VARIABLE SELECTION IN NONPARAMETRIC REGRESSION WITH CATEGORICAL COVARIATES

Cited by: 14
Authors
BICKEL, P [1 ]
PING, Z [1 ]
Institution
[1] UNIV PENN, DEPT STAT, PHILADELPHIA, PA 19104
Keywords
CROSS-VALIDATION; MODEL SELECTION; PREDICTION
DOI
10.2307/2290456
Chinese Library Classification (CLC)
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Subject Classification Codes
020208; 070103; 0714
Abstract
This article extends the problem of variable selection to a nonparametric regression model with categorical covariates. Two selection criteria are considered: the cross-validation (CV) criterion and the accumulated prediction error (APE) criterion. We find that, asymptotically, the CV criterion performs well only when the true model is infinite-dimensional, whereas the APE criterion is appropriate when the true model is finite-dimensional. This closely parallels the case of the linear regression model. A simulation study reveals some interesting small-sample properties of these criteria.

More specifically, suppose that we have observations (X_1, Y_1), ..., (X_n, Y_n) that are iid random vectors, with X = (X^(1), X^(2), ...), where the components X^(j) are categorical; Y may be of any type. A new observation X arrives, and we want to predict the corresponding Y. Such a framework is more appropriate than regression with fixed covariates in situations where the covariates are observational rather than controlled. For instance, Y could be the time from HIV infection to the development of clinical AIDS, and the covariates (mostly categorical, or reducible to categorical) could be observations from blood tests, a physical examination, or further personal information, such as sexual practices obtained from an interview. As another example, Y could be the premium of an insurance policy, with the covariates being the customer's general demographic information. Our goal is to select the subset of covariates that best predicts Y. We define the true model dimension to be d_0 if the regression function E(Y | X^(1), X^(2), ...) is a d_0-variate function.

The main conclusions of the article are:
(1) The popular CV criterion performs well only when d_0 = infinity.
(2) Other criteria are more appropriate than CV when d_0 < infinity.
(3) As far as asymptotics are concerned, there is no difference between conditional and unconditional prediction errors.
(4) The selection range has to depend on the sample size: for a given sample size n, we argue that one should only select models whose number of covariates is of order o(log n).
(5) The simulation study indicates that the CV criterion has nice small-sample properties.
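To make the two criteria concrete, the following is a minimal sketch in Python, not the paper's implementation. It assumes, as a plausible reading of the abstract, that CV is leave-one-out squared prediction error and APE is a Rissanen-style accumulated one-step-ahead prediction error, with a cell-mean (regressogram) fit standing in for nonparametric regression on categorical covariates; the function names, the burn-in length, the unseen-cell fallback, and the simulated data are all illustrative assumptions.

# Illustrative sketch only: assumed definitions of CV and APE, with a
# cell-mean (regressogram) fit for categorical covariates.
import itertools
import numpy as np

def cell_mean_predict(X_train, y_train, X_test, cols, default):
    # Predict by the mean of y over training rows sharing the same
    # category pattern on the selected columns; unseen cells fall back
    # to `default` (here, the overall training mean).
    sums, counts = {}, {}
    for row, y in zip(X_train, y_train):
        key = tuple(row[c] for c in cols)
        sums[key] = sums.get(key, 0.0) + y
        counts[key] = counts.get(key, 0) + 1
    preds = []
    for row in X_test:
        key = tuple(row[c] for c in cols)
        preds.append(sums[key] / counts[key] if key in counts else default)
    return np.array(preds)

def cv_score(X, y, cols):
    # CV criterion: average leave-one-out squared prediction error.
    n = len(y)
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        pred = cell_mean_predict(X[mask], y[mask], X[i:i + 1], cols,
                                 y[mask].mean())
        errs.append((y[i] - pred[0]) ** 2)
    return float(np.mean(errs))

def ape_score(X, y, cols, burn_in=20):
    # APE criterion: predict each Y_i from observations 1..i-1 only,
    # then accumulate the squared one-step-ahead errors.
    errs = []
    for i in range(burn_in, len(y)):
        pred = cell_mean_predict(X[:i], y[:i], X[i:i + 1], cols,
                                 y[:i].mean())
        errs.append((y[i] - pred[0]) ** 2)
    return float(np.sum(errs))

def select(X, y, criterion, max_dim=2):
    # Search all subsets of at most max_dim covariates; the cap echoes
    # conclusion (4), that the search range should grow like o(log n).
    best_cols, best_val = (), np.inf
    for d in range(1, max_dim + 1):
        for cols in itertools.combinations(range(X.shape[1]), d):
            val = criterion(X, y, cols)
            if val < best_val:
                best_cols, best_val = cols, val
    return best_cols, best_val

rng = np.random.default_rng(0)
n, p = 200, 4
X = rng.integers(0, 3, size=(n, p))  # categorical covariates in {0, 1, 2}
y = X[:, 0] - 2.0 * (X[:, 1] == 2) + rng.normal(0.0, 0.5, n)  # true d_0 = 2
for name, crit in (("CV", cv_score), ("APE", ape_score)):
    cols, val = select(X, y, crit)
    print(f"{name} selects covariates {cols} (score {val:.3f})")

The all-subsets search is capped at max_dim covariates, echoing conclusion (4) that the selection range should grow with n no faster than o(log n); the burn-in length and the unseen-cell fallback are implementation choices not specified by the abstract.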
Pages: 90-97
Page count: 8