SUBMODEL SELECTION AND EVALUATION IN REGRESSION - THE X-RANDOM CASE

Cited: 417
Authors
BREIMAN, L
SPECTOR, P
Institution
Keywords
REGRESSION; VARIABLE SELECTION; CROSS-VALIDATION; BOOTSTRAP; PREDICTION ERROR; SUBSET SELECTION;
DOI
10.2307/1403680
Chinese Library Classification (CLC)
O21 [Probability theory and mathematical statistics]; C8 [Statistics];
Discipline codes
020208 ; 070103 ; 0714 ;
Abstract
Often, in a regression situation with many variables, a sequence of submodels containing fewer variables is generated by methods such as stepwise addition or deletion of variables, or 'best subsets'. The question is which of this sequence of submodels is 'best', and how submodel performance can be evaluated. This was explored in Breiman (1988) for a fixed X-design. This is a sequel exploring the case of random X-designs. Analytical results are difficult, if not impossible, so this study involved an extensive simulation. The basis of the study is the theoretical definition of prediction error (PE) as the expected squared error produced by applying a prediction equation to the distributional universe of (y, x) values. This definition is used throughout to compare various submodels. There can be startling differences between the x-fixed and x-random situations, and different PE estimates are appropriate. Non-resampling estimates such as C(P), adjusted R2, etc. turn out to be highly biased methods for submodel selection. The two best methods are cross-validation and the bootstrap. One surprise is that 5-fold cross-validation (leave out 20% of the data) is better at submodel selection and evaluation than leave-one-out cross-validation. There are a number of other surprises.
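The two cross-validation PE estimates compared in the abstract can be illustrated with a minimal sketch (not the authors' code; the data-generating setup and all names below are hypothetical): k-fold cross-validation with k = 5 leaves out 20% of the data at a time, while leave-one-out corresponds to k = n.

```python
# Minimal sketch of the k-fold cross-validation estimate of prediction
# error (PE) for an ordinary-least-squares submodel. Illustrative only;
# the simulated data and variable names are assumptions, not the paper's.
import numpy as np

def cv_pe(X, y, k):
    """Estimate PE of an OLS fit by k-fold cross-validation."""
    n = len(y)
    idx = np.arange(n)
    folds = np.array_split(idx, k)
    sq_err = 0.0
    for test in folds:
        train = np.setdiff1d(idx, test)
        # Fit OLS on the training folds only.
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        # Accumulate squared prediction error on the held-out fold.
        sq_err += np.sum((y[test] - X[test] @ beta) ** 2)
    return sq_err / n  # average squared prediction error

# Simulated x-random data: predictors drawn afresh, noise variance 1.
rng = np.random.default_rng(0)
n, p = 60, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, 0.0, 0.0, -1.5]) + rng.normal(size=n)

pe_5fold = cv_pe(X, y, k=5)   # leave out 20% of the data per fold
pe_loo   = cv_pe(X, y, k=n)   # leave-one-out cross-validation
print(pe_5fold, pe_loo)
```

Both calls estimate the same theoretical PE; the paper's simulation finding is that the 5-fold version ranks competing submodels more reliably than leave-one-out.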
Pages: 291-319
Page count: 29