Persistence in high-dimensional linear predictor selection and the virtue of overparametrization

Times cited: 186
Authors
Greenshtein, E
Ritov, Y
Affiliations
[1] Univ Haifa, Dept Stat, IL-31905 Haifa, Israel
[2] Hebrew Univ Jerusalem, Dept Stat, IL-91905 Jerusalem, Israel
Keywords
consistency; lasso; regression; variable selection;
DOI
10.3150/bj/1106314846
Chinese Library Classification
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];
Subject Classification Codes
020208 ; 070103 ; 0714 ;
Abstract
Let $Z^i = (Y^i, X_1^i, \ldots, X_m^i)$, $i = 1, \ldots, n$, be independent and identically distributed random vectors, $Z^i \sim F$, $F \in \mathcal{F}$. It is desired to predict $Y$ by $\sum_j \beta_j X_j$, where $(\beta_1, \ldots, \beta_m) \in B_n \subseteq \mathbb{R}^m$, under a prediction loss. Suppose that $m = n^a$, $a > 1$; that is, there are many more explanatory variables than observations. We consider sets $B_n$ restricted by the maximal number of non-zero coefficients of their members, or by their $\ell_1$ radius. We study the following asymptotic question: how 'large' may the set $B_n$ be, so that it is still possible to select empirically a predictor whose risk under $F$ is close to that of the best predictor in the set? Sharp bounds for orders of magnitude are given under various assumptions on $\mathcal{F}$. Algorithmic complexity of the ensuing procedures is also studied. The main message of this paper, and the implication of the orders derived, is that under various sparsity assumptions on the optimal predictor there is 'asymptotically no harm' in introducing many more explanatory variables than observations. Furthermore, such practice can be beneficial in comparison with a procedure that screens in advance a small subset of explanatory variables. Another main result is that 'lasso' procedures, that is, optimization under $\ell_1$ constraints, could be efficient in finding optimal sparse predictors in high dimensions.
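The setting described in the abstract (an $\ell_1$-constrained lasso fit with $m = n^a \gg n$ explanatory variables and a sparse optimal predictor) can be illustrated with a minimal numerical sketch. The sketch below is not from the paper: it uses scikit-learn's Lasso, which solves the penalized (Lagrangian) form of the $\ell_1$-constrained problem, and all parameter values (n = 50, a = 1.5, noise level, penalty weight) are arbitrary illustrative choices.

```python
# Minimal sketch: lasso with many more explanatory variables than observations.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 50
a = 1.5
m = int(n ** a)  # m = n^a > n: far more variables than observations

# Sparse "optimal" predictor: only the first three coefficients are non-zero.
beta_true = np.zeros(m)
beta_true[:3] = [2.0, -1.5, 1.0]

X = rng.standard_normal((n, m))
y = X @ beta_true + 0.5 * rng.standard_normal(n)

# Penalized (Lagrangian) form of the l1-constrained problem; the penalty weight
# 0.1 is an illustrative choice, not taken from the paper.
fit = Lasso(alpha=0.1).fit(X, y)

# Compare out-of-sample prediction risk of the empirically selected predictor
# with that of the best (true sparse) predictor.
X_new = rng.standard_normal((20_000, m))
y_new = X_new @ beta_true + 0.5 * rng.standard_normal(20_000)
risk_fit = np.mean((y_new - fit.predict(X_new)) ** 2)
risk_best = np.mean((y_new - X_new @ beta_true) ** 2)

print(f"m = {m}, n = {n}, non-zero coefficients selected: {np.sum(fit.coef_ != 0)}")
print(f"risk of lasso predictor: {risk_fit:.3f}   risk of best sparse predictor: {risk_best:.3f}")
```

Despite m being roughly seven times n, the $\ell_1$-penalized fit typically recovers a small set of non-zero coefficients whose prediction risk is close to that of the best sparse predictor, in line with the 'no harm in overparametrization' message of the abstract.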
Pages: 971-988
Page count: 18