Chance correlation in variable subset regression: Influence of the objective function, the selection mechanism, and ensemble averaging

Cited by: 64
Author
Baumann, K [1]
Affiliation
[1] Univ Wurzburg, Dept Pharm, D-97074 Wurzburg, Germany
Source
QSAR & COMBINATORIAL SCIENCE | 2005 / Vol. 24 / Issue 09
Keywords
variable selection; The LASSO; ensemble averaging; cross-validation; permutation test; bagging; chance correlation; overfitting;
DOI
10.1002/qsar.200530134
Chinese Library Classification number
R914 [Medicinal Chemistry];
Discipline classification code
100701;
Abstract
Cross-validation is often used to guide variable selection algorithms. While cross-validation yields an almost unbiased estimate of the prediction error when no model selection (such as variable selection) is involved, it is heavily biased when a large amount of model selection is applied (i.e., sifting through thousands of models). In the latter case, internal figures of merit such as R²_CV or RMSEP_CV can be deceptively overoptimistic. The extent of this inflation (overoptimism) and the factors that influence its degree are studied here. It turns out that the extent of inflation is extremely large for small data sets. The main factors influencing the degree of inflation are the data set size, the size of the variable pool, the allowed object-to-variable ratio, the objective function guiding a stepwise selection technique, and the correlation structure of the data matrix. Moreover, changing the selection mechanism from the commonly applied stepwise procedures to the more stable shrinkage and selection technique LASSO largely eliminates the inflation. No inflation is observed when ensemble averaging is used to estimate the prediction error. This property, combined with the potential of ensemble averaging to improve predictivity and the possibility of using the information from the single models of the ensemble for validation tasks, renders ensemble averaging an attractive tool if prediction is the primary goal of the analysis.
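The selection bias described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's protocol: it assumes ordinary least squares, a greedy forward selection as a stand-in for "stepwise procedures", and leave-one-out cross-validation as the objective function. The helper names (loo_q2, forward_select) and the data dimensions (20 objects, a pool of 200 variables) are hypothetical choices made for illustration. The response is pure noise, so any apparent predictivity of the selected subset is chance correlation.

```python
import numpy as np

def loo_q2(X, y):
    """Leave-one-out cross-validated R^2 of an OLS model with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    # Hat (projection) matrix; pinv keeps this stable for near-collinear columns.
    H = A @ np.linalg.pinv(A)
    resid = y - H @ y
    # Closed-form leave-one-out residuals for linear least squares.
    loo_resid = resid / (1.0 - np.diag(H))
    press = np.sum(loo_resid ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def forward_select(X, y, n_vars):
    """Greedy forward selection that maximizes the LOO cross-validated R^2."""
    selected, history = [], []
    for _ in range(n_vars):
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        scores = [loo_q2(X[:, selected + [j]], y) for j in candidates]
        best = int(np.argmax(scores))
        selected.append(candidates[best])
        history.append(scores[best])
    return selected, history

rng = np.random.default_rng(0)
n_objects, pool_size = 20, 200                 # small data set, large variable pool
X = rng.standard_normal((n_objects, pool_size))
y = rng.standard_normal(n_objects)             # response is independent of X

selected, history = forward_select(X, y, n_vars=3)
baseline = loo_q2(X[:, :3], y)                 # three variables fixed in advance

print("selected variables:", selected)
print("cross-validated R^2 after each selection step:", np.round(history, 3))
print("cross-validated R^2 of a pre-specified subset:", round(baseline, 3))
```

Because the cross-validated statistic is itself the quantity being maximized over many candidate models, the value reported for the selected subset is no longer an honest estimate of prediction error. Comparing it with the value for the pre-specified subset gives a rough sense of how much of it is due to the search alone.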
Pages: 1033-1046
Number of pages: 14