Reducing over-optimism in variable selection by cross-model validation

Cited by: 144
Authors
Anderssen, Endre [1]
Dyrstad, Knut
Westad, Frank
Martens, Harald
Affiliations
[1] Norwegian Univ Sci & Technol, Dept Chem, N-7491 Trondheim, Norway
[2] GE Healthcare, Oslo, Norway
[3] Matforsk, N-1430 As, Norway
[4] Univ Life Sci, Ctr Integrated Genom, CIGENE, As, Norway
Keywords
variable selection; regression; over-fitting; cross-model validation; jack-knifing; QSAR
DOI
10.1016/j.chemolab.2006.04.021
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Extensive optimisation of a mathematical model's fit to a relatively small set of empirical data may lead to over-optimistic validation results. If the assessment of the final, optimised model is based on the same validation method and the same input data that were used as the basis for the extensive model optimisation, accumulated spurious correlations may appear as real predictive ability in the final model validation. An example of this is the use of extensive variable selection in multiple regression, based on a cross-model validation scheme. To illustrate the over-optimism problem in optimisation based on conventional one-layered validation, an artificial data set containing only random numbers was submitted to regression modelling. The model was optimised by stepwise variable selection. A very good apparent predictive ability for y from X was found in the final model by leave-one-out cross-validation (84%), after the number of X-variables had been reduced stepwise from 500 to 29. Finally, the performance of cross-model validation is tested on one large QSAR data set. Several calibration sets were chosen randomly, and a regression model was optimised by variable selection. The prediction accuracy of these models was compared to the cross-validation and cross-model validation results. In these tests, cross-model validation gives the better measure of model predictive ability. © 2006 Published by Elsevier B.V.
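The random-number experiment in the abstract can be imitated on a small scale. The following is a minimal sketch, not the authors' procedure: the abstract does not spell out the cross-model validation algorithm, so the sketch assumes only its general idea, namely that the variable selection is repeated inside every left-out segment instead of being done once on all data. The stepwise multiple regression is replaced by a much simpler stand-in (keep the `k` variables most correlated with y, combine them into a signed score, and fit a straight line), and the sizes `n`, `p`, `k` and all function names are illustrative assumptions.

```python
import random
import statistics


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    sab = sum((x - ma) * (v - mb) for x, v in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5


def select(X, y, k):
    """Keep the k columns with the largest |correlation| to y.

    Returns (column index, mean, standard deviation, sign) per column.
    This is a simple stand-in for stepwise variable selection.
    """
    cols = list(zip(*X))
    r = [pearson(c, y) for c in cols]
    top = sorted(range(len(cols)), key=lambda j: abs(r[j]), reverse=True)[:k]
    return [(j, statistics.fmean(cols[j]), statistics.stdev(cols[j]),
             1.0 if r[j] >= 0 else -1.0) for j in top]


def score(row, sel):
    """Signed sum of the standardised selected variables for one sample."""
    return sum(s * (row[j] - m) / sd for j, m, sd, s in sel)


def fit_line(t, y):
    """Least-squares intercept and slope of y on the score t."""
    mt, my = statistics.fmean(t), statistics.fmean(y)
    b = (sum((ti - mt) * (yi - my) for ti, yi in zip(t, y))
         / sum((ti - mt) ** 2 for ti in t))
    return my - b * mt, b


def q2(y, pred):
    """Cross-validated explained variance: 1 - PRESS / total sum of squares."""
    my = statistics.fmean(y)
    press = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    return 1.0 - press / sum((yi - my) ** 2 for yi in y)


def one_layer_loo(X, y, k):
    """Selection sees ALL samples; only the line fit is cross-validated."""
    sel = select(X, y, k)
    t = [score(row, sel) for row in X]
    pred = []
    for i in range(len(y)):
        a, b = fit_line(t[:i] + t[i + 1:], y[:i] + y[i + 1:])
        pred.append(a + b * t[i])
    return q2(y, pred)


def cross_model_loo(X, y, k):
    """Selection is redone without the left-out sample in every segment."""
    pred = []
    for i in range(len(y)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        sel = select(X_tr, y_tr, k)
        a, b = fit_line([score(row, sel) for row in X_tr], y_tr)
        pred.append(a + b * score(X[i], sel))
    return q2(y, pred)


# Pure noise: y has no real relationship to any column of X.
rng = random.Random(0)
n, p, k = 30, 200, 10  # illustrative sizes, smaller than in the paper
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [rng.gauss(0.0, 1.0) for _ in range(n)]

naive_q2 = one_layer_loo(X, y, k)
cmv_q2 = cross_model_loo(X, y, k)
print(f"one-layer LOO Q2:        {naive_q2:.2f}")
print(f"selection-in-the-loop Q2: {cmv_q2:.2f}")
```

Because the data are pure noise, any predictive ability reported by the one-layer scheme is spurious. Keeping the selection inside the validation loop removes the information leak from the left-out sample, so its Q2 should come out markedly lower.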
Pages: 69-74
Number of pages: 6
Related papers
19 in total
  • [1] [Anonymous], COMPUTER INTENSIVE S
  • [2] Comparative spectra analysis (CoSA): Spectra as three-dimensional molecular descriptors for the prediction of biological activities
    Bursi, R
    Dao, T
    van Wijk, T
    de Gooyer, M
    Kellenbach, E
    Verwer, P
    [J]. JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES, 1999, 39 (05): : 861 - 867
  • [3] COMPARATIVE MOLECULAR-FIELD ANALYSIS (COMFA) .1. EFFECT OF SHAPE ON BINDING OF STEROIDS TO CARRIER PROTEINS
    CRAMER, RD
    PATTERSON, DE
    BUNCE, JD
    [J]. JOURNAL OF THE AMERICAN CHEMICAL SOCIETY, 1988, 110 (18) : 5959 - 5967
  • [4] 1977 RIETZ LECTURE - BOOTSTRAP METHODS - ANOTHER LOOK AT THE JACKKNIFE
    EFRON, B
    [J]. ANNALS OF STATISTICS, 1979, 7 (01) : 1 - 26
  • [5] Multivariate design and modeling in QSAR
    Eriksson, L
    Johansson, E
    [J]. CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS, 1996, 34 (01) : 1 - 19
  • [6] Multivariate data analysis:: quo vadis?: I.: Object-oriented data modelling (OODM)
    Esbensen, KH
    Höskuldsson, A
    [J]. JOURNAL OF CHEMOMETRICS, 2003, 17 (01) : 34 - 44
  • [7] STRATEGIES FOR MULTIVARIATE IMAGE REGRESSION
    ESBENSEN, KH
    GELADI, PL
    GRAHN, HF
    [J]. CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS, 1992, 14 (1-3) : 357 - 374
  • [8] EVA: A new theoretically based molecular descriptor for use in QSAR/QSPR analysis
    Ferguson, AM
    Heritage, T
    Jonathon, P
    Pack, SE
    Phillips, L
    Rogan, J
    Snaith, PJ
    [J]. JOURNAL OF COMPUTER-AIDED MOLECULAR DESIGN, 1997, 11 (02) : 143 - 152
  • [9] Multivariate measurement of gene expression relationships
    Kim, SC
    Dougherty, ER
    Chen, YD
    Sivakumar, K
    Meltzer, P
    Trent, JM
    Bittner, M
    [J]. GENOMICS, 2000, 67 (02) : 201 - 209
  • [10] MOLECULAR SIMILARITY INDEXES IN A COMPARATIVE-ANALYSIS (COMSIA) OF DRUG MOLECULES TO CORRELATE AND PREDICT THEIR BIOLOGICAL-ACTIVITY
    KLEBE, G
    ABRAHAM, U
    MIETZNER, T
    [J]. JOURNAL OF MEDICINAL CHEMISTRY, 1994, 37 (24) : 4130 - 4146