Boosted leave-many-out cross-validation: the effect of training and test set diversity on PLS statistics

Cited by: 55
Authors
Clark, RD [1]
Affiliation
[1] Tripos Inc, St Louis, MO 63144 USA
Keywords
cross-validation; dissimilarity selection; molecular diversity; OptiSim; PLS; projection onto latent structures; representativeness; boosted LMO
DOI
10.1023/A:1025366721142
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification codes
071010; 081704
Abstract
It is becoming increasingly common in quantitative structure/activity relationship (QSAR) analyses to use external test sets to evaluate the likely stability and predictivity of the models obtained. In some cases, such as those involving variable selection, an internal test set (i.e., a cross-validation set) is also used. Care is sometimes taken to ensure that the subsets used exhibit response and/or property distributions similar to those of the data set as a whole, but more often the individual observations are simply assigned 'at random.' In the special case of multiple linear regression (MLR) without variable selection, it can be demonstrated analytically that random assignment is inferior to other strategies. In particular, D-optimal design performs better if the form of the regression equation is known and the variables involved are well behaved. This report introduces an alternative, non-parametric approach termed 'boosted leave-many-out' (boosted LMO) cross-validation. In this method, relatively small training sets are chosen by applying optimizable k-dissimilarity selection (OptiSim) with a small subsample size (k=4, in this case), and the unselected observations are reserved as a test set for the corresponding reduced model. Predictive errors for the full model are then estimated by aggregating results over several such analyses. The countervailing effects of training and test set size, diversity, and representativeness on PLS model statistics are described for CoMFA analysis of a large data set of COX2 inhibitors.
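The procedure outlined in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `optisim_select` is a simplified k-dissimilarity rule (details of the published OptiSim procedure that the abstract does not spell out are omitted), Euclidean distance is assumed as the dissimilarity measure, a nearest-neighbour predictor stands in for PLS so the block needs only the standard library, and all function names, the number of rounds, and the pooled-RMSE aggregation are hypothetical choices.

```python
import math
import random


def optisim_select(points, n_select, k=4, seed=0):
    """Pick n_select indices with a simplified OptiSim-style rule (assumption)."""
    rng = random.Random(seed)
    pool = list(range(len(points)))
    rng.shuffle(pool)
    selected = [pool.pop()]  # arbitrary starting observation
    while len(selected) < n_select and pool:
        # Draw a subsample of at most k candidates from the remaining pool.
        candidates = [pool.pop() for _ in range(min(k, len(pool)))]
        # Keep the candidate farthest from its nearest already-selected point.
        best = max(
            candidates,
            key=lambda i: min(math.dist(points[i], points[j]) for j in selected),
        )
        selected.append(best)
        # Simplification: unchosen candidates go straight back into the pool.
        pool.extend(c for c in candidates if c != best)
        rng.shuffle(pool)
    return selected


def nearest_neighbour_fit_predict(x_train, y_train, x_test):
    """Stand-in for PLS: predict the response of the closest training point."""
    return [
        y_train[min(range(len(x_train)), key=lambda j: math.dist(x, x_train[j]))]
        for x in x_test
    ]


def boosted_lmo(points, y, fit_predict, n_train, n_rounds=5, k=4):
    """Pool squared test-set errors over several small, diverse training sets."""
    sq_err_sum = 0.0
    n_pred = 0
    for r in range(n_rounds):
        train = optisim_select(points, n_train, k=k, seed=r)
        train_set = set(train)
        # Every unselected observation is reserved as the test set.
        test = [i for i in range(len(points)) if i not in train_set]
        preds = fit_predict(
            [points[i] for i in train],
            [y[i] for i in train],
            [points[i] for i in test],
        )
        sq_err_sum += sum((y[i] - p) ** 2 for i, p in zip(test, preds))
        n_pred += len(test)
    return math.sqrt(sq_err_sum / n_pred)  # pooled RMSE over all rounds


# Synthetic demonstration data (not from the paper).
demo_rng = random.Random(42)
X_demo = [(demo_rng.random(), demo_rng.random()) for _ in range(60)]
y_demo = [2.0 * a - b for a, b in X_demo]
print(boosted_lmo(X_demo, y_demo, nearest_neighbour_fit_predict, n_train=15))
```

In a real analysis the `fit_predict` argument would wrap a PLS fit on the reduced training set; keeping it as a pluggable function is only a convenience of this sketch.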
Pages: 265-275
Page count: 11