Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction

Times cited: 194
Author
Sheridan, Robert P. [1]
Affiliation
[1] Merck Res Labs, Cheminformat Dept, Rahway, NJ 07065 USA
Keywords
APPLICABILITY DOMAIN; QSAR MODELS; RANDOM FOREST; TEST SETS; SIMILARITY; SELECTION; DEFINE
DOI
10.1021/ci400084k
Chinese Library Classification number
R914 [Medicinal Chemistry]
Discipline classification code
100701
Abstract
Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted with that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R²) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending on how the compounds in the test set are selected. Here, we show that time-split selection gives an R² that is more like that of true prospective prediction than the R² from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
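
The difference between random and time-split selection described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the record fields ("id", "date", "activity"), the function names, and the 25% test fraction are hypothetical, and R² is computed here as the squared Pearson correlation, which is one common convention and an assumption about the metric.

```python
import random


def random_split(records, test_fraction=0.25, seed=0):
    """Hold out a randomly chosen fraction of compounds as the test set."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def time_split(records, test_fraction=0.25):
    """Hold out the most recently dated compounds as the test set.

    Assumes each record has a "date" key holding an ISO 8601 string,
    so that string order matches chronological order.
    """
    ordered = sorted(records, key=lambda r: r["date"])
    n_test = max(1, round(len(ordered) * test_fraction))
    return ordered[:-n_test], ordered[-n_test:]  # (train, test)


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("R^2 is undefined when either series is constant")
    return cov * cov / (var_o * var_p)
```

In this sketch, a model would be fit on the training portion returned by each splitter and scored with r_squared on the corresponding test portion. Under time_split, every test compound is dated no earlier than every training compound, which mimics predicting compounds that did not yet exist when the model was built.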
Pages: 783-790
Page count: 8