Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection

Times cited: 335
Authors
Golbraikh, A [1]
Tropsha, A [1]
Affiliations
[1] Univ N Carolina, Sch Pharm, Lab Mol Modeling, Chapel Hill, NC 27599 USA
Keywords
DOI
10.1023/A:1020869118689
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
One of the most important characteristics of Quantitative Structure-Activity Relationship (QSAR) models is their predictive power. The latter can be defined as the ability of a model to predict accurately the target property (e.g., biological activity) of compounds that were not used for model development. We suggest that this goal can be achieved by rational division of an experimental SAR dataset into the training and test sets, which are used for model development and validation, respectively. Given that all compounds are represented by points in multidimensional descriptor space, we argue that training and test sets must satisfy the following criteria: (i) Representative points of the test set must be close to those of the training set; (ii) Representative points of the training set must be close to representative points of the test set; (iii) The training set must be diverse. For quantitative description of these criteria, we use molecular dataset diversity indices introduced recently (Golbraikh, A., J. Chem. Inf. Comput. Sci., 40 (2000) 414-425). For rational division of a dataset into the training and test sets, we use three closely related sphere-exclusion algorithms. Using several experimental datasets, we demonstrate that QSAR models built and validated with our approach have statistically better predictive power than models generated with either random or activity-ranking-based selection of the training and test sets. We suggest that rational approaches to the selection of training and test sets based on diversity principles should be used routinely in all QSAR modeling research.
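The abstract names sphere-exclusion algorithms but does not spell out their steps, so the following is only a minimal sketch of one generic sphere-exclusion split, not the authors' three variants. It assumes Euclidean distance in descriptor space, a user-chosen `radius`, and that the next sphere centre is simply the lowest-indexed compound still unassigned. The function names `sphere_exclusion_split` and `max_nn_distance` are hypothetical and introduced here for illustration only.

```python
import numpy as np


def sphere_exclusion_split(X, radius, start=0):
    """Generic sphere-exclusion split (illustrative, not the paper's exact rule).

    Each sphere centre goes to the training set; every unassigned compound
    within `radius` of that centre goes to the test set.
    """
    X = np.asarray(X, dtype=float)
    remaining = np.ones(X.shape[0], dtype=bool)  # compounds not yet assigned
    train, test = [], []
    center = start
    while remaining.any():
        train.append(center)
        remaining[center] = False
        # Euclidean distance from the current centre to every compound.
        dist = np.linalg.norm(X - X[center], axis=1)
        inside = remaining & (dist <= radius)
        test.extend(np.flatnonzero(inside).tolist())
        remaining[inside] = False
        if remaining.any():
            # Assumption: take the first unassigned compound as the next centre.
            center = int(np.flatnonzero(remaining)[0])
    return train, test


def max_nn_distance(X, from_idx, to_idx):
    """Largest nearest-neighbour distance from points `from_idx` to set `to_idx`."""
    X = np.asarray(X, dtype=float)
    diff = X[from_idx][:, None, :] - X[to_idx][None, :, :]
    d = np.linalg.norm(diff, axis=2)
    return float(d.min(axis=1).max())


# Toy descriptor matrix: five compounds on a line in a 2-D descriptor space.
X = [[0.0, 0.0], [0.5, 0.0], [3.0, 0.0], [3.2, 0.0], [10.0, 0.0]]
train, test = sphere_exclusion_split(X, radius=1.0)
print(train, test)
# Criterion (i): every test point lies within `radius` of some training point.
print(max_nn_distance(X, test, train) <= 1.0)
```

In this sketch criterion (i) holds by construction, because a compound enters the test set only when it falls inside a sphere centred on a training compound. Criteria (ii) and (iii) are not guaranteed and depend on the radius and on how the next centre is chosen; an isolated compound, such as the last one in the toy data, ends up in the training set with no nearby test compound.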
Pages: 357-369
Page count: 13