Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection

Cited by: 335
Authors
Golbraikh, A [1]
Tropsha, A [1]
Affiliations
[1] Univ N Carolina, Sch Pharm, Lab Mol Modeling, Chapel Hill, NC 27599 USA
DOI
10.1023/A:1020869118689
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010; 081704;
Abstract
One of the most important characteristics of Quantitative Structure-Activity Relationship (QSAR) models is their predictive power. The latter can be defined as the ability of a model to predict accurately the target property (e.g., biological activity) of compounds that were not used for model development. We suggest that this goal can be achieved by rational division of an experimental SAR dataset into the training and test sets, which are used for model development and validation, respectively. Given that all compounds are represented by points in multidimensional descriptor space, we argue that training and test sets must satisfy the following criteria: (i) representative points of the test set must be close to those of the training set; (ii) representative points of the training set must be close to representative points of the test set; (iii) the training set must be diverse. For quantitative description of these criteria, we use molecular dataset diversity indices introduced recently (Golbraikh, A., J. Chem. Inf. Comput. Sci., 40 (2000) 414-425). For rational division of a dataset into the training and test sets, we use three closely related sphere-exclusion algorithms. Using several experimental datasets, we demonstrate that QSAR models built and validated with our approach have statistically better predictive power than models generated with either random or activity-ranking-based selection of the training and test sets. We suggest that rational approaches to the selection of training and test sets based on diversity principles should be used routinely in all QSAR modeling research.
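The abstract does not spell out the three sphere-exclusion variants, so the following is only a generic sketch of the idea and not the authors' procedure. It assumes Euclidean distance in descriptor space, a user-chosen `radius`, and random selection of sphere centres; the function names `sphere_exclusion_split` and `mean_nn_distance` are hypothetical. Each sphere centre goes to the training set and every compound inside its sphere goes to the test set, so test points lie near training points (criterion i) and training points stay more than `radius` apart (criterion iii). Criterion (ii) is not guaranteed by this simple pass, since a sphere may contain no other compounds.

```python
import numpy as np


def sphere_exclusion_split(X, radius, seed=0):
    """Split the rows of X into training and test indices.

    Illustrative assumption: centres are drawn at random from the
    compounds not yet assigned; the paper's variants may differ.
    """
    rng = np.random.default_rng(seed)
    remaining = list(range(len(X)))
    train, test = [], []
    while remaining:
        # Pick a random unassigned compound as the sphere centre (training set).
        centre = remaining.pop(int(rng.integers(len(remaining))))
        train.append(centre)
        if not remaining:
            break
        # Every unassigned compound inside the sphere goes to the test set.
        rest = np.array(remaining)
        dist = np.linalg.norm(X[rest] - X[centre], axis=1)
        inside = dist <= radius
        test.extend(rest[inside].tolist())
        remaining = rest[~inside].tolist()
    return np.array(train, dtype=int), np.array(test, dtype=int)


def mean_nn_distance(A, B):
    """Mean Euclidean distance from each row of A to its nearest row of B.

    A crude stand-in for a closeness measure; it is not the diversity
    index defined in the cited Golbraikh (2000) paper.
    """
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return d.min(axis=1).mean()
```

Calling `sphere_exclusion_split(X, radius)` on a descriptor matrix `X` (one row per compound) returns two index arrays. `mean_nn_distance(X[test], X[train])` and `mean_nn_distance(X[train], X[test])` then give rough numerical checks of criteria (i) and (ii), provided both sets are non-empty. A larger `radius` yields a smaller training set and a larger test set.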
Pages: 357-369
Number of pages: 13