Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection

Times cited: 335
Authors
Golbraikh, A [1]
Tropsha, A [1]
Affiliations
[1] Univ N Carolina, Sch Pharm, Lab Mol Modeling, Chapel Hill, NC 27599 USA
Keywords
DOI
10.1023/A:1020869118689
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
One of the most important characteristics of Quantitative Structure-Activity Relationship (QSAR) models is their predictive power. The latter can be defined as the ability of a model to predict accurately the target property (e.g., biological activity) of compounds that were not used for model development. We suggest that this goal can be achieved by rational division of an experimental SAR dataset into the training and test sets, which are used for model development and validation, respectively. Given that all compounds are represented by points in multidimensional descriptor space, we argue that training and test sets must satisfy the following criteria: (i) representative points of the test set must be close to those of the training set; (ii) representative points of the training set must be close to representative points of the test set; (iii) the training set must be diverse. For quantitative description of these criteria, we use molecular dataset diversity indices introduced recently (Golbraikh, A., J. Chem. Inf. Comput. Sci., 40 (2000) 414-425). For rational division of a dataset into the training and test sets, we use three closely related sphere-exclusion algorithms. Using several experimental datasets, we demonstrate that QSAR models built and validated with our approach have statistically better predictive power than models generated with either random or activity-ranking-based selection of the training and test sets. We suggest that rational approaches to the selection of training and test sets based on diversity principles should be used routinely in all QSAR modeling research.
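The abstract names sphere-exclusion algorithms but does not spell out their steps, so the following is only a minimal sketch of a generic sphere-exclusion split, not the authors' three variants. The function name `sphere_exclusion_split`, the `radius` parameter, the Euclidean metric, and the random choice of sphere centers are all assumptions made for illustration.

```python
import numpy as np


def sphere_exclusion_split(X, radius, seed=0):
    """Split the rows of X (compounds in descriptor space) into
    training and test indices with a generic sphere-exclusion pass.

    Assumed procedure (hypothetical, not taken from the paper):
    pick an unassigned compound as a sphere center and put it in the
    training set; every other unassigned compound within `radius` of
    that center goes to the test set; repeat until none are left.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    unassigned = [int(i) for i in rng.permutation(len(X))]
    train, test = [], []
    while unassigned:
        center = unassigned.pop(0)
        train.append(center)
        remaining = []
        for i in unassigned:
            # Euclidean distance is an assumption; any metric could be used.
            if np.linalg.norm(X[i] - X[center]) <= radius:
                test.append(i)
            else:
                remaining.append(i)
        unassigned = remaining
    return sorted(train), sorted(test)


# Toy descriptor matrix: two tight pairs and one isolated compound.
X_demo = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 10.0]]
train_idx, test_idx = sphere_exclusion_split(X_demo, radius=1.0)
```

Under these assumptions, every test compound lies within `radius` of some training compound (criterion i) and the training compounds are more than `radius` apart from one another (criterion iii). Criterion (ii) is not guaranteed: an isolated compound, such as the last row of `X_demo`, ends up in the training set with no test neighbor.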
Pages: 357-369
Number of pages: 13