Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection

Cited by: 335
Authors
Golbraikh, A [1]
Tropsha, A [1]
Affiliations
[1] Univ N Carolina, Sch Pharm, Lab Mol Modeling, Chapel Hill, NC 27599 USA
DOI
10.1023/A:1020869118689
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
One of the most important characteristics of Quantitative Structure Activity Relationships (QSAR) models is their predictive power. The latter can be defined as the ability of a model to predict accurately the target property (e.g., biological activity) of compounds that were not used for model development. We suggest that this goal can be achieved by rational division of an experimental SAR dataset into the training and test set, which are used for model development and validation, respectively. Given that all compounds are represented by points in multidimensional descriptor space, we argue that training and test sets must satisfy the following criteria: (i) Representative points of the test set must be close to those of the training set; (ii) Representative points of the training set must be close to representative points of the test set; (iii) Training set must be diverse. For quantitative description of these criteria, we use molecular dataset diversity indices introduced recently (Golbraikh, A., J. Chem. Inf. Comput. Sci., 40 (2000) 414-425). For rational division of a dataset into the training and test sets, we use three closely related sphere-exclusion algorithms. Using several experimental datasets, we demonstrate that QSAR models built and validated with our approach have statistically better predictive power than models generated with either random or activity ranking based selection of the training and test sets. We suggest that rational approaches to the selection of training and test sets based on diversity principles should be used routinely in all QSAR modeling research.
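The abstract names sphere-exclusion algorithms but does not spell them out. The following is a minimal sketch of one generic sphere-exclusion split, not necessarily any of the three variants used in the paper. The function name `sphere_exclusion_split`, the `radius` parameter, the rule for choosing the next sphere center, and the toy descriptor values are all assumptions made for illustration.

```python
import math

def sphere_exclusion_split(points, radius):
    """Split point indices into (train, test) with a simple sphere-exclusion pass.

    Assumption: the next sphere center is simply the first compound not yet
    assigned; published variants may choose centers differently.
    """
    remaining = list(range(len(points)))
    train, test = [], []
    while remaining:
        center = remaining.pop(0)      # sphere center goes to the training set
        train.append(center)
        outside = []
        for i in remaining:
            # Compounds inside the sphere around the center go to the test set.
            if math.dist(points[center], points[i]) <= radius:
                test.append(i)
            else:
                outside.append(i)
        remaining = outside            # only compounds outside all spheres remain
    return train, test

# Hypothetical 2-D descriptor values for five compounds.
descriptors = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.05, 0.95), (2.0, 0.0)]
train, test = sphere_exclusion_split(descriptors, radius=0.5)
print(train, test)  # [0, 2, 4] [1, 3]
```

In this sketch every test compound lies within `radius` of some training compound, in the spirit of criterion (i), and training compounds are more than `radius` apart from each other, in the spirit of criterion (iii). Criterion (ii) is not guaranteed: a training compound such as index 4 above may have no nearby test compound.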
Pages: 357-369
Number of pages: 13