y-Randomization and its variants in QSPR/QSAR

Cited by: 780
Authors
Ruecker, Christoph
Ruecker, Gerta
Meringer, Markus
Affiliations
[1] Univ Basel, Bioctr, CH-4056 Basel, Switzerland
[2] Univ Freiburg, Inst Med Biometr & Med Informat, D-79104 Freiburg, Germany
[3] Univ Groningen, Dept Med Chem, NL-9747 AA Groningen, Netherlands
DOI
10.1021/ci700157b
Chinese Library Classification number
R914 [Medicinal Chemistry]
Discipline classification code
100701 [Medicinal Chemistry]
Abstract
y-Randomization is a tool used in validation of QSPR/QSAR models, whereby the performance of the original model in data description (r²) is compared to that of models built for permuted (randomly shuffled) response, based on the original descriptor pool and the original model building procedure. We compared y-randomization and several variants thereof, using original response, permuted response, or random number pseudoresponse and original descriptors or random number pseudodescriptors, in the typical setting of multilinear regression (MLR) with descriptor selection. For each combination of number of observations (compounds), number of descriptors in the final model, and number of descriptors in the pool to select from, computer experiments using the same descriptor selection method result in two different mean highest random r² values. A lower one is produced by y-randomization or a variant likewise based on the original descriptors, while a higher one is obtained from variants that use random number pseudodescriptors. The difference is due to the intercorrelation of real descriptors in the pool. We propose to compare an original model's r² to both of these whenever possible. The meaning of the three possible outcomes of such a double test is discussed. Often y-randomization is not available to a potential user of a model, because the values of all descriptors in the pool for all compounds have not been published. In such cases random number experiments as proposed here are still possible. The test was applied to several recently published MLR QSAR equations, and cases of failure were identified. Some progress is also reported toward the aim of obtaining the mean highest r² of random pseudomodels by calculation rather than by tedious multiple simulations on random number variables.
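The double test described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the data are synthetic, greedy forward selection stands in for whatever descriptor selection method a given model used, and the function names (`r_squared`, `forward_select_r2`, `mean_random_r2`) are hypothetical. The sketch fits an MLR model with descriptor selection on the original response, then estimates the mean highest random r² twice: once with permuted response on the original descriptors (y-randomization), and once with permuted response on random number pseudodescriptors.

```python
import numpy as np

def r_squared(X, y):
    """r^2 of an ordinary least-squares fit of y on X, with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - (resid @ resid) / ss_tot

def forward_select_r2(X, y, n_select):
    """Greedy forward selection of n_select descriptors; returns the final r^2.

    Assumption: forward selection is used here only as a simple stand-in
    for the model building procedure of the model under test.
    """
    chosen = []
    best = 0.0
    for _ in range(n_select):
        candidates = [j for j in range(X.shape[1]) if j not in chosen]
        scores = [r_squared(X[:, chosen + [j]], y) for j in candidates]
        k = int(np.argmax(scores))
        chosen.append(candidates[k])
        best = scores[k]
    return best

def mean_random_r2(X, y, n_select, n_trials, rng, pseudodescriptors=False):
    """Mean highest r^2 over n_trials pseudomodels built on permuted response.

    pseudodescriptors=False: original descriptor pool (y-randomization).
    pseudodescriptors=True:  pool replaced by random numbers of equal shape.
    """
    values = []
    for _ in range(n_trials):
        y_perm = rng.permutation(y)
        X_used = rng.standard_normal(X.shape) if pseudodescriptors else X
        values.append(forward_select_r2(X_used, y_perm, n_select))
    return float(np.mean(values))

# Synthetic example (hypothetical data): 30 compounds, pool of 20
# intercorrelated descriptors, 3 descriptors in the final model.
rng = np.random.default_rng(0)
n_compounds, n_pool, n_select = 30, 20, 3
common = rng.standard_normal((n_compounds, 1))         # shared factor
X = rng.standard_normal((n_compounds, n_pool)) + common  # intercorrelated pool
y = (2.0 * X[:, 0] - X[:, 1] + 0.5 * X[:, 2]
     + 0.3 * rng.standard_normal(n_compounds))

r2_original = forward_select_r2(X, y, n_select)
mean_r2_yrand = mean_random_r2(X, y, n_select, 50, rng)
mean_r2_pseudo = mean_random_r2(X, y, n_select, 50, rng, pseudodescriptors=True)

print(f"original r^2:                      {r2_original:.3f}")
print(f"mean random r^2, original pool:    {mean_r2_yrand:.3f}")
print(f"mean random r^2, pseudodescriptors: {mean_r2_pseudo:.3f}")
```

In this reading, a model is supported only if its r² clearly exceeds both random means. The pseudodescriptor variant needs only the number of compounds, the pool size, and the number of selected descriptors, which is why it remains usable when the descriptor values themselves are unpublished.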
Pages: 2345-2357
Number of pages: 13