Selecting Relevant Descriptors for Classification by Bayesian Estimates: A Comparison with Decision Trees and Support Vector Machines Approaches for Disparate Data Sets

Cited by: 31
Authors
Carbon-Mangels, Miriam [1]
Hutter, Michael C. [2]
Affiliations
[1] Fed Inst Vaccines & Biomed, Paul Ehrlich Inst, Biostat Sect, D-63225 Langen, Germany
[2] Univ Saarland, Ctr Bioinformat, D-66123 Saarbrucken, Germany
Keywords
Chemoinformatics; Drug design; Feature selection; in silico-ADMET; Machine learning; Molecular descriptors; Naive Bayes classifier; CYTOCHROME-P450; 3A4; PHARMACOPHORE MODEL; COMBINED PROTEIN; DRUG-METABOLISM; SHANNON ENTROPY; PREDICTION; CYP2D6; 2C9; 2D6; INFORMATION;
DOI
10.1002/minf.201100069
Chinese Library Classification (CLC) number
R914 [Medicinal chemistry]
Subject classification code
100701
Abstract
Classification algorithms suffer from the curse of dimensionality, which leads to overfitting, particularly if the problem is over-determined. It is therefore of particular interest to identify the most relevant descriptors in order to reduce complexity. We applied Bayesian estimates to model the probability distribution of descriptor values used for binary classification, using n-fold cross-validation. As a measure of the discriminative power of the classifiers, the symmetric form of the Kullback-Leibler divergence of their probability distributions was computed. We found that the most relevant descriptors possess a Gaussian-like distribution of their values, show the largest divergences, and therefore appear most often in the cross-validation scenario. The results were compared to those of the LASSO feature selection method applied to multiple decision trees and support vector machine approaches for data sets of substrates and nonsubstrates of three cytochrome P450 isoenzymes, which comprise strongly unbalanced compound distributions. In contrast to decision trees and support vector machines, the performance of Bayesian estimates is less affected by unbalanced data sets. This strategy reveals those descriptors that allow a simple linear separation of the classes, whereas the superior accuracy of decision trees and support vector machines can be attributed to nonlinear separation, which in turn makes them more prone to overfitting.
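The descriptor ranking described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each descriptor is modelled by a single univariate Gaussian per class (the abstract only says that the most relevant descriptors are Gaussian-like), it omits the n-fold cross-validation, and the function names symmetric_kl_gaussian and rank_descriptors are hypothetical. Under these assumptions the symmetric Kullback-Leibler divergence has a closed form, and descriptors are sorted by it.

```python
import numpy as np


def symmetric_kl_gaussian(mu1, var1, mu2, var2):
    """Symmetric Kullback-Leibler divergence KL(p||q) + KL(q||p)
    between two univariate Gaussians p = N(mu1, var1), q = N(mu2, var2).
    Works element-wise on arrays, one entry per descriptor."""
    sq_diff = (mu1 - mu2) ** 2
    kl_pq = 0.5 * (np.log(var2 / var1) + (var1 + sq_diff) / var2 - 1.0)
    kl_qp = 0.5 * (np.log(var1 / var2) + (var2 + sq_diff) / var1 - 1.0)
    return kl_pq + kl_qp


def rank_descriptors(X, y, eps=1e-9):
    """Rank descriptors (columns of X) by the symmetric KL divergence
    between their class-conditional Gaussian fits.

    X   : (n_samples, n_descriptors) array of descriptor values
    y   : binary labels, 1 = substrate, 0 = nonsubstrate (assumed coding)
    eps : small constant added to the variances to avoid division by zero
          (an assumption of this sketch, not taken from the paper)

    Returns (order, divergence): descriptor indices sorted from most to
    least discriminative, and the divergence of every descriptor.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    pos, neg = X[y == 1], X[y == 0]

    # Class-conditional mean and variance of each descriptor.
    mu_pos, mu_neg = pos.mean(axis=0), neg.mean(axis=0)
    var_pos, var_neg = pos.var(axis=0) + eps, neg.var(axis=0) + eps

    divergence = symmetric_kl_gaussian(mu_pos, var_pos, mu_neg, var_neg)
    order = np.argsort(divergence)[::-1]  # largest divergence first
    return order, divergence
```

Because the divergence depends only on per-class means and variances, not on class sizes, a strongly unbalanced data set changes how well each class is estimated but not the quantity being compared, which is in line with the abstract's remark on unbalanced data.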
Pages: 885-895
Page count: 11