Estimating the domain of applicability for machine learning QSAR models: a study on aqueous solubility of drug discovery molecules

Cited by: 34
Authors
Schroeter, Timon Sebastian
Schwaighofer, Anton
Mika, Sebastian
Ter Laak, Antonius
Suelzle, Detlev
Ganzer, Ursula
Heinrich, Nikolaus
Mueller, Klaus-Robert
Affiliations
[1] Fraunhofer FIRST, D-12489 Berlin, Germany
[2] Tech Univ Berlin, Dept Comp Sci, D-10587 Berlin, Germany
[3] Idalab GmbH, D-10178 Berlin, Germany
[4] Res Labs Bayer Schering Pharma AG, D-13342 Berlin, Germany
Keywords
solubility; aqueous; machine learning; drug discovery; domain of applicability; error bar; error estimation; Gaussian process; Bayesian modeling; random forest; ensemble; decision tree; support vector machine; ridge regression; distance
DOI
10.1007/s10822-007-9125-z
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Subject classification codes
071010; 081704
Abstract
We investigate the use of different machine learning methods to construct models for aqueous solubility. The models are based on about 4000 compounds, including an in-house set of 632 drug discovery molecules from Bayer Schering Pharma. For each method, we also consider an appropriate way to obtain error bars in order to estimate the domain of applicability (DOA) of each model. Specifically, we investigate error bars from a Bayesian model (Gaussian Process, GP), an ensemble-based approach (Random Forest), and approaches based on the Mahalanobis distance to the training data (for Support Vector Machine and Ridge Regression models). We evaluate all approaches in terms of their prediction accuracy (in cross-validation and on an external validation set of 536 molecules) and in terms of how faithfully the individual error bars represent the actual prediction error.
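As a rough illustration of the ensemble-based error bars and the calibration check mentioned in the abstract, the sketch below takes the spread of per-member predictions as the error bar and then counts how often the true value falls within k error bars of the prediction. It is not the authors' implementation: the function names `ensemble_prediction` and `coverage` and all numbers are hypothetical, and the sample standard deviation is only one possible choice of spread measure.

```python
from statistics import mean, stdev


def ensemble_prediction(member_predictions):
    """Return (prediction, error bar) for one compound.

    Assumption: the prediction is the mean over ensemble members and the
    error bar is their sample standard deviation.
    """
    return mean(member_predictions), stdev(member_predictions)


def coverage(y_true, y_pred, error_bars, k=1.0):
    """Fraction of compounds whose true value lies within k error bars."""
    hits = [abs(t - p) <= k * s for t, p, s in zip(y_true, y_pred, error_bars)]
    return sum(hits) / len(hits)


# Hypothetical log-solubility predictions from four ensemble members
# (e.g. four trees) for three compounds; the values are made up.
members = [
    [-3.0, -3.2, -2.8, -3.0],  # members agree: small error bar
    [-5.0, -4.0, -6.0, -5.0],  # members disagree: large error bar
    [-1.0, -1.1, -0.9, -1.0],  # members agree, but the prediction is wrong
]
y_true = [-3.1, -5.5, -2.0]  # hypothetical measured values

preds, bars = zip(*(ensemble_prediction(m) for m in members))
print("error bars:", [round(b, 3) for b in bars])
print("coverage within 1 error bar:", coverage(y_true, preds, bars, k=1.0))
```

If the error bars behaved like Gaussian standard deviations, roughly 68% of the true values would be expected to fall within one error bar. Coverage far from that figure would suggest the error bars are too narrow or too wide.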
Pages: 485-498
Page count: 14