Comparison of Random Forest and Pipeline Pilot Naive Bayes in Prospective QSAR Predictions

Times cited: 83
Authors
Chen, Bin [2]
Sheridan, Robert P. [1]
Hornak, Viktor [1]
Voigt, Johannes H. [1]
Affiliations
[1] Merck Res Labs, Chem Modeling & Informat Dept, Rahway, NJ 07065 USA
[2] Indiana Univ, Sch Informat & Comp, Bloomington, IN 47405 USA
Keywords
COMPOUND CLASSIFICATION; MOLECULAR DESCRIPTOR; SIMILARITY; REGRESSION; MODELS; TOOL; SET;
DOI
10.1021/ci200615h
Chinese Library Classification number
R914 [Medicinal chemistry];
Discipline classification code
100701 ;
Abstract
Random forest is currently considered one of the best QSAR methods available in terms of accuracy of prediction. However, it is computationally intensive. Naive Bayes is a simple, robust classification method. The Laplacian-modified Naive Bayes implementation is the preferred QSAR method in the widely used commercial chemoinformatics platform Pipeline Pilot. We made a comparison of the ability of Pipeline Pilot Naive Bayes (PLPNB) and random forest to make accurate predictions on 18 large, diverse in-house QSAR data sets. These include on-target and ADME-related activities. These data sets were set up as classification problems with either binary or multicategory activities. We used a time-split method of dividing training and test sets, as we feel this is a realistic way of simulating prospective prediction. PLPNB is computationally efficient. However, random forest predictions are at least as good as, and in many cases significantly better than, those of PLPNB on our data sets. PLPNB performs better with ECFP4 and ECFP6 descriptors, which are native to Pipeline Pilot, and more poorly with other descriptors we tried.
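The time-split idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' procedure: the record layout, the `tested_on` field, the `time_split` function name, and the 25% test fraction are all assumptions made for illustration. The sketch only shows the general principle that compounds tested earlier form the training set and compounds tested later form the test set.

```python
from datetime import date


def time_split(records, test_fraction=0.25):
    """Split records chronologically: earliest into training, latest into test.

    `records` is a list of dicts with a hypothetical `tested_on` date field.
    `test_fraction` is an illustrative choice, not a value from the paper.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    # Order compounds by the date on which they were assayed.
    ordered = sorted(records, key=lambda r: r["tested_on"])
    # Hold out the most recent compounds, at least one.
    n_test = max(1, round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Toy data; identifiers, dates, and labels are invented for the example.
records = [
    {"id": "cmpd-3", "tested_on": date(2010, 6, 1), "active": 1},
    {"id": "cmpd-1", "tested_on": date(2008, 1, 15), "active": 0},
    {"id": "cmpd-4", "tested_on": date(2011, 3, 9), "active": 1},
    {"id": "cmpd-2", "tested_on": date(2009, 11, 30), "active": 0},
]

train, test = time_split(records)
print([r["id"] for r in train])
print([r["id"] for r in test])
```

Compared with a random split, a chronological split keeps every test compound later than every training compound, which is closer to how a model is used prospectively.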
Pages: 792-803
Number of pages: 12