Random forest: A classification and regression tool for compound classification and QSAR modeling

Cited by: 2517
Authors
Svetnik, V
Liaw, A
Tong, C
Culberson, JC
Sheridan, RP
Feuston, BP
Affiliations
[1] Merck Res Labs, Rahway, NJ 07065 USA
[2] Merck Res Labs, West Point, PA 19486 USA
Source
JOURNAL OF CHEMICAL INFORMATION AND COMPUTER SCIENCES | 2003 / Vol. 43 / Issue 06
DOI
10.1021/ci034160g
Chinese Library Classification (CLC) number
O6 [Chemistry];
Discipline classification code
0703;
Abstract
A new classification and regression tool, Random Forest, is introduced and investigated for predicting a compound's quantitative or categorical biological activity based on a quantitative description of the compound's molecular structure. Random Forest is an ensemble of unpruned classification or regression trees created by using bootstrap samples of the training data and random feature selection in tree induction. Prediction is made by aggregating (majority vote or averaging) the predictions of the ensemble. We built predictive models for six cheminformatics data sets. Our analysis demonstrates that Random Forest is a powerful tool capable of delivering performance that is among the most accurate methods to date. We also present three additional features of Random Forest: built-in performance assessment, a measure of relative importance of descriptors, and a measure of compound similarity that is weighted by the relative importance of descriptors. It is the combination of relatively high prediction accuracy and its collection of desired features that makes Random Forest uniquely suited for modeling in cheminformatics.
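The procedure summarized in the abstract (bootstrap samples, random feature selection at each split, unpruned trees, majority vote, and a built-in out-of-bag performance estimate) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the Gini split criterion, the parameter names `n_trees` and `mtry`, and the toy two-descriptor data set are all assumptions made for illustration.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (assumed split criterion)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(X, y, mtry, rng):
    """Grow an unpruned tree; each split considers only mtry random features."""
    if len(set(y)) == 1:
        return y[0]  # pure node becomes a leaf
    best = None
    for f in rng.sample(range(len(X[0])), mtry):
        for t in set(row[f] for row in X):
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:  # sampled features cannot split: majority-class leaf
        return Counter(y).most_common(1)[0][0]
    _, f, t, left, right = best
    return (f, t,
            grow_tree([X[i] for i in left], [y[i] for i in left], mtry, rng),
            grow_tree([X[i] for i in right], [y[i] for i in right], mtry, rng))

def predict_tree(node, row):
    """Follow splits until a leaf (any non-tuple value) is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def random_forest(X, y, n_trees=25, mtry=1, seed=0):
    """Fit n_trees trees, each on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        tree = grow_tree([X[i] for i in idx], [y[i] for i in idx], mtry, rng)
        forest.append((tree, set(idx)))  # remember which rows were in-bag
    return forest

def predict(forest, row):
    """Aggregate the ensemble by majority vote."""
    votes = Counter(predict_tree(tree, row) for tree, _ in forest)
    return votes.most_common(1)[0][0]

def oob_error(forest, X, y):
    """Error rate where each row is voted on only by trees that never saw it."""
    wrong = counted = 0
    for i, row in enumerate(X):
        votes = Counter(predict_tree(tree, row)
                        for tree, inbag in forest if i not in inbag)
        if votes:
            counted += 1
            wrong += votes.most_common(1)[0][0] != y[i]
    return wrong / counted if counted else float("nan")

# Hypothetical toy data: two descriptors per compound, two activity classes.
X = [[0.1, 1.0], [0.2, 1.2], [0.3, 0.9], [0.4, 1.1],
     [2.1, 3.0], [2.2, 3.3], [2.3, 2.9], [2.4, 3.1]]
y = ["inactive"] * 4 + ["active"] * 4

forest = random_forest(X, y, n_trees=25, mtry=1, seed=0)
print(predict(forest, [0.0, 0.5]), predict(forest, [3.0, 4.0]))
print("OOB error:", oob_error(forest, X, y))
```

For regression, the same structure would apply with a variance-based split criterion and averaging in place of the vote. The out-of-bag estimate works because each bootstrap sample leaves out roughly a third of the rows, which then serve as a test set for that tree.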
Pages: 1947 - 1958
Page count: 12