Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle?

Cited by: 270
Authors
Touw, Wouter G. [1]
Bayjanov, Jumamurat R. [2]
Overmars, Lex [3]
Backus, Lennart [2]
Boekhorst, Jos
Wels, Michiel
van Hijum, Sacha A. F. T. [4]
Affiliations
[1] Radboud Univ Nijmegen, Nijmegen, Netherlands
[2] Radboud Univ Nijmegen, Med Ctr, Nijmegen, Netherlands
[3] Radboud Univ Nijmegen, Med Ctr, Ctr Mol & Biomol Informat, Nijmegen, Netherlands
[4] Radboud Univ Nijmegen, Med Ctr, Genom Grp, Ctr Mol & Biomol Informat, Nijmegen, Netherlands
Keywords
Random Forest; variable importance; local importance; conditional relationships; variable interaction; proximity; VARIABLE IMPORTANCE MEASURES; AMINO-ACID; SYSTEMS BIOLOGY; PREDICTION; IDENTIFICATION; MICROARRAY; CLASSIFICATION; PROTEINS; MODEL; CLASSIFIERS
DOI
10.1093/bib/bbs034
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
In the Life Sciences, 'omics' data are increasingly generated by different high-throughput technologies. Often only the integration of these data allows uncovering biological insights that can be experimentally validated or mechanistically modelled; that is, sophisticated computational approaches are required to extract the complex non-linear trends present in omics data. Classification techniques allow training a model based on variables (e.g. SNPs in genetic association studies) to separate different classes (e.g. healthy subjects versus patients). Random Forest (RF) is a versatile classification algorithm suited for the analysis of these large data sets. In the Life Sciences, RF is popular because RF classification models have a high prediction accuracy and provide information on the importance of variables for classification. For omics data, variables or conditional relations between variables are typically important for a subset of samples of the same class. For example, within a class of cancer patients, certain SNP combinations may be important for a subset of patients that have a specific subtype of cancer, but not important for a different subset of patients. These conditional relationships can in principle be uncovered from the data with RF, as they are implicitly taken into account by the algorithm during the creation of the classification model. This review details some RF properties that, to the best of our knowledge, are rarely or never used, and that allow maximizing the biological insights that can be extracted from complex omics data sets using RF.
Pages: 315-326
Number of pages: 12