Supervised learning with decision tree-based methods in computational and systems biology

Cited by: 146
Authors
Geurts, Pierre [1]
Irrthum, Alexandre [1]
Wehenkel, Louis [1]
Affiliations
[1] Univ Liege, Dept EE & CS & GIGA Res, B-4000 Liege, Belgium
Keywords
RANDOM FOREST; VARIABLE IMPORTANCE; REGULATORY MODULES; SERUM BIOMARKERS; BINDING SITES; CLASSIFICATION; PREDICTION; NETWORKS; IDENTIFICATION; PROTEINS;
DOI
10.1039/b907946g
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification codes
071010 ; 081704 ;
Abstract
At the intersection between artificial intelligence and statistics, supervised learning allows algorithms to automatically build predictive models from observations of a system alone. During the last twenty years, supervised learning has been a tool of choice for analyzing the ever-growing and increasingly complex data generated in molecular biology, with successful applications in genome annotation, function prediction, and biomarker discovery. Among supervised learning methods, decision tree-based methods stand out as non-parametric methods that have the unique feature of combining interpretability, efficiency, and, when used in ensembles of trees, excellent accuracy. The goal of this paper is to provide an accessible and comprehensive introduction to this class of methods. The first part of the review is devoted to an intuitive but complete description of decision tree-based methods and a discussion of their strengths and limitations with respect to other supervised learning methods. The second part of the review provides a survey of their applications in the context of computational and systems biology.
Pages: 1593-1605
Number of pages: 13