A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification

Cited by: 472
Authors
Statnikov, Alexander [1]
Wang, Lily [2]
Aliferis, Constantin F. [1, 2, 3, 4]
Affiliations
[1] Vanderbilt Univ, Dept Biomed Informat, Nashville, TN 37203 USA
[2] Vanderbilt Univ, Dept Biostat, Nashville, TN USA
[3] Vanderbilt Univ, Dept Canc Biol, Nashville, TN USA
[4] Vanderbilt Univ, Dept Comp Sci, Nashville, TN 37235 USA
DOI
10.1186/1471-2105-9-319
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Cancer diagnosis and clinical outcome prediction are among the most important emerging applications of gene expression microarray technology with several molecular signatures on their way toward clinical deployment. Use of the most accurate classification algorithms available for microarray gene expression data is a critical ingredient in order to develop the best possible molecular signatures for patient care. As suggested by a large body of literature to date, support vector machines can be considered "best of class" algorithms for classification of such data. Recent work, however, suggests that random forest classifiers may outperform support vector machines in this domain.
Results: In the present paper we identify methodological biases of prior work comparing random forests and support vector machines and conduct a new rigorous evaluation of the two algorithms that corrects these limitations. Our experiments use 22 diagnostic and prognostic datasets and show that support vector machines outperform random forests, often by a large margin. Our data also underlines the importance of sound research design in benchmarking and comparison of bioinformatics algorithms.
Conclusion: We found that both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines both in the settings when no gene selection is performed and when several popular gene selection methods are used.
Pages: 10