A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis

Cited by: 541
Authors
Statnikov, A [1]
Aliferis, CF
Tsamardinos, I
Hardin, D
Levy, S
Affiliations
[1] Vanderbilt Univ, Dept Biomed Informat, Nashville, TN 37240 USA
[2] Vanderbilt Univ, Dept Math, Nashville, TN USA
DOI
10.1093/bioinformatics/bti033
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Cancer diagnosis is one of the most important emerging clinical applications of gene expression microarray technology. We are seeking to develop a computer system for powerful and reliable cancer diagnostic model creation based on microarray data. To keep a realistic perspective on clinical applications, we focus on multicategory diagnosis. To equip the system with the optimum combination of classifier, gene selection and cross-validation methods, we performed a systematic and comprehensive evaluation of several major algorithms for multicategory classification, several gene selection methods, multiple ensemble classifier methods and two cross-validation designs, using 11 datasets spanning 74 diagnostic categories, 41 cancer types and 12 normal tissue types.
Results: Multicategory support vector machines (MC-SVMs) are the most effective classifiers in performing accurate cancer diagnosis from gene expression data. The MC-SVM techniques by Crammer and Singer, by Weston and Watkins, and one-versus-rest were found to be the best methods in this domain. MC-SVMs outperform other popular machine learning algorithms, such as k-nearest neighbors, backpropagation neural networks and probabilistic neural networks, often to a remarkable degree. Gene selection techniques can significantly improve the classification performance of both MC-SVMs and non-SVM learning algorithms. Ensemble classifiers do not generally improve on the performance of the best non-ensemble models. These results guided the construction of a software system, GEMS (Gene Expression Model Selector), that automates high-quality model construction and enforces sound optimization and performance estimation procedures. This is the first such system to be informed by a rigorous comparative analysis of the available algorithms and datasets.
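The evaluation described in the abstract combines a classifier, a gene selection step and a cross-validation design. The following is a minimal sketch of how such a pipeline can be wired together. It is not the authors' code and not GEMS. Several things in it are assumptions made for illustration: the data are synthetic, the classifier is k-nearest neighbors (one of the baselines named in the abstract, used here only because it fits in a few lines of standard-library Python, where the paper's best performers were MC-SVMs), the gene score is a simple between-class to within-class sum-of-squares ratio, and all function names are hypothetical. The sketch also assumes that sound performance estimation means re-selecting genes inside each training fold, so the held-out samples never influence which genes are kept.

```python
import random
from collections import Counter, defaultdict


def make_toy_data(n_per_class=12, n_genes=30, n_informative=3, n_classes=3, seed=0):
    """Synthetic stand-in for an expression matrix (not real microarray data)."""
    rng = random.Random(seed)
    X, y = [], []
    for label in range(n_classes):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            for g in range(n_informative):
                row[g] += 3.0 * label  # only these genes carry class signal
            X.append(row)
            y.append(label)
    return X, y


def bw_scores(X, y):
    """Score each gene by between-class / within-class sum of squares."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    scores = []
    for g in range(len(X[0])):
        overall = sum(row[g] for row in X) / len(X)
        between = within = 0.0
        for rows in by_class.values():
            mean = sum(r[g] for r in rows) / len(rows)
            between += len(rows) * (mean - overall) ** 2
            within += sum((r[g] - mean) ** 2 for r in rows)
        scores.append(between / within if within > 0 else 0.0)
    return scores


def select_genes(X, y, n_keep):
    """Return indices of the n_keep highest-scoring genes."""
    scores = bw_scores(X, y)
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)[:n_keep]


def knn_predict(train_X, train_y, x, genes, k=3):
    """Majority vote among the k nearest training samples (squared Euclidean)."""
    dists = sorted(
        (sum((row[g] - x[g]) ** 2 for g in genes), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def stratified_folds(y, n_folds, seed=0):
    """Split sample indices into folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % n_folds].append(i)
    return folds


def cross_validate(X, y, n_folds=4, n_keep=3, k=3):
    """Estimate accuracy; genes are re-selected from each training fold only."""
    correct = 0
    for test_idx in stratified_folds(y, n_folds):
        held_out = set(test_idx)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(y)) if i not in held_out]
        genes = select_genes(train_X, train_y, n_keep)
        correct += sum(
            knn_predict(train_X, train_y, X[i], genes, k) == y[i] for i in test_idx
        )
    return correct / len(X)


X, y = make_toy_data()
accuracy = cross_validate(X, y)
print(f"cross-validated accuracy: {accuracy:.2f}")
```

Swapping `knn_predict` for another classifier, or `bw_scores` for another ranking criterion, leaves the cross-validation loop unchanged, which is the kind of combination the abstract says was compared systematically.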
Pages: 631-643
Page count: 13