Gene extraction for cancer diagnosis by support vector machines - An improvement

Cited by: 62
Authors
Huang, TM [1]
Kecman, V [1]
Affiliations
[1] Univ Auckland, Sch Engn, Auckland 1, New Zealand
Keywords
cancer diagnosis; support vector machines; gene selection; feature selection
DOI
10.1016/j.artmed.2005.01.006
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Objective: To improve the performance of gene extraction for cancer diagnosis by recursive feature elimination with support vector machines (RFE-SVMs). Cancer diagnosis from DNA microarray data faces many challenges, the most serious being the presence of thousands of genes and only several dozen (at best) patient samples. Classification in such high-dimensional spaces from a limited number of samples is therefore both extremely difficult and error-prone. The improved RFE-SVMs method is introduced and used here to eliminate less relevant genes and thus reduce the overall number of genes used in a medical diagnosis. Methods: The paper shows why and how the usually neglected penalty parameter C and some standard data preprocessing techniques (normalizing and scaling) influence the classification results and the gene selection of RFE-SVMs. The genes selected by RFE-SVMs are compared with those of eight other gene selection algorithms implemented in the Rankgene software to investigate whether there is any consensus among the algorithms, so that the search for the right set of genes can be narrowed. Results: The improved RFE-SVMs method is applied to two benchmark data sets, colon cancer and lymphoma, with various values of C and different standard preprocessing techniques. Decreasing C leads to a smaller diagnosis error than other known methods applied to the same benchmark data sets. With an appropriate parameter C and a proper preprocessing procedure, the reduction in diagnosis error is as high as 36%. Conclusions: The results suggest that, with a properly chosen parameter C, the extracted genes and the constructed classifier overfit the training data less, leading to increased accuracy in selecting relevant genes. Finally, a comparison of the gene rankings obtained by different algorithms shows that there is significant consensus among the various algorithms as to which set of genes is relevant. (c) 2005 Elsevier B.V. All rights reserved.
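The recursive feature elimination loop named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a linear soft-margin SVM trained by plain full-batch subgradient descent (a stand-in for a proper QP solver), per-gene standardization as the preprocessing step, and removal of one gene per round by smallest squared weight. All function names, the toy data generator, and the default values (`epochs`, `lr`, `C=0.01`) are hypothetical and chosen only for illustration.

```python
import random


def standardize(X):
    """Scale every gene (column) to zero mean and unit variance."""
    n = len(X)
    cols = list(zip(*X))
    means = [sum(c) / n for c in cols]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [(sum((v - m) ** 2 for v in c) / n) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


def train_linear_svm(X, y, C, epochs=300, lr=0.1):
    """Minimise 0.5*||w||^2 + C*sum(hinge loss) by full-batch subgradient descent.

    Assumption: a simple stand-in optimiser, adequate only for small C and toy data.
    """
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0  # gradient of the 0.5*||w||^2 term is w
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # sample violates the margin, so the hinge term is active
                for j in range(d):
                    grad_w[j] -= C * yi * xi[j]
                grad_b -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def rfe_svm_rank(X, y, C):
    """Return gene indices ordered from most to least relevant."""
    X = standardize(X)
    remaining = list(range(len(X[0])))
    eliminated = []
    while remaining:
        sub = [[row[j] for j in remaining] for row in X]
        w, _ = train_linear_svm(sub, y, C)
        # Drop the gene whose squared weight is smallest in this round.
        worst = min(range(len(remaining)), key=lambda k: w[k] ** 2)
        eliminated.append(remaining.pop(worst))
    return eliminated[::-1]  # the last gene eliminated is ranked first


def make_toy_data(n_samples=20, n_genes=6, informative=2, seed=0):
    """Hypothetical data: one gene tracks the class label, the rest are noise."""
    rng = random.Random(seed)
    y = [1 if i % 2 == 0 else -1 for i in range(n_samples)]
    X = [[2.0 * yi + rng.gauss(0, 0.3) if j == informative else rng.gauss(0, 1)
          for j in range(n_genes)] for yi in y]
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print(rfe_svm_rank(X, y, C=0.01))
```

Rerunning `rfe_svm_rank` with several values of `C`, with and without `standardize`, is one way to observe on toy data the kind of sensitivity to the penalty parameter and to preprocessing that the abstract describes.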
Pages: 185-194
Page count: 10