Comprehensive vertical sample-based KNN/LSVM classification for gene expression analysis

Cited by: 46
Authors
Pan, F [1]
Wang, BY
Hu, X
Perrizo, W
Affiliations
[1] N Dakota State Univ, Dept Comp Sci, Fargo, ND 58105 USA
[2] Rockefeller Univ, Lab Struct Microbiol, New York, NY 10021 USA
Keywords
data mining; k-nearest neighbor; support vector machine; feature selection; P-tree; gene expression; machine learning
DOI
10.1016/j.jbi.2004.07.003
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Classification analysis of microarray gene expression data has been widely used to uncover biological features and to distinguish closely related cell types that often appear in the diagnosis of cancer. However, the number of dimensions of gene expression data is often very high, e.g., in the hundreds or thousands. Accurate and efficient classification of such high-dimensional data remains a contemporary challenge. In this paper, we propose a comprehensive vertical sample-based KNN/LSVM classification approach with weights optimized by genetic algorithms for high-dimensional data. Experiments on common gene expression datasets demonstrated that our approach can achieve high accuracy and efficiency at the same time. The improvement in speed is mainly related to the vertical data representation, P-tree, and its optimized logical algebra. The high accuracy is due to the combination of a KNN majority voting approach and a local support vector machine approach that makes optimal decisions at the local level. As a result, our approach could be a powerful tool for high-dimensional gene expression data analysis. © 2004 Elsevier Inc. All rights reserved.
Pages: 240-248
Page count: 9