Robust classification modeling on microarray data using misclassification penalized posterior

被引:18
作者
Soukup, M
Cho, HJ
Lee, JK
机构
[1] Univ Virginia, Div Biostat & Epidemiol, Charlottesville, VA 22908 USA
[2] US FDA, Div Biometr 3, Rockville, MD 20850 USA
关键词
D O I
10.1093/bioinformatics/bti1020
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Motivation: Genome-wide microarray data are often used in challenging classification problems of clinically relevant subtypes of human diseases. However, the identification of a parsimonious robust prediction model that performs consistently well on future independent data has not been successful due to the biased model selection from an extremely large number of candidate models during the classification model search and construction. Furthermore, common criteria of prediction model performance, such as classification error rates, do not provide a sensitive measure for evaluating performance of such astronomic competing models. Also, even though several different classification approaches have been utilized to tackle such classification problems, no direct comparison on these methods have been made. Results: We introduce a novel measure for assessing the performance of a prediction model, the misclassification-penalized posterior (MiPP), the sum of the posterior classification probabilities penalized by the number of incorrectly classified samples. Using MiPP, we implement a forward step-wise cross-validated procedure to find our optimal prediction models with different numbers of features on a training set. Our final robust classification model and its dimension are determined based on a completely independent test dataset. This MiPP-based classification modeling approach enables us to identify the most parsimonious robust prediction models only with two or three features on well-known microarray datasets. These models show superior performance to other models in the literature that often have more than 40-100 features in their model construction.
引用
收藏
页码:I423 / I430
页数:8
相关论文
共 18 条
[1]   Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays [J].
Alon, U ;
Barkai, N ;
Notterman, DA ;
Gish, K ;
Ybarra, S ;
Mack, D ;
Levine, AJ .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 1999, 96 (12) :6745-6750
[2]   Selection bias in gene extraction on the basis of microarray gene-expression data [J].
Ambroise, C ;
McLachlan, GJ .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2002, 99 (10) :6562-6566
[3]   The use of the area under the roc curve in the evaluation of machine learning algorithms [J].
Bradley, AP .
PATTERN RECOGNITION, 1997, 30 (07) :1145-1159
[4]  
Christianini N., 2000, INTRO SUPPORT VECTOR, DOI DOI 10.1017/CBO9780511801389
[5]   Between-group analysis of microarray data [J].
Culhane, AC ;
Perrière, G ;
Considine, EC ;
Cotter, TG ;
Higgins, DG .
BIOINFORMATICS, 2002, 18 (12) :1600-1608
[6]   Comparison of discrimination methods for the classification of tumors using gene expression data [J].
Dudoit, S ;
Fridlyand, J ;
Speed, TP .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2002, 97 (457) :77-87
[7]   Support vector machine classification and validation of cancer tissue samples using microarray expression data [J].
Furey, TS ;
Cristianini, N ;
Duffy, N ;
Bednarski, DW ;
Schummer, M ;
Haussler, D .
BIOINFORMATICS, 2000, 16 (10) :906-914
[8]   Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring [J].
Golub, TR ;
Slonim, DK ;
Tamayo, P ;
Huard, C ;
Gaasenbeek, M ;
Mesirov, JP ;
Coller, H ;
Loh, ML ;
Downing, JR ;
Caligiuri, MA ;
Bloomfield, CD ;
Lander, ES .
SCIENCE, 1999, 286 (5439) :531-537
[9]  
JOHNSON RA, 1998, APPL MULTIVARIATE ST, P630
[10]   Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method [J].
Li, LP ;
Weinberg, CR ;
Darden, TA ;
Pedersen, LG .
BIOINFORMATICS, 2001, 17 (12) :1131-1142