Simultaneous gene clustering and subset selection for sample classification via MDL

Cited by: 54
Authors
Jörnsten, R
Yu, B
Affiliations
[1] Rutgers State Univ, Dept Stat, Piscataway, NJ 08854 USA
[2] Univ Calif Berkeley, Dept Stat, Berkeley, CA 94720 USA
Funding
National Science Foundation (USA)
DOI
10.1093/bioinformatics/btg039
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Microarray technology allows the simultaneous monitoring of thousands of genes for each sample. The resulting high-dimensional gene expression data can be used to study similarities of gene expression profiles across samples and thereby form a gene clustering. The clusters may be indicative of genetic pathways. Parallel to gene clustering is the important application of sample classification based on all or selected gene expressions. Gene clustering and sample classification are often undertaken separately, or in a directional manner (one as an aid for the other). However, separating these two tasks may obscure informative structure in the data. Here we present an algorithm for the simultaneous clustering of genes and subset selection of gene clusters for sample classification. We develop a new model selection criterion based on Rissanen's MDL (minimum description length) principle. For the first time, an MDL code length is given for both explanatory variables (genes) and response variables (sample class labels). The final output of the proposed algorithm is a sparse and interpretable classification rule based on cluster centroids or the genes closest to the centroids.
Results: Our algorithm for simultaneous gene clustering and subset selection for classification is applied to three publicly available data sets. For all three data sets, we obtain sparse and interpretable classification models based on centroids of clusters. At the same time, these models give test error rates competitive with those of the best reported methods. Compared with classification models based on single gene selections, our rules are stable in the sense that the number of clusters has small variability and the centroids of the clusters are well correlated (or consistent) across different cross-validation samples. We also discuss models where the centroids of clusters are replaced with the genes closest to the centroids. These models show test error rates comparable to those of models based on single gene selection, but are sparser as well as more stable. Moreover, we comment on how the inclusion of a classification criterion affects the gene clustering, bringing out class-informative structure in the data.
Pages: 1100-1109
Number of pages: 10