Model-based clustering and data transformations for gene expression data

被引:525
作者
Yeung, KY
Fraley, C
Murua, A
Raftery, AE
Ruzzo, WL
机构
[1] Univ Washington, Seattle, WA 98195 USA
[2] Insightful Corp, Seattle, WA 98109 USA
基金
美国国家科学基金会;
关键词
D O I
10.1093/bioinformatics/17.10.977
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Motivation: Clustering is a useful exploratory technique for the analysis of gene expression data. Many different heuristic clustering algorithms have been proposed in this context. Clustering algorithms based on probability models offer a principled alternative to heuristic algorithms. In particular, model-based clustering assumes that the data is generated by a finite mixture of underlying probability distributions such as multivariate normal distributions. The issues of selecting a 'good' clustering method and determining the 'correct' number of clusters are reduced to model selection problems in the probability framework. Gaussian mixture models have been shown to be a powerful tool for clustering in many applications. Results: We benchmarked the performance of model-based clustering on several synthetic and real gene expression data sets for which external evaluation criteria were available. The model-based approach has superior performance on our synthetic data sets, consistently selecting the correct model and the number of clusters. On real expression data, the model-based approach produced clusters of quality comparable to a leading heuristic clustering algorithm, but with the key advantage of suggesting the number of clusters and an appropriate model. We also explored the validity of the Gaussian mixture assumption on different transformations of real data. We also assessed the degree to which these real gene expression data sets fit multivariate Gaussian distributions both before and after subjecting them to commonly used data transformations. Suitably chosen transformations seem to result in reasonable fits.
引用
收藏
页码:977 / 987
页数:11
相关论文
共 35 条
[1]  
Anderson T.W., 1986, STAT ANAL DATA, V2nd
[2]  
[Anonymous], 208 STANF U DEP STAT
[3]  
[Anonymous], 1991, APPL MULTIVARIATE DA
[4]  
[Anonymous], P ACM RECOMB 11 14 A
[5]  
[Anonymous], 1949, Human behaviour and the principle of least-effort
[6]   MODEL-BASED GAUSSIAN AND NON-GAUSSIAN CLUSTERING [J].
BANFIELD, JD ;
RAFTERY, AE .
BIOMETRICS, 1993, 49 (03) :803-821
[7]  
BARASH Y, 2001, RECOMB 2001, P12
[8]   Knowledge-based analysis of microarray gene expression data by using support vector machines [J].
Brown, MPS ;
Grundy, WN ;
Lin, D ;
Cristianini, N ;
Sugnet, CW ;
Furey, TS ;
Ares, M ;
Haussler, D .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2000, 97 (01) :262-267
[9]   GAUSSIAN PARSIMONIOUS CLUSTERING MODELS [J].
CELEUX, G ;
GOVAERT, G .
PATTERN RECOGNITION, 1995, 28 (05) :781-793
[10]   A CLASSIFICATION EM ALGORITHM FOR CLUSTERING AND 2 STOCHASTIC VERSIONS [J].
CELEUX, G ;
GOVAERT, G .
COMPUTATIONAL STATISTICS & DATA ANALYSIS, 1992, 14 (03) :315-332