Mixtures of common t-factor analyzers for clustering high-dimensional microarray data

Cited by: 51
Authors
Baek, Jangsun [1]
McLachlan, Geoffrey J. [2, 3]
Affiliations
[1] Chonnam Natl Univ, Dept Stat, Kwangju 500757, South Korea
[2] Univ Queensland, Dept Math, Brisbane, Qld 4072, Australia
[3] Univ Queensland, Inst Mol Biosci, Brisbane, Qld 4072, Australia
Funding
Australian Research Council
Keywords
MODEL; CLASSIFICATION
DOI
10.1093/bioinformatics/btr112
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Mixtures of factor analyzers enable model-based clustering of high-dimensional microarray data, where the number of observations n is small relative to the number of genes p. When the number of clusters is not small, for example where there are several different types of cancer, the number of parameters in the component-covariance matrices may need to be reduced further. This can be achieved with mixtures of factor analyzers with common component-factor loadings (MCFA), a more parsimonious model. However, the MCFA approach is sensitive to both non-normality and outliers, which are commonly observed in microarray experiments, because it assumes the multivariate normal family of distributions for the component-error and factor distributions.

Results: An extension to mixtures of t-factor analyzers with common component-factor loadings is considered, whereby the multivariate t-family is adopted for the component-error and factor distributions. An EM algorithm is developed for fitting mixtures of common t-factor analyzers. The model can handle data with tails longer than those of the normal distribution, is robust against outliers and allows the data to be displayed in low-dimensional plots. It is applied here to both synthetic data and some microarray gene expression data for clustering, where it performs better than several existing methods.
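The generative structure described in the abstract can be illustrated with a minimal sampling sketch. It is not the authors' code and does not implement their EM algorithm. It assumes the usual common-factor form y = A u + e, with a loading matrix A shared by all components, component-specific factor means and covariances, a diagonal error covariance, and a shared gamma-distributed scale weight that turns the normal factors and errors into multivariate t ones. The function name `sample_mctfa` and all parameter names and values are hypothetical, chosen only for illustration.

```python
import numpy as np


def sample_mctfa(n, pi, A, xi, omega, d, nu, rng):
    """Draw n observations from a mixture of common t-factor analyzers.

    Assumed (hypothetical) parameterization:
      pi    : (g,)      mixing proportions
      A     : (p, q)    factor loadings shared by all components
      xi    : (g, q)    component factor means
      omega : (g, q, q) component factor covariance matrices
      d     : (p,)      diagonal of the common error covariance
      nu    : (g,)      component degrees of freedom
    """
    g = len(pi)
    p, _ = A.shape
    labels = rng.choice(g, size=n, p=pi)
    y = np.empty((n, p))
    for j, i in enumerate(labels):
        # Scale weight w ~ Gamma(nu/2, rate nu/2); dividing the normal
        # covariances by w makes factors and errors jointly multivariate t.
        w = rng.gamma(shape=nu[i] / 2.0, scale=2.0 / nu[i])
        u = rng.multivariate_normal(xi[i], omega[i] / w)  # latent factors
        e = rng.normal(0.0, np.sqrt(d / w))               # diagonal errors
        y[j] = A @ u + e
    return y, labels


# Illustrative settings only: n = 200 samples, p = 50 variables,
# q = 2 factors, g = 3 clusters.
rng = np.random.default_rng(0)
n, p, q = 200, 50, 2
pi = np.array([0.5, 0.3, 0.2])
A = rng.normal(size=(p, q))
xi = np.array([[-3.0, 0.0], [0.0, 3.0], [3.0, 0.0]])
omega = np.stack([np.eye(q) * s for s in (0.5, 1.0, 1.5)])
d = np.full(p, 0.5)
nu = np.array([3.0, 5.0, 10.0])

y, labels = sample_mctfa(n, pi, A, xi, omega, d, nu, rng)
```

Because A is shared across components, the clusters differ only through the q-dimensional factors, which is what allows high-dimensional data to be displayed in low-dimensional plots. Small values of `nu` give heavier tails and occasional outlying observations.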
Pages: 1269-1276
Number of pages: 8