Combining Mixture Components for Clustering

Cited by: 203
Authors
Baudry, Jean-Patrick [1 ,2 ,3 ]
Raftery, Adrian E. [4 ]
Celeux, Gilles [1 ]
Lo, Kenneth [5 ]
Gottardo, Raphael [6 ]
Affiliations
[1] Univ Paris Sud, INRIA Saclay Ile de France, F-91405 Orsay, France
[2] Univ Paris 05, Lab MAP5, Paris, France
[3] CNRS, F-75700 Paris, France
[4] Univ Washington, Dept Stat, Seattle, WA 98195 USA
[5] Univ Washington, Dept Microbiol, Seattle, WA 98195 USA
[6] Inst Rech Clin Montreal, Montreal, PQ H2W 1R7, Canada
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
BIC; Entropy; Flow cytometry; Mixture model; Model-based clustering; Multivariate normal distribution; FLOW-CYTOMETRY; MODEL; FEATURES; CLUTTER;
DOI
10.1198/jcgs.2010.08111
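Chinese Library Classification number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];
Discipline classification codes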
020208 ; 070103 ; 0714 ;
Abstract
Model-based clustering consists of fitting a mixture model to data and identifying each cluster with one of its components. Multivariate normal distributions are typically used. The number of clusters is usually determined from the data, often using BIC. In practice, however, individual clusters can be poorly fitted by Gaussian distributions, and in that case model-based clustering tends to represent one non-Gaussian cluster by a mixture of two or more Gaussian distributions. If the number of mixture components is interpreted as the number of clusters, this can lead to overestimation of the number of clusters. This is because BIC selects the number of mixture components needed to provide a good approximation to the density, rather than the number of clusters as such. We propose first selecting the total number of Gaussian mixture components, K, using BIC and then combining them hierarchically according to an entropy criterion. This yields a unique soft clustering for each number of clusters less than or equal to K. These clusterings can be compared on substantive grounds, and we also describe an automatic way of selecting the number of clusters via a piecewise linear regression fit to the rescaled entropy plot. We illustrate the method with simulated data and a flow cytometry dataset. Supplemental materials are available on the journal web site and described at the end of the article.
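The hierarchical combining step described in the abstract can be illustrated with a minimal sketch. It assumes the posterior membership probabilities from an already fitted K-component mixture are available as an n x K matrix `tau`, and it assumes the merging rule is "join the pair of components whose union gives the lowest entropy of the resulting soft clustering". The function names `entropy` and `combine_components` are hypothetical, and neither the BIC selection of K nor the piecewise linear regression on the entropy plot is shown.

```python
import numpy as np


def entropy(tau, eps=1e-300):
    """Entropy of a soft clustering: -sum_i sum_k tau_ik * log(tau_ik)."""
    # Clip only inside the log so that zero probabilities contribute 0.
    safe = np.clip(tau, eps, 1.0)
    return float(-np.sum(tau * np.log(safe)))


def combine_components(tau):
    """Greedily merge columns of tau (n x K posterior probabilities).

    Assumed rule (illustrative): at each step, merge the pair of columns
    whose sum yields the lowest entropy of the merged soft clustering.
    Returns a list of (n_clusters, tau, entropy), from K clusters down to 1.
    """
    tau = np.asarray(tau, dtype=float)
    path = [(tau.shape[1], tau.copy(), entropy(tau))]
    while tau.shape[1] > 1:
        n_comp = tau.shape[1]
        best_entropy, best_tau = None, None
        for j in range(n_comp):
            for k in range(j + 1, n_comp):
                # Drop column k and add its mass to column j (j < k).
                merged = np.delete(tau, k, axis=1)
                merged[:, j] = tau[:, j] + tau[:, k]
                e = entropy(merged)
                if best_entropy is None or e < best_entropy:
                    best_entropy, best_tau = e, merged
        tau = best_tau
        path.append((tau.shape[1], tau.copy(), best_entropy))
    return path
```

Calling `combine_components(tau)` returns one soft clustering for each number of clusters from K down to 1, together with its entropy; plotting those entropies against the number of clusters gives the kind of entropy plot the abstract refers to.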
Pages: 332-353
Page count: 22