Model-based clustering, discriminant analysis, and density estimation

Cited by: 2748
Authors
Fraley, C [1]
Raftery, AE [1]
Affiliations
[1] Univ Washington, Dept Stat, Seattle, WA 98195 USA
Keywords
Bayes factor; breast cancer diagnosis; cluster analysis; EM algorithm; gene expression microarray data; Markov chain Monte Carlo; mixture model; outliers; spatial point process
DOI
10.1198/016214502760047131
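Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification code
020208; 070103; 0714
Abstract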
Cluster analysis is the automated search for groups of related observations in a dataset. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures, and most clustering methods available in commercial software are also of this type. However, there is little systematic guidance associated with these methods for solving important practical questions that arise in cluster analysis, such as how many clusters are there, which clustering method should be used, and how should outliers be handled. We review a general methodology for model-based clustering that provides a principled statistical approach to these issues. We also show that this can be useful for other problems in multivariate analysis, such as discriminant analysis and multivariate density estimation. We give examples from medical diagnosis, minefield detection, cluster recovery from noisy data, and spatial density estimation. Finally, we mention limitations of the methodology and discuss recent developments in model-based clustering for non-Gaussian data, high-dimensional datasets, large datasets, and Bayesian estimation.
Pages: 611-631
Page count: 21