How many clusters? An information-theoretic perspective

Cited by: 92
Authors
Still, S [1]
Bialek, W
Affiliations
[1] Princeton Univ, Dept Phys, Princeton, NJ 08544 USA
[2] Princeton Univ, Lewis Sigler Inst Integrat Genom, Princeton, NJ 08544 USA
DOI
10.1162/0899766042321751
Chinese Library Classification (CLC): TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
Clustering provides a common means of identifying structure in complex data, and there is renewed interest in clustering as a tool for the analysis of large data sets in many fields. A natural question is how many clusters are appropriate for the description of a given system. Traditional approaches to this problem are based on either a framework in which clusters of a particular shape are assumed as a model of the system or on a two-step procedure in which a clustering criterion determines the optimal assignments for a given number of clusters and a separate criterion measures the goodness of the classification to determine the number of clusters. In a statistical mechanics approach, clustering can be seen as a trade-off between energy- and entropy-like terms, with lower temperature driving the proliferation of clusters to provide a more detailed description of the data. For finite data sets, we expect that there is a limit to the meaningful structure that can be resolved and therefore a minimum temperature beyond which we will capture sampling noise. This suggests that correcting the clustering criterion for the bias that arises due to sampling errors will allow us to find a clustering solution at a temperature that is optimal in the sense that we capture maximal meaningful structure - without having to define an external criterion for the goodness or stability of the clustering. We show that in a general information-theoretic framework, the finite size of a data set determines an optimal temperature, and we introduce a method for finding the maximal number of clusters that can be resolved from the data in the hard clustering limit.
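
A minimal sketch of the annealing picture described above, assuming NumPy: soft assignments follow a Boltzmann distribution p(c|x) proportional to p(c) exp(-d(x,c)/T), so lowering the temperature T lets more clusters become occupied. This illustrates the generic rate-distortion / deterministic-annealing trade-off the abstract refers to, not the authors' bias-corrected criterion; the function and parameter names are hypothetical.

# Minimal sketch (assumed names; not the authors' exact algorithm):
# rate-distortion style soft clustering by deterministic annealing.
# Assignments follow p(c|x) ~ p(c) * exp(-d(x,c)/T): the -d/T term is the
# energy-like distortion cost, the log p(c) term the entropy-like
# compression cost, and T sets the trade-off between them.
import numpy as np

def anneal_clusters(X, K=10, T=1.0, n_iter=200, seed=0):
    """Soft clustering at temperature T with squared-Euclidean distortion."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    p_c = np.full(K, 1.0 / K)                        # cluster marginals p(c)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)  # N x K
        logits = np.log(p_c + 1e-12) - d / T
        logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
        q = np.exp(logits)
        q /= q.sum(axis=1, keepdims=True)            # Boltzmann p(c|x)
        p_c = q.mean(axis=0)                         # re-estimate p(c)
        centers = (q / (q.sum(axis=0) + 1e-12)).T @ X  # weighted centroids
    return centers, q

# Three well-separated Gaussians; K is deliberately too large.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(50, 2)) for m in (0.0, 2.0, 4.0)])
for T in (4.0, 1.0, 0.2):
    _, q = anneal_clusters(X, K=10, T=T)
    occupied = np.unique(q.argmax(axis=1)).size      # clusters actually used
    print(f"T={T:g}: {occupied} occupied clusters")

At low temperature this sketch typically splits each of the three generating Gaussians across several codewords; on a finite sample, that extra "structure" is exactly the sampling noise the paper's corrected criterion is designed to exclude when selecting the optimal temperature.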
Pages: 2483-2506
Page count: 24