How many clusters? An information-theoretic perspective

Cited by: 92
Authors
Still, S [1]
Bialek, W
Affiliations
[1] Princeton Univ, Dept Phys, Princeton, NJ 08544 USA
[2] Princeton Univ, Lewis Sigler Inst Integrat Genom, Princeton, NJ 08544 USA
DOI
10.1162/0899766042321751
Chinese Library Classification (CLC): TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
Clustering provides a common means of identifying structure in complex data, and there is renewed interest in clustering as a tool for the analysis of large data sets in many fields. A natural question is how many clusters are appropriate for the description of a given system. Traditional approaches to this problem are based on either a framework in which clusters of a particular shape are assumed as a model of the system or on a two-step procedure in which a clustering criterion determines the optimal assignments for a given number of clusters and a separate criterion measures the goodness of the classification to determine the number of clusters. In a statistical mechanics approach, clustering can be seen as a trade-off between energy- and entropy-like terms, with lower temperature driving the proliferation of clusters to provide a more detailed description of the data. For finite data sets, we expect that there is a limit to the meaningful structure that can be resolved and therefore a minimum temperature beyond which we will capture sampling noise. This suggests that correcting the clustering criterion for the bias that arises due to sampling errors will allow us to find a clustering solution at a temperature that is optimal in the sense that we capture maximal meaningful structure - without having to define an external criterion for the goodness or stability of the clustering. We show that in a general information-theoretic framework, the finite size of a data set determines an optimal temperature, and we introduce a method for finding the maximal number of clusters that can be resolved from the data in the hard clustering limit.
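
A minimal sketch of the annealing picture described above, assuming NumPy: soft assignments follow a Boltzmann distribution p(c|x) proportional to p(c) exp(-d(x,c)/T), so lowering the temperature T lets more clusters become occupied. This illustrates the generic rate-distortion / deterministic-annealing trade-off the abstract refers to, not the authors' bias-corrected criterion; the function and parameter names are hypothetical.

# Minimal sketch (assumed names; not the authors' exact algorithm):
# rate-distortion style soft clustering by deterministic annealing.
# Assignments follow p(c|x) ~ p(c) * exp(-d(x,c)/T): the -d/T term is the
# energy-like distortion cost, the log p(c) term the entropy-like
# compression cost, and T sets the trade-off between them.
import numpy as np

def anneal_clusters(X, K=10, T=1.0, n_iter=200, seed=0):
    """Soft clustering at temperature T with squared-Euclidean distortion."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    p_c = np.full(K, 1.0 / K)                        # cluster marginals p(c)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)  # N x K
        logits = np.log(p_c + 1e-12) - d / T
        logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
        q = np.exp(logits)
        q /= q.sum(axis=1, keepdims=True)            # Boltzmann p(c|x)
        p_c = q.mean(axis=0)                         # re-estimate p(c)
        centers = (q / (q.sum(axis=0) + 1e-12)).T @ X  # weighted centroids
    return centers, q

# Three well-separated Gaussians; K is deliberately too large.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(50, 2)) for m in (0.0, 2.0, 4.0)])
for T in (4.0, 1.0, 0.2):
    _, q = anneal_clusters(X, K=10, T=T)
    occupied = np.unique(q.argmax(axis=1)).size      # clusters actually used
    print(f"T={T:g}: {occupied} occupied clusters")

At low temperature this sketch typically splits each of the three generating Gaussians across several codewords; on a finite sample, that extra "structure" is exactly the sampling noise the paper's corrected criterion is designed to exclude when selecting the optimal temperature.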
Pages: 2483-2506
Page count: 24