Cluster-wise assessment of cluster stability

Times cited: 524
Authors
Hennig, Christian [1]
Affiliations
[1] UCL, Dept Stat Sci, London WC1E 6BT, England
Keywords
cluster validation; bootstrap; robustness; clustering with noise; Jaccard coefficient
DOI
10.1016/j.csda.2006.11.025
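Chinese Library Classification
TP39 [Computer applications]
Subject classification codes
081203 [Computer application technology]; 0835 [Software engineering]
Abstract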
Stability in cluster analysis is strongly dependent on the data set, especially on how well separated and how homogeneous the clusters are. In the same clustering, some clusters may be very stable and others may be extremely unstable. The Jaccard coefficient, a similarity measure between sets, is used as a cluster-wise measure of cluster stability, which is assessed by the bootstrap distribution of the Jaccard coefficient for every single cluster of a clustering compared to the most similar cluster in the bootstrapped data sets. This can be applied to very general cluster analysis methods. Some alternative resampling methods are investigated as well, namely subsetting, jittering the data points and replacing some data points by artificial noise points. The different methods are compared by means of a simulation study. A data example illustrates the use of the cluster-wise stability assessment to distinguish between meaningful stable and spurious clusters, but it is also shown that clusters are sometimes only stable because of the inflexibility of certain clustering methods. (c) 2006 Elsevier B.V. All rights reserved.
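The bootstrap procedure described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the toy one-dimensional clustering rule `threshold_clusters`, the function names, and the choice to summarise each cluster's bootstrap distribution by its mean are all assumptions made for illustration. Each original cluster is restricted to the points drawn in a bootstrap sample and compared, via the Jaccard coefficient, to the most similar cluster found in that sample.

```python
import random


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| between two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def threshold_clusters(points, gap=1.0):
    """Hypothetical toy 1-D clustering, standing in for any method:
    sort the values and start a new cluster wherever the distance to
    the previous value exceeds `gap`. Returns sets of positions."""
    order = sorted(range(len(points)), key=lambda i: points[i])
    clusters, current = [], [order[0]]
    for prev, idx in zip(order, order[1:]):
        if points[idx] - points[prev] > gap:
            clusters.append(set(current))
            current = []
        current.append(idx)
    clusters.append(set(current))
    return clusters


def clusterwise_stability(points, cluster_fn, n_boot=100, seed=0):
    """Mean bootstrap Jaccard similarity for each original cluster."""
    rng = random.Random(seed)
    n = len(points)
    original = cluster_fn(points)
    totals = [0.0] * len(original)
    for _ in range(n_boot):
        # Draw n indices with replacement and cluster the resampled data.
        sample = [rng.randrange(n) for _ in range(n)]
        boot_clusters = cluster_fn([points[i] for i in sample])
        # Map positions in the bootstrap sample back to original indices.
        boot_sets = [{sample[j] for j in c} for c in boot_clusters]
        drawn = set(sample)
        for k, cluster in enumerate(original):
            # Compare only the part of the cluster that was actually drawn;
            # a cluster with no drawn points scores 0 (an assumption).
            restricted = cluster & drawn
            totals[k] += max(jaccard(restricted, b) for b in boot_sets)
    return [t / n_boot for t in totals]


if __name__ == "__main__":
    data = [0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2, 10.3]
    print(clusterwise_stability(data, threshold_clusters))
```

Values near 1 suggest a cluster that is recovered in almost every resample, while low values suggest a cluster that tends to dissolve. As the abstract cautions, a high value can also reflect the inflexibility of the clustering method rather than genuine structure.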
Pages: 258-271
Page count: 14