Estimating the number of clusters in a data set via the gap statistic

Cited by: 3836
Authors
Tibshirani, R [1]
Walther, G
Hastie, T
Affiliations
[1] Stanford Univ, Dept Hlth Res & Policy, Stanford, CA 94305 USA
[2] Stanford Univ, Dept Stat, Stanford, CA 94305 USA
Keywords
clustering; groups; hierarchy; K-means; uniform distribution;
DOI
10.1111/1467-9868.00293
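Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];
Subject classification codes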
020208 ; 070103 ; 0714 ;
Abstract
We propose a method (the 'gap statistic') for estimating the number of clusters (groups) in a set of data. The technique uses the output of any clustering algorithm (e.g. K-means or hierarchical), comparing the change in within-cluster dispersion with that expected under an appropriate reference null distribution. Some theory is developed for the proposal and a simulation study shows that the gap statistic usually outperforms other methods that have been proposed in the literature.
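The comparison described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the record above does not reproduce the formulas, so the definition Gap(k) = E*[log W_k] - log W_k, the correction s_k = sd_k * sqrt(1 + 1/B), and the rule "choose the smallest k with Gap(k) >= Gap(k+1) - s_{k+1}" are taken from the published paper. The reference distribution is assumed to be uniform over the bounding box of the data, the clustering algorithm is a plain Lloyd's K-means, and the names `kmeans_wk`, `log_wk`, and `gap_statistic` are hypothetical.

```python
import math
import random


def kmeans_wk(points, k, rng, iters=50):
    """Plain Lloyd's K-means; returns the within-cluster sum of squares W_k."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    # Dispersion of every point around its nearest final center.
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)


def log_wk(points, k, rng, restarts=8):
    """log W_k, keeping the best of several random restarts."""
    return math.log(min(kmeans_wk(points, k, rng) for _ in range(restarts)))


def gap_statistic(points, k_max, n_ref=10, seed=0):
    """Return (estimated k, list of Gap(1..k_max)).

    Assumption: the reference null is uniform over the data's bounding box.
    """
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        ref = []
        for _ in range(n_ref):
            sample = [
                tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                for _ in points
            ]
            ref.append(log_wk(sample, k, rng))
        mean = sum(ref) / n_ref
        sd = math.sqrt(sum((r - mean) ** 2 for r in ref) / n_ref)
        gaps.append(mean - log_wk(points, k, rng))   # Gap(k)
        errs.append(sd * math.sqrt(1 + 1 / n_ref))   # s_k
    # Smallest k with Gap(k) >= Gap(k+1) - s_{k+1}.
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - errs[k]:
            return k, gaps
    return k_max, gaps


# Illustrative data: three well-separated Gaussian blobs (synthetic, not from the paper).
data_rng = random.Random(1)
data = [
    (data_rng.gauss(cx, 0.3), data_rng.gauss(cy, 0.3))
    for cx, cy in [(0, 0), (5, 5), (0, 5)]
    for _ in range(20)
]
k_hat, gaps = gap_statistic(data, k_max=5)
print("estimated number of clusters:", k_hat)
```

Because the gap statistic only needs W_k from some clustering routine, the K-means helper here could be swapped for a hierarchical method without changing `gap_statistic`.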
Pages: 411-423
Page count: 13