Estimating the number of clusters in a data set via the gap statistic

Cited by: 3836

Authors:

Tibshirani, R [1]

Walther, G

Hastie, T

Affiliations:

[1] Stanford Univ, Dept Hlth Res & Policy, Stanford, CA 94305 USA

[2] Stanford Univ, Dept Stat, Stanford, CA 94305 USA

Source:

JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY | 2001 / Volume 63

Keywords:

clustering; groups; hierarchy; K-means; uniform distribution;

DOI:

10.1111/1467-9868.00293

Chinese Library Classification (CLC) number:

O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];

Discipline classification code:

020208 ; 070103 ; 0714 ;

Abstract:

We propose a method (the 'gap statistic') for estimating the number of clusters (groups) in a set of data. The technique uses the output of any clustering algorithm (e.g. K-means or hierarchical), comparing the change in within-cluster dispersion with that expected under an appropriate reference null distribution. Some theory is developed for the proposal and a simulation study shows that the gap statistic usually outperforms other methods that have been proposed in the literature.
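The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a basic Lloyd-style K-means, a reference distribution that is uniform over the bounding box of the data, and the commonly cited selection rule (the smallest k with Gap(k) >= Gap(k+1) - s_{k+1}); none of these details are stated in the abstract itself. The function names `kmeans_dispersion`, `gap_statistic`, and `choose_k` are hypothetical.

```python
import math
import random


def kmeans_dispersion(points, k, rng, n_init=5, n_iter=50):
    """Smallest within-cluster sum of squares found over n_init Lloyd runs."""
    best = float("inf")
    for _ in range(n_init):
        centers = rng.sample(points, k)  # random initial centers
        for _ in range(n_iter):
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda c: math.dist(p, centers[c]))
                clusters[j].append(p)
            # Recompute each center as its cluster mean (keep it if empty).
            new_centers = [
                tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
                for j, cl in enumerate(clusters)
            ]
            if new_centers == centers:
                break
            centers = new_centers
        wss = sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)
        best = min(best, wss)
    return best


def gap_statistic(points, k_max, n_ref=10, seed=0):
    """Return a list of (gap_k, s_k) for k = 1..k_max."""
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    results = []
    for k in range(1, k_max + 1):
        log_w = math.log(kmeans_dispersion(points, k, rng))
        ref_logs = []
        for _ in range(n_ref):
            # Assumed null model: uniform over the data's bounding box.
            ref = [
                tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                for _ in points
            ]
            ref_logs.append(math.log(kmeans_dispersion(ref, k, rng)))
        mean_ref = sum(ref_logs) / n_ref
        sd = math.sqrt(sum((v - mean_ref) ** 2 for v in ref_logs) / n_ref)
        # Gap = expected log-dispersion under the null minus the observed one.
        results.append((mean_ref - log_w, sd * math.sqrt(1 + 1 / n_ref)))
    return results


def choose_k(results):
    """Smallest k with gap_k >= gap_{k+1} - s_{k+1}; falls back to k_max."""
    for k in range(1, len(results)):
        gap_k = results[k - 1][0]
        gap_next, s_next = results[k]
        if gap_k >= gap_next - s_next:
            return k
    return len(results)


# Usage on synthetic data: two well-separated Gaussian blobs in the plane.
data_rng = random.Random(1)
data = [(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(30)]
data += [(data_rng.gauss(10, 0.5), data_rng.gauss(0, 0.5)) for _ in range(30)]
estimated_k = choose_k(gap_statistic(data, k_max=4))
print("estimated number of clusters:", estimated_k)
```

The sketch uses only the standard library so that it stays self-contained. Any clustering routine that reports within-cluster dispersion could replace `kmeans_dispersion`, in line with the abstract's remark that the technique works with the output of any clustering algorithm.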

Citation:

Pages: 411-423

Page count: 13

References: 20 in total

[1] Breiman L., 1984, BIOMETRICS, DOI 10.2307/2530946

[2] Calinski T., 1974, COMMUN STAT-THEOR M, V3, P1, DOI 10.1080/03610927408827101

[3] Cuevas, A; Febrero, M; Fraiman, R. Estimating the number of clusters [J]. CANADIAN JOURNAL OF STATISTICS-REVUE CANADIENNE DE STATISTIQUE, 2000, 28 (02): 367-382

[4] Dharmadhikari S, 1988, UNIMODALITY CONVEXIT

[5] DIDAY E, 1977, RAIRO INFORMATIQUE C, P329

[6] Fraley, C; Raftery, AE. How many clusters? Which clustering method? Answers via model-based cluster analysis [J]. COMPUTER JOURNAL, 1998, 41 (08): 578-588

[7] Gordon A, 1999, Classification

[8] Gordon A.D., 1996, DATA KNOWLEDGE, P32, DOI 10.1007/978-3-642-79999-0_3

[9] Hartigan J. A., 1975, CLUSTERING ALGORITHM

[10] KRZANOWSKI, WJ; LAI, YT. A CRITERION FOR DETERMINING THE NUMBER OF GROUPS IN A DATA SET USING SUM-OF-SQUARES CLUSTERING [J]. BIOMETRICS, 1988, 44 (01): 23-34
