K-Means Clustering Versus Validation Measures: A Data-Distribution Perspective

被引:156
作者
Xiong, Hui [1 ]
Wu, Junjie [2 ]
Chen, Jian [3 ]
机构
[1] Rutgers State Univ, Rutgers Business Sch, Management Sci & Informat Syst Dept, Newark, NJ 07102 USA
[2] Beihang Univ, Sch Econ & Management, Beijing 100083, Peoples R China
[3] Tsinghua Univ, Res Ctr Contemporary Management, Key Res Inst Humanities & Social Sci Univ, Sch Econ & Management, Beijing 100084, Peoples R China
来源
IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART B-CYBERNETICS | 2009年 / 39卷 / 02期
基金
美国国家科学基金会;
关键词
Clustering validation; coefficient of variation (CV); entropy; F-measure; K-means clustering; CLASSIFICATION; SIMILARITY;
D O I
10.1109/TSMCB.2008.2004559
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
K-means is a well-known and widely used partitional clustering method. While there are considerable research efforts to characterize the key features of the K-means clustering algorithm, further investigation is needed to understand how data distributions can have impact on the performance of K-means clustering. To that end, in this paper, we provide a formal and organized study of the effect of skewed data distributions on K-means clustering. Along this line, we first formally illustrate that K-means tends to produce clusters of relatively uniform size, even if input data have varied "true" cluster sizes. In addition, we show that some clustering validation measures, such as the entropy measure, may not capture this uniform effect and provide misleading information on the clustering performance. Viewed in this light, we provide the coefficient of variation (CV) as a necessary criterion to validate the clustering results. Our findings reveal that K-means tends to produce clusters in which the variations of cluster sizes, as measured by CV, are in a range of about 0.3-1.0. Specifically, for data sets with large variation in "true" cluster sizes (e.g., CV > 1.0), K-means reduces variation in resultant cluster sizes to less than 1.0. In contrast, for data sets with small variation in "true" cluster sizes (e.g., CV < 0.3), K-means increases variation in resultant cluster sizes to greater than 0.3. In other words, for the earlier two cases, K-means produces the clustering results which are away from the "true" cluster distributions.
引用
收藏
页码:318 / 331
页数:14
相关论文
共 45 条
[31]  
MURTAGH F, 2000, CLUSTERING MASSIVE D
[32]  
OZGUR A, 2004, UNSUPERVISED MACHINE
[33]   AN ALGORITHM FOR SUFFIX STRIPPING [J].
PORTER, MF .
PROGRAM-AUTOMATED LIBRARY AND INFORMATION SYSTEMS, 1980, 14 (03) :130-137
[34]  
Portnoy L., 2001, Proc. ACM CSS Workshop on Data Mining Applied to Security (DMSA), P5
[35]  
STEINBACH M, 2000, 6 ACM SIGKDD INT C K, P20
[36]  
Tan P., 2019, INTRO DATA MINING
[37]  
Tang B, 2005, LECT NOTES COMPUT SC, V3501, P292
[38]  
*TREC, 1996, TEXT RETR C
[39]  
Van Rijsbergen C.J., 1979, Information retrieval
[40]  
XIONG H, 2006, P 12 ACM SIGKDD INT, P779