A comparative study of efficient initialization methods for the k-means clustering algorithm

被引:859
作者
Celebi, M. Emre [1 ]
Kingravi, Hassan A. [2 ]
Vela, Patricio A. [2 ]
机构
[1] Louisiana State Univ, Dept Comp Sci, Shreveport, LA 71105 USA
[2] Georgia Inst Technol, Sch Elect & Comp Engn, Atlanta, GA 30332 USA
基金
美国国家科学基金会;
关键词
Partitional clustering; Sum of squared error criterion; k-means; Cluster center initialization; ENCODING ALGORITHM; QUANTIZATION; PERFORMANCE; SETS;
D O I
10.1016/j.eswa.2012.07.021
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
K-means is undoubtedly the most widely used partitional clustering algorithm. Unfortunately, due to its gradient descent nature, this algorithm is highly sensitive to the initial placement of the cluster centers. Numerous initialization methods have been proposed to address this problem. In this paper, we first present an overview of these methods with an emphasis on their computational efficiency. We then compare eight commonly used linear time complexity initialization methods on a large and diverse collection of data sets using various performance criteria. Finally, we analyze the experimental results using nonparametric statistical tests and provide recommendations for practitioners. We demonstrate that popular initialization methods often perform poorly and that there are in fact strong alternatives to these methods. (C) 2012 Elsevier Ltd. All rights reserved.
引用
收藏
页码:200 / 210
页数:11
相关论文
共 76 条
[51]  
Macqueen J., 1967, 5 BERK S MATH STAT P, P281, DOI DOI 10.1007/S11665-016-2173-6
[52]   The planar k-means problem is NP-hard [J].
Mahajan, Meena ;
Nimbhorkar, Prajakta ;
Varadarajan, Kasturi .
THEORETICAL COMPUTER SCIENCE, 2012, 442 :13-21
[53]   Simulating Data to Study Performance of Finite Mixture Modeling and Clustering Algorithms [J].
Maitra, Ranjan ;
Melnykov, Volodymyr .
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS, 2010, 19 (02) :354-376
[54]   A self-organizing network for hyperellipsoidal clustering (HEC) [J].
Mao, JC ;
Jain, AK .
IEEE TRANSACTIONS ON NEURAL NETWORKS, 1996, 7 (01) :16-29
[55]  
Matsumoto M., 1998, ACM Transactions on Modeling and Computer Simulation, V8, P3, DOI 10.1145/272991.272995
[56]   Comparing clusterings - an information based distance [J].
Meila, Marina .
JOURNAL OF MULTIVARIATE ANALYSIS, 2007, 98 (05) :873-895
[57]   AN EXAMINATION OF THE EFFECT OF 6 TYPES OF ERROR PERTURBATION ON 15 CLUSTERING ALGORITHMS [J].
MILLIGAN, GW .
PSYCHOMETRIKA, 1980, 45 (03) :325-342
[58]   A STUDY OF STANDARDIZATION OF VARIABLES IN CLUSTER-ANALYSIS [J].
MILLIGAN, GW ;
COOPER, MC .
JOURNAL OF CLASSIFICATION, 1988, 5 (02) :181-204
[59]  
Norusis MarijaJ., 2011, IBM SPSS Statistics 19 Statistical Procedures Companion
[60]  
Onoda Takashi, 2012, Journal of Emerging Technologies in Web Intelligence, V4, P51, DOI 10.4304/jetwi.4.1.51-59