Fast and exact out-of-core and distributed k-means clustering

被引:57
作者
Jin, Ruoming [1 ]
Goswami, Anjan
Agrawal, Gagan
机构
[1] Kent State Univ, Dept Comp Sci, Kent, OH USA
[2] Ohio State Univ, Dcpt Comp Sci & Engn, Columbus, OH 43210 USA
关键词
k-means clustering; out-of-core datasets distributed k-means; confidence radius; boundary points;
D O I
10.1007/s10115-005-0210-0
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Clustering has been one of the most widely studied topics in data mining and k-means clustering has been one of the popular clustering algorithms. K-means requires several passes on the entire dataset, which can make it very expensive for large disk-resident datasets. In view of this, a lot of work has been done on various approximate versions of k-means, which require only one or a small number of passes on the entire dataset. In this paper, we present a new algorithm, called fast and exact k-means clustering (FEKM), which typically requires only one or a small number of passes on the entire dataset and provably produces the same cluster centres as reported by the original k-means algorithm. The algorithm uses sampling to create initial cluster centres and then takes one or more passes over the entire dataset to adjust these cluster centres. We provide theoretical analysis to show that the cluster centres thus reported are the same as the ones computed by the original k-means algorithm. Experimental results from a number of real and synthetic datasets show speedup between a factor of 2 and 4.5, as compared with k-means. This paper also describes and evaluates a distributed version of FEKM, which we refer to as DFEKM. This algorithm is suitable for analysing data that is distributed across loosely coupled machines. Unlike the previous work in this area, DFEKM provably produces the same results as the original k-means algorithm. Our experimental results show that DFEKM is clearly better than two other possible options for exact clustering on distributed data, which are down loading all data and running sequential k-means or running parallel k-means on a loosely coupled configuration. Moreover, even in a tightly coupled environment, DFEKM can outperform parallel k-means if there is a significant load imbalance.
引用
收藏
页码:17 / 40
页数:24
相关论文
共 27 条
[1]  
[Anonymous], P 4 INT C KNOWL DISC
[2]  
BADOIU M, 2002, P ANN ACM S THEOR CO
[3]  
Bayardo R.J., 1999, P 5 ACM SIGKDD INT C
[4]  
Berkhin P., 2002, SURVEY CLUSTERING DA
[5]  
Bottou L., 1995, Advances in Neural Information Processing Systems 7, P585
[6]  
CHARIKAR M, 2003, P 35 ANN ACM S THEOR
[7]  
CHERIKAR M, 1997, P 35 ANN ACM S THEOR
[8]  
CHERVENAK A, 2001, J NETWORK COMPUT APP
[9]  
DETERUEL PEL, 1999, P PDPTA
[10]   A data-clustering algorithm on distributed memory multiprocessors [J].
Dhillon, IS ;
Modha, DS .
LARGE-SCALE PARALLEL DATA MINING, 2000, 1759 :245-260