A data-clustering algorithm on distributed memory multiprocessors

被引:153
作者
Dhillon, IS [1 ]
Modha, DS
机构
[1] Univ Texas, Dept Comp Sci, Austin, TX 78712 USA
[2] IBM Corp, Almaden Res Ctr, San Jose, CA 95120 USA
来源
LARGE-SCALE PARALLEL DATA MINING | 2000年 / 1759卷
关键词
D O I
10.1007/3-540-46502-2_13
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
To cluster increasingly massive data sets that are common today in data and text mining, we propose a parallel implementation of the k-means clustering algorithm based on the message passing model. The proposed algorithm exploits the inherent data-parallelism in the k-means algorithm. We analytically show that the speedup and the scaleup of our algorithm approach the optimal as the number of data points increases. We implemented our algorithm on an IBM POWERparallel SP2 with a maximum of 16 nodes. On typical test data sets, we observe nearly linear relative speedups, for example, 15.62 on 16 nodes, and essentially linear scaleup in the size of the data set and in the number of clusters desired. For a 2 gigabyte test data set, our implementation drives the 16 node SP2 at more than 1.8 gigaflops.
引用
收藏
页码:245 / 260
页数:16
相关论文
共 43 条
[1]   Parallel mining of association rules [J].
Agrawal, R ;
Shafer, JC .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 1996, 8 (06) :962-969
[2]  
[Anonymous], 1996, P ACM SIGMOD C MAN D
[3]  
[Anonymous], 1998, MINING VERY LARGE DA
[4]  
BOLEY D, 1998, AI REV
[5]  
Bottou L., 1995, Advances in Neural Information Processing Systems 7, P585
[6]  
Broder A.Z., 1997, 1997015 DIG SYST RES
[7]  
CHATTRATICHAT J, 1997, P 3 INT C KNOWL DISC, P61
[8]  
Cheeseman P.C., 1996, ADV KNOWLEDGE DISCOV, V180, P153, DOI https://doi.org/10.5555/257938.257954
[9]  
CHEUNG DW, 1999, IN PRESS DATA MINING
[10]   LogP - A practice model of parallel computation [J].
Culler, DE ;
Karp, RM ;
Patterson, D ;
Sahay, A ;
Santos, EE ;
Schauser, KE ;
Subramonian, R ;
vonEicken, T .
COMMUNICATIONS OF THE ACM, 1996, 39 (11) :78-85