Fast and exact out-of-core K-means clustering

被引:17
作者
Goswami, A [1 ]
Jin, RM [1 ]
Agrawal, G [1 ]
机构
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
来源
FOURTH IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS | 2004年
关键词
D O I
10.1109/ICDM.2004.10102
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Clustering has been one. of the most widely studied topics in data mining and k-means clustering has been one of the popular clustering algorithms. K-means requires several passes on the entire dataset, which can make it very expensive for large disk-resident datasets. In view of this, a lot of work has been done on various approximate versions of k-means, which, require only one or a,small number of passes on the entire dataset. In this paper we present a new algorithm which typically requires only, one or a small number of passes on the entire dataset, and provably produces the same cluster centers as reported by the original k-means algorithm. The algorithm uses sampling to create initial cluster centers, and then takes one or more passes over the entire dataset to adjust these cluster centers.. We provide theoretical analysis to show that the cluster centers thus reported are the same as the ones computed by the original k-means algorithm. Experimental results from a number of real and synthetic datasets show speedup between a factor of 2 and 4.5, as compared to k-means.
引用
收藏
页码:83 / 90
页数:8
相关论文
共 17 条
[1]  
[Anonymous], P 4 INT C KNOWL DISC
[2]  
BADOIU M, 2002, P ANN ACM S THEORY C
[3]  
Bayardo R.J., 1999, P 5 ACM SIGKDD INT C
[4]  
Berkhin P., 2002, SURVEY CLUSTERING DA
[5]  
Bottou L., 1995, Advances in Neural Information Processing Systems 7, P585
[6]  
CHARIKAR M, 1997, P S THEOR COMP
[7]  
CHARIKAR M, 2003, P 35 ANN ACM S THEOR
[8]  
Domingos P., 2001, P 18 INT C MACH LEAR
[9]  
Farnstrom F., 2000, SIGKDD Explor. Newsl, V2, P51, DOI DOI 10.1145/360402.360419
[10]  
Ghosh J, 2003, HUM FAC ER, P247