Clustering data streams: Theory and practice

被引:404
作者
Guha, S
Meyerson, A
Mishra, N
Motwani, R
O'Callaghan, L
机构
[1] Univ Penn, Dept Comp Sci, Philadelphia, PA 19104 USA
[2] Carnegie Mellon Univ, Dept Comp Sci, Pittsburgh, PA 15213 USA
[3] Hewlett Packard Labs, Palo Alto, CA 94304 USA
[4] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
基金
美国国家科学基金会;
关键词
clustering; data streams; approximation algorithms;
D O I
10.1109/TKDE.2003.1198387
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, While using little memory, is crucial. We describe such a Streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm's performance on synthetic and real data streams.
引用
收藏
页码:515 / 528
页数:14
相关论文
共 75 条
[1]  
Achlioptas D., 2001, PROC 33 ACM S THEORY, P611
[2]  
Agarwal PK, 2000, PROCEEDINGS OF THE ELEVENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, P538
[3]  
AGRAWAL R, 1998, P SIGMOD
[4]  
Alon N., 1996, Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, P20, DOI 10.1145/237814.237823
[5]  
Ankerst M, 1999, P SIGMOD
[6]  
[Anonymous], 1999, 40 ANN S FDN COMP SC
[7]  
[Anonymous], P 4 INT C KNOWL DISC
[8]  
Arora S., 1998, Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, P106, DOI 10.1145/276698.276718
[9]  
Arya V., 2001, 33 ACM S THEORY COMP, P21
[10]  
Babcock B., 2002, P ACM S DISCR ALG