On clustering massive text and categorical data streams

Cited by: 56
Authors
Aggarwal, Charu C. [1]
Yu, Philip S. [2]
Affiliations
[1] IBM TJ Watson Res Ctr, Hawthorne, NY 10532 USA
[2] Univ Illinois, Chicago, IL USA
Keywords
Stream clustering; Text clustering; Text streams; Text stream clustering; Categorical data
DOI
10.1007/s10115-009-0241-z
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we study the data stream clustering problem for text and categorical data domains. While clustering has recently been studied for numeric data streams, text and categorical data present different challenges because of the large and unordered nature of the corresponding attributes. We therefore propose algorithms for text and categorical data stream clustering. We propose a condensation-based approach that summarizes the stream into a number of fine-grained cluster droplets. These summarized droplets can be used in conjunction with a variety of user queries to construct the clusters for different input parameters, which provides an online analytical processing approach to stream clustering. We also study the problem of detecting noisy and outlier records in real time. We test the approach on a number of real and synthetic data sets and show its effectiveness over the baseline OSKM algorithm for stream clustering.
Pages: 171-196
Page count: 26