Stream Data Clustering Based on Grid Density and Attraction

被引:96
作者
Tu, Li [1 ]
Chen, Yixin [2 ]
机构
[1] Nanjing Univ Aeronaut & Astronaut, Inst Informat Sci & Technol, Nanjing 210016, Peoples R China
[2] Washington Univ, St Louis, MO 63130 USA
关键词
Stream data; data mining; clustering; density-based algorithms;
D O I
10.1145/1552303.1552305
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Clustering real-time stream data is an important and challenging problem. Existing algorithms such as CluStream are based on the k-means algorithm. These clustering algorithms have difficulties finding clusters of arbitrary shapes and handling outliers. Further, they require the knowledge of k and user-specified time window. To address these issues, this article proposes D-Stream, a framework for clustering stream data using a density-based approach. Our algorithm uses an online component that maps each input data record into a grid and an offline component that computes the grid density and clusters the grids based on the density. The algorithm adopts a density decaying technique to capture the dynamic changes of a data stream and a attraction-based mechanism to accurately generate cluster boundaries. Exploiting the intricate relationships among the decay factor, attraction, data density, and cluster structure, our algorithm can efficiently and effectively generate and adjust the clusters in real time. Further, a theoretically sound technique is developed to detect and remove sporadic grids mapped by outliers in order to dramatically improve the space and time efficiency of the system. The technique makes high-speed data stream clustering feasible without degrading the clustering quality. The experimental results show that our algorithm has superior quality and efficiency, can find clusters of arbitrary shapes, and can accurately recognize the evolving behaviors of real-time data streams.
引用
收藏
页数:27
相关论文
共 22 条
[1]  
[Anonymous], 2003, P 29 INT C VER LARG
[2]   Clustering distributed data streams in peer-to-peer environments [J].
Bandyopadhyay, Sanghamitra ;
Giannella, Chris ;
Maulik, Ujjwal ;
Kargupta, Hillol ;
Liu, Kun ;
Datta, Souptik .
INFORMATION SCIENCES, 2006, 176 (14) :1952-1985
[3]  
Barbara Daniel., 2002, ACM sIGKDD Explorations Newsletter, V3, P23, DOI DOI 10.1145/507515.507519
[4]   Online clustering of parallel data streams [J].
Beringer, Juergen ;
Huellermeier, Eyke .
DATA & KNOWLEDGE ENGINEERING, 2006, 58 (02) :180-204
[5]  
CHEN Y, 2007, P INT C KNOWL DISC D
[6]  
CHEN Y, 2002, P WORKSH RES ISS DAT, P53
[7]  
Giannella C., 2003, NEXT GENERATION DATA
[8]  
GILBERT A., 2001, P INT C VER LARG DAT
[9]  
Golab L, 2003, SIGMOD REC, V32, P5, DOI 10.1145/776985.776986
[10]   Clustering data streams: Theory and practice [J].
Guha, S ;
Meyerson, A ;
Mishra, N ;
Motwani, R ;
O'Callaghan, L .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2003, 15 (03) :515-528