Squeezer: An efficient algorithm for clustering categorical data

被引:199
作者
He, ZY [1 ]
Xu, XF [1 ]
Deng, SC [1 ]
机构
[1] Harbin Inst Technol, Dept Comp Sci & Engn, Harbin 150001, Peoples R China
基金
中国国家自然科学基金;
关键词
clustering; categorical data; data stream; data mining;
D O I
10.1007/BF02948829
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 [计算机科学与技术];
摘要
This paper presents a new efficient algorithm for clustering categorical data, Squeezer, which can produce high quality clustering results and at the same time deserve good scalability. The Squeezer algorithm reads each tuple t in sequence, either assigning t to an existing cluster (initially none), or creating t as a new cluster, which is determined by the similarities between t and clusters. Due to its characteristics, the proposed algorithm is extremely suitable for clustering data streams, where given a sequence of points, the objective is to maintain consistently good clustering of the sequence so far, using a small amount of memory and time. Outliers can also be handled efficiently and directly in Squeezer. Experimental results on real-life and synthetic datasets verify the superiority of Squeezer.
引用
收藏
页码:611 / 624
页数:14
相关论文
共 16 条
[1]
[Anonymous], 1997, SIGMOD WORKSHOP RES
[2]
Ester M, 1996, 2 INT C KNOWL DISCOV, P226, DOI DOI 10.5555/3001460.3001507
[3]
Ganti Venkatesh., 1999, Int. Conf. Knowledge Discovery and Data Mining, P73, DOI DOI 10.1145/312129.312201
[4]
Gibson D., 1998, Proceedings of the Twenty-Fourth International Conference on Very-Large Databases, P311
[5]
Guha S., 1998, SIGMOD Record, V27, P73, DOI 10.1145/276305.276312
[6]
ROCK: A robust clustering algorithm for categorical attributes [J].
Guha, S ;
Rastogi, R ;
Shim, K .
15TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS, 1999, :512-521
[7]
Clustering data streams [J].
Guha, S ;
Mishra, N ;
Motwani, R ;
O'Callaghan, L .
41ST ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE, PROCEEDINGS, 2000, :359-366
[8]
HAN EH, 1997, P 1997 SIGMOD WORKSH, P78
[9]
Two-phase clustering process for outliers detection [J].
Jiang, MF ;
Tseng, SS ;
Su, CM .
PATTERN RECOGNITION LETTERS, 2001, 22 (6-7) :691-700
[10]
Chameleon: Hierarchical clustering using dynamic modeling [J].
Karypis, G ;
Han, EH ;
Kumar, V .
COMPUTER, 1999, 32 (08) :68-+