Rock: A robust clustering algorithm for categorical attributes

被引:479
作者
Guha, S [1 ]
Rastogi, R
Shim, K
机构
[1] Stanford Univ, Stanford, CA 94305 USA
[2] Bell Labs, Murray Hill, NJ 07974 USA
[3] Korea Adv Inst Sci & Technol, Taejon 305701, South Korea
[4] Adv Informat Technol Res Ctr, Taejon 305701, South Korea
关键词
data mining; knowledge discovery; clustering algorithms;
D O I
10.1016/S0306-4379(00)00022-3
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Clustering, in data mining, is useful to discover distribution patterns in the underlying data. Clustering algorithms usually employ a distance metric based (e.g., euclidean) similarity measure in order to partition the database such that data points in the same partition are more similar than points in different partitions. In this paper, we study clustering algorithms for data with boolean and categorical attributes. We show that traditional clustering algorithms that use distances between points for clustering are not appropriate for boolean and categorical attributes. Instead, we propose a novel concept of links to measure the similarity/proximity between a pair of data points. We develop a robust hierarchical clustering algorithm ROCK that employs links and not distances when merging clusters. Our methods naturally extend to non-metric similarity measures that are relevant in situations where a domain expert/similarity table is the only source of knowledge. In addition to presenting detailed complexity results for ROCK, we also conduct an experimental study with real-life as well as synthetic data sets to demonstrate the effectiveness of our techniques. For data with categorical attributes, our findings indicate that ROCK not only generates better quality clusters than traditional algorithms, but it also exhibits good scalability properties. (C) 2000 Published by Elsevier Science Ltd.
引用
收藏
页码:345 / 366
页数:22
相关论文
共 9 条
  • [1] [Anonymous], P C VER LARG DAT VLD
  • [2] Guha S., 1998, SIGMOD Record, V27, P73, DOI 10.1145/276305.276312
  • [3] Ng R.T., 1994, Proceedings of the 20th International Conference on Very Large Data Bases, VLDB '94, P144
  • [4] SHIM K, 1997, ICDE, P301
  • [5] RANDOM SAMPLING WITH A RESERVOIR
    VITTER, JS
    [J]. ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 1985, 11 (01): : 37 - 57
  • [6] Zhang T, 1996, ACM SIGMOD RECORD, V25, P103, DOI [/10.1145/235968.233324, DOI 10.1145/235968.233324, 10.1145/235968.233324]
  • [7] [No title captured]
  • [8] [No title captured]
  • [9] [No title captured]