Using self-similarity to cluster large data sets

Cited by: 38

Authors:

Barbará, D

Chen, P

Affiliations:

[1] George Mason Univ, ISE Dept, Fairfax, VA 22030 USA

[2] Univ Houston Downtown, Dept Math & Comp Sci, Houston, TX 77002 USA

Source:

DATA MINING AND KNOWLEDGE DISCOVERY | 2003 / Vol. 7 / No. 02

Funding:

U.S. National Science Foundation;

Keywords:

clustering; self-similarity; scalability;

DOI:

10.1023/A:1022493416690

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Clustering is a widely used knowledge discovery technique. It helps uncover structures in data that were not previously known. The clustering of large data sets has received a lot of attention in recent years; however, clustering is still a challenging task, since many published algorithms fail to scale well with the size of the data set and the number of dimensions that describe the points, to find clusters of arbitrary shape, or to deal effectively with the presence of noise. In this paper, we present a new clustering algorithm based on self-similarity properties of the data sets. Self-similarity is the property of being invariant with respect to the scale used to look at the data set. While fractals are self-similar at every scale used to look at them, many data sets exhibit self-similarity over a range of scales. Self-similarity can be measured using the fractal dimension. The new algorithm, which we call Fractal Clustering (FC), places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least. This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among them (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable at will, providing the best answer possible at that point, and is incremental. We show via experiments that FC deals effectively with large data sets, high dimensionality, and noise, and is capable of recognizing clusters of arbitrary shape.
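
The placement rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes points lie in the unit hypercube [0, 1)^d, estimates the fractal dimension by plain box counting over a few hypothetical grid sizes, and assumes every cluster already holds some initial points. The names `box_counting_dimension` and `assign_point` are illustrative only.

```python
import math


def box_counting_dimension(points, sizes=(0.5, 0.25, 0.125, 0.0625)):
    """Estimate the box-counting dimension of points in [0, 1)^d.

    For each cell size r, count the grid cells that contain at least one
    point, then fit the slope of log N(r) against log(1/r) by least squares.
    The grid sizes are an assumption of this sketch.
    """
    xs, ys = [], []
    for r in sizes:
        occupied = {tuple(math.floor(c / r) for c in p) for p in points}
        xs.append(math.log(1.0 / r))
        ys.append(math.log(len(occupied)))
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def assign_point(clusters, point, sizes=(0.5, 0.25, 0.125, 0.0625)):
    """Add `point` to the cluster whose estimated dimension changes least.

    `clusters` is a list of non-empty lists of coordinate tuples.
    Returns the index of the chosen cluster.
    """
    best_index, best_change = None, math.inf
    for i, members in enumerate(clusters):
        before = box_counting_dimension(members, sizes)
        after = box_counting_dimension(members + [point], sizes)
        change = abs(after - before)
        if change < best_change:
            best_index, best_change = i, change
    clusters[best_index].append(point)
    return best_index


if __name__ == "__main__":
    # Two initial clusters: a horizontal and a vertical line segment.
    horizontal = [(i / 64, 0.1) for i in range(64)]
    vertical = [(0.9, i / 64) for i in range(64)]
    clusters = [horizontal, vertical]
    print(assign_point(clusters, (0.507, 0.1)))
```

Recomputing the dimension from scratch for every candidate cluster keeps the sketch short but is expensive; an implementation aiming at the single-scan behavior the abstract describes would presumably maintain the per-cell counts incrementally.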

Pages: 123 - 152

Page count: 30
