An Index Structure for Data Mining and Clustering

被引：17

作者：

Xiong Wang

Jason T. L. Wang

King-Ip Lin

Dennis Shasha

Bruce A. Shapiro

Kaizhong Zhang

机构：

[1] Department of Computer and Information Science,

[2] New Jersey Institute of Technology,undefined

[3] Newark,undefined

[4] NJ,undefined

[5] USA,undefined

[6] Department of Mathematical Sciences,undefined

[7] University of Memphis,undefined

[8] Memphis,undefined

[9] TN,undefined

[10] USA,undefined

[11] Courant Institute of Mathematical Sciences,undefined

[12] New York University,undefined

[13] New York,undefined

[14] USA,undefined

[15] Laboratory of Experimental and Computational Biology,undefined

[16] National Cancer Institute,undefined

[17] Frederick,undefined

[18] MD,undefined

[19] USA,undefined

[20] Department of Computer Science,undefined

[21] The University of Western Ontario,undefined

[22] London,undefined

[23] Ontario,undefined

[24] Canada,undefined

来源：

Knowledge and Information Systems | 2000年 / 2卷 / 2期

关键词：

Keywords: Biomedical applications; Data engineering; Distance metrics; Knowledge discovery; Visualization;

D O I：

10.1007/s101150050009

中图分类号：

学科分类号：

摘要：

In this paper we present an index structure, called MetricMap, that takes a set of objects and a distance metric and then maps those objects to a k-dimensional space in such a way that the distances among objects are approximately preserved. The index structure is a useful tool for clustering and visualization in data-intensive applications, because it replaces expensive distance calculations by sum-of-square calculations. This can make clustering in large databases with expensive distance metrics practical. We compare the index structure with another data mining index structure, FastMap, recently proposed by Faloutsos and Lin, according to two criteria: relative error and clustering accuracy. For relative error, we show that (i) FastMap gives a lower relative error than MetricMap for Euclidean distances, (ii) MetricMap gives a lower relative error than FastMap for non-Euclidean distances (i.e., general distance metrics), and (iii) combining the two reduces the error yet further. A similar result is obtained when comparing the accuracy of clustering. These results hold for different data sizes. The main qualitative conclusion is that these two index structures capture complementary information about distance metrics and therefore can be used together to great benefit. The net effect is that multi-day computations can be done in minutes.

引用

页码：161 / 184

页数：23