Indexing large metric spaces for similarity search queries

Cited by: 140
Authors
Bozkaya, T
Ozsoyoglu, M
Affiliations
[1] Oracle Corp, Redwood Shores, CA 94065 USA
[2] Case Western Reserve Univ, Dept Comp Engn & Sci, Cleveland, OH 44106 USA
Source
ACM TRANSACTIONS ON DATABASE SYSTEMS | 1999 / Vol. 24 / No. 3
Keywords
algorithms; experimentation; measurement; performance; verification
DOI
10.1145/328939.328959
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
One of the common queries in many database applications is finding approximate matches to a given query item from a collection of data items. For example, given an image database, one may want to retrieve all images that are similar to a given query image. Distance-based index structures are proposed for applications where the distance computations between objects of the data domain are expensive (such as high-dimensional data) and the distance function is metric. In this paper we consider using distance-based index structures for similarity queries on large metric spaces. We elaborate on the approach that uses reference points (vantage points) to partition the data space into spherical shell-like regions in a hierarchical manner. We introduce the multivantage point tree structure (mvp-tree) that uses more than one vantage point to partition the space into spherical cuts at each level. In answering similarity-based queries, the mvp-tree also utilizes the precomputed (at construction time) distances between the data points and the vantage points. We summarize the experiments comparing mvp-trees to vp-trees that have a similar partitioning strategy, but use only one vantage point at each level and do not make use of the precomputed distances. Empirical studies show that the mvp-tree outperforms the vp-tree by 20% to 80% for varying query ranges and different distance distributions. Next, we generalize the idea of using multiple vantage points and discuss the results of experiments we have made to see how varying the number of vantage points in a node affects search performance and how much is gained in performance by making use of precomputed distances. The results show that, after all, it may be best to use a large number of vantage points in an internal node in order to end up with a single directory node and keep as many of the precomputed distances as possible to provide more efficient filtering during search operations. 
Finally, we provide some experimental results that compare mvp-trees with M-trees, a dynamic distance-based index structure for metric domains.
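The partitioning idea in the abstract can be illustrated with a minimal sketch. This is not the paper's mvp-tree: it is a simplified tree with a single vantage point per node, it stores no precomputed distances, and the names `Node`, `build`, `range_search`, and `euclidean` are hypothetical, chosen only for this illustration. It assumes the distance function is a metric, so the triangle inequality can be used to skip a whole partition without computing distances to its points.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, ...]
Metric = Callable[[Point, Point], float]


def euclidean(a: Point, b: Point) -> float:
    """Example metric; any function obeying the triangle inequality works."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


@dataclass
class Node:
    vantage: Point                    # reference (vantage) point of this node
    radius: float = 0.0               # median distance from vantage to the rest
    inside: Optional["Node"] = None   # points with d(vantage, p) <= radius
    outside: Optional["Node"] = None  # points with d(vantage, p) > radius


def build(points: Sequence[Point], dist: Metric = euclidean) -> Optional[Node]:
    """Recursively split the points into a ball and a shell around a vantage point."""
    if not points:
        return None
    vantage, rest = points[0], list(points[1:])  # assumption: first point is the vantage
    if not rest:
        return Node(vantage)
    dists = [dist(vantage, p) for p in rest]
    radius = sorted(dists)[len(dists) // 2]      # median distance as the cut
    inside = [p for p, d in zip(rest, dists) if d <= radius]
    outside = [p for p, d in zip(rest, dists) if d > radius]
    return Node(vantage, radius, build(inside, dist), build(outside, dist))


def range_search(node: Optional[Node], query: Point, r: float,
                 dist: Metric = euclidean,
                 out: Optional[List[Point]] = None) -> List[Point]:
    """Collect every indexed point p with dist(query, p) <= r."""
    if out is None:
        out = []
    if node is None:
        return out
    d = dist(query, node.vantage)
    if d <= r:
        out.append(node.vantage)
    # Triangle inequality: a match p satisfies d - r <= d(vantage, p) <= d + r.
    if d - r <= node.radius:   # the query ball can reach the inner partition
        range_search(node.inside, query, r, dist, out)
    if d + r > node.radius:    # the query ball can reach the outer partition
        range_search(node.outside, query, r, dist, out)
    return out
```

Calling `build(points)` once and then `range_search(tree, query, r)` returns the same points as a linear scan, while skipping any partition that the query ball cannot intersect. The mvp-tree described in the abstract goes further by using more than one vantage point per node and by keeping precomputed distances for additional filtering, neither of which this sketch attempts.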
Pages: 361-404
Page count: 44