Properties of embedding methods for similarity searching in metric spaces

被引:124
作者
Hjaltason, GR [1 ]
Samet, H
机构
[1] Univ Waterloo, Sch Comp Sci, Waterloo, ON N2L 3G1, Canada
[2] Univ Maryland, Ctr Automat Res, Dept Comp Sci, College Pk, MD 20742 USA
[3] Univ Maryland, Inst Adv Comp Studies, College Pk, MD 20742 USA
基金
美国国家科学基金会;
关键词
embedding methods; metric spaces; similarity search; multimedia databases; contractiveness; distortion; quality; Lipschitz embeddings; singular value decomposition (SVID); SparseMap; FastMap; MetricMap;
D O I
10.1109/TPAMI.2003.1195989
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Complex data types-such as images, documents, DNA sequences, etc-are becoming increasingly important in modern database applications. A typical query in many of these applications seeks to find objects that are similar to some target object, where (dis)similarity is defined by some distance function. Often, the cost of evaluating the distance between two objects is very high. Thus, the number of distance evaluations should be kept at a minimum, while (ideally) maintaining the quality of the result. One way to approach this goal is to embed the data objects in a vector space so that the distances of the embedded objects approximates the actual distances. Thus, queries can be performed (for the most part) on the embedded objects. In this paper, we are especially interested in examining the issue of whether or not the embedding methods will ensure that no relevant objects are left out (i.e., there are no false dismissals and, hence, the correct result is reported). Particular attention is paid to the SparseMap, FastMap, and MetricMap embedding methods. SparseMap is a variant of Lipschitz embeddings, while FastMap and MetricMap are inspired by dimension reduction methods for Euclidean spaces (using KILT or the related PCA and SVD). We show that, in general, none of these embedding methods guarantee that queries on the embedded objects have no false dismissals, while also demonstrating the limited cases in which the guarantee does hold. Moreover, we describe a variant of SparseMap that allows queries with no false dismissals. In addition, we show that with FastMap and MetricMap, the distances of the embedded objects can be much greater than the actual distances. This makes it impossible (or at least impractical) to modify FastMap and MetricMap to guarantee no false dismissals.
引用
收藏
页码:530 / 549
页数:20
相关论文
共 45 条
[1]  
Achlioptas D., 2001, P 20 ACM SIGMOD SIGA, P274, DOI DOI 10.1145/375551.375608
[2]  
Agrawal R., 1993, Foundations of Data Organization and Algorithms. 4th International Conference. FODO '93 Proceedings, P69
[3]  
Ankerst M, 1999, LECT NOTES COMPUT SC, V1651, P207
[4]  
[Anonymous], P 1998 ACM SIGMOD IN
[5]  
[Anonymous], 1999, P 5 ACM SIGKDD INT C, DOI DOI 10.1145/312129.312264
[6]   An optimal algorithm for approximate nearest neighbor searching in fixed dimensions [J].
Arya, S ;
Mount, DM ;
Netanyahu, NS ;
Silverman, R ;
Wu, AY .
JOURNAL OF THE ACM, 1998, 45 (06) :891-923
[7]   Using the triangle inequality to reduce the number of comparisons required for similarity-based retrieval [J].
Barros, J ;
French, J ;
Martin, W ;
Kelly, P ;
Cannon, M .
STORAGE AND RETRIEVAL FOR STILL IMAGE AND VIDEO DATABASES IV, 1996, 2670 :392-403
[8]   APPROXIMATE CLOSEST-POINT QUERIES IN HIGH DIMENSIONS [J].
BERN, M .
INFORMATION PROCESSING LETTERS, 1993, 45 (02) :95-99
[9]  
Bingham E., 2001, P 7 ACM SIGKDD INT C, P245, DOI DOI 10.1145/502512.502546
[10]   ON LIPSCHITZ EMBEDDING OF FINITE METRIC-SPACES IN HILBERT-SPACE [J].
BOURGAIN, J .
ISRAEL JOURNAL OF MATHEMATICS, 1985, 52 (1-2) :46-52