A cost model and index architecture for the similarity join

被引:17
作者
Böhm, C
Kriegel, HP
机构
来源
17TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS | 2001年
关键词
D O I
10.1109/ICDE.2001.914854
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
The similarity join is an important database primitive,which has been successfully applied to speed up data mining algorithms. In the similarity join, two point sets of a multidimensional vector space are combined such that the result contains all point pairs where the distance does not exceed a parameter epsilon. Due to its high practical relevance, many similarity join algorithms have been devised. In this paper, we propose an analytical cost model for the similarity join operation based an indexes. Our problem analysis reveals a serious optimization conflict between CPU time and I/O time: Fine-grained index structures are beneficial for the CPU efficiency, but deteriorate the I/O performance. As a consequence of this observation, we propose a new index architecture and join algorithm which allows a separate optimization of CPU time and I/O time. Our solution utilizes large pages which are optimized for I/O processing. The pages accommodate a search structure which minimizes the computational effort. In the experimental evaluation, a substantial improvement over competitive techniques is shown.
引用
收藏
页码:411 / 420
页数:10
相关论文
共 35 条
[1]  
Agrawal R, 1993, INT C FDN DAT ORG AL
[2]  
Ankerst M., 1999, ACM SIGMOD INT C MAN
[3]  
[Anonymous], 1996, COMPREHENSIVE CHEMOM
[4]  
BERCHTOLD S, 2000, IEEE INT C DAT ENG I
[5]  
BERCHTOLD S, 1997, ACM S PRINC DAT SYST
[6]  
BERCHTOLD S, 1998, INT C EXT DAT TECHN
[7]  
BOHM C, 2000, INT C INF KNOWL MAN
[8]  
BRINKHOFF T, 1993, ACM SIGMOD INT C MAN
[9]  
CIACCIA P, 1998, ACM S PRINC DAT SYST
[10]  
Faloutsos C., 1994, J INTELLIGENT INFORM, V3