Faster algorithm of string comparison

被引:48
作者
Yang, QX [1 ]
Yuan, SS [1 ]
Zhao, L [1 ]
Chun, L [1 ]
Peng, S [1 ]
机构
[1] Inst High Performance Comp, Singapore 117528, Singapore
关键词
data cleaning; data mining; field similarity; pattern recognition; record similarity; string similarity;
D O I
10.1007/s10044-002-0180-8
中图分类号
TP18 [人工智能理论];
学科分类号
081104 [模式识别与智能系统]; 0812 [计算机科学与技术]; 0835 [软件工程]; 1405 [智能科学与技术];
摘要
In many applications, it is necessary to determine the field similarity. Our paper introduces a package of substring-based new algorithms to determine Field Similarity. Combined together, our new algorithms not only achieves higher accuracy, but also gains the time complexity O(knm) (k<0.75) for the worst case, O(beta*n) where beta<6 for the average case and O(1) for the best case. Throughout the paper, we use the approach of comparative examples to show the higher accuracy of our algorithms compared to that proposed in Lee et al. [1]. Theoretical analysis, concrete examples and experimental results show that our algorithms can significantly improve the accuracy and time complexity of the calculation of field similarity.
引用
收藏
页码:122 / 133
页数:12
相关论文
共 24 条
[1]
GENERALIZED STRING MATCHING [J].
ABRAHAMSON, K .
SIAM JOURNAL ON COMPUTING, 1987, 16 (06) :1039-1051
[2]
Alstrup S, 2000, PROCEEDINGS OF THE ELEVENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, P819
[3]
Amir A, 2000, PROCEEDINGS OF THE ELEVENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, P794
[4]
[Anonymous], P S THEOR COMP
[5]
DUPLICATE RECORD ELIMINATION IN LARGE DATA FILES [J].
BITTON, D ;
DEWITT, DJ .
ACM TRANSACTIONS ON DATABASE SYSTEMS, 1983, 8 (02) :255-265
[6]
CARRANO FM, 2001, DATA ABSTRACTION PRO
[7]
Cormode G, 2000, PROCEEDINGS OF THE ELEVENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, P197
[8]
GALIL Z, 1985, COMBINATORIAL ALGORI, P1
[9]
Gusfield D, 1997, ALGORITHMS STRINGS T
[10]
HERNANDEZ MA, 1995, P 1995 ACM SIGMOD IN, P127