Transformation-based framework for record matching

Cited by: 42
Authors
Arasu, Arvind [1]
Chaudhuri, Surajit [1]
Kaushik, Raghav [1]
Affiliation
[1] Microsoft Res, Redmond, WA 98052 USA
Source
2008 IEEE 24th International Conference on Data Engineering, Vols 1-3 | 2008
DOI
10.1109/ICDE.2008.4497412
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Today's record matching infrastructure does not allow a flexible way to account for synonyms such as "Robert" and "Bob", which refer to the same name, or for more general forms of string transformations such as abbreviations. We propose a programmatic framework for record matching that takes such user-defined string transformations as input. To the best of our knowledge, this is the first proposal for such a framework. This transformational framework, while expressive, poses significant computational challenges, which we address. We empirically evaluate our techniques over real data.
Pages: 40-49
Page count: 10