Learning to match the schemas of data sources: A multistrategy approach

Cited by: 119
Authors
Doan, A
Domingos, P
Halevy, A
Affiliations
[1] Univ Illinois, Dept Comp Sci, Urbana, IL 61801 USA
[2] Univ Washington, Dept Comp Sci & Engn, Seattle, WA 98195 USA
Keywords
schema matching; multistrategy learning; data integration
DOI
10.1023/A:1021765902788
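Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104 [Pattern Recognition and Intelligent Systems]; 0812 [Computer Science and Technology]; 0835 [Software Engineering]; 1405 [Intelligence Science and Technology]
Abstract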
The problem of integrating data from multiple data sources, either on the Internet or within enterprises, has received much attention in the database and AI communities. The focus has been on building data integration systems that provide a uniform query interface to the sources. A key bottleneck in building such systems has been the laborious manual construction of semantic mappings between the query interface and the source schemas. Examples of mappings are "element location maps to address" and "price maps to listed-price". We propose a multistrategy learning approach to automatically find such mappings. The approach applies multiple learner modules, where each module exploits a different type of information either in the schemas of the sources or in their data, then combines the predictions of the modules using a meta-learner. Learner modules employ a variety of techniques, ranging from Naive Bayes and nearest-neighbor classification to entity recognition and information retrieval. We describe the LSD system, which employs this approach to find semantic mappings. To further improve matching accuracy, LSD exploits domain integrity constraints, user feedback, and nested structures in XML data. We test LSD experimentally on several real-world domains. The experiments validate the utility of multistrategy learning for data integration and show that LSD proposes semantic mappings with a high degree of accuracy.
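The multistrategy idea in the abstract can be illustrated with a small sketch. This is not the LSD implementation: the class names, the toy training data, and the fixed combination weights are all assumptions made for illustration. One hypothetical base learner scores element names by token overlap, another applies Naive Bayes to data values, and a weighted sum stands in for a trained meta-learner.

```python
import math
from collections import Counter, defaultdict


def tokens(text):
    """Lowercase a string and split it on spaces, hyphens, and underscores."""
    return text.lower().replace("-", " ").replace("_", " ").split()


def normalize(scores):
    """Rescale non-negative scores so they sum to 1 (uniform if all are zero)."""
    total = sum(scores.values())
    if total == 0:
        return {label: 1.0 / len(scores) for label in scores}
    return {label: value / total for label, value in scores.items()}


class NameLearner:
    """Hypothetical base learner: Jaccard overlap between element-name tokens."""

    def fit(self, examples):
        self.names = defaultdict(set)
        for name, _values, label in examples:
            self.names[label].update(tokens(name))
        return self

    def predict(self, name, values):
        query = set(tokens(name))
        scores = {}
        for label, seen in self.names.items():
            union = query | seen
            scores[label] = len(query & seen) / len(union) if union else 0.0
        return normalize(scores)


class NaiveBayesLearner:
    """Hypothetical base learner: multinomial Naive Bayes over data-value tokens."""

    def fit(self, examples):
        self.counts = defaultdict(Counter)
        self.docs = Counter()
        for _name, values, label in examples:
            self.docs[label] += 1
            for value in values:
                self.counts[label].update(tokens(value))
        self.vocab = {tok for c in self.counts.values() for tok in c}
        return self

    def predict(self, name, values):
        words = [tok for value in values for tok in tokens(value)]
        total_docs = sum(self.docs.values())
        log_scores = {}
        for label in self.docs:
            total = sum(self.counts[label].values())
            score = math.log(self.docs[label] / total_docs)
            for word in words:
                # Laplace smoothing; the extra +1 reserves mass for unseen tokens.
                score += math.log(
                    (self.counts[label][word] + 1) / (total + len(self.vocab) + 1)
                )
            log_scores[label] = score
        top = max(log_scores.values())
        return normalize({l: math.exp(s - top) for l, s in log_scores.items()})


def combine(predictions, weights):
    """Weighted sum of base-learner scores (assumed stand-in for a meta-learner)."""
    labels = predictions[0].keys()
    merged = {
        label: sum(w * p[label] for w, p in zip(weights, predictions))
        for label in labels
    }
    return normalize(merged)


# Toy training data (invented): (source element name, sample values, target label).
train = [
    ("location", ["Seattle WA", "Urbana IL"], "address"),
    ("listed-price", ["$250,000", "$310,000"], "price"),
]
learners = [NameLearner().fit(train), NaiveBayesLearner().fit(train)]
preds = [l.predict("house-location", ["Portland OR", "Seattle WA"]) for l in learners]
combined = combine(preds, [0.5, 0.5])
best = max(combined, key=combined.get)
```

In this sketch the weights are fixed by hand; a meta-learner in the sense of the abstract would instead learn how much to trust each module from training data.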
Pages: 279-301
Page count: 23