An unsupervised heuristic-based approach for bibliographic metadata deduplication

Cited by: 12
Authors
Borges, Eduardo N. [1]
de Carvalho, Moises G. [2]
Galante, Renata [1]
Goncalves, Marcos Andre [2]
Laender, Alberto H. F. [2]
Affiliations
[1] Univ Fed Rio Grande do Sul, Inst Informat, Porto Alegre, RS, Brazil
[2] Univ Fed Minas Gerais, Dept Comp Sci, Belo Horizonte, MG, Brazil
Keywords
Digital libraries; Metadata; Deduplication; Similarity
DOI
10.1016/j.ipm.2011.01.009
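Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
080201 [Mechanical Manufacturing and Automation]
Abstract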
Digital libraries of scientific articles contain collections of digital objects that are usually described by bibliographic metadata records. These records can be acquired from different sources and represented using several metadata standards, which may be heterogeneous in both content and structure. As a result, many records may be duplicated in the repository, which degrades the quality of services such as searching and browsing. In this article we present an approach that identifies duplicated bibliographic metadata records efficiently and effectively. We propose similarity functions designed specifically for the digital library domain and evaluate them experimentally. Our results show that the proposed functions improve the quality of metadata deduplication by up to 188% compared to four different baselines. We also show that our approach achieves statistically equivalent results when compared to a state-of-the-art method for replica identification based on genetic programming, without the burden and cost of any training process. © 2011 Elsevier Ltd. All rights reserved.
Pages: 706-718
Page count: 13