BibPro: A Citation Parser Based on Sequence Alignment

Cited by: 12
Authors
Chen, Chien-Chih [1]
Yang, Kai-Hsiang [2]
Chen, Chuen-Liang [1]
Ho, Jan-Ming [3]
Affiliations
[1] Natl Taiwan Univ, Dept Comp Sci & Informat Engn, Taipei 10764, Taiwan
[2] Natl Taipei Univ Educ, Dept Math & Informat Educ, Taipei, Taiwan
[3] Acad Sinica, Inst Informat Sci, Taipei, Taiwan
Keywords
Data integration; digital libraries; information extraction; sequence alignment; EXTRACTION
DOI
10.1109/TKDE.2010.231
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The dramatic increase in the number of academic publications has led to a growing demand for efficient organization of these resources to meet researchers' needs. As a result, a number of network services have compiled databases from public resources scattered over the Internet. However, different conferences and journals adopt different citation styles, so accurately extracting metadata from a citation string formatted in one of thousands of styles is an interesting problem, and it has attracted a great deal of research attention in recent years. In this paper, based on the notion of sequence alignment, we present a citation parser called BibPro that extracts the components of a citation string. To demonstrate the efficacy of BibPro, we conducted experiments on three benchmark data sets. The results show that BibPro achieved over 90 percent accuracy on each benchmark. Even when trained on citations and associated metadata retrieved from the web, BibPro still achieves reasonable performance.
Pages: 236-250
Number of pages: 15