AUTOBIB: Automatic extraction of bibliographic information on the web

Cited by: 6
Authors
Geng, JF [1]
Yang, J [1]
Affiliations
[1] Duke Univ, Dept Comp Sci, Durham, NC 27708 USA
Source
INTERNATIONAL DATABASE ENGINEERING AND APPLICATIONS SYMPOSIUM, PROCEEDINGS | 2004
DOI
10.1109/IDEAS.2004.1319792
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
The Web has greatly facilitated access to information. However, information presented in HTML is mainly intended to be browsed by humans, and the problem of automatically extracting such information remains an important and challenging task. In this work, we focus on building a system called AUTOBIB to automate extraction of bibliographic information on the Web. We use a combination of bootstrapping, statistical, and heuristic methods to achieve a high degree of automation. To set up extraction from a new site, we only need to provide a few lines of code specifying how to download pages containing bibliographic information. We do not need to be concerned with each site's presentation format, and the system can cope with changes in the presentation format without human intervention. AUTOBIB bootstraps itself with a small seed database of structured bibliographic records. For each bibliographic Web site, we identify segments within its pages that represent bibliographic records, using state-of-the-art record-boundary discovery techniques. Next, we find matches for some of these "raw records" in the seed database using a set of heuristics. These matches serve as a training set for a parser based on the Hidden Markov Model (HMM), which is then used to parse the rest of the raw records into structured records. We have found an effective HMM structure with special states that correspond to delimiters and HTML tags in raw records. Experiments demonstrate that for our application, this HMM structure achieves high success rates without the complexity of previously proposed structures.
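The parsing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy HMM with three field states plus one hypothetical `delim` state that absorbs punctuation and HTML tags, and it decodes with the standard Viterbi algorithm. The state names, the token classes, and every probability below are made up for illustration; in the setting the abstract describes, such parameters would be estimated from the raw records matched against the seed database.

```python
import math

# Hypothetical states: three fields plus one state for delimiters and HTML tags.
STATES = ["author", "title", "year", "delim"]

# Illustrative, hand-set probabilities (assumptions, not values from the paper).
START = {"author": 0.85, "title": 0.05, "year": 0.05, "delim": 0.05}
TRANS = {
    "author": {"author": 0.60, "title": 0.01, "year": 0.01, "delim": 0.38},
    "title":  {"author": 0.01, "title": 0.60, "year": 0.01, "delim": 0.38},
    "year":   {"author": 0.10, "title": 0.10, "year": 0.10, "delim": 0.70},
    "delim":  {"author": 0.05, "title": 0.40, "year": 0.25, "delim": 0.30},
}
EMIT = {
    "author": {"WORD": 0.90, "PUNCT": 0.08, "NUM": 0.01, "TAG": 0.01},
    "title":  {"WORD": 0.90, "PUNCT": 0.05, "NUM": 0.04, "TAG": 0.01},
    "year":   {"WORD": 0.01, "PUNCT": 0.01, "NUM": 0.97, "TAG": 0.01},
    "delim":  {"WORD": 0.01, "PUNCT": 0.60, "NUM": 0.01, "TAG": 0.38},
}


def token_class(tok):
    """Map a raw token to a coarse observation symbol."""
    if tok.startswith("<") and tok.endswith(">"):
        return "TAG"
    if tok.isdigit():
        return "NUM"
    if not any(c.isalnum() for c in tok):
        return "PUNCT"
    return "WORD"


def viterbi(tokens):
    """Return the most likely state sequence for the tokens (log-space Viterbi)."""
    if not tokens:
        return []
    obs = [token_class(t) for t in tokens]
    # score[s] = best log-probability of any state path ending in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []  # one back-pointer table per token after the first
    for o in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            pointers[s] = best
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
        back.append(pointers)
    # Follow back-pointers from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def to_record(tokens, path):
    """Group tokens by field label, dropping tokens labelled as delimiters."""
    record = {}
    for tok, state in zip(tokens, path):
        if state != "delim":
            record.setdefault(state, []).append(tok)
    return {field: " ".join(toks) for field, toks in record.items()}


# A made-up raw record, already split into tokens.
tokens = ["Geng", "and", "Yang", ".", "<i>", "AUTOBIB", "extraction", "</i>", ",", "2004"]
path = viterbi(tokens)
print(to_record(tokens, path))
# {'author': 'Geng and Yang', 'title': 'AUTOBIB extraction', 'year': '2004'}
```

In this toy model the author and title states emit words with equal probability, so the split between the two fields comes entirely from the transitions through the delimiter state. This is one way to see why dedicated states for delimiters and HTML tags can be useful.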
Pages: 193-204
Number of pages: 12