A trigram hidden Markov model for metadata extraction from heterogeneous references

Cited by: 22
Authors
Ojokoh, Bolanle [1, 2]
Zhang, Ming [1]
Tang, Jian [1]
Affiliations
[1] Peking Univ, Sch Elect Engn & Comp Sci, Beijing 100871, Peoples R China
[2] Fed Univ Technol Akure, Dept Comp Sci, Akure, Nigeria
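Funding
Specialized Research Fund for the Doctoral Program of Higher Education; National High Technology Research and Development Program of China (863 Program);
Keywords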
Metadata extraction; Hidden Markov models; Bibliography; Second order; Shrinkage;
DOI
10.1016/j.ins.2011.01.014
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Our objective was to explore efficient and accurate extraction of metadata such as author, title and institution from heterogeneous references, using hidden Markov models (HMMs). The major contributions of the research were (i) the development of a trigram, full second-order hidden Markov model that gives more priority to words emitted in transitions to the same state, with a corresponding new Viterbi algorithm; (ii) the introduction of a new smoothing technique for transition probabilities; and (iii) the proposal of a modification of the back-off shrinkage technique for emission probabilities. The effect of the size of the data set on the training procedure was also measured. Comparisons were made with other related works, and the model was evaluated with three different data sets. The results showed overall accuracy, precision, recall and F1 measure of over 95%, suggesting that the method outperforms other related methods in the task of metadata extraction from references. © 2011 Elsevier Inc. All rights reserved.
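For orientation only, the following is a minimal sketch of generic second-order (trigram) Viterbi decoding, in which each transition is conditioned on the two preceding states. It is not the modified Viterbi algorithm of the paper: it assumes emissions depend only on the current state, and it replaces the paper's transition smoothing and emission shrinkage with a simple probability floor. The function name `trigram_viterbi`, its parameters, and the toy two-state model are all hypothetical.

```python
import math
from itertools import product


def trigram_viterbi(obs, states, init, trans2, trans3, emit, floor=1e-12):
    """Most likely state path under a second-order HMM (illustrative only).

    init[s]            : P(first state = s)
    trans2[(u, v)]     : P(second state = v | first state = u)
    trans3[(u, v, w)]  : P(w | two previous states u, v)
    emit[s][o]         : P(observation o | state s)
    Missing entries are floored (a stand-in, NOT the paper's smoothing).
    """
    def lg(p):
        return math.log(max(p, floor))

    n = len(obs)
    if n == 0:
        return []
    if n == 1:
        return [max(states,
                    key=lambda s: lg(init.get(s, 0)) + lg(emit[s].get(obs[0], 0)))]

    # delta[(u, v)]: best log-probability of a path whose last two states are u, v
    delta = {}
    for u, v in product(states, repeat=2):
        delta[(u, v)] = (lg(init.get(u, 0)) + lg(emit[u].get(obs[0], 0))
                         + lg(trans2.get((u, v), 0)) + lg(emit[v].get(obs[1], 0)))

    back = []  # back[k][(v, w)] = best state two positions before w, at t = k + 2
    for t in range(2, n):
        new_delta, ptr = {}, {}
        for v, w in product(states, repeat=2):
            best_u = max(states,
                         key=lambda u: delta[(u, v)] + lg(trans3.get((u, v, w), 0)))
            new_delta[(v, w)] = (delta[(best_u, v)]
                                 + lg(trans3.get((best_u, v, w), 0))
                                 + lg(emit[w].get(obs[t], 0)))
            ptr[(v, w)] = best_u
        delta = new_delta
        back.append(ptr)

    # Backtrace from the best final state pair
    u, v = max(delta, key=delta.get)
    path = [u, v]
    for ptr in reversed(back):
        path.insert(0, ptr[(path[0], path[1])])
    return path


# Toy model with two hypothetical metadata fields (all numbers are made up)
STATES = ["author", "title"]
INIT = {"author": 0.9, "title": 0.1}
TRANS2 = {("author", "author"): 0.7, ("author", "title"): 0.3,
          ("title", "title"): 0.9, ("title", "author"): 0.1}
TRANS3 = {("author", "author", "author"): 0.6, ("author", "author", "title"): 0.4,
          ("author", "title", "title"): 0.9, ("author", "title", "author"): 0.1,
          ("title", "title", "title"): 0.9, ("title", "title", "author"): 0.1,
          ("title", "author", "author"): 0.5, ("title", "author", "title"): 0.5}
EMIT = {"author": {"smith": 0.5, "j": 0.5},
        "title": {"hidden": 0.5, "markov": 0.5}}

tokens = ["smith", "j", "hidden", "markov"]
print(trigram_viterbi(tokens, STATES, INIT, TRANS2, TRANS3, EMIT))
# prints ['author', 'author', 'title', 'title']
```

The dynamic program tracks pairs of states rather than single states, so its cost grows with the cube of the number of states per token. This is the usual price of a full second-order model compared with a first-order one.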
Pages: 1538-1551
Number of pages: 14