FLUX-CiM: Flexible Unsupervised Extraction of Citation Metadata

Cited by: 30
Authors
Cortez, Eli [1]
da Silva, Altigran S. [1]
Goncalves, Marcos Andre
Mesquita, Filipe [1]
de Moura, Edleno S. [1, 2]
Affiliations
[1] Univ Fed Amazonas, Dept Ciencia Comp, Manaus, AM, Brazil
[2] Univ Fed Minas Gerais, Dept Ciencia Comp, Belo Horizonte, MG, Brazil
Source
PROCEEDINGS OF THE 7TH ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES: BUILDING & SUSTAINING THE DIGITAL ENVIRONMENT | 2007
Keywords
Citation Management; Metadata Extraction
DOI
10.1145/1255175.1255219
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In this paper we propose a knowledge-based approach to help extract the correct components of citations in any given format. Unlike related approaches that rely on manually built knowledge bases (KBs) for recognizing the components of a citation, in our case such a KB is automatically constructed from an existing set of sample metadata records from a given area (e.g., computer science or health sciences). Our approach does not rely on patterns encoding specific delimiters of a particular citation style. It is also unsupervised, in the sense that it does not rely on a learning method that requires a training phase. These features give our technique a high degree of automation and flexibility. To demonstrate the effectiveness and applicability of our proposed approach, we ran experiments in which we applied it to extract information from citations in papers from two different domains. The results of these experiments indicate precision and recall levels above 94% and perfect extraction for the large majority of citations tested.
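The abstract's central idea, a knowledge base built automatically from sample metadata records and then used to label citation terms without delimiter patterns, can be illustrated with a minimal sketch. This is not the authors' algorithm: the function names (`normalize`, `build_kb`, `label_terms`), the term-frequency scoring, and the toy record are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict

def normalize(token):
    """Strip surrounding punctuation and lowercase a token."""
    return token.strip(".,;:()[]\"'").lower()

def build_kb(records):
    """Build a toy KB: for each term, count how often it occurs in each field."""
    kb = defaultdict(Counter)
    for record in records:
        for field, value in record.items():
            for token in value.split():
                term = normalize(token)
                if term:
                    kb[term][field] += 1
    return kb

def label_terms(citation, kb):
    """Label each citation token with the field in which the KB saw it most often.

    Tokens unknown to the KB get None. No delimiter rules of a particular
    citation style are used, only term lookups.
    """
    labels = []
    for token in citation.split():
        term = normalize(token)
        field = kb[term].most_common(1)[0][0] if term in kb else None
        labels.append((token, field))
    return labels

# Hypothetical sample metadata record (toy data, not from the paper).
sample_records = [
    {"author": "Ada Lovelace",
     "title": "notes on engines",
     "venue": "Journal of Examples"},
]

kb = build_kb(sample_records)
labels = label_terms("Lovelace, A. Notes on engines. Journal of Examples", kb)
for token, field in labels:
    print(f"{token!r}: {field}")
```

In this sketch the token "A." stays unlabeled because the toy KB never saw it; a real system would need some way to resolve such unmatched terms, which the abstract does not describe.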
Pages: 215+
Page count: 3