Extracting lists of data records from semi-structured web pages

Cited by: 33
Authors
Alvarez, Manuel [1]
Pan, Alberto [1]
Raposo, Juan [1]
Bellas, Fernando [1]
Cacheda, Fidel [1]
Affiliation
[1] Univ A Coruna, Dept Informat & Commun Technol, La Coruna 15071, Spain
Keywords
data extraction; data mining/web-based information; web/web-based information systems;
DOI
10.1016/j.datak.2007.10.002
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Many web sources provide access to an underlying database containing structured data. These data can usually be accessed only in HTML form, which makes it difficult for software programs to obtain them in structured form. Nevertheless, web sources usually encode data records using a consistent template or layout, and the implicit regularities in the template can be used to automatically infer the structure and extract the data. In this paper, we propose a set of novel techniques to address this problem. While several previous works have addressed the same problem, most of them require multiple input pages, whereas our method requires only one. In addition, previous methods make some assumptions about how data records are encoded into web pages, which do not always hold in real websites. Finally, we have also tested our techniques with a large number of real web sources and found them to be very effective. (C) 2007 Elsevier B.V. All rights reserved.
Pages: 491-509
Page count: 19