Effective techniques for automatic extraction of Web publications

Cited by: 3
Authors
Fong, ACM [1]
Hui, SC
Vu, HL
Affiliations
[1] Massey Univ, Inst Informat & Math Sci, Auckland, New Zealand
[2] Nanyang Technol Univ, Sch Comp Engn, Singapore 2263, Singapore
Keywords
internet; research; electronic publishing; content analysis
DOI
10.1108/14684520210418347
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Research organisations and individual researchers increasingly choose to share their research findings by providing lists of their published works on the World Wide Web. To facilitate the exchange of ideas, the lists often include links to published papers in portable document format (PDF) or PostScript (PS) format. Generally, these publication Web sites are updated regularly to include new works. While manual monitoring of relevant Web sites is tedious, commercial search engines and information monitoring systems are ineffective in finding and tracking scholarly publications. Analyses the characteristics of publication index pages and describes effective automatic extraction techniques that the authors have developed. The authors' techniques combine lexical and syntactic analyses with heuristics. The proposed techniques have been implemented and tested on more than 14,000 Web pages and achieved consistently high success rates of around 90 percent.
Pages: 4-18
Page count: 15