Focused web crawling in the acquisition of comparable corpora

Cited by: 27
Authors
Talvensaari, Tuomas [1]
Pirkola, Ari [2]
Järvelin, Kalervo [2]
Juhola, Martti [1]
Laurikkala, Jorma [1]
Affiliations
[1] Univ Tampere, Dept Comp Sci, Tampere 33014, Finland
[2] Univ Tampere, Dept Informat Studies, Tampere 33014, Finland
Source
Information Retrieval | 2008 / Vol. 11 / Issue 5
Funding
Academy of Finland
Keywords
cross-language information retrieval; focused crawling; comparable corpora;
DOI
10.1007/s10791-008-9058-8
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812 ;
Abstract
Cross-Language Information Retrieval (CLIR) resources, such as dictionaries and parallel corpora, are scarce for special domains. Obtaining comparable corpora automatically for such domains could be an answer to this problem. The Web, with its vast volumes of data, offers a natural source for this. We experimented with focused crawling as a means to acquire comparable corpora in the genomics domain. The acquired corpora were used to statistically translate domain-specific words. The same words were also translated using a high-quality, but non-genomics-related parallel corpus, which fared considerably worse. We also evaluated our system with standard information retrieval (IR) experiments, combining statistical translation using the Web corpora with dictionary-based translation. The results showed improvement over pure dictionary-based translation. Therefore, mining the Web for comparable corpora seems promising.
Pages: 427-445
Page count: 19