Creating and exploiting a comparable corpus in cross-language information retrieval

Cited by: 34
Authors
Talvensaari, Tuomas [1]
Laurikkala, Jorma
Jarvelin, Kalervo
Juhola, Martti
Keskustalo, Heikki
Affiliations
[1] Univ Tampere, Dept Comp Sci, Tampere 33014, Finland
[2] Univ Tampere, Dept Informat Studies, Tampere 33014, Finland
Keywords
algorithms; languages; cross-language information retrieval; comparable corpora; query translation
DOI
10.1145/1198296.1198300
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
We present a method for creating a comparable text corpus from two document collections in different languages. The collections can be very different in origin. In this study, we build a comparable corpus from articles by a Swedish news agency and a U.S. newspaper. The keys with the best resolution power were extracted from the documents of one collection, the source collection, by using the relative average term frequency (RATF) value. The keys were translated into the language of the other collection, the target collection, with a dictionary-based query translation program. The translated queries were run against the target collection, and an alignment pair was made if the retrieved documents matched the given date and similarity score criteria. The resulting comparable collection was used as a similarity thesaurus to translate queries along with a dictionary-based translator. The combined approaches outperformed translation schemes where dictionary-based translation or corpus translation was used alone.
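The alignment step described in the abstract (keep a retrieved target document only if it meets both a date criterion and a similarity score criterion) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Hit` record, the `align` function, and the default values for `max_days`, `min_score`, and `max_pairs` are all hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Hit:
    """One target-collection document retrieved for a translated query (hypothetical record)."""
    doc_id: str
    published: date
    score: float


def align(source_date, hits, max_days=1, min_score=0.5, max_pairs=3):
    """Return ids of retrieved documents that pass both criteria.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    accepted = [
        h for h in hits
        # Date criterion: published within max_days of the source document.
        if abs((h.published - source_date).days) <= max_days
        # Similarity criterion: retrieval score at or above the threshold.
        and h.score >= min_score
    ]
    # Prefer the highest-scoring candidates and cap the number of pairs.
    accepted.sort(key=lambda h: h.score, reverse=True)
    return [h.doc_id for h in accepted[:max_pairs]]
```

Under these assumed thresholds, a source article dated 1994-05-01 would be paired only with retrieved documents published between 1994-04-30 and 1994-05-02 whose score is at least 0.5.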
Pages: 21