A study on automatic creation of a comparable document collection in cross-language information retrieval

Cited by: 2
Authors
Talvensaari, Tuomas
Laurikkala, Jorma
Jarvelin, Kalervo
Juhola, Martti [1]
Affiliations
[1] Univ Tampere, Dept Comp Sci, FIN-33101 Tampere, Finland
[2] Univ Tampere, Dept Informat Studies, FIN-33101 Tampere, Finland
Keywords
information retrieval; document management; language and literature
DOI
10.1108/00220410610666510
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Purpose - To present a method for creating a comparable document collection from two document collections in different languages.
Design/methodology/approach - The best query keys were extracted from a Finnish source collection (articles of the newspaper Aamulehti) with the relative average term frequency formula. The keys were translated into English with a dictionary-based query translation program. The resulting word lists were used as queries that were run against the target collection (Los Angeles Times articles) with the nearest neighbor method. The documents were aligned with unrestricted and date-restricted alignment schemes, which were also combined.
Findings - The combined alignment scheme proved best when the relatedness of the document pairs was assessed on a five-degree relevance scale. Of the 400 document pairs, roughly 40 percent were highly or fairly related, and 75 percent showed at least lexical similarity.
Research limitations/implications - The number of alignment pairs was small because the two collections share only a short common time period and are geographically (and thus topically) remote. In the future, our aim is to build larger comparable corpora in various languages and use them as a source of translation knowledge for cross-language information retrieval (CLIR).
Practical implications - Readily available parallel corpora are scarce. With this method, two unrelated document collections can be aligned relatively easily to create a CLIR resource.
Originality/value - The method can be applied to weakly linked collections and to morphologically complex languages, such as Finnish.
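The alignment step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the query keys have already been extracted and translated, it uses plain cosine similarity over bags of words as a stand-in for the nearest neighbor method, and the way the two schemes are combined (prefer a date-restricted match, otherwise accept an unrestricted match above a score threshold) is an assumption made for illustration. All names, the sample documents, the date window, and the threshold are hypothetical.

```python
import math
from collections import Counter
from datetime import date


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align(query, query_date, targets, window_days=None):
    """Return (doc_id, score) of the nearest target document, or None.

    With window_days=None the search is unrestricted; otherwise only
    target documents dated within window_days of query_date are considered.
    """
    q = Counter(query)
    best = None
    for doc_id, (doc_date, words) in targets.items():
        if window_days is not None:
            if abs((doc_date - query_date).days) > window_days:
                continue
        score = cosine(q, Counter(words))
        if score > 0 and (best is None or score > best[1]):
            best = (doc_id, score)
    return best


def combined(query, query_date, targets, window_days=1, threshold=0.5):
    """Hypothetical combination rule (an assumption, not from the source):
    prefer a date-restricted match; otherwise accept an unrestricted
    match only if its score reaches the threshold."""
    restricted = align(query, query_date, targets, window_days)
    if restricted is not None:
        return restricted
    unrestricted = align(query, query_date, targets)
    if unrestricted is not None and unrestricted[1] >= threshold:
        return unrestricted
    return None


# Hypothetical target collection: doc_id -> (publication date, words).
targets = {
    "lat-1": (date(1994, 5, 2), ["election", "president", "vote", "finland"]),
    "lat-2": (date(1994, 9, 10),
              ["election", "president", "vote", "campaign", "finland"]),
    "lat-3": (date(1994, 5, 3), ["hockey", "championship", "game"]),
}

# Translated query keys of two hypothetical source documents.
query_a, date_a = ["election", "president", "finland"], date(1994, 5, 1)
query_b, date_b = ["hockey", "game"], date(1994, 9, 9)

pair_a = align(query_a, date_a, targets)                    # unrestricted
pair_b_restricted = align(query_b, date_b, targets, window_days=2)
pair_b_combined = combined(query_b, date_b, targets, window_days=2)
```

In this toy data the second query has no lexically similar target document within the date window, so the date-restricted scheme returns no pair, while the combined rule falls back to the unrestricted nearest neighbor.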
Pages: 372-387
Number of pages: 16