To stem or lemmatize a highly inflectional language in a probabilistic IR environment?

Cited by: 15
Authors
Kettunen, K [1]
Kunttu, T [1]
Järvelin, K [1]
Affiliations
[1] Univ Tampere, Dept Informat Studies, FIN-33101 Tampere, Finland
Keywords
information research; languages;
DOI
10.1108/00220410510607480
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Purpose - To show that stem generation compares well with lemmatization as a morphological tool for a highly inflectional language for IR purposes in a best-match retrieval system.
Design/methodology/approach - Effects of three different morphological methods - lemmatization, stemming and stem production - for Finnish are compared in a probabilistic IR environment (INQUERY). Evaluation is done using a four-point relevance scale which is partitioned differently in different test settings.
Findings - Results show that stem production, a lighter method than morphological lemmatization, compares well with lemmatization in a best-match IR environment. Differences in performance between stem production and lemmatization are small and are not statistically significant in most of the tested settings. It is also shown that stemming, a hitherto rather neglected method of morphological processing for Finnish, performs reasonably well although the stemmer used - a Porter stemmer implementation - is far from optimal for a morphologically complex language like Finnish. In another series of tests, the effects of compound splitting and derivational expansion of queries are tested.
Practical implications - The usefulness of morphological lemmatization and stem generation for IR purposes can be estimated with many factors. On the average P-R level they seem to behave very close to each other in a probabilistic IR system. Thus, the choice of method for highly inflectional languages needs to be estimated along other dimensions too.
Originality/value - Results are achieved using Finnish as an example of a highly inflectional language. The results are of interest for anyone who is interested in processing the morphological variation of a highly inflected language for IR purposes.
Pages: 476-496
Page count: 21