How effective is stemming and decompounding for German text retrieval?

Cited by: 53
Authors
Braschler, M
Ripplinger, B
Affiliations
[1] Eurospider Informat Technol AG, CH-8006 Zurich, Switzerland
[2] Univ Neuchatel, Inst Interfac Informat, CH-2001 Neuchatel, Switzerland
Source
INFORMATION RETRIEVAL | 2004, Vol. 7, Issue 3-4
Keywords
stemming; decompounding; German; evaluation; morphological analysis
DOI
10.1023/B:INRT.0000011208.60754.a1
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Information retrieval systems operating on free text face difficulties when word forms used in the query and documents do not match. The usual solution is the use of a "stemming component" that reduces related word forms to a common stem. Extensive studies of such components exist for English, but considerably less is known for other languages. Previously, it has been claimed that stemming is essential for highly declensional languages. We report on our experiments on stemming for German, where an additional issue is the handling of compounds, which are formed by concatenating several words. The major contribution of our work that goes beyond its focus on German lies in the investigation of a complete spectrum of approaches, ranging from language-independent to elaborate linguistic methods. The main findings are that stemming is beneficial even when using a simple approach, and that carefully designed decompounding, the splitting of compound words, remarkably boosts performance. All findings are based on a thorough analysis using a large reliable test collection.
Pages: 291-316
Page count: 26