Character N-gram tokenization for European language text retrieval

Cited by: 154
Authors
McNamee, P [1]
Mayfield, J [1]
Affiliation
[1] Johns Hopkins Univ, Appl Phys Lab, Laurel, MD 20723 USA
Source
INFORMATION RETRIEVAL | 2004 / Vol. 7 / Issue 1-2
Keywords
cross-language information retrieval; language-neutral retrieval; character n-grams; Cross Language Evaluation Forum; European languages;
DOI
10.1023/B:INRT.0000009441.78971.be
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
The Cross-Language Evaluation Forum has encouraged research in text retrieval methods for numerous European languages and has developed durable test suites that allow language-specific techniques to be investigated and compared. The labor associated with crafting a retrieval system that takes advantage of sophisticated linguistic methods is daunting. We examine whether language-neutral methods can achieve accuracy comparable to language-specific methods with less concomitant software complexity. Using the CLEF 2002 test set we demonstrate empirically how overlapping character n-gram tokenization can provide retrieval accuracy that rivals the best current language-specific approaches for European languages. We show that n=4 is a good choice for those languages, and document the increased storage and time requirements of the technique. We report on the benefits of and challenges posed by n-grams, and explain peculiarities attendant to bilingual retrieval. Our findings demonstrate clearly that accuracy using n-gram indexing rivals or exceeds accuracy using unnormalized words, for both monolingual and bilingual retrieval.
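As an illustration of the overlapping character n-gram tokenization the abstract describes, the following is a minimal sketch with n=4 as the default. The function name `char_ngrams`, the lowercasing, the whitespace collapsing, and the single-space padding at both ends are assumptions made for this example and are not taken from the paper.

```python
from typing import List


def char_ngrams(text: str, n: int = 4) -> List[str]:
    """Return the overlapping character n-grams of `text`.

    Hypothetical helper for illustration only; the normalization
    choices below are assumptions, not the authors' exact procedure.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Assumption: lowercase, collapse runs of whitespace to one space,
    # and pad with one space on each side so word edges are marked.
    normalized = " " + " ".join(text.lower().split()) + " "
    # Slide a window of width n one character at a time. Windows may
    # span word boundaries. Input shorter than n yields an empty list.
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


if __name__ == "__main__":
    for gram in char_ngrams("Text retrieval"):
        print(repr(gram))
```

A padded string of length m yields m - n + 1 n-grams, roughly one indexing term per character instead of one per word, which is consistent with the increased storage and time requirements the abstract mentions.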
Pages: 73-97
Page count: 25