Character N-gram tokenization for European language text retrieval

Cited by: 154
Authors
McNamee, P [1]
Mayfield, J [1]
Affiliation
[1] Johns Hopkins Univ, Appl Phys Lab, Laurel, MD 20723 USA
Source
INFORMATION RETRIEVAL | 2004 / Vol. 7 / Issue 1-2
Keywords
cross-language information retrieval; language-neutral retrieval; character n-grams; Cross Language Evaluation Forum; European languages
DOI
10.1023/B:INRT.0000009441.78971.be
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The Cross-Language Evaluation Forum has encouraged research in text retrieval methods for numerous European languages and has developed durable test suites that allow language-specific techniques to be investigated and compared. The labor associated with crafting a retrieval system that takes advantage of sophisticated linguistic methods is daunting. We examine whether language-neutral methods can achieve accuracy comparable to language-specific methods with less concomitant software complexity. Using the CLEF 2002 test set we demonstrate empirically how overlapping character n-gram tokenization can provide retrieval accuracy that rivals the best current language-specific approaches for European languages. We show that n=4 is a good choice for those languages, and document the increased storage and time requirements of the technique. We report on the benefits of and challenges posed by n-grams, and explain peculiarities attendant to bilingual retrieval. Our findings demonstrate clearly that accuracy using n-gram indexing rivals or exceeds accuracy using unnormalized words, for both monolingual and bilingual retrieval.
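The overlapping character n-gram tokenization described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `char_ngrams`, the lowercasing, the whitespace collapsing, the single-space padding, and the choice to let n-grams span word boundaries are all assumptions made for illustration. Only the default n=4 comes from the abstract.

```python
def char_ngrams(text: str, n: int = 4) -> list[str]:
    """Return the overlapping character n-grams of `text` (hypothetical helper).

    Assumptions, not taken from the paper: text is lowercased, runs of
    whitespace collapse to one space, the string is padded with one space on
    each side, and n-grams are allowed to span word boundaries.
    """
    # Normalize case and whitespace, then pad so word starts and ends are visible.
    normalized = " " + " ".join(text.lower().split()) + " "
    # Input shorter than n yields the padded string as a single token.
    if len(normalized) < n:
        return [normalized]
    # Slide a window of width n one character at a time.
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("text retrieval"))
```

Under these assumptions, a string of k characters produces k - n + 1 tokens after padding, compared with one token per word under word indexing. That is consistent with the increased storage and time requirements the abstract mentions.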
Pages: 73-97
Page count: 25