Information retrieval on Turkish texts

Cited by: 58
Authors
Can, Fazli [1 ]
Kocberber, Seyit [1 ]
Balcik, Erman [1 ]
Kaynak, Cihan [1 ]
Ocalan, H. Cagdas [1 ]
Vursavas, Onur M. [1 ]
Affiliation
[1] Bilkent Univ, Dept Comp Engn, Bilkent Informat Retrieval Grp, TR-06800 Ankara, Turkey
Source
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY | 2008 / Vol. 59 / No. 3
DOI
10.1002/asi.20750
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
In this study, we investigate information retrieval (IR) on Turkish texts using a large-scale test collection that contains 408,305 documents and 72 ad hoc queries. We examine the effects of several stemming options and query-document matching functions on retrieval performance. We show that a simple word truncation approach, a word truncation approach that uses language-dependent corpus statistics, and an elaborate lemmatizer-based stemmer provide similar retrieval effectiveness in Turkish IR. We investigate the effects of a range of search conditions on retrieval performance; these include scalability issues, query and document length effects, and the use of a stopword list in indexing.
Pages: 407-421
Page count: 15