Corpus-based stemming using cooccurrence of word variants

Cited by: 131
Authors
Xu, JX [1]
Croft, WB [1]
Affiliations
[1] Univ Massachusetts, Dept Comp Sci, Amherst, MA 01003 USA
Keywords
algorithms; experimentation; performance; class refinement; cooccurrence; corpus analysis; information retrieval; n-gram; stemming
DOI
10.1145/267954.267957
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Stemming is used in many information retrieval (IR) systems to reduce variant word forms to common roots. It is one of the simplest applications of natural-language processing to IR and one of the most effective in terms of user acceptance and consistency, though it yields only small retrieval improvements. Current stemming techniques do not, however, reflect the language use in specific corpora, and this can lead to occasional serious retrieval failures. We propose a technique for using corpus-based word variant cooccurrence statistics to modify or create a stemmer. Experimental results on English newspaper and legal text and on Spanish text demonstrate the viability of this technique and its advantages relative to conventional approaches that employ only morphological rules.
Pages: 61-81
Page count: 21