THE AUTOMATIC IDENTIFICATION OF STOP WORDS

Cited by: 148
Authors
WILBUR, WJ
SIROTKIN, K
Affiliation
[1] National Center for Biotechnology Information, Bethesda, MD
DOI
10.1177/016555159201800106
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
A stop word may be identified as a word that has the same likelihood of occurring in those documents not relevant to a query as in those documents relevant to the query. In this paper we show how the concept of relevance may be replaced by the condition of being highly rated by a similarity measure. Thus it becomes possible to identify the stop words in a collection by automated statistical testing. We describe the nature of the statistical test as it is realized with a vector retrieval methodology based on the cosine coefficient of document-document similarity. As an example, this technique is then applied to a large MEDLINE® subset in the area of biotechnology. The initial processing of this database involves a 310-word stop list of common non-content terms. Our technique is then applied, and 75% of the remaining terms are identified as stop words. We compare retrieval with and without the removal of these stop words and find that, of the top twenty documents retrieved in response to a random query document, seventeen are the same on average for the two methods. We also examine the differences and conclude that where the user prefers one method over the other, the new method with the reduced term set is favored about three times out of four.
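The idea in the abstract can be illustrated with a minimal sketch. It is not the statistical test used in the paper, which the abstract does not specify; it only shows the underlying comparison. Each document serves in turn as a query, the other documents are ranked by the cosine coefficient, and the term's occurrence rate among the highly rated documents is compared with its rate among the rest. The function names, the `top_k` cutoff, the simple rate difference, and the choice to exclude the tested term from the similarity computation are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine coefficient between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def occurrence_gap(docs, term, top_k=1):
    """Hypothetical score: average difference between the term's occurrence
    rate in the top_k most similar documents and in the remaining ones.
    A value near zero suggests stop-word-like behaviour."""
    vectors = [Counter(d.split()) for d in docs]
    # Assumption: the tested term is left out of the similarity computation
    # so that it cannot influence its own ranking.
    reduced = [Counter({t: c for t, c in v.items() if t != term})
               for v in vectors]
    gaps = []
    for q, query_vec in enumerate(vectors):
        if term not in query_vec:
            continue  # only documents containing the term act as queries
        others = [i for i in range(len(docs)) if i != q]
        others.sort(key=lambda i: cosine(reduced[q], reduced[i]), reverse=True)
        top, rest = others[:top_k], others[top_k:]
        if not top or not rest:
            continue
        rate_top = sum(term in vectors[i] for i in top) / len(top)
        rate_rest = sum(term in vectors[i] for i in rest) / len(rest)
        gaps.append(rate_top - rate_rest)
    return sum(gaps) / len(gaps) if gaps else 0.0


# Toy collection (invented for illustration, not taken from MEDLINE).
docs = [
    "the gene encodes the protein",
    "the protein binds the gene",
    "the river floods the valley",
    "the valley holds the river",
]
gap_the = occurrence_gap(docs, "the")
gap_gene = occurrence_gap(docs, "gene")
```

In this toy collection "the" occurs equally often in similar and dissimilar documents, so its gap is zero, while "gene" is concentrated in the documents most similar to those that contain it. A real application would replace the raw gap with a significance test over a large collection.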
Pages: 45-55
Page count: 11