Documents similarity measurement using field association terms

Cited by: 50
Authors
Atlam, ES [1]
Fuketa, M [1]
Morita, K [1]
Aoe, J [1]
Affiliations
[1] University of Tokushima, Department of Information Science and Intelligent Systems, Tokushima 770-8506, Japan
Keywords
information retrieval; FA terms; FA-Sim; recall; precision
DOI
10.1016/S0306-4573(03)00019-0
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Conventional approaches to text analysis and information retrieval, which measure document similarity by considering all of the information in texts, are relatively inefficient for processing large text collections in heterogeneous subject areas. This paper outlines a new text manipulation system, FA-Sim, that is useful for retrieving information in large heterogeneous texts and for recognizing content similarity in text excerpts. FA-Sim is based on flexible text matching procedures carried out in various contexts and at various field ranks. FA-Sim measures text similarity by using specific field association (FA) terms instead of comparing all text information. Measuring similarity between texts is faster, and the resulting similarity is higher, with FA-Sim than with two other analysis methods. Recall and Precision improved significantly, by 39% and 37% respectively, over these two traditional methods. (C) 2003 Elsevier Ltd. All rights reserved.
Pages: 809-824
Page count: 16