Extraction, selection and ranking of Field Association (FA) Terms from domain-specific corpora for building a comprehensive FA terms dictionary

Cited by: 14
Authors
Dorji, Tshering Cigay [1]
Atlam, El-sayed [1]
Yata, Susumu [1]
Fuketa, Masao [1]
Morita, Kazuhiro [1]
Aoe, Jun-ichi [1]
Affiliation
[1] Univ Tokushima, Dept Informat Sci & Intelligent Syst, Fac Engn, Tokushima 7708506, Japan
Keywords
Field Association (FA) Terms; Terms weighting and selection; Document classification; Terminology extraction; Information retrieval; CLASSIFICATION; KNOWLEDGE; WEB
DOI
10.1007/s10115-010-0296-x
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification code
081104; 0812; 0835; 1405
Abstract
Field Association (FA) Terms, words or phrases that serve to identify document fields, are effective in document classification, similar file retrieval and passage retrieval. However, there is no effective method for extracting and selecting relevant FA Terms to build a comprehensive dictionary of FA Terms. This paper presents a new method to extract, select and rank FA Terms from domain-specific corpora using part-of-speech (POS) pattern rules, corpora comparison and modified tf-idf weighting. Experimental evaluation on 21 fields using 306 MB of domain-specific corpora obtained from English Wikipedia dumps selected up to 2,517 FA Terms (single and compound) per field at precision of 74-97% and recall of 65-98%. This is better than the traditional methods. The FA Terms dictionary constructed using this method achieved an average accuracy of 97.6% in identifying the fields of 10,077 test documents collected from Wikipedia, the Reuters RCV1 corpus and the 20 Newsgroups data set.
Pages: 141-161
Page count: 21