Improving effectiveness of mutual information for substantival multiword expression extraction

Cited by: 36
Authors
Zhang, Wen [1, 3]
Yoshida, Taketoshi [1 ]
Tang, Xijin [2 ]
Ho, Tu-Bao [1 ]
Affiliations
[1] Japan Adv Inst Sci & Technol, Sch Knowledge Sci, Nomi, Ishikawa 9231292, Japan
[2] Chinese Acad Sci, Acad Math & Syst Sci, Inst Syst Sci, Beijing 100080, Peoples R China
[3] Chinese Acad Sci, Inst Software, Lab Internet Software Technol, Beijing 100190, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Substantival multiword expression; Mutual information; Enhanced mutual information; Collocation optimization; EMICO; TEXT CLASSIFICATION; WORD;
DOI
10.1016/j.eswa.2009.02.026
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
One deficiency of mutual information is its poor capacity to measure the association of words with asymmetric co-occurrence, which is common among multiword expressions in text. Moreover, threshold setting, which is decisive for the practical success of mutual information in multiword extraction, requires many parameters to be predefined manually when extracting multiword expressions with different numbers of individual words. In this paper, we propose a new method, EMICO (Enhanced Mutual Information and Collocation Optimization), to extract substantival multiword expressions from text. Specifically, enhanced mutual information is proposed to measure the association of words, and collocation optimization is proposed to automatically determine the number of individual words contained in a multiword expression when the multiword expression occurs in a candidate set. Our experiments showed that EMICO significantly improves the performance of substantival multiword expression extraction in comparison with a classic extraction method based on mutual information. (C) 2009 Elsevier Ltd. All rights reserved.
Pages: 10919-10930
Number of pages: 12
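
For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of classic pointwise mutual information over adjacent word pairs. It is not the EMICO method, whose formulas are not given in this record. The function name `bigram_pmi`, the whitespace tokenization, and the maximum-likelihood probability estimates are all assumptions made for illustration.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for every adjacent word pair.

    Hypothetical helper: probabilities are plain relative frequencies,
    with no smoothing and no frequency threshold.
    """
    n_uni = len(tokens)
    n_bi = max(n_uni - 1, 1)  # number of adjacent pairs
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi          # joint probability of the pair
        p_x = unigrams[w1] / n_uni   # marginal probability of w1
        p_y = unigrams[w2] / n_uni   # marginal probability of w2
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


tokens = "new york is a city new york is big".split()
scores = bigram_pmi(tokens)
# Rank candidate pairs from strongest to weakest association.
ranked = sorted(scores, key=scores.get, reverse=True)
```

In practice a cutoff on these scores decides which pairs are kept as candidates. A separate cutoff is typically needed for each expression length, which is the manual parameter burden the abstract describes.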