Sentiment analysis in multiple languages: Feature selection for opinion classification in Web forums

Cited by: 463
Authors
Abbasi, Ahmed [1]
Chen, Hsinchun [1]
Salem, Arab [1]
Affiliations
[1] Univ Arizona, Dept Management Informat Syst, Tucson, AZ 85721 USA
Keywords
algorithms; experimentation; sentiment analysis; opinion mining; feature selection; text classification
DOI
10.1145/1361684.1361685
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The Internet is frequently used as a medium for exchange of information and opinions, as well as propaganda dissemination. In this study the use of sentiment analysis methodologies is proposed for classification of Web forum opinions in multiple languages. The utility of stylistic and syntactic features is evaluated for sentiment classification of English and Arabic content. Specific feature extraction components are integrated to account for the linguistic characteristics of Arabic. The entropy weighted genetic algorithm (EWGA) is also developed, which is a hybridized genetic algorithm that incorporates the information-gain heuristic for feature selection. EWGA is designed to improve performance and get a better assessment of key features. The proposed features and techniques are evaluated on a benchmark movie review dataset and U.S. and Middle Eastern Web forum postings. The experimental results using EWGA with SVM indicate high performance levels, with accuracies of over 91% on the benchmark dataset as well as the U.S. and Middle Eastern forums. Stylistic features significantly enhanced performance across all testbeds while EWGA also outperformed other feature selection methods, indicating the utility of these features and techniques for document-level classification of sentiments.
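The abstract names the information-gain heuristic as the component EWGA folds into its genetic algorithm. As a rough illustration of that heuristic alone, the sketch below scores one binary feature (present or absent in each document) against sentiment labels. It is not the authors' EWGA: the abstract does not say how information gain is combined with the genetic search, and the function names `entropy` and `information_gain` and the toy data are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels).values()
    return -sum((c / total) * math.log2(c / total) for c in counts)


def information_gain(feature_present, labels):
    """Reduction in label entropy from splitting on a binary feature.

    feature_present: one truthy/falsy value per document
    labels: one class label per document (same length)
    """
    with_f = [y for f, y in zip(feature_present, labels) if f]
    without_f = [y for f, y in zip(feature_present, labels) if not f]
    n = len(labels)
    # Entropy of the labels after the split, weighted by subset size.
    conditional = (len(with_f) / n) * entropy(with_f) \
        + (len(without_f) / n) * entropy(without_f)
    return entropy(labels) - conditional


# Hypothetical toy data: four documents, two per sentiment class.
labels = ["pos", "pos", "neg", "neg"]
separating = [1, 1, 0, 0]      # feature occurs only in positive documents
uninformative = [1, 0, 1, 0]   # feature occurs equally in both classes
```

On this toy data the feature that splits the classes cleanly receives the maximum score and the evenly spread one receives none, which is the ranking signal a selection method can then use to favor some features over others.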
Pages: 34