Short text classification based on strong feature thesaurus

Cited by: 31
Authors
Wang, Bing-kun [1 ,2 ]
Huang, Yong-feng [1 ,2 ]
Yang, Wan-xia [1 ,2 ]
Li, Xing [1 ,2 ]
Affiliations
[1] Tsinghua Univ, Dept Elect & Engn, Informat Cognit & Intelligent Syst Res Inst, Beijing 100084, Peoples R China
[2] Tsinghua Univ, Informat Technol Natl Lab, Beijing 100084, Peoples R China
Source
JOURNAL OF ZHEJIANG UNIVERSITY-SCIENCE C-COMPUTERS & ELECTRONICS | 2012 / Vol. 13 / No. 9
Keywords
Short text; Classification; Data sparseness; Semantic; Strong feature thesaurus (SFT); Latent Dirichlet allocation (LDA);
DOI
10.1631/jzus.C1100373
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Data sparseness, the evident characteristic of short text, has always been regarded as the main cause of the low accuracy in the classification of short texts using statistical methods. Intensive research has been conducted in this area during the past decade. However, most researchers failed to notice that ignoring the semantic importance of certain feature terms might also contribute to low classification accuracy. In this paper we present a new method to tackle the problem by building a strong feature thesaurus (SFT) based on latent Dirichlet allocation (LDA) and information gain (IG) models. By giving larger weights to feature terms in the SFT, the classification accuracy can be improved. Specifically, our method appeared to be more effective with more detailed classification. Experiments on two short text datasets demonstrate that our approach achieved improvement compared with state-of-the-art methods including support vector machine (SVM) and Naïve Bayes Multinomial.
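The abstract names information gain (IG) as one ingredient of the thesaurus and says that terms in the SFT receive larger weights, but it does not give the formulas. The sketch below is only an illustration under stated assumptions: it scores a term by the standard IG of its presence or absence against the class labels, and it multiplies the counts of thesaurus terms by a constant. The function names, the `boost` parameter, and the toy data are hypothetical, and the LDA step is omitted entirely.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG of a term's presence/absence with respect to the class labels.

    docs is a list of token sets, one per short text (assumed representation).
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = ((len(with_term) / n) * entropy(with_term)
                   + (len(without_term) / n) * entropy(without_term))
    return entropy(labels) - conditional


def weighted_term_counts(tokens, strong_terms, boost=2.0):
    """Term counts with thesaurus terms scaled by a hypothetical boost factor."""
    counts = Counter(tokens)
    return {t: c * (boost if t in strong_terms else 1.0)
            for t, c in counts.items()}


# Toy data, invented for illustration only.
docs = [{"goal", "match"}, {"match", "team"},
        {"stock", "market"}, {"market", "bank"}]
labels = ["sports", "sports", "finance", "finance"]

ig_match = information_gain(docs, labels, "match")  # separates the classes fully
ig_goal = information_gain(docs, labels, "goal")    # appears in one text only
features = weighted_term_counts(["goal", "goal", "team"], {"goal"})
```

In this toy setting a term that occurs in every text of one class and in none of the other gets the maximum gain, so it would be a natural candidate for a thesaurus of strong features. How the paper combines IG with LDA and which weights it assigns are not stated in the abstract.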
Pages: 649-659
Number of pages: 11