Incorporating word embeddings into topic modeling of short text

Cited by: 44
Authors
Gao, Wang [1 ]
Peng, Min [1 ]
Wang, Hua [2 ]
Zhang, Yanchun [2 ]
Xie, Qianqian [1 ]
Tian, Gang [1 ]
Affiliations
[1] Wuhan Univ, Sch Comp Sci, Wuhan, Hubei, Peoples R China
[2] Victoria Univ, Ctr Appl Informat, Melbourne, Vic, Australia
Funding
US National Science Foundation;
Keywords
Short text; Topic model; Word embeddings; Conditional Random Fields;
DOI
10.1007/s10115-018-1314-7
CLC number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Short texts have become the prevalent format of information on the Internet. Inferring the topics of such messages is a critical and challenging task for many applications. Because short texts contain so few words, conventional topic models (e.g., latent Dirichlet allocation and its variants) suffer from severe data sparsity, which makes topic modeling of short texts difficult and unreliable. Recently, word embeddings have proved effective at capturing semantic and syntactic information about words, and can be used to induce similarity measures and semantic correlations among words. Motivated by this, we design a novel model for short text topic modeling, referred to as the Conditional Random Field regularized Topic Model (CRFTM). CRFTM not only develops a generalized solution to alleviate the sparsity problem by aggregating short texts into pseudo-documents, but also leverages a Conditional Random Field regularized model that encourages semantically related words to share the same topic assignment. Experimental results on two real-world datasets show that our method extracts more coherent topics and significantly outperforms state-of-the-art baselines on several evaluation metrics.
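The aggregation step described in the abstract (grouping short texts into pseudo-documents by embedding similarity) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy 3-dimensional vectors, the greedy nearest-centroid merge, and the 0.8 threshold are all assumptions chosen for demonstration; real word embeddings have hundreds of dimensions, and CRFTM's actual aggregation strategy is defined in the paper itself.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy 3-d "embeddings" (illustrative only; real models use 100-300 dims).
emb = {
    "soccer":   [0.90, 0.10, 0.00],
    "football": [0.85, 0.15, 0.05],
    "stock":    [0.05, 0.90, 0.10],
    "market":   [0.10, 0.85, 0.20],
}

def doc_vec(words):
    """Represent a short text as the mean of its word embeddings."""
    vs = [emb[w] for w in words if w in emb]
    return [sum(c) / len(vs) for c in zip(*vs)]

def aggregate(docs, threshold=0.8):
    """Greedy merge: attach each short text to the first pseudo-document
    whose centroid is similar enough, otherwise start a new one."""
    pseudo = []  # each entry is [word list, centroid vector]
    for d in docs:
        v = doc_vec(d)
        for p in pseudo:
            if cosine(v, p[1]) >= threshold:
                p[0].extend(d)
                p[1][:] = doc_vec(p[0])  # update the centroid in place
                break
        else:
            pseudo.append([list(d), v])
    return [p[0] for p in pseudo]

docs = [["soccer"], ["football"], ["stock"], ["market"]]
print(aggregate(docs))  # → [['soccer', 'football'], ['stock', 'market']]
```

The four one-word texts collapse into two pseudo-documents, one per semantic cluster, which is the kind of length amplification that lets a standard topic model escape the sparsity problem.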
Pages: 1123-1145
Page count: 23
Related papers
41 records in total
  • [1] Term weighting scheme for short-text classification: Twitter corpuses
    Alsmadi, Issa
    Hoon, Gan Keng
    [J]. NEURAL COMPUTING & APPLICATIONS, 2019, 31 (08) : 3819 - 3831
  • [2] [Anonymous], 2005, TECHNICAL REPORT
  • [3] [Anonymous], 2016, P 2016 IEEE INT C MU, DOI 10.1109/ICME.2016.7552883
  • [4] Bansal M, 2014, PROCEEDINGS OF THE 52ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 2, P809
  • [5] Latent Dirichlet allocation
    Blei, DM
    Ng, AY
    Jordan, MI
    [J]. JOURNAL OF MACHINE LEARNING RESEARCH, 2003, 3 (4-5) : 993 - 1022
  • [6] Chang J., 2009, Adv. Neural Inf. Process. Syst., P288
  • [7] BTM: Topic Modeling over Short Texts
    Cheng, Xueqi
    Yan, Xiaohui
    Lan, Yanyan
    Guo, Jiafeng
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2014, 26 (12) : 2928 - 2941
  • [8] Das R, 2015, PROCEEDINGS OF THE 53RD ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 7TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 1, P795
  • [9] Hofmann T, 1999, UNCERTAINTY IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, P289
  • [10] Hong L., 2010, P 1 WORKSHOP SOCIAL, P80, DOI 10.1145/1964858.1964870