BTM: Topic Modeling over Short Texts

Cited by: 338
Authors
Cheng, Xueqi [1]
Yan, Xiaohui [1]
Lan, Yanyan [1]
Guo, Jiafeng [1]
Affiliations
[1] Chinese Academy of Sciences, Institute of Computing Technology, Beijing 100190, People's Republic of China
Funding
National Natural Science Foundation of China;
Keywords
Short text; topic model; biterm; online algorithm; content analysis;
DOI
10.1109/TKDE.2014.2313872
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Short texts are popular on today's web, especially with the emergence of social media. Inferring topics from large-scale short texts has become a critical but challenging problem for many content analysis tasks. Conventional topic models such as latent Dirichlet allocation (LDA) and probabilistic latent semantic analysis (PLSA) learn topics from document-level word co-occurrences by modeling each document as a mixture of topics, and their inference suffers from the sparsity of word co-occurrence patterns in short texts. In this paper, we propose a novel approach to short text topic modeling, referred to as the biterm topic model (BTM). BTM learns topics by directly modeling the generation of word co-occurrence patterns (i.e., biterms) in the corpus, which makes inference effective by drawing on rich corpus-level information. To cope with large-scale short text data, we further introduce two online algorithms for BTM for efficient topic learning. Experiments on real-world short text collections show that BTM can discover more prominent and coherent topics and significantly outperforms the state-of-the-art baselines. We also demonstrate the appealing performance of the two online BTM algorithms in both time efficiency and topic learning.
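To make the notion of a biterm concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a biterm is an unordered pair of words that co-occur in the same short text, and that whitespace tokenization is adequate. The function names `extract_biterms` and `corpus_biterm_counts` and the sample documents are hypothetical, introduced only for illustration.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) in one short text.

    Assumption: the whole short text is treated as a single context window.
    Each pair is sorted so that (a, b) and (b, a) count as the same biterm.
    A word repeated in the text also yields a (w, w) pair.
    """
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def corpus_biterm_counts(docs):
    """Aggregate biterm counts over the whole corpus.

    Pooling pairs across documents is what gives corpus-level
    co-occurrence statistics even when each document is very short.
    """
    counts = Counter()
    for doc in docs:
        counts.update(extract_biterms(doc.lower().split()))
    return counts


# Hypothetical sample corpus of short texts.
docs = [
    "apple releases new phone",
    "new phone from apple",
    "topic models for short text",
]
counts = corpus_biterm_counts(docs)
```

Each document here contributes only a handful of pairs, but pairs such as ("apple", "phone") accumulate across documents, which is the corpus-level signal the abstract refers to. Topic inference over these biterms is not shown.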
Pages: 2928-2941
Page count: 14