On-Line LDA: Adaptive Topic Models for Mining Text Streams with Applications to Topic Detection and Tracking

Cited by: 232
Authors
AlSumait, Loulwah [1]
Barbara, Daniel [1]
Domeniconi, Carlotta [1]
Affiliation
[1] George Mason Univ, Dept Comp Sci, Fairfax, VA 22030 USA
Source
ICDM 2008: EIGHTH IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS | 2008
DOI
10.1109/ICDM.2008.140
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper presents Online Topic Model (OLDA), a topic model that automatically captures the thematic patterns of text streams, identifies emerging topics, and tracks how they change over time. Our approach allows the topic modeling framework, specifically the Latent Dirichlet Allocation (LDA) model, to work in an online fashion such that it incrementally builds an up-to-date model (mixture of topics per document and mixture of words per topic) when a new document (or a set of documents) appears. A solution based on the Empirical Bayes method is proposed. The idea is to incrementally update the current model according to the information inferred from the new stream of data, with no need to access previous data. The dynamics of the proposed approach also provide an efficient means to track the topics over time and detect emerging topics in real time. Our method is evaluated both qualitatively and quantitatively using benchmark datasets. In our experiments, OLDA discovered interesting patterns by analyzing only a fraction of the data at a time. Our tests also show that OLDA can align topics across epochs, which captures the evolution of the topics over time. OLDA is also comparable to, and sometimes better than, the original LDA in predicting the likelihood of unseen documents.
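The incremental update described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the inference procedure or the exact update rule, so this sketch assumes collapsed Gibbs sampling on each batch and an assumed rule in which the next topic-word prior is a small base prior plus a weighted copy of the previous batch's topic-word distribution. The names `gibbs_lda`, `online_lda`, `base_beta`, and `weight` are hypothetical.

```python
import random


def gibbs_lda(docs, vocab_size, num_topics, beta, alpha=0.1, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA on a single batch of documents.

    docs: list of documents, each a list of integer word ids.
    beta: K x V matrix of Dirichlet pseudo-counts (the topic-word prior).
    Returns the K x V matrix of topic-word counts for this batch.
    """
    rng = random.Random(seed)
    topics = range(num_topics)
    n_kw = [[0] * vocab_size for _ in topics]   # topic-word counts
    n_k = [0] * num_topics                      # tokens assigned to each topic
    n_dk = [[0] * num_topics for _ in docs]     # document-topic counts
    beta_sum = [sum(row) for row in beta]

    def bump(d, w, k, step):
        n_kw[k][w] += step
        n_k[k] += step
        n_dk[d][k] += step

    # Random initial topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        z.append([rng.randrange(num_topics) for _ in doc])
        for w, k in zip(doc, z[d]):
            bump(d, w, k, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                bump(d, w, z[d][i], -1)  # remove the token's current assignment
                weights = [
                    (n_dk[d][k] + alpha) * (n_kw[k][w] + beta[k][w])
                    / (n_k[k] + beta_sum[k])
                    for k in topics
                ]
                z[d][i] = rng.choices(topics, weights=weights)[0]
                bump(d, w, z[d][i], +1)
    return n_kw


def online_lda(batches, vocab_size, num_topics, base_beta=0.01, weight=1.0, **kw):
    """Fit one model per batch; only the previous model is carried forward."""
    beta = [[base_beta] * vocab_size for _ in range(num_topics)]
    models = []
    for batch in batches:
        counts = gibbs_lda(batch, vocab_size, num_topics, beta, **kw)
        # Posterior mean of the topic-word distribution for this batch.
        phi = []
        for k in range(num_topics):
            total = sum(counts[k]) + sum(beta[k])
            phi.append([(counts[k][w] + beta[k][w]) / total
                        for w in range(vocab_size)])
        models.append(phi)
        # ASSUMED update rule (not taken from the abstract): the previous
        # topic-word distribution becomes the prior for the next batch.
        beta = [[base_beta + weight * p for p in row] for row in phi]
    return models


batches = [
    [[0, 1, 0, 1], [2, 3, 2, 3]],   # first batch: two short documents
    [[0, 1, 1], [3, 2, 2]],         # second batch arrives later
]
models = online_lda(batches, vocab_size=4, num_topics=2, iters=20)
```

Each element of `models` is the topic-word matrix for one batch. Because topic `k` in each batch is seeded from topic `k` of the previous batch, the topics stay aligned across batches in this sketch, and the raw documents of earlier batches are never revisited.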
Pages: 3-12
Page count: 10