Discovering Hot Topic Phrases in Chinese Twitter Microblogs Based on the LDA Model

Cited by: 6
Authors
Sun Shijie (孙世杰)
Pu Jianzhong (濮建忠)
Affiliation
[1] PLA University of Foreign Languages (解放军外国语学院)
Keywords
hot topic identification; hotspot mining; feature selection; microblog
DOI
10.16594/j.cnki.41-1302/g4.2012.11.024
Chinese Library Classification (CLC) number
TP391.1 [Text information processing]
Discipline classification codes
081203; 0835
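Abstract
This paper proposes a method for mining hot topic phrases from Chinese microblogs on Twitter by combining the latent Dirichlet allocation (LDA) model with natural language processing techniques. Using a sample of 20,923 Chinese tweets, the method extracted topic phrases for the relevant hot topics. The results largely matched expectations, indicating that the model is effective both at identifying hot topics and at describing them.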
Pages: 60-64, 81
Page count: 6