A CORRELATED TOPIC MODEL OF SCIENCE

Cited by: 950
Authors
Blei, David M. [1]
Lafferty, John D. [2]
Affiliations
[1] Princeton Univ, Dept Comp Sci, Princeton, NJ 08540 USA
[2] Carnegie Mellon Univ, Dept Comp Sci, Machine Learning Dept, Pittsburgh, PA 15213 USA
Keywords
Hierarchical models; approximate posterior inference; variational methods; text analysis;
DOI
10.1214/07-AOAS114
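Chinese Library Classification number
O21 [Probability theory and mathematical statistics]; C8 [Statistics];
Discipline classification code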
020208 ; 070103 ; 0714 ;
Abstract
Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than X-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [J. Roy. Statist. Soc. Ser. B 44 (1982) 139-177]. We derive a fast variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. We apply the CTM to the articles from Science published from 1990-1999, a data set that comprises 57M words. The CTM gives a better fit of the data than LDA, and we demonstrate its use as an exploratory tool of large document collections.
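The contrast the abstract draws between Dirichlet and logistic normal topic proportions can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: it assumes a softmax map from a Gaussian draw to the simplex, and the three topics, the mean `mu`, the covariance `sigma`, and the helper name `sample_logistic_normal` are hypothetical choices made only for illustration.

```python
import numpy as np

def sample_logistic_normal(mu, sigma, n, rng):
    """Draw n topic-proportion vectors: eta ~ N(mu, sigma), theta = softmax(eta)."""
    eta = rng.multivariate_normal(mu, sigma, size=n)
    eta = eta - eta.max(axis=1, keepdims=True)  # shift for numerical stability
    weights = np.exp(eta)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_docs = 20000

# Hypothetical 3-topic setup: topics 0 and 1 (say, "genetics" and "disease")
# are given a strong positive covariance; topic 2 ("astronomy") is independent.
mu = np.zeros(3)
sigma = np.array([[1.0, 0.9, 0.0],
                  [0.9, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])

theta_ctm = sample_logistic_normal(mu, sigma, n_docs, rng)  # logistic normal draws
theta_lda = rng.dirichlet(np.ones(3), size=n_docs)          # symmetric Dirichlet draws

# Empirical correlation between the proportions of topics 0 and 1.
corr_ctm = np.corrcoef(theta_ctm[:, 0], theta_ctm[:, 1])[0, 1]
corr_lda = np.corrcoef(theta_lda[:, 0], theta_lda[:, 1])[0, 1]

print(f"logistic normal corr(topic 0, topic 1): {corr_ctm:.3f}")
print(f"Dirichlet       corr(topic 0, topic 1): {corr_lda:.3f}")
```

Under a Dirichlet, any two components are negatively correlated, so the second figure printed should be negative, whereas the covariance matrix of the logistic normal lets two topics rise and fall together. The sketch covers only the prior over topic proportions; it does not touch the variational inference algorithm mentioned in the abstract.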
Pages: 17-35
Page count: 19