Learning Author-Topic Models from Text Corpora

Cited by: 209
Authors
Rosen-Zvi, Michal [1]
Chemudugunta, Chaitanya [2]
Griffiths, Thomas [3]
Smyth, Padhraic [2]
Steyvers, Mark [2]
Affiliations
[1] IBM Res Lab, Haifa, Israel
[2] Univ Calif Irvine, Irvine, CA USA
[3] Univ Calif Berkeley, Berkeley, CA 94720 USA
Keywords
Algorithms; Topic models; Gibbs sampling; unsupervised learning; author models; perplexity; NETWORKS;
DOI
10.1145/1658377.1658381
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
We propose an unsupervised learning technique for extracting information about authors and topics from large text collections. We model documents as if they were generated by a two-stage stochastic process. An author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words. The probability distribution over topics in a multi-author paper is a mixture of the distributions associated with the authors. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to three large text corpora: 150,000 abstracts from the CiteSeer digital library, 1,740 papers from the Neural Information Processing Systems (NIPS) Conferences, and 121,000 emails from the Enron Corporation. We discuss in detail the interpretation of the results discovered by the system, including specific topic and author models, ranking of authors by topic and topics by author, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors. Experiments based on perplexity scores for test documents and precision-recall for document retrieval are used to illustrate systematic differences between the proposed author-topic model and a number of alternatives. Extensions to the model, allowing, for example, generalizations of the notion of an author, are also briefly discussed.
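
The two-stage process described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that each word's author is drawn uniformly from the document's author list and that the author-topic and topic-word distributions are sampled from symmetric Dirichlet priors for the demonstration. The function names, array shapes, and hyperparameter values are all hypothetical.

```python
import numpy as np


def generate_document(author_ids, theta, phi, n_words, rng):
    """Sample one document from an author-topic style generative process.

    theta: (n_authors, n_topics) array, each row an author's topic distribution.
    phi:   (n_topics, vocab_size) array, each row a topic's word distribution.
    Assumption: the author responsible for each word is chosen uniformly
    from author_ids.
    """
    words, authors, topics = [], [], []
    for _ in range(n_words):
        x = int(rng.choice(author_ids))                  # pick one of the document's authors
        z = int(rng.choice(theta.shape[1], p=theta[x]))  # stage 1: topic from that author
        w = int(rng.choice(phi.shape[1], p=phi[z]))      # stage 2: word from that topic
        authors.append(x)
        topics.append(z)
        words.append(w)
    return words, authors, topics


def document_topic_mixture(author_ids, theta):
    """Topic distribution of a multi-author document: an equal-weight
    mixture of its authors' topic distributions (uniform weights assumed)."""
    return theta[author_ids].mean(axis=0)


# Hypothetical toy sizes: 4 authors, 3 topics, 10 vocabulary items.
rng = np.random.default_rng(0)
n_authors, n_topics, vocab_size = 4, 3, 10
theta = rng.dirichlet(np.full(n_topics, 0.5), size=n_authors)
phi = rng.dirichlet(np.full(vocab_size, 0.5), size=n_topics)

doc_authors = [0, 2]
words, authors, topics = generate_document(doc_authors, theta, phi, 20, rng)
mixture = document_topic_mixture(doc_authors, theta)
```

In the sketch, `mixture` is the per-document topic distribution implied by the co-authors, while `authors` and `topics` record the latent assignments that a Markov chain Monte Carlo procedure such as Gibbs sampling would have to infer when only `words` and `doc_authors` are observed.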
Pages: 38