Text classification from labeled and unlabeled documents using EM

Cited by: 1547
Authors
Nigam, K [1]
McCallum, AK
Thrun, S
Mitchell, T
Affiliations
[1] Carnegie Mellon Univ, Sch Comp Sci, Pittsburgh, PA 15213 USA
[2] Just Res, Pittsburgh, PA 15213 USA
Keywords
text classification; Expectation-Maximization; integrating supervised and unsupervised learning; combining labeled and unlabeled data; Bayesian learning;
DOI
10.1023/A:1007692713085
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve classification accuracy under these conditions: (1) a weighting factor to modulate the contribution of the unlabeled data, and (2) the use of multiple mixture components per class. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 30%.
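The iterative procedure described in the abstract lends itself to a compact sketch. The following is a minimal illustration of the EM + naive Bayes loop, assuming scikit-learn's MultinomialNB as the base classifier; the toy documents, the two class names, and the unlabeled-data weight `lam` (corresponding to the paper's first extension) are illustrative assumptions, not values taken from the paper.

```python
# A minimal sketch of semi-supervised EM with a naive Bayes classifier,
# as outlined in the abstract. Corpus, classes, and `lam` are assumptions.
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data: a few labeled documents plus a pool of unlabeled documents (assumed).
labeled_docs = ["cheap loans apply now", "meeting agenda attached",
                "win a free prize", "project status report"]
labels = np.array([1, 0, 1, 0])            # 1 = spam, 0 = ham (assumed classes)
unlabeled_docs = ["free prize inside", "agenda for the status meeting",
                  "apply for cheap credit", "attached is the report"]

vec = CountVectorizer()
X_l = vec.fit_transform(labeled_docs)
X_u = vec.transform(unlabeled_docs)

lam = 0.5            # weighting factor for the unlabeled data (paper's extension 1)
n_classes = 2

# Step 1: train an initial classifier on the labeled documents only.
clf = MultinomialNB()
clf.fit(X_l, labels)

for _ in range(10):  # iterate EM until (approximately) converged
    # E-step: probabilistically label the unlabeled documents.
    post = clf.predict_proba(X_u)          # shape (n_unlabeled, n_classes)

    # M-step: retrain on labeled + soft-labeled data. Soft labels are encoded
    # by replicating each unlabeled document once per class, with a sample
    # weight equal to its posterior probability, down-weighted by lam.
    X_all = sp.vstack([X_l] + [X_u] * n_classes)
    y_all = np.concatenate(
        [labels] + [np.full(X_u.shape[0], c) for c in range(n_classes)])
    w_all = np.concatenate(
        [np.ones(X_l.shape[0])] + [lam * post[:, c] for c in range(n_classes)])
    clf = MultinomialNB()
    clf.fit(X_all, y_all, sample_weight=w_all)

print(clf.predict(vec.transform(["free loans, apply today"])))
```

Replicating unlabeled rows with posterior-probability sample weights reproduces the fractional word and class counts of the EM M-step without needing a classifier that accepts soft labels directly; the paper's second extension (multiple mixture components per class) is omitted here for brevity.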
Pages: 103-134
Page count: 32