Committee-based sample selection for probabilistic classifiers

Cited by: 58
Authors
Argamon-Engelson, S
Dagan, I
Affiliations
[1] Jerusalem Coll Technol, Dept Comp Sci, IL-91160 Jerusalem, Israel
[2] Bar Ilan Univ, Dept Math & Comp Sci, IL-52900 Ramat Gan, Israel
Keywords
DOI
10.1613/jair.612
CLC classification
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In many real-world learning tasks it is expensive to acquire a sufficient number of labeled examples for training. This paper investigates methods for reducing annotation cost by sample selection. In this approach, during training the learning program examines many unlabeled examples and selects for labeling only those that are most informative at each stage. This avoids redundantly labeling examples that contribute little new information. Our work builds on previous research on Query By Committee, and extends the committee-based paradigm to the context of probabilistic classification. We describe a family of empirical methods for committee-based sample selection in probabilistic classification models, which evaluate the informativeness of an example by measuring the degree of disagreement between several model variants. These variants (the committee) are drawn randomly from a probability distribution conditioned on the training set labeled so far. The method was applied to the real-world natural language processing task of stochastic part-of-speech tagging. We find that all variants of the method achieve a significant reduction in annotation cost, although their computational efficiency differs. In particular, the simplest variant, a two-member committee with no parameters to tune, gives excellent results. We also show that sample selection yields a significant reduction in the size of the model used by the tagger.
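The selection loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' exact method: the helper names (`sample_committee`, `vote_entropy`, `select_for_labeling`) are hypothetical, and sampling class probabilities from a Dirichlet posterior over observed counts stands in for the paper's sampling of HMM tagger parameters.

```python
import math
import random

def sample_committee(counts, k=2):
    """Draw k multinomial model variants from a Dirichlet posterior
    over the observed class counts (hypothetical helper; the paper
    samples part-of-speech tagger parameters analogously)."""
    models = []
    for _ in range(k):
        # Normalized Gamma draws give a sample from a Dirichlet
        # distribution with parameters counts + 1 (uniform prior).
        draws = [random.gammavariate(c + 1.0, 1.0) for c in counts]
        total = sum(draws)
        models.append([d / total for d in draws])
    return models

def vote_entropy(votes, k):
    """Disagreement measure: entropy of the committee's vote
    distribution over an example's possible labels."""
    ent = 0.0
    for v in set(votes):
        p = votes.count(v) / k
        ent -= p * math.log(p, 2)
    return ent

def select_for_labeling(votes, k, num_classes):
    """Sketch of probabilistic selection: ask for a label with
    probability proportional to the committee's disagreement."""
    max_ent = math.log(num_classes, 2)
    return random.random() < vote_entropy(votes, k) / max_ent
```

With a two-member committee (the paper's simplest variant), `vote_entropy` is 0 when both variants agree on an example's label and maximal when they disagree, so only contested examples tend to be sent for annotation.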
Pages: 335 - 360
Page count: 26