Repeated labeling using multiple noisy labelers

Cited by: 114
Authors
Ipeirotis, Panagiotis G. [1]
Provost, Foster [1]
Sheng, Victor S. [2]
Wang, Jing [1]
Affiliations
[1] NYU, Leonard N Stern Sch Business, Dept Informat Operat & Management Sci, New York, NY 10012 USA
[2] Univ Cent Arkansas, Dept Comp Sci, Conway, AR USA
Funding
Natural Sciences and Engineering Research Council of Canada; National Science Foundation (USA)
Keywords
Active learning; Data selection; Data preprocessing; Classification; Human computation; Repeated labeling; Selective labeling; Data acquisition
DOI
10.1007/s10618-013-0306-1
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction of predictive models. With the outsourcing of small tasks becoming easier, for example via Amazon's Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated-labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a set of robust techniques that combine different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.
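Result (i) in the abstract, that repeated labeling can improve label quality "but not always", can be illustrated with a small calculation. The sketch below is not taken from the paper: it assumes binary labels, labelers who are each correct independently with the same probability `p`, and integration of the repeated labels by simple majority vote. The function name `majority_vote_quality` is hypothetical. Under these assumptions, the quality of the integrated label is the upper tail of a binomial distribution.

```python
from math import comb


def majority_vote_quality(p: float, n: int) -> float:
    """Probability that a majority vote over n labels is correct.

    Assumptions (illustrative, not from the source): binary labels, each of
    the n labels is correct independently with probability p, and n is odd
    so that ties cannot occur.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    needed = n // 2 + 1  # smallest number of correct labels that wins the vote
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(needed, n + 1)
    )


if __name__ == "__main__":
    # Compare integrated label quality for several labeler accuracies.
    for p in (0.4, 0.5, 0.6, 0.8):
        qualities = [round(majority_vote_quality(p, n), 3) for n in (1, 3, 5, 11)]
        print(f"p={p}: {qualities}")
```

In this simplified model, additional labels raise the integrated quality only when `p` is above 0.5; at `p = 0.5` they have no effect, and below 0.5 they make the integrated label worse, which is one way to read the "but not always" caveat.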
Pages: 402-441
Number of pages: 40