Learning from positive and unlabeled examples

被引:102
作者
Denis, F [1 ]
Gilleron, R
Letouzey, F
机构
[1] Univ Aix Marseille 1, Equipe BDAA, LIF, UMR 6166 CNRS, Marseille, France
[2] Univ Lille 1, Equipe Grappa, LIFL, UPRESA 8022 CNRS, Lille, France
[3] Univ Lille 3, Lille, France
关键词
PAC learning; statistical query model; semi-supervised learning; data mining;
D O I
10.1016/j.tcs.2005.09.007
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
In many machine learning settings, labeled examples are difficult to collect while unlabeled data are abundant. Also, for some binary classification problems, positive examples which are elements of the target concept are available. Can these additional data be used to improve accuracy of supervised learning algorithms? We investigate in this paper the design of learning algorithms from positive and unlabeled data only. Many machine learning and data mining algorithms, such as decision tree induction algorithms and naive Bayes algorithms, use examples only to evaluate statistical queries (SQ-like algorithms). Kearns designed the statistical query learning model in order to describe these algorithms. Here, we design an algorithm scheme which transforms any SQ-Iike algorithm into an algorithm based on positive statistical queries (estimate for probabilities over the set of positive instances) and instance statistical queries (estimate for probabilities over the instance space). We prove that any class learnable in the statistical query learning model is learnable from positive statistical queries and instance statistical queries only if a lower bound on the weight of any target concept f can be estimated in polynomial time. Then, we design a decision tree induction algorithm POSC4.5, based on C4.5, that uses only positive and unlabeled examples and we give experimental results for this algorithm. In the case of imbalanced classes in the sense that one of the two classes (say the positive class) is heavily underrepresented compared to the other class, the learning problem remains open. This problem is challenging because it is encountered in many real-world applications. (c) 2005 Elsevier B.V. All rights reserved.
引用
收藏
页码:70 / 83
页数:14
相关论文
共 17 条
[1]  
Angluin D., 1988, Machine Learning, V2, P343, DOI 10.1007/BF00116829
[2]  
[Anonymous], 1998, UCI REPOSITORY MACHI
[3]  
[Anonymous], P 17 INT C MACH LEAR
[4]  
[Anonymous], C4 5 PROGR MACHINE L
[5]  
Blum A., 1998, Proceedings of the Eleventh Annual Conference on Computational Learning Theory, P92, DOI 10.1145/279943.279962
[6]  
Blum A., 2000, Proceedings of the Thirty Second Annual ACM Symposium on Theory of Computing, P435, DOI 10.1145/335305.335355
[7]  
BREIMAN L, 1984, CLASSIFICAL NOISE AP
[8]  
De Comité F, 1999, LECT NOTES ARTIF INT, V1720, P219
[9]  
Denis F, 1998, LECT NOTES ARTIF INT, V1501, P112
[10]   EQUIVALENCE OF MODELS FOR POLYNOMIAL LEARNABILITY [J].
HAUSSLER, D ;
KEARNS, M ;
LITTLESTONE, N ;
WARMUTH, MK .
INFORMATION AND COMPUTATION, 1991, 95 (02) :129-161