A Clustering-Based PU Active Text Classification Method

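Cited by: 24
Authors
Liu Lu (刘露) [1, 2]
Peng Tao (彭涛) [1, 2, 3]
Zuo Wanli (左万利) [1, 3]
Dai Yaokang (戴耀康) [1]
Affiliations
[1] College of Computer Science and Technology, Jilin University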
[2] Department of Computer Science University of Illinois at
Keywords
PU (positive and unlabeled) text classification; clustering; TFIPNDF (term frequency inverse positive-negative document frequency); active learning; reliable negative examples; improved Rocchio
DOI
Not available
Chinese Library Classification (CLC) number
TP391.1 [Text information processing]
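Abstract
Text classification is one of the key problems in information retrieval. Extracting more reliable negative examples and building an accurate, efficient classifier are the two central problems in PU (positive and unlabeled) text classification. Many existing extraction methods, however, yield only a small number of reliable negatives, and the classifiers built on them leave room for improvement. This paper proposes a clustering-based semi-supervised active classification method that addresses both steps. Unlike traditional negative-extraction methods, it uses clustering together with the observation that positive and negative documents should share as few feature terms as possible, removing as many positive examples as possible from the unlabeled set and thereby obtaining more reliable negatives. The classifier is built by combining SVM active learning with an improved Rocchio method, and an improved TFIDF (term frequency inverse document frequency) is used for feature extraction, which significantly improves classification accuracy. Classification results were tested on three datasets (RCV1, Reuters-21578, and 20 Newsgroups). The experiments show that clustering-based extraction obtains more reliable negatives while keeping the error rate low, and that introducing active learning also significantly improves classification accuracy.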
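The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea of cluster-based reliable-negative extraction, not the authors' method. It assumes whitespace tokenization, raw term-frequency vectors, a toy k-means seeded with the first k documents, and a hypothetical cosine threshold of 0.2; the function names (`reliable_negatives`, `kmeans`, and so on) are illustrative. Unlabeled documents are clustered, clusters whose centroid resembles the positive centroid are discarded as likely positives, and the remaining documents are kept as reliable negatives.

```python
from collections import Counter
import math

def tf_vector(text):
    """Term-frequency vector for one whitespace-tokenized document."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: c / len(vectors) for t, c in total.items()})

def kmeans(vectors, k, iterations=10):
    """Toy cosine k-means; seeding with the first k vectors is an assumption."""
    k = min(k, len(vectors))
    centers = [Counter(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each document to its most similar center.
        assignment = [max(range(k), key=lambda j: cosine(v, centers[j]))
                      for v in vectors]
        # Recompute each center from its current members.
        for j in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:
                centers[j] = centroid(members)
    return assignment, centers

def reliable_negatives(positive_docs, unlabeled_docs, k=2, threshold=0.2):
    """Keep unlabeled documents from clusters that look unlike the positives.

    The threshold is a hypothetical value chosen for illustration only.
    """
    pos_center = centroid([tf_vector(d) for d in positive_docs])
    vectors = [tf_vector(d) for d in unlabeled_docs]
    assignment, centers = kmeans(vectors, k)
    keep = {j for j in range(len(centers))
            if cosine(centers[j], pos_center) < threshold}
    return [d for d, a in zip(unlabeled_docs, assignment) if a in keep]

positive = ["stock market shares rise", "market shares fall stock"]
unlabeled = ["stock market rally shares", "football match goal team",
             "team wins football cup", "shares drop market"]
negatives = reliable_negatives(positive, unlabeled)
```

In this toy run the two finance documents in the unlabeled set fall into a cluster that overlaps heavily with the positive centroid and are removed, leaving the two football documents as candidate negatives. A real pipeline would use weighted features and a tuned threshold before training the classifier.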
Pages: 2571-2583
Page count: 13