Clustering feature decision trees for semi-supervised classification from high-speed data streams

Cited by: 17
Authors
Xu, Wen-hua [2]
Qin, Zheng [1]
Chang, Yang [1]
Affiliations
[1] Tsinghua Univ, Sch Software, Beijing 100084, Peoples R China
[2] Tsinghua Univ, Dept Comp Sci & Technol, Beijing 100084, Peoples R China
Source
JOURNAL OF ZHEJIANG UNIVERSITY-SCIENCE C-COMPUTERS & ELECTRONICS | 2011 / Vol. 12 / Issue 08
Funding
National Natural Science Foundation of China;
Keywords
Clustering feature vector; Decision tree; Semi-supervised learning; Stream data classification; Very fast decision tree;
DOI
10.1631/jzus.C1000330
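Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
080201 [Mechanical manufacturing and automation];
Abstract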
Most stream data classification algorithms use supervised learning, which requires large amounts of labeled data. Such approaches are impractical because labeled data are usually hard to obtain in practice. In this paper, we build a clustering feature decision tree model, CFDT, from data streams that contain both unlabeled examples and a small number of labeled examples. CFDT applies a micro-clustering algorithm that scans the data only once to provide the statistical summaries of the data for incremental decision tree induction. Micro-clusters also serve as classifiers in the tree leaves to improve classification accuracy and reinforce the any-time property. Our experiments on synthetic and real-world datasets show that CFDT scales well to data streams while achieving high classification accuracy at high speed.
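The abstract does not spell out the structure of a micro-cluster, so the following is only a minimal sketch under assumptions: it uses a BIRCH-style clustering feature (count, linear sum, squared sum) extended with per-class label counts, and a nearest-centroid majority vote as the leaf classifier. The names `MicroCluster`, `add`, `centroid`, `majority_label`, and `nearest` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


class MicroCluster:
    """Assumed BIRCH-style summary: (n, linear sum, squared sum) plus label counts."""

    def __init__(self, dim):
        self.n = 0                          # number of examples absorbed
        self.linear_sum = [0.0] * dim       # per-attribute sum of values
        self.squared_sum = [0.0] * dim      # per-attribute sum of squared values
        self.label_counts = Counter()       # counts of labeled examples only

    def add(self, x, label=None):
        """Absorb one example in a single pass; label is None if unlabeled."""
        self.n += 1
        for i, v in enumerate(x):
            self.linear_sum[i] += v
            self.squared_sum[i] += v * v
        if label is not None:
            self.label_counts[label] += 1

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def majority_label(self):
        """Most frequent label seen so far, or None if no labeled example arrived."""
        if not self.label_counts:
            return None
        return self.label_counts.most_common(1)[0][0]


def nearest(clusters, x):
    """Return the micro-cluster whose centroid is closest to x (Euclidean)."""
    return min(clusters, key=lambda c: math.dist(c.centroid(), x))


# Usage: two micro-clusters built from a mix of labeled and unlabeled examples.
a = MicroCluster(2)
a.add([0.0, 0.0], "neg")
a.add([2.0, 0.0])                # unlabeled example still updates the summary
b = MicroCluster(2)
b.add([10.0, 10.0], "pos")
predicted = nearest([a, b], [9.0, 9.0]).majority_label()
```

Because each summary is additive, every example is touched once and then discarded, which is the property a one-pass stream learner relies on.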
Pages: 615-628
Page count: 14
Related papers
23 in total
[1]
[Anonymous], MOA MASSIVE ONLINE A
[2]
[Anonymous], P 8 INT C DAT MIN
[3]
Fast Perceptron Decision Tree Learning from Evolving Data Streams [J].
Bifet, Albert ;
Holmes, Geoff ;
Pfahringer, Bernhard ;
Frank, Eibe .
ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PT II, PROCEEDINGS, 2010, 6119 :299-310
[4]
Bifet A, 2009, KDD-09: 15TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, P139
[5]
Chapelle O., 2006, SEMISUPERVISED LEARN, P5
[6]
Domingos P., 2000, Proceedings. KDD-2000. Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, P71, DOI 10.1145/347090.347107
[7]
Gama Joao, 2003, Accurate Decision Trees for Mining High-Speed Data Streams (KDD '03), P523, DOI 10.1145/956750.956813
[8]
Gehrke J, 1999, SIGMOD RECORD, VOL 28, NO 2 - JUNE 1999, P169, DOI 10.1145/304181.304197
[9]
RainForest - A framework for fast decision tree construction of large datasets [J].
Gehrke, J ;
Ramakrishnan, R ;
Ganti, V .
DATA MINING AND KNOWLEDGE DISCOVERY, 2000, 4 (2-3) :127-162
[10]
Greenwald M., 2001, SIGMOD Record, V30, P58, DOI 10.1145/376284.375670