A RANDOM DECISION TREE ENSEMBLE FOR MINING CONCEPT DRIFTS FROM NOISY DATA STREAMS

Cited by: 19
Authors
Li, Peipei [1]
Wu, Xindong [1, 2]
Hu, Xuegang [1]
Liang, Qianhui [3]
Gao, Yunjun [4]
Affiliations
[1] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei 230009, Peoples R China
[2] Univ Vermont, Dept Comp Sci, Burlington, VT 05405 USA
[3] Singapore Management Univ, Sch Informat Syst, Singapore, Singapore
[4] Zhejiang Univ, Coll Comp Sci, Hangzhou, Zhejiang, Peoples R China
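Funding
National Science Foundation (USA); National Natural Science Foundation of China
Keywords
CLASSIFIERS
DOI
10.1080/08839514.2010.499500
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract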
Detecting concept drifts and reducing the impact of noise in real-world data stream applications are challenging but valuable tasks for inductive learning, especially under tight constraints on time and space overheads. Although many inductive learning algorithms based on ensemble classification models have been proposed for handling concept-drifting data streams, little attention has been paid to detecting diverse types of concept drift while also handling the influence of noise. Motivated by this, we present a new lightweight inductive algorithm for concept drift detection, based on an ensemble model of random decision trees (named CDRDT), to distinguish various types of concept drifts from noisy data streams. We use small data chunks of variable size to generate random decision trees incrementally. Meanwhile, we introduce the Hoeffding bound inequality and the principle of statistical quality control to detect the different types of concept drifts and noise. Extensive studies on synthetic and real streaming data demonstrate that CDRDT can detect concept drifts in noisy streaming data effectively and efficiently. Our algorithm therefore provides a feasible reference framework for classifying concept-drifting data streams with noise.
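The abstract names the Hoeffding bound and statistical quality control as the detection tools but does not give the rule itself. The following is a minimal sketch of how such a check could look, not the CDRDT procedure: it uses the standard Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), and compares the change in error rate between two consecutive data chunks against multiples of epsilon. The function names, the three labels, and the `drift_factor` multiplier are illustrative assumptions that do not come from the record.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Standard Hoeffding bound: with probability 1 - delta, the mean of n
    observations in a range of width value_range lies within this distance
    of the true mean."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def classify_change(prev_error: float, curr_error: float, n: int,
                    delta: float = 0.05, drift_factor: float = 3.0) -> str:
    """Hypothetical control-chart style rule (an assumption, not the paper's):
    label the rise in error rate between two chunks of n instances each."""
    eps = hoeffding_bound(1.0, delta, n)  # error rates lie in [0, 1]
    diff = curr_error - prev_error
    if diff <= eps:
        return "stable"
    if diff <= drift_factor * eps:
        return "warning"  # may be noise or the onset of a drift
    return "drift"


if __name__ == "__main__":
    for curr in (0.11, 0.20, 0.30):
        print(curr, classify_change(0.10, curr, n=1000))
```

A band between one and several multiples of epsilon mirrors the warning and action limits of a quality-control chart; where those limits should sit for a particular stream is a tuning choice that this sketch leaves open.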
Pages: 680-710
Page count: 31