A framework of irregularity enlightenment for data pre-processing in data mining

Cited by: 16
Authors
Au, Siu-Tong [2]
Duan, Rong [2]
Hesar, Siamak G. [1]
Jiang, Wei [1]
Affiliations
[1] Stevens Inst Technol, Hoboken, NJ 07030 USA
[2] AT&T Labs Res, Florham Pk, NJ USA
Funding
National Science Foundation (USA)
Keywords
Activity monitoring; Change point; Feature selection; LASSO; Outliers; Regression models; Statistical process control; Outlier detection; Control charts; Time series; Models
DOI
10.1007/s10479-008-0494-z
Chinese Library Classification (CLC)
C93 [Management]; O22 [Operations Research]
Discipline classification codes
070105; 12; 1201; 1202; 120202
Abstract
Irregularities are widespread in large databases and often lead to erroneous conclusions in data mining and statistical analysis. For example, many parameter estimation procedures produce considerable bias when significant irregularities are not handled properly. Most data cleaning tools assume a single, known type of irregularity. This paper proposes a generic Irregularity Enlightenment (IE) framework for the situation in which multiple irregularities are hidden in large volumes of data in general, and in cross-sectional time series in particular. It develops an automatic data mining platform that captures key irregularities and classifies them by their importance in a database. By decomposing time series data into basic components, we propose to optimize a penalized least-squares loss function to aid the selection of key irregularities in consecutive steps, and to cluster time series into different groups until an acceptable level of variation reduction is achieved. Finally, visualization tools are developed to help analysts interpret and understand the nature of the data better and faster before further data modeling and analysis.
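The penalized least-squares selection step mentioned in the abstract can be illustrated with a small sketch. This is not the paper's IE procedure: it is a minimal LASSO example under stated assumptions, namely that the baseline has already been removed from the series, that the candidate irregularities are only additive outliers (spikes) and level shifts, and that a single penalty weight `lam` is used. The function names `build_dictionary` and `lasso_cd`, the toy series, and the value of `lam` are all hypothetical.

```python
import numpy as np


def build_dictionary(n):
    """Candidate irregularities for a length-n series (assumed layout).

    Columns 0..n-1  : additive outlier (spike) at time t.
    Columns n..2n-3 : level shift starting at time k = 1..n-2, scaled to
                      unit norm so one penalty weight treats spikes and
                      shifts comparably.
    """
    spikes = np.eye(n)
    steps = np.tril(np.ones((n, n)))[:, 1:n - 1]   # column k is 1 for t >= k
    steps = steps / np.linalg.norm(steps, axis=0)
    return np.hstack([spikes, steps])


def lasso_cd(X, y, lam, max_sweeps=1000, tol=1e-10):
    """Minimize 0.5*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    b = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    r = y.astype(float).copy()                     # residual y - X b
    for _ in range(max_sweeps):
        max_change = 0.0
        for j in range(X.shape[1]):
            old = b[j]
            # Correlation of column j with the residual that excludes column j.
            rho = X[:, j] @ r + col_sq[j] * old
            # Soft-thresholding sets weak candidates exactly to zero.
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            if new != old:
                r -= X[:, j] * (new - old)
                b[j] = new
                max_change = max(max_change, abs(new - old))
        if max_change < tol:
            break
    return b


# Noise-free toy series: one outlier at t = 8 and a level shift from t = 25.
n = 40
y = np.zeros(n)
y[8] += 8.0
y[25:] += 5.0

X = build_dictionary(n)
b = lasso_cd(X, y, lam=1.0)

outlier_at = int(np.argmax(np.abs(b[:n])))         # strongest spike candidate
shift_at = int(np.argmax(np.abs(b[n:]))) + 1       # strongest shift candidate
print("strongest outlier at t =", outlier_at)
print("strongest level shift at t =", shift_at)
```

Ranking the nonzero coefficients by magnitude gives one simple notion of "key" irregularities; a real series would also need noise handling and a data-driven choice of `lam`.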
Pages: 47-66
Page count: 20