Clustering time series with clipped data

Cited by: 54
Authors
Bagnall, A [1]
Janacek, G [1]
Affiliations
[1] Univ E Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England
Keywords
clustering time series; clipping
DOI
10.1007/s10994-005-5825-6
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Clustering time series is a problem that has applications in a wide variety of fields, and has recently attracted a large amount of research. Time series data are often large and may contain outliers. We show that the simple procedure of clipping the time series (discretising to above or below the median) reduces memory requirements and significantly speeds up clustering without decreasing clustering accuracy. We also demonstrate that clipping increases clustering accuracy when there are outliers in the data, thus serving as a means of outlier detection and a method of identifying model misspecification. We consider simulated data from polynomial, autoregressive moving average and hidden Markov models and show that the estimated parameters of the clipped data used in clustering tend, asymptotically, to those of the unclipped data. We also demonstrate experimentally that, if the series are long enough, the accuracy on clipped data is not significantly less than the accuracy on unclipped data, and if the series contain outliers then clipping results in significantly better clusterings. We then illustrate how using clipped series can be of practical benefit in detecting model misspecification and outliers on two real world data sets: an electricity generation bid data set and an ECG data set.
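The clipping step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `clip` and `pack_bits`, the sample data, and the choice to map values equal to the median to 0 are all assumptions made for the example.

```python
from statistics import median


def clip(series):
    """Map each value to 1 if it lies above the series median, else 0.

    Ties with the median are sent to 0 here; that tie rule is an assumption.
    """
    m = median(series)
    return [1 if x > m else 0 for x in series]


def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, eight observations per byte."""
    out = bytearray((len(bits) + 7) // 8)
    for i, b in enumerate(bits):
        if b:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


# Hypothetical series; 250.0 plays the role of an outlier.
series = [0.2, 1.5, -0.7, 0.9, 250.0, -1.1, 0.4, 0.1]
clipped = clip(series)
packed = pack_bits(clipped)

# The same series with a moderate value in place of the outlier.
clipped_no_outlier = clip([0.2, 1.5, -0.7, 0.9, 2.5, -1.1, 0.4, 0.1])
```

In this sketch the eight observations fit in a single byte once packed, and the clipped sequence is the same whether the fifth value is 250.0 or 2.5, which illustrates why the magnitude of an outlier has no influence on the clipped representation.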
Pages: 151-178
Page count: 28