Fast mining of distance-based outliers in high-dimensional datasets

被引:120
作者
Ghoting, Amol [1 ]
Parthasarathy, Srinivasan [2 ]
Otey, Matthew Eric [3 ]
机构
[1] IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA
[2] Ohio State Univ, Columbus, OH 43210 USA
[3] Google Inc, Pittsburgh, PA USA
基金
美国国家科学基金会;
关键词
outlier detection; high-dimensional datasets; approximate k-nearest neighbors; clustering;
D O I
10.1007/s10618-008-0093-2
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Defining outliers by their distance to neighboring data points has been shown to be an effective non-parametric approach to outlier detection. In recent years, many research efforts have looked at developing fast distance-based outlier detection algorithms. Several of the existing distance-based outlier detection algorithms report log-linear time performance as a function of the number of data points on many real low-dimensional datasets. However, these algorithms are unable to deliver the same level of performance on high-dimensional datasets, since their scaling behavior is exponential in the number of dimensions. In this paper, we present RBRP, a fast algorithm for mining distance-based outliers, particularly targeted at high-dimensional datasets. RBRP scales log-linearly as a function of the number of data points and linearly as a function of the number of dimensions. Our empirical evaluation demonstrates that we outperform the state-of-the-art algorithm, often by an order of magnitude.
引用
收藏
页码:349 / 364
页数:16
相关论文
共 23 条
[1]  
ANGIULLI F, 2002, P INT C PRINC DAT MI
[2]  
[Anonymous], 1975, CLUSTERING ALGORITHM
[3]   An optimal algorithm for approximate nearest neighbor searching in fixed dimensions [J].
Arya, S ;
Mount, DM ;
Netanyahu, NS ;
Silverman, R ;
Wu, AY .
JOURNAL OF THE ACM, 1998, 45 (06) :891-923
[4]  
Barnett V., 1994, Outliers in Statistical Data, V3rd
[5]  
BAY S, 1999, UCI KDD ARCHIVE
[6]  
Bay S. D., 2003, P INT C KNOWL DISC D
[7]   MULTIDIMENSIONAL BINARY SEARCH TREES USED FOR ASSOCIATIVE SEARCHING [J].
BENTLEY, JL .
COMMUNICATIONS OF THE ACM, 1975, 18 (09) :509-517
[8]  
BERCHTOLD S, 1996, P INT C VER LARG DAT
[9]  
Bolton RJ, 2002, STAT SCI, V17, P235
[10]  
GAMBERGER D, 1999, P INT C MACH LEARN