Parallel mining of outliers in large database

Cited by: 37
Authors
Hung, E [1 ]
Cheung, DW [1 ]
Affiliation
[1] Univ Hong Kong, Dept Comp Sci & Informat Syst, Hong Kong, Hong Kong, Peoples R China
Keywords
data mining; outlier detection; parallel algorithm;
DOI
10.1023/A:1015608814486
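Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812 [Computer science and technology];
Abstract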
Data mining is a new, important and fast-growing database application. Outlier (exception) detection is one kind of data mining, with applications in areas such as monitoring credit card fraud and criminal activity in electronic commerce. As databases grow in both size and number of attributes (dimensions), previously proposed detection methods designed for two dimensions are no longer applicable. The time complexity of the Nested-Loop (NL) algorithm (Knorr and Ng, in Proc. 24th VLDB, 1998) is linear in the dimensionality but quadratic in the dataset size, which makes its cost unacceptable for large datasets. A more efficient version (ENL) and its parallel version (PENL) are introduced. In theory, the performance improvement of PENL is linear in the number of processors, as shown by a performance comparison between ENL and PENL under the Bulk Synchronous Parallel (BSP) model. This improvement is further verified by experiments on an IBM 9076 SP2 parallel computer system. The results show that mining outliers on a low-cost cluster of workstations interconnected by a commodity communication network is a very good choice.
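To make the quadratic cost mentioned in the abstract concrete, the following is a minimal sketch of a naive pairwise scan for distance-based outliers. It is not the NL, ENL or PENL algorithm from the paper, whose details are not given in this record. The function name `distance_outliers` and the parameters `d` (neighbourhood radius) and `m` (maximum number of neighbours an outlier may have) are assumptions chosen for illustration; in the distance-based formulation commonly attributed to Knorr and Ng, `m` would be derived from the dataset size and a fraction parameter.

```python
import math


def distance_outliers(points, d, m):
    """Return indices of points with at most m other points within distance d.

    Hypothetical helper for illustration only: a naive pairwise scan,
    not the block-oriented NL algorithm or the paper's ENL/PENL.
    """
    outliers = []
    for i, a in enumerate(points):
        neighbours = 0
        for j, b in enumerate(points):
            if i == j:
                continue
            # Each distance computation costs time linear in the dimensionality.
            if math.dist(a, b) <= d:
                neighbours += 1
                if neighbours > m:
                    # Too many close neighbours: a cannot be an outlier.
                    break
        else:
            # The inner loop finished without exceeding m neighbours.
            outliers.append(i)
    return outliers


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    print(distance_outliers(data, d=2.0, m=1))
```

In the worst case every point is compared with every other point, so the number of distance computations grows with the square of the dataset size, while each computation is linear in the number of dimensions. The early `break` only helps for points in dense regions.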
Pages: 5-26
Page count: 22