Clustering in very large databases based on distance and density

Cited by: 29
Authors
Qian, WN [1 ]
Gong, XQ [1 ]
Zhou, AY [1 ]
Affiliations
[1] Fudan Univ, Dept Comp Sci & Engn, Lab Intelligent Informat Proc, Shanghai 200433, Peoples R China
Funding
Specialized Research Fund for the Doctoral Program of Higher Education;
Keywords
data mining; very large database; clustering;
DOI
10.1007/BF02946652
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Clustering in very large databases or data warehouses, with applications in areas such as spatial computation, web information collection, pattern recognition, and economic analysis, is a major task that challenges data mining research. Current clustering methods commonly suffer from the following problems: 1) scanning the whole database leads to high I/O cost and expensive maintenance (e.g., R*-tree); 2) the uncertain parameter k must be specified in advance, so the clustering can only be refined through repeated trial and error; 3) they handle clusters of arbitrary shape inefficiently on very large data sets. In this paper, we first present a new hybrid clustering algorithm to solve these problems. The algorithm, which combines distance-based and density-based strategies, can handle clusters of arbitrary shape effectively. It makes full use of statistical information during mining to greatly reduce the time complexity while maintaining good clustering quality. Furthermore, the algorithm can easily eliminate noise and identify outliers. An experimental evaluation is performed on a spatial database, comparing this method with other popular clustering algorithms (CURE and DBSCAN). The results show that our algorithm outperforms them in terms of efficiency and cost, and achieves even greater speedup as the data size scales up.
Pages: 67-76
Number of pages: 10