A database clustering methodology and tool

Cited by: 16
Authors
Ryu, TW [1]
Eick, CF [2]
Affiliations
[1] Calif State Univ Fullerton, Dept Comp Sci, Fullerton, CA 92834 USA
[2] Univ Houston, Dept Comp Sci, Houston, TX 77204 USA
Keywords
database clustering; preprocessing in KDD; data mining; data model discrepancy; similarity measures for bags;
DOI
10.1016/j.ins.2004.03.016
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Clustering is a popular data analysis and data mining technique. However, applying traditional clustering algorithms directly to a database is not straightforward, because a database usually consists of structured and related data; moreover, there may be several object views of the database to be clustered, depending on the data analyst's particular interest. Finally, in many cases there is a data model discrepancy between the format used to store the database to be analyzed and the representation format that clustering algorithms expect as their input. These discrepancies have been mostly ignored by current research. This paper focuses on identifying those discrepancies and on analyzing their impact on the application of clustering techniques to databases. We are particularly interested in the question of how clustering algorithms can be generalized to become more directly applicable to real-world databases. The paper introduces methodologies, techniques, and tools that serve this purpose. We propose a data set representation framework for database clustering that characterizes objects to be clustered through sets of tuples, and introduce preprocessing techniques and tools to generate object views based on this framework. Moreover, we introduce bag-oriented similarity measures and clustering algorithms that are suitable for our proposed data set representation framework. We also demonstrate that, through bag-oriented clustering, our approach is capable of dealing with the relationship information commonly found in databases. Finally, we argue that our bag-oriented data representation framework is more suitable for database clustering than the commonly used flat file format and produces clusters of better quality. (C) 2004 Elsevier Inc. All rights reserved.
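To make the idea of characterizing an object through a bag of related tuples more concrete, the following is a minimal sketch under stated assumptions. The `purchases` table, the helper names `build_bags` and `bag_similarity`, and the choice of a multiset Jaccard coefficient are all illustrative assumptions; they are not the similarity measures or tools defined in the paper.

```python
from collections import Counter, defaultdict


def build_bags(rows, key, attribute):
    """Group the values of `attribute` from related tuples into one bag per object.

    Hypothetical helper: each object (identified by `key`) is represented by a
    multiset (Counter) of attribute values instead of a single flat record.
    """
    bags = defaultdict(Counter)
    for row in rows:
        bags[row[key]][row[attribute]] += 1
    return dict(bags)


def bag_similarity(a, b):
    """Multiset Jaccard coefficient: |a ∩ b| / |a ∪ b|, counting duplicates.

    This is an assumed stand-in measure, chosen only because it is simple.
    """
    union = sum((a | b).values())
    if union == 0:
        return 1.0  # two empty bags are treated as identical
    return sum((a & b).values()) / union


# Made-up one-to-many relationship: one customer, many purchase tuples.
purchases = [
    {"customer": "c1", "item": "book"},
    {"customer": "c1", "item": "book"},
    {"customer": "c1", "item": "pen"},
    {"customer": "c2", "item": "book"},
    {"customer": "c2", "item": "pen"},
    {"customer": "c3", "item": "lamp"},
]

bags = build_bags(purchases, "customer", "item")
sim_12 = bag_similarity(bags["c1"], bags["c2"])  # shared items, different counts
sim_13 = bag_similarity(bags["c1"], bags["c3"])  # no items in common
```

A flat file would have to either duplicate the customer once per purchase or aggregate the purchases into a single value; keeping the bag preserves the duplicates and the one-to-many relationship, and a pairwise similarity such as `bag_similarity` could then feed any distance-based clustering algorithm.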
Pages: 29-59
Number of pages: 31