Innovation in the cluster validating techniques

Cited by: 24
Authors
Jain, Ravi [1]
Koronios, Andy [1]
Affiliations
[1] Univ S Australia, Sch Comp & Informat Sci, Adelaide, SA 5001, Australia
Keywords
clustering algorithms; Silhouette width; Calinski & Harabasz index; Baker & Hubert indices
DOI
10.1007/s10700-008-9033-2
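Chinese Library Classification (CLC) number
TP18 [Theory of artificial intelligence];
Discipline classification code
081104 [Pattern recognition and intelligent systems]; 0812 [Computer science and technology]; 0835 [Software engineering]; 1405 [Intelligent science and technology];
Abstract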
Detecting database records that are exact or approximate duplicates, whether caused by data entry errors, differences in the detailed schemas of records from multiple databases, or other reasons, is an important line of research. Yet no comprehensive comparative study has evaluated the effectiveness of the Silhouette width, the Calinski & Harabasz index (pseudo F-statistic), and the Baker & Hubert index (gamma index) in the presence of exact and approximate duplicates. This paper presents a comparative study of the effectiveness of these three cluster validation techniques, which involve measuring the stability of a partition of a data set in the presence of noise, in particular approximate and exact duplicates. The Silhouette width, Calinski & Harabasz index, and Baker & Hubert index are calculated before and after exact and approximate duplicates are deliberately inserted into the data set. Comprehensive experiments on the glass, wine, iris, and ruspini data sets confirm that the Baker & Hubert index is not stable in the presence of approximate duplicates. Moreover, in the presence of approximate duplicates, the Silhouette width, Calinski & Harabasz index, and Baker & Hubert index do not exceed the values obtained on the original data.
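The before-and-after comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the toy points, the function names, and the choice to give each inserted duplicate the cluster label of the record it copies are all assumptions made for illustration. The three indices follow their textbook definitions and use only the Python standard library.

```python
import math
from itertools import combinations


def silhouette_width(points, labels):
    """Average silhouette width; assumes at least two clusters."""
    total = 0.0
    for i, p in enumerate(points):
        # Mean distance from p to the other members of each cluster.
        mean_dist = {}
        for c in set(labels):
            ds = [math.dist(p, q) for j, q in enumerate(points)
                  if labels[j] == c and j != i]
            if ds:
                mean_dist[c] = sum(ds) / len(ds)
        if labels[i] not in mean_dist:
            continue  # singleton cluster contributes 0 by convention
        a = mean_dist[labels[i]]
        b = min(v for c, v in mean_dist.items() if c != labels[i])
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def calinski_harabasz(points, labels):
    """Pseudo F-statistic: (BSS / (k - 1)) / (WSS / (n - k))."""
    n, k, dim = len(points), len(set(labels)), len(points[0])

    def centroid(pts):
        return tuple(sum(p[d] for p in pts) / len(pts) for d in range(dim))

    overall = centroid(points)
    bss = wss = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == c]
        mu = centroid(members)
        bss += len(members) * math.dist(mu, overall) ** 2
        wss += sum(math.dist(p, mu) ** 2 for p in members)
    return (bss / (k - 1)) / (wss / (n - k))


def baker_hubert_gamma(points, labels):
    """Gamma = (S+ - S-) / (S+ + S-); tied comparisons are ignored."""
    within, between = [], []
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        (within if labels[i] == labels[j] else between).append(d)
    s_plus = sum(1 for w in within for b in between if w < b)
    s_minus = sum(1 for w in within for b in between if w > b)
    return (s_plus - s_minus) / (s_plus + s_minus)


def evaluate(points, labels):
    """Compute all three indices for one labelled data set."""
    return {
        "silhouette": silhouette_width(points, labels),
        "calinski_harabasz": calinski_harabasz(points, labels),
        "gamma": baker_hubert_gamma(points, labels),
    }


# Hypothetical toy data: two well-separated clusters.
original = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = [0, 0, 0, 1, 1, 1]

# Assumption: one exact duplicate of (0, 0) and one approximate
# duplicate of (5, 5), each labelled like its source record.
with_dups = original + [(0.0, 0.0), (5.1, 5.0)]
dup_labels = labels + [0, 1]

before = evaluate(original, labels)
after = evaluate(with_dups, dup_labels)
for name in before:
    print(f"{name}: before={before[name]:.4f} after={after[name]:.4f}")
```

Comparing the two printed columns shows how each index shifts once duplicates are present. A study like the one described would presumably re-cluster the augmented data rather than reuse labels, so this sketch only conveys the shape of the comparison.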
Pages: 233-241
Number of pages: 9
Related papers
18 entries in total
[1]
[Anonymous], 2003, Proceedings of the 3rd IASTED International Conference on Artificial Intelligence and Applications (AIA 03), Benalmadena, Spain
[2]
MEASURING POWER OF HIERARCHICAL CLUSTER-ANALYSIS [J].
BAKER, FB ;
HUBERT, LJ .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1975, 70 (349) :31-38
[3]
Blake C.L., 1998, UCI repository of machine learning databases
[4]
Cluster validation techniques for genome expression data [J].
Bolshakova, N ;
Azuaje, F .
SIGNAL PROCESSING, 2003, 83 (04) :825-833
[5]
Calinski T., 1974, COMMUN STAT, V3, P1, DOI DOI 10.1080/03610927408827101
[6]
Duplicate record detection: A survey [J].
Elmagarmid, Ahmed K. ;
Ipeirotis, Panagiotis G. ;
Verykios, Vassilios S. .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2007, 19 (01) :1-16
[7]
Halkidi M, 2002, SIGMOD REC, V31, P19, DOI 10.1145/601858.601862
[8]
Halkidi M, 2002, SIGMOD RECORD, V31, P40, DOI 10.1145/565117.565124
[9]
On clustering validation techniques [J].
Halkidi, M ;
Batistakis, Y ;
Vazirgiannis, M .
JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2001, 17 (2-3) :107-145
[10]
Comparison of clustering methods for clinical databases [J].
Hirano, S ;
Sun, XG ;
Tsumoto, S .
INFORMATION SCIENCES, 2004, 159 (3-4) :155-165