Clustering validity assessment: Finding the optimal partitioning of a data set

Cited by: 268
Authors
Halkidi, M [1 ]
Vazirgiannis, M [1 ]
Affiliations
[1] Athens Univ Econ & Business, Dept Informat, Athens, Greece
Source
2001 IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS | 2001
DOI
10.1109/ICDM.2001.989517
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Clustering is a mostly unsupervised procedure, and the majority of clustering algorithms depend on certain assumptions in order to define the subgroups present in a data set. As a consequence, in most applications the resulting clustering scheme requires some sort of evaluation of its validity. In this paper we present a clustering validity procedure, which evaluates the results of clustering algorithms on data sets. We define a validity index, S_Dbw, based on well-defined clustering criteria, enabling the selection of the optimal input parameter values for a clustering algorithm that result in the best partitioning of a data set. We evaluate the reliability of our index both theoretically and experimentally, considering three representative clustering algorithms run on synthetic and real data sets. We also carried out an evaluation study to compare the performance of S_Dbw with other known validity indices. Our approach performed favorably in all cases, even in those where other indices failed to indicate the correct partitions in a data set.
Pages: 187-194
Number of pages: 8