Hierarchical clustering of mixed data based on distance hierarchy

Cited by: 77
Authors
Hsu, Chung-Chian [1]
Chen, Chin-Long [1]
Su, Yu-Wei [1]
Affiliations
[1] Natl Yunlin Univ Sci & Technol, Dept Informat Management, Touliu 640, Yunlin, Taiwan
Keywords
categorical data; distance hierarchy; hierarchical clustering; k-means; mixed data
DOI
10.1016/j.ins.2007.05.003
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Data clustering is an important data mining technique that partitions data according to some similarity criterion. Many algorithms have been proposed for clustering numerical data, and some recent research tackles the problem of clustering categorical or mixed data. Unlike numerical attributes, whose distance is measured by subtraction, categorical values have no standard distance measure. In this article, we propose a distance representation scheme, the distance hierarchy, which facilitates expressing the similarity between categorical values and also unifies the distance measurement of numerical and categorical values. We then apply the scheme to mixed data clustering, in particular by integrating it with a hierarchical clustering algorithm. Consequently, the integrated approach can handle numerical and categorical data uniformly, and also enables one to take the similarity between categorical values into consideration. Experimental results show that the proposed approach produces better clustering results than conventional clustering algorithms when categorical attributes are present and their values have different degrees of similarity. © 2007 Elsevier Inc. All rights reserved.
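The idea summarized in the abstract can be illustrated with a small sketch. It is not the authors' algorithm: the beverage hierarchy, the unit link weights, the Euclidean-style combination of attribute distances, and the average-linkage merge rule are all assumptions chosen for illustration. It shows only how a tree of weighted links can give categorical values graded distances that are then combined with a numerical difference inside an agglomerative clustering loop.

```python
from itertools import combinations

# Hypothetical distance hierarchy: child -> (parent, link weight).
# The categories and weights are illustrative, not taken from the paper.
HIERARCHY = {
    "Coke": ("Carbonated", 1.0),
    "Pepsi": ("Carbonated", 1.0),
    "Latte": ("Coffee", 1.0),
    "Mocha": ("Coffee", 1.0),
    "Carbonated": ("Any", 1.0),
    "Coffee": ("Any", 1.0),
}
MAX_PATH = 4.0  # longest leaf-to-leaf path in this hierarchy


def path_to_root(value):
    """Map each ancestor of value (and value itself) to its distance from value."""
    depths = {value: 0.0}
    total = 0.0
    while value in HIERARCHY:
        value, weight = HIERARCHY[value]
        total += weight
        depths[value] = total
    return depths


def categorical_distance(a, b):
    """Path length between a and b through their lowest common ancestor."""
    up_a, up_b = path_to_root(a), path_to_root(b)
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


def mixed_distance(x, y, num_range):
    """Distance between records of the assumed form (category, number)."""
    d_cat = categorical_distance(x[0], y[0]) / MAX_PATH  # scaled to [0, 1]
    d_num = abs(x[1] - y[1]) / num_range                 # scaled to [0, 1]
    return (d_cat ** 2 + d_num ** 2) ** 0.5


def agglomerate(points, k):
    """Naive average-linkage agglomerative clustering down to k clusters."""
    values = [p[1] for p in points]
    num_range = (max(values) - min(values)) or 1.0
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        pairs = [(i, j) for i in c1 for j in c2]
        total = sum(mixed_distance(points[i], points[j], num_range)
                    for i, j in pairs)
        return total / len(pairs)

    while len(clusters) > k:
        # Merge the closest pair of clusters (a < b, so deleting b is safe).
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters


points = [("Coke", 10), ("Pepsi", 11), ("Latte", 30), ("Mocha", 31)]
print(agglomerate(points, 2))  # [[0, 1], [2, 3]]
```

With simple 0/1 matching, "Coke" would be as far from "Pepsi" as from "Latte"; under the assumed hierarchy the first pair is closer (path length 2 versus 4), which is the kind of graded similarity the abstract refers to.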
Pages: 4474-4492
Page count: 19