Extensions to the k-means algorithm for clustering large data sets with categorical values

被引：1774

作者：

Huang, ZX ^{[1
]}

机构：

[1] CSIRO, ACsys CRC, Canberra, ACT 2601, Australia

来源：

DATA MINING AND KNOWLEDGE DISCOVERY | 1998年 / 2卷 / 03期

关键词：

data mining; cluster analysis; clustering algorithms; categorical data;

D O I：

10.1023/A:1009769707641

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.

引用

页码：283 / 304

页数：22

共 34 条

[1] Anderberg M.R., 1973, Probability and Mathematical Statistics
[2] [Anonymous], 1996, P PAC RIM KNOWL ACQS
[3] A CLUSTERING TECHNIQUE FOR SUMMARIZING MULTIVARIATE DATA
BALL, GH
HALL, DJ
[J]. BEHAVIORAL SCIENCE, 1967, 12 (02): : 153 - &
[4] Bezdek J.C., 2013, Pattern Recognition With Fuzzy Objective Function Algorithms
[5] C-MEANS CLUSTERING WITH THE L1 AND L-INFINITY NORMS
BOBROWSKI, L
BEZDEK, JC
[J]. IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS, 1991, 21 (03): : 545 - 554
[6] REVIEW OF CLASSIFICATION
CORMACK, RM
[J]. JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES A-GENERAL, 1971, 134 : 321 - +
[7] VALIDITY STUDIES IN CLUSTERING METHODOLOGIES
DUBES, R
JAIN, AK
[J]. PATTERN RECOGNITION, 1979, 11 (04) : 235 - 254
[8] HOW MANY CLUSTERS ARE BEST - AN EXPERIMENT
DUBES, RC
[J]. PATTERN RECOGNITION, 1987, 20 (06) : 645 - 663
[9] Ester M., 1996, KDD, P226
[10] Everitt B, 1974, CLUSTER ANAL

← 1 2 3 4 →