K-modes clustering

Cited by: 181
Authors
Chaturvedi, A
Green, PE
Carroll, JD
Affiliations
[1] Kraft Gen Foods Inc, Glenview, IL 60025 USA
[2] Univ Penn, Wharton Sch, Philadelphia, PA 19104 USA
[3] Rutgers State Univ, Grad Sch Management, Newark, NJ 07102 USA
Keywords
categorical data; cluster analysis; groups; modes; latent class analysis
DOI
10.1007/s00357-001-0004-3
Chinese Library Classification (CLC) number
O1 [Mathematics]
Discipline classification code
0701; 070101
Abstract
We present a nonparametric approach to deriving clusters from categorical (nominal scale) data using a new clustering procedure called K-modes, which is analogous to the traditional K-Means procedure (MacQueen 1967) for clustering interval scale data. Unlike most existing methods for clustering nominal scale data, the K-modes procedure explicitly optimizes a loss function based on the L0 norm (defined as the limit of an Lp norm as p approaches zero). In Monte Carlo simulations, both K-modes and latent class procedures (e.g., Goodman 1974) performed with equal efficiency in recovering a known underlying cluster structure. However, K-modes is an order of magnitude faster than the latent class procedure in speed and suffers from fewer problems of local optima than do latent class procedures. For data sets involving a large number of categorical variables, latent class procedures become computationally extremely slow and hence infeasible. We conjecture that, although in some cases latent class procedures might perform better than K-modes, it could out-perform latent class procedures in other cases. Hence, we recommend that these two approaches be used as "complementary" procedures in performing cluster analysis. We also present an empirical comparison of K-modes and latent class, where the former method prevails.
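The abstract describes K-modes as a K-Means analogue that minimizes an L0-type loss. The following is a minimal Python sketch of that idea, not the authors' implementation. It assumes the L0 loss reduces to counting the variables on which an observation disagrees with its cluster's mode, and that the procedure alternates between assignment and mode updates as K-Means does. The names mismatches, column_modes, and k_modes, the init parameter, and the toy data are all hypothetical.

```python
from collections import Counter
import random


def mismatches(row, mode):
    """L0-type loss: number of variables on which row and mode disagree."""
    return sum(a != b for a, b in zip(row, mode))


def column_modes(rows):
    """Most frequent category in each column of a non-empty list of rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, k, init=None, n_iter=20, seed=0):
    """Alternate between nearest-mode assignment and per-cluster mode updates.

    data is a list of equal-length tuples of categorical values.
    init optionally supplies k starting modes; otherwise k distinct rows
    are drawn at random (an assumption, not a detail from the abstract).
    """
    if init is None:
        rng = random.Random(seed)
        distinct = list(dict.fromkeys(data))  # requires k <= number of distinct rows
        modes = rng.sample(distinct, k)
    else:
        modes = list(init)

    labels = None
    for _ in range(n_iter):
        # Assignment step: each row joins the mode with the fewest mismatches.
        new_labels = [
            min(range(k), key=lambda j: mismatches(row, modes[j])) for row in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the modes will not change either
        labels = new_labels
        # Update step: each non-empty cluster's mode becomes its column-wise mode.
        for j in range(k):
            members = [row for row, lab in zip(data, labels) if lab == j]
            if members:
                modes[j] = column_modes(members)

    loss = sum(mismatches(row, modes[lab]) for row, lab in zip(data, labels))
    return labels, modes, loss


toy = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
    ("b", "z", "r"), ("b", "z", "s"), ("c", "z", "r"),
]
labels, modes, loss = k_modes(toy, k=2, init=[toy[0], toy[3]])
print(labels, modes, loss)
```

With the starting modes fixed to the first and fourth rows, this toy data splits into its first three and last three rows, with a total loss of 4 mismatches. With random starts the sketch can stop at a poorer local optimum, so several restarts keeping the lowest loss would be a reasonable precaution.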
Pages: 35-55
Page count: 21