Data clustering: 50 years beyond K-means

Cited by: 6138
Authors
Jain, Anil K. [1, 2]
Affiliations
[1] Michigan State Univ, Dept Comp Sci & Engn, E Lansing, MI 48824 USA
[2] Korea Univ, Dept Brain & Cognit Engn, Seoul 136713, South Korea
Funding
U.S. National Science Foundation
Keywords
Data clustering; User's dilemma; Historical developments; Perspectives on clustering; King-Sun Fu prize; SCALABLE FRAMEWORK; ALGORITHM;
DOI
10.1016/j.patrec.2009.09.011
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into a system of ranked taxa: domain, kingdom, phylum, class, etc. Cluster analysis is the formal study of methods and algorithms for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is to find structure in data and is therefore exploratory in nature. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. In spite of the fact that K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty in designing a general purpose clustering algorithm and the ill-posed problem of clustering. We provide a brief overview of clustering, summarize well known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large scale data clustering. (C) 2009 Elsevier B.V. All rights reserved.
Pages: 651-666
Number of pages: 16