A distributed framework for trimmed Kernel k-Means clustering

被引:36
作者
Tsapanos, Nikolaos [1 ]
Tefas, Anastasios [1 ]
Nikolaidis, Nikolaos [1 ]
Pitas, Ioannis [1 ]
机构
[1] Aristotle Univ Thessaloniki, GR-54006 Thessaloniki, Greece
关键词
Data clustering; Face clustering; Kernel k-Means; Distributed computing;
D O I
10.1016/j.patcog.2015.02.020
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Data clustering is an unsupervised learning task that has found many applications in various scientific fields. The goal is to find subgroups of closely related data samples (clusters) in a set of unlabeled data. Kernel k-Means is a state of the art clustering algorithm. However, in contrast to clustering algorithms that can work using only a limited percentage of the data at a time, Kernel k-Means is a global clustering algorithm. It requires the computation of the kernel matrix, which takes O(n(2)d) time and O(n(2)) space in memory. As datasets grow larger, the application of Kernel k-Means becomes infeasible on a single computer, a fact that strongly suggests a distributed approach. In this paper, we present such an approach to the Kernel k-Means clustering algorithm, in order to make its application to a large number of samples feasible and, thus, achieve high performance clustering results on very big datasets. Our distributed approach follows the MapReduce programming model and consists of 3 stages, the kernel matrix computation, a novel kernel matrix trimming method and the Kernel k-Means clustering algorithm. (C) 2015 Elsevier Ltd. All rights reserved.
引用
收藏
页码:2685 / 2698
页数:14
相关论文
共 44 条
  • [1] Agrawal D., 2011, P 14 INT C EXTENDING, P530, DOI DOI 10.1145/1951365.1951432
  • [2] Aizerman M. A., 1964, Automation and Remote Control, V25, P821
  • [3] [Anonymous], 2011, KDD 11 P 17 ACM SIGK
  • [4] [Anonymous], 2011, P IEEE C COMP VIS PA
  • [5] [Anonymous], 2008, REAL LIF IM WORKSH E
  • [6] [Anonymous], 2009, Hadoop: The Definitive Guide
  • [7] [Anonymous], 2012, ADV NEURAL INFORM PR
  • [8] Spectral clustering based on local linear approximations
    Arias-Castro, Ery
    Chen, Guangliang
    Lerman, Gilad
    [J]. ELECTRONIC JOURNAL OF STATISTICS, 2011, 5 : 1537 - 1587
  • [9] Bach FR, 2004, ADV NEUR IN, V16, P305
  • [10] Bingham E, 2001, P 7 ACM SIGKDD INT C, P245, DOI DOI 10.1145/502512.502546