A new nonparametric pairwise clustering algorithm based on iterative estimation of distance profiles

被引:20
作者
Dubnov, S [1 ]
El-Yaniv, R
Gdalyahu, Y
Schneidman, E
Tishby, N
Yona, G
机构
[1] Ben Gurion Univ Negev, Dept Commun Syst Engn, IL-84105 Beer Sheva, Israel
[2] Technion Israel Inst Technol, Dept Comp Sci, IL-32000 Haifa, Israel
[3] Hebrew Univ Jerusalem, Sch Engn & Comp Sci, IL-91904 Jerusalem, Israel
[4] Hebrew Univ Jerusalem, Ctr Neural Computat, IL-91904 Jerusalem, Israel
[5] Hebrew Univ Jerusalem, Dept Neurobiol, IL-91904 Jerusalem, Israel
[6] Cornell Univ, Dept Comp Sci, Ithaca, NY 14853 USA
关键词
nonparametric methods; pairwise distances; hierarchical clustering; Jensen-Shannon divergence; cross-validation; noise robustness;
D O I
10.1023/A:1013631728342
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
We present a novel pairwise clustering method. Given a proximity matrix of pairwise relations (i.e. pairwise similarity or dissimilarity estimates) between data points, our algorithm extracts the two most prominent clusters in the data set. The algorithm, which is completely nonparametric, iteratively employs a two-step transformation on the proximity matrix. The first step of the transformation represents each point by its relation to all other data points, and the second step re-estimates the pairwise distances using a statistically motivated proximity measure on these representations. Using this transformation, the algorithm iteratively partitions the data points, until it finally converges to two clusters. Although the algorithm is simple and intuitive, it generates a complex dynamics of the proximity matrices. Based on this bipartition procedure we devise a hierarchical clustering algorithm, which employs the basic bipartition algorithm in a straightforward divisive manner. The hierarchical clustering algorithm copes with the model validation problem using a general cross-validation approach, which may be combined with various hierarchical clustering methods. We further present an experimental study of this algorithm. We examine some of the algorithm's properties and performance on some synthetic and 'standard' data sets. The experiments demonstrate the robustness of the algorithm and indicate that it generates a good clustering partition even when the data is noisy or corrupted.
引用
收藏
页码:35 / 61
页数:27
相关论文
共 39 条
  • [1] [Anonymous], [No title captured]
  • [2] [Anonymous], P 31 ANN M ASS COMP
  • [3] [Anonymous], 1998, P ACM SIAM S DISCR A
  • [4] Bishop C. M., 1995, NEURAL NETWORKS PATT
  • [5] Blake C.L., 1998, UCI repository of machine learning databases
  • [6] Blatt M, 1996, ADV NEUR IN, V8, P416
  • [7] BUHMANN J, 1995, IAITR957 U BONN DEP
  • [8] CLARK J, 1995, BLACKWELL TXB LINGUI, V9
  • [9] Crawford T., 1998, COMPUTING MUSICOLOGY, V11, P73
  • [10] MAXIMUM LIKELIHOOD FROM INCOMPLETE DATA VIA EM ALGORITHM
    DEMPSTER, AP
    LAIRD, NM
    RUBIN, DB
    [J]. JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-METHODOLOGICAL, 1977, 39 (01): : 1 - 38