Feature weighting in k-means clustering

Cited by: 292
Authors
Modha, DS [1]
Spangler, WS [1]
Affiliations
[1] IBM Corp, Almaden Res Ctr, San Jose, CA 95120 USA
Keywords
clustering; convexity; convex k-means algorithm; feature combination; feature selection; Fisher's discriminant analysis; text mining; unsupervised learning
DOI
10.1023/A:1024016609528
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Data sets with multiple, heterogeneous feature spaces occur frequently. We present an abstract framework for integrating multiple feature spaces in the k-means clustering algorithm. Our main ideas are (i) to represent each data object as a tuple of multiple feature vectors, (ii) to assign a suitable (and possibly different) distortion measure to each feature space, (iii) to combine distortions on different feature spaces, in a convex fashion, by assigning (possibly) different relative weights to each, (iv) for a fixed weighting, to cluster using the proposed convex k-means algorithm, and (v) to determine the optimal feature weighting to be the one that yields the clustering that simultaneously minimizes the average within-cluster dispersion and maximizes the average between-cluster dispersion along all the feature spaces. Using precision/recall evaluations and known ground truth classifications, we empirically demonstrate the effectiveness of feature weighting in clustering on several different application domains.
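Steps (i) through (iv) of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes squared Euclidean distortion on every feature space (the abstract allows a different measure per space), uses the per-space mean as the centroid update, which is appropriate for that distortion, and takes the initial centroids as an argument. All function and variable names are hypothetical. Step (v), the search for the optimal weighting, is not covered.

```python
def sq_euclidean(x, c):
    """Squared Euclidean distortion between two vectors (an assumed choice)."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def weighted_distortion(obj, centroid, alpha, distortions):
    """Convex combination of per-feature-space distortions.

    obj and centroid are tuples holding one feature vector per feature space;
    alpha holds non-negative weights that sum to 1.
    """
    return sum(a * d(x, c)
               for a, d, x, c in zip(alpha, distortions, obj, centroid))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def convex_kmeans(data, alpha, init_centroids, distortions=None, iters=100):
    """k-means under a fixed weighting alpha; k = len(init_centroids)."""
    if distortions is None:
        distortions = [sq_euclidean] * len(alpha)
    centroids = list(init_centroids)
    k = len(centroids)
    labels = None
    for _ in range(iters):
        # Assignment step: nearest centroid under the combined distortion.
        new_labels = [
            min(range(k),
                key=lambda j: weighted_distortion(obj, centroids[j],
                                                  alpha, distortions))
            for obj in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: recompute each centroid separately in each space.
        for j in range(k):
            members = [obj for obj, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(
                    mean_vector([m[s] for m in members])
                    for s in range(len(alpha))
                )
    return labels, centroids


# Toy data: each object is a tuple (2-d feature vector, 1-d feature vector).
data = [
    ([0.0, 0.1], [5.0]),
    ([0.1, 0.0], [-5.0]),
    ([10.0, 10.1], [5.0]),
    ([10.1, 10.0], [-5.0]),
]
init = [data[0], data[3]]

labels_first, _ = convex_kmeans(data, (1.0, 0.0), init)
labels_second, _ = convex_kmeans(data, (0.0, 1.0), init)
print(labels_first)   # [0, 0, 1, 1]
print(labels_second)  # [0, 1, 0, 1]
```

On this toy data the two feature spaces suggest different groupings, so the weighting determines which partition is recovered. That dependence is why step (v) has to select the weighting by a separate criterion.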
Pages: 217-237
Page count: 21