Robust projected clustering

Cited by: 57
Authors
Moise, Gabriela [1]
Sander, Joerg [1]
Ester, Martin [2]
Affiliations
[1] Univ Alberta, Dept Comp Sci, Edmonton, AB T6G 2E8, Canada
[2] Simon Fraser Univ, Sch Comp Sci, Burnaby, BC V5A 1S6, Canada
Keywords
projected clustering; subspace clustering; clustering numerical and categorical data
DOI
10.1007/s10115-007-0090-6
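Chinese Library Classification (CLC) number
TP18 [Theory of artificial intelligence]
Discipline classification codes
081104 [Pattern recognition and intelligent systems]; 0812 [Computer science and technology]; 0835 [Software engineering]; 1405 [Intelligent science and technology]
Abstract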
Projected clustering partitions a data set into several disjoint clusters, plus outliers, so that each cluster exists in a subspace. Subspace clustering enumerates clusters of objects in all subspaces of a data set, and it tends to produce many overlapping clusters. Such algorithms have been extensively studied for numerical data, but only a few have been proposed for categorical data. Typical drawbacks of existing projected and subspace clustering algorithms for numerical or categorical data are that they rely on parameters whose appropriate values are difficult to set, or that they are unable to identify projected clusters with few relevant attributes. We present P3C, a robust algorithm for projected clustering that can effectively discover projected clusters in the data while minimizing the number of required parameters. P3C does not need the number of projected clusters as input, and can discover, under very general conditions, the true number of projected clusters. P3C is effective in detecting very low-dimensional projected clusters embedded in high-dimensional spaces. P3C positions itself between projected and subspace clustering in that it can compute either disjoint or overlapping clusters. P3C is the first projected clustering algorithm for both numerical and categorical data.
Pages: 273-298
Page count: 26