Correlation preserving discretization

Cited by: 3
Authors
Mehta, S [1]
Parthasarathy, S [1]
Yang, H [1]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
Source
FOURTH IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS | 2004
Keywords
unsupervised discretization; missing data;
DOI
10.1109/ICDM.2004.10007
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Discretization is a crucial preprocessing primitive for a variety of data warehousing and mining tasks. In this article we present a novel PCA-based unsupervised algorithm for the discretization of continuous attributes in multivariate datasets. The algorithm leverages the underlying correlation structure in the dataset to obtain the discrete intervals, and ensures that the inherent correlations are preserved. The approach also extends easily to datasets containing missing values. We demonstrate the efficacy of the approach on real datasets and as a preprocessing step for both classification and frequent itemset mining tasks. We also show that the intervals are meaningful and can uncover hidden patterns in data.
Pages: 479-482
Page count: 4
Related papers
15 in total
  • [1] Multivariate Discretization for Set Mining
    Stephen D. Bay
    [J]. Knowledge and Information Systems, 2001, 3(4): 491-512
  • [2] CATLETT J, 1991, P EUR WORK SESS LEAR
  • [3] Dougherty J., 1995, ICML
  • [4] Jolliffe I. T., 1986, Principal Component Analysis, DOI: 10.1016/0169-7439(87)80084-9; 10.1007/0-387-22440-8_13
  • [5] KERBER R, 1991, NAT C AI
  • [6] Kim J-O., 1978, Factor analysis: statistical methods and practical issues
  • [7] Liu B, 1998, Proceedings of the fourth international conference on knowledge discovery and data mining, P80
  • [8] MAASS W, 1994, COLT
  • [9] MARCUSCHRISTOPH.L, 2000, PKDD
  • [10] MEHTA S, 2003, OSUCISRC1203TR69