An unsupervised approach to feature discretization and selection

Cited: 93
Authors
Ferreira, Artur J. [1 ,3 ]
Figueiredo, Mario A. T. [2 ,3 ]
Affiliations
[1] Polytech Inst Lisbon, Inst Super Engn Lisboa, P-1959007 Lisbon, Portugal
[2] Univ Tecn Lisboa, Inst Super Tecn, Lisbon, Portugal
[3] Inst Telecomunicacoes, Lisbon, Portugal
Keywords
Feature discretization; Feature quantization; Feature selection; Linde-Buzo-Gray algorithm; Sparse data; Support vector machines; Naive Bayes; k-nearest neighbor; RANDOM SUBSPACE METHOD; MICROARRAY DATA; GENE SELECTION; CLASSIFICATION; ALGORITHM; INFORMATION; RELEVANCE;
DOI
10.1016/j.patcog.2011.12.008
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Many learning problems require handling high-dimensional datasets with a relatively small number of instances. Learning algorithms are thus confronted with the curse of dimensionality and need to address it in order to be effective. Examples of this type of data include the bag-of-words representation in text classification problems and gene expression data for tumor detection/classification. Among the high number of features characterizing the instances, many may be irrelevant (or even detrimental) to the learning task. There is thus a clear need for adequate techniques for feature representation, reduction, and selection, to improve classification accuracy and reduce memory requirements. In this paper, we propose combined unsupervised feature discretization and feature selection techniques, suitable for medium- and high-dimensional datasets. Experimental results on several standard datasets, with both sparse and dense features, show the effectiveness of the proposed techniques as well as improvements over previous related techniques. (C) 2011 Elsevier Ltd. All rights reserved.
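This record does not include the paper's algorithmic details, but the abstract's keywords name the Linde-Buzo-Gray (LBG) algorithm for unsupervised feature discretization followed by feature selection. As a rough, hypothetical sketch of that kind of pipeline (not the authors' actual method): each feature is scalar-quantized with LBG, and features are then ranked by an unsupervised relevance score — here the variance of the discretized values is used as a stand-in for the paper's criterion.

```python
import numpy as np

def lbg_discretize(x, n_levels=4, eps=1e-3, max_iter=50):
    """Scalar Linde-Buzo-Gray quantization of one feature.

    Starts from a single codeword (the mean), repeatedly splits each
    codeword into a perturbed pair, and refines the codebook with Lloyd
    iterations. n_levels should be a power of two for this schedule.
    Returns (discrete indices, sorted codebook).
    """
    delta = eps * (x.std() + 1e-12)          # perturbation for codeword splitting
    codebook = np.array([x.mean()])
    while len(codebook) < n_levels:
        # Split every codeword into two slightly shifted copies.
        codebook = np.concatenate([codebook - delta, codebook + delta])
        for _ in range(max_iter):
            # Lloyd step: nearest-codeword assignment, then centroid update.
            idx = np.argmin(np.abs(x[:, None] - codebook[None, :]), axis=1)
            new = np.array([x[idx == k].mean() if np.any(idx == k) else codebook[k]
                            for k in range(len(codebook))])
            if np.allclose(new, codebook):
                break
            codebook = new
    codebook = np.sort(codebook)
    idx = np.argmin(np.abs(x[:, None] - codebook[None, :]), axis=1)
    return idx, codebook

def select_features(X, n_levels=4, n_keep=2):
    """Discretize each column with LBG, then keep the n_keep columns whose
    discretized values have the largest variance (an illustrative,
    unsupervised relevance score -- not the paper's actual criterion)."""
    D = np.column_stack([lbg_discretize(X[:, j], n_levels)[0]
                         for j in range(X.shape[1])])
    scores = D.var(axis=0)
    keep = np.argsort(scores)[::-1][:n_keep]
    return D[:, keep], keep
```

In this sketch a constant (uninformative) feature collapses into a single quantization bin, gets zero variance, and is discarded by the ranking step, which is the intuition behind ranking features after discretization.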
Pages: 3048-3060 (13 pages)