A review of feature selection methods on synthetic data

Cited by: 536
Authors
Bolon-Canedo, Veronica [1]
Sanchez-Marono, Noelia [1]
Alonso-Betanzos, Amparo [1]
Affiliation
[1] Univ A Coruna, Dept Comp Sci, La Coruna, Spain
Keywords
Feature selection; Filters; Embedded methods; Wrappers; Synthetic datasets; EFFICIENT FEATURE-SELECTION; MUTUAL INFORMATION; GENE SELECTION; CLASSIFICATION; ALGORITHMS; RELEVANCE; RELIEFF; SEARCH
DOI
10.1007/s10115-012-0487-8
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
With the advent of high dimensionality, adequate identification of the relevant features of the data has become indispensable in real-world scenarios. In this context, the importance of feature selection is beyond doubt and different methods have been developed. However, with such a vast body of algorithms available, choosing an adequate feature selection method is not easy, and it is necessary to check their effectiveness in different situations. The assessment of relevant features is difficult in real datasets, so an interesting option is to use artificial data, where the relevant features are known in advance. In this paper, several synthetic datasets are employed for this purpose, aiming at reviewing the performance of feature selection methods in the presence of a growing number of irrelevant features, noise in the data, redundancy and interaction between attributes, as well as a small ratio between the number of samples and the number of features. Seven filters, two embedded methods, and two wrappers are applied over eleven synthetic datasets and tested by four classifiers, so as to be able to choose a robust method, paving the way for its application to real datasets.
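The idea of scoring a feature selection method against synthetic data can be illustrated with a minimal sketch. It is not taken from the paper: the dataset generator, the absolute-correlation filter, and every name below (`make_synthetic`, `abs_correlation`, `rank_features`) are assumptions chosen for illustration, and the filter is a simple stand-in rather than one of the seven filters the authors evaluate. Because the relevant features are known by construction, the ranking can be checked directly.

```python
import math
import random


def make_synthetic(n_samples=500, n_irrelevant=20, noise=0.05, seed=0):
    """Hypothetical generator: features 0 and 1 determine the class,
    all remaining features are irrelevant Gaussian noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_samples):
        row = [rng.gauss(0, 1) for _ in range(2 + n_irrelevant)]
        label = 1 if row[0] + row[1] > 0 else 0
        if rng.random() < noise:  # flip a small fraction of labels
            label = 1 - label
        X.append(row)
        y.append(label)
    return X, y


def abs_correlation(xs, ys):
    """Absolute Pearson correlation between one feature and the labels."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return abs(cov) / math.sqrt(vx * vy)


def rank_features(X, y):
    """Univariate filter: order feature indices by decreasing score."""
    n_features = len(X[0])
    scores = [abs_correlation([row[j] for row in X], y)
              for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


X, y = make_synthetic()
ranking = rank_features(X, y)

relevant = {0, 1}  # known by construction of the synthetic data
top = set(ranking[:len(relevant)])
recovered = len(top & relevant) / len(relevant)
print(f"top-ranked features: {sorted(top)}, "
      f"fraction of relevant features recovered: {recovered:.2f}")
```

Raising `n_irrelevant`, raising `noise`, or lowering `n_samples` in this sketch mimics, in a very reduced form, the stress conditions the abstract lists. A univariate filter like this one would not be expected to handle interacting or redundant attributes.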
Pages: 483-519
Page count: 37