Fast feature selection using a simple estimation of distribution algorithm: a case study on splice site prediction

Cited by: 38
Authors
Saeys, Yvan [1]
Degroeve, Sven [1]
Aeyels, Dirk [2]
Van de Peer, Yves [1]
Rouze, Pierre [3]
Affiliations
[1] Univ Ghent VIB, Dept Plant Syst Biol, B-9000 Ghent, Belgium
[2] Univ Ghent, SYSTeMS Res Grp, B-9052 Zwijnaarde, Belgium
[3] Univ Ghent, Lab Associe INRA France, B-9000 Ghent, Belgium
Keywords
Machine Learning; Feature Subset Selection; Estimation of Distribution Algorithms; Splice Site Prediction
DOI
10.1093/bioinformatics/btg1076
Chinese Library Classification number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Motivation: Feature subset selection is an important preprocessing step for classification. In biology, where structures or processes are described by a large number of features, eliminating irrelevant and redundant information in a reasonable amount of time has several advantages. It enables the classification system to achieve equally good or even better solutions with a restricted subset of features, allows for faster classification, and helps the human expert focus on a relevant subset of features, hence providing useful biological knowledge.
Results: We present a heuristic method based on Estimation of Distribution Algorithms to select relevant subsets of features for splice site prediction in Arabidopsis thaliana. We show that this method detects relevant feature subsets quickly by using the technique of constrained feature subsets. Compared with traditional greedy methods, the gain in speed can be up to one order of magnitude, with results comparable to or even better than those of the greedy methods. This makes it a very practical solution for classification tasks that can be solved using a relatively small number of discriminative features (or feature dependencies), but where the initial set of potentially discriminative features is rather large.
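The abstract gives no algorithmic detail, so the following is only a minimal sketch of how a simple univariate estimation of distribution algorithm might select feature subsets. It assumes that "constrained feature subsets" means every candidate subset has a fixed size `k`; that reading, the function names `sample_subset`, `eda_select` and `toy_fitness`, the `RELEVANT` set, and all parameter values are hypothetical and not taken from the paper. In practice the fitness would presumably be a classifier's performance on the selected features; here a toy score stands in for it.

```python
import random


def sample_subset(weights, k, rng):
    """Draw k distinct feature indices, favouring features with larger weight."""
    # Efraimidis-Spirakis trick: key = u ** (1 / w); the k largest keys form a
    # weighted sample without replacement. All weights must be > 0.
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    keys.sort(reverse=True)
    return frozenset(i for _, i in keys[:k])


def eda_select(n_features, k, fitness, pop_size=60, n_keep=20, n_iters=30, seed=0):
    """Univariate EDA sketch: one selection weight per feature (an assumption)."""
    rng = random.Random(seed)
    weights = [0.5] * n_features  # start with no preference between features
    best, best_score = None, float("-inf")
    for _ in range(n_iters):
        # Sample a population of fixed-size subsets from the current model.
        population = [sample_subset(weights, k, rng) for _ in range(pop_size)]
        # Keep the highest-scoring subsets.
        elite = sorted(population, key=fitness, reverse=True)[:n_keep]
        score = fitness(elite[0])
        if score > best_score:
            best, best_score = elite[0], score
        # Re-estimate each feature's weight from its frequency in the elite,
        # with Laplace smoothing so that no weight reaches 0 or 1.
        counts = [0] * n_features
        for subset in elite:
            for i in subset:
                counts[i] += 1
        weights = [(c + 1.0) / (n_keep + 2.0) for c in counts]
    return best, best_score, weights


# Toy stand-in for classifier accuracy: count hidden "relevant" features.
RELEVANT = frozenset({2, 5, 11, 17})


def toy_fitness(subset):
    return len(subset & RELEVANT)


best, score, weights = eda_select(n_features=40, k=6, fitness=toy_fitness)
print(sorted(best), score)
```

The returned `weights` can be read as a rough per-feature relevance estimate, which is one way such a method could point a human expert at a small set of features.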
Pages: II179-II188
Page count: 10