FrankSum: New feature selection method for protein function prediction

Cited by: 22
Authors
Al-Shahib, A [1]
Breitling, R
Gilbert, D
Affiliations
[1] Univ Glasgow, Dept Comp Sci, Bioinformat Res Ctr, Glasgow G12 8QQ, Lanark, Scotland
[2] Univ Glasgow, Inst Biomed Life Sci, Glasgow G12 8QQ, Lanark, Scotland
Keywords
feature selection; protein function; sequence features; machine learning
DOI
10.1142/S0129065705000281
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In the study of in silico functional genomics, improving the performance of protein function prediction is the ultimate goal for identifying proteins associated with defined cellular functions. The classical prediction approach is to employ pairwise sequence alignments. However, this method often faces difficulties when no statistically significant homologous sequences are identified. An alternative is to predict protein function from sequence-derived features using machine learning. In this case, the choice of features that can be derived from the sequence is of vital importance to ensure adequate discrimination between functions. In this paper we have successfully selected biologically significant features for protein function prediction. This was performed using a new feature selection method (FrankSum) that avoids data distribution assumptions, uses a data-independent measurement (p-value) within the feature, identifies redundancy between features and uses an appropriate ranking criterion for feature selection. We have shown that classifiers generated from features selected by FrankSum outperform classifiers generated from full feature sets, from randomly selected features and from features selected by the Wrapper method. We have also shown that the features are concordant across all species and that the top-ranking features are biologically informative. We conclude that feature selection is vital for successful protein function prediction and that FrankSum is one of the feature selection methods that can be applied successfully to such a domain.
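The abstract names four ingredients (no distribution assumptions, a p-value per feature, redundancy detection, and ranking) but does not give the algorithm itself. The following is only a generic sketch of a filter built from those ingredients, not the published FrankSum procedure. It assumes a rank-sum test with a normal approximation for the per-feature p-value and a Spearman correlation threshold for redundancy; the function names, the `max_corr` threshold and the binary 0/1 labels are all hypothetical choices made for illustration.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_sum_p_value(pos, neg):
    """Two-sided rank-sum p-value (normal approximation, no tie correction)."""
    n1, n2 = len(pos), len(neg)
    ranks = average_ranks(list(pos) + list(neg))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))


def spearman(a, b):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)


def select_features(columns, labels, max_corr=0.9):
    """Rank features by p-value, then skip any feature that is redundant
    (|Spearman| >= max_corr) with a feature already selected.

    columns: dict mapping feature name -> list of values, one per protein.
    labels:  list of 1 (has the function) or 0 (does not), one per protein.
    """
    scored = []
    for name, values in columns.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        scored.append((rank_sum_p_value(pos, neg), name))
    scored.sort()  # smallest p-value first; ties broken by name

    selected = []
    for _, name in scored:
        if all(abs(spearman(columns[name], columns[s])) < max_corr
               for s in selected):
            selected.append(name)
    return selected
```

Given a toy table with a discriminative feature, an exact rescaling of it and an uninformative feature, a filter of this kind keeps the first, drops the rescaled copy as redundant and ranks the uninformative feature last. With real data, a tie-corrected or exact test would be preferable to the normal approximation used here.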
Pages: 259-275
Page count: 17