Predicting protein subcellular locations using hierarchical ensemble of Bayesian classifiers based on Markov chains

Cited by: 69
Authors
Bulashevska, Alla [1]
Eils, Roland [1]
Affiliations
[1] Deutsches Krebsforschungszentrum (DKFZ), Department of Theoretical Bioinformatics, D-69120 Heidelberg, Germany
DOI
10.1186/1471-2105-7-298
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: The subcellular location of a protein is closely related to its function. It would be worthwhile to develop a method that predicts the subcellular location of a protein when only its amino acid sequence is known. Although many efforts have been made to predict subcellular location from sequence information alone, further research is needed to improve prediction accuracy.
Results: A novel method called HensBC is introduced to predict protein subcellular location. HensBC is a recursive algorithm that constructs a hierarchical ensemble of classifiers. The classifiers used are Bayesian classifiers based on Markov chain models. We tested our method on six different datasets, including a Gram-negative bacteria dataset, a dataset for discriminating outer membrane proteins, and an apoptosis protein dataset. We observed that our method can predict the subcellular location with high accuracy. Another advantage of the proposed method is that it can improve prediction accuracy for classes with few training sequences, and it is therefore useful for datasets with an imbalanced class distribution.
Conclusion: This study introduces an algorithm that uses only the primary sequence of a protein to predict its subcellular location. The proposed recursive scheme represents an interesting methodology for learning and combining classifiers. The method is computationally efficient and, as the empirical results indicate, competitive with previously reported approaches in terms of prediction accuracy. The code for the software is available upon request.
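As a rough illustration of the base classifier named in the abstract (a Bayesian classifier built on Markov chain models), the sketch below fits one first-order Markov chain per class and assigns a sequence to the class with the highest log posterior. This is not the authors' HensBC code: the chain order, the Laplace pseudocount `alpha`, the 20-letter alphabet, and the class name `MarkovChainBayes` are all assumptions made for the example, and the recursive hierarchical ensemble is not reproduced because the abstract does not specify it.

```python
import math
from collections import defaultdict

# Standard 20 amino acids; other symbols (e.g. X, B) are dropped (an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def _clean(seq):
    """Keep only the residues that belong to the assumed alphabet."""
    return [ch for ch in seq.upper() if ch in AMINO_ACIDS]


class MarkovChainBayes:
    """Hypothetical sketch: one first-order Markov chain per class plus Bayes' rule."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha      # Laplace pseudocount (assumed, not from the paper)
        self.log_prior = {}     # class -> log P(class)
        self.log_init = {}      # class -> {a: log P(first residue = a | class)}
        self.log_trans = {}     # class -> {(a, b): log P(b | a, class)}

    def fit(self, sequences, labels):
        n = len(AMINO_ACIDS)
        for c in sorted(set(labels)):
            seqs = [_clean(s) for s, y in zip(sequences, labels) if y == c]
            init = defaultdict(float)
            trans = defaultdict(float)
            for s in seqs:
                if not s:
                    continue
                init[s[0]] += 1
                for a, b in zip(s, s[1:]):
                    trans[(a, b)] += 1
            # Class prior estimated from the class frequencies in the training set.
            self.log_prior[c] = math.log(len(seqs) / len(labels))
            init_total = sum(init.values()) + self.alpha * n
            self.log_init[c] = {
                a: math.log((init[a] + self.alpha) / init_total)
                for a in AMINO_ACIDS
            }
            table = {}
            for a in AMINO_ACIDS:
                row_total = sum(trans[(a, b)] for b in AMINO_ACIDS) + self.alpha * n
                for b in AMINO_ACIDS:
                    table[(a, b)] = math.log((trans[(a, b)] + self.alpha) / row_total)
            self.log_trans[c] = table
        return self

    def scores(self, seq):
        """Unnormalised log posterior of each class for one sequence."""
        s = _clean(seq)
        out = {}
        for c, prior in self.log_prior.items():
            total = prior
            if s:
                total += self.log_init[c][s[0]]
                total += sum(self.log_trans[c][pair] for pair in zip(s, s[1:]))
            out[c] = total
        return out

    def predict(self, seq):
        """Return the class with the highest log posterior."""
        sc = self.scores(seq)
        return max(sc, key=sc.get)
```

A call such as `MarkovChainBayes().fit(train_sequences, train_labels).predict(query_sequence)` returns the predicted class label, where the three argument names are placeholders for the caller's own data. Working in log space avoids numerical underflow on long protein sequences, and the pseudocount keeps unseen residue pairs from producing a zero probability.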
Pages: 13