A novel representation of protein sequences for prediction of subcellular location using support vector machines

Cited by: 124
Authors
Matsuda, S [1]
Vert, JP
Saigo, H
Ueda, N
Toh, H
Akutsu, T
Affiliations
[1] Kyoto Univ, Inst Chem Res, Bioinformat Ctr, Kyoto 6110011, Japan
[2] Ecole Mines, Ctr Geostat, F-77300 Fontainebleau, France
[3] Kyushu Univ, Med Inst Bioregulat, Div Bioinformat, Fukuoka 8128582, Japan
Keywords
subcellular location; signal sequence; amino acid composition; distance frequency; support vector machine; predictive accuracy;
DOI
10.1110/ps.051597405
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
As the number of complete genomes rapidly increases, accurate methods to automatically predict the subcellular location of proteins are increasingly useful to help their functional annotation. In order to improve the predictive accuracy of the many prediction methods developed to date, a novel representation of protein sequences is proposed. This representation involves local compositions of amino acids and twin amino acids, and local frequencies of distance between successive (basic, hydrophobic, and other) amino acids. For calculating the local features, each sequence is split into three parts: N-terminal, middle, and C-terminal. The N-terminal part is further divided into four regions to consider ambiguity in the length and position of signal sequences. We tested this representation with support vector machines on two data sets extracted from the SWISS-PROT database. Through fivefold cross-validation tests, overall accuracies of more than 87% and 91% were obtained for eukaryotic and prokaryotic proteins, respectively. It is concluded that considering the respective features in the N-terminal, middle, and C-terminal parts is helpful to predict the subcellular location.
Pages: 2804-2813
Page count: 10