Achieving 80% ten-fold cross-validated accuracy for secondary structure prediction by large-scale training

Cited by: 102
Authors
Dor, Ofer
Zhou, Yaoqi [1]
Affiliations
[1] Indiana Univ Purdue Univ, Sch Informat, Indianapolis, IN 46202 USA
[2] SUNY Buffalo, Howard Hughes Med Inst, Ctr Single Mol Biophys, Dept Physiol & Biophys, Buffalo, NY 14214 USA
[3] Indiana Univ, Sch Med, Ctr Computat Biol & Bioinformat, Indianapolis, IN 46202 USA
Keywords
solvent accessibility; solvent accessible surface area; neural network;
DOI
10.1002/prot.21298
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
An integrated system of neural networks, called SPINE, is established and optimized for predicting structural properties of proteins. SPINE is applied to three-state secondary-structure and residue-solvent-accessibility (RSA) prediction in this paper. The integrated neural networks are carefully trained with a large dataset of 2640 chains, sequence profiles generated from multiple sequence alignment, representative amino acid properties, a slow learning rate, overfitting protection, and an optimized sliding-window size. More than 200,000 weights in SPINE are optimized by maximizing the accuracy measured by Q(3) (the percentage of correctly classified residues). SPINE yields a 10-fold cross-validated accuracy of 79.5% (80.0% for chains of length between 50 and 300) in secondary-structure prediction after one-month (CPU time) training on 22 processors. An accuracy of 87.5% is achieved for exposed residues (RSA > 95%). The latter approaches the theoretical upper limit of 88-90% accuracy in assigning secondary structures. An accuracy of 73% for three-state solvent-accessibility prediction (25%/75% cutoff) and 79.3% for two-state prediction (25% cutoff) is also obtained.
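The abstract defines Q(3) as the percentage of correctly classified residues. As a minimal sketch of that definition only (not the SPINE code), the calculation could look like the following. The function name `q3_accuracy`, the one-letter state codes H/E/C, and the example strings are assumptions made for illustration; they do not come from the paper.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Return the percentage of residues whose predicted state matches the observed state.

    Both arguments are strings of per-residue three-state labels
    (assumed here to be H = helix, E = strand, C = coil).
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    # Count positions where the two label strings agree.
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical ten-residue example: 8 of 10 positions agree.
predicted = "HHHHCCEEEC"
observed = "HHHCCCEEEE"
print(q3_accuracy(predicted, observed))  # 80.0
```

In a cross-validated setting, a score like this would be computed on the held-out fold only; how the paper aggregates it over folds is not stated in the abstract.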
Pages: 838-845
Number of pages: 8
Related papers
64 in total
[1]   Combining prediction of secondary structure and solvent accessibility in proteins [J].
Adamczak, R ;
Porollo, A ;
Meller, J .
PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS, 2005, 59 (03) :467-475
[2]   Accurate prediction of solvent accessibility using neural networks-based regression [J].
Adamczak, R ;
Porollo, A ;
Meller, J .
PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS, 2004, 56 (04) :753-767
[3]   Real value prediction of solvent accessibility from amino acid sequence [J].
Ahmad, S ;
Gromiha, MM ;
Sarai, A .
PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS, 2003, 50 (04) :629-635
[4]   NETASA: neural network based prediction of solvent accessibility [J].
Ahmad, S ;
Gromiha, MM .
BIOINFORMATICS, 2002, 18 (06) :819-824
[5]   Gapped BLAST and PSI-BLAST: a new generation of protein database search programs [J].
Altschul, SF ;
Madden, TL ;
Schaffer, AA ;
Zhang, JH ;
Zhang, Z ;
Miller, W ;
Lipman, DJ .
NUCLEIC ACIDS RESEARCH, 1997, 25 (17) :3389-3402
[6]   USE OF CONDITIONAL PROBABILITIES FOR DETERMINING RELATIONSHIPS BETWEEN AMINO-ACID-SEQUENCE AND PROTEIN SECONDARY STRUCTURE [J].
ARNOLD, GE ;
DUNKER, AK ;
JOHNS, SJ ;
DOUTHART, RJ .
PROTEINS-STRUCTURE FUNCTION AND GENETICS, 1992, 12 (04) :382-399
[7]   Prediction of protein continuum secondary structure with probabilistic models based on NMR solved structures [J].
Bodén, M ;
Yuan, Z ;
Bailey, TL .
BMC BIOINFORMATICS, 2006, 7 (1)
[8]   Predicting residue solvent accessibility from protein sequence by considering the sequence environment [J].
Carugo, O .
PROTEIN ENGINEERING, 2000, 13 (09) :607-609
[9]  
Chandonia JM, 1999, PROTEINS, V35, P293
[10]   PREDICTION OF PROTEIN CONFORMATION [J].
CHOU, PY ;
FASMAN, GD .
BIOCHEMISTRY, 1974, 13 (02) :222-245