Learning sparse models for a dynamic Bayesian network classifier of protein secondary structure

Cited by: 33
Authors
Aydin, Zafer [1]
Singh, Ajit [2]
Bilmes, Jeff [2]
Noble, William S. [1, 3]
Affiliations
[1] Univ Washington, Dept Genome Sci, Seattle, WA 98195 USA
[2] Univ Washington, Dept Elect Engn, Seattle, WA 98195 USA
[3] Univ Washington, Dept Comp Sci & Engn, Seattle, WA 98195 USA
Source
BMC BIOINFORMATICS | 2011 / Volume 12
Keywords
HIDDEN MARKOV-MODELS; SEQUENCE ALIGNMENT PROFILES; STRUCTURE PREDICTION
DOI
10.1186/1471-2105-12-154
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Protein secondary structure prediction provides insight into protein function and is a valuable preliminary step for predicting the 3D structure of a protein. Dynamic Bayesian networks (DBNs) and support vector machines (SVMs) have been shown to provide state-of-the-art performance in secondary structure prediction. As the size of the protein database grows, it becomes feasible to use a richer model in an effort to capture subtle correlations among the amino acids and the predicted labels. In this context, it is beneficial to derive sparse models that discourage over-fitting and provide biological insight.
Results: In this paper, we first show that we are able to obtain accurate secondary structure predictions. Our per-residue accuracy on a well established and difficult benchmark (CB513) is 80.3%, which is comparable to the state-of-the-art evaluated on this dataset. We then introduce an algorithm for sparsifying the parameters of a DBN. Using this algorithm, we can automatically remove up to 70-95% of the parameters of a DBN while maintaining the same level of predictive accuracy on the SD576 set. At 90% sparsity, we are able to compute predictions three times faster than a fully dense model evaluated on the SD576 set. We also demonstrate, using simulated data, that the algorithm is able to recover true sparse structures with high accuracy, and using real data, that the sparse model identifies known correlation structure (local and non-local) related to different classes of secondary structure elements.
Conclusions: We present a secondary structure prediction method that employs dynamic Bayesian networks and support vector machines. We also introduce an algorithm for sparsifying the parameters of the dynamic Bayesian network. The sparsification approach yields a significant speed-up in generating predictions, and we demonstrate that the amino acid correlations identified by the algorithm correspond to several known features of protein secondary structure. Datasets and source code used in this study are available at http://noble.gs.washington.edu/proj/pssp.
Pages: 21