A Unified Multitask Architecture for Predicting Local Protein Properties

Times cited: 31
Authors
Qi, Yanjun [1]
Oja, Merja [2]
Weston, Jason [3]
Noble, William Stafford [2]
Affiliations
[1] NEC Labs Amer, Machine Learning Dept, Princeton, NJ USA
[2] Univ Washington, Dept Genome Sci, Seattle, WA 98195 USA
[3] Google, New York, NY USA
Source
PLOS ONE | 2012 / Vol. 7 / Issue 03
Funding
Academy of Finland;
Keywords
COMBINED TRANSMEMBRANE TOPOLOGY; SEQUENCE ALIGNMENT PROFILES; SIGNAL PEPTIDE PREDICTION; SECONDARY STRUCTURE; SOLVENT ACCESSIBILITY; COILED COILS; INTERACTION SITES; BINDING RESIDUES; IDENTIFICATION; DATABASE;
DOI
10.1371/journal.pone.0032235
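Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification codes
07 ; 0710 ; 09 ;
Abstract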
A variety of functionally important protein properties, such as secondary structure, transmembrane topology and solvent accessibility, can be encoded as a labeling of amino acids. Indeed, the prediction of such properties from the primary amino acid sequence is one of the core projects of computational biology. Accordingly, a panoply of approaches have been developed for predicting such properties; however, most such approaches focus on solving a single task at a time. Motivated by recent, successful work in natural language processing, we propose to use multitask learning to train a single, joint model that exploits the dependencies among these various labeling tasks. We describe a deep neural network architecture that, given a protein sequence, outputs a host of predicted local properties, including secondary structure, solvent accessibility, transmembrane topology, signal peptides and DNA-binding residues. The network is trained jointly on all these tasks in a supervised fashion, augmented with a novel form of semi-supervised learning in which the model is trained to distinguish between local patterns from natural and synthetic protein sequences. The task-independent architecture of the network obviates the need for task-specific feature engineering. We demonstrate that, for all of the tasks that we considered, our approach leads to statistically significant improvements in performance, relative to a single task neural network approach, and that the resulting model achieves state-of-the-art performance.
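The architecture described in the abstract (one shared network with a separate output per labeling task, plus an auxiliary task that separates natural from synthetic local patterns) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation. The task names, label counts, window size, hidden-layer size, one-hot input encoding, and the way a synthetic window is produced (replacing the central residue) are all assumptions made for illustration, and no training loop is shown.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"  # padding symbol for window positions beyond the sequence ends

# Hypothetical task names and label counts, chosen only for illustration.
TASKS = {
    "secondary_structure": 3,
    "solvent_accessibility": 2,
    "natural_vs_synthetic": 2,  # auxiliary semi-supervised task
}


def windows(seq, half_width):
    """Return one fixed-size window of residues centred on each position."""
    padded = PAD * half_width + seq + PAD * half_width
    return [padded[i:i + 2 * half_width + 1] for i in range(len(seq))]


def one_hot_window(window):
    """Concatenate a one-hot vector per residue (all zeros for padding)."""
    vec = []
    for aa in window:
        vec.extend(1.0 if aa == a else 0.0 for a in AMINO_ACIDS)
    return vec


def init_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these by training."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class MultitaskTagger:
    """Shared hidden layer feeding one softmax output layer per task."""

    def __init__(self, half_width=3, hidden=16, seed=0):
        rng = random.Random(seed)
        self.half_width = half_width
        in_dim = (2 * half_width + 1) * len(AMINO_ACIDS)
        self.shared = init_matrix(hidden, in_dim, rng)  # used by every task
        self.heads = {t: init_matrix(n, hidden, rng) for t, n in TASKS.items()}

    def predict(self, seq):
        """Return, per task, a label distribution for every residue."""
        out = {task: [] for task in TASKS}
        for window in windows(seq, self.half_width):
            features = matvec(self.shared, one_hot_window(window))
            hidden = [math.tanh(z) for z in features]
            for task, head in self.heads.items():
                out[task].append(softmax(matvec(head, hidden)))
        return out


def synthetic_window(window, rng):
    """Assumed corruption: swap the central residue for a different one."""
    mid = len(window) // 2
    choices = [a for a in AMINO_ACIDS if a != window[mid]]
    return window[:mid] + rng.choice(choices) + window[mid + 1:]


if __name__ == "__main__":
    model = MultitaskTagger()
    predictions = model.predict("MKTAYIAKQR")
    for task, per_residue in predictions.items():
        print(task, len(per_residue), "residues,", len(per_residue[0]), "labels")
```

Because every task reads the same hidden representation, a gradient from any one task's loss would update the shared weights, which is the mechanism by which joint training lets tasks inform one another in a design of this kind.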
Pages: 11