Enhanced protein domain discovery using taxonomy

Cited by: 14
Authors
Coin, L [1]
Bateman, A [1]
Durbin, R [1]
Affiliations
[1] Wellcome Trust Sanger Inst, Cambridge CB10 1SA, England
Keywords
Taxonomic Distribution; Negative Sequence; Pfam Family; Asparaginyl; Profile HMMs;
DOI
10.1186/1471-2105-5-56
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification codes
071010; 081704;
Abstract
Background: It is well known that different species have different protein domain repertoires, and indeed that some protein domains are kingdom specific. This information has not yet been incorporated into statistical methods for finding domains in sequences of amino acids. Results: We show that by incorporating our understanding of the taxonomic distribution of specific protein domains, we can enhance domain recognition in protein sequences. We identify 4447 new instances of Pfam domains in the SP-TREMBL database using this technique, equivalent to the coverage increase given by the last 8.3% of Pfam families and to a 0.7% increase in the number of domain predictions. We use PSI-BLAST to cross-validate our new predictions. We also benchmark our approach using a SCOP test set of proteins of known structure, and demonstrate improvements relative to standard Hidden Markov model techniques. Conclusions: Explicitly including knowledge about the taxonomic distribution of protein domains can enhance protein domain recognition. Our method can also incorporate other context-specific domain distributions - such as domain co-occurrence and protein localisation.
Pages: 10
Related papers
14 in total
  • [1] Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
    Altschul, SF
    Madden, TL
    Schaffer, AA
    Zhang, JH
    Zhang, Z
    Miller, W
    Lipman, DJ
    [J]. NUCLEIC ACIDS RESEARCH, 1997, 25 (17) : 3389 - 3402
  • [2] Bateman A, 2004, NUCLEIC ACIDS RES, V32, pD138, DOI 10.1093/nar/gkh121
  • [3] Disulfide bridges of Ergtoxin, a member of a new sub-family of peptide blockers of the ether-a-go-go-related K+ channel
    Bottiglieri, C
    Ferrara, L
    Corona, M
    Gurrola, GB
    Batista, C
    Wanke, E
    Possani, LD
    [J]. FEBS LETTERS, 2000, 479 (03) : 156 - 157
  • [4] ASTRAL compendium enhancements
    Chandonia, JM
    Walker, NS
    Conte, LL
    Koehl, P
    Levitt, M
    Brenner, SE
    [J]. NUCLEIC ACIDS RESEARCH, 2002, 30 (01) : 260 - 263
  • [5] Enhanced protein domain discovery by using language modeling techniques from speech recognition
    Coin, L
    Bateman, A
    Durbin, R
    [J]. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2003, 100 (08) : 4516 - 4520
  • [6] The structure of the tetratricopeptide repeats of protein phosphatase 5: implications for TPR-mediated protein-protein interactions
    Das, AK
    Cohen, PTW
    Barford, D
    [J]. EMBO JOURNAL, 1998, 17 (05) : 1192 - 1199
  • [7] Durbin R., 1998, Biological sequence analysis: Probabilistic models of proteins and nucleic acids
  • [8] Profile hidden Markov models
    Eddy, SR
    [J]. BIOINFORMATICS, 1998, 14 (09) : 755 - 763
  • [9] SCOP: A structural classification of proteins database
    Hubbard, TJP
    Murzin, AG
    Brenner, SE
    Chothia, C
    [J]. NUCLEIC ACIDS RESEARCH, 1997, 25 (01) : 236 - 239
  • [10] HIDDEN MARKOV-MODELS IN COMPUTATIONAL BIOLOGY - APPLICATIONS TO PROTEIN MODELING
    KROGH, A
    BROWN, M
    MIAN, IS
    SJOLANDER, K
    HAUSSLER, D
    [J]. JOURNAL OF MOLECULAR BIOLOGY, 1994, 235 (05) : 1501 - 1531