Assignment of protein sequences to existing domain and family classification systems: Pfam and the PDB

Times cited: 47
Authors
Xu, Qifang [1]
Dunbrack, Roland L., Jr. [1]
Affiliations
[1] Fox Chase Canc Ctr, Inst Canc Res, Philadelphia, PA 19111 USA
Keywords
COMPREHENSIVE DATABASE; STRUCTURAL GENOMICS; PSI-BLAST
DOI
10.1093/bioinformatics/bts533
Chinese Library Classification number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Motivation: Automating the assignment of existing domain and protein family classifications to new sets of sequences is an important task. Current methods often miss assignments because remote relationships fail to achieve statistical significance. Some assignments are shorter than the actual domain definitions because local alignment methods often cut alignments short. Long insertions in query sequences often cause two copies of a domain to be erroneously assigned to the query. Divergent repeat sequences in proteins are often missed.

Results: We have developed a multilevel procedure to produce nearly complete assignments of protein families of an existing classification system to a large set of sequences. We apply this to the task of assigning Pfam domains to sequences and structures in the Protein Data Bank (PDB). We found that HHsearch alignments frequently scored more remotely related Pfams in Pfam clans higher than closely related Pfams, leading to erroneous assignment at the Pfam family level. A greedy algorithm allowing for partial overlaps was therefore applied first to sequence/HMM alignments, then to HMM-HMM alignments and then to structure alignments, taking care to join partial alignments split by large insertions into single-domain assignments. Additional assignment of repeat Pfams with weaker E-values was allowed after stronger assignments of the repeat HMM. Our assignments, available in a database called PDBfam, include Pfams for 99.4% of chains longer than 50 residues.
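The greedy, tiered assignment described in the Results can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Hit` record, the `greedy_assign` function and the `max_overlap` threshold of 10 residues are hypothetical names and values chosen for illustration, and the sketch leaves out the joining of alignments split by large insertions and the relaxed E-value rule for repeat Pfams. It only shows the core idea of accepting hits in order of E-value, one evidence tier at a time, while tolerating small overlaps with hits already accepted.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One candidate domain assignment on a query chain (hypothetical record)."""
    family: str    # family identifier, e.g. a Pfam name
    start: int     # first aligned query residue, 1-based inclusive
    end: int       # last aligned query residue, inclusive
    evalue: float  # alignment E-value; smaller is stronger


def overlap(a: Hit, b: Hit) -> int:
    """Number of query residues covered by both hits."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def greedy_assign(tiers: List[List[Hit]], max_overlap: int = 10) -> List[Hit]:
    """Accept hits tier by tier, strongest E-value first within each tier.

    `tiers` is ordered by evidence level (for example sequence/HMM hits,
    then HMM-HMM hits, then structure-based hits). A hit is kept only if
    it shares at most `max_overlap` residues with every hit kept so far.
    The threshold value is an assumption, not taken from the paper.
    """
    accepted: List[Hit] = []
    for tier in tiers:
        for hit in sorted(tier, key=lambda h: h.evalue):
            if all(overlap(hit, kept) <= max_overlap for kept in accepted):
                accepted.append(hit)
    # Report the assignments in sequence order.
    return sorted(accepted, key=lambda h: h.start)


if __name__ == "__main__":
    tier1 = [Hit("A", 1, 100, 1e-30), Hit("C", 50, 150, 1e-20), Hit("B", 95, 200, 1e-10)]
    tier2 = [Hit("E", 150, 260, 1e-8), Hit("D", 201, 300, 1e-5)]
    for h in greedy_assign([tier1, tier2]):
        print(h.family, h.start, h.end)
```

In this toy input, hit C is rejected because it overlaps the stronger hit A by 51 residues, while B is kept because it overlaps A by only 6 residues. Hits from the second tier are considered only after all first-tier hits have been placed, so a weaker-tier hit can never displace a stronger-tier assignment.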
Pages: 2763-2772
Number of pages: 10