Exhaustive enumeration of protein domain families

Cited by: 100
Authors
Heger, A [1 ]
Holm, L [1 ]
Affiliation
[1] EMBL, EBI, Cambridge CB10 1SD, England
Keywords
homology; evolution; protein family; domain; maximum likelihood
DOI
10.1016/S0022-2836(03)00269-9
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Domains are considered the basic units of protein folding, evolution, and function. Decomposing each protein into modular domains is thus a basic prerequisite for accurate functional classification of biological molecules. Here, we present ADDA, an automatic algorithm for domain decomposition and clustering of all protein domain families. We use alignments derived from an all-on-all sequence comparison to define domains within protein sequences based on a global maximum likelihood model. In all, 90% of domain boundaries are predicted within 10% of domain size when compared with the manual domain definitions given in the SCOP database. A representative database of 249,264 protein sequences was decomposed into 450,462 domains. These domains were clustered on the basis of sequence similarities into 33,879 domain families containing at least two members with less than 40% sequence identity. Validation against family definitions in the manually curated databases SCOP and PFAM indicates almost perfect unification of various large domain families while contamination by unrelated sequences remains at a low level. The global survey of protein-domain space by ADDA confirms that most large and universal domain families are already described in PFAM and/or SMART. However, a survey of the complete set of mobile modules leads to the identification of 1479 new interesting domain families which shuffle around in multi-domain proteins. The data are publicly available at ftp://ftp.ebi.ac.uk/pub/contrib/heger/adda. © 2003 Elsevier Science Ltd. All rights reserved.
Pages: 749-767
Page count: 19