Protein subfamily assignment using the Conserved Domain Database

Cited by: 21
Authors
Fong J.H. [1]
Marchler-Bauer A. [1]
Affiliations
[1] National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894
Funding
National Institutes of Health (USA)
Keywords
Domain Model; Alignment Score; Sequence Interval; Domain Assignment; Query Protein Sequence
DOI
10.1186/1756-0500-1-114
Abstract
Background. Domains, evolutionarily conserved units of proteins, are widely used to classify protein sequences and infer protein function. Often, two or more overlapping domain models match a region of a protein sequence. Therefore, procedures are required to choose appropriate domain annotations for the protein. Here, we propose a method for assigning NCBI-curated domains from the Conserved Domain Database (CDD) that takes into account the organization of the domains into hierarchies of homologous domain models.
Findings. Our analysis of alignment scores from NCBI-curated domain assignments suggests that identifying the correct model among closely related models is more difficult than choosing between non-overlapping domain models. We find that simple heuristics based on sorting scores and domain-specific thresholds are effective at reducing classification error. In fact, in our test set, the heuristics result in almost 90% of current misclassifications due to missing domain subfamilies being replaced by more generic domain assignments, thereby eliminating a significant amount of error within the database.
Conclusion. Our proposed domain subfamily assignment rule has been incorporated into the CD-Search software for assigning CDD domains to query protein sequences and has significantly improved pre-calculated domain annotations on protein sequences in NCBI's Entrez resource.
© 2008 Fong et al; licensee BioMed Central Ltd.
Related papers
23 in total
[1] Gilks W.R., Audit B., De Angelis D., Tsoka S., Ouzounis C.A., Percolation of annotation errors through hierarchically structured protein sequence databases, Math Biosci, 193, pp. 223-234, (2005)
[2] Galperin M.Y., Koonin E.V., Sources of systematic error in functional annotation of genomes: Domain rearrangement, non-orthologous gene displacement and operon disruption, In Silico Biol, 1, pp. 55-67, (1998)
[3] Marchler-Bauer A., Anderson J.B., Cherukuri P.F., Deweese-Scott C., Geer L.Y., Gwadz M., He S., Hurwitz D.I., Jackson J.D., Ke Z., et al., CDD: A Conserved Domain Database for protein classification, Nucleic Acids Res, 33, pp. D192-D196, (2005)
[4] Marchler-Bauer A., Anderson J.B., Derbyshire M.K., Deweese-Scott C., Gonzales N.R., Gwadz M., Hao L., He S., Hurwitz D.I., Jackson J.D., et al., CDD: A conserved domain database for interactive domain family analysis, Nucleic Acids Res, 35, pp. D237-D240, (2007)
[5] Snel B., Bork P., Huynen M., Genome evolution. Gene fusion versus gene fission, Trends Genet, 16, pp. 9-11, (2000)
[6] Bornberg-Bauer E., Beaussart F., Kummerfeld S.K., Teichmann S.A., Weiner J. III, The evolution of domain arrangements in proteins and interaction networks, Cell Mol Life Sci, 62, pp. 435-445, (2005)
[7] Eddy S.R., Profile hidden Markov models, Bioinformatics, 14, pp. 755-763, (1998)
[8] Marchler-Bauer A., Panchenko A.R., Shoemaker B.A., Thiessen P.A., Geer L.Y., Bryant S.H., CDD: A database of conserved domain alignments with links to domain three-dimensional structure, Nucleic Acids Res, 30, pp. 281-283, (2002)
[9] Bateman A., Coin L., Durbin R., Finn R.D., Hollich V., Griffiths-Jones S., Khanna A., Marshall M., Moxon S., Sonnhammer E.L., et al., The Pfam protein families database, Nucleic Acids Res, 32, pp. D138-D141, (2004)
[10] Finn R.D., Mistry J., Schuster-Bockler B., Griffiths-Jones S., Hollich V., Lassmann T., Moxon S., Marshall M., Khanna A., Durbin R., et al., Pfam: Clans, web tools and services, Nucleic Acids Res, 34, pp. D247-D251, (2006)