Automatic annotation of protein function based on family identification

被引:37
作者
Abascal, F [1 ]
Valencia, A [1 ]
机构
[1] CSIC, Prot Design Grp, Natl Biotechnol Ctr, CNB, E-28049 Madrid, Spain
关键词
protein function prediction; genome analysis; protein families and subfamilies; orthologues and paralogues; annotation inconsistencies; database errors;
D O I
10.1002/prot.10449
中图分类号
Q5 [生物化学]; Q7 [分子生物学];
学科分类号
071010 ; 081704 ;
摘要
Although genomes are being sequenced at an impressive rate, the information generated tells us little about protein function, which is slow to characterize by traditional methods. Automatic protein function annotation based on computational methods has alleviated this imbalance. The most powerful current approach for inferring the function of new proteins is by studying the annotations of their homologues, since their common origin is assumed to be reflected in their structure and function. Unfortunately, as proteins evolve they acquire new functions, so annotation based on homology must be carried out in the context of orthologues or subfamilies. Evolution adds new complications through domain shuffling: homology (or orthology) frequently corresponds to domains rather than complete proteins. Moreover, the function of a protein may be seen as the result of combining the functions of its domains. Additionally, automatic annotation has to deal with problems related to the annotations in the databases: errors (which are likely to be propagated), inconsistencies, or different degrees of function specification. We describe a method that addresses these difficulties for the annotation of protein function. Sequence relationships are detected and measured to obtain a map of the sequence space, which is searched for differentiated groups of proteins (similar to islands on the map), which are expected to have a common function and correspond to groups of orthologues or subfamilies. This mapmaking is done by applying a clustering algorithm based on Normalized cuts in graphs. The domain problem is addressed in a simple way: pairwise local alignments are analyzed to determine the extent to which they cover the entire sequence lengths of the two proteins. This analysis determines both what homologues are preferred for functional inheritance and the level of confidence of the annotation. To alleviate the problems associated with database annotations, the information on all the homologues that are grouped together with the query protein are taken into account to select the most representative functional descriptors. This method has been applied for the annotation of the genome of Buchnera aphidicola (specific host Baizongia pistaciae). Human inspection of the annotations allowed an estimation of accuracy of 94%; the different kinds of error that may appear when using this approach are described. Results can be accessed at http://www.pdg.cnb.uam.es/funcut.html. The programs are available upon request, although installation in other systems may be complicated. (C) 2003 Wiley-Liss, Inc.
引用
收藏
页码:683 / 692
页数:10
相关论文
共 34 条
[1]   Clustering of proximal sequence space for the identification of protein families [J].
Abascal, F ;
Valencia, A .
BIOINFORMATICS, 2002, 18 (07) :908-921
[2]  
Altschul SF, 1996, METHOD ENZYMOL, V266, P460
[3]   Gapped BLAST and PSI-BLAST: a new generation of protein database search programs [J].
Altschul, SF ;
Madden, TL ;
Schaffer, AA ;
Zhang, JH ;
Zhang, Z ;
Miller, W ;
Lipman, DJ .
NUCLEIC ACIDS RESEARCH, 1997, 25 (17) :3389-3402
[4]   Automated genome sequence analysis and annotation [J].
Andrade, MA ;
Brown, NP ;
Leroy, C ;
Hoersch, S ;
de Daruvar, A ;
Reich, C ;
Franchini, A ;
Tamames, J ;
Valencia, A ;
Ouzounis, C ;
Sander, C .
BIOINFORMATICS, 1999, 15 (05) :391-412
[5]  
ANDRADE MA, 1999, ISMB, V7, P28
[6]   Gene Ontology: tool for the unification of biology [J].
Ashburner, M ;
Ball, CA ;
Blake, JA ;
Botstein, D ;
Butler, H ;
Cherry, JM ;
Davis, AP ;
Dolinski, K ;
Dwight, SS ;
Eppig, JT ;
Harris, MA ;
Hill, DP ;
Issel-Tarver, L ;
Kasarskis, A ;
Lewis, S ;
Matese, JC ;
Richardson, JE ;
Ringwald, M ;
Rubin, GM ;
Sherlock, G .
NATURE GENETICS, 2000, 25 (01) :25-29
[7]   Genomics - The babel of bioinformatics [J].
Attwood, TK .
SCIENCE, 2000, 290 (5491) :471-473
[8]   The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000 [J].
Bairoch, A ;
Apweiler, R .
NUCLEIC ACIDS RESEARCH, 2000, 28 (01) :45-48
[9]  
Blaschke Christian, 2002, Brief Bioinform, V3, P154, DOI 10.1093/bib/3.2.154
[10]   Predicting function: From genes to genomes and back [J].
Bork, P ;
Dandekar, T ;
Diaz-Lazcoz, Y ;
Eisenhaber, F ;
Huynen, M ;
Yuan, YP .
JOURNAL OF MOLECULAR BIOLOGY, 1998, 283 (04) :707-725