DIVCLUS: an automatic method in the GEANFAMMER package that finds homologous domains in single- and multi-domain proteins

被引:44
作者
Park, J
Teichmann, SA
机构
[1] MRC, Mol Biol Lab, Cambridge CB2 2QH, England
[2] Ctr Prot Engn, Cambridge CB2 2QH, England
关键词
D O I
10.1093/bioinformatics/14.2.144
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Motivation: Large-scale determination of relationships between the proteins produced by genome sequences is now common. All protein sequences are matched and those that have high match scores are clustered into families. In cases where the proteins al-e built of several domains or duplication modules, this can lead to misleading results. Consider the very simple example of three proteins: I, formed by duplication modules A and B; 2, formed by duplication modules B' and C; and 3, formed by duplication modules C' and D. Duplication modules B and B' are homologous, as are C and C'. Matching the sequences of 1, 2 and 3 followed by simple single-linkage clustering would put all three in the same family even though proteins 1 and 3 are not related. This is because the different pal fs of 2 match 1 and 3. This paper describes a procedure, DIVCLUS, that divides such complex clusters of partially related sequences into simple clusters that contain only related duplication modules. In the example just given, it would produce two groups of sequences: the first with domains B of sequence I and B of sequence 2, and the second with domain C of sequence 2 and C of sequence 3. DIVCLUS is part of a package called GEANFAMMER, for GEnome ANalysis and protein FAMily MakER. The package automates the detection of families of duplication modules fi-om a protein sequence database. Results: DIVCLUS has been applied to the division of single-linkage clusters generated from the protein sequences of sh completely sequenced bacterial genomes. Our of 12 013 genes in these six genomes, 4563 single- and multi-domain sequences formed 1071 complex clusters. Application of the DIVCLUS program resolved these clusters into 2113 clusters corresponding to single duplication modules.
引用
收藏
页码:144 / 150
页数:7
相关论文
共 22 条
[1]   Gapped BLAST and PSI-BLAST: a new generation of protein database search programs [J].
Altschul, SF ;
Madden, TL ;
Schaffer, AA ;
Zhang, JH ;
Zhang, Z ;
Miller, W ;
Lipman, DJ .
NUCLEIC ACIDS RESEARCH, 1997, 25 (17) :3389-3402
[2]   BASIC LOCAL ALIGNMENT SEARCH TOOL [J].
ALTSCHUL, SF ;
GISH, W ;
MILLER, W ;
MYERS, EW ;
LIPMAN, DJ .
JOURNAL OF MOLECULAR BIOLOGY, 1990, 215 (03) :403-410
[3]   The PROSITE database, its status in 1995 [J].
Bairoch, A ;
Bucher, P ;
Hofmann, K .
NUCLEIC ACIDS RESEARCH, 1996, 24 (01) :189-196
[4]   The complete genome sequence of Escherichia coli K-12 [J].
Blattner, FR ;
Plunkett, G ;
Bloch, CA ;
Perna, NT ;
Burland, V ;
Riley, M ;
ColladoVides, J ;
Glasner, JD ;
Rode, CK ;
Mayhew, GF ;
Gregor, J ;
Davis, NW ;
Kirkpatrick, HA ;
Goeden, MA ;
Rose, DJ ;
Mau, B ;
Shao, Y .
SCIENCE, 1997, 277 (5331) :1453-+
[5]  
BRENNER SE, 1996, THESIS U CAMBRIDGE
[6]   Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii [J].
Bult, CJ ;
White, O ;
Olsen, GJ ;
Zhou, LX ;
Fleischmann, RD ;
Sutton, GG ;
Blake, JA ;
FitzGerald, LM ;
Clayton, RA ;
Gocayne, JD ;
Kerlavage, AR ;
Dougherty, BA ;
Tomb, JF ;
Adams, MD ;
Reich, CI ;
Overbeek, R ;
Kirkness, EF ;
Weinstock, KG ;
Merrick, JM ;
Glodek, A ;
Scott, JL ;
Geoghagen, NSM ;
Weidman, JF ;
Fuhrmann, JL ;
Nguyen, D ;
Utterback, TR ;
Kelley, JM ;
Peterson, JD ;
Sadow, PW ;
Hanna, MC ;
Cotton, MD ;
Roberts, KM ;
Hurst, MA ;
Kaine, BP ;
Borodovsky, M ;
Klenk, HP ;
Fraser, CM ;
Smith, HO ;
Woese, CR ;
Venter, JC .
SCIENCE, 1996, 273 (5278) :1058-1073
[7]   WHOLE-GENOME RANDOM SEQUENCING AND ASSEMBLY OF HAEMOPHILUS-INFLUENZAE RD [J].
FLEISCHMANN, RD ;
ADAMS, MD ;
WHITE, O ;
CLAYTON, RA ;
KIRKNESS, EF ;
KERLAVAGE, AR ;
BULT, CJ ;
TOMB, JF ;
DOUGHERTY, BA ;
MERRICK, JM ;
MCKENNEY, K ;
SUTTON, G ;
FITZHUGH, W ;
FIELDS, C ;
GOCAYNE, JD ;
SCOTT, J ;
SHIRLEY, R ;
LIU, LI ;
GLODEK, A ;
KELLEY, JM ;
WEIDMAN, JF ;
PHILLIPS, CA ;
SPRIGGS, T ;
HEDBLOM, E ;
COTTON, MD ;
UTTERBACK, TR ;
HANNA, MC ;
NGUYEN, DT ;
SAUDEK, DM ;
BRANDON, RC ;
FINE, LD ;
FRITCHMAN, JL ;
FUHRMANN, JL ;
GEOGHAGEN, NSM ;
GNEHM, CL ;
MCDONALD, LA ;
SMALL, KV ;
FRASER, CM ;
SMITH, HO ;
VENTER, JC .
SCIENCE, 1995, 269 (5223) :496-512
[8]   THE MINIMAL GENE COMPLEMENT OF MYCOPLASMA-GENITALIUM [J].
FRASER, CM ;
GOCAYNE, JD ;
WHITE, O ;
ADAMS, MD ;
CLAYTON, RA ;
FLEISCHMANN, RD ;
BULT, CJ ;
KERLAVAGE, AR ;
SUTTON, G ;
KELLEY, JM ;
FRITCHMAN, JL ;
WEIDMAN, JF ;
SMALL, KV ;
SANDUSKY, M ;
FUHRMANN, J ;
NGUYEN, D ;
UTTERBACK, TR ;
SAUDEK, DM ;
PHILLIPS, CA ;
MERRICK, JM ;
TOMB, JF ;
DOUGHERTY, BA ;
BOTT, KF ;
HU, PC ;
LUCIER, TS ;
PETERSON, SN ;
SMITH, HO ;
HUTCHISON, CA ;
VENTER, JC .
SCIENCE, 1995, 270 (5235) :397-403
[9]   Complete sequence analysis of the genome of the bacterium Mycoplasma pneumoniae [J].
Himmelreich, R ;
Hilbert, H ;
Plagens, H ;
Pirkl, E ;
Li, BC ;
Herrmann, R .
NUCLEIC ACIDS RESEARCH, 1996, 24 (22) :4420-4449
[10]   PARSER FOR PROTEIN-FOLDING UNITS [J].
HOLM, L ;
SANDER, C .
PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS, 1994, 19 (03) :256-268