CATHEDRAL: A fast and effective algorithm to predict folds and domain boundaries from multidomain protein structures

Cited by: 63
Authors
Redfern, Oliver C. [1]
Harrison, Andrew
Dallman, Tim
Pearl, Frances M. G.
Orengo, Christine A.
Affiliations
[1] UCL, Dept Biochem & Mol Biol, London, England
[2] Univ Essex, Dept Math Sci & Biol Sci, Colchester CO4 3SQ, Essex, England
DOI
10.1371/journal.pcbi.0030232
Chinese Library Classification number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
We present CATHEDRAL, an iterative protocol for determining the location of previously observed protein folds in novel multidomain protein structures. CATHEDRAL builds on the features of a fast secondary-structure-based method (using graph theory) to locate known folds within a multidomain context and a residue-based, double-dynamic programming algorithm, which is used to align members of the target fold groups against the query protein structure to identify the closest relative and assign domain boundaries. To increase the fidelity of the assignments, a support vector machine is used to provide an optimal scoring scheme. Once a domain is verified, it is excised, and the search protocol is repeated in an iterative fashion until all recognisable domains have been identified. We have performed an initial benchmark of CATHEDRAL against other publicly available structure comparison methods using a consensus dataset of domains derived from the CATH and SCOP domain classifications. CATHEDRAL shows superior performance in fold recognition and alignment accuracy when compared with many equivalent methods. If a novel multidomain structure contains a known fold, CATHEDRAL will locate it in 90% of cases, with <1% false positives. For nearly 80% of assigned domains in a manually validated test set, the boundaries were correctly delineated within a tolerance of ten residues. For the remaining cases, previously classified domains were very remotely related to the query chain so that embellishments to the core of the fold caused significant differences in domain sizes and manual refinement of the boundaries was necessary. To put this performance in context, a well-established sequence method based on hidden Markov models was only able to detect 65% of domains, with 33% of the subsequent boundaries assigned within ten residues.
Since, on average, 50% of newly determined protein structures contain more than one domain unit, and typically 90% or more of these domains are already classified in CATH, CATHEDRAL will considerably facilitate the automation of protein structure classification.
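The excise-and-repeat loop described in the abstract can be sketched in a few lines of Python. This is only an illustration of the control flow, not the authors' implementation: the `DomainHit` type, the `search` callback (which stands in for the graph-based fold scan, the double-dynamic-programming alignment, and the SVM scoring combined), and the `min_domain_size` cutoff of 40 residues are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional, Set


@dataclass(frozen=True)
class DomainHit:
    """A verified match of a known fold to query residues (hypothetical type)."""
    fold_id: str
    residues: FrozenSet[int]  # residue indices assigned to this domain
    score: float


# Hypothetical callback: given the residues still unassigned, return the best
# verified hit, or None when no recognisable domain remains.
SearchFn = Callable[[Set[int]], Optional[DomainHit]]


def iterative_domain_assignment(chain_length: int,
                                search: SearchFn,
                                min_domain_size: int = 40) -> List[DomainHit]:
    """Repeat search -> verify -> excise until nothing recognisable is left.

    min_domain_size is an assumed stopping threshold, not a value from the paper.
    """
    remaining = set(range(chain_length))
    assigned: List[DomainHit] = []
    while len(remaining) >= min_domain_size:
        hit = search(remaining)
        if hit is None:
            break  # no further known fold found in the unassigned residues
        excised = set(hit.residues) & remaining
        if not excised:
            break  # guard against a search that makes no progress
        assigned.append(hit)
        remaining -= excised  # excise the verified domain before the next pass
    return assigned


def toy_search(remaining: Set[int]) -> Optional[DomainHit]:
    """Toy stand-in for the real search: a fold matches only if its whole
    residue span is still unassigned."""
    library = {"fold_A": range(0, 100), "fold_B": range(100, 220)}
    for fold_id, span in library.items():
        if set(span) <= remaining:
            return DomainHit(fold_id, frozenset(span), score=1.0)
    return None


hits = iterative_domain_assignment(250, toy_search)
print([h.fold_id for h in hits])
```

With the toy library, the 250-residue chain yields two domains; the 30 residues left after the second excision fall below the assumed size cutoff, so the loop stops.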
Pages: 2333-2347
Page count: 15
Related papers
46 entries in total
  • [1] Domain combinations in archaeal, eubacterial and eukaryotic proteomes
    Apic, G
    Gough, J
    Teichmann, SA
    [J]. JOURNAL OF MOLECULAR BIOLOGY, 2001, 310 (02) : 311 - 325
  • [2] Bateman A, 2002, NUCLEIC ACIDS RES, V30, P276, DOI [10.1093/nar/gkr1065, 10.1093/nar/gkp985, 10.1093/nar/gkh121]
  • [3] The Protein Data Bank
    Berman, HM
    Westbrook, J
    Feng, Z
    Gilliland, G
    Bhat, TN
    Weissig, H
    Shindyalov, IN
    Bourne, PE
    [J]. NUCLEIC ACIDS RESEARCH, 2000, 28 (01) : 235 - 242
  • [4] CRYSTAL-STRUCTURE OF CATALASE HPII FROM ESCHERICHIA-COLI
    BRAVO, J
    VERDAGUER, N
    TORMO, J
    BETZEL, C
    SWITALA, J
    LOEWEN, PC
    FITA, I
    [J]. STRUCTURE, 1995, 3 (05) : 491 - 502
  • [5] Bovine beta-lactoglobulin at 1.8 angstrom resolution - Still an enigmatic lipocalin
    Brownlow, S
    Cabral, JHM
    Cooper, R
    Flower, DR
    Yewdall, SJ
    Polikarpov, I
    North, ACT
    Sawyer, L
    [J]. STRUCTURE, 1997, 5 (04) : 481 - 495
  • [6] The impact of structural genomics: Expectations and outcomes
    Chandonia, JM
    Brenner, SE
    [J]. SCIENCE, 2006, 311 (5759) : 347 - 351
  • [7] A unifold, mesofold, and superfold model of protein fold use
    Coulson, AFW
    Moult, J
    [J]. PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS, 2002, 46 (01) : 61 - 71
  • [8] Hidden Markov models
    Eddy, SR
    [J]. CURRENT OPINION IN STRUCTURAL BIOLOGY, 1996, 6 (03) : 361 - 365
  • [9] SnapDRAGON: a method to delineate protein structural domains from sequence data
    George, RA
    Heringa, J
    [J]. JOURNAL OF MOLECULAR BIOLOGY, 2002, 316 (03) : 839 - 851
  • [10] Whole genome protein domain analysis using a new method for domain clustering
    Gouzy, J
    Corpet, F
    Kahn, D
    [J]. COMPUTERS & CHEMISTRY, 1999, 23 (3-4) : 333 - 340