Mining the structural genomics pipeline: Identification of protein properties that affect high-throughput experimental analysis

Cited by: 108
Authors
Goh, CS
Lan, N
Douglas, SM
Wu, BL
Echols, N
Smith, A
Milburn, D
Montelione, GT
Zhao, HY
Gerstein, M
Affiliations
[1] Yale Univ, Dept Epidemiol & Publ Hlth, New Haven, CT 06520 USA
[2] Univ Med & Dent New Jersey, NE Struct Genom Consortium, Robert Wood Johnson Med Sch, Piscataway, NJ 08854 USA
[3] Univ Med & Dent New Jersey, Robert Wood Johnson Med Sch, Ctr Adv Biotechnol & Med, Piscataway, NJ 08854 USA
[4] Rutgers State Univ, Robert Wood Johnson Med Sch, Dept Mol Biol & Biochem, UMDNJ, Piscataway, NJ 08854 USA
[5] Univ Med & Dent New Jersey, Robert Wood Johnson Med Sch, Dept Biochem, Piscataway, NJ 08854 USA
[6] Yale Univ, Dept Genet, New Haven, CT 06520 USA
Funding
National Institutes of Health (USA); National Science Foundation (USA);
Keywords
structural genomics; COGs; charged residues; hydrophobicity; decision trees;
DOI
10.1016/j.jmb.2003.11.053
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification codes
071010 ; 081704 ;
Abstract
Structural genomics projects represent major undertakings that will change our understanding of proteins. They generate unique datasets that, for the first time, present a standardized view of proteins in terms of their physical and chemical properties. By analyzing these datasets here, we are able to discover correlations between a protein's characteristics and its progress through each stage of the structural genomics pipeline, from cloning, expression, and purification through to structural determination. First, we use tree-based analyses (decision trees and random forest algorithms) to discover the most significant protein features that influence a protein's amenability to high-throughput experimentation. Based on this, we identify potential bottlenecks in various stages of the structural genomics process through specialized "pipeline schematics". We find that the properties of a protein that are most significant are: (i) whether it is conserved across many organisms; (ii) the percentage composition of charged residues; (iii) the occurrence of hydrophobic patches; (iv) the number of binding partners it has; and (v) its length. Conversely, a number of other properties that might have been thought to be important, such as nuclear localization signals, are not significant. Thus, using our tree-based analyses, we are able to identify combinations of features that best differentiate the small group of proteins for which a structure has been determined from all the currently selected targets. This information may prove useful in optimizing high-throughput experimentation. Further information is available from http://mining.nesg.org/. (C) 2003 Elsevier Ltd. All rights reserved.
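As a rough illustration of two of the sequence-derived properties named in the abstract (length and percentage of charged residues), the sketch below computes them for a single sequence. It is not the authors' code: the function name `sequence_features`, the example sequence, and the choice to count only Asp, Glu, Lys, and Arg as charged are assumptions made for illustration.

```python
# Hypothetical feature extraction for one protein sequence (illustrative only).
# Assumption: D, E, K, R are treated as charged; histidine is excluded.
CHARGED = set("DEKR")


def sequence_features(seq: str) -> dict:
    """Return the length and percentage of charged residues of a sequence."""
    seq = seq.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    charged = sum(1 for aa in seq if aa in CHARGED)
    return {
        "length": len(seq),
        "pct_charged": 100.0 * charged / len(seq),
    }


# Made-up 10-residue sequence, used only to exercise the function.
features = sequence_features("MKDELRAAGG")
print(features)
```

Feature values of this kind could then serve as inputs to a decision tree or random forest classifier; that step is omitted here because the abstract does not specify the implementation.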
Pages: 115-130
Page count: 16