Improving the prediction of disease-related variants using protein three-dimensional structure

Cited by: 100
Authors
Capriotti, Emidio [1, 3]
Altman, Russ B. [1, 2]
Affiliations
[1] Stanford Univ, Dept Bioengn, Stanford, CA 94305 USA
[2] Stanford Univ, Dept Genet, Stanford, CA 94305 USA
[3] Univ Balearic Isl, Dept Math & Comp Sci, Palma De Mallorca, Spain
Source
BMC BIOINFORMATICS | 2011 / Volume 12
Keywords
SINGLE-NUCLEOTIDE POLYMORPHISMS; AMINO-ACID SUBSTITUTIONS; SUPPORT VECTOR MACHINES; NON-SYNONYMOUS SNPS; STABILITY CHANGES; EVOLUTIONARY INFORMATION; POINT MUTATIONS; HUMAN GENOME; SEQUENCE; DATABASE;
DOI
10.1186/1471-2105-12-S4-S3
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: Single Nucleotide Polymorphisms (SNPs) are an important source of human genome variability. Non-synonymous SNPs occurring in coding regions result in single amino acid polymorphisms (SAPs) that may affect protein function and lead to pathology. Several methods attempt to estimate the impact of SAPs using different sources of information. Although sequence-based predictors have shown good performance, the quality of these predictions can be further improved by introducing new features derived from three-dimensional protein structures.
Results: In this paper, we present a structure-based machine learning approach for predicting disease-related SAPs. We trained a Support Vector Machine (SVM) on a set of 3,342 disease-related mutations and 1,644 neutral polymorphisms from 784 protein chains. The SVM input features are derived from the protein's sequence, structure, and function. After dataset balancing, the structure-based method (SVM-3D) reaches an overall accuracy of 85%, a correlation coefficient of 0.70, and an area under the receiver operating characteristic curve (AUC) of 0.92. Compared with a similar sequence-based predictor, SVM-3D increases the overall accuracy and AUC by 3%, and the correlation coefficient by 0.06. The robustness of this improvement was tested on different datasets, and in all cases SVM-3D performs better than previously developed methods, even when compared with PolyPhen2, which explicitly takes protein structure information as input.
Conclusion: This work demonstrates that structural information can increase the accuracy of disease-related SAP identification. Our results also quantify the magnitude of the improvement on a large dataset. This improvement agrees with previously observed results, where structural information enhanced the prediction of protein stability changes upon mutation. Although the limited structural information in the Protein Data Bank restricts the application and performance of our structure-based method, we expect that SVM-3D will reach higher accuracy when more structural data become available.
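The three scores quoted in the abstract (overall accuracy, correlation coefficient, and AUC) can be illustrated with a small sketch. This is not the authors' evaluation code: the function names are hypothetical, the toy labels and scores are invented for illustration, and it assumes that "correlation coefficient" refers to the Matthews correlation coefficient, which is the usual choice for binary predictors of this kind.

```python
import math


def confusion_counts(labels, predictions):
    """Count true/false positives and negatives (1 = disease, 0 = neutral)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    return tp, tn, fp, fn


def overall_accuracy(tp, tn, fp, fn):
    """Fraction of variants classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_correlation(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0.0 when a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def roc_auc(labels, scores):
    """AUC as the probability that a disease variant outscores a neutral one.

    Ties count as one half. Quadratic in dataset size, which is fine for a sketch.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (invented): two disease-related and two neutral variants.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_predictions = [1 if s >= 0.5 else 0 for s in toy_scores]

counts = confusion_counts(toy_labels, toy_predictions)
print(overall_accuracy(*counts), matthews_correlation(*counts),
      roc_auc(toy_labels, toy_scores))
```

Accuracy and the correlation coefficient depend on the chosen decision threshold (0.5 here, an assumption), whereas AUC is threshold-independent, which is one reason the abstract reports all three.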
Pages: 11