Collective judgment predicts disease-associated single nucleotide variants

Cited by: 193
Authors
Capriotti, Emidio [1]
Altman, Russ B. [2, 3]
Bromberg, Yana [4]
Affiliations
[1] Univ Alabama Birmingham, Dept Pathol, Div Informat, Birmingham, AL 35294 USA
[2] Stanford Univ, Dept Bioengn, Stanford, CA 94305 USA
[3] Stanford Univ, Dept Genet, Stanford, CA 94305 USA
[4] Rutgers State Univ, Dept Biochem & Microbiol, New Brunswick, NJ 08903 USA
Source
BMC GENOMICS | 2013 / Vol. 14
Keywords
PROTEIN MUTATIONS; BIOINFORMATICS; POLYMORPHISMS; DATABASE
DOI
10.1186/1471-2164-14-S3-S2
Chinese Library Classification (CLC) number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology]
Subject classification codes
071005; 0836; 090102; 100705
Abstract
Background: In recent years the number of human genetic variants deposited in publicly available databases has been increasing exponentially. The latest version of dbSNP, for example, contains ~50 million validated Single Nucleotide Variants (SNVs). SNVs make up most of human variation and are often the primary causes of disease. Non-synonymous SNVs (nsSNVs) result in single amino acid substitutions and may affect protein function, often causing disease. Although several methods for the detection of nsSNV effects have already been developed, the steady increase in annotated data offers the opportunity to improve prediction accuracy.
Results: Here we present a new approach for the detection of disease-associated nsSNVs (Meta-SNP) that integrates four existing methods: PANTHER, PhD-SNP, SIFT and SNAP. We first tested the accuracy of each method using a dataset of 35,766 disease-annotated mutations from 8,667 proteins extracted from the SwissVar database. The four methods reached overall accuracies of 64%-76% with a Matthews correlation coefficient (MCC) of 0.38-0.53. We then used the outputs of these methods to develop a machine-learning-based approach (Meta-SNP) that discriminates between disease-associated and polymorphic variants. In testing, the combined method reached 79% overall accuracy and 0.59 MCC, ~3% higher accuracy and ~0.05 higher correlation than the best-performing single method. Moreover, for the hardest-to-define subset of nsSNVs, i.e. variants for which half of the predictors disagreed with the other half, Meta-SNP attained 8% higher accuracy than the best predictor.
Conclusions: Here we find that the Meta-SNP algorithm achieves better performance than the best single predictor. This result suggests that the methods used for the prediction of variant-disease associations are orthogonal, encoding different biologically relevant relationships. Careful combination of predictions from various resources is therefore a good strategy for the selection of high-reliability predictions. Indeed, for the subset of nsSNVs where all predictors were in agreement (46% of all nsSNVs in the set), our method reached 87% overall accuracy and 0.73 MCC.
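The abstract reports overall accuracy and MCC, and it singles out two subsets of variants: those where all four predictors agree and those where they split two against two. The following is a minimal sketch of how such quantities can be computed for binary calls (1 = disease-associated, 0 = polymorphic). It uses the standard definitions of accuracy and MCC. The function names and the toy labels are hypothetical, chosen only for illustration; they are not the authors' code or data, and the sketch does not reproduce the Meta-SNP learning step.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = disease, 0 = polymorphic)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Overall accuracy: fraction of correct calls."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def agreement_class(votes):
    """Classify one variant by how its predictors voted.

    votes: list of 0/1 calls, one per predictor (four in the abstract).
    Returns "unanimous", "tie" (half against half), or "majority".
    """
    n_disease = sum(votes)
    if n_disease in (0, len(votes)):
        return "unanimous"
    if 2 * n_disease == len(votes):
        return "tie"
    return "majority"


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
    counts = confusion_counts(y_true, y_pred)
    print("accuracy:", accuracy(*counts))
    print("MCC:", mcc(*counts))
    print(agreement_class([1, 1, 1, 1]), agreement_class([1, 0, 1, 0]))
```

Grouping variants by `agreement_class` before scoring would allow per-subset accuracy and MCC to be reported separately, in the spirit of the unanimous and half-against-half subsets described in the abstract.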
Pages: 9