Deleterious SNP prediction: be mindful of your training data!

Cited by: 48
Authors
Care, Matthew A.
Needham, Chris J.
Bulpitt, Andrew J.
Westhead, David R. [1]
Affiliations
[1] Univ Leeds, Inst Mol & Cell Biol, Leeds LS2 9JT, W Yorkshire, England
[2] Univ Leeds, Sch Comp, Leeds LS2 9JT, W Yorkshire, England
Funding
Biotechnology and Biological Sciences Research Council (UK)
DOI
10.1093/bioinformatics/btl649
Chinese Library Classification number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Motivation: Predicting which of the vast number of human single nucleotide polymorphisms (SNPs) are deleterious to gene function or likely to be disease associated is an important problem, and many methods have been reported in the literature. All methods require data sets of mutations classified as 'deleterious' or 'neutral' for training and/or validation. Although different workers have used different data sets, there has been no study of which is best. Here, the three most commonly used data sets are analysed. We examine their contents and relate these to classifiers, with the aims of revealing the strengths and pitfalls of each data set and recommending a best approach for future studies.

Results: The data sets examined are shown to be substantially different in content, particularly with regard to amino acid substitutions, reflecting the different ways in which they are derived. This leads to differences in classifiers and reveals serious pitfalls in some data sets, making them less than ideal for non-synonymous SNP prediction.
Pages: 664-672
Number of pages: 9