Predicting deleterious nsSNPs: an analysis of sequence and structural attributes

Cited by: 75
Authors
Dobson, Richard J.
Munroe, Patricia B.
Caulfield, Mark J.
Saqi, Mansoor A. S.
Affiliations
[1] Queen Mary Univ London, Barts & London Sch Med & Dent, William Harvey Res Inst, London EC1M 6BQ, England
[2] Queen Mary Univ London, Inst Cell & Mol Sci, Barts & London Sch Med & Dent, London EC1M 6BQ, England
Funding
UK Medical Research Council
DOI
10.1186/1471-2105-7-217
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Background: There has been an explosion in the number of single nucleotide polymorphisms (SNPs) within public databases. In this study we focused on non-synonymous protein coding single nucleotide polymorphisms (nsSNPs), some associated with disease and others which are thought to be neutral. We describe the distribution of both types of nsSNPs using structural and sequence based features and assess the relative value of these attributes as predictors of function using machine learning methods. We also address the common problem of balance within machine learning methods and show the effect of imbalance on nsSNP function prediction. We show that nsSNP function prediction can be significantly improved by 100% undersampling of the majority class. The learnt rules were then applied to make predictions of function on all nsSNPs within Ensembl.
Results: The measure of prediction success is greatly affected by the level of imbalance in the training dataset. We found the balanced dataset that included all attributes produced the best prediction. The performance as measured by the Matthews correlation coefficient (MCC) varied between 0.49 and 0.25 depending on the imbalance. As previously observed, the degree of sequence conservation at the nsSNP position is the single most useful attribute. In addition to conservation, structural predictions made using a balanced dataset can be of value.
Conclusion: The predictions for all nsSNPs within Ensembl, based on a balanced dataset using all attributes, are available as a DAS annotation. Instructions for adding the track to Ensembl are at http://www.brightstudy.ac.uk/das_help.html.
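The abstract refers to two standard techniques, undersampling the majority class and scoring with the Matthews correlation coefficient. The sketch below illustrates both on toy data; it is not the authors' pipeline. It assumes that "100% undersampling" means randomly discarding majority-class examples until both classes are the same size, and the function names `mcc` and `undersample_majority`, the class labels, and the toy data are all hypothetical.

```python
import math
import random


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: MCC is taken as 0 when a margin is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


def undersample_majority(samples, labels, seed=0):
    """Randomly drop examples so every class has as many as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    n_min = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        kept = rng.sample(items, n_min)  # sampling without replacement
        balanced.extend((sample, label) for sample in kept)
    rng.shuffle(balanced)
    return balanced


# Toy, made-up data: 9 "neutral" and 3 "disease" examples (assumed labels).
samples = list(range(12))
labels = ["neutral"] * 9 + ["disease"] * 3
balanced = undersample_majority(samples, labels)
```

MCC is a natural choice for this kind of comparison because, unlike plain accuracy, it stays near zero for a classifier that simply predicts the majority class.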
Pages: 9