Statistical geometry based prediction of nonsynonymous SNP functional effects using random forest and neuro-fuzzy classifiers

Cited by: 26
Authors
Barenboim, Maxim [1]
Masso, Majid [1]
Vaisman, Iosif I. [1]
Jamison, D. Curtis [1]
Affiliations
[1] George Mason Univ, Dept Bioinformat & Comp Biol, Manassas, VA 20110 USA
Keywords
single nucleotide polymorphism; computational geometry; artificial intelligence; algorithms; protein conformation; statistical models; complex disease; tertiary protein structure; computational biology; amino acid substitution;
DOI
10.1002/prot.21838
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Subject classification codes
071010 ; 081704 ;
Abstract
There is substantial interest in methods designed to predict the effect of nonsynonymous single nucleotide polymorphisms (nsSNPs) on protein function, given their potential relationship to heritable diseases. Current state-of-the-art supervised machine learning algorithms, such as random forest (RF), train models that classify single amino acid mutations in proteins as either neutral or deleterious to function. However, it is frequently the case that the functional effect of a polymorphism on a protein resides between these two extremes. The utilization of classifiers that incorporate fuzzy logic provides a natural extension in order to account for the spectrum of possible functional consequences. We generated a dataset of single amino acid substitutions in human proteins having known three-dimensional structures. Each variant was uniquely represented as a feature vector that included computational geometry and knowledge-based statistical potential predictors obtained through application of Delaunay tessellation of protein structures. Additional attributes consisted of physicochemical properties of the native and replacement amino acids as well as topological location of the mutated residue position in the solved structure. Classification performance of the RF algorithm was evaluated on a training set consisting of the disease-associated and neutral nsSNPs taken from our dataset, and attributes were ranked according to their relative importance. Similarly, we evaluated the performance of an adaptive neuro-fuzzy inference system (ANFIS). The utility of statistical geometry predictors was compared with that of traditional structural and evolutionary attributes employed by other researchers, revealing an equally effective yet complementary methodology. Among all attributes in our feature set, the statistical geometry predictors were found to be the most highly ranked.
On the basis of the AUC (area under the ROC curve) measure of performance, the ANFIS and RF models were equally effective when only statistical geometry features were utilized. Tenfold cross-validation studies evaluating AUC, balanced error rate (BER), and Matthews correlation coefficient (MCC) showed that our RF model was at least comparable with the well-established methods of SIFT and PolyPhen. The trained RF and ANFIS models were each subsequently used to predict the disease potential of human nsSNPs in our dataset that are currently unclassified (http://rna.gmu.edu/FuzzySnps/).
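For readers unfamiliar with the three performance measures named in the abstract, the following is a minimal sketch of their standard textbook definitions for a binary neutral/deleterious classifier. It is not the authors' code: the function names, the 0/1 label encoding, and the convention that BER is the mean of the false-negative and false-positive rates are assumptions made for illustration.

```python
from math import sqrt


def balanced_error_rate(tp, fp, tn, fn):
    """Mean of the false-negative rate and the false-positive rate (assumed BER convention)."""
    fnr = fn / (tp + fn)  # fraction of true positives that were missed
    fpr = fp / (tn + fp)  # fraction of true negatives that were flagged
    return 0.5 * (fnr + fpr)


def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews correlation coefficient computed from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive (label 1)
    receives a higher score than a randomly chosen negative (label 0);
    ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical confusion-matrix counts, not results from the paper.
    print(balanced_error_rate(tp=40, fp=10, tn=40, fn=10))
    print(matthews_corrcoef(tp=40, fp=10, tn=40, fn=10))
    print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a small illustration; a rank-based formulation would be preferable for large datasets.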
Pages: 1930-1939
Page count: 10