Searching for interpretable rules for disease mutations: a simulated annealing bump hunting strategy

Cited by: 15
Authors
Jiang, Rui [1]
Yang, Hua [1]
Sun, Fengzhu [1]
Chen, Ting [1]
Affiliations
[1] Univ So Calif, Los Angeles, CA 90089 USA
Funding
US National Science Foundation
DOI
10.1186/1471-2105-7-417
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Understanding how amino acid substitutions affect protein function is critical for the study of proteins and their roles in disease. Although methods have been developed for predicting the potential effects of amino acid substitutions using sequence, three-dimensional structural, and evolutionary properties of proteins, their application is limited by the complexity of the features and the limited availability of protein structural information. Another limitation is that the prediction results are hard to interpret in terms of physicochemical principles and biological knowledge.
Results: To overcome these limitations, we proposed a novel feature set based on physicochemical properties of amino acids, evolutionary profiles of proteins, and protein sequence information. We applied the support vector machine and the random forest with this feature set to experimental amino acid substitutions in the E. coli lac repressor and the bacteriophage T4 lysozyme, as well as to annotated amino acid substitutions in a wide range of human proteins. The results showed that the proposed feature set was superior to existing ones. To explore the physicochemical principles behind amino acid substitutions, we designed a simulated annealing bump hunting strategy to automatically extract interpretable rules for amino acid substitutions. We applied the strategy to annotated human amino acid substitutions and successfully extracted several rules that were either consistent with current biological knowledge or provided new insights into amino acid substitutions. When applied to unclassified data, these rules could cover a large portion of the samples, and most of the covered samples agreed well with predictions made by either the support vector machine or the random forest.
Conclusion: The prediction methods using the proposed feature set achieve a larger AUC (area under the ROC curve), a smaller BER (balanced error rate), and a larger MCC (Matthews correlation coefficient) than those using published feature sets, suggesting that our feature set is superior to existing ones. The rules extracted by the simulated annealing bump hunting strategy have coverage and accuracy comparable to, and interpretability much better than, those extracted by the patient rule induction method (PRIM), indicating that the strategy is more effective at inducing interpretable rules.
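The three evaluation measures named in the conclusion can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact formulas, so the definitions below are the commonly used ones and are an assumption here; the function names and the example counts and scores are hypothetical and are not taken from the paper.

```python
import math


def balanced_error_rate(tp, fn, tn, fp):
    """BER: average of the error rates on the positive and negative classes.

    Assumed standard definition; the abstract does not state the formula.
    """
    return 0.5 * (fn / (tp + fn) + fp / (tn + fp))


def matthews_cc(tp, fn, tn, fp):
    """MCC computed from the four cells of a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Conventionally reported as 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This pairwise form is O(n*m) and is
    meant for illustration, not for large data sets.
    """
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical counts and scores, for illustration only.
ber = balanced_error_rate(tp=40, fn=10, tn=45, fp=5)
mcc = matthews_cc(tp=40, fn=10, tn=45, fp=5)
area = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```

Under these definitions a better classifier has a larger AUC and MCC and a smaller BER, which matches the direction of the comparisons reported in the conclusion.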
Pages: 18