FunSAV: Predicting the Functional Effect of Single Amino Acid Variants Using a Two-Stage Random Forest Model

Cited by: 42
Authors
Wang, Mingjun [1, 2]
Zhao, Xing-Ming [3]
Takemoto, Kazuhiro [4]
Xu, Haisong [1, 2]
Li, Yuan [1, 2]
Akutsu, Tatsuya [5]
Song, Jiangning [1, 2, 5, 6]
Affiliations
[1] Chinese Acad Sci, Natl Engn Lab Ind Enzymes, Tianjin Inst Ind Biotechnol, Tianjin, Peoples R China
[2] Chinese Acad Sci, Key Lab Syst Microbial Biotechnol, Tianjin Inst Ind Biotechnol, Tianjin, Peoples R China
[3] Tongji Univ, Dept Comp Sci, Sch Elect & Informat Engn, Shanghai 200092, Peoples R China
[4] Kyushu Inst Technol, Dept Biosci & Bioinformat, Iizuka, Fukuoka, Japan
[5] Kyoto Univ, Bioinformat Ctr, Inst Chem Res, Uji, Kyoto, Japan
[6] Monash Univ, Dept Biochem & Mol Biol, Melbourne, Vic 3004, Australia
Source
PLOS ONE | 2012 / Vol. 7 / Issue 08
Funding
Medical Research Council (UK); Japan Society for the Promotion of Science; Australian Research Council;
Keywords
PROTEIN SECONDARY STRUCTURE; SUPPORT VECTOR MACHINES; NUCLEOTIDE POLYMORPHISMS; GENOME SEQUENCE; DISEASE; INFORMATION; MUTATIONS; BIOINFORMATICS; COVARIATION; RESOURCE;
DOI
10.1371/journal.pone.0043847
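Chinese Library Classification (CLC) number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Discipline classification code
07; 0710; 09;
Abstract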
Single amino acid variants (SAVs) are the most abundant form of known genetic variations associated with human disease. Successful prediction of the functional impact of SAVs from sequences can thus lead to an improved understanding of the mechanisms by which a SAV may be associated with a particular disease. In this work, we constructed a high-quality structural dataset that contained 679 protein structures with 2,048 SAVs by collecting human genetic variant data from multiple resources and dividing them into two categories, i.e., disease-associated and neutral variants. We built a two-stage random forest (RF) model, termed FunSAV, to predict the functional effect of SAVs by combining sequence, structure and residue-contact network features with additional features that were not explored in previous studies. Importantly, a two-step feature selection procedure was proposed to select the most important and informative features that contribute to the prediction of disease association of SAVs. In cross-validation experiments on the benchmark dataset, FunSAV achieved a good prediction performance with an area under the curve (AUC) of 0.882, which is competitive with and in some cases better than other existing tools including SIFT, SNAP, Polyphen2, PANTHER, nsSNPAnalyzer and PhD-SNP. The source code of FunSAV and the datasets can be downloaded at http://sunflower.kuicr.kyoto-u.ac.jp/~sjn/FunSAV.
Pages: 14