FunSAV: Predicting the Functional Effect of Single Amino Acid Variants Using a Two-Stage Random Forest Model

Cited by: 42
Authors
Wang, Mingjun [1, 2]
Zhao, Xing-Ming [3]
Takemoto, Kazuhiro [4]
Xu, Haisong [1, 2]
Li, Yuan [1, 2]
Akutsu, Tatsuya [5]
Song, Jiangning [1, 2, 5, 6]
Affiliations
[1] Chinese Acad Sci, Natl Engn Lab Ind Enzymes, Tianjin Inst Ind Biotechnol, Tianjin, Peoples R China
[2] Chinese Acad Sci, Key Lab Syst Microbial Biotechnol, Tianjin Inst Ind Biotechnol, Tianjin, Peoples R China
[3] Tongji Univ, Dept Comp Sci, Sch Elect & Informat Engn, Shanghai 200092, Peoples R China
[4] Kyushu Inst Technol, Dept Biosci & Bioinformat, Iizuka, Fukuoka, Japan
[5] Kyoto Univ, Bioinformat Ctr, Inst Chem Res, Uji, Kyoto, Japan
[6] Monash Univ, Dept Biochem & Mol Biol, Melbourne, Vic 3004, Australia
Source
PLOS ONE | 2012 / Vol. 7 / Issue 08
Funding
Medical Research Council (UK); Japan Society for the Promotion of Science; Australian Research Council
Keywords
PROTEIN SECONDARY STRUCTURE; SUPPORT VECTOR MACHINES; NUCLEOTIDE POLYMORPHISMS; GENOME SEQUENCE; DISEASE; INFORMATION; MUTATIONS; BIOINFORMATICS; COVARIATION; RESOURCE;
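DOI
10.1371/journal.pone.0043847
Chinese Library Classification (CLC)
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09
Abstract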
Single amino acid variants (SAVs) are the most abundant form of known genetic variation associated with human disease. Successful prediction of the functional impact of SAVs from sequence can thus improve our understanding of the mechanisms by which a SAV may be associated with a given disease. In this work, we constructed a high-quality structural dataset containing 679 protein structures with 2,048 SAVs by collecting human genetic variant data from multiple resources and dividing the variants into two categories: disease-associated and neutral. We built a two-stage random forest (RF) model, termed FunSAV, to predict the functional effect of SAVs by combining sequence, structure and residue-contact network features with additional features not explored in previous studies. Importantly, we propose a two-step feature selection procedure to select the most informative features contributing to the prediction of disease association of SAVs. In cross-validation experiments on the benchmark dataset, FunSAV achieved an area under the curve (AUC) of 0.882, which is competitive with, and in some cases better than, existing tools including SIFT, SNAP, PolyPhen-2, PANTHER, nsSNPAnalyzer and PhD-SNP. The source code of FunSAV and the datasets can be downloaded at http://sunflower.kuicr.kyoto-u.ac.jp/~sjn/FunSAV.
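As a reading aid for the AUC figure quoted in the abstract, the following is a minimal sketch of how an area under the ROC curve can be computed from predicted scores. It uses the standard pairwise-ranking definition (the probability that a randomly chosen positive outranks a randomly chosen negative). The function name `roc_auc` and the toy labels and scores are assumptions made up for illustration; they are not FunSAV code, data, or results.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking.

    labels: 1 for disease-associated, 0 for neutral (hypothetical encoding).
    scores: predicted score per variant; higher means more likely positive.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy, made-up example: three positives and three negatives.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)  # 8 of 9 positive/negative pairs are ordered correctly
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes. The quadratic loop is chosen for clarity; a sort-based formulation would suit larger datasets.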
Pages: 14