Feature-based classifiers for somatic mutation detection in tumour-normal paired sequencing data

Times cited: 105
Authors
Ding, Jiarui [1 ,2 ]
Bashashati, Ali [1 ]
Roth, Andrew [1 ]
Oloumi, Arusha [1 ]
Tse, Kane [3 ]
Zeng, Thomas [3 ]
Haffari, Gholamreza [1 ]
Hirst, Martin [3 ]
Marra, Marco A. [3 ]
Condon, Anne [2 ]
Aparicio, Samuel [1 ,4 ]
Shah, Sohrab P. [1 ,2 ,4 ]
Affiliations
[1] BC Canc Agcy, Dept Mol Oncol, Vancouver, BC, Canada
[2] Univ British Columbia, Dept Comp Sci, Vancouver, BC V6T 1W5, Canada
[3] Canadas Michael Smith Genome Sci Ctr, Vancouver, BC, Canada
[4] Univ British Columbia, Dept Pathol, Vancouver, BC V6T 1W5, Canada
Keywords
FREQUENT MUTATION; GENOME; IDENTIFICATION; TOOLKIT
DOI
10.1093/bioinformatics/btr629
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Results: We present a comparison of four standard supervised machine learning algorithms for somatic SNV prediction in tumour/normal NGS experiments. To evaluate these approaches (random forest, Bayesian additive regression trees, support vector machine and logistic regression), we constructed 106 features representing 3369 candidate somatic SNVs from 48 breast cancer genomes, originally predicted with naive methods and subsequently revalidated to establish ground truth labels. We trained the classifiers on these data (consisting of 1015 true somatic mutations and 2354 non-somatic mutation positions) and conducted a rigorous evaluation of these methods using a cross-validation framework and hold-out test NGS data from both exome capture and whole genome shotgun platforms. All learning algorithms employing predictive discriminative approaches with feature selection improved the predictive accuracy over standard approaches by statistically significant margins. In addition, using unsupervised clustering of the ground truth 'false positive' predictions, we noted several distinct classes and present evidence suggesting non-overlapping sources of technical artefacts, illuminating important directions for future study.
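The cross-validation framework mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the number of folds (10), the single synthetic feature, and the toy threshold classifier are all assumptions made for illustration, and the helper names (`stratified_folds`, `cross_validate`, `fit_threshold`, `predict_threshold`) are hypothetical. Only the class sizes (1015 somatic, 2354 non-somatic) come from the abstract.

```python
import random

def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds that each keep roughly the overall class ratio."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so per-fold class counts differ by at most one.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

def cross_validate(features, labels, fit, predict, k=10, seed=0):
    """Return the per-fold accuracy of a classifier given as fit/predict callables."""
    folds = stratified_folds(labels, k, seed)
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        model = fit([features[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, features[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return accuracies

# Class sizes from the abstract; the single feature itself is synthetic (assumption).
N_SOMATIC, N_NON_SOMATIC = 1015, 2354
labels = [1] * N_SOMATIC + [0] * N_NON_SOMATIC
data_rng = random.Random(1)
features = [[data_rng.gauss(1.0 if y else -1.0, 1.0)] for y in labels]

def fit_threshold(xs, ys):
    """Toy model (assumption): midpoint between the two class means of the feature."""
    pos = [x[0] for x, y in zip(xs, ys) if y == 1]
    neg = [x[0] for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict_threshold(threshold, x):
    """Label a candidate as somatic (1) when its feature exceeds the threshold."""
    return 1 if x[0] > threshold else 0

fold_accuracies = cross_validate(features, labels, fit_threshold, predict_threshold, k=10)
mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
print(f"mean 10-fold accuracy: {mean_accuracy:.3f}")
```

Stratifying the folds matters here because the two classes are imbalanced (roughly 30% somatic), so an unstratified split could give folds with noticeably different class ratios. Any of the four classifiers named in the abstract could be substituted for the toy threshold model by supplying different `fit` and `predict` callables.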
Pages: 167-175
Page count: 9