Accurate indel prediction using paired-end short reads

Cited by: 24
Authors
Grimm, Dominik [1,2]
Hagmann, Joerg [3]
Koenig, Daniel [3]
Weigel, Detlef [3]
Borgwardt, Karsten [1,2,4]
Affiliations
[1] Max Planck Inst Dev Biol, Machine Learning & Computat Biol Res Grp, Tubingen, Germany
[2] Max Planck Inst Intelligent Syst, Tubingen, Germany
[3] Max Planck Inst Dev Biol, Dept Mol Biol, Tubingen, Germany
[4] Univ Tubingen, Ctr Bioinformat, Tubingen, Germany
Source
BMC GENOMICS | 2013 / Vol. 14
Keywords
Next generation sequencing; Indel detection; Discriminative machine learning; Paired-end short reads; Split-read mapping; COPY-NUMBER VARIATION; GENOME-WIDE ASSOCIATION; ARABIDOPSIS-THALIANA; STRUCTURAL VARIATION; IDENTIFICATION; BREAKPOINTS; FRAMEWORK; DELETIONS;
DOI
10.1186/1471-2164-14-132
Chinese Library Classification (CLC) number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];
Subject classification codes
071005 ; 0836 ; 090102 ; 100705 ;
Abstract
Background: One of the major open challenges in next generation sequencing (NGS) is the accurate identification of structural variants such as insertions and deletions (indels). Current methods for indel calling assign scores to different types of evidence or counter-evidence for the presence of an indel, such as the number of split read alignments spanning the boundaries of a deletion candidate or reads that map within a putative deletion. Candidates with a score above a manually defined threshold are then predicted to be true indels. As a consequence, structural variants detected in this manner contain many false positives.
Results: Here, we present a machine learning based method which is able to discover and distinguish true from false indel candidates in order to reduce the false positive rate. Our method identifies indel candidates using a discriminative classifier based on features of split read alignment profiles and trained on true and false indel candidates that were validated by Sanger sequencing. We demonstrate the usefulness of our method with paired-end Illumina reads from 80 genomes of the first phase of the 1001 Genomes Project (http://www.1001genomes.org) in Arabidopsis thaliana.
Conclusion: In this work we show that indel classification is a necessary step to reduce the number of false positive candidates. We demonstrate that missing classification may lead to spurious biological interpretations. The software is available at: http://agkb.is.tuebingen.mpg.de/Forschung/SV-M/.
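The contrast the abstract draws between a manually defined score threshold and a trained classifier can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two features (split reads spanning the candidate's boundaries, reads mapping inside the putative deletion), the toy values and labels, and all function names. A plain perceptron stands in as a simple discriminative classifier; it is not the authors' model and does not reproduce their feature set or software.

```python
# Toy candidates: ((split_reads_spanning_boundaries, reads_inside_deletion), label).
# Label +1 = validated true indel, -1 = validated false candidate.
# All values are invented for illustration; they are not data from the paper.
TRAIN = [
    ((12.0, 0.0), 1),
    ((9.0, 1.0), 1),
    ((15.0, 2.0), 1),
    ((8.0, 0.0), 1),
    ((10.0, 9.0), -1),
    ((3.0, 6.0), -1),
    ((11.0, 12.0), -1),
    ((2.0, 1.0), -1),
]


def threshold_call(features, min_split_reads=8.0):
    """Baseline: accept a candidate when its support passes a fixed cutoff."""
    split_reads, _inside_reads = features
    return 1 if split_reads >= min_split_reads else -1


def train_perceptron(samples, max_epochs=1000):
    """Learn weights and bias from labelled candidates (classic perceptron)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in samples:
            score = w[0] * x[0] + w[1] * x[1] + b
            if y * score <= 0.0:
                # Misclassified (or on the boundary): move the boundary toward x.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            break  # every training candidate is on the correct side
    return w, b


def classify(model, features):
    """Predict +1 (true indel) or -1 (false candidate) with a trained model."""
    w, b = model
    score = w[0] * features[0] + w[1] * features[1] + b
    return 1 if score > 0.0 else -1


def count_false_positives(predict, samples):
    """Count candidates labelled -1 that the predictor calls +1."""
    return sum(1 for x, y in samples if y == -1 and predict(x) == 1)


if __name__ == "__main__":
    model = train_perceptron(TRAIN)
    print("threshold false positives:", count_false_positives(threshold_call, TRAIN))
    print("classifier false positives:",
          count_false_positives(lambda f: classify(model, f), TRAIN))
```

On this toy set the cutoff accepts two false candidates that have many split reads but also many reads mapping inside the putative deletion, while the trained model uses that counter-evidence to reject them. A real evaluation would measure this on held-out validated candidates rather than on the training set.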
Pages: 10