Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures

被引:390
作者
Hoffmann, Steve [1 ,2 ]
Otto, Christian [1 ]
Kurtz, Stefan [3 ]
Sharma, Cynthia M. [4 ]
Khaitovich, Philipp [9 ]
Vogel, Joerg [4 ]
Stadler, Peter F. [1 ,5 ,6 ,7 ,8 ]
Hackermueller, Joerg [5 ]
机构
[1] Univ Leipzig, Dept Comp Sci, Bioinformat Grp, Leipzig, Germany
[2] Univ Leipzig, Interdisciplinary Ctr Bioinformat, Leipzig, Germany
[3] Univ Hamburg, Ctr Bioinformat, Hamburg, Germany
[4] Max Planck Inst Infect Biol, Berlin, Germany
[5] Fraunhofer Inst Cell Therapy & Immunol IZI, Rnom Grp, Leipzig, Germany
[6] Santa Fe Inst, Santa Fe, NM 87501 USA
[7] Univ Vienna, Dept Theoret Chem, Vienna, Austria
[8] Max Planck Inst Math Sci, Leipzig, Germany
[9] Partner Inst Computat Biol, Comparat Biol Grp, Shanghai, Peoples R China
关键词
ALIGNMENT;
D O I
10.1371/journal.pcbi.1000502
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
With few exceptions, current methods for short read mapping make use of simple seed heuristics to speed up the search. Most of the underlying matching models neglect the necessity to allow not only mismatches, but also insertions and deletions. Current evaluations indicate, however, that very different error models apply to the novel high-throughput sequencing methods. While the most frequent error-type in Illumina reads are mismatches, reads produced by 454's GS FLX predominantly contain insertions and deletions (indels). Even though 454 sequencers are able to produce longer reads, the method is frequently applied to small RNA (miRNA and siRNA) sequencing. Fast and accurate matching in particular of short reads with diverse errors is therefore a pressing practical problem. We introduce a matching model for short reads that can, besides mismatches, also cope with indels. It addresses different error models. For example, it can handle the problem of leading and trailing contaminations caused by primers and poly-A tails in transcriptomics or the length-dependent increase of error rates. In these contexts, it thus simplifies the tedious and error-prone trimming step. For efficient searches, our method utilizes index structures in the form of enhanced suffix arrays. In a comparison with current methods for short read mapping, the presented approach shows significantly increased performance not only for 454 reads, but also for Illumina reads. Our approach is implemented in the software segemehl available at http://www.bioinf.uni-leipzig.de/Software/segemehl/.
引用
收藏
页数:10
相关论文
共 17 条
[1]  
Abouelhoda M. I., 2004, Journal of Discrete Algorithms, V2, P53, DOI 10.1016/S1570-8667(03)00065-0
[2]   Solexa Ltd [J].
Bennett, S .
PHARMACOGENOMICS, 2004, 5 (04) :433-438
[3]   SUBLINEAR APPROXIMATE STRING-MATCHING AND BIOLOGICAL APPLICATIONS [J].
CHANG, WI ;
LAWLER, EL .
ALGORITHMICA, 1994, 12 (4-5) :327-344
[4]  
Crochemore M., 2007, Algorithms on Strings, DOI DOI 10.1017/CBO9780511546853
[5]   Substantial biases in ultra-short read data sets from high-throughput DNA sequencing [J].
Dohm, Juliane C. ;
Lottaz, Claudio ;
Borodina, Tatiana ;
Himmelbauer, Heinz .
NUCLEIC ACIDS RESEARCH, 2008, 36 (16)
[6]   Opportunistic data structures with applications [J].
Ferragina, P ;
Manzini, G .
41ST ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE, PROCEEDINGS, 2000, :390-398
[7]   Accuracy and quality of massively parallel DNA pyrosequencing [J].
Huse, Susan M. ;
Huber, Julie A. ;
Morrison, Hilary G. ;
Sogin, Mitchell L. ;
Mark Welch, David .
GENOME BIOLOGY, 2007, 8 (07)
[8]   METHODS FOR ASSESSING THE STATISTICAL SIGNIFICANCE OF MOLECULAR SEQUENCE FEATURES BY USING GENERAL SCORING SCHEMES [J].
KARLIN, S ;
ALTSCHUL, SF .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 1990, 87 (06) :2264-2268
[9]   Ultrafast and memory-efficient alignment of short DNA sequences to the human genome [J].
Langmead, Ben ;
Trapnell, Cole ;
Pop, Mihai ;
Salzberg, Steven L. .
GENOME BIOLOGY, 2009, 10 (03)
[10]   Mapping short DNA sequencing reads and calling variants using mapping quality scores [J].
Li, Heng ;
Ruan, Jue ;
Durbin, Richard .
GENOME RESEARCH, 2008, 18 (11) :1851-1858