PEAR: a fast and accurate Illumina Paired-End reAd mergeR

Cited by: 3373
Authors
Zhang, Jiajie [1, 2, 3]
Kobert, Kassian [1]
Flouri, Tomas [1]
Stamatakis, Alexandros [1, 4]
Affiliations
[1] Heidelberg Inst Theoret Studies, Sci Comp Grp, Exelixis Lab, D-69118 Heidelberg, Germany
[2] Med Univ Lubeck, Grad Sch Comp Med & Life Sci, D-23538 Lubeck, Germany
[3] Med Univ Lubeck, Inst Neuro & Bioinformat, D-23538 Lubeck, Germany
[4] Karlsruhe Inst Technol, Inst Theoret Informat, D-76128 Karlsruhe, Germany
Keywords
SEQUENCES; GENERATION; ALIGNMENT;
DOI
10.1093/bioinformatics/btt593
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification code
070307 [Chemical Biology];
Abstract
Motivation: The Illumina paired-end sequencing technology can generate reads from both ends of target DNA fragments, which can subsequently be merged to increase the overall read length. There already exist tools for merging these paired-end reads when the target fragments are equally long. However, when fragment lengths vary and, in particular, when either the fragment size is shorter than a single-end read, or longer than twice the size of a single-end read, most state-of-the-art mergers fail to generate reliable results. Therefore, a robust tool is needed to merge paired-end reads that exhibit varying overlap lengths because of varying target fragment lengths.
Results: We present the PEAR software for merging raw Illumina paired-end reads from target fragments of varying length. The program evaluates all possible paired-end read overlaps and does not require the target fragment size as input. It also implements a statistical test for minimizing false-positive results. Tests on simulated and empirical data show that PEAR consistently generates highly accurate merged paired-end reads. A highly optimized implementation allows for merging millions of paired-end reads within a few minutes on a standard desktop computer. On multi-core architectures, the parallel version of PEAR shows linear speedups compared with the sequential version of PEAR.
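The idea of evaluating every possible overlap without knowing the fragment size can be illustrated with a minimal Python sketch. This is not PEAR's implementation: the function names, the match-minus-mismatch score, the `min_overlap` parameter, and the rule of keeping the forward base inside the overlap are all assumptions made for illustration. The sketch ignores base quality scores and replaces the statistical test mentioned in the abstract with a simple "score must be positive" check.

```python
# Illustrative sketch only; not PEAR's scoring function or statistical test.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def best_overlap(fwd, rev, min_overlap=3):
    """Try every placement of the reverse read against the forward read.

    `offset` is the start of the reverse-complemented reverse read in
    forward-read coordinates. Negative offsets cover fragments that are
    shorter than a single read. Returns (score, offset) or None.
    """
    rc = reverse_complement(rev)
    best = None
    for offset in range(-(len(rc) - min_overlap), len(fwd) - min_overlap + 1):
        start = max(0, offset)
        end = min(len(fwd), offset + len(rc))
        n = end - start
        if n < min_overlap:
            continue
        matches = sum(fwd[i] == rc[i - offset] for i in range(start, end))
        score = matches - (n - matches)  # hypothetical score: +1 match, -1 mismatch
        if best is None or score > best[0]:
            best = (score, offset)
    return best


def merge_pair(fwd, rev, min_overlap=3):
    """Merge a read pair at its best-scoring overlap, or return None."""
    result = best_overlap(fwd, rev, min_overlap)
    if result is None or result[0] <= 0:
        return None  # stand-in for a real significance test
    _, offset = result
    rc = reverse_complement(rev)
    fragment_end = offset + len(rc)
    # Forward base wherever the forward read covers the position,
    # otherwise the base from the reverse-complemented reverse read.
    return "".join(
        fwd[p] if p < len(fwd) else rc[p - offset] for p in range(fragment_end)
    )


if __name__ == "__main__":
    fragment = "ACGTTGCAAGGCTTAC"
    fwd = fragment[:10]
    rev = reverse_complement(fragment[-10:])
    print(merge_pair(fwd, rev))
```

Because the loop also visits negative offsets, the same code handles a fragment shorter than the read length, where the bases read past the fragment end are dropped from the merged result. A pair with no positively scoring overlap is left unmerged, which is the situation a proper statistical test is meant to decide more carefully.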
Pages: 614-620
Number of pages: 7