CONDETRI - A Content Dependent Read Trimmer for Illumina Data

Cited by: 180
Authors
Smeds, Linnea [1]
Kunstner, Axel [1]
Affiliations
[1] Uppsala Univ, Dept Evolutionary Biol, Evolutionary Biol Ctr, Uppsala, Sweden
Source
PLOS ONE | 2011 / Volume 6 / Issue 10
Funding
Swedish Research Council
Keywords
SEQUENCING ERRORS; QUALITY; GENOMES
DOI
10.1371/journal.pone.0026314
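Chinese Library Classification codes
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Subject classification codes
07; 0710; 09
Abstract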
During the last few years, DNA and RNA sequencing have started to play an increasingly important role in biological and medical applications, especially due to the greater amount of sequencing data yielded by the new sequencing machines and the enormous decrease in sequencing costs. In particular, Illumina/Solexa sequencing has had an increasing impact on gathering data from model and non-model organisms. However, accurate and easy-to-use tools for quality filtering have not yet been established. We present CONDETRI, a method for content-dependent read trimming of next-generation sequencing data using the quality scores of each individual base. The main focus of the method is to remove sequencing errors from reads so that sequencing reads can be standardized. Another aspect of the method is to incorporate read trimming in next-generation sequencing data processing and analysis pipelines. It can process single-end and paired-end sequence data of arbitrary length, and it is independent of sequencing coverage and user interaction. CONDETRI is able to trim and remove reads with low quality scores to save computational time and memory usage during de novo assemblies. Low-coverage or large-genome sequencing projects will especially gain from trimming reads. The method can easily be incorporated into preprocessing and analysis pipelines for Illumina data.
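The abstract does not spell out the trimming rule, so the following is only a generic illustration of per-base quality trimming, not CONDETRI's actual algorithm. It is a minimal sketch that assumes Phred-scaled qualities stored as ASCII characters and removes bases from the 3' end while they fall below a threshold. The function name `trim_3prime`, the threshold `min_q=20`, and the default `offset=33` are all hypothetical choices made for this example; older Illumina pipelines used an offset of 64.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove bases from the 3' end while their quality is below min_q.

    Illustrative only: the threshold, the offset, and the stopping rule
    are assumptions, not the published CONDETRI procedure.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq)
    # Walk backwards from the 3' end until a base meets the threshold.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


if __name__ == "__main__":
    # 'I' encodes Q40 and '#' encodes Q2 under an offset of 33.
    print(trim_3prime("ACGTAC", "IIII##"))
```

A read whose bases all fall below the threshold comes back empty, which a caller could treat as a signal to discard the read altogether.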
Pages: 6