AfterQC: automatic filtering, trimming, error removing and quality control for fastq data

Cited by: 261
Authors
Chen, Shifu [1, 2, 3]
Huang, Tanxiao [2]
Zhou, Yanqing [2]
Han, Yue [2]
Xu, Mingyan [2]
Gu, Jia [1]
Affiliations
[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, Xueyuan Rd, Shenzhen, Peoples R China
[2] HaploX BioTechnol, Songpingshan Rd, Shenzhen, Peoples R China
[3] Univ Chinese Acad Sci, 19 A Yuquan Rd, Beijing, Peoples R China
Funding
National Science Foundation (United States);
Keywords
NGS; Overlap analysis; Quality control; Data filtering; Bubble; CANCER; DNA;
DOI
10.1186/s12859-017-1469-3
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification code
070307 [Chemical Biology];
Abstract
Background: Applications that require highly accurate sequencing data, clinical applications in particular, must cope with unavoidable sequencing errors. Several tools have been proposed to profile sequencing quality, but few of them can quantify or correct sequencing errors. This unmet need motivated us to develop AfterQC, a tool that profiles sequencing errors and corrects most of them, and that also provides highly automated quality control and data filtering. Unlike most tools, AfterQC analyses the overlap between the two reads of each pair in paired-end sequencing data. Based on this overlap analysis, AfterQC can detect and cut adapters, and it offers a novel function to correct wrong bases in the overlapping regions. Another new feature detects and visualises sequencing bubbles, which are commonly found on flowcell lanes and may cause sequencing errors. Besides the usual per-cycle quality and base content plots, AfterQC also provides polyX filtering (polyX being a long sub-sequence of the same base X), automatic trimming and k-mer based strand bias profiling.
Results: For each single FastQ file or pair of FastQ files, AfterQC filters out bad reads, detects and eliminates the sequencer's bubble effects, trims reads at the front and tail, detects sequencing errors and corrects part of them, and finally outputs clean data and generates HTML reports with interactive figures. AfterQC can run in batch mode with multiprocess support; it accepts a single FastQ file, a single pair of FastQ files (for paired-end sequencing), or a folder whose FastQ files are all processed automatically. Based on the overlap analysis, AfterQC can estimate the sequencing error rate and profile the error transform distribution. Our error profiling tests show that the error distribution is highly platform dependent.
Conclusion: Much more than just another quality control (QC) tool, AfterQC performs quality control, data filtering, error profiling and base correction automatically. Experimental results show that AfterQC can help eliminate sequencing errors in paired-end sequencing data to provide much cleaner outputs, and consequently help reduce false-positive variants, especially for low-frequency somatic mutations. While providing rich configurable options, AfterQC can detect and set all options automatically and requires no arguments in most cases.
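The overlap-based base correction described in the abstract can be illustrated with a minimal Python sketch. It is not AfterQC's actual implementation: the function names (`reverse_complement`, `find_overlap`, `correct_pair`), the thresholds (`min_overlap`, `max_mismatch_frac`, `min_quality_gap`), the Phred+33 quality encoding, and the rule of trusting the higher-quality base are all assumptions made for illustration. The sketch also assumes the sequenced fragment is at least as long as a read, so adapter read-through is not handled.

```python
from typing import Optional, Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (uppercase ACGTN)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_overlap(r1: str, r2_rc: str, min_overlap: int = 8,
                 max_mismatch_frac: float = 0.2) -> Optional[int]:
    """Find the smallest offset at which r1[offset:] aligns with the start
    of r2_rc with few enough mismatches; return None if there is none."""
    for offset in range(0, len(r1) - min_overlap + 1):
        length = min(len(r1) - offset, len(r2_rc))
        if length < min_overlap:
            break
        mismatches = sum(a != b for a, b in zip(r1[offset:], r2_rc))
        if mismatches <= max_mismatch_frac * length:
            return offset
    return None


def correct_pair(r1: str, q1: str, r2: str, q2: str,
                 min_overlap: int = 8, max_mismatch_frac: float = 0.2,
                 min_quality_gap: int = 10) -> Tuple[str, str, int]:
    """Correct mismatched bases in the overlap of a read pair.

    Hypothetical rule: where the two reads disagree, the base whose Phred
    quality is higher by at least min_quality_gap overwrites the other.
    Returns (corrected r1, corrected r2, number of corrected bases).
    """
    r2_rc = reverse_complement(r2)
    q2_rev = q2[::-1]  # qualities follow the reversed read orientation
    offset = find_overlap(r1, r2_rc, min_overlap, max_mismatch_frac)
    if offset is None:
        return r1, r2, 0

    s1, s2 = list(r1), list(r2_rc)
    corrected = 0
    for i in range(min(len(r1) - offset, len(r2_rc))):
        j = offset + i
        if s1[j] == s2[i]:
            continue
        qa = ord(q1[j]) - 33      # Phred+33 decoding (assumed encoding)
        qb = ord(q2_rev[i]) - 33
        if qa - qb >= min_quality_gap:
            s2[i] = s1[j]
            corrected += 1
        elif qb - qa >= min_quality_gap:
            s1[j] = s2[i]
            corrected += 1
    return "".join(s1), reverse_complement("".join(s2)), corrected
```

Mismatches whose two quality scores are close are left untouched in this sketch, since neither read gives a clear reason to prefer its base.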
Pages: 10