SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data

Cited by: 1403
Authors
Chen, Yuxin [1]
Chen, Yongsheng [2, 4]
Shi, Chunmei [3, 5]
Huang, Zhibo [1]
Zhang, Yong [1, 6]
Li, Shengkang [1, 6]
Li, Yan [1]
Ye, Jia [1]
Yu, Chang [7]
Li, Zhuo [8, 9]
Zhang, Xiuqing [1]
Wang, Jian [1, 10]
Yang, Huanming [1, 10]
Fang, Lin [1, 6]
Chen, Qiang [3, 4, 5]
Affiliations
[1] BGI Shenzhen, Shenzhen 518083, Peoples R China
[2] Geneplus Beijing, Beijing 102206, Peoples R China
[3] Fujian Med Univ, Dept Oncol, Union Hosp, Fuzhou 350001, Fujian, Peoples R China
[4] Fujian Key Lab Translat Canc Med, Fuzhou 350014, Fujian, Peoples R China
[5] Fujian Med Univ, Stem Cell Res Inst, Dept Stem Cell Res Inst, Fuzhou 350000, Fujian, Peoples R China
[6] Natl Univ Def Technol, Collaborat Innovat Ctr High Performance Comp, Changsha 410073, Hunan, Peoples R China
[7] Intel China Ltd, Shanghai 200336, Peoples R China
[8] Guangdong Prov Hosp Chinese Med, Guangzhou 510120, Guangdong, Peoples R China
[9] Chinese Univ Hong Kong, Dept Surg, Fac Med, Hong Kong, Hong Kong, Peoples R China
[10] James D Watson Inst Genome Sci, Hangzhou 310058, Zhejiang, Peoples R China
Keywords
high-throughput sequencing; quality control; preprocessing; MapReduce; ADAPTER; TOOL
DOI
10.1093/gigascience/gix120
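Chinese Library Classification number
Q [Biological Sciences];
Discipline classification code
090105 [Crop Production Systems and Ecological Engineering];
Abstract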
Quality control (QC) and preprocessing are essential steps in sequencing data analysis to ensure the accuracy of results. However, existing tools cannot provide a satisfying solution that combines integrated comprehensive functions, a proper architecture, and highly scalable acceleration. In this article, we present SOAPnuke, a tool with abundant functions for a "QC-Preprocess-QC" workflow and a MapReduce acceleration framework. Four modules with different preprocessing functions are designed for processing datasets from genomic, small RNA, Digital Gene Expression, and metagenomic experiments, respectively. As a workflow-like tool, SOAPnuke centralizes processing functions into 1 executable and predefines their order, which avoids the need to reformat files when switching between tools. Furthermore, the MapReduce framework provides high scalability by distributing all processing work across an entire compute cluster. We conducted a benchmark in which SOAPnuke and other tools were used to preprocess a ~30x NA12878 dataset published by GIAB. The standalone operation of SOAPnuke struck a balance between resource occupancy and performance. When accelerated on 16 working nodes with MapReduce, SOAPnuke achieved ~5.7 times the speed of the fastest of the other tools.
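The "QC-Preprocess-QC" workflow and its MapReduce-style distribution described in the abstract can be illustrated with a minimal Python sketch. This is not SOAPnuke's implementation: every function name here (`parse_fastq`, `qc_stats`, `keep_read`, `map_chunk`, `merge`, `run`) is hypothetical, the filter thresholds are arbitrary illustrative values, Phred+33 quality encoding is assumed, and the map and reduce steps run in a single process rather than on a cluster.

```python
from functools import reduce


def parse_fastq(lines):
    """Yield (name, seq, qual) tuples from an iterable of FASTQ lines."""
    it = iter(lines)
    for header in it:
        seq = next(it)
        next(it)  # skip the '+' separator line
        qual = next(it)
        yield header.strip(), seq.strip(), qual.strip()


def qc_stats(records):
    """QC step: count reads, bases, and bases with Phred quality >= 20."""
    stats = {"reads": 0, "bases": 0, "q20_bases": 0}
    for _, seq, qual in records:
        stats["reads"] += 1
        stats["bases"] += len(seq)
        # Assumption: Phred+33 encoding (quality = ASCII code - 33).
        stats["q20_bases"] += sum(1 for c in qual if ord(c) - 33 >= 20)
    return stats


def keep_read(record, max_n_frac=0.1, min_mean_q=20):
    """Preprocess step: keep reads with few N bases and adequate mean quality.

    The thresholds are illustrative assumptions, not SOAPnuke defaults.
    """
    _, seq, qual = record
    if not seq or not qual:
        return False
    n_frac = seq.upper().count("N") / len(seq)
    mean_q = sum(ord(c) - 33 for c in qual) / len(qual)
    return n_frac <= max_n_frac and mean_q >= min_mean_q


def map_chunk(chunk):
    """Map step: run QC, filtering, and QC again on one chunk of reads."""
    records = list(chunk)
    kept = [r for r in records if keep_read(r)]
    return {"raw": qc_stats(records), "clean": qc_stats(kept), "kept": kept}


def merge(a, b):
    """Reduce step: sum the per-chunk statistics and concatenate kept reads."""
    return {
        "raw": {k: a["raw"][k] + b["raw"][k] for k in a["raw"]},
        "clean": {k: a["clean"][k] + b["clean"][k] for k in a["clean"]},
        "kept": a["kept"] + b["kept"],
    }


def run(chunks):
    """Apply the map step to every chunk, then fold the results together."""
    return reduce(merge, map(map_chunk, chunks))
```

In a real cluster setting each call to `map_chunk` would run on a different worker node; the sketch only shows why the workflow parallelizes well: reads are filtered independently, and the QC statistics are simple sums that can be merged in any order.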
Pages: 1 / 6
Number of pages: 6