COPE: an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly

Cited by: 125
Authors
Liu, Binghang [1, 2]
Yuan, Jianying [2]
Yiu, Siu-Ming [1, 3]
Li, Zhenyu [1, 2]
Xie, Yinlong [1, 2]
Chen, Yanxiang [2]
Shi, Yujian [2]
Zhang, Hao [2]
Li, Yingrui [1, 2]
Lam, Tak-Wah [1, 3]
Luo, Ruibang [1, 2, 3]
Affiliations
[1] Univ Hong Kong, HKU BGI BAL Bioinformat Algorithms & Core Technol, Hong Kong, Hong Kong, Peoples R China
[2] BGI Shenzhen, Shenzhen 518083, Guangdong, Peoples R China
[3] Univ Hong Kong, Dept Comp Sci, Hong Kong, Hong Kong, Peoples R China
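Funding
National Natural Science Foundation of China; National High Technology Research and Development Program of China (863 Program);
Keywords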
SEQUENCE; PROFILE;
DOI
10.1093/bioinformatics/bts563
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
070307 [Chemical Biology];
Abstract
Motivation: The boost of next-generation sequencing technologies provides us with an unprecedented opportunity for elucidating genetic mysteries, yet the short read length hinders us from better assembling the genome from scratch. New protocols now exist that can generate overlapping pair-end reads. By joining the 3' ends of each read pair, one is able to construct longer reads for assembly. However, effectively joining two overlapping pair-end reads remains a challenging task.
Results: In this article, we present an efficient tool called Connecting Overlapped Pair-End (COPE) reads, to connect overlapping pair-end reads using k-mer frequencies. We evaluated our tool on 30x simulated pair-end reads from Arabidopsis thaliana with 1% base error. COPE connected over 99% of reads with 98.8% accuracy, which is, respectively, 10% and 2% higher than the recently published tool FLASH. When COPE is applied to real reads for genome assembly, the resulting contigs are found to have fewer errors and give a 14-fold improvement in the N50 measurement when compared with the contigs produced using unconnected reads.
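The joining step described in the abstract can be illustrated with a minimal Python sketch. This is a naive overlap search with a mismatch tolerance, not COPE's actual method: COPE uses k-mer frequencies to choose among candidate overlaps, and that part is omitted here. The function names and the `min_overlap` and `max_mismatch_rate` parameters are hypothetical, chosen only for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def join_pair(read1, read2, min_overlap=10, max_mismatch_rate=0.1):
    """Join a pair-end read pair whose 3' ends overlap.

    Assumption: read2 comes from the opposite strand, so it is
    reverse-complemented before comparison. Overlap lengths are tried
    from longest to shortest, and the first one whose mismatch rate is
    within the tolerance is accepted. Returns the merged sequence, or
    None if no acceptable overlap exists.
    """
    mate = reverse_complement(read2)
    longest = min(len(read1), len(mate))
    for overlap in range(longest, min_overlap - 1, -1):
        tail = read1[-overlap:]   # 3' end of read 1
        head = mate[:overlap]     # start of the reoriented read 2
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_rate * overlap:
            # Keep read 1 in full and append the non-overlapping part of the mate.
            return read1 + mate[overlap:]
    return None


# Usage: two 25-base reads drawn from a 40-base fragment, overlapping by 10 bases.
fragment = "ACGTTGCAAGGCTTACCGATGGATCCTTAGGCAATCGTAC"
read1 = fragment[:25]
read2 = reverse_complement(fragment[15:])
merged = join_pair(read1, read2)
```

Accepting the first overlap that passes the threshold is the weak point of this sketch: in repetitive or error-prone regions several overlap lengths may pass, and that ambiguity is the case the abstract says k-mer frequencies are used to resolve.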
Pages: 2870-2874
Page count: 5
Related papers
13 in total
[1]
Field guide to next-generation DNA sequencers [J].
Glenn, Travis C. .
MOLECULAR ECOLOGY RESOURCES, 2011, 11 (05) :759-769
[2]
High-quality draft assemblies of mammalian genomes from massively parallel sequence data [J].
Gnerre, Sante ;
MacCallum, Iain ;
Przybylski, Dariusz ;
Ribeiro, Filipe J. ;
Burton, Joshua N. ;
Walker, Bruce J. ;
Sharpe, Ted ;
Hall, Giles ;
Shea, Terrance P. ;
Sykes, Sean ;
Berlin, Aaron M. ;
Aird, Daniel ;
Costello, Maura ;
Daza, Riza ;
Williams, Louise ;
Nicol, Robert ;
Gnirke, Andreas ;
Nusbaum, Chad ;
Lander, Eric S. ;
Jaffe, David B. .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2011, 108 (04) :1513-1518
[3]
pIRS: Profile-based Illumina pair-end reads simulator [J].
Hu, Xuesong ;
Yuan, Jianying ;
Shi, Yujian ;
Lu, Jianliang ;
Liu, Binghang ;
Li, Zhenyu ;
Chen, Yanxiang ;
Mu, Desheng ;
Zhang, Hao ;
Li, Nan ;
Yue, Zhen ;
Bai, Fan ;
Li, Heng ;
Fan, Wei .
BIOINFORMATICS, 2012, 28 (11) :1533-1535
[4]
Quake: quality-aware detection and correction of sequencing errors [J].
Kelley, David R. ;
Schatz, Michael C. ;
Salzberg, Steven L. .
GENOME BIOLOGY, 2010, 11 (11)
[5]
Adaptive seeds tame genomic sequence comparison [J].
Kielbasa, Szymon M. ;
Wan, Raymond ;
Sato, Kengo ;
Horton, Paul ;
Frith, Martin C. .
GENOME RESEARCH, 2011, 21 (03) :487-493
[6]
De novo assembly of human genomes with massively parallel short read sequencing [J].
Li, Ruiqiang ;
Zhu, Hongmei ;
Ruan, Jue ;
Qian, Wubin ;
Fang, Xiaodong ;
Shi, Zhongbin ;
Li, Yingrui ;
Li, Shengting ;
Shan, Gao ;
Kristiansen, Karsten ;
Li, Songgang ;
Yang, Huanming ;
Wang, Jian ;
Wang, Jun .
GENOME RESEARCH, 2010, 20 (02) :265-272
[7]
Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph [J].
Li, Zhenyu ;
Chen, Yanxiang ;
Mu, Desheng ;
Yuan, Jianying ;
Shi, Yujian ;
Zhang, Hao ;
Gan, Jun ;
Li, Nan ;
Hu, Xuesong ;
Liu, Binghang ;
Yang, Bicheng ;
Fan, Wei .
BRIEFINGS IN FUNCTIONAL GENOMICS, 2012, 11 (01) :25-37
[8]
FLASH: fast length adjustment of short reads to improve genome assemblies [J].
Magoc, Tanja ;
Salzberg, Steven L. .
BIOINFORMATICS, 2011, 27 (21) :2957-2963
[9]
Next-generation transcriptome assembly [J].
Martin, Jeffrey A. ;
Wang, Zhong .
NATURE REVIEWS GENETICS, 2011, 12 (10) :671-682
[10]
PANDAseq: PAired-eND Assembler for Illumina sequences [J].
Masella, Andre P. ;
Bartram, Andrea K. ;
Truszkowski, Jakub M. ;
Brown, Daniel G. ;
Neufeld, Josh D. .
BMC BIOINFORMATICS, 2012, 13