A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

被引:2780
作者
Marcais, Guillaume [1 ]
Kingsford, Carl [2 ,3 ]
机构
[1] Univ Maryland, Program Appl Math Stat & Sci Computat, College Pk, MD 20742 USA
[2] Univ Maryland, Dept Comp Sci, College Pk, MD 20742 USA
[3] Univ Maryland, Inst Adv Comp Studies, College Pk, MD 20742 USA
基金
美国国家科学基金会; 美国国家卫生研究院;
关键词
LARGE GENOMES; SEQUENCE;
D O I
10.1093/bioinformatics/btr011
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Motivation: Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities allowing for a new parallel computational paradigm. Results: We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution.
引用
收藏
页码:764 / 770
页数:7
相关论文
共 23 条
[1]   RAP:: a new computer program for de novo identification of repeated sequences in whole genomes [J].
Campagna, D ;
Romualdi, C ;
Vitulo, N ;
Del Favero, M ;
Lexa, M ;
Cannata, N ;
Valle, G .
BIOINFORMATICS, 2005, 21 (05) :582-588
[2]  
CORMEN TH, 1990, INTRO ALGORITHMS, pCH12
[3]   Multi-Platform Next-Generation Sequencing of the Domestic Turkey (Meleagris gallopavo): Genome Assembly and Analysis [J].
Dalloul, Rami A. ;
Long, Julie A. ;
Zimin, Aleksey V. ;
Aslam, Luqman ;
Beal, Kathryn ;
Blomberg, Le Ann ;
Bouffard, Pascal ;
Burt, David W. ;
Crasta, Oswald ;
Crooijmans, Richard P. M. A. ;
Cooper, Kristal ;
Coulombe, Roger A. ;
De, Supriyo ;
Delany, Mary E. ;
Dodgson, Jerry B. ;
Dong, Jennifer J. ;
Evans, Clive ;
Frederickson, Karin M. ;
Flicek, Paul ;
Florea, Liliana ;
Folkerts, Otto ;
Groenen, Martien A. M. ;
Harkins, Tim T. ;
Herrero, Javier ;
Hoffmann, Steve ;
Megens, Hendrik-Jan ;
Jiang, Andrew ;
de Jong, Pieter ;
Kaiser, Pete ;
Kim, Heebal ;
Kim, Kyu-Won ;
Kim, Sungwon ;
Langenberger, David ;
Lee, Mi-Kyung ;
Lee, Taeheon ;
Mane, Shrinivasrao ;
Marcais, Guillaume ;
Marz, Manja ;
McElroy, Audrey P. ;
Modise, Thero ;
Nefedov, Mikhail ;
Notredame, Cedric ;
Paton, Ian R. ;
Payne, William S. ;
Pertea, Geo ;
Prickett, Dennis ;
Puiu, Daniela ;
Qioa, Dan ;
Raineri, Emanuele ;
Ruffier, Magali .
PLOS BIOLOGY, 2010, 8 (09)
[4]  
Dean J, 2004, USENIX ASSOCIATION PROCEEDINGS OF THE SIXTH SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION (OSDE '04), P137
[5]   MUSCLE: multiple sequence alignment with high accuracy and high throughput [J].
Edgar, RC .
NUCLEIC ACIDS RESEARCH, 2004, 32 (05) :1792-1797
[6]  
GAO H, 2004, P 18 INT PAR DISTR P, pA50
[7]   Annotating large genomes with exact word matches [J].
Healy, J ;
Thomas, EE ;
Schwartz, JT ;
Wigler, M .
GENOME RESEARCH, 2003, 13 (10) :2306-2315
[8]   Whole-genome sequence assembly for mammalian genomes: Arachne 2 [J].
Jaffe, DB ;
Butler, J ;
Gnerre, S ;
Mauceli, E ;
Lindblad-Toh, K ;
Mesirov, JP ;
Zody, MC ;
Lander, ES .
GENOME RESEARCH, 2003, 13 (01) :91-96
[9]   Quake: quality-aware detection and correction of sequencing errors [J].
Kelley, David R. ;
Schatz, Michael C. ;
Salzberg, Steven L. .
GENOME BIOLOGY, 2010, 11 (11)
[10]   A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes [J].
Kurtz, Stefan ;
Narechania, Apurva ;
Stein, Joshua C. ;
Ware, Doreen .
BMC GENOMICS, 2008, 9 (1) :517