A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

被引：2780

作者：

Marcais, Guillaume ^{[1
]}

Kingsford, Carl ^{[2
,3
]}

机构：

[1] Univ Maryland, Program Appl Math Stat & Sci Computat, College Pk, MD 20742 USA

[2] Univ Maryland, Dept Comp Sci, College Pk, MD 20742 USA

[3] Univ Maryland, Inst Adv Comp Studies, College Pk, MD 20742 USA

来源：

BIOINFORMATICS | 2011年 / 27卷 / 06期

基金：

美国国家科学基金会; 美国国家卫生研究院;

关键词：

LARGE GENOMES; SEQUENCE;

D O I：

10.1093/bioinformatics/btr011

中图分类号：

Q5 [生物化学];

学科分类号：

071010 ; 081704 ;

摘要：

Motivation: Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities allowing for a new parallel computational paradigm. Results: We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution.

引用

页码：764 / 770

页数：7

共 23 条

[1] RAP:: a new computer program for de novo identification of repeated sequences in whole genomes [J].

Campagna, D ;

Romualdi, C ;

Vitulo, N ;

Del Favero, M ;

Lexa, M ;

Cannata, N ;

Valle, G .

BIOINFORMATICS, 2005, 21 (05) :582-588

[2]

CORMEN TH, 1990, INTRO ALGORITHMS, pCH12

[3] Multi-Platform Next-Generation Sequencing of the Domestic Turkey (Meleagris gallopavo): Genome Assembly and Analysis [J].

Dalloul, Rami A. ;

Long, Julie A. ;

Zimin, Aleksey V. ;

Aslam, Luqman ;

Beal, Kathryn ;

Blomberg, Le Ann ;

Bouffard, Pascal ;

Burt, David W. ;

Crasta, Oswald ;

Crooijmans, Richard P. M. A. ;

Cooper, Kristal ;

Coulombe, Roger A. ;

De, Supriyo ;

Delany, Mary E. ;

Dodgson, Jerry B. ;

Dong, Jennifer J. ;

Evans, Clive ;

Frederickson, Karin M. ;

Flicek, Paul ;

Florea, Liliana ;

Folkerts, Otto ;

Groenen, Martien A. M. ;

Harkins, Tim T. ;

Herrero, Javier ;

Hoffmann, Steve ;

Megens, Hendrik-Jan ;

Jiang, Andrew ;

de Jong, Pieter ;

Kaiser, Pete ;

Kim, Heebal ;

Kim, Kyu-Won ;

Kim, Sungwon ;

Langenberger, David ;

Lee, Mi-Kyung ;

Lee, Taeheon ;

Mane, Shrinivasrao ;

Marcais, Guillaume ;

Marz, Manja ;

McElroy, Audrey P. ;

Modise, Thero ;

Nefedov, Mikhail ;

Notredame, Cedric ;

Paton, Ian R. ;

Payne, William S. ;

Pertea, Geo ;

Prickett, Dennis ;

Puiu, Daniela ;

Qioa, Dan ;

Raineri, Emanuele ;

Ruffier, Magali .

PLOS BIOLOGY, 2010, 8 (09)

[4]

Dean J, 2004, USENIX ASSOCIATION PROCEEDINGS OF THE SIXTH SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION (OSDE '04), P137

[5] MUSCLE: multiple sequence alignment with high accuracy and high throughput [J].

Edgar, RC .

NUCLEIC ACIDS RESEARCH, 2004, 32 (05) :1792-1797

[6]

GAO H, 2004, P 18 INT PAR DISTR P, pA50

[7] Annotating large genomes with exact word matches [J].

Healy, J ;

Thomas, EE ;

Schwartz, JT ;

Wigler, M .

GENOME RESEARCH, 2003, 13 (10) :2306-2315

[8] Whole-genome sequence assembly for mammalian genomes: Arachne 2 [J].

Jaffe, DB ;

Butler, J ;

Gnerre, S ;

Mauceli, E ;

Lindblad-Toh, K ;

Mesirov, JP ;

Zody, MC ;

Lander, ES .

GENOME RESEARCH, 2003, 13 (01) :91-96

[9] Quake: quality-aware detection and correction of sequencing errors [J].

Kelley, David R. ;

Schatz, Michael C. ;

Salzberg, Steven L. .

GENOME BIOLOGY, 2010, 11 (11)

[10] A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes [J].

Kurtz, Stefan ;

Narechania, Apurva ;

Stein, Joshua C. ;

Ware, Doreen .

BMC GENOMICS, 2008, 9 (1) :517

← 1 2 3 →