DSK: k-mer counting with very low memory usage

被引:186
作者
Rizk, Guillaume [1 ]
Lavenier, Dominique [2 ]
Chikhi, Rayan [2 ]
机构
[1] Algorizk, F-75013 Paris, France
[2] ENS Cachan Brittany IRISA, F-35700 Rennes, France
关键词
D O I
10.1093/bioinformatics/btt020
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Counting all the k-mers (substrings of length k) in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications. However, state of the art k-mer counting methods require that a large data structure resides in memory. Such structure typically grows with the number of distinct k-mers to count. We present a new streaming algorithm for k-mer counting, called DSK (disk streaming of k-mers), which only requires a fixed user-defined amount of memory and disk space. This approach realizes a memory, time and disk trade-off. The multi-set of all k-mers present in the reads is partitioned, and partitions are saved to disk. Then, each partition is separately loaded in memory in a temporary hash table. The k-mer counts are returned by traversing each hash table. Low-abundance k-mers are optionally filtered. DSK is the first approach that is able to count all the 27-mers of a human genome dataset using only 4.0 GB of memory and moderate disk space (160 GB), in 17.9 h. DSK can replace a popular k-mer counting software (Jellyfish) on small-memory servers.
引用
收藏
页码:652 / 653
页数:2
相关论文
共 2 条
[1]   A fast, lock-free approach for efficient parallel counting of occurrences of k-mers [J].
Marcais, Guillaume ;
Kingsford, Carl .
BIOINFORMATICS, 2011, 27 (06) :764-770
[2]   Efficient counting of k-mers in DNA sequences using a bloom filter [J].
Melsted, Pall ;
Pritchard, Jonathan K. .
BMC BIOINFORMATICS, 2011, 12