KAnalyze: a fast versatile pipelined K-mer toolkit

Cited by: 41
Authors
Audano, Peter [1]
Vannberg, Fredrik [1]
Affiliations
[1] Georgia Inst Technol, Sch Biol, Atlanta, GA 30332 USA
DOI
10.1093/bioinformatics/btu152
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Motivation: Converting nucleotide sequences into short overlapping fragments of uniform length, k-mers, is a common step in many bioinformatics applications. While existing software packages count k-mers, few are optimized for speed, offer an application programming interface (API), a graphical interface or contain features that make it extensible and maintainable. We designed KAnalyze to compete with the fastest k-mer counters, to produce reliable output and to support future development efforts through well-architected, documented and testable code. Currently, KAnalyze can output k-mer counts in a sorted tab-delimited file or stream k-mers as they are read. KAnalyze can process large datasets with 2 GB of memory. This project is implemented in Java 7, and the command line interface (CLI) is designed to integrate into pipelines written in any language. Results: As a k-mer counter, KAnalyze outperforms Jellyfish, DSK and a pipeline built on Perl and Linux utilities. Through extensive unit and system testing, we have verified that KAnalyze produces the correct k-mer counts over multiple datasets and k-mer sizes.
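The k-mer decomposition described in the abstract can be illustrated with a minimal sketch. This is not KAnalyze's implementation or API (KAnalyze is written in Java); the function names `count_kmers` and `to_sorted_tsv` and the two-column `kmer<TAB>count` layout are assumptions made for illustration only. The sketch holds all counts in memory and does not merge reverse complements.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer in a nucleotide sequence (hypothetical helper)."""
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    # A sequence of length n yields n - k + 1 overlapping k-mers;
    # if n < k the range is empty and no k-mers are produced.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def to_sorted_tsv(counts):
    """Render counts as tab-delimited lines sorted by k-mer (assumed layout)."""
    return "\n".join(f"{kmer}\t{n}" for kmer, n in sorted(counts.items()))


counts = count_kmers("ACGTACGT", 3)
tsv = to_sorted_tsv(counts)
print(tsv)
```

For the 8-base input above with k = 3, the six overlapping windows are ACG, CGT, GTA, TAC, ACG and CGT, so ACG and CGT are each counted twice. An in-memory counter like this one does not scale to large sequencing datasets, which is the problem dedicated k-mer counters address.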
Pages: 2070-2072
Page count: 3
Related papers
6 in total
[1]   Knuth D., 1998, SORTING SEARCHING, V3, P248
[2]   A fast, lock-free approach for efficient parallel counting of occurrences of k-mers [J].
Marcais, Guillaume ;
Kingsford, Carl .
BIOINFORMATICS, 2011, 27 (06) :764-770
[3]   UniPROBE: an online database of protein binding microarray data on protein-DNA interactions [J].
Newburger, Daniel E. ;
Bulyk, Martha L. .
NUCLEIC ACIDS RESEARCH, 2009, 37 :D77-D82
[4]   Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using k-mers [J].
Nordstroem, Karl J. V. ;
Albani, Maria C. ;
James, Geo Velikkakam ;
Gutjahr, Caroline ;
Hartwig, Benjamin ;
Turck, Franziska ;
Paszkowski, Uta ;
Coupland, George ;
Schneeberger, Korbinian .
NATURE BIOTECHNOLOGY, 2013, 31 (04) :325-+
[5]   DSK: k-mer counting with very low memory usage [J].
Rizk, Guillaume ;
Lavenier, Dominique ;
Chikhi, Rayan .
BIOINFORMATICS, 2013, 29 (05) :652-653
[6]   Best Practices for Scientific Computing [J].
Wilson, Greg ;
Aruliah, D. A. ;
Brown, C. Titus ;
Hong, Neil P. Chue ;
Davis, Matt ;
Guy, Richard T. ;
Haddock, Steven H. D. ;
Huff, Kathryn D. ;
Mitchell, Ian M. ;
Plumbley, Mark D. ;
Waugh, Ben ;
White, Ethan P. ;
Wilson, Paul .
PLOS BIOLOGY, 2014, 12 (01)