Compressing DNA sequence databases with coil

被引：12

作者：

White, W. Timothy J. ^{[1
]}

Hendy, Michael D. ^{[1
]}

机构：

[1] Massey Univ, Allan Wilson Ctr Mol Ecol & Evolut, Palmerston North, New Zealand

来源：

BMC BIOINFORMATICS | 2008年 / 9卷 / 1期

关键词：

D O I：

10.1186/1471-2105-9-242

中图分类号：

Q5 [生物化学];

学科分类号：

071010 ; 081704 ;

摘要：

Background: Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression - an approach which rarely achieves good compression ratios. While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil. Results: We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression - the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental additions to a sequence database. Conclusion: coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work.

引用

页数：15

共 35 条

[1] Burkhardt S, 2002, LECT NOTES COMPUT SC, V2373, P225
[2] A minimum spanning tree algorithm with Inverse Ackermann type complexity
Chazelle, B
[J]. JOURNAL OF THE ACM, 2000, 47 (06) : 1028 - 1047
[3] DNACompress: fast and effective DNA sequence compression
Chen, X
Li, M
Ma, B
Tromp, J
[J]. BIOINFORMATICS, 2002, 18 (12) : 1696 - 1698
[4] A compression algorithm for DNA sequences
Chen, X
Kwong, S
Li, M
[J]. IEEE ENGINEERING IN MEDICINE AND BIOLOGY MAGAZINE, 2001, 20 (04): : 61 - 66
[5] Indexing compressed text
Ferragina, P
Manzini, G
[J]. JOURNAL OF THE ACM, 2005, 52 (04) : 552 - 581
[6] When Indexing Equals Compression: Experiments with Compressing Suffix Arrays and Applications
Foschini, Luca
Grossi, Roberto
Gupta, Ankur
Vitter, Jeffrey Scott
[J]. ACM TRANSACTIONS ON ALGORITHMS, 2006, 2 (04)
[7] Foulds L.R., 1982, ADV APPL MATH, V3, P43, DOI DOI 10.1016/S0196-8858(82)80004-3
[8] GAILLY J, GZIP GNU ZIP COMPRES
[9] A NEW CHALLENGE FOR COMPRESSION ALGORITHMS - GENETIC SEQUENCES
GRUMBACH, S
TAHI, F
[J]. INFORMATION PROCESSING & MANAGEMENT, 1994, 30 (06) : 875 - 886
[10] Grumbach S., 1993, COMPRESSION DNA SEQU, P340, DOI 10.1109/DCC.1993.253115

← 1 2 3 4 →