Efficient storage of high throughput DNA sequencing data using reference-based compression

Cited by: 248
Authors
Fritz, Markus Hsi-Yang [1]
Leinonen, Rasko [1]
Cochrane, Guy [1]
Birney, Ewan [1]
Affiliations
[1] EMBL EBI, Hinxton CB10 1SD, Cambs, England
Funding
Wellcome Trust (UK)
Keywords
DOI
10.1101/gr.114819.110
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Data storage costs have become an appreciable proportion of total cost in the creation and analysis of DNA sequence data. Of particular concern is that the rate of increase in DNA sequencing is significantly outstripping the rate of increase in disk storage capacity. In this paper we present a new reference-based compression method that efficiently compresses DNA sequences for storage. Our approach works for resequencing experiments that target well-studied genomes. We align new sequences to a reference genome and then encode the differences between the new sequence and the reference genome for storage. Our compression method is most efficient when we allow controlled loss of data in the saving of quality information and unaligned sequences. With this new compression method we observe exponential efficiency gains as read lengths increase, and the magnitude of this efficiency gain can be controlled by changing the amount of quality information stored. Our compression method is tunable: The storage of quality scores and unaligned sequences may be adjusted for different experiments to conserve information or to minimize storage costs, and provides one opportunity to address the threat that increasing DNA sequence volumes will overcome our ability to store the sequences.
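The core idea in the abstract (align a read to a reference, then keep only the differences) can be illustrated with a minimal Python sketch. This is not the paper's actual encoding: the record layout, the function names `encode_read` and `decode_read`, and the restriction to ungapped alignments with substitutions only are all assumptions made for illustration, and quality scores and unaligned reads are left out.

```python
from typing import List, Tuple

# Hypothetical record layout (an assumption, not the paper's format):
# (start position on the reference, read length, [(offset, substituted base), ...])
Encoded = Tuple[int, int, List[Tuple[int, str]]]


def encode_read(reference: str, read: str, start: int) -> Encoded:
    """Store only where an aligned read differs from the reference.

    Assumes an ungapped alignment: substitutions only, no insertions or deletions.
    """
    window = reference[start:start + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    diffs = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if ref_base != base
    ]
    return start, len(read), diffs


def decode_read(reference: str, encoded: Encoded) -> str:
    """Rebuild the read from the reference plus its stored differences."""
    start, length, diffs = encoded
    bases = list(reference[start:start + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTACGTACGT"
read = "TACGAACG"  # aligns at position 3 with a single substitution
encoded = encode_read(reference, read, start=3)
restored = decode_read(reference, encoded)
```

In this sketch a read that matches the reference exactly is stored as just its position and length, which gives some intuition for why the approach suits resequencing of well-studied genomes.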
Pages: 734-740
Number of pages: 7