mz5: Space- and Time-efficient Storage of Mass Spectrometry Data Sets

Cited by: 41
Authors
Wilhelm, Mathias [1, 2, 3]
Kirchner, Marc [1, 3, 4]
Steen, Judith A. J. [1, 5, 6]
Steen, Hanno [1, 3, 4]
Affiliations
[1] Childrens Hosp Boston, Prote Ctr, Boston, MA USA
[2] Univ Bielefeld, Fac Technol, D-33615 Bielefeld, Germany
[3] Childrens Hosp Boston, Dept Pathol, Boston, MA USA
[4] Harvard Univ, Sch Med, Dept Pathol, Boston, MA 02115 USA
[5] Harvard Univ, Sch Med, Dept Neurobiol, Boston, MA 02115 USA
[6] Childrens Hosp, FM Kirby Neurobiol Ctr, Boston, MA 02115 USA
Funding
National Institutes of Health (USA);
Keywords
OPEN SOURCE SOFTWARE; IDENTIFICATION;
DOI
10.1074/mcp.O111.011379
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Across a host of MS-driven -omics fields, researchers witness the acquisition of ever-increasing amounts of high-throughput MS data and face the need for their compact yet efficiently accessible storage. Addressing the need for an open data exchange format, the Proteomics Standards Initiative and the Seattle Proteome Center at the Institute for Systems Biology independently developed the mzData and mzXML formats, respectively. In a subsequent joint effort, they defined an ontology and associated controlled vocabulary that specifies the contents of MS data files, implemented as the newer mzML format. All three formats are based on XML and are thus not particularly efficient in either storage space requirements or read/write speed. This contribution introduces mz5, a complete reimplementation of the mzML ontology that is based on the efficient, industrial-strength storage backend HDF5. Compared with the current mzML standard, this strategy reduces the average file size to approximately 54% and increases linear read and write speeds approximately 3- to 4-fold. The format is implemented as part of the ProteoWizard project and is available under a permissive Apache license. Additional information and download links are available from http://software.steenlab.org/mz5. Molecular & Cellular Proteomics 11: 10.1074/mcp.O111.011379, 1-5, 2012.
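As a rough illustration of why text-based containers cost extra space, the following sketch compares the size of a numeric array stored as raw binary with the same array base64-encoded for embedding in XML text. It is not the mz5 or mzML implementation: the m/z values are made up, the function names are hypothetical, and the comparison ignores compression and all XML markup, so it does not reproduce the figures reported in the abstract.

```python
import base64
import struct


def encode_raw(values):
    """Pack floats as little-endian 64-bit binary (8 bytes per value)."""
    return struct.pack(f"<{len(values)}d", *values)


def encode_base64_text(values):
    """Base64-encode the packed bytes so they can be embedded in a text format."""
    return base64.b64encode(encode_raw(values))


def decode_base64_text(text):
    """Reverse encode_base64_text and return the values as a list of floats."""
    raw = base64.b64decode(text)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


# Hypothetical m/z axis: 3000 points, invented purely for illustration.
mz_values = [400.0 + 0.01 * i for i in range(3000)]

raw = encode_raw(mz_values)
text = encode_base64_text(mz_values)

print("raw binary bytes:", len(raw))
print("base64 text bytes:", len(text))
print("text / binary size ratio:", len(text) / len(raw))
```

Base64 maps every 3 input bytes to 4 output characters, so the text form is about one third larger before any markup is added; a binary container can store the packed bytes directly.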
Pages: 5