Petabyte-scale innovations at the European Nucleotide Archive

Cited by: 58
Authors
Cochrane, Guy [1]
Akhtar, Ruth [1]
Bonfield, James [2]
Bower, Lawrence [1]
Demiralp, Fehmi [1]
Faruque, Nadeem [1]
Gibson, Richard [1]
Hoad, Gemma [1]
Hubbard, Tim [2]
Hunter, Christopher [1]
Jang, Mikyung [1]
Juhos, Szilveszter [1]
Leinonen, Rasko [1]
Leonard, Steven [2]
Lin, Quan [1]
Lopez, Rodrigo [1]
Lorenc, Dariusz [1]
McWilliam, Hamish [1]
Mukherjee, Gaurab [1]
Plaister, Sheila [1]
Radhakrishnan, Rajesh [1]
Robinson, Stephen [1]
Sobhany, Siamak [1]
Hoopen, Petra Ten [1]
Vaughan, Robert [1]
Zalunin, Vadim [1]
Birney, Ewan [1]
Affiliations
[1] EMBL European Bioinformatics Institute, Cambridge CB10 1SD, England
[2] Sanger Institute, Cambridge CB10 1SA, England
Funding
Wellcome Trust (UK);
Keywords
DATABASE; SEQUENCE;
DOI
10.1093/nar/gkn765
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
Dramatic increases in the throughput of nucleotide sequencing machines, and the promise of ever greater performance, have thrust bioinformatics into the era of petabyte-scale data sets. Sequence repositories, which provide the feed for these data sets into the worldwide computational infrastructure, are challenged by the impact of these data volumes. The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/embl), comprising the EMBL Nucleotide Sequence Database and the Ensembl Trace Archive, has identified challenges in the storage, movement, analysis, interpretation and visualization of petabyte-scale data sets. We present here our new repository for next-generation sequence data and a brief summary of the contents of the ENA, and provide details of major developments to submission pipelines, high-throughput rule-based validation infrastructure and data integration approaches.
Pages: D19-D25
Page count: 7