The Stratosphere platform for big data analytics

Cited by: 252
Authors
Alexandrov, Alexander [1]
Bergmann, Rico [2]
Ewen, Stephan [1]
Freytag, Johann-Christoph [2]
Hueske, Fabian [1]
Heise, Arvid [3]
Kao, Odej [1]
Leich, Marcus [1]
Leser, Ulf [2]
Markl, Volker [1]
Naumann, Felix [3]
Peters, Mathias [2]
Rheinlaender, Astrid [2]
Sax, Matthias J. [2]
Schelter, Sebastian [1]
Hoeger, Mareike [1]
Tzoumas, Kostas [1]
Warneke, Daniel [4]
Affiliations
[1] Tech Univ Berlin, Berlin, Germany
[2] Humboldt Univ, D-10099 Berlin, Germany
[3] Hasso Plattner Inst, Potsdam, Germany
[4] Int Comp Sci Inst, Berkeley, CA 94704 USA
Keywords
Big data; Parallel databases; Query processing; Query optimization; Data cleansing; Text mining; Graph processing; Distributed systems
DOI
10.1007/s00778-014-0357-y
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
We present Stratosphere, an open-source software stack for parallel data analysis. Stratosphere brings together a unique set of features that allow the expressive, easy, and efficient programming of analytical applications at very large scale. Stratosphere's features include "in situ" data processing, a declarative query language, treatment of user-defined functions as first-class citizens, automatic program parallelization and optimization, support for iterative programs, and a scalable and efficient execution engine. Stratosphere covers a variety of "Big Data" use cases, such as data warehousing, information extraction and integration, data cleansing, graph analysis, and statistical analysis applications. In this paper, we present the overall system architecture design decisions, introduce Stratosphere through example queries, and then dive into the internal workings of the system's components that relate to extensibility, programming model, optimization, and query execution. We experimentally compare Stratosphere against popular open-source alternatives, and we conclude with a research outlook for the next years.
Pages: 939-964
Page count: 26