Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce

Cited by: 61
Author
Chen, Songting [1]
Affiliation
[1] Turn Inc, Redwood City, CA 94063 USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2010 / Volume 3 / Issue 02
DOI
10.14778/1920841.1921020
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Large-scale data analysis has become increasingly important for many enterprises. Recently, a new distributed computing paradigm called MapReduce, together with its open-source implementation Hadoop, has been widely adopted because of its impressive scalability and its flexibility in handling structured as well as unstructured data. In this paper, we describe our data warehouse system, called Cheetah, built on top of MapReduce. Cheetah is designed specifically for our online advertising application, which allows various simplifications and custom optimizations. First, we take a fresh look at data warehouse schema design. In particular, we define a virtual view on top of the common star or snowflake data warehouse schema. This virtual view abstraction not only allows us to design a SQL-like but much more succinct query language, but also makes it easier to support many advanced query processing features. Next, we describe a stack of optimization techniques ranging from data compression and access methods to multi-query optimization and the exploitation of materialized views. In fact, each node with commodity hardware in our cluster is able to process raw data at 1 GByte/s. Lastly, we show how to seamlessly integrate Cheetah into any ad hoc MapReduce job. This allows MapReduce developers to fully leverage the power of both MapReduce and data warehouse technologies.
Pages: 1459-1468
Page count: 10