MapReduce Workload Modeling with Statistical Approach

Cited by: 53
Authors
Yang, Hailong [1]
Luan, Zhongzhi [1]
Li, Wenjun [1]
Qian, Depei [1]
Affiliation
[1] Beihang Univ, Sino German Joint Software Inst, State Key Lab Software Dev Environm, Sch Comp Sci & Engn, Beijing, Peoples R China
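Funding
National Natural Science Foundation of China; National High Technology Research and Development Program of China (863 Program)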
Keywords
Cloud computing; Data intensive computing; MapReduce; Workload characterization; Statistical analysis; Performance prediction;
DOI
10.1007/s10723-011-9201-4
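Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
080201 [Mechanical manufacturing and automation]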
Abstract
Large-scale data-intensive cloud computing with the MapReduce framework is becoming pervasive in the core business of many academic, government, and industrial organizations. Hadoop, a state-of-the-art open source project, is by far the most successful realization of the MapReduce framework. While MapReduce is easy to use, efficient, and reliable for data-intensive computations, the large number of configuration parameters in Hadoop poses unexpected challenges to running various workloads effectively on a Hadoop cluster. Consequently, developers with little experience of the Hadoop configuration system may devote significant effort to writing an application that performs poorly, either because they do not know how these configurations influence performance or because they are not even aware that the configurations exist. There is a pressing need for comprehensive analysis and performance modeling to ease MapReduce application development and guide performance optimization under different Hadoop configurations. In this paper, we propose a statistical analysis approach to identify the relationships among workload characteristics, Hadoop configurations, and workload performance. We apply principal component analysis and cluster analysis to 45 different metrics to derive relationships between workload characteristics and the corresponding performance under different Hadoop configurations. We also construct regression models that attempt to predict the performance of various workloads under different Hadoop configurations. Our analysis reveals several non-intuitive relationships between workload characteristics and performance, and the experimental results demonstrate that our regression models accurately predict the performance of MapReduce workloads under different Hadoop configurations.
Pages: 279-310
Page count: 32