LIBRA: Lightweight Data Skew Mitigation in MapReduce

被引：75

作者：

Chen, Qi ^{[1
]}

Yao, Jinyu ^{[1
]}

Xiao, Zhen ^{[1
]}

机构：

[1] Peking Univ, Dept Comp Sci, Beijing 100871, Peoples R China

来源：

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS | 2015年 / 26卷 / 09期

基金：

中国国家自然科学基金;

关键词：

MapReduce; data skew; sampling; partitioning;

D O I：

10.1109/TPDS.2014.2350972

中图分类号：

TP301 [理论、方法];

学科分类号：

081202 ;

摘要：

MapReduce is an effective tool for parallel data processing. One significant issue in practical MapReduce applications is data skew: the imbalance in the amount of data assigned to each task. This causes some tasks to take much longer to finish than others and can significantly impact performance. This paper presents LIBRA, a lightweight strategy to address the data skew problemamong the reducers of MapReduce applications. Unlike previous work, LIBRA does not require any pre-run sampling of the input data or prevent the overlap between the map and the reduce stages. It uses an innovative sampling method which can achieve a highly accurate approximation to the distribution of the intermediate data by sampling only a small fraction of the intermediate data during the normal map processing. It allows the reduce tasks to start copying as soon as the chosen sample map tasks (only a small fraction of map tasks which are issued first) complete. It supports the split of large keys when application semantics permit and the total order of the output data. It considers the heterogeneity of the computing resources when balancing the load among the reduce tasks appropriately. LIBRA is applicable to a wide range of applications and is transparent to the users. We implement LIBRA in Hadoop and our experiments show that LIBRA has negligible overhead and can speed up the execution of some popular applications by

引用

页码：2520 / 2533

页数：14

共 29 条

[1] Acharya S, 2000, SIGMOD REC, V29, P487
[2] Adamic L.A., 2002, GLOTTOMETRICS, V3, P143, DOI DOI 10.1109/S0SE.2014.50
[3] Ananthanarayanan G., 2010, 9 USENIX S OP SYST D, DOI DOI 10.5555/1924943.1924962
[4] Improving MapReduce Performance Using Smart Speculative Execution Strategy
Chen, Qi
Liu, Cheng
Xiao, Zhen
[J]. IEEE TRANSACTIONS ON COMPUTERS, 2014, 63 (04) : 954 - 967
[5] Dean J, 2004, USENIX ASSOCIATION PROCEEDINGS OF THE SIXTH SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION (OSDE '04), P137
[6] DEWITT DJ, 1992, PROC INT CONF VERY L, P27
[7] Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience
Gates, Alan F.
Natkovich, Olga
Chopra, Shubham
Kamath, Pradeep
Narayanamurthy, Shravan M.
Olston, Christopher
Reed, Benjamin
Srinivasan, Santhosh
Srivastava, Utkarsh
[J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2009, 2 (02): : 1414 - 1425
[8] Gufler Benjamin, 2011, Proceedings of the 1st International Conference on Cloud Computing and Services Science. CLOSER 2011, P574
[9] Load Balancing in MapReduce Based on Scalable Cardinality Estimates
Gufler, Benjamin
Augsten, Nikolaus
Reiser, Angelika
Kemper, Alfons
[J]. 2012 IEEE 28TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2012, : 522 - 533
[10] Ibrahim S., 2010, Proceedings of the 2010 IEEE 2nd International Conference on Cloud Computing Technology and Science (CloudCom 2010), P17, DOI 10.1109/CloudCom.2010.25

← 1 2 3 →