Performance comparison under failures of MPI and MapReduce: An analytical approach

Cited by: 11
Authors
Jin, Hui [1]
Sun, Xian-He [2]
Affiliations
[1] Oracle, Parallel Query Grp, Redwood City, CA 94065 USA
[2] IIT, Chicago, IL 60616 USA
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2013 / Vol. 29 / No. 07
Funding
National Science Foundation (USA);
Keywords
Fault tolerance; MPI; MapReduce; Checkpoint; OPTIMUM CHECKPOINT INTERVAL;
DOI
10.1016/j.future.2013.01.013
Chinese Library Classification
TP301 [Theory, methods];
Discipline classification code
080201 [Mechanical manufacturing and automation];
Abstract
MPI has been the de facto standard of parallel programming for decades. There has been increasing concern about the reliability of MPI applications in recent years, partially due to the inefficiency of parallel checkpointing. MapReduce is a newer programming model originally introduced to handle massive data processing. Numerous recent efforts transform classical MPI-based scientific applications to MapReduce, motivated by its easy programming, automatic parallelism, and fault tolerance. However, the stricter synchronization primitive supported by MapReduce also imposes considerable overhead. While the failure-free performance comparison between MPI and MapReduce has been investigated, little work exists comparing the two programming models under failures. In this paper, we propose an analytical approach to quantifying and comparing the capabilities of the two programming models to tolerate failures. We also carry out extensive numerical analysis to study the impact of different parameters on fault tolerance. This work can be used by the HPC community in making critical decisions. For example, it helps algorithm designers answer questions such as: at which scale should we give up MPI and use MapReduce as the programming model for better performance in the presence of failures? (C) 2013 Elsevier B.V. All rights reserved.
Pages: 1808-1815
Page count: 8