Checkpointing in distributed computing systems

Cited by: 18
Authors
Wong, KF
Franklin, M
Affiliation
[1] Comp. and Commun. Res. Center, Washington University, St. Louis
Funding
National Science Foundation (USA)
DOI
10.1006/jpdc.1996.0069
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
This paper examines the performance of synchronous checkpointing in a distributed computing environment with and without load redistribution. Performance models are developed, and optimum checkpoint intervals are determined. The analysis extends earlier work by allowing for multiple nodes, state-dependent checkpoint intervals, and a performance metric that is coupled with failure-free performance and the speedup functions associated with the implementation of parallel algorithms. The analytic results for synchronous checkpointing without load redistribution are compared to measurements of a synthetic parallel algorithm with user-level checkpointing. Expressions for the optimum checkpoint intervals for synchronous checkpointing with and without load redistribution are used to determine when load redistribution is advantageous. © 1996 Academic Press, Inc.
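To give a rough sense of what an "optimum checkpoint interval" means, the sketch below uses Young's classical first-order approximation, T ≈ sqrt(2 · C · M), where C is the cost of writing one checkpoint and M is the mean time between failures. This is not the performance model developed in the paper, which is not reproduced in this record. The function names, and the assumption that node failures are independent and exponentially distributed so that any single failure forces all nodes to roll back, are illustrative only.

```python
import math


def system_mtbf(node_mtbf: float, nodes: int) -> float:
    """MTBF of the whole job, ASSUMING independent exponential node failures
    and that a failure of any one node interrupts every node."""
    if node_mtbf <= 0 or nodes < 1:
        raise ValueError("node_mtbf must be positive and nodes must be >= 1")
    return node_mtbf / nodes


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the optimum time between
    checkpoints. Reasonable only when checkpoint_cost is much smaller
    than mtbf. This is NOT the model from the paper."""
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


if __name__ == "__main__":
    # Hypothetical numbers: 16 nodes, each failing on average once per
    # 400 hours, and a synchronous checkpoint that costs 0.05 hours.
    mtbf = system_mtbf(node_mtbf=400.0, nodes=16)
    interval = young_interval(checkpoint_cost=0.05, mtbf=mtbf)
    print(f"system MTBF: {mtbf:.1f} h, checkpoint every {interval:.2f} h")
```

Under these assumptions, adding nodes shortens the system MTBF and therefore the suggested interval, which is one reason a multi-node analysis such as the one in the paper differs from single-node results.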
Pages: 67-75
Page count: 9