CHECKPOINT SPACE RECLAMATION FOR UNCOORDINATED CHECKPOINTING IN MESSAGE-PASSING SYSTEMS

被引:26
作者
WANG, YM
CHUNG, PY
LIN, IJ
FUCHS, WK
机构
[1] UNIV ILLINOIS,COORDINATED SCI LAB,URBANA,IL 61801
[2] UNIV ILLINOIS,DEPT MATH,URBANA,IL 61801
基金
美国国家航空航天局;
关键词
FAULT TOLERANCE; MESSAGE-PASSING SYSTEMS; UNCOORDINATED CHECKPOINTING; ROLLBACK RECOVERY; GARBAGE COLLECTION;
D O I
10.1109/71.382324
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Uncoordinated checkpointing allows process autonomy and general nondeterministic execution, but suffers from potential domino effects and the associated space overhead. Previous to this research, checkpoint space reclamation had been based on the notion of obsolete checkpoints; as a result, a potentially unbounded number of nonobsolete checkpoints may have to be retained on stable storage. In this paper, we derive a necessary and sufficient condition for identifying all garbage checkpoints. By using the approach of recovery line transformation and decomposition, we develop an optimal checkpoint space reclamation algorithm and show that the space overhead for uncoordinated checkpointing is in fact bounded by N(N + 1)/2 checkpoints where N is the number of processes.
引用
收藏
页码:546 / 554
页数:9
相关论文
共 21 条
  • [1] Anderson I., 1987, COMBINATORICS FINITE
  • [2] Bhargava B., 1988, Proceedings. Seventh Symposium on Reliable Distributed Systems (IEEE Cat. No.88CH2612-0), P3, DOI 10.1109/RELDIS.1988.25775
  • [3] FAULT TOLERANCE UNDER UNIX
    BORG, A
    BLAU, W
    GRAETSCH, W
    HERRMANN, F
    OBERLE, W
    [J]. ACM TRANSACTIONS ON COMPUTER SYSTEMS, 1989, 7 (01): : 1 - 24
  • [4] DISTRIBUTED SNAPSHOTS - DETERMINING GLOBAL STATES OF DISTRIBUTED SYSTEMS
    CHANDY, KM
    LAMPORT, L
    [J]. ACM TRANSACTIONS ON COMPUTER SYSTEMS, 1985, 3 (01): : 63 - 75
  • [5] ELNOZAHY EN, 1992, OCT P IEEE S REL DIS, P39
  • [6] RECOVERY IN DISTRIBUTED SYSTEMS USING OPTIMISTIC MESSAGE LOGGING AND CHECKPOINTING
    JOHNSON, DB
    ZWAENEPOEL, W
    [J]. JOURNAL OF ALGORITHMS, 1990, 11 (03) : 462 - 491
  • [7] CHECKPOINTING AND ROLLBACK-RECOVERY FOR DISTRIBUTED SYSTEMS
    KOO, R
    TOUEG, S
    [J]. IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 1987, 13 (01) : 23 - 31
  • [8] TIME, CLOCKS, AND ORDERING OF EVENTS IN A DISTRIBUTED SYSTEM
    LAMPORT, L
    [J]. COMMUNICATIONS OF THE ACM, 1978, 21 (07) : 558 - 565
  • [9] POWELL ML, 1983, 9TH P ACM S OP SYST, P100
  • [10] Randell B., 1975, IEEE T SOFTWARE ENG, VSE-1, P220, DOI DOI 10.1109/TSE.1975.6312842