A survey of rollback-recovery protocols in message-passing systems

Cited by: 857
Authors
Elnozahy, EN
Alvisi, L
Wang, YM
Johnson, DB
Affiliations
[1] IBM Corp, Res, Austin Res Lab, Austin, TX 78578 USA
[2] Univ Texas, Dept Comp Sci, Austin, TX 78712 USA
[3] Microsoft Corp, Res, Redmond, WA 98052 USA
[4] Rice Univ, Dept Comp Sci, Houston, TX 77005 USA
Keywords
design; reliability; performance; message logging; rollback-recovery
DOI
10.1145/568522.568525
Chinese Library Classification (CLC) number
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
This survey covers rollback-recovery techniques that do not require special language constructs. In the first part of the survey we classify rollback-recovery protocols into checkpoint-based and log-based. Checkpoint-based protocols rely solely on checkpointing for system state restoration. Checkpointing can be coordinated, uncoordinated, or communication-induced. Log-based protocols combine checkpointing with logging of nondeterministic events, encoded in tuples called determinants. Depending on how determinants are logged, log-based protocols can be pessimistic, optimistic, or causal. Throughout the survey, we highlight the research issues that are at the core of rollback-recovery and present the solutions that currently address them. We also compare the performance of different rollback-recovery protocols with respect to a series of desirable properties and discuss the issues that arise in the practical implementations of these protocols.
Pages: 375-408
Page count: 34