Application level fault tolerance in heterogeneous networks of workstations

Cited by: 50
Authors
Beguelin, A
Seligman, E
Stephan, P
Affiliation
[1] School of Computer Science, Carnegie Mellon University, Pittsburgh
Funding
U.S. National Science Foundation
DOI
10.1006/jpdc.1997.1338
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
We have explored methods for checkpointing and restarting processes within the Distributed Object Migration Environment (Dome), a C++ library of data parallel objects that are automatically distributed over heterogeneous networks of workstations (NOWs). System level checkpointing methods, although transparent to the user, were rejected because they lack support for heterogeneity. We have implemented application level checkpointing, which places the checkpoint and restart mechanisms within Dome's C++ objects. Application level checkpointing has been implemented with a library-based technique for the programmer and a more transparent preprocessor-based technique. Dome's implementation of checkpointing successfully checkpoints and restarts processes on different numbers of machines and different architectures. Results from executing Dome programs across a NOW with realistic failure rates have been experimentally determined and are compared with theoretical results. The overhead of checkpointing is low while providing substantial decreases in expected runtime on realistic systems. © 1997 Academic Press.
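To make the idea of application level checkpointing concrete, the following is a minimal sketch in which an object writes and reads its own state, rather than relying on a system level memory image. It is not Dome's API: the names CheckpointableVector, ProgramState, save, load, and run are hypothetical and chosen for illustration only. The sketch assumes that storing values as decimal text is an acceptable way to stay independent of the byte order and word size of the machine that wrote the checkpoint.

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <istream>
#include <limits>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a data parallel object; not Dome's real class.
class CheckpointableVector {
public:
    explicit CheckpointableVector(std::size_t n = 0) : data_(n, 0.0) {}

    std::size_t size() const { return data_.size(); }
    double& operator[](std::size_t i) { return data_[i]; }
    double operator[](std::size_t i) const { return data_[i]; }

    // Write the state as decimal text so the image does not depend on the
    // byte order or word size of the writing machine (an assumption of
    // this sketch, not a statement about Dome's on-disk format).
    void checkpoint(std::ostream& out) const {
        out << std::setprecision(std::numeric_limits<double>::max_digits10);
        out << data_.size() << '\n';
        for (double v : data_) out << v << '\n';
    }

    // Rebuild the state from a checkpoint image; throws on malformed input.
    void restore(std::istream& in) {
        std::size_t n = 0;
        if (!(in >> n)) throw std::runtime_error("bad checkpoint header");
        std::vector<double> tmp(n);
        for (double& v : tmp) {
            if (!(in >> v)) throw std::runtime_error("truncated checkpoint");
        }
        data_.swap(tmp);
    }

private:
    std::vector<double> data_;
};

// Everything needed to resume: the loop position plus the object state.
struct ProgramState {
    std::size_t next_step = 0;
    CheckpointableVector values;
};

// Serialize the whole program state into a checkpoint image.
std::string save(const ProgramState& s) {
    std::ostringstream out;
    out << s.next_step << '\n';
    s.values.checkpoint(out);
    return out.str();
}

// Reconstruct the program state from a checkpoint image.
ProgramState load(const std::string& image) {
    std::istringstream in(image);
    ProgramState s;
    if (!(in >> s.next_step)) throw std::runtime_error("bad step counter");
    s.values.restore(in);
    return s;
}

// Advance the computation from s.next_step up to, but not including, stop_step.
void run(ProgramState& s, std::size_t stop_step) {
    assert(s.next_step <= stop_step);
    for (; s.next_step < stop_step; ++s.next_step) {
        for (std::size_t i = 0; i < s.values.size(); ++i) {
            s.values[i] += static_cast<double>(i + 1);
        }
    }
}
```

In this sketch a restart calls load on the last saved image and then run, so only the steps after the checkpoint are repeated. Because the state is saved at a loop boundary chosen by the program, the image holds only the step counter and the object data, not stack or register contents.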
Pages: 147-155
Page count: 9