LOW-LATENCY, CONCURRENT CHECKPOINTING FOR PARALLEL PROGRAMS

Cited by: 57
Authors
LI, K
NAUGHTON, JF
PLANK, JS
Affiliations
[1] UNIV WISCONSIN,DEPT COMP SCI,MADISON,WI 53706
[2] UNIV TENNESSEE,DEPT COMP SCI,KNOXVILLE,TN 37996
Funding
National Science Foundation (USA);
Keywords
CHECKPOINTING; FAULT TOLERANCE; COPY-ON-WRITE; MULTIPROCESSING; BACKWARD ERROR RECOVERY;
DOI
10.1109/71.298215
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202 [Computer Software and Theory];
Abstract
This short note presents the results of an implementation of several algorithms for checkpointing and restarting parallel programs on shared-memory multiprocessors. The algorithms are compared according to the metrics of overall checkpointing time, overhead imposed by the checkpointer on the target program, and amount of time during which the checkpointer interrupts the target program. The best algorithm measured achieves its efficiency through a variation of copy-on-write, which allows the most time-consuming operations of the checkpoint to be overlapped with the running of the program being checkpointed.
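The abstract does not spell out the copy-on-write variation, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It simulates page protection in plain Python: starting a checkpoint merely marks every page as protected (a short interruption), a program write to a protected page first saves the old contents, and a background copier saves the remaining pages while the program keeps running. The class `CowCheckpointer` and all of its method names are hypothetical, chosen for illustration.

```python
class CowCheckpointer:
    """Toy page-level copy-on-write checkpointer (a simulation, not real VM protection)."""

    def __init__(self, num_pages, page_size=4):
        self.pages = [bytearray(page_size) for _ in range(num_pages)]
        self.protected = set()  # pages not yet saved for the current checkpoint
        self.snapshot = {}      # page index -> contents as of checkpoint start

    def begin_checkpoint(self):
        # The only step that interrupts the program: mark every page protected.
        self.snapshot = {}
        self.protected = set(range(len(self.pages)))

    def _save_page(self, i):
        # Copy the page into the checkpoint, then drop its protection.
        self.snapshot[i] = bytes(self.pages[i])
        self.protected.discard(i)

    def write(self, i, offset, value):
        # A program write. Touching a protected page plays the role of a
        # protection fault: the old contents are saved before the write lands.
        if i in self.protected:
            self._save_page(i)
        self.pages[i][offset] = value

    def copier_step(self):
        # Background copier: save one still-protected page per call.
        # Returns False once every page has been saved.
        if not self.protected:
            return False
        self._save_page(min(self.protected))
        return True

    def checkpoint_image(self):
        # The completed checkpoint: every page as it was at begin_checkpoint().
        if self.protected:
            raise RuntimeError("checkpoint still in progress")
        return [self.snapshot[i] for i in range(len(self.pages))]


mem = CowCheckpointer(num_pages=3)
mem.write(0, 0, 1)        # state before the checkpoint
mem.begin_checkpoint()
mem.write(0, 0, 9)        # page 0 is saved with its old value first
mem.copier_step()         # copier saves page 1 in the background
mem.write(2, 1, 7)        # page 2 is saved before this write
while mem.copier_step():  # copier drains anything left
    pass
image = mem.checkpoint_image()
```

In this sketch the checkpoint image reflects memory exactly as it was when `begin_checkpoint()` was called, even though writes continued afterwards. In a real system the protection and fault handling would come from the virtual memory hardware and operating system rather than from an explicit `write` method.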
Pages: 874-879
Page count: 6