Performing tasks on synchronous restartable message-passing processors

被引:28
作者
Chlebus, BS
De Prisco, R
Shvartsman, AA
机构
[1] Univ Warsaw, Inst Informat, PL-02097 Warsaw, Poland
[2] MIT, Comp Sci Lab, Cambridge, MA 02139 USA
[3] Univ Salerno, Dipartimento Informat & Applicaz, I-84081 Baronissi, SA, Italy
[4] Univ Connecticut, Dept Comp Sci & Engn, Storrs, CT 06269 USA
关键词
fault-tolerance; distributed systems; load balancing; processor restarts; work;
D O I
10.1007/PL00008926
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
This work considers the problem of performing t tasks in a distributed system of p fault-prone processors. This problem, called DO-ALL herein, was introduced by Dwork. Halpern and Waarts. The solutions presented here are for the model of computation that abstracts a synchronous message-passing distributed system with processor stop-failures and restarts. We present two new algorithms based on a new aggressive coordination paradigm by which multiple coordinators may be active as the result of failures. The first algorithm is tolerant of f < p stop-failures and does not allow restarts. Its available processor steps (work) complexity is S = O((t+ P logp/ log logp) log S) and its message complexity is M = O(t + p logp/ log log p +fp). Unlike prior solutions, our algorithm uses redundant broadcasts when encountering failures and, for p = t and large f, it achieves better work complexity. This algorithm is used as the basis for another algorithm that tolerates stop-failures nod restarts. This new algorithm is the first solution for the DO-ALL problem that efficiently deals with processor restarts. Its available processor steps is S = O((t + p log p + f) min{log p, log f}), and its message complexity is M = O(t + p logp + fp), where S is the total number of failures.
引用
收藏
页码:49 / 64
页数:16
相关论文
共 19 条
[1]   Parallel algorithms with processor failures and delays [J].
Buss, JF ;
Kanellakis, PC ;
Ragde, PL ;
Shvartsman, AA .
JOURNAL OF ALGORITHMS-COGNITION INFORMATICS AND LOGIC, 1996, 20 (01) :45-86
[2]  
Chlebus BS, 1997, LECT NOTES COMPUT SC, V1320, P96, DOI 10.1007/BFb0030678
[3]  
CHLEBUS BS, 1999, LECT NOTES COMPUTER
[4]  
De Prisco R., 1994, Proceedings of the Thirteenth Annual ACM Symposium on Principles of Distributed Computing, P161, DOI 10.1145/197917.198082
[5]  
Dwork C., 1992, Proceedings of the Eleventh Annual ACM Symposium on Principles of Distributed Computing, P91, DOI 10.1145/135419.135439
[6]   Performing work efficiently in the presence of faults [J].
Dwork, C ;
Halpern, JY ;
Waarts, O .
SIAM JOURNAL ON COMPUTING, 1998, 27 (05) :1457-1491
[7]  
Galil Z, 1995, AN S FDN CO, P724, DOI 10.1109/SFCS.1995.492674
[8]  
Hadzilacos Vassos, 1993, DISTRIBUTED SYSTEMS
[9]  
Kanellakis P. C., 1989, Proceedings of the Eighth Annual ACM Symposium on Principles of Distributed Computing, P211, DOI 10.1145/72981.72996
[10]  
Kanellakis P. C., 1995, Nordic Journal of Computing, V2, P146