Performing tasks on synchronous restartable message-passing processors

被引：28

作者：

Chlebus, BS

De Prisco, R

Shvartsman, AA

机构：

[1] Univ Warsaw, Inst Informat, PL-02097 Warsaw, Poland

[2] MIT, Comp Sci Lab, Cambridge, MA 02139 USA

[3] Univ Salerno, Dipartimento Informat & Applicaz, I-84081 Baronissi, SA, Italy

[4] Univ Connecticut, Dept Comp Sci & Engn, Storrs, CT 06269 USA

来源：

DISTRIBUTED COMPUTING | 2001年 / 14卷 / 01期

关键词：

fault-tolerance; distributed systems; load balancing; processor restarts; work;

D O I：

10.1007/PL00008926

中图分类号：

TP301 [理论、方法];

学科分类号：

081202 ;

摘要：

This work considers the problem of performing t tasks in a distributed system of p fault-prone processors. This problem, called DO-ALL herein, was introduced by Dwork. Halpern and Waarts. The solutions presented here are for the model of computation that abstracts a synchronous message-passing distributed system with processor stop-failures and restarts. We present two new algorithms based on a new aggressive coordination paradigm by which multiple coordinators may be active as the result of failures. The first algorithm is tolerant of f < p stop-failures and does not allow restarts. Its available processor steps (work) complexity is S = O((t+ P logp/ log logp) log S) and its message complexity is M = O(t + p logp/ log log p +fp). Unlike prior solutions, our algorithm uses redundant broadcasts when encountering failures and, for p = t and large f, it achieves better work complexity. This algorithm is used as the basis for another algorithm that tolerates stop-failures nod restarts. This new algorithm is the first solution for the DO-ALL problem that efficiently deals with processor restarts. Its available processor steps is S = O((t + p log p + f) min{log p, log f}), and its message complexity is M = O(t + p logp + fp), where S is the total number of failures.

引用

页码：49 / 64

页数：16

共 19 条

[1] Parallel algorithms with processor failures and delays [J].

Buss, JF ;

Kanellakis, PC ;

Ragde, PL ;

Shvartsman, AA .

JOURNAL OF ALGORITHMS-COGNITION INFORMATICS AND LOGIC, 1996, 20 (01) :45-86

[2]

Chlebus BS, 1997, LECT NOTES COMPUT SC, V1320, P96, DOI 10.1007/BFb0030678

[3]

CHLEBUS BS, 1999, LECT NOTES COMPUTER

[4]

De Prisco R., 1994, Proceedings of the Thirteenth Annual ACM Symposium on Principles of Distributed Computing, P161, DOI 10.1145/197917.198082

[5]

Dwork C., 1992, Proceedings of the Eleventh Annual ACM Symposium on Principles of Distributed Computing, P91, DOI 10.1145/135419.135439

[6] Performing work efficiently in the presence of faults [J].

Dwork, C ;

Halpern, JY ;

Waarts, O .

SIAM JOURNAL ON COMPUTING, 1998, 27 (05) :1457-1491

[7]

Galil Z, 1995, AN S FDN CO, P724, DOI 10.1109/SFCS.1995.492674

[8]

Hadzilacos Vassos, 1993, DISTRIBUTED SYSTEMS

[9]

Kanellakis P. C., 1989, Proceedings of the Eighth Annual ACM Symposium on Principles of Distributed Computing, P211, DOI 10.1145/72981.72996

[10]

Kanellakis P. C., 1995, Nordic Journal of Computing, V2, P146

← 1 2 →