Parallel algorithms with processor failures and delays

被引:32
作者
Buss, JF
Kanellakis, PC
Ragde, PL
Shvartsman, AA
机构
[1] BROWN UNIV, DEPT COMP SCI, PROVIDENCE, RI 02912 USA
[2] MIT, COMP SCI LAB, CAMBRIDGE, MA 02139 USA
来源
JOURNAL OF ALGORITHMS-COGNITION INFORMATICS AND LOGIC | 1996年 / 20卷 / 01期
基金
加拿大自然科学与工程研究理事会; 美国国家科学基金会;
关键词
D O I
10.1006/jagm.1996.0003
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
We study efficient deterministic parallel algorithms on two models: restartable fail-stop CRCW PRAMs and asynchronous PRAMs. In the first model, synchronous processes are subject to arbitrary stop failures and restarts determined by an on-line adversary and involving loss of private but not shared memory; the complexity measures are completed work (where processors are charged for completed fixed-size update cycles) and overhead ratio (completed work amortized over necessary work and failures). In the second model, the result of the computation is a serialization of the actions of the processors determined by an on-line adversary; the complexity measure is total work (number of steps taken by all processors). Despite their differences, the two models share key algorithmic techniques. We present new algorithms for the Write-All problem (in which P processors write ones into an array of size N) for the two models. These algorithms can be used to implement a simulation strategy for any N processor PRAM on a restartable fail-stop P processor CRCW PRAM such that it guarantees a terminating execution of each simulated N processor step, with O(log(2) N) overhead ratio, and O(min{N + P log(2) N + M log N, N . p(0.59)}) (subquadratic) completed work (where M is the number of failures during this step's simulation). This strategy has a range of optimality. We also show that the Write-All requires N + Omega(P log P) completed/total work on these models for P less than or equal to N. (C) 1996 Academic Press, Inc.
引用
收藏
页码:45 / 86
页数:42
相关论文
共 49 条
[1]   A SURVEY AND COMPARISON OF FAULT-TOLERANT MULTISTAGE INTERCONNECTION NETWORKS [J].
ADAMS, GB ;
AGRAWAL, DP ;
SIEGEL, HJ .
COMPUTER, 1987, 20 (06) :14-27
[2]  
Afek Y., 1987, 28th Annual Symposium on Foundations of Computer Science (Cat. No.87CH2471-1), P347, DOI 10.1109/SFCS.1987.38
[3]  
ANDERSON RJ, 1991, 23RD P ACM S THEOR C, P370
[4]  
ANDERSON RJ, 1990, 2ND P ANN ACM S PAR, P95
[5]  
ASSAF S, 1990, ANN IEEE SYMP FOUND, P275
[6]  
BUSS J, 1990, CERTIFIED WRITE ALL
[7]  
CHOR B, 1987, ACM PODC, P86
[8]  
COLE R, 1990, 2ND P ACM S PAR ALG, P85
[9]  
COLE R, 1989, 1989 P ACM S PAR ALG, P170
[10]   UNDERSTANDING FAULT-TOLERANT DISTRIBUTED SYSTEMS [J].
CRISTIAN, F .
COMMUNICATIONS OF THE ACM, 1991, 34 (02) :56-78