ALGORITHM-BASED FAULT TOLERANCE ON A HYPERCUBE MULTIPROCESSOR

Cited by: 67
Authors
BANERJEE, P
RAHMEH, JT
STUNKEL, C
NAIR, VS
ROY, K
BALASUBRAMANIAN, V
ABRAHAM, JA
Affiliations
[1] UNIV TEXAS,DEPT ELECT & COMP ENGN,AUSTIN,TX 78712
[2] UNIV ILLINOIS,COORDINATED SCI LAB,URBANA,IL 61801
[3] IBM CORP,THOMAS J WATSON RES CTR,RES STAFF,YORKTOWN HTS,NY 10598
Funding
National Science Foundation (USA);
Keywords
Error coverage; hypercube multiprocessors; parallel algorithms; reconfiguration; system-level error detection;
DOI
10.1109/12.57055
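Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812 [Computer Science and Technology];
Abstract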
Hypercube multiprocessors have recently offered a cost-effective and feasible approach to supercomputing through processor-level parallelism: a large number of low-cost processors with local memories are directly connected and communicate by message passing instead of shared variables. This paper discusses the design of a fault-tolerant hypercube multiprocessor architecture. Most recently proposed fault-tolerance schemes for parallel architectures mainly address how to reconfigure the architecture once a faulty processor has been identified, and they assume the existence of an off-line diagnosis strategy that locates the faulty processor. We propose detecting and locating faulty processors concurrently with the actual execution of parallel applications on the hypercube, using a novel scheme of algorithm-based error detection. We have implemented system-level error detection mechanisms for three parallel applications on a 16-processor Intel iPSC hypercube multiprocessor: 1) matrix multiplication, 2) Gaussian elimination, and 3) fast Fourier transform. Schemes for other applications are under development. We have performed extensive studies of the error coverage of our system-level error detection schemes in the presence of finite-precision arithmetic, which affects our system-level encodings. Finally, the paper proposes two reconfiguration schemes that allow faulty processors to be isolated and replaced with spare processors. These reconfiguration schemes are integrated with the error detection schemes to form a truly fault-tolerant hypercube multiprocessor. © 1990 IEEE
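As an illustration of what algorithm-based error detection can look like for matrix multiplication, the following is a minimal single-process sketch of the classic row/column checksum encoding. It is not the paper's hypercube implementation: the function names, the absolute tolerance `tol`, and the example matrices are all assumptions made for this sketch. The tolerance stands in for the finite-precision concern raised in the abstract, since rounding makes exact checksum equality unreliable for floating-point data.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def add_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksum(b):
    """Append one extra column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def find_errors(c_full, tol=1e-9):
    """Return (bad_rows, bad_cols) whose checksums disagree by more than tol.

    c_full is a full-checksum product: its last row and last column hold
    checksums of the data part. tol is a hypothetical absolute threshold
    that absorbs rounding error; choosing it well is problem dependent.
    """
    n = len(c_full) - 1      # number of data rows
    m = len(c_full[0]) - 1   # number of data columns
    bad_rows = [i for i in range(n)
                if abs(sum(c_full[i][:m]) - c_full[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(c_full[i][j] for i in range(n)) - c_full[n][j]) > tol]
    return bad_rows, bad_cols


# Usage: encode the operands, multiply, then verify the result.
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c_full = matmul(add_column_checksum(a), add_row_checksum(b))
clean = find_errors(c_full)

# Simulate a faulty processor corrupting one element of the product.
c_full[1][0] += 0.5
faulty = find_errors(c_full)
```

In this sketch a single corrupted element shows up as exactly one inconsistent row and one inconsistent column, and their intersection locates the error. In a multiprocessor setting the same check could, under a suitable data distribution, point to the processor that computed the bad element.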
Pages: 1132-1145
Number of pages: 14