FAULT-TOLERANT MESHES AND HYPERCUBES WITH MINIMAL NUMBERS OF SPARES

被引:53
作者
BRUCK, J
CYPHER, R
HO, CT
机构
[1] IBM, vjtAlmaden Research Center, San Jose
关键词
D O I
10.1109/12.241598
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
Many parallel computers consist of processors connected in the form of a d-dimensional mesh or hypercube. Two- and three-dimensional meshes have been shown to be efficient in manipulating images and dense matrices, whereas hypercubes have been shown to be well suited to divide-and-conquer algorithms requiring global communication. However, even a single faulty processor or communication link can seriously affect the performance of these machines. This paper presents several techniques for tolerating faults in d-dimensional mesh and hyper-cube architectures. Our approach consists of adding spare processors and communication links so that the resulting architecture will contain a fault-free mesh or hypercube in the presence of faults. We optimize the cost of the fault-tolerant architecture by adding exactly k spare processors (while tolerating up to k processor and/or link faults) and minimizing the maximum number of links per processor. For example, when the desired architecture is a d-dimensional mesh and k = 1, we present a fault-tolerant architecture that has the same maximum degree as the desired architecture (namely, 2d) and has only one spare processor. We also present efficient layouts for fault-tolerant two- and three-dimensional meshes, and show how multiplexers and buses can be used to reduce the degree of fault-tolerant architectures. Finally, we give constructions for fault-tolerant tori, eight-connected meshes, and hexagonal meshes.
引用
收藏
页码:1089 / 1104
页数:16
相关论文
共 28 条
[1]  
ANNEXSTEIN F, 1989, 1989 P ACM S PAR ALG, P179
[2]   A FAULT TOLERANT MASSIVELY PARALLEL PROCESSING ARCHITECTURE [J].
BALASUBRAMANIAN, V ;
BANERJEE, P .
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 1987, 4 (04) :363-383
[3]  
BATCHER KE, 1980, IEEE T COMPUT, V29, P836, DOI 10.1109/TC.1980.1675684
[4]  
Beivide R., 1987, 14th Annual International Symposium on Computer Architecture. Conference Proceedings (Cat. No.87CH2420-8), P163, DOI 10.1145/30350.30369
[5]  
BRUCK J, 1990, 2ND P ANN ACM S PAR, P37
[6]  
BRUCK J, 1991 P INT C PAR PRO, V1, P692
[7]   ADDRESSING, ROUTING, AND BROADCASTING IN HEXAGONAL MESH MULTIPROCESSORS [J].
CHEN, MS ;
SHIN, KG ;
KANDLUR, DD .
IEEE TRANSACTIONS ON COMPUTERS, 1990, 39 (01) :10-18
[8]   ON DESIGNING AND RECONFIGURING K-FAULT-TOLERANT TREE ARCHITECTURES [J].
DUTT, S ;
HAYES, JP .
IEEE TRANSACTIONS ON COMPUTERS, 1990, 39 (04) :490-503
[9]   DESIGNING FAULT-TOLERANT SYSTEMS USING AUTOMORPHISMS [J].
DUTT, S ;
HAYES, JP .
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 1991, 12 (03) :249-268
[10]  
DUTT S, 1991, 21ST P INT S FAULT T, P292