Teraflops supercomputer: Architecture and validation of the fault tolerance mechanisms

Cited by: 11
Authors
Constantinescu, C [1]
Affiliations
[1] Intel Corp, Server Architecture Lab, Hillsboro, OR 97124 USA
Keywords
supercomputing; fault-tolerant computing; validation; fault injection; fault/error detection coverage
DOI
10.1109/12.869320
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Intel Corporation developed the Teraflops supercomputer for the US Department of Energy (DOE) as part of the Accelerated Strategic Computing Initiative (ASCI). This is the most powerful computing machine available today, performing over two trillion floating point operations per second with the aid of more than 9,000 Intel processors. The Teraflops machine employs complex hardware and software fault/error handling mechanisms for complying with DOE's reliability requirements. This paper gives a brief description of the system architecture and presents the validation of the fault tolerance mechanisms. Physical fault injection at the IC pin level was used for validation purposes. An original approach was developed for assessing signal sensitivity to transient faults and the effectiveness of the fault/error handling mechanisms. Dependency between fault/error detection coverage and fault duration was also determined. Fault injection experiments unveiled several malfunctions at the hardware, firmware, and software levels. The supercomputer performed according to the DOE requirements after corrective actions were implemented. The fault injection approach presented in this paper can be used for validation of any fault-tolerant or highly available computing system.
Pages: 886-894
Number of pages: 9