Fault-tolerant grid architecture and practice

Cited by: 23
Authors
Jin, H [1]
Zou, DQ [1]
Chen, HH [1]
Sun, JH [1]
Wu, S [1]
Affiliations
[1] Huazhong Univ Sci & Technol, Wuhan 430074, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
grid computing; fault tolerance; middleware; Globus; distributed computing
DOI
10.1007/BF02948916
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Grid computing has emerged as an effective technology for coupling geographically distributed resources and solving large-scale computational problems over wide area networks. Fault tolerance is a significant and complex issue in grid computing systems. Various techniques have been investigated to detect and correct faults in distributed computing systems, and unreliable fault detection is one of the most effective. Globus, a grid middleware, manages resources in a wide area network. The Globus fault detection service uses well-known techniques based on unreliable fault detectors to detect and report component failures. However, more powerful techniques are required to detect and correct both system-level and application-level faults in a grid system, and a convenient toolkit is also needed to maintain consistency in the grid. This paper presents a fault-tolerant grid platform (FTGP) based on an unreliable fault detector and the Globus fault detection service. The platform offers effective strategies in three areas: key grid components, user tasks, and high-level applications.
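The abstract refers to unreliable fault detectors without showing what one looks like. The following is a minimal sketch of the general idea, assuming a simple heartbeat-and-timeout scheme; it is not the FTGP implementation or the Globus fault detection service API, and every name in it (`HeartbeatDetector`, `heartbeat`, `suspects`, the component names, the 5-second timeout) is hypothetical. Time is passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatDetector:
    """Timeout-based failure detector; all names here are hypothetical."""

    timeout: float  # seconds of silence before a component is suspected
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, component: str, now: float) -> None:
        # Record the arrival time of the latest heartbeat from a component.
        self.last_seen[component] = now

    def suspects(self, now: float) -> set:
        # A component is suspected, not proven dead: a slow network can
        # delay heartbeats, so a suspicion may later be withdrawn.
        return {c for c, t in self.last_seen.items() if now - t > self.timeout}


detector = HeartbeatDetector(timeout=5.0)
detector.heartbeat("scheduler", now=0.0)
detector.heartbeat("storage", now=0.0)
detector.heartbeat("scheduler", now=4.0)
early = detector.suspects(now=6.0)      # storage has been silent for 6 s
detector.heartbeat("storage", now=7.0)  # a late heartbeat arrives
later = detector.suspects(now=8.0)      # the earlier suspicion is withdrawn
```

The detector is called unreliable because it can be wrong in this way: `storage` is suspected at time 6 and cleared again at time 8. A platform built on such a detector has to tolerate false suspicions rather than treat each one as a confirmed failure.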
Pages: 423-433
Page count: 11