A flexible framework for fault tolerance in the grid

Cited by: 14
Authors
Soonwook Hwang
Carl Kesselman
Affiliations
[1] National Institute of Informatics, Tokyo 101-0051, Jimbocho Mitsui bldg. 14F (NAREGI), 1-105, Kanda-Jimbocho
[2] Information Sciences Institute, University of Southern California, Marina del Rey
Keywords
Failure detection; Fault tolerance; Grid computing; Workflow
DOI
10.1023/B:GRID.0000035187.54694.75
Abstract
This paper presents a failure detection service (FDS) and a flexible failure handling framework (Grid-WFS) as a fault tolerance mechanism on the Grid. The FDS enables the detection of both task crashes and user-defined exceptions. A major challenge in providing such a generic failure detection service on the Grid is to detect those failures without requiring any modification to either the Grid protocol or the local policy of each Grid node. This paper describes how to overcome that challenge with a notification mechanism based on interpreting the notification messages delivered from the underlying Grid resources. The Grid-WFS, built on top of the FDS, allows users to achieve failure recovery in a variety of ways depending on the requirements and constraints of their applications. Central to the framework is flexibility in handling failures. This paper describes how to achieve that flexibility by using the workflow structure as a high-level recovery policy specification, which enables support for multiple failure recovery techniques, the separation of failure handling strategies from the application code, and user-defined exception handling. Finally, this paper presents an experimental evaluation of the Grid-WFS using a simulation, demonstrating the value of supporting multiple failure recovery techniques in Grid applications to achieve high performance in the presence of failures. © 2004 Kluwer Academic Publishers.
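The idea of a recovery policy declared apart from the application code can be illustrated with a minimal Python sketch. It is not the Grid-WFS interface, which this record does not describe; `TaskCrash`, `RecoveryPolicy`, `run_with_policy`, and `make_flaky` are hypothetical names introduced only for illustration, and the order of techniques (retry on crash, user-defined handler on a named exception, alternative task as a last resort) is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Type


class TaskCrash(Exception):
    """Hypothetical stand-in for a task crash reported by a failure detector."""


@dataclass
class RecoveryPolicy:
    """Hypothetical per-task policy, declared outside the task's own code."""
    max_retries: int = 0
    alternative: Optional[Callable[[], str]] = None
    handlers: Dict[Type[Exception], Callable[[Exception], str]] = field(default_factory=dict)


def run_with_policy(task: Callable[[], str], policy: RecoveryPolicy) -> str:
    """Run a task; retry on crashes, dispatch user-defined exceptions, then fall back."""
    if policy.max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    last_crash: Optional[TaskCrash] = None
    for _ in range(policy.max_retries + 1):
        try:
            return task()
        except TaskCrash as exc:
            last_crash = exc  # crash: try again while attempts remain
        except Exception as exc:
            handler = policy.handlers.get(type(exc))
            if handler is None:
                raise  # no user-defined handler registered for this exception
            return handler(exc)
    if policy.alternative is not None:
        return policy.alternative()  # all attempts crashed: run the alternative task
    assert last_crash is not None
    raise last_crash


def make_flaky(crashes: int) -> Callable[[], str]:
    """Build a task that crashes a fixed number of times before succeeding."""
    remaining = {"left": crashes}

    def task() -> str:
        if remaining["left"] > 0:
            remaining["left"] -= 1
            raise TaskCrash("simulated crash")
        return "done"

    return task


if __name__ == "__main__":
    print(run_with_policy(make_flaky(2), RecoveryPolicy(max_retries=2)))
    print(run_with_policy(make_flaky(5),
                          RecoveryPolicy(max_retries=1, alternative=lambda: "fallback")))
```

The task functions contain no recovery logic of their own; changing the policy object changes how failures are handled without touching the task.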
Pages: 251-272
Page count: 21