Deploying fault tolerance and task migration with NetSolve

Cited by: 12
Authors
Plank, JS [1]
Casanova, H
Beck, M
Dongarra, JJ
Affiliations
[1] Univ Tennessee, Dept Comp Sci, Knoxville, TN 37996 USA
[2] Oak Ridge Natl Lab, Math Sci Sect, Oak Ridge, TN 37831 USA
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 1999 / Vol. 15 / Issue 5-6
Funding
National Science Foundation (USA);
Keywords
fault-tolerance; scientific computing; computational server; checkpointing; migration;
DOI
10.1016/S0167-739X(99)00024-2
Chinese Library Classification
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Computational power grids are computing environments with massive resources for processing and storage. While these resources may be pervasive, harnessing them is a major challenge for the average user. NetSolve is a software environment that addresses this concern. A fundamental feature of NetSolve is its integration of fault-tolerance and task migration in a way that is transparent to the end user. In this paper, we discuss how NetSolve's structure allows for the seamless integration of fault-tolerance and migration in grid applications, and present the specific approaches that have been and are currently being implemented within NetSolve. (C) 1999 Elsevier Science B.V. All rights reserved.
Pages: 745-755
Number of pages: 11