CUMULVS: Providing fault tolerance, visualization, and steering of parallel applications

Cited by: 70
Authors
Geist, GA
Kohl, JA
Papadopoulos, PM
Affiliations
[1] Mathematical Sciences Section, Oak Ridge National Laboratory, Oak Ridge
[2] Mathematical Sciences Section, Oak Ridge National Laboratory, Building 6012, Oak Ridge, TN 37831-6367
Source
INTERNATIONAL JOURNAL OF SUPERCOMPUTER APPLICATIONS AND HIGH PERFORMANCE COMPUTING | 1997 / Vol. 11 / No. 3
DOI
10.1177/109434209701100305
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Visualization and computational steering can often assist scientists in analyzing large-scale scientific applications, and fault tolerance is of great importance when running on a distributed system. However, the details of implementing these features are complex and tedious, leaving many scientists with inadequate development tools. CUMULVS is a library that enables programmers to easily incorporate interactive visualization and computational steering into existing parallel programs. Built on the PVM virtual machine framework, CUMULVS is portable and interoperable with all the computer architectures that PVM works with, a growing list that now stands at about 60 architectures. The CUMULVS library is divided into two pieces: one for the application program and one for the (possibly commercial) visualization and steering front end. Together, these two libraries encompass all the connection and data protocols needed to dynamically attach multiple, independent viewer front ends to a running parallel application. Viewer programs can also steer one or more user-defined parameters to "close the loop" for computational experiments and analyses. CUMULVS allows the programmer to specify user-directed checkpoints for saving important program state in case of failures, and it also provides a mechanism to migrate tasks across heterogeneous machine architectures to achieve improved performance. Details of the CUMULVS design goals and compromises, as well as future directions, are given.
Pages: 224-235
Page count: 12