PERFORMANCE AND SCALABILITY OF PRECONDITIONED CONJUGATE-GRADIENT METHODS ON PARALLEL COMPUTERS

Cited: 26
Authors
GUPTA, A
KUMAR, V
SAMEH, A
Affiliation
[1] Department of Computer Science, University of Minnesota, Minneapolis.
Keywords
DOI
10.1109/71.382315
CLC number
TP301 [Theory, Methods];
Subject classification code
081202
Abstract
This paper analyzes the performance and scalability of an iteration of the preconditioned conjugate gradient algorithm on parallel architectures with a variety of interconnection networks, such as the mesh, the hypercube, and that of the CM-5(TM) parallel computer. It is shown that for block-tridiagonal matrices resulting from two-dimensional finite difference grids, the communication overhead due to vector inner products dominates the communication overheads of the remainder of the computation on a large number of processors. However, with a suitable mapping, the parallel formulation of a PCG iteration is highly scalable for such matrices on a machine like the CM-5, whose fast control network practically eliminates the overheads due to inner product computation. The use of the truncated Incomplete Cholesky (IC) preconditioner can lead to further improvement in scalability on the CM-5 by a constant factor. As a result, a parallel formulation of the PCG algorithm with the IC preconditioner may execute faster than that with a simple diagonal preconditioner, even if the latter runs faster in a serial implementation. For the matrices resulting from three-dimensional finite difference grids, the scalability is quite good on a hypercube or the CM-5, but not as good on a 2-D mesh architecture. In the case of unstructured sparse matrices with a constant number of nonzero elements in each row, the parallel formulation of the PCG iteration is unscalable on any message-passing parallel architecture, unless some ordering is applied to the sparse matrix. The parallel system can be made scalable either if, after reordering, the nonzero elements of the N x N matrix can be confined in a band whose width is O(N^y) for any y < 1, or if the number of nonzero elements per row increases as N^x for any x > 0. Scalability increases as the number of nonzero elements per row is increased and/or the width of the band containing these elements is reduced.
For unstructured sparse matrices, the scalability is asymptotically the same for all architectures. Many of these analytical results are experimentally verified on the CM-5 parallel computer.
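The paper's cost analysis centers on the operations inside one PCG iteration: a sparse matrix-vector product (nearest-neighbor communication on a mesh or hypercube) and two inner products (global reductions, the dominant overhead identified in the abstract). A minimal sketch with a diagonal (Jacobi) preconditioner, written in NumPy purely for illustration (the function name and interface are hypothetical, not from the paper), shows where each of these operations occurs:

```python
import numpy as np

def pcg(A, b, M_inv_diag, tol=1e-8, max_iter=200):
    """One-processor sketch of PCG with a diagonal (Jacobi) preconditioner.
    In a parallel formulation, each `@` inner product below becomes a global
    reduction, and `A @ p` becomes a local stencil update plus boundary
    exchange -- the two communication patterns the paper's analysis compares."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv_diag * r            # apply preconditioner: z = M^{-1} r
    p = z.copy()
    rz = r @ z                    # inner product (global reduction in parallel)
    for _ in range(max_iter):
        Ap = A @ p                # matrix-vector product (local communication)
        alpha = rz / (p @ Ap)     # inner product (global reduction)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv_diag * r
        rz_new = r @ z            # inner product (global reduction)
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```

For the block-tridiagonal systems discussed above, `A` would be the 2-D finite-difference operator; substituting a truncated IC preconditioner changes only the `z = ...` steps, which is why its effect on scalability is a constant factor.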
Pages: 455-469
Page count: 15
Related papers
33 in total
[1]  
Anderson E., Parallel implementation of preconditioned conjugate gradient methods for solving sparse systems of linear equations, Cent. for Supercomput. Res. and Development, (1988)
[2]  
Aykanat C., Ozguner F., Ercal F., Sadayappan P., Iterative algorithms for solution of large sparse systems of linear equations on hypercubes, IEEE Trans. Comput., 37, pp. 1554-1567, (1988)
[3]  
Eager D.L., Zahorjan J., Lazowska E.D., Speedup versus efficiency in parallel systems, IEEE Trans. Comput., 38, pp. 408-423, (1989)
[4]  
George A., Liu J.-W.H., Computer Solution of Large Sparse Positive Definite Systems, (1981)
[5]  
Gibbs N.E., Poole W.G., Stockmeyer P.K., A comparison of several bandwidth and profile reduction algorithms, ACM Trans. Math. Software., 2, pp. 322-330, (1976)
[6]  
Golub G.H., Van Loan C., Matrix Computations, 2nd ed., Baltimore, MD: The Johns Hopkins University Press, (1989)
[7]  
Grama A., Gupta A., Kumar V., Isoefficiency: Measuring the scalability of parallel algorithms and architectures, IEEE Parallel and Distrib. Technol., 1, pp. 12-21, (1993)
[8]  
Gupta A., Kumar V., A scalable parallel algorithm for sparse matrix factorization, Dep. Comput. Sci., (1994)
[9]  
The scalability of FFT on parallel computers, IEEE Trans. Parallel and Distrib. Syst., 4, pp. 922-932, (1993)
[10]  
Gustafson J.L., Reevaluating Amdahl's law, Commun. ACM, 31, 5, pp. 532-533, (1988)