MPI-based implementation of a PCG solver using an EBE architecture and preconditioner for implicit, 3-D finite element analysis

Cited by: 42
Authors
Gullerud, AS
Dodds, RH [1]
Affiliations
[1] Univ Illinois, Dept Civil Engn, Urbana, IL 61801 USA
[2] Sandia Natl Labs, Albuquerque, NM 87185 USA
Keywords
element-by-element computation; Hughes-Winget preconditioner; parallel finite elements; message passing; conjugate gradient; domain decomposition; coloring algorithms;
DOI
10.1016/S0045-7949(00)00153-X
CLC Classification
TP39 [Computer Applications];
Subject Classification
081203; 0835;
Abstract
This work describes a coarse-grain parallel implementation of a linear preconditioned conjugate gradient solver using an element-by-element architecture and preconditioner for computation. The solver, implemented within a nonlinear, implicit finite element code, uses an MPI-based message-passing approach to provide portable parallel execution on shared, distributed, and distributed-shared memory computers. The flexibility of the element-by-element approach permits a dual-level mesh decomposition: a coarse, domain-level decomposition creates a load-balanced domain for each processor for parallel computation, while a second-level decomposition breaks each domain into blocks of similar elements (same constitutive model, order of integration, element type) for fine-grained parallel computation on each processor. The key contribution here is a new parallel implementation of the Hughes-Winget (HW) element-by-element preconditioner suitable for arbitrary, unstructured meshes. The implementation couples an unstructured dependency graph with a new balanced graph-coloring algorithm to schedule parallel computations within and across domains. The code also includes the diagonal preconditioner and a modern parallel (threaded) sparse direct solver for comparison. Three example problems with up to 158,000 elements and 180,000 nodes analyzed on an SGI/Cray Origin 2000 illustrate the parallel performance of the algorithms and preconditioners. Analyses with varying block sizes illustrate that the two-level decomposition improves overall execution speed when the block size is tuned for the cache memory architecture of the executing platform. This implementation of the HW preconditioner shows reasonable parallel efficiency, typically 80% on 48 processors. Efficiency for the diagonal preconditioner is also high, with total speedups reaching 86% of ideal on 48 CPUs. Calculation of the tangent element stiffnesses shows superlinear speedups for each of the test problems, while the computation of strains/stresses/residual forces shows 80% parallel efficiency on 48 processors. (C) 2001 Elsevier Science Ltd. All rights reserved.
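The abstract compares the Hughes-Winget element-by-element preconditioner against the simpler diagonal (Jacobi) preconditioner inside a preconditioned conjugate gradient iteration. That baseline iteration can be sketched as follows (a minimal dense-matrix illustration in Python, not the paper's element-by-element or MPI implementation; the function name `jacobi_pcg` and its interface are invented for this sketch):

```python
import numpy as np

def jacobi_pcg(A, b, tol=1e-8, max_iter=200):
    """Conjugate gradient with a diagonal (Jacobi) preconditioner.

    A must be symmetric positive-definite. In the paper's setting the
    products A @ p are accumulated element-by-element in parallel rather
    than from an assembled global matrix; the iteration is the same.
    """
    n = len(b)
    M_inv = 1.0 / np.diag(A)          # diagonal preconditioner M^{-1}
    x = np.zeros(n)
    r = b - A @ x                     # initial residual
    z = M_inv * r                     # preconditioned residual
    p = z.copy()                      # initial search direction
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p                    # the matrix-vector product done EBE in the paper
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = M_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p     # new conjugate search direction
        rz = rz_new
    return x
```

In the paper's solver the matrix-vector product is instead accumulated element-by-element across MPI processes, so no global matrix is ever assembled; only this iteration skeleton carries over, with the Jacobi step replaced by the HW factorized element preconditioner.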
Pages: 553-575
Page count: 23
Related Papers
61 in total
[1] AMIN A, 1994, 8 INT PAR P S CANC, P509
[2] [Anonymous], J ENG MECH DIV, DOI 10.1061/JMCEA3.0000098
[3] [Anonymous], 1995, CHACO USERS GUIDE VE
[4] [Anonymous], 1993, SOLVING LARGE SCALE
[5] [Anonymous], 1995, Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering
[6] ARBOGAST T, 1996, INT C COMP METH WAT, V1, P621
[7] AXELSSON O, CAREY G, LINDSKOG G. On a class of preconditioned iterative methods on parallel computers [J]. International Journal for Numerical Methods in Engineering, 1989, 27(3): 637-654
[8] Bhardwaj M, 2000, INT J NUMER METH ENG, V47, P513, DOI 10.1002/(SICI)1097-0207(20000110/30)47:1/3<513::AID-NME782>3.0.CO;2-V
[10] BINLEY AM, MURPHY MF. Preconditioning finite-element subsurface flow solutions on distributed-memory parallel computers [J]. Advances in Water Resources, 1993, 16(3): 191-202