A recursive formulation of Cholesky factorization of a matrix in packed storage

被引:40
作者
Andersen, BS
Wasniewski, J
Gustavson, FG
机构
[1] Tech Univ Denmark, UNIC Danish Comp Ctr Res & Educ, DK-2800 Lyngby, Denmark
[2] IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA
来源
ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE | 2001年 / 27卷 / 02期
关键词
algorithms; performance BLAS; real symmetric matrices; complex Hermitian matrices; positive definite matrices; Cholesky factorization and solution; recursive algorithms; novel packed matrix data structures;
D O I
10.1145/383738.383741
中图分类号
TP31 [计算机软件];
学科分类号
081202 ; 0835 ;
摘要
A new compact way to store a symmetric or triangular matrix called RPF for Recursive Packed Format is fully described. Novel ways to transform RPF to and from standard packed format are included. A new algorithm, called RPC for Recursive Packed Cholesky, that operates on the RPF format is presented. Algorithm RPC is based on level-3 BLAS and requires variants of algorithms TRSM and SYRK that work on RPF. We call these RP_TRSM and RP_SYRK and find that they do most of their work by calling GEMM. It follows that most of the execution time of RPC lies in GEMM. The advantage of this storage scheme compared to traditional packed and full storage is demonstrated. First, the RPC storage format uses the minimal amount of storage for the symmetric or triangular matrix. Second, RPC gives a level-3 implementation of Cholesky factorization whereas standard packed implementations are only level 2. Hence, the performance of our RPC implementation is decidedly superior. Third, unlike fixed block size algorithms, RPC requires no block size tuning parameter. We present performance measurements on several current architectures that demonstrate improvements over the traditional packed routines. Also SMP parallel computations on the IBM SMP computer are made. The graphs that are attached in Section 7 show that the RPC algorithms are superior by a factor between 1.6 and 7.4 for order around 1000, and between 1.9 and 10.3 for order around 3000 over the traditional packed algorithms. For some architectures, the RPC performance results are almost the same or even better than the traditional full-storage algorithm results.
引用
收藏
页码:214 / 244
页数:31
相关论文
共 23 条
[1]   EXPLOITING FUNCTIONAL PARALLELISM OF POWER2 TO DESIGN HIGH-PERFORMANCE NUMERICAL ALGORITHMS [J].
AGARWAL, RC ;
GUSTAVSON, FG ;
ZUBAIR, M .
IBM JOURNAL OF RESEARCH AND DEVELOPMENT, 1994, 38 (05) :563-576
[2]  
ANDERSEN BS, 1999, P 3 INT C PAR PROC A, P63
[3]  
ANDERSON E, 1999, LAPACK USERSS GUIDE
[4]  
Bilmes J., 1997, PROC 11 INT C SUPERC, P340, DOI 10.1145/263580.263662
[5]  
Demmel J.W., 1997, APPL NUMERICAL LINEA
[6]  
DONGARRA J, 1999, COMBINATORIAL OPTIMI, V5, P241
[7]   AN EXTENDED SET OF FORTRAN BASIC LINEAR ALGEBRA SUBPROGRAMS [J].
DONGARRA, JJ ;
DUCROZ, J ;
HAMMARLING, S ;
HANSON, RJ .
ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 1988, 14 (01) :1-17
[8]  
DONGARRA JJ, 1990, ACM T MATH SOFTWARE, V16, P1, DOI 10.1145/77626.79170
[9]  
DONGARRA JJ, 1998, SOFTW ENVIRONM TOOL, P1
[10]  
Golub G.H., 1996, Johns Hopkins Studies in the Mathematical Sciences, V3rd