GEMM-based level 3 BLAS: High-performance model implementations and performance evaluation benchmark

Times cited: 109
Authors
Kågström, B [1]
Ling, P
Van Loan, C
Affiliations
[1] Umea Univ, Dept Comp Sci, S-901 Umea, Sweden
[2] Cornell Univ, Dept Comp Sci, Ithaca, NY 14853 USA
Source
ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE | 1998 / Vol. 24 / No. 3
Keywords
blocked algorithms; GEMM-based level 3 BLAS; matrix-matrix kernels; memory hierarchy; parallelization; vectorization;
DOI
10.1145/292395.292412
Chinese Library Classification (CLC) number
TP31 [Computer software];
Subject classification code
081202; 0835;
Abstract
The level 3 Basic Linear Algebra Subprograms (BLAS) are designed to perform various matrix multiply and triangular system solving computations. Due to the complex hardware organization of advanced computer architectures, the development of optimal level 3 BLAS code is costly and time-consuming. However, it is possible to develop a portable and high-performance level 3 BLAS library mainly relying on a highly optimized GEMM, the routine for the general matrix multiply and add operation. With suitable partitioning, all the other level 3 BLAS can be defined in terms of GEMM and a small amount of level 1 and level 2 computations. Our contribution is twofold. First, the model implementations in Fortran 77 of the GEMM-based level 3 BLAS are structured to effectively reduce data traffic in a memory hierarchy. Second, the GEMM-based level 3 BLAS performance evaluation benchmark is a tool for evaluating and comparing different implementations of the level 3 BLAS with the GEMM-based model implementations.
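As a minimal sketch of the partitioning idea in the abstract (not the authors' Fortran 77 code), the Python below builds a lower-triangular symmetric rank-k update, C := alpha*A*A^T + beta*C, on top of a GEMM-style kernel. Every block strictly below the diagonal is a full rectangle and goes through `gemm_nt`; only the diagonal blocks need a small unblocked update. The function names, the block size `nb`, and the triple-loop `gemm_nt` (which merely stands in for an optimized GEMM) are hypothetical, chosen for illustration.

```python
from typing import List

Matrix = List[List[float]]


def gemm_nt(alpha: float, A: Matrix, B: Matrix, beta: float, C: Matrix,
            rows: range, cols: range, k: int) -> None:
    """C[rows, cols] := alpha * A[rows, :k] * B[cols, :k]^T + beta * C[rows, cols].

    Hypothetical stand-in: a plain triple loop in place of an optimized GEMM.
    """
    for i in rows:
        for j in cols:
            s = sum(A[i][p] * B[j][p] for p in range(k))
            C[i][j] = alpha * s + beta * C[i][j]


def syrk_unblocked(alpha: float, A: Matrix, beta: float, C: Matrix,
                   rows: range, k: int) -> None:
    """Small update of the lower triangle of one diagonal block."""
    for idx, i in enumerate(rows):
        for j in rows[: idx + 1]:  # columns up to and including the diagonal
            s = sum(A[i][p] * A[j][p] for p in range(k))
            C[i][j] = alpha * s + beta * C[i][j]


def syrk_lower_blocked(alpha: float, A: Matrix, beta: float, C: Matrix,
                       nb: int) -> Matrix:
    """Lower-triangular SYRK, C := alpha*A*A^T + beta*C, expressed via GEMM.

    Only entries with j <= i are written; the strict upper triangle of C
    is left untouched.
    """
    n = len(A)
    k = len(A[0]) if n else 0
    for i0 in range(0, n, nb):
        rows = range(i0, min(i0 + nb, n))
        # Diagonal block: symmetric, so only its lower triangle is updated.
        syrk_unblocked(alpha, A, beta, C, rows, k)
        # Blocks to the left of the diagonal block are full rectangles,
        # so each one is a plain GEMM: C_IJ := alpha*A_I*A_J^T + beta*C_IJ.
        for j0 in range(0, i0, nb):
            cols = range(j0, min(j0 + nb, n))
            gemm_nt(alpha, A, A, beta, C, rows, cols, k)
    return C


if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    C = [[0.0] * 3 for _ in range(3)]
    syrk_lower_blocked(1.0, A, 0.0, C, nb=2)
    print(C)  # prints [[5.0, 0.0, 0.0], [11.0, 25.0, 0.0], [17.0, 39.0, 61.0]]
```

With n rows split into blocks of size nb, the diagonal-block work is roughly a fraction nb/n of the total flops, so for n much larger than nb nearly all of the arithmetic runs inside the GEMM kernel, which is the property a GEMM-based design relies on.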
Pages: 268-302
Number of pages: 35