MINIMIZING COMMUNICATION IN NUMERICAL LINEAR ALGEBRA

Cited by: 133
Authors
Ballard, Grey [1 ]
Demmel, James [2 ,3 ]
Holtz, Olga [2 ,4 ]
Schwartz, Oded [1 ]
Affiliations
[1] Univ Calif Berkeley, Dept Comp Sci, Berkeley, CA 94720 USA
[2] Univ Calif Berkeley, Dept Math, Berkeley, CA 94720 USA
[3] Univ Calif Berkeley, CS Div, Berkeley, CA 94720 USA
[4] Tech Univ Berlin, Berlin, Germany
Keywords
linear algebra algorithms; bandwidth; latency; communication-avoiding; lower bound; multishift QR algorithm; nested dissection; parallel matrix; factorization; complexity; recursion; bounds; serial; leads
DOI
10.1137/090769156
Chinese Library Classification
O29 [Applied Mathematics]
Discipline Classification Code
070104
Abstract
In 1981 Hong and Kung proved a lower bound on the amount of communication (amount of data moved between a small, fast memory and a large, slow memory) needed to perform dense, n-by-n matrix multiplication using the conventional O(n^3) algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo, and Tiskin gave a new proof of this result and extended it to the parallel case (where communication means the amount of data moved between processors). In both cases the lower bound may be expressed as Ω(#arithmetic operations / √M), where M is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, LDL^T factorization, QR factorization, the Gram-Schmidt algorithm, and algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth cost), we get lower bounds on the number of messages required to move it (latency cost). We extend our lower bound technique to compositions of linear algebra operations (like computing powers of a matrix) to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or whether we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph-theoretic problems. We point out recently designed algorithms that attain many of these lower bounds.
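The Ω(#arithmetic operations / √M) bandwidth bound, combined with the fact that a single message can carry at most M words, also yields a latency bound of Ω(#arithmetic operations / M^(3/2)) messages. A minimal sketch of how these two quantities scale for conventional n-by-n matrix multiplication (the helper name and parameter values are illustrative, not from the paper):

```python
import math

def matmul_lower_bounds(n, M):
    """Illustrative sequential lower bounds for conventional n-by-n
    matrix multiplication with a fast memory of M words:
      words moved    = Omega(n^3 / sqrt(M))
      messages sent  = Omega(n^3 / M^(3/2))  (each message <= M words)
    Returns the two Omega-expressions' leading terms (constants dropped)."""
    flops = n ** 3                        # arithmetic operations in the O(n^3) algorithm
    bandwidth = flops / math.sqrt(M)      # lower bound on words moved
    latency = flops / M ** 1.5            # lower bound on number of messages
    return bandwidth, latency

# Example: n = 1024, fast memory of 4096 words.
words, msgs = matmul_lower_bounds(n=1024, M=4096)
```

Note that enlarging the fast memory by a factor of 4 halves the bandwidth lower bound but reduces the latency lower bound by a factor of 8, which is why blocked ("communication-avoiding") algorithms choose block sizes close to √M.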
Pages: 866–901
Page count: 36
References
64 in total; items [51]–[59] shown
[51]   Parallel Out-of-Core computation and updating of the QR factorization [J].
Gunter, BC ;
Van De Geijn, RA .
ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2005, 31 (01) :60-78
[52]   Recursion leads to automatic variable blocking for dense linear-algebra algorithms [J].
Gustavson, FG .
IBM JOURNAL OF RESEARCH AND DEVELOPMENT, 1997, 41 (06) :737-755
[53]   Complexity bounds for regular finite-difference and finite-element grids [J].
HOFFMAN, AJ ;
MARTIN, MS ;
ROSE, DJ .
SIAM JOURNAL ON NUMERICAL ANALYSIS, 1973, 10 (02) :364-369
[54]   Communication lower bounds for distributed-memory matrix multiplication [J].
Irony, D ;
Toledo, S ;
Tiskin, A .
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2004, 64 (09) :1017-1026
[55]  
Irony D., 2002, Parallel Processing Letters, V12, P79, DOI 10.1142/S0129626402000847
[56]   An inequality related to the isoperimetric inequality [J].
LOOMIS, LH ;
WHITNEY, H .
BULLETIN OF THE AMERICAN MATHEMATICAL SOCIETY, 1949, 55 (10) :961-962
[57]   Memory-efficient matrix multiplication in the BSP model [J].
McColl, WF ;
Tiskin, A .
ALGORITHMICA, 1999, 24 (3-4) :287-297
[58]   Optimizing graph algorithms for improved cache performance [J].
Park, JS ;
Penner, M ;
Prasanna, VK .
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2004, 15 (09) :769-782
[59]   Modification of the Householder method based on the compact WY representation [J].
PUGLISI, C .
SIAM JOURNAL ON SCIENTIFIC AND STATISTICAL COMPUTING, 1992, 13 (03) :723-726