Communication lower bounds for distributed-memory matrix multiplication

Cited by: 126
Authors
Irony, D
Toledo, S [1 ]
Tiskin, A
Affiliations
[1] Tel Aviv Univ, Sch Comp Sci, IL-69978 Tel Aviv, Israel
[2] Univ Warwick, Dept Comp Sci, Coventry CV4 7AL, W Midlands, England
Funding
Israel Science Foundation;
Keywords
communication; lower bounds; distributed memory; matrix multiplication;
DOI
10.1016/j.jpdc.2004.03.021
CLC classification
TP301 [Theory, Methods];
Discipline code
081202 ;
Abstract
We present lower bounds on the amount of communication that matrix multiplication algorithms must perform on a distributed-memory parallel computer. We denote the number of processors by P and the dimension of the square matrices by n. We show that the most widely used class of algorithms, the so-called two-dimensional (2D) algorithms, is optimal, in the sense that in any algorithm that uses only O(n²/P) words of memory per processor, at least one processor must send or receive Ω(n²/P^{1/2}) words. We also show that algorithms from another class, the so-called three-dimensional (3D) algorithms, are optimal. These algorithms use replication to reduce communication: in any algorithm that uses O(n²/P^{2/3}) words of memory per processor, at least one processor must send or receive Ω(n²/P^{2/3}) words. Furthermore, we show a continuous tradeoff between the size of local memories and the amount of communication that must be performed; the 2D and 3D bounds are essentially instantiations of this tradeoff. We also show that if the input is distributed across the local memories of multiple nodes without replication, then Ω(n²) words must cross any bisection cut of the machine. All our bounds apply only to conventional Θ(n³) algorithms; they do not apply to Strassen's algorithm or other o(n³) algorithms. (C) 2004 Elsevier Inc. All rights reserved.
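The continuous memory-communication tradeoff described in the abstract can be sketched numerically. The helper below is hypothetical (not from the paper's text) and drops all constant factors: it evaluates the asymptotic per-processor bound Ω(n³/(P·M^{1/2})) for local memory size M, and checks that plugging in the 2D memory size n²/P and the 3D memory size n²/P^{2/3} recovers the two stated bounds.

```python
import math

def comm_lower_bound(n, P, M):
    """Asymptotic per-processor communication lower bound, in words,
    for conventional Theta(n^3) matrix multiplication with M words of
    local memory per processor: Omega(n^3 / (P * sqrt(M))).
    Constant factors are dropped; this is an illustrative sketch only."""
    return n**3 / (P * math.sqrt(M))

n, P = 4096, 64
M_2d = n**2 / P            # 2D algorithms: O(n^2/P) memory per processor
M_3d = n**2 / P**(2 / 3)   # 3D algorithms: O(n^2/P^(2/3)) memory per processor

# 2D instantiation of the tradeoff recovers Omega(n^2 / P^(1/2))
assert math.isclose(comm_lower_bound(n, P, M_2d), n**2 / math.sqrt(P))
# 3D instantiation recovers Omega(n^2 / P^(2/3))
assert math.isclose(comm_lower_bound(n, P, M_3d), n**2 / P**(2 / 3))
```

Note how the bound decreases as M grows: giving each processor more memory (replication, as in 3D algorithms) provably reduces the communication that any conventional algorithm must perform.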
Pages: 1017-1026 (10 pages)