IMPLICATIONS OF HIERARCHICAL N-BODY METHODS FOR MULTIPROCESSOR ARCHITECTURES

Cited by: 33
Authors
SINGH, JP [1]
HENNESSY, JL [1]
GUPTA, A [1]
Affiliation
[1] STANFORD UNIV,COMP SYST LAB,STANFORD,CA 94305
Source
ACM TRANSACTIONS ON COMPUTER SYSTEMS | 1995 / Vol. 13 / No. 2
Keywords
ALGORITHMS; EXPERIMENTATION; MEASUREMENT; PERFORMANCE; COMMUNICATION ABSTRACTIONS; LOCALITY; MESSAGE PASSING; N-BODY METHODS; PARALLEL APPLICATIONS; PARALLEL COMPUTER ARCHITECTURE; SCALING; SHARED ADDRESS SPACE; SHARED MEMORY;
DOI
10.1145/201045.201050
Chinese Library Classification
TP301 [Theory, Methods]
Discipline Code
081202
Abstract
To design effective large-scale multiprocessors, designers need to understand the characteristics of the applications that will use the machines. Application characteristics of particular interest include the amount of communication relative to computation, the structure of the communication, and the local cache and memory requirements, as well as how these characteristics scale with larger problems and machines. One important class of applications is based on hierarchical N-body methods, which are used to solve a wide range of scientific and engineering problems efficiently. Important characteristics of these methods include the nonuniform and dynamically changing nature of the domains to which they are applied, and their use of long-range, irregular communication. This article examines the key architectural implications of representative applications that use the two dominant hierarchical N-body methods: the Barnes-Hut Method and the Fast Multipole Method. We first show that exploiting temporal locality on accesses to communicated data is critical to obtaining good performance on these applications, and then argue that coherent caches on shared-address-space machines exploit this locality both automatically and very effectively. Next, we examine the implications of scaling the applications to run on larger machines. We use scaling methods that reflect the concerns of the application scientist and find that this leads to different conclusions about how communication traffic and local cache and memory usage scale than does scaling based only on data-set size. In particular, we show that under the most realistic form of scaling, both the communication-to-computation ratio and the working-set size (and hence the ideal cache size per processor) grow slowly as larger problems are run on larger machines. Finally, we examine the effects of using the two dominant abstractions for interprocessor communication: a shared address space and explicit message passing between private address spaces. We show that the lack of an efficiently supported shared address space will substantially increase the programming complexity and performance overheads for these applications.
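The abstract describes the Barnes-Hut method only at a high level. As a rough, sequential illustration of the hierarchical idea it refers to (a tree over space, with distant cells approximated by their centers of mass under an opening-angle test), the following minimal 2-D quadtree sketch in Python may help. The class name Node, the helper methods insert and force_on, and the parameters THETA, G, and EPS are illustrative assumptions, not taken from the paper, and the sketch does not attempt to reproduce the parallel applications studied in the article.

import math

THETA = 0.5   # opening-angle parameter: smaller values mean more accuracy, more work
G = 1.0       # gravitational constant in arbitrary units (illustrative)
EPS = 1e-3    # softening term to avoid singular forces at tiny separations


class Node:
    """One square cell of a 2-D quadtree: empty, holding one body, or holding four children."""

    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half  # cell center and half-width
        self.mass = 0.0
        self.comx = 0.0  # center of mass, x
        self.comy = 0.0  # center of mass, y
        self.body = None
        self.children = None

    def insert(self, x, y, m):
        if self.children is None and self.body is None:
            self.body = (x, y, m)          # empty leaf: store the body here
        else:
            if self.children is None:      # occupied leaf: split and push the old body down
                self._subdivide()
                bx, by, bm = self.body
                self.body = None
                self._child_for(bx, by).insert(bx, by, bm)
            self._child_for(x, y).insert(x, y, m)
        # maintain the aggregate mass and center of mass of this cell
        total = self.mass + m
        self.comx = (self.comx * self.mass + x * m) / total
        self.comy = (self.comy * self.mass + y * m) / total
        self.mass = total

    def _subdivide(self):
        h = self.half / 2.0
        self.children = [Node(self.cx + dx * h, self.cy + dy * h, h)
                         for dx in (-1, 1) for dy in (-1, 1)]

    def _child_for(self, x, y):
        idx = (2 if x >= self.cx else 0) + (1 if y >= self.cy else 0)
        return self.children[idx]

    def force_on(self, x, y, m):
        """Approximate gravitational force on a body of mass m at (x, y)."""
        if self.mass == 0.0:
            return 0.0, 0.0
        if self.body is not None and self.body[0] == x and self.body[1] == y:
            return 0.0, 0.0                # skip the body's own leaf
        dx, dy = self.comx - x, self.comy - y
        dist = math.sqrt(dx * dx + dy * dy) + EPS
        # Far-enough cells (cell size / distance < THETA) and leaves are treated as
        # point masses at their centers of mass; otherwise the cell is opened.
        if self.children is None or (2.0 * self.half) / dist < THETA:
            f = G * self.mass * m / (dist * dist)
            return f * dx / dist, f * dy / dist
        fx = fy = 0.0
        for child in self.children:
            cfx, cfy = child.force_on(x, y, m)
            fx, fy = fx + cfx, fy + cfy
        return fx, fy


if __name__ == "__main__":
    import random
    random.seed(0)
    bodies = [(random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0), 1.0)
              for _ in range(1000)]
    root = Node(0.0, 0.0, 1.0)             # root cell covering [-1, 1] x [-1, 1]
    for x, y, m in bodies:
        root.insert(x, y, m)
    fx, fy = root.force_on(*bodies[0])
    print("approximate force on body 0:", fx, fy)

The sketch needs only the Python standard library. The opening-angle test is what gives the O(N log N) force-evaluation cost cited in reference [4] below, in contrast to the O(N^2) cost of direct all-pairs summation.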
Pages: 141-202
Page count: 62
References
37 in total
  • [1] AARSETH S.J., HENON M., WIELEN R., Astronomy Astrophysics, (1974)
  • [2] APPEL A.W., An efficient program for many body simulation, SIAM J. Sci. Stat. Comput., 6, pp. 85-93, (1985)
  • [3] BARNES J.E., HUT P., Error analysis of a tree code, Astrophysics J. Suppl., 70, pp. 389-417, (1989)
  • [4] BARNES J.E., HUT P., A hierarchical O(N log N) force calculation algorithm, Nature, 324, 4, pp. 446-449, (1986)
  • [5] CHAIKEN D., KUBIATOWICZ J., AGARWAL A., LimitLESS directories: A scalable cache coherence scheme, 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 224-234, (1991)
  • [6] CHAN T., Hierarchical algorithms and architectures for parallel scientific computing, Proceedings of ACM Conference on Supercomputing, (1990)
  • [7] CHORIN A.J., Numerical study of slightly viscous flow, J. Fluid Mech., 57, pp. 785-796, (1973)
  • [8] DENNING P.J., The working set model for program behavior, Commun. ACM, 11, 5, pp. 323-333, (1968)
  • [9] EGGERS S., KATZ R., The effect of sharing on the cache and bus performance of parallel programs, Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 257-270, (1989)
  • [10] FOX G.C., A graphical approach to load balancing and sparse matrix vector multiplication on the hypercube, Numerical Algorithms for Modern Parallel Computer Architectures, pp. 37-62, (1988)