Fast-multipole algorithms moving to Exascale


1 Numerical Algorithms for Extreme Computing Architectures. Software Institute for Methodologies and Abstractions for Codes (SIMAC 3). Fast-multipole algorithms moving to Exascale. Lorena A. Barba, The George Washington University (at Boston University until July on Twitter

2 Acknowledgements NSF CAREER award NVIDIA Academic Partnership Award CUDA Fellow ONR Applied Computational Analysis Program

3 Acknowledgement joint work with Dr. Rio Yokota here at Nagasaki Advanced Computing Center, Japan

4 1946 The Monte Carlo method; 1947 Simplex Method for Linear Programming; 1950 Krylov Subspace Iteration Method; 1951 The Decompositional Approach to Matrix Computations; 1957 The Fortran Compiler; QR Algorithm for Computing Eigenvalues; Quicksort Algorithms for Sorting; Fast Fourier Transform; Integer Relation Detection; Fast Multipole Method. Dongarra & Sullivan, IEEE Comput. Sci. Eng., Vol. 2(1): (2000)

6 14 Gordon Bell awards for N-body

7-9 Gordon Bell awards for N-body. Performance, 1992: Warren & Salmon, 5 Gflop/s. Price/performance, 1997: Warren et al., 18 Gflop/s / $1 M. Price/performance, 2009: Hamada et al., 124 Mflop/s / $1 (6200x cheaper; 34x more than Moore's law). Performance, 2010: Rahimian et al., 0.7 Pflop/s on Jaguar.

10 Largest simulation: 90 billion unknowns. Scale: 256 GPUs of Lincoln cluster / 196,608 cores of Jaguar. Numerical engine: FMM (kernel-independent version, kifmm).

11 N-body simulation on GPU hardware: The algorithmic and hardware speed-ups multiply.

12-13 FMM on GPU: multiplying speed-ups. [Figure: time [s] vs. N for Direct (CPU), Direct (GPU), FMM (CPU) and FMM (GPU), with a 200x speed-up annotated. Note: p=10, L2-norm error (normalized): 10^-4.] "Treecode and fast multipole method for N-body simulation with CUDA", R. Yokota & L. A. Barba, Ch. 9 in GPU Computing Gems Emerald Edition, Elsevier/Morgan Kaufmann (2011)

14 Advantage of N-body algorithms on GPUs: quantify using the Roofline Model, which shows hardware barriers ("ceilings") on a computational kernel. Components of performance: computation, communication, locality.

15 Performance: Computation. Metric: Gflop/s (dp / sp). Peak achievable if: exploit FMA, etc.; non-divergence (GPU). Intra-node parallelism: explicit in algorithm; explicit in code. Source: ParLab, UC Berkeley

16 Performance: Communication. Metric: GB/s. Peak achievable if optimizations are explicit: prefetching; allocation/usage; stride streams; coalescing on GPU. Source: ParLab, UC Berkeley

17 Performance: Locality. "Computation is free": maximize locality to minimize communication (comm lower bound). Minimize capacity misses: hardware aid, cache size; optimization via software, blocking. Minimize conflict misses: hardware aid, associativities; optimization via software, padding. Source: ParLab, UC Berkeley

18 Roofline model. "Roofline: An Insightful Visual Performance Model for Multicore Architectures", S. Williams, A. Waterman, D. Patterson. Communications of the ACM, April. Operational intensity = total flop / total byte = (Gflop/s) / (GB/s). [Figure: roofline for the NVIDIA C2050, log/log scale; attainable flop/s (Gflop/s) vs. operational intensity (flop/byte); ceilings: peak memory performance and single-precision peak floating-point performance, with "+SFU, +FMA" and "no SFU, no FMA" levels.]
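The roofline bound on this slide can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the function name `attainable_gflops` and the device numbers (1000 Gflop/s peak, 100 GB/s bandwidth) are hypothetical round values, not the specifications of the C2050.

```python
def attainable_gflops(peak_gflops, peak_gbs, intensity):
    """Roofline bound: the lower of the compute ceiling and
    (memory bandwidth x operational intensity)."""
    return min(peak_gflops, peak_gbs * intensity)


# Hypothetical device, chosen only for round arithmetic.
PEAK_GFLOPS = 1000.0   # floating-point ceiling
PEAK_GBS = 100.0       # memory-bandwidth ceiling

for oi in (0.25, 1.0, 4.0, 16.0, 64.0):   # operational intensity, flop/byte
    bound = attainable_gflops(PEAK_GFLOPS, PEAK_GBS, oi)
    # A kernel is memory-bound while the bandwidth term is the smaller one.
    regime = "memory-bound" if bound < PEAK_GFLOPS else "compute-bound"
    print(f"{oi:6.2f} flop/byte -> {bound:7.1f} Gflop/s ({regime})")
```

On a log/log plot the bandwidth term is the slanted part of the roof and the peak is the flat part, which is why low-intensity kernels cannot reach the floating-point peak however well they are tuned.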

19 Advantage of N-body algorithms on GPUs. [Figure: roofline, attainable flop/s (Gflop/s) vs. operational intensity (flop/byte); ceilings: single-precision peak (+SFU, +FMA) and "no SFU, no FMA"; kernels marked: SpMV, Stencil, 3-D FFT, Fast N-body (cell-cell), Fast N-body (particle-particle).] "Hierarchical N-body simulations with auto-tuning for heterogeneous systems", Rio Yokota, L. A. Barba. IEEE Computing in Science and Engineering, 3 January 2012

20 Computing in Science and Engineering (CiSE), 3 January 2012, IEEE Computer Society, doi: /mcse Preprint arxiv:

21 Scalability of FMM in many-GPU & many-CPU systems

22 Scalability of FMM. Our own progress: 1) 1 billion points on 512 GPUs (Degima); 2) 32 billion on 32,768 processors of Kraken; 3) 69 billion on 4096 GPUs of Tsubame 2.0, achieved 1 petaflop/s on turbulence simulation.

24 Demo: Lysozyme molecule. [Figure panels: charges; mesh.] Discretized with 102,486 boundary elements.

25 1000 lysozyme molecules. Largest calculation: 10,648 molecules, each discretized with 102,486 boundary elements; more than 20 million atoms; 1 billion unknowns; one minute per iteration on 512 GPUs of Degima.
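The "1 billion unknowns" figure can be checked from the two counts on this slide. The short sketch below assumes one unknown per boundary element, which the slide does not state explicitly; the variable names are illustrative.

```python
# Counts taken from the slide.
molecules = 10_648                 # equals 22**3
elements_per_molecule = 102_486

# Assumption: one unknown per boundary element.
unknowns = molecules * elements_per_molecule
print(f"{unknowns:,} unknowns")    # prints "1,091,270,928 unknowns"
```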

26 Tsubame 2.0: 1408 nodes, each with 12 CPU cores, 3 NVIDIA M2050 GPUs and 54 GB of RAM. Total of 4224 GPUs; peak performance 2.4 Petaflop/s. #5 in Top500 (Jun '11 & Nov '11).

27 Weak scaling on Tsubame: 4 million points per process. [Figure: time (sec) vs. number of processes (GPUs), broken down into tree construction, GPU communication, MPI communication, FMM evaluation and local evaluation.] ~9 billion points, 30 s on 2048 GPUs.

28 FMM vs. FFT, weak scaling. [Figure: parallel efficiency vs. number of processes, for the FMM and a spectral method.]
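For reference, weak-scaling parallel efficiency is commonly computed as the run time of the smallest run divided by the run time at P processes, with the work per process held fixed. The sketch below uses that definition; the helper name `weak_scaling_efficiency` and the timings are hypothetical, not measurements from the talk.

```python
def weak_scaling_efficiency(times):
    """times maps number of processes -> wall time in seconds, with the
    same amount of work per process in every run. Efficiency is measured
    relative to the run with the fewest processes."""
    base = times[min(times)]
    return {p: base / t for p, t in sorted(times.items())}


# Hypothetical timings, for illustration only.
timings = {1: 10.0, 8: 10.5, 64: 12.5, 512: 16.0}

efficiency = weak_scaling_efficiency(timings)
for procs, eff in efficiency.items():
    print(f"{procs:4d} processes: efficiency {eff:.2f}")
```

Perfect weak scaling keeps the time constant as processes are added, so the efficiency stays at 1; communication and tree-construction overheads pull it below 1.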

29 Petascale turbulence simulation using vortex method: 4,096^3 grid, 69 billion points, 1 Pflop/s. Energy spectrum well-matched. [Figure: energy spectrum E(k) vs. k, spectral method and vortex method.]

30 How will FMM fare in exascale?

31 Pressure from hardware: overall byte-to-flop ratios (machine balance) are declining. Machine balance is converging.
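Machine balance here is memory bandwidth divided by peak floating-point rate. A small sketch of why a declining balance puts pressure on algorithms follows; the function names and both machines' numbers are hypothetical and chosen only to make the trend visible.

```python
def machine_balance(bandwidth_gbs, peak_gflops):
    """Bytes the memory system can deliver per floating-point operation."""
    return bandwidth_gbs / peak_gflops


def intensity_needed_for_peak(bandwidth_gbs, peak_gflops):
    """Flop per byte a kernel must perform to keep the arithmetic units busy
    (the reciprocal of the machine balance)."""
    return peak_gflops / bandwidth_gbs


# Two hypothetical machines (illustrative numbers, not vendor data).
older = {"bandwidth_gbs": 50.0, "peak_gflops": 100.0}
newer = {"bandwidth_gbs": 200.0, "peak_gflops": 1600.0}

for name, m in (("older", older), ("newer", newer)):
    print(f"{name}: {machine_balance(**m):.3f} byte/flop, "
          f"needs {intensity_needed_for_peak(**m):.0f} flop/byte to reach peak")
```

In this made-up example bandwidth grows 4x while peak flop/s grows 16x, so a kernel needs four times the operational intensity to stay compute-bound on the newer machine.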

32 Features of FMM for exascale. The tree data structure extracts data locality from nonuniform and multi-scale resolutions; benign synchronization requirements; some kernels are purely local; hierarchical communication pattern. The FMM is not a locality-sensitive application, in the sense of: Bergman et al. (2008) Exascale Computing Study, DARPA IPTO

33 [Figure: roofline, double precision performance (Gflop/s) vs. operational intensity (flop/byte), for Intel Sandy Bridge, AMD Abu Dhabi, IBM BG/Q, Fujitsu FX10, NVIDIA Kepler and Intel Xeon Phi; kernels marked: SpMV, Stencil, 3D FFT, FMM M2L (Spherical), FMM M2L (Cartesian), DGEMM, FMM P2P.]

37 25 September 2013

38 15x the bandwidth of a DDR3 module; 70% less energy per bit compared to DDR3.

39 Hardware balance. [Table: GB/s, Gflop/s and byte/flop by vendor, microarchitecture and model.] Rows: AMD, Bulldozer, Opteron 6284; AMD, Southern Islands, Radeon HD7970; Intel, Sandy Bridge, Xeon E; Intel, MIC, Xeon Phi; Nvidia, Fermi GF110, Tesla M2090 (16 SMs); Nvidia, Kepler GK110, Tesla K20X (14 SMXs); Nvidia, Volta, GPU with Stacked 3D DRAM. Gflop/s entries: 1331 (SP), 1310 (DP), 3950 (SP), 4000 (DP). Byte/flop entries: 0.266, 0.133, 0.190, 0.063, 0.256, 0.085. Thanks: Manuel Ujaldón

40-47 [Figure, built up over eight slides: roofline, GFLOP/s (double precision performance) vs. FLOP/byte (operational intensity), for Xeon, Opteron, Volta, Kepler, Xeon Phi, Radeon and Fermi. Annotations: balance zone; memory-bound kernels; compute-bound kernels; Stacked DRAM: 1 GB/s. Kernels marked: SpMxV, Stencil, FFT 3D, MxM (DGEMM in BLAS).] Thanks: Manuel Ujaldón

48 [Figure: roofline, GFLOP/s (double precision performance) vs. FLOP/byte (operational intensity), for Volta and Kepler; kernels marked: Stencil, FMM M2L (Spherical), FMM M2L (Cartesian), FMM M2L P2P.] Thanks: Manuel Ujaldón

49 Most of the algorithms in our toolbox are bandwidth-bound. Forecasts with current trends are gloomy. Stacked memory changes the game.

50 Backup

51 Treecode & Fast multipole method: reduces operation count from O(N^2) to O(N log N) or O(N). $f(y_j) = \sum_{i=1}^{N} c_i \, K(y_j - x_i), \quad j \in [1 \ldots N]$. [Figure: tree, from root through level 1 to leaf level.] "A tuned and scalable fast multipole method as a preeminent algorithm for exascale systems", Rio Yokota, L. A. Barba. Int. J. High-perf. Comput. 26(4): (November 2012)
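The O(N^2) baseline that the treecode and FMM improve on is the direct evaluation of the sum above. A minimal pure-Python sketch follows. It assumes the Laplace kernel K(r) = 1/|r| in 3D purely for illustration (the slide leaves K generic), and the names `direct_sum` and `laplace_kernel` are hypothetical.

```python
import math


def laplace_kernel(r):
    """K(r) = 1/|r| (an assumed kernel); returns 0 for the self-interaction."""
    dist = math.sqrt(sum(rk * rk for rk in r))
    return 1.0 / dist if dist > 0.0 else 0.0


def direct_sum(targets, sources, charges, kernel):
    """Evaluate f(y_j) = sum_i c_i K(y_j - x_i) for every target y_j.
    The cost is len(targets) * len(sources) kernel evaluations."""
    result = []
    for y in targets:
        total = 0.0
        for x, c in zip(sources, charges):
            r = tuple(yk - xk for yk, xk in zip(y, x))
            total += c * kernel(r)
        result.append(total)
    return result


# Three unit charges; targets coincide with sources, as in an N-body problem.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
charges = [1.0, 1.0, 1.0]
potentials = direct_sum(points, points, charges, laplace_kernel)
print(potentials)
```

The nested loop is what makes the cost quadratic when targets and sources are the same N points; the hierarchical methods replace most of the inner loop with interactions between groups of points.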

52 [Figure: FMM kernels on the tree, from root through level 1 to leaf level: P2M, M2M, M2L, L2L, L2P.] "A tuned and scalable fast multipole method as a preeminent algorithm for exascale systems", Rio Yokota, L. A. Barba. Int. J. High-perf. Comput. 26(4): (November 2012)
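The idea behind the upward pass can be illustrated with the crudest possible expansion: collapse all charges in a cell into a single total charge at the cell centre and evaluate that at a distant target. The sketch below is a treecode-style, order-0 (monopole) simplification under the assumption K = 1/r. It is not the M2L/L2L/L2P chain in the figure nor the p=10 expansions used in the talk, and all function names are hypothetical.

```python
import math
import random


def direct_potential(target, sources, charges):
    """Exact particle-to-particle sum with the assumed kernel K = 1/r."""
    return sum(c / math.dist(target, x) for x, c in zip(sources, charges))


def monopole_p2m(charges):
    """Order-0 particle-to-multipole step: keep only the total charge."""
    return sum(charges)


def monopole_eval(target, center, total_charge):
    """Evaluate the monopole at a far-away target (treecode-style)."""
    return total_charge / math.dist(target, center)


random.seed(0)
n = 1000
# Sources fill the unit cube, playing the role of one leaf cell.
sources = [(random.random(), random.random(), random.random()) for _ in range(n)]
charges = [random.uniform(0.5, 1.5) for _ in range(n)]
center = (0.5, 0.5, 0.5)
target = (10.0, 10.0, 10.0)   # well separated from the cell

exact = direct_potential(target, sources, charges)
approx = monopole_eval(target, center, monopole_p2m(charges))
rel_err = abs(approx - exact) / abs(exact)
print(f"exact={exact:.6f}  approx={approx:.6f}  relative error={rel_err:.2e}")
```

One evaluation replaces a thousand, at the price of an error controlled by the ratio of cell size to distance; higher-order expansions shrink that error, and the M2L and L2L steps additionally share the work across nearby targets.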
