The success of multipole methods: Are we there yet?


1 King Abdullah University of Science and Technology, Saudi Arabia, Apr. 28-30, 2012. The success of multipole methods: Are we there yet? Lorena A Barba, Boston University

2 1948 The world's first computation by a stored-program computer: The Manchester Baby

3

4 50th Anniversary Celebration at the University of Manchester

5 Professor Lloyd N Trefethen FRS, Head of Oxford's Numerical Analysis Group, President of SIAM (2011-2012)

6 1998 Lloyd Trefethen: Predictions for Scientific Computing 50 Years From Now

7 #1. We may not be here

8 #2. We'll talk to computers more often than type to them, and they'll respond with pictures more often than numbers.

9 [FMM tree diagram: P2M at the leaf level, M2M and M2L at level 1, L2L and L2P back down to the leaves.] #7. Multipole methods and their descendants will be ubiquitous

10 "... the fundamental law of computer science [is]: the faster the computer, the greater the importance of speed of algorithms." Trefethen & Bau, Numerical Linear Algebra, SIAM

11 Fast multipole method: O(N) solution of the N-body problem. A Top 10 Algorithm of the 20th century.

12 1946 The Monte Carlo Method; the Simplex Method for Linear Programming; the Krylov Subspace Iteration Method; the Decompositional Approach to Matrix Computations; the Fortran Compiler; the QR Algorithm for Computing Eigenvalues; Quicksort Algorithms for Sorting; the Fast Fourier Transform; Integer Relation Detection; the Fast Multipole Method. Dongarra & Sullivan, IEEE Comput. Sci. Eng., Vol. 2(1):22-23 (2000)

13 N-body Problem: updates to a system where each element of the system rigorously depends on the state of every other element of the system.

14 N-body Problem: updates to a system where each element of the system rigorously depends on the state of every other element of the system.

15 Credit: Mark Stock

16

17 M31, the Andromeda galaxy. # stars: 10^12

18

19 Fast N-body method stars of the Andromeda galaxy Earth

20 Fast N-body method stars of the Andromeda galaxy

21 Fast N-body method stars of the Andromeda galaxy O(N)

22 Treecode & fast multipole method: reduces the operation count from O(N^2) to O(N log N) or O(N). f(y_j) = Σ_{i=1}^{N} c_i K(y_j − x_i), for j ∈ [1...N]. [Tree diagram: root, level 1, ..., leaf level.] Image: A tuned and scalable fast multipole method as a preeminent algorithm for exascale systems, Rio Yokota, L A Barba. Int. J. High-perf. Comput. Accepted (2011), to appear; preprint arXiv:
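As a point of reference for the operation counts above, here is a minimal sketch (Python/NumPy, assuming an illustrative softened 1/r kernel; not code from the talk) of the direct O(N^2) evaluation that treecodes and the FMM approximate:

```python
import numpy as np

def direct_sum(targets, sources, charges, eps=1e-6):
    """Direct O(N^2) evaluation of f(y_j) = sum_i c_i K(y_j - x_i).

    K is an illustrative softened 1/r kernel; the treecode/FMM replace this
    double loop with tree traversals costing O(N log N) or O(N).
    """
    f = np.zeros(len(targets))
    for j, y in enumerate(targets):                 # loop over N targets
        r = np.linalg.norm(y - sources, axis=1)     # distances to all N sources
        f[j] = np.sum(charges / np.sqrt(r**2 + eps**2))
    return f

# Example: 1,000 random particles acting on themselves.
rng = np.random.default_rng(0)
x = rng.random((1000, 3))
c = rng.random(1000)
print(direct_sum(x, x, c)[:3])
```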

23 Information moves from red (source particles) to blue (target particles): M2M, multipole to multipole (treecode & FMM); P2M, particle to multipole (treecode & FMM); M2L, multipole to local (FMM); M2P, multipole to particle (treecode); L2L, local to local (FMM); L2P, local to particle (FMM); P2P, particle to particle (treecode & FMM). Image: Treecode and fast multipole method for N-body simulation with CUDA, Rio Yokota, Lorena A Barba, Ch. 9 in GPU Computing Gems Emerald Edition, Wen-mei Hwu, ed.; Morgan Kaufmann/Elsevier (2011)

24 [FMM tree diagram: root, level 1, ..., leaf level; P2M and M2M on the way up, M2L across, L2L and L2P on the way down.] A tuned and scalable fast multipole method as a preeminent algorithm for exascale systems, Rio Yokota, L A Barba. Int. J. High-perf. Comput., Jan. 2012. This figure shared under CC-BY on figshare.com
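To make the operator flow on slides 23-24 concrete, here is a structural sketch (a hypothetical Cell class and stub operators, not the ExaFMM code) that only prints the order in which an FMM visits a tiny two-level tree:

```python
class Cell:
    """Minimal tree cell: a level and a list of child cells."""
    def __init__(self, level, children=()):
        self.level = level
        self.children = list(children)

def upward_pass(cell):
    """P2M at the leaves, then M2M towards the root (treecode & FMM)."""
    if not cell.children:
        print(f"P2M  at leaf level {cell.level}")
    else:
        for child in cell.children:
            upward_pass(child)
        print(f"M2M  level {cell.level + 1} -> {cell.level}")

def interaction_pass(cells_by_level):
    """M2L between well-separated cells on the same level (FMM only)."""
    for level, cells in cells_by_level.items():
        print(f"M2L  among {len(cells)} cells at level {level}")

def downward_pass(cell):
    """L2L towards the leaves, then L2P back to the particles (FMM only)."""
    if not cell.children:
        print(f"L2P  at leaf level {cell.level}")
    else:
        print(f"L2L  level {cell.level} -> {cell.level + 1}")
        for child in cell.children:
            downward_pass(child)

# Tiny example tree: one root (level 0) with two leaf children (level 1).
leaves = [Cell(level=1), Cell(level=1)]
root = Cell(level=0, children=leaves)

upward_pass(root)
interaction_pass({1: leaves})
downward_pass(root)
print("P2P  between neighbouring leaf cells (near field, treecode & FMM)")
```

A treecode stops after the upward pass and evaluates M2P/P2P directly at the targets; the FMM adds the M2L interaction pass and the downward pass.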

25 Diversity of N-body problems: e.g., atoms/ions in electrostatic or van der Waals forces; integral formulation of elliptic PDEs: ∇²u = f, u = ∫ G f dΩ; numerical integration.

26 Applications of the FMM (∇²u = f, u = ∫ G f dΩ). Poisson, ∇²u = f: astrophysics, electrostatics, fluid mechanics. Helmholtz, ∇²u + k²u = f: acoustics, electromagnetics. Poisson-Boltzmann, ∇·(ε∇u) + k²u = f: geophysics, biophysics. As a fast mat-vec: accelerates the iterations of Krylov solvers; speeds up Boundary Element Method (BEM) solvers.

27 Progress: 25 years since Greengard & Rokhlin '87. FMM matures in the late '90s and early 2000s. Fourteen Gordon Bell awards for N-body. Variants (kifmm) and several optimizations. Last 5 years: N-body simulation on GPU hardware.

28 FMM matures in the late '90s and early 2000s: first parallel implementations; application to acoustics, scattering; translation theory. Meanwhile, success in treecodes.

29 Parallel FMM. 1990: Greengard & Gropp, shared memory. A parallel adaptive fast multipole method, J. P. Singh, C. Holt, J. L. Hennessy, A. Gupta, Supercomputing '93, Proceedings of the 1993 ACM/IEEE Conference. Load Balancing and Data Locality in Adaptive Hierarchical N-Body Methods..., J. P. Singh, C. Holt, T. Totsuka, A. Gupta, J. Hennessy, Journal of Parallel and Distributed Computing, 27(2) (1995). A new parallel kernel-independent fast multipole method, L. Ying, G. Biros, D. Zorin, H. Langston, Supercomputing 2003, ACM/IEEE Conference (Best Student Paper Award).

30 Acoustics, Helmholtz & Maxwell equations. References below by N. Gumerov & R. Duraiswami, Univ. of Maryland: Computation of scattering from N spheres using multipole reexpansion, J. Acoust. Soc. Am., 112(6) (2002); Efficient computation of acoustical scattering from N spheres via the fast multipole method..., J. Acoust. Soc. Am., 116(4): 2528 (2004); Computation of scattering from clusters of spheres using the fast multipole method, J. Acoust. Soc. Am., 117(4) (2005); A broadband fast multipole accelerated boundary element method for the three dimensional Helmholtz equation, J. Acoust. Soc. Am., 125(1) (2009).

31

32 Helmholtz: hard life at high frequencies (Rokhlin): classical translation operators are of no use. Summary of Duraiswami group techniques: level-dependent truncation number; use of rotation-based symmetries and local expansion with special functions; switch to diagonal representations later.

33 14 Gordon Bell awards for N-body Performance 1992 Warren & Salmon, 5 Gflop/s Price/performance 1997 Warren et al., 18 Gflop/s / $1 M Price/performance 2009 Hamada et al., 124 Mflop/s / $1 Performance 2010 Rahimian et al., 0.7 Pflop/s on Jaguar

34 14 Gordon Bell awards for N-body Performance 1992 Warren & Salmon, 5 Gflop/s Price/performance 1997 Warren et al., 18 Gflop/s / $1 M 6200x cheaper Price/performance 2009 Hamada et al., 124 Mflop/s / $1 Performance 2010 Rahimian et al., 0.7 Pflop/s on Jaguar

35 14 Gordon Bell awards for N-body Performance 1992 Warren & Salmon, 5 Gflop/s Price/performance 1997 Warren et al., 18 Gflop/s / $1 M 34x more than Moore's law 6200x cheaper Price/performance 2009 Hamada et al., 124 Mflop/s / $1 Performance 2010 Rahimian et al., 0.7 Pflop/s on Jaguar

36 Largest simulation: 90 billion unknowns. Scale: 256 GPUs of the Lincoln cluster / 196,608 cores of Jaguar. Numerical engine: FMM (kernel-independent version, kifmm).

37 Most recent progress on FMM. The black-box fast multipole method, W. Fong, E. Darve, J. Comput. Phys., 228(23) (2009): separation of variables via Chebyshev interpolation of the kernel; M2L compressed with SVD; any non-oscillatory kernel. Generalized fast multipole method, P.-D. Létourneau, C. Cecka, E. Darve, IOP Conference Series: Materials Science and Engineering, 10(1) (2010): separation of variables via Laplace transform; diagonal M2L; any analytic kernel.
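A minimal numerical illustration (NumPy, a 1/r kernel between two hypothetical well-separated point clusters; not the bbFMM code itself) of why the M2L block compresses so well with an SVD:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two unit clusters of points whose centres are 5 units apart.
sources = rng.random((200, 3))
targets = rng.random((200, 3)) + np.array([5.0, 0.0, 0.0])

# Dense kernel block K[j, i] = 1 / |y_j - x_i| coupling the two clusters.
diff = targets[:, None, :] - sources[None, :, :]
K = 1.0 / np.linalg.norm(diff, axis=2)

# The singular values decay rapidly, so a low-rank factorization of this
# 200 x 200 block captures it to high accuracy with few terms.
s = np.linalg.svd(K, compute_uv=False)
rank = int(np.sum(s / s[0] > 1e-6))
print(f"numerical rank at 1e-6 relative tolerance: {rank} of {len(s)}")
```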

38 N-body simulation on GPU hardware: the algorithmic and hardware speed-ups multiply.

39 Early application of GPUs: direct N-body. 2007, Hamada & Iitaka, CUNbody: distributed source particles among thread blocks, requiring a reduction. 2007, Nyland et al., GPU Gems 3: target particles were distributed, no reduction necessary. 2008, Belleman et al., Kirin code. 2009, Gaburov et al., Sapporo code.

40 FMM on GPU 2008

41 FMM on GPU: multiplying speed-ups (repeated on slides 42-44 as a build). [Plot: time (s) vs. N for Direct (CPU), Direct (GPU), FMM (CPU), and FMM (GPU); L2-norm error (normalized): 10^-4.] Treecode and fast multipole method for N-body simulation with CUDA, R Yokota & L A Barba, Ch. 9 in GPU Computing Gems Emerald Edition, Elsevier/Morgan Kaufmann (2011)

45 New M2L optimizations for GPU. Optimizing the multipole-to-local operator in the fast multipole method for graphical processing units, T. Takahashi, C. Cecka, W. Fong, E. Darve, Int. J. Num. Meth. Eng., 89(1) (Jan. 2012): uses the structure of the FMM to increase the flop:byte ratio; the result is GPU compute-bound rather than memory-bound; full implementation of bbfmm on GPU; M2L achieves ~6x over a 12-core CPU implementation; ~3.2x total over a 12-core CPU implementation. Code available:

46 Advantage of N-body algorithms on GPUs: quantify using the Roofline Model, which shows hardware barriers ("ceilings") on a computational kernel. Components of performance: Computation, Communication, Locality.

47 Performance: Computation. Metric: Gflop/s (double / single precision). Peak achievable if: exploit FMA, etc.; non-divergence (GPU); intra-node parallelism explicit in the algorithm and explicit in the code. Source: ParLab, UC Berkeley

48 Performance: Communication. Metric: GB/s. Peak achievable if optimizations are explicit: prefetching, allocation/usage, stride streams, coalescing on GPU. Source: ParLab, UC Berkeley

49 Performance: Locality. Computation is free; maximize locality to minimize communication (communication has a lower bound). Minimize capacity misses: hardware aids (cache size), optimizations via software (blocking). Minimize conflict misses: associativities, padding. Source: ParLab, UC Berkeley

50 Roofline model. Roofline: An Insightful Visual Performance Model for Multicore Architectures, S. Williams, A. Waterman, D. Patterson, Communications of the ACM, April 2009. Operational intensity = total flop / total byte = Gflop/s / GB/s. [Roofline plot for the NVIDIA C2050 on a log/log scale, built up over slides 50-52: attainable flop/s (Gflop/s) vs. operational intensity (flop/byte), with the peak memory bandwidth slope and the peak floating-point ceilings (single-precision peak with SFU + FMA, and without SFU or FMA).]
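A small helper (Python; the peak numbers below are approximate assumptions for an NVIDIA C2050, not values read off the slide) that evaluates the roofline bound, attainable Gflop/s = min(peak Gflop/s, operational intensity × peak GB/s):

```python
def roofline(op_intensity, peak_gflops=1030.0, peak_gbs=144.0):
    """Attainable Gflop/s for a kernel under the roofline model.

    op_intensity: flop per byte moved to/from DRAM (flop/byte).
    peak_gflops, peak_gbs: assumed single-precision peak and memory
    bandwidth, roughly those of an NVIDIA C2050.
    """
    return min(peak_gflops, op_intensity * peak_gbs)

# A memory-bound kernel (SpMV-like, ~0.25 flop/byte) versus kernels with
# higher arithmetic intensity, such as the fast N-body kernels below.
for oi in (0.25, 4.0, 32.0):
    print(f"OI = {oi:5.2f} flop/byte -> {roofline(oi):7.1f} Gflop/s attainable")
```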

53 Advantage of N-body algorithms on GPUs. [Roofline plot for the NVIDIA C2050 placing kernels by operational intensity: SpMV, stencil and 3-D FFT sit under the memory-bandwidth slope, while the fast N-body cell-cell and particle-particle kernels reach the compute-bound region near the single-precision peak (with SFU + FMA).] Image: Hierarchical N-body simulations with auto-tuning for heterogeneous systems, Rio Yokota, L A Barba. Computing in Science and Engineering (CiSE), 3 January 2012, IEEE Computer Society, doi: /mcse

54 Scalability in many-GPU & many-CPU systems. Our own progress so far: 1) 1 billion unknowns on 512 GPUs (Degima); 2) 32 billion on 32,768 processors of Kraken; 3) 69 billion on 4096 GPUs of Tsubame 2.0, achieving 1 petaflop/s on a turbulence simulation.

55 Degima cluster, Nagasaki Advanced Computing Center

56

57 Lysozyme molecule: charges; mesh discretized with 102,486 boundary elements

58 1000 Lysozyme molecules. Largest calculation: 10,648 molecules, each discretized with 102,486 boundary elements; more than 20 million atoms; 1 billion unknowns; one minute per iteration on 512 GPUs of Degima.

59 Kraken: Cray XT5 system at NICS, Tennessee: 9,408 nodes with 12 CPU cores each and 16 GB memory; peak performance is 1.17 Petaflop/s. #11 in the Top500 (Jun '11 & Nov '11)

60 Weak scaling on Kraken: p = 3, N = 10^6 points per process; parallel efficiency 72% at 32,768 processors; largest run: 32 billion points; time to solution: < 40 s. [Stacked-bar plot of time (s) vs. number of processes, broken down into tree construction, MPI send P2P, MPI send M2L, and the P2P, P2M, M2M, M2L, L2L and L2P kernels.] Image: A tuned and scalable fast multipole method as a preeminent algorithm for exascale systems, Rio Yokota, L A Barba. Int. J. High-perf. Comput. (Jan. 2012)
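For reference, weak-scaling parallel efficiency is the single-process time divided by the time on P processes for fixed work per process; a short sketch (the timings are made-up examples, not the measured Kraken data):

```python
def weak_scaling_efficiency(t_one, t_p):
    """Parallel efficiency under weak scaling: E(P) = T(1) / T(P)."""
    return t_one / t_p

# Hypothetical numbers: 28.8 s on 1 process and 40 s on 32,768 processes
# would correspond to the ~72% efficiency quoted on this slide.
print(f"{weak_scaling_efficiency(28.8, 40.0):.0%}")
```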

61 Tsubame 2.0: 1,408 nodes with 12 CPU cores each, 3 NVIDIA M2050 GPUs, and 54 GB of RAM; a total of 4,224 GPUs; peak performance 2.4 Petaflop/s. #5 in the Top500 (Jun '11 & Nov '11)

62 Weak scaling on Tsubame: 4 million points per process; largest run ~9 billion points, 30 s on 2048 GPUs (slide 63 repeats the plot with this annotation). [Plot of time (sec) vs. number of processes (GPUs), broken down into tree construction, GPU communication, MPI communication, FMM evaluation and local evaluation.] Petascale turbulence simulation using a highly parallel fast multipole method, Rio Yokota, L A Barba, Tetsu Narumi, Kenji Yasuoka. Comput. Phys. Commun., under revision (minor). Preprint arXiv:

64 FMM vs. FFT, weak scaling. [Plot of parallel efficiency vs. number of processes, comparing the FMM with a spectral (FFT-based) method.]

65 Petascale turbulence simulation using the vortex method: 4096^3 grid, 69 billion points, 1 Pflop/s. Energy spectrum well-matched between the spectral method and the vortex method. [Plot of energy spectrum E(k) vs. wavenumber k for the two methods.] Petascale turbulence simulation using a highly parallel fast multipole method, Rio Yokota, L A Barba, Tetsu Narumi, Kenji Yasuoka. Comput. Phys. Commun., under revision (minor). Preprint arXiv:

66 To conclude: hierarchical N-body algorithms are well-suited for achieving exascale. The FMM is a particularly favorable algorithm for the emerging heterogeneous, many-core architectural landscape.

More information