Hybrid MPI + OpenMP Approach to Improve the Scalability of a Phase-Field-Crystal Code


Reuben D. Budiardja (reubendb@utk.edu)
ECSS Symposium, March 19th, 2013

Project Background
- Project team (University of Michigan): Katsuyo Thornton (P.I.), Victor Chan
- Phase-field-crystal (PFC) formulation to study the dynamics of various metal systems
- Original in-house code written in C++
- Has been run on 2D and 3D systems
- Each time step solves multiple Helmholtz equations, performs a reduction, then takes an explicit time step

Solving the Helmholtz Equations
- Equation to solve: $\nabla^2 H + k^2 H = 0$
- Originally used GMRES with the Algebraic Multigrid (AMG) preconditioner from HYPRE
- In 3D, the discretization matrix is large and may become indefinite, which makes it difficult to solve and requires many iterations
- Poor weak-scaling results
- Prohibitively long run time in the indefinite-matrix case
- Memory requirements increase with the number of iterations

Goal
- Scale to solve larger problems
  - Weak scaling: maintain the time-to-solution as the number of processes increases, with a fixed problem size per process
- Decrease the time to solution to 1 sec / time step
  - Strong scaling: decrease the time-to-solution as the number of processes increases, with a fixed total problem size
- Exploit other parallelism (with OpenMP?)
- Investigate a better preconditioner
- Use a different method (library?) to solve the equations
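The two scaling goals above can be turned into numbers with the usual efficiency definitions. The sketch below is only illustrative: the function names are made up for this example and are not part of the PFC code, and `t1` denotes the run time of whatever baseline run is chosen.

```cpp
#include <cassert>

// Strong scaling: the total problem size is fixed while the process count
// grows. The ideal time on p processes is t1 / p, so
// efficiency = t1 / (p * tp).
double strong_scaling_efficiency(double t1, int p, double tp) {
    assert(p > 0 && tp > 0.0);
    return t1 / (static_cast<double>(p) * tp);
}

// Weak scaling: the problem size per process is fixed, so the ideal time
// stays at t1 for any process count; efficiency = t1 / tp.
double weak_scaling_efficiency(double t1, double tp) {
    assert(tp > 0.0);
    return t1 / tp;
}

// Speedup of a run that takes tp relative to the baseline time t1.
double speedup(double t1, double tp) {
    assert(tp > 0.0);
    return t1 / tp;
}
```

With made-up timings, a baseline of 100 s that drops to 25 s on 8 processes has a speedup of 4 and a strong-scaling efficiency of 0.5.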

Complex Iterative Jacobi Solver
- Reference: Hadley, G. R., "A complex Jacobi iterative method for the indefinite Helmholtz equation," J. Comp. Phys. 203 (2005)
- Replaced HYPRE
- A modification of the standard Jacobi method
- $H^{n+1}$ is computed from $H^{n}$, $\Delta l_i$, and $\delta_i^2$, where $\delta_i^2$ is computed with a centered difference
- Easily parallelized, with a low memory requirement
- The convergence rate depends on resolution but is roughly constant from problem to problem, so a larger problem (with similar resolution) should not need more iterations
- A draft version was quickly implemented by the project team (Victor Chan) and tested
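To make the structure of a Jacobi sweep concrete, here is a minimal sketch of the standard (real-valued) Jacobi method, not Hadley's complex variant. It assumes a sign-definite 1D model problem, $-u'' + k^2 u = f$ with zero boundary values, chosen only because plain Jacobi converges for it. All names are hypothetical and none of this is taken from the PFC code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Largest absolute residual of the centered-difference system
//   (2 u[i] - u[i-1] - u[i+1]) / h^2 + k2 * u[i] = f[i],
// with zero Dirichlet values outside the array.
double max_residual(const std::vector<double>& u, const std::vector<double>& f,
                    double k2, double h) {
    const std::size_t n = u.size();
    double worst = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double left = (i > 0) ? u[i - 1] : 0.0;
        const double right = (i + 1 < n) ? u[i + 1] : 0.0;
        const double r = f[i] - ((2.0 * u[i] - left - right) / (h * h) + k2 * u[i]);
        worst = std::max(worst, std::fabs(r));
    }
    return worst;
}

// Standard Jacobi: each new value depends only on the previous iterate, so
// all cells can be updated independently. Returns the number of sweeps
// used, or -1 if the tolerance was not reached within max_sweeps.
int jacobi_solve(std::vector<double>& u, const std::vector<double>& f,
                 double k2, double h, double tol, int max_sweeps) {
    assert(u.size() == f.size());
    const std::size_t n = u.size();
    std::vector<double> next(n, 0.0);  // the only extra storage required
    const double diag = 2.0 + k2 * h * h;
    for (int sweep = 1; sweep <= max_sweeps; ++sweep) {
        for (std::size_t i = 0; i < n; ++i) {
            const double left = (i > 0) ? u[i - 1] : 0.0;
            const double right = (i + 1 < n) ? u[i + 1] : 0.0;
            next[i] = (h * h * f[i] + left + right) / diag;
        }
        std::swap(u, next);
        if (max_residual(u, f, k2, h) < tol) return sweep;
    }
    return -1;
}
```

The sweep needs one extra array and no global data structure, which is the property behind the "easily parallelized and low memory" point; the complex variant changes the update formula, not this overall shape.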

Profiling the Code with CrayPAT
- Measure before optimizing
- Can use sampling or tracing
- Using CrayPAT is simple: load the module, re-compile, build the instrumented code, re-run
- CrayPAT can trace only a specified group, e.g. mpi, io, heap, fftw, ...

    > module load perftools
    > make clean
    > make
    > pat_build -g mpi pfc_jacobi.exe
    > aprun -n 48 pfc_jacobi.exe+pat
    > pat_report -o profile.txt \
        <output_data>.xf

That should have worked!

CrayPAT Workaround
- Use the API for fine-grained instrumentation
- Add PAT_region_{begin/end} calls to most subroutines
- After narrowing down to a couple of major subroutines, split the labels into computation and communication
- The communication subroutines eventually dominate at a certain MPI size

    #include <pat_api.h>
    ...
    void Complex_Jacobi( ){
      ...
      int PAT_ID, ierr;

      // Region 41: ghost-cell exchange
      PAT_ID = 41;
      ierr = PAT_region_begin(PAT_ID, "communication");
      MPI_Internal_Communicate( );
      MPI_Boundary_Communicate( );
      ierr = PAT_region_end(PAT_ID);

      // Region 42: cell updates
      PAT_ID = 42;
      ierr = PAT_region_begin(PAT_ID, "computation");
      for (int i=1; i<size.l1+1; i++){
        for (int j=1; j<size.l2+1; j++){
          for (int k=1; k<size.l3+1; k++){
            residual(i,j,k)=(1.0/d)*(...);
          }
        }
      }
      ierr = PAT_region_end(PAT_ID);
      ...
    }

Cell Update and MPI Communication
[Figure: diagram labeled r, r+1, r+2, showing the steps repeated over iterations]
- Step n: compute differences and update cell values
- Step n: communicate the updated values to neighboring ghost cells (using MPI_Sendrecv( ))
- Step n+1: compute differences and update cell values

There is a fixed communication cost in every iteration. Can we hide it?

Hiding Communication Cost
[Figure: diagram labeled r, r+1, r+2, showing the steps repeated over iterations]
- Step n: compute differences and update cell values on surface cells. Post MPI_Irecv( ) for ghost cells.
- Step n: send the surface cell values with MPI_Isend( ). Compute differences and update cell values on inner cells.
- Step n+1: compute differences and update cell values on surface cells. Post MPI_Irecv( ) for ghost cells.

The communication cost is hidden as long as there is enough work to do while the communication happens.
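The reordering above only works because a Jacobi-style update reads the previous iterate, so surface cells and inner cells can be updated in either order with the same result. The sketch below shows that split on a 1D array with one ghost cell at each end. It is a serial stand-in: the MPI calls are marked in comments rather than executed, the `MPI_Waitall` placement is an assumption (the slides name only `MPI_Irecv` and `MPI_Isend`), and the update formula and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the real cell update: a centered average that reads only
// the previous iterate `old`.
inline double update_cell(const std::vector<double>& old, std::size_t i) {
    return 0.5 * (old[i - 1] + old[i + 1]);
}

// Reference: update all owned cells 1..n in one pass. Indices 0 and n+1
// are ghost cells holding the neighbors' surface values.
std::vector<double> full_sweep(const std::vector<double>& old) {
    assert(old.size() >= 3);
    const std::size_t n = old.size() - 2;
    std::vector<double> next(old);
    for (std::size_t i = 1; i <= n; ++i) next[i] = update_cell(old, i);
    return next;
}

// Same update, reordered so communication can overlap the inner cells.
std::vector<double> split_sweep(const std::vector<double>& old) {
    assert(old.size() >= 3);
    const std::size_t n = old.size() - 2;
    std::vector<double> next(old);
    // (MPI) post MPI_Irecv for the ghost cells here.
    // 1. Surface cells first: these are the values the neighbors need.
    next[1] = update_cell(old, 1);
    if (n > 1) next[n] = update_cell(old, n);
    // (MPI) start MPI_Isend of next[1] and next[n] here.
    // 2. Inner cells: this work proceeds while messages are in flight.
    for (std::size_t i = 2; i + 1 <= n; ++i) next[i] = update_cell(old, i);
    // (MPI) MPI_Waitall here, before the ghost cells are read again.
    return next;
}
```

In 3D the surface is a thin shell and the inner loop carries most of the work, which is what gives the messages time to complete; with very small subdomains per process the inner work shrinks and the overlap is lost.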

Different parallelisms with OpenMP

Data parallelism:
- Parallelize over cell updates for each Helmholtz equation
- Use the do/for directive
- Need to modify every loop with OpenMP directives
- May require a higher synchronization cost
- Only the master thread communicates

Task parallelism:
- Parallelize over the solving of the Helmholtz equations
- Use the section directive
- Only need to modify the main( ) routine that calls the solver
- May have load imbalance among threads
- Each thread communicates, which requires thread-safe MPI
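The difference between the two options shows up in where the directives go. The following sketch contrasts them on a toy update; `relax` and the solver functions are hypothetical stand-ins, not the PFC solver, and no MPI communication is included. Compiled without OpenMP support the pragmas are ignored and the code runs serially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for one cell update of one Helmholtz solve (hypothetical).
inline double relax(double value, double shift) { return 0.5 * value + shift; }

// Data parallelism: the threads share the cells of a single equation.
// Every loop inside the solver needs a directive like this one.
void solve_data_parallel(std::vector<double>& field, double shift) {
    const long n = static_cast<long>(field.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        field[i] = relax(field[i], shift);
    }
}

// Unmodified serial solver, as used by the task-parallel version.
void solve_serial(std::vector<double>& field, double shift) {
    for (std::size_t i = 0; i < field.size(); ++i) {
        field[i] = relax(field[i], shift);
    }
}

// Task parallelism: each section solves a different equation on its own
// data, so only the calling routine changes.
void solve_two_equations_task_parallel(std::vector<double>& a,
                                       std::vector<double>& b) {
    #pragma omp parallel sections
    {
        #pragma omp section
        { solve_serial(a, 1.0); }
        #pragma omp section
        { solve_serial(b, 2.0); }
    }
}
```

The task-parallel version can use at most as many threads as there are equations, and if one equation needs more iterations than another the faster thread idles, which is the load-imbalance risk listed above.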

MPI Threading Support
The MPI-2 standard defines four levels of thread support:

Model      | Description                                       | Advantage                                        | Disadvantages
Single     | Only one thread allowed                           | Portable: every MPI implementation supports this |
Funneled   | Only the master thread makes MPI calls            | Simpler to program                               | Limited flexibility; manager thread could get overloaded
Serialized | All threads can make MPI calls, but one at a time | Freedom to communicate                           | Risk of too much cross-communication
Multiple   | No restriction                                    | Completely thread safe                           | Limited availability

- Our OpenMP implementation requires the "multiple" level of thread support
- On Kraken, we need to set the environment variable MPICH_MAX_THREAD_SAFETY=multiple

This Should Have Worked...

    ierr = MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &MPI_Thread_Provided);
    // check that MPI_Thread_Provided == MPI_THREAD_MULTIPLE

    #pragma omp parallel private(alpha, beta, w1, fit, bvalue, tag)
    {
      #pragma omp sections
      {
        #pragma omp section
        {
          tag = 1;
          Complex_Jacobi(..., tag, MPI_CART_COMM);
        }
        #pragma omp section
        {
          tag = 2;
          Complex_Jacobi(..., tag, MPI_CART_COMM);
        }
      }
    }

But this produced an MPI error on Kraken...

Workaround for MPI Thread Issue: Create a Separate MPI Communicator for Each Thread

    ierr = MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &MPI_Thread_Provided);
    // check that MPI_Thread_Provided == MPI_THREAD_MULTIPLE

    // One duplicate communicator per OpenMP section
    ierr = MPI_Comm_dup(MPI_CART_COMM, &MPI_CART_COMM_1);
    ierr = MPI_Comm_dup(MPI_CART_COMM, &MPI_CART_COMM_2);

    #pragma omp parallel private(alpha, beta, w1, fit, bvalue, tag)
    {
      #pragma omp sections
      {
        #pragma omp section
        {
          Complex_Jacobi(..., MPI_CART_COMM_1);
        }
        #pragma omp section
        {
          Complex_Jacobi(..., MPI_CART_COMM_2);
        }
      }
    }

Other Minor Optimizations
- Only check for convergence (which requires an MPI reduction) every tenth iteration
- On Kraken, sacrificing one core (5 threads per socket) may in fact be beneficial (no NUMA effect, more memory bandwidth)
- Use -O2 instead of -O3 (-O3 is not always better)
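The first optimization trades a few extra sweeps for far fewer global reductions. The sketch below illustrates the trade-off with a toy iteration whose residual halves on every sweep. The names and the halving model are assumptions made for this example, and the reduction itself is represented only by a counter.

```cpp
#include <cassert>

struct SolveStats {
    int iterations;          // sweeps actually performed
    int convergence_checks;  // times the global residual was evaluated
};

// Toy iteration standing in for the Jacobi sweep: the residual halves on
// each sweep. In the MPI code, every convergence check would cost one
// global reduction across all processes.
SolveStats iterate_with_sparse_checks(double tol, int check_interval,
                                      int max_iterations) {
    assert(check_interval > 0);
    SolveStats stats{0, 0};
    double residual = 1.0;
    while (stats.iterations < max_iterations) {
        residual *= 0.5;  // one sweep
        ++stats.iterations;
        if (stats.iterations % check_interval == 0) {
            ++stats.convergence_checks;  // the reduction would happen here
            if (residual < tol) break;
        }
    }
    return stats;
}
```

With a tolerance of 1e-3, checking every sweep stops after 10 sweeps and 10 checks, while checking every 7th sweep stops after 14 sweeps and only 2 checks. The overshoot is at most one check interval minus one sweep, which is cheap when a sweep costs less than a reduction at scale.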

Results
[Figure: strong-scaling and efficiency plot for system; annotation on plot: ~8X]

Conclusion
- HYPRE was replaced by an in-house complex iterative Jacobi solver
- MPI parallelization with non-blocking communication for domain decomposition
- OpenMP task parallelism over the multiple Helmholtz equations
- Result: a scalable PFC code (weak and strong scaling), a major improvement over the original code

Future work:
- Simulate larger systems
- Implement OpenMP data parallelism (nested inside the task parallelism)

Lessons learned:
- Need a better collaborative framework for ECSS (e.g. wiki, code revision control such as git or subversion)
- Estimating a reasonable goal can be difficult
- A custom solver may give better flexibility, but has a development cost
- Bugs can come from unexpected places
