Scientific Computing at Million-way Parallelism - Blue Gene/Q Early Science Program


Slide 1: Scientific Computing at Million-way Parallelism - Blue Gene/Q Early Science Program
Implementing Hybrid Parallelism in FLASH
Christopher Daley [1,2], Vitali Morozov [1], Dongwook Lee [2], Anshu Dubey [1,2], Jonathon Gallagher [2], Don Lamb [2], Klaus Weide [2]
[1] Argonne National Laboratory, [2] The Flash Center for Computational Science at the University of Chicago
July 10

Slide 2: Outline
1. Introduction
2. Multithreading FLASH
3. FLASH on BG/Q: RTFlame, White Dwarf
4. Optimizations
5. Conclusion

Slide 3: FLASH overview
- FLASH simulates problems from astrophysics, cosmology, HEDP, and incompressible fluid dynamics
- Collection of code units which a user assembles into a custom application
- Portable and scalable
- 1/2 million lines of code, written in Fortran 90 and C
- Parallelised with MPI and (recently) OpenMP
- Adaptive Mesh Refinement (AMR) with Paramesh or Chombo
- Parallel I/O with HDF5 or PnetCDF

Slide 4: Early science goals
- Improve understanding of Type Ia supernovae
- Key physical processes will be studied in various FLASH simulations
- The early science applications include the following two, as well as other applications with functionality in between:
  - RTFlame (flame in a rectilinear domain with constant gravity)
  - White Dwarf (full supernova problem)

Slide 5: BG/P to BG/Q transition
- Intrepid BG/P: 4 cores/node, 2 GB/node, 40,960 nodes
  - FLASH has been run on Intrepid for the last several years and scales to the whole machine
  - MPI-only is sufficient; run in VN mode (4 MPI ranks/node)
- Mira BG/Q: 4 hw threads/core, 16 cores/node, 16 GB/node, 49,152 nodes
  - The MPI-only approach is not suitable for BG/Q
  - OpenMP directives have been added to FLASH to take advantage of the additional intra-node parallelism

Slide 6: Paramesh
- There is an underlying mesh in every FLASH application
- We will use Paramesh during our early science time
- The mesh is divided into blocks of fixed size (typically 16^3 cells)
- The hierarchy of blocks is organized as an oct-tree (3D)
- Blocks are assigned to MPI ranks

Slide 7: MPI parallelism in Paramesh
- (Figure) The thick black lines show blocks assigned to a single MPI rank: 6 total blocks, 5 leaf blocks, 1 parent block
- FLASH solvers update the solution on local leaf blocks
- We use multiple threads to speed up the solution update (details in the next slides)

Slide 8: Multithreading strategy 1
- Assign different blocks to different threads
- Assuming 2 threads per MPI rank (figure):
  - Thread 0 (blue) updates 3 full blocks - 72 cells
  - Thread 1 (yellow) updates 2 full blocks - 48 cells
- This will be referred to as "thread block list" (see the sketch below)
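
A minimal Fortran/OpenMP sketch of the "thread block list" strategy follows; the kernel name update_leaf_block and the argument list are placeholders, not FLASH's actual solver interface. Whole leaf blocks owned by an MPI rank are distributed across its OpenMP threads:

  subroutine update_blocks_coarse(blockCount, blockList, dt)
     ! Sketch only: update_leaf_block is a hypothetical per-block kernel.
     implicit none
     integer, intent(in) :: blockCount
     integer, intent(in) :: blockList(blockCount)
     real,    intent(in) :: dt
     integer :: lb
     ! Each thread updates a disjoint subset of the rank-local leaf blocks.
     !$omp parallel do schedule(static)
     do lb = 1, blockCount
        call update_leaf_block(blockList(lb), dt)
     end do
     !$omp end parallel do
  end subroutine update_blocks_coarse

With whole blocks as the unit of work, a thread that draws fewer or cheaper blocks idles while the others finish; this is the load imbalance the finer-grained strategy on the next slide addresses.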

Slide 9: Multithreading strategy 2
- Assign different cells from the same block to different threads
- Assuming 2 threads per MPI rank (figure):
  - Thread 0 (blue) updates 5 partial blocks - 60 cells
  - Thread 1 (yellow) updates 5 partial blocks - 60 cells
- This will be referred to as "thread within block" (see the sketch below)
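
For comparison, a minimal sketch of the "thread within block" strategy; update_cell_plane and the block extent are placeholders. Blocks are processed one at a time and the threads share the cell updates of each block:

  subroutine update_blocks_fine(blockCount, blockList, dt)
     ! Sketch only: update_cell_plane is a hypothetical kernel that updates
     ! one k-plane of cells in the given block.
     implicit none
     integer, intent(in) :: blockCount
     integer, intent(in) :: blockList(blockCount)
     real,    intent(in) :: dt
     integer, parameter  :: NZB = 16   ! cells per block in z (typical 16^3 blocks)
     integer :: lb, k
     do lb = 1, blockCount
        ! Threads split the cells of the current block between them.
        !$omp parallel do schedule(static)
        do k = 1, NZB
           call update_cell_plane(blockList(lb), k, dt)
        end do
        !$omp end parallel do
     end do
  end subroutine update_blocks_fine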

Slide 10: RTFlame: Finding the best configuration
- Run a fixed problem on 128 nodes of BG/Q and find the fastest time to solution
- Vary the number of MPI ranks per node and OpenMP threads per MPI rank
- We want a configuration which gives:
  - A fast time to solution
  - Enough run-time flexibility in terms of memory per MPI rank
- The figure on the next slide shows the time to solution for the fastest multithreaded strategy only: thread within block (investigated in later slides)

Slides 11-13: RTFlame: FLASH performance on BG/Q (figure only: time to solution for the configurations described on slide 10)

Slide 14: RTFlame: FLASH performance on BG/Q summary
- Best performance: 32 MPI ranks/node, 2 threads/MPI rank
  - BG/Q to BG/P node-to-node ratio of 8.9x
  - Tricky to fit the application in 512 MB/MPI rank: the unsplit hydro solver is more memory-hungry than the split hydro solver used for science runs on BG/P in previous years
- Best compromise: 16 MPI ranks/node, 4 threads/MPI rank
  - BG/Q to BG/P node-to-node ratio of 8.6x
  - Comfortable to fit the application in 1 GB/MPI rank; buffers can be sized larger to accommodate rapid refinement of the problem and congregation of many tracer particles on some MPI ranks
- All subsequent results use this configuration
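
As an illustration only (not taken from the slides), the best-compromise configuration would typically be launched on a BG/Q system with a runjob invocation along these lines; the node count and executable name are placeholders:

  # 512 nodes x 16 ranks/node = 8192 MPI ranks, 4 OpenMP threads per rank (placeholder sizes)
  runjob --np 8192 --ranks-per-node 16 --envs OMP_NUM_THREADS=4 : ./flash4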

Slide 15: RTFlame: Strong scaling on BG/Q (figure only)

Slide 16: RTFlame: Strong scaling on BG/Q summary
- Good strong scaling for both multithreading strategies
- Block count/MPI rank is varied by a factor of 10
  - Only 3 blocks of 16^3 cells on some MPI ranks for the 4096 MPI rank (256 node) data point!
- Finer-grained threading performs slightly better
  - Better load balancing within an MPI rank
  - The performance advantage increases as work becomes more finely distributed

Slide 17: White Dwarf: Steps towards a successful run
- Difficulty running the full supernova problem on BG/Q
- Controlled FLASH aborts because of:
  - Imaginary sound speed in unsplit hydro
  - Non-convergence in the iterative equation of state (EOS)
  - No nice segfault to debug :(
- Pre-V1R1M1 driver upgrade: only able to run FLASH by turning off compiler optimization
- Post-V1R1M1 driver upgrade: able to run MPI+OpenMP FLASH with compiler optimization, but only by adopting a custom build strategy

Slide 18: White Dwarf: Custom build strategy
- The White Dwarf application aborts when compiling all source files with the OpenMP compiler option -qsmp=omp:noauto
  - The XLF OpenMP compiler option is known to introduce additional code transformations
- Adopt a build strategy where we use the OpenMP compiler option only if a file contains OpenMP (see the sketch after this list)
- The strategy initially failed when using the GNU compiler
  - Need to use the -frecursive option on files without OpenMP
  - Prevents the compiler from placing local arrays in static memory and keeps the application thread-safe
- We use -qnosave in case this happens with the XLF compiler too
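
A minimal GNU make sketch of the per-file flag selection is given below; the file lists, compiler wrapper name, and exact flag combinations are assumptions, since the real FLASH Makefile is generated by its setup script:

  # Hypothetical split of the source into files with and without OpenMP directives.
  OMP_SRC    = solver_with_omp.F90
  NOOMP_SRC  = physics_without_omp.F90

  FFLAGS_OMP   = -O3 -qsmp=omp:noauto     # only for files that contain OpenMP
  FFLAGS_NOOMP = -O3 -qnosave             # with gfortran use -O3 -frecursive instead

  # (Recipe lines must begin with a tab.)
  $(OMP_SRC:.F90=.o): %.o: %.F90
  	mpixlf90_r $(FFLAGS_OMP) -c $<
  $(NOOMP_SRC:.F90=.o): %.o: %.F90
  	mpixlf90_r $(FFLAGS_NOOMP) -c $<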

Slide 19: White Dwarf: Weak scaling on BG/Q (figure only)

Slide 20: White Dwarf: Weak scaling on BG/Q summary
- Good weak scaling on Intrepid BG/P (up to 131K MPI ranks) and Vesta BG/Q (up to 16K MPI ranks - the whole machine)
- It is hard to create perfect weak scaling adaptive grid tests: oscillations in evolution time happen because the work per MPI rank does not stay exactly constant
- BG/Q to BG/P node-to-node ratio between 7.4x and 7.9x
  - Less than the RTFlame node-to-node ratio
  - The White Dwarf application includes additional physics and so more guard cell fills per time step
  - Does not include multipole solver multithreading - removed to avoid a strange crash during access of threadprivate data

Slide 21: Optimizations
- Easy optimization opportunities found by collecting performance measurements with libmpihpm_smp.a
  - Supports sampling of the program counter through vprof: one sample every 0.01 seconds
- The 128-node RTFlame test problem originally took 202 seconds
- Many samples in the glibc log function
  - Linking against the MASS library reduced the time to 184 seconds
- Many samples on function call lines in unsplit hydro
  - Reordering arrays in unsplit hydro reduced the time to 156 seconds (example function call improvement on the next slide)
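
For reference, and as an assumption rather than something stated on the slides, redirecting libm calls such as log to MASS on Blue Gene/Q is normally just a link-line change:

  # Assumed link flags: scalar MASS plus its vector variant, ahead of the default libm.
  LIBS += -lmassv -lmass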

Slides 22-24: Unsplit hydro array layout optimization

Original call (source line 712): 479 counts (approximately 4.79 seconds)

  call hy_uhd_dataReconstOnestep(                         &
       leig(1:NDIM, i, j, k, 1:HY_WAVENUM, 1:HY_VARINUM), &
       reig(1:NDIM, i, j, k, 1:HY_VARINUM, 1:HY_WAVENUM))

Reordering the arrays so that i, j, k are the slowest-varying dimensions allows us to pass a memory address instead of creating a temporary array:

  call hy_uhd_dataReconstOnestep( &
       leig(1, 1, 1, i, j, k),    &
       reig(1, 1, 1, i, j, k))

Optimized call: 37 counts (approximately 0.37 seconds)
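
To make the mechanism concrete, the sketch below (array names, extents, and parameter values are illustrative, not FLASH's actual declarations) shows why the new ordering avoids the temporary: once i, j, k are the slowest-varying dimensions, all eigenvector data for one cell is contiguous, so passing the address of its first element hands the callee a contiguous array section; with the old ordering the section is strided and the compiler must build a copy-in temporary.

  ! Illustrative shapes only; parameter values are assumed, not FLASH's.
  integer, parameter :: NDIM = 3, NX = 16, NY = 16, NZ = 16
  integer, parameter :: HY_WAVENUM = 7, HY_VARINUM = 7
  ! Old layout: cell indices sit between the small eigenvector dimensions,
  ! so leig_old(1:NDIM, i, j, k, :, :) is strided and triggers a temporary.
  real :: leig_old(NDIM, NX, NY, NZ, HY_WAVENUM, HY_VARINUM)
  ! New layout: everything for cell (i,j,k) is contiguous, so the call can
  ! pass leig_new(1, 1, 1, i, j, k) as a starting address with no copy.
  real :: leig_new(NDIM, HY_WAVENUM, HY_VARINUM, NX, NY, NZ)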

Slide 25: Conclusion
- FLASH early science applications run correctly on BG/Q
- The best compromise between performance and usability is 16 MPI ranks/node and 4 threads/MPI rank
- Finer-grained multithreading performs slightly better
- A custom build process allows us to run the more complicated White Dwarf application
- Open question: what is -qsmp=omp:noauto doing to MPI-only code?

Slide 26: Any questions?

Slide 27: Backup slides - vary the block size in the RTFlame test problem
