Results from the Early Science High Speed Combustion and Detonation Project

1 Results from the Early Science High Speed Combustion and Detonation Project. Alexei Khokhlov, University of Chicago; Joanna Austin, University of Illinois; Charles Bacon, Argonne National Laboratory; Andrew Knisely, University of Illinois; Ben Clifford, Argonne National Laboratory; Joe Bernstein, Argonne National Laboratory

2 Overview: Science; Technology; Scaling challenges; Current timings

3 The Science: Direct numerical simulation of the deflagration-to-detonation transition (DDT) in hydrogen-oxygen gaseous mixtures for hydrogen safety, funded by ASCR and BES. The plan: shock bifurcation of a reflected shock; auto-ignition, strong and weak; 0.1 atm, 6-micron resolution of a 1 m, 2.5 cm x 2.5 cm pipe to model flame acceleration and predict run distance to detonation

4 Code structure: Physics modules and the initial/boundary conditions of a problem run on top of ALLA. ALLA: a Navier-Stokes fluid dynamics solver that runs on top of FTT. FTT: fully threaded tree library. The FTT library provides mesh, AMR, global parallel iterators, visualization, and I/O

5 Code features: 3-D reactive flow Navier-Stokes with 8-species, 19-reaction kinetics for H2-O2 burning; multi-species NASA7 equation of state; multi-species temperature-dependent viscosity, mass and heat conduction, and radiative cooling. Adaptive mesh refinement on a regular rectangular grid

6 Scaling challenges, BG/P: Moving to GPFS from Lustre required a different I/O strategy. Changed from one file per rank to a single file with MPI-I/O. This improved checkpoint time by 41x, mostly due to the reduction in metadata overhead (~28x faster). The remainder was due to MPI-I/O enforcing aligned writes, another ~1.5 times faster. Data acquired using the Darshan library
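The single-file layout with aligned writes described on this slide can be sketched as an offset calculation. This is a minimal illustration, not the project's code: the function name `aligned_offsets` and the block size are assumptions. In an MPI program the running sum would typically come from an exclusive prefix scan over ranks rather than a serial loop.

```python
def aligned_offsets(sizes, block=4 * 1024 * 1024):
    """Return (offsets, total) for writing one region per rank into one file.

    sizes: number of bytes each rank wants to write (hypothetical input).
    block: assumed file-system block size; every region starts on a
           multiple of it so no two ranks share a block.
    """
    offsets = []
    position = 0
    for size in sizes:
        offsets.append(position)
        # Round this rank's region up to a whole number of blocks.
        blocks_needed = -(-size // block)  # ceiling division
        position += blocks_needed * block
    return offsets, position


# Four ranks with different payload sizes, 4 KiB blocks for readability.
sizes = [10, 4096, 5000, 0]
offsets, total = aligned_offsets(sizes, block=4096)
print(offsets, total)
```

The padding wastes some space at the end of each region, which is the usual trade for avoiding block-level contention between writers.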

7 Scaling challenges, BG/P (II): Lower memory per node led to the use of OpenMP. The app couldn't run with 4 or 2 ranks per node, leading to three idle cores in SMP mode. The AMR code (FTT) executes work functions from the physics code (ALLA) using a global iterator. Putting OpenMP around the fluid dynamics work-function call results in a 3x speedup. This got us scaling up to 32 racks of BG/P
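The pattern on this slide, an iterator handing cells to a work function with threads wrapped around that call, can be sketched structurally in Python. A thread pool stands in for the OpenMP parallel loop, so this shows the shape of the change only and will not reproduce the reported speedup. `advance_cell` and `apply_work_function` are hypothetical names, and the work function is assumed to touch only its own cell.

```python
from concurrent.futures import ThreadPoolExecutor


def advance_cell(cell):
    # Hypothetical work function: a stand-in for the fluid dynamics update.
    return cell * 0.5 + 1.0


def apply_work_function(cells, work, n_threads=4):
    """Split the cell list into contiguous chunks, one per thread."""
    chunk = max(1, -(-len(cells) // n_threads))  # ceiling division, at least 1
    chunks = [cells[i:i + chunk] for i in range(0, len(cells), chunk)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # pool.map preserves chunk order, so the output order matches the input.
        parts = pool.map(lambda part: [work(c) for c in part], chunks)
        return [value for part in parts for value in part]


cells = [float(i) for i in range(10)]
updated = apply_work_function(cells, advance_cell)
```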

8 Reflected shock tube validation

9 Shock bifurcation angle: The agreement of these structures depends on 3D, an accurate equation of state, viscosity, and heat conduction. For instance, eliminating heat conduction changes angle 1 by 8 degrees and angle 2 by 4 degrees

10 Turbulence

11 Strong ignition

12 Weak ignition

13 Reducing communications overheads: At this point, the computational side of the code was scaling well, and the efficiency losses at higher rank counts were due to the AMR refinement and rebalance steps

14 Balance timer speedup: On 128K cores, balance decisions were taking 64 seconds/call due to serial repeated work; we got this under 1 second. Also, setting the ghost pattern after refinement was taking a long time; a rewrite to MPI one-sided communication sped this up to the point where mesh refinement was down to 10% of the cost of a run

15 Remaining challenges: The point-to-point ghost data exchanges now consume a lot of time, exhibiting a load imbalance between ranks. This needs to be addressed more thoroughly in the rebalance heuristic. BG/P had one thread/core; BG/Q goes up to four. A new strategy is needed for fine-grained OpenMP

16 Adding fine-grained OpenMP: Currently the code passes all cells through work functions, one after the other. In order to take advantage of caching, we will create work-function sets that will operate over the cells in cache at the time
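The reordering proposed on this slide amounts to swapping the loop nest: instead of sweeping every cell for each work function in turn, a block of cells is taken through the whole set of work functions before moving on. The sketch below assumes each work function depends only on the cell it is given, which real stencil-based updates do not satisfy without extra care. All names and the block size are illustrative.

```python
def run_one_function_at_a_time(cells, work_functions):
    """Current ordering: each work function sweeps over all cells."""
    for work in work_functions:
        cells = [work(c) for c in cells]
    return cells


def run_in_cache_sized_sets(cells, work_functions, block_size=4):
    """Proposed ordering: a block of cells goes through every work function
    while it is (notionally) still in cache."""
    out = []
    for start in range(0, len(cells), block_size):
        block = cells[start:start + block_size]
        for work in work_functions:
            block = [work(c) for c in block]
        out.extend(block)
    return out


# Hypothetical cell-local work functions.
work_functions = [lambda c: c + 1.0, lambda c: c * 2.0, lambda c: c - 3.0]
cells = [float(i) for i in range(10)]

baseline = run_one_function_at_a_time(cells, work_functions)
blocked = run_in_cache_sized_sets(cells, work_functions, block_size=4)
```

Both orderings give the same result for cell-local functions; the second touches each block once per pass instead of once per work function.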

17 Single-node scaling on Q. Thread count, Time per step, Efficiency: (68) 62 (64) (48) 34 (45) (40) 24 (27) (41) 11 (13). Parenthetical numbers come after increasing the size of the array of cells passed to the work functions; high rank counts were not getting enough work per thread from the original setting
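The slide does not say how efficiency was computed. A common definition for thread scaling at fixed problem size is the speedup over the single-thread time divided by the thread count, sketched below. The timings in the example are made-up placeholders, not values from the table.

```python
def parallel_efficiency(time_one_thread, time_n_threads, n_threads):
    """Strong-scaling efficiency: (T1 / Tn) / n, as a fraction of ideal."""
    speedup = time_one_thread / time_n_threads
    return speedup / n_threads


# Placeholder timings (seconds per step), chosen only to show the arithmetic.
eff = parallel_efficiency(time_one_thread=8.0, time_n_threads=2.5, n_threads=4)
print(f"efficiency = {eff:.0%}")
```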

18 20-step times (includes 5x refine/balance). BG/Q: Node count, Time, Efficiency. BG/P: Node count, Time, Efficiency

19 Main loop times: BG/P -> BG/Q speedup = 2.5x/core, 9.2x/node. BG/Q: Node count, Time, Efficiency. BG/P: Node count, Time, Efficiency

20 Scaling plots

21 Next step: DDT in a long pipe. Tube length ~ 1 meter; cross-section ~ 2.5 cm x 2.5 cm; N cells ~ 10,000,000,000; N time steps ~ 140,000; numerical resolution ~ 6 microns
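A rough check of these numbers shows why adaptive refinement matters here. The sketch assumes the 6-micron figure is the finest cell size and asks how many cells a uniform grid at that size would need for the whole tube; the comparison is illustrative and is not taken from the slides.

```python
# Figures quoted on the slide.
tube_length_m = 1.0
tube_width_m = 0.025
finest_cell_m = 6e-6
amr_cells = 1e10
time_steps = 140_000

# Assumption: a uniform grid would use the finest cell size everywhere.
cells_along = tube_length_m / finest_cell_m
cells_across = tube_width_m / finest_cell_m
uniform_cells = cells_along * cells_across * cells_across

amr_fraction = amr_cells / uniform_cells
cell_updates = amr_cells * time_steps

print(f"uniform grid: {uniform_cells:.2e} cells")
print(f"AMR grid is {amr_fraction:.2%} of that")
print(f"total cell updates: {cell_updates:.1e}")
```

Under this assumption a uniform grid would need a few trillion cells, so the quoted 10 billion cells corresponds to well under one percent of the uniform count.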
