Scalability of Uintah Past Present and Future



1 Scalability of Uintah: Past, Present and Future
Martin Berzins, Qingyu Meng, John Schmidt, Alan Humphrey, Justin Luitjens.
Acknowledgments: DOE for funding the CSAFE project (97-10), DOE NETL, DOE NNSA; NSF for funding via SDCI and PetaApps; INCITE; XSEDE.
How do we get software frameworks for complex multiscale multiphysics problems such as Uintah to scale on machines such as Blue Waters, Sequoia, Titan and the next two generations of machines? Directed acyclic graph (DAG) approaches are seen as a key technology.
"Exascale programming will require prioritization of critical-path and noncritical path tasks, adaptive directed acyclic graph scheduling of critical-..." (Exascale Report)

2 Directed Acyclic Graph Based Applications/Languages
Charm++: object-based virtualization (NAMD etc.).
Intel CnC: new language for graph-based parallelism.
PLASMA (Dongarra): DAG-based parallel linear algebra software.
Uintah software.
[Figure: example task graph with nodes labeled 1:1 through 3:4.]

3 Uintah Applications
Explosions, angiogenesis, micropin flow, industrial flares, sandstone compaction, foam compaction, carbon capture and cleanup, shaped charges.

4 Uintah Patch and Variables
Structured grid variables (for flows): cell centered, node centered, face centered.
Unstructured points (for solids): particles.
ICE is a cell-centered finite volume method for the Navier-Stokes equations.
MPM is a novel method that uses particles and nodes. It exchanges data with ICE, not just boundary conditions.
ARCHES is a combustion code using several different radiation models and linear solvers.


5 Uintah Software Parallelism
AMR meshes: structured grid + unstructured points, patch-based domain decomposition, adaptive mesh refinement.
Dynamic load balancing: profiling + forecasting model, parallel space-filling curves.
Works at the MPI and/or thread level.
Uses an asynchronous task directed-graph approach to specify the relationships between tasks, not when they will compute.
[Figure: task graph for the Material Point Method.]

6 Uintah uses a combination of compile-time and run-time techniques
XML input, task compilation (when needed), run time (each timestep), parallel I/O.

7 Uintah Task Graph
Example: single-level ICE task graph on 4 patches.
Compiled from detailed tasks (on local and neighboring patches).
Intermediate representation for the Uintah runtime system.
Internal dependencies -> edges. External dependencies -> MPI message tags.
The variable dictionary is held in the data warehouse and stored locally, not globally.
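The split between internal edges and external messages can be illustrated with a small sketch. This is not Uintah's implementation: the `Task` record, the `compile_task_graph` function, the 1D row of four patches, and the tuple used as a stand-in for an MPI message tag are all hypothetical names chosen for illustration.

```python
from collections import namedtuple

# Hypothetical task record: `requires` lists (variable, patch) pairs.
Task = namedtuple("Task", ["name", "patch", "computes", "requires"])

def compile_task_graph(tasks, owner, my_rank):
    """Split the dependencies of locally owned tasks into edges and messages."""
    # Map each (variable, patch) pair to the task that produces it.
    producer = {(var, t.patch): t for t in tasks for var in t.computes}
    edges, messages = [], []
    for t in tasks:
        if owner[t.patch] != my_rank:
            continue  # only tasks on local patches are scheduled on this rank
        for var, src_patch in t.requires:
            src = producer[(var, src_patch)]
            if owner[src_patch] == my_rank:
                # Internal dependency: an edge in the local graph.
                edges.append(((src.name, src.patch), (t.name, t.patch)))
            else:
                # External dependency: stand-in for an MPI message tag.
                messages.append((var, src_patch, t.patch))
    return edges, messages

def build_example():
    """Four patches in a row; 'advance' needs 'u' from itself and its neighbors."""
    patches = range(4)
    tasks = [Task("compute_u", p, ["u"], []) for p in patches]
    for p in patches:
        neighbors = [q for q in (p - 1, p, p + 1) if 0 <= q < 4]
        tasks.append(Task("advance", p, ["u_new"], [("u", q) for q in neighbors]))
    owner = {0: 0, 1: 0, 2: 1, 3: 1}  # patch -> rank (assumed layout)
    return tasks, owner

tasks, owner = build_example()
edges, messages = compile_task_graph(tasks, owner, my_rank=0)
```

On rank 0 only the dependency of patch 1 on patch 2 crosses a rank boundary, so it is the only one recorded as a message; everything else stays a local edge.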


8 Scalability of Uintah DAG Framework
This is what scalability with a large software framework such as Uintah feels like.
The Harry Potter approach to scalability??? You only need a few very smart young (computer science know what impossible means.
Steve Parker (designer of Uintah), Justin Luitjens, Qingyu Meng, Alan Humphrey, plus a strong supporting cast.

9 The DAG Approach is not a silver bullet
Uintah Phase: overlap communications with computation. Static task graph execution, standard data structures, one MPI process per core. No AMR.
Uintah Phase: improved fast data structures and load balancer. AMR to 12K cores, then 100K cores using a new approach before data structures cause problems. Out-of-order and dynamic task execution.
Uintah Phase: hybrid model. Threaded runtime system on node. One MPI process and one data warehouse per node. Multiple cores and GPUs grab tasks as needed. Fast scalable use of hypre for linear equations.

10 Weak and Strong Scalability
A problem of size n on p cores takes time $T(n,p)$.
Strong scalability: $T(n,p) = T(n,1)/p$. Try to solve the same problem p times more quickly on p cores.
Weak scalability: $T(np,p) = T(n,1)$. Solve a problem that is p times as large in the same time on p cores.
Theorem [Tirado + Martin, 1998]: both weak and strong scalability hold only if the complexity is linear, $T(n,1) \propto n$.
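A small numerical sketch of these definitions follows. The cost models `linear_time` and `quadratic_time` are illustrative assumptions (perfectly divisible work, no communication cost), not measurements from Uintah; they only show why a superlinear serial cost can strong scale yet fail to weak scale.

```python
def strong_efficiency(time_fn, n, p):
    """T(n,1) / (p * T(n,p)); equals 1.0 for ideal strong scaling."""
    return time_fn(n, 1) / (p * time_fn(n, p))

def weak_efficiency(time_fn, n, p):
    """T(n,1) / T(n*p,p); equals 1.0 for ideal weak scaling."""
    return time_fn(n, 1) / time_fn(n * p, p)

def linear_time(n, p, alpha=1.0):
    # Assumed model: linear serial complexity, work split evenly over p cores.
    return alpha * n / p

def quadratic_time(n, p, alpha=1.0):
    # Assumed model: quadratic serial complexity, work split evenly over p cores.
    return alpha * n * n / p

n, p = 1000, 8
linear_strong = strong_efficiency(linear_time, n, p)
linear_weak = weak_efficiency(linear_time, n, p)
quadratic_strong = strong_efficiency(quadratic_time, n, p)
quadratic_weak = weak_efficiency(quadratic_time, n, p)
```

Under these assumptions the linear model keeps both efficiencies at 1, while the quadratic model keeps strong efficiency at 1 but its weak efficiency falls to 1/p, because the problem that is p times larger costs p squared times more work.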


11 Scalable Tiled AMR Regridding Algorithm [Analysis in Concurrency]
Tiled algorithm: tiles that contain flags become patches.
Simple and easy to parallelize.
Semi-regular patch sets that can be exploited. Example: neighbor finding.
Each core processes a subset of the refinement flags in parallel, which helps produce the global patch set.
[Figure: GBRv1 vs. Tiled.]
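The tiled idea can be sketched in a few lines. This is a minimal illustration, not the Uintah regridder: the function names, the 2D cell indices, and the round-robin split of flags across "cores" are assumptions, and the per-core work is run serially here rather than in parallel.

```python
def tiles_with_flags(flags, tile_size):
    """Return the set of tile indices that contain at least one flagged cell."""
    return {tuple(c // tile_size for c in cell) for cell in flags}

def tiles_to_patches(tiles, tile_size):
    """Turn tile indices into (low, high) cell-index boxes, sorted for stability."""
    return sorted(
        (tuple(i * tile_size for i in t), tuple((i + 1) * tile_size for i in t))
        for t in tiles
    )

def tiled_regrid(flags, tile_size, num_cores=1):
    """Each 'core' handles a subset of the flags; the union is the global patch set."""
    chunks = [flags[r::num_cores] for r in range(num_cores)]
    local_sets = [tiles_with_flags(chunk, tile_size) for chunk in chunks]
    global_tiles = set().union(*local_sets)
    return tiles_to_patches(global_tiles, tile_size)

flags = [(0, 0), (1, 1), (5, 2), (9, 9)]  # hypothetical refinement flags
patches_serial = tiled_regrid(flags, tile_size=4, num_cores=1)
patches_split = tiled_regrid(flags, tile_size=4, num_cores=3)
```

Because each flag maps to its tile independently of every other flag, splitting the flags across cores and taking the union gives the same patch set as a serial pass, which is what makes the approach easy to parallelize.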

12 Strong Scaling of AMR Algorithms
With the problem size fixed, increasing the number of cores should lead to decreasing execution time. GBR is Global Berger-Rigoutsos.
Plot panels: no strong scaling; strong scaling.
$T_{\mathrm{GBRv1}} = c_1 \,(\mathrm{Flags}/\mathrm{Proc}) \log(n_{\mathrm{patches}}) + c_2 \, M(\mathrm{GBRv1})$
$T_{\mathrm{Tiled}} = c_5 \, N_{\mathrm{meshcells}} / \mathrm{Proc}$
$T_{\mathrm{LBR}}$: GBR on a compute node.
Dots are data and lines are models.


13 Example: AMR MPMICE
A PBX explosive flow quickly pushing a piece of its metal container.
[Figures: flow velocity and particle volume; computational grids and particles.]
Grid variables: fixed number per patch, relatively easy to balance.
Particle variables: variable number per patch, hard to load balance.

14 Uintah Scalability (Original MPI-only)
Poor scalability up to 98K cores (Kraken, NICS).
Test problems: two flows transporting; one flow with particles moving.
Issues: out of memory with 98K cores; load imbalance in AMR MPMICE scaling.
Solution: new runtime system with hybrid thread/MPI.

15 Hybrid threaded CPU-GPU scheduler
One data warehouse per node.
No on-node MPI.
Better load balance per node (more tasks).
Lower MPI reduction time.
Atomic operations for faster lock-free memory access.
Needs thread-safe code.
Originally one control core; now decentralized: GPUs and CPUs pull work.
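The decentralized "workers pull work" pattern can be sketched with Python's standard `queue` and `threading` modules. This is only an illustration of the pattern: the `run_tasks` function and the toy tasks are hypothetical, and the sketch uses a lock-protected queue and list rather than the atomic, lock-free structures the slide mentions.

```python
import queue
import threading

def run_tasks(tasks, num_workers=4):
    """Run (name, callable) tasks; each worker pulls the next ready task itself."""
    ready = queue.Queue()
    for task in tasks:
        ready.put(task)
    done = []
    done_lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                name, fn = ready.get_nowait()  # pull work; no central dispatcher
            except queue.Empty:
                return  # nothing left to do
            result = fn()
            with done_lock:
                done.append((name, result, worker_id))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return done

# Toy tasks: task i computes i squared.
toy_tasks = [(f"task{i}", (lambda i=i: i * i)) for i in range(8)]
finished = run_tasks(toy_tasks)
```

Which worker runs which task varies from run to run, but every task is executed exactly once, and a worker that finishes early simply pulls more work.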

16 New Hybrid Model Memory Savings: Ghost Cells
Comparison of MPI-only and Thread/MPI storage (figure legend: local patch, ghost cells).
Raw data: 49152 doubles (MPI) vs. 31360 doubles (Thread/MPI).
MPI buffer: values missing from the source.
Total: 75K doubles (MPI) vs. 40K doubles (Thread/MPI).
(Example on Kraken, 12 cores/node; more cores, more savings.)
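The raw data counts of 49152 and 31360 doubles can be reproduced with a short calculation. The patch size (12 cells per side), the ghost width (2 cells), and the 2 x 2 x 3 arrangement of the 12 patches on a node are assumptions: the slide does not state them, and they were picked here because they match both figures.

```python
def cells_with_ghosts(extent, ghost):
    """Number of cells in a box of the given extent plus a ghost layer on every side."""
    total = 1
    for n in extent:
        total *= n + 2 * ghost
    return total

PATCH = (12, 12, 12)  # assumed cells per patch
GHOST = 2             # assumed ghost-layer width
LAYOUT = (2, 2, 3)    # assumed arrangement of the 12 patches on one node

patches_per_node = LAYOUT[0] * LAYOUT[1] * LAYOUT[2]

# MPI-only: every patch belongs to its own process and keeps its own ghost cells.
mpi_only = patches_per_node * cells_with_ghosts(PATCH, GHOST)

# Thread/MPI: patches on a node share memory, so ghost cells are only
# needed around the outside of the node's block of patches.
node_extent = tuple(n * k for n, k in zip(PATCH, LAYOUT))
hybrid = cells_with_ghosts(node_extent, GHOST)
```

Under these assumptions the saving comes entirely from the interior patch faces, where neighboring patches on the same node no longer duplicate each other's cells.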

17 Memory Savings
Global metadata copies: 60 bytes or 7.5 doubles per patch. One copy per core vs. one copy per node.
MPI library buffer overhead.
Results (cores vs. percent; the core counts are missing from the source): 61%, 47%, 36%, 27%, 18%, 11%.
AMR ICE: simulation of the transport of two fluids with a prescribed initial velocity of Mach two. 435 million cells, strong scaling runs on Kraken.

18 Particle Reallocations
Move particles to new patches after each timestep.
Crossing a patch boundary: just re-indexing.
Crossing a processor boundary: MPI scatter record (expensive!).
Thread/MPI: fewer particles cross a processor boundary.
[Figure: patch boundaries and processor boundaries, MPI vs. Thread/MPI.]
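A toy count shows why grouping patches under one process per node reduces the expensive case. The `classify_moves` function, the list of particle moves, and the two ownership maps (one patch per process for MPI-only, four patches per process for Thread/MPI) are hypothetical choices for illustration.

```python
def classify_moves(moves, owner):
    """Count particle moves by cost; moves is a list of (old_patch, new_patch)."""
    counts = {"stay": 0, "reindex": 0, "mpi": 0}
    for old_patch, new_patch in moves:
        if old_patch == new_patch:
            counts["stay"] += 1      # particle did not leave its patch
        elif owner(old_patch) == owner(new_patch):
            counts["reindex"] += 1   # new patch, same process: cheap re-indexing
        else:
            counts["mpi"] += 1       # new process: particle must be sent via MPI
    return counts

# Hypothetical particle moves along a row of 8 patches.
moves = [(0, 0), (0, 1), (1, 2), (3, 4), (5, 4), (7, 7)]

mpi_only_counts = classify_moves(moves, owner=lambda patch: patch)      # 1 patch per process
hybrid_counts = classify_moves(moves, owner=lambda patch: patch // 4)   # 4 patches per process
```

With one patch per process every patch crossing is an MPI transfer; with four patches per process only the move between patches 3 and 4 crosses a process boundary.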

19 Results: Scalability on Jaguar
One flow with particles moving: 3-level AMR MPMICE.
Two flows transporting: 3-level AMR ICE.
Using one control core per node scales up to 196K cores.

20 OLD Scalability on Titan
Scaling breakdown.
One flow with particles moving, 3-level AMR MPMICE.
50% efficiency at 256K cores vs. 16K cores.
Distributed controller.

21 ICSE Carbon Dioxide Cleanup
Red is CO2.
Turbulent flow problem: need to quantify the uncertainty in the simulation to estimate how much CO2 is removed.
Need at least 100K cores to resolve the problem scales.

22 Weak Scalability with hypre on turbulent combustion problems
Uintah on implicit problems using the hypre solver: 4 months' work with the hypre PFMG preconditioner and CG solver. Hypre setup is called only once and the workspace is carried by Uintah.
Good, approximately log(c), weak scaling to 98K cores on the turbulent combustion Helium Plume problem with Red-Black Gauss-Seidel on Kraken XT5, where c = cores.
Jacobi iteration on Jaguar XK6 has weak scaling.
The Kraken case has 2x the Jaguar unknowns per core.

23 DESIGNING FOR EXASCALE
Clear trend towards accelerators, e.g. GPU but also Intel MIC (new NSF ... 15PF).
Balance factor = flops/bandwidth: high. ...-based codes: 2x over multicore CPU estimated and achieved for ICE. Similar results by others.
Network and memory performance are growing more slowly than CPU/GPU performance.
GPU performance of the ray-tracing radiation method is 100x CPU. See his talk this week.
Comparison hardware: NVIDIA 2050/2070 vs. Intel Xeon 2x4 cores.
Overlapping and hiding communications is essential. New MPI non-blocking collectives will help; new algorithms are needed.


24 An Exascale Problem: Design of Alstom Clean Coal Boilers
Professor Phil Smith, ICSE, Utah.
[Figure: temperature field for a 350 MWe boiler problem.]
LES resolution needed: 1 mm per side for each computational volume = $9 \times 10^{12}$ cells.
Based on initial runs, running in 48 hours of wall-clock time requires M fast cores.

25 Summary
Scalability is still a challenge, even with the DAG approach, which does work amazingly well.
Uintah hybrid parallelism: domain decomposition + task graph.
MPI level: load balance based on patches.
Multi-threaded task graph execution locally: concurrent data warehouse, thread task queue controller.
Memory: reduced overhead of global metadata and ghost cells. Memory savings: 40% to 90%.
Scaling: no in-node MPI, better load balancing. AMR MPMICE is 2x to 4x faster when running with 12K+ cores. Allows Uintah to scale to 256K cores even for fluid-structure calculations.
GPU development is ongoing.
The approach used here shows promise for very large core and GPU counts.
NSF-funded work will stop in September 2013.
