The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

1 The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System
Alan Humphrey, Qingyu Meng, Martin Berzins
Scientific Computing and Imaging Institute & University of Utah

I. Uintah Overview
II. Emergence of Heterogeneous Systems
III. Unified Scheduler and Runtime Design
IV. Computational Experiments & Results
V. Future Work and Questions

Thanks to:
  John Schmidt and J. Davison de St. Germain, SCI Institute
  Justin Luitjens and Steve Parker, NVIDIA
  DoE for funding the CSAFE project, DOE NETL, DOE NNSA, INCITE
  NSF for funding via SDCI and PetaApps
  Keeneland Computing Facility, supported by NSF under Contract OCI
  Oak Ridge Leadership Computing Facility for access to TitanDev

2 Uintah Overview
  Parallel, adaptive multi-physics framework
  Fluid-structure interaction problems
  Patch-based AMR using particles and mesh-based fluid-solve
  Example applications: Plume Fires, Explosions, Angiogenesis, Foam Compaction, Sandstone Compaction, Industrial Flares, Shaped Charges, Virtual Soldier

3 Uintah - Scalability
  Patch-based domain decomposition
  Asynchronous task-based paradigm
  256K cores on Jaguar XK6
  95% weak scaling efficiency & 60% strong scaling efficiency
  Multi-threaded MPI, shared memory model on-node [1]
  Scalable, efficient, lock-free data structures [2]

[1] Q. Meng, M. Berzins, and J. Schmidt. Using Hybrid Parallelism to Improve Memory Use in the Uintah Framework. In Proc. of the 2011 TeraGrid Conference (TG11), Salt Lake City, Utah.
[2] Q. Meng and M. Berzins. Scalable Large-scale Fluid-structure Interaction Solvers in the Uintah Framework via Hybrid Task-based Parallelism Algorithms. Concurrency and Computation: Practice and Experience, 2012, submitted.
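The weak and strong scaling efficiencies quoted on this slide are conventionally derived from wall-clock timings. The sketch below shows those conventional definitions only; the function names and all timings are hypothetical and are not Uintah measurements.

```python
def strong_scaling_efficiency(t_base, n_base, t_n, n):
    """Fixed total problem size: the ideal run time shrinks as n_base / n."""
    return (t_base * n_base) / (t_n * n)


def weak_scaling_efficiency(t_base, t_n):
    """Fixed problem size per core: the ideal run time stays constant."""
    return t_base / t_n


# Hypothetical timings in seconds, chosen only to exercise the formulas.
strong = strong_scaling_efficiency(t_base=100.0, n_base=1024, t_n=2.0, n=65536)
weak = weak_scaling_efficiency(t_base=10.0, t_n=12.5)
print(f"strong: {strong:.3f}, weak: {weak:.3f}")
```

An efficiency of 1.0 means ideal scaling in both cases; values below 1.0 measure the cost of communication, load imbalance, and runtime overhead.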

4 Exascale Problem
Design of Alstom Clean Coal Boilers
  LES resolution needed for 350MW boiler problem
  1mm per side for each computational volume = 9 x 10^12 cells
  Based on initial runs, simulating the problem in 48 hours of wall clock time requires million fast cores
  Professor Phil Smith, ICSE, Utah
  [Figure: O2 concentrations in a clean coal boiler]
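The cell count on this slide can be turned into a rough sense of scale with a few lines of arithmetic. This is a back-of-the-envelope sketch: the 8 bytes per cell (one double-precision value per field) is an assumption added here, not a figure from the slides.

```python
CELLS = 9 * 10**12        # cell count quoted on the slide
CELL_EDGE_MM = 1          # 1 mm per side
MM3_PER_M3 = 1000**3      # cubic millimetres in one cubic metre
BYTES_PER_VALUE = 8       # assumption: one double-precision value per cell

# Physical volume resolved by the mesh.
resolved_volume_m3 = CELLS * CELL_EDGE_MM**3 // MM3_PER_M3

# Memory needed to hold a single scalar field over the whole mesh.
bytes_per_field = CELLS * BYTES_PER_VALUE

print(resolved_volume_m3, "m^3 resolved")
print(bytes_per_field // 10**12, "TB per scalar field")
```

Under that assumption every additional scalar field costs tens of terabytes, which is one way to see why the problem is framed as an exascale one.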

5 Emergence of Heterogeneous Systems
Nvidia M2070/90 Tesla GPU + multi-core CPU
  Keeneland Initial Delivery System: 360 GPUs
  NSF Keeneland Full Scale System: 792 GPUs
  DoE Titan: 18,688 GPUs
Motivation - Accelerate Uintah Components
  Utilize all on-node computational resources
  Uintah's asynchronous task-based approach is well suited to take advantage of GPUs
  Natural progression: GPU tasks

6 When extending a general computational framework with over 700K lines of code to GPUs, where to start? Uintah's asynchronous task-based approach makes this surprisingly manageable.

7 NVIDIA Fermi Overview
  Host memory to device memory is max 8GB/sec
  Device memory to cores is 144GB/sec
  Memory-bound applications must hide PCIe latency
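The gap between the two bandwidths is easier to appreciate with a worked example. The sketch below uses the two peak figures from the slide; the patch size (128^3 cells, one double per cell) and the reading of GB as 10^9 bytes are assumptions made for illustration.

```python
PCIE_GB_PER_S = 8.0       # host memory -> device memory, peak (from the slide)
DEVICE_GB_PER_S = 144.0   # device memory -> cores (from the slide)


def transfer_seconds(n_bytes, gb_per_s):
    """Time to move n_bytes at a sustained rate of gb_per_s (1 GB = 1e9 bytes)."""
    return n_bytes / (gb_per_s * 1e9)


# Hypothetical patch: 128^3 cells, one 8-byte double per cell.
patch_bytes = 128**3 * 8
t_pcie = transfer_seconds(patch_bytes, PCIE_GB_PER_S)
t_device = transfer_seconds(patch_bytes, DEVICE_GB_PER_S)
ratio = t_pcie / t_device

print(f"PCIe: {t_pcie * 1e3:.3f} ms, device: {t_device * 1e3:.3f} ms, ratio: {ratio:.1f}")
```

Moving the patch across PCIe takes 144/8 = 18 times longer than reading it once from device memory, so a kernel that touches each value only a few times is dominated by the transfer unless that transfer is overlapped with other work.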

8 Fluid Solver Code (ICE)
[Figure: call-graph profile generated by Google profiling tool, visualized by Kcachegrind]
FirstOrderAdvector operators (labels 1 and 2 in the profile)
  Significant portion of runtime (~20%)
  Highly structured calculations
  Stencil operations and other SIMD constructs
  Map well onto GPU
  High FLOPs:Byte ratio

9 Results Without Optimizations
  GPU performance for stencil-based operations is ~2x over the multi-core CPU equivalent for realistic patch sizes
  Significant speedups for large patch sizes only
  Worth pursuing, but optimizations are needed
  Hide PCIe latency with asynchronous memory copies

10 Hiding PCIe Latency
Nvidia CUDA Asynchronous API
  Asynchronous functions provide:
    Memcopies asynchronous with CPU
    Concurrent execution of a kernel and a memcopy
  Stream: sequence of operations that execute in order on the GPU
  Operations from different streams can be interleaved
[Figure: timelines of data transfer and kernel execution with normal vs. page-locked memory]
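The benefit of interleaving copies and kernels from different streams can be estimated with a simple timing model. The sketch below is an idealized two-stage pipeline, not a measurement and not CUDA code: it assumes one copy engine, one kernel at a time, uniform chunk costs, and it ignores the device-to-host copy. The function names and timings are hypothetical.

```python
def serial_time(n_chunks, t_copy, t_kernel):
    """Every chunk is copied, then computed, with no overlap."""
    return n_chunks * (t_copy + t_kernel)


def overlapped_time(n_chunks, t_copy, t_kernel):
    """The copy of chunk i+1 runs while the kernel for chunk i executes.

    The first copy and the last kernel cannot be hidden; in between, the
    slower of the two stages sets the pace.
    """
    return t_copy + t_kernel + (n_chunks - 1) * max(t_copy, t_kernel)


# Hypothetical costs in milliseconds per chunk.
n, t_copy, t_kernel = 8, 2.0, 3.0
print("serial:    ", serial_time(n, t_copy, t_kernel))
print("overlapped:", overlapped_time(n, t_copy, t_kernel))
```

In this model the best achievable time approaches n * max(t_copy, t_kernel), so overlap helps most when copy and kernel costs are comparable.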

11 Unified CPU-GPU Scheduler

12 GPU Task Management
  With Uintah's knowledge of the task-graph, task data can be automatically transferred asynchronously to the device before a GPU task executes
  All device memory allocations and asynchronous transfers are handled automatically
  Can handle multiple devices on-node
  All device data is made available to component code via a convenient interface
Data flow for one GPU task (steps 1-6 in the diagram):
  1. hostrequires: existing host memory is pinned with cudaHostRegister() (page-locked buffer)
  2. cudaMemcpyAsync(H2D) copies it to devrequires
  3. Call-back executed here (kernel run); the computation produces devcomputes
  4. Component requests the D2H copy here: cudaMemcpyAsync(D2H)
  5. Result back on host in hostcomputes
  6. Free pinned host memory
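The ordering of operations for a single GPU task can be illustrated without a GPU. The sketch below is a plain-Python mock, not Uintah or CUDA code: `FakeDevice`, `run_gpu_task`, and `double_kernel` are hypothetical names, and the comments only indicate which CUDA call each step would correspond to on real hardware.

```python
class FakeDevice:
    """Stand-in for a GPU that records operations instead of calling CUDA."""

    def __init__(self):
        self.memory = {}
        self.log = []

    def pin(self, name):
        self.log.append(f"pin {name}")        # would be cudaHostRegister()

    def copy_h2d(self, name, host_data):
        self.memory[name] = list(host_data)   # would be cudaMemcpyAsync (H2D)
        self.log.append(f"h2d {name}")

    def copy_d2h(self, name):
        self.log.append(f"d2h {name}")        # would be cudaMemcpyAsync (D2H)
        return list(self.memory[name])

    def unpin(self, name):
        self.log.append(f"unpin {name}")      # releases the pinned host buffer


def run_gpu_task(device, host_requires, kernel):
    """Stage inputs, run the kernel call-back, and return the results."""
    for name, data in host_requires.items():
        device.pin(name)
        device.copy_h2d(name, data)
    device.log.append("kernel")
    dev_computes = kernel(device.memory)      # call-back: the kernel run
    device.memory.update(dev_computes)
    host_computes = {name: device.copy_d2h(name) for name in dev_computes}
    for name in host_requires:
        device.unpin(name)
    return host_computes


def double_kernel(dev_memory):
    """Toy kernel: doubles every value of the 'in' variable."""
    return {"out": [2 * x for x in dev_memory["in"]]}


device = FakeDevice()
result = run_gpu_task(device, {"in": [1, 2, 3]}, double_kernel)
print(result)
print(device.log)
```

The component code only supplies the kernel call-back; everything else in `run_gpu_task` is the kind of bookkeeping the slide says the runtime performs automatically.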

13 Multistage Task Queue Architecture
  Overlap computation with PCIe transfers and MPI communication
  Automatically handles device memory ops and stream management
  Enables Uintah to pre-fetch GPU data
  Queries the task-graph for a task's data requirements
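One way to picture a multistage queue is as a set of stages that a task moves through as its data becomes available. The sketch below is a toy, tick-based simulation written for this transcript, not the Uintah scheduler: it assumes a single copy engine, kernels that take one tick, and hypothetical names throughout.

```python
from collections import deque


def run_multistage(task_names, copy_ticks):
    """Simulate tasks moving through 'waiting', 'in flight', and 'ready' stages."""
    waiting = deque(task_names)   # stage 0: copy not yet issued
    in_flight = None              # stage 1: [name, ticks_left] on the copy engine
    ready = deque()               # stage 2: data resident on the device
    executed = []
    tick = 0
    while waiting or in_flight or ready:
        tick += 1
        # Run one ready task; this overlaps with the copy advanced below.
        if ready:
            executed.append((tick, ready.popleft()))
        # Pre-fetch: issue the next copy as soon as the copy engine is idle.
        if in_flight is None and waiting:
            in_flight = [waiting.popleft(), copy_ticks]
        # Advance the copy; once it completes, the task becomes ready.
        if in_flight is not None:
            in_flight[1] -= 1
            if in_flight[1] == 0:
                ready.append(in_flight[0])
                in_flight = None
    return executed, tick


executed, total_ticks = run_multistage(["a", "b", "c"], copy_ticks=2)
print(executed, total_ticks)
```

With three tasks and a two-tick copy, running each copy and kernel back to back would take 3 * (2 + 1) = 9 ticks; the staged version finishes in 7 because each kernel runs while the next task's copy is in progress.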

14 Using GPUs in Alstom Boiler Problem
ARCHES Combustion Component
  Need to approximate the radiation transfer equation
Methods considered (both solve the same equation):
  Discrete Ordinates Method (DOM)
    Slow and expensive (solving linear systems); difficult to add more complex radiation physics (specifically scattering)
  Reverse Monte Carlo Ray Tracing (RMCRT)
    Faster due to ray decomposition; naturally incorporates physics such as scattering. No linear solve.

15 ARCHES GPU-Based RMCRT
  RayTrace: computationally intensive task
  Ideal for SIMD parallelization
    Rays are mutually exclusive
    Can be traced simultaneously
  Offload ray tracing and RNG to GPU(s)
    NVIDIA cuRAND library
    RNG states on device, 1 per thread
  Available CPU cores can perform other computation
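The independence of rays is what makes this method map well onto a GPU. The sketch below illustrates the idea in plain Python rather than CUDA and is not the ARCHES implementation: it works in 2D on a unit square, uses a uniform absorption coefficient, black walls, no scattering, and fixed-step ray marching. Every function name and parameter is hypothetical. Each ray gets its own random generator, loosely mirroring the one-RNG-state-per-thread arrangement on the slide.

```python
import math
import random

SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, W m^-2 K^-4


def blackbody_intensity(temperature):
    return SIGMA * temperature**4 / math.pi


def trace_ray(origin, direction, kappa, temperature_at, wall_temperature,
              side=1.0, step=0.01):
    """March one ray backwards from `origin` until it leaves the square."""
    x, y = origin
    dx, dy = direction
    tau = 0.0        # optical depth accumulated so far
    intensity = 0.0
    while 0.0 <= x <= side and 0.0 <= y <= side:
        tau_next = tau + kappa * step
        # Emission from this segment, attenuated by the medium already crossed.
        intensity += blackbody_intensity(temperature_at(x, y)) * (
            math.exp(-tau) - math.exp(-tau_next))
        tau = tau_next
        x += dx * step
        y += dy * step
    # Whatever the medium did not absorb originates at the black wall.
    return intensity + blackbody_intensity(wall_temperature) * math.exp(-tau)


def incident_intensity(origin, n_rays, kappa, temperature_at,
                       wall_temperature, seed=0):
    """Average the intensity arriving at `origin` over random directions."""
    total = 0.0
    for ray in range(n_rays):
        rng = random.Random(seed * 1_000_003 + ray)  # independent stream per ray
        angle = rng.uniform(0.0, 2.0 * math.pi)
        total += trace_ray(origin, (math.cos(angle), math.sin(angle)),
                           kappa, temperature_at, wall_temperature)
    return total / n_rays


mean = incident_intensity((0.5, 0.5), n_rays=200, kappa=2.0,
                          temperature_at=lambda x, y: 1000.0,
                          wall_temperature=1000.0)
print(mean, blackbody_intensity(1000.0))
```

The isothermal case makes a convenient sanity check: when the medium and walls share one temperature, every ray must return the blackbody intensity at that temperature, whatever its direction. Because no ray reads another ray's state, the loop over rays is the part that would be distributed across GPU threads.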

16 Uintah CPU-GPU Scheduler Abilities
Now able to run capability jobs on:
  Keeneland Initial Delivery System (NICS)
    1440 CPU cores & 360 GPUs simultaneously
    (3) Nvidia Tesla M2090 GPUs per node
  TitanDev - Jaguar XK6 GPU partition (OLCF)
    CPU cores & 960 GPUs simultaneously
    (1) Nvidia Tesla M2090 GPU per node
Significant speedups shown
High degree of node-level parallelism

17 GPU RMCRT Speedup Results (Single Node)
Single CPU Core vs Single GPU
[Table: Machine, Rays, CPU (sec), GPU (sec), Speedup (x); rows: Keeneland (1-core Intel), TitanDev (1-core AMD)]
  GPU: Nvidia M2090
  Keeneland CPU core: Intel Xeon X5660
  TitanDev CPU core: AMD Opteron 6200

18 GPU RMCRT Speedup Results (Single Node)
All CPU Cores vs Single GPU
[Table: Machine, Rays, CPU (sec), GPU (sec), Speedup (x); rows: Keeneland (12-cores Intel), TitanDev (16-cores AMD)]
  GPU: Nvidia M2090
  Keeneland CPU cores: Intel Xeon X5660
  TitanDev CPU cores: AMD Opteron 6200

19 Performance Comparison Tests: CPU-Only
Execution time (s), Master-Slave Model vs Unified
[Table: #Cores, Master-Slave, Unified]
  Problem: Combined MPMICE problem using AMR
  Run on a single Cray XE6 node with two 16-core AMD Opteron 6200 Series (Interlagos) processors

20 Performance Comparison Tests: CPU-GPU
Execution time (s), Master-Slave Model vs Unified
[Table: #Cores, Master-Slave, Unified]
  Problem: GPU-enabled Reverse Monte Carlo Ray Tracer (RMCRT)
  Run on a single 12-core heterogeneous node: two Intel Xeon X5650 (Westmere) processors, (2) Nvidia Tesla C2070 GPUs and (1) Nvidia GeForce 570 GTX GPU

21 CUDA 5.0 and Kepler
Nvidia CUDA 5.0 and Kepler GPUs promise to significantly enhance Uintah's GPU capabilities
  Dynamic Parallelism: launch kernels from the device
  GPU Object Linking: create libraries for GPU code
Future Uintah GPU design plans will include leveraging these two offerings

22 Future Work
Scheduler infrastructure
  GPU affinity for multi-socket/GPU nodes
  Support for Intel MIC (Xeon Phi)
PETSc GPU interface utilization
  ARCHES linear solves
  Alstom Boiler Problem
Mechanism to dynamically determine whether to run the GPU or CPU version of a task
Optimize GPU codes for Nvidia Kepler

23 Questions? Software Download
