HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

PREFACE
This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh photon transport. Note: you do not need to have a background in photon transport for this talk; this talk is about the parallelism, not about the physics. Various other kinds of solvers use unstructured meshes, as do many graphical applications; similar ideas might apply to some of those.

INTRODUCTION: WHAT IS AN UNSTRUCTURED MESH?
An unstructured mesh (grid) is a spatial decomposition using simple shapes (e.g., triangles, tetrahedra) in an irregular pattern.

INTRODUCTION: WHAT IS AN UNSTRUCTURED MESH?
Unstructured methods require an explicit description of the geometry and its connectivity. Sometimes the vertices (or the edges) are the important parts; other times the facets/volumes are the important parts.
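
As a purely illustrative sketch of what an explicit connectivity description can look like in code (the layout and all names here are assumptions, not UMT2013's data structures), zone-to-zone adjacency is often stored in compressed-sparse-row form; every traversal then goes through an index array, which is exactly the indirect addressing discussed on the next slide.

```cpp
// Minimal sketch (hypothetical layout): zone-to-zone adjacency in compressed-sparse-row form.
struct UnstructuredMesh {
    int     numZones;
    int*    zoneNeighborOffset;  // size numZones + 1
    int*    zoneNeighbors;       // neighbors of zone z: zoneNeighbors[zoneNeighborOffset[z] .. zoneNeighborOffset[z+1])
    double* vertexCoords;        // x, y, z per vertex; zone-to-vertex lists omitted for brevity
};
```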

INTRODUCTION: WHY IS THIS CHALLENGING?
The amount of concurrency we can extract as we sweep across the mesh is irregular. Traversing an unstructured mesh requires indirect (and likely incoherent) memory addressing.

TRADITIONAL PARALLELIZATION OF UNSTRUCTURED MESH SWEEPS
Partition the mesh into patches (like tiling a structured mesh). Each processor sweeps its local patch.

PARALLELISM IN UMT2013
Out of the box, the UMT2013 benchmark presents the following degrees of (potential) concurrency. Geometry is partitioned over 10,000s of MPI ranks (say we pick 8-16 ranks per processor). Each rank sweeps over its partition from up to 32 angles concurrently using OpenMP. Each sweep is over the 1,000s of zones of that rank's partition of the geometry. Total available concurrency per processor is ~8k. Each calculation within the sweep is over 16 energy groups, SIMD-style.
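
As a rough sketch of that baseline structure (all identifiers here, such as sweepRank, sweepOrder, and psi, are hypothetical rather than UMT2013 names): MPI across partitions, OpenMP across the angles within a rank, a sequential pre-scheduled sweep over that rank's zones, and SIMD across the 16 energy groups innermost.

```cpp
// Sketch only: one rank's sweep, with OpenMP across angles and SIMD across energy groups.
#include <vector>

constexpr int kGroups = 16;

void sweepRank(const std::vector<std::vector<int>>& sweepOrder,   // per-angle, pre-scheduled zone order
               std::vector<double>& psi,                          // [angle][zone][group], flattened
               int numZones)
{
    const int numAngles = static_cast<int>(sweepOrder.size());    // up to 32 angles
    #pragma omp parallel for
    for (int angle = 0; angle < numAngles; ++angle) {
        for (int zone : sweepOrder[angle]) {                       // sequential, dependency-ordered sweep
            #pragma omp simd
            for (int g = 0; g < kGroups; ++g)                      // 16 energy groups, SIMD-style
                psi[(angle * numZones + zone) * kGroups + g] += 1.0;  // placeholder for the real per-zone solve
        }
    }
}
```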

SWEEP SCHEDULING IN UMT2013
The sweep order is angle-dependent but pre-scheduled. Zone indices are ordered (as well as possible) such that, when the ordered list is processed sequentially, each zone comes after the zones on which it depends. UMT2013 makes no attempt to schedule for parallelism.

STRUCTURED GRID = (MOSTLY) REGULAR PARALLELISM
To understand the parallelization opportunities for traversal of unstructured meshes, let's look first at structured meshes. In a structured grid, the number of neighbors each grid cell has is known (and generally constant), and addressing the neighbors is straightforward. Parallelism is (usually) regular: we can often consider all grid cells concurrently at any given step (or at least all cells along a given row or column). The worst case is when we sweep across the diagonal, but parallelizing these wavefront computations through pipelining is well understood.
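
For reference, here is a minimal sketch of the diagonal (wavefront) sweep over a structured grid; the update itself is a toy stand-in, but the dependency pattern (each cell needs its upper and left neighbors) is the one that matters.

```cpp
// Toy wavefront sweep over an N x N structured grid: cell (i, j) depends on (i-1, j) and
// (i, j-1), so all cells on a given anti-diagonal (i + j == d) are independent.
#include <algorithm>

void wavefrontSweep(double* grid, int N) {
    for (int d = 0; d <= 2 * (N - 1); ++d) {                       // one step per anti-diagonal
        #pragma omp parallel for                                   // cells on this diagonal are independent
        for (int i = std::max(0, d - (N - 1)); i <= std::min(d, N - 1); ++i) {
            int j = d - i;
            double up   = (i > 0) ? grid[(i - 1) * N + j] : 1.0;
            double left = (j > 0) ? grid[i * N + (j - 1)] : 1.0;
            grid[i * N + j] = 0.5 * (up + left);                   // placeholder update
        }
    }
}
```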

PARALLEL WAVEFRONTS
When sweeping across the diagonal, the available parallelism at each step follows a triangular pattern (e.g., for an 8x8 grid: 1, 2, 3, 4, 5, 6, 7, 8, 7, 6, 5, 4, 3, 2, 1 cells per step).

PARALLEL WAVEFRONTS
Sweeping along the diagonal sequentially creates load imbalances (both within a node and across nodes).

PARALLEL WAVEFRONTS
The standard solution to this is pipelining parallel wavefronts: communicate each boundary value as soon as it's ready, and start on the next iteration's wavefront on the heels of the previous one.
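
A minimal sketch of that pipelining idea, assuming a row-wise decomposition of a 2D structured grid across MPI ranks (sizes, tags, and the update are placeholders): each rank forwards its bottom boundary for a column strip as soon as that strip is finished, so the rank below can start before the whole row block is done.

```cpp
// Pipelined wavefront sketch: ranks own blocks of rows; work proceeds in column strips so the
// next rank can start as soon as the first strip's boundary row arrives.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int rowsPerRank = 64, cols = 256, strip = 16;            // illustrative sizes
    std::vector<double> u((rowsPerRank + 1) * cols, 1.0);          // row 0 holds the halo from the rank above

    for (int j0 = 0; j0 < cols; j0 += strip) {
        if (rank > 0)                                              // receive this strip's halo as soon as it exists
            MPI_Recv(&u[j0], strip, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 1; i <= rowsPerRank; ++i)                     // sweep this strip of our rows
            for (int j = j0; j < j0 + strip; ++j) {
                double left = (j > 0) ? u[i * cols + j - 1] : 1.0;
                u[i * cols + j] = 0.5 * (u[(i - 1) * cols + j] + left);
            }
        if (rank < nranks - 1)                                     // forward the boundary immediately
            MPI_Send(&u[rowsPerRank * cols + j0], strip, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```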

SWEEP SCHEDULING IN UMT2013
Fortunately, we don't have to contend with the pipelining problem; UMT2013 already deals with this in another way. We do have variable parallel widths, like sweeping diagonals in structured meshes, but worse due to the irregularity. This is why UMT2013 makes no attempt to schedule sweeps for parallelism.

SWEEP SCHEDULING IN UMT2013
Still, there is some degree of parallelism among the zones within a given step of a single sweep. We need a way to detect and extract that parallelism.

SWEEP SCHEDULING FOR PARALLELISM
After scheduling the sweep, iterate over the schedule on the CPU and calculate the first step in which a zone can appear such that its dependencies will have been resolved.
[Diagram: example mesh with the earliest possible step labeled on each zone]
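
A minimal sketch of that CPU pass, with hypothetical names (this is not the UMT2013 source): because the schedule already lists dependencies before dependents, a single forward pass can assign each zone the earliest step at which all of its upstream zones are done.

```cpp
// Assign each zone the earliest step ("batch") in which its dependencies are already resolved:
// one more than the latest batch among the zones it depends on for this sweep direction.
#include <algorithm>
#include <vector>

std::vector<int> computeBatchNumbers(const std::vector<int>& sweepOrder,                 // pre-scheduled zone order
                                     const std::vector<std::vector<int>>& upstreamZones) // dependencies per zone
{
    std::vector<int> batch(sweepOrder.size(), 0);
    for (int zone : sweepOrder) {                        // dependencies appear earlier in the order
        int earliest = 0;
        for (int dep : upstreamZones[zone])
            earliest = std::max(earliest, batch[dep] + 1);
        batch[zone] = earliest;
    }
    return batch;
}
```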

SWEEP SCHEDULING FOR PARALLELISM
Use this value as a batch number and sort the sweep schedule using the batch number as the key. The resulting parallelism is irregular but still useful.
[Diagram: the same mesh with zones grouped by batch number]
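
Continuing the same sketch: a stable sort keyed on the batch number groups together all the zones that can run concurrently, and counting zones per batch gives the irregular batch-size profile shown next.

```cpp
// Reorder the sweep schedule so zones with the same batch number are contiguous; a stable sort
// preserves the original (dependency-respecting) order within each batch.
#include <algorithm>
#include <vector>

void sortScheduleByBatch(std::vector<int>& sweepOrder, const std::vector<int>& batch,
                         std::vector<int>& batchSize)
{
    std::stable_sort(sweepOrder.begin(), sweepOrder.end(),
                     [&](int a, int b) { return batch[a] < batch[b]; });
    const int numBatches = batch.empty() ? 0 : *std::max_element(batch.begin(), batch.end()) + 1;
    batchSize.assign(numBatches, 0);
    for (int b : batch) ++batchSize[b];                  // zones that may be processed concurrently
}
```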

SWEEP SCHEDULING FOR PARALLELISM
Batch size follows a (loosely) triangular trend similar to what was seen in structured mesh sweeps, though less predictable. When zones/rank is ~2800, the number of batches is ~100 and the maximum batch size is ~64.
[Plot: batch size vs. batch number]

REGULARIZING UNSTRUCTURED MESH SWEEPS IN UMT2013
To regularize this, pick a parallel width of, say, 16, and process sets of 16 zones from a single batch concurrently. Loop until all zones in the batch are done, then synchronize threads, since some will have had one extra zone (loop iteration). When the whole batch is done, move on to the next batch.
[Plot: batch size vs. batch number]
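
A sketch of how that loop might look as a CUDA kernel. The structure (one block per angle, 16 zone lanes by 16 energy groups, a block-wide synchronization between batches) follows the description above, but the kernel body and all names are placeholders rather than UMT2013 code.

```cuda
// One thread block sweeps one angle. threadIdx.y selects one of 16 zone lanes, threadIdx.x one
// of 16 energy groups. Each batch is processed 16 zones at a time; lanes with one extra zone
// take one more pass, and the block synchronizes before moving to the next batch.
__global__ void sweepAngle(const int* zonesSortedByBatch, // [numAngles][numZones], per-angle schedule
                           const int* batchStart,         // [numAngles][maxBatches + 1] offsets
                           const int* numBatches,         // batches per angle
                           int numZones, int maxBatches,
                           double* psi)                   // [numAngles][numZones][16] angular unknowns
{
    const int angle = blockIdx.x;
    const int lane  = threadIdx.y;                        // which of the 16 concurrent zones
    const int group = threadIdx.x;                        // which of the 16 energy groups

    const int* order = zonesSortedByBatch + angle * numZones;
    const int* start = batchStart + angle * (maxBatches + 1);

    for (int b = 0; b < numBatches[angle]; ++b) {
        for (int z = start[b] + lane; z < start[b + 1]; z += 16) {
            int zone = order[z];
            psi[(angle * numZones + zone) * 16 + group] += 1.0;   // placeholder for the per-zone, per-group solve
        }
        __syncthreads();                                  // whole batch must finish before the next starts
    }
}
```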

REGULARIZING UNSTRUCTURED MESH SWEEPS IN UMT2013
Some processors idle some of the time, about 25% overall. Possibilities for future improvement here include a different batch size and work stealing.
[Plot: batch size vs. batch number]

PARALLELISM IN UMT2013
After exposing more parallelism, UMT2013 now has ~16x the concurrency. Geometry is partitioned over 10,000s of MPI ranks (8-16 ranks per GPU using CUDA MPS). Each rank sweeps over its partition from all 32 angles concurrently (each angle is a thread block). Each sweep is over all of the 1,000s of zones of that rank's partition (16 zones at a time). Total available concurrency per GPU is ~128k threads. Each calculation within the sweep is over all 16 energy groups (one per SIMT thread, so each warp of 32 threads handles two zones).
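
To make the arithmetic concrete, here is the launch shape these numbers imply, reusing the sweepAngle sketch from the earlier slide (names remain hypothetical): 32 blocks of 16 x 16 threads is 8,192 threads per rank, and 8-16 ranks sharing the GPU through MPS gives roughly 64k-128k threads.

```cuda
// Launch shape implied by the numbers above, for one rank's sweep.
void launchSweepForRank(const int* d_zonesSortedByBatch, const int* d_batchStart,
                        const int* d_numBatches, int numZones, int maxBatches, double* d_psi)
{
    dim3 block(16, 16);   // x: energy group, y: zone lane  -> 256 threads per block
    dim3 grid(32);        // one block per angle            -> 8,192 threads per rank
    sweepAngle<<<grid, block>>>(d_zonesSortedByBatch, d_batchStart, d_numBatches,
                                numZones, maxBatches, d_psi);
}
```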

VISUAL PROFILER TIMELINE
A K40 can fit 60 of UMT's 16x16-thread blocks concurrently. There are 32 blocks per rank, so ~2 ranks fill the GPU once over; note, though, that we prefer to fill the GPU many times over. Memory transfers and computation are overlapped automatically.

RESULTS
GPUs can do well on traversal of unstructured meshes! With this initial work, the K40 is already 1.5x faster than Ivy Bridge for whole-app performance.
[Chart: whole-app speedup, Xeon E5-2690 v2 @ 3.0GHz vs. Tesla K40m @ 875MHz; CUDA 6.0 RC, driver 331.44, PGI 13.9 compiler, 10 ranks, 12^3 zones/rank]

FUTURE WORK
We're not done yet! While the CPU run pegs all cores at 100% utilization throughout, the CUDA version has lots of headroom for further optimization. For example: about 20% of total time is still outside the sweeps and currently CPU-only; we could GPU-accelerate more of that. We can also find ways to reduce register pressure and improve occupancy (see the sketch below). Exposing sufficient parallelism was just the first step.
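
For the register-pressure item, one standard knob (shown only as an illustration, not something the talk prescribes) is to constrain the kernel with __launch_bounds__, which tells the compiler the maximum block size and a desired number of resident blocks per SM so it can trade registers for occupancy; nvcc's -maxrregcount flag is the coarser, per-file alternative.

```cuda
// Illustrative only: cap the sweep kernel at 256 threads per block and ask the compiler to aim
// for at least 4 resident blocks per SM, encouraging it to keep register use low enough for
// better occupancy.
__global__ void __launch_bounds__(256, 4)
sweepAngleBounded(/* same parameter list as the sweepAngle sketch */)
{
    // ... same body as the sweepAngle sketch above ...
}
```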