Scaling in a Heterogeneous Environment with GPUs: GPU Architecture, Concepts, and Strategies

Scaling in a Heterogeneous Environment with GPUs: GPU Architecture, Concepts, and Strategies John E. Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of Illinois at Urbana-Champaign ~johns/ Scaling to Petascale Institute, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

Agenda: Scaling in a Heterogeneous Environment With GPUs GPU architecture, concepts, and strategies OpenACC OpenACC Hands-On Lab CUDA Programming 1 CUDA Hands-On Lab CUDA Programming 2 GPU Optimization and Scaling with Profiling and Debugging Open Hands-on Lab

Administrativa: QwikLab Accounts Participants that have not created & verified their QwikLab accounts should do so ASAP to be ready for today's hands-on: https://nvlabs.qwiklab.com/ If you are a walk-in participant, your site handler will need to request access by email to Justin Luitjens, including which site you're located at, and your email address. If you still don't have access or we reach max capacity, buddy up with another participant until you get access. You should have an access code to run the QwikLab courses: Accelerating Applications with GPU-Accelerated Libraries in C/C++ OpenACC 2X in 4 steps Accelerating Applications with CUDA C/C++

GPU Computing GPUs evolved from graphics toward general purpose data-parallel workloads GPUs are commodity devices, omnipresent in modern computers (~million sold per week) Massively parallel hardware, well suited to throughput-oriented workloads, streaming data far too large for CPU caches Programming tools allow software to be written in various dialects of familiar C/C++/Fortran and integrated into legacy software GPU algorithms are often multicore-friendly due to attention paid to data locality and data-parallel work decomposition

What Makes GPUs Compelling? Massively parallel hardware architecture: Tens of wide SIMD-oriented stream processing compute units ("SMs" in NVIDIA nomenclature) Tens of thousands of threads running on thousands of ALUs and special function units Large register files, fast on-chip and die-stacked memory systems Example: NVIDIA Tesla V100 (Volta) Peak performance: 7.5 TFLOPS FP64, 15 TFLOPS FP32 120 TFLOPS Tensor unit (FP16/FP32 mix) 900 GB/sec memory bandwidth (HBM2) http://www.nvidia.com/object/volta-architecture-whitepaper.html

Evolution of GPUs Over Multiple Generations http://www.nvidia.com/object/volta-architecture-whitepaper.html

Other Benefits of GPUs 2017: 13 of the top 14 Green500 systems are GPU-accelerated (Tesla P100) machines Increased GFLOPS/watt power efficiency Increased compute power per unit volume Desktop workstations can incorporate the same types of GPUs found in clouds, clusters, and supercomputers GPUs can be upgraded without new OS version, license fees, etc.

Growth of GPU Accelerated Apps (2013) [Chart: number of GPU-accelerated applications among top HPC applications, 2011-2013, by domain: Molecular Dynamics, Quantum Chemistry, Material Science, Weather & Climate, Lattice QCD, Plasma Physics, Structural Mechanics, Fluid Dynamics. Applications shown include AMBER, CHARMM, DESMOND, Abinit, Gaussian, CP2K, QMCPACK, GROMACS, LAMMPS, NAMD, GAMESS, NWChem, Quantum Espresso, VASP, COSMO, GEOS-5, HOMME, CAM-SE, NEMO, NIM, WRF, Chroma, MILC, GTC, GTS, ANSYS Mechanical, LS-DYNA Implicit, MSC Nastran, ANSYS Fluent, OptiStruct, Abaqus/Standard, and Culises (OpenFOAM), marked as either accelerated or in development. Courtesy NVIDIA.]

Sounds Great! What Don't GPUs Do? GPUs don't accelerate serial code GPUs don't run your operating system; you still need a CPU for that GPUs don't accelerate your InfiniBand card GPUs don't make disk I/O faster and GPUs don't make Amdahl's Law magically go away

Heterogeneous Computing Use processors with complementary capabilities for best overall performance GPUs of today are effective accelerators that depend on the host system for OS and resource management, I/O, etc. GPU-accelerated programs are therefore programs that run on heterogeneous computing systems consisting of a mix of processors (at least CPU+GPU)

Complementarity of Typical CPU and GPU Hardware Architectures CPU: Cache heavy, low latency, per-thread performance, small core counts GPU: ALU heavy, massively parallel, throughput-oriented

Exemplary Heterogeneous Computing Challenges Tuning, adapting, or developing software for multiple processor types Decomposition of problem(s) and load balancing work across heterogeneous resources for best overall performance and work-efficiency Managing data placement in disjoint memory systems with varying performance attributes Transferring data between processors, memory systems, interconnect, and I/O devices

Heterogeneous Compute Node Dense PCIe-based multi-GPU compute node Application would ideally exploit all of the CPU, GPU, and I/O resources concurrently (I/O devices not shown) ~12GB/s Simulation of reaction diffusion processes over biologically relevant size and time scales using multi-GPU workstations. Michael J. Hallock, John E. Stone, Elijah Roberts, Corey Fry, and Zaida Luthey-Schulten. Journal of Parallel Computing, 40:86-99, 2014. http://dx.doi.org/10.1016/j.parco.2014.03.009

Major Approaches For Programming Hybrid Architectures Use drop-in libraries in place of CPU-only libraries Little or no code development Examples: MAGMA, BLAS-variants, FFT libraries, etc. Speedups limited by Amdahl's Law and overheads associated with data movement between CPUs and GPU accelerators Generate accelerator code as a variant of CPU source, e.g. using OpenMP and OpenACC directives, and similar Write lower-level accelerator-specific code, e.g. using CUDA, OpenCL, other approaches

Simplified GPU-Accelerated Application Adaptation and Development Cycle 1. Use drop-in GPU libraries, e.g. BLAS, FFT 2. Profile application, identify opportunities for massive data-parallelism 3. Migrate well-suited data-parallel work to GPUs Run data-parallel work, e.g. loop nests, on GPUs Exploit high bandwidth memory systems Exploit massively parallel arithmetic hardware Minimize host-GPU data transfers 4. Go back to step 2 Observe Amdahl's Law, adjust CPU-GPU workloads

GPU Accelerated Libraries: Drop-in Acceleration for your Applications Linear Algebra (FFT, BLAS, SPARSE, matrix) Numerical & Math (RAND, statistics) Data Structures & AI (sort, scan, zero-sum game solvers, board games, path finding) Visual Processing (image & video) Example libraries: NVIDIA cuFFT, cuBLAS, cuSPARSE, cuRAND, NVIDIA NPP, NVIDIA Math Lib, NVIDIA Video Encode, GPU AI libraries. Courtesy NVIDIA
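
As a concrete illustration of the drop-in library approach, the sketch below (added here for clarity, not part of the original slides) offloads a single-precision matrix multiply to cuBLAS; the matrix size, test data, and omitted error checking are assumptions made for brevity.

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Minimal sketch: C = alpha*A*B + beta*C on the GPU via cuBLAS (column-major).
int main(void) {
    const int N = 1024;                     // assumed matrix dimension
    size_t bytes = (size_t)N * N * sizeof(float);
    float *hA = (float*)malloc(bytes), *hB = (float*)malloc(bytes), *hC = (float*)malloc(bytes);
    for (int i = 0; i < N * N; i++) { hA[i] = 1.0f; hB[i] = 2.0f; hC[i] = 0.0f; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // Same math as a CPU SGEMM call; only the library and memory placement change.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, dA, N, dB, N, &beta, dC, N);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", hC[0]);           // expect 2*N for this test data

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}

As the slide on Amdahl's Law cautions, the host-GPU copies surrounding the library call are part of the cost, so drop-in acceleration pays off most when the accelerated routine dominates the runtime.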

What Runs on a GPU? GPUs run data-parallel programs called kernels GPUs are managed by host CPU thread(s): Create a CUDA / OpenCL / OpenACC context Manage GPU memory allocations/properties Host-GPU and GPU-GPU (peer to peer) transfers Launch GPU kernels Query GPU status Handle runtime errors
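
To make those host-side responsibilities concrete, here is a minimal CUDA host sketch (an illustration added here, not from the original slides); the kernel name scale and the array size are assumptions.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical data-parallel kernel used only for illustration.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *hx = (float*)malloc(bytes);
    for (int i = 0; i < n; i++) hx[i] = 1.0f;

    cudaSetDevice(0);                                     // select GPU / create context
    float *dx;
    cudaMalloc(&dx, bytes);                               // manage GPU memory allocations
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);    // host-to-GPU transfer

    scale<<<(n + 255) / 256, 256>>>(dx, 2.0f, n);         // launch GPU kernel
    cudaError_t err = cudaGetLastError();                 // handle runtime errors
    if (err != cudaSuccess) fprintf(stderr, "%s\n", cudaGetErrorString(err));

    cudaMemcpy(hx, dx, bytes, cudaMemcpyDeviceToHost);    // GPU-to-host transfer
    cudaFree(dx);
    free(hx);
    return 0;
}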

How Do I Write GPU Kernels? Directive-based parallelism (OpenACC): Annotate existing source code loop nests with directives that allow a compiler to automatically generate data-parallel kernels Same source code targets multiple processors Explicit parallelism (CUDA, OpenCL): Write data-parallel kernels, explicitly map range of independent work items to GPU threads and groups Explicit control over specialized on-chip memory systems, low-level parallel synchronization, reductions

OpenACC Directives: Open, Simple, Portable

main() {
    <serial code>
    /* compiler hint */
    #pragma acc kernels
    {
        <compute intensive code>
    }
}

Open standard; easy, compiler-driven approach. Example: CAM-SE climate code, 6x faster on GPU, with the top kernel accounting for 50% of runtime. Courtesy NVIDIA

Directive-Based Parallel Programming with OpenACC Annotate loop nests in existing code with #pragma compiler directives: Annotate opportunities for parallelism Annotate points where host-GPU memory transfers are best performed, indicating how data propagates Evolve original code structure to improve efficacy of parallelization Eliminate false dependencies between loop iterations Revise algorithms or constructs that create excess data movement
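
For illustration, a minimal OpenACC sketch (added here, not from the slides) that annotates a SAXPY-style loop and makes the host-GPU data movement explicit; the function and array names are assumptions.

#include <stdlib.h>

// Hypothetical example: y = a*x + y, annotated for OpenACC.
void saxpy(int n, float a, const float *restrict x, float *restrict y) {
    // copyin: x is only read on the GPU; copy: y is read and written,
    // so its updated contents propagate back to the host afterward.
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }
}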

Process for Writing CUDA Kernels Data-parallel loop nests are unrolled into a large batch of independent work items that can execute concurrently Work items are mapped onto GPU hardware threads using multidimensional grids and blocks of threads that execute on stream processing units (SMs) Programmer manages data placement in GPU memory systems, access patterns, and data dependencies
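
As an illustration of that mapping (a sketch added here, not in the original slides), a 2-D loop nest over an image becomes a kernel in which each thread handles one (x, y) work item; the brighten operation and launch helper are assumptions.

#include <cuda_runtime.h>

// CPU version: a doubly nested loop over all pixels.
// for (int y = 0; y < height; y++)
//     for (int x = 0; x < width; x++)
//         img[y * width + x] += offset;

// CUDA version: the loop nest is unrolled into width*height independent work
// items, each mapped to one GPU thread via a 2-D grid of 2-D thread blocks.
__global__ void brighten(float *img, int width, int height, float offset) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)                  // guard threads past the edge
        img[y * width + x] += offset;
}

void launch_brighten(float *d_img, int width, int height, float offset) {
    dim3 block(16, 16);                           // 256 threads per block
    dim3 grid((width  + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    brighten<<<grid, block>>>(d_img, width, height, offset);
}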

CUDA Grid, Block, Thread Decomposition [Diagram: a 1-D, 2-D, or 3-D computational domain is decomposed into a 1-D, 2-D, or 3-D grid of thread blocks, indexed (0,0), (0,1), (1,0), (1,1), ..., each block itself a 1-D, 2-D, or 3-D block of threads.] Padding arrays can optimize global memory performance
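
The padding remark can be made concrete with cudaMallocPitch, which pads each row of a 2-D array so rows start on aligned boundaries; this sketch is only illustrative, and the kernel and dimensions are assumptions.

#include <cuda_runtime.h>

// Hypothetical kernel indexing a pitched (row-padded) 2-D array.
__global__ void zero_rows(float *img, size_t pitch, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // pitch is the padded row size in bytes, so step between rows in bytes.
        float *row = (float*)((char*)img + y * pitch);
        row[x] = 0.0f;
    }
}

void alloc_and_clear(int width, int height) {
    float *d_img;
    size_t pitch;   // actual padded row width in bytes, chosen by the runtime
    cudaMallocPitch((void**)&d_img, &pitch, width * sizeof(float), height);

    dim3 block(16, 16);
    dim3 grid((width + 15) / 16, (height + 15) / 16);
    zero_rows<<<grid, block>>>(d_img, pitch, width, height);
    cudaFree(d_img);
}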

Overview of Throughput-Oriented GPU Hardware Architecture GPUs have small on-chip caches Main memory latency (several hundred clock cycles!) is tolerated through hardware multithreading that overlaps memory transfer latency with execution of other work When a GPU thread stalls on a memory operation, the hardware immediately switches context to a ready thread Effective latency hiding requires saturating the GPU with lots of work: tens of thousands of independent work items
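
One practical way to keep that many work items in flight (a sketch added for illustration; the kernel's work is hypothetical) is to launch one thread per independent item and let the runtime's occupancy API suggest a block size:

#include <cuda_runtime.h>

__global__ void work(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

// Launch with a block size the runtime estimates will keep the SMs busy,
// and enough blocks to cover all n work items (ideally tens of thousands).
void launch_saturated(float *d_x, int n) {
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, work, 0, 0);
    int gridSize = (n + blockSize - 1) / blockSize;   // one thread per item
    work<<<gridSize, blockSize>>>(d_x, n);
}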

Avoid Output Conflicts, Conversion of Scatter to Gather Many CPU codes contain algorithms that scatter outputs to memory, to reduce arithmetic Scattered output can create bottlenecks for GPU performance due to write conflicts among hundreds or thousands of threads On the GPU, it is often better to: do more arithmetic, in exchange for regularized output memory write patterns convert scatter algorithms to gather approaches Use data privatization to reduce the scope of potentially conflicting outputs, and to leverage special on-chip memory systems and data reduction instructions
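
A schematic contrast (an illustrative sketch, not from the slides): a scatter-style histogram update needs atomics to resolve write conflicts, while the gather form gives each thread one output bin at the cost of rereading the input; the bin_of array is assumed to map items to bins.

#include <cuda_runtime.h>

// Scatter: each thread writes to a data-dependent bin; hundreds of threads
// may target the same bin, so atomics are needed and updates can serialize.
__global__ void histogram_scatter(const int *bin_of, int n, int *counts) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&counts[bin_of[i]], 1);
}

// Gather: each thread owns one output bin and scans the input, trading
// extra arithmetic and reads for regular, conflict-free writes.
__global__ void histogram_gather(const int *bin_of, int n, int *counts, int nbins) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= nbins) return;
    int c = 0;
    for (int i = 0; i < n; i++)
        c += (bin_of[i] == b);
    counts[b] = c;                 // one regular write per output element
}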

GPU Technology Conference Presentations: See the latest announcements about GPU hardware, libraries, and programming tools http://www.gputechconf.com/ http://www.gputechconf.com/attend/sessions

Bonus Material If Time Allows

Peak Arithmetic Performance Trend

Peak Memory Bandwidth Trend

Multi-GPU NUMA Architectures: Example of a balanced PCIe topology NUMA: Host threads should be pinned to the CPU that is closest to their target GPU GPUs on the same PCIe I/O Hub (IOH) can use CUDA peer-to-peer transfer APIs Intel: GPUs on different IOHs can't use peer-to-peer transfers Simulation of reaction diffusion processes over biologically relevant size and time scales using multi-GPU workstations. Michael J. Hallock, John E. Stone, Elijah Roberts, Corey Fry, and Zaida Luthey-Schulten. Journal of Parallel Computing, 40:86-99, 2014. http://dx.doi.org/10.1016/j.parco.2014.03.009
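
The peer-to-peer APIs referred to above look roughly like this sketch (illustrative only; the device numbering and buffer arguments are assumptions): the host checks whether two devices can reach each other directly, enables access, and then copies device-to-device without staging through host memory.

#include <cuda_runtime.h>

// Copy 'bytes' of data from a buffer on GPU 0 to a buffer on GPU 1,
// using direct peer DMA when the PCIe/NVLink topology allows it.
void copy_between_gpus(void *dst_on_gpu1, const void *src_on_gpu0, size_t bytes) {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);   // can GPU 0 reach GPU 1 directly?
    cudaDeviceCanAccessPeer(&can10, 1, 0);

    if (can01 && can10) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);    // enable direct peer access 0 -> 1
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);    // and 1 -> 0
    }
    // cudaMemcpyPeer uses direct GPU-to-GPU DMA when peer access is enabled,
    // and falls back to staging through host memory otherwise.
    cudaMemcpyPeer(dst_on_gpu1, 1, src_on_gpu0, 0, bytes);
}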

GPU PCI-Express DMA Simulation of reaction diffusion processes over biologically relevant size and time scales using multi-GPU workstations. Michael J. Hallock, John E. Stone, Elijah Roberts, Corey Fry, and Zaida Luthey-Schulten. Journal of Parallel Computing, 40:86-99, 2014. http://dx.doi.org/10.1016/j.parco.2014.03.009

Multi-GPU NUMA Architectures: Direct GPU-to-GPU peer DMA operations are more performant than other approaches, particularly for moderate-sized transfers They perform even better with NVLink peer-to-peer GPU interconnections Simulation of reaction diffusion processes over biologically relevant size and time scales using multi-GPU workstations. Michael J. Hallock, John E. Stone, Elijah Roberts, Corey Fry, and Zaida Luthey-Schulten. Journal of Parallel Computing, 40:86-99, 2014. http://dx.doi.org/10.1016/j.parco.2014.03.009

IBM S822LC w/ NVLink 1.0 ("Minsky")

Overlapping CPU Work with GPU Work [Timeline with CPU and GPU rows:] Host CPU thread launches a GPU action, e.g. a kernel or DMA memory copy, on the GPU; the GPU action runs to completion while CPU code keeps running. The CPU then waits for the GPU, ideally doing something productive in the meantime, and finally the host synchronizes with the completed GPU action.

Single CUDA Execution Stream [Timeline with CPU and GPU rows:] Host CPU thread launches a CUDA kernel, a memory copy, etc. on the GPU; the GPU action runs to completion; the host then synchronizes with the completed GPU action. While the GPU works, CPU code keeps running, and the CPU waits for the GPU only at the synchronization point, ideally doing something productive until then.
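
In code, that overlap looks roughly like the sketch below (illustrative; the kernel and the CPU-side work are hypothetical): the kernel launch returns immediately, so the host can do useful work before synchronizing.

#include <cuda_runtime.h>

__global__ void gpu_work(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

void overlap_cpu_and_gpu(float *d_x, int n, void (*cpu_work)(void)) {
    // Kernel launches are asynchronous: control returns to the CPU immediately.
    gpu_work<<<(n + 255) / 256, 256>>>(d_x, n);

    cpu_work();                  // productive CPU work while the GPU runs

    cudaDeviceSynchronize();     // host synchronizes with the completed GPU action
}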

Multiple CUDA Streams: Overlapping Compute and DMA Operations Simulation of reaction diffusion processes over biologically relevant size and time scales using multi-GPU workstations. Michael J. Hallock, John E. Stone, Elijah Roberts, Corey Fry, and Zaida Luthey-Schulten. Journal of Parallel Computing, 40:86-99, 2014. http://dx.doi.org/10.1016/j.parco.2014.03.009
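
A minimal multi-stream sketch (illustrative, not taken from the paper cited above): the input is processed in chunks, with each chunk's host-to-GPU copy, kernel, and GPU-to-host copy issued into its own stream so copies and compute from different chunks can overlap; pinned host memory and a chunk count that divides n evenly are assumptions.

#include <cuda_runtime.h>

__global__ void process(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void pipelined(float *h_x /* pinned via cudaMallocHost */, float *d_x, int n) {
    const int NSTREAMS = 4;
    const int chunk = n / NSTREAMS;           // assume NSTREAMS divides n
    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; s++) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < NSTREAMS; s++) {
        size_t off = (size_t)s * chunk;
        size_t bytes = chunk * sizeof(float);
        // Each stream carries copy-in, compute, copy-out for its own chunk;
        // work in different streams may overlap on the copy and compute engines.
        cudaMemcpyAsync(d_x + off, h_x + off, bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_x + off, chunk);
        cudaMemcpyAsync(h_x + off, d_x + off, bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();                  // wait for all streams to finish
    for (int s = 0; s < NSTREAMS; s++) cudaStreamDestroy(streams[s]);
}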

Using the CPU to Optimize GPU Performance GPU performs best when the work evenly divides into the number of threads/processing units Optimization strategy: Use the CPU to regularize the GPU workload Use fixed size bin data structures, with empty slots skipped or producing zeroed out results Handle exceptional or irregular work units on the CPU; GPU processes the bulk of the work concurrently On average, the GPU is kept highly occupied, attaining a high fraction of peak performance
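
A CPU-side sketch of that regularization strategy (illustrative; the bin capacity, sentinel value, and bin_of mapping are assumptions): items are packed into fixed-size bins padded with a sentinel the GPU kernel skips, and any overflow items are kept aside for the CPU to handle.

#include <stdlib.h>

#define BIN_CAPACITY 8          /* assumed fixed slot count per bin */
#define EMPTY_SLOT   (-1)       /* sentinel slot the GPU kernel skips */

/* Pack item indices into fixed-size bins; overflow goes to a CPU-side list. */
void build_bins(const int *bin_of, int nitems, int nbins,
                int *bins /* nbins * BIN_CAPACITY */,
                int *cpu_overflow, int *n_overflow) {
    int *fill = (int*)calloc(nbins, sizeof(int));
    for (int b = 0; b < nbins * BIN_CAPACITY; b++)
        bins[b] = EMPTY_SLOT;                         /* pad every slot up front */

    *n_overflow = 0;
    for (int i = 0; i < nitems; i++) {
        int b = bin_of[i];
        if (fill[b] < BIN_CAPACITY)
            bins[b * BIN_CAPACITY + fill[b]++] = i;   /* regular work for the GPU */
        else
            cpu_overflow[(*n_overflow)++] = i;        /* irregular work for the CPU */
    }
    free(fill);
}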

Time-Averaged Electrostatics Analysis on NCSA Blue Waters
NCSA Blue Waters node type                                                          Seconds per trajectory frame (one compute node)
Cray XE6 compute node: 32 CPU cores (2x AMD 6200 CPUs)                              9.33
Cray XK6 GPU-accelerated compute node: 16 CPU cores + NVIDIA X2090 (Fermi) GPU      2.25
Speedup for GPU XK6 nodes vs. CPU XE6 nodes: XK6 nodes are 4.15x faster overall.
Tests on XK7 nodes indicate MSM is CPU-bound with the Kepler K20X GPU; performance is not much faster (yet) than with the Fermi X2090. Need to move spatial hashing, prolongation, and interpolation onto the GPU (in progress); XK7 nodes are 4.3x faster overall.
Preliminary performance for VMD time-averaged electrostatics w/ Multilevel Summation Method on the NCSA Blue Waters Early Science System.

Multilevel Summation on the GPU Accelerate short-range cutoff and lattice cutoff parts. Performance profile for 0.5 Å map of potential for 1.5 M atoms. Hardware platform is Intel QX6700 CPU and NVIDIA GTX 280.
Computational steps             CPU (s)    w/ GPU (s)    Speedup
Short-range cutoff              480.07     14.87         32.3
Long-range: anterpolation         0.18
            restriction           0.16
            lattice cutoff       49.47      1.36         36.4
            prolongation          0.17
            interpolation         3.47
Total                           533.52     20.21         26.4
Multilevel summation of electrostatic potentials using graphics processing units. D. Hardy, J. Stone, K. Schulten. J. Parallel Computing, 35:164-177, 2009.

Avoiding Shared Memory Bank Conflicts: Array of Structures (AOS) vs. Structure of Arrays (SOA)

AOS:
typedef struct {
    float x;
    float y;
    float z;
} myvec;
myvec aos[1024];

aos[threadIdx.x].x = 0;
aos[threadIdx.x].y = 0;

SOA:
typedef struct {
    float x[1024];
    float y[1024];
    float z[1024];
} myvecs;
myvecs soa;

soa.x[threadIdx.x] = 0;
soa.y[threadIdx.x] = 0;