CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi, University of Bologna and INFN


A Streaming Multiprocessor (SM) is a parallel compute unit that dispatches instructions to warps, vectors of 32 lightweight threads (SIMT lanes). A GPU has on the order of 16 SMs. Warps are capable of SIMT (Single Instruction, Multiple Thread) execution: virtually, each thread has its own execution context (registers, program counter, ...), but serialization occurs if threads take different execution paths. SIMT ~ an SPMD (Single Program, Multiple Data) instruction set, vectorized and dispatched by the hardware onto SIMD (Single Instruction, Multiple Data) vector lanes.*

* Using this terminology since you've already heard of SIMD and SPMD at this school.

[Diagram: a CUDA program made of SPMD instructions is fed to a hardware dispatcher that vectorizes it onto the SIMD hardware.]

The CUDA programming model is an abstraction of the multiprocessor. A block is a group of a variable number of threads (warps). Each streaming multiprocessor is assigned a few blocks, and blocks are arranged on a grid. Each thread is identified inside its block by the integer triplet threadIdx.{x,y,z}, and each block inside the grid by blockIdx.{x,y,z}. Warps (and blocks) execute in parallel and no order is specified, as they are dynamically dispatched.

[Figure: a 1D grid with blockDim.x = 64; within each block threadIdx.x runs over 0..63 (ticks at 0, 8, 16, ..., 56), blockIdx.x takes the values 0, 1, 2, and the corresponding global index blockIdx.x * 64 + threadIdx.x runs from 0 to 191 (ticks at 0, 8, ..., 64, 96, 128, 160).]
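As a minimal illustrative sketch (the kernel name and parameters are assumptions, not part of the slides), this is how a thread combines blockIdx, blockDim and threadIdx into the global index of the figure:

__global__ void scale(float *data, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index, as in the figure
    if (i < n)                                      // guard the last, partial block
        data[i] *= alpha;
}

// Launch with a 1D grid of 1D blocks, e.g. 64 threads per block:
// scale<<<(n + 63) / 64, 64>>>(d_data, 2.0f, n);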

The memory hierarchy reflects the hierarchy of the parallelism model: registers (per thread, on the SM), per-block shared memory (on the SM), and global memory (on the GPU). Moving down the hierarchy from registers to global memory, bandwidth decreases while latency and size increase.
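The sketch below (an assumption, not taken from the slides) shows where each level of the hierarchy appears in a kernel: registers hold per-thread values, __shared__ arrays hold per-block data, and pointers to global memory are passed in from the host.

__global__ void levels(const float *g_in, float *g_out)  // g_in, g_out: global memory
{
    __shared__ float s_tile[256];        // per-block shared memory (on the SM)
    float r;                             // register, private to the thread

    // assumes blockDim.x == 256 and a grid that exactly covers the data
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s_tile[threadIdx.x] = g_in[i];       // global -> shared
    __syncthreads();                     // make the tile visible to the whole block

    r = 2.0f * s_tile[threadIdx.x];      // shared -> register
    g_out[i] = r;                        // register -> global
}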

Thanks to Tim Mattson @ Intel for this slide (third-party names are the property of their owners). OpenCL vs. CUDA terminology: the host defines a command queue and associates it with a context (devices, kernels, memory, etc.), then enqueues commands to the command queue. Kernel-execution commands launch work-items, i.e. one kernel instance for each point of an abstract index space called an NDRange; work-items execute together as a work-group. In a (G_x by G_y) NDRange partitioned into work-groups of size (S_x, S_y), the work-item with local ID (s_x, s_y) in work-group (w_x, w_y) has global ID (w_x*S_x + s_x, w_y*S_y + s_y). The CUDA equivalents: work-item ~ thread, work-group ~ thread block, NDRange ~ grid, command queue ~ CUDA stream.

[Slide: side-by-side code comparison, serial code vs. OpenCL vs. CUDA.]

Three performance questions.

Question: why does this kernel not achieve the GPU bandwidth achieved by cudaMemcpy when stride > 1? Hands-on: compile and run bw.cu.
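bw.cu is not reproduced here; a minimal sketch of a strided copy kernel that shows the effect could look like this (name and signature are assumptions):

__global__ void strided_copy(float *out, const float *in, int n, int stride)
{
    // with stride == 1 a warp reads one contiguous chunk per transaction;
    // with stride > 1 it spreads its 32 accesses over several cache lines
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}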

A: Warps should access global memory in a coalesced manner. On Kepler, for example, only contiguous chunks of global memory (in caching mode, cache lines) can be moved to and from the warps: a stride of two (four) therefore requires two (four) cache lines to be moved, and most of what is moved is not used. See this presentation for more details: http://on-demand.gputechconf.com/gtc-express/2011/presentations/cuda_webinars_globalmemory.pdf

Question: why does the second kernel achieve only roughly 50% of the floating-point throughput of the first? Compile and run diverge.cu and fix the performance bug. Exercise: fix the performance issue of the second kernel while keeping the same result. On Fermi, you should be able to see the floating-point throughput fully utilized:

$ ./divergence
divergence:    time=1.0965s throughput=501.34 GFLOP/s
no divergence: time=0.5513s throughput=997.16 GFLOP/s

A: If threads in a warp take different code paths, the execution of those paths is serialized. A warp (32 threads) can be seen as a large, flexible SIMD vector unit that can be programmed to take internal branches (virtually, SPMD), but at a performance cost (serialization), since a single instruction can only be dispatched to an entire warp. The fix shown on the slide is a lazy and generally bad solution; a more robust solution is left as homework. Note that the constants array is stored in the local memory space (see the CUDA programming guide) and the load is optimized to happen just once per thread, as the PTX shows (nvcc -ptx diverge.cu):

... ld.local.f32 %f8, [%rd10+0]; ...
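diverge.cu itself is not reproduced here; the following hypothetical pair of kernels illustrates the mechanism (not the exact exercise). In the first, the branch condition changes from lane to lane inside a warp, so both paths execute serially; in the second, the condition is uniform across each warp, so no serialization occurs.

__global__ void divergent(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)            // even/odd lanes of the same warp diverge
        x[i] = 2.0f * x[i] + 1.0f;
    else
        x[i] = 3.0f * x[i] + 2.0f;
}

__global__ void uniform(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)     // condition constant within each warp of 32
        x[i] = 2.0f * x[i] + 1.0f;
    else
        x[i] = 3.0f * x[i] + 2.0f;
}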

Question: how can we use the parallel caches so that this kernel loads idata just once per grid? The memory access pattern is fully coalesced and the bandwidth is fully used, yet we can still improve performance by using caches to avoid loading the same data twice. Compile and run stencil.cu.

The shared memory cache can be manually programmed to share the values read by different threads. Caches on CPUs are used to hide latency for a single thread; caches on the GPU are used to share data amongst parallel units. Shared memory is a programmable cache for sharing data amongst the threads of a block.

[Figure: without shared memory, neighbouring threads (e.g. thread 2 and thread 3) each issue their own ld.global, so the same global memory values are read twice. With shared memory, the block loads its tile and its halos from global memory once (ld.global / st.shared) and the threads then read from shared memory (ld.shared, much faster).] Make sure that everyone sees everything in shared memory before using it, i.e. synchronize the block with __syncthreads().
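stencil.cu is not reproduced here; a minimal sketch of a 1D three-point stencil that uses shared memory as a programmable cache (kernel name and sizes are assumptions) could look like this:

#define BLOCK 256

__global__ void stencil3(float *out, const float *in, int n)
{
    __shared__ float tile[BLOCK + 2];               // block tile plus one halo per side

    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + 1;                        // local index inside the tile

    tile[l] = (g < n) ? in[g] : 0.0f;               // each value loaded once per block
    if (threadIdx.x == 0)                           // left halo
        tile[0] = (g > 0) ? in[g - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)              // right halo
        tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;

    __syncthreads();   // everyone must see everything in shared memory before reading

    if (g < n)
        out[g] = tile[l - 1] + tile[l] + tile[l + 1];
}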

On Kepler, the read-only (texture) data cache can be used very easily with the const and __restrict__ qualifiers or the __ldg() intrinsic.
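A minimal sketch (assumed kernel, not from the slides): declaring the input pointer const and __restrict__ lets nvcc route its loads through the read-only cache on Kepler, and __ldg() requests such a load explicitly.

__global__ void saxpy_ro(float *__restrict__ y,
                         const float *__restrict__ x,  // eligible for the read-only cache
                         float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * __ldg(&x[i]) + y[i];               // explicit read-only load (sm_35+)
}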

Final Tips. Read the CUDA programming guide to understand the architecture: http://docs.nvidia.com/cuda/cuda-c-programming-guide/ Learn how GPUs hide latencies by keeping many transactions in flight (a very important topic not covered explicitly here, but covered in Tim's lecture). Use the NVIDIA Visual Profiler to identify performance issues: since in the SIMT paradigm the hardware dispatches and vectorizes the code, dynamic performance analysis is even more important.