
GPU programming: Unified memory models and more Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr

OK, I lied

You were told:
- CPU and GPU have distinct memory spaces
- Blocks cannot communicate
- You need to synchronize threads inside a block
That is the least-common denominator across all CUDA GPUs, but it is not true (any more). We now have:
- Global and shared memory atomics
- Warp-synchronous programming, warp vote, shuffle
- Device-mapped memory, a unified virtual address space, unified memory
- Dynamic parallelism

Outline
- Device-mapped memory, unified virtual address space, unified memory
- Dynamic parallelism
- Other memory spaces: texture, constant
- Pitfalls and fallacies

NVIDIA compute capabilities

Newer GPUs introduce additional features.

Architecture   GPU     Compute capability
Tesla          G80     1.0
Tesla          G92     1.1
Tesla          GT200   1.3
Tesla          GT21x   1.2
Fermi          GF100   2.0
Fermi          GF104   2.1
Kepler         GK104   3.0
Kepler         GK110   3.5
Maxwell        GM107   5.0
Maxwell        GM204   5.2
(Timeline: 2006 to 2015.)

Compute capability means two things:
- The set of features supported. Who can do more can do less: for x > y, CC x includes CC y.
- The native instruction set. This is not always backward-compatible: e.g. a GPU of CC 2.0 cannot run a binary built for CC 1.3.
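
A minimal host-side sketch of checking the compute capability at runtime, using the standard cudaGetDeviceCount / cudaGetDeviceProperties calls:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.major and prop.minor encode the compute capability, e.g. 3.5
        printf("Device %d: %s, CC %d.%d\n", dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}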

Device-mapped memory

The GPU can access host memory directly, if it is explicitly mapped into the device address space.
- Lower bandwidth but lower latency than DMA transfers
- Useful for light or sparse memory accesses
- Advantageous for integrated GPUs
The address spaces are still different: the device address differs from the host address.
(Diagram: the CPU dereferences *host_pointer and the GPU dereferences *device_pointer; both refer to the same buffer in CPU memory.)

Example

// Allocate device-mappable memory
float* a;
cudaHostAlloc((void**)&a, bytes, cudaHostAllocMapped);

// Get device-side pointer
float* d_a;
cudaHostGetDevicePointer((void**)&d_a, a, 0);

// Write using host pointer on CPU
a[0] = 42;

// Read/write using device pointer on GPU
mykernel<<<grid, block>>>(d_a);

// Wait for kernel completion
cudaDeviceSynchronize();

// Read using host pointer
printf("a=%f\n", a[0]);

// Free memory
cudaFreeHost(a);

Unified virtual address space (UVA)

Coordinates allocation in the CPU and GPU address spaces:
- The runtime can tell whether an address is on the device, on the host, or on both.
- cudaHostAlloc returns an address valid in both host and device space.
- No need to specify the direction in cudaMemcpy.
Caveat: this is not a true unified address space. Having an address does not mean you can access the data it points to; memory allocated with malloc still cannot be accessed by the GPU.
Requirements: CUDA 4.0, a 64-bit OS, CC 2.0. Not available on Windows Vista/7 under the WDDM driver model.
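
A minimal sketch of what UVA buys you, assuming a host buffer from cudaHostAlloc and a device buffer from cudaMalloc; cudaPointerGetAttributes and the cudaMemcpyDefault direction are standard runtime API:

// With UVA, a pointer value identifies its allocation system-wide
float *h_buf, *d_buf;
cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocDefault);
cudaMalloc((void**)&d_buf, bytes);

// The runtime can tell where an address lives
cudaPointerAttributes attr;
cudaPointerGetAttributes(&attr, d_buf);   // reports a device allocation

// No need to state the direction: the runtime infers it from the addresses
cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyDefault);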

Unified memory: the real thing

Allocate memory using cudaMallocManaged: the CUDA runtime takes care of the transfers, so there is no need to call cudaMemcpy any more. It behaves as if you had a single memory space.
Performance is suboptimal, since coherency is managed in software: still call cudaMemcpy when you can.
Requires CUDA 6.0, CC 3.0 (preferably 5.x), 64-bit Linux or Windows.

VectorAdd using unified memory

int main()
{
    int numElements = 50000;
    size_t size = numElements * sizeof(float);
    float *A, *B, *C;
    cudaMallocManaged((void **)&A, size);
    cudaMallocManaged((void **)&B, size);
    cudaMallocManaged((void **)&C, size);
    Initialize(A, B);
    int blocks = numElements;
    vectorAdd2<<<blocks, 1>>>(A, B, C);
    cudaDeviceSynchronize();
    Display(C);
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
}

Quiz: can I replace cudaMallocManaged by cudaHostAlloc (with UVA)? What is the difference?

Multi-GPU programming

What if I want to use multiple GPUs in the same machine?
- cudaGetDeviceCount, cudaGetDeviceProperties: enumerate devices
- cudaSetDevice(i): selects the current device; all following CUDA calls in the thread will concern this device
- Streams are associated with devices, but cudaStreamWaitEvent can synchronize with events on another GPU, which allows inter-GPU synchronization
- Host memory accessible from multiple devices: allocate with cudaHostAlloc(..., cudaHostAllocPortable), then call cudaHostGetDevicePointer for each GPU
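
A minimal sketch of driving every GPU in the machine from one host thread; mykernel, grid, block and bytes are placeholders, the runtime calls are standard:

int deviceCount;
cudaGetDeviceCount(&deviceCount);

for (int dev = 0; dev < deviceCount; ++dev) {
    cudaSetDevice(dev);                  // subsequent calls target device dev
    float *d_buf;
    cudaMalloc((void**)&d_buf, bytes);   // allocation lives on device dev
    mykernel<<<grid, block>>>(d_buf);    // kernel runs on device dev
}

for (int dev = 0; dev < deviceCount; ++dev) {
    cudaSetDevice(dev);
    cudaDeviceSynchronize();             // wait for each device to finish
}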

Peer-to-peer memory copy/access

Transferring memory between GPUs:
- cudaMemcpyDeviceToDevice will not work. Why?
- Without UVA: cudaMemcpyPeer(), cudaMemcpyPeerAsync()
- With UVA: just plain cudaMemcpy
Direct access to another GPU's memory:
- Check the result of cudaDeviceCanAccessPeer
- Call cudaDeviceEnablePeerAccess from the accessing GPU
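
A sketch of both paths, assuming dst_on_gpu0 and src_on_gpu1 are device allocations made on GPU 0 and GPU 1 respectively:

// Explicit inter-GPU copy: works whether or not peer access is possible
cudaMemcpyPeer(dst_on_gpu0, 0, src_on_gpu1, 1, bytes);

// Direct access: kernels on GPU 0 dereference GPU 1 pointers
int canAccess = 0;
cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can device 0 access device 1?
if (canAccess) {
    cudaSetDevice(0);                        // the accessing GPU
    cudaDeviceEnablePeerAccess(1, 0);        // peer device 1, flags must be 0
    // Kernels launched on GPU 0 may now read/write memory allocated on GPU 1
}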

Recap

We have too many mallocs...
- cudaMalloc
- cudaMallocHost
- cudaHostAlloc(..., cudaHostAlloc{Portable, Mapped})
- cudaMallocManaged
...and too many memcpys:
- cudaMemcpy(dst, src, size, cudaMemcpy{HostToDevice, DeviceToHost, DeviceToDevice, Default})
- cudaMemcpyAsync
- cudaMemcpyPeer[Async]
Quiz: do you remember what they do and when we should use them?

Recap: evolution of the memory model

New features keep moving away from the split memory model:
- Device-mapped memory: the GPU can map and access CPU memory. Lower bandwidth than DMA transfers, higher latency than GPU memory access.
- Unified virtual addressing (CC 2.0): coordinates the CPU and GPU memory spaces. Addresses are unique across the system; no need to specify the direction in cudaMemcpy.
- Unified memory (CC 3.x): CPU and GPU share a managed memory space. The driver manages transfers automatically, but only for memory allocated as managed. Unmanaged memory plus cudaMemcpy is still useful for optimized transfers.

Outline
- Device-mapped memory, unified virtual address space, unified memory
- Dynamic parallelism
- Other memory spaces: texture, constant
- Pitfalls and fallacies

Dynamic parallelism

Kernels can launch kernels, using the same syntax as host code:
kernel<<<blocks, threads, smem, stream>>>(arguments);
Use streams for parallel kernel calls from the same block.
(Diagram: kernel launches reach the block scheduler both from the CPU, through the command queue, and from the SMs, through push/pop operations.)
Applications: traversal of graphs and irregular structures, multi-resolution finite element analysis.
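
A minimal device-side launch sketch; child and parent are illustrative kernels, and the code needs CC 3.5+ plus relocatable device code (nvcc -rdc=true) to build:

__global__ void child(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

__global__ void parent(float *data, int n)
{
    if (threadIdx.x == 0) {
        // Same launch syntax as on the host, executed from the GPU
        child<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();   // device-side wait for the child grid
    }
}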

Texture memory

Primarily made for graphics:
- Optimized for 2D accesses: takes advantage of locality in space
- Caches are not coherent: usually read-only
Initialization through specific host functions: cudaCreateTextureObject(), cudaBindTexture2D()...
Access through specific device functions: tex1D(), tex2D(), tex3D()...
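
A hedged sketch of the texture-object path for a 2D float array; d_data, width and height are assumed to come from the caller, and a real pitch-linear texture requires the row pitch to satisfy the device's alignment constraint (e.g. allocate with cudaMallocPitch):

cudaTextureObject_t make_tex(float *d_data, int width, int height)
{
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypePitch2D;
    resDesc.res.pitch2D.devPtr = d_data;
    resDesc.res.pitch2D.desc = cudaCreateChannelDesc<float>();
    resDesc.res.pitch2D.width = width;
    resDesc.res.pitch2D.height = height;
    resDesc.res.pitch2D.pitchInBytes = width * sizeof(float);

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;
    texDesc.filterMode = cudaFilterModePoint;
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;

    cudaTextureObject_t tex;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    return tex;
}

// Device side: float v = tex2D<float>(tex, x, y);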

Bilinear texture filtering

A fixed-function unit in the texture read path interpolates the color of a pixel from the texels T[i,j], T[i,j+1], ... surrounding the sample point.
(Diagram: interpolation weights α, β, γ; coordinates are aligned from floating point to fixed point with a shared exponent, then a MAC unit computes FP*FX+FP->FP.)
Applications: graphics, image processing. Can implement 2D lookup tables with interpolation.
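
The interpolation itself is the standard weighted average of the four surrounding texels; a plain-C sketch (ignoring the half-texel offset the hardware applies and the reduced fixed-point precision of the weights):

float bilinear(const float *T, int width, int height, float x, float y)
{
    int i = (int)x, j = (int)y;
    float alpha = x - i, beta = y - j;          // fractional parts = weights
    int i1 = (i + 1 < width)  ? i + 1 : i;      // clamp at the border
    int j1 = (j + 1 < height) ? j + 1 : j;
    return (1 - alpha) * (1 - beta) * T[j  * width + i ]
         +      alpha  * (1 - beta) * T[j  * width + i1]
         + (1 - alpha) *      beta  * T[j1 * width + i ]
         +      alpha  *      beta  * T[j1 * width + i1];
}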

Constant memory

Array or variable declared as __constant__. Initialized in place or from host code, then read-only.
Used automatically by the compiler for compile-time constants in the code.
Single-ported memory: good performance when all threads in a warp access the same word (broadcast); similar to shared memory with a single bank. Generally not worth using with divergent accesses.
Being phased out in CUDA, but may appear in existing code.
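
A minimal sketch of the broadcast-friendly pattern, with illustrative names; cudaMemcpyToSymbol is the standard way to fill a __constant__ array from the host:

__constant__ float coeffs[16];

__global__ void apply(float *data, int n, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Every thread of the warp reads the same word coeffs[k]: broadcast, full speed
    if (i < n) data[i] *= coeffs[k];
}

void setup_and_launch(float *d_data, int n)
{
    float h_coeffs[16] = {1.0f, 0.5f};            // remaining entries are zero
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));
    apply<<<(n + 255) / 256, 256>>>(d_data, n, 0);
}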

Cached constants in global memory

Newer NVIDIA GPU architectures have a non-coherent cache for constants, which complements or replaces the constant memory space. It is reached through a special load instruction (in PTX or assembly).
Two ways to use it at the C for CUDA level:
- Automatic: declare all pointers to constant data as const __restrict__. The compiler may then be able to prove the accesses are read-only.
- Manual: force use of the cache with the __ldg() intrinsic.
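
Both variants on a toy scaling kernel, as a sketch:

__global__ void scale_auto(float *y, const float * __restrict__ x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // const + __restrict__ lets the compiler prove x is read-only here
    if (i < n) y[i] = a * x[i];
}

__global__ void scale_ldg(float *y, const float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // __ldg() forces the load through the read-only (non-coherent) cache, CC 3.5+
    if (i < n) y[i] = a * __ldg(&x[i]);
}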

Conclusion: trends

GPUs are rapidly evolving hardware and software, going towards tighter CPU-GPU coupling: on-chip integration, shared physical memory, shared virtual memory space.
Most development effort is going into mobile:
- NVIDIA: the Kepler GPU in Tegra K1 supports CUDA
- Many GPUs support OpenCL: ARM Mali, Qualcomm Adreno, Imagination Technologies PowerVR Rogue...
Still much work to do at the operating system level: the GPU is currently a second-class citizen. We need to move from a CPU+devices model to a heterogeneous compute model.

References and further reading/watching

- CUDA C Programming Guide
- Mark Harris. Introduction to CUDA C. http://developer.nvidia.com/cuda-education
- David Luebke, John Owens. Intro to Parallel Programming. Online course. https://www.udacity.com/course/cs344
- Paulius Micikevicius. GPU Performance Analysis and Optimization. GTC 2012. http://on-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-GPU-Performance-Analysis.pdf

Pitfalls and fallacies

GPUs are 100x faster than CPUs

GPUs are 100x faster than CPUs

Wrong:
- The gap in peak performance (compute and memory) is about 10x; in practice, expect 5x to 8x against optimized CPU code.
Right, you can get a 100x speedup when porting an application to GPU:
- You get 10x from the higher GPU throughput... and 10x more because you parallelized and optimized the application.
- But spending the same effort on optimizing for the CPU would also yield a 10x improvement: modern CPUs are highly parallel too.
Right, if you use GPU hardware features not available on CPUs: texture filtering, elementary functions (exp, log, sin, cos)...
Victor W. Lee, et al. "Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU." ISCA 2010.

GPUs have thousands of cores

GPUs have thousands of cores

The official definition of a CUDA core is one execution unit.
Right: by that definition GPUs have thousands of cores, but using the same metric, CPUs have hundreds of cores!
The closest equivalent to a CPU core is a GPU SM.
Wrong: there are only 15 to 20 SMs on a GPU. They do have wide SIMD units inside, but just 2x to 4x wider than on a CPU.
Often-overlooked aspect: GPUs have tens of thousands of threads. But a CPU thread is roughly equivalent to a GPU warp, which still leaves thousands of warps.

I can avoid divergence with predication

I can avoid divergence with predication

Do not confuse predication with prediction!
In a superscalar CPU, branches are predicted:
- The cost of a misprediction is on the order of the pipeline depth, independent of the length of the mispredicted path.
- Modern branch predictors are very good at finding correlations, but are inefficient on short ifs governed by random conditions. Predication might help in this case.
In a GPU, instructions are predicated:
- The cost of divergence is proportional to the size of the divergent section.
- GPU divergence management has very low overhead, but is inefficient on long ifs/loops governed by divergent conditions. Predication does not help in this case.