GPU programming. Dr. Bernhard Kainz


Overview
- About myself
- Motivation
- GPU hardware and system architecture
- GPU programming languages
- GPU programming paradigms
- Pitfalls and best practice
- Reduction and tiling examples
- State-of-the-art applications
(This week / Next week)

About myself
- Born, raised, and educated in Austria
- PhD in interactive medical image analysis and visualisation
- Marie Curie Fellow, Imperial College London, UK
- Senior research fellow, King's College London
- Lecturer in high-performance medical image analysis at DOC
- Many years of GPU programming experience

History

GPUs
- GPU = graphics processing unit
- GPGPU = general-purpose computation on graphics processing units
- CUDA = Compute Unified Device Architecture
- OpenCL = Open Computing Language
Images: www.geforce.co.uk

History
(Timeline figure.) First dedicated GPUs appeared around 1998, followed by programmable shaders; Brook (ca. 2004) pioneered stream computing on GPUs; CUDA (2007) and OpenCL (2008) established general-purpose GPU programming; modern interfaces to CUDA and OpenCL (Python, Matlab, etc.) followed, alongside other graphics-related developments.

Why GPUs became popular
http://www.computerhistory.org/timeline/graphics-games/

Why GPUs became popular for computing
(Figure: CPU scaling trends, with Intel Sandy Bridge and Haswell marked.) Single-core performance growth has stalled; see Herb Sutter, "The free lunch is over".

(Figures from the CUDA C Programming Guide: GPU vs. CPU floating-point throughput and memory bandwidth over time.)

Motivation

Parallelisation (c = a + b)
One thread:
Thread 0: for (int i = 0; i < N; ++i) c[i] = a[i] + b[i];

Parallelisation (c = a + b)
Two threads:
Thread 0: for (int i = 0; i < N/2; ++i) c[i] = a[i] + b[i];
Thread 1: for (int i = N/2; i < N; ++i) c[i] = a[i] + b[i];

Parallelisation (c = a + b)
Three threads:
Thread 0: for (int i = 0; i < N/3; ++i) c[i] = a[i] + b[i];
Thread 1: for (int i = N/3; i < 2*N/3; ++i) c[i] = a[i] + b[i];
Thread 2: for (int i = 2*N/3; i < N; ++i) c[i] = a[i] + b[i];

Multi-core CPU
(Schematic: a large Control unit and Cache serving a few ALUs, backed by DRAM.)

Parallelisation (c = a + b)
One thread per element:
c[0] = a[0] + b[0];
c[1] = a[1] + b[1];
c[2] = a[2] + b[2];
c[3] = a[3] + b[3];
c[4] = a[4] + b[4];
c[5] = a[5] + b[5];
...
c[n-1] = a[n-1] + b[n-1];
c[n] = a[n] + b[n];

Multi-core GPU
(Schematic: many small ALUs with little control logic and cache, backed by DRAM.)

Terminology

Host vs. device: the CPU is the host; the GPU with its local DRAM is the device.

Multi-core GPU: current schematic (Nvidia Maxwell architecture)

Streaming Multiprocessors (SM, SMX)
- single-instruction, multiple-data (SIMD) hardware
- 32 threads form a warp
- each thread within a warp must execute the same instruction (or be deactivated)
- 1 instruction -> 32 values computed
- handle more warps than cores to hide latency

Differences CPU-GPU
Threading resources
- Host: currently only a handful of concurrent hardware threads
- Device: smallest executable unit of parallelism is the warp (32 threads)
- 768 to 1024 active threads per multiprocessor
- A device with many multiprocessors can keep tens of thousands of threads active
- Devices can schedule billions of threads in a grid
Threads
- Host: heavyweight entities; context switches are expensive
- Device: lightweight threads
- If the GPU must wait for one warp of threads, it simply begins executing work on another warp
Memory
- Host: equally accessible to all code
- Device: divided virtually and physically into different types

Flynn's Taxonomy
- SISD: single-instruction, single-data (single-core CPU)
- MIMD: multiple-instruction, multiple-data (multi-core CPU)
- SIMD: single-instruction, multiple-data (data-based parallelism)
- MISD: multiple-instruction, single-data (fault-tolerant computers)

Amdahl's Law
- Sequential vs. parallel execution
- Performance benefit: S(N) = 1 / ((1 - P) + P/N)
- P: parallelizable part of the code
- N: number of processors

SM Warp Scheduling
- SM hardware implements zero-overhead warp scheduling
- Warps whose next instruction has its operands ready for consumption are eligible for execution
- Eligible warps are selected for execution based on a prioritized scheduling policy
- All threads in a warp execute the same instruction when selected
- Currently: ready-queue and memory-access scoreboarding
- Thread and warp scheduling are active topics of research!

Programming GPUs

Programming languages
OpenCL (Open Computing Language):
- An open, royalty-free standard for cross-platform, parallel programming of modern processors
- An Apple initiative, approved by Intel, Nvidia, AMD, etc.
- Specified by the Khronos group (same as OpenGL)
- Intends to unify access to heterogeneous hardware accelerators
  - CPUs (e.g. Intel i7)
  - GPUs (Nvidia GTX & Tesla, AMD/ATI 58xx, ...)
- What's the difference to other languages?
  - Portability across Nvidia, ATI, and other platforms + CPUs
  - Slow or no implementation of new/special hardware features

Programming languages
CUDA:
- Compute Unified Device Architecture
- Nvidia GPUs only!
- Open source announcement
- Does not provide a CPU fallback
- Community: the NVIDIA CUDA forums and the Stackoverflow CUDA tag are considerably more active than their OpenCL counterparts
- Raw math libraries in NVIDIA CUDA: CUBLAS, CUFFT, CULA, Magma
- New hardware features immediately available!

Installation
- Download and install the newest driver for your GPU!
- OpenCL: get the SDK from Nvidia or AMD
- CUDA: https://developer.nvidia.com/cuda-downloads
- CUDA nvcc compiler -> easy access via CMake and .cu files
- OpenCL -> no special compiler, runtime compilation
- Integrated Intel graphics -> No No No!

Writing parallel code
- Current GPUs have thousands of cores (GTX TITAN, Tesla K80, etc.)
- Need more threads than cores (warp scheduler)
- Writing different code for each thread / warp? No:
- Single-program, multiple-data (SPMD) model
  - Write one program that is executed by all threads

CUDA C
CUDA C is C (C++) with additional keywords to control parallel execution:
- Type qualifiers: __global__, __device__, __shared__, __constant__
- Keywords: threadIdx, blockIdx
- Intrinsics: __syncthreads(), __any()
- Runtime API: cudaSetDevice, cudaMalloc
- GPU function launches: func<<<...>>>(...)

GPU code (device code):
__global__ void func(int* mem)
{
    __shared__ int y[];
    y[threadIdx.x] = blockIdx.x;
    __syncthreads();
}

CPU code (host code):
cudaMalloc(&d_mem, bytes);
func<<<...>>>(d_mem);

Kernel
- A function that is executed on the GPU
- Each started thread executes the same function

__global__ void myFunction(float *input, float *output)
{
    *output = *input;
}

- Indicated by __global__
- Must have return type void

Parallel kernel: the kernel launch is split up into blocks of threads.

Launching a kernel
- A function that is executed on the GPU
- Each started thread executes the same function

dim3 blockSize(...);
dim3 gridSize((iSpaceX + blockSize.x - 1) / blockSize.x,
              (iSpaceY + blockSize.y - 1) / blockSize.y);
myFunction<<<gridSize, blockSize>>>(input, output);

- Indicated by __global__
- Must have return type void

Distinguishing between threads
- Using threadIdx and blockIdx, execution paths can be chosen
- With blockDim and gridDim, the number of threads can be determined

__global__ void myFunction(float *input, float *output)
{
    uint bid = blockIdx.x + blockIdx.y * gridDim.x;
    uint tid = bid * (blockDim.x * blockDim.y)
             + (threadIdx.y * blockDim.x) + threadIdx.x;
    output[tid] = input[tid];
}

Distinguishing between threads: blockIdx and threadIdx
(Figure: a 2D grid of thread blocks; each thread is labelled with its (blockIdx, threadIdx) pair, showing how the pairs tile the whole index space.)

Grids, Blocks, Threads

Blocks
Threads within one block
- are executed together
- can be synchronized
- can communicate efficiently
- share the same local cache
-> can work on a goal cooperatively

Blocks
Threads of different blocks
- may be executed one after another
- cannot synchronize with each other
- can only communicate inefficiently
-> should work independently of other blocks

Block Scheduling
- A block queue feeds the multiprocessors
- The number of available multiprocessors determines the number of concurrently executed blocks

Blocks to warps
- On each multiprocessor, each block is split up into warps
- Threads with the lowest IDs map to the first warp (threads 0-31 form warp 0, threads 32-63 form warp 1, and so on)

Where to start
CUDA programming guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
OpenCL:
http://www.nvidia.com/content/cudazone/download/opencl/nvidia_opencl_programmingguide.pdf
http://developer.amd.com/tools-and-sdks/opencl-zone/
