GPU Computing: Development and Analysis, Part 1. Anton Wijs, Muhammad Osama, Marieke Huisman, Sebastiaan Joosten

NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven

Who are we? Anton Wijs: assistant professor, Software Engineering & Technology, TU Eindhoven. Developing and integrating formal methods for model-driven software engineering; verification of model transformations; automatic generation of (correct) parallel software; accelerating formal methods with multi-/many-threading. Muhammad Osama: PhD student, Software Engineering & Technology, TU Eindhoven. GEARS: GPU Enabled Accelerated Reasoning about System designs; GPU-accelerated SAT solving.

Schedule GPU Computing. Tuesday 12 June, afternoon: intro to GPU computing. Wednesday 13 June, morning/afternoon: formal verification of GPU software; afternoon: optimised GPU computing (to perform model checking).

Schedule of this afternoon: 13:30-14:00 Introduction to GPU Computing; 14:00-14:30 High-level intro to the CUDA Programming Model; 14:30-15:00 1st hands-on session; 15:00-15:15 Coffee break; 15:15-15:30 Solution to the first hands-on session; 15:30-16:15 CUDA Programming Model Part 2, with 2nd hands-on session; 16:15-16:40 CUDA program execution.

Before we start, you can already do the following: install VirtualBox (virtualbox.org) and download the VM file with scp gpuser@131.155.68.95:gpututorial.ova . in a terminal (Linux/Mac) or with WinSCP (Windows); password: cuda2018. Alternatively, download https://tinyurl.com/y9j5pcwt (10 GB) or copy the file from a USB stick.

We will cover approximately the first five chapters.

Introduction to GPU Computing

What is a GPU? Graphics Processing Unit The computer chip on a graphics card General Purpose GPU (GPGPU)

Graphics in 1980

Graphics in 2000

Graphics now

General Purpose Computing. Graphics processing units (GPUs) are used for numerical simulation, media processing, medical imaging, machine learning, and more. "GPUs are a gateway to the future of computing" (Communications of the ACM 59(9):14-16, September 2016). Example: deep learning; in 2011-12, GPUs dramatically increased performance.

Compute performance (According to Nvidia)

GPUs vs supercomputers?

Oak Ridge's Titan. Number 3 in the TOP500 list: 27.113 petaflops peak, 8.2 MW power. 18,688 AMD Opteron processors x 16 cores = 299,008 cores; 18,688 Nvidia Tesla K20X GPUs x 2,688 cores = 50,233,344 cores.

CPU vs GPU Hardware. Different goals produce different designs: the GPU assumes the workload is highly parallel, while the CPU must be good at everything, parallel or not. CPU: minimize the latency experienced by one thread; big on-chip caches; sophisticated control logic. GPU: maximize the throughput of all threads; multithreading can hide latency, so no big caches; control logic is much simpler and is shared across many threads.

It's all about the memory

Many-core architectures From Wikipedia: A many-core processor is a multicore processor in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient largely because of issues with congestion in supplying instructions and data to the many processors.

Integration into the host system: PCI-e 3.0 achieves about 16 GB/s. Comparison: GPU device memory bandwidth is 320 GB/s for a GTX 1080.

Why GPUs? Performance: large-scale parallelism. Power efficiency: transistors are used more efficiently; the #1 system in the Green500 uses NVIDIA Tesla P100 GPUs. Price: huge market, mass production, economies of scale; gamers pay for our HPC needs!

When to use GPU computing? When there are thousands or even millions of elements that can be processed in parallel. Very efficient for algorithms that have high arithmetic intensity (lots of computations per element), have regular data access patterns, do not have a lot of data dependencies between elements, and do the same set of instructions for all elements.

A high-level intro to the CUDA Programming Model

CUDA Programming Model. Before we start: I'm going to explain the CUDA programming model. I'll try to avoid talking about the hardware as much as possible; for the moment, make no assumptions about the backend or how the program is executed by the hardware. I will be using the term "thread" a lot; this stands for a thread of execution and should be seen as a parallel programming concept. Do not compare them to CPU threads.

CUDA Programming Model. The CUDA programming model separates a program into a host (CPU) part and a device (GPU) part. The host part allocates memory, transfers data between host and device memory, and starts GPU functions. The device part consists of functions that will execute on the GPU, which are called kernels. Kernels are executed by huge numbers of threads at the same time, and the data-parallel workload is divided among these threads. The CUDA programming model allows you to code for each thread individually.

Data management. The GPU is located on a separate device. The host program manages the allocation and freeing of GPU memory, and also copies data between the different physical memories: host memory and device memory, connected by a PCI Express link.
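As an illustration of this host-side data management, here is a minimal sketch inside a host function, using the standard CUDA runtime calls cudaMalloc, cudaMemcpy and cudaFree (n and the kernel launches are placeholders):

    // Allocate host and device (global) memory for n floats.
    float *h_data = (float *) malloc(n * sizeof(float));
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Copy the input from host memory to device memory over the PCI Express link.
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // ... launch kernels that read and write d_data ...

    // Copy the results back and free the device memory.
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    free(h_data);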

Thread Hierarchy. Kernels are executed in parallel by possibly millions of threads, so it makes sense to organize them in some manner: threads are grouped into thread blocks, and the thread blocks form a grid (the figure shows a 3 x 2 grid of thread blocks, each a 3 x 2 x 1 block of threads). Typical block sizes: 256, 512, 1024.
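The grid and block layout sketched above could be expressed with the dim3 type as follows (myKernel and d_data are illustrative names, not part of the hands-on code):

    dim3 grid(3, 2);        // 3 x 2 thread blocks: blockIdx.x in 0..2, blockIdx.y in 0..1
    dim3 block(3, 2, 1);    // 3 x 2 x 1 threads per block: threadIdx.x in 0..2, threadIdx.y in 0..1
    myKernel<<<grid, block>>>(d_data);   // 6 blocks x 6 threads = 36 threads in total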

Threads. In the CUDA programming model, a thread is the most fine-grained entity that performs computations. Threads direct themselves to different parts of memory using their built-in variables threadIdx.x, .y, .z (the thread index within the thread block). Example: create a single thread block of N threads; effectively, the loop is unrolled and spread across N threads, following the Single Instruction Multiple Data (SIMD) principle (see the sketch below).
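The example referred to above was roughly of the following shape (a sketch with illustrative names; the CPU loop is shown as a comment for comparison):

    // CPU version: for (int i = 0; i < N; i++) c[i] = a[i] + b[i];

    // CUDA version: one thread per element, using only threadIdx.x.
    __global__ void vector_add(float *c, const float *a, const float *b) {
        int i = threadIdx.x;       // thread index within the single thread block
        c[i] = a[i] + b[i];
    }

    // Host side: one thread block of N threads (N must not exceed the maximum
    // block size, e.g. 1024):  vector_add<<<1, N>>>(d_c, d_a, d_b);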

Thread blocks. Threads are grouped in thread blocks, allowing you to work on problems larger than the maximum thread block size. Thread blocks are also numbered, using the built-in variables blockIdx.x, .y, .z, which contain the index of each block within the grid. The total number of threads created is always a multiple of the thread block size, and therefore possibly not exactly equal to the problem size. Other built-in variables (blockDim and gridDim) describe the thread block dimensions and grid dimensions.

Mapping to hardware

Starting a kernel. The host program sets the number of thread blocks and the number of threads per block when it launches the kernel, as in the sketch below.
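A sketch of such a launch configuration (this variant of the vector_add kernel also receives the problem size n so it can guard against excess threads; see the solution sketch later on):

    int threadsPerBlock = 256;
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;   // round up
    vector_add<<<numBlocks, threadsPerBlock>>>(d_c, d_a, d_b, n);

    // numBlocks * threadsPerBlock may exceed n, so the kernel must guard its
    // memory accesses with a bounds check such as: if (i < n) { ... }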

CUDA function declarations: __device__ float DeviceFunc(), __global__ void KernelFunc(), __host__ float HostFunc(). __global__ defines a kernel function; each qualifier consists of two underscore characters before and after the keyword. A kernel function must return void. __device__ and __host__ can be used together; __host__ is optional if used alone.
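A short sketch of the three qualifiers in use (names are illustrative):

    __device__ float square(float x) {        // runs on the device, callable from device code
        return x * x;
    }

    __global__ void squares(float *out, const float *in, int n) {   // kernel, launched from the host
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = square(in[i]);
    }

    __host__ __device__ float twice(float x) {   // compiled for both the host and the device
        return 2.0f * x;
    }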

Setup hands-on session. You can already do the following: install VirtualBox (virtualbox.org) and download the VM file with scp gpuser@131.155.68.95:gpututorial.ova . in a terminal (Linux/Mac) or with WinSCP (Windows); password: cuda2018. Alternatively, download https://tinyurl.com/y9j5pcwt (10 GB) or copy the file from a USB stick.

Setup hands-on session. Import the file as an appliance in VirtualBox and start the machine. Login name: gpuser; login password: cuda2018. Launch Nsight.

First hands-on session. Start with the project vector_add in the left pane. Configure: right-click vector_add -> Properties; go to Build -> Target Systems -> Manage. Also update the Project Path.

First hands-on session. Configure: go to Run/Debug Settings, click the configuration -> Edit, select the remote connection, and set the Remote executable folder. Do these steps for the four projects in the left pane, and restart Nsight.

1st Hands-on Session. Make sure you understand everything in the code, and complete the exercise! Hints: look at how the kernel is launched in the host program; threadIdx.x is the thread index within the thread block, blockIdx.x is the block index within the grid, and blockDim.x is the dimension of the thread block.

Hint (figure): blockIdx.x numbers the thread blocks (thread block 0, 1, 2); threadIdx.x runs from 0 to 3 within each block; what is blockDim.x?

Solution. CPU implementation versus GPU implementation: create N threads using multiple thread blocks, following the Single Instruction Multiple Data (SIMD) principle (see the sketch below).
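The two implementations referred to above were along these lines (a sketch; variable names may differ from the hands-on code):

    // CPU implementation: a sequential loop over all n elements.
    void vector_add_cpu(float *c, const float *a, const float *b, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    // GPU implementation: N threads spread over multiple thread blocks.
    __global__ void vector_add(float *c, const float *a, const float *b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)                                       // guard against excess threads
            c[i] = a[i] + b[i];
    }

    // Host launch, e.g. with 256 threads per block:
    // vector_add<<<(n + 255) / 256, 256>>>(d_c, d_a, d_b, n);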

CUDA Programming model Part 2

CUDA memory hierarchy: registers (per thread), shared memory (per thread block), and global memory and constant memory (visible to the entire grid).

Hardware overview

Memory space: registers. Thread-local scalars and small constant-size arrays are stored in registers. Registers are implicit in the programming model; their behavior is very similar to that of normal local variables. They are not persistent: after the kernel has finished, the values in registers are lost (an example is sketched below).
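The example on this slide was a plain thread-local variable; in a sketch such as the one below, i and tmp are typically held in registers (illustrative code):

    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // thread-local scalar: a register
        float tmp;                                       // also a register; its value is lost
        if (i < n) {                                     // once the kernel has finished
            tmp = data[i] * factor;
            data[i] = tmp;
        }
    }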

Memory space: global. Global memory is allocated by the host program (using cudaMalloc) and initialized by the host program (using cudaMemcpy) or by previous kernels. It is persistent: values in global memory remain across kernel invocations. It is not coherent: writes by other threads will not be visible until the kernel has finished.

Memory space: constant. Constant memory is statically defined by the host program, using the __constant__ qualifier, and is defined as a global variable. It is initialized by the host program (using cudaMemcpyToSymbol). It is read-only to the GPU and cannot be accessed directly by the host. Values are cached in a special cache optimized for broadcast access by multiple threads simultaneously, so the access pattern should not depend on threadIdx.
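A sketch of constant memory in use, assuming a small array of coefficients (the names are illustrative):

    // Statically defined at global scope with the __constant__ qualifier.
    __constant__ float coeff[16];

    __global__ void apply(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= coeff[0];   // broadcast: every thread reads the same element
    }

    // Host side, before launching the kernel:
    // float h_coeff[16] = { /* ... */ };
    // cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));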

2nd Hands-on Session. Go to the project reduction and look at the source files; make sure you understand everything in the code. Task: implement the kernel to perform a single iteration of parallel reduction. Hints: it is assumed that enough threads are launched such that each thread only needs to compute the sum of two elements in the input array; in each iteration, an array of size n is reduced to an array of size n/2; each thread stores its result at a designated position in the output array.

Hint Parallel Summation
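One possible shape of the kernel asked for in this exercise (a sketch, not the only correct solution; it assumes n is even and that at least n/2 threads are launched):

    // Single iteration of parallel reduction: out[i] = in[2*i] + in[2*i + 1],
    // turning an array of size n into an array of size n/2.
    __global__ void reduce_step(float *out, const float *in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n / 2)
            out[i] = in[2 * i] + in[2 * i + 1];
    }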

Global synchronisation. CUDA has no mechanism for global synchronisation of all threads across the grid. Instead, synchronisation points are enforced by breaking the computation down into multiple kernel launches (kernel launch 0, kernel launch 1, kernel launch 2, and so on, executed one after the other).
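A host-side sketch of this idea, reusing the reduce_step kernel from the previous sketch: successive launches on the same stream execute in order, so each launch acts as a global synchronisation point (assumes n is a power of two and d_in/d_out are device buffers):

    while (n > 1) {
        int threads = 256;
        int blocks  = (n / 2 + threads - 1) / threads;
        reduce_step<<<blocks, threads>>>(d_out, d_in, n);   // one reduction iteration
        float *tmp = d_in; d_in = d_out; d_out = tmp;       // output becomes the next input
        n /= 2;
    }
    // After the loop, the final sum is in d_in[0].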

Barrier synchronisation comes in two forms. Global synchronisation is achieved between kernel launches. Intra-block synchronisation: contrary to global synchronisation, CUDA does provide a mechanism to synchronise all threads in the same block; all threads in the same block must reach the __syncthreads() barrier before any of them can move on. It is best used to split the computation of each block into several phases, and is tightly linked to the use of (block-local) shared memory, which we will address tomorrow afternoon.
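A minimal sketch of intra-block synchronisation with __syncthreads(), using block-local shared memory (shared memory itself is covered tomorrow afternoon; assumes the kernel is launched with 256 threads per block, a power of two):

    __global__ void block_sum(float *out, const float *in) {
        __shared__ float buf[256];                        // block-local shared memory
        int tid = threadIdx.x;
        buf[tid] = in[blockIdx.x * blockDim.x + tid];     // phase 1: load
        __syncthreads();                                  // all loads done before phase 2

        for (int s = blockDim.x / 2; s > 0; s /= 2) {
            if (tid < s)
                buf[tid] += buf[tid + s];
            __syncthreads();                              // whole block waits after each phase
        }
        if (tid == 0)
            out[blockIdx.x] = buf[0];                     // one partial sum per block
    }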

CUDA Program execution

Compilation. The Nvidia compiler nvcc translates a CUDA program into PTX assembly and/or CUBIN bytecode; the runtime compiler driver translates PTX into the machine-level binary that runs on the actual GPU.

Translation table (CUDA / OpenCL / OpenACC / OpenMP 4):
Grid / NDRange / compute region / parallel region
Thread block / Work group / Gang / Team
Warp / CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE / Worker / SIMD chunk
Thread / Work item / Vector / Thread or SIMD
Note that the mapping is actually implementation dependent for the open standards and may differ across computing platforms. Not too sure about the OpenMP 4 naming scheme, please correct me if wrong.

How threads are executed. Remember: all threads in a CUDA kernel execute the exact same program. Threads are actually executed in groups of 32 threads called warps (more on this tomorrow afternoon). Threads within a warp all execute one common instruction simultaneously. The context of each thread is stored separately; as such, the GPU stores the context of all currently active threads. The GPU can switch between warps even after executing only one instruction, effectively hiding the long latency of instructions such as memory loads.

Maxwell architecture (figure): a streaming multiprocessor (SM), built from blocks of 32 cores.

Maxwell architecture (figure): each SM combines a register file, blocks of 32 cores, L1 cache, and shared memory.

Resource partitioning. The GPU consists of several (1 to 56) streaming multiprocessors (SMs), which are fully independent. Each SM contains several resources: thread and thread block slots, a register file, and shared memory. SM resources are dynamically partitioned among the thread blocks that execute concurrently on the SM, resulting in a certain occupancy.

Global memory access. Global memory is cached in L2, and on some GPUs also in L1 (each SM has its own L1 cache; the L2 cache and device memory are shared by all SMs). When a thread reads a value from global memory, think about: the total number of values that are accessed by the warp that the thread belongs to; the cache-line length and the number of cache lines that those values fall into; and the alignment of the data accesses to the cache lines. A sketch of the two extremes follows.
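A sketch contrasting two access patterns: in the first kernel, the 32 threads of a warp read 32 consecutive values, which fall into a small number of cache lines; in the second, each thread strides through memory, so one warp touches many cache lines (illustrative code):

    // Coalesced: thread i reads element i of a contiguous, aligned range.
    __global__ void copy_coalesced(float *dst, const float *src, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) dst[i] = src[i];
    }

    // Strided: neighbouring threads read elements that are 'stride' apart.
    __global__ void copy_strided(float *dst, const float *src, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n) dst[i * stride] = src[i * stride];
    }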

Cached memory access. The memory hierarchy is optimized for certain access patterns: all memory accesses happen through the cache; the cache fetches memory at the granularity of cache lines; and main memory is optimized for reading in (row-wise) bursts.

Overview. CUDA programming model (API): think in terms of threads; reason about program correctness (proving correctness: tomorrow morning / afternoon!). GPU hardware: think in terms of warps; reason about program performance (tomorrow in Part 2 of GPU Development!).

To do: set up the VerCors tool. See https://github.com/utwente-fmt/vercors for the basic build instructions: clone the VerCors repository, move into the cloned directory, build VerCors with Ant, and test the build. If this fails, there will be a VM with VerCors available tomorrow. Do NOT delete your VM with Nsight, as we will use it again tomorrow afternoon!