CS 179: GPU Computing. Lecture 2: The Basics

Recap We can use the GPU to solve highly parallelizable problems, with performance benefits vs. the CPU. CUDA is a straightforward extension to the C language.

Disclaimer Goal for Week 1: a fast-paced introduction. Know enough to be dangerous. We will fill in the details later!

Our original problem Add two arrays: A[] + B[] -> C[] Goal: understand what's going on

CUDA code (first part)

Basic formula (sketched in code below):
1. Setup inputs on the host (CPU-accessible memory)
2. Allocate memory for inputs on the GPU
3. Copy inputs from host to GPU
4. Allocate memory for outputs on the host
5. Allocate memory for outputs on the GPU
6. Start the GPU kernel
7. Copy output from GPU to host
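A minimal sketch of this formula for the A[] + B[] -> C[] problem, assuming a kernel named add and N floats per array (names like dev_a are illustrative, not fixed by CUDA):

int N = 1024;
size_t size = N * sizeof(float);

// Setup inputs on the host
float *a = (float *) malloc(size);
float *b = (float *) malloc(size);
float *c = (float *) malloc(size); // host output buffer
// ... fill a[] and b[] ...

// Allocate memory for inputs and outputs on the GPU
float *dev_a, *dev_b, *dev_c;
cudaMalloc(&dev_a, size);
cudaMalloc(&dev_b, size);
cudaMalloc(&dev_c, size);

// Copy inputs from host to GPU
cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

// Start the GPU kernel: 2 blocks of 512 threads covers all 1024 elements
// (choosing these numbers is the subject of Part 2)
add<<<2, 512>>>(dev_a, dev_b, dev_c, N);

// Copy output from GPU to host
cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);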

Classic Memory Hierarchy

The GPU

Global memory

Pointers Difference between CPU and GPU pointers? None, pointers are just addresses! It's up to the programmer to keep track of which is which!

Pointers Good practice: Special naming conventions, e.g. dev_ prefix

Memory allocation With the CPU (host memory): float *c = malloc(N * sizeof(float)); Attempts to allocate the number of bytes given in the argument

Memory allocation On the GPU (global memory):
float *dev_c;
cudaMalloc(&dev_c, N * sizeof(float));
Signature: cudaError_t cudaMalloc(void **devPtr, size_t size)
Attempts to allocate #bytes in arg 2
arg 1 is a pointer to the pointer, passed into the function so it can be modified
Result after a successful call: memory allocated on the GPU at the location given by dev_c
Return value is an error code, which can be checked
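A minimal sketch of checking that error code (the error-handling style here is ours, not from the slides; fprintf and exit need <cstdio> and <cstdlib>):

float *dev_c;
cudaError_t err = cudaMalloc(&dev_c, N * sizeof(float));
if (err != cudaSuccess) {
    // cudaGetErrorString() converts the error code to a readable message
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(1);
}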

Memory copying With the CPU (host memory):
// source and destination are pointers to memory regions
memcpy(destination, source, N);
Signature: void *memcpy(void *destination, const void *source, size_t num);
Copies num bytes from (the area pointed to by) source to (the area pointed to by) destination

Memory copying cudaMemcpy() is a versatile equivalent: it can copy CPU -> GPU, GPU -> CPU, GPU -> GPU, and CPU -> CPU

Memory copying Signature: cudaError_t cudaMemcpy(void *destination, const void *source, size_t count, enum cudaMemcpyKind kind)
Values for kind:
cudaMemcpyHostToHost
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice
The kind determines whether destination and source are treated as CPU or GPU addresses
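For example, reusing the hypothetical host buffers a, c and device buffers dev_a, dev_c from earlier (N floats each):

cudaMemcpy(dev_a, a, N * sizeof(float), cudaMemcpyHostToDevice);       // CPU -> GPU
cudaMemcpy(c, dev_c, N * sizeof(float), cudaMemcpyDeviceToHost);       // GPU -> CPU
cudaMemcpy(dev_c, dev_a, N * sizeof(float), cudaMemcpyDeviceToDevice); // GPU -> GPU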

Summary of memory CPU vs. GPU pointers, cudaMalloc(), cudaMemcpy()

Part 2

Recall GPUs have lots of cores and are suited to parallel problems

GPU internals

CPU internals

GPU internals One instruction unit for multiple cores!

Warps Groups of threads simultaneously execute the same instructions! Such a group is called a warp (32 threads per warp under current standards)

Blocks Group of threads scheduled to a multiprocessor. A block contains multiple warps and has a maximum number of threads (varies by GPU, e.g. 512 or 1024)

Multiprocessor execution timeline

Thread groups:
A grid (all the threads started) contains blocks <- assigned to multiprocessors
Each block contains warps <- executed simultaneously
Each warp contains individual threads

Moral 1 (from Lecture 1): Start lots of threads! Recall: low context-switch penalty, hides latency. Start enough blocks to occupy the SMs!
Don't call: kernel<<<1, 1>>>(); // 1 block, 1 thread per block
Call: kernel<<<50, 512>>>(); // 50 blocks, 512 threads per block

Moral 2: Multiprocessors execute warps (of 32 threads), so block sizes of 32*n (integer n) are best.
Don't call: kernel<<<50, 97>>>(); // 50 blocks, 97 threads per block
Call: kernel<<<50, 128>>>(); // 50 blocks, 128 threads per block

Summary (processor internals) Key parameters on a kernel call: threads per block and number of blocks. Choose carefully!
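A common way to choose them, as a sketch (threadsPerBlock and blocks are our names): fix a block size that is a multiple of 32, then round the block count up so every element is covered.

int threadsPerBlock = 512; // a multiple of 32, within the per-block maximum
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock; // ceiling division
add<<<blocks, threadsPerBlock>>>(dev_a, dev_b, dev_c, N);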

Kernel argument passing Similar to argument passing in C functions. Some rules: don't pass host-memory pointers; small variables (e.g. individual ints) are fine; no pass-by-reference.
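A sketch of a signature that follows these rules for our hypothetical add kernel: device pointers, plus a small int passed by value.

// a, b, c must point to GPU (device) memory; N is a small value copied in
__global__ void add(float *a, float *b, float *c, int N);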

Kernel function Executed by many threads. Each thread has a unique ID mechanism: its thread index within the block, plus its block index.
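A sketch of how that ID is typically turned into a per-element index inside the kernel (one thread per array element):

__global__ void add(float *a, float *b, float *c, int N) {
    // Global index = this block's offset plus the thread's index within it
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    c[index] = a[index] + b[index]; // no bounds check yet; see below
}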

Out of bounds issue: if index >= (# of elements), illegal access!
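The usual fix is a guard, sketched here:

int index = blockIdx.x * blockDim.x + threadIdx.x;
if (index < N) // threads whose index is past the end of the array do nothing
    c[index] = a[index] + b[index];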

#Threads issue: Cannot start e.g. 1e9 threads! Threads should handle an arbitrary number of elements, regardless of the total number of blocks launched.
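A common pattern for this is a grid-stride loop, sketched below: each thread strides through the array by the total number of threads launched, so any N is handled.

__global__ void add(float *a, float *b, float *c, int N) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x; // total number of threads in the grid
    // Each thread handles elements index, index + stride, index + 2*stride, ...
    for (int i = index; i < N; i += stride)
        c[i] = a[i] + b[i];
}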

GPU -> CPU The destination is a host memory pointer (copy to here); the source is a device memory pointer (copy from here)

cudaFree() Equivalent to host memory's free() function. As on the host, free memory after completion!
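Putting the last two slides together, a sketch of the end of the program (reusing our earlier hypothetical names):

cudaMemcpy(c, dev_c, N * sizeof(float), cudaMemcpyDeviceToHost); // GPU -> CPU
cudaFree(dev_c); // release GPU global memory
free(c);         // release host memory as usual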

Summary GPU global memory: pointers (CPU vs. GPU), cudaMalloc() and cudaMemcpy(). GPU processor details: thread group hierarchy, launch parameters, threads in a kernel.