Basic CUDA Workshop. Penporn Koanantakool. Outline: Setting Up Your Machine, Architecture, Getting Started, Programming CUDA, Fine-Tuning.


Basic CUDA Workshop
Penporn Koanantakool (twitter: @kaewgb, e-mail: kaewgb@gmail.com)

Outline
- Setting Up Your Machine
- Architecture
- Getting Started
- Programming CUDA
- Debugging
- Fine-Tuning

Setting Up Your Machine

Disclaimer: this workshop assumes everyone already knows C/C++, e.g.:
- Loops: for(i = 0; i < n; i++) printf("%d\n", i);
- Arrays: array[i] = array[i-2] + array[i-1];
- Functions: void something(){ printf("something\n"); }
- Dynamic memory allocation: array = (float *) malloc(n*sizeof(float));

Maeka Cluster
- Log in to the Maeka cluster using the account provided.
- Change the password with the command passwd.
- Create hello.c either with an editor on the cluster (nano, vim, emacs, etc.) or with your editor on your own machine, then transfer the file via FTP.
- Type the following classic code:

    #include <stdio.h>
    int main(){
        printf("Hello, world!!!\n");
        return 0;
    }

- Compile and run:

    gcc -o hello hello.c
    ./hello

  You should get something like:

    [kaewgb@maeka ~]$ ./hello
    Hello, world!!!

Installation
- Go to the official CUDA page: http://developer.nvidia.com/object/cuda_3_1_downloads.html
- Download and install the CUDA driver, only if your machine has an NVIDIA GPU.
- Download and install the CUDA toolkit (SDK). Linux (.run): chmod +x the file, then run it with ./
- Download and install the GPU Computing SDK Code Samples (CUDA 2.3):
  http://developer.download.nvidia.com/compute/cuda/2_3/sdk/cudasdk_2.3_linux.run
- Build the samples:

    cd ~/NVIDIA_GPU_Computing_SDK/C
    make

  Sample programs will be in ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release

IDE Syntax Highlighting

Visual Studio (Windows)
- Refer to C:\NVIDIA GPU Computing SDK\C\doc\syntax_highlighting\visual_studio_8\readme.txt
- May use Visual Assist X for auto-completion.

Code::Blocks (Windows/Linux)
- Open C:\Program Files\CodeBlocks\share\CodeBlocks\lexers\lexer_cpp.xml
- Add .cu (and .cuh, if you use it) to the filemasks attribute in the Lexer tag:
  filemasks="*.c,*.cpp,*.cc,*.cxx,*.h,*.hpp,*.hh,*.hxx,*.inl,*.cu,*.cuh"

Hardware Architecture in General

NVIDIA GPU Architecture
- Highly parallel, multithreaded, many-core processor.
- Example: NVIDIA Tesla M2050
  - 14 Streaming Multiprocessors (SM), each with 32 Streaming Processors (SP)
  - Per SM, on-chip: 16KB shared memory and a 16K-register file
  - Off-chip: 4GB video memory / global memory
[Figure: one SM with its SPs, shared memory, and register file]

Let's begin! How to get started.

Drill #0
- Try running some test programs from the GPU Computing SDK Code Samples in NVIDIA_GPU_Computing_SDK/C/bin/linux/release/
- Many programs use X Windows, so if you're not on Linux you may not be able to try them. Just skip those.
- Run by prefixing the whole path, like this:
  [kaewgb@maeka ~]$ NVIDIA_GPU_Computing_SDK/C/bin/linux/release/bandwidthTest
- Don't cd into the folder and run: that folder is actually a symbolic link, and the actual folder is in /share/apps/NVIDIA_GPU_Computing_SDK/. Some test programs try to write files at the running location and might have permission problems.

Useful Sample Programs
- deviceQuery: driver version, SDK (toolkit) version, compute capability, hardware/software specifications
- bandwidthTest: peak bandwidth (practical/experimental)
- etc. Many more in newer SDKs (not in SDK 2.3).

Compute Capabilities
- A number that tells how much a GPU is capable of: 1.0, 1.1, 1.2, ...
- 1.3: supports double-precision floating point.
- 2.0 (Fermi): global memory cache, native printf().
- Learn more in the CUDA C Programming Guide, Appendix A (CUDA-Enabled GPUs) and Appendix G (Compute Capabilities).

A Closer Look at the Specifications
- Multiprocessors x Cores/MP = Cores: 14 (MP) x 32 (Cores/MP) = 448 (Cores)
- Total amount of constant memory: 65536 bytes
- Total amount of shared memory per block: 49152 bytes
- Total number of registers available per block: 32768
- Warp size: 32
- Maximum number of threads per block: 1024
- Maximum sizes of each dimension of a block: 1024 x 1024 x 64
- Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
What is this all about!?

Programming CUDA
Terminology, memory hierarchy, APIs, etc.

Kernels
- A function that is executed on the GPU.
- Many threads execute the same instructions together: SPMD (Single Program Multiple Data).

Multithreaded
- Massive number of threads.
- GPU threads are of much lighter weight than CPU threads: they take very few cycles to generate and schedule.
- Each data element is computed simultaneously: thread #i computes c_i = a_i + b_i, one thread per element, even for n = 1 billion.
[Figure: threads #0..#n, each adding one pair a_i + b_i = c_i]

Threads can be structured in 2D, e.g. one thread t(i,j) per output element c(i,j) of a matrix.
[Figure: a 4x4 array of threads t(0,0)..t(3,3) mapped onto 4x4 matrices a, b, c]

Dimensional Structure
- Threads are grouped into Blocks for management reasons.

Dimensional Structure
- Threads are grouped into Blocks for management reasons.
- Each Block is executed simultaneously on one Core (Streaming Multiprocessor).
[Figure: four blocks b(0,0)..b(1,1), each containing a 2x2 group of threads]

Dimensional Structure (cont.)
Automatic Scalability
[Figures courtesy of NVIDIA]

Heterogeneous Computing
- Kernel launches are asynchronous: the CPU launches a kernel and continues its own work.
- Example timeline: CPU runs Task#0; launches Kernel#0 (Grid_0: blocks 0-3) on the GPU; runs Task#1; Synchronize(); launches Kernel#1 (Grid_1: blocks 0-1); runs Task#2; Synchronize().
- Each kernel has independent #blocks and #threads.

Normal Program Flow
1. Allocate device memory
2. Transfer input into device memory
3. Compute (call the kernel function)
4. Transfer output back to host memory
5. Free unused memory

Terminology check: Thread, Block, Grid, Kernel

Vector Addition: Host

    float a[MAX], b[MAX], c[MAX];

    int main(){
        float *d_a, *d_b, *d_c;

        // Device memory allocation
        cudaMalloc((void **) &d_a, MAX*sizeof(float));
        cudaMalloc((void **) &d_b, MAX*sizeof(float));
        cudaMalloc((void **) &d_c, MAX*sizeof(float));

        // Copy input to device memory
        cudaMemcpy(d_a, a, MAX*sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, MAX*sizeof(float), cudaMemcpyHostToDevice);

        // Run kernel
        dim3 dimBlock(32);
        dim3 dimGrid(MAX/32);
        vecAdd<<<dimGrid, dimBlock>>>(d_a, d_b, d_c);

        // Copy output back
        cudaMemcpy(c, d_c, MAX*sizeof(float), cudaMemcpyDeviceToHost);

        // Free device memory
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }

API List (Part I)
- cudaError_t cudaMalloc(void **devPtr, size_t size)
- cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind)
- cudaError_t cudaFree(void *devPtr)
- cudaError_t cudaGetSymbolAddress(void **devPtr, const char *symbol)
- cudaError_t tells the execution result. Checking errors:
  if(cudaMalloc((void **) &d_a, 4096) != cudaSuccess) ...
- Refer to the CUDA Reference Manual for more details.
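The if(... != cudaSuccess) check above gets tedious to repeat, so a common pattern is to wrap every runtime call in a small checking macro. This is a generic sketch, not part of the workshop code; the macro name CUDA_CHECK is my own:

```cuda
#include <stdio.h>
#include <stdlib.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",           \
                    __FILE__, __LINE__, cudaGetErrorString(err));  \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Usage, e.g. in the vector-addition host code:
//   CUDA_CHECK(cudaMalloc((void **) &d_a, MAX * sizeof(float)));
//   CUDA_CHECK(cudaMemcpy(d_a, a, MAX * sizeof(float), cudaMemcpyHostToDevice));
```

The do/while(0) wrapper lets the macro behave like a single statement after an if or inside a loop.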

dim3
- An integer vector type based on uint3, used to specify dimensions (block/grid).
- Any component left unspecified is initialized to 1.
- Syntax: dim3 <name>(x[, y, z])   ([] = optional)
- Examples:

    dim3 abc(1, 2, 3);
    dim3 haha(13);      // same as dim3 haha(13, 1, 1);

Kernel Call Syntax

    KernelName<<<#blocks per grid, #threads per block>>>(parameter list)

#blocks per grid and #threads per block can be:
- dim3 variables:

    dim3 dimGrid(2, 2);
    dim3 dimBlock(8, 8, 8);
    vecAdd<<<dimGrid, dimBlock>>>(d_a, d_b, d_c);

- A natural number (if only 1 dimension):

    vecAdd<<<4, 512>>>(d_a, d_b, d_c);

- Of course, you can mix them:

    vecAdd<<<4, dimBlock>>>(d_a, d_b, d_c);

Kernel Code
- Standard C. Can use the standard math library (math.h).
- Can't use other headers such as stdio.h or stdlib.h. This means a kernel can't print anything out, except on compute capability 2.0 (Fermi), which supports printf() natively -- and we are on Fermi!
- Syntax: __global__ void <name>(<parameter list>)

Function Type Qualifiers

    Qualifier              Executed on   Callable from
    __device__             Device        Device
    __global__             Device        Host
    __host__               Host          Host
    __host__ __device__    Host/Device   Host/Device

Built-in Variables (Hardware Registers)
- threadIdx (thread index): .x, .y, .z
- blockDim (block dimension): number of threads in each dimension; .x, .y, .z
- blockIdx (block index): .x, .y
- gridDim (grid dimension): number of blocks in each dimension; .x, .y
- warpSize: number of threads per warp

Vector Addition: Device

    __global__ void vecAdd(float *d_a, float *d_b, float *d_c){
        int i;
        i = blockIdx.x * blockDim.x + threadIdx.x;
        d_c[i] = d_a[i] + d_b[i];
    }

Compiling
- nvcc: just like gcc. Kernels must be in .cu files.

    nvcc -o vecadd vecadd.cu

- Can link with object files compiled from gcc:

    gcc -c main.c
    nvcc -o vecadd main.o vecadd.cu

  or

    nvcc -c vecadd.cu
    nvcc -o vecadd main.o vecadd.o

Drill #1
- Download cuda.zip from http://203.159.10.25/cuda/cuda.zip or cuda.tar from http://203.159.10.25/cuda/cuda.tar
- Write a 1-D vector addition program: fixed size, 1024 int elements; find C = A + B.
- Try compiling and running by yourself. Use my template: templates/vecadd.cu
- All solutions for all tasks are in solutions/
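Putting the host code and the kernel from the previous slides together, one possible shape for Drill #1 is sketched below (floats shown to match the earlier slides, though the drill itself asks for int; the official solution is in solutions/):

```cuda
#include <stdio.h>

#define MAX 1024

__global__ void vecAdd(float *d_a, float *d_b, float *d_c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_c[i] = d_a[i] + d_b[i];
}

int main() {
    float a[MAX], b[MAX], c[MAX];
    float *d_a, *d_b, *d_c;

    for (int i = 0; i < MAX; i++) { a[i] = i; b[i] = 2 * i; }

    cudaMalloc((void **) &d_a, MAX * sizeof(float));
    cudaMalloc((void **) &d_b, MAX * sizeof(float));
    cudaMalloc((void **) &d_c, MAX * sizeof(float));
    cudaMemcpy(d_a, a, MAX * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, MAX * sizeof(float), cudaMemcpyHostToDevice);

    vecAdd<<<MAX / 32, 32>>>(d_a, d_b, d_c); // 32 blocks of 32 threads

    cudaMemcpy(c, d_c, MAX * sizeof(float), cudaMemcpyDeviceToHost);
    printf("c[10] = %f\n", c[10]); // 10 + 20 = 30
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

Compile with nvcc -o vecadd vecadd.cu, exactly as on the Compiling slide.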

APIs So Far
Refer to the CUDA C Reference Manual for usage details.
- Memory APIs: cudaMalloc, cudaFree, cudaMemcpy, cudaGetSymbolAddress
- Synchronization:
  __syncthreads()            (block level)
  cudaThreadSynchronize()    (grid level)

Multidimensional arrays are actually stored in 1 dimension.

Drill #2
- Write a matrix addition program: fixed size, 2048x2048 int; find C = A + B.
- Use my template: templates/matadd.cu
- APIs:

    cudaError_t cudaMallocPitch(void **devPtr, size_t *pitch, size_t width, size_t height)
    cudaError_t cudaMemcpy2D(void *dst, size_t dpitch, const void *src, size_t spitch, size_t width, size_t height, enum cudaMemcpyKind kind)

Drill #3
- Write a matrix multiplication program: fixed size, 2048x2048; find C = AB.
- Use my template: templates/matmul.cu

Fine-Tuning
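For Drill #3, the usual starting point is a naive global-memory kernel: one thread per output element, each looping over a row of A and a column of B. This is a sketch, not the template's code; the 16x16 block shape is my choice:

```cuda
#define N 2048

// C = A * B for N x N row-major matrices; one thread per element of C.
__global__ void matMul(float *A, float *B, float *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    float sum = 0.0f;
    for (int k = 0; k < N; k++)
        sum += A[row * N + k] * B[k * N + col];
    C[row * N + col] = sum;
}

// Launch (N is a multiple of 16, so no bounds check is needed):
//   dim3 dimBlock(16, 16);
//   dim3 dimGrid(N / 16, N / 16);
//   matMul<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
```

Every thread re-reads whole rows and columns from global memory, which is exactly what the shared-memory revision in the Fine-Tuning section improves.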

Overall Optimization Strategies
- Maximize parallel execution: expose data parallelism in algorithms; overlap memory access with computation; keep the hardware busy; reduce divergence.
- Maximize memory bandwidth: access data efficiently.
- Maximize instruction throughput: loop unrolling.

Memory Space
- Off-chip (slow): Global, Local, Texture (cached)
- On-chip (fast): Registers, Shared Memory, Constant Memory
[Figure: CUDA memory spaces, from http://i.cmpnet.com/ddj/images/article/2008/0806/080604cuda4_f1.gif]

Scopes (courtesy of David Kirk & NVIDIA)

    Variable Declaration                    Memory    Scope    Lifetime
    Automatic variables other than arrays   Register  Thread   Kernel
    Automatic array variables               Global    Thread   Kernel
    __device__ __shared__ int SharedVar;    Shared    Block    Kernel
    __device__ int GlobalVar;               Global    Grid     Application
    __device__ __constant__ int ConstVar;   Constant  Grid     Application

Example:

    int b;
    __constant__ int c;
    __device__ int d;

    __global__ void myKernel(){
        int a, array[10];
        __shared__ float temp[20];
    }

    __device__ float square(float num){
        return num*num;
    }

Drill #3 (again)
- Revise your matrix multiplication program to use shared memory.
- Fixed size 2048x2048; find C = AB.
- Use my template: templates/matmul.cu

Warps
- A 32-thread unit. Only 1 warp is actually executed by the hardware at any point in time. <more on the whiteboard>
- Zero-overhead scheduling
- Long latency operations
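A shared-memory revision of Drill #3 stages TILE x TILE sub-blocks of A and B in shared memory, so each global-memory element is loaded once per block instead of once per thread. A sketch, assuming N is a multiple of TILE (both constants are my choices, not the template's):

```cuda
#define N 2048
#define TILE 16

__global__ void matMulShared(float *A, float *B, float *C) {
    __shared__ float As[TILE][TILE]; // current tile of A
    __shared__ float Bs[TILE][TILE]; // current tile of B

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE; t++) {
        // Each thread loads one element of each tile from global memory.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads(); // wait until the whole tile is loaded

        for (int k = 0; k < TILE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads(); // don't overwrite tiles others are still reading
    }
    C[row * N + col] = sum;
}
```

Both __syncthreads() calls are block-level barriers, which is why the tile size must match the block size (dim3 dimBlock(TILE, TILE)).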

Performance Evaluation
- Speedup
- Compare actual values against the maximum theoretical ones:
  - Memory bandwidth
  - Instruction bandwidth
  - Peak TFLOPS

Debugging

Debugging
- cuda-gdb: extended from GNU gdb. Doesn't work well when kernel code diverges (branches).
- printf: may be the best and most convenient way.
- cudaMemcpyFromSymbol(<symbol name>): copy a device symbol back to the host for inspection.

Miscellaneous

Visual Studio Integration
(to be continued)

Tutorials
- Chapters 1-7 of the online textbook of a course at Illinois: http://courses.engr.illinois.edu/ece498/al/syllabus.html
- Official CUDA page: http://developer.nvidia.com/object/cuda_3_1_downloads.html
  - CUDA C Programming Guide
  - CUDA C Best Practices Guide
  - CUDA Reference Manual
- NVIDIA Developer Zone: http://developer.nvidia.com/object/cuda_training.html

Thank you. Any questions? kaewgb@gmail.com

CPU vs. GPU Specifications (courtesy of Tokyo Institute of Technology)

                               Intel Core i7 Extreme   GeForce GTX 285   GeForce GTX 480
    Peak Performance [GFlops]  102.4                   708*, 1063        1344.96*
    Number of Processors       4                       240               480
    Core Clock [MHz]           3200                    1476              1401
    Bandwidth [GB/s]           32                      159               177.4
    Memory Interface [bit]     64                      512               384
    Memory Data Clock [MHz]    1333 (DDR3)             2484 (GDDR3)      3696 (GDDR5)
    GPU Memory Capacity [MB]   -----                   2046              1536