Introduction to GPU Programming

Introduction to GPU Programming
Volodymyr (Vlad) Kindratenko
Innovative Systems Laboratory @ NCSA
Institute for Advanced Computing Applications and Technologies (IACAT)

Tutorial Goals
- Become familiar with NVIDIA GPU architecture
- Become familiar with the NVIDIA GPU application development flow
- Be able to write and run simple NVIDIA GPU kernels in CUDA
- Be aware of performance-limiting factors and understand performance tuning strategies

Schedule
Day 1: 2:30-3:45 part I; 3:45-4:15 break; 4:15-5:30 part II
Day 2: 2:30-3:45 part III; 3:45-4:15 break; 4:15-5:30 part IV

Part I
- Introduction
- Hands-on: getting started with the NCSA GPU cluster
- Hands-on: anatomy of a GPU application

Introduction
- Why use Graphics Processing Units (GPUs) for general-purpose computing
- Modern GPU architecture
- NVIDIA GPU programming overview
  - Libraries
  - CUDA C
  - OpenCL
  - PGI x64+gpu

Why GPUs? Raw Performance Trends
[Graph, courtesy of NVIDIA: single-precision GFLOPS from late 2002 through 2008. NVIDIA GPUs (GeForce 5800, 5950 Ultra, 6800 Ultra, 7800 GTX, 7900 GTX, 8800 GTX, 8800 Ultra, GTX 280, GTX 285) climb toward 1000+ GFLOPS, while a 3 GHz quad-core Intel Xeon remains far below.]

Why GPUs? Memory Bandwidth Trends
[Graph, courtesy of NVIDIA: memory bandwidth in GB/s from late 2002 through 2008. NVIDIA GPUs grow from the GeForce 5800 to roughly 160 GB/s on the GTX 285.]

GPU vs. CPU Silicon Use
[Figure, courtesy of NVIDIA: relative die area devoted to arithmetic units, caches, and control logic in a GPU versus a CPU.]

NVIDIA GPU Architecture
A scalable array of multithreaded Streaming Multiprocessors (SMs); each SM consists of:
- 8 scalar processor (SP) cores
- 2 special function units for transcendentals
- a multithreaded instruction unit
- on-chip shared memory
Plus GDDR3 SDRAM and a PCIe interface.
[Figure is courtesy of NVIDIA]

NVIDIA GeForce 9400M G GPU
[Block diagram: one TPC with a geometry controller and SMC; two SMs, each with I-cache, MT issue, C-cache, SFUs, and shared memory; texture units with texture L1; a 128-bit interconnect to L2 and ROPs; DRAM.]
- 16 streaming processors arranged as 2 streaming multiprocessors
- At 0.8 GHz this provides 54 GFLOPS in single precision (SP)
- 128-bit interface to off-chip GDDR3 memory; 21 GB/s bandwidth

NVIDIA Tesla C1060 GPU
[Block diagram: 10 TPCs (TPC 1 through TPC 10), each with a geometry controller, SMC, three SMs (I-cache, MT issue, C-cache, SFUs, shared memory), and texture units with texture L1; a 512-bit memory interconnect to L2, ROPs, and eight DRAM partitions.]
- 240 streaming processors arranged as 30 streaming multiprocessors
- At 1.3 GHz this provides ~1 TFLOPS SP and 86.4 GFLOPS DP
- 512-bit interface to off-chip GDDR3 memory; 102 GB/s bandwidth

NVIDIA Tesla S1070 Computing Server
[Diagram, courtesy of NVIDIA: 4 T10 GPUs, each with 4 GB of GDDR3 SDRAM, paired behind two NVIDIA switches; each switch connects to the host over a PCIe x16 interface. The unit also provides system monitoring, thermal management, and its own power supply.]

GPU Use/Programming
- GPU libraries: NVIDIA's CUDA BLAS and FFT libraries; many 3rd-party libraries
- Low-abstraction, lightweight GPU programming toolkits: CUDA C, OpenCL
- High-abstraction, compiler-based tools: PGI x64+gpu
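As a hedged illustration of the CUDA BLAS library mentioned above, here is a minimal single-precision SAXPY sketch using the handle-based cublas_v2 API (this API form postdates the CUDA 3.2 era of these slides, but is what current cuBLAS documentation describes):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main(void)
    {
        int n = 16384;
        float alpha = 2.0f;
        float *x = (float*)malloc(n * sizeof(float));
        float *y = (float*)malloc(n * sizeof(float));
        for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        // Allocate device memory and copy the inputs over
        float *dx, *dy;
        cudaMalloc((void**)&dx, n * sizeof(float));
        cudaMalloc((void**)&dy, n * sizeof(float));
        cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dy, y, n * sizeof(float), cudaMemcpyHostToDevice);

        // y = alpha*x + y, computed on the GPU by cuBLAS
        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1);
        cublasDestroy(handle);

        cudaMemcpy(y, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("y[0] = %f\n", y[0]);  // expect 4.0

        cudaFree(dx); cudaFree(dy); free(x); free(y);
        return 0;
    }

Compile with something like: nvcc saxpy.cu -lcublas -o saxpy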

CUDA C APIs
A higher-level API called the CUDA runtime API:

    mykernel<<<grid, block>>>(args);

A low-level API called the CUDA driver API:

    cuModuleLoad( &module, binfile );
    cuModuleGetFunction( &func, module, "mykernel" );
    cuParamSetv( func, 0, &args, 48 );
    cuParamSetSize( func, 48 );
    cuFuncSetBlockShape( func, ts[0], ts[1], 1 );
    cuLaunchGrid( func, gs[0], gs[1] );
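For reference, later CUDA releases replaced the cuParamSet*/cuLaunchGrid sequence above with a single cuLaunchKernel call. A minimal sketch of the same launch in that style, reusing the slide's module, func, ts, and gs, and assuming three hypothetical device pointers devPtrA/B/C as the kernel's arguments:

    #include <cuda.h>

    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   module;
    CUfunction func;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoad(&module, "mykernel.ptx");           /* illustrative file name */
    cuModuleGetFunction(&func, module, "mykernel");

    void *args[] = { &devPtrA, &devPtrB, &devPtrC }; /* hypothetical arguments */
    cuLaunchKernel(func,
                   gs[0], gs[1], 1,   /* grid dimensions   */
                   ts[0], ts[1], 1,   /* block dimensions  */
                   0, 0,              /* shared mem, stream */
                   args, 0);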

Getting Started with NCSA GPU Cluster
- Cluster architecture overview
- How to login and check out a node
- How to compile and run an existing application

NCSA AC GPU Cluster

GPU Cluster Architecture
- Servers: 32 (CPU cores: 128)
- Accelerator units: 32 (GPUs: 128)
[Diagram: ac (head node) connected to compute nodes ac01 through ac32.]

GPU Cluster Node Architecture
- HP xw9400 workstation: 2.4 GHz AMD Opteron 2216 processors (dual socket, dual core), 8 GB DDR2, QDR InfiniBand
- Tesla S1070 1U GPU computing server: 1.3 GHz Tesla T10 processors, 4x4 GB GDDR3 SDRAM
[Diagram: the workstation connects to the S1070's four T10 processors and their DRAM through two PCIe x16 links and PCIe interfaces; QDR InfiniBand links the compute node to the cluster fabric.]

Accessing the GPU Cluster
Use a Secure Shell (SSH) client to access AC:

    ssh USER@ac.ncsa.uiuc.edu    (User: tra1-tra50; Password: ???)

You will see something like this printed out:

    See machine details and a technical report at:
    http://www.ncsa.uiuc.edu/projects/gpucluster/
    If publishing works supported by AC, you can acknowledge it as an IACAT resource:
    http://www.ncsa.illinois.edu/userinfo/allocations/ack.html
    All SSH traffic on this system is monitored. For more information see:
    https://bw-wiki.ncsa.uiuc.edu/display/bw/security+monitoring+policy
    Machine description and HOW TO USE. See: /usr/local/share/docs/ac.readme
    CUDA wrapper readme: /usr/local/share/docs/cuda_wrapper.readme
    These docs also available at: http://ac.ncsa.uiuc.edu/docs/
    #####################################################################
    Nov 22, 2010: CUDA 3.2 fully deployed. Release notes are available here:
    http://developer.nvidia.com/object/cuda_3_2_downloads.html
    An updated SDK is available in /usr/local/cuda/
    Questions? Contact Jeremy Enos jenos@ncsa.uiuc.edu
    (for password resets, please contact help@ncsa.uiuc.edu)

    11:17:55 up 2:38, 9 users, load average: 0.10, 0.07, 0.06
    Job state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:49 Exiting:0
    [tra50@ac ~]$

Installing Tutorial Examples
Run this sequence to retrieve and install the tutorial examples:

    cd
    wget http://www.ncsa.illinois.edu/~kindr/projects/hpca/files/cairo2010_tutorial.tgz
    tar -xvzf cairo2010_tutorial.tgz
    cd tutorial
    ls
    benchmarks  src1  src2  src4  src5  src6

Accessing the GPU Cluster
[Diagram: users 1 through n connect to ac (head node), which fronts compute nodes ac01 through ac32. "You are here": the head node.]

Requesting a Cluster Node for Interactive Use
Run qstat to see what other users are doing.
Run the following to request a node with a single GPU for 3 hours of interactive use:

    qsub -I -l walltime=03:00:00

You will see something like this printed out:

    qsub: waiting for job 1183789.acm to start
    qsub: job 1183789.acm ready
    [tra50@ac10 ~]$ _

Requesting a Cluster Node
[Diagram: users 1 through n connect through ac (head node) to compute nodes ac01 through ac32. "You are here": a compute node.]

Some useful utilities installed on AC
- As part of the NVIDIA driver: nvidia-smi (NVIDIA System Management Interface program)
- As part of the NVIDIA CUDA SDK: deviceQuery, bandwidthTest
- As part of the CUDA wrapper: wrapper_query, showgputime/showallgputime (work from the head node only)

nvidia-smi

    Timestamp : Mon May 24 14:39:28 2010
    Unit 0:
        Product Name  : NVIDIA Tesla S1070-400 Turn-key
        Product ID    : 920-20804-0006
        Serial Number : 0324708000059
        Firmware Ver  : 3.6
        Intake Temperature : 22 C
        GPU 0:
            Product Name : Tesla T10 Processor
            Serial       : 2624258902399
            PCI ID       : 5e710de
            Bridge Port  : 0
            Temperature  : 33 C
        GPU 1:
            Product Name : Tesla T10 Processor
            Serial       : 2624258902399
            PCI ID       : 5e710de
            Bridge Port  : 2
            Temperature  : 30 C
        Fan Tachs:
            #00: 3566 Status: NORMAL    #01: 3574 Status: NORMAL
            #12: 3564 Status: NORMAL    #13: 3408 Status: NORMAL
        PSU:
            Voltage : 12.01 V
            Current : 19.14 A
            State   : Normal
        LED:
            State : GREEN
    Unit 1:
        Product Name  : NVIDIA Tesla S1070-400 Turn-key
        Product ID    : 920-20804-0006
        Serial Number : 0324708000059
        Firmware Ver  : 3.6
        Intake Temperature : 22 C
        GPU 0:
            Product Name : Tesla T10 Processor
            Serial       : 1930554578325
            PCI ID       : 5e710de
            Bridge Port  : 0
            Temperature  : 33 C
        GPU 1:
            Product Name : Tesla T10 Processor
            Serial       : 1930554578325
            PCI ID       : 5e710de
            Bridge Port  : 2
            Temperature  : 30 C
        Fan Tachs:
            #00: 3584 Status: NORMAL    #01: 3570 Status: NORMAL
            #12: 3572 Status: NORMAL    #13: 3412 Status: NORMAL
        PSU:
            Voltage : 11.99 V
            Current : 19.14 A
            State   : Normal
        LED:
            State : GREEN

deviceQuery

    CUDA Device Query (Runtime API) version (CUDART static linking)
    There is 1 device supporting CUDA
    Device 0: "Tesla T10 Processor"
      CUDA Driver Version:                           3.0
      CUDA Runtime Version:                          3.0
      CUDA Capability Major revision number:         1
      CUDA Capability Minor revision number:         3
      Total amount of global memory:                 4294770688 bytes
      Number of multiprocessors:                     30
      Number of cores:                               240
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       16384 bytes
      Total number of registers available per block: 16384
      Warp size:                                     32
      Maximum number of threads per block:           512
      Maximum sizes of each dimension of a block:    512 x 512 x 64
      Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             256 bytes
      Clock rate:                                    1.30 GHz
      Concurrent copy and execution:                 Yes
      Run time limit on kernels:                     No
      Integrated:                                    No
      Support host page-locked memory mapping:       Yes
      Compute mode:                                  Exclusive (only one host thread at a time can use this device)
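The same information is available programmatically through the CUDA runtime API. A minimal sketch that queries a few of the fields deviceQuery prints:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        printf("There are %d device(s) supporting CUDA\n", count);

        for (int d = 0; d < count; d++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);  // fill in the device's properties
            printf("Device %d: \"%s\"\n", d, prop.name);
            printf("  Compute capability:        %d.%d\n", prop.major, prop.minor);
            printf("  Number of multiprocessors: %d\n", prop.multiProcessorCount);
            printf("  Global memory:             %lu bytes\n",
                   (unsigned long)prop.totalGlobalMem);
            printf("  Shared memory per block:   %lu bytes\n",
                   (unsigned long)prop.sharedMemPerBlock);
            printf("  Max threads per block:     %d\n", prop.maxThreadsPerBlock);
        }
        return 0;
    }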

wrapper_query

    cuda_wrapper info:
      version=2
      userid=21783
      pid=-1
      ngpu=1
      physgpu[0]=2
      key_env_var=
      allow_user_passthru=1
      affinity:
        GPU=0, CPU=0 2
        GPU=1, CPU=0 2
        GPU=2, CPU=1 3
        GPU=3, CPU=1 3
      cudaapi = Unknown
      walltime = 10.228021 seconds
      gpu_kernel_time = 0.000000 seconds
      gpu_usage = 0.00%

There are 4 GPUs per cluster node. When requesting a node, we can specify how many GPUs should be allocated; e.g., `-l nodes=1:ppn=4` in the qsub resource string will result in all 4 GPUs being allocated. By default, only one GPU per node is allocated.

Compiling and Running an Existing Application
cd tutorial/src1
- vecadd.c - reference C implementation
- vecadd.cu - CUDA implementation

Compile & run the CPU version:

    gcc vecadd.c -o vecadd_cpu
    ./vecadd_cpu
    Running CPU vecadd for 16384 elements
    C[0]=2147483648.00...

Compile & run the GPU version:

    nvcc vecadd.cu -o vecadd_gpu
    ./vecadd_gpu
    Running GPU vecadd for 16384 elements
    C[0]=2147483648.00...

nvcc
Any source file containing CUDA C language extensions must be compiled with nvcc. nvcc is a compiler driver that invokes many other tools to accomplish the job.

Basic nvcc usage:

    nvcc <filename>.cu [-o <executable>]    (builds release mode)
    nvcc -deviceemu <filename>.cu           (builds device emulation mode; all code runs on the CPU)
    nvcc --version

The -g flag builds debug mode for the gdb debugger.

Anatomy of a GPU Application
Host side:
- Allocate memory on the GPU (device memory)
- Copy data to the device memory
- Launch the GPU kernel and wait until it is done
- Copy results back to the host memory
Device side:
- Execute the GPU kernel

Reference CPU Version

    // Computational kernel
    void vecadd(int N, float* A, float* B, float* C)
    {
        for (int i = 0; i < N; i++)
            C[i] = A[i] + B[i];
    }

    int main(int argc, char **argv)
    {
        int N = 16384; // default vector size

        // Memory allocation
        float *A = (float*)malloc(N * sizeof(float));
        float *B = (float*)malloc(N * sizeof(float));
        float *C = (float*)malloc(N * sizeof(float));

        // Kernel invocation
        vecadd(N, A, B, C); // call compute kernel

        // Memory de-allocation
        free(A); free(B); free(C);
    }

Adding GPU support
[Diagram: the host CPU with host memory holding vectors A, B, and C; the GPU card with device memory holding their copies gA, gB, and gC.]

Memory Spaces
CPU and GPU have separate memory spaces; data is moved between them across the PCIe bus. The host (CPU) manages device (GPU) memory using functions to allocate/set/copy memory on the GPU:

    cudaMalloc(void** pointer, size_t nbytes)
    cudaFree(void* pointer)
    cudaMemcpy(void* dst, void* src, size_t nbytes, enum cudaMemcpyKind direction);

cudaMemcpy:
- returns after the copy is complete
- blocks the CPU thread until all bytes have been copied
- does not start copying until previous CUDA calls complete

enum cudaMemcpyKind:
- cudaMemcpyHostToDevice
- cudaMemcpyDeviceToHost
- cudaMemcpyDeviceToDevice
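These calls all return a cudaError_t, which the vecadd example below ignores for brevity. A minimal sketch of checking them; the CHECK macro here is our own convenience wrapper, not part of CUDA:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    // Hypothetical convenience macro: print the error string and exit on failure.
    #define CHECK(call)                                                  \
        do {                                                             \
            cudaError_t err = (call);                                    \
            if (err != cudaSuccess) {                                    \
                fprintf(stderr, "CUDA error at %s:%d: %s\n",             \
                        __FILE__, __LINE__, cudaGetErrorString(err));    \
                exit(EXIT_FAILURE);                                      \
            }                                                            \
        } while (0)

    int main(void)
    {
        int N = 16384;
        float *devPtrA;
        CHECK(cudaMalloc((void**)&devPtrA, N * sizeof(float)));
        /* ... use devPtrA ... */
        CHECK(cudaFree(devPtrA));
        return 0;
    }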

Adding GPU support

    int main(int argc, char **argv)
    {
        int N = 16384; // default vector size

        float *A = (float*)malloc(N * sizeof(float));
        float *B = (float*)malloc(N * sizeof(float));
        float *C = (float*)malloc(N * sizeof(float));

        // Memory allocation on the GPU card
        float *devPtrA, *devPtrB, *devPtrC;
        cudaMalloc((void**)&devPtrA, N * sizeof(float));
        cudaMalloc((void**)&devPtrB, N * sizeof(float));
        cudaMalloc((void**)&devPtrC, N * sizeof(float));

        // Copy data from the CPU (host) memory to the GPU (device) memory
        cudaMemcpy(devPtrA, A, N * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(devPtrB, B, N * sizeof(float), cudaMemcpyHostToDevice);

Adding GPU support

        // Kernel invocation
        vecadd<<<N/512, 512>>>(devPtrA, devPtrB, devPtrC);

        // Copy results from device memory to the host memory
        cudaMemcpy(C, devPtrC, N * sizeof(float), cudaMemcpyDeviceToHost);

        // Device memory de-allocation
        cudaFree(devPtrA);
        cudaFree(devPtrB);
        cudaFree(devPtrC);

        free(A); free(B); free(C);
    }

GPU Kernel

CPU version:

    void vecadd(int N, float* A, float* B, float* C)
    {
        for (int i = 0; i < N; i++)
            C[i] = A[i] + B[i];
    }

GPU version:

    __global__ void vecadd(float* A, float* B, float* C)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        C[i] = A[i] + B[i];
    }
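Note that the <<<N/512, 512>>> launch above only works because N is an exact multiple of the block size. A minimal sketch of the usual generalization, which adds an N parameter to the slide's kernel, rounds the grid size up, and guards the out-of-range threads:

    __global__ void vecadd(int N, float* A, float* B, float* C)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)                  // guard: the last block may overrun N
            C[i] = A[i] + B[i];
    }

    // Host side (inside main): round the grid size up to cover all N elements
    int threads = 512;
    int blocks  = (N + threads - 1) / threads;
    vecadd<<<blocks, threads>>>(N, devPtrA, devPtrB, devPtrC);
    cudaError_t err = cudaGetLastError();   // detect launch failures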