CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms


Fengguang Song, Department of Computer & Information Science, IUPUI

What is a GPU?
- Conventional GPUs are used to generate 2D and 3D graphics, images, video, GUIs, and games
- 15 years ago there was only the VGA controller; by 2000, the GPU had everything a graphics workstation could provide
- Fixed-function logic was replaced by programmable logic; GPUs became more and more programmable → GPGPU
- GPUs are optimized for visual computing: mixing graphics processing and computing together so that users can interact with computed objects via graphics, images, and video

GPU Design for General HPC
- 2006: the first GPU designed for general HPC as well as graphics processing, the NVIDIA GeForce 8800 card (GPGPU)
- It has unified processors that can perform vertex, geometry, pixel, and general computing operations, unifying graphics and computing
- Provides a large amount of floating-point processing power in the GPU, attractive even for non-graphics applications
- You can write your programs in C rather than using a graphics API

GPU Performance Gains over CPUs
[Figure: performance comparison chart, not captured in the transcription]

Tesla P100 Peak Performance
- 5.3 TFLOPS of double-precision floating-point (FP64) performance
- 10.6 TFLOPS of single-precision (FP32) performance
- 21.2 TFLOPS of half-precision (FP16) performance

GPU Processor Array (GeForce 8800, 2006)
- 14 SMs, each with 8 SPs → 112 SP (or CUDA) cores
- Connected to four DRAM partitions (each 8 bytes wide)

Kepler GK110 Full-Chip Block Diagram
Each SMX (streaming multiprocessor) contains:
- 192 single-precision CUDA cores
- 64 double-precision units
- 32 special function units (SFUs)
- 32 load/store units (LD/ST)

Table 1. Tesla P100 compared to prior-generation Tesla products

Tesla Products                Tesla K40           Tesla M40           Tesla P100
GPU                           GK110 (Kepler)      GM200 (Maxwell)     GP100 (Pascal)
SMs                           15                  24                  56
TPCs                          15                  24                  28
FP32 CUDA Cores / SM          192                 128                 64
FP32 CUDA Cores / GPU         2880                3072                3584
FP64 CUDA Cores / SM          64                  4                   32
FP64 CUDA Cores / GPU         960                 96                  1792
Base Clock                    745 MHz             948 MHz             1328 MHz
GPU Boost Clock               810/875 MHz         1114 MHz            1480 MHz
Peak FP32 GFLOPS (1)          5040                6840                10600
Peak FP64 GFLOPS (1)          1680                210                 5300
Texture Units                 240                 192                 224
Memory Interface              384-bit GDDR5       384-bit GDDR5       4096-bit HBM2
Memory Size                   Up to 12 GB         Up to 24 GB         16 GB
L2 Cache Size                 1536 KB             3072 KB             4096 KB
Register File Size / SM       256 KB              256 KB              256 KB
Register File Size / GPU      3840 KB             6144 KB             14336 KB
TDP                           235 W               250 W               300 W
Transistors                   7.1 billion         8 billion           15.3 billion
GPU Die Size                  551 mm²             601 mm²             610 mm²
Manufacturing Process         28 nm               28 nm               16 nm FinFET

(1) The GFLOPS in this table are based on GPU Boost clocks.

Programming GPUs
- In the past, a program had to be expressed as a graphics-rendering algorithm, which made GPUs very difficult to program
- Now there is the CUDA programming model, an extension to C/C++
- Programmers decompose a problem into small sub-problems that are executed in parallel
- The GPU architecture has two levels, SMs and SPs, so the parallel program decomposition also has two levels:
  - a thread block runs on an SM
  - a thread runs on an SP

CUDA (Compute Unified Device Architecture)
- An architecture and programming model introduced by NVIDIA in 2007
- Enables GPUs to execute programs written in C: from C, you simply call kernel routines that are executed on the GPU
- Easy to get started with, although achieving the highest performance requires an understanding of the hardware architecture!

Example of Problem Decomposition
- A matrix is divided into 2-D blocks: 2 rows x 3 columns of blocks
- Each block has 3x5 elements
- Each block corresponds to one thread block
- Your task: write a threaded program in which each thread computes only one element (see the sketch below)
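A minimal sketch of this decomposition, assuming a 6x15 matrix stored in row-major order; the matrix contents, the scaling operation, and the kernel name scale_elem are illustrative assumptions, not from the slides:

    #include <cuda_runtime.h>

    // Each thread computes exactly one matrix element.
    // The grid is 2 x 3 thread blocks; each block is 3 x 5 threads,
    // matching the decomposition on the slide.
    __global__ void scale_elem(float *A, int width, float alpha)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        A[row * width + col] *= alpha;   // one element per thread
    }

    int main(void)
    {
        const int height = 6, width = 15;   // 2*3 rows, 3*5 columns
        float *dA;
        cudaMalloc(&dA, height * width * sizeof(float));

        dim3 grid(3, 2);    // 3 columns x 2 rows of blocks
        dim3 block(5, 3);   // 5 columns x 3 rows of threads per block
        // Grid and block dimensions cover the matrix exactly,
        // so no bounds guard is needed in this particular case.
        scale_elem<<<grid, block>>>(dA, width, 2.0f);
        cudaDeviceSynchronize();

        cudaFree(dA);
        return 0;
    }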

How to Specify the Grid Size and Thread Block Size?
- Kernel<<< (input_size / block_size), (Tx, Ty) >>>
- Programmers need to decide Tx, Ty, and the block_size (see the sizing sketch below)

CUDA Programming Paradigm
There are 3 key abstractions:
- A hierarchy of thread groups
- Shared memories
- Barrier synchronization
Terminology:
- Kernel: a sequential code for one thread, designed to be executed by many threads
- Thread block: a set of concurrent threads; the second launch parameter (<<< , ??? >>>)
- Grid: a set of thread blocks, which execute in parallel; the first launch parameter (<<< ???, >>>)
- Every kernel launch creates a grid; successive kernels execute in sequence: kernel A → kernel B → kernel C
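One subtlety in the launch expression above: the integer division input_size / block_size truncates, so it covers every element only when input_size is a multiple of block_size. A common pattern is to round the block count up and guard inside the kernel; a minimal sketch, with the kernel name and data assumed for illustration:

    #include <cuda_runtime.h>

    // Each thread handles one element; threads beyond input_size do nothing.
    __global__ void touch(float *data, int input_size)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < input_size)          // guard against the rounded-up launch
            data[i] += 1.0f;
    }

    int main(void)
    {
        int input_size = 1000;       // deliberately not a multiple of 256
        float *d_data;
        cudaMalloc(&d_data, input_size * sizeof(float));
        cudaMemset(d_data, 0, input_size * sizeof(float));

        int block_size = 256;
        int num_blocks = (input_size + block_size - 1) / block_size;  // ceiling division
        touch<<<num_blocks, block_size>>>(d_data, input_size);
        cudaDeviceSynchronize();

        cudaFree(d_data);
        return 0;
    }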

CUDA Threads
- All threads execute the same kernel code, but can take different paths
- Each thread has an ID: threadIdx (.x, .y)
  - It can select its own input/output data
  - It can make its own control decisions
- Threads are grouped into thread blocks, and thread blocks are grouped into a grid
- A kernel is executed as a grid of blocks of threads (a small example follows the figure below)

[Figure: the host launches Kernel 1 as Grid 1, consisting of Blocks (0,0), (0,1), (1,0), and (1,1), and Kernel 2 as Grid 2; Block (1,1) is expanded to show its 4 x 2 x 2 threads, Thread (0,0,0) through Thread (3,1,1). Courtesy: NVIDIA]
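A small illustration of those points; the kernel name and the clamping operation are assumptions for the example. Every thread runs the same code, but each one uses its ID to select its own element and then branches on it independently:

    // Same kernel code for every thread; each thread picks its own
    // element via its ID and makes its own control decision.
    __global__ void clamp_negatives(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
        if (i < n) {
            if (data[i] < 0.0f)      // threads branch independently
                data[i] = 0.0f;
        }
    }

It would be launched like any other kernel, e.g. clamp_negatives<<<(n + 255) / 256, 256>>>(d_data, n).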

Restrictions
- All threads in a grid execute the same kernel
- A grid is organized as a 3D array of blocks (gridDim.x, gridDim.y, and gridDim.z)
- Each block is organized as a 3D array of threads (blockDim.x, blockDim.y, and blockDim.z)
- Once a kernel is launched, its dimensions cannot change
- All blocks in a grid have the same dimensions
- The total size of a block is limited to 1024 threads
- Once assigned to an SM, a thread block must execute in its entirety on that SM

Thread Index
- When invoking a kernel, the programmer specifies the number of blocks comprising the grid and the number of threads per block
- Each thread is given a unique thread ID number, threadIdx, within its thread block
- Each thread block is given a unique block ID number, blockIdx
- Thread blocks and grids may have 1, 2, or 3 dimensions, accessed via the .x, .y, and .z index fields (see the indexing sketch below)
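Combining blockIdx, blockDim, and threadIdx gives each thread a unique global position. A 2D sketch, where the kernel name and the row-major matrix layout are assumptions for illustration:

    __global__ void fill2d(float *A, int width, int height)
    {
        // Global 2D coordinates of this thread.
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;

        // Guard: the grid may be rounded up past the matrix edges.
        if (row < height && col < width)
            A[row * width + col] = (float)(row * width + col);
    }

A matching launch could use dim3 block(16, 16) and dim3 grid((width + 15) / 16, (height + 15) / 16), for example.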

CUDA C Keywords
Kernel: a function that executes on the device (GPU) and can be called from the host (CPU)
- Can only access GPU memory
- No variable number of arguments
- No static variables
Functions must be declared with a qualifier:
- __global__ : a GPU kernel function launched by the CPU; must return void
- __device__ : can be called from GPU functions
- __host__ : can be called from CPU functions (the default)
- The __host__ and __device__ qualifiers can be combined
The qualifiers determine how functions are compiled, i.e., which compiler is used to compile each function (a small example follows the listing below)

Compiling CUDA C/C++ Programs

    // foo.cpp
    int foo(int x)
    {
      ...
    }

    float bar(float x)
    {
      ...
    }

    // saxpy.cu
    __global__ void saxpy(int n, float ...)
    {
      int i = threadIdx.x + ...;
      if (i < n) y[i] = a*x[i] + y[i];
    }

    // main.cpp
    int main()
    {
      float x = bar(1.0);
      if (x < 2.0f)
        saxpy<<<...>>>(foo(1), ...);
      ...
    }

Build flow: the CUDA C functions go through NVCC to produce CUDA object files; the rest of the C application goes through the CPU compiler to produce CPU object files; the linker combines both into a single CPU + GPU executable.
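A compilable sketch of the qualifiers in action; the function names and the arithmetic are illustrative assumptions:

    #include <cstdio>
    #include <cuda_runtime.h>

    // __host__ __device__: compiled for both CPU and GPU.
    __host__ __device__ float square(float x) { return x * x; }

    // __device__: callable only from GPU code.
    __device__ float next(float x) { return square(x) + 1.0f; }

    // __global__: a kernel, launched from the host, must return void.
    __global__ void apply(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = next(data[i]);
    }

    int main(void)
    {
        const int n = 256;
        float *d;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemset(d, 0, n * sizeof(float));
        apply<<<1, n>>>(d, n);
        cudaDeviceSynchronize();
        printf("launched %d threads\n", n);
        cudaFree(d);
        return 0;
    }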

Example of SAXPY GPU Kernel
[The kernel listing on this slide was an image and is not in the transcription; a standard version is sketched below.]

Indexing Arrays with Blocks and Threads
- No longer as simple as using blockIdx.x or threadIdx.x alone
- Consider indexing an array with one element per thread (8 threads/block):

    threadIdx.x:  0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7
    blockIdx.x:          0                 1                 2                 3

- With M threads/block, a unique index for each thread is given by:

    int index = threadIdx.x + blockIdx.x * M;
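For reference, a minimal sketch of the SAXPY (y = a*x + y) kernel, consistent with the compilation example above; this is the standard formulation, not copied from the missing slide:

    __global__ void saxpy(int n, float a, float *x, float *y)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;  // unique global index
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Launched with, e.g.: saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);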

Kernel Execution
- A thread block executes on a single SM; threads and blocks do not migrate to different SMs
- All threads within a block execute concurrently, in parallel
- One SM may execute multiple thread blocks; it must be able to satisfy their aggregate register and memory demands
- A grid executes on a single device (GPU)
  - Blocks from the same grid may execute concurrently or serially
  - Blocks from multiple grids may execute concurrently
  - A device can execute multiple kernels concurrently

Thread / Thread Block / Thread Grid
- CUDA kernel calling syntax: kernel<<<grid dim, thread block dim>>>( ...parameter list... )
- Threads in a thread block can synchronize with __syncthreads()
  - They can communicate with each other through shared memory at a synchronization point (see the sketch below)
- How many blocks there are depends on the user input
- Thread blocks must be independent! (executable in any order); no direct communication between blocks
- Thread grids can be independent or dependent; there is an implicit barrier between kernels
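A small sketch of block-level communication through shared memory; the kernel name, the tile size, and the reversal operation are assumptions for the example. Each thread stages one element in shared memory, the block synchronizes, then each thread reads an element written by a different thread:

    #define TILE 64

    __global__ void reverse_tile(float *data)
    {
        __shared__ float s[TILE];            // visible to the whole thread block
        int t = threadIdx.x;

        s[t] = data[blockIdx.x * TILE + t];  // each thread writes one slot
        __syncthreads();                     // wait until every write has landed

        // Now it is safe to read an element written by another thread.
        data[blockIdx.x * TILE + t] = s[TILE - 1 - t];
    }

This must be launched with blockDim.x == TILE; without the __syncthreads() barrier, some threads could read slots that have not yet been written.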

Memory Structure in a GPU
- Local memory: per thread
- Shared memory: per thread block
- Global memory: per application
- The GPU executes kernel grids; an SM executes one or more thread blocks
- An SM executes threads in groups of 32; such a group is called a warp

Memory Structure in a GPU (Cont.)
Threads have access to multiple memory spaces:
- Each thread has a private local memory and thread registers
- Each thread block has a shared memory, visible to all threads of the thread block
  - Declare variables with __shared__
  - Low-latency on-chip RAM, like an L1 cache
  - Typical pattern: initialize data in shared memory, compute, then copy the results to global memory
- Finally, all threads have access to global memory
  - Declare variables with __device__
  - DRAM on the graphics board
(a declaration sketch follows this list)
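A sketch of how each memory space is declared; the variable and kernel names are illustrative:

    __device__ float g_scale = 2.0f;     // global memory: visible to all threads, in DRAM

    __global__ void demo(float *out)
    {
        __shared__ float tile[128];      // shared memory: one copy per thread block, on-chip
        float t = threadIdx.x * g_scale; // local variable: private per thread, held in registers

        tile[threadIdx.x] = t;
        __syncthreads();
        out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
    }

    // Launched with exactly 128 threads per block so that the
    // shared-memory tile and the thread block line up.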

Warps
- Once a thread block is assigned to an SM, it is divided into units of 32 threads called warps (see the warp-indexing sketch below)
- Thread IDs within a warp are consecutive and increasing; warp 0 starts with thread ID 0
- The warp is the unit of thread scheduling in an SM
- Each warp is executed in a SIMD fashion (i.e., all threads within a warp must execute the same instruction at any given time)
- A warp is like a traditional thread of SIMD instructions (32 elements wide); the 32 SPs are like 32 SIMD lanes

Thread Blocks Are Executed as Warps
- Each thread block is mapped to one or more warps
- When the thread block size is not a multiple of the warp size, unused threads within the last warp are disabled automatically
  [Figure: a block of 128 threads mapped onto 4 warps of 32 threads each]
- The hardware schedules each warp independently
- Warps within a thread block can execute independently
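A sketch of how a thread can discover which warp it belongs to, using the built-in warpSize variable; the kernel is an illustrative assumption:

    __global__ void warp_info(int *warp_of, int *lane_of)
    {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int warp = threadIdx.x / warpSize;  // which warp within this block (warpSize is 32)
        int lane = threadIdx.x % warpSize;  // position within the warp
        warp_of[tid] = warp;
        lane_of[tid] = lane;
    }

Because a warp executes in SIMD fashion, an if/else whose condition differs between threads of the same warp serializes both paths (warp divergence); branches that differ only from warp to warp cost nothing extra.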

!"#$%&'%(&')%#*'+,"$&-./(0' '!"$'*#1,$221#2'324#$%5/(0'5-.4/*#1,$221#26',%('27/4,"' 8$47$$('7%#*2'7/4"'(1'%**%#$(4'19$#"$%&' )%#*2'7/4"'/(24#-,4/1('7"12$'/(*-42'%#$'#$%&:'%#$'$./0/8.$' 41'$;$,-4$<'%(&'7/..'8$',1(2/&$#$&'7"$('2,"$&-./(0' )"$('%'7%#*'/2'2$.$,4$&'=1#'$;$,-4/1(<'%..'3%,4/9$6'4"#$%&2' $;$,-4$'4"$'2%5$'/(24#-,4/1(' ) C ' " ) B ' ) A '" ) @ '"! ( ' >;$,-4/(0" " )%/4/(0'=1#'&%4%"?$%&:'41'$;$,-4$" " 29!"##"$%&'()*+&,-./0& 1)232)&45)2(6&7#89:&+";2+&45(4&)2+<#4&"$&=8+4#>&3<##&?()*+& @(6A&&& B:(>A&& @2442)A&!"#$"%&&&'()*+++),)---).)!"#$"%&&&')/)01()01+++,)---).&!"#$"%&&&')/)*12()*12+++,)---).) & 1)232)&48&5(C2&2$8<%5&45)2(6+&*2)&7#89:&48&*)8C"62& 5()6?()2&?"45&=($>&?()*+&48&+?"495&724?22$&& D5"+&"+&58?&452&E1F&5"62+&=2=8)>&(992++&#(42$9>& & G2+8<)92&#":2&HH+5()26HH&=(>&98$+4)("$&45)2(6+&*2)&7#89:& I#%8)"45=&($6&6298=*8+"4"8$&?"##&2+4(7#"+5&+8=2&*)232))26&(=8<$4& 83&+5()26&6(4(&($6&HH+5()26HH&(##89(4"8$& 30 15

Table 2. Compute Capabilities: GK110 vs GM200 vs GP100

GPU                                   Kepler GK110        Maxwell GM200    Pascal GP100
Compute Capability                    3.5                 5.2              6.0
Threads / Warp                        32                  32               32
Max Warps / Multiprocessor            64                  64               64
Max Threads / Multiprocessor          2048                2048             2048
Max Thread Blocks / Multiprocessor    16                  32               32
Max 32-bit Registers / SM             65536               65536            65536
Max Registers / Block                 65536               32768            65536
Max Registers / Thread                255                 255              255
Max Thread Block Size                 1024                1024             1024
Shared Memory Size / SM               16 KB/32 KB/48 KB   96 KB            64 KB
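Many of these limits can be queried at run time instead of hard-coding them; a sketch using the standard CUDA runtime API:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0

        printf("Device:                %s\n",  prop.name);
        printf("Compute capability:    %d.%d\n", prop.major, prop.minor);
        printf("Warp size:             %d\n",  prop.warpSize);
        printf("Max threads / block:   %d\n",  prop.maxThreadsPerBlock);
        printf("Max threads / SM:      %d\n",  prop.maxThreadsPerMultiProcessor);
        printf("32-bit registers / SM: %d\n",  prop.regsPerMultiprocessor);
        printf("Shared memory / SM:    %zu bytes\n", prop.sharedMemPerMultiprocessor);
        return 0;
    }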