CS 590: High Performance Computing
GPU Architectures and CUDA Concepts/Terms
Fengguang Song
Department of Computer & Information Science, IUPUI

What is a GPU?
- Conventional GPUs are used to generate 2D and 3D graphics, images, video, GUIs, and games
- 15 years ago, there was only VGA
- By 2000, a GPU had everything a graphics workstation could provide
- Fixed-function logic was replaced by programmable logic
- GPUs became more and more programmable -> GPGPU
- GPUs are optimized for visual computing
- Visual computing? Mixing graphics processing and computing so that users can interact with computed objects via graphics, images, and video

GPU Design for General HPC
- 2006: the NVIDIA GeForce 8800 card is the first GPU designed for general HPC as well as graphics processing (GPGPU)
- It has unified processors that can perform vertex, geometry, pixel, and general computing operations
  - Unifies graphics and computing
- It provides a large amount of floating-point processing power
  - Attractive even for non-graphics applications
- Programs can be written in C rather than through a graphics API

GPU Performance Gains over CPUs
[Figure: performance chart, not recovered]

Peak performance (these figures match the Tesla P100 column of Table 1)
- 5.3 TFLOPS of double-precision floating-point (FP64) performance
- 10.6 TFLOPS of single-precision (FP32) performance
- 21.2 TFLOPS of half-precision (FP16) performance

GPU Processor Array (GeForce 8800, 2006)
- 14 SMs, each with 8 SPs -> 112 SP (or CUDA) cores
- Connected to four DRAM partitions (each 8 bytes wide)

Kepler GK110: Full Chip Block Diagram
[Figure: block diagram, not recovered]

Each GK110 streaming multiprocessor (SMX) contains:
- 192 single-precision CUDA cores
- 64 double-precision units
- 32 special function units (SFU)
- 32 load/store units (LD/ST)

Table 1. Tesla P100 Compared to Prior Generation Tesla Products

Tesla Products             Tesla K40        Tesla M40        Tesla P100
GPU                        GK110 (Kepler)   GM200 (Maxwell)  GP100 (Pascal)
SMs                        15               24               56
TPCs                       15               24               28
FP32 CUDA Cores / SM       192              128              64
FP32 CUDA Cores / GPU      2880             3072             3584
FP64 CUDA Cores / SM       64               4                32
FP64 CUDA Cores / GPU      960              96               1792
Base Clock                 745 MHz          948 MHz          1328 MHz
GPU Boost Clock            810/875 MHz      1114 MHz         1480 MHz
Peak FP32 GFLOPS [1]       5040             6840             10600
Peak FP64 GFLOPS [1]       1680             210              5300
Texture Units              240              192              224
Memory Interface           384-bit GDDR5    384-bit GDDR5    4096-bit HBM2
Memory Size                Up to 12 GB      Up to 24 GB      16 GB
L2 Cache Size              1536 KB          3072 KB          4096 KB
Register File Size / SM    256 KB           256 KB           256 KB
Register File Size / GPU   3840 KB          6144 KB          14336 KB
TDP                        235 Watts        250 Watts        300 Watts
Transistors                7.1 billion      8 billion        15.3 billion
GPU Die Size               551 mm²          601 mm²          610 mm²
Manufacturing Process      28-nm            28-nm            16-nm FinFET

[1] The GFLOPS in this chart are based on GPU Boost Clocks.

Programming GPUs
- In the past, a program had to be expressed as a graphics-rendering algorithm
  - Very difficult to program
- Now: the CUDA programming model
  - An extension to C/C++
  - Programmers decompose a problem into small subproblems that are executed in parallel
- Two-level GPU architecture: SMP (streaming multiprocessor, SM) and SP (streaming processor)
  - Thus, a two-level decomposition of the parallel program
  - A thread block runs on an SMP
  - A thread runs on an SP

CUDA (Compute Unified Device Architecture)
- An architecture and programming model, introduced by NVIDIA in 2007
- Enables GPUs to execute programs written in C
- In C, the program simply calls kernel routines that are executed on the GPU
- Easy to start with, although getting the highest performance requires understanding the hardware architecture!

Example of Problem Decomposition
- A matrix is divided into 2-D blocks: 2 rows x 3 columns of blocks
- Each block has 3 x 5 elements
- Each block corresponds to one thread block
- Your task: write the code for a single thread, which computes only 1 element
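
The decomposition on this slide can be sketched in CUDA as follows. This is a minimal illustration, not the course's reference solution: the kernel name `scale_matrix`, the host helper `scale_on_gpu`, and the choice of "multiply by a factor" as the per-element work are all assumptions made for the example. Error checking of the CUDA calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Shape taken from the slide: 2 x 3 blocks, each holding 3 x 5 elements.
constexpr int GRID_ROWS  = 2, GRID_COLS  = 3;
constexpr int BLOCK_ROWS = 3, BLOCK_COLS = 5;
constexpr int ROWS = GRID_ROWS * BLOCK_ROWS;   // 6 matrix rows
constexpr int COLS = GRID_COLS * BLOCK_COLS;   // 15 matrix columns

// Code for ONE thread: it locates and updates exactly one matrix element.
// (Hypothetical per-element work: scaling by `factor`.)
__global__ void scale_matrix(float *m, float factor)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    m[row * COLS + col] *= factor;   // row-major storage assumed
}

// Hypothetical host helper; `host_in` must hold ROWS * COLS values.
std::vector<float> scale_on_gpu(const std::vector<float> &host_in, float factor)
{
    const size_t bytes = ROWS * COLS * sizeof(float);
    float *dev = nullptr;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 grid(GRID_COLS, GRID_ROWS);     // x = columns of blocks, y = rows of blocks
    dim3 block(BLOCK_COLS, BLOCK_ROWS);  // x = columns in a block, y = rows in a block
    scale_matrix<<<grid, block>>>(dev, factor);

    std::vector<float> host_out(ROWS * COLS);
    cudaMemcpy(host_out.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host_out;
}
```

Because the grid and block shapes cover the matrix exactly, no bounds check is needed here; a matrix whose size is not a multiple of the block shape would need one.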

How to Specify Block Size and Thread Block Size?
- Kernel<<< (input_size / block_size), (Tx, Ty) >>>
- Programmers need to decide Tx, Ty, and the block_size

CUDA Programming Paradigm
- There are 3 key abstractions:
  - A hierarchy of thread groups
  - Shared memories
  - Barrier synchronization
- Kernel: sequential code for 1 thread, designed to be executed by many threads
- Thread block: a set of concurrent threads (the second launch parameter: <<< , ??? >>>)
- Grid: a set of thread blocks, which execute in parallel (the first launch parameter: <<< ????, >>>)
- Every kernel has a grid
  - kernel A -> kernel B -> kernel C

CUDA Threads
- All threads execute the same kernel code, but can take different paths
- Each thread has an ID: threadIdx (.x, .y)
  - Can select its own input/output data
  - Can make its own control decisions
- Threads are grouped into thread blocks
- Thread blocks are grouped into a grid
- A kernel is executed as a grid of blocks of threads

[Figure: the host launches Kernel 1 and Kernel 2, which run on the device as Grid 1 and Grid 2. Grid 1 contains Block (0, 0), Block (0, 1), Block (1, 0), and Block (1, 1); Block (1, 1) is expanded to show its threads, Thread (0,0,0) through Thread (3,1,0) with a second layer (0,0,1) through (3,0,1). Courtesy: NVIDIA]

Restrictions
- All threads in a grid execute the same kernel
- A grid is organized as a 3D array of blocks (gridDim.x, gridDim.y, and gridDim.z)
- Each block is organized as a 3D array of threads (blockDim.x, blockDim.y, and blockDim.z)
- Once a kernel is launched, its dimensions cannot change
- All blocks in a grid have the same dimensions
- The total size of a block is limited to 1024 threads
- Once assigned to an SM, a thread block is executed in its entirety by that SM

Thread Index
- When invoking a kernel, the programmer specifies the number of blocks in the grid and the number of threads per block
- Each thread is given a unique thread ID, threadIdx, within its thread block
- Each thread block is given a unique block ID, blockIdx
- Thread blocks and grids may have 1, 2, or 3 dimensions, accessed via the .x, .y, and .z index fields
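
One way to check a launch configuration against these restrictions is to ask the runtime for the device's limits instead of hard-coding 1024. The sketch below uses `cudaGetDeviceProperties` and the `maxThreadsPerBlock`, `maxThreadsDim`, and `maxGridSize` fields of `cudaDeviceProp`; the helper name `launch_config_fits` is hypothetical and not part of CUDA.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns true if `grid` and `block` respect the limits
// reported by the given device. Returns false if the device cannot be queried.
bool launch_config_fits(dim3 grid, dim3 block, int device = 0)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return false;

    const unsigned int b[3] = {block.x, block.y, block.z};
    const unsigned int g[3] = {grid.x, grid.y, grid.z};

    unsigned long long threads_per_block = 1;
    for (int d = 0; d < 3; ++d) {
        // Per-dimension limits for the block and the grid.
        if (b[d] == 0 || b[d] > static_cast<unsigned int>(prop.maxThreadsDim[d]))
            return false;
        if (g[d] == 0 || g[d] > static_cast<unsigned int>(prop.maxGridSize[d]))
            return false;
        threads_per_block *= b[d];
    }
    // Total block size limit (1024 threads on the GPUs discussed here).
    return threads_per_block <=
           static_cast<unsigned long long>(prop.maxThreadsPerBlock);
}
```

For example, a 32 x 32 block (1024 threads) is expected to pass on the GPUs covered in these slides, while a 64 x 32 block (2048 threads) exceeds the block size limit even though each dimension is individually legal.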

CUDA C Keywords
- Kernel: a function that executes on the device (GPU) and can be called from the host (CPU)
  - Can only access GPU memory
  - No variable number of arguments
  - No static variables
- Functions must be declared with a qualifier
  - __global__: GPU kernel function launched by the CPU; must return void
  - __device__: can be called from GPU functions
  - __host__: can be called from CPU functions (default)
  - The __host__ and __device__ qualifiers can be combined
- Qualifiers determine how functions are compiled
  - They control which compilers are used to compile functions

Compiling CUDA C/C++ Programs

    // foo.cpp
    int foo(int x)
    {
      ...
    }
    float bar(float x)
    {
      ...
    }

    // saxpy.cu
    __global__ void saxpy(int n, float ...)
    {
      int i = threadIdx.x + ...;
      if (i < n) y[i] = a*x[i] + y[i];
    }

    // main.cpp
    int main( ) {
      float x = bar(1.0);
      if (x < 2.0f)
        saxpy<<<...>>>(foo(1), ...);
      ...
    }

Build flow:
- CUDA C functions -> NVCC -> CUDA object files
- Rest of C application -> CPU compiler -> CPU object files
- CUDA object files + CPU object files -> Linker -> CPU + GPU executable
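
The function qualifiers listed above can be seen side by side in a small sketch. All function names here (`square`, `square_plus_one`, `apply`, `apply_to_one_value`) are made up for illustration, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// __host__ __device__: compiled twice, once for the CPU and once for the GPU.
__host__ __device__ inline float square(float v) { return v * v; }

// __device__: runs on the GPU and is callable only from GPU code.
__device__ float square_plus_one(float v) { return square(v) + 1.0f; }

// __global__: a kernel, launched from the CPU; it must return void.
__global__ void apply(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = square_plus_one(data[i]);
}

// No qualifier means __host__: an ordinary CPU function.
float apply_to_one_value(float v)
{
    float *dev = nullptr;
    cudaMalloc(&dev, sizeof(float));
    cudaMemcpy(dev, &v, sizeof(float), cudaMemcpyHostToDevice);

    apply<<<1, 32>>>(dev, 1);   // 32 threads launched, only thread 0 has work

    float result = 0.0f;
    cudaMemcpy(&result, dev, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return result;
}
```

Calling `square` directly from host code and calling `apply_to_one_value` both work; calling `square_plus_one` from host code would be rejected by the compiler, since it exists only in the GPU build.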

Example of SAXPY GPU Kernel
[Figure: SAXPY kernel listing, not recovered]

Indexing Arrays with Blocks and Threads
- No longer as simple as using blockIdx.x and threadIdx.x
- Consider indexing an array with one element per thread (8 threads/block):

    threadIdx.x        threadIdx.x        threadIdx.x        threadIdx.x
    0 1 2 3 4 5 6 7    0 1 2 3 4 5 6 7    0 1 2 3 4 5 6 7    0 1 2 3 4 5 6 7
    blockIdx.x = 0     blockIdx.x = 1     blockIdx.x = 2     blockIdx.x = 3

- With M threads/block, a unique index for each thread is given by:

    int index = threadIdx.x + blockIdx.x * M;
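
Putting the indexing formula to work, a complete SAXPY (y = a*x + y) might look like the sketch below. The kernel body follows the fragment shown on the compilation slide; the full parameter list, the host helper `saxpy_on_gpu`, and the block size of 256 are assumptions for this example, and error checking is omitted. Here `blockDim.x` plays the role of M.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per element; `i` is the unique global index of this thread.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) y[i] = a * x[i] + y[i];   // guard: the last block may overshoot n
}

// Hypothetical host helper; assumes x.size() == y.size().
std::vector<float> saxpy_on_gpu(float a, const std::vector<float> &x,
                                std::vector<float> y)
{
    const int n = static_cast<int>(x.size());
    if (n == 0) return y;
    const size_t bytes = n * sizeof(float);

    float *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;                          // M threads per block
    const int blocks  = (n + threads - 1) / threads;  // round up to cover all n
    saxpy<<<blocks, threads>>>(n, a, dx, dy);

    cudaMemcpy(y.data(), dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    return y;
}
```

With n = 1000 and 256 threads per block, four blocks are launched (1024 threads), and the `if (i < n)` guard keeps the last 24 threads from writing past the end of the arrays.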

Kernel Execution
- A thread block executes on a single SM
  - Threads and blocks do not migrate to different SMs
  - All threads within a block execute concurrently, in parallel
- One SM may execute multiple thread blocks
  - It must be able to satisfy their aggregate register and memory demands
- A grid executes on a single device (GPU)
  - Blocks from the same grid may execute concurrently or serially
  - Blocks from multiple grids may execute concurrently
  - A device can execute multiple kernels concurrently

Thread / Thread Block / Thread Grid
- CUDA kernel calling syntax
  - kernel<<<grid dim, thread block dim>>>(... parameter list ...)
- Threads in a thread block can synchronize by calling __syncthreads()
  - They can communicate with each other through shared memory at the synchronization point
- The number of blocks depends on the user input
- Thread blocks must be independent! (able to execute in any order)
  - No direct communication between blocks
- Thread grids can be independent or dependent
  - There is an implicit barrier between consecutive kernels launched in the same stream

Memory Structure in GPU
- Local memory: per thread
- Shared memory: per thread block
- Global memory: per application
- The GPU executes kernel grids
- An SM executes one or more thread blocks
- An SM executes threads in groups of 32 threads; such a group is called a warp

Memory Structure in GPU (Cont.)
- Threads have access to multiple memory spaces
- Each thread has a private local memory and thread registers
- Each thread block has a shared memory, visible to all threads of the thread block
  - Declare variables with __shared__
  - Low-latency on-chip RAM, similar to an L1 cache
  - Normally: initialize data in shared memory, compute, then copy the data to global memory
- Finally, all threads have access to global memory
  - Declare variables with __device__
  - DRAM on the graphics board
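
The "initialize in shared memory, compute, copy to global memory" pattern can be illustrated with a per-block sum. This is a sketch under stated assumptions: the names `block_sums` and `sum_on_gpu` are hypothetical, the block size is fixed at 256 (a power of two, which the halving loop relies on), the per-block results are added up on the host for simplicity, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int THREADS = 256;   // threads per block; must be a power of two here

__global__ void block_sums(const float *in, float *partial, int n)
{
    __shared__ float tile[THREADS];          // one copy per thread block

    // 1. Initialize shared memory from global memory.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                         // all loads visible to the block

    // 2. Compute: tree reduction inside the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();                     // reached by every thread of the block
    }

    // 3. Copy the block's result back to global memory.
    if (threadIdx.x == 0) partial[blockIdx.x] = tile[0];
}

// Hypothetical host helper: sums `data` using one partial sum per block.
float sum_on_gpu(const std::vector<float> &data)
{
    const int n = static_cast<int>(data.size());
    if (n == 0) return 0.0f;
    const int blocks = (n + THREADS - 1) / THREADS;

    float *d_in = nullptr, *d_partial = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sums<<<blocks, THREADS>>>(d_in, d_partial, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_partial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);      // waits for the kernel to finish
    cudaFree(d_in);
    cudaFree(d_partial);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

Each block writes only its own entry of `partial`, so the blocks stay independent and can run in any order; threads cooperate only within a block, through `tile` and `__syncthreads()`.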

Warps
- Once a thread block is assigned to an SM, it is divided into units called warps (i.e., groups of 32 threads)
  - Thread IDs within a warp are consecutive and increasing
  - Warp 0 starts with thread ID 0
- The warp is the unit of thread scheduling in an SM
- Each warp is executed in SIMD fashion (i.e., all threads within a warp must execute the same instruction at any given time)
- A warp is like a traditional thread of SIMD instructions (32 elements wide)
- 32 SPs are like 32 SIMD lanes

Thread Blocks are Executed as Warps
- Each thread block is mapped to one or more warps
- When the thread block size is not a multiple of the warp size, unused threads within the last warp are disabled automatically
- The hardware schedules each warp independently
- Warps within a thread block can execute independently

[Figure: a block of 128 threads shown as four warps of 32 threads each]
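
The mapping from threads to warps is simple arithmetic, sketched below for 1D thread blocks. The helper names are hypothetical; `warpSize` is the built-in device variable, and the constant `WARP_SIZE = 32` reflects the warp size stated in these slides.

```cuda
#include <cuda_runtime.h>

constexpr int WARP_SIZE = 32;   // warp size stated on the slides

// Number of warps needed for a block: round up to whole warps.
__host__ __device__ inline int warps_per_block(int threads_per_block)
{
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}

// Lanes of the last warp that have no thread and are therefore disabled.
__host__ __device__ inline int idle_lanes_in_last_warp(int threads_per_block)
{
    return warps_per_block(threads_per_block) * WARP_SIZE - threads_per_block;
}

// Each thread of a 1D block records which warp it belongs to and its
// position (lane) inside that warp. Both arrays need blockDim.x entries.
__global__ void record_warp_and_lane(int *warp_id, int *lane_id)
{
    int t = threadIdx.x;
    warp_id[t] = t / warpSize;   // threads 0..31 -> warp 0, 32..63 -> warp 1, ...
    lane_id[t] = t % warpSize;
}
```

A block of 128 threads fills exactly 4 warps. A block of 100 threads also occupies 4 warps, but 28 lanes of the last warp sit idle, which is why block sizes that are multiples of 32 are usually preferred.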

Thread and Warp Scheduling
- The processors (streaming multiprocessors) can switch between warps with no apparent overhead
- Warps with an instruction whose inputs are ready are eligible to execute, and will be considered when scheduling
- When a warp is selected for execution, all (active) threads execute the same instruction

[Figure: warps W1 to W4 over time, each marked as executing, waiting for data, or ready to execute]

Filling Warps
- Prefer thread block sizes that result in mostly full warps

    Bad:     kernel<<<N, 1>>>( ... )
    Okay:    kernel<<<N / 32, 32>>>( ... )
    Better:  kernel<<<N / 128, 128>>>( ... )

- Prefer to have enough threads per block to provide the hardware with many warps to switch between
  - This is how the GPU hides memory access latency
- Resources like __shared__ may constrain threads per block
  - The algorithm and decomposition will establish some preferred amount of shared data and __shared__ allocation

Table 2. Compute Capabilities: GK110 vs GM200 vs GP100

GPU                                  Kepler GK110        Maxwell GM200   Pascal GP100
Compute Capability                   3.5                 5.2             6.0
Threads / Warp                       32                  32              32
Max Warps / Multiprocessor           64                  64              64
Max Threads / Multiprocessor         2048                2048            2048
Max Thread Blocks / Multiprocessor   16                  32              32
Max 32-bit Registers / SM            65536               65536           65536
Max Registers / Block                65536               32768           65536
Max Registers / Thread               255                 255             255
Max Thread Block Size                1024                1024            1024
Shared Memory Size / SM              16 KB/32 KB/48 KB   96 KB           64 KB
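
The limits in Table 2 jointly decide how many thread blocks can be resident on one SM at a time. The host-side sketch below estimates that number for the Pascal GP100 column. It is a simplified upper bound, not NVIDIA's occupancy calculator: it ignores hardware allocation granularity for registers and shared memory, and the names `SmLimits`, `GP100`, and `resident_blocks_per_sm` are hypothetical.

```cuda
#include <algorithm>

// Per-SM limits; the field names are made up for this sketch.
struct SmLimits {
    int max_threads;     // Max Threads / Multiprocessor
    int max_blocks;      // Max Thread Blocks / Multiprocessor
    int max_registers;   // Max 32-bit Registers / SM
    int shared_bytes;    // Shared Memory Size / SM
};

// Values copied from the Pascal GP100 column of Table 2.
constexpr SmLimits GP100 = {2048, 32, 65536, 64 * 1024};

// Estimated resident blocks per SM: the tightest of the four limits wins.
// Assumes threads_per_block > 0 and regs_per_thread > 0.
inline int resident_blocks_per_sm(const SmLimits &sm, int threads_per_block,
                                  int regs_per_thread, int shared_bytes_per_block)
{
    const int by_threads = sm.max_threads / threads_per_block;
    const int by_regs    = sm.max_registers / (regs_per_thread * threads_per_block);
    const int by_shared  = shared_bytes_per_block > 0
                               ? sm.shared_bytes / shared_bytes_per_block
                               : sm.max_blocks;   // no shared memory: no constraint
    return std::min({sm.max_blocks, by_threads, by_regs, by_shared});
}
```

For example, with 256 threads per block and 32 registers per thread, both the thread limit and the register limit allow 8 blocks. Doubling register use to 64 per thread halves that to 4 blocks, and a block of 128 threads that allocates 16 KB of `__shared__` memory is capped at 4 blocks by shared memory alone.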