Introduction to GPGPU and GPU Architectures
Henk Corporaal, Gert-Jan van den Braak
http://www.es.ele.tue.nl/
Contents
1. What is a GPU?
2. Programming a GPU
3. GPU thread scheduling
4. GPU performance bottlenecks
1. What is a GPU?
It's a bird... It's a plane... It's a GPU!
Graphics Processing Unit
Ref: "How GPUs Work", http://dx.doi.org/10.1109/mc.2007.59
Trends in chip design
- Frequency wall
- Energy wall
- ILP wall
Why multi/many-core?
Running into the:
- Energy wall
- Frequency wall
- ILP wall
- Memory wall
Chip area enabler: Moore's law goes well below 22 nm. What to do with all this area?
- Multiple processors fit easily on a single die
- Cost-effective re-use: just connect existing processors or processor cores
How Do CPUs Spend Their Die Area?
CPUs are designed for low latency instead of high throughput:
- about half the area is spent on processing
- the rest goes mostly to large on-chip caches
Die photo of Intel Penryn (source: Intel)
How Do GPUs Spend Their Die Area?
GPUs are designed to match the workload of 3D graphics (NVidia Kepler GPU shown):
- most area is spent on processing
- relatively small on-chip memories
- huge off-chip memory latencies
Why GPU? According to NVidia:
"NVidia GPUs give a 1000x performance boost!" (Jen-Hsun Huang, CEO NVidia)
Ref: http://www.nvidia.com/object/cuda-apps-flash-new.html
Why GPU? The answer from Intel:
"Actually, NVidia GPUs are only 2.5x faster than Intel CPUs." (Paul Otellini, CEO Intel)
Application performance benchmarked by Intel: Intel Core i7-960 vs. NVidia GTX 280.
Ref: "Debunking the 100X GPU vs. CPU Myth", ISCA 2010, http://dx.doi.org/10.1145/1815961.1816021
Why GPU?
"NVidia GPUs give a 1000x performance boost!" (Jen-Hsun Huang, CEO NVidia)
"Actually, NVidia GPUs are only 2.5x faster than Intel CPUs." (Paul Otellini, CEO Intel)
2. How to program a GPU?
NVIDIA GPU architecture
NVidia Tesla 8800GT (2006): multiple clusters, each with its own i-cache, scheduler, register file, SFUs, and shared memory.
GPU programming

                    Sequential | Data parallel | Thread parallel
Programming model   C          | Vector        | CUDA
Architecture        CPU        | SIMD          | SIMT

Let's start from C and CPU.
Let's Start with C and CPU

int A[2][4];
for(i=0; i<2; i++){
    for(j=0; j<4; j++){
        A[i][j]++;
    }
}

Assembly code of the inner loop (programmer's view of a RISC CPU):

lw   r0, 4(r1)
addi r0, r0, 1
sw   r0, 4(r1)
Most CPUs Have Vector SIMD Units
Programmer's view of a vector SIMD unit, e.g. SSE.
Let's Program the Vector SIMD
Unroll the inner loop into a vector operation:

int A[2][4];
for(i=0; i<2; i++){
    movups xmm0, [ &A[i][0] ]   // load
    addps  xmm0, xmm1           // add 1
    movups [ &A[i][0] ], xmm0   // store
}

This looks like the previous (RISC) example, but each SSE instruction executes on 4 ALUs at once.
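The same loop can also be written in C with SSE intrinsics instead of inline assembly (a minimal sketch of our own, not from the slides; it uses float data, since addps is a single-precision add):

#include <xmmintrin.h>   // SSE intrinsics

int main(void) {
    float A[2][4] = {{0,1,2,3},{4,5,6,7}};
    __m128 ones = _mm_set1_ps(1.0f);       // xmm1 = {1,1,1,1}
    for (int i = 0; i < 2; i++) {
        __m128 v = _mm_loadu_ps(A[i]);     // movups: load 4 floats
        v = _mm_add_ps(v, ones);           // addps: add 1 to each element
        _mm_storeu_ps(A[i], v);            // movups: store 4 floats
    }
    return 0;
}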
How Do Vector Programs Run?

int A[2][4];
for(i=0; i<2; i++){
    movups xmm0, [ &A[i][0] ]   // load
    addps  xmm0, xmm1           // add 1
    movups [ &A[i][0] ], xmm0   // store
}
Programmer's View of GPUs
A GPU contains multiple SIMD units. All of them can access global memory.
What Are the Differences? (SSE vs. GPU)
Let's start with two important differences:
1. GPUs use threads instead of vectors
2. GPUs have "shared memory" spaces
Let's Start Again from C

int A[2][4];
for(i=0; i<2; i++){
    for(j=0; j<4; j++){
        A[i][j]++;
    }
}

Convert into CUDA:

int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);   // define threads

__global__ kernelF(A){         // all threads run the same kernel
    i = blockIdx.x;            // each thread block has its id
    j = threadIdx.x;           // each thread has its id
    A[i][j]++;                 // each thread has a different i and j
}
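Filled in as a complete, compilable program, the example might look like this (a minimal sketch; the host-side allocation, copies, and flattened indexing are our additions, not part of the slide):

#include <cuda_runtime.h>

__global__ void kernelF(int *A, int w) {
    int i = blockIdx.x;               // block id selects the row
    int j = threadIdx.x;              // thread id selects the column
    A[i * w + j]++;                   // each thread increments one element
}

int main(void) {
    int h_A[2][4] = {{0}};
    int *d_A;
    cudaMalloc((void**)&d_A, sizeof(h_A));
    cudaMemcpy(d_A, h_A, sizeof(h_A), cudaMemcpyHostToDevice);
    kernelF<<<2, 4>>>(d_A, 4);        // 2 thread blocks of 4 threads each
    cudaMemcpy(h_A, d_A, sizeof(h_A), cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    return 0;
}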
3. How does a GPU schedule all these threads?
Thread Hierarchy in CUDA
- A grid contains thread blocks
- A thread block contains threads
What Is the Thread Hierarchy?
Thread 3 of block 1 operates on element A[1][3].

int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);   // define threads

__global__ kernelF(A){         // all threads run the same kernel
    i = blockIdx.x;            // each thread block has its id
    j = threadIdx.x;           // each thread has its id
    A[i][j]++;                 // each thread has a different i and j
}
How Are Threads Scheduled?
Executing many, many threads
Scale the same kernel up from 8 threads to 32K threads:

int A[256][128];
kernelF<<<(256,1),(128,1)>>>(A);

__global__ kernelF(A){
    i = blockIdx.x;
    j = threadIdx.x;
    A[i][j]++;
}
Warps as a group of threads
- A warp is a group of 32 threads from the same thread block
- Warps are generated dynamically by the hardware

kernelF<<<(256,1),(128,1)>>>(A);
→ 1 grid of 256 thread blocks, 128 threads per block = 4 warps per block
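Within a kernel, a thread's warp and its position (lane) in that warp follow directly from the thread index (a sketch of our own; warp_id and lane are not built-in names):

__global__ void whoAmI(int *out) {
    int warp_id = threadIdx.x / 32;           // which warp in the block: 0..3 for 128 threads
    int lane    = threadIdx.x % 32;           // position within the warp: 0..31
    out[threadIdx.x] = warp_id * 100 + lane;  // record both for inspection on the host
}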
Scheduling warps
Example, per cluster: 128 threads (4 warps), 16 processing cores.
With 32 threads per warp and 16 cores, each issued instruction occupies the cores for 2 clock cycles. The scheduler picks the warps in round-robin order:

Clock cycle  0: Warp 0, Instruction 0
Clock cycle  2: Warp 1, Instruction 0
Clock cycle  4: Warp 2, Instruction 0
Clock cycle  6: Warp 3, Instruction 0
Clock cycle  8: Warp 0, Instruction 1
Clock cycle 10: Warp 1, Instruction 1
Clock cycle 12: Warp 2, Instruction 1
Clock cycle 14: Warp 3, Instruction 1
Clock cycle 16: Warp 0, Instruction 2
And so on...
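The trace above can be reproduced with a tiny round-robin model (a sketch under simplifying assumptions: one issue slot, every warp always ready, each instruction occupying the 16 cores for 2 cycles):

#include <stdio.h>

#define WARPS            4
#define INSTRUCTIONS     3
#define CYCLES_PER_ISSUE 2   // 32 threads per warp / 16 cores

int main(void) {
    int pc[WARPS] = {0};     // next instruction per warp
    int cycle = 0;
    for (int n = 0; n < WARPS * INSTRUCTIONS; n++) {
        int w = n % WARPS;   // round-robin warp selection
        printf("Clock cycle %2d: Warp %d, Instruction %d\n", cycle, w, pc[w]++);
        cycle += CYCLES_PER_ISSUE;
    }
    return 0;
}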
Scheduling warps from multiple thread blocks
Example, per cluster: two thread blocks (TB0, TB1), each with 128 threads (4 warps), on 16 processing cores.
The warp pool now holds warps from both blocks, and the scheduler interleaves them:

TB0 - Warp 0..3, Instruction 0
TB1 - Warp 0..3, Instruction 0
TB0 - Warp 0..3, Instruction 1
TB1 - Warp 0..3, Instruction 1
And so on...
Inside the warp scheduler
Ref: "Analyzing CUDA Workloads Using a Detailed GPU Simulator" (GPGPU-Sim), ISPASS 2009, http://dx.doi.org/10.1109/ispass.2009.4919648
4. Performance bottlenecks
How do I get the maximum out of my GPU?
Thread divergence
Example, per cluster: 128 threads (4 warps), 16 processing cores.

__global__ kernelF(A){
    i = blockIdx.x;
    j = threadIdx.x;
    if(j%2 == 1)
        A[i][j]++;
    else
        A[i][j]--;
}

All 32 threads of a warp execute the same instruction, so each warp runs both sides of the branch, masking out the inactive threads:

Warp 0..3, Instruction j%2==1
Warp 0..3, Instruction A[i][j]++   (odd threads active)
Warp 0..3, Instruction A[i][j]--   (even threads active)
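For a branch this simple, divergence can be avoided by turning the two paths into one arithmetic expression (a possible rewrite of our own, not from the slides; the compiler can typically map the ternary to a select instruction instead of a branch):

__global__ void kernelF(int *A, int w) {
    int i = blockIdx.x;
    int j = threadIdx.x;
    A[i * w + j] += (j % 2 == 1) ? 1 : -1;   // same result, no divergent branch
}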
GPU memory hierarchy
- Registers: per thread
- Shared memory: per thread block
- Global memory: per kernel (accessible to all threads)
On-chip shared / local memory
Shared memory is divided into banks (threads Th.0-Th.7, banks 0-7 in the figure). The accesses of a warp proceed in parallel as long as no two threads touch different addresses in the same bank:
- each thread accesses its own bank: no bank conflict
- multiple threads access the same bank: bank conflict (the accesses are serialized)
- a permutation of threads over the banks: no bank conflict
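A classic way to remove bank conflicts is to pad a shared 2D array by one element per row, so that a column of the array no longer maps to a single bank (a sketch based on the well-known tiled transpose; BLOCK and the kernel name are our own):

#define BLOCK 32

__global__ void transposeTile(float *out, const float *in, int n) {
    __shared__ float tile[BLOCK][BLOCK + 1];   // +1 padding: column accesses hit different banks
    int x = blockIdx.x * BLOCK + threadIdx.x;
    int y = blockIdx.y * BLOCK + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];    // coalesced read from global memory
    __syncthreads();
    x = blockIdx.y * BLOCK + threadIdx.x;              // swapped block indices for the transpose
    y = blockIdx.x * BLOCK + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];   // conflict-free thanks to the padding
}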
Off-chip global memory
When the threads of a warp (Th.0-Th.7 in the figure) access consecutive global memory addresses, the hardware coalesces them into a single wide memory transaction:
- consecutive, aligned accesses: coalesced
- strided or scattered accesses: not coalesced (multiple transactions, wasted bandwidth)
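The difference is visible directly in the indexing of a kernel (a sketch of our own):

__global__ void coalesced(float *A) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    A[i] += 1.0f;          // thread k touches address k: one wide transaction per warp
}

__global__ void strided(float *A, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    A[i] += 1.0f;          // thread k touches address k*stride: many transactions per warp
}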
CPU ↔ GPU memory
CPU and GPU memories are separate and connected via PCI-e. Each frame of work involves three steps: copy CPU → GPU, GPU compute, copy GPU → CPU. As the timeline in the figure shows, the transfers of one frame can be overlapped with the compute of another, hiding the PCI-e transfer time.
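In CUDA this overlap is expressed with streams and asynchronous copies (a minimal sketch of our own; the kernel, sizes, and two-buffer scheme are assumptions, and the host buffers must be pinned for the async copies to overlap):

#include <cuda_runtime.h>

#define FRAMES 8
#define N (1 << 20)

__global__ void process(float *out, const float *in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 2.0f * in[i];                    // placeholder per-frame work
}

int main(void) {
    float *h_in, *h_out, *d_in[2], *d_out[2];
    cudaMallocHost((void**)&h_in,  FRAMES * N * sizeof(float));   // pinned host memory
    cudaMallocHost((void**)&h_out, FRAMES * N * sizeof(float));
    cudaStream_t s[2];
    for (int i = 0; i < 2; i++) {
        cudaStreamCreate(&s[i]);
        cudaMalloc((void**)&d_in[i],  N * sizeof(float));
        cudaMalloc((void**)&d_out[i], N * sizeof(float));
    }
    for (int f = 0; f < FRAMES; f++) {
        int b = f % 2;   // alternate streams: copies of frame f overlap compute of frame f-1
        cudaMemcpyAsync(d_in[b], h_in + f * N, N * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        process<<<N / 256, 256, 0, s[b]>>>(d_out[b], d_in[b]);
        cudaMemcpyAsync(h_out + f * N, d_out[b], N * sizeof(float),
                        cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();
    return 0;
}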
Example: Matrix Multiplication

__global__ void matrixMul_naive(float* C, float* A, float* B, int wA, int wB)
{
    // Accumulate row i of A and column j of B
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    float accu = 0;
    for (int k = 0; k < wA; k++) {
        accu += A[i * wA + k] * B[k * wB + j];
    }
    // Write the computed element to device memory
    C[i * wB + j] = accu;
}
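A matching host-side launch would use one thread per element of C, for example with 16x16 thread blocks (our sketch, assuming the matrix dimensions are multiples of 16; hA is the height of A):

dim3 block(16, 16);
dim3 grid(wB / 16, hA / 16);   // one thread per element of the hA x wB result C
matrixMul_naive<<<grid, block>>>(d_C, d_A, d_B, wA, wB);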
Example: Matrix Multiplication
Optimized version: double-buffered prefetching of A sub-matrices into shared memory.

__global__ void matrixMul_prefetch(float* C, float* A, float* B, int wA, int wB)
{
    // Block and thread index
    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Two shared memory arrays used to double-buffer sub-matrices of A
    __shared__ float As [BLOCK_SIZE * BLOCK_SIZE];
    __shared__ float As2[BLOCK_SIZE * BLOCK_SIZE];
    float *prefetch  = As;
    float *prefetch2 = As2;

    // Per-thread column of partial results
    float cv[BLOCK_SIZE] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};

    // Index of the first and last sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;
    int aEnd   = aBegin + wA - 1;
    // Step size used to iterate through the sub-matrices of A
    int aStep  = BLOCK_SIZE;
    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * VECTOR_SIZE * bx;
    // Step size used to iterate through the sub-matrices of B
    int bStep  = BLOCK_SIZE * wB;
    int cBegin = wB * BLOCK_SIZE * by + VECTOR_SIZE * BLOCK_SIZE * bx;

    // Prefetch the first sub-matrix of A into shared memory
    float *Ap = &A[aBegin + wA * ty + tx];
    float *ap = &prefetch[ty + BLOCK_SIZE * tx];
    #pragma unroll
    for (int i = 0; i < 16; i += 4) {
        ap[i] = Ap[wA * i];
    }
    __syncthreads();

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
        // Prefetch the next sub-matrix of A into the second buffer
        Ap = &A[a + aStep + wA * ty + tx];
        float *ap2 = &prefetch2[ty + BLOCK_SIZE * tx];
        #pragma unroll
        for (int i = 0; i < 16; i += 4) {
            ap2[i] = Ap[wA * i];
        }

        // Compute with the current buffer
        ap = &prefetch[0];
        float *bp = &B[b + BLOCK_SIZE * ty + tx];
        #pragma unroll
        for (int i = 0; i < BLOCK_SIZE; i++) {
            float bv = bp[0];
            cv[0]  += ap[0]  * bv;  cv[1]  += ap[1]  * bv;
            cv[2]  += ap[2]  * bv;  cv[3]  += ap[3]  * bv;
            cv[4]  += ap[4]  * bv;  cv[5]  += ap[5]  * bv;
            cv[6]  += ap[6]  * bv;  cv[7]  += ap[7]  * bv;
            cv[8]  += ap[8]  * bv;  cv[9]  += ap[9]  * bv;
            cv[10] += ap[10] * bv;  cv[11] += ap[11] * bv;
            cv[12] += ap[12] * bv;  cv[13] += ap[13] * bv;
            cv[14] += ap[14] * bv;  cv[15] += ap[15] * bv;
            ap += BLOCK_SIZE;
            bp += wB;
        }

        // Synchronize to make sure the next sub-matrix is loaded
        __syncthreads();

        // Swap the prefetch buffers (As and As2)
        float *prefetch_temp = prefetch;
        prefetch  = prefetch2;
        prefetch2 = prefetch_temp;
    }

    // Write the block sub-matrix to device memory;
    // each thread writes one column of BLOCK_SIZE elements
    float *Cp = &C[cBegin] + BLOCK_SIZE * ty + tx;
    int cStep = wB;
    #pragma unroll
    for (int i = 0; i < BLOCK_SIZE; i++) {
        Cp[0] = cv[i];
        Cp += cStep;
    }
}
Present: AMD vs. NVidia

                        AMD HD 7970             NVidia GTX 680
                        (Graphics Core Next)    (Kepler)
Cores                   2048 @ 925 MHz          1536 @ 1 GHz
Max. performance (sp)   3.9 Tflop/s             3.1 Tflop/s
Max. performance (dp)   947 Gflop/s             129 Gflop/s
Bandwidth               264 GB/s                192 GB/s
Gflop/s per W           16.5                    15.9
Prog. language          OpenCL                  CUDA & OpenCL
Future: GPU and CPU on one chip
- AMD Fusion
- Intel i5 2500K
ARM & GPU
Present: Carma (NVidia) = Tegra 3 (ARM) + CUDA GPU
Future: embedded ARM CPU + embedded GPU (ARM Mali, PowerVR)
Summary
- Trends in chip design: more transistors → more cores
- GPU architecture: SIMD → SIMT
- Thread scheduling on the GPU
- Thread divergence & memory issues
- Current GPUs on the market: Gflop/s per W
Questions?