Introduction to GPGPU and GPU-architectures


1 Introduction to GPGPU and GPU-architectures
Henk Corporaal, Gert-Jan van den Braak

2 Contents
1. What is a GPU
2. Programming a GPU
3. GPU thread scheduling
4. GPU performance bottlenecks

3 1. What is a GPU?
It's a bird... It's a plane... It's a GPU!

4 Graphics Processing Unit
ref: "How GPUs Work"

5 Trends in chip design
Frequency wall, Energy wall, ILP wall

6 Why multi / many core?
Running into the:
- Energy wall
- Frequency wall
- ILP wall
- Memory wall
Chip area enabler: Moore's law goes well below 22 nm. What to do with all this area?
- Multiple processors fit easily on a single die
- Cost effective
- Re-use: just connect existing processors or processor cores

7 How Do CPUs Spend Their Die Area?
CPUs are designed for low latency instead of high throughput.
Intel Penryn (×2): half the area is spent on processing; large on-chip caches.
Die photo of Intel Penryn (source: Intel)

8 How Do GPUs Spend Their Die Area?
GPUs are designed to match the workload of 3D graphics.
NVidia Kepler GPU (15 processing clusters): most area is spent on processing, with relatively small on-chip memories and huge off-chip memory latencies.

9 Why GPU? According to NVidia:
"NVidia GPUs give a 1000x performance boost!" - Jen-Hsun Huang, CEO NVidia

10 Why GPU? The answer from Intel:
"Actually, NVidia GPUs are only 2.5x faster than Intel CPUs." - Paul Otellini, CEO Intel
Application performance benchmarked by Intel: Intel i7 960 vs. NVidia GTX280.
Ref: "Debunking the 100X GPU vs. CPU Myth", ISCA 2010

11 Why GPU? The performance-boost claims side by side:
"NVidia GPUs give a 1000x performance boost!" - Jen-Hsun Huang, CEO NVidia
"Actually, NVidia GPUs are only 2.5x faster than Intel CPUs." - Paul Otellini, CEO Intel

12 2. How to program a GPU?

13 NVIDIA GPU architecture

14 NVIDIA GPU architecture
Nvidia Tesla architecture, e.g. the 8800GT (2006): an array of clusters, each with its own i-cache, scheduler, register file, processing cores, SFUs (special function units) and shared memory.

15 GPU programming
                    Sequential    Data parallel    Thread parallel
Programming model   C             Vector           CUDA
Architecture        CPU           SIMD             SIMT
Let's start from C and CPU.

16 Let's Start with C and CPU
int A[2][4];
for(i=0;i<2;i++){
  for(j=0;j<4;j++){
    A[i][j]++;
  }
}
Assembly code of the inner loop (programmer's view of RISC):
lw   r0, 4(r1)
addi r0, r0, 1
sw   r0, 4(r1)

17 Most CPUs Have Vector SIMD Units
Programmer's view of a vector SIMD unit, e.g. SSE.

18 Let's Program the Vector SIMD
Unroll the inner loop into a vector operation:
int A[2][4];
for(i=0;i<2;i++){
  for(j=0;j<4;j++){
    A[i][j]++;
  }
}
becomes
int A[2][4];
for(i=0;i<2;i++){
  movups xmm0, [ &A[i][0] ]   // load
  addps  xmm0, xmm1           // add 1
  movups [ &A[i][0] ], xmm0   // store
}
This looks like the previous RISC example (lw / addi / sw), but the SSE instructions execute on 4 ALUs.
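
(Not on the slides:) the same unrolled loop can be written portably with SSE intrinsics instead of inline assembly. A minimal sketch, assuming a float array so that addps applies; the slide's int array would need the integer add _mm_add_epi32 from SSE2:

#include <xmmintrin.h>   /* SSE intrinsics */

float A[2][4];

void increment_rows(void)
{
    const __m128 ones = _mm_set1_ps(1.0f);   /* xmm1 = {1,1,1,1}      */
    for (int i = 0; i < 2; i++) {
        __m128 v = _mm_loadu_ps(&A[i][0]);   /* movups: load 4 floats */
        v = _mm_add_ps(v, ones);             /* addps:  add 1 to each */
        _mm_storeu_ps(&A[i][0], v);          /* movups: store back    */
    }
}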

19 How Do Vector Programs Run?
int A[2][4];
for(i=0;i<2;i++){
  movups xmm0, [ &A[i][0] ]   // load
  addps  xmm0, xmm1           // add 1
  movups [ &A[i][0] ], xmm0   // store
}

20 Programmer's View of GPUs
A GPU contains multiple SIMD Units.

21 Programmer's View of GPUs
A GPU contains multiple SIMD Units. All of them can access global memory.

22 What Are the Differences? SSE vs. GPU
Let's start with two important differences:
1. GPUs use threads instead of vectors
2. The "Shared Memory" spaces

23 Let's Start Again from C
int A[2][4];
for(i=0;i<2;i++){
  for(j=0;j<4;j++){
    A[i][j]++;
  }
}
Convert into CUDA:
int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);   // define threads
__device__ kernelF(A){         // all threads run the same kernel
  i = blockIdx.x;              // each thread block has its id
  j = threadIdx.x;             // each thread has its id
  A[i][j]++;                   // each thread has a different i and j
}
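
The slide code is deliberately simplified: a launchable CUDA kernel must be declared __global__, and a real program needs types and host-side memory management. A minimal complete sketch (the host boilerplate is our assumption, not part of the slides; cudaMalloc/cudaMemcpy are the standard CUDA runtime calls):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernelF(int *A, int width)
{
    int i = blockIdx.x;           // thread block id selects the row
    int j = threadIdx.x;          // thread id selects the column
    A[i * width + j]++;           // each thread increments one element
}

int main()
{
    int h_A[2][4] = {{0,1,2,3},{4,5,6,7}};
    int *d_A;
    cudaMalloc(&d_A, sizeof(h_A));
    cudaMemcpy(d_A, h_A, sizeof(h_A), cudaMemcpyHostToDevice);

    kernelF<<<2, 4>>>(d_A, 4);    // 2 blocks of 4 threads, as on the slide

    cudaMemcpy(h_A, d_A, sizeof(h_A), cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    printf("A[1][3] = %d\n", h_A[1][3]);  // prints 8
    return 0;
}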

24 3. How does a GPU schedule all these threads?

25 Thread Hierarchy in CUDA
A Grid contains Thread Blocks; a Thread Block contains Threads.

26 What Is the Thread Hierarchy?
Thread 3 of block 1 operates on element A[1][3]:
int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);   // define threads
__device__ kernelF(A){         // all threads run the same kernel
  i = blockIdx.x;              // each thread block has its id
  j = threadIdx.x;             // each thread has its id
  A[i][j]++;                   // each thread has a different i and j
}

27 How Are Threads Scheduled?

28 Executing many, many threads
Scaling up from A[2][4] with kernelF<<<(2,1),(4,1)>>>(A) to:
int A[256][128];
kernelF<<<(256,1),(128,1)>>>(A);
__device__ kernelF(A){
  i = blockIdx.x;
  j = threadIdx.x;
  A[i][j]++;
}

29 Warps as a group of threads
A warp is a group of 32 threads from the same thread block.
Warps are generated dynamically by the hardware.
kernelF<<<(256,1),(128,1)>>>(A); gives 1 grid, 256 thread blocks, 128 threads per block = 4 warps per block.
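
The warp count is simply the block's thread count rounded up to a multiple of 32; a one-line helper (ours, purely illustrative) makes the arithmetic explicit:

/* Warps per thread block: 32 threads each, rounding up partial warps. */
int warps_per_block(int threads_per_block)
{
    return (threads_per_block + 31) / 32;   /* 128 threads -> 4 warps */
}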

30 Scheduling warps
Example, per cluster: 128 threads (4 warps), 16 processing cores.
Warp 0 = threads 0-31, Warp 1 = threads 32-63, Warp 2 = threads 64-95, Warp 3 = threads 96-127.
A warp of 32 threads executes on the 16 cores over 2 clock cycles, so a new warp instruction issues every other cycle. The scheduler round-robins over the warp pool:
Clock cycle 0: Warp 0, Instruction 0

31 Scheduling warps
Clock cycle 2: Warp 1, Instruction 0

32 Scheduling warps
Clock cycle 4: Warp 2, Instruction 0

33 Scheduling warps
Clock cycle 6: Warp 3, Instruction 0

34 Scheduling warps
Clock cycle 8: Warp 0, Instruction 1

35 Scheduling warps
Clock cycle 10: Warp 1, Instruction 1

36 Scheduling warps
Clock cycle 12: Warp 2, Instruction 1

37 Scheduling warps
Clock cycle 14: Warp 3, Instruction 1

38 Scheduling warps
Clock cycle 16: Warp 0, Instruction 2. And so on...
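
The issue order above can be reproduced with a few lines of host C (a toy model that assumes a perfect round-robin scheduler, one warp instruction every 2 cycles, and no stalls):

#include <stdio.h>

int main(void)
{
    const int warps = 4, instructions = 3, cycles_per_issue = 2;
    for (int i = 0; i < instructions; i++)
        for (int w = 0; w < warps; w++)
            printf("Clock cycle %2d: Warp %d, Instruction %d\n",
                   (i * warps + w) * cycles_per_issue, w, i);
    return 0;
}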

39 Scheduling warps from multiple thread blocks
Example, per cluster: 128 threads (4 warps) per thread block, 16 processing cores.
Warps from different thread blocks (TB0, TB1) share the warp pool and are interleaved by the scheduler:
TB0 - Warp 0..3, Instruction 0; TB1 - Warp 0..3, Instruction 0;
TB0 - Warp 0..3, Instruction 1; TB1 - Warp 0..3, Instruction 1; and so on...

40 Inside the warp scheduler
Ref: "Analyzing CUDA Workloads Using a Detailed GPU Simulator", ISPASS 2009 (GPGPU-Sim)

41 4. Performance bottlenecks
How do I get the maximum out of my GPU?

42 Thread divergence
Example, per cluster: 128 threads (4 warps), 16 processing cores.
__device__ kernelF(A){
  i = blockIdx.x;
  j = threadIdx.x;
  if(j%2 == 1)
    A[i][j]++;
  else
    A[i][j]--;
}
Every warp first executes the branch condition (j%2==1); then each warp executes both the A[i][j]++ path and the A[i][j]-- path in turn, with the threads on the not-taken path masked off. Divergence within a warp therefore costs the time of both branch paths.
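
A common remedy (our sketch, not from the slides) is to replace the branch by arithmetic, so that all 32 threads of a warp execute the identical instruction stream:

__global__ void kernelF_branchless(int *A, int width)
{
    int i = blockIdx.x;
    int j = threadIdx.x;
    // (j % 2) * 2 - 1 evaluates to +1 for odd j and -1 for even j,
    // so every thread executes the same add: no divergence.
    A[i * width + j] += (j % 2) * 2 - 1;
}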

43 GPU memory hierarchy
Registers: per thread
Shared memory: per thread block
Global memory: per kernel
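
In CUDA source the three levels map onto declarations like the following (a minimal sketch; the identifiers are invented for illustration, and the block is assumed to have 128 threads):

__global__ void memoryLevels(float *gA)   // gA points to global memory (per kernel)
{
    __shared__ float tile[128];           // shared memory (per thread block)
    float r = gA[threadIdx.x];            // r lives in a register (per thread)
    tile[threadIdx.x] = r * 2.0f;
    __syncthreads();                      // make shared-memory writes visible
    gA[threadIdx.x] = tile[threadIdx.x];
}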

44 On-chip shared / local memory
Shared memory is divided into banks (banks 0-7 in the figure), and the figure shows three patterns of threads 0-7 accessing them: no bank conflict, bank conflict, no bank conflict. When several threads of a warp hit different addresses in the same bank, their accesses are serialized.
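
A classic trick to avoid bank conflicts (our sketch, not from the slides) is to pad a shared-memory tile by one column, so that a column-wise read hits distinct banks. The matrix transpose below assumes a square width x width matrix with width a multiple of 32:

#define TILE 32

__global__ void transposeTile(float *out, const float *in, int width)
{
    // The extra +1 column makes consecutive rows start in different
    // banks, so reading tile[threadIdx.x][threadIdx.y] is conflict-free.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;   // swap block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}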

45 Off-chip global memory
The figure shows three patterns of threads 0-7 accessing global memory: coalesced (consecutive threads access consecutive addresses, which the hardware combines into one wide transaction), not coalesced, and not coalesced.
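
The difference shows up directly in the indexing (our sketch; stride is a hypothetical parameter):

__global__ void copyCoalesced(float *dst, const float *src)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    dst[tid] = src[tid];           // thread k reads address k: coalesced
}

__global__ void copyStrided(float *dst, const float *src, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    dst[tid] = src[tid * stride];  // stride > 1 scatters the warp's
                                   // addresses: not coalesced
}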

46 CPU - GPU memory transfers
CPU and GPU memory are connected via PCI-e. The timeline of one frame is: copy CPU to GPU, GPU compute, copy GPU to CPU, repeated frame after frame; the PCI-e transfers take a substantial share of each frame's time.
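
With pinned host memory and CUDA streams, the copy of one frame can overlap the compute of another. A hedged sketch (the kernel, buffer layout and sizes are our assumptions; cudaMemcpyAsync and streams are the standard CUDA mechanism):

#include <cuda_runtime.h>

__global__ void process(float *buf) { buf[threadIdx.x] *= 2.0f; }  // stand-in per-frame work

// h_in/h_out must be allocated with cudaMallocHost (pinned memory);
// d_buf[0] and d_buf[1] are device buffers of 'bytes' bytes each.
void run_frames(float **h_in, float **h_out, float *d_buf[2],
                int nFrames, size_t bytes, dim3 grid, dim3 block)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    for (int f = 0; f < nFrames; f++) {
        int b = f % 2;                  // ping-pong between buffers/streams
        cudaMemcpyAsync(d_buf[b], h_in[f], bytes,
                        cudaMemcpyHostToDevice, s[b]);
        process<<<grid, block, 0, s[b]>>>(d_buf[b]);
        cudaMemcpyAsync(h_out[f], d_buf[b], bytes,
                        cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();            // wait for all frames to finish
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}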

47 Example: Matrix Multiplication
__global__ void matrixMul_naive(float* C, float* A, float* B, int wA, int wB)
{
  // Accumulate row i of A and column j of B
  int i = blockIdx.y * blockDim.y + threadIdx.y;
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  float accu = 0;
  for(int k = 0; k < wA; k++) {
    accu = accu + A[ i * wA + k ] * B[ k * wB + j ];
  }
  // Write the element of the block sub-matrix to device memory
  C[ i * wB + j ] = accu;
}
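
The slide omits the host-side launch; a plausible configuration (our assumption: one thread per output element, 16x16 thread blocks, matrix dimensions hA and wB multiples of 16, with d_A/d_B/d_C already on the device) would be:

dim3 block(16, 16);                 // 256 threads per block
dim3 grid(wB / 16, hA / 16);        // one thread per element of C (hA x wB)
matrixMul_naive<<<grid, block>>>(d_C, d_A, d_B, wA, wB);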

48 Example: Matrix Multiplication (with shared-memory prefetching)
// BLOCK_SIZE and VECTOR_SIZE are compile-time constants; the 16-element
// cv array implies BLOCK_SIZE == 16 here.
__global__ void matrixMul_prefetch(float* C, float* A, float* B, int wA, int wB)
{
  // Block index
  int bx = blockIdx.x;
  int by = blockIdx.y;
  // Thread index
  int tx = threadIdx.x;
  int ty = threadIdx.y;

  // Declaration of the shared memory arrays As and As2, used to
  // store and prefetch sub-matrices of A
  __shared__ float As[BLOCK_SIZE * BLOCK_SIZE];
  __shared__ float As2[BLOCK_SIZE * BLOCK_SIZE];
  float *prefetch = As;
  float *prefetch2 = As2;

  // The sub-matrix of B is streamed from global memory instead:
  // __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

  float cv[BLOCK_SIZE] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};

  // Index of the first sub-matrix of A processed by the block
  int aBegin = wA * BLOCK_SIZE * by;
  // Index of the last sub-matrix of A processed by the block
  int aEnd = aBegin + wA - 1;
  // Step size used to iterate through the sub-matrices of A
  int aStep = BLOCK_SIZE;

  // Index of the first sub-matrix of B processed by the block
  int bBegin = BLOCK_SIZE * VECTOR_SIZE * bx;
  // Step size used to iterate through the sub-matrices of B
  int bStep = BLOCK_SIZE * wB;

  int cBegin = wB * BLOCK_SIZE * by + VECTOR_SIZE * BLOCK_SIZE * bx;

  // Prefetch the first sub-matrix of A into shared memory
  float *Ap = &A[aBegin + wA * ty + tx];
  float *ap = &prefetch[ty + BLOCK_SIZE * tx];
  #pragma unroll
  for(int i = 0; i < 16; i += 4){
    ap[i] = Ap[wA * i];
  }
  __syncthreads();

  // Loop over all the sub-matrices of A and B
  // required to compute the block sub-matrix
  for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
    // Prefetch the next sub-matrix of A from device memory
    // into the second shared-memory buffer
    Ap = &A[a + aStep + wA * ty + tx];
    float *ap2 = &prefetch2[ty + BLOCK_SIZE * tx];
    #pragma unroll
    for(int i = 0; i < 16; i += 4){
      ap2[i] = Ap[wA * i];
    }

    // Compute on the current buffer while the prefetch is in flight
    ap = &prefetch[0];
    float *bp = &B[b + BLOCK_SIZE * ty + tx];
    #pragma unroll
    for(int i = 0; i < BLOCK_SIZE; i++){
      float bv = bp[0];
      cv[0]  += ap[0]  * bv;
      cv[1]  += ap[1]  * bv;
      cv[2]  += ap[2]  * bv;
      cv[3]  += ap[3]  * bv;
      cv[4]  += ap[4]  * bv;
      cv[5]  += ap[5]  * bv;
      cv[6]  += ap[6]  * bv;
      cv[7]  += ap[7]  * bv;
      cv[8]  += ap[8]  * bv;
      cv[9]  += ap[9]  * bv;
      cv[10] += ap[10] * bv;
      cv[11] += ap[11] * bv;
      cv[12] += ap[12] * bv;
      cv[13] += ap[13] * bv;
      cv[14] += ap[14] * bv;
      cv[15] += ap[15] * bv;
      ap += BLOCK_SIZE;
      bp += wB;
    }
    // Synchronize to make sure the prefetched sub-matrix is loaded
    __syncthreads();

    // Swap As and As2
    float *prefetch_temp = prefetch;
    prefetch = prefetch2;
    prefetch2 = prefetch_temp;
  }

  // Write the block sub-matrix to device memory;
  // each thread writes one element per iteration
  float *Cp = &C[cBegin];
  Cp += BLOCK_SIZE * ty + tx;
  int cStep = wB;
  #pragma unroll
  for(int i = 0; i < BLOCK_SIZE; i++){
    Cp[0] = cv[i];
    Cp += cStep;
  }
}

49 Example: Matrix Multiplication

50 Present: AMD vs. NVidia
                        AMD Graphics Core Next   NVidia Kepler
                        HD 7970                  GTX 680
Clock frequency         925 MHz                  1.006 GHz
Max. performance (sp)   3.9 Tflop/s              3.1 Tflop/s
Max. performance (dp)   947 Gflop/s              129 Gflop/s
Bandwidth               264 GB/s                 192 GB/s
Gflops/W
Prog. language          OpenCL                   CUDA & OpenCL

51 Future: GPU and CPU on one chip
AMD Fusion, Intel i5 2500K

52 ARM & GPU
Present: Carma (NVidia) = Tegra 3 ARM CPU + CUDA GPU
Future: embedded ARM CPU + embedded GPU (ARM Mali, PowerVR)

53 Summary
- Trends in chip design: more transistors, more cores
- GPU architecture: SIMD vs. SIMT
- Thread scheduling on GPU
- Thread divergence & memory issues
- Current GPUs on the market: Gflops/W

54 Questions?
