Introduction to GPGPU and GPU-architectures


Henk Corporaal and Gert-Jan van den Braak
Department of Electrical Engineering (ES group), Eindhoven University of Technology
http://www.es.ele.tue.nl/
28 June 2012

Contents
1. What is a GPU?
2. Programming a GPU
3. GPU thread scheduling
4. GPU performance bottlenecks

1. What is a GPU?
It's a bird... It's a plane... It's a GPU!

Graphics Processing Unit
Ref: "How GPUs Work", http://dx.doi.org/10.1109/mc.2007.59

Trends in chip design
Frequency wall, energy wall, ILP wall.

Why multi / many core?
Running into the energy wall, frequency wall, ILP wall and memory wall.
Chip area enabler: Moore's law goes well below 22 nm. What to do with all this area?
- Multiple processors fit easily on a single die
- Cost effective
- Re-use: just connect existing processors or processor cores

How Do CPUs Spend Their Die Area?
CPUs are designed for low latency instead of high throughput.
Intel Penryn (dual core): about half the area is spent on processing, much of the rest on large on-chip caches.
(Die photo of Intel Penryn, source: Intel)

How Do GPUs Spend Their Die Area?
GPUs are designed to match the workload of 3D graphics.
NVidia Kepler GPU (die photo with 15 processing clusters): most area is spent on processing, with relatively small on-chip memories and huge off-chip memory latencies.

Why GPU? (according to NVidia)
"NVidia GPUs give a 1000x performance boost!" (Jen-Hsun Huang, CEO NVidia)
Ref: http://www.nvidia.com/object/cuda-apps-flash-new.html

Why GPU? (answer from Intel)
"Actually, NVidia GPUs are only 2.5x faster than Intel CPUs." (Paul Otellini, CEO Intel)
Application performance benchmarked by Intel: Intel Core i7 960 vs. NVidia GTX 280.
Ref: "Debunking the 100X GPU vs. CPU Myth", ISCA 2010, http://dx.doi.org/10.1145/1815961.1816021

Why the GPU performance boost?
"NVidia GPUs give a 1000x performance boost!" (Jen-Hsun Huang, CEO NVidia)
"Actually, NVidia GPUs are only 2.5x faster than Intel CPUs." (Paul Otellini, CEO Intel)

2. How to program a GPU?

NVIDIA GPU architecture

NVIDIA GPU architecture: NVidia Tesla 8800GT (2006)
(Block diagram: several clusters, each with an i-cache, scheduler, register file, SFUs and shared memory.)

GPU programming

                    Sequential   Data parallel   Thread parallel
Programming model   C            Vector          CUDA
Architecture        CPU          SIMD            SIMT

Let's start from C and the CPU.

Let's Start with C and the CPU

int A[2][4];
for(i=0;i<2;i++){
  for(j=0;j<4;j++){
    A[i][j]++;
  }
}

Assembly code of the inner loop (programmer's view of a RISC CPU):

lw   r0, 4(r1)
addi r0, r0, 1
sw   r0, 4(r1)

Most CPUs Have Vector SIMD Units
Programmer's view of a vector SIMD unit, e.g. SSE.

Let's Program the Vector SIMD
Unroll the inner loop into a single vector operation.

Scalar C code:

int A[2][4];
for(i=0;i<2;i++){
  for(j=0;j<4;j++){
    A[i][j]++;
  }
}

Vectorized with SSE:

int A[2][4];
for(i=0;i<2;i++){
  movups xmm0, [ &A[i][0] ]   // load
  addps  xmm0, xmm1           // add 1
  movups [ &A[i][0] ], xmm0   // store
}

This looks like the scalar inner loop of the previous example (lw / addi / sw), but each SSE instruction executes on 4 ALUs at once.

How Do Vector Programs Run?

int A[2][4];
for(i=0;i<2;i++){
  movups xmm0, [ &A[i][0] ]   // load
  addps  xmm0, xmm1           // add 1
  movups [ &A[i][0] ], xmm0   // store
}
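The same idea with compiler intrinsics instead of inline assembly (a sketch of mine, not from the slides): since SSE's addps operates on packed floats, the sketch assumes a float matrix and uses the <xmmintrin.h> intrinsics; increment_rows_sse is a hypothetical helper name.

#include <xmmintrin.h>   // SSE intrinsics

// Increment every element of a 2x4 float matrix, one 4-wide vector
// operation per row (mirrors the movups / addps / movups loop above).
static void increment_rows_sse(float A[2][4]) {
    const __m128 ones = _mm_set1_ps(1.0f);    // plays the role of xmm1 = {1,1,1,1}
    for (int i = 0; i < 2; i++) {
        __m128 row = _mm_loadu_ps(&A[i][0]);  // movups: load 4 floats
        row = _mm_add_ps(row, ones);          // addps:  add 1 to each lane
        _mm_storeu_ps(&A[i][0], row);         // movups: store 4 floats
    }
}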

Programmer's View of GPUs
A GPU contains multiple SIMD units.

Programmer's View of GPUs
A GPU contains multiple SIMD units. All of them can access global memory.

What Are the Differences? (SSE vs. GPU)
Let's start with two important differences:
1. GPUs use threads instead of vectors
2. GPUs have "shared memory" spaces

Let's Start Again from C

int A[2][4];
for(i=0;i<2;i++){
  for(j=0;j<4;j++){
    A[i][j]++;
  }
}

Converted into CUDA:

int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);          // define the threads: 2 thread blocks of 4 threads

__global__ void kernelF(int A[][4]){  // all threads run the same kernel
  int i = blockIdx.x;                 // each thread block has its id
  int j = threadIdx.x;                // each thread has its id
  A[i][j]++;                          // each thread works on a different i and j
}
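For reference, a complete, compilable version of this example (my sketch, not the slide's code): the kernel logic follows the slide, but the device allocation, copies and launch configuration around it are just one straightforward way to write them; names such as d_A are made up here.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernelF(int *A, int width) {
    int i = blockIdx.x;              // the block id selects the row
    int j = threadIdx.x;             // the thread id selects the column
    A[i * width + j]++;              // every thread increments one element
}

int main() {
    int A[2][4] = {{0,1,2,3},{4,5,6,7}};
    int *d_A;                                         // hypothetical device copy of A
    cudaMalloc(&d_A, sizeof(A));
    cudaMemcpy(d_A, A, sizeof(A), cudaMemcpyHostToDevice);

    kernelF<<<2, 4>>>(d_A, 4);                        // 2 thread blocks of 4 threads each

    cudaMemcpy(A, d_A, sizeof(A), cudaMemcpyDeviceToHost);
    cudaFree(d_A);

    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 4; j++)
            printf("A[%d][%d] = %d\n", i, j, A[i][j]);
    return 0;
}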

3. How does a GPU schedule all these threads?

Thread Hierarchy in CUDA
A grid contains thread blocks; a thread block contains threads.

What Is the Thread Hierarchy?
Thread 3 of block 1 operates on element A[1][3].

int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);          // define the threads

__global__ void kernelF(int A[][4]){  // all threads run the same kernel
  int i = blockIdx.x;                 // each thread block has its id
  int j = threadIdx.x;                // each thread has its id
  A[i][j]++;                          // each thread works on a different i and j
}

How Are Threads Scheduled?

Executing many, many threads
Scaling the example up from A[2][4] to A[256][128]:

int A[256][128];
kernelF<<<(256,1),(128,1)>>>(A);

__global__ void kernelF(int A[][128]){
  int i = blockIdx.x;
  int j = threadIdx.x;
  A[i][j]++;
}

Warps as a group of threads
A warp is a group of 32 threads from the same thread block. Warps are formed dynamically by the hardware.

kernelF<<<(256,1),(128,1)>>>(A);

launches 1 grid of 256 thread blocks, with 128 threads per block = 4 warps per block.
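As an illustration (mine, not on the slide), the warp a thread belongs to follows directly from its thread id; the hardware forms warps from 32 consecutive thread ids:

// Hypothetical kernel: record which warp each thread belongs to.
__global__ void warp_id_demo(int *warp_of_thread) {
    int tid     = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_id = threadIdx.x / 32;   // 0..3 when blockDim.x == 128
    int lane    = threadIdx.x % 32;   // position of the thread within its warp
    (void)lane;                       // lane is shown only for illustration
    warp_of_thread[tid] = warp_id;
}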

Scheduling warps
Example, per cluster: 128 threads (4 warps), 16 processing cores.
A warp of 32 threads occupies the 16 processing cores for 2 clock cycles, so every 2 cycles the scheduler issues the next instruction from the ready warps in the warp pool, round-robin:

Clock cycle   Issued from the warp pool
0             Warp 0 (threads 0-31),   Instruction 0
2             Warp 1 (threads 32-63),  Instruction 0
4             Warp 2 (threads 64-95),  Instruction 0
6             Warp 3 (threads 96-127), Instruction 0
8             Warp 0, Instruction 1
10            Warp 1, Instruction 1
12            Warp 2, Instruction 1
14            Warp 3, Instruction 1
16            Warp 0, Instruction 2
...and so on.

Scheduling warps from multiple thread blocks
Example, per cluster: 128 threads (4 warps) per thread block, 16 processing cores.
With two thread blocks (TB0 and TB1) resident on the same cluster, the warp pool holds 8 warps instead of 4. The scheduler issues instruction 0 for TB0 warps 0-3 and TB1 warps 0-3, then instruction 1 for TB0 warps 0-3 and TB1 warps 0-3, and so on. More resident warps give the scheduler more opportunities to hide latencies.

Inside the warp scheduler
(Detailed scheduler model, as used in GPGPU-Sim.)
Ref: "Analyzing CUDA Workloads Using a Detailed GPU Simulator", ISPASS 2009, http://dx.doi.org/10.1109/ispass.2009.4919648

4. Performance bottlenecks
How do I get the maximum out of my GPU?

Thread divergence
Example, per cluster: 128 threads (4 warps), 16 processing cores.

__global__ void kernelF(int A[][128]){
  int i = blockIdx.x;
  int j = threadIdx.x;
  if(j % 2 == 1)
    A[i][j]++;
  else
    A[i][j]--;
}

The warp pool first issues the condition (j % 2 == 1) for warps 0-3, then the A[i][j]++ path for warps 0-3, and then the A[i][j]-- path for warps 0-3. Because even and odd threads are mixed within every warp, each warp executes both sides of the branch (with the non-participating threads masked off), so the divergent code takes roughly twice as long.
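A common remedy, not shown on the slide: restructure the code so all threads of a warp take the same path, or remove the branch entirely. A minimal branch-free sketch of the kernel above (my code, assuming a flattened array is passed in):

// Branch-free variant: every thread executes the same instruction stream,
// adding +1 to odd columns and -1 to even columns without an if/else.
__global__ void kernelF_nodiv(int *A, int width) {
    int i = blockIdx.x;
    int j = threadIdx.x;
    int delta = 2 * (j & 1) - 1;      // +1 when j is odd, -1 when j is even
    A[i * width + j] += delta;        // single path, no warp divergence
}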

GPU memory hierarchy
Register        per thread
Shared memory   per thread block
Global memory   per kernel
(Each cluster has its own i-cache, scheduler, register file, SFUs and shared memory; global memory is shared by all clusters.)
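A small sketch of where these three kinds of data are declared in CUDA C (my example, not the lecture's; it assumes thread blocks of 128 threads and an output array with one float per thread):

__device__ float g_table[256];                   // global memory: visible to all threads, lives across the kernel

// Hypothetical kernel mixing the three memory spaces.
__global__ void memory_demo(float *global_out)   // global_out also resides in global memory
{
    __shared__ float s_buf[128];                 // shared memory: one copy per thread block
    float r = threadIdx.x * 0.5f;                // local variable: held in a register, per thread

    s_buf[threadIdx.x] = r + g_table[threadIdx.x % 256];
    __syncthreads();
    global_out[blockIdx.x * blockDim.x + threadIdx.x] = s_buf[threadIdx.x];
}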

On-chip shared / local memory
Shared memory is divided into banks. For threads 0-7 accessing banks 0-7, three access patterns:
- each thread accesses a different bank: no bank conflict
- several threads access the same bank: bank conflict, the accesses are serialized
- a permutation in which each thread still hits a distinct bank: no bank conflict
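A small illustration of the two extremes (my code, not the slide's), assuming the usual 32 banks of 4-byte words and a launch with 32 threads per block:

// Hypothetical kernel contrasting conflict-free and conflicting accesses.
__global__ void bank_demo(float *out) {          // out: at least 32 floats
    __shared__ float tile[32 * 32];
    for (int k = 0; k < 32; k++)                 // each thread fills one row of the tile
        tile[threadIdx.x * 32 + k] = (float)(threadIdx.x + k);
    __syncthreads();

    // Conflict-free: thread t reads word t, so each thread of the warp
    // touches a different bank (bank = word index mod 32).
    float a = tile[threadIdx.x];

    // 32-way bank conflict: a stride of 32 words maps every thread of the
    // warp to bank 0, so the 32 reads are serviced one after another.
    float b = tile[threadIdx.x * 32];

    out[threadIdx.x] = a + b;
}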

Off-chip global memory
For threads 0-7 accessing global memory:
- consecutive threads accessing consecutive addresses: coalesced into a single memory transaction
- scattered or strided accesses: not coalesced, requiring multiple transactions
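A sketch contrasting the two patterns (mine, not the slide's), for a row-major matrix whose width is assumed to be a multiple of the block size:

// Coalesced: consecutive threads read consecutive floats of one row.
__global__ void copy_coalesced(const float *in, float *out, int width) {
    int row = blockIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    out[row * width + col] = in[row * width + col];
}

// Not coalesced: consecutive threads each read a different row of one column,
// so their addresses are width floats apart and each access needs its own transaction.
__global__ void copy_strided(const float *in, float *out, int width) {
    int col = blockIdx.y;
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    out[row * width + col] = in[row * width + col];
}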

CPU-GPU memory transfers (PCI-e)
Per frame: copy the input CPU -> GPU, GPU compute, copy the result GPU -> CPU. The transfers and the compute of successive frames can be overlapped in time.
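One way to get this overlap in CUDA (a sketch of mine, not from the slides) is to use cudaMemcpyAsync with two streams, so the copies of one frame run while another frame is being computed; it assumes pinned host buffers (cudaHostAlloc) and pre-allocated device buffers d_in/d_out per stream:

// process_frame is a stand-in kernel for the per-frame GPU compute.
__global__ void process_frame(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

void run_frames(float *h_in, float *h_out, float *d_in[2], float *d_out[2],
                int n, int nFrames) {
    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    size_t bytes = n * sizeof(float);
    for (int f = 0; f < nFrames; f++) {
        int s = f & 1;                              // alternate between the two streams
        cudaMemcpyAsync(d_in[s], h_in + f * n, bytes,
                        cudaMemcpyHostToDevice, stream[s]);
        process_frame<<<(n + 255) / 256, 256, 0, stream[s]>>>(d_in[s], d_out[s], n);
        cudaMemcpyAsync(h_out + f * n, d_out[s], bytes,
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaStreamSynchronize(stream[0]);
    cudaStreamSynchronize(stream[1]);
    cudaStreamDestroy(stream[0]);
    cudaStreamDestroy(stream[1]);
}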

Example: Matrix Multiplication (naive kernel)

__global__ void matrixMul_naive(float* C, float* A, float* B, int wA, int wB)
{
    // Accumulate row i of A and column j of B
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    float accu = 0;
    for (int k = 0; k < wA; k++) {
        accu = accu + A[i * wA + k] * B[k * wB + j];
    }
    // Write the result to device memory; each thread writes one element
    C[i * wB + j] = accu;
}
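A possible host-side launch for this kernel (not on the slide), assuming square N x N matrices with N a multiple of the 16 x 16 thread block:

// d_A, d_B, d_C are device pointers holding N*N floats each.
void launch_matrixMul_naive(float *d_C, float *d_A, float *d_B, int N) {
    dim3 block(16, 16);
    dim3 grid(N / block.x, N / block.y);
    matrixMul_naive<<<grid, block>>>(d_C, d_A, d_B, N, N);
}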

Example: Matrix Multiplication (optimized kernel with shared-memory prefetching)

__global__ void matrixMul_prefetch(float* C, float* A, float* B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;
    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Two shared memory buffers for the sub-matrix of A (double buffering)
    __shared__ float As [BLOCK_SIZE * BLOCK_SIZE];
    __shared__ float As2[BLOCK_SIZE * BLOCK_SIZE];
    float *prefetch  = As;
    float *prefetch2 = As2;

    // Per-thread column of results, kept in registers
    float cv[BLOCK_SIZE] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};

    // Index of the first and last sub-matrix of A processed by the block,
    // and the step size used to iterate through the sub-matrices of A
    int aBegin = wA * BLOCK_SIZE * by;
    int aEnd   = aBegin + wA - 1;
    int aStep  = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block,
    // and the step size used to iterate through the sub-matrices of B
    int bBegin = BLOCK_SIZE * VECTOR_SIZE * bx;
    int bStep  = BLOCK_SIZE * wB;

    int cBegin = wB * BLOCK_SIZE * by + VECTOR_SIZE * BLOCK_SIZE * bx;

    // Prefetch the first sub-matrix of A into shared memory
    float *Ap = &A[aBegin + wA * ty + tx];
    float *ap = &prefetch[ty + BLOCK_SIZE * tx];
#pragma unroll
    for (int i = 0; i < 16; i += 4) {
        ap[i] = Ap[wA * i];
    }
    __syncthreads();

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
        // Prefetch the next sub-matrix of A into the second buffer
        Ap = &A[a + aStep + wA * ty + tx];
        float *ap2 = &prefetch2[ty + BLOCK_SIZE * tx];
#pragma unroll
        for (int i = 0; i < 16; i += 4) {
            ap2[i] = Ap[wA * i];
        }

        // Multiply the current sub-matrix of A with a column of B
        ap = &prefetch[0];
        float *bp = &B[b + BLOCK_SIZE * ty + tx];
#pragma unroll
        for (int i = 0; i < BLOCK_SIZE; i++) {
            float bv = bp[0];
            cv[0]  += ap[0]  * bv;   cv[1]  += ap[1]  * bv;
            cv[2]  += ap[2]  * bv;   cv[3]  += ap[3]  * bv;
            cv[4]  += ap[4]  * bv;   cv[5]  += ap[5]  * bv;
            cv[6]  += ap[6]  * bv;   cv[7]  += ap[7]  * bv;
            cv[8]  += ap[8]  * bv;   cv[9]  += ap[9]  * bv;
            cv[10] += ap[10] * bv;   cv[11] += ap[11] * bv;
            cv[12] += ap[12] * bv;   cv[13] += ap[13] * bv;
            cv[14] += ap[14] * bv;   cv[15] += ap[15] * bv;
            ap += BLOCK_SIZE;
            bp += wB;
        }

        // Synchronize to make sure the prefetched sub-matrix is loaded
        __syncthreads();

        // Swap As and As2
        float *prefetch_temp = prefetch;
        prefetch  = prefetch2;
        prefetch2 = prefetch_temp;
    }

    // Write the block sub-matrix to device memory;
    // each thread writes a column of BLOCK_SIZE elements
    float *Cp = &C[cBegin];
    Cp += BLOCK_SIZE * ty + tx;
    int cStep = wB;
#pragma unroll
    for (int i = 0; i < BLOCK_SIZE; i++) {
        Cp[0] = cv[i];
        Cp += cStep;
    }
}
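For completeness (my reading of the code, not stated on the slide): the kernel above is written for BLOCK_SIZE = 16 and VECTOR_SIZE = 4, i.e. thread blocks of 16 x 4 threads in which every thread accumulates a 16-element column of C, so each block produces a 16 x 64 tile of C. A possible launch sketch under that assumption, for square N x N matrices with N a multiple of 64:

#define BLOCK_SIZE  16   // must be visible where the kernel is compiled
#define VECTOR_SIZE 4

void launch_matrixMul_prefetch(float *d_C, float *d_A, float *d_B, int N) {
    dim3 block(BLOCK_SIZE, VECTOR_SIZE);                        // 16 x 4 threads
    dim3 grid(N / (BLOCK_SIZE * VECTOR_SIZE), N / BLOCK_SIZE);  // 64-wide, 16-tall C tiles
    matrixMul_prefetch<<<grid, block>>>(d_C, d_A, d_B, N, N);
}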

Example: Matrix Multiplication

Present: AMD vs. NVidia

                          AMD HD 7970           NVidia GTX 680
Architecture              Graphics Core Next    Kepler
Cores                     2048 @ 925 MHz        1536 @ 1 GHz
Max. performance (sp)     3.9 Tflop/s           3.1 Tflop/s
Max. performance (dp)     947 Gflop/s           129 Gflop/s
Bandwidth                 264 GB/s              192 GB/s
Gflop/s per W             16.5                  15.9
Prog. language            OpenCL                CUDA & OpenCL

Future: GPU and CPU on one chip
AMD Fusion, Intel i5 2500K

ARM & GPU
Present: Carma (NVidia): Tegra 3 ARM CPU + CUDA GPU
Future: embedded ARM CPU + embedded GPU (ARM Mali, PowerVR)

Summary
- Trends in chip design: more transistors, more cores
- GPU architecture: SIMD, SIMT
- Thread scheduling on the GPU
- Thread divergence & memory issues
- Current GPUs on the market: Gflop/s per W

Questions?