Introduction to GPGPU and GPU-architectures


1 Introduction to GPGPU and GPU-architectures
Henk Corporaal, Gert-Jan van den Braak

2 Contents
1. What is a GPU
2. Programming a GPU
3. GPU thread scheduling
4. GPU performance bottlenecks

3 1. What is a GPU?
It's a bird... It's a plane... It's a GPU!

4 Graphics Processing Unit
ref: "How GPUs Work"

5 Trends in chip design
Frequency wall, Energy wall, ILP wall

6 Why multi / many core?
Running into the:
- Energy wall
- Frequency wall
- ILP wall
- Memory wall
Chip area enabler: Moore's law goes well below 22 nm. What to do with all this area?
- Multiple processors fit easily on a single die
- Cost effective
- Re-use: just connect existing processors or processor cores

7 How Do CPUs Spend Their Die Area?
CPUs are designed for low latency instead of high throughput.
Intel Penryn (×2): half the area is spent on processing; large on-chip caches.
Die photo of Intel Penryn (source: Intel)

8 How Do GPUs Spend Their Die Area?
GPUs are designed to match the workload of 3D graphics.
NVidia Kepler GPU (15 processing clusters): most area is spent on processing, with relatively small on-chip memories and huge off-chip memory latencies.

9 Why GPU? According to NVidia:
"NVidia GPUs give a 1000x performance boost!" - Jen-Hsun Huang, CEO NVidia

10 Why GPU? The answer from Intel:
"Actually, NVidia GPUs are only 2.5x faster than Intel CPUs." - Paul Otellini, CEO Intel
Application performance benchmarked by Intel: Intel i7 960 vs. NVidia GTX280.
Ref: "Debunking the 100X GPU vs. CPU Myth", ISCA 2010

11 Why GPU? The performance-boost claims side by side:
"NVidia GPUs give a 1000x performance boost!" - Jen-Hsun Huang, CEO NVidia
"Actually, NVidia GPUs are only 2.5x faster than Intel CPUs." - Paul Otellini, CEO Intel

12 2. How to program a GPU?

13 NVIDIA GPU architecture

14 NVIDIA GPU architecture
Nvidia Tesla architecture, e.g. the 8800GT (2006): an array of clusters, each with its own i-cache, scheduler, register file, processing cores, SFUs (special function units) and shared memory.

15 GPU programming
                    Sequential    Data parallel    Thread parallel
Programming model   C             Vector           CUDA
Architecture        CPU           SIMD             SIMT
Let's start from C and CPU.

16 Let's Start with C and CPU
int A[2][4];
for(i=0;i<2;i++){
  for(j=0;j<4;j++){
    A[i][j]++;
  }
}
Assembly code of the inner loop (programmer's view of RISC):
lw   r0, 4(r1)
addi r0, r0, 1
sw   r0, 4(r1)

17 Most CPUs Have Vector SIMD Units
Programmer's view of a vector SIMD unit, e.g. SSE.

18 Let's Program the Vector SIMD
Unroll the inner loop into a vector operation:
int A[2][4];
for(i=0;i<2;i++){
  for(j=0;j<4;j++){
    A[i][j]++;
  }
}
becomes
int A[2][4];
for(i=0;i<2;i++){
  movups xmm0, [ &A[i][0] ]   // load
  addps  xmm0, xmm1           // add 1
  movups [ &A[i][0] ], xmm0   // store
}
This looks like the previous RISC example (lw / addi / sw), but the SSE instructions execute on 4 ALUs.
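
(Not on the slides:) the same unrolled loop can be written portably with SSE intrinsics instead of inline assembly. A minimal sketch, assuming a float array so that addps applies; the slide's int array would need the integer add _mm_add_epi32 from SSE2:

#include <xmmintrin.h>   /* SSE intrinsics */

float A[2][4];

void increment_rows(void)
{
    const __m128 ones = _mm_set1_ps(1.0f);   /* xmm1 = {1,1,1,1}      */
    for (int i = 0; i < 2; i++) {
        __m128 v = _mm_loadu_ps(&A[i][0]);   /* movups: load 4 floats */
        v = _mm_add_ps(v, ones);             /* addps:  add 1 to each */
        _mm_storeu_ps(&A[i][0], v);          /* movups: store back    */
    }
}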

19 How Do Vector Programs Run?
int A[2][4];
for(i=0;i<2;i++){
  movups xmm0, [ &A[i][0] ]   // load
  addps  xmm0, xmm1           // add 1
  movups [ &A[i][0] ], xmm0   // store
}

20 Programmer's View of GPUs
A GPU contains multiple SIMD Units.

21 Programmer's View of GPUs
A GPU contains multiple SIMD Units. All of them can access global memory.

22 What Are the Differences? SSE vs. GPU
Let's start with two important differences:
1. GPUs use threads instead of vectors
2. The "Shared Memory" spaces

23 Let's Start Again from C
int A[2][4];
for(i=0;i<2;i++){
  for(j=0;j<4;j++){
    A[i][j]++;
  }
}
Convert into CUDA:
int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);   // define threads
__device__ kernelF(A){         // all threads run the same kernel
  i = blockIdx.x;              // each thread block has its id
  j = threadIdx.x;             // each thread has its id
  A[i][j]++;                   // each thread has a different i and j
}
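
The slide code is deliberately simplified: a launchable CUDA kernel must be declared __global__, and a real program needs types and host-side memory management. A minimal complete sketch (the host boilerplate is our assumption, not part of the slides; cudaMalloc/cudaMemcpy are the standard CUDA runtime calls):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernelF(int *A, int width)
{
    int i = blockIdx.x;           // thread block id selects the row
    int j = threadIdx.x;          // thread id selects the column
    A[i * width + j]++;           // each thread increments one element
}

int main()
{
    int h_A[2][4] = {{0,1,2,3},{4,5,6,7}};
    int *d_A;
    cudaMalloc(&d_A, sizeof(h_A));
    cudaMemcpy(d_A, h_A, sizeof(h_A), cudaMemcpyHostToDevice);

    kernelF<<<2, 4>>>(d_A, 4);    // 2 blocks of 4 threads, as on the slide

    cudaMemcpy(h_A, d_A, sizeof(h_A), cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    printf("A[1][3] = %d\n", h_A[1][3]);  // prints 8
    return 0;
}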

24 3. How does a GPU schedule all these threads?

25 Thread Hierarchy in CUDA
A Grid contains Thread Blocks; a Thread Block contains Threads.

26 What Is the Thread Hierarchy?
Thread 3 of block 1 operates on element A[1][3]:
int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);   // define threads
__device__ kernelF(A){         // all threads run the same kernel
  i = blockIdx.x;              // each thread block has its id
  j = threadIdx.x;             // each thread has its id
  A[i][j]++;                   // each thread has a different i and j
}

27 How Are Threads Scheduled?

28 Executing many, many threads
Scaling up from A[2][4] with kernelF<<<(2,1),(4,1)>>>(A) to:
int A[256][128];
kernelF<<<(256,1),(128,1)>>>(A);
__device__ kernelF(A){
  i = blockIdx.x;
  j = threadIdx.x;
  A[i][j]++;
}

29 Warps as a group of threads
A warp is a group of 32 threads from the same thread block.
Warps are generated dynamically by the hardware.
kernelF<<<(256,1),(128,1)>>>(A); gives 1 grid, 256 thread blocks, 128 threads per block = 4 warps per block.
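
The warp count is simply the block's thread count rounded up to a multiple of 32; a one-line helper (ours, purely illustrative) makes the arithmetic explicit:

/* Warps per thread block: 32 threads each, rounding up partial warps. */
int warps_per_block(int threads_per_block)
{
    return (threads_per_block + 31) / 32;   /* 128 threads -> 4 warps */
}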

30 Scheduling warps
Example, per cluster: 128 threads (4 warps), 16 processing cores.
Warp 0 = threads 0-31, Warp 1 = threads 32-63, Warp 2 = threads 64-95, Warp 3 = threads 96-127.
A warp of 32 threads executes on the 16 cores over 2 clock cycles, so a new warp instruction issues every other cycle. The scheduler round-robins over the warp pool:
Clock cycle 0: Warp 0, Instruction 0

31 Scheduling warps
Clock cycle 2: Warp 1, Instruction 0

32 Scheduling warps
Clock cycle 4: Warp 2, Instruction 0

33 Scheduling warps
Clock cycle 6: Warp 3, Instruction 0

34 Scheduling warps
Clock cycle 8: Warp 0, Instruction 1

35 Scheduling warps
Clock cycle 10: Warp 1, Instruction 1

36 Scheduling warps
Clock cycle 12: Warp 2, Instruction 1

37 Scheduling warps
Clock cycle 14: Warp 3, Instruction 1

38 Scheduling warps
Clock cycle 16: Warp 0, Instruction 2. And so on...
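
The issue order above can be reproduced with a few lines of host C (a toy model that assumes a perfect round-robin scheduler, one warp instruction every 2 cycles, and no stalls):

#include <stdio.h>

int main(void)
{
    const int warps = 4, instructions = 3, cycles_per_issue = 2;
    for (int i = 0; i < instructions; i++)
        for (int w = 0; w < warps; w++)
            printf("Clock cycle %2d: Warp %d, Instruction %d\n",
                   (i * warps + w) * cycles_per_issue, w, i);
    return 0;
}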

39 Scheduling warps from multiple thread blocks
Example, per cluster: 128 threads (4 warps) per thread block, 16 processing cores.
Warps from different thread blocks (TB0, TB1) share the warp pool and are interleaved by the scheduler:
TB0 - Warp 0..3, Instruction 0; TB1 - Warp 0..3, Instruction 0;
TB0 - Warp 0..3, Instruction 1; TB1 - Warp 0..3, Instruction 1; and so on...

40 Inside the warp scheduler
Ref: "Analyzing CUDA Workloads Using a Detailed GPU Simulator", ISPASS 2009 (GPGPU-Sim)

41 4. Performance bottlenecks
How do I get the maximum out of my GPU?

42 Thread divergence
Example, per cluster: 128 threads (4 warps), 16 processing cores.
__device__ kernelF(A){
  i = blockIdx.x;
  j = threadIdx.x;
  if(j%2 == 1)
    A[i][j]++;
  else
    A[i][j]--;
}
Every warp first executes the branch condition (j%2==1); then each warp executes both the A[i][j]++ path and the A[i][j]-- path in turn, with the threads on the not-taken path masked off. Divergence within a warp therefore costs the time of both branch paths.
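
A common remedy (our sketch, not from the slides) is to replace the branch by arithmetic, so that all 32 threads of a warp execute the identical instruction stream:

__global__ void kernelF_branchless(int *A, int width)
{
    int i = blockIdx.x;
    int j = threadIdx.x;
    // (j % 2) * 2 - 1 evaluates to +1 for odd j and -1 for even j,
    // so every thread executes the same add: no divergence.
    A[i * width + j] += (j % 2) * 2 - 1;
}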

43 GPU memory hierarchy
Registers: per thread
Shared memory: per thread block
Global memory: per kernel
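
In CUDA source the three levels map onto declarations like the following (a minimal sketch; the identifiers are invented for illustration, and the block is assumed to have 128 threads):

__global__ void memoryLevels(float *gA)   // gA points to global memory (per kernel)
{
    __shared__ float tile[128];           // shared memory (per thread block)
    float r = gA[threadIdx.x];            // r lives in a register (per thread)
    tile[threadIdx.x] = r * 2.0f;
    __syncthreads();                      // make shared-memory writes visible
    gA[threadIdx.x] = tile[threadIdx.x];
}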

44 On-chip shared / local memory
Shared memory is divided into banks (banks 0-7 in the figure), and the figure shows three patterns of threads 0-7 accessing them: no bank conflict, bank conflict, no bank conflict. When several threads of a warp hit different addresses in the same bank, their accesses are serialized.
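
A classic trick to avoid bank conflicts (our sketch, not from the slides) is to pad a shared-memory tile by one column, so that a column-wise read hits distinct banks. The matrix transpose below assumes a square width x width matrix with width a multiple of 32:

#define TILE 32

__global__ void transposeTile(float *out, const float *in, int width)
{
    // The extra +1 column makes consecutive rows start in different
    // banks, so reading tile[threadIdx.x][threadIdx.y] is conflict-free.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;   // swap block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}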

45 Off-chip global memory
The figure shows three patterns of threads 0-7 accessing global memory: coalesced (consecutive threads access consecutive addresses, which the hardware combines into one wide transaction), not coalesced, and not coalesced.
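
The difference shows up directly in the indexing (our sketch; stride is a hypothetical parameter):

__global__ void copyCoalesced(float *dst, const float *src)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    dst[tid] = src[tid];           // thread k reads address k: coalesced
}

__global__ void copyStrided(float *dst, const float *src, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    dst[tid] = src[tid * stride];  // stride > 1 scatters the warp's
                                   // addresses: not coalesced
}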

46 CPU - GPU memory transfers
CPU and GPU memory are connected via PCI-e. The timeline of one frame is: copy CPU to GPU, GPU compute, copy GPU to CPU, repeated frame after frame; the PCI-e transfers take a substantial share of each frame's time.
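
With pinned host memory and CUDA streams, the copy of one frame can overlap the compute of another. A hedged sketch (the kernel, buffer layout and sizes are our assumptions; cudaMemcpyAsync and streams are the standard CUDA mechanism):

#include <cuda_runtime.h>

__global__ void process(float *buf) { buf[threadIdx.x] *= 2.0f; }  // stand-in per-frame work

// h_in/h_out must be allocated with cudaMallocHost (pinned memory);
// d_buf[0] and d_buf[1] are device buffers of 'bytes' bytes each.
void run_frames(float **h_in, float **h_out, float *d_buf[2],
                int nFrames, size_t bytes, dim3 grid, dim3 block)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    for (int f = 0; f < nFrames; f++) {
        int b = f % 2;                  // ping-pong between buffers/streams
        cudaMemcpyAsync(d_buf[b], h_in[f], bytes,
                        cudaMemcpyHostToDevice, s[b]);
        process<<<grid, block, 0, s[b]>>>(d_buf[b]);
        cudaMemcpyAsync(h_out[f], d_buf[b], bytes,
                        cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();            // wait for all frames to finish
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}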

47 Example: Matrix Multiplication
__global__ void matrixMul_naive(float* C, float* A, float* B, int wA, int wB)
{
  // Accumulate row i of A and column j of B
  int i = blockIdx.y * blockDim.y + threadIdx.y;
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  float accu = 0;
  for(int k = 0; k < wA; k++) {
    accu = accu + A[ i * wA + k ] * B[ k * wB + j ];
  }
  // Write the element of the block sub-matrix to device memory
  C[ i * wB + j ] = accu;
}
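
The slide omits the host-side launch; a plausible configuration (our assumption: one thread per output element, 16x16 thread blocks, matrix dimensions hA and wB multiples of 16, with d_A/d_B/d_C already on the device) would be:

dim3 block(16, 16);                 // 256 threads per block
dim3 grid(wB / 16, hA / 16);        // one thread per element of C (hA x wB)
matrixMul_naive<<<grid, block>>>(d_C, d_A, d_B, wA, wB);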

48 Example: Matrix Multiplication (with shared-memory prefetching)
// BLOCK_SIZE and VECTOR_SIZE are compile-time constants; the 16-element
// cv array implies BLOCK_SIZE == 16 here.
__global__ void matrixMul_prefetch(float* C, float* A, float* B, int wA, int wB)
{
  // Block index
  int bx = blockIdx.x;
  int by = blockIdx.y;
  // Thread index
  int tx = threadIdx.x;
  int ty = threadIdx.y;

  // Declaration of the shared memory arrays As and As2, used to
  // store and prefetch sub-matrices of A
  __shared__ float As[BLOCK_SIZE * BLOCK_SIZE];
  __shared__ float As2[BLOCK_SIZE * BLOCK_SIZE];
  float *prefetch = As;
  float *prefetch2 = As2;

  // The sub-matrix of B is streamed from global memory instead:
  // __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

  float cv[BLOCK_SIZE] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};

  // Index of the first sub-matrix of A processed by the block
  int aBegin = wA * BLOCK_SIZE * by;
  // Index of the last sub-matrix of A processed by the block
  int aEnd = aBegin + wA - 1;
  // Step size used to iterate through the sub-matrices of A
  int aStep = BLOCK_SIZE;

  // Index of the first sub-matrix of B processed by the block
  int bBegin = BLOCK_SIZE * VECTOR_SIZE * bx;
  // Step size used to iterate through the sub-matrices of B
  int bStep = BLOCK_SIZE * wB;

  int cBegin = wB * BLOCK_SIZE * by + VECTOR_SIZE * BLOCK_SIZE * bx;

  // Prefetch the first sub-matrix of A into shared memory
  float *Ap = &A[aBegin + wA * ty + tx];
  float *ap = &prefetch[ty + BLOCK_SIZE * tx];
  #pragma unroll
  for(int i = 0; i < 16; i += 4){
    ap[i] = Ap[wA * i];
  }
  __syncthreads();

  // Loop over all the sub-matrices of A and B
  // required to compute the block sub-matrix
  for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
    // Prefetch the next sub-matrix of A from device memory
    // into the second shared-memory buffer
    Ap = &A[a + aStep + wA * ty + tx];
    float *ap2 = &prefetch2[ty + BLOCK_SIZE * tx];
    #pragma unroll
    for(int i = 0; i < 16; i += 4){
      ap2[i] = Ap[wA * i];
    }

    // Compute on the current buffer while the prefetch is in flight
    ap = &prefetch[0];
    float *bp = &B[b + BLOCK_SIZE * ty + tx];
    #pragma unroll
    for(int i = 0; i < BLOCK_SIZE; i++){
      float bv = bp[0];
      cv[0]  += ap[0]  * bv;
      cv[1]  += ap[1]  * bv;
      cv[2]  += ap[2]  * bv;
      cv[3]  += ap[3]  * bv;
      cv[4]  += ap[4]  * bv;
      cv[5]  += ap[5]  * bv;
      cv[6]  += ap[6]  * bv;
      cv[7]  += ap[7]  * bv;
      cv[8]  += ap[8]  * bv;
      cv[9]  += ap[9]  * bv;
      cv[10] += ap[10] * bv;
      cv[11] += ap[11] * bv;
      cv[12] += ap[12] * bv;
      cv[13] += ap[13] * bv;
      cv[14] += ap[14] * bv;
      cv[15] += ap[15] * bv;
      ap += BLOCK_SIZE;
      bp += wB;
    }
    // Synchronize to make sure the prefetched sub-matrix is loaded
    __syncthreads();

    // Swap As and As2
    float *prefetch_temp = prefetch;
    prefetch = prefetch2;
    prefetch2 = prefetch_temp;
  }

  // Write the block sub-matrix to device memory;
  // each thread writes one element per iteration
  float *Cp = &C[cBegin];
  Cp += BLOCK_SIZE * ty + tx;
  int cStep = wB;
  #pragma unroll
  for(int i = 0; i < BLOCK_SIZE; i++){
    Cp[0] = cv[i];
    Cp += cStep;
  }
}

49 Example: Matrix Multiplication

50 Present: AMD vs. NVidia
                        AMD Graphics Core Next   NVidia Kepler
                        HD 7970                  GTX 680
Clock frequency         925 MHz                  1.006 GHz
Max. performance (sp)   3.9 Tflop/s              3.1 Tflop/s
Max. performance (dp)   947 Gflop/s              129 Gflop/s
Bandwidth               264 GB/s                 192 GB/s
Gflops/W
Prog. language          OpenCL                   CUDA & OpenCL

51 Future: GPU and CPU on one chip
AMD Fusion, Intel i5 2500K

52 ARM & GPU
Present: Carma (NVidia) = Tegra 3 ARM CPU + CUDA GPU
Future: embedded ARM CPU + embedded GPU (ARM Mali, PowerVR)

53 Summary
- Trends in chip design: more transistors, more cores
- GPU architecture: SIMD vs. SIMT
- Thread scheduling on GPU
- Thread divergence & memory issues
- Current GPUs on the market: Gflops/W

54 Questions?
