Code Optimizations for High Performance GPU Computing


1 Code Optimizations for High Performance GPU Computing
Yi Yang and Huiyang Zhou
Department of Electrical and Computer Engineering, North Carolina State University

2 Question to answer
Given a task to accelerate some algorithm (e.g., solving a PDE, image filtering) using GPU computing:
How can we start?
How can we systematically develop high performance GPGPU programs?

3 Outline
Hardware abstraction
A systematic approach to developing high performance GPGPU programs
Optimization techniques with case studies:
  Coalesced memory accesses
  Data reuse through thread (block) merge
  Eliminating partition conflicts
  Leveraging constant cache
Conclusions

4 Hardware Abstraction of GPU Architecture
Based on this simple abstraction, develop a naïve implementation without considering optimizations.
Focuses: data-level parallelism and functional correctness.

5 GPGPU Architecture
Fast (local) communication among processors in an SM through shared memory.
Memory requests need to be evenly distributed among the memory controllers (MCs) to avoid conflicts/partition camping.

6 Key to Performance
Global memory access bandwidth:
  Coalesced global memory accesses
  Memory partitions
Fast data accesses:
  Shared memory, constant cache, texture cache, registers
Balanced resource usage, balanced ILP and TLP:
  Thread level: register usage
  Thread-block level: shared memory usage

7 Developing High Performance GPGPU Code
Naïve code
Vectorization for memory access bandwidth
Checking memory coalescing; converting non-coalesced accesses into coalesced ones
Checking data dependencies and sharing patterns
Thread & thread-block merge
Data prefetching
Removing memory partition camping
High performance code

8 Naïve Kernel
Fine-grain data-level parallelism: compute one element/pixel in the output domain.
Example: matrix multiplication.

  float sum = 0;
  for (int i = 0; i < w; i++)
      sum += A[idy][i] * B[i][idx];
  C[idy][idx] = sum;

Naïve matrix multiplication.
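For concreteness, a minimal self-contained version of this naïve kernel is sketched below under assumed conventions: row-major flat arrays, hypothetical parameters wA (the inner dimension) and wC (the width of C), and bounds checks omitted exactly as on the slide.

```cuda
// Naive matrix multiplication: one thread per element of C = A x B.
// Assumption: A is hC x wA, B is wA x wC, C is hC x wC, all row-major.
__global__ void matMulNaive(const float *A, const float *B, float *C,
                            int wA, int wC)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
    int idy = blockIdx.y * blockDim.y + threadIdx.y;  // row of C

    float sum = 0.0f;
    for (int i = 0; i < wA; i++)
        sum += A[idy * wA + i] * B[i * wC + idx];
    C[idy * wC + idx] = sum;
}
```

A 16 x 16 thread block with a (wC/16, hC/16) grid covers C when the sizes divide evenly.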

9 Physical Meaning of the Naïve Kernel
One thread computes one element at (idx, idy) in the product matrix C = A x B: idy selects a row of A, idx selects a column of B.

  float sum = 0;
  for (int i = 0; i < w; i++)
      sum += A[idy][i] * B[i][idx];
  C[idy][idx] = sum;

Naïve matrix multiplication.

10 Outline
Hardware abstraction
A systematic approach to developing high performance GPGPU programs
Optimization techniques with case studies:
  Coalesced memory accesses
  Data reuse through thread (block) merge
  Eliminating partition conflicts
  Leveraging constant cache
Conclusions

11 Case Study: Convolution
C = A x B, where A is the input matrix, B is an 8x8 filter matrix, and C is the output matrix.

  float sum = 0;
  for (j = 0; j < 8; j = j + 1) {
      for (i = 0; i < 8; i = i + 1) {
          float a = A[idy-j][idx-i];
          float b = B[j][i];
          sum += a * b;
      }
  }
  C[idy][idx] = sum;

Naïve version of convolution: one thread computes one output pixel at (idx, idy).
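The same naïve pattern for this case study, as a runnable sketch: the flat row-major layout and the hypothetical `width` parameter (the row pitch shared by A and C) are my assumptions, and boundary handling is ignored exactly as on the slide.

```cuda
// Naive 8x8 convolution: one thread per output pixel C[idy][idx].
__global__ void convNaive(const float *A, const float *B, float *C, int width)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;

    float sum = 0.0f;
    for (int j = 0; j < 8; j++)
        for (int i = 0; i < 8; i++)
            sum += A[(idy - j) * width + (idx - i)] * B[j * 8 + i];
    C[idy * width + idx] = sum;
}
```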

12 Coalesced Global Memory Access
Needed by the GPU to achieve high memory bandwidth.
Examined at the half-warp granularity; threads in a warp have consecutive thread ids.
Requirements for coalesced global memory accesses:
  Aligned: the half-warp must access data whose starting address is a multiple of 64 bytes.
  Sequential (less strict for GTX 280/480): threads 0 through 15 of the half-warp must access the data sequentially in global memory.
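To make the two requirements concrete, here is an illustrative pair of toy kernels (the names are mine, not the slides'): the first meets both the alignment and the sequential rule; the second breaks the sequential rule whenever stride > 1.

```cuda
// Coalesced: thread k of a half-warp touches address base + 4*k, so the
// 16 accesses fall into one aligned 64-byte segment.
__global__ void coalescedCopy(const float *in, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid];
}

// Not coalesced: with stride > 1 the half-warp's accesses are scattered
// across several 64-byte segments, multiplying the memory transactions.
__global__ void stridedCopy(const float *in, float *out, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid * stride];
}
```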

13 Checking coalesced memory accesses
Inner loop of convolution:

  for (i = 0; i < 8; i = i + 1) {
      float a = A[idy-j][idx-i];
      float b = B[j][i];
      sum += a * b;
  }

Access pattern of B[j][i]:
  When i = 0: B[j][0] for all the threads in a warp.
  When i = 1: B[j][1] for all the threads in a warp.
Therefore, it is not coalesced. As B is a small filter kernel, we can store it in shared memory or constant memory (cache).

14-17 Checking coalesced memory accesses
Access pattern of A[idy-j][idx-i] for the warp, in the same inner loop:
  When i = 0: A[idy-j][idx]
  When i = 1: A[idy-j][idx-1]
  When i = 2: A[idy-j][idx-2]
  ...
  When i = 7: A[idy-j][idx-7]
Therefore, it is not coalesced. Over the inner loop, the warp accesses the data A[idy-j][idx-7 : idx+31].

18 Convert to coalesced accesses
Inner loop of convolution (as above). We preload the data into shared memory, then access it from shared memory.
One warp (32 threads) loads 64 floats into shared memory.

19 Coalesced memory access

  __shared__ float shared_0[64];
  shared_0[tidx] = A[idy-j][idx-32];      // load data into shared memory
  shared_0[tidx+32] = A[idy-j][idx];
  __syncthreads();
  for (i = 0; i < 8; i = i + 1) {
      float a = shared_0[tidx+32-i];      // access data from shared memory
      float b = B[j][i];
      sum += a * b;
  }
  __syncthreads();

32 threads (one warp) in one thread block.
Each warp accesses 64 elements, A[idy-j][idx-tidx-32 : idx-tidx+31], where (idx - tidx) is the start position of the thread block.

20 Convolution: thread block merge
Inner loop of convolution (as above). One warp now needs to load 64 floats for the inner loop, and there is some overlap between neighboring warps/thread blocks: A[idy-j][idx-tidx-32 : idx-tidx+31].
If we put more warps into one thread block, they can share the overlapping part and reduce global memory accesses: 256 threads only need 256 + 32 = 288 floats from global memory.

21 Thread block merge
Improve memory reuse by merging neighboring thread blocks that touch the same shared data segment.
Parallelism impact: increases the thread-block workload; keeps the per-thread workload.
Advantage: doesn't increase register pressure.
Disadvantage: the shared data must be in shared memory (slower than registers).

22 Code after thread block merge

  __shared__ float shared_0[256+32];
  if (tidx < 32)                          // only the first warp loads the halo
      shared_0[tidx] = A[idy-j][idx-32];
  shared_0[tidx+32] = A[idy-j][idx];
  __syncthreads();
  for (i = 0; i < 8; i = i + 1) {
      float a = shared_0[tidx+32-i];
      float b = B[j][i];
      sum += a * b;
  }
  __syncthreads();

256 threads in one thread block.
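For reference, slide 22's fragment dropped into a complete kernel might look like the sketch below. The 1-D 256-thread block shape, the flat row-major layout, and keeping the filter B in plain global memory (rather than shared or constant memory, as the slides suggest) are simplifying assumptions.

```cuda
// Convolution after thread-block merge: 256 threads share one 32-float halo
// load per filter row.
__global__ void convMergedBlock(const float *A, const float *B, float *C, int width)
{
    int tidx = threadIdx.x;
    int idx  = blockIdx.x * blockDim.x + tidx;   // blockDim.x == 256 assumed
    int idy  = blockIdx.y;

    __shared__ float shared_0[256 + 32];
    float sum = 0.0f;
    for (int j = 0; j < 8; j++) {
        if (tidx < 32)                           // first warp loads the left halo
            shared_0[tidx] = A[(idy - j) * width + (idx - 32)];
        shared_0[tidx + 32] = A[(idy - j) * width + idx];
        __syncthreads();
        for (int i = 0; i < 8; i++)
            sum += shared_0[tidx + 32 - i] * B[j * 8 + i];
        __syncthreads();
    }
    C[idy * width + idx] = sum;
}
```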

23 Case study: Convolution
The neighboring threads in the Y direction have overlapping windows in A: the 8x8 input windows for two vertically adjacent output pixels share rows.
If we let one thread compute two output pixels in the Y direction, we can reduce the data accesses of A.

  float sum = 0;
  for (j = 0; j < 8; j = j + 1) {
      for (i = 0; i < 8; i = i + 1) {
          float a = A[idy-j][idx-i];
          float b = B[j][i];
          sum += a * b;
      }
  }
  C[idy][idx] = sum;

Outer loop of convolution.

24 Convolution: thread merge
When we load 8 pixels from A (shared memory), we can run the inner loop for one output pixel, or two, or three, or more.
So after we load the data of A from shared memory, we can keep it in registers to do more ALU computation.

25 Code after thread merge

  float sum_0 = 0, sum_1 = 0;
  for (j = 0; j < 8; j = j + 1) {
      __shared__ float shared_0[256+32];
      if (tidx < 32)
          shared_0[tidx] = A[idy-j][idx-32];
      shared_0[tidx+32] = A[idy-j][idx];
      __syncthreads();
      for (i = 0; i < 8; i = i + 1) {
          float a = shared_0[tidx+32-i];
          float b_0 = B[j][i];
          float b_1 = B[j+1][i];   // we also compute another output pixel
          sum_0 += a * b_0;        // code for boundary checks is omitted
          sum_1 += a * b_1;
      }
      __syncthreads();
  }
  C[2*idy][idx] = sum_0;
  C[2*idy+1][idx] = sum_1;

One thread computes two output pixels.

26 Thread merge
Improve memory reuse by merging threads from neighboring thread blocks.
Parallelism impact: increases the per-thread workload; keeps the thread-block workload.
Advantage: shared data can be kept in registers or shared memory.
Disadvantage: increases the register pressure of a single thread.

27 Performance (Gflops)
[Figure: convolution performance with an 8 x 8 filter matrix on GTX 480, for input matrix sizes 1k by 1k, 2k by 2k, 4k by 4k, and 8k by 8k.]
The optimized kernel reaches 70% of the theoretical computation power (1.35 Tflops) of the GTX 480, using 256 threads in one thread block with each thread computing multiple output pixels.

28 Case study: matrix vector multiplication
C = A x B: one thread loads one row of A and computes one element of C.

  float sum = 0;
  for (i = 0; i < w; i = i + 1) {
      float a = A[idx][i];
      float b = B[i];
      sum += a * b;
  }
  C[idx] = sum;

Naïve version of MV.
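The slide's fragment as a self-contained kernel; the flat row-major layout (A is h x w), one thread per element of C, and h being a multiple of the block size are assumptions.

```cuda
// Naive matrix-vector multiplication: thread idx computes C[idx] = A[idx][:] . B.
__global__ void mvNaive(const float *A, const float *B, float *C, int w)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int i = 0; i < w; i++)
        sum += A[idx * w + i] * B[i];
    C[idx] = sum;
}
```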

29 Partition camping
If the width of A is a multiple of the partition size, all thread blocks (tb0, tb1, ...) start from the same partition: every block's first accesses hit partition 0 while partitions 1 and 2 sit idle.

30 Eliminating partition camping
Let different thread blocks have different start points: tb0, tb1, and tb2 begin in partitions 0, 1, and 2 respectively, spreading the requests across all partitions.

31 Code to eliminate partition camping

  int start = blockIdx.x * 16;   // different start points for different thread blocks
  for (i = 0; i < w; i = i + 1) {
      int k = (start + i) % w;   // wrap around so every column is still visited
      float a = A[idx][k];
      float b = B[k];
      sum += a * b;
  }
  C[idx] = sum;

The un-optimized kernel is used here to illustrate how to remove partition camping.
The optimized kernel has 32 threads in one thread block and uses shared memory to avoid un-coalesced memory accesses.
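The staggered version in the same self-contained form; the 16-float offset granularity (64 bytes) preserves alignment, and w is assumed to be a multiple of 16 so the wrapped accesses stay aligned.

```cuda
// Matrix-vector multiplication with staggered start columns:
// consecutive thread blocks begin their dot products in different partitions.
__global__ void mvNoCamping(const float *A, const float *B, float *C, int w)
{
    int idx   = blockIdx.x * blockDim.x + threadIdx.x;
    int start = blockIdx.x * 16;          // per-block offset, 64-byte aligned
    float sum = 0.0f;
    for (int i = 0; i < w; i++) {
        int k = (start + i) % w;          // wrap so all w columns are visited once
        sum += A[idx * w + k] * B[k];
    }
    C[idx] = sum;
}
```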

32 Performance (GFLOPS)
[Figure: matrix vector multiplication on GTX 280, comparing Naïve, Opti_PC, Optimized, and CUBLAS 2.2 across matrix sizes 2kx2k, 2kx4k, 2kx8k, 2kx16k, 2kx32k, 2kx64k, 4kx4k, and 3kx3k.]
Opti_PC: the optimized kernel without partition camping elimination.

33 Performance (Gflops)
[Figure: matrix vector multiplication on GTX 480, comparing Opti_PC, optimized, and cublas across matrix sizes 1K x 1K through 8K x 8K.]
Opti_PC: the optimized kernel without partition camping elimination.
Partition camping elimination benefits the 3K and 6K sizes more because the GTX 480 has 6 partitions.

34 Compiling for High Performance GPGPU Code
Naïve code
Vectorization for memory access bandwidth
Checking memory coalescing; converting non-coalesced accesses into coalesced ones
Checking data dependencies and sharing patterns
Thread & thread-block merge
Data prefetching
Removing memory partition camping
High performance code

35 Outline
Hardware abstraction
A systematic approach to developing high performance GPGPU programs
Optimization techniques with case studies:
  Coalesced memory accesses
  Data reuse through thread (block) merge
  Eliminating partition conflicts
  Leveraging constant cache
Conclusions

36 Leveraging constant cache (GTX 480)
Register: fastest, no latency to the ALU. Limitation: no sharing between threads.
Constant cache: up to 2 TB/s. Limitations: 64 KB of constant memory on GTX 480; broadcast access only (a warp should read one address at a time, so accesses to different addresses are serialized).
Shared memory: sharing within a block with arbitrary indexing. Limitation: up to 1 TB/s.
Texture cache: automatic 2D caching. Limitation: up to 334 GB/s.
For a MAD such as r0 = r1 + r2 * shared[k], each 4-byte shared-memory read feeds 2 flops, which bounds the throughput at 500 Gflops (worked out below).
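Written out, the slide's back-of-the-envelope bound for that MAD line, assuming one 4-byte shared-memory operand per two flops and the 1 TB/s shared-memory peak:

$$
\frac{10^{12}\ \text{bytes/s}}{4\ \text{bytes/float}} = 0.25 \times 10^{12}\ \text{floats/s},
\qquad
0.25 \times 10^{12}\ \text{floats/s} \times 2\ \text{flops/float} = 500\ \text{Gflops}.
$$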

37 Case study: Matrix Multiplication with Constant Memory
C = A * B.

  float sum = 0;
  for (int i = 0; i < w; i++)
      sum += A[idy][i] * B[i][idx];
  C[idy][idx] = sum;

Naïve matrix multiplication (one output per thread).
All threads with the same idy access the same locations of input A sequentially, from A[idy][0] to A[idy][w-1].
What if we put A into constant memory?
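Moving A into constant memory could look like the sketch below. The array name `constA`, its 128 x 16 capacity, and the host function are illustrative assumptions; `__constant__` and cudaMemcpyToSymbol are the standard CUDA mechanisms for declaring and staging constant memory.

```cuda
#include <cuda_runtime.h>

#define A_ELEMS (128 * 16)                 // 8 KB, well under the 64 KB limit
__constant__ float constA[A_ELEMS];        // lives in constant memory

// Host side: stage A into constant memory before any kernel launch uses it.
void uploadA(const float *hostA)
{
    cudaMemcpyToSymbol(constA, hostA, A_ELEMS * sizeof(float));
}
```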

38 Matrix Multiplication (Tiled)

  A[0][0] A[0][1] A[0][2]      B[0]
  A[1][0] A[1][1] A[1][2]      B[1]
  A[2][0] A[2][1] A[2][2]      B[2]
  A[3][0] A[3][1] A[3][2]

C[i] = A[i][0]*B[0] + A[i][1]*B[1] + A[i][2]*B[2]
C[0], C[1], C[2], C[3] can be computed concurrently.
Let's put A[i][j] into constant memory.

39 Efficient constant memory accesses
A is 128 x 16, B is 128 x WidthOfB, and C = A^T x B is 16 x WidthOfC (with WidthOfC = WidthOfB).
When we load one element from B, we can compute one output element, or two, up to 16, so that we can use more computation to overlap the memory accesses to B.
But column access in constant memory is not efficient.

40 Matrix Multiplication
One thread: loads one element from B, loads one column from A, and computes one element in each of the 16 rows of C.
A is 128 x 16; we can put A into constant memory (column major), so that for each float loaded from B we can do 16 MADs to overlap the memory request for B.
The width of B determines the overall thread count.

41 Kernel code when A is 128 x 16

  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  float sum[16] = {0};
  for (int i = 0; i < 128; i++) {
      float b = B[i][idx];
      for (int j = 0; j < 16; j++) {
          sum[j] += b * constA[i*16+j];   // A is in constant memory
      }
  }
  for (int j = 0; j < 16; j++) {
      C[j][idx] = sum[j];
  }

Each thread computes 16 output pixels. Thread block size:
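A self-contained flat-index version of this kernel plus a hypothetical host wrapper is sketched below, repeating the `constA` declaration from the earlier sketch for completeness. The 256-thread block size and widthB being a multiple of it are assumptions. The layout constA[i*16+j] = A[i][j] keeps the 16 values used per i contiguous, and at each (i, j) every thread reads the same constA address, which is exactly the broadcast pattern the constant cache serves at full speed.

```cuda
#include <cuda_runtime.h>

__constant__ float constA[128 * 16];          // A staged in constant memory

// Flat-index version of the slide-41 kernel: one thread per column of B,
// 16 output elements per thread. Assumption: B is 128 x widthB and
// C is 16 x widthB, both row-major.
__global__ void matMulConstA(const float *B, float *C, int widthB)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float sum[16] = {0.0f};
    for (int i = 0; i < 128; i++) {
        float b = B[i * widthB + idx];            // one global load per i ...
        for (int j = 0; j < 16; j++)
            sum[j] += b * constA[i * 16 + j];     // ... overlapped by 16 MADs
    }
    for (int j = 0; j < 16; j++)
        C[j * widthB + idx] = sum[j];
}

// Hypothetical host wrapper: stage A (row-major 128 x 16), then launch.
void runMatMulConst(const float *hostA, const float *devB, float *devC, int widthB)
{
    cudaMemcpyToSymbol(constA, hostA, 128 * 16 * sizeof(float));
    int block = 256;                              // assumed thread-block size
    matMulConstA<<<widthB / block, block>>>(devB, devC, widthB);
}
```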

42 Performance (Gflops)
[Figure: matrix multiplication on GTX 480 with A fixed at 128 x 16, comparing cublas 3.1 and the const version across widths of B; constant memory transpose and transfer time included.]
Up to 1.8x speedup over CUBLAS 3.1, and a substantial fraction of the theoretical computation power (1.35 Tflops) of the GTX 480.

43 Performance (Gflops)
[Figure: matrix multiplication on GTX 480, comparing cublas 3.1 and the const version; width and height of A and B are the same as the input size; constant memory transpose and transfer time included.]
Up to 1.65x speedups over CUBLAS 3.1, and a substantial fraction of the theoretical computation power (1.35 Tflops) of the GTX 480.

44 Conclusion
A systematic way to optimize GPGPU programs: start from a naïve kernel based on a simplified hardware abstraction, then apply optimizations:
  Coalesced memory accesses
  Data reuse through thread (block) merge
  Eliminating partition conflicts
  Leveraging different types of caches
We implemented a source-to-source compiler to perform these optimizations automatically.
