Code Optimizations for High Performance GPU Computing
|
|
- Earl Lewis
- 6 years ago
- Views:
Transcription
1 Code Optimizations for High Performance GPU Computing Yi Yang and Huiyang Zhou Department of Electrical and Computer Engineering North Carolina State University 1
2 Question to answer Given a task to accelerate some algorithm, e.g., solving PDE, image fltering, etc., using GPU computing How can we start? How can we systematically develop high performance GPGPU programs? 2
3 Outline Hardware abstraction A systematic approach to developing high performance GPGPU programs Optimization techniques with case studies Coalesced memory accesses Data reuse through thread (block) merge Eliminating partition conflicts Leveraging constant cache Conclusions 3
4 Hardware Abstraction of GPU Architecture Based on this simple abstraction, develop a naïve implementation without considering optimizations. Focuses: Data level parallelism Functional correctness 4
5 GPGPU Architecture Fast (local) communication among processors in a SM through shared memory Memory requests need to be evenly distributed among MCs to avoid conflicts/partition camping 5
6 Key to Performance Global memory access bandwidth Coalesced global memory accesses Memory partitions Fast data accesses Shared memory Constant Cache Texture Cache Register Balanced resource usage, balanced ILP and TLP Thread level: register usage Thread block level: shared memory usage, 6
7 Developing High Performance GPGPU Code Naïve code Vectorization for memory access bandwidth Checking memory coalescing Converting non-coalesced accesses into coalesced Checking data dependencies and sharing patterns Thread & thread-block merge Data prefetching Removing memory partition camping High performance code 7
8 Naïve Kernel Fine-grain data-level parallelism Compute one element/pixel in the output domain Example: Matrix multiplication float sum = 0; for (int i=0; i<w; i++) sum+=a[idy][i]*b[i][idx]; C[idy][idx] = sum; Naïve matrix multiplication 8
9 Physical Meaning of the Naïve Kernel One thread computes one element at (idx, idy) in the product matrix B float sum = 0; for (int i=0; i<w; i++) sum+=a[idy][i]*b[i][idx]; C[idy][idx] = sum; Naïve matrix multiplication idx A idy C= AXB A (idx, idy) 9
10 Outline Hardware abstraction A systematic approach to developing high performance GPGPU Programs Optimization techniques with case studies Coalesced memory accesses Data reuse through thread (block) merge Eliminating partition conflicts Leveraging constant cache Conclusions 10
11 Case Study: Convolution (idx, idy) A (idx, idy) C = X B float sum = 0; for (j=0; j<8; j=j+1) { for (i=0; i<8; i=i+1) { float a; float b; a = A[idy-j][idx-i]; b = B[j] [i]; sum += a*b; } } C[idy][idx] = sum; A is input matrix B is 8x8 filter matrix C is output matrix Naïve version of Convolution one thread computes one output pixel at (idx, idy) 11
12 Coalesced Global Memory Access Needed by GPU to achieve high memory bandwidth Examined at the half-warp granularity Threads in a warp have consecutive thread ids Requirements for coalesced global memory accesses Aligned: Half of warp threads must access the data with starting address to be a multiple of 64 bytes Sequential (less strict for GTX 280/480): Thread Half of warp 0 threads must access the data 15 sequentially Global memory
13 Checking coalesced memory accesses A = X B for (i=0; i<8; i=i+1) { float a; float b; a = A[idy-j][idx-i]; b = B[j][i]; sum += a*b; } Inner loop of convolution C Access pattern of B[j][i]: When i = 0; B[j][0] for all the threads in a warp When i = 1; B[j][1] for all the threads in a warp Therefore, it is not coalesced. As B is a small kernel, we can store it in the shared memory or constant memory (cache). 13
14 Checking coalesced memory accesses (idx, idy) A = X B for (i=0; i<8; i=i+1) { float a; float b; a = A[idy-j][idx-i]; b = B[j] [i]; sum += a*b; } Inner loop of convolution C Access pattern of A[idy-j][idx-i] When i = 0, A[idy-j][idx] 14
15 Checking coalesced memory accesses (idx, idy) A = X B for (i=0; i<8; i=i+1) { float a; float b; a = A[idy-j][idx-i]; b = B[j] [i]; sum += a*b; } Inner loop of convolution C Access pattern of A[idy-j][idx-i] When i = 0, A[idy-j][idx] When i = 1, A[idy-j][idx-1] 15
16 Checking coalesced memory accesses (idx, idy) A = X B for (i=0; i<8; i=i+1) { float a; float b; a = A[idy-j][idx-i]; b = B[j] [i]; sum += a*b; } Inner loop of convolution C Access pattern of A[idy-j][idx-i] When i = 0, A[idy-j][idx] When i = 1, A[idy-j][idx-1] When i = 2, A[idy-j][idx-2] 16
17 Checking coalesced memory accesses (idx, idy) A = X B for (i=0; i<8; i=i+1) { float a; float b; a = A[idy-j][idx-i]; b = B[j] [i]; sum += a*b; } Inner loop of convolution C Access pattern of A[idy-j][idx-i] for the warp When i = 0, A[idy-j][idx] When i = 1, A[idy-j][idx-1] When i = 2, A[idy-j][idx-2] When i = 7, A[idy-j][idx-7] Therefore, it is not coalesced. The warp accesses the data: A[idy-j][idx-7:idx+31] 17
18 Convert to coalesced accesses Shared memory A = X B for (i=0; i<8; i=i+1) { float a; float b; a = A[idy-j][idx-i]; b = B[j] [i]; sum += a*b; } Inner loop of convolution C We preload data into shared memory. Then access it from shared memory. One warp (32 threads) loads 64 floats into shared memory 18
19 Coalesced memory access shared float shared_0[64]; shared_0[tidx]=a[idy-j][idx-32]; shared_0[tidx+32]=a[idy-j][idx]; // load data into shared memory syncthreads(); for (i=0; i<8; i=i+1) { float a=shared_0[tidx+32]; // access data from shared memory float b=b[j][i]; sum+=(a*b); } syncthreads(); 32 threads (one warp) in one thread block Each warp access 64 elements A[idy-j][idx-tidx-32 : idxtidx+31] (idx tidx) is the start position of the thread block 19
20 Convolution: thread block merge 256 threads only need floats from global memory A C = X B for (i=0; i<8; i=i+1) { float a; float b; a = A[idy-j][idx-i]; b = B[j] [i]; sum += a*b; } Inner loop of convolution Now one warp needs to load 64 floats for the inner loop There are some overlap between neighboring warps/thread blocks A[idy-j][idx-tidx-32 : idx-tidx+31] If we put more warps into one thread blocks, they can share the overlap part and reduce global memory access 20
21 Thread block merge Parallelism impact Increase thread block workload Keep the thread workload TB before merge TB before merge Shared Data Segment Advantage: Don t increase register pressure Disadvantage : Data must be in shared memory (slower than register) Thread block after thread-block merge Thread Improve memory reuse by merging neighboring thread blocks 21
22 Code after thread block merge shared float shared_0[256+32]; if (tidx<32) shared_0[tidx]=a[idy-j][idx-32]; shared_0[tidx+32]=a[idy-j][idx]; syncthreads(); for (i=0; i<8; i=i+1) { float a=shared_0[tidx+32]; float b=b[j][i]; sum+=(a*b); } syncthreads(); // only first warp executes 256 threads in one thread block 22
23 Case study: Convolution /////// /////// /////// A C = X float sum = 0; for (j=0; j<8; j=j+1) { } C[idy][idx] = sum; The neighboring threads in Y direction have overlaps in A If we let one thread compute two output pixels in Y direction, we can reduce the data access of A. B Overlap for two output pixels for (i=0; i<8; i=i+1) { float a; float b; a = A[idy-j][idx-i]; b = B[j] [i]; sum += a*b; } Outer loop of Convolution 23
24 Convolution: thread merge A X B When we load 8 pixels from A (shared memory) = We can do inner loop for one output pixel C Or two pixels Three, or more So after we load data A from shared memory, we can keep it in the register to do more ALU computation 24
25 Code after thread merge float sum_0 = 0; sum_1 = 0; for (j=0; j<8; j=j+1) { shared float shared_0[256+32]; if (tidx<32) shared_0[tidx]=a[idy-j][idx-32]; shared_0[tidx+32]=a[idy-j][idx]; syncthreads(); for (i=0; i<8; i=i+1) { float a=shared_0[tidx+32]; float b_0=b[j][i]; float b_1=b[j+1][i]; // we also compute another output pixel sum_0+=(a*b_0); // code for boundary check is ignored sum_1+=(a*b_1); } syncthreads(); } C[2*idy][idx] = sum; C[2*idy+1][idx] = sum; One thread computes two output pixels 25
26 Thread merge Parallelism impact Increase thread workload Keep the thread block workload TB before merge Shared Data Segment Thread Shared Register Advantage : Data can be in the register or shared memory Disadvantage : Increase the register pressure for single thread Thread block after thread merge TB before merge Improve memory reuse by merging threads from neighboring thread blocks. 26
27 Performance (Gflops) 1000 Convolution Performance with 8 x 8 filter matrix on GTX k by 1k 2k by 2k 4k by 4k 8k by 8k Input matrix size 70% of theoretic computation power (1.35Tflops) of GTX threads in one thread block and one thread computes 184 output pixels. 27
28 Case study: matrix vector multiplication t0 t1 One thread load one row of A and compute one pixel of C A X B = float sum = 0; for (i=0; i<w; i=i+1) { float a; float b; a = A[idx][i]; b = B[i]; sum += a*b; } C[idx] = sum; Naïve version of MV C 28
29 Partition camping tb0 tb1 A X B If the width of A is multiple of partition size All thread blocks start from same partition Partition Partition 0 Partition 1 2 = C 29
30 Eliminating Partition camping tb0 tb1 tb2 A X B Let different thread blocks have different start points Partition 0 Partition 1 = Partition 2 C 30
31 Code to eliminate partition camping int start = (blockidx.x*16); // different start points for different thread blocks for (i=0; i<w; i=(i+16)) { int k=((start+i)%w); float a; float b; a = A[idx][k]; b = B[k]; sum += a*b; } C[idx]=sum; The un-optimized kernel is used to illustrate the code to remove partition camping Optimized kernel has 32 threads in one thread block and uses shared memory to avoid un-coalesced memory access 31
32 GFLOPS Matrix vector multiplication on GTX 280 Naïve Opti_PC Optimized CUBLAS2.2 2kx2k 2kx4k 2kx8k 2kx16k 2kx32k 2kx64k 4kx4k 3kx3k Matrix size Opti_PC : the optimized kernel without partition camping elimination 32
33 Performance (Gflops) Matrix vector multiplication on GTX Opti_PC optimized cublas K X 1K 2K X 2K 3K X 3K 4K X 4K 5K X 5K 6K X 6K 7K X 7K 8K X 8K Matrix Size Opti_PC : the optimized kernel without partition camping elimination. Partition camping elimination benefits 3K and 6K more because GTX 480 has 6 partitions. 33
34 Compiling for High Performance GPGPU Code: Naïve code Vectorization for memory access bandwidth Checking memory coalescing Converting non-coalesced accesses into coalesced Checking data dependencies and sharing patterns Thread & thread-block merge Data prefetching Removing memory partition camping High performance code 34
35 Outline Hardware abstraction A systematic approach to developing high performance GPGPU programs Optimization techniques with case studies Coalesced memory accesses Data reuse through thread (block) merge Eliminating partition conflicts Leveraging constant cache Conclusions 35
36 Leveraging constant cache (GTX 480) Register Benefit: fastest, no latency for ALU Limitation: no sharing between threads Constant cache Benefit: up to 2TBytes/S Limitation: 64kB const memory on GTX 480, sequential broadcast access Shared memory Benefit: sharing in block with index Limitation: up to 1TBytes/S Texture cache Benefit: 2D cache automatically Limitation: up to 334GBytes/S r0 = r1 + r2*shared[k]; One second 1T/4bytes = 0.25T float 0.25T * 2flops = 500 GFlop 36
37 Case study: Matrix Multiplication with Constant memory C = A * B float sum = 0; for (int i=0; i<w; i++) sum+=a[idy][i]*b[i][idx]; C[idy][idx] = sum; Naïve matrix multiplication (one output per thread) All threads with the same idy access input A same location sequentially. From A[idy][0] to A[idy][w-1] How about we put A into constant memory 37
38 Matrix Multiplication (Tiled) A[0][0] A[0][1] A[0][2] A[1][0] A[1][1] A[1][2] A[2][0] A[2][1] A[2][2] A[3][0] A[3][1] A[3][0] B[0] B[1] B[2] C[i] = A[i][0]*B[0] + A[i][1]*B[1]+A[i][2]*B[2] C[0], C[1], C[2], C[3] can be computed concurrently Let s put A[i][j] into constant memory C[0] C[1] C[2] C[3] 38
39 Efficient constant memory accesses 128 x 16 A X WidthOfB x 128 B A T = WidthOfC x 16 C When we load one pixel from B We can compute one output pixel Or two, up to 16 so that we can use more computation to overlap memory access B But column access in constant memory is not efficient 39
40 Matrix Multiplication 128 x 16 A X One thread Load one pixel from B Load one column from A Compute one column of C WidthOfB x 128 = WidthOfC x 16 B C A is 128 x 16 We can put A into const memory (column major) Load a float from B, we can do 16 mad to overlap the memory request of B The width of B determine the overall thread number. 40
41 Kernel code when A is 128 x 16 int idx = blockidx.x*blockdim.x + threadidx.x; float sum[16]; for (int i=0; i<128; i++) { float b = B[i] [idx]; for (int j=0; j<16; j++) { sum[j] += b*consta_a[i*16+j]; // A is in constant memory } } for (int j=0; j<16; j++) { C[j] [idx] = sum[j]; } Each thread computes 16 output pixels Thread block size:
42 Performance (Gflops) Matrix Mutiplication on GTX 480 (A is 128x16) cublas 3.1 const version *Constant memory transpose and transfer time included Width of B Up to 1.8 times Speedup on CUBLAS % of theoretical computation power (1.35Tflops) of GTX
43 Performance (Gflops) Matrix Mutiplication on GTX 480 cublas 3.1 const version *Constant memory transpose and transfer time included Input size Width, Height of A and B are the same as the input size Up to 1.65X speedups over CUBLAS % of theoretical computation power (1.35Tflops) of GTX
44 Conclusion A systematic way to optimize GPGPU programs Naïve kernel based on simplified hardware abstraction Optimizations Coalesced memory accesses Data reuse through thread (block) merge Eliminating partition conflicts Leveraging different types of caches We implement a source to source compiler to perform the optimizations automatically. 44
A GPGPU Compiler for Memory Optimization and Parallelism Management
GPGPU Compiler for Memory Optimization and Parallelism Management Yi Yang, Ping Xiang, Jingfei Kong, Huiyang Zhou Department of Electrical and Computer Engineering North Carolina State University School
More informationA GPGPU Compiler for Memory Optimization and Parallelism Management
A GPGPU Compiler for Memory Optimization and Parallelism Management Yi Yang Dept. of ECE North Carolina State University yyang14@ncsu.edu Ping Xiang School of EECS Univ. of Central Florida xp@knights.ucf.edu
More informationA GPGPU Compiler for Memory Optimization and Parallelism Management
A GPGPU Compiler for Memory Optimization and Parallelism Management Yi Yang Dept. of ECE North Carolina State University yyang14@ncsu.edu Ping Xiang School of EECS Univ. of Central Florida xp@knights.ucf.edu
More informationA Unified Optimizing Compiler Framework for Different GPGPU Architectures
A Unified Optimizing Compiler Framework for Different GPGPU Architectures YI YANG, North Carolina State University PING XIANG, North Carolina State University JINGFEI KONG, Advanced Micro Devices MIKE
More informationA Unified Optimizing Compiler Framework for Different GPGPU Architectures
A Unified Optimizing Compiler Framework for Different GPGPU Architectures YI YANG and PING XIANG, North Carolina State University JINGFEI KONG and MIKE MANTOR, Advanced Micro Devices HUIYANG ZHOU, North
More informationDouble-Precision Matrix Multiply on CUDA
Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices
More informationProgramming in CUDA. Malik M Khan
Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationABSTRACT. YANG, YI. Architectural Support and Compiler Optimization for Many-Core Architectures. (Under the direction of Dr. Huiyang Zhou).
ABSTRACT YANG, YI. Architectural Support and Compiler Optimization for Many-Core Architectures. (Under the direction of Dr. Huiyang Zhou). Many-core architectures, such as general purpose computation on
More informationFundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA
Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Optimization Overview GPU architecture Kernel optimization Memory optimization Latency optimization Instruction optimization CPU-GPU
More informationConvolution Soup: A case study in CUDA optimization. The Fairmont San Jose Joe Stam
Convolution Soup: A case study in CUDA optimization The Fairmont San Jose Joe Stam Optimization GPUs are very fast BUT Poor programming can lead to disappointing performance Squeaking out the most speed
More information2/2/11. Administrative. L6: Memory Hierarchy Optimization IV, Bandwidth Optimization. Project Proposal (due 3/9) Faculty Project Suggestions
Administrative L6: Memory Hierarchy Optimization IV, Bandwidth Optimization Next assignment available Goals of assignment: simple memory hierarchy management block-thread decomposition tradeoff Due Tuesday,
More informationGPU Programming. Performance Considerations. Miaoqing Huang University of Arkansas Fall / 60
1 / 60 GPU Programming Performance Considerations Miaoqing Huang University of Arkansas Fall 2013 2 / 60 Outline Control Flow Divergence Memory Coalescing Shared Memory Bank Conflicts Occupancy Loop Unrolling
More informationInformation Coding / Computer Graphics, ISY, LiTH. CUDA memory! ! Coalescing!! Constant memory!! Texture memory!! Pinned memory 26(86)
26(86) Information Coding / Computer Graphics, ISY, LiTH CUDA memory Coalescing Constant memory Texture memory Pinned memory 26(86) CUDA memory We already know... Global memory is slow. Shared memory is
More informationConvolution Soup: A case study in CUDA optimization. The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam
Convolution Soup: A case study in CUDA optimization The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam Optimization GPUs are very fast BUT Naïve programming can result in disappointing performance
More informationReview for Midterm 3/28/11. Administrative. Parts of Exam. Midterm Exam Monday, April 4. Midterm. Design Review. Final projects
Administrative Midterm - In class April 4, open notes - Review notes, readings and review lecture (before break) - Will post prior exams Design Review - Intermediate assessment of progress on project,
More informationCUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012
CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 Overview 1. Memory Access Efficiency 2. CUDA Memory Types 3. Reducing Global Memory Traffic 4. Example: Matrix-Matrix
More informationLecture 7. Using Shared Memory Performance programming and the memory hierarchy
Lecture 7 Using Shared Memory Performance programming and the memory hierarchy Announcements Scott B. Baden /CSE 260/ Winter 2014 2 Assignment #1 Blocking for cache will boost performance but a lot more
More informationCUDA Performance Considerations (2 of 2)
Administrivia CUDA Performance Considerations (2 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Friday 03/04, 11:59pm Assignment 4 due Presentation date change due via email Not bonus
More informationLecture 5. Performance Programming with CUDA
Lecture 5 Performance Programming with CUDA Announcements 2011 Scott B. Baden / CSE 262 / Spring 2011 2 Today s lecture Matrix multiplication 2011 Scott B. Baden / CSE 262 / Spring 2011 3 Memory Hierarchy
More informationIntroduction to GPGPU and GPU-architectures
Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks
More informationCUDA Performance Optimization. Patrick Legresley
CUDA Performance Optimization Patrick Legresley Optimizations Kernel optimizations Maximizing global memory throughput Efficient use of shared memory Minimizing divergent warps Intrinsic instructions Optimizations
More informationUniversity of Bielefeld
Geistes-, Natur-, Sozial- und Technikwissenschaften gemeinsam unter einem Dach Introduction to GPU Programming using CUDA Olaf Kaczmarek University of Bielefeld STRONGnet Summerschool 2011 ZIF Bielefeld
More informationCUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION. Julien Demouth, NVIDIA Cliff Woolley, NVIDIA
CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION Julien Demouth, NVIDIA Cliff Woolley, NVIDIA WHAT WILL YOU LEARN? An iterative method to optimize your GPU code A way to conduct that method with NVIDIA
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationCS/EE 217 Midterm. Question Possible Points Points Scored Total 100
CS/EE 217 Midterm ANSWER ALL QUESTIONS TIME ALLOWED 60 MINUTES Question Possible Points Points Scored 1 24 2 32 3 20 4 24 Total 100 Question 1] [24 Points] Given a GPGPU with 14 streaming multiprocessor
More informationComputer Architecture
Jens Teubner Computer Architecture Summer 2017 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2017 Jens Teubner Computer Architecture Summer 2017 34 Part II Graphics
More informationCS/CoE 1541 Final exam (Fall 2017). This is the cumulative final exam given in the Fall of Question 1 (12 points): was on Chapter 4
CS/CoE 1541 Final exam (Fall 2017). Name: This is the cumulative final exam given in the Fall of 2017. Question 1 (12 points): was on Chapter 4 Question 2 (13 points): was on Chapter 4 For Exam 2, you
More informationSlide credit: Slides adapted from David Kirk/NVIDIA and Wen-mei W. Hwu, DRAM Bandwidth
Slide credit: Slides adapted from David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2016 DRAM Bandwidth MEMORY ACCESS PERFORMANCE Objective To learn that memory bandwidth is a first-order performance factor in
More informationShared Memory. Table of Contents. Shared Memory Learning CUDA to Solve Scientific Problems. Objectives. Technical Issues Shared Memory.
Table of Contents Shared Memory Learning CUDA to Solve Scientific Problems. 1 Objectives Miguel Cárdenas Montes Centro de Investigaciones Energéticas Medioambientales y Tecnológicas, Madrid, Spain miguel.cardenas@ciemat.es
More informationCUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer
CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer Outline We ll be focussing on optimizing global memory throughput on Fermi-class GPUs
More informationOptimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA
Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA Shane Ryoo, Christopher I. Rodrigues, Sara S. Baghsorkhi, Sam S. Stone, David B. Kirk, and Wen-mei H. Hwu
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationDense Linear Algebra. HPC - Algorithms and Applications
Dense Linear Algebra HPC - Algorithms and Applications Alexander Pöppl Technical University of Munich Chair of Scientific Computing November 6 th 2017 Last Tutorial CUDA Architecture thread hierarchy:
More informationUnrolling parallel loops
Unrolling parallel loops Vasily Volkov UC Berkeley November 14, 2011 1 Today Very simple optimization technique Closely resembles loop unrolling Widely used in high performance codes 2 Mapping to GPU:
More informationCSE 599 I Accelerated Computing - Programming GPUS. Memory performance
CSE 599 I Accelerated Computing - Programming GPUS Memory performance GPU Teaching Kit Accelerated Computing Module 6.1 Memory Access Performance DRAM Bandwidth Objective To learn that memory bandwidth
More information1/25/12. Administrative
Administrative L3: Memory Hierarchy Optimization I, Locality and Data Placement Next assignment due Friday, 5 PM Use handin program on CADE machines handin CS6235 lab1 TA: Preethi Kotari - Email:
More informationCPU-Assisted GPGPU on Fused CPU-GPU Architectures
CPU-Assisted GPGPU on Fused CPU-GPU Architectures Yi Yang, Ping Xiang, Mike Mantor *, Huiyang Zhou Department of Electrical and Computer Engineering *Graphics Products Group North Carolina State University
More informationCUDA Memory Model. Monday, 21 February Some material David Kirk, NVIDIA and Wen-mei W. Hwu, (used with permission)
CUDA Memory Model Some material David Kirk, NVIDIA and Wen-mei W. Hwu, 2007-2009 (used with permission) 1 G80 Implementation of CUDA Memories Each thread can: Grid Read/write per-thread registers Read/write
More informationLecture 8: GPU Programming. CSE599G1: Spring 2017
Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More information2/17/10. Administrative. L7: Memory Hierarchy Optimization IV, Bandwidth Optimization and Case Studies. Administrative, cont.
Administrative L7: Memory Hierarchy Optimization IV, Bandwidth Optimization and Case Studies Next assignment on the website Description at end of class Due Wednesday, Feb. 17, 5PM Use handin program on
More informationLecture 7. Overlap Using Shared Memory Performance programming the memory hierarchy
Lecture 7 Overlap Using Shared Memory Performance programming the memory hierarchy Announcements Mac Mini lab (APM 2402) Starts Tuesday Every Tues and Fri for the next 2 weeks Project proposals due on
More informationCS 179: GPU Computing. Recitation 2: Synchronization, Shared memory, Matrix Transpose
CS 179: GPU Computing Recitation 2: Synchronization, Shared memory, Matrix Transpose Synchronization Ideal case for parallelism: no resources shared between threads no communication between threads Many
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationGREAT PERFORMANCE FOR TINY PROBLEMS: BATCHED PRODUCTS OF SMALL MATRICES. Nikolay Markovskiy Peter Messmer
GREAT PERFORMANCE FOR TINY PROBLEMS: BATCHED PRODUCTS OF SMALL MATRICES Nikolay Markovskiy Peter Messmer ABOUT CP2K Atomistic and molecular simulations of solid state From ab initio DFT and Hartree-Fock
More informationCUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION
CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION WHAT YOU WILL LEARN An iterative method to optimize your GPU code Some common bottlenecks to look out for Performance diagnostics with NVIDIA Nsight
More informationAdvanced CUDA Optimizations. Umar Arshad ArrayFire
Advanced CUDA Optimizations Umar Arshad (@arshad_umar) ArrayFire (@arrayfire) ArrayFire World s leading GPU experts In the industry since 2007 NVIDIA Partner Deep experience working with thousands of customers
More informationCME 213 S PRING Eric Darve
CME 213 S PRING 2017 Eric Darve Review Secret behind GPU performance: simple cores but a large number of them; even more threads can exist live on the hardware (10k 20k threads live). Important performance
More informationCS 314 Principles of Programming Languages
CS 314 Principles of Programming Languages Zheng Zhang Fall 2016 Dec 14 GPU Programming Rutgers University Programming with CUDA Compute Unified Device Architecture (CUDA) Mapping and managing computations
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied
More informationQR Decomposition on GPUs
QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of
More informationAccelerating GPU computation through mixed-precision methods. Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University
Accelerating GPU computation through mixed-precision methods Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University Outline Motivation Truncated Precision using CUDA Solving Linear
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationHardware/Software Co-Design
1 / 13 Hardware/Software Co-Design Review so far Miaoqing Huang University of Arkansas Fall 2011 2 / 13 Problem I A student mentioned that he was able to multiply two 1,024 1,024 matrices using a tiled
More informationCS 677: Parallel Programming for Many-core Processors Lecture 6
1 CS 677: Parallel Programming for Many-core Processors Lecture 6 Instructor: Philippos Mordohai Webpage: www.cs.stevens.edu/~mordohai E-mail: Philippos.Mordohai@stevens.edu Logistics Midterm: March 11
More informationCUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA
CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0 Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nsight VSE APOD
More informationOptimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs
Normalized execution time Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs Chao Li # Yi Yang* Min Feng* Srimat Chakradhar* Huiyang Zhou # # Department of Electrical and Computer
More informationCUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni
CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied
More informationTiled Matrix Multiplication
Tiled Matrix Multiplication Basic Matrix Multiplication Kernel global void MatrixMulKernel(int m, m, int n, n, int k, k, float* A, A, float* B, B, float* C) C) { int Row = blockidx.y*blockdim.y+threadidx.y;
More informationGPU programming: CUDA basics. Sylvain Collange Inria Rennes Bretagne Atlantique
GPU programming: CUDA basics Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr This lecture: CUDA programming We have seen some GPU architecture Now how to program it? 2 Outline
More informationOptimizing CUDA NVIDIA Corporation 2009 U. Melbourne GPU Computing Workshop 27/5/2009 1
Optimizing CUDA 1 Outline Overview Hardware Memory Optimizations Execution Configuration Optimizations Instruction Optimizations Summary 2 Optimize Algorithms for the GPU Maximize independent parallelism
More informationImage convolution with CUDA
Image convolution with CUDA Lecture Alexey Abramov abramov _at_ physik3.gwdg.de Georg-August University, Bernstein Center for Computational Neuroscience, III Physikalisches Institut, Göttingen, Germany
More informationCS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology
CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367
More informationECE 408 / CS 483 Final Exam, Fall 2014
ECE 408 / CS 483 Final Exam, Fall 2014 Thursday 18 December 2014 8:00 to 11:00 Central Standard Time You may use any notes, books, papers, or other reference materials. In the interest of fair access across
More informationThreading Hardware in G80
ing Hardware in G80 1 Sources Slides by ECE 498 AL : Programming Massively Parallel Processors : Wen-Mei Hwu John Nickolls, NVIDIA 2 3D 3D API: API: OpenGL OpenGL or or Direct3D Direct3D GPU Command &
More informationLecture 2: CUDA Programming
CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:
More informationLab 1 Part 1: Introduction to CUDA
Lab 1 Part 1: Introduction to CUDA Code tarball: lab1.tgz In this hands-on lab, you will learn to use CUDA to program a GPU. The lab can be conducted on the SSSU Fermi Blade (M2050) or NCSA Forge using
More informationEE/CSCI 451 Midterm 1
EE/CSCI 451 Midterm 1 Spring 2018 Instructor: Xuehai Qian Friday: 02/26/2018 Problem # Topic Points Score 1 Definitions 20 2 Memory System Performance 10 3 Cache Performance 10 4 Shared Memory Programming
More informationEE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III)
EE382 (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III) Mattan Erez The University of Texas at Austin EE382: Principles of Computer Architecture, Fall 2011 -- Lecture
More informationGPU programming: Code optimization part 2. Sylvain Collange Inria Rennes Bretagne Atlantique
GPU programming: Code optimization part 2 Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr Outline Memory optimization Memory access patterns Global memory optimization Shared
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationGPU programming basics. Prof. Marco Bertini
GPU programming basics Prof. Marco Bertini CUDA: atomic operations, privatization, algorithms Atomic operations The basics atomic operation in hardware is something like a read-modify-write operation performed
More informationAdvanced CUDA Optimizations
Advanced CUDA Optimizations General Audience Assumptions General working knowledge of CUDA Want kernels to perform better Profiling Before optimizing, make sure you are spending effort in correct location
More informationGPU Performance Optimisation. Alan Gray EPCC The University of Edinburgh
GPU Performance Optimisation EPCC The University of Edinburgh Hardware NVIDIA accelerated system: Memory Memory GPU vs CPU: Theoretical Peak capabilities NVIDIA Fermi AMD Magny-Cours (6172) Cores 448 (1.15GHz)
More informationGPU Performance Nuggets
GPU Performance Nuggets Simon Garcia de Gonzalo & Carl Pearson PhD Students, IMPACT Research Group Advised by Professor Wen-mei Hwu Jun. 15, 2016 grcdgnz2@illinois.edu pearson@illinois.edu GPU Performance
More informationCS/EE 217 GPU Architecture and Parallel Programming. Lecture 10. Reduction Trees
CS/EE 217 GPU Architecture and Parallel Programming Lecture 10 Reduction Trees David Kirk/NVIDIA and Wen-mei W. Hwu University of Illinois, 2007-2012 1 Objective To master Reduction Trees, arguably the
More informationGPU Computing with CUDA. Part 3: CUDA Performance Tips and Tricks
GPU Computing with CUDA Part 3: CUDA Performance Tips and Tricks Dortmund, June 4, 2009 SFB 708, AK "Modellierung und Simulation" Dominik Göddeke Angewandte Mathematik und Numerik TU Dortmund dominik.goeddeke@math.tu-dortmund.de
More informationEfficient Integral Image Computation on the GPU
Efficient Integral Image Computation on the GPU The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published Publisher Bilgic,
More informationLecture 1: Introduction and Computational Thinking
PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational
More informationPerformance optimization with CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Performance optimization with CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide
More informationAdministrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve
Administrative Issues L11: Sparse Linear Algebra on GPUs Next assignment, triangular solve Due 5PM, Tuesday, March 15 handin cs6963 lab 3 Project proposals Due 5PM, Wednesday, March 7 (hard
More informationAdvanced CUDA Programming. Dr. Timo Stich
Advanced CUDA Programming Dr. Timo Stich (tstich@nvidia.com) Outline SIMT Architecture, Warps Kernel optimizations Global memory throughput Launch configuration Shared memory access Instruction throughput
More information2006: Short-Range Molecular Dynamics on GPU. San Jose, CA September 22, 2010 Peng Wang, NVIDIA
2006: Short-Range Molecular Dynamics on GPU San Jose, CA September 22, 2010 Peng Wang, NVIDIA Overview The LAMMPS molecular dynamics (MD) code Cell-list generation and force calculation Algorithm & performance
More informationMaximizing Face Detection Performance
Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC 2015 1 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount
More informationCE 431 Parallel Computer Architecture Spring Graphics Processor Units (GPU) Architecture
CE 431 Parallel Computer Architecture Spring 2017 Graphics Processor Units (GPU) Architecture Nikos Bellas Computer and Communications Engineering Department University of Thessaly Some slides borrowed
More informationAnalysis Report. Number of Multiprocessors 3 Multiprocessor Clock Rate Concurrent Kernel Max IPC 6 Threads per Warp 32 Global Memory Bandwidth
Analysis Report v3 Duration 932.612 µs Grid Size [ 1024,1,1 ] Block Size [ 1024,1,1 ] Registers/Thread 32 Shared Memory/Block 28 KiB Shared Memory Requested 64 KiB Shared Memory Executed 64 KiB Shared
More informationDRAM Bank Organization
DRM andwidth DRM ank Organization Row ddr Row Decoder Memory Cell Core rray DRM Memory Cell Sense mps Column Latches Column ddr Mux Mux Off-chip Data DRM Core rrays are Slow DRM Core rrays are Slow DDR:
More informationModule Memory and Data Locality
GPU Teaching Kit Accelerated Computing Module 4.4 - Memory and Data Locality Tiled Matrix Multiplication Kernel Objective To learn to write a tiled matrix-multiplication kernel Loading and using tiles
More informationBlocks, Grids, and Shared Memory
Blocks, Grids, and Shared Memory GPU Course, Fall 2012 Last week: ax+b Homework Threads, Blocks, Grids CUDA threads are organized into blocks Threads operate in SIMD(ish) manner -- each executing same
More informationMatrix Multiplication in CUDA. A case study
Matrix Multiplication in CUDA A case study 1 Matrix Multiplication: A Case Study Matrix multiplication illustrates many of the basic features of memory and thread management in CUDA Usage of thread/block
More informationGPU Accelerated Application Performance Tuning. Delivered by John Ashley, Slides by Cliff Woolley, NVIDIA Developer Technology Group
GPU Accelerated Application Performance Tuning Delivered by John Ashley, Slides by Cliff Woolley, NVIDIA Developer Technology Group Main Requirements for GPU Performance Expose sufficient parallelism Use
More informationCUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav
CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture
More informationBy: Tomer Morad Based on: Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym. NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE In IEEE Micro 28(2), 2008 } } Erik Lindholm, John Nickolls,
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationImage Processing Optimization C# on GPU with Hybridizer
Image Processing Optimization C# on GPU with Hybridizer regis.portalez@altimesh.com Median Filter Denoising Noisy image (lena 1960x1960) Denoised image window = 3 2 Median Filter Denoising window Output[i,j]=
More informationCSE 591: GPU Programming. Memories. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591: GPU Programming Memories Klaus Mueller Computer Science Department Stony Brook University Importance of Memory Access Efficiency Every loop iteration has two global memory accesses two floating
More informationIntroduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series
Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca (Slides http://support.scinet.utoronto.ca/ northrup/westgrid CUDA.pdf) March 12, 2014
More informationSparse Linear Algebra in CUDA
Sparse Linear Algebra in CUDA HPC - Algorithms and Applications Alexander Pöppl Technical University of Munich Chair of Scientific Computing November 22 nd 2017 Table of Contents Homework - Worksheet 2
More information