CS 179: GPU Programming LECTURE 5: GPU COMPUTE ARCHITECTURE FOR THE LAST TIME


Last time... GPU memory systems: the different kinds of memory pools, caches, etc., and the optimization techniques that go with them.

Warp Schedulers: A warp scheduler finds a warp that is ready to execute its next instruction, finds available execution cores, and starts execution. GK110: 4 warp schedulers per SM, each with 2 instruction dispatchers. Each clock, an SM can start instructions in up to 4 warps, and up to 2 independent instructions in each of those warps.

GK110 (Kepler) numbers: max threads / SM = 2048 (64 warps); max threads / block = 1024 (32 warps); 32-bit registers / SM = 64K; max shared memory / SM = 48 KB. The number of blocks that run concurrently on an SM depends on the resource requirements of each block!

Occupancy: occupancy = (warps per SM) / (max warps per SM). Max warps / SM depends only on the GPU; warps / SM depends on warps / block, registers / block, and shared memory / block.

GK110 occupancy examples: 100% occupancy = 2 blocks of 1024 threads, 32 registers/thread, 24 KB of shared memory / block. 50% occupancy = 1 block of 1024 threads, 64 registers/thread, 48 KB of shared memory / block.
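In practice you can query how many blocks of a given size fit on an SM with the CUDA occupancy API and derive occupancy from that. A minimal sketch, assuming a toy kernel named myKernel (the kernel and its launch configuration are illustrative, not from the lecture):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Toy kernel used only so the occupancy API has something to inspect.
    __global__ void myKernel(float *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = (float)i;
    }

    int main() {
        int device;
        cudaGetDevice(&device);
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, device);

        int blockSize = 1024;   // threads per block
        int blocksPerSM = 0;    // filled in by the occupancy API
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, myKernel, blockSize, 0 /* dynamic shared mem */);

        // occupancy = active warps per SM / max warps per SM
        int activeWarps = blocksPerSM * blockSize / prop.warpSize;
        int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
        printf("occupancy = %.0f%%\n", 100.0 * activeWarps / maxWarps);
        return 0;
    }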

This lecture: Synchronization, Atomic Operations, Instruction Dependencies, Instruction Level Parallelism (ILP).

Synchronization: Synchronization is the coordination of multiple threads so that they do not clash with each other when they access shared data. Example of a synchronization issue: int x = 1; Thread 1 wants to add 1 to x, and Thread 2 wants to add 1 to x. Thread 1 reads the value of x (which is 1) into a register. Thread 2 reads the value of x (which is still 1) into a register. Both threads increment the value they read, so each computes 2. They both write x back out, and the final result is 2 rather than the expected 3.
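The same lost-update race is easy to reproduce on a GPU with a plain (non-atomic) increment. A minimal sketch, assuming managed memory and an illustrative kernel name racyIncrement; with many threads doing the read-modify-write, most of the increments are lost:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Every thread performs x = x + 1 as a separate read, add, and write,
    // so increments from different threads can overwrite each other.
    __global__ void racyIncrement(int *x) {
        *x = *x + 1;   // NOT atomic
    }

    int main() {
        int *x;
        cudaMallocManaged(&x, sizeof(int));
        *x = 1;

        racyIncrement<<<1, 256>>>(x);
        cudaDeviceSynchronize();

        // Would print 257 if the increments were atomic; in practice the
        // result is much smaller because of lost updates.
        printf("x = %d\n", *x);
        cudaFree(x);
        return 0;
    }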

Synchronization: On a CPU, you can solve synchronization issues using locks, semaphores, condition variables, etc. On a GPU, these solutions introduce too much memory and processing overhead, so we use simpler mechanisms better suited to massively parallel programs.

CUDA Synchronization: Use the __syncthreads() function to synchronize the threads within a block. It only works at the block level: SMs are separate from each other, so we can't do better than this cheaply. It acts as a barrier, similar to barrier constructs in CPU threading libraries.
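A minimal sketch of the usual __syncthreads() pattern: stage data in shared memory, hit the barrier, then read values written by other threads in the block (the kernel name and block size are illustrative):

    #define BLOCK_SIZE 256

    // Reverse the elements handled by each block. The barrier guarantees
    // that every thread's write to tile[] is visible before any thread
    // reads another thread's slot.
    __global__ void reverseInBlock(const int *in, int *out) {
        __shared__ int tile[BLOCK_SIZE];
        int t = threadIdx.x;
        int i = blockIdx.x * blockDim.x + t;

        tile[t] = in[i];
        __syncthreads();                     // all writes to tile[] complete

        out[i] = tile[blockDim.x - 1 - t];   // safe to read other threads' data
    }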

Atomic Operations: Atomic operations are operations that execute as single, indivisible steps; concurrent atomic operations on the same address are serialized and happen one at a time. For example, if you want to add up N numbers by adding them into a variable that starts at 0, the additions must happen one at a time. Don't do this, though: we'll talk about better ways (reductions) in the next lecture. Only use atomics when you have no other option. CUDA provides built-in atomic operations of the form atomic<Op>(address, val), where <Op> is one of Add, Sub, Exch, Min, Max, Inc, Dec, And, Or, Xor; e.g. atomicAdd(float *address, float val) for atomic addition. These functions can all be implemented in terms of atomicCAS(int *address, int compare, int val). CAS stands for compare-and-swap: it compares *address to compare and, if they are equal, swaps *address to val, returning the old value either way.
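Below is a sketch of both ideas: a deliberately slow sum kernel that leans on the built-in atomicAdd, and a CAS-loop version of float addition in the style of the CUDA programming guide. The names sumAll and atomicAddViaCAS are illustrative, not part of the CUDA API:

    // Slow but correct: every thread funnels its element through one address.
    __global__ void sumAll(const float *x, float *result, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(result, x[i]);   // built-in atomic addition
    }

    // Float addition built from atomicCAS: keep retrying until no other
    // thread has changed *address between our read and our swap.
    __device__ float atomicAddViaCAS(float *address, float val) {
        int *address_as_int = (int *)address;
        int old = *address_as_int, assumed;
        do {
            assumed = old;
            old = atomicCAS(address_as_int, assumed,
                            __float_as_int(val + __int_as_float(assumed)));
        } while (assumed != old);
        return __int_as_float(old);
    }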

Instruction Dependencies: An instruction dependency is a relationship between instructions that forces them to execute sequentially. In the example below, each summation must happen in order because each value of acc depends on the result of the previous addition. Dependencies can be direct (one instruction uses another's result) or imposed by the required execution order of the code; i.e., within a single thread, an instruction can't start until all of the operations it depends on have completed. acc += x[0]; acc += x[1]; acc += x[2]; acc += x[3];...

Instruction Level Parallelism (ILP): Instruction-level parallelism is when you arrange independent instructions so as to avoid the performance losses caused by instruction dependencies. In CUDA, it also removes performance losses caused by how certain operations are handled by the hardware.

ILP Example: z0 = x[0] + y[0]; z1 = x[1] + y[1]; compiles to: x0 = x[0]; y0 = y[0]; z0 = x0 + y0; x1 = x[1]; y1 = y[1]; z1 = x1 + y1; The second half of the code can't start execution until the first half completes.

ILP Example: z0 = x[0] + y[0]; z1 = x[1] + y[1]; can instead compile to: x0 = x[0]; y0 = y[0]; x1 = x[1]; y1 = y[1]; z0 = x0 + y0; z1 = x1 + y1; The sequential nature of the code due to instruction dependencies has been minimized. Additionally, issuing all four loads back to back lets their memory latencies overlap rather than being paid one after another.
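A sketch of the same idea inside a kernel: each thread handles two independent elements, so the compiler can issue both pairs of loads before either addition needs their results (the kernel name and the two-elements-per-thread choice are illustrative):

    // Each thread computes two independent sums. The loads for element i+1
    // do not depend on the results for element i, so they can be issued
    // back to back and their latencies overlapped (ILP within one thread).
    __global__ void addTwoPerThread(const float *x, const float *y,
                                    float *z, int n) {
        int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
        if (i + 1 < n) {
            float x0 = x[i],     y0 = y[i];
            float x1 = x[i + 1], y1 = y[i + 1];   // independent of x0, y0
            z[i]     = x0 + y0;
            z[i + 1] = x1 + y1;
        }
    }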

Questions? Synchronization, Atomic Operations, Instruction Dependencies, Instruction Level Parallelism (ILP).

Next time... Set 2 recitation on Friday (04/06). GPU-based algorithms (next week; lectures will be given by Aadyot).