CSE 591: GPU Programming. Using CUDA in Practice. Klaus Mueller, Computer Science Department, Stony Brook University. Code examples from Shane Cook, CUDA Programming.

Lazy Evaluation. Related to scoreboarding of loads and stores. Consider the following kind of code: is this efficient? Each operation depends on the one before it: compute the new index and address, then load the data, then add it to the sum. Not much memory latency can be hidden this way. Can we do it better?
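A minimal sketch of the kind of dependent loop the slide refers to (the actual listing is in Shane Cook's book; the array and length names here are placeholders):

// Hypothetical sketch of a fully dependent accumulation loop.
// Each addition needs the value that was just loaded, so there is
// little opportunity to overlap memory latency with computation.
__device__ unsigned int sum_dependent(const unsigned int *data,
                                      unsigned int len)
{
    unsigned int sum = 0;
    for (unsigned int i = 0; i < len; i++)
    {
        sum += data[i];   // load, then immediately consume the result
    }
    return sum;
}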

Lazy Evaluation. Splitting the loop into four independent sums: is this better?
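A hedged sketch of the four-way split (placeholder names again; the length is assumed to be a multiple of four): the four partial sums do not depend on each other, so their loads can all be in flight at once.

// Hypothetical sketch: four independent partial sums.
// The four loads per iteration are independent, so they can be issued
// back to back and more memory latency can be hidden.
__device__ unsigned int sum_split4(const unsigned int *data,
                                   unsigned int len)   // assumed divisible by 4
{
    unsigned int sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    for (unsigned int i = 0; i < len; i += 4)
    {
        sum0 += data[i];
        sum1 += data[i + 1];
        sum2 += data[i + 2];
        sum3 += data[i + 3];
    }
    return sum0 + sum1 + sum2 + sum3;
}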

Lazy Evaluation. How about this? Compare it with the eager evaluation model of CPUs: a CPU stalls at every read, whereas a GPU delays the stall until the loaded value is actually used, and then rapidly switches to another thread.
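One possible form of the variant the slide alludes to (my guess at the pattern; names are placeholders): issue all the loads first and consume them afterwards, so any stall is deferred to the point of first use.

// Hypothetical sketch: batch the loads, then consume them.
// The four loads are issued back to back; the thread only waits
// (or gets switched out) when d0..d3 are actually used in the sum.
__device__ unsigned int sum_batched(const unsigned int *data,
                                    unsigned int len)   // assumed divisible by 4
{
    unsigned int sum = 0;
    for (unsigned int i = 0; i < len; i += 4)
    {
        const unsigned int d0 = data[i];
        const unsigned int d1 = data[i + 1];
        const unsigned int d2 = data[i + 2];
        const unsigned int d3 = data[i + 3];
        sum += d0 + d1 + d2 + d3;   // first use of the loaded values
    }
    return sum;
}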

Recursion on GPUs. Recursion leads to branch explosion and is less appropriate for GPUs, since the number of threads is fixed before kernel invocation; only the very recent Kepler high-end K20 GPUs support dynamic parallelism. How else can it be implemented? Use iterative methods instead of generating new branches (recall the binary search in Sample Sort); one can also invoke new kernels at every level. See the book for more detail.
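A minimal sketch of an iterative binary search of the kind Sample Sort uses to place an element among the sorted sample values (function and parameter names are mine, not the book's):

// Hypothetical sketch: iterative binary search, no recursion required.
// Returns the index of the first sample greater than 'key', i.e. the
// bucket this key falls into.
__device__ unsigned int find_bucket(const unsigned int *samples,
                                    unsigned int num_samples,
                                    unsigned int key)
{
    unsigned int lo = 0;
    unsigned int hi = num_samples;      // search range is [lo, hi)
    while (lo < hi)
    {
        const unsigned int mid = (lo + hi) / 2;
        if (samples[mid] <= key)
            lo = mid + 1;               // key lies in the right half
        else
            hi = mid;                   // key lies in the left half
    }
    return lo;                          // bucket index
}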

CUDA Ballot Function. For compute capability 2.0 and above: unsigned int __ballot(int predicate). The function returns a value whose Nth bit is set for every thread in the warp whose predicate evaluates to TRUE, where N is the thread's lane within the warp (threadIdx.x % 32 for a 1D block). A C implementation of a non-atomic version (which works for all compute levels) follows:
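A hedged reconstruction of such a non-atomic version (the book's exact listing is not reproduced here): each thread only contributes its own lane bit, so the per-thread results still have to be OR-ed together, which is what the atomicOr example on the next slides does.

// Hypothetical sketch of a non-atomic __ballot substitute.
// Each thread returns a word with only its own lane bit set (or 0);
// the results must still be OR-ed across the warp to form the ballot.
__device__ unsigned int ballot_non_atomic(int predicate)
{
    if (predicate != 0)
        return 1u << (threadIdx.x % 32);   // set this thread's lane bit
    return 0u;
}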

Intrinsics. What are (CUDA) intrinsics? An intrinsic is a function known to the compiler that maps directly to a sequence of one or more assembly-language instructions. Intrinsics are inherently more efficient than called functions because no calling linkage is required, and they make processor-specific enhancements easier to use because they provide a CUDA-language interface to assembly instructions. In doing so, the compiler manages details the user would otherwise have to worry about, such as register names, register allocation, and the memory locations of data.
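As a concrete illustration (my example, not the slide's): the __popc intrinsic used later maps to a single population-count instruction, whereas the equivalent plain C loop compiles to many instructions.

// Hypothetical illustration: intrinsic vs. hand-written equivalent.
__device__ int count_bits_intrinsic(unsigned int x)
{
    return __popc(x);        // single population-count instruction
}

__device__ int count_bits_loop(unsigned int x)
{
    int count = 0;
    while (x)                // many instructions, data-dependent trip count
    {
        count += (int)(x & 1u);
        x >>= 1;
    }
    return count;
}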

Other CUDA Intrinsics. atomicOr: int atomicOr(int *address, int val). It reads the value pointed to by address, performs a bitwise OR with val, and writes the result back to address (the old value is returned). Example use: what does this function do? It sets a bit for every thread for which data[tid] > threshold. What useful computation can this enable?
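A hedged sketch of the kind of example described here (kernel and variable names are placeholders): every thread whose value exceeds the threshold sets its own lane bit in a per-warp word via atomicOr.

// Hypothetical sketch: set one bit per qualifying thread, one word per warp.
// warp_flags is assumed to be zero-initialized before the launch, and the
// grid is assumed to cover the data array exactly.
__global__ void flag_over_threshold(const unsigned int *data,
                                    unsigned int threshold,
                                    unsigned int *warp_flags)   // one word per warp
{
    const unsigned int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int warp = tid / 32;

    if (data[tid] > threshold)
        atomicOr(&warp_flags[warp], 1u << (threadIdx.x % 32));
}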

Other CUDA Intrinsics. This enables very fast counting given a condition (predicate). Extend it using the __popc function, which returns the number of bits set in a 32-bit parameter. It can be used to accumulate a block-based sum over all warps in the block: for a given CUDA block it accumulates the number of threads in each warp for which the predicate held (in this example, the condition that the data value was larger than a threshold). The per-block values must then be summed across blocks. See the book for more detail and applications.
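A hedged sketch combining the ballot and __popc as described (names are placeholders; the per-block counts would still need a final sum on the host or in a second kernel):

// Hypothetical sketch: count, per block, how many threads have
// data[tid] > threshold, using __ballot + __popc (compute 2.x).
__global__ void count_over_threshold(const unsigned int *data,
                                     unsigned int threshold,
                                     unsigned int *block_counts)   // one count per block
{
    __shared__ unsigned int block_count;
    if (threadIdx.x == 0)
        block_count = 0;
    __syncthreads();

    const unsigned int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int bits = __ballot(data[tid] > threshold);  // one bit per lane

    if (threadIdx.x % 32 == 0)                    // one lane adds its warp's count
        atomicAdd(&block_count, (unsigned int)__popc(bits));
    __syncthreads();

    if (threadIdx.x == 0)
        block_counts[blockIdx.x] = block_count;   // still needs a sum across blocks
}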

Profiling. We shall use Sample Sort as a running example. The major parameters are the number of samples and the number of threads. Explore the possible search space: double the number of samples per iteration, and use 32, 64, 128, or 256 threads.

Example Configuration (1)

Example Configuration (2)

Hard to see? What do we do? Visualize it! Now differences and trends can be observed easily. For 16K samples, about three quarters of the time is spent sorting and one quarter on setting up the sample sort. For 64K samples, the time to sort the sample set suddenly jumps to around one third. There is much variability with respect to both the number of samples and the number of threads.

NVIDIA Parallel Nsight. A free debugging and analysis tool, incredibly useful for identifying bottlenecks; look for the New Analysis Activity feature. We shall focus on the middle case, 32K samples. Choose the Profile activity type (next slide); by default this runs a couple of experiments, Achieved Occupancy and Instruction Statistics, and produces a summary. At the top of the summary page is a dropdown box; select CUDA Launches to get useful information.

Parallel Nsight Launch Options

Parallel Nsight Analysis Occupancy View

Analysis. Factors limiting occupancy are colored red. Occupancy: the block limit (8 blocks per SM) is limiting the maximum number of active warps on the device. Too few warps means less memory latency can be hidden. We launched around 16 warps (but could have up to 48), which yields 1/3 of the maximum occupancy. So should we increase the number of warps (and threads)? It turns out this has the opposite effect (see the measured results).

More Parallel Nsight Analysis

Analysis. Some instructions were reissued (blue bars) due to serialization; these come from threads that were not able to execute as a complete warp because of divergent program flow, uncoalesced memory accesses, or conflicts (shared memory, atomics). The distribution of work across SMs is uneven: some have more warps than others, and some also take longer due to an uneven amount of work. With 14 SMs and 512 blocks of 64 threads we would expect about 36 blocks (72 warps) per SM, but some SMs get 68 warps while others get 78.

Change in Parameters. When moving to 256 threads/block, the variability between issued and executed instructions grows, and the number of scheduled blocks drops from eight to three because of the 34 registers allocated per thread. Nevertheless, the number of scheduled warps rises to 24 (from 16), giving a 50% occupancy rate. Can we increase occupancy? We need to limit register use via a compiler option (set the maximum to 32); this leads to a register use of 18 and an occupancy of 100%, but execution time grows from 63 ms to 86 ms. Why? Because registers are now pushed to local storage (the L1 cache, or global memory on earlier devices). Can we gain performance in a different way? A speedup of 4.4 over CPU-based QSort can be achieved, mainly through better cache utilization; see the book for details.
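Two standard ways to cap register use, shown as a hedged sketch (the kernel name is a placeholder, and the slide does not say which mechanism was actually used): the nvcc flag applies to the whole compilation unit, while __launch_bounds__ constrains a single kernel.

// Option 1 (assumed): cap registers for the whole file at compile time.
//   nvcc -maxrregcount=32 samplesort.cu

// Option 2 (assumed): let the compiler derive a register budget for one kernel:
// 256 threads per block and at least 6 resident blocks per SM (48 warps,
// i.e. full occupancy on a compute 2.0 device).
__global__ void __launch_bounds__(256, 6)
sort_bucket_kernel(unsigned int *data, unsigned int len)
{
    // kernel body omitted; placeholder for the sample-sort work
}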