CUDA Optimization: Memory Bandwidth Limited Kernels
CUDA Webinar
Tim C. Schroeder, HPC Developer Technology Engineer

Outline
We'll be focusing on optimizing global memory throughput on Fermi-class GPUs:
- Launch Configuration
- Memory Access Patterns
- Using On-Chip Memory
- Summary, Further Reading and Questions

Launch Configuration

Launch Configuration
- Need enough total threads to keep the GPU busy
- Typically, you'd like 512+ threads per SM
  - More if processing one fp32 element per thread
  - An SM can concurrently execute up to 8 thread blocks
- Of course, exceptions exist

Example
- Increment of an array of 64M elements
- Two accesses per thread (load, then store)
- The two accesses are dependent, so really only 1 access per thread is in flight at a time
- Tesla C2050, ECC on, theoretical bandwidth: ~120 GB/s
- Several independent smaller accesses have the same effect as one larger one; for example, four 32-bit accesses ~= one 128-bit access
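A minimal sketch of the kernel this example describes, assuming one fp32 element per thread; the 256-thread block size in the launch comment is an illustrative choice, not a figure from the webinar:

    // Increment kernel: one element per thread, one load followed by a dependent store.
    __global__ void increment(float *a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            a[i] = a[i] + 1.0f;   // the store cannot issue until the load completes
    }

    // Example launch (block size is an assumption):
    //   increment<<<(n + 255) / 256, 256>>>(d_a, n);
    // Note: on Fermi, gridDim.x tops out at 65535, so covering 64M elements in one
    // dimension requires either a 2D grid or several elements per thread (next slide).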

Launch Configuration
- Conclusion: have enough loads in flight to saturate the bus; more threads or multiple independent elements per thread give the same performance (see the sketch below)
- Typically, you'd like 512+ threads per SM
- Independent loads and stores from the same thread
- Loads and stores from different threads
- Larger word sizes can also help (a float2 load issues twice the transactions of a float load, for example)
- For more details: Vasily Volkov's GTC 2010 talk "Better Performance at Lower Occupancy"
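As a sketch of "multiple elements per thread with independent loads", assuming four elements per thread spaced a full grid apart so each iteration stays coalesced (the factor of four and the 512-thread block size are illustrative choices, not webinar figures):

    // Four independent elements per thread: the four load/store pairs have no
    // dependences between them, so more memory traffic is in flight per thread.
    __global__ void increment4(float *a, int n)
    {
        int stride = gridDim.x * blockDim.x;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        #pragma unroll
        for (int k = 0; k < 4; ++k) {
            int idx = i + k * stride;
            if (idx < n)
                a[idx] += 1.0f;
        }
    }

    // Launched with a quarter as many blocks as the one-element-per-thread version:
    //   int threads = 512;
    //   int blocks  = (n + 4 * threads - 1) / (4 * threads);
    //   increment4<<<blocks, threads>>>(d_a, n);

Wider types (float2, float4) achieve a similar effect by moving more bytes per memory instruction.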

Memory Access Patterns

GMEM Operations
Two types of loads:
- Caching
  - Default mode
  - Attempts to hit in L1, then L2, then GMEM
  - Load granularity is a 128-byte line
  - Program configurable: 16KB shared / 48KB L1 OR 48KB shared / 16KB L1
- Non-caching
  - Compile with the -Xptxas -dlcm=cg option to nvcc
  - Attempts to hit in L2, then GMEM
  - Load granularity is 32 bytes

Load Operation
- Memory operations are issued per warp (32 threads), just like all other instructions
- Operation: threads in a warp provide memory addresses; the hardware determines which lines/segments are needed and requests them

Caching Load
- Warp requests 32 aligned, consecutive 4-byte words
- Addresses fall within 1 cache line
- Warp needs 128 bytes; 128 bytes move across the bus on a miss
- Bus utilization: 100%
(Diagram: warp addresses vs. memory addresses.)

Non-caching Load
- Warp requests 32 aligned, consecutive 4-byte words
- Addresses fall within 4 segments
- Warp needs 128 bytes; 128 bytes move across the bus on a miss
- Bus utilization: 100%
(Diagram: warp addresses vs. memory addresses.)

Caching Load
- Warp requests 32 aligned, permuted 4-byte words
- Addresses fall within 1 cache line
- Warp needs 128 bytes; 128 bytes move across the bus on a miss
- Bus utilization: 100%
(Diagram: warp addresses vs. memory addresses.)

Non-caching Load
- Warp requests 32 aligned, permuted 4-byte words
- Addresses fall within 4 segments
- Warp needs 128 bytes; 128 bytes move across the bus on a miss
- Bus utilization: 100%
(Diagram: warp addresses vs. memory addresses.)

Caching Load
- Warp requests 32 misaligned, consecutive 4-byte words
- Addresses fall within 2 cache lines
- Warp needs 128 bytes; 256 bytes move across the bus on misses
- Bus utilization: 50%
(Diagram: warp addresses vs. memory addresses.)

Non-caching Load
- Warp requests 32 misaligned, consecutive 4-byte words
- Addresses fall within at most 5 segments
- Warp needs 128 bytes; 160 bytes move across the bus on misses
- Bus utilization: at least 80%
  - Some misaligned patterns will fall within 4 segments, so 100% utilization
(Diagram: warp addresses vs. memory addresses.)

Caching Load
- All threads in a warp request the same 4-byte word
- Addresses fall within a single cache line
- Warp needs 4 bytes; 128 bytes move across the bus on a miss
- Bus utilization: 4/128 = 3.125%
(Diagram: warp addresses vs. memory addresses.)

Non-caching Load
- All threads in a warp request the same 4-byte word
- Addresses fall within a single segment
- Warp needs 4 bytes; 32 bytes move across the bus on a miss
- Bus utilization: 4/32 = 12.5%
(Diagram: warp addresses vs. memory addresses.)

Caching Load
- Warp requests 32 scattered 4-byte words
- Addresses fall within N cache lines
- Warp needs 128 bytes; N*128 bytes move across the bus on a miss
- Bus utilization: 128 / (N*128), i.e. 1/N
(Diagram: warp addresses vs. memory addresses.)

Non-caching Load
- Warp requests 32 scattered 4-byte words
- Addresses fall within N segments
- Warp needs 128 bytes; N*32 bytes move across the bus on a miss
- Bus utilization: 128 / (N*32), i.e. 4/N
(Diagram: warp addresses vs. memory addresses.)

Impact of Address Alignment
- Warps should access aligned regions for maximum memory throughput
- L1 can help for misaligned loads if several warps are accessing a contiguous region
- ECC further reduces misaligned store throughput significantly
- Experiment: copy 16MB of floats, 256 threads/block
- Greatest throughput drop: CA (caching) loads: 15%; CG (non-caching) loads: 32%
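A sketch of the kind of copy kernel such an offset experiment typically uses; the webinar does not show its code, so the parameterization below is an assumption:

    // Copy with a configurable element offset. When `offset` is not a multiple of
    // 32 elements (128 bytes), each warp's request straddles an extra cache line
    // or segment, and measured throughput drops as described above.
    __global__ void copyWithOffset(float *out, const float *in, int n, int offset)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
        if (i < n)
            out[i] = in[i];
    }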

Using On-Chip Memory

Shared Memory
Uses:
- Improve global memory access patterns (see the transpose sketch below)
- Cache data to reduce redundant global memory accesses
- Inter-thread communication within a block
Performance:
- Program configurable: 16KB shared / 48KB L1 OR 48KB shared / 16KB L1
- Very low latency
- Very high throughput: 1+ TB/s aggregate
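As an illustration of the first use, a standard tiled matrix transpose stages data through shared memory so that both the global load and the global store are coalesced. This is a conventional example rather than code from the webinar; the 32x32 tile and the +1 padding column are common choices:

    #define TILE 32

    // Tiled transpose: read a 32x32 tile with coalesced loads, reorder it in
    // shared memory, then write it out with coalesced stores.
    // Launch with dim3 block(TILE, TILE) and a grid covering the matrix.
    __global__ void transposeTiled(float *out, const float *in, int width, int height)
    {
        __shared__ float tile[TILE][TILE + 1];   // +1 avoids shared-memory bank conflicts

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];

        __syncthreads();

        // Write the transposed tile; the block's role in x and y is swapped.
        x = blockIdx.y * TILE + threadIdx.x;
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y];
    }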

Additional Memories: Texture and Constant
- Read-only
- Data resides in global memory
- Read through different caches
- Additional fast on-chip memory
- Avoids polluting L1
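A minimal sketch of reading through the constant cache; the coefficient table, its size, and the kernel are made up for illustration:

    // A small, uniformly accessed table placed in constant memory. When every
    // thread in a warp reads the same entry, the constant cache broadcasts the
    // value without consuming L1 space or DRAM bandwidth.
    __constant__ float c_coeffs[16];

    __global__ void scaleBy(float *data, int n, int which)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= c_coeffs[which];   // same address for every thread in the warp
    }

    // Host side, before launching:
    //   cudaMemcpyToSymbol(c_coeffs, h_coeffs, 16 * sizeof(float));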

Summary

Summary
- Strive for perfect coalescing
  - Align the starting address (may require padding)
  - A warp should access within a contiguous region
- Have enough concurrent accesses to saturate the bus
  - Process several elements per thread
    - Multiple loads get pipelined
    - Indexing calculations can often be reused
  - Launch enough threads to maximize throughput
    - Latency is hidden by switching threads (warps)
- Try L1 and caching configurations to see which one works best
  - Caching vs non-caching loads (compiler option)
  - 16KB vs 48KB L1 (CUDA call; see the sketch below)
  - Sometimes using shared memory or the texture / constant cache is the best choice
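The "16KB vs 48KB L1 (CUDA call)" item refers to cudaFuncSetCacheConfig (or cudaDeviceSetCacheConfig for a device-wide preference). A minimal sketch, reusing the increment kernel from the earlier example; error checking is omitted:

    // Ask for the 48KB L1 / 16KB shared split for a kernel that uses little
    // shared memory; cudaFuncCachePreferShared requests the opposite split.
    void launchWithLargeL1(float *d_a, int n)
    {
        cudaFuncSetCacheConfig(increment, cudaFuncCachePreferL1);
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        increment<<<blocks, threads>>>(d_a, n);
    }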

Further Reading
GTC 2010 talks:
- "Fundamental Performance Optimizations for GPUs" (Paulius Micikevicius)
- "Analysis-Driven Optimization" (Paulius Micikevicius)
- "Better Performance at Lower Occupancy" (Vasily Volkov)
http://www.gputechconf.com/page/gtc-on-demand.html

Questions?