
Advanced CUDA Optimization 1. Introduction Thomas Bradley

Agenda: CUDA Review (architecture, programming and memory models, programming environment, execution); Performance (optimization guidelines); Productivity (resources)

CUDA Review REVIEW OF CUDA ARCHITECTURE

Processing Flow (PCI Bus): 1. Copy input data from CPU memory to GPU memory. 2. Load the GPU program and execute, caching data on chip for performance. 3. Copy results from GPU memory back to CPU memory.
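The three-step processing flow maps directly onto host-side CUDA C. A minimal sketch, assuming a simple element-wise kernel (the `increment` kernel and sizes here are illustrative, not from the slides):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void increment(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h_data[1024];
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc(&d_data, bytes);

    // 1. Copy input data from CPU memory to GPU memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // 2. Load GPU program and execute, caching data on chip
    increment<<<n / 256, 256>>>(d_data, n);

    // 3. Copy results from GPU memory back to CPU memory
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    printf("h_data[0] = %f\n", h_data[0]);
    return 0;
}
```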

CUDA Parallel Computing Architecture Parallel computing architecture and programming model Includes a CUDA C compiler, support for OpenCL and DirectCompute Architected to natively support multiple computational interfaces (standard languages and APIs)

CUDA Parallel Computing Architecture CUDA defines: Programming model Memory model Execution model CUDA uses the GPU, but is for general-purpose computing Facilitate heterogeneous computing: CPU + GPU CUDA is scalable Scale to run on 100s of cores/1000s of parallel threads

CUDA Review PROGRAMMING MODEL

CUDA Kernels: The parallel portion of the application executes as a kernel; the entire GPU executes the kernel with many threads. CUDA threads are lightweight, switch fast, and 1000s execute simultaneously. The CPU (host) executes functions; the GPU (device) executes kernels.

CUDA Kernels: Parallel Threads. A kernel is a function executed on the GPU by an array of threads, in parallel. All threads execute the same code but can take different paths: float x = input[threadID]; float y = func(x); output[threadID] = y; Each thread has an ID, used to select input/output data and make control decisions.
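The fragment on this slide becomes a complete kernel once the thread ID is derived from the built-in index variables. A sketch, where `func` stands in for any per-element device function (the doubling here is an arbitrary example):

```cuda
__device__ float func(float x)
{
    return 2.0f * x;   // placeholder for any per-element computation
}

__global__ void apply(const float *input, float *output, int n)
{
    // Each thread computes its own global ID and handles one element
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadID < n) {
        float x = input[threadID];
        float y = func(x);
        output[threadID] = y;
    }
}
```

All threads run the same code; the `if (threadID < n)` guard is the "different paths" the slide mentions, protecting against out-of-range threads in the last block.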

CUDA Kernels: Subdivide into Blocks. Threads are grouped into blocks; blocks are grouped into a grid; a kernel is executed on the GPU as a grid of blocks of threads.

Communication Within a Block. Threads may need to cooperate: on memory accesses and to share results. They cooperate using shared memory, which is accessible by all threads within a block. Restricting communication to within a block is what permits scalability: fast communication between N threads is not feasible when N is large.
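As an illustration of block-level cooperation, a sketch of a per-block partial sum staged through shared memory with __syncthreads() barriers (the kernel name and the assumption that blockDim.x equals BLOCK_SIZE are ours, not the slides'):

```cuda
#define BLOCK_SIZE 256   // assumed: kernel launched with blockDim.x == BLOCK_SIZE

__global__ void block_sum(const float *input, float *partial, int n)
{
    __shared__ float cache[BLOCK_SIZE];   // visible to all threads in this block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (i < n) ? input[i] : 0.0f;
    __syncthreads();                      // all loads complete before sharing

    // Tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();                  // round complete before the next
    }

    if (threadIdx.x == 0)
        partial[blockIdx.x] = cache[0];   // one result per block
}
```

Note there is no equivalent barrier across blocks: combining the per-block partials requires a second kernel launch or atomics, which is exactly the scalability restriction the slide describes.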

Transparent Scalability. The same grid of blocks runs unchanged on GPUs of different sizes: in the diagrams, a grid of 12 blocks is scheduled two at a time on G84, eight at a time on G80, and all twelve concurrently on GT200 (leaving some multiprocessors idle).

CUDA Programming Model - Summary. A kernel executes as a grid of thread blocks; grids and blocks may be 1D or 2D (the diagram shows Kernel 1 with a 1D grid and Kernel 2 with a 2D grid). A block is a batch of threads that communicate through shared memory. Each block has a block ID, and each thread has a thread ID.

CUDA Review MEMORY MODEL

Memory hierarchy. Per thread: registers and local memory. Per block of threads: shared memory. All blocks: global memory.
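A sketch annotating where each level of the hierarchy (plus the constant memory introduced on the next slide) appears in a kernel; the variable names and sizes are illustrative:

```cuda
__constant__ float coeff[16];          // constant memory: read-only, cached

__global__ void hierarchy_demo(float *out)   // global memory: all blocks
{
    __shared__ float tile[256];        // shared memory: one copy per block

    float r = threadIdx.x * 2.0f;      // register: private to this thread

    float big[64];                     // large per-thread arrays may be
                                       // spilled to off-chip local memory
    big[threadIdx.x % 64] = r;

    tile[threadIdx.x] = r + coeff[0];
    __syncthreads();

    out[blockIdx.x * blockDim.x + threadIdx.x] =
        tile[threadIdx.x] + big[threadIdx.x % 64];
}
```

The compiler, not the programmer, decides between registers and local memory; only __shared__, __constant__, and global allocations are explicit.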

Additional Memories Host can also allocate textures and arrays of constants Textures and constants have dedicated caches

CUDA Review PROGRAMMING ENVIRONMENT

CUDA C and OpenCL. OpenCL is the entry point for developers who want a low-level API; CUDA C is the entry point for developers who prefer high-level C. Both share the same back-end compiler and optimization technology.

Visual Studio. Separate file types: .c/.cpp for host code, .cu for device/mixed code. Compilation rules: cuda.rules. Syntax highlighting and Intellisense. Integrated debugger and profiler: Nexus.

NVIDIA Nexus IDE. The industry's first IDE for massively parallel applications. Accelerates co-processing (CPU + GPU) application development. Complete Visual Studio-integrated development environment.

Linux. Separate file types: .c/.cpp for host code, .cu for device/mixed code. Typically makefile driven. cuda-gdb for debugging; CUDA Visual Profiler for profiling.

Performance OPTIMIZATION GUIDELINES

Optimize Algorithms for GPU. Algorithm selection: understand the problem and consider alternative algorithms. Maximize independent parallelism. Maximize arithmetic intensity (math/bandwidth). Recompute? The GPU allocates transistors to arithmetic, not memory, so it is sometimes better to recompute than to cache. Serial computation on GPU? Even low-parallelism computation may be faster on the GPU than the cost of copying the data to and from the host.

Optimize Memory Access. Coalesce global memory accesses to maximize DRAM efficiency: an order-of-magnitude impact on performance. Avoid serialization: minimize shared memory bank conflicts. Understand constant cache semantics. Understand spatial locality: optimize use of textures to ensure spatial locality.
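A sketch of the coalescing point: consecutive threads should touch consecutive addresses. Both kernels below copy data, but the strided version (kernel names and the stride parameter are illustrative) scatters each warp's loads across DRAM and can be dramatically slower:

```cuda
// Coalesced: thread i reads element i, so a warp's 32 loads fall in
// adjacent addresses and combine into a few wide DRAM transactions.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i * stride, so a warp's loads are
// scattered and each may require its own memory transaction.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}
```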

Exploit Shared Memory Hundreds of times faster than global memory Inter-thread cooperation via shared memory and synchronization Cache data that is reused by multiple threads Stage loads/stores to allow reordering Avoid non-coalesced global memory accesses

Use Resources Efficiently Partition the computation to keep multiprocessors busy Many threads, many thread blocks Multiple GPUs Monitor per-multiprocessor resource utilization Registers and shared memory Low utilization per thread block permits multiple active blocks per multiprocessor Overlap computation with I/O Use asynchronous memory transfers
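Overlapping computation with I/O can be sketched with streams and cudaMemcpyAsync. The chunking scheme, buffer names, and the empty `process` kernel below are illustrative assumptions; the requirement that host buffers be page-locked (cudaMallocHost / cudaHostAlloc) for true overlap is real:

```cuda
__global__ void process(const float *in, float *out)
{
    // placeholder kernel; real work goes here
}

// Pipeline host<->device copies with kernel execution across two streams.
// Assumes h_in/h_out are page-locked and chunkElems divides the total evenly.
void run_pipelined(float *h_in, float *h_out, float *d_in, float *d_out,
                   int nChunks, int chunkElems)
{
    size_t chunkBytes = chunkElems * sizeof(float);
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&stream[s]);

    for (int c = 0; c < nChunks; ++c) {
        int s = c % 2;
        size_t off = (size_t)c * chunkElems;
        // Copy chunk c in, process it, copy it out -- all in stream s, so
        // one chunk's transfers can overlap another chunk's kernel.
        cudaMemcpyAsync(d_in + off, h_in + off, chunkBytes,
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<chunkElems / 256, 256, 0, stream[s]>>>(d_in + off, d_out + off);
        cudaMemcpyAsync(h_out + off, d_out + off, chunkBytes,
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();   // wait for all streams to drain

    for (int s = 0; s < 2; ++s) cudaStreamDestroy(stream[s]);
}
```

Operations within one stream run in order; operations in different streams may overlap, which is what keeps the multiprocessors busy while the PCI bus is transferring the next chunk.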

Productivity RESOURCES

Getting Started. CUDA Zone: www.nvidia.com/cuda. Introductory tutorials/webinars and forums. Documentation: Programming Guide, Best Practices Guide. Examples: CUDA SDK.

Libraries. NVIDIA: cublas (dense linear algebra, a subset of the full BLAS suite); cufft (1D/2D/3D, real and complex). Third party: NAG numeric libraries (e.g. RNGs); culapack/magma. Open source: Thrust (STL/Boost-style template library); cudpp (data-parallel primitives, e.g. scan, sort, and reduction); CUSP (sparse linear algebra and graph computation). Many more...