Programming in CUDA. Malik M Khan


Programming in CUDA. October 21, 2010. Malik M Khan

Outline
- Reminder of CUDA architecture
- Execution model
  - Brief mention of control flow
- Heterogeneous memory hierarchy
  - Locality through data placement
  - Maximizing bandwidth through global memory coalescing
  - Avoiding memory bank conflicts
- Tiling and its applicability to CUDA code generation
  - Example: matrix-vector multiplication
This lecture includes slides provided by Wen-mei Hwu (UIUC) and David Kirk (NVIDIA), see http://courses.ece.uiuc.edu/ece498/al1/, and by Austin Robison (NVIDIA).

Reading
- David Kirk and Wen-mei Hwu manuscript (in progress): http://www.toodoc.com/cuda-textbook-by-david-kirkfrom-nvidia-and-prof-wen-mei-hwu-pdf.html
- CUDA 2.x Manual, particularly Chapters 2 and 4 (download from nvidia.com/cudazone)
- Nice series from Dr. Dobb's Journal by Rob Farber: http://www.ddj.com/cpp/207200659

CUDA Programming Model: A Highly Multithreaded Coprocessor
The GPU is viewed as a compute device that:
- Is a coprocessor to the CPU, or host
- Has its own DRAM (device memory)
- Runs many threads in parallel
Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads.
Differences between GPU and CPU threads:
- GPU threads are extremely lightweight, with very little creation overhead
- The GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few
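To make the kernel idea concrete, here is a minimal sketch (not from the lecture) of a data-parallel vector addition; the names vecAdd, n, d_a, d_b, and d_c are illustrative assumptions.

// Minimal sketch of a data-parallel kernel: each lightweight GPU thread
// handles one element of the output.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against overshoot
        c[i] = a[i] + b[i];
}

// Host side: launch enough 256-thread blocks to cover n elements.
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);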

Batching: Grids and Blocks
A kernel is executed as a grid of thread blocks.
- All threads share data memory space
A thread block is a batch of threads that can cooperate with each other by:
- Synchronizing their execution, for hazard-free shared memory accesses
- Efficiently sharing data through a low-latency shared memory
Two threads from two different blocks cannot cooperate.
[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2D array of blocks, and each block is an array of threads.]
David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign. Courtesy: NVIDIA

Block and Thread IDs
Threads and blocks have IDs, so each thread can decide what data to work on.
- Block ID: 1D or 2D (blockIdx.x, blockIdx.y)
- Thread ID: 1D, 2D, or 3D (threadIdx.{x,y,z})
This simplifies memory addressing when processing multidimensional data:
- Image processing
- Solving PDEs on volumes
[Figure: a grid of blocks, with one block expanded into its 2D array of threads.]
Courtesy: NVIDIA. David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign
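As an assumed illustration (not from the slides), a 2D kernel can combine block and thread IDs to address one pixel of an image; img, width, and height are hypothetical names.

// Illustrative sketch: mapping 2D block/thread IDs to pixel coordinates.
__global__ void invert(unsigned char* img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x];
}

// Launch with a 2D grid of 16x16 blocks covering the image:
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// invert<<<grid, block>>>(d_img, width, height);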

Hardware Implementation: A Set of SIMD Multiprocessors
A device has a set of multiprocessors.
Each multiprocessor is a set of 32-bit processors with a Single Instruction Multiple Data architecture:
- Shared instruction unit
At each clock cycle, a multiprocessor executes the same instruction on a group of threads called a warp.
The number of threads in a warp is the warp size.
[Figure: a device containing Multiprocessors 1..N, each with Processors 1..M sharing one instruction unit.]
David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

Hardware Execution Model
I. SIMD execution of warpsize = M threads (from a single block)
- Result is a set of instruction streams roughly equal to the number of threads in the block divided by the warp size
II. Multithreaded execution across different instruction streams within a block
- Also possibly across different blocks if there are more blocks than SMs
III. Each block is mapped to a single SM
- No direct interaction across SMs
[Figure: device with Multiprocessors 1..N; each multiprocessor has Processors 1..M with per-processor registers, shared memory, an instruction unit, constant and texture caches, and access to device memory.]

A Very Simple Execution Model
No branch prediction:
- Just evaluate branch targets and wait for resolution
- But the wait is only a small number of cycles
No speculation:
- Only execute useful instructions

Terminology
Divergent paths:
- Different threads within a warp take different control flow paths within a kernel function
- N divergent paths in a warp? An N-way divergent warp is serially issued over the N different paths, using a hardware stack and per-thread predication logic to write back results only from the threads taking each divergent path.
- Performance decreases by about a factor of N
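As a hedged illustration (not from the slides), the kernel below branches on the thread index, which diverges within a warp, while a branch on the block index is uniform and incurs no serialization; all names are hypothetical.

// Illustrative sketch of warp divergence.
__global__ void divergenceDemo(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergent: threads within the same warp take different paths,
    // so the warp is issued serially over both paths.
    if (threadIdx.x % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;

    // Not divergent: the condition is uniform across every thread of a warp
    // (all threads of a warp belong to the same block).
    if (blockIdx.x == 0)
        data[i] -= 0.5f;
}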

Hardware Implementation: Memory Architecture
The local, global, constant, and texture spaces are regions of device memory.
Each multiprocessor has:
- A set of 32-bit registers per processor
- On-chip shared memory, where the shared memory space resides
- A read-only constant cache, to speed up access to the constant memory space
- A read-only texture cache, to speed up access to the texture memory space
[Figure: device with Multiprocessors 1..N; each has per-processor registers, shared memory, an instruction unit, and constant and texture caches, backed by device memory holding the global, constant, and texture spaces.]
David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

Programmer's View: Memory Spaces
Each thread can:
- Read/write per-thread registers
- Read/write per-thread local memory
- Read/write per-block shared memory
- Read/write per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory
The host can read/write global, constant, and texture memory.
[Figure: a grid of blocks, each with per-block shared memory and per-thread registers and local memory, all connected to the global, constant, and texture memories accessible from the host.]
David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

Now Let's Look at Shared Memory
Common programming pattern (Section 5.1.2 of the CUDA manual):
- Load data from global memory into shared memory
- Synchronize (if necessary)
- Operate on the data in shared memory
- Synchronize (if necessary)
- Write intermediate results back to global memory
- Repeat until done
[Figure: data staged between global memory and shared memory.]
Familiar concept?
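A minimal sketch of this load/sync/operate/store pattern, assuming a kernel that reverses each block's 256-element tile; the kernel name and tile size are illustrative, not from the lecture.

// Illustrative sketch of the shared-memory pattern.
// Assumes blockDim.x == TILE and an array length that is a multiple of TILE.
#define TILE 256

__global__ void reverseTiles(float* d_out, const float* d_in) {
    __shared__ float tile[TILE];                 // per-block shared memory
    int gid = blockIdx.x * TILE + threadIdx.x;

    tile[threadIdx.x] = d_in[gid];               // 1. load into shared memory
    __syncthreads();                             // 2. synchronize

    float v = tile[TILE - 1 - threadIdx.x];      // 3. operate on shared data
    __syncthreads();                             // 4. synchronize (needed only if the tile were reused)

    d_out[gid] = v;                              // 5. write results to global memory
}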

Mechanics of Using Shared Memory
- The __shared__ type qualifier is required
- Must be allocated from a global/device function, or as "extern"
Examples:

/* Dynamic allocation: size supplied at launch time.   */
/* MEMSIZE is the size of the per-block shared memory. */
extern __shared__ float d_s_array[];

__host__ void outerCompute() {
  compute<<<gs, bs, MEMSIZE>>>();
}

__global__ void compute() {
  d_s_array[i] = ...;
}

/* Static allocation: size fixed at compile time. */
__global__ void compute2() {
  __shared__ float d_s_array[M];
  d_s_array[j] = ...;             /* create or copy from global memory */
  d_g_array[j] = d_s_array[j];    /* write result back to global memory */
}

Bandwidth to Shared Memory: Parallel Memory Accesses
Consider each thread accessing a different location in shared memory.
Bandwidth is maximized if each one is able to proceed in parallel.
Hardware to support this:
- Banked memory: each bank can support an access on every memory cycle

Bank Addressing Examples
No bank conflicts:
- Linear addressing, stride == 1
- Random 1:1 permutation
[Figure: 16 threads mapped one-to-one onto banks 0..15 for both the stride-1 and the permutation case.]
David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

Bank Addressing Examples
Bank conflicts:
- 2-way bank conflicts: linear addressing, stride == 2
- 8-way bank conflicts: linear addressing, stride == 8
[Figure: bank mapping for the stride-2 and stride-8 access patterns.]
David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

How Addresses Map to Banks on G80
- Each bank has a bandwidth of 32 bits per clock cycle
- Successive 32-bit words are assigned to successive banks
- G80 has 16 banks, so bank = address % 16
- 16 is also the size of a half-warp: no bank conflicts between different half-warps, only within a single half-warp
David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

Shared Memory Bank Conflicts
Shared memory is as fast as registers if there are no bank conflicts.
The fast case:
- If all threads of a half-warp access different banks, there is no bank conflict
- If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
The slow case:
- Bank conflict: multiple threads in the same half-warp access the same bank
- The accesses must be serialized
- Cost = max # of simultaneous accesses to a single bank
David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
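As an illustrative sketch (not part of the lecture), padding a shared-memory tile by one column is a common way to avoid the stride-induced conflicts described above; the kernel, the 16x16 tile size, and the assumption that width is a multiple of TILE are all hypothetical.

// Illustrative sketch: avoiding bank conflicts with a padded shared-memory tile.
// Launch with dim3 block(TILE, TILE) and a square matrix whose width is a multiple of TILE.
#define TILE 16

__global__ void transposeTile(float* out, const float* in, int width) {
    // The extra "+ 1" column shifts each row by one bank, so column-wise
    // accesses by a half-warp fall in distinct banks (no 16-way conflict).
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced load
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y]; // conflict-free read
}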

Global Memory Accesses
- Each thread issues memory accesses to data types of varying sizes, perhaps as small as 1-byte entities
- Given an address to load or store, memory returns/updates segments of either 32 bytes, 64 bytes, or 128 bytes
- Maximizing bandwidth: operate on an entire 128-byte segment for each memory transfer

Understanding Global Memory Accesses
Memory protocol for compute capability 1.2* (CUDA Manual 5.1.2.1):
- Start with the memory request by the smallest-numbered thread. Find the memory segment that contains the address (32-, 64-, or 128-byte segment, depending on data type)
- Find other active threads requesting addresses within that segment and coalesce
- Reduce the transaction size if possible
- Access memory and mark the serviced threads as inactive
- Repeat until all threads in the half-warp are serviced
*Includes Tesla and GTX platforms
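A hedged sketch (assumed, not from the slides) of what this means for kernel code: consecutive threads should touch consecutive addresses so that a half-warp's requests fall within one segment.

// Illustrative sketch: coalesced vs. strided global memory access.
__global__ void scale(float* data, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced: thread k of a half-warp reads element base + k, so the 16
    // requests fall in one contiguous segment and coalesce into one transaction.
    if (i < n)
        data[i] *= s;
}

__global__ void scaleStrided(float* data, int n, float s) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 16;

    // Strided: consecutive threads touch addresses 64 bytes apart, so the
    // accesses of a half-warp span many segments and cannot be coalesced.
    if (i < n)
        data[i] *= s;
}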

Memory Layout of a Matrix in C
[Figure, shown twice: a 4x4 matrix M stored in row-major order, contrasting an access direction in kernel code where threads T1..T4 touch adjacent memory locations in each time period with one where the threads' addresses are a full row apart.]
David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
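To connect this to code, here is an assumed sketch (not from the lecture): with row-major storage, letting the thread index select the column gives adjacent threads adjacent addresses, while letting it select the row does not. The kernel and parameter names are hypothetical.

// Illustrative sketch: indexing a row-major matrix M[row * width + col].
__global__ void addOneCoalesced(float* M, int width, int height, int row) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // thread index -> column
    if (col < width)
        M[row * width + col] += 1.0f;   // adjacent threads, adjacent addresses
}

__global__ void addOneUncoalesced(float* M, int width, int height, int col) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // thread index -> row
    if (row < height)
        M[row * width + col] += 1.0f;   // adjacent threads, addresses one row apart
}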

Using Loop Transformations in GPU Code Generation
Code transformations, and how they apply on conventional architectures vs. GPUs:
- Tiling: manage reuse in limited storage (conventional); manage reuse in limited storage and partition parallel execution at 2 levels (GPU)
- Data-copy: eliminate conflict misses in cache (conventional); copy data to specialized memory structures (GPU)
- Permutation: reorder loop structure to enable other optimizations (both)
- Unrolling: expose fine-grain parallelism by replicating the loop body (both)

Tiling for Computation Mapping in GPUs
[Figure: the I-J iteration space partitioned into tiles, with tiles mapped to thread blocks.]

Matrix-Vector Multiply: A Naïve Strategy
N = 1024
Sequential code*:

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    a[i] = a[i] + c[j][i] * b[j];

CUDA kernel:

__global__ void GPU_MV(float* a, float* b, float** c) {
  int i = blockIdx.x * 512 + threadIdx.x;
  for (int j = 0; j < N; j++)
    a[i] = a[i] + c[j][i] * b[j];
}

Host code:

// ...copy data to device...
dim3 dimGrid(N/512, 1);
dim3 dimBlock(512, 1);
GPU_MV<<<dimGrid, dimBlock>>>(gpuA, gpuB, gpuC);
// ...copy data back to host...

Performance: < 3.75 GFLOPS, below CUBLAS 2.2.
* Following CUBLAS notation for matrix-vector multiplication, which takes the transposed matrix as input.

Decision Algorithm: Computational Decomposition
Block parallelism:
  tile_by_index({"i"}, {TI}, {l1_control="ii"}, {"ii","i","j"})
  cudaize(block{"ii"}, thread{})
Thread parallelism:
  cudaize(block{"ii"}, thread{"i"})

Decision Algorithm: Data Staging
Data staging:
- Shared memory: data reuse across threads
  tile_by_index({"j"}, {TJ}, {l1_control="jj"}, {"ii","jj","i","j"})
- Registers: data reuse inside a thread
- Final loop order
Cudaize and unrolling:
  cudaize(block{"ii"}, thread{"i"})
  unroll_to_depth(1)

CUDA-CHiLL Recipe
N = 1024, TI = TJ = 32

tile_by_index({"i","j"}, {TI,TJ}, {l1_control="ii", l2_control="jj"}, {"ii","jj","i","j"})
normalize_index("i")
cudaize("mv_gpu", {a=N, b=N, c=N*N}, {block={"ii"}, thread={"i"}})
copy_to_shared("tx", "b", 1)
copy_to_registers("jj", "a")
unroll_to_depth(1)

Matrix-Vector Multiply: GPU Code
Generated code, with computational decomposition only:

__global__ void GPU_MV(float* a, float* b, float** c) {
  int bx = blockIdx.x;
  int tx = threadIdx.x;
  int i = 32 * bx + tx;
  for (j = 0; j < N; j++)
    a[i] = a[i] + c[j][i] * b[j];
}

Final generated MV code, with data staged in shared memory and registers:

__global__ void GPU_MV(float* a, float* b, float** c) {
  int bx = blockIdx.x;
  int tx = threadIdx.x;
  __shared__ float bcpy[32];
  float acpy = a[tx + 32 * bx];
  for (jj = 0; jj < 32; jj++) {
    bcpy[tx] = b[32 * jj + tx];
    __syncthreads();
    // this inner loop is actually fully unrolled
    for (j = 32 * jj; j < 32 * jj + 32; j++)
      acpy = acpy + c[j][32 * bx + tx] * bcpy[j - 32 * jj];
    __syncthreads();
  }
  a[tx + 32 * bx] = acpy;
}

Performance always tops CUBLAS 2.2, roughly a 4x improvement over the naïve GPU kernel.

BLAS Library Kernel Optimization
[Charts: matrix-vector (MV) and matrix-matrix (MM) multiply results.]

Memory Copy and Stencil Optimization
[Charts: 2D convolution and matrix transpose results.]

Summary of Lecture
A deeper probe of performance issues:
- Execution model
- Control flow
- Heterogeneous memory hierarchy
  - Locality and bandwidth
- Tiling for CUDA code generation
  - An example of the tiling transformation with matrix-vector multiplication