L14: CUDA, cont. Execution Model and Memory Hierarchy (October 27, 2011)

Programming Assignment 3, Due 11:59PM Nov. 7

Purpose:
- Synthesize the concepts you have learned so far
- Data parallelism, locality and task parallelism
- Image processing computation adapted from a real application

Turn in using the handin program on the CADE machines
- handin cs4961 proj3 <file>
- Include code + README

Three Parts:
1. Locality optimization (50%): Improve performance using locality optimizations only (no parallelism)
2. Data parallelism (20%): Improve performance using locality optimizations plus data parallel constructs in OpenMP
3. Task parallelism (30%): Code will not be faster by adding task parallelism, but you can improve time to first result. Use task parallelism in conjunction with data parallelism, per my message from the previous assignment.

Here's the code:

    /* initialize th[i][j] = 0 */

    /* compute array convolution */
    for (m = 0; m < IMAGE_NROWS - TEMPLATE_NROWS + 1; m++) {
      for (n = 0; n < IMAGE_NCOLS - TEMPLATE_NCOLS + 1; n++) {
        for (i = 0; i < TEMPLATE_NROWS; i++) {
          for (j = 0; j < TEMPLATE_NCOLS; j++) {
            if (mask[i][j] != 0) {
              th[m][n] += image[i+m][j+n];
            }
          }
        }
      }
    }

    /* scale array with bright count and template bias */
    th[i][j] = th[i][j] * bc - bias;

Things to think about:
- Beyond the assigned work, how does parallelization affect the profitability of locality optimizations?
- What happens if you make the image size larger (1024x1024 or even 2048x2048)? You'll need to use unlimit stacksize to run these.
- What happens if you repeat this experiment on a different architecture with a different memory hierarchy (in particular, a smaller cache)?
- How do SSE or other multimedia extensions affect performance and optimization selection?
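For Part 2, a minimal sketch of one possible OpenMP data-parallel version of the convolution loop is shown below. The array types, sizes, and the placement of the pragma are illustrative assumptions, not the required solution; the key observation is that every (m, n) output element is independent.

    /* Hedged sketch for Part 2: OpenMP data parallelism over the convolution.
       Array types and sizes are assumptions for illustration only. */
    #include <omp.h>

    #define IMAGE_NROWS    512
    #define IMAGE_NCOLS    512
    #define TEMPLATE_NROWS 16
    #define TEMPLATE_NCOLS 16

    void convolve(float th[IMAGE_NROWS][IMAGE_NCOLS],
                  const float image[IMAGE_NROWS][IMAGE_NCOLS],
                  const int mask[TEMPLATE_NROWS][TEMPLATE_NCOLS])
    {
        /* Each (m, n) output element is independent, so the two outer loops
           can be distributed across threads. */
        #pragma omp parallel for collapse(2)
        for (int m = 0; m < IMAGE_NROWS - TEMPLATE_NROWS + 1; m++) {
            for (int n = 0; n < IMAGE_NCOLS - TEMPLATE_NCOLS + 1; n++) {
                float sum = 0.0f;
                for (int i = 0; i < TEMPLATE_NROWS; i++) {
                    for (int j = 0; j < TEMPLATE_NCOLS; j++) {
                        if (mask[i][j] != 0)
                            sum += image[i + m][j + n];
                    }
                }
                th[m][n] = sum;
            }
        }
    }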

Reminder About Final Project

Purpose:
- A chance to dig in deeper into a parallel programming model and explore concepts
- Research experience
- Freedom to pick your own problem and solution, get feedback
- Thrill of victory, agony of defeat
- Communication skills
- Present results to communicate technical ideas

Write a non-trivial parallel program that combines two parallel programming languages/models. In some cases, just do two separate implementations.
- OpenMP + SSE
- OpenMP + CUDA (but need to do this in separate parts of the code)
- MPI + OpenMP
- MPI + SSE
- MPI + CUDA
Present results in a poster session on the last day of class.

Example Projects

Look in the textbook or look on-line:
- Chapter 6: N-body, Tree search
- Chapters 3 and 5: Sorting
- Image and signal processing algorithms
- Graphics algorithms
- Stencil computations
- FFT
- Graph algorithms
- Other domains
Must change it up in some way from the text, e.g., a different language/strategy.

Details and Schedule

2-3 person projects
- Let me know if you need help finding a team
Ok to combine with a project for another class, but expectations will be higher and the professors will discuss.
Each group must talk to me about their project between now and November 10
- Before/after class, during office hours or by appointment
- Bring a written description of the project; slides are fine
- Must include your plan and how the work will be shared across the team
I must sign off by November 22 (in writing).
Dry run on December 6.
Poster presentation on December 8.

Strategy

A lesson in research:
- Big vision is great, but make sure you have an evolutionary plan where success comes in stages
- Sometimes even the best students get too ambitious and struggle
- Parallel programming is hard
- Some of you will pick problems that don't speed up well and we'll need to figure out what to do
- There are many opportunities to recover if you have problems
- I'll check in with you a few times and redirect if needed
- Feel free to ask for help
- An optional final report can boost your grade, particularly if things are not working on the last day of classes

Shared Memory Architecture 2: Sun UltraSPARC T2 Niagara (water)

[Block diagram: processor cores connected through a full cross-bar interconnect to the memory controllers]

More on Niagara:
- Target applications are server-class, business operations
- Characterization: Floating point? Array-based computation?
- Support for the VIS 2.0 SIMD instruction set
- 64-way multithreading (8-way per processor, 8 processors)

The 2nd fastest computer in the world
- What is its name? Tianhe-1A
- Where is it located? National Supercomputer Center in Tianjin
- How many processors does it have? ~21,000 processor chips (14,000 CPUs and 7,000 Tesla Fermis)
- What kind of processors? CPUs plus NVIDIA Fermi GPUs
- How fast is it? 2.507 Petaflop/second, i.e., about 2.5 quadrillion (2.5 x 10^15) operations per second
See http://www.top500.org

Outline
- Reminder of CUDA Architecture
- Execution Model
  - Brief mention of control flow
- Heterogeneous Memory Hierarchy
  - Locality through data placement
  - Maximizing bandwidth through global memory coalescing
  - Avoiding memory bank conflicts
- Tiling and its Applicability to CUDA Code Generation
This lecture includes slides provided by Wen-mei Hwu (UIUC) and David Kirk (NVIDIA), see http://courses.ece.uiuc.edu/ece498/al1/, and Austin Robison (NVIDIA).

Reading
- David Kirk and Wen-mei Hwu manuscript (in progress)
  - http://www.toodoc.com/cuda-textbook-by-david-kirkfrom-nvidia-and-prof-wen-mei-hwu-pdf.html
- CUDA Manual, particularly Chapters 2 and 4 (download from nvidia.com/cudazone)
- Nice series from Dr. Dobb's Journal by Rob Farber (link on class website)
  - http://www.ddj.com/cpp/207200659

Hardware Implementation: A Set of SIMD Multiprocessors
- A device has a set of multiprocessors
- Each multiprocessor is a set of 32-bit processors with a Single Instruction Multiple Data architecture
  - Shared instruction unit
- At each clock cycle, a multiprocessor executes the same instruction on a group of threads called a warp
- The number of threads in a warp is the warp size

[Figure: a device containing Multiprocessor 1 through Multiprocessor N, each built from Processor 1 through Processor M sharing one instruction unit]

Hardware Execution Model
I.   SIMD execution of warpsize = M threads (from a single block)
     - The result is a set of instruction streams roughly equal to the number of threads in a block divided by the warp size
II.  Multithreaded execution across different instruction streams within a block
     - Also possibly across different blocks if there are more blocks than SMs
III. Each block is mapped to a single SM
     - No direct interaction across SMs

[Figure: device with multiprocessors; each multiprocessor has per-processor registers, shared memory, an instruction unit, constant and texture caches, and access to device memory]

Example SIMD Execution
"Count 6" kernel function:

    d_out[threadIdx.x] = 0;
    for (int i = 0; i < SIZE / BLOCKSIZE; i++) {
      int val = d_in[i * BLOCKSIZE + threadIdx.x];
      d_out[threadIdx.x] += compare(val, 6);
    }
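For reference, a hedged, self-contained reconstruction of the "count 6" example follows. The compare() helper, the BLOCKSIZE and SIZE values, and the kernel signature are assumptions filled in around the slide fragment; the slides only show the loop body.

    /* Hedged reconstruction of the "count 6" kernel. compare(), BLOCKSIZE,
       SIZE and the kernel signature are assumptions; only the loop body
       appears on the slides. */
    #define BLOCKSIZE 64
    #define SIZE      1024

    __device__ int compare(int a, int b)
    {
        return (a == b) ? 1 : 0;   /* assumed semantics: 1 if equal, else 0 */
    }

    __global__ void count6(int *d_out, const int *d_in)
    {
        d_out[threadIdx.x] = 0;
        /* Each thread strides through the input one block-width at a time,
           counting how many of its assigned elements equal 6. */
        for (int i = 0; i < SIZE / BLOCKSIZE; i++) {
            int val = d_in[i * BLOCKSIZE + threadIdx.x];
            d_out[threadIdx.x] += compare(val, 6);
        }
    }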

Example SIMD Execution, step by step
"Count 6" kernel function:

    d_out[threadIdx.x] = 0;
    for (int i = 0; i < SIZE / BLOCKSIZE; i++) {
      int val = d_in[i * BLOCKSIZE + threadIdx.x];
      d_out[threadIdx.x] += compare(val, 6);
    }

Each core initializes data from an address based on its own threadIdx:

    LDC 0, &(dout + threadIdx)

Each core initializes its own R3:

    /* int i = 0; */
    LDC 0, R3

Each core performs the same operations from its own registers, etc.:

    /* i*BLOCKSIZE + threadIdx */
    LDC BLOCKSIZE, R2
    MUL R1, R3, R2
    ADD R4, R1, R0

[Figures: processors P0 through PM-1 each executing the same instruction on their own registers]

SM Warp Scheduling
- SM hardware implements zero-overhead warp scheduling
  - Warps whose next instruction has its operands ready for consumption are eligible for execution
  - Eligible warps are selected for execution on a prioritized scheduling policy
  - All threads in a warp execute the same instruction when selected
- 4 clock cycles are needed to dispatch the same instruction for all threads in a warp in G80
  - If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate 200-cycle memory latency (200 / (4 instructions x 4 cycles) = 12.5, rounded up to 13)

[Figure: SM multithreaded warp scheduler interleaving, over time, warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, ..., warp 8 instruction 12, warp 3 instruction 96]
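A host-side launch of the kernel might look like the hedged sketch below. It assumes the count6 kernel and the BLOCKSIZE/SIZE constants from the sketch above are in the same file, and uses a single block so the kernel's indexing holds.

    /* Hedged host-side sketch (not from the slides): allocate device buffers,
       copy input, launch count6 with one block of BLOCKSIZE threads, and
       reduce the per-thread counts on the host. */
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        int h_in[SIZE], h_out[BLOCKSIZE];
        for (int i = 0; i < SIZE; i++) h_in[i] = i % 10;   /* sample data */

        int *d_in, *d_out;
        cudaMalloc((void **)&d_in,  SIZE * sizeof(int));
        cudaMalloc((void **)&d_out, BLOCKSIZE * sizeof(int));
        cudaMemcpy(d_in, h_in, SIZE * sizeof(int), cudaMemcpyHostToDevice);

        count6<<<1, BLOCKSIZE>>>(d_out, d_in);   /* one block, BLOCKSIZE threads */

        cudaMemcpy(h_out, d_out, BLOCKSIZE * sizeof(int), cudaMemcpyDeviceToHost);

        int total = 0;
        for (int i = 0; i < BLOCKSIZE; i++) total += h_out[i];
        printf("number of 6s: %d\n", total);

        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }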

SIMD Execution of Control Flow

Control flow example:

    if (threadIdx.x >= 2) {
      out[threadIdx.x] += 100;
    }
    else {
      out[threadIdx.x] += 10;
    }

First, every processor evaluates the condition:

    compare threadIdx.x, 2

Then the taken path is issued, possibly predicated using a condition code (CC), so only processors whose threads satisfy the test write results:

    /* possibly predicated using CC */
    (CC) LD  R5, &(out + threadIdx.x)
    (CC) ADD R5, R5, 100
    (CC) ST  R5, &(out + threadIdx.x)

Finally the not-taken path is issued with the complementary predicate:

    /* possibly predicated using CC */
    (not CC) LD  R5, &(out + threadIdx.x)
    (not CC) ADD R5, R5, 10
    (not CC) ST  R5, &(out + threadIdx.x)

[Figures: processors P0 through PM-1 with per-processor registers; in the second and third steps the masked-off processors are marked with an X]

Terminology
- Divergent paths: different threads within a warp take different control flow paths within a kernel function
- N divergent paths in a warp?
  - An N-way divergent warp is serially issued over the N different paths using a hardware stack and per-thread predication logic to only write back results from the threads taking each divergent path
  - Performance decreases by about a factor of N
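As an illustration of the factor-of-N cost (not from the slides), here is a hedged sketch contrasting a branch that diverges inside every warp with one that branches uniformly per warp. Remapping which elements take which path is only valid when the computation allows it; the point is the granularity of the branch.

    /* Hedged illustration (not from the slides): the same amount of work, but
       the first kernel diverges within every warp while the second branches on
       the warp index, so all 32 threads of a warp follow the same path. */
    #define WARP_SIZE 32

    __global__ void divergent(float *out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid % 2 == 0)            /* even/odd threads split inside each warp: 2-way divergence */
            out[tid] += 100.0f;
        else
            out[tid] += 10.0f;
    }

    __global__ void warp_uniform(float *out)
    {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int warp = threadIdx.x / WARP_SIZE;
        if (warp % 2 == 0)           /* whole warp takes one path: no divergence */
            out[tid] += 100.0f;
        else
            out[tid] += 10.0f;
    }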

Hardware Implementation: Memory Architecture
- The local, global, constant, and texture spaces are regions of device memory (DRAM)
- Each multiprocessor has:
  - A set of 32-bit registers per processor
  - On-chip shared memory, where the shared memory space resides
  - A read-only constant cache, to speed up access to the constant memory space
  - A read-only texture cache, to speed up access to the texture memory space
  - A data cache (Fermi only)

[Figure: device with multiprocessors; each multiprocessor has per-processor registers, shared memory, a data cache (Fermi only), an instruction unit, and constant and texture caches, all backed by device memory holding the global, constant and texture memories]

Programmer's View: Memory Spaces
Each thread can:
- Read/write per-thread registers
- Read/write per-thread local memory
- Read/write per-block shared memory
- Read/write per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory
The host can read/write global, constant, and texture memory.

[Figure: host connected to a grid of blocks; each block has shared memory and per-thread registers and local memory, and all blocks share the global, constant and texture memories]

Targets of Memory Hierarchy Optimizations
- Reduce memory latency
  - The latency of a memory access is the time (usually in cycles) between a memory request and its completion
- Maximize memory bandwidth
  - Bandwidth is the amount of useful data that can be retrieved over a time interval
- Manage overhead
  - The cost of performing an optimization (e.g., copying) should be less than the anticipated gain

Global Memory Accesses
- Each thread issues memory accesses to data types of varying sizes, perhaps as small as 1-byte entities
- Given an address to load or store, memory returns/updates segments of either 32 bytes, 64 bytes or 128 bytes
- Maximizing bandwidth:
  - Operate on an entire 128-byte segment for each memory transfer
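A minimal sketch (not from the slides) of how these spaces appear in CUDA C follows; all names and sizes are illustrative, and the block is assumed to have at most 256 threads. Texture memory is read through the texture fetch API rather than a plain variable declaration, so it is omitted here.

    /* Hedged sketch of the memory spaces from the programmer's view.
       All names and sizes are illustrative. */
    __constant__ float c_coeff[16];     /* per-grid constant memory: read-only in kernels, set by the host */
    __device__   float g_table[1024];   /* per-grid global memory declared at file scope */

    __global__ void spaces_demo(float *g_out)   /* g_out: per-grid global memory passed from the host */
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        float r = c_coeff[0];           /* scalar locals normally live in per-thread registers */
        float local_buf[32];            /* large or dynamically indexed per-thread arrays may spill to local memory */

        __shared__ float s_tile[256];   /* per-block shared memory, visible to all threads in the block */
        s_tile[threadIdx.x] = g_table[tid % 1024];
        __syncthreads();

        local_buf[0] = s_tile[threadIdx.x] * r;
        g_out[tid]   = local_buf[0];    /* written back to global memory, readable by the host */
    }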

Understanding Global Memory Accesses
Memory protocol for compute capability 1.2* (CUDA Manual 5.1.2.1):
- Start with the memory request by the smallest numbered thread. Find the memory segment that contains the address (32-, 64- or 128-byte segment, depending on data type)
- Find other active threads requesting addresses within that segment and coalesce
- Reduce the transaction size if possible
- Access memory and mark threads as inactive
- Repeat until all threads in the half-warp are serviced
*Includes Tesla and GTX platforms

Layout of a Matrix in C

[Figures: a 4x4 matrix M and its linear layout in memory (M0,0 M1,0 M2,0 M3,0 M0,1 ...), showing the access direction in kernel code and which elements threads T1-T4 touch in time periods 1 and 2]

Now Let's Look at Shared Memory

Common programming pattern (Section 5.1.2 of the CUDA manual):
- Load data from global memory into shared memory
- Synchronize (if necessary)
- Operate on data in shared memory
- Synchronize (if necessary)
- Write intermediate results to global memory
- Repeat until done
Familiar concept???
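A hedged sketch (not from the slides) of the two access directions for an N x N matrix stored row-contiguously: when consecutive threads read consecutive addresses, a half-warp's requests fall in one segment and coalesce, while a stride of a full row forces separate transactions.

    /* Hedged illustration of coalesced vs. uncoalesced global memory access
       for an N x N matrix stored row-contiguously. Names and N are illustrative. */
    #define N 1024

    /* Coalesced: thread t of a warp reads element (row, t), i.e. consecutive
       addresses row*N + t, so a half-warp's requests fall in one segment. */
    __global__ void read_coalesced(float *out, const float *M)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        float sum = 0.0f;
        for (int row = 0; row < N; row++)
            sum += M[row * N + col];
        out[col] = sum;
    }

    /* Uncoalesced: thread t reads row t, so addresses within a warp are
       N floats apart and each access needs its own memory transaction. */
    __global__ void read_strided(float *out, const float *M)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        float sum = 0.0f;
        for (int col = 0; col < N; col++)
            sum += M[row * N + col];
        out[row] = sum;
    }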

Mechanics of Using Shared Memory
- __shared__ type qualifier required
- Must be allocated from a __global__ or __device__ function, or as extern
- Examples:

    extern __shared__ float d_s_array[];

    /* a form of dynamic allocation */
    /* MEMSIZE is the size of the per-block shared memory */
    __host__ void outerCompute() {
      compute<<<gs, bs, MEMSIZE>>>();
    }
    __global__ void compute() {
      d_s_array[i] = ...;
    }

    __global__ void compute2() {
      __shared__ float d_s_array[M];
      /* create or copy from global memory */
      d_s_array[j] = ...;
      /* write result back to global memory */
      d_g_array[j] = d_s_array[j];
    }

Bandwidth to Shared Memory: Parallel Accesses
- Consider each thread accessing a different location in shared memory
- Bandwidth is maximized if each one is able to proceed in parallel
- Hardware to support this:
  - Banked memory: each bank can support an access on every memory cycle

Bank Addressing Examples
- No bank conflicts: linear addressing, stride == 1
- No bank conflicts: random 1:1 permutation
- 2-way bank conflicts: linear addressing, stride == 2
- 8-way bank conflicts: linear addressing, stride == 8

[Figures: threads 0-15 mapped onto banks 0-15 for each of the four cases above]
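A hedged, compilable sketch (not from the slides) of the stride cases above and of a common padding trick; it assumes the 16-bank, 32-bit-word organization described on the next slide, a 256-thread block for the first kernel, and a 16x16 thread block for the transpose.

    /* Hedged sketch: shared memory access strides and a padding fix.
       Assumes 16 banks of 32-bit words (G80-era organization). */
    __global__ void bank_demo(float *out, const float *in)
    {
        __shared__ float s_data[256];
        int tid = threadIdx.x;                 /* assume blockDim.x == 256 */

        s_data[tid] = in[tid];
        __syncthreads();

        float a = s_data[tid];                 /* stride 1: thread t hits bank t % 16, no conflicts */
        float b = s_data[(2 * tid) % 256];     /* stride 2: threads t and t+8 of a half-warp share a bank, 2-way conflict */
        out[tid] = a + b;
    }

    /* Padding trick: a 16x16 tile padded to 17 columns, so a column read
       (which would otherwise map all 16 threads of a half-warp to one bank)
       spreads across all banks. */
    __global__ void transpose16(float *out, const float *in)
    {
        __shared__ float tile[16][16 + 1];
        int x = threadIdx.x, y = threadIdx.y;  /* assume a 16x16 thread block */

        tile[y][x] = in[y * 16 + x];           /* stride-1 write: conflict-free */
        __syncthreads();

        out[y * 16 + x] = tile[x][y];          /* padded column read: conflict-free */
    }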

How Addresses Map to Banks on G80 (older technology)
- Each bank has a bandwidth of 32 bits per clock cycle
- Successive 32-bit words are assigned to successive banks
- G80 has 16 banks
  - So bank = address % 16
  - Same as the size of a half-warp
  - No bank conflicts between different half-warps, only within a single half-warp

Shared Memory Bank Conflicts
- Shared memory is as fast as registers if there are no bank conflicts
- The fast case:
  - If all threads of a half-warp access different banks, there is no bank conflict
  - If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
- The slow case:
  - Bank conflict: multiple threads in the same half-warp access the same bank
  - Must serialize the accesses
  - Cost = max # of simultaneous accesses to a single bank

Summary of Lecture
A deeper probe of performance issues:
- Execution model
- Control flow
- Heterogeneous memory hierarchy
- Locality and bandwidth
- Tiling for CUDA code generation