Spring 2011 Prof. Hyesoon Kim



A warp is the basic unit of execution: a group of threads (e.g., 32 threads in the Tesla GPU architecture). [Figure: warp execution - an instruction is issued for a warp once all of its sources are ready; warps share a finite number of streaming processors that form the SIMD execution unit.]

[Figure: two SMs (MT issue unit, SPs, shared memory) sharing a texture unit, texture L1, and L2.] Threads are assigned to SMs at block granularity: up to 8 blocks per SM as resources allow (the exact limit depends on the architecture). An SM in G80 can hold up to 768 threads - for example 256 threads/block * 3 blocks, or 128 threads/block * 6 blocks, etc. Threads run concurrently; the SM assigns and maintains thread IDs and manages/schedules thread execution. David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC
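As an illustration of these limits (a minimal sketch; the kernel name and sizes are hypothetical, not from the slides), launching with 256 threads per block lets G80-era hardware place 3 such blocks on one SM, reaching the 768-thread limit:

    // Hypothetical kernel: each thread scales one array element.
    __global__ void scale(float *a, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread id
        if (i < n) a[i] *= s;
    }
    // Launch with 256 threads/block; on G80 up to 3 such blocks fit per SM (3 * 256 = 768 threads).
    // scale<<<(n + 255) / 256, 256>>>(d_a, 2.0f, n);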

[Figure: fetch stage - one PC per warp feeds a mux that selects which warp's PC accesses the I-cache; the selected warp id may stall.]

TB = Thread Block, W = Warp. One instruction is fetched per warp at a time (further optimizations are possible). Fetch policies: round-robin and greedy-fetch (switch warps on stall events such as branches, I-cache misses, or a full buffer). An instruction executes when all of its sources are ready; execution is in order within a warp. Scheduling policies: greedy-execution, round-robin. [Figure: fetch/execute timeline interleaving warps from TB1, TB2, and TB3 around stall events.] Kirk & Hwu

With enough parallelism, the hardware can simply switch to another thread instead of speculating. Speculative execution would be expensive: a per-thread branch predictor is costly, and a pipeline flush is too costly. Branch elimination techniques are used instead.

Basic block - Def: a sequence of consecutive operations in which flow of control enters at the beginning and leaves at the end without halt or possibility of branching except at the end. Single entry, single exit. Example (blocks A, B, C, D in the control-flow graph): A: add r1, r2, r3; br.cond TARGET - B: mov r3, r4; br jmp JOIN - C: TARGET: add r1, r2, r3 - D: JOIN: mov r4, r5. http://www.eecs.umich.edu/~mahlke/583w04/

Defn: Dominator - given a CFG, a node x dominates a node y if every path from the Entry block to y contains x. That is, for a given BB, which blocks are guaranteed to have executed before it executes. Defn: Post-dominator - given a CFG, a node x post-dominates a node y if every path from y to the Exit contains x. That is, for a given BB, which blocks are guaranteed to execute after it executes (the reverse of the dominator relation). [Figure: example CFG with BB1-BB7.] http://www.eecs.umich.edu/~mahlke/583w04/

Defn: Immediate post-dominator (ipdom) - each node n has a unique immediate post-dominator m, the first post-dominator of n on any path from n to the Exit; i.e., the closest node that post-dominates n (the first breadth-first successor that post-dominates the node). The immediate post-dominator is the reconvergence point of a divergent branch. [Figure: example CFG from Entry through BB1-BB7 to Exit.] http://www.eecs.umich.edu/~mahlke/583w04/

Recap: the 32 threads in a warp are executed in SIMD (they share one instruction sequencer). Threads within a warp can be disabled (masked), for example when handling bank conflicts. Threads contain arbitrary code, including conditional branches. How do we handle different conditions in different threads? There is no problem if the threads are in different warps. Within a warp, two mechanisms apply: control divergence and predication. https://users.ece.utexas.edu/~merez/new/pmwiki.php/ee382vfa07/schedule?action=download&upname=ee382v_fa07_lect14_g80control.pdf

Predication Loop unrolling

if (cond) { b = 0; } else { b = 1; } Normal branch code (blocks A, B, C, D): A: p1 = (cond); branch p1, TARGET - B: mov b, 1; jmp JOIN - C: TARGET: mov b, 0 - D: JOIN. Predicated code: p1 = (cond); (!p1) mov b, 1; (p1) mov b, 0. Predication converts a control-flow dependency into a data dependency. Pros: eliminates hard-to-predict branches (on traditional architectures) and eliminates branch divergence (in CUDA). Cons: extra instructions.
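A minimal CUDA sketch of the same idea (the kernel name is hypothetical, not from the slides): writing the assignment as a select lets the compiler emit predicated code instead of a divergent branch.

    // Hypothetical kernel: branch-free version of "if (cond) b = 0; else b = 1;"
    __global__ void set_flags(const int *cond, int *b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // Select instead of branch: the compiler can predicate this,
            // so threads in the same warp do not diverge here.
            b[i] = cond[i] ? 0 : 1;
        }
    }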

Comparison instructions set condition codes (CC). Instructions can be predicated to write results only when the CC meets a criterion (CC != 0, CC >= 0, etc.). The compiler tries to predict whether a branch condition is likely to produce many divergent warps: if the branch is guaranteed not to diverge, it predicates only when the body is fewer than 4 instructions; if divergence is not guaranteed either way, only when it is fewer than 7 instructions. The compiler may thus replace branches with instruction predication. All predicated instructions take execution cycles; those with false conditions don't write their output or invoke memory loads and stores. Predication saves branch instructions, so it can be cheaper than serializing divergent paths. David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC

Loop unrolling transforms an M-iteration loop into a loop with M/N iterations; we say the loop has been unrolled N times. Original: for (i = 0; i < 100; i++) a[i] *= 2; Unrolled 4 times: for (i = 0; i < 100; i += 4) { a[i] *= 2; a[i+1] *= 2; a[i+2] *= 2; a[i+3] *= 2; } http://www.cc.gatech.edu/~milos/cs6290f07/
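In CUDA, unrolling can also be requested explicitly (a minimal sketch; the kernel and chunk size are hypothetical):

    // Hypothetical kernel: each thread doubles a fixed 4-element chunk.
    __global__ void scale_chunks(float *a) {
        int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
        #pragma unroll                      // ask nvcc to fully unroll the loop
        for (int k = 0; k < 4; k++)
            a[base + k] *= 2.0f;
    }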

Sum {1..100}: how do we calculate it in parallel?

Reduction example: [Figure: tree reduction over threads 0-7 - if (threadIdx.x % 2 == 0), then if (threadIdx.x % 4 == 0), then if (threadIdx.x % 8 == 0); the active threads shrink from {0,2,4,6} to {0,4} to {0}.] What about the other threads? What about different paths? Example (CFG A -> B or C -> D): if (threadIdx.x > 2) { do work B } else { do work C } - a divergent branch! From Fung et al. MICRO 07
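A minimal shared-memory reduction sketch along these lines (hypothetical kernel; assumes a power-of-two block size, not taken from the slides). Indexing by a shrinking stride keeps the active threads contiguous, so whole warps drop out instead of diverging within a warp:

    // Hypothetical tree reduction: each block sums blockDim.x input elements.
    __global__ void block_sum(const float *in, float *out, int n) {
        extern __shared__ float s[];                   // one float per thread
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        s[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride) s[tid] += s[tid + stride];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = s[0];          // one partial sum per block
    }
    // Launch: block_sum<<<grid, block, block * sizeof(float)>>>(d_in, d_out, n);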

All branch paths are serialized and executed: parallel code becomes sequential code. Divergence occurs at warp granularity, and it is a performance issue; the cost grows with the degree of nested branches. Depending on memory instructions (cache hits or misses), divergent warps can also arise - dynamic warp subdivision addresses this [Meng 10]. Hardware solutions to reduce divergent branches: dynamic warp formation [Fung 07].

Register File (RF): 32 KB, providing 4 operands per clock. The TEX pipe can also read/write the RF (2 SMs share 1 TEX unit), and the load/store pipe can also read/write the RF. [Figure: SM datapath - I$ L1, multithreaded instruction buffer, RF, C$ L1, operand select, shared memory, MAD and SFU units.] David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC

[Figure: register file organization - a design with multiple read/write ports versus a banked design (R1-R4 in separate banks) that serves several reads per cycle.]

[Figure: SMs with per-SM shared memory and the texture/L2 hierarchy. Courtesy: John Nickolls, NVIDIA.] Threads in a block share data and results, in global memory and in shared memory, and synchronize at barrier instructions. Per-block shared memory allocation keeps data close to the processor and minimizes trips to global memory. Shared memory is dynamically allocated to blocks and is one of the limiting resources. David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC
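A minimal sketch of per-block shared memory plus a barrier (hypothetical kernel, not from the slides): each thread stages one element in shared memory, the block synchronizes, and neighbors are then read without further trips to global memory.

    // Hypothetical kernel: neighbor difference using a shared-memory tile.
    __global__ void diff_neighbors(const float *in, float *out, int n) {
        __shared__ float tile[256];                    // assumes blockDim.x == 256
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tile[threadIdx.x] = in[i];
        __syncthreads();                               // barrier: tile fully loaded
        if (i < n && threadIdx.x > 0)
            out[i] = tile[threadIdx.x] - tile[threadIdx.x - 1];
    }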

Immediate address constants and indexed address constants. Constants are stored in DRAM and cached on chip (an L1 constant cache per SM). A constant value can be broadcast to all threads in a warp - an extremely efficient way of accessing a value that is common to all threads in a block, and it can reduce the number of registers needed. [Figure: SM datapath with the constant cache (C$ L1) feeding operand select.] David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC
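For example (a hedged sketch; the symbol and kernel names are hypothetical), coefficients placed in constant memory are read by every thread in a warp via a single broadcast:

    __constant__ float coeff[4];                       // cached, read-only on the device

    // Hypothetical kernel: evaluate a cubic polynomial per element.
    __global__ void apply_poly(const float *x, float *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = x[i];
            // All threads read the same coeff[k]: one broadcast per warp.
            y[i] = coeff[0] + v * (coeff[1] + v * (coeff[2] + v * coeff[3]));
        }
    }
    // Host side: cudaMemcpyToSymbol(coeff, h_coeff, 4 * sizeof(float));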

Textures are 1D, 2D, or 3D arrays of values stored in global DRAM. Textures are cached in L1 and L2 and are read-only. The texture caches are optimized for 2D access: threads in a warp that follow 2D locality will achieve better memory performance. Texels: texture elements, i.e., the elements of these arrays. https://users.ece.utexas.edu/~merez/new/pmwiki.php/ee382vfa07/schedule?action=download&upname=ee382v_fa07_lect13_g80mem.pdf

The texture units were designed to map textures onto 3D polygons, with specialized hardware pipelines for: fast data sampling from 1D, 2D, and 3D arrays; swizzling of 2D/3D data for optimal access; bilinear filtering in zero cycles; and image compositing and blending operations. Arrays are indexed by (u, v, w) coordinates, which is easy to program. This makes textures extremely well suited for multigrid and finite-difference methods.

The memory system has MSHRs, many levels of queues, large queue sizes, and a high number of DRAM banks. Performance is sensitive to the memory scheduling algorithm: FR-FCFS >> FCFS. An interconnection-network arbitration algorithm can be used to obtain FR-FCFS-like effects [Yan 09].

Execution is in order, but a warp is blocked only when an instruction's sources depend on an outstanding memory instruction, not when the instruction merely generates memory requests. This exposes high MLP: a warp can issue several memory requests back to back before a context switch to another warp. [Figure: warps W0 and W1 interleaving compute (C), memory (M), and dependent (D) instructions around context switches.]
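A small sketch of this effect (hypothetical kernel): the three loads below are independent, so all of them can be in flight before the warp finally stalls on the instruction that consumes them.

    // Hypothetical kernel: independent loads issue back to back (high MLP);
    // the warp only stalls at the multiply-add that uses the loaded values.
    __global__ void fma3(const float *a, const float *b, const float *c, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float x = a[i];           // load 1
            float y = b[i];           // load 2, independent of load 1
            float z = c[i];           // load 3, independent of loads 1 and 2
            out[i] = x * y + z;       // first use: the warp waits here
        }
    }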

High merging effects: inter-core merging and intra-core merging of memory requests (in the MSHRs). Techniques that take advantage of this: SDMT, compiler optimization to increase memory reuse [Yang 10], and cache coherence to increase reuse in caches [Tarjan 10].

No bank conflicts: linear addressing with stride == 1. Also no bank conflicts: a random 1:1 permutation. [Figure: threads 0-15 each mapped to a distinct bank 0-15 in both cases.] David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC

This has no conflicts if the element type of shared is 32 bits: foo = shared[baseIndex + threadIdx.x]; But not if the data type is smaller. 4-way bank conflicts: __shared__ char shared[]; foo = shared[baseIndex + threadIdx.x]; 2-way bank conflicts: __shared__ short shared[]; foo = shared[baseIndex + threadIdx.x]; [Figure: threads 0-15 mapping onto banks 0-15, with multiple threads hitting the same bank for sub-word types.] David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC
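One common workaround (a hedged sketch, not from the slides) is to pad a 2D shared array by one column, so that accesses down a column no longer map to the same bank. The 16x16 tile size below is an assumption matching the 16 banks of G80-class hardware, and width is assumed to be a multiple of 16:

    // Transpose using a padded shared tile: the +1 column shifts each row, so
    // reading tile[threadIdx.x][threadIdx.y] hits 16 different banks.
    __global__ void transpose_tile(const float *in, float *out, int width) {
        __shared__ float tile[16][17];                 // 16x16 data + 1 padding column
        int x = blockIdx.x * 16 + threadIdx.x;
        int y = blockIdx.y * 16 + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();
        int tx = blockIdx.y * 16 + threadIdx.x;        // transposed block origin
        int ty = blockIdx.x * 16 + threadIdx.y;
        out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
    }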

The CUDA model is a Bulk-Synchronous Parallel (BSP) program (Valiant [90]). Synchronization within a block uses explicit barriers; across kernels there is an implicit barrier - all blocks of kernel 1 finish before kernel 2 starts. (Cf. CUDA 3.x, which adds concurrent kernel execution.) [Figure: kernel 1's blocks separated from kernel 2's blocks by a global barrier, with per-block barriers inside each kernel.]
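A minimal host-side sketch (hypothetical kernel names): two launches in the same (default) stream are implicitly ordered, so the second kernel sees all of the first kernel's writes, while __syncthreads() only synchronizes within a block.

    __global__ void kernel1(float *d) { int i = blockIdx.x * blockDim.x + threadIdx.x; d[i] = (float)i; }
    __global__ void kernel2(float *d) { int i = blockIdx.x * blockDim.x + threadIdx.x; d[i] += 1.0f; }

    // Host side: the default stream orders the launches (implicit global barrier).
    // kernel1<<<grid, block>>>(d_data);
    // kernel2<<<grid, block>>>(d_data);
    // cudaDeviceSynchronize();              // wait on the host for both to finish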

One option is to use multiple kernels. If threads write to the same memory addresses, the behavior is not guaranteed - a data race. Atomic operations: no other thread can write to the same location until the operation completes, although the memory write order across threads is still arbitrary. The set keeps being extended: atomic{Add, Sub, Exch, Min, Max, Inc, Dec, CAS, And, Or, Xor}. Atomics can cause performance degradation; Fermi increases atomic performance by 5x to 20x (M. Shebanow).
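A minimal sketch (hypothetical kernel): building a histogram with atomicAdd, so concurrent increments of the same bin cannot race.

    // Hypothetical kernel: each thread classifies one byte into one of 256 bins.
    __global__ void histogram(const unsigned char *data, unsigned int *bins, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[data[i]], 1u);   // read-modify-write is indivisible
    }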

White paper: NVIDIA's Next Generation CUDA Compute Architecture: Fermi. White paper: World's Fastest GPU Delivering Great Gaming Performance with True Geometric Realism.

SM: 32 CUDA cores per SM (fully 32-lane); a dual warp scheduler that issues and dispatches from two independent warps; 64 KB of RAM with a configurable partitioning between shared memory and L1 cache. Programming support: unified address space with full C++ support; a full 32-bit integer path with 64-bit extensions; memory access instructions to support the transition to 64-bit addressing. Memory system: data caches, ECC support, atomic memory operations, concurrent kernel execution.
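For instance (a hedged sketch; the kernel name is hypothetical), the shared-memory/L1 split on Fermi can be requested per kernel from the host:

    // Request the 48 KB shared / 16 KB L1 split for a shared-memory-heavy kernel;
    // cudaFuncCachePreferL1 would request the opposite split.
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);
    my_kernel<<<grid, block>>>(args);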

Better integer support; 4 SFUs; a dual warp scheduler for more TLP; 16 cores are executed together as a group.