Lecture 9: Thread Divergence, Scheduling, Instruction Level Parallelism


Announcements
- Tuesday's lecture on 2/11 will be moved to room 4140, from 6:30 PM to 7:50 PM.
- Office hours are cancelled on Thursday; make an appointment to see me later this week.
- CUDA profiling with nvprof: see slides 13-16 of http://www.nvidia.com/docs/io/116711/sc11-perfoptimization.pdf

Recapping optimized matrix multiply (Volkov and Demmel, SC08)
- Improve performance by using fewer threads: reducing concurrency frees up registers, trading locality against parallelism.
- Use ILP to increase processor utilization.

Hiding memory latency
- Parallelism = latency × throughput
  - Arithmetic: 576 operations in flight per SM = 18 CP × 32 ops/SM/CP
  - Memory: 150 KB = ~500 CP (1100 ns) × 150 GB/sec
- How can we keep 150 KB in flight?
  - Multiple threads: ~35,000 threads @ 4 B/thread
  - ILP: increase the number of fetches per thread
  - Larger fetches: 64 or 128 bits per thread
  - Higher occupancy

Copy 1 float per thread (needs 100% occupancy):

```cuda
int indx = threadIdx.x + blockIdx.x * blockDim.x;
float a0 = src[indx];
dest[indx] = a0;
```

Copy 2 floats per thread (needs 50% occupancy):

```cuda
int indx = threadIdx.x + 2 * blockIdx.x * blockDim.x;
float a0 = src[indx];
float a1 = src[indx + blockDim.x];
dest[indx] = a0;
dest[indx + blockDim.x] = a1;
```

Copy 4 floats per thread (needs 25% occupancy):

```cuda
int indx = threadIdx.x + 4 * blockIdx.x * blockDim.x;
float a[4];                                   // held in registers
for (int i = 0; i < 4; i++) a[i] = src[indx + i * blockDim.x];
for (int i = 0; i < 4; i++) dest[indx + i * blockDim.x] = a[i];
```
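Spelling out the arithmetic behind those figures (Little's law applied to the memory system, using the latency and bandwidth quoted above; the thread count assumes one 4-byte word in flight per thread):

```latex
\text{data in flight} = \text{latency} \times \text{throughput}
  = 1100\,\text{ns} \times 150\,\text{GB/s} \approx 165\,\text{KB},
\qquad
\frac{150\,\text{KB}}{4\,\text{B/thread}} \approx 37{,}500\ \text{threads}
```

which is where the ~35,000-thread figure comes from.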

Today's lecture
- Thread divergence
- Scheduling
- Instruction level parallelism

Recapping: warp scheduling on Fermi
- Threads are assigned to an SM in units of a thread block; multiple blocks per SM.
- Each block is divided into warps of 32 (SIMD) threads, the schedulable unit.
- A warp becomes eligible for execution when all its operands are available.
- Dynamic instruction reordering: eligible warps are selected for execution using a prioritized scheduling policy.
- All threads in a warp execute the same instruction; branches serialize execution.
- Multiple warps are simultaneously active, hiding data transfer delays.
- All registers in all the warps are available: zero-overhead scheduling.
- Hardware is free to assign blocks to any SM.
[Figure: interleaved warp issue over time, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, warp 3 instruction 96]

Instruction issue on Fermi
- Each vector unit has 32 CUDA cores for integer and floating-point arithmetic, plus 4 special function units for single-precision transcendentals.
- 2 warp schedulers; each scheduler issues 1 (2) instructions for capability 2.0 (2.1).
- One scheduler is in charge of odd warp IDs, the other of even warp IDs.
- Only 1 scheduler can issue a double-precision floating-point instruction at a time.
- A warp scheduler can issue an instruction to ½ the CUDA cores in an SM, so the scheduler must issue each integer or floating-point arithmetic instruction over 2 clock cycles.

Dynamic behavior: resource utilization
- Each vector core (SM) has 1024 thread slots and 8 block slots.
- Hardware partitions slots into blocks at run time, accommodating different processing capacities.
- Registers are split dynamically across all blocks assigned to the vector core.
- A register is private to a single thread within a single block.
Courtesy of David Kirk/NVIDIA and Wen-mei Hwu/UIUC

Thread divergence
- All the threads in a warp execute the same instruction; different control paths are serialized.
- Divergence occurs when a predicate is a function of the thread ID:
  if (threadId < 2) { }
- No divergence if all threads follow the same path:
  if (threadId / WARP_SIZE < 2) { }
- Consider a reduction, e.g. the summation $\sum_i x_i$.

Thread divergence
- All the threads in a warp execute the same instruction; different control paths are serialized.
[Figure: scalar threads grouped into warps (e.g. thread warp 3, thread warp 8, thread warp 7); at a branch, the threads of a warp split between path A and path B. Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt, UBC]

Divergence example

```cuda
if (threadIdx.x >= 2)
    a = 100;
else
    a = -100;
```

[Figure sequence: lanes P0, P1, ..., PM-1, each with its own registers, first all evaluate "compare threadIdx.x, 2"; then the lanes whose predicate is false are masked off (marked X) while the taken path executes, and vice versa for the other path. Courtesy of Mary Hall]

Thread divergence
- All the threads in a warp execute the same instruction; different control paths are serialized.
- Divergence occurs when a predicate is a function of the thread ID:
  if (threadId < 2) { }
- No divergence if all threads within a warp follow the same path:
  if (threadId / WARP_SIZE < 2) { }
- We can have different control paths within the thread block.
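As a concrete illustration, the sketch below (hypothetical kernel and names, not from the slides) contrasts the two predicates: the first branch splits threads 0-1 from the rest of their warp, so both paths are serialized; the second branches on the warp index, so every thread in a given warp takes the same path and no divergence occurs:

```cuda
__global__ void divergenceDemo(float *out) {
    int tid = threadIdx.x;

    // Divergent: threads 0 and 1 take a different path than
    // threads 2..31 of the same warp; both paths execute serially.
    if (tid < 2)
        out[tid] = 1.0f;
    else
        out[tid] = 2.0f;

    // Not divergent: all 32 threads of a warp share the same value
    // of tid / 32 (the warp index), so each warp takes one path.
    if (tid / 32 < 2)
        out[tid] += 10.0f;
    else
        out[tid] += 20.0f;
}
```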

Example: reduction with thread divergence
[Figure: tree-based summation of the array 3 1 7 0 4 1 6 3. In step 1, even-numbered threads (thread 0, 2, 4, 6, ...) add pairs 0+1, 2+3, 4+5, 6+7, giving 4, 7, 5, 9; step 2 combines 0..3 and 4..7, giving 11 and 14; step 3 combines 0..7, giving 25. Courtesy of David Kirk/NVIDIA & Wen-mei Hwu/UIUC]

The naïve code

```cuda
__global__ void reduce(int *input, unsigned int N, int *total) {
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ int x[BSIZE];
    x[tid] = (i < N) ? input[i] : 0;
    __syncthreads();
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        if (tid % (2 * stride) == 0)
            x[tid] += x[tid + stride];
    }
    if (tid == 0)
        atomicAdd(total, x[tid]);
}
```
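A minimal host-side harness for this kernel might look as follows (a sketch; BSIZE must match the kernel's shared-memory array, h_in is a hypothetical host array, and *total must be zeroed first because the kernel accumulates with atomicAdd):

```cuda
int *h_in = (int *)malloc(N * sizeof(int));   // fill with input values
int *d_in, *d_total;
cudaMalloc(&d_in, N * sizeof(int));
cudaMalloc(&d_total, sizeof(int));
cudaMemset(d_total, 0, sizeof(int));          // kernel accumulates into this
cudaMemcpy(d_in, h_in, N * sizeof(int), cudaMemcpyHostToDevice);

// One thread per input element, rounded up to whole blocks.
int blocks = (N + BSIZE - 1) / BSIZE;
reduce<<<blocks, BSIZE>>>(d_in, N, d_total);

int sum;
cudaMemcpy(&sum, d_total, sizeof(int), cudaMemcpyDeviceToHost);
```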

Reducing divergence and avoiding bank conflicts
[Figure: threads 0..15 each add an element from the upper half of the array: thread 0 computes 0+16, ..., thread 15 computes 15+31, keeping the active threads contiguous. Courtesy of David Kirk/NVIDIA & Wen-mei Hwu/UIUC]

The improved code
- All threads in a warp execute the same instruction: reduceSum<<<N/512, 512>>>(x, N)
- No divergence until stride < 32; when stride ≥ 32, each warp is either fully active or fully inactive.

The divergent loop

```cuda
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    if (tid % (2 * stride) == 0)
        x[tid] += x[tid + stride];
}
```

is replaced by one that keeps the active threads contiguous:

```cuda
__shared__ int x[BSIZE];
unsigned int tid = threadIdx.x;
for (unsigned int s = blockDim.x / 2; s > 0; s /= 2) {
    __syncthreads();
    if (tid < s)
        x[tid] += x[tid + s];
}
```

For blockDim.x = 512: s = 256 uses threads 0:255, s = 128 uses threads 0:127, s = 64 uses threads 0:63, s = 32 uses threads 0:31, s = 16 uses threads 0:15, and so on.

Predication on Fermi
- All instructions support predication in compute capability 2.x.
- Each thread has a condition code or predicate, set to true or false; the instruction executes only if the predicate is true:
  if (x > 1) y = 7;   becomes   test = (x > 1); test: y = 7;
- The compiler replaces a branch instruction with predicated instructions only if the number of instructions controlled by the branch condition is not too large: if the compiler predicts too many divergent warps, the threshold is 7 instructions, otherwise 4.
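For instance, a short guarded assignment like the one below is a likely candidate for predication rather than a branch (a sketch with hypothetical names; whether the compiler actually predicates depends on the instruction count and its divergence estimate, as described above):

```cuda
__global__ void clampNegative(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && x[i] < 0.0f)   // short body: the compiler may emit a
        x[i] = 0.0f;            // predicated store instead of a branch
}
```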

Concurrency: Host & Device
- Non-blocking operations: asynchronous transfers between device and {host, device}, and kernel launch.
- Interleaved CPU and device calls: kernel calls run asynchronously with respect to the host, but a sequence of kernel calls runs sequentially on the device.
- Multiple kernel invocations running in separate CUDA streams are interleaved, independent computations:

```cuda
cudaStream_t st1, st2, st3, st4;
cudaStreamCreate(&st1);
cudaMemcpyAsync(d1, h1, N, cudaMemcpyHostToDevice, st1);
kernel2<<<grid, block, 0, st2>>>(..., d2, ...);
kernel3<<<grid, block, 0, st3>>>(..., d3, ...);
cudaMemcpyAsync(h4, d4, N, cudaMemcpyDeviceToHost, st4);
CPU_function(...);
```

on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf
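Filling in the elided pieces, a self-contained version of the stream example might look like this (a sketch; the buffers and kernels are placeholders, and the host buffers must be pinned, via cudaMallocHost, for the async copies to actually overlap with the kernels):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernels standing in for kernel2/kernel3 on the slide.
__global__ void kernel2(float *d) { int i = blockIdx.x * blockDim.x + threadIdx.x; d[i] *= 2.0f; }
__global__ void kernel3(float *d) { int i = blockIdx.x * blockDim.x + threadIdx.x; d[i] += 1.0f; }

int main() {
    const int N = 1 << 20;                    // divisible by the block size
    float *h1, *h4, *d1, *d2, *d3, *d4;
    cudaMallocHost(&h1, N * sizeof(float));   // pinned host memory, required
    cudaMallocHost(&h4, N * sizeof(float));   // for truly asynchronous copies
    cudaMalloc(&d1, N * sizeof(float));
    cudaMalloc(&d2, N * sizeof(float));
    cudaMalloc(&d3, N * sizeof(float));
    cudaMalloc(&d4, N * sizeof(float));

    cudaStream_t st1, st2, st3, st4;
    cudaStreamCreate(&st1); cudaStreamCreate(&st2);
    cudaStreamCreate(&st3); cudaStreamCreate(&st4);

    // The four operations below are independent, so they may overlap.
    cudaMemcpyAsync(d1, h1, N * sizeof(float), cudaMemcpyHostToDevice, st1);
    kernel2<<<N / 256, 256, 0, st2>>>(d2);
    kernel3<<<N / 256, 256, 0, st3>>>(d3);
    cudaMemcpyAsync(h4, d4, N * sizeof(float), cudaMemcpyDeviceToHost, st4);
    // ... CPU work here overlaps with all four streams ...

    cudaDeviceSynchronize();                  // wait for all four streams
    cudaStreamDestroy(st1); cudaStreamDestroy(st2);
    cudaStreamDestroy(st3); cudaStreamDestroy(st4);
    return 0;
}
```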

Today's lecture
- Bank conflicts (previous lecture)
- Further improvements to matrix multiplication
- Thread divergence
- Scheduling

Dual warp scheduling on Fermi
- 2 schedulers per SM, each finding an eligible warp; each issues 1 instruction per cycle.
- Dynamic instruction reordering: eligible warps are selected for execution using a prioritized scheduling policy.
- Issue selection: round-robin/age of warp; odd and even warps are split between the schedulers.
- A warp scheduler can issue an instruction to ½ the cores; each scheduler issues 1 (2) instructions for capability 2.0 (2.1).
- 16 load/store units.
- The scheduler must issue each integer or floating-point arithmetic instruction over 2 clock cycles.
- Most instructions can be dual issued: 2 integer, 2 FP, or a mix of integer/FP/load/store.
- Only 1 double-precision instruction at a time.
- All registers in all the warps are available: zero-overhead scheduling. Overhead may differ when switching blocks.
[Figure: interleaved warp issue over time: warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, warp 3 instruction 96]

What makes a processor run faster?
- Registers and cache
- Pipelining
- Instruction level parallelism
- Vectorization, SSE

Pipelining
- Assembly-line processing, as in an auto plant.
- Dave Patterson's laundry example: 4 people (A, B, C, D) doing laundry; wash (30 min) + dry (40 min) + fold (20 min) = 90 min per load.
- Sequential execution takes 4 × 90 min = 6 hours; pipelined execution takes 30 + 4 × 40 + 20 = 210 min = 3.5 hours.
- Bandwidth = loads/hour. Pipelining helps bandwidth but not latency (each load still takes 90 min).
- Bandwidth is limited by the slowest pipeline stage; potential speedup = number of pipe stages.
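The speedup for the laundry example works out as follows (the 40-minute dryer is the slowest stage and sets the steady-state rate):

```latex
T_{\text{seq}} = 4 \times 90 = 360\ \text{min}, \qquad
T_{\text{pipe}} = 30 + 4 \times 40 + 20 = 210\ \text{min}, \qquad
\text{speedup} = \frac{360}{210} \approx 1.7
```

This falls short of the ideal 3-stage speedup of 3 because the stage times are unequal and only 4 loads flow through the pipe.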

Instruction level parallelism
- Execute more than one instruction at a time:
  x = y / z;  a = b + c;
- Dynamic techniques allow stalled instructions to proceed, or reorder instructions:
  x = y / z;  a = x + c;  t = c - q;   can be reordered as   x = y / z;  t = c - q;  a = x + c;

MIPS R4000 multi-execution pipeline
[Figure: IF and ID stages feed four parallel execution paths: an integer unit (EX), an FP/integer multiplier (stages m1-m7), an FP adder (stages a1-a4), and an FP/integer divider (Div, latency = 25), all rejoining at MEM and WB. Courtesy of David Culler]
- Independent operations can occupy different units simultaneously:
  x = y / z;  a = b + c;  s = q * r;  i++;
  x = y / z;  t = c - q;  a = x + c;

Scoreboarding and Tomasulo's algorithm
- Hardware data structures keep track of when instructions complete, which instructions depend on the results, and when it's safe to write a register.
- Deals with data hazards: WAR (write after read), RAW (read after write), WAW (write after write).
[Figure: instruction fetch feeds an issue-and-resolve stage; the scoreboard tracks each functional unit's "op fetch / ex" progress for an instruction stream such as a = b*c; x = y - b; q = a / y; y = x + b]

Scoreboarding on the GPU
- Keeps track of all register operands of all instructions in the instruction buffer; an instruction becomes ready after the needed values are written.
- Eliminates hazards; ready instructions are eligible for issue.
- Decouples the memory and processor pipelines: threads can issue instructions until the scoreboard prevents issue, allowing memory/processor ops to proceed in parallel with other waiting memory/processor ops.
[Figure: issue timeline in which warps from several thread blocks stall and are interleaved, e.g. TB1/W1 stalls while TB2/W1 and TB3/W1 issue in the meantime. TB = thread block, W = warp]

Static scheduling limits performance
- The ADDD instruction is stalled on the DIVide, stalling further instruction issue, e.g. the SUBD:

```
DIV  F0,  F2, F4
ADDD F10, F0, F8
SUBD F12, F8, F14
```

- But SUBD doesn't depend on ADDD or DIV. If we have two adder/subtraction units, one will sit idle uselessly until the DIV finishes.

Dynamic scheduling
- Idea: modify the pipeline to permit instructions to execute as soon as their operands become available.
- This is known as out-of-order execution (classic dataflow); the SUBD can now proceed normally.
- Complication: dynamically scheduled instructions also complete out of order.

Dynamic scheduling splits the ID stage
- Issue sub-stage: decode the instructions and check for structural hazards.
- Read-operands sub-stage: wait until there are no data hazards, then read the operands.
- We need additional registers to store pending instructions that aren't ready to execute.
[Figure: GPU analogue: instruction cache (I$) feeding a multithreaded instruction buffer, register file (RF), constant cache (C$) and L1, shared memory, and operand-select logic in front of the MAD and SFU units. Courtesy of Mary Hall]

Consequences of a split ID stage
- We distinguish between when an instruction begins execution and when it completes.
- Previously, an instruction stalled in the ID stage, and this held up the entire pipeline.
- Instructions can now be in a suspended state, neither stalling the pipeline nor executing: they are waiting on operands.

Two schemes for dynamic scheduling
- Scoreboarding: CDC 6600
- Tomasulo's algorithm: IBM 360/91
- We'll vary the number of functional units, their latency, and functional unit pipelining.

What is scoreboarding?
- A technique that allows instructions to execute out of order, so long as there are sufficient resources and no data dependencies.
- The goal of scoreboarding is to maintain an execution rate of one instruction per clock cycle.

Multiple execution pipelines in DLX with scoreboarding
[Figure: registers and data buses connecting two FP multipliers, an FP divider, an FP adder, and an integer unit, each exchanging control/status signals with the scoreboard]

Unit         Latency
Integer      0
Memory       1
FP Add       2
FP Multiply  10
FP Div       40

Scoreboard controls the pipeline
[Figure: the scoreboard/control unit oversees the whole pipeline: instruction fetch → pre-issue buffer → instruction issue (WAW hazards checked) → instruction decode/read operands (RAW hazards checked) → execution units 1 and 2 with pre- and post-execution buffers → write results (WAR hazards checked). Courtesy of Mike Frank]

What are the requirements?
- Responsibility for instruction issue and execution, including hazard detection.
- Multiple instructions must be in the EX stage simultaneously, either through pipelining or multiple functional units.
- DLX has 2 multipliers, 1 divider, and 1 integer unit (memory, branch, integer arithmetic).

How does it work?
- As each instruction passes through the scoreboard, a description of its data dependencies is constructed (issue).
- The scoreboard determines when the instruction can read its operands and begin execution; if it can't begin execution, the scoreboard keeps a record and monitors the hardware for the moment the instruction can execute.
- The scoreboard also controls when an instruction may write its result.
- All hazard detection is centralized.

How is a scoreboard implemented?
- As a centralized bookkeeping table that tracks instructions, the register operand(s) they depend on, and which register they modify.
- It records the status of result registers (which unit is going to write a given register) and the status of the functional units.
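A minimal sketch of that bookkeeping in C (hypothetical field names, following the classic three-table organization of functional-unit status, register-result status, and instruction status):

```c
#define NUM_REGS 32
#define NUM_FUS   5

typedef struct {          /* one entry per functional unit          */
    int busy;             /* is the unit in use?                    */
    int op;               /* opcode being executed                  */
    int Fi, Fj, Fk;       /* destination and source registers       */
    int Qj, Qk;           /* FUs producing Fj / Fk (-1 = available) */
    int Rj, Rk;           /* are the source operands ready to read? */
} FUStatus;

FUStatus fu[NUM_FUS];     /* functional-unit status table           */
int reg_result[NUM_REGS]; /* which FU will write each register      */
                          /* (-1 = no pending write)                */
/* A per-instruction status table additionally records which of the
   four stages (issue, read operands, execute, write result) each
   in-flight instruction has completed.                             */
```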