Computer Science 146. Computer Architecture


Computer Architecture, Spring 2004, Harvard University
Instructor: Prof. David Brooks (dbrooks@eecs.harvard.edu)

Lecture 11: Software Pipelining and Global Scheduling

Lecture Outline
- Review of Loop Unrolling
- Software Pipelining
- Global Scheduling
  - Trace Scheduling, Superblocks
- Next Time
  - Hardware-Assisted Software ILP
  - Conditional, Predicated Instructions
  - Compiler Speculation with Hardware Support
  - Hardware vs. Software comparison
  - Itanium Implementation

Compiler Loop Unrolling
1. Check OK to move the S.D after DSUBUI and BNEZ, and find amount to adjust S.D offset
2. Determine unrolling the loop would be useful by finding that the loop iterations are independent
3. Rename registers to avoid name dependencies
4. Eliminate extra test and branch instructions and adjust the loop termination and iteration code
5. Determine that loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent (requires analyzing memory addresses and finding that they do not refer to the same address)
6. Schedule the code, preserving any dependences needed to yield the same result as the original code

Loop Unrolling Limitations
- Decrease in amount of overhead amortized per unroll
  - Diminishing returns in reducing loop overheads
- Growth in code size
  - Can hurt instruction-fetch performance
- Register pressure
  - Aggressive unrolling/scheduling can exhaust 32-register machines
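As an illustration of the unrolling steps above, here is a minimal C sketch of unrolling the x[i] = x[i] + s style loop by a factor of four. The array name, trip count, and the assumption that n is a multiple of four are hypothetical, chosen only to keep the example short.

    /* Original loop: one add plus one test/branch and index update per element. */
    for (int i = 0; i < n; i++)
        x[i] = x[i] + s;

    /* Unrolled by 4 (assumes n is a multiple of 4): the loop overhead is
     * amortized over four element operations, and the four bodies touch
     * independent addresses, so they can be freely rescheduled against
     * one another (steps 3, 5, and 6 above). */
    for (int i = 0; i < n; i += 4) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }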

Loop Unrolling Problem
- Every loop unrolling iteration requires the pipeline to fill and drain
- Occurs every m/n times if the loop has m iterations and is unrolled n times
(figure: overlapped ops over time; overlap is proportional to the number of unrolls, and exists only within an unrolled body, not between unrolled iterations)

More Advanced Technique: Software Pipelining
- Observation: if iterations from loops are independent, then we can get more ILP by taking instructions from different iterations
- Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (~ Tomasulo in SW)
(figure: iterations 0 through 4 of the original loop overlapped in time to form one software-pipelined iteration)

Software Pipelining
- Now must optimize the inner loop
- Want to do as much work as possible in each iteration
- Keep all of the functional units in the processor busy

    for (j = 0; j < MAX; j++)
        C[j] = A * B[j];

(figure: dataflow graph of the loop body)

Software Pipelining Example

    for (j = 0; j < MAX; j++)
        C[j] = A * B[j];

(figure: not-pipelined vs. pipelined execution of the iterations, showing the Fill, Steady State, and Drain phases)
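A minimal C sketch of what the software-pipelined version of this loop looks like, with the fill, steady-state, and drain phases written out explicitly. The names follow the slide (A, B, C, MAX); the three-stage split (load / multiply / store) and the function name are assumptions made to mirror the dataflow graph.

    #include <stddef.h>

    /* Software-pipelined form of: for (j = 0; j < max; j++) C[j] = A * B[j];
     * Assumes max >= 2. Each steady-state iteration touches three different
     * iterations of the original loop: store j, multiply j+1, load j+2. */
    void scale_swp(double *C, const double *B, double A, size_t max) {
        double p = A * B[0];        /* fill: multiply for iteration 0   */
        double b = B[1];            /* fill: load for iteration 1       */
        for (size_t j = 0; j + 2 < max; j++) {
            C[j] = p;               /* store, iteration j               */
            p    = A * b;           /* multiply, iteration j+1          */
            b    = B[j + 2];        /* load, iteration j+2              */
        }
        C[max - 2] = p;             /* drain */
        C[max - 1] = A * b;         /* drain */
    }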

Software Pipelining Example

Before: Unrolled 3 times
 1 L.D    F0,0(R1)
 2 ADD.D  F4,F0,F2
 3 S.D    0(R1),F4
 4 L.D    F6,-8(R1)
 5 ADD.D  F8,F6,F2
 6 S.D    -8(R1),F8
 7 L.D    F10,-16(R1)
 8 ADD.D  F12,F10,F2
 9 S.D    -16(R1),F12
10 DSUBUI R1,R1,#24
11 BNEZ   R1,LOOP

After: Software Pipelined
 1 S.D    0(R1),F4     ; Stores M[i]
 2 ADD.D  F4,F0,F2     ; Adds to M[i-1]
 3 L.D    F0,-16(R1)   ; Loads M[i-2]
 4 DSUBUI R1,R1,#8
 5 BNEZ   R1,LOOP

(figure: SW-pipelined loop vs. unrolled loop, overlapped ops over time)

Symbolic Loop Unrolling
- Maximize result-use distance
- Less code space than unrolling
- Fill & drain pipe only once per loop, vs. once per each unrolled iteration in loop unrolling
- 5 cycles per iteration

Software Pipelining vs. Loop Unrolling
- Software pipelining is symbolic loop unrolling
  - Consumes less code space
- Actually they are targeting different things
  - Both provide a better scheduled inner loop
  - Loop unrolling targets loop overhead code (branch / counter-update code)
  - Software pipelining targets the time when the pipeline is filling and draining
- Best performance can come from doing both

When Safe to Unroll Loop?
Example: Where are the data dependences? (A, B, C distinct & nonoverlapping)

    for (i = 0; i < 100; i = i+1) {
        A[i+1] = A[i] + C[i];    /* S1 */
        B[i+1] = B[i] + A[i+1];  /* S2 */
    }

1. S2 uses the value A[i+1] computed by S1 in the same iteration.
2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].
   - This is a loop-carried dependence: between iterations
- For our prior example, each iteration was distinct
- Implies that iterations can't be executed in parallel?

VLIW vs. Superscalar
- Superscalar processors decide on the fly how many instructions to issue
  - HW complexity of deciding the number of instructions to issue is O(n^2)
- Proposal: allow the compiler to schedule instruction-level parallelism explicitly
  - Format the instructions in a potential issue packet so that HW need not check explicitly for dependences
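Referring back to the "When Safe to Unroll Loop?" slide above: the question mark at the end is there because a loop-carried dependence does not always rule out parallelism. The following contrasting C sketch is the standard companion example (not taken from these slides; array sizes such as double A[100], B[101], C[100], D[100] are assumed declared elsewhere). Its only loop-carried dependence connects different statements, so the loop can be rewritten with no loop-carried dependence at all.

    /* Original: S2 writes B[i+1], which S1 reads in the *next* iteration,
     * so the loop-carried dependence is S2 -> S1, not a statement on itself. */
    for (int i = 0; i < 100; i = i + 1) {
        A[i] = A[i] + B[i];        /* S1 */
        B[i + 1] = C[i] + D[i];    /* S2 */
    }

    /* Transformed: peel the first S1 and the last S2, then pair S2 of
     * iteration i with S1 of iteration i+1. The new loop body has no
     * loop-carried dependence, so its iterations can execute in parallel. */
    A[0] = A[0] + B[0];
    for (int i = 0; i < 99; i = i + 1) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[100] = C[99] + D[99];

In the slide's own example, by contrast, S1 depends on itself across iterations (A[i+1] needs A[i]), so that chain genuinely serializes the loop.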

VLIW: Very Long Instruction Word
- Each instruction has explicit coding for multiple operations
  - In IA-64, the grouping is called a packet
  - Tradeoff: instruction space for simple decoding
- Slots are available for many ops in the instruction word
  - By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel
- E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
  - 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
- Need a compiling technique that schedules across several branches (discussed next time)

Recall: Unrolled Loop that Minimizes Stalls for Scalar
(latencies: L.D to ADD.D: 1 cycle; ADD.D to S.D: 2 cycles)
 1 Loop: L.D    F0,0(R1)
 2       L.D    F6,-8(R1)
 3       L.D    F10,-16(R1)
 4       L.D    F14,-24(R1)
 5       ADD.D  F4,F0,F2
 6       ADD.D  F8,F6,F2
 7       ADD.D  F12,F10,F2
 8       ADD.D  F16,F14,F2
 9       S.D    0(R1),F4
10       S.D    -8(R1),F8
11       S.D    -16(R1),F12
12       DSUBUI R1,R1,#32
13       BNEZ   R1,LOOP
14       S.D    8(R1),F16   ; 8-32 = -24
14 clock cycles, or 3.5 per iteration
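Purely as an illustration of the long-instruction-word format described above (the slot layout and widths follow the slide's "e.g." configuration; the encoding itself is invented for the sketch and does not correspond to any real VLIW ISA), the 7-slot word could be pictured in C roughly as:

    #include <stdint.h>

    /* Hypothetical 7-slot VLIW instruction word: 2 memory refs, 2 FP ops,
     * 2 integer ops, 1 branch. At 24 bits per slot the word is 7*24 = 168
     * bits wide; unused slots hold no-ops. The compiler guarantees the
     * seven operations are mutually independent, so the hardware issues
     * them in parallel with no dependence checking. */
    typedef struct {
        uint32_t mem_op[2];   /* 2 memory references (up to 24 bits each) */
        uint32_t fp_op[2];    /* 2 floating-point operations              */
        uint32_t int_op[2];   /* 2 integer operations                     */
        uint32_t branch_op;   /* 1 branch                                 */
    } vliw_word_t;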

Loop Unrolling in VLIW

Clock | Memory reference 1 | Memory reference 2 | FP operation 1   | FP operation 2   | Int. op / branch
  1   | L.D F0,0(R1)       | L.D F6,-8(R1)      |                  |                  |
  2   | L.D F10,-16(R1)    | L.D F14,-24(R1)    |                  |                  |
  3   | L.D F18,-32(R1)    | L.D F22,-40(R1)    | ADD.D F4,F0,F2   | ADD.D F8,F6,F2   |
  4   | L.D F26,-48(R1)    |                    | ADD.D F12,F10,F2 | ADD.D F16,F14,F2 |
  5   |                    |                    | ADD.D F20,F18,F2 | ADD.D F24,F22,F2 |
  6   | S.D 0(R1),F4       | S.D -8(R1),F8      | ADD.D F28,F26,F2 |                  |
  7   | S.D -16(R1),F12    | S.D -24(R1),F16    |                  |                  |
  8   | S.D -32(R1),F20    | S.D -40(R1),F24    |                  |                  | DSUBUI R1,R1,#48
  9   | S.D -0(R1),F28     |                    |                  |                  | BNEZ R1,LOOP

- Unrolled 7 times to avoid delays
- 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
- Average: 2.5 ops per clock, 50% efficiency
- Note: Need more registers in VLIW (15 vs. 6 in SS)

Software Pipelining with Loop Unrolling in VLIW

Clock | Memory reference 1 | Memory reference 2 | FP operation 1   | FP operation 2 | Int. op / branch
  1   | L.D F0,-48(R1)     | ST 0(R1),F4        | ADD.D F4,F0,F2   |                |
  2   | L.D F6,-56(R1)     | ST -8(R1),F8       | ADD.D F8,F6,F2   |                | DSUBUI R1,R1,#24
  3   | L.D F10,-40(R1)    | ST 8(R1),F12       | ADD.D F12,F10,F2 |                | BNEZ R1,LOOP

- Software pipelined across 9 iterations of the original loop
- In each iteration of the above loop, we:
  - Store to M, M-8, M-16         (iterations i-3, i-2, i-1)
  - Compute for M-24, M-32, M-40  (iterations i, i+1, i+2)
  - Load from M-48, M-56, M-64    (iterations i+3, i+4, i+5)
- 9 results in 9 cycles, or 1 clock per iteration
- Average: 3.3 ops per clock, 66% efficiency
- Note: Need fewer registers for software pipelining (only using 7 registers here, was using 15)

Global Scheduling
- Previously we focused on loop-level parallelism
  - Unrolling and software pipelining scheduling work well
  - These work best on single basic blocks (repeatable schedules)
- Basic block: single-entry / single-exit instruction sequence
- What about internal control flow?
  - What about if-branches instead of loop-branches?

Global Scheduling
- How to move the computation and assignment of B[i]?
  - Relative execution frequency?
  - How cheap is it to execute B[i] above the branch?
  - How much benefit to executing B[i] early? (critical path?)
  - What is the cost of compensation code for the else case?
- What about moving C[i]?
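The slide refers to a code fragment that is not reproduced in this transcription. The assumed shape of that example (the function name and the particular statements on each path are placeholders) is an if/else guarded by a computed value with a join afterwards, which is what makes the hoisting questions above meaningful:

    /* Should the compiler move the assignment to B[i] (and perhaps C[i])
     * above the branch on A[i], into the block that computes A[i]?
     * Doing so shortens the likely path but needs compensation code if
     * the else path is actually taken. */
    void global_sched_shape(int *A, int *B, int *C, int i, int x, int y) {
        A[i] = A[i] + B[i];      /* block 1: computes the branch condition */
        if (A[i] != 0)
            B[i] = x;            /* candidate to hoist above the branch    */
        else
            A[i] = y;            /* placeholder work on the other path     */
        C[i] = A[i] + B[i];      /* join point: candidate to move up too   */
    }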

Static Branch Prediction
- Simplest: predict taken
  - Misprediction rate = untaken branch frequency, which for the SPEC programs is 34%
  - The range is quite large, though (from not very accurate (59%) to highly accurate (9%))
- Predict on the basis of branch direction? (P6 on a BTB miss)
  - Backward-going branches predicted taken (loop)
  - Forward-going branches predicted not taken (if)
  - In the SPEC programs, however, most forward-going branches are taken => predict taken is better
- Predict branches on the basis of profile information collected from earlier runs
  - Misprediction varies from 5% to 22%

Trace Scheduling
- Parallelism across IF branches vs. LOOP branches?
- Two steps:
  - Trace selection: find a likely sequence of basic blocks (a trace) forming a (statically predicted or profile-predicted) long sequence of straight-line code
  - Trace compaction: squeeze the trace into few VLIW instructions; need bookkeeping code in case the prediction is wrong
- This is a form of compiler-generated speculation
  - The compiler must generate fixup code to handle cases in which the trace is not the taken path
  - Needs extra registers: undoes a bad guess by discarding
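As a present-day aside (not from the slides): one concrete way source code can feed static-prediction and layout decisions like those above is a likelihood hint such as GCC/Clang's __builtin_expect, which biases the compiler toward treating one branch direction as the common case:

    /* Hint (GCC/Clang) that the error path is unlikely; the compiler lays
     * out and statically predicts the common path as the fall-through one. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    int sum_buffer(const int *buf, int n) {
        if (unlikely(buf == 0 || n <= 0))   /* rare error path */
            return -1;
        int sum = 0;
        for (int i = 0; i < n; i++)         /* backward branch: loop, predict taken */
            sum += buf[i];
        return sum;
    }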

Trace Scheduling
- Use loop unrolling and static branch prediction to generate long traces
- Trace scheduling: bookkeeping code is needed when code is moved across trace entry and exit points

Superblocks
- Fix a major drawback of trace scheduling: entries and exits in the middle of a trace are complicated
- Superblocks use a similar process as trace generation, but superblocks are restricted to a single entry point with multiple exit points
- Scheduling (compaction) is simpler
  - Only code motion across exits must be considered
- How to get only one entrance? Tail duplication is used to create a separate block that corresponds to the portion of the trace after the entry
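A minimal C sketch of tail duplication (the function names and block contents are hypothetical): the shared tail after the join is copied so that the likely path forms a superblock with a single entry, at the cost of some code growth.

    void fast_path(void);
    void slow_path(void);
    void work_tail(void);

    /* Before: work_tail() is reached from both sides of the branch, so a
     * trace built along the likely path has a side entrance at the tail. */
    void before(int likely_cond) {
        if (likely_cond)
            fast_path();
        else
            slow_path();
        work_tail();          /* two predecessors: side entrance into the trace */
    }

    /* After tail duplication: {fast_path; work_tail} is a superblock with one
     * entry and possibly several exits; the unlikely path gets its own copy. */
    void after(int likely_cond) {
        if (likely_cond) {
            fast_path();
            work_tail();      /* superblock: single entry, scheduled aggressively */
        } else {
            slow_path();
            work_tail();      /* duplicated tail for the off-trace path           */
        }
    }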

Superblocks
- What if branches are not statically predictable?
  - Loop unrolling and trace scheduling work great when branches are fairly predictable statically
  - The same is true of memory reference dependences
- Compiler speculation is needed to solve this
  - Conditional/predicated instructions, if-conversion
  - Hardware support for exception/memory-dependence checks
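Before the hardware-level example on the next slide, a small C sketch of if-conversion (function names are hypothetical): the control dependence on the branch is turned into a data dependence, which a compiler can then map onto a conditional move or a fully predicated instruction.

    /* Branchy form: a possibly hard-to-predict branch guards the assignment. */
    int select_branchy(int cond, int a, int b) {
        int x = b;
        if (cond)
            x = a;
        return x;
    }

    /* If-converted form: no branch; the select can become a conditional
     * move or a predicated instruction on a machine that supports them. */
    int select_ifconv(int cond, int a, int b) {
        return cond ? a : b;
    }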

Hardware Support for Exposing More Parallelism at Compile Time
- Conditional or Predicated Instructions
  - Conditional instruction execution
  - Full predication: every instruction has a predicate tag (IA-64)
  - Conditional moves (Alpha, IA-32, etc.)

Example: if (R3 == 0) R1 = R2;

With a branch:               With a conditional move:
        BNEZ  R3, L              CMOVEQZ R1, R2, R3
        ADDU  R1, R2, R0
    L:

Schedule for the next few lectures
- Next Time (Mar. 17th)
  - HW#3 due Friday
  - Hardware support for software ILP
  - Itanium (IA-64) case study
- Review for midterm (Mar. 22nd)
- Midterm: March 24th