HY425 Lecture 09: Software to exploit ILP


Dimitrios S. Nikolopoulos, University of Crete and FORTH-ICS
November 4, 2010

ILP techniques
Hardware:
- Dynamic scheduling with a scoreboard
- Register renaming (Tomasulo)
- Branch prediction
- Multiple issue
- Speculation
Software:
- Instruction scheduling
- Code transformations (topic of the next lecture)

What limits ILP
Software and hardware issues:
- Limits of parallelism in programs
  - Data flow: true data dependences
  - Control flow: control dependences
  - Code generation and scheduling by the compiler
- Hardware complexity
  - Large storage structures: branch prediction, ROB, instruction window
  - Complex logic: dependence control, associative searches
  - Higher bandwidth: multiple issue, multiple outstanding instructions
  - Long latencies: memory system (caches, DRAM)

Example: add scalar to vector
Unrolling and renaming (6 cycles per iteration):

    Loop: LD    F0,0(R1)
          ADDD  F4,F0,F2
          SD    F4,0(R1)
          LD    F6,-8(R1)
          ADDD  F8,F6,F2
          SD    F8,-8(R1)
          LD    F10,-16(R1)
          ADDD  F12,F10,F2
          SD    F12,-16(R1)
          LD    F14,-24(R1)
          ADDD  F16,F14,F2
          SD    F16,-24(R1)
          ADDI  R1,R1,-32
          BNE   R1,R2,Loop

Pros: unrolling lowers the loop overhead (ADDI, BNE)
Cons: unrolling grows code size and increases register pressure
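The unrolled assembly above corresponds to a source-level transformation. A minimal C sketch of unrolling the add-scalar-to-vector loop by a factor of 4 (the function and variable names are illustrative, not from the lecture):

```c
#include <assert.h>

/* Illustrative sketch: add scalar s to each element of vector v of
 * length n, unrolled by 4. Assumes n is a multiple of 4, as in the
 * lecture's example; a real compiler adds a remainder loop. */
void add_scalar_unrolled(double *v, double s, int n)
{
    for (int i = 0; i < n; i += 4) {
        v[i]     += s;   /* four independent iterations per trip:   */
        v[i + 1] += s;   /* one branch and one index update instead */
        v[i + 2] += s;   /* of four                                 */
        v[i + 3] += s;
    }
}
```

As on the slide, the four statements in the body are independent, so a scheduler is free to interleave their loads, adds, and stores.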

Example: add scalar to vector
Unrolling and renaming with improved instruction scheduling (3.5 cycles per iteration):

    Loop: LD    F0,0(R1)
          LD    F6,-8(R1)
          LD    F10,-16(R1)
          LD    F14,-24(R1)
          ADDD  F4,F0,F2
          ADDD  F8,F6,F2
          ADDD  F12,F10,F2
          ADDD  F16,F14,F2
          SD    F4,0(R1)
          SD    F8,-8(R1)
          SD    F12,-16(R1)
          ADDI  R1,R1,-32
          BNE   R1,R2,Loop
          SD    F16,8(R1)

Example: add scalar to vector
Unrolling and renaming with dual issue (2.4 cycles per iteration):

    Integer instruction       FP instruction      Clock cycle
    Loop: LD   F0,0(R1)                           1
          LD   F6,-8(R1)                          2
          LD   F10,-16(R1)    ADDD F4,F0,F2       3
          LD   F14,-24(R1)    ADDD F8,F6,F2       4
          LD   F18,-32(R1)    ADDD F12,F10,F2     5
          SD   F4,0(R1)       ADDD F16,F14,F2     6
          SD   F8,-8(R1)      ADDD F20,F18,F2     7
          SD   F12,-16(R1)                        8
          SD   F16,-24(R1)                        9
          ADDI R1,R1,-40                          10
          BNE  R1,R2,Loop                         11
          SD   F20,-32(R1)                        12

Predict branches as always taken
Example: DSUBU and BEQZ need to stall waiting for the LD result:

    LD    R1,0(R2)
    DSUBU R1,R1,R3
    BEQZ  R1,L
    OR    R4,R5,R6
    DADDU R10,R4,R3
    L: DADDU R7,R8,R9

If the branch is predicted taken, the control-dependent DADDU at L can be speculatively moved before the branch to eliminate the stall:

    LD    R1,0(R2)
    DADDU R7,R8,R9   #speculative
    DSUBU R1,R1,R3
    BEQZ  R1,L
    OR    R4,R5,R6
    DADDU R10,R4,R3
    L:

Note that the moved DADDU is not data-dependent on the OR or on the first DADDU.

Static branch prediction alternatives
Simple offline branch prediction schemes:
- Predict always taken: 34% misprediction rate, high variance
- Predict by branch direction (forward not taken, backward taken): misprediction rates of 30%-40%
- Predict based on an execution profile (branch bias: mostly taken or mostly not taken): accuracy is sensitive to the input
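The direction-based heuristic above can be stated in a few lines of C. This is a sketch under the usual assumption that a backward branch (target at a lower address) closes a loop and is therefore predicted taken; the function name is mine:

```c
#include <stdbool.h>

/* Sketch of the BTFNT (backward taken, forward not taken) static
 * heuristic: predict taken iff the branch target lies at a lower
 * address than the branch itself. */
bool predict_taken_btfnt(unsigned branch_pc, unsigned target_pc)
{
    return target_pc < branch_pc;   /* backward branch: assume a loop */
}
```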

Performance of profile-based branch prediction (SPECCPU92 results)
[Bar chart: misprediction rates of profile-based prediction for compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, and su2cor, ranging from roughly 5% to 22%.]

Profile-based vs. static prediction (SPECCPU92 results)
[Bar chart: instructions executed between mispredicted branches for the same benchmarks, comparing static and profile-based prediction; profile-based prediction reaches up to roughly 250 instructions between mispredictions.]

VLIW processors
Statically scheduled multiple-issue processors:
- Reduce hardware cost compared to dynamically scheduled processors
- Require advanced compiler support for exploiting ILP
- Instructions are scheduled in packets, with no dependences among the instructions in a packet
- Long instruction word (64+ bits); parallelism among instructions is explicit
- The compiler guarantees that instructions are independent
- Multiple functional units
- Parallelism is exploited via loop unrolling and instruction scheduling

Add scalar to vector example
Unrolling and code scheduling with 2 load/store, 2 INT, and 1 FP unit; 1.29 cycles per iteration (vs. 2.4 in the two-issue superscalar):

    Memory ref 1      Memory ref 2      FP op 1           FP op 2           INT op/branch
    LD F0,0(R1)       LD F6,-8(R1)
    LD F10,-16(R1)    LD F14,-24(R1)
    LD F18,-32(R1)    LD F22,-40(R1)    ADDD F4,F0,F2     ADDD F8,F6,F2
    LD F26,-48(R1)                      ADDD F12,F10,F2   ADDD F16,F14,F2
                                        ADDD F20,F18,F2   ADDD F24,F22,F2
    SD F4,0(R1)       SD F8,-8(R1)      ADDD F28,F26,F2
    SD F12,-16(R1)    SD F16,-24(R1)
    SD F20,-32(R1)    SD F24,-40(R1)                                        ADDI R1,R1,-56
    SD F28,8(R1)                                                            BNE R1,R2,Loop

VLIW processors
Statically scheduled multiple-issue processors:
- Parallelism is sought within and across basic blocks
- Hardware support for predicated instruction execution
Design implications:
- Increased code size
- Lock-step execution of instruction bundles: a stall in one FU stalls the entire processor pipeline
- Hard to schedule instructions around cache misses
- Solution: check hazards and dependences at issue time, and use hardware to enable unsynchronized execution

VLIW processors (cont.)
Code compatibility:
- An old binary cannot run if the number of FUs changes or the format of the instruction bundles changes
- Binary translation across hardware generations
- Binary translation from superscalar to VLIW
- A superscalar runs unmodified binaries from previous hardware generations; only code scheduling may require changes, for performance

Compiler dependence analysis of source code
Loop-level analysis for detecting loop-carried dependences:
- Dependences between instructions in two lexicographically ordered iterations of a loop
- Lexicographical ordering produces the equivalent of a sequential, in-order execution of all instructions in a loop
- Independent iterations can be unrolled at will
- Independent iterations can execute in parallel: the key to exploiting multiple processors

Example:

    for (i=1; i<=100; i=i+1) {
      A[i+1] = A[i] + C[i];    /* S1 */
      B[i+1] = B[i] + A[i+1];  /* S2 */
    }

- Loop-carried dependence on S1 (A[i+1] depends on A[i] from the previous iteration)
- Loop-carried dependence on S2 (B[i+1] depends on B[i])
- Same-iteration dependence on S2 (B[i+1] depends on A[i+1])
- Loop-carried dependences may or may not prevent parallelization

Example
Can this loop be parallelized?

    for (i=1; i<=100; i=i+1) {
      A[i] = B[i] + C[i];    /* S1 */
      B[i+1] = C[i] + D[i];  /* S2 */
    }

Peel one iteration from each end of the loop; notice that no iteration then produces a result for a future iteration:

    A[1] = B[1] + C[1];
    for (i=1; i<=99; i=i+1) {
      A[i+1] = B[i+1] + C[i+1];  /* S1 */
      B[i+1] = C[i] + D[i];      /* S2 */
    }
    B[101] = C[100] + D[100];

The loop-carried dependence is eliminated.

Limitations of analyzing memory references
- Static analysis indicates only that there may be a dependence between two instructions, due to the naming of memory locations
- Dependence resolution requires disambiguation of memory references: easy for scalar variables, harder for arrays, hard for pointers
- Dependences do not always prevent parallelization

Uncovering parallelism in loops with dependences:

    for (i=6; i<=100; i=i+1) {
      A[i] = A[i-5] + A[i];  /* S1 */
    }

The dependence distance is 5, so there are no loop-carried dependences within any window of 5 consecutive iterations.
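Because the dependence distance in the last loop is 5, any 5 consecutive iterations touch disjoint elements. A hypothetical restructuring (the blocking and names are mine, not from the slides) that exposes this:

```c
/* Sketch: the loop A[i] = A[i-5] + A[i] has dependence distance 5,
 * so each group of 5 consecutive iterations is independent and could
 * run in parallel (e.g. on 5 lanes or processors). Indices run from
 * 6 to n, as on the slide. */
void add_with_distance5(double *A, int n)
{
    for (int i = 6; i <= n; i += 5) {
        /* the (up to) 5 statements in this group are mutually independent */
        for (int j = i; j < i + 5 && j <= n; j++)
            A[j] = A[j - 5] + A[j];
    }
}
```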

Dependence testing: the GCD test
Assume affine array indices: index = a*i + b. An index into a multi-dimensional array is affine if the index in each dimension is affine.
Given two references a*j + b and c*k + d to the same array, check whether:
- both array elements are within the loop bounds: m <= j <= n, m <= k <= n
- j precedes k in lexicographical ordering
- a*j + b = c*k + d

GCD test: a dependence is possible only if GCD(a, c) divides (d - b).
This is a necessary but not sufficient condition, and a, b, c, d and the loop bounds need to be known at compile time.

Eliminating dependences through renaming
Finding dependences in source code:

    for (i=1; i<=100; i=i+1) {
      Y[i] = X[i] / c;  /* S1 */
      X[i] = X[i] + c;  /* S2 */
      Z[i] = Y[i] + c;  /* S3 */
      Y[i] = c - Y[i];  /* S4 */
    }

- True dependences: S1 -> S3, S1 -> S4
- Antidependences: S1 -> S2, S3 -> S4
- Output dependence: S1 -> S4
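The GCD test itself is mechanical. A minimal C sketch (function names are illustrative) that checks whether a dependence between references a*j + b and c*k + d is possible:

```c
/* Euclid's algorithm; result is non-negative. */
static int gcd(int a, int b)
{
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a < 0 ? -a : a;
}

/* GCD test: a dependence between a*j + b and c*k + d is possible only
 * if gcd(a, c) divides (d - b). Returning 0 proves independence;
 * returning 1 does NOT prove dependence, since the test is necessary
 * but not sufficient and ignores the loop bounds. */
int gcd_test_may_depend(int a, int b, int c, int d)
{
    int g = gcd(a, c);
    if (g == 0)                 /* both strides zero: same element iff b == d */
        return b == d;
    return (d - b) % g == 0;
}
```

For example, a write to A[2*i] and a read of A[2*i+1] give a=2, b=0, c=2, d=1: gcd(2,2)=2 does not divide 1, so the references are provably independent.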

Eliminating dependences through renaming (cont.)
Renaming resolves the output and anti-dependences:

    for (i=1; i<=100; i=i+1) {
      T1[i] = X[i] / c;   /* S1 */
      T2[i] = X[i] + c;   /* S2 */
      Z[i]  = T1[i] + c;  /* S3 */
      Y[i]  = c - T1[i];  /* S4 */
    }

Limits of dependence analysis
Examples of hard-to-analyze cases:
- Pointer references: determining whether two pointers reference the same memory location is undecidable for dynamically allocated data structures, and hard when the code uses pointer arithmetic
- Array-indexed arrays, sparse arrays, indirect references
- Input-dependent dependences
- Inter-procedural dependences; analysis beyond basic blocks
- Conservatism of the analysis: correctness precedes performance in compilers

Other compiler optimizations
Copy propagation:

    ADDI R1,R2,4
    ADDI R1,R1,4

becomes

    ADDI R1,R2,8

Tree height reduction:

    ADD R1,R2,R3
    ADD R4,R1,R6
    ADD R8,R4,R7

becomes

    ADD R1,R2,R3
    ADD R4,R7,R6
    ADD R8,R1,R4

This assumes addition is associative, which is not true in FP arithmetic.

Software pipelining: symbolic loop unrolling
Benefits of loop unrolling with reduced code size:
- Instructions in the loop body are selected from different loop iterations
- Increases the distance between dependent instructions
[Figure: a software-pipelined iteration assembles instructions drawn from iterations 0 through 4.]
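Tree height reduction is easiest to see on a four-operand sum. A sketch in integer arithmetic, where associativity actually holds (function names are mine):

```c
/* Serial form: three dependent additions, dependence chain depth 3. */
int sum4_serial(int a, int b, int c, int d)
{
    return ((a + b) + c) + d;
}

/* Reassociated form: the two inner sums are independent and can
 * execute in parallel, reducing the chain depth to 2. Valid for
 * integers; NOT generally valid for floating point. */
int sum4_balanced(int a, int b, int c, int d)
{
    return (a + b) + (c + d);
}
```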

Software pipelined loop
Loop unrolled 3 times:

    Iteration i:   LD   F0,0(R1)
                   ADDD F4,F0,F2
                   SD   F4,0(R1)
    Iteration i+1: LD   F0,0(R1)
                   ADDD F4,F0,F2
                   SD   F4,0(R1)
    Iteration i+2: LD   F0,0(R1)
                   ADDD F4,F0,F2
                   SD   F4,0(R1)

Software pipelined:

    Loop: SD   F4,16(R1)  #store to v[i]
          ADDD F4,F0,F2   #add to v[i-1]
          LD   F0,0(R1)   #load v[i-2]
          ADDI R1,R1,-8
          BNE  R1,R2,Loop

5 cycles/iteration (with dynamic scheduling and renaming); needs startup/cleanup code.

Software pipelining (cont.)
SW pipelined loop with startup and cleanup code:

    #startup, assume i runs from 0 to n
          ADDI R1,R1,-16  #point to v[n-2]
          LD   F0,16(R1)  #load v[n]
          ADDD F4,F0,F2   #add v[n]
          LD   F0,8(R1)   #load v[n-1]
    #body, for (i=2; i<=n-2; i++)
    Loop: SD   F4,16(R1)  #store to v[i]
          ADDD F4,F0,F2   #add to v[i-1]
          LD   F0,0(R1)   #load v[i-2]
          ADDI R1,R1,-8
          BNE  R1,R2,Loop
    #cleanup
          SD   F4,8(R1)   #store v[1]
          ADDD F4,F0,F2   #add v[0]
          SD   F4,0(R1)   #store v[0]
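The same restructuring can be written at the source level. A hedged C sketch of the 3-stage software pipeline (store, add, and load drawn from three different iterations), with my own prologue and epilogue rather than the slide's exact startup/cleanup code:

```c
/* Source-level analogue of the slide's software pipeline: in steady
 * state each trip stores element i, adds element i+1, and loads
 * element i+2. Requires n >= 2. */
void add_scalar_swp(double *v, double s, int n)
{
    double loaded = v[0];          /* prologue: load v[0]            */
    double sum    = loaded + s;    /*           add  v[0]            */
    loaded = v[1];                 /*           load v[1]            */
    for (int i = 0; i < n - 2; i++) {
        v[i]   = sum;              /* store result of iteration i    */
        sum    = loaded + s;       /* add for iteration i+1          */
        loaded = v[i + 2];         /* load for iteration i+2         */
    }
    v[n - 2] = sum;                /* epilogue: store v[n-2]         */
    v[n - 1] = loaded + s;         /*           add and store v[n-1] */
}
```

In the loop body, the store, add, and load belong to three different iterations, so no trip contains a load feeding an add feeding a store back-to-back.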

Software pipelining versus unrolling
Performance effects of SW pipelining vs. unrolling:
- Unrolling reduces the loop overhead per iteration
- SW pipelining reduces the startup/cleanup pipeline overhead
[Figure: (a) software pipelining overlaps iterations with a single startup and cleanup phase; (b) loop unrolling pays overlap overhead between each group of unrolled iterations.]

Software pipelining (cont.)
Advantages:
- Less code space than conventional unrolling
- The loop runs at peak speed during steady state
- Overhead only at loop initiation and termination
- Complements unrolling
Disadvantages:
- Hard to overlap long latencies
- Unrolling combined with SW pipelining requires advanced compiler transformations

Predication
Conditional move:
- A predicated instruction packs a conditional and an instruction
- The instruction is control-dependent on the conditional
- If the conditional is false the instruction is converted to a no-op, otherwise it is executed
- Converts a control dependence into a data dependence

    #if (A==0) {S=T;}
    # simple translation
          BNEZ  R1,L       #if (A==0)
          ADDI  R2,R3,0    #S=T;
    L:
    # predicated instruction
          CMOVZ R2,R3,R1   #move T to S if R1==0

Generalized predication
- Predicates applied to all instructions
- Enables predicated execution of large code blocks
- Speculatively puts time-critical instructions under predicates

    No predication
    Slot 1              Slot 2
    LW   R1,40(R2)      ADD R3,R4,R5
    BEQZ R10,L          ADD R6,R3,R7
    LW   R8,0(R10)
    LW   R9,0(R8)

    Predication
    Slot 1              Slot 2
    LW   R1,40(R2)      ADD R3,R4,R5
    LWC  R8,0(R10),R10  ADD R6,R3,R7
    BEQZ R10,L
    LW   R9,0(R8)
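Conditional-move predication has a direct C analogue. A sketch (function names are mine) of converting the control dependence `if (A==0) S=T;` into a data dependence via a mask, as a CMOVZ would do in hardware:

```c
/* Branchy version of: if (A == 0) S = T; return S; */
int select_branchy(int A, int S, int T)
{
    if (A == 0) S = T;
    return S;
}

/* Predicated version: the condition becomes a data input. The mask is
 * all ones when A == 0 and zero otherwise, so the result is computed
 * without any branch. */
int select_predicated(int A, int S, int T)
{
    int mask = -(A == 0);             /* 0xFFFFFFFF if A==0, else 0 */
    return (T & mask) | (S & ~mask);
}
```

Both functions compute the same result; the predicated form simply trades a control dependence for a data dependence on `mask`.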

Predication implementation issues
Preserving control flow, data flow, and precise interrupts:
- Speculative predicated instructions may not throw illegal exceptions: LWC may not throw an exception if R10 == 0, but may throw a recoverable page fault if R10 != 0
- Converting an instruction to a no-op: early condition detection may not be possible due to data dependences, while late condition detection incurs stalls and consumes pipeline resources needlessly
- Instructions may be dependent on multiple branches
- The compiler must be able to find instruction slots and reorder code

Hardware support for speculation
Alternatives for handling speculative exceptions:
- Hardware and OS ignore exceptions from speculative instructions
- Mark speculative instructions and check for exceptions: additional instructions check for exceptions and recover
- Registers marked with poison bits catch exceptions upon read
- Hardware buffers instruction results until the instruction is no longer speculative

Exception classes:
- Recoverable: an exception from a speculative instruction may harm performance, but not the precision of exceptions
- Unrecoverable: an exception from a speculative instruction compromises precision

Solution I: ignore exceptions
HW/SW solution:
- An instruction causing an exception returns an undefined value
- The value is not used if the instruction is speculative
- Incorrect result if the instruction is non-speculative: the compiler generates code to throw a regular exception
- Rename registers receiving speculative results

Non-speculative:

    # if (A==0) A=B; else A=A+4;
          LD   R1,0(R3)   ;load A
          BNEZ R1,L1      ;test A
          LD   R1,0(R2)   ;then load B
          J    L2
    L1:   ADDI R1,R1,4    ;else
    L2:   SD   R1,0(R3)   ;store A

Speculative:

    # if (A==0) A=B; else A=A+4;
          LD   R1,0(R3)   ;load A
          LD   R4,0(R2)   ;speculative load B
          BEQZ R1,L3      ;test A
          ADDI R4,R1,4    ;else
    L3:   SD   R4,0(R3)   ;non-speculative store

Solution II: mark speculative instructions

    # if (A==0) A=B; else A=A+4;
          LD   R1,0(R3)     ;load A
          SLD  R4,0(R2)     ;speculative load B
          BNEZ R1,L1        ;test A
          CHK  R4,recover   ;speculation check
          J    L2           ;skip else
    L1:   ADDI R4,R1,4      ;else
    L2:   SD   R4,0(R3)     ;store A
    recover: ...

An instruction checks the speculation status and jumps to recovery code if an exception occurred (cf. the Itanium CHK instruction).

Solution III: poison bits

    # if (A==0) A=B; else A=A+4;
          LD   R1,0(R3)   ;load A
          SLD  R4,0(R2)   ;speculative load B
          BEQZ R1,L3      ;test A
          ADDI R4,R1,4    ;else
    L3:   SD   R4,0(R3)   ;store A

- R4 is marked with a poison bit; the use of R4 in SD raises an exception if SLD raised one
- The exception is generated when the result of the offending instruction is used for the first time
- OS code needs to save poison bits during context switches

Performance of VLIW processors
Itanium vs. Alpha vs. Pentium 4:
- Low INT performance
- Better FP performance, though highly application-dependent
- Poor power efficiency (performance/watt)

What next?
Alternatives for exploiting parallelism:
- Vector processors and SIMD (next lecture)
- Simultaneous multithreading (lecture after next)