1. Chapter 3: Exploiting ILP

1. Chapter 3: Exploiting ILP

Simple static techniques for finding ILP: basic scheduling, renaming, and loop unrolling (data, name, & control)
Hardware (dynamic) techniques for reducing dependences (dependences limit ILP):
1. Branch result prediction (control)
2. Branch target prediction (control)
3. Dynamic scheduling with renaming (data & name)
4. Speculation (control)
Hardware techniques for finding ILP:
1. Pipelining
2. Multiple instruction issue
Approaches for multiple instruction issue:
1. Superscalar
2. VLIW
Limitations of ILP

2. Dependences Limit ILP

If two insts have a dependence, some ordering is required for correctness. Types of dependence:
Data dependence: inst requires data computed by preceding inst(s)
Name dependence: two insts (one or both of which write) access the same reg/mem location
Control dependence: whether or not an inst is reached is determined by previous insts

Data dependence: caused by the flow of data values among insts.
RAW (true dep): inst i produces a result (Write) required as input (Read) by inst j
inst i must compute its result before inst j is allowed to continue
No fix (short of value prediction), but can be ameliorated by dynamic scheduling

3. Data & Name Dependences

NOTE: inst i comes before inst j in the program.
Name dependence: caused when two insts use the same storage location (reg/mem):
WAW (output dep): inst i Writes to the same loc as inst j
WAR (antidependence): inst j Writes a loc Read by inst i
Can be fixed by using different locations for insts i and j; this is known as renaming.
4. Static Scheduling

Instead of using hardware to do dynamic scheduling (Tomasulo), have the compiler pre-arrange the inst stream so that hazards are minimized.
Without dynamic execution, the compiler must schedule code to avoid dependences:
Means pipe-stage information must be available to the compiler
When the machine is updated, programs must be recompiled
If a hazard is detected, the inst stalls in ID with no new inst fetched or issued until the dependence is cleared.
Some machines even rely on the compiler to insert stalls in the face of hazards, so the hardware does not have to do checks!
Compiler techniques are used to statically schedule insts to avoid or minimize stalls:
Schedule independent insts between insts with a true dep
Statically perform other transforms to tackle other dependence types (e.g., explicit register renaming to avoid name dependences)

5. Assumed Effective Latencies

We will examine some transforms that a compiler might make, and unless otherwise noted, we will assume the following latencies between a dependent set and use:

Producer   User     Latency (cycles)
Load       use      1
Load       Store    0
iop        iop      0
iop        branch   1
FP op      FP op    3
FP op      Store    2

Simple 5-stage MIPS pipeline; the effective latency depends on the inst pair:
Need to schedule 1 inst between a load and a dependent use to avoid a stall
The branch computes its condition in ID

6. Instruction Scheduling

The ILP available in a basic block is typically small; a variety of techniques exploit it across basic blocks. The most common way to increase ILP exploits parallelism among iterations of a loop.

Original C code:
    for (i = 999; i >= 0; i--)
        x[i] = x[i] + s;

Original assembly code:
    LOOP: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)
          DADDUI R1,R1,#-8
          BNE    R1,R2,LOOP

Would need a nop in the delay slot! The branch calculates its condition in ID, so there is a delay between the dependent iop and the branch.

Original with stalls (10 cycles/iter):
    LOOP: L.D    F0,0(R1)
          stall
          ADD.D  F4,F0,F2
          stall
          stall
          S.D    F4,0(R1)
          DADDUI R1,R1,#-8
          stall
          BNE    R1,R2,LOOP
          stall

Reordered code (6 cycles/iter):
    LOOP: L.D    F0,0(R1)
          DADDUI R1,R1,#-8
          ADD.D  F4,F0,F2
          stall
          BNE    R1,R2,LOOP
          S.D    F4,8(R1)

The offset of the S.D changed to fill the delay slot. Cycles reduced from 10 to 6. We will see that software pipelining can reduce this further.
7. Loop Unrolling

Loop unrolling duplicates the body of the loop multiple times, which can reduce loop overhead and increase scheduling (& other optimization) opportunities:
Must have a loop cleanup and guard
After unrolling, want to remove the duplicate loop control

Here's an example (different from the assembly version) at the source level:

    for (i = 0; i < n; i++)
        a[i] = a[i] + s;

Unrolled by 4, with a cleanup loop:

    for (i = 0; i < n % 4; i++)
        a[i] = a[i] + s;
    for (; i < n; i++) {
        a[i] = a[i] + s; i++;
        a[i] = a[i] + s; i++;
        a[i] = a[i] + s; i++;
        a[i] = a[i] + s;
    }

After removing the duplicate loop control:

    for (i = 0; i < n % 4; i++)
        a[i] = a[i] + s;
    for (; i < n; i += 4) {
        a[i]   = a[i]   + s;
        a[i+1] = a[i+1] + s;
        a[i+2] = a[i+2] + s;
        a[i+3] = a[i+3] + s;
    }

8. Loop Unrolling at the Assembly Level

    for (i = 999; i >= 0; i--)
        x[i] = x[i] + s;

We unroll, register-rename each iteration, and do induction-variable elimination to get (ignoring cleanup and guard):

Unrolled loop:
    Loop: L.D   F0,0(R1)
          ADD.D F4,F0,F2
          S.D   F4,0(R1)
          ADDIU R1,R1,#-8
          L.D   F0,0(R1)
          ADD.D F4,F0,F2
          S.D   F4,0(R1)
          ADDIU R1,R1,#-8
          L.D   F0,0(R1)
          ADD.D F4,F0,F2
          S.D   F4,0(R1)
          ADDIU R1,R1,#-8
          L.D   F0,0(R1)
          ADD.D F4,F0,F2
          S.D   F4,0(R1)
          ADDIU R1,R1,#-8
          BNE   R1,R2,Loop

Removed name & data dep (registers renamed, offsets adjusted):
    Loop: L.D   F0,0(R1)
          ADD.D F4,F0,F2
          S.D   F4,0(R1)
          L.D   F6,-8(R1)
          ADD.D F8,F6,F2
          S.D   F8,-8(R1)
          L.D   F10,-16(R1)
          ADD.D F12,F10,F2
          S.D   F12,-16(R1)
          L.D   F14,-24(R1)
          ADD.D F16,F14,F2
          S.D   F16,-24(R1)
          ADDIU R1,R1,#-8
          ADDIU R1,R1,#-8
          ADDIU R1,R1,#-8
          ADDIU R1,R1,#-8
          BNE   R1,R2,Loop

9. Loop Unrolling without Scheduling

Unrolled loop (induction variable increments combined):
    Loop: L.D   F0,0(R1)
          ADD.D F4,F0,F2
          S.D   F4,0(R1)
          L.D   F6,-8(R1)
          ADD.D F8,F6,F2
          S.D   F8,-8(R1)
          L.D   F10,-16(R1)
          ADD.D F12,F10,F2
          S.D   F12,-16(R1)
          L.D   F14,-24(R1)
          ADD.D F16,F14,F2
          S.D   F16,-24(R1)
          ADDIU R1,R1,#-32
          BNE   R1,R2,Loop

Unrolled loop with stalls (28 cycles):
    Loop: L.D   F0,0(R1)
          stall
          ADD.D F4,F0,F2
          stall
          stall
          S.D   F4,0(R1)
          L.D   F6,-8(R1)
          stall
          ADD.D F8,F6,F2
          stall
          stall
          S.D   F8,-8(R1)
          L.D   F10,-16(R1)
          stall
          ADD.D F12,F10,F2
          stall
          stall
          S.D   F12,-16(R1)
          L.D   F14,-24(R1)
          stall
          ADD.D F16,F14,F2
          stall
          stall
          S.D   F16,-24(R1)
          ADDIU R1,R1,#-32
          stall
          BNE   R1,R2,Loop
          stall

10. Loop Unrolling with Scheduling

Compare the 28-cycle unrolled loop with stalls (above) against the scheduled version:

Scheduled loop (no stalls!):
    Loop: L.D   F0,0(R1)
          L.D   F6,-8(R1)
          L.D   F10,-16(R1)
          L.D   F14,-24(R1)
          ADD.D F4,F0,F2
          ADD.D F8,F6,F2
          ADD.D F12,F10,F2
          ADD.D F16,F14,F2
          S.D   F4,0(R1)
          S.D   F8,-8(R1)
          ADDIU R1,R1,#-32
          S.D   F12,16(R1)     ; was -16, adjusted by +32
          BNE   R1,R2,Loop
          S.D   F16,8(R1)      ; was -24, adjusted by +32; fills delay slot

Moved 2 S.D past the ADDIU (offsets adjusted by +32). The loop now takes 14/4 = 3.5 cycles/iter, rather than 28/4 = 7. It took 6 cycles/iter after scheduling but before unrolling!
11. Reducing Branch Costs with Prediction

Branches cause control (branch) hazards: we saw that the need to compute the branch target and condition can cause stalls in the pipeline.
We have seen that name hazards can be fixed by renaming.
We (will) see that data dep can be helped by scheduling & OOE.
We saw that the compiler/programmer can help with control hazards via unrolling.
Hardware can further improve branch performance using prediction:
1. Branch prediction: predicts whether a branch is taken or untaken
   In static prediction, the compiler/programmer indicates whether the br is taken
   In dynamic prediction, hardware (a br predictor) guesses whether the br is taken
2. Branch target prediction: predicts where a taken branch will go
   Only untaken targets are easily handled statically
   Can have the front end precompute simple branch targets
   Dynamically predicted using a branch target buffer
We need both if branches are not to cause pipe flushes.

12. Branch Prediction Buffers (AKA Branch History Table)

Dynamic branch prediction is done by a special cache called a branch prediction buffer:
Cache indexed by the lower portion of the branch address (branches far apart may overwrite each other)
Unlike a data cache, we don't check whether this was the right branch at all!
Contains a bit indicating whether the branch was taken or fell through (last time)
Can increase the bits used to capture more complex behavior
Gives decent accuracy with very little memory
If the branch is taken, this is no help unless the branch target is also known; the BPB alone helps only if we calculate the target address very early. Schemes for branch target help are discussed later.

13. 1-Bit Branch Prediction

Only 1 bit per entry: if the prediction is wrong, invert the bit.
Will typically mispredict a loop br twice (possibly only once for the initial execution, depending on the initial bit value — unknowable, as the entry may belong to another br entirely).

Example: how often will these two branches be mispredicted?

    for (i = 0; i < 100; i++)
        if (i & 1) { A } else { B }

14. 2-Bit Branch Prediction

Use extra bits for prediction:
Provides stickiness, so the prediction doesn't thrash; branches that favor one way are mispredicted less
Must update the bits every cycle (so will read and write every cycle); 1-bit updates only on a mispredict
2-bit saturating counter: taken increments toward state 11, not taken decrements toward 00; predict taken in states 11 and 10, not taken in 01 and 00 [Comp Arch, Henn & Patt, Fig 2.4, pg 83 (4th ed)]
On the alternating if branch in the example, essentially halves the # of mispredicts (the counter bounces back and forth horizontally between the weak states). Now only as good as static :(

15. Correlating (Two-Level) Branch Prediction Buffers

Try to improve prediction by examining the recent behavior of preceding branches.
An (m,n) predictor uses the behavior of the last m dynamic branches to choose between 2^m predictors, each of which uses n bits of stickiness.
Records the most recent m dynamic branch results using an m-bit shift register.

16. (m,n) Branch Prediction Buffer with N_e Entries

Requires 2^m * n * N_e bits of storage (N_e = # entries). Machines almost never use n > 2.
(0,1) is the conventional 1-bit predictor; (0,2) the conventional 2-bit predictor. These require 2^0 * n * N_e = n * N_e bits of storage.
In our example the last two dynamic branches are the if and the loop control, so m = 2 will allow us to perfectly predict the if; m = 1 would capture only the loop control, which is not correlated with the if behavior.
(2,2) predictor (N_e = 10): entry used e = (branch address / 4) mod N_e
Each column is n bits wide; each column corresponds to a (0,n) local predictor. Select among the 2^m columns using the branch history stored in the shift register.
Said to have m bits of global history and n bits of local history.
During the loop, we will flip between the history-11 & history-01 predictors (if branch alternating, loop branch always taken), which will remain in a steady state of correct prediction!

17. Advantages of Global Prediction

[Comp Arch, Henn & Patt, Fig 3.3, pg 165]
Little advantage to adding more entries to a (0,2) predictor.
A (2,2) pred beats a (0,2) of the same size — and even beats a (0,2) of infinite size!

18. Tournament Branch Prediction

The results of some branches correlate with prior branches, but not always; in those cases a local predictor may indeed be better. Therefore, we can combine a local and a global predictor in a tournament predictor:
Uses multiple predictors, usually one local [e.g. (0,2)] and one global [e.g. (2,2)]
Uses a sticky selector, like a (0,2) counter, to choose between the local & global predictor (next slide)
Makes better use of bits than just pumping up the entries of a strictly local or strictly global predictor
Must read & update all predictors in parallel
Tournament predictors are the most accurate branch predictors in use today.

19. State Diagram for Tournament Predictor

The selector input is res1/res2, where resX = 0 means predictor X was wrong and resX = 1 means it was right:
Never change state if both were right or both wrong (0/0, 1/1)
Strengthen the current choice if it was right and the other wrong
Weaken the current choice if it was wrong and the other right
(Two "use pred 1" states and two "use pred 2" states; 1/0 moves toward pred 1, 0/1 moves toward pred 2.)

20. Misprediction Rate for Three Predictor Types

[Comp Arch, Henn & Patt, Fig 2.8, pg 88; based on SPEC89 benchmarks]
The tournament pred is best.
A small size (~32K bits) does pretty well; little improvement beyond 128K.
Why does pumping up N_e have so little effect?

21. Is Branch Prediction Sufficient?

The idea behind branch prediction is to know where to fetch the next inst from, to avoid draining the pipe.
If we predict taken, we still don't know where to fetch from until the branch target (BT) is computed.
Predicting taken/untaken helps hide late-stage condition detection, but taken branches are still penalized until the BT-computing stage. Still helpful for long-running conditions, like floating-point comparisons.
We need a way to predict the BT as well as whether the branch will be taken if we are to avoid branch penalties, even assuming perfect prediction!
Branch prediction alone may be very helpful with multi-issue & an Instruction Fetch Unit, which may precompute the BT (only indirect jumps require reading a non-PC register).

22. Simple Branch Target Buffer

Stores only taken branches (untaken branches behave like non-branches). Replaces the (0,1) predictor!
Checked during the IF stage for earliest resolution.
Must store the relevant bits of the PC (a tag) to avoid flushing the pipe on a non-branch!
Requires more bits than branch prediction (the PC has implied bits due to alignment, and possibly the entry #); keeping/checking the PC of the inst is high overhead, and the extra bits make updates more expensive as well.
Some BT buffers store the actual BT instructions, rather than the target address; hardware can then do branch folding.
If we keep pred bits (n > 1) in the buffer, we must store both taken and untaken branches.
Can combine with an independent br pred, and then keep the BT buffer small.

23. Simple Branch Target Buffer Access & Penalties

[Comp Arch, Henn & Patt, Fig 2.23, pg 124]
Without a delay slot, there is a 2-cycle penalty for a mispredict: 1 cycle of fetching the wrong inst (the error is discovered in ID), and 1 cycle to update the BT buffer.

Penalty table:
In buff?  Pred    Actual   Penalty
Yes       taken   taken    0
Yes       taken   untaken  2
No        --      taken    2
No        --      untaken  0

24. Integrated Instruction Fetch Unit

To feed wide backends, many architectures use an Integrated Instruction Fetch Unit, which performs:
Integrated branch prediction: when an inst is fetched, it is decoded so that branches are identified. If they are, at least simple br targets are computed (PC+immed). The branch is predicted, so that we can fetch the correct instructions.
Instruction prefetch: using br pred, instructions are fetched into a buffer before they are accessed, to provide a pool of schedulable insts for the backend.
Instruction memory access and buffering: multiple insts may cross cache boundaries, so buffer them in a known queue using prefetching and feed them to the backend as needed.
The IFU is even more important on x86, where insts are of varying length, so getting one inst may take multiple reads.

25. Dynamic Scheduling

Hardware is designed to reorder instructions on the fly as needed. Insts are issued in order, but possibly executed out of order: we can schedule unrelated insts between RAW insts.
Advantages: avoids many data stalls, allows greater use of multiple FUs, does not require a good compiler, and is easily combined with renaming to solve name dep.
Disadvantages: complicates the hardware (possibly slowing the clock rate) and makes execution less predictable.
There are different ways of doing dynamic scheduling; we will use Tomasulo's Approach.

26. Dynamic Scheduling / Out-of-Order Execution

To support dynamic scheduling in our pipe, we must split ID into two stages:
1. Issue: decode the inst, check for structural hazards
2. Read operands: wait until there are no data hazards, then read operands
While issue is in order, execution and completion need not be:
    DIV.D F0,F2,F4
    ADD.D F10,F0,F8
    SUB.D F12,F8,F14
    SUB.D F3,F5,F6
No dep, so start the SUBs while DIV is still running (why does this not work for in-order?). With multiple fadders, EX both SUBs at the same time. Later insts may finish before DIV.D (out-of-order completion).
OOE/OOC complicates exception handling (in-order commit, discussed later).

27. Tomasulo Approach

Invented by Robert Tomasulo of IBM. Uses a level of indirection called reservation stations to track when operands are available (defeating RAW hazards) and enable OOE, and register renaming to avoid WAR/WAW hazards:
As an inst is issued, register specifiers are renamed to a reservation station or load buffer entry
Res stations/buffers are updated via a common data bus (CDB) coming from the FUs
Insts wait until the CDB provides data, and begin EX when it is available (dynamic scheduling)
Store buffers can tell which EAs are truly different (dynamic memory disambiguation), allowing independent LDs to complete before preceding STs

28. Tomasulo's Algorithm for MIPS FPU

[Comp Arch, Henn & Patt, Fig 2.9, pg 94]
The inst queue buffers insts FIFO from the inst unit. Each FPU has a res station with multiple entries.
An RS entry holds: the inst, operands, and flow info.
LD buffers hold: the EA components until calculated, the EA while waiting for the mem bus, and the value while waiting for the CDB.
ST buffers hold: the EA components until calculated, the EA while awaiting the data to ST, and both while waiting for the mem unit.

29. Tomasulo Algorithm Steps

The TA requires a new way of viewing insts compared with our ipipe; each step may take multiple cycles (and be pipelined!):
1. Issue: get the inst from the queue, and issue it to the appropriate res stat/buffer if one is available (otherwise, stall on a structural hazard). If the operands are available in regs, write the values to the res stat; else add pointers to the res stat/buffer that will have the results. Thus input is renamed, defeating WAW and WAR hazards.
2. Execute: monitor the CDB until all operands are available (RAW), and write them to the res stat entry. When all operands are available, execute the operation, assuming prior branches have completed. Multiple insts may be ready at the same time. LD/ST first compute the EA, then perform the op. STs require ordering with other STs and LDs.
3. Write result: write the result to the CDB and from there to the registers, reservation stations, and ST buffers. STs write data to mem: when both the EA and value are available, they are sent to the MU, and the ST completes.

30. Bookkeeping for Tomasulo's Algorithm

In order to detect & eliminate hazards we need bookkeeping info:
1. Reservation station/buffer fields:
   Op: operation to perform on source operands S1 and S2. Set during issue.
   Qj, Qk: res stat entries that will produce S1 and S2; 0 means unneeded, or that the actual value is in V. Set in issue and exec.
   Vj, Vk: the values of S1 and S2. Ignored unless Q is 0. Set in issue and exec.
   A: EA info for ld/st. Initially the imm field (issue); after EA calc (exec), the EA.
   Busy: is this reservation station currently being used? Set in issue and write result.
2. New field for the register file:
   Qi: res stat entry whose operation will produce a result for this reg. If 0, no currently active inst has the i-th reg as output. Simply overwritten by a later inst to effect a write-back squash.

31. Tomasulo Algorithm Information Tables

State after only the first load has completed (Fig 2.10 in book is wrong):

Instruction history:
Instruction       Issue  Exec  Write
L.D   F6,34(R2)   Yes    Yes   Yes
L.D   F2,45(R3)   Yes    Yes   No
MUL.D F0,F2,F4    Yes    No    No
SUB.D F8,F2,F6    Yes    No    No
DIV.D F10,F0,F6   No     No    No
ADD.D F6,F8,F2    No     No    No

Regfile tags:
Reg #  Qi
F0     Mult1
F2     Load2
F4     0
F6     0
F8     Add1
F10    0

Reservation station info:
name   busy  Op     Vj  Vk                Qj     Qk   A
Load1  no
Load2  yes   L.D                                      45+Regs[R3]
Add1   yes   SUB.D      Mem[34+Regs[R2]]  Load2  0
Add2   no
Add3   no
Mult1  yes   MUL.D      Regs[F4]          Load2  0
Mult2  no

32. Computing Tomasulo Algorithm Information

Res stations: load=5, store=5, fadd=3, fmul=2. Assume latencies: load=1, fadd=2, fmul=10, fdiv=40.
The computed steps of the TA are given in the book, but as a human you can easily fill in the table below using the # of res stations, the deps between insts, and the cycle times of the ops. First, compute the timing table (only the first row is filled here; the solution is on the next slide):

Instruction       Res stat  Issue  Exec at FU  Exec beg-end  Mem acc  CDB write
L.D   F6,34(R2)   Load1     1      ALU         2-2           3        4
L.D   F2,45(R3)
MUL.D F0,F2,F4
SUB.D F8,F2,F6
DIV.D F10,F0,F6
ADD.D F6,F8,F2

Now, we know that load 1 completes at cycle 4. Fill in the TA state tables as they would be at cycle 4 to get the previous slide! The CDB write adds 1 cycle of latency to all insts!

32. Computing Tomasulo Algorithm Information (solution)

Res stations: load=5, store=5, fadd=3, fmul=2. Assume latencies: load=1, fadd=2, fmul=10, fdiv=40.

Instruction       Res stat  Issue  Exec at FU  Exec beg-end  Mem acc  CDB write
L.D   F6,34(R2)   Load1     1      ALU         2-2           3        4
L.D   F2,45(R3)   Load2     2      ALU         3-3           4        5
MUL.D F0,F2,F4    Mult1     3      FMUL        6-15          n/a      16
SUB.D F8,F2,F6    Add1      4      FADD        6-7           n/a      8
DIV.D F10,F0,F6   Mult2     5      FMUL        17-56         n/a      57
ADD.D F6,F8,F2    Add2      6      FADD        9-10          n/a      11

Now we know that load 1 completes at cycle 4; fill in the TA state tables as they would be at cycle 4 to get the previous slide. The CDB write adds 1 cycle of latency to all insts! Without the dep, there would be a possible structural hazard on FMUL if MUL.D or DIV.D were unpipelined.

33. Tomasulo Algorithm Information Tables 2

Exercise: compute the state tables for when the MUL is ready to write its result (end of cycle 15). Solution:

Instruction history:
Instruction       Issue  Exec  Write
L.D   F6,34(R2)   Yes    Yes   Yes
L.D   F2,45(R3)   Yes    Yes   Yes
MUL.D F0,F2,F4    Yes    Yes   No
SUB.D F8,F2,F6    Yes    Yes   Yes
DIV.D F10,F0,F6   Yes    No    No
ADD.D F6,F8,F2    Yes    Yes   Yes

Regfile tags:
Reg #  Qi
F0     Mult1
F2     0
F4     0
F6     0
F8     0
F10    Mult2

Reservation station info:
name   busy  Op   Vj                Vk                Qj     Qk   A
Load1  No
Load2  No
Add1   No
Add2   No
Add3   No
Mult1  Yes   MUL  Mem[45+Regs[R3]]  Regs[F4]          0      0    n/a
Mult2  Yes   DIV                    Mem[34+Regs[R2]]  Mult1  0    n/a

34. Compute Tomasulo Algorithm Information for Loop

Res stations: load=5, store=5, fmul=2, bu=1, alu=1. Assume cycles: cache=1, fmul=4.
The BU computes the BT during issue, and verifies the taken prediction in Exec. Any transfer between units must use the CDB (1 cycle latency). Note: the insts are from a loop; the ALU/BU do not really have res stations.

Exercise: fill in the timing table for two iterations (solution on the next slide):

Instruction        Res stat  Issue  Exec at FU  Exec B-E  Mem acc  CDB write
L.D    F0,0(R1)    Load1     1      ALU         2-2       3        4
MUL.D  F4,F0,F2
S.D    F4,0(R1)
DADDIU R1,R1,#8
BNE    R1,R2,Loop
L.D    F0,0(R1)
MUL.D  F4,F0,F2
S.D    F4,0(R1)
DADDIU R1,R1,#8
BNE    R1,R2,Loop

34. Compute Tomasulo Algorithm Information (solution)

Res stations: load=5, store=5, fmul=2, bu=1, alu=1. Assume latencies: cache=1, fmul=4.
The BU computes the BT during issue, and verifies the taken prediction in Exec. Any transfer between units must use the CDB (1 cycle latency).

Instruction        Res stat  Issue  Exec at FU  Exec B-E  Mem acc  CDB write
L.D    F0,0(R1)    Load1     1      ALU         2-2       3        4
MUL.D  F4,F0,F2    Mult1     2      FMUL        5-8       n/a      9
S.D    F4,0(R1)    Stor1     3      ALU         4-4       10       n/a
DADDIU R1,R1,#8    ALU       4      ALU         5-5       n/a      6
BNE    R1,R2,Loop  BU        5      BU          7-7       n/a      n/a
L.D    F0,0(R1)    Load1     6      ALU         7-7       8        10 (CDB busy at 9)
MUL.D  F4,F0,F2    Mult2     7      FMUL        11-14     n/a      15
S.D    F4,0(R1)    Stor2     8      ALU         9-9       16       n/a
DADDIU R1,R1,#8    ALU       9      ALU         10-10     n/a      11
BNE    R1,R2,Loop  BU        10     BU          12-12     n/a      n/a

35. Hardware-Based Speculation

In order to get reasonable ILP, we must reduce the time wasted due to control hazards, which requires speculation. This chapter is concerned with hardware-based speculation, which combines three key ideas:
1. Dynamic branch prediction [which insts lie past the br?]
2. Speculation [execute insts past the br]
3. Dynamic scheduling (out-of-order execution) [work around data and control hazards]
Insts on such a machine must:
1. Issue in order
2. May execute & produce results out of order
3. Update the state of the machine (commit) in order

36. Speculative Tomasulo Approach

Allows speculative execution with dynamic scheduling based on Tomasulo's algorithm.
Like Tomasulo's algorithm, it allows instructions to complete out of order.
Unlike Tomasulo's algorithm, it forces the instructions to commit (update machine state such as registers or memory) in order. This means:
Precise exceptions are supported
Speculative execution is supported
We will add an additional step, instruction commit, to Tomasulo's algorithm, and must have hardware support to buffer out-of-order results until they are ready for in-order commit.

37. Speculative Tomasulo's Algorithm for MIPS FPU

[Comp Arch, Henn & Patt, Fig 2.14, pg 107: MIPS FP unit using the speculative Tomasulo's algorithm]
The addition is that the CDB writes to the reorder buffer (ROB) rather than directly to the FP registers:
Holds results (& provides results to subsequent operations, as res stations used to do) between completion of OOE and in-order commit
Like the inst queue, may be implemented as a circular queue
Used to accomplish renaming (defeating name hazards) & ensure in-order commit
Registers are tagged with a reorder buffer #, not a res station
Store buffers are replaced with a reorder entry (ROB & data)
Res stations still buffer ops until operands & FU are available (data & structural hazards)

38. Repeat: Out-of-Order-Commit Tomasulo Algorithm Steps

1. Issue: get the inst from the queue, and issue it to the appropriate res stat/buffer if one is available (otherwise, stall on a structural hazard). If the operands are available in regs, write the values to the res stat; else add pointers to the res stat/buffer that will have the results. Thus input is renamed, defeating WAW and WAR hazards.
2. Execute: monitor the CDB until all operands are available (RAW), and write them to the res stat entry. When all operands are available, execute the operation, assuming prior branches have completed. Multiple insts may be ready at the same time. LD/ST first compute the EA, then perform the op. STs require ordering with other STs and LDs.
3. Write result: write the result to the CDB and from there to the registers, reservation stations, and ST buffers. STs write data to mem: when both the EA and value are available, they are sent to the MU, and the ST completes.

39. Speculative (In-Order-Commit) Tomasulo Steps

1. Issue: get the inst from the queue. Issue it if there is an empty res stat and an empty slot in the ROB (otherwise, stall issue with a structural hazard). Mark both as busy. If operands are available in regs or the ROB, write them to the res stat. Write the output's ROB entry # to the res stat.
2. Execute: monitor the CDB for operands to become available (avoids RAW hazards). Then execute the op when the FU is available (structural hazard). Stores need only the base reg and imm (EA calc). Loads have an additional MEM step in this stage.
3. Write result: write the result to the CDB, including the ROB tag provided during issue. From the CDB, write to the indicated ROB entry and any res stat waiting for the result. Mark the res stat as available. A store monitors the CDB for its value, and writes it to the ROB when it arrives.
4. Commit: when the result at the head of the ROB is available, commit it. If a store or op, write mem or reg and free the ROB entry. If a mispredicted branch, flush the ROB and restart at the correct branch successor.

40. Speculative Tomasulo Approach

Res stations/ROB entries: ld=5, fpadd=3, fpmul=2, rob=8. Only one of each FU (the ALU handles EA calc). FMUL and FADD are pipelined; division is done by the FMUL.
1 cycle in EX for an iop (including EA), branch, load, store, or cache access; 2 cycles for ADD.D, 10 for MUL.D, & 40 for DIV.D.

Instruction       issue  res/rob  FU/EX       mem  cdb  commit
L.D   F6,34(R2)   1      Ld1/#1   ALU/2-2     3    4    5
L.D   F2,45(R3)   2      Ld2/#2   ALU/3-3     4    5    6
MUL.D F0,F2,F4    3      Mul1/#3  FMUL/6-15   n/a  16   17
SUB.D F8,F6,F2    4      Add1/#4  FADD/6-7    n/a  8    18
DIV.D F10,F0,F6   5      Mul2/#5  FMUL/17-56  n/a  57   58
ADD.D F6,F8,F2    6      Add2/#6  FADD/9-10   n/a  11   59

The larger the difference in inst run-times, the larger the ROB will need to be! Exceptions are handled only when an inst reaches the head of the queue.

41. Speculative Tomasulo Information Tables

State when MUL.D is ready to commit (end of cycle 16):
The only way an inst stays in issue is an unavailable ROB entry or res stat.
"Execute" in the timing table is the execute step of the algorithm, which happens the cycle after a successful issue.
The write step of the algorithm continues until the inst reaches the head of the ROB; the write to the CDB itself takes only one cycle!

Reservation station info:
name   busy  Op     Vj                  Vk          Qj     Qk  Dest
Load1  N
Load2  N
Add1   N
Add2   N
Add3   N
Mult1  N
Mult2  Y     DIV.D  Mem[45+R3]*Reg[F4]  Mem[34+R2]  0      0   #5

(The book may assume Vj is written in cycle 17, but we won't.)

Reorder buffer info:
#  busy  instruction       state    destreg  value
1  N     L.D   F6,34(R2)   Commit   F6       Mem[34+R2]
2  N     L.D   F2,45(R3)   Commit   F2       Mem[45+R3]
3  Y     MUL.D F0,F2,F4    Write    F0       #2*Reg[F4]
4  Y     SUB.D F8,F6,F2    Write    F8       #1-#2
5  Y     DIV.D F10,F0,F6   Execute  F10
6  Y     ADD.D F6,F8,F2    Write    F6       #4+#2

Regfile tags:
Reg   rob #
F0    #3
F2    n/a
F4    n/a
F6    #6
F8    #4
F10   #5

42. Speculative Tomasulo Example 2

Res stations: load=5, store=5, fadd=2, bu=1, alu=1; 16-entry ROB. Assume cycles: cache=1, fadd=4. Single issue/commit/CDB/cache.
FUs: 1 FADD, 1 FMUL, 1 BU, 1 ALU, 1 EA, 1 MEM, 1 CDB. Stores now do MEM during commit!

Instruction        issue  res/rob  FU/EX       mem  cdb  commit
L.D    F0,0(R1)    1      Ld1/#1   ALU/2-2     3    4    5
ADD.D  F4,F0,F2    2      Add1/#2  FADD/5-8    n/a  9    10
S.D    F4,0(R1)    3      #3       ALU/4-4     n/a  n/a  11
DADDIU R1,R1,#8    4      ALU/#4   ALU/5-5     n/a  6    12
BNE    R1,R2,Loop  5      BCU/#5   BU/7-7      n/a  n/a  13
L.D    F0,0(R1)    6      Ld1/#6   ALU/7-7     8    9    14
ADD.D  F4,F0,F2    7      Add2/#7  FADD/10-13  n/a  14   15
S.D    F4,0(R1)    8      #8       ALU/9-9     n/a  n/a  16
DADDIU R1,R1,#8    9      ALU/#9   ALU/10-10   n/a  11   17
BNE    R1,R2,Loop  10     BCU/#10  BU/12-12    n/a  n/a  18

A branch's EX actually updates the PC. If the 1st BNE is not taken (PC != predicted), simply flush ROB entries #6-#10. Exceptions for flushed insts are never handled.

43. Issuing Multiple Instructions in Parallel

Must have multi-issue to achieve CPI < 1 (IPC > 1). Requires simultaneous fetching, decoding, and executing of multiple instructions. Two approaches to multi-issue:
1. Superscalar: hardware examines the inst stream for parallel insts
2. VLIW: the compiler puts parallel insts in a VLIW/packet
Intel modified the VLIW idea slightly and calls it EPIC. VLIW/EPIC is covered in Appendix G. Most of this chapter concentrates on multi-issue using superscalar.

44. Top Five Multi-issue Approaches

Approach       Issue      Hazard     Scheduling  Distinguishing       Implementations
               structure  detection              characteristic
Superscalar    dynamic    hardware   static      in-order exec        Sun UltraSPARC IV,
(static)                                                              ARM/MIPS
Superscalar    dynamic    hardware   dynamic     out-of-order exec    IBM Power2
(dynamic)
Superscalar    dynamic    hardware   dynamic     out-of-order exec    P4/Opteron,
(speculative)                        + specul.   with speculation     MIPS R12K, Power5
VLIW/LIW       static     software   static      no hazards in        TI C6x, Efficeon
                                                 issue packets
EPIC           mostly     mostly     mostly      deps marked          IA-64 Itanium 2
               static     software   static      by compiler

We see that almost all desktops today are speculative superscalar. Most are 4-way SS; some (particularly embedded) are 2-way. We concentrate on the first 3 approaches in this chapter, and will briefly overview VLIW/EPIC in the next couple of slides.

45. Static Multi-issue with Very Long Instruction Word (VLIW)

The compiler has the responsibility to package multiple independent insts together in a VLIW. Can issue many instructions in parallel: past machines have varied from 2-32 inst/VLIW. Simplifies the issuing hardware. Performance is limited by the amount of ILP the compiler can discover. With many functional units to feed, this requires very complicated compiler scheduling support:
Loop unrolling
Software pipelining
Trace scheduling
Superblock scheduling

46. VLIW Challenges

As the machine gets wider, I/O must keep pace (multiple ports to the reg file, must be able to issue multiple loads at once, etc.). Too much width may restrain the maximum clock speed.
Any functional-unit stall (e.g., d-cache) will cause the entire processor to stall. Cache misses are not as predictable, so a dynamic superscalar has an advantage here (since it won't stall the machine until the ROB is full).
In the original VLIW attempts, adding more FUs resulted in a greater word length, which meant you had to recompile executables for them to run correctly.
May significantly increase code size, due to inserting nops in the VLIW and aggressive compiler optimizations (e.g., loop unrolling).
Many of these problems were improved in the Intel/HP EPIC (Explicitly Parallel Instruction Computing) design, at the cost of simplicity.

47. All Aboard the Good Ship Itanic

48. EPIC/Itanium Remarks
Itanium does not appear to be a major factor in the marketplace
Still-increasing perf of dynamic superscalars seems to indicate hardware complexity has not yet hit a ceiling
Does the loss of the MHz race to slower CMPs present an opportunity? (too late?)
Itanium has not shown that a software-oriented approach necessarily yields simpler or faster processors
Excellent compiler support has remained elusive
The IA-64 arch does not appear to have a major architectural advantage, as caches and RISC did (in their day)
Software & hardware approaches are instead intermingling: I would bet on hardware with some software tricks
Software tricks in hardware-centric designs:
Conditional inst (eg. move)
Prefetch inst & cache hints
Br pred hints
Spec (non-excepting) loads
Hardware tricks in software-centric designs:
Scoreboard scheduling of inst
Dynamic br pred
Rollback or trap-and-fix-up support for speculation
Hardware for checking spec load correctness

49. Static Superscalar Approach
Issue multiple inst that are quickly detected as independent
Most SS allow 4 inst at most (dep checking, available ILP)
If one inst in the set has some dependency, only the prior inst are issued (maintains in-order execution)
Must increase inst fetch rate: widen the icache bus, pipeline IF (often by having an integrated inst fetch unit)
Longer pipeline and multi-issue result in more hazards. Why does multi-issue result in more hazards? With more inst in flight each cycle, dependent inst are closer together in time, so more dependences turn into stalls
No increase in code size and less dependence on the compiler than VLIW
Programs compiled for nonsuperscalar machines may still get benefit
Most superscalar procs have an issue restriction, which limits the types of instructions that may be issued simultaneously
Parallel inst are called an issue packet
May have 0-n (2 <= n <= 4) inst in an issue packet
An easy issue restriction for dual-issue is one fp comp and one other (ld/st/iop)
Dependencies are limited, since they use different WB and EX (except for fp loads)
Will become single-issue for all-integer/fp code
Since fp computations are long-running, will need pipelined and perhaps multiple FPUs

50. Simple Static Superscalar Pipeline in Operation
Need for multiple inst/cycle often requires:
Superpipelining IF/ID stages
New FUs: Instruction Fetch Unit, Issue Unit, etc.

int inst   IF  ID  EX   MEM  WB
fp inst    IF  ID  FEX  FWB
int inst       IF  ID   EX   MEM  WB
fp inst        IF  ID   FEX  FWB
int inst           IF   ID   EX   MEM  WB
fp inst            IF   ID   FEX  FWB
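The dual-issue restriction just described (one fp computation paired with one other instruction, with a fallback to single issue) can be sketched as below. The (opcode, dest, srcs) tuple format is an assumption for this sketch, not a real ISA encoding.

```python
# Toy check of a dual-issue restriction: pair exactly one fp comp with
# one other inst (ld/st/iop); reject the pair on a RAW dependence.

FP_COMPS = {"ADD.D", "SUB.D", "MUL.D", "DIV.D"}

def can_dual_issue(i1, i2):
    """True if i1 and i2 may leave the issue stage in the same cycle."""
    fp_count = (i1[0] in FP_COMPS) + (i2[0] in FP_COMPS)
    if fp_count != 1:                 # exactly one fp comp + one other
        return False
    if i1[1] is not None and i1[1] in i2[2]:
        return False                  # RAW: i2 reads i1's result
    return True

prog = [
    ("ADD.D",  "F4", ("F0", "F2")),   # fp comp + independent int op: OK
    ("DADDIU", "R1", ("R1",)),
    ("L.D",    "F0", ("R1",)),        # load result feeds the fp comp: RAW
    ("ADD.D",  "F6", ("F0", "F2")),
]
pairs = [can_dual_issue(prog[i], prog[i + 1]) for i in range(0, len(prog), 2)]
print(pairs)   # [True, False]
```

The second pair drops to single issue, which is exactly the "only prior inst are issued" behavior that keeps execution in order.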

51. Simple Dynamic Superscalar with Tomasulo
Other than multiple instruction fetch and issue, Tomasulo's Algorithm works unchanged for dynamic superscalar, and defeats some issue probs.
Assumptions: fp exec = 3 cycles, iop exec = 1 cycle; a CDB write adds 1 cycle of latency; branches require single issue; BCU: an ALU dedicated to the branch condition & branch target; ld/st use the ALU for EA calculation.

Instruction        Res stat  Issue  Exec FU  Exec B-E  Mem acc  CDB write
L.D F0,0(R1)       Load1     1      ALU      2-2       3        4
ADD.D F4,F0,F2     Add1      1      FADD     5-7       n/a      8
S.D F4,0(R1)       Stor1     2      ALU      3-3       9        n/a
DADDIU R1,R1,#8    ALU       2      ALU      4-4       n/a      5
BNE R1,R2,Loop     BCU       3      BCU      6-6       n/a      n/a
L.D F0,0(R1)       Load2     4      ALU      7-7       8        9
ADD.D F4,F0,F2     Add2      4      FADD     10-12     n/a      13
S.D F4,0(R1)       Stor2     5      ALU      8-8       14       n/a
DADDIU R1,R1,#8    ALU       5      ALU      9-9       n/a      10
BNE R1,R2,Loop     BCU       6      BCU      11-11     n/a      n/a
L.D F0,0(R1)       Load1     7      ALU      12-12     13       14
ADD.D F4,F0,F2     Add3      7      FADD     15-17     n/a      18
S.D F4,0(R1)       Stor3     8      ALU      13-13     19       n/a
DADDIU R1,R1,#8    ALU       8      ALU      14-14     n/a      15
BNE R1,R2,Loop     BCU       9      BCU      16-16     n/a      n/a

Must watch for structural hazards! DADDIU is delayed by S.D's use of the ALU in the EX stage!
Later inst are delayed until the branch is resolved

52. Resource Usage Table (fig 3.26)

Clock  ALU       FPU      D-cache  CDB
2      1/L.D
3      1/S.D              1/L.D
4      1/DADDIU                    1/L.D
5                1/ADD.D           1/DADDIU
6
7      2/L.D
8      2/S.D              2/L.D    1/ADD.D
9      2/DADDIU           1/S.D    2/L.D
10               2/ADD.D           2/DADDIU
11
12     3/L.D
13     3/S.D              3/L.D    2/ADD.D
14     3/DADDIU           2/S.D    3/L.D
15               3/ADD.D           3/DADDIU
16
17
18                                 3/ADD.D
19                        3/S.D
20

ADD.D only shows up in its 1st EX cycle, since the unit is pipelined
Any cycle without an entry in both ALU & FPU is a missed opportunity
Any resource, including the CDB, can cause a structural hazard

53. Dynamic Superscalar with Tomasulo, Ex 2
Had repeated structural stalls on the ALU (DADDIU/S.D)
If we add an ALU alone, we will then stall on the CDB!
Assume a 2nd ALU (EAU) for EA calculation, & a 2nd CDB.
Same assumptions as before: fp exec = 3 cycles, iop exec = 1; a CDB write adds 1 cycle of latency; branches require single issue; BCU: ALU for the branch condition & target; EAU: ALU for EA.

Instruction        Res stat  Issue  Exec FU  Exec B-E  Mem acc  CDB write
L.D F0,0(R1)       Load1     1      EAU      2-2       3        4
ADD.D F4,F0,F2     Add1      1      FADD     5-7       n/a      8
S.D F4,0(R1)       Stor1     2      EAU      3-3       9        n/a
DADDIU R1,R1,#8    ALU       2      ALU      3-3       n/a      4
BNE R1,R2,Loop     BCU       3      BCU      5-5       n/a      n/a
L.D F0,0(R1)       Load2     4      EAU      6-6       7        8
ADD.D F4,F0,F2     Add2      4      FADD     9-11      n/a      12
S.D F4,0(R1)       Stor2     5      EAU      7-7       13       n/a
DADDIU R1,R1,#8    ALU       5      ALU      6-6       n/a      7
BNE R1,R2,Loop     BCU       6      BCU      8-8       n/a      n/a
L.D F0,0(R1)       Load1     7      EAU      9-9       10       11
ADD.D F4,F0,F2     Add3      7      FADD     12-14     n/a      15
S.D F4,0(R1)       Stor3     8      EAU      10-10     16       n/a
DADDIU R1,R1,#8    ALU       8      ALU      9-9       n/a      10
BNE R1,R2,Loop     BCU       9      BCU      11-11     n/a      n/a

The last inst (S.D) is done at cycle 16 rather than 19!

54. Resource Usage Table (fig 3.28)

Clock  ALU       EA ALU  FPU      D-cache  CDB 1     CDB 2
2                1/L.D
3      1/DADDIU  1/S.D            1/L.D
4                                          1/L.D     1/DADDIU
5                        1/ADD.D
6      2/DADDIU  2/L.D
7                2/S.D            2/L.D    2/DADDIU
8                                          1/ADD.D   2/L.D
9      3/DADDIU  3/L.D   2/ADD.D  1/S.D
10               3/S.D            3/L.D    3/DADDIU
11                                         3/L.D
12                       3/ADD.D           2/ADD.D
13                                2/S.D
14
15                                         3/ADD.D
16                                3/S.D

So we see that widening one stage (ALU) may require us to widen others (CDB)
With fewer deps or more ALUs/FPUs, we might need more D-cache ports
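The resource-usage tables above boil down to one rule: a structural hazard is any (cycle, unit) slot requested more than once. A minimal sketch of that bookkeeping, using the first example's single shared ALU (the cycle numbers below are taken from iteration 1 of that timing table):

```python
# Log which instruction wants each unit on each cycle and flag any
# (cycle, unit) slot that is over-subscribed.
from collections import defaultdict

def find_structural_hazards(requests):
    """requests: (cycle, unit, inst) triples. Returns the contested
    (cycle, unit) slots and the instructions competing for them."""
    table = defaultdict(list)
    for cycle, unit, inst in requests:
        table[(cycle, unit)].append(inst)
    return {slot: insts for slot, insts in table.items() if len(insts) > 1}

# One shared ALU does both integer ops and EA calculation, so S.D's EA
# and DADDIU collide in cycle 3 (which is why DADDIU's execute slips
# to cycle 4 in the timing table).
requests = [
    (2, "ALU", "1/L.D"),
    (3, "ALU", "1/S.D"),
    (3, "ALU", "1/DADDIU"),
    (5, "FPU", "1/ADD.D"),
]
print(find_structural_hazards(requests))   # {(3, 'ALU'): ['1/S.D', '1/DADDIU']}
```

Adding the second EAU in Ex 2 removes exactly this slot contention, which is why the schedule tightens from 19 to 16 cycles.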

55. Timing Table for Dual-Issue Pipeline w/o Speculation

Instruction       issue  FU/EX      mem  cdb  comment
LW R2,0(R1)       1      EA/2-2     3    4    1st issue
DADDIU R2,R2,#1   1      ALU/5-5    na   6    wait LW
SW R2,0(R1)       2      EA/3-3     7    na   wait DADDIU
DADDIU R1,R1,#4   2      ALU/3-3    na   4    no wait
BNE R2,R3,Loop    3      BU/7-7     na   na   wait for DADDIU
LW R2,0(R1)       4      EA/8-8     9    10   wait BNE
DADDIU R2,R2,#1   4      ALU/11-11  na   12   wait LW
SW R2,0(R1)       5      EA/9-9     13   na   wait DADDIU
DADDIU R1,R1,#4   5      ALU/8-8    na   9    wait for BNE
BNE R2,R3,Loop    6      BU/13-13   na   na   wait for DADDIU
LW R2,0(R1)       7      EA/14-14   15   16   wait BNE
DADDIU R2,R2,#1   7      ALU/17-17  na   18   wait LW
SW R2,0(R1)       8      EA/15-15   19   na   wait DADDIU
DADDIU R1,R1,#4   8      ALU/14-14  na   15   wait for BNE
BNE R2,R3,Loop    9      BU/19-19   na   na   wait for DADDIU

Separate ALU & EA units allow parallel exec (eg. cycle 3)
The 2nd SW execs on cycle 9 since LW is using the EA unit on 8
The final instruction finishes at cycle 19

56. Timing Table for Dual-Issue Pipeline with Speculation

Instruction       issue  FU/EX      mem  cdb  commit  comment
LW R2,0(R1)       1      EA/2-2     3    4    5       1st issue
DADDIU R2,R2,#1   1      ALU/5-5    na   6    7       wait LW
SW R2,0(R1)       2      EA/3-3     na   na   7       wait DADDIU
DADDIU R1,R1,#4   2      ALU/3-3    na   4    8       no wait
BNE R2,R3,Loop    3      BU/7-7     na   na   8       wait for DADDIU
LW R2,0(R1)       4      EA/5-5     6    7    9       no wait
DADDIU R2,R2,#1   4      ALU/8-8    na   9    10      wait LW
SW R2,0(R1)       5      EA/6-6     na   na   10      wait DADDIU
DADDIU R1,R1,#4   5      ALU/6-6    na   7    11      no wait
BNE R2,R3,Loop    6      BU/10-10   na   na   11      wait for DADDIU
LW R2,0(R1)       7      EA/8-8     9    10   12      earliest possible
DADDIU R2,R2,#1   7      ALU/11-11  na   12   13      wait LW
SW R2,0(R1)       8      EA/9-9     na   na   13      wait DADDIU
DADDIU R1,R1,#4   8      ALU/9-9    na   10   14      execs earlier
BNE R2,R3,Loop    9      BU/13-13   na   na   14      wait for DADDIU

Multiple inst issue, write, and commit in the same cycle
Otherwise, the peak of multi-issue is not achievable!
Multi-issue implies multiple read and write ports to the regfile
With speculation, we finish in 14 rather than 19 cycles!
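The commit column above follows a simple in-order retirement rule, which can be sketched as below. The dual-commit-per-cycle width and the "ready" cycles are assumptions matching the first iteration of the speculative table.

```python
# In-order commit behind the speculative table: results may finish out
# of order, but the ROB retires strictly in program order, here at most
# two commits per clock.

def commit_cycles(ready, per_cycle=2):
    """ready[i] = cycle inst i's result is available (program order).
    Each inst commits no earlier than ready+1, never before an earlier
    inst, and at most per_cycle insts retire in any one clock."""
    commits, used = [], {}
    for r in ready:
        c = r + 1
        if commits:
            c = max(c, commits[-1])       # program order
        while used.get(c, 0) >= per_cycle:
            c += 1                        # commit ports full this cycle
        commits.append(c)
        used[c] = used.get(c, 0) + 1
    return commits

# Iteration 1 of the table: LW's result on the CDB at 4, DADDIU R2 at 6,
# SW's store data ready with it at 6, DADDIU R1 at 4, BNE resolved at 7.
print(commit_cycles([4, 6, 6, 4, 7]))   # [5, 7, 7, 8, 8] as in the table
```

Note how DADDIU R1 finishes early (cycle 4) but cannot retire until cycle 8: cycle 7 already retires two earlier instructions, and commit order must match program order.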
Fallacies and Pitfalls
Fallacy: Processors with faster clock rates are always faster
More sophisticated pipelines have slower clock rates, but may be faster due to better ILP exploitation
One of the most difficult tradeoffs is between simple procs with high clock rates and large caches, and more complex procs with slower clock rates and smaller caches, but more ILP
Pitfall: Improving only one aspect of a multi-issue proc and expecting overall performance improvement
As we have seen, removing one bottleneck often only exposes another (eg., multi-issue requiring a wider CDB, etc.)
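The clock-rate fallacy above falls directly out of CPU time = IC x CPI x cycle time. A quick worked comparison, with illustrative numbers rather than measurements of real processors:

```python
# A slower-clocked but wider machine can win: compare a fast-clock
# simple pipeline against an aggressive-ILP design with lower CPI.

def cpu_time_s(inst_count, cpi, clock_ghz):
    """Execution time in seconds from the classic CPU performance equation."""
    return inst_count * cpi / (clock_ghz * 1e9)

ic = 1e9                                              # instructions
fast_clock = cpu_time_s(ic, cpi=1.1, clock_ghz=3.8)   # simple pipeline
wide_issue = cpu_time_s(ic, cpi=0.5, clock_ghz=2.6)   # multi-issue, lower CPI
print(fast_clock > wide_issue)   # True: the slower-clocked design wins
```

Here the 2.6 GHz design finishes in about 0.19 s versus 0.29 s, despite giving up over a GHz of clock rate.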


Getting CPI under 1: Outline CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more

More information

E0-243: Computer Architecture

E0-243: Computer Architecture E0-243: Computer Architecture L1 ILP Processors RG:E0243:L1-ILP Processors 1 ILP Architectures Superscalar Architecture VLIW Architecture EPIC, Subword Parallelism, RG:E0243:L1-ILP Processors 2 Motivation

More information

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring

More information

Computer Architecture: Mul1ple Issue. Berk Sunar and Thomas Eisenbarth ECE 505

Computer Architecture: Mul1ple Issue. Berk Sunar and Thomas Eisenbarth ECE 505 Computer Architecture: Mul1ple Issue Berk Sunar and Thomas Eisenbarth ECE 505 Outline 5 stages of RISC Type of hazards Sta@c and Dynamic Branch Predic@on Pipelining with Excep@ons Pipelining with Floa@ng-

More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

Complex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar

Complex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar Complex Pipelining COE 501 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Diversified Pipeline Detecting

More information

Course on Advanced Computer Architectures

Course on Advanced Computer Architectures Surname (Cognome) Name (Nome) POLIMI ID Number Signature (Firma) SOLUTION Politecnico di Milano, July 9, 2018 Course on Advanced Computer Architectures Prof. D. Sciuto, Prof. C. Silvano EX1 EX2 EX3 Q1

More information

Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining

Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining Single-Cycle Design Problems Assuming fixed-period clock every instruction datapath uses one

More information

TDT 4260 TDT ILP Chap 2, App. C

TDT 4260 TDT ILP Chap 2, App. C TDT 4260 ILP Chap 2, App. C Intro Ian Bratt (ianbra@idi.ntnu.no) ntnu no) Instruction level parallelism (ILP) A program is sequence of instructions typically written to be executed one after the other

More information

TDT 4260 lecture 7 spring semester 2015

TDT 4260 lecture 7 spring semester 2015 1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding

More information

The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

More information

ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism

ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University,

More information

CS252 Graduate Computer Architecture Lecture 8. Review: Scoreboard (CDC 6600) Explicit Renaming Precise Interrupts February 13 th, 2010

CS252 Graduate Computer Architecture Lecture 8. Review: Scoreboard (CDC 6600) Explicit Renaming Precise Interrupts February 13 th, 2010 CS252 Graduate Computer Architecture Lecture 8 Explicit Renaming Precise Interrupts February 13 th, 2010 John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley

More information

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest

More information

Instruction Level Parallelism

Instruction Level Parallelism Instruction Level Parallelism Software View of Computer Architecture COMP2 Godfrey van der Linden 200-0-0 Introduction Definition of Instruction Level Parallelism(ILP) Pipelining Hazards & Solutions Dynamic

More information

Advanced Computer Architecture. Chapter 4: More sophisticated CPU architectures

Advanced Computer Architecture. Chapter 4: More sophisticated CPU architectures Advanced Computer Architecture Chapter 4: More sophisticated CPU architectures Lecturer: Paul H J Kelly Autumn 2001 Department of Computing Imperial College Room 423 email: phjk@doc.ic.ac.uk Course web

More information

計算機結構 Chapter 4 Exploiting Instruction-Level Parallelism with Software Approaches

計算機結構 Chapter 4 Exploiting Instruction-Level Parallelism with Software Approaches 4.1 Basic Compiler Techniques for Exposing ILP 計算機結構 Chapter 4 Exploiting Instruction-Level Parallelism with Software Approaches 吳俊興高雄大學資訊工程學系 To avoid a pipeline stall, a dependent instruction must be

More information

CS 152 Computer Architecture and Engineering. Lecture 13 - Out-of-Order Issue and Register Renaming

CS 152 Computer Architecture and Engineering. Lecture 13 - Out-of-Order Issue and Register Renaming CS 152 Computer Architecture and Engineering Lecture 13 - Out-of-Order Issue and Register Renaming Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://wwweecsberkeleyedu/~krste

More information

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel

More information

Instruction Level Parallelism

Instruction Level Parallelism Instruction Level Parallelism Dynamic scheduling Scoreboard Technique Tomasulo Algorithm Speculation Reorder Buffer Superscalar Processors 1 Definition of ILP ILP=Potential overlap of execution among unrelated

More information

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 15: Instruction Level Parallelism and Dynamic Execution March 11, 2002 Prof. David E. Culler Computer Science 252 Spring 2002

More information

Complications with long instructions. CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. How slow is slow?

Complications with long instructions. CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. How slow is slow? Complications with long instructions CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3 Long Instructions & MIPS Case Study So far, all MIPS instructions take 5 cycles But haven't talked

More information

COSC4201 Instruction Level Parallelism Dynamic Scheduling

COSC4201 Instruction Level Parallelism Dynamic Scheduling COSC4201 Instruction Level Parallelism Dynamic Scheduling Prof. Mokhtar Aboelaze Parts of these slides are taken from Notes by Prof. David Patterson (UCB) Outline Data dependence and hazards Exposing parallelism

More information

Instruction Level Parallelism (ILP)

Instruction Level Parallelism (ILP) 1 / 26 Instruction Level Parallelism (ILP) ILP: The simultaneous execution of multiple instructions from a program. While pipelining is a form of ILP, the general application of ILP goes much further into

More information

Scoreboard information (3 tables) Four stages of scoreboard control

Scoreboard information (3 tables) Four stages of scoreboard control Scoreboard information (3 tables) Instruction : issued, read operands and started execution (dispatched), completed execution or wrote result, Functional unit (assuming non-pipelined units) busy/not busy

More information

Compiler Optimizations. Lecture 7 Overview of Superscalar Techniques. Memory Allocation by Compilers. Compiler Structure. Register allocation

Compiler Optimizations. Lecture 7 Overview of Superscalar Techniques. Memory Allocation by Compilers. Compiler Structure. Register allocation Lecture 7 Overview of Superscalar Techniques CprE 581 Computer Systems Architecture, Fall 2013 Reading: Textbook, Ch. 3 Complexity-Effective Superscalar Processors, PhD Thesis by Subbarao Palacharla, Ch.1

More information

Instruction Level Parallelism (ILP)

Instruction Level Parallelism (ILP) Instruction Level Parallelism (ILP) Pipelining supports a limited sense of ILP e.g. overlapped instructions, out of order completion and issue, bypass logic, etc. Remember Pipeline CPI = Ideal Pipeline

More information

ESE 545 Computer Architecture Instruction-Level Parallelism (ILP): Speculation, Reorder Buffer, Exceptions, Superscalar Processors, VLIW

ESE 545 Computer Architecture Instruction-Level Parallelism (ILP): Speculation, Reorder Buffer, Exceptions, Superscalar Processors, VLIW Computer Architecture ESE 545 Computer Architecture Instruction-Level Parallelism (ILP): Speculation, Reorder Buffer, Exceptions, Superscalar Processors, VLIW 1 Review from Last Lecture Leverage Implicit

More information

CPE 631 Lecture 09: Instruction Level Parallelism and Its Dynamic Exploitation

CPE 631 Lecture 09: Instruction Level Parallelism and Its Dynamic Exploitation Lecture 09: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Instruction

More information