Instruction pipelining is the use of overlapped execution to allow more than one instruction to be in some stage of execution at the same time.


1 Pipelining. Pipelining allows more than one instruction to be in some stage of execution at the same time. Ferranti ATLAS (1963): pipelining reduced the average time per instruction by 375%; memory could not keep up with the CPU, so a cache was needed.

2 What Is Pipelining? Laundry example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold. The washer takes 30 minutes, the dryer takes 40 minutes, and the folder takes 20 minutes.

3 What Is Pipelining? (Chart: tasks A, B, C, D laid out against the time axis from 6 PM to midnight.) Sequential laundry takes 6 hours for 4 loads. If they learned pipelining, how long would laundry take?

4 What Is Pipelining? Start work ASAP. (Chart: the four loads overlapped against the time axis from 6 PM to midnight.) Pipelined laundry takes 3.5 hours for 4 loads.

5 Pipelining Lessons. Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload. The pipeline rate is limited by the slowest pipeline stage. Multiple tasks operate simultaneously. Potential speedup = number of pipe stages. Unbalanced pipe-stage lengths reduce the speedup. Time to fill the pipeline and time to drain it also reduce the speedup.
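To make the fill/drain and slowest-stage effects concrete, here is a minimal Python sketch (not from the original slides) that computes the sequential and pipelined finish times for the laundry example, assuming the 30/40/20-minute stage times given above.

# Minimal sketch: pipelined vs. sequential laundry timing (assumed stage times).
stages = [30, 40, 20]        # wash, dry, fold, in minutes
loads = 4

sequential = loads * sum(stages)                    # 4 * 90 = 360 min = 6 hours
bottleneck = max(stages)                            # the slowest stage limits the rate
pipelined = sum(stages) + (loads - 1) * bottleneck  # fill time + one bottleneck per extra load

print(sequential / 60, "hours sequential")          # 6.0
print(pipelined / 60, "hours pipelined")            # 3.5
print(round(sequential / pipelined, 2), "x speedup")  # ~1.71, less than the 3x stage count

The speedup falls short of the ideal "number of stages" precisely because of the unbalanced stage lengths and the fill time, which is the point of the lessons above.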

6 Models. Single-cycle model (non-overlapping): each instruction executes in a single cycle; every instruction, and hence the clock cycle, must be stretched to the slowest instruction. Multi-cycle model (non-overlapping): each instruction executes in multiple cycles; the clock cycle must be stretched only to the slowest step; functional units can be shared within the execution of a single instruction. Pipeline model (overlapping): each instruction executes in multiple cycles and the clock cycle is stretched to the slowest step, but the throughput is roughly one clock cycle per instruction; efficiency comes from overlapping the execution of multiple instructions, increasing hardware utilization.

7 Review: Single-Cycle Datapath. Two adders: the PC+4 adder and the branch/jump offset adder. (Datapath figure: PC, instruction memory, register file, sign extend, shift-left-2, ALU, and data memory, with control signals RegDst, RegWrite, ALUSrc, Zero, MemRead, MemWrite, Branch, MemtoReg.) Harvard architecture: separate instruction and data memory.

8 Review: Multi- vs. Single-cycle Processor Datapath. Combine the adders into the ALU: add 1½ muxes and 3 temporary registers (A, B, ALUOut). Combine the memories: add a mux and 2 temporary registers (IR, MDR). (Multi-cycle datapath figure with control signals IorD, MemRead, MemWrite, IRWrite, RegDst, RegWrite, ALUSrcA, ALUSrcB, ALUOp, MemtoReg.) Single-cycle = ALU + 2 memories + 4 muxes + 2 adders + opcode decoders. Multi-cycle = ALU + 1 memory + 5½ muxes + 5 registers (IR, A, B, MDR, ALUOut) + FSM.

9 Multi-cycle Processor Datapath. (Same datapath figure as the previous slide.) Single-cycle = ALU + 2 memories + 4 muxes + 2 adders + opcode decoders. Multi-cycle = ALU + 1 memory + 5½ muxes + 5 registers (IR, A, B, MDR, ALUOut) + FSM. The five 32-bit temporary registers amount to 160 additional flip-flops for the multi-cycle processor over the single-cycle processor.

10 Pipelined Datapath Registers. (Datapath figure: the PC plus the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, each holding the data bits its stage needs.) The pipeline registers add about 229 additional flip-flops for the pipeline over the multi-cycle processor.

11 Overhead (chip area vs. speed). Single-cycle model: 8 ns clock (125 MHz), non-overlapping; ALU + 2 adders; muxes; datapath register bits (flip-flops). Multi-cycle model: 2 ns clock (500 MHz), non-overlapping; ALU + controller; 5½ muxes; about 160 datapath register bits (flip-flops). Pipeline model: 2 ns clock (500 MHz), overlapping; 2 ALUs + controller; 4 muxes; 373 datapath + 16 controlpath register bits (flip-flops).

12 Comparison. CISC: any instruction may reference memory; many instructions & addressing modes; variable instruction formats; a single register set; multi-clock-cycle instructions; a micro-program interprets instructions; the complexity is in the micro-program; little to no pipelining; small program code size. RISC: only load/store reference memory; few instructions & addressing modes; fixed instruction formats; multiple register sets; single-clock-cycle instructions; hardware (an FSM) executes instructions; the complexity is in the compiler; highly pipelined; large program code size.

13 Pipelining versus Parallelism. Most high-performance computers exhibit a great deal of concurrency, but it is not desirable to call every modern computer a parallel computer. Pipelining and parallelism are two methods used to achieve concurrency. Pipelining increases concurrency by dividing a computation into a number of steps; parallelism is the use of multiple resources to increase concurrency.

14 Can pipelining get us into trouble? Yes: pipeline hazards. Structural hazards: attempting to use the same resource two different ways at the same time (e.g., multiple memory accesses, multiple register writes); solutions: multiple memories (separate instruction & data memory), stretching the pipeline. Control hazards: attempting to make a decision before the condition is evaluated (e.g., any conditional branch); solutions: prediction, delayed branch. Data hazards: attempting to use an item before it is ready (e.g., add r1,r2,r3; sub r4,r1,r5; lw r6,0(r7); or r8,r6,r9); solutions: forwarding/bypassing, stall/bubble.

15 When stalls are unavoidable. There are numerous instances where hazards lead unavoidably to stalls! Perhaps the most surprising case is the simple assignment statement: A = B + C;

16 Classification of hazards. Assume that instruction J occurs after instruction I. RAW (read after write): J tries to read a source before I writes it, so J gets a possibly incorrect value. WAR (write after read): J tries to write a destination before it is read by I, so I reads a possibly incorrect value. WAW (write after write): J tries to write an operand before it is written by I, so the writes are performed in the wrong order. MIPS avoids the latter two hazards in its integer pipeline; however, they can still arise in its floating-point pipeline, since instructions there take different lengths of time.
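As a concrete illustration (not from the slides), the following Python sketch classifies the dependences between an earlier instruction I and a later instruction J from their register sets; the (destination, sources) tuple encoding is a made-up convenience, not a MIPS format.

# Hypothetical encoding: each instruction is (destination_reg, [source_regs]).
def classify_hazards(instr_i, instr_j):
    """Return the set of hazard types between earlier I and later J."""
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    hazards = set()
    if dest_i is not None and dest_i in srcs_j:
        hazards.add("RAW")   # J reads what I writes
    if dest_j is not None and dest_j in srcs_i:
        hazards.add("WAR")   # J writes what I still needs to read
    if dest_i is not None and dest_i == dest_j:
        hazards.add("WAW")   # both write the same register
    return hazards

# add r1,r2,r3 followed by sub r4,r1,r5  ->  {'RAW'}
print(classify_hazards(("r1", ["r2", "r3"]), ("r4", ["r1", "r5"])))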

17 The Five Stages of Load (cycles 1-5): Ifetch, Reg/Dec, Exec, Mem, Wr. Ifetch: fetch the instruction from the instruction memory. Reg/Dec: register fetch and instruction decode. Exec: calculate the memory address. Mem: read the data from the data memory. Wr: write the data back to the register file.

18 RISCEE 4 Architecture. (Datapath figure: on each clock, values are loaded into the registers; PC, memory with address and data ports, MDR and IR registers, an accumulator, and an ALU with selectable operations, under control signals PCSrc, IorD, MemRead, MemWrite, RegDst, RegWrite, srcA, and srcB.)

19 Single Cycle vs. Multiple Cycle vs. Pipeline. (Timing diagram: the single-cycle implementation stretches every instruction to the longest one, wasting time after a store; the multiple-cycle implementation steps Load through Ifetch, Reg, Exec, Mem, Wr and Store through Ifetch, Reg, Exec, Mem over ten cycles; the pipeline implementation overlaps Load, Store, and R-type so that a new instruction starts every cycle.)

20 Why Pipeline? Suppose we execute 100 instructions. Single-cycle machine: 45 ns/cycle x 1 CPI x 100 inst = 4500 ns. Multicycle machine: 10 ns/cycle x 4.6 CPI (due to instruction mix) x 100 inst = 4600 ns. Ideal pipelined machine: 10 ns/cycle x (1 CPI x 100 inst + 4 cycles of drain) = 1040 ns.
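A quick check of those numbers in Python, assuming the cycle times and CPIs quoted on the slide (100 instructions; 45 ns single-cycle; 10 ns multicycle at CPI 4.6; 10 ns pipelined with a 4-cycle drain):

# Rough arithmetic for the "Why Pipeline?" example (values assumed from the slide).
n = 100                                  # instructions
single_cycle = 45 * 1 * n                # 4500 ns
multicycle   = 10 * 4.6 * n              # 4600 ns
pipelined    = 10 * (1 * n + 4)          # 1040 ns: 1 CPI plus a 4-cycle drain
print(single_cycle, multicycle, pipelined)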

21 Why Pipeline? Because the resources are there! (Pipeline diagram: instructions Inst 0-4 flow through Im, Reg, ALU, Dm, Reg in successive cycles; the accompanying table shows the MemInst, Reg read, ALU, MemData, and Reg write resources switching from idle to busy until every resource is busy each cycle once the pipeline is full.)

22 Can pipelining get us into trouble? Yes: pipeline hazards. Structural hazards: attempting to use the same resource two different ways at the same time — e.g., a combined washer/dryer would be a structural hazard, or the folder busy doing something else (watching TV). Data hazards: attempting to use an item before it is ready — e.g., one sock of a pair in the dryer and one in the washer; you can't fold until you get the sock from the washer through the dryer; an instruction depends on the result of a prior instruction still in the pipeline. Control hazards: attempting to make a decision before the condition is evaluated — e.g., washing football uniforms and needing to get the detergent level right; you need to see the result after the dryer before the next load goes in; branch instructions. You can always resolve hazards by waiting: the pipeline control must detect the hazard and take action (or delay action) to resolve it.

23 Single Memory (Inst & Data) is a Structural Hazard. Structural hazards: attempting to use the same resource two different ways at the same time. Previous example: separate InstMem and DataMem. (Pipeline diagram: a Load and three following instructions share one memory; the right half of a stage box highlighted means read, the left half means write; the resource table shows the single Mem (Inst & Data) needed twice in the same cycle.) Detection is easy in this case!

24 Single Memory (Inst & Data) is a Structural Hazard. By changing the architecture from a Harvard memory (separate instruction and data memory) to a von Neumann memory, we actually created a structural hazard! Structural hazards can be avoided by changing the hardware — the design of the architecture (splitting resources) — or in software: re-order the instruction sequence, or delay.

25 Pipelining. Improve performance by increasing instruction throughput. (Timing diagram: lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Without pipelining each instruction takes 8 ns and the next starts 8 ns later; with pipelining each stage takes 2 ns and a new instruction starts every 2 ns.) The ideal speedup is the number of stages in the pipeline. Do we achieve this?

26 Stall on Branch. (Timing diagram: add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0). The lw after the branch cannot be fetched until the branch resolves, inserting a stall.)

27 Predicting branches. (Two timing diagrams, Figure 6.5: when the branch beq $1, $2, 40 is predicted not taken and the prediction is correct, the following lw proceeds with no delay; when the prediction is wrong, the fetched instruction is turned into a bubble and or $7, $8, $9 is fetched from the branch target instead.)

28 Delayed branch. (Timing diagram: beq $1, $2, 40 is followed by add $4, $5, $6 in the delayed branch slot, then lw $3, 300($0); the add always executes regardless of the branch outcome.)

29 The MIPS pipeline (Figure 6.7). Example: add $s0, $t0, $t1 passes through IF, ID, EX, MEM, WB. Pipeline stages: IF — instruction fetch (read); ID — instruction decode and register read (read); EX — execute the ALU operation; MEM — memory access (read or write); WB — write back to the register. Resources: Mem — instruction & data memory; Reg1 — register read port #1; Reg2 — register read port #2; Reg — register write port; ALU — ALU operation.

30 Forwarding. (Diagram, Figure 6.8: add $s0, $t0, $t1 followed by sub $t2, $s0, $t3; the add's result is forwarded from its EX stage to the sub's EX stage.)

31 Load Forwarding. (Diagram, Figure 6.9: lw $s0, 20($t1) followed by sub $t2, $s0, $t3; even with forwarding, a bubble must be inserted because the loaded value is not available until after MEM.)

32 Reordering. Original sequence: lw $t1, 0($t0) ($t1 = Memory[0+$t0]); lw $t2, 4($t0) ($t2 = Memory[4+$t0]); sw $t2, 0($t0) (Memory[0+$t0] = $t2); sw $t1, 4($t0) (Memory[4+$t0] = $t1). Reordered to avoid the load-use stall: lw $t1, 0($t0); lw $t2, 4($t0); sw $t1, 4($t0); sw $t2, 0($t0).

33 Basic Idea: split the datapath into stages. IF: instruction fetch; ID: instruction decode / register file read; EX: execute / address calculation; MEM: memory access; WB: write back. (Single-cycle datapath figure marked with the five stage boundaries.) What do we need to add to actually split the datapath into stages?

34 Graphically Representing Pipelines. (Multi-cycle pipeline diagram: lw $10, 20($1) and sub $11, $2, $3 across clock cycles CC1-CC6, each passing through IM, Reg, ALU, DM, Reg.) This representation can help answer questions like: how many cycles does it take to execute this code? What is the ALU doing during cycle 4? Use it to help understand datapaths.

35 Pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB. (Datapath figure: PC, instruction memory, register file, sign extend, ALU, and data memory separated by the four pipeline registers.)

36 Load instruction: fetch and decode. (Two datapath figures: in IF the lw is fetched and the PC incremented; in ID it is decoded and its source registers are read.)

37 Load instruction: execution. (Datapath figure: in EX the ALU adds the base register and the sign-extended offset to form the memory address.)

38 Load instruction: memory and write back. (Two datapath figures: in MEM the data memory is read at the computed address; in WB the loaded value is written to the register file.)

39 Store instruction: execution. (Datapath figure: in EX the ALU computes the effective address for the sw.)

40 Store instruction: memory and write back. (Two datapath figures: in MEM the register value is written to data memory; in WB the sw does nothing.)

41 Load instruction: corrected datapath. (Datapath figure: the destination register number is carried forward through the pipeline registers so that WB writes the correct register.)

42 Load instruction: overall usage. (Datapath figure highlighting every portion of the datapath the lw uses across its five stages.)

43 Multi-clock-cycle pipeline diagrams. (Two diagrams for lw $10, 20($1) followed by sub $11, $2, $3: one drawn with resource icons (IM, Reg, ALU, DM, Reg) per cycle, one labeling the stages Instruction fetch, Instruction decode, Execution, Data access, Write back.)

44 Single-clock-cycle diagrams, clocks 1-2 (Figure 6.22). Clock 1: lw $10, 20($1) is in instruction fetch. Clock 2: sub $11, $2, $3 is fetched while the lw is decoded. (Annotated datapath figures.)

45 Single-clock-cycle diagrams, clocks 3-4 (Figure 6.23). Clock 3: the lw executes while the sub is decoded. Clock 4: the lw accesses memory while the sub executes. (Annotated datapath figures.)

46 Single-clock-cycle diagrams, clocks 5-6 (Figure 6.24). Clock 5: the lw writes back while the sub accesses memory. Clock 6: the sub writes back. (Annotated datapath figures.)

47 Conventional Pipelined Execution Representation. (Diagram: successive instructions, each passing through IFetch, Dcd, Exec, Mem, WB, offset by one cycle along the time axis; program flow runs down the page.)

48 Structural hazards limit performance. Example: if there are 1.3 memory accesses per instruction and only one memory access is possible per cycle, then the average CPI must be at least 1.3; otherwise the memory resource would be more than 100% utilized.
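The bound spelled out as a one-liner in Python, assuming the 1.3 accesses per instruction and one memory port from the slide:

accesses_per_instr = 1.3
ports = 1
min_cpi = accesses_per_instr / ports   # memory alone forces average CPI >= 1.3
print(min_cpi)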

49 Control Hazard Solutions. Stall: wait until the decision is clear. It is possible to move the decision up to the 2nd stage by adding hardware to compare the registers as they are read. (Pipeline diagram: Beq, Load, and a following instruction; the instructions after the branch wait.) Impact: 2 clock cycles per branch instruction => slow.

50 Control Hazard Solutions. Predict: guess one direction, then back up if wrong. Predict not taken. (Pipeline diagram: Beq and Load proceed under the not-taken assumption.) Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right about 50% of the time). A more dynamic scheme keeps a history of each branch (right about 90% of the time).

51 Control Hazard Solutions. Redefine the branch behavior (it takes effect after the next instruction): the delayed branch. (Pipeline diagram: Beq, Misc, Load.) Impact: 1 clock cycle per branch instruction if the compiler can find an instruction to put in the slot (about 50% of the time). As machines launch more instructions per clock cycle, this becomes less useful.

52 Data Hazard on r1: add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11.

53 Data Hazard on r1: dependencies backwards in time are hazards. (Pipeline diagram: add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 flowing through IF, ID/RF, EX, MEM, WB; the arrows from the add's write of r1 to the later reads point backwards in time.)

54 Data Hazard Solution: forward the result from one stage to another. (Same pipeline diagram with forwarding arrows from the add's EX/MEM results into the dependent instructions' EX stages.) The or r8,r1,r9 is OK without forwarding if same-cycle register read/write is defined properly (write in the first half of the cycle, read in the second half).

55 Forwarding (or Bypassing): what about loads? Dependencies backwards in time are hazards. (Pipeline diagram: lw r1, 0(r2) followed by sub r4, r1, r3; the loaded value is not available until MEM, after the sub's EX.) This can't be solved with forwarding alone: the instruction dependent on the load must be delayed/stalled.

56 Pipelining the Load. (Pipeline diagram: three consecutive lw instructions, each going Ifetch, Reg/Dec, Exec, Mem, Wr one cycle apart.) The five independent functional units in the pipelined datapath are: the instruction memory for the Ifetch stage; the register file's read ports (busA and busB) for the Reg/Dec stage; the ALU for the Exec stage; the data memory for the Mem stage; and the register file's write port (busW) for the Wr stage.

57 The Four Stages of R-type (cycles 1-4): Ifetch, Reg/Dec, Exec, Wr. Ifetch: fetch the instruction from the instruction memory. Reg/Dec: register fetch and instruction decode. Exec: the ALU operates on the two register operands; update the PC. Wr: write the ALU output back to the register file.

58 Pipelining the R-type and Load instructions. (Pipeline diagram: R-type, R-type, Load, R-type, R-type — oops, we have a problem!) There is a pipeline conflict, a structural hazard: two instructions try to write to the register file at the same time, but there is only one write port.

59 Important Observation. Each functional unit can only be used once per instruction, and each functional unit must be used at the same stage for all instructions. Load uses the register file's write port during its 5th stage (Ifetch, Reg/Dec, Exec, Mem, Wr), while R-type uses the register file's write port during its 4th stage (Ifetch, Reg/Dec, Exec, Wr). There are 2 ways to solve this pipeline hazard.

60 Solution 1: Insert a Bubble into the Pipeline. (Pipeline diagram: a Load followed by R-types; one R-type is held back by a bubble.) Insert a bubble into the pipeline to prevent two writes in the same cycle. The control logic can be complex, and we lose an instruction fetch-and-issue opportunity: no instruction is started in cycle 6!

61 Solution 2: Delay the R-type's Write by One Cycle. Delay the R-type's register write by one cycle: now R-type instructions also use the register file's write port at stage 5, and their Mem stage is a NOOP stage in which nothing is done. (Pipeline diagram: R-type now goes Ifetch, Reg/Dec, Exec, Mem, Wr, so R-types and Loads mix without conflict.)

62 The Four Stages of Store (cycles 1-4): Ifetch, Reg/Dec, Exec, Mem. Ifetch: fetch the instruction from the instruction memory. Reg/Dec: register fetch and instruction decode. Exec: calculate the memory address. Mem: write the data into the data memory.

63 The Three Stages of Beq (cycles 1-3): Ifetch, Reg/Dec, Exec. Ifetch: fetch the instruction from the instruction memory. Reg/Dec: register fetch and instruction decode. Exec: the ALU compares the two register operands, and the correct branch target address is selected and latched into the PC.

64 Summary: Pipelining. What makes it easy: all instructions are the same length; there are just a few instruction formats; memory operands appear only in loads and stores. What makes it hard: structural hazards (suppose we had only one memory); control hazards (we need to worry about branch instructions); data hazards (an instruction depends on a previous instruction). We'll build a simple pipeline and look at these issues, then talk about modern processors and what really makes it hard: exception handling, and trying to improve performance with out-of-order execution, etc.

65 Summary. Pipelining is a fundamental concept: multiple steps using distinct resources. Utilize the capabilities of the datapath by pipelined instruction processing: start the next instruction while working on the current one; limited by the length of the longest stage (plus fill/flush); detect and resolve hazards.

66 Pipeline Control: controlpath register bits. (Figure: the control unit generates the EX, MEM, and WB control fields in ID; they are carried along in the ID/EX, EX/MEM, and MEM/WB pipeline registers — 9 control bits in ID/EX, 5 in EX/MEM, and 2 in MEM/WB.)

67 Pipeline Control: controlpath table. (Figure 5.2: the single-cycle control table with columns RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, and ALUOp for the R-format, lw, sw, and beq rows; sw and beq have don't-cares (X) for RegDst and MemtoReg. Figure 6.28: the same signals regrouped into ID/EX (EX) control lines RegDst, ALUOp, ALUSrc; EX/MEM (MEM) control lines Branch, MemRead, MemWrite; and MEM/WB (WB) control lines RegWrite, MemtoReg.)
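Since the 0/1 entries of the table did not survive transcription, here is a Python sketch of the usual textbook single-cycle MIPS control settings, grouped the way the pipeline carries them (EX, MEM, WB); treat the values as an assumption taken from the standard table, with "X" marking a don't-care.

# Assumed standard MIPS control settings per instruction class, grouped by stage.
CONTROL = {
    "R-format": {"RegDst": 1,   "ALUOp": "10", "ALUSrc": 0,    # EX
                 "Branch": 0,   "MemRead": 0,  "MemWrite": 0,  # MEM
                 "RegWrite": 1, "MemtoReg": 0},                # WB
    "lw":       {"RegDst": 0,   "ALUOp": "00", "ALUSrc": 1,
                 "Branch": 0,   "MemRead": 1,  "MemWrite": 0,
                 "RegWrite": 1, "MemtoReg": 1},
    "sw":       {"RegDst": "X", "ALUOp": "00", "ALUSrc": 1,
                 "Branch": 0,   "MemRead": 0,  "MemWrite": 1,
                 "RegWrite": 0, "MemtoReg": "X"},
    "beq":      {"RegDst": "X", "ALUOp": "01", "ALUSrc": 0,
                 "Branch": 1,   "MemRead": 0,  "MemWrite": 0,
                 "RegWrite": 0, "MemtoReg": "X"},
}
print(CONTROL["lw"])   # the 9 bits handed to ID/EX for a load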

68 Pipeline Hazards. Solution #1 always works (for non-real-time applications): stall, delay & procrastinate! Structural hazards (e.g. fetching from the same memory bank) — solution #2: partition the architecture. Control hazards (i.e. branching) — solution #1: stall, but this decreases throughput; solution #2: guess and back-track; solution #3: delayed decision — delayed branch & fill the slot. Data hazards (i.e. register dependencies) — the worst-case situation; solution #2: re-order instructions; solution #3: forwarding or bypassing, plus the delayed load.

69 Pipeline Datapath and Controlpath. (Full pipelined datapath figure with the control unit and the EX, MEM, and WB control fields carried in the ID/EX, EX/MEM, and MEM/WB pipeline registers, plus PCSrc, RegWrite, ALUSrc, Branch, MemRead, MemWrite, MemtoReg, RegDst, and ALUOp.)

70 Load instruction, single-stepped. (Datapath figure annotated, clock by clock, with the contents of the PC and of the IF/ID, ID/EX, EX/MEM, and MEM/WB registers as a lw advances: the instruction is fetched, its base register and offset are latched, the ALU forms the address, memory is read, and the result is written back.)

71 Load instruction, single-stepped (continued). (The same annotated datapath figure for the later clock cycles of the lw.)

72 Pipeline single-stepping. (Table: the initial register and memory contents, the instruction formats add $rd, $rs=A, $rt=B and lw $rt=B, addr($rs=A), and, for each clock, the contents of the IF/ID <PC, IR>, ID/EX <PC, A, B, S, Rt, Rd>, EX/MEM <PC, Z, ALU, B, R>, and MEM/WB <MDR, ALUOut, R> pipeline registers as the sequence lw, sub, and, or, add advances through the pipeline.)

73 Clock 1. IF: lw $10, 20($1); ID, EX, MEM, WB: earlier instructions. Clock 2: IF: sub $11, $2, $3; ID: lw $10, 20($1). (Annotated pipelined datapath figures.)

74 Clock 2. IF: sub $11, $2, $3; ID: lw $10, 20($1); EX, MEM, WB: earlier instructions. (Annotated pipelined datapath figure: IF/ID now holds the sub while ID/EX holds the decoded lw with its base register value and the sign-extended offset 20.)

75 Clock 3. IF: and $12, $4, $5; ID: sub $11, $2, $3; EX: lw $10, ...; MEM, WB: earlier instructions. (Annotated pipelined datapath figure: the ALU computes 20 + C$1 for the lw.)

76 Clock 4 (Figure 6.32b). IF: or $13, $6, $7; ID: and $12, $4, $5; EX: sub $11, ...; MEM: lw $10, ...; WB: earlier instruction. (Annotated pipelined datapath figure: data memory is read for the lw while the ALU subtracts for the sub.)

77 Data Dependencies that can be resolved by forwarding. (Multi-cycle pipeline diagram: sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2). The value of register $2 is produced by the sub; the and and or need it before it is written back — data hazards resolved by forwarding; the add reads $2 in the same cycle it is written back — not a hazard; the sw reads it even later, forward in time — not a hazard.)

78 Data Hazards: arithmetic. (Pipeline diagram showing, for the same sequence, the value of register $2 and of the EX/MEM and MEM/WB pipeline registers in each clock cycle; dependences that point forward in time can be resolved by forwarding from EX/MEM or MEM/WB, and a register written and read in the same cycle is not a hazard.)

79 Data Dependencies: no forwarding. (Clock diagram: sub $2,$1,$3 then and $12,$2,$5; the and repeats its ID stage for two extra cycles — 2 stalls — since the register file writes in the 1st half of a cycle and reads in the 2nd half.) Suppose every instruction is dependent: 1 + 2 stalls = 3 clocks per instruction. MIPS = Clock / CPI = 500 MHz / 3 = 167 MIPS.

80 Data Dependencies: no forwarding. A dependent instruction takes 1 + 2 stalls = 3 clocks; an independent instruction takes 1 + 0 stalls = 1 clock. Suppose 10% of the time an instruction is dependent: average instruction time = 10% x 3 + 90% x 1 = 0.1 x 3 + 0.9 x 1 = 1.2 clocks. MIPS = Clock / CPI = 500 MHz / 1.2 = 417 MIPS (10% dependency); 500 MHz / 3 = 167 MIPS (100% dependency); 500 MHz / 1 = 500 MIPS (0% dependency).
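The same arithmetic as a small Python sketch, assuming the slide's 500 MHz clock, 3 clocks per dependent instruction, and 1 clock per independent instruction:

# Average CPI and MIPS rate as a function of the fraction of dependent instructions.
def mips(dependency_fraction, clock_mhz=500, dep_clocks=3, indep_clocks=1):
    cpi = dependency_fraction * dep_clocks + (1 - dependency_fraction) * indep_clocks
    return clock_mhz / cpi

print(round(mips(0.0)))   # 500 MIPS,  0% dependent
print(round(mips(0.1)))   # ~417 MIPS, 10% dependent
print(round(mips(1.0)))   # ~167 MIPS, every instruction dependent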

81 Data Dependencies: with forwarding. (Clock diagram: sub $2,$1,$3 then and $12,$2,$5; a type-1a data hazard is detected when ID/EX.$rs = EX/MEM.$rd and the result is forwarded.) Suppose every instruction is dependent: 1 + 0 stalls = 1 clock. MIPS = Clock / CPI = 500 MHz / 1 = 500 MIPS.

82 Data Dependencies: Hazard Conditions. A data hazard condition occurs whenever a source needs a previous, not-yet-available result from a destination. Example: sub $2, $1, $3 (sub $rd, $rs, $rt) followed by and $12, $2, $5 (and $rd, $rs, $rt). Data hazard detection always compares a destination with a source: EX/MEM.$rd against ID/EX.$rs (type 1a) and ID/EX.$rt (type 1b); MEM/WB.$rd against ID/EX.$rs (type 2a) and ID/EX.$rt (type 2b).
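A sketch of those four comparisons in Python, with each pipeline register modeled as a plain dictionary (the field names follow the slide; the data structure itself is just an assumption for illustration):

# Model each pipeline register as a dict with just the fields we need here.
def forwarding_hazards(id_ex, ex_mem, mem_wb):
    """Return which of the 1a/1b/2a/2b conditions hold (real hardware also
    checks that the producing instruction writes a register and that rd != $0)."""
    hits = []
    if ex_mem["rd"] == id_ex["rs"]:
        hits.append("1a")   # forward the EX/MEM result to the upper ALU input
    if ex_mem["rd"] == id_ex["rt"]:
        hits.append("1b")   # forward the EX/MEM result to the lower ALU input
    if mem_wb["rd"] == id_ex["rs"]:
        hits.append("2a")   # forward the MEM/WB result to the upper ALU input
    if mem_wb["rd"] == id_ex["rt"]:
        hits.append("2b")   # forward the MEM/WB result to the lower ALU input
    return hits

# sub $2,$1,$3 now in EX/MEM, and $12,$2,$5 in ID/EX  ->  ['1a']
print(forwarding_hazards({"rs": 2, "rt": 5}, {"rd": 2}, {"rd": 13}))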

83 Data Dependencies: Hazard Conditions. 1a data hazard, EX/MEM.$rd = ID/EX.$rs: sub $2, $1, $3 followed by and $12, $2, $5. 1b data hazard, EX/MEM.$rd = ID/EX.$rt: sub $2, $1, $3 followed by and $12, $1, $2. 2a data hazard, MEM/WB.$rd = ID/EX.$rs: sub $2, $1, $3; and $12, $1, $5; or $13, $2, $1. 2b data hazard, MEM/WB.$rd = ID/EX.$rt: sub $2, $1, $3; and $12, $1, $5; or $13, $6, $2.

84 Data Dependencies: worst case. sub $2, $1, $3; and $12, $2, $2; or $13, $2, $2. This triggers all four conditions — 1a: EX/MEM.$rd = ID/EX.$rs; 1b: EX/MEM.$rd = ID/EX.$rt; 2a: MEM/WB.$rd = ID/EX.$rs; 2b: MEM/WB.$rd = ID/EX.$rt.

85 Data Dependencies: Hazard Conditions (summary). 1a: ID/EX.$rs = EX/MEM.$rd; 1b: ID/EX.$rt = EX/MEM.$rd; 2a: ID/EX.$rs = MEM/WB.$rd; 2b: ID/EX.$rt = MEM/WB.$rd. (Diagram: the $rs, $rt, and $rd fields carried in the ID/EX, EX/MEM, and MEM/WB pipeline registers feed these comparisons.)

86 Forwarding hardware. (Datapath figures: a. no forwarding — the ALU inputs come only from the register file; b. with forwarding — muxes controlled by ForwardA and ForwardB select among the register file, EX/MEM.RegisterRd, and MEM/WB.RegisterRd, driven by a forwarding unit that compares Rs and Rt against the destination registers.)

87 Data Hazards: Loads. (Pipeline diagram: lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7. The and's dependence on the lw points backwards in time and cannot be resolved even by forwarding; the or can be satisfied by forwarding; the add reads $2 in the same cycle it is written — not a hazard; the slt has no dependence.)

88 Data Hazards: load stalling. (Pipeline diagram: the same sequence with a bubble inserted; the and and all following instructions are stalled one cycle so the loaded $2 can be forwarded.)

89 Data Hazards: hazard detection unit (page 49). Stall condition: ID/EX.MemRead = 1 and ID/EX.$rt matches IF/ID.$rs or IF/ID.$rt. Stall example: lw $2, 20($1) (lw $rt, addr($rs)) followed by and $4, $2, $5 (and $rd, $rs, $rt). No-stall example (we only need to look at the next instruction): lw $2, 20($1); and $4, $1, $5; or $8, $2, $6.
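The load-use stall test from this slide as a Python sketch (field names follow the slide; the dictionary encoding is assumed for illustration):

def must_stall(if_id, id_ex):
    """True when the instruction in ID needs a register the load in EX will produce."""
    return bool(id_ex["MemRead"]) and id_ex["rt"] in (if_id["rs"], if_id["rt"])

# lw $2, 20($1) in EX; and $4, $2, $5 in ID  -> stall
print(must_stall({"rs": 2, "rt": 5}, {"MemRead": 1, "rt": 2}))   # True
# lw $2, 20($1) in EX; and $4, $1, $5 in ID  -> no stall
print(must_stall({"rs": 1, "rt": 5}, {"MemRead": 1, "rt": 2}))   # False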

90 Data Hazards: hazard detection unit (page 49). No-stall example (we only need to look at the next instruction): lw $2, 20($1); and $4, $1, $5; or $8, $2, $6. Load example: assume half of the loads are immediately followed by an instruction that uses the result. What is the average number of clocks for a load? Load instruction time: 50% x (1 clock) + 50% x (2 clocks) = 1.5.

91 Hazard Detection Unit: when to stall. (Datapath figure: a hazard detection unit examines ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, and IF/ID.RegisterRt and, on a load-use hazard, holds the PC and the IF/ID register and zeroes the control signals; the forwarding unit compares against EX/MEM.RegisterRd and MEM/WB.RegisterRd.)

92 Data Dependency Units. Forwarding condition: ID/EX.$rs or ID/EX.$rt equals EX/MEM.$rd or MEM/WB.$rd. Stall condition: IF/ID.$rs or IF/ID.$rt equals ID/EX.$rt, and ID/EX.MemRead = 1.

93 Data Dependency Units. (Diagram: the stalling comparisons use the $rs and $rt fields in IF/ID against the $rt in ID/EX; the forwarding comparisons use the $rs and $rt fields in ID/EX against the $rd fields in EX/MEM and MEM/WB.) Stall condition: IF/ID.$rs or IF/ID.$rt equals ID/EX.$rt, and ID/EX.MemRead = 1.

94 Branch Hazards: Solution #1, stall until the decision is made (fig. 6.4). (Pipeline diagram: add $4, $5, $6; beq $1, $3, 7; the instructions after the branch — through lw $4, 50($7) — are stalled until the branch decision is made in the ID stage, which here resolves to: do the load.)

95 Branch Hazards: Solution #2, predict until the decision is made. (Clock diagram: beq $1,$3,7 is predicted not taken; and $12,$2,$5 is fetched and then discarded when the decision, made in the ID stage, goes the other way; lw $4, 50($7) is fetched from the branch target.)

96 Branch Hazards: Solution #3, delayed decision. (Clock diagram: an instruction from before the branch, add $4,$6,$6, is moved into the slot after beq $1,$3,7, so it never needs to be discarded; the decision is made in the ID stage and lw $4, 50($7) follows.)

97 Branch Hazards: Solution #3, delayed decision (continued). (Clock diagram: beq $1,$3,7 with and $12,$2,$5 in the delay slot; the decision, made in the ID stage, is to take the branch, and lw $4, 50($7) is fetched from the target.)

98 Branch Hazards: the decision is made in the ID stage (figure 6.4). (Clock diagram: beq $1,$3,7; no decision yet, so a nop is inserted; the decision then resolves to "do the load", and lw $4, 50($7) is fetched.)

99 Branch Hazards: Solution #2, predict until the decision is made. (Pipeline diagram: 40 beq $1, $3, 7; 44 and $12, $2, $5; 48 or $13, $6, $2; 52 add $14, $2, $2; 72 lw $4, 50($7). The branch is predicted not taken; here the decision is made in the MEM stage, so when the prediction is wrong the values of the three following instructions are discarded — the same effect as 3 stalls.)

100 Early branch decision hardware. (Datapath figure: the branch comparison is moved up into the ID stage; an IF.Flush signal turns the wrongly fetched instruction into a nop when the prediction is wrong; the hazard detection unit and forwarding unit are also shown.)

101 Performance. Load: assume half of the loads are immediately followed by an instruction that uses the result (i.e. a dependency); load instruction time = 50% x (1 clock) + 50% x (2 clocks) = 1.5. Jump: assume that jumps always pay a full one-clock-cycle delay (stall); jump instruction time = 2. Branch: the misprediction delay is 1 clock cycle and 25% of branches are mispredicted; branch time = 75% x (1 clock) + 25% x (2 clocks) = 1.25.

102 Performance (page 54). CPI here is the instruction latency within the pipeline; MIPS measures pipeline throughput.
Class (mix %i)   | Single-cycle clocks | Multi-cycle clocks | Pipeline cycles
loads (23%)      | 1                   | 5                  | 1.5 (50% dependency)
stores (13%)     | 1                   | 4                  | 1
arithmetic (43%) | 1                   | 4                  | 1
branches (19%)   | 1                   | 3                  | 1.25 (25% misprediction)
jumps (2%)       | 1                   | 3                  | 2
Clock speed      | 125 MHz (8 ns)      | 500 MHz (2 ns)     | 500 MHz (2 ns)
CPI = Σ cycles x %i; MIPS = Clock / CPI: 125 MIPS (single-cycle), 125 MIPS (multi-cycle), 424 MIPS (pipeline).
load instruction time = 50% x (1 clock) + 50% x (2 clocks) = 1.5; branch time = 75% x (1 clock) + 25% x (2 clocks) = 1.25
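The pipeline column of that table can be checked with a short Python sketch using the instruction mix and per-class cycle counts shown above:

# CPI = sum over instruction classes of (cycles * mix fraction); MIPS = clock / CPI.
mix    = {"loads": 0.23, "stores": 0.13, "arithmetic": 0.43, "branches": 0.19, "jumps": 0.02}
cycles = {"loads": 1.5,  "stores": 1.0,  "arithmetic": 1.0,  "branches": 1.25, "jumps": 2.0}

cpi = sum(cycles[c] * mix[c] for c in mix)
clock_mhz = 500
print(round(cpi, 3), "CPI ->", round(clock_mhz / cpi), "MIPS")  # ~1.18 CPI, ~423 MIPS (the slide rounds to 424)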

103 Branch prediction. Datapath parallelism is only useful if you can keep it fed. It is easy to fetch multiple (consecutive) instructions per cycle — essentially speculating on sequential flow. Jump: an unconditional change of control flow, always taken. Branch: a conditional change of control flow, taken about 50% of the time — backward branches (about 30% of branches) are roughly 80% taken, forward branches (about 70%) are roughly 40% taken.

104 A Big Idea for Today. Reactive: past actions cause the system to adapt its use — do what you did before, better (e.g. caches, TCP windows, URL completion, ...). Proactive: use past actions to predict future actions — optimize speculatively, anticipate what you are about to do (branch prediction, long cache blocks, ...?).

105 The Case for Branch Prediction when Issuing N Instructions per Clock Cycle. 1. Branches will arrive up to n times faster in an n-issue processor. 2. Amdahl's Law => the relative impact of control stalls is larger with the lower potential CPI of an n-issue processor. Conversely, branch prediction is needed to expose the potential parallelism.

106 Branch Prediction Schemes: static branch prediction; the 1-bit branch-prediction buffer; the 2-bit branch-prediction buffer; the correlating branch-prediction buffer; the tournament branch predictor; the branch target buffer; integrated instruction fetch units; return address predictors.

107 Dynamic Branch Prediction. Performance = f(accuracy, cost of misprediction). Branch History Table: the lower bits of the PC address index a table of 1-bit values that say whether or not the branch was taken last time. There is no address check — this saves hardware, but the entry may not belong to the right branch. If the instruction is a branch, update the table with the outcome. Problem: in a loop, a 1-bit BHT causes 2 mispredictions — at the end of the loop, when it exits instead of looping as before, and the first time through the loop on the next pass, when it predicts exit instead of looping. With an average of 9 iterations before exit, accuracy is only 80% even though the loop branch is taken 90% of the time. This is local history: the history of this particular branch instruction — or of one that maps into the same slot of the table.

108 2-bit Dynamic Branch Prediction. A 2-bit scheme changes the prediction only after it mispredicts twice. (State diagram: two "predict taken" states and two "predict not taken" states with T/NT transitions; red = stop, not taken; green = go, taken.) This adds hysteresis to the decision-making process. It generalizes to an n-bit saturating counter.
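A minimal 2-bit saturating-counter predictor in Python — a sketch of the scheme described above, not code from the course:

class TwoBitPredictor:
    """One 2-bit saturating counter: states 0,1 predict not taken; 2,3 predict taken."""
    def __init__(self, state=1):
        self.state = state

    def predict(self):
        return self.state >= 2          # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop branch taken 9 times and then not taken: once the counter is warmed up,
# only the exit iteration mispredicts; a 1-bit scheme would also miss the re-entry.
p = TwoBitPredictor(state=3)
mispredicts = 0
for taken in [True] * 9 + [False]:
    if p.predict() != taken:
        mispredicts += 1
    p.update(taken)
print(mispredicts)   # 1 misprediction out of 10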

109 Consider 3 Scenarios: a branch used as a loop test; a branch that checks for an error or exception; an alternating taken / not-taken pattern. How do the simple predictors do on each, and where does global history help? What is your worst-case prediction scenario? How could hardware predict "this loop will execute 3 times" using a simple mechanism?

110 Correlating Branches. Idea: whether recently executed branches were taken or not taken is related to the behavior of the next branch (as well as to that branch's own history). The behavior of recent branches then selects between, say, 4 predictions for the next branch, and only that prediction is updated. (Diagram: the branch address (4 bits) together with the 2-bit recent global branch history — e.g. 01 = not taken then taken — indexes 2-bit-per-branch local predictors to produce the prediction.) A (2,2) predictor: 2 bits of global history, 2-bit counters per branch.

111 Accuracy of Different Schemes. (Chart: frequency of mispredictions on nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, and li for a 4,096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1,024-entry (2,2) BHT; the (2,2) predictor is consistently best.) What's missing in this picture?

112 Re-evaluating Correlation. Several of the SPEC benchmarks have fewer than a dozen branches responsible for 90% of taken branches. (Table: for compress, eqntott, gcc, mpeg, and "real" gcc — the dynamic branch percentage, the static branch count, and the number of static branches accounting for 90% of taken branches.) Real programs plus the OS behave more like gcc. Are there only small benefits beyond benchmarks for correlation? Are there problems with branch aliases?

113 BHT Accuracy. A misprediction happens either because the guess for that branch was wrong, or because the history of a different branch was fetched when indexing the table. With a 4096-entry table, programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%. For SPEC92, 4096 entries is about as good as an infinite table.

114 Dynamically finding structure in spaghetti?

115 Tournament Predictors. The motivation for correlating branch predictors was that the 2-bit predictor failed on important branches; adding global information improved performance. Tournament predictors use 2 predictors — one based on global information and one based on local information — and combine them with a selector that uses whichever predictor tends to guess correctly. (Diagram: the address and the history feed Predictor A and Predictor B.)

116 Tournament Predictor in the Alpha 21264. 4K 2-bit counters choose between a global predictor and a local predictor. The global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry is a standard 2-bit predictor. In the 12-bit pattern, the ith bit = 0 means the ith prior branch was not taken, and the ith bit = 1 means it was taken. The local predictor is a 2-level predictor: the top level is a local history table of 1024 10-bit entries, each recording the 10 most recent branch outcomes for that entry, so patterns of up to 10 branches can be discovered and predicted; the selected entry then indexes a table of 1K 3-bit saturating counters, which provide the local prediction. Total size: 4K x 2 + 4K x 2 + 1K x 10 + 1K x 3 = 29K bits! (~180,000 transistors)
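The storage arithmetic from this slide, spelled out in Python using the table sizes quoted above:

K = 1024
bits = (4*K * 2       # selector: 4K 2-bit "choose" counters
      + 4*K * 2       # global predictor: 4K 2-bit counters
      + 1*K * 10      # local history table: 1K 10-bit histories
      + 1*K * 3)      # local prediction table: 1K 3-bit saturating counters
print(bits // K, "Kbits")   # 29 Kbits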

117 Percentage of predictions coming from the local predictor in the tournament prediction scheme. (Chart, SPEC89: nasa7 37%, matrix300 55%, tomcatv 76%, doduc 72%, spice 63%, fpppp 69%, gcc 98%, espresso 100%, eqntott 94%, li 90%.)

118 Accuracy of Branch Prediction (fig 3.4). (Chart: branch prediction accuracy of profile-based, 2-bit counter, and tournament predictors on tomcatv, doduc, fpppp, li, espresso, and gcc; the tournament predictor is the most accurate on every benchmark.)

119 Accuracy vs. Size (SPEC89). (Chart: conditional branch misprediction rate versus total predictor size in Kbits for local, correlating, and tournament predictors; the tournament predictor gives the lowest misprediction rate at every size.)

120 GSHARE — a good, simple predictor. It sprays the predictions for a given address across a large table, one entry per distinct history, by indexing the table with (address XOR global history).
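A toy gshare predictor in Python to show the indexing idea, assuming a table of 2-bit counters indexed by (PC XOR global history); the parameter names and sizes are illustrative, not the slides':

class GShare:
    """Toy gshare: a table of 2-bit counters indexed by PC xor global history."""
    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)   # 2-bit counters, start weakly not taken
        self.history = 0

    def _index(self, pc):
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask

The XOR spreads the different histories of one branch over different counters, which is exactly the "spraying" the slide describes.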

121 Need the Target Address at the Same Time as the Prediction. Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken). Note: we must now check for a branch match, since we can't use a wrong branch's address (Figure 3.9, 3.2). (Diagram: the fetch PC is compared against the stored branch PCs; if there is no match, the instruction is not predicted to be a branch and fetch proceeds normally with Next PC = PC+4; if there is a match, the instruction is a branch and the predicted PC is used as the next PC; extra prediction state bits are stored alongside each entry.)
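A toy branch target buffer as a Python dictionary, just to show the lookup-and-compare step described above (the structure is illustrative, not the hardware organization):

# BTB entry: fetch PC -> (predicted target, 2-bit prediction state).
btb = {}

def next_pc(pc):
    """Return the next fetch address: predicted target on a taken hit, else PC+4."""
    entry = btb.get(pc)            # the dictionary lookup plays the "branch PC =?" check:
    if entry is None:              # only a matching entry counts, so a wrong branch's
        return pc + 4              # target is never used
    target, state = entry
    return target if state >= 2 else pc + 4

btb[0x400] = (0x480, 3)            # a taken branch at 0x400 targeting 0x480
print(hex(next_pc(0x400)), hex(next_pc(0x404)))   # 0x480 0x408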

122 Predicated Execution. Avoid branch prediction by turning branches into conditionally executed instructions: if (x) then A = B op C else NOP — if x is false, neither store the result nor cause an exception. The expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have a conditional move; PA-RISC can annul any following instruction; IA-64 has 64 1-bit condition fields, so any instruction can be executed conditionally. This transformation is called if-conversion: x? A = B op C. Drawbacks of conditional instructions: they still take a clock even if annulled; they stall if the condition is evaluated late; complex conditions reduce effectiveness because the condition becomes known late in the pipeline.

123 Special Case: Return Addresses. Register-indirect branches are hard to predict (the target address varies). In SPEC89, 85% of such branches are procedure returns. Since procedures follow a stack discipline, save the return address in a small buffer that acts like a stack: 8 to 16 entries gives a small miss rate.

124 Pitfall: sometimes bigger and dumber is better. The 21264 uses a tournament predictor (29 Kbits); the earlier 21164 uses a simple 2-bit predictor with 2K entries (a total of 4 Kbits). On the SPEC95 benchmarks the 21264 outperforms: it averages 11.5 mispredictions per 1000 instructions versus 16.5 per 1000 for the 21164. This is reversed for transaction processing (TP)! The 21264 averages 17 mispredictions per 1000 instructions, the 21164 averages 15 per 1000. TP code is much larger, and the 21164 holds 2X as many branch predictions based on local behavior (a 2K local predictor vs. the 1K local predictor in the 21264).

125 A zero-cycle jump. What really has to be done at run time? Once an instruction has been detected as a jump or JAL, we might recode it in the internal cache — a very limited form of dynamic compilation? (Example: the sequence and r3,r1,r5; addi r2,r3,#4; sub r4,r2,r1; jal doit; subi r1,r1,#1 is held in an internal cache whose entries carry a next-address field, so the sub's entry points directly at doit and the jal itself disappears.) This uses a pre-decoded instruction cache, called branch folding in the Bell Labs CRISP processor; the original CRISP cache had two addresses and could thus fold a complete branch into the previous instruction. Notice that JAL introduces a structural hazard on the register write.

126 Reflect PREDICTIONS and remove delay slots. (Example: and r3,r1,r5; addi r2,r3,#4; sub r4,r2,r1; bne r4,loop; subi r1,r1,#1 — the internal cache entry for the bne records "loop" as its next address.) This causes the next instruction to be fetched immediately from the branch destination (predict taken). If the branch ends up not being taken, then squash the destination instruction and restart the pipeline at address A+16.

127 Dynamic Branch Prediction Summary. Prediction is becoming an important part of scalar execution. Branch History Table: 2 bits for loop accuracy. Correlation: recently executed branches are correlated with the next branch — either different branches, or different executions of the same branch. Tournament predictor: devote more resources to competing solutions and pick between them. Branch Target Buffer: include the branch target address & prediction. Predicated execution can reduce the number of branches and of mispredicted branches. A return address stack predicts the indirect jumps of procedure returns.

128 Exceptions and Interrupts (Hardware)

129 Example: Device Interrupt (say, the arrival of a network message). (Figure: the user code add r1,r2,r3; subi r4,r1,#4; slli r4,r4,#2 is interrupted — "Hiccup!" — before lw r2,0(r4); lw r3,4(r4); add r2,r2,r3; sw 8(r4),r2 can run. On the external interrupt the PC is saved, all interrupts are disabled, and the machine enters supervisor mode. The interrupt handler raises the priority, re-enables interrupts, saves registers, services the device with a few loads and stores, restores registers, clears the current interrupt, disables all interrupts, restores the priority, and executes RTE; the PC is restored and user mode resumes.)

130 Alternative: Polling (again, for the arrival of a network message). (Figure: the same computation with a polling point inserted — load a device register and beq to no_mess if there is no message; otherwise fall into the handler, which services the device and clears the network interrupt, with network interrupts disabled around the check.)

131 Polling is faster/slower than interrupts. Polling is faster than interrupts because the compiler knows which registers are in use at the polling point, so it does not need to save and restore registers (or not as many), and other interrupt overhead is avoided (pipeline flush, trap priorities, etc.). Polling is slower than interrupts because the overhead of the polling instructions is incurred whether or not the handler is run — this can add to inner-loop delay — and the device may have to wait a long time for service. When to use one or the other? It is a multi-axis tradeoff: frequent/regular events are good for polling, as long as the device can be controlled at user level; interrupts are good for infrequent/irregular events and for ensuring regular/predictable service of events.

132 Exception/Interrupt classifications. Exceptions: relevant to the current process — faults, arithmetic traps, and synchronous traps; they invoke software on behalf of the currently executing process. Interrupts: caused by asynchronous, outside events — I/O devices requiring service (disk, network), clock interrupts (real-time scheduling). Machine checks: caused by serious hardware failure; not always restartable; they indicate that bad things have happened — a non-recoverable ECC error, a machine-room fire, a power outage.

133 A related classification: synchronous vs. asynchronous. Synchronous means related to the instruction stream, i.e. occurring during the execution of an instruction: the currently executing instruction must be stopped — a page fault on a load or store, an arithmetic exception, a software trap. Asynchronous means unrelated to the instruction stream, i.e. caused by an outside event, and it does not have to disrupt instructions that are already executing: interrupts are asynchronous, machine checks are asynchronous. Semi-synchronous (or high-availability) interrupts are caused by an external event but may have to disrupt current instructions in order to guarantee service.

134 Interrupt Priorities Must be Handled. (The same device-interrupt figure, now annotated to show that the network interrupt handler could itself be interrupted by the disk.) Note that the priority must be raised to avoid recursive interrupts!

135 Interrupt controller hardware and mask levels. The operating system constructs a hierarchy of masks that reflects some form of interrupt priority, for instance: software interrupts at the lowest priority; 2 — network interrupts; 4 — sound card; 5 — disk interrupt; 6 — real-time clock; non-maskable interrupts (power) at the highest. This reflects an ordering of urgency among interrupts: for instance, this ordering says that disk events can interrupt the interrupt handlers for network interrupts.

136 Can we have fast interrupts? (The same interrupt figure, annotated with the costs of taking a fine-grained interrupt: the pipeline drain can be very expensive; priority manipulations; register save/restore — 128 registers plus cache misses, etc. The handler could still be interrupted by the disk.)

137 SPARC (and RISC I) had register windows. On an interrupt or procedure call, simply switch to a different set of registers. This really saves on interrupt overhead: interrupts can happen at any point in execution, so the compiler cannot help with knowledge of live registers — conservative handlers must save all registers, and although short handlers might be able to save only a few, that analysis is complicated. It is not as big a deal for procedure calls: the original statement by Patterson was that Berkeley didn't have a compiler team, so they used a hardware solution; good compilers can allocate registers across procedure boundaries and know which registers are live at any one time. However, register windows have returned! IA-64 has them, and many other processors have shadow registers for interrupts.

138 Supervisor State. Typically, processors have some amount of state that user programs are not allowed to touch. Page-mapping hardware / TLB: the TLB prevents one user from accessing another's memory, and TLB protection prevents a user from modifying the mappings. Interrupt controllers: user code is prevented from crashing the machine by disabling interrupts, ignoring device interrupts, etc.; real-time clock interrupts ensure that users cannot lock up or crash the machine even if they run code that goes into a loop (preemptive vs. non-preemptive multitasking). Access to hardware devices is restricted: this prevents a malicious user from stealing network packets or writing over disk blocks. The distinction is made with at least two levels, USER/SYSTEM (one hardware mode bit); x86 architectures actually provide 4 different levels, but only two are usually used by the OS (or only one in older Microsoft OSs).


More information

EEC 483 Computer Organization

EEC 483 Computer Organization EEC 83 Compter Organization Chapter.6 A Pipelined path Chans Y Pipelined Approach 2 - Cycle time, No. stages - Resorce conflict E E A B C D 3 E E 5 E 2 3 5 2 6 7 8 9 c.y9@csohio.ed Resorces sed in 5 Stages

More information

PART I: Adding Instructions to the Datapath. (2 nd Edition):

PART I: Adding Instructions to the Datapath. (2 nd Edition): EE57 Instrctor: G. Pvvada ===================================================================== Homework #5b De: check on the blackboard =====================================================================

More information

EEC 483 Computer Organization

EEC 483 Computer Organization EEC 483 Compter Organization Chapter 4.4 A Simple Implementation Scheme Chans Y The Big Pictre The Five Classic Components of a Compter Processor Control emory Inpt path Otpt path & Control 2 path and

More information

Pipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ...

Pipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ... CHAPTER 6 1 Pipelining Instruction class Instruction memory ister read ALU Data memory ister write Total (in ps) Load word 200 100 200 200 100 800 Store word 200 100 200 200 700 R-format 200 100 200 100

More information

Instruction fetch. MemRead. IRWrite ALUSrcB = 01. ALUOp = 00. PCWrite. PCSource = 00. ALUSrcB = 00. R-type completion

Instruction fetch. MemRead. IRWrite ALUSrcB = 01. ALUOp = 00. PCWrite. PCSource = 00. ALUSrcB = 00. R-type completion . (Chapter 5) Fill in the vales for SrcA, SrcB, IorD, Dst and emto to complete the Finite State achine for the mlti-cycle datapath shown below. emory address comptation 2 SrcA = SrcB = Op = fetch em SrcA

More information

Computer Architecture Chapter 5. Fall 2005 Department of Computer Science Kent State University

Computer Architecture Chapter 5. Fall 2005 Department of Computer Science Kent State University Compter Architectre Chapter 5 Fall 25 Department of Compter Science Kent State University The Processor: Datapath & Control Or implementation of the MIPS is simplified memory-reference instrctions: lw,

More information

Prof. Kozyrakis. 1. (10 points) Consider the following fragment of Java code:

Prof. Kozyrakis. 1. (10 points) Consider the following fragment of Java code: EE8 Winter 25 Homework #2 Soltions De Thrsday, Feb 2, 5 P. ( points) Consider the following fragment of Java code: for (i=; i

More information

Computer Architecture

Computer Architecture Compter Architectre Lectre 4: Intro to icroarchitectre: Single- Cycle Dr. Ahmed Sallam Sez Canal University Spring 25 Based on original slides by Prof. Onr tl Review Compter Architectre Today and Basics

More information

Computer Architecture

Computer Architecture Compter Architectre Lectre 4: Intro to icroarchitectre: Single- Cycle Dr. Ahmed Sallam Sez Canal University Based on original slides by Prof. Onr tl Review Compter Architectre Today and Basics (Lectres

More information

Lecture 10: Pipelined Implementations

Lecture 10: Pipelined Implementations U 8-7 S 9 L- 8-7 Lectre : Pipelined Implementations James. Hoe ept of EE, U Febrary 23, 29 nnoncements: Project is de this week idterm graded, d reslts posted Handots: H9 Homework 3 (on lackboard) Graded

More information

Pipelined Datapath. Reading. Sections Practice Problems: 1, 3, 8, 12

Pipelined Datapath. Reading. Sections Practice Problems: 1, 3, 8, 12 Pipelined Datapath Lecture notes from KP, H. H. Lee and S. Yalamanchili Sections 4.5 4. Practice Problems:, 3, 8, 2 ing Note: Appendices A-E in the hardcopy text correspond to chapters 7- in the online

More information

Quiz #1 EEC 483, Spring 2019

Quiz #1 EEC 483, Spring 2019 Qiz # EEC 483, Spring 29 Date: Jan 22 Name: Eercise #: Translate the following instrction in C into IPS code. Eercise #2: Translate the following instrction in C into IPS code. Hint: operand C is stored

More information

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Introduction Chapter 4.1 Chapter 4.2 Review: MIPS (RISC) Design Principles Simplicity favors regularity fixed size instructions small number

More information

Lecture 6: Microprogrammed Multi Cycle Implementation. James C. Hoe Department of ECE Carnegie Mellon University

Lecture 6: Microprogrammed Multi Cycle Implementation. James C. Hoe Department of ECE Carnegie Mellon University 8 447 Lectre 6: icroprogrammed lti Cycle Implementation James C. Hoe Department of ECE Carnegie ellon University 8 447 S8 L06 S, James C. Hoe, CU/ECE/CALC, 208 Yor goal today Hosekeeping nderstand why

More information

Processor Design CSCE Instructor: Saraju P. Mohanty, Ph. D. NOTE: The figures, text etc included in slides are borrowed

Processor Design CSCE Instructor: Saraju P. Mohanty, Ph. D. NOTE: The figures, text etc included in slides are borrowed Lecture 3: General Purpose Processor Design CSCE 665 Advanced VLSI Systems Instructor: Saraju P. ohanty, Ph. D. NOTE: The figures, tet etc included in slides are borrowed from various books, websites,

More information

1048: Computer Organization

1048: Computer Organization 48: Compter Organization Lectre 5 Datapath and Control Lectre5B - mlticycle implementation (cwli@twins.ee.nct.ed.tw) 5B- Recap: A Single-Cycle Processor PCSrc 4 Add Shift left 2 Add ALU reslt PC address

More information

Lecture 7. Building A Simple Processor

Lecture 7. Building A Simple Processor Lectre 7 Bilding A Simple Processor Christos Kozyrakis Stanford University http://eeclass.stanford.ed/ee8b C. Kozyrakis EE8b Lectre 7 Annoncements Upcoming deadlines Lab is de today Demo by 5pm, report

More information

Chapter 4 (Part II) Sequential Laundry

Chapter 4 (Part II) Sequential Laundry Chapter 4 (Part II) The Processor Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Sequential Laundry 6 P 7 8 9 10 11 12 1 2 A T a s k O r d e r A B C D 30 30 30 30 30 30 30 30 30 30

More information

Lecture 3. Pipelining. Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1

Lecture 3. Pipelining. Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1 Lecture 3 Pipelining Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1 A "Typical" RISC ISA 32-bit fixed format instruction (3 formats) 32 32-bit GPR (R0 contains zero, DP take pair)

More information

Lecture 13: Exceptions and Interrupts

Lecture 13: Exceptions and Interrupts 18 447 Lectre 13: Eceptions and Interrpts S 10 L13 1 James C. Hoe Dept of ECE, CU arch 1, 2010 Annoncements: Handots: Spring break is almost here Check grades on Blackboard idterm 1 graded Handot #9: Lab

More information

CSE 141 Computer Architecture Summer Session I, Lectures 10 Advanced Topics, Memory Hierarchy and Cache. Pramod V. Argade

CSE 141 Computer Architecture Summer Session I, Lectures 10 Advanced Topics, Memory Hierarchy and Cache. Pramod V. Argade CSE 141 Compter Architectre Smmer Session I, 2004 Lectres 10 Advanced Topics, emory Hierarchy and Cache Pramod V. Argade CSE141: Introdction to Compter Architectre Instrctor: TA: Pramod V. Argade (p2argade@cs.csd.ed)

More information

NOW Handout Page 1. Debate Review: ISA -a Critical Interface. EECS 252 Graduate Computer Architecture

NOW Handout Page 1. Debate Review: ISA -a Critical Interface. EECS 252 Graduate Computer Architecture NOW Handout Page 1 EECS 252 Graduate Computer Architecture Lec 4 Issues in Basic Pipelines (stalls, exceptions, branch prediction) David Culler Electrical Engineering and Computer Sciences University of

More information

Pipelined Datapath. Reading. Sections Practice Problems: 1, 3, 8, 12 (2) Lecture notes from MKP, H. H. Lee and S.

Pipelined Datapath. Reading. Sections Practice Problems: 1, 3, 8, 12 (2) Lecture notes from MKP, H. H. Lee and S. Pipelined Datapath Lecture notes from KP, H. H. Lee and S. Yalamanchili Sections 4.5 4. Practice Problems:, 3, 8, 2 ing (2) Pipeline Performance Assume time for stages is ps for register read or write

More information

CPS104 Computer Organization and Programming Lecture 19: Pipelining. Robert Wagner

CPS104 Computer Organization and Programming Lecture 19: Pipelining. Robert Wagner CPS104 Computer Organization and Programming Lecture 19: Pipelining Robert Wagner cps 104 Pipelining..1 RW Fall 2000 Lecture Overview A Pipelined Processor : Introduction to the concept of pipelined processor.

More information

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri Department of Computer and IT Engineering University of Kurdistan Computer Architecture Pipelining By: Dr. Alireza Abdollahpouri Pipelined MIPS processor Any instruction set can be implemented in many

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

Outline Marquette University

Outline Marquette University COEN-4710 Computer Hardware Lecture 4 Processor Part 2: Pipelining (Ch.4) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations from Mike

More information

T = I x CPI x C. Both effective CPI and clock cycle C are heavily influenced by CPU design. CPI increased (3-5) bad Shorter cycle good

T = I x CPI x C. Both effective CPI and clock cycle C are heavily influenced by CPU design. CPI increased (3-5) bad Shorter cycle good CPU performance equation: T = I x CPI x C Both effective CPI and clock cycle C are heavily influenced by CPU design. For single-cycle CPU: CPI = 1 good Long cycle time bad On the other hand, for multi-cycle

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

Lecture 6: Pipelining

Lecture 6: Pipelining Lecture 6: Pipelining i CSCE 26 Computer Organization Instructor: Saraju P. ohanty, Ph. D. NOTE: The figures, text etc included in slides are borrowed from various books, websites, authors pages, and other

More information

Pipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Pipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Pipeline Hazards Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Hazards What are hazards? Situations that prevent starting the next instruction

More information

Lecture 9: Microcontrolled Multi-Cycle Implementations

Lecture 9: Microcontrolled Multi-Cycle Implementations 8-447 Lectre 9: icroled lti-cycle Implementations James C. Hoe Dept of ECE, CU Febrary 8, 29 S 9 L9- Annoncements: P&H Appendi D Get started t on Lab Handots: Handot #8: Project (on Blackboard) Single-Cycle

More information

Modern Computer Architecture

Modern Computer Architecture Modern Computer Architecture Lecture2 Pipelining: Basic and Intermediate Concepts Hongbin Sun 国家集成电路人才培养基地 Xi an Jiaotong University Pipelining: Its Natural! Laundry Example Ann, Brian, Cathy, Dave each

More information

What do we have so far? Multi-Cycle Datapath (Textbook Version)

What do we have so far? Multi-Cycle Datapath (Textbook Version) What do we have so far? ulti-cycle Datapath (Textbook Version) CPI: R-Type = 4, Load = 5, Store 4, Branch = 3 Only one instruction being processed in datapath How to lower CPI further? #1 Lec # 8 Summer2001

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

Improve performance by increasing instruction throughput

Improve performance by increasing instruction throughput Improve performance by increasing instruction throughput Program execution order Time (in instructions) lw $1, 100($0) fetch 2 4 6 8 10 12 14 16 18 ALU Data access lw $2, 200($0) 8ns fetch ALU Data access

More information

4.13 Advanced Topic: An Introduction to Digital Design Using a Hardware Design Language 345.e1

4.13 Advanced Topic: An Introduction to Digital Design Using a Hardware Design Language 345.e1 .3 Advanced Topic: An Introdction to Digital Design Using a Hardware Design Langage 35.e.3 Advanced Topic: An Introdction to Digital Design Using a Hardware Design Langage to Describe and odel a Pipeline

More information

Hardware Design Tips. Outline

Hardware Design Tips. Outline Hardware Design Tips EE 36 University of Hawaii EE 36 Fall 23 University of Hawaii Otline Verilog: some sbleties Simlators Test Benching Implementing the IPS Actally a simplified 6 bit version EE 36 Fall

More information

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3. Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =2n/05n+15 2n/0.5n 1.5 4 = number of stages 4.5 An Overview

More information

Dynamic Control Hazard Avoidance

Dynamic Control Hazard Avoidance Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count CPI and Cycle time Determined

More information

CENG 3420 Lecture 06: Pipeline

CENG 3420 Lecture 06: Pipeline CENG 3420 Lecture 06: Pipeline Bei Yu byu@cse.cuhk.edu.hk CENG3420 L06.1 Spring 2019 Outline q Pipeline Motivations q Pipeline Hazards q Exceptions q Background: Flip-Flop Control Signals CENG3420 L06.2

More information

Advanced Computer Architecture Pipelining

Advanced Computer Architecture Pipelining Advanced Computer Architecture Pipelining Dr. Shadrokh Samavi Some slides are from the instructors resources which accompany the 6 th and previous editions of the textbook. Some slides are from David Patterson,

More information

LECTURE 3: THE PROCESSOR

LECTURE 3: THE PROCESSOR LECTURE 3: THE PROCESSOR Abridged version of Patterson & Hennessy (2013):Ch.4 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

Lecture 9. Pipeline Hazards. Christos Kozyrakis Stanford University

Lecture 9. Pipeline Hazards. Christos Kozyrakis Stanford University Lecture 9 Pipeline Hazards Christos Kozyrakis Stanford University http://eeclass.stanford.edu/ee18b 1 Announcements PA-1 is due today Electronic submission Lab2 is due on Tuesday 2/13 th Quiz1 grades will

More information

CS 110 Computer Architecture. Pipelining. Guest Lecture: Shu Yin. School of Information Science and Technology SIST

CS 110 Computer Architecture. Pipelining. Guest Lecture: Shu Yin.   School of Information Science and Technology SIST CS 110 Computer Architecture Pipelining Guest Lecture: Shu Yin http://shtech.org/courses/ca/ School of Information Science and Technology SIST ShanghaiTech University Slides based on UC Berkley's CS61C

More information

CS 251, Spring 2018, Assignment 3.0 3% of course mark

CS 251, Spring 2018, Assignment 3.0 3% of course mark CS 25, Spring 28, Assignment 3. 3% of corse mark De onday, Jne 25th, 5:3 P. (5 points) Consider the single-cycle compter shown on page 6 of this assignment. Sppose the circit elements take the following

More information

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14 MIPS Pipelining Computer Organization Architectures for Embedded Computing Wednesday 8 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy 4th Edition, 2011, MK

More information

EECS 252 Graduate Computer Architecture Lecture 4. Control Flow (continued) Interrupts. Review: Ordering Properties of basic inst.

EECS 252 Graduate Computer Architecture Lecture 4. Control Flow (continued) Interrupts. Review: Ordering Properties of basic inst. EECS 252 Graduate Computer Architecture Lecture 4 Control Flow (continued) Interrupts John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~kubitron/cs252

More information

Page 1. Today s Big Idea. Lecture 18: Branch Prediction + analysis resources => ILP

Page 1. Today s Big Idea. Lecture 18: Branch Prediction + analysis resources => ILP CS252 Graduate Computer Architecture Lecture 18: Branch Prediction + analysis resources => ILP April 2, 2 Prof. David E. Culler Computer Science 252 Spring 2 Today s Big Idea Reactive: past actions cause

More information

Computer Architecture. Lecture 6.1: Fundamentals of

Computer Architecture. Lecture 6.1: Fundamentals of CS3350B Computer Architecture Winter 2015 Lecture 6.1: Fundamentals of Instructional Level Parallelism Marc Moreno Maza www.csd.uwo.ca/courses/cs3350b [Adapted from lectures on Computer Organization and

More information

CSE Introduction to Computer Architecture Chapter 5 The Processor: Datapath & Control

CSE Introduction to Computer Architecture Chapter 5 The Processor: Datapath & Control CSE-45432 Introdction to Compter Architectre Chapter 5 The Processor: Datapath & Control Dr. Izadi Data Processor Register # PC Address Registers ALU memory Register # Register # Address Data memory Data

More information

Full Datapath. Chapter 4 The Processor 2

Full Datapath. Chapter 4 The Processor 2 Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

More information

CS 251, Winter 2018, Assignment % of course mark

CS 251, Winter 2018, Assignment % of course mark CS 25, Winter 28, Assignment 3.. 3% of corse mark De onday, Febrary 26th, 4:3 P Lates accepted ntil : A, Febrary 27th with a 5% penalty. IEEE 754 Floating Point ( points): (a) (4 points) Complete the following

More information

Pipelining. Maurizio Palesi

Pipelining. Maurizio Palesi * Pipelining * Adapted from David A. Patterson s CS252 lecture slides, http://www.cs.berkeley/~pattrsn/252s98/index.html Copyright 1998 UCB 1 References John L. Hennessy and David A. Patterson, Computer

More information

CSCI 402: Computer Architectures. Fengguang Song Department of Computer & Information Science IUPUI. Today s Content

CSCI 402: Computer Architectures. Fengguang Song Department of Computer & Information Science IUPUI. Today s Content 3/6/8 CSCI 42: Computer Architectures The Processor (2) Fengguang Song Department of Computer & Information Science IUPUI Today s Content We have looked at how to design a Data Path. 4.4, 4.5 We will design

More information

Review Tomasulo. Lecture 17: ILP and Dynamic Execution #2: Branch Prediction, Multiple Issue. Tomasulo Algorithm and Branch Prediction

Review Tomasulo. Lecture 17: ILP and Dynamic Execution #2: Branch Prediction, Multiple Issue. Tomasulo Algorithm and Branch Prediction CS252 Graduate Computer Architecture Lecture 17: ILP and Dynamic Execution #2: Branch Prediction, Multiple Issue March 23, 01 Prof. David A. Patterson Computer Science 252 Spring 01 Review Tomasulo Reservations

More information

Pipelining: Basic Concepts

Pipelining: Basic Concepts Pipelining: Basic Concepts Prof. Cristina Silvano Dipartimento di Elettronica e Informazione Politecnico di ilano email: silvano@elet.polimi.it Outline Reduced Instruction Set of IPS Processor Implementation

More information

ECE473 Computer Architecture and Organization. Pipeline: Control Hazard

ECE473 Computer Architecture and Organization. Pipeline: Control Hazard Computer Architecture and Organization Pipeline: Control Hazard Lecturer: Prof. Yifeng Zhu Fall, 2015 Portions of these slides are derived from: Dave Patterson UCB Lec 15.1 Pipelining Outline Introduction

More information

Chapter 4 The Processor 1. Chapter 4B. The Processor

Chapter 4 The Processor 1. Chapter 4B. The Processor Chapter 4 The Processor 1 Chapter 4B The Processor Chapter 4 The Processor 2 Control Hazards Branch determines flow of control Fetching next instruction depends on branch outcome Pipeline can t always

More information

Chapter 4 The Processor 1. Chapter 4A. The Processor

Chapter 4 The Processor 1. Chapter 4A. The Processor Chapter 4 The Processor 1 Chapter 4A The Processor Chapter 4 The Processor 2 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware

More information

Thomas Polzer Institut für Technische Informatik

Thomas Polzer Institut für Technische Informatik Thomas Polzer tpolzer@ecs.tuwien.ac.at Institut für Technische Informatik Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =

More information

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building

More information

Computer Architecture

Computer Architecture Lecture 3: Pipelining Iakovos Mavroidis Computer Science Department University of Crete 1 Previous Lecture Measurements and metrics : Performance, Cost, Dependability, Power Guidelines and principles in

More information

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest

More information

Lecture 8: Data Hazard and Resolution. James C. Hoe Department of ECE Carnegie Mellon University

Lecture 8: Data Hazard and Resolution. James C. Hoe Department of ECE Carnegie Mellon University 18 447 Lecture 8: Data Hazard and Resolution James C. Hoe Department of ECE Carnegie ellon University 18 447 S18 L08 S1, James C. Hoe, CU/ECE/CALC, 2018 Your goal today Housekeeping detect and resolve

More information

EITF20: Computer Architecture Part2.2.1: Pipeline-1

EITF20: Computer Architecture Part2.2.1: Pipeline-1 EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle

More information

Instruction Fetch and Branch Prediction. CprE 581 Computer Systems Architecture Readings: Textbook (4 th ed 2.3, 2.9); (5 th ed 3.

Instruction Fetch and Branch Prediction. CprE 581 Computer Systems Architecture Readings: Textbook (4 th ed 2.3, 2.9); (5 th ed 3. Instruction Fetch and Branch Prediction CprE 581 Computer Systems Architecture Readings: Textbook (4 th ed 2.3, 2.9); (5 th ed 3.3) 1 Frontend and Backend Feedback: - Prediction correct or not, update

More information

Lecture 7 Pipelining. Peng Liu.

Lecture 7 Pipelining. Peng Liu. Lecture 7 Pipelining Peng Liu liupeng@zju.edu.cn 1 Review: The Single Cycle Processor 2 Review: Given Datapath,RTL -> Control Instruction Inst Memory Adr Op Fun Rt

More information