Instruction pipelining is the use of overlapped execution to allow more than one instruction to be in some stage of execution at the same time.


1 Pipelining. Pipelining allows more than one instruction to be in some stage of execution at the same time. Ferranti ATLAS (1963): pipelining reduced the average time per instruction by 375%; memory could not keep up with the CPU, so a cache was needed.

2 What Is Pipelining? Laundry example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold. The washer takes 30 minutes, the dryer takes 40 minutes, and the folder takes 20 minutes.

3 What Is Pipelining? (Chart: tasks A, B, C, D laid out against the time axis from 6 PM to midnight.) Sequential laundry takes 6 hours for 4 loads. If they learned pipelining, how long would laundry take?

4 What Is Pipelining? Start work ASAP. (Chart: the four loads overlapped against the time axis from 6 PM to midnight.) Pipelined laundry takes 3.5 hours for 4 loads.

5 Pipelining Lessons. Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload. The pipeline rate is limited by the slowest pipeline stage. Multiple tasks operate simultaneously. Potential speedup = number of pipe stages. Unbalanced pipe-stage lengths reduce the speedup. Time to fill the pipeline and time to drain it also reduce the speedup.
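To make the fill/drain and slowest-stage effects concrete, here is a minimal Python sketch (not from the original slides) that computes the sequential and pipelined finish times for the laundry example, assuming the 30/40/20-minute stage times given above.

# Minimal sketch: pipelined vs. sequential laundry timing (assumed stage times).
stages = [30, 40, 20]        # wash, dry, fold, in minutes
loads = 4

sequential = loads * sum(stages)                    # 4 * 90 = 360 min = 6 hours
bottleneck = max(stages)                            # the slowest stage limits the rate
pipelined = sum(stages) + (loads - 1) * bottleneck  # fill time + one bottleneck per extra load

print(sequential / 60, "hours sequential")          # 6.0
print(pipelined / 60, "hours pipelined")            # 3.5
print(round(sequential / pipelined, 2), "x speedup")  # ~1.71, less than the 3x stage count

The speedup falls short of the ideal "number of stages" precisely because of the unbalanced stage lengths and the fill time, which is the point of the lessons above.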

6 Models. Single-cycle model (non-overlapping): each instruction executes in a single cycle; every instruction, and hence the clock cycle, must be stretched to the slowest instruction. Multi-cycle model (non-overlapping): each instruction executes in multiple cycles; the clock cycle must be stretched only to the slowest step; functional units can be shared within the execution of a single instruction. Pipeline model (overlapping): each instruction executes in multiple cycles and the clock cycle is stretched to the slowest step, but the throughput is roughly one clock cycle per instruction; efficiency comes from overlapping the execution of multiple instructions, increasing hardware utilization.

7 Review: Single-Cycle Datapath. Two adders: the PC+4 adder and the branch/jump offset adder. (Datapath figure: PC, instruction memory, register file, sign extend, shift-left-2, ALU, and data memory, with control signals RegDst, RegWrite, ALUSrc, Zero, MemRead, MemWrite, Branch, MemtoReg.) Harvard architecture: separate instruction and data memory.

8 Review: Multi- vs. Single-cycle Processor Datapath. Combine the adders into the ALU: add 1½ muxes and 3 temporary registers (A, B, ALUOut). Combine the memories: add a mux and 2 temporary registers (IR, MDR). (Multi-cycle datapath figure with control signals IorD, MemRead, MemWrite, IRWrite, RegDst, RegWrite, ALUSrcA, ALUSrcB, ALUOp, MemtoReg.) Single-cycle = ALU + 2 memories + 4 muxes + 2 adders + opcode decoders. Multi-cycle = ALU + 1 memory + 5½ muxes + 5 registers (IR, A, B, MDR, ALUOut) + FSM.

9 Multi-cycle Processor Datapath. (Same datapath figure as the previous slide.) Single-cycle = ALU + 2 memories + 4 muxes + 2 adders + opcode decoders. Multi-cycle = ALU + 1 memory + 5½ muxes + 5 registers (IR, A, B, MDR, ALUOut) + FSM. The five 32-bit temporary registers amount to 160 additional flip-flops for the multi-cycle processor over the single-cycle processor.

10 Pipelined Datapath Registers. (Datapath figure: the PC plus the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, each holding the data bits its stage needs.) The pipeline registers add about 229 additional flip-flops for the pipeline over the multi-cycle processor.

11 Overhead (chip area vs. speed). Single-cycle model: 8 ns clock (125 MHz), non-overlapping; ALU + 2 adders; muxes; datapath register bits (flip-flops). Multi-cycle model: 2 ns clock (500 MHz), non-overlapping; ALU + controller; 5½ muxes; about 160 datapath register bits (flip-flops). Pipeline model: 2 ns clock (500 MHz), overlapping; 2 ALUs + controller; 4 muxes; 373 datapath + 16 controlpath register bits (flip-flops).

12 Comparison. CISC: any instruction may reference memory; many instructions & addressing modes; variable instruction formats; a single register set; multi-clock-cycle instructions; a micro-program interprets instructions; the complexity is in the micro-program; little to no pipelining; small program code size. RISC: only load/store reference memory; few instructions & addressing modes; fixed instruction formats; multiple register sets; single-clock-cycle instructions; hardware (an FSM) executes instructions; the complexity is in the compiler; highly pipelined; large program code size.

13 Pipelining versus Parallelism. Most high-performance computers exhibit a great deal of concurrency, but it is not desirable to call every modern computer a parallel computer. Pipelining and parallelism are two methods used to achieve concurrency. Pipelining increases concurrency by dividing a computation into a number of steps; parallelism is the use of multiple resources to increase concurrency.

14 Can pipelining get us into trouble? Yes: pipeline hazards. Structural hazards: attempting to use the same resource two different ways at the same time (e.g., multiple memory accesses, multiple register writes); solutions: multiple memories (separate instruction & data memory), stretching the pipeline. Control hazards: attempting to make a decision before the condition is evaluated (e.g., any conditional branch); solutions: prediction, delayed branch. Data hazards: attempting to use an item before it is ready (e.g., add r1,r2,r3; sub r4,r1,r5; lw r6,0(r7); or r8,r6,r9); solutions: forwarding/bypassing, stall/bubble.

15 When stalls are unavoidable. There are numerous instances where hazards lead unavoidably to stalls! Perhaps the most surprising case is the simple assignment statement: A = B + C;

16 Classification of hazards. Assume that instruction J occurs after instruction I. RAW (read after write): J tries to read a source before I writes it, so J gets a possibly incorrect value. WAR (write after read): J tries to write a destination before it is read by I, so I reads a possibly incorrect value. WAW (write after write): J tries to write an operand before it is written by I, so the writes are performed in the wrong order. MIPS avoids the latter two hazards in its integer pipeline; however, they can still arise in its floating-point pipeline, since instructions there take different lengths of time.
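As a concrete illustration (not from the slides), the following Python sketch classifies the dependences between an earlier instruction I and a later instruction J from their register sets; the (destination, sources) tuple encoding is a made-up convenience, not a MIPS format.

# Hypothetical encoding: each instruction is (destination_reg, [source_regs]).
def classify_hazards(instr_i, instr_j):
    """Return the set of hazard types between earlier I and later J."""
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    hazards = set()
    if dest_i is not None and dest_i in srcs_j:
        hazards.add("RAW")   # J reads what I writes
    if dest_j is not None and dest_j in srcs_i:
        hazards.add("WAR")   # J writes what I still needs to read
    if dest_i is not None and dest_i == dest_j:
        hazards.add("WAW")   # both write the same register
    return hazards

# add r1,r2,r3 followed by sub r4,r1,r5  ->  {'RAW'}
print(classify_hazards(("r1", ["r2", "r3"]), ("r4", ["r1", "r5"])))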

17 The Five Stages of Load (cycles 1-5): Ifetch, Reg/Dec, Exec, Mem, Wr. Ifetch: fetch the instruction from the instruction memory. Reg/Dec: register fetch and instruction decode. Exec: calculate the memory address. Mem: read the data from the data memory. Wr: write the data back to the register file.

18 RISCEE 4 Architecture. (Datapath figure: on each clock, values are loaded into the registers; PC, memory with address and data ports, MDR and IR registers, an accumulator, and an ALU with selectable operations, under control signals PCSrc, IorD, MemRead, MemWrite, RegDst, RegWrite, srcA, and srcB.)

19 Single Cycle vs. Multiple Cycle vs. Pipeline. (Timing diagram: the single-cycle implementation stretches every instruction to the longest one, wasting time after a store; the multiple-cycle implementation steps Load through Ifetch, Reg, Exec, Mem, Wr and Store through Ifetch, Reg, Exec, Mem over ten cycles; the pipeline implementation overlaps Load, Store, and R-type so that a new instruction starts every cycle.)

20 Why Pipeline? Suppose we execute 100 instructions. Single-cycle machine: 45 ns/cycle x 1 CPI x 100 inst = 4500 ns. Multicycle machine: 10 ns/cycle x 4.6 CPI (due to instruction mix) x 100 inst = 4600 ns. Ideal pipelined machine: 10 ns/cycle x (1 CPI x 100 inst + 4 cycles of drain) = 1040 ns.
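A quick check of those numbers in Python, assuming the cycle times and CPIs quoted on the slide (100 instructions; 45 ns single-cycle; 10 ns multicycle at CPI 4.6; 10 ns pipelined with a 4-cycle drain):

# Rough arithmetic for the "Why Pipeline?" example (values assumed from the slide).
n = 100                                  # instructions
single_cycle = 45 * 1 * n                # 4500 ns
multicycle   = 10 * 4.6 * n              # 4600 ns
pipelined    = 10 * (1 * n + 4)          # 1040 ns: 1 CPI plus a 4-cycle drain
print(single_cycle, multicycle, pipelined)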

21 Why Pipeline? Because the resources are there! (Pipeline diagram: instructions Inst 0-4 flow through Im, Reg, ALU, Dm, Reg in successive cycles; the accompanying table shows the MemInst, Reg read, ALU, MemData, and Reg write resources switching from idle to busy until every resource is busy each cycle once the pipeline is full.)

22 Can pipelining get us into trouble? Yes: pipeline hazards. Structural hazards: attempting to use the same resource two different ways at the same time — e.g., a combined washer/dryer would be a structural hazard, or the folder busy doing something else (watching TV). Data hazards: attempting to use an item before it is ready — e.g., one sock of a pair in the dryer and one in the washer; you can't fold until you get the sock from the washer through the dryer; an instruction depends on the result of a prior instruction still in the pipeline. Control hazards: attempting to make a decision before the condition is evaluated — e.g., washing football uniforms and needing to get the detergent level right; you need to see the result after the dryer before the next load goes in; branch instructions. You can always resolve hazards by waiting: the pipeline control must detect the hazard and take action (or delay action) to resolve it.

23 Single Memory (Inst & Data) is a Structural Hazard. Structural hazards: attempting to use the same resource two different ways at the same time. Previous example: separate InstMem and DataMem. (Pipeline diagram: a Load and three following instructions share one memory; the right half of a stage box highlighted means read, the left half means write; the resource table shows the single Mem (Inst & Data) needed twice in the same cycle.) Detection is easy in this case!

24 Single Memory (Inst & Data) is a Structural Hazard. By changing the architecture from a Harvard memory (separate instruction and data memory) to a von Neumann memory, we actually created a structural hazard! Structural hazards can be avoided by changing the hardware — the design of the architecture (splitting resources) — or in software: re-order the instruction sequence, or delay.

25 Pipelining. Improve performance by increasing instruction throughput. (Timing diagram: lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Without pipelining each instruction takes 8 ns and the next starts 8 ns later; with pipelining each stage takes 2 ns and a new instruction starts every 2 ns.) The ideal speedup is the number of stages in the pipeline. Do we achieve this?

26 Stall on Branch. (Timing diagram: add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0). The lw after the branch cannot be fetched until the branch resolves, inserting a stall.)

27 Predicting branches. (Two timing diagrams, Figure 6.5: when the branch beq $1, $2, 40 is predicted not taken and the prediction is correct, the following lw proceeds with no delay; when the prediction is wrong, the fetched instruction is turned into a bubble and or $7, $8, $9 is fetched from the branch target instead.)

28 Delayed branch. (Timing diagram: beq $1, $2, 40 is followed by add $4, $5, $6 in the delayed branch slot, then lw $3, 300($0); the add always executes regardless of the branch outcome.)

29 The MIPS pipeline (Figure 6.7). Example: add $s0, $t0, $t1 passes through IF, ID, EX, MEM, WB. Pipeline stages: IF — instruction fetch (read); ID — instruction decode and register read (read); EX — execute the ALU operation; MEM — memory access (read or write); WB — write back to the register. Resources: Mem — instruction & data memory; Reg1 — register read port #1; Reg2 — register read port #2; Reg — register write port; ALU — ALU operation.

30 Forwarding. (Diagram, Figure 6.8: add $s0, $t0, $t1 followed by sub $t2, $s0, $t3; the add's result is forwarded from its EX stage to the sub's EX stage.)

31 Load Forwarding. (Diagram, Figure 6.9: lw $s0, 20($t1) followed by sub $t2, $s0, $t3; even with forwarding, a bubble must be inserted because the loaded value is not available until after MEM.)

32 Reordering. Original sequence: lw $t1, 0($t0) ($t1 = Memory[0+$t0]); lw $t2, 4($t0) ($t2 = Memory[4+$t0]); sw $t2, 0($t0) (Memory[0+$t0] = $t2); sw $t1, 4($t0) (Memory[4+$t0] = $t1). Reordered to avoid the load-use stall: lw $t1, 0($t0); lw $t2, 4($t0); sw $t1, 4($t0); sw $t2, 0($t0).

33 Basic Idea: split the datapath into stages. IF: instruction fetch; ID: instruction decode / register file read; EX: execute / address calculation; MEM: memory access; WB: write back. (Single-cycle datapath figure marked with the five stage boundaries.) What do we need to add to actually split the datapath into stages?

34 Graphically Representing Pipelines. (Multi-cycle pipeline diagram: lw $10, 20($1) and sub $11, $2, $3 across clock cycles CC1-CC6, each passing through IM, Reg, ALU, DM, Reg.) This representation can help answer questions like: how many cycles does it take to execute this code? What is the ALU doing during cycle 4? Use it to help understand datapaths.

35 Pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB. (Datapath figure: PC, instruction memory, register file, sign extend, ALU, and data memory separated by the four pipeline registers.)

36 Load instruction: fetch and decode. (Two datapath figures: in IF the lw is fetched and the PC incremented; in ID it is decoded and its source registers are read.)

37 Load instruction: execution. (Datapath figure: in EX the ALU adds the base register and the sign-extended offset to form the memory address.)

38 Load instruction: memory and write back. (Two datapath figures: in MEM the data memory is read at the computed address; in WB the loaded value is written to the register file.)

39 Store instruction: execution. (Datapath figure: in EX the ALU computes the effective address for the sw.)

40 Store instruction: memory and write back. (Two datapath figures: in MEM the register value is written to data memory; in WB the sw does nothing.)

41 Load instruction: corrected datapath. (Datapath figure: the destination register number is carried forward through the pipeline registers so that WB writes the correct register.)

42 Load instruction: overall usage. (Datapath figure highlighting every portion of the datapath the lw uses across its five stages.)

43 Multi-clock-cycle pipeline diagrams. (Two diagrams for lw $10, 20($1) followed by sub $11, $2, $3: one drawn with resource icons (IM, Reg, ALU, DM, Reg) per cycle, one labeling the stages Instruction fetch, Instruction decode, Execution, Data access, Write back.)

44 Single-clock-cycle diagrams, clocks 1-2 (Figure 6.22). Clock 1: lw $10, 20($1) is in instruction fetch. Clock 2: sub $11, $2, $3 is fetched while the lw is decoded. (Annotated datapath figures.)

45 Single-clock-cycle diagrams, clocks 3-4 (Figure 6.23). Clock 3: the lw executes while the sub is decoded. Clock 4: the lw accesses memory while the sub executes. (Annotated datapath figures.)

46 Single-clock-cycle diagrams, clocks 5-6 (Figure 6.24). Clock 5: the lw writes back while the sub accesses memory. Clock 6: the sub writes back. (Annotated datapath figures.)

47 Conventional Pipelined Execution Representation. (Diagram: successive instructions, each passing through IFetch, Dcd, Exec, Mem, WB, offset by one cycle along the time axis; program flow runs down the page.)

48 Structural hazards limit performance. Example: if there are 1.3 memory accesses per instruction and only one memory access is possible per cycle, then the average CPI must be at least 1.3; otherwise the memory resource would be more than 100% utilized.
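The bound spelled out as a one-liner in Python, assuming the 1.3 accesses per instruction and one memory port from the slide:

accesses_per_instr = 1.3
ports = 1
min_cpi = accesses_per_instr / ports   # memory alone forces average CPI >= 1.3
print(min_cpi)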

49 Control Hazard Solutions. Stall: wait until the decision is clear. It is possible to move the decision up to the 2nd stage by adding hardware to compare the registers as they are read. (Pipeline diagram: Beq, Load, and a following instruction; the instructions after the branch wait.) Impact: 2 clock cycles per branch instruction => slow.

50 Control Hazard Solutions. Predict: guess one direction, then back up if wrong. Predict not taken. (Pipeline diagram: Beq and Load proceed under the not-taken assumption.) Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right about 50% of the time). A more dynamic scheme keeps a history of each branch (right about 90% of the time).

51 Control Hazard Solutions. Redefine the branch behavior (it takes effect after the next instruction): the delayed branch. (Pipeline diagram: Beq, Misc, Load.) Impact: 1 clock cycle per branch instruction if the compiler can find an instruction to put in the slot (about 50% of the time). As machines launch more instructions per clock cycle, this becomes less useful.

52 Data Hazard on r1: add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11.

53 Data Hazard on r1: dependencies backwards in time are hazards. (Pipeline diagram: add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 flowing through IF, ID/RF, EX, MEM, WB; the arrows from the add's write of r1 to the later reads point backwards in time.)

54 Data Hazard Solution: forward the result from one stage to another. (Same pipeline diagram with forwarding arrows from the add's EX/MEM results into the dependent instructions' EX stages.) The or r8,r1,r9 is OK without forwarding if same-cycle register read/write is defined properly (write in the first half of the cycle, read in the second half).

55 Forwarding (or Bypassing): what about loads? Dependencies backwards in time are hazards. (Pipeline diagram: lw r1, 0(r2) followed by sub r4, r1, r3; the loaded value is not available until MEM, after the sub's EX.) This can't be solved with forwarding alone: the instruction dependent on the load must be delayed/stalled.

56 Pipelining the Load. (Pipeline diagram: three consecutive lw instructions, each going Ifetch, Reg/Dec, Exec, Mem, Wr one cycle apart.) The five independent functional units in the pipelined datapath are: the instruction memory for the Ifetch stage; the register file's read ports (busA and busB) for the Reg/Dec stage; the ALU for the Exec stage; the data memory for the Mem stage; and the register file's write port (busW) for the Wr stage.

57 The Four Stages of R-type (cycles 1-4): Ifetch, Reg/Dec, Exec, Wr. Ifetch: fetch the instruction from the instruction memory. Reg/Dec: register fetch and instruction decode. Exec: the ALU operates on the two register operands; update the PC. Wr: write the ALU output back to the register file.

58 Pipelining the R-type and Load instructions. (Pipeline diagram: R-type, R-type, Load, R-type, R-type — oops, we have a problem!) There is a pipeline conflict, a structural hazard: two instructions try to write to the register file at the same time, but there is only one write port.

59 Important Observation. Each functional unit can only be used once per instruction, and each functional unit must be used at the same stage for all instructions. Load uses the register file's write port during its 5th stage (Ifetch, Reg/Dec, Exec, Mem, Wr), while R-type uses the register file's write port during its 4th stage (Ifetch, Reg/Dec, Exec, Wr). There are 2 ways to solve this pipeline hazard.

60 Solution 1: Insert a Bubble into the Pipeline. (Pipeline diagram: a Load followed by R-types; one R-type is held back by a bubble.) Insert a bubble into the pipeline to prevent two writes in the same cycle. The control logic can be complex, and we lose an instruction fetch-and-issue opportunity: no instruction is started in cycle 6!

61 Solution 2: Delay the R-type's Write by One Cycle. Delay the R-type's register write by one cycle: now R-type instructions also use the register file's write port at stage 5, and their Mem stage is a NOOP stage in which nothing is done. (Pipeline diagram: R-type now goes Ifetch, Reg/Dec, Exec, Mem, Wr, so R-types and Loads mix without conflict.)

62 The Four Stages of Store (cycles 1-4): Ifetch, Reg/Dec, Exec, Mem. Ifetch: fetch the instruction from the instruction memory. Reg/Dec: register fetch and instruction decode. Exec: calculate the memory address. Mem: write the data into the data memory.

63 The Three Stages of Beq (cycles 1-3): Ifetch, Reg/Dec, Exec. Ifetch: fetch the instruction from the instruction memory. Reg/Dec: register fetch and instruction decode. Exec: the ALU compares the two register operands, and the correct branch target address is selected and latched into the PC.

64 Summary: Pipelining. What makes it easy: all instructions are the same length; there are just a few instruction formats; memory operands appear only in loads and stores. What makes it hard: structural hazards (suppose we had only one memory); control hazards (we need to worry about branch instructions); data hazards (an instruction depends on a previous instruction). We'll build a simple pipeline and look at these issues, then talk about modern processors and what really makes it hard: exception handling, and trying to improve performance with out-of-order execution, etc.

65 Summary. Pipelining is a fundamental concept: multiple steps using distinct resources. Utilize the capabilities of the datapath by pipelined instruction processing: start the next instruction while working on the current one; limited by the length of the longest stage (plus fill/flush); detect and resolve hazards.

66 Pipeline Control: controlpath register bits. (Figure: the control unit generates the EX, MEM, and WB control fields in ID; they are carried along in the ID/EX, EX/MEM, and MEM/WB pipeline registers — 9 control bits in ID/EX, 5 in EX/MEM, and 2 in MEM/WB.)

67 Pipeline Control: controlpath table. (Figure 5.2: the single-cycle control table with columns RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, and ALUOp for the R-format, lw, sw, and beq rows; sw and beq have don't-cares (X) for RegDst and MemtoReg. Figure 6.28: the same signals regrouped into ID/EX (EX) control lines RegDst, ALUOp, ALUSrc; EX/MEM (MEM) control lines Branch, MemRead, MemWrite; and MEM/WB (WB) control lines RegWrite, MemtoReg.)
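Since the 0/1 entries of the table did not survive transcription, here is a Python sketch of the usual textbook single-cycle MIPS control settings, grouped the way the pipeline carries them (EX, MEM, WB); treat the values as an assumption taken from the standard table, with "X" marking a don't-care.

# Assumed standard MIPS control settings per instruction class, grouped by stage.
CONTROL = {
    "R-format": {"RegDst": 1,   "ALUOp": "10", "ALUSrc": 0,    # EX
                 "Branch": 0,   "MemRead": 0,  "MemWrite": 0,  # MEM
                 "RegWrite": 1, "MemtoReg": 0},                # WB
    "lw":       {"RegDst": 0,   "ALUOp": "00", "ALUSrc": 1,
                 "Branch": 0,   "MemRead": 1,  "MemWrite": 0,
                 "RegWrite": 1, "MemtoReg": 1},
    "sw":       {"RegDst": "X", "ALUOp": "00", "ALUSrc": 1,
                 "Branch": 0,   "MemRead": 0,  "MemWrite": 1,
                 "RegWrite": 0, "MemtoReg": "X"},
    "beq":      {"RegDst": "X", "ALUOp": "01", "ALUSrc": 0,
                 "Branch": 1,   "MemRead": 0,  "MemWrite": 0,
                 "RegWrite": 0, "MemtoReg": "X"},
}
print(CONTROL["lw"])   # the 9 bits handed to ID/EX for a load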

68 Pipeline Hazards. Solution #1 always works (for non-real-time applications): stall, delay & procrastinate! Structural hazards (e.g. fetching from the same memory bank) — solution #2: partition the architecture. Control hazards (i.e. branching) — solution #1: stall, but this decreases throughput; solution #2: guess and back-track; solution #3: delayed decision — delayed branch & fill the slot. Data hazards (i.e. register dependencies) — the worst-case situation; solution #2: re-order instructions; solution #3: forwarding or bypassing, plus the delayed load.

69 Pipeline Datapath and Controlpath. (Full pipelined datapath figure with the control unit and the EX, MEM, and WB control fields carried in the ID/EX, EX/MEM, and MEM/WB pipeline registers, plus PCSrc, RegWrite, ALUSrc, Branch, MemRead, MemWrite, MemtoReg, RegDst, and ALUOp.)

70 Load instruction, single-stepped. (Datapath figure annotated, clock by clock, with the contents of the PC and of the IF/ID, ID/EX, EX/MEM, and MEM/WB registers as a lw advances: the instruction is fetched, its base register and offset are latched, the ALU forms the address, memory is read, and the result is written back.)

71 Load instruction, single-stepped (continued). (The same annotated datapath figure for the later clock cycles of the lw.)

72 Pipeline single-stepping. (Table: the initial register and memory contents, the instruction formats add $rd, $rs=A, $rt=B and lw $rt=B, addr($rs=A), and, for each clock, the contents of the IF/ID <PC, IR>, ID/EX <PC, A, B, S, Rt, Rd>, EX/MEM <PC, Z, ALU, B, R>, and MEM/WB <MDR, ALUOut, R> pipeline registers as the sequence lw, sub, and, or, add advances through the pipeline.)

73 Clock 1. IF: lw $10, 20($1); ID, EX, MEM, WB: earlier instructions. Clock 2: IF: sub $11, $2, $3; ID: lw $10, 20($1). (Annotated pipelined datapath figures.)

74 Clock 2. IF: sub $11, $2, $3; ID: lw $10, 20($1); EX, MEM, WB: earlier instructions. (Annotated pipelined datapath figure: IF/ID now holds the sub while ID/EX holds the decoded lw with its base register value and the sign-extended offset 20.)

75 Clock 3. IF: and $12, $4, $5; ID: sub $11, $2, $3; EX: lw $10, ...; MEM, WB: earlier instructions. (Annotated pipelined datapath figure: the ALU computes 20 + C$1 for the lw.)

76 Clock 4 (Figure 6.32b). IF: or $13, $6, $7; ID: and $12, $4, $5; EX: sub $11, ...; MEM: lw $10, ...; WB: earlier instruction. (Annotated pipelined datapath figure: data memory is read for the lw while the ALU subtracts for the sub.)

77 Data Dependencies that can be resolved by forwarding. (Multi-cycle pipeline diagram: sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2). The value of register $2 is produced by the sub; the and and or need it before it is written back — data hazards resolved by forwarding; the add reads $2 in the same cycle it is written back — not a hazard; the sw reads it even later, forward in time — not a hazard.)

78 Data Hazards: arithmetic. (Pipeline diagram showing, for the same sequence, the value of register $2 and of the EX/MEM and MEM/WB pipeline registers in each clock cycle; dependences that point forward in time can be resolved by forwarding from EX/MEM or MEM/WB, and a register written and read in the same cycle is not a hazard.)

79 Data Dependencies: no forwarding. (Clock diagram: sub $2,$1,$3 then and $12,$2,$5; the and repeats its ID stage for two extra cycles — 2 stalls — since the register file writes in the 1st half of a cycle and reads in the 2nd half.) Suppose every instruction is dependent: 1 + 2 stalls = 3 clocks per instruction. MIPS = Clock / CPI = 500 MHz / 3 = 167 MIPS.

80 Data Dependencies: no forwarding. A dependent instruction takes 1 + 2 stalls = 3 clocks; an independent instruction takes 1 + 0 stalls = 1 clock. Suppose 10% of the time an instruction is dependent: average instruction time = 10% x 3 + 90% x 1 = 0.1 x 3 + 0.9 x 1 = 1.2 clocks. MIPS = Clock / CPI = 500 MHz / 1.2 = 417 MIPS (10% dependency); 500 MHz / 3 = 167 MIPS (100% dependency); 500 MHz / 1 = 500 MIPS (0% dependency).
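The same arithmetic as a small Python sketch, assuming the slide's 500 MHz clock, 3 clocks per dependent instruction, and 1 clock per independent instruction:

# Average CPI and MIPS rate as a function of the fraction of dependent instructions.
def mips(dependency_fraction, clock_mhz=500, dep_clocks=3, indep_clocks=1):
    cpi = dependency_fraction * dep_clocks + (1 - dependency_fraction) * indep_clocks
    return clock_mhz / cpi

print(round(mips(0.0)))   # 500 MIPS,  0% dependent
print(round(mips(0.1)))   # ~417 MIPS, 10% dependent
print(round(mips(1.0)))   # ~167 MIPS, every instruction dependent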

81 Data Dependencies: with forwarding. (Clock diagram: sub $2,$1,$3 then and $12,$2,$5; a type-1a data hazard is detected when ID/EX.$rs = EX/MEM.$rd and the result is forwarded.) Suppose every instruction is dependent: 1 + 0 stalls = 1 clock. MIPS = Clock / CPI = 500 MHz / 1 = 500 MIPS.

82 Data Dependencies: Hazard Conditions. A data hazard condition occurs whenever a source needs a previous, not-yet-available result from a destination. Example: sub $2, $1, $3 (sub $rd, $rs, $rt) followed by and $12, $2, $5 (and $rd, $rs, $rt). Data hazard detection always compares a destination with a source: EX/MEM.$rd against ID/EX.$rs (type 1a) and ID/EX.$rt (type 1b); MEM/WB.$rd against ID/EX.$rs (type 2a) and ID/EX.$rt (type 2b).
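A sketch of those four comparisons in Python, with each pipeline register modeled as a plain dictionary (the field names follow the slide; the data structure itself is just an assumption for illustration):

# Model each pipeline register as a dict with just the fields we need here.
def forwarding_hazards(id_ex, ex_mem, mem_wb):
    """Return which of the 1a/1b/2a/2b conditions hold (real hardware also
    checks that the producing instruction writes a register and that rd != $0)."""
    hits = []
    if ex_mem["rd"] == id_ex["rs"]:
        hits.append("1a")   # forward the EX/MEM result to the upper ALU input
    if ex_mem["rd"] == id_ex["rt"]:
        hits.append("1b")   # forward the EX/MEM result to the lower ALU input
    if mem_wb["rd"] == id_ex["rs"]:
        hits.append("2a")   # forward the MEM/WB result to the upper ALU input
    if mem_wb["rd"] == id_ex["rt"]:
        hits.append("2b")   # forward the MEM/WB result to the lower ALU input
    return hits

# sub $2,$1,$3 now in EX/MEM, and $12,$2,$5 in ID/EX  ->  ['1a']
print(forwarding_hazards({"rs": 2, "rt": 5}, {"rd": 2}, {"rd": 13}))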

83 Data Dependencies: Hazard Conditions. 1a data hazard, EX/MEM.$rd = ID/EX.$rs: sub $2, $1, $3 followed by and $12, $2, $5. 1b data hazard, EX/MEM.$rd = ID/EX.$rt: sub $2, $1, $3 followed by and $12, $1, $2. 2a data hazard, MEM/WB.$rd = ID/EX.$rs: sub $2, $1, $3; and $12, $1, $5; or $13, $2, $1. 2b data hazard, MEM/WB.$rd = ID/EX.$rt: sub $2, $1, $3; and $12, $1, $5; or $13, $6, $2.

84 Data Dependencies: worst case. sub $2, $1, $3; and $12, $2, $2; or $13, $2, $2. This triggers all four conditions — 1a: EX/MEM.$rd = ID/EX.$rs; 1b: EX/MEM.$rd = ID/EX.$rt; 2a: MEM/WB.$rd = ID/EX.$rs; 2b: MEM/WB.$rd = ID/EX.$rt.

85 Data Dependencies: Hazard Conditions (summary). 1a: ID/EX.$rs = EX/MEM.$rd; 1b: ID/EX.$rt = EX/MEM.$rd; 2a: ID/EX.$rs = MEM/WB.$rd; 2b: ID/EX.$rt = MEM/WB.$rd. (Diagram: the $rs, $rt, and $rd fields carried in the ID/EX, EX/MEM, and MEM/WB pipeline registers feed these comparisons.)

86 Forwarding hardware. (Datapath figures: a. no forwarding — the ALU inputs come only from the register file; b. with forwarding — muxes controlled by ForwardA and ForwardB select among the register file, EX/MEM.RegisterRd, and MEM/WB.RegisterRd, driven by a forwarding unit that compares Rs and Rt against the destination registers.)

87 Data Hazards: Loads. (Pipeline diagram: lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7. The and's dependence on the lw points backwards in time and cannot be resolved even by forwarding; the or can be satisfied by forwarding; the add reads $2 in the same cycle it is written — not a hazard; the slt has no dependence.)

88 Data Hazards: load stalling. (Pipeline diagram: the same sequence with a bubble inserted; the and and all following instructions are stalled one cycle so the loaded $2 can be forwarded.)

89 Data Hazards: hazard detection unit (page 49). Stall condition: ID/EX.MemRead = 1 and ID/EX.$rt matches IF/ID.$rs or IF/ID.$rt. Stall example: lw $2, 20($1) (lw $rt, addr($rs)) followed by and $4, $2, $5 (and $rd, $rs, $rt). No-stall example (we only need to look at the next instruction): lw $2, 20($1); and $4, $1, $5; or $8, $2, $6.
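The load-use stall test from this slide as a Python sketch (field names follow the slide; the dictionary encoding is assumed for illustration):

def must_stall(if_id, id_ex):
    """True when the instruction in ID needs a register the load in EX will produce."""
    return bool(id_ex["MemRead"]) and id_ex["rt"] in (if_id["rs"], if_id["rt"])

# lw $2, 20($1) in EX; and $4, $2, $5 in ID  -> stall
print(must_stall({"rs": 2, "rt": 5}, {"MemRead": 1, "rt": 2}))   # True
# lw $2, 20($1) in EX; and $4, $1, $5 in ID  -> no stall
print(must_stall({"rs": 1, "rt": 5}, {"MemRead": 1, "rt": 2}))   # False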

90 Data Hazards: hazard detection unit (page 49). No-stall example (we only need to look at the next instruction): lw $2, 20($1); and $4, $1, $5; or $8, $2, $6. Load example: assume half of the loads are immediately followed by an instruction that uses the result. What is the average number of clocks for a load? Load instruction time: 50% x (1 clock) + 50% x (2 clocks) = 1.5.

91 Hazard Detection Unit: when to stall. (Datapath figure: a hazard detection unit examines ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, and IF/ID.RegisterRt and, on a load-use hazard, holds the PC and the IF/ID register and zeroes the control signals; the forwarding unit compares against EX/MEM.RegisterRd and MEM/WB.RegisterRd.)

92 Data Dependency Units. Forwarding condition: ID/EX.$rs or ID/EX.$rt equals EX/MEM.$rd or MEM/WB.$rd. Stall condition: IF/ID.$rs or IF/ID.$rt equals ID/EX.$rt, and ID/EX.MemRead = 1.

93 Data Dependency Units. (Diagram: the stalling comparisons use the $rs and $rt fields in IF/ID against the $rt in ID/EX; the forwarding comparisons use the $rs and $rt fields in ID/EX against the $rd fields in EX/MEM and MEM/WB.) Stall condition: IF/ID.$rs or IF/ID.$rt equals ID/EX.$rt, and ID/EX.MemRead = 1.

94 Branch Hazards: Solution #1, stall until the decision is made (fig. 6.4). (Pipeline diagram: add $4, $5, $6; beq $1, $3, 7; the instructions after the branch — through lw $4, 50($7) — are stalled until the branch decision is made in the ID stage, which here resolves to: do the load.)

95 Branch Hazards: Solution #2, predict until the decision is made. (Clock diagram: beq $1,$3,7 is predicted not taken; and $12,$2,$5 is fetched and then discarded when the decision, made in the ID stage, goes the other way; lw $4, 50($7) is fetched from the branch target.)

96 Branch Hazards: Solution #3, delayed decision. (Clock diagram: an instruction from before the branch, add $4,$6,$6, is moved into the slot after beq $1,$3,7, so it never needs to be discarded; the decision is made in the ID stage and lw $4, 50($7) follows.)

97 Branch Hazards: Solution #3, delayed decision (continued). (Clock diagram: beq $1,$3,7 with and $12,$2,$5 in the delay slot; the decision, made in the ID stage, is to take the branch, and lw $4, 50($7) is fetched from the target.)

98 Branch Hazards: the decision is made in the ID stage (figure 6.4). (Clock diagram: beq $1,$3,7; no decision yet, so a nop is inserted; the decision then resolves to "do the load", and lw $4, 50($7) is fetched.)

99 Branch Hazards: Solution #2, predict until the decision is made. (Pipeline diagram: 40 beq $1, $3, 7; 44 and $12, $2, $5; 48 or $13, $6, $2; 52 add $14, $2, $2; 72 lw $4, 50($7). The branch is predicted not taken; here the decision is made in the MEM stage, so when the prediction is wrong the values of the three following instructions are discarded — the same effect as 3 stalls.)

100 Early branch decision hardware. (Datapath figure: the branch comparison is moved up into the ID stage; an IF.Flush signal turns the wrongly fetched instruction into a nop when the prediction is wrong; the hazard detection unit and forwarding unit are also shown.)

101 Performance. Load: assume half of the loads are immediately followed by an instruction that uses the result (i.e. a dependency); load instruction time = 50% x (1 clock) + 50% x (2 clocks) = 1.5. Jump: assume that jumps always pay a full one-clock-cycle delay (stall); jump instruction time = 2. Branch: the misprediction delay is 1 clock cycle and 25% of branches are mispredicted; branch time = 75% x (1 clock) + 25% x (2 clocks) = 1.25.

102 Performance (page 54). CPI here is the instruction latency within the pipeline; MIPS measures pipeline throughput.
Class (mix %i)   | Single-cycle clocks | Multi-cycle clocks | Pipeline cycles
loads (23%)      | 1                   | 5                  | 1.5 (50% dependency)
stores (13%)     | 1                   | 4                  | 1
arithmetic (43%) | 1                   | 4                  | 1
branches (19%)   | 1                   | 3                  | 1.25 (25% misprediction)
jumps (2%)       | 1                   | 3                  | 2
Clock speed      | 125 MHz (8 ns)      | 500 MHz (2 ns)     | 500 MHz (2 ns)
CPI = Σ cycles x %i; MIPS = Clock / CPI: 125 MIPS (single-cycle), 125 MIPS (multi-cycle), 424 MIPS (pipeline).
load instruction time = 50% x (1 clock) + 50% x (2 clocks) = 1.5; branch time = 75% x (1 clock) + 25% x (2 clocks) = 1.25
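The pipeline column of that table can be checked with a short Python sketch using the instruction mix and per-class cycle counts shown above:

# CPI = sum over instruction classes of (cycles * mix fraction); MIPS = clock / CPI.
mix    = {"loads": 0.23, "stores": 0.13, "arithmetic": 0.43, "branches": 0.19, "jumps": 0.02}
cycles = {"loads": 1.5,  "stores": 1.0,  "arithmetic": 1.0,  "branches": 1.25, "jumps": 2.0}

cpi = sum(cycles[c] * mix[c] for c in mix)
clock_mhz = 500
print(round(cpi, 3), "CPI ->", round(clock_mhz / cpi), "MIPS")  # ~1.18 CPI, ~423 MIPS (the slide rounds to 424)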

103 Branch prediction. Datapath parallelism is only useful if you can keep it fed. It is easy to fetch multiple (consecutive) instructions per cycle — essentially speculating on sequential flow. Jump: an unconditional change of control flow, always taken. Branch: a conditional change of control flow, taken about 50% of the time — backward branches (about 30% of branches) are roughly 80% taken, forward branches (about 70%) are roughly 40% taken.

104 A Big Idea for Today. Reactive: past actions cause the system to adapt its use — do what you did before, better (e.g. caches, TCP windows, URL completion, ...). Proactive: use past actions to predict future actions — optimize speculatively, anticipate what you are about to do (branch prediction, long cache blocks, ...?).

105 The Case for Branch Prediction when Issuing N Instructions per Clock Cycle. 1. Branches will arrive up to n times faster in an n-issue processor. 2. Amdahl's Law => the relative impact of control stalls is larger with the lower potential CPI of an n-issue processor. Conversely, branch prediction is needed to expose the potential parallelism.

106 Branch Prediction Schemes: static branch prediction; the 1-bit branch-prediction buffer; the 2-bit branch-prediction buffer; the correlating branch-prediction buffer; the tournament branch predictor; the branch target buffer; integrated instruction fetch units; return address predictors.

107 Dynamic Branch Prediction. Performance = f(accuracy, cost of misprediction). Branch History Table: the lower bits of the PC address index a table of 1-bit values that say whether or not the branch was taken last time. There is no address check — this saves hardware, but the entry may not belong to the right branch. If the instruction is a branch, update the table with the outcome. Problem: in a loop, a 1-bit BHT causes 2 mispredictions — at the end of the loop, when it exits instead of looping as before, and the first time through the loop on the next pass, when it predicts exit instead of looping. With an average of 9 iterations before exit, accuracy is only 80% even though the loop branch is taken 90% of the time. This is local history: the history of this particular branch instruction — or of one that maps into the same slot of the table.

108 2-bit Dynamic Branch Prediction. A 2-bit scheme changes the prediction only after it mispredicts twice. (State diagram: two "predict taken" states and two "predict not taken" states with T/NT transitions; red = stop, not taken; green = go, taken.) This adds hysteresis to the decision-making process. It generalizes to an n-bit saturating counter.
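A minimal 2-bit saturating-counter predictor in Python — a sketch of the scheme described above, not code from the course:

class TwoBitPredictor:
    """One 2-bit saturating counter: states 0,1 predict not taken; 2,3 predict taken."""
    def __init__(self, state=1):
        self.state = state

    def predict(self):
        return self.state >= 2          # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop branch taken 9 times and then not taken: once the counter is warmed up,
# only the exit iteration mispredicts; a 1-bit scheme would also miss the re-entry.
p = TwoBitPredictor(state=3)
mispredicts = 0
for taken in [True] * 9 + [False]:
    if p.predict() != taken:
        mispredicts += 1
    p.update(taken)
print(mispredicts)   # 1 misprediction out of 10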

109 Consider 3 Scenarios: a branch used as a loop test; a branch that checks for an error or exception; an alternating taken / not-taken pattern. How do the simple predictors do on each, and where does global history help? What is your worst-case prediction scenario? How could hardware predict "this loop will execute 3 times" using a simple mechanism?

110 Correlating Branches. Idea: whether recently executed branches were taken or not taken is related to the behavior of the next branch (as well as to that branch's own history). The behavior of recent branches then selects between, say, 4 predictions for the next branch, and only that prediction is updated. (Diagram: the branch address (4 bits) together with the 2-bit recent global branch history — e.g. 01 = not taken then taken — indexes 2-bit-per-branch local predictors to produce the prediction.) A (2,2) predictor: 2 bits of global history, 2-bit counters per branch.

111 Accuracy of Different Schemes. (Chart: frequency of mispredictions on nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, and li for a 4,096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1,024-entry (2,2) BHT; the (2,2) predictor is consistently best.) What's missing in this picture?

112 Re-evaluating Correlation. Several of the SPEC benchmarks have fewer than a dozen branches responsible for 90% of taken branches. (Table: for compress, eqntott, gcc, mpeg, and "real" gcc — the dynamic branch percentage, the static branch count, and the number of static branches accounting for 90% of taken branches.) Real programs plus the OS behave more like gcc. Are there only small benefits beyond benchmarks for correlation? Are there problems with branch aliases?

113 BHT Accuracy. A misprediction happens either because the guess for that branch was wrong, or because the history of a different branch was fetched when indexing the table. With a 4096-entry table, programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%. For SPEC92, 4096 entries is about as good as an infinite table.

114 Dynamically finding structure in spaghetti?

115 Tournament Predictors. The motivation for correlating branch predictors was that the 2-bit predictor failed on important branches; adding global information improved performance. Tournament predictors use 2 predictors — one based on global information and one based on local information — and combine them with a selector that uses whichever predictor tends to guess correctly. (Diagram: the address and the history feed Predictor A and Predictor B.)

116 Tournament Predictor in the Alpha 21264. 4K 2-bit counters choose between a global predictor and a local predictor. The global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry is a standard 2-bit predictor. In the 12-bit pattern, the ith bit = 0 means the ith prior branch was not taken, and the ith bit = 1 means it was taken. The local predictor is a 2-level predictor: the top level is a local history table of 1024 10-bit entries, each recording the 10 most recent branch outcomes for that entry, so patterns of up to 10 branches can be discovered and predicted; the selected entry then indexes a table of 1K 3-bit saturating counters, which provide the local prediction. Total size: 4K x 2 + 4K x 2 + 1K x 10 + 1K x 3 = 29K bits! (~180,000 transistors)
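The storage arithmetic from this slide, spelled out in Python using the table sizes quoted above:

K = 1024
bits = (4*K * 2       # selector: 4K 2-bit "choose" counters
      + 4*K * 2       # global predictor: 4K 2-bit counters
      + 1*K * 10      # local history table: 1K 10-bit histories
      + 1*K * 3)      # local prediction table: 1K 3-bit saturating counters
print(bits // K, "Kbits")   # 29 Kbits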

117 Percentage of predictions coming from the local predictor in the tournament prediction scheme. (Chart, SPEC89: nasa7 37%, matrix300 55%, tomcatv 76%, doduc 72%, spice 63%, fpppp 69%, gcc 98%, espresso 100%, eqntott 94%, li 90%.)

118 Accuracy of Branch Prediction (fig 3.4). (Chart: branch prediction accuracy of profile-based, 2-bit counter, and tournament predictors on tomcatv, doduc, fpppp, li, espresso, and gcc; the tournament predictor is the most accurate on every benchmark.)

119 Accuracy vs. Size (SPEC89). (Chart: conditional branch misprediction rate versus total predictor size in Kbits for local, correlating, and tournament predictors; the tournament predictor gives the lowest misprediction rate at every size.)

120 GSHARE — a good, simple predictor. It sprays the predictions for a given address across a large table, one entry per distinct history, by indexing the table with (address XOR global history).
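A toy gshare predictor in Python to show the indexing idea, assuming a table of 2-bit counters indexed by (PC XOR global history); the parameter names and sizes are illustrative, not the slides':

class GShare:
    """Toy gshare: a table of 2-bit counters indexed by PC xor global history."""
    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)   # 2-bit counters, start weakly not taken
        self.history = 0

    def _index(self, pc):
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask

The XOR spreads the different histories of one branch over different counters, which is exactly the "spraying" the slide describes.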

121 Need the Target Address at the Same Time as the Prediction. Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken). Note: we must now check for a branch match, since we can't use a wrong branch's address (Figure 3.9, 3.2). (Diagram: the fetch PC is compared against the stored branch PCs; if there is no match, the instruction is not predicted to be a branch and fetch proceeds normally with Next PC = PC+4; if there is a match, the instruction is a branch and the predicted PC is used as the next PC; extra prediction state bits are stored alongside each entry.)
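A toy branch target buffer as a Python dictionary, just to show the lookup-and-compare step described above (the structure is illustrative, not the hardware organization):

# BTB entry: fetch PC -> (predicted target, 2-bit prediction state).
btb = {}

def next_pc(pc):
    """Return the next fetch address: predicted target on a taken hit, else PC+4."""
    entry = btb.get(pc)            # the dictionary lookup plays the "branch PC =?" check:
    if entry is None:              # only a matching entry counts, so a wrong branch's
        return pc + 4              # target is never used
    target, state = entry
    return target if state >= 2 else pc + 4

btb[0x400] = (0x480, 3)            # a taken branch at 0x400 targeting 0x480
print(hex(next_pc(0x400)), hex(next_pc(0x404)))   # 0x480 0x408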

122 Predicated Execution. Avoid branch prediction by turning branches into conditionally executed instructions: if (x) then A = B op C else NOP — if x is false, neither store the result nor cause an exception. The expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have a conditional move; PA-RISC can annul any following instruction; IA-64 has 64 1-bit condition fields, so any instruction can be executed conditionally. This transformation is called if-conversion: x? A = B op C. Drawbacks of conditional instructions: they still take a clock even if annulled; they stall if the condition is evaluated late; complex conditions reduce effectiveness because the condition becomes known late in the pipeline.

123 Special Case: Return Addresses. Register-indirect branches are hard to predict (the target address varies). In SPEC89, 85% of such branches are procedure returns. Since procedures follow a stack discipline, save the return address in a small buffer that acts like a stack: 8 to 16 entries gives a small miss rate.

124 Pitfall: sometimes bigger and dumber is better. The 21264 uses a tournament predictor (29 Kbits); the earlier 21164 uses a simple 2-bit predictor with 2K entries (a total of 4 Kbits). On the SPEC95 benchmarks the 21264 outperforms: it averages 11.5 mispredictions per 1000 instructions versus 16.5 per 1000 for the 21164. This is reversed for transaction processing (TP)! The 21264 averages 17 mispredictions per 1000 instructions, the 21164 averages 15 per 1000. TP code is much larger, and the 21164 holds 2X as many branch predictions based on local behavior (a 2K local predictor vs. the 1K local predictor in the 21264).

125 A zero-cycle jump. What really has to be done at run time? Once an instruction has been detected as a jump or JAL, we might recode it in the internal cache — a very limited form of dynamic compilation? (Example: the sequence and r3,r1,r5; addi r2,r3,#4; sub r4,r2,r1; jal doit; subi r1,r1,#1 is held in an internal cache whose entries carry a next-address field, so the sub's entry points directly at doit and the jal itself disappears.) This uses a pre-decoded instruction cache, called branch folding in the Bell Labs CRISP processor; the original CRISP cache had two addresses and could thus fold a complete branch into the previous instruction. Notice that JAL introduces a structural hazard on the register write.

126 Reflect PREDICTIONS and remove delay slots. (Example: and r3,r1,r5; addi r2,r3,#4; sub r4,r2,r1; bne r4,loop; subi r1,r1,#1 — the internal cache entry for the bne records "loop" as its next address.) This causes the next instruction to be fetched immediately from the branch destination (predict taken). If the branch ends up not being taken, then squash the destination instruction and restart the pipeline at address A+16.

127 Dynamic Branch Prediction Summary. Prediction is becoming an important part of scalar execution. Branch History Table: 2 bits for loop accuracy. Correlation: recently executed branches are correlated with the next branch — either different branches, or different executions of the same branch. Tournament predictor: devote more resources to competing solutions and pick between them. Branch Target Buffer: include the branch target address & prediction. Predicated execution can reduce the number of branches and of mispredicted branches. A return address stack predicts the indirect jumps of procedure returns.

128 Exceptions and Interrupts (Hardware)

129 Example: Device Interrupt (say, the arrival of a network message). (Figure: the user code add r1,r2,r3; subi r4,r1,#4; slli r4,r4,#2 is interrupted — "Hiccup!" — before lw r2,0(r4); lw r3,4(r4); add r2,r2,r3; sw 8(r4),r2 can run. On the external interrupt the PC is saved, all interrupts are disabled, and the machine enters supervisor mode. The interrupt handler raises the priority, re-enables interrupts, saves registers, services the device with a few loads and stores, restores registers, clears the current interrupt, disables all interrupts, restores the priority, and executes RTE; the PC is restored and user mode resumes.)

130 Alternative: Polling (again, for the arrival of a network message). (Figure: the same computation with a polling point inserted — load a device register and beq to no_mess if there is no message; otherwise fall into the handler, which services the device and clears the network interrupt, with network interrupts disabled around the check.)

131 Polling is faster/slower than interrupts. Polling is faster than interrupts because the compiler knows which registers are in use at the polling point, so it does not need to save and restore registers (or not as many), and other interrupt overhead is avoided (pipeline flush, trap priorities, etc.). Polling is slower than interrupts because the overhead of the polling instructions is incurred whether or not the handler is run — this can add to inner-loop delay — and the device may have to wait a long time for service. When to use one or the other? It is a multi-axis tradeoff: frequent/regular events are good for polling, as long as the device can be controlled at user level; interrupts are good for infrequent/irregular events and for ensuring regular/predictable service of events.

132 Exception/Interrupt classifications. Exceptions: relevant to the current process — faults, arithmetic traps, and synchronous traps; they invoke software on behalf of the currently executing process. Interrupts: caused by asynchronous, outside events — I/O devices requiring service (disk, network), clock interrupts (real-time scheduling). Machine checks: caused by serious hardware failure; not always restartable; they indicate that bad things have happened — a non-recoverable ECC error, a machine-room fire, a power outage.

133 A related classification: synchronous vs. asynchronous. Synchronous means related to the instruction stream, i.e. occurring during the execution of an instruction: the currently executing instruction must be stopped — a page fault on a load or store, an arithmetic exception, a software trap. Asynchronous means unrelated to the instruction stream, i.e. caused by an outside event, and it does not have to disrupt instructions that are already executing: interrupts are asynchronous, machine checks are asynchronous. Semi-synchronous (or high-availability) interrupts are caused by an external event but may have to disrupt current instructions in order to guarantee service.

134 Interrupt Priorities Must be Handled. (The same device-interrupt figure, now annotated to show that the network interrupt handler could itself be interrupted by the disk.) Note that the priority must be raised to avoid recursive interrupts!

135 Interrupt controller hardware and mask levels. The operating system constructs a hierarchy of masks that reflects some form of interrupt priority, for instance: software interrupts at the lowest priority; 2 — network interrupts; 4 — sound card; 5 — disk interrupt; 6 — real-time clock; non-maskable interrupts (power) at the highest. This reflects an ordering of urgency among interrupts: for instance, this ordering says that disk events can interrupt the interrupt handlers for network interrupts.

136 Can we have fast interrupts? (The same interrupt figure, annotated with the costs of taking a fine-grained interrupt: the pipeline drain can be very expensive; priority manipulations; register save/restore — 128 registers plus cache misses, etc. The handler could still be interrupted by the disk.)

137 SPARC (and RISC I) had register windows. On an interrupt or procedure call, simply switch to a different set of registers. This really saves on interrupt overhead: interrupts can happen at any point in execution, so the compiler cannot help with knowledge of live registers — conservative handlers must save all registers, and although short handlers might be able to save only a few, that analysis is complicated. It is not as big a deal for procedure calls: the original statement by Patterson was that Berkeley didn't have a compiler team, so they used a hardware solution; good compilers can allocate registers across procedure boundaries and know which registers are live at any one time. However, register windows have returned! IA-64 has them, and many other processors have shadow registers for interrupts.

138 Supervisor State. Typically, processors have some amount of state that user programs are not allowed to touch. Page-mapping hardware / TLB: the TLB prevents one user from accessing another's memory, and TLB protection prevents a user from modifying the mappings. Interrupt controllers: user code is prevented from crashing the machine by disabling interrupts, ignoring device interrupts, etc.; real-time clock interrupts ensure that users cannot lock up or crash the machine even if they run code that goes into a loop (preemptive vs. non-preemptive multitasking). Access to hardware devices is restricted: this prevents a malicious user from stealing network packets or writing over disk blocks. The distinction is made with at least two levels, USER/SYSTEM (one hardware mode bit); x86 architectures actually provide 4 different levels, but only two are usually used by the OS (or only one in older Microsoft OSs).


More information

EEC 483 Computer Organization

EEC 483 Computer Organization EEC 83 Compter Organization Chapter.6 A Pipelined path Chans Y Pipelined Approach 2 - Cycle time, No. stages - Resorce conflict E E A B C D 3 E E 5 E 2 3 5 2 6 7 8 9 c.y9@csohio.ed Resorces sed in 5 Stages

More information

PART I: Adding Instructions to the Datapath. (2 nd Edition):

PART I: Adding Instructions to the Datapath. (2 nd Edition): EE57 Instrctor: G. Pvvada ===================================================================== Homework #5b De: check on the blackboard =====================================================================

More information

EEC 483 Computer Organization

EEC 483 Computer Organization EEC 483 Compter Organization Chapter 4.4 A Simple Implementation Scheme Chans Y The Big Pictre The Five Classic Components of a Compter Processor Control emory Inpt path Otpt path & Control 2 path and

More information

Pipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ...

Pipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ... CHAPTER 6 1 Pipelining Instruction class Instruction memory ister read ALU Data memory ister write Total (in ps) Load word 200 100 200 200 100 800 Store word 200 100 200 200 700 R-format 200 100 200 100

More information

Instruction fetch. MemRead. IRWrite ALUSrcB = 01. ALUOp = 00. PCWrite. PCSource = 00. ALUSrcB = 00. R-type completion

Instruction fetch. MemRead. IRWrite ALUSrcB = 01. ALUOp = 00. PCWrite. PCSource = 00. ALUSrcB = 00. R-type completion . (Chapter 5) Fill in the vales for SrcA, SrcB, IorD, Dst and emto to complete the Finite State achine for the mlti-cycle datapath shown below. emory address comptation 2 SrcA = SrcB = Op = fetch em SrcA

More information

Computer Architecture Chapter 5. Fall 2005 Department of Computer Science Kent State University

Computer Architecture Chapter 5. Fall 2005 Department of Computer Science Kent State University Compter Architectre Chapter 5 Fall 25 Department of Compter Science Kent State University The Processor: Datapath & Control Or implementation of the MIPS is simplified memory-reference instrctions: lw,

More information

Prof. Kozyrakis. 1. (10 points) Consider the following fragment of Java code:

Prof. Kozyrakis. 1. (10 points) Consider the following fragment of Java code: EE8 Winter 25 Homework #2 Soltions De Thrsday, Feb 2, 5 P. ( points) Consider the following fragment of Java code: for (i=; i

More information

Computer Architecture

Computer Architecture Compter Architectre Lectre 4: Intro to icroarchitectre: Single- Cycle Dr. Ahmed Sallam Sez Canal University Spring 25 Based on original slides by Prof. Onr tl Review Compter Architectre Today and Basics

More information

Computer Architecture

Computer Architecture Compter Architectre Lectre 4: Intro to icroarchitectre: Single- Cycle Dr. Ahmed Sallam Sez Canal University Based on original slides by Prof. Onr tl Review Compter Architectre Today and Basics (Lectres

More information

Lecture 10: Pipelined Implementations

Lecture 10: Pipelined Implementations U 8-7 S 9 L- 8-7 Lectre : Pipelined Implementations James. Hoe ept of EE, U Febrary 23, 29 nnoncements: Project is de this week idterm graded, d reslts posted Handots: H9 Homework 3 (on lackboard) Graded

More information

Pipelined Datapath. Reading. Sections Practice Problems: 1, 3, 8, 12

Pipelined Datapath. Reading. Sections Practice Problems: 1, 3, 8, 12 Pipelined Datapath Lecture notes from KP, H. H. Lee and S. Yalamanchili Sections 4.5 4. Practice Problems:, 3, 8, 2 ing Note: Appendices A-E in the hardcopy text correspond to chapters 7- in the online

More information

Quiz #1 EEC 483, Spring 2019

Quiz #1 EEC 483, Spring 2019 Qiz # EEC 483, Spring 29 Date: Jan 22 Name: Eercise #: Translate the following instrction in C into IPS code. Eercise #2: Translate the following instrction in C into IPS code. Hint: operand C is stored

More information

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Introduction Chapter 4.1 Chapter 4.2 Review: MIPS (RISC) Design Principles Simplicity favors regularity fixed size instructions small number

More information

Lecture 6: Microprogrammed Multi Cycle Implementation. James C. Hoe Department of ECE Carnegie Mellon University

Lecture 6: Microprogrammed Multi Cycle Implementation. James C. Hoe Department of ECE Carnegie Mellon University 8 447 Lectre 6: icroprogrammed lti Cycle Implementation James C. Hoe Department of ECE Carnegie ellon University 8 447 S8 L06 S, James C. Hoe, CU/ECE/CALC, 208 Yor goal today Hosekeeping nderstand why

More information

Processor Design CSCE Instructor: Saraju P. Mohanty, Ph. D. NOTE: The figures, text etc included in slides are borrowed

Processor Design CSCE Instructor: Saraju P. Mohanty, Ph. D. NOTE: The figures, text etc included in slides are borrowed Lecture 3: General Purpose Processor Design CSCE 665 Advanced VLSI Systems Instructor: Saraju P. ohanty, Ph. D. NOTE: The figures, tet etc included in slides are borrowed from various books, websites,

More information

1048: Computer Organization

1048: Computer Organization 48: Compter Organization Lectre 5 Datapath and Control Lectre5B - mlticycle implementation (cwli@twins.ee.nct.ed.tw) 5B- Recap: A Single-Cycle Processor PCSrc 4 Add Shift left 2 Add ALU reslt PC address

More information

Lecture 7. Building A Simple Processor

Lecture 7. Building A Simple Processor Lectre 7 Bilding A Simple Processor Christos Kozyrakis Stanford University http://eeclass.stanford.ed/ee8b C. Kozyrakis EE8b Lectre 7 Annoncements Upcoming deadlines Lab is de today Demo by 5pm, report

More information

Chapter 4 (Part II) Sequential Laundry

Chapter 4 (Part II) Sequential Laundry Chapter 4 (Part II) The Processor Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Sequential Laundry 6 P 7 8 9 10 11 12 1 2 A T a s k O r d e r A B C D 30 30 30 30 30 30 30 30 30 30

More information

Lecture 3. Pipelining. Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1

Lecture 3. Pipelining. Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1 Lecture 3 Pipelining Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1 A "Typical" RISC ISA 32-bit fixed format instruction (3 formats) 32 32-bit GPR (R0 contains zero, DP take pair)

More information

Lecture 13: Exceptions and Interrupts

Lecture 13: Exceptions and Interrupts 18 447 Lectre 13: Eceptions and Interrpts S 10 L13 1 James C. Hoe Dept of ECE, CU arch 1, 2010 Annoncements: Handots: Spring break is almost here Check grades on Blackboard idterm 1 graded Handot #9: Lab

More information

CSE 141 Computer Architecture Summer Session I, Lectures 10 Advanced Topics, Memory Hierarchy and Cache. Pramod V. Argade

CSE 141 Computer Architecture Summer Session I, Lectures 10 Advanced Topics, Memory Hierarchy and Cache. Pramod V. Argade CSE 141 Compter Architectre Smmer Session I, 2004 Lectres 10 Advanced Topics, emory Hierarchy and Cache Pramod V. Argade CSE141: Introdction to Compter Architectre Instrctor: TA: Pramod V. Argade (p2argade@cs.csd.ed)

More information

NOW Handout Page 1. Debate Review: ISA -a Critical Interface. EECS 252 Graduate Computer Architecture

NOW Handout Page 1. Debate Review: ISA -a Critical Interface. EECS 252 Graduate Computer Architecture NOW Handout Page 1 EECS 252 Graduate Computer Architecture Lec 4 Issues in Basic Pipelines (stalls, exceptions, branch prediction) David Culler Electrical Engineering and Computer Sciences University of

More information

Pipelined Datapath. Reading. Sections Practice Problems: 1, 3, 8, 12 (2) Lecture notes from MKP, H. H. Lee and S.

Pipelined Datapath. Reading. Sections Practice Problems: 1, 3, 8, 12 (2) Lecture notes from MKP, H. H. Lee and S. Pipelined Datapath Lecture notes from KP, H. H. Lee and S. Yalamanchili Sections 4.5 4. Practice Problems:, 3, 8, 2 ing (2) Pipeline Performance Assume time for stages is ps for register read or write

More information

CPS104 Computer Organization and Programming Lecture 19: Pipelining. Robert Wagner

CPS104 Computer Organization and Programming Lecture 19: Pipelining. Robert Wagner CPS104 Computer Organization and Programming Lecture 19: Pipelining Robert Wagner cps 104 Pipelining..1 RW Fall 2000 Lecture Overview A Pipelined Processor : Introduction to the concept of pipelined processor.

More information

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri Department of Computer and IT Engineering University of Kurdistan Computer Architecture Pipelining By: Dr. Alireza Abdollahpouri Pipelined MIPS processor Any instruction set can be implemented in many

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

Outline Marquette University

Outline Marquette University COEN-4710 Computer Hardware Lecture 4 Processor Part 2: Pipelining (Ch.4) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations from Mike

More information

T = I x CPI x C. Both effective CPI and clock cycle C are heavily influenced by CPU design. CPI increased (3-5) bad Shorter cycle good

T = I x CPI x C. Both effective CPI and clock cycle C are heavily influenced by CPU design. CPI increased (3-5) bad Shorter cycle good CPU performance equation: T = I x CPI x C Both effective CPI and clock cycle C are heavily influenced by CPU design. For single-cycle CPU: CPI = 1 good Long cycle time bad On the other hand, for multi-cycle

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

Lecture 6: Pipelining

Lecture 6: Pipelining Lecture 6: Pipelining i CSCE 26 Computer Organization Instructor: Saraju P. ohanty, Ph. D. NOTE: The figures, text etc included in slides are borrowed from various books, websites, authors pages, and other

More information

Pipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Pipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Pipeline Hazards Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Hazards What are hazards? Situations that prevent starting the next instruction

More information

Lecture 9: Microcontrolled Multi-Cycle Implementations

Lecture 9: Microcontrolled Multi-Cycle Implementations 8-447 Lectre 9: icroled lti-cycle Implementations James C. Hoe Dept of ECE, CU Febrary 8, 29 S 9 L9- Annoncements: P&H Appendi D Get started t on Lab Handots: Handot #8: Project (on Blackboard) Single-Cycle

More information

Modern Computer Architecture

Modern Computer Architecture Modern Computer Architecture Lecture2 Pipelining: Basic and Intermediate Concepts Hongbin Sun 国家集成电路人才培养基地 Xi an Jiaotong University Pipelining: Its Natural! Laundry Example Ann, Brian, Cathy, Dave each

More information

What do we have so far? Multi-Cycle Datapath (Textbook Version)

What do we have so far? Multi-Cycle Datapath (Textbook Version) What do we have so far? ulti-cycle Datapath (Textbook Version) CPI: R-Type = 4, Load = 5, Store 4, Branch = 3 Only one instruction being processed in datapath How to lower CPI further? #1 Lec # 8 Summer2001

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

Improve performance by increasing instruction throughput

Improve performance by increasing instruction throughput Improve performance by increasing instruction throughput Program execution order Time (in instructions) lw $1, 100($0) fetch 2 4 6 8 10 12 14 16 18 ALU Data access lw $2, 200($0) 8ns fetch ALU Data access

More information

4.13 Advanced Topic: An Introduction to Digital Design Using a Hardware Design Language 345.e1

4.13 Advanced Topic: An Introduction to Digital Design Using a Hardware Design Language 345.e1 .3 Advanced Topic: An Introdction to Digital Design Using a Hardware Design Langage 35.e.3 Advanced Topic: An Introdction to Digital Design Using a Hardware Design Langage to Describe and odel a Pipeline

More information

Hardware Design Tips. Outline

Hardware Design Tips. Outline Hardware Design Tips EE 36 University of Hawaii EE 36 Fall 23 University of Hawaii Otline Verilog: some sbleties Simlators Test Benching Implementing the IPS Actally a simplified 6 bit version EE 36 Fall

More information

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3. Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =2n/05n+15 2n/0.5n 1.5 4 = number of stages 4.5 An Overview

More information

Dynamic Control Hazard Avoidance

Dynamic Control Hazard Avoidance Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count CPI and Cycle time Determined

More information

CENG 3420 Lecture 06: Pipeline

CENG 3420 Lecture 06: Pipeline CENG 3420 Lecture 06: Pipeline Bei Yu byu@cse.cuhk.edu.hk CENG3420 L06.1 Spring 2019 Outline q Pipeline Motivations q Pipeline Hazards q Exceptions q Background: Flip-Flop Control Signals CENG3420 L06.2

More information

Advanced Computer Architecture Pipelining

Advanced Computer Architecture Pipelining Advanced Computer Architecture Pipelining Dr. Shadrokh Samavi Some slides are from the instructors resources which accompany the 6 th and previous editions of the textbook. Some slides are from David Patterson,

More information

LECTURE 3: THE PROCESSOR

LECTURE 3: THE PROCESSOR LECTURE 3: THE PROCESSOR Abridged version of Patterson & Hennessy (2013):Ch.4 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

Lecture 9. Pipeline Hazards. Christos Kozyrakis Stanford University

Lecture 9. Pipeline Hazards. Christos Kozyrakis Stanford University Lecture 9 Pipeline Hazards Christos Kozyrakis Stanford University http://eeclass.stanford.edu/ee18b 1 Announcements PA-1 is due today Electronic submission Lab2 is due on Tuesday 2/13 th Quiz1 grades will

More information

CS 110 Computer Architecture. Pipelining. Guest Lecture: Shu Yin. School of Information Science and Technology SIST

CS 110 Computer Architecture. Pipelining. Guest Lecture: Shu Yin.   School of Information Science and Technology SIST CS 110 Computer Architecture Pipelining Guest Lecture: Shu Yin http://shtech.org/courses/ca/ School of Information Science and Technology SIST ShanghaiTech University Slides based on UC Berkley's CS61C

More information

CS 251, Spring 2018, Assignment 3.0 3% of course mark

CS 251, Spring 2018, Assignment 3.0 3% of course mark CS 25, Spring 28, Assignment 3. 3% of corse mark De onday, Jne 25th, 5:3 P. (5 points) Consider the single-cycle compter shown on page 6 of this assignment. Sppose the circit elements take the following

More information

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14 MIPS Pipelining Computer Organization Architectures for Embedded Computing Wednesday 8 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy 4th Edition, 2011, MK

More information

EECS 252 Graduate Computer Architecture Lecture 4. Control Flow (continued) Interrupts. Review: Ordering Properties of basic inst.

EECS 252 Graduate Computer Architecture Lecture 4. Control Flow (continued) Interrupts. Review: Ordering Properties of basic inst. EECS 252 Graduate Computer Architecture Lecture 4 Control Flow (continued) Interrupts John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~kubitron/cs252

More information

Page 1. Today s Big Idea. Lecture 18: Branch Prediction + analysis resources => ILP

Page 1. Today s Big Idea. Lecture 18: Branch Prediction + analysis resources => ILP CS252 Graduate Computer Architecture Lecture 18: Branch Prediction + analysis resources => ILP April 2, 2 Prof. David E. Culler Computer Science 252 Spring 2 Today s Big Idea Reactive: past actions cause

More information

Computer Architecture. Lecture 6.1: Fundamentals of

Computer Architecture. Lecture 6.1: Fundamentals of CS3350B Computer Architecture Winter 2015 Lecture 6.1: Fundamentals of Instructional Level Parallelism Marc Moreno Maza www.csd.uwo.ca/courses/cs3350b [Adapted from lectures on Computer Organization and

More information

CSE Introduction to Computer Architecture Chapter 5 The Processor: Datapath & Control

CSE Introduction to Computer Architecture Chapter 5 The Processor: Datapath & Control CSE-45432 Introdction to Compter Architectre Chapter 5 The Processor: Datapath & Control Dr. Izadi Data Processor Register # PC Address Registers ALU memory Register # Register # Address Data memory Data

More information

Full Datapath. Chapter 4 The Processor 2

Full Datapath. Chapter 4 The Processor 2 Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

More information

CS 251, Winter 2018, Assignment % of course mark

CS 251, Winter 2018, Assignment % of course mark CS 25, Winter 28, Assignment 3.. 3% of corse mark De onday, Febrary 26th, 4:3 P Lates accepted ntil : A, Febrary 27th with a 5% penalty. IEEE 754 Floating Point ( points): (a) (4 points) Complete the following

More information

Pipelining. Maurizio Palesi

Pipelining. Maurizio Palesi * Pipelining * Adapted from David A. Patterson s CS252 lecture slides, http://www.cs.berkeley/~pattrsn/252s98/index.html Copyright 1998 UCB 1 References John L. Hennessy and David A. Patterson, Computer

More information

CSCI 402: Computer Architectures. Fengguang Song Department of Computer & Information Science IUPUI. Today s Content

CSCI 402: Computer Architectures. Fengguang Song Department of Computer & Information Science IUPUI. Today s Content 3/6/8 CSCI 42: Computer Architectures The Processor (2) Fengguang Song Department of Computer & Information Science IUPUI Today s Content We have looked at how to design a Data Path. 4.4, 4.5 We will design

More information

Review Tomasulo. Lecture 17: ILP and Dynamic Execution #2: Branch Prediction, Multiple Issue. Tomasulo Algorithm and Branch Prediction

Review Tomasulo. Lecture 17: ILP and Dynamic Execution #2: Branch Prediction, Multiple Issue. Tomasulo Algorithm and Branch Prediction CS252 Graduate Computer Architecture Lecture 17: ILP and Dynamic Execution #2: Branch Prediction, Multiple Issue March 23, 01 Prof. David A. Patterson Computer Science 252 Spring 01 Review Tomasulo Reservations

More information

Pipelining: Basic Concepts

Pipelining: Basic Concepts Pipelining: Basic Concepts Prof. Cristina Silvano Dipartimento di Elettronica e Informazione Politecnico di ilano email: silvano@elet.polimi.it Outline Reduced Instruction Set of IPS Processor Implementation

More information

ECE473 Computer Architecture and Organization. Pipeline: Control Hazard

ECE473 Computer Architecture and Organization. Pipeline: Control Hazard Computer Architecture and Organization Pipeline: Control Hazard Lecturer: Prof. Yifeng Zhu Fall, 2015 Portions of these slides are derived from: Dave Patterson UCB Lec 15.1 Pipelining Outline Introduction

More information

Chapter 4 The Processor 1. Chapter 4B. The Processor

Chapter 4 The Processor 1. Chapter 4B. The Processor Chapter 4 The Processor 1 Chapter 4B The Processor Chapter 4 The Processor 2 Control Hazards Branch determines flow of control Fetching next instruction depends on branch outcome Pipeline can t always

More information

Chapter 4 The Processor 1. Chapter 4A. The Processor

Chapter 4 The Processor 1. Chapter 4A. The Processor Chapter 4 The Processor 1 Chapter 4A The Processor Chapter 4 The Processor 2 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware

More information

Thomas Polzer Institut für Technische Informatik

Thomas Polzer Institut für Technische Informatik Thomas Polzer tpolzer@ecs.tuwien.ac.at Institut für Technische Informatik Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =

More information

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building

More information

Computer Architecture

Computer Architecture Lecture 3: Pipelining Iakovos Mavroidis Computer Science Department University of Crete 1 Previous Lecture Measurements and metrics : Performance, Cost, Dependability, Power Guidelines and principles in

More information

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest

More information

Lecture 8: Data Hazard and Resolution. James C. Hoe Department of ECE Carnegie Mellon University

Lecture 8: Data Hazard and Resolution. James C. Hoe Department of ECE Carnegie Mellon University 18 447 Lecture 8: Data Hazard and Resolution James C. Hoe Department of ECE Carnegie ellon University 18 447 S18 L08 S1, James C. Hoe, CU/ECE/CALC, 2018 Your goal today Housekeeping detect and resolve

More information

EITF20: Computer Architecture Part2.2.1: Pipeline-1

EITF20: Computer Architecture Part2.2.1: Pipeline-1 EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle

More information

Instruction Fetch and Branch Prediction. CprE 581 Computer Systems Architecture Readings: Textbook (4 th ed 2.3, 2.9); (5 th ed 3.

Instruction Fetch and Branch Prediction. CprE 581 Computer Systems Architecture Readings: Textbook (4 th ed 2.3, 2.9); (5 th ed 3. Instruction Fetch and Branch Prediction CprE 581 Computer Systems Architecture Readings: Textbook (4 th ed 2.3, 2.9); (5 th ed 3.3) 1 Frontend and Backend Feedback: - Prediction correct or not, update

More information

Lecture 7 Pipelining. Peng Liu.

Lecture 7 Pipelining. Peng Liu. Lecture 7 Pipelining Peng Liu liupeng@zju.edu.cn 1 Review: The Single Cycle Processor 2 Review: Given Datapath,RTL -> Control Instruction Inst Memory Adr Op Fun Rt

More information