Instruction Pipelining is the use of pipelining to allow more than one instruction to be in some stage of execution at the same time.
- Miranda McKenzie
- 6 years ago
1 Pipelining: Pipelining is the use of pipelining to allow more than one instruction to be in some stage of execution at the same time. Ferranti ATLAS (1963): pipelining reduced the average time per instruction by 375%. Memory could not keep up with the CPU, so a cache was needed.
2 What Is Pipelining: Laundry example. Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold. The washer takes 30 minutes, the dryer takes 40 minutes, and folding takes 20 minutes.
3 What Is Pipelining: [Timeline from 6 PM to midnight, task order A B C D.] Sequential laundry takes 6 hours for 4 loads. If they learned pipelining, how long would laundry take?
4 What Is Pipelining: Start work ASAP. [Timeline from 6 PM to midnight, task order A B C D.] Pipelined laundry takes 3.5 hours for 4 loads.
5 Pipelining Lessons: Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload. The pipeline rate is limited by the slowest pipeline stage. Multiple tasks operate simultaneously. Potential speedup = number of pipe stages. Unbalanced lengths of pipe stages reduce the speedup. Time to fill the pipeline and time to drain it also reduce the speedup.
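The arithmetic in the laundry example can be sketched in a few lines; the stage times (wash 30, dry 40, fold 20 minutes) are the ones from the slides above.

```python
# Sequential vs. pipelined laundry time for the slide's example.
def sequential_minutes(n_loads, stages=(30, 40, 20)):
    # Each load finishes completely before the next one starts.
    return n_loads * sum(stages)

def pipelined_minutes(n_loads, stages=(30, 40, 20)):
    # The first load fills the pipeline; every later load is limited
    # by the slowest stage (the 40-minute dryer).
    return sum(stages) + (n_loads - 1) * max(stages)

print(sequential_minutes(4) / 60)  # 6.0 hours, as on the slide
print(pipelined_minutes(4) / 60)   # 3.5 hours
```

Note that the speedup (360/210, about 1.7) is well below the 3-stage ideal, because the stages are unbalanced and the pipeline has to fill and drain: exactly the lessons listed above.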
6 Models
Single-cycle model (non-overlapping): the instruction latency is a single cycle; every instruction and clock cycle must be stretched to the slowest instruction.
Multi-cycle model (non-overlapping): the instruction executes in multiple cycles; the clock cycle must be stretched to the slowest step; functional units can be shared within the execution of a single instruction.
Pipeline model (overlapping): the instruction executes in multiple cycles and the clock cycle must be stretched to the slowest step, but the throughput is mainly one clock cycle per instruction; gains efficiency by overlapping the execution of multiple instructions, increasing hardware utilization.
7 Review: Single-Cycle Datapath. 2 adders: a PC+4 adder and a Branch/Jump offset adder. [Datapath diagram: PC, instruction memory, register file, sign extend, shift left 2, ALU, data memory, with RegDst, ALUSrc, Branch, and MemtoReg control lines.] Harvard architecture: separate instruction and data memory.
8 Review: Multi- vs. Single-cycle Processor Datapath. Combine the adders: add ½ a Mux and 3 temporary registers (A, B, ALUOut). Combine the memories: add a Mux and 2 temporary registers (IR, MDR). [Multi-cycle datapath diagram.] Single-cycle = ALU + 2 Mem + 4 Muxes + 2 adders + OpcodeDecoders; Multi-cycle = ALU + Mem + 5½ Muxes + 5 Regs (IR, A, B, MDR, ALUOut) + FSM.
9 Multi-cycle Processor Datapath. Single-cycle = ALU + 2 Mem + 4 Muxes + 2 adders + OpcodeDecoders; Multi-cycle = ALU + Mem + 5½ Muxes + 5 Regs (IR, A, B, MDR, ALUOut) + FSM. [Multi-cycle datapath diagram.] = 160 additional FFs for the multi-cycle processor over the single-cycle processor.
10 Pipelined Datapath Registers: [Pipeline datapath diagram: the PC plus the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers between the five stages, carrying the PC, IR, register values, sign-extended immediate, and control fields forward.] 213 + 16 = 229 additional FFs for the pipeline over the multi-cycle processor.
11 Overhead (chip area vs. speed)
Single-cycle model: 8 ns clock (125 MHz), non-overlapping; ALU + 2 adders; 4 Muxes; 0 datapath register bits (flip-flops).
Multi-cycle model: 2 ns clock (500 MHz), non-overlapping; ALU + Controller; 5½ Muxes; 160 datapath register bits (flip-flops).
Pipeline model: 2 ns clock (500 MHz), overlapping; 2 ALUs + Controller; 4 Muxes; 373 datapath + 16 controlpath register bits (flip-flops).
12 Comparison
CISC: any instruction may reference memory; many instructions and addressing modes; variable instruction formats; single register set; multi-clock-cycle instructions; a micro-program interprets instructions; complexity is in the micro-program; little to no pipelining; program code size is small.
RISC: only load/store reference memory; few instructions and addressing modes; fixed instruction formats; multiple register sets; single-clock-cycle instructions; hardware (FSM) executes instructions; complexity is in the compiler; highly pipelined; program code size is large.
13 Pipelining versus Parallelism: Most high-performance computers exhibit a great deal of concurrency. However, it is not desirable to call every modern computer a parallel computer. Pipelining and parallelism are two methods used to achieve concurrency. Pipelining increases concurrency by dividing a computation into a number of steps. Parallelism is the use of multiple resources to increase concurrency.
14 Can pipelining get us into trouble? Yes: pipeline hazards.
structural hazards: attempt to use the same resource two different ways at the same time - e.g., multiple memory accesses, multiple register writes - solutions: multiple memories (separate instruction and data memory), stretch the pipeline
control hazards: attempt to make a decision before the condition is evaluated - e.g., any conditional branch - solutions: prediction, delayed branch
data hazards: attempt to use an item before it is ready - e.g., add r1,r2,r3; sub r4,r1,r5; lw r6,0(r7); or r8,r6,r9 - solutions: forwarding/bypassing, stall/bubble
15 When stalls are unavoidable: There are numerous instances where hazards lead unavoidably to stalls! Perhaps the most surprising case is the simple assignment statement: A = B + C;
16 Classification of hazards: Assume that instruction J occurs after instruction I. RAW (read after write): J tries to read a source before I writes to it, so J gets a possibly incorrect value. WAR (write after read): J tries to write a destination before it is read by I, so I reads a possibly incorrect value. WAW (write after write): J tries to write an operand before it is written by I; the writes are performed in the wrong order. MIPS avoids the latter two hazards in its integer pipeline. However, they can still arise in its floating-point pipeline, since instructions spend different lengths of time there.
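A minimal sketch of these definitions, assuming a hypothetical (dest, src1, src2) register-tuple encoding rather than real MIPS decoding:

```python
# Classify the dependence between earlier instruction I and later
# instruction J per the RAW/WAR/WAW definitions above.
def hazards(i, j):
    i_dest, *i_srcs = i
    j_dest, *j_srcs = j
    found = set()
    if i_dest in j_srcs:
        found.add("RAW")  # J reads what I writes
    if j_dest in i_srcs:
        found.add("WAR")  # J writes what I reads
    if j_dest == i_dest:
        found.add("WAW")  # both write the same register
    return found

# add r1,r2,r3 ; sub r4,r1,r5 -> RAW on r1
print(hazards(("r1", "r2", "r3"), ("r4", "r1", "r5")))  # {'RAW'}
```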
17 The Five Stages of Load: cycles 1-5 are Ifetch, Reg/Dec, Exec, Mem, Wr. Ifetch: fetch the instruction from the Instruction Memory. Reg/Dec: register fetch and instruction decode. Exec: calculate the memory address. Mem: read the data from the Data Memory. Wr: write the data back to the register file.
18 RISCEE 4 Architecture: [Accumulator-machine datapath diagram: PC, memory, IR, data registers, accumulator, and an ALU with operations X+1, X-Y, 1+Y, X+Y, controlled by PCSrc, IorD, srcA/srcB, and the memory signals. Clock = load value into register.]
19 Single Cycle, Multiple Cycle, vs. Pipeline: [Timing diagram.] Single-cycle implementation: Load, then Store, with wasted slack in each fixed-length cycle. Multiple-cycle implementation: Load = Ifetch Reg Exec Mem Wr; Store = Ifetch Reg Exec Mem; R-type = Ifetch ... Pipeline implementation: Load, Store, and R-type overlap, each going through Ifetch Reg Exec Mem Wr with one instruction starting per cycle.
20 Why Pipeline? Suppose we execute 100 instructions. Single-cycle machine: 45 ns/cycle x 1 CPI x 100 inst = 4500 ns. Multicycle machine: 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns. Ideal pipelined machine: 10 ns/cycle x (1 CPI x 100 inst + 4 cycles drain) = 1040 ns.
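The slide's arithmetic can be checked directly; treat the constants (45 ns single-cycle clock, 10 ns multicycle/pipeline clock, 4.6 CPI mix, 100 instructions) as the illustrative assumptions read from the slide.

```python
# Total time for 100 instructions under the three models.
N = 100  # instructions

single_cycle = 45 * 1 * N        # 45 ns/cycle, CPI = 1      -> 4500 ns
multicycle   = 10 * 4.6 * N      # 10 ns/cycle, CPI = 4.6    -> 4600 ns
pipelined    = 10 * (1 * N + 4)  # 1 CPI plus 4 cycles drain -> 1040 ns

print(single_cycle, multicycle, pipelined)
```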
21 Why Pipeline? Because the resources are there! [Diagram: five instructions flowing through Im, Reg, ALU, Dm, Reg stages over successive clock cycles; the accompanying resource table shows each unit (instruction memory, data memory, register read ports, register write port, ALU) busy nearly every cycle once the pipeline is full.]
22 Can pipelining get us into trouble? Yes: pipeline hazards.
structural hazards: attempt to use the same resource two different ways at the same time - e.g., a combined washer/dryer would be a structural hazard, or the folder busy doing something else (watching TV)
data hazards: attempt to use an item before it is ready - e.g., one sock of a pair in the dryer and one in the washer; can't fold until the sock gets from the washer through the dryer - an instruction depends on the result of a prior instruction still in the pipeline
control hazards: attempt to make a decision before the condition is evaluated - e.g., washing football uniforms and need to get the proper detergent level; need to see the result after the dryer before the next load can go in - branch instructions
Can always resolve hazards by waiting: pipeline control must detect the hazard and take action (or delay action) to resolve it.
23 Single Memory (Inst & Data) is a Structural Hazard: structural hazards are an attempt to use the same resource two different ways at the same time. Previous example: separate InstMem and DataMem. [Diagram: Load, Instr 1, Instr 2, Instr 3 flowing through Mem, Reg, ALU stages; the right half of a box highlighted means read, the left half means write. With a single Mem (Inst & Data), two instructions need the memory in the same cycle.] Detection is easy in this case!
24 Single Memory (Inst & Data) is a Structural Hazard: By changing the architecture from a Harvard memory (separate instruction and data memory) to a von Neumann memory, we actually created a structural hazard! Structural hazards can be avoided by changing: hardware: the design of the architecture (splitting resources); software: re-ordering the instruction sequence; software: delay.
25 Pipelining: Improve performance by increasing instruction throughput. [Timing diagram: lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Non-pipelined, each instruction takes 8 ns (instruction fetch, Reg, ALU, data access, Reg); pipelined, one instruction completes every 2 ns.] Ideal speedup is the number of stages in the pipeline. Do we achieve this?
26 Stall on Branch: [Timing diagram: add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0). Each stage takes 2 ns; the lw is held back until the branch decision is made, costing a bubble.]
27 Predicting branches (Figure 6.5): [Timing diagrams for predict-not-taken with add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0). When the prediction is correct, the pipeline never stalls; when the branch is taken, the wrongly fetched instruction becomes a bubble and or $7, $8, $9 is fetched from the branch target instead.]
28 Delayed branch: [Timing diagram: beq $1, $2, 40; add $4, $5, $6 (delayed branch slot); lw $3, 300($0). The instruction in the delay slot always executes, so no cycles are lost.]
29 Pipeline (Figure 6.7): add $s0, $t0, $t1 goes through IF ID EX MEM WB.
Pipeline stages: IF = instruction fetch (read); ID = instruction decode and register read; EX = execute ALU operation; MEM = memory access (read or write); WB = write back to register.
Resources: Mem (instruction and data memory), Reg1 (register read port #1), Reg2 (register read port #2), RegW (register write port), ALU (ALU operation).
30 Forwarding (Figure 6.8): add $s0, $t0, $t1: IF ID EX MEM WB; sub $t2, $s0, $t3: IF ID EX MEM WB. [Diagram: the add's ALU result is forwarded from its EX stage to the sub's EX stage.]
31 Load Forwarding (Figure 6.9): lw $s0, 20($t1): IF ID EX MEM WB; bubble; sub $t2, $s0, $t3: IF ID EX MEM WB. Even with forwarding, a load followed immediately by a dependent instruction costs one bubble.
32 Reordering
lw $t1, 0($t0)  # $t1 = Memory[0+$t0]
lw $t2, 4($t0)  # $t2 = Memory[4+$t0]
sw $t2, 0($t0)  # Memory[0+$t0] = $t2
sw $t1, 4($t0)  # Memory[4+$t0] = $t1
Reordered to avoid the load-use stall:
lw $t1, 0($t0)
lw $t2, 4($t0)
sw $t1, 4($t0)
sw $t2, 0($t0)
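The benefit of this reordering can be sketched by counting load-use bubbles; the instruction tuples (op, dest, base, sources...) are a hypothetical encoding, not real MIPS decoding.

```python
# One bubble whenever an instruction uses a register loaded by the
# immediately preceding lw.
def load_use_stalls(seq):
    stalls = 0
    for a, b in zip(seq, seq[1:]):
        if a[0] == "lw" and a[1] in b[2:]:
            stalls += 1
    return stalls

orig = [("lw", "t1", "t0"),        # lw $t1, 0($t0)
        ("lw", "t2", "t0"),        # lw $t2, 4($t0)
        ("sw", None, "t0", "t2"),  # sw $t2, 0($t0): uses $t2 right away
        ("sw", None, "t0", "t1")]  # sw $t1, 4($t0)
swapped = [orig[0], orig[1], orig[3], orig[2]]  # swap the two stores

print(load_use_stalls(orig), load_use_stalls(swapped))  # 1 0
```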
33 Basic Idea: split the datapath. IF: instruction fetch; ID: instruction decode / register file read; EX: execute / address calculation; MEM: memory access; WB: write back. [Single-cycle datapath drawn with the five stage boundaries marked.] What do we need to add to actually split the datapath into stages?
34 Graphically Representing Pipelines: [Diagram: lw $10, 20($1) and sub $11, $2, $3 over clock cycles CC1 to CC6, each using IM, Reg, ALU, DM, Reg.] This representation can help with answering questions like: how many cycles does it take to execute this code? What is the ALU doing during cycle 4? Use it to help understand datapaths.
35 Pipeline datapath with registers: [Figure: the single-cycle datapath with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers inserted between the stages.]
36 Load instruction: fetch and decode. [Diagrams: in IF, the lw is fetched from instruction memory and the incremented PC is written into IF/ID; in ID, the registers are read and the immediate is sign-extended into ID/EX.]
37 Load instruction: execution. [Diagram: the ALU adds the base register and the sign-extended offset; the result is written into EX/MEM.]
38 Load instruction: memory and write back. [Diagrams: the data memory is read at the ALU address into MEM/WB; then the loaded data is written back to the register file.]
39 Store instruction: execution. [Diagram: address calculation as for load, but the register value to be stored is carried along into EX/MEM.]
40 Store instruction: memory and write back. [Diagrams: the data is written into the Data Memory; nothing happens in the write-back stage.]
41 Load instruction: corrected datapath. [Figure: the write-register number must be carried forward through the pipeline registers so it arrives at WB with the data.]
42 Load instruction: overall usage. [Figure: the portions of the datapath the lw uses in each of its five stages.]
43 Multi-clock-cycle pipeline diagram: [Two diagrams over CC1 to CC6 for lw $10, 20($1) followed by sub $11, $2, $3. The first shows the resources used per cycle (IM, Reg, ALU, DM, Reg); the second labels the stages (instruction fetch, instruction decode, execution, data access, write back).]
44 Single-cycle diagrams #1-2 (Figure 6.22): [Clock 1: lw $10, 20($1) in instruction fetch. Clock 2: sub $11, $2, $3 in instruction fetch while the lw is in instruction decode.]
45 Single-cycle diagrams #3-4 (Figure 6.23): [Clock 3: sub in decode, lw in execution. Clock 4: sub in execution, lw in memory.]
46 Single-cycle diagrams #5-6 (Figure 6.24): [Clock 5: sub in memory, lw in write back. Clock 6: sub in write back.]
47 Conventional Pipelined Execution Representation: [Diagram: one IFetch Dcd Exec Mem WB row per instruction, each row shifted one cycle later than the previous, following program flow.]
48 Structural Hazards limit performance. Example: if there are 1.3 memory accesses per instruction and only one memory access is possible per cycle, then average CPI >= 1.3; otherwise the memory resource would be more than 100% utilized.
49 Control Hazard Solutions. Stall: wait until the decision is clear. It's possible to move the decision up to the 2nd stage by adding hardware to check the registers as they are being read. [Diagram: Beq followed by Load; the Load's fetch is delayed until the branch resolves.] Impact: 2 clock cycles per branch instruction => slow.
50 Control Hazard Solutions. Predict: guess one direction, then back up if wrong. Predict not taken. [Diagram: Beq followed by Load with no stall on a correct guess.] Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right about 50% of the time). A more dynamic scheme keeps a history of the branch (right about 90% of the time).
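The impact numbers can be expressed as an expected cost per branch, using the slide's figures (1 cycle when the guess is right, 2 when wrong) as assumptions:

```python
def avg_branch_cycles(p_right, cost_right=1, cost_wrong=2):
    # Expected clocks per branch under predict-not-taken.
    return p_right * cost_right + (1 - p_right) * cost_wrong

print(avg_branch_cycles(0.5))  # 1.5 with a static ~50% guess
print(avg_branch_cycles(0.9))  # ~1.1 with branch history (~90% right)
```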
51 Control Hazard Solutions. Redefine branch behavior (the branch takes place after the next instruction): delayed branch. [Diagram: Beq, Misc (delay-slot instruction), Load.] Impact: 1 clock cycle per branch instruction if we can find an instruction to put in the slot (about 50% of the time). As machines launch more instructions per clock cycle, this becomes less useful.
52 Data Hazard on r1: add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; or r10,r1,r11
53 Data Hazard on r1: dependencies backwards in time are hazards. [Diagram: add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; or r10,r1,r11 across IF ID/RF EX MEM WB; the following instructions read r1 before the add has written it back.]
54 Data Hazard Solution: forward the result from one stage to another. [Diagram: the same code; the add's ALU result is forwarded to the dependent instructions' EX stages.] The or r8 is OK if we define register read/write properly (write in the first half of the cycle, read in the second).
55 Forwarding (or Bypassing): what about loads? Dependencies backwards in time are hazards: lw r1, 0(r2); sub r4, r1, r3. Can't solve this with forwarding alone: must delay/stall the instruction dependent on the load.
56 Pipelining the Load: [Diagram: three lw instructions, one starting each cycle.] The five independent functional units in the pipeline datapath are: the Instruction Memory for the Ifetch stage; the Register File's read ports (busA and busB) for the Reg/Dec stage; the ALU for the Exec stage; the Data Memory for the Mem stage; the Register File's write port (busW) for the Wr stage.
57 The Four Stages of R-type: cycles 1-4 are Ifetch, Reg/Dec, Exec, Wr. Ifetch: fetch the instruction from the Instruction Memory. Reg/Dec: register fetch and instruction decode. Exec: the ALU operates on the two register operands; update the PC. Wr: write the ALU output back to the register file.
58 Pipelining the R-type and Load: [Diagram: R-type, R-type, Load, R-type, R-type entering the pipeline on successive cycles.] Oops! We have a problem! We have a pipeline conflict, or structural hazard: two instructions try to write to the register file at the same time, and there is only one write port.
59 Important Observation: Each functional unit can only be used once per instruction, and each functional unit must be used at the same stage for all instructions. Load uses the Register File's write port during its 5th stage (Ifetch, Reg/Dec, Exec, Mem, Wr); R-type uses the Register File's write port during its 4th stage (Ifetch, Reg/Dec, Exec, Wr). There are 2 ways to solve this pipeline hazard.
60 Solution 1: Insert a Bubble into the Pipeline. [Diagram: Load, R-type, R-type, then a bubble, then R-type.] Insert a bubble into the pipeline to prevent 2 writes in the same cycle. The control logic can be complex, and we lose an instruction fetch and issue opportunity: no instruction is started in cycle 6!
61 Solution 2: Delay the R-type's Write by One Cycle. Delay the R-type's register write by one cycle: now R-type instructions also use the Register File's write port at stage 5. The Mem stage becomes a NOOP stage for R-type: nothing is done there. R-type: Ifetch, Reg/Dec, Exec, Mem (NOOP), Wr. [Diagram: R-type and Load instructions now all write in stage 5, so there is no write-port conflict.]
62 The Four Stages of Store: cycles 1-4 are Ifetch, Reg/Dec, Exec, Mem. Ifetch: fetch the instruction from the Instruction Memory. Reg/Dec: register fetch and instruction decode. Exec: calculate the memory address. Mem: write the data into the Data Memory.
63 The Three Stages of Beq: cycles 1-3 are Ifetch, Reg/Dec, Exec. Ifetch: fetch the instruction from the Instruction Memory. Reg/Dec: register fetch and instruction decode. Exec: the ALU compares the two register operands, the correct branch target address is selected, and it is latched into the PC.
64 Summary: Pipelining. What makes it easy: all instructions are the same length; there are just a few instruction formats; memory operands appear only in loads and stores. What makes it hard: structural hazards (suppose we had only one memory); control hazards (need to worry about branch instructions); data hazards (an instruction depends on a previous instruction). We'll build a simple pipeline and look at these issues. We'll talk about modern processors and what really makes it hard: exception handling, trying to improve performance with out-of-order execution, etc.
65 Summary: Pipelining is a fundamental concept: multiple steps using distinct resources. Utilize the capabilities of the datapath by pipelined instruction processing: start the next instruction while working on the current one; limited by the length of the longest stage (plus fill/flush); detect and resolve hazards.
66 Pipeline Control: Controlpath Register bits (Figure 6.27). [Figure: the control unit produces 9 control bits in ID/EX; the 5 bits for MEM and WB are passed on to EX/MEM; the 2 WB bits are passed on to MEM/WB.]
67 Pipeline Control: Controlpath table (Figure 5.12 single cycle; Figure 6.28 pipelined).
Instruction | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp
R-format    |   1    |   0    |    0     |    1     |    0    |    0     |   0    |  10
lw          |   0    |   1    |    1     |    1     |    1    |    0     |   0    |  00
sw          |   X    |   1    |    X     |    0     |    0    |    1     |   0    |  00
beq         |   X    |   0    |    X     |    0     |    0    |    0     |   1    |  01
Grouped by pipeline stage: ID/EX (EX control lines): RegDst, ALUOp, ALUSrc; EX/MEM (MEM control lines): Branch, MemRead, MemWrite; MEM/WB (WB control lines): RegWrite, MemtoReg.
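The table can be written down as a lookup structure; the values follow the standard single-issue MIPS pipeline and should be read as a reconstruction of the figure, not an authoritative copy:

```python
# Control values per instruction class, grouped by the pipeline stage
# that consumes them (EX, MEM, WB). None marks a don't-care (X).
CONTROL = {
    "R-format": dict(RegDst=1,    ALUOp="10", ALUSrc=0,
                     Branch=0, MemRead=0, MemWrite=0,
                     RegWrite=1, MemtoReg=0),
    "lw":       dict(RegDst=0,    ALUOp="00", ALUSrc=1,
                     Branch=0, MemRead=1, MemWrite=0,
                     RegWrite=1, MemtoReg=1),
    "sw":       dict(RegDst=None, ALUOp="00", ALUSrc=1,
                     Branch=0, MemRead=0, MemWrite=1,
                     RegWrite=0, MemtoReg=None),
    "beq":      dict(RegDst=None, ALUOp="01", ALUSrc=0,
                     Branch=1, MemRead=0, MemWrite=0,
                     RegWrite=0, MemtoReg=None),
}

def wb_bits(op):
    # Only the WB group (2 bits) survives into the MEM/WB register.
    c = CONTROL[op]
    return (c["RegWrite"], c["MemtoReg"])

print(wb_bits("lw"))  # (1, 1)
```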
68 Pipeline Hazards. Solution #1 always works (for non-realtime applications): stall, delay & procrastinate!
Structural hazards (e.g., fetching from the same memory bank): Solution #2: partition the architecture.
Control hazards (e.g., branching): Solution #1: stall! but this decreases throughput. Solution #2: guess and back-track. Solution #3: delayed decision: delayed branch & fill the slot.
Data hazards (e.g., register dependencies): the worst-case situation. Solution #2: re-order instructions. Solution #3: forwarding or bypassing; delayed load.
69 Pipeline Datapath and Controlpath: [Figure: the pipelined datapath with the control unit feeding WB, MEM, and EX control bits forward through the ID/EX, EX/MEM, and MEM/WB pipeline registers, plus PCSrc, RegDst, ALUSrc, Branch, MemRead, MemWrite, and MemtoReg.]
70-71 Load instruction, clock by clock: [Figures: lw $10, 20($1) stepped through the pipeline over clocks 1-5. Each snapshot shows the PC, the IR, the values latched in the pipeline registers (A, B, the immediate S, the destination D=$10), the computed address 20+C($1), the memory read DR=Mem[20+C($1)], and finally the write back into $10.]
72 Pipeline single stepping. Contents of registers: C($1)=3; C($2)=4; C($3)=4; C($4)=6; C($5)=7; C($10)=8; Memory[23]=9. Formats: add $rd,$rs=A,$rt=B; lw $rt=B,@($rs=A). [Table: for clocks 1-5, the contents of the IF/ID <PC, IR>, ID/EX <PC, A, B, S, Rt, Rd>, EX/MEM <PC, Z, ALU, B, R>, and MEM/WB <DR, ALUout, R> pipeline registers as the sequence lw $10,20($1); sub $11,$2,$3; and $12,$4,$5; or $13,$6,$7; add $14,$8,$9 steps through the pipeline.]
73-76 Clock-by-clock pipeline snapshots (Figures 6.30-6.32): [Figures: clocks 1-4 of the sequence lw $10, 20($1); sub $11, $2, $3; and $12, $4, $5; or $13, $6, $7. Each snapshot shows which instruction occupies IF, ID, EX, MEM, and WB, and the values latched into the pipeline registers at that clock edge.]
77 Data Dependencies that can be resolved by forwarding: [Diagram: sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2) over CC1 to CC9. The backward-in-time uses of $2 by the and and the or are data hazards resolved by forwarding; the add reads $2 in the same cycle it is written (not a hazard); the sw's forward-in-time use is not a hazard.]
78 Data Hazards: arithmetic. [Diagram: the same code sequence annotated with the value of register $2 each cycle, and with the values held in the EX/MEM and MEM/WB registers. Forwards in time can be resolved; a read in the same cycle as the write is not a hazard.]
79 Data Dependencies: no forwarding. sub $2,$1,$3: IF ID EX MEM WB; and $12,$2,$5: IF ID ID ID EX ... (two stalls; write in the 1st half of the cycle, read in the 2nd half). Suppose every instruction is dependent: 1 + 2 stalls = 3 clocks. MIPS = Clock/CPI = 500 MHz / 3 = 167 MIPS.
80 Data Dependencies: no forwarding. A dependent instruction takes 1 + 2 stalls = 3 clocks; an independent instruction takes 1 + 0 stalls = 1 clock. Suppose 10% of the time the instructions are dependent: average instruction time = 10% x 3 + 90% x 1 = 1.2 clocks. MIPS = Clock/CPI = 500 MHz / 1.2 = 417 MIPS (10% dependency); 500 MHz / 3 = 167 MIPS (100% dependency); 500 MHz / 1 = 500 MIPS (0% dependency).
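The same arithmetic as a function, with the 500 MHz clock and the 3-clock dependent-instruction cost taken as the slide's assumptions:

```python
def mips_rate(clock_mhz, cpi):
    # Millions of instructions per second = clock rate / clocks per inst.
    return clock_mhz / cpi

cpi_10pct = 0.10 * 3 + 0.90 * 1  # 1.2 clocks on average
print(round(mips_rate(500, cpi_10pct)))  # 417 MIPS at 10% dependency
print(round(mips_rate(500, 3)))          # 167 MIPS if every inst depends
print(round(mips_rate(500, 1)))          # 500 MIPS with no dependences
```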
81 Data Dependencies: with forwarding. sub $2,$1,$3: IF ID EX MEM WB; and $12,$2,$5: IF ID EX MEM WB. The data hazard is detected and the result forwarded when ID/EX.$rs = EX/MEM.$rd. Suppose every instruction is dependent: 1 + 0 stalls = 1 clock. MIPS = Clock/CPI = 500 MHz / 1 = 500 MIPS.
82 Data Dependencies: Hazard Conditions. A data hazard condition occurs whenever a source needs a previous, still-unavailable result from a destination. Example: sub $2, $1, $3 (sub $rd, $rs, $rt); and $12, $2, $5 (and $rd, $rs, $rt). Data hazard detection is always a comparison of a destination with a source: EX/MEM.$rdest vs. ID/EX.$rs (type 1a) and ID/EX.$rt (1b); MEM/WB.$rdest vs. ID/EX.$rs (2a) and ID/EX.$rt (2b).
83 Data Dependencies: Hazard Conditions
1a (EX/MEM.$rd = ID/EX.$rs): sub $2, $1, $3; and $12, $2, $5
1b (EX/MEM.$rd = ID/EX.$rt): sub $2, $1, $3; and $12, $1, $2
2a (MEM/WB.$rd = ID/EX.$rs): sub $2, $1, $3; and $12, $1, $5; or $13, $2, $1
2b (MEM/WB.$rd = ID/EX.$rt): sub $2, $1, $3; and $12, $1, $5; or $13, $6, $2
84 Data Dependencies: Worst case. sub $2, $1, $3; and $2, $2, $2; or $13, $2, $2: all four hazard types at once. 1a: EX/MEM.$rd = ID/EX.$rs; 1b: EX/MEM.$rd = ID/EX.$rt; 2a: MEM/WB.$rd = ID/EX.$rs; 2b: MEM/WB.$rd = ID/EX.$rt.
85 Data Dependencies: Hazard Conditions. Sources ID/EX.$rs and ID/EX.$rt are compared against destinations EX/MEM.$rdest (types 1a/1b) and MEM/WB.$rdest (types 2a/2b). [Diagram: the $rs, $rt, and $rd fields flowing through the ID/EX, EX/MEM, and MEM/WB pipeline registers, with the comparison points marked.]
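A sketch of the four comparisons as they would drive a forwarding unit; the two-bit mux codes and the EX/MEM-wins priority follow the usual textbook treatment and are assumptions here:

```python
def forward_controls(id_ex_rs, id_ex_rt, ex_mem_rd, mem_wb_rd):
    a = b = "00"                   # default: operands from register file
    if mem_wb_rd is not None:
        if mem_wb_rd == id_ex_rs: a = "01"  # hazard 2a
        if mem_wb_rd == id_ex_rt: b = "01"  # hazard 2b
    if ex_mem_rd is not None:      # the younger result wins over MEM/WB
        if ex_mem_rd == id_ex_rs: a = "10"  # hazard 1a
        if ex_mem_rd == id_ex_rt: b = "10"  # hazard 1b
    return a, b

# sub $2,$1,$3 ; and $12,$2,$5 -> forward $2 from EX/MEM to ALU input A
print(forward_controls(2, 5, 2, None))  # ('10', '00')
```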
86 [Figure: (a) the datapath without forwarding; (b) with forwarding: a forwarding unit compares EX/MEM.RegisterRd and MEM/WB.RegisterRd against the Rs and Rt fields held in ID/EX and drives the ForwardA and ForwardB muxes at the ALU inputs.]
87 Data Hazards: Loads. [Diagram: lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7 over CC1 to CC9. The and's backward-in-time use of $2 cannot be resolved even by forwarding; the add reads its operands in the same cycle they are available, so it is not a hazard.]
88 Data Hazards: load stalling. [Diagram: the same sequence with a bubble inserted: lw $2, 20($1); and $4, $2, $5 stalls one cycle; or $8, $2, $6 and the following instructions slip one cycle with it.]
89 Data Hazards: Hazard detection unit (page 49). Stall condition: (IF/ID.$rs = ID/EX.$rt or IF/ID.$rt = ID/EX.$rt) and ID/EX.MemRead = 1. Stall example: lw $2, 20($1) (lw $rt, addr($rs)); and $4, $2, $5 (and $rd, $rs, $rt). No-stall example (only need to look at the next instruction): lw $2, 20($1); and $4, $1, $5; or $8, $2, $6.
90 Data Hazards: Hazard detection unit (page 49). No-stall example (only need to look at the next instruction): lw $2, 20($1); and $4, $1, $5; or $8, $2, $6. Example load cost: assume half of the loads are immediately followed by an instruction that uses the loaded value. Average number of clocks for a load: 50% x (1 clock) + 50% x (2 clocks) = 1.5.
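The stall condition above, plus the average-cost arithmetic, as a sketch (the 50% load-use figure is the slide's assumption):

```python
def must_stall(if_id_rs, if_id_rt, id_ex_rt, id_ex_mem_read):
    # Stall when the instruction in decode reads the register that a
    # load currently in EX is about to fetch from memory.
    return id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)

avg_load_clocks = 0.5 * 1 + 0.5 * 2  # 1.5 clocks per load on average

# lw $2, 20($1) ; and $4, $2, $5 -> stall
print(must_stall(2, 5, 2, True), avg_load_clocks)  # True 1.5
```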
91 Hazard Detection Unit: when to stall. [Figure: the pipelined datapath with a hazard detection unit reading ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, and IF/ID.RegisterRt; on a load-use hazard it freezes the PC and IF/ID and zeroes the control signals, working alongside the forwarding unit.]
92 Data Dependency Units. Forwarding condition: ID/EX.$rs or ID/EX.$rt = EX/MEM.$rd, or ID/EX.$rs or ID/EX.$rt = MEM/WB.$rd. Stall condition: (IF/ID.$rs or IF/ID.$rt = ID/EX.$rt) and ID/EX.MemRead = 1.
93 Data Dependency Units
[Figure: the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. Stalling comparisons are made between the $rs/$rt fields in IF/ID and the $rt field in ID/EX; forwarding comparisons between the $rs/$rt fields in ID/EX and the $rd fields in EX/MEM and MEM/WB.]
Stall condition: source IF/ID.$rs or IF/ID.$rt matches destination ID/EX.$rt, AND ID/EX.MemRead = 1.
94 Branch Hazards: Solution #1, Stall until decision made (fig. 6.4)
Program:
  beq $1, $3, 7
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  lw  $4, 50($7)
[Timing diagram, 2 ns per stage: add $4, $5, $6 | beq $1, $2, 40 | lw $3, 300($0). The lw is stalled until the branch decision is made; the decision is made in the ID stage, then do the load.]
95 Branch Hazards: Solution #2, Predict until decision made
  beq $1,$3,7    IF ID EX MEM WB    <- predict branch not taken
  and $12,$2,$5  IF ID ...          <- discard the and instruction on a wrong prediction
  lw  $4, 50($7) IF ID EX MEM WB
Decision made in the ID stage: discard and branch.
96 Branch Hazards: Solution #3, Delayed Decision
  beq $1,$3,7    IF ID EX MEM WB
  add $4,$6,$6   IF ID EX MEM WB    <- an instruction from before the branch is moved into the delay slot, so nothing needs to be discarded
  lw  $4, 50($7) IF ID EX MEM WB
Decision made in the ID stage: branch.
97 Branch Hazards: Solution #3, Delayed Decision
  beq $1,$3,7    IF ID EX MEM WB
  and $12,$2,$5  IF ID EX MEM WB    <- the delay-slot instruction executes regardless
  lw  $4, 50($7) IF ID EX MEM WB
Decision made in the ID stage: do branch.
98 Branch Hazards: Decision made in the ID stage (figure 6.4)
  beq $1,$3,7    IF ID EX MEM WB
  nop            IF ID EX MEM WB    <- no decision yet: insert a nop
  lw  $4, 50($7) IF ID EX MEM WB    <- decision: do load
99 Branch Hazards: Solution #2, Predict until decision made
[Pipeline diagram, clock cycles CC 1 through CC 9. Predict the branch not taken; the branch decision is made in the MEM stage, and in-flight values are discarded on a wrong prediction:
  40 beq $1, $3, 7
  44 and $12, $2, $5
  48 or  $13, $6, $2
  52 add $14, $2, $2
  72 lw  $4, 50($7)
A misprediction has the same effect as 3 stalls.]
100 Figure 6.51
[Datapath figure: IF.Flush, the hazard detection unit, and an early branch comparison — the register equality test and branch-target adder are moved into the ID stage. Flush: on a wrong prediction, the instruction in IF is turned into a nop.]
101 Performance
Load: assume half of the loads are immediately followed by an instruction that uses the result (i.e. a dependency).
  load instruction time = 50%*(1 clock) + 50%*(2 clocks) = 1.5
Jump: assume that jumps always pay a full clock-cycle delay (stall).
  jump instruction time = 2
Branch: the branch delay on a misprediction is 1 clock cycle, and 25% of branches are mispredicted.
  branch time = 75%*(1 clock) + 25%*(2 clocks) = 1.25
102 Performance, page 540
Also known as the instruction latency within a pipeline; compare against pipeline throughput.

                Single-Cycle    Multi-Cycle    Pipeline
  Cycles_i                                                        mix_i
  loads         1               5              1.5 (50% dependency)  23%
  stores        1               4              1                     13%
  arithmetic    1               4              1                     43%
  branches      1               3              1.25 (25% mispredict) 19%
  jumps         1               3              2                      2%
  Clock speed   125 MHz (8 ns)  500 MHz (2 ns) 500 MHz (2 ns)
  MIPS = Clock/CPI: 125 MIPS    125 MIPS       424 MIPS

CPI = Σ Cycles_i * mix_i
load instruction time = 50%*(1 clock) + 50%*(2 clocks) = 1.5
branch time = 75%*(1 clock) + 25%*(2 clocks) = 1.25
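The pipeline CPI and MIPS figures follow mechanically from the instruction mix and per-class cycle counts on this slide. A quick check (the mix percentages and cycle counts are taken from the slide; the 500 MHz clock gives roughly the slide's 424 MIPS after rounding):

```python
# Instruction mix and effective pipeline cycles per class (from the slide).
mix         = {"loads": 0.23, "stores": 0.13, "arith": 0.43, "branches": 0.19, "jumps": 0.02}
pipe_cycles = {"loads": 1.5,  "stores": 1,    "arith": 1,    "branches": 1.25, "jumps": 2}

# CPI = sum over classes of (cycles * frequency)
cpi = sum(mix[k] * pipe_cycles[k] for k in mix)          # 1.1825
mips = 500 / cpi                                         # 500 MHz clock -> ~423 MIPS
```

The same formula with the multi-cycle counts (5, 4, 4, 3, 3) gives a CPI of about 4.0, i.e. roughly 125 MIPS at 500 MHz, matching the middle column.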
103 Branch prediction
Datapath parallelism is only useful if you can keep it fed. It is easy to fetch multiple (consecutive) instructions per cycle — essentially speculating on sequential flow.
Jump: unconditional change of control flow — always taken.
Branch: conditional change of control flow — taken about 50% of the time.
  Backward (about 30% of branches): roughly 80% taken
  Forward (about 70% of branches): roughly 40% taken
104 A Big Idea for Today
Reactive: past actions cause the system to adapt its use — do what you did before, better. Ex: caches, TCP windows, URL completion, ...
Proactive: use past actions to predict future actions — optimize speculatively, anticipate what you are about to do. Ex: branch prediction, long cache blocks, ???
105 Case for Branch Prediction when Issuing N Instructions per Clock Cycle
1. Branches will arrive up to n times faster in an n-issue processor.
2. Amdahl's Law => the relative impact of control stalls is larger with the lower potential CPI of an n-issue processor.
Conversely, you need branch prediction to see the potential parallelism.
106 Branch Prediction Schemes
1. Static Branch Prediction
2. 1-bit Branch-Prediction Buffer
3. 2-bit Branch-Prediction Buffer
4. Correlating Branch Prediction Buffer
5. Tournament Branch Predictor
6. Branch Target Buffer
7. Integrated Instruction Fetch Units
8. Return Address Predictors
107 Dynamic Branch Prediction
Performance = ƒ(accuracy, cost of misprediction)
Branch History Table: the lower bits of the PC address index a table of 1-bit values
  Says whether or not the branch was taken last time
  No address check — saves hardware, but it may not be the right branch
  If inst == BR, update the table with the outcome
Problem: in a loop, a 1-bit BHT causes 2 mispredictions per loop:
  At the end of the loop, when it exits instead of looping as before
  The first time through the loop on the next pass through the code, when it predicts exit instead of looping
  With an average of 9 iterations before exit, that is only 80% accuracy even if the branch loops 90% of the time
Local history: this particular branch instruction — or one that maps into the same slot of the table
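The 2-mispredictions-per-loop behavior is easy to reproduce. A minimal sketch of a single 1-bit predictor entry, run over a loop branch that is taken 9 times and then falls through (9 + 1 = 10 outcomes per loop, as in the slide's example):

```python
def run_1bit(outcomes):
    """Simulate one 1-bit BHT entry: predict whatever happened last time."""
    pred, correct = False, 0          # entry starts out predicting not-taken
    for taken in outcomes:
        correct += (pred == taken)
        pred = taken                  # 1 bit: remember only the last outcome
    return correct / len(outcomes)

# A loop branch: taken 9 times, not taken on exit, repeated 100 times.
loop = ([True] * 9 + [False]) * 100
# 2 misses per 10 outcomes (the exit, and the re-entry) -> 80% accuracy,
# even though the branch is taken 90% of the time.
assert run_1bit(loop) == 0.8
```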
108 2-bit Dynamic Branch Prediction
2-bit scheme: change the prediction only after getting a misprediction twice.
[State diagram: strongly/weakly Predict Taken and strongly/weakly Predict Not Taken states, with T/NT transitions between them. Red: stop, not taken; green: go, taken.]
Adds hysteresis to the decision-making process.
Generalizes to an n-bit saturating counter.
109 Consider 3 Scenarios
  A branch for a loop test
  A check for an error or exception — almost never taken; simple predictors do well
  An alternating taken / not-taken pattern — needs global history to predict
Your worst-case prediction scenario:
  How could hardware predict "this loop will execute 3 times" using a simple mechanism?
110 Correlating Branches
Idea: the taken/not-taken behavior of recently executed branches is related to the behavior of the next branch (as well as to that branch's own history).
Then the behavior of recent branches selects between, say, 4 predictions for the next branch, updating just that prediction.
(2,2) predictor: 2 bits of global history, 2-bit local counters. The branch address (4 bits) selects a row of per-branch 2-bit predictors; the 2-bit recent global branch history (e.g. 01 = not taken, then taken) selects the column.
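The (2,2) organization described above can be sketched compactly. This is an illustrative model, not any particular machine's implementation; the 16-row table size and weakly-taken initialization are assumptions. The test case is the alternating branch a plain 2-bit counter cannot learn:

```python
class Correlating22:
    """(2,2) predictor sketch: 4 low address bits select a row; the 2-bit
    global history (last two outcomes) selects one of 4 two-bit saturating
    counters in that row."""
    def __init__(self):
        self.table = [[2] * 4 for _ in range(16)]   # all weakly taken
        self.ghist = 0                              # 2-bit global history
    def predict(self, pc):
        return self.table[pc & 0xF][self.ghist] >= 2
    def update(self, pc, taken):
        row = self.table[pc & 0xF]
        row[self.ghist] = min(3, row[self.ghist] + 1) if taken else max(0, row[self.ghist] - 1)
        self.ghist = ((self.ghist << 1) | int(taken)) & 0b11

# A perfectly alternating branch: each history value sees a constant outcome,
# so separate counters learn it after a brief warm-up.
p = Correlating22()
misses = 0
for taken in [True, False] * 25:
    misses += (p.predict(0x40) != taken)
    p.update(0x40, taken)
assert misses == 1   # one warm-up miss, then the alternation is predicted perfectly
```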
111 Accuracy of Different Schemes
[Bar chart: frequency of mispredictions on SPEC89 (nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, li) for three predictors: a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT. Misprediction rates range from about 0-1% on the floating-point codes up to roughly 18% at worst; the 1024-entry (2,2) predictor mispredicts least despite being the smallest.]
What's missing in this picture?
112 Re-evaluating Correlation
Several of the SPEC benchmarks have fewer than a dozen branches responsible for 90% of taken branches:
  program    branch %   (the slide also listed the static branch count covering 90%)
  compress   14%
  eqntott    25%
  gcc        15%
  mpeg       10%
  real gcc   13%
Real programs + OS behave more like gcc.
Small benefits beyond benchmarks for correlation? Problems with branch aliases?
113 BHT Accuracy
Mispredictions occur because either:
  The guess was wrong for that branch, or
  The table lookup returned the branch history of the wrong branch (aliasing in the index)
With a 4096-entry table, programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%.
For SPEC92, 4096 entries is about as good as an infinite table.
114 Dynamically finding structure in spaghetti?
115 Tournament Predictors
Motivation for correlating branch predictors: the 2-bit predictor failed on important branches; adding global information improved performance.
Tournament predictors: use 2 predictors, one based on global information and one based on local information, and combine them with a selector.
Use whichever predictor tends to guess correctly.
[Figure: the branch address and history feed Predictor A and Predictor B; a selector chooses between their predictions.]
116 Tournament Predictor in Alpha 21264
4K 2-bit counters choose between a global predictor and a local predictor.
Global predictor: also has 4K entries, indexed by the history of the last 12 branches; each entry is a standard 2-bit predictor.
  12-bit pattern: ith bit = 0 => ith prior branch not taken; ith bit = 1 => ith prior branch taken.
Local predictor consists of a 2-level predictor:
  Top level: a local history table of 1024 10-bit entries; each 10-bit entry records the 10 most recent branch outcomes mapping to that entry. A 10-bit history allows patterns of 10 branches to be discovered and predicted.
  Next level: the selected entry from the local history table indexes a table of 1K entries of 3-bit saturating counters, which provide the local prediction.
Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits! (~180,000 transistors)
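The 29 Kbit total is just the sum of the four structures itemized above, which is worth checking:

```python
# Storage in the 21264 tournament predictor, as itemized on the slide
# (1K = 1024 entries throughout).
choice_bits  = 4 * 1024 * 2    # 4K 2-bit selector counters
global_bits  = 4 * 1024 * 2    # 4K 2-bit global predictors
lhist_bits   = 1 * 1024 * 10   # 1K 10-bit local history entries
local_bits   = 1 * 1024 * 3    # 1K 3-bit local saturating counters

total = choice_bits + global_bits + lhist_bits + local_bits
assert total == 29 * 1024      # 29 Kbits exactly
```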
117 % of predictions from the local predictor in the Tournament Prediction Scheme
[Bar chart, SPEC89: nasa7 37%, matrix300 55%, tomcatv 76%, doduc 72%, spice 63%, fpppp 69%, gcc 98%, espresso 100%, eqntott 94%, li 90%.]
118 Accuracy of Branch Prediction (fig 3.40)
[Bar chart: branch prediction accuracy for profile-based, 2-bit counter, and tournament predictors on tomcatv, doduc, fpppp, li, espresso, and gcc. The tournament predictor is the most accurate throughout (roughly 94-99%); the profile-based approach is the weakest, dropping to around 70% on gcc.]
119 Accuracy v. Size (SPEC89)
[Line chart: misprediction rate (0-10%) versus total predictor size (Kbits) for local, correlating, and tournament predictors. The tournament predictor is the most accurate at every size; all curves flatten as size grows.]
120 GSHARE
A good simple predictor: index the prediction table with (branch address XOR n-bit global history). This sprays the predictions for a given address across a large table, one entry per distinct recent history.
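The gshare index computation is one XOR. A minimal sketch (the `>> 2` to drop byte-offset bits and the particular addresses are illustrative assumptions):

```python
def gshare_index(pc, ghist, n):
    """Index into a 2**n-entry counter table: the low n bits of the branch
    address XORed with the n-bit global history, so one static branch is
    spread across different entries for different recent histories."""
    mask = (1 << n) - 1
    return ((pc >> 2) & mask) ^ (ghist & mask)   # >> 2 drops byte-offset bits

# The same branch under two different histories lands in two different entries.
assert gshare_index(0x400890, 0b1010, 4) != gshare_index(0x400890, 0b0110, 4)
```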
121 Need Address At Same Time as Prediction
Branch Target Buffer (BTB): the address of the branch indexes the table to get the prediction AND the branch target address (if taken).
Note: must check for a branch match now, since we can't use the wrong branch's address. (Figures 3.19, 3.20)
[Figure: the fetch PC is compared against stored branch PCs. No match: branch not predicted, proceed normally (next PC = PC+4). Match (=?): the instruction is a branch; use the predicted PC as the next PC, guided by extra prediction state bits.]
122 Predicated Execution
Avoid branch prediction by turning branches into conditionally executed instructions:
  if (x) then A = B op C else NOP
  If false, neither store the result nor cause an exception
  Expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have a conditional move; PA-RISC can annul any following instruction
  IA-64: 64 1-bit condition fields selected, so conditional execution of any instruction
  This transformation is called if-conversion
Drawbacks of conditional instructions:
  Still takes a clock cycle even if annulled
  Stall if the condition is evaluated late
  Complex conditions reduce effectiveness; the condition becomes known late in the pipeline
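If-conversion turns the control dependence into a data dependence. Python has no conditional-move instruction, so this is only a sketch of the transformation's semantics — both forms compute the same result, but the second computes `B op C` unconditionally and then selects, which is what a predicated or cmov-style instruction does in hardware:

```python
def with_branch(x, a, b, c):
    if x:
        a = b + c              # stand-in for "A = B op C"
    return a

def if_converted(x, a, b, c):
    t = b + c                  # executed unconditionally (may be annulled)
    return t if x else a       # a conditional select replaces the branch

assert with_branch(True, 0, 2, 3) == if_converted(True, 0, 2, 3) == 5
assert with_branch(False, 9, 2, 3) == if_converted(False, 9, 2, 3) == 9
```

Note the drawback listed above is visible here: `b + c` is computed even when `x` is false.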
123 Special Case: Return Addresses
Register-indirect branches are hard to predict: the target address varies.
In SPEC89, 85% of such branches are procedure returns.
Since procedure calls and returns follow a stack discipline, save the return address in a small buffer that acts like a stack: 8 to 16 entries has a small miss rate.
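The return address stack described above can be sketched in a few lines. This is an illustrative model (the circular-buffer overflow policy and 16-entry default are assumptions; real designs differ in how they recover from mispredicted pushes):

```python
class ReturnAddressStack:
    """Small circular stack: push the return address at each call, pop a
    predicted target at each return. On overflow the oldest entries are
    silently overwritten, which is why 8-16 entries already miss rarely."""
    def __init__(self, size=16):
        self.buf, self.top, self.size = [0] * size, 0, size
    def push(self, ret_addr):
        self.buf[self.top % self.size] = ret_addr
        self.top += 1
    def predict_return(self):
        self.top -= 1
        return self.buf[self.top % self.size]

ras = ReturnAddressStack(8)
ras.push(0x1004)                        # outer call site
ras.push(0x2008)                        # nested call site
assert ras.predict_return() == 0x2008   # inner return is predicted first
assert ras.predict_return() == 0x1004
```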
124 Pitfall: Sometimes bigger and dumber is better
The 21264 uses a tournament predictor (29 Kbits); the earlier 21064 uses a simple 2-bit predictor with 2K entries (a total of 4 Kbits).
On the SPEC95 benchmarks, the 21264 outperforms:
  21264 avg. 11.5 mispredictions per 1000 instructions
  21064 avg. 16.5 mispredictions per 1000 instructions
Reversed for transaction processing (TP)!
  21264 avg. 17 mispredictions per 1000 instructions
  21064 avg. 15 mispredictions per 1000 instructions
TP code is much larger, and the 21064 holds 2X as many branch predictions based on local behavior (2K entries vs. the 1K local predictor in the 21264).
125 A zero-cycle jump
What really has to be done at runtime?
Once an instruction has been detected as a jump or JAL, we might recode it in the internal cache — a very limited form of dynamic compilation?
  Program:                 Internal cache state:       next
  A:    and  r3,r1,r5      and  r3,r1,r5       N       A+4
  A+4:  addi r2,r3,#4      addi r2,r3,#4       N       A+8
  A+8:  sub  r4,r2,r1      sub  r4,r2,r1       L       doit
        jal  doit
        subi r1,r1,#1      subi r1,r1,#1       N       A+20
Use of a pre-decoded instruction cache.
Called "branch folding" in the Bell Labs CRISP processor. The original CRISP cache had two addresses and could thus fold a complete branch into the previous instruction.
Notice that JAL introduces a structural hazard on the register write.
126 The internal cache can reflect PREDICTIONS and remove delay slots
  Program:                 Internal cache state:       next
  A:    and  r3,r1,r5      and  r3,r1,r5       N       A+4
        addi r2,r3,#4      addi r2,r3,#4       N       A+8
        sub  r4,r2,r1      sub  r4,r2,r1       N       A+12
        bne  r4,loop       bne  loop           N       loop
  A+16: subi r1,r1,#1      subi r1,r1,#1       N       A+20
This causes the next instruction to be fetched immediately from the branch destination (predict taken).
If the branch ends up not being taken, squash the destination instruction and restart the pipeline at address A+16.
127 Dynamic Branch Prediction Summary
Prediction is becoming an important part of scalar execution.
Branch History Table: 2 bits for loop accuracy.
Correlation: recently executed branches correlate with the next branch — either different branches, or different executions of the same branch.
Tournament predictor: spend more resources on competing solutions and pick between them.
Branch Target Buffer: include the branch target address & prediction.
Predicated execution can reduce the number of branches and the number of mispredicted branches.
Return address stack for prediction of indirect jumps.
128 Exceptions and Interrupts (Hardware)
129 Example: Device Interrupt (say, arrival of a network message)
[Figure: an external interrupt arrives in the middle of user code (the "Hiccup"). The PC is saved, all interrupts are disabled, and the processor enters supervisor mode; after the handler, the PC is restored and user mode resumes.]
  User code:              Interrupt handler:
  add  r1,r2,r3           Raise priority
  subi r4,r1,#4           Re-enable all interrupts
  slli r4,r4,#2           Save registers
  Hiccup(!)               lw   r1,20(r0)
  lw   r2,0(r4)           lw   r2,0(r1)
  lw   r3,4(r4)           addi r3,r1,#5
  add  r2,r2,r3           sw   0(r1),r3
  sw   8(r4),r2           Restore registers
                          Clear current interrupt
                          Disable all interrupts
                          Restore priority
                          RTE
130 Alternative: Polling (again, for arrival of a network message)
[Figure: with the network interrupt disabled, the main loop reaches a polling point, checks a device register, and falls into the handler only if a message is pending.]
            Disable network interrupt
            ...
            subi r4,r1,#4
            slli r4,r4,#2
            lw   r2,0(r4)
            lw   r3,4(r4)
            add  r2,r2,r3
            sw   8(r4),r2
            lw   r1,20(r0)      # polling point (check device register)
            beq  r1,no_mess
            lw   r1,20(r0)      # handler
            lw   r2,0(r1)
            addi r3,r1,#5
            sw   0(r1),r3
            Clear network interrupt
  no_mess:  ...
131 Polling is faster/slower than interrupts
Polling is faster than interrupts because:
  The compiler knows which registers are in use at the polling point. Hence it does not need to save and restore registers (or not as many).
  Other interrupt overhead is avoided (pipeline flush, trap priorities, etc.).
Polling is slower than interrupts because:
  The overhead of the polling instructions is incurred regardless of whether or not the handler is run. This can add to inner-loop delay.
  The device may have to wait a long time for service.
When to use one or the other? A multi-axis tradeoff:
  Frequent/regular events are good for polling, as long as the device can be controlled at user level.
  Interrupts are good for infrequent/irregular events.
  Interrupts are good for ensuring regular/predictable service of events.
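The tradeoff above can be made concrete with a toy cost model. All the numbers and parameter names here are made up for illustration: suppose a poll costs P cycles whether or not an event is pending, and an interrupt costs I cycles of overhead per event.

```python
def polling_cheaper(poll_cost, intr_cost, polls_per_sec, events_per_sec):
    """Compare total overhead per second: polling pays poll_cost on every
    poll; interrupts pay intr_cost only when an event actually occurs."""
    return poll_cost * polls_per_sec < intr_cost * events_per_sec

# Frequent, regular events favor polling...
assert polling_cheaper(poll_cost=20, intr_cost=1000,
                       polls_per_sec=10_000, events_per_sec=9_000)
# ...while rare events favor interrupts.
assert not polling_cheaper(poll_cost=20, intr_cost=1000,
                           polls_per_sec=10_000, events_per_sec=10)
```

The model ignores the latency axis — a polled device waits until the next polling point, which is the other half of the tradeoff on the slide.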
132 Exception/Interrupt classifications
Exceptions: relevant to the current process
  Faults, arithmetic traps, and synchronous traps
  Invoke software on behalf of the currently executing process
Interrupts: caused by asynchronous, outside events
  I/O devices requiring service (disk, network)
  Clock interrupts (real-time scheduling)
Machine checks: caused by serious hardware failure
  Not always restartable
  Indicate that bad things have happened:
    Non-recoverable ECC error
    Machine-room fire
    Power outage
133 A related classification: Synchronous vs. Asynchronous
Synchronous: related to the instruction stream, i.e. arising during the execution of an instruction
  Must stop the instruction that is currently executing
  Page fault on a load or store instruction
  Arithmetic exception
  Software traps
Asynchronous: unrelated to the instruction stream, i.e. caused by an outside event
  Does not have to disrupt instructions that are already executing
  Interrupts are asynchronous
  Machine checks are asynchronous
Semi-synchronous (or high-availability interrupts):
  Caused by an external event but may have to disrupt current instructions in order to guarantee service
134 Interrupt Priorities Must Be Handled
[Figure: the same user-code/handler sequence as before, now labeled "Network Interrupt"; the network handler itself could be interrupted by a disk interrupt.]
Note that the priority must be raised on handler entry to avoid recursive interrupts!
135 Interrupt controller hardware and mask levels
The operating system constructs a hierarchy of masks that reflects some form of interrupt priority. For instance:
  Priority   Examples
  lowest     Software interrupts
  2          Network interrupts
  4          Sound card
  5          Disk interrupt
  6          Real-time clock
  highest    Non-maskable interrupts (power)
This reflects an order of urgency among interrupts. For instance, this ordering says that disk events can interrupt the handlers for network interrupts.
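The mask hierarchy can be sketched as a simple level comparison. The example levels mirror the slide; the numeric values for the software-interrupt and NMI levels are assumptions, since they were lost in the source:

```python
# Assumed level numbers for the slide's examples (0 and 7 are placeholders).
PRIORITY = {"software": 0, "network": 2, "sound": 4,
            "disk": 5, "clock": 6, "nmi": 7}

def preempts(incoming, current_level):
    """An interrupt is delivered only if its priority exceeds the level the
    processor currently runs at - the level a handler raises on entry, which
    is what prevents recursive interrupts from the same source."""
    return PRIORITY[incoming] > current_level

assert preempts("disk", PRIORITY["network"])      # disk interrupts the network handler
assert not preempts("network", PRIORITY["disk"])  # but not the other way around
assert not preempts("network", PRIORITY["network"])  # and never itself
```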
136 Can we have fast interrupts?
[Figure: "Fine-Grain Interrupt" — the same user-code/handler sequence, now annotated with the costs: the pipeline drain can be very expensive; priority manipulations; register save/restore — 128 registers + cache misses + etc. The handler could still be interrupted by a disk interrupt.]
137 SPARC (and RISC I) had register windows
On an interrupt or procedure call, simply switch to a different set of registers.
Really saves on interrupt overhead:
  Interrupts can happen at any point in execution, so the compiler cannot help with knowledge of live registers.
  Conservative handlers must save all registers.
  Short handlers might be able to save only a few, but this analysis is complicated.
Not as big a deal for procedure calls:
  The original statement by Patterson was that Berkeley didn't have a compiler team, so they used a hardware solution.
  Good compilers can allocate registers across procedure boundaries.
  Good compilers know which registers are live at any one time.
However, register windows have returned!
  IA-64 has them.
  Many other processors have shadow registers for interrupts.
138 Supervisor State
Typically, processors have some amount of state that user programs are not allowed to touch.
  Page mapping hardware/TLB
    The TLB prevents one user from accessing the memory of another
    TLB protection prevents a user from modifying the mappings
  Interrupt controllers — user code is prevented from crashing the machine by disabling interrupts, ignoring device interrupts, etc.
    Real-time clock interrupts ensure that users cannot lock up/crash the machine even if they run code that goes into a loop:
      Preemptive multitasking vs. non-preemptive multitasking
  Access to hardware devices is restricted
    Prevents a malicious user from stealing network packets
    Prevents a user from writing over disk blocks
The distinction is made with at least two levels: USER/SYSTEM (one hardware mode bit).
x86 architectures actually provide 4 different levels; only two are usually used by the OS (or only one in older Microsoft OSes).
More informationT = I x CPI x C. Both effective CPI and clock cycle C are heavily influenced by CPU design. CPI increased (3-5) bad Shorter cycle good
CPU performance equation: T = I x CPI x C Both effective CPI and clock cycle C are heavily influenced by CPU design. For single-cycle CPU: CPI = 1 good Long cycle time bad On the other hand, for multi-cycle
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle
More informationLecture 6: Pipelining
Lecture 6: Pipelining i CSCE 26 Computer Organization Instructor: Saraju P. ohanty, Ph. D. NOTE: The figures, text etc included in slides are borrowed from various books, websites, authors pages, and other
More informationPipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University
Pipeline Hazards Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Hazards What are hazards? Situations that prevent starting the next instruction
More informationLecture 9: Microcontrolled Multi-Cycle Implementations
8-447 Lectre 9: icroled lti-cycle Implementations James C. Hoe Dept of ECE, CU Febrary 8, 29 S 9 L9- Annoncements: P&H Appendi D Get started t on Lab Handots: Handot #8: Project (on Blackboard) Single-Cycle
More informationModern Computer Architecture
Modern Computer Architecture Lecture2 Pipelining: Basic and Intermediate Concepts Hongbin Sun 国家集成电路人才培养基地 Xi an Jiaotong University Pipelining: Its Natural! Laundry Example Ann, Brian, Cathy, Dave each
More informationWhat do we have so far? Multi-Cycle Datapath (Textbook Version)
What do we have so far? ulti-cycle Datapath (Textbook Version) CPI: R-Type = 4, Load = 5, Store 4, Branch = 3 Only one instruction being processed in datapath How to lower CPI further? #1 Lec # 8 Summer2001
More informationChapter 4. The Processor
Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationImprove performance by increasing instruction throughput
Improve performance by increasing instruction throughput Program execution order Time (in instructions) lw $1, 100($0) fetch 2 4 6 8 10 12 14 16 18 ALU Data access lw $2, 200($0) 8ns fetch ALU Data access
More information4.13 Advanced Topic: An Introduction to Digital Design Using a Hardware Design Language 345.e1
.3 Advanced Topic: An Introdction to Digital Design Using a Hardware Design Langage 35.e.3 Advanced Topic: An Introdction to Digital Design Using a Hardware Design Langage to Describe and odel a Pipeline
More informationHardware Design Tips. Outline
Hardware Design Tips EE 36 University of Hawaii EE 36 Fall 23 University of Hawaii Otline Verilog: some sbleties Simlators Test Benching Implementing the IPS Actally a simplified 6 bit version EE 36 Fall
More informationPipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.
Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =2n/05n+15 2n/0.5n 1.5 4 = number of stages 4.5 An Overview
More informationDynamic Control Hazard Avoidance
Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>
More informationCOMPUTER ORGANIZATION AND DESIGN
COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count CPI and Cycle time Determined
More informationCENG 3420 Lecture 06: Pipeline
CENG 3420 Lecture 06: Pipeline Bei Yu byu@cse.cuhk.edu.hk CENG3420 L06.1 Spring 2019 Outline q Pipeline Motivations q Pipeline Hazards q Exceptions q Background: Flip-Flop Control Signals CENG3420 L06.2
More informationAdvanced Computer Architecture Pipelining
Advanced Computer Architecture Pipelining Dr. Shadrokh Samavi Some slides are from the instructors resources which accompany the 6 th and previous editions of the textbook. Some slides are from David Patterson,
More informationLECTURE 3: THE PROCESSOR
LECTURE 3: THE PROCESSOR Abridged version of Patterson & Hennessy (2013):Ch.4 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU
More informationChapter 4. The Processor
Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationLecture 9. Pipeline Hazards. Christos Kozyrakis Stanford University
Lecture 9 Pipeline Hazards Christos Kozyrakis Stanford University http://eeclass.stanford.edu/ee18b 1 Announcements PA-1 is due today Electronic submission Lab2 is due on Tuesday 2/13 th Quiz1 grades will
More informationCS 110 Computer Architecture. Pipelining. Guest Lecture: Shu Yin. School of Information Science and Technology SIST
CS 110 Computer Architecture Pipelining Guest Lecture: Shu Yin http://shtech.org/courses/ca/ School of Information Science and Technology SIST ShanghaiTech University Slides based on UC Berkley's CS61C
More informationCS 251, Spring 2018, Assignment 3.0 3% of course mark
CS 25, Spring 28, Assignment 3. 3% of corse mark De onday, Jne 25th, 5:3 P. (5 points) Consider the single-cycle compter shown on page 6 of this assignment. Sppose the circit elements take the following
More informationMIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14
MIPS Pipelining Computer Organization Architectures for Embedded Computing Wednesday 8 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy 4th Edition, 2011, MK
More informationEECS 252 Graduate Computer Architecture Lecture 4. Control Flow (continued) Interrupts. Review: Ordering Properties of basic inst.
EECS 252 Graduate Computer Architecture Lecture 4 Control Flow (continued) Interrupts John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~kubitron/cs252
More informationPage 1. Today s Big Idea. Lecture 18: Branch Prediction + analysis resources => ILP
CS252 Graduate Computer Architecture Lecture 18: Branch Prediction + analysis resources => ILP April 2, 2 Prof. David E. Culler Computer Science 252 Spring 2 Today s Big Idea Reactive: past actions cause
More informationComputer Architecture. Lecture 6.1: Fundamentals of
CS3350B Computer Architecture Winter 2015 Lecture 6.1: Fundamentals of Instructional Level Parallelism Marc Moreno Maza www.csd.uwo.ca/courses/cs3350b [Adapted from lectures on Computer Organization and
More informationCSE Introduction to Computer Architecture Chapter 5 The Processor: Datapath & Control
CSE-45432 Introdction to Compter Architectre Chapter 5 The Processor: Datapath & Control Dr. Izadi Data Processor Register # PC Address Registers ALU memory Register # Register # Address Data memory Data
More informationFull Datapath. Chapter 4 The Processor 2
Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory
More informationCS 251, Winter 2018, Assignment % of course mark
CS 25, Winter 28, Assignment 3.. 3% of corse mark De onday, Febrary 26th, 4:3 P Lates accepted ntil : A, Febrary 27th with a 5% penalty. IEEE 754 Floating Point ( points): (a) (4 points) Complete the following
More informationPipelining. Maurizio Palesi
* Pipelining * Adapted from David A. Patterson s CS252 lecture slides, http://www.cs.berkeley/~pattrsn/252s98/index.html Copyright 1998 UCB 1 References John L. Hennessy and David A. Patterson, Computer
More informationCSCI 402: Computer Architectures. Fengguang Song Department of Computer & Information Science IUPUI. Today s Content
3/6/8 CSCI 42: Computer Architectures The Processor (2) Fengguang Song Department of Computer & Information Science IUPUI Today s Content We have looked at how to design a Data Path. 4.4, 4.5 We will design
More informationReview Tomasulo. Lecture 17: ILP and Dynamic Execution #2: Branch Prediction, Multiple Issue. Tomasulo Algorithm and Branch Prediction
CS252 Graduate Computer Architecture Lecture 17: ILP and Dynamic Execution #2: Branch Prediction, Multiple Issue March 23, 01 Prof. David A. Patterson Computer Science 252 Spring 01 Review Tomasulo Reservations
More informationPipelining: Basic Concepts
Pipelining: Basic Concepts Prof. Cristina Silvano Dipartimento di Elettronica e Informazione Politecnico di ilano email: silvano@elet.polimi.it Outline Reduced Instruction Set of IPS Processor Implementation
More informationECE473 Computer Architecture and Organization. Pipeline: Control Hazard
Computer Architecture and Organization Pipeline: Control Hazard Lecturer: Prof. Yifeng Zhu Fall, 2015 Portions of these slides are derived from: Dave Patterson UCB Lec 15.1 Pipelining Outline Introduction
More informationChapter 4 The Processor 1. Chapter 4B. The Processor
Chapter 4 The Processor 1 Chapter 4B The Processor Chapter 4 The Processor 2 Control Hazards Branch determines flow of control Fetching next instruction depends on branch outcome Pipeline can t always
More informationChapter 4 The Processor 1. Chapter 4A. The Processor
Chapter 4 The Processor 1 Chapter 4A The Processor Chapter 4 The Processor 2 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware
More informationThomas Polzer Institut für Technische Informatik
Thomas Polzer tpolzer@ecs.tuwien.ac.at Institut für Technische Informatik Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =
More informationEI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)
EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building
More informationComputer Architecture
Lecture 3: Pipelining Iakovos Mavroidis Computer Science Department University of Crete 1 Previous Lecture Measurements and metrics : Performance, Cost, Dependability, Power Guidelines and principles in
More informationLecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S
Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest
More informationLecture 8: Data Hazard and Resolution. James C. Hoe Department of ECE Carnegie Mellon University
18 447 Lecture 8: Data Hazard and Resolution James C. Hoe Department of ECE Carnegie ellon University 18 447 S18 L08 S1, James C. Hoe, CU/ECE/CALC, 2018 Your goal today Housekeeping detect and resolve
More informationEITF20: Computer Architecture Part2.2.1: Pipeline-1
EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle
More informationInstruction Fetch and Branch Prediction. CprE 581 Computer Systems Architecture Readings: Textbook (4 th ed 2.3, 2.9); (5 th ed 3.
Instruction Fetch and Branch Prediction CprE 581 Computer Systems Architecture Readings: Textbook (4 th ed 2.3, 2.9); (5 th ed 3.3) 1 Frontend and Backend Feedback: - Prediction correct or not, update
More informationLecture 7 Pipelining. Peng Liu.
Lecture 7 Pipelining Peng Liu liupeng@zju.edu.cn 1 Review: The Single Cycle Processor 2 Review: Given Datapath,RTL -> Control Instruction Inst Memory Adr Op Fun Rt
More information