CENG 3420 Lecture 07: Pipeline

1 CENG 3420 Lecture 07: Pipeline. Bei Yu. CENG3420 L07.1 Spring 2017

2 Outline
- Review: Flip-Flop Control Signals
- Pipeline Motivations
- Pipeline Hazards
- Exceptions

4 Clocking Methodologies
- A clocking methodology defines when signals can be read and when they can be written
  [Figure: clock waveform labeling the falling (negative) edge, the rising (positive) edge, and one clock cycle]
  - clock rate = 1/(clock cycle), e.g., a 10 ns clock cycle = 100 MHz clock rate; a 1 ns clock cycle = 1 GHz clock rate
- State element design choices
  - level-sensitive latch
  - master-slave and edge-triggered flip-flops

5 Review: Latches vs Flip-Flops
- Output is equal to the stored value inside the element
- Change of state (value) is based on the clock
  - Latches: output changes whenever the inputs change and the clock is asserted (level-sensitive methodology)
    - Two-sided timing constraint
  - Flip-flop: output changes only on a clock edge (edge-triggered methodology)
    - One-sided timing constraint
- A clocking methodology defines when signals can be read and written: we would NOT want to read a signal at the same time it is being written

6 Review: Design a Latch
- Store one bit of information: cross-coupled inverters
- How to change the value stored?
  - R: reset signal
  - S: set signal
  - SR latch; other latch structures
  [Figure: cross-coupled inverters and the equivalent SR latch]

7 Review: Design a Flip-Flop
- Based on the gated latch
- Master-slave positive-edge-triggered D flip-flop
  [Figure: gated latch and master-slave D flip-flop circuits]

8 Review: Latch and Flip-Flop
- A latch is level-sensitive
- A flip-flop is edge-triggered

9 Our Implementation
- An edge-triggered methodology
- Typical execution
  - read contents of some state elements
  - send values through some combinational logic
  - write results to one or more state elements
  [Figure: State element 1 -> Combinational logic -> State element 2, both clocked, spanning one clock cycle]
- Assumes state elements are written on every clock cycle; if not, an explicit write control signal is needed
  - the write occurs only when both the write control is asserted and the clock edge occurs

10 Outline
- Review: Flip-Flop Control Signals
- Pipeline Motivations
- Pipeline Hazards
- Exceptions

11 Review: Instruction Critical Paths
- Calculate cycle time assuming negligible delays (for muxes, control unit, sign extend, PC access, shift left 2, wires) except:
  - Instruction and Data Memory (4 ns)
  - ALU and adders (2 ns)
  - Register File access (reads or writes) (1 ns)

Inst.  | I Mem | Reg Rd | ALU Op | D Mem | Reg Wr | Total
R-type |       |        |        |       |        |
load   |       |        |        |       |        |
store  |       |        |        |       |        |
beq    |       |        |        |       |        |
jump   |       |        |        |       |        |
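
The table can be filled in mechanically once we decide which units each instruction class passes through. The sketch below uses the unit delays from the slide; the `UNITS_USED` mapping is an assumption (the usual single-cycle MIPS datapath usage), not something stated on the slide.

```python
# Unit delays in ns, taken from the slide.
DELAY_NS = {"I Mem": 4, "Reg Rd": 1, "ALU Op": 2, "D Mem": 4, "Reg Wr": 1}

# Assumption: units on each instruction class's critical path
# (conventional single-cycle MIPS datapath; not given on the slide).
UNITS_USED = {
    "R-type": ["I Mem", "Reg Rd", "ALU Op", "Reg Wr"],
    "load": ["I Mem", "Reg Rd", "ALU Op", "D Mem", "Reg Wr"],
    "store": ["I Mem", "Reg Rd", "ALU Op", "D Mem"],
    "beq": ["I Mem", "Reg Rd", "ALU Op"],
    "jump": ["I Mem"],
}


def critical_path_ns(inst):
    """Sum the delays of the units the instruction class uses."""
    return sum(DELAY_NS[unit] for unit in UNITS_USED[inst])


totals = {inst: critical_path_ns(inst) for inst in UNITS_USED}

# A single-cycle clock must accommodate the slowest instruction.
cycle_time_ns = max(totals.values())

for inst, total in totals.items():
    print(f"{inst:7s} {total:2d} ns")
print(f"single-cycle clock period: {cycle_time_ns} ns")
```

Under these assumptions the load sets the clock period, which is the inefficiency the next slide discusses.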

12 Review: Single Cycle Disadvantages & Advantages
- Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction
  - especially problematic for more complex instructions like floating-point multiply
  [Figure: clock timeline; lw occupies Cycle 1, sw finishes early in Cycle 2 and the remainder of that cycle is wasted]
- May be wasteful of area, since some functional units (e.g., adders) must be duplicated because they cannot be shared during a clock cycle
but
- It is simple and easy to understand

13 How Can We Make It Faster?
- Start fetching and executing the next instruction before the current one has completed
  - Pipelining: (all?) modern processors are pipelined for performance
  - Remember the performance equation: CPU time = CPI * CC * IC
- Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages
  - A five-stage pipeline is nearly five times faster because the CC is nearly five times faster
- Fetch (and execute) more than one instruction at a time
  - Superscalar processing: stay tuned

14 The Five Stages of Load Instruction
        Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
lw      IFetch   Dec      Exec     Mem      WB
- IFetch: Instruction Fetch and Update PC
- Dec: Registers Fetch and Instruction Decode
- Exec: Execute R-type; calculate memory address
- Mem: Read/write the data from/to the Data Memory
- WB: Write the result data into the register file

15 A Pipelined MIPS Processor
- Start the next instruction before the current one has completed
  - improves throughput: the total amount of work done in a given time
  - instruction latency (execution time, delay time, response time: the time from the start of an instruction to its completion) is not reduced

        Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8
lw      IFetch   Dec      Exec     Mem      WB
sw               IFetch   Dec      Exec     Mem      WB
R-type                    IFetch   Dec      Exec     Mem      WB

  - clock cycle (pipeline stage time) is limited by the slowest stage
  - some stages don't need the whole clock cycle (e.g., WB)
  - for some instructions, some stages are wasted cycles (i.e., nothing is done during that cycle for that instruction)
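
A small sketch of how such a diagram can be generated, assuming an ideal five-stage pipeline with no stalls, where instruction i enters IFetch in cycle i+1. The function name `pipeline_rows` is illustrative only.

```python
STAGES = ["IFetch", "Dec", "Exec", "Mem", "WB"]


def pipeline_rows(instructions):
    """Return (name, row) pairs; row[c] is the stage occupied in cycle c+1, or ''."""
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instructions):
        row = [""] * total_cycles
        # Instruction i starts one cycle after instruction i-1 (no stalls assumed).
        for s, stage in enumerate(STAGES):
            row[i + s] = stage
        rows.append((name, row))
    return rows


rows = pipeline_rows(["lw", "sw", "R-type"])
for name, row in rows:
    print(f"{name:7s}" + "".join(f"{cell:8s}" for cell in row))
```

Three instructions finish in 3 + 5 - 1 = 7 cycles here, even though each one still takes five cycles from start to completion.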

16 Single Cycle versus Pipeline
Single Cycle Implementation (CC = 800 ps):
  [Figure: lw occupies Cycle 1; sw finishes early in Cycle 2 and the remainder of that cycle is wasted]
Pipeline Implementation (CC = 200 ps):
  [Figure: lw, sw, and R-type each pass through IFetch, Dec, Exec, Mem, WB, starting one cycle apart; the figure also carries a 400 ps label]
- To complete an entire instruction in the pipelined case takes 1000 ps (as compared to 800 ps for the single cycle case). Why?
- How long does each take to complete 1,000,000 adds?
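
A worked answer to the two questions, as a sketch. The clock periods are from the slide; the five-stage depth and the absence of stalls between the adds are assumptions.

```python
SINGLE_CC_PS = 800   # single-cycle clock period (from the slide)
PIPE_CC_PS = 200     # pipelined clock period (from the slide)
NUM_STAGES = 5       # assumption: the five-stage pipeline used in this lecture


def single_cycle_time_ps(n):
    """One instruction per (long) clock cycle."""
    return n * SINGLE_CC_PS


def pipelined_time_ps(n):
    """Fill the pipeline once, then complete one instruction per cycle (no stalls assumed)."""
    return (NUM_STAGES + n - 1) * PIPE_CC_PS


# Latency of one instruction in the pipeline: every stage takes a full cycle.
latency_ps = NUM_STAGES * PIPE_CC_PS

n = 1_000_000
single_ps = single_cycle_time_ps(n)
piped_ps = pipelined_time_ps(n)
speedup = single_ps / piped_ps

print(f"pipelined latency per instruction: {latency_ps} ps")
print(f"single cycle: {single_ps} ps, pipelined: {piped_ps} ps, speedup: {speedup:.4f}")
```

The single instruction takes 1000 ps because all five stages are stretched to the 200 ps of the slowest stage. For a long run of instructions the fill time is negligible, so the speedup approaches 800/200 = 4 rather than 5.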

17 Pipelining the MIPS ISA
- What makes it easy
  - all instructions are the same length (32 bits)
    - can fetch in the 1st stage and decode in the 2nd stage
  - few instruction formats (three) with symmetry across formats
    - can begin reading the register file in the 2nd stage
  - memory operations occur only in loads and stores
    - can use the execute stage to calculate memory addresses
  - each instruction writes at most one result (i.e., changes the machine state) and does it in the last few pipeline stages (MEM or WB)
  - operands must be aligned in memory, so a single data transfer takes only one data memory access

18 MIPS Pipeline Datapath Additions/Mods
- State registers between each pipeline stage to isolate them
  [Figure: five-stage datapath (IF: IFetch, ID: Dec, EX: Execute, MEM: MemAccess, WB: WriteBack) separated by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB; components: PC, +4 adder, Instruction Memory, Register File, sign extend (16 to 32), shift left 2, branch adder, ALU, Data Memory, System Clock]

19 MIPS Pipeline Control Path Modifications
- All control signals can be determined during Decode and held in the state registers between pipeline stages
  [Figure: the pipelined datapath with a Control unit in ID feeding ID/EX, EX/MEM, and MEM/WB; control signals shown: PCSrc, RegWrite, ALUSrc, ALUOp, ALU cntl, Branch, MemRead, MemtoReg, RegDst]

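20 Pipeline Control
- IF Stage: read Inst Memory (always asserted) and write PC (on System Clock)
- ID Stage: no optional control signals to set

      | EX Stage                           | MEM Stage                  | WB Stage
Inst. | RegDst | ALUOp1 | ALUOp0 | ALUSrc | Brch | MemRead | MemWrite | RegWrite | MemtoReg
R     | 1      | 1      | 0      | 0      | 0    | 0       | 0        | 1        | 0
lw    | 0      | 0      | 0      | 1      | 0    | 1       | 0        | 1        | 1
sw    | X      | 0      | 0      | 1      | 0    | 0       | 1        | 0        | X
beq   | X      | 0      | 1      | 0      | 1    | 0       | 0        | 0        | X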

21 Graphically Representing MIPS Pipeline
- Can help with answering questions like:
  - How many cycles does it take to execute this code?
  - What is the ALU doing during cycle 4?
  - Is there a hazard, why does it occur, and how can it be fixed?

22 Other Pipeline Structures Are Possible
- What about the (slow) multiply operation?
  - Make the clock twice as slow, or
  - let it take two cycles (since it doesn't use the DM stage)
  [Figure: pipeline with a two-cycle MUL stage]
- What if the data memory access is twice as slow as the instruction memory?
  - make the clock twice as slow, or
  - let data memory access take two cycles (and keep the same clock rate)
  [Figure: pipeline stages IM, Reg, ALU, DM1, DM2, Reg]

23 Other Sample Pipeline Alternatives
- ARM7 (stages IM, Reg, EX)
  - IM: PC update, IM access
  - Reg: decode, reg access
  - EX: ALU op, DM access, shift/rotate, commit result (write back)
- XScale (stages IM1, IM2, Reg, SHFT, ALU, DM1, DM2/Reg)
  - activities in pipeline order: PC update, BTB access, start IM access; IM access; decode, reg 1 access; shift/rotate, reg 2 access; ALU op; start DM access, exception; DM write, reg write

24 Why Pipeline? For Performance!
[Figure: pipeline diagram with time (clock cycles) across and instruction order down, Inst 0 through Inst 4; the first cycles are marked as the time to fill the pipeline]
- Once the pipeline is full, one instruction is completed every cycle, so CPI = 1

25 Outline
- Review: Flip-Flop Control Signals
- Pipeline Motivations
- Pipeline Hazards
- Exceptions

26 Can Pipelining Get Us Into Trouble?
- Yes: Pipeline Hazards
  - structural hazards: a required resource is busy
  - data hazards: attempt to use data before it is ready
  - control hazards: deciding on control action depends on a previous instruction
- Can usually resolve hazards by waiting
  - pipeline control must detect the hazard
  - and take action to resolve hazards

27 Structural Hazards
- Conflict for use of a resource
- In a MIPS pipeline with a single memory
  - Load/store requires data access
  - Instruction fetch requires instruction access
- Hence, pipeline datapaths require separate instruction/data memories
  - Or separate instruction/data caches
- Since Register File

28 Resolve Structural Hazard 1
[Figure: pipeline diagram with time (clock cycles) across and instruction order down (lw, Inst 1 through Inst 4); in one cycle lw is reading data from memory while a later instruction is reading its instruction from memory]
- Fix with separate instruction and data memories (I$ and D$)

29 Resolve Structural Hazard 2
[Figure: pipeline diagram with time (clock cycles) across and instruction order down: add $1, then Inst 1, Inst 2, add $2,$1,]
- Fix the register file access hazard by doing reads in the second half of the cycle and writes in the first half
  - one clock edge controls register writing
  - the other clock edge controls loading of the pipeline state registers

31 Data Hazards: Register Usage
- Dependencies backward in time cause hazards
    add $1,
    sub $4,$1,$5
    and $6,$1,$7
    or  $8,$1,$9
    xor $4,$1,$5
- Read-before-write data hazard

32 Data Hazards: Load Memory
- Dependencies backward in time cause hazards
    lw  $1,4($2)
    sub $4,$1,$5
    and $6,$1,$7
    or  $8,$1,$9
    xor $4,$1,$5
- Load-use data hazard

33 Resolve Data Hazards 1: Insert Stall
    add $1,
    stall
    stall
    sub $4,$1,$5
    and $6,$1,$7
- Can fix a data hazard by waiting (stall), but this impacts CPI

35 Resolve Data Hazards 2: Forwarding
    add $1,
    sub $4,$1,$5
    and $6,$1,$7
    or  $8,$1,$9
    xor $4,$1,$5
- Fix data hazards by forwarding results as soon as they are available to where they are needed

36 Forward Unit Output Signals

38 Datapath with Forwarding Hardware
[Figure: the pipelined datapath with a Forward Unit in the EX stage; Forward Unit inputs: ID/EX.RegisterRs, ID/EX.RegisterRt, EX/MEM.RegisterRd, and MEM/WB.RegisterRd; its outputs steer the muxes on the two ALU inputs]

39 Data Forwarding Control Conditions
1. EX Forward Unit (forwards the result from the previous instruction to either input of the ALU):
   if (EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0) and (EX/MEM.RegisterRd == ID/EX.RegisterRs)) ForwardA = 10
   if (EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0) and (EX/MEM.RegisterRd == ID/EX.RegisterRt)) ForwardB = 10
2. MEM Forward Unit (forwards the result from the second previous instruction to either input of the ALU):
   if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0) and (MEM/WB.RegisterRd == ID/EX.RegisterRs)) ForwardA = 01
   if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0) and (MEM/WB.RegisterRd == ID/EX.RegisterRt)) ForwardB = 01

40 Forwarding Illustration
    add $1,
    sub $4,$1,$5
    and $6,$7,$1
[Figure: pipeline diagram showing EX forwarding from add to sub and MEM forwarding from add to and]

41 Yet Another Complication!
- Another potential data hazard can occur when there is a conflict between the result of the WB stage instruction and the MEM stage instruction: which should be forwarded?
    add $1,$1,$2
    add $1,$1,$3
    add $1,$1,$4

43 EX: Corrected MEM Forward Unit
- MEM Forward Unit:
   if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0) and (EX/MEM.RegisterRd != ID/EX.RegisterRs) and (MEM/WB.RegisterRd == ID/EX.RegisterRs)) ForwardA = 01
   if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0) and (EX/MEM.RegisterRd != ID/EX.RegisterRt) and (MEM/WB.RegisterRd == ID/EX.RegisterRt)) ForwardB = 01
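
The EX conditions and the corrected MEM conditions can be transcribed into a small software model. This is only a sketch: the `PipeRegs` record and the function names are hypothetical, and the default select value 00 (operand comes from the register file) is an assumption following the usual convention, since the output-signal table did not survive in these notes.

```python
from dataclasses import dataclass


@dataclass
class PipeRegs:
    """Hypothetical snapshot of the pipeline register fields the Forward Unit reads."""
    ex_mem_reg_write: bool
    ex_mem_rd: int
    mem_wb_reg_write: bool
    mem_wb_rd: int
    id_ex_rs: int
    id_ex_rt: int


def forward_select(src, r):
    """Return the 2-bit mux select for one ALU input whose source register is src."""
    # EX Forward Unit: result of the previous instruction (in EX/MEM).
    if r.ex_mem_reg_write and r.ex_mem_rd != 0 and r.ex_mem_rd == src:
        return 0b10
    # Corrected MEM Forward Unit: only when EX/MEM does not name the same register.
    if (r.mem_wb_reg_write and r.mem_wb_rd != 0
            and r.ex_mem_rd != src and r.mem_wb_rd == src):
        return 0b01
    return 0b00  # assumption: 00 selects the value read from the register file


def forward_unit(r):
    """Return (ForwardA, ForwardB) for the instruction currently in EX."""
    return forward_select(r.id_ex_rs, r), forward_select(r.id_ex_rt, r)


# Third add of: add $1,$1,$2 ; add $1,$1,$3 ; add $1,$1,$4
# Both EX/MEM and MEM/WB hold a pending write to $1.
third_add = PipeRegs(True, 1, True, 1, id_ex_rs=1, id_ex_rt=4)
print(forward_unit(third_add))
```

For the third add, ForwardA selects the EX/MEM value (the most recent result for $1) and ForwardB stays at the register file, which is the behavior the correction is meant to guarantee.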

44 Memory-to-Memory Copies
- For loads immediately followed by stores (memory-to-memory copies), a stall can be avoided by adding forwarding hardware from the MEM/WB register to the data memory input
  - Would need to add a Forward Unit and a mux to the MEM stage
    lw $1,4($2)
    sw $1,4($3)

47 Forwarding with Load-use Data Hazards
    lw  $1,4($2)
    stall
    sub $4,$1,$5
    and $6,$1,$7
    or  $8,$1,$9
    xor $4,$1,$5
- Will still need one stall cycle even with forwarding

48 Load-use Hazard Detection Unit (optional)
- Need a Hazard detection Unit in the ID stage that inserts a stall between the load and its use
  1. ID Hazard detection Unit:
     if (ID/EX.MemRead
         and ((ID/EX.RegisterRt == IF/ID.RegisterRs)
         or (ID/EX.RegisterRt == IF/ID.RegisterRt)))
       stall the pipeline
- The first line tests to see if the instruction now in the EX stage is a lw; the next two lines check to see if the destination register of the lw matches either source register of the instruction in the ID stage (the load-use instruction)
- After this one-cycle stall, the forwarding logic can handle the remaining data hazards
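
The detection condition is a direct Boolean function of four pipeline register fields. A minimal sketch, with a hypothetical function name and plain integers standing in for register numbers:

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Load-use check: True when the instruction in ID needs the result of a lw in EX."""
    return id_ex_mem_read and (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt)


# lw $1,4($2) in EX, sub $4,$1,$5 in ID: the sub reads $1, so stall one cycle.
stall_sub = must_stall(True, id_ex_rt=1, if_id_rs=1, if_id_rt=5)

# lw $1,4($2) in EX, and $6,$7,$8 in ID: no dependence, no stall.
stall_and = must_stall(True, id_ex_rt=1, if_id_rs=7, if_id_rt=8)

# An instruction that does not read memory in EX never triggers this check.
stall_non_load = must_stall(False, id_ex_rt=1, if_id_rs=1, if_id_rt=5)

print(stall_sub, stall_and, stall_non_load)
```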

50 Adding the Hazard/Stall Hardware (optional)
[Figure: the forwarding datapath with a Hazard Unit added in the ID stage; Hazard Unit inputs include ID/EX.MemRead and ID/EX.RegisterRt; a mux after the Control unit can select 0 for the control bits entering ID/EX]

51 Control Hazards
- Occur when the flow of instruction addresses is not sequential (i.e., not PC = PC + 4); incurred by change-of-flow instructions
  - Unconditional branches (j, jal, jr)
  - Conditional branches (beq, bne)
  - Exceptions
- Possible approaches
  - Stall (impacts CPI)
  - Move the decision point as early in the pipeline as possible, thereby reducing the number of stall cycles
  - Delay decision (requires compiler support)
  - Predict and hope for the best!
- Control hazards occur less frequently than data hazards, but there is nothing as effective against control hazards as forwarding is for data hazards

52 Control Hazards 1: Jumps Incur One Stall
- Jumps are not decoded until ID, so one flush is needed
  - To flush, set IF.Flush to zero the instruction field of the IF/ID pipeline register (turning it into a nop)
    j
    flush
    j target
- Fix the jump hazard by waiting (flush)
- Fortunately, jumps are very infrequent: only 3% of the SPECint instruction mix
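
To see why a 3% jump frequency is "fortunate", here is a back-of-the-envelope CPI estimate. The jump fraction and the one-cycle penalty come from the slide; the ideal base CPI of 1 and the absence of any other stalls are assumptions.

```python
BASE_CPI = 1.0          # assumption: ideal pipeline, no other stalls
JUMP_FRACTION = 0.03    # from the slide: 3% of the SPECint instruction mix
JUMP_STALL_CYCLES = 1   # from the slide: one flush per jump

# Each jump adds one wasted cycle on top of the base CPI.
cpi_with_jumps = BASE_CPI + JUMP_FRACTION * JUMP_STALL_CYCLES

print(f"CPI with jump stalls: {cpi_with_jumps:.2f}")
```

Under these assumptions jumps cost roughly 3% in CPI.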

53 Datapath Branch and Jump Hardware
[Figure: the forwarding datapath extended with jump hardware: a Jump control signal, a shift left 2 of the jump field combined with PC+4[31-28] to form the jump address, and PCSrc and Branch selecting the next PC]

54 Supporting ID Stage Jumps
[Figure: the branch and jump datapath of the previous slide, with jumps resolved in the ID stage and a 0 input that can replace the fetched instruction]

55 Control Hazards 2: Branch Inst
- Dependencies backward in time cause hazards
    beq
    lw
    Inst 3
    Inst 4

56 One Way to Fix a Branch Control Hazard
    beq
    flush
    flush
    flush
    beq target
    Inst 3
- Fix the branch hazard by waiting (flush), but this affects CPI

57 Another Way to Fix a Branch Control Hazard
- Move the branch decision hardware back to as early in the pipeline as possible, i.e., during the decode cycle
    beq
    flush
    beq target
    Inst 3
- Fix the branch hazard by waiting (flush)

58 Two Types of Stalls
- Nop instruction (or bubble) inserted between two instructions in the pipeline (as done for load-use situations)
  - Keep the instructions earlier in the pipeline (later in the code) from progressing down the pipeline for a cycle ("bounce" them in place with write control signals)
  - Insert a nop by zeroing control bits in the pipeline register at the appropriate stage
  - Let the instructions later in the pipeline (earlier in the code) progress normally down the pipeline
- Flushes (or instruction squashing), where an instruction in the pipeline is replaced with a nop instruction (as done for instructions located sequentially after j instructions)
  - Zero the control bits for the instruction to be flushed

59 Reducing the Delay of Branches
- Move the branch decision hardware back to the EX stage
  - Reduces the number of stall (flush) cycles to two
  - Adds an and gate and a 2x1 mux to the EX timing path
- Add hardware to compute the branch target address and evaluate the branch decision to the ID stage
  - Reduces the number of stall (flush) cycles to one (like with jumps)
    - But now need to add forwarding hardware in the ID stage
  - Computing the branch target address can be done in parallel with the RegFile read (done for all instructions, only used when needed)
  - Comparing the registers can't be done until after the RegFile read, so comparing and updating the PC adds a mux, a comparator, and an and gate to the ID timing path
- For deeper pipelines, branch decision points can be even later in the pipeline, incurring more stalls

60 ID Branch Forwarding Issues
- MEM/WB forwarding is taken care of by the normal RegFile write-before-read operation
    WB   add3 $1,
    MEM  add2 $3,
    EX   add1 $4,
    ID   beq $1,$2,Loop
    IF   next_seq_inst
- Need to forward from the EX/MEM pipeline stage to the ID comparison hardware for cases like
    WB   add3 $3,
    MEM  add2 $1,
    EX   add1 $4,
    ID   beq $1,$2,Loop
    IF   next_seq_inst
  if (IDcontrol.Branch and (EX/MEM.RegisterRd != 0) and (EX/MEM.RegisterRd == IF/ID.RegisterRs)) ForwardC = 1
  if (IDcontrol.Branch and (EX/MEM.RegisterRd != 0) and (EX/MEM.RegisterRd == IF/ID.RegisterRt)) ForwardD = 1
  - Forwards the result from the second previous instruction to either input of the compare

61 ID Branch Forwarding Issues, con't
- If the instruction immediately before the branch produces one of the branch source operands, then a stall needs to be inserted (between the beq and add1), since the EX stage ALU operation is occurring at the same time as the ID stage branch compare operation
    WB   add3 $3,
    MEM  add2 $4,
    EX   add1 $1,
    ID   beq $1,$2,Loop
    IF   next_seq_inst
  - "Bounce" the beq (in ID) and next_seq_inst (in IF) in place (the ID Hazard Unit deasserts PC.Write and IF/ID.Write)
  - Insert a stall between the add in the EX stage and the beq in the ID stage by zeroing the control bits going into the ID/EX pipeline register (done by the ID Hazard Unit)
- If the branch is found to be taken, then flush the instruction currently in IF (IF.Flush)

62 Supporting ID Stage Branches (optional)
[Figure: pipelined datapath with the branch resolved in ID: a Compare unit and branch target adder in the ID stage, a Hazard Unit, the Branch and PCSrc signals, IF.Flush, and two Forward Units (one for the ID compare inputs, one for the ALU inputs)]

63 Delayed Branches
- If the branch hardware has been moved to the ID stage, then we can eliminate all branch stalls with delayed branches, which are defined as always executing the next sequential instruction after the branch instruction; the branch takes effect after that next instruction
  - The MIPS compiler moves an instruction that is not affected by the branch (a safe instruction) to immediately after the branch, thereby hiding the branch delay
- With deeper pipelines, the branch delay grows, requiring more than one delay slot
  - Delayed branches have lost popularity compared to more expensive but more flexible (dynamic) hardware branch prediction
  - Growth in available transistors has made hardware branch prediction relatively cheaper

64 Scheduling Branch Delay Slots
A. From before branch
    add $1,$2,$3
    if $2=0 then
      delay slot
  becomes
    if $2=0 then
      add $1,$2,$3
B. From branch target
    sub $4,$5,$6
    ...
    add $1,$2,$3
    if $1=0 then
      delay slot
  becomes
    add $1,$2,$3
    if $1=0 then
      sub $4,$5,$6
C. From fall through
    add $1,$2,$3
    if $1=0 then
      delay slot
    sub $4,$5,$6
  becomes
    add $1,$2,$3
    if $1=0 then
      sub $4,$5,$6
- A is the best choice: it fills the delay slot and reduces IC
- In B and C, the sub instruction may need to be copied, increasing IC
- In B and C, it must be okay to execute sub when the branch fails

65 Static Branch Prediction
- Resolve branch hazards by assuming a given outcome and proceeding without waiting to see the actual branch outcome
1. Predict not taken: always predict branches will not be taken and continue to fetch from the sequential instruction stream; only when the branch is taken does the pipeline stall
  - If taken, flush instructions after the branch (earlier in the pipeline)
    - in IF, ID, and EX stages if branch logic is in MEM: three stalls
    - in IF and ID stages if branch logic is in EX: two stalls
    - in IF stage if branch logic is in ID: one stall
  - ensure that those flushed instructions haven't changed the machine state; this is automatic in the MIPS pipeline since machine-state-changing operations are at the tail end of the pipeline (MemWrite (in MEM) or RegWrite (in WB))
  - restart the pipeline at the branch destination
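
The flush counts above follow directly from the position of the branch logic in the pipeline, which a short sketch makes explicit. The function names are illustrative, and the branch frequency and taken fraction in the CPI example are made-up numbers chosen only to show the arithmetic.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def flushed_stages(branch_stage):
    """Stages holding wrong-path instructions when a taken branch resolves in branch_stage."""
    return STAGES[:STAGES.index(branch_stage)]


def effective_cpi(branch_fraction, taken_fraction, branch_stage, base_cpi=1.0):
    """Predict-not-taken CPI: only taken branches pay the flush penalty."""
    penalty = len(flushed_stages(branch_stage))
    return base_cpi + branch_fraction * taken_fraction * penalty


for stage in ["MEM", "EX", "ID"]:
    print(stage, flushed_stages(stage))

# Hypothetical workload: 20% branches, half of them taken, branch logic in ID.
example_cpi = effective_cpi(0.2, 0.5, "ID")
print(f"example CPI: {example_cpi:.2f}")
```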

67 Flushing with Misprediction (Not Taken)
     4  beq $1,$2,2
     8  sub $4,$1,$5   (flushed)
    16  and $6,$1,$7
    20  or  $8,$1,$9
- To flush the IF stage instruction, assert IF.Flush to zero the instruction field of the IF/ID pipeline register (transforming it into a nop)

68 Branching Structures
- Predict not taken works well for "top of the loop" branching structures
    Loop: beq $1,$2,Out
          2nd loop inst
          ...
          last loop inst
          j Loop
    Out:  fall out inst
  - But such loops have jumps at the bottom of the loop to return to the top of the loop, and incur the jump stall overhead
- Predict not taken doesn't work well for "bottom of the loop" branching structures
    Loop: 1st loop inst
          2nd loop inst
          ...
          last loop inst
          bne $1,$2,Loop
          fall out inst

69 Static Branch Prediction, con't
- Resolve branch hazards by assuming a given outcome and proceeding
2. Predict taken: predict branches will always be taken
  - Predict taken always incurs one stall cycle (if branch destination hardware has been moved to the ID stage)
  - Is there a way to "cache" the address of the branch target instruction?
- As the branch penalty increases (for deeper pipelines), a simple static prediction scheme will hurt performance. With more hardware, it is possible to try to predict branch behavior dynamically during program execution
3. Dynamic branch prediction: predict branches at run-time using run-time information

70 Dynamic Branch Prediction
- A branch prediction buffer (aka branch history table (BHT)) in the IF stage, addressed by the lower bits of the PC, contains bit(s) passed to the ID stage through the IF/ID pipeline register that tell whether the branch was taken the last time it was executed
  - The prediction bit may predict incorrectly (it may be a wrong prediction for this branch this iteration, or may be from a different branch with the same low-order PC bits), but this doesn't affect correctness, just performance
    - The branch decision occurs in the ID stage after determining that the fetched instruction is a branch and checking the prediction bit(s)
  - If the prediction is wrong, flush the incorrect instruction(s) in the pipeline, restart the pipeline with the right instruction, and invert the prediction bit(s)
    - A 4096 bit BHT varies from 1% misprediction (nasa7, tomcatv) to 18% (eqntott)
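
A minimal software model of a 1-bit BHT illustrates the aliasing effect mentioned above. The class name, the choice of 4096 one-bit entries, and the indexing by `(pc >> 2) % entries` (dropping the two always-zero bits of a word-aligned PC) are assumptions made for this sketch; real designs vary.

```python
class BranchHistoryTable:
    """Hypothetical 1-bit branch history table indexed by low-order PC bits."""

    def __init__(self, entries=4096):
        self.entries = entries
        self.bits = [0] * entries  # 0 = predict not taken, 1 = predict taken

    def index(self, pc):
        # Assumption: word-aligned PCs, so the two lowest bits carry no information.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.bits[self.index(pc)] == 1

    def update(self, pc, taken):
        # Storing the actual outcome equals inverting the bit on a misprediction.
        self.bits[self.index(pc)] = 1 if taken else 0


bht = BranchHistoryTable()
bht.update(0x1000, True)           # the branch at 0x1000 was taken

same_branch = bht.predict(0x1000)  # predicted taken next time
aliased = bht.predict(0x1000 + 4096 * 4)  # different branch, same table entry
neighbor = bht.predict(0x1004)     # different entry, still at its initial value

print(same_branch, aliased, neighbor)
```

The aliased branch inherits another branch's history, which can cost a misprediction but never produces a wrong result, since the prediction is always checked in ID.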

71 Branch Target Buffer
- The BHT predicts when a branch is taken, but does not tell where it is taken to!
  - A branch target buffer (BTB) in the IF stage caches the branch target address, but we also need to fetch the next sequential instruction. The prediction bit in IF/ID selects which "next" instruction will be loaded into IF/ID at the next clock edge
    - Would need a two-read-port instruction memory
  - Or the BTB can cache the branch taken instruction while the instruction memory is fetching the next sequential instruction
  [Figure: the PC addresses both the BTB and the Instruction Memory (Read Address); a mux selects which instruction goes on]
- If the prediction is correct, stalls can be avoided no matter which direction they go

72 1-bit Prediction Accuracy
- A 1-bit predictor will be incorrect twice when not taken
- Assume predict_bit = 0 to start (indicating branch not taken) and loop control is at the bottom of the loop code
  1. First time through the loop, the predictor mispredicts the branch since the branch is taken back to the top of the loop; invert prediction bit (predict_bit = 1)
  2. As long as the branch is taken (looping), the prediction is correct
  3. Exiting the loop, the predictor again mispredicts the branch since this time the branch is not taken, falling out of the loop; invert prediction bit (predict_bit = 0)
    Loop: 1st loop inst
          2nd loop inst
          ...
          last loop inst
          bne $1,$2,Loop
          fall out inst
- For 10 times through the loop we have an 80% prediction accuracy for a branch that is taken 90% of the time
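
The three steps above can be replayed in a few lines. This sketch models the bottom-of-loop `bne` as a list of outcomes (taken nine times, then not taken once); the function name is illustrative.

```python
def one_bit_accuracy(outcomes, predict_bit=0):
    """Fraction of correct predictions for a 1-bit predictor starting at predict_bit."""
    correct = 0
    for taken in outcomes:
        prediction = predict_bit == 1
        if prediction == taken:
            correct += 1
        else:
            predict_bit ^= 1  # invert the prediction bit on a misprediction
    return correct / len(outcomes)


# 10 times through the loop: the branch is taken 9 times, then falls out.
loop_outcomes = [True] * 9 + [False]
accuracy = one_bit_accuracy(loop_outcomes)

print(f"1-bit accuracy: {accuracy:.0%}")
```

The two mispredictions are the first iteration and the loop exit, which gives the 80% figure.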

74 2-bit Predictors
- A 2-bit scheme can give 90% accuracy since a prediction must be wrong twice before the prediction bit is changed
  [Figure: four-state predictor FSM with states 11 and 10 (Predict Taken) and 01 and 00 (Predict Not Taken), with Taken / Not taken transitions; annotations for the loop: right 9 times, wrong on loop fall out, right on 1st iteration]
    Loop: 1st loop inst
          2nd loop inst
          ...
          last loop inst
          bne $1,$2,Loop
          fall out inst
- BHT also stores the initial FSM state
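
A sketch of the 90% claim, using a standard 2-bit saturating counter. This is an assumption: the slide's state diagram may wire its transitions slightly differently, but states 2 and 3 predict taken and states 0 and 1 predict not taken in both. The starting state 3 (strongly taken) is also an assumption.

```python
def two_bit_accuracy(outcomes, state=3):
    """Run a 2-bit saturating counter; return (accuracy, final_state)."""
    correct = 0
    for taken in outcomes:
        prediction = state >= 2  # states 2 and 3 predict taken
        if prediction == taken:
            correct += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes), state


loop_outcomes = [True] * 9 + [False]

first_acc, state_after_first = two_bit_accuracy(loop_outcomes)
second_acc, state_after_second = two_bit_accuracy(loop_outcomes, state_after_first)

print(f"first execution of the loop:  {first_acc:.0%}")
print(f"second execution of the loop: {second_acc:.0%}")
```

Unlike the 1-bit scheme, the single not-taken outcome at loop exit does not flip the prediction, so the first iteration of the next execution is still predicted correctly.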

75 Outline
- Review: Flip-Flop Control Signals
- Pipeline Motivations
- Pipeline Hazards
- Exceptions

76 Dealing with Exceptions
- Exceptions (aka interrupts) are just another form of control hazard. Exceptions arise from
  - R-type arithmetic overflow
  - Trying to execute an undefined instruction
  - An I/O device request
  - An OS service request (e.g., a page fault, TLB exception)
  - A hardware malfunction
- The pipeline has to stop executing the offending instruction in midstream, let all prior instructions complete, flush all following instructions, set a register to show the cause of the exception, save the address of the offending instruction, and then jump to a prearranged address (the address of the exception handler code)
- The software (OS) looks at the cause of the exception and deals with it

77 Two Types of Exceptions
- Interrupts: asynchronous to program execution
  - caused by external events
  - may be handled between instructions, so can let the instructions currently active in the pipeline complete before passing control to the OS interrupt handler
  - simply suspend and resume the user program
- Traps (Exception): synchronous to program execution
  - caused by internal events
  - the condition must be remedied by the trap handler for that instruction, so must stop the offending instruction midstream in the pipeline and pass control to the OS trap handler
  - the offending instruction may be retried (or simulated by the OS) and the program may continue, or it may be aborted

79 Where in the Pipeline Exceptions Occur

Exception             | Stage(s)? | Synchronous?
Arithmetic overflow   | EX        | yes
Undefined instruction | ID        | yes
TLB or page fault     | IF, MEM   | yes
I/O service request   | any       | no
Hardware malfunction  | any       | no

- Beware that multiple exceptions can occur simultaneously in a single clock cycle

81 Multiple Simultaneous Exceptions
[Figure: pipeline diagram with instruction order down (Inst 0 through Inst 4); exceptions labeled in the same clock cycle: D$ page fault, arithmetic overflow, undefined instruction, I$ page fault]
- Hardware sorts the exceptions so that the earliest instruction is the one interrupted first
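
The sorting rule can be expressed as picking the pending exception whose instruction is earliest in program order. In the sketch below the record type, the function name, and the assignment of each exception to a particular instruction and stage are all illustrative assumptions; the stages follow the "where exceptions occur" table.

```python
from dataclasses import dataclass


@dataclass
class PendingException:
    """Hypothetical record of an exception raised in the current clock cycle."""
    inst_index: int  # position in program order; smaller means earlier
    stage: str
    cause: str


def first_to_handle(pending):
    """Select the exception belonging to the earliest instruction in program order."""
    return min(pending, key=lambda exc: exc.inst_index)


# Illustrative snapshot of one clock cycle with four simultaneous exceptions.
pending = [
    PendingException(3, "IF", "I$ page fault"),
    PendingException(1, "EX", "arithmetic overflow"),
    PendingException(0, "MEM", "D$ page fault"),
    PendingException(2, "ID", "undefined instruction"),
]

winner = first_to_handle(pending)
print(f"handle first: Inst {winner.inst_index} ({winner.cause} in {winner.stage})")
```

Because older instructions sit in later pipeline stages, choosing the earliest instruction amounts to giving priority to the exception raised deepest in the pipeline.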

82 Additions to MIPS to Handle Exceptions (optional)
- Cause register (records exceptions): hardware to record in Cause the exceptions and a signal to control writes to it (CauseWrite)
- EPC register (records the addresses of the offending instructions): hardware to record in EPC the address of the offending instruction and a signal to control writes to it (EPCWrite)
  - Exception software must match exception to instruction
- A way to load the PC with the address of the exception handler
  - Expand the PC input mux, where the new input is hardwired to the exception handler address
    - (e.g., a fixed hex address for arithmetic overflow)
- A way to flush the offending instruction and the ones that follow it

83 Datapath with Controls for Exceptions (optional)
[Figure: the ID-stage-branch datapath extended for exceptions: a hardwired hex exception handler address as an extra PC mux input, Cause and EPC registers, and the flush signals IF.Flush, ID.Flush, and EX.Flush, alongside the Hazard Unit, Control, Compare, and the two Forward Units]

84 Summary
- All modern-day processors use pipelining for performance (a CPI of 1 and a fast CC)
- Pipeline clock rate is limited by the slowest pipeline stage, so designing a balanced pipeline is important
- Must detect and resolve hazards
  - Structural hazards: resolved by designing the pipeline correctly
  - Data hazards
    - Stall (impacts CPI)
    - Forward (requires hardware support)
  - Control hazards: put the branch decision hardware in as early a stage in the pipeline as possible
    - Stall (impacts CPI)
    - Delay decision (requires compiler support)
    - Static and dynamic prediction (requires hardware support)
- Pipelining complicates exception handling
