Design of Digital Circuits Lecture 14: Pipelining. Prof. Onur Mutlu ETH Zurich Spring April 2018

Size: px

Start display at page:

Download "Design of Digital Circuits Lecture 14: Pipelining. Prof. Onur Mutlu ETH Zurich Spring April 2018"

Martha Beasley
5 years ago
Views:

1 Desig of Digital Circuits Lecture 4: Pipeliig Prof. Our Mutlu ETH Zurich Sprig 28 9 April 28

2 Ageda for Today & Next Few Lectures Previous lectures Sigle-cycle Microarchitectures Multi-cycle ad Microprogrammed Microarchitectures Pipeliig Issues i Pipeliig: Cotrol & Data Depedece Hadlig, State Maiteace ad Recovery, Out-of-Order Executio Issues i OoO Executio: Load-Store Hadlig, 2

3 Lecture Aoucemet Moday, April 3, 28 6:5-7:5 CAB G 6 Apéro after the lecture J Prof. We-Mei Hwu (Uiversity of Illiois at Urbaa-Champaig) D-INFK Distiguished Collouium Iovative Applicatios ad Techology Pivots A Perfect Storm i Computig 3

4 Readigs for This Week ad Next Week H&H, Chapter 7.5: Pipelied Processor H&H (fiish Chapter 7) Smith ad Sohi, The Microarchitecture of Superscalar Processors, Proceedigs of the IEEE, 995 More advaced pipeliig Iterrupt ad exceptio hadlig Out-of-order ad superscalar executio cocepts 4

5 Ca We Do Better? 5

6 Ca We Do Better? What limitatios do you see with the multi-cycle desig? Limited cocurrecy Some hardware resources are idle durig differet phases of istructio processig cycle Fetch logic is idle whe a istructio is beig decoded or executed Most of the datapath is idle whe a memory access is happeig 6

7 Ca We Use the Idle Hardware to Improve Cocurrecy? Goal: More cocurrecy à Higher istructio throughput (i.e., more work completed i oe cycle) Idea: Whe a istructio is usig some resources i its processig phase, process other istructios o idle resources ot eeded by that istructio E.g., whe a istructio is beig decoded, fetch the ext istructio E.g., whe a istructio is beig executed, decode aother istructio E.g., whe a istructio is accessig data memory (ld/st), execute the ext istructio E.g., whe a istructio is writig its result ito the register file, access data memory for the ext istructio 7

8 Pipeliig 8

9 Pipeliig: Basic Idea More systematically: Pipelie the executio of multiple istructios Aalogy: Assembly lie processig of istructios Idea: Divide the istructio processig cycle ito distict stages of processig Esure there are eough hardware resources to process oe istructio i each stage Process a differet istructio i each stage Istructios cosecutive i program order are processed i cosecutive stages Beefit: Icreases istructio processig throughput (/CPI) Dowside: Start thikig about this 9

10 Example: Executio of Four Idepedet ADDs Multi-cycle: 4 cycles per istructio F D E W F D E W F D E W F D E W Pipelied: 4 cycles per 4 istructios (steady state) Time F D E W F D E W F D E W Is life always this beautiful? F D E W Time

11 The Laudry Aalogy Time Task order A B C D 6 PM AM place oe dirty load of clothes i the washer whe the washer is fiished, place the wet load i the dryer whe the dryer is fiished, take out the dry load ad fold whe foldig is fiished, ask your roommate (??) to put the clothes away - steps to do a load are seuetially depedet - o depedece betwee differet loads - differet steps do ot share resources Based o origial figure from [P&H CO&D, COPYRIGHT 24 Elsevier. ALL RIGHTS RESERVED.]

12 Pipeliig Multiple Loads of Laudry Time Task order A B C D 6 PM AM Time 6 PM AM Task order A B C D - 4 loads of laudry i parallel - o additioal resources - throughput icreased by 4 - latecy per load is the same Based o origial figure from [P&H CO&D, COPYRIGHT 24 Elsevier. ALL RIGHTS RESERVED.] 2

13 Pipeliig Multiple Loads of Laudry: I Practice Time Task order A B C D 6 PM AM Time 6 PM AM Task order A B C D the slowest step decides throughput Based o origial figure from [P&H CO&D, COPYRIGHT 24 Elsevier. ALL RIGHTS RESERVED.] 3

14 Pipeliig Multiple Loads of Laudry: I Practice Time Task order A B C D 6 PM AM Time 6 PM AM Task order A B C D Based o origial figure from [P&H CO&D, COPYRIGHT 24 Elsevier. ALL RIGHTS RESERVED.] A B A B throughput restored (2 loads per hour) usig 2 dryers 4

15 A Ideal Pipelie Goal: Icrease throughput with little icrease i cost (hardware cost, i case of istructio processig) Repetitio of idetical operatios The same operatio is repeated o a large umber of differet iputs (e.g., all laudry loads go through the same steps) Repetitio of idepedet operatios No depedecies betwee repeated operatios Uiformly partitioable suboperatios Processig ca be evely divided ito uiform-latecy suboperatios (that do ot share resources) Fittig examples: automobile assembly lie, doig laudry What about the istructio processig cycle? 5

16 Ideal Pipeliig combiatioal logic (F,D,E,M,W) T psec BW=~(/T) T/2 ps (F,D,E) T/2 ps (M,W) BW=~(2/T) T/3 ps (F,D) T/3 ps (E,M) T/3 ps (M,W) BW=~(3/T) 6

17 More Realistic Pipelie: Throughput Nopipelied versio with delay T BW = /(T+S) where S = latch delay T ps k-stage pipelied versio BW k-stage = / (T/k +S ) BW max = / ( gate delay + S ) Latch delay reduces throughput (switchig overhead b/w stages) T/k ps T/k ps 7

18 More Realistic Pipelie: Cost Nopipelied versio with combiatioal cost G Cost = G+L where L = latch cost G gates k-stage pipelied versio Cost k-stage = G + Lk Latches icrease hardware cost G/k G/k 8

19 Pipeliig Istructio Processig 9

20 Remember: The Istructio Processig Cycle Fetch. Istructio fetch (IF) 2. Istructio Decodedecode ad register Evaluate operad Address fetch (ID/RF) 3. Execute/Evaluate Fetch Operads memory address (EX/AG) 4. Memory operad fetch (MEM) 5. Store/writeback Execute result (WB) Store Result 2

21 Remember the Sigle-Cycle Uarch Istructio [25 ] Shift Jump address [3 ] left PCSrc =Jump 4 Add PC+4 [3 28] Istructio [3 26] Cotrol RegDst Jump Brach MemRead MemtoReg ALUOp MemWrite ALUSrc RegWrite Shift left 2 Add result ALU M u x M u x PCSrc 2 =Br Take PC Read address Istructio memory Istructio [3 ] Istructio [25 2] Istructio [2 6] Istructio [5 ] M u x Read register Read data Read register 2 Registers Read Write data 2 register Write data M u x Zero ALU ALU result bcod Address Write data Data memory Read data M u x Istructio [5 ] 6 Sig 32 exted ALU cotrol Istructio [5 ] ALU operatio Based o origial figure from [P&H CO&D, COPYRIGHT 24 Elsevier. ALL RIGHTS RESERVED.] T BW=~(/T) 2

22 Dividig Ito Stages 2ps IF: Istructio fetch M u x ps 2ps 2ps ps ID: Istructio decode/ register file read EX: Execute/ address calculatio MEM: Memory access WB: Write back igore for ow Add 4 Shift left 2 Add Add result PC Address Istructio memory Istructio Read register Read data Read register 2 Registers Read data 2 Write register Write data M u x Zero ALU ALU result Address Write data Data memory Read data M u x RF write 6 Sig exted 32 Is this the correct partitioig? Why ot 4 or 6 stages? Why ot differet boudaries? Based o origial figure from [P&H CO&D, COPYRIGHT 24 Elsevier. ALL RIGHTS RESERVED.] 22

23 Istructio Pipelie Throughput Program executio order Time (i istructios) lw $, ($) Istructio fetch Reg ALU Data access Reg lw $2, 2($) 8ps 8 s Istructio fetch Reg ALU Data access Reg lw $3, 3($) Program executio Time order (i istructios) lw $, ($) Istructio fetch 8 8ps s Reg ALU Data access Reg Istructio fetch 8ps 8 s... lw $2, 2($) 2ps 2 s Istructio fetch Reg ALU Data access Reg lw $3, 3($) 2ps 2 s Istructio fetch Reg ALU Data access Reg 2ps 2 s 2ps 2 s 22ps s 2ps 2 s 2ps 2 s 5-stage speedup is 4, ot 5 as predicted by the ideal model. Why? 23

24 Eablig Pipelied Processig: Pipelie Registers IF: Istructio fetch M M u u x x ID: Istructio decode/ register file read EX: Execute/ address calculatio MEM: Memory access WB: Write back No resource is used by more tha stage! IF/ID ID/EX EX/MEM MEM/WB 4 4 Add Add PC D +4 PC E +4 Add Add Add Add result result PC M Shift Shift left left 2 2 PC PC PC F Address Address Istructio memory Istructio Istructio memory IR D Istructio Read Read register register Read Read data data Read Read register 2 2 Registers Read Read Write Write data data 2 2 register register Write Write data data Sig Sig exted exted A E B E Imm E M M u u x x Zero Zero ALU ALU ALU ALU result result Aout M B M Address Address Write Write data data Data Data memory Read Read data data MDR W Aout W M M u u x x Based o origial figure from [P&H CO&D, COPYRIGHT 24 Elsevier. ALL RIGHTS RESERVED.] T/k ps T T/k ps 24

25 Pipelied Operatio Example lw Istructio fetch M u x All istructio classes must follow the same path ad timig through the pipelie stages. lw lw Istructio decode Ay performace impact? Executio lw Memory lw Write back IF/ID ID/EX EX/MEM MEM/WB Add 4 Shift left left 2 Add Add result PC PC Address Istructio memory Istructio Read register Read data Read register 2 Registers Read Write data 2 register Write data 6 6 Sig exted M u x Zero ALU ALU result Address Data memory Data memory Write data data Read data M u x Based o origial figure from [P&H CO&D, COPYRIGHT 24 Elsevier. ALL RIGHTS RESERVED.] 25

26 Write data Pipelied Operatio Example 32 6 Sig exted M u x Write data Data memory M u x Clock 5 sub lw $, $, 2($) $2, $3 Istructio fetch M u x sub lw $, $, 2($) $2, $3 Istructio decode lw $, 2($) Executio sub $, $2, $3 Executio sub lw $, $, 2($) $2, $3 Memory sub lw $, $, 2($) $2, $3 Write back IF/ID ID/EX EX/MEM MEM/WB Add 4 Shift left 2 Add Add result PC Address Istructio memory Istructio Read register Read Read data register 2 Zero Registers Read ALU ALU Write data 2 result register M u Write x data Is life always this beautiful? 6 Sig exted 32 Address Data memory Write data Read data M u x Clock 2 3 Clock Clock 56 4 Based o origial figure from [P&H CO&D, COPYRIGHT 24 Elsevier. ALL RIGHTS RESERVED.] 26 sub $, $2, $3

27 Illustratig Pipelie Operatio: Operatio View t t t 2 t 3 t 4 t 5 Ist Ist Ist 2 Ist 3 Ist 4 IF ID IF EX ID IF MEM EX ID IF WB MEM EX ID IF steady state (full pipelie) WB MEM EX ID IF WB MEM EX ID IF WB MEM EX ID IF 27

28 Illustratig Pipelie Operatio: Resource View t t t 2 t 3 t 4 t 5 t 6 t 7 t 8 t 9 t IF I I I 2 I 3 I 4 I 5 I 6 I 7 I 8 I 9 I ID I I I 2 I 3 I 4 I 5 I 6 I 7 I 8 I 9 EX I I I 2 I 3 I 4 I 5 I 6 I 7 I 8 MEM I I I 2 I 3 I 4 I 5 I 6 I 7 WB I I I 2 I 3 I 4 I 5 I 6 28

29 Cotrol Poits i a Pipelie PCSrc M u x IF/ID ID/EX EX/MEM MEM/WB Add 4 RegWrite Shift left 2 Add Add result Brach PC Address Istructio memory Istructio Read register Read data Read register 2 Registers Read Write data 2 register Write data Istructio [5 ] 6 Sig 32 exted ALUSrc M u x 6 ALU cotrol Zero ALU ALU result Address Write data MemWrite Data memory MemRead Read data MemtoReg M u x Based o origial figure from [P&H CO&D, COPYRIGHT 24 Elsevier. ALL RIGHTS RESERVED.] Istructio [2 6] Istructio [5 ] M u x RegDst ALUOp Idetical set of cotrol poits as the sigle-cycle datapath!! 29

30 Cotrol Sigals i a Pipelie For a give istructio Þ same cotrol sigals as sigle-cycle, but cotrol sigals reuired at differet cycles, depedig o stage Optio : decode oce usig the same logic as sigle-cycle ad buffer sigals util cosumed WB Istructio Cotrol M WB EX M WB Þ IF/ID ID/EX EX/MEM MEM/WB Optio 2: carry relevat istructio word/field dow the pipelie ad decode locally withi each or i a previous stage Which oe is better? 3

31 Pipelied Cotrol Sigals PCSrc M u x Cotrol ID/EX WB M EX/MEM WB MEM/WB IF/ID EX M WB Add PC 4 Address Istructio memory Istructio Read register Read data Read register 2 Registers Read Write data 2 register Write data RegWrite Shift left 2 M u x Add Add result ALUSrc Zero ALU ALU result Brach Write data MemWrite Address Data memory Read data MemtoReg M u x Istructio [5 ] 6 Sig 32 exted 6 ALU cotrol MemRead Istructio [2 6] Istructio [5 ] M u x RegDst ALUOp Based o origial figure from [P&H CO&D, COPYRIGHT 24 Elsevier. ALL RIGHTS RESERVED.] 3

32 Caregie Mello Aother Example: Sigle-Cycle ad Pipelied CLK PC' PC A RD Istructio Memory Istr 25:2 2:6 CLK A A2 A3 WD3 WE3 Register File RD RD2 SrcA SrcB ALU Zero ALUResult WriteData CLK A RD Data Memory WD WE ReadData 4 + PCPlus4 2:6 5: 5: Sig Exted SigImm PC' CLK PCF A RD Istructio Memory CLK IstrD 25:2 2:6 2:6 5: CLK A A2 A3 WD3 WE3 Register File RD RD2 CLK RtE RdE SrcAE SrcBE WriteDataE WriteRegE 4: CLK ZeroM ALUOutM WriteDataM CLK WE A RD Data Memory WD ALUOutW ReadDataW + 4 5: Sig Exted SigImmE <<2 PCBrachM + WriteReg 4: <<2 + PCBrach Result ALU CLK PCPlus4F PCPlus4D PCPlus4E Fetch Decode Execute Memory Writeback ResultW 32

33 Caregie Mello Aother Example: Correct Pipelied Datapath CLK PC' PCF A RD Istructio Memory CLK IstrD 25:2 2:6 2:6 5: CLK A A2 A3 WD3 WE3 Register File RD RD2 CLK RtE RdE SrcAE SrcBE ALU WriteDataE WriteRegE 4: CLK ZeroM ALUOutM WriteDataM WriteRegM 4: CLK A RD Data Memory WD WE CLK ALUOutW ReadDataW WriteRegW 4: 4 + 5: Sig Exted SigImmE <<2 + PCBrachM PCPlus4F PCPlus4D PCPlus4E Fetch Decode Execute Memory Writeback WriteReg must arrive at the same time as Result ResultW 33

34 Caregie Mello Aother Example: Pipelied Cotrol CLK CLK CLK Cotrol Uit RegWriteD MemtoRegD MemWriteD RegWriteE RegWriteM RegWriteW MemtoRegE MemtoRegM MemtoRegW MemWriteE MemWriteM 3:26 5: Op Fuct BrachD ALUCotrolD ALUSrcD BrachE ALUCotrolE 2: ALUSrcE BrachM PCSrcM RegDstD RegDstE ALUOutW PC' CLK PCF A RD Istructio Memory CLK IstrD 25:2 2:6 2:6 5: CLK A A2 A3 WD3 WE3 Register File RD RD2 RtE RdE SrcAE SrcBE WriteDataE ALU WriteRegE 4: ZeroM ALUOutM WriteDataM WriteRegM 4: CLK A RD Data Memory WD WE ReadDataW WriteRegW 4: 4 + 5: Sig Exted SigImmE <<2 + PCBrachM PCPlus4F PCPlus4D PCPlus4E ResultW Same cotrol uit as sigle-cycle processor Cotrol delayed to proper pipelie stage 34

35 Remember: A Ideal Pipelie Goal: Icrease throughput with little icrease i cost (hardware cost, i case of istructio processig) Repetitio of idetical operatios The same operatio is repeated o a large umber of differet iputs (e.g., all laudry loads go through the same steps) Repetitio of idepedet operatios No depedecies betwee repeated operatios Uiformly partitioable suboperatios Processig a be evely divided ito uiform-latecy suboperatios (that do ot share resources) Fittig examples: automobile assembly lie, doig laudry What about the istructio processig cycle? 35

36 Istructio Pipelie: Not A Ideal Pipelie Idetical operatios... NOT! Þ differet istructios à ot all eed the same stages Forcig differet istructios to go through the same pipe stages à exteral fragmetatio (some pipe stages idle for some istructios) Uiform suboperatios... NOT! Þ differet pipelie stages à ot the same latecy Need to force each stage to be cotrolled by the same clock à iteral fragmetatio (some pipe stages are too fast but all take the same clock cycle time) Idepedet operatios... NOT! Þ istructios are ot idepedet of each other Need to detect ad resolve iter-istructio depedecies to esure the pipelie provides correct results à pipelie stalls (pipelie is ot always movig) 36

37 Issues i Pipelie Desig Balacig work i pipelie stages How may stages ad what is doe i each stage Keepig the pipelie correct, movig, ad full i the presece of evets that disrupt pipelie flow Hadlig depedeces Data Cotrol Hadlig resource cotetio Hadlig log-latecy (multi-cycle) operatios Hadlig exceptios, iterrupts Advaced: Improvig pipelie throughput Miimizig stalls 37

38 Causes of Pipelie Stalls Stall: A coditio whe the pipelie stops movig Resource cotetio Depedeces (betwee istructios) Data Cotrol Log-latecy (multi-cycle) operatios 38

39 Depedeces ad Their Types Also called depedecy or less desirably hazard Depedeces dictate orderig reuiremets betwee istructios Two types Data depedece Cotrol depedece Resource cotetio is sometimes called resource depedece However, this is ot fudametal to (dictated by) program sematics, so we will treat it separately 39

40 Hadlig Resource Cotetio Happes whe istructios i two pipelie stages eed the same resource Solutio : Elimiate the cause of cotetio Duplicate the resource or icrease its throughput E.g., use separate istructio ad data memories (caches) E.g., use multiple ports for memory structures Solutio 2: Detect the resource cotetio ad stall oe of the cotedig stages Which stage do you stall? Example: What if you had a sigle read ad write port for the register file? 4

41 Caregie Mello Example Resource Depedece: RegFile The register file ca be read ad writte i the same cycle: write takes place durig the st half of the cycle read takes place durig the 2d half of the cycle => o problem!!! However operatios that ivolve register file have oly half a clock cycle to complete the operatio!! Time (cycles) add $s2 add $s, $s2, $s3 IM RF $s3 + DM $s RF ad $t, $s, $s IM ad $s RF $s & DM $t RF or $t, $s4, $s IM or $s4 RF $s DM $t RF sub $t2, $s, $s5 IM sub $s RF $s5 - DM $t2 RF 4

42 Data Depedeces Types of data depedeces Flow depedece (true data depedece read after write) Output depedece (write after write) Ati depedece (write after read) Which oes cause stalls i a pipelied machie? For all of them, we eed to esure sematics of the program is correct Flow depedeces always eed to be obeyed because they costitute true depedece o a value Ati ad output depedeces exist due to limited umber of architectural registers They are depedece o a ame, ot a value We will later see what we ca do about them 42

43 Data Depedece Types Flow depedece r 3 r op r 2 Read-after-Write r 5 r 3 op r 4 (RAW) Ati depedece r 3 r op r 2 Write-after-Read r r 4 op r 5 (WAR) Output-depedece r 3 r op r 2 Write-after-Write r 5 r 3 op r 4 (WAW) r 3 r 6 op r 7 43

44 Write data Pipelied Operatio Example 32 6 Sig exted M u x Write data Data memory M u x Clock 5 sub lw $, $, 2($) $2, $3 Istructio fetch M u x sub lw $, $, 2($) $2, $3 Istructio decode lw $, 2($) Executio sub $, $2, $3 Executio sub lw $, $, 2($) $2, $3 Memory sub lw $, $, 2($) $2, $3 Write back IF/ID ID/EX EX/MEM MEM/WB Add 4 Shift left 2 Add Add result PC Read Address register Read Read data Istructio register 2 Zero Registers memory Read ALU ALU Write data 2 result Address register M u x Data Write data memory What if the SUB were depedet o LW? Write data Istructio 6 Sig exted 32 Read data M u x Clock 2 3 Clock Clock 56 4 Based o origial figure from [P&H CO&D, COPYRIGHT 24 Elsevier. ALL RIGHTS RESERVED.] 44 sub $, $2, $3

45 Data Depedece Hadlig 45

46 Readig for Next Few Lectures H&H, Chapter Smith ad Sohi, The Microarchitecture of Superscalar Processors, Proceedigs of the IEEE, 995 More advaced pipeliig Iterrupt ad exceptio hadlig Out-of-order ad superscalar executio cocepts 46

How to Hadle Data Depedeces Ati ad output depedeces are easier to hadle write to the destiatio i oe stage ad i program order Flow depedeces are more iterestig Five fudametal ways of hadlig flow

47 How to Hadle Data Depedeces Ati ad output depedeces are easier to hadle write to the destiatio i oe stage ad i program order Flow depedeces are more iterestig Five fudametal ways of hadlig flow depedeces Detect ad wait util value is available i register file Detect ad forward/bypass data to depedet istructio Detect ad elimiate the depedece at the software level No eed for the hardware to detect depedece Predict the eeded value(s), execute speculatively, ad verify Do somethig else (fie-graied multithreadig) No eed to detect 47

48 Iterlockig Detectio of depedece betwee istructios i a pipelied processor to guaratee correct executio Software based iterlockig vs. Hardware based iterlockig MIPS acroym? 48

49 Approaches to Depedece Detectio (I) Scoreboardig Each register i register file has a Valid bit associated with it A istructio that is writig to the register resets the Valid bit A istructio i Decode stage checks if all its source ad destiatio registers are Valid Yes: No eed to stall No depedece No: Stall the istructio Advatage: Simple. bit per register Disadvatage: Need to stall for all types of depedeces, ot oly flow dep. 49

50 Not Stallig o Ati ad Output Depedeces What chages would you make to the scoreboard to eable this? 5

51 Approaches to Depedece Detectio (II) Combiatioal depedece check logic Special logic that checks if ay istructio i later stages is supposed to write to ay source register of the istructio that is beig decoded Yes: stall the istructio/pipelie No: o eed to stall o flow depedece Advatage: No eed to stall o ati ad output depedeces Disadvatage: Logic is more complex tha a scoreboard Logic becomes more complex as we make the pipelie deeper ad wider (flash-forward: thik superscalar executio) 5

52 Oce You Detect the Depedece i Hardware What do you do afterwards? Observatio: Depedece betwee two istructios is detected before the commuicated data value becomes available Optio : Stall the depedet istructio right away Optio 2: Stall the depedet istructio oly whe ecessary à data forwardig/bypassig Optio 3: 52

53 Data Forwardig/Bypassig Problem: A cosumer (depedet) istructio has to wait i decode stage util the producer istructio writes its value i the register file Goal: We do ot wat to stall the pipelie uecessarily Observatio: The data value eeded by the cosumer istructio ca be supplied directly from a later stage i the pipelie (istead of oly from the register file) Idea: Add additioal depedece check logic ad data forwardig paths (buses) to supply the producer s value to the cosumer right after the value is available Beefit: Cosumer ca move i the pipelie util the poit the value ca be supplied à less stallig 53

54 A Special Case of Data Depedece Cotrol depedece Data depedece o the Istructio Poiter / Program Couter 54

55 Cotrol Depedece Questio: What should the fetch PC be i the ext cycle? Aswer: The address of the ext istructio All istructios are cotrol depedet o previous oes. Why? If the fetched istructio is a o-cotrol-flow istructio: Next Fetch PC is the address of the ext-seuetial istructio Easy to determie if we kow the size of the fetched istructio If the istructio that is fetched is a cotrol-flow istructio: How do we determie the ext Fetch PC? I fact, how do we kow whether or ot the fetched istructio is a cotrol-flow istructio? 55

56 Data Depedece Hadlig: Cocepts ad Implemetatio 56

57 Remember: Data Depedece Types Flow depedece r 3 r op r 2 Read-after-Write r 5 r 3 op r 4 (RAW) Ati depedece r 3 r op r 2 Write-after-Read r r 4 op r 5 (WAR) Output-depedece r 3 r op r 2 Write-after-Write r 5 r 3 op r 4 (WAW) r 3 r 6 op r 7 57

58 RAW Depedece Hadlig Which oe of the followig flow depedeces lead to coflicts i the 5-stage pipelie? addi ra r- - IF ID EX MEM WB addi r- ra - IF ID EX MEM WB addi r- ra - IF ID EX MEM addi r- ra - IF ID EX addi r- ra - IF? ID addi r- ra - IF 58

59 Pipelie Stall: Resolvig Data Depedece t t t 2 t 3 t 4 t 5 Ist h IF ID ALU MEM WB Ist i i IF ID ALU MEM WB Ist j j IF ID ALU ID MEM ALU ID WB MEM ALU ID WB MEM ALU Ist k IF ID IF ALU ID IF MEM ALU ID IF WB MEM ALU ID Ist l IF ID IF ALU ID IF MEM ALU ID IF IF ID IF ALU ID IF i: r x _ j: bubble _ r IF ID x dist(i,j)= IF Stall = make the depedet istructio j: bubble _ r x dist(i,j)=2 IF j: bubble _ r x dist(i,j)=3 j: _ r x dist(i,j)=4 wait util its source data value is available. stop all up-stream stages 2. drai all dow-stream stages 59

60 How to Implemet Stallig PCSrc M u x Cotrol ID/EX WB M EX/MEM WB MEM/WB IF/ID EX M WB Add PC 4 Address Istructio memory Istructio Read register Read data Read register 2 Registers Read Write data 2 register Write data RegWrite Shift left 2 M u x Add Add result ALUSrc Zero ALU ALU result Brach Write data MemWrite Address Data memory Read data MemtoReg M u x Istructio [5 ] 6 Sig 32 exted 6 ALU cotrol MemRead Stall Istructio [2 6] Istructio [5 ] M u x RegDst disable PC ad IF/ID latchig; esure stalled istructio stays i its stage Isert ivalid istructios/ops ito the stage followig the stalled oe (called bubbles ) Based o origial figure from [P&H CO&D, COPYRIGHT 24 Elsevier. ALL RIGHTS RESERVED.] ALUOp 6

61 Caregie Mello RAW Data Depedece Example Oe istructio writes a register ($s) ad ext istructios read this register => read after write (RAW) depedece. add writes ito $s i the first half of cycle 5 ad reads $s o cycle 3, obtaiig the wrog value Oly if the pipelie hadles data depedeces wrog! or reads $s o cycle 4, agai obtaiig the wrog value. sub reads $s i the secod half of cycle 5, obtaiig the correct value subseuet istructios read the correct value of $s Time (cycles) add $s2 add $s, $s2, $s3 IM RF $s3 + DM $s RF ad $t, $s, $s IM ad $s RF $s & DM $t RF or $t, $s4, $s IM or $s4 RF $s DM $t RF sub $t2, $s, $s5 IM sub $s RF $s5 - DM $t2 RF 6

62 Caregie Mello Compile-Time Detectio ad Elimiatio Time (cycles) add $s2 add $s, $s2, $s3 IM RF $s3 + DM $s RF op IM op RF DM RF op IM op RF DM RF ad $t, $s, $s IM ad $s RF $s & DM $t RF or $t, $s4, $s IM or $s4 RF $s DM $t RF sub $t2, $s, $s5 IM sub $s RF $s5 - DM $t2 RF Isert eough NOPs for the reuired result to be ready Or (if you ca) move idepedet useful istructios up 62

63 Caregie Mello Data Forwardig Also called Data Bypassig We have already see the basic idea before Forward the result value to the depedet istructio as soo as the value is available Remember dataflow? Data value supplied to depedet istructio as soo as it is available Istructio executes whe all its operads are available Data forwardig brigs a pipelie closer to data flow executio priciples 63

64 Caregie Mello Data Forwardig Time (cycles) add $s2 add $s, $s2, $s3 IM RF $s3 + DM $s RF ad $t, $s, $s IM ad $s RF $s & DM $t RF or $t, $s4, $s IM or $s4 RF $s DM $t RF sub $t2, $s, $s5 IM sub $s RF $s5 - DM $t2 RF 64

65 Caregie Mello Data Forwardig Cotrol Uit RegWriteD MemtoRegD MemWriteD CLK CLK CLK RegWriteE RegWriteM RegWriteW MemtoRegE MemtoRegM MemtoRegW MemWriteE MemWriteM 3:26 5: Op Fuct ALUCotrolD 2: ALUSrcD RegDstD BrachD ALUCotrolE 2: ALUSrcE RegDstE BrachE BrachM PCSrcM CLK PC' PCF 4 A RD Istructio Memory + CLK IstrD 25:2 2:6 25:2 2:6 5: 5: CLK A A2 A3 WD3 WE3 Sig Exted Register File RD RD2 SigImmD RsD RtD RdD RsE RtE RdE SrcAE SrcBE WriteDataE ALU WriteRegE 4: SigImmE <<2 ZeroM ALUOutM WriteDataM WriteRegM 4: CLK WE A RD Data Memory WD ReadDataW ALUOutW WriteRegW 4: + PCPlus4F PCPlus4D PCPlus4E PCBrachM ResultW ForwardAE ForwardBE RegWriteM RegWriteW Hazard Uit 65

66 Caregie Mello Data Forwardig Forward to Execute stage from either: Memory stage or Writeback stage Whe should we forward from oe either Memory or Writeback stage? If that stage will write a destiatio register ad the destiatio register matches the source register. If both the Memory ad Writeback stages cotai matchig destiatio registers, the Memory stage should have priority, because it cotais the more recetly executed istructio. 66

67 Caregie Mello Data Forwardig Forward to Execute stage from either: Memory stage or Writeback stage Forwardig logic for ForwardAE (pseudo code): if ((rse!= ) AND (rse == WriteRegM) AND RegWriteM) the ForwardAE = # forward from Memory stage else if ((rse!= ) AND (rse == WriteRegW) AND RegWriteW) the ForwardAE = # forward from Writeback stage else ForwardAE = # o forwardig Forwardig logic for ForwardBE same, but replace rse with rte 67

68 Caregie Mello Stallig Time (cycles) lw $s, 4($) IM RF 4 lw $ + DM $s RF ad $t, $s, $s IM ad Trouble! $s RF $s & DM $t RF or $t, $s4, $s IM or $s4 RF $s DM $t RF sub $t2, $s, $s5 IM sub $s RF $s5 - DM $t2 RF Forwardig is sufficiet to resolve RAW data depedeces but There are cases whe forwardig is ot possible due to pipelie desig ad istructio latecies 68

69 Caregie Mello Stallig Time (cycles) lw $s, 4($) IM RF 4 lw $ + DM $s RF ad $t, $s, $s IM ad Trouble! $s RF $s & DM $t RF or $t, $s4, $s IM or $s4 RF $s DM $t RF sub $t2, $s, $s5 IM sub $s RF $s5 - DM $t2 RF The lw istructio does ot fiish readig data util the ed of the Memory stage, so its result caot be forwarded to the Execute stage of the ext istructio. 69

70 Caregie Mello Stallig Time (cycles) lw $s, 4($) IM RF 4 lw $ + DM $s RF ad $t, $s, $s IM ad Trouble! $s RF $s & DM $t RF or $t, $s4, $s IM or $s4 RF $s DM $t RF sub $t2, $s, $s5 IM sub $s RF $s5 - DM $t2 RF The lw istructio has a two-cycle latecy, therefore a depedet istructio caot use its result util two cycles later. The lw istructio receives data from memory at the ed of cycle 4. But the ad istructio eeds that data as a source operad at the begiig of cycle 4. There is o way to supply the data with forwardig. 7

71 Caregie Mello Stallig Time (cycles) lw $s, 4($) IM RF 4 lw $ + DM $s RF ad $t, $s, $s IM ad $s RF $s $s RF $s & DM $t RF or $t, $s4, $s IM or IM or $s4 RF $s DM $t RF sub $t2, $s, $s5 Stall IM sub $s RF $s5 - DM $t2 RF 7

72 Caregie Mello Stallig Hardware Stalls are supported by: addig eable iputs (EN) to the Fetch ad Decode pipelie registers ad a sychroous reset/clear (CLR) iput to the Execute pipelie register or a INV bit associated with each pipelie register Whe a lw stall occurs StallD ad StallF are asserted to force the Decode ad Fetch stage pipelie registers to hold their old values. FlushE is also asserted to clear the cotets of the Execute stage pipelie register, itroducig a bubble 72

73 Desig of Digital Circuits Lecture 4: Pipeliig Prof. Our Mutlu ETH Zurich Sprig 28 9 April 28

74 Backup Slides 74

75 How to Hadle Data Depedeces Ati ad output depedeces are easier to hadle write to the destiatio i oe stage ad i program order Flow depedeces are more iterestig Five fudametal ways of hadlig flow depedeces Detect ad wait util value is available i register file Detect ad forward/bypass data to depedet istructio Detect ad elimiate the depedece at the software level No eed for the hardware to detect depedece Predict the eeded value(s), execute speculatively, ad verify Do somethig else (fie-graied multithreadig) No eed to detect 75

76 Fie-Graied Multithreadig 76

77 Fie-Graied Multithreadig Idea: Hardware has multiple thread cotexts (PC+registers). Each cycle, fetch egie fetches from a differet thread. By the time the fetched brach/istructio resolves, o istructio is fetched from the same thread Brach/istructio resolutio latecy overlapped with executio of other threads istructios + No logic eeded for hadlig cotrol ad data depedeces withi a thread -- Sigle thread performace suffers -- Extra logic for keepig thread cotexts -- Does ot overlap latecy if ot eough threads to cover the whole pipelie 77

78 Fie-Graied Multithreadig (II) Idea: Switch to aother thread every cycle such that o two istructios from a thread are i the pipelie cocurretly Tolerates the cotrol ad data depedecy latecies by overlappig the latecy with useful work from other threads Improves pipelie utilizatio by takig advatage of multiple threads Thorto, Parallel Operatio i the Cotrol Data 66, AFIPS 964. Smith, A pipelied, shared resource MIMD computer, ICPP

79 Fie-Graied Multithreadig: History CDC 66 s peripheral processig uit is fie-graied multithreaded Thorto, Parallel Operatio i the Cotrol Data 66, AFIPS 964. Processor executes a differet I/O thread every cycle A operatio from the same thread is executed every cycles Deelcor HEP (Heterogeeous Elemet Processor) Smith, A pipelied, shared resource MIMD computer, ICPP threads/processor available ueue vs. uavailable (waitig) ueue for threads each thread ca have oly istructio i the processor pipelie; each thread idepedet to each thread, processor looks like a o-pipelied machie system throughput vs. sigle thread performace tradeoff 79

80 Fie-Graied Multithreadig i HEP Cycle time: s 8 stages à 8 s to complete a istructio assumig o memory access No cotrol ad data depedecy checkig 8

81 Multithreaded Pipelie Example Slide credit: Joel Emer 8

82 Su Niagara Multithreaded Pipelie Kogetira et al., Niagara: A 32-Way Multithreaded Sparc Processor, IEEE Micro

83 Fie-graied Multithreadig Advatages + No eed for depedecy checkig betwee istructios (oly oe istructio i pipelie from a sigle thread) + No eed for brach predictio logic + Otherwise-bubble cycles used for executig useful istructios from differet threads + Improved system throughput, latecy tolerace, utilizatio Disadvatages - Extra hardware complexity: multiple hardware cotexts (PCs, register files, ), thread selectio logic - Reduced sigle thread performace (oe istructio fetched every N cycles from the same thread) - Resource cotetio betwee threads i caches ad memory - Some depedecy checkig logic betwee threads remais (load/store) 83

84 Moder GPUs Are FGMT Machies 84

85 NVIDIA GeForce GTX 285 core 64 KB of storage for thread cotexts (registers) = data-parallel (SIMD) fuc. uit, cotrol shared across 8 uits = multiply-add = multiply = istructio stream decode = executio cotext storage Slide credit: Kayvo Fatahalia 85

86 NVIDIA GeForce GTX 285 core 64 KB of storage for thread cotexts (registers) Groups of 32 threads share istructio stream (each group is a Warp): they execute the same istructio o differet data Up to 32 warps are iterleaved i a FGMT maer Up to 24 thread cotexts ca be stored Slide credit: Kayvo Fatahalia 86

87 NVIDIA GeForce GTX 285 Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex 3 cores o the GTX 285: 3,72 threads Slide credit: Kayvo Fatahalia 87

88 Ed of Fie-Graied Multithreadig 88

CMSC Computer Architecture Lecture 5: Pipelining. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 5: Pipelining. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 5: Pipeliig Prof. Yajig Li Uiversity of Chicago Admiistrative Stuff Lab1 Due toight Lab2: out later today; due 2 weeks from ow Review sessio this Friday Turig award