CMSC Computer Architecture Lecture 5: Pipelining. Prof. Yanjing Li University of Chicago



2 Administrative Stuff
- Lab 1: due tonight
- Lab 2: out later today; due 2 weeks from now
- Review session this Friday
- Turing Award lecture tomorrow

3 Lecture Outline
- Pipelining basics and discussions
- Non-ideal pipeline

5 Single-Cycle uarch: Summary
- Inefficient
  - All instructions run as slow as the slowest instruction
- Not necessarily the simplest way to implement an ISA
  - Single-cycle implementation of REP MOVS (x86)?
- Not easy to optimize/improve performance
  - Optimizing the common case (e.g., common instructions) does not work
  - Need to optimize the worst case all the time
- Not all resources are fully utilized
  - e.g., data memory access can't overlap with ALU operation
- How to do better?

6 Single-Cycle, Multi-Cycle, Pipelining
- Single-cycle: 1 cycle per instruction, long cycle time
- Multi-cycle: 5 cycles per instruction, short cycle time
- Pipeline: 1 cycle per instruction (steady state), short cycle time
[Figure: timing diagrams of the F, D, E, M, W stages of successive instructions over time. Single-cycle and multi-cycle run the instructions back to back; the pipeline overlaps them, starting a new instruction each cycle.]

7 Instruction Pipelining: Basic Idea
- Pipeline the execution of multiple instructions
- Idea:
  - Divide the instruction processing into distinct stages of processing
  - Ensure there are enough hardware resources to process one instruction in each stage
  - Process a different instruction in each stage
    - Instructions consecutive in program order are processed in consecutive stages
- Benefit: increases instruction processing throughput
- Downside: start thinking about this

8 Pipelining Instruction Processing

9 Remember: Instruction Processing Steps
1. Instruction fetch (IF)
2. Instruction decode and register operand fetch (ID/RF)
3. Execute/evaluate memory address (EX/AG)
4. Memory operand fetch (MEM)
5. Store/writeback result (WB)

Remember the Single-Cycle uarch
Based on original figure from

11 Pipeline Operation Examples
- We'll look at load & store
- Show pipeline usage in a single cycle
- Highlight resources used

13 Adding Pipeline Registers
- Registers between stages to hold information produced in the previous cycle
[Figure: pipelined datapath with pipeline registers IR_D, PC_D; A_E, B_E, Imm_E, PC_E; Aout_M, B_M, PC_M; Aout_W, MDR_W]
**Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

23 Pipeline Operation Examples
Consider the following instruction sequence:
    LDUR X10, [X1, 40]
    SUB  X11, X2, X3
    ADD  X12, X3, X4
    LDUR X13, [X1, 48]
    ADD  X14, X5, X6

26 Illustrating Pipeline Operation: Operation View

         t0   t1   t2   t3   t4   t5
Inst 0   IF   ID   EX   MEM  WB
Inst 1        IF   ID   EX   MEM  WB
Inst 2             IF   ID   EX   MEM
Inst 3                  IF   ID   EX
Inst 4                       IF   ID

Steady state (full pipeline) is reached at t4, when all five stages are occupied; later instructions continue the same pattern.

27 Illustrating Pipeline Operation: Resource View

       t0   t1   t2   t3   t4   t5   t6   t7   t8   t9   t10
IF     I0   I1   I2   I3   I4   I5   I6   I7   I8   I9   I10
ID          I0   I1   I2   I3   I4   I5   I6   I7   I8   I9
EX               I0   I1   I2   I3   I4   I5   I6   I7   I8
MEM                   I0   I1   I2   I3   I4   I5   I6   I7
WB                         I0   I1   I2   I3   I4   I5   I6
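The resource view can be generated mechanically. The following is a minimal Python sketch, assuming an ideal five-stage pipeline with no stalls in which instruction i enters IF at cycle i. The names `resource_view` and `format_view` are illustrative and not part of the lecture.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def resource_view(num_instructions, num_cycles, stages=STAGES):
    """Return {stage: [instruction index or None for each cycle]}.

    Assumption: ideal pipeline, one instruction enters IF per cycle, no stalls.
    """
    table = {}
    for depth, stage in enumerate(stages):
        row = []
        for cycle in range(num_cycles):
            # Instruction i occupies the stage at position `depth` in cycle i + depth.
            inst = cycle - depth
            row.append(inst if 0 <= inst < num_instructions else None)
        table[stage] = row
    return table

def format_view(table):
    """Render the table as plain text, one row per stage."""
    lines = []
    for stage, row in table.items():
        cells = ["I%d" % i if i is not None else "--" for i in row]
        lines.append("%-4s %s" % (stage, " ".join("%-3s" % c for c in cells)))
    return "\n".join(lines)

view = resource_view(num_instructions=11, num_cycles=11)
print(format_view(view))
```

Reading a column of the output shows which instruction each stage holds in that cycle; from cycle 4 onward every stage is busy, which is the steady state.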


29 Pipelined Control
- Control signals derived from instruction
  - Decode once as in single-cycle implementation
  - Buffer signals until consumed
- What other options are there to derive pipeline control signals?
**Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

30 Pipelined Control + Datapath
Note:
1. Reg2Loc==0: instruction[20:16] is selected; Reg2Loc==1: instruction[4:0] is selected
2. instruction[9:5] is the input to Read register 1
**Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

31 Performance Analysis

32 Terminologies and Definitions
- CPI: cycles per instruction
- IPC: instructions per cycle, which is 1/CPI
- Execution time of an instruction
  - {CPI} x {clock cycle time}
- Execution time of a program (Iron Law)
  - Sum over all instructions [ {CPI} x {clock cycle time} ]
  - {# of instructions} x {average CPI} x {clock cycle time}

33 Examples
- Remember: execution time of a program
  - Sum over all instructions [ {CPI} x {clock cycle time} ]
  - {# of instructions} x {average CPI} x {clock cycle time}
- Single-cycle uarch
  - CPI = 1, but clock cycle time is long
- Multi-cycle uarch (with 5 stages)
  - CPI = 5, but clock cycle time is short
- Pipelined uarch (with 5 stages)
  - CPI = 1 (steady state), clock cycle time same as multi-cycle
  - This is the ideal case
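To make the comparison concrete, here is a small Python sketch of the Iron Law applied to the three designs. The instruction count and the 200 ps stage delay are hypothetical numbers chosen for illustration. It also assumes five equal-latency stages, that every instruction uses all five, and that pipeline fill time is ignored (steady state).

```python
def execution_time_ps(num_instructions, average_cpi, cycle_time_ps):
    """Iron Law: {# of instructions} x {average CPI} x {clock cycle time}."""
    return num_instructions * average_cpi * cycle_time_ps

NUM_INSTRUCTIONS = 1000   # hypothetical program length
NUM_STAGES = 5
STAGE_DELAY_PS = 200      # hypothetical, identical for every stage

# Single-cycle: CPI = 1, but the cycle must cover all five stages.
single_cycle = execution_time_ps(NUM_INSTRUCTIONS, 1, NUM_STAGES * STAGE_DELAY_PS)
# Multi-cycle: CPI = 5, short cycle.
multi_cycle = execution_time_ps(NUM_INSTRUCTIONS, NUM_STAGES, STAGE_DELAY_PS)
# Pipelined (ideal, steady state): CPI = 1, short cycle.
pipelined = execution_time_ps(NUM_INSTRUCTIONS, 1, STAGE_DELAY_PS)

print("single-cycle:", single_cycle, "ps")
print("multi-cycle: ", multi_cycle, "ps")
print("pipelined:   ", pipelined, "ps")
print("ideal speedup over single-cycle:", single_cycle / pipelined)
```

Under these assumptions the single-cycle and multi-cycle designs take the same total time, and the ideal pipeline is faster by a factor equal to the number of stages. Real pipelines fall short of this for the reasons discussed in the rest of the lecture.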

34 Pipelining: Discussions

36 Pipeline Considerations
- How to partition?
- How many stages?

37 Pipeline Partitioning: Resource Requirement
- The goal: no shared resources among different pipeline stages
  - i.e., no resource is used by more than 1 stage
  - Otherwise, we have resource contention or structural hazard
- Example: need to be able to fetch instructions (in IF stage) and load data (in MEM stage) at the same time
  - Single memory interface not sufficient
  - Solution 1: provide two separate interfaces via instruction and data caches
  - Solution 2: ??

38 How Many Pipeline Stages?
- BW (bandwidth), a.k.a. throughput (1 / cycle time)
- Ideally, sequential elements (pipeline registers) do not impose additional delays/cost

Partition of combinational logic    Delay per stage    BW
(F,D,E,M,W)                         T ps               ~(1/T)
(F,D,E) | (M,W)                     T/2 ps             ~(2/T)
(F,D) | (E,M) | (M,W)               T/3 ps             ~(3/T)

39 Pipeline Stages and Impact on Performance
- Nonpipelined version with delay T
  - BW = 1 / (T + S), where S = sequential element delay
- k-stage pipelined version
  - BW_{k-stage} = 1 / (T/k + S)
  - BW_{max} = 1 / (1 gate delay + S)
- Sequential element delay reduces BW (switching overhead between stages)
[Figure: one block of T ps logic vs. k stages of T/k ps each]
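The effect of the sequential element delay S is easy to see numerically. The sketch below evaluates BW = 1 / (T/k + S) for a few stage counts; the values T = 1000 ps and S = 50 ps are hypothetical and chosen only for illustration.

```python
def bandwidth(logic_delay_ps, seq_delay_ps, k=1):
    """Throughput of a k-stage pipeline: 1 / (T/k + S), in instructions per ps."""
    return 1.0 / (logic_delay_ps / k + seq_delay_ps)

def speedup(logic_delay_ps, seq_delay_ps, k):
    """Throughput of the k-stage version relative to the nonpipelined version."""
    return bandwidth(logic_delay_ps, seq_delay_ps, k) / bandwidth(logic_delay_ps, seq_delay_ps, 1)

T_PS = 1000.0  # hypothetical total combinational delay
S_PS = 50.0    # hypothetical pipeline-register delay

for k in (1, 2, 5, 10, 20):
    print("k=%2d  cycle=%6.1f ps  speedup=%.2f" % (k, T_PS / k + S_PS, speedup(T_PS, S_PS, k)))
```

With these numbers the speedup grows more slowly than k: each added stage shortens the logic delay per stage but pays the full S again, so the throughput is bounded by roughly 1/S no matter how finely the logic is divided.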

40 Pipeline Stages and Impact on HW Cost
- Nonpipelined version with combinational cost G
  - Cost = G + L, where L = sequential element cost
- k-stage pipelined version
  - Cost_{k-stage} ~= G + Lk
- Sequential elements increase hardware cost
[Figure: one block of G gates vs. k stages of G/k gates each]
- It is critical to balance the tradeoffs, i.e., how many stages and what is done in each stage

41 Ideal vs. Non-Ideal Pipelines

42 Properties of an Ideal Pipeline
- Goal: increase throughput with little increase in cost (hardware cost, in case of instruction processing)
- Repetition of identical operations
  - The same operation is repeated on a large number of different inputs (e.g., all laundry loads go through the same steps)
- Uniformly partitionable suboperations
  - Processing can be evenly divided into uniform-latency suboperations (that do not share resources)
- Repetition of independent operations
  - No dependencies between repeated operations
- Can you implement an ideal pipeline for instruction processing?

43 Instruction Pipeline: Not Ideal
- Identical operations ... NOT!
  => different instructions -> not all need the same stages
  - Forcing different instructions to go through the same pipe stages -> external fragmentation (some pipe stages idle for some instructions)
- Uniform suboperations ... NOT!
  => different pipeline stages -> not the same latency
  - Need to force each stage to be controlled by the same clock -> internal fragmentation (some pipe stages are too fast but all take the same clock cycle time)
- Independent operations ... NOT!
  => instructions are not independent of each other
  - Need to detect and resolve inter-instruction dependencies to ensure the pipeline provides correct results -> pipeline stalls (pipeline is not always moving)

44 Instruction Pipeline: Not Ideal
- Identical operations ... NOT!
  => different instructions -> not all need the same stages
  - Forcing different instructions to go through the same pipe stages -> external fragmentation (some pipe stages idle for some instructions)
- Examples
  - Add, Branch: no need to go through the MEM stage
  - Others?
- Performance impact?

45 Instruction Pipeline: Not Ideal
- Uniform suboperations ... NOT!
  => different pipeline stages -> not the same latency
  - Need to force each stage to be controlled by the same clock -> internal fragmentation (some pipe stages are too fast but all take the same clock cycle time)

46 Non-Uniform Operations: Laundry Analogy
[Figure: laundry analogy, task order A to D plotted against time (6 PM to AM), shown once sequentially and once pipelined. Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]]
- The slowest step decides throughput or cycle time

47 Non-Uniform Operations: Example
- Stage latencies: 200 ps, 100 ps, 200 ps, 200 ps, 100 ps
[Figure: pipelined datapath with pipeline registers IR_D, PC_D; A_E, B_E, Imm_E, PC_E; Aout_M, B_M, PC_M; Aout_W, MDR_W. Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]]

48 Non-Uniform Operations: Example
[Figure: program execution order vs. time for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), each passing through Instruction fetch, Reg, ALU, Data access, Reg. Top: nonpipelined, a new instruction starts every 800 ps. Bottom: pipelined, a new instruction starts every 200 ps and every stage takes 200 ps.]
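The numbers in this example can be reproduced with a short Python sketch. It assumes the five latencies listed on the previous slide map, in order, to Instruction fetch, Reg read, ALU, Data access, and Reg write; that mapping and the function names are assumptions made for illustration.

```python
# Assumed mapping of the slide's latencies (in ps) to the five stages, in order.
STAGE_LATENCY_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

def single_cycle_time(latencies):
    """Nonpipelined: one long cycle that covers every stage."""
    return sum(latencies.values())

def pipelined_cycle_time(latencies):
    """Pipelined: every stage is clocked at the speed of the slowest stage."""
    return max(latencies.values())

def single_cycle_total(num_instructions, latencies):
    return num_instructions * single_cycle_time(latencies)

def pipelined_total(num_instructions, latencies):
    # The first instruction needs one cycle per stage; each later one finishes a cycle after.
    cycles = len(latencies) + num_instructions - 1
    return cycles * pipelined_cycle_time(latencies)

print(single_cycle_time(STAGE_LATENCY_PS), "ps per instruction, nonpipelined")
print(pipelined_cycle_time(STAGE_LATENCY_PS), "ps per cycle, pipelined")
print(single_cycle_total(3, STAGE_LATENCY_PS), "ps for three loads, nonpipelined")
print(pipelined_total(3, STAGE_LATENCY_PS), "ps for three loads, pipelined")
```

The pipelined clock is set by the 200 ps stages, so the two 100 ps stages sit idle for half of every cycle. This is the internal fragmentation described earlier, and it is why the speedup is 800/200 = 4 in steady state rather than 5.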

49 Instruction Pipeline: Not Ideal
- Independent operations ... NOT!
  => instructions are not independent of each other
  - Need to detect and resolve inter-instruction dependencies to ensure the pipeline provides correct results -> pipeline stalls (pipeline is not always moving)

50 Dependencies and Their Types
- Also called hazards
- Two types
  - Data dependency
  - Control dependency

51 Data Dependency Handling

52 Data Dependency Types
- Flow dependency: Read-after-Write (RAW)
    r3 <- r1 op r2
    r5 <- r3 op r4
- Anti dependency: Write-after-Read (WAR)
    r5 <- r3 op r4
    r3 <- r6 op r7
- Output dependency: Write-after-Write (WAW)
    r3 <- r1 op r2
    r5 <- r3 op r4
    r3 <- r6 op r7
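The three dependency types can be told apart purely from register names. The following Python sketch classifies the dependencies between an earlier and a later instruction; the `Inst` tuple and the `dependencies` function are illustrative names, and the three instructions are the ones used on the slide.

```python
from collections import namedtuple

# Illustrative instruction model: one destination register and a tuple of sources.
Inst = namedtuple("Inst", ["dest", "srcs"])

def dependencies(earlier, later):
    """Return the set of dependency types from `earlier` to `later` (program order)."""
    kinds = set()
    if earlier.dest is not None and earlier.dest in later.srcs:
        kinds.add("RAW")   # flow: later reads what earlier writes
    if later.dest is not None and later.dest in earlier.srcs:
        kinds.add("WAR")   # anti: later overwrites what earlier reads
    if earlier.dest is not None and earlier.dest == later.dest:
        kinds.add("WAW")   # output: both write the same register
    return kinds

i1 = Inst("r3", ("r1", "r2"))   # r3 <- r1 op r2
i2 = Inst("r5", ("r3", "r4"))   # r5 <- r3 op r4
i3 = Inst("r3", ("r6", "r7"))   # r3 <- r6 op r7

print(dependencies(i1, i2))  # flow dependency on r3
print(dependencies(i2, i3))  # anti dependency on r3
print(dependencies(i1, i3))  # output dependency on r3
```

Only the first check depends on a value actually flowing between instructions; the other two arise because the same register name is reused.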

53 Data Dependency Types
- Flow dependencies always need to be obeyed because they constitute true dependence on a value
- Anti and output dependencies exist due to limited number of architectural registers
  - They are dependence on a name, not a value
- Anti and output dependences are easier to handle
  - Write to the destination in one stage and in program order
- Flow dependences are more interesting

54 Ways of Handling Flow Dependencies
- Detect and wait until value is available in register file
- Detect and forward/bypass data to dependent instruction
- Detect and eliminate the dependence at the software level
  - No need for the hardware to detect dependence
- Predict the needed value(s), execute speculatively, and verify
- Do something else (fine-grained multithreading)
  - No need to detect

55 Flow Dependency Example
Consider this sequence:
    SUB  X2, X1, X3
    AND  X12, X2, X5
    OR   X13, X2, X6
    ADD  X14, X2, X2
    STUR X15, [X2, #100]

56 Flow Dependency Example

Time ->
SUB  X2, X1, X3          IF  ID  EX  MEM WB
AND  X12, X2, X5             IF  ID  EX  MEM WB
OR   X13, X2, X6                 IF  ID  EX  MEM ?
ADD  X14, X2, X2                     IF  ID  EX
STUR X15, [X2, #100]                     IF  ID

- SUB writing to X2 and ADD reading from it in the same cycle
- Assume internal forwarding in register file, i.e., ADD gets the new X2 value produced from SUB

57 How to Detect Flow Dependencies in HW?

       R/I-Type    LDUR        STUR       B
IF
ID     read RF     read RF     read RF
EX
MEM
WB     write RF    write RF

Instructions I_A and I_B (where I_A comes before I_B) have RAW dependency iff
- I_B (R/I, LDUR, or STUR) reads a register written by I_A (R/I or LDUR), and
- dist(I_A, I_B) < dist(ID, WB) = 3

58 Flow Dependency Check Logic
- Helper functions
  - Op1(I) and Op2(I) return the 1st and 2nd register operand field of I, respectively
  - Use_Op1(I) returns true if I requires the 1st register operand and the register is not X31; similarly for Use_Op2(I)
- Flow dependency occurs when
       (Op1(IR_ID) == dest_EX)  && Use_Op1(IR_ID) && RegWrite_EX
    or (Op1(IR_ID) == dest_MEM) && Use_Op1(IR_ID) && RegWrite_MEM
    or (Op2(IR_ID) == dest_EX)  && Use_Op2(IR_ID) && RegWrite_EX
    or (Op2(IR_ID) == dest_MEM) && Use_Op2(IR_ID) && RegWrite_MEM
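The four-term condition above translates directly into code. Below is a minimal Python sketch of the check, not the hardware implementation from the lecture. The `DecodeStage` and `LaterStage` tuples are hypothetical stand-ins for the pipeline-register fields, registers are modeled as plain integers, and which source register counts as operand 1 is chosen arbitrarily here.

```python
from collections import namedtuple

XZR = 31  # register X31, which Use_Op1/Use_Op2 exclude

# Hypothetical snapshots of what the ID, EX, and MEM stages hold in one cycle.
DecodeStage = namedtuple("DecodeStage", ["op1", "op2", "needs_op1", "needs_op2"])
LaterStage = namedtuple("LaterStage", ["dest", "reg_write"])

def use_op(reg, needed):
    """Use_Op: the operand is required and is not X31."""
    return needed and reg != XZR

def flow_dependency(id_stage, ex_stage, mem_stage):
    """True if the instruction in ID reads a register that the instruction
    in EX or MEM will write (the four OR-ed terms of the slide)."""
    use1 = use_op(id_stage.op1, id_stage.needs_op1)
    use2 = use_op(id_stage.op2, id_stage.needs_op2)
    for stage in (ex_stage, mem_stage):
        if not stage.reg_write:
            continue
        if (use1 and id_stage.op1 == stage.dest) or (use2 and id_stage.op2 == stage.dest):
            return True
    return False

# AND X12, X2, X5 is in ID while SUB X2, X1, X3 is in EX.
and_in_id = DecodeStage(op1=2, op2=5, needs_op1=True, needs_op2=True)
sub_in_ex = LaterStage(dest=2, reg_write=True)
no_write = LaterStage(dest=0, reg_write=False)
print(flow_dependency(and_in_id, sub_in_ex, no_write))
```

Only EX and MEM are checked because, with internal forwarding in the register file, a producer already in WB delivers its value to a consumer in ID in the same cycle.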

59 Resolving Data Dependence
- Option 1: Stall the pipeline (i.e., inserting "bubbles")
[Figure: pipeline diagram over t0 to t5 for Inst h, i, j, k, l (IF, ID, ALU, MEM, WB), with the dependent instruction j held in ID. Cases shown for i: rx <- _ followed by j: _ <- rx at dist(i,j)=1, dist(i,j)=2, and dist(i,j)=3, with bubbles shown for dist(i,j)=1 and dist(i,j)=2.]
- Stall = make the dependent instruction wait until its source data value is available
  1. Stop all up-stream stages
  2. Drain all down-stream stages

60 Resolving Data Dependence
- Option 1: Stall the pipeline (i.e., inserting "bubbles")
[Figure: the same pipeline diagram over t0 to t5, with the bubbles drawn as nops inserted between Inst i and Inst j. Cases shown for i: rx <- _ followed by j: _ <- rx at dist(i,j)=1 and dist(i,j)=2.]
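The number of bubbles a stall inserts follows from the detection rule dist(I_A, I_B) < dist(ID, WB) = 3. The Python sketch below schedules an in-order instruction sequence and counts the bubbles in front of each instruction. It assumes stalling is the only mechanism, apart from the internal forwarding in the register file assumed earlier, so a consumer may be in ID in the same cycle its producer is in WB. The function names and the (dest, srcs) encoding are illustrative, and X31 is not treated specially.

```python
ID_TO_WB = 3  # dist(ID, WB): ID -> EX -> MEM -> WB

def schedule_id_cycles(program):
    """program: list of (dest, srcs) in program order; dest may be None.
    Returns the cycle in which each instruction reads its operands in ID."""
    id_cycles = []
    for idx, (_dest, srcs) in enumerate(program):
        # In order, at most one instruction per cycle; the first one is in ID at cycle 1.
        cycle = id_cycles[-1] + 1 if id_cycles else 1
        for prev in range(idx):
            prev_dest = program[prev][0]
            if prev_dest is not None and prev_dest in srcs:
                # Wait until the producer reaches WB.
                cycle = max(cycle, id_cycles[prev] + ID_TO_WB)
        id_cycles.append(cycle)
    return id_cycles

def bubbles(id_cycles):
    """Bubbles inserted in front of each instruction."""
    counts, prev = [], 0
    for cycle in id_cycles:
        counts.append(cycle - prev - 1)
        prev = cycle
    return counts

# The sequence from the flow dependency example.
program = [
    ("X2", ("X1", "X3")),    # SUB  X2, X1, X3
    ("X12", ("X2", "X5")),   # AND  X12, X2, X5
    ("X13", ("X2", "X6")),   # OR   X13, X2, X6
    ("X14", ("X2", "X2")),   # ADD  X14, X2, X2
    (None, ("X15", "X2")),   # STUR X15, [X2, #100]
]
cycles = schedule_id_cycles(program)
print(cycles)
print(bubbles(cycles))
```

Under these assumptions a consumer at distance 1 from its producer waits two cycles, one at distance 2 waits one cycle, and one at distance 3 does not wait. Once AND has been delayed, the instructions behind it are delayed with it and need no bubbles of their own.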
