CMSC Computer Architecture Lecture 5: Pipelining. Prof. Yanjing Li University of Chicago

CMSC 22200 Computer Architecture Lecture 5: Pipelining Prof. Yanjing Li University of Chicago

Administrative Stuff
- Lab1: due tonight
- Lab2: out later today; due 2 weeks from now
- Review session this Friday
- Turing award lecture tomorrow

Lecture Outline
- Pipelining basics and discussions
- Non-ideal pipeline

Single-Cycle uarch: Summary
- Inefficient
  - All instructions run as slow as the slowest instruction
- Not necessarily the simplest way to implement an ISA
  - Single-cycle implementation of REP MOVS (x86)?
- Not easy to optimize/improve performance
  - Optimizing the common case (e.g., common instructions) does not work
  - Need to optimize the worst case all the time
- Not all resources are fully utilized
  - e.g., data memory access can't overlap with ALU operation
- How to do better?

Single-Cycle, Multi-Cycle, Pipelining
- Single-cycle: 1 cycle per instruction, long cycle time
- Multi-cycle: 5 cycles per instruction, short cycle time
- Pipeline: 1 cycle per instruction (steady state), short cycle time
[Figure: timelines of successive instructions passing through stages F, D, E, M, W under each of the three schemes; horizontal axis is time]

Instruction Pipelining: Basic Idea
- Pipeline the execution of multiple instructions
- Idea:
  - Divide the instruction processing into distinct "stages" of processing
  - Ensure there are enough hardware resources to process one instruction in each stage
  - Process a different instruction in each stage
    - Instructions consecutive in program order are processed in consecutive stages
- Benefit: Increases instruction processing throughput
- Downside: Start thinking about this

Pipelining Instruction Processing

Remember: Instruction Processing Steps
1. Instruction fetch (IF)
2. Instruction decode and register operand fetch (ID/RF)
3. Execute/Evaluate memory address (EX/AG)
4. Memory operand fetch (MEM)
5. Store/writeback result (WB)

Pipeline Operation Examples
- We'll look at load & store
- Show pipeline usage in a single cycle
- Highlight resources used

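Adding Pipeline Registers
- Registers between stages to hold information produced in previous cycle
[Figure: pipelined datapath with pipeline registers IR_D, PC_D (into decode); A_E, B_E, Imm_E, PC_E (into execute); Aout_M, B_M, PC_M (into memory); MDR_W, Aout_W (into writeback)]
** Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]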

Pipeline Operation Examples
Consider the following instruction sequence:
  LDUR X10, [X1, 40]
  SUB  X11, X2, X3
  ADD  X12, X3, X4
  LDUR X13, [X1, 48]
  ADD  X14, X5, X6

Illustrating Pipeline Operation: Operation View

         t0    t1    t2    t3    t4    t5
  Inst0  IF    ID    EX    MEM   WB
  Inst1        IF    ID    EX    MEM   WB
  Inst2              IF    ID    EX    MEM
  Inst3                    IF    ID    EX
  Inst4                          IF    ID

Steady state (full pipeline) is reached once all five stages are occupied (t4 onward); every later instruction follows the same IF, ID, EX, MEM, WB staircase.

Illustrating Pipeline Operation: Resource View

        t0   t1   t2   t3   t4   t5   t6   t7   t8   t9   t10
  IF    I0   I1   I2   I3   I4   I5   I6   I7   I8   I9   I10
  ID         I0   I1   I2   I3   I4   I5   I6   I7   I8   I9
  EX              I0   I1   I2   I3   I4   I5   I6   I7   I8
  MEM                  I0   I1   I2   I3   I4   I5   I6   I7
  WB                        I0   I1   I2   I3   I4   I5   I6
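The resource view can also be generated programmatically. The following is a minimal Python sketch, assuming an ideal five-stage pipeline with no stalls; the function names `resource_view` and `format_view` are illustrative and not part of the lecture.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def resource_view(num_instructions, stages=STAGES):
    """Map each stage to the instruction index it holds in every cycle.

    Assumes an ideal pipeline: instruction i occupies stage s in cycle i + s.
    A value of None means the stage is empty in that cycle.
    """
    num_cycles = num_instructions + len(stages) - 1
    table = {}
    for s, stage in enumerate(stages):
        row = []
        for t in range(num_cycles):
            i = t - s
            row.append(i if 0 <= i < num_instructions else None)
        table[stage] = row
    return table

def format_view(table):
    """Render the table as text, one row per stage, one column per cycle."""
    lines = []
    for stage, row in table.items():
        cells = ["I%d" % i if i is not None else "--" for i in row]
        lines.append("%-4s" % stage + " ".join("%-3s" % c for c in cells))
    return "\n".join(lines)

view = resource_view(5)
print(format_view(view))
```

Reading a column of the result top to bottom gives the instructions in flight during that cycle; reading a row gives the sequence of instructions that pass through one stage.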

Pipelined Control
- Control signals derived from instruction
  - Decode once as in single-cycle implementation
  - Buffer signals until consumed
- What other options are there to derive pipeline control signals?
** Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

Pipelined Control + Datapath
Note:
1. Reg2Loc==0: instruction[20:16] is selected; Reg2Loc==1: instruction[4:0] is selected
2. instruction[9:5] is the input to Read register 1
** Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

Performance Analysis

Terminologies and Definitions
- CPI: cycles per instruction
- IPC: instructions per cycle, which is 1/CPI
- Execution time of an instruction
  - {CPI} x {clock cycle time}
- Execution time of a program (Iron Law)
  - Sum over all instructions [ {CPI} x {clock cycle time} ]
  - = {# of instructions} x {average CPI} x {clock cycle time}

Examples
- Remember: execution time of a program
  - Sum over all instructions [ {CPI} x {clock cycle time} ]
  - = {# of instructions} x {average CPI} x {clock cycle time}
- Single-cycle uarch
  - CPI = 1, but clock cycle time is long
- Multi-cycle uarch (with 5 stages)
  - CPI = 5, but clock cycle time is short
- Pipelined uarch (with 5 stages)
  - CPI = 1 (steady state), clock cycle time same as multi-cycle
  - This is the ideal case
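To make the comparison concrete, here is a small Python sketch of the Iron Law applied to the three microarchitectures. The instruction count and the two cycle times (800 ps and 200 ps) are assumed values chosen only for illustration, and the pipelined case ignores fill/drain cycles and stalls.

```python
def execution_time_ps(num_instructions, avg_cpi, cycle_time_ps):
    """Iron Law: {# of instructions} x {average CPI} x {clock cycle time}."""
    return num_instructions * avg_cpi * cycle_time_ps

N = 1000                # assumed instruction count
LONG_CYCLE_PS = 800     # assumed single-cycle clock period
SHORT_CYCLE_PS = 200    # assumed multi-cycle / pipelined clock period

single_cycle = execution_time_ps(N, avg_cpi=1, cycle_time_ps=LONG_CYCLE_PS)
multi_cycle = execution_time_ps(N, avg_cpi=5, cycle_time_ps=SHORT_CYCLE_PS)
pipelined = execution_time_ps(N, avg_cpi=1, cycle_time_ps=SHORT_CYCLE_PS)

print("single-cycle:", single_cycle, "ps")
print("multi-cycle: ", multi_cycle, "ps")
print("pipelined:   ", pipelined, "ps (ideal, steady state)")
```

With these assumed numbers the multi-cycle design comes out slower than the single-cycle one, because every instruction is charged five short cycles; the ideal pipeline keeps the short cycle while bringing CPI back to 1.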

Pipelining: Discussions

Pipeline Considerations
- How to partition?
- How many stages?

Pipeline Partitioning: Resource Requirement
- The goal: no shared resources among different pipeline stages
  - i.e., no resource is used by more than 1 stage
  - Otherwise, we have resource contention or structural hazard
- Example: need to be able to fetch instructions (in IF stage) and load data (in MEM stage) at the same time
  - Single memory interface not sufficient
  - Solution 1: provide two separate interfaces via instruction and data caches
  - Solution 2: ??

How Many Pipeline Stages?
- BW (bandwidth), a.k.a. throughput (1/cycle time)
- Ideally, sequential elements (pipeline registers) do not impose additional delays/cost

  Combinational logic partitioning      Delay per stage   BW
  (F,D,E,M,W)                           T ps              ~(1/T)
  (F,D,E) | (M,W)                       T/2 ps            ~(2/T)
  (F,D) | (E,M) | (M,W)                 T/3 ps            ~(3/T)

Pipeline Stages and Impact on Performance
- Non-pipelined version with delay T
  - BW = 1 / (T + S), where S = sequential element delay
- k-stage pipelined version (each stage has delay T/k)
  - BW_k-stage = 1 / (T/k + S)
  - BW_max = 1 / (1 gate delay + S)
- Sequential element delay reduces BW (switching overhead between stages)

Pipeline Stages and Impact on HW Cost
- Non-pipelined version with combinational cost G (G gates)
  - Cost = G + L, where L = sequential element cost
- k-stage pipelined version (each stage has G/k gates)
  - Cost_k-stage ~= G + Lk
- Sequential elements increase hardware cost
- It is critical to balance the tradeoffs
  - i.e., how many stages and what is done in each stage
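The throughput formula BW = 1 / (T/k + S) and the cost formula G + Lk can be tabulated side by side to see the tradeoff. This is a rough Python sketch; the values chosen for T, S, G, and L are assumptions made up for the illustration, not figures from the lecture.

```python
def throughput(T, S, k):
    """Operations per ps for a k-stage pipeline: 1 / (T/k + S)."""
    return 1.0 / (T / k + S)

def cost(G, L, k):
    """Hardware cost of a k-stage pipeline: G + L*k."""
    return G + L * k

# Assumed values, for illustration only.
T = 800.0   # total combinational delay in ps
S = 20.0    # sequential element (pipeline register) delay in ps
G = 10000   # combinational logic cost in gates
L = 500     # cost of one set of pipeline registers in gates

baseline = throughput(T, S, 1)
for k in (1, 2, 5, 10, 20):
    speedup = throughput(T, S, k) / baseline
    print("k=%2d  speedup=%5.2fx  cost=%d gates" % (k, speedup, cost(G, L, k)))
```

Because S does not shrink as k grows, the speedup always falls short of k, while the cost keeps rising linearly with the number of stages.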

Ideal vs. Non-Ideal Pipelines

Properties of an Ideal Pipeline
- Goal: Increase throughput with little increase in cost (hardware cost, in case of instruction processing)
- Repetition of identical operations
  - The same operation is repeated on a large number of different inputs (e.g., all laundry loads go through the same steps)
- Uniformly partitionable suboperations
  - Processing can be evenly divided into uniform-latency suboperations (that do not share resources)
- Repetition of independent operations
  - No dependencies between repeated operations
- Can you implement an ideal pipeline for instruction processing?

Instruction Pipeline: Not Ideal
- Identical operations ... NOT!
  ⇒ different instructions → not all need the same stages
  - Forcing different instructions to go through the same pipe stages
    → external fragmentation (some pipe stages idle for some instructions)
- Uniform suboperations ... NOT!
  ⇒ different pipeline stages → not the same latency
  - Need to force each stage to be controlled by the same clock
    → internal fragmentation (some pipe stages are too fast but all take the same clock cycle time)
- Independent operations ... NOT!
  ⇒ instructions are not independent of each other
  - Need to detect and resolve inter-instruction dependencies to ensure the pipeline provides correct results
    → pipeline stalls (pipeline is not always moving)

Instruction Pipeline: Not Ideal
- Uniform suboperations ... NOT!
  ⇒ different pipeline stages → not the same latency
  - Need to force each stage to be controlled by the same clock
    → internal fragmentation (some pipe stages are too fast but all take the same clock cycle time)

Non-Uniform Operations: Laundry Analogy
[Figure: two timing diagrams of laundry tasks A, B, C, D in task order, plotted against time from 6 PM to 2 AM]
- The slowest step decides throughput or cycle time
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

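Non-Uniform Operations: Example
[Figure: pipelined datapath with pipeline registers (IR_D, PC_D; A_E, B_E, Imm_E, PC_E; Aout_M, B_M, PC_M; MDR_W, Aout_W), annotated with stage latencies, left to right: 200 ps, 100 ps, 200 ps, 200 ps, 100 ps]
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]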

Non-Uniform Operations: Example
[Figure: program execution order vs. time for
   lw $1, 100($0)
   lw $2, 200($0)
   lw $3, 300($0)
 with each instruction passing through Instruction fetch, Reg, ALU, Data access, Reg.
 Top panel (non-pipelined): a new instruction starts every 800 ps.
 Bottom panel (pipelined): a new instruction starts every 200 ps, and every stage is given 200 ps.]
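The effect of non-uniform stages on cycle time can be worked through numerically. The sketch below is in Python and assumes that the five latencies from the example (200, 100, 200, 200, 100 ps) map to IF, ID, EX, MEM, WB in that order; the function names are illustrative.

```python
# Assumed mapping of the example latencies onto the five stages.
STAGE_LATENCY_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

def single_cycle_period(latencies):
    """Without pipelining, one clock cycle must cover all stages back to back."""
    return sum(latencies.values())

def pipelined_period(latencies):
    """With pipelining, the slowest stage sets the clock for every stage."""
    return max(latencies.values())

def total_time_single_cycle(n, latencies):
    return n * single_cycle_period(latencies)

def total_time_pipelined(n, latencies):
    # The first instruction needs one cycle per stage; each later one adds a cycle.
    return (len(latencies) + n - 1) * pipelined_period(latencies)

n = 3  # the three lw instructions in the example
print("single-cycle period:", single_cycle_period(STAGE_LATENCY_PS), "ps")
print("pipelined period:   ", pipelined_period(STAGE_LATENCY_PS), "ps")
print("3 loads, single-cycle:", total_time_single_cycle(n, STAGE_LATENCY_PS), "ps")
print("3 loads, pipelined:   ", total_time_pipelined(n, STAGE_LATENCY_PS), "ps")
```

Under these assumptions the speedup for three loads is well below 5x, and a single instruction takes 5 x 200 = 1000 ps in the pipeline versus 800 ps without it: the 100 ps stages are padded to the 200 ps clock, which is the internal fragmentation described above.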

Instruction Pipeline: Not Ideal
- Independent operations ... NOT!
  ⇒ instructions are not independent of each other
  - Need to detect and resolve inter-instruction dependencies to ensure the pipeline provides correct results
    → pipeline stalls (pipeline is not always moving)

Dependencies and Their Types
- Also called "hazards"
- Two types
  - Data dependency
  - Control dependency

Data Dependency Handling

Data Dependency Types
- Flow dependency: Read-after-Write (RAW)
    r3 ← r1 op r2
    r5 ← r3 op r4
- Anti dependency: Write-after-Read (WAR)
    r5 ← r3 op r4
    r3 ← r6 op r7
- Output dependency: Write-after-Write (WAW)
    r3 ← r1 op r2
    r5 ← r3 op r4
    r3 ← r6 op r7
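The three dependency types can be told apart by comparing destination and source registers of an earlier and a later instruction. The following Python sketch shows one way to do that; the `Instr` tuple and the `dependencies` function are hypothetical helpers introduced only for this illustration.

```python
from collections import namedtuple

# Hypothetical instruction record: one destination register, a tuple of sources.
Instr = namedtuple("Instr", ["dest", "srcs"])

def dependencies(earlier, later):
    """Return the set of dependency types that `later` has on `earlier`."""
    kinds = set()
    if earlier.dest is not None and earlier.dest in later.srcs:
        kinds.add("RAW")   # flow: later reads what earlier writes
    if later.dest is not None and later.dest in earlier.srcs:
        kinds.add("WAR")   # anti: later overwrites what earlier reads
    if earlier.dest is not None and earlier.dest == later.dest:
        kinds.add("WAW")   # output: both write the same register
    return kinds

i1 = Instr(dest="r3", srcs=("r1", "r2"))   # r3 <- r1 op r2
i2 = Instr(dest="r5", srcs=("r3", "r4"))   # r5 <- r3 op r4
i3 = Instr(dest="r3", srcs=("r6", "r7"))   # r3 <- r6 op r7

print(dependencies(i1, i2))
print(dependencies(i2, i3))
print(dependencies(i1, i3))
```

Only the RAW check compares a written register against a value that is actually consumed; the WAR and WAW checks fire purely because the register name r3 is reused.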

Data Dependency Types
- Flow dependencies always need to be obeyed because they constitute true dependence on a value
- Anti and output dependencies exist due to limited number of architectural registers
  - They are dependence on a name, not a value
- Anti and output dependencies are easier to handle
  - Write to the destination in one stage and in program order
- Flow dependencies are more interesting

Ways of Handling Flow Dependencies
- Detect and wait until value is available in register file
- Detect and forward/bypass data to dependent instruction
- Detect and eliminate the dependence at the software level
  - No need for the hardware to detect dependence
- Predict the needed value(s), execute "speculatively", and verify
- Do something else (fine-grained multithreading)
  - No need to detect

Flow Dependency Example
Consider this sequence:
  SUB  X2, X1, X3
  AND  X12, X2, X5
  OR   X13, X2, X6
  ADD  X14, X2, X2
  STUR X15, [X2, #100]

Flow Dependency Example

  Time →
  SUB  X2, X1, X3        IF   ID   EX   MEM  WB
  AND  X12, X2, X5            IF   ID   EX   MEM  WB
  OR   X13, X2, X6                 IF   ID   EX   MEM
  ADD  X14, X2, X2                      IF   ID   EX
  STUR X15, [X2, #100]                       IF   ID

- ?: SUB writing to X2 and ADD reading from it in the same cycle
- Assume internal forwarding in register file
  - i.e., ADD gets the new X2 value produced from SUB

How to Detect Flow Dependencies in HW?

        R/I-Type   LDUR       STUR      B
  IF
  ID    read RF    read RF    read RF
  EX
  MEM
  WB    write RF   write RF

Instructions I_A and I_B (where I_A comes before I_B) have RAW dependency iff
- I_B (R/I, LDUR, or STUR) reads a register written by I_A (R/I or LDUR), and
- dist(I_A, I_B) < dist(ID, WB) = 3

Flow Dependency Check Logic
- Helper functions
  - Op1(I) and Op2(I) return the 1st and 2nd register operand field of I, respectively
  - Use_Op1(I) returns true if I requires the 1st register operand and the register is not X31; similarly for Use_Op2(I)
- Flow dependency occurs when
     (Op1(IR_ID) == dest_EX)  && Use_Op1(IR_ID) && RegWrite_EX
  or (Op1(IR_ID) == dest_MEM) && Use_Op1(IR_ID) && RegWrite_MEM
  or (Op2(IR_ID) == dest_EX)  && Use_Op2(IR_ID) && RegWrite_EX
  or (Op2(IR_ID) == dest_MEM) && Use_Op2(IR_ID) && RegWrite_MEM
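The check logic translates almost directly into code. Below is a minimal Python sketch of the same four-term condition; the `IdInstr` and `StageInfo` records are hypothetical stand-ins for the instruction in ID and for the dest/RegWrite signals carried in the EX and MEM pipeline registers, and an unused operand is represented as `None`.

```python
from collections import namedtuple

XZR = 31  # X31 always reads as zero, so it never carries a dependency

# Hypothetical records for this sketch (not names from the lecture).
IdInstr = namedtuple("IdInstr", ["op1", "op2"])             # register numbers or None
StageInfo = namedtuple("StageInfo", ["dest", "reg_write"])  # signals held in EX or MEM

def use_op(reg):
    """Plays the role of Use_Op1/Use_Op2: operand is required and is not X31."""
    return reg is not None and reg != XZR

def flow_dependency(ir_id, ex, mem):
    """True if the instruction in ID reads a register that EX or MEM will write."""
    for reg in (ir_id.op1, ir_id.op2):
        if not use_op(reg):
            continue
        for stage in (ex, mem):
            if stage.reg_write and stage.dest == reg:
                return True
    return False

EMPTY = StageInfo(dest=None, reg_write=False)
sub_stage = StageInfo(dest=2, reg_write=True)   # SUB X2, X1, X3 further down the pipe
and_in_id = IdInstr(op1=2, op2=5)               # AND X12, X2, X5 in ID

stall_needed = flow_dependency(and_in_id, ex=sub_stage, mem=EMPTY)
print(stall_needed)
```

The WB stage is deliberately not checked, which matches the assumption of internal forwarding in the register file: a value written in WB is visible to a read in ID during the same cycle.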

Resolving Data Dependence
Option 1: Stall the pipeline (i.e., inserting "bubbles")
[Figure: pipeline timing diagram (t0 to t5) for Inst h, i, j, k, l moving through IF, ID, ALU, MEM, WB, where j reads a register r_x written by i. Side panels show i: r_x ← _ followed by j: _ ← r_x for dist(i,j) = 1, 2, and 3, with bubbles shown for dist(i,j) = 1 and 2.]
Stall = make the dependent instruction wait until its source data value is available
1. stop all up-stream stages
2. drain all down-stream stages
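The stalling policy can be sketched as a small scheduling calculation in Python. It assumes the five-stage pipeline from this lecture, dist(ID, WB) = 3, internal forwarding in the register file (a read in ID sees a write in WB in the same cycle), and no other source of stalls; `schedule_id_cycles` and the program encoding are made up for this illustration.

```python
ID_TO_WB = 3  # dist(ID, WB) in the five-stage pipeline

def schedule_id_cycles(program):
    """Return the cycle in which each instruction leaves ID, stalling on RAW.

    `program` is a list of (dest, srcs) pairs in program order, with dest set
    to None for instructions that write no register (e.g. a store).
    Instruction 0 is fetched in cycle 0.
    """
    writer_id_cycle = {}   # register -> ID cycle of its most recent writer
    id_cycles = []
    prev = 0
    for dest, srcs in program:
        cycle = prev + 1   # earliest slot: right behind the previous instruction
        for reg in srcs:
            if reg in writer_id_cycle:
                # Wait until the producer reaches WB.
                cycle = max(cycle, writer_id_cycle[reg] + ID_TO_WB)
        id_cycles.append(cycle)
        if dest is not None:
            writer_id_cycle[dest] = cycle
        prev = cycle
    return id_cycles

program = [
    ("X2", ("X1", "X3")),    # SUB  X2, X1, X3
    ("X12", ("X2", "X5")),   # AND  X12, X2, X5
    ("X13", ("X2", "X6")),   # OR   X13, X2, X6
    ("X14", ("X2", "X2")),   # ADD  X14, X2, X2
    (None, ("X15", "X2")),   # STUR X15, [X2, #100]
]

id_cycles = schedule_id_cycles(program)
bubbles = id_cycles[-1] - len(program)
total_cycles = id_cycles[-1] + ID_TO_WB + 1
print(id_cycles, bubbles, total_cycles)
```

In this sketch only AND stalls: it waits two cycles for SUB to reach WB, and the instructions behind it are held up-stream, so by the time they reach ID the value of X2 is already in the register file.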

Resolving Data Dependence
Option 1: Stall the pipeline (i.e., inserting "bubbles")
[Figure: the same timing diagram (t0 to t5) with bubbles (nops) inserted between Inst i and Inst j; j is held in ID and k is held in IF while the bubbles move through ALU, MEM, and WB. Side panels show i: r_x ← _ and j: _ ← r_x for dist(i,j) = 1 and 2.]