COSC 6385 Computer Architecture - Pipelining
Spring 2012
Some of the slides are based on a lecture by David Culler

Pipelining
- Pipelining is an implementation technique whereby multiple instructions are overlapped in execution
  - Split an expensive operation into several sub-operations
  - Execute the sub-operations in a staggered manner
- Real-world analogy: assembly line in car manufacturing
  - Each station is doing something different
  - Each station is working on a separate car
- Pipelining increases the throughput, but does not reduce the latency of an operation
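The throughput-versus-latency point can be checked with a little arithmetic. The sketch below assumes an idealized pipeline: every stage takes exactly one clock cycle and there are no stalls. The function names are illustrative, not from the lecture.

```python
def unpipelined_cycles(num_instructions, num_stages):
    """Each instruction runs all stages before the next one starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions, num_stages):
    """The first instruction needs num_stages cycles to drain through the
    pipeline; under the no-stall assumption every further instruction
    completes one cycle later."""
    return num_stages + (num_instructions - 1)


def speedup(num_instructions, num_stages):
    return (unpipelined_cycles(num_instructions, num_stages)
            / pipelined_cycles(num_instructions, num_stages))


if __name__ == "__main__":
    # Latency of a single instruction is unchanged: still 5 cycles.
    print(pipelined_cycles(1, 5))
    # Throughput improves: the speedup approaches 5 for long instruction streams.
    print(round(speedup(1000, 5), 2))
```

For a single instruction both functions return the same cycle count, which is the "latency is not reduced" statement in numeric form.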
Classes of instructions
- ALU instructions
  - Take either 2 registers as operands, or 1 register and one 16-bit immediate offset
  - Results are stored in a 3rd register
- Load and store instructions
- Branches and jumps

Typical implementation of an instruction (I)
1. Instruction fetch cycle (IF):
  - Send PC to memory
  - Fetch current instruction
  - Update PC to next sequential PC (+4 bytes)
2. Instruction decode/register fetch cycle (ID):
  - Decode instruction
  - Read registers corresponding to register source specifiers from register file
  - Sign-extend offset fields if needed
  - Compute possible branch target address
Typical implementation of an instruction (II)
3. Execution/effective address cycle (EX):
  - ALU adds base register and offset to form effective address, or
  - ALU performs operations on the values read from register file, or
  - ALU performs operation on value read from register and sign-extended immediate
4. Memory access cycle (MEM):
  - If instruction is a load, read memory using the effective address computed in step 3
  - If instruction is a store, write the data from the second register read of the register file to the effective address
5. Write-back cycle (WB):
  - Write result into register file
    - From memory for a load instruction
    - From ALU for an ALU instruction

Typical implementation of an instruction (III)
[Figure: datapath with the five stages Instruction Fetch, Instr. Decode/Reg. Fetch, Execute/Addr. Calc, Memory Access, Write Back; components include PC, adder (+4), instruction memory, register file (RS1, RS2, RD), sign extend (Imm), ALU with Zero? test, MUXes, data memory, LMD]
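The five steps can be traced in a few lines of Python. This is a minimal sketch, not the lecture's implementation: the dictionary-based instruction format, the `step` function and the restriction to `add`, `lw` and `sw` are all assumptions made for illustration. Branches are left out and register 0 is not hard-wired to zero.

```python
def step(instr_mem, regs, data_mem, pc):
    """Execute one instruction through IF, ID, EX, MEM, WB; return the next PC."""
    # 1. IF: fetch the instruction and compute the next sequential PC.
    instr = instr_mem[pc]
    next_pc = pc + 4

    # 2. ID: decode and read the source registers; pick up the immediate.
    op = instr["op"]
    a = regs[instr["rs1"]]
    b = regs[instr.get("rs2", 0)]
    imm = instr.get("imm", 0)  # assumed to be sign-extended already

    # 3. EX: ALU operation or effective address calculation.
    if op == "add":
        alu_out = a + b
    elif op in ("lw", "sw"):
        alu_out = a + imm  # base register + offset
    else:
        raise ValueError(f"unsupported op: {op}")

    # 4. MEM: only loads and stores touch data memory.
    loaded = None
    if op == "lw":
        loaded = data_mem.get(alu_out, 0)
    elif op == "sw":
        data_mem[alu_out] = b  # data comes from the second register read

    # 5. WB: write the result into the register file.
    if op == "add":
        regs[instr["rd"]] = alu_out
    elif op == "lw":
        regs[instr["rd"]] = loaded
    return next_pc


def run_example():
    regs = [0] * 32
    regs[2] = 100
    data_mem = {108: 7}
    instr_mem = {
        0: {"op": "lw", "rd": 1, "rs1": 2, "imm": 8},   # lw  r1, 8(r2)
        4: {"op": "add", "rd": 3, "rs1": 1, "rs2": 1},  # add r3, r1, r1
        8: {"op": "sw", "rs1": 2, "rs2": 3, "imm": 0},  # sw  r3, 0(r2)
    }
    pc = 0
    while pc in instr_mem:
        pc = step(instr_mem, regs, data_mem, pc)
    return regs, data_mem, pc
```

Each instruction completes all five steps before the next one starts, so this models the non-pipelined machine only.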
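Details (I)
Fetching instructions and incrementing the program counter (PC)
[Figure: the PC supplies the read address to the instruction memory, which outputs the instruction; an adder computes PC + 4]

Details (II)
ALU instructions, e.g. add R1, R2, R3
- Register number input is 5 bits wide if you have 32 (= 2^5) registers
- ALU operation control signal (4 bits)
[Figure: register file with two 5-bit read register inputs, a 5-bit write register input, a write data input, read data 1 and read data 2 outputs and a write control signal; ALU with a 4-bit ALU operation input and Zero and ALU result outputs]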
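To make the 5-bit register numbers concrete, the sketch below extracts the fields of a 32-bit ALU instruction. The slides do not give an encoding, so the classic MIPS R-format (opcode 6 bits, rs 5, rt 5, rd 5, shamt 5, funct 6) is assumed here, and `decode_r_type` is an illustrative name.

```python
def decode_r_type(word):
    """Split a 32-bit word into fields, assuming the MIPS R-format layout."""
    return {
        "opcode": (word >> 26) & 0x3F,  # 6 bits
        "rs": (word >> 21) & 0x1F,      # 5 bits: first source register
        "rt": (word >> 16) & 0x1F,      # 5 bits: second source register
        "rd": (word >> 11) & 0x1F,      # 5 bits: destination register
        "shamt": (word >> 6) & 0x1F,    # 5 bits: shift amount
        "funct": word & 0x3F,           # 6 bits: selects the ALU operation
    }


if __name__ == "__main__":
    # 0x00430820 encodes "add $1, $2, $3" under the assumed MIPS format.
    print(decode_r_type(0x00430820))
```

The mask `0x1F` is 2^5 - 1, which is why 32 registers need exactly five bits per register number.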
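Details (III)
Load/store instructions, e.g. LW R1, offset(R2)
[Figure: data memory with address, write data and read data ports and MemWrite/MemRead control signals; sign-extend unit from 16 to 32 bits]
Basic steps for a load/store operation
- Sign-extend the offset from 16 to 32 bits
- Add the sign-extended offset to R2
- Load the content of the resulting address into R1, or store the data from R1 into the resulting memory address

Details (IV)
Combining load/store and ALU instructions
[Figure: combined datapath: the register file and ALU from Details (II) joined with the data memory and sign-extend unit from Details (III); one MUX (ALUSrc) selects between read data 2 and the sign-extended immediate as the second ALU input, a second MUX (MemtoReg) selects between the ALU result and the memory read data as the register write data; control signals RegWrite, MemWrite, MemRead and the 4-bit ALU operation]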
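The load/store address calculation can be sketched as follows. This is a minimal illustration assuming 32-bit addresses that wrap around; the function names are not from the lecture.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(base, offset16):
    """Base register content plus sign-extended 16-bit offset, modulo 2^32."""
    return (base + sign_extend16(offset16)) & 0xFFFFFFFF


if __name__ == "__main__":
    # An offset field of 0xFFFC is -4 after sign extension.
    print(hex(effective_address(0x1000, 0xFFFC)))
    print(hex(effective_address(0x1000, 8)))
```

Without the sign extension, an offset field of 0xFFFC would be treated as +65532 instead of -4.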
Details (V)
Branches, e.g. beq R1, R2, offset
Basic steps for a branch-equal instruction
- Compute branch target address
  - Sign-extend the offset field
  - Shift the offset field left by 2 bits in order to ensure a word offset
  - Add the shifted, sign-extended offset to the PC (the already incremented PC + 4)
- Compare registers R1 and R2

Details (VI)
Implementation of branches, e.g. beq R1, R2, offset
[Figure: branch datapath: PC + 4 from the instruction datapath and the sign-extended (16 to 32 bit) offset, shifted left by 2, feed an adder that produces the branch target; the register file supplies read data 1 and read data 2 to the ALU (4-bit ALU operation), whose Zero output goes to the branch control logic]
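The branch steps above can be sketched as a small function. It assumes the offset is relative to PC + 4, as in the datapath figure, and that addresses are 32 bits wide; the function names are illustrative.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, offset16):
    """PC + 4 plus the sign-extended offset shifted left by 2 (word offset)."""
    return (pc + 4 + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF


def beq_next_pc(pc, r1_value, r2_value, offset16):
    """Next PC for beq: the branch target if the registers are equal, else PC + 4."""
    if r1_value == r2_value:
        return branch_target(pc, offset16)
    return pc + 4


if __name__ == "__main__":
    print(hex(beq_next_pc(0x100, 5, 5, 3)))  # taken: 0x104 + 3 words
    print(hex(beq_next_pc(0x100, 5, 6, 3)))  # not taken: falls through
```

In hardware the comparison is done by the ALU: it subtracts the two register values and the Zero output signals equality.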
Visualizing pipelining
[Figure: pipeline diagram, instruction order versus time in clock cycles (Cycle 1 to Cycle 7); each instruction passes through IF, ID, ALU, Mem and WB and starts one cycle after its predecessor]

Effects of pipelining
- A pipeline of depth n requires n times the memory bandwidth of a non-pipelined processor for the same clock rate
  - Separate data and instruction caches eliminate some memory conflicts
- Register file is used in stage ID and in WB
  - Usually not a conflict, since writes are executed in the first half of the clock cycle and reads in the second half
- Instructions in the pipeline should not attempt to use the same hardware resources at the same time
  - Introducing pipeline registers between successive stages of the pipeline
  - Registers named after the stages they connect (e.g. IF/ID, ID/EX, etc.)
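A pipeline diagram like the one on the slide can be generated with a short script. The sketch assumes an ideal five-stage pipeline without stalls; the stage labels and the function name are illustrative.

```python
STAGES = ("IF", "ID", "EX", "MEM", "WB")


def pipeline_diagram(num_instructions):
    """Return one row per instruction and one column per clock cycle.

    Instruction i enters IF in cycle i and advances one stage per cycle
    (cycles are 0-based here)."""
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(num_instructions):
        row = ["."] * total_cycles
        for offset, stage in enumerate(STAGES):
            row[i + offset] = stage
        rows.append(row)
    return rows


if __name__ == "__main__":
    for i, row in enumerate(pipeline_diagram(4)):
        print(f"instr {i}: " + " ".join(f"{cell:>3}" for cell in row))
```

Reading any column top to bottom shows that no stage name appears twice in the same cycle. That is the "no two instructions use the same hardware resource at the same time" requirement for the ideal case.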
[Figure: pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB between the stages Instruction Fetch, Instr. Decode/Reg. Fetch, Execute/Addr. Calc, Memory Access and Write Back; Next SEQ PC and the destination register number RD are carried along through the pipeline registers]

Pipeline Hazards
- Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  - Structural hazards: HW cannot support this combination of instructions
  - Data hazards: instruction depends on result of prior instruction still in the pipeline
  - Control hazards: caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
One Memory Port/Structural Hazards
[Figure: pipeline diagram, instruction order versus Cycle 1 to Cycle 7, for the sequence Load, Instr 1, Instr 2, Instr 3, Instr 4]

One Memory Port/Structural Hazards
[Figure: the same sequence with a stall: Load, Instr 1, Instr 2, Stall, Instr 3; the stall inserts a bubble that moves through the pipeline stages]
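The stall in the second diagram can be reproduced with a small scheduling sketch. It assumes a single memory port shared by instruction fetch (IF) and data access (MEM), a five-stage pipeline in which MEM comes three cycles after IF, and that the data access wins the conflict. The function name is illustrative.

```python
def fetch_cycles(uses_data_memory):
    """Return the cycle (1-based) in which each instruction is fetched.

    uses_data_memory[i] is True if instruction i is a load or store.
    A fetch is delayed while an older instruction occupies the single
    memory port in its MEM stage."""
    port_busy = set()  # cycles in which a data access uses the port
    fetch = []
    cycle = 0
    for is_mem in uses_data_memory:
        cycle += 1
        while cycle in port_busy:  # structural hazard: stall the fetch
            cycle += 1
        fetch.append(cycle)
        if is_mem:
            port_busy.add(cycle + 3)  # IF -> ID -> EX -> MEM
    return fetch


if __name__ == "__main__":
    # Load followed by four instructions that do not access data memory.
    print(fetch_cycles([True, False, False, False, False]))
```

With the load fetched in cycle 1, its MEM stage falls in cycle 4, which is when Instr 3 would otherwise be fetched. Instr 3 therefore slips to cycle 5, matching the stall on the slide.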
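Data Hazard on R1
[Figure: pipeline diagram (IF, ID, EX, MEM, WB), instruction order versus time, for the sequence below; the later instructions need r1 from the add]
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11

Three Generic Data Hazards
Read After Write (RAW)
- Instr J tries to read operand before Instr I writes it
  I: add r1,r2,r3
  J: sub r4,r1,r3
- Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.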
Three Generic Data Hazards
Write After Read (WAR)
- Instr J writes operand before Instr I reads it
  I: sub r4,r1,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
- Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
- Can't happen in our 5-stage pipeline because:
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Writes are always in stage 5

Three Generic Data Hazards
Write After Write (WAW)
- Instr J writes operand before Instr I writes it
  I: sub r1,r4,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
- Called an "output dependence" by compiler writers. This also results from the reuse of name "r1".
- Can't happen in our 5-stage pipeline because:
  - All instructions take 5 stages, and
  - Writes are always in stage 5
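The three cases can be told apart mechanically by comparing the destination and source registers of two instructions. The sketch below reports dependences, i.e. potential hazards; whether one turns into an actual hazard depends on the pipeline, as the slides note for WAR and WAW. The tuple representation and the function name are assumptions for illustration.

```python
def classify_dependences(first, second):
    """Return the set of dependence types between two instructions.

    Each instruction is (dest, sources), e.g. ("r1", ["r2", "r3"]) for
    add r1,r2,r3. `first` precedes `second` in program order."""
    dest1, sources1 = first
    dest2, sources2 = second
    found = set()
    if dest1 is not None and dest1 in sources2:
        found.add("RAW")  # second reads what first writes
    if dest2 is not None and dest2 in sources1:
        found.add("WAR")  # second writes what first reads
    if dest1 is not None and dest1 == dest2:
        found.add("WAW")  # both write the same register
    return found


if __name__ == "__main__":
    add = ("r1", ["r2", "r3"])      # add r1,r2,r3
    sub = ("r4", ["r1", "r3"])      # sub r4,r1,r3
    print(classify_dependences(add, sub))
    print(classify_dependences(sub, add))
```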
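Forwarding to Avoid Data Hazard
[Figure: pipeline diagram, instruction order versus time (clock cycles), for the sequence below, with the result of the add forwarded to the dependent instructions]
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11

Data Hazard Even with Forwarding
[Figure: pipeline diagram for the sequence below; the loaded value is not available in time for the sub that immediately follows the load]
  lw  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9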
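The difference between the two sequences above can be expressed as a simple check. The sketch assumes full forwarding to the ALU inputs and load data that becomes available only at the end of MEM, so the only remaining stall is one cycle when an instruction uses a loaded register immediately after the load. It is conservative: some designs also avoid the stall for a store that only needs the loaded value as store data. The instruction tuples and the function name are illustrative.

```python
def load_use_stalls(program):
    """Count stall cycles in a 5-stage pipeline with forwarding.

    program is a list of (op, dest, sources) tuples in program order.
    Only a load followed directly by a consumer of its result stalls."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1  # one bubble between the load and its consumer
    return stalls


if __name__ == "__main__":
    alu_sequence = [
        ("add", "r1", ["r2", "r3"]),
        ("sub", "r4", ["r1", "r3"]),
        ("and", "r6", ["r1", "r7"]),
    ]
    load_sequence = [
        ("lw", "r1", ["r2"]),
        ("sub", "r4", ["r1", "r6"]),
        ("and", "r6", ["r1", "r7"]),
        ("or", "r8", ["r1", "r9"]),
    ]
    print(load_use_stalls(alu_sequence), load_use_stalls(load_sequence))
```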
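Data Hazard Even with Forwarding
[Figure: the same sequence with a bubble inserted after the load; sub, and, or are each delayed by one cycle]
  lw  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9

Branches: Pipelined Datapath
[Figure: pipelined datapath with the stages Instruction Fetch, Instr. Decode/Reg. Fetch, Execute/Addr. Calc, Memory Access and Write Back and the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; an additional adder and the Zero? test appear in the decode stage, feeding the MUX that selects the next PC]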
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
- Execute successor instructions in sequence
- "Squash" instructions in pipeline if branch actually taken
- Advantage of late pipeline state update
- 47% of branches not taken on average
- PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
- 53% of branches taken on average
- But haven't calculated branch target address yet
  - Still incurs 1 cycle branch penalty
  - Other machines: branch target known before outcome

Four Branch Hazard Alternatives
#4: Delayed Branch
- Define branch to take place AFTER a following instruction
    branch instruction
    sequential successor 1
    sequential successor 2
    ...
    sequential successor n
    branch target if taken
  (sequential successors 1 to n: branch delay of length n)
- 1 slot delay allows proper decision and branch target address in 5-stage pipeline
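The cost of alternatives #1 and #2 can be compared with a simple CPI estimate. The sketch assumes an ideal CPI of 1, a branch penalty of 1 cycle as stated above, and a branch frequency of 20%, which is a hypothetical value chosen only for illustration. The 53% taken rate is the figure from the slide.

```python
def effective_cpi(branch_freq, penalty_cycles, penalized_fraction=1.0, base_cpi=1.0):
    """CPI including branch stalls.

    branch_freq:        fraction of instructions that are branches
    penalty_cycles:     stall cycles per penalized branch
    penalized_fraction: fraction of branches that pay the penalty"""
    return base_cpi + branch_freq * penalized_fraction * penalty_cycles


if __name__ == "__main__":
    branch_freq = 0.20  # hypothetical, not from the lecture

    # #1 Stall: every branch pays the 1-cycle penalty.
    print(effective_cpi(branch_freq, 1))

    # #2 Predict not taken: only taken branches (53%) pay the penalty.
    print(effective_cpi(branch_freq, 1, penalized_fraction=0.53))
```

Under these assumptions, predicting not taken removes a little under half of the branch stall cycles, because 47% of branches fall through at no cost.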
Delayed Branch
- Where to get instructions to fill branch delay slot?
  - Before branch instruction
  - From the target address: only valuable when branch taken
  - From fall through: only valuable when branch not taken
- Compiler effectiveness for single branch delay slot:
  - Fills about 60% of branch delay slots
  - About 80% of instructions executed in branch delay slots useful in computation
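Combining the two percentages gives the share of delay slots that do useful work. The short calculation below simply multiplies them, assuming the 80% figure applies to the 60% of slots that get filled; the function name is illustrative.

```python
def useful_slot_fraction(filled_fraction, useful_fraction):
    """Fraction of all branch delay slots that contribute to the computation."""
    return filled_fraction * useful_fraction


if __name__ == "__main__":
    # About 60% of slots filled, about 80% of those useful.
    print(round(useful_slot_fraction(0.60, 0.80), 2))
```

Under that reading, roughly half of the delay slots are usefully filled; the remainder are either empty or execute an instruction whose result is not needed.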