Lectue 8 Intoduction to Pipelines Adapated fom slides by David Patteson http://www-inst.eecs.bekeley.edu/~cs61c/ * 1
Review (1/3) Datapath is the hadwae that pefoms opeations necessay to execute pogams. Contol instucts datapath on what to do next. Datapath needs: access to stoage (geneal pupose egistes and memoy) computational ability () helpe hadwae (local egistes and PC) * 2
Review (2/3) Five stages of datapath (executing an instuction): 1. Instuction Fetch (Incement PC) 2. Instuction Decode (Read Registes) 3. (Computation) 4. Memoy Access 5. Wite to Registes ALL instuctions must go though ALL five stages. Datapath designed in hadwae. * 3
Example Datapath PC instuction memoy d s t egistes Data memoy +4 imm 1. Instuction Fetch 2. Decode/ Registe Read 3. Execute 4. Memoy 5. Wite Back * 4
Outline Pipelining Analogy Pipelining Instuction Execution Hazads Advanced Pipelining Concepts by Analogy * 5
Gotta Do Laundy Ann, Bian, Cathy, Dave each have one load of clothes to wash, dy, fold, and put away Washe takes 30 minutes Dye takes 30 minutes A B C D Folde takes 30 minutes Stashe takes 30 minutes to put clothes into dawes * 6
Sequential Laundy 6 PM 7 8 9 10 11 12 1 2 AM T a s k O d e A B C D 3030 30 30 30 30 30 30 3030 30 30 3030 30 30 Time Sequential laundy takes 8 hous fo 4 loads * 7
Pipelined Laundy 12 2 AM 6 PM 7 8 9 10 11 1 T a s k O d e A B C D 3030 30 30 30 30 30 Time Pipelined laundy takes 3.5 hous fo 4 loads! * 8
Geneal Definitions Latency: time to completely execute a cetain task fo example, time to ead a secto fom disk is disk access time o disk latency Thoughput: amount of wok that can be done ove a peiod of time * 9
T a s k O d e Pipelining Lessons (1/2) Pipelining doesn t help 6 PM 7 8 9 latency of single task, it helps thoughput of Time entie wokload A B C D 30 30 30 30 30 30 30 Multiple tasks opeating simultaneously using diffeent esouces Potential speedup = Numbe pipe stages Time to fill pipeline and time to dain it educes speedup: 2.3X v. 4X in this example * 10
T a s k O d e Pipelining Lessons (2/2) Suppose new 6 PM 7 8 9 A B C D Time 30 30 30 30 30 30 30 Washe takes 20 minutes, new Stashe takes 20 minutes. How much faste is pipeline? Pipeline ate limited by slowest pipeline stage Unbalanced lengths of pipe stages also educes speedup * 11
Steps in Executing MIPS 1) IFetch: Fetch Instuction, Incement PC 2) Decode Instuction, Read Registes 3) Execute: Mem-ef: Calculate Addess Aith-log: Pefom Opeation 4) Memoy: Load: Stoe: Read Data fom Memoy Wite Data to Memoy 5) Wite Back: Wite Data to Registe * 12
Pipelined Execution Repesentation Time IFtch Dcd Exec Mem WB IFtch Dcd Exec Mem WB IFtch Dcd Exec Mem WB IFtch Dcd Exec Mem WB IFtch Dcd Exec Mem WB IFtch Dcd Exec Mem WB Evey instuction must take same numbe of steps, also called pipeline stages, so some will go idle sometimes * 13
Review: Datapath fo MIPS PC +4 instuction memoy Stage 1 Stage 2 Stage 3Stage 4 Stage 5 1. Instuction Fetch d s t imm Use datapath figue to epesent pipeline IFtch Dcd egistes 2. Decode/ Registe Read Exec Mem WB Data memoy 3. Execute 4. Memoy5. Wite Back I$ Reg D$ Reg * 14
I n s t. O d e Gaphical Pipeline Repesentation (In Reg, ight half highlight ead, left half wite) Time (clock cycles) Load Add Stoe Sub O I$ Reg I$ Reg I$ D$ Reg I$ * 15 Reg D$ Reg I$ Reg D$ Reg Reg D$ Reg D$ Reg
Example Suppose 2 ns fo memoy access, 2 ns fo opeation, and 1 ns fo egiste file ead o wite Nonpipelined Execution: lw : IF + Read Reg + + Memoy + Wite Reg = 2 + 1 + 2 + 2 + 1 = 8 ns add: IF + Read Reg + + Wite Reg = 2 + 1 + 2 + 1 = 6 ns Pipelined Execution: Max(IF,Read Reg,,Memoy,Wite Reg) = 2 ns * 16
Pipeline Hazad: Matching socks in late load 12 2 AM 6 PM 7 8 9 10 11 1 T a s k A B 3030 30 30 30 30 30 bubble Time O d e C D E F A depends on D; stall since folde tied up * 17
Poblems fo Computes Limits to pipelining: Hazads pevent next instuction fom executing duing its designated clock cycle Stuctual hazads: HW cannot suppot this combination of instuctions (single peson to fold and put clothes away) Contol hazads: Pipelining of banches & othe instuctions stall the pipeline until the hazad bubbles in the pipeline Data hazads: Instuction depends on esult of pio instuction still in the pipeline (missing sock) * 18
Stuctual Hazad #1: Single Memoy (1/2) I n s t. O d e Load Inst 1 Inst 2 Inst 3 Inst 4 Time (clock cycles) I$ Reg D$ Reg I$ Reg D$ Reg I$ Reg D$ Reg I$ Read same memoy twice in same clock cycle * 19 Reg D$ Reg I$ Reg D$ Reg
I n s t. Stuctual Hazad #2: Registes (1/2) Load Inst 1 Time (clock cycles) I$ Reg D$ Reg I$ Reg D$ Reg O d e Inst 2 Inst 3 Inst 4 I$ Reg D$ Reg I$ Reg D$ Reg I$ Reg D$ Reg Can t ead and wite to egistes simultaneously * 20
Stuctual Hazad #2: Registes (2/2) Fact: Registe access is VERY fast: takes less than half the time of stage Solution: intoduce convention always Wite to Registes duing fist half of each clock cycle always Read fom Registes duing second half of each clock cycle Result: can pefom Read and Wite duing same clock cycle * 21
Contol Hazad: Banching (1/6) Suppose we put banch decisionmaking hadwae in stage then two moe instuctions afte the banch will always be fetched, whethe o not the banch is taken Desied functionality of a banch if we do not take the banch, don t waste any time and continue executing nomally if we take the banch, don t execute any instuctions afte the banch, just go to the desied label * 22
Contol Hazad: Banching (2/6) Initial Solution: Stall until decision is made inset no-op instuctions: those that accomplish nothing, just take time Dawback: banches take 3 clock cycles each (assuming compaato is put in stage) * 23
Contol Hazad: Banching (3/6) Optimization #1: move compaato up to Stage 2 as soon as instuction is decoded (Opcode identifies is as a banch), immediately make a decision and set the value of the PC (if necessay) Benefit: since banch is complete in Stage 2, only one unnecessay instuction is fetched, so only one no-op is needed Side Note: This means that banches ae idle in Stages 3, 4 and 5. * 24
I n s t. O d e Contol Hazad: Banching (4/6) Inset a single no-op (bubble) Add Beq Load Time (clock cycles) I$ Reg D$ Reg I$ Reg D$ Reg bub ble I$ Reg D$ Reg Impact: 2 clock cycles pe banch instuction slow * 25
Contol Hazad: Banching (5/6) Optimization #2: Redefine banches Old definition: if we take the banch, none of the instuctions afte the banch get executed by accident New definition: whethe o not we take the banch, the single instuction immediately following the banch gets executed (called the banch-delay slot) * 26
Contol Hazad: Banching (6/6) Notes on Banch-Delay Slot Wost-Case Scenaio: can always put a no-op in the banch-delay slot Bette Case: can find an instuction peceding the banch which can be placed in the banch-delay slot without affecting flow of the pogam - e-odeing instuctions is a common method of speeding up pogams - compile must be vey smat in ode to find instuctions to do this - usually can find such an instuction at least 50% of the time * 27
Example: Nondelayed vs. Delayed Banch Nondelayed Banch o $8, $9,$10 Delayed Banch add $1,$2,$3 add $1,$2,$3 sub $4, $5,$6 beq $1, $4, Exit xo $10, $1,$11 sub $4, $5,$6 beq $1, $4, Exit o $8, $9,$10 xo $10, $1,$11 Exit: Exit: * 28
Things to Remembe (1/2) Optimal Pipeline Each stage is executing pat of an instuction each clock cycle. One instuction finishes duing each clock cycle. On aveage, execute fa moe quickly. What makes this wok? Similaities between instuctions allow us to use same stages fo all instuctions (geneally). Each stage takes about the same amount of time as all othes: little wasted time. * 29
Advanced Pipelining Concepts (if time) Out-of-ode Execution Supescala execution State-of-the-At Micopocesso * 30
Review Pipeline Hazad: Stall is dependency T a s k O d e 12 2 AM 6 PM 7 8 9 10 11 1 A B C D E F 3030 30 30 30 30 30 bubble Time A depends on D; stall since folde tied up * 31
Out-of-Ode Laundy: Don t Wait T a s k O d e 12 2 AM 6 PM 7 8 9 10 11 1 A B C D E F 3030 30 30 30 30 30 bubble Time A depends on D; est continue; need moe esouces to allow out-of-ode * 32
Supescala Laundy: Paallel pe stage T a s k O d e 12 2 AM 6 PM 7 8 9 10 11 1 A B C D E F 3030 30 30 30 Time (light clothing) (dak clothing) (vey dity clothing) (light clothing) (dak clothing) (vey dity clothing) Moe esouces, HW to match mix of paallel tasks? * 33
Supescala Laundy: Mismatch Mix 12 2 AM 6 PM 7 8 9 10 11 1 T a s k O d e A B C D Time 3030 30 30 30 30 30 (light clothing) (light clothing) (dak clothing) (light clothing) Task mix undeutilizes exta esouces * 34
Compaq Alpha 21264 Vey simila instuction set to MIPS 1 64KB Instuction cache, 1 64 KB Data cache on chip; 16MB L2 cache off chip Clock cycle = 1.5 nanoseconds, o 667 MHz clock ate Supescala: fetch up to 6 instuctions /clock cycle, eties up to 4 instuction/clock cycle Execution out-of-ode 15 million tansistos, 90 watts! * 35
Things to Remembe (1/2) Optimal Pipeline Each stage is executing pat of an instuction each clock cycle. One instuction finishes duing each clock cycle. On aveage, execute fa moe quickly. What makes this wok? Similaities between instuctions allow us to use same stages fo all instuctions (geneally). Each stage takes about the same amount of time as all othes: little wasted time. * 36
Things to Remembe (2/2) Pipelining a Big Idea: widely used concept What makes it less than pefect? Stuctual hazads: two diffeent instuctions equie same hadwae Need moe HW esouces Contol hazads: need to woy about banch instuctions? Delayed banch Data hazads: an instuction depends on a pevious instuction? * 37