Review: Moore's Law. EECS 252 Graduate Computer Architecture Lecture 2. Review: Joy's Law in ManyCore world. Bell's Law: new class per decade




EECS 252 Graduate Computer Architecture
Lecture 2 ℵ0: Review of Instruction Sets, Pipelines, and Caches
January 26th, 2009
John Kubiatowicz
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://…edu/~kubitron/cs252
1/26/2009, CS252-S09, Lecture 02

Review: Moore's Law
- "Cramming More Components onto Integrated Circuits", Gordon Moore, Electronics, 1965
- Number of transistors on a cost-effective integrated circuit doubles every 18 months

Review: Joy's Law in ManyCore world
[Plot: Performance (vs. VAX-11/780) over time. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October]
- VAX: 25%/year, 1978 to 1986
- RISC + x86: 52%/year, 1986 to 2002
- RISC + x86: ??%/year, 2002 to present

Bell's Law: new class per decade
[Plot: log (people per computer) versus year]
- Enabled by technological opportunities
- Smaller, more numerous and more intimately connected
- Brings in a new kind of application
- Used in many ways not previously imagined
- Application progression: Number Crunching, Data Storage, productivity, interactive, streaming information to/from physical world

Today: Quick review of everything you should have learned ℵ0
(A countably-infinite set of computer architecture concepts)

Metrics used to Compare Designs
- Cost
  - Die cost and system cost
- Execution Time
  - average and worst-case
  - Latency vs. Throughput
- Energy and Power
  - Also peak power and peak switching current
- Reliability
  - Resiliency to electrical noise, part failure
  - Robustness to bad software, operator error
- Maintainability
  - System administration costs
- Compatibility
  - Software costs dominate

Cost of Processor
- Design cost (Non-recurring Engineering Costs, NRE)
  - dominated by engineer-years (~$200K per engineer year)
  - also mask costs (exceeding $1M per spin)
- Cost of die
  - die area
  - die yield (maturity of manufacturing process, redundancy features)
  - cost/size of wafers
  - die cost ~= f(die area^4) with no redundancy
- Cost of packaging
  - number of pins (signal + power/ground pins)
  - power dissipation
- Cost of testing
  - built-in test features?
  - logical complexity of design
  - choice of circuits (minimum clock rates, leakage currents, I/O drivers)
- Architect affects all of these

What is Performance?
- Latency (or response time or execution time): time to complete one task
- Bandwidth (or throughput): tasks completed per unit time

Definition: Performance
- Performance is in units of things per sec; bigger is better
- If we are primarily concerned with response time:

\[ \text{performance}(x) = \frac{1}{\text{execution\_time}(x)} \]

- "X is n times faster than Y" means:

\[ n = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = \frac{\text{Execution\_time}(Y)}{\text{Execution\_time}(X)} \]

Performance: What to measure
- Usually rely on benchmarks vs. real workloads
- To increase predictability, collections of benchmark applications -- benchmark suites -- are popular
- SPECCPU: popular desktop benchmark suite
  - CPU only, split between integer and floating point programs
  - SPECint2000 has 12 integer programs, SPECfp2000 has 14 floating-point programs
  - SPECCPU2006 to be announced Spring 2006
  - SPECSFS (NFS file server) and SPECWeb (WebServer) added as server benchmarks
- Transaction Processing Council measures server performance and cost-performance for databases
  - TPC-C: Complex query for Online Transaction Processing
  - TPC-H: models ad hoc decision support
  - TPC-W: a transactional web benchmark
  - TPC-App: application server and web services benchmark

Summarizing Performance
[Table: Rate (Task 1) and Rate (Task 2) for systems A and B; shown four times: "Which system is faster?", with average throughput, with throughput relative to B and its average, and with throughput relative to A and its average]
- The answer depends on who's selling

Summarizing Performance over Set of Benchmark Programs
- Arithmetic mean of execution times t_i (in seconds):

\[ \frac{1}{n} \sum_i t_i \]

- Harmonic mean of execution rates r_i (MIPS/MFLOPS):

\[ \frac{n}{\sum_i (1/r_i)} \]

- Both equivalent to workload where each program is run the same number of times
- Can add weighting factors to model other workload distributions

Normalized Execution Time and Geometric Mean
- Measure speedup relative to reference machine:

\[ \text{ratio} = \frac{t_{Ref}}{t_A} \]

- Average time ratios using geometric mean:

\[ \sqrt[n]{\prod_i \text{ratio}_i} \]

- Insensitive to machine chosen as reference
- Insensitive to run time of individual benchmarks
- Used by SPEC89, SPEC92, SPEC95, ...
- But beware that choice of reference machine can suggest what is "normal" performance profile

Vector/Superscalar Speedup
[Plot: 100 MHz Cray J90 vector machine versus 300 MHz Alpha. LANL Computational Physics Codes, Wasserman, ICS'96]
- Vector machine peaks on a few codes????

Superscalar/Vector Speedup
[Plot: 100 MHz Cray J90 vector machine versus 300 MHz Alpha. LANL Computational Physics Codes, Wasserman, ICS'96]
- Scalar machine peaks on one code???
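The three means from the "Summarizing Performance" slides can be sketched in a few lines of Python. This is a minimal illustration, not course material: the function names and all benchmark times below are hypothetical. It also checks the slide's claim that a geometric-mean comparison of two machines does not depend on which reference machine is used.

```python
import math

def arithmetic_mean(times):
    """Arithmetic mean of execution times t_i: (1/n) * sum(t_i)."""
    return sum(times) / len(times)

def harmonic_mean(rates):
    """Harmonic mean of execution rates r_i: n / sum(1/r_i)."""
    return len(rates) / sum(1.0 / r for r in rates)

def geometric_mean(ratios):
    """n-th root of the product of the ratios, computed via logarithms."""
    return math.exp(sum(math.log(x) for x in ratios) / len(ratios))

def geomean_speedup(ref_times, machine_times):
    """Geometric mean of ratio_i = t_Ref,i / t_machine,i."""
    return geometric_mean([r / t for r, t in zip(ref_times, machine_times)])

# Hypothetical execution times (seconds) on three benchmarks.
ref_1 = [10.0, 40.0, 5.0]    # one candidate reference machine
ref_2 = [2.0, 100.0, 8.0]    # a very different reference machine
machine_a = [5.0, 20.0, 10.0]
machine_b = [4.0, 50.0, 2.0]

# The A-versus-B comparison is the same under either reference.
a_vs_b_ref1 = geomean_speedup(ref_1, machine_a) / geomean_speedup(ref_1, machine_b)
a_vs_b_ref2 = geomean_speedup(ref_2, machine_a) / geomean_speedup(ref_2, machine_b)
```

The reference times cancel in the ratio of two geometric means, which is why the comparison is unaffected by the choice of reference.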

How to Mislead with Performance Reports
- Select pieces of workload that work well on your design, ignore others
- Use unrealistic data set sizes for application (too big or too small)
- Report throughput numbers for a latency benchmark
- Report latency numbers for a throughput benchmark
- Report performance on a kernel and claim it represents an entire application
- Use 16-bit fixed-point arithmetic (because it's fastest on your system) even though application requires 64-bit floating-point arithmetic
- Use a less efficient algorithm on the competing machine
- Report speedup for an inefficient algorithm (bubblesort)
- Compare hand-optimized assembly code with unoptimized C code
- Compare your design using next year's technology against competitor's year old design (1% performance improvement per week)
- Ignore the relative cost of the systems being compared
- Report averages and not individual results
- Report speedup over unspecified base system, not absolute times
- Report efficiency not absolute times
- Report MFLOPS not absolute times (use inefficient algorithm)
[David Bailey, "Twelve ways to fool the masses when giving performance results for parallel supercomputers"]

Amdahl's Law

\[ \text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right] \]

\[ \text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}} \]

Best you could ever hope to do:

\[ \text{Speedup}_{maximum} = \frac{1}{1 - \text{Fraction}_{enhanced}} \]

Amdahl's Law example
- New CPU 10X faster
- I/O bound server, so 60% time waiting for I/O

\[ \text{Speedup}_{overall} = \frac{1}{(1 - 0.4) + \dfrac{0.4}{10}} = \frac{1}{0.64} = 1.56 \]

- Apparently, it's human nature to be attracted by 10X faster, vs. keeping in perspective it's just 1.6X faster

Computer Performance

\[ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]

                Inst Count   CPI    Clock Rate
Program             X
Compiler            X        (X)
Inst. Set           X         X
Organization                  X         X
Technology                              X
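The Amdahl's Law example above (a 10X faster CPU on a server that spends 60% of its time waiting for I/O) can be reproduced with a short sketch. The function names are illustrative assumptions; the formulas are the ones on the slide.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only `fraction_enhanced` of the time is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

def amdahl_limit(fraction_enhanced):
    """Upper bound on overall speedup as speedup_enhanced grows without limit."""
    return 1.0 / (1.0 - fraction_enhanced)

# Slide example: 40% of the time is CPU-bound, and the CPU becomes 10X faster.
overall = amdahl_speedup(fraction_enhanced=0.4, speedup_enhanced=10.0)

# Even an infinitely fast CPU cannot beat this bound for the same workload.
best_case = amdahl_limit(0.4)
```

With these inputs the overall speedup is 1/0.64, and the bound for an arbitrarily fast CPU is 1/0.6, so the unenhanced 60% dominates.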

Cycles Per Instruction (Throughput)
- "Average Cycles per Instruction":

\[ \text{CPI} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Cycles}}{\text{Instruction Count}} \]

\[ \text{CPU time} = \text{Cycle Time} \times \sum_{j=1}^{n} \text{CPI}_j \times I_j \]

- "Instruction Frequency":

\[ \text{CPI} = \sum_{j=1}^{n} \text{CPI}_j \times F_j \quad \text{where} \quad F_j = \frac{I_j}{\text{Instruction Count}} \]

Example: Calculating CPI bottom up
- Run benchmark and collect workload characterization (simulate, machine counters, or sampling)

Base Machine (Reg / Reg), typical mix of instruction types in program:

Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        0.5      (33%)
Load     20%    2        0.4      (27%)
Store    10%    2        0.2      (13%)
Branch   20%    2        0.4      (27%)
Total                    1.5

- Design guideline: Make the common case fast
- MIPS 1% rule: only consider adding an instruction if it is shown to add 1% performance improvement on reasonable benchmarks

Power and Energy
- Energy to complete operation (Joules)
  - Corresponds approximately to battery life
  - (Battery energy capacity actually depends on rate of discharge)
- Peak power dissipation (Watts = Joules/second)
  - Affects packaging (power and ground pins, thermal design)
- di/dt, peak change in supply current (Amps/second)
  - Affects power supply noise (power and ground pins, decoupling capacitors)

Peak Power versus Lower Energy
[Plot: Power versus Time for two systems, with Peak A and Peak B marked]
- Integrate power curve to get energy
- System A has higher peak power, but lower total energy
- System B has lower peak power, but higher total energy
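The bottom-up CPI calculation from the example slide can be checked with a small sketch. The frequencies and cycle counts are taken from the slide's table; the "ALU" row label and all function names are assumptions made for illustration.

```python
# Instruction mix: op -> (frequency F_j, cycles CPI_j).
MIX = {
    "ALU":    (0.50, 1),
    "Load":   (0.20, 2),
    "Store":  (0.10, 2),
    "Branch": (0.20, 2),
}

def average_cpi(mix):
    """CPI = sum over instruction classes of CPI_j * F_j."""
    return sum(freq * cycles for freq, cycles in mix.values())

def time_share(mix):
    """Fraction of execution time spent in each instruction class."""
    total = average_cpi(mix)
    return {op: freq * cycles / total for op, (freq, cycles) in mix.items()}

def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time = Instruction Count * CPI / Clock Rate."""
    return instruction_count * cpi / clock_rate_hz

cpi = average_cpi(MIX)
shares = time_share(MIX)
```

Note that ALU operations are half of the instructions but only about a third of the time, which is the kind of observation the "make the common case fast" guideline relies on.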

CS 252 Administrivia
- Sign up! Web site is (doesn't quite work!): http//
- Review: Chapter 1, Appendix A, B, C
  - CS 152 home page, maybe "Computer Organization and Design (COD) 2/e"
  - If did take a class, be sure COD Chapters 2, 5, 6, 7 are familiar
  - Copies in Bechtel Library on 2-hour reserve
- First two readings are up (look on Lecture page)
  - Read the assignment carefully, since the requirements vary about what you need to turn in
  - Submit results to website before class (will be a link up on handouts page)
  - You can have 5 total late days on assignments
    - 10% per day afterwards
    - Save late days!

CS 252 Administrivia
- Resources for course on web site:
  - Check out the ISCA (International Symposium on Computer Architecture) 25th year retrospective on web site. Look for "Additional reading" below text-book description
  - Pointers to previous CS152 exams and resources
  - Lots of old CS252 material
  - Interesting links. Check out the WWW Computer Architecture Home Page
- Size of class seems ok:
  - I asked Michael David to put everyone on waitlist into class
  - Check to make sure

ISA Implementation Review

A "Typical" RISC ISA
- 32-bit fixed format instruction (3 formats)
- 32 32-bit GPR (R0 contains zero, DP take pair)
- 3-address, reg-reg arithmetic instruction
- Single address mode for load/store: base + displacement
  - no indirection
- Simple branch conditions
- Delayed branch
see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3

Example: MIPS
Instruction formats (fields listed from most to least significant):
- Register-Register: Op | Rs1 | Rs2 | Rd | Opx
- Register-Immediate: Op | Rs1 | Rd | immediate
- Branch: Op | Rs1 | Rs2/Opx | immediate
- Jump / Call: Op | target

Datapath vs Control
[Diagram: Controller drives Control Points on the Datapath; the Datapath returns signals to the Controller]
- Datapath: Storage, FU, interconnect sufficient to perform the desired functions
  - Inputs are Control Points
  - Outputs are signals
- Controller: State machine to orchestrate operation on the data path
  - Based on desired function and signals

Simple Pipelining Review

5 Steps of MIPS Datapath
Stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back
[Datapath figure: Next PC, Adder (+4), Next SEQ PC, Address, Memory, RS1, RS2, Imm, Reg File, Sign Extend, Zero?, MUX, ALU, Data Memory, WB Data]
Register transfers:
  IR <= mem[PC]; PC <= PC + 4
  A <= Reg[IR_rs]; B <= Reg[IR_rt]
  rslt <= A op_IRop B
  WB <= rslt
  Reg[IR_rd] <= WB

5 Steps of MIPS Datapath (pipelined)
[Datapath figure as above, with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB and RD carried through each stage]
- Data stationary control: local decode for each instruction phase / pipeline stage

Visualizing Pipelining (Figure A.2, Page A-8)
[Figure: instructions in order on the vertical axis, Time (clock cycles) Cycle 1 to Cycle 7 on the horizontal axis]

Pipelining is not quite that easy!
- Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
  - Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
  - Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)
  - Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

One Memory Port/Structural Hazards (Figure A.4, Page A-14)
[Figure: Load, Instr 1, Instr 2, Instr 3, Instr 4 in order; Time (clock cycles) Cycle 1 to Cycle 7]

One Memory Port/Structural Hazards (Similar to Figure A.5, Page A-15)
[Figure: Load, Instr 1, Instr 2, Stall (a row of five Bubbles), Instr 3; Time (clock cycles) Cycle 1 to Cycle 7]
- How do you "bubble" the pipe?

Speed Up Equation for Pipelining

\[ \text{CPI}_{pipelined} = \text{Ideal CPI} + \text{Average stall cycles per Inst} \]

\[ \text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{unpipelined}}{\text{Cycle Time}_{pipelined}} \]

For simple RISC pipeline, CPI = 1:

\[ \text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{unpipelined}}{\text{Cycle Time}_{pipelined}} \]

Example: Dual-port vs. Single-port
- Machine A: Dual ported memory ("Harvard Architecture")
- Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
- Ideal CPI = 1 for both
- Loads are 40% of instructions executed

\[ \text{SpeedUp}_A = \frac{\text{Pipeline Depth}}{1 + 0} \times \frac{\text{clock}_{unpipe}}{\text{clock}_{pipe}} = \text{Pipeline Depth} \]

\[ \text{SpeedUp}_B = \frac{\text{Pipeline Depth}}{1 + 0.4 \times 1} \times \frac{\text{clock}_{unpipe}}{\text{clock}_{unpipe} / 1.05} = \frac{\text{Pipeline Depth}}{1.4} \times 1.05 = 0.75 \times \text{Pipeline Depth} \]

\[ \frac{\text{SpeedUp}_A}{\text{SpeedUp}_B} = \frac{\text{Pipeline Depth}}{0.75 \times \text{Pipeline Depth}} = 1.33 \]

- Machine A is 1.33 times faster

Data Hazard on R1
[Figure: pipeline stages IF, ID/RF, EX, MEM, WB; Time (clock cycles); instructions in order:]
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11

Three Generic Data Hazards
- Read After Write (RAW): Instr_J tries to read operand before Instr_I writes it
  I: add r1,r2,r3
  J: sub r4,r1,r3
- Caused by a "Dependence" (in compiler nomenclature). This hazard results from an actual need for communication.
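The dual-port versus single-port comparison above can be reproduced numerically. This is a minimal sketch: the function and parameter names are hypothetical, and the pipeline depth of 5 is an assumption (it cancels in the final ratio).

```python
def pipeline_speedup(depth, stall_cpi, cycle_time_ratio=1.0, ideal_cpi=1.0):
    """Speedup = (ideal CPI * depth) / (ideal CPI + stall CPI) * cycle time ratio,
    where cycle_time_ratio = Cycle Time unpipelined / Cycle Time pipelined."""
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * cycle_time_ratio

DEPTH = 5  # assumed depth; any positive value gives the same A/B ratio

# Machine A: dual-ported memory, so loads cause no structural stalls.
speedup_a = pipeline_speedup(DEPTH, stall_cpi=0.0)

# Machine B: single port, so each load (40% of instructions) stalls one cycle,
# but the pipelined clock is 1.05 times faster.
speedup_b = pipeline_speedup(DEPTH, stall_cpi=0.4 * 1, cycle_time_ratio=1.05)

a_over_b = speedup_a / speedup_b
```

The 5% clock advantage of Machine B does not make up for the 40% increase in CPI, so Machine A comes out ahead.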

Three Generic Data Hazards
- Write After Read (WAR): Instr_J writes operand before Instr_I reads it
  I: sub r4,r1,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
- Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
- Can't happen in MIPS 5 stage pipeline because:
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Writes are always in stage 5

Three Generic Data Hazards
- Write After Write (WAW): Instr_J writes operand before Instr_I writes it
  I: sub r1,r4,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
- Called an "output dependence" by compiler writers. This also results from the reuse of name "r1".
- Can't happen in MIPS 5 stage pipeline because:
  - All instructions take 5 stages, and
  - Writes are always in stage 5
- Will see WAR and WAW in more complicated pipes

Forwarding to Avoid Data Hazard
[Figure: Time (clock cycles); instructions in order:]
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11

HW Change for Forwarding
[Figure: NextPC, Registers, Immediate, ID/EX, mux, mux, ALU, EX/MEM, Data Memory, MEM/WR, mux]
- What circuit detects and resolves this hazard?

Forwarding to Avoid LW-SW Data Hazard
[Figure: Time (clock cycles); instructions in order:]
  add r1,r2,r3
  lw  r4, 0(r1)
  sw  r4,12(r1)
  or  r8,r6,r9
  xor r10,r9,r11

Data Hazard Even with Forwarding
[Figure: Time (clock cycles); instructions in order:]
  lw  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9

Data Hazard Even with Forwarding
[Figure: the same sequence with a Bubble inserted after the lw, delaying sub, and, and or by one cycle]

Software Scheduling to Avoid Load Hazards
Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f in memory.

Slow code:            Fast code:
  LW  Rb,b              LW  Rb,b
  LW  Rc,c              LW  Rc,c
  ADD Ra,Rb,Rc          LW  Re,e
  SW  a,Ra              ADD Ra,Rb,Rc
  LW  Re,e              LW  Rf,f
  LW  Rf,f              SW  a,Ra
  SUB Rd,Re,Rf          SUB Rd,Re,Rf
  SW  d,Rd              SW  d,Rd

Control Hazard on Branches: Three Stage Stall
  10: beq r1,r3,36
  14: and r2,r3,r5
  18: or  r6,r1,r7
  22: add r8,r1,r9
  36: xor r10,r1,r11
- What do you do with the 3 instructions in between?
- How do you do it?
- Where is the "commit"?

Branch Stall Impact
- If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9!
- Two part solution:
  - Determine branch taken or not sooner, AND
  - Compute taken branch address earlier
- MIPS branch tests if register = 0 or ≠ 0
- MIPS Solution:
  - Move Zero test to ID/RF stage
  - Adder to calculate new PC in ID/RF stage
  - 1 clock cycle penalty for branch versus 3

Pipelined MIPS Datapath (Figure A.24, page A-38)
[Datapath figure: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back, with the Zero? test and a second Adder moved into the decode stage, and pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]
- Interplay of instruction set design and cycle time.

Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
- Execute successor instructions in sequence
- "Squash" instructions in pipeline if branch actually taken
- Advantage of late pipeline state update
- 47% MIPS branches not taken on average
- PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
- 53% MIPS branches taken on average
- But haven't calculated branch target address in MIPS
  - MIPS still incurs 1 cycle branch penalty
  - Other machines: branch target known before outcome

Four Branch Hazard Alternatives
#4: Delayed Branch
- Define branch to take place AFTER a following instruction:
    branch instruction
    sequential successor_1
    sequential successor_2
    ...
    sequential successor_n
    branch target if taken
  (Branch delay of length n)
- 1 slot delay allows proper decision and branch target address in 5 stage pipeline
- MIPS uses this

Scheduling Branch Delay Slots

A. From before branch    B. From branch target    C. From fall through
  add $1,$2,$3             sub $4,$5,$6             add $1,$2,$3
  if $2=0 then             add $1,$2,$3             if $1=0 then
    delay slot             if $1=0 then               delay slot
                             delay slot             sub $4,$5,$6
becomes                  becomes                  becomes
  if $2=0 then             add $1,$2,$3             add $1,$2,$3
    add $1,$2,$3           if $1=0 then             if $1=0 then
                             sub $4,$5,$6             sub $4,$5,$6

- A is the best choice, fills delay slot & reduces instruction count (IC)
- In B, the sub instruction may need to be copied, increasing IC
- In B and C, must be okay to execute sub when branch fails

Delayed Branch
- Compiler effectiveness for single branch delay slot:
  - Fills about 60% of branch delay slots
  - About 80% of instructions executed in branch delay slots useful in computation
  - About 50% (60% x 80%) of slots usefully filled
- Delayed Branch downside: As processors go to deeper pipelines and multiple issue, the branch delay grows and more than one delay slot is needed
  - Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches
  - Growth in available transistors has made dynamic approaches relatively cheaper

Evaluating Branch Alternatives

\[ \text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}} \]

Assume 4% unconditional branch, 6% conditional branch-untaken, 10% conditional branch-taken
[Table: Scheduling scheme, Branch penalty, CPI, speedup v. unpipelined, speedup v. stall; rows: Stall pipeline, Predict taken, Predict not taken, Delayed branch]
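The "Evaluating Branch Alternatives" formula can be exercised with the branch mix stated on the slide. The per-scheme penalties below are assumptions rather than values read from the table: 3 cycles for stalling and 1 cycle for the prediction schemes (both figures appear in the earlier branch slides), no penalty for correctly predicted not-taken branches, and an average of 0.5 cycles for delayed branches (about half of the delay slots are usefully filled). The pipeline depth of 5 and all names are also illustrative.

```python
# Branch mix from the slide, as fractions of all instructions.
BRANCH_MIX = {"unconditional": 0.04, "cond_untaken": 0.06, "cond_taken": 0.10}

# Assumed penalty (cycles) per branch class for each scheme.
SCHEMES = {
    "stall pipeline":    {"unconditional": 3,   "cond_untaken": 3,   "cond_taken": 3},
    "predict taken":     {"unconditional": 1,   "cond_untaken": 1,   "cond_taken": 1},
    "predict not taken": {"unconditional": 1,   "cond_untaken": 0,   "cond_taken": 1},
    "delayed branch":    {"unconditional": 0.5, "cond_untaken": 0.5, "cond_taken": 0.5},
}

def branch_cpi(penalties, mix=BRANCH_MIX, ideal_cpi=1.0):
    """CPI = ideal CPI + sum(branch frequency * branch penalty)."""
    return ideal_cpi + sum(mix[kind] * penalties[kind] for kind in mix)

def pipeline_speedup(depth, cpi):
    """Pipeline speedup = pipeline depth / CPI (ideal CPI of 1 assumed)."""
    return depth / cpi

DEPTH = 5  # assumed 5-stage pipeline
results = {
    name: (branch_cpi(p), pipeline_speedup(DEPTH, branch_cpi(p)))
    for name, p in SCHEMES.items()
}

# The earlier "Branch Stall Impact" figure: 30% branches, 3-cycle stall.
stall_impact_cpi = branch_cpi({"branch": 3}, mix={"branch": 0.30})
```

Under these assumptions the schemes rank in the order listed, with delayed branching giving the lowest CPI.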


Problems with Pipelining
- Exception: An unusual event happens to an instruction during its execution
  - Examples: divide by zero, undefined opcode
- Interrupt: Hardware signal to switch the processor to a new instruction stream
  - Example: a sound card interrupts when it needs more audio output samples (an audio "click" happens if it is left waiting)
- Problem: It must appear that the exception or interrupt occurs between 2 instructions (I_i and I_{i+1})
  - The effect of all instructions up to and including I_i is totally complete
  - No effect of any instruction after I_i can take place
- The interrupt (exception) handler either aborts program or restarts at instruction I_{i+1}

Precise Exceptions in Static Pipelines
- Key observation: architected state only changes in memory and register write stages.

Memory Hierarchy Review

Since 1980, CPU has outpaced DRAM...
[Plot: Performance (1/latency) versus Year]
- CPU: 60% per yr, 2X in 1.5 yrs
- DRAM: 9% per yr, 2X in 10 yrs
- Gap grew 50% per year
- How do architects address this gap?
  - Put small, fast "cache" memories between CPU and DRAM.
  - Create a "memory hierarchy"


1977: DRAM faster than microprocessors
- Apple ][ (1977)
  - CPU: 1000 ns
  - DRAM: 400 ns
[Photo: Steve Jobs, Steve Wozniak]

Memory Hierarchy of a Modern Computer
- Take advantage of the principle of locality to:
  - Present as much memory as in the cheapest technology
  - Provide access at speed offered by the fastest technology

Level                                         Speed (ns)                   Size (bytes)
Processor (Datapath, Control, Registers)      1s                           100s
On-Chip Cache, Second Level Cache (SRAM)      10s-100s                     Ks-Ms
Main Memory (DRAM)                            100s                         Ms
Secondary Storage (Disk)                      10,000,000s (10s ms)         Gs
Tertiary Storage (Tape)                       10,000,000,000s (10s sec)    Ts

The Principle of Locality
- The Principle of Locality: Programs access a relatively small portion of the address space at any instant of time.
- Two Different Types of Locality:
  - Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  - Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straightline code, array access)
- Last 15 years, HW relied on locality for speed

Programs with locality cache well...
[Plot: Memory Address (one dot per access) versus Time, with regions labeled Bad locality behavior, Spatial Locality, and Temporal Locality]
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3) (1971)


Memory Hierarchy: Apple iMac G5 (1.6 GHz)

                    Reg      L1 Inst   L1 Data   L2       DRAM     Disk
Size                1K       64K       32K       512K     256M     80G
Latency (cycles)    1        3         3         11       88       10^7
Latency (time)      0.6 ns   1.9 ns    1.9 ns    6.9 ns   55 ns    12 ms

- Registers: managed by compiler
- Caches: managed by hardware
- DRAM and disk: managed by OS, hardware, application
- Goal: Illusion of large, fast, cheap memory
- Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access

iMac's PowerPC 970: All caches on-chip
[Die photo: Registers (1K), L1 (64K Instruction), L1 (32K Data), 512K L2]

Memory Hierarchy: Terminology
- Hit: data appears in some block in the upper level (example: Block X)
  - Hit Rate: the fraction of memory accesses found in the upper level
  - Hit Time: Time to access the upper level, which consists of RAM access time + Time to determine hit/miss
- Miss: data needs to be retrieved from a block in the lower level (Block Y)
  - Miss Rate = 1 - (Hit Rate)
  - Miss Penalty: Time to replace a block in the upper level + Time to deliver the block to the processor
- Hit Time << Miss Penalty (500 instructions on 21264!)
[Diagram: To Processor / From Processor; Upper Level Memory holding Blk X; Lower Level Memory holding Blk Y]

4 Questions for Memory Hierarchy
- Q1: Where can a block be placed in the upper level? (Block placement)
- Q2: How is a block found if it is in the upper level? (Block identification)
- Q3: Which block should be replaced on a miss? (Block replacement)
- Q4: What happens on a write? (Write strategy)
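The hit time, miss rate, and miss penalty defined in the terminology slide are commonly combined into an average memory access time. The slides above do not state this formula; it is the standard textbook relation, shown here as a hedged sketch. The function names and the 95% hit rate are assumptions; the 3-cycle and 11-cycle figures are borrowed from the iMac G5 table purely as sample inputs.

```python
def miss_rate(hit_rate):
    """Miss Rate = 1 - (Hit Rate)."""
    return 1.0 - hit_rate

def average_access_time(hit_time, hit_rate, miss_penalty):
    """Standard relation (not from the slides):
    average access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate(hit_rate) * miss_penalty

# Hypothetical example: 3-cycle hit time, 95% hit rate (assumed),
# 11-cycle penalty to fetch the block from the next level.
example_cycles = average_access_time(hit_time=3, hit_rate=0.95, miss_penalty=11)
```

Because the miss penalty is usually much larger than the hit time, small changes in the hit rate have a large effect on the average.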

Q1: Where can a block be placed in the upper level?
- Block 12 placed in 8 block cache:
  - Fully associative, direct mapped, 2-way set associative
  - S.A. Mapping = Block Number Modulo Number Sets
  - Fully associative: block 12 can go anywhere
  - Direct Mapped: (12 mod 8) = 4
  - 2-Way Assoc: (12 mod 4) = 0
[Figure: the 8-block Cache under each organization, above a Memory with block 12 highlighted]

A Summary on Sources of Cache Misses
- Compulsory (cold start or process migration, first reference): first access to a block
  - "Cold" fact of life: not a whole lot you can do about it
  - Note: If you are going to run "billions" of instructions, Compulsory Misses are insignificant
- Capacity:
  - Cache cannot contain all blocks accessed by the program
  - Solution: increase cache size
- Conflict (collision):
  - Multiple memory locations mapped to the same cache location
  - Solution 1: increase cache size
  - Solution 2: increase associativity
- Coherence (Invalidation): other process (e.g., I/O) updates memory

Q2: How is a block found if it is in the upper level?
Address fields: Block Address = Tag | Index (Set Select), followed by Block offset (Data Select)
- Index Used to Lookup Candidates in Cache
  - Index identifies the set
- Tag used to identify actual copy
  - If no candidates match, then declare cache miss
- Block is minimum quantum of caching
  - Data select field used to select data within block
  - Many caching applications don't have data select field

Direct Mapped Cache
- Direct Mapped 2^N byte cache:
  - The uppermost (32 - N) bits are always the Cache Tag
  - The lowest M bits are the Byte Select (Block Size = 2^M)
- Example: 1 KB Direct Mapped Cache with 32 B Blocks
  - Index chooses potential block
  - Tag checked to verify block
  - Byte select chooses byte within block
Address fields in the example:
  Cache Tag (bits 31-10)    Ex: 0x50
  Cache Index (bits 9-5)    Ex: 0x01
  Byte Select (bits 4-0)    Ex: 0x00
[Figure: cache array with Valid Bit, Cache Tag (0x50 stored at index 1), and Cache Data rows of 32 bytes: Byte 0 ... Byte 31, Byte 32 ... Byte 63, and so on]
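The block-placement rule from Q1 and the address split from the direct-mapped example can be sketched as follows. The function names are hypothetical, and the code assumes that cache size, block size, and associativity are powers of two.

```python
def set_index(block_number, num_blocks, ways):
    """Set-associative mapping: block number modulo number of sets."""
    num_sets = num_blocks // ways
    return block_number % num_sets

def split_address(addr, cache_bytes=1024, block_bytes=32):
    """Split an address into (tag, index, byte_select) for a direct-mapped cache."""
    offset_bits = block_bytes.bit_length() - 1                 # M
    index_bits = (cache_bytes // block_bytes).bit_length() - 1  # N - M
    byte_select = addr & (block_bytes - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, byte_select

# Q1 example: block 12 in an 8-block cache.
direct_mapped = set_index(12, num_blocks=8, ways=1)   # 12 mod 8
two_way = set_index(12, num_blocks=8, ways=2)         # 12 mod 4
fully_assoc = set_index(12, num_blocks=8, ways=8)     # a single set

# Direct-mapped example: tag 0x50, index 0x01, byte select 0x00.
example_addr = (0x50 << 10) | (0x01 << 5) | 0x00
fields = split_address(example_addr)
```

For the 1 KB cache with 32 B blocks this yields a 5-bit byte select, a 5-bit index, and a 22-bit tag, matching the (32 - N) rule with N = 10.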

Set Associative Cache
- N-way set associative: N entries per Cache Index
  - N direct mapped caches operate in parallel
- Example: Two-way set associative cache
  - Cache Index selects a "set" from the cache
  - Two tags in the set are compared to input in parallel
  - Data is selected based on the tag result
[Figure: address split into Cache Tag, Cache Index, Byte Select; two banks of Valid, Cache Tag, Cache Data (Cache Block 0); two Compare units feed Sel1/Sel0 of a Mux and are ORed to produce Hit; the Mux outputs the Cache Block]

Fully Associative Cache
- Fully Associative: Every block can hold any line
  - Address does not include a cache index
  - Compare Cache Tags of all Cache Entries in Parallel
- Example: Block Size = 32 B
  - We need N 27-bit comparators
  - Still have byte select to choose from within block
[Figure: address split into Cache Tag (27 bits long) and Byte Select (Ex: 0x01); every entry has a Valid Bit, Cache Tag with its own comparator, and Cache Data: Byte 0 ... Byte 31, Byte 32 ... Byte 63, and so on]

Q3: Which block should be replaced on a miss?
- Easy for Direct Mapped
- Set Associative or Fully Associative:
  - LRU (Least Recently Used): Appealing, but hard to implement for high associativity
  - Random: Easy, but how well does it work?

Miss rates:
Assoc:     2-way             4-way             8-way
Size       LRU     Ran       LRU     Ran       LRU     Ran
16K        5.2%    5.7%      4.7%    5.3%      4.4%    5.0%
64K        1.9%    2.0%      1.5%    1.7%      1.4%    1.5%
256K       1.15%   1.17%     1.13%   1.13%     1.12%   1.12%

Q4: What happens on a write?
- Write-Through policy: Data written to cache block also written to lower-level memory
- Write-Back policy: Write data only to the cache; update lower level when a block falls out of the cache

                                          Write-Through   Write-Back
Debug                                     Easy            Hard
Do read misses produce writes?            No              Yes
Do repeated writes make it to lower level? Yes             No

- Additional option -- let writes to an un-cached address allocate a new cache line ("write-allocate").
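To make the set-associative lookup and LRU replacement above concrete, here is a minimal software model. It is a sketch under stated assumptions, not how hardware implements LRU: the class name, the block-number interface, and the access traces are all hypothetical, and Python's `collections.OrderedDict` stands in for the per-set recency state.

```python
from collections import OrderedDict

class LRUSetAssociativeCache:
    """Toy cache model that tracks only which block numbers are resident."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # One ordered dict per set: oldest (least recently used) tag first.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, block_number):
        """Return True on a hit, False on a miss (filling the block)."""
        index = block_number % self.num_sets      # index selects the set
        tag = block_number // self.num_sets       # tag identifies the block
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)                # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)             # evict least recently used
        lines[tag] = True
        return False

# Blocks 0 and 8 conflict in an 8-block direct-mapped cache (8 sets, 1 way)...
direct = LRUSetAssociativeCache(num_sets=8, ways=1)
for block in [0, 8, 0, 8]:
    direct.access(block)

# ...but coexist in a 2-way cache of the same total size (4 sets, 2 ways).
two_way = LRUSetAssociativeCache(num_sets=4, ways=2)
for block in [0, 8, 0, 8]:
    two_way.access(block)
```

The direct-mapped run shows alternating conflict misses on two blocks that map to the same location, which higher associativity removes.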

Write Buffers for Write-Through Caches
[Diagram: Processor -> Cache -> Write Buffer -> Lower Level Memory. The write buffer holds data awaiting write-through to lower level memory]
- Q. Why a write buffer?
  A. So CPU doesn't stall
- Q. Why a buffer, why not just one register?
  A. Bursts of writes are common.
- Q. Are Read After Write (RAW) hazards an issue for write buffer?
  A. Yes! Drain buffer before next read, or check write buffers for match on reads

5 Basic Cache Optimizations
- Reducing Miss Rate
  1. Larger Block size (compulsory misses)
  2. Larger Cache size (capacity misses)
  3. Higher Associativity (conflict misses)
- Reducing Miss Penalty
  4. Multilevel Caches
- Reducing hit time
  5. Giving Reads Priority over Writes
     - E.g., Read complete before earlier writes in write buffer

Virtual Memory

What is virtual memory?
[Diagram: Virtual Address Space mapped onto Physical Address Space. The Virtual Address is split into V page no. and a 10-bit offset; the Page Table Base Reg plus the V page no. index into the Page Table, whose entries hold V, Access Rights, and PA; the table is located in physical memory. The Physical Address is P page no. plus the same 10-bit offset]
- Virtual memory => treat memory as a cache for the disk
- Terminology: blocks in this cache are called "Pages"
  - Typical size of a page: 1K to 8K
- Page table maps virtual page numbers to physical frames
  - "PTE" = Page Table Entry

Three Advantages of Virtual Memory
- Translation:
  - Program can be given consistent view of memory, even though physical memory is scrambled
  - Makes multithreading reasonable (now used a lot!)
  - Only the most important part of program ("Working Set") must be in physical memory.
  - Contiguous structures (like stacks) use only as much physical memory as necessary yet still grow later.
- Protection:
  - Different threads (or processes) protected from each other.
  - Different pages can be given special behavior (Read Only, Invisible to user programs, etc).
  - Kernel data protected from User programs
  - Very important for protection from malicious programs
- Sharing:
  - Can map same physical page to multiple users ("Shared memory")

Large Address Space Support
[Diagram: Virtual Address = Virtual P1 index (10 bits) | Virtual P2 index (10 bits) | Offset (12 bits). PageTablePtr points to the first-level table; entries are 4 bytes; tables and pages are 4KB. Physical Address = Physical Page # | Offset]
- Single-Level Page Table
  - Large: 4KB pages for a 32-bit address -> 1M entries
  - Each process needs own page table!
- Multi-Level Page Table
  - Can allow sparseness of page table
  - Portions of table can be swapped to disk

VM and Disk: Page replacement policy
[Diagram: the set of all pages in memory arranged in a circle, swept by two pointers, alongside the Page Table (dirty and used bits) and the Freelist of Free Pages]
- Dirty bit: page written.
- Used bit: set to 1 on any reference
- Tail pointer: Clear the used bit in the page table
- Head pointer: Place pages on free list if used bit is still clear. Schedule pages with dirty bit set to be written to disk.
- Architect's role: support setting dirty and used bits

Translation Look-Aside Buffers
- Translation Look-Aside Buffers (TLB)
  - Cache on translations
  - Fully Associative, Set Associative, or Direct Mapped
- TLBs are:
  - Small: typically not more than a few hundred entries
  - Fully Associative
[Diagram: Translation with a TLB. CPU sends VA to TLB; on a TLB hit the PA goes to the Cache; on a TLB miss, Translation supplies the PA; a cache hit returns data, a cache miss goes to Main Memory]
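The 10/10/12-bit split from the "Large Address Space Support" slide can be sketched as a two-level table walk. This is an illustrative model only: the page tables are plain Python dictionaries, and the function names, the `PageFault` class, and the sample mapping are hypothetical.

```python
OFFSET_BITS = 12   # 4KB pages
INDEX_BITS = 10    # 10-bit P1 index and 10-bit P2 index

class PageFault(Exception):
    """Raised when no valid translation exists for a virtual address."""

def split_virtual_address(va):
    """Return (p1_index, p2_index, offset) for a 32-bit virtual address."""
    offset = va & ((1 << OFFSET_BITS) - 1)
    p2 = (va >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    p1 = (va >> (OFFSET_BITS + INDEX_BITS)) & ((1 << INDEX_BITS) - 1)
    return p1, p2, offset

def translate(va, root_table):
    """Walk a two-level table: root_table[p1][p2] -> physical page number."""
    p1, p2, offset = split_virtual_address(va)
    second_level = root_table.get(p1)
    if second_level is None or p2 not in second_level:
        raise PageFault(hex(va))
    return (second_level[p2] << OFFSET_BITS) | offset

# Sparse table: only one second-level table exists, holding one mapping.
root = {1: {3: 0x2A}}              # hypothetical mapping
pa = translate(0x00403004, root)   # p1 = 1, p2 = 3, offset = 4

# A single-level table would need one entry per 4KB page of a 32-bit space.
single_level_entries = 2**32 // 2**OFFSET_BITS
```

Only the second-level tables that are actually used need to exist, which is the sparseness advantage the slide mentions.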

What Actually Happens on a TLB Miss?
- Hardware traversed page tables:
  - On TLB miss, hardware in MMU looks at current page table to fill TLB (may walk multiple levels)
    - If PTE valid, hardware fills TLB and processor never knows
    - If PTE marked as invalid, causes Page Fault, after which kernel decides what to do afterwards
- Software traversed Page tables (like MIPS)
  - On TLB miss, processor receives TLB fault
  - Kernel traverses page table to find PTE
    - If PTE valid, fills TLB and returns from fault
    - If PTE marked as invalid, internally calls Page Fault handler
- Most chip sets provide hardware traversal
  - Modern operating systems tend to have more TLB faults since they use translation for many things
  - Examples:
    - shared segments
    - user-level portions of an operating system

Example: R3000 pipeline
MIPS R3000 Pipeline:
  Inst Fetch        Dcd/Reg   ALU / E.A          Memory    Write Reg
  TLB, I-Cache      RF        Operation, E.A.    TLB, D-Cache   WB
- TLB: 64 entry, on-chip, fully associative, software TLB fault handler
- Virtual Address Space: ASID | V. Page Number | Offset
  - 0xx: User segment (caching based on PT/TLB entry)
  - 100: Kernel physical space, cached
  - 101: Kernel physical space, uncached
  - 11x: Kernel virtual space
- Allows context switching among 64 user processes without TLB flush

Reducing translation time further
- As described, TLB lookup is in serial with cache lookup:
[Diagram: Virtual Address (V page no., 10-bit offset) -> TLB Lookup (V, Access Rights, PA) -> Physical Address (P page no., 10-bit offset)]
- Machines with TLBs go one step further: they overlap TLB lookup with cache access.
  - Works because offset available early

Overlapping TLB & Cache Access
- Here is how this might work with a 4K cache:
[Diagram: 32-bit address split into a 20-bit page #, a 10-bit index, and a 2-bit displacement (00). The page # goes to the TLB (associative lookup) while the index selects one of 1K 4-byte entries in the 4K Cache; the frame number (FN) from the TLB is compared with the FN stored in the cache to produce Hit/Miss, alongside the Data]
- What if cache size is increased to 8KB?
  - Overlap not complete
  - Need to do something else. See CS152/252
- Another option: Virtual Caches
  - Tags in cache are virtual addresses
  - Translation only happens on cache misses
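The condition for fully overlapping TLB lookup with cache indexing can be written as a small check: the index and block-offset bits must fit within the page offset, so that translation does not change them. This sketch assumes 4 KB pages (the 20-bit page number plus 12 low-order bits in the diagram above) and power-of-two sizes; the function names are hypothetical.

```python
def translated_index_bits(cache_bytes, ways, page_bytes):
    """Number of cache-index bits that lie above the page offset,
    and are therefore changed by address translation."""
    bytes_per_way = cache_bytes // ways
    way_bits = bytes_per_way.bit_length() - 1    # index + block-offset bits
    page_offset_bits = page_bytes.bit_length() - 1
    return max(0, way_bits - page_offset_bits)

def can_fully_overlap(cache_bytes, ways, page_bytes):
    """True when the cache can be indexed using untranslated bits only."""
    return translated_index_bits(cache_bytes, ways, page_bytes) == 0

PAGE = 4096  # assumed 4 KB pages

ok_4k_direct = can_fully_overlap(4 * 1024, ways=1, page_bytes=PAGE)
ok_8k_direct = can_fully_overlap(8 * 1024, ways=1, page_bytes=PAGE)
ok_8k_two_way = can_fully_overlap(8 * 1024, ways=2, page_bytes=PAGE)
ok_8k_big_pages = can_fully_overlap(8 * 1024, ways=1, page_bytes=8 * 1024)
```

Doubling the direct-mapped cache to 8 KB pushes one index bit into the translated part of the address; raising associativity or the page size brings it back under the page offset.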

Problems With Overlapped TLB Access

Overlapped access requires that the address bits used to index into the cache do not change as a result of translation.

This usually limits things to small caches, large page sizes, or high n-way set associative caches if you want a large cache.

Example: suppose everything the same except that the cache is increased to 8 K bytes instead of 4 K.

[Figure: the cache index is now 11 bits plus a 2-bit (00) displacement, while the address is a 20-bit virt page # plus a 12-bit disp. The top index bit is changed by VA translation, but is needed for cache lookup.]

Solutions:
  - go to 8K byte page sizes;
  - go to 2 way set associative cache; or
  - SW guarantee VA[13]=PA[13]

[Figure: 2 way set assoc cache, 1K entries per way.]

Summary: Control and Pipelining

  - Next time: Read Appendix A
  - Control VIA State Machines and Microprogramming
  - Just overlap tasks; easy if tasks are independent
  - Speed Up ≤ Pipeline Depth; if ideal CPI is 1, then:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$

  - Hazards limit performance on computers:
    » Structural: need more HW resources
    » Data (RAW, WAR, WAW): need forwarding, compiler scheduling
    » Control: delayed branch, prediction
  - Exceptions, Interrupts add complexity
  - Next time: Read Appendix C, record bugs online!

Summary #1/3: The Cache Design Space

  - Several interacting dimensions
    » cache size
    » block size
    » associativity
    » replacement policy
    » write-through vs write-back
    » write allocation
  - The optimal choice is a compromise
    » depends on access characteristics: workload, use (I-cache, D-cache, TLB)
    » depends on technology / cost
  - Simplicity often wins

[Figure: the cache design space, with axes Cache Size, Associativity, and Block Size, and a Good/Bad curve trading Factor A against Factor B from Less to More.]

Summary #2/3: Caches

  - The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
    » Temporal Locality: Locality in Time
    » Spatial Locality: Locality in Space
  - Three Major Categories of Cache Misses:
    » Compulsory Misses: sad facts of life. Example: cold start misses.
    » Capacity Misses: increase cache size
    » Conflict Misses: increase cache size and/or associativity. Nightmare Scenario: ping pong effect!
  - Write Policy: Write Through vs. Write Back
  - Today CPU time is a function of (ops, cache misses) vs. just f(ops): affects Compilers, Data structures, and Algorithms
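The ping pong effect named under conflict misses can be reproduced with a tiny cache model. The sketch below is an illustration only: it assumes LRU replacement and 4-byte blocks, and `count_misses` is a hypothetical helper, not something from the lecture.

```python
from collections import OrderedDict


def count_misses(addresses, num_sets, ways, block_bytes=4):
    """Simulate a set-associative cache with LRU replacement; return misses."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        block = addr // block_bytes
        index = block % num_sets         # which set the block maps to
        tag = block // num_sets          # identifies the block within the set
        lines = sets[index]
        if tag in lines:
            lines.move_to_end(tag)       # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)    # evict the least recently used
            lines[tag] = True
    return misses


# Two addresses one cache size (4KB) apart, touched alternately 8 times each.
trace = [0x0000, 0x1000] * 8

# Same 4KB capacity organized two ways.
direct_mapped = count_misses(trace, num_sets=1024, ways=1)
two_way = count_misses(trace, num_sets=512, ways=2)
print(direct_mapped, two_way)  # prints 16 2
```

In the direct-mapped cache both addresses map to set 0 and evict each other on every access, so all 16 references miss. With two ways of the same total capacity, only the two compulsory misses remain.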

Summary #3/3: TLB, Virtual Memory

  - Page tables map virtual address to physical address
  - TLBs are important for fast translation
  - TLB misses are significant in processor performance
    » funny times, as most systems can't access all of 2nd level cache without TLB misses!
  - Caches, TLBs, Virtual Memory all understood by examining how they deal with 4 questions:
    1) Where can block be placed?
    2) How is block found?
    3) What block is replaced on miss?
    4) How are writes handled?
  - Today VM allows many processes to share single memory without having to swap all processes to disk; today VM protection is more important than memory hierarchy benefits, but computers insecure
  - Prepare for debate + quiz on Wednesday
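The remark that most systems cannot touch all of the 2nd level cache without TLB misses comes down to TLB reach: entries times page size. A rough sketch follows, using the 64-entry TLB and 4KB pages mentioned earlier in the lecture; the 1MB L2 size is an assumed figure for illustration, and both function names are hypothetical.

```python
def tlb_reach_bytes(entries, page_bytes):
    """Memory that can be mapped by the TLB at one time."""
    return entries * page_bytes


def fraction_of_cache_covered(entries, page_bytes, cache_bytes):
    """Share of a cache that the TLB can map at once, capped at 1.0."""
    return min(1.0, tlb_reach_bytes(entries, page_bytes) / cache_bytes)


# 64 entries x 4KB pages = 256KB of reach.
print(tlb_reach_bytes(64, 4096))                          # prints 262144
# Assumed 1MB second-level cache: only a quarter is covered.
print(fraction_of_cache_covered(64, 4096, 1024 * 1024))   # prints 0.25
```

Under these assumptions, a program sweeping through data that fits entirely in the L2 cache would still take TLB misses, since the TLB maps only a quarter of it at any moment.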
