Pre-requisites. This is a textbook-based course. Chapter 1. Pipelines, Performance, Caches, and Virtual Memory. January 2009 Paul H J Kelly


332 Advanced Computer Architecture
Chapter 1: Introduction and review of Pipelines, Performance, Caches, and Virtual Memory
January 2009, Paul H J Kelly

These lecture notes are partly based on the course text, Hennessy and Patterson's Computer Architecture, a quantitative approach (4th ed), and on the lecture slides of David Patterson's Berkeley course (CS252).

Advanced Computer Architecture Chapter 1. p1

Pre-requisites
- This is a third-level computer architecture course.
- The usual path would be to take this course after following a course based on a textbook like Computer Organization and Design (Patterson and Hennessy, Morgan Kaufmann).
- This course is based on the more advanced book by the same authors (see next slide).
- You can take this course provided you're prepared to catch up if necessary:
  - Read chapters 1 to 8 of Computer Organization and Design (COD) if this material is new to you.
  - If you have studied computer architecture before, make sure COD Chapters 2, 6, 7 are familiar.
  - See also Appendix A, "Pipelining: Basic and Intermediate Concepts", of the course textbook.
- FAST review today of Pipelining, Performance, Caches, and Virtual Memory.

This is a textbook-based course
- Computer Architecture: A Quantitative Approach (4th Edition), John L. Hennessy and David A. Patterson. ~580 pages. Morgan Kaufmann (2007), with substantial additional material on CD.
- Publisher's companion web site: http://textbooks.elsevier.com/
- The textbook includes some vital introductory material as appendices:
  - Appendix A: tutorial on pipelining (read it NOW)
  - Appendix C: tutorial on caching (read it NOW)
- Further appendices (some in the book, some on the CD) cover more advanced material, some of it very relevant to parts of the course, eg:
  - Networks
  - Parallel applications
  - Implementing coherence protocols
  - Embedded systems
  - VLIW
  - Computer arithmetic (esp floating point)
  - Historical perspectives

Who are these guys anyway and why should I read their book?
- John Hennessy
  - Founder, MIPS Computer Systems
  - President, Stanford University (previously Provost, succeeding Condoleezza Rice)
- David Patterson
  - Leader, Berkeley RISC project (led to Sun's SPARC)
  - RAID (redundant arrays of inexpensive disks)
  - Professor, University of California, Berkeley
  - Former president of the ACM
  - Served on the Information Technology Advisory Committee to the US President
- RAID-I (1989) consisted of a Sun 4/280 workstation with 128 MB of DRAM, four dual-string SCSI controllers, 28 5.25-inch SCSI disks and specialized disk striping software.
- RISC-I (1982) contains 44,420 transistors, fabbed in 5 micron NMOS, with a die area of 77 mm^2, and ran at 1 MHz. This chip is probably the first VLSI RISC.

Administration details
- Course web site: see the course materials online.
- Course textbook: H&P 4th ed. Read Appendix A right away.
- Background for the 2008 context: see the Workshop on Trends in Computing Performance, http://www7.nationalacademies.org/cstb/project_computingperformance_workshop.html

Course organisation
- Lecturer: Paul Kelly, leader of the Software Performance Optimisation research group.
- Tutorial helper: Anton Lokhmotov, postdoctoral researcher. PhD from Cambridge on optimisation and algorithms for SIMD. Industry experience with Broadcom (VLIW hardware), Clearspeed (massively-multicore SIMD hardware), Codeplay (compilers for games), ACE (compilers).
- 3 hours per week: nominally two hours of lectures and one hour of classroom tutorials. We will use the time more flexibly.
- Assessment:
  - Exam
    - For the CS M.Eng. class, the exam will take place in the last week of term.
    - For everyone else, the exam will take place early in the summer term.
    - The goal of the course is to teach you how to think about computer architecture.
    - The exam usually includes some architectural ideas not presented in the lectures.
  - Coursework
    - You will be assigned a substantial, laboratory-based exercise.
    - You will learn about performance tuning for computationally-intensive kernels.
    - You will learn about using simulators, and experimentally evaluating hypotheses to understand system performance.
    - You are encouraged to bring laptops to class to get started and get help during tutorials.
- Please do not use computers for anything else during classes.

Course overview (plan)
- Ch1: Review of pipelined, in-order processor architecture and simple cache structures
- Ch2: Caches in more depth; software techniques to improve cache performance; virtual memory; benchmarking; fab
- Ch3: Instruction-level parallelism; dynamic scheduling, out-of-order; register renaming; speculative execution; branch prediction; limits to ILP
- Ch4: Compiler techniques: loop nest transformations; loop parallelisation, interchange, tiling/blocking, skewing
- Ch5: Multithreading, hyperthreading, SMT; static instruction scheduling; software pipelining; EPIC/IA-64; instruction-set support for speculation and register renaming
- Ch6: GPUs, GPGPU, and manycore
- Ch7: Shared-memory multiprocessors; cache coherency; large-scale cache-coherency; ccNUMA, COMA
- Lab-based coursework exercise: simulation study "challenge", using performance analysis tools
- Exam: partially based on a recent processor architecture article, which we will study in advance (see past papers)

A "Typical" RISC
- 32-bit fixed format instruction (3 formats, see next slide)
- 32 32-bit general-purpose registers (R0 contains zero; double-precision/long operands occupy a pair)
- Memory access only via load/store instructions
  - No instruction both accesses memory and does arithmetic
  - All arithmetic is done on registers
- 3-address, reg-reg arithmetic instruction
  - "subw r1,r2,r3" means r1 := r2 - r3
  - Register identifiers always occupy the same bits of the instruction encoding
- Single addressing mode for load/store: base + displacement
  - ie register contents are added to a constant from the instruction word, and used as the address; eg "lw r2,100(r1)" means r2 := Mem[100+r1]
  - No indirection
- Simple branch conditions
- Delayed branch
- See: SPARC, MIPS, ARM, HP PA-Risc, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
- Not: Intel IA-32, IA-64 (?), Motorola 68000, DEC VAX, PDP-11, IBM 360/370
- Eg: VAX matchc, IA32 scas instructions!


Example: MIPS (note register location)

  Register-register:   Op [31:26] | Rs1 [25:21] | Rs2 [20:16] | Rd [15:11] | Opx [10:0]
  Register-immediate:  Op [31:26] | Rs1 [25:21] | Rd [20:16] | immediate [15:0]
  Branch:              Op [31:26] | Rs1 [25:21] | Rs2/Opx [20:16] | immediate [15:0]
  Jump / Call:         Op [31:26] | target [25:0]

- Q: What is the largest signed immediate operand for "subw r1,r2,X"?
- Q: What range of addresses can a conditional branch jump to?

So where do I find a MIPS processor?
- MIPS licensees shipped more than 350 million units during fiscal year 2007 (per the MIPS company milestones page).
- HP 100 multifunction printer
- Digimax L85 digital camera
- Sony PS2 and PSP
- Linksys WRT54G router (Linux-based)

A machine to execute these instructions
- To execute this instruction set we need a machine that fetches instructions and does what each instruction says.
- A "universal" computing device: a simple digital circuit that, with the right code, can compute anything.
- Something like:

    Instr = Mem[PC]; PC += 4;
    rs1 = Reg[Instr.rs1];
    rs2 = Reg[Instr.rs2];
    imm = SignExtend(Instr.imm);
    Operand1 = if (Instr.op == BRANCH) then PC else rs1;
    Operand2 = if (immediateOperand(Instr.op)) then imm else rs2;
    res = ALU(Instr.op, Operand1, Operand2);
    switch (Instr.op) {
      case BRANCH: if (rs1 == 0) then PC = PC + imm; continue;
      case STORE:  Mem[res] = rs1; continue;
      case LOAD:   lmd = Mem[res];
    }
    Reg[Instr.rd] = if (Instr.op == LOAD) then lmd else res;

5 Steps of MIPS Datapath
[Figure 3.1, Page 130, CAAQA 2e. Datapath in five steps: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Labelled components: Next PC, Adder, Instruction Memory, Reg File, Sign Extend, ALU, Zero?, Data Memory, LMD.]
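The "something like" pseudocode on the "A machine to execute these instructions" slide can be turned into a small runnable interpreter. The sketch below is illustrative only: the `Instr` record, the opcode names (`LW`, `SW`, `ADDI`, `SUB`, `BEQZ`) and the `run` function are assumptions made for this example, instructions are held as already-decoded tuples rather than 32-bit words, and a store is assumed to write the register named in the `rd` field.

```python
from collections import namedtuple

# Hypothetical decoded-instruction record; real hardware would extract
# these fields from the 32-bit instruction encoding.
Instr = namedtuple("Instr", "op rs1 rs2 rd imm")

def run(program, reg, mem, max_steps=1000):
    """Interpret program (dict: byte address -> Instr) until PC leaves it."""
    pc = 0
    for _ in range(max_steps):
        if pc not in program:
            break
        inst = program[pc]
        pc += 4                                    # sequential next PC
        rs1, rs2, imm = reg[inst.rs1], reg[inst.rs2], inst.imm
        if inst.op == "BEQZ":                      # PC-relative branch if rs1 == 0
            if rs1 == 0:
                pc += imm
            continue
        op2 = imm if inst.op in ("ADDI", "LW", "SW") else rs2
        res = rs1 - op2 if inst.op == "SUB" else rs1 + op2   # the "ALU"
        if inst.op == "SW":
            mem[res] = reg[inst.rd]                # assumption: data register is in rd
            continue
        if inst.op == "LW":
            res = mem[res]                         # the loaded value (lmd)
        if inst.rd != 0:                           # R0 always contains zero
            reg[inst.rd] = res
    return reg, mem

program = {
    0:  Instr("LW",   0, 0, 1, 100),   # r1 = Mem[100 + r0]
    4:  Instr("ADDI", 1, 0, 2, 5),     # r2 = r1 + 5
    8:  Instr("SUB",  2, 1, 3, 0),     # r3 = r2 - r1
    12: Instr("SW",   0, 0, 3, 104),   # Mem[104 + r0] = r3
}
regs, memory = run(program, [0] * 32, {100: 7})
print(regs[1], regs[2], regs[3], memory[104])
```

Each loop iteration does the work of all five datapath steps for one instruction, which is the unpipelined behaviour that the later slides overlap in time.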

Pipelining the MIPS datapath
[Figure 3.1, Page 130, CAAQA 2e. The same five-step datapath: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back.]
- We will see more complex pipeline structures later. For example, the Pentium 4 "Netburst" architecture has 31 stages.

5-stage MIPS pipeline with pipeline buffers
[Figure 3.4, Page 134, CAAQA 2e. The datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB between the stages; the destination register field RD is carried along through ID/EX, EX/MEM and MEM/WB.]
- Data stationary control: local decode for each instruction phase / pipeline stage

Visualizing Pipelining
[Figure 3.3, Page 133, CAAQA 2e. Instructions in program order (vertical axis) against time in clock cycles 1 to 7 (horizontal axis).]
- Pipelining doesn't help the latency of a single instruction; it helps the throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to "fill" the pipeline and time to "drain" it reduce speedup
- Speedup comes from parallelism
  - For free: no new hardware

It's Not That Easy for Computers
- Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  - Structural hazards: HW cannot support this combination of instructions
  - Data hazards: instruction depends on the result of a prior instruction still in the pipeline
  - Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
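The fill and drain effect listed on the "Visualizing Pipelining" slide can be quantified with a short calculation. This is a minimal sketch under stated assumptions: every stage takes one cycle, the unpipelined machine uses the same clock and needs one cycle per stage per instruction, and the function names are made up for this example.

```python
def pipelined_cycles(n_instr, n_stages, stall_cycles=0):
    """Cycles to run n_instr instructions: fill the pipe once, then one per cycle."""
    return n_stages + (n_instr - 1) + stall_cycles

def speedup(n_instr, n_stages, stall_cycles=0):
    """Speedup over an unpipelined machine taking n_stages cycles per instruction."""
    unpipelined = n_instr * n_stages
    return unpipelined / pipelined_cycles(n_instr, n_stages, stall_cycles)

for n in (5, 100, 1_000_000):
    print(n, pipelined_cycles(n, 5), round(speedup(n, 5), 3))
```

For short instruction sequences the fill time dominates and the speedup is well below 5; it approaches the number of stages only for long runs, and any stall cycles caused by hazards lower it further.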

One Memory Port / Structural Hazards
[Figure 3.6, Page 142, CAAQA 2e. Pipeline diagram over clock cycles 1 to 7 for the sequence Load, Instr 1, Instr 2, Instr 3, Instr 4.]
- Eg if there is only one memory for both instructions and data
- Two different stages may need access at the same time
- Example: IBM/Sony/Toshiba Cell processor

One Memory Port / Structural Hazards
[Figure 3.7, Page 143, CAAQA 2e. The same sequence with a stall inserted before Instr 3: Load, Instr 1, Instr 2, Stall, Instr 3; the stall appears as a row of bubbles.]
- Instr 3 cannot be loaded in cycle 4
- ID stage has nothing to do in cycle 5
- EX stage has nothing to do in cycle 6, etc.
- The "bubble" propagates

Data Hazard on R1
[Figure 3.9, page 147, CAAQA 2e. Pipeline diagram with stages IF, ID/RF, EX, MEM, WB for the sequence below, in program order.]

    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11

Three Generic Data Hazards
- Read After Write (RAW): Instr J tries to read an operand before Instr I writes it

    I: add r1,r2,r3
    J: sub r4,r1,r3

- Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

Three Generic Data Hazards
- Write After Read (WAR): Instr J writes an operand before Instr I reads it

    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

- Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
- Can't happen in the MIPS 5-stage pipeline because:
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Writes are always in stage 5

Three Generic Data Hazards
- Write After Write (WAW): Instr J writes an operand before Instr I writes it

    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

- Called an "output dependence" by compiler writers. This also results from the reuse of the name "r1".
- Can't happen in the MIPS 5-stage pipeline because:
  - All instructions take 5 stages, and
  - Writes are always in stage 5
- We will see WAR and WAW hazards in later, more complicated pipes

Forwarding to Avoid Data Hazard
[Figure 3.10, Page 149, CAAQA 2e. Pipeline diagram over clock cycles for the sequence below, with ALU results forwarded to later instructions.]

    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11

HW Change for Forwarding
[Figure 3.20, Page 161, CAAQA 2e. Labelled components: NextPC, Registers, Immediate, ID/EX, multiplexors in front of the ALU, EX/MEM, Data Memory, MEM/WR.]
- Add forwarding ("bypass") paths
- Add multiplexors to select where each ALU operand should come from
- Determine mux control in the ID stage
- Forward if the source register is the target of an instruction that will not WB in time
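The three kinds of data hazard defined on the "Three Generic Data Hazards" slides differ only in which register names two instructions share. A minimal sketch of that classification follows; the tuple representation `(dest, src1, src2)` and the function name `classify` are assumptions for illustration. It reports dependences, meaning potential hazards; whether a hazard actually occurs depends on the pipeline, as the slides note for the MIPS 5-stage design.

```python
def classify(first, second):
    """Classify dependences between two instructions given in program order.

    Each instruction is a (dest, src1, src2) tuple of register names.
    """
    d1, *s1 = first
    d2, *s2 = second
    found = []
    if d1 in s2:
        found.append("RAW")   # second reads what first writes (true dependence)
    if d2 in s1:
        found.append("WAR")   # second writes what first reads (anti-dependence)
    if d1 == d2:
        found.append("WAW")   # both write the same name (output dependence)
    return found

# The instruction pairs I, J from the slides:
print(classify(("r1", "r2", "r3"), ("r4", "r1", "r3")))   # add ; sub
print(classify(("r4", "r1", "r3"), ("r1", "r2", "r3")))   # sub ; add
print(classify(("r1", "r4", "r3"), ("r1", "r2", "r3")))   # sub ; add
```

Only the RAW case reflects a real flow of data; the WAR and WAW cases come from reusing the name r1, which is why register renaming (Ch3 in the course plan) can remove them.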

Data Hazard Even with Forwarding
[Figure 3.12, Page 153, CAAQA 2e. Pipeline diagram over clock cycles for the sequence below, in program order.]

    lw  r1, 0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or  r8,r1,r9

Data Hazard Even with Forwarding
[Figure 3.13, Page 154, CAAQA 2e. The same sequence with a bubble inserted after the load.]
- EX stage waits in cycle 4 for the operand
- The following instruction ("and") waits in the ID stage
- Missed instruction issue opportunity

Software Scheduling to Avoid Load Hazards
- Try producing fast code for

    a = b + c;
    d = e - f;

  assuming a, b, c, d, e, and f are in memory.

    Slow code               Fast code
    LW  Rb,b                LW  Rb,b
    LW  Rc,c                LW  Rc,c
    STALL                   LW  Re,e
    ADD Ra,Rb,Rc            ADD Ra,Rb,Rc
    SW  a,Ra                LW  Rf,f
    LW  Re,e                SW  a,Ra
    LW  Rf,f                SUB Rd,Re,Rf
    STALL                   SW  d,Rd
    SUB Rd,Re,Rf
    SW  d,Rd
    10 cycles (2 stalls)    8 cycles (0 stalls)

- The stalls are shown explicitly.

Control Hazard on Branches: Three Stage Stall
[Pipeline diagram for the sequence below; the number before each instruction is its address.]

    10: beq r1,r3,36
    14: and r2,r3,r5
    18: or  r6,r1,r7
    22: add r8,r1,r9
    36: xor r10,r1,r11
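The cycle counts on the "Software Scheduling to Avoid Load Hazards" slide can be checked mechanically. The sketch below assumes full forwarding, so the only stall is the one-cycle load-use bubble when an instruction immediately follows the load that produces one of its operands. The `(op, dest, sources)` tuples and the function name `issue_cycles` are made up for this example, and the count covers issue slots only, ignoring pipeline fill.

```python
def issue_cycles(code):
    """Return (cycles, stalls) for a list of (op, dest, sources) instructions."""
    stalls = 0
    for prev, cur in zip(code, code[1:]):
        # Load-use hazard: the value loaded by prev is needed by the very next instruction.
        if prev[0] == "LW" and prev[1] in cur[2]:
            stalls += 1
    return len(code) + stalls, stalls

slow = [
    ("LW", "Rb", ()), ("LW", "Rc", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
fast = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()),
    ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
print(issue_cycles(slow), issue_cycles(fast))
```

Both schedules contain the same eight instructions; the fast version only reorders them so that an independent instruction sits between each load and its first use.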

Pipelined MIPS Datapath with early branch determination
[Figure 3.22, page 163, CAAQA 2/e. The five-stage datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB, with the branch-target adder and the Zero? test moved into the Instr. Decode / Reg. Fetch stage.]

Four Branch Hazard Alternatives
- #1: Stall until the branch direction is clear (wasteful: the next instruction is being fetched during ID)
- #2: Predict Branch Not Taken
  - Execute successor instructions in sequence
  - "Squash" instructions in the pipeline if the branch is actually taken
  - With MIPS we have the advantage of late pipeline state update
  - 47% of MIPS branches are not taken on average
  - PC+4 is already calculated, so use it to get the next instruction
- #3: Predict Branch Taken
  - 53% of MIPS branches are taken on average
  - But in the MIPS instruction set we haven't calculated the branch target address yet (because branches are relative to the PC)
    - MIPS still incurs a 1 cycle branch penalty
    - With some other machines, the branch target is known before the branch condition

Four Branch Hazard Alternatives
- #4: Delayed Branch
  - Define the branch to take place AFTER a following instruction:

      branch instruction
      sequential successor 1
      sequential successor 2
      ...
      sequential successor n
      branch target if taken

    (branch delay of length n)
  - A 1 slot delay allows proper decision and branch target address in the 5 stage pipeline
  - MIPS uses this; eg in

      Source:              Assembly:
      If (R1==0)               LW   R3, #100
        X=100                  LW   R4, #200
      Else                     BEQZ R1, L1
        X=200                  SW   R3, X
      R5 = X                   SW   R4, X
                           L1: LW   R5, X

  - The "SW R3, X" instruction is executed regardless
  - The "SW R4, X" instruction is executed only if R1 is non-zero

Delayed Branch
- Where to get instructions to fill the branch delay slot?
  - Before the branch instruction
  - From the target address: only valuable when the branch is taken
  - From fall through: only valuable when the branch is not taken
- Compiler effectiveness for a single branch delay slot:
  - Fills about 60% of branch delay slots
  - About 80% of instructions executed in branch delay slots are useful in computation
  - About 50% (60% x 80%) of slots usefully filled
- Delayed branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)
- Canceling branches
  - The branch delay slot instruction is executed, but write-back is disabled if it is not supposed to be executed
  - Two variants: branch "likely taken", branch "likely not-taken"
  - Allows more slots to be filled
[Figure: code layout around "Blt R1,L1", marking the candidate sources for the delay slot: before the branch, the target at L1, and the fall-through.]

Eliminating hazards with simultaneous multi-threading
- If we had no stalls we could finish one instruction every cycle
- If we had no hazards we could do without forwarding, and decode/control would be simpler too
[Figure: the pipelined datapath with two program counters (PC0, PC1) feeding Next PC, and two register sets (Thread 0 regs, Thread 1 regs).]
- Example: PowerPC processing element (PPE) in the Cell Broadband Engine (Sony PlayStation 3)
- IF maintains two Program Counters
  - Even cycle: fetch from PC0
  - Odd cycle: fetch from PC1
- Thread 0 reads and writes thread-0 registers
- No register-to-register hazards between adjacent pipeline stages

So how fast can this design go?
- A simple 5-stage pipeline can run at >3GHz
- Limited by the critical path through the slowest pipeline stage logic
- Tradeoff: do more per cycle? Or increase the clock rate?
  - Or do more per cycle, in parallel
- At 3GHz, the clock period is about 330 picoseconds
  - The time light takes to go about four inches
  - About 10 gate delays
    - For example, the Cell BE is designed for 11 FO4 ("fan-out=4") gates per cycle (ISSCC 2005)
    - Pipeline latches etc account for 3-5 FO4 delays, leaving only 5-8 for actual work
- How can we build a RAM that can implement our MEM stage in 5-8 FO4 delays?

Life used to be so easy
[Figure: Processor-DRAM Memory Gap (latency). Performance (log scale) against time. CPU ("Moore's Law"): µProc 60%/yr (2X/1.5yr). DRAM: 9%/yr (2X/10 yrs). Processor-Memory Performance Gap: grows 50% / year.]
- In 1980 a large RAM's access time was close to the CPU cycle time. 1980s machines had little or no need for cache. Life is no longer quite so simple.

Memory Hierarchy Terminology
- Hit: data appears in some block in the upper level (example: Block X)
  - Hit Rate: the fraction of memory accesses found in the upper level
  - Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
- Miss: data needs to be retrieved from a block in the lower level (Block Y)
  - Miss Rate = 1 - (Hit Rate)
  - Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
- Hit Time << Miss Penalty
  - Typically hundreds of missed instruction issue opportunities
[Figure: Blk X in Upper Level Memory exchanging data to and from the processor; Blk Y in Lower Level Memory.]
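The numbers quoted on the "So how fast can this design go?" and "Life used to be so easy" slides can be reproduced with a few lines of arithmetic. This is a sketch for checking orders of magnitude only; the variable names are invented here, and the speed of light used is the vacuum value, so signals in real wires travel less far.

```python
CLOCK_HZ = 3e9                      # 3 GHz clock from the slide
C_MM_PER_PS = 0.299792458           # speed of light in vacuum, millimetres per picosecond

period_ps = 1e12 / CLOCK_HZ                     # clock period in picoseconds
light_inches = period_ps * C_MM_PER_PS / 25.4   # distance light covers in one cycle

# Annual improvement factors from the processor-DRAM gap figure.
cpu_growth = 1.60                   # processor performance: 60% per year
dram_growth = 1.09                  # DRAM latency: 9% per year
gap_growth = cpu_growth / dram_growth           # factor by which the gap widens each year

print(f"period = {period_ps:.1f} ps, light travels {light_inches:.2f} in per cycle")
print(f"gap grows by {100 * (gap_growth - 1):.0f}% per year")
```

The period comes out slightly above the 330 ps quoted on the slide, and the gap growth slightly below the rounded 50% per year; both are consistent with the slides at the precision they use.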

Levels of the Memory Hierarchy
Each level is listed with its capacity, access time, cost, who manages the staging to the next level, and the transfer unit.
- CPU registers: 100s of bytes; <1 ns; managed by programmer/compiler; transfer unit: instructions and operands, 1-16 bytes
- Cache (perhaps multilevel): 10s-1000s KBytes; 1-10 ns; $10/MByte; managed by the cache controller; transfer unit: blocks
- Main memory: GBytes; 100ns-300ns; $1/MByte; managed by the operating system; transfer unit: pages, 4K-8K bytes
- Disk: 100s of GBytes; 10 ms (10,000,000 ns); $0.0031/MByte; managed by user/operator; transfer unit: files, MBytes
- Tape: infinite; sec-min; $0.001/MByte
- Upper levels are faster; lower levels are larger.

The Principle of Locality
- The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
- Two different types of locality:
  - Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  - Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
- In recent years, architectures have become increasingly reliant (totally reliant?) on locality for speed

Cache Measures
- Hit rate: fraction found in that level
  - So high that we usually talk about the miss rate
  - Miss rate fallacy: miss rate is to average memory access time what MIPS is to CPU performance
- Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks)
- Miss penalty: time to replace a block from the lower level, including time to replace in the CPU
  - Access time: time to lower level = f(latency to lower level)
  - Transfer time: time to transfer block = f(BW between upper & lower levels)

Direct-mapped cache - storage
1 KB Direct Mapped Cache, 32B blocks
- For a 2^N byte cache:
  - The uppermost (32 - N) bits are always the Cache Tag
  - The lowest M bits are the Byte Select (Block Size = 2^M)
- Example address fields: Cache Tag 0x50, Cache Index 0x01, Byte Select 0x00
- The Valid Bit and Cache Tag are stored as part of the cache state
[Figure: 32 cache entries, each with a valid bit, a cache tag (0x50 in entry 1) and 32 bytes of data: Byte 0 to Byte 31 in entry 0, Byte 32 to Byte 63 in entry 1, through to Byte 992 to Byte 1023 in entry 31.]
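The average memory-access time formula on the "Cache Measures" slide, and the miss rate fallacy it warns about, can be illustrated with a small calculation. The function name `amat` and all the numbers below are hypothetical values chosen for the example, not measurements from the course.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory-access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Two hypothetical cache designs, all times in clock cycles.
design_a = amat(hit_time=1, miss_rate=0.04, miss_penalty=50)   # faster hit, more misses
design_b = amat(hit_time=3, miss_rate=0.02, miss_penalty=50)   # slower hit, fewer misses

print(f"A: {design_a:.1f} cycles, B: {design_b:.1f} cycles")
```

Design B halves the miss rate yet has the higher average access time, because its longer hit time is paid on every access. Comparing caches by miss rate alone is therefore misleading in the same way as comparing CPUs by MIPS alone.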

Direct-mapped cache - read access
1 KB Direct Mapped Cache, 32B blocks
[Figure: the same cache as on the previous slide (address split into Cache Tag, eg 0x50; Cache Index, eg 0x01; Byte Select, eg 0x00). The tag stored at the indexed entry is compared with the address tag to produce Hit, and the byte select picks the Data.]

1 KB Direct Mapped Cache, 32B blocks
[Figure: main memory blocks numbered 0, 1, 2, ... next to the 32-block cache.]
- Cache location 0 can be occupied by data from main memory location 0, 32, 64, etc.
- Cache location 1 can be occupied by data from main memory location 1, 33, 65, etc.
- In general, all locations with the same Address<9:5> bits map to the same location in the cache
  - Which one should we place in the cache?
  - How can we tell which one is in the cache?

Direct-mapped Cache - structure
- Capacity: C bytes (eg 1KB)
- Blocksize: B bytes (eg 32)
- Byte select bits: 0..log(B)-1 (eg 0..4)
- Number of blocks: C/B (eg 32)
- Address size: A (eg 32 bits)
- Cache index size: I = log(C/B) (eg log(32) = 5)
- Tag size: A-I-log(B) (eg 32-5-5 = 22)

Two-way Set Associative Cache
- N-way set associative: N entries for each Cache Index
  - N direct mapped caches operated in parallel (N typically 2 to 4)
- Example: two-way set associative cache
  - The Cache Index selects a "set" from the cache
  - The two tags in the set are compared in parallel
  - Data is selected based on the tag result
[Figure: two banks, each with Valid, Cache Tag and Cache Block entries selected by the Cache Index; the address tag (Adr Tag) is compared against both stored tags; the compare results drive Sel1/Sel0 of a mux that selects the Cache Block, and are ORed to produce Hit.]
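The field sizes listed on the "Direct-mapped Cache - structure" slide follow directly from the capacity and block size. The sketch below computes them and splits an address into tag, index and byte select. It assumes power-of-two sizes; the function names and the sample address 0x14020 (chosen so that it decodes to the slide's example fields) are inventions for this illustration.

```python
def direct_mapped_geometry(capacity, block_size, addr_bits=32):
    """Return (tag_bits, index_bits, offset_bits); sizes must be powers of two."""
    offset_bits = block_size.bit_length() - 1                 # log2(B)
    index_bits = (capacity // block_size).bit_length() - 1    # log2(C/B)
    tag_bits = addr_bits - index_bits - offset_bits           # A - I - log2(B)
    return tag_bits, index_bits, offset_bits

def split_address(addr, index_bits, offset_bits):
    """Split an address into (tag, index, byte_select)."""
    byte_select = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, byte_select

tag_bits, index_bits, offset_bits = direct_mapped_geometry(1024, 32)
print(tag_bits, index_bits, offset_bits)
print([hex(f) for f in split_address(0x14020, index_bits, offset_bits)])
```

For the 1 KB cache with 32-byte blocks this gives a 22-bit tag, a 5-bit index and a 5-bit byte select, and the sample address decodes to tag 0x50, index 0x01, byte select 0x00.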

Disadvantage of Set Associative Cache
- N-way Set Associative Cache v. Direct Mapped Cache:
  - N comparators vs. 1
  - Extra MUX delay for the data
  - Data comes AFTER Hit/Miss
- In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss:
  - Possible to assume a hit and continue. Recover later if miss.
[Figure: the two-way set associative cache organisation: two banks of Valid, Cache Tag and Cache Block selected by the Cache Index, two tag comparators, a mux selecting the Cache Block, and an OR gate producing Hit.]

Basic cache terminology
- Example: Intel Pentium 4 Level-1 cache (pre-Prescott)
  - Capacity: 8K bytes (total amount of data the cache can store)
  - Block: 64 bytes (so there are 8K/64 = 128 blocks in the cache)
  - Ways: 4 (addresses with the same index bits can be placed in one of 4 ways)
  - Sets: 32 (= 128/4, that is, each RAM array holds 32 blocks)
  - Index: 5 bits (since 2^5 = 32 and we need the index to select one of the 32 sets)
  - Tag: 21 bits (= 32 minus 5 for index, minus 6 to address the byte within the block)
  - Access time: 2 cycles (0.6ns at 3GHz; pipelined, dual-ported [load+store])

Questions for Memory Hierarchy
- Q1: Where can a block be placed in the upper level? (Block placement)
- Q2: How is a block found if it is in the upper level? (Block identification)
- Q3: Which block should be replaced on a miss? (Block replacement)
- Q4: What happens on a write? (Write strategy)

Q1: Where can a block be placed in the upper level?
(Example: a cache with 8 block locations.)
- In a fully-associative cache, block 12 can be placed in any location in the cache
- In a two-way set-associative cache, the set is determined by its low-order address bits: (12 mod 4) = 0. Block 12 can be placed in either of the two cache locations in set 0
- In a direct-mapped cache, block 12 can only be placed in one cache location, determined by its low-order address bits: (12 mod 8) = 4
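The block placement rule in Q1 and the Pentium 4 parameters on the "Basic cache terminology" slide are instances of the same arithmetic. The sketch below treats direct-mapped and fully-associative caches as the two extremes of set associativity. The function names are invented here, and numbering the frames of a set contiguously is an assumption made only so the output is easy to read.

```python
def candidate_frames(block, n_frames, ways):
    """Return (set_index, frames) where the given block number may be placed."""
    n_sets = n_frames // ways
    set_index = block % n_sets                    # low-order bits of the block address
    return set_index, [set_index * ways + w for w in range(ways)]

def set_assoc_geometry(capacity, block_size, ways, addr_bits=32):
    """Return (n_sets, index_bits, offset_bits, tag_bits); sizes are powers of two."""
    n_sets = capacity // block_size // ways
    index_bits = n_sets.bit_length() - 1
    offset_bits = block_size.bit_length() - 1
    return n_sets, index_bits, offset_bits, addr_bits - index_bits - offset_bits

print(candidate_frames(12, 8, ways=1))   # direct mapped
print(candidate_frames(12, 8, ways=2))   # two-way set associative
print(candidate_frames(12, 8, ways=8))   # fully associative
print(set_assoc_geometry(8 * 1024, 64, ways=4))
```

With 8 frames, block 12 maps to frame 4 when direct mapped, to either frame of set 0 when two-way set associative, and to any frame when fully associative. The last line reproduces the 32 sets, 5 index bits, 6 offset bits and 21 tag bits quoted for the Pentium 4 L1 cache.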

Q2: How is a block found if it is in the upper level?
- Tag on each block
  - No need to check index or block offset
- Address fields: Block Address (Tag, Index), then Block Offset
- Increasing associativity shrinks the index and expands the tag
[Figure: the two-way set associative lookup: Cache Index selects a set, the address tag (Adr Tag) is compared with both stored Cache Tags, the results select the Cache Block through a mux and are ORed to produce Hit.]

Q3: Which block should be replaced on a miss?
- Easy for direct mapped
- Set associative or fully associative:
  - Random
  - LRU (Least Recently Used)
- Miss rates by cache size, associativity and replacement policy:

                  2-way              4-way              8-way
    Size       LRU     Random     LRU     Random     LRU     Random
    16 KB      5.2%    5.7%       4.7%    5.3%       4.4%    5.0%
    64 KB      1.9%    2.0%       1.5%    1.7%       1.4%    1.5%
    256 KB     1.15%   1.17%      1.13%   1.13%      1.12%   1.12%

- Benchmark studies show that LRU beats random only with small caches

Q4: What happens on a write?
- Write through: the information is written to both the block in the cache and to the block in the lower-level memory
- Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
  - Is the block clean or dirty?
- Pros and cons of each?
  - WT: read misses cannot result in writes
  - WB: no repeated writes to the same location
- WT is always combined with write buffers so that the processor doesn't wait for lower-level memory

Write Buffer for Write Through
[Figure: Processor connected to Cache and to a Write Buffer; the Write Buffer connected to DRAM.]
- A write buffer is needed between the cache and memory
  - Processor: writes data into the cache and the write buffer
  - Memory controller: writes contents of the buffer to memory
- The write buffer is just a FIFO:
  - Typical number of entries: 4
  - Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
- Memory system designer's nightmare:
  - Store frequency (w.r.t. time) -> 1 / DRAM write cycle
  - Write buffer saturation
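The LRU policy named under Q3 can be sketched with a small set-associative cache model. This is an illustrative simulation only: the class name `LRUCache`, the use of block numbers rather than byte addresses, and the short access trace are all assumptions for the example, and real hardware usually approximates LRU rather than tracking exact recency.

```python
from collections import OrderedDict

class LRUCache:
    """Set-associative cache of block numbers with exact LRU replacement."""

    def __init__(self, n_sets, ways):
        self.n_sets, self.ways = n_sets, ways
        # One ordered dict per set: oldest (least recently used) tag first.
        self.sets = [OrderedDict() for _ in range(n_sets)]
        self.hits = self.misses = 0

    def access(self, block):
        """Access a block number; return True on a hit, False on a miss."""
        lines = self.sets[block % self.n_sets]
        tag = block // self.n_sets
        if tag in lines:
            lines.move_to_end(tag)        # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(lines) == self.ways:
            lines.popitem(last=False)     # evict the least recently used tag
        lines[tag] = True
        return False

cache = LRUCache(n_sets=1, ways=2)
trace = [0, 1, 0, 2, 0]
results = [cache.access(b) for b in trace]
print(results, cache.hits, cache.misses)
```

In this trace block 0 is touched again just before block 2 arrives, so LRU evicts block 1 and the final access to block 0 still hits. Swapping `popitem(last=False)` for a random choice would give the random policy compared in the table.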

A Modern Memory Hierarchy
- By taking advantage of the principle of locality:
  - Present the user with as much memory as is available in the cheapest technology.
  - Provide access at the speed offered by the fastest technology.
[Figure: Processor (Control, Datapath, Registers) with On-Chip Cache, then Second Level Cache (SRAM), Main Memory (DRAM), Secondary Storage (Disk), Tertiary Storage (Disk/Tape). Speed (ns): 1s, 10s, 100s, 10,000,000s (10s ms), 10,000,000,000s (10s sec). Size (bytes): 100s, Ks, Ms, Gs, Ts.]

Large-scale storage
- StorageTek STK 9310 ("Powderhorn")
  - 2,000, 3,000, 4,000, 5,000, or 6,000 cartridge slots per library storage module (LSM)
  - Up to 24 LSMs per library (144,000 cartridges)
  - 120 TB (1 LSM) to 28,800 TB capacity (24 LSMs)
  - Each cartridge holds 300GB, readable at up to 40 MB/sec
- Up to 28.8 petabytes
- Average tape load time is measured in seconds
- See http://en.wikipedia.org/wiki/Tape_library

Can we live without cache?
- Interesting exception: Cray/Tera MTA
  - Each CPU switches every cycle between 128 threads
  - Each thread can have up to 8 outstanding memory accesses
  - 3D toroidal mesh interconnect
  - Memory accesses are hashed to spread load across banks
- MTA-1 fabricated using Gallium Arsenide, not silicon
  - "nearly un-manufacturable" (Wikipedia)
- Third generation: Cray XMT

Where we are going
- Ch1: Review of pipelined, in-order processor architecture and simple cache structures
- Ch2: Caches in more depth; software techniques to improve cache performance; virtual memory; benchmarking; fab
- Ch3: Instruction-level parallelism; dynamic scheduling, out-of-order; register renaming; speculative execution; branch prediction; limits to ILP
- Ch4: Compiler techniques: loop nest transformations; loop parallelisation, interchange, tiling/blocking, skewing
- Ch5: Multithreading, hyperthreading, SMT; static instruction scheduling; software pipelining; EPIC/IA-64; instruction-set support for speculation and register renaming
- Ch6: GPUs, GPGPU, and manycore
- Ch7: Shared-memory multiprocessors; cache coherency; large-scale cache-coherency; ccNUMA, COMA
- Lab-based coursework exercise: simulation study "challenge", using performance analysis tools
- Exam: partially based on a recent processor architecture article, which we will study in advance (see past papers)
