Pre-requisites. This is a textbook-based course. Chapter 1. Pipelines, Performance, Caches, and Virtual Memory. January 2009 Paul H J Kelly


332 Advanced Computer Architecture
Chapter 1: Introduction and review of Pipelines, Performance, Caches, and Virtual Memory
January 2009, Paul H J Kelly
These lecture notes are partly based on the course text, Hennessy and Patterson's Computer Architecture, a quantitative approach (4th ed), and on the lecture slides of David Patterson's Berkeley course (CS252).
Course materials online at http//
Advanced Computer Architecture Chapter 1. p1

Pre-requisites
This is a third-level computer architecture course.
The usual path would be to take this course after following a course based on a textbook like Computer Organization and Design (Patterson and Hennessy, Morgan Kaufmann).
This course is based on the more advanced book by the same authors (see next slide).
You can take this course provided you're prepared to catch up if necessary.
Read chapters 1 to 8 of Computer Organization and Design (COD) if this material is new to you.
If you have studied computer architecture before, make sure COD Chapters 2, 6, 7 are familiar.
See also Appendix A, "Pipelining: Basic and Intermediate Concepts", of the course textbook.
FAST review today of Pipelining, Performance, Caches, and Virtual Memory.

This is a textbook-based course
Computer Architecture: A Quantitative Approach (4th Edition), John L. Hennessy, David A. Patterson.
~580 pages. Morgan Kaufmann (2007); ISBN; with substantial additional material on CD.
Price (Amazon.co.uk, Nov 2006)
Publisher's companion web site: http//textbooks.elsevier.com/ /
Textbook includes some vital introductory material as appendices:
Appendix A: tutorial on pipelining (read it NOW)
Appendix C: tutorial on caching (read it NOW)
Further appendices (some in book, some on CD) cover more advanced material (some very relevant to parts of the course), eg:
Networks; Parallel applications; Implementing Coherence Protocols; Embedded systems; VLIW; Computer arithmetic (esp floating point); Historical perspectives.

Who are these guys anyway and why should I read their book?
John Hennessy:
Founder, MIPS Computer Systems
President, Stanford University (he earlier succeeded Condoleezza Rice as Stanford's provost)
David Patterson:
Leader, Berkeley RISC project (led to Sun's SPARC)
RAID (redundant arrays of inexpensive disks)
Professor, University of California, Berkeley
Current president of the ACM
Served on Information Technology Advisory Committee to the US President
RAID-I (1989) consisted of a Sun 4/280 workstation with 128 MB of DRAM, four dual-string SCSI controllers, inch SCSI disks and specialized disk striping software.
RISC-I (1982) contains 44,420 transistors, fabbed in 5 micron NMOS, with a die area of 77 mm^2, ran at 1 MHz. This chip is probably the first VLSI RISC.

Administration details
Course web site: http// e.html
Course textbook: H&P 4th ed. Read Appendix A right away.
Background for 2008 context: see Workshop on Trends in Computing Performance, http://www7.nationalacademies.org/cstb/project_computingperformance_workshop.html

Course organisation
Lecturer: Paul Kelly. Leader, Software Performance Optimisation research group.
Tutorial helper: Anton Lokhmotov, postdoctoral researcher. PhD from Cambridge on optimisation and algorithms for SIMD. Industry experience with Broadcom (VLIW hardware), Clearspeed (massively-multicore SIMD hardware), Codeplay (compilers for games), ACE (compilers).
3 hours per week: nominally two hours of lectures, one hour of classroom tutorials. We will use the time more flexibly.
Assessment:
Exam
For CS M.Eng. class, exam will take place in last week of term.
For everyone else, exam will take place early in the summer term.
The goal of the course is to teach you how to think about computer architecture.
The exam usually includes some architectural ideas not presented in the lectures.
Coursework
You will be assigned a substantial, laboratory-based exercise.
You will learn about performance tuning for computationally-intensive kernels.
You will learn about using simulators, and experimentally evaluating hypotheses to understand system performance.
You are encouraged to bring laptops to class to get started and get help during tutorials.
Please do not use computers for anything else during classes.

Course overview (plan)
Ch1: Review of pipelined, in-order processor architecture and simple cache structures
Ch2: Caches in more depth. Software techniques to improve cache performance. Virtual memory. Benchmarking. Fab.
Ch3: Instruction-level parallelism. Dynamic scheduling, out-of-order. Register renaming. Speculative execution. Branch prediction. Limits to ILP.
Ch4: Compiler techniques: loop nest transformations. Loop parallelisation, interchange, tiling/blocking, skewing.
Ch5: Multithreading, hyperthreading, SMT. Static instruction scheduling. Software pipelining. EPIC/IA-64; instruction-set support for speculation and register renaming.
Ch6: GPUs, GPGPU, and manycore
Ch7: Shared-memory multiprocessors. Cache coherency. Large-scale cache-coherency; ccNUMA. COMA.
Lab-based coursework exercise: simulation study "challenge", using performance analysis tools.
Exam: partially based on recent processor architecture article, which we will study in advance (see past papers).

A "Typical" RISC
32-bit fixed format instruction (3 formats, see next slide)
32 32-bit general-purpose registers (R0 contains zero, double-precision/long operands occupy a pair)
Memory access only via load/store instructions
No instruction both accesses memory and does arithmetic
All arithmetic is done on registers
3-address, reg-reg arithmetic instruction: subw r1,r2,r3 means r1 = r2-r3
Register identifiers always occupy same bits of instruction encoding
Single addressing mode for load/store: base + displacement, ie register contents are added to constant from instruction word, and used as address, eg lw r2,100(r1) means r2 = Mem[100+r1]; no indirection
Simple branch conditions
Delayed branch
see: SPARC, MIPS, ARM, HP PA-Risc, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
Not: Intel IA-32, IA-64 (?), Motorola 68000, DEC VAX, PDP-11, IBM 360/370
Eg: VAX matchc, IA32 scas instructions!

Example: MIPS (Note register location)
Register-Register: Op | Rs1 | Rs2 | Rd | Opx
Register-Immediate: Op | Rs1 | Rd | immediate
Branch: Op | Rs1 | Rs2/Opx | immediate
Jump / Call: Op | target
Q: What is the largest signed immediate operand for "subw r1,r2,X"?
Q: What range of addresses can a conditional branch jump to?

So where do I find a MIPS processor?
MIPS licensees shipped more than 350 million units during fiscal year 2007 (http//)
HP 100 multifunction printer
Digimax L85 digital camera (http//)
Sony PS2 and PSP
Linksys WRT54G Router (Linux-based)

A machine to execute these instructions
To execute this instruction set we need a machine that fetches them and does what each instruction says.
A "universal" computing device: a simple digital circuit that, with the right code, can compute anything.
Something like:
    Instr = Mem[PC]; PC += 4;
    rs1 = Reg[Instr.rs1];
    rs2 = Reg[Instr.rs2];
    imm = SignExtend(Instr.imm);
    Operand1 = if (Instr.op == BRANCH) then PC else rs1;
    Operand2 = if (immediateOperand(Instr.op)) then imm else rs2;
    res = ALU(Instr.op, Operand1, Operand2);
    switch (Instr.op) {
      case BRANCH: if (rs1 == 0) then PC = PC + imm; continue;
      case STORE: Mem[res] = rs1; continue;
      case LOAD: lmd = Mem[res];
    }
    Reg[Instr.rd] = if (Instr.op == LOAD) then lmd else res;

5 Steps of MIPS Datapath
[Figure 3.1, Page 130, CAAQA 2e. Datapath stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back.]
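
The "something like" loop above can be tried out directly. The following is a minimal Python sketch, not the course's own simulator: the opcode names, the `Instr` record, the `IMMEDIATE_OPS` set and the `run` function are all hypothetical, and it assumes a store writes the value of the second source register (the slide's pseudocode stores rs1).

```python
from dataclasses import dataclass

# Hypothetical opcode set: which operations take their second operand
# from the immediate field rather than from rs2.
IMMEDIATE_OPS = {"ADDI", "LOAD", "STORE", "BRANCH"}


@dataclass
class Instr:
    op: str        # "ADD", "SUB", "ADDI", "LOAD", "STORE" or "BRANCH"
    rs1: int = 0   # first source register number
    rs2: int = 0   # second source register number (store data for STORE)
    rd: int = 0    # destination register number
    imm: int = 0   # already sign-extended immediate


def alu(op, a, b):
    """SUB subtracts; every other operation (including address calculation) adds."""
    return a - b if op == "SUB" else a + b


def run(program, reg, mem, max_steps=1000):
    """Fetch/decode/execute loop; instructions are 4 bytes apart, as on the slide."""
    pc = 0
    for _ in range(max_steps):
        if pc // 4 >= len(program):
            break                                  # ran off the end: halt
        instr = program[pc // 4]                   # Instr = Mem[PC]
        pc += 4                                    # PC += 4
        rs1, rs2 = reg[instr.rs1], reg[instr.rs2]
        operand1 = pc if instr.op == "BRANCH" else rs1
        operand2 = instr.imm if instr.op in IMMEDIATE_OPS else rs2
        res = alu(instr.op, operand1, operand2)
        if instr.op == "BRANCH":
            if rs1 == 0:                           # branch if register is zero
                pc = res                           # PC = PC + imm
            continue
        if instr.op == "STORE":
            mem[res] = rs2                         # assumption: rs2 holds the data
            continue
        value = mem.get(res, 0) if instr.op == "LOAD" else res
        if instr.rd != 0:                          # R0 always reads as zero
            reg[instr.rd] = value
    return reg, mem
```

Calling `run` with a list of `Instr` records, a 32-entry register list and an empty dictionary for memory executes the program to completion; keeping memory as a dictionary is only a convenience for small examples.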

Pipelining the MIPS datapath
[Figure 3.1, Page 130, CAAQA 2e. Same datapath, divided into Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back.]
We will see more complex pipeline structures later. For example, the Pentium 4 "Netburst" architecture has 31 stages.

5-stage MIPS pipeline with pipeline buffers
[Figure 3.4, Page 134, CAAQA 2e. The five stages separated by the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB.]
Data stationary control: local decode for each instruction phase / pipeline stage.

Visualizing Pipelining
[Figure 3.3, Page 133, CAAQA 2e. Time (clock cycles 1 to 7) runs horizontally; instruction order runs vertically.]
Pipelining doesn't help latency of single instruction; it helps throughput of entire workload.
Pipeline rate limited by slowest pipeline stage.
Potential speedup = number of pipe stages.
Unbalanced lengths of pipe stages reduces speedup.
Time to "fill" pipeline and time to "drain" it reduces speedup.
Speedup comes from parallelism: for free, no new hardware.

It's Not That Easy for Computers
Limits to pipelining: hazards prevent next instruction from executing during its designated clock cycle.
Structural hazards: HW cannot support this combination of instructions.
Data hazards: instruction depends on result of prior instruction still in the pipeline.
Control hazards: caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).
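
The claims on the Visualizing Pipelining slide (potential speedup equals the number of stages, reduced by fill and drain time and by stalls) can be checked with a little arithmetic. This is a sketch under stated assumptions: stages are perfectly balanced, the unpipelined machine takes one cycle per stage per instruction at the same clock, and the function names are invented for illustration.

```python
def pipelined_cycles(n_instr, n_stages, stall_cycles=0):
    """Cycles to run n_instr instructions: fill the pipe once, then one per cycle."""
    return n_stages + (n_instr - 1) + stall_cycles


def speedup(n_instr, n_stages, stall_cycles=0):
    """Speedup over an unpipelined machine that spends n_stages cycles per instruction."""
    unpipelined = n_instr * n_stages
    return unpipelined / pipelined_cycles(n_instr, n_stages, stall_cycles)
```

For a single instruction the speedup is 1 (latency is not improved); for long instruction streams it approaches the stage count, and every stall cycle pulls it back down.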

One Memory Port / Structural Hazards
[Figure 3.6, Page 142, CAAQA 2e. Load, Instr 1, Instr 2, Instr 3 and Instr 4 overlapped across clock cycles 1 to 7.]
Eg if there is only one memory for both instructions and data.
Two different stages may need access at same time.
Example: IBM/Sony/Toshiba Cell processor.

One Memory Port / Structural Hazards
[Figure 3.7, Page 143, CAAQA 2e. Load, Instr 1, Instr 2, Stall (a row of bubbles), Instr 3 across clock cycles 1 to 7.]
Instr 3 cannot be loaded in cycle 4.
ID stage has nothing to do in cycle 5.
EX stage has nothing to do in cycle 6, etc. "Bubble" propagates.

Data Hazard on R1
[Figure 3.9, page 147, CAAQA 2e. Pipeline stages IF, ID/RF, EX, MEM, WB over time for the instructions below.]
    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11

Three Generic Data Hazards
Read After Write (RAW): Instr J tries to read operand before Instr I writes it.
    I: add r1,r2,r3
    J: sub r4,r1,r3
Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

Three Generic Data Hazards
Write After Read (WAR): Instr J writes operand before Instr I reads it.
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
Can't happen in MIPS 5 stage pipeline because:
All instructions take 5 stages, and
Reads are always in stage 2, and
Writes are always in stage 5.

Three Generic Data Hazards
Write After Write (WAW): Instr J writes operand before Instr I writes it.
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
Called an "output dependence" by compiler writers. This also results from the reuse of name "r1".
Can't happen in MIPS 5 stage pipeline because:
All instructions take 5 stages, and
Writes are always in stage 5.
Will see WAR and WAW in later more complicated pipes.

Forwarding to Avoid Data Hazard
[Figure 3.10, Page 149, CAAQA 2e. Time (clock cycles) against instruction order for the instructions below, with forwarding paths from the ALU output.]
    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11

HW Change for Forwarding
[Figure 3.20, Page 161, CAAQA 2e. NextPC, Registers and Immediate feed ID/EX; multiplexors select the ALU inputs; results pass through EX/MEM, Data Memory and MEM/WR.]
Add forwarding ("bypass") paths.
Add multiplexors to select where ALU operand should come from.
Determine mux control in ID stage.
Forward if source register is the target of an instruction that will not WB in time.
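
The three generic hazards can be told apart mechanically from the destination and source registers of an instruction pair. The sketch below is illustrative only: the `(destination, sources)` tuple representation and the `data_hazards` name are assumptions, and it reports the dependences that exist between I and J, not whether a particular pipeline would actually stall on them.

```python
def data_hazards(first, second):
    """Classify hazards between an earlier instruction I and a later one J.

    Each instruction is a (destination, sources) pair; destination is None
    for instructions that write no register (e.g. a store).
    """
    first_dest, first_srcs = first
    second_dest, second_srcs = second
    found = set()
    if first_dest is not None and first_dest in second_srcs:
        found.add("RAW")   # J reads what I writes: true dependence
    if second_dest is not None and second_dest in first_srcs:
        found.add("WAR")   # J writes what I reads: anti-dependence
    if first_dest is not None and first_dest == second_dest:
        found.add("WAW")   # both write the same name: output dependence
    return found
```

Applied to the instruction pairs on the slides, `add r1,r2,r3` followed by `sub r4,r1,r3` gives RAW, while the `sub`/`add` pairs from the WAR and WAW slides give an anti-dependence and an output dependence respectively.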

Data Hazard Even with Forwarding
[Figure 3.12, Page 153, CAAQA 2e. Time (clock cycles) against instruction order for the instructions below.]
    lw  r1,0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or  r8,r1,r9

Data Hazard Even with Forwarding
[Figure 3.13, Page 154, CAAQA 2e. The same sequence with a bubble inserted after the load.]
EX stage waits in cycle 4 for operand.
Following instruction ("and") waits in ID stage.
Missed instruction issue opportunity.

Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f in memory.
    Slow code:              Fast code:
    LW   Rb,b               LW   Rb,b
    LW   Rc,c               LW   Rc,c
    STALL                   LW   Re,e
    ADD  Ra,Rb,Rc           ADD  Ra,Rb,Rc
    SW   a,Ra               LW   Rf,f
    LW   Re,e               SW   a,Ra
    LW   Rf,f               SUB  Rd,Re,Rf
    STALL                   SW   d,Rd
    SUB  Rd,Re,Rf
    SW   d,Rd
    10 cycles (2 stalls)    8 cycles (0 stalls)
The stalls are shown explicitly.

Control Hazard on Branches: Three Stage Stall
    10: beq r1,r3,36
    14: and r2,r3,r5
    18: or  r6,r1,r7
    22: add r8,r1,r9
    36: xor r10,r1,r11
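
The cycle counts quoted for the slow and fast code in the Software Scheduling slide can be reproduced by counting load-use stalls. This sketch assumes one stall whenever an instruction uses the result of the load immediately before it, and counts cycles the way the slide does (instructions plus stalls, ignoring pipeline fill). The tuple encoding and function names are hypothetical.

```python
def count_load_use_stalls(program):
    """Count instructions that use the destination of the load just before them."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "LW" and prev_dest in cur_sources:
            stalls += 1
    return stalls


def cycles(program):
    """Slide-style cycle count: one per instruction plus one per stall."""
    return len(program) + count_load_use_stalls(program)


# Each instruction is (opcode, destination register, source registers).
SLOW = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("SW", None, ("Ra",)), ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
FAST = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()), ("SW", None, ("Ra",)),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
```

Both sequences contain the same eight instructions; only the order differs, which is why the scheduled version saves exactly the two stall cycles.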

Pipelined MIPS Datapath with early branch determination
[Figure 3.22, page 163, CAAQA 2/e. The 5-stage datapath with the branch target adder and the Zero? test moved into the Instr. Decode / Reg. Fetch stage.]

Four Branch Hazard Alternatives
#1: Stall until branch direction is clear (wasteful: the next instruction is being fetched during ID)
#2: Predict Branch Not Taken
Execute successor instructions in sequence.
"Squash" instructions in pipeline if branch actually taken.
With MIPS we have advantage of late pipeline state update.
47% MIPS branches are not taken on average.
PC+4 already calculated, so use it to get next instruction.
#3: Predict Branch Taken
53% MIPS branches are taken on average.
But in MIPS instruction set we haven't calculated branch target address yet (because branches are relative to the PC).
MIPS still incurs 1 cycle branch penalty.
With some other machines, branch target is known before branch condition.

Four Branch Hazard Alternatives
#4: Delayed Branch
Define branch to take place AFTER a following instruction:
    branch instruction
    sequential successor 1
    sequential successor 2
    ...
    sequential successor n
    branch target if taken
(Branch delay of length n.)
1 slot delay allows proper decision and branch target address in 5 stage pipeline.
MIPS uses this; eg in
    Source:                     Assembly:
    If (R1==0)                      LW   R3, #100
      X=100                         LW   R4, #200
    Else                            BEQZ R1, L1
      X=100                         SW   R3, X
      X=200                         SW   R4, X
    R5 = X                      L1: LW   R5, X
"SW R3, X" instruction is executed regardless.
"SW R4, X" instruction is executed only if R1 is non-zero.

Delayed Branch
Where to get instructions to fill branch delay slot?
Before branch instruction.
From the target address: only valuable when branch taken.
From fall through: only valuable when branch not taken.
Compiler effectiveness for single branch delay slot:
Fills about 60% of branch delay slots.
About 80% of instructions executed in branch delay slots useful in computation.
About 50% (60% x 80%) of slots usefully filled.
Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar).
Canceling branches:
Branch delay slot instruction is executed but write-back is disabled if it is not supposed to be executed.
Two variants: branch "likely taken", branch "likely not-taken"; allows more slots to be filled.
[Figure: code before the branch "Blt R1,L1", the fall-through code after it, and the target code at L1.]
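
The four branch alternatives can be compared by the average number of cycles each wastes per branch, using only the figures quoted on the slides (53% taken, 1 cycle penalty, 60% x 80% of delay slots usefully filled). This is a rough sketch: the function names are invented, and the 20% branch frequency in the usage note is a made-up illustration, not a figure from the lecture.

```python
TAKEN_FRACTION = 0.53          # from the slide: 53% of MIPS branches are taken
SLOT_USEFUL = 0.60 * 0.80      # slots filled x filled slots that do useful work


def average_branch_penalty(taken=TAKEN_FRACTION, slot_useful=SLOT_USEFUL):
    """Average wasted cycles per branch for each alternative (1-cycle penalty)."""
    return {
        "stall": 1.0,                          # always lose the cycle
        "predict not taken": taken,            # lose it only when taken
        "predict taken": 1.0,                  # MIPS: target not yet known
        "delayed branch": 1.0 - slot_useful,   # lose it when the slot is wasted
    }


def cpi(branch_fraction, penalty_per_branch, base_cpi=1.0):
    """Cycles per instruction once branch penalties are added to an ideal base."""
    return base_cpi + branch_fraction * penalty_per_branch
```

With a hypothetical workload in which one instruction in five is a branch, `cpi(0.2, average_branch_penalty()["predict not taken"])` comes to about 1.106, against 1.2 for always stalling.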

Eliminating hazards with simultaneous multi-threading
If we had no stalls we could finish one instruction every cycle.
If we had no hazards we could do without forwarding, and decode/control would be simpler too.
[Figure: the pipeline with two program counters (PC0, PC1) feeding Next PC, and separate Thread 0 and Thread 1 register sets.]
Example: PowerPC processing element (PPE) in the Cell Broadband Engine (Sony PlayStation 3).
IF maintains two Program Counters.
Even cycle: fetch from PC0.
Odd cycle: fetch from PC1.
Thread 0 reads and writes thread-0 registers.
No register-to-register hazards between adjacent pipeline stages.

So how fast can this design go?
A simple 5-stage pipeline can run at >3GHz.
Limited by critical path through slowest pipeline stage logic.
Tradeoff: do more per cycle? Or increase clock rate? Or do more per cycle, in parallel.
At 3GHz, clock period is 330 picoseconds: the time light takes to go about four inches.
About 10 gate delays; for example, the Cell BE is designed for 11 FO4 ("fanout=4") gates per cycle (ISSCC 2005).
Pipeline latches etc account for 3-5 FO4 delays, leaving only 5-8 for actual work.
How can we build a RAM that can implement our MEM stage in 5-8 FO4 delays?

Life used to be so easy
Processor-DRAM Memory Gap (latency)
[Figure: performance (log scale) against time. CPU ("Moore's Law"): µProc 60%/yr (2X/1.5yr). DRAM: 9%/yr (2X/10 yrs). Processor-Memory Performance Gap grows 50% / year.]
In 1980 a large RAM's access time was close to the CPU cycle time. 1980s machines had little or no need for cache. Life is no longer quite so simple.

Memory Hierarchy Terminology
Hit: data appears in some block in the upper level (example: Block X).
Hit Rate: the fraction of memory accesses found in the upper level.
Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss.
Miss: data needs to be retrieved from a block in the lower level (Block Y).
Miss Rate = 1 - (Hit Rate)
Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor.
Hit Time << Miss Penalty. Typically hundreds of missed instruction issue opportunities.
[Figure: blocks move between Upper Level Memory (Blk X), which exchanges data to and from the processor, and Lower Level Memory (Blk Y).]
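
Two of the numbers on the preceding slides are easy to sanity-check: the clock period at 3GHz with the distance light covers in that time, and the rate at which the processor-DRAM gap widens given 60%/yr and 9%/yr improvement. The helper names below are made up for this sketch; the physical constants are standard.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458
METRES_PER_INCH = 0.0254


def clock_period_ps(frequency_hz):
    """Clock period in picoseconds for a given clock frequency."""
    return 1e12 / frequency_hz


def light_travel_inches(period_ps):
    """Distance light travels in a vacuum during one clock period, in inches."""
    seconds = period_ps * 1e-12
    return SPEED_OF_LIGHT_M_PER_S * seconds / METRES_PER_INCH


def gap_growth_per_year(cpu_rate=0.60, dram_rate=0.09):
    """Factor by which the processor/DRAM performance ratio grows each year."""
    return (1 + cpu_rate) / (1 + dram_rate)
```

At 3GHz this gives a period of roughly 333 ps and just under four inches of light travel; the gap factor is about 1.47 per year, which the slide rounds to 50%.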

Levels of the Memory Hierarchy
(Capacity; access time; cost. Then the staging/transfer unit to the level below and who manages it.)
CPU Registers: 100s Bytes; <1ns. Transfer unit: instructions and operands, 1-16 bytes, managed by programmer/compiler.
Cache (perhaps multilevel): 10s-1000s K Bytes; 1-10 ns; $10/MByte. Transfer unit: blocks, managed by cache controller.
Main Memory: G Bytes; 100ns-300ns; $1/MByte. Transfer unit: pages, 4K-8K bytes, managed by operating system.
Disk: 100s G Bytes; 10 ms (10,000,000 ns); $0.0031/MByte. Transfer unit: files, Mbytes, managed by user/operator.
Tape: infinite; sec-min; $0.001/MByte.
Upper level: faster. Lower level: larger.

The Principle of Locality
The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
Two Different Types of Locality:
Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straightline code, array access).
In recent years, architectures have become increasingly reliant (totally reliant?) on locality for speed.

Cache Measures
Hit rate: fraction found in that level. So high that usually talk about Miss rate.
Miss rate fallacy: miss rate is to average memory access time as MIPS is to CPU performance.
Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks)
Miss penalty: time to replace a block from lower level, including time to replace in CPU.
access time: time to lower level = f(latency to lower level)
transfer time: time to transfer block = f(BW between upper & lower levels)

1 KB Direct Mapped Cache, 32B blocks
For a 2^N byte cache:
The uppermost (32 - N) bits are always the Cache Tag.
The lowest M bits are the Byte Select (Block Size = 2^M).
Example address: Cache Tag 0x50, Cache Index 0x01, Byte Select 0x00.
The Cache Tag is stored as part of the cache state, alongside a Valid Bit.
[Figure: direct-mapped cache storage. Each row holds a valid bit, a cache tag and 32 data bytes: Byte 0 to Byte 31 in row 0, Byte 32 to Byte 63 in row 1, up to Byte 1023 in the last row.]
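
The average memory-access time formula from the Cache Measures slide is simple enough to evaluate directly. The function below is a small sketch; the name `amat` and the numbers in the usage note are illustrative assumptions, not measurements from the course.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory-access time = hit time + miss rate x miss penalty.

    hit_time and miss_penalty must be in the same unit (ns or clock cycles);
    miss_rate is a fraction between 0 and 1.
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be a fraction between 0 and 1")
    return hit_time + miss_rate * miss_penalty
```

For instance, with a hypothetical 1-cycle hit time, 2% miss rate and 100-cycle miss penalty, `amat(1, 0.02, 100)` gives 3 cycles: a small miss rate still triples the average access time when the penalty is large.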

1 KB Direct Mapped Cache, 32B blocks
Direct-mapped cache: read access
For a 2^N byte cache:
The uppermost (32 - N) bits are always the Cache Tag.
The lowest M bits are the Byte Select (Block Size = 2^M).
Example address: Cache Tag 0x50, Cache Index 0x01, Byte Select 0x00.
[Figure: the Cache Index selects a row; the stored Cache Tag is compared with the address tag to produce Hit, and the Byte Select picks the data byte.]

1 KB Direct Mapped Cache, 32B blocks
[Figure: main memory blocks 0 to 32 and beyond, mapped onto the 32 block locations of the cache.]
Cache location 0 can be occupied by data from main memory location 0, 32, 64, etc.
Cache location 1 can be occupied by data from main memory location 1, 33, 65, etc.
In general, all locations with same Address<9:5> bits map to the same location in the cache.
Which one should we place in the cache?
How can we tell which one is in the cache?

Direct-mapped Cache - structure
Capacity: C bytes (eg 1KB)
Blocksize: B bytes (eg 32)
Byte select bits: 0..log(B)-1 (eg 0..4)
Number of blocks: C/B (eg 32)
Address size: A (eg 32 bits)
Cache index size: I=log(C/B) (eg log(32)=5)
Tag size: A-I-log(B) (eg 32-5-5=22)
[Figure: Cache Index selects a Valid bit, Cache Tag and Cache Block; the tag is compared with the address tag to give Hit.]

Two-way Set Associative Cache
N-way set associative: N entries for each Cache Index.
N direct mapped caches operated in parallel (N typically 2 to 4).
Example: Two-way set associative cache
Cache Index selects a "set" from the cache.
The two tags in the set are compared in parallel.
Data is selected based on the tag result.
[Figure: two banks of Valid, Cache Tag and Cache Data; each tag is compared with the address tag; the results drive Sel0/Sel1 of a mux choosing the Cache Block, and are ORed to give Hit.]
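
The field sizes listed on the direct-mapped structure slide, and the tag/index/byte-select example (0x50, 0x01, 0x00), can be reproduced with a few shifts and masks. This is a minimal sketch assuming power-of-two capacity and block size; the function names are hypothetical.

```python
def log2_exact(n):
    """Return log2(n) for a positive power of two, else raise ValueError."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a positive power of two")
    return n.bit_length() - 1


def field_widths(capacity, block_size, address_bits=32):
    """Return (tag_bits, index_bits, byte_select_bits) for a direct-mapped cache."""
    byte_select_bits = log2_exact(block_size)
    index_bits = log2_exact(capacity // block_size)
    tag_bits = address_bits - index_bits - byte_select_bits
    return tag_bits, index_bits, byte_select_bits


def split_address(address, capacity, block_size, address_bits=32):
    """Split an address into (tag, index, byte_select)."""
    _, index_bits, byte_select_bits = field_widths(capacity, block_size, address_bits)
    byte_select = address & (block_size - 1)
    index = (address >> byte_select_bits) & ((1 << index_bits) - 1)
    tag = address >> (byte_select_bits + index_bits)
    return tag, index, byte_select
```

For the 1 KB cache with 32-byte blocks, `field_widths(1024, 32)` gives a 22-bit tag, 5-bit index and 5-bit byte select, and address 0x14020 splits into the slide's example fields.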

Disadvantage of Set Associative Cache
N-way Set Associative Cache v. Direct Mapped Cache:
N comparators vs. 1
Extra MUX delay for the data
Data comes AFTER Hit/Miss
In a direct mapped cache, Cache Block is available BEFORE Hit/Miss:
Possible to assume a hit and continue. Recover later if miss.
[Figure: the two-way set associative cache, with tag comparison, Sel0/Sel1 mux, OR to Hit, and Cache Block output.]

Basic cache terminology
Example: Intel Pentium 4 Level-1 cache (pre-Prescott)
Capacity: 8K bytes (total amount of data cache can store)
Block: 64 bytes (so there are 8K/64=128 blocks in the cache)
Ways: 4 (addresses with same index bits can be placed in one of 4 ways)
Sets: 32 (=128/4, that is each RAM array holds 32 blocks)
Index: 5 bits (since 2^5=32 and we need index to select one of the 32 sets)
Tag: 21 bits (=32 minus 5 for index, minus 6 to address byte within block)
Access time: 2 cycles (.6ns at 3GHz; pipelined, dual-ported [load+store])
[Figure: the two-way set associative cache diagram, repeated.]

4 Questions for Memory Hierarchy
Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)

Q1: Where can a block be placed in the upper level?
In a fully-associative cache, block 12 can be placed in any location in the cache.
In a two-way set-associative cache, the set is determined by its low-order address bits: (12 mod 4) = 0. Block 12 can be placed in either of the two cache locations in set 0.
In a direct-mapped cache, block 12 can only be placed in one cache location, determined by its low-order address bits: (12 mod 8) = 4.
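
Both the Pentium 4 level-1 cache figures and the block-12 placement example follow from the same arithmetic. The sketch below assumes power-of-two sizes and a 32-bit address, as the slides do; the function names and the dictionary layout are invented for illustration.

```python
def set_assoc_geometry(capacity, block_size, ways, address_bits=32):
    """Derive block count, set count and address field widths for a cache."""
    blocks = capacity // block_size
    sets = blocks // ways
    index_bits = sets.bit_length() - 1         # log2, assuming a power of two
    offset_bits = block_size.bit_length() - 1  # bits to address a byte in a block
    return {
        "blocks": blocks,
        "sets": sets,
        "index_bits": index_bits,
        "offset_bits": offset_bits,
        "tag_bits": address_bits - index_bits - offset_bits,
    }


def set_for_block(block_number, n_blocks, ways):
    """Set in which a memory block may live: block number mod number of sets."""
    return block_number % (n_blocks // ways)
```

In an 8-block cache, `ways=1` is the direct-mapped case, `ways=2` the two-way case, and `ways=8` the fully-associative case, where there is only one set and the block may go anywhere in it.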

Q2: How is a block found if it is in the upper level?
Tag on each block.
No need to check index or block offset.
Address fields: Block Address (Tag | Index), then Block Offset.
Increasing associativity shrinks index, expands tag.
[Figure: the two-way set associative cache diagram, with tag comparison, mux and Hit output.]

Q3: Which block should be replaced on a miss?
Easy for Direct Mapped.
Set Associative or Fully Associative: Random, or LRU (Least Recently Used).
Miss rates:
    Assoc:      2-way            4-way            8-way
    Size        LRU     Ran      LRU     Ran      LRU     Ran
    16 KB       5.2%    5.7%     4.7%    5.3%     4.4%    5.0%
    64 KB       1.9%    2.0%     1.5%    1.7%     1.4%    1.5%
    256 KB      1.15%   1.17%    1.13%   1.13%    1.12%   1.12%
Benchmark studies show that LRU beats random only with small caches.

Q4: What happens on a write?
Write through: the information is written to both the block in the cache and to the block in the lower-level memory.
Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced (is block clean or dirty?).
Pros and Cons of each?
WT: read misses cannot result in writes.
WB: no repeated writes to same location.
WT always combined with write buffers so that don't wait for lower level memory.

Write Buffer for Write Through
[Figure: Processor writes to Cache and to Write Buffer; Write Buffer drains to DRAM.]
A Write Buffer is needed between the Cache and Memory.
Processor: writes data into the cache and the write buffer.
Memory controller: write contents of the buffer to memory.
Write buffer is just a FIFO:
Typical number of entries: 4
Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle
Memory system designer's nightmare:
Store frequency (w.r.t. time) -> 1 / DRAM write cycle
Write buffer saturation
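
To experiment with the replacement question (Q3), a set-associative cache with LRU replacement can be modelled in a few lines. This is an illustrative sketch, not the simulator behind the miss-rate table: it tracks block numbers only, ignores writes, and the function name is an assumption.

```python
from collections import OrderedDict


def count_misses_lru(block_addresses, n_sets, ways):
    """Count misses for a stream of block numbers under LRU replacement.

    Each set is an OrderedDict ordered from least to most recently used.
    """
    sets = [OrderedDict() for _ in range(n_sets)]
    misses = 0
    for block in block_addresses:
        entries = sets[block % n_sets]       # low-order bits choose the set
        if block in entries:
            entries.move_to_end(block)       # hit: mark most recently used
        else:
            misses += 1
            if len(entries) == ways:
                entries.popitem(last=False)  # evict the least recently used
            entries[block] = True
    return misses
```

Running the same reference stream with different `n_sets` and `ways` (keeping their product fixed) shows how associativity changes the miss count for a cache of constant size; a random-replacement variant could be compared by swapping the eviction line.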

A Modern Memory Hierarchy
By taking advantage of the principle of locality:
Present the user with as much memory as is available in the cheapest technology.
Provide access at the speed offered by the fastest technology.
[Figure: Processor (Control, Datapath, Registers), On-Chip Cache, Second Level Cache (SRAM), Main Memory (DRAM), Secondary Storage (Disk), Tertiary Storage (Disk/Tape). Speed (ns): 1s, 10s, 100s, 10,000,000s (10s ms), 10,000,000,000s (10s sec). Size (bytes): 100s, Ks, Ms, Gs, Ts.]

Large-scale storage
StorageTek STK 9310 ("Powderhorn")
2,000, 3,000, 4,000, 5,000, or 6,000 cartridge slots per library storage module (LSM)
Up to 24 LSMs per library (144,000 cartridges)
120 TB (1 LSM) to 28,800 TB capacity (24 LSM)
Each cartridge holds 300GB, readable up to 40 MB/sec
Up to 28.8 petabytes
Ave 4s to load tape
http://en.wikipedia.org/wiki/tape_library

Can we live without cache?
Interesting exception: Cray/Tera MTA, first delivered June
Each CPU switches every cycle between 128 threads.
Each thread can have up to 8 outstanding memory accesses.
3D toroidal mesh interconnect.
Memory accesses hashed to spread load across banks.
MTA-1 fabricated using Gallium Arsenide, not silicon; "nearly un-manufacturable" (wikipedia).
Third-generation Cray XMT: http//

Where we are going
Ch1: Review of pipelined, in-order processor architecture and simple cache structures
Ch2: Caches in more depth. Software techniques to improve cache performance. Virtual memory. Benchmarking. Fab.
Ch3: Instruction-level parallelism. Dynamic scheduling, out-of-order. Register renaming. Speculative execution. Branch prediction. Limits to ILP.
Ch4: Compiler techniques: loop nest transformations. Loop parallelisation, interchange, tiling/blocking, skewing.
Ch5: Multithreading, hyperthreading, SMT. Static instruction scheduling. Software pipelining. EPIC/IA-64; instruction-set support for speculation and register renaming.
Ch6: GPUs, GPGPU, and manycore
Ch7: Shared-memory multiprocessors. Cache coherency. Large-scale cache-coherency; ccNUMA. COMA.
Lab-based coursework exercise: simulation study "challenge", using performance analysis tools.
Exam: partially based on recent processor architecture article, which we will study in advance (see past papers).

CISC 662 Graduate Computer Architecture Lecture 6 - Hazards

CISC 662 Graduate Computer Architecture Lecture 6 - Hazards CISC 662 Gaduate Compute Achitectue Lectue 6 - Hazads Michela Taufe http://www.cis.udel.edu/~taufe/teaching/cis662f07 Powepoint Lectue Notes fom John Hennessy and David Patteson s: Compute Achitectue,

More information

COSC 6385 Computer Architecture. - Pipelining

COSC 6385 Computer Architecture. - Pipelining COSC 6385 Compute Achitectue - Pipelining Sping 2012 Some of the slides ae based on a lectue by David Culle, Pipelining Pipelining is an implementation technique wheeby multiple instuctions ae ovelapped

More information

COEN-4730 Computer Architecture Lecture 2 Review of Instruction Sets and Pipelines

COEN-4730 Computer Architecture Lecture 2 Review of Instruction Sets and Pipelines 1 COEN-4730 Compute Achitectue Lectue 2 Review of nstuction Sets and Pipelines Cistinel Ababei Dept. of Electical and Compute Engineeing Maquette Univesity Cedits: Slides adapted fom pesentations of Sudeep

More information

Administrivia. CMSC 411 Computer Systems Architecture Lecture 5. Data Hazard Even with Forwarding Figure A.9, Page A-20

Administrivia. CMSC 411 Computer Systems Architecture Lecture 5. Data Hazard Even with Forwarding Figure A.9, Page A-20 Administivia CMSC 411 Compute Systems Achitectue Lectue 5 Basic Pipelining (cont.) Alan Sussman als@cs.umd.edu as@csu dedu Homewok poblems fo Unit 1 due today Homewok poblems fo Unit 3 posted soon CMSC

More information

Lecture 8 Introduction to Pipelines Adapated from slides by David Patterson

Lecture 8 Introduction to Pipelines Adapated from slides by David Patterson Lectue 8 Intoduction to Pipelines Adapated fom slides by David Patteson http://www-inst.eecs.bekeley.edu/~cs61c/ * 1 Review (1/3) Datapath is the hadwae that pefoms opeations necessay to execute pogams.

More information

Introduction To Pipelining. Chapter Pipelining1 1

Introduction To Pipelining. Chapter Pipelining1 1 Intoduction To Pipelining Chapte 6.1 - Pipelining1 1 Mooe s Law Mooe s Law says that the numbe of pocessos on a chip doubles about evey 18 months. Given the data on the following two slides, is this tue?

More information

Computer Science 141 Computing Hardware

Computer Science 141 Computing Hardware Compute Science 141 Computing Hadwae Fall 2006 Havad Univesity Instucto: Pof. David Books dbooks@eecs.havad.edu [MIPS Pipeline Slides adapted fom Dave Patteson s UCB CS152 slides and May Jane Iwin s CSE331/431

More information

Review from last lecture

Review from last lecture CSE820 Gaduate Compute Achitectue Week 3 Pefomance + Pipeline Review Based on slides by David Patteson Review fom last lectue Tacking and extapolating technology pat of achitect s esponsibility Expect

More information

UCB CS61C : Machine Structures

UCB CS61C : Machine Structures inst.eecs.bekeley.edu/~cs61c UCB CS61C : Machine Stuctues Lectue SOE Dan Gacia Lectue 28 CPU Design : Pipelining to Impove Pefomance 2010-04-05 Stanfod Reseaches have invented a monitoing technique called

More information

CMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1

CMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1 CMCS 611-101 Advanced Compute Achitectue Lectue 6 Intoduction to Pipelining Septembe 23, 2009 www.csee.umbc.edu/~younis/cmsc611/cmsc611.htm Mohamed Younis CMCS 611, Advanced Compute Achitectue 1 Pevious

More information

CSE4201. Computer Architecture

CSE4201. Computer Architecture CSE 4201 Compute Achitectue Pof. Mokhta Aboelaze Pats of these slides ae taken fom Notes by Pof. David Patteson at UCB Outline MIPS and instuction set Simple pipeline in MIPS Stuctual and data hazads Fowading

More information

The Processor: Improving Performance Data Hazards

The Processor: Improving Performance Data Hazards The Pocesso: Impoving Pefomance Data Hazads Monday 12 Octobe 15 Many slides adapted fom: and Design, Patteson & Hennessy 5th Edition, 2014, MK and fom Pof. May Jane Iwin, PSU Summay Pevious Class Pipeline

More information

ECE331: Hardware Organization and Design

ECE331: Hardware Organization and Design ECE331: Hadwae Oganization and Design Lectue 16: Pipelining Adapted fom Compute Oganization and Design, Patteson & Hennessy, UCB Last time: single cycle data path op System clock affects pimaily the Pogam

More information

Lecture #22 Pipelining II, Cache I

Lecture #22 Pipelining II, Cache I inst.eecs.bekeley.edu/~cs61c CS61C : Machine Stuctues Lectue #22 Pipelining II, Cache I Wiewold cicuits 2008-7-29 http://www.maa.og/editoial/mathgames/mathgames_05_24_04.html http://www.quinapalus.com/wi-index.html


CS 2461: Computer Architecture 1 Program performance and High Performance Processors

CS 2461: Computer Architecture 1 Program performance and High Performance Processors Course Objectives: Where are we. CS 2461: Program performance and High Performance Processors Instructor: Prof. Bhagi Narahari Bits&bytes: Logic devices HW building blocks Processor: ISA, datapath Using building blocks


Computer Architecture. Pipelining and Instruction Level Parallelism An Introduction. Outline of This Lecture

Computer Architecture. Pipelining and Instruction Level Parallelism An Introduction. Outline of This Lecture Computer Architecture Pipelining and Instruction Level Parallelism An Introduction Adapted from COD2e by Hennessy & Patterson Slide 1 Outline of This Lecture Introduction to the Concept of Pipelined Processor Pipelined


CS 61C: Great Ideas in Computer Architecture. Pipelining Hazards. Instructor: Senior Lecturer SOE Dan Garcia

CS 61C: Great Ideas in Computer Architecture. Pipelining Hazards. Instructor: Senior Lecturer SOE Dan Garcia CS 61C: Great Ideas in Computer Architecture Pipelining Hazards Instructor: Senior Lecturer SOE Dan Garcia 1 Great Idea #4: Parallelism Software Parallel Requests Assigned to computer e.g. search Garcia Parallel Threads Assigned


Review: Moore's Law. EECS 252 Graduate Computer Architecture Lecture 2. Review: Joy's Law in ManyCore world. Bell's Law new class per decade

Review: Moore's Law. EECS 252 Graduate Computer Architecture Lecture 2. Review: Joy's Law in ManyCore world. Bell's Law new class per decade EECS 252 Graduate Computer Architecture Lecture 2 ℵ 0 Review of Instruction Sets, Pipelines, and Caches January 26th, 2009 Review Moore's Law John Kubiatowicz Electrical Engineering and Computer Sciences University


Computer Architecture Spring 2016

Computer Architecture Spring 2016 Computer Architecture Spring 2016 Lecture 02: Introduction II Shuai Wang Department of Computer Science and Technology Nanjing University Pipeline Hazards Major hurdle to pipelining: hazards prevent the


Chapter 4 (Part III) The Processor: Datapath and Control (Pipeline Hazards)

Chapter 4 (Part III) The Processor: Datapath and Control (Pipeline Hazards) Chapter 4 (Part III) The Processor: Datapath and Control (Pipeline Hazards) 陳瑞奇 (J.C. Chen), Department of Computer Science and Information Engineering, Asia University Adapted from class notes by Prof. M.J. Irwin, PSU and Prof. D. Patterson, UCB 1 What should you do about the side effects of cold medicine? http://big5.sznews.com/health/images/attachement/jpg/site3/20120319/001558d90b3310d0c1683e.jpg


User Visible Registers. CPU Structure and Function Ch 11. General CPU Organization (4) Control and Status Registers (5) Register Organisation (4)

User Visible Registers. CPU Structure and Function Ch 11. General CPU Organization (4) Control and Status Registers (5) Register Organisation (4) CPU Structure and Function Ch 11 General Organisation Registers Instruction Cycle Pipelining Branch Prediction Interrupts User Visible Registers Varies from one architecture to another General purpose register (GPR) Data, address,


Lecture Topics ECE 341. Lecture # 12. Control Signals. Control Signals for Datapath. Basic Processing Unit. Pipelining

Lecture Topics ECE 341. Lecture # 12. Control Signals. Control Signals for Datapath. Basic Processing Unit. Pipelining ECE 341 Lecture # 12 Instructor: Zeshan Chishti zeshan@ece.pdx.edu November 10, 2014 Portland State University Basic Processing Unit Control Signals Hardwired Control Datapath control signals Dealing with memory delay Pipelining


CENG 3420 Computer Organization and Design. Lecture 07: MIPS Processor - II. Bei Yu

CENG 3420 Computer Organization and Design. Lecture 07: MIPS Processor - II. Bei Yu CENG 3420 Computer Organization and Design Lecture 07: MIPS Processor - II Bei Yu CEG3420 L07.1 Spring 2016 Review: Instruction Critical Paths Calculate cycle time assuming negligible delays (for muxes, control


Computer Architecture

Computer Architecture Lecture 3: Pipelining Iakovos Mavroidis Computer Science Department University of Crete 1 Previous Lecture Measurements and metrics : Performance, Cost, Dependability, Power Guidelines and principles in


You Are Here! Review: Hazards. Agenda. Agenda. Review: Load / Branch Delay Slots 7/28/2011

You Are Here! Review: Hazards. Agenda. Agenda. Review: Load / Branch Delay Slots 7/28/2011 CS 61C: Great Ideas in Computer Architecture (Machine Structures) Instruction Level Parallelism: Multiple Instruction Issue Guest Lecturer: Justin Hsia Software Parallel Requests Assigned to computer e.g., Search Katz


Modern Computer Architecture

Modern Computer Architecture Modern Computer Architecture Lecture2 Pipelining: Basic and Intermediate Concepts Hongbin Sun National IC Talent Training Base, Xi'an Jiaotong University Pipelining: It's Natural! Laundry Example Ann, Brian, Cathy, Dave each


CENG 3420 Lecture 07: Pipeline

CENG 3420 Lecture 07: Pipeline CENG 3420 Lecture 07: Pipeline Bei Yu byu@cse.cuhk.edu.hk CENG3420 L07.1 Spring 2017 Outline: Review: Flip-Flop Control Signals; Pipeline Motivations; Pipeline Hazards; Exceptions CENG3420 L07.2 Spring


Modern Computer Architecture

Modern Computer Architecture Modern Computer Architecture Lecture3 Review of Memory Hierarchy Hongbin Sun National IC Talent Training Base, Xi'an Jiaotong University Performance 1000 Recap: Who Cares About the Memory Hierarchy? Processor-DRAM Memory Gap


CS 61C: Great Ideas in Computer Architecture (Machine Structures) Instruction Level Parallelism

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Instruction Level Parallelism Agenda CS 61C: Great Ideas in Computer Architecture (Machine Structures) Instruction Level Parallelism Instructors: Randy H. Katz, David A. Patterson http://inst.eecs.berkeley.edu/~cs61c/fa10 Review Instruction Set Design


CS 61C: Great Ideas in Computer Architecture Instruction Level Parallelism: Multiple Instruction Issue

CS 61C: Great Ideas in Computer Architecture Instruction Level Parallelism: Multiple Instruction Issue CS 61C: Great Ideas in Computer Architecture Instruction Level Parallelism: Multiple Instruction Issue Instructors: Krste Asanovic, Randy H. Katz http://inst.eecs.berkeley.edu/~cs61c/fa12 1 Parallel Requests Assigned


Any modern computer system will incorporate (at least) two levels of storage:

Any modern computer system will incorporate (at least) two levels of storage: 1 Any modern computer system will incorporate (at least) two levels of storage: primary storage: random access memory (RAM) typical capacity 32MB to 1GB cost per MB $3. typical access time 5ns to 6ns burst transfer


CISC 662 Graduate Computer Architecture Lecture 6 - Hazards

CISC 662 Graduate Computer Architecture Lecture 6 - Hazards CISC 662 Graduate Computer Architecture Lecture 6 - Hazards Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer


COSC 6385 Computer Architecture. - Memory Hierarchies (I)

COSC 6385 Computer Architecture. - Memory Hierarchies (I) COSC 6385 Computer Architecture - Hierarchies (I) Fall 2007 Slides are based on a lecture by David Culler, University of California, Berkley http//www.eecs.berkeley.edu/~culler/courses/cs252-s05 Recap


Memory Hierarchy. Maurizio Palesi. Maurizio Palesi 1

Memory Hierarchy. Maurizio Palesi. Maurizio Palesi 1 Memory Hierarchy Maurizio Palesi Maurizio Palesi 1 References John L. Hennessy and David A. Patterson, Computer Architecture a Quantitative Approach, second edition, Morgan Kaufmann Chapter 5 Maurizio


Lecture 3. Pipelining. Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1

Lecture 3. Pipelining. Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1 Lecture 3 Pipelining Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1 A "Typical" RISC ISA 32-bit fixed format instruction (3 formats) 32 32-bit GPR (R0 contains zero, DP take pair)


Pipeline Overview. Dr. Jiang Li. Adapted from the slides provided by the authors. Jiang Li, Ph.D. Department of Computer Science

Pipeline Overview. Dr. Jiang Li. Adapted from the slides provided by the authors. Jiang Li, Ph.D. Department of Computer Science Pipeline Overview Dr. Jiang Li Adapted from the slides provided by the authors Outline MIPS An ISA for Pipelining 5 stage pipelining Structural and Data Hazards Forwarding Branch Schemes Exceptions and


Pipelining, Instruction Level Parallelism and Memory in Processors. Advanced Topics ICOM 4215 Computer Architecture and Organization Fall 2010

Pipelining, Instruction Level Parallelism and Memory in Processors. Advanced Topics ICOM 4215 Computer Architecture and Organization Fall 2010 Pipelining, Instruction Level Parallelism and Memory in Processors Advanced Topics ICOM 4215 Computer Architecture and Organization Fall 2010 NOTE: The material for this lecture was taken from several


COSC 6385 Computer Architecture - Pipelining

COSC 6385 Computer Architecture - Pipelining COSC 6385 Computer Architecture - Pipelining Fall 2006 Some of the slides are based on a lecture by David Culler, Instruction Set Architecture Relevant features for distinguishing ISA s Internal storage


Multidimensional Testing

Multidimensional Testing Multidimensional Testing QA approach for Storage networking Yohay Lasi Visuality Systems 1 Introduction Who I am Yohay Lasi, QA Manager at Visuality Systems Visuality Systems the leading commercial provider of


Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Science 6 PM 7 8 9 10 11 Midnight Time 30 40 20 30 40 20


a Not yet implemented in current version SPARK: Research Kit Pointer Analysis Parameters Soot Pointer analysis. Objectives

a Not yet implemented in current version SPARK: Research Kit Pointer Analysis Parameters Soot Pointer analysis. Objectives SPARK: Soot Research Kit Ondřej Lhoták Objectives Spark is a modular toolkit for flow-insensitive may points-to analyses for Java, which enables experimentation with: various parameters of pointer analyses which


A Memory Efficient Array Architecture for Real-Time Motion Estimation

A Memory Efficient Array Architecture for Real-Time Motion Estimation A Memory Efficient Array Architecture for Real-Time Motion Estimation Vasily G. Moshnyaga and Keikichi Tamaru Department of Electronics & Communication, Kyoto University Sakyo-ku, Yoshida-Honmachi, Kyoto 66-1, JAPAN


EE 6900: Interconnection Networks for HPC Systems Fall 2016

EE 6900: Interconnection Networks for HPC Systems Fall 2016 EE 6900: Interconnection Networks for HPC Systems Fall 2016 Avinash Karanth Kodi School of Electrical Engineering and Computer Science Ohio University Athens, OH 45701 Email: kodi@ohio.edu 1 Acknowledgement: Interconnection


ANALYTIC PERFORMANCE MODELS FOR SINGLE CLASS AND MULTIPLE CLASS MULTITHREADED SOFTWARE SERVERS

ANALYTIC PERFORMANCE MODELS FOR SINGLE CLASS AND MULTIPLE CLASS MULTITHREADED SOFTWARE SERVERS ANALYTIC PERFORMANCE MODELS FOR SINGLE CLASS AND MULTIPLE CLASS MULTITHREADED SOFTWARE SERVERS Daniel A Menascé Mohamed N Bennani Dept of Computer Science Oracle, Inc George Mason University 1211 SW Fifth



COSC 6385 Computer Architecture - Memory Hierarchies (I)

COSC 6385 Computer Architecture - Memory Hierarchies (I) COSC 6385 Computer Architecture - Memory Hierarchies (I) Edgar Gabriel Spring 2018 Some slides are based on a lecture by David Culler, University of California, Berkley http//www.eecs.berkeley.edu/~culler/courses/cs252-s05


EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building


MapReduce Optimizations and Algorithms 2015 Professor Sasu Tarkoma

MapReduce Optimizations and Algorithms 2015 Professor Sasu Tarkoma MapReduce Optimizations and Algorithms 2015 Professor Sasu Tarkoma www.cs.helsinki.fi Optimizations Reduce tasks cannot start before the whole map phase is complete Thus single slow machine can slow down the


Persistent Memory what developers need to know Mark Carlson Co-chair SNIA Technical Council Toshiba

Persistent Memory what developers need to know Mark Carlson Co-chair SNIA Technical Council Toshiba Persistent Memory what developers need to know Mark Carlson Co-chair SNIA Technical Council Toshiba 2018 Storage Developer Conference EMEA. All Rights Reserved. 1 Contents Welcome Persistent Memory Overview Non-Volatile


Lecture 2: Processor and Pipelining 1

Lecture 2: Processor and Pipelining 1 The Simple BIG Picture! Chapter 3 Additional Slides The Processor and Pipelining CENG 6332 2 Datapath vs Control Datapath signals Control Points Controller Datapath: Storage, FU, interconnect sufficient


Accelerating Storage with RDMA Max Gurtovoy Mellanox Technologies

Accelerating Storage with RDMA Max Gurtovoy Mellanox Technologies Accelerating Storage with RDMA Max Gurtovoy Mellanox Technologies 2018 Storage Developer Conference EMEA. Mellanox Technologies. All Rights Reserved. 1 What is RDMA? Remote Direct Memory Access - provides the ability


EECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches EE141

EECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches EE141 EECS151/251A Spring 2018 Digital Design and Integrated Circuits Instructors: John Wawrzynek and Nick Weaver Lecture 19: Caches Cache Introduction 40% of this ARM CPU is devoted to SRAM cache. But the role


CISC 662 Graduate Computer Architecture Lecture 16 - Cache and virtual memory review

CISC 662 Graduate Computer Architecture Lecture 16 - Cache and virtual memory review CISC 662 Graduate Computer Architecture Lecture 6 - Cache and virtual memory review Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David


THE THETA BLOCKCHAIN

THE THETA BLOCKCHAIN THE THETA BLOCKCHAIN Theta is a decentralized video streaming network, powered by a new blockchain and token. By Theta Labs, Inc. Last Updated: Nov 21, 2017 Version 1.0 1 OUTLINE Motivation Reputation Dependent


CPE 631 Lecture 04: CPU Caches

CPE 631 Lecture 04: CPU Caches Lecture 04 CPU Caches Electrical and Computer Engineering University of Alabama in Huntsville Outline Memory Hierarchy Four Questions for Memory Hierarchy Cache Performance 26/01/2004 UAH- 2 1 Processor-DR


High performance CUDA based CNN image processor

High performance CUDA based CNN image processor High performance CUDA based CNN image processor GEORGE VALENTIN STOICA, RADU DOGARU, ELENA CRISTINA STOICA Department of Applied Electronics and Information Engineering University Politehnica of Bucharest -3, Iuliu Maniu


Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Science Cases that affect instruction execution semantics


A Novel Parallel Deadlock Detection Algorithm and Architecture

A Novel Parallel Deadlock Detection Algorithm and Architecture A Novel Parallel Deadlock Detection Algorithm and Architecture Pun H. Shiu 2, Yudong Tan 2, Vincent J. Mooney III {ship, ydtan, mooney}@ece.gatech.edu http://codesign.ece.gatech.edu


Conversion Functions for Symmetric Key Ciphers

Conversion Functions for Symmetric Key Ciphers Journal of Information Assurance and Security 2 (2006) 41-50 Conversion Functions for Symmetric Key Ciphers Debra L. Cook and Angelos D. Keromytis Department of Computer Science Columbia University, mail code 0401


GARBAGE COLLECTION METHODS. Hanan Samet

GARBAGE COLLECTION METHODS. Hanan Samet gc0 GARBAGE COLLECTION METHODS Hanan Samet Computer Science Department and Center for Automation Research and Institute for Advanced Computer Studies University of Maryland College Park, Maryland 07 e-mail: hjs@umiacs.umd.edu


GCC-AVR Inline Assembler Cookbook Version 1.2

GCC-AVR Inline Assembler Cookbook Version 1.2 GCC-AVR Inline Assembler Cookbook Version 1.2 About this Document The GNU C compiler for Atmel AVR RISC processors offers, to embed assembly language code into C programs. This cool feature may be used for manually


Performance! (1/latency)! 1000! 100! 10! Capacity Access Time Cost. CPU Registers 100s Bytes <10s ns. Cache K Bytes ns 1-0.

Performance! (1/latency)! 1000! 100! 10! Capacity Access Time Cost. CPU Registers 100s Bytes <10s ns. Cache K Bytes ns 1-0. Since 1980, CPU has outpaced DRAM... EEL 5764: Graduate Computer Architecture Appendix C Hierarchy Review Ann Gordon-Ross Electrical and Computer Engineering University of Florida http://www.ann.ece.ufl.edu/


RANDOM IRREGULAR BLOCK-HIERARCHICAL NETWORKS: ALGORITHMS FOR COMPUTATION OF MAIN PROPERTIES

RANDOM IRREGULAR BLOCK-HIERARCHICAL NETWORKS: ALGORITHMS FOR COMPUTATION OF MAIN PROPERTIES RANDOM IRREGULAR BLOCK-HIERARCHICAL NETWORKS: ALGORITHMS FOR COMPUTATION OF MAIN PROPERTIES Svetlana Avetisyan Mikayel Samvelyan* Martun Karapetyan Yerevan State University Abstract In this paper, the class


EITF20: Computer Architecture Part2.2.1: Pipeline-1

EITF20: Computer Architecture Part2.2.1: Pipeline-1 EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle


The Java Virtual Machine. Compiler construction The structure of a frame. JVM stacks. Lecture 2

The Java Virtual Machine. Compiler construction The structure of a frame. JVM stacks. Lecture 2 Compiler construction 2009 Lecture 2 Code generation 1: Generating code The Java Virtual Machine Data types Primitive types, including integer and floating-point types of various sizes and the boolean type. The


COEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory

COEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory 1 COEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations


IP Network Design by Modified Branch Exchange Method

IP Network Design by Modified Branch Exchange Method Received: June 7, 207 98 IP Network Design by Modified Branch Method Kairat Jaroenrat Natchamol Srichumroenrattana 2* Faculty of Engineering at Kamphaeng Saen, Kasetsart University, Thailand 2 Faculty of Management


User Group testing report

User Group testing report User Group testing report Deliverable No: D6.10 Contract No: Integrated Project No. 506723: SafetyNet Acronym: SafetyNet Title: Building the European Road Safety Observatory Integrated Project, Thematic Priority 6.2 Sustainable


A New Finite Word-length Optimization Method Design for LDPC Decoder

A New Finite Word-length Optimization Method Design for LDPC Decoder A New Finite Word-length Optimization Method Design for LDPC Decoder Jinlei Chen, Yan Zhang and Xu Wang Key Laboratory of Network Oriented Intelligent Computation Shenzhen Graduate School, Harbin Institute of Technology


Modeling a shared medium access node with QoS distinction

Modeling a shared medium access node with QoS distinction Modeling a shared medium access node with QoS distinction Matthias Gries, Jonas Greutert Computer Engineering and Networks Laboratory (TIK) Swiss Federal Institute of Technology Zürich CH-8092 Zürich, Switzerland email:


Handout 4 Memory Hierarchy

Handout 4 Memory Hierarchy Handout 4 Memory Hierarchy Outline Memory hierarchy Locality Cache design Virtual address spaces Page table layout TLB design options (MMU Sub-system) Conclusion 2012/11/7 2 Since 1980, CPU has outpaced


Lecture 17 Introduction to Memory Hierarchies. Why it's important. Fundamental lesson(s). Suggested reading: (HP Chapter

Lecture 17 Introduction to Memory Hierarchies Why it's important Fundamental lesson(s) Suggested reading: (HP Chapter Processor components Multicore processors and programming Processor comparison vs. Lecture 17 Introduction to Memory Hierarchies CSE 30321 Suggested reading: (HP Chapter 5.1-5.2) Writing more



Computer Systems Architecture Spring 2016

Computer Systems Architecture Spring 2016 Computer Systems Architecture Spring 2016 Lecture 01: Introduction Shuai Wang Department of Computer Science and Technology Nanjing University [Adapted from Computer Architecture: A Quantitative Approach,


Towards Adaptive Information Merging Using Selected XML Fragments

Towards Adaptive Information Merging Using Selected XML Fragments Towards Adaptive Information Merging Using Selected XML Fragments Ho-Lam Lau and Wilfred Ng Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong {lauhl,


Using the Grow-And-Prune Network to Solve Problems of Large Dimensionality

Using the Grow-And-Prune Network to Solve Problems of Large Dimensionality B.J. Briedis and T.D. Gedeon School of Computer Science & Engineering The University of New South Wales Sydney NSW 2052 AUSTRALIA bbriedis@cse.unsw.edu.au


Image Enhancement in the Spatial Domain. Spatial Domain

Image Enhancement in the Spatial Domain. Spatial Domain 8-- Spatial Domain Image Enhancement in the Spatial Domain What is spatial domain The space where all pixels form an image In spatial domain we can represent an image by f( where x and y are coordinates along


CPU Pipelining Issues

CPU Pipelining Issues CPU Pipelining Issues What have you been beating your head against? This pipe stuff makes my head hurt! L17 Pipeline Issues & Memory 1 Pipelining Improve performance by increasing instruction throughput


arxiv: v1 [cs.lo] 3 Dec 2018

arxiv: v1 [cs.lo] 3 Dec 2018 A high-level operational semantics for hardware weak memory models arXiv:1812.00996v1 [cs.LO] 3 Dec 2018 Abstract Robert J. Colvin School of Electrical Engineering and Information Technology The University of Queensland


Module 6 STILL IMAGE COMPRESSION STANDARDS

Module 6 STILL IMAGE COMPRESSION STANDARDS Module 6 STILL IMAGE COMPRESSION STANDARDS Lesson 17 JPEG-2000 Architecture and Features Instructional Objectives At the end of this lesson, the students should be able to: 1. State the shortcomings of JPEG standard.


Memory. Lecture 22 CS301

Memory. Lecture 22 CS301 Memory Lecture 22 CS301 Administrative Daily Review of today s lecture w Due tomorrow (11/13) at 8am HW #8 due today at 5pm Program #2 due Friday, 11/16 at 11:59pm Test #2 Wednesday Pipelined Machine Fetch


Journal of World's Electrical Engineering and Technology J. World. Elect. Eng. Tech. 1(1): 12-16, 2012

Journal of World's Electrical Engineering and Technology J. World. Elect. Eng. Tech. 1(1): 12-16, 2012 2011, Scienceline Publication www.science-line.com Journal of World's Electrical Engineering and Technology J. World. Elect. Eng. Tech. 1(1): 12-16, 2012 JWEET An Efficient Algorithm for Lip Segmentation in Color


Dynamic Multiple Parity (DMP) Disk Array for Serial Transaction Processing

Dynamic Multiple Parity (DMP) Disk Array for Serial Transaction Processing IEEE TRANSACTIONS ON COMPUTERS, VOL. 50, NO. 9, SEPTEMBER 200 949 Dynamic Multiple Parity (DMP) Disk Array for Serial Transaction Processing K.H. Yeung, Member, IEEE, and T.S. Yum, Senior Member, IEEE Abstract: The


Memory Hierarchy: Motivation

Memory Hierarchy: Motivation Memory Hierarchy: Motivation The gap between CPU performance and main memory speed has been widening with higher performance CPUs creating performance bottlenecks for memory access instructions. The memory



i-pcgrid Workshop 2016 April 1 st 2016 San Francisco, CA

i-pcgrid Workshop 2016 April 1 st 2016 San Francisco, CA i-pcgrid Workshop 2016 April 1st 2016 San Francisco, CA Liang Min* Eddy Banks, Brian Kelley, Mert Korkali, Yining Qin, Steve Smith, Philip Top, and Carol Woodward *min2@llnl.gov, 925-422-1187 LDRD 13-ERD-043


ASSIGN 01: Due Monday Feb 04 PART 1 Get a Sketchbook: 8.5 x 11 (Minimum size 5 x 7) for keeping a design journal and a place to keep project research & ideas. Make sure you have your Dropbox account and/or Flash


dc - Linux Command Dc may be invoked with the following command-line options: -V --version Print out the version of dc

dc - Linux Command Dc may be invoked with the following command-line options: -V --version Print out the version of dc - CentOS 5.2 - Linux Users Guide - Linux Command SYNOPSIS [-V] [--version] [-h] [--help] [-e scriptexpression] [--expression=scriptexpression] [-f scriptfile] [--file=scriptfile] [file...] DESCRIPTION is a reverse-polish


DYNAMIC STORAGE ALLOCATION. Hanan Samet

DYNAMIC STORAGE ALLOCATION. Hanan Samet ds0 DYNAMIC STORAGE ALLOCATION Hanan Samet Computer Science Department and Center for Automation Research and Institute for Advanced Computer Studies University of Maryland College Park, Maryland 07 e-mail: hjs@umiacs.umd.edu


COSC4201 Pipelining. Prof. Mokhtar Aboelaze York University

COSC4201 Pipelining. Prof. Mokhtar Aboelaze York University COSC4201 Pipelining Prof. Mokhtar Aboelaze York University 1 Instructions: Fetch Every instruction could be executed in 5 cycles, these 5 cycles are (MIPS like machine). Instruction fetch IR Mem[PC] NPC


A New and Efficient 2D Collision Detection Method Based on Contact Theory Xiaolong CHENG, Jun XIAO a, Ying WANG, Qinghai MIAO, Jian XUE

A New and Efficient 2D Collision Detection Method Based on Contact Theory Xiaolong CHENG, Jun XIAO a, Ying WANG, Qinghai MIAO, Jian XUE 5th International Conference on Advanced Materials and Computer Science (ICAMCS 2016) A New and Efficient 2D Collision Detection Method Based on Contact Theory Xiaolong CHENG, Jun XIAO a, Ying WANG, Qinghai


Communication vs Distributed Computation: an alternative trade-off curve

Communication vs Distributed Computation: an alternative trade-off curve Communication vs Distributed Computation: an alternative trade-off curve Yahya H. Ezzeldin, Mohammed Karmoose, Christina Fragouli University of California, Los Angeles, CA 90095, USA, Email: {yahya.ezzeldin, mkarmoose,


CS152 Computer Architecture and Engineering Lecture 17: Cache System

CS152 Computer Architecture and Engineering Lecture 17: Cache System CS152 Computer Architecture and Engineering Lecture 17 System March 17, 1995 Dave Patterson (patterson@cs) and Shing Kong (shing.kong@eng.sun.com) Slides available on http//http.cs.berkeley.edu/~patterson


Pipelining. Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano. Seminar DEIB. 30 November, 2017

Pipelining. Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano. Seminar DEIB. 30 November, 2017 Advanced Topics on Heterogeneous System Architectures Pipelining Politecnico di Milano Seminar Room @ DEIB 30 November, 2017 Antonio R. Miele Marco D. Santambrogio Politecnico di Milano 2 Outline


Physical Aware System Level Design for Tiled Hierarchical Chip Multiprocessors

Physical Aware System Level Design for Tiled Hierarchical Chip Multiprocessors Physical Aware System Level Design for Tiled Hierarchical Chip Multiprocessors Jordi Cortadella, Javier de San Pedro, Nikita Nikitin and Jordi Petit Universitat Politècnica de Catalunya (Barcelona) Project funded by Intel


Segmentation of Casting Defects in X-Ray Images Based on Fractal Dimension

Segmentation of Casting Defects in X-Ray Images Based on Fractal Dimension 17th World Conference on Nondestructive Testing, 25-28 Oct 2008, Shanghai, China Segmentation of Casting Defects in X-Ray Images Based on Fractal Dimension Jue WANG 1, Xiaoqin HOU 2, Yufang CAI 3 ICT Research


Question? Processor comparison

Question? Processor comparison 1 2 Suggested Readings Readings H&P: Chapter 5.1-5.2 (Over the next 2 lectures) Lecture 18 Introduction to Memory Hierarchies 3 Processor components Multicore processors and programming Question?


On the Conversion between Binary Code and Binary-Reflected Gray Code on Boolean Cubes

On the Conversion between Binary Code and Binary-Reflected Gray Code on Boolean Cubes On the Conversion between Binary Code and Binary-Reflected Gray Code on Boolean Cubes The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters Citation


The Memory Hierarchy & Cache

The Memory Hierarchy & Cache Removing The Ideal Memory Assumption: The Memory Hierarchy & Cache The impact of real memory on CPU Performance. Main memory basic properties: Memory Types: DRAM vs. SRAM The Motivation for The Memory


The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350):

The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350): The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350): Motivation for The Memory Hierarchy: { CPU/Memory Performance Gap The Principle Of Locality Cache $$$$$ Cache Basics:
