Master Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy (most slides are borrowed). Advanced Architectures, MiEI, UMinho, 2017/18.


Introduction

Programmers want unlimited amounts of memory with low latency, but fast memory technology is more expensive per bit than slower memory. Solution: organize the memory system into a hierarchy:
- The entire addressable memory space is available in the largest, slowest memory.
- Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor.
Temporal and spatial locality ensure that nearly all references can be found in the smaller memories, giving the illusion of a large, fast memory being presented to the processor.

Copyright 2012, Elsevier Inc. All rights reserved.

Memory Performance Gap

Memory Hierarchy Design

Memory hierarchy design becomes more crucial with recent multi-core processors, because aggregate peak bandwidth grows with the number of cores. An Intel Core i7 can generate two references per core per clock; with four cores and a 3.2 GHz clock:
- 25.6 billion 64-bit data references/second
- + 12.8 billion 128-bit instruction references/second
- = 409.6 GB/s!
DRAM bandwidth is only 6% of this (25 GB/s). This requires:
- Multi-port, pipelined caches
- Two levels of cache per core
- A shared third-level cache on chip
(billion = 10^9)

The Memory Hierarchy: The BIG Picture

Common principles apply at all levels of the memory hierarchy, based on notions of caching. At each level in the hierarchy:
- Block placement
- Finding a block
- Replacement on a miss
- Write policy
(Section 5.5, "A Common Framework for Memory Hierarchies", Chapter 5, Large and Fast: Exploiting Memory Hierarchy)

Direct Mapped Cache

Location is determined by the address. Direct mapped: only one choice,
  (Block address) modulo (#Blocks in cache)
Since #Blocks is a power of 2, this uses the low-order address bits.
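Because #Blocks is a power of 2, the modulo above reduces to selecting low-order bits. A sketch with an illustrative geometry (256 blocks of 64 bytes; the function names are ours, not from the slides):

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BLOCKS 256   /* power of 2, so modulo = low-order bits */
#define BLOCK_BYTES 64

/* The single cache slot a block can occupy in a direct-mapped cache */
unsigned dm_index(uint32_t addr) {
    uint32_t block_addr = addr / BLOCK_BYTES;
    return block_addr % NUM_BLOCKS;   /* same as block_addr & (NUM_BLOCKS - 1) */
}

/* The remaining high-order bits, stored as the tag to identify the block */
uint32_t dm_tag(uint32_t addr) {
    return (addr / BLOCK_BYTES) / NUM_BLOCKS;
}
```

Two addresses exactly one cache-span apart map to the same slot and can only be told apart by their tags, which is the source of conflict misses discussed later.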

Associative Caches

Fully associative:
- Allows a given block to go in any cache entry
- Requires all entries to be searched at once
- One comparator per entry (expensive)
n-way set associative:
- Each set contains n entries
- The block number determines the set: (Block number) modulo (#Sets in cache)
- Search all entries in a given set at once
- n comparators (less expensive)

How Much Associativity?

Increased associativity decreases miss rate, but with diminishing returns. Simulation of a system with a 64KB D-cache and 16-word blocks on SPEC2000:
- 1-way: 10.3%
- 2-way: 8.6%
- 4-way: 8.3%
- 8-way: 8.1%

Block Placement

Determined by associativity:
- Direct mapped (1-way associative): one choice for placement
- n-way set associative: n choices within a set
- Fully associative: any location
Higher associativity reduces miss rate, but increases complexity, cost, and access time.

Replacement Policy

Direct mapped: no choice. Set associative:
- Prefer a non-valid entry, if there is one
- Otherwise, choose among the entries in the set
Least-recently used (LRU): choose the one unused for the longest time. Simple for 2-way, manageable for 4-way, too hard beyond that.
Random: gives approximately the same performance as LRU for high associativity.
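For a 2-way set, "simple" is literal: one LRU bit per set suffices. A toy sketch of the victim choice described above (the struct and function names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/* One 2-way set: per-way valid bit and tag, plus a single LRU bit */
typedef struct {
    bool valid[2];
    unsigned tag[2];
    int lru;            /* index of the least-recently-used way */
} Set2;

/* The way that should receive an incoming block */
int victim_way(const Set2 *s) {
    for (int w = 0; w < 2; w++)
        if (!s->valid[w]) return w;   /* prefer a non-valid entry */
    return s->lru;                    /* otherwise evict the LRU way */
}

/* On an access to `way`, the other way becomes least-recently used */
void touch(Set2 *s, int way) { s->lru = 1 - way; }
```

Beyond 2-way, true LRU needs an ordering over all n ways per set, which is why real 4-way and wider caches use approximations (pseudo-LRU) or random replacement.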

Write Policy

Write-through:
- Update both the upper and lower levels
- Simplifies replacement, but may require a write buffer
Write-back:
- Update the upper level only
- Update the lower level when the block is replaced
- Needs to keep more state
Virtual memory: only write-back is feasible, given disk write latency.

Memory Hierarchy Basics

n sets => n-way set associative:
- Direct-mapped cache => one block per set
- Fully associative => one set
Writing to the cache: two strategies:
- Write-through: immediately update lower levels of the hierarchy
- Write-back: only update lower levels of the hierarchy when an updated block is replaced
Both strategies use a write buffer to make writes asynchronous.

Memory Hierarchy Basics

CPU exec-time = (CPU clock-cycles + Mem stall-cycles) × Clock cycle time
CPU exec-time = (IC × CPI_CPU + Mem stall-cycles) × Clock cycle time
Mem stall-cycles = IC × (Mem accesses / instruction) × Miss rate × Miss penalty
               = IC × (Misses / instruction) × Miss penalty

Note 1: miss rate and miss penalty are often different for reads and writes.
Note 2: speculative and multithreaded processors may execute other instructions during a miss, which reduces the performance impact of misses.

Cache Performance Example

Given:
- I-cache miss rate = 2%
- D-cache miss rate = 4%
- Miss penalty = 100 cycles
- Base CPI (ideal cache) = 2
- Loads & stores are 36% of instructions
Miss cycles per instruction:
- I-cache: ??
- D-cache: ??
Actual CPI = 2 + ?? + ?? = ??

Cache Performance Example (solution)

Miss cycles per instruction:
- I-cache: 0.02 × 100 = 2
- D-cache: 0.36 × 0.04 × 100 = 1.44
Actual CPI = 2 + 2 + 1.44 = 5.44

Memory Hierarchy Basics

Miss rate: the fraction of cache accesses that result in a miss.
Causes of misses (the 3 C's + 1):
- Compulsory: first reference to a block
- Capacity: blocks discarded and later retrieved
- Conflict: the program makes repeated references to multiple addresses from different blocks that map to the same location in the cache
- Coherency: different processors should see the same value in the same location

The 3 C's in different cache sizes
(figure: miss-rate breakdown by cache size; conflict misses highlighted)

The cache coherence problem

Processors may see different values through their caches.

Cache Coherence

Coherence:
- All reads by any processor must return the most recently written value
- Writes to the same location by any two processors are seen in the same order by all processors
(Coherence defines the behaviour of reads & writes to the same memory location.)
Consistency:
- When a written value will be returned by a read
- If a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A
(Consistency defines the behaviour of reads & writes with respect to accesses to other memory locations.)

Enforcing Coherence

Coherent caches provide:
- Migration: movement of data
- Replication: multiple copies of data
Cache coherence protocols:
- Directory based: the sharing status of each block is kept in one location
- Snooping: each core tracks the sharing status of each block

Memory Hierarchy Basics

Six basic cache optimizations:
1. Larger block size: reduces compulsory misses; increases capacity and conflict misses, increases miss penalty
2. Larger total cache capacity to reduce miss rate: increases hit time, increases power consumption
3. Higher associativity: reduces conflict misses; increases hit time, increases power consumption
4. Multilevel caches to reduce miss penalty: reduces overall memory access time
5. Giving priority to read misses over writes: reduces miss penalty
6. Avoiding address translation in cache indexing: reduces hit time

Multilevel Caches

- Primary cache attached to the CPU: small, but fast
- Level-2 cache services misses from the primary cache: larger, slower, but still faster than main memory
- Main memory services L2 cache misses
- Some high-end systems include an L3 cache

Multilevel Cache Example

Given:
- CPU base CPI = 1, clock rate = 4 GHz
- Miss rate/instruction = 2%
- Main memory access time = 100 ns
With just the primary cache:
- Miss penalty = ??? = 400 cycles
- Effective CPI = 1 + ??? = 9
Now add an L2 cache...

Multilevel Cache Example (solution)

With just the primary cache (clock cycle = 0.25 ns at 4 GHz):
- Miss penalty = 100 ns / 0.25 ns = 400 cycles
- Effective CPI = 1 + 0.02 × 400 = 9

Example (cont.)

Now add an L2 cache:
- Access time = 5 ns
- Global miss rate to main memory = 0.5%
Primary miss with L2 hit: penalty = 5 ns / 0.25 ns = 20 cycles
Primary miss with L2 miss: extra penalty = 400 cycles
CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
Performance ratio = 9 / 3.4 = 2.6

Multilevel On-Chip Caches

Intel Nehalem 4-core processor. Per core: 32KB L1 I-cache, 32KB L1 D-cache, 512KB L2 cache.

3-Level Cache Organization (n/a: data not available)

Intel Nehalem:
- L1 caches (per core): I-cache 32KB, 64-byte blocks, 4-way, approx LRU replacement, hit time n/a; D-cache 32KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a
- L2 unified cache (per core): 256KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a
- L3 unified cache (shared): 8MB, 64-byte blocks, 16-way, replacement n/a, write-back/allocate, hit time n/a

AMD Opteron X4:
- L1 caches (per core): I-cache 32KB, 64-byte blocks, 2-way, approx LRU replacement, hit time 3 cycles; D-cache 32KB, 64-byte blocks, 2-way, approx LRU replacement, write-back/allocate, hit time 9 cycles
- L2 unified cache (per core): 512KB, 64-byte blocks, 16-way, approx LRU replacement, write-back/allocate, hit time n/a
- L3 unified cache (shared): 2MB, 64-byte blocks, 32-way, replace block shared by fewest cores, write-back/allocate, hit time 32 cycles

Intel's new cache approach with Skylake

(figures from https://www.servethehome.com/intel-xeon-scalable-processor-family-microarchitecture-overview/)

Ten Advanced Optimizations

Reducing the hit time:
1. Small & simple first-level caches
2. Way-prediction
Increasing cache bandwidth:
3. Pipelined cache access
4. Nonblocking caches
5. Multibanked caches
Reducing the miss penalty:
6. Critical word first
7. Merging write buffers
Reducing the miss rate:
8. Compiler optimizations
Reducing the miss penalty or miss rate via parallelism:
9. Hardware prefetching of instructions and data
10. Compiler-controlled prefetching

1. Small and simple first-level caches

Critical timing path: addressing the tag memory, then comparing tags, then selecting the correct set. Direct-mapped caches can overlap tag compare and transmission of data. Lower associativity reduces power because fewer cache lines are accessed.

L1 Size and Associativity

(figure: access time vs. size and associativity)
(figure: energy per read vs. size and associativity)

2. Way Prediction

To improve hit time, predict the way to pre-set the mux; a mis-prediction gives a longer hit time. Prediction accuracy:
- > 90% for two-way
- > 80% for four-way
- The I-cache has better accuracy than the D-cache
First used on the MIPS R10000 in the mid-90s; used on the ARM Cortex-A8. Extended to predict the block as well ("way selection"), which increases the mis-prediction penalty.

3. Pipelining Cache

Pipeline cache access to improve bandwidth. Examples:
- Pentium: 1 cycle
- Pentium Pro to Pentium III: 2 cycles
- Pentium 4 to Core i7: 4 cycles
Increases the branch mis-prediction penalty, but makes it easier to increase associativity.

4. Nonblocking Caches

Allow hits before previous misses complete:
- Hit under miss
- Hit under multiple miss
The L2 must support this. In general, processors can hide an L1 miss penalty but not an L2 miss penalty.

5. Multibanked Caches

Organize the cache as independent banks to support simultaneous access:
- The ARM Cortex-A8 supports 1-4 banks for L2
- The Intel i7 supports 4 banks for L1 and 8 banks for L2
Interleave banks according to block address.
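Interleaving by block address is again simple bit selection. A sketch assuming 64-byte blocks and the i7's 4 L1 banks (the geometry here is illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_BYTES 64
#define NUM_BANKS 4

/* Consecutive blocks go to consecutive banks, so a sequential access
   stream keeps all banks busy in parallel */
unsigned bank_of(uint32_t addr) {
    return (addr / BLOCK_BYTES) % NUM_BANKS;
}
```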

6. Critical Word First, Early Restart

Critical word first:
- Request the missed word from memory first
- Send it to the processor as soon as it arrives
Early restart:
- Request the words in normal order
- Send the missed word to the processor as soon as it arrives
The effectiveness of these strategies depends on block size and on the likelihood of another access to the portion of the block that has not yet been fetched.

7. Merging Write Buffer

When storing to a block that is already pending in the write buffer, update the write buffer entry instead of allocating a new one. Reduces stalls due to a full write buffer. Do not apply this to I/O addresses.
(figure: no write buffering vs. write buffering)

8. Compiler Optimizations

Loop interchange: swap nested loops to access memory in sequential order.
Blocking: instead of accessing entire rows or columns, subdivide the matrices into blocks. Requires more memory accesses but improves the locality of those accesses.

9. Hardware Prefetching

Fetch two blocks on a miss (include the next sequential block).
(figure: Pentium 4 pre-fetching)
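Loop interchange is purely a source-level change. A sketch on a row-major C array (blocking applies the same idea at the scale of cache-sized tiles):

```c
#include <assert.h>
#define N 64

/* Column-major traversal of a row-major array: consecutive accesses
   jump N*sizeof(double) bytes apart -- poor spatial locality */
double sum_strided(double a[N][N]) {
    double s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* Interchanged loops touch memory sequentially: consecutive accesses
   fall within the same cache block */
double sum_sequential(double a[N][N]) {
    double s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}
```

Both versions compute the same result; only the order of memory accesses, and hence the miss rate, differs.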

10. Compiler Prefetching

Insert prefetch instructions before the data is needed. Non-faulting: the prefetch doesn't cause exceptions.
- Register prefetch: loads data into a register
- Cache prefetch: loads data into the cache
Combine with loop unrolling and software pipelining.

Summary
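GCC and Clang expose cache prefetching through the `__builtin_prefetch` builtin, a non-faulting hint as described above. A sketch; the prefetch distance of 16 elements is an illustrative guess, since real tuning depends on the miss latency and loop body:

```c
#include <assert.h>

/* Sum an array while hinting the cache a few blocks ahead.
   __builtin_prefetch is only a hint and never faults, so the
   out-of-bounds guard is about wasted hints, not correctness. */
double sum_prefetched(const double *a, int n) {
    double s = 0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16]);  /* read prefetch */
        s += a[i];
    }
    return s;
}
```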