Memory Hierarchy. Slide contents from: Hennessy & Patterson, 5th ed., Appendix B and Chapter 2; David Wentzlaff, ELE 475 Computer Architecture; MJT, High Performance Computing, NPTEL
Memory Performance Gap: the Memory Wall
Memory Performance Gap: Aggregate peak bandwidth required by the Intel Core i7
- 2 data references per core per clock; 4 cores at 3.2 GHz
- 25.6 billion 64-bit data references/sec + 12.8 billion 128-bit instruction references/sec = 409.6 GB/s of peak demand
- DRAM can supply only about 25 GB/s
Bridging the gap: multiport, pipelined caches; two levels of cache per core; shared third-level cache on chip. The arithmetic is checked below.
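As a sanity check, the peak-bandwidth arithmetic can be reproduced in a few lines of C (the constants are the H&P figures quoted above):

```c
#include <stdio.h>

int main(void) {
    double clock_hz  = 3.2e9;                       /* 3.2 GHz */
    int    cores     = 4;
    double data_refs = 2.0 * cores * clock_hz;      /* 2 refs/core/clock = 25.6e9/s */
    double inst_refs = 12.8e9;                      /* 128-bit instruction refs/s */
    double peak      = data_refs * 8 + inst_refs * 16;  /* bytes/s */
    printf("peak = %.1f GB/s\n", peak / 1e9);       /* prints 409.6 GB/s */
    return 0;
}
```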
Introduction
- Programmers want unlimited amounts of memory with low latency
- Fast memory is more expensive per bit than slower memory
- Solution: organize the memory system into a hierarchy
- The entire addressable memory space is available in the largest, slowest memory; incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor
- Temporal and spatial locality ensure a high hit rate in the smaller memories
Predictable Memory Reference Patterns: Spatial and Temporal Locality. Hatfield and Gerald, "Program Restructuring for Virtual Memory," IBM Systems Journal 10(3): 168-192 (1971).
Locality of Reference
- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon
- Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon
A simple loop exhibits both, as sketched below.
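A minimal C loop (array size and names are illustrative) shows both kinds at once:

```c
#include <stdio.h>

int main(void) {
    int a[1024], sum = 0;           /* 'sum' is touched every iteration: temporal locality */
    for (int i = 0; i < 1024; i++)
        a[i] = i;
    for (int i = 0; i < 1024; i++)
        sum += a[i];                /* consecutive addresses: spatial locality */
    printf("%d\n", sum);
    return 0;
}
```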
The Memory Hierarchy
Memory Technology Trade-offs: latches/registers and the register file offer low capacity, low latency, and high bandwidth (more and wider ports); moving through SRAM to DRAM, capacity grows while latency rises and bandwidth falls (high capacity, high latency, low bandwidth).
SRAM Cell [figure: SRAM cell accessed via a wordline and a complementary bitline pair b, b̄]
Cache: a hardware structure that provides the memory objects that the processor references. [figure: processor ↔ cache ↔ main memory]
Cache Organization: a 32 KB cache holds 1024 lines (blocks) of 32 B of data each, numbered 0-1023. A 32-bit address from the processor is split into a 17-bit tag, a 10-bit index that selects one of the 1024 lines, and a 5-bit block offset that selects a byte within the 32 B line, as sketched below.
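A sketch, assuming the 32 KB / 32 B-line parameters above, of how the hardware slices the address (helper names are made up for illustration):

```c
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 5    /* 32 B line  -> 5 offset bits */
#define INDEX_BITS  10   /* 1024 lines -> 10 index bits */

static uint32_t bo(uint32_t addr)  { return addr & ((1u << OFFSET_BITS) - 1); }
static uint32_t idx(uint32_t addr) { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
static uint32_t tag(uint32_t addr) { return addr >> (OFFSET_BITS + INDEX_BITS); }  /* 17 bits */

int main(void) {
    uint32_t addr = 0x12345678u;
    printf("tag=0x%05x index=%u offset=%u\n", tag(addr), idx(addr), bo(addr));
    return 0;
}
```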
Direct Mapped Cache Organization: the index selects one entry of the tag/data store; each entry holds a valid bit, a 17-bit tag, and 32 B of data. A comparator checks the stored tag against the tag bits of the address. If the entry is valid and the tags match, it is a hit and the selected data is sent to the processor; otherwise it is a cache miss and the address is sent to the lower level of the hierarchy. A software sketch of this check follows.
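The tag store is modeled here with plain arrays; a real cache does this in hardware:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NLINES 1024

static bool     valid[NLINES];      /* valid bit per line */
static uint32_t tags[NLINES];       /* 17-bit tag per line */

/* Returns true on a hit; on a miss, installs the tag as the fill would. */
static bool dm_lookup(uint32_t addr) {
    uint32_t idx = (addr >> 5) & (NLINES - 1);   /* skip 5 offset bits, keep 10 index bits */
    uint32_t tag = addr >> 15;
    if (valid[idx] && tags[idx] == tag)
        return true;                /* hit: data is sent to the processor */
    valid[idx] = true;              /* miss: address goes to the lower level, */
    tags[idx]  = tag;               /* block is fetched and the tag installed */
    return false;
}

int main(void) {
    printf("%d %d\n", dm_lookup(0x1000), dm_lookup(0x1000));  /* miss, then hit */
    return 0;
}
```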
Block Placement
- Direct mapped cache: the index bits identify a unique cache line (lines 0-1023 in the running example)
- Set associative cache: the index bits identify a unique set, and a set contains multiple cache lines; a 2-way set associative version of the 32 KB cache has sets 0-511
- 2^Index = CacheSize / (BlockSize × SetAssociativity), i.e., the number of sets is the cache size divided by (block size × associativity); see the calculation below
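Plugging the 32 KB / 32 B-block cache into the formula for a few associativities (a small calculator, not from the slides):

```c
#include <stdio.h>

int main(void) {
    unsigned cache = 32 * 1024, block = 32;
    for (unsigned ways = 1; ways <= 4; ways *= 2) {
        unsigned sets = cache / (block * ways);   /* CacheSize / (BlockSize * Assoc) */
        unsigned bits = 0;
        while ((1u << bits) < sets) bits++;       /* index bits = log2(sets) */
        printf("%u-way: %4u sets, %u index bits\n", ways, sets, bits);
    }
    return 0;   /* 1-way: 1024 sets/10 bits; 2-way: 512/9; 4-way: 256/8 */
}
```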
2-way Set Associative Cache: the valid bits and tags of both ways are checked in parallel by two comparators, and a hit selects the matching way's data. This reduces conflict misses, but more energy is spent per data access; a lookup sketch follows.
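The two tag comparators correspond to the loop below (a software analogue; in hardware both ways are compared simultaneously):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NSETS 512   /* 32 KB / (32 B * 2 ways) */
#define WAYS  2

static bool     valid[NSETS][WAYS];
static uint32_t tags[NSETS][WAYS];

static bool sa_lookup(uint32_t addr) {
    uint32_t set = (addr >> 5) & (NSETS - 1);   /* 5 offset + 9 index bits */
    uint32_t tag = addr >> 14;                  /* remaining 18 tag bits */
    for (int w = 0; w < WAYS; w++)              /* both comparators fire at once in hardware */
        if (valid[set][w] && tags[set][w] == tag)
            return true;
    return false;
}

int main(void) {
    printf("%d\n", sa_lookup(0x2000));          /* cold cache: miss */
    return 0;
}
```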
The Four Questions (4Qs) of Caches
- Block placement: where can a block be placed in a cache? (direct mapped, set associative, fully associative)
- Block identification: how is a block found if it is in the cache?
- Block replacement: which block should be replaced on a miss?
- Write strategy: what happens on a write?
Block Replacement
- No choice in a direct mapped cache
- Random
- Least Recently Used (LRU): LRU state must be updated on every access; true LRU is only feasible for small sets (e.g., 2-way, as sketched after this list), otherwise pseudo-LRU is used
- First In, First Out (FIFO), a.k.a. Round-Robin: used in highly associative caches
- Not Most Recently Used (NMRU): FIFO with an exception for the most recently used block(s)
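For a 2-way set, true LRU needs only one bit per set, updated on every access; a minimal sketch (names are illustrative):

```c
#include <stdio.h>

#define NSETS 512

static int lru_way[NSETS];   /* which way to evict next (the least recently used) */

static void touch(int set, int way) { lru_way[set] = 1 - way; }  /* other way is now LRU */
static int  victim(int set)         { return lru_way[set]; }     /* consulted on a miss */

int main(void) {
    touch(0, 1);                          /* access way 1 of set 0 */
    printf("evict way %d\n", victim(0));  /* way 0 is least recently used */
    return 0;
}
```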
Write Strategy: How are writes handled?
- On a cache hit:
  - Write through: write both cache and memory; generally higher traffic, but simpler to design
  - Write back: write the cache only; memory is written when the block is evicted; a dirty bit per block avoids unnecessary write backs; more complicated
- On a cache miss:
  - No write allocate: write only to main memory
  - Write allocate: fetch the block into the cache, then write
- Common combinations (sketched below): write through & no write allocate; write back & write allocate
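A sketch of the two common combinations, with the dirty bit doing the work for write-back; the state and helpers are illustrative, not a full cache model:

```c
#include <stdbool.h>
#include <stdint.h>

#define NLINES 1024
static uint8_t data[NLINES][32];    /* cached copies of 32 B blocks */
static bool    dirty[NLINES];

/* Write through + no write allocate: memory is always updated;
   the cache copy changes only if the block is already present. */
static void wt_write(uint32_t idx, uint32_t off, uint8_t byte, bool hit) {
    if (hit) data[idx][off] = byte;
    /* always update main memory here (often via a write buffer) */
}

/* Write back + write allocate: the block is fetched on a miss,
   only the cache is written, and the dirty bit defers the memory write. */
static void wb_write(uint32_t idx, uint32_t off, uint8_t byte, bool hit) {
    if (!hit) { /* fetch the block into data[idx] first (write allocate) */ }
    data[idx][off] = byte;
    dirty[idx] = true;
}

/* On eviction, a dirty block is written back; a clean one can be dropped. */
static void wb_evict(uint32_t idx) {
    if (dirty[idx]) { /* write data[idx] back to memory */ dirty[idx] = false; }
}

int main(void) {
    wt_write(3, 0, 0xAB, true);
    wb_write(7, 1, 0xCD, false);
    wb_evict(7);
    return 0;
}
```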
Average Memory Access Time: AMAT = Hit Time + (Miss Rate × Miss Penalty). For example, with a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty, AMAT = 1 + 0.05 × 100 = 6 cycles. [figure: processor accesses the cache; hits are served there, misses go to main memory]
Categorizing Misses: The Three C's
- Compulsory (cold start) misses: first reference to a block
- Capacity misses: the cache is too small to hold all the data the program needs; these occur even under a perfect replacement policy
- Conflict (collision) misses: caused by collisions due to less-than-full associativity
Basic Cache Optimizations
- Larger block size to reduce miss rate: more data items per block, so fewer compulsory misses; but traffic increases and conflict misses may increase
- Larger caches to reduce miss rate: fewer capacity (and conflict) misses; but longer access time
Basic Cache Optimizations
- Higher associativity to reduce miss rate: fewer conflict misses; a 2-way set associative cache of size N has about the same miss ratio as a direct mapped cache of size 2N
- Multilevel caches to reduce miss penalty (see the two-level AMAT calculation below)
- Prioritize read misses over writes: use a write buffer
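Multilevel caches change the miss-penalty term: the L1 miss penalty becomes the average access time of the L2. A small calculator with purely illustrative numbers:

```c
#include <stdio.h>

int main(void) {
    double l1_hit = 1.0,  l1_miss = 0.05;   /* cycles; L1 miss rate */
    double l2_hit = 10.0, l2_miss = 0.20;   /* cycles; L2 local miss rate */
    double mem_penalty = 100.0;             /* cycles to main memory */

    /* AMAT = HitTime_L1 + MissRate_L1 * (HitTime_L2 + MissRate_L2 * MemPenalty) */
    double amat = l1_hit + l1_miss * (l2_hit + l2_miss * mem_penalty);
    printf("AMAT = %.2f cycles\n", amat);   /* 1 + 0.05*(10 + 0.2*100) = 2.50 */
    return 0;
}
```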
Reducing Cache Misses
- Cache compression
- Victim cache: direct mapped caches suffer many conflict misses; a small fully associative victim buffer saves recently evicted lines (lookup order sketched below) [figure: direct-mapped L1 backed by a fully associative victim buffer V]
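A sketch of the lookup order with a victim buffer; sizes and names are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

#define VICTIMS 4   /* tiny fully associative buffer of evicted lines */

static bool     v_valid[VICTIMS];
static uint32_t v_addr[VICTIMS];    /* full line address, since the buffer has no index */

/* On an L1 miss, probe the victim buffer before the lower level;
   a victim hit swaps the line back into L1 instead of a full miss. */
static bool victim_probe(uint32_t line_addr) {
    for (int i = 0; i < VICTIMS; i++)
        if (v_valid[i] && v_addr[i] == line_addr)
            return true;            /* swap with the conflicting L1 line */
    return false;                   /* genuine miss: go to the lower level */
}

int main(void) {
    v_valid[0] = true;              /* pretend line 0x40 was just evicted from L1 */
    v_addr[0]  = 0x40u;
    return victim_probe(0x40u) ? 0 : 1;
}
```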
Multicore Caches
- Private L2: the L2 is divided into per-core private banks; lower access time; data duplication and cache coherence issues; static space allocation
- Shared L2: longer access time (long bus delay, contention between cores); higher hit rate; dynamic space allocation
[figure: P1-P4 each with a private L1 and L2, versus P1-P4 with private L1s over one shared L2]
UCA and NUCA: Uniform Cache Access vs. Non-Uniform Cache Access
Shared NUCA Cache: the L2 is distributed in banks throughout the chip, and the OS can intelligently place the data a core needs in the bank closest to that core.