Memory Hierarchy. Slides contents from:

Toby McDowell
5 years ago

1 Memory Hierarchy. Slide contents from: Hennessy & Patterson, 5ed, Appendix B and Chapter 2; David Wentzlaff, ELE 475 Computer Architecture; MJT, High Performance Computing, NPTEL.

2 Memory Performance Gap: the Memory Wall.

3 Memory Performance Gap. Aggregate peak bandwidth required by an Intel Core i7: 2 data references per core per clock, 4 cores, 3.2 GHz. This gives 25.6 billion 64-bit data references/sec plus 12.8 billion 128-bit instruction references/sec = 409.6 GB/s. DRAM bandwidth = 25.6 GB/s. The gap is bridged by multiported, pipelined caches; two levels of cache per core; and a shared third-level cache on chip.
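The peak-demand figure can be checked with a few lines of arithmetic. This is a minimal sketch assuming the Hennessy & Patterson numbers (3.2 GHz clock, two 64-bit data references per core per clock) and, as an assumption made here to reach 12.8 billion instruction references/sec, one 128-bit instruction fetch per core per clock. All constant and function names are illustrative.

```python
# Peak memory bandwidth demand of a 4-core processor (illustrative constants).
CORES = 4
CLOCK_HZ = 3.2e9
DATA_REFS_PER_CLOCK = 2    # 64-bit data references per core per clock
DATA_REF_BYTES = 8
INSTR_REFS_PER_CLOCK = 1   # assumption: one 128-bit fetch per core per clock
INSTR_REF_BYTES = 16


def peak_demand_gb_per_s():
    """Return the aggregate peak demand in GB/s (1 GB = 1e9 bytes)."""
    data_refs = CORES * CLOCK_HZ * DATA_REFS_PER_CLOCK     # 25.6e9 refs/s
    instr_refs = CORES * CLOCK_HZ * INSTR_REFS_PER_CLOCK   # 12.8e9 refs/s
    total_bytes = data_refs * DATA_REF_BYTES + instr_refs * INSTR_REF_BYTES
    return total_bytes / 1e9


print(round(peak_demand_gb_per_s(), 1))
```

Data and instruction traffic each contribute half of the total, which is why the demand is an order of magnitude above what DRAM alone can supply.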

4 Introduction. Programmers want unlimited amounts of memory with low latency, but fast memory is more expensive per bit than slower memory. Solution: organize the memory system into a hierarchy. The entire addressable memory space is available in the largest, slowest memory. Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor. Temporal and spatial locality ensure a high hit rate in the smaller memories.

5 Predictable Memory Reference Patterns: Spatial and Temporal Locality. Source: Hatfield and Gerald, "Program Restructuring for Virtual Memory," IBM Systems Journal 10(3), 1971.

6 Locality of Reference. Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon. Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon.
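Spatial locality can be made concrete by comparing two traversal orders of the same array. The sketch below is illustrative only: it assumes a 64x64 array of 8-byte elements stored row by row and a 32-byte block, and measures how often an access falls in the same block as the previous one. The function and variable names are hypothetical.

```python
def same_block_fraction(addresses, block_size=32):
    """Fraction of accesses that land in the same block as the previous access."""
    blocks = [a // block_size for a in addresses]
    repeats = sum(1 for prev, cur in zip(blocks, blocks[1:]) if prev == cur)
    return repeats / (len(blocks) - 1)


ROWS, COLS, ELEM_BYTES = 64, 64, 8

# Row-major traversal walks consecutive addresses (good spatial locality).
row_major = [(i * COLS + j) * ELEM_BYTES for i in range(ROWS) for j in range(COLS)]
# Column-major traversal jumps a full row (512 bytes) between accesses.
col_major = [(i * COLS + j) * ELEM_BYTES for j in range(COLS) for i in range(ROWS)]

print(same_block_fraction(row_major), same_block_fraction(col_major))
```

With four elements per block, about three of every four row-major accesses reuse the block just touched, while the column-major order never does.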

7 The Memory Hierarchy

8 Memory Technology Trade-offs. The range runs from latches/registers and the register file (low capacity, low latency, high bandwidth with more and wider ports), through SRAM, to DRAM (high capacity, high latency, low bandwidth).

9 SRAM Cell. (Figure: wordline, bitline b and complementary bitline.)

10 Cache: a hardware structure that provides the memory objects that the processor references. (Figure: PROCESSOR, CACHE, MAIN MEMORY.)

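11 Cache Organization. A 32KB cache; each line (block) holds 32B of data.

12 Cache Organization. A 32-bit address arrives from the processor; each line holds 32B of data.

13 Cache Organization. Address field: Index (10 bits). Line data: 32B.

14 Cache Organization. Address fields: Index (10 bits), Block Offset (5 bits). Line data: 32B.

15 Cache Organization. Address fields: Tag (17 bits), Index (10 bits), Block Offset (5 bits). Line data: 32B.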
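The tag/index/offset split above follows from the sizes: 32KB / 32B = 1024 lines, so 10 index bits; a 32B line needs 5 offset bits; the remaining 32 - 10 - 5 = 17 bits form the tag. A minimal sketch of the decomposition, with hypothetical names, might look like this:

```python
ADDRESS_BITS = 32
OFFSET_BITS = 5     # 32-byte block
INDEX_BITS = 10     # 1024 lines
TAG_BITS = ADDRESS_BITS - INDEX_BITS - OFFSET_BITS


def split_address(addr):
    """Split a 32-bit address into (tag, index, block offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


print(TAG_BITS, split_address(0x12345678))
```

For example, address 0x12345678 splits into tag 9320, index 691, and block offset 24.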

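16 Direct Mapped Cache Organization. The address is split into Tag, Index, and Block Offset (BO). The Index selects one cache entry; each entry holds a Valid bit, a Tag, and Data.

17 Direct Mapped Cache Organization. Hit: the stored Tag of the indexed entry equals (=) the address Tag and the Valid bit is set. The 32B data line is then sent to the processor.

18 Direct Mapped Cache Organization. Cache miss: the tag comparison fails (or the entry is not valid), and the address is sent to the lower level.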
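The hit/miss decision described in these slides can be sketched as a tiny simulator. This is an illustrative model, not a description of any particular hardware: it tracks only valid bits and tags (no data), and the class and method names are assumptions.

```python
class DirectMappedCache:
    """Tag store of a direct-mapped cache: 1024 lines of 32 bytes by default."""

    def __init__(self, num_lines=1024, block_size=32):
        self.num_lines = num_lines
        self.block_size = block_size
        self.valid = [False] * num_lines
        self.tags = [0] * num_lines

    def access(self, addr):
        """Return True on a hit; on a miss, fill the line and return False."""
        block = addr // self.block_size     # drop the block offset
        index = block % self.num_lines      # index bits select one line
        tag = block // self.num_lines       # remaining upper bits
        if self.valid[index] and self.tags[index] == tag:
            return True
        # Miss: the block would be fetched from the lower level here.
        self.valid[index] = True
        self.tags[index] = tag
        return False


cache = DirectMappedCache()
print([cache.access(a) for a in (0, 4, 32 * 1024, 0)])
```

Addresses 0 and 32 * 1024 share index 0 but differ in tag, so they keep evicting each other.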

19 Block Placement. Direct mapped cache: the index bits identify a unique cache line. (Figure: address split into Tag, Index, BO; each line has V, Tag, Data.)

20 Block Placement. Set associative cache: $2^{\text{Index}} = \dfrac{\text{CacheSize}}{\text{BlockSize} \times \text{SetAssociativity}}$. The index bits identify a unique set, and a set contains multiple cache lines, one per way. (Figure: address split into Tag, Index, BO; Set 0 and Set 1; each line has V, Tag, Data.)
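As a quick check of the index formula, the following sketch computes the number of index bits for a given geometry. It assumes power-of-two sizes; the function name is hypothetical.

```python
def index_bits(cache_size, block_size, associativity):
    """Number of index bits: log2(cache_size / (block_size * associativity))."""
    num_sets = cache_size // (block_size * associativity)
    if num_sets < 1 or num_sets & (num_sets - 1):
        raise ValueError("number of sets must be a power of two")
    return num_sets.bit_length() - 1


print(index_bits(32 * 1024, 32, 1))     # direct mapped
print(index_bits(32 * 1024, 32, 2))     # 2-way set associative
print(index_bits(32 * 1024, 32, 1024))  # fully associative: no index bits
```

Doubling the associativity at a fixed cache size halves the number of sets, so one bit moves from the index to the tag.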

21 2-way Set Associative Cache. Each set holds two lines (Valid, Tag, Data), and both tags are compared (=) in parallel to produce Hit. Reduces conflict misses, but more energy is spent per data access.
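To see how a second way removes conflict misses, the sketch below runs the same trace through a direct-mapped and a 2-way configuration of equal size. It is a simplified model: only tags are tracked, and LRU replacement within a set is an assumption made here (the slide does not name a policy). All names are illustrative.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Tag-only model of a set-associative cache with LRU replacement per set."""

    def __init__(self, cache_size=32 * 1024, block_size=32, ways=2):
        self.block_size = block_size
        self.ways = ways
        self.num_sets = cache_size // (block_size * ways)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, addr):
        """Return True on a hit, False on a miss (the block is then installed)."""
        block = addr // self.block_size
        index = block % self.num_sets
        tag = block // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)       # mark as most recently used
            return True
        if len(lines) == self.ways:
            lines.popitem(last=False)    # evict the least recently used line
        lines[tag] = True
        return False


def count_misses(cache, trace):
    return sum(0 if cache.access(a) else 1 for a in trace)


trace = [0, 32 * 1024] * 4   # two blocks that collide in a direct-mapped cache
print(count_misses(SetAssociativeCache(ways=1), trace))
print(count_misses(SetAssociativeCache(ways=2), trace))
```

With one way, every access misses; with two ways, only the two initial (compulsory) misses remain.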

22 4Qs of Caches. Block Placement: where can a block be placed in a cache? (Direct mapped, set associative, fully associative.) Block Identification: how is a block found if it is in the cache? Block Replacement: which block should be replaced on a miss? Write Strategy: what happens on a write?

23 Block Replacement. No choice in a direct mapped cache. Random. Least Recently Used (LRU): LRU cache state must be updated on every access; a true implementation is only feasible for small sets (2-way), so Pseudo-LRU is used otherwise. First In, First Out (FIFO), also known as Round-Robin: used in highly associative caches. Not Most Recently Used (NMRU): FIFO with an exception for the most recently used block(s).
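The difference between LRU and FIFO shows up on even a short trace. The sketch below models a single set and counts misses under each policy; it is an illustrative model with hypothetical names, not a hardware implementation.

```python
from collections import OrderedDict


def count_set_misses(tags, ways, policy="lru"):
    """Count misses in one cache set for a sequence of tags."""
    if policy not in ("lru", "fifo"):
        raise ValueError("policy must be 'lru' or 'fifo'")
    resident = OrderedDict()    # oldest entry first
    misses = 0
    for tag in tags:
        if tag in resident:
            if policy == "lru":
                resident.move_to_end(tag)   # a hit refreshes recency under LRU
            continue                        # FIFO ignores hits
        misses += 1
        if len(resident) == ways:
            resident.popitem(last=False)    # evict the oldest entry
        resident[tag] = True
    return misses


trace = ["A", "B", "A", "C", "A", "B"]
print(count_set_misses(trace, ways=2, policy="lru"))
print(count_set_misses(trace, ways=2, policy="fifo"))
```

On this trace LRU keeps the frequently reused block A resident and takes 4 misses, while FIFO evicts A because it was inserted first and takes 5.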

24 Write Strategy: How are writes handled? On a cache hit: Write Through writes both cache and memory (generally higher traffic, but simpler to design); Write Back writes the cache only, and memory is written when the block is evicted (a dirty bit per block avoids unnecessary write-backs; more complicated). On a cache miss: No Write Allocate writes only to main memory; Write Allocate fetches the block into the cache, then writes. Common combinations: Write Through with No Write Allocate; Write Back with Write Allocate.
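The traffic difference between the two common combinations can be illustrated by counting writes that reach main memory. This is a deliberately simplified sketch: it models only stores, uses a small direct-mapped tag store, and counts lines still dirty at the end as eventual write-backs. The function name and policy strings are hypothetical.

```python
def memory_writes(store_addrs, policy, num_lines=4, block_size=32):
    """Count writes reaching main memory for a trace of store addresses.

    "write_through" is paired with no-write-allocate,
    "write_back" is paired with write-allocate.
    """
    if policy not in ("write_through", "write_back"):
        raise ValueError("unknown policy: " + policy)
    tags = [None] * num_lines
    dirty = [False] * num_lines
    writes = 0
    for addr in store_addrs:
        block = addr // block_size
        index, tag = block % num_lines, block // num_lines
        if policy == "write_through":
            writes += 1            # every store goes to memory, hit or miss
            continue
        if tags[index] != tag:     # write-allocate: bring the block in
            if dirty[index]:
                writes += 1        # write back the dirty victim first
            tags[index] = tag
        dirty[index] = True        # the store only updates the cache
    return writes + sum(dirty)     # dirty lines are written back eventually


stores = list(range(0, 32, 4))     # eight stores to the same 32-byte block
print(memory_writes(stores, "write_through"))
print(memory_writes(stores, "write_back"))
```

Eight stores to one block cost eight memory writes under write-through but a single write-back under write-back.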

25 Average Memory Access Time. (Figure: PROCESSOR, CACHE, MAIN MEMORY, with HIT and MISS paths.) Avg Memory Access Time = Hit Time + (Miss Rate × Miss Penalty)
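A short worked example of the formula follows. The numbers (1-cycle hit, 5% miss rate, 100-cycle miss penalty) are hypothetical and chosen only for illustration.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical values: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
print(amat(hit_time=1, miss_rate=0.05, miss_penalty=100))
```

Even a 5% miss rate makes the average access about six times slower than a hit, which motivates the optimizations on the following slides.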

26 Categorizing Misses: The Three C's. Cold Start (Compulsory) misses: the first reference to a block. Capacity misses: the cache is too small to hold all the data needed by the program; these occur even under a perfect replacement policy. Conflict (Collision) misses: misses that occur because of collisions due to less than full associativity.
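One common way to separate the three categories in a simulator is sketched below for a direct-mapped cache. Note the approximation: capacity misses are estimated with a fully associative LRU cache of the same size rather than a perfect replacement policy, so the split is indicative only. Function and variable names are hypothetical, and the trace is given as block numbers.

```python
from collections import OrderedDict


def classify_misses(block_trace, num_lines):
    """Classify direct-mapped cache misses as compulsory, capacity, or conflict."""
    seen = set()                   # blocks referenced at least once
    fully_assoc = OrderedDict()    # same capacity, LRU replacement
    direct = [None] * num_lines    # direct-mapped tag store (holds block numbers)
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}
    for block in block_trace:
        # Reference model: fully associative LRU cache.
        fa_hit = block in fully_assoc
        if fa_hit:
            fully_assoc.move_to_end(block)
        else:
            if len(fully_assoc) == num_lines:
                fully_assoc.popitem(last=False)
            fully_assoc[block] = True
        # Cache under study: direct mapped.
        index = block % num_lines
        dm_hit = direct[index] == block
        direct[index] = block
        if not dm_hit:
            if block not in seen:
                counts["compulsory"] += 1   # first reference ever
            elif not fa_hit:
                counts["capacity"] += 1     # missed even with full associativity
            else:
                counts["conflict"] += 1     # full associativity would have hit
        seen.add(block)
    return counts


print(classify_misses([0, 4, 0, 4], num_lines=4))
print(classify_misses([0, 1, 2, 3, 4, 0], num_lines=4))
```

In the first trace, blocks 0 and 4 fight over one line (conflict misses); in the second, five distinct blocks do not fit in four lines, so the re-reference to block 0 is a capacity miss.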

27 Basic Cache Optimizations. Larger block size to reduce miss rate: more data items per block; reduces compulsory misses; increases traffic; may increase conflict misses. Larger caches to reduce miss rate: reduces capacity (and conflict) misses; longer access time.

28 Basic Cache Optimizations. Higher associativity to reduce miss rate: fewer conflict misses; as a rule of thumb, a 2-way set associative cache of size N has about the same miss ratio as a direct mapped cache of size 2N. Multilevel caches to reduce miss penalty. Prioritize read misses over writes, using a write buffer.
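Multilevel caches reduce the miss penalty because an L1 miss is usually served by L2 rather than by main memory. A minimal sketch of the standard two-level form of the average memory access time follows; the latencies and miss rates are hypothetical, and the L2 miss rate is the local one (L2 misses divided by L2 accesses).

```python
def amat_two_level(hit_l1, miss_rate_l1, hit_l2, miss_rate_l2, mem_penalty):
    """AMAT = hit_l1 + miss_rate_l1 * (hit_l2 + miss_rate_l2 * mem_penalty)."""
    l1_miss_penalty = hit_l2 + miss_rate_l2 * mem_penalty
    return hit_l1 + miss_rate_l1 * l1_miss_penalty


# Hypothetical values, in cycles: L1 hit 1, L2 hit 10, memory 100.
print(amat_two_level(1, 0.05, 10, 0.2, 100))
```

With these numbers the L1 miss penalty drops from 100 cycles (memory only) to 30 cycles, so the average access time falls from 6 cycles to 2.5.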

29 Reducing Cache Misses. Cache compression. Victim cache: direct mapped caches have many conflict misses, so evicted lines are saved in a victim buffer. (Figure: direct mapped L1 backed by a small fully associative victim buffer V.)
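The victim buffer idea can be sketched as a direct-mapped tag store backed by a small fully associative buffer of evicted blocks. This is an illustrative model under stated assumptions: the buffer is replaced in FIFO order, a victim hit swaps the block back into L1, and the trace uses block numbers. Class and method names are hypothetical.

```python
from collections import OrderedDict


class DirectMappedWithVictim:
    """Direct-mapped cache plus a small fully associative victim buffer."""

    def __init__(self, num_lines=4, victim_entries=2):
        self.lines = [None] * num_lines
        self.victim = OrderedDict()          # evicted blocks, oldest first
        self.victim_entries = victim_entries

    def access(self, block):
        """Return 'hit', 'victim_hit', or 'miss' for a block number."""
        index = block % len(self.lines)
        if self.lines[index] == block:
            return "hit"
        evicted = self.lines[index]
        self.lines[index] = block            # install the requested block in L1
        result = "miss"
        if block in self.victim:             # found among recently evicted lines
            del self.victim[block]
            result = "victim_hit"
        if evicted is not None:              # save the displaced line
            self.victim[evicted] = True
            if len(self.victim) > self.victim_entries:
                self.victim.popitem(last=False)
        return result


cache = DirectMappedWithVictim()
print([cache.access(b) for b in (0, 4, 0, 4)])
```

Blocks 0 and 4 collide in the direct-mapped array, but after the two initial misses they are served from the victim buffer instead of the next level.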

30 Multicore Caches. Private L2: L2 divided into banks; lower access time; data duplication and cache coherence overhead; static space allocation. Shared L2: longer access time; long bus delay; contention between cores; higher hit rate; dynamic space allocation. (Figure: cores P1 to P4, each with its own L1; on one side each core has a private L2, on the other all cores share one L2.)

31 UCA and NUCA: Uniform Cache Access (UCA) and Non-Uniform Cache Access (NUCA).

32 Shared NUCA Cache. The L2 is distributed throughout the chip. The OS can intelligently place the data required by a core in the bank closest to that core.
