Memory Hierarchy. Slides contents from:

Lesley Barber
5 years ago
1 Memory Hierarchy. Slides contents from: Hennessy & Patterson, 5ed, Appendix B and Chapter 2; David Wentzlaff, ELE 475 Computer Architecture; MJT, High Performance Computing, NPTEL

2 Memory Performance Gap. Memory Wall

3 Memory Performance Gap. Aggregate peak bandwidth required by Intel i7: 2 references per clock per core, 4 cores, 3.2 GHz. 25.6 billion 64-bit data references/sec + 12.8 billion 128-bit instruction references/sec = 409.6 GB/s. DRAM bandwidth = 25.6 GB/s. Requires: multiport, pipelined caches; two levels of cache per core; shared third-level cache on chip.
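
The peak-demand arithmetic can be checked with a short sketch. It assumes the Hennessy & Patterson figures for this example: 4 cores at 3.2 GHz, two 64-bit data references per core per clock, and (an assumption made explicit here) one 128-bit instruction reference per core per clock. The variable names are illustrative only.

```python
# Peak memory bandwidth demand, under the assumptions stated above.
cores = 4
clock_hz = 3.2e9
data_refs_per_clock = 2    # 64-bit (8-byte) data references per core per clock
instr_refs_per_clock = 1   # assumption: one 128-bit (16-byte) fetch per core per clock

data_refs_per_sec = cores * clock_hz * data_refs_per_clock
instr_refs_per_sec = cores * clock_hz * instr_refs_per_clock

peak_bytes_per_sec = data_refs_per_sec * 8 + instr_refs_per_sec * 16
peak_gb_per_sec = peak_bytes_per_sec / 1e9

print(f"data refs/s:  {data_refs_per_sec / 1e9:.1f} billion")
print(f"instr refs/s: {instr_refs_per_sec / 1e9:.1f} billion")
print(f"peak demand:  {peak_gb_per_sec:.1f} GB/s")
```

Data and instruction traffic each contribute half of the total, which is more than an order of magnitude above what DRAM alone can supply; this gap is what the cache levels have to cover.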

4 Introduction. Programmers want unlimited amounts of memory with low latency. Fast memory is more expensive per bit than slower memory. Solution: organize the memory system into a hierarchy. The entire addressable memory space is available in the largest, slowest memory. Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor. Temporal and spatial locality ensure a high hit rate in the smaller memories. An illusion of a large, fast memory is presented to the processor.

5 Predictable Memory Reference Patterns. Spatial and temporal locality. Hatfield and Gerald: Program Restructuring for Virtual Memory, IBM Systems Journal 10(3): (1971)

6 Locality of Reference. Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon. Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon.

7 The Memory Hierarchy. [Figure: CPU, RF, L1, L2, L3, DRAM]

8 The Memory Hierarchy

9 Memory Technology Trade-offs. Latches/registers and register file: low capacity, low latency, high bandwidth (more and wider ports). Then SRAM. DRAM: high capacity, high latency, low bandwidth.

10 Register File

11 Cache. Hardware structure that provides the memory objects that the processor references. [Figure: processor, cache, main memory]

12 Memory Array - SRAM

13 SRAM Cell. [Figure: SRAM cell with a wordline and two bitlines labeled b]

14 Cache Organization. 32KB cache; each line (block) holds 32B of data.

15 Cache Organization. 32-bit address from the processor; data (32B) per line.

16 Cache Organization. Index: 10b. Data (32B).

17 Cache Organization. Index: 10b. Block offset: 5b. Data (32B).

18 Cache Organization. Tag: 17b. Index: 10b. Block offset: 5b. Data (32B).
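
The 17/10/5 split follows directly from the cache parameters on these slides (32KB cache, 32B blocks, 32-bit addresses, one line per index). A minimal sketch of the decomposition is below; `split_address` and the constant names are hypothetical, introduced only for illustration.

```python
# Derive the tag / index / block-offset widths for a direct mapped cache.
CACHE_SIZE = 32 * 1024   # bytes
BLOCK_SIZE = 32          # bytes per line
ADDRESS_BITS = 32

NUM_LINES = CACHE_SIZE // BLOCK_SIZE                 # 1024 lines
OFFSET_BITS = (BLOCK_SIZE - 1).bit_length()          # 5
INDEX_BITS = (NUM_LINES - 1).bit_length()            # 10
TAG_BITS = ADDRESS_BITS - INDEX_BITS - OFFSET_BITS   # 17


def split_address(addr):
    """Split a 32-bit address into (tag, index, block offset)."""
    offset = addr & (BLOCK_SIZE - 1)
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


print(TAG_BITS, INDEX_BITS, OFFSET_BITS)
print(split_address(0x12345678))
```

The bit-length trick assumes the block size and line count are powers of two, which holds for the sizes used here.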

19 Direct Mapped Cache Organization. Address fields: Tag, Index, Block Offset (BO). The index selects one cache entry, which holds Valid, Tag, and Data.

20 Direct Mapped Cache Organization. Hit: the selected entry is valid and its stored tag equals (=) the address tag; the data (32B) is sent to the processor.

21 Direct Mapped Cache Organization. Cache miss: the comparison does not produce a hit; send the address to the lower level.

22 Block Placement. Direct mapped cache: address fields Tag, Index, BO; each entry holds V, Tag, Data. The index bits identify a unique cache line.
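
The hit/miss decision described on the last three slides can be sketched as a small simulator. This is a minimal model under stated assumptions: it tracks only valid bits and tags (no data), and the class and method names are hypothetical.

```python
class DirectMappedCache:
    """Tag-only model of a direct mapped cache (no data storage)."""

    def __init__(self, cache_size=32 * 1024, block_size=32):
        self.block_size = block_size
        self.num_lines = cache_size // block_size
        self.valid = [False] * self.num_lines
        self.tags = [0] * self.num_lines

    def access(self, addr):
        """Return True on a hit, False on a miss (the line is then filled)."""
        block = addr // self.block_size   # drop the block offset
        index = block % self.num_lines    # index bits select one line
        tag = block // self.num_lines     # remaining upper bits
        if self.valid[index] and self.tags[index] == tag:
            return True
        # Miss: the address would go to the lower level; install the block.
        self.valid[index] = True
        self.tags[index] = tag
        return False


cache = DirectMappedCache()
trace = [0, 4, 32 * 1024, 0]
print([cache.access(a) for a in trace])
```

In this trace, address 4 hits because it shares a block with address 0, while address 32K maps to the same line as address 0 and evicts it, so the final access to 0 misses again.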

23 Block Placement. $2^{\text{Index}} = \frac{\text{CacheSize}}{\text{BlockSize} \times \text{SetAssociativity}}$. Set associative cache: address fields Tag, Index, BO; each entry holds V, Tag, Data. The index bits identify a unique set. A set contains multiple cache lines, one per way.

24 2-way Set Associative Cache. Each set holds two (Valid, Tag, Data) entries; both stored tags are compared (=) with the address tag to produce Hit. Reduces conflict misses. More energy spent per data access.
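
The index-width relation on this slide (two to the power of the index width equals cache size divided by block size times associativity) can be evaluated for a few organizations. The function name `index_bits` is hypothetical, and the sketch assumes all sizes are powers of two.

```python
def index_bits(cache_size, block_size, associativity):
    """Number of index bits: log2(cache_size / (block_size * associativity))."""
    num_sets = cache_size // (block_size * associativity)
    return (num_sets - 1).bit_length()


KB = 1024
print(index_bits(32 * KB, 32, 1))     # direct mapped
print(index_bits(32 * KB, 32, 2))     # 2-way set associative
print(index_bits(32 * KB, 32, 1024))  # fully associative: a single set
```

Each doubling of associativity removes one index bit and adds one tag bit, since the block offset width is unchanged.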

25 Inside a Cache. [Figure: processor, cache, main memory. The cache contains a controller with lookup logic and a cache RAM of lines (blocks) 0, 1, ..., N-1, each holding tags and data.]

26 Classification of Caches. Block placement: where can a block be placed in a cache? Direct mapped, set associative, fully associative. Block identification: how is a block found if it is in cache? Block replacement: which block should be replaced on a miss? Write strategy: what happens on a write?

27 Block Replacement. No choice in a direct mapped cache. Random. Least Recently Used (LRU): LRU cache state must be updated on every access; true implementation only feasible for small sets (2-way); pseudo-LRU. First In, First Out (FIFO), aka round-robin: used in highly associative caches. Not Most Recently Used (NMRU): FIFO with an exception for the most recently used block(s).
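
To make the LRU policy concrete, here is a minimal sketch of a set associative cache with true LRU replacement. It is a tag-only software model, not a description of how hardware tracks recency; the class name and parameters are hypothetical, and Python's `OrderedDict` stands in for the per-set LRU state.

```python
from collections import OrderedDict


class SetAssociativeLRUCache:
    """Tag-only set associative cache with true LRU replacement per set."""

    def __init__(self, num_sets, ways, block_size=32):
        self.num_sets = num_sets
        self.ways = ways
        self.block_size = block_size
        # One ordered dict per set: least recently used tag first.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, addr):
        """Return True on a hit, False on a miss."""
        block = addr // self.block_size
        index = block % self.num_sets
        tag = block // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)      # LRU state updated on every access
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)   # evict the least recently used line
        lines[tag] = True
        return False


cache = SetAssociativeLRUCache(num_sets=1, ways=2)
trace = [0, 32, 0, 64, 32]   # three different blocks, all in the same set
print([cache.access(a) for a in trace])
```

In the trace, the second access to block 0 makes block 1 (address 32) the least recently used, so the access to address 64 evicts it and the final access to 32 misses.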

28 Write Strategy: How are writes handled? Cache hit: write through (write both cache and memory; generally higher traffic but simpler to design) or write back (write cache only; memory is written when the block is evicted; a dirty bit per block avoids unnecessary write-backs; more complicated). Cache miss: no write allocate (only write to main memory) or write allocate (fetch block into cache, then write). Common combinations: write through & no write allocate; write back & write allocate.
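
The write back & write allocate combination can be sketched as a small model that counts memory traffic. This is a simplified, direct mapped, tag-only illustration; the class name, the tiny 4-line size, and the counters are assumptions made for the example.

```python
class WriteBackCache:
    """Direct mapped, write-back, write-allocate cache; counts block traffic."""

    def __init__(self, num_lines=4, block_size=32):
        self.num_lines = num_lines
        self.block_size = block_size
        self.lines = [None] * num_lines   # each entry: [tag, dirty] or None
        self.mem_reads = 0                # blocks fetched from memory
        self.mem_writes = 0               # dirty blocks written back

    def _access(self, addr, is_write):
        block = addr // self.block_size
        index, tag = block % self.num_lines, block // self.num_lines
        line = self.lines[index]
        hit = line is not None and line[0] == tag
        if not hit:
            if line is not None and line[1]:
                self.mem_writes += 1      # evicted block is dirty: write it back
            self.mem_reads += 1           # allocate: fetch block on read or write miss
            line = self.lines[index] = [tag, False]
        if is_write:
            line[1] = True                # write the cache only; mark the block dirty
        return hit

    def read(self, addr):
        return self._access(addr, False)

    def write(self, addr):
        return self._access(addr, True)


cache = WriteBackCache()
for _ in range(10):
    cache.write(0)      # ten writes to the same block
cache.read(4 * 32)      # conflicting block forces the eviction
print(cache.mem_reads, cache.mem_writes)
```

Ten writes to one block cost a single write-back at eviction time, whereas a write-through cache would send all ten writes to memory; that is the traffic difference the slide refers to.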

29 Average Memory Access Time. [Figure: processor, cache (hit), main memory (miss)] $\text{Avg Memory Access Time} = \text{Hit Time} + (\text{Miss Rate} \times \text{Miss Penalty})$
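
A worked example of the average memory access time formula follows. The numbers (1-cycle hit, 5% miss rate, 100-cycle penalty, and the L2 parameters) are hypothetical values chosen for illustration, not figures from the slides.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Single-level cache, hypothetical numbers (times in cycles).
single_level = amat(hit_time=1, miss_rate=0.05, miss_penalty=100)

# Two levels: the L1 miss penalty is itself the AMAT of the L2.
l2 = amat(hit_time=10, miss_rate=0.20, miss_penalty=100)
two_level = amat(hit_time=1, miss_rate=0.05, miss_penalty=l2)

print(f"single level: {single_level:.2f} cycles")
print(f"two levels:   {two_level:.2f} cycles")
```

Nesting the formula this way shows why a second cache level helps: it lowers the miss penalty seen by the first level without touching the first level's hit time.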

30 Tolerating Miss Penalty. Out-of-order execution. Software prefetching: execute loads early, speculatively. Hardware prefetching: stride-based stream prefetcher. Accesses: A, A+4, A+8, ... A stride detector tracks the present address, stride, and count; a prefetch is triggered when the count reaches a predefined value. [Figure: stride detector table (present address, stride, count) and stream buffer (address, value) with entries A+12, A+16, A+32]
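
The stride detector can be sketched as a small state machine holding the last address, the current stride, and a confirmation count. This is one plausible reading of the slide's table, not its exact design: the threshold of 2, the prefetch degree, and all names are assumptions.

```python
class StrideDetector:
    """Tracks (last address, stride, count); prefetches once a stride repeats."""

    def __init__(self, threshold=2, degree=1):
        self.last_addr = None
        self.stride = None
        self.count = 0
        self.threshold = threshold   # assumed trigger value for the count
        self.degree = degree         # how many blocks ahead to prefetch

    def observe(self, addr):
        """Record one access; return the list of addresses to prefetch."""
        prefetches = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.stride:
                self.count += 1      # same stride seen again
            else:
                self.stride = stride
                self.count = 1 if stride != 0 else 0
            if self.count >= self.threshold:
                prefetches = [addr + self.stride * i
                              for i in range(1, self.degree + 1)]
        self.last_addr = addr
        return prefetches


A = 100
detector = StrideDetector()
for addr in [A, A + 4, A + 8, A + 12, A + 200]:
    print(addr, detector.observe(addr))
```

With these settings the detector stays silent for the first two accesses, starts prefetching one stride ahead once the stride of 4 has been confirmed, and resets when the pattern breaks.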

31 Categorizing Misses: The Three C's. Cold start (compulsory) misses: first reference to a block. Capacity misses: the cache is too small to hold all the data needed by the program; these occur even under a perfect replacement policy. Conflict (collision) misses: misses that occur because of collisions due to less than full associativity.

32 Basic Cache Optimizations. Larger block size to reduce miss rate: more data items per block; reduces compulsory misses; increases traffic; may increase conflict misses. Larger caches to reduce miss rate: reduces capacity (and conflict) misses; longer access time.

33 Basic Cache Optimizations. Higher associativity to reduce miss rate: fewer conflict misses; a 2-way SA cache of size N has about the same miss ratio as a DM cache of size 2N. Multilevel caches to reduce miss penalty. Prioritize read misses over writes: write buffer.

34 Reducing Cache Misses. Cache compression. Victim cache: DM caches have many conflict misses; a victim cache saves evicted lines in a victim buffer. [Figure: L1 (DM) backed by a victim buffer V (FA)]
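
A victim buffer can be sketched as a small fully associative store sitting behind a direct mapped L1. This is a minimal model under stated assumptions: block numbers stand in for tags, the buffer evicts its oldest entry first, and the class name and sizes are hypothetical.

```python
from collections import OrderedDict


class DirectMappedWithVictim:
    """Direct mapped L1 plus a small fully associative victim buffer."""

    def __init__(self, num_lines=4, victim_entries=2, block_size=32):
        self.num_lines = num_lines
        self.block_size = block_size
        self.lines = [None] * num_lines   # block number held by each L1 line
        self.victims = OrderedDict()      # evicted blocks, oldest first
        self.victim_entries = victim_entries

    def access(self, addr):
        """Return 'hit', 'victim hit', or 'miss'."""
        block = addr // self.block_size
        index = block % self.num_lines
        if self.lines[index] == block:
            return "hit"
        result = "miss"
        if block in self.victims:
            del self.victims[block]       # block moves back into L1
            result = "victim hit"
        evicted = self.lines[index]
        self.lines[index] = block
        if evicted is not None:
            self.victims[evicted] = True  # save the evicted line
            if len(self.victims) > self.victim_entries:
                self.victims.popitem(last=False)
        return result


cache = DirectMappedWithVictim()
trace = [0, 128, 0, 128]   # blocks 0 and 4 collide on the same L1 line
print([cache.access(a) for a in trace])
```

Two blocks that ping-pong on the same line would miss on every access in a plain direct mapped cache; here, after the two compulsory misses, each conflict is served from the victim buffer instead of the lower level.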

35 Multicore Caches. Private L2: L2 divided into banks; lower access time; data duplication, cache coherence; static space allocation. Shared L2: longer access time; long bus delay; contention between cores; higher hit rate; dynamic space allocation. [Figure: processors P1 to P4, each with its own L1, backed by private L2s versus one shared L2]

36 UCA and NUCA. Uniform Cache Access (UCA). Non-Uniform Cache Access (NUCA).

37 Shared NUCA Cache. L2 is distributed throughout the chip. The OS can place the data required by a core in the bank closest to that core.

38 Cache Example. Compare the CPI of the version with cache misses to the one with no cache misses.

39 Example

40 Extra

42 SRAM vs DRAM. Foss, R.C., Implementing Application-Specific Memory, ISSCC 1996

43 4-way Set Associative Cache

44 Block Identification. $2^{\text{Index}} = \frac{\text{CacheSize}}{\text{BlockSize} \times \text{SetAssociativity}}$. [Figure: direct mapped vs. 2-way set associative]

45 Prioritizing read misses over writes. A direct mapped, write-through cache maps addresses 512 and 1024 to the same cache block. If the write buffer is not checked on a read miss, will the value in R2 always be equal to the value in R3? Copy the dirty block to a buffer. On a read: check the buffer for the address, read memory, and then write memory.

46 Mapping NUCA. Data placement for least access time: one way per bank vs. entire set in a bank. Search. [Figure: CPU with ways W1, W2, ..., W16 placed in banks, versus sets S0-S255, S256-S511, S512-S767, S768-S1023 placed in banks]
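
The hazard behind this question can be sketched with a toy write buffer. The instruction sequence from the slide is not reproduced in this transcript, so the sketch assumes a simple scenario: a store to address 512 is still sitting in the write buffer when a later load of 512 misses in the cache. All names are hypothetical.

```python
# Toy model: memory plus a write buffer of pending (not yet written) stores.
memory = {512: 0, 1024: 0}
write_buffer = {}   # address -> value waiting to be written to memory


def store(addr, value):
    """Write-through store: the value is queued in the write buffer."""
    write_buffer[addr] = value


def load_on_miss(addr, check_buffer):
    """Service a read miss, optionally checking the write buffer first."""
    if check_buffer and addr in write_buffer:
        return write_buffer[addr]   # newest value is still in the buffer
    return memory[addr]             # may be stale if a store is pending


def drain():
    """Let the buffered writes reach memory."""
    memory.update(write_buffer)
    write_buffer.clear()


store(512, 7)
print(load_on_miss(512, check_buffer=False))  # stale value from memory
print(load_on_miss(512, check_buffer=True))   # value forwarded from the buffer
drain()
print(load_on_miss(512, check_buffer=False))  # correct once the buffer drains
```

Giving reads priority is only safe if a read miss either checks the buffer for a matching address or waits for the buffer to drain; the first option avoids stalling the read.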
