CMPSC 311 - Introduction to Systems Programming Module: Caching Professor Patrick McDaniel Fall 2016
Reminder: Memory Hierarchy
Smaller, faster, costlier per byte at the top; larger, slower, cheaper per byte at the bottom:
L0: Registers - CPU registers hold words retrieved from the L1 cache
L1: L1 cache (SRAM) - holds cache lines retrieved from the L2 cache
L2: L2 cache (SRAM) - holds cache lines retrieved from main memory
L3: Main memory (DRAM) - holds disk blocks retrieved from local disks
L4: Local secondary storage (local disks) - holds files retrieved from disks on remote network servers
L5: Remote secondary storage (tapes, distributed file systems, Web servers)
Processor Caches
Most modern computers have multiple layers of caches to manage data passing into and out of the processors:
- L1: very fast and small, processor adjacent
- L2: a bit slower, but often much larger
- L3: larger still, maybe off chip; may be shared amongst processors in a multi-core system
- Memory: slowest, least expensive
Instruction caches are different from data caches.
[Figure: CPU -> Level 1 Cache (small, fast) -> Level 2 Cache (bigger, slower) -> Level 3 Cache (bigger, slower) -> Memory (huge, slow, and inexpensive)]
Caches
Cache: a smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.
Fundamental idea of a memory hierarchy: for each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.
Why do memory hierarchies work? Because of locality, programs tend to access the data at level k more often than they access the data at level k+1. Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit.
Big Idea: the memory hierarchy creates a large pool of storage that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.
Locality
Caches exploit locality to improve performance, of which there are two types:
- Spatial locality: data that is accessed tends to be close to data you have already accessed
- Temporal (time) locality: data that is accessed is likely to be accessed again soon
This leads to two cache design strategies, illustrated in the sketch below:
- Spatial: cache items in blocks larger than the item actually accessed
- Temporal: keep recently used items around longer
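To make the two kinds of locality concrete, here is a minimal C sketch (not from the slides; the array size is arbitrary). The row-major loop benefits from spatial locality because consecutive accesses fall in the same cache block; the column-major loop defeats it; and the accumulator exhibits temporal locality.

    #include <stdio.h>

    #define ROWS 1024
    #define COLS 1024

    static int grid[ROWS][COLS];

    int main(void) {
        long sum = 0;  /* reused on every iteration: temporal locality */

        /* Row-major traversal: consecutive elements share cache blocks,
           so most accesses hit in the cache (spatial locality). */
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                sum += grid[i][j];

        /* Column-major traversal: each access jumps COLS * sizeof(int)
           bytes, defeating spatial locality and missing far more often. */
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                sum += grid[i][j];

        printf("sum = %ld\n", sum);
        return 0;
    }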
General Cache Concepts
Smaller, faster, more expensive memory (the cache) holds a subset of the blocks in its cache lines; data is copied between levels in block-sized transfer units. The larger, slower, cheaper memory is viewed as partitioned into blocks.
[Figure: a 4-line cache holding blocks 8, 9, 14, and 3 above a memory partitioned into blocks 0-15; blocks 4 and 10 are copied into the cache as block-sized transfers]
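One way to picture cache lines and block-sized transfer units in code is the sketch below (a hypothetical layout; the slides do not prescribe an implementation):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE 64   /* bytes per block-sized transfer (assumed) */
    #define NUM_LINES   4   /* a tiny cache, matching the figures */

    struct cache_line {
        bool     valid;              /* does this line hold a block? */
        uint32_t block_num;          /* which memory block is cached here */
        uint8_t  data[BLOCK_SIZE];   /* the block itself, copied as a unit */
    };

    struct cache {
        struct cache_line lines[NUM_LINES];
    };

    int main(void) {
        struct cache c = { 0 };      /* all lines start invalid (empty) */
        printf("%d lines x %d-byte blocks, %zu bytes total\n",
               NUM_LINES, BLOCK_SIZE, sizeof c);
        return 0;
    }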
Cache Hit
Data in block b is needed, and block b is in the cache: hit!
[Figure: a request for block 14 is served directly from a cache holding blocks 8, 9, 14, and 3]
Cache Miss
Data in block b is needed, but block b is not in the cache: miss! Block b is fetched from memory and stored in the cache.
- Placement policy: determines where b goes
- Replacement policy: determines which block gets evicted (the victim)
[Figure: a request for block 12 misses; block 12 is fetched from memory and placed in the cache]
Placement Policy
Q: When a new block comes in, where in the cache can you keep it?
A: Depends on the placement policy:
- Anywhere (fully associative) - why not do this all the time?
- Exactly one cache line (direct-mapped) - commonly, block i is mapped to cache line (i mod t), where t is the total number of lines
- One of n cache lines (n-way set-associative)
The arithmetic behind these mappings is sketched below.
[Figures: the same cache drawn as fully associative, direct mapped, and n-way set associative]
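The mapping rules reduce to simple arithmetic. A small C sketch (function names are illustrative, not from the slides):

    #include <stdio.h>

    /* Direct-mapped: block i may live only in line (i mod t),
       where t is the total number of lines. */
    unsigned direct_mapped_line(unsigned block, unsigned num_lines) {
        return block % num_lines;
    }

    /* n-way set-associative: block i maps to set (i mod s), where s is
       the number of sets; it may occupy any of that set's n lines.
       Fully associative is the special case of a single set. */
    unsigned set_for_block(unsigned block, unsigned num_sets) {
        return block % num_sets;
    }

    int main(void) {
        printf("block 14, direct-mapped, 4 lines -> line %u\n",
               direct_mapped_line(14, 4));   /* 14 mod 4 = line 2 */
        printf("block 14, 2-way assoc., 2 sets  -> set %u\n",
               set_for_block(14, 2));        /* 14 mod 2 = set 0 */
        return 0;
    }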
Types of Cache Misses
- Cold (compulsory) miss: cold misses occur because the cache is empty.
- Capacity miss: occurs when the set of active cache blocks (the working set) is larger than the cache.
- Conflict miss (direct-mapped and set-associative caches): most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the line positions at level k. E.g., block i at level k+1 must be placed in line (i mod 4) at level k. Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k line. E.g., referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.
Conflict Miss
[Figures: four animation frames of a direct-mapped cache in which blocks 0 and 8, mapping to the same line, repeatedly evict each other; a sketch of this behavior follows]
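The 0, 8, 0, 8, ... pattern is easy to reproduce. This sketch (illustrative; it assumes the 4-line direct-mapped cache from the example above) counts the resulting conflict misses:

    #include <stdio.h>

    #define LINES 4

    int main(void) {
        int line_block[LINES];               /* block held by each line */
        for (int i = 0; i < LINES; i++)
            line_block[i] = -1;              /* all lines start empty */

        int refs[] = { 0, 8, 0, 8, 0, 8 };
        int n = (int)(sizeof refs / sizeof refs[0]);
        int misses = 0;

        for (int r = 0; r < n; r++) {
            int b = refs[r];
            int line = b % LINES;            /* 0 and 8 both map to line 0 */
            if (line_block[line] != b) {
                misses++;                    /* conflict miss: evict */
                line_block[line] = b;
            }
        }
        printf("%d references, %d misses\n", n, misses);  /* 6 and 6 */
        return 0;
    }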
Cache replacement policy
When your cache is full and you acquire a new value, you must evict a previously stored value. The performance of the cache is determined by how smart you are in evicting values, as governed by the cache eviction policy. Popular policies:
- Least recently used (LRU): evict the value that has been in the cache the longest without being accessed
- Least frequently used (LFU): evict the value that has been accessed the least number of times
- First in, first out (FIFO): evict values in the same order they came in
Policy efficiency is measured by hit performance (how often is something asked for and found?) and the associated costs, and is determined by the working set and the workload. A sketch of LRU follows.
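A minimal LRU sketch in C (illustrative; a fully associative 4-line cache with a logical clock per line). Its main() replays the reference string from the example on the following slides and reproduces its 4 hits and 6 misses.

    #include <stdio.h>

    #define LINES 4

    struct line { int block; long last_used; };

    /* block = -1 marks an empty line; last_used = -1 sorts before any access */
    static struct line cache[LINES] = {
        { -1, -1 }, { -1, -1 }, { -1, -1 }, { -1, -1 }
    };

    /* Returns 1 on a hit, 0 on a miss; 'now' is a logical clock. */
    static int access_block(int block, long now) {
        int victim = 0;
        for (int i = 0; i < LINES; i++) {
            if (cache[i].block == block) {   /* hit: refresh recency */
                cache[i].last_used = now;
                return 1;
            }
            if (cache[i].last_used < cache[victim].last_used)
                victim = i;                  /* track least recently used */
        }
        cache[victim].block = block;         /* miss: evict the LRU line */
        cache[victim].last_used = now;
        return 0;
    }

    int main(void) {
        int refs[] = { 1, 4, 3, 1, 5, 1, 4, 0, 3, 1 };
        int n = (int)(sizeof refs / sizeof refs[0]);
        int hits = 0;
        for (int t = 0; t < n; t++)
            hits += access_block(refs[t], t);
        printf("%d hits, %d misses\n", hits, n - hits);  /* 4 hits, 6 misses */
        return 0;
    }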
Cache performance
A cache hit is when the referenced information is served out of the cache. A cache miss occurs when the referenced information cannot be served out of the cache. The hit ratio is:

    hit ratio = (# cache hits) / (# total accesses)

Note that the efficiency of a cache is almost entirely determined by the hit ratio.
Cache performance
The average memory access time can be calculated as:

    memory latency = hit cost + P(miss) x miss penalty

where the hit cost is the cost of serving an access out of the cache, the miss penalty is the cost of serving it out of main memory, and P(miss) is the probability of a cache access resulting in a miss, i.e., the ratio of misses to total accesses (1 - hit ratio).
E.g., for a hit cost of 25 usec, a miss penalty of 250 usec, and a cache hit rate of 80%:

    25 usec + (0.2 x 250 usec) = 25 + 50 usec = 75 usec

This is the average access time through the cache.
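The same arithmetic as a small C sketch (the function name is illustrative), using the slide's numbers:

    #include <stdio.h>

    /* Average memory access time: hit cost + P(miss) * miss penalty */
    static double amat_usec(double hit_cost, double miss_penalty, double p_miss) {
        return hit_cost + p_miss * miss_penalty;
    }

    int main(void) {
        /* 25 usec hit cost, 250 usec penalty, 80% hit rate => P(miss) = 0.2 */
        printf("average access time: %.0f usec\n", amat_usec(25.0, 250.0, 0.2));
        return 0;                                 /* prints 75 usec */
    }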
Example: 4-Line LRU Cache
Reference string: memory[1], memory[4], memory[3], memory[1], memory[5], memory[1], memory[4], memory[0], memory[3], memory[1]

    Time  Access      Result           Cache contents after access
    T=0   memory[1]   miss (cold)      1
    T=1   memory[4]   miss (cold)      1, 4
    T=2   memory[3]   miss (cold)      1, 4, 3
    T=3   memory[1]   hit              1, 4, 3
    T=4   memory[5]   miss (cold)      1, 4, 3, 5
    T=5   memory[1]   hit              1, 4, 3, 5
    T=6   memory[4]   hit              1, 4, 3, 5
    T=7   memory[0]   miss (evict 3)   1, 4, 0, 5
    T=8   memory[3]   miss (evict 5)   1, 4, 0, 3
    T=9   memory[1]   hit              1, 4, 0, 3
Example: 4-Line LRU Cache
Result: 6 misses and 4 hits, so P(miss) = 0.6.
Assume a hit cost of 100 usec and a miss penalty of 1000 usec. The average memory access time is then:

    100 usec + (0.6 x 1000 usec) = 100 + 600 = 700 usec

Q: Why is the performance so poor?