Why memory hierarchy


1 Why memory hierarchy (3rd Ed: p , 4th Ed: p )
  users want unlimited fast memory
  fast memory is expensive, slow memory is cheap
  cache: small, fast memory near the CPU
  large, slow memory (main memory, disk, ...) connected to faster memory (one level up)

2 Core-2 Duo Extreme

3 Locality principles
  temporal locality: items recently used by the CPU tend to be referenced again soon
    guides cache replacement policy: what to replace when the cache is full
  spatial locality: items with addresses close to recently used items tend to be referenced soon
    motivates fetching multiple data items at a time

4 Cache operation
  cache organised in blocks of one or more words
  cache hit: CPU finds required block in cache
  otherwise, cache miss: get data from main memory
  hit rate: ratio of cache hits to total memory accesses
  hit time: time to access cache + time to determine cache hit or miss
  miss penalty: time to replace item in cache + time to transfer it to CPU

5 Direct mapped cache
  each memory location is mapped to one cache location
    e.g. cache index (cache address) = (memory block address) mod (number of blocks in cache)
  multiple memory locations -> one cache location
    e.g. with 8 blocks in cache, cache location 001 may contain items from memory locations 00001, 01001, ...
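The modulo mapping on this slide can be sketched in a few lines of Python. The function name `cache_index` and the 5-bit memory block address space are assumptions made for illustration; they are not part of the slides.

```python
def cache_index(block_address: int, num_blocks: int) -> int:
    """Direct-mapped placement: memory block address modulo number of cache blocks."""
    return block_address % num_blocks


NUM_BLOCKS = 8          # 8 blocks in the cache, as in the slide's example
MEMORY_BLOCKS = 32      # assumed: 5-bit block addresses (00000 .. 11111)

# Every memory block that competes for cache location 001.
sharing_001 = [
    format(addr, "05b")
    for addr in range(MEMORY_BLOCKS)
    if cache_index(addr, NUM_BLOCKS) == 0b001
]
print(sharing_001)
```

When the number of blocks is a power of two, the modulo reduces to taking the low-order bits of the block address, which is why all blocks listed share the same last three bits.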

6 Direct mapped cache
  mapping: address is modulo the number of blocks in the cache
  (figure: memory blocks mapped onto cache blocks; labels: Cache, Memory)

7 Address translation
  lower portion: cache index
  upper portion: compare with tag
  Hit if: tag matches and block is valid
  What about miss?
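A minimal sketch of the lookup described here, assuming a direct-mapped cache addressed by block number. The class name `DirectMappedCache`, its methods, and the choice to fill the entry on a miss are illustrative assumptions, not something the slides specify.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that tracks only valid bits and tags (no data)."""

    def __init__(self, index_bits: int):
        self.index_bits = index_bits
        entries = 1 << index_bits
        self.valid = [False] * entries
        self.tags = [0] * entries

    def split(self, block_address: int):
        """Lower portion of the address is the index, upper portion is the tag."""
        index = block_address & ((1 << self.index_bits) - 1)
        tag = block_address >> self.index_bits
        return tag, index

    def access(self, block_address: int) -> bool:
        """Return True on a hit: the entry is valid and its tag matches."""
        tag, index = self.split(block_address)
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            # Assumed miss handling: fetch the block and overwrite the entry.
            self.valid[index] = True
            self.tags[index] = tag
        return hit


cache = DirectMappedCache(index_bits=3)
results = [cache.access(a) for a in (0b00001, 0b00001, 0b01001, 0b00001)]
print(results)
```

The third and fourth accesses miss because blocks 00001 and 01001 share index 001 and keep evicting each other.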

8 Handling misses: by exception
  cache miss on instruction read:
    1. restore PC: PC = PC - 4
    2. send address to main memory and wait (stall)
    3. write data received from memory to cache
    4. refetch instruction (from restored PC)
  cache miss on data read:
    similar: stall CPU until data from main memory are available in cache
  which is probably more common?

9 Cache write policy
  need to maintain cache consistency: how to keep data in cache and in memory consistent?
  write back: write to cache only
    complex control, need dirty bit for cache entries
    have to flush dirty entries when they are evicted
  write through: write to both cache and memory
    memory write bandwidth can cause bottleneck
  What happens with shared memory multiprocessors?
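To make the trade-off concrete, the sketch below counts memory writes for a sequence of stores under each policy. It assumes a direct-mapped cache, allocation on a write miss for the write-back case, and a final flush of all dirty entries; the function name `count_memory_writes` and the store sequence are hypothetical.

```python
def count_memory_writes(store_blocks, policy, num_blocks=8):
    """Count writes that reach main memory for a list of stored-to block addresses."""
    if policy == "write-through":
        # Every store goes to both the cache and memory.
        return len(store_blocks)
    if policy != "write-back":
        raise ValueError(f"unknown policy: {policy}")

    tags = [None] * num_blocks
    dirty = [False] * num_blocks
    memory_writes = 0
    for block in store_blocks:
        index = block % num_blocks
        if tags[index] != block:
            # Evicting a dirty entry forces it to be written to memory first.
            if dirty[index]:
                memory_writes += 1
            tags[index] = block
        dirty[index] = True          # the store only updates the cache
    # Assumed: flush whatever is still dirty at the end of the run.
    memory_writes += sum(dirty)
    return memory_writes


stores = [0, 0, 0, 8, 0]             # blocks 0 and 8 share cache index 0
print(count_memory_writes(stores, "write-through"))
print(count_memory_writes(stores, "write-back"))
```

Repeated stores to the same block are absorbed by the write-back cache, while conflicting blocks force write-backs on each eviction.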

10 Write buffer for write through
  insert buffer between cache and memory
    processor: write data into cache and write buffer
    memory controller: write contents of the write buffer to memory
  write buffer is just a FIFO queue
    fine if store frequency (w.r.t. time) $\ll 1 / \text{memory write cycle}$
    otherwise have write buffer saturation
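The saturation condition can be explored with a small cycle-by-cycle simulation. This is a simplified sketch under stated assumptions: the processor issues one store every `store_interval` non-stalled cycles, the memory controller retires one buffered write every `write_cycle` cycles, and the processor stalls whenever the buffer is full. All names and parameters are hypothetical.

```python
from collections import deque


def simulate_write_buffer(store_interval, write_cycle, capacity, n_stores):
    """Return the number of processor stall cycles caused by a full write buffer."""
    buffer = deque()          # FIFO of pending writes
    remaining = 0             # cycles left on the memory write in progress
    until_next_store = 0      # non-stalled cycles until the next store issues
    issued = 0
    stall_cycles = 0

    while issued < n_stores:
        # Memory controller: advance the write in progress, retire it when done.
        if remaining > 0:
            remaining -= 1
            if remaining == 0:
                buffer.popleft()

        # Processor: issue a store if one is due and the buffer has room.
        if until_next_store == 0:
            if len(buffer) < capacity:
                buffer.append(issued)
                issued += 1
                until_next_store = store_interval - 1
            else:
                stall_cycles += 1
        else:
            until_next_store -= 1

        # Memory controller: start the next write if idle.
        if remaining == 0 and buffer:
            remaining = write_cycle

    return stall_cycles


print(simulate_write_buffer(store_interval=10, write_cycle=4, capacity=4, n_stores=100))
print(simulate_write_buffer(store_interval=2, write_cycle=10, capacity=4, n_stores=100))
```

The first call models stores that arrive much less often than memory can retire them; the second models stores arriving faster than the memory write cycle, where no finite buffer avoids stalls.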

11 Write buffer saturation
  store buffer overflows when
    CPU cycle time is too fast with respect to memory access time
    there are too many store instructions in a row
  solutions to write buffer saturation
    use a write back cache
    install a second-level (L2) cache
    store compression

12 Exploiting spatial locality

13 Block size and performance
  larger block size generally lowers miss rate (especially for instructions)
  large block for small cache: miss rate goes up - too few blocks
  larger block size: longer transfer time between cache and main memory

14 Impact of block size on miss rate
  (figure: miss rate, 0% to 40%, versus block size in bytes; one curve per total cache size: 1 KB, 8 KB, 16 KB, 64 KB, 256 KB)
  larger block size generally lowers miss rate
  large block for small cache: miss rate goes up - too few blocks
  larger block size: longer transfer time between cache and main memory

15 Multi-word cache: handling misses
  cache miss on read:
    same way as single-word block
    bring back the entire multi-word block from memory
  cache miss on write, given write-through cache:
    single-word block: disregard hit or miss, just write to cache and write buffer / memory
    do the same for multi-word block?

16 Cache performance
  $\text{CPU time} = (\text{execution cycles} + \text{memory stall cycles}) \times \text{cycle time}$
  $\text{memory stall cycles} = \text{read stall cycles} + \text{write stall cycles}$
  $\text{read stall cycles} = \frac{\text{reads}}{\text{program}} \times \text{read miss rate (\%)} \times \text{read miss penalty (cycles)}$
    (read misses per program, times number of cycles per read miss)
  $\text{write stall cycles} = \frac{\text{writes}}{\text{program}} \times \text{write miss rate (\%)} \times \text{write miss penalty (cycles)} + \text{write buffer stalls (when full)}$
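The stall-cycle formulas translate directly into code. The sketch below is one possible transcription; the function names and the workload numbers in the example are made up for illustration and do not come from the slides. Miss rates are passed as fractions (0.04 for 4%).

```python
def memory_stall_cycles(reads, read_miss_rate, read_miss_penalty,
                        writes, write_miss_rate, write_miss_penalty,
                        write_buffer_stalls=0):
    """Read stall cycles plus write stall cycles for one program run."""
    read_stalls = reads * read_miss_rate * read_miss_penalty
    write_stalls = (writes * write_miss_rate * write_miss_penalty
                    + write_buffer_stalls)
    return read_stalls + write_stalls


def cpu_time(execution_cycles, stall_cycles, cycle_time):
    """CPU time = (execution cycles + memory stall cycles) * cycle time."""
    return (execution_cycles + stall_cycles) * cycle_time


# Hypothetical workload: 1000 reads, 500 writes, 100-cycle miss penalty.
stalls = memory_stall_cycles(reads=1000, read_miss_rate=0.04, read_miss_penalty=100,
                             writes=500, write_miss_rate=0.02, write_miss_penalty=100)
print(stalls)
print(cpu_time(execution_cycles=10_000, stall_cycles=stalls, cycle_time=1e-9))
```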

17 Effect on CPU
  assume hit time insignificant (data transfer time dominates)
  let c: CPI without stalls, i: instruction miss rate, p: miss penalty, d: data miss rate, n: instruction count, f: load/store frequency
  total memory stall cycles: $nip + nfdp$
  total CPU cycles without stall: $nc$
  total CPU cycles with stall: $n(c + ip + fdp)$
  % time on stall: $\frac{ip + fdp}{c + ip + fdp}$
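As a quick check of the last expression, the sketch below evaluates the stall fraction for one set of parameter values. The function name `stall_fraction` and the numbers (CPI of 2, 2% instruction miss rate, 4% data miss rate, 36% loads and stores, 100-cycle penalty) are illustrative assumptions, not values given on this slide.

```python
def stall_fraction(c, i, p, f, d):
    """Fraction of CPU cycles spent stalled: (ip + fdp) / (c + ip + fdp)."""
    stall_per_instruction = i * p + f * d * p
    return stall_per_instruction / (c + stall_per_instruction)


fraction = stall_fraction(c=2, i=0.02, p=100, f=0.36, d=0.04)
print(f"{fraction:.1%} of cycles are memory stalls")
```

Note that the instruction count n cancels out, so the fraction depends only on per-instruction quantities.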

18 Faster CPU
  same memory speed, halved CPI:
    $\%\ \text{time on stall} = \frac{ip + fdp}{\frac{1}{2}c + ip + fdp} > \frac{ip + fdp}{c + ip + fdp}$
    lower CPI results in greater impact of stall cycles
  same memory speed, halved clock cycle time t:
    miss penalty (in cycles): 2p
    total CPU cycles with new clock: $n(c + 2ip + 2fdp)$
    $\text{performance improvement} = \frac{\text{exec. time with old clock}}{\text{exec. time with new clock}} = \frac{n(c + ip + fdp)\,t}{n(c + 2ip + 2fdp)\,t/2} = 2 \cdot \frac{c + ip + fdp}{c + 2ip + 2fdp} < 2$
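Both claims on this slide can be checked numerically. The sketch below uses hypothetical parameter values (CPI of 2, 2% instruction miss rate, 4% data miss rate, 36% loads and stores, 100-cycle penalty); the function names are also assumptions made for illustration.

```python
def cycles_per_instruction(c, i, p, f, d):
    """Effective CPI including memory stalls: c + ip + fdp."""
    return c + i * p + f * d * p


def stall_fraction(c, i, p, f, d):
    """Fraction of cycles spent stalled: (ip + fdp) / (c + ip + fdp)."""
    stalls = i * p + f * d * p
    return stalls / (c + stalls)


def speedup_from_halved_clock(c, i, p, f, d, t=1.0):
    """Old execution time over new, with cycle time t/2 and miss penalty 2p cycles."""
    old_time = cycles_per_instruction(c, i, p, f, d) * t
    new_time = cycles_per_instruction(c, i, 2 * p, f, d) * (t / 2)
    return old_time / new_time


params = dict(i=0.02, p=100, f=0.36, d=0.04)

# Halving the CPI raises the share of time lost to stalls.
print(stall_fraction(c=2, **params), stall_fraction(c=1, **params))

# Halving the clock cycle time gives a speedup well short of 2.
speedup = speedup_from_halved_clock(c=2, **params)
print(speedup)
```

The instruction count n is omitted because it cancels in both ratios.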

19 Multi-level cache hierarchy
  how can we look at the cache hierarchy?
    performance view
    capacity view
    physical hierarchy
    abstract hierarchy

20 Typical scale
  L1
    size: tens of KB
    hit time: complete in one clock cycle
    miss rates: 1-5%
  L2
    size: hundreds of KB
    hit time: a few clock cycles
    miss rates: 10-20%
  L2 miss rate: fraction of L1 misses that also miss in L2
    why so high?
  complex: different block size/placement for L1, L2

21 Average Memory Access Time
  want the Average Memory Access Time (AMAT)
    take into account all levels of the hierarchy
    calculate $AMAT_{CPU}$: AMAT for ISA-level accesses
    follow the abstract hierarchy
  $AMAT_{CPU} = AMAT_{L1}$
  $AMAT_{L1} = HitTime_{L1} + MissRate_{L1} \times AMAT_{L2}$
  $AMAT_{L2} = HitTime_{L2} + MissRate_{L2} \times AMAT_{M}$
  $AMAT_{M} = \text{constant}$
  $AMAT_{CPU} = HitTime_{L1} + MissRate_{L1} \times (HitTime_{L2} + MissRate_{L2} \times AMAT_{M})$
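The recurrence lends itself to a short recursive function that works for any number of cache levels. This is a sketch: the name `amat`, the list-of-tuples representation, and the sample numbers are assumptions chosen for illustration.

```python
def amat(levels, memory_time):
    """Average memory access time for a cache hierarchy.

    levels: list of (hit_time, miss_rate) pairs, ordered from L1 outward.
    memory_time: constant access time of main memory (AMAT_M).
    """
    if not levels:
        return memory_time
    hit_time, miss_rate = levels[0]
    # AMAT of this level = its hit time + miss rate * AMAT of the next level.
    return hit_time + miss_rate * amat(levels[1:], memory_time)


# Hypothetical two-level hierarchy: L1 = (1 cycle, 10%), L2 = (10 cycles, 50%).
print(amat([(1, 0.10), (10, 0.50)], memory_time=100))
```

Each miss rate here is local to its level, that is, the fraction of accesses reaching that level which miss in it.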

22 Example
  assume
    L1 hit time = 1 cycle
    L1 miss rate = 5%
    L2 hit time = 5 cycles
    L2 miss rate = 15% (% of L1 misses that also miss in L2)
    L2 miss penalty = 200 cycles
  L1 miss penalty = $5 + 0.15 \times 200 = 35$ cycles
  AMAT = $1 + 0.05 \times 35 = 2.75$ cycles

23 Example: without L2 cache
  assume
    L1 hit time = 1 cycle
    L1 miss rate = 5%
    L1 miss penalty = 200 cycles
  AMAT = $1 + 0.05 \times 200 = 11$ cycles
  4 times faster with L2 cache! (2.75 versus 11)
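The two examples can be reproduced with plain arithmetic, using the hit times, miss rates, and 200-cycle memory penalty stated in the slides. The variable names are chosen here for illustration.

```python
L1_HIT_TIME = 1        # cycles
L1_MISS_RATE = 0.05
L2_HIT_TIME = 5        # cycles
L2_MISS_RATE = 0.15    # fraction of L1 misses that also miss in L2
MEMORY_PENALTY = 200   # cycles

# With an L2 cache, an L1 miss costs the L2 hit time plus the occasional trip to memory.
l1_miss_penalty_with_l2 = L2_HIT_TIME + L2_MISS_RATE * MEMORY_PENALTY
amat_with_l2 = L1_HIT_TIME + L1_MISS_RATE * l1_miss_penalty_with_l2

# Without an L2 cache, every L1 miss goes straight to memory.
amat_without_l2 = L1_HIT_TIME + L1_MISS_RATE * MEMORY_PENALTY

print(f"with L2:    {amat_with_l2:.2f} cycles")
print(f"without L2: {amat_without_l2:.2f} cycles")
print(f"ratio:      {amat_without_l2 / amat_with_l2:.1f}x")
```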

CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson