Why memory hierarchy


Why memory hierarchy (3rd Ed: p. 468-487, 4th Ed: p. 452-470)
- users want unlimited amounts of fast memory
- fast memory is expensive; slow memory is cheap
- cache: a small, fast memory near the CPU
- each large, slow memory (main memory, disk, ...) connects to the faster memory one level up
dt10 2011 6.1

Core-2 Duo Extreme
[Figure: Core-2 Duo Extreme chip]

Locality principles
- temporal locality: items recently used by the CPU tend to be referenced again soon
  - guides the cache replacement policy: what to replace when the cache is full
- spatial locality: items with addresses close to recently used items tend to be referenced soon
  - guides fetching: a miss fetches multiple neighbouring data items

Cache operation
- the cache is organised in blocks of one or more words
- cache hit: the CPU finds the required block in the cache
- otherwise, cache miss: fetch the data from main memory
- hit rate: fraction of memory accesses that hit in the cache
- hit time: time to access the cache, including the time to determine hit or miss
- miss penalty: time to replace a block in the cache plus time to transfer the data to the CPU

Direct mapped cache
- each memory location is mapped to exactly one cache location:
  cache index = (memory block address) mod (number of blocks in cache)
- so multiple memory locations map to the same cache location
- e.g. with 8 blocks in the cache, cache location 001 may contain items from memory locations 00001, 01001, 10001

Direct mapped cache mapping
[Figure: mapping is address modulo the number of blocks in the cache; memory addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, 11101 map to cache indices 000-111 according to their three low-order bits]
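
The modulo mapping can be checked with a short sketch (Python; the 8-block cache size is taken from the figure):

```python
NUM_BLOCKS = 8  # 8-block cache: indices 000 to 111

def cache_index(block_address: int) -> int:
    """Direct-mapped placement: index = block address mod number of blocks."""
    return block_address % NUM_BLOCKS

# Memory locations 00001, 01001 and 10001 (binary) all share cache index 001.
for addr in (0b00001, 0b01001, 0b10001):
    print(f"{addr:05b} -> index {cache_index(addr):03b}")
```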

Address translation
- lower portion of the address: cache index
- upper portion: compared with the stored tag
- hit if the tag matches and the block is valid
- what about a miss?
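
The split can be sketched as follows (Python); the field widths are illustrative assumptions, not the lecture's: 16-byte blocks and a 16-block direct-mapped cache on byte addresses.

```python
OFFSET_BITS = 4  # assumed 16-byte blocks
INDEX_BITS = 4   # assumed 16-block direct-mapped cache

def split_address(addr: int):
    """Split a byte address into (tag, index, offset): the lowest bits select
    the byte within the block, the middle bits pick the cache index, and the
    remaining upper bits form the tag compared on every access."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def is_hit(entry, addr):
    """Hit iff the indexed entry is valid and its stored tag matches."""
    tag, _, _ = split_address(addr)
    valid, stored_tag = entry
    return valid and stored_tag == tag
```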

Handling misses: by exception
- cache miss on instruction read:
  - restore the PC: PC = PC - 4
  - send the address to main memory and wait (stall)
  - write the data received from memory into the cache
  - refetch the instruction (from the restored PC)
- cache miss on data read: similar; stall the CPU until the data from main memory are available in the cache
- which is probably more common?

Cache write policy
- need to maintain cache consistency: how to keep data in the cache and in memory consistent?
- write back: write to the cache only
  - complex control; needs a dirty bit per cache entry
  - dirty entries have to be flushed to memory when they are evicted
- write through: write to both the cache and memory
  - memory write bandwidth can become a bottleneck
- what happens with shared-memory multiprocessors?
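
The two policies can be contrasted in a toy sketch (Python; word-granularity dicts stand in for real blocks and tags, purely as an illustration):

```python
class WriteThroughCache:
    """Every store updates both the cache and memory."""
    def __init__(self, memory):
        self.memory = memory
        self.data = {}

    def write(self, addr, value):
        self.data[addr] = value    # update the cache...
        self.memory[addr] = value  # ...and memory, on every store

class WriteBackCache:
    """Stores update the cache only; memory is updated on eviction."""
    def __init__(self, memory):
        self.memory = memory
        self.data = {}
        self.dirty = set()  # the per-entry dirty bits

    def write(self, addr, value):
        self.data[addr] = value
        self.dirty.add(addr)       # cache entry now diverges from memory

    def evict(self, addr):
        if addr in self.dirty:     # flush dirty entries when evicted
            self.memory[addr] = self.data[addr]
            self.dirty.discard(addr)
        self.data.pop(addr, None)
```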

Write buffer for write through
- insert a buffer between the cache and memory
- processor: writes data into the cache and the write buffer
- memory controller: writes the contents of the write buffer to memory
- the write buffer is just a FIFO queue
- fine if the store frequency (w.r.t. time) « 1 / (memory write cycle); otherwise write buffer saturation

Write buffer saturation
- the store buffer overflows when:
  - the CPU cycle time is too fast with respect to the memory access time
  - too many store instructions occur in a row
- solutions to write buffer saturation:
  - use a write-back cache
  - install a second-level (L2) cache
  - store compression
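
A toy cycle-level simulation makes the saturation condition concrete (Python; all the rates and the 4-entry buffer depth are made-up values):

```python
from collections import deque

def write_buffer_stalls(cycles, store_interval, mem_write_cycle, depth=4):
    """Count cycles the CPU stalls because the FIFO write buffer is full.
    Memory retires one buffered write every mem_write_cycle cycles; the CPU
    issues a store every store_interval cycles."""
    buf = deque()
    stalls = 0
    for t in range(cycles):
        if t % mem_write_cycle == 0 and buf:
            buf.popleft()              # memory drains one write
        if t % store_interval == 0:
            if len(buf) < depth:
                buf.append(t)          # store enters the write buffer
            else:
                stalls += 1            # buffer saturated: CPU must wait
    return stalls

# a store every 2 cycles vs a 10-cycle memory write: the buffer saturates;
# a store every 20 cycles: store frequency << 1/(memory write cycle), no stalls
```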

Exploiting spatial locality
[Figure: a cache with multi-word blocks, so each miss fetches several neighbouring words]

Block size and performance
- increasing the block size generally decreases the miss rate (especially for instructions)
- but a large block in a small cache increases the miss rate: too few blocks
- a larger block size also increases the transfer time between cache and main memory

Impact of block size on miss rate
[Figure: miss rate (0% to 40%) versus block size (4 to 256 bytes) for total cache sizes of 1 KB, 8 KB, 16 KB, 64 KB and 256 KB; miss rate generally falls as block size grows, except for large blocks in small caches, where too few blocks remain]

Multi-word cache: handling misses
- cache miss on read: handled the same way as for a single-word block, except that the entire multi-word block is brought back from memory
- cache miss on write, given a write-through cache:
  - single-word block: disregard hit or miss, just write to the cache and to the write buffer / memory
  - can we do the same for a multi-word block?

Cache performance
CPU time = (execution cycles + memory stall cycles) × cycle time
memory stall cycles = read stall cycles + write stall cycles
read stall cycles = (reads / program) × read miss rate (%) × read miss penalty (cycles)
                  = (read misses / program) × (number of cycles per read miss)
write stall cycles = (writes / program) × write miss rate (%) × write miss penalty (cycles)
                   + write buffer stalls (when full)
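
A worked instance of these formulas (Python); every number below is invented for illustration, and write buffer stalls are ignored:

```python
instructions = 1_000_000
loads_per_instr, stores_per_instr = 0.20, 0.10  # assumed instruction mix
read_miss_rate, write_miss_rate = 0.04, 0.06    # assumed miss rates
miss_penalty = 40                               # cycles, assumed
base_cpi = 1.2                                  # CPI without memory stalls
cycle_time_ns = 0.5

read_stall_cycles = instructions * loads_per_instr * read_miss_rate * miss_penalty
write_stall_cycles = instructions * stores_per_instr * write_miss_rate * miss_penalty
memory_stall_cycles = read_stall_cycles + write_stall_cycles

# CPU time = (execution cycles + memory stall cycles) x cycle time
cpu_time_ns = (instructions * base_cpi + memory_stall_cycles) * cycle_time_ns
```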

Effect on CPU
- assume hit time is insignificant (data transfer time dominates)
- let c = CPI without stalls, i = instruction miss rate, p = miss penalty, d = data miss rate, n = instruction count, f = load/store frequency
- total memory stall cycles: nip + nfdp
- total CPU cycles without stalls: nc
- total CPU cycles with stalls: n(c + ip + fdp)
- fraction of time on stalls: (ip + fdp) / (c + ip + fdp)
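
With the symbols above, the stall fraction is straightforward to evaluate (Python); the parameter values in the example call are invented:

```python
def stall_fraction(c, i, d, f, p):
    """Fraction of CPU time spent on memory stalls:
    (i*p + f*d*p) / (c + i*p + f*d*p), in the slide's notation."""
    stall_cpi = i * p + f * d * p
    return stall_cpi / (c + stall_cpi)

# e.g. c = 2.0, i = 2% instruction miss rate, d = 4% data miss rate,
# f = 0.3 loads/stores per instruction, p = 40-cycle penalty (all assumed)
```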

Faster CPU
- same memory speed, halved CPI:
  fraction of time on stalls = (ip + fdp) / ((½)c + ip + fdp) > (ip + fdp) / (c + ip + fdp)
  so a lower CPI results in a greater impact of stall cycles
- same memory speed, halved clock cycle time t:
  miss penalty becomes 2p, so total CPU time with the new clock is n(c + 2ip + 2fdp)(t/2)
  performance improvement = exec. time with old clock / exec. time with new clock
    = n(c + ip + fdp)t / (n(c + 2ip + 2fdp)(t/2))
    = 2(c + ip + fdp) / (c + 2ip + 2fdp) < 2
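
The halved-clock case can be checked numerically (Python; the parameter values in the example call are assumed):

```python
def speedup_halved_clock(c, i, d, f, p):
    """Old execution time / new execution time when the clock cycle time is
    halved but memory speed is unchanged, so the miss penalty doubles to 2p
    cycles. The instruction count n and old cycle time t cancel out."""
    old_time = c + i * p + f * d * p                  # times n, times t
    new_time = (c + 2 * i * p + 2 * f * d * p) * 0.5  # times n, times t
    return old_time / new_time

# For positive parameters the result lies strictly between 1 and 2:
# the stall cycles keep the speedup below the ideal factor of 2.
```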

Multi-level cache hierarchy
- how can we look at the cache hierarchy?
  - performance view
  - capacity view
  - physical hierarchy
  - abstract hierarchy

Typical scale
- L1: size: tens of KB; hit time: completes in one clock cycle; miss rate: 1-5%
- L2: size: hundreds of KB; hit time: a few clock cycles; miss rate: 10-20%
- L2 miss rate: fraction of L1 misses that also miss in L2
  - why so high? L2 only sees the accesses that already missed in L1
- it gets complex: L1 and L2 may use different block sizes and placement

Average Memory Access Time
- want the Average Memory Access Time (AMAT), taking all levels of the hierarchy into account
- calculate AMAT_CPU, the AMAT for ISA-level accesses, by following the abstract hierarchy:
  AMAT_CPU = AMAT_L1
  AMAT_L1 = HitTime_L1 + MissRate_L1 × AMAT_L2
  AMAT_L2 = HitTime_L2 + MissRate_L2 × AMAT_M
  AMAT_M = constant
- hence AMAT_CPU = HitTime_L1 + MissRate_L1 × (HitTime_L2 + MissRate_L2 × AMAT_M)
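
The recursion bottoms out at main memory, so it can be folded from the bottom up; a Python sketch of the formulas above:

```python
def amat(hit_times, miss_rates, mem_time):
    """AMAT following the abstract hierarchy:
    AMAT_Lk = HitTime_Lk + MissRate_Lk * AMAT_L(k+1), with AMAT_M constant.
    hit_times and miss_rates are listed from L1 downwards."""
    t = mem_time  # AMAT_M
    for hit_time, miss_rate in zip(reversed(hit_times), reversed(miss_rates)):
        t = hit_time + miss_rate * t
    return t

# With the numbers of the example on the following slide,
# amat([1, 5], [0.05, 0.15], 200) gives 2.75 cycles.
```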

Example
- assume:
  L1 hit time = 1 cycle
  L1 miss rate = 5%
  L2 hit time = 5 cycles
  L2 miss rate = 15% (fraction of L1 misses that miss in L2)
  L2 miss penalty = 200 cycles
- L1 miss penalty = 5 + 0.15 × 200 = 35 cycles
- AMAT = 1 + 0.05 × 35 = 2.75 cycles

Example: without L2 cache
- assume:
  L1 hit time = 1 cycle
  L1 miss rate = 5%
  L1 miss penalty = 200 cycles
- AMAT = 1 + 0.05 × 200 = 11 cycles
- 4 times faster with the L2 cache! (2.75 versus 11)
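
The comparison between the two examples reduces to three lines of arithmetic (Python, using the lecture's numbers):

```python
l1_hit, l1_miss_rate = 1, 0.05
l2_hit, l2_miss_rate, mem_penalty = 5, 0.15, 200

amat_with_l2 = l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_penalty)
amat_no_l2 = l1_hit + l1_miss_rate * mem_penalty

print(amat_with_l2)               # 2.75 cycles
print(amat_no_l2)                 # 11.0 cycles
print(amat_no_l2 / amat_with_l2)  # 4.0x faster with the L2 cache
```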