The memory gap: in 1980 microprocessors had no cache; by 1995 the Alpha 21164 microprocessor had a two-level on-chip cache.
Memory Technology Review
DRAM: the value is stored as a charge on a capacitor (must be refreshed; accessed via RAS/CAS). Very small per bit, but slower than SRAM (by a factor of 5 to 10). [Figure: DRAM cell with word line, pass transistor, and capacitor.]
SRAM: the value is stored on a pair of inverting gates (e.g., a D-latch). Very fast, but takes up more space than DRAM (4 to 6 transistors per bit). [Figure: SRAM cell with bit lines.]
General Principles
Locality
- Temporal locality: a referenced item is likely to be referenced again soon.
- Spatial locality: items near a referenced item are likely to be referenced soon.
Locality + "smaller memory is faster" = memory hierarchy.
Levels: each level is smaller, faster, and more expensive per byte than the level below it.
Inclusive: data found in the top level is also found in the levels below it.
Definitions
- Upper level: the level closer to the processor.
- Block: the minimum unit that is either present or not present in the upper level.
- Address = block frame address + block offset.
- Hit time: time to access the upper level, including the time to determine hit or miss.
Why does code have locality?
Locality
- Temporal locality: a recently used item is likely to be re-used in the near future.
- Spatial locality: addresses physically close together tend to be referenced close together in time.
90/10 locality rule: a program executes about 90% of its instructions in 10% of its code.
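Both kinds of locality can be made concrete with a toy cache model. This is a minimal sketch using a hypothetical direct-mapped cache; the block size and frame count are assumed parameters, not figures from the slides.

```python
BLOCK_WORDS = 4   # words per block (assumed)
NUM_FRAMES = 8    # number of cache frames (assumed)

def simulate(addresses):
    """Count (hits, misses) for a trace of word addresses."""
    frames = [None] * NUM_FRAMES      # tag stored in each frame, or None
    hits = misses = 0
    for addr in addresses:
        block = addr // BLOCK_WORDS   # strip the block offset
        index = block % NUM_FRAMES    # frame this block maps to
        tag = block // NUM_FRAMES     # remaining high-order bits
        if frames[index] == tag:
            hits += 1
        else:
            misses += 1
            frames[index] = tag       # bring the block in
    return hits, misses

# Spatial locality: a sequential sweep misses only once per 4-word block.
print(simulate(range(32)))             # (24, 8)
# Temporal locality: re-walking a small working set hits after the first pass.
print(simulate(list(range(16)) * 4))   # (60, 4)
```

The sequential sweep pays one miss per block and then rides spatial locality; the repeated loop misses only on its first pass and then rides temporal locality.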
The memory hierarchy
Users want memories that are large, fast, and cheap: a conflict.
SRAM trades capacity for speed: faster, lower power per access.
DRAM trades speed for cost/capacity: must be refreshed, but far less expensive per byte.
The memory hierarchy
[Figure: levels in the memory hierarchy, with the CPU at the top, then Level 1, Level 2, …, Level n; access time and the size of the memory at each level both increase with distance from the CPU.]
Who's in control?
Level      Managed by
Registers  compiler
Cache      hardware
Memory     OS
Storage    OS/user
(Cost per byte falls and size grows as we move down the hierarchy.)
Q1: Where can a block be placed in the upper level? (Direct mapped: exactly one frame; set associative: any frame within one set; fully associative: any frame in the cache.)
Q2: How Is a Block Found If It Is in the Upper Level?
A tag on each block identifies it; there is no need to store or check the index or block offset.
Increasing associativity shrinks the index and expands the tag.
Address fields: Block Frame Address = Tag + Index, followed by the Block Offset.
Given a block address, the block can only be found in the set specified by its index. The tags of all frames in that set must be compared against the tag of the block address (in parallel) to find a hit.
FA (fully associative): no index; the block address is compared against the tags of all frames in the cache in parallel.
DM (direct mapped): large index; one frame per set, so selecting the set selects the one candidate frame.
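The tag/index/offset split can be sketched concretely. This sketch assumes a byte-addressed cache with 64-byte blocks and 256 sets; both sizes are hypothetical, chosen only to make the bit widths visible.

```python
BLOCK_BYTES = 64                               # assumed: gives 6 offset bits
NUM_SETS = 256                                 # assumed: gives 8 index bits
OFFSET_BITS = (BLOCK_BYTES - 1).bit_length()   # = 6
INDEX_BITS = (NUM_SETS - 1).bit_length()       # = 8

def split_address(addr):
    offset = addr & (BLOCK_BYTES - 1)                # byte within the block
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)   # which set to search
    tag = addr >> (OFFSET_BITS + INDEX_BITS)         # compared to stored tags
    return tag, index, offset

print(split_address(0x12345678))   # (18641, 89, 56)
```

In a fully associative cache INDEX_BITS would be 0 (no index, every tag compared); in a direct-mapped cache each set holds one frame, so the index alone picks the single candidate.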
Cache Performance
CPU execution time = (CPU clock cycles + memory stall cycles) × clock cycle time
Memory stall cycles = number of misses × miss penalty
  = IC × (misses / instruction) × miss penalty
  = IC × (memory accesses / instruction) × miss rate × miss penalty
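The formulas above can be exercised with a small sketch; every input value here is assumed for illustration, not taken from the slides.

```python
def cpu_time(ic, cpi, accesses_per_instr, miss_rate, miss_penalty, cycle_time):
    """CPU execution time = (base cycles + memory stall cycles) * cycle time."""
    base_cycles = ic * cpi
    stall_cycles = ic * accesses_per_instr * miss_rate * miss_penalty
    return (base_cycles + stall_cycles) * cycle_time

# Assumed inputs: 1M instructions, CPI 1, 1.36 accesses/instr,
# 2% miss rate, 100-cycle miss penalty, 1 ns clock.
t = cpu_time(1_000_000, 1, 1.36, 0.02, 100, 1e-9)
print(t)   # about 0.00372 s, vs. an ideal 0.001 s without stalls
```

Even a 2% miss rate nearly quadruples execution time here, which is why the rest of the section dissects miss rate and miss penalty.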
Example 1: split cache vs. single-ported unified cache, miss rate
Instruction mix: 64% non-data-transfer instructions, 36% data transfers, so each instruction makes 1 instruction access + 0.36 data accesses. The reference stream is therefore 74% instruction refs (1/1.36) and 26% data refs (0.36/1.36).

Misses per instruction:
Size   Icache   Dcache   Ucache
16KB   .00382   .0409    .0510
32KB   .00136   .0384    .0433

Unified cache timing: hit time 2 cycles for data transfers (single ported), 1 cycle for all other references; miss time 100 cycles for both.

miss rate = #misses / #memory accesses = (misses/instruction) / (memory accesses/instruction)
I-cache, 16KB: .00382 / 1 = .00382
D-cache, 16KB: .0409 / .36 = .1136
Split cache (16KB + 16KB) miss rate = .74(.00382) + .26(.1136) = .0324
Unified cache (32KB) miss rate = .0433 / 1.36 = .0318
Split cache miss rate > unified cache miss rate.
Example 1 continued: split cache vs. unified cache, AMAT
AMAT = hit time + miss rate × miss penalty
AMAT = %I × AMAT_I + %D × AMAT_D
AMAT_split = .74(1 + .00382 × 100) + .26(1 + .1136 × 100) = 4.24
AMAT_unified = .74(1 + .0318 × 100) + .26(2 + .0318 × 100) = 4.44
Split cache AMAT < unified cache AMAT: the opposite of the miss-rate result (split > unified), because every data reference to the single-ported unified cache pays an extra hit cycle.
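The example's arithmetic can be reproduced directly; all figures below come from the slide's miss-per-instruction table and timing assumptions.

```python
MISS_PENALTY = 100
# miss rate = (misses/instruction) / (memory accesses/instruction)
i_rate = 0.00382 / 1.0    # 16KB I-cache: 1 instruction fetch per instruction
d_rate = 0.0409 / 0.36    # 16KB D-cache: 0.36 data accesses per instruction
u_rate = 0.0433 / 1.36    # 32KB unified cache: 1.36 accesses per instruction

# Split caches: 1-cycle hit for both instruction and data references.
amat_split = (0.74 * (1 + i_rate * MISS_PENALTY)
              + 0.26 * (1 + d_rate * MISS_PENALTY))
# Unified cache is single ported: data references pay a 2-cycle hit time.
amat_unified = (0.74 * (1 + u_rate * MISS_PENALTY)
                + 0.26 * (2 + u_rate * MISS_PENALTY))

print(round(amat_split, 2), round(amat_unified, 2))   # 4.24 4.44
```

The extra data-hit cycle on the unified cache (0.26 × 1 cycle) outweighs its lower miss rate, which is why AMAT and miss rate rank the two designs oppositely.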
A Closer Look at Misses
Classifying misses: the 3 Cs
Compulsory: the first access to a block cannot find it in the cache, so the block must be brought in. Also called cold-start misses or first-reference misses. (Misses even in an infinite cache.)
Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses occur because blocks are discarded and later retrieved. (Additional misses in a size-X cache.)
Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) occur because a block can be discarded and later retrieved when too many blocks map to its set. Also called collision misses or interference misses. (Additional misses in an N-way associative, size-X cache.)
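The first two Cs can be measured from a block-address trace. This is a sketch under the usual model assumption: compulsory misses are the misses of an infinite cache, and capacity misses are the extra misses of a fully associative LRU cache of the given size.

```python
from collections import OrderedDict

def classify_misses(trace, num_frames):
    """Split misses into (compulsory, capacity) for a block-address trace."""
    seen = set()              # infinite cache: its misses are compulsory
    lru = OrderedDict()       # finite fully associative LRU cache
    compulsory = total = 0
    for block in trace:
        if block not in seen:
            compulsory += 1
            seen.add(block)
        if block in lru:
            lru.move_to_end(block)          # refresh LRU order on a hit
        else:
            total += 1
            lru[block] = True
            if len(lru) > num_frames:
                lru.popitem(last=False)     # evict the least recently used
    return compulsory, total - compulsory

# 4 blocks cycled through a 2-frame cache: LRU evicts each block just
# before it is re-used, so every access misses.
print(classify_misses([0, 1, 2, 3] * 2, num_frames=2))   # (4, 4)
# A cache that holds the whole working set leaves only compulsory misses.
print(classify_misses([0, 1, 2, 3] * 2, num_frames=4))   # (4, 0)
```

Conflict misses would then be whatever additional misses a direct-mapped or set-associative cache of the same size shows beyond this fully associative run.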
[Figure: 3Cs absolute miss rate vs. cache size (1 KB to 128 KB; miss rate 0 to 0.14) for 1-, 2-, 4-, and 8-way associativity, broken down into conflict, capacity, and compulsory components.]
Block Size vs. Cache Measures
Increasing block size generally increases the miss penalty (more data to transfer per miss) and decreases the miss rate (more spatial locality captured), until blocks become so large that they crowd useful data out of the cache.
Since miss penalty × miss rate (plus hit time) gives average memory access time, AMAT is minimized at an intermediate block size.
[Figure: miss penalty, miss rate, and average memory access time each plotted against block size.]
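The trade-off can be sketched numerically. The memory model (fixed latency plus transfer time) and the per-block-size miss rates below are assumed for illustration, not measured data.

```python
LATENCY = 80        # cycles before the first word arrives (assumed)
BANDWIDTH = 4       # bytes transferred per cycle (assumed)
HIT_TIME = 1        # cycles (assumed)

# Illustrative miss rates: falling with block size, then rising again
# as large blocks crowd out useful data.
miss_rate = {16: 0.050, 32: 0.040, 64: 0.035, 128: 0.034, 256: 0.036}

def amat(block_size):
    miss_penalty = LATENCY + block_size / BANDWIDTH   # grows with block size
    return HIT_TIME + miss_rate[block_size] * miss_penalty

for size in sorted(miss_rate):
    print(size, round(amat(size), 2))
# AMAT bottoms out at an intermediate block size under these numbers:
print(min(miss_rate, key=amat))   # 64
```

Past 64 bytes, the growing transfer time overwhelms the shrinking miss rate, reproducing the U-shape of the AMAT-vs-block-size curve in the figure.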