COSC4201 Chapter 4: Cache. Prof. Mokhtar Aboelaze, York University. Based on notes by Prof. L. Bhuyan (UCR) and Prof. M. Shaaban (RIT).

Memory Hierarchy

The gap between CPU performance and main memory speed has been widening; with higher-performance CPUs, memory-access instructions become the performance bottleneck. The memory hierarchy is organized into several levels, with the smaller, more expensive, and faster levels closer to the CPU: registers, then the primary cache level (L1), then additional secondary cache levels (L2, L3), then main memory, then mass storage (virtual memory). Each level of the hierarchy is a subset of the level below: data found in a level is also found in the level below, but at lower speed. Each level maps addresses from a larger physical memory onto a smaller, faster one. This organization works because of the principle of locality, both temporal and spatial: programs tend to reuse data and instructions that they have used recently, or those stored in their vicinity, which gives rise to the working set of a program.

Processor-Memory Gap

[Chart: processor vs. DRAM performance, 1980-2000. The processor-memory performance gap grows roughly 50% per year: microprocessor performance improves about 60%/yr while DRAM performance improves about 7%/yr.]

Cost of Cache

Processor         % Area (-cost)   % Transistors (-power)
Alpha 21164       37%              77%
StrongArm SA110   61%              94%
Pentium Pro       64%              88%    (2 dies per package: Proc/I$/D$ + L2$)

Caches have no inherent value; they only try to close the performance gap. Smaller is faster.

Principle of Locality

Programs usually spend 90% of their time in 10% of the code (the 90/10 rule). There are two types of locality:
Temporal locality: if an item is referenced, it will tend to be referenced again soon.
Spatial locality: if an item is referenced, items whose addresses are close to it will tend to be referenced soon.

Levels of Memory Hierarchy

[Diagram: the levels of the memory hierarchy, from registers down to mass storage.]
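A minimal C sketch (not from the slides) of the two kinds of locality defined above: the running total sum is touched on every iteration (temporal locality), while the array a is walked through consecutive addresses (spatial locality).

    #include <stdio.h>

    #define N 1024

    int main(void) {
        static int a[N];            /* consecutive elements share cache blocks     */
        int sum = 0;                /* reused every iteration: temporal locality   */
        for (int i = 0; i < N; i++)
            sum += a[i];            /* sequential addresses of 'a': spatial locality */
        printf("sum = %d\n", sum);
        return 0;
    }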

Definitions

Block: the smallest piece of information transferred between two levels (line, page, ...).
Hit, hit rate.
Miss, miss rate.
Miss penalty.

4 Questions

Q1: Where can a block be placed in the upper level? (Block placement) Fully associative, set associative, direct mapped.
Q2: How is a block found if it is in the upper level? (Block identification) Tag/block.
Q3: Which block should be replaced on a miss? (Block replacement) Random, LRU.
Q4: What happens on a write? (Write strategy) Write back or write through (with a write buffer).

Cache Placement

Placement strategies, i.e., the mapping of a main memory block onto cache block frames, divide caches into three organizations:
1. Direct mapped cache: a block can be placed in one location only, given by (Block address) MOD (Number of blocks in cache).
2. Fully associative cache: a block can be placed anywhere in the cache.
3. Set associative cache: a block can be placed in a restricted set of places (cache block frames). A set is a group of block frames in the cache. A block is first mapped onto a set and can then be placed anywhere within that set. The set is chosen by (Block address) MOD (Number of sets in cache). If there are n blocks in a set, the placement is called n-way set associative. (A small mapping sketch follows this slide.)

Direct Mapped Cache

[Diagram: a 1024-block direct-mapped cache, each block one word; the 32-bit address splits into a 20-bit tag, a 10-bit index, and a 2-bit byte offset; the cache can map up to 2^32 bytes of memory.]
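The mapping sketch referenced above: both MOD formulas applied to one hypothetical block address, using made-up sizes (1024 block frames, or 256 sets of 4 ways) purely for illustration.

    #include <stdio.h>

    int main(void) {
        unsigned block_addr = 123456;   /* hypothetical block address                 */
        unsigned num_blocks = 1024;     /* direct mapped: 1024 block frames (assumed) */
        unsigned num_sets   = 256;      /* 4-way set associative: 1024/4 sets         */

        unsigned frame = block_addr % num_blocks;  /* (Block address) MOD (Number of blocks in cache) */
        unsigned set   = block_addr % num_sets;    /* (Block address) MOD (Number of sets in cache)   */

        printf("direct mapped: frame %u\n", frame);
        printf("4-way set associative: set %u (any of its 4 ways)\n", set);
        return 0;
    }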

Direct Mapped Cache

[Diagram: a 4K-block direct-mapped cache, each block four words (128 bits); the address (bit positions shown) splits into a 16-bit tag field, a 12-bit index field, a 2-bit word select (block offset), and a 2-bit byte offset; a 4-to-1 mux selects the requested 32-bit word.]

Alpha AXP 21064 Data Cache

[Diagram of the Alpha AXP 21064 data cache.]

Set Associativity

[Diagram: an eight-block cache organized four ways: one-way set associative (direct mapped, eight sets of one block), two-way set associative (four sets of two blocks), four-way set associative (two sets of four blocks), and eight-way set associative (fully associative, one set of eight blocks); each block frame holds a tag and data.]

Set Associative Cache

Each block frame in the cache has an address tag. The tags of every cache block that might contain the required data are checked in parallel. A valid bit is added to the tag to indicate whether the entry contains a valid address. The address from the CPU to the cache is divided into:
A block address, further divided into:
  An index field to choose a block set in the cache (there is no index field when fully associative).
  A tag field to search and match addresses in the selected set.
A block offset to select the data from the block.

Set Associative Cache

Physical address generated by the CPU: | Tag | Index | Block offset |
Block offset size = log2(block size)
Index size = log2(total number of blocks / associativity)
Tag size = address size - index size - offset size
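A quick sketch of the three field-size formulas above, with an assumed 32-bit address, 32-byte blocks, 256 total blocks, and 4-way associativity (all of these sizes are illustrative, not from the slides).

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        int addr_bits  = 32;   /* physical address size (assumed) */
        int block_size = 32;   /* bytes per block (assumed)       */
        int num_blocks = 256;  /* total block frames (assumed)    */
        int assoc      = 4;    /* 4-way set associative (assumed) */

        int offset_bits = (int)log2(block_size);           /* log2(block size)                   */
        int index_bits  = (int)log2(num_blocks / assoc);   /* log2(total blocks / associativity) */
        int tag_bits    = addr_bits - index_bits - offset_bits;

        printf("offset = %d bits, index = %d bits, tag = %d bits\n",
               offset_bits, index_bits, tag_bits);          /* 5, 6, 21 */
        return 0;
    }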

4-way Set Associative Cache

[Diagram: a 4-way set associative cache with 256 sets; the address splits into a 22-bit tag field, an 8-bit index field, and the byte offset; the four tags of the indexed set are compared in parallel and a 4-to-1 multiplexor selects the data of the hitting way.]

Example

Which has a lower miss rate: a 16KB instruction cache plus a 16KB data cache, or a combined 32KB cache? (Miss rates: 0.64% instruction, 6.47% data, 1.99% combined.) Assume a hit takes 1 cycle, a miss costs 50 cycles, and 75% of memory references are instruction fetches.
Miss rate of the split cache = 0.75*0.64% + 0.25*6.47% = 2.1%, slightly worse than the 1.99% of the combined cache.
But what about average memory access time? (Re-derived in the short sketch after this slide.)
Split cache: 0.75*(1 + 0.64%*50) + 0.25*(1 + 6.47%*50) = 2.05 cycles.
Combined cache (extra cycle for load/store, since data accesses contend with instruction fetches for the single port): 0.75*(1 + 1.99%*50) + 0.25*(1 + 1 + 1.99%*50) = 2.24 cycles.

Example

Miss penalty = 50 cycles, 2 cycles per instruction for execution, 1.33 memory references per instruction, and a miss rate of 2%:
CPU time = IC * (2.0 + 1.33*2%*50) * Tclock = IC * 3.33 * Tclock
Without a cache it would be IC * (2 + 1.33*50) * Tclock = IC * 68.5 * Tclock
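The sketch referenced above simply plugs the first example's own numbers into the average-memory-access-time formula; nothing new is assumed.

    #include <stdio.h>

    int main(void) {
        double hit = 1.0, penalty = 50.0;
        double f_inst = 0.75, f_data = 0.25;     /* fraction of references            */
        double mr_i = 0.0064, mr_d = 0.0647;     /* split 16KB I-cache / 16KB D-cache */
        double mr_u = 0.0199;                    /* combined 32KB cache               */

        double amat_split    = f_inst * (hit + mr_i * penalty)
                             + f_data * (hit + mr_d * penalty);
        double amat_combined = f_inst * (hit + mr_u * penalty)
                             + f_data * (hit + 1.0 + mr_u * penalty);  /* +1 cycle for the port conflict */

        printf("split: %.3f cycles, combined: %.3f cycles\n",
               amat_split, amat_combined);       /* 2.050 and 2.245 (slide rounds to 2.24) */
        return 0;
    }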

Example

Compare two different organizations. CPI with a perfect cache = 2.0, 1 ns clock cycle, 1.5 memory references per instruction.
Direct mapped: miss rate = 1.4%.
2-way set associative: miss rate = 1.0%, but we have to stretch the clock cycle by 25% (the MUX is on the critical path).
In both cases the miss penalty is 75 ns. (These numbers are re-derived in the sketch after this slide.)
Direct mapped memory access time = 1 + 0.014*75 = 2.05 ns.
2-way memory access time = 1*1.25 + 0.010*75 = 2.00 ns.
CPU time = IC * (CPI_exec + CPI_misses)
         = IC * (CPI_exec + Misses/Inst * Miss penalty) * Tc
         = IC * [ (CPI_exec * Tc) + (Mem/Inst * Miss rate * Miss penalty) ]
Direct mapped: IC * (2*1 + 1.5*0.014*75) = 3.58 IC
2-way:         IC * (2*1.25 + 1.5*0.010*75) = 3.63 IC

Improving Cache Performance

1. Reducing the miss rate
2. Reducing the miss penalty
3. Reducing the time to hit
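The sketch below re-derives the direct-mapped vs. 2-way comparison from the example above using only the slide's inputs (2.0 CPI, 1 ns cycle, 1.5 references per instruction, 75 ns miss penalty).

    #include <stdio.h>

    int main(void) {
        double cpi_exec = 2.0, tc = 1.0;   /* base CPI and clock cycle (ns) */
        double mem_per_inst = 1.5;
        double miss_penalty = 75.0;        /* ns */

        /* CPU time per instruction = CPI_exec*Tc + Mem/Inst * miss rate * miss penalty */
        double direct = cpi_exec * tc        + mem_per_inst * 0.014 * miss_penalty;
        double twoway = cpi_exec * tc * 1.25 + mem_per_inst * 0.010 * miss_penalty;

        printf("direct mapped: %.3f ns per instruction\n", direct);  /* 3.575, slide rounds to 3.58 */
        printf("2-way:         %.3f ns per instruction\n", twoway);  /* 3.625, slide rounds to 3.63 */
        return 0;
    }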

Reducing Cache Misses

Classifying Misses: the 3 Cs
Compulsory: the first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (These occur even in an infinite cache.)
Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses occur because blocks are discarded and later retrieved. (Misses in a fully associative cache of size X.)
Conflict: if the block placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) occur because a block can be discarded and later retrieved when too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way set associative cache of size X.)

Classifying Misses

[Chart: miss rate vs. cache size (1KB to 128KB), broken into compulsory, capacity, and conflict components for 1-way, 2-way, 4-way, and 8-way set associative caches; the conflict component shrinks as associativity increases.]
Misses in a 1-way set associative cache of size X ≈ misses in a 2-way set associative cache of size X/2.

Reducing Cache Misses: Larger Block Size

A larger block size reduces misses up to a point, then starts to increase them: a larger block takes advantage of spatial locality, but it decreases the number of distinct blocks in the cache and increases the miss penalty, since it takes longer to load a block from memory into the cache.

[Chart: miss rate vs. block size (16 to 256 bytes) for cache sizes of 1K, 4K, 16K, 64K, and 256K; miss rate first falls and then rises as the block grows.]

Reducing Cache Misses: Larger Block Size

From the last graph, suppose memory takes 40 cycles of overhead and then delivers 16 bytes every 2 cycles. Compare 64-byte and 128-byte blocks for a 64K cache (miss rates 1.06% and 1.02%):
64-byte block: access time = 1 + 1.06%*(40 + 2*4) = 1 + 1.06%*48 = 1.5088
128-byte block: access time = 1 + 1.02%*(40 + 2*8) = 1 + 1.02%*56 = 1.5712
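A small sketch of the calculation above, with the miss penalty modeled as 40 cycles of overhead plus 2 cycles per 16 bytes transferred; the miss rates are the slide's figures for a 64K cache.

    #include <stdio.h>

    /* miss penalty: 40-cycle overhead + 2 cycles per 16 bytes transferred */
    static double miss_penalty(int block_bytes) {
        return 40.0 + 2.0 * (block_bytes / 16);
    }

    int main(void) {
        double amat_64  = 1.0 + 0.0106 * miss_penalty(64);    /* 64K cache, 64-byte blocks  */
        double amat_128 = 1.0 + 0.0102 * miss_penalty(128);   /* 64K cache, 128-byte blocks */

        printf("64-byte blocks:  %.4f cycles\n", amat_64);    /* 1.5088 */
        printf("128-byte blocks: %.4f cycles\n", amat_128);   /* 1.5712 */
        return 0;
    }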

Reducing Cache Misses: Higher Associativity

In practice, an 8-way set associative cache is about as good as a fully associative one.
The 2:1 cache rule of thumb states that a direct-mapped cache of size N has about the same miss rate as a 2-way set associative cache of size N/2.
One problem with associative caches is that we need to lengthen the clock cycle (at a minimum, we need a MUX to choose a block within the set). In practice, going from direct mapped to 2-way set associative costs roughly a 10% increase in clock time for TTL or ECL, and about 2% for custom CMOS.

Reducing Cache Misses: Victim Caches

Add a small fully associative cache between the cache and memory. This victim cache contains only blocks that were discarded from the original cache. The victim cache is checked on a miss; if the data is found there, it is swapped with the data in the cache. A small victim cache of 1-5 blocks is sufficient (a 4-block victim cache removed 20% to 95% of conflict misses).

Reducing Cache Misses: Pseudo-Associative Cache

This cache behaves like a direct-mapped cache, but on a miss, before going to main memory, it checks another cache entry. There are therefore two hit times, one fast and one slow.
Example: compare direct-mapped, 2-way set associative, and pseudo-associative caches (2 extra cycles for a pseudo hit):
Direct mapped: 1 + 9.8%*50 = 5.9
2-way: 1.1 + 7.6%*50 = 4.9 (10% increase in Tc)
Pseudo-associative: 1 + (9.8% - 7.6%)*2 + 7.6%*50 = 4.844

Reducing Cache Misses: Prefetching

Prefetching can be done by the compiler (inserting prefetch instructions in the code) or by the hardware. Prefetching can go directly into the cache or into a buffer that can be accessed much faster than main memory.
Access time = hit time + miss rate * prefetch hit rate * 1 + miss rate * (1 - prefetch hit rate) * miss penalty
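A tiny sketch of the prefetch access-time formula above. The miss rate, prefetch hit rate, and miss penalty are made-up numbers, chosen only to show how a prefetch buffer turns part of the miss penalty into a 1-cycle buffer hit.

    #include <stdio.h>

    int main(void) {
        double hit_time     = 1.0;
        double miss_rate    = 0.05;   /* assumed */
        double prefetch_hit = 0.60;   /* fraction of misses found in the prefetch buffer (assumed) */
        double penalty      = 50.0;   /* cycles (assumed) */

        /* access time = hit time + miss rate*prefetch hit rate*1
                                  + miss rate*(1 - prefetch hit rate)*miss penalty */
        double t = hit_time
                 + miss_rate * prefetch_hit * 1.0
                 + miss_rate * (1.0 - prefetch_hit) * penalty;

        printf("average access time = %.2f cycles\n", t);   /* 2.03, vs 1 + 0.05*50 = 3.5 without prefetching */
        return 0;
    }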

Reducing Cache Misses: Compiler Optimization

By rearranging data we can reduce cache misses.
Merging arrays: instead of

    int A[N];
    int B[N];

use

    struct merge {
        int a;
        int b;
    };
    struct merge merged_array[N];

This works if we reference the same indices of both arrays at the same time.
Loop interchange: reference arrays by row instead of by column if the array is stored in row-major order (a sketch follows this slide).

Reducing Cache Misses: Compiler Optimization (Loop Fusion)

Before:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[i][j] = 1/b[i][j] * c[i][j];
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            d[i][j] = a[i][j] * c[i][j];

After:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            a[i][j] = 1/b[i][j] * c[i][j];
            d[i][j] = a[i][j] * c[i][j];
        }
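The loop interchange mentioned above, as a minimal before/after fragment in the same style as the fusion example; x is assumed to be a row-major two-dimensional array and the loop bounds are illustrative.

    /* Before: the inner loop strides down a column, touching a new row
       (and likely a new cache block) on every iteration */
    for (j = 0; j < 100; j++)
        for (i = 0; i < 5000; i++)
            x[i][j] = 2 * x[i][j];

    /* After: the inner loop walks consecutive elements of a row, so each
       cache block brought in is fully used (spatial locality) */
    for (i = 0; i < 5000; i++)
        for (j = 0; j < 100; j++)
            x[i][j] = 2 * x[i][j];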

Reducing Cache Misses: Compiler Optimization (Blocking)

Blocking reduces misses by exploiting temporal locality. Suppose we are multiplying two matrices, a row of the first against the columns of the second. We load the first column, then the second, and so on until the last; by then the first column has been evicted from the cache, so the next pass over it misses again. By blocking, we work on sub-matrices that fit in the cache, so we fully utilize the data we brought into the cache before we replace it (a blocked matrix-multiply sketch follows this slide).

Reducing Cache Miss Penalty: Giving Priority to Reads over Writes

With a write-through cache, don't wait for the write to complete: send the data to a write buffer and continue reading. This might cause problems (RAW hazards) if we want to read data that is still in the buffer and not yet written to memory. One way out is to wait until the buffer is empty before reading (wastes time); instead, we can check the write buffer on our way to memory (if we find the data there, we don't go to memory). The same situation can arise with a write-back cache.
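The blocked matrix-multiply sketch referenced above. The matrix size N and tile size B are assumptions for illustration; B is normally chosen so that roughly three B x B tiles fit in the cache at once.

    #include <stdio.h>

    #define N 256   /* matrix dimension (assumed) */
    #define B 32    /* tile size (assumed); N must be a multiple of B in this sketch */

    static double x[N][N], y[N][N], z[N][N];

    /* blocked matrix multiply: x += y * z, one B x B tile at a time */
    static void blocked_mm(void) {
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = 0; i < N; i++)
                    for (int j = jj; j < jj + B; j++) {
                        double r = 0.0;
                        for (int k = kk; k < kk + B; k++)
                            r += y[i][k] * z[k][j];   /* reuse the cached y/z tiles before eviction */
                        x[i][j] += r;
                    }
    }

    int main(void) {
        blocked_mm();
        printf("x[0][0] = %f\n", x[0][0]);
        return 0;
    }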

Reducing Cache Miss Penalty: Sub-block Placement

To decrease the area occupied by the tags, we want bigger blocks, but bigger blocks increase the miss penalty. One solution is to add a valid bit for each sub-block and to transfer data between memory and the cache at the sub-block level. Even when the tag matches, some parts of the block may be valid while other parts are not (see the struct sketch after this slide).

Reducing Cache Miss Penalty: Early Restart and Critical Word First

The CPU resumes as soon as the required word is in the cache. Even better, we can request the required word first. This reduces the miss penalty since we don't have to wait for the entire block to be written into the cache.
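A possible layout for a cache line with sub-block valid bits, as described above; the block and sub-block sizes are assumptions for illustration. A hit now requires both a tag match and the valid bit of the addressed sub-block.

    #include <stdbool.h>

    #define SUBBLOCKS      4    /* sub-blocks per block (assumed) */
    #define SUBBLOCK_BYTES 16   /* bytes per sub-block (assumed)  */

    struct cache_line {
        unsigned tag;                                   /* one tag for the whole 64-byte block */
        bool valid[SUBBLOCKS];                          /* one valid bit per sub-block         */
        unsigned char data[SUBBLOCKS][SUBBLOCK_BYTES];  /* transfers happen per sub-block      */
    };

    /* hit = tag match AND the addressed sub-block is valid */
    bool is_hit(const struct cache_line *line, unsigned tag, unsigned subblock) {
        return line->tag == tag && line->valid[subblock];
    }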

Reducing Cache Miss Penalty: Non-blocking Caches

The cache continues to supply data during the miss penalty (hit under miss). It can even allow hits under multiple outstanding misses.

Reducing Cache Miss Penalty: Second-Level Cache

In this case:
Average memory access time = Hit time L1 + Miss rate L1 * Miss penalty L1
Miss penalty L1 = Hit time L2 + Miss rate L2 * Miss penalty L2
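A short sketch of the two-level formula above; the L1/L2 hit times, miss rates, and L2 miss penalty are illustrative assumptions, not figures from the slides.

    #include <stdio.h>

    int main(void) {
        double hit_l1 = 1.0,  miss_rate_l1 = 0.04;   /* assumed                         */
        double hit_l2 = 10.0, miss_rate_l2 = 0.20;   /* L2 local miss rate (assumed)    */
        double penalty_l2 = 100.0;                   /* cycles to main memory (assumed) */

        double penalty_l1 = hit_l2 + miss_rate_l2 * penalty_l2;   /* = 30 cycles  */
        double amat       = hit_l1 + miss_rate_l1 * penalty_l1;   /* = 2.2 cycles */

        printf("L1 miss penalty = %.1f cycles, AMAT = %.1f cycles\n", penalty_l1, amat);
        return 0;
    }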

Reducing Hit Time

Small and simple caches.
Avoiding address translation: virtual caches (indexed with the virtual address). Every time the process is switched, the same virtual address refers to a different physical address, requiring the cache to be flushed. If a process identifier (PID) is attached to each line, only that process's lines need to be flushed.
Pipelined writes: we have to check the tag before writing. One technique is to pipeline the writes: check the tag in one cycle, then in the next cycle perform the write while checking the next access's tag.