CMPE110 Spring 2005, A. Di Blas
Caches: direct-mapped caches, reads and writes, cache associativity, caches and performance
Textbook: Second Edition 7.1 to 7.3; Third Edition 7.1 to 7.3
Memory hierarchy
On a memory access, the CPU looks for the data level by level (relative speeds shown):
- registers (rel. speed 1): data not in registers? go down a level
- on-chip cache (rel. speed 1-2): is it here? if not, go down
- off-chip cache (rel. speed 2-5): is it here? if not, go down
- main memory (rel. speed 10-20): real address space, part of the virtual address space; is it here? if not, go down
- disk (rel. speed 1,000-100,000): rest of the virtual address space, files, etc.; long-term storage devices
Once the data is found, get it.
Memory location
[Figure: CPU with registers and split L1 instruction/data caches, then L2 cache, main memory, and disk]
Basic concepts
- data locality: temporal locality and spatial locality
- block: amount of information transferred (in bytes or words)
- hit: the block is present
- miss: the block is not present
- hit rate: fraction of times a requested block is found
- hit time: time to fetch a block that is present
- miss rate: fraction of times a requested block is not present (miss rate = 100% - hit rate)
- miss penalty: time (in clock cycles) to fetch a block from the lower level
Cache mappings
Size of cache < size of main memory. [Figure: CPU, cache, main memory]
Mapping strategies: direct-mapped cache, set-associative cache, fully-associative cache.
Direct-mapped cache
Each memory block is mapped to exactly one block in the cache.
[Figure: 8-block cache with block addresses (cache indices) 000-111; memory block addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, 11101 shown mapping to their cache indices]
The cache index
Many different memory blocks map to a single cache block: which block? Use the memory address's lower bits to index the cache.
cache index = (memory block address) % (cache size in blocks)
Example 1: 32-block main memory, 8-block cache (we consider block addresses).
- The memory block address is ___ bits.
- To index the cache we need ___ bits: the lower ___ bits of the memory block address.
- The memory block 01001 maps to the cache location ___
- The memory block 10110 maps to the cache location ___
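The modulo rule above can be sketched in a few lines of code (an illustrative helper, not from the slides; for power-of-2 cache sizes the modulo simply keeps the lower address bits):

```python
def cache_index(block_addr: int, num_blocks: int) -> int:
    # cache index = (memory block address) % (cache size in blocks)
    return block_addr % num_blocks

# Example 1: 32-block memory (5-bit block addresses), 8-block cache (3-bit index)
print(format(cache_index(0b01001, 8), '03b'))  # 001
print(format(cache_index(0b10110, 8), '03b'))  # 110
```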
Example 2: 128-byte main memory, 8-block cache, 4-byte (= 1 word) block size (we consider byte addresses).
- ___-bit memory byte address
- ___-bit cache (block) index
- ___ bits to address the byte within the block
The memory addresses 0100100, 0100101, 0100110, and 0100111 all map to the same cache block.
[Figure: memory byte addresses 0100010 through 0101010 alongside an 8-entry cache (indices 000-111) with 4-byte blocks]
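A hypothetical helper (names and defaults are mine, chosen to match Example 2) shows how a byte address splits into tag, index, and byte offset:

```python
def split(addr: int, offset_bits: int = 2, index_bits: int = 3):
    # 4-byte blocks -> 2 offset bits; 8-block cache -> 3 index bits
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

for a in (0b0100100, 0b0100101, 0b0100110, 0b0100111):
    print(split(a))  # all share tag 01 and index 001; only the offset differs
```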
The Tag field
Many different memory blocks map to a single cache block: how do we know which memory block is in the cache block? To each cache line we add a tag that contains the remaining part (upper bits) of the address.
Example 3: 32-block main memory, 8-block cache. Memory blocks 00001, 01001, 10001, and 11001 all map to the same cache block.
[Figure: memory block addresses 00000 through 01010 alongside an 8-entry cache (indices 000-111) with tag fields]
The Valid bit
The CPU performs many different tasks, and the memory contents change: how do we know if a cache block is "good"? To each cache line we add a valid bit to indicate whether the content of the block corresponds to what the CPU is actually looking for. For instance, after a reset, all valid bits are reset: no block contains useful information.
1-word-block, direct-mapped cache with 1024 entries
[Figure: 32-bit memory byte address split into tag (bits 31-12), index (bits 11-2), and byte offset (bits 1-0); the index selects one of 1024 lines (0 through 1023), each with V, TAG, and DATA fields; the tag comparison produces HIT and the data word goes to the CPU]
Worksheet:
- mem. address [b]: ___
- cache line size [b]: ___
- bits for index: ___
- cache data size [B]: ___
- bits for tag: ___
- total cache size [b]: ___
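The worksheet arithmetic can be sketched as follows (a sketch assuming 32-bit byte addresses and a valid bit as the only management bit, as in the figure):

```python
addr_bits, index_bits, byte_offset_bits, word_bits = 32, 10, 2, 32

tag_bits = addr_bits - index_bits - byte_offset_bits  # 32 - 10 - 2 = 20
line_bits = 1 + tag_bits + word_bits                  # valid + tag + data = 53
total_bits = (1 << index_bits) * line_bits            # 1024 lines * 53 bits
print(tag_bits, line_bits, total_bits)                # 20 53 54272
```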
Cache trace with block address
32-block memory, 8-block cache; the memory address is a block address.

Address (dec) | Address (bin) | Hit/Miss
22  | 10110 | ___
26  | 11010 | ___
22  | 10110 | ___
18  | 10010 | ___
26  | 11010 | ___
18  | 10010 | ___
26  | 11010 | ___
22  | 10110 | ___

Cache state table: INDEX 000 through 111, each line with V, TAG, and DATA fields, filled in as the trace proceeds.
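The trace above can be checked with a minimal simulator (an illustrative sketch, not from the slides; the valid bit is modeled implicitly by starting with an empty cache):

```python
def trace(addresses, num_blocks=8):
    store = {}  # cache index -> tag currently held there
    result = []
    for a in addresses:
        idx, tag = a % num_blocks, a // num_blocks
        result.append('H' if store.get(idx) == tag else 'M')
        store[idx] = tag  # on a miss, the new block replaces the old one
    return result

# Blocks 26 (index 010) and 18 (index 010) conflict and keep evicting each other.
print(trace([22, 26, 22, 18, 26, 18, 26, 22]))
# -> ['M', 'M', 'H', 'M', 'M', 'M', 'M', 'H']
```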
Cache trace with byte address
256-byte memory, 32-byte cache, 4-byte cache block, memory byte addressing.

Address (dec) | Address (bin) | Hit/Miss
89  | 01011001 | ___
232 | 11101000 | ___
90  | 01011010 | ___
8   | 00001000 | ___
91  | 01011011 | ___
92  | 01011100 | ___
232 | 11101000 | ___
7   | 00000111 | ___

Cache state table: INDEX 000 through 111, each line with V, TAG, and DATA fields, filled in as the trace proceeds.
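The same idea works for byte addresses: first strip the 2-bit byte offset (4-byte blocks) to get the block address, then index as before (again an illustrative sketch, not from the slides):

```python
def trace_bytes(addresses, block_bytes=4, num_blocks=8):
    store, result = {}, []
    for a in addresses:
        blk = a // block_bytes             # drop the byte-within-block offset
        idx, tag = blk % num_blocks, blk // num_blocks
        result.append('H' if store.get(idx) == tag else 'M')
        store[idx] = tag
    return result

# 89, 90, 91 share a block (hits after the first); 232 and 8 conflict on index 010.
print(trace_bytes([89, 232, 90, 8, 91, 92, 232, 7]))
# -> ['M', 'M', 'H', 'M', 'H', 'M', 'M', 'M']
```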
Cache reads and writes
In our CPU, the Instruction Memory and Data Memory are actually cache memories. On a memory access, hits are straightforward to handle. Misses are more complex: read misses and write misses.
Read misses
For instructions:
- stall the CPU
- send the original PC to memory (current PC - 4) and wait
- write the cache entry (including tag and valid bit)
- restart the instruction
For data:
- stall the CPU
- send the address to memory and wait
- write the cache entry (including tag and valid bit)
Write misses
What is a write miss? In a 1-word-block write-through cache, writes always hit: we do not need to know what was in the memory location, since the CPU is overwriting it anyway.
Problem: inconsistency between cache and memory.
Solutions: write-through, write-back.
Write-through
Every time, write both the cache and the memory.
[Figure: CPU writes to the cache and, through a write buffer, to memory]
Pros: simple. Cons: slow (mitigated by the write buffer).
Write-back
Write only the cache. Write the entire block back into the memory only when the block needs to be replaced (a dirty bit marks modified blocks).
[Figure: trace example. W 22 (10110, cache index 110): hit in the cache, dirty bit set. W 14 (01110): maps to the same index 110, so the dirty block 22 must be flushed to memory before being replaced. R 22 (10110): the flushed block is read back from memory. Each cache line holds V, T (tag), D (dirty), and DATA fields.]
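The dirty-bit mechanics can be sketched for a 1-word-block direct-mapped cache (a hedged sketch with invented names, not the slides' hardware; writes go only to the cache, and a block is flushed only when evicted dirty):

```python
class WriteBackCache:
    def __init__(self, num_blocks=8):
        self.lines = {}        # index -> [tag, dirty, data]
        self.n = num_blocks
        self.flushes = []      # block addresses written back to memory

    def write(self, block_addr, data):
        idx, tag = block_addr % self.n, block_addr // self.n
        line = self.lines.get(idx)
        if line and line[0] != tag and line[1]:           # evicting a dirty block
            self.flushes.append(line[0] * self.n + idx)   # flush it to memory first
        self.lines[idx] = [tag, True, data]               # write cache, set dirty

c = WriteBackCache()
c.write(22, 'x')  # miss: no flush needed (line was empty)
c.write(22, 'y')  # hit: only the cache is updated
c.write(14, 'z')  # 14 and 22 share index 110: dirty block 22 is flushed
print(c.flushes)  # -> [22]
```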
Multi-word caches
Using cache blocks larger than one word takes advantage of spatial locality.
[Figure: 4-GB memory, 64-KB direct-mapped cache with 4-word (16-byte) data blocks; the memory byte address splits into tag, index, and word offset; the index selects a line with V, tag, and a 4-word cache data block; the word offset selects one data word, and the tag comparison produces hit]
Exercise
What is the total size in bits of the cache in the previous slide?
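One way to work the exercise (a sketch assuming 32-bit byte addresses and a valid bit as the only management state, matching the figure on the previous slide):

```python
data_bytes, block_bytes, addr_bits = 64 * 1024, 16, 32

num_blocks = data_bytes // block_bytes           # 4096 lines
index_bits = num_blocks.bit_length() - 1         # 12
offset_bits = block_bytes.bit_length() - 1       # 4 (word offset + byte offset)
tag_bits = addr_bits - index_bits - offset_bits  # 16
# valid + tag + 128 data bits per line, times 4096 lines
total_bits = num_blocks * (1 + tag_bits + block_bytes * 8)
print(total_bits)                                # 593920
```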
Hits and misses in a multi-word cache
Read: just like the read misses on a single-word cache, except that the entire block is fetched.
Write: we cannot just write the word, tag, and valid bit without verifying whether the block is the actual block we want to write to, since more than one memory block maps to the same cache block. We need to compare the tag for writes too:
- the tags match: we can write the word
- the tags do not match: we need to read the block from memory and then write the word
Cache block size and miss rate
- up to a certain point, cache miss rate decreases with increasing block size
- after a certain point, cache miss rate increases with increasing block size:
  - spatial locality decreases with block size
  - the miss penalty increases with block size
[Figure: miss rate vs. block size, (c) 1998 Morgan Kaufmann Publishers, Inc.]
Miss penalty (= additional clock cycles)
Has three components:
a) sending the address to memory
b) latency to initiate the memory transfer
c) time for transferring each word
Example: a) = 1 clock cycle, b) = 15 clock cycles, c) = 1 clock cycle.
With a 4-word block cache and a 1-word memory bus, the miss penalty on a standard DRAM is: ___
On an SDRAM or with an interleaved memory organization it is: ___
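The two cases differ only in how often the initiation latency is paid; a sketch of the arithmetic with the example numbers (the breakdown into these two formulas is my reading of the slide):

```python
a, b, c, words = 1, 15, 1, 4  # address send, initiation latency, per-word transfer

# Standard DRAM, 1-word bus: every word pays the full initiation latency.
dram_penalty = a + words * (b + c)   # 1 + 4 * 16 = 65 cycles

# SDRAM or interleaved memory: the latency is paid once, words then stream out.
sdram_penalty = a + b + words * c    # 1 + 15 + 4 = 20 cycles

print(dram_penalty, sdram_penalty)
```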
Memory bandwidth
If a single transfer to/from memory can transfer multiple words at a time, the miss penalty decreases.
[Figure: three CPU-cache-memory organizations connected by buses: a narrow 1-word bus, a wide bus with MUX/DEMUX between cache and memory, and a wide cache and memory]
Miss penalty for a 2-word block cache with a 2-word memory bus: ___
Miss penalty for a 4-word block cache with a 4-word memory bus: ___
Cache associativity
What if the CPU keeps accessing two (or more) variables that map to the same location in a direct-mapped cache? More sophisticated strategy: n-way set-associative caches.
- direct-mapped ("1-way set associative")
- n-way set associative
- fully associative
Two-way set associative cache
[Figure: 4 sets (cache SET index 00-11), two lines per set; memory block addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, 11101 all map to set 01 (their lower two bits)]
Four-way set associative cache
[Figure: 2 sets (cache SET index 0 and 1), four lines per set; memory block addresses 00001, 00011, 00101, 00111, 01001, 01011, 01101, 01111, 10001, 10011, 10101, 10111, 11001, 11011, 11101, 11111 all map to set 1 (their lowest bit)]
Eight-way set associative cache
For an 8-line cache, this is a fully-associative cache: any block can go anywhere.
Direct-mapped cache = 1-way set associative cache
[Figure: 8 cache lines (line index 000-111); memory block addresses 00001, 01001, 10001, 11001 all map to line 001]
Pros and cons of increasing cache associativity
Advantages: reduces the miss rate.
Disadvantages: requires more hardware; requires a replacement policy.
Block replacement policy: Least Recently Used (LRU) or random, implemented in hardware.
Exercise 1
For an 8-line, write-through, 2-way set-associative cache with LRU replacement and 1-word data blocks, trace the following sequence of addresses:

block address (dec) | binary | H/M
23  | 00010111 | ___
18  | 00010010 | ___
196 | 11000100 | ___
63  | 00111111 | ___
79  | 01001111 | ___
18  | 00010010 | ___
199 | 11000111 | ___
165 | 10100101 | ___
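A small simulator for checking the trace (an illustrative sketch, not from the slides): 8 lines, 2 ways per set gives 4 sets, so the set index is the block address mod 4.

```python
def trace_2way(addresses, num_sets=4, ways=2):
    sets = {s: [] for s in range(num_sets)}  # each set holds tags, LRU first
    result = []
    for a in addresses:
        s, tag = a % num_sets, a // num_sets
        if tag in sets[s]:
            result.append('H')
            sets[s].remove(tag)          # refresh: re-append as most recent
        else:
            result.append('M')
            if len(sets[s]) == ways:
                sets[s].pop(0)           # evict the least recently used tag
        sets[s].append(tag)              # most recently used at the end
    return result

print(trace_2way([23, 18, 196, 63, 79, 18, 199, 165]))
# -> ['M', 'M', 'M', 'M', 'M', 'H', 'M', 'M']
```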
Exercise 2
A computer system has 32-bit addresses and a 64-KB direct-mapped, write-back cache with 8-byte data block lines.
a) how many lines are in the cache?
b) how many bits total (including cache management bits) are in each line, minimum?
c) what is the total cache size in bits?
d) diagram a cache lookup
Solution:
a) # of lines: ___
b) # of bits per line: ___
c) total cache size: ___
d) cache lookup: [address diagram, bit positions 31 ... 24 ... 16 ... 8 ... 0]
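One hedged way to work the numbers (assuming the minimum management bits for a write-back cache are one valid bit and one dirty bit per line):

```python
data_bytes, block_bytes, addr_bits = 64 * 1024, 8, 32

lines = data_bytes // block_bytes                # a) 8192 lines
offset_bits = 3                                  # 8-byte blocks
index_bits = lines.bit_length() - 1              # 13
tag_bits = addr_bits - index_bits - offset_bits  # 16
bits_per_line = block_bytes * 8 + tag_bits + 2   # b) 64 data + 16 tag + V + D = 82
total_bits = lines * bits_per_line               # c) 671744
print(lines, bits_per_line, total_bits)
```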
Exercise 3
Suppose the 64-KB cache in Exercise 2 was instead 2-way set associative with 8-byte lines.
a) how many sets are in the cache?
b) how many lines are in the cache?
c) how many bits total (including cache management bits) are in each line, minimum?
d) what is the total cache size in bits?
e) diagram a cache lookup
Solution:
a) # of sets: ___
b) # of lines: ___
c) # of bits per line: ___
d) total cache size: ___
e) cache lookup: [address diagram, bit positions 31 ... 24 ... 16 ... 8 ... 0]
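The same bookkeeping for the 2-way version (hedged: LRU state, here assumed to be 1 bit per 2-way set, is counted separately since conventions for where it lives vary):

```python
data_bytes, block_bytes, ways, addr_bits = 64 * 1024, 8, 2, 32

lines = data_bytes // block_bytes                # b) 8192 lines
sets = lines // ways                             # a) 4096 sets
offset_bits = 3                                  # 8-byte blocks
index_bits = sets.bit_length() - 1               # 12
tag_bits = addr_bits - index_bits - offset_bits  # 17
bits_per_line = block_bytes * 8 + tag_bits + 2   # c) 64 + 17 + V + D = 83
total_bits = lines * bits_per_line + sets        # d) plus 1 LRU bit per set
print(sets, lines, bits_per_line, total_bits)
```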
Caches and performance
Exercise 4: a computer has a CPI of 1.0 when there are no cache misses, and a 100 MHz clock. Each instruction has on average 0.4 data memory references. For each cache miss the instruction takes an additional 9 clock cycles to complete.
- what are the CPI_100% and the MIPS_100% rating with a cache and an (unrealistic) 100% hit rate?
- what are the CPI_NOCACHE and the MIPS_NOCACHE rating without a cache?
- what are the CPI_90/85 and the MIPS_90/85 rating with a cache and a 90% hit rate on instructions and an 85% hit rate on data?
Solution:
CPI_100% = ___      MIPS_100% = ___
CPI_NOCACHE = ___   MIPS_NOCACHE = ___
CPI_90/85 = ___     MIPS_90/85 = ___
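A sketch of the Exercise 4 arithmetic (my reading of the problem: the no-cache case charges the 9-cycle penalty to every one of the 1.4 memory accesses per instruction, and MIPS = clock in MHz / CPI):

```python
base_cpi, clock_mhz, data_refs, penalty = 1.0, 100, 0.4, 9

cpi_100 = base_cpi                                  # every access hits
cpi_nocache = base_cpi + (1 + data_refs) * penalty  # 1 + 1.4 * 9 = 13.6
cpi_90_85 = (base_cpi
             + 0.10 * penalty                       # 10% instruction misses
             + data_refs * 0.15 * penalty)          # 15% data misses -> 2.44

for cpi in (cpi_100, cpi_nocache, cpi_90_85):
    print(round(cpi, 2), 'CPI ->', round(clock_mhz / cpi, 2), 'MIPS')
```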
Homework: recommended exercises
Second Edition:
- Ex 7.1 to 7.9, 7.13 to 7.17, 7.20, 7.21
- Ex 7.27. Hint: no need to know the CPI. Remember: the miss penalty is in number of clock cycles. Note: the three machines are identical except for the cache.
- Ex 7.28. Hints: i) no need to know the CPI; ii) since the three machines only differ in the cache system (and in the clock cycle time, of course) we only need to consider the total miss cycles per instruction.
- Ex 7.29 (especially interesting for CS). Hint: find in what cache blocks array[0], array[131], and array[132] are stored.
Third Edition:
- Ex 7.2-7.4, 7.6-7.8, 7.9, 7.12, 7.16, 7.17-7.19, 7.25-7.27
- Ex 7.32, 7.33, 7.35 are equivalent to old 7.27, 7.28, and 7.29 respectively.