CMPE110 Computer Architecture, Winter 2009, Andrea Di Blas

Lecture 13: Caches

Topics: cache, direct-mapped cache, reads and writes, cache associativity, cache and performance. Textbook readings: Third Edition, 7.1 to 7.3; Fourth Edition, 5.1, 5.2, 5.3.

Caches: Memory hierarchy

A memory access walks down the hierarchy until the data is found (relative speeds in parentheses):
- registers (1): data not in registers? look below
- on-chip cache (1-2): is it here? if not, go down
- off-chip cache (2-5): is it here? if not, go down
- main memory, the real address space, part of the virtual address space (10-20): is it here? if not, go down
- disk, the rest of the virtual address space, files, etc. (1000-100,000): long-term storage devices; get it from here

Caches: Memory location

[Figure: where each level physically sits: registers, L1 instruction cache, and L1 data cache inside the CPU; then the L2 cache, main memory, and disk.]

Caches: Basic concepts

- data locality: temporal locality and spatial locality
- block: amount of information transferred (in bytes or words)
- hit: the block is present
- miss: the block is not present
- hit rate: fraction of times a requested block is found
- hit time: time to fetch a block that is present
- miss rate: fraction of times a requested block is not present (miss rate = 100% - hit rate)
- miss penalty: time (in clock cycles) to fetch a block from the lower level

Caches: Cache mappings

Size of cache < size of main memory, so memory blocks must be mapped onto cache blocks:
- direct-mapped cache
- set-associative cache
- fully-associative cache

Direct-mapped caches

Each memory block is mapped to exactly one block in the cache.

[Figure: an 8-block cache with block addresses (cache indexes) 000 through 111, and memory block addresses mapping onto it.]

Direct-mapped caches: The cache index

Many different memory blocks map to a single cache block. Which block? Use the lower bits of the memory address to index the cache:

cache index = (memory block address) % (cache size in blocks)

Example 1: 32-block main memory, 8-block cache (we consider block addresses). The memory block address is 5 bits. To index the cache we need 3 bits: the lower 3 bits of the memory block address. For example, memory block 00110 maps to cache location 110, and memory block 10110 also maps to cache location 110.
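A minimal sketch of this index computation in C (the function name and the checks in main are mine, for illustration): with a power-of-two number of cache blocks, the modulo is just the lower bits of the address.

    #include <stdio.h>

    #define CACHE_BLOCKS 8   /* 8-block cache, as in Example 1 */

    /* cache index = (memory block address) % (cache size in blocks) */
    static unsigned cache_index(unsigned block_addr) {
        return block_addr % CACHE_BLOCKS;   /* same as block_addr & (CACHE_BLOCKS - 1) */
    }

    int main(void) {
        printf("%u\n", cache_index(0x06));  /* 00110 -> index 110 (6) */
        printf("%u\n", cache_index(0x16));  /* 10110 -> index 110 (6) too */
        return 0;
    }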

Direct-mapped caches

Example 2: 128-byte main memory, 8-block cache, 4-byte (= 1 word) cache block size (we consider byte addresses).
- 7-bit memory byte address (128 = 2^7)
- 3-bit cache (block) index (8 blocks = 2^3)
- 2 bits to address the byte within the block (4 bytes = 2^2)

The memory byte addresses 0000000, 0000001, 0000010, and 0000011 all map to the same cache block, 000.
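A sketch of the byte-address split for Example 2 (field widths from the slide; the helper names are mine):

    #include <stdio.h>

    #define OFFSET_BITS 2   /* 4-byte blocks */
    #define INDEX_BITS  3   /* 8 cache blocks */

    static unsigned byte_offset(unsigned a) { return a & ((1u << OFFSET_BITS) - 1); }
    static unsigned cache_idx(unsigned a)   { return (a >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
    static unsigned tag_bits(unsigned a)    { return a >> (OFFSET_BITS + INDEX_BITS); }

    int main(void) {
        /* byte addresses 0000000..0000011 share tag 00 and index 000 */
        for (unsigned a = 0; a < 4; a++)
            printf("addr %u: tag %u, index %u, offset %u\n",
                   a, tag_bits(a), cache_idx(a), byte_offset(a));
        return 0;
    }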

Direct-mapped caches: The Tag field

Many different memory blocks map to a single cache block. How do we know which memory block is in the cache block? To each cache line we add a tag that contains the remaining part (upper bits) of the address.

Example 3: 32-block main memory, 8-block cache. Memory blocks 00000, 01000, 10000, and 11000 all map to the same cache block, 000; the 2-bit tag (here 00, 01, 10, or 11) tells them apart.

Direct-mapped caches: The Valid bit

The CPU performs many different tasks, and the memory contents change. How do we know if a cache block is "good"? To each cache line we add a valid bit to indicate whether the content of the block corresponds to what the CPU is actually looking for. For instance, after a reset all valid bits are cleared: no block contains useful information.

Direct-mapped caches: A 1-word-block, direct-mapped cache

[Figure: a 1024-line, 1-word-block, direct-mapped cache. The 32-bit memory byte address (bits 31..0) splits into a tag (bits 31..12), a 10-bit index (bits 11..2) selecting one of the 1024 lines, and a 2-bit byte offset (bits 1..0). Each line holds a valid bit (V), a TAG, and a DATA word; the stored tag is compared with the address tag and combined with the valid bit to produce the HIT signal.]

Worksheet: mem. address [b], cache line size [b], bits for index, cache data size [B], bits for tag, total cache size [b].
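A worked sketch of that worksheet for the 1024-line cache in the figure (my arithmetic, using the standard formulas):

    #include <stdio.h>

    int main(void) {
        int addr_bits   = 32;      /* memory address [b] */
        int lines       = 1024;
        int data_bits   = 32;      /* 1-word block */
        int offset_bits = 2;       /* byte within the word */
        int index_bits  = 10;      /* log2(1024) */
        int tag_bits    = addr_bits - index_bits - offset_bits;  /* 20 */
        int line_bits   = 1 + tag_bits + data_bits;              /* valid + tag + data = 53 */
        printf("cache line size: %d b\n", line_bits);
        printf("cache data size: %d B\n", lines * data_bits / 8); /* 4096 B */
        printf("total cache size: %d b\n", lines * line_bits);    /* 54272 b */
        return 0;
    }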

Direct-mapped caches: Cache trace with block address

32-block memory, 8-block cache; the memory address is a block address. Trace the sequence, marking each access Hit/Miss and updating INDEX, V, TAG, DATA for cache lines 000 through 111 (all initially invalid):

Address (dec): 22, 26, 22, 18, 26, 18, 26, 22
Address (bin): 10110, 11010, 10110, 10010, 11010, 10010, 11010, 10110
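A minimal direct-mapped trace simulator for this slide's parameters (a sketch; the hit/miss column it prints is computed by the code, not copied from the slide):

    #include <stdio.h>

    int main(void) {
        unsigned trace[] = {22, 26, 22, 18, 26, 18, 26, 22};
        int valid[8] = {0};
        unsigned tags[8] = {0};
        for (int i = 0; i < 8; i++) {
            unsigned a   = trace[i];
            unsigned idx = a & 7;    /* lower 3 bits of the block address */
            unsigned t   = a >> 3;   /* upper 2 bits */
            int hit = valid[idx] && tags[idx] == t;
            printf("%2u (index %u, tag %u): %s\n", a, idx, t, hit ? "hit" : "miss");
            valid[idx] = 1;          /* on a miss, the block is fetched into the line */
            tags[idx]  = t;
        }
        return 0;
    }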

Direct-mapped caches: Cache trace with byte address

256-byte memory, 32-byte cache, 4-byte cache block, memory byte addressing (8-bit byte address: 3-bit tag, 3-bit index, 2-bit byte offset). Trace the sequence, marking each access Hit/Miss and updating INDEX, V, TAG, DATA for cache lines 000 through 111:

Address (dec): 89, 232, 90, 8, 91, 92, 232, 7
Address (bin): 01011001, 11101000, 01011010, 00001000, 01011011, 01011100, 11101000, 00000111

Cache reads and writes

In our CPU, Instruction Memory and Data Memory are actually cache memories. On a memory access, hits are straightforward to handle. Misses are more complex:
- read misses
- write misses

Cache reads and writes: Read misses

For instructions:
- stall the CPU
- send the original PC to memory (current PC - 4) and wait
- write the cache entry (including tag and valid bit)
- restart the instruction

For data:
- stall the CPU
- send the address to memory and wait
- write the cache entry (including tag and valid bit)
- restart the instruction

Cache reads and writes: Write misses

What is a write miss? In a 1-word-block write-through cache, writes always hit: we do not need to know what was in the memory location, since the CPU is overwriting it anyway.

Problem: inconsistency. Solutions:
- write-through
- write-back

Cache reads and writes: Write-through

Every time, write both the cache and the memory: CPU -> CACHE -> MEMORY.
- simple
- slow (mitigated by a write buffer between cache and memory)

Cache reads and writes: Write-back

Write only the cache. Write the entire block back into the memory only when the block needs to be replaced (dirty bit).

[Figure: a read/write trace (R 22 hit; W 22; W 14; ...) against a write-back cache whose lines carry V (valid), T (tag), and D (dirty) bits. Writes update only the cached block and set its dirty bit; when a dirty block must be replaced, it is first flushed back to memory.]
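A sketch of the write-back bookkeeping for a 1-word-block, direct-mapped cache (a simplified model of the slide's trace, not its exact diagram; mem[] stands in for main memory):

    #include <stdio.h>

    #define LINES 8

    static unsigned mem[32];   /* 32-block main memory */
    static struct { int valid, dirty; unsigned tag, data; } cache[LINES];

    static void write_word(unsigned block_addr, unsigned value) {
        unsigned idx = block_addr % LINES, t = block_addr / LINES;
        if (cache[idx].valid && cache[idx].tag != t && cache[idx].dirty)
            mem[cache[idx].tag * LINES + idx] = cache[idx].data;  /* flush the dirty victim */
        cache[idx].valid = 1;
        cache[idx].tag   = t;
        cache[idx].data  = value;   /* write only the cache... */
        cache[idx].dirty = 1;       /* ...and mark it newer than memory */
    }

    int main(void) {
        write_word(22, 111);        /* W 22: miss, allocate, set dirty */
        write_word(14, 222);        /* W 14: same index (110), so block 22 is flushed first */
        printf("mem[22] = %u\n", mem[22]);  /* 111, written back on replacement */
        return 0;
    }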

Multi-word caches

Using cache blocks larger than one word takes advantage of spatial locality.

[Figure: 4-GB memory, 64-KB direct-mapped cache with 4-word (16-byte) data blocks. The memory byte address splits into a tag, an index, a word offset, and a byte offset; the index selects a line (valid bit, tag, 4-word cache data block), the tag comparison produces the hit signal, and the word offset selects the data word.]

Multi-word caches: Exercise

What is the total size in bytes of the cache in the previous slide?
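A worked answer under the figure's parameters (my arithmetic, not the slide's): 64 KB of data in 16-byte blocks is 4096 lines, so the index is 12 bits; with a 2-bit byte offset and a 2-bit word offset, the tag is 32 - 12 - 2 - 2 = 16 bits. Each line holds 1 (valid) + 16 (tag) + 128 (data) = 145 bits, so the total is 4096 x 145 = 593,920 bits = 74,240 bytes, about 72.5 KB.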

Multi-word caches: Hits/misses in a multi-word cache

Read: just like the read misses on a single-word cache, except that the entire block is fetched.

Write: we cannot just write the word, tag, and valid bit without verifying whether the block is the actual block we want to write to, since more than one memory block maps to the same cache block. We need to compare the tag for writes too:
- the tags match: we can write the word
- the tags do not match: we need to read the block from memory and then write the word
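A sketch of that write path for a 4-word-block line (the names are mine, and fetch_block() is only a stand-in for the read from memory):

    #include <stdio.h>
    #include <string.h>

    #define WORDS_PER_BLOCK 4
    #define LINES 4096

    struct line { int valid; unsigned tag; unsigned data[WORDS_PER_BLOCK]; };
    static struct line cache[LINES];

    static void fetch_block(unsigned tag, unsigned idx, unsigned dst[]) {
        (void)tag; (void)idx;                       /* stand-in for a memory read */
        memset(dst, 0, WORDS_PER_BLOCK * sizeof dst[0]);
    }

    static void write_word(unsigned idx, unsigned tag, unsigned off, unsigned value) {
        struct line *l = &cache[idx];
        if (!(l->valid && l->tag == tag)) {         /* tags do not match (or line invalid): */
            fetch_block(tag, idx, l->data);         /* read the block from memory first */
            l->tag   = tag;
            l->valid = 1;
        }
        l->data[off] = value;                       /* then write the word */
    }

    int main(void) {
        write_word(5, 3, 2, 42);                    /* miss: fetch block, then write */
        write_word(5, 3, 0, 7);                     /* tags match: just write the word */
        printf("%u %u\n", cache[5].data[0], cache[5].data[2]);
        return 0;
    }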

Multi-word caches: Cache block size and miss rate

- up to a certain point, cache miss rate decreases with increasing block size
- after a certain point, cache miss rate increases with increasing block size (spatial locality decreases with block size)
- the miss penalty increases with block size

[Figure: miss rate vs. block size; copyright 1998 Morgan Kaufmann Publishers, Inc. All rights reserved.]

Multi-word caches: Miss penalty (= additional clock cycles)

Has three components:
a) sending the address to memory
b) latency to initiate the memory transfer
c) time for transferring each word

Example in the textbook: a) = 1 clock cycle, b) = 15 clock cycles, c) = 1 clock cycle. With a 4-word-block cache and a 1-word memory bus, what is the miss penalty on a standard DRAM? And on an SDRAM or with an interleaved memory organization?
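A worked answer with those numbers (my arithmetic): on a standard DRAM each of the 4 words pays the full access latency, so the miss penalty is 1 + 4 x (15 + 1) = 65 clock cycles; on an SDRAM or with an interleaved organization the latency is paid once and the words then stream out one per cycle, so it is 1 + 15 + 4 x 1 = 20 clock cycles.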

Multi-word caches: Static and Dynamic RAMs

Multi-word caches: DRAM diagram

[Figure: DRAM organization. An address buffer feeds a row decoder and a column decoder, strobed by RAS and CAS; the selected row of the cell array is read through sense amps and I/O, gated by WE.]

Multi-word caches: Synchronous DRAM timing

[Timing diagram: for a read (WE = READ), the row address RA is presented with RAS; after trcd, the column address CA is presented with CAS; tcl later, data words DO[CA], DO[CA+1], DO[CA+2], ... stream out on DQ on successive clock edges in BURST mode, all within one row access.]

Multi-word caches: Double Data Rate (DDR) DRAM timing

[Timing diagram: same command sequence as the SDRAM read (RAS, trcd, CAS, tcl), but in BURST mode the data D0, D1, D2, D3 is transferred on both edges of the clock, doubling the data rate.]

Multi-word caches: Commercial SDRAM parameters

- tcl: CAS Latency, time between the read command and data output valid
- trcd: RAS-to-CAS Delay, minimum time between RAS and CAS
- trp: RAS Precharge time, time the row decoder needs to precharge the row
- tras: Activate-to-Precharge time, minimum time before trying to change row
- CMDrate: Command Rate, minimum time between chip select and RAS (activate)

Typical values: tcl = 2, trcd = 2, trp = 2, tras = 5, CMDrate = T1. All the above numbers are expressed in clock cycles.
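A hedged worked reading of those values (my interpretation, assuming the bank is idle and that command rate T1 means one cycle): the first data word of a read to a closed row appears after CMDrate + trcd + tcl = 1 + 2 + 2 = 5 clock cycles; if a different row is currently open, its precharge (trp, issued no earlier than tras after its activation) must complete first.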

Multi-word caches: Commercial SDRAM parameters diagram

[Timing diagram: the command sequence ACTIVATE (set RAS), READ (set CAS), PRECHARGE, ACTIVATE, with trcd between ACTIVATE and READ, tcl between READ and data on DQ, tras between ACTIVATE and PRECHARGE, and trp between PRECHARGE and the next ACTIVATE.]

Multi-word caches: Memory bandwidth

If a single transfer to/from memory can transfer multiple words at a time, the miss penalty decreases.

[Figure: three CPU-cache-memory organizations with progressively wider buses, using MUX/DEMUX stages between cache and memory.]

Miss penalty for a 2-word-block cache with a 2-word memory bus?
Miss penalty for a 4-word-block cache with a 4-word memory bus?
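A worked answer, reusing the textbook components from the miss-penalty slide (my arithmetic): when the bus is as wide as the block, the whole block moves in a single transfer, so both cases cost 1 + 15 + 1 = 17 clock cycles.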

Cache associativity

What if the CPU keeps accessing two (or more) variables that map to the same location in a direct-mapped cache? More sophisticated strategy: n-way set-associative caches.
- direct-mapped ("1-way set associative")
- n-way set associative
- fully associative

Cache associativity: Two-way set-associative cache

[Figure: the eight cache blocks 000 through 111 arranged as four sets (cache SET indexes 00, 01, 10, 11) of two lines each; a memory block can go in either line of its set.]

Cache associativity: Four-way set-associative cache

[Figure: the eight cache blocks 000 through 111 arranged as two sets (cache SET indexes 0, 1) of four lines each.]

Cache associativity: Eight-way set-associative cache

With eight lines, an eight-way set-associative cache is a fully-associative cache: any block can go anywhere.

Cache associativity: Direct-mapped cache

[Figure: a direct-mapped cache is a 1-way set-associative cache: eight sets (cache line indexes 000 through 111) of one line each.]

Cache associativity: Pros and cons of increasing cache associativity

Advantages:
- reduces the miss rate

Disadvantages:
- requires more hardware
- requires a replacement policy

Block replacement policy: Least Recently Used (LRU) or random, implemented in hardware.

Cache associativity: Exercise 1

For an 8-line, write-through, 2-way set-associative cache with LRU replacement and 1-word data blocks, trace the following sequence of block addresses, marking each H/M:

Address (dec): 23, 18, 196, 63, 79, 18, 199, 165
Address (bin): 00010111, 00010010, 11000100, 00111111, 01001111, 00010010, 11000111, 10100101
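A sketch of this 2-way, LRU lookup in C (my model of the exercise's cache; the hit/miss answers it prints are computed by the code, not taken from the slide):

    #include <stdio.h>

    #define SETS 4    /* 8 lines, 2-way => 4 sets */
    #define WAYS 2

    struct way { int valid; unsigned tag; };
    static struct { struct way w[WAYS]; int lru; } cache[SETS];  /* lru = least recently used way */

    static int access_block(unsigned block_addr) {   /* returns 1 on hit, 0 on miss */
        unsigned idx = block_addr % SETS, tag = block_addr / SETS;
        for (int i = 0; i < WAYS; i++)
            if (cache[idx].w[i].valid && cache[idx].w[i].tag == tag) {
                cache[idx].lru = 1 - i;              /* the other way becomes LRU */
                return 1;
            }
        int victim = cache[idx].lru;                 /* on a miss, replace the LRU way */
        cache[idx].w[victim].valid = 1;
        cache[idx].w[victim].tag   = tag;
        cache[idx].lru = 1 - victim;
        return 0;
    }

    int main(void) {
        unsigned trace[] = {23, 18, 196, 63, 79, 18, 199, 165};
        for (int i = 0; i < 8; i++)
            printf("%3u: %s\n", trace[i], access_block(trace[i]) ? "hit" : "miss");
        return 0;
    }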

Cache associativity: Exercise 2

A computer system has 32-bit addresses and a 64-KB direct-mapped, write-back cache with 8-byte data block lines.
a) how many lines are in the cache?
b) how many bits total (including cache management bits) are in each line, minimum?
c) what is the total cache size in bits?
d) diagram a cache lookup

Cache associativity: Solution

a) # of lines: ...  b) # of bits per line: ...  c) total cache size: ...  d) cache lookup: [diagram over address bits 31..0]

Cache associativity: Exercise 3

Suppose the 64-KB cache in Exercise 2 was instead 2-way set-associative with 8-byte lines.
a) how many sets are in the cache?
b) how many lines are in the cache?
c) how many bits total (including cache management bits) are in each line, minimum?
d) what is the total cache size in bits?
e) diagram a cache lookup
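A sketch that grinds through the geometry for Exercises 2 and 3 (my arithmetic; it assumes one valid bit and one dirty bit per line since the cache is write-back, and no replacement-state bits, since random replacement would need none):

    #include <stdio.h>

    static void geometry(const char *name, int ways) {
        int addr_bits = 32, cache_bytes = 64 * 1024, block_bytes = 8;
        int lines = cache_bytes / block_bytes;            /* 8192 */
        int sets  = lines / ways;
        int offset_bits = 3;                              /* log2(8-byte block) */
        int index_bits  = 0;
        for (int s = sets; s > 1; s >>= 1) index_bits++;  /* log2(sets) */
        int tag_bits  = addr_bits - index_bits - offset_bits;
        int line_bits = 1 + 1 + tag_bits + 8 * block_bytes;  /* valid + dirty + tag + data */
        printf("%s: %d sets, %d lines, %d b/line, %d b total\n",
               name, sets, lines, line_bits, lines * line_bits);
    }

    int main(void) {
        geometry("Exercise 2 (direct-mapped)", 1);  /* 8192 sets, 82 b/line, 671744 b */
        geometry("Exercise 3 (2-way)", 2);          /* 4096 sets, 83 b/line, 679936 b */
        return 0;
    }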

Cache associativity: Solution

a) # of sets: ...  b) # of lines: ...  c) # of bits per line: ...  d) total cache size: ...  e) cache lookup: [diagram over address bits 31..0]

Caches and performance

Exercise 4: a computer has a CPI of 1.0 when there are no cache misses, and a 100 MHz clock. Each instruction has on average 0.4 data memory references. For each cache miss the instruction takes an additional 9 clock cycles to complete.
- what are the CPI_100% and the MIPS_100% rating with a cache and an (unrealistic) 100% hit rate?
- what are the CPI_NOCACHE and the MIPS_NOCACHE rating without a cache?
- what are the CPI_90/85 and the MIPS_90/85 rating with a cache, a 90% hit rate on instructions, and an 85% hit rate on data?
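A worked solution (my arithmetic; the no-cache case treats every instruction fetch and data reference as a miss): CPI_100% = 1.0, so MIPS_100% = 100 MHz / 1.0 = 100. Without a cache, CPI_NOCACHE = 1 + (1 + 0.4) x 9 = 13.6 and MIPS_NOCACHE = 100 / 13.6, about 7.4. With the cache, CPI_90/85 = 1 + 0.10 x 9 + 0.4 x 0.15 x 9 = 1 + 0.90 + 0.54 = 2.44 and MIPS_90/85 = 100 / 2.44, about 41.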

Caches and performance: Solution

CPI_100% = ...  MIPS_100% = ...  CPI_NOCACHE = ...  MIPS_NOCACHE = ...  CPI_90/85 = ...  MIPS_90/85 = ...

Homework: Recommended exercises

Third Edition: Ex 7.2-7.4, 7.6-7.8, 7.9, 7.12, 7.16, 7.17-7.19, 7.25-7.27, 7.32, 7.33, 7.35