TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science

2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spatial locality Hits and misses Direct-mapped, set associative, fully associative caches Addressing Handling writes Performance

3 Review What is control speculation? What is data speculation? What are the advantages of a superscalar vs a VLIW? What are the disadvantages of a superscalar vs a VLIW? When is a VLIW appropriate? When is a superscalar appropriate?

4 Datapath and control from Chapter 4

5 Memory technologies
Static RAM (SRAM): 0.5-2.5 ns, $2000-$5000 per GB
Dynamic RAM (DRAM): 50-70 ns, $20-$75 per GB
Magnetic disk: 5-20 ms, $0.20-$2 per GB
Ideal memory: access time of SRAM, capacity and cost/GB of disk
(Prices as of 2008)

6 Memory hierarchies - motivation Programmers want an unlimited amount of fast memory Fast memory is expensive Large memories are slow Compromise: a memory hierarchy is used to create the illusion of memory with the size of the largest and the speed of the fastest

7 Memory hierarchies Memory hierarchy: A structure that uses multiple levels of memories As the distance from the CPU increases, the size of the memories and the access time both increase The illusion of a large, fast memory is achieved by using the principles of locality

8 Principle of temporal locality If you read an address once, you are likely to touch it again (variables) If you execute an instruction once, you are likely to execute it again (loops) Temporal locality: addresses recently referenced will tend to be referenced again soon Caches exploit temporal locality!

9 Principle of spatial locality If you read an address once, you are likely to also read neighbouring addresses (arrays) If you execute an instruction once, you are likely to access neighbouring instructions Spatial locality: if you access address X, you are likely to access an address close to X Caches exploit spatial locality!
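A minimal code illustration (my own example, not from the slides): the loop below has temporal locality on sum, i and the loop instructions, and spatial locality on the array elements.

```c
/* Illustrative only: a sequential array traversal showing both
 * kinds of locality. */
#include <stdio.h>

int main(void) {
    int a[1024];
    for (int i = 0; i < 1024; i++) a[i] = i;

    long sum = 0;
    for (int i = 0; i < 1024; i++) {
        sum += a[i];  /* spatial: a[i] shares a cache line with its
                         neighbours, so most accesses hit          */
                      /* temporal: sum, i and the loop body itself
                         are reused on every iteration             */
    }
    printf("%ld\n", sum);
    return 0;
}
```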

10 Levels of hierarchy Exploit the principle of locality by using the memory hierarchy Memory closer to the CPU is a subset of memory further away All data is stored at the lowest level Data copied between only two levels at a time Upper levels: caching Lower levels: virtual memory

11 Exploiting locality Memory hierarchy Store everything on disk Copy recently accessed (and nearby) items from disk to smaller DRAM memory (main memory) Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory (CPU cache)

12 Organization of data Data is transferred only between two levels at a time The data can be either present or not present in the upper level when needed The minimum unit of data that can be present or not present is called a block or a line The block is also the unit transferred from the lower level when needed

13 Hierarchy and the computer Concepts used to build memory systems affect many other aspects of a computer and its performance: How the operating system manages memory and I/O How compilers generate code How applications use the computer Since all programs spend much of their time accessing memory, the memory system is the major factor in determining performance Programmers should understand the memory hierarchy to achieve good performance

14 Hits and misses (1/2) A hit occurs when data referenced by the processor is available in a block in the upper level Otherwise, it is a miss On a miss the block containing the data must be transferred from the next level in the hierarchy The hit rate is the fraction of memory accesses found in the upper level The fraction not found is called the miss rate

15 Hits and misses (2/2) The hit time is the time needed to access data from the upper level Includes the time to determine if it is a hit or a miss The miss penalty is the time needed to access data that is not available in the upper level Includes the time to transfer the block from the lower level and to deliver the requested data The hit time is much smaller than the miss penalty
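Taken together, these definitions give the standard average memory access time formula (textbook material, not spelled out on this slide):

```latex
\text{AMAT} = \text{hit time} + \text{miss rate} \times \text{miss penalty}
```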

16 Programs and locality Programs tend to reuse recently accessed data items (temporal locality) and reference data items that are close to recently accessed data (spatial locality) Memory hierarchies exploit temporal locality by keeping more recently accessed data items closer to the processor Memory hierarchies exploit spatial locality by moving blocks consisting of multiple contiguous words in memory to upper levels of the hierarchy Most systems use a true hierarchy: data present at level i is also present at level i + 1

17 Cache Cache: the levels in the hierarchy between main memory and the processor Simple (level 1) cache where a block is one single word The cache before and after a reference to X_n

18 Direct-mapped cache How do we know if the requested word is in the cache? How do we find it? Easy to find a word in the cache if a memory location is mapped to exactly one cache location If the address of the memory location determines the exact placement in the cache it is called a direct-mapped cache Typical mapping: (block address) mod (#cache blocks)

19 Direct-mapped 1-word block sized cache Mapping: (word address) mod (#cache words) 8-word cache for a 32-word memory: the 3 least significant bits of the address determine the cache position

20 Direct-mapped 1-word block sized cache How do we know which memory word is in a given cache word? We need to store the remaining upper address bits with the data The upper part of the address stored with the data block is called a tag For our 32-word memory and 8-word cache the tag is 2 bits What if there is invalid data in the cache word? We need a valid bit for each block For each block we have: valid bit, tag, data block
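A hedged C sketch of this exact organization (structure and names are mine, not from the lecture): an 8-entry direct-mapped cache of one-word blocks over a 32-word memory, with a 3-bit index, 2-bit tag and a valid bit per entry.

```c
/* Sketch: direct-mapped cache, 8 one-word entries, 32-word memory
 * (5-bit word addresses -> 3 index bits + 2 tag bits). */
#include <stdbool.h>
#include <stdint.h>

#define CACHE_WORDS 8

struct line {
    bool     valid;
    uint32_t tag;    /* upper 2 bits of the 5-bit word address */
    uint32_t data;
};

static struct line cache[CACHE_WORDS];

/* Returns true on a hit. On a miss the caller would fetch the word
 * from memory and fill cache[index] with {true, tag, data}. */
bool lookup(uint32_t word_addr, uint32_t *data_out) {
    uint32_t index = word_addr % CACHE_WORDS;  /* (word address) mod (#cache words) */
    uint32_t tag   = word_addr / CACHE_WORDS;  /* remaining upper bits */
    if (cache[index].valid && cache[index].tag == tag) {
        *data_out = cache[index].data;
        return true;
    }
    return false;
}
```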

21 Example: reads on the simple cache See pages 460-461

22 Larger blocks In order to take advantage of spatial locality, caches use blocks several words in size We only need one valid bit and one tag per block (less storage overhead) The block address is (byte address) / (#bytes in block) With 16 bytes per block, byte address 1200 has block address 1200/16 = 75

23 Larger blocks and hit rate Too large blocks decrease the hit rate Many words are not used before the block is kicked out Larger blocks also increase the miss penalty

24 Anatomy of an address Tag Index Byte offset Byte offset: which byte in the cache line are we reading? Index: which cache line are we reading? Tag: how we differentiate this address from other addresses with the same index and byte offset Consider a 64 KB, direct mapped cache with 64 B cache lines. Assuming a 32-bit address, how many bits are used for tag? Index? Byte offset? Index: 10 bits (1024 lines), byte offset: 6 bits (64 B lines), tag: 16 bits (32 - 10 - 6)
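A small check of those widths in C (the program and its constants are mine; it just implements the arithmetic above):

```c
/* Field extraction for the slide's example: 64 KB direct-mapped
 * cache, 64 B lines, 32-bit addresses. */
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 6   /* log2(64 B per line)             */
#define INDEX_BITS 10   /* log2(64 KB / 64 B = 1024 lines) */

int main(void) {
    uint32_t addr   = 0x12345678;  /* arbitrary example address */
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);  /* 16 bits remain */
    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}
```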

25 Cache implementation

26 Caches and associativity fully associative Instead of direct mapping, a cache design can be fully associative In a fully associative cache a block can be put in any position in the cache regardless of its address This requires a full search of the tags to determine cache hit or miss, which increases hardware costs What is the size of the index field for a fully associative cache?

27 Caches and associativity set associative Direct mapping and fully associative are two ends of the spectrum. Set associative caches are somewhere in between In a set associative cache one address maps to a fixed number of locations in the cache A set associative cache with n locations for each block is called n-way set associative The index of the set in the cache is given by: (block address) mod (#sets in the cache) All tags in the set must be searched to determine hit or miss

28 Caches and associativity Direct mapped is the same as one-way set associative Fully associative is m-way set associative where m is the number of blocks in the cache

29 Associativity, hit time and hit rate Increased associativity can increase the hit time Tag search takes more time Increased associativity can decrease the miss rate Blocks are kept longer in the cache
Associativity  Data miss rate
1              10.3 %
2              8.6 %
4              8.3 %
8              8.1 %
(FastMATH processor running SPEC2000)

30 Block replacement With associativity one has to decide which block to remove when a set is full on a cache miss Two strategies: least recently used (LRU) and random LRU needs hardware to track accesses
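A sketch of LRU tracking in C (illustrative; real hardware uses cheaper approximations at higher associativity): keep a timestamp per way and evict the oldest valid entry.

```c
/* LRU victim selection for one n-way set, via access timestamps. */
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4

struct way {
    bool     valid;
    uint32_t tag;
    uint64_t last_used;  /* updated on every access to this way */
};

/* Prefer an invalid way; otherwise evict the least recently used. */
int choose_victim(const struct way set[WAYS]) {
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (!set[i].valid)
            return i;                       /* free slot first */
        if (set[i].last_used < set[victim].last_used)
            victim = i;                     /* older access wins */
    }
    return victim;
}
```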

31 Example: associative caches 1-word blocks, four blocks 3 different cache implementations: direct-mapped, two-way set associative, fully associative Block addresses accessed in sequence: 0, 8, 0, 6, 8 (a simulation sketch follows the figure slides below)

32 Example: associative caches direct mapped

33 Example: associative caches 2-way set

34 Example: associative caches fully associative
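A hedged C sketch of slide 31's experiment (the simulator is mine, written just for this trace); with LRU replacement it reports 5 misses direct-mapped, 4 misses two-way set associative, and 3 misses fully associative.

```c
/* Simulate block addresses 0, 8, 0, 6, 8 on a 4-block cache of
 * 1-word blocks, for 1-, 2- and 4-way associativity (4-way == fully
 * associative here). LRU replacement. */
#include <stdbool.h>
#include <stdio.h>

#define BLOCKS 4

static int misses(int ways, const int *trace, int n) {
    int sets = BLOCKS / ways;
    int tag[BLOCKS], stamp[BLOCKS], now = 0, miss = 0;
    bool valid[BLOCKS] = {false};
    for (int t = 0; t < n; t++) {
        int base = (trace[t] % sets) * ways, hit = -1;
        for (int w = 0; w < ways; w++)           /* search the set */
            if (valid[base + w] && tag[base + w] == trace[t])
                hit = base + w;
        if (hit < 0) {                           /* miss: pick victim */
            miss++;
            hit = base;
            for (int w = 0; w < ways; w++) {
                if (!valid[base + w]) { hit = base + w; break; }
                if (stamp[base + w] < stamp[hit]) hit = base + w;
            }
            valid[hit] = true;
            tag[hit] = trace[t];
        }
        stamp[hit] = ++now;                      /* mark as most recent */
    }
    return miss;
}

int main(void) {
    const int trace[] = {0, 8, 0, 6, 8};
    for (int ways = 1; ways <= BLOCKS; ways *= 2)
        printf("%d-way: %d misses\n", ways, misses(ways, trace, 5));
    return 0;
}
```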

35 Associativity and tag bits Increasing associativity increases the number of comparators needed It also increases the size of the tag fields Assume a cache with 4K blocks, 16-byte block size, and 32-bit addresses:
Direct mapped: 16-bit tag, 64 Kbits total for tags
2-way: 2K sets, 17-bit tag, 68 Kbits total for tags
4-way: 1K sets, 18-bit tag, 72 Kbits total for tags
Fully associative: 1 set, 28-bit tag, 112 Kbits total for tags
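The pattern behind these numbers (standard material, implicit in the slide):

```latex
% Cache with B blocks, 2^b-byte blocks, n-way associative, 32-bit addresses:
\begin{align*}
\#\text{sets} &= B / n \\
\text{tag bits} &= 32 - \log_2(\#\text{sets}) - b \\
\text{tag storage} &= B \times \text{tag bits}
\end{align*}
% e.g. 4-way: sets = 4096/4 = 1024, tag = 32 - 10 - 4 = 18 bits,
% storage = 4096 x 18 = 72 Kbits, matching the slide.
```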

36 Implementation of direct mapped

37 Implementation of 4-way set associative

38 Handling of cache misses On a cache hit the processor proceeds as normal on the next clock cycle On a cache miss the processor must be stalled until the data is available in the cache This freezes the contents of the pipeline registers and the register file On an instruction miss the contents of the instruction register are invalid and the instruction must be re-fetched

39 Miss penalty elaborated A large block size increases the transfer time of the block We can hide some of the transfer time: Early restart Resume execution as soon as the requested word is available in the cache, possibly before the transfer is complete Requested word first (critical word first) Transfer the requested word of the block first and then the rest of the block consecutively, wrapping the address around at the top of the block

40 Memory writes Write-through can help with memory consistency On a write the block is read from the lower level (if not present in the cache) The new word is written both to the word in the cache and to the address in main memory. Poor performance Write buffer A buffer holds a queue of write accesses to main memory Write-back Writes go only to the cache block The block is written back when replaced (requires a dirty bit!)

41 Memory writes Write-through + no-write-allocate: on a write miss, write through to the next level without filling the cache Write-back + write-allocate: on a write miss, read the line from the next level, place it in the cache, and write to the cache; when the block is evicted, write the line back to memory
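A toy contrast of the two policies in C, reusing the simple direct-mapped cache of one-word blocks from earlier slides (all names and structure are mine, not from the lecture):

```c
/* Write-through + no-write-allocate vs. write-back + write-allocate
 * on a direct-mapped cache of 1-word blocks. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MEM_WORDS 32
#define CACHE_WORDS 8

static uint32_t mem[MEM_WORDS];
static struct { bool valid, dirty; uint32_t tag, data; } cache[CACHE_WORDS];

/* Write-through + no-write-allocate: the next level is always up to
 * date, and a write miss does not fill the cache. */
void write_wt(uint32_t a, uint32_t w) {
    uint32_t i = a % CACHE_WORDS, t = a / CACHE_WORDS;
    if (cache[i].valid && cache[i].tag == t)
        cache[i].data = w;      /* update the cached copy if present */
    mem[a] = w;                 /* always write the next level */
}

/* Write-back + write-allocate: writes go only to the cache; memory is
 * updated when a dirty block is evicted. */
void write_wb(uint32_t a, uint32_t w) {
    uint32_t i = a % CACHE_WORDS, t = a / CACHE_WORDS;
    if (!(cache[i].valid && cache[i].tag == t)) {        /* write miss */
        if (cache[i].valid && cache[i].dirty)            /* evict dirty block */
            mem[cache[i].tag * CACHE_WORDS + i] = cache[i].data;
        cache[i].valid = true;
        cache[i].dirty = false;
        cache[i].tag = t;
        cache[i].data = mem[a];                          /* allocate: fetch line */
    }
    cache[i].data = w;          /* write the cache only */
    cache[i].dirty = true;      /* remembered via the dirty bit */
}

int main(void) {
    write_wb(5, 42);            /* memory still stale for address 5 */
    write_wt(6, 7);             /* memory immediately updated */
    printf("mem[5]=%u mem[6]=%u\n", mem[5], mem[6]);  /* prints 0 and 7 */
    return 0;
}
```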

42 Types of cache misses Cache misses can be divided into three categories depending on the reason for the miss:
Compulsory misses: access to a block that has never been in the cache
Capacity misses: access to blocks that have been kicked out because of the cache size
Conflict misses: access to blocks that have been kicked out of a set associative or direct mapped cache, but would still have been available in a fully associative cache

43 Types of cache misses

44 Impact of miss rate on performance Example on page 477

45 16 KB cache in FastMATH (MIPS) 4K words, 16-word blocks Separate instruction and data caches OS decides between write-through and write-back Effective miss rate 3.2 % (SPEC2000 integer benchmarks): 11.4 % data, 0.4 % instruction Bits 5-2 of the address select the word within the block

46 Split vs. combined cache A combined cache with a total size equal to the sum of the two split caches gives a better hit rate FastMATH: split cache miss rate 3.24 %, combined cache miss rate 3.18 % A split cache doubles the cache bandwidth: the processor can access both the instruction cache and the data cache in the same clock cycle The increased bandwidth easily outweighs the disadvantage of the increased miss rate

47 Designing the memory system to support caches (1/3) The miss penalty can be reduced if the memory-to-cache bandwidth is increased (allows larger blocks while maintaining a low miss penalty) The bus clock rate is usually much slower than the processor clock (e.g., a factor of 10), affecting the miss penalty Assume 1 memory bus clock cycle to send the address, 15 memory bus clock cycles for each DRAM access initiated, and 1 memory bus clock cycle to send a word of data With a 4-word block and a one-word-wide bank, the miss penalty is 1 + 4 x 15 + 4 x 1 = 65 memory bus clock cycles Transferred bytes per bus clock cycle: (4 x 4)/65 = 0.25

48 Increasing bandwidth by widening the bus (2/3) Widen the memory and the buses between the processor and the memory Memory bandwidth increases proportionally Miss penalty improvement over the previous example with a main-memory width of two words: 1 + 2 x 15 + 2 x 1 = 33 memory bus clock cycles, down from 65. Main-memory width of 4 words: 17 cycles. Costs: a wider bus and a potential increase in access time due to the multiplexor and control logic between the processor and the cache

49 Increasing bandwidth by interleaving (3/3) Memory chips are organized in banks to read or write multiple words in one access time rather than reading or writing a single word each time. Sending an address to several banks permits them all to read simultaneously. This gives the advantage of incurring the full memory latency only once: 1 + 1 x 15 + 4 x 1 = 20 memory bus clock cycles

50 Bytes / clock cycle for a single miss
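The figure's throughput numbers follow from the three previous slides (16 bytes transferred per 4-word block miss); recomputing:

```latex
\begin{align*}
\text{one-word-wide bus}    &: 16/65 \approx 0.25 \text{ bytes/cycle} \\
\text{two-word-wide bus}    &: 16/33 \approx 0.48 \\
\text{four-word-wide bus}   &: 16/17 \approx 0.94 \\
\text{interleaved, 4 banks} &: 16/20 = 0.80
\end{align*}
```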

51 Cache performance
CPU time = (execution cycles + stall cycles) x cycle time
stall cycles = read stall cycles + write stall cycles
read stalls = reads/program x read miss rate x read miss penalty
write stalls = writes/program x write miss rate x write miss penalty + write buffer stalls (buffer stalls << write misses)
With a single miss rate and miss penalty:
memory stalls = memory accesses/program x miss rate x miss penalty = instructions/program x misses/instruction x miss penalty
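A worked instance with assumed numbers (chosen to resemble the textbook's examples; they are not from this lecture): base CPI 2, miss penalty 100 cycles, instruction cache miss rate 2 %, data cache miss rate 4 %, and 36 % of instructions are loads or stores.

```latex
\begin{align*}
\text{memory stall cycles per instruction} &= 0.02 \times 100 + 0.36 \times 0.04 \times 100 \\
                                           &= 2 + 1.44 = 3.44 \\
\text{effective CPI} &= 2 + 3.44 = 5.44
\end{align*}
```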

52 Multi-level cache Adding cache levels reduces the miss penalty The level 1 cache focuses on reducing hit time: smaller cache size, smaller block size The level 2 cache focuses on reducing the miss penalty: larger cache size, larger block size

53 Review Cache lines exploit which locality? What is the benefit of associativity? What is the cost of associativity? Why aren't L1 caches big? What is temporal locality? What is an example of code that has no temporal locality?