Cache Structure

Replacement policies
Overhead / Implementation
Handling writes
Cache simulations

Comp 411, L18: Cache Structure

Basic Caching Algorithm

[Diagram: CPU connected to a cache of tag/data pairs (tag A holds Mem[A], tag B holds Mem[B]), backed by main memory.]

ON REFERENCE TO Mem[X]: Look for X among the cache's tags...

HIT: X == TAG(i), for some cache line i
  READ: return DATA(i)
  WRITE: change DATA(i); write to Mem[X]

MISS: X not found in the TAG of any cache line
  REPLACEMENT ALGORITHM: select some line k to hold Mem[X] (allocation)
  READ: read Mem[X]; set TAG(k) = X, DATA(k) = Mem[X]
  WRITE: write to Mem[X]; set TAG(k) = X, DATA(k) = new value

Continuum of Associativity

Fully associative: compares the address with all tags simultaneously; location A can be stored in any cache line.

N-way set associative: compares the address with N tags simultaneously; data can be stored in any of the N cache lines belonging to a set (like N direct-mapped caches operating in parallel).

Direct-mapped: compares the address with only one tag; location A can be stored in exactly one cache line.

ON A MISS?
  Fully associative: allocates any cache entry.
  N-way set associative: allocates a line within one set.
  Direct-mapped: only one place to put it.
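To make the three mapping rules concrete, here is a small Python sketch (not from the lecture; the parameter names num_lines and ways are illustrative) that returns the candidate lines for an address at each point on the continuum:

    def candidate_lines(addr, num_lines, ways):
        """Return the list of line indices that may hold `addr`."""
        if ways == num_lines:                  # fully associative: any line
            return list(range(num_lines))
        num_sets = num_lines // ways           # N-way: one set of N lines
        s = addr % num_sets                    # direct-mapped is the ways == 1 case
        return [s * ways + w for w in range(ways)]

    # Example with 8 lines: fully associative, 2-way, direct-mapped views of addr 13
    print(candidate_lines(13, 8, 8))   # [0, 1, ..., 7]  -- any line
    print(candidate_lines(13, 8, 2))   # [2, 3]          -- set 1 of 4, two ways
    print(candidate_lines(13, 8, 1))   # [5]             -- exactly one line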

Three Replacement Strategies

LRU (least-recently used): replaces the item that has gone UNACCESSED the LONGEST; favors the most recently accessed items.

FIFO/LRR (first-in, first-out / least-recently replaced): replaces the OLDEST item in the cache; favors recently loaded items over older STALE items.

Random: replaces some item at RANDOM. No favoritism, uniform distribution; no pathological reference streams cause worst-case results; use a pseudo-random generator to get reproducible behavior.

Keeping Track of LRU

We need to keep an ordered list of N items for an N-way associative cache, updated on every access. Example for N = 4:

Current Order   Action            Resulting Order
(0,1,2,3)       Hit 2             (2,0,1,3)
(2,0,1,3)       Hit 1             (1,2,0,3)
(1,2,0,3)       Miss, Replace 3   (3,1,2,0)
(3,1,2,0)       Hit 3             (3,1,2,0)

N! possible orderings, so log2(N!) bits per set: approximately O(N log2 N) LRU bits, plus update logic.
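The slide's bookkeeping is easy to mirror in software. A minimal Python sketch (illustrative, not the lecture's hardware) that reproduces the table above, keeping the most-recently-used line at the front of the list:

    def touch(order, line):
        """Move `line` to the front of the recency list (MRU first)."""
        order.remove(line)
        order.insert(0, line)

    order = [0, 1, 2, 3]
    touch(order, 2); print(order)        # hit 2     -> [2, 0, 1, 3]
    touch(order, 1); print(order)        # hit 1     -> [1, 2, 0, 3]
    victim = order[-1]                   # miss: LRU victim is line 3
    touch(order, victim); print(order)   # replace 3 -> [3, 1, 2, 0]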

Example: LRU for 2-Way Sets

Bits needed? log2(2!) = 1 per set. The LRU bit is selected using the same index as the cache (part of the same SRAM) and keeps track of the last line accessed in the set:

(0), Hit 0 -> (0)
(0), Hit 1 -> (1)
(0), Miss, replace 1 -> (1)
(1), Hit 0 -> (0)
(1), Hit 1 -> (1)
(1), Miss, replace 0 -> (0)

[Diagram: 2-way set-associative cache; the address indexes two tag/data ways, each compared (=?) in parallel, with miss logic and data select.]

Example: LRU for 4-Way Sets

Bits needed? log2(4!) = log2(24) = 5 per set. How?

One method: One-Out/Hidden-Line coding (and variants). Directly encode the indices of the N-2 most recently accessed lines, plus one bit indicating whether the smaller (0) or larger (1) of the two remaining lines was more recently accessed:

(2,0,1,3) -> 10 00 0
(3,2,1,0) -> 11 10 1
(3,2,0,1) -> 11 10 0

Requires (N-2)*log2(N) + 1 bits. 8-way sets? log2(8!) = 16 bits (rounding up), versus (8-2)*log2(8) + 1 = 19 bits encoded. Either way, the overhead is O(N log2 N) bits/set.

Bottom line: LRU replacement requires considerable overhead as associativity increases.
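As a sanity check on these counts, a few lines of Python (assumed arithmetic: the exact cost rounds log2(N!) up to whole bits) reproduce the slide's numbers:

    from math import ceil, factorial, log2

    for n in (2, 4, 8, 16):
        exact = ceil(log2(factorial(n)))        # bits to name one of N! orders
        encoded = (n - 2) * ceil(log2(n)) + 1   # one-out/hidden-line coding
        print(n, exact, encoded)
    # 2-way: 1, 1    4-way: 5, 5    8-way: 16, 19    16-way: 45, 57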

FIFO Replacement

Each set keeps a modulo-N counter that points to the victim line that will be replaced on the next miss. The counter is updated only on cache misses. Example for a 4-way set-associative cache:

Next Victim   Action
(0)           Miss, Replace 0
(1)           Hit 1
(1)           Miss, Replace 1
(2)           Miss, Replace 2
(3)           Miss, Replace 3
(0)           Miss, Replace 0

Overhead is O(log2 N) bits/set.
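In code, the per-set state is nothing more than a counter. A sketch (class and method names are illustrative) that reproduces the 4-way example above:

    class FIFOSet:
        def __init__(self, n):
            self.n, self.next_victim = n, 0

        def on_miss(self):
            v = self.next_victim
            self.next_victim = (v + 1) % self.n   # advance only on a miss
            return v                              # line to replace

    s = FIFOSet(4)
    print(s.on_miss())   # replace 0  (a hit on line 1 would not touch the counter)
    print(s.on_miss())   # replace 1
    print(s.on_miss())   # replace 2
    print(s.on_miss())   # replace 3
    print(s.on_miss())   # replace 0 again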

Example: FIFO for 2-Way Sets

Bits needed? log2(2) = 1 per set. The FIFO bit is selected using the same index as the cache (part of the same SRAM) and keeps track of the oldest line in the set.

Same overhead as LRU! And LRU generally has lower miss rates than FIFO, soooo... why bother with FIFO?

[Diagram: 2-way set-associative cache with parallel tag comparison (=?), miss logic, and data select.]

FIFO for 4-Way Sets

Bits needed? log2(4) = 2 per set. Low-cost and easy to implement (no tricks here).

8-way? log2(8) = 3 per set.
16-way? log2(16) = 4 per set.
LRU for 16-way? log2(16!) = 45 bits per set exact, or 14*log2(16) + 1 = 57 bits per set encoded. Can we afford it?

FIFO summary: easy to implement, scales well. I'm starting to buy into all that big-O stuff!

Random Replacement

Build a single pseudorandom number generator for the WHOLE cache. On a miss, roll the dice and throw out a cache line at random. Updates only on misses.

How do you build a random number generator? It is easier than you might think: use a pseudorandom linear feedback shift register (LFSR).

[Diagram: a 5-bit LFSR and its counting sequence, which steps through all 31 non-zero 5-bit states (0x1F, 0x0F, 0x07, 0x13, 0x19, ...) in a scrambled, pseudorandom order before repeating.]

Overhead is O(log2 N) bits per cache (not per set)!
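The slide does not give the LFSR's feedback taps, so this Python sketch assumes a standard maximal-length 5-bit polynomial (x^5 + x^3 + 1); any primitive polynomial gives the same all-31-nonzero-states behavior:

    def lfsr5(state):
        bit = ((state >> 4) ^ (state >> 2)) & 1   # XOR of the tap bits
        return ((state << 1) | bit) & 0x1F        # shift left, feed bit back in

    s, seen = 0x1F, set()
    while s not in seen:
        seen.add(s)
        s = lfsr5(s)
    print(len(seen))   # 31: every non-zero 5-bit state before repeating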

Replacement Strategy vs. Miss Rate (H&P, Figure 5.4)

            2-way             4-way             8-way
Size        LRU      Random   LRU      Random   LRU      Random
16KB        5.18%    5.69%    4.67%    5.29%    4.39%    4.96%
64KB        1.88%    2.01%    1.54%    1.66%    1.39%    1.53%
256KB       1.15%    1.17%    1.13%    1.13%    1.12%    1.12%

FIFO was reported to be worse than random or LRU. There is little difference between random and LRU for larger caches.

Valid Bits

[Diagram: cache with a valid (V) bit per line; only the lines with V = 1 (tags A and B, holding Mem[A] and Mem[B]) are meaningful, the rest are marked "?".]

Problem: how do we ignore cache lines that don't contain REAL or CORRECT values?
- on start-up
- after "back door" changes to memory (e.g., loading a program from disk)

Solution: extend each TAG with a VALID bit. The valid bit must be set for a cache line to HIT. On power-up / reset, clear all valid bits. Set the valid bit when the cache line is first replaced (i.e., first filled with real data).

Cache control feature: flush the cache by clearing all valid bits, under program or external control.

Handling WRITES

Observation: most (80+%) of memory accesses are READs, but writes are essential. How should we handle writes?

Policies:
WRITE-THROUGH: CPU writes are cached, but also written to main memory (stalling the CPU until the write is completed). Memory always holds "the truth."
WRITE-BACK: CPU writes are cached, but not immediately written to main memory. Memory contents can become "stale."

Additional enhancement:
WRITE BUFFERS: for either write-through or write-back, writes to main memory are buffered. The CPU keeps executing while the writes are completed (in order) in the background.

What combination has the highest performance?

Write-Through

ON REFERENCE TO Mem[X]: Look for X among the tags...

HIT: X == TAG(i), for some cache line i
  READ: return DATA(i)
  WRITE: change DATA(i); start write to Mem[X]

MISS: X not found in the TAG of any cache line
  REPLACEMENT SELECTION: select some line k to hold Mem[X]
  READ: read Mem[X]; set TAG(k) = X, DATA(k) = Mem[X]
  WRITE: start write to Mem[X]; set TAG(k) = X, DATA(k) = new Mem[X]

Write-Back

ON REFERENCE TO Mem[X]: Look for X among the tags...

HIT: X == TAG(i), for some cache line i
  READ: return DATA(i)
  WRITE: change DATA(i) only; the write to Mem[X] is deferred

MISS: X not found in the TAG of any cache line
  REPLACEMENT SELECTION: select some line k to hold Mem[X]
  Write back: write DATA(k) to Mem[TAG(k)]
    (costly if the contents of the line were never modified, which motivates dirty bits)
  READ: read Mem[X]; set TAG(k) = X, DATA(k) = Mem[X]
  WRITE: set TAG(k) = X, DATA(k) = new value

Write-Back with Dirty Bits

Add a dirty bit D to each line. Dirty and valid bits are per cache line, not per set.

[Diagram: cache lines with D and V bits; line A is dirty (D = 1), line B is clean (D = 0).]

ON REFERENCE TO Mem[X]: Look for X among the tags...

HIT: X == TAG(i), for some cache line i
  READ: return DATA(i)
  WRITE: change DATA(i); set D(i) = 1 (no immediate write to Mem[X])

MISS: X not found in the TAG of any cache line
  REPLACEMENT SELECTION: select some line k to hold Mem[X]
  If D(k) == 1: write back, i.e., write DATA(k) to Mem[TAG(k)]
  READ: read Mem[X]; set TAG(k) = X, DATA(k) = Mem[X], D(k) = 0
  WRITE: read Mem[X]; set TAG(k) = X, DATA(k) = new value, D(k) = 1

What if the cache has a block size larger than one?
A) If only one word in the line is modified, we end up writing back ALL of its words.
B) On a MISS, we need to READ the line BEFORE we WRITE it (hence the read in the WRITE path above).
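A minimal sketch of this policy, assuming one-word blocks and a single line for clarity; MEM and the Line fields are illustrative stand-ins for main memory and the cache SRAM:

    MEM = {}

    class Line:
        def __init__(self):
            self.valid = self.dirty = False
            self.tag = self.data = None

    def access(line, addr, write=False, value=None):
        if line.valid and line.tag == addr:          # HIT
            if write:
                line.data, line.dirty = value, True  # no memory traffic yet
            return line.data
        if line.valid and line.dirty:                # MISS: write back victim
            MEM[line.tag] = line.data
        line.tag, line.valid = addr, True
        line.data = MEM.get(addr)                    # read the block into the cache
        if write:
            line.data, line.dirty = value, True
        else:
            line.dirty = False
        return line.data

    line = Line()
    access(line, 0x10, write=True, value=42)   # write miss: allocate, mark dirty
    access(line, 0x20)                         # read miss: 42 is written back first
    print(MEM[0x10])                           # 42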

Write Buffers

Write buffers avoid the overhead of waiting for writes to complete. A write is stored in a special hardware address queue, the write buffer, where it is POSTED until the write completes. The buffer is usually at least the size of a cache block. On a subsequent cache miss, you may still need to stall any reads until the outstanding (posted) writes have completed, or you can check whether the missed address matches one in the write buffer. This takes advantage of sequential writes.

Prevailing wisdom:
- Write-back is better than write-through: less memory traffic.
- Always use write buffering.

[Diagram: CPU writes flow through the write buffer to main memory on a miss or dirty write-back; hits are serviced directly by the cache.]
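The buffer's behavior on a read miss (check for a matching posted address before going to memory) can be sketched as follows; the queue, the forwarding choice, and the function names are illustrative, not from the slides:

    from collections import deque

    write_buffer = deque()

    def post_write(addr, data):
        write_buffer.append((addr, data))     # CPU continues immediately

    def drain_one(mem):
        if write_buffer:
            a, d = write_buffer.popleft()     # background: complete oldest write
            mem[a] = d

    def read_miss(addr, mem):
        for a, d in reversed(write_buffer):   # newest matching post wins
            if a == addr:
                return d                      # forward from the buffer
        return mem.get(addr)                  # otherwise go to main memory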

Cache Benchmarking

GOAL: given some cache design, simulate (by hand or machine) execution well enough to estimate the hit ratio.

Suppose this loop is entered with $t3 = 4000:

ADR: Instruction           I-address   D-address
400: lw $t0, 0($t3)        400         4000, 4004, ...
404: addi $t3, $t3, 4      404
408: bne $t0, $0, 400      408

1. Observe that the sequence of memory locations referenced is 400, 4000, 404, 408, 400, 4004, ... We can use this simpler reference string / memory trace, rather than the program, to simulate cache behavior.

2. We can make our life even easier by converting to word addresses: 100, 1000, 101, 102, 100, 1001, ... (Word Addr = Byte Addr / 4)

Simple Cache Simulation

4-line fully-associative, LRU:

Addr   Line#  Miss?
100    0      M
1000   1      M
101    2      M
102    3      M
100    0
1001   1      M
101    2
102    3
100    0
1002   1      M
101    2
102    3
100    0
1003   1      M
101    2
102    3

Steady state: 1/4 miss rate.

4-line direct-mapped:

Addr   Line#  Miss?
100    0      M
1000   0      M
101    1      M
102    2      M
100    0      M
1001   1      M
101    1      M
102    2
100    0
1002   2      M
101    1
102    2      M
100    0
1003   3      M
101    1
102    2

Steady state: 7/16 miss rate.

The first access to each block is a compulsory miss. Misses caused by a cache too small to hold the working set are capacity misses. The direct-mapped cache's extra misses on repeated addresses (100, 101, 102 evicted by data that maps to the same line) are collision misses.
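These hand simulations are easy to check by machine, which is exactly what the slide's GOAL suggests. A Python sketch (assumed helper names) that replays the same word-address trace through both organizations:

    trace = []
    for i in range(4):                           # four loop iterations
        trace += [100, 1000 + i, 101, 102]       # lw fetch, data, addi, bne

    def misses_fa_lru(trace, lines=4):
        order, miss = [], 0
        for a in trace:
            if a in order:
                order.remove(a)
            else:
                miss += 1
                if len(order) == lines:
                    order.pop()                  # evict LRU (back of the list)
            order.insert(0, a)                   # most recently used in front
        return miss

    def misses_dm(trace, lines=4):
        store, miss = {}, 0                      # line index -> resident address
        for a in trace:
            if store.get(a % lines) != a:        # tag mismatch or empty line
                miss += 1
                store[a % lines] = a
        return miss

    print(misses_fa_lru(trace))   # 7 misses / 16  (1/4 in steady state)
    print(misses_dm(trace))       # 10 misses / 16 (7/16 in steady state)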

Cache Simulation: Bout 2

8-line fully-associative, LRU:

Addr   Line#  Miss?
100    0      M
1000   1      M
101    2      M
102    3      M
100    0
1001   4      M
101    2
102    3
100    0
1002   5      M
101    2
102    3
100    0
1003   6      M
101    2
102    3

2-way set-associative, 8 lines total, LRU:

Addr   Set,Way  Miss?
100    0,0      M
1000   0,1      M
101    1,0      M
102    2,0      M
100    0,0
1001   1,1      M
101    1,0
102    2,0
100    0,0
1002   2,1      M
101    1,0
102    2,0
100    0,0
1003   3,0      M
101    1,0
102    2,0

Both caches: 1/4 miss rate in steady state.

Cache Simulation: Bout 3

The first 16 cycles of both caches are identical (the same as the 2-way cache on the previous slide), so we jump to round 2, where the data word addresses are 1004-1007. Hits are omitted from the tables.

2-way, 8 lines total, LRU:

Addr   Set,Way  Miss?
1004   0,1      M
1005   1,1      M
1006   2,1      M
1007   3,1      M

Steady state: 1/4 miss rate (all instruction fetches hit).

2-way, 8 lines total, FIFO:

Addr   Set,Way  Miss?
1004   0,0      M
100    0,1      M
1005   1,0      M
101    1,1      M
1006   2,0      M
102    2,1      M
1007   3,1      M

Steady state: 7/16 miss rate (FIFO evicts the resident instruction lines).

Cache Simulation: Bout 4

2-way, 8 lines total, 1-word blocks, LRU (the misses from Bout 2):

Addr   Set,Way  Miss?
100    0,0      M
1000   0,1      M
101    1,0      M
102    2,0      M
1001   1,1      M
1002   2,1      M
1003   3,0      M

(1/4 miss rate in steady state)

2-way, 4 lines, 2-word blocks, LRU (hits omitted):

Addr        Set,Way  Miss?
100/101     0,0      M
1000/1001   0,1      M
102/103     1,0      M
1002/1003   1,1      M

(1/8 miss rate in steady state)

Wider blocks exploit the spatial locality of the sequential instruction fetches and data accesses: each miss now brings in two useful words.
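Extending the same hand-simulation sketch with a block-size parameter reproduces both tables above (assumed helper; set index and block address computed from word addresses):

    def misses_set_assoc(trace, sets, ways, block_words):
        cache = [[] for _ in range(sets)]   # per set: block addresses, MRU first
        miss = 0
        for a in trace:
            blk = a // block_words          # block address
            s = cache[blk % sets]
            if blk in s:
                s.remove(blk)
            else:
                miss += 1
                if len(s) == ways:
                    s.pop()                 # evict the LRU way
            s.insert(0, blk)
        return miss

    trace = []
    for i in range(4):
        trace += [100, 1000 + i, 101, 102]

    print(misses_set_assoc(trace, sets=4, ways=2, block_words=1))  # 7 misses / 16
    print(misses_set_assoc(trace, sets=2, ways=2, block_words=2))  # 4 misses / 16
                                            # (1/8 miss rate in steady state)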

Cache Design Summary

Various design decisions affect cache performance:

Block size: exploits spatial locality and saves hardware, but if blocks are too large you can load unneeded items at the expense of needed ones.

Replacement strategy: attempts to exploit temporal locality to keep frequently referenced items in the cache.
  LRU: best performance, highest cost.
  FIFO: low performance, economical.
  RANDOM: medium performance, lowest cost; avoids pathological sequences, but performance can vary.

Write policies:
  Write-through: keeps memory and cache consistent, but causes high memory traffic.
  Write-back: allows memory to become STALE, but reduces memory traffic.
  Write buffer: a queue that allows the processor to continue while waiting for writes to finish; reduces stalls.

No simple answers: in the real world, cache designs are based on simulations using memory traces.