CS 136: Advanced Architecture. Review of Caches

Why Caches? Basic goal: the size of the cheapest memory... at the speed of the most expensive. Locality makes it work. Temporal locality: if you reference x, you'll probably use it again. Spatial locality: if you reference x, you'll probably reference x+1. To get the required low cost, we must understand what goes into performance.
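
As an illustration (a hypothetical C fragment, not from the slides), a simple array sum exhibits both kinds of locality:

    /* Hypothetical illustration of locality (not part of the original slides).
       Summing an array touches consecutive elements (spatial locality),
       while the accumulator `sum` is reused on every iteration (temporal locality). */
    #include <stddef.h>

    double sum_array(const double *a, size_t n)
    {
        double sum = 0.0;               /* reused every iteration: temporal locality */
        for (size_t i = 0; i < n; i++)
            sum += a[i];                /* a[i], a[i+1], ... are adjacent: spatial locality */
        return sum;
    }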

Cache Performance Review. Memory accesses cause CPU stalls, which affect total run time: CPU execution time = (CPU clock cycles + Memory stall cycles) × Clock cycle time. This assumes CPU clock cycles already include cache hits. In this form, it is difficult to measure.

Measuring Stall Cycles. Stall cycles are controlled by misses and the miss penalty:
Memory stall cycles = Number of misses × Miss penalty
 = IC × (Misses / Instruction) × Miss penalty
 = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty
All the items above are easily measured. Some people prefer misses per instruction rather than misses per (memory) access.
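
As a sketch of the formula in use, the following C fragment plugs in made-up values for IC, accesses per instruction, miss rate, and miss penalty (none of these numbers come from the slides):

    /* Hypothetical worked example of the stall-cycle formula above.
       All numbers (IC, accesses per instruction, miss rate, miss penalty, base CPI)
       are made up for illustration. */
    #include <stdio.h>

    int main(void)
    {
        double ic                 = 1e9;   /* instruction count (IC) */
        double accesses_per_instr = 1.36;  /* memory accesses per instruction */
        double miss_rate          = 0.02;  /* misses per memory access */
        double miss_penalty       = 200.0; /* clocks per miss */
        double cpi_base           = 1.0;   /* CPU clock cycles per instruction, hits included */

        double stall_cycles = ic * accesses_per_instr * miss_rate * miss_penalty;
        double total_cycles = ic * cpi_base + stall_cycles;
        printf("stall cycles = %.3g, total cycles = %.3g\n", stall_cycles, total_cycles);
        return 0;
    }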

Four Questions in Cache Design. Cache design is controlled by four questions:
Q1. Where to place a block? (block placement)
Q2. How to find a block? (block identification)
Q3. Which block to replace on a miss? (block replacement)
Q4. What happens on a write? (write strategy)

Block Placement Strategies. Where do we put a block? The most general method: the block can go anywhere (fully associative). The most restrictive method: the block can go in only one place (direct-mapped), usually at the address modulo the cache size in blocks. The mixture: set associative, where the block can go in a limited number of places inside one (sub)set of cache blocks. The set is chosen as the address modulo the number of sets. The size of a set is the "way" of the cache; e.g., a 4-way set-associative cache has 4 blocks per set. Direct-mapped is just 1-way set associative; fully associative is m-way if the cache has m blocks.

Block Identification. How do we find a block? The address is divided (bitwise, from the right) into block offset, set index, and tag. The block offset is log2(b) bits, where b is the number of bytes in a block. The set index is log2(s) bits, where s is the number of sets in the cache, given by s = cache size / (b × way). (The set index is 0 bits in a fully associative cache.) The tag is the remaining address bits. To find a block, index into the set, compare tags, and check the valid bit.
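
A hypothetical sketch of this address split in C; the geometry (32 KB, 64-byte blocks, 4-way, hence 128 sets) is an illustrative assumption, not a cache from the course:

    /* Hypothetical sketch of splitting an address into tag / set index / block offset.
       The geometry (32 KB, 64-byte blocks, 4-way) is an illustrative assumption. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        const uint32_t cache_size = 32 * 1024;  /* bytes */
        const uint32_t block_size = 64;         /* b: bytes per block */
        const uint32_t ways       = 4;          /* associativity */
        const uint32_t sets       = cache_size / (block_size * ways); /* s = cache size / (b * way) = 128 */

        const uint32_t offset_bits = 6;         /* log2(64)  */
        const uint32_t index_bits  = 7;         /* log2(128) */

        uint32_t addr   = 0x12345678u;
        uint32_t offset = addr & (block_size - 1);
        uint32_t index  = (addr >> offset_bits) & (sets - 1);
        uint32_t tag    = addr >> (offset_bits + index_bits);

        printf("sets=%u offset=0x%x index=0x%x tag=0x%x\n", sets, offset, index, tag);
        return 0;
    }

With these parameters the offset is the low 6 bits, the index the next 7, and the tag everything above them, exactly as the formula predicts.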

Block Replacement: LRU. How do we pick a block to replace? If direct-mapped, there is only one place a block can go, so kick out the previous occupant. Otherwise, the ideal is to evict whatever will go unused longest in the future; crystal balls don't fit on modern chips... The best approximation is LRU (least recently used). Temporal locality makes it a good predictor of the future. It is easy to implement in a 2-way cache (why?) but hard to do beyond 2-way (again, why?).
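
A minimal software sketch of LRU bookkeeping (not from the slides; real hardware uses much cheaper encodings, e.g., a single bit suffices for 2-way): keep a last-used counter per block in the set and evict the smallest.

    /* Hypothetical LRU sketch: per-block "last used" timestamps within one set.
       Hardware would not store full counters; this only shows the policy. */
    #include <stdint.h>

    #define WAYS 4

    struct set {
        uint32_t tag[WAYS];
        uint64_t last_used[WAYS];   /* updated on every hit */
        int      valid[WAYS];
    };

    /* Pick the victim way: an invalid block if any, otherwise the least recently used. */
    int pick_victim(const struct set *s)
    {
        int victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!s->valid[w])
                return w;                               /* free slot: no eviction needed */
            if (s->last_used[w] < s->last_used[victim])
                victim = w;                             /* older than current candidate */
        }
        return victim;
    }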

Block Replacement: Approximations. LRU is hard (beyond 2-way), so we need simpler schemes that perform well. FIFO: replace the block that was loaded longest ago. Implementable with a shift register; behaves surprisingly well with small caches (16K). Implication: temporal locality is small? Random: what the heck, just flip a coin. Implementable with a PRNG or just the low bits of a CPU clock counter; better than FIFO with large caches.

Speeding Up Reads. Reads dominate writes: writes are 28% of data traffic, and only 7% of overall traffic. The common case is reads, so optimize that; but Amdahl's Law still applies! When reading, pull out the data along with the tag, and discard the data if the tag doesn't match.

Write Strategies. Two policies: write-through and write-back. Write-through is simple to implement, but it increases memory traffic and causes a stall on every write (unless buffered); it simplifies coherency (important for I/O as well as SMP). Write-back reduces traffic, especially on multiple writes to the same block, and causes no stalls on normal writes. It requires an extra dirty bit in the cache... and a long stall on a read if a dirty block must be replaced. Memory is often out of date, so coherency becomes extremely complex.

Subtleties of Write Strategies. Suppose you write to something that's not currently in the cache? Write allocate: put it in the cache. For most cache block sizes, this means a read miss to fetch the parts that weren't written. No write allocate: just go straight to memory. This assumes no spatial or temporal locality, causes excess memory traffic on multiple writes, and is not very sensible with write-back caches.
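
A rough simulator-style sketch in C (entirely hypothetical; the sizes and direct-mapped organization are assumptions) contrasting write-back with write-allocate against write-through with no-write-allocate:

    /* Hypothetical sketch of the write path in a tiny direct-mapped cache model,
       contrasting write-back + write-allocate with write-through + no-write-allocate.
       Memory is modeled as a plain array; all sizes are illustrative. */
    #include <stdint.h>
    #include <string.h>

    #define LINES      64
    #define BLOCK_SIZE 16

    static uint8_t memory[1 << 16];              /* 64 KB of backing "DRAM" */

    struct line { uint32_t tag; int valid, dirty; uint8_t data[BLOCK_SIZE]; };
    static struct line cache[LINES];

    static void write_byte(uint32_t addr, uint8_t val, int write_back_allocate)
    {
        uint32_t offset = addr % BLOCK_SIZE;
        uint32_t index  = (addr / BLOCK_SIZE) % LINES;
        uint32_t tag    = addr / (BLOCK_SIZE * LINES);
        struct line *l  = &cache[index];
        int hit = l->valid && l->tag == tag;

        if (write_back_allocate) {
            if (!hit) {                           /* write-allocate: fetch the block first */
                if (l->valid && l->dirty)         /* write back the victim if it is dirty */
                    memcpy(&memory[(l->tag * LINES + index) * BLOCK_SIZE],
                           l->data, BLOCK_SIZE);
                memcpy(l->data, &memory[addr - offset], BLOCK_SIZE);
                l->tag = tag; l->valid = 1;
            }
            l->data[offset] = val;
            l->dirty = 1;                         /* memory is updated only on eviction */
        } else {
            if (hit)
                l->data[offset] = val;            /* keep the cached copy consistent on a hit */
            memory[addr] = val;                   /* write-through: memory is always updated */
            /* no-write-allocate: a miss does not bring the block into the cache */
        }
    }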

Optimizing Cache Performance. Cache performance (especially miss rate) can have dramatic overall effects. A near-100% hit rate means about 1 clock per memory access; a near-0% hit rate can mean 150-200 clocks: a 200x performance drop! Amdahl's Law says you don't have to stray far from 100% to get big changes.
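
A tiny sketch of how fast things degrade, assuming a 1-clock hit and a 200-clock miss penalty (the same numbers used later in the lecture):

    /* Hypothetical illustration: average access time vs. hit rate,
       assuming a 1-clock hit and a 200-clock miss penalty. */
    #include <stdio.h>

    int main(void)
    {
        const double hit_time = 1.0, miss_penalty = 200.0;
        const double hit_rates[] = { 1.00, 0.99, 0.95, 0.90, 0.50 };

        for (int i = 0; i < 5; i++) {
            double miss_rate = 1.0 - hit_rates[i];
            double avg = hit_time + miss_rate * miss_penalty;
            printf("hit rate %.2f -> %.1f clocks per access\n", hit_rates[i], avg);
        }
        return 0;
    }

Under these assumptions, even a 1% miss rate already triples the average access time.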

Effect of Miss Rate. [Figure: relative performance (y-axis, 0 to 1) versus per-instruction miss rate (x-axis, 0 to 0.1), assuming a 200-cycle miss penalty.]

Complications of Miss Rates. It is common to separate the L1 instruction and data caches. This allows an I-fetch and a D-fetch on the same clock, and is cheaper than dual-porting the L1 cache. Separate caches have slightly higher miss rates, but the net penalty can be less. Why?

The Math of Separate Caches. Assume a 200-clock miss penalty and a 1-clock hit. Separate 16K I- and D-caches can service both an instruction fetch and a data access in one cycle. A unified 32K cache takes an extra clock if a data access and an instruction fetch arrive on the same cycle (i.e., on any data access).

The Math of Separate Caches (cont'd). MR_16KI = 0.004 (from Fig. C.6). 36% of instructions transfer data, so MR_16KD = 0.0409 (data misses per instruction) / 0.36 = 0.114. 74% of accesses are instruction fetches, so MR_split = 0.74 × 0.004 + 0.26 × 0.114 = 0.0326. The unified cache gets an access on every instruction fetch, plus another access for the 36% of instructions that transfer data, so MR_U = 0.0433 / 1.36 = 0.0318. The unified cache misses less. BUT... miss rate isn't what counts: only total time matters.

The Truth about Separate Caches. Average access time is given by:
t = %instrs × (t_hit + MR_I × Miss penalty) + %data × (t_hit + MR_D × Miss penalty)
For split caches, this is:
t = 0.74 × (1 + 0.004 × 200) + 0.26 × (1 + 0.114 × 200) = 0.74 × 1.80 + 0.26 × 23.80 = 7.52
For a unified cache (with the extra clock on data accesses), this is:
t = 0.74 × (1 + 0.0318 × 200) + 0.26 × (1 + 1 + 0.0318 × 200) = 0.74 × 7.36 + 0.26 × 8.36 = 7.62
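
The same comparison as a short C sketch, using the numbers from the slide above:

    /* Reproducing the split-vs-unified average-access-time comparison above. */
    #include <stdio.h>

    int main(void)
    {
        const double hit = 1.0, penalty = 200.0;
        const double f_instr = 0.74, f_data = 0.26;   /* fraction of accesses */
        const double mr_i = 0.004, mr_d = 0.114;      /* split 16K I- and D-cache miss rates */
        const double mr_u = 0.0318;                   /* unified 32K miss rate */

        double t_split   = f_instr * (hit + mr_i * penalty)
                         + f_data  * (hit + mr_d * penalty);
        double t_unified = f_instr * (hit + mr_u * penalty)
                         + f_data  * (1.0 + hit + mr_u * penalty); /* +1 clock structural stall */
        printf("split: %.2f clocks/access, unified: %.2f clocks/access\n",
               t_split, t_unified);
        return 0;
    }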

Out-of-Order Execution. How should we define the miss penalty? The full latency of the fetch, or the exposed (non-overlapped) latency when the CPU actually stalls? We'll prefer the latter... but then how do we decide when a stall happens, and how do we measure the miss penalty? It is very difficult to characterize as percentages: a slight change could expose latency elsewhere.

Six Basic Cache Optimizations. Average access time = Hit time + Miss rate × Miss penalty. Six easy ways to improve the above equation:
Reducing miss rate:
1. Larger block size (compulsory misses)
2. Larger cache size (capacity misses)
3. Higher associativity (conflict misses)
Reducing miss penalty:
4. Multilevel caches
5. Prioritize reads over writes
Reducing hit time:
6. Avoiding address translation

Types of Misses. Uniprocessor misses fall into three categories. Compulsory: the first access ever to a location. Capacity: the program accesses more than will fit in the cache. Conflict: too many blocks map to a given set; sometimes called a collision miss, and it can't happen in fully associative caches.

Reducing Compulsory Misses. 1. Use a larger block size: brings in more data per miss, but unfortunately can increase conflict and capacity misses. 2. Reduce invalidations: requires OS changes; a PID in the cache tag can help.

Reducing Capacity Misses. Increase the size of the cache. The drawbacks: higher cost, more power draw, and possibly a longer hit time.

Reducing Conflict Misses. Increase associativity: gives the CPU more places to put things. But it increases cost and slows the CPU clock, which may outweigh the gain from the reduced miss rate. Sometimes direct-mapped may be better!

Multilevel Caches. Average access time = Hit time + Miss rate × Miss penalty. Halving the miss penalty gives the same benefit as halving the miss rate, and the former may be easier to do. Insert an L2 cache between L1 and memory; to a first approximation, everything hits in L2. This also lets L1 be smaller and simpler (i.e., faster). Problem: L2 only sees the bad memory accesses, with poor locality and poor predictability, so L2 must be significantly bigger than L1.

Analyzing Multilevel Caches.
Average access time = Hit time_L1 + Miss rate_L1 × Miss penalty_L1
 = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)
We're back where we started, except now all the L2 badness is reduced by a factor of Miss rate_L1. But now Miss rate_L2 is much higher because of the bad accesses it sees.
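
A short sketch of the two-level formula with illustrative numbers (the L2 hit time, miss rates, and DRAM penalty below are assumptions, not from the slides):

    /* Hypothetical two-level AMAT calculation; all parameters are made up. */
    #include <stdio.h>

    int main(void)
    {
        const double hit_l1 = 1.0,  miss_rate_l1 = 0.05;   /* misses per L1 access */
        const double hit_l2 = 10.0, miss_rate_l2 = 0.40;   /* local L2 miss rate */
        const double penalty_l2 = 200.0;                   /* DRAM access */

        double penalty_l1 = hit_l2 + miss_rate_l2 * penalty_l2;   /* what an L1 miss costs */
        double amat = hit_l1 + miss_rate_l1 * penalty_l1;
        printf("L1 miss penalty = %.1f clocks, AMAT = %.1f clocks\n", penalty_l1, amat);
        return 0;
    }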

Multilevel Inclusion. Should L2 contain everything that's in L1? Pro: an SMP can talk just to L2. Con: block sizes must be identical, or complexity rises. Example with a larger L2 block: 1. L1 caches A0; L2 caches A0 and A1 together in one block. 2. A CPU miss on B1 replaces A1 in L1, but replaces both A0 and A1 in L2. 3. Inclusion is now violated! L1 has A0 and B1; L2 has B0 and B1. Con: if inclusion is violated, SMP coherency is more complicated. Alternative: explicit exclusion: if a block is moved into L1, it is evicted from L2. Upshot: L1 and L2 must be designed together.

Read Priority. The memory system can buffer and delay writes; the CPU can continue working... but the CPU must wait for reads to complete. Obvious fix: don't make reads wait for writes. Problem: this introduces RAW hazards, so a read must now check the write buffer for dirty blocks.
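
A minimal sketch of the check a prioritized read miss must make before going to memory (the buffer size, block size, and structure are all assumptions, not the slides' design):

    /* Hypothetical write-buffer check on a read miss: if a buffered (not-yet-written)
       entry matches the requested block, the read must take its data from the buffer
       (or wait for it to drain) to avoid a RAW hazard. */
    #include <stdint.h>

    #define WB_ENTRIES 8
    #define BLOCK_SIZE 64

    struct wb_entry { uint32_t block_addr; int valid; uint8_t data[BLOCK_SIZE]; };
    static struct wb_entry write_buffer[WB_ENTRIES];

    /* Returns the matching write-buffer entry for this read address, or -1 if none. */
    int wb_lookup(uint32_t read_addr)
    {
        uint32_t block_addr = read_addr / BLOCK_SIZE;
        for (int i = 0; i < WB_ENTRIES; i++)
            if (write_buffer[i].valid && write_buffer[i].block_addr == block_addr)
                return i;       /* RAW hazard: read must be serviced from / after this entry */
        return -1;              /* safe to send the read miss to memory ahead of the writes */
    }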

Virtually Addressed Caches. Address translation takes time, and we can't look up in the cache until translation is done. Solutions:
1. Make sure the set index comes from the page-offset portion of the virtual address. We can then look up the set and fetch the tag during translation, waiting only for the comparison (which must be done in any case); data can even be fetched on the assumption that the comparison will succeed. This limits the cache size (for a given number of ways).
2. Index the cache directly from the virtual address. Aliasing becomes a problem. If aliasing is limited, all possible aliases can be checked (e.g., Opteron). Page coloring requires aliases to differ only in upper bits, so all aliases map to the same cache block.
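
A quick sketch of the size limit in solution 1: if the set-index and block-offset bits must fit within the page offset, the cache can hold at most page size × ways bytes. The 4 KB page size below is an assumption; the slides don't state one.

    /* Hypothetical illustration of the size limit for a virtually indexed,
       physically tagged cache: index + offset bits must come from the page offset.
       4 KB pages are an assumption, not stated in the slides. */
    #include <stdio.h>

    int main(void)
    {
        const unsigned page_size = 4096;   /* bytes covered by the page offset */
        for (unsigned ways = 1; ways <= 16; ways *= 2)
            printf("%2u-way: max cache size = %u KB\n", ways, page_size * ways / 1024);
        return 0;
    }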

Summary of Cache Optimizations.

Technique           Hit time   Miss penalty   Miss rate   HW cost   Comment
Large block size                              +           0         Trivial: P4 L2 is 128B
Large cache size                              +           1         Widely used, esp. L2
High associativity                            +           1         Widely used
Multilevel caches              +                          2         Costly but widely used
Read priority                  +                          1         Often used
Addr. translation   +                                     1         Often used