Cache Memory and Performance

Many of the following slides are taken with permission from the Complete PowerPoint Lecture Notes for Computer Systems: A Programmer's Perspective (CS:APP), Randal E. Bryant and David R. O'Hallaron, http://csapp.cs.cmu.edu/public/lectures.html. The book is used explicitly in CS 2505 and CS 3214 and as a reference in CS 2506.

Cache Memories

Cache memories are small, fast SRAM-based memories managed automatically in hardware. They hold frequently accessed blocks of main memory. The CPU looks first for data in the caches (e.g., L1, L2, and L3), then in main memory.

Typical system structure: the CPU chip contains the register file, the ALU, and the cache memories; a bus interface connects it over the system bus to an I/O bridge, which connects over the memory bus to main memory.

General Cache Organization (S, E, B)

A cache is an array of S = 2^s sets. Each set contains E = 2^e lines (blocks). Each line holds a valid bit, a tag, and B = 2^b bytes of data (the cache block).

Cache size: C = S x E x B data bytes.
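
To see how these parameters are used, here is a minimal sketch of how a memory address would be split into tag, set index, and block offset. The 64-bit address width and the values s = 6, b = 6 (64 sets of 64-byte blocks) are illustrative assumptions, not taken from the slides:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical geometry: S = 2^s sets, B = 2^b bytes per block.
       The low b bits are the block offset, the next s bits are the
       set index, and the remaining high bits are the tag. */
    #define S_BITS 6   /* s: 64 sets          */
    #define B_BITS 6   /* b: 64-byte blocks   */

    int main(void) {
        uint64_t addr   = 0x7ffe12345678;
        uint64_t offset = addr & ((1ULL << B_BITS) - 1);
        uint64_t set    = (addr >> B_BITS) & ((1ULL << S_BITS) - 1);
        uint64_t tag    = addr >> (B_BITS + S_BITS);
        printf("tag=%#llx set=%llu offset=%llu\n",
               (unsigned long long)tag, (unsigned long long)set,
               (unsigned long long)offset);
        return 0;
    }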

Cache Organization Types

The "geometry" of the cache is defined by:
- S = 2^s, the number of sets in the cache
- E = 2^e, the number of lines (blocks) in a set
- B = 2^b, the number of bytes in a line (block)

E = 1 (e = 0): direct-mapped cache; only one possible location in the cache for each DRAM block.

E = K > 1 (with S > 1): K-way set-associative cache; K possible locations (in the same cache set) for each DRAM block.

S = 1 (only one set), E = number of cache blocks: fully associative cache; each DRAM block can be at any location in the cache.

Cache Performance Metrics

Miss rate: fraction of memory references not found in the cache (# misses / # accesses) = 1 - hit rate. Typical miss rates:
- 3-10% for L1
- can be quite small (e.g., < 1%) for L2, depending on cache size and locality

Hit time: time to deliver a line in the cache to the processor, including the time to determine whether the line is in the cache. Typical times:
- 1-2 clock cycles for L1
- 5-20 clock cycles for L2

Miss penalty: additional time required for a data access because of a cache miss; typically 50-200 cycles for main memory. The trend is toward an increasing number of cycles. Why?

Calculating Average Access Time

Let's say that we have two levels of cache, backed by DRAM:
- L1 cache costs 1 cycle to access and has a miss rate of 10%
- L2 cache costs 10 cycles to access and has a miss rate of 2%
- DRAM costs 80 cycles to access (and has a miss rate of 0%)

Then the average memory access time (AMAT) would be:

      1                  always access the L1 cache
    + 0.10 * 10          probability of a miss in L1 * time to access L2
    + 0.10 * 0.02 * 80   probability of a miss in L1 * probability of a miss in L2 * time to access DRAM
    = 2.16 cycles
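
The same calculation is easy to parameterize. The following helper is a sketch of the AMAT formula above using the slide's numbers; the function and parameter names are our own:

    #include <stdio.h>

    /* Average memory access time for a two-level cache hierarchy,
       per the formula above (illustrative numbers, not measured). */
    static double amat(double l1_time, double l1_miss,
                       double l2_time, double l2_miss,
                       double dram_time) {
        return l1_time
             + l1_miss * l2_time
             + l1_miss * l2_miss * dram_time;
    }

    int main(void) {
        printf("AMAT = %.2f cycles\n",
               amat(1, 0.10, 10, 0.02, 80));   /* prints 2.16 */
        return 0;
    }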

Let's see that again

Let's say that we have two levels of cache, backed by DRAM:
- L1 cache costs 1 cycle to access and has a miss rate of 10%
- L2 cache costs 10 cycles to access and has a miss rate of 2%
- DRAM costs 80 cycles to access (and has a miss rate of 0%)

On a memory access, going outward from the CPU:
- L1: we go here 100% of the time; 1 cycle to access; average penalty 1 cycle
- L2: we go here 10% of the time; 10 cycles to access; average penalty 1 cycle
- DRAM: we go here 0.2% of the time; 80 cycles to access; average penalty 0.16 cycles

Total: 2.16 cycles on average.

Let's think about those numbers

There can be a huge difference between the cost of a hit and a miss: it could be 100x with just an L1 and main memory.

Would you believe 99% hits is twice as good as 97%? Consider:
- L1 cache hit time of 1 cycle
- L1 miss penalty of 100 cycles (to DRAM)

Average access time:
- 97% L1 hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
- 99% L1 hits: 1 cycle + 0.01 * 100 cycles = 2 cycles

Measuring Cache Performance

Components of CPU time:
- Program execution cycles (includes cache hit time)
- Memory stall cycles (mainly from cache misses)

With simplifying assumptions:

    Memory stall cycles = (Memory accesses / Program) * Miss rate * Miss penalty
                        = (Instructions / Program) * (Misses / Instruction) * Miss penalty

Cache Performance Example

Given:
- Instruction-cache miss rate = 2%
- Data-cache miss rate = 4%
- Miss penalty = 100 cycles
- Base CPI (with ideal cache performance) = 2
- Loads & stores are 36% of instructions

Miss cycles per instruction:
- Instruction-cache: 0.02 * 100 = 2
- Data-cache: 0.36 * 0.04 * 100 = 1.44

Actual CPI = 2 + 2 + 1.44 = 5.44
- The ideal-cache machine would be 5.44/2 = 2.72 times faster.
- We spend 3.44/5.44 = 63% of our execution time on memory stalls!

Reduce Ideal CPI

What if we improved the datapath so that the base (ideal) CPI was reduced?
- Instruction-cache miss rate = 2%
- Data-cache miss rate = 4%
- Miss penalty = 100 cycles
- Base CPI (with ideal cache performance) = 1.5
- Loads & stores are 36% of instructions

Miss cycles per instruction are still the same as before.

Actual CPI = 1.5 + 2 + 1.44 = 4.94
- The ideal-cache machine would now be 4.94/1.5 = 3.29 times faster.
- We now spend 3.44/4.94 = 70% of our execution time on memory stalls!
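
Both of these examples follow the same stall-cycle formula. The sketch below (our own code, using the slides' illustrative values) reproduces the two actual-CPI figures and the stall shares:

    #include <stdio.h>

    /* CPI including memory stalls, per the simplifying model above:
       stalls/instruction = I-miss rate * penalty
                          + load-store fraction * D-miss rate * penalty */
    int main(void) {
        double i_miss = 0.02, d_miss = 0.04, penalty = 100, ldst = 0.36;
        double stalls = i_miss * penalty + ldst * d_miss * penalty; /* 2 + 1.44 */
        double base_cpi[] = { 2.0, 1.5 };   /* the two slides' base CPIs */
        for (int i = 0; i < 2; i++) {
            double cpi = base_cpi[i] + stalls;
            printf("base CPI %.1f -> actual CPI %.2f, stall share %.0f%%\n",
                   base_cpi[i], cpi, 100 * stalls / cpi);
        }
        return 0;
    }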

Performance Summary

When CPU performance increases, the effect of the miss penalty becomes more significant:
- Decreasing the base CPI means a greater proportion of time is spent on memory stalls.
- Increasing the clock rate means memory stalls account for more CPU cycles.

We can't neglect cache behavior when evaluating system performance.

Multilevel Cache Considerations

Primary (L1) cache: focus on minimal hit time.

L2 cache: focus on a low miss rate to avoid main-memory accesses; hit time has less overall impact.

Results:
- The L1 cache is usually smaller than a single-level cache would be.
- The L1 block size is smaller than the L2 block size.

Intel Core i7 Cache Hierarchy

Each core in the processor package has its own registers, L1 d-cache, L1 i-cache, and L2 unified cache; an L3 unified cache is shared by all cores and backed by main memory.

2700-series parameters:
- L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles
- L2 unified cache: 256 KB, 8-way, access: 11 cycles
- L3 unified cache: 8 MB, 16-way, access: 25-40 cycles
- Block size: 64 bytes for all caches
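
Combined with C = S x E x B, these figures determine each cache's geometry. As a quick sketch (our own code, using only the slide's parameters), the number of sets per level is C / (E x B):

    #include <stdio.h>

    /* Derive the number of sets for each i7 cache from C = S * E * B. */
    int main(void) {
        struct { const char *name; unsigned size, ways, block; } caches[] = {
            { "L1", 32 * 1024,       8,  64 },
            { "L2", 256 * 1024,      8,  64 },
            { "L3", 8 * 1024 * 1024, 16, 64 },
        };
        for (int i = 0; i < 3; i++)
            printf("%s: S = %u sets\n", caches[i].name,
                   caches[i].size / (caches[i].ways * caches[i].block));
        return 0;   /* prints 64, 512, and 8192 sets */
    }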

i7 Sandy Bridge (figure-only slide)

What about writes?

Multiple copies of the data may exist:
- L1
- L2
- DRAM
- Disk

Remember: each level of the hierarchy is a subset of the one below it.

Suppose we write to a data block that's in L1:
- If we update only the copy in L1, then we will have multiple, inconsistent versions!
- If we update all the copies, we'll incur a substantial time penalty!

And what if we write to a data block that's not in L1?

What about writes?

What to do on a write hit?
- Write-through: write immediately to memory.
- Write-back: defer the write to memory until the line is replaced. This requires a dirty bit, which records whether the cached line differs from memory.

What about writes?

What to do on a write miss?
- Write-allocate: load the block into the cache and update the line in the cache. Good if more writes to the location follow.
- No-write-allocate: write immediately to memory, bypassing the cache.

Typical combinations:
- Write-through + No-write-allocate
- Write-back + Write-allocate
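
To make the write-back + write-allocate combination concrete, here is a toy, single-line "cache" in C. Everything here (the one-line geometry, the names, the sizes) is invented for illustration; it is a sketch of the policy, not a real cache implementation:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* One cache line of 64 bytes, backed by a 64 KB "DRAM" array. */
    struct line { bool valid, dirty; unsigned tag; unsigned char data[64]; };
    static struct line cache_line;
    static unsigned char memory[1 << 16];

    static void write_byte(unsigned addr, unsigned char val) {
        unsigned tag = addr / 64, offset = addr % 64;
        if (!(cache_line.valid && cache_line.tag == tag)) {  /* write miss */
            if (cache_line.valid && cache_line.dirty)        /* flush dirty victim */
                memcpy(&memory[cache_line.tag * 64], cache_line.data, 64);
            memcpy(cache_line.data, &memory[tag * 64], 64);  /* write-allocate: fetch block */
            cache_line.valid = true;
            cache_line.tag = tag;
        }
        cache_line.data[offset] = val;  /* update the cache only... */
        cache_line.dirty = true;        /* ...and defer memory until eviction */
    }

    int main(void) {
        write_byte(100, 42);
        /* The cache holds the new value; memory is stale until eviction. */
        printf("dirty=%d memory[100]=%d\n", cache_line.dirty, memory[100]);
        return 0;   /* prints dirty=1 memory[100]=0 */
    }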