Improving Cache Performance

Improving Cache Performance
Tuesday 27 October 2015
Many slides adapted from: Computer Organization and Design, Patterson & Hennessy, 5th Edition, 2014, MK, and from Prof. Mary Jane Irwin, PSU

Summary
Previous class: memory hierarchy; caches
Today: improving cache performance

Terminology
Block (or line): the minimum unit of information that is present (or not) in a cache
Hit rate: the fraction of memory accesses found in a level of the memory hierarchy
Miss rate: the fraction of memory accesses not found in a level of the memory hierarchy = 1 - (hit rate)
Miss penalty: the time to replace a block in that level with the corresponding block from a lower level of the memory hierarchy

Measuring Cache Performance
Components of CPU time:
- Program execution cycles (includes cache hit time)
- Memory stall cycles (mainly from cache misses)
Assuming cache hit costs are included as part of the normal CPU execution cycle:
CPU time = IC × CPI × CC = IC × (CPI_ideal + memory-stall cycles per instruction) × CC = IC × CPI_stall × CC
Assuming a combined read/write miss rate:
Memory-stall cycles = (memory accesses / program) × miss rate × miss penalty

Cache Performance Example
Given:
- I-cache miss rate = 2%
- D-cache miss rate = 4%
- Miss penalty = 100 cycles
- Base CPI (ideal cache) = 2
- Loads & stores are 36% of instructions
Miss cycles per instruction:
- I-cache: 0.02 × 100 = 2
- D-cache: 0.36 × 0.04 × 100 = 1.44
Actual CPI = 2 + 2 + 1.44 = 5.44
The CPU with an ideal cache would be 5.44/2 = 2.72 times faster.
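A minimal sketch (not from the slides) that reproduces this calculation; the function name and parameters are illustrative.

```python
def cpi_with_stalls(base_cpi, icache_miss_rate, dcache_miss_rate,
                    mem_refs_per_instr, miss_penalty):
    """Effective CPI = base CPI + memory-stall cycles per instruction."""
    icache_stalls = icache_miss_rate * miss_penalty                       # one fetch per instruction
    dcache_stalls = mem_refs_per_instr * dcache_miss_rate * miss_penalty  # loads and stores only
    return base_cpi + icache_stalls + dcache_stalls

cpi = cpi_with_stalls(2, 0.02, 0.04, 0.36, 100)
print(cpi, cpi / 2)  # 5.44, i.e. the ideal-cache CPU would be 2.72x faster
```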

Average Access Time
Hit time is also important for performance.
Average memory access time (AMAT):
AMAT = Hit time + Miss rate × Miss penalty
Example: a CPU with a 1 ns clock, hit time = 1 cycle, miss penalty = 20 cycles, and I-cache miss rate = 5%
AMAT = 1 + 0.05 × 20 = 2 ns (2 cycles per instruction)
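Below is a minimal sketch (not from the slides) of the AMAT formula applied to this example; the function name and argument units are illustrative.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# 1-cycle hit, 5% miss rate, 20-cycle miss penalty -> 2.0 cycles (2 ns at 1 ns/cycle)
print(amat(1, 0.05, 20))
```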

Performance Summary
When CPU performance increases, the miss penalty becomes more significant.
Decreasing the base CPI: a greater proportion of time is spent on memory stalls.
Increasing the clock rate: memory stalls account for more CPU cycles.
We can't neglect cache behavior when evaluating system performance.

Reducing Cache Miss Rates: Associative Caches
Allow more flexible block placement:
- In a direct-mapped cache, a memory block maps to exactly one cache block.
- At the other extreme, we can allow a memory block to be mapped to any cache block: a fully associative cache.
- A compromise is to divide the cache into sets, each of which consists of n ways (n-way set associative). A memory block maps to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices).

Associative Caches
Fully associative:
- Allows a given block to go in any cache entry
- Requires all entries to be searched at once
- One comparator per entry (expensive)
n-way set associative:
- Each set contains n entries
- Block number determines the set: (block number) modulo (#sets in cache)
- Search all entries in a given set at once
- n comparators (less expensive)
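As a concrete illustration (not from the slides), the sketch below splits a byte address into tag, set index, and block offset for a set-associative cache; the 64-byte block size and 64 sets are assumed values, not taken from the lecture.

```python
def split_address(addr, block_size=64, num_sets=64):
    """Return (tag, set index, block offset) for a byte address."""
    offset = addr % block_size              # position of the byte within the block
    block_number = addr // block_size
    index = block_number % num_sets         # (block number) modulo (#sets in cache)
    tag = block_number // num_sets          # remaining upper bits identify the block
    return tag, index, offset

print(split_address(0x12345678))  # (74565, 25, 56)
```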

Associative Cache Example (figure)

Spectrum of Associativity
For a cache with 8 entries: (figure)

Direct Mapped Cache - Example
Consider the main memory word reference string 0 4 0 4 0 4 0 4, starting with an empty cache (all blocks initially marked as not valid).
Every reference misses: blocks 0 and 4 map to the same cache block, so each access evicts the other.
8 requests, 8 misses.
Ping-pong effect due to conflict misses: two memory locations that map into the same cache block keep replacing each other.

2-Way Set Associative Cache - Example
Consider the same reference string 0 4 0 4 0 4 0 4, starting with an empty cache (all blocks initially marked as not valid).
Only the first two references miss; once blocks 0 and 4 are both resident in the same set, every subsequent access hits.
8 requests, 2 misses.
Solves the ping-pong effect of the direct-mapped cache: two memory locations that map into the same cache set can now co-exist.
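The sketch below (not from the slides) replays this reference string through a tiny 4-block cache configured as direct mapped and as 2-way set associative, assuming one-word blocks and LRU replacement within a set.

```python
def simulate(refs, num_blocks=4, ways=1):
    """Count misses for a word-address reference string."""
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]    # each set holds at most `ways` block numbers
    misses = 0
    for addr in refs:
        s = sets[addr % num_sets]           # set index = block number mod #sets
        if addr in s:
            s.remove(addr)                  # hit: refresh LRU order
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)                    # evict the least recently used block
        s.append(addr)                      # most recently used goes to the back
    return misses

refs = [0, 4, 0, 4, 0, 4, 0, 4]
print(simulate(refs, ways=1))  # 8 misses: direct mapped ping-pongs between 0 and 4
print(simulate(refs, ways=2))  # 2 misses: both blocks co-exist in the same set
```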

Set Associative Cache Organization (figure)

How Much Associativity?
The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation.
(Figure: miss rate vs. associativity (1-way, 2-way, 4-way, 8-way) for cache sizes from 4 KB to 512 KB. Data from Hennessy & Patterson, Computer Architecture, 2003.)
The largest gains are in going from direct mapped to 2-way (20%+ reduction in miss rate).

Replacement Policy
Direct mapped: no choice.
Set associative:
- Prefer a non-valid entry, if there is one
- Otherwise, choose among the entries in the set
Least-recently used (LRU):
- Choose the entry unused for the longest time
- Simple for 2-way, manageable for 4-way, too hard beyond that
Random:
- Gives approximately the same performance as LRU for high associativity
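For 2-way sets, LRU needs only a single bit per set pointing at the way to replace; the class below is a hypothetical sketch of that bookkeeping, not code from the lecture.

```python
class TwoWaySet:
    """One set of a 2-way set-associative cache with single-bit LRU."""
    def __init__(self):
        self.tags = [None, None]   # tag held in way 0 and way 1 (None = not valid)
        self.lru = 0               # index of the least recently used way

    def access(self, tag):
        """Return True on a hit; on a miss, fill a free way or replace the LRU way."""
        for way in (0, 1):
            if self.tags[way] == tag:
                self.lru = 1 - way          # the other way becomes least recently used
                return True
        # prefer a non-valid entry if there is one, otherwise replace the LRU way
        victim = self.tags.index(None) if None in self.tags else self.lru
        self.tags[victim] = tag
        self.lru = 1 - victim
        return False

s = TwoWaySet()
print([s.access(t) for t in (0, 4, 0, 4)])  # [False, False, True, True]
```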

Reducing Cache Miss Rates: Multilevel Caches
Use multiple levels of caches. With advancing technology there is more than enough room on the die for bigger L1 caches or for a second level of cache, normally a unified L2 cache (i.e., it holds both instructions and data). Many high-end systems already include a unified L3 cache.
Design considerations for L1 and L2 caches are very different:
- The primary cache is attached to the CPU; focus on minimizing hit time (i.e., small but fast), with smaller block sizes.
- The level-2 cache services misses from the primary cache; focus on reducing miss rate (i.e., large, slower than L1) to reduce the penalty of long main memory access times, with larger block sizes and higher levels of associativity.

Multilevel Cache Considerations
Primary cache: focus on minimal hit time.
L2 cache: focus on low miss rate to avoid main memory accesses; hit time has less overall impact.
Results:
- L1 cache usually smaller than a single-level cache
- L1 block size smaller than L2 block size
Access time:
t_access = t_hit(L1) + p_miss(L1) × t_penalty(L1)
t_penalty(L1) = t_hit(L2) + p_miss(L2) × t_penalty(L2)
t_access = t_hit(L1) + p_miss(L1) × (t_hit(L2) + p_miss(L2) × t_penalty(L2))
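A small sketch (not from the slides) of the two-level access-time formula above; note that p_miss(L2) here is the local L2 miss rate, i.e. the fraction of L1 misses that also miss in L2, and the example numbers are purely illustrative.

```python
def amat_two_level(t_hit_l1, p_miss_l1, t_hit_l2, p_miss_l2_local, t_penalty_l2):
    """Average access time for an L1 backed by an L2 (same time unit throughout)."""
    t_penalty_l1 = t_hit_l2 + p_miss_l2_local * t_penalty_l2   # cost of an L1 miss
    return t_hit_l1 + p_miss_l1 * t_penalty_l1

# Illustrative: 1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit,
# 20% of L1 misses also miss in L2, 100-cycle main memory penalty.
print(amat_two_level(1, 0.05, 10, 0.20, 100))  # 2.5 cycles
```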

Multilevel Cache Example
Given: CPU base CPI = 1, clock rate = 4 GHz, miss rate per instruction = 2%, main memory access time = 100 ns.
With just the primary cache:
- Miss penalty = 100 ns / 0.25 ns = 400 cycles
- Effective CPI = 1 + 0.02 × 400 = 9

Example (cont.)
Now add an L2 cache with access time = 5 ns and a global miss rate to main memory of 0.5%.
- Primary miss with L2 hit: penalty = 5 ns / 0.25 ns = 20 cycles
- Primary miss with L2 miss: extra penalty = 400 cycles
CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
Performance ratio = 9 / 3.4 = 2.6
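A minimal sketch (not from the slides) reproducing the numbers above; it follows the slide's convention that both the L1 miss rate and the global miss rate to main memory are measured per instruction.

```python
def effective_cpi(base_cpi, l1_miss_rate, l1_miss_penalty,
                  global_miss_rate=0.0, memory_penalty=0):
    """CPI including stall cycles from L1 misses and misses that reach main memory."""
    return (base_cpi
            + l1_miss_rate * l1_miss_penalty       # every L1 miss pays the L2 access
            + global_miss_rate * memory_penalty)   # L2 misses pay the memory access too

cpi_l1_only = effective_cpi(1, 0.02, 400)                # 9.0
cpi_with_l2 = effective_cpi(1, 0.02, 20, 0.005, 400)     # 3.4
print(cpi_l1_only / cpi_with_l2)                         # performance ratio ~ 2.6
```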

Two Machines' Cache Parameters

Characteristic | ARM Cortex-A8 | Intel Nehalem
L1 cache organization | Split instruction and data caches | Split instruction and data caches
L1 cache size | 32 KiB each for instructions/data | 32 KiB each for instructions/data, per core
L1 cache associativity | 4-way (I), 4-way (D) set associative | 4-way (I), 8-way (D) set associative
L1 replacement | Random | Approximated LRU
L1 block size | 64 bytes | 64 bytes
L1 write policy | Write-back, write-allocate (?) | Write-back, no-write-allocate
L1 hit time (load-use) | 1 clock cycle | 4 clock cycles, pipelined
L2 cache organization | Unified (instruction and data) | Unified (instruction and data), per core
L2 cache size | 128 KiB to 1 MiB | 256 KiB (0.25 MiB)
L2 cache associativity | 8-way set associative | 8-way set associative
L2 replacement | Random (?) | Approximated LRU
L2 block size | 64 bytes | 64 bytes
L2 write policy | Write-back, write-allocate (?) | Write-back, write-allocate
L2 hit time | 11 clock cycles | 10 clock cycles
L3 cache organization | -- | Unified (instruction and data)
L3 cache size | -- | 8 MiB, shared
L3 cache associativity | -- | 16-way set associative
L3 replacement | -- | Approximated LRU
L3 block size | -- | 64 bytes
L3 write policy | -- | Write-back, write-allocate
L3 hit time | -- | 35 clock cycles

Interactions with Advanced CPUs
Out-of-order CPUs can execute instructions during a cache miss:
- The pending store stays in the load/store unit
- Dependent instructions wait in reservation stations
- Independent instructions continue
The effect of a miss depends on program data flow:
- Much harder to analyze
- Use system simulation

Victim Cache
Instead of completely discarding a block when it has to be replaced, temporarily keep it in a victim buffer. On a subsequent cache miss, the contents of the buffer are checked for the desired data before going to the next lower level of memory.
- Small cache (e.g., 4 to 16 entries)
- Fully associative
- Particularly effective for small direct-mapped caches (more than 25% reduction of the miss rate in a 4 KB cache)
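A hypothetical sketch (not from the slides) of a direct-mapped cache backed by a small victim buffer; the block granularity, buffer depth, and FIFO victim replacement are assumptions for illustration.

```python
from collections import OrderedDict

class CacheWithVictimBuffer:
    """Direct-mapped cache whose evicted blocks land in a small fully associative buffer."""
    def __init__(self, num_blocks=4, victim_entries=4):
        self.num_blocks = num_blocks
        self.blocks = {}                         # index -> block number (direct mapped)
        self.victim = OrderedDict()              # FIFO victim buffer of block numbers
        self.victim_entries = victim_entries

    def access(self, block):
        idx = block % self.num_blocks
        if self.blocks.get(idx) == block:
            return "hit"
        evicted = self.blocks.get(idx)
        self.blocks[idx] = block                 # place the requested block
        if block in self.victim:                 # found in victim buffer: no memory access
            del self.victim[block]
            result = "victim hit"
        else:
            result = "miss"                      # must fetch from the next lower level
        if evicted is not None:                  # keep the displaced block around
            self.victim[evicted] = True
            if len(self.victim) > self.victim_entries:
                self.victim.popitem(last=False)  # drop the oldest victim
        return result

c = CacheWithVictimBuffer()
print([c.access(b) for b in (0, 4, 0, 4)])  # ['miss', 'miss', 'victim hit', 'victim hit']
```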

Interactions with Software
Misses depend on memory access patterns:
- Algorithm behavior
- Compiler optimization for memory accesses

Summary: Improving Cache Performance
1. Reduce the time to hit in the cache:
- Smaller cache
- Direct mapped cache
- Smaller blocks
- For writes:
  - No write allocate: no hit on the cache, just write to the write buffer
  - Write allocate: to avoid two cycles (first check for hit, then write), pipeline writes via a delayed write buffer to the cache
2. Reduce the miss rate:
- Bigger cache
- More flexible placement (increase associativity)
- Larger blocks (16 to 64 bytes typical)
- Victim cache: a small buffer holding the most recently discarded blocks

Summary: Improving Cache Performance (cont.)
3. Reduce the miss penalty:
- Smaller blocks
- Use a write buffer to hold dirty blocks being replaced, so reads don't have to wait for the write to complete
  - Check the write buffer (and/or victim cache) on a read miss: may get lucky
- For large blocks, fetch the critical word first
- Use multiple cache levels: the L2 cache is not tied to the CPU clock rate
- Faster backing store / improved memory bandwidth: wider buses; memory interleaving, DDR SDRAMs

Summary: The Cache Design Space
Several interacting dimensions:
- Cache size
- Block size
- Associativity
- Replacement policy
- Write-through vs. write-back
- Write allocation
(Figure: design space with axes cache size, associativity, and block size.)

Next Class
Optimizing cache usage
