Lecture 17: Memory Hierarchy: Cache Design


[S09 L17-1] 18-447 Lecture 17: Memory Hierarchy: Cache Design
James C. Hoe, Dept of ECE, CMU, March 24, 2009
Announcements: Project 3 is due; Midterm 2 is coming
Handouts: Practice Midterm 2 solutions

[S09 L17-2] The Problem (recap)
Given potentially M = 2^m bytes of memory, how do we keep the most frequently used ones in C bytes of fast storage, where C << M?
Basic issues (intertwined):
(1) Where to cache a memory location?
(2) How to find a cached memory location?
(3) Granularity of management: large, small, uniform?
(4) When to bring a memory location into cache?
(5) Which cached memory location to evict to free up space?
Plus optimizations of the above.

[S09 L17-3] Basic Operation
The CPU issues an address and the cache performs a lookup (issue 2):
- hit: return the data and update cache state
- miss: choose a location (issues 1, 3, 5); if that location is occupied, evict the old block to L_{i+1}, then fetch the new block from L_{i+1} and return it
Answer to (4): a memory location is brought into the cache on demand. What about prefetch?

[S09 L17-4] Direct-Mapped Cache (v1)
The M-bit address is split into tag, idx, and g (byte offset) fields:
- Tag Bank: C/G lines, each holding a t-bit tag and a valid bit
- Data Bank: C/G lines of G bytes each
- let t = M - lg2(C)
What about writes? (A lookup sketch in C follows below.)
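To make the lookup concrete, here is a minimal C sketch of the v1 organization, assuming M = 32-bit addresses, G = 4, and C = 16 KB; the struct layout and the lookup() helper are illustrative, not from the lecture.

#include <stdint.h>
#include <stdbool.h>

/* Word-granularity direct-mapped cache from slide 17-4 ("v1").
   Illustrative parameters: M = 32-bit addresses, G = 4-byte access
   granularity, C = 16 KB => C/G = 4096 lines, t = M - lg2(C) = 18. */
#define G_BYTES 4u
#define C_BYTES (16u * 1024u)
#define N_LINES (C_BYTES / G_BYTES)

typedef struct {
    bool     valid;
    uint32_t tag;    /* the t = 18 tag bits */
    uint32_t data;   /* one G-byte word     */
} line_t;

static line_t cache[N_LINES];

/* Split the address into { tag | idx | g } and compare tags. */
bool lookup(uint32_t addr, uint32_t *word_out)
{
    uint32_t idx = (addr / G_BYTES) % N_LINES; /* lg2(C/G) = 12 index bits */
    uint32_t tag = addr >> 14;                 /* top t = M - lg2(C) bits  */
    if (cache[idx].valid && cache[idx].tag == tag) {
        *word_out = cache[idx].data;           /* hit */
        return true;
    }
    return false;   /* miss: fetch from L_{i+1}, possibly evicting */
}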

[S09 L17-5] Storage Overhead
For each cache block of G bytes, we must also store an additional t+1 bits (tag plus valid bit), where t = M - lg2(C).
- if M = 2^32, G = 4, C = 16K = 2^14: t = 18, i.e., 19 extra bits for each 4-byte block, a ~60% storage overhead; the "16KB" cache really needs 25.5KB of SRAM
Solution: let multiple G-byte words share a common tag, so each B-byte block holds B/G words.
- if M = 2^32, B = 16, G = 4, C = 16K: t = 18, i.e., 19 extra bits for each 16-byte block, a ~15% storage overhead; the 16KB cache needs 18.4KB of SRAM
15% of 16KB is small, but 15% of 1MB is 152KB, hence larger block sizes for lower (larger) levels of the hierarchy. (The arithmetic is worked in the sketch below.)

[S09 L17-6] Direct-Mapped Cache (final)
The M-bit address is split into tag, idx, bo, and g fields:
- Tag Bank: C/B lines, each holding a t-bit tag and a valid bit
- Data Bank: C/B lines of B bytes (B/G words each); a B/G-to-1 mux selects the requested G-byte word
- let t = M - lg2(C)
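As a check on the numbers above, a short C sketch that recomputes the overhead ratio, (t+1) tag-plus-valid bits per 8B data bits; the helper names are illustrative.

#include <stdio.h>

/* Recomputes the storage-overhead arithmetic of slide 17-5:
   overhead = (t + 1) extra bits per block / (8 * B) data bits per block,
   with t = M - lg2(C). */
static int lg2u(unsigned x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

static double overhead(int m, unsigned c, unsigned b)
{
    int t = m - lg2u(c);                  /* tag bits */
    return (double)(t + 1) / (8.0 * b);   /* (tag+valid) per data bit */
}

int main(void)
{
    unsigned c = 16 * 1024;
    /* B = G = 4: 19/32  = 59.4% => the 16KB cache needs ~25.5KB of SRAM */
    printf("B=4:  %4.1f%%\n", 100.0 * overhead(32, c, 4));
    /* B = 16:    19/128 = 14.8% => ~18.4KB of SRAM                      */
    printf("B=16: %4.1f%%\n", 100.0 * overhead(32, c, 16));
    return 0;
}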

[S09 L17-7] Direct-Mapped Cache
C bytes of storage divided into C/B blocks.
A block of memory is mapped to one particular cache block according to the address's block index field.
All addresses with the same block index field map to the same cache block:
- there are 2^t such addresses, and only one such block can be cached at a time
- even if C > working set size, collision is possible: given 2 random addresses, the chance of collision is 1/(C/B) (see the collision sketch below)
- notice the likelihood of collision decreases with an increasing number of cache blocks (C/B)
[Figure: hit rate vs. working set size W; hit rate stays near 100% until W approaches C, then falls off]

[S09 L17-8] Block Size and m_i
Bytes that share a common tag are all-in or all-out.
Loading a multi-word block at a time has the effect of prefetching for spatial locality:
- pay the miss penalty only once per block
- works especially well in instruction caches
- effective up to the limit of spatial locality
But increasing block size (while holding C constant) reduces the number of blocks, which increases the possibility of collision.
[Figure: hit rate vs. block size B; rises with B, then falls once B exceeds the program's spatial locality]
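The collision property of slide 17-7 can be seen directly: any two addresses that differ only in their tag bits, e.g., addresses exactly C bytes apart, index the same block. A minimal sketch with illustrative parameters (C = 16 KB, B = 16):

#include <stdint.h>
#include <stdio.h>

/* Two addresses whose block index fields match collide in a
   direct-mapped cache. C/B = 1024 blocks => a 10-bit index. */
#define C_BYTES  (16u * 1024u)
#define B_BYTES  16u
#define N_BLOCKS (C_BYTES / B_BYTES)

static uint32_t block_index(uint32_t addr)
{
    return (addr / B_BYTES) % N_BLOCKS;   /* middle lg2(C/B) bits */
}

int main(void)
{
    uint32_t a = 0x00012340;
    uint32_t b = a + C_BYTES;             /* differs only in the tag */
    printf("idx(a)=%u idx(b)=%u -> %s\n",
           block_index(a), block_index(b),
           block_index(a) == block_index(b) ? "collide" : "no collision");
    return 0;
}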

[S09 L17-9] Block Size and T_{i+1}
Loading a large block can increase T_{i+1}: if I want the last word of a block, I have to wait for the entire block to be loaded.
Solution 1: critical-word-first reload
- L_{i+1} returns the requested word first, then rotates around to complete the block
- the requested word is supplied to the pipeline as soon as it is available
Solution 2: sub-blocking (sketched below)
- an individual valid bit for each sub-block; reload only the requested sub-block on demand
- note: all sub-blocks still share a common tag
[Diagram: one line holding a single tag and per-sub-block (valid, data) pairs: tag | v s-block 0 | v s-block 1 | v s-block 2]

[S09 L17-10] Test yourself: what is wrong with this?
[Diagram: the cache of slide 17-6 redrawn with the address fields in the order bo, idx, tag, g rather than tag, idx, bo, g]
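A minimal C sketch of a sub-blocked line, assuming a 64-byte block split into four 16-byte sub-blocks; the struct and the reload_subblock() helper are illustrative:

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

/* Sub-blocked cache line (slide 17-9, solution 2): one shared tag,
   one valid bit per sub-block. */
#define SUBBLOCKS      4
#define SUBBLOCK_BYTES 16

typedef struct {
    uint32_t tag;                         /* shared by all sub-blocks  */
    bool     valid[SUBBLOCKS];            /* per-sub-block valid bits  */
    uint8_t  data[SUBBLOCKS][SUBBLOCK_BYTES];
} subblocked_line_t;

/* On a miss to sub-block s, reload only that sub-block; if the tag
   changes, the other sub-blocks must be invalidated first. */
void reload_subblock(subblocked_line_t *l, uint32_t tag, int s,
                     const uint8_t *from_next_level)
{
    if (l->tag != tag) {                  /* new block: drop old sub-blocks */
        memset(l->valid, 0, sizeof l->valid);
        l->tag = tag;
    }
    memcpy(l->data[s], from_next_level, SUBBLOCK_BYTES);
    l->valid[s] = true;
}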

[S09 L17-11] Direct-Mapped Cache (double check)
[Diagram: the address is split into tag, idx, bo, g with lg2(C/B)+k index bits and lg2(B/G)-k block offset bits; the Tag Bank remains C/B entries of t bits plus valid, while the Data Bank is reorganized as 2^k * (C/B) entries of B/2^k bytes each; t = M - lg2(C) as before]

[S09 L17-12] a-way Set-Associative Cache
[Diagram: a banks, each a C/a-byte direct-mapped cache with a Tag Bank of (C/a)/B entries of t bits plus valid and a Data Bank of (C/a)/B entries of B bytes; the bank outputs are combined through a tristate mux bus; here t = M - lg2(C/a)]
(A lookup sketch in C follows below.)
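A minimal sketch of the slide 17-12 organization, assuming C = 16 KB, B = 16, a = 2; the bank layout and sa_lookup() are illustrative:

#include <stdint.h>
#include <stdbool.h>

/* a-way set-associative lookup: a parallel direct-mapped banks.
   (C/a)/B = 512 sets with these parameters. */
#define ASSOC   2
#define B_BYTES 16u
#define N_SETS  ((16u * 1024u / ASSOC) / B_BYTES)

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[B_BYTES];
} way_t;

static way_t bank[ASSOC][N_SETS];

/* idx selects a set; the a tags are compared in parallel in hardware
   (a comparators) and an a-to-1 mux selects the hitting way. */
const uint8_t *sa_lookup(uint32_t addr)
{
    uint32_t idx = (addr / B_BYTES) % N_SETS;
    uint32_t tag = addr / B_BYTES / N_SETS;    /* t = M - lg2(C/a) bits */
    for (int w = 0; w < ASSOC; w++)
        if (bank[w][idx].valid && bank[w][idx].tag == tag)
            return bank[w][idx].data;           /* hit in way w */
    return 0;                                   /* miss */
}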

[S09 L17-13] a-way Set-Associative Cache
C bytes of storage divided into a banks, each with (C/a)/B blocks.
Requires a comparators and an a-to-1 multiplexer.
An address is mapped to a particular block in a bank according to its block index field, but there are a such banks (together known as a "set").
All addresses with the same block index field map to a common set of cache blocks:
- there are 2^t such addresses, and a such blocks can be cached at a time
- if C > working set size, a higher degree of associativity means fewer collisions

[S09 L17-14] Replacement Policy for Set-Associative Caches
A new cache block can evict any of the cached memory blocks in the same set; which one?
- pick the one that is least recently used (LRU): a simple function for a = 2 (sketched below), complicated for a > 2
- pick any one except the most recently used
- pick the most recently used one
- pick one based on some part of the address
- pick the one that you will need again furthest in the future
- pick a (pseudo-)random one
Replacement policy has only a second-order effect:
- if you actively use fewer than a blocks in a set, any sensible replacement policy will quickly converge
- if you actively use more than a blocks in a set, no replacement policy can help you
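Why LRU is simple for a = 2: a single bit per set suffices, since the way that is not most recently used is by definition the LRU victim. A tiny sketch with an illustrative N_SETS:

#include <stdbool.h>

/* 2-way LRU (slide 17-14): one MRU bit per set. */
#define N_SETS 512

static bool mru_way[N_SETS];   /* which way was touched last */

void touch(int set, int way)   { mru_way[set] = (way == 1); }
int  lru_victim(int set)       { return mru_way[set] ? 0 : 1; }

For a > 2, tracking a full recency ordering of the ways is what makes exact LRU complicated, which is why pseudo-LRU schemes are used in practice.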

[S09 L17-15] Pseudo-Associativity (e.g., the MIPS R10K's 2-way L2)
a-way associativity is a placement policy: it says an address could be mapped to any of a different locations in the cache. It does not say lookups must be done in parallel banks.
Pseudo a-way associativity: given a direct-mapped array with C/B cache blocks, implement C/B/a sets; given an address A, sequentially search the locations
{0, A[lg(C/B/a)-1 : lg(B)]}, {1, A[lg(C/B/a)-1 : lg(B)]}, ..., {a-1, A[lg(C/B/a)-1 : lg(B)]}
Optimization: record the most recently used (MRU) way and look it up first (sketched below).
How does this compare with true associative caches?

[S09 L17-16] Fully Associative Cache: a = C/B
[Diagram: the address has only tag, bo, and g fields (no index); C/B entries, each a single t-bit tag with a valid bit alongside a single B-byte data block; here t = M - lg2(B)]
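A minimal sketch of the sequential probe with the MRU optimization, assuming a = 2 and C/B = 1024; the array layout and pa_lookup() are illustrative:

#include <stdint.h>
#include <stdbool.h>

/* Pseudo 2-way associativity over one direct-mapped array (slide 17-15):
   C/B blocks form C/B/a sets; probe {way, set_idx} serially, MRU first. */
#define N_BLOCKS 1024
#define ASSOC    2
#define N_SETS   (N_BLOCKS / ASSOC)
#define B_BYTES  16u

typedef struct { bool valid; uint32_t tag; } entry_t;
static entry_t array[N_BLOCKS];
static int     mru[N_SETS];          /* most recently used way per set */

/* Returns the array slot on hit, -1 on miss (after probing all ways).
   Each probe is a separate serial access to the direct-mapped array. */
int pa_lookup(uint32_t addr)
{
    uint32_t set = (addr / B_BYTES) % N_SETS;
    uint32_t tag = addr / B_BYTES / N_SETS;
    for (int p = 0; p < ASSOC; p++) {
        int way  = (mru[set] + p) % ASSOC;   /* MRU way first */
        int slot = way * N_SETS + set;       /* {way, set_idx} */
        if (array[slot].valid && array[slot].tag == tag) {
            mru[set] = way;
            return slot;
        }
    }
    return -1;
}

Compared with a true set-associative cache, hits in the MRU way are as fast as a direct-mapped hit, but hits in the other ways pay extra serial probes.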

[S09 L17-17] Fully Associative Cache: a = C/B
A content-addressable memory (CAM), not your regular SRAM: present a tag, and it returns the block with the matching tag, or else signals a miss. No index is used in the lookup.
Any address can go into any of the C/B blocks, so if C > working set size, there are no collisions.
Requires 1 comparator per cache block, a huge multiplexer, and many long wires; considered exorbitantly expensive/difficult for more than 32~64 blocks at L1 latencies.
Fortunately, there is little reason for very large fully associative caches: for any reasonably large C/B, a = 4~5 is as good as a = C/B for typical programs.
[Figure: hit rate vs. associativity a; the curve flattens around a = 4~5]

[S09 L17-18] Recap: Basic Cache Parameters
ISA:
- let M = 2^m be the size of the address space in bytes; sample values: 2^32, 2^64
- let G = 2^g be the cache access granularity in bytes; sample values: 4, 8
Implementation:
- let C be the capacity of the cache in bytes; sample values: 16 KBytes (L1), 1 MByte (L2)
- let B = 2^b be the block size of the cache in bytes; sample values: 16 (L1), >=64 (L2)
- let a be the associativity of the cache; sample values: 1, 2, 4, 5(?), ..., C/B

[S09 L17-19] Recap: Address Fields
The M-bit address divides into tag, block index, block offset (B.O.), and byte offset fields:
- tag: M - lg2(C/a) bits
- block index: lg2((C/a)/B) bits
- block offset: lg2(B/G) bits
- byte offset: lg2(G) bits

[S09 L17-20] Recap: M = 32, G, C, B, a
[Worked-example slide: the field widths above evaluated for concrete values of G, C, B, and a; the figure did not survive transcription, but the sketch below reconstructs the arithmetic for one configuration]
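Since slide 17-20's worked example was lost, here is the field-width arithmetic in C for one assumed configuration (M = 32, G = 4, C = 16 KB, B = 16, a = 2); the parameter values are illustrative, not from the slide:

#include <stdio.h>

/* Evaluates the slide 17-19 field widths; the widths must sum to M. */
static int lg2u(unsigned x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

int main(void)
{
    unsigned M = 32, G = 4, C = 16 * 1024, B = 16, a = 2;
    int g   = lg2u(G);              /* byte offset:  2 bits */
    int bo  = lg2u(B / G);          /* block offset: 2 bits */
    int idx = lg2u((C / a) / B);    /* block index:  9 bits */
    int tag = (int)M - lg2u(C / a); /* tag:         19 bits */
    printf("tag=%d idx=%d bo=%d g=%d (sum=%d, must equal M)\n",
           tag, idx, bo, g, tag + idx + bo + g);
    return 0;
}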

[S09 L17-21] Classification of Cache Misses
Compulsory miss (design factors: B and prefetching)
- the first reference to an address (block) always results in a miss
- subsequent references should hit, unless the cache block is displaced for one of the reasons below
- dominates when locality is poor
Capacity miss (design factor: C)
- the cache is too small to hold everything needed
- defined as the misses that would occur even in a fully associative cache of the same capacity
- dominates when C < W
Conflict miss (design factor: a)
- displaced by collision under direct-mapped or set-associative allocation
- defined as any miss that is neither a compulsory nor a capacity miss
- dominates when C is close to W or when C/B is small

[S09 L17-22] Compulsory Misses (continued)
Compulsory misses dominate, for example, in a streaming access pattern where many addresses are visited but each is visited exactly once; there is little reuse to amortize the cost (simulated in the sketch below).
[Figure: hit rate vs. block size B, as on slide 17-8]
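A tiny tag-only simulation of the streaming pattern above, with illustrative parameters; every block of the stream costs exactly one compulsory miss, no matter the replacement policy:

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Direct-mapped, tag-only miss counter: B = 16, C/B = 1024 blocks. */
#define B_BYTES  16u
#define N_BLOCKS 1024u

static bool     valid[N_BLOCKS];
static uint32_t tags[N_BLOCKS];

static bool cache_access(uint32_t addr)   /* true on hit */
{
    uint32_t blk = addr / B_BYTES;
    uint32_t idx = blk % N_BLOCKS, tag = blk / N_BLOCKS;
    if (valid[idx] && tags[idx] == tag) return true;
    valid[idx] = true; tags[idx] = tag;   /* allocate on miss */
    return false;
}

int main(void)
{
    unsigned misses = 0;
    for (uint32_t a = 0; a < 1u << 20; a += 4)   /* stream 1 MB of words */
        if (!cache_access(a)) misses++;
    /* one miss per 16-byte block: 1 MB / 16 = 65536 compulsory misses */
    printf("misses = %u\n", misses);
    return 0;
}

Doubling B would halve the compulsory misses for this stream, which is the slide's point that B (and prefetching) is the design lever here.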

[S09 L17-23] Capacity Misses (continued)
Capacity misses dominate, for example, in the L1 cache, which can never be made big enough because of the cycle-time tradeoff.
[Figure: hit rate vs. working set size W, as on slide 17-7: near 100% until W approaches C]

[S09 L17-24] Conflict Misses (continued)
Conflict misses dominate when C is close to W or when C/B is small.
[Figure: hit rate vs. associativity a, as on slide 17-17: flattens around a = 4~5]