Spring 2018 :: CSE 502. Cache Design Basics. Nima Honarmand

Cache Design Basics Nima Honarmand

Storage Hierarchy. Make the common case fast. Common: temporal & spatial locality. Fast: smaller, more expensive memory. [Figure: the hierarchy pyramid, Registers, Caches (SRAM), Memory (DRAM), [SSD? (Flash)], Disk (Magnetic Media); levels toward the top are faster; levels toward the bottom are larger and cheaper, with bigger transfers and more bandwidth; caches are controlled by hardware, the levels below by software (the OS).]

Caches. An automatically managed hierarchy. Break memory into blocks (several bytes) and transfer data to/from the cache in blocks, to exploit spatial locality. Keep recently accessed blocks, to exploit temporal locality. [Figure: Core - $ (cache) - Memory]

Cache Terminology. Block (cache line): minimum unit that may be cached. Frame: cache storage location to hold one block. Hit: block is found in the cache. Miss: block is not found in the cache. Miss ratio: fraction of references that miss. Hit time: time to access the cache. Miss penalty: time to retrieve the block on a miss.

Cache Example. Address sequence from the core (assume 8-byte lines): 0x10000 Miss, 0x10004 Hit, 0x10120 Miss, 0x10008 Miss, 0x10124 Hit, 0x10004 Hit. [Figure: blocks 0x10000, 0x10008, and 0x10120 are brought into the cache from memory.] Final miss ratio is 50%.
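
To replay this trace, here is a minimal sketch in Python (an idealized cache that simply remembers every 8-byte block it has fetched; names are illustrative, not from the slides):

BLOCK_SIZE = 8   # 8-byte lines, as assumed on the slide

trace = [0x10000, 0x10004, 0x10120, 0x10008, 0x10124, 0x10004]
cached_blocks = set()    # idealized: every fetched block stays resident
misses = 0
for addr in trace:
    block = addr // BLOCK_SIZE          # drop the offset bits
    if block in cached_blocks:
        print(hex(addr), "Hit")
    else:
        print(hex(addr), "Miss")
        cached_blocks.add(block)
        misses += 1
print("miss ratio:", misses / len(trace))   # 0.5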

Average Memory Access Time (1). AMAT = Hit-time + Miss-rate × Miss-penalty. A very powerful tool to estimate performance. If a cache hit is 10 cycles (core to L1 and back) and the miss penalty is 100 cycles, then: at 50% miss ratio, avg. access = 10 + 0.5 × 100 = 60; at 10% miss ratio, avg. access = 10 + 0.1 × 100 = 20; at 1% miss ratio, avg. access = 10 + 0.01 × 100 = 11.
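
A quick way to check these numbers, as a small sketch (Python; function and variable names are illustrative):

def amat(hit_time, miss_rate, miss_penalty):
    # AMAT = Hit-time + Miss-rate x Miss-penalty
    return hit_time + miss_rate * miss_penalty

for miss_rate in (0.5, 0.1, 0.01):
    print(miss_rate, amat(10, miss_rate, 100))   # 60.0, 20.0, 11.0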

Average Memory Access Time (2). Generalizes nicely to hierarchies of any depth. If an L1 cache hit is 5 cycles (core to L1 and back), an L2 cache hit is 20 cycles (core to L2 and back), and a memory access is 100 cycles (L2 miss penalty), then at a 20% miss ratio in L1 and a 40% miss ratio in L2: avg. access = 5 + 0.2 × (0.6 × 20 + 0.4 × 100) = 15.4.
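
The same check extended to two levels, as a sketch (Python; variable names are mine, not from the slides):

l1_hit, l1_miss_rate = 5, 0.20
l2_hit, l2_miss_rate = 20, 0.40
mem_latency = 100

# Average cost of going to L2: serviced by L2 or by memory.
l2_access = (1 - l2_miss_rate) * l2_hit + l2_miss_rate * mem_latency  # 52.0
amat = l1_hit + l1_miss_rate * l2_access                              # 15.4
print(amat)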

Memory Hierarchy (1). L1 is usually split: separate I$ (instruction cache) and D$ (data cache). L2 and L3 are unified. [Figure: Processor with Registers, I-TLB + L1 I-Cache, D-TLB + L1 D-Cache, then L2 Cache, L3 Cache (LLC), and Main Memory (DRAM).]

Memory Hierarchy (2). L1 and L2 are private; L3 is shared. Multi-core replicates the top of the hierarchy. [Figure: two cores (Core 0 and Core 1), each with registers, I-TLB + L1 I-Cache, D-TLB + L1 D-Cache, and a private L2 Cache; both share the L3 Cache (LLC) and Main Memory (DRAM).]

Memory Hierarchy (3). Example: Intel Nehalem (3.3 GHz, 4 cores, 2 threads per core), with 32 KB L1-I, 32 KB L1-D, and 256 KB L2 per core. [Die photo.]

How to Build a Cache

SRAM Overview. Chained inverters maintain a stable state. Access gates provide access to the cell. Writing to the cell involves over-powering the storage inverters. [Figure: 6T SRAM cell, 2 access gates, 2 transistors per inverter, bitlines b and b̄.]

8-bit SRAM Array. [Figure: a row of eight SRAM cells driven by a single wordline, each cell connected to its own pair of bitlines.]

8×8-bit SRAM Array. [Figure: a 3-bit address drives a 1-of-8 decoder that selects one wordline; the selected row's eight cells drive the bitlines.]

Direct-Mapped Cache using SRAM. Address split: tag[63:16], index[15:6], block offset[5:0]. Use the middle bits as the index; only one tag comparison is needed. [Figure: a decoder selects one frame (data + state + tag); the stored tag is compared (=) with the address tag to produce hit?; a multiplexor selects the requested word from the block.] Why take the index bits out of the middle?
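
A sketch of the address split used on this slide, in Python (constant and function names are illustrative):

OFFSET_BITS = 6    # 64-byte blocks -> block offset[5:0]
INDEX_BITS = 10    # index[15:6] -> 1024 frames

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# Lookup: read frame[index]; it is a hit iff the frame is valid and its
# stored tag equals 'tag'.
print([hex(f) for f in split_address(0x10040)])   # ['0x1', '0x1', '0x0']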

Improving Cache Performance. Recall the AMAT formula: AMAT = Hit-time + Miss-rate × Miss-penalty. To improve cache performance, we can improve any of the three components. Let's start by reducing miss rate.

The 4 C's of Cache Misses. Compulsory: never accessed before. Capacity: accessed long ago and already replaced because the cache is too small. Conflict: neither compulsory nor capacity, caused by limited associativity. Coherence: (will discuss in the multi-processor lectures).

Cache Size. Cache size is data capacity (don't count tag and state). Bigger can exploit temporal locality better, but bigger is not always better. Too large a cache: smaller is faster, bigger is slower; access time may hurt the critical path. Too small a cache: limited temporal locality; useful data constantly replaced. [Plot: hit rate vs. capacity; hit rate flattens once capacity reaches the working set size.]

Block Size. Block size is the data that is associated with an address tag; not necessarily the unit of transfer between hierarchy levels. Too small a block: doesn't exploit spatial locality well; excessive tag overhead. Too large a block: useless data transferred; too few total blocks, so useful data is frequently replaced. Common block sizes are 32-128 bytes. [Plot: hit rate vs. block size.]

Cache Conflicts. What if two blocks alias on a frame? Same index, but different tags. Address sequence (shown as tag | index | block offset):
0xDEADBEEF = 1101111010101101 | 1011111011 | 101111
0xFEEDBEEF = 1111111011101101 | 1011111011 | 101111
0xDEADBEEF = 1101111010101101 | 1011111011 | 101111
The second access to 0xDEADBEEF experiences a Conflict miss: not Compulsory (we have seen it before) and not Capacity (lots of other frames are available in the cache).
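
Using the same field split as the previous slide, a tiny sketch (Python, self-contained) shows why these two addresses collide:

def index_and_tag(addr, offset_bits=6, index_bits=10):
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return hex(index), hex(tag)

print(index_and_tag(0xDEADBEEF))   # ('0x2fb', '0xdead')
print(index_and_tag(0xFEEDBEEF))   # ('0x2fb', '0xfeed')
# Same index, different tags: in a direct-mapped cache they evict each other.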

Associativity (1). In a cache with 8 frames, where does block 12 (0b1100) go? Fully-associative: the block goes in any frame (all frames in 1 set). Set-associative: the block goes in any frame of one set (frames grouped into sets). Direct-mapped: the block goes in exactly one frame (1 frame per set). [Figure: the three organizations, with the 8 frames numbered 0-7.]
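
The placement rules can be written down directly; a sketch (Python), assuming a 2-way organization for the set-associative case:

block = 12        # 0b1100
NUM_FRAMES = 8

print("direct-mapped:", "frame", block % NUM_FRAMES)   # frame 4 (1 frame per set)
print("2-way set-assoc:", "set", block % 4)            # set 0 -> frame 0 or 1
print("fully-assoc:", "set", block % 1)                # set 0 -> any of the 8 frames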

Associativity (2). Larger associativity (for the same size): lower miss rate (fewer conflicts), higher power consumption. Smaller associativity: lower cost, faster hit time. 2:1 rule of thumb: for small caches (up to 128 KB), 2-way associativity gives the same miss rate as a direct-mapped cache of twice the size, holding cache and block size constant. [Plot: hit rate vs. associativity, with ~5 noted for L1-D associativity.]

N-Way Set-Associative Cache. Address split: tag[63:16], index[15:6], block offset[5:0]. [Figure: the index selects one set; each way has its own decoder and data/state/tag arrays; each way's stored tag is compared (=) with the address tag, and a multiplexor selects the hitting way's data to produce hit? and the output data.] Note the additional bit(s) moved from index to tag: for the same capacity, more ways means fewer sets, so fewer index bits.

Fully-Associative Cache. Address split: tag[63:6], block offset[5:0] (no index bits). Keep blocks in any cache frame; each frame holds data, state (e.g., valid), and an address tag. [Figure: every stored tag is compared (=) against the address tag in parallel, a Content Addressable Memory (CAM); the matching frame's data is selected through a multiplexor to produce hit? and data.]

Block Replacement Algorithms. Which block in a set to replace on a miss? Ideal replacement (Belady's Algorithm): replace the block accessed farthest in the future. Trick question: how do you implement it? Least Recently Used (LRU): optimized for temporal locality (expensive for > 2-way associativity). Not Most Recently Used (NMRU): track the MRU block, select randomly among the rest; same as LRU for 2-way sets. Random: nearly as good as LRU, sometimes better (when?). Pseudo-LRU: used in caches with high associativity; examples: Tree-PLRU, Bit-PLRU.
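
A minimal sketch of true LRU within one set (Python; an OrderedDict keeps recency order; this is a software illustration, not how hardware implements it):

from collections import OrderedDict

class LRUSet:
    """One cache set with true LRU replacement (tags only, no data)."""
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()          # tag -> None, oldest (LRU) first

    def access(self, tag):
        if tag in self.blocks:               # hit: make it most recently used
            self.blocks.move_to_end(tag)
            return "hit"
        if len(self.blocks) >= self.ways:    # miss in a full set: evict the LRU block
            self.blocks.popitem(last=False)
        self.blocks[tag] = None
        return "miss"

s = LRUSet(ways=2)
print([s.access(t) for t in ["A", "B", "A", "C", "B"]])
# ['miss', 'miss', 'hit', 'miss', 'miss'] -- C evicts B (the LRU), so B misses again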

Victim Cache (1). Associativity is expensive: performance overhead from extra muxes; power overhead from reading and checking more tags and data. Conflicts are expensive: performance lost to extra misses. Observation: conflicts don't occur in all sets. Idea: use a fully-associative victim cache to absorb blocks displaced from the main cache.

Victim Cache (2). [Figure: an access sequence that cycles through five blocks A-E mapping to one set and five blocks J-N mapping to another, shown for a 4-way set-associative L1 cache alone and for the same cache plus a fully-associative victim cache.] With the 4-way cache alone, every access is a miss: ABCDE and JKLMN do not fit in a 4-way set-associative cache. The victim cache provides a fifth way, so long as only four sets overflow into it at the same time; it can even provide 6th or 7th ways. It provides extra associativity, but not for all sets.
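
A rough sketch of the mechanism (Python): a direct-mapped main cache backed by a tiny fully-associative victim cache. Sizes, names, and the direct-mapped simplification are mine, not from the slide; the point is only the swap on a victim hit.

from collections import deque

class VictimCache:
    """Direct-mapped main cache + small fully-associative FIFO victim cache."""
    def __init__(self, num_frames, victim_entries):
        self.frames = [None] * num_frames            # block address per frame
        self.victim = deque(maxlen=victim_entries)   # recently evicted blocks

    def access(self, block):
        idx = block % len(self.frames)
        if self.frames[idx] == block:
            return "hit"
        if block in self.victim:                     # hit in the victim cache:
            self.victim.remove(block)                # swap it back into the
            evicted = self.frames[idx]               # main cache
            self.frames[idx] = block
            if evicted is not None:
                self.victim.append(evicted)
            return "victim hit"
        evicted = self.frames[idx]                   # miss: fill from memory;
        self.frames[idx] = block                     # the displaced block goes
        if evicted is not None:                      # into the victim cache
            self.victim.append(evicted)
        return "miss"

c = VictimCache(num_frames=4, victim_entries=2)
# Blocks 0 and 4 conflict on frame 0; the victim cache absorbs the loser.
print([c.access(b) for b in [0, 4, 0, 4]])
# ['miss', 'miss', 'victim hit', 'victim hit']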

Parallel vs. Serial Caches. Tag and data are usually separate SRAMs; the tag array is smaller and faster. State bits (valid bit, LRU bit(s), ...) are stored along with the tags. Parallel access to tag and data reduces latency (good for L1). Serial access to tag and data reduces power (good for L2+). [Figure: parallel lookup reads all ways' data while comparing tags and checking valid bits; serial lookup uses the tag-match result to enable only the hitting data way.]

Cache, TLB & Address Translation (1). Should we use the virtual address or the physical address to access caches? In theory, we can use either. Drawback(s) of physical: the TLB access has to happen before the cache access, increasing hit time. Drawback(s) of virtual: aliasing problem (the same physical memory might be mapped using multiple virtual addresses); memory protection bits (part of the page table and TLB) still have to be checked; I/O devices usually use physical addresses. So, what should we do?

Cache, TLB & Address Translation (2). Observation: caches use addresses for two things. Indexing: to find and access the set that could contain the cache block; only requires a small subset of the low-order address bits. Tag matching: to search the blocks in the set to see if any of them is actually the one we're looking for; requires the complete address. Solution: 1) use the part of the address that is common between virtual and physical for indexing; 2) while the set is being accessed, do the TLB lookup in parallel; 3) use the physical address from (2) for tag matching.

Cache, TLB & Address Translation (3). Example: in Intel processors, the page size is 4 KB and the cache block is 64 bytes, so the page offset is 12 bits and the block offset is 6 bits. What is the maximum number of index bits that are common between virtual and physical addresses? 12 − 6 = 6. What is the largest direct-mapped cache we can build using 6 bits of index? 2^6 blocks × 64 bytes per block = 4 KB (same as the page size). But Intel L1 caches are 32 KB. How do they do that? Make the cache 8-way set-associative: each way is 4 KB and still needs only 6 bits of index.
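
The arithmetic on this slide, as a small sketch (Python; constant names are illustrative):

PAGE_OFFSET_BITS = 12        # 4 KB pages
BLOCK_OFFSET_BITS = 6        # 64-byte blocks

index_bits = PAGE_OFFSET_BITS - BLOCK_OFFSET_BITS      # 6 untranslated index bits
way_size = (1 << index_bits) * (1 << BLOCK_OFFSET_BITS)
print(index_bits, way_size)                            # 6, 4096 bytes per way

target_size = 32 * 1024
print("ways needed:", target_size // way_size)         # 8-way for a 32 KB L1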

Cache, TLB & Address Translation (4). [Figure: the virtual address is split as virtual page number[63:12], index[11:6], block offset[5:0]; the 6 index bits select the set while the TLB translates the virtual page number in parallel; the resulting physical tag is then compared (=) against the stored tags.] By removing the TLB from the critical path, we reduce the hit-time component of AMAT.

Caches and Writes. Writes are more interesting (i.e., complicated) than reads. On reads, tag and data can be accessed in parallel. On writes, we need two steps: first, do indexing and tag matching to find the block; then, write the data to the SRAM.

Cache Write Policies (1). On write hits, update lower-level memory? Yes: write-through (more memory traffic). No: write-back (uses dirty state bits to identify blocks to write back). What is the drawback of write-back? On a block replacement, we must first write the old block back to memory if it is dirty, increasing the miss penalty. With write-through, cache blocks are always clean, so there is no need to write back. In multi-level caches, you can have a mix: for example, write-through for L1 and write-back for L2.

Cache Write Policies (2). On write misses, allocate a cache block frame? Yes: write-allocate. Bring the data in from the lower level, allocate a cache frame, and then do the write; more common in write-back caches. No: no-write-allocate. Do not allocate a cache frame; just send the write to the lower level; more common in write-through caches. For your HW2, you will implement a write-back, write-allocate cache.
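
A simplified sketch of the combined write-back, write-allocate policy in Python (direct-mapped, tags and dirty bits only; an illustration of the policy, not the HW2 interface or solution):

class WBWACache:
    """Direct-mapped, write-back, write-allocate cache model."""
    def __init__(self, num_frames, block_size=64):
        self.block_size = block_size
        self.tags = [None] * num_frames      # block address stored in each frame
        self.dirty = [False] * num_frames
        self.writebacks = 0

    def _lookup(self, addr):
        block = addr // self.block_size
        return block, block % len(self.tags)

    def _fill(self, block, idx):
        if self.tags[idx] is not None and self.dirty[idx]:
            self.writebacks += 1             # write-back: flush the dirty victim
        self.tags[idx] = block
        self.dirty[idx] = False

    def read(self, addr):
        block, idx = self._lookup(addr)
        if self.tags[idx] != block:
            self._fill(block, idx)           # read miss: fetch the block
            return "miss"
        return "hit"

    def write(self, addr):
        block, idx = self._lookup(addr)
        if self.tags[idx] != block:
            self._fill(block, idx)           # write-allocate: fetch, then write
            self.dirty[idx] = True
            return "miss"
        self.dirty[idx] = True               # write hit: mark dirty, no memory traffic
        return "hit"

c = WBWACache(num_frames=4)
print(c.write(0x100))    # miss (write-allocate fills the block, then dirties it)
print(c.write(0x200))    # miss; 0x100 and 0x200 conflict, so the dirty block is written back
print(c.writebacks)      # 1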

Multiple Accesses per Cycle. Superscalars might make multiple parallel cache accesses: a core can make multiple L1$ access requests per cycle (e.g., 2 simultaneous L1 D$ accesses in Intel processors), and multiple cores can access the LLC at the same time. We must either delay some requests, or design the SRAM with multiple ports (big and power-hungry), or split the SRAM into multiple banks (can result in delays, but usually does not).

Multi-Ported SRAMs. [Figure: a 2-ported SRAM cell with Wordline 1 and Wordline 2 and bitline pairs b1/b̄1 and b2/b̄2.] Wordlines: 1 per port. Bitlines: 2 per port. Area = O(ports²).

Multi-Porting vs. Banking. [Figure: one large SRAM array with four decoders and sense amps (4 ports) vs. four separate single-ported SRAM arrays (4 banks) with column muxing.] 4 ports: big (and slow), but guarantees concurrent access. 4 banks, 1 port each: each bank is small (and fast), but conflicts (delays) are possible. How to decide which bank to go to?

Bank Conflicts. Banks are address-interleaved: for a cache with block size b and N banks, Bank = (Address / b) % N. This looks more complicated than it is: the bank number is just the low-order bits of the index. [Address split without banking: tag | index | offset; with banking: tag | index | bank | offset.] Banking can provide high bandwidth, but only if all accesses go to different banks. For 4 banks and 2 simultaneous accesses (to random, independent addresses), the chance of conflict is 25%.
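
The bank-selection rule and the 25% figure, as a sketch (Python; constants are illustrative):

BLOCK_SIZE = 64   # b
NUM_BANKS = 4     # N

def bank_of(addr):
    return (addr // BLOCK_SIZE) % NUM_BANKS   # low-order bits of the block index

print(bank_of(0x0000), bank_of(0x0040), bank_of(0x0100))   # banks 0, 1, 0

# For two accesses to random, independent block addresses, the second lands
# in the same bank as the first with probability 1/N.
print("conflict probability:", 1 / NUM_BANKS)   # 0.25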