Caches

- Today's topics
  » the basics: memory hierarchy, locality, cache models, associativity options, calculating miss penalties, some fundamental optimization issues

The Problem

- Widening memory gap
  » DRAM latency CAGR = 7%
  » CPU performance CAGR ≈ 25% prior to 1986, 52% from 1986 to 2005, 20% thereafter

The Solution

- Deepen the memory hierarchy
  » make dependence on DRAM latency the rare case, if you can

Balancing Act

- As always: performance vs. cost, capacity vs. latency, multi-ported vs. single-ported
- On-die SRAM is expensive per bit
  » latency depends on size and organization
  » area: 6T per bit
  » power: not so bad, since only a small portion is active per cycle
- DRAM
  » latency is awful when counted in cycles at GHz clock speeds
  » area: 1T + 1C per bit (lots of details later)
- Dealing with the cache size vs. latency trade-off: deepen the cache hierarchy
  » small, separate IL1$ and DL1$ minimize memory structural stalls
  » a unified L2 reduces fragmentation problems
  » multicore introduces a global L3
- This may not continue to be a good idea
  » everything runs out of steam at some point
  » the key is what % of L1 misses hit in L2 (and, by induction, between any Ln and Ln+1); a worked sketch follows below
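To make that last point concrete, here is a minimal C sketch computing average memory access time (AMAT) for a two-level hierarchy. All latencies and rates are made-up illustrative values, not figures from the lecture.

    #include <stdio.h>

    /* Two-level AMAT: shows why the fraction of L1 misses that hit in L2
     * dominates. All numbers below are made-up illustrative values. */
    int main(void) {
        double l1_hit = 2.0;         /* L1 hit time, cycles */
        double l2_hit = 12.0;        /* L2 hit time, cycles */
        double dram = 200.0;         /* DRAM access time, cycles */
        double l1_miss_rate = 0.05;  /* fraction of accesses that miss L1 */

        for (int i = 0; i <= 3; i++) {
            /* sweep the local L2 hit rate: what % of L1 misses hit in L2 */
            double l2_local_hit = 0.50 + 0.15 * i;
            double amat = l1_hit +
                          l1_miss_rate * (l2_hit + (1.0 - l2_local_hit) * dram);
            printf("L2 local hit rate %.2f -> AMAT %.2f cycles\n",
                   l2_local_hit, amat);
        }
        return 0;
    }

With these assumed numbers, raising the local L2 hit rate from 0.50 to 0.95 cuts the AMAT from 7.6 to 3.1 cycles: the "make DRAM the rare case" argument in numbers.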

Locality

- 2 basic types
  » temporal: if you used this block, then you will use it again soon
  » spatial: if you used this block, you will use a nearby one soon
- Exploiting locality
  » match the application's memory usage to the cache design; if you match, you win, simple as that
- Some codes are "touch once" and fall straight through to misses
  » e.g. encrypt/decrypt, encode/decode, ... (the media influence)
  » here caches are a liability: you have to swing to miss. Try L1: miss? Try L2: miss? That is a lot of extra time if you end up going to DRAM anyway.
- Historical tidbit
  » Seymour Cray didn't believe in caches, or DRAM, or (initially) even 2's complement

Basics: Performance Issues

  CPUtime = IC × CPI × cycle_time
  cycle_time = 1 / frequency
  IPC = 1 / CPI
  CPUtime = IC / (IPC × frequency)

- Enter stalls
  » IPC gets degraded by stalls
  » we've seen stalls due to hazards and pipeline issues; now the focus is on memory stalls
  » their influence depends on the load & store %; with good branch prediction, most stalls are memory induced

Computing Memory Effect

- Misses induce stalls
  » ILP & TLP can overlap some of the penalty
  » but a miss is a surprise, so some of the penalty won't be hidden

  XEQtime = (CPUcycles + MEMstallcycles) × cycle_time
  MEMstallcycles = num_misses × miss_penalty
                 = IC × (misses / instruction) × miss_penalty
                 = IC × (memory_accesses / instruction) × miss_rate × miss_penalty

- Separating reads and writes: MEMstallcycles = READstalls + WRITEstalls
  » READstalls = IC × (reads / instruction) × read_miss_rate × read_miss_penalty
  » WRITEstalls = IC × (writes / instruction) × write_miss_rate × write_miss_penalty
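The stall equations translate directly into code. A minimal C sketch; the 2 GHz clock, miss rate, and accesses per instruction are illustrative assumptions, not benchmark data.

    #include <stdio.h>

    /* Direct translation of the stall equations above:
     *   MEMstallcycles = IC * (memory_accesses/instruction) * miss_rate * miss_penalty
     *   XEQtime        = (CPUcycles + MEMstallcycles) * cycle_time
     * All inputs are made-up illustrative values. */
    int main(void) {
        double ic = 1e9;             /* instruction count */
        double base_cpi = 1.0;       /* CPI ignoring memory stalls */
        double mem_per_inst = 1.3;   /* memory accesses per instruction */
        double miss_rate = 0.02;     /* misses per memory access */
        double miss_penalty = 200.0; /* cycles per miss */
        double cycle_time = 0.5e-9;  /* seconds (assumed 2 GHz clock) */

        double cpu_cycles = ic * base_cpi;
        double stall_cycles = ic * mem_per_inst * miss_rate * miss_penalty;
        double xeq_time = (cpu_cycles + stall_cycles) * cycle_time;

        printf("effective CPI = %.2f, exec time = %.3f s\n",
               (cpu_cycles + stall_cycles) / ic, xeq_time);
        return 0;
    }

With these assumed numbers the effective CPI is 6.2 against a base CPI of 1.0; the machine spends most of its time waiting on memory, which is why the rest of the lecture is about reducing misses and their penalty.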

4 Questions = $ Organization

- Q1: Where can a block be placed? (examples shortly)
  » fully associative: answer = anywhere
  » direct mapped: answer = only one place
  » set-associative: in one of a small number of ways
- Q2: How is a block found?
  » 2 types of tags
    status tags: valid and dirty (for now)
    address tag: the chance of an alias must be resolved (next slide)
- Q3: Which block is replaced on a miss?
  » LRU: best but expensive; matches the temporal locality model
  » FIFO: good but expensive; like LRU, but ordered by 1st touch rather than by use
  » random: defeats temporal locality but is simple to implement
  » approximation by epochs: add a "use" status tag
- Q4: What happens on a write?
  » hit: write-through or write-back (the latter requires the dirty flag)
  » miss: write-allocate (a.k.a. write-replace) or write-around (a.k.a. write-no-allocate); modern CPUs generally allow both

Cache Components

- Cache block (line) size varies; 64B is a common choice
  » there is no reason why the block size can't be larger the deeper you go in the memory hierarchy
  » keeping cache lines the same size at every level reduces some complexity
  » memory-to-cache transfers move line-sized chunks; disk-to-memory transfers move page-sized chunks
- 2 main structures & a bunch of logic
  » the data RAM holds the cached data
  » the tag RAM holds the tag information
    same number of entries as the data RAM: an entry = a line for direct-mapped or fully associative, an entry = a set for set-associative
    for set-associative, the width = a number of ways
    for each set of address tags, the status tags are present as well

Block Identification

- tag = address tag
  » held in the tag RAM
  » its size is whatever is left over after the index and block offset
- index
  » log2(number of data RAM entries)
- block offset
  » says which byte, half-word, or word is to be moved to the target register
    silly in a way: a word or double is transferred to the register, and the appropriate byte or half-word is then used for the op
  » size = log2(line size)
- Increasing the offset or index bits reduces the tag size (a bit-slicing sketch follows the figure below)
- Alias problem?

Cache Access

[Figure: a direct-mapped cache. The byte address (e.g. 101000) is split into index and offset bits; with 8 words, 3 index bits select the entry in a data array of 8-byte words. Each address maps to a unique location.]
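The tag/index/offset split is plain bit slicing. A minimal C sketch for a hypothetical direct-mapped cache with 64B lines (6 offset bits) and 256 entries (8 index bits); the parameters and the address are chosen purely for illustration.

    #include <stdint.h>
    #include <stdio.h>

    /* Split a byte address into block offset, index, and tag for a
     * hypothetical direct-mapped cache: 64B lines -> 6 offset bits,
     * 256 entries -> 8 index bits. The tag is whatever is left over. */
    #define OFFSET_BITS 6u
    #define INDEX_BITS  8u

    int main(void) {
        uint32_t addr   = 0x12345678u;  /* arbitrary example address */
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

        printf("addr 0x%08x -> tag 0x%x, index %u, offset %u\n",
               (unsigned)addr, (unsigned)tag, (unsigned)index, (unsigned)offset);
        return 0;
    }

Note how growing the line size or entry count steals bits from the tag, which is the "increasing the offset or index reduces the tag size" point above.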

De-Alias by Matching the Address Tag

[Figure: the byte address (e.g. 10100000) is split; its tag field (101000) is compared against the tag array entry selected by the index while the data array of 8-byte words is read. If the compare succeeds, it is a hit.]

Set Associative Cache

[Figure: a set-associative cache with Way-1 and Way-2; the index selects a set, and a tag compare is performed in each way.]

- Set associativity gives fewer conflicts, but wastes power, because multiple data and tag entries are read on every access (a lookup sketch follows at the end of this section)

Miss Types

- The 3 C's (for now; a 4th will show up later)
- compulsory
  » the 1st access to a block will always miss
  » fix: prefetch if you can
    LD R0, address will do the trick for MIPS
    several machines have HW-assisted prefetch engines, e.g. dynamic stride prediction in the HP PA-7400
    prefetching is another form of speculation: it just wastes power if you lose, so benefit vs. liability is a tricky balance point
- capacity
  » a line that was previously in the cache is evicted and then reloaded
  » an indication that the app's working set is bigger than the cache
  » fix: a bigger cache, or prefetch
- conflict
  » e.g. you only need 2 lines, but they victimize each other
- How can you tell a capacity miss from a conflict miss?
  » conflict misses don't exist in a fully associative cache, since any line can go anywhere
  » run the test on the set-associative or direct-mapped cache, and then on a fully associative cache of the same capacity; after discounting all first-touch (compulsory) misses, the intersection of the miss sets gives the capacity misses, and the rest are conflict misses
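In software, the per-way tag compare from the figures above looks like the loop below; in hardware the compares run in parallel. The structure layout and way count are hypothetical, for illustration only.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* One tag-RAM entry: address tag plus status bits (valid, dirty). */
    struct tag_entry {
        uint32_t tag;
        bool valid;
        bool dirty;
    };

    #define NUM_WAYS 4

    /* Set-associative lookup: compare the incoming address tag against
     * every way of the selected set. Returns the hitting way, or -1 on
     * a miss. */
    int lookup(const struct tag_entry set[NUM_WAYS], uint32_t addr_tag) {
        for (int way = 0; way < NUM_WAYS; way++) {
            if (set[way].valid && set[way].tag == addr_tag)
                return way;  /* compare succeeds: hit in this way */
        }
        return -1;           /* no compare succeeds: miss */
    }

    int main(void) {
        struct tag_entry set[NUM_WAYS] = {
            { 0x48D1, true, false }, { 0x1F00, true, true },
            { 0x0000, false, false }, { 0x0ABC, true, false },
        };
        printf("tag 0x1F00 -> way %d\n", lookup(set, 0x1F00));  /* hit: way 1 */
        printf("tag 0x2222 -> way %d\n", lookup(set, 0x2222));  /* miss: -1  */
        return 0;
    }

The same loop with NUM_WAYS = 1 is a direct-mapped lookup, and with NUM_WAYS equal to the whole line capacity it is a fully associative search, which is exactly the power problem discussed next.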

Increasing Associativity

- The data and tag RAM #_entries stay the same; they depend on capacity
- Logic involved in the compares
  » for n ways, n parallel compares
    if one of them succeeds, it is a hit in the associated way
    if none succeed, it is a miss
    if more than one succeeds, somebody made a big mistake
    if n is large, then there are problems → way prediction
  » fully associative
    a huge number of parallel compares (m-line capacity → m compares): power hungry, which limits its use to smallish caches like a TLB
    or save power but increase hit time: do n compares where n << m, walking the tag array in n-sized chunks and stopping on a hit; but then the miss time is VERY long
    that gives a variable miss time, though in essence miss time is always variable anyway

Organization Effects

- Increasing associativity and capacity helps, but there are trade-offs
- Increasing either increases both power and delay
- You need to find the sweet spot (see the AMAT sketch at the end of this section)

Optimizations to Reduce Miss Rate

- Increase block size
  » +: reduces the tag size, compulsory misses, and the miss rate if there is spatial locality
  » -: a miss must fetch a larger block; with no spatial locality the extra data is waste; conflict misses increase, since at the same capacity there will be fewer blocks in the cache
- Increase cache size
  » +: reduces conflict and capacity misses
  » -: larger caches are slower → increased hit time
- Increase associativity
  » issues already discussed
  » rule of thumb: a 2-way set-associative cache of capacity N/2 has about the same miss rate as a direct-mapped (1-way) cache of size N
- Way prediction
  » can use methods similar to branch prediction
  » get it right and you reduce power consumption, since 1-way set-associative = direct mapped

Hiding/Tolerating Miss Penalty

- OOO execution
  » a combo of ILP and TLP techniques: do as much as you can between a load and its consumer
- Non-blocking caches
  » the first miss doesn't block subsequent accesses
  » the cache controller keeps track of multiple outstanding misses: MSHRs (miss status handling registers), and dynamic issue is required
- Write buffers
  » prioritize reads; handle writes when memory is otherwise idle (the opposite of the Itanium ALAT concept)
  » reads must check the write buffer to get the latest result
- Prefetching
  » even if you get it right, being too aggressive increases cache pressure
  » it is possible to increase conflict/capacity misses
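A common way to weigh these trade-offs is average memory access time, AMAT = hit_time + miss_rate × miss_penalty, the per-access view of the stall equations earlier. A minimal C sketch comparing two hypothetical L1 configurations; all numbers are illustrative, not measured.

    #include <stdio.h>

    /* AMAT = hit_time + miss_rate * miss_penalty. Compare two
     * hypothetical L1 configurations: a lower miss rate does not
     * automatically win if it costs hit time. Numbers are made up. */
    struct config { const char *name; double hit_time, miss_rate; };

    int main(void) {
        double miss_penalty = 100.0;  /* cycles to the next level */
        struct config cfgs[] = {
            { "32 KB direct-mapped", 1.0, 0.040 },
            { "64 KB 2-way",         1.4, 0.028 },
        };
        for (int i = 0; i < 2; i++) {
            double amat = cfgs[i].hit_time + cfgs[i].miss_rate * miss_penalty;
            printf("%-22s AMAT = %.2f cycles\n", cfgs[i].name, amat);
        }
        return 0;
    }

With these made-up numbers the bigger, more associative cache wins (4.2 vs. 5.0 cycles) despite its slower hit; change the miss penalty or hit times and the sweet spot moves, which is the whole point of the exercise.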

The Problem w/ Addresses

- Program vs. memory
  » a program generates virtual addresses
  » main memory and I/O land use physical addresses
  » so what does the cache use?
- A physically indexed, physically tagged cache
  » result: increased hit_time; by Amdahl's reasoning, not a good idea if you hit most of the time. Why?
  » the virtual address needs to be translated to a physical address before the cache access can even begin
    the TLB is a cache of these translations
    a miss in the TLB goes to the page table, which may be in main memory or even on disk. UGHly.
- Improve by doing the address translation in parallel with the data access
  » a bit speculative, but with a high hit rate it's the right choice

Virtually Indexed, Physically Tagged $

- Note that the virtual address space can be bigger than the physical address space. Problems?

Issues and Options

- VI-PT $
  » the page size inherently limits the size of the L1$ (the arithmetic is sketched below)
  » higher associativity can mitigate this, BUT delay and/or power may go up
- VI-VT $: just do everything virtual
  » note: these caches do exist, so it's not hypothetical hokum
  » removes the page-size limit on cache capacity
  » problem: multiple processes run concurrently, and the OS guarantees address space privacy
  » fix: assign each process an address space ID (AS-ID)
    the address space ID becomes part of the tag RAM and is provided with the virtual address
    processes are not interleaved the way threads are, so the process ID is known on a context switch
    the OS must flush the cache if the AS-ID isn't active
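The page-size limit is simple arithmetic: if the index and block offset must fit inside the page-offset bits, so that set selection can start before the TLB finishes translating, then cache_size ≤ page_size × associativity. A tiny C sketch of that bound, assuming hypothetical 4 KB pages.

    #include <stdio.h>

    /* For a virtually indexed, physically tagged cache, index + offset
     * bits must fit inside the page offset so the set can be selected
     * before translation completes. That bounds the cache:
     *     max_size = page_size * associativity
     * The parameters below are hypothetical. */
    int main(void) {
        unsigned page_size = 4096;           /* 4 KB pages */
        unsigned ways[] = { 1, 2, 4, 8 };

        for (int i = 0; i < 4; i++)
            printf("%u-way VI-PT L1 limited to %u KB\n",
                   ways[i], page_size * ways[i] / 1024);
        return 0;
    }

This is exactly why higher associativity mitigates the limit: under these assumptions an 8-way VI-PT L1 can reach 32 KB even with 4 KB pages.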

Concluding Remarks

- This lecture
  » somewhat remedial, but essential for understanding what follows
  » if you don't understand it: read the book more thoroughly, or go back to the CS 3810 text; ask questions; confer w/ Dogan, ...
  » mid-term questions will be conceptual
  » subsequent homework will be more substantive
- Applies to reality as well
  » lots of cores complicate cache design
  » the foundation, however, is the same