CS61C Review of Cache/VM/TLB. Lecture 26. April 30, 1999 Dave Patterson (http.cs.berkeley.edu/~patterson)


CS61C Review of Cache/VM/TLB. Lecture 26. April 30, 1999. Dave Patterson (http.cs.berkeley.edu/~patterson) www-inst.eecs.berkeley.edu/~cs61c/schedule.html

Outline: Review Pipelining; Review Cache/VM/TLB slides; Administrivia; What's this Stuff Good For?; 4 Questions on Memory Hierarchy; Detailed Example; 3Cs of Caches (if time permits); Cache Impact on Algorithms (if time permits); Conclusion. (cs 61C L26 cachereview.1)

Review 1/3: Pipelining Introduction. Pipelining is a fundamental concept: multiple steps use distinct resources, exploiting parallelism among instructions. What makes it easy? (MIPS vs. 80x86) All instructions are the same length, so instruction fetch is simple. There are just a few instruction formats, so registers can be read before the instruction is fully decoded. Memory operands appear only in loads and stores, which allows fewer pipeline stages. Data aligned in memory means one memory access per load or store.

Review 2/3: Pipelining Introduction. What makes it hard? Structural hazards: suppose we had only one cache? We would need more hardware resources. Control hazards: need to worry about branch instructions? Use branch prediction or the delayed branch. Data hazards: an instruction depends on a previous instruction? Use forwarding and compiler scheduling.
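The load-use data hazard above can be made concrete with a small sketch. This is not from the lecture: the instruction encoding (tuples of opcode, destination, sources) and the one-bubble rule for a 5-stage MIPS-like pipeline with forwarding are illustrative assumptions.

```python
# Hypothetical sketch: counting load-use stalls in a 5-stage MIPS-like
# pipeline with forwarding. One bubble is needed whenever an instruction
# reads a register that the immediately preceding load writes; ALU-to-ALU
# dependences are covered by forwarding and cost nothing.

def load_use_stalls(instrs):
    """instrs: list of (op, dest_reg, src_regs) tuples."""
    stalls = 0
    for prev, cur in zip(instrs, instrs[1:]):
        op, dest, _ = prev
        if op == "lw" and dest in cur[2]:   # load result needed next cycle
            stalls += 1                      # forwarding can't help: 1 bubble
    return stalls

program = [
    ("lw",  "t0", ()),            # t0 <- memory
    ("add", "t1", ("t0", "t2")),  # uses t0 right away -> 1 stall
    ("sub", "t3", ("t1", "t2")),  # ALU->ALU forwarding, no stall
]
print(load_use_stalls(program))  # -> 1
```

A compiler can often schedule an independent instruction between the load and its use, which this model would count as zero stalls.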

Review 3/3: Advanced Concepts. Superscalar Issue, Execution, Retire: start several instructions each clock cycle (1999: 3-4 instructions); execute on multiple units in parallel; retire in parallel, with hardware guaranteeing the appearance of simple single-instruction execution. Out-of-order Execution: instructions issue in order, but execute out of order when hazards occur (load-use, cache miss, multiplier busy, ...); instructions retire in order, with hardware guaranteeing the appearance of simple in-order execution.

Memory Hierarchy Pyramid. [Diagram: a pyramid from the Central Processor Unit (CPU) at the top through Level 1, Level 2, Level 3, ..., Level n; moving down the pyramid means increasing distance from the CPU, decreasing cost/MB, and increasing size of memory at each level.] Principle of Locality (in time, in space) + a hierarchy of memories of different speed and cost; exploit them to improve cost-performance.

Why Caches? [Chart: processor vs. DRAM performance, 1980-2000 (Moore's Law); µproc improves 60%/yr, DRAM 7%/yr, so the Processor-Memory Performance Gap grows 50%/year.] 1989: first Intel CPU with cache on chip; today caches are 37% of the area of the Alpha 21164, 61% of the StrongArm SA-110, 64% of the Pentium Pro.

Why virtual memory? (1/2) Protection: regions of the address space can be read only, execute only, ... Flexibility: portions of a program can be placed anywhere, without relocation. Expandability: can leave room in the virtual address space for objects to grow. Storage management: allocation/deallocation of variable-sized blocks is costly and leads to (external) fragmentation.

Why virtual memory? (2/2) Generality: the ability to run programs larger than the size of physical memory. Storage efficiency: retain only the most important portions of the program in memory. Concurrent I/O: execute other processes while loading/dumping a page.

Why a Translation Lookaside Buffer (TLB)? Paging is the most popular implementation of virtual memory (vs. base/bounds). Every paged virtual memory access must be checked against an entry of the Page Table in memory. A cache of Page Table Entries makes address translation possible without a memory access in the common case.

Paging/Virtual Memory Review. [Diagram: User A's and User B's virtual memories (Stack, Heap, Static, Code) map through the A and B Page Tables and a TLB into 64 MB of physical memory.]

Three Advantages of Virtual Memory. 1) Translation: a program can be given a consistent view of memory, even though physical memory is scrambled. Makes multiple processes reasonable. Only the most important part of a program (the "Working Set") must be in physical memory. Contiguous structures (like stacks) use only as much physical memory as necessary yet can still grow later.
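The TLB's role as a cache of page-table entries can be sketched in a few lines. Everything here is illustrative, not from the slides: 4 KB pages, a toy page table, and a software-managed TLB modeled as a plain dict.

```python
# Minimal sketch of paged address translation with a TLB.
# Assumed for illustration: 4 KB pages, a tiny page table, unlimited TLB.

PAGE_SIZE = 4096  # 4 KB pages

page_table = {0: 7, 1: 3, 2: 9}   # virtual page number -> physical frame
tlb = {}                          # cache of recent page-table entries

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                       # common case: no memory access
        frame, hit = tlb[vpn], True
    else:                                # TLB miss: consult the page table
        frame, hit = page_table[vpn], False
        tlb[vpn] = frame                 # refill the TLB
    return frame * PAGE_SIZE + offset, hit

paddr, hit = translate(1 * PAGE_SIZE + 100)    # first touch: TLB miss
paddr2, hit2 = translate(1 * PAGE_SIZE + 200)  # same page: TLB hit
print(paddr, hit, hit2)  # -> 12388 False True
```

A real TLB is small and fully associative, so it also needs a replacement policy; that detail is omitted here.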

Three Advantages of Virtual Memory. 2) Protection: different processes are protected from each other. Different pages can be given special behavior (Read Only, invisible to user programs, etc.). Kernel data is protected from user programs. Very important for protection from malicious programs (far more viruses under Microsoft Windows). 3) Sharing: can map the same physical page into multiple users' address spaces ("shared memory").

Virtual Memory Summary. Virtual memory allows protected sharing of memory between processes, with less swapping to disk and less fragmentation than always-swap or base/bound. 3 problems: 1) Not enough memory: spatial locality means a small Working Set of pages is OK. 2) A TLB to reduce the performance cost of VM. 3) Need a more compact representation to reduce the memory-size cost of a simple 1-level page table, especially for 64-bit addresses (see CS 162).

Administrivia. 10th homework (last): due today. Next readings: A.7. M 5/3: deadline to correct your grade record (up to 10th lab, 10th homework, 5th project). W 5/5: Review: Interrupts/Polling; A.7. F 5/7: 61C summary / your Cal heritage / HKN course evaluation (due: final 61C survey in lab). Sun 5/9: final review starting 2PM (1 Pimentel). W 5/12: Final (5-8PM, 1 Pimentel). Need an early final? Contact mds@cory.

What's This Stuff (Potentially) Good For? Allow civilians to find mines, mass graves, plan disaster relief? "Private Spy in Space to Rival Military's": the Ikonos spacecraft is just 0.8 tons and 5 feet long with its solar panels extended; it runs on just 1,200 watts of power -- a bit more than a toaster. It sees objects as small as one meter, and covers Earth every 3 days. The company plans to sell the imagery for $25 to $300 per square mile via www.spaceimaging.com. "We believe it will fundamentally change the approach to many forms of information that we use in business and our private lives." (N.Y. Times, 4/27/99)

4 Questions for Memory Hierarchy. Q1: Where can a block be placed in the upper level? (Block placement) Q2: How is a block found if it is in the upper level? (Block identification) Q3: Which block should be replaced on a miss? (Block replacement) Q4: What happens on a write? (Write strategy)

Q1: Where is a block placed in the upper level? Example: block-frame address 12 placed in an 8-block cache that is fully associative, direct mapped, or 2-way set associative. Set-associative mapping = block number mod number of sets. Fully associative: block 12 can go anywhere (cache blocks 0-7). Direct mapped: block 12 can go only into cache block 4 (12 mod 8). 2-way set associative: block 12 can go anywhere in set 0 (12 mod 4), with sets 0-3 of two blocks each.

Q2: How is a block found in the upper level? Direct indexing (using the index and block offset), tag compares, or a combination. The block address splits into Tag | Index | Block offset: the index selects the set, and the tag selects the data within the set. Increasing associativity shrinks the index and expands the tag.

Q3: Which block is replaced on a miss? Easy for direct mapped. For set associative or fully associative: Random or LRU (Least Recently Used). Miss rates:

Associativity:    2-way          4-way          8-way
Size        LRU    Random    LRU    Random    LRU    Random
16 KB       5.2%   5.7%      4.7%   5.3%      4.4%   5.0%
64 KB       1.9%   2.0%      1.5%   1.7%      1.4%   1.5%
256 KB      1.15%  1.17%     1.13%  1.13%     1.12%  1.12%
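The Q1/Q2 arithmetic above can be checked with a short sketch. The 8-block cache and block-frame address 12 are the lecture's toy example; the function names and the tag/index/offset field widths are illustrative assumptions.

```python
# Sketch of Q1 (placement) and Q2 (identification) for a toy cache.
# NUM_BLOCKS matches the lecture's 8-block example; everything else is
# illustrative.

NUM_BLOCKS = 8

def placements(block, ways):
    """Cache block numbers where block-frame `block` may be placed."""
    num_sets = NUM_BLOCKS // ways
    s = block % num_sets               # S.A. mapping: block mod #sets
    return list(range(s * ways, s * ways + ways))

print(placements(12, 1))   # direct mapped: [4]      (12 mod 8)
print(placements(12, 2))   # 2-way: set 0 -> [0, 1]  (12 mod 4)
print(placements(12, 8))   # fully associative: any of blocks 0-7

def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, index, block offset) fields."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset
```

Note how `split_address` makes the last point concrete: adding one way of associativity halves `num_sets`, which removes one index bit and adds it to the tag.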

Q4: What happens on a write? Write through: the information is written both to the block in the cache and to the block in the lower-level memory. Write back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced (is the block clean or dirty?). Pros and cons of each? WT: read misses cannot result in writes. WB: no repeated writes to the same block. WT is always combined with write buffers.

Comparing the two levels of the hierarchy:

Cache version                              Virtual memory version
Block or Line                              Page
Miss                                       Page Fault
Block size: 32-64 B                        Page size: 4 KB-8 KB
Placement: Direct Mapped,
  N-way Set Associative                    Fully Associative
Replacement: LRU or Random                 Least Recently Used (LRU)
Write Thru or Back                         Write Back

Picking the Optimal Page Size. Minimize wasted storage: a small page minimizes internal fragmentation, but a small page increases the size of the page table. Minimize transfer time: large pages (multiple disk sectors) amortize access cost, but sometimes transfer unnecessary info, sometimes prefetch useful data, and sometimes discard useless data early. The general trend is toward larger pages because of big cheap RAM, the increasing memory/disk performance gap, and larger address spaces.

Alpha 21064. Separate instruction & data TLBs. Separate instruction & data caches. TLBs fully associative; TLB updates in software. Caches 8 KB direct mapped, write through; 2 MB L2 cache, direct mapped, write back (off chip); 32 B blocks; 256-bit path to memory; 4-entry write buffer between D$ and L2$. [Diagram: the Alpha AXP 21064 memory hierarchy, showing PC, ITLB and DTLB (page-frame address, page offset, valid/read/write bits, physical-address tags), the instruction cache with instruction prefetch stream buffer, the data cache with delayed write buffer, the off-chip L2 cache with victim buffer, the write buffer, main memory, and magnetic disk.]
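The Q4 trade-off above can be illustrated with a toy traffic counter. This is a deliberately simplified model (one cached block, no write buffer), and the function name and counters are invented for the demonstration.

```python
# Hedged sketch contrasting write through and write back for repeated
# stores to a single cached block: WT sends every store to memory; WB
# marks the block dirty and writes it back once, on replacement.

def memory_writes(policy, n_writes):
    """Count lower-level memory writes for n_writes stores to one block."""
    mem_writes = 0
    dirty = False
    for _ in range(n_writes):
        if policy == "write-through":
            mem_writes += 1          # every store also goes to memory
        else:                        # write-back: just mark the block dirty
            dirty = True
    if policy == "write-back" and dirty:
        mem_writes += 1              # single write-back when replaced
    return mem_writes

print(memory_writes("write-through", 10))  # -> 10
print(memory_writes("write-back", 10))     # -> 1
```

The model also shows why WT wants a write buffer: without one, all ten stores would each stall for a memory write.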

Starting an Alpha 21064: 1/6. Starts in kernel mode, where memory is not translated. Since the TLB is software managed, first load the instruction TLB with 2 valid mappings for the process. Set the PC to the appropriate address for the user program and go into user mode. The page-frame portion of the 1st instruction access is sent to the instruction TLB, which checks for a valid PTE and matching access rights (otherwise an exception for a page fault or access violation).

Starting an Alpha 21064: 2/6. Since there are 256 32-byte blocks in the instruction cache, an 8-bit index selects the I-cache block. The translated address is compared to the cache block tag (assuming the tag entry is valid). If a hit, the proper 4 bytes of the block are sent to the CPU. If a miss, the address is sent to the L2 cache; its 2 MB hold 64K 32-byte blocks, so a 16-bit index selects the L2 cache block, and the address tag is compared (assuming it is valid). If an L2 cache hit, the 32 bytes arrive in 10 clock cycles.

Starting an Alpha 264: 5/6 Suppose instead st instruction is a store TLB still checked (no protection violations and valid PTE), send translated address to data cache Access to data cache as before for hit, except write new data into block Since Data Cache is write through, also write data to Write Buffer If Data Cache access on store is a miss, also write to write buffer since policy is no allocate on write miss cs 6C L26 cachereview.29 Starting an Alpha 264: 5/6 Write buffer checks to see if write is already to address within entry, and if so updates the block If 4-entry write buffer is full, stall until entry is written to L2 cache block All writes eventually passed to L2 cache If miss, put old block in buffer, and since L2 allocates on write miss, load missing data from cache;write to portion of block,and mark new L2 cache block as dirty; If old block in buffer is dirty, then write it to memory cs 6C L26 cachereview.3 Classifying Misses: 3 Cs Compulsory The first access to a block is not in the cache, so the block must be brought into the cache. (Misses in even an Infinite Cache) Capacity If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in Fully Associative Size X Cache) Conflict If block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. (Misses in N-way Associative, Size X Cache) cs 6C L26 cachereview.3 3Cs Absolute Miss Rate (SPEC92) Miss Rate per Type.4.2..8.6.4.2 cs 6C L26 cachereview.32 Compulsory vanishingly small -way 2 2-way 4 4-way 8 8-way Conflict Capacity 6 Cache Size (KB) 32 64 28 Compulsory

Miss Rate per Type 3Cs Relative Miss Rate % 8% 6% 4% 2% % cs 6C L26 cachereview.33 3Cs Flaws: for fixed block size 3Cs Pros: insight invention 2 -way 2-way 4-way 8-way 4 8 6 Cache Size (KB) Capacity 32 64 28 Compulsory Conflict Impact of What Learned About Caches? 96-985: Speed = ƒ(no. operations) 99s Pipelined Execution & Fast Clock Rate Out-of-Order execution Superscalar 999: Speed = ƒ(non-cached memory accesses) Superscalar, Out-of-Order machines hide L data cache miss (5 clocks) but not L2 cache miss (5 clocks)? cs 6C L26 cachereview.34 98 98 982 983 984 985 986 987 988 989 99 99 992 993 994 995 996 997 998 999 2 CPU DRAM Quicksort vs. Radix as vary number keys: Instructions Quicksort vs. Radix as vary number keys: Instructions and Time 8 7 6 5 Radix sort Quick (Instr/key) Radix (Instr/key) 8 7 6 5 Radix sort Time Quick (Instr/key) Radix (Instr/key) Quick (Clocks/key) Radix (clocks/key) 4 4 3 2 Quick sort Instructions/key 3 2 Quick sort Instructions E+7 E+7 cs 6C L26 cachereview.35 Set size in keys cs 6C L26 cachereview.36 Set size in keys

3Cs Relative Miss Rate. [Chart: the same data shown as a percentage (0-100%) of all misses, for 1-way through 8-way associativity and 1-128 KB cache sizes.] 3Cs flaws: defined for a fixed block size. 3Cs pros: insight leads to invention.

Impact of What We Learned About Caches? 1960-1985: Speed = f(no. of operations). 1990s: pipelined execution & fast clock rate, out-of-order execution, superscalar. 1999: Speed = f(non-cached memory accesses). Superscalar, out-of-order machines hide an L1 data cache miss (5 clocks) but not an L2 cache miss (50 clocks).

Quicksort vs. Radix sort as the number of keys varies. [Charts: instructions/key and clocks/key vs. set size, up to 1E+7 keys. Radix sort executes fewer instructions per key than quicksort as the set grows, yet its clocks per key grow.] [Chart: cache misses per key vs. set size: radix sort's misses/key climb with set size while quicksort's stay low, explaining the time gap.] What is the proper approach to fast algorithms?

Cache/VM/TLB Summary: 1/3. The Principle of Locality: a program accesses a relatively small portion of the address space at any instant of time. Temporal locality: locality in time. Spatial locality: locality in space. 3 major categories of cache misses: Compulsory misses: sad facts of life (example: cold-start misses). Capacity misses: increase cache size. Conflict misses: increase cache size and/or associativity.

Cache/VM/TLB Summary: 2/3. Caches, TLBs, and virtual memory are all understood by examining how they deal with 4 questions: 1) Where can a block be placed? 2) How is a block found? 3) What block is replaced on a miss? 4) How are writes handled? Page tables map virtual addresses to physical addresses. TLBs are important for fast translation, and TLB misses are significant in processor performance.

Cache/VM/TLB Summary: 3/3. Virtual memory was controversial at the time: can software automatically manage 64 KB across many programs? 1000X DRAM growth removed the controversy. Today VM allows many processes to share a single memory without having to swap all processes to disk; VM protection is now more important than the memory hierarchy motivation. Today CPU time is a function of (ops, cache misses) vs. just f(ops): what does this mean to compilers, data structures, algorithms?
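The closing question about algorithms and cache misses can be felt even from Python, though only faintly: the classic demonstration is traversing a matrix row by row (good spatial locality) versus column by column (a large stride on every access). The matrix size and function names below are illustrative; in a low-level language like C the timing gap is far larger, since Python's list-of-lists layout and interpreter overhead mute the cache effect.

```python
import time

# Same O(n^2) work, different locality: row-major traversal touches
# neighboring elements; column-major traversal strides across rows.

N = 1000
matrix = [[1] * N for _ in range(N)]

def sum_rows():
    return sum(matrix[i][j] for i in range(N) for j in range(N))

def sum_cols():
    return sum(matrix[i][j] for j in range(N) for i in range(N))

for f in (sum_rows, sum_cols):
    t0 = time.perf_counter()
    total = f()                      # both compute the same sum
    print(f.__name__, total, round(time.perf_counter() - t0, 3), "s")
```

This is the lecture's radix-sort lesson in miniature: counting operations alone predicts identical run times, while the memory hierarchy makes the access order matter.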