Cache Performance and Memory Management: From Absolute Addresses to Demand Paging

6.823 Lecture 11: Cache Performance and Memory Management: From Absolute Addresses to Demand Paging
Krste Asanovic, Laboratory for Computer Science, M.I.T.
http://www.csg.lcs.mit.edu/6.823

Cache Performance

Average memory access time (AMAT) = Hit time + Miss rate x Miss penalty

To improve performance:
- reduce the hit time
- reduce the miss rate (e.g., a larger cache)
- reduce the miss penalty (e.g., an L2 cache)
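As a concrete illustration of the formula, the short program below plugs in hypothetical numbers (a 1-cycle hit time, 5% miss rate, and 40-cycle miss penalty; these figures are assumptions, not from the lecture):

```c
/* Worked AMAT example with assumed, illustrative parameters. */
#include <stdio.h>

int main(void)
{
    double hit_time     = 1.0;   /* cycles (assumed) */
    double miss_rate    = 0.05;  /* fraction of accesses that miss (assumed) */
    double miss_penalty = 40.0;  /* cycles (assumed) */

    /* AMAT = hit time + miss rate x miss penalty */
    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.1f + %.2f * %.1f = %.1f cycles\n",
           hit_time, miss_rate, miss_penalty, amat);
    return 0;
}
```

With these numbers AMAT is 3.0 cycles; halving the miss rate (e.g., a larger cache) or halving the miss penalty (e.g., an L2 cache) each brings it down to 2.0 cycles, which is why all three terms are worth attacking.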

Effect of Cache Parameters on Performance

Larger cache size:
+ reduces capacity and conflict misses
- hit time may increase

Larger block size:
+ spatial locality reduces compulsory misses and capacity reload misses
- having fewer blocks may increase the conflict miss rate
- larger blocks may increase the miss penalty

Higher associativity:
+ reduces conflict misses (up to around 4-8 ways)
- may increase access time

Reducing Hit Time

- On-chip versus off-chip caches
- Pipelining the write tag check and data update to get single-cycle write hits
- Sum-addressed caches (UltraSPARC-III):
  - the equality A + B = C can be evaluated faster than A and B can be added!
  - in register+offset addressing mode, no addition is needed before the cache access; the tag-compare unit performs the A + B = C operation
- Pseudo-associative (way-predicting) caches:
  - guess which way will hit and look there first
  - check the other ways sequentially on a mispredict
  - combines a direct-mapped hit time with an associative miss rate
  - hit time grows if way prediction is poor (a sketch follows below)
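To make the way-prediction idea concrete, here is a minimal sketch of a pseudo-associative lookup, assuming a hypothetical 2-way cache with one predicted way per set (the structure and cycle counts are illustrative, not taken from any real design):

```c
/* Way-predicted lookup: probe the predicted way first (direct-mapped
 * speed), and only on a mispredict probe the remaining ways. */
#include <stdbool.h>
#include <stdint.h>

#define NSETS 256
#define NWAYS 2

struct line { bool valid; uint32_t tag; };

static struct line cache[NSETS][NWAYS];
static uint8_t     predicted_way[NSETS];  /* last way that hit in each set */

/* Returns true on a hit; *cycles is 1 for a predicted-way hit and 2
 * when the other way had to be probed sequentially. */
bool lookup(uint32_t set, uint32_t tag, int *cycles)
{
    uint8_t p = predicted_way[set];
    if (cache[set][p].valid && cache[set][p].tag == tag) {
        *cycles = 1;                 /* fast path: prediction was right */
        return true;
    }
    for (uint8_t w = 0; w < NWAYS; w++) {
        if (w == p)
            continue;
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            predicted_way[set] = w;  /* retrain the predictor */
            *cycles = 2;             /* slow hit in a non-predicted way */
            return true;
        }
    }
    *cycles = 2;
    return false;                    /* miss after checking every way */
}
```

If the predictor is usually right, most hits take the direct-mapped time; a poor predictor turns many hits into the slow two-probe case, which is exactly the "larger hit time if way prediction is poor" trade-off above.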

Techniques to Further Reduce Cache Miss Rate

Hardware prefetching of instructions and data:
- can remove compulsory misses
- stream buffers predict sequential or strided accesses
- speculative fetches can add useless memory traffic

Software prefetching:
- requires extending the ISA with register and cache prefetch instructions
- takes extra instruction issue slots
- the software is tuned to one hardware implementation

Other software techniques:
- placement (e.g., array padding) to reduce cache conflicts
- blocking transformations to reuse data once it is in the cache, reducing capacity misses (see the tiling sketch after the next slide)

Reducing Miss Penalty

- Give priority to read misses over writes and write-backs:
  - queue writes in a write buffer and let reads overtake writes to memory
  - the write buffer must be checked for a match on every read address
- Multi-level caches: reduced latency on a primary cache miss
- Sub-block placement: fetch only part of a block on a miss
  - small tag overhead from large blocks, but a low miss penalty from small sub-blocks
  - needs an extra valid bit per sub-block
- Fetch the critical word in the block first (aka wrap-around refill): restart the processor as soon as the needed word arrives
- Victim caches: hold recently evicted blocks nearby to reduce the conflict miss penalty of low-associativity caches
- Non-blocking caches: reduce stalls on cache misses
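The blocking transformation mentioned above is easiest to see on matrix multiply. The sketch below tiles the loops so each B x B tile is reused while it is cache-resident; the tile size B is an assumption to be tuned so that roughly three tiles fit in the cache:

```c
/* Cache blocking (tiling) of matrix multiply to reduce capacity misses.
 * Matrices are n x n, row-major; c must be zero-initialized by the
 * caller. */
#include <stddef.h>

#define B 32  /* tile size: an assumption, tune to the target cache */

void matmul_blocked(size_t n, const double *a, const double *b, double *c)
{
    for (size_t ii = 0; ii < n; ii += B)
        for (size_t jj = 0; jj < n; jj += B)
            for (size_t kk = 0; kk < n; kk += B)
                /* work on one B x B tile while it stays cache-resident */
                for (size_t i = ii; i < ii + B && i < n; i++)
                    for (size_t j = jj; j < jj + B && j < n; j++) {
                        double sum = c[i * n + j];
                        for (size_t k = kk; k < kk + B && k < n; k++)
                            sum += a[i * n + k] * b[k * n + j];
                        c[i * n + j] = sum;
                    }
}
```

The naive i-j-k loop streams through all of b once per row of a, evicting it repeatedly for large n; with tiling, each element of b is brought into the cache roughly n/B times instead of n times, cutting capacity misses by about a factor of B.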

Memory Hierarchy Example: AlphaStation 600/5

A desktop workstation built around an Alpha 21164 at 333 MHz, with on-chip L1 and L2 caches:
- L1 instruction cache: 8 KB, direct-mapped, 32 B lines; fetches four instructions per cycle (16 B); the instruction stream prefetches up to 4 cache lines ahead
- L1 data cache: 8 KB, direct-mapped, 32 B lines, write-through; loads two 8 B words or stores one 8 B word per cycle (2-cycle latency); up to 21 outstanding loads and 6 x 32 B lines of outstanding writes
- L2 unified cache: 96 KB, 3-way set-associative, 64 B blocks with 32 B sub-blocks, write-back; 16 B/cycle bandwidth (7-cycle latency)
- Off-chip L3 unified cache: 8 MB, direct-mapped, 64 B blocks; peak bandwidth 16 B every 7 cycles (15-cycle latency)
- DRAM: peak bandwidth 16 B every 10 cycles (60-cycle latency)

Memory Management

- The fifties: absolute addresses; dynamic address translation
- The sixties: paged memory systems and TLBs; Atlas demand paging
- Modern virtual memory systems (next lecture)

Types of Names for Memory Locations

[Figure: a machine language address is turned by the ISA into an effective address, which address mapping turns into a physical address in physical memory.]

- Machine language address: the address as specified in the machine code
- Effective address: the ISA specifies how the machine-code address is translated into the virtual address of a program variable (this may involve segment registers, etc.)
- Physical address: the name of a physical memory location

Absolute Addresses (EDSAC, early 1950s)

effective address = physical memory address

- Only one program ran at a time, with unrestricted access to the entire machine (RAM + I/O devices)
- Addresses in a program depended on where the program was to be loaded in memory
- But it was more convenient for programmers to write location-independent subroutines
- This led to the development of loaders and linkers to statically relocate and link programs

Dynamic Address Translation

Motivation: in the early machines, I/O operations were slow and each word transferred involved the CPU. Throughput is higher if the CPU and I/O of two or more programs are overlapped: multiprogramming.

- Location-independent programs: ease of programming and storage management motivate a base register
- Protection: independent programs should not affect each other inadvertently, which motivates a bound register

[Figure: program 1 and program 2 resident in separate regions of physical memory.]

User versus Kernel

With multiprogramming came a move away from programming the bare machine:
- users should be protected from each other (and from program bugs)
- users need to share resources (e.g., CPU time, memory, disk)

Hardware support evolved to support an OS:
- device interrupts to support multiprogramming (it cannot be enforced that users' software polls for each other's devices)
- protected state and privileged execution modes to run the OS
- exceptions to catch protection violations

Purely software schemes to manage protection (high-level-language programming only, no assembly code, and a trusted compiler) were not successful: software bugs caused security loopholes and system crashes.

Simple Base and Bound Translation

[Figure: on a load, the effective address is compared against the bound register (a bound violation traps) and added to the base register to form the physical address within the current segment of main memory.]

The base and bound registers are visible/accessible only when the processor is running in kernel mode (aka supervisor mode). (A sketch follows below.)

Separate Areas for Program and Data

[Figure: a data base/bound pair translates data accesses into a data segment, while a program base/bound pair translates instruction fetches, via the program counter, into a program segment.]

- This permitted sharing of program segments
- Used today on Cray vector supercomputers
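A base-and-bound check is simple enough to sketch directly. The code below is a minimal model of the translation in the figure; the register pair is hypothetical and would in reality be privileged state loaded by the OS on each context switch:

```c
/* Base-and-bound translation: bound check, then relocation by base. */
#include <stdbool.h>
#include <stdint.h>

struct segment {
    uint32_t base;   /* physical start of the current segment */
    uint32_t bound;  /* segment length in bytes */
};

/* Returns false on a bound violation (the hardware would trap to the
 * OS); otherwise writes the relocated physical address. */
bool translate(struct segment seg, uint32_t eff_addr, uint32_t *phys_addr)
{
    if (eff_addr >= seg.bound)
        return false;              /* bound violation */
    *phys_addr = seg.base + eff_addr;
    return true;
}
```

The separate program/data scheme above is the same mechanism duplicated: instruction fetches go through the program base/bound pair, loads and stores through the data pair.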

Memory Fragmentation

[Figure: three snapshots of memory. Initially users 1-3 occupy partitions with free holes of 16 K and 32 K below the OS space; after users 4 and 5 arrive the holes shrink to 16 K, 16 K, 8 K, and 32 K; after users 2 and 3 leave, holes of 16 K, 16 K, and 40 K remain.]

As users come and go, the storage becomes fragmented. Therefore, at some stage programs have to be moved around to compact the storage ("burping" the memory).

Paged Memory Systems

To reduce fragmentation, the processor-generated address is interpreted as a pair <page number, offset>. The page number indexes the user's page table, and the page table contains the physical address of the base of each page.

[Figure: pages 0-3 of user 1's address space are mapped through user 1's page table to non-contiguous physical page frames.]

Fixed-length pages plus indirection through the page table relax the contiguous-allocation requirement. (A translation sketch follows below.)
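The <page number, offset> interpretation can be sketched in a few lines, assuming hypothetical 4 KB pages and a flat one-level page table (the lecture's figure uses a tiny 4-page address space, but the arithmetic is the same):

```c
/* One-level paged translation: split the address, then indirect
 * through the page table. */
#include <stdint.h>

#define PAGE_SHIFT 12                  /* 4 KB pages: an assumption */
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define PAGE_MASK  (PAGE_SIZE - 1)

/* page_table[vpn] holds the physical base address of that page. */
uint32_t translate(const uint32_t *page_table, uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;  /* page number */
    uint32_t offset = vaddr & PAGE_MASK;    /* offset within the page */
    return page_table[vpn] + offset;        /* pages need not be contiguous */
}
```

Because each page is relocated independently, the OS can place any free frame behind any virtual page, which is exactly what relaxes the contiguous-allocation requirement.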

Private Address Space per User

[Figure: users 1, 2, and 3 each translate the same virtual address VA1 through their own page table to different frames in physical memory, alongside OS pages and free frames.]

- Each user has a page table
- The page table contains an entry for each user page
- But where should the page tables reside?

Where Should Page Tables Reside?

The space required by the page tables is proportional to the page size, the number of users, and so on, and that space requirement is large:
- too expensive to keep in registers
- in special registers just for the current user: needs new management instructions, affects context-switch time, and may not be feasible for large page tables
- in main memory: needs one reference to retrieve the page base address and another to access the data word, doubling the number of memory references!

Page Tables in Physical Memory

[Figure: the page tables of users 1 and 2 are themselves stored in physical memory and must be referenced on each access to translate VA1.]

Translation Lookaside Buffers: Caching the Address Translation

[Figure: the virtual address splits into VPN and offset; each TLB entry holds valid (V), read (R), write (W), and dirty (D) bits, a tag, and a PPN; on a hit, the PPN is concatenated with the offset to form the physical address. VPN = virtual page number, PPN = physical page number.]

- A TLB speeds up address translation (IBM, late 1960s)
- The TLB keeps the <VPN, PPN> mappings for recently accessed pages
- The TLB also keeps additional information about each page, e.g., read/write permission and a dirty bit
- Usually the TLB is per process and is flushed on a context switch (a lookup sketch follows below)
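A TLB lookup can be modeled as below. For brevity this sketch is direct-mapped rather than the fully associative structure the slide suggests, and the entry count and field widths are assumptions:

```c
/* Direct-mapped TLB model: V/R/W/D bits, a VPN tag, and the PPN. */
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64  /* an assumption */

struct tlb_entry {
    bool     valid;     /* V */
    bool     readable;  /* R */
    bool     writable;  /* W */
    bool     dirty;     /* D */
    uint32_t vpn;       /* tag: virtual page number */
    uint32_t ppn;       /* physical page number */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* On a hit, return the PPN without referencing the in-memory page
 * table; on a miss, the page table walk (the extra memory reference
 * discussed above) would refill this entry. */
bool tlb_lookup(uint32_t vpn, uint32_t *ppn)
{
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
    if (e->valid && e->vpn == vpn) {
        *ppn = e->ppn;   /* physical address = {PPN, offset} */
        return true;
    }
    return false;        /* TLB miss: walk the page table, then refill */
}
```

On a hit, translation costs no extra memory reference, which removes the doubling of references that keeping page tables purely in main memory would otherwise cause.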

A Problem in the Early Sixties

There were many applications whose data could not fit in main memory, e.g., payroll. Paged memory systems reduced fragmentation but still required the whole program to be resident in main memory, so programmers moved data back and forth between the secondary store and the primary store by overlaying it repeatedly: tricky programming!

Manual Overlays

On the Ferranti Mercury (1956), with 40k bits of main (central) store and 640k bits of drum, and assuming an instruction can address all the storage on the drum:
- Method 1: the programmer keeps track of addresses in main memory and initiates an I/O transfer when required
- Method 2: software address translation initiates I/O transfers automatically (Brooker's interpretive coding, 1960)

Method 1 proved too difficult for users, and method 2 too slow!

Demand Paging (Atlas, 1962)

"A page from secondary storage is brought into the primary storage whenever it is (implicitly) demanded by the processor." (Tom Kilburn)

- Primary (central) memory: 32 pages of 512 words each
- Secondary store (drum): 32 x 6 pages
- The user sees 32 x 6 x 512 words of storage
- Primary memory acts as a cache for secondary memory

Hardware Organization of Atlas

[Figure: the effective address passes through initial address decode and is checked against the PARs before the store is accessed.]

- 48-bit words, 512-word pages
- One PAR (page address register) per page frame, holding <effective PN, status>; PARs 0-31
- Fixed store (ROM): 16 pages, 0.4-1 µs; system code (not swapped)
- Subsidiary store: 2 pages, 1.4 µs; system data (not swapped)
- Main store: 32 pages, 1.4 µs
- Drum (4 units): 192 pages
- Tape: 8 decks, 88 µs/word

On each access, the effective page address is compared against all 32 PARs:
- match: normal access
- no match: page fault, and the state of the partially executed instruction is saved

Atlas Demand Paging Scheme

On a page fault:
- an input transfer into a free page frame is initiated
- the PAR is updated
- if no free page frame is left, a page is selected for replacement (based on usage)
- the replaced page is written out to the drum; to minimize the effect of drum latency, the first empty slot on the drum is selected
- the page table is updated to point to the new location of the page on the drum

(A code sketch of this sequence appears after the comparison below.)

Caching vs Demand Paging

[Figure: caching places a cache between the CPU and primary memory; demand paging places primary memory between the CPU and secondary memory.]

                   Caching                   Demand paging
container          cache entry               page frame
unit               cache block (~16 bytes)   page (~4 KB)
miss rate          1% to 20%                 < 0.001%
hit cost           ~1 cycle                  ~100 cycles
miss cost          ~10 cycles                ~5M cycles
miss handling      in hardware               mostly in software
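The Atlas fault sequence above maps naturally onto code. Below is a self-contained sketch; the sizes match the Atlas figures, but the rotating victim choice is a simplification of Atlas's usage-based replacement, and the drum I/O itself is elided:

```c
/* Sketch of an Atlas-style page-fault sequence: find or free a frame,
 * bring the page in, update the mapping, and restart the instruction. */
#include <stdint.h>

#define NFRAMES 32   /* primary store: 32 page frames (Atlas) */
#define NPAGES  192  /* drum: 192 pages (Atlas) */
#define FREE    (-1)

static int32_t frame_to_vpn[NFRAMES];  /* FREE if the frame is empty */
static int32_t vpn_to_frame[NPAGES];   /* FREE if the page is on drum */
static int     next_victim;            /* simplification of usage-based choice */

void paging_init(void)
{
    for (int f = 0; f < NFRAMES; f++) frame_to_vpn[f] = FREE;
    for (int p = 0; p < NPAGES; p++)  vpn_to_frame[p] = FREE;
}

void handle_page_fault(int32_t vpn)
{
    int frame = -1;
    for (int f = 0; f < NFRAMES; f++)       /* look for a free frame */
        if (frame_to_vpn[f] == FREE) { frame = f; break; }

    if (frame < 0) {                        /* none free: replace a page */
        frame = next_victim;
        next_victim = (next_victim + 1) % NFRAMES;
        /* write the victim back to the first empty drum slot (I/O
         * elided) and mark it as no longer resident */
        vpn_to_frame[frame_to_vpn[frame]] = FREE;
    }

    /* initiate the drum-to-core transfer (elided), then update the
     * PAR / page table and restart the faulting instruction */
    frame_to_vpn[frame] = vpn;
    vpn_to_frame[vpn] = frame;
}
```

The comparison table above explains why this runs in software: at a miss rate below 0.001% and a penalty of millions of cycles, the handler's cost is negligible, whereas a cache miss at ~10 cycles must be handled entirely in hardware.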