CSL373: Operating Systems -- more VM fun


The past: translating VAs to PAs
Load-time relocation: translate addresses when loading from disk.
  Pro: needs no hardware support. Con: hard to redo after the program starts running; no protection.
Base & bounds: add a base register and check against a bound register.
  Pro: simple relocation + protection, fast. Con: inflexible.
Segmentation: multiple base & bounds pairs (segments).
  Pro: more flexible sharing and space usage. Con: segments need contiguous memory; fragmentation.
Paging: quantize memory into fixed-size pages and map those.
  Pro: eliminates external fragmentation; flexible mappings. Con: internal fragmentation; mapping contiguous ranges is more costly.
Today: paging + segmentation, and speed issues.

Paging + segmentation: best of both?
Dual problems:
  Paged VM: a simple (linear) page table = lots of storage.
  Segmentation: all bytes in a segment must map to a contiguous set of storage locations, which makes paging to disk problematic: all of the segment is in memory, or none of it.
Idea: use paging + segmentation!
  Map program memory with page tables.
  Map page-table memory with a segment table.
  Call each chunk of the virtual address space a segment; each segment has >= 1 pages.
  [Figure: a process (e.g., gcc) with text and heap segments, each mapped by its own page table into physical memory.]

Paging + segmentation tradeoffs
Page-based virtual memory lets segments come from noncontiguous memory:
  makes allocation flexible; a portion of a segment can be on disk!
Segmentation = a way to keep page-table space manageable:
  page tables correspond to logical units (code, stack), which are often relatively large.
  (But what happens if the page tables themselves get really large?)
Going from paging to paging + segmentation is like going from a single segment to multiple segments, except at a higher level: instead of a single page table, have many page tables, with a base and bound for each.

Example: System/370
System/370: 24-bit virtual address space (an old machine):
  | seg # (4 bits) | page # (8 bits) | page offset (12 bits) |
The mappings:
  Segment table: maps segment number to the physical address & size of that segment's page table.
  Page table: maps the virtual page number to a 12-bit physical page number, which is concatenated with the 12-bit offset to get a 24-bit physical address.
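A minimal sketch of that two-level walk, under the assumptions above (4-bit segment, 8-bit page, 12-bit offset, 2-byte page-table entries); the struct layout and names are illustrative, not the real S/370 entry formats:

    /* Illustrative System/370-style translation.  Table formats are
     * simplified for clarity; real hardware encodes entries differently. */
    #include <stdint.h>

    struct seg_entry { uint32_t pt_base; uint8_t pt_len; uint8_t prot; };

    uint32_t s370_translate(uint32_t va,
                            const struct seg_entry *seg_table,
                            const uint8_t *phys_mem)
    {
        uint32_t seg = (va >> 20) & 0xF;    /* 4-bit segment number    */
        uint32_t vpn = (va >> 12) & 0xFF;   /* 8-bit virtual page #    */
        uint32_t off =  va        & 0xFFF;  /* 12-bit page offset      */

        const struct seg_entry *se = &seg_table[seg];
        if (vpn >= se->pt_len)              /* out of bounds: fault    */
            return (uint32_t)-1;

        /* page-table entries are 2 bytes each: entry addr = base + 2*vpn */
        uint32_t ea  = se->pt_base + 2 * vpn;
        uint16_t pte = (uint16_t)((phys_mem[ea] << 8) | phys_mem[ea + 1]);

        uint32_t ppn = pte & 0xFFF;         /* 12-bit physical page #  */
        return (ppn << 12) | off;           /* 24-bit physical address */
    }

The worked example on the next slide follows exactly this path: segment 0's page table at 0x2000, entry address 0x2004, yielding physical address 0x3070.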

Example: System/370 translation
Segment table (entry: base, bound, prot):
  entry 0: base 0x2000, bound 0x14, prot R
  entry 1: base 0x0000, bound 0x00, prot --
  entry 2: base 0x1000, bound 0x0d, prot RW
Worked read of VA 0x002070 (seg 0x0, page 0x02, offset 0x070):
  seg 0's page table starts at 0x2000; entry address = 0x2000 + 2 bytes * page 2 = 0x2004;
  that entry holds 0x0003, so PA = (0x3 << 12) | 0x070 = 0x003070.
Exercises: 0x202016 read? 0x104c84 read? 0x011424 read? 0x210014 write?
[Figure: page tables (2-byte entries) at 0x1000 and 0x2000; the remaining entry values from the figure (0x001f, 0x0011, 0x0003, 0x002a, 0x0013, 0x000c, 0x0007, 0x004, 0x00b, 0x006, ...) are not fully recoverable from the transcription.]

Paging + segmentation discussion
If a segment is not used, there is no need to have a page table for it.
Can share at two levels: a single page, or a whole segment (i.e., a whole page table).
Pages eliminate external fragmentation and make it possible for segments to grow, and to page to disk, without shuffling memory.
If the page size is small compared to most segments, internal fragmentation is not too bad (about half a page per segment).
Users are not given write access to page tables. Read access can be quite useful (e.g., to let garbage collectors see dirty bits), but only experimental OSes provide it.
If translation tables are kept in main memory, the overhead is high: 1 or 2 extra references for every real reference (i.e., memory operation).
Other example systems: VAX, Multics, x86, ...

Who won?
Simplicity = good (and fast): most new architectures have gone with pages. Examples: MIPS, SPARC, Alpha. And many old architectures have added paging even if they started with segmentation!
But: to efficiently map large regions, many newer machines are backsliding into superpages.
  Large, multi-page "pages": have strict alignment restrictions and (typically) a small number of sizes.
  [Figure: a 16K superpage with 32K alignment (VA % 32K = 0) next to ordinary 4K pages.]
  Inflexibility = speed, but it handles 80% of the cases we want (e.g., that 8MB framebuffer).

Virtual memory is complicated, but not deep
Again: despite the obscure terminology, all virtual memory is trying to do is map ints to ints: VA:int -> PA:int.
Most of the complexity comes from the fact that CS has no good way to construct a general integer function. Mapping between two arbitrary sets of integers fundamentally requires a table of some sort (e.g., 4 -> 42, 5 -> 213, 6 -> 101).
Page table = a lookup table used to build the function by hand.
The usual variations: an array (a "direct" page table), hash tables, weird trees of arrays.
The usual complications: speed, space.
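A minimal sketch of the array variation (a direct page table: index by VPN, read out the PPN); sizes and the invalid marker are illustrative:

    /* A "direct" page table is just an int-to-int lookup array. */
    #include <stdint.h>

    #define NPAGES  (1u << 20)       /* e.g., 32-bit VA, 4K pages    */
    #define INVALID 0xFFFFFFFFu      /* marker for unmapped pages    */

    static uint32_t page_table[NPAGES];   /* VPN -> PPN, or INVALID  */

    static uint32_t va_to_pa(uint32_t va)
    {
        uint32_t vpn = va >> 12;
        uint32_t off = va & 0xFFF;
        uint32_t ppn = page_table[vpn];
        if (ppn == INVALID)
            return INVALID;               /* would raise a page fault */
        return (ppn << 12) | off;
    }

The hash-table and tree-of-arrays variations trade this flat array's space cost for extra lookup work, which is exactly the speed/space complication above.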

Paging and segmentation: not so different
The main difference is the way they map ints.
Segmentation: represents VAs as byte ranges (va, nbytes), and restricts outputs so it can map each range with a single table entry: virtual [base, len) -> physical.
  Result: can protect & allocate variable-sized units, but forces each mapped range to be contiguous and hard-codes the segment index for speed.
Paging: represents VAs as pages and maps them with a tuple (vpn, ppn) for each page-sized unit (forces the base to be page-aligned).
  Result: can only protect and allocate fixed-size units, but can map any virtual page to any physical page.
Hardware caches and lookups differ, but only because of different table layouts and logic; nothing fundamental.
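A side-by-side sketch of the two translation styles (layouts are illustrative, not any particular machine's; compare with the direct page table above):

    #include <stdint.h>

    struct segment { uint32_t base, len; };   /* one entry maps a whole byte range */

    /* Segmentation: virtual [0, len) maps to physical [base, base+len). */
    uint32_t seg_translate(const struct segment *s, uint32_t va, int *fault)
    {
        *fault = (va >= s->len);              /* bounds check             */
        return s->base + va;                  /* contiguous by construction */
    }

    /* Paging: one fixed-size entry per 4K page; any page -> any frame. */
    uint32_t page_translate(const uint32_t *vpn_to_ppn, uint32_t va)
    {
        return (vpn_to_ppn[va >> 12] << 12) | (va & 0xFFF);
    }

Same int-to-int job in both cases; only the granularity and table shape differ.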

Mapping functions: a perennial OS/CS theme
OSes are constantly in the business of constructing a mapping function and then trying to make it fast.
Example: file systems:
  Directory: maps file name to inode number (e.g., foo.c -> 42, tsts -> 213, nachos -> 101).
  Inode: maps file offset to disk block number (e.g., offset 0 -> block 42, 1 -> 213, 2 -> 101).
Others: DNS names to IP addresses, IP addresses to Ethernet addresses, addresses to routes, variables to registers, symbols to VAs, ...

Variable versus fixed size
Can view as space/flexibility vs. time.
Fixed size: simple and fast, but inflexible + internal fragmentation.
  Nice: allows simple, fast int-to-int mappings, e.g., computing the address of array[i] = base + i * element_size.
Variable size: more flexible, but complex, slower, external fragmentation.
A constant computer-science theme:
  variable-sized instructions (CISC) vs. fixed-size (RISC)
  variable-sized packets (Ethernet) vs. fixed-size cells (ATM)
  fixed-size examples: cache entries, disk blocks, IP addresses
  variable-sized examples: variable names & sizes, files, DNS names
  our primal mud: digital (discrete) vs. analog (continuous signal)

Problem: large address range = large page tables
Same problem as with memory itself: we don't want to have to allocate page tables contiguously.
So use the same solution: map the page tables using another page table.
To stop the recursion, the page-table page table (the "system page table") resides in contiguous memory, usually at a known fixed address.
[Figure: several processes (ls, gcc, emacs), each with page tables pieced together from scattered physical pages.]
Win: page tables can be pieced together from scattered pages.
Win: invalid mappings can be represented with invalid addresses rather than requiring space in a table.
Lose: worst-case lookup?

Extending the idea: hierarchical page tables
A large address range = lots of unused space = unwieldy page tables: with a 2^32 address space and 2^12 (4K) pages, a flat table needs 2^20 (1M) entries!
[Figure: emacs's sparse address space, mostly idle, wasting flat page-table entries.]
Conceptually: map small regions with direct page tables, then map those page tables with another page table.
Space savings with 2^22 (4M) regions?
When are hierarchical page tables (much) worse than non-hierarchical ones? (See the sketch below.)
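A minimal sketch of a two-level walk under these numbers (2^32 VA, 4K pages, 4M regions: 10 outer bits, 10 inner bits, 12 offset bits); all names are illustrative:

    /* Illustrative two-level page table.  Unused 4MB regions have a
     * NULL outer entry, so they cost no inner-table space at all. */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint32_t *inner[1024];   /* each points to a 1024-entry VPN->PPN table, or NULL */
    } pagedir_t;

    static uint32_t walk(const pagedir_t *pd, uint32_t va, int *fault)
    {
        uint32_t outer = (va >> 22) & 0x3FF;
        uint32_t inner = (va >> 12) & 0x3FF;
        uint32_t off   =  va        & 0xFFF;

        const uint32_t *pt = pd->inner[outer];
        if (pt == NULL) { *fault = 1; return 0; }   /* whole 4MB region unmapped */

        *fault = 0;
        return (pt[inner] << 12) | off;             /* two memory touches per lookup */
    }

The tradeoff is visible in the last comment: a densely used address space pays an extra memory reference per lookup for no space savings, which is when hierarchical tables are worse.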

Inverted page table
One entry for each physical page frame, specifying the PID and VPN currently mapped to that frame.
On a TLB miss, search the inverted page table to find the physical page frame for the process's virtual address.
Speed up using:
  hashing
  an associative table of recently accessed entries (in hardware)
Examples: IBM (RS/6000, RT, System/38), HP Spectrum.
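A minimal sketch of the hashed lookup (hash function, chaining scheme, and field names are illustrative, not how any of the machines above lay it out):

    /* Illustrative hashed inverted page table: one entry per physical
     * frame; a hash of (pid, vpn) picks a bucket, chains resolve collisions. */
    #include <stdint.h>

    #define NFRAMES  4096
    #define NO_FRAME ((uint32_t)-1)

    struct ipt_entry { uint32_t pid, vpn, next; };   /* next = chain link or NO_FRAME */
    static struct ipt_entry ipt[NFRAMES];
    static uint32_t hash_head[NFRAMES];              /* hash bucket -> first frame    */

    static uint32_t ipt_lookup(uint32_t pid, uint32_t vpn)
    {
        uint32_t h = (pid ^ vpn) % NFRAMES;          /* toy hash of (pid, vpn)        */
        for (uint32_t f = hash_head[h]; f != NO_FRAME; f = ipt[f].next)
            if (ipt[f].pid == pid && ipt[f].vpn == vpn)
                return f;                            /* frame number = table index    */
        return NO_FRAME;                             /* not resident: page fault      */
    }

Note the space advantage: the table size is proportional to physical memory, not to the (much larger) virtual address space.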

Problem: mapping = slow
If each memory reference requires 1 or more page-table lookups, speed will be bad.
The obvious idea: caching. What to cache? VA-to-PA translations.
[Figure: number of references vs. memory address: accesses cluster in a few regions.]
Why does this work? A program doesn't wildly access its entire address space; it usually references (relatively) small, slowly-changing subsets.
So, just cache the translations for those regions!

Translation look-aside buffer (TLB)
A TLB is just a very fast memory with some comparators.
Each TLB entry maps a VPN to a PPN + protection information.
On each memory reference: check the TLB; if the translation is there, fast. If not, do the page-table lookup and insert the result in the TLB for next time (evicting some entry).
Typical: 64-2K entries, ~95% hit rate.
[Figure: a virtual address split into VPN and 12-bit page offset; the VPN is matched in the TLB; on a miss, the segment/page tables are walked, and an invalid entry raises a fault.]
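A minimal sketch of the per-reference check (fully associative scan, illustrative entry format; the page-table walk is stubbed out):

    /* Illustrative fully associative TLB: scan all entries for a VPN
     * match; on a miss, walk the page table and install the result. */
    #include <stdint.h>

    #define TLB_SIZE 64

    struct tlb_entry { uint32_t vpn, ppn; uint8_t valid, prot; };
    static struct tlb_entry tlb[TLB_SIZE];

    static uint32_t page_table_walk(uint32_t vpn, uint8_t *prot)
    {
        *prot = 0x7;                                  /* stub: pretend all pages mapped */
        return vpn;                                   /* stub: identity mapping         */
    }

    static uint32_t tlb_translate(uint32_t va)
    {
        uint32_t vpn = va >> 12, off = va & 0xFFF;

        for (int i = 0; i < TLB_SIZE; i++)            /* hardware does this in parallel */
            if (tlb[i].valid && tlb[i].vpn == vpn)
                return (tlb[i].ppn << 12) | off;      /* hit: fast                      */

        uint8_t prot;
        uint32_t ppn = page_table_walk(vpn, &prot);   /* miss: slow table walk          */
        int victim = vpn % TLB_SIZE;                  /* toy replacement choice         */
        tlb[victim] = (struct tlb_entry){ vpn, ppn, 1, prot };
        return (ppn << 12) | off;
    }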

Caching in the real world
  Refrigerator -- kitchen
  Issued (borrowed) book -- library
  Water bottle -- water cooler
  Lecture slides -- book
More?

Why bother with caching?
[Plot: processor-DRAM performance gap (latency), 1980-2000. CPU performance grows ~60%/year (2x every 1.5 years, Moore's Law); DRAM grows ~9%/year (2x every 10 years); the processor-memory gap grows ~50% per year.]

The precondition to caching: locality
[Figure: probability of reference vs. address (0 to n-1): accesses cluster in a few regions of the address space.]
Temporal locality (locality in time): keep recently accessed data items closer to the processor.
Spatial locality (locality in space): move contiguous blocks to the upper levels.
[Figure: upper-level memory (block X) sits between the processor and lower-level memory (block Y).]

Memory hierarchy of a modern computer system
Take advantage of the principle of locality to:
  present as much memory as the cheapest technology provides,
  at the access speed offered by the fastest technology.
Levels: processor (datapath, control, registers) -> on-chip cache -> second-level cache (SRAM) -> main memory (DRAM) -> secondary storage (disk) -> tertiary storage (tape).
Speed (ns): 1s -> 10s-100s -> 100s -> 10,000,000s (10s of ms) -> 10,000,000,000s (10s of sec).
Size (bytes): 100s -> Ks-Ms -> Ms -> Gs -> Ts.

A summary of the sources of cache misses
Compulsory (cold start, process migration, first reference): the first access to a block.
  Cold fact of life: not a whole lot you can do about it.
  Note: if you are going to run billions of instructions, compulsory misses are insignificant.
Capacity: the cache cannot contain all blocks accessed by the program.
  Solution: increase cache size.
Conflict (collision): multiple memory locations map to the same cache location.
  Solution 1: increase cache size. Solution 2: increase associativity.
Coherence (invalidation): another agent (e.g., I/O) updates memory.

How is a block found in a cache?
Block address = | tag | index | block offset |
Index ("set select"): used to look up candidates in the cache; the index identifies the set.
Tag: used to identify the actual copy; if no candidates match, declare a cache miss.
Block: the minimum quantum of caching.
Block offset ("data select"): used to select data within the block. Many caching applications don't have a data-select field.

The (usual) possible cache organizations
Fully associative: entries can go anywhere (replacement policy?).
  Minimizes conflict misses, but lookup is expensive: requires searching the entire table on every memory reference.
Direct mapped: each entry can go in exactly one place, e.g., position = VPN % TLB-size.
  Fast lookup, but the inflexible mapping = conflict misses.
Hybrid: set associative (e.g., 2-way set associative): entries can go in any of the banks (ways), but in exactly one place within each; search the banks in parallel = speed.
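A minimal sketch of the 2-way organization applied to a TLB (two direct-mapped banks probed together; sizes and names are illustrative):

    /* Illustrative 2-way set-associative TLB: two direct-mapped banks,
     * each indexed by VPN % NSETS; hardware probes both at once. */
    #include <stdbool.h>
    #include <stdint.h>

    #define NSETS 32

    struct way { uint32_t vpn, ppn; bool valid; };
    static struct way bank[2][NSETS];

    static bool lookup_2way(uint32_t vpn, uint32_t *ppn)
    {
        uint32_t set = vpn % NSETS;                 /* one fixed slot per bank      */
        for (int w = 0; w < 2; w++) {               /* probed in parallel in h/w    */
            if (bank[w][set].valid && bank[w][set].vpn == vpn) {
                *ppn = bank[w][set].ppn;
                return true;
            }
        }
        return false;                               /* miss */
    }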

TLB details
A TLB is like a hash table, but simpler: fixed size, no pointer chains.
How to build the hash function? In general, more information = better decisions. In this case, exploit a fact about programs: spatial locality. Is it better to use the upper or the lower address bits to map to an entry?
If we have temporal locality, which entries are good candidates to replace? (Also, allow the system to reserve space for some translations.)
Complication: what to do when switching address spaces?
  Flush the TLB on context switch (e.g., x86).
  Tag each entry with the associated process's ID (e.g., MIPS).
In general, the OS must manually keep the TLB in a valid state.

MIPS R3000
L1 cache: 64 KB, direct mapped, write-allocate, write-through, physically addressed.
Write buffer: 4 entries.
L2 cache: 256 KB, direct mapped, write-back.
Physical memory: 64 MB, 4 KB page frames.
Backing store (disk): 256 MB of swap space.
TLB: 64 entries, fully associative.

MIPS R3000 segments
kuseg: top bit = 0. User space.
kseg0: top bits = 100. Cached and unmapped. Kernel instructions and data.
kseg1: top bits = 101. Uncached and unmapped. Disk buffers, I/O registers, ROM code.
kseg2: top bits = 11. Cached and mapped; the mapping is different for each process. Per-process kernel data structures (e.g., user page tables).

Translation of user addresses
31 bits of virtual address + 6 bits of process ID -> 32-bit physical address.
Page size = 4K, so the VPN is how many bits?
Mapping VPN to PPN using a linear page table: where is that page table stored?
Page the page table itself using its top 9 bits: where is the page table for the page table stored?

Example: MIPS R2000/R3000 TLB
Used in DECstations and SGI machines. 64 entries, fully associative.
TLB entry format (64 bits per entry):
  | VPN (20) | PID (6) | Undef (6) | PFN (20) | N (1) | D (1) | V (1) | G (1) | Undef (8) |
  G - global, valid for any PID (why?)
  V - entry is valid (why have this?)
  D - dirty bit, page has been modified
  N - don't cache memory addresses (why?)
  PFN - physical frame number (address) of the page
  PID - process ID for the page (how many? how to reuse?)
  VPN - virtual page number
  Undef - undefined bits, set by the OS (why?)
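A sketch of that 64-bit entry as C bitfields, for reading the format only (C bitfield packing is implementation-defined, so this is not how you would poke real hardware):

    /* Illustrative model of an R3000-style TLB entry (high/low word pair). */
    #include <stdint.h>

    struct r3000_tlb_entry {
        /* high word: who this translation belongs to */
        uint32_t vpn   : 20;   /* virtual page number            */
        uint32_t pid   : 6;    /* address-space / process ID     */
        uint32_t hi_un : 6;    /* undefined, usable by the OS    */
        /* low word: where it maps and how it may be used */
        uint32_t pfn   : 20;   /* physical frame number          */
        uint32_t n     : 1;    /* non-cacheable                  */
        uint32_t d     : 1;    /* dirty (page has been written)  */
        uint32_t v     : 1;    /* valid                          */
        uint32_t g     : 1;    /* global: match any PID          */
        uint32_t lo_un : 8;    /* undefined, usable by the OS    */
    };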

Example: the R3000 pipeline includes TLB stages
[Figure: MIPS R3000 pipeline (Inst Fetch, Dcd/Reg, ALU/E.A., Memory, Write Reg); the TLB is consulted at instruction fetch (before the I-cache) and again at the memory stage (before the D-cache).]
TLB: 64 entries, on-chip, fully associative, with a software TLB fault handler.
Virtual address space: | ASID (6) | virtual page number (20) | offset (12) |
  0xx: user segment (caching based on PT/TLB entry)
  100: kernel physical space, cached
  101: kernel physical space, uncached
  11x: kernel virtual space
The 6-bit ASID allows context switching among 64 user processes without a TLB flush.

Reducing translation time further
As described, the TLB lookup is in series with the cache lookup:
[Figure: virtual address = virtual page number + offset; the TLB lookup yields access rights and a physical page number, which is combined with the offset to form the physical address.]
Machines with TLBs go one step further: they overlap the TLB lookup with the cache access. This works because the offset bits are available early (they are not translated).

Overlapping TLB & cache access
Here is how this might work with a 4K cache:
[Figure: the 32-bit virtual address splits into a 20-bit page number and a 12-bit displacement (10-bit index + 2-bit offset into 4-byte words). The 10-bit index selects one of 1K cache entries at the same time as the TLB's associative lookup translates the page number; the resulting frame number (FN) is then compared against the cache tag to decide hit/miss.]

MIPS R3000 lookup
Match on the upper half of the TLB entry; use the lower half.
Can generate 3 different types of TLB misses:
  UTLB miss: access to kuseg, no match in the TLB.
  TLB miss: access to kseg0, kseg1, or kseg2 with no match.
  TLB mod: the mapping is loaded, but the access is a write and the D bit is not set.
Each type of miss has its own exception handler.

Alternatives
The OS may run completely unmapped. Problem: can't page OS data structures.
The OS may have a separate address space from the user. Problem: can't as easily access user space.
The cache may be virtually addressed. Problem: must flush the cache on context switch, and can't alias multiple virtual addresses to the same physical address.
TLB reload may be done entirely in hardware. Problem: locks the OS into using a specific page-table data structure (x86).

Problem: accessing user data
The OS needs to access user memory: I/O routines read/write user buffers. BUT:
User pointers cannot be trusted:
  obvious: the user passes in a null pointer or an invalid address such as 0xfffee000;
  less obvious: the user passes in a valid *kernel* address.
User memory might be paged out, so the OS will get a page fault in the middle of a system call. If it then switches to a new process:
  Problem 1: if the kernel held a single OS lock, the system will deadlock.
  Problem 2: if the system call is halfway through and depends on current state, it will have an invalid view of the world when restarted.
  Problem 3: the system call may have partially altered OS data structures.
User pointers are only valid in the user's own address space. What happens if the OS stores them away?
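A minimal sketch of the usual defense: validate the user range before touching it, and copy through one routine that can tolerate faults. The boundary constant and helper name are illustrative, not any particular kernel's API:

    /* Illustrative copyin(): reject pointers outside the user half of the
     * address space before dereferencing them.  Real kernels also handle
     * page faults taken during the copy itself (paged-out user memory). */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define USER_TOP 0xC0000000u      /* e.g., a Linux32-style 3GB/1GB split */

    static int copyin(void *dst, const void *user_src, size_t n)
    {
        uintptr_t start = (uintptr_t)user_src;
        if (n == 0) return 0;
        if (start >= USER_TOP || n > USER_TOP - start)
            return -1;                /* null, kernel, or wrapping range: reject */
        memcpy(dst, user_src, n);     /* may still fault if the page is not resident */
        return 0;
    }

Checking the range first is what blocks the "valid kernel address" trick; the fault-tolerant copy is what handles the paged-out case without trusting the pointer.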

Where does the OS live?
In a different address space? (Microkernels.)
  Nice: catches wild pointer writes. Bad: accessing user memory is clumsy.
Common: user addresses co-exist with the OS's.
  The OS is aliased into every application's address space at the same place (which allows user addresses to co-exist with the OS's).
  E.g., Windows32 (2GB OS + 2GB user), Linux32 (1GB OS + 3GB user).
As a refinement: the OS runs "unmapped" in a special address range, accessed in privileged mode, that is mapped to physical memory by subtracting a constant.
  E.g., Linux: addresses >= 0xc0000000 have this constant subtracted.
What happens when an application tries to read/write OS data?
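A sketch of that constant-offset mapping in the spirit of Linux's __pa()/__va() helpers (the 0xc0000000 split is from the slide; the helper names are illustrative of the idea, not full definitions):

    /* Illustrative kernel virtual <-> physical conversion for a Linux32-style
     * layout where kernel addresses start at 0xc0000000 and map linearly
     * onto physical memory. */
    #include <stdint.h>

    #define PAGE_OFFSET 0xC0000000u

    static inline uint32_t kvirt_to_phys(uint32_t kva) { return kva - PAGE_OFFSET; }
    static inline uint32_t phys_to_kvirt(uint32_t pa)  { return pa  + PAGE_OFFSET; }

    /* e.g., kernel virtual 0xC0100000 refers to physical 0x00100000 */

An application touching addresses above the split is stopped by the protection bits on those mappings (or by the privileged-mode requirement), which answers the question above.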

Summary: virtual memory mapping
What? Give the programmer a logical (virtual) set of addresses rather than the actual physical set.
How? Translate every memory operation using a table (page table, segment table). Speed: cache frequently used translations.
Result?
  Each user has a private address space.
  Programs run independently of the actual physical memory addresses used, and of the actual memory size.
  Protection: check that programs only access their own memory.