The Art and Science of Memory Allocation


Logical Diagram
[Figure: map of the kernel placing today's lecture. Memory Allocators (today's lecture) sits among Binary Formats, RCU, Memory Management, the CPU Scheduler, System Calls, the File System, Networking, Sync, Device Drivers, Threads, Interrupts, and Consistency, spanning the User, Kernel, and Hardware (Disk, Net) layers.]

Lecture goal
- Understand how memory allocators work
  - In both the kernel and applications
- Understand trade-offs and current best practices
- Hoard: a best-of-breed concurrent allocator for user applications
  - Seminal paper
- We'll also talk about how Linux allocates its own memory

Bump allocator
[Figure: malloc(6), malloc(12), malloc(20), malloc(5) placed end-to-end as the free pointer advances.]
- Assume memory is limited
- Simply "bumps" up the free pointer
- How does free() work? It doesn't
  - Well, you could try to recycle cells if you wanted, but that means complicated bookkeeping
  - You only care about free() if you need the memory for something else
- Controversial observation: this is ideal for simple programs
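A minimal sketch of the bump allocator described above; the fixed arena and the 8-byte alignment policy are illustrative assumptions, not anything from the paper:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical fixed arena; a real allocator would get pages from the OS. */
    static uint8_t arena[4096];
    static size_t next_free;

    void *bump_malloc(size_t n)
    {
        n = (n + 7) & ~(size_t)7;           /* round up for word alignment */
        if (next_free + n > sizeof(arena))  /* memory is limited: fail when full */
            return NULL;
        void *p = &arena[next_free];
        next_free += n;                     /* "bump" the free pointer */
        return p;
    }

    void bump_free(void *p)
    {
        (void)p;   /* free() does nothing, exactly as the slide says */
    }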

Overarching issues
- Fragmentation
- Allocation and free latency
- Synchronization/concurrency
- Implementation complexity
- Cache behavior
  - Alignment (cache and word)
  - Coloring

Fragmentation
- Undergrad review: what is it? Why does it happen?
- What is internal fragmentation?
  - Wasted space when you round an allocation up
- External fragmentation?
  - When you end up with small chunks of free memory that are too small to be useful
- Which kind does our bump allocator have?

Hoard: Superblocks
- At a high level, the allocator operates on superblocks
  - A chunk of (virtually) contiguous pages
  - All objects in a superblock are the same size
  - A given superblock is treated as an array of same-sized objects
- Sizes generalize to powers of b > 1; in usual practice, b == 2
[Figure: a 256-byte-object heap built from 4 KB pages, with free space at the end of each page. Each page is an array of objects; the free list is kept in LIFO order, and the list pointers are stored in the free objects themselves ("next" chained through each free slot).]

Superblock intuition
- To serve malloc(8):
  1) Find the nearest power-of-2 heap (8)
  2) Find a free object in a superblock (pick the first free object)
  3) Add a superblock if needed; goto 2
[Figure: malloc(8) and malloc(200) being served from the heap's pages and free lists.]
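The "store list pointers in free objects" trick can be sketched in a few lines; the single static page, the object size, and the names are illustrative assumptions:

    #include <stddef.h>

    #define PAGE_SIZE 4096
    #define OBJ_SIZE  256            /* the slide's 256-byte object heap */

    static char page[PAGE_SIZE];     /* one page of a superblock */
    static void *free_head;          /* LIFO free list threaded through free slots */

    static void init_page(void)
    {
        /* Carve the page into same-sized slots; each free slot's first
         * word points at the next free slot, so free space needs no
         * extra metadata. */
        for (size_t off = 0; off + OBJ_SIZE <= PAGE_SIZE; off += OBJ_SIZE) {
            void **slot = (void **)&page[off];
            *slot = free_head;
            free_head = slot;
        }
    }

    static void *alloc_obj(void)
    {
        void *obj = free_head;
        if (obj)
            free_head = *(void **)obj;   /* pop in LIFO order */
        return obj;                      /* NULL: add another superblock */
    }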

Superblock example
- Suppose my program allocates objects of sizes 4, 5, 7, 34, and 40 bytes
- How many superblocks do I need (if b == 2)?
  - 3 (4-, 8-, and 64-byte chunks)
- If I allocate a 5-byte object from an 8-byte superblock, doesn't that yield internal fragmentation?
  - Yes, but it is bounded to < 50%
  - Give up some space to bound the worst case and the complexity

Memory free
- Simple most-recently-used list per superblock
- How do you tell which superblock an object is from?
- Round the address down: suppose a superblock is 8 KB (2 pages)
  - An object at address 0x431a01c came from a superblock that starts at 0x431a000 or 0x4319000. Which one?
  - (Assume superblocks are virtually contiguous)
  - Subtract the first superblock's virtual address; the start is the candidate whose offset is divisible by two pages
- Simple math can tell you where an object came from!

Big objects
- If an object is bigger than half the size of a superblock, just mmap() it
  - Recall, a superblock is on the order of pages already
- What about fragmentation?
  - Example: a 4097-byte object (1 page + 1 byte)
  - Argument (preview): more trouble than it is worth
    - Extra bookkeeping, potential contention, and potential bad cache behavior

LIFO
- Why are objects re-allocated most-recently-used first?
  - Aren't all good OS heuristics FIFO?
- More likely to be already in cache (hot)
  - Recall from undergrad architecture that it takes quite a few cycles to load data into cache from memory
- If it is all the same, let's try to recycle the object already in our cache

High-level strategy
- Allocate a heap for each processor, plus one shared heap
  - Note: not threads, but CPUs
  - Can only use as many heaps as CPUs at once
  - Requires some way to figure out the current processor
- Try the per-CPU heap first
- If it has no free blocks of the right size, then try the global heap
- If that fails, get another superblock for the per-CPU heap

Simplicity
- The bookkeeping for alloc and free is pretty straightforward; many allocators are quite complex (slab)
- Overall: need a simple array of (# CPUs + 1) heaps
  - Per heap: one list of superblocks per object size
  - Per superblock: need to know which/how many objects are free
    - LIFO list of free blocks
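Putting the pieces together: a sketch of the per-CPU-then-global lookup, with the free path using the address masking described above (it peeks ahead to the superblock metadata covered on the next page's locking slides). The names, NCPUS, the 8 KB size-aligned superblocks, and the zero-initialized mutexes (valid on glibc) are all illustrative assumptions, not Hoard's actual code:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdint.h>

    #define NCPUS    4         /* illustrative; real code queries the system */
    #define NCLASSES 8
    #define SB_SIZE  0x2000u   /* 8 KB superblocks, assumed aligned to their size */

    struct heap {
        pthread_mutex_t lock;
        void *free_list[NCLASSES];   /* "next" stored in each free object */
    };

    struct sb_hdr { struct heap *owner; };  /* per-superblock metadata */

    static struct heap heaps[NCPUS + 1];    /* heaps[NCPUS] is the shared heap */

    static void *pop(struct heap *h, int cls)
    {
        void *obj = h->free_list[cls];
        if (obj)
            h->free_list[cls] = *(void **)obj;
        return obj;
    }

    void *heap_malloc(int cls)
    {
        struct heap *h = &heaps[sched_getcpu() % NCPUS];  /* per-CPU heap first */
        pthread_mutex_lock(&h->lock);
        void *obj = pop(h, cls);
        pthread_mutex_unlock(&h->lock);
        if (obj)
            return obj;                    /* common, mostly uncontended case */

        struct heap *g = &heaps[NCPUS];    /* then the shared heap */
        pthread_mutex_lock(&g->lock);
        obj = pop(g, cls);
        pthread_mutex_unlock(&g->lock);
        return obj;   /* still NULL? a real allocator grabs a new superblock */
    }

    void heap_free(void *obj, int cls)
    {
        /* "Simple math": mask the address to find the superblock, whose
         * metadata names the owning (per-CPU or shared) heap. */
        struct sb_hdr *sb =
            (struct sb_hdr *)((uintptr_t)obj & ~(uintptr_t)(SB_SIZE - 1));
        struct heap *h = sb->owner;
        pthread_mutex_lock(&h->lock);
        *(void **)obj = h->free_list[cls];   /* push back, LIFO */
        h->free_list[cls] = obj;
        pthread_mutex_unlock(&h->lock);
    }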

Locking
- On alloc and free, the superblock and the per-CPU heap are locked
- Why? An object can be freed from a different CPU than it was allocated on
- Alternative: we could add more bookkeeping for objects to move to the local superblock
  - Reintroduces fragmentation issues and loses simplicity

How to find the locks?
- Again, page alignment can identify the start of a superblock
- And each superblock keeps a small amount of metadata, including the heap it belongs to
  - Per-CPU or shared heap
  - And the heap includes a lock

Locking performance
- Acquiring and releasing a lock generally requires an atomic instruction
  - Tens to a few hundred cycles vs. a few cycles
- Waiting for a lock can take thousands of cycles
  - Depends on how good the lock implementation is at managing contention (spinning)
  - Blocking locks require many hundreds of cycles to context switch

Performance argument
- Common case: allocations and frees are from the per-CPU heap
  - Yes, grabbing a lock adds overheads
  - But better than the fragmented or complex alternatives
  - And locking hurts scalability only under contention
- Uncommon case: all CPUs contend to access one heap
  - The objects had to all come from that heap (only frees cross heaps)
  - Bizarre workload, probably won't scale anyway

New topic: alignment
- Word
- Cacheline

Alignment (words)

    struct foo {
        unsigned x : 1;   /* the slide's 1-bit field */
        int y;
    };

- Naïve layout: 1 bit for x, followed by 32 bits for y
- CPUs only do aligned operations
  - A 32-bit add expects arguments to start at addresses divisible by 4 bytes (32 bits)
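On a typical 32/64-bit ABI, the padding the next slide describes is easy to observe; this sketch uses a plain char in place of the slide's 1-bit field:

    #include <stdio.h>
    #include <stddef.h>

    struct foo {
        char x;   /* 1 byte, standing in for the slide's 1-bit field */
        int  y;   /* the compiler moves this to the next 4-byte boundary */
    };

    int main(void)
    {
        /* Typically prints "offsetof y = 4, sizeof = 8": three bytes are
         * wasted after x, mirroring the slide's 31 wasted bits. */
        printf("offsetof y = %zu, sizeof = %zu\n",
               offsetof(struct foo, y), sizeof(struct foo));
        return 0;
    }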

Word alignment, cont.
- If the fields of a data type are not aligned, the compiler has to generate separate instructions for the low and high bits
  - No one wants to do this
- So the compiler generally pads the layout out
  - Waste 31 bits after x
  - Save a ton of code reinventing simple arithmetic
  - Code takes space in memory too!

Memory allocator + alignment
- The compiler generally expects a structure to be allocated starting on a word boundary
  - Otherwise, we have the same problem as before
  - Code breaks if not aligned
- This contract often dictates a degree of fragmentation
  - See the appeal of 2^n-sized objects yet?

Cacheline alignment
- Different issue, similar name
- Cache lines are bigger than words
  - Word: 32 bits or 64 bits
  - Cache line: 64-128 bytes on most CPUs
- Lines are the basic unit at which memory is cached

Simple coherence model
- When a memory region is cached, the CPU automatically acquires a reader-writer lock on that region
  - Multiple CPUs can share a read lock
  - A write lock is exclusive
- The programmer can't control how long these locks are held
  - Ex: a store from a register holds the write lock long enough to perform the write; it is held from there until the next CPU wants the line

False sharing
[Figure: object foo (CPU 0 writes) and object bar (CPU 1 writes) sitting on the same cache line.]
- These objects have nothing to do with each other
  - At the program level, they are private to separate threads
  - At the cache level, the CPUs are fighting for a write lock

False sharing is BAD
- Leads to pathological performance problems
  - Super-linear slowdown in some cases
- Rule of thumb: any performance trend that is more than linear in the number of CPUs is probably caused by cache behavior
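The picture above, in miniature: a hypothetical two-thread demo where each thread writes only its own long, yet both longs almost certainly share one cache line (volatile keeps the compiler from collapsing the loops; build with -pthread):

    #include <pthread.h>

    static volatile long objs[2];   /* foo = objs[0], bar = objs[1]: same line */

    static void *writer(void *arg)
    {
        volatile long *obj = arg;
        for (long i = 0; i < 100000000L; i++)
            *obj += 1;   /* every store needs exclusive ownership of the line */
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, writer, (void *)&objs[0]);  /* "CPU 0 writes foo" */
        pthread_create(&b, NULL, writer, (void *)&objs[1]);  /* "CPU 1 writes bar" */
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }

Spacing the two objects a cache line apart makes the same program run dramatically faster, which is the tell-tale sign of false sharing.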

Strawman
- Round everything up to the size of a cache line
- Thoughts?
  - Wastes too much memory; a bit extreme

Hoard strategy (pragmatic)
- Rounding up to powers of 2 helps, once your objects are bigger than a cache line
- Locality observation: things tend to be used on the CPU where they were allocated
- For small objects, always return frees to the original heap
  - Remember the idea about adding extra bookkeeping to avoid synchronization? Some allocators do this
  - It saves locking, but introduces false sharing!

Hoard strategy (2)
- Thread A can allocate 2 small objects from the same line
- Hand off one to another thread to use; keep using the second
- This will cause false sharing

Where to draw the line?
- Question: is it really the allocator's job to prevent this?
- Encapsulation should match programmer intuitions (my opinion)
- In the hand-off example:
  - Hard for the allocator to fix
  - The programmer would have reasonable intuitions (after 506)
- If the allocator just gives parts of the same lines to different threads:
  - Hard for the programmer to debug the performance

Hoard summary
- Really nice piece of work
- Establishes a nice balance among concerns
- Good performance results

Linux kernel allocators
- Focus today on dynamic allocation of small objects
- Later class on management of physical pages
  - And on allocation of page ranges to allocators
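Applied selectively, the strawman's cache-line rounding is the standard fix for cases like the hand-off example; a minimal sketch, assuming 64-byte lines (the struct and names are illustrative, not Hoard's code):

    #include <stdalign.h>

    #define CACHE_LINE 64   /* assumption; 64-128 bytes on most CPUs */

    /* Each element is aligned (and therefore padded) to its own cache
     * line, so objects written by different threads never share a line.
     * Spends memory to buy locality; worth it only where it matters. */
    struct padded_counter {
        alignas(CACHE_LINE) long value;
    };

    static struct padded_counter per_thread_counters[8];  /* one per thread */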

kmem_caches
- Linux has kmalloc and kfree, but caches are preferred for common object types
- Like Hoard, a given cache allocates a specific type of object
  - Ex: a cache for file descriptors, a cache for inodes, etc.
- Unlike Hoard, objects of the same size are not mixed
  - The allocator can do initialization automatically
  - May also need to constrain where memory comes from

Caches (2)
- Caches can also keep a certain reserve capacity
  - No guarantees, but allows performance tuning
  - Example: I know I'll have ~100 list nodes frequently allocated and freed; target the cache capacity at 120 elements to avoid expensive page allocation
  - Often called a memory pool
- Universal interface: can change the allocator underneath
- The kernel has kmalloc and kfree too
  - Implemented on caches of various powers of 2 (familiar?)

Superblocks to slabs
- The default cache allocator (at least as of early 2.6) was the slab allocator
- A slab is a chunk of contiguous pages, similar to a superblock in Hoard
- Similar basic ideas, but substantially more complex bookkeeping
  - I'll spare you the details, but slab bookkeeping is complicated

Complexity backlash
- 2 groups upset (guesses who?):
  - Users of very small systems
  - Users of large multi-processor systems
- The slab allocator came first, historically

Small systems
- Think 4 MB of RAM on a small device/phone/etc.
- As system memory gets tiny, the bookkeeping overheads become a large percentage of total system memory
- How bad is fragmentation really going to be?
  - Note: not sure this has been carefully studied; may just be intuition

SLOB allocator
- Simple List Of Blocks
- Just keep a free list of each available chunk and its size
  - Grab the first one big enough to work
  - Split the block if there are leftover bytes
- No internal fragmentation, obviously
- External fragmentation? Yes. Traded for low overheads
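A sketch of the SLOB-style first fit just described: walk a free list of chunks, grab the first one big enough, and split off the leftover bytes. The structures are illustrative, not the kernel's actual SLOB code (which, among other things, must also record each allocation's size so it can be freed):

    #include <stddef.h>

    struct chunk {
        size_t size;            /* total bytes in this free region */
        struct chunk *next;
    };

    static struct chunk *free_list;

    static void *first_fit_alloc(size_t n)
    {
        struct chunk **prev = &free_list;
        for (struct chunk *c = free_list; c; prev = &c->next, c = c->next) {
            if (c->size < n)
                continue;                    /* too small; keep scanning */
            if (c->size >= n + sizeof(struct chunk)) {
                /* Split: carve n bytes off the front; the leftover bytes
                 * become a new, smaller free chunk. */
                struct chunk *rest = (struct chunk *)((char *)c + n);
                rest->size = c->size - n;
                rest->next = c->next;
                *prev = rest;
            } else {
                *prev = c->next;             /* use the whole chunk */
            }
            return c;   /* no rounding up: no internal fragmentation */
        }
        return NULL;    /* only too-small chunks left: external fragmentation */
    }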

Large systems
- For very large (thousands of CPUs) systems, complex allocator bookkeeping gets out of hand
- Example: slabs try to migrate objects from one CPU to another to avoid synchronization
  - Per-CPU × per-CPU bookkeeping

SLUB allocator
- The Unqueued Slab Allocator
- A much more Hoard-like design
  - All objects of the same size come from the same slab
  - Simple free list per slab
  - No cross-CPU nonsense

SLUB status
- Does better than SLAB in many cases
- Still has some performance pathologies
- Not universally accepted
- General-purpose memory allocation is tricky business

Forward pointer
- Hoard gets more superblocks via mmap()
- What is the kernel's equivalent of mmap()?
- Everything we've talked about today posits something that can give us reasonably-sized, contiguous chunks of pages

Conclusion
- Different allocation strategies have different trade-offs
  - No one, perfect solution
- Allocators try to optimize for multiple variables: fragmentation, low false conflicts, speed, multi-processor scalability, etc.
- Understand the trade-offs: Hoard vs. slab vs. SLOB

Misc notes
- When is a superblock considered free and eligible to be moved to the global bucket?
  - See figure 2, free(), line 9
  - Essentially a configurable "empty fraction"
- Is a "used block" count stored somewhere?
  - Not clear, but probably
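To make the forward pointer concrete: at user level, "something that can give us reasonably-sized, contiguous chunks of pages" is an anonymous mmap(), as in this sketch (SB_SIZE matches the earlier examples; real allocators over-allocate and trim when they need the region aligned to its size):

    #include <stddef.h>
    #include <sys/mman.h>

    #define SB_SIZE 0x2000u   /* 8 KB: one superblock's worth of pages */

    static void *grab_superblock(void)
    {
        /* Zeroed, contiguous, page-aligned memory straight from the kernel.
         * Inside the kernel there is no mmap(); page ranges come from the
         * physical page allocator, covered in a later class. */
        void *p = mmap(NULL, SB_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }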