Parallel storage allocator


CSE 539, Lecture 9, 02/7/2015
Scribe: Jing Li

Outline of this lecture:
1. Criteria and definitions
2. Serial storage allocators
3. Parallel storage allocators

1 Criteria and definitions

Criteria for a parallel storage allocator:

- Speed.
- Scalability.
- Avoid allocator-induced false sharing.
- Minimize fragmentation.
- Minimize space overhead.

False sharing is the situation where two threads allocate objects on the same cache line, with at least one of the objects being updated. False sharing can cause the underlying hardware to generate lots of cache-coherence traffic (we will see more on cache-coherence protocols later in the course) and can be really damaging to program performance.

Note that for a serial storage allocator, speed, fragmentation, and space overhead should be optimized as well; the scalability and false-sharing problems are unique to parallel allocators.

Definitions

Space overhead is the ratio between the space used for bookkeeping and the space used for actual storage.

External fragmentation is the waste due to the inability to use storage because it is not contiguous.

Internal fragmentation is the waste due to allocating a larger block than the user requested.

Blowup is the additional waste of a parallel allocator beyond what a serial allocator would require.
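To make the false-sharing criterion above concrete, here is a minimal C program; it is an illustration rather than anything from the lecture, and whether the two counters actually share a cache line depends on the allocator, which is exactly the point.

/* Illustration of allocator-induced false sharing: two threads update
 * logically independent counters.  A naive allocator may carve both
 * 8-byte allocations out of the same 64-byte cache line, so each write
 * by one thread can invalidate the line in the other thread's cache. */
#include <pthread.h>
#include <stdlib.h>

static void *bump(void *arg) {
    volatile long *ctr = arg;
    for (long i = 0; i < 100000000; i++)
        (*ctr)++;                    /* writes may ping-pong the line */
    return NULL;
}

int main(void) {
    long *a = malloc(sizeof *a);     /* consecutive small allocations */
    long *b = malloc(sizeof *b);     /* often land on one cache line  */
    *a = 0; *b = 0;

    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump, a);
    pthread_create(&t2, NULL, bump, b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    free(a); free(b);
    return 0;
}

A parallel allocator can avoid this by ensuring that objects handed to different threads come from different cache lines, for example by padding or by serving each thread's small allocations from its own pool.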

2 Serial storage allocators

Simple storage allocator using a free list

The simple storage allocator is implemented using a last-in-first-out free list, shown in the figure below. A free-list pointer (denoted free) serves as the head of the free list and points to the first free block; in each free block, a next pointer points to the next free block. For now, let's simply assume allocations are of a fixed size. When the user returns a previously allocated block B, the allocator first sets B to point to the first free block and then sets the free pointer to B. The pseudocode of the allocator is shown below.

malloc()
    if (free != NULL) {
        x = free;
        free = free->next;
        return x;
    } else {
        error;   // out of memory
    }

free(void *x)
    x->next = free;
    free = x;

In this simple storage allocator, the space overhead is O(1), which is very small. When the block size is not fixed, extra information is stored at the head and/or end of each block so that two contiguous free blocks can be coalesced. In such cases, the actual space a user can use in each block is less than the block size, so the space overhead can be larger.
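A runnable C counterpart to this pseudocode might look as follows; the pool size, the 64-byte block size, and the function names are illustrative assumptions, not part of the notes.

/* Fixed-size LIFO free-list allocator backed by a static pool. */
#include <stddef.h>
#include <stdio.h>

#define NBLOCKS 1024

typedef union block {
    union block *next;    /* valid while the block sits on the free list */
    char payload[64];     /* fixed allocation size */
} block_t;

static block_t pool[NBLOCKS];
static block_t *freelist = NULL;

static void pool_init(void) {
    for (size_t i = 0; i < NBLOCKS; i++) {  /* push every block (LIFO) */
        pool[i].next = freelist;
        freelist = &pool[i];
    }
}

static void *pool_malloc(void) {
    if (freelist == NULL)
        return NULL;                  /* out of memory */
    block_t *x = freelist;
    freelist = freelist->next;        /* pop the head: O(1) */
    return x;
}

static void pool_free(void *p) {
    block_t *x = p;
    x->next = freelist;               /* push onto the head: O(1) */
    freelist = x;
}

int main(void) {
    pool_init();
    void *a = pool_malloc();
    void *b = pool_malloc();
    printf("allocated %p and %p\n", a, b);
    pool_free(a);
    pool_free(b);
    return 0;
}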

On the other hand, this simple approach can cause lots of external fragmentation, due to the last-in-first-out nature of the free list. An example of such a bad case is shown in the figure below.

In summary, the pros:

- O(1) allocation and free operations.
- Good temporal locality.

The cons, due to bad external fragmentation:

- The page table size can increase quickly.
- It can cause disk thrashing.
- It is bad for the TLB (Translation Lookaside Buffer).

Minimizing external fragmentation using a page-based free list

We can minimize the external fragmentation of the simple implementation by using a page-based free list, as shown in the figure below (a sketch of the bookkeeping follows this list). In particular:

- Keep a free list on a per-page basis.
- Allocate blocks from the fullest page.
- Always return a freed block to the page it belongs to.
- Return a page to the OS when it becomes empty.
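The per-page bookkeeping might be pictured with the following C sketch; the field names and the linear scan for the fullest page are assumptions made for illustration.

/* Per-page free lists: each page tracks its own free blocks and a count
 * of blocks in use.  Allocating from the fullest non-full page lets
 * nearly empty pages drain so they can be returned to the OS. */
#include <stddef.h>

#define PAGE_SIZE  4096
#define BLOCK_SIZE 64

typedef struct page {
    struct page *next_page;   /* all pages owned by the allocator */
    void *freelist;           /* free blocks belonging to THIS page */
    int used;                 /* number of allocated blocks on the page */
} page_t;

static page_t *pages = NULL;

/* Linear scan for the fullest page with a free block: O(n) in pages,
 * which is what makes this scheme slower than the simple free list. */
page_t *fullest_nonfull_page(void) {
    page_t *best = NULL;
    for (page_t *p = pages; p != NULL; p = p->next_page)
        if (p->freelist != NULL && (best == NULL || p->used > best->used))
            best = p;
    return best;
}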

However, this improved storage allocator no longer has O(1) allocation and free. Instead, it must do extra work, either while allocating or while freeing, to keep track of which page is the fullest. It is therefore roughly O(n) per allocation/free, where n is the number of pages.

Handling allocations of different sizes using a binned list

Note that we typically want to optimize the allocator for small objects rather than large ones, because small objects come and go much more frequently. Therefore, people use a binned list to allocate small objects for better performance. As shown in the figure below, each bin contains blocks of one specified size, and the bin sizes grow geometrically: in the example they follow powers of 2, while in TBB they follow powers of 1.25. A block larger than the largest bin is maintained on a single list for large sizes.

Because each small block must be one of the specified sizes, the binned-list implementation causes internal fragmentation. For example, if the user requests a block of size 34, it will get a block from the 64-byte bin (see the sketch at the end of this section). However, the waste is bounded by (s − 1)B, where s is the growth factor of the bin sizes and B is the actual requested block size.

Note that all the approaches described above are designed for the serial case. With a parallel program running multiple threads, we need to make the allocator thread-safe. This can be accomplished by putting a global lock around every allocation and free call to ensure correctness; however, a global lock does not scale well, so we need to consider more sophisticated parallel allocators.
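The size-to-bin mapping mentioned above might look like the following C sketch, assuming power-of-two bins starting at 8 bytes; the names and the bin layout are illustrative, not from the notes.

/* Round a request up to the next power-of-two bin size. */
#include <stddef.h>
#include <stdio.h>

#define MIN_BIN 8      /* smallest bin size in bytes */
#define MAX_BIN 4096   /* requests above this go to the large-object list */

size_t bin_size(size_t request) {
    if (request > MAX_BIN)
        return 0;      /* caller uses the large-object list instead */
    size_t sz = MIN_BIN;
    while (sz < request)
        sz <<= 1;      /* next power of two */
    return sz;
}

int main(void) {
    /* Request 34 maps to the 64-byte bin: the waste of 64 - 34 = 30
     * bytes is within the (s - 1)B = 34-byte bound for s = 2, B = 34. */
    printf("request 34 -> bin %zu\n", bin_size(34));
    return 0;
}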

3 Parallel storage allocators

Approach 1: a local heap per thread

One simple approach is to partition the heap and give each thread its own local heap. However, this approach has unbounded blowup and may cause false sharing. Unbounded blowup can be demonstrated by a program in which thread 1 keeps allocating blocks while thread 2 keeps freeing them. Because all the blocks are freed by thread 2, they all go onto thread 2's free list; meanwhile, thread 1 sees no freed blocks and keeps asking for more memory. Together, threads 1 and 2 cause unbounded blowup.

Approach 2: local heaps with ownership (the Hoard memory allocator [1])

To avoid the blowup problem, we can mark each block with an owner. In this approach, each thread has its own local allocator. All the local allocators talk to a global allocator, which in turn talks to the OS. Because blocks are marked with their owner, a freed block is returned to its owner's heap rather than to the heap of the thread that freed it. The pros:

- Scalable.
- Minimizes false sharing.
- Bounded blowup.

One important property of the Hoard memory allocator is that a local allocator returns mostly free superblocks to the global allocator when the utilization of its local heap is low. This is achieved by having each local allocator track two quantities:

- u_i, the memory in use in heap i;
- a_i, the memory allocated to heap i.

Hoard maintains u_i so that

    u_i ≥ a_i − KS   or   u_i ≥ (1 − f) a_i,

where K and f are adjustable parameters (K is the number of initially empty superblocks allowed for a heap and f is the emptiness threshold) and S is the size of a superblock. The pseudocode of Hoard is:

malloc()
    if there exists a free block x in heap i
        return x;
    else try to request a superblock from the global heap,
        move it to the local heap, and return a free block;
    else have the global heap request a superblock from the OS,
        move it to the local heap, and return a free block;

free(void *x)
    put block x back into its owner heap i;
    while (u_i < min(a_i − KS, (1 − f) a_i)) {
        find a superblock that is at least f empty
        and return it to the global allocator;
    }
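As a rough C sketch of how the free path might maintain this invariant (the constants, the heap fields, and the release_superblock() helper are all hypothetical; see [1] for Hoard's actual design):

/* Emptiness check on the free path: while BOTH invariant bounds are
 * violated, hand superblocks back to the global heap. */
#include <stdio.h>

#define K 4        /* superblocks of slack allowed per heap */
#define F 0.25     /* emptiness threshold f */
#define S 8192.0   /* superblock size in bytes */

typedef struct heap {
    double u;      /* u_i: bytes in use in heap i */
    double a;      /* a_i: bytes allocated to heap i */
} heap_t;

/* Hypothetical helper: release one superblock that is at least f empty
 * to the global heap (only the accounting is modeled here). */
void release_superblock(heap_t *h) {
    h->a -= S;
}

void hoard_free_tail(heap_t *h, double block_size) {
    h->u -= block_size;   /* block already returned to its owner heap */
    /* Restore u_i >= a_i - K*S or u_i >= (1 - f)*a_i. */
    while (h->u < h->a - K * S && h->u < (1 - F) * h->a)
        release_superblock(h);
}

int main(void) {
    heap_t h = { .u = 16384, .a = 65536 };   /* a mostly empty local heap */
    hoard_free_tail(&h, 64);
    printf("after free: u = %.0f, a = %.0f\n", h.u, h.a);
    return 0;
}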

If there are P local heaps in total, of which the first R (after renumbering) satisfy u_i ≥ a_i − KS and the remaining P − R satisfy u_i ≥ (1 − f) a_i, then the overall allocated memory A can be bounded by

    A = Σ_{i=1..P} a_i
      = Σ_{i=1..R} a_i + Σ_{i=R+1..P} a_i
      ≤ Σ_{i=1..R} (u_i + KS) + Σ_{i=R+1..P} u_i / (1 − f)
      ≤ P · KS + (1 / (1 − f)) · Σ_{i=1..P} u_i
      = O(U + P),

where U = Σ_i u_i is the total memory in use by the program and K, S, and f are constants. In other words, Hoard's space consumption exceeds the program's footprint by at most a constant factor plus a constant amount per heap, so the blowup is bounded.

References

[1] E. D. Berger, K. S. McKinley, R. D. Blumofe, and P. R. Wilson. Hoard: A scalable memory allocator for multithreaded applications. In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IX), pages 117–128, Cambridge, Massachusetts, USA, 2000. ACM.