CSE 539 02/7/2015: Parallel storage allocator
Lecture 9
Scribe: Jing Li

Outline of this lecture:
1. Criteria and definitions
2. Serial storage allocators
3. Parallel storage allocators

1 Criteria and definitions

Criteria for a parallel storage allocator:
- Speed.
- Scalability.
- Avoid allocator-induced false sharing.
- Minimize fragmentation.
- Minimize space overhead.

False sharing is the situation where two threads allocate objects on the same cache line, with at least one of the objects being updated. False sharing can cause the underlying hardware to generate lots of cache-coherence traffic (we will see more on cache-coherence protocols later in the course) and can be really damaging to program performance.

Note that for a serial storage allocator, speed, fragmentation, and space overhead should be optimized as well. The scalability and false-sharing problems are unique to parallel allocators.

Definitions:
- Space overhead is the ratio between the space used for bookkeeping and the space used for actual storage.
- External fragmentation is the waste due to being unable to use storage because it is not contiguous.
- Internal fragmentation is the waste due to allocating a larger block than the user requested.
- Blowup is the additional waste of a parallel allocator beyond what a serial allocator would require.
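As a tiny numeric sketch of these definitions (all sizes here are made up for illustration, not from the lecture):

```python
# Illustration of internal fragmentation and space overhead.
# Hypothetical numbers: the user asks for 34 bytes, the allocator hands out
# a 64-byte block that carries an 8-byte bookkeeping header.

requested = 34
block = 64
header = 8

internal_fragmentation = block - requested   # waste inside the allocated block
space_overhead = header / block              # bookkeeping space vs. storage space

print(internal_fragmentation)  # 30
print(space_overhead)          # 0.125
```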
2 Serial storage allocators

Simple storage allocator using a free list

The simple storage allocator is implemented using a last-in-first-out free list, shown in the figure below. A free-list pointer (denoted free) serves as the head of the free list, pointing to the first free block. In each free block, a next pointer points to the next free block. For now, let's simply assume the allocation size is fixed. When the user returns a previously allocated block B, the allocator first sets B to point to the first free block, and then sets the free pointer to B. The pseudocode of the allocator is shown below.

malloc()
    if (free != NULL) {
        x = free;
        free = free->next;
        return x;
    } else {
        error;
    }

free(void *x)
    x->next = free;
    free = x;

In this simple storage allocator, the space overhead is O(1), which is very small. When the block size is not fixed, extra information is stored at the head and/or end of each block so that two contiguous blocks can be coalesced together. In such cases, the actual space a user can use in each block is less than the block size, so the space overhead can be larger.
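The free-list allocator above can be sketched in executable form. This is a minimal model, not the lecture's exact implementation: Python objects stand in for raw memory blocks, and the `Block` and `FreeListAllocator` names are invented for illustration.

```python
# Sketch of a LIFO free list over a pool of fixed-size blocks.

class Block:
    def __init__(self):
        self.next = None  # reused as the free-list link when the block is free

class FreeListAllocator:
    def __init__(self, nblocks):
        # Pre-carve a pool of fixed-size blocks and thread them onto the list.
        self.free = None
        for _ in range(nblocks):
            b = Block()
            b.next = self.free
            self.free = b

    def malloc(self):
        if self.free is None:
            raise MemoryError("out of blocks")
        x = self.free
        self.free = self.free.next  # pop the head: O(1)
        return x

    def free_block(self, x):
        x.next = self.free          # push onto the head: O(1)
        self.free = x

alloc = FreeListAllocator(2)
a = alloc.malloc()
b = alloc.malloc()
alloc.free_block(a)
c = alloc.malloc()   # LIFO order: the most recently freed block comes back first
print(c is a)        # True
```

The LIFO order is what gives the good temporal locality mentioned below: the block handed out next is the one whose cache lines were touched most recently.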
On the other hand, this simple approach may cause lots of external fragmentation, due to the last-in-first-out nature of the free list. An example of such a bad case is shown in the figure below.

In summary, the pros:
- O(1) allocation and free operations.
- Good temporal locality.

The cons, due to bad external fragmentation:
- The page table size could increase quickly.
- It can cause disk thrashing.
- It is bad for the TLB (Translation Lookaside Buffer).

Minimizing external fragmentation using a page-based free list

We can minimize the external fragmentation issue of the simple implementation by using a page-based free list, as shown in the figure below. In particular:
- Keep a free list on a per-page basis.
- Allocate blocks from the fullest page.
- Always return a freed block to the page it belongs to.
- Return a page back to the OS when it becomes empty.
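The page-based policy above can be sketched as follows. This is an assumed structure rather than the lecture's code: the page size (4 blocks) and class names are made up, and a list of pages stands in for real memory handed out by the OS.

```python
# Sketch of a page-based free list: allocate from the fullest page,
# free back to the owning page, and return empty pages to the "OS".

BLOCKS_PER_PAGE = 4  # assumed tiny page size for illustration

class Page:
    def __init__(self):
        self.free_slots = list(range(BLOCKS_PER_PAGE))  # free block slots
        self.used = 0

class PageAllocator:
    def __init__(self):
        self.pages = []

    def malloc(self):
        # Pick the fullest page that still has room: an O(#pages) scan,
        # which is why this scheme is no longer O(1) per operation.
        candidates = [p for p in self.pages if p.free_slots]
        if not candidates:
            page = Page()              # request a fresh page from the OS
            self.pages.append(page)
        else:
            page = max(candidates, key=lambda p: p.used)
        slot = page.free_slots.pop()
        page.used += 1
        return (page, slot)

    def free(self, block):
        page, slot = block
        page.free_slots.append(slot)   # always return the block to its own page
        page.used -= 1
        if page.used == 0:
            self.pages.remove(page)    # empty page goes back to the OS

alloc = PageAllocator()
blocks = [alloc.malloc() for _ in range(6)]  # fills one page, spills onto a second
pages_after_alloc = len(alloc.pages)         # 2
for blk in blocks:
    alloc.free(blk)
pages_after_free = len(alloc.pages)          # 0: both pages returned to the OS
print(pages_after_alloc, pages_after_free)
```

Preferring the fullest page concentrates live blocks on few pages, so sparsely used pages drain and can be given back, which is exactly what keeps external fragmentation low.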
However, this improved storage allocator no longer has O(1) allocation/free. Instead, it needs to do extra work to keep track of which page is the fullest, either when allocating or when freeing a block. Therefore, it is around O(n) allocation/free, where n is the number of pages.

Handling allocations of different sizes using a binned list

Note that we typically want to optimize the allocator for small objects instead of large objects, because small objects come and go more frequently. Therefore, people use a binned list to allocate small objects to improve performance. As shown in the figure below, each bin contains blocks of the same specified size, and the specified sizes increase from bin to bin. In the example, the bin sizes follow powers of 2; in TBB, they follow powers of 1.25. When a block is larger than the largest bin, it is maintained by a single list for large sizes.

Because small blocks must be one of the specified sizes, the binned-list implementation causes internal fragmentation. For example, if the user requests a block of size 34, it will get a block from the size-64 bin. However, the waste can be bounded by (s − 1)B, where s is the increase factor of the bin sizes and B is the actual requested block size.

Note that all the approaches described above are designed for the serial case. When we have a parallel program with multiple threads, we need to make the allocator thread-safe. This can be accomplished by putting a global lock around every allocation and free call to ensure the correctness of the allocator. However, a global lock does not scale well. Therefore, we need to consider more complex parallel allocators.
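The binned size classes can be sketched as below. The smallest bin size (8) and the helper name are assumptions for illustration; only the growth factor s and the (s − 1)B waste bound come from the discussion above.

```python
# Sketch of binned size classes: bin sizes grow by a factor s, a request is
# rounded up to the nearest bin, and the per-block waste stays within
# (s - 1) * B for a request of size B.

def bin_size(request, s=2, smallest=8):
    """Round a request up to the nearest size class smallest * s**k."""
    size = smallest
    while size < request:
        size *= s
    return size

req = 34
granted = bin_size(req)   # power-of-2 bins: a request of 34 lands in the 64 bin
waste = granted - req
print(granted, waste)     # 64 30

# Check the internal-fragmentation bound from the notes: waste <= (s - 1) * B.
assert waste <= (2 - 1) * req
```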
3 Parallel storage allocators

Approach 1: local heap per thread

One simple approach would be to partition the heap and assign a local heap to each thread. However, this approach has unbounded blowup and may cause false sharing. Unbounded blowup can be caused by constructing a program in which thread 1 keeps allocating blocks while thread 2 keeps freeing blocks. Because all the blocks are freed by thread 2, they all go to the free list of thread 2. Meanwhile, thread 1 sees no freed blocks, so it will ask for more and more blocks. Threads 1 and 2 together cause unbounded blowup.

Approach 2: local heap with ownership (Hoard memory allocator [1])

To avoid the blowup problem, we can mark each block with its owner. In this approach, each thread has its own allocator. All the local allocators talk to a global allocator, which talks to the OS. Blocks are marked with their owner, and when a block is freed, it is returned to its owner instead of to the thread that frees it.

The pros:
- Scalable.
- Minimizes false sharing.
- Bounded blowup.

One important property of the Hoard memory allocator is that a local allocator will find free superblocks and return them to the global allocator when the utilization of the local heap is low. This is achieved by having each local allocator keep track of two parameters:
- u_i: the utilization of heap i (memory in use).
- a_i: the memory allocated to heap i.

The Hoard memory allocator maintains u_i so that

    u_i ≥ a_i − K·S   or   u_i ≥ (1 − f)·a_i,
where K and f are adjustable parameters: K is the number of empty superblocks a heap is allowed to keep, and f is the emptiness threshold. S is the size of a superblock.

The pseudocode of Hoard is:

malloc()
    if there exists a free block x in heap i
        return x;
    else try to request a superblock from the global heap, move it to the local heap, and return a free block
    else have the global heap request a superblock from the OS, move it to the local heap, and return a free block

free()
    put the block back into its heap i
    while (u_i < min(a_i − K·S, (1 − f)·a_i)) {
        find a superblock that is at least f empty and return it to the global allocator
    }

If there are in total P local heaps, of which R local heaps satisfy u_i ≥ a_i − K·S and the remaining P − R local heaps satisfy u_i ≥ (1 − f)·a_i, then the overall allocated memory A can be bounded by

    A = Σ_{i=1}^{P} a_i
      = Σ_{i=1}^{R} a_i + Σ_{i=R+1}^{P} a_i
      ≤ Σ_{i=1}^{R} (u_i + K·S) + Σ_{i=R+1}^{P} u_i / (1 − f)
      ≤ P·K·S + Σ_{i=1}^{R} u_i + (1 / (1 − f)) Σ_{i=R+1}^{P} u_i
      = O(U + P),

where U = Σ_i u_i is the total memory in use.

References

[1] E. D. Berger, K. S. McKinley, R. D. Blumofe, and P. R. Wilson. Hoard: A scalable memory allocator for multithreaded applications. In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS IX, pages 117–128, Cambridge, Massachusetts, USA, 2000. ACM.