A Comprehensive Complexity Analysis of User-level Memory Allocator Algorithms

2012 Brazilian Symposium on Computing System Engineering

Taís Borges Ferreira, Márcia Aparecida Fernandes, Rivalino Matias Jr.
School of Computer Science, Federal University of Uberlandia, Uberlandia, Brazil
taisbferreira@comp.ufu.br, marcia@facom.ufu.br, rivalino@fc.ufu.br

Abstract—Memory allocations are among the most frequently used operations in computer programs. The performance of memory allocation operations is a critical factor in software design; however, it is very often neglected. In this paper, we present a comprehensive complexity analysis of widely adopted user-level memory allocator algorithms. We consider time and space complexity, as well as the allocator overhead. The results show that the Ptmalloc family of memory allocator algorithms outperformed all other investigated allocators in terms of theoretical time complexity and space overhead. All allocators showed the same space complexity.

Keywords—memory management, analysis of algorithms, user-level allocators.

I. INTRODUCTION

Memory allocations are among the most ubiquitous operations in computer programs [1]. The efficient use of main memory has a significant impact on the performance of computer programs. In general, sophisticated real-world applications need to allocate and deallocate portions of memory of varying sizes, many times, during their execution. These operations are commonly performed with high frequency, which makes their individual performance significantly important. A memory allocator, or simply allocator, is the code responsible for implementing memory allocation operations [2]. In terms of operating system (OS) architecture, these operations exist at two levels: kernel level and user level. At the kernel level, they are implemented by a KMA (kernel memory allocator), which is responsible for providing the memory management needed to meet the OS kernel's demands for dynamic memory allocation. At the user level, memory allocation operations are implemented by a UMA (user-level memory allocator), which is an integral part of the application process. The structures and algorithms used in KMAs and UMAs are practically the same. This work focuses on UMAs.

Since the default UMA code is linked to the application program, it is possible to replace it with any other allocator of interest. Some sophisticated applications do not use the default UMA, bringing in their own custom UMA implementation [3]. This occurs because the standard C library usually implements a general-purpose UMA, which is not optimized to support specific application requirements. Depending on the application design, the needs for memory allocation are quite specific and the default allocator may not offer good performance. The use of multiple processors and multiple threads is an example of an application-specific characteristic that has a significant impact on UMA performance [4]. For this reason, there are currently many proprietary and open-source implementations of memory allocator algorithms, which can be used as alternatives to the default UMA. Many previous works (e.g., [1], [3]-[6]) have evaluated different memory allocators from an experimental point of view. In order to contribute to the body of knowledge in this area, this work presents a theoretical study of six widely used memory allocator algorithms: Hoard (version 3.8), Ptmalloc (version 2), Ptmalloc (version 3), TCMalloc (version 1.5), Jemalloc (version 2.0.1), and Miser.
We conduct the analysis of these algorithms in terms of time and space complexity, as well as their overheads. The rest of this paper is organized as follows. Section II presents an overview of each investigated memory allocator. Sections III and IV describe the analyses of each algorithm in terms of time and space complexity, and overhead, respectively. Section V presents our final remarks.

II. INVESTIGATED MEMORY ALLOCATORS

A. Hoard
The Hoard allocator [4] is designed to offer high performance in multithreaded programs running on multicore processors. To avoid heap contention, Hoard implements three kinds of heaps: thread-cache, local heap, and global heap. The thread-cache is a heap exclusive to each thread that keeps only memory blocks smaller than 256 bytes. Hoard implements 128 local heaps and uses a hash function to map threads to heaps. This hash function uses the thread identifier to generate a tid number between 0 and 127. Therefore, more than one thread shares the same heap if they get the same tid value. A group of threads will also share a local heap if the application launches more than 128 threads to run simultaneously. In general, most threads are expected to have exclusive access to a given local heap, reducing heap contention. When a thread t1 requests a memory slice smaller than 256 bytes, Hoard first searches for available memory in t1's thread-cache. Otherwise, it searches for memory in t1's local heap. In the local heap, at the entry corresponding to the requested size, Hoard searches for an emptiness group with fuller superblocks and returns the first memory slice of the superblock at the top of the chosen emptiness group.
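To make the heap-selection step concrete, the fragment below sketches how a thread identifier can be hashed into one of the 128 local heaps. It is only a minimal illustration of the idea described above; the names (local_heaps, heap_for_current_thread) and the particular hash are assumptions, not Hoard's actual code.

    #include <pthread.h>
    #include <stdint.h>

    #define NUM_LOCAL_HEAPS 128          /* Hoard keeps a fixed set of 128 local heaps */

    struct local_heap;                              /* opaque here */
    extern struct local_heap *local_heaps[NUM_LOCAL_HEAPS];

    /* Map the calling thread to one of the 128 local heaps.  Threads whose
       identifiers hash to the same value in 0..127 share a heap, as described
       in the text.  (Casting pthread_t to an integer is a simplification that
       holds on platforms where pthread_t is an integral type.) */
    static struct local_heap *heap_for_current_thread(void)
    {
        uintptr_t tid = (uintptr_t)pthread_self();
        unsigned idx = (unsigned)((tid >> 4) % NUM_LOCAL_HEAPS);  /* illustrative hash */
        return local_heaps[idx];
    }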

A superblock (Si) is a memory block of 64 kilobytes, which is split into blocks of the same size, where each of these blocks is up to 32 kilobytes. Figure 2 illustrates Hoard's main data structures.

Figure 2. Hoard heap management data structures and the superblock (S1) split into blocks of 16 bytes.

Each main index entry points to ten lists (0..9) of memory blocks of a certain size. These lists are known as emptiness groups. Figure 2 shows the emptiness groups (EG) corresponding to blocks of 16 bytes. Each EG entry points to a list of superblocks: EG[0] keeps unused superblocks, i.e., superblocks whose blocks are all free, while EG[8] keeps superblocks that are totally used, i.e., superblocks whose freelist points to NULL. The freelist is a pointer to the first block of a list of free blocks inside a superblock. When Hoard cannot solve a memory request using the thread's superblocks, and there are no more free superblocks in the thread's local heap, it goes to the global heap (an extra heap) that is shared by all threads. Hoard's architecture is built to minimize blowup problems [4], since it transfers superblocks from local heaps to the global heap, minimizing the waste of memory. Additionally, Hoard avoids false sharing, given that the size of a superblock is a multiple of the cache line size and a single heap owns a superblock at a time.

B. Ptmallocv2
Ptmallocv2 [7] is based on the well-known DLMalloc allocator [8] and incorporates features aimed at multiprocessors running multithreaded programs. Similar to Hoard, Ptmallocv2 implements multiple heap areas to reduce contention in multithreaded programs. Unlike Hoard, it addresses neither false sharing nor blowup. Figure 3 shows the main components of the Ptmallocv2 allocator. Ptmallocv2 inserts into the unsorted list the most recently released memory blocks and the chunks that result from a split or consolidation operation. These chunks are kept in the unsorted list until Ptmallocv2 needs to allocate memory again; in this case, chunks from the unsorted list are moved to the correct bins.

Figure 3. Ptmallocv2 heap management data structures.

Every time a block is released, Ptmallocv2 tries to consolidate it with the adjacent free chunks before inserting it into the unsorted list. To avoid consolidation during free operations, chunks up to 64 bytes are kept in a secondary array, the fast bins, from which they cannot be consolidated. To avoid fragmentation, and thus waste of memory, when an application requests or releases large chunks, Ptmallocv2 clears the fast bins and tries to consolidate their chunks. All threads share all arenas, but heap contention does not happen because, if a thread requests a memory block and all arenas are in use, Ptmallocv2 creates a new arena to meet that request and links it to the previous one; as soon as the request is solved, other threads may share the newly created arena. Ptmallocv2 is the current UMA implemented in the GNU C library (glibc). Hence, except for programs using their own allocators, all programs running under recent Linux distributions use Ptmallocv2.
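As an illustration of the fast-bin path just described, the sketch below shows a LIFO singly linked fast bin whose link pointer is stored inside the free chunk itself, which is why no extra space is needed for free-list bookkeeping. The structure and function names (struct chunk, fast_bin_push, fast_bin_pop) are simplified assumptions for this paper, not glibc's actual definitions.

    #include <stddef.h>

    /* Simplified free-chunk view: the size word is kept even while the chunk
       is allocated, and the free-list link reuses the start of the payload. */
    struct chunk {
        size_t size;          /* real chunk size                      */
        struct chunk *next;   /* valid only while the chunk is free   */
    };

    static struct chunk *fast_bins[8];    /* one LIFO list per small size class */

    static void fast_bin_push(int bin, struct chunk *c)
    {
        c->next = fast_bins[bin];         /* no coalescing on this path */
        fast_bins[bin] = c;
    }

    static struct chunk *fast_bin_pop(int bin)
    {
        struct chunk *c = fast_bins[bin];
        if (c != NULL)
            fast_bins[bin] = c->next;     /* constant-time removal from the head */
        return c;
    }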
C. Ptmallocv3
Version 3 of Ptmalloc [7] is an improvement over Ptmallocv2, mainly because it adopts a different method to meet requests for blocks larger than 256 bytes. Ptmallocv3 keeps small chunks (8 to 256 bytes) in linked lists (small bins). Large chunks (>256 bytes) are kept in binary trees (tree bins). Each size appears only once in a tree; the node corresponding to a size is also the head of the linked list of chunks of that size. Unlike Ptmallocv2, Ptmallocv3 uses neither unsorted lists nor fast bins. Every released chunk is placed into a bin according to its size. Figure 4 shows the heap management data structures of Ptmallocv3.

Figure 4. Ptmallocv3 heap management data structures.
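The tree-bin arrangement described above can be pictured as follows: each tree node stands for one chunk size and doubles as the head of the list of same-sized chunks. The struct layout and names below are illustrative assumptions that simplify Ptmallocv3's real chunk headers.

    #include <stddef.h>

    /* One node per distinct chunk size in a tree bin.  Additional free chunks
       of the same size hang off the node's same_size list instead of the tree. */
    struct tree_chunk {
        size_t size;                      /* chunk size this node represents       */
        struct tree_chunk *left, *right;  /* unbalanced binary tree, keyed by size */
        struct tree_chunk *same_size;     /* other free chunks of the same size    */
    };

    /* Find the node for `size`; in an unbalanced tree this walk can degenerate
       into a sequential search, which is the worst case analyzed in Section III. */
    static struct tree_chunk *tree_bin_find(struct tree_chunk *root, size_t size)
    {
        while (root != NULL && root->size != size)
            root = (size < root->size) ? root->left : root->right;
        return root;                      /* NULL: no free chunk of exactly this size */
    }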

D. TCMalloc
TCMalloc [9] also seeks to minimize heap contention. It implements a cache per thread. The memory blocks in the thread cache vary from 8 bytes to 32 kilobytes (kB). For blocks larger than 32 kB, TCMalloc uses a global page heap that is shared by all threads. Every page heap object is named a span and is a multiple of the page size; TCMalloc adopts 4-kilobyte pages. TCMalloc also implements a global central cache used to exchange memory between the thread caches and the page heap. Figures 5 and 6 present the main TCMalloc data structures. The small objects structure corresponds to the thread-cache. Since the majority of requests are assumed to be smaller than 32 kB, they are solved locally (no heap contention). Requests larger than 32 kB are solved by the page heap. Each free_ array entry is related to the number of pages of the span: index positions 1 to 256 keep the spans of 1 to 256 pages, and spans larger than 256 pages are kept in the large_ list. TCMalloc minimizes blowup given that lists of free blocks in the thread cache migrate to tc_slots in the central cache, allowing memory reuse. It also reduces contention because lists of memory objects can migrate from the central cache to the thread caches. However, TCMalloc does not address false sharing, since two small objects assigned to different threads can have close memory addresses.

Figure 5. TCMalloc local heap: small objects structure.
Figure 6. TCMalloc global data structures.

E. Jemalloc
Jemalloc [10] is the UMA used in many well-known applications (e.g., FreeBSD, Firefox, and Facebook). Similar to Hoard, it implements a thread-cache and multiple heaps; therefore, Jemalloc also addresses blowup, false sharing, and heap contention. The main difference between the two allocators is the number of heaps and the size of memory objects. The thread-cache, shown in Figure 7, is exclusive to the thread and keeps memory slices smaller than 32 kB. Memory blocks are classified into three classes according to their sizes: small (4 bytes to 4 kB), large (4 kB to 4 MB), and huge (> 4 MB). Local heaps (arenas) keep memory blocks classified as small and large. Small blocks are managed in structures named bins; each bin is for a specific size. A bin consists of a pointer to the run currently in use (runcur) and the list of non-empty runs (runs). A run is a group of pages that, when in a bin, is split into blocks of a certain size. Large requests are met by runs of two red-black trees: run_avail_dirty (runs that came from a bin or that result from a free() of large objects) and run_avail_clean (runs that contain only never-used pages). Huge requests are directly forwarded to the operating system. There are four arenas per processor. Figure 8 shows the arena data structures. False sharing is not completely avoided because memory slices smaller than 4 kB are grouped in pages that are not exclusive to a thread. Blowup is avoided since free blocks migrate from the thread-cache to the local heaps for memory reuse.

Figure 7. Jemalloc thread cache.
Figure 8. Jemalloc arena data structures.
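Both TCMalloc's thread cache and Jemalloc's thread-cache follow the same general pattern described in the two sections above: a per-thread array of free lists indexed by size class, from which small requests are served without touching any shared structure. The sketch below illustrates that pattern only; the names (thread_cache, size_class_index) and the size-class rounding are assumptions, not the code of either allocator.

    #include <stddef.h>

    #define NUM_SIZE_CLASSES 61            /* e.g., TCMalloc's thread cache has 61 entries */

    struct free_object { struct free_object *next; };   /* link lives inside the free object */

    /* Per-thread cache: one singly linked free list per size class
       (__thread is the GCC/Clang thread-local storage keyword). */
    static __thread struct free_object *thread_cache[NUM_SIZE_CLASSES];

    /* Map a request size to a size-class index; real allocators use carefully
       chosen class boundaries, this 8-byte rounding is only a placeholder. */
    static int size_class_index(size_t size)
    {
        size_t idx = (size + 7) / 8;
        return (idx < NUM_SIZE_CLASSES) ? (int)idx : NUM_SIZE_CLASSES - 1;
    }

    /* Fast path: pop from the calling thread's list.  A NULL result means the
       slow path (central cache or arena, as described in the text) must refill it. */
    static void *cache_alloc(size_t size)
    {
        int idx = size_class_index(size);
        struct free_object *obj = thread_cache[idx];
        if (obj != NULL)
            thread_cache[idx] = obj->next;
        return obj;
    }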
F. Miser
Miser [11] is based on Hoard and assumes that the majority of memory requests are up to 256 bytes; therefore, its goal is to meet these request sizes very fast. To achieve this, Miser implements one local heap per thread, avoiding heap contention, as can be seen in Figure 9.

Figure 9. Miser heap management structures.

The memory blocks in the local heap are organized as superblocks, and they are exclusive to a thread in order to avoid false sharing. Superblocks containing at least one object are inserted into a SuperblockChain entry, according to the size of their objects. Within an entry, Miser places a superblock in QuadList[i] according to how full it is: a superblock goes into QuadList[i] when the fraction of its allocated objects is greater than i/4 and at most (i+1)/4.
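A minimal sketch of this placement rule, under the quartile reading given above, could look like the following; the function name and the boundary handling are hypothetical, not Miser's code.

    /* Choose the QuadList slot of a superblock from its fullness quartile.
       Indices 0..3 cover partially used superblocks; as described next in the
       text, full superblocks go to QuadList[4] and empty ones to the EmptyList. */
    static int quadlist_index(unsigned allocated, unsigned capacity)
    {
        if (allocated == 0)
            return -1;                              /* caller moves it to EmptyList */
        if (allocated == capacity)
            return 4;                               /* completely full superblock   */
        return (int)((allocated * 4U) / capacity);  /* quartile 0..3 (boundaries simplified) */
    }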

When a superblock is full (no free memory available), it is moved to QuadList[4], and when it is empty it is moved to EmptyList. Miser also implements a global heap; superblocks of local heaps migrate to the global heap to avoid blowup. Nevertheless, Miser does not solve requests for memory blocks larger than 256 bytes, which are redirected to the standard C library memory allocator.

III. ANALYSIS OF MEMORY ALLOCATOR ALGORITHMS

A. Time and Space Complexity
We primarily consider the number of threads (T) as the parameter of the function that describes the execution time of an allocation algorithm. However, T can be limited in some allocators, while the number of allocations (R) may be sufficiently large, independently of the number of threads. Hence, the analysis of all algorithms also considers R.

1) Hoard: The first stage of Hoard runs in constant time, since it simply takes the requested memory slice from the thread cache when its size is less than 256 bytes. Otherwise, the second stage is a sequential search in the emptiness groups at the entry corresponding to the requested memory size; this search runs in constant time, since the emptiness group is an array of fixed size (see Figure 2). The third stage is the search in the global heap. This stage is the same as the second one, because the global and local heaps are organized in the same way. As each of these stages takes constant time, and can be executed for each memory request, the execution time of Hoard's allocator is O(R), where R is the number of requests. This constant time per request is achieved due to the stored pointers, which occupy a considerable memory space, proportional to 440·T + 4400·(H+1) + 72·S, where T is the number of threads, H is the number of heaps, and S is the number of superblocks. In general, one can consider T ≥ H, R ≥ T, and R ≥ S, so the space complexity is O(R).

2) Ptmallocv2: The arenas in Ptmallocv2 contain two arrays: fast bins and normal bins. The first is used to solve requests of less than 64 bytes; in this case, Ptmallocv2 always returns in constant time, taking the first chunk of the linked list in the fast-bin entry corresponding to the requested size. Requests for large chunks (solved by the normal bins) are handled similarly to small chunks. The worst case for large-chunk requests is a sequential search, which can be done either in the unsorted list (the first entry of the normal bins) or in the sorted lists of the normal bins (entries 65 up to 123). Given that the worst-case scenario is O(R), the execution time of Ptmallocv2 is considered O(R).

3) Ptmallocv3: It solves requests from 8 to 256 bytes using the small bins, so these requests are met in constant time. Memory requests for sizes larger than 256 bytes are solved by the tree bins, which are similar to the normal bins; the only difference is the replacement of the sorted lists with unbalanced trees. In the worst case, the height of a tree bin is proportional to its number of nodes, and the search is still sequential since the tree is unbalanced. Hence, we can consider Ptmallocv3's execution time to be the same as Ptmallocv2's, i.e., O(R). In terms of space complexity, we also have O(T) for both versions of Ptmalloc, since each arena occupies 132 kilobytes and, in the worst case, the number of arenas is equal to the number of threads.
4) TCMalloc: The first stage of TCMalloc is executed in constant time, since memory requests smaller than 32 kB are served from the thread cache, specifically the first memory block of the linked list in the array entry corresponding to the requested size. If this list is empty, the second stage moves an object list from the central cache to the thread cache. The only structural difference between these caches is the two linked lists in each entry of the central cache, so the execution time of the second stage is also constant. Requests larger than 32 kB are also solved in constant time by the page heap. Thus, we conclude that the time complexity is O(R). In terms of space complexity, the used space is proportional to 732·T, where 732 bytes corresponds to the space used to control each thread cache. So, the space complexity is O(R), assuming that R ≥ T.

5) Jemalloc: Jemalloc distinguishes three request sizes: small, large, and huge. Small requests are solved by the thread cache in constant time. The execution time for large requests relates to the red-black trees, whose heights depend on the number of deallocations. In the worst case, there are R deallocations and the tree height is O(log R). If all requests are large, then the execution time is O(log R), but if the number of small requests is higher than the number of large requests, then the time is O(R). Therefore, we consider the execution time to be O(R). Regarding the space complexity, each thread has a thread cache assigned, and there are four arenas for each processor. Since the number of processors does not change during the application execution, the space complexity can be considered O(T).

6) Miser: The only search process in the Miser allocator occurs in the QuadList with respect to the requested size. The QuadList contains four entries, so the execution time of this search is considered constant. The allocator first searches in the local heap, and then in the global heap if the local heap could not solve the request. Since the global and local heaps are organized in the same way, the search time is the same in both cases. Therefore, the execution time of Miser is O(R), given that each request executes at most two searches. The required space is proportional to 240·T + 44·S, for R ≥ T and R ≥ S. Hence, the space complexity of Miser is O(R).

B. Space Overhead
1) Hoard: Initially, Hoard allocates space to store an array of 128 entries, each a pointer of four bytes. Even if the application does not make any request, Hoard will consume 512 bytes to store this array. When the application makes its first request, the control structures of the global heap and the thread's local heap are initialized. Both heaps are based on the same structures.

A heap uses 4400 bytes to store an array of 55 entries. Each entry corresponds to another array of 10 positions of 8 bytes each, which point to doubly-linked lists of superblocks. Each thread receives a thread-cache that requires 440 bytes to store its array; this array has 55 entries of 8 bytes used to point to the lists of objects released into the thread-cache. The memory required for the list pointer comes from the object itself, given that this space is not otherwise needed while the object is free. When the object is allocated, the memory occupied by the pointer is integrated into the object's usable area, i.e., an object of 8 bytes will use 8 bytes. If the first request is for 8 bytes, then, after connecting a thread-cache and a heap to the thread, Hoard requests a superblock from the operating system and inserts it into the thread's local heap. A superblock is always 64 kilobytes, of which 72 bytes are used to store its header; the remaining bytes are used to meet memory requests. After the first allocation, 9312 bytes of the memory occupied by Hoard are spent to store memory management structures. Once in use, a superblock can only provide blocks of a specific size, so if the next memory request is for a block of 16 bytes, Hoard will allocate a new superblock. If all T threads perform several allocations requiring the creation of S superblocks, then Hoard will use 72·S bytes to store superblock headers and 440·T bytes for the thread caches. Since the number of local heaps in Hoard does not exceed 128, we conclude that the space overhead of Hoard is between 72·S + (440+4400)·T bytes and 72·S + 440·T bytes.

2) Ptmalloc: Unlike Hoard, Ptmalloc (v2 and v3) only begins to occupy memory when the application performs its first allocation. At this time, it requests an arena from the operating system. This arena has enough space to store the header, the requested memory block, and 128 kilobytes to use for subsequent requests. The arena header occupies 1144 bytes in Ptmallocv2 and 468 bytes in Ptmallocv3. The header holds a pointer to the circular linked list of arenas and the arrays used to index free objects. In Ptmallocv2, the header holds the fast bins, an array of 8 entries of 4 bytes, and the normal bins, another array of 128 entries of 8 bytes. In Ptmallocv3, there is an array of 32 entries of 8 bytes for the lists of small chunks, and an array of 16 entries of 4 bytes for the tree bins. Each memory chunk in Ptmalloc has a size equal to the amount of memory requested plus 8 extra bytes, used to keep the real chunk size in memory even while the chunk is allocated. When the chunk is free, as in Hoard, the pointers of the linked list (or tree) are stored within the block itself, saving space. For example, if the application requests 8 bytes, the chunk received will occupy 16 bytes in the arena. On the other hand, this overhead is reduced when releasing a chunk: if the chunk is adjacent to another chunk that is also free, they will be coalesced (joined). The arena is always a multiple of 4 kilobytes; therefore, if the first request performed by the application requires 8 bytes, the new arena will occupy 132 kilobytes. In Ptmallocv2, 1152 bytes of the arena space are spent on memory management structures; in Ptmallocv3, this value is 476 bytes. If the application launches T threads that allocate and deallocate memory blocks at the same time, Ptmalloc will create T new arenas.
Considering C as the number of chunks that the application is using after several allocations and deallocations, Ptmallocv2 will consume 1144·T + 8·C bytes against 468·T + 8·C bytes for Ptmallocv3.

3) TCMalloc: Initially, TCMalloc allocates memory for the central cache and the page heap. The central cache keeps an array of 61 entries of 576 bytes each. These 576 bytes are used to store another array of 488 bytes (slots_), two doubly-linked lists of spans of 24 bytes each, and some control variables. Altogether, the central cache data structures occupy 35,136 bytes (61 × 576). The page heap holds an array of 256 entries of 48 bytes each, which contain the pointers to the doubly-linked lists of spans, plus the head of the list of spans that exceed 256 pages, which occupies another 48 bytes. Hence, the page heap structures require 12,336 bytes in total. TCMalloc also requests 256 pages from the operating system, besides the space spent on global structures. Six of these pages are used to create six spans that are inserted into the central cache; the remaining 250 pages stay in the page heap as a single span. Each span keeps a header of 24 bytes and can only be used for requests of the same size. Even before the application makes its first memory allocation, TCMalloc requires approximately one megabyte, part of which is used only to store the initial memory management structures. When the application makes its first allocation, TCMalloc creates a thread-cache for the current thread. This structure consumes 732 bytes, with an array of 61 entries; each entry stores a pointer to the list of released objects of that size. The pointer of the linked list in a free object is stored within its first four bytes. When the application makes a request, it will receive the whole object memory area, including the 4 bytes previously used to store the list pointer. The exception is requests larger than 32 kilobytes; in this case, TCMalloc gives an entire span to the application, so it spends 24 bytes on the span control structures. As the application requests more memory, the number of spans increases, and thus the amount of memory used for span headers also increases. On the other hand, if a span becomes free, the allocator will group it with the adjacent span, reducing the memory overhead. Hence, assuming T is the number of threads and S is the number of spans, after some allocations and deallocations, TCMalloc will occupy 732·T + 24·S bytes, in addition to its initial global structures, to store its memory management structures.

4) Jemalloc: Right after starting, Jemalloc creates an array with 4·N entries, where N is the number of processors (or cores) in the system. Each entry in this array occupies 4 bytes and corresponds to an arena pointer. Even if the application does not perform any allocation, Jemalloc will occupy 16·N bytes in memory. Upon receiving the first memory request, Jemalloc allocates space for an arena. The arena is composed of an array of 29 entries, where each entry requires 60 bytes, and some control variables (e.g., the trees of runs) that consume 80 bytes in total. So, Jemalloc requires 1820 bytes to store the control structures of a new arena.

Each thread that executes some memory request (of up to 32 kilobytes) receives a thread cache. This structure maintains an array of 37 entries of 20 bytes each, plus 12 bytes used to maintain its control variables. Each entry stores a linked list of objects released into the thread cache. In total, the thread-cache management structures consume 752 bytes. If Jemalloc is out of free memory blocks, it will request four megabytes from the operating system. If the first memory request is for a block of 8 bytes, the allocator will take the first page to make a run of 8-byte blocks. This run needs a header of 16 bytes; the remaining 4080 bytes will be used to meet the next memory requests of 8 bytes. The other 1023 allocated pages will be inserted, as a single run, into the runs_avail_clean tree in the arena. Right after finishing the first allocation, part of the memory Jemalloc consumes, namely the arena array (16·N bytes), the arena header (1820 bytes), the thread cache (752 bytes), and the run header (16 bytes), is used to store control structures and will never be used to meet memory requests. Similar to Hoard, when a block is free, the pointer to the next free block in the list is stored inside the block's usable area: when a block becomes free, its first 4 bytes are used as the list pointer, so no extra area is needed for this purpose. If the application requests only amounts of memory that Jemalloc solves with blocks of 8 bytes, it will not create additional runs until it exhausts the memory of the currently created run. If the application requests 16 bytes after allocating a block of 8 bytes, Jemalloc will create a new run to meet this request. In this case, Jemalloc splits the run of 1023 pages, previously inserted into the runs_avail_clean tree, and increases the amount of memory spent on run headers by 16 bytes. In allocations between 4 kilobytes and 4 megabytes, each memory block will occupy 16 extra bytes, given that in these cases Jemalloc delivers an entire run to the application. When releasing memory, the allocator can group adjacent free runs, thereby reducing the memory consumed by headers. Assuming T is the number of threads and R is the number of runs in the arenas, after several allocations and deallocations, Jemalloc will occupy 16·N bytes for the arena array, 1820 bytes per arena, 752·T bytes for thread caches, and 16·R bytes for run headers as control structures.

5) Miser: This allocator initializes its structures when it receives the first allocation request. At this time, it allocates an array of 1024 entries, where each entry stores a 4-byte pointer to a heap. Miser also allocates a memory area of 344 kilobytes, in order to store the headers of the 1024 heaps (344 bytes each) and 344 bytes for the global heap. Once it initializes the control structures of the heaps, it requests 64 kilobytes from the operating system, which are divided into 16 superblocks. Each superblock has 4 kilobytes, of which 44 bytes are occupied by its header. Assuming that the first allocation asks for 8 bytes, Miser places one of the previously allocated superblocks into the 8-byte entry of the thread's local heap and splits the remaining 4052 bytes into blocks of 8 bytes. There is no extra memory per memory block; all block control information is stored in the superblock header. After the first allocation, Miser will be using about 413 kilobytes, where 349 kilobytes are used to store the control structures of superblocks and heaps. When the allocated superblocks are exhausted (no free memory), Miser requests 16 more superblocks from the operating system, which increases the memory reserved for headers by 704 bytes. Hence, assuming S is the number of superblocks, after several memory requests Miser will occupy 44·S bytes for superblock headers, in addition to the fixed space reserved for the heap array and heap headers.
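To make these per-allocator expressions easier to compare, the short program below plugs example values of T (threads), S (superblocks or spans), and C (live chunks) into the overhead formulas stated in this section. It is only illustrative arithmetic over the reported figures: fixed start-up structures are ignored, Hoard uses the lower bound of its range, and Jemalloc is omitted because its expression also depends on N and the number of runs.

    #include <stdio.h>

    int main(void)
    {
        long T = 16;       /* threads                                        */
        long S = 1000;     /* superblocks (Hoard/Miser) or spans (TCMalloc)  */
        long C = 100000;   /* live chunks (Ptmalloc)                         */

        printf("Hoard      : %ld bytes\n", 72 * S + 440 * T);   /* lower bound */
        printf("Ptmallocv2 : %ld bytes\n", 1144 * T + 8 * C);
        printf("Ptmallocv3 : %ld bytes\n", 468 * T + 8 * C);
        printf("TCMalloc   : %ld bytes\n", 732 * T + 24 * S);
        printf("Miser      : %ld bytes\n", 44 * S);
        return 0;
    }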
IV. CONCLUSIONS

We conclude that, considering R (the number of requests) as the parameter used to express the execution time, Jemalloc and the two versions of Ptmalloc execute in linear time only in the worst cases; in all other cases, their execution takes constant time. The Hoard, Miser, and TCMalloc allocators run in linear time in all scenarios. Therefore, from the theoretical point of view, with respect to time complexity, the first three allocators (Jemalloc and Ptmalloc v2 & v3) are better than the others. Regarding space, all the allocators have linear complexity. It is important to note that the constant underlying the asymptotic notation is different for each evaluated allocator, so in practical terms it should also be considered when selecting the best allocator for a specific application. In terms of the space overhead due to management structures, in general we may rank the investigated allocators as follows: Ptmallocv3 (smallest), Ptmallocv2, Jemalloc, TCMalloc, Miser, and Hoard (largest). As with the time complexity analysis, the space overhead of each algorithm may change due to different factors, such as the number of processors (e.g., Jemalloc). Hence, future work in this research is to conduct a sensitivity analysis of all investigated algorithms with respect to important environmental factors that may affect their behavior in terms of time and space.

REFERENCES
[1] T. B. Ferreira, R. Matias Jr., A. Macedo, and L. B. Araujo, "An Experimental Study on Memory Allocators in Multicore and Multithreaded Applications," Proc. of the Int'l Conference on Parallel and Distributed Computing, Applications and Technologies, Gwangju.
[2] U. Vahalia, UNIX Internals: The New Frontiers, Prentice Hall.
[3] E. Berger, B. Zorn, and K. McKinley, "Reconsidering Custom Memory Allocation," Proc. of the Conference on Object-Oriented Programming: Systems, Languages, and Applications.
[4] E. D. Berger, K. S. McKinley, R. D. Blumofe, and P. R. Wilson, "Hoard: A Scalable Memory Allocator for Multithreaded Applications," ACM SIGARCH Computer Architecture News, v. 28, n. 5.
[5] J. Attardi and N. Nadgir, "A Comparison of Memory Allocators in Multiprocessors," multiproc.html.
[6] M. Masmano, I. Ripoll, and A. Crespo, "A Comparison of Memory Allocators for Real-Time Applications," Proc. of the 4th Int'l Workshop on Java Technologies for Real-Time and Embedded Systems, ACM Int'l Conference Proceeding Series, vol. 177, pp. 68-76.
[7] W. Gloger, "Ptmalloc."
[8] D. Lea, "A Memory Allocator," html/malloc.html.
[9] S. Ghemawat and P. Menage, "TCMalloc: Thread-Caching Malloc."
[10] J. Evans, "A Scalable Concurrent malloc() Implementation for FreeBSD," Proc. of The BSD Conference.
[11] T. Tannenbaum, "Miser: A Dynamically Loadable Memory Allocator for Multi-Threaded Applications," miser-a-dynamically-loadable-memory-allocator-for-multi-threaded-applications.


More information

OPERATING SYSTEM. Chapter 9: Virtual Memory

OPERATING SYSTEM. Chapter 9: Virtual Memory OPERATING SYSTEM Chapter 9: Virtual Memory Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating Kernel Memory

More information

File system internals Tanenbaum, Chapter 4. COMP3231 Operating Systems

File system internals Tanenbaum, Chapter 4. COMP3231 Operating Systems File system internals Tanenbaum, Chapter 4 COMP3231 Operating Systems Architecture of the OS storage stack Application File system: Hides physical location of data on the disk Exposes: directory hierarchy,

More information

File System: Interface and Implmentation

File System: Interface and Implmentation File System: Interface and Implmentation Two Parts Filesystem Interface Interface the user sees Organization of the files as seen by the user Operations defined on files Properties that can be read/modified

More information

Memory Management. Dr. Yingwu Zhu

Memory Management. Dr. Yingwu Zhu Memory Management Dr. Yingwu Zhu Big picture Main memory is a resource A process/thread is being executing, the instructions & data must be in memory Assumption: Main memory is infinite Allocation of memory

More information

DYNAMIC MEMORY ALLOCATOR ALGORITHMS SIMULATION AND PERFORMANCE ANALYSIS

DYNAMIC MEMORY ALLOCATOR ALGORITHMS SIMULATION AND PERFORMANCE ANALYSIS ISTANBUL UNIVERSITY JOURNAL OF ELECTRICAL & ELECTRONICS ENGINEERING YEAR VOLUME NUMBER : 25 : 5 : 2 (1435-1441) DYNAMIC MEMORY ALLOCATOR ALGORITHMS SIMULATION AND PERFORMANCE ANALYSIS 1 Fethullah Karabiber

More information

Memory Management. To do. q Basic memory management q Swapping q Kernel memory allocation q Next Time: Virtual memory

Memory Management. To do. q Basic memory management q Swapping q Kernel memory allocation q Next Time: Virtual memory Memory Management To do q Basic memory management q Swapping q Kernel memory allocation q Next Time: Virtual memory Memory management Ideal memory for a programmer large, fast, nonvolatile and cheap not

More information

Parallelizing Inline Data Reduction Operations for Primary Storage Systems

Parallelizing Inline Data Reduction Operations for Primary Storage Systems Parallelizing Inline Data Reduction Operations for Primary Storage Systems Jeonghyeon Ma ( ) and Chanik Park Department of Computer Science and Engineering, POSTECH, Pohang, South Korea {doitnow0415,cipark}@postech.ac.kr

More information

Instructions. Definitions. Name: CMSC 341 Fall Question Points I. /12 II. /30 III. /10 IV. /12 V. /12 VI. /12 VII.

Instructions. Definitions. Name: CMSC 341 Fall Question Points I. /12 II. /30 III. /10 IV. /12 V. /12 VI. /12 VII. CMSC 341 Fall 2013 Data Structures Final Exam B Name: Question Points I. /12 II. /30 III. /10 IV. /12 V. /12 VI. /12 VII. /12 TOTAL: /100 Instructions 1. This is a closed-book, closed-notes exam. 2. You

More information

Operating Systems Memory Management. Mathieu Delalandre University of Tours, Tours city, France

Operating Systems Memory Management. Mathieu Delalandre University of Tours, Tours city, France Operating Systems Memory Management Mathieu Delalandre University of Tours, Tours city, France mathieu.delalandre@univ-tours.fr 1 Operating Systems Memory Management 1. Introduction 2. Contiguous memory

More information

Chapter 9: Virtual Memory

Chapter 9: Virtual Memory Chapter 9: Virtual Memory Silberschatz, Galvin and Gagne 2013 Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

CSE 373 OCTOBER 23 RD MEMORY AND HARDWARE

CSE 373 OCTOBER 23 RD MEMORY AND HARDWARE CSE 373 OCTOBER 23 RD MEMORY AND HARDWARE MEMORY ANALYSIS Similar to runtime analysis MEMORY ANALYSIS Similar to runtime analysis Consider the worst case MEMORY ANALYSIS Similar to runtime analysis Rather

More information

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University Che-Wei Chang chewei@mail.cgu.edu.tw Department of Computer Science and Information Engineering, Chang Gung University Chapter 10: File System Chapter 11: Implementing File-Systems Chapter 12: Mass-Storage

More information

CIS Operating Systems Non-contiguous Memory Allocation. Professor Qiang Zeng Spring 2018

CIS Operating Systems Non-contiguous Memory Allocation. Professor Qiang Zeng Spring 2018 CIS 3207 - Operating Systems Non-contiguous Memory Allocation Professor Qiang Zeng Spring 2018 Big picture Fixed partitions Dynamic partitions Buddy system Contiguous allocation: Each process occupies

More information

e-pg Pathshala Subject: Computer Science Paper: Operating Systems Module 35: File Allocation Methods Module No: CS/OS/35 Quadrant 1 e-text

e-pg Pathshala Subject: Computer Science Paper: Operating Systems Module 35: File Allocation Methods Module No: CS/OS/35 Quadrant 1 e-text e-pg Pathshala Subject: Computer Science Paper: Operating Systems Module 35: File Allocation Methods Module No: CS/OS/35 Quadrant 1 e-text 35.1 Introduction File system is the most visible part of the

More information

Performance of Non-Moving Garbage Collectors. Hans-J. Boehm HP Labs

Performance of Non-Moving Garbage Collectors. Hans-J. Boehm HP Labs Performance of Non-Moving Garbage Collectors Hans-J. Boehm HP Labs Why Use (Tracing) Garbage Collection to Reclaim Program Memory? Increasingly common Java, C#, Scheme, Python, ML,... gcc, w3m, emacs,

More information

Dynamic Memory Allocation. Gerson Robboy Portland State University. class20.ppt

Dynamic Memory Allocation. Gerson Robboy Portland State University. class20.ppt Dynamic Memory Allocation Gerson Robboy Portland State University class20.ppt Harsh Reality Memory is not unbounded It must be allocated and managed Many applications are memory dominated Especially those

More information

Chapter 11: File System Implementation. Objectives

Chapter 11: File System Implementation. Objectives Chapter 11: File System Implementation Objectives To describe the details of implementing local file systems and directory structures To describe the implementation of remote file systems To discuss block

More information

ECE 2035 A Programming HW/SW Systems Spring problems, 5 pages Exam Three 13 April Your Name (please print clearly)

ECE 2035 A Programming HW/SW Systems Spring problems, 5 pages Exam Three 13 April Your Name (please print clearly) Instructions: This is a closed book, closed note exam. Calculators are not permitted. If you have a question, raise your hand; do not leave your seat. Please work the exam in pencil and do not separate

More information

Chapter 9: Virtual Memory. Operating System Concepts 9th Edition

Chapter 9: Virtual Memory. Operating System Concepts 9th Edition Chapter 9: Virtual Memory Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating Kernel Memory Other Considerations

More information