CSE 539 Project 2
Assigned: 02/20/2015
Due Date: 03/06/2015

Building a thread-safe memory allocator

In this project, you will implement a thread-safe malloc library. The provided codebase includes a simple implementation of a serial malloc library (i.e., NOT thread safe). Your job is to make it thread safe and possibly optimize it so that it performs better (in terms of both utilization and speed). Remember that correctness comes first, then performance. Thus, even though this document describes the provided infrastructure for testing your serial memory allocator first and your parallel memory allocator second, that doesn't necessarily mean you should work on the project in that order; it depends heavily on what you plan to do and how you structure your code.

Please read through the entire document carefully before you start. The information in this document will help you navigate the codebase. If you organize your code well, you can likely make it thread safe first, and then optimize the baseline malloc implementation without too much disruption to the thread-safe part.

Logistics for this project

As in project 1, you will obtain a copy of the base code via your svn repository (if you svn up, you should see a proj2 directory). You are encouraged, but not required, to work in pairs for this project. If you decide to work in a group of two, you only need to turn in one copy of the code; please designate one person's svn repository for turning in code. You will still have to turn in separate writeups, however. Since this project is due on Friday, you will turn in your writeup by committing the pdf file of your writeup in your own repository (even if you are in a group of two and your repo is not the designated code repo). The deadline is 11:59:59 pm, just before midnight.

Heap memory allocator interface

Your dynamic storage allocator will consist of the following four functions, which (among other functions) are declared in allocator_interface.h and should be defined in allocator.cpp. Both allocator_interface.h and allocator.cpp are encapsulated in the namespace called my to prevent name collisions with the malloc library in libc.

int allocator::init(void);
Before calling any functions relating to memory allocation, the application we use to evaluate your implementation will call allocator::init. You may use this function to perform any necessary initialization of your library, such as allocating the initial heap area. The return value should be 1 if there was a problem in performing the initialization and 0 if everything went smoothly.

void* allocator::malloc(size_t size);
This call must return a pointer to a contiguous block of newly allocated memory which is at least size bytes long. This entire block must lie within the heap region and must not overlap any other currently allocated chunk. The pointers returned by allocator::malloc must always be aligned to 8-byte boundaries and lie within the heap boundary (i.e., between the values returned by mem_heap_lo() and mem_heap_hi() from memlib.c); you'll notice that the libc implementation of malloc does the same. If the requested size is zero, or an error occurs and the requested block cannot be allocated, a NULL pointer must be returned.

void allocator::free(void* ptr);
This call notifies your storage allocator that a currently allocated block of memory should be deallocated. The argument must be a pointer previously returned by allocator::malloc or allocator::realloc, and not previously freed. You are not required to detect or handle either of these error cases. However, you should handle freeing a NULL pointer; it is defined to have no effect.

void* allocator::realloc(void* ptr, size_t size);
This call returns a pointer to an allocated region, similarly to how allocator::malloc behaves. There are two special cases you should be aware of. If ptr is NULL, the call is equivalent to allocator::malloc(size);. If size is equal to zero, the call is equivalent to allocator::free(ptr);.
Otherwise, ptr must meet the same constraints as the argument to allocator::free; it must point to a previously allocated block and it must have been previously returned by either allocator::malloc or allocator::realloc. You do not need to defend against invalid pointers. The return value of allocator::realloc must meet all of the same constraints as the return value of allocator::malloc; namely, it must be 8-byte aligned, must point to a block of memory of at least size bytes, and must lie within the heap boundary.

There is one additional constraint on the behavior of allocator::realloc. Any data in the old block must be copied over to the new block. If the new block is smaller, the old values are truncated; if the new block is larger, the value of each of the bytes at the end of the block is undefined.
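
The realloc rules above (the two special cases, the copy of the old data, and the 8-byte alignment) can be sketched in a few lines. This is only an illustration under stated assumptions, not the provided code: the arena, the Header struct, and the toy_malloc, toy_free and toy_realloc names are all hypothetical, the toy allocator never reuses freed memory, and returning NULL from a zero-size realloc is an assumption the handout does not spell out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>

namespace sketch {

// Hypothetical toy arena standing in for the real heap (NOT memlib.c).
alignas(8) static unsigned char arena[1 << 16];
static std::size_t arena_used = 0;

// Hypothetical 8-byte header placed in front of every payload.
struct Header { std::uint64_t size; };

// Round n up to the next multiple of 8.
inline std::size_t align8(std::size_t n) {
    return (n + 7) & ~static_cast<std::size_t>(7);
}

void* toy_malloc(std::size_t size) {
    if (size == 0 || size > sizeof(arena)) return nullptr;  // zero size or hopeless request
    std::size_t need = sizeof(Header) + align8(size);
    if (need > sizeof(arena) - arena_used) return nullptr;  // arena exhausted
    assert(arena_used % 8 == 0);                            // keeps every payload 8-byte aligned
    Header* h = new (arena + arena_used) Header{static_cast<std::uint64_t>(size)};
    arena_used += need;
    return h + 1;                                           // payload starts right after the header
}

// Freeing NULL has no effect; this toy bump allocator never reuses memory.
void toy_free(void* ptr) { (void)ptr; }

void* toy_realloc(void* ptr, std::size_t size) {
    if (ptr == nullptr) return toy_malloc(size);            // special case 1: behaves like malloc
    if (size == 0) { toy_free(ptr); return nullptr; }       // special case 2: behaves like free
    Header* old = static_cast<Header*>(ptr) - 1;
    void* fresh = toy_malloc(size);
    if (fresh == nullptr) return nullptr;                   // old block is left untouched
    std::size_t old_size = static_cast<std::size_t>(old->size);
    std::size_t keep = old_size < size ? old_size : size;   // truncate when shrinking
    std::memcpy(fresh, ptr, keep);
    toy_free(ptr);
    return fresh;
}

}  // namespace sketch
```

Storing the requested size in a header is what lets toy_realloc know how many bytes to copy; a real allocator would typically recover that from its own block metadata instead.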

A naive implementation of allocator::realloc might consist of nothing more than a call to allocator::malloc, a memory copy, and a call to allocator::free. This is, in fact, how the reference implementation works; leaving this solution in place is probably a good way to get started. Once you've made progress on allocator::malloc and allocator::free, you will want to consider ways of improving the performance of allocator::realloc.

All of this behavior matches the semantics of the corresponding libc routines. Type man malloc at the shell to see additional documentation, if you're curious.

The allocator.cpp file we have given you currently just makes calls to a fairly simple and serial (i.e., NOT thread-safe) malloc library that uses a freelist without binning (code in mm-implicit.c). Each block in the freelist has a header and footer to allow coalescing. (A high-level description of the implicit free-list implementation can be found here: http://www.cs.cmu.edu/afs/cs/academic/class/15213-f11/www/lectures/18-allocation-basic.pdf).

Support routines

The code in memlib.c simulates the memory system for your dynamic memory allocator. You can invoke the following functions in memlib.c:

void* mem_sbrk(int incr);
Expands the heap by incr bytes, where incr is a positive non-zero integer, and returns a generic pointer to the first byte of the newly allocated heap area. The semantics are identical to the Unix sbrk function, except that mem_sbrk accepts only a positive non-zero integer argument.

void* mem_heap_lo(void);
Returns a generic pointer to the first byte in the heap.

void* mem_heap_hi(void);
Returns a generic pointer to the last byte in the heap.

size_t mem_heapsize(void);
Returns the current size of the heap in bytes.

size_t mem_pagesize(void);
Returns the system page size in bytes (4 KB on Linux systems). It is unlikely that you will need this.
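
To make the relationship between these routines concrete, here is a toy model of the semantics described above. It is a sketch and not the provided memlib.c: the fixed-size backing array, the kMaxHeap limit, and the choice to return nullptr on failure are assumptions made for illustration, and the real routines may report errors differently.

```cpp
#include <cassert>
#include <cstddef>

namespace toy_memlib {

// Hypothetical fixed backing store; the real memlib.c manages its own region.
constexpr std::size_t kMaxHeap = 1 << 20;
alignas(8) static unsigned char heap[kMaxHeap];
static std::size_t brk_offset = 0;  // number of bytes currently in the heap

// Grows the heap by incr bytes and returns the first byte of the new area.
void* mem_sbrk(int incr) {
    if (incr <= 0) return nullptr;  // only positive increments are accepted
    std::size_t n = static_cast<std::size_t>(incr);
    if (n > kMaxHeap - brk_offset) return nullptr;  // assumption: nullptr on exhaustion
    void* old_brk = heap + brk_offset;
    brk_offset += n;
    return old_brk;
}

// First byte of the heap.
void* mem_heap_lo() { return heap; }

// Last byte of the heap; only meaningful once the heap is non-empty.
void* mem_heap_hi() {
    assert(brk_offset > 0);
    return heap + brk_offset - 1;
}

// Current heap size in bytes.
std::size_t mem_heapsize() { return brk_offset; }

}  // namespace toy_memlib
```

Note that the value returned by mem_sbrk is the old end of the heap, so consecutive calls hand out adjacent regions, and mem_heap_hi always equals mem_heap_lo plus mem_heapsize minus one.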

Improving and testing the serial malloc library implementation

You are encouraged to improve the serial implementation of the malloc library. The simple implementation in mm-implicit functions correctly, but it is slow, and its utilization could be improved. We have provided infrastructure for you to improve and test the serial malloc library.

The serial implementation can be tested by running the trace-based driver program mdriver. The mdriver program tests your allocator.cpp package for correctness, space utilization, and throughput. Be sure to test that your serial implementation works correctly before you try to make it thread safe; otherwise you will be in debugging hell for the rest of the week.

The driver program is controlled by a trace file. Each trace file contains a sequence of allocate, reallocate, and free commands that instruct the driver to call your allocator::malloc, allocator::realloc, and allocator::free routines in some sequence.

The driver mdriver accepts the following command line arguments:

-t <tracedir>: Look for the default trace files in directory <tracedir> instead of the default directory (../traces).
-f <tracefile>: Use one particular tracefile for testing instead of the default set of tracefiles.
-h: Print a summary of the command line arguments.
-l: Run and measure libc malloc in addition to the student's allocator::malloc package. Note that there is no utilization measure for the libc runs.
-v: Verbose output. Print a performance breakdown for each tracefile in a compact table.
-V: More verbose output. Prints additional diagnostic information as each trace file is processed. Useful during debugging for determining which trace file is causing your malloc package to fail.
-c: Invoke the allocator::check method after each call to the malloc library. This is extremely useful for debugging. Right now it simply calls the mm_checkheap function in mm-implicit.c, which checks the heap validity based on how mm-implicit.c manages the heap space.
As you modify the malloc library, you should customize allocator::check so that it checks the validity of your own heap layout.
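
As an illustration of the kind of invariants a customized allocator::check might verify, the sketch below walks a sequence of blocks that each carry a header and a footer. The layout is an assumption made for this example (4-byte header and footer words holding the block size with the low bit as the allocated flag, and no prologue or epilogue blocks); it is not taken from mm-implicit.c, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

namespace check_sketch {

// Assumed encoding: block size (a multiple of 8) with the low bit as the allocated flag.
inline std::uint32_t pack(std::uint32_t size, bool alloc) { return size | (alloc ? 1u : 0u); }
inline std::uint32_t block_size(std::uint32_t word) { return word & ~7u; }
inline bool is_alloc(std::uint32_t word) { return (word & 1u) != 0; }

inline std::uint32_t load_word(const unsigned char* p) {
    std::uint32_t w;
    std::memcpy(&w, p, sizeof w);
    return w;
}
inline void store_word(unsigned char* p, std::uint32_t w) { std::memcpy(p, &w, sizeof w); }

// Writes a matching header and footer for a block of `size` bytes starting at p.
void write_block(unsigned char* p, std::uint32_t size, bool alloc) {
    assert(size >= 8 && size % 8 == 0);
    store_word(p, pack(size, alloc));
    store_word(p + size - 4, pack(size, alloc));
}

// Returns true if [lo, lo + len) is a well-formed sequence of blocks.
bool check_heap(const unsigned char* lo, std::size_t len) {
    std::size_t off = 0;
    bool prev_free = false;
    while (off < len) {
        if (len - off < 8) return false;                         // no room for header + footer
        std::uint32_t hdr = load_word(lo + off);
        std::uint32_t size = block_size(hdr);
        if (size < 8 || size > len - off) return false;          // block must fit in the heap
        if (load_word(lo + off + size - 4) != hdr) return false; // footer must mirror header
        bool free_now = !is_alloc(hdr);
        if (free_now && prev_free) return false;                 // adjacent free blocks: missed coalescing
        prev_free = free_now;
        off += size;
    }
    return true;
}

}  // namespace check_sketch
```

A checker for a different design (for example, explicit or segregated free lists) would add its own invariants, such as every free block appearing on exactly one list.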

Making the malloc library thread safe

Even though the malloc library in mm-implicit is a fine starting point, you need to think about how to restructure the library so that you can make it thread safe. In particular, mm-implicit uses static variables liberally, and you probably wouldn't want that. It is best to organize the code so that you can easily encapsulate the variables for each thread-local heap; anything that's not thread-local will need to be protected via some form of synchronization. If you add any new *.c or *.cpp files, be sure to edit your Makefile to reflect that (add the corresponding *.o files under OBJS, or the benchmarks won't compile).

For testing the thread-safe version of your malloc library, we will use the following additional files:

benchmarks/ contains benchmark programs.
wrapper.[cpp,h] contains wrapper functions for building benchmark programs. You should not modify these files.
validate.py validates the correctness and computes a utilization score of an allocator for multi-threaded programs. You should not modify this file.

To build the testing benchmarks, run make benchmark. This command builds 3 versions of each benchmark program. For example, benchmarks/cache-thrash.cpp generates cache-thrash, cache-thrash-validate and cache-thrash-libc. cache-thrash uses your allocator to allocate memory, while cache-thrash-libc uses a standard allocator. cache-thrash-validate will be used with the script validate.py to test the correctness of your allocator. You should not use it for performance testing.

Benchmarks for testing the thread-safe library

We are using concurrent benchmarks from the paper Hoard: A Scalable Memory Allocator for Multithreaded Applications. [1]

cache-thrash tests resilience against active false sharing. Active false sharing occurs when malloc satisfies memory requests by different threads from the same cache line.
Parameters: <threads> <inner-loop> <object-size> <iterations>
./cache-thrash P 100 8 1000000

cache-scratch tests resilience against passive false sharing. Passive false sharing occurs when free allows a future malloc to produce false sharing.
Parameters: <threads> <inner-loop> <object-size> <iterations>
./cache-scratch P 100 8 1000000

larson simulates a server: each thread allocates and deallocates objects, and then transfers some objects (randomly selected) to other threads to be freed. This producer-consumer pattern can lead to blowup in the allocator, where some threads keep allocating bigger and bigger heaps.
Parameters: <seconds> <min-obj-size> <max-obj-size> <objects> <iterations> <rng seed> <num-threads>
./larson 10 7 8 1000 10000 RAND P

linux-scalability tests allocator throughput.
Parameters: <object-size> <iterations> <number-of-threads>
./linux-scalability 8 10000000 P

You can also write your own benchmarks. The program should include ../wrapper.cpp, call end_thread() at the end of each thread, and call end_program() at the end of the program. Then, you can add this new program to the Makefile. Although we will not run your benchmarks to evaluate your program, they are useful for your own regression testing.

Correctness tests

You can use validate.py to test the correctness of the allocator:

./validate.py ./cache-scratch-validate 12 100 8 100000

This will print out VALIDATION SUCCESS if there is no error in the program. Note that you need to use a *-validate binary. validate.py runs the program and logs all memory operations to multiple files in tmp/. After that, it reads all the logs, verifies that all memory operations are legal, and calculates a space utilization score.

There is also a script testbenchmarks.sh already in your directory. It shows a few parameters that you can use for testing your program. During grading, we will run your program with slightly different parameters. Note that right now your program succeeds in some of the runs specified in testbenchmarks.sh, either because they are configured to run with a single thread or because the thread interleaving doesn't happen to trigger an error. More likely, the runs with multiple threads will fail with a validation error or a segmentation fault.
This should not occur once you have a correctly implemented thread-safe library.

Hints and tips

Use __thread for thread-local storage.

Use the volatile keyword for variables whose values can change on their own, that is, outside the control of the code that reads them (for example, when another thread writes them). Note that volatile is not a memory fence; it's a compiler hint telling the compiler that it should always fetch the value from memory instead of keeping it in a register. Since we haven't covered memory fences and the hardware memory model, you probably should not write code that requires memory fences, unless you know what you are doing.

Your allocator should allocate memory such that programs run fast. You should think about how the programs access the memory, not just how to allocate memory as fast as possible. Keep in mind that TLB misses can also affect the performance of the program.

Rules and reminders

You should not change any of the sources in the distribution except for the Makefile, allocator.cpp, validator.h, allocator_interface.h and bad_allocator.cpp. You are free to add new files and update the Makefile appropriately if you wish. All of the other files will be overwritten with fresh copies during grading.

You should not invoke any memory-management related library calls or system calls. This rules out the use of malloc, calloc, free, realloc, sbrk, brk, mmap or any variants of these calls in your code.

Usage of C++ Standard Template Library containers is NOT allowed (since they use heap memory internally, bypassing our allocator heap interface). All data structures that allocate memory on the heap MUST use our allocator heap interface. Hence, all heap memory used by your data structures will be counted under space utilization.

The total size of all defined global and static scalar variables and compound data structures should be small, ideally not exceeding 256 bytes per thread, but we will not put a strict upper bound on it. The spirit is that you should also consider space overhead, and any space you use for bookkeeping is counted towards the space utilization.

Evaluation

You will receive zero points if you break any of the rules or your code is buggy / won't compile.
Otherwise, your grade will be calculated based on both correctness and performance (utilization and speed). The library should work correctly for both serial and parallel code (as evaluated by invoking mdriver and the parallel benchmarks).
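
As a point of reference for the thread-safety requirement, the simplest correct design wraps a serial allocator in one global lock. The sketch below shows that pattern under stated assumptions: serial_malloc and serial_free are hypothetical stand-ins for a serial library (here a toy bump allocator that never reuses memory), and std::mutex is used for brevity; whether the course expects pthread mutexes or another primitive is not stated in this handout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

namespace lock_sketch {

// Hypothetical serial allocator: a bump pointer over a static arena.
// On its own it is NOT thread safe, because `used` is shared mutable state.
alignas(8) static unsigned char arena[1 << 16];
static std::size_t used = 0;

static void* serial_malloc(std::size_t size) {
    if (size == 0 || size > sizeof(arena)) return nullptr;
    std::size_t need = (size + 7) & ~static_cast<std::size_t>(7);  // round up to 8 bytes
    if (need > sizeof(arena) - used) return nullptr;
    assert(used % 8 == 0);
    void* p = arena + used;
    used += need;
    return p;
}

static void serial_free(void* ptr) { (void)ptr; }  // toy: freed memory is never reused

// One global lock guards every entry point into the serial allocator.
static std::mutex heap_lock;

void* locked_malloc(std::size_t size) {
    std::lock_guard<std::mutex> guard(heap_lock);  // released automatically on return
    return serial_malloc(size);
}

void locked_free(void* ptr) {
    if (ptr == nullptr) return;                    // freeing NULL has no effect
    std::lock_guard<std::mutex> guard(heap_lock);
    serial_free(ptr);
}

}  // namespace lock_sketch
```

With a single lock that is never held while acquiring another, deadlock cannot arise from lock ordering, but every allocation serializes on that lock, which is why per-thread heaps are worth considering for throughput.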

You will get partial credit if you have a correct working version of a thread-safe malloc library that simply uses the mm-implicit implementation protected by locks. The more efficient your implementation is (in both utilization and throughput), the more credit you will get. That said, since this project is mainly about a thread-safe malloc library, the first goal should be to make the malloc library thread safe. Once that's done, go back and optimize the baseline malloc library that you use to implement the per-thread heap.

Writeup

In your write-up, which you will turn in by committing a pdf in your svn repo, please include:

1. a description of your strategy for making the library thread safe, including a justification of why it correctly synchronizes accesses from multiple threads and why it is deadlock free.
2. a description of any changes you made to the serial base code to speed it up (or a description of the new memory allocator if you re-implemented the serial version).
3. a description of any optimizations that you made to speed up the parallel malloc library implementation.
4. a description of how you divided up the work, if you worked in a group of two.

Note that even if you work in a group of two, you still need to turn in your own writeup.

References

[1] Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson. Hoard: A Scalable Memory Allocator for Multithreaded Applications. ACM SIGPLAN Notices, 35(11):117-128, Nov. 2000.