Conservative Garbage Collection on Distributed Shared Memory Systems


Weimin Yu and Alan Cox
Department of Computer Science, Rice University, Houston, TX

This research was supported in part by the National Science Foundation under an NYI Award and by the Texas Advanced Technology Program and Tech-Sym Inc.

Abstract

In this paper we present the design and implementation of a conservative garbage collection algorithm for distributed shared memory (DSM) applications that use weakly typed languages like C or C++, and we evaluate its performance. In the absence of language support to identify references, our algorithm constructs a conservative approximation of the set of cross-node references based on local information only. It is also designed to tolerate memory inconsistency on DSM systems that use relaxed consistency protocols. These techniques enable every node to perform garbage collections without communicating with the others, effectively avoiding the high cost of cross-node communication in networks of workstations. We measured the performance of our garbage collector against explicit programmer management using three application programs. In two out of the three programs the performance of the GC version is within 15% of the explicit version. The results show that the garbage collector has two effects on application programs: on one hand, it tends to reduce memory locality, increasing the communication cost; on the other hand, it may eliminate synchronization and memory accesses that would be incurred if memory were managed by the programmer, reducing the communication cost.

1. Introduction

Over the last decade, both distributed garbage collection and distributed shared memory have become increasingly active areas of research [21, 17]. Despite the activity in these areas individually, their intersection has received relatively little attention. Furthermore, none of the published work that we are aware of [18, 14, 11] has measured the performance of an implementation on any application programs. Neither has it addressed the design of a garbage collector for weakly typed languages such as C and C++. In this paper, we present the design and implementation of a conservative garbage collection algorithm for distributed shared memory systems, and we evaluate its performance on a collection of application programs.

Distributed shared memory (DSM) and garbage collection (GC) are motivated by the same desire: to simplify the programmer's task by handling some of the low-level program details automatically in the run-time system. A DSM system handles the communication of data between machines, eliminating the need for the programmer to write message-passing code. Roughly speaking, a DSM system enables processes on different machines to share virtual memory, even though no physical memory is shared by the machines [15]. It is widely accepted that it is easier to program with shared memory than with message passing: instead of sending and receiving messages explicitly, programs can use ordinary loads and stores to access shared data. This enables programmers to concentrate on algorithmic issues rather than on managing partitioned data sets and communicating values. A GC system handles the memory management, eliminating the need for the programmer to write code to track the status of allocated memory, for example, reference counting to determine whether memory can be freed.
Conservative GC is a technique that does not require any support from the language implementation, enabling the use of GC with programs written in weakly typed languages like C or C++. Several conservative garbage collection algorithms have been implemented in the past few years [9, 4, 3, 8]. Zorn [22] compared the Boehm-Weiser algorithm [8] with a few explicit management algorithms used to implement malloc() and concluded that conservative garbage collection is a viable alternative to explicit memory management for many programs.

In contrast to shared-memory multiprocessors, interprocessor communication is quite expensive on general-purpose networks of workstations. It is therefore essential to minimize the amount of data movement and especially the number of messages used to implement garbage collection. In contrast to garbage collection algorithms designed for shared-memory multiprocessors, our algorithm avoids one-at-a-time references to non-local (uncached) data that could generate a message exchange per access. Instead, it aggregates these references and piggybacks them onto messages used by the DSM system. To further minimize communication, our algorithm allows the collection of most garbage without global synchronization. This entails knowing when references are communicated to other nodes. With a weakly typed language like C or C++, it isn't obvious when references are communicated. Therefore, our algorithm constructs a conservative approximation of the set of cross-node references. In addition, our algorithm is designed to cope with the fact that in high-performance DSM systems updates to shared data are not visible simultaneously at every node. Instead of requiring global synchronization to bring the nodes up to date, our algorithm is designed to tolerate memory inconsistency.

Our garbage collector has been implemented on the TreadMarks DSM system [13]. TreadMarks is a high-performance DSM system that runs on standard workstations connected by general-purpose networks. It uses the lazy release consistency algorithm [12] and a multiple-writer protocol [10] to minimize the number of messages and the amount of data communicated, resulting in good performance on a large class of applications [16]. Using our garbage collector, two out of the three application programs used in this study performed within 15% of explicit memory management by the programmer.

This paper is organized as follows. Section 2 describes our conservative garbage collection algorithm, and Section 3 describes its implementation. Section 4 presents a performance evaluation based on a small set of application programs. Section 5 examines related work. Finally, Section 6 offers our concluding remarks.

2. Design

Most modern garbage collectors work by starting from a root set of memory objects and following references from these objects to other objects recursively, until all objects reachable from the roots have been found. Inaccessible objects are garbage and can be reclaimed. There are two classes of collectors: copying collectors, which copy accessible objects to another part of the address space and reclaim the entire old region; and mark-and-sweep collectors, which mark all accessible objects, then scan the heap and reclaim unmarked objects. To tell references from data, most garbage collectors depend on some language support; at a minimum, tags are maintained for each object's type. Conservative garbage collection is a technique that does not require such cooperation and can work with weakly typed languages. It identifies a superset of the true references by treating every word of a memory object as if it contains a reference.

In DSM systems, an application's object graph can be large and widely distributed among the nodes. Therefore, it is very expensive to collect all objects at the same time. In our algorithm each node can independently collect its own objects.
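As a rough illustration of conservative scanning (not the authors' actual code, which is based on the Boehm-Weiser collector [8]), the following C sketch treats every word of a candidate region as a potential reference. The heap bounds and the helper names find_header() and mark_object() are hypothetical.

#include <stdint.h>

typedef struct object_header object_header;

extern uintptr_t heap_lo, heap_hi;                 /* bounds of the shared heap */
extern object_header *find_header(uintptr_t addr); /* NULL if addr is not inside a valid object */
extern void mark_object(object_header *h);         /* mark and queue for tracing */

/* Treat every word in [start, end) as if it might contain a reference. */
void scan_conservatively(const uintptr_t *start, const uintptr_t *end)
{
    for (const uintptr_t *p = start; p < end; p++) {
        uintptr_t word = *p;
        if (word < heap_lo || word >= heap_hi)
            continue;                 /* cannot point into the heap */
        object_header *h = find_header(word);
        if (h != NULL)
            mark_object(h);           /* anything that looks like a reference
                                         conservatively keeps its object alive */
    }
}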
To keep the amount of communication small, no extra messages are sent: all GC data exchanged between the nodes is piggybacked on messages required by the execution of the application program. To allow a node to collect independently without sending extra messages, we must solve three problems. First, an object owned by one node may be referenced by another node. The collecting node must identify these objects so that they will not be collected while a remote reference persists. In many strongly typed languages this is not a problem because every assignment of references can be detected and examined. In C, however, such detection is impossible. Second, we need to collect the remotely referenced objects once they are no longer used by other nodes. Third, because of race conditions or the delay in updates to shared data reaching every node, a collecting node may miss references it should see. We must solve this problem without synchronization and communication. In the rest of this section, we present our solutions to these problems.

2.1. Remote reference detection

Neither our target languages nor the DSM abstraction can alert us every time a reference created by one node is passed to another node. However, any remote reference must have been communicated in some message. Therefore, if the DSM system makes the contents of the messages available to the garbage collector, the garbage collector will know all objects that may potentially be referenced remotely and avoid reclaiming them. Some of the references in the messages may be passed because of false sharing and never actually be used by the receiving node; others may not be references at all, just bit patterns that look like references. To be safe, we must assume that everything in a message is a reference and is used by the receiving node.
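The same conservative word scan can be applied to outgoing DSM messages. The sketch below uses hypothetical helper names; how local and imported references are actually recorded, with weights, is detailed in Section 3.2.

#include <stdint.h>
#include <stddef.h>

extern int  is_shared_heap_address(uintptr_t word); /* inside some shared pool? */
extern int  is_local_object(uintptr_t word);        /* owned by this node? */
extern void export_table_note(uintptr_t word);      /* record a possible export */
extern void import_table_split(uintptr_t word);     /* an imported ref is forwarded */

/* Scan the user data of an outgoing message for possible references. */
void scan_outgoing_message(const uintptr_t *msg, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++) {
        uintptr_t word = msg[i];
        if (!is_shared_heap_address(word))
            continue;                   /* ordinary data, not a reference */
        if (is_local_object(word))
            export_table_note(word);    /* object may now be referenced remotely */
        else
            import_table_split(word);   /* forwarding an imported reference
                                           to a third node (see Section 2.2) */
    }
}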

2.2. Reclaiming exported references

If a node finds that a remote reference is no longer used locally, it is easy to notify the reference's owner: this information, which we call a nack, can be piggybacked on a message to the owner node. The problem is for the owner to determine whether all other nodes have dropped their references. Assume node N1 exports obj to nodes N2 and N3. When N2 no longer uses obj, it sends a nack to N1. Then N3 passes obj to N2 and removes its own reference to obj. Even though N1 has received nacks from both nodes, the owner, N1, must recognize that there is still a valid reference to obj.

A technique called weighted reference counting [19, 20, 5] solves this problem. It works as follows. A reference is assigned a predetermined weight when it is first exported by its owner. Whenever a reference is duplicated across a node boundary, the weight of the reference is equally divided between the local reference and the new remote reference, so the sum of the weights remains constant. When a reference is no longer used and is sent back to its owner, its weight is also returned. When the sum of the returned weights equals the original weight assigned at the reference's creation, the owner node is sure that no one needs the reference.

The weight of a reference may reach one due to repeated export. When such a reference is being exported, its owner should be asked to increase its weight. We call this situation a weight underflow. However, to make our implementation easier, we assign zero to the weights of both copies. This way the reference can only be reclaimed by the infrequent global phase of our collection algorithm, but we avoid the communication cost.
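A minimal sketch of the sending side of this scheme follows, using the two-field table entries described in Section 3.1. The default weight, the table functions, and the power-of-two weight choice are assumptions made for illustration.

typedef struct {
    void    *obj;     /* the object reference */
    unsigned weight;  /* its current weight */
} entry;

extern entry *export_table_lookup(void *obj);
extern void   export_table_insert(void *obj, unsigned weight);

#define DEFAULT_WEIGHT 64u   /* a power of two, so repeated halving stays exact */

/* Owner side: a reference to a local object leaves this node in a message. */
void export_reference(void *obj)
{
    entry *e = export_table_lookup(obj);
    if (e == NULL)
        export_table_insert(obj, DEFAULT_WEIGHT);
    else
        e->weight += DEFAULT_WEIGHT;   /* re-export adds another weight share */
}

/* Non-owner side: an imported reference is forwarded to a third node.
 * Returns the weight to carry in the outgoing message. */
unsigned forward_reference(entry *e)
{
    if (e->weight == 1) {
        e->weight = 0;   /* weight underflow: both copies get weight zero, so the
                            object is reclaimable only by the global collection */
        return 0;
    }
    e->weight /= 2;      /* split the weight equally between the two copies */
    return e->weight;
}

The invariant is that, outside of underflow, the weights held by all copies of a reference always sum to the total weight the owner has recorded, so the owner can reclaim its bookkeeping exactly when that amount has been returned.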
2.3. Synchronization and consistency

In DSM systems, a race condition may occur if a collecting node traces an object that another node is updating. If the update doesn't reach the collecting node before it starts garbage collection, the local copy of the object may be inconsistent. As a result, the collector may miss references it should see. Figure 1 gives an example. Object O1 is cached on both N1 and N2 and is in an inconsistent state. The only reference to O2 is assigned to O1 by N2, but N1 has not seen this reference. If N2 removes the reference to O1 from its local roots before the reference to O2 is sent to N1, O2 may be mistakenly collected by N2.

This problem can be solved without communication. There are two situations to consider. First, O1 is owned by N2. Since N1 also has a reference to O1, N2 knows that O1 is remotely referenced. Since remotely referenced objects are treated like local roots, O2 will not be collected in this case. Second, O1 is not owned by N2. For N2 to be able to access O1, it must have imported O1 from its owner. If N2 remembers all imported objects and traces them in its collections, O2 will not be reclaimed either. Therefore a collecting node is not required to update the objects it is tracing.

Care must be taken when this is combined with the reclamation of exported references discussed in Section 2.2. Take the scenario in Figure 1: N1 will not see the reference to O2 until the application program requires it to update its copy of O1. If N2 deletes its reference to O1, it will remove O1 from its imported object list in the next collection, and O2 will be reclaimed during the collection after that. Our solution is to remember objects like O1 in a depart table. If an imported object is no longer used, and the changes made to it have not been seen by its owner node, the object is put in the depart table. Objects in the depart table are treated like local roots. When the changes made to an object in the depart table are retrieved by its owner node, the object is removed from the table and a nack is sent to its owner.

To summarize, our algorithm allows each node to independently perform local garbage collections, and no messages are needed beyond those required by the execution of the application program.

2.4. Limitations

The garbage collection algorithm discussed above, which we will call the local collection algorithm, cannot reclaim circular structures, nor can it reclaim the objects lost in the case of weight underflow. To make up for this limitation, we implemented a global collection algorithm in which every node in the system suspends its computation and takes part in the garbage collection. Global collection is only used as a last resort, when a node cannot collect enough memory via local collection and the export table grows over a predefined threshold.

3. Implementation

Our implementation is based on a sequential conservative garbage collector by Boehm and Weiser [8]. The shared memory heap provided by TreadMarks is divided into several pools. Each node is responsible for one pool. Memory requests are satisfied from the local pool, and an object is owned by its allocator. There is also a global (free) pool from which processes can allocate more memory when necessary.

Figure 1. Tracing inconsistent objects.
Figure 2. Cycles of references.

In the text that follows, we refer to shared objects allocated from a node's pool as local objects of that node, and to other shared objects as remote objects of that node. Whether an object is local or remote is decided by its ownership. A remote object may be cached in the local memory of a node and thus accessed without ongoing communication cost.

3.1. Data structures

Each node has an object header table, an import table, an export table, a depart table, and one nack buffer for every other node. An object header is maintained for every object allocated by a node. Potential references are checked against this structure to see if they are valid. Information about an object, such as its size, can also be found in this structure. The import table is the set of remote objects that are referenced locally. The export table is the set of local objects that are referenced by other nodes. The depart table holds the imported objects that are no longer referenced locally but whose local changes have not been propagated to their owner nodes. All three tables are implemented as hash tables, in which each entry has two fields: the object reference and its weight. The nack buffers hold the imported references that will be sent back to their owners.

3.2. The local collection algorithm

Message handling. Every message that contains user data is scanned before it is sent. If a reference to a local object is found, and it is not already in the export table, that reference is inserted into the export table with a default weight; otherwise its weight is incremented by the default amount. If an imported reference is found, its weight in the import table is halved and the two copies are both assigned the new weight. After the message is scanned, the references and their weights are appended to the message's end, and the message is sent out. When a node receives a message, it checks the appended sequence of reference/weight pairs. The references are inserted into the import table with their weights, or have their weights incremented if they are already there.

After a local collection, a node may find that some imported references are no longer used locally. At this time these references are removed from the import table. A removed reference, with its weight, is put into the depart table if the changes made to the referenced object by the local node have not yet been retrieved by its owner node; otherwise the reference is put into the nack buffer for its owner node. The references in the depart table are also put into the nack buffers when changes to the referenced objects have been sent to their owner nodes. When a message is sent to a remote node, the contents of the corresponding nack buffer are appended to the end of the message. The receiver checks the nack against its export table. For each exported reference that also appears in the nack, its weight is decremented by the amount shown in the nack. An exported reference is removed from the export table if its weight reaches zero.
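The receive side of this message handling might look like the following sketch; the pair layout and the helper names are assumptions, and the send side corresponds to the scans shown in Sections 2.1 and 2.2.

#include <stddef.h>

typedef struct {
    void    *obj;     /* the reference */
    unsigned weight;  /* the weight carried with it */
} ref_weight;

extern void     import_table_add(void *obj, unsigned weight);      /* insert or add weight */
extern unsigned export_table_subtract(void *obj, unsigned weight); /* returns new weight */
extern void     export_table_remove(void *obj);

/* Apply the reference/weight pairs appended to an incoming message. */
void receive_appended_refs(const ref_weight *pairs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        import_table_add(pairs[i].obj, pairs[i].weight);
}

/* Apply a piggybacked nack against the export table. */
void receive_nack(const ref_weight *nacks, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned left = export_table_subtract(nacks[i].obj, nacks[i].weight);
        if (left == 0)
            export_table_remove(nacks[i].obj); /* all weight returned: no
                                                  remote references remain */
    }
}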
The collection. A collection works as follows:

1. Objects reachable from the local roots are recursively marked and traced. Local roots include registers, stack cells, and global variables.

2. Objects reachable from any references in the export table or the depart table are recursively marked and traced.

3. After the first two steps are done, we start from the import table and look for imported references that are not marked. These references are not used locally and need not be marked, but the local references they contain must be recursively marked and traced. The reason for this was explained in Section 2.3.

4. The collector sweeps through the local pool and reclaims all local objects that are not marked.

5. The imported references that are not marked are either put in the depart table or in a nack buffer. They are handled as described under Message handling.
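These five steps condense to the sketch below. The table iterators and helpers are hypothetical names, and the save-the-next-entry detail in step 5 is an assumption about how removal during iteration might be handled.

#include <stddef.h>

typedef struct { void *obj; unsigned weight; } entry;  /* as in Section 3.1 */

extern void  *export_table, *import_table, *depart_table;   /* hash tables */
extern entry *table_first(void *table);
extern entry *table_next(entry *e);
extern void   mark_from_local_roots(void);     /* registers, stack, globals */
extern void   mark_and_trace(void *obj);       /* mark obj and trace its contents */
extern void   trace_contents_only(void *obj);  /* trace contents, leave obj unmarked */
extern int    is_marked(void *obj);
extern void   sweep_local_pool(void);
extern int    changes_not_yet_retrieved(void *obj);  /* owner hasn't seen our changes? */
extern int    owner_of(void *obj);
extern void   depart_table_insert(void *obj, unsigned weight);
extern void   nack_buffer_append(int node, void *obj, unsigned weight);
extern void   import_table_remove(entry *e);

void local_collection(void)
{
    /* Step 1: mark everything reachable from the local roots. */
    mark_from_local_roots();

    /* Step 2: exported and departed objects may be referenced remotely,
     * so they are traced like roots. */
    for (entry *e = table_first(export_table); e; e = table_next(e))
        mark_and_trace(e->obj);
    for (entry *e = table_first(depart_table); e; e = table_next(e))
        mark_and_trace(e->obj);

    /* Step 3: unmarked imported objects are not used locally, but the local
     * references inside them must still be traced (Section 2.3). */
    for (entry *e = table_first(import_table); e; e = table_next(e))
        if (!is_marked(e->obj))
            trace_contents_only(e->obj);

    /* Step 4: reclaim unmarked local objects. */
    sweep_local_pool();

    /* Step 5: retire imported references that stayed unmarked. */
    for (entry *e = table_first(import_table); e != NULL; ) {
        entry *next = table_next(e);           /* saved before possible removal */
        if (!is_marked(e->obj)) {
            if (changes_not_yet_retrieved(e->obj))
                depart_table_insert(e->obj, e->weight);
            else
                nack_buffer_append(owner_of(e->obj), e->obj, e->weight);
            import_table_remove(e);
        }
        e = next;
    }
}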

3.3. The global collection algorithm

A global collection is invoked only as a last resort, when a node cannot collect enough memory by local collections and the export table grows over a threshold. Every other node in the system is interrupted to participate. A global collection consists of several marking phases followed by one sweeping phase. At the beginning, each node starts marking from its local roots, including registers, stack, and global variables. The import and export tables are not included in the local roots for global collection. A node only traces local references; remote references are not traced, but they are recorded. At the end of a marking phase, the nodes synchronize and exchange the remote references they have recorded. In the next phase they start tracing again from the references they just received. This continues until no unmarked remote references are found on any node. Then each node sweeps through its own pools and reclaims the garbage.

At the end of a global collection, the import and export tables are reconstituted, the nack buffers are cleared, and the depart table is not affected. To reconstitute the import and export tables, a node must remember all remote references it has sent and all local references it has received. A remote reference found during a marking phase is put in the import table with the default weight. Each local reference received from another node adds to that object's entry in the export table. An object's final weight is the default weight multiplied by the number of nodes that hold the reference.
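The phase structure might look like the following sketch. The exchange primitive, list type, and termination test are hypothetical; in particular, how the nodes agree that no unmarked remote references remain anywhere is shown here as a single collective operation.

#include <stddef.h>

typedef struct { void **refs; size_t n; } ref_list;

extern void     reset_marks(void);
extern void     mark_from_local_roots(void);        /* tables are NOT roots here */
extern ref_list trace_local_recording_remote(void); /* trace local refs only;
                                                       record remote refs found */
extern ref_list exchange_refs(ref_list out);        /* barrier + deliver refs
                                                       to their owner nodes */
extern int      none_found_anywhere(ref_list in);   /* collective termination test */
extern void     mark_and_trace(void *obj);
extern void     sweep_own_pools(void);
extern void     rebuild_tables_and_clear_nacks(void); /* depart table untouched */

void global_collection(void)
{
    reset_marks();
    mark_from_local_roots();

    for (;;) {
        /* Trace local references; remote references are only recorded. */
        ref_list found = trace_local_recording_remote();

        /* Synchronize with the other nodes and swap the recorded references. */
        ref_list incoming = exchange_refs(found);
        if (none_found_anywhere(incoming))
            break;                        /* no unmarked remote refs anywhere */

        /* The next marking phase starts from the references just received. */
        for (size_t i = 0; i < incoming.n; i++)
            mark_and_trace(incoming.refs[i]);
    }

    sweep_own_pools();
    rebuild_tables_and_clear_nacks();
}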
4. Evaluation

We have measured the performance of our garbage collector with three applications: Othello, MIP, and Pcfrac. For each application, a version using the garbage collector (the GC version) was compared to a version using explicit memory management (the Exp version). All measurements were taken on a cluster of eight SPARCstation-20 Model 61 workstations connected by a 10Mbit/second Ethernet. This section starts with a brief description of the test programs, then presents the results, and concludes with a summary and a discussion of potential improvements.

4.1. The applications

Othello is a parallel program that performs game tree search to play the game Othello. At the beginning, the master process takes the root task, creates a number of derived tasks, and puts them in a shared queue. Then each process repeatedly takes a task from the queue and performs a search on the subtree rooted at that task. The computed score of one subtree can be used as the cutoff value in subsequent computations. This program allocates a lot of objects in shared memory, but in the end most of the objects are accessed by only one process.

MIP solves the Mixed Integer Programming problem [6], a form of linear programming in which many of the variables are restricted to integer values. It uses branch and bound to find the optimal solution to the problem. Nodes in the search space are kept in a doubly linked queue. Each process takes a node from this queue, performs some computation, perhaps generating new nodes, and puts these new nodes back into the queue. For each node, the computation involves relaxing the integer restrictions on the variables and solving the corresponding linear program to determine whether a better solution than the current best is possible below that node. This is repeated until the solution is found. This application allocates relatively few objects, but most of them are shared.

Pcfrac is a naive parallelization of a large-number factoring program called cfrac [23]. The main data structures in Pcfrac include a task array and a result array. It works as follows:

1. The master process generates some tasks and puts them in the task array while the other processes wait.

2. Each process takes an equal share of the tasks and performs the computation. Interesting results are put in the result array.

3. After everyone is done, one of the slave processes collects the results in the result array and does more computation, then goes back to step 1; the other processes go directly back to step 1.

This procedure is repeated until the problem is solved. The Exp version of Pcfrac uses a complicated reference counting scheme for memory management.

4.2. Running time

The normalized running times of the applications are shown in Table 1, with the running time of the sequential Exp version taken as 1.00. Table 2 presents the average time each node spends on garbage collection.

Table 1. Normalized running time.

Table 2. GC time per node (in sec.) and its ratio to running time.

The GC versions of Othello and MIP performed well. In Othello the small difference between the GC version and the Exp version can be explained by the garbage collection cost. MIP does not allocate as much memory as the other programs, so there is little difference between its two versions. This shows that our collector does little harm when it is lightly used. In Pcfrac, the difference between the two versions is significant, and it cannot be explained by the garbage collection cost alone. The problem here is the poor spatial locality caused by the garbage collector.

For each task generated by the master, and each result computed by the slaves, quite a few objects were allocated as temporary variables that became immediately useless. In the Exp version, these objects are immediately reclaimed and reused, so the tasks and results are packed together tightly. In the GC version these temporary objects are not reused until after the next collection, so the tasks and results are mingled with garbage. When a node retrieves its tasks or collects the results, it accesses more pages than the Exp version does. For example, with 8 nodes, it takes 93 seconds to collect the results in the GC version, and 22 megabytes of data are transferred for this purpose. In the Exp version, the numbers are 8.7 seconds and 4.4 megabytes, respectively.

The time spent on garbage collection in all of the applications is small, ranging from almost zero in MIP to about 10% in Othello at 8 nodes. The GC cost follows the same pattern in all three applications: there is a jump in the cost when the program goes parallel, then the average time spent by each node holds steady. The collection cost is higher in the parallel executions because many objects are checked twice: the objects listed in the export table, or those contained in imported objects, are often reachable from the local roots as well. In the sequential execution, the export table is empty and there are no imported objects. This helps to explain the increases in Pcfrac and Othello when the number of nodes increased from one to two.

The handling of imported objects in the current implementation is not very efficient either. In some cases we are not sure about the exact size of an imported object, so the whole page it is in gets scanned. This makes up the major collection cost in MIP. The problem can be solved by using a more sophisticated protocol to keep track of the sizes of imported objects.

Another cost comes from the message handling by the garbage collector. Our collector checks and modifies all the messages that contain data. We do not present this cost here because the actual cost measured in these applications was negligible: it was almost zero in Othello and less than one second in the other two.

4.3. Communication costs

Table 3 shows the number of messages sent between the nodes during the execution of each program. Garbage collection can affect the amount of communication in different ways depending on the memory usage pattern. It may increase the amount of communication because it causes poor spatial locality. On the other hand, it may decrease the communication cost by eliminating shared accesses associated with free space management, for example, updates to a reference count.

In Othello, the message count in the GC version is greater than the count in the Exp version. This is because most objects are privately held and there is little false sharing. The Exp version does not incur much cost from accesses to shared objects or from free list management, so there is not much that the GC version can save to offset the locality cost. MIP is just the opposite: poor locality is not a big problem due to the small number of objects, but there is a lot of sharing among the nodes, so any reduction in the number of writes to shared objects is beneficial. The result is a large reduction in the number of messages. In Pcfrac both of these factors exist.

Table 3. Message count.

In Pcfrac the turning point is at 4 nodes. Below that, the reduction in the number of writes outweighs the effect of poor locality of reference, and we see a reduction in the number of messages in the GC version; beyond that, we see the opposite. The reason is that the more nodes share a page, the more messages are needed to maintain consistency on that page. With a large number of nodes, more pages are used and more messages are needed for each page. At some point this cost can no longer be offset by the savings.

In our algorithm, when a reference is exported in a message, its value and weight are appended to the end of the message. When it is no longer needed on a node, an acknowledgement is appended to some message sent to its owner. This increases the amount of data transmitted. Table 4 shows the amount of appended references against the total amount of data sent in the GC versions. The size of the appended data is small compared with the total data moved in the system, ranging from 5% in Pcfrac to around 20% in MIP.

4.4. Memory usage

Table 5 presents the amount of memory used by each application. The numbers were obtained by adding up the memory allocated by every node. The table shows that for all three applications, our conservative garbage collector requires more memory than programmer management. The increase in memory demand ranges from 80% in Othello to 300% in MIP.

In Othello, the GC version used 80% more space than the Exp version regardless of the number of nodes. This was because we limited the frequency of garbage collections to improve CPU performance, so the heap was often expanded even though there was garbage that could be reclaimed. When we doubled the garbage collection frequency, the space requirement of the GC version dropped to the same as that of the Exp version, while the garbage collection time increased by 100% to 150% over the numbers presented in Table 2. The effect on the overall running time, however, is small, because garbage collections account for only a small percentage of the running time.

In Pcfrac, the GC version required three times as much memory as the Exp version. This ratio did not drop when we increased the collection frequency. The main reason is that the program does not overwrite obsolete references fast enough. For example, the tasks generated by the master node are not overwritten until the beginning of the next round of computation, although each task is useless once the computation on it is finished.

In MIP, the GC version required 50% more memory than the Exp version when running sequentially. That ratio increased to more than 300% with eight nodes. The main reason the ratio increased in parallel executions is that there were circular references spanning several nodes, caused by the doubly linked task queue in MIP. Figure 2 illustrates a scenario in which a cycle is formed. When two adjacent elements t1 and t2 in the queue are owned by different nodes, each of the elements holds the address of the other. This puts them into their owners' export tables, and local collection will not be able to reclaim them. For example, when N1 starts a local collection, t1 will be found and traced since it is in N1's export table. The reference to t2 will be found, so N1 will keep t2 in its import table, and N2 will not be able to remove t2 from its export table.
The nodes must cooperate to reclaim such cycles.

4.5. Summary

The garbage collector can have two kinds of effect on the performance of the application programs. The poor spatial locality increases the number of messages, negatively affecting performance; on the other hand, the elimination of memory accesses for free space management can decrease the number of messages. The net effect depends on the memory usage pattern of the application program. The garbage collector may also increase the space requirements of the applications, for three reasons: to improve CPU performance, the collector may expand the heap rather than collect the garbage; obsolete references are not overwritten fast enough by the program; and circular structures may be formed, which cannot be reclaimed by local collections.

Procs   Othello                  MIP                      Pcfrac
        Append  Total   Ratio    Append  Total   Ratio    Append  Total   Ratio
2       3.8K    52K     7.3%     1.6M    7.5M    21.3%    1.3M    26.2M   5.0%
4       15K     134K    11.2%    4.7M    22.2M   21.2%    2.1M    54.6M   3.8%
6       24K     191K    12.6%    6.8M    35.3M   19.3%    2.9M    60.4M   4.8%
8       37K     334K    11.1%    9.9M    49.3M   20.1%    3.5M    71.8M   4.9%

Table 4. Appended data (bytes).

Table 5. Memory usage (bytes).

4.6. Future work

The problem of poor spatial locality is inherent in the use of mark-and-sweep garbage collectors. To solve this problem, the garbage collector must be able to move objects. However, to move objects around, the garbage collector must be able to distinguish references from data. This means that the free conversion between reference and non-reference types, which is allowed in languages like C, must be forbidden. We will explore whether simple and reasonable restrictions exist that can provide enough information to the garbage collector while not excessively restricting the freedom of the programmer.

5. Related work

Concurrent garbage collection for shared-memory multiprocessors [2, 7] and distributed systems [1] has been an active area of research. We are, however, aware of only three attempts to design garbage collectors for DSM systems. None of them reports on the cost of garbage collection.

Le Sergent and Berthomieu [18] described the extension to a DSM system of a copying collector originally designed for a multiprocessor. Their design entails collecting the entire address space across all nodes at the same time. The garbage collector also locks pages while scanning. It cannot be used with weakly typed languages like C.

Kordale's GC design [14] for DSM is based on the mark-and-sweep technique. The design is very complex and relies on a large amount of auxiliary information.

Ferreira and Shapiro [11] discussed a copying garbage collector for weakly consistent DSM systems. They were the first to point out that garbage collectors can be designed to tolerate memory inconsistency. Their algorithm allows the nodes to collect independently, but extra messages may be needed during the creation of cross-node references and for reclaiming objects with multiple copies. It does not work with weakly typed languages either.

6. Conclusion

In this paper we presented the design and implementation of a conservative garbage collection algorithm for DSM systems and evaluated its performance. Our algorithm allows each node to perform garbage collection without communicating with the other nodes. It is robust against race conditions (due to concurrent accesses by many nodes to the same object) and memory inconsistency (due to relaxed consistency protocols). The two sources of overhead are that each node must check every message that contains data, so that an approximation of the set of cross-node references can be built, and that GC data is appended to some messages. Our measurements show that neither of these overheads significantly affects application performance.

The most detrimental effect of the garbage collector is that it tends to reduce spatial locality. This effect is not always an overwhelming problem. For example, the performance of Othello and MIP using GC is within 15% of explicit programmer management.
Programs most susceptible to this effect are those like Pcfrac, which use many shared objects that are created after allocations of short-lived intermediate variables. Poor spatial locality is inherent in any mark-and-sweep collector. To handle programs like Pcfrac more efficiently, we must look to copying collectors to improve spatial locality. This in turn requires us to restrict the ways the programmer can manipulate references.

We want to develop reasonable restrictions that allow the programmer maximum freedom while enabling the garbage collector to move data. Our garbage collection algorithm was implemented on the TreadMarks DSM system, but it is not limited to TreadMarks. As long as the DSM system makes the contents of the messages available to the garbage collector, our algorithm will work.

References

[1] A. Abdullahi, E. Miranda, and G. Ringwood. Collection schemes for distributed garbage. In International Workshop on Memory Management, September 1992.
[2] A. Appel, J. Ellis, and K. Li. Real-time concurrent collection on stock multiprocessors. In Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, pages 11-20, June 1988.
[3] J. Bartlett. Compacting garbage collection with ambiguous roots. Technical Report 88/2, DEC Western Research Lab, 1988.
[4] J. Bartlett. Mostly-copying garbage collection picks up generations and C++. Technical Report TN-12, DEC Western Research Lab, 1989.
[5] D. I. Bevan. Distributed garbage collection using reference counting. In Parallel Architectures and Languages Europe, Eindhoven, The Netherlands, June 1987. Springer-Verlag Lecture Notes in Computer Science 259.
[6] R. Bixby, W. Cook, A. Cox, and E. Lee. Parallel mixed integer programming. Submitted for publication.
[7] H. Boehm, A. Demers, and S. Shenker. Mostly parallel garbage collection. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation, June 1991.
[8] H. Boehm and M. Weiser. Garbage collection in an uncooperative environment. Software: Practice and Experience, 18(9):807-820, September 1988.
[9] M. Caplinger. A memory allocator with garbage collection for C. In Proceedings of the 1988 Winter USENIX Conference, February 1988.
[10] J. Carter, J. Bennett, and W. Zwaenepoel. Implementation and performance of Munin. In Proceedings of the 13th ACM Symposium on Operating Systems Principles, October 1991.
[11] P. Ferreira and M. Shapiro. Garbage collection and DSM consistency. In Proceedings of the First Symposium on Operating Systems Design and Implementation, 1994.
[12] P. Keleher, A. L. Cox, and W. Zwaenepoel. Lazy release consistency for software distributed shared memory. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 13-21, May 1992.
[13] P. Keleher, S. Dwarkadas, A. Cox, and W. Zwaenepoel. TreadMarks: Distributed shared memory on standard workstations and operating systems. In Proceedings of the 1994 Winter USENIX Conference, January 1994.
[14] R. Kordale, M. Ahamad, and J. Shilling. Distributed/concurrent garbage collection in distributed shared memory systems. In Proceedings of the International Workshop on Object Orientation in Operating Systems, December 1993.
[15] K. Li and P. Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems, 7(4):321-359, November 1989.
[16] H. Lu, S. Dwarkadas, A. Cox, and W. Zwaenepoel. Message passing vs. distributed shared memory on networks of workstations. To appear in Supercomputing '95.
[17] B. Nitzberg and V. Lo. Distributed shared memory: A survey of issues and algorithms. IEEE Computer, 24(8):52-60, August 1991.
[18] T. Le Sergent and B. Berthomieu. Incremental multi-threaded garbage collection on virtually shared memory architectures. In International Workshop on Memory Management, September 1992.
[19] R. Thomas. A dataflow computer with improved asymptotic performance. Technical Report TR-265, MIT Laboratory for Computer Science, 1981.
[20] P. Watson and I. Watson. An efficient garbage collection scheme for parallel computer architectures. In PARLE '87: Parallel Architectures and Languages Europe, number 259 in Lecture Notes in Computer Science, Eindhoven, The Netherlands, June 1987. Springer-Verlag.
[21] P. R. Wilson. Uniprocessor garbage collection techniques. In International Workshop on Memory Management, September 1992.
[22] B. Zorn. The measured cost of conservative garbage collection. Software: Practice and Experience, 23(7):733-756, July 1993.
[23] B. Zorn and D. Grunwald. Empirical measurements of six allocation-intensive C programs. SIGPLAN Notices, 27(12):71-80, December 1992.


More information

Memory management has always involved tradeoffs between numerous optimization possibilities: Schemes to manage problem fall into roughly two camps

Memory management has always involved tradeoffs between numerous optimization possibilities: Schemes to manage problem fall into roughly two camps Garbage Collection Garbage collection makes memory management easier for programmers by automatically reclaiming unused memory. The garbage collector in the CLR makes tradeoffs to assure reasonable performance

More information

Heckaton. SQL Server's Memory Optimized OLTP Engine

Heckaton. SQL Server's Memory Optimized OLTP Engine Heckaton SQL Server's Memory Optimized OLTP Engine Agenda Introduction to Hekaton Design Consideration High Level Architecture Storage and Indexing Query Processing Transaction Management Transaction Durability

More information

Compiler Construction D7011E

Compiler Construction D7011E Compiler Construction D7011E Lecture 14: Memory Management Viktor Leijon Slides largely by Johan Nordlander with material generously provided by Mark P. Jones. 1 First: Run-time Systems 2 The Final Component:

More information

CSCI-1200 Data Structures Spring 2017 Lecture 27 Garbage Collection & Smart Pointers

CSCI-1200 Data Structures Spring 2017 Lecture 27 Garbage Collection & Smart Pointers CSCI-1200 Data Structures Spring 2017 Lecture 27 Garbage Collection & Smart Pointers Announcements Please fill out your course evaluations! Those of you interested in becoming an undergraduate mentor for

More information

CS61C : Machine Structures

CS61C : Machine Structures inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture #5 Memory Management; Intro MIPS 2007-7-2 Scott Beamer, Instructor iphone Draws Crowds www.sfgate.com CS61C L5 Memory Management; Intro

More information

Mark-Sweep and Mark-Compact GC

Mark-Sweep and Mark-Compact GC Mark-Sweep and Mark-Compact GC Richard Jones Anthony Hoskins Eliot Moss Presented by Pavel Brodsky 04/11/14 Our topics today Two basic garbage collection paradigms: Mark-Sweep GC Mark-Compact GC Definitions

More information

Garbage Collection. Hwansoo Han

Garbage Collection. Hwansoo Han Garbage Collection Hwansoo Han Heap Memory Garbage collection Automatically reclaim the space that the running program can never access again Performed by the runtime system Two parts of a garbage collector

More information

Copying Garbage Collection in the Presence of Ambiguous References

Copying Garbage Collection in the Presence of Ambiguous References Copying Garbage Collection in the Presence of Ambiguous References Andrew W. Appel and David R. Hanson Department of Computer Science, Princeton University, Princeton, New Jersey 08544 Research Report

More information

Exploiting the Behavior of Generational Garbage Collector

Exploiting the Behavior of Generational Garbage Collector Exploiting the Behavior of Generational Garbage Collector I. Introduction Zhe Xu, Jia Zhao Garbage collection is a form of automatic memory management. The garbage collector, attempts to reclaim garbage,

More information

File System Interface and Implementation

File System Interface and Implementation Unit 8 Structure 8.1 Introduction Objectives 8.2 Concept of a File Attributes of a File Operations on Files Types of Files Structure of File 8.3 File Access Methods Sequential Access Direct Access Indexed

More information

Preview. Memory Management

Preview. Memory Management Preview Memory Management With Mono-Process With Multi-Processes Multi-process with Fixed Partitions Modeling Multiprogramming Swapping Memory Management with Bitmaps Memory Management with Free-List Virtual

More information

Concurrent Preliminaries

Concurrent Preliminaries Concurrent Preliminaries Sagi Katorza Tel Aviv University 09/12/2014 1 Outline Hardware infrastructure Hardware primitives Mutual exclusion Work sharing and termination detection Concurrent data structures

More information

HOT-Compilation: Garbage Collection

HOT-Compilation: Garbage Collection HOT-Compilation: Garbage Collection TA: Akiva Leffert aleffert@andrew.cmu.edu Out: Saturday, December 9th In: Tuesday, December 9th (Before midnight) Introduction It s time to take a step back and congratulate

More information

A Migrating-Home Protocol for Implementing Scope Consistency Model on a Cluster of Workstations

A Migrating-Home Protocol for Implementing Scope Consistency Model on a Cluster of Workstations A Migrating-Home Protocol for Implementing Scope Consistency Model on a Cluster of Workstations Benny Wang-Leung Cheung, Cho-Li Wang and Kai Hwang Department of Computer Science and Information Systems

More information

Incremental Multi-threaded Garbage Collection on Virtually Shared Memory Architectures

Incremental Multi-threaded Garbage Collection on Virtually Shared Memory Architectures Incremental Multi-threaded Garbage Collection on Virtually Shared Memory Architectures Thierry Le Sergent, Bernard Berthomieu Laboratoire d Automatique et d Analyse des Systèmes du CNRS 7, Avenue du Colonel

More information

FILE SYSTEMS. CS124 Operating Systems Winter , Lecture 23

FILE SYSTEMS. CS124 Operating Systems Winter , Lecture 23 FILE SYSTEMS CS124 Operating Systems Winter 2015-2016, Lecture 23 2 Persistent Storage All programs require some form of persistent storage that lasts beyond the lifetime of an individual process Most

More information

16 Sharing Main Memory Segmentation and Paging

16 Sharing Main Memory Segmentation and Paging Operating Systems 64 16 Sharing Main Memory Segmentation and Paging Readings for this topic: Anderson/Dahlin Chapter 8 9; Siberschatz/Galvin Chapter 8 9 Simple uniprogramming with a single segment per

More information

AST: scalable synchronization Supervisors guide 2002

AST: scalable synchronization Supervisors guide 2002 AST: scalable synchronization Supervisors guide 00 tim.harris@cl.cam.ac.uk These are some notes about the topics that I intended the questions to draw on. Do let me know if you find the questions unclear

More information

Adaptive Prefetching Technique for Shared Virtual Memory

Adaptive Prefetching Technique for Shared Virtual Memory Adaptive Prefetching Technique for Shared Virtual Memory Sang-Kwon Lee Hee-Chul Yun Joonwon Lee Seungryoul Maeng Computer Architecture Laboratory Korea Advanced Institute of Science and Technology 373-1

More information

Chapter 11: File System Implementation. Objectives

Chapter 11: File System Implementation. Objectives Chapter 11: File System Implementation Objectives To describe the details of implementing local file systems and directory structures To describe the implementation of remote file systems To discuss block

More information

The Operating System. Chapter 6

The Operating System. Chapter 6 The Operating System Machine Level Chapter 6 1 Contemporary Multilevel Machines A six-level l computer. The support method for each level is indicated below it.2 Operating System Machine a) Operating System

More information

Motivation for Dynamic Memory. Dynamic Memory Allocation. Stack Organization. Stack Discussion. Questions answered in this lecture:

Motivation for Dynamic Memory. Dynamic Memory Allocation. Stack Organization. Stack Discussion. Questions answered in this lecture: CS 537 Introduction to Operating Systems UNIVERSITY of WISCONSIN-MADISON Computer Sciences Department Dynamic Memory Allocation Questions answered in this lecture: When is a stack appropriate? When is

More information

A new Mono GC. Paolo Molaro October 25, 2006

A new Mono GC. Paolo Molaro October 25, 2006 A new Mono GC Paolo Molaro lupus@novell.com October 25, 2006 Current GC: why Boehm Ported to the major architectures and systems Featurefull Very easy to integrate Handles managed pointers in unmanaged

More information

Chapter 8 :: Composite Types

Chapter 8 :: Composite Types Chapter 8 :: Composite Types Programming Language Pragmatics, Fourth Edition Michael L. Scott Copyright 2016 Elsevier 1 Chapter08_Composite_Types_4e - Tue November 21, 2017 Records (Structures) and Variants

More information

Optimizing Closures in O(0) time

Optimizing Closures in O(0) time Optimizing Closures in O(0 time Andrew W. Keep Cisco Systems, Inc. Indiana Univeristy akeep@cisco.com Alex Hearn Indiana University adhearn@cs.indiana.edu R. Kent Dybvig Cisco Systems, Inc. Indiana University

More information

a process may be swapped in and out of main memory such that it occupies different regions

a process may be swapped in and out of main memory such that it occupies different regions Virtual Memory Characteristics of Paging and Segmentation A process may be broken up into pieces (pages or segments) that do not need to be located contiguously in main memory Memory references are dynamically

More information

Design Issues. Subroutines and Control Abstraction. Subroutines and Control Abstraction. CSC 4101: Programming Languages 1. Textbook, Chapter 8

Design Issues. Subroutines and Control Abstraction. Subroutines and Control Abstraction. CSC 4101: Programming Languages 1. Textbook, Chapter 8 Subroutines and Control Abstraction Textbook, Chapter 8 1 Subroutines and Control Abstraction Mechanisms for process abstraction Single entry (except FORTRAN, PL/I) Caller is suspended Control returns

More information

Scientific Applications. Chao Sun

Scientific Applications. Chao Sun Large Scale Multiprocessors And Scientific Applications Zhou Li Chao Sun Contents Introduction Interprocessor Communication: The Critical Performance Issue Characteristics of Scientific Applications Synchronization:

More information

Reducing Disk Latency through Replication

Reducing Disk Latency through Replication Gordon B. Bell Morris Marden Abstract Today s disks are inexpensive and have a large amount of capacity. As a result, most disks have a significant amount of excess capacity. At the same time, the performance

More information

CS Operating Systems

CS Operating Systems CS 4500 - Operating Systems Module 9: Memory Management - Part 1 Stanley Wileman Department of Computer Science University of Nebraska at Omaha Omaha, NE 68182-0500, USA June 9, 2017 In This Module...

More information

CS Operating Systems

CS Operating Systems CS 4500 - Operating Systems Module 9: Memory Management - Part 1 Stanley Wileman Department of Computer Science University of Nebraska at Omaha Omaha, NE 68182-0500, USA June 9, 2017 In This Module...

More information

Java Performance Tuning

Java Performance Tuning 443 North Clark St, Suite 350 Chicago, IL 60654 Phone: (312) 229-1727 Java Performance Tuning This white paper presents the basics of Java Performance Tuning and its preferred values for large deployments

More information

Coping with Conflicts in an Optimistically Replicated File System

Coping with Conflicts in an Optimistically Replicated File System Coping with Conflicts in an Optimistically Replicated File System Puneet Kumar School of Computer Science Carnegie Mellon University 1. Introduction Coda is a scalable distributed Unix file system that

More information

Parallel storage allocator

Parallel storage allocator CSE 539 02/7/205 Parallel storage allocator Lecture 9 Scribe: Jing Li Outline of this lecture:. Criteria and definitions 2. Serial storage allocators 3. Parallel storage allocators Criteria and definitions

More information

CS61C : Machine Structures

CS61C : Machine Structures inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture 7 More Memory Management CS 61C L07 More Memory Management (1) 2004-09-15 Lecturer PSOE Dan Garcia www.cs.berkeley.edu/~ddgarcia Star Wars

More information

A STUDY IN THE INTEGRATION OF COMPUTER ALGEBRA SYSTEMS: MEMORY MANAGEMENT IN A MAPLE ALDOR ENVIRONMENT

A STUDY IN THE INTEGRATION OF COMPUTER ALGEBRA SYSTEMS: MEMORY MANAGEMENT IN A MAPLE ALDOR ENVIRONMENT A STUDY IN THE INTEGRATION OF COMPUTER ALGEBRA SYSTEMS: MEMORY MANAGEMENT IN A MAPLE ALDOR ENVIRONMENT STEPHEN M. WATT ONTARIO RESEARCH CENTER FOR COMPUTER ALGEBRA UNIVERSITY OF WESTERN ONTARIO LONDON

More information

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012) Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues

More information

CSCI 4717 Computer Architecture

CSCI 4717 Computer Architecture CSCI 4717/5717 Computer Architecture Topic: Symmetric Multiprocessors & Clusters Reading: Stallings, Sections 18.1 through 18.4 Classifications of Parallel Processing M. Flynn classified types of parallel

More information

Dynamic Memory Allocation. Gerson Robboy Portland State University. class20.ppt

Dynamic Memory Allocation. Gerson Robboy Portland State University. class20.ppt Dynamic Memory Allocation Gerson Robboy Portland State University class20.ppt Harsh Reality Memory is not unbounded It must be allocated and managed Many applications are memory dominated Especially those

More information