Parallel GC (Chapter 14). Eleanor Ainy, December 16th, 2014


Outline of Today's Talk
How to use parallelism in each of the four components of tracing GC:
- Marking
- Copying
- Sweeping
- Compaction

Introduction
Until now: multiple mutator threads, but only one collector thread. A poor use of resources! One assumption remains: no mutators run in parallel with the collector.

Introduction: Parallel vs. Non-Parallel Collection
[Figure: timelines contrasting a single collector thread with multiple parallel collector threads, across mutator execution and collection cycles 1 and 2.]

Introduction: The Goal
To reduce:
- the time overhead of garbage collection
- pause times, in the case of stop-the-world collection

Introduction: GC Challenges
- Ensure there is sufficient work to be done; otherwise parallelism isn't worth it.
- Load balancing: distribute work and other resources in a way that minimizes the coordination needed.
- Synchronization: needed both for correctness and to avoid repeating work.

Introduction: More on Load Balancing
Static partitioning:
- Some processors will probably have more work to do than others.
- Some processors will exhaust their resources before others do.

Introduction: More on Load Balancing
Dynamic load balancing:
- Sometimes it is possible to obtain a good estimate of the amount of work in advance; more often it is not.
- Solution: (1) over-partition the work into more tasks than threads; (2) have each thread compete to claim one task at a time to execute.
- Advantages: (1) more resilient to changes in the number of processors available; (2) if one task takes a long time to execute, other threads can still claim any remaining work (sketched below).
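To make this concrete, here is a minimal Java sketch of the claiming pattern, under my own naming (TaskClaiming, NUM_TASKS and process are illustrative, not from the chapter): the work is over-partitioned into many small tasks, and each worker claims one task at a time with an atomic counter.

    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical sketch: over-partition the collection work into many more
    // tasks than threads, then let threads compete to claim one task at a time.
    class TaskClaiming {
        static final int NUM_TASKS = 1024;                  // many small tasks
        static final AtomicInteger nextTask = new AtomicInteger(0);

        static void worker() {
            while (true) {
                int task = nextTask.getAndIncrement();      // claim one task
                if (task >= NUM_TASKS) return;              // nothing left to do
                process(task);
            }
        }

        static void process(int task) { /* collector work for this task */ }

        public static void main(String[] args) throws InterruptedException {
            Thread[] workers = new Thread[4];
            for (int i = 0; i < workers.length; i++) {
                workers[i] = new Thread(TaskClaiming::worker);
                workers[i].start();
            }
            for (Thread w : workers) w.join();
        }
    }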

Introduction: More on Load Balancing
Why not divide the work into the smallest possible independent tasks? Because the coordination cost is too expensive. Synchronization guarantees correctness and avoids unnecessary work, but it has time and space overheads. Algorithms therefore try to minimize the synchronization needed, for instance by using thread-local data structures.

Introduction: Processor-Centric vs. Memory-Centric
Processor-centric algorithms:
- Threads acquire work that varies in size.
- Threads steal work from other threads.
- Little regard for the location of the objects.
Memory-centric algorithms:
- Take location into greater account.
- Operate on contiguous blocks of heap memory.
- Acquire/release work from/to shared pools of fixed-size work buffers.

Introduction: Algorithm Abstraction
Assumption: each collector thread executes the following loop (in most cases):

    while not terminated()
        acquireWork()
        performWork()
        generateWork()

Next up: Marking.

Marking
Marking comprises:
1) Acquiring an object from a work list.
2) Testing and setting marks.
3) Generating further marking work by adding the object's children to the work list.

Marking: Important Note
All known parallel marking algorithms are processor-centric.

Marking: When Is Synchronization Required?
No synchronization: if the work list is thread-local. Example: when an object's mark is represented by a bit in its header.
Synchronization needed: otherwise, the thread must acquire work atomically from some other thread's work list or from a global list. Example: when marks are stored in a shared bitmap.

Marking: Endo et al [1997] Mark-Sweep Algorithm
N is the total number of threads. Each marker thread has its own local mark stack and a stealable work queue.

    shared stealableWorkQueue[N]
    me ← myThreadId

    acquireWork():
        if not isEmpty(myMarkStack)
            return
        stealFromMyself()
        if isEmpty(myMarkStack)
            stealFromOthers()

Marking: Endo et al [1997] Mark-Sweep Algorithm
An idle thread acquires work by first examining its own queue:

    stealFromMyself():
        lock(stealableWorkQueue[me])
        n ← size(stealableWorkQueue[me]) / 2
        transfer(stealableWorkQueue[me], n, myMarkStack)
        unlock(stealableWorkQueue[me])

Marking: Endo et al [1997] Mark-Sweep Algorithm
If its own queue is empty, the thread then tries to steal half of some other thread's queue:

    stealFromOthers():
        for each j in Threads
            if not locked(stealableWorkQueue[j])
                if lock(stealableWorkQueue[j])
                    n ← size(stealableWorkQueue[j]) / 2
                    transfer(stealableWorkQueue[j], n, myMarkStack)
                    unlock(stealableWorkQueue[j])
                    return

Marking: Endo et al [1997] Mark-Sweep Algorithm

    performWork():
        while pop(myMarkStack, ref)
            for each fld in Pointers(ref)
                child ← *fld
                if child ≠ null && not isMarked(child)
                    setMarked(child)
                    push(myMarkStack, child)

Marking: Endo et al [1997] Mark-Sweep Algorithm
Notice: it is possible for two threads to attempt to mark the same child object.
[Figure: threads A and B, each with a mark stack and a stealable queue, both reaching the same child object.]

Marking: Endo et al [1997] Mark-Sweep Algorithm
Each thread checks its own work queue; if the queue is empty, the thread transfers its whole mark stack (apart from local roots) to the queue.

    generateWork():
        if isEmpty(stealableWorkQueue[me])
            n ← size(myMarkStack)
            lock(stealableWorkQueue[me])
            transfer(myMarkStack, n, stealableWorkQueue[me])
            unlock(stealableWorkQueue[me])

Marking: Endo et al [1997]: Marking With a Bitmap
The collector tests the mark bit and, only if it is not yet set, attempts to set it atomically, retrying if the set fails (the mark byte is re-read on each retry).

    setMarked(ref):
        bitPosition ← markBit(ref)
        loop
            oldByte ← markByte(ref)     /* re-read on each iteration */
            if isMarked(oldByte, bitPosition)
                return
            newByte ← mark(oldByte, bitPosition)
            if CompareAndSet(&markByte(ref), oldByte, newByte)
                return

    CompareAndSet(x, old, new):
        atomic
            curr ← *x
            if curr = old
                *x ← new
                return true
            return false
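For concreteness, a hedged Java rendering of the same test-then-CAS loop, using java.util.concurrent's AtomicIntegerArray in place of the chapter's byte-level markByte/markBit primitives (class and method names are mine):

    import java.util.concurrent.atomic.AtomicIntegerArray;

    // Illustrative sketch of CAS-based bitmap marking: test the bit first,
    // and only if it is clear try to set it atomically, retrying on failure.
    class MarkBitmap {
        private final AtomicIntegerArray words;

        MarkBitmap(int numBits) {
            words = new AtomicIntegerArray((numBits + 31) / 32);
        }

        /** Returns true if this call set the mark, false if already set. */
        boolean setMarked(int bit) {
            int index = bit >>> 5;           // word holding the bit
            int mask  = 1 << (bit & 31);     // position within the word
            while (true) {
                int old = words.get(index);
                if ((old & mask) != 0) return false;    // already marked
                if (words.compareAndSet(index, old, old | mask)) return true;
                // CAS failed: another thread changed the word; retry
            }
        }
    }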

Marking: Endo et al [1997]: Termination Detection (reminder from the previous lecture)
- A separate thread can be dedicated to termination detection, or
- symmetric detection: every thread can play the role of the detector.

Marking: Endo et al [1997]: Termination Detection
Shared state:

    shared jobs[N]             /* initial work assignments */
    shared busy[N] ← [true, ...]
    shared jobsMoved ← false
    shared allDone ← false
    me ← myThreadId

Marking: Endo et al [1997]: Termination Detection

    worker():
        loop
            while not isEmpty(jobs[me])
                job ← dequeue(jobs[me])
                perform job
            if another thread j exists whose jobs set appears relatively large
                some ← stealJobs(j)
                enqueue(jobs[me], some)
                continue
            busy[me] ← false
            while no thread has jobs to steal && not allDone
                /* do nothing: wait for work or termination */
            if allDone
                return
            busy[me] ← true

Marking: Endo et al [1997]: Termination Detection

    stealJobs(j):
        some ← atomicallyRemoveJobs(jobs[j])
        if not isEmpty(some)
            jobsMoved ← true
        return some

Marking: Endo et al [1997]: Termination Detection

    detect():
        anyActive ← true
        while anyActive
            anyActive ← (∃i)(busy[i])
            anyActive ← anyActive || jobsMoved
            jobsMoved ← false
        allDone ← true

Marking: Endo et al [1997]: Running Example
Initially the stealable queues are empty. acquireWork returns immediately if the thread's stack is non-empty.
[Figure: threads A and B, each with a mark stack and an empty stealable queue.]

Marking: Endo et al [1997]: Running Example
performWork pops an object, marks it, and pushes its children.
[Figure: thread B's stack and queue as objects O1-O4 are traced.]

Marking: Endo et al [1997]: Running Example
generateWork moves all the objects from the stack to the (empty) queue.
[Figure: O2 and O3 transferred from stack B to queue B.]

Marking: Endo et al [1997]: Running Example
acquireWork: if the stack is empty, half of the thread's own queue is moved to the stack.
[Figure: half of queue B's contents moved back to stack B.]

Marking: Endo et al [1997]: Running Example
acquireWork: if the thread's own queue is also empty, it steals from other threads' queues. This continues until there is no more work, which the detector will detect.
[Figure: a thread with an empty stack and queue stealing from another thread's queue.]

Marking: Flood et al [2001] Mark-Sweep Algorithm
N is the total number of threads. Each thread has its own stealable deque (double-ended queue). The deques are fixed-size, to avoid allocating during collection, which means a push can cause overflow. All threads share a global overflow set, implemented as a list of lists.

    shared overflowSet
    shared deque[N]
    me ← myThreadId

    acquireWork():
        if not isEmpty(deque[me])
            return
        n ← dequeFixedSize / 2
        if extractFromOverflowSet(n)
            return
        stealFromOthers()

Marking: Flood et al [2001] Mark-Sweep Algorithm
The overflow set is threaded through the objects themselves: each Java class structure holds the head of a list of overflow objects of that type, linked through the class-pointer field in their headers. An object's type field can be restored when it is removed from the overflow set (stop-the-world collection makes it safe to reuse the field this way).

Marking: Flood et al [2001] Mark-Sweep Algorithm
Idle threads acquire work by trying to fill half their deque from the overflow set before stealing from other deques.

    extractFromOverflowSet(n):
        return transfer(overflowSet, n, deque[me])

Marking: Flood et al [2001] Mark-Sweep Algorithm
Idle threads steal work from the top of other threads' deques using remove.

    stealFromOthers():
        for each j in Threads
            ref ← remove(deque[j])
            if ref ≠ null
                push(deque[me], ref)
                return

remove requires synchronization.

Marking: Flood et al [2001] Mark-Sweep Algorithm

    performWork():
        loop
            ref ← pop(deque[me])
            if ref = null
                return
            for each fld in Pointers(ref)
                child ← *fld
                if child ≠ null && not isMarked(child)
                    setMarked(child)
                    if not push(deque[me], child)
                        n ← size(deque[me]) / 2
                        transfer(deque[me], n, overflowSet)
                        push(deque[me], child)   /* room has been made */

pop requires synchronization only to claim the last element of the deque; push requires no synchronization.

Marking: Flood et al [2001] Mark-Sweep Algorithm
Work is generated inside performWork, by pushing to the deque or transferring to the overflow set.

    generateWork():
        /* nop */

Marking: Flood et al [2001]: Termination Detection
A variation of the symmetric detection we saw in the previous lecture: a status word holds one bit per thread (active/inactive).

Marking: Flood et al [2001]: Running Example
Initially the deques are non-empty. acquireWork returns immediately if the thread's deque is non-empty.
[Figure: threads A and B with non-empty deques.]

Marking: Flood et al [2001]: Running Example
performWork pops an object, marks it, and pushes its children.
[Figure: objects O1-O7 being traced through a deque.]

Marking: Flood et al [2001]: Running Example
performWork: if a push causes overflow, half the deque is moved to the overflow set.
[Figure: half of deque B's entries transferred to the overflow set.]

Marking: Flood et al [2001]: Running Example
The resulting overflow set:
[Figure: the class A structure heads a list of overflowed type-A objects (O5, O6); the class B structure heads a list containing O7; the lists are linked through the objects' class-pointer fields.]

Marking: Flood et al [2001]: Running Example
acquireWork: if the deque is empty, the thread takes work from the overflow set; if that fails, it removes work from other threads' deques.
[Figure: object O9 acquired by an idle thread.]

Marking: Mark Stacks With Work Stealing: Disadvantages
- The technique is best employed when the number of threads is known in advance.
- It may be difficult for a thread to choose the best queue from which to steal, and to detect termination.

Marking: Wu and Li [2007]: Tracing With Channels
Threads exchange marking tasks through single-writer, single-reader channels. In a system of N threads, each thread has an array of N-1 queues. Notation: the input channel from thread i to thread j is written i→j; it is also an output channel of thread i.

    shared channel[N, N]
    me ← myThreadId

Marking: Wu and Li [2007]: Tracing With Channels
If a thread's stack is empty, it takes a task from some input channel k→me.

    acquireWork():
        if not isEmpty(myMarkStack)
            return
        for each k in Threads
            if not isEmpty(channel[k, me])
                ref ← remove(channel[k, me])
                push(myMarkStack, ref)
                return

Marking: Wu and Li [2007]: Tracing With Channels
Threads first try to add new tasks (children to be marked) to other threads' input channels, i.e. their own output channels.

    performWork():
        loop
            if isEmpty(myMarkStack)
                return
            ref ← pop(myMarkStack)
            for each fld in Pointers(ref)
                child ← *fld
                if child ≠ null && not isMarked(child)
                    if not generateWork(child)
                        push(myMarkStack, child)

Marking: Wu and Li [2007]: Tracing With Channels
When a thread generates a new task, it first checks whether any other thread k needs work; if so, it adds the task to the output channel me→k. Otherwise it pushes the task onto its own stack.

    generateWork(ref):
        for each k in Threads
            if needsWork(k) && not isFull(channel[me, k])
                add(channel[me, k], ref)
                return true
        return false

Marking: Wu and Li [2007]: Tracing With Channels
Advantages:
- No expensive atomic operations.
- Performs better on servers with many processors.
- Keeps all threads busy.
(On a machine with 16 Intel Xeon processors, channels of size one or two were found to scale best.)
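A sketch of one plausible channel implementation in Java, a Lamport-style single-writer/single-reader ring buffer (the class is my own illustration, not Wu and Li's code): because only the writer advances tail and only the reader advances head, no compare-and-set is needed; volatile indexes provide the required visibility.

    // Illustrative single-writer/single-reader channel: no atomic
    // read-modify-write operations, only volatile reads and writes.
    class Channel<T> {
        private final Object[] slots;
        private volatile int head = 0;  // advanced only by the reader
        private volatile int tail = 0;  // advanced only by the writer

        Channel(int capacity) { slots = new Object[capacity + 1]; }

        boolean isEmpty() { return head == tail; }
        boolean isFull()  { return (tail + 1) % slots.length == head; }

        /** Writer side: add a task; returns false if the channel is full. */
        boolean add(T task) {
            if (isFull()) return false;
            slots[tail] = task;
            tail = (tail + 1) % slots.length;  // publish the slot to the reader
            return true;
        }

        /** Reader side: remove a task, or null if the channel is empty. */
        @SuppressWarnings("unchecked")
        T remove() {
            if (isEmpty()) return null;
            T task = (T) slots[head];
            head = (head + 1) % slots.length;
            return task;
        }
    }

Given the observation above that capacities of one or two scaled best, each channel is little more than a couple of slots plus two indexes.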

Next up: Copying.

Copying: Copying Is Different From Marking
It is essential that an object be copied only once, whereas marking an object twice usually does not affect the correctness of the program.

Copying: Processor-Centric Techniques: Cheng and Blelloch [2001]
Each copying thread is given its own local stack and transfers work between its local stack and a shared stack.

    k ← size of a local stack
    shared sharedStack
    myCopyStack[k]
    sp ← 0    /* local stack pointer */

Copying: Cheng and Blelloch [2001]
Using "rooms", they allow multiple threads to pop elements from the shared stack in parallel, and to push elements onto the shared stack in parallel, but never to pop and push in parallel.

    shared gate ← OPEN
    shared popClients    /* number of clients in the pop room */
    shared pushClients   /* number of clients in the push room */

Copying: Cheng and Blelloch [2001]

    while not terminated()
        enterRoom()                  /* enter the pop room */
        for i ← 1 to k
            if isLocalStackEmpty()
                acquireWork()
            if isLocalStackEmpty()
                break
            performWork()
        transitionRooms()
        generateWork()
        if exitRoom()                /* exit the push room */
            terminate()

    isLocalStackEmpty():
        return sp = 0

    acquireWork():
        sharedPop()

    performWork():
        ref ← localPop()
        scan(ref)

    generateWork():
        sharedPush()

Copying: Cheng and Blelloch [2001]

    localPush(ref):
        myCopyStack[sp++] ← ref

    localPop():
        return myCopyStack[--sp]

[Figure: a local stack with stack pointer sp; localPop and localPush operate at the top.]

Copying: Cheng and Blelloch [2001]

    sharedPop():
        cursor ← FetchAndAdd(&sharedStack, 1)
        if cursor ≥ stackLimit           /* stack was empty: undo */
            FetchAndAdd(&sharedStack, -1)
        else
            myCopyStack[sp++] ← cursor[0]

    FetchAndAdd(x, v):
        atomic
            old ← *x
            *x ← old + v
            return old

Copying: Cheng and Blelloch [2001]

    sharedPush():
        cursor ← FetchAndAdd(&sharedStack, -sp) - sp
        for i ← 0 to sp-1
            cursor[i] ← myCopyStack[i]
        sp ← 0

Copying: Cheng and Blelloch [2001]

    enterRoom():
        while gate ≠ OPEN
            /* do nothing: wait */
        FetchAndAdd(&popClients, 1)
        while gate ≠ OPEN
            FetchAndAdd(&popClients, -1)   /* failure: return to previous state */
            while gate ≠ OPEN
                /* do nothing: wait */
            FetchAndAdd(&popClients, 1)    /* try again */

Copying: Cheng and Blelloch [2001]

    transitionRooms():       /* move from the pop room to the push room */
        gate ← CLOSED        /* close the gate to the pop room */
        FetchAndAdd(&pushClients, 1)
        FetchAndAdd(&popClients, -1)
        while popClients > 0
            /* do nothing: wait until no one is popping */

Copying: Cheng and Blelloch [2001]

    exitRoom():
        pushers ← FetchAndAdd(&pushClients, -1) - 1
        if pushers = 0                   /* last to leave the push room */
            gate ← OPEN
            if isEmpty(sharedStack)      /* no work left */
                return true
        return false

Copying: Cheng and Blelloch [2001]
Problem: any processor waiting to enter the push room must wait until all processors in the pop room have finished their work.
Possible solution: do the copying work outside the rooms. This increases the likelihood that the pop room is empty, so threads can enter the push room more quickly.

Copying: Memory-Centric Techniques: Block-Structured Heaps
Divide the heap into small, fixed-size chunks. Each thread receives its own chunks: one to scan, and one into which to copy survivors. Once a thread's copy chunk is full, it is transferred to a global pool, where idle threads compete to scan it, and the thread obtains a fresh empty chunk (a minimal sketch of the chunk flow follows).
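A minimal Java sketch of that chunk flow, under an assumed simple shared pool (the Chunk and ChunkPool names are mine, not from the chapter):

    import java.util.concurrent.ConcurrentLinkedQueue;

    // Hypothetical sketch: each thread scans its own chunk and copies into
    // its own chunk; full copy chunks go to a shared pool where idle threads
    // claim them to scan.
    class ChunkPool {
        static class Chunk { /* a fixed-size slice of to-space */ }

        private final ConcurrentLinkedQueue<Chunk> scanPool =
                new ConcurrentLinkedQueue<>();

        void publishFullChunk(Chunk c) { scanPool.add(c); }   // copy chunk filled

        Chunk claimChunkToScan() { return scanPool.poll(); }  // null if none left
    }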

Copying: Block-Structured Heaps
Mechanisms used to ensure good load balancing:
- The chunks acquired were small (256 words).
- To avoid fragmentation, "big bag of pages" allocation was used for small objects; larger objects and chunks were allocated from the shared heap under a lock.

Copying: Block-Structured Heaps
- Load was balanced at a finer granularity: each chunk was divided into smaller blocks (32 words).

Copying: Block-Structured Heaps
After scanning a slot, the thread checks whether it has reached a block boundary. If so:
- If the next object is smaller than a block, the thread advances its scan pointer to the start of its current copy block. This reduces contention, since the thread does not have to compete to acquire a new scan block; the unscanned blocks left behind are given to the global pool.
- If the object is larger than a block but smaller than a chunk, the scan pointer is advanced to the start of the current copy chunk.
- If the object is large, the thread simply continues to scan it.

Copying: Block-Structured Heaps
[Figures: the load-balancing mechanisms in action; block states and transitions; the state-transition logic.]

Next up: Sweeping.

Sweeping: Simple Strategies
1) Statically partition the heap into contiguous blocks for threads to sweep.
2) Over-partition the heap and have threads compete for the next block to sweep onto a free-list.
Problem: the shared free-list becomes a bottleneck. Solution: give each processor its own free-list.

Sweeping: Endo et al [1997]: Lazy Sweeping
Lazy sweeping is a naturally parallel solution for partially full blocks: the sweep phase itself only needs to identify empty blocks and return them to the block allocator. To reduce contention:
- Each thread is given several consecutive blocks to process locally.
- Bitmap marking is used, with the bitmaps held in block headers; this makes it cheap to determine whether a block is empty.
- Empty blocks are added to a local free-block list; partially full blocks are added to a local reclaim list for subsequent lazy sweeping.
- Once a processor finishes with its sweep set, it merges its local list into the global free-block list (sketched below).
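A hedged Java sketch of this structure (all names are mine; a real collector works on raw heap blocks rather than Java objects): threads claim runs of consecutive blocks, sort them into thread-local lists, and touch the global free-block list only once, at the end.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    // Illustrative parallel sweep with thread-local lists.
    class ParallelSweep {
        static class Block {
            boolean empty;                 // in reality: derived from the mark bitmap
            boolean isEmpty() { return empty; }
        }

        static final int BLOCKS_PER_CLAIM = 8;
        final Block[] heapBlocks;
        final AtomicInteger nextBlock = new AtomicInteger(0);
        final List<Block> globalFreeList = new ArrayList<>();

        ParallelSweep(Block[] heapBlocks) { this.heapBlocks = heapBlocks; }

        void sweeperThread() {
            List<Block> localFree = new ArrayList<>();
            List<Block> localReclaim = new ArrayList<>();
            while (true) {
                int start = nextBlock.getAndAdd(BLOCKS_PER_CLAIM); // claim a run
                if (start >= heapBlocks.length) break;
                int end = Math.min(start + BLOCKS_PER_CLAIM, heapBlocks.length);
                for (int i = start; i < end; i++) {
                    if (heapBlocks[i].isEmpty()) localFree.add(heapBlocks[i]);
                    else localReclaim.add(heapBlocks[i]); // lazily swept later
                }
            }
            synchronized (globalFreeList) {               // single merge at the end
                globalFreeList.addAll(localFree);
            }
            // localReclaim stays with this thread for lazy sweeping on demand
        }
    }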

Next up: Compaction.

Compaction: Flood et al [2001] Mark-Compact
Observation: uniprocessor compaction algorithms typically slide all live data to one end of the heap space. If multiple threads do this in parallel, one thread can overwrite live data before another thread has moved it.
[Figure: threads 1 and 2 sliding objects A-E concurrently, with one thread's target range overlapping live data the other has not yet moved.]

Compaction: Flood et al [2001] Mark-Compact
Suggested solution: divide the heap space into several regions, one per compacting thread. To reduce fragmentation, threads also alternate the direction in which they move objects in even- and odd-numbered regions.

Compaction: Flood et al [2001] Mark-Compact
Four phases:
1) Mark.
2) Calculate forwarding addresses.
3) Update references.
4) Move objects.

Compaction: Flood et al [2001]: Phase 2, Calculating Forwarding Addresses
- Over-partition the space into M = 4N units of roughly the same size (N is the number of threads).
- Threads compete to claim units; each thread counts the volume of live data in its units.
- Based on these volumes, the space is partitioned into N regions that contain approximately the same amount of live data.
- Threads then compete to claim units and install forwarding addresses for each live object in their units.
Example: M = 12 units with live volumes 8, 3, 6, 13, 7, 10, 5, 7, 5, 12, 4, 9, split into N = 3 regions with totals 30, 29 and 30 (a worked sketch follows).
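A worked sketch of the partitioning step using the volumes above, as a greedy prefix-sum split (illustrative, not Flood et al's exact code):

    // Split 12 unit live-data counts into N = 3 regions of roughly equal
    // live volume by accumulating a running sum against the per-region target.
    class RegionPartition {
        public static void main(String[] args) {
            int[] liveVolume = {8, 3, 6, 13, 7, 10, 5, 7, 5, 12, 4, 9};
            int regions = 3;
            int total = 0;
            for (int v : liveVolume) total += v;     // 89 units of live data
            int target = total / regions;            // ~29 per region

            int sum = 0, region = 0;
            for (int unit = 0; unit < liveVolume.length; unit++) {
                sum += liveVolume[unit];
                System.out.printf("unit %2d (live %2d) -> region %d%n",
                                  unit, liveVolume[unit], region);
                if (sum >= target && region < regions - 1) {
                    System.out.println("region " + region + " total: " + sum);
                    sum = 0;
                    region++;                        // close this region
                }
            }
            System.out.println("region " + region + " total: " + sum);
        }
    }

Running this reproduces the region totals 30, 29 and 30 from the slide.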

Compaction: Flood et al [2001]: Phases 3 and 4
Phase 3, updating references: making references point to objects' new locations requires scanning:
- mutator threads' stacks and the young generation, which may hold references into the compacted (old-generation) space;
- live objects in the old-generation heap space itself.
Threads compete to claim old-generation units to scan, while a single thread scans the young generation.
Phase 4, moving objects: each thread is in charge of one region. Good load balancing is expected because the regions contain roughly equal volumes of live data.

Compaction: Flood et al [2001] Mark-Compact
Disadvantages:
1) The algorithm makes three passes over the heap, while other compacting algorithms make fewer.
2) Rather than compacting all live data to one end of the heap, the algorithm compacts into N regions, leaving (N+1)/2 gaps for allocation. If a large number of threads is used, it is difficult for mutators to allocate very large objects.

Compaction: Abuaiadh et al [2004] Mark-Compact
1) Addressing the three-passes problem: calculate forwarding addresses rather than storing them, using the mark bitmap and an offset vector that holds the new address of the first live object in each block. Constructing the offset vector requires one pass over the mark-bit vector; after that, only a single pass over the heap is needed to move objects and update references.

Compaction: Abuaiadh et al [2004] Mark-Compact
1) Addressing the three-passes problem (continued):
- Bits in the mark-bit vector indicate the start and end of each live object.
- Words in the offset vector hold the address to which the first live object in the corresponding block will be moved.
- Forwarding addresses are not stored but are calculated when needed from the offset and mark-bit vectors, as sketched below.
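A hedged Java sketch of the on-demand calculation, simplified to one mark bit per live heap word (whereas the real scheme marks the first and last word of each object); all names are mine:

    // Forwarding address = offset[block] + number of live words that precede
    // the object within its block, counted from the mark-bit vector.
    class ForwardingCalc {
        static final int WORDS_PER_BLOCK = 256;
        private final long[] markBits;  // one bit per heap word, heap-wide
        private final long[] offset;    // offset[b] = new location (in words)
                                        // of the first live object in block b

        ForwardingCalc(long[] markBits, long[] offset) {
            this.markBits = markBits;
            this.offset = offset;
        }

        long forwardingAddress(long oldWordIndex) {
            int block = (int) (oldWordIndex / WORDS_PER_BLOCK);
            long blockStart = (long) block * WORDS_PER_BLOCK;
            long precedingLive = 0;  // live words before this object in its block
            for (long w = blockStart; w < oldWordIndex; w++) {
                if ((markBits[(int) (w >>> 6)] & (1L << (w & 63))) != 0) {
                    precedingLive++;
                }
            }
            return offset[block] + precedingLive;
        }
    }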

Compaction: Abuaiadh et al [2004] Mark-Compact
2) Addressing the small-gaps problem: over-partition the heap into fairly large areas. Threads race to claim the next area to compact, using an atomic operation to increment a global area index. A thread that succeeds has obtained an area to compact; one that fails tries to claim the next area.

Compaction: Abuaiadh et al [2004] Mark-Compact
2) Addressing the small-gaps problem (continued):
- A table holds a pointer to the beginning of the free space in each area.
- After winning an area to compact, a thread races to obtain an area into which to move objects; it claims an area by trying to write null into the corresponding table slot.
- Threads never try to compact from, or into, an area whose table entry is null.
- Objects are never moved from a lower- to a higher-numbered area.
- Progress is guaranteed, since a thread can always compact an area into itself.
- Once a thread has finished with an area, it updates the area's free pointer. If an area is full, its free-space pointer remains null.
A sketch of the claiming protocol follows.
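A Java sketch of the two races, under stated assumptions (names are mine; AtomicLongArray cannot hold null, so a zero sentinel stands in for the null table entry):

    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.atomic.AtomicLongArray;

    // Hypothetical sketch: an atomic index hands out "from" areas to compact,
    // and the free-pointer table is claimed as the "to" area by atomically
    // swapping its entry to the sentinel.
    class AreaClaiming {
        static final long CLAIMED = 0;        // sentinel: area in use or full
        final AtomicInteger nextArea = new AtomicInteger(0);
        final AtomicLongArray freePointer;    // freePointer[a] = start of free
                                              // space in area a
        final int numAreas;

        AreaClaiming(long[] initialFree) {
            numAreas = initialFree.length;
            freePointer = new AtomicLongArray(initialFree);
        }

        /** Race to claim the next area to compact from; -1 when done. */
        int claimFromArea() {
            int a = nextArea.getAndIncrement();
            return a < numAreas ? a : -1;
        }

        /** Try to claim area a as a compaction target. */
        long tryClaimToArea(int a) {
            long free = freePointer.get(a);
            if (free == CLAIMED) return CLAIMED;              // busy or full
            return freePointer.compareAndSet(a, free, CLAIMED) ? free : CLAIMED;
        }

        /** When finished, publish the area's updated free pointer. */
        void releaseToArea(int a, long newFree) { freePointer.set(a, newFree); }
    }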

Compaction: Abuaiadh et al [2004] Mark-Compact
[Figure: heap areas holding objects A-E, the global area index, and the free-pointers table with entries NULL, 200, 400, 1000, 1800.]

Compaction: Abuaiadh et al [2004] Mark-Compact
They explored two ways in which objects can be moved:
a. Slide objects one at a time.
b. To reduce compaction time, slide only complete blocks (256 bytes); free space within each block is not squeezed out.

Discussion
- What is the tradeoff in the choice of chunk size in parallel copying?
- What issues can copying with no synchronization cause? For example, if an object is copied twice by two different threads, what can the consequences be?
[Figure: an object A copied twice by two threads, and the resulting forwarding-address confusion.]

Something Extra
https://www.youtube.com/watch?v=yhkze22tzlc

Conclusions & Summary
- There should be enough work to make parallel collection worthwhile.
- Synchronization costs need to be taken into account.
- Load needs to be balanced across the multiple collector threads.
- We saw different algorithms for marking, sweeping, copying and compaction that take all these challenges into account.
- A key difference between marking and copying: marking an object twice is not so bad, but copying an object twice can harm correctness.