Concurrent Garbage Collection
Deepak Sreedhar, JVM engineer, Azul Systems
Java User Group Bangalore
@azulsystems | azulsystems.com
About me: Deepak Sreedhar
- JVM engineer at Azul Systems
- Currently working on enhancing the C4 garbage collector implementation in the Azul Zing JVM
- Prior experience with dynamic binary translation and server migration tools
Introduction
Quiz
- Does the Java spec mandate automatic GC?
- Is GC efficient?
- Can GC collect all dead objects?
- Can GC impact application throughput?
- Can GC impact application latency?
- Does a larger heap imply poorer performance?
- Does increasing -Xmx (more free space) improve GC efficiency?
Terminology
- The Java heap memory
- Objects and references
- Live, reachable, and dead objects
- Fragmentation and headroom wastage
- Virtual and physical memory
- Mutators
- Allocation and mutation rates
GC Safepoint
- A point in thread execution where GC can identify all references correctly and there is no mutation
- Global safepoint (stop-the-world): all threads are at a safepoint
- Safepointing is not the same as halting: a thread running native code (JNI) is at a safepoint
- Time to safepoint is as crucial for low latency as the GC operation time itself. Try -XX:+PrintGCApplicationStoppedTime
- Safepoints may be needed for non-GC reasons such as deoptimization and JVMTI heap iteration
GC classification
- Precise vs. conservative
- Incremental vs. monolithic
- Parallel vs. serial
- Concurrent vs. stop-the-world
- Multi-generational collectors
  - Weak generational hypothesis
  - Young (new) and old (tenured) generations
  - Promotion (tenuring)
  - Smaller pauses, usually in the new gen (smaller set of live objects)
  - Remembered sets and card tables track cross-generational references
  - Can delay, but not avoid, old gen collections
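The card-table mechanism for cross-generational references can be sketched in a few lines. Everything here (card size, field names, the `postWriteBarrier` helper) is illustrative, not any real JVM's implementation:

```java
// Illustrative card-table sketch: the heap is divided into fixed-size
// "cards"; a write barrier marks the card containing any updated field
// so a young-gen collection can scan only dirty old-gen cards for
// old->young references. All names and sizes are hypothetical.
public class CardTableSketch {
    static final int CARD_SIZE = 512;          // bytes covered per card
    static final int HEAP_SIZE = 64 * 1024;    // toy heap
    static final byte CLEAN = 0, DIRTY = 1;
    static final byte[] cards = new byte[HEAP_SIZE / CARD_SIZE];

    // Write barrier: executed after every reference store into heap[addr].
    static void postWriteBarrier(int addr) {
        cards[addr / CARD_SIZE] = DIRTY;
    }

    // A young collection scans only dirty cards instead of the whole old gen.
    static int dirtyCardCount() {
        int n = 0;
        for (byte c : cards) if (c == DIRTY) n++;
        return n;
    }

    public static void main(String[] args) {
        postWriteBarrier(100);     // two stores land in the same card
        postWriteBarrier(200);
        postWriteBarrier(5000);    // a store into a different card
        System.out.println(dirtyCardCount()); // 2
    }
}
```

The point of the coarse card granularity is that the barrier stays a single store; the collector pays for the imprecision by scanning the whole dirty card.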
Copying collector
- Copy and fix up as objects are discovered
- From and To spaces
- Used for the young (new) gen in many collectors
- Usually implemented as monolithic and stop-the-world
- Complexity on the order of the number of live objects
- Theoretically requires double the memory; practically, many objects may be dead
- Eden and survivor spaces
- Early promotion to the old gen when more memory is needed
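A minimal, illustrative Cheney-style semispace copy shows why the cost scales with live objects only: dead objects are simply never visited. The object model and names here are hypothetical simplifications:

```java
import java.util.ArrayList;
import java.util.List;

// Toy semispace copying collector (Cheney-style breadth-first copy).
// Objects are nodes with reference fields; the To space doubles as the
// work queue. Work is proportional to the number of LIVE objects.
public class CopyingGcSketch {
    static class Obj {
        Obj[] refs;
        Obj forward;                        // forwarding pointer once copied
        Obj(int nRefs) { refs = new Obj[nRefs]; }
    }

    static List<Obj> collect(List<Obj> roots) {
        List<Obj> toSpace = new ArrayList<>();
        for (int i = 0; i < roots.size(); i++)
            roots.set(i, copy(roots.get(i), toSpace));
        // Cheney scan: everything already copied is scanned in order,
        // copying (and fixing up) the objects it references.
        for (int scan = 0; scan < toSpace.size(); scan++) {
            Obj o = toSpace.get(scan);
            for (int i = 0; i < o.refs.length; i++)
                if (o.refs[i] != null)
                    o.refs[i] = copy(o.refs[i], toSpace);
        }
        return toSpace;
    }

    static Obj copy(Obj o, List<Obj> toSpace) {
        if (o.forward == null) {            // not yet copied: copy once
            Obj c = new Obj(o.refs.length);
            System.arraycopy(o.refs, 0, c.refs, 0, o.refs.length);
            o.forward = c;
            toSpace.add(c);
        }
        return o.forward;                   // every later visit just forwards
    }

    public static void main(String[] args) {
        Obj a = new Obj(1), b = new Obj(1), dead = new Obj(0);
        a.refs[0] = b; b.refs[0] = a;       // live cycle; 'dead' is unreachable
        List<Obj> roots = new ArrayList<>();
        roots.add(a);
        System.out.println(collect(roots).size()); // 2
    }
}
```

The forwarding pointer is what keeps the cycle from being copied twice: the second time the collector reaches an object, it just returns the copy's address.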
Mark Compact
- Separate mark and compact phases
- Mark (trace): identify live objects
- Compact: move objects to reduce fragmentation, compacting into a To space
- Complexity on the order of the number of live objects
- Can be implemented incrementally; full compaction can be delayed
Mark Sweep Compact
- Mark: identify live objects
- Sweep: iterate over the heap and find free space
- Compact: move objects to reduce fragmentation
- Used for the old gen in many collectors
- Complexity on the order of the heap size
- In-place; does not need more memory
- Can be implemented incrementally
- Compaction can be delayed to reduce pauses, but not eliminated
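The mark and sweep phases above can be sketched to make the cost asymmetry concrete: mark traces only live objects, while sweep must touch every heap slot, which is why the slide puts the complexity on the order of heap size. The slot-array heap model is purely illustrative:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy mark-sweep over a heap modeled as an array of slots.
// (Compaction is omitted here; this shows only mark + sweep.)
public class MarkSweepSketch {
    static class Obj {
        List<Integer> refs = new ArrayList<>();   // heap indices it points at
        boolean marked;
    }

    // Returns the number of objects freed by the sweep.
    static int sweep(Obj[] heap, int[] roots) {
        // Mark: trace from the roots through the reference graph.
        Deque<Integer> work = new ArrayDeque<>();
        for (int r : roots) work.push(r);
        while (!work.isEmpty()) {
            Obj o = heap[work.pop()];
            if (o == null || o.marked) continue;
            o.marked = true;
            for (int ref : o.refs) work.push(ref);
        }
        // Sweep: walk the WHOLE heap, freeing unmarked slots.
        int freed = 0;
        for (int i = 0; i < heap.length; i++) {
            if (heap[i] != null && !heap[i].marked) { heap[i] = null; freed++; }
            else if (heap[i] != null) heap[i].marked = false; // reset for next cycle
        }
        return freed;
    }

    public static void main(String[] args) {
        Obj[] heap = new Obj[8];
        for (int i = 0; i < 4; i++) heap[i] = new Obj();
        heap[0].refs.add(1);        // 0 -> 1 are live
        heap[2].refs.add(3);        // 2 -> 3 are garbage (no root reaches them)
        System.out.println(sweep(heap, new int[]{0})); // 2
    }
}
```

Note that objects 2 and 3 reference each other's chain but are still collected: reachability from the roots, not reference counts, decides liveness.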
Object allocation
- Memory available on servers is growing into the terabyte range
- Efficient allocation using Thread Local Allocation Buffers (TLABs) and a simple bump-the-pointer (advance-the-top) algorithm
- Not many Java applications are able to fully utilize this capacity
  - GC pauses (including in the new gen)
  - Difficulty in arriving at the right tuning
- Object pools and off-heap memory are used to get around this problem; not perfect solutions, since a memory management layer needs to be coded
- Can we have a continuously concurrent garbage collector?
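TLAB-style allocation can be sketched as follows: each thread claims a private buffer from the shared eden once, after which allocation is an unsynchronized add-and-compare. Sizes and names are illustrative, not HotSpot's actual values:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of TLAB "bump the pointer" allocation. The fast path touches
// only thread-local state; the shared eden top is bumped atomically
// only once per TLAB refill.
public class TlabSketch {
    static final AtomicLong edenTop = new AtomicLong(0); // shared eden cursor
    static final long TLAB_SIZE = 1024;                  // toy TLAB size

    long tlabTop, tlabEnd;        // per-thread state (one TlabSketch per thread)

    // Fast path: bump the thread-local pointer; no locks, no CAS.
    long allocate(long size) {
        if (tlabTop + size > tlabEnd) refill();          // slow path
        long addr = tlabTop;
        tlabTop += size;
        return addr;
    }

    // Slow path: one atomic bump of the shared eden top per whole TLAB.
    void refill() {
        tlabTop = edenTop.getAndAdd(TLAB_SIZE);
        tlabEnd = tlabTop + TLAB_SIZE;
    }

    public static void main(String[] args) {
        TlabSketch t = new TlabSketch();
        long a = t.allocate(16);
        long b = t.allocate(16);
        System.out.println(b - a); // 16: consecutive bump-pointer allocations
    }
}
```

This is why allocation in a compacted heap is so cheap: free space is contiguous, so "find room" degenerates to one addition and one comparison.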
Challenges and approaches
Concurrent Marking
- Marking starts from the roots and traverses the object graph through discovered references
- Mutators can modify the object graph while GC is marking
  - Move a reference into an already visited portion of the graph
  - Remove all references to an object from the heap while keeping a single reference in a register, hiding it from the GC marker
- Approaches
  - Incremental update: revisit the root set and modified portions of the graph iteratively, ending with a re-mark pause
  - SATB (snapshot at the beginning): intercept writes and store the old contents into buffers
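The SATB approach can be sketched as a pre-write barrier: before a reference field is overwritten during marking, the old value is logged so the marker still sees every object that was live when the snapshot was taken. This is an illustrative model, not any JVM's actual barrier code:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Snapshot-at-the-beginning (SATB) write barrier sketch. The mutator may
// hide an object by overwriting the only heap reference to it, but the
// barrier has already logged the old value for the concurrent marker.
public class SatbSketch {
    static class Obj { Obj field; }

    static boolean markingActive = true;          // set by GC at mark start
    static final Deque<Obj> satbBuffer = new ArrayDeque<>(); // drained by marker

    // Pre-write barrier, executed in place of: holder.field = newRef
    static void writeRef(Obj holder, Obj newRef) {
        if (markingActive && holder.field != null)
            satbBuffer.push(holder.field);        // log the OLD value
        holder.field = newRef;
    }

    public static void main(String[] args) {
        Obj holder = new Obj(), oldRef = new Obj();
        holder.field = oldRef;                    // set up before the snapshot
        writeRef(holder, new Obj());              // mutator overwrites the ref...
        System.out.println(satbBuffer.peek() == oldRef); // ...barrier kept it: true
    }
}
```

Incremental-update barriers invert this choice and log the new value instead, which is why they need a final re-mark pause to catch up, while SATB may retain some objects that died during the cycle (floating garbage).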
Concurrent Compaction
- Mutators can modify an object while it is being copied
- Mutators can read an object through stale pointers after it has been copied
- Incremental compaction: the G1GC approach
  - Divide the heap into regions; maintain inter-region references using remembered sets (RSets)
  - Minor collections use a copying collector; some minor collections do incremental compaction for the old gen
  - After a concurrent mark, estimate the efficiency of collecting each region; regions with no or smaller RSets can be collected more easily, so they are prioritized for upcoming minor collections
  - Source regions are updated while copying; RSet updates on new regions follow copying
  - Mark-sweep-compact for stop-the-world major collections
- Read barriers: the approach C4 takes
GC Barriers
- Instructions executed by mutators that aid garbage collection
  - Help maintain metadata
  - Impose invariants
- Write barriers
  - Update cross-generation or cross-region references
  - SATB barrier to ensure the snapshot is fully marked
  - Incremental update barriers that store new references
- Read barriers
  - Baker-style barrier
  - Brooks-style forwarding pointer
  - C4 Loaded Value Barrier
The Continuously Concurrent Compacting Collector (C4)
Loaded Value Barrier (LVB)
- A read barrier that ensures, at the time of load, that the following invariants are met before the reference is visible to the application:
  - If the GC cycle is in the marking phase, the reference will be marked through
  - If the GC cycle is in the relocation phase, or has completed relocation but not fixup, the reference will be updated to point to the relocated object
- Simultaneously guarantees that:
  - No reference escapes GC attention during marking
  - There is no stale access to a compacted page
- The result of the load is always a valid reference to a valid object
Self Healing
- The contents of the source location are overwritten with the result of the LVB
- Loading from the same source cannot trigger the barrier again
- A critical property that ensures a finite and predictable amount of work
- There may be trap storms at phase shifts, but they settle down as healing completes
- Unique to the C4 barrier (LVB)
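The relocation half of the LVB and its self-healing property can be modeled conceptually: a load consults a forwarding table kept outside the heap, and when it traps, it both returns the new address and repairs the source slot so that slot can never trap again. This is a heavily simplified illustration of the idea, not C4's actual mechanism (which works via page protection and reference metadata bits):

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual LVB-with-self-healing sketch for the relocation/fixup
// invariant. Addresses are plain ints; the forwarding table stands in
// for C4's off-heap forwarding information.
public class LvbSketch {
    static final Map<Integer, Integer> forwarding = new HashMap<>(); // old -> new
    static int trapCount = 0;

    // Load the reference stored in slots[i], applying the barrier.
    static int lvbLoad(int[] slots, int i) {
        int ref = slots[i];
        Integer moved = forwarding.get(ref);
        if (moved != null) {           // stale reference: barrier traps
            trapCount++;
            slots[i] = moved;          // self-heal the source location
            ref = moved;
        }
        return ref;
    }

    public static void main(String[] args) {
        int[] slots = {100, 200};
        forwarding.put(100, 700);      // GC relocated the object at 100 to 700
        lvbLoad(slots, 0);             // first load traps and heals the slot
        lvbLoad(slots, 0);             // same slot again: no trap
        System.out.println(trapCount + " " + slots[0]); // 1 700
    }
}
```

Self-healing is what bounds the work: each stale slot pays for at most one trap, so a trap storm at a phase shift decays as the live slots get repaired.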
Mark phase
- Like other collectors, starts from the root set and traverses the object graph
- NMT (not marked through) LVB check: does the reference metadata match the expected GC state for the generation?
- Trap handling: fix the NMT state of the reference, heal the source location, and add the reference to the collector's work queue
- Checkpoints to clean stacks and transfer reference buffers
- Marking is followed by a concurrent weak reference processing phase
Relocation phase
- Forwarding information is kept outside of heap pages
- Virtual memory of compacted pages remains reserved until fixup is complete
- Physical memory can be released immediately (Quick Release) and recycled
- Hand-over-hand relocation: each GC thread can complete with just one seed page
- Compacted pages are protected to catch accesses performed without the LVB
- Mutators cooperate in the relocation if GC hasn't moved the object yet at the relocate LVB trap; they also heal the source memory with the new address of the object
- Large objects are just remapped to new virtual addresses, not physically copied
Fixup phase
- Traverse the object graph and heal memory locations if not already done by mutators
- At the end of the fixup phase, virtual memory corresponding to compacted pages can be freed
- Can be combined with the marking phase of the next GC cycle, helping reduce GC cycle duration
- Mutators do fixup as part of the LVB
Generational features
- New and old collections can proceed simultaneously and almost independently, unlike in most collectors
- Perm gen is processed by the old collector
- Old and new collectors use the same algorithm
- Synchronization uses simple interlocks and limited suspension at phase changes
- Precise card marks for inter-generational references, updated by Store Value Barriers (SVB)
- Can be extended to N generations
Heap management
- Allocation in 2 MB pages
- Quick Release allows physical pages to be recycled to satisfy allocation requests before fixup is complete
- New, old, and perm gen pages are interleaved in virtual space
- Tiered allocation: objects are divided into small, mid, and large spaces based on size, which helps limit maximum headroom wastage (currently 12.5%)
- TLABs for small-space allocation, bump-the-pointer
- Relocation uses a different mechanism for each space to limit the maximum copy that a mutator needs to do
Zing Safepoints
- The C4 algorithm is pauseless, but the current implementation has a few short pauses, mostly at collector phase transitions (for ease and efficiency)
- Pause times are independent of heap size, live object size, object lifetime, allocation rate, mutation rate, and the count of weak/soft/phantom references
- Provides sufficient safepoint opportunities to reduce the time to bring threads to a safepoint
- Pause times remain consistent
- Employs thread checkpoints when there is a specific action to be performed for/by a thread, or when the thread needs to observe a GC state change
More on Zing
- GC is scheduled by heuristics; in most cases no tuning is required
- Elastic memory helps reduce occurrences of OutOfMemoryError
- A Linux kernel module improves the performance of virtual memory operations
Keywords for reference search
- Talks by Gil Tene, CTO, Azul Systems
- The Garbage Collection Handbook
- C4: The Continuously Concurrent Compacting Collector
- Garbage-First Garbage Collection
- Azul Zing JVM
Where Zing shines
- Low latency: eliminate behavior blips down to the sub-millisecond level
  - Machine-to-machine workloads; support higher *sustainable* throughput (throughput that meets SLAs)
  - Messaging, queues, market data feeds, fraud detection, analytics
- Human response times: eliminate user-annoying response time blips; multi-second and even fraction-of-a-second blips will be completely gone
  - Support larger-memory JVMs *if needed* (e.g. larger virtual user counts, larger caches, in-memory state, or consolidating multiple instances)
- Large data and in-memory analytics: make batch workloads business real time; gain super-efficiencies
  - Cassandra, Spark, Solr, DataGrid, any large dataset in fast motion
Q & A