The C4 Collector
Or: the application memory wall will remain until compaction is solved
Gil Tene, Balaji Iyengar, Michael Wolf

High Level Agenda
1. The Application Memory Wall
2. Generational collection for modern servers
3. C4 algorithm basics
4. Special generational considerations
5. Additional contributions
6. Results

The Application Memory Wall

Memory
How many of you use heap sizes of:
- more than ½ GB?
- more than 1 GB?
- more than 2 GB?
- more than 4 GB?
- more than 10 GB?
- more than 20 GB?
- more than 50 GB?
- more than 100 GB?

Reality check: memory in 2011
Retail prices of common commodity servers (June 2011):
- 24 vcore, 96GB server: ~$6.5K
- 32 vcore, 256GB server: ~$20K
- 64 vcore, 512GB server: ~$35K
- 96 vcore, 1TB server: ~$80K
Cheap (<$2/GB/month), and roughly linear to ~1TB

How much memory do applications need?
"640K ought to be enough for anybody" WRONG!
So what's the right number?
- 6,400K (6.4MB)?
- 64,000K (64MB)?
- 640,000K (640MB)?
- 6,400,000K (6.4GB)?
- 64,000,000K (64GB)?
There is no right number; the target moves at ~50x-100x per decade.

Tiny application history
Moore's Law: if transistor counts grow at ~2x every 18 months, that's ~100x every 10 years
- 2010: ??? GB apps on 256GB servers
- 2000: 1GB apps on a 2-4GB server
- 1990: 10MB apps on a 32-64MB server
- 1980: 100KB apps on a ¼ to ½ MB server

Why is there an application memory wall?
- GC is a clear and dominant cause
- There seems to be a practical heap size limit for applications with responsiveness requirements
  - A 100GB heap won't crash; it will just periodically pause for several minutes at a time
- [Virtually] all current commercial JVMs will exhibit multi-second pauses on a normally utilized 2-4GB heap
  - It's a question of when and how often, not if
  - GC tuning only moves the "when" and the "how often" around
- [Inevitable] compaction dominates the pauses
- The C4 collector is focused on removing the application memory wall in enterprise server environments
  - Focus: compaction that no longer hurts responsiveness

Generational Collection for modern servers
- A modern server: 100s of GB of memory and 10s of cores, each allocating at ¼-½ GB/sec
- A practical collector must work within this envelope without incurring significant program pauses
- Generational collection is a practical necessity for keeping up with modern processor throughputs
- The young generation is typically collected in a stop-the-world, full copying/promotion pass
  - But at modern server capacities, even the young generation will commonly hold live sets that are several GB in size
- Young generation collection needs to become either concurrent or incremental

A C4 invention: A Concurrent Young Generation Collector
- Historically, ALL generational collectors have used stop-the-world, full-cycle young-gen collection
- C4 is the first known collector to use a non-stop-the-world young generation (first shipped in 2006)
- There is currently only one incremental young-gen collector we are aware of: "Generational Real-Time GC: A Three-Part Invention for Young Objects", ECOOP '07 [11]
  - Motivated by a similar wish to keep up with throughput
- There are currently no other concurrent young-gen collectors that we are aware of

The C4 Collector
- Concurrent, compacting new generation
- Concurrent, compacting old generation
- Concurrent, guaranteed-single-pass markers
  - Oblivious to mutation, insensitive to mutation rate
- Concurrent compactors
  - Objects are moved without stopping the mutator
  - References are remapped without stopping the mutator
  - Can relocate an entire generation (new or old) in every GC cycle
- No stop-the-world fallback
  - Always compacts, and always does so concurrently

C4 Algorithm Basics

C4 algorithm highlights
- The same core mechanism is used for both generations: concurrent mark-compact
- A Loaded Value Barrier (LVB) is central to the algorithm (see the sketch after this list)
  - Every heap reference is verified as "sane" when loaded
  - Non-sane refs are caught and fixed in a self-healing barrier
  - Refs that have not yet been marked through are caught
    - Enables a guaranteed single-pass concurrent marker
  - Refs that point to relocated objects are caught
    - Lazily (and concurrently) remapped; there is no hurry
- Relocation and remapping are both concurrent
- Uses "quick release" to recycle memory
  - Forwarding information is kept outside of the object pages
  - Physical memory is released immediately upon relocation
  - Enables hand-over-hand compaction without requiring empty memory
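To make the barrier concrete, here is a minimal C sketch of what an LVB-style read barrier could look like. It is a sketch under assumed names (lvb_load, expected_nmt, points_into_from_page, and collector_fix_ref are all hypothetical), not C4's actual implementation; in a real JVM the fast path is a few JIT-emitted instructions on every reference load.

```c
#include <stdint.h>
#include <stdbool.h>

typedef uintptr_t ref_t;

/* Hypothetical globals/helpers, for illustration only. */
extern bool expected_nmt;                    /* current "marked-through" phase value   */
extern bool ref_nmt_bit(ref_t r);            /* NMT metadata bit carried in the ref    */
extern bool points_into_from_page(ref_t r);  /* true if target page is being relocated */
extern ref_t collector_fix_ref(ref_t r);     /* mark-through and/or remap via fwd ptr  */

/* LVB: runs on EVERY reference load from the heap. */
static inline ref_t lvb_load(ref_t *addr) {
    ref_t r = *addr;
    /* Fast path: ref is already marked through and does not point into a
       page under relocation; the common case is just a test and branch. */
    if (r != 0 &&
        (ref_nmt_bit(r) != expected_nmt || points_into_from_page(r))) {
        /* Slow path: queue for marking and/or remap to the new location. */
        r = collector_fix_ref(r);
        /* Self-healing: store the corrected ref back, so this memory
           location never triggers the barrier again this phase. */
        *addr = r;
    }
    return r;
}
```

The self-healing store is what keeps the barrier's cost bounded: each heap slot can trigger at most once per phase, so slow-path work is limited by the number of live reference slots rather than by the mutation rate.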

The C4 GC Cycle
[Diagram: the mark, relocate, and remap phases form one continuous, repeating compaction cycle]

Mark Phase
- Finds all live objects in the Java heap
- Concurrent and predictable: always completes in a single pass
- Uses the LVB to defeat concurrent marking races
- Tracks which references have been traversed using an NMT ("not marked through") metadata bit in each object reference
  - Any access to a not-yet-traversed reference triggers the LVB
  - Triggered references are queued on collector work lists, and the reference's NMT state is corrected
  - Self-healing: the memory location the reference was loaded from is corrected as well
- The marker tracks the total live memory in each memory page
  - Compaction uses this to go after the sparsest pages first (a sketch of the accounting follows below)
  - (But each cycle will tend to compact the entire heap)
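As a small illustration of the per-page accounting described above, a marker could accumulate live bytes per 2MB page roughly like this. The names and sizes are assumptions for the sketch, not C4's actual bookkeeping:

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 21                  /* assumed 2MB heap pages        */
#define N_PAGES    (1u << 18)          /* illustrative bound: 512GB heap */

static size_t page_live_bytes[N_PAGES];    /* zeroed at cycle start */

/* Called once per object the marker proves live; heap_offset is the
   object's offset from the heap base. */
static void mark_account(uintptr_t heap_offset, size_t obj_size) {
    page_live_bytes[heap_offset >> PAGE_SHIFT] += obj_size;
}

/* Relocation then visits the sparsest pages first: the less live data
   a page holds, the more space each byte of copying reclaims. */
static int worth_compacting(size_t page_index, size_t page_size) {
    return page_live_bytes[page_index] < page_size / 4;  /* arbitrary cut */
}
```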

Relocate Phase
- Compacts without stopping the mutator, reclaiming the heap space occupied by dead objects in "from" pages
- Protects the from pages, and uses the LVB to support concurrent relocation and lazy remapping by triggering on any access to a reference into a from page
- Relocates all live objects into newly allocated "to" pages
- Maintains forwarding pointers outside of the from pages
- The virtual from space cannot be recycled until all references to relocated objects are remapped
- "Quick release": physical memory can be reclaimed immediately and used to feed further compaction or allocation (see the sketch after this list)
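A rough C sketch of a relocate step with quick release, assuming a side table of forwarding entries kept per from-page; every name here (os_protect_page, os_release_physical, and so on) is hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

typedef uintptr_t ref_t;

/* Forwarding entries live OUTSIDE the from page, so the page's
   physical memory can be released before remapping finishes. */
typedef struct { ref_t old_addr, new_addr; } fwd_entry_t;

extern fwd_entry_t *fwd_table_for_page(void *from_page);
extern void  os_protect_page(void *page);         /* loads now trigger LVB */
extern void  os_release_physical(void *page);     /* quick release: drop the
                                                     physical backing, keep
                                                     the virtual range       */
extern ref_t copy_object_to_new_page(ref_t old);  /* allocate in a to-page  */

static void relocate_page(void *from_page, ref_t *live, size_t n_live) {
    fwd_entry_t *fwd = fwd_table_for_page(from_page);
    os_protect_page(from_page);            /* from here on, mutator loads of
                                              refs into this page will trap */
    for (size_t i = 0; i < n_live; i++) {
        fwd[i].old_addr = live[i];
        fwd[i].new_addr = copy_object_to_new_page(live[i]);
    }
    /* All live data is out: the physical memory can feed further
       compaction or allocation NOW, even though stale refs still exist.
       The virtual range stays reserved until the remap phase completes. */
    os_release_physical(from_page);
}
```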

Remap Phase
- Scans all live objects in the heap
- Looks for references to previously relocated objects and updates ("remaps") them to point to the new object locations
- Uses the LVB to support lazy remapping
  - Any access to a not-yet-remapped reference triggers the LVB
  - Triggered references are corrected to point to the object's new location by consulting the forwarding pointers (sketched below)
  - Self-healing: the memory location the reference was loaded from is corrected as well
- Overlaps with the next mark phase's live-object scan: mark and remap are executed as a single pass
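The remap slow path is then little more than a forwarding-table lookup plus the same self-healing store shown in the LVB sketch earlier (hypothetical names again):

```c
#include <stdint.h>

typedef uintptr_t ref_t;

extern ref_t fwd_lookup(ref_t stale);   /* consult the out-of-page fwd table */

/* Called from the LVB slow path when a loaded ref points into a
   protected from-page. */
static ref_t remap_ref(ref_t *addr, ref_t stale) {
    ref_t fresh = fwd_lookup(stale);     /* the object's new location   */
    *addr = fresh;                       /* self-heal: fix the source slot */
    return fresh;
}
```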

The C4 GC Cycle
[Diagram repeated: the mark, relocate, and remap phases form one continuous, repeating compaction cycle]

Special Generational considerations
- Multiple young-gen collections must be able to complete within a single old-gen collection
  - Otherwise, the generational filter would be missing
- C4 runs young- and old-generation collections concurrently with each other
  - Young-gen effects are not atomic as seen by the old gen, and vice versa
  - Old-gen roots include moving targets in the young gen
- Every old-gen cycle kicks off a "starting" young-gen cycle
  - The starting young-gen cycle produces a young-to-old root-set stream
  - The old-gen marker concurrently consumes that stream (pictured in the sketch after this list)
- Some additional synchronization is needed
  - E.g., the young gen may need an object size located in an old-gen class object that is being relocated
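One way to picture the young-to-old root-set stream is as a concurrent queue, with the young-gen collector producing and the old-gen marker consuming. The single-producer/single-consumer ring below is purely illustrative; the talk does not describe C4's actual structure or synchronization at this level.

```c
#include <stdint.h>
#include <stdatomic.h>

typedef uintptr_t ref_t;
#define STREAM_CAP 4096

/* Young-gen collector pushes old-gen roots it discovers;
   the old-gen marker pops them concurrently. */
typedef struct {
    ref_t slots[STREAM_CAP];
    _Atomic size_t head, tail;
} root_stream_t;

static int stream_push(root_stream_t *s, ref_t young_to_old_root) {
    size_t t = atomic_load(&s->tail);
    if (t - atomic_load(&s->head) == STREAM_CAP) return 0;  /* full */
    s->slots[t % STREAM_CAP] = young_to_old_root;
    atomic_store(&s->tail, t + 1);                          /* publish */
    return 1;
}

static int stream_pop(root_stream_t *s, ref_t *out) {
    size_t h = atomic_load(&s->head);
    if (h == atomic_load(&s->tail)) return 0;               /* empty */
    *out = s->slots[h % STREAM_CAP];
    atomic_store(&s->head, h + 1);
    return 1;
}
```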

Additional Contributions
Tiered allocation spaces:
- C4's concurrent compaction requires that objects not span relocation page boundaries, which creates potential space waste
- The presented tiered allocation spaces method can bound worst-case space waste to an arbitrarily chosen level
- Tiered allocation spaces also bound the worst-case latency a mutator will encounter when cooperatively relocating an object
OS kernel enhancements:
- C4's page life cycle relies on OS virtual memory manipulation
- Sustaining modern server throughput requires a higher manipulation rate than most modern OSs can support (e.g., a page-unmap rate high enough to match quick-release behavior)
- We present new OS kernel APIs that provide much higher-throughput virtual memory manipulation (a hypothetical sketch of such an interface follows below)
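The talk does not spell out the kernel API here, so the following is a hypothetical C sketch of what a batched, commit-at-once virtual memory interface in that spirit might look like; every identifier is invented for illustration. The key ideas it encodes are from the later slides: operations accumulate invisibly in a shadow table, 2MB mappings are used throughout, and a single commit flips the whole batch.

```c
#include <stddef.h>

typedef struct vm_batch vm_batch;               /* opaque shadow table */

/* Hypothetical batched VM-manipulation API. Nothing below is visible to
   mutators until commit, which installs the shadow table near-atomically
   (~one pointer copy per GB of mappings) instead of issuing per-page
   syscalls with TLB invalidates under a global lock. */
vm_batch *vm_batch_begin(void);
int  vm_batch_remap(vm_batch *b, void *from, void *to, size_t len);
int  vm_batch_protect(vm_batch *b, void *addr, size_t len, int prot);
int  vm_batch_unmap_physical(vm_batch *b, void *addr, size_t len);
int  vm_batch_commit(vm_batch *b);

/* Example GC-side use: flip a batch of relocated pages at a safepoint. */
extern void bring_mutators_to_safepoint(void);
extern void release_mutators(void);

void flip_relocated_pages(void **pages, size_t n, size_t page_len) {
    vm_batch *b = vm_batch_begin();
    for (size_t i = 0; i < n; i++)
        vm_batch_protect(b, pages[i], page_len, 0 /* i.e. PROT_NONE */);
    bring_mutators_to_safepoint();
    vm_batch_commit(b);     /* the only work done while mutators wait */
    release_mutators();
}
```

The design point is that mutators only pause across vm_batch_commit, which installs the pre-built shadow table, rather than across thousands of individual mremap()/mprotect() calls.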

Results
- Focus: maintaining consistent response times while using multi-GB heaps and live sets, and sustaining multi-GB/sec allocation rates
- It is surprisingly hard to find standard benchmarks that both scale to modern server capacities and include long-running enterprise application behavior
- Workload: a modified 4-warehouse SPECjbb2005 run, changed to include a modest 2GB object cache churning at a slow rate (20MB/sec), and to measure transaction response times
- We compared the observed worst-case response times under different collectors
- Oh, and we ran all setups long enough to see an old-gen compaction event occur.

Sample throughput: SPECjbb + slowly churning 2GB LRU cache
[Chart: measured throughput across the compared collectors]
- Live set is ~2.5GB across all measurements
- Allocation rate is ~1.2GB/sec across all measurements

Sample responsiveness improvement: SPECjbb + slowly churning 2GB LRU cache
[Chart: observed worst-case transaction response times across the compared collectors]
- Live set is ~2.5GB across all measurements
- Allocation rate is ~1.2GB/sec across all measurements

Design goal: be insensitive to
- Heap size
- Allocation rate
- Mutation rate
- Locality
- Non-generational behavior

Q & A
The C4 Collector

Sustainable Remap Rates
Per 2MB of allocation: one map, one remap/protect, one unmap
- Need to keep up with the sustained allocation rate
  - A modern x86 core will happily generate ~0.5GB/sec of garbage
- (m)remapping pages gets only a small part of the GC cycle
  - With a healthy GC duty cycle at ~20%, mremap is ~5% of the GC cycle
  - So we need to sustain 100s of GB/sec in mremap rate (the arithmetic is worked below)
- Linux remaps sustain <1GB/sec
  - Dominated by unneeded semantics: TLB invalidates, 4KB mappings, global locking, ...
- The enhanced kernel supports >6TB/sec sustained remap rates
  - Avoids in-process implicit TLB invalidates; uses 2MB mappings
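To make the "100s of GB/sec" figure concrete, here is the back-of-the-envelope arithmetic the slide implies. The ~20% duty cycle, ~5% share, and ~0.5 GB/sec per core are the slide's numbers; the 8-core configuration is an assumed example.

```latex
% Every 2MB of allocation implies a matching map / remap+protect / unmap,
% so remap bytes must keep pace with the allocation rate A over time.
% mremap work gets ~5% of a ~20% GC duty cycle of wall-clock time:
\text{remap wall-clock share} \approx 0.20 \times 0.05 = 0.01
% So the burst remap rate must be roughly 100x the allocation rate:
\text{required remap rate} \approx \frac{A}{0.01} = 100\,A
% Assumed example: 8 cores at ~0.5 GB/s of garbage each gives
% A of ~4 GB/s, hence ~400 GB/s of required remap throughput;
% hundreds of GB/s, versus the <1 GB/s that stock Linux sustains.
```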

Sustained Remap Rates

Active threads   Mainline Linux    w/ Azul Memory Module   Speedup
 1               3.04 GB/sec       6.50 TB/sec             >2,000x
 2               1.82 GB/sec       6.09 TB/sec             >3,000x
 4               1.19 GB/sec       6.08 TB/sec             >5,000x
 8               897.65 MB/sec     6.29 TB/sec             >7,000x
12               736.65 MB/sec     6.39 TB/sec             >8,000x

Remap Commit Rates
- Remapping/protection must become visible consistently across all mutator threads
  - Each batch of relocated pages needs synchronization
  - In practical terms: bring the mutators to a safepoint and flip the pages
- Using Linux mremap(), protecting 16GB would take ~20 sec
- The enhanced kernel supports >800TB/sec remap commit rates
  - Uses a shadow table and a batched remap/protect-ops API
  - Accumulated batch operations are not visible until committed
  - Committing the shadow table costs ~1 pointer copy per GB
  - Protecting 16GB takes about ~22 usec

Remap Commit Rates (commit rate, and in parentheses the time it would take to commit 16GB)

Active threads   Mainline Linux           w/ Azul Memory Module      Speedup
 0               43.58 GB/sec (360 ms)    4734.85 TB/sec (3 usec)    >100,000x
 1               3.04 GB/sec (5 sec)      1488.10 TB/sec (11 usec)   >480,000x
 2               1.82 GB/sec (8 sec)      1166.04 TB/sec (14 usec)   >640,000x
 4               1.19 GB/sec (13 sec)     913.74 TB/sec (18 usec)    >750,000x
 8               897.65 MB/sec (18 sec)   801.28 TB/sec (20 usec)    >890,000x
12               736.65 MB/sec (21 sec)   740.52 TB/sec (22 usec)    >1,000,000x

New programming models?
- The coherent, shared-memory SMP model has endured; that's how people program. Still...
- Over the past 40 years, new programming models have been proposed whenever we ran into a new architectural limit
  - They usually involve some sort of loosely coupled memory
  - New models are generally useful at mega-scale (a moving target)
  - They don't survive (for long) within a physical machine
- 64KB not enough? (early 1980s)
  - 20-bit segmented memory for 16-bit processors (the birth of x86)
- 640KB not enough? (early 1990s)
  - 32-bit operating systems, even in the commodity/desktop world

The hard things to do in GC
- Robust concurrent marking
  - Refs keep changing
  - Multi-pass marking is sensitive to mutation rate
  - Weak, soft, and final references are hard to process concurrently
- [Concurrent] compaction
  - It's not the moving of the objects...
  - ...it's the fixing of all the references that point to them
  - How do you deal with a mutator looking at a stale reference?
  - If you can't, then remapping is a stop-the-world operation
- Without solving compaction, GC won't be solved
  - All current commercial server JVMs and GCs perform compaction
  - Azul ships the only commercial JVMs that compact concurrently

Garbage Collection & Compaction
- Compaction is inevitable: you can delay it, but you cannot get rid of it
  - Compacting anything requires scanning and fixing all references to it
  - That is usually the worst possible thing that can happen in a GC
- Delay tactics focus on claiming the easy empty space first; this is the focus of the vast majority of GC tuning
  - Most objects die young, so collect young objects only, as much as possible; but eventually, some old dead objects must be reclaimed
  - Most old dead space can be reclaimed without moving anything, so track dead space in free lists and reuse it in place; but eventually, the space gets fragmented and objects need to be moved
- Eventually, all collectors compact the heap

HotSpot CMS Collector: mechanism classification
- Stop-the-world compacting new gen (ParNew)
- Mostly-concurrent, non-compacting old gen (CMS)
  - Mostly-concurrent marking
    - Marks concurrently while the mutator is running
    - Tracks mutations in card marks
    - Revisits mutated cards (repeating as needed)
    - Stops the world to catch up on mutations, reference processing, etc.
  - Concurrent sweeping
  - Does not compact (maintains free lists; does not move objects)
- Falls back to a full collection (stop-the-world)
  - Used for compaction, etc.

HotSpot Garbage-First (aka G1) Collector: mechanism classification
- Experimental: -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC
- Stop-the-world compacting new gen
- Mostly-concurrent old-gen marker
  - Mostly-concurrent marking
  - Tracks inter-region relationships in remembered sets
- Stop-the-world incremental compacting old gen
  - Objective: avoid, as much as possible, ever doing a full GC
  - Compacts sets of regions that can be scanned in limited time
  - Delays compaction of popular objects and popular regions
- Falls back to a full collection (stop-the-world)
  - Used for compacting popular objects and popular regions