The C4 Collector
Or: the application memory wall will remain until compaction is solved
Gil Tene, Balaji Iyengar, Michael Wolf

High Level Agenda
1. The Application Memory Wall
2. Generational collection for modern servers
3. C4 algorithm basics
4. Special generational considerations
5. Additional contributions
6. Results

The Application Memory Wall

Memory
How many of you use heap sizes of:
- more than ½ GB?
- more than 1 GB?
- more than 2 GB?
- more than 4 GB?
- more than 10 GB?
- more than 20 GB?
- more than 50 GB?
- more than 100 GB?

Reality check: memory in 2011
Retail prices of common commodity servers (June 2011):
- 24 vcore, 96GB server: ~$6.5K
- 32 vcore, 256GB server: ~$20K
- 64 vcore, 512GB server: ~$35K
- 96 vcore, 1TB server: ~$80K
Cheap (<$2/GB/month), and roughly linear to ~1TB

How much memory do applications need?
"640K ought to be enough for anybody" WRONG!
So what's the right number?
- 6,400K (6.4MB)?
- 64,000K (64MB)?
- 640,000K (640MB)?
- 6,400,000K (6.4GB)?
- 64,000,000K (64GB)?
There is no right number; the target moves at ~50x-100x per decade.

Tiny application history
Moore's Law: if transistor counts grow at ~2x every 18 months, that's ~100x every 10 years
- 2010: ??? GB apps on 256GB servers
- 2000: 1GB apps on a 2-4GB server
- 1990: 10MB apps on a 32-64MB server
- 1980: 100KB apps on a ¼ to ½ MB server

Why is there an application memory wall?
- GC is a clear and dominant cause
- There seems to be a practical heap size limit for applications with responsiveness requirements
  - A 100GB heap won't crash; it will just periodically pause for several minutes at a time
- [Virtually] all current commercial JVMs will exhibit multi-second pauses on a normally utilized 2-4GB heap
  - It's a question of when and how often, not if
  - GC tuning only moves the "when" and the "how often" around
- [Inevitable] compaction dominates the pauses
- The C4 collector is focused on removing the application memory wall in enterprise server environments
  - Focus: compaction that no longer hurts responsiveness

Generational Collection for modern servers
- A modern server: 100s of GB of memory and 10s of cores, each allocating at ¼-½ GB/sec
- A practical collector must work within this envelope without incurring significant program pauses
- Generational collection is a practical necessity for keeping up with modern processor throughputs
- The young generation is typically collected in a stop-the-world, full copying/promotion pass
  - But at modern server capacities, even the young generation will commonly hold live sets that are several GB in size
- Young generation collection needs to become either concurrent or incremental

A C4 invention: A Concurrent Young Generation Collector
- Historically, ALL generational collectors have used stop-the-world, full-cycle young-gen collection
- C4 is the first known collector to use a non-stop-the-world young generation (first shipped in 2006)
- There is currently only one incremental young-gen collector we are aware of: "Generational Real-Time GC: A Three-Part Invention for Young Objects", ECOOP '07 [11]
  - Motivated by a similar wish to keep up with throughput
- There are currently no other concurrent young-gen collectors that we are aware of

The C4 Collector
- Concurrent, compacting new generation
- Concurrent, compacting old generation
- Concurrent, guaranteed-single-pass markers
  - Oblivious to mutation, insensitive to mutation rate
- Concurrent compactors
  - Objects are moved without stopping the mutator
  - References are remapped without stopping the mutator
  - Can relocate an entire generation (new or old) in every GC cycle
- No stop-the-world fallback
  - Always compacts, and always does so concurrently

C4 Algorithm Basics

C4 algorithm highlights
- The same core mechanism is used for both generations: concurrent mark-compact
- A Loaded Value Barrier (LVB) is central to the algorithm (see the sketch after this list)
  - Every heap reference is verified as "sane" when loaded
  - Non-sane refs are caught and fixed in a self-healing barrier
  - Refs that have not yet been marked through are caught
    - Enables a guaranteed single-pass concurrent marker
  - Refs that point to relocated objects are caught
    - Lazily (and concurrently) remapped; there is no hurry
- Relocation and remapping are both concurrent
- Uses "quick release" to recycle memory
  - Forwarding information is kept outside of the object pages
  - Physical memory is released immediately upon relocation
  - Enables hand-over-hand compaction without requiring empty memory
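To make the barrier concrete, here is a minimal C sketch of what an LVB-style read barrier could look like. It is a sketch under assumed names (lvb_load, expected_nmt, points_into_from_page, and collector_fix_ref are all hypothetical), not C4's actual implementation; in a real JVM the fast path is a few JIT-emitted instructions on every reference load.

```c
#include <stdint.h>
#include <stdbool.h>

typedef uintptr_t ref_t;

/* Hypothetical globals/helpers, for illustration only. */
extern bool expected_nmt;                    /* current "marked-through" phase value   */
extern bool ref_nmt_bit(ref_t r);            /* NMT metadata bit carried in the ref    */
extern bool points_into_from_page(ref_t r);  /* true if target page is being relocated */
extern ref_t collector_fix_ref(ref_t r);     /* mark-through and/or remap via fwd ptr  */

/* LVB: runs on EVERY reference load from the heap. */
static inline ref_t lvb_load(ref_t *addr) {
    ref_t r = *addr;
    /* Fast path: ref is already marked through and does not point into a
       page under relocation; the common case is just a test and branch. */
    if (r != 0 &&
        (ref_nmt_bit(r) != expected_nmt || points_into_from_page(r))) {
        /* Slow path: queue for marking and/or remap to the new location. */
        r = collector_fix_ref(r);
        /* Self-healing: store the corrected ref back, so this memory
           location never triggers the barrier again this phase. */
        *addr = r;
    }
    return r;
}
```

The self-healing store is what keeps the barrier's cost bounded: each heap slot can trigger at most once per phase, so slow-path work is limited by the number of live reference slots rather than by the mutation rate.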

The C4 GC Cycle
[Diagram: the mark, relocate, and remap phases form one continuous, repeating compaction cycle]

Mark Phase
- Finds all live objects in the Java heap
- Concurrent and predictable: always completes in a single pass
- Uses the LVB to defeat concurrent marking races
- Tracks which references have been traversed using an NMT ("not marked through") metadata bit in each object reference
  - Any access to a not-yet-traversed reference triggers the LVB
  - Triggered references are queued on collector work lists, and the reference's NMT state is corrected
  - Self-healing: the memory location the reference was loaded from is corrected as well
- The marker tracks the total live memory in each memory page
  - Compaction uses this to go after the sparsest pages first (a sketch of the accounting follows below)
  - (But each cycle will tend to compact the entire heap)
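As a small illustration of the per-page accounting described above, a marker could accumulate live bytes per 2MB page roughly like this. The names and sizes are assumptions for the sketch, not C4's actual bookkeeping:

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 21                  /* assumed 2MB heap pages        */
#define N_PAGES    (1u << 18)          /* illustrative bound: 512GB heap */

static size_t page_live_bytes[N_PAGES];    /* zeroed at cycle start */

/* Called once per object the marker proves live; heap_offset is the
   object's offset from the heap base. */
static void mark_account(uintptr_t heap_offset, size_t obj_size) {
    page_live_bytes[heap_offset >> PAGE_SHIFT] += obj_size;
}

/* Relocation then visits the sparsest pages first: the less live data
   a page holds, the more space each byte of copying reclaims. */
static int worth_compacting(size_t page_index, size_t page_size) {
    return page_live_bytes[page_index] < page_size / 4;  /* arbitrary cut */
}
```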

Relocate Phase
- Compacts without stopping the mutator, reclaiming the heap space occupied by dead objects in "from" pages
- Protects the from pages, and uses the LVB to support concurrent relocation and lazy remapping by triggering on any access to a reference into a from page
- Relocates all live objects into newly allocated "to" pages
- Maintains forwarding pointers outside of the from pages
- The virtual from space cannot be recycled until all references to relocated objects are remapped
- "Quick release": physical memory can be reclaimed immediately and used to feed further compaction or allocation (see the sketch after this list)
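A rough C sketch of a relocate step with quick release, assuming a side table of forwarding entries kept per from-page; every name here (os_protect_page, os_release_physical, and so on) is hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

typedef uintptr_t ref_t;

/* Forwarding entries live OUTSIDE the from page, so the page's
   physical memory can be released before remapping finishes. */
typedef struct { ref_t old_addr, new_addr; } fwd_entry_t;

extern fwd_entry_t *fwd_table_for_page(void *from_page);
extern void  os_protect_page(void *page);         /* loads now trigger LVB */
extern void  os_release_physical(void *page);     /* quick release: drop the
                                                     physical backing, keep
                                                     the virtual range       */
extern ref_t copy_object_to_new_page(ref_t old);  /* allocate in a to-page  */

static void relocate_page(void *from_page, ref_t *live, size_t n_live) {
    fwd_entry_t *fwd = fwd_table_for_page(from_page);
    os_protect_page(from_page);            /* from here on, mutator loads of
                                              refs into this page will trap */
    for (size_t i = 0; i < n_live; i++) {
        fwd[i].old_addr = live[i];
        fwd[i].new_addr = copy_object_to_new_page(live[i]);
    }
    /* All live data is out: the physical memory can feed further
       compaction or allocation NOW, even though stale refs still exist.
       The virtual range stays reserved until the remap phase completes. */
    os_release_physical(from_page);
}
```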

Remap Phase
- Scans all live objects in the heap
- Looks for references to previously relocated objects and updates ("remaps") them to point to the new object locations
- Uses the LVB to support lazy remapping
  - Any access to a not-yet-remapped reference triggers the LVB
  - Triggered references are corrected to point to the object's new location by consulting the forwarding pointers (sketched below)
  - Self-healing: the memory location the reference was loaded from is corrected as well
- Overlaps with the next mark phase's live-object scan: mark and remap are executed as a single pass
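The remap slow path is then little more than a forwarding-table lookup plus the same self-healing store shown in the LVB sketch earlier (hypothetical names again):

```c
#include <stdint.h>

typedef uintptr_t ref_t;

extern ref_t fwd_lookup(ref_t stale);   /* consult the out-of-page fwd table */

/* Called from the LVB slow path when a loaded ref points into a
   protected from-page. */
static ref_t remap_ref(ref_t *addr, ref_t stale) {
    ref_t fresh = fwd_lookup(stale);     /* the object's new location   */
    *addr = fresh;                       /* self-heal: fix the source slot */
    return fresh;
}
```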

The C4 GC Cycle
[Diagram repeated: the mark, relocate, and remap phases form one continuous, repeating compaction cycle]

Special Generational considerations
- Multiple young-gen collections must be able to complete within a single old-gen collection
  - Otherwise, the generational filter would be missing
- C4 runs young- and old-generation collections concurrently with each other
  - Young-gen effects are not atomic as seen by the old gen, and vice versa
  - Old-gen roots include moving targets in the young gen
- Every old-gen cycle kicks off a "starting" young-gen cycle
  - The starting young-gen cycle produces a young-to-old root-set stream
  - The old-gen marker concurrently consumes that stream (pictured in the sketch after this list)
- Some additional synchronization is needed
  - E.g., the young gen may need an object size located in an old-gen class object that is being relocated
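One way to picture the young-to-old root-set stream is as a concurrent queue, with the young-gen collector producing and the old-gen marker consuming. The single-producer/single-consumer ring below is purely illustrative; the talk does not describe C4's actual structure or synchronization at this level.

```c
#include <stdint.h>
#include <stdatomic.h>

typedef uintptr_t ref_t;
#define STREAM_CAP 4096

/* Young-gen collector pushes old-gen roots it discovers;
   the old-gen marker pops them concurrently. */
typedef struct {
    ref_t slots[STREAM_CAP];
    _Atomic size_t head, tail;
} root_stream_t;

static int stream_push(root_stream_t *s, ref_t young_to_old_root) {
    size_t t = atomic_load(&s->tail);
    if (t - atomic_load(&s->head) == STREAM_CAP) return 0;  /* full */
    s->slots[t % STREAM_CAP] = young_to_old_root;
    atomic_store(&s->tail, t + 1);                          /* publish */
    return 1;
}

static int stream_pop(root_stream_t *s, ref_t *out) {
    size_t h = atomic_load(&s->head);
    if (h == atomic_load(&s->tail)) return 0;               /* empty */
    *out = s->slots[h % STREAM_CAP];
    atomic_store(&s->head, h + 1);
    return 1;
}
```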

Additional Contributions
Tiered allocation spaces:
- C4's concurrent compaction requires that objects not span relocation page boundaries, which creates potential space waste
- The presented tiered allocation spaces method can bound worst-case space waste to an arbitrarily chosen level
- Tiered allocation spaces also bound the worst-case latency a mutator will encounter when cooperatively relocating an object
OS kernel enhancements:
- C4's page life cycle relies on OS virtual memory manipulation
- Sustaining modern server throughput requires a higher manipulation rate than most modern OSs can support (e.g., a page-unmap rate high enough to match quick-release behavior)
- We present new OS kernel APIs that provide much higher-throughput virtual memory manipulation (a hypothetical sketch of such an interface follows below)
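The talk does not spell out the kernel API here, so the following is a hypothetical C sketch of what a batched, commit-at-once virtual memory interface in that spirit might look like; every identifier is invented for illustration. The key ideas it encodes are from the later slides: operations accumulate invisibly in a shadow table, 2MB mappings are used throughout, and a single commit flips the whole batch.

```c
#include <stddef.h>

typedef struct vm_batch vm_batch;               /* opaque shadow table */

/* Hypothetical batched VM-manipulation API. Nothing below is visible to
   mutators until commit, which installs the shadow table near-atomically
   (~one pointer copy per GB of mappings) instead of issuing per-page
   syscalls with TLB invalidates under a global lock. */
vm_batch *vm_batch_begin(void);
int  vm_batch_remap(vm_batch *b, void *from, void *to, size_t len);
int  vm_batch_protect(vm_batch *b, void *addr, size_t len, int prot);
int  vm_batch_unmap_physical(vm_batch *b, void *addr, size_t len);
int  vm_batch_commit(vm_batch *b);

/* Example GC-side use: flip a batch of relocated pages at a safepoint. */
extern void bring_mutators_to_safepoint(void);
extern void release_mutators(void);

void flip_relocated_pages(void **pages, size_t n, size_t page_len) {
    vm_batch *b = vm_batch_begin();
    for (size_t i = 0; i < n; i++)
        vm_batch_protect(b, pages[i], page_len, 0 /* i.e. PROT_NONE */);
    bring_mutators_to_safepoint();
    vm_batch_commit(b);     /* the only work done while mutators wait */
    release_mutators();
}
```

The design point is that mutators only pause across vm_batch_commit, which installs the pre-built shadow table, rather than across thousands of individual mremap()/mprotect() calls.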

Results
- Focus: maintaining consistent response times while using multi-GB heaps and live sets, and sustaining multi-GB/sec allocation rates
- It is surprisingly hard to find standard benchmarks that both scale to modern server capacities and include long-running enterprise application behavior
- Workload: a modified 4-warehouse SPECjbb2005 run, changed to include a modest 2GB object cache churning at a slow rate (20MB/sec), and to measure transaction response times
- We compared the observed worst-case response times under different collectors
- Oh, and we ran all setups long enough to see an old-gen compaction event occur.

Sample throughput: SPECjbb + slowly churning 2GB LRU cache
[Chart: measured throughput across the compared collectors]
- Live set is ~2.5GB across all measurements
- Allocation rate is ~1.2GB/sec across all measurements

Sample responsiveness improvement: SPECjbb + slowly churning 2GB LRU cache
[Chart: observed worst-case transaction response times across the compared collectors]
- Live set is ~2.5GB across all measurements
- Allocation rate is ~1.2GB/sec across all measurements

Design goal: be insensitive to
- Heap size
- Allocation rate
- Mutation rate
- Locality
- Non-generational behavior

Q & A
The C4 Collector

Sustainable Remap Rates
Per 2MB of allocation: one map, one remap/protect, one unmap
- Need to keep up with the sustained allocation rate
  - A modern x86 core will happily generate ~0.5GB/sec of garbage
- (m)remapping pages gets only a small part of the GC cycle
  - With a healthy GC duty cycle at ~20%, mremap is ~5% of the GC cycle
  - So we need to sustain 100s of GB/sec in mremap rate (the arithmetic is worked below)
- Linux remaps sustain <1GB/sec
  - Dominated by unneeded semantics: TLB invalidates, 4KB mappings, global locking, ...
- The enhanced kernel supports >6TB/sec sustained remap rates
  - Avoids in-process implicit TLB invalidates; uses 2MB mappings
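To make the "100s of GB/sec" figure concrete, here is the back-of-the-envelope arithmetic the slide implies. The ~20% duty cycle, ~5% share, and ~0.5 GB/sec per core are the slide's numbers; the 8-core configuration is an assumed example.

```latex
% Every 2MB of allocation implies a matching map / remap+protect / unmap,
% so remap bytes must keep pace with the allocation rate A over time.
% mremap work gets ~5% of a ~20% GC duty cycle of wall-clock time:
\text{remap wall-clock share} \approx 0.20 \times 0.05 = 0.01
% So the burst remap rate must be roughly 100x the allocation rate:
\text{required remap rate} \approx \frac{A}{0.01} = 100\,A
% Assumed example: 8 cores at ~0.5 GB/s of garbage each gives
% A of ~4 GB/s, hence ~400 GB/s of required remap throughput;
% hundreds of GB/s, versus the <1 GB/s that stock Linux sustains.
```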

Sustained Remap Rates

Active threads   Mainline Linux    w/ Azul Memory Module   Speedup
 1               3.04 GB/sec       6.50 TB/sec             >2,000x
 2               1.82 GB/sec       6.09 TB/sec             >3,000x
 4               1.19 GB/sec       6.08 TB/sec             >5,000x
 8               897.65 MB/sec     6.29 TB/sec             >7,000x
12               736.65 MB/sec     6.39 TB/sec             >8,000x

Remap Commit Rates
- Remapping/protection must become visible consistently across all mutator threads
  - Each batch of relocated pages needs synchronization
  - In practical terms: bring the mutators to a safepoint and flip the pages
- Using Linux mremap(), protecting 16GB would take ~20 sec
- The enhanced kernel supports >800TB/sec remap commit rates
  - Uses a shadow table and a batched remap/protect-ops API
  - Accumulated batch operations are not visible until committed
  - Committing the shadow table costs ~1 pointer copy per GB
  - Protecting 16GB takes about ~22 usec

Remap Commit Rates (commit rate, and in parentheses the time it would take to commit 16GB)

Active threads   Mainline Linux           w/ Azul Memory Module      Speedup
 0               43.58 GB/sec (360 ms)    4734.85 TB/sec (3 usec)    >100,000x
 1               3.04 GB/sec (5 sec)      1488.10 TB/sec (11 usec)   >480,000x
 2               1.82 GB/sec (8 sec)      1166.04 TB/sec (14 usec)   >640,000x
 4               1.19 GB/sec (13 sec)     913.74 TB/sec (18 usec)    >750,000x
 8               897.65 MB/sec (18 sec)   801.28 TB/sec (20 usec)    >890,000x
12               736.65 MB/sec (21 sec)   740.52 TB/sec (22 usec)    >1,000,000x

New programming models?
- The coherent, shared-memory SMP model has endured; that's how people program. Still...
- Over the past 40 years, new programming models have been proposed whenever we ran into a new architectural limit
  - They usually involve some sort of loosely coupled memory
  - New models are generally useful at mega-scale (a moving target)
  - They don't survive (for long) within a physical machine
- 64KB not enough? (early 1980s)
  - 20-bit segmented memory for 16-bit processors (the birth of x86)
- 640KB not enough? (early 1990s)
  - 32-bit operating systems, even in the commodity/desktop world

The hard things to do in GC
- Robust concurrent marking
  - Refs keep changing
  - Multi-pass marking is sensitive to mutation rate
  - Weak, soft, and final references are hard to process concurrently
- [Concurrent] compaction
  - It's not the moving of the objects...
  - ...it's the fixing of all the references that point to them
  - How do you deal with a mutator looking at a stale reference?
  - If you can't, then remapping is a stop-the-world operation
- Without solving compaction, GC won't be solved
  - All current commercial server JVMs and GCs perform compaction
  - Azul ships the only commercial JVMs that compact concurrently

Garbage Collection & Compaction
- Compaction is inevitable: you can delay it, but you cannot get rid of it
  - Compacting anything requires scanning and fixing all references to it
  - That is usually the worst possible thing that can happen in a GC
- Delay tactics focus on claiming the easy empty space first; this is the focus of the vast majority of GC tuning
  - Most objects die young, so collect young objects only, as much as possible; but eventually, some old dead objects must be reclaimed
  - Most old dead space can be reclaimed without moving anything, so track dead space in free lists and reuse it in place; but eventually, the space gets fragmented and objects need to be moved
- Eventually, all collectors compact the heap

HotSpot CMS Collector: mechanism classification
- Stop-the-world compacting new gen (ParNew)
- Mostly-concurrent, non-compacting old gen (CMS)
  - Mostly-concurrent marking
    - Marks concurrently while the mutator is running
    - Tracks mutations in card marks
    - Revisits mutated cards (repeating as needed)
    - Stops the world to catch up on mutations, reference processing, etc.
  - Concurrent sweeping
  - Does not compact (maintains free lists; does not move objects)
- Falls back to a full collection (stop-the-world)
  - Used for compaction, etc.

HotSpot Garbage-First (aka G1) Collector: mechanism classification
- Experimental: -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC
- Stop-the-world compacting new gen
- Mostly-concurrent old-gen marker
  - Mostly-concurrent marking
  - Tracks inter-region relationships in remembered sets
- Stop-the-world incremental compacting old gen
  - Objective: avoid, as much as possible, ever doing a full GC
  - Compacts sets of regions that can be scanned in limited time
  - Delays compaction of popular objects and popular regions
- Falls back to a full collection (stop-the-world)
  - Used for compacting popular objects and popular regions