
Shenandoah GC ...and what it looks like in September 2017. Aleksey Shipilёv, shade@redhat.com, @shipilev

Disclaimers
This talk:
1. ...assumes some knowledge of GC internals: this is an implementors-to-implementors talk, not implementors-to-users; we are here to troll for ideas.
2. ...briefly covers successes and thoroughly covers challenges: mind the availability heuristic that can trick you into thinking the challenges outweigh the successes.
3. ...covers many topics, so if you have blinked and lost the thread of thought, wait a little until the next (ahem) safepoint.

Overview

Overview: Landscape (diagram-only slide)

Overview: Key Idea
Brooks forwarding pointer to help concurrent copying:
- fwdptr is attached to every object, at all times
- fwdptr always points to the most up-to-date (to-space) copy, and gets atomically updated during evacuation
- Barriers maintain the to-space invariant: «All writes happen into the to-space copy»
- Barriers also help to select the to-space copy for reading
- This allows updating the heap references concurrently too
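To make the mechanism concrete, here is a minimal C-style sketch (illustrative only, not HotSpot code; the names object, fwdptr and resolve are made up) of how the extra word lets any thread reach the up-to-date copy. In the real implementation the forwarding word sits one word before the object start, which is why the read barrier later in this talk dereferences obj - 8.

  #include <stddef.h>

  /* One extra word travels with every object; it points back at the object
   * itself until the object is evacuated, and at the to-space copy after.  */
  typedef struct object {
      struct object* fwdptr;     /* Brooks forwarding pointer               */
      /* ... header and Java fields would follow here ...                   */
  } object;

  /* Read barrier: a single dependent load resolves to the current copy.    */
  static inline object* resolve(object* obj) {
      return (obj == NULL) ? NULL : obj->fwdptr;
  }

  /* During evacuation, the copying side installs the new location with an
   * atomic compare-and-swap on fwdptr, so all threads converge on exactly
   * one to-space copy of the object.                                        */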

Overview: GC Cycle
Regular cycle:
1. Snapshot-at-the-beginning concurrent mark
2. Concurrent evacuation
3. Concurrent update references (optional)
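For orientation, a tiny runnable sketch (plain C, just print statements; the phase names follow the GC log shown on the next slide) of the order in which the phases of one regular cycle run:

  #include <stdio.h>

  static void phase(const char* kind, const char* name) {
      printf("%-11s %s\n", kind, name);
  }

  int main(void) {
      phase("Pause",      "Init Mark");          /* scan roots, start SATB marking   */
      phase("Concurrent", "marking");            /* trace the heap alongside the app */
      phase("Pause",      "Final Mark");         /* drain SATB queues, pick cset     */
      phase("Concurrent", "evacuation");         /* copy live objects out of cset    */
      phase("Pause",      "Init Update Refs");   /* optional update-refs phase       */
      phase("Concurrent", "update references");  /* fix heap refs to moved objects   */
      phase("Pause",      "Final Update Refs");  /* fix roots, retire cset regions   */
      phase("Concurrent", "cleanup");            /* recycle the freed regions        */
      return 0;
  }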

Successes

Successes: Almost Concurrent Works!
LRUFragger, 100 GB heap, 80 GB LDS:
  Pause Init Mark                                          0.437ms
  Concurrent marking             76780M->77260M(102400M) 700.185ms
  Pause Final Mark               77260M->77288M(102400M)   0.698ms
  Concurrent cleanup             77288M->77296M(102400M)   0.176ms
  Concurrent evacuation          77296M->85696M(102400M) 405.312ms
  Pause Init Update Refs                                   0.038ms
  Concurrent update references   85700M->85928M(102400M) 319.116ms
  Pause Final Update Refs        85928M->85928M(102400M)   0.351ms
  Concurrent cleanup             85928M->56620M(102400M)  14.316ms

Successes: Concurrent Means Freedom
Mostly-concurrent GC is very liberating!
- No rush doing concurrent phases: a slow concurrent phase means more frequent cycles, which steal more cycles from the application instead of pausing it extensively
- Heuristics mistakes are (usually) much less painful: diminished throughput, not increased pauses
- Control the GC cycle time budget: -XX:ConcGCThreads=...
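For example (a usage sketch, assuming a Shenandoah-enabled build; MyApp stands for your application's main class), capping the concurrent GC workers at two threads trades a longer concurrent phase for more CPU left to the application:

  java -XX:+UseShenandoahGC -XX:ConcGCThreads=2 -Xlog:gc MyApp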

Successes: Progress
Concurrent collector runs GC cycles without blocking mutator progress (translation: BMU/MMU is really good). That means we can do:
- ...thousands of GC cycles per minute: very efficient testing that surfaces the rarest bugs
- ...continuous GC cycles when capacity is overwhelmed: the ultimate sacrifice of throughput for latency
- ...periodic GCs without significant penalty: idle applications get their floating garbage purged

Successes: Non-Generational Workloads
Shenandoah does not need the Generational Hypothesis to hold true in order to operate efficiently. Prime example: LRU/ARC-like in-memory caches.
It would still like GH to be true:
- regions of immediate garbage can be reclaimed right after mark, and the cycle shortcuts
- partial collections may use region age to focus on «younger» regions

Successes: Barriers Injection
Educated bystander concern: where do we inject the barriers?
Reality:
- Most heap accesses from the VM are done via native accessors
- Most VM parts (e.g. compilers) hold on to JNI handles
- Only a very few naked reads and stores are done
The story gets much better with JEP 304 «GC interface». Internal heap verification helps to catch missing barriers.

Successes: Heap Management
Regionalized heap allows (un)committing individual regions. Some users are willing to trade increased peak footprint for better idle footprint: per-region heap uncommit plus periodic GCs.

Successes: Releases
Easy-to-access (development) releases: try it now!
- Development in a separate JDK 10 forest, regular backports to separate JDK 9 and 8u forests
- JDK 8u backports ship in RHEL 7.4+, Fedora 24+
- Nightly development builds (tarballs, Docker images):
  docker run -it --rm shipilev/openjdk:10-shenandoah \
    java -XX:+UseShenandoahGC -Xlog:gc -version

Challenges

Challenges: Footprint Overhead
Shenandoah requires an additional word per object for the forwarding pointer, at all times:
- 1.5x worst case and 1.05-1.10x average overhead
- but it is counted in Java heap, not native structures: easier capacity planning
- the current pointer is uncompressed; compressing it gains nothing due to object alignment constraints
Moving the fwdptr into a synthetic object field promises substantial improvements, but read barriers are already quite expensive.
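A quick back-of-the-envelope check of those numbers (assuming a 64-bit HotSpot heap with a 16-byte minimal object and an 8-byte forwarding word; these are my assumptions, not figures from the slide): the smallest object grows from 16 B to 16 + 8 = 24 B, i.e. the 1.5x worst case, while a more typical 80 B object grows to 88 B, i.e. about 1.1x, matching the quoted 1.05-1.10x average.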

Challenges: Barriers Overhead
Shenandoah requires many more barriers:
1. SATB barriers for regular cycles
2. Write barriers on all stores, not only reference stores
3. Read barriers on almost all heap reads
4. Other exotic flavors of barriers: acmp, CAS, clone, ...

Challenges: Read Barriers
  # Read barrier: dereference via fwdptr
  mov -0x8(%r10),%r10     # obj = *(obj - 8)
  # read the field at offset 0x30
  mov 0x30(%r10),%r10d    # val = *(obj + 0x30)
- Very simple: a single instruction
- Very frequent: before almost every heap read
- Optimizable: moving heap accesses moves the barriers
Accounts for a 0..15% throughput hit, depending on the workload.

Challenges: Write Barriers
  # Read TLS flag and see if evac is enabled
  movzbl 0x3d8(%r15),%r11d   # flag = *(TLS + 0x3d8)
  test   %r11d,%r11d         # if (flag) ...
  jne    OMG-EVAC-ENABLED    # No, no, no!
  # Not enabled: read barrier
  mov    -0x8(%rbp),%r10     # obj = *(obj - 8)
  # Store into the field!
  mov    %r10,0x30(%r10)     # *(obj + 0x30) = r10
Writing to a field? Locking on an object? Computing the identity hash code? Writing a new klass pointer? All of those are object stores, and all require the write barrier.
Writes are rare, the fast path is fast, so the throughput overhead is 0..5%.
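What hides behind the OMG-EVAC-ENABLED slow path, roughly: if evacuation is in progress and the target object sits in the collection set and has not been copied yet, the mutator copies it itself and publishes the copy via a CAS on the forwarding pointer. A conceptual C sketch (illustrative only; the object layout matches the earlier sketch, and helpers such as in_collection_set and gclab_allocate are stand-ins for VM internals):

  #include <string.h>

  typedef struct object { struct object* fwdptr; /* header, fields... */ } object;

  /* Stand-ins for VM internals (declarations only, illustrative). */
  extern int     in_collection_set(object* p);
  extern object* gclab_allocate(size_t size);
  extern size_t  object_size(object* p);
  extern object* cas_fwdptr(object* p, object* expected, object* desired);

  object* write_barrier_slow(object* obj) {
      object* fwd = obj->fwdptr;                    /* Brooks resolve             */
      if (in_collection_set(obj) && fwd == obj) {   /* in cset, not yet copied    */
          object* copy = gclab_allocate(object_size(obj));
          memcpy(copy, obj, object_size(obj));
          copy->fwdptr = copy;                      /* new copy points to itself  */
          fwd = cas_fwdptr(obj, obj, copy);         /* losers adopt winner's copy */
      }
      return fwd;                                   /* all writes go to to-space  */
  }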

Challenges: Exotic Barriers
Shenandoah-specific barriers make sure comparisons work while both copies of an object are reachable: unequal machine pointers no longer imply unequal Java references!
- acmp barrier: on comparison failure, apply read barriers to both operands and compare again
- Java reference comparisons in native VM code have to do the same
- CAS barrier: on CAS failure, do extra work to dodge spurious failures
Normally costs < 1% throughput.
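As a concrete illustration of the acmp case (a conceptual C sketch reusing the illustrative object/fwdptr layout from before, not the actual VM code): two distinct machine pointers may denote the same Java object, so a failed comparison is retried on the resolved pointers.

  typedef struct object { struct object* fwdptr; /* header, fields... */ } object;

  static int acmp(object* a, object* b) {
      if (a == b) return 1;                 /* fast path: bitwise equal          */
      a = a ? a->fwdptr : a;                /* read-barrier both operands        */
      b = b ? b->fwdptr : b;
      return a == b;                        /* compare the to-space copies       */
  }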

Challenges: Compiler Support
The key to coping with barrier overhead is Shenandoah-specific compiler optimizations (this is also the major source of interesting bugs):
- Hoisting read and write barriers out of loops
- Eliminating barriers on known new objects and known constants
- Bypassing read barriers on unordered reads, e.g. finals
- Optimizable barriers represented straight in the IR
- Coalescing barriers: SATB+WB, back-to-back barriers, etc.
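One of the listed optimizations, read-barrier hoisting, in a conceptual C sketch (illustrative only; the real barriers live in the compiler IR, not in source code): when the compiler can prove it is safe, it resolves the forwarding pointer once before a loop instead of on every iteration.

  typedef struct int_array {
      struct int_array* fwdptr;        /* Brooks pointer, as in the earlier sketches */
      int length;
      int elems[];
  } int_array;

  static inline int_array* resolve(int_array* a) { return a->fwdptr; }

  /* Naive shape: one read barrier per iteration. */
  long sum_naive(int_array* a) {
      long s = 0;
      for (int i = 0; i < resolve(a)->length; i++)
          s += resolve(a)->elems[i];
      return s;
  }

  /* After hoisting: the barrier is executed once, outside the loop. */
  long sum_hoisted(int_array* a) {
      int_array* to_space = resolve(a);
      long s = 0;
      for (int i = 0; i < to_space->length; i++)
          s += to_space->elems[i];
      return s;
  }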

Challenges: Compiler Support (1)
                 C1                      C2
  Test     G1   Shen   %diff      G1    Shen   %diff
  Cmp      79     74     -7%     127     121     -5%
  Cpr     125     86    -31%     146     125    -15%
  Cry      79     63    -21%     232     227     -2%
  Drb      78     70    -11%     165     156     -6%
  Mpa      31     21    -33%      50      40    -19%
  Sci      42     31    -25%      74      66    -10%
  Ser    1639   1279    -22%    2471    2101    -15%
  Sun      99     75    -24%     112      98    -13%
  Xml      89     70    -21%     190     170    -11%
C1 codegens good barriers, but C2 also does high-level optimizations.
(1) Caveat: the author made this experiment while inebriated after the conference dinner, so...

Challenges: STW Woes
Pauses ≤ 1 ms leave little time budget to work with, but we still need to scan roots, clean up runtime stuff, walk over regions, ...
Consider:
- Thread wakeup latency is easily more than 200 us: parallelism does not give you all the bang, though some parallelism is still efficient
- Processing 10K regions leaves about 100 ns per region (1 ms / 10 000). Example: you can afford marking regions as «dirty», but cannot afford actually recycling them during the pause.

In Progress

In Progress: VM Support
Pauses ≤ 1 ms require more runtime support. Some examples:
- Time-To-Safepoint alone takes about that much, even without loopy code
- Safepoint auxiliaries: stack scans for method aging take > 1 ms, cleanups can easily take 1 ms
- Lots of roots, many of which are hard or messy to scan concurrently or in parallel: StringTable, synchronizer roots, etc.

In Progress: Partials
A full-heap concurrent cycle takes a throughput toll on the application. Idea: partial collections!
Requires knowing what parts of the heap to scan for incoming refs:
- Card Table for Serial, Parallel, CMS
- Card Table + Remembered Sets for G1
Differs from the regular cycle: it selects the collection set without prior marking, and is thus more conservative. Generational is a special case of partial.

In Progress: Partials, Connection Matrix
A concurrent collector allows for a very coarse «connection matrix» (diagram-only slide).
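A conceptual sketch of what such a coarse matrix barrier could look like (illustrative C, not the HotSpot code; the region size, matrix encoding and names are assumptions): every reference store records that the region holding the updated object may point into the region holding the stored value, and a partial cycle then scans only the regions flagged as pointing into its collection set.

  #include <stdint.h>
  #include <stddef.h>

  #define REGION_SIZE  (1u << 20)                  /* assume 1 MB regions        */
  #define MAX_REGIONS  1024                        /* assume a small heap        */

  static uint8_t   conn[MAX_REGIONS][MAX_REGIONS]; /* conn[from][to] != 0 means  */
                                                   /* 'from' may point into 'to' */
  static uintptr_t heap_base;

  static size_t region_of(const void* p) {
      return ((uintptr_t)p - heap_base) / REGION_SIZE;
  }

  /* Matrix barrier, conceptually run on every reference store obj.f = value: */
  static void matrix_barrier(const void* obj, const void* value) {
      if (value != NULL)
          conn[region_of(obj)][region_of(value)] = 1;
  }

  /* A partial cycle collecting region R then scans only the regions F with
   * conn[F][R] set when it looks for incoming references into R.            */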

In Progress: Partial, Status
The current partial machinery does work!
- Implemented the GC infra, and matrix barriers in all compilers
- Can use timestamps to bias towards younger and older regions
Caveat: partials are STW in the current experiment:
  GC(7) Pause Partial        2103M->2106M(10240M) 4.209ms
  GC(7) Concurrent cleanup   2106M->59M(10240M)   5.288ms
Anecdote: sometimes the partial STW pause is shorter than the regular STW pauses.

In Progress: Concurrent Partial
Next step: making partial collections concurrent (work in progress).
- Q: How do we keep the matrix consistent during a concurrent partial?
- Q: Are new barriers required?
- Q: Is the regular concurrent cycle just a special case of partial?

In Progress: Traversal Order
Spot the trouble: separate marking and evacuation phases mean the collector maintains the allocation order, not the traversal order.
- Q: Can we coalesce evacuation and update-refs?
- Q: Can a concurrent partial coalesce the phases?

In Progress: Humongous and 2^K Allocs
Is new byte[1024*1024] the best fit for a regionalized GC? Actually, with G1-style humongous allocs it is the worst fit: objects have headers, so a 2^K-sized allocation just barely does not fit, wasting one of the regions.
Q: Can this be redone with a segregated-fits freelist maintained separately?
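To put numbers on it (my assumptions, not from the slide: 1 MB regions and a 16-byte array header): new byte[1024*1024] needs 1 048 576 B of payload plus 16 B of header, i.e. 1 048 592 B, which does not fit into a single 1 048 576 B region; a G1-style humongous allocation therefore spans two regions and leaves the second one almost entirely unused.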

In Progress: Application Pacing
A concurrent collector relies on collecting faster than the application allocates, so that the application always sees available memory. In practice this is frequently true: applications rarely do allocations only, GC threads are high-priority, and there is enough space to absorb allocations while the GC is running...
In some overloaded-heap cases the application outpaces the GC, yielding an Allocation Failure and prompting a STW collection.
Q: Pace the application when the heap is close to exhaustion?

In Progress: SATB or IU
SATB overestimates liveness: all new allocations during mark are implicitly live.
- Keeps lots of floating garbage that has to wait for another cycle to be collected
- Has interesting implications for weak reference processing: e.g. the deadly embrace with Reference.get()...
Q: Is incremental update (IU) more suitable here?
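The difference between the two marking disciplines, as a conceptual C sketch (illustrative only; object, marking_in_progress and the enqueue helpers are stand-ins): SATB remembers the old value of an overwritten reference so that the start-of-cycle snapshot stays reachable, while incremental update remembers the new value so that freshly installed edges get re-scanned instead of keeping everything allocated during the cycle alive.

  typedef struct object object;

  /* Stand-ins for VM internals (illustrative). */
  extern int  marking_in_progress;
  extern void satb_enqueue(object* old_val);       /* SATB: remember old value  */
  extern void mark_enqueue(object* new_val);       /* IU: remember new value    */

  /* SATB pre-write barrier: log the reference about to be overwritten. */
  void satb_store(object** field, object* new_val) {
      object* old_val = *field;
      if (marking_in_progress && old_val != NULL)
          satb_enqueue(old_val);
      *field = new_val;
  }

  /* Incremental-update post-write barrier: log the reference just installed. */
  void iu_store(object** field, object* new_val) {
      *field = new_val;
      if (marking_in_progress && new_val != NULL)
          mark_enqueue(new_val);
  }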

Conclusion

Conclusion: Ready for Experimental Use
Try it. Break it. Report the successes and failures.
https://wiki.openjdk.java.net/display/shenandoah/main