Shenandoah: Theory and Practice. Christine Flood Roman Kennke Principal Software Engineers Red Hat

Shenandoah
  Why do we need it?
  What does it do?
  How does it work?
  What's the current state?
  What's left to do?
  Performance

GC is like an omniscient organizer for program memory. I bet that's your messy pantry, isn't it?

Java execution
[Figure: stack frames for methods Foo and Bar hold values (42, 6.847) and references into the heap, where objects and arrays hold references to further objects.]

When we reorganize objects, we need to copy the objects and update the stack locations to point to their new addresses.
[Figure: an object referenced from the stack frames of Foo and Bar is copied within the heap.]

Why yet another garbage collector?
OpenJDK already has 4 collectors:
  Serial (minimal collector)
  Parallel (high throughput)
  Concurrent Mark Sweep (low pause time, but...)
  G1 (low/managed pause time, but...)

But?
All existing collectors must (occasionally) compact the old generation or the whole heap, and therefore stop the world: for a long time, if the heap is large.

Shenandoah!
  Aims to reduce GC pause times
  Goal: <10ms pauses for >100GB heaps
  More precisely: make GC pauses independent of heap size
  Long-term goal: pauseless GC

How do we do it?
Evacuate concurrently with Java threads.

Garbage-First (G1)
[Figure: timeline of phases: Java, Init Mark, Java with Concurrent Mark, Final Mark, Evacuation, Java.]

Shenandoah: Current implementation
[Figure: timeline of phases: Java, Init Mark, Java with Concurrent Mark, Final Mark, Java with Concurrent Evacuation.]
We choose our collection set to minimize the amount of copying.
We have a plan for removing all of the stop-the-world pauses.

[Figure: an object referenced from the stack frames of Foo and Bar is copied within the heap.]
Wait, are you moving those objects while the program is running?

How do we do that?
We recycle an idea from the 1980s and add a level of indirection.

Forwarding Pointers
Based on Brooks pointers:
Rodney A. Brooks, "Trading Data Space for Reduced Time and Code Space in Real-Time Garbage Collection on Stock Hardware", 1984 Symposium on Lisp and Functional Programming.

Forwarding Pointer
[Figure: object Foo with an indirection pointer that points to Foo.]
  Object layout inside the JVM remains the same.
  Third-party tools can still walk the heap.
  Can choose GC algorithm at run time.
  We hope to one day be able to take advantage of unused space in double-word-aligned objects when possible.

Forwarding Pointers
[Figure: From-Region and To-Region with objects Foo, A, A' and B.]
Any reads or writes of A will now be redirected to A'. We don't need to update Foo immediately.

How to move an object while the program is running
  1. Read the forwarding pointer (it still points to from-space).
  2. Allocate a temporary copy of the object in to-space.
  3. Copy the data.
  4. CAS the forwarding pointer.
  5. If you succeed, carry on. If you fail, use the copy that was placed by the thread that beat you and recycle your temporary copy.
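The steps above can be sketched in plain Java. This is a toy model, not HotSpot code: the class `Cell`, its `forwardee` field (standing in for the forwarding pointer) and the helper `distinctCopiesAfterRace` are hypothetical names introduced for illustration, and the losing thread's temporary copy is simply dropped rather than explicitly recycled.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

final class EvacuationSketch {
    /** Toy heap object (hypothetical); forwardee plays the role of the forwarding pointer. */
    static final class Cell {
        final AtomicReference<Cell> forwardee = new AtomicReference<>();
        final int payload;

        Cell(int payload) {
            this.payload = payload;
            forwardee.set(this); // not yet evacuated: points at itself
        }
    }

    /** Returns the to-space copy of 'from', creating it if no thread has done so yet. */
    static Cell evacuate(Cell from) {
        Cell current = from.forwardee.get();      // read the forwarding pointer
        if (current != from) {
            return current;                       // someone already evacuated it
        }
        Cell copy = new Cell(from.payload);       // allocate a temporary copy and copy the data
        if (from.forwardee.compareAndSet(from, copy)) {
            return copy;                          // CAS succeeded: our copy is the to-space copy
        }
        return from.forwardee.get();              // CAS failed: use the winner's copy, drop ours
    }

    /** Lets several threads race to evacuate one object; returns how many distinct copies they saw. */
    static int distinctCopiesAfterRace(int threadCount) {
        Cell from = new Cell(42);
        Set<Cell> seen = ConcurrentHashMap.newKeySet();
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            threads[i] = new Thread(() -> seen.add(evacuate(from)));
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("distinct to-space copies: " + distinctCopiesAfterRace(8));
    }
}
```

Because the CAS from the original object to a copy can succeed only once, every racing thread ends up with the same to-space copy in this model.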

Forwarding Pointers
[Figure: From-Region and To-Region with objects A and B.]
Reading an object in a From-Region doesn't trigger an evacuation.
Note: If reads were to cause copying, we might have a read storm where every operation required copying an object. Our intention is that, since we are only copying on writes, we will have less bursty behavior.

Forwarding Pointers
[Figure: From-Region and To-Region with objects A, A' and B.]
Writing an object in a From-Region will trigger an evacuation of that object to a To-Region, and the write will occur there.

How does Java code know where the real object is?
Reads, writes, acmps and some others are wrapped by code that ensures the correct objects are accessed:
  Read barriers
  Write barriers
  Acmp / cmpxchg barriers

Read Barriers
  Read the forwarding pointer to access the forwarded object.
  Does not trigger evacuation.
  If a write occurs concurrently, it's a race, but it's been a race before :-)
  Usually compiles into a single mov instruction.

Write Barriers
  Ensure that writes only happen in to-space.
  They do so by speculatively making a copy, then CASing the forwarding pointer in the object.
  If the CAS succeeds, we win. If not, we roll back the allocation and use whatever the other thread did.
  ...but only for objects in the collection set, and only if evacuation is currently in progress.
  ...otherwise it's a simple read barrier.

Acmp barriers
  If we compare a == a', we can get false negatives.
  Therefore, if an object comparison fails, we resolve both operands through a read barrier, then try again.
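A minimal Java sketch of the comparison described above, under stated assumptions: `Ref` is a hypothetical toy class whose `forwardee` field stands in for the forwarding pointer, and `acmp` models what the barrier does around a reference comparison. It is an illustration, not the code the JVM emits.

```java
final class AcmpSketch {
    /** Toy object (hypothetical); forwardee points at itself until the object is copied. */
    static final class Ref {
        volatile Ref forwardee = this;
    }

    /** Read barrier: follow the forwarding pointer (null stays null). */
    static Ref readBarrier(Ref r) {
        return r == null ? null : r.forwardee;
    }

    /** Reference comparison that tolerates one operand being a stale from-space copy. */
    static boolean acmp(Ref a, Ref b) {
        if (a == b) {
            return true;                          // fast path: raw references already match
        }
        return readBarrier(a) == readBarrier(b);  // slow path: resolve both, then try again
    }

    public static void main(String[] args) {
        Ref a = new Ref();
        Ref aPrime = new Ref();
        a.forwardee = aPrime;                     // simulate: a was evacuated to aPrime
        System.out.println("raw comparison:     " + (a == aPrime));
        System.out.println("barrier comparison: " + acmp(a, aPrime));
    }
}
```

The fast path keeps the common case cheap: the two read barriers only run when the raw comparison fails.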

CmpXChg Barriers
compareAndSwapObject() combines all three, because it loads, compares and writes an object field.
We insert a somewhat complex barrier that:
  Resolves the written value (read barrier)
  Ensures to-space copy (write barrier)
  Prevents false negatives (acmp barrier)
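One way to picture how the three barriers combine is the following Java sketch. It is a simplified model under loud assumptions: `Node`, `resolve` and `casField` are hypothetical names, the holder object is assumed to be already in to-space (so following its forwarding pointer stands in for the write barrier), and the retry loop is one possible way to avoid false negatives, not necessarily the one the JVM uses.

```java
import java.util.concurrent.atomic.AtomicReference;

final class CasBarrierSketch {
    /** Toy object (hypothetical) with a forwarding pointer and one reference field. */
    static final class Node {
        volatile Node forwardee = this;
        final AtomicReference<Node> field = new AtomicReference<>();
    }

    /** Read barrier: follow the forwarding pointer (null stays null). */
    static Node resolve(Node n) {
        return n == null ? null : n.forwardee;
    }

    /** CAS on holder.field that treats a from-space copy and its to-space copy as equal. */
    static boolean casField(Node holder, Node expected, Node newValue) {
        Node target = resolve(holder);   // assumption: stands in for the write barrier on the holder
        Node value = resolve(newValue);  // read barrier on the value being written
        Node compare = expected;
        while (true) {
            if (target.field.compareAndSet(compare, value)) {
                return true;
            }
            Node observed = target.field.get();
            if (resolve(observed) != resolve(expected)) {
                return false;            // genuinely a different object
            }
            compare = observed;          // same object, other copy: retry with the raw value
        }
    }

    public static void main(String[] args) {
        Node holder = new Node();
        Node old = new Node();
        Node oldCopy = new Node();
        old.forwardee = oldCopy;         // simulate: old was evacuated to oldCopy
        holder.field.set(old);           // the field still holds the stale reference
        System.out.println(casField(holder, oldCopy, new Node()));
    }
}
```

In `main`, a plain CAS would fail because the field holds `old` while the caller expects `oldCopy`; resolving both sides shows they are the same object, so the sketch retries and succeeds.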

How are barriers implemented?
Need two types of barriers:
  Read barrier: read Brooks pointer
  Write barrier: maybe copy obj & update Brooks ptr

    oop read_barrier(oop obj)
    oop write_barrier(oop obj)

Shenandoah barriers

    oop read_barrier(oop obj) {
        // The Brooks pointer sits in the word just before the object.
        return *(obj - 0x8);
    }

Shenandoah barriers

    oop write_barrier(oop obj) {
        if (evacuation_in_progress) {
            return runtime_wbarrier(obj);
        }
        return obj;
    }
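The two barrier functions can also be modelled in runnable Java. This is a sketch under assumptions, not the JVM's implementation: `HeapObj`, `inCollectionSet`, `evacuationInProgress` and `runtimeWriteBarrier` are hypothetical names, and the fallback path follows the write-barrier description given earlier (when no evacuation is needed, behave like a read barrier).

```java
import java.util.concurrent.atomic.AtomicReference;

final class BarrierSketch {
    /** Toy heap object (hypothetical); forwardee models the Brooks pointer. */
    static final class HeapObj {
        final AtomicReference<HeapObj> forwardee = new AtomicReference<>();
        final boolean inCollectionSet;
        volatile int x;

        HeapObj(int x, boolean inCollectionSet) {
            this.x = x;
            this.inCollectionSet = inCollectionSet;
            forwardee.set(this);
        }
    }

    /** Global GC phase flag; in this model it is flipped by hand. */
    static volatile boolean evacuationInProgress = false;

    /** Read barrier: a single load through the forwarding pointer, never copies. */
    static HeapObj readBarrier(HeapObj obj) {
        return obj.forwardee.get();
    }

    /** Write barrier: cheap inline checks first, slow path only when really necessary. */
    static HeapObj writeBarrier(HeapObj obj) {
        if (evacuationInProgress && obj.inCollectionSet) {
            return runtimeWriteBarrier(obj);
        }
        return readBarrier(obj);
    }

    /** Slow path: speculatively copy, then CAS the forwarding pointer. */
    static HeapObj runtimeWriteBarrier(HeapObj obj) {
        HeapObj current = obj.forwardee.get();
        if (current != obj) {
            return current;                       // already evacuated
        }
        HeapObj copy = new HeapObj(obj.x, false); // to-space copy is not in the collection set
        return obj.forwardee.compareAndSet(obj, copy) ? copy : obj.forwardee.get();
    }

    public static void main(String[] args) {
        HeapObj a = new HeapObj(1, true);
        evacuationInProgress = true;
        HeapObj target = writeBarrier(a);         // evacuates a, returns the to-space copy
        target.x = 3;                             // the write lands in to-space
        System.out.println(readBarrier(a).x);     // readers are redirected to the copy
    }
}
```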

Shenandoah barriers
Read barriers:
  getfield
  Xaload
  Intrinsics
  Some esoteric stuff

Shenandoah barriers
Write barriers:
  putfield
  Xastore
  Intrinsics
  Some esoteric stuff

Shenandoah barrier example

    // Method without barriers
    void doStuff(TypeA a, TypeA b) {
        for (..) {
            a.x = 3;                    // putfield
            System.out.println(b.x);    // getfield
        }
    }

    // Same method with Shenandoah barriers
    void doStuff(TypeA a, TypeA b) {
        for (..) {
            a = write_barrier(a);
            a.x = 3;                    // putfield
            b = read_barrier(b);
            System.out.println(b.x);    // getfield
        }
    }

Shenandoah barriers
Barriers are inserted by:
  The interpreter
  The C1 compiler
  The C2 compiler
  Us, hardcoded in the runtime

Shenandoah barriers
The initial implementation showed disheartening performance: more than 50% slower than with other GCs.
So how did we make it fast?

Shenandoah barriers
How to optimize barriers?
  Make barriers more efficient
  Eliminate barriers
  Optimize barrier placement

Shenandoah barriers
Making barriers more efficient:
  Eliminate null-checks
  Inline null-checks
  Inline evacuation-in-progress checks
  Inline in-collection-set checks
  Only call the runtime when really necessary

Shenandoah barriers
Eliminate barriers
We don't need barriers:
  For known NULL objects
  For inlined constants
  For newly allocated objects
  After write barriers
Since we can only figure most of this out after parsing, this isn't possible to do with parse-time barriers.

Eliminate barriers on null objects

    bool isNull(Type a) {
        Type b = null;
        a' = read_barrier(a);   // Don't care
        b' = read_barrier(b);   // Known null
        return a' == b';
    }

Eliminate barriers on null objects

    bool isNull(Type a) {
        Type b = null;
        return a == b;
    }

Eliminate barriers on constants

    static final Type A = ...;

    int getFoo() {
        return A.foo;
    }

Eliminate barriers on constants

    static final Type A = ...;

    int getFoo() {
        // Constants are always in to-space
        Type A' = read_barrier(A);
        return A'.foo;
    }

Eliminate barriers on new objects

    int getFoo() {
        Type a = new Type();
        // New objects are always in to-space
        a' = read_barrier(a);
        return a'.foo;
    }

Eliminate barriers on new objects

    int getFoo() {
        Type a = new Type();
        return a.foo;
    }

Eliminate barriers after write barriers

    int getFoo(Type a) {
        a' = write_barrier(a);
        a'.bar = ...;
        // a' already in to-space
        a'' = read_barrier(a');
        return a''.foo;
    }

Eliminate barriers after write barriers

    int getFoo(Type a) {
        a' = write_barrier(a);
        a'.bar = ...;
        return a'.foo;
    }

Optimize barrier placement
Hoist barriers out of hot loops.

Example: barriers inside the inner loop

    void doStuff(TypeA a, TypeZ z) {
        for (...) {
            Call();    // Safepoint
            for (...) {
                a = write_barrier(a);
                a.x = foo;
                z = read_barrier(z);
                System.out.println(z.y);
            }
        }
    }

Example: write barrier hoisted out of the loops

    void doStuff(TypeA a, TypeZ z) {
        a = write_barrier(a);
        for (...) {
            Call();    // Safepoint
            for (...) {
                a.x = foo;
                z = read_barrier(z);
                System.out.println(z.y);
            }
        }
    }

Example: read barrier hoisted as well

    void doStuff(TypeA a, TypeZ z) {
        a = write_barrier(a);
        z = read_barrier(z);
        for (...) {
            Call();    // Safepoint
            for (...) {
                a.x = foo;
                System.out.println(z.y);
            }
        }
    }

Lessons learned
  Basic algorithm pretty easy
  Hard parts:
    Finding all the right places to insert barriers
    Supporting all JVM peculiarities:
      Weak references
      JNI Critical regions
      System.gc()
    Compiler support and optimization

Status
  Feature complete
  Stable (beta-quality)
  Good performance (see later)
  Established OpenJDK project: http://openjdk.java.net/projects/shenandoah/
  Got nightly builds: https://adopt-openjdk.ci.cloudbees.com/view/openjdk/ (Thanks Adopt-OpenJDK!!)

Future Work (last year)
  Finish big application testing.
  Move the barriers to right before code generation.
  Barrier-specific C2 opts?
  Exploit Java Memory Model?
  Heuristics tuning!
  Generational Shenandoah?
  Remembered Sets for updating roots and freeing memory sooner?
  Round Robin Thread Stopping?
  NUMA Aware?

Future Work (now)
  Finish big application testing.
  Move the barriers to right before code generation.
  Barrier-specific C2 opts?
  Exploit Java Memory Model?
  Heuristics tuning!
  Generational Shenandoah?
  Remembered Sets for updating roots and freeing memory sooner?
  Round Robin Thread Stopping? (2.0)
  NUMA Aware? (2.0)

Releases?
  First in Fedora 24
  JDK 10
  JEP 189: http://openjdk.java.net/jeps/189

Performance
SPECjbb2015 - 32 cores - 160GB RAM, 140GB heap
[Figure: bar chart comparing Shenandoah and G1 on compress, derby, scimark.large, compiler, crypto, mpegaudio, scimark.small, serial, sunflow, startup and xml.]
Throughput: Shenandoah: 374 ops/m, G1: 393 ops/m (95%, min 80%, max 140%)
Pauses: Shenandoah: avg: 41ms, max: 202ms; G1: avg: 240ms, max: 1126ms total

Performance
SPECjbb2015
  Max-jops: maximum throughput
  Critical-jops: throughput under response-time constraints (SLA)

                   G1        Shenandoah   Shenandoah/G1
  Max-jops         18117     16899        93%
  Critical-jops    4294      7990         186%
  Pause avg        862ms     24.6ms
  Pause max        2054ms    78.61ms

Performance
Radargun/Infinispan
Throughput:
  G1: 940,065 reqs/s
  Shenandoah: 1,202,925 reqs/s

Performance
Radargun/Infinispan
[Figure: response time percentiles.]
Beware the scales!

LRU test
Simple handwritten LRU cache benchmark
  ParallelGC: 116091ms / 100000 ops
  G1: 98598ms / 100000 ops
  Shenandoah: 56698ms / 100000 ops

Please test
Download and build:
  http://hg.openjdk.java.net/shenandoah
Or use nightly builds:
  https://adopt-openjdk.ci.cloudbees.com/view/openjdk/job/project-shenandoah-jdk9/
  https://adopt-openjdk.ci.cloudbees.com/view/openjdk/job/project-shenandoah-jdk8/
Report issues or success stories to:
  http://mail.openjdk.java.net/mailman/listinfo/shenandoah-dev

References
  http://openjdk.java.net/projects/shenandoah/