New Java performance developments: compilation and garbage collection

Similar documents
JDK 9/10/11 and Garbage Collection

Towards High Performance Processing in Modern Java-based Control Systems. Marek Misiowiec Wojciech Buczak, Mark Buttner CERN ICalepcs 2011

Shenandoah: Theory and Practice. Christine Flood Roman Kennke Principal Software Engineers Red Hat

Shenandoah An ultra-low pause time Garbage Collector for OpenJDK. Christine H. Flood Roman Kennke

Concurrent Garbage Collection

JAVA PERFORMANCE. PR SW2 S18 Dr. Prähofer DI Leopoldseder

Shenandoah: An ultra-low pause time garbage collector for OpenJDK. Christine Flood Roman Kennke Principal Software Engineers Red Hat

A JVM Does What? Eva Andreasson Product Manager, Azul Systems

Java On Steroids: Sun s High-Performance Java Implementation. History

JVM Memory Model and GC

High Performance Managed Languages. Martin Thompson

SABLEJIT: A Retargetable Just-In-Time Compiler for a Portable Virtual Machine p. 1

Pause-Less GC for Improving Java Responsiveness. Charlie Gracie IBM Senior Software charliegracie

Java Performance Tuning

Evolution of Virtual Machine Technologies for Portability and Application Capture. Bob Vandette Java Hotspot VM Engineering Sept 2004

A new Mono GC. Paolo Molaro October 25, 2006

The G1 GC in JDK 9. Erik Duveblad Senior Member of Technical Staf Oracle JVM GC Team October, 2017

2011 Oracle Corporation and Affiliates. Do not re-distribute!

THE TROUBLE WITH MEMORY

Shenandoah: An ultra-low pause time garbage collector for OpenJDK. Christine Flood Principal Software Engineer Red Hat

Myths and Realities: The Performance Impact of Garbage Collection

Shenandoah GC...and how it looks like in September Aleksey

JVM Performance Study Comparing Java HotSpot to Azul Zing Using Red Hat JBoss Data Grid

Java Without the Jitter

JamaicaVM Java for Embedded Realtime Systems

Garbage Collection. Hwansoo Han

Ahead of Time (AOT) Compilation

The C4 Collector. Or: the Application memory wall will remain until compaction is solved. Gil Tene Balaji Iyengar Michael Wolf

Running class Timing on Java HotSpot VM, 1

G1 Garbage Collector Details and Tuning. Simone Bordet

OS-caused Long JVM Pauses - Deep Dive and Solutions

Do Your GC Logs Speak To You

The Z Garbage Collector Scalable Low-Latency GC in JDK 11

Java Performance: The Definitive Guide

Managed runtimes & garbage collection. CSE 6341 Some slides by Kathryn McKinley

Fiji VM Safety Critical Java

Managed runtimes & garbage collection

Compiler construction 2009

Java Performance Tuning and Optimization Student Guide

Java & Coherence Simon Cook - Sales Consultant, FMW for Financial Services

Java Memory Management. Märt Bakhoff Java Fundamentals

High-Level Language VMs

JVM Troubleshooting MOOC: Troubleshooting Memory Issues in Java Applications

Optimising Multicore JVMs. Khaled Alnowaiser

The Z Garbage Collector An Introduction

High Performance Managed Languages. Martin Thompson

Sustainable Memory Use Allocation & (Implicit) Deallocation (mostly in Java)

Garbage Collection. Steven R. Bagley

CSc 453 Interpreters & Interpretation

NG2C: Pretenuring Garbage Collection with Dynamic Generations for HotSpot Big Data Applications

The Garbage-First Garbage Collector

MODULE 1 JAVA PLATFORMS. Identifying Java Technology Product Groups

Exploiting the Behavior of Generational Garbage Collector

Fundamentals of GC Tuning. Charlie Hunt JVM & Performance Junkie

Towards Lean 4: Sebastian Ullrich 1, Leonardo de Moura 2.

The Z Garbage Collector Low Latency GC for OpenJDK

How to keep capacity predictions on target and cut CPU usage by 5x

JVM Performance Tuning with respect to Garbage Collection(GC) policies for WebSphere Application Server V6.1 - Part 1

HBase Practice At Xiaomi.

Attila Szegedi, Software

JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra

Runtime Application Self-Protection (RASP) Performance Metrics

Memory management has always involved tradeoffs between numerous optimization possibilities: Schemes to manage problem fall into roughly two camps

Kodewerk. Java Performance Services. The War on Latency. Reducing Dead Time Kirk Pepperdine Principle Kodewerk Ltd.

Agenda. CSE P 501 Compilers. Java Implementation Overview. JVM Architecture. JVM Runtime Data Areas (1) JVM Data Types. CSE P 501 Su04 T-1

A Trace-based Java JIT Compiler Retrofitted from a Method-based Compiler

DNWSH - Version: 2.3..NET Performance and Debugging Workshop

Java without the Coffee Breaks: A Nonintrusive Multiprocessor Garbage Collector

Acknowledgements These slides are based on Kathryn McKinley s slides on garbage collection as well as E Christopher Lewis s slides

Java Performance Tuning From A Garbage Collection Perspective. Nagendra Nagarajayya MDE

Dynamic Vertical Memory Scalability for OpenJDK Cloud Applications

JOP: A Java Optimized Processor for Embedded Real-Time Systems. Martin Schoeberl

Runtime. The optimized program is ready to run What sorts of facilities are available at runtime

Hard Real-Time Garbage Collection in Java Virtual Machines

CMSC 330: Organization of Programming Languages

MEMORY MANAGEMENT HEAP, STACK AND GARBAGE COLLECTION

Run-Time Environments/Garbage Collection

Lesson 2 Dissecting Memory Problems

Hierarchical PLABs, CLABs, TLABs in Hotspot

The Fundamentals of JVM Tuning

Azul Systems, Inc.

Low latency & Mechanical Sympathy: Issues and solutions

Dynamic Selection of Application-Specific Garbage Collectors

Heap Compression for Memory-Constrained Java

Insight Case Studies. Tuning the Beloved DB-Engines. Presented By Nithya Koka and Michael Arnold

Lecture 15 Garbage Collection

Contents in Detail. Who This Book Is For... xx Using Ruby to Test Itself... xx Which Implementation of Ruby?... xxi Overview...

Lecture 9 Dynamic Compilation

MicroJIT: A Lightweight Just-in-Time Compiler to Improve Startup Times

CON Java in a World of Containers

Zing Vision. Answering your toughest production Java performance questions

Azul Pauseless Garbage Collection

Executing Legacy Applications on a Java Operating System

TECHNOLOGY WHITE PAPER. Azul Pauseless Garbage Collection. Providing continuous, pauseless operation for Java applications

Java on System z. IBM J9 virtual machine

Java Application Performance Tuning for AMD EPYC Processors

Real Time: Understanding the Trade-offs Between Determinism and Throughput

Just-In-Time Compilers & Runtime Optimizers

Implementation Garbage Collection

PennBench: A Benchmark Suite for Embedded Java

Transcription:

New Java performance developments: compilation and garbage collection Jeroen Borgers @jborgers #jfall17

Part 1: New in Java compilation Part 2: New in Java garbage collection 2

Part 1 New in Java compilation 3

New in Java compilation #jfall17 Compilation basics for the JVM AOT-compilation added Advantages of AOT-compilation Examples using AOT Current limitations Conclusions 4

Java compilation basics interpretation and JIT-compilation in the JVM

Interpretation versus compilation #jfall17.php file.cpp file.pl file compiler interpreter executable file each line native code OS Hardware 6

Interpretation versus compilation #jfall17.php file.cpp file portable source code.pl file compiler interpreter executable file each line native code OS Hardware 7 exe non-portable optimizations faster execution

JVM: Interpretation and JIT-compilation #jfall17.java file javac.class file byte code: portable JVM interpreter JIT-compiler native code OS Hardware 8

JVM: Interpretation and JIT-compilation #jfall17.java file javac.class file byte code JVM interpreter each line JIT-compiler method code cache native code OS Hardware 9

Profile guided JIT-compilation #jfall17.java file javac.class file byte code JVM interpreter profiling JIT-compiler hot method code cache native code OS Hardware 10

Profile guided JIT-compilation with adaptive optimizations.class file byte code JVM interpreter profiling JIT-compiler hot method code cache native code focus on hot code: speculatively optimize method inlining branch prediction loop unrolling de-optimization OS Hardware 11

C1: client or C2: server compiler #jfall17.class file byte code JVM interpreter profiling client - C1 server - C2 hot code cache C1: quick startup C2: best performance eventually by more optimizations 12 hot: > C1:1500 > C2:10.000 invocations + iterations

Tiered compilation: Best of C1 and C2 default in Java 8.class file byte code interpreter client - C1 hotter server - C2 JVM profiling hot profiling code cache C1: quick startup C2: best performance eventually by more optimizations 13

interpretation versus compilation flags -Xint -Xcomp -Xmixed JIT-compilation default 14

Java 9 test code: HelloJUG 15

HelloJUG Which is quickest? java HelloJUG java -Xcomp HelloJUG java -Xint HelloJUG 16

HelloJUG Real times, average of 5 runs java HelloJUG java -Xcomp java -Xint 252 ms 2272 ms 214 ms 17

Tiered compilation levels #jfall17 Interpreter profiling C1 no profiling C1 profiling C2 no profiling code speed 1 10 7 14 common Level 0 hot hotter Level 3 Level 4 hot stop at L1 Level 0 Level 1 18

HelloJUG Real times, average of 5 runs java HelloJUG 252 ms java -Xcomp 2272 ms java -Xint 214 ms java -XX:TieredStopAtLevel=1 244 ms 19

AOT-compilation added #jfall17 byte code.class file JVM interpreter JIT-compiler native code 20

AOT-compilation added #jfall17 byte code.class file Graal jaotc libelf liby.so file native code non-portable -XX:AOTLibrary= JVM interpreter JIT-compiler native code 21

AOT in 2 flavors: tiered (with JIT) and non-tiered (no JIT) Graal libelf jaotc byte code.class file liby.so file native code -XX:AOTLibrary= JVM AOT code JIT-compiler native code 22

Non-tiered AOT-compilation: no JIT #jfall17 Graal libelf jaotc byte code.class file liby.so file native code -XX:AOTLibrary= JVM AOT code JIT-compiler native code 23

Non-tiered compilation with AOT #jfall17 AOT C1 no profiling no profiling profiling no profiling code speed 9 10 7 14 C1 C2 common aot footprint more important than peak performance no JIT-code cache, no compiler threads more predictable behavior constant speed of code, no JIT-compilation taking CPU 24

HelloJUG Real times, average of 5 runs java HelloJUG java -Xcomp java -Xint java -XX:TieredStopAtLevel=1 java -XX:AOTLibrary=lib..nont.so 25 252 ms 2272 ms 214 ms 244 ms 214 ms

Tiered AOT-compilation #jfall17 Graal libelf byte code.class file jaotc liby.so file native code -XX:AOTLibrary= JVM AOT code hotter client - C1 server - C2 profiling profiling hot code cache 26

Tiered compilation levels with AOT #jfall17 AOT profiling C1 no profiling C1 profiling C2 no profiling code speed 9 10 7 14 common AOT hot hotter Level 3 Level 4 27

Why AOT-compilation? #jfall17 Faster startup time compiled/optimized methods immediately available JIT optional, for best peak-performance Reach peak-performance quicker Potentially less memory usage Class data sharing (CDS) No JIT-compiler overhead Less CPU usage less interpreting, profiling, compiling 28

AOT-compiling HelloJUG jaotc 29

Run HelloJUG using AOT with -XX:AOTLibrary= 30

Run HelloJUG using AOT with -XX:AOTLibrary and -XX:PrintAOT 31

CreateCalendars micro-benchmark Average of 5 runs, 2 CPU s non-aot tiered AOT non-tiered AOT -Xint > 29 s. real run time (s) C1 time C2 time 0 0,5 1 1,5 2 2,5 3 Shorter is better 32

Current limitations #jfall17 For JDK 9 only supported on Linux x64 JDK 10 (18.3) also on MacOS and Windows No use of profiling data during AOT (yet) Only supported for G1 GC and Parallel GC 33

AOT-compilation conclusions Promising technique for quicker startup times Most noticeable when user waiting for startup / execution Short run: non-tiered - without JIT-compiler Small devices with little resources IDE s, unit tests, code generation, coding checks,.. Can be 10% to 2x better for short execution or startup AOT-compile and bundle only touched methods of java.base etc. should help more training run in future versions 34

Part 1: New in Java compilation Part 2: New in Java garbage collection 35

Part 2: New garbage collectors Shenandoah and Epsilon GC

New garbage collectors #jfall17 GC basics Serial, Parallel, Concurrent GC Region based: G1 GC Shenandoah GC Epsilon GC Conclusions 37

Garbage collection basics Mark&Sweep, Compaction and Mark&Copy

Mark and Sweep #jfall17 OOPs GC-roots (stacks..) jpinpoint 39

Mark #jfall17 OOPs GC-roots (stacks..) jpinpoint 40

Sweep #jfall17 OOPs GC-roots jpinpoint 41

Mark-Sweep: Fragmentation #jfall17 After compaction: Compaction is expensive jpinpoint 42

Mark-Copy: no fragmentation 43

Generational GC: Young and Old Mark Sweep - Compact Eden Survi vor 1 Survi vor 2 Old space Mark Copy 44

Serial GC: stop-the-world pauzes GC thread eg. 7 s. 45

Serial vs Parallel GC #jfall17 GC thread eg. 7 s. GC threads eg. 2 s. 46

Concurrent Mark Sweep GC - mostly concurrent to app GC threads eg. 2x100 ms. STW 10 s. concurrent 47

Parallel GC vs CMS throughput vs responsiveness GC threads eg. 2 s. GC threads eg. 2x100 ms. STW 10 s. concurrent 48

Parallel GC vs CMS Young and Old GC threads 300 ms. GC threads 2 s. GC threads 300 ms. GC threads 2x100 ms. STW 10 s. concurrent 49

CMS problem 1: failures #jfall17 GC threads 300 ms. Concurrent failure GC threads 2x100 ms. STW 10 s. concurrent GC thread 7 s. 50

CMS problem 2: not scalable 4 GB heap x4 GC Conc GC 16 GB heap GC GC threads 1200 ms. Concurrent failure GC threads 2x400 ms. STW 40 s. concurrent 51 GC thread 28 s.

Solution: regionalize #jfall17 Mark Sweep - Compact Eden Survi vor 1 Survi vor 2 Old space Mark Copy 52

G1: Garbage first: young + old with most garbage % %-live objects 90% 40% 90% 80% 20% 15% 50% 53

G1: Garbage first: young + old regions with most garbage Concurrent mark in Old Mark-copy: no fragmentation Limit copy #regions to meet -XX:MaxPauseTimeMillis 90% 40% 90% Solves scalability and failures Still stw pauses 54 80% 20% 15% 50%

G1 stw-pauses: Mark young + Copy Y&O RSet CSet 55

G1 stw-pauses: Mark young + Copy Y&O + Update refs RSet CSet 56

G1 pauses: Mark young + Copy Y&O + Update refs CSet 57 RSet

Shenandoah GC Ultra low pause times

Stw-pause times in G1 how to beat them? Mark Young Copy Young & Old Update refs 59

pause times in G1 vs Shenandoah Mark Young Copy Young & Old Update refs 60

removed pause times in Shenandoah Mark Young Copy Young & Old concurrent Update refs concurrent 61

How to copy Old and update refs concurrently? RSet CSet 62

How to copy Old and update refs concurrently? add RSet CSet 63

All problems in computer science can be solved by? Jim Coplien - David Wheeler

Another level of indirection forwarding pointer 65

Copy Old and update refs concurrently by fwd pointer CSet 66

Application thread can write objects being evacuated add CSet 67

Application thread writes to new copy of object add CSet 68

Concurrent update refs #jfall17 CSet 69

Concurrent cleanup #jfall17 CSet 70

Shenandoah pause times 4 short stw pauses init mark, final mark, init UR, final UR GC(3) Pause Init Mark 0.771ms GC(3) Concurrent marking 76480M->77212M(102400M) 633.213ms GC(3) Pause Final Mark 1.821ms GC(3) Concurrent cleanup 77224M->66592M(102400M) 3.112ms GC(3) Concurrent evacuation 66592M->75640M(102400M) 405.312ms GC(3) Pause Init Update Refs 0.084ms GC(3) Concurrent update references 75700M->76424M(102400M) 354.341ms GC(3) Pause Final Update Refs 0.409ms GC(3) Concurrent cleanup 76244M->56620M(102400M) 12.242ms 71

Shenandoah pause times 4 short stw pauses init mark, final mark, init UR, final UR GC(3) Pause Init Mark 0.771ms GC(3) Concurrent marking 76480M->77212M(102400M) 633.213ms GC(3) Pause Final Mark 1.821ms GC(3) Concurrent cleanup 77224M->66592M(102400M) 3.112ms GC(3) Concurrent evacuation 66592M->75640M(102400M) 405.312ms GC(3) Pause Init Update Refs 0.084ms GC(3) Concurrent update references 75700M->76424M(102400M) 354.341ms GC(3) Pause Final Update Refs 0.409ms GC(3) Concurrent cleanup 76244M->56620M(102400M) 12.242ms 72

Elastic search benchmark (Oct 2, 2017) Run time (x100 s) Total pause time (s) Average (x10 ms) Max (x100 ms) Shorter is better 0 1 2 3 4 5 6 7 8 9 G1 73 Shenandoah

Parallel GC: max = 161 ms 74

G1 GC: max = 38 ms #jfall17 75

Shenandoah GC: max = 5,9 ms 76

G1 with -XX:MaxGCPauseTimeMillis=6: 29 ms 77

Epsilon GC #jfall17 JEP Draft (by Aleksey Shipilev) stw pauzes = 0.0 ms! How? 78

Epsilon GC #jfall17 Garbage non-collector JVM shutdown when heap exhausted java -XX:+UnlockExperimentalVMOptions XX:+UseEpsilonGC Binary builds of patched JDK10 available 79

No-op GC use cases #jfall17 Performance testing compare with real garbage collectors Minimize overhead garbage free apps short-lived apps active failover before heap exhausted 80

Conclusions new GC s #jfall17 Shenandoah GC clearly beats G1 for short pauses Likely to replace G1 as default after JDK9 Can try it out now on JDK8+! Epsilon GC eliminates all GC-overhead (and comfort) if you can avoid GC last-drop performance improvement 81

Questions? #jfall17 want to learn more? resources: accelerating Java applications training covers Java 8 & 9 12-14 March 2018 thanks for the attention! 82

Please rate my talk in the official J-Fall app