Scaling the OpenJDK. Claes Redestad Java SE Performance Team Oracle. Copyright 2017, Oracle and/or its affiliates. All rights reserved.

Scaling the OpenJDK Claes Redestad Java SE Performance Team Oracle

Safe Harbor Statement The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.

In this talk... We'll mainly concern ourselves with the vertical scalability of Java running on a single OpenJDK 9 JVM. For examples, I will be using JMH to explore code and bottlenecks: http://openjdk.java.net/projects/code-tools/jmh/ The benchmarking machine is a dual-socket Intel Xeon E5-2630 v3 (Haswell), 8 cores with 2 threads each per socket. Don't take anything presented here to be good, general performance advice or even representative of what you'd see on your own hardware.

JVM scalability challenges:
- Allow many concurrent and parallel tasks
- Allow for the increasing memory requirements of applications
- Make it easy to work with
- Do all this without degrading throughput, latency and memory overheads (too much)!

Scaling up gently

Bottlenecks hiding in plain sight...

import org.openjdk.jmh.annotations.*;
import java.util.*;

@State(Scope.Benchmark)
public class Scale {
    public int year = 2017;
    public int month = 11;
    public int day = 29;

    @Benchmark
    public Date getDate() {
        return new Date(year, month, day);
    }
}

Bottlenecks hiding in plain sight...

Benchmark      Mode  Cnt   Score   Error  Units
Scale.getDate  avgt   10   0.637 ± 0.027  us/op  # 1 thread
Scale.getDate  avgt   10   1.239 ± 0.149  us/op  # 2 threads
Scale.getDate  avgt   10   9.713 ± 0.676  us/op  # 8 threads
Scale.getDate  avgt   10  18.215 ± 2.578  us/op  # 16 threads
Scale.getDate  avgt   10  37.693 ± 2.374  us/op  # 32 threads

No scaling at all! Reason: Date(int, int, int) synchronizes on a shared, mutable calendar instance!
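The per-thread-count rows above come from separate JMH runs; a hedged example of how such runs might be launched with the JMH command-line runner (the benchmarks.jar name is a placeholder, -t sets the number of benchmark threads):

$ java -jar benchmarks.jar Scale.getDate -t 1
$ java -jar benchmarks.jar Scale.getDate -t 32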

Better alternatives exist

@Benchmark
public LocalDate getLocalDate() {
    return LocalDate.of(year, month, day);
}

Benchmark           Mode  Cnt   Score   Error  Units
Scale.getDate       avgt   10   0.637 ± 0.027  us/op  # 1 thread
Scale.getDate       avgt   10   1.239 ± 0.149  us/op  # 2 threads
Scale.getDate       avgt   10  37.693 ± 2.374  us/op  # 32 threads
Scale.getLocalDate  avgt   10   0.031 ± 0.007  us/op  # 1 thread
Scale.getLocalDate  avgt   10   0.024 ± 0.009  us/op  # 2 threads
Scale.getLocalDate  avgt   10   0.029 ± 0.005  us/op  # 8 threads
Scale.getLocalDate  avgt   10   0.037 ± 0.007  us/op  # 16 threads
Scale.getLocalDate  avgt   10   0.067 ± 0.001  us/op  # 32 threads

Much better! Only a small overhead per operation when saturating all hyperthreads.

Sharing effects

Improve String.hashCode implementation

During JDK 9 development, an innocent little clean-up was done to the String.hashCode implementation:

Before:

int h = hash;
if (h == 0 && value.length > 0) {
    char val[] = value;
    for (int i = 0; i < value.length; i++) {
        h = 31 * h + val[i];
    }
    hash = h;
}

After:

int h = hash;
if (h == 0) {
    for (int v : value) {
        h = 31 * h + v;
    }
    hash = h;
}

Improve String.hashCode a bit further...

For the corner case of the empty String we were now always calculating and storing 0 to the hash field. Even though the value doesn't change, this causes the cache line to be evicted, which led to a 5% regression on a standard benchmark on dual-socket machines.

Original:

int h = hash;
if (h == 0 && value.length > 0) {
    char val[] = value;
    for (int i = 0; i < value.length; i++) {
        h = 31 * h + val[i];
    }
    hash = h;
}

Clean-up (always stores):

int h = hash;
if (h == 0) {
    for (int v : value) {
        h = 31 * h + v;
    }
    hash = h;
}

Fixed (only store a non-zero hash):

int h = hash;
if (h == 0) {
    for (int v : value) {
        h = 31 * h + v;
    }
    if (h != 0) {
        hash = h;
    }
}

True sharing in "".hashCode()

Using JMH's perfnorm profiler made it easy to see not only the extra store in the second implementation, but also the dramatic increase in L1 cache misses per operation induced by these stores:

$ java -jar benchmarks.jar .*String.hashCode.* -t 4 -prof perfnorm

String.hashCode2                        avgt  5  38.701 ± 0.794  ns/op
String.hashCode2:CPI                    avgt      3.501          #/op
String.hashCode2:L1-dcache-load-misses  avgt      0.460          #/op
String.hashCode2:L1-dcache-loads        avgt     14.173          #/op
String.hashCode2:L1-dcache-stores       avgt      5.067          #/op
String.hashCode3                        avgt  5   6.512 ± 0.450  ns/op
String.hashCode3:CPI                    avgt      0.527          #/op
String.hashCode3:L1-dcache-load-misses  avgt      0.001          #/op
String.hashCode3:L1-dcache-loads        avgt     13.995          #/op
String.hashCode3:L1-dcache-stores       avgt      4.005          #/op

(Omitted the first implementation as those results are indistinguishable from the third)
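A minimal sketch (not from the talk; class and method names are hypothetical) of the kind of benchmark that provokes this true sharing: every benchmark thread hashes the same interned empty String, so with the intermediate implementation each call stores 0 back into one shared hash field.

import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class EmptyStringHash {
    // "" is interned, so all benchmark threads share this one String instance
    // and, with the always-storing implementation, keep dirtying its cache line.
    private final String empty = "";

    @Benchmark
    public int hashEmpty() {
        return empty.hashCode();
    }
}

Run with several threads (-t 4) and -prof perfnorm as above to see the per-operation stores and L1 misses.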

Sharing another story...

Around 2012 we got a new batch of multi-socket hardware where we started seeing intermittent performance regressions on various benchmarks, not seen before. Often reductions of around 5-10% from one build to another, then back to normal after a build or two... Then we discovered that within the same build - with everything else exactly the same - performance could flip back and forth between a good and bad state by simply adding any parameter to the command line... It took a few false starts, but soon a possible root cause was found.

The dreaded false sharing

HotSpot's parallel GC implementation has a PSPromotionManager class to keep information pertaining to an individual GC thread. Each thread's instance of the PSPromotionManager is allocated at the same time, and the instances are laid out next to each other in an array. When aligned to cache lines, all was good:

[Diagram: PSPromotionManager 0 and PSPromotionManager 1 each aligned to its own cache line (Cache Line 0, Cache Line 1)]

The dreaded false sharing

When the memory layout changed, say, due to the addition of a command line flag, the alignment of the PSPromotionManagers might shift so that they were now split across cache lines:

[Diagram: PSPromotionManager 0 and PSPromotionManager 1 straddling Cache Line 0 and Cache Line 1]

When mutating state at the end of the first manager instance, memory pertaining to the second manager is dirtied and evicted from CPU caches.

False sharing explained

On most multi-core CPU architectures, data stores will engage a cache coherency protocol to ensure the view of memory is kept consistent across CPUs - atomicity guarantees may incur added cost. False sharing happens when one (or more) CPUs write to memory that just happens to be on the same cache line as other memory that another CPU is working on. The cost of false sharing grows dramatically when work is spread across different sockets.
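A minimal, self-contained sketch (not from the talk) of how false sharing can be provoked in plain Java: two threads each update their own counter, but the counters live on the same cache line. Note that padding with dummy fields is only a rough illustration - the JVM is free to reorder fields, which is exactly why the @Contended annotation mentioned later exists.

public class FalseSharingDemo {
    static final class Counters {
        volatile long a;            // written only by thread 1
        volatile long b;            // written only by thread 2, likely same cache line as a
    }

    static final class PaddedCounters {
        volatile long a;
        long p1, p2, p3, p4, p5, p6, p7;   // crude padding to push b onto another cache line
        volatile long b;
    }

    static long runMillis(Runnable r1, Runnable r2) throws InterruptedException {
        Thread t1 = new Thread(r1), t2 = new Thread(r2);
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        final long N = 100_000_000L;

        Counters shared = new Counters();
        long sharedMs = runMillis(() -> { for (long i = 0; i < N; i++) shared.a++; },
                                  () -> { for (long i = 0; i < N; i++) shared.b++; });

        PaddedCounters padded = new PaddedCounters();
        long paddedMs = runMillis(() -> { for (long i = 0; i < N; i++) padded.a++; },
                                  () -> { for (long i = 0; i < N; i++) padded.b++; });

        System.out.println("same cache line: " + sharedMs + " ms, padded: " + paddedMs + " ms");
    }
}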

PSPromotionManager false sharing

Solution:
- Pad the PSPromotionManager object to be size-aligned with the cache line size
- Align the array of PSPromotionManager objects to start at a cache line boundary

Much more consistent performance in benchmarks using ParallelGC ever since! Most code isn't as massively parallel as the stop-the-world parallel GC algorithm at play here, so it's not unlikely there are problems lurking elsewhere that are simply harder to detect or provoke...

Contention everywhere!

JEP 143: Improved Contended Locking was a project delivered as part of JDK 9. Java differentiates between biased locks, thin locks, and heavy, contended monitors. Biased or thin locks are used as a fast path when application code needs to lock on an Object but no one else is contending for the lock. Monitors are installed, or inflated, into Objects when demand for the Object monitor becomes contentious. The JEP work includes a number of small but well-rounded optimizations to reduce the overall latencies of synchronization primitives in Java.
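A hedged JMH sketch (illustrative names, not from the JEP) of the difference described above: the same synchronized increment is measured once against a per-thread lock, which stays on the biased/thin fast path, and once against a lock shared by all benchmark threads, which gets inflated into a contended monitor.

import org.openjdk.jmh.annotations.*;

public class LockContention {
    @State(Scope.Thread)
    public static class ThreadLocalCounter {
        long value;
        public synchronized void inc() { value++; }   // each thread locks its own instance
    }

    @State(Scope.Benchmark)
    public static class SharedCounter {
        long value;
        public synchronized void inc() { value++; }   // all threads compete for one monitor
    }

    @Benchmark
    public void uncontended(ThreadLocalCounter c) { c.inc(); }

    @Benchmark
    public void contended(SharedCounter c) { c.inc(); }
}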

Pad all the things!

One optimization of JEP 143 was to pad the native data structure used to represent the contended monitor of an Object - up to 50% improvements in targeted microbenchmarks. Global monitors and mutexes in HotSpot were later padded out, too. Similarly, there is the @jdk.internal.vm.annotation.Contended facility (since JDK 8) to apply padding around Java fields and objects.
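A minimal sketch of how @Contended might be applied (class and field names are illustrative). The annotation lives in an internal package and is ignored for application code unless the JVM is started with -XX:-RestrictContended (and, on JDK 9+, the package is exported, e.g. --add-exports java.base/jdk.internal.vm.annotation=ALL-UNNAMED).

import jdk.internal.vm.annotation.Contended;

public class Positions {
    @Contended              // padded so writer updates don't false-share with readerPosition
    volatile long writerPosition;

    @Contended
    volatile long readerPosition;
}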

Inherent VM latencies

Safepoints, GC and VM operations

When the VM for some reason needs to stop all threads, it requests a safepoint that instructs all application threads to stop. Java threads perform safepoint polls at regular intervals during normal execution - if a safepoint has been requested, the thread will halt and not resume work until the safepoint has completed. Safepoints can be initiated by the GC or the VM, and by default the VM will safepoint at least once per second. While typically small and quick, time spent in safepoints and GC does put upper bounds on scalability.
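Safepoint pauses are easy to make visible with JDK 9's unified logging; a hedged example (app.jar is a placeholder):

$ java -Xlog:safepoint -jar app.jar   # log safepoint pauses, i.e. how long application threads were stopped
$ java -Xlog:gc -jar app.jar          # log GC pauses, which are a subset of all safepoints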

Monitor deflation and other cleanup...

One VM operation performed at safepoints is scanning for Java object monitors to deflate, which means removing the object monitor from an Object and recycling it to a global list. A single thread scans the entire population of monitors... and both in benchmarks and real world applications the monitor population might grow to 100s of thousands, causing this deflation operation to take significant time.

JDK 9: Made -XX:+MonitorInUseLists the default (and deprecated the flag)
Future work: Concurrent monitor deflation, parallelized safepoint cleanup

JEP 312: Thread-local handshakes (JDK 10)

Added infrastructure to move from only allowing global safepoints (that stop all threads) to being able to perform handshakes on any number of threads. "The big difference between safepointing and handshaking is that the per thread operation will be performed on all threads as soon as possible and they will continue to execute as soon as its own operation is completed." Enables a number of optimizations, such as no longer needing to stop all threads to revoke a biased lock.

UseMembar - not all optimizations age well

-XX:-UseMembar was implemented at a time when fence instructions on state of the art hardware were expensive; instead a "pseudo-membar" was introduced. It turns out this implementation caused various issues, including a scalability bottleneck when reading thread states, a long tail of actual bugs... there's even some false sharing possible when running more than ~64 Java threads. Making -XX:+UseMembar the default trades global synchronization of thread state for local fences. In single-threaded benchmarks that take a lot of transitions this can be a performance loss, since fences still have higher latencies. For scalability, however, it's typically preferable to take a local performance hit if it removes a cost incurred on all threads.

GC scalability

G1 scalability improvements in JDK 9

From experience with a 10 TB dataset: "Pause times reduced by 5-20x on read-only benchmark"; "For the first time we achieved stable operation on a mixed read-write workload with a 10 TB dataset" https://www.youtube.com/watch?v=lppgqvkouks

Key improvements include merging per-thread bitmaps into a single shared structure managed by lock-free algorithms, dropping worst case mark times from 15 minutes to mere seconds, mainly from becoming way more cache-friendly.

ZGC project proposed
- Max pause times not exceeding 10 ms on multi-TB heaps
- Parallel and concurrent: designed for 1000s of hardware threads => requires lock- and/or wait-free data structures => thread/CPU-local data
- Dynamic, NUMA-aware and lazy resource allocation
- Features striping: tries to be locality aware and allocate into compact chunks of memory that individual GC workers tend to, aiming to reduce memory contention and cache cross-talk
- Sacrifices a bit of throughput to improve latency and scalability
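ZGC later shipped as an experimental collector in JDK 11 (initially Linux/x64 only); a hedged example of enabling it there (heap size and application are placeholders):

$ java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -Xmx16g -jar app.jar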

Honorable mention: Project Loom

Project Loom

Implement a light-weight user-mode thread as an alternative to the OS-managed threads that are the de facto standard in Java today. The goal is that such "threads" will:
- Have minimal footprint - allow millions of concurrent jobs on a system which could host only thousands of Threads
- Remove penalties of blocking
- Support tail calls
- ...

5 minute lightning talk: https://www.youtube.com/watch?v=t-8fa3deulg

Q & A