
Attila Szegedi, Software Engineer @asz

Everything I ever learned about JVM performance tuning @twitter

More than I ever wanted to learn about JVM performance tuning @twitter

Memory tuning CPU usage tuning Lock contention tuning I/O tuning

Twitter's biggest enemy: Latency. CC licensed image from http://www.flickr.com/photos/dunechaser/213255210/

Latency contributors By far the biggest contributor is the garbage collector; the others are, in no particular order: in-process locking and thread scheduling, I/O, and application algorithmic inefficiencies.

Areas of performance tuning Memory tuning Lock contention tuning CPU usage tuning I/O tuning

Areas of memory performance tuning Memory footprint tuning Allocation rate tuning Garbage collection tuning

Memory footprint tuning So you got an OutOfMemoryError Maybe you just have too much data! Maybe your data representation is fat! You can also have a genuine memory leak

Too much data Run with -verbosegc. Observe the numbers in Full GC messages: [Full GC $before->$after($total), $time secs]. Can you give the JVM more memory? Do you need all that data in memory? Consider using an LRU cache, or soft references*
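
For illustration, a minimal sketch of an LRU cache, here built on LinkedHashMap's access order and removeEldestEntry; the class and its capacity policy are an assumption for this example, not something from the talk:

import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache sketch: a LinkedHashMap in access order evicts the
// least recently used entry once the capacity is exceeded.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true);          // true = iterate in access order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;        // evict once we grow past capacity
    }
}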

Fat data Can be a problem when you want to do wacky things, like: load the full Twitter social graph in a single JVM, or load all user metadata in a single JVM. Slimming the internal data representation works at these economies of scale.

Fat data: object header The JVM object header is normally two machine words. That's 16 bytes, or 128 bits, on a 64-bit JVM! new java.lang.Object() takes 16 bytes. new byte[0] takes 24 bytes.

Fat data: padding class A { byte x; } class B extends A { byte y; } new A() takes 24 bytes. new B() takes 32 bytes.

Fat data: no inline structs class C { Object obj = new Object(); } new C() takes 40 bytes. Similarly, no inline array elements.
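
The sizes quoted on these slides can be checked empirically; one option (not mentioned in the talk, so treat it as an assumption) is the OpenJDK JOL tool, pulled in via the org.openjdk.jol:jol-core dependency:

import org.openjdk.jol.info.ClassLayout;

class C { Object obj = new Object(); }

public class LayoutCheck {
    public static void main(String[] args) {
        // Prints header, field offsets, padding and total instance size
        // as seen by the running JVM.
        System.out.println(ClassLayout.parseClass(C.class).toPrintable());
        System.out.println(ClassLayout.parseInstance(new byte[0]).toPrintable());
    }
}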

Slimming taken to extreme A research project had to load the full follower graph in memory. Each vertex's edges ended up being represented as int arrays. If it grows further, we can consider variable-length differential encoding in a byte array.
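
A sketch of the variable-length differential encoding idea, assuming a sorted int[] of ids; the exact encoding below (delta plus unsigned varint) is illustrative, not Twitter's actual format:

import java.io.ByteArrayOutputStream;

class DeltaEncoding {
    // Delta-encode a sorted int[] of ids into a byte[]: store the gap between
    // consecutive ids as an unsigned varint (7 bits per byte, high bit set
    // means "more bytes follow").
    static byte[] encode(int[] sortedIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int id : sortedIds) {
            int delta = id - previous;
            previous = id;
            while ((delta & ~0x7F) != 0) {      // still more than 7 bits left
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);                   // last byte, high bit clear
        }
        return out.toByteArray();
    }
}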

Compressed object pointers Pointers become 4 bytes long. Usable below 32 GB of max heap size. Automatically used below 30 GB of max heap.

Compressed object pointers

                 Uncompressed   Compressed   32-bit
Pointer                8            4           4
Object header         16           12*          8
Array header          24           16          12
Superclass pad         8            4           4

* Object can have 4 bytes of fields and still only take up 16 bytes

Avoid instances of primitive wrappers Hard-won experience with Scala 2.7.7: a Seq[Int] stores java.lang.Integer, an Array[Int] stores int; the first needs (24 + 32 * length) bytes, the second needs (24 + 4 * length) bytes.

Avoid instances of primitive wrappers This was fixed in Scala 2.8, but it shows that: you often don't know the performance characteristics of your libraries, and won't ever know them until you run your application under a profiler.

Map footprints Guava MapMaker.makeMap() takes 2272 bytes! MapMaker.concurrencyLevel(1).makeMap() takes 352 bytes! A ConcurrentMap with level 1 makes sense sometimes (i.e. when you don't want a ConcurrentModificationException).
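
For illustration, the two Guava calls side by side; the sizes quoted above are the talk's measurements for an empty map, and the wrapper class here is just for the example:

import com.google.common.collect.MapMaker;
import java.util.concurrent.ConcurrentMap;

class MapFootprint {
    // The default concurrency level allocates multiple internal segments up
    // front, which is where most of an empty map's footprint goes;
    // concurrencyLevel(1) keeps ConcurrentMap semantics with one segment.
    ConcurrentMap<String, Object> big   = new MapMaker().makeMap();
    ConcurrentMap<String, Object> small = new MapMaker().concurrencyLevel(1).makeMap();
}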

Thrift can be heavy Thrift-generated classes are used to encapsulate a wire transfer format. Using them as your domain objects: almost never a good idea.

Thrift can be heavy Every Thrift class with a primitive field has a java.util.BitSet isset_bit_vector field. It adds between 52 and 72 bytes of overhead per object.

Thrift can be heavy

Thrift can be heavy Thrift does not support 32-bit floats. Coupling the domain model with the transport format creates resistance to changing the domain model. You also miss opportunities for interning and N-to-1 normalization.

class Location {
  public String city;
  public String region;
  public String countryCode;
  public int metro;
  public List<String> placeIds;
  public double lat;
  public double lon;
  public double confidence;
}

class SharedLocation {
  public String city;
  public String region;
  public String countryCode;
  public int metro;
  public List<String> placeIds;
}

class UniqueLocation {
  private SharedLocation sharedLocation;
  public double lat;
  public double lon;
  public double confidence;
}
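
A hedged sketch of the N-to-1 normalization opportunity this split enables: canonicalize SharedLocation instances so that many tweets from the same place share one object. The interner below is illustrative, not from the talk, and assumes SharedLocation implements equals() and hashCode() over its fields:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class SharedLocationInterner {
    private final ConcurrentMap<SharedLocation, SharedLocation> canonical =
        new ConcurrentHashMap<>();

    // Returns one canonical instance per distinct value; duplicate instances
    // built during parsing become garbage and die young.
    SharedLocation intern(SharedLocation loc) {
        SharedLocation existing = canonical.putIfAbsent(loc, loc);
        return existing != null ? existing : loc;
    }
}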

Careful with thread locals Thread locals stick around. Particularly problematic in thread pools with m×n resource association. 200 pooled threads using 50 connections: you end up with 10,000 connection buffers. Consider using synchronized objects, or just create new objects all the time.

Part II: fighting latency

Performance tradeoff Memory vs. time. Convenient, but oversimplified view.

Performance triangle Memory footprint Throughput Latency

Performance triangle Compactness Throughput Responsiveness C · T · R = a Tuning: vary C, T, R for fixed a. Optimization: increase a.

Performance triangle Compactness: inverse of memory footprint. Responsiveness: longest pause the application will experience. Throughput: amount of useful application CPU work over time. Can trade one for the other, within limits. If you have spare CPU, it can be a pure win.

Responsiveness vs. throughput

Biggest threat to responsiveness in the JVM is the garbage collector

Memory pools Eden Survivor Old Permanent Code cache This is entirely HotSpot specific!

How does young gen work? Eden S1 S2 Old All new allocation happens in eden. It only costs a pointer bump. When eden fills up, a stop-the-world copy-collection into a survivor space occurs. Dead objects cost zero to collect. After several collections, survivors get tenured into the old generation.

Ideal young gen operation Big enough to hold more than one set of all concurrent request-response cycle objects. Each survivor space big enough to hold active request objects + tenuring ones. Tenuring threshold such that long-lived objects tenure fast.

Old generation collectors Throughput collectors -XX:+UseSerialGC -XX:+UseParallelGC -XX:+UseParallelOldGC Low-pause collectors -XX:+UseConcMarkSweepGC -XX:+UseG1GC (can't discuss it here)

Adaptive sizing policy Throughput collectors can automatically tune themselves: -XX:+UseAdaptiveSizePolicy -XX:MaxGCPauseMillis=<n> (e.g. 100) -XX:GCTimeRatio=<n> (e.g. 19)
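
A hypothetical launch line combining these flags; the heap size and jar name are placeholders, and the values are just the slide's examples:

java -server -Xmx4g \
     -XX:+UseParallelOldGC \
     -XX:+UseAdaptiveSizePolicy \
     -XX:MaxGCPauseMillis=100 \
     -XX:GCTimeRatio=19 \
     -jar service.jar

GCTimeRatio=19 asks the collector to spend at most 1/(1+19) = 5% of total time in GC, and MaxGCPauseMillis=100 sets a 100 ms pause goal; the adaptive sizing policy then resizes the generations to chase those goals.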

Adaptive sizing policy at work

Choose a collector Bulk service: throughput collector, no adaptive sizing policy. Everything else: try the throughput collector with adaptive sizing policy. If it didn't work, use concurrent mark-and-sweep (CMS).

Always start with tuning the young generation Enable -XX:+PrintGCDetails, -XX:+PrintHeapAtGC, and -XX:+PrintTenuringDistribution. Watch survivor sizes! You'll need to determine the desired survivor size. There's no such thing as a desired eden size, mind you. The bigger, the better, with some responsiveness caveats. Watch the tenuring threshold; you might need to tune it to tenure long-lived objects faster.
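
For example, a diagnostic run might enable the logging flags above and pin the young generation sizes explicitly; the sizes below are placeholders, not recommendations:

java -Xmx8g -Xmn2g \
     -XX:SurvivorRatio=8 \
     -XX:MaxTenuringThreshold=4 \
     -XX:+PrintGCDetails \
     -XX:+PrintHeapAtGC \
     -XX:+PrintTenuringDistribution \
     -jar service.jar

Note that on JDK 9 and later these Print* flags were replaced by unified -Xlog:gc* logging; the slides predate that change.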

-XX:+PrintHeapAtGC

Heap after GC invocations=7000 (full 87):
 par new generation   total 4608000K, used 398455K
  eden space 4096000K,   0% used
  from space  512000K,  77% used
  to   space  512000K,   0% used
 concurrent mark-sweep generation total 3072000K, used 1565157K
 concurrent-mark-sweep perm gen   total   53256K, used   31889K
}

-XX:+PrintTenuringDistribution

Desired survivor size 262144000 bytes, new threshold 4 (max 4)
- age 1: 137474336 bytes, 137474336 total
- age 2:  37725496 bytes, 175199832 total
- age 3:  23551752 bytes, 198751584 total
- age 4:  14772272 bytes, 213523856 total

Things of interest: the number of ages, and the size distribution across the ages. You want it strongly declining.

Tuning the CMS Give your app as much memory as possible. CMS is speculative: more memory, less punitive miscalculations. Try using CMS without tuning. Use -verbosegc and -XX:+PrintGCDetails. Didn't get any Full GC messages? You're done! Otherwise, tune the young generation first.

Tuning the old generation Goals: Keep the fragmentation low. Avoid full GC stops. Fortunately, the two goals are not conflicting.

Tuning the old generation Find the minimum and maximum working set size (observe Full GC numbers under stable state and under load). Overprovision the numbers by 25-33%. This gives CMS a cushion to concurrently clean memory as it's used.

Tuning the old generation Set -XX:CMSInitiatingOccupancyFraction to 80 or 75, respectively; it corresponds to the overprovisioned heap ratio. You can lower the initiating occupancy fraction to 0 if you have CPU to spare.

Responsiveness still not good enough? Too many live objects during young gen GC: Reduce NewSize, reduce survivor spaces, reduce tenuring threshold. Too many threads: Find the minimal concurrency level, or split the service into several JVMs.

Responsiveness still not good enough? Does the CMS abortable preclean phase, well, abort? It is sensitive to the number of objects in the new generation, so: go for a smaller new generation, or try to reduce the amount of short-lived garbage your app creates.

Part III: let's take a break from GC

Thread coordination optimization You don't always have to go for synchronized. Synchronization is a read barrier on entry and a write barrier on exit. Sometimes you only need a half-barrier, e.g. in a producer-observer pattern. Volatiles can be used as half-barriers.
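
A minimal sketch of the volatile-as-half-barrier idea in a producer-observer pair; the class and field names are illustrative:

class ProgressTracker {
    // Plain field, written only by the single producer thread.
    private long itemsProcessed;
    // Volatile "publish" field: the write releases prior writes to other
    // threads, the read on the observer side acquires them.
    private volatile long published;

    // Producer thread
    void recordBatch(int size) {
        itemsProcessed += size;       // plain write
        published = itemsProcessed;   // volatile write publishes the update
    }

    // Observer thread: only ever reads, so a full lock (read barrier on
    // entry plus write barrier on exit) would be overkill here.
    long lastPublished() {
        return published;             // volatile read
    }
}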

Thread coordination optimization For atomic update of a single value, you only need Atomic{Integer|Long}.compareAndSet(). You can use AtomicReference.compareAndSet() for atomic update of composite values represented by immutable objects.
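
A sketch of compare-and-set on a composite value held in an immutable object; the min/max tracker below is just an example of the pattern:

import java.util.concurrent.atomic.AtomicReference;

final class MinMax {
    final long min, max;
    MinMax(long min, long max) { this.min = min; this.max = max; }
}

class MinMaxTracker {
    private final AtomicReference<MinMax> state =
        new AtomicReference<>(new MinMax(Long.MAX_VALUE, Long.MIN_VALUE));

    void record(long sample) {
        for (;;) {                              // classic CAS retry loop
            MinMax current = state.get();
            MinMax updated = new MinMax(Math.min(current.min, sample),
                                        Math.max(current.max, sample));
            if (state.compareAndSet(current, updated)) {
                return;                         // both fields swapped atomically
            }
        }
    }
}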

Fight CMS fragmentation with slab allocators CMS doesn't compact, so it's prone to fragmentation, which will lead to a stop-the-world pause. Apache Cassandra uses a slab allocator internally.

Cassandra slab allocator 2 MB slab sizes; copy byte[] into them using compare-and-set. GC before: 30-60 seconds every hour. GC after: 5 seconds once in 3 days and 10 hours.
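
A minimal sketch of the copy-into-a-slab-with-CAS idea, not Cassandra's actual code; slab recycling and overflow handling are deliberately omitted:

import java.util.concurrent.atomic.AtomicInteger;

class Slab {
    private static final int SLAB_SIZE = 2 * 1024 * 1024;   // 2 MB slab
    private final byte[] data = new byte[SLAB_SIZE];
    private final AtomicInteger nextFree = new AtomicInteger();

    // Copies src into the slab and returns its offset, or -1 if the slab is
    // full (the caller would then move on to a fresh slab).
    int allocate(byte[] src) {
        for (;;) {
            int offset = nextFree.get();
            if (offset + src.length > SLAB_SIZE) {
                return -1;
            }
            // compare-and-set claims the region without a lock
            if (nextFree.compareAndSet(offset, offset + src.length)) {
                System.arraycopy(src, 0, data, offset, src.length);
                return offset;
            }
        }
    }
}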

Slab allocator constraints Works for limited usage: buffers are written to linearly, flushed to disk and recycled when they fill up. The objects need to be converted to binary representation anyway. If you need random freeing and compaction, you're heading in the wrong direction. If you find yourself writing a full memory manager on top of byte buffers, stop!

Soft references revisited Soft reference clearing is based on the amount of free memory available when GC encounters the reference. By definition, throughput collectors always clear them. Can use them with CMS, but they increase memory pressure and make the behavior less predictable. Need two GC cycles to get rid of referenced objects.
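
For reference, a sketch of what using a soft reference for a cached value looks like; the wrapper class is an illustration only, and the slide's caveats about predictability and the extra GC cycle still apply:

import java.lang.ref.SoftReference;
import java.util.function.Supplier;

class SoftCachedValue<T> {
    private volatile SoftReference<T> ref = new SoftReference<>(null);

    // Returns the cached value, recomputing it if the collector has cleared
    // the soft reference under memory pressure.
    T get(Supplier<T> recompute) {
        T value = ref.get();
        if (value == null) {
            value = recompute.get();
            ref = new SoftReference<>(value);
        }
        return value;
    }
}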

More than I ever wanted to learn about JVM performance tuning @twitter Questions?

Attila Szegedi, Software Engineer @asz