
High Performance Managed Languages Martin Thompson - @mjpt777

Really, what is your preferred platform for building HFT applications?

Why do you build low-latency applications on a GC'ed platform?

Agenda
1. Let's set some Context
2. Runtime Optimisation
3. Garbage Collection
4. Algorithms & Design

Some Context

Let's be clear: a Managed Runtime is not always the best choice

Latency Arbitrage?

Two questions

Why build on a Managed Runtime?

Can managed languages provide good performance?

We need to follow the evidence

Are native languages faster?

Time? Skills & Resources?

What can, or should, be outsourced?

CPU vs Memory Performance

How much time to perform an addition operation on 2 integers?

1 CPU Cycle < 1ns

Sequential Access - Average time in ns/op to sum all longs in a 1GB array?

Access Pattern Benchmark

Benchmark                   Mode    Score     Error  Units
testSequential              avgt    0.832   ± 0.006  ns/op

~1 ns/op

Really??? Less than 1ns per operation?

Random walk per OS Page - Average time in ns/op to sum all longs in a 1GB array?

Access Pattern Benchmark

Benchmark                   Mode    Score     Error  Units
testSequential              avgt    0.832   ± 0.006  ns/op
testRandomPage              avgt    2.703   ± 0.025  ns/op

~3 ns/op

Data dependent walk per OS Page - Average time in ns/op to sum all longs in a 1GB array?

Access Pattern Benchmark

Benchmark                   Mode    Score     Error  Units
testSequential              avgt    0.832   ± 0.006  ns/op
testRandomPage              avgt    2.703   ± 0.025  ns/op
testDependentRandomPage     avgt    7.102   ± 0.326  ns/op

~7 ns/op

Random heap walk - Average time in ns/op to sum all longs in a 1GB array?

Access Pattern Benchmark

Benchmark                   Mode    Score     Error  Units
testSequential              avgt    0.832   ± 0.006  ns/op
testRandomPage              avgt    2.703   ± 0.025  ns/op
testDependentRandomPage     avgt    7.102   ± 0.326  ns/op
testRandomHeap              avgt   19.896   ± 3.110  ns/op

~20 ns/op

Data dependent heap walk - Average time in ns/op to sum all longs in a 1GB array?

Access Pattern Benchmark

Benchmark                   Mode    Score     Error  Units
testSequential              avgt    0.832   ± 0.006  ns/op
testRandomPage              avgt    2.703   ± 0.025  ns/op
testDependentRandomPage     avgt    7.102   ± 0.326  ns/op
testRandomHeap              avgt   19.896   ± 3.110  ns/op
testDependentRandomHeap     avgt   89.516   ± 4.573  ns/op

~90 ns/op

Then ADD 40+ ns/op for NUMA access on a server!!!!

Data Dependent Loads aka Pointer Chasing!!!
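
To make the comparison concrete, here is a minimal JMH-style sketch of this kind of access-pattern benchmark. The class name, array sizes, and method bodies are illustrative assumptions rather than the original benchmark code; the dependent walk chases a random permutation so each load address depends on the previous load. It needs a heap large enough for the arrays (e.g. -Xmx4g).

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OperationsPerInvocation;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@OperationsPerInvocation(AccessPatternBenchmark.SIZE)
public class AccessPatternBenchmark
{
    static final int SIZE = 128 * 1024 * 1024; // 128M longs = 1GB

    private long[] data;
    private int[] next; // a single-cycle random permutation of the indexes

    @Setup
    public void setup()
    {
        data = new long[SIZE];
        next = new int[SIZE];
        for (int i = 0; i < SIZE; i++)
        {
            data[i] = i;
            next[i] = i;
        }

        // Sattolo's algorithm: shuffle into one big cycle so the dependent
        // walk visits every element exactly once.
        for (int i = SIZE - 1; i > 0; i--)
        {
            final int j = ThreadLocalRandom.current().nextInt(i);
            final int tmp = next[i];
            next[i] = next[j];
            next[j] = tmp;
        }
    }

    @Benchmark
    public long testSequential()
    {
        long sum = 0;
        for (int i = 0; i < SIZE; i++)
        {
            sum += data[i]; // stride-1 access: the hardware prefetcher hides the latency
        }
        return sum;
    }

    @Benchmark
    public long testRandomHeap()
    {
        long sum = 0;
        for (int i = 0; i < SIZE; i++)
        {
            sum += data[next[i]]; // random addresses, but independent loads can overlap
        }
        return sum;
    }

    @Benchmark
    public long testDependentRandomHeap()
    {
        long sum = 0;
        int p = 0;
        for (int i = 0; i < SIZE; i++)
        {
            p = next[p];     // the next address depends on this load: pointer chasing
            sum += data[p];
        }
        return sum;
    }
}

The @OperationsPerInvocation annotation makes JMH report the average cost per element rather than per full array traversal.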

Performance 101
1. Memory is transported in Cachelines
2. Memory is managed in OS Pages
3. Memory is pre-fetched on predictable access patterns

Runtime Optimisation

Runtime JIT
1. Profile guided optimisations
2. Bets can be taken and later revoked

Branches

void foo()
{
    // code            <- Block A
    if (condition)
    {
        // code        <- Block B
    }
    // code            <- Block C
}

Guided by the runtime profile, the JIT emits the hot path contiguously (Block A followed directly by Block C) and moves the less likely Block B out of line.

Subtle Branches

int result = (i > 7) ? a : b;

CMOV vs Branch Prediction?

Method/Function Inlining

void foo()
{
    // code            <- Block A
    bar();
    // code            <- Block B
}

After inlining, the body of bar() is copied into foo() between Block A and Block B, removing the call and exposing both to further optimisation.

Method/Function Inlining: i-cache pressure & code bloat?

Method/Function Inlining Inlining is THE optimisation. - Cliff Click

Bounds Checking

void foo(int[] array, int length)
{
    // code
    for (int i = 0; i < length; i++)
    {
        bar(array[i]);
    }
    // code
}

Bounds Checking

void foo(int[] array)
{
    // code
    for (int i = 0; i < array.length; i++)
    {
        bar(array[i]);
    }
    // code
}

When the loop bound is array.length the JIT can prove every index is in range and eliminate, or hoist, the per-element bounds check.

Subtype Polymorphism

void draw(Shape[] shapes)
{
    for (int i = 0; i < shapes.length; i++)
    {
        shapes[i].draw();
    }
}

void bar(Shape shape)
{
    bar(shape.isVisible());
}

Class Hierarchy Analysis & Inline Caching
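
A small hypothetical sketch of the call sites such optimisations target; the Shape, Circle, and Square types here are invented for illustration, not from the talk.

interface Shape
{
    boolean isVisible();
    void draw();
}

final class Circle implements Shape
{
    public boolean isVisible() { return true; }
    public void draw() { /* render a circle */ }
}

final class Square implements Shape
{
    public boolean isVisible() { return true; }
    public void draw() { /* render a square */ }
}

public class CallSites
{
    // If only Circle has been loaded, Class Hierarchy Analysis lets the JIT
    // devirtualise and inline draw(). Once Circle and Square have both been seen,
    // the call site is bimorphic and a guarded inline cache keeps the fast path;
    // many receiver types (megamorphic) fall back to full virtual dispatch.
    static void drawAll(final Shape[] shapes)
    {
        for (int i = 0; i < shapes.length; i++)
        {
            shapes[i].draw();
        }
    }

    public static void main(final String[] args)
    {
        drawAll(new Shape[]{ new Circle(), new Square() });
    }
}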

Runtime JIT
1. Profile guided optimisations
2. Bets can be taken and later revoked

Garbage Collection

Generational Garbage Collection Only the good die young. - Billy Joel

Generational Garbage Collection

Young/New Generation: Eden (carved into per-thread TLABs), Survivor 0, Survivor 1, Virtual
Old Generation: Tenured, Virtual

Modern Hardware (Intel Sandy Bridge EP)

Registers/Buffers: <1ns
L1 cache: ~4 cycles, ~1ns
L2 cache: ~12 cycles, ~3ns
L3 cache: ~40 cycles, ~15ns (~60 cycles, ~20ns for a dirty hit)
QPI hop to the other socket: ~40ns
Local DRAM: ~65ns
Each socket has its own memory controller and PCI-e 3 lanes.

* Assumption: 3GHz Processor

Broadwell EX 24 cores & 60MB L3 Cache

Thread Local Allocation Buffers

TLABs within Eden (Young/New Generation):
Afford locality of reference
Avoid false sharing
Can have NUMA-aware allocation
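
As a hedged aside, the HotSpot flags behind these behaviours can be inspected at runtime; whether NUMA-aware allocation actually applies depends on the collector and JDK version. A minimal sketch using the standard HotSpotDiagnosticMXBean:

import java.lang.management.ManagementFactory;

import com.sun.management.HotSpotDiagnosticMXBean;

public class TlabSettings
{
    public static void main(final String[] args)
    {
        final HotSpotDiagnosticMXBean hotspot =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);

        // UseTLAB/ResizeTLAB control thread-local allocation buffers;
        // UseNUMA enables NUMA-aware allocation where the collector supports it.
        for (final String flag : new String[]{ "UseTLAB", "ResizeTLAB", "UseNUMA" })
        {
            System.out.println(flag + " = " + hotspot.getVMOption(flag).getValue());
        }
    }
}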

Object Survival

Young/New Generation: Eden (TLABs), Survivor 0, Survivor 1, Virtual
Aging policies
Compacting copy
NUMA interleave
Fast parallel scavenging
Only the survivors require work

Object Promotion

Survivors are promoted from the Young/New Generation (Eden, Survivor 0, Survivor 1) into the Old Generation (Tenured).
Concurrent Collection
String Deduplication

Compacting Collections

Compacting Collections Depth first copy

Compacting Collections

Compacting Collections OS Pages and cache lines?

G1 Concurrent Compaction

The heap is divided into regions: Eden (E), Survivor (S), Old (O), Humongous (H), and Unused.

Azul Zing C4 True Concurrent Compacting Collector

Where next for GC?

Object Inlining/Aggregation
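
Until a collector or runtime can aggregate objects for us, the effect can be approximated by hand. A hedged sketch, with invented class and field names, of flattening an array of small objects into parallel arrays (struct-of-arrays) so traversal becomes sequential access rather than pointer chasing:

public class PricesSoA
{
    // Array-of-objects layout: each element is a separate heap object,
    // so iterating means data-dependent loads.
    static final class PriceAoS
    {
        long timestamp;
        double bid;
        double ask;
    }

    // Struct-of-arrays layout: each field is laid out contiguously,
    // so iterating is prefetch-friendly sequential access.
    private final long[] timestamps;
    private final double[] bids;
    private final double[] asks;

    PricesSoA(final int capacity)
    {
        timestamps = new long[capacity];
        bids = new double[capacity];
        asks = new double[capacity];
    }

    void put(final int index, final long timestamp, final double bid, final double ask)
    {
        timestamps[index] = timestamp;
        bids[index] = bid;
        asks[index] = ask;
    }

    double midPrice(final int index)
    {
        return (bids[index] + asks[index]) / 2.0;
    }
}

Iterating midPrice over the indexes walks contiguous arrays instead of chasing one object reference per element.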

GC vs Manual Memory Management

It is not easy to pick a clear winner.

Managed (GC): the GC implementation itself, card marking, read/write barriers, object headers, background overhead in CPU and memory.

Native: the malloc implementation, arena/pool contention, bin wastage, fragmentation, debugging effort, inter-thread costs.
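
One hedged way to put a number on the managed-side background cost is to sample the collectors via the standard JMX beans; the class name below is just a sketch.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcOverhead
{
    public static void main(final String[] args)
    {
        // Reports cumulative collection counts and approximate elapsed collection time
        // for each collector in the running JVM.
        for (final GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans())
        {
            System.out.printf(
                "%s: collections=%d, timeMillis=%d%n",
                gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}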

Algorithms & Design

What is most important to performance?

Avoiding cache misses
Strength reduction
Avoiding duplicate work
Amortising expensive operations
Mechanical Sympathy
Choice of data structures
Choice of algorithms
API design
Overall design
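
As a concrete illustration of strength reduction from the list above, a small sketch (names are illustrative) replacing an integer modulus with a bit mask when the capacity is a power of two, as is commonly done for ring-buffer indexing:

public class StrengthReduction
{
    private static final int CAPACITY = 1024;    // must be a power of two
    private static final int MASK = CAPACITY - 1;

    static int indexWithModulo(final long sequence)
    {
        return (int)(sequence % CAPACITY);       // relatively expensive integer division
    }

    static int indexWithMask(final long sequence)
    {
        return (int)(sequence & MASK);           // single cheap AND; equivalent for non-negative sequences
    }

    public static void main(final String[] args)
    {
        for (long s = 0; s < 5; s++)
        {
            System.out.println(indexWithModulo(s) + " == " + indexWithMask(s));
        }
    }
}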

In a large codebase it is really difficult to do everything well

It also takes some uncommon disciplines, such as profiling, telemetry, and modelling

If I had more time, I would have written a shorter letter. - Blaise Pascal

The story of Aeron

Aeron is an interesting lesson in time to performance

Lots of other examples exist, such as the C# Roslyn compiler

Time spent on Mechanical Sympathy vs Debugging Pointers???

Immutable Data & Concurrency

Functional Programming

In Closing

What does the future hold?

Remember Assembly vs Compiled Languages

What about the issues of footprint, startup time, GC pauses, etc.???

Questions?

Blog: http://mechanical-sympathy.blogspot.com/
Twitter: @mjpt777

Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius, and a lot of courage, to move in the opposite direction. - Albert Einstein