Low latency & Mechanical Sympathy: Issues and solutions


Jean-Philippe BEMPEL, Performance Architect, ULLINK (2016)
@jpbempel
http://jpbempel.blogspot.com

Low latency order router
- Pure Java SE application
- FIX protocol / Direct Market Access connectivity (TCP)
- Messages are transformed to an intermediate form (UL Message)

Performance target
- Process and route orders in < 100 µs

Low latency
- Process in under 1 ms
- No (disk) I/O involved
- Memory is the new disk
- Hardware architecture has a major impact on performance
- Mechanical Sympathy

Agenda
- GC pauses
- Power management
- NUMA
- Cache pollution

GC pauses

Generations

GC pauses
- Minor GC and Full GC are Stop the World (STW) => all application threads are stopped
- Minor GC pauses: 30-100 ms
- Full GC pauses: a couple of seconds to minutes

Full GC
- None during trading hours
- Outside of trading hours, maintenance time is allowed
- Java heap sized accordingly to avoid Full GC

Minor GC
- Some tricks to reduce both frequency and pause time

Minor GC frequency
- Increase the Young Gen (512 MB)
- But increasing the Young Gen can increase pause time
- A larger Young Gen also increases the likelihood that objects die before they are collected
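The effect of Young Gen size on frequency can be sketched with back-of-envelope arithmetic: if the application allocates at a steady rate, the time between minor GCs is roughly the Eden size divided by that rate. The 20 MB/s allocation rate below is purely hypothetical, and treating the whole 512 MB as Eden ignores the survivor spaces, so this is an approximation rather than a measurement.

```java
public final class MinorGcInterval {

    /** Approximate seconds between minor GCs, assuming a steady allocation rate fills Eden. */
    static double secondsBetweenMinorGcs(long edenBytes, long allocBytesPerSecond) {
        if (allocBytesPerSecond <= 0) {
            throw new IllegalArgumentException("allocation rate must be positive");
        }
        return (double) edenBytes / allocBytesPerSecond;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long allocRate = 20 * mb; // hypothetical allocation rate, not taken from the talk
        System.out.println("128 MB: " + secondsBetweenMinorGcs(128 * mb, allocRate) + " s");
        System.out.println("512 MB: " + secondsBetweenMinorGcs(512 * mb, allocRate) + " s");
    }
}
```

Quadrupling the space quadruples the interval under this model; whether pause time grows with it depends on how much data is still live at collection time.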

Minor GC pause time
Pause time factors:
- Reference traversal
- Card scanning
- Object copy (to survivor or tenured space)

Card scanning
[Diagram: GC roots, Old Gen, Young Gen, and the card table (entries labelled C, C, D), with steps numbered 1 to 4]
- 1 card = 512 bytes
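To illustrate why card scanning contributes to pause time, here is a minimal sketch of a card table with 512-byte cards. It is not HotSpot's implementation: the class name, the CLEAN/DIRTY byte values and the offset-based addressing are assumptions made for the example. The point is only that a reference store into the Old Gen marks one card, and a minor GC then has to scan every dirty card to find Old-to-Young references.

```java
public final class CardTableSketch {
    static final int CARD_SHIFT = 9;   // 2^9 = 512 bytes per card, as stated on the slide
    static final byte CLEAN = 0;       // illustrative encoding, not HotSpot's
    static final byte DIRTY = 1;

    private final byte[] cards;

    public CardTableSketch(long oldGenBytes) {
        long cardSize = 1L << CARD_SHIFT;
        // One byte per card, rounding up so a partial last card is still covered.
        cards = new byte[(int) ((oldGenBytes + cardSize - 1) >>> CARD_SHIFT)];
    }

    /** What a write barrier would do when a reference field at this Old Gen offset is updated. */
    public void markDirty(long oldGenOffset) {
        cards[(int) (oldGenOffset >>> CARD_SHIFT)] = DIRTY;
    }

    public int dirtyCards() {
        int count = 0;
        for (byte card : cards) {
            if (card == DIRTY) count++;
        }
        return count;
    }

    /** Amount of Old Gen memory a minor GC would have to scan for Old-to-Young references. */
    public long bytesToScan() {
        return (long) dirtyCards() << CARD_SHIFT;
    }
}
```

For example, stores at offsets 0, 511 and 512 dirty two cards, so 1024 bytes of Old Gen would be scanned.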

Object copy
- Orders need to be kept for the lifetime of the application
- Orders are relatively large objects
- Orders will be promoted to the old generation => copied to survivor spaces and then to Old Gen
- This significantly increases the pause time of minor GCs

Object copy
- Access pattern: 90% of the time, an order is accessed immediately after creation
- Weak references: avoid copying order objects
- A persistence cache allows orders to be reloaded without too much impact
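The weak-reference idea can be sketched as follows. This is not the talk's actual code: the class name and the `loader` function (standing in for a reload from the persistence cache) are hypothetical. Orders are held only through `WeakReference`, so an order nobody is using can be reclaimed in the Young Gen instead of being copied to survivor spaces and Old Gen, and it is reloaded on the rare later access.

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Sketch only: not thread-safe, and stale map entries are never purged. */
public final class WeakOrderCache<K, V> {
    private final Map<K, WeakReference<V>> refs = new HashMap<>();
    private final Function<K, V> loader; // hypothetical reload from the persistence cache

    public WeakOrderCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Registers a freshly created order without keeping it strongly reachable. */
    public void put(K id, V order) {
        refs.put(id, new WeakReference<>(order));
    }

    /** Returns the order if it is still in memory, otherwise reloads it. */
    public V get(K id) {
        WeakReference<V> ref = refs.get(id);
        V order = (ref == null) ? null : ref.get();
        if (order == null) {
            order = loader.apply(id); // slow path, expected to be rare
            refs.put(id, new WeakReference<>(order));
        }
        return order;
    }
}
```

The trade-off is that a reload costs far more than a direct access, which is only acceptable because most accesses happen right after creation.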

Allocation
- Allocation should be controlled carefully
- Reduces how often minor GCs occur
- The JDK API is not always GC/allocation friendly

Allocation
- Some parts of the JDK need to be rewritten, for example the int -> char[] conversion:
  int inttostring(int i, char[] buffer, int offset)
- This avoids intermediate/temporary char[] arrays
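A possible implementation of such a conversion is sketched below. The slide only gives the signature; the camelCase name, the meaning of the return value (the offset just past the last character written) and the digit loop are assumptions made for this example. The caller owns the buffer and must make sure it has room for up to 11 characters.

```java
public final class IntToChars {

    /**
     * Writes the decimal form of i into buffer starting at offset, without allocating.
     * Returns the offset just past the last character written (assumed contract).
     */
    static int intToString(int i, char[] buffer, int offset) {
        // Work with the negative value so that Integer.MIN_VALUE does not overflow.
        int n = (i < 0) ? i : -i;

        int digits = 1;
        for (int t = n; t <= -10; t /= 10) {
            digits++;
        }

        int end = offset + digits + (i < 0 ? 1 : 0);
        if (i < 0) {
            buffer[offset] = '-';
        }

        // Fill the digits from right to left; n % 10 is in the range -9..0 here.
        int pos = end;
        do {
            buffer[--pos] = (char) ('0' - (n % 10));
            n /= 10;
        } while (n != 0);

        return end;
    }

    public static void main(String[] args) {
        char[] buffer = new char[32]; // reused across calls, so no per-call garbage
        int end = intToString(-12345, buffer, 0);
        System.out.println(new String(buffer, 0, end));
    }
}
```

Compared with `Integer.toString(i)`, nothing is allocated per call, at the cost of the caller managing the buffer.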

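Allocation
- Use empty collections and pre-sizing:

    Collection<String> process() {
        // Start with the shared immutable empty list: nothing is allocated if there are no results.
        Collection<String> results = Collections.emptyList();
        for (String s : list) { // list is the input collection, defined elsewhere
            if (results == Collections.<String>emptyList())
                results = new ArrayList<>(list.size()); // allocate once, pre-sized
            results.add(s);
        }
        return results;
    }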

CMS

CMS
- Promotion allocation uses free lists (like malloc)
- No compaction of the Old Gen
- This increases minor GC pause time

CMS
- Full GC fallback is unpredictable (fragmentation, promotion failure, ...)

    176710.366: [GC 176710.366: [ParNew (promotion failed): 471872K->455002K(471872K), 2.4424250 secs]176712.809: [CMS: 6998032K->5328833K(7864320K), 53.8558640 secs] 7457624K->5328833K(8336192K), [CMS Perm : 199377K->198212K(524288K)], 56.2987730 secs] [Times: user=56.60 sys=0.78, real=56.29 secs]

- Same problem for G1

Zing JVM from Azul?
- We are partners
- Best GC algorithm in the world
- Our tests show a maximum pause time of 2 ms
- Some overhead compared to HotSpot (8%)
- False issue syndrome

Power management

Why does power management matter?
- Few orders per day; the application is idle most of the time

Power management technologies
- CPUs embed technologies to reduce power usage
- Frequency scaling during execution (P-states): P0 = nominal frequency
- Deep sleep modes during idle phases (C-states): C0 = running, C1 = idle
- Waking up from a sleep state has a latency:

    cat /sys/devices/system/cpu/cpu0/cpuidle/state3/latency
    200 (us)
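The same sysfs files can be read from Java to list the wake-up latency of every idle state on a machine. This is a small sketch assuming a Linux system that exposes the cpuidle directory for cpu0; the class and method names are made up for the example, and the program simply reports when the directory is absent.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public final class CpuIdleLatency {

    /** Parses the content of a cpuidle "latency" file: a single integer, in microseconds. */
    static long parseLatencyMicros(String fileContent) {
        return Long.parseLong(fileContent.trim());
    }

    private static String read(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get("/sys/devices/system/cpu/cpu0/cpuidle");
        if (!Files.isDirectory(dir)) {
            System.out.println("cpuidle is not exposed on this system");
            return;
        }
        // Each stateN directory describes one idle state (name, latency, ...).
        try (DirectoryStream<Path> states = Files.newDirectoryStream(dir, "state*")) {
            for (Path state : states) {
                String name = read(state.resolve("name")).trim();
                long latency = parseLatencyMicros(read(state.resolve("latency")));
                System.out.println(name + ": wake-up latency " + latency + " us");
            }
        }
    }
}
```

A wake-up latency of 200 µs on its own is twice the 100 µs target stated earlier, which is why deep idle states matter for a mostly idle application.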

BIOS settings

BIOS settings impact
[Four chart slides; the charts are not recoverable from the transcription]

Core reduction / fixed Turbo Boost
- Turbo Boost is dynamic frequency scaling
- It depends on the thermal envelope
- Reducing the number of active cores reduces the thermal envelope => higher frequency
- Some vendors can fix this higher frequency
- Example: Intel E5 v3, 14 cores => 2 cores, 2.6 GHz => 3.6 GHz

Power management at OS level
- Some OS drivers are also very aggressive:

    # cat /sys/devices/system/cpu/cpuidle/current_driver
    intel_idle

- Disabled with a kernel boot parameter: intel_idle.max_cstate=0

With intel_idle driver, 1 msg/s

Without intel_idle driver, 1 msg/s

With intel_idle driver, 100 msg/s

Without intel_idle driver, 100 msg/s

NUMA

Uniform Memory Access?
[Diagram: one CPU attached to DRAM]

Uniform Memory Access?
[Diagram: two CPUs attached to the same DRAM]

NUMA
- Non-Uniform Memory Access
- Applies from 2 CPUs (sockets) upwards
- 1 memory controller per socket (node)
- Local memory vs remote memory

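NUMA architecture
[Diagram: Socket 0 and Socket 1, each with a memory controller (MC) and its own DRAM, connected by QPI; labelled latencies: 65 ns at each memory controller, 40 ns on the QPI link]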

How to deal with NUMA?
- Avoid remote memory access
- Avoid scheduling and allocation on different nodes
- Bind the process to one CPU (node) only, using numactl
- Restrictive but effective
- Binding on node 0: numactl --cpunodebind=0 --membind=0 java
- BIOS: node interleaving disabled

Cache pollution

Cache hierarchy architecture
[Diagram: Socket 0 with Core0 and Core1, each with its own L1 and L2 caches, sharing one L3 cache in front of DRAM]

Cache pollution on L3
- The CPU's L3 cache is shared across cores
- Non-critical threads can pollute the cache, evicting data required by critical threads
  => adds latency when that data is accessed again
- Critical threads need to be isolated from non-critical ones

Thread affinity
- OS scheduling cannot handle this case
- Threads need to be pinned manually to cores
- Thread affinity allows us to do this
- Identify the critical and non-critical threads in the application
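Pinning itself is done outside standard Java, either with a native library such as the Java Thread Affinity library listed in the references or with an OS tool such as `taskset`. The sketch below only shows the bookkeeping part: turning a set of core ids into the hexadecimal bitmask that `taskset -p <mask> <tid>` expects. The class name and the core numbers are illustrative assumptions, and the helper is limited to 64 cores.

```java
public final class AffinityMask {

    /** Builds a CPU bitmask with one bit set per core id (core 0 is the lowest bit). */
    static long maskOf(int... cores) {
        long mask = 0L;
        for (int core : cores) {
            if (core < 0 || core > 63) {
                throw new IllegalArgumentException("core out of range: " + core);
            }
            mask |= 1L << core;
        }
        return mask;
    }

    static String toHex(long mask) {
        return "0x" + Long.toHexString(mask);
    }

    public static void main(String[] args) {
        // Hypothetical layout: critical threads on cores 2 and 3, everything else on 0 and 1.
        System.out.println("critical:     " + toHex(maskOf(2, 3)));
        System.out.println("non-critical: " + toHex(maskOf(0, 1)));
    }
}
```

The native thread ids needed by `taskset` can be found in a thread dump (the `nid` field) once the critical threads have been identified by name.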

No thread affinity

With thread affinity

System processes & cache pollution
- Other processes can also pollute the L3 cache
- All other processes need to be kept away from critical threads
- The whole CPU needs to be dedicated

Isolating cores
- No other processes should be scheduled on the same socket (isolcpus)
- But if more than one thread runs per core, cpuset must be used
[Diagram: the scheduler above Socket 0 (Core0, Core1, Core2, Core3) and Socket 1 (Core4, Core5, Core6, Core7)]

Recommendations
- Make GC pauses predictable
- Tune BIOS/OS for maximum performance
- Control NUMA effects by binding to one node
- Avoid cache pollution with thread affinity and CPU isolation

Conclusion
- You can optimize a process without changing your code
- Know your hardware and your OS
- This is Mechanical Sympathy

References
- Mechanical Sympathy blog: http://mechanical-sympathy.blogspot.com/
- Mechanical Sympathy forum: https://groups.google.com/forum/#!forum/mechanical-sympathy
- Java Thread Affinity library: https://github.com/peter-lawrey/java-thread-affinity

Q&A
Jean-Philippe BEMPEL, Performance Architect
@jpbempel
http://jpbempel.blogspot.com