A Quest for Predictable Latency Adventures in Java Concurrency. Martin Thompson



If a system does not respond in a timely manner then it is effectively unavailable

1. It's all about the Blocking
2. What do we mean by Latency?
3. Adventures with Locks and Queues
4. Some Alternative FIFOs
5. Where can we go Next?

1. It's all about the Blocking

What is preventing progress?

A thread is blocked when it cannot make progress

There are two major causes of blocking

Blocking
Systemic Pauses: JVM safepoints (GC, etc.), Transparent Huge Pages (Linux), hardware (C-States, SMIs)
Concurrent Algorithms: notifying completion, mutual exclusion (contention), synchronisation/rendezvous

Concurrent Algorithms
Locks: call into the kernel on contention; always blocking; difficult to get right.
Atomic/CAS instructions: execute in user space; can be non-blocking; very difficult to get right.
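The contrast can be made concrete with a shared counter. A minimal sketch of both styles (names are illustrative, not from the talk): the lock variant may park a contended thread in the kernel, while the CAS variant retries in user space.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

public class CounterStyles
{
    // Lock-based: a contended thread may be parked by the kernel.
    static final ReentrantLock LOCK = new ReentrantLock();
    static long lockedCounter = 0;

    static void lockedIncrement()
    {
        LOCK.lock();
        try
        {
            lockedCounter++;
        }
        finally
        {
            LOCK.unlock();
        }
    }

    // CAS-based: losers retry in user space; lock-free because some
    // thread always makes progress.
    static final AtomicLong CAS_COUNTER = new AtomicLong();

    static void casIncrement()
    {
        long current;
        do
        {
            current = CAS_COUNTER.get();
        }
        while (!CAS_COUNTER.compareAndSet(current, current + 1));
    }

    public static void main(final String[] args)
    {
        for (int i = 0; i < 1000; i++)
        {
            lockedIncrement();
            casIncrement();
        }

        System.out.println(lockedCounter + " " + CAS_COUNTER.get());
    }
}
```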

2. What do we mean by Latency?

Queuing Theory
An item waits from enqueue (latent/wait time), is then serviced (service time), and leaves at dequeue.
Response Time = Latent/Wait Time + Service Time

[Chart: queuing theory response time rising steeply as utilisation approaches 1.0]
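The shape of that curve follows directly from queuing theory. Assuming, for illustration, a simple M/M/1 queue (the slide does not name a model), mean response time grows without bound as utilisation approaches 1:

```latex
R = \frac{S}{1 - \rho}
```

where \(S\) is the service time and \(\rho\) the utilisation: at 50% utilisation the response time has already doubled, and at 90% it is ten times the service time.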

[Chart: speedup vs processors (1 to 1024) under Amdahl's Law and the Universal Scalability Law; Amdahl plateaus while USL peaks and then declines]
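The two curves can be written down directly. Amdahl's Law caps speedup by the serial fraction, while Gunther's Universal Scalability Law adds a coherence penalty that makes throughput degrade past a peak:

```latex
\text{Amdahl:}\quad S(n) = \frac{1}{(1 - p) + \frac{p}{n}}
\qquad
\text{USL:}\quad C(n) = \frac{n}{1 + \sigma\,(n - 1) + \kappa\,n\,(n - 1)}
```

Here \(p\) is the parallelisable fraction, \(\sigma\) the contention coefficient, and \(\kappa\) the coherence (crosstalk) coefficient; \(\kappa > 0\) is what makes the USL curve turn down while Amdahl's merely flattens.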

3. Adventures with Locks and Queues

Some queue implementations spend more time queueing to be enqueued than in the queue itself!!!

The evils of Blocking

Queue.put() && Queue.take()

Condition Variables

Condition Variable RTT: a Ping service signals an Echo service, which signals back (Pong), and the round trip is recorded in a histogram.

[Histogram: condition variable RTT in µs; Max = 5,525.503 µs]
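The ping/echo measurement can be sketched with a ReentrantLock and Condition: one thread signals ping, the other answers pong, and the round trip is timed. This is a toy version of the pattern, not the actual benchmark harness.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class ConditionPingPong
{
    static final ReentrantLock LOCK = new ReentrantLock();
    static final Condition CONDITION = LOCK.newCondition();
    static boolean pinged = false;

    public static void main(final String[] args) throws InterruptedException
    {
        final Thread echo = new Thread(() ->
        {
            LOCK.lock();
            try
            {
                while (!pinged)
                {
                    CONDITION.awaitUninterruptibly(); // wait for ping
                }
                pinged = false;
                CONDITION.signal(); // pong
            }
            finally
            {
                LOCK.unlock();
            }
        });
        echo.start();

        final long start = System.nanoTime();
        LOCK.lock();
        try
        {
            pinged = true;
            CONDITION.signal(); // ping
            while (pinged)
            {
                CONDITION.await(); // wait for pong
            }
        }
        finally
        {
            LOCK.unlock();
        }

        final long rttNs = System.nanoTime() - start;
        echo.join();
        System.out.println("RTT ns > 0: " + (rttNs > 0));
    }
}
```

Each leg of the round trip involves lock hand-off and thread wake-up, which is why even the best case in the histogram above costs microseconds rather than nanoseconds.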

Bad News: that's the best-case scenario!

Are non-blocking APIs any better?

Queue.offer() && Queue.poll()
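The non-blocking pairing offer()/poll() returns immediately instead of parking the caller. A minimal sketch of the calling pattern (using ConcurrentLinkedQueue; the spin strategy is illustrative):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class OfferPollExample
{
    public static void main(final String[] args)
    {
        final Queue<Integer> queue = new ConcurrentLinkedQueue<>();

        // offer() never parks the caller; on a bounded queue it can return
        // false, so producers must spin, back off, or apply backpressure.
        while (!queue.offer(42))
        {
            Thread.yield();
        }

        // poll() returns null when the queue is empty instead of parking.
        Integer message;
        while ((message = queue.poll()) == null)
        {
            Thread.yield();
        }

        System.out.println(message);
    }
}
```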

Context for the Benchmarks
https://github.com/real-logic/benchmarks
Java 8u60
Ubuntu 15.04 (performance mode)
Intel i7-3632QM 2.2 GHz (Ivy Bridge)

Burst Length = 1: RTT (ns)

                            Prod #      Mean       99%
Baseline (Wait-free)             1       167       189
ArrayBlockingQueue               1       645     1,210
                                 2     1,984    11,648
                                 3     5,257    19,680
LinkedBlockingQueue              1       461       740
                                 2     1,197     7,320
                                 3     2,010    15,152
ConcurrentLinkedQueue            1       281       361
                                 2       381       559
                                 3       444       705

Burst Length = 100: RTT (ns)

                            Prod #      Mean       99%
Baseline (Wait-free)             1       721       982
ArrayBlockingQueue               1    30,346    41,728
                                 2    35,631    65,008
                                 3    50,271    90,240
LinkedBlockingQueue              1    31,422    41,600
                                 2    45,096    94,208
                                 3    89,820   180,224
ConcurrentLinkedQueue            1    12,916    15,792
                                 2    25,132    35,136
                                 3    39,462    56,768


Backpressure? Size methods? Flow Rates? Garbage? Fan out?

4. Some Alternative FIFOs

Inter-Thread FIFOs

Disruptor 1.0/2.0 (reference ring buffer; influences: Lamport + network cards):
  claimSequence <<CAS>>: 0   gating: 0   cursor: 0
  claimSequence <<CAS>>: 1   gating: 0   cursor: 0   -- producer claims a slot
  claimSequence <<CAS>>: 1   gating: 0   cursor: 1   -- producer publishes by advancing the cursor
  claimSequence <<CAS>>: 1   gating: 1   cursor: 1   -- consumer advances the gating sequence

long expectedSequence = claimedSequence - 1;

while (cursor != expectedSequence)
{
    // busy spin until prior claims are published
}

cursor = claimedSequence;

Disruptor 3.0 (reference ring buffer; influences: Lamport + network cards + FastFlow):
  gatingCache: 0   gating: 0   cursor <<CAS>>: 0
  gatingCache: 0   gating: 0   cursor <<CAS>>: 1   -- producer claims a slot by CASing the cursor
  gatingCache: 0   gating: 0   cursor <<CAS>>: 1   available: 1   -- producer marks the slot available
  gatingCache: 0   gating: 1   cursor <<CAS>>: 1   available: 1   -- consumer advances the gating sequence

Disruptor 2.0

Disruptor 3.0

What if I need something with a Queue interface?

ManyToOneConcurrentArrayQueue (reference ring buffer; influences: Lamport + FastFlow):
  headCache: 0   head: 0   tail <<CAS>>: 0
  headCache: 0   head: 0   tail <<CAS>>: 1   -- producer claims a slot by CASing the tail
  headCache: 0   head: 1   tail <<CAS>>: 1   -- consumer advances the head
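Agrona's ManyToOneConcurrentArrayQueue combines a CAS'd tail with a producer-side cache of the head, so producers only touch the consumer's counter when the queue looks full. A simplified sketch of that idea (class and field names are illustrative, not the real Agrona code, which also uses power-of-two masks and ordered stores):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

public class ManyToOneQueueSketch<E>
{
    final int capacity;
    final AtomicReferenceArray<E> buffer;
    final AtomicLong tail = new AtomicLong(); // CAS'd by producers
    final AtomicLong head = new AtomicLong(); // advanced by the single consumer
    volatile long headCache = 0;              // producers' cached view of head

    public ManyToOneQueueSketch(final int capacity)
    {
        this.capacity = capacity;
        this.buffer = new AtomicReferenceArray<>(capacity);
    }

    public boolean offer(final E e)
    {
        long currentTail;
        do
        {
            currentTail = tail.get();
            if (currentTail - headCache >= capacity)
            {
                headCache = head.get(); // refresh only when it looks full
                if (currentTail - headCache >= capacity)
                {
                    return false; // genuinely full
                }
            }
        }
        while (!tail.compareAndSet(currentTail, currentTail + 1));

        buffer.set((int)(currentTail % capacity), e);
        return true;
    }

    public E poll()
    {
        final long currentHead = head.get();
        final int index = (int)(currentHead % capacity);
        final E e = buffer.get(index);
        if (e == null)
        {
            return null; // empty, or slot claimed but not yet published
        }
        buffer.set(index, null);
        head.set(currentHead + 1); // a lazy/ordered set in the real code
        return e;
    }

    public static void main(final String[] args)
    {
        final ManyToOneQueueSketch<Integer> q = new ManyToOneQueueSketch<>(4);
        q.offer(1);
        q.offer(2);
        q.offer(3);
        System.out.println(q.poll() + " " + q.poll() + " " + q.poll() + " " + q.poll());
    }
}
```

The headCache is what keeps producers off the consumer's cache line on the fast path, which is the FastFlow influence named on the slide.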

Burst Length = 1: RTT (ns)

                                    Prod #      Mean       99%
Baseline (Wait-free)                     1       167       189
ConcurrentLinkedQueue                    1       281       361
                                         2       381       559
                                         3       444       705
Disruptor 3.3.2                          1       201       263
                                         2       256       373
                                         3       282       459
ManyToOneConcurrentArrayQueue            1       173       198
                                         2       238       319
                                         3       259       392


Burst Length = 100: RTT (ns)

                                    Prod #      Mean       99%
Baseline (Wait-free)                     1       721       982
ConcurrentLinkedQueue                    1    12,916    15,792
                                         2    25,132    35,136
                                         3    39,462    56,768
Disruptor 3.3.2                          1     7,030     8,192
                                         2    15,159    21,184
                                         3    31,597    54,976
ManyToOneConcurrentArrayQueue            1     3,570     3,932
                                         2    12,926    16,480
                                         3    26,685    51,072


BUT!!! The Disruptor has single-producer and batch methods

Inter-Process FIFOs

ManyToOneRingBuffer

File: [Head | Message 2 | Message 3 | Tail <<CAS>>]
File: [Head | Message 2 | Message 3 | Message 4 | Tail <<CAS>>]  -- producer appends by CASing the tail forward
File: [Head | Message 3 | Message 4 | Tail <<CAS>>]              -- consumer consumes Message 2 and advances the head

MPSC Ring Buffer Message

   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
  +-+-------------------------------------------------------------+
  |R|                         Frame Length                         |
  +-+-------------------------------------------------------------+
  |R|                         Message Type                         |
  +-+-------------------------------------------------------------+
  |                        Encoded Message                        ...
 ...                                                               |
  +---------------------------------------------------------------+
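The frame layout suggests why a consumer can poll without locks: if the producer writes the frame length last, a zero length means "not yet published". A sketch of that commit protocol on a plain ByteBuffer (offsets and names are illustrative; the real ring buffer uses ordered stores on a memory-mapped file):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class FrameWriteSketch
{
    static final int FRAME_LENGTH_OFFSET = 0;
    static final int MESSAGE_TYPE_OFFSET = 4;
    static final int HEADER_LENGTH = 8;

    // Write the payload and type first; publish the frame last by writing
    // its length. A consumer spinning on a zero length only ever sees a
    // complete frame.
    static void writeFrame(final ByteBuffer buffer, final int offset,
                           final int msgType, final byte[] payload)
    {
        buffer.position(offset + HEADER_LENGTH);
        buffer.put(payload);
        buffer.putInt(offset + MESSAGE_TYPE_OFFSET, msgType);
        // In the real ring buffer this final write is an ordered/volatile store.
        buffer.putInt(offset + FRAME_LENGTH_OFFSET, HEADER_LENGTH + payload.length);
    }

    public static void main(final String[] args)
    {
        final ByteBuffer buffer = ByteBuffer.allocate(1024).order(ByteOrder.LITTLE_ENDIAN);
        writeFrame(buffer, 0, 7, "hello".getBytes());
        System.out.println(buffer.getInt(0) + " " + buffer.getInt(4));
    }
}
```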

It takes a throughput hit on zeroing memory, but latency is good

Aeron IPC

File: [Tail]
File: [Message 1 | Tail]
File: [Message 1 | Message 2 | Tail]
File: [Message 1 | Message 2 | Message 3 | Tail]  -- append-only: the tail only ever moves forward

One big file that goes on forever?

No!!! Page faults, page cache churn, VM pressure,...

[Diagram: term buffers in Clean, Active, and Dirty states; the tail appends messages into the Active term while Dirty terms are recycled into Clean ones]

Do publishers need a CAS?

File: [Message 1 | Message 2 | Message 3 | Tail <<XADD>>]                 -- producers X and Y both XADD the tail
File: [Message 1 | Message 2 | Message 3 | Message X | Message Y | Tail]  -- each gets a unique, non-overlapping range
File: [Message 1 | Message 2 | Message 3 | Message X | Message Y | Padding]  -- a claim past the end of the term becomes padding
New file: [Message X | Tail]  -- rotate: the message that did not fit is written to the next term

Have a background thread do the zeroing and flow control
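The XADD-based claim can be sketched with AtomicLong.getAndAdd(): unlike a CAS loop, every producer succeeds on its first attempt and merely discovers whether the range it claimed fits within the term (constants and names below are illustrative):

```java
import java.util.concurrent.atomic.AtomicLong;

public class XaddClaimSketch
{
    static final int TERM_LENGTH = 64 * 1024;
    static final AtomicLong TAIL = new AtomicLong();

    // Claim space with a single fetch-and-add: no retry loop, and every
    // producer receives a unique, non-overlapping offset under contention.
    static long claim(final int length)
    {
        final long offset = TAIL.getAndAdd(length);
        if (offset + length > TERM_LENGTH)
        {
            return -1; // ran off the end of the term: write padding and rotate
        }
        return offset;
    }

    public static void main(final String[] args)
    {
        System.out.println(claim(100));
        System.out.println(claim(100));
    }
}
```

This is why the slides ask "Do publishers need a CAS?": with XADD the losers of a race still make progress, so the spinning retry loop disappears.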

Burst Length = 1: RTT (ns)

                            Prod #      Mean       99%
Baseline (Wait-free)             1       167       189
ConcurrentLinkedQueue            1       281       361
                                 2       381       559
                                 3       444       705
ManyToOneRingBuffer              1       170       194
                                 2       256       279
                                 3       283       340
Aeron IPC                        1       192       210
                                 2       251       286
                                 3       290       342


Aeron Data Message

   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
  +-+-------------------------------------------------------------+
  |R|                         Frame Length                         |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-------------------------------+
  |    Version    |B|E|  Flags    |             Type              |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-------------------------------+
  |R|                         Term Offset                          |
  +-+-------------------------------------------------------------+
  |                          Session ID                            |
  +---------------------------------------------------------------+
  |                           Stream ID                            |
  +---------------------------------------------------------------+
  |                            Term ID                             |
  +---------------------------------------------------------------+
  |                        Encoded Message                        ...
 ...                                                               |
  +---------------------------------------------------------------+

Burst Length = 100: RTT (ns)

                            Prod #      Mean       99%
Baseline (Wait-free)             1       721       982
ConcurrentLinkedQueue            1    12,916    15,792
                                 2    25,132    35,136
                                 3    39,462    56,768
ManyToOneRingBuffer              1     3,773     3,992
                                 2    11,286    13,200
                                 3    27,255    34,432
Aeron IPC                        1     4,436     4,632
                                 2     7,623     8,224
                                 3    10,825    13,872


Less false sharing: inlined data vs a reference array
Less card marking

Is avoiding the spinning CAS loop a major step forward?


[Chart: 99% RTT (ns) at burst length 100 compared across implementations; y-axis 0–200,000 ns]

Logging is a Messaging Problem
What design should we use?

5. Where can we go Next?

C vs Java

C vs Java
- Spin Wait Loops: Thread.onSpinWait()
- Data Dependent Loads: heap aggregates (ObjectLayout), stack allocation, value types
- Memory Copying: baseline against memcpy() for differing alignment
- Queue Interface: break conflated concerns to reduce blocking actions
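Thread.onSpinWait() (standardised in Java 9, after this talk's Java 8 benchmarks) lets a busy-spin loop hint the CPU to issue a PAUSE-style instruction instead of burning full-rate speculative loads. A minimal sketch:

```java
public class SpinWaitExample
{
    static volatile boolean done = false;

    public static void main(final String[] args) throws InterruptedException
    {
        final Thread waiter = new Thread(() ->
        {
            while (!done)
            {
                // Hint to the CPU that we are in a spin loop (PAUSE on x86);
                // reduces power use and speeds up exit from the loop.
                Thread.onSpinWait();
            }
        });
        waiter.start();

        Thread.sleep(10);
        done = true;
        waiter.join();
        System.out.println("woken");
    }
}
```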

In closing

2TB RAM and 100+ vcores!!!

Where can I find the code?
https://github.com/real-logic/benchmarks
https://github.com/real-logic/agrona
https://github.com/real-logic/aeron

Questions?
Blog: http://mechanical-sympathy.blogspot.com/
Twitter: @mjpt777

"Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius, and a lot of courage, to move in the opposite direction." - Albert Einstein