A Quest for Predictable Latency Adventures in Java Concurrency. Martin Thompson

Size: px

Start display at page:

Download "A Quest for Predictable Latency Adventures in Java Concurrency. Martin Thompson"

Warren Rodgers
6 years ago
Views:

1 A Quest for Predictable Latency Adventures in Java Concurrency Martin Thompson

3 If a system does not respond in a timely manner then it is effectively unavailable

4 1. It s all about the Blocking 2. What do we mean by Latency 3. Adventures with Locks and queues 4. Some Alternative FIFOs 5. Where can we go Next

5 1. It s all about the Blocking

6 What is preventing progress?

7 A thread is blocked when it cannot make progress

8 There are two major causes of blocking

9 Blocking Systemic Pauses JVM Safepoints (GC, etc.) Transparent Huge Pages (Linux) Hardware (C-States, SMIs)

SMIs) Concurrent Algorithms Notifying Completion

10 Blocking Systemic Pauses JVM Safepoints (GC, etc.) Transparent Huge Pages (Linux) Hardware (C-States, SMIs) Concurrent Algorithms Notifying Completion Mutual Exclusion (Contention) Synchronisation / Rendezvous

11 Concurrent Algorithms Locks Atomic/CAS Instructions Call into kernel on contention Always blocking Difficult to get right Execute in user space Can be non-blocking Very difficult to get right

12 2. What do we mean by Latency

13 Queuing Theory

14 Queuing Theory Service Time

15 Queuing Theory Latent/Wait Time Service Time

16 Queuing Theory Latent/Wait Time Service Time Response Time

17 Queuing Theory Latent/Wait Time Service Time Enqueue Dequeue Response Time

18 Response Time Queuing Theory Utilisation

19 Speedup Amdahl s & Universal Scalability Laws Processors Amdahl USL

20 3. Adventures with Locks and Queues?

21 Some queue implementations spend more time queueing to be enqueued than in queue time!!!

22 The evils of Blocking

23 Queue.put() && Queue.take()

24 Condition Variables

25 Condition Variable RTT Signal Ping Echo Service Histogram Record Signal Pong

26 µs

27 µs Max = µs

28 Bad News That s the best case scenario!

29 Are non-blocking APIs any better?

30 Queue.offer() && Queue.poll()

31 Context for the Benchmarks Java 8u60 Ubuntu performance mode Intel i7-3632qm 2.2 GHz (Ivy Bridge)

32 Burst Length = 1: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ArrayBlockingQueue , ,984 11, ,257 19,680 LinkedBlockingQueue ,197 7, ,010 15,152 ConcurrentLinkedQueue

33 Burst Length = 100: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ArrayBlockingQueue 1 30,346 41, ,631 65, ,271 90,240 LinkedBlockingQueue 1 31,422 41, ,096 94, , ,224 ConcurrentLinkedQueue 1 12,916 15, ,132 35, ,462 56,768

34 Burst Length = 100: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ArrayBlockingQueue 1 30,346 41, ,631 65, ,271 90,240 LinkedBlockingQueue 1 31,422 41, ,096 94, , ,224 ConcurrentLinkedQueue 1 12,916 15, ,132 35, ,462 56,768

35 Backpressure? Size methods? Flow Rates? Garbage? Fan out?

36 5. Some alternative FIFOs

37 Inter-Thread FIFOs

38 Disruptor claimsequence <<CAS>> 0 gating 0 cursor Reference Ring Buffer Influences: Lamport + Network Cards

39 Disruptor claimsequence <<CAS>> 0 gating 0 cursor Reference Ring Buffer Influences: Lamport + Network Cards

40 Disruptor claimsequence <<CAS>> 0 gating 0 cursor Reference Ring Buffer Influences: Lamport + Network Cards

41 Disruptor claimsequence <<CAS>> 0 gating 1 cursor Reference Ring Buffer Influences: Lamport + Network Cards

42 Disruptor claimsequence <<CAS>> 0 gating 1 cursor Reference Ring Buffer Influences: Lamport + Network Cards

43 Disruptor claimsequence <<CAS>> 1 gating 1 cursor Reference Ring Buffer Influences: Lamport + Network Cards

44 long expectedsequence = claimedsequence - 1; while (cursor!= expectedsequence) { // busy spin } cursor = claimedsequence;

45 Disruptor gatingcache 0 gating 0 cursor <<CAS>> available Reference Ring Buffer Influences: Lamport + Network Cards + Fast Flow

46 Disruptor gatingcache 0 gating 1 cursor <<CAS>> available Reference Ring Buffer Influences: Lamport + Network Cards + Fast Flow

47 Disruptor gatingcache 0 gating 1 cursor <<CAS>> available Reference Ring Buffer Influences: Lamport + Network Cards + Fast Flow

48 Disruptor gatingcache 0 gating 1 cursor <<CAS>> 1 available Reference Ring Buffer Influences: Lamport + Network Cards + Fast Flow

49 Disruptor gatingcache 0 gating 1 cursor <<CAS>> 1 available Reference Ring Buffer Influences: Lamport + Network Cards + Fast Flow

50 Disruptor gatingcache 1 gating 1 cursor <<CAS>> 1 available Reference Ring Buffer Influences: Lamport + Network Cards + Fast Flow

51 Disruptor 2.0

52 Disruptor 3.0

53 What if I need something with a Queue interface?

54 ManyToOneConcurrentArrayQueue 0 headcache 0 head 0 tail <<CAS>> Reference Ring Buffer Influences: Lamport + Fast Flow

55 ManyToOneConcurrentArrayQueue 0 headcache 0 head 1 tail <<CAS>> Reference Ring Buffer Influences: Lamport + Fast Flow

56 ManyToOneConcurrentArrayQueue 0 headcache 0 head 1 tail <<CAS>> Reference Ring Buffer Influences: Lamport + Fast Flow

57 ManyToOneConcurrentArrayQueue 0 headcache 0 head 1 tail <<CAS>> Reference Ring Buffer Influences: Lamport + Fast Flow

58 ManyToOneConcurrentArrayQueue 0 headcache 1 head 1 tail <<CAS>> Reference Ring Buffer Influences: Lamport + Fast Flow

59 Burst Length = 1: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue Disruptor ManyToOneConcurrentArrayQueue

60 Burst Length = 1: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue Disruptor ManyToOneConcurrentArrayQueue

61 Burst Length = 100: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue 1 12,916 15, ,132 35, ,462 56,768 Disruptor ,030 8, ,159 21, ,597 54,976 ManyToOneConcurrentArrayQueue 1 3,570 3, ,926 16, ,685 51,072

62 Burst Length = 100: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue 1 12,916 15, ,132 35, ,462 56,768 Disruptor ,030 8, ,159 21, ,597 54,976 ManyToOneConcurrentArrayQueue 1 3,570 3, ,926 16, ,685 51,072

63 BUT!!! Disruptor has single producer and batch methods

64 Inter-Process FIFOs

65 ManyToOneRingBuffer

66 File Head Message 2 Message 3 Tail <<CAS>>

67 File Head Message 2 Message 3 Tail <<CAS>>

68 File Head Message 2 Message 3 Message 4 Tail <<CAS>>

69 File Head Message 2 Message 3 Message 4 Tail <<CAS>>

70 File Head Message 3 Message 4 Tail <<CAS>>

71 File Head Message 3 Message 4 Tail <<CAS>>

72 MPSC Ring Buffer Message R Frame Length R Message Type Encoded Message

73 Takes a throughput hit on zero ing memory but latency is good

74 Aeron IPC

75 File Tail

76 File Message 1 Tail

77 File Message 1 Message 2 Tail

78 File Message 1 Message 2 Tail

79 File Message 1 Message 2 Message 3 Tail

80 File Message 1 Message 2 Message 3 Tail

81 One big file that goes on forever?

82 No!!! Page faults, page cache churn, VM pressure,...

83 Clean Dirty Message Message Active Message Message Message Message Tail Message Message

84 Do publishers need a CAS?

85 File Message 1 Message 2 Message 3 Message X Tail <<XADD>> Message Y

86 File Message 1 Message 2 Message 3 Message X Tail <<XADD>> Message Y

87 File Message 1 Message 2 Message 3 Message X Message Y Tail

88 File Message 1 Message 2 Message 3 Message X Message Y Tail

89 File Message 1 Message 2 Message 3 Message X Message Y Padding Tail

90 File Message 1 File Tail Message 2 Message 3 Message X Message Y Padding

91 File Message 1 File Message X Message 2 Tail Message 3 Message Y Padding

92 Have a background thread do zero ing and flow control

93 Burst Length = 1: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue ManyToOneRingBuffer Aeron IPC

94 Burst Length = 1: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue ManyToOneRingBuffer Aeron IPC

95 Aeron Data Message R Frame Length Version B E Flags Type R Term Offset Session ID Stream ID Term ID Encoded Message

96 Burst Length = 100: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue 1 12,916 15, ,132 35, ,462 56,768 ManyToOneRingBuffer 1 3,773 3, ,286 13, ,255 34,432 Aeron IPC 1 4,436 4, ,623 8, ,825 13,872

97 Burst Length = 100: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue 1 12,916 15, ,132 35, ,462 56,768 ManyToOneRingBuffer 1 3,773 3, ,286 13, ,255 34,432 Aeron IPC 1 4,436 4, ,623 8, ,825 13,872

98 Less False Sharing - Inlined data vs Reference Array Less Card Marking

99 Is avoiding the spinning CAS loop is a major step forward?

100 Burst Length = 100: RTT (ns) Prod # Mean 99% Baseline (Wait-free) ConcurrentLinkedQueue 1 12,916 15, ,132 35, ,462 56,768 ManyToOneRingBuffer 1 3,773 3, ,286 13, ,255 34,432 Aeron IPC 1 4,436 4, ,623 8, ,825 13,872

Burst Length = 100: 99% RTT (ns) 200,000 180,000

101 Burst Length = 100: 99% RTT (ns) 200, , , , , ,000 80,000 60,000 40,000 20,000 0

102 Logging is a Messaging Problem What design should we use?

103 5. Where can we go Next?

104 C vs Java

105 C vs Java Spin Wait Loops Thread.onSpinWait()

106 C vs Java Spin Wait Loops Thread.onSpinWait() Data Dependent Loads Heap Aggregates - ObjectLayout Stack Allocation Value Types

107 C vs Java Spin Wait Loops Thread.onSpinWait() Data Dependent Loads Heap Aggregates - ObjectLayout Stack Allocation Value Types Memory Copying Baseline against memcpy() for differing alignment

108 C vs Java Spin Wait Loops Thread.onSpinWait() Data Dependent Loads Heap Aggregates - ObjectLayout Stack Allocation Value Types Memory Copying Baseline against memcpy() for differing alignment Queue Interface Break conflated concerns to reduce blocking actions

109 In closing

110 2TB RAM and 100+ vcores!!!

111 Where can I find the code?

112 Questions? Blog: Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius, and a lot of courage, to move in the opposite direction. - Albert Einstein

High Performance Managed Languages. Martin Thompson

High Performance Managed Languages Martin Thompson - @mjpt777 Really, what s your preferred platform for building HFT applications? Why would you build low-latency applications on a GC ed platform? Some