Disruptor - Using High Performance, Low Latency Technology in the CERN Control System


ICALEPCS 2015, 21/10/2015, WEB3O03

The problem at hand
- CESAR is used to control the devices in CERN experimental areas
- These devices produce 2500 event streams
- The business logic on the CESAR server combines the data coming from these streams to calculate device states
- This concurrent processing must be properly synchronized
[Diagram: motors, detectors and power converters (x 2500 streams) feed the CESAR server, which derives the states of devices A, B and C]

What happens when all flows converge?

1 - Some background about the Disruptor
- Created by LMAX, a trading company, to build a high-performance Forex exchange
- The result of several rounds of trial and error
- Challenges the idea that CPUs are not getting any faster
- Designed to take advantage of the architecture of modern CPUs, following the concept of mechanical sympathy

Feed the cores - avoid cache misses
Typical sizes and access latencies on a modern multi-core machine:
- L1 cache (per core): 64 KB, ~1 ns
- L2 cache (per core): 256 KB, ~3 ns
- L3 cache (shared per socket): 1 to 20 MB, ~12 ns
- RAM: ~65 ns
[Diagram: two cores with private L1 and L2 caches and a shared L3 cache per socket; sockets are linked by an interconnect to RAM]
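
The gap between these numbers is why access patterns matter so much: code that walks memory sequentially keeps the caches and the prefetcher fed, while strided access pays the main-memory penalty on almost every read. A small illustrative Java sketch (matrix size chosen arbitrarily; the actual ratio depends on the machine and the JIT):

```java
// Illustrative only: summing a large matrix stored in a flat array,
// row by row (sequential, cache friendly) versus column by column
// (a stride of N * 8 bytes, so roughly one cache miss per access).
public class TraversalOrder {
    private static final int N = 4096;                // arbitrary size
    private static final double[] data = new double[N * N];

    static double sumRowMajor() {                     // walks memory sequentially
        double sum = 0;
        for (int row = 0; row < N; row++)
            for (int col = 0; col < N; col++)
                sum += data[row * N + col];
        return sum;
    }

    static double sumColumnMajor() {                  // jumps across rows
        double sum = 0;
        for (int col = 0; col < N; col++)
            for (int row = 0; row < N; row++)
                sum += data[row * N + col];
        return sum;
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        double a = sumRowMajor();
        long t1 = System.nanoTime();
        double b = sumColumnMajor();
        long t2 = System.nanoTime();
        System.out.printf("row-major %d ms, column-major %d ms (sums %.1f / %.1f)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a, b);
    }
}
```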

Avoiding false sharing
- One cache line is 64 bytes on modern x86
- If two cores update two independent variables X and Y that happen to sit on the same cache line, that line keeps bouncing between the cores' private L1/L2 caches (1 to 3 ns), the shared L3 (12 ns) and the socket interconnect (40 ns), even though the variables are logically unrelated
- The solution? Padding: surround each hot variable with unused padding so that it gets a 64-byte cache line of its own
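
What padding looks like in code: the sketch below illustrates the idea and is not the Disruptor's actual source. On HotSpot, superclass fields are laid out before subclass fields, so the hot value ends up with 56 bytes of unused longs on either side and gets a 64-byte cache line to itself; newer JVMs also provide an internal @Contended annotation for the same purpose.

```java
// Cache-line padding via a small class hierarchy (illustrative sketch).
// Two threads incrementing two different PaddedCounter instances can no
// longer invalidate each other's cache lines by accident.
class LeftPadding  { protected long p1, p2, p3, p4, p5, p6, p7; }

class HotValue extends LeftPadding { protected volatile long value; }

class RightPadding extends HotValue { protected long p9, p10, p11, p12, p13, p14, p15; }

public class PaddedCounter extends RightPadding {
    public long get() {
        return value;
    }

    public void increment() {    // intended for a single writer thread
        value++;
    }
}
```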

2 - Disruptor architecture
What is it?
- Can be viewed as a very efficient bounded FIFO queue
- A data structure to pass data between threads, designed to avoid contention
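
A minimal usage sketch with the open-source LMAX Disruptor library (the event class and the values published here are invented for illustration; the constructor shown is the one from recent 3.x releases): preallocate the ring buffer, attach a consumer, then publish by mutating the preallocated slots instead of allocating new message objects.

```java
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import java.util.concurrent.Executors;

public class DisruptorHelloWorld {

    // A mutable event: one instance per ring buffer slot, reused forever.
    static class ValueEvent {
        long value;
    }

    public static void main(String[] args) {
        // The ring buffer size must be a power of two.
        Disruptor<ValueEvent> disruptor = new Disruptor<>(
                ValueEvent::new, 1024, Executors.defaultThreadFactory());

        // The consumer: runs on its own thread, sees events in sequence order.
        disruptor.handleEventsWith((event, sequence, endOfBatch) ->
                System.out.println("got " + event.value + " at sequence " + sequence));

        RingBuffer<ValueEvent> ringBuffer = disruptor.start();

        // Publishing: claim a slot, mutate it in place, publish its sequence.
        for (long i = 0; i < 10; i++) {
            final long v = i;
            ringBuffer.publishEvent((event, sequence) -> event.value = v);
        }

        disruptor.shutdown();    // waits until all published events are handled
    }
}
```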

The mighty ring buffer
[Diagram: a producer and a consumer advancing around a ring of numbered slots, each tracked by its own sequence number]
- Represented internally as an array, so cache lines get prefetched
- The sequence number is a padded long, so there is no false sharing
- Memory visibility relies on the volatile sequence number, so there are no locks
- Slots are preallocated, so there is no garbage collection
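
To see why a volatile sequence is enough for memory visibility, here is a deliberately simplified single-producer, single-consumer ring (an illustrative sketch, nowhere near the real implementation: no wait strategies, no padding, busy-spinning only). The producer fills a preallocated slot first and bumps the volatile sequence second, so a consumer that observes the new sequence is guaranteed by the Java memory model to see the slot's contents.

```java
// Simplified single-producer / single-consumer ring buffer.
public class TinyRing {
    static final int SIZE = 1024;                    // power of two
    static final int MASK = SIZE - 1;

    static final class Slot { long value; }          // preallocated, mutated in place

    final Slot[] slots = new Slot[SIZE];
    volatile long published = -1;                    // last sequence visible to the consumer
    volatile long consumed  = -1;                    // last sequence the consumer finished

    public TinyRing() {
        for (int i = 0; i < SIZE; i++) slots[i] = new Slot();   // no allocation after this
    }

    // Called by the single producer thread only.
    public void publish(long value) {
        long seq = published + 1;
        while (seq - consumed > SIZE) Thread.onSpinWait();      // ring full: wait for the consumer
        slots[(int) (seq & MASK)].value = value;     // 1. write the data into the slot
        published = seq;                             // 2. volatile write makes it visible
    }

    // Called by the single consumer thread only.
    public long take() {
        long seq = consumed + 1;
        while (published < seq) Thread.onSpinWait();  // nothing published yet
        long value = slots[(int) (seq & MASK)].value; // safe: happens-before via "published"
        consumed = seq;                               // lets the producer reuse the slot
        return value;
    }
}
```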

Main differences compared to a queue
- Latency and jitter reduced to a minimum
- No garbage collection
- Can have multiple consumers organized in a dependency graph
- Consumers can use batching to catch up with producers
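
Both of the last two points map directly onto the library's DSL. In the sketch below (handler names are invented for illustration), the journal and replicate handlers see every event first and run in parallel, the business-logic handler runs only once both are done with an event, and the endOfBatch flag passed to every handler is the hook consumers use to batch their work.

```java
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.dsl.Disruptor;
import java.util.concurrent.Executors;

public class DependencyGraphExample {

    static class Event { long value; }

    // Handler names are illustrative, not taken from any real code base.
    static final EventHandler<Event> journal =
            (event, sequence, endOfBatch) -> System.out.println("journal " + sequence);
    static final EventHandler<Event> replicate =
            (event, sequence, endOfBatch) -> System.out.println("replicate " + sequence);
    static final EventHandler<Event> businessLogic =
            (event, sequence, endOfBatch) -> {
                // endOfBatch is true on the last event the consumer could grab:
                // a natural point to flush accumulated work in one go.
                if (endOfBatch) System.out.println("batch finished at " + sequence);
            };

    public static void main(String[] args) {
        Disruptor<Event> disruptor =
                new Disruptor<>(Event::new, 1024, Executors.defaultThreadFactory());

        // journal and replicate run in parallel; businessLogic only sees an
        // event once both of them have processed it.
        disruptor.handleEventsWith(journal, replicate).then(businessLogic);

        disruptor.start();
        for (long i = 0; i < 4; i++) {
            final long v = i;
            disruptor.getRingBuffer().publishEvent((e, seq) -> e.value = v);
        }
        disruptor.shutdown();
    }
}
```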

Benefits for the architecture
- Performance: no locks, no garbage collection, CPU friendly
- Determinism: the order in which events were processed is known, and messages can be replayed to rebuild the server state
- Simplification of the code base: since the business logic runs on a single thread, there is no need to worry about concurrency

3 - The Disruptor in CESAR
- Each event received from hardware is stored on the ring buffer
- For each stream of data, the last value is kept
- We make use of batching
- At the end of a batch, the business logic is triggered and executed on a single thread
- The new states are published over the network, making sure that we do not block the Disruptor thread if the message broker is down
[Diagram: motor, detector and power converter events flow through the ring buffer into the business logic, which produces device states A and B]
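
A sketch of what such a handler could look like (an illustration of the pattern described above, not the actual CESAR code; DeviceEvent, DeviceStateHandler and the publish queue are invented names): the handler keeps only the latest value per stream, runs the business logic when endOfBatch is signalled, and hands the resulting states to a separate publisher thread through a non-blocking offer, so a dead message broker can never stall the Disruptor thread.

```java
import com.lmax.disruptor.EventHandler;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Preallocated ring buffer slot, mutated in place (name invented).
class DeviceEvent {
    String streamId;
    Object value;
}

// Illustrative sketch of the pattern described above, not CESAR source code.
public class DeviceStateHandler implements EventHandler<DeviceEvent> {

    // Latest value per stream; touched only by the single Disruptor thread.
    private final Map<String, Object> lastValues = new HashMap<>();

    // Hand-off to a separate publisher thread; bounded, and used without blocking.
    private final BlockingQueue<Map<String, Object>> toPublish = new ArrayBlockingQueue<>(128);

    @Override
    public void onEvent(DeviceEvent event, long sequence, boolean endOfBatch) {
        lastValues.put(event.streamId, event.value);       // keep only the last value per stream

        if (endOfBatch) {
            // End of batch: run the business logic on this single thread and
            // derive the new device states (the real calculation is omitted).
            Map<String, Object> newStates = new HashMap<>(lastValues);

            // offer() never blocks; if the publisher has fallen behind we drop
            // this snapshot and a fresher one follows at the next end of batch.
            toPublish.offer(newStates);
        }
    }

    // A publisher thread drains this queue and talks to the message broker.
    public BlockingQueue<Map<String, Object>> publishQueue() {
        return toPublish;
    }
}
```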

Conclusions
- The Disruptor, a tool from the world of finance, fits really well in an accelerator control system
- It simplified the CERN CESAR code base while handling the flow of data more efficiently
- It is easily integrated into an existing design to replace a queue or a full pipeline of queues
- The main challenge faced was switching the developers' mind-set to think in asynchronous terms

Useful links
- The Disruptor main page, with an introduction and code samples
- A presentation of the Disruptor at QCon
- An article from Martin Fowler
- A useful presentation on latency by Gil Tene, who shows that most of what we measure during performance tests is wrong
- The new async logger in Log4j
