StreamBox: Modern Stream Processing on a Multicore Machine
1 StreamBox: Modern Stream Processing on a Multicore Machine Hongyu Miao and Heejin Park, Purdue ECE; Myeongjae Jeon and Gennady Pekhimenko, Microsoft Research; Kathryn S. McKinley, Google; Felix Xiaozhu Lin, Purdue ECE
2 Data sources: IoT, data centers, and humans. The high velocity of streaming data requires real-time processing.
3-7 Streaming pipeline: an infinite data stream flows from the input through a chain of transforms (Transform 0, Transform 1, Transform 2) to the output.
8-9 Why is it hard? Records arrive out of order, and high performance on multicore requires data parallelism, pipeline parallelism, and memory locality. (The target hardware is a multi-socket Intel Xeon E v4 machine: many cores with per-core caches, a shared 35MB L3, and four NUMA nodes.)
10-12 Prior work: out-of-order processing within epochs. Each transform processes only one epoch at a time, so epochs (delimited by watermarks such as 5:00, 10:00, 15:00, 20:00) advance through the pipeline in lockstep, one transform per step.
13 StreamBox insight: out-of-order processing across epochs. Process all epochs in all transforms in parallel, so different epochs are in flight in different transforms at once.
14 Prior work vs. StreamBox: prior work processes only one epoch in each transform at a time; StreamBox processes all epochs in all transforms in parallel. StreamBox is a highly pipeline- and data-parallel processing system.
15 Result: StreamBox vs. existing systems on multicore. StreamBox achieves high throughput and high utilization of multicore hardware (throughput in KRec/s vs. number of cores, compared against Spark Streaming and Beam).
16 Roadmap Background Stream pipeline, streaming data, window, watermark, and epoch StreamBox Design Invariants to guarantee correctness Out-of-order epoch processing Evaluation 16
17 Streaming pipeline for data analytics. Transform: a computation that consumes and produces streams. Pipeline: a dataflow graph of transforms. Example, a simple WordCount pipeline: Ingress, Transform 1 (group by word), Transform 2 (count word occurrences), Egress.
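The WordCount pipeline above can be sketched as a chain of transforms. This is a minimal single-threaded Python sketch for illustration only; the function names (`ingress`, `count_words`) are hypothetical and not StreamBox's API.

```python
from collections import Counter

def ingress(lines):
    """Ingress: split each incoming line into word records."""
    for line in lines:
        for word in line.split():
            yield word

def count_words(words):
    """GroupBy word + count occurrences (Transforms 1 and 2 fused)."""
    counts = Counter()
    for w in words:
        counts[w] += 1
    return counts

# Usage: feed one small batch of the (conceptually infinite) stream.
batch = ["to be or not to be"]
counts = count_words(ingress(batch))
```

In a real streaming engine each transform runs continuously and in parallel; here the batch boundary stands in for a window of the infinite stream.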
18 Stream records = data + event time. Records arrive out of order because they travel diverse network paths and computations execute at different rates; for example, records stamped 0:03, 0:05, 0:02 reach the processing system out of order.
19-26 Window: a temporal processing scope of records, chopping the infinite stream into finite pieces along temporal boundaries; transforms do their computation per window. Records are assigned to windows by event time: records stamped 1:02, 1:03, 1:08, 1:09, 1:11, 1:13 fall into the 5-minute windows starting at 1:00, 1:05, and 1:10.
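Assigning a record to its tumbling window is a simple event-time computation. A minimal sketch, assuming times are expressed in minutes and the 5-minute windows from the slides; `window_start` is an illustrative name, not StreamBox's API:

```python
def window_start(event_minute, width=5):
    """Tumbling windows: a record with event time t belongs to the
    window [k*width, (k+1)*width) that contains t."""
    return (event_minute // width) * width

# 1:08 (minute 68) falls in window [1:05, 1:10);
# 1:03 (minute 63) falls in window [1:00, 1:05).
assert window_start(68) == 65
assert window_start(63) == 60
```

Note that the assignment depends only on the record's event time, never on its arrival order, which is what makes out-of-order arrival tolerable.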
27 Out-of-order records: a record such as 1:08 may arrive after records stamped later (1:09, 1:11, 1:13), yet still belongs to an earlier window.
28 When is a window complete? With out-of-order arrival, the system cannot tell from the records alone whether a window (e.g., [1:05, 1:10)) will still receive late records.
29 Watermark: input completeness indicated by the data source. A watermark X means that all input data with event times less than X have arrived.
30-34 Handling out-of-order records with watermarks: records are buffered into their event-time windows as they arrive; when watermark 1:05 and then watermark 1:10 arrive, the windows ending at or before each watermark are known to be complete and can be emitted.
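The buffering-plus-watermark mechanism above can be sketched in a few lines. This is an illustrative Python sketch, not StreamBox's implementation; the class name `Windower` and its methods are hypothetical:

```python
from collections import defaultdict

WIDTH = 5  # 5-minute windows, as in the slides

class Windower:
    def __init__(self):
        self.windows = defaultdict(list)  # window start -> buffered records

    def add(self, event_minute):
        """Buffer a (possibly out-of-order) record into its window."""
        self.windows[(event_minute // WIDTH) * WIDTH].append(event_minute)

    def on_watermark(self, wm_minute):
        """Watermark wm means all records with event time < wm have
        arrived, so every window ending at or before wm is complete."""
        done = sorted(s for s in self.windows if s + WIDTH <= wm_minute)
        return [(s, self.windows.pop(s)) for s in done]

w = Windower()
for t in [68, 63, 71]:        # 1:08, 1:03, 1:11 arrive out of order
    w.add(t)
fired = w.on_watermark(70)    # watermark 1:10 completes two windows
```

After the watermark for 1:10, the windows starting at 1:00 and 1:05 are emitted, while the still-open window starting at 1:10 keeps buffering.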
35 Epoch: the set of records arriving between two watermarks (e.g., the records between watermarks 1:05 and 1:10). A window may span multiple epochs.
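Slicing an arriving stream into epochs follows directly from the definition: each watermark closes the current epoch. A hedged sketch (the function `slice_epochs` and the `("rec", t)` / `("wm", t)` encoding are assumptions for illustration):

```python
def slice_epochs(stream):
    """Split a stream of records and watermarks into epochs: each epoch
    is the list of records arriving before the next watermark."""
    epochs, current = [], []
    for kind, value in stream:
        if kind == "rec":
            current.append(value)
        else:  # a watermark closes the current epoch
            epochs.append((current, value))
            current = []
    return epochs

# Arrival order: 1:02, 1:03, WM 1:05, 1:08, 1:11, WM 1:10 (times in minutes)
s = [("rec", 62), ("rec", 63), ("wm", 65),
     ("rec", 68), ("rec", 71), ("wm", 70)]
epochs = slice_epochs(s)
```

Note the epoch boundary is arrival order, not event time, which is why a single event-time window can span multiple epochs.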
36 Roadmap Background StreamBox Design Invariants to guarantee correctness Out-of-order epoch processing Evaluation 36
37 Stream processing engines: most stream engines optimize for distributed execution, neglect efficient multicore implementation, and assume a single machine is incapable of handling stream data.
38 Goal: a stream engine for multicore. Modern multicore hardware offers high-throughput I/O, terabytes of DRAM, and a large number of cores. The engine must provide correctness (respect dependences with minimal synchronization), dynamic parallelism (process any records in any epochs), and meet target throughput and latency.
39 Challenges. Correctness: guarantee watermark semantics by meeting two invariants. Throughput: never stall the pipeline. Latency: do not relax the watermark; dynamically adjust parallelism to relieve bottlenecks.
40 Invariant 1, watermark ordering: transforms consume watermarks in order, and a transform consumes all records in an epoch before consuming that epoch's watermark.
41-42 Invariant 2, respect epoch boundaries: once a transform assigns a record to an epoch, the record never changes epochs; output records carry the same epoch boundaries as the input.
43 What if a record changes to a later epoch? It would be consumed after its epoch's watermark, violating the watermark guarantee.
44 What if records change to an earlier epoch? The watermark would have to be relaxed, delaying window completion.
45 Our solution: cascading containers. Each cascading container corresponds to an epoch, tracks the epoch's state and the relationship between its records and its end watermark, and orchestrates worker threads to consume the records and then the watermark.
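The container idea can be sketched as a small data structure that enforces both invariants. This is a simplified, single-threaded Python sketch under stated assumptions (real containers are concurrent and lock-minimal); the class `Container` and its methods are illustrative, not StreamBox's code:

```python
class Container:
    """One container per epoch: holds the epoch's record bundles and its
    end watermark, and links to the matching downstream container."""
    def __init__(self, end_wm):
        self.bundles = []
        self.pending = 0          # bundles dispatched but not yet finished
        self.end_wm = end_wm
        self.downstream = None    # where outputs of this epoch go (invariant 2)

    def put(self, bundle):
        self.bundles.append(bundle)

    def take(self):
        """Workers may grab bundles from any container, in any order."""
        self.pending += 1
        return self.bundles.pop()

    def done(self, outputs):
        """A worker finished a bundle; its outputs stay in the same
        epoch by flowing into the linked downstream container."""
        self.pending -= 1
        if self.downstream is not None:
            self.downstream.put(outputs)

    def watermark_ready(self):
        """Invariant 1: the watermark may be consumed only after all
        records of the epoch have been consumed."""
        return not self.bundles and self.pending == 0

up, down = Container(end_wm=65), Container(end_wm=65)
up.downstream = down
up.put([62, 63])
b = up.take()
up.done(b)  # the processed bundle lands in the downstream container
```

Because `watermark_ready` gates watermark consumption on the container being drained, watermarks are consumed in epoch order even though the records themselves are processed in any order.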
46 Each transform has multiple containers A transform has multiple epochs Each epoch corresponds to a container Containers Transform 0 20:00 15:00 Newest Oldest 46
47 Link each container to a downstream container defined by the transform Transform 0 20:00 15:00 Transform 1 47
48 Records and watermarks flow through the pipeline by following the links. This meets invariant 2 (records respect epoch boundaries) and avoids relaxing the watermark.
49 A watermark is processed only after all records within its container have been processed, guaranteeing invariant 1 (watermark ordering).
50 Watermarks are processed in order, guaranteeing invariant 1 (watermark ordering).
51 All records in all containers can be processed in parallel, avoiding pipeline stalls.
52 Big picture: a pipeline is multiple transforms, the containers form a network, and records and watermarks flow through the links. The result is a highly parallel pipeline that guarantees watermark semantics, avoids stalling the pipeline (for throughput), and avoids relaxing the watermark (for latency).
53 Other key optimizations. Organizing records into bundles: minimize synchronization. Multi-input transforms: defer container ordering to downstream. Pipeline scheduling: prioritize externalization to minimize latency. Pipeline state management: target NUMA awareness and coarse-grained allocation.
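The bundling optimization trades per-record synchronization for per-bundle synchronization: workers pay one synchronized queue operation per bundle of many records. A minimal sketch, assuming an illustrative `bundle` helper and bundle size (real bundle sizes would be tuned and much larger):

```python
from queue import Queue

BUNDLE_SIZE = 4  # illustrative only

def bundle(records, size=BUNDLE_SIZE):
    """Group records into fixed-size bundles so that downstream
    queues see one operation per bundle, not per record."""
    out, cur = [], []
    for r in records:
        cur.append(r)
        if len(cur) == size:
            out.append(cur)
            cur = []
    if cur:
        out.append(cur)
    return out

q = Queue()
for b in bundle(range(10)):
    q.put(b)  # 3 synchronized puts instead of 10
```

With 10 records and a bundle size of 4, the queue sees 3 puts instead of 10, cutting synchronization overhead roughly by the bundle size.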
54 StreamBox implementation: built from scratch in 22K SLoC of C++11. Supported transforms: Windowing, GroupBy, Aggregation, Mapper, Reducer, Temporal Join, Grep. C++ libraries used: Intel TBB, Facebook folly, jemalloc, Boost. Concurrent hash tables: wrapped TBB's concurrent hash map.
55 Benchmarks: windowed grep, word count, counting distinct URLs, network latency monitoring, and tweets sentiment analysis. Machine configurations: CM12 (2 sockets of 6 cores, 256GB DRAM) and CM56 (4 sockets of 14 cores, 256GB DRAM).
56 Roadmap Background StreamBox Design Evaluation 56
57 Evaluation Throughput and scalability Comparison with existing stream engines Handling out-of-order input streaming data Epoch parallelism effectiveness 57
58-60 Good throughput and scalability: Tweets Sentiment Analysis throughput (KRec/s) scales with the number of cores on both CM56 and CM12, at both 1-second and 500-ms target latencies.
61 Good throughput and scalability across all benchmarks: Windowed Grep, Word Count, Temporal Join, Counting Distinct URLs, Network Latency Monitoring, and Tweets Sentiment Analysis all scale with core count on CM56 and CM12, at 1-second and shorter (50-ms or 500-ms) target latencies.
62 StreamBox vs. existing stream engines (Spark v2.1.0, Beam v0.5.0): StreamBox achieves significantly better throughput and scalability as cores are added.
63 Handling out-of-order records: with 20% and 40% of records arriving out of order, throughput drops only slightly (e.g., a 7% drop on WordCount) across WordCount, Netmon, and Tweets. StreamBox achieves good throughput even with many out-of-order records.
64 Epoch parallelism is effective: compared with StreamBox, an in-order parallel baseline (as in prior work, one epoch per transform at a time) suffers a large throughput drop (87% on Grep) on the Grep and WordCount benchmarks.
65 Summary: StreamBox on multicores processes any records in any epochs in parallel, using all CPU cores. It achieves high throughput with low latency: millions of records per second, on par with distributed engines running on clusters with a few hundred CPU cores, and latencies of tens of milliseconds, 20x shorter than other large-scale engines.
More informationShark: Hive (SQL) on Spark
Shark: Hive (SQL) on Spark Reynold Xin UC Berkeley AMP Camp Aug 21, 2012 UC BERKELEY SELECT page_name, SUM(page_views) views FROM wikistats GROUP BY page_name ORDER BY views DESC LIMIT 10; Stage 0: Map-Shuffle-Reduce
More informationApache Spark Graph Performance with Memory1. February Page 1 of 13
Apache Spark Graph Performance with Memory1 February 2017 Page 1 of 13 Abstract Apache Spark is a powerful open source distributed computing platform focused on high speed, large scale data processing
More informationCloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018
Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning
More informationHANA Performance. Efficient Speed and Scale-out for Real-time BI
HANA Performance Efficient Speed and Scale-out for Real-time BI 1 HANA Performance: Efficient Speed and Scale-out for Real-time BI Introduction SAP HANA enables organizations to optimize their business
More informationLightweight Streaming-based Runtime for Cloud Computing. Shrideep Pallickara. Community Grids Lab, Indiana University
Lightweight Streaming-based Runtime for Cloud Computing granules Shrideep Pallickara Community Grids Lab, Indiana University A unique confluence of factors have driven the need for cloud computing DEMAND
More informationNowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?
Big data hype? Big Data: Hype or Hallelujah? Data Base and Data Mining Group of 2 Google Flu trends On the Internet February 2010 detected flu outbreak two weeks ahead of CDC data Nowcasting http://www.internetlivestats.com/
More informationXPU A Programmable FPGA Accelerator for Diverse Workloads
XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for
More informationKaisen Lin and Michael Conley
Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC
More informationHPMMAP: Lightweight Memory Management for Commodity Operating Systems. University of Pittsburgh
HPMMAP: Lightweight Memory Management for Commodity Operating Systems Brian Kocoloski Jack Lange University of Pittsburgh Lightweight Experience in a Consolidated Environment HPC applications need lightweight
More informationCrossing the Chasm: Sneaking a parallel file system into Hadoop
Crossing the Chasm: Sneaking a parallel file system into Hadoop Wittawat Tantisiriroj Swapnil Patil, Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University In this work Compare and contrast large
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More informationEVCache: Lowering Costs for a Low Latency Cache with RocksDB. Scott Mansfield Vu Nguyen EVCache
EVCache: Lowering Costs for a Low Latency Cache with RocksDB Scott Mansfield Vu Nguyen EVCache 90 seconds What do caches touch? Signing up* Logging in Choosing a profile Picking liked videos
More informationChapter 8: Memory-Management Strategies
Chapter 8: Memory-Management Strategies Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and
More informationHardware and Software solutions for scaling highly threaded processors. Denis Sheahan Distinguished Engineer Sun Microsystems Inc.
Hardware and Software solutions for scaling highly threaded processors Denis Sheahan Distinguished Engineer Sun Microsystems Inc. Agenda Chip Multi-threaded concepts Lessons learned from 6 years of CMT
More informationcustinger - Supporting Dynamic Graph Algorithms for GPUs Oded Green & David Bader
custinger - Supporting Dynamic Graph Algorithms for GPUs Oded Green & David Bader What we will see today The first dynamic graph data structure for the GPU. Scalable in size Supports the same functionality
More informationSoftware and Tools for HPE s The Machine Project
Labs Software and Tools for HPE s The Machine Project Scalable Tools Workshop Aug/1 - Aug/4, 2016 Lake Tahoe Milind Chabbi Traditional Computing Paradigm CPU DRAM CPU DRAM CPU-centric computing 2 CPU-Centric
More informationA closer look at network structure:
T1: Introduction 1.1 What is computer network? Examples of computer network The Internet Network structure: edge and core 1.2 Why computer networks 1.3 The way networks work 1.4 Performance metrics: Delay,
More informationTOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT
TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware
More informationIntroduction to MapReduce (cont.)
Introduction to MapReduce (cont.) Rafael Ferreira da Silva rafsilva@isi.edu http://rafaelsilva.com USC INF 553 Foundations and Applications of Data Mining (Fall 2018) 2 MapReduce: Summary USC INF 553 Foundations
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationEffect of memory latency
CACHE AWARENESS Effect of memory latency Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns. Assume that the processor has two ALU units and it is capable
More informationScalable Streaming Analytics
Scalable Streaming Analytics KARTHIK RAMASAMY @karthikz TALK OUTLINE BEGIN I! II ( III b Overview Storm Overview Storm Internals IV Z V K Heron Operational Experiences END WHAT IS ANALYTICS? according
More informationCS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 3
CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 3 Instructors: Bernhard Boser & Randy H. Katz http://inst.eecs.berkeley.edu/~cs61c/ 10/24/16 Fall 2016 - Lecture #16 1 Software
More informationCS 426 Parallel Computing. Parallel Computing Platforms
CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:
More informationBig Data Platforms. Alessandro Margara
Big Data Platforms Alessandro Margara alessandro.margara@polimi.it http://home.deib.polimi.it/margara Data Science Data science is an interdisciplinary field that uses scientific methods, processes, algorithms
More informationShark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker
Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha
More information