StreamBox: Modern Stream Processing on a Multicore Machine

Hongyu Miao and Heejin Park, Purdue ECE; Myeongjae Jeon and Gennady Pekhimenko, Microsoft Research; Kathryn S. McKinley, Google; Felix Xiaozhu Lin, Purdue ECE. http://xsel.rocks/p/streambox

Sources of streaming data: IoT, data centers, humans. The high velocity of streaming data requires real-time processing.

Streaming pipeline: an infinite data stream enters at Input, passes through Transform 0, Transform 1, and Transform 2, and leaves at Output.

Why is it hard? Records arrive out of order. High performance on multicore requires data parallelism, pipeline parallelism, and memory locality. [Figure: the pipeline (Input, Transform 0, Transform 1, Transform 2, Output) fed by an infinite data stream, mapped onto an Intel Xeon E7-4830 v4 with NUMA 0 to NUMA 3, Core 0 to Core 13, cache sizes labeled 960 KB and 3584 KB, and a 35 MB L3.]

Prior work: out-of-order processing within epochs. Processes only one epoch in each transform at a time. [Figure: epochs with watermarks from 0:00 to 20:00 moving through Transform 0, Transform 1, and Transform 2, one epoch per transform.]

StreamBox insight: out-of-order processing across epochs. Process all epochs in all transforms in parallel. [Figure: epochs labeled 10:00, 15:00, and 10:00 in flight at the same time across Transform 0, Transform 1, and Transform 2.]

Prior work vs. StreamBox. Prior work processes only one epoch in each transform at a time. StreamBox processes all epochs in all transforms in parallel. StreamBox: a stream processing system with high pipeline and data parallelism.

Result: StreamBox vs. existing systems on multicore. High throughput and utilization of multicore hardware. [Figure: throughput (KRec/s, 0 to 8000) vs. number of cores (4, 12, 32, 56) for StreamBox, Spark Streaming, and Beam; data labels 7K, 10K, 10K, 8K.]

Roadmap. Background: stream pipeline, streaming data, window, watermark, and epoch. StreamBox design: invariants to guarantee correctness; out-of-order epoch processing. Evaluation.

Streaming pipeline for data analytics. Transform: a computation that consumes and produces streams. Pipeline: a dataflow graph of transforms. A simple WordCount pipeline: Ingress, Transform 1 (group by word), Transform 2 (count word occurrences), Egress.
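
The WordCount pipeline can be sketched as two transforms chained between ingress and egress. This is a minimal single-threaded illustration over a finite batch, not StreamBox code; the function names `group_by_word`, `count_occurrences`, and `wordcount_pipeline` are hypothetical.

```python
from collections import defaultdict


def group_by_word(lines):
    """Transform 1 (hypothetical name): group occurrences by word."""
    groups = defaultdict(list)
    for line in lines:
        for word in line.split():
            groups[word].append(1)
    return groups


def count_occurrences(groups):
    """Transform 2 (hypothetical name): count occurrences of each word."""
    return {word: sum(ones) for word, ones in groups.items()}


def wordcount_pipeline(ingress):
    """Chain the two transforms: ingress -> group -> count -> egress."""
    return count_occurrences(group_by_word(ingress))
```

Calling `wordcount_pipeline(["a b a", "b c"])` counts `a` and `b` twice and `c` once. A real stream engine would apply the same two steps per window rather than to a finite list.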

Stream records = data + event time. Records arrive out of order: records travel diverse network paths, and computations execute at different rates. [Figure: an infinite data stream with records 0:03, 0:05, 0:02 entering the processing system.]

Window: a temporal processing scope of records. Windows chop up infinite data into finite pieces along temporal boundaries. Transforms do computation based on windows. [Figure: an infinite input stream with records 1:13, 1:09, 1:11, 1:08, 1:02, 1:03 and a window from 1:00 to 1:05 on the event-time axis.]

Window: a temporal processing scope of records. [Figure: windows by event time after all records are assigned: 1:00 to 1:05 holds 1:03 and 1:02; 1:05 to 1:10 holds 1:09 and 1:08; 1:10 to 1:15 holds 1:13 and 1:11.]
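
Assigning records to fixed windows by event time can be sketched as below. This is an illustration only: it assumes the slide timestamps are minutes:seconds and that windows are fixed and 5 seconds wide, and the helper names are hypothetical.

```python
from collections import defaultdict


def to_seconds(ts):
    """Parse an 'M:SS' timestamp (assumed format) into seconds."""
    minutes, seconds = ts.split(":")
    return int(minutes) * 60 + int(seconds)


def fmt(seconds):
    """Format seconds back into 'M:SS'."""
    return f"{seconds // 60}:{seconds % 60:02d}"


def assign_windows(timestamps, size):
    """Group records into fixed windows [start, start + size) by event time."""
    windows = defaultdict(list)
    for ts in timestamps:
        t = to_seconds(ts)
        start = t - t % size  # round down to the window boundary
        windows[(fmt(start), fmt(start + size))].append(ts)
    return dict(windows)


# Records in an assumed arrival order; note 1:11 arrives before 1:09.
arrivals = ["1:02", "1:03", "1:08", "1:11", "1:09", "1:13"]
windows = assign_windows(arrivals, size=5)
```

The window a record lands in depends only on its event time, so the out-of-order arrival of 1:09 does not change the grouping.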

Out-of-order records. [Figure: records 1:13, 1:09, 1:11 still in the input stream; window 1:05 to 1:10 holds 1:08. Record 1:11 arrives before 1:09.]

When is a window complete? [Figure: records 1:13 and 1:09 still in the input stream; window 1:10 to 1:15 holds 1:11 and window 1:05 to 1:10 holds 1:08.]

Watermark: input completeness indicated by the data source. Watermark X: all input data with event times less than X have arrived. [Figure: an infinite input stream carrying records 1:13, 1:09, 1:11, 1:08, 1:03, 1:02 and watermarks 1:05 and 1:10.]

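Handling out-of-order with watermarks. Once watermark 1:10 arrives, window 1:05 to 1:10 is complete. [Figure: watermarks 1:05 and 1:10 received and record 1:13 still in the input stream; window 1:10 to 1:15 holds 1:11 and window 1:05 to 1:10 holds 1:09 and 1:08.]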
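
How a watermark marks windows as complete can be sketched as follows. This is a simplified illustration rather than StreamBox's implementation: event times are plain integers in seconds, windows are fixed and 5 seconds wide, the arrival order is an assumption read off the slides, and `process` is a hypothetical name.

```python
def process(events, size):
    """Consume ("rec", t) and ("wm", t) events in arrival order.

    A window [start, start + size) is complete once a watermark X with
    start + size <= X arrives, because every record with event time
    below X has already been seen.
    """
    open_windows = {}
    closed = []
    for kind, t in events:
        if kind == "rec":
            start = t - t % size
            open_windows.setdefault(start, []).append(t)
        else:  # watermark
            for start in sorted(open_windows):  # sorted() copies the keys
                if start + size <= t:
                    closed.append((start, open_windows.pop(start)))
    return closed, open_windows


# 1:02, 1:03, watermark 1:05, 1:08, 1:11, 1:09, watermark 1:10, 1:13
events = [("rec", 62), ("rec", 63), ("wm", 65), ("rec", 68),
          ("rec", 71), ("rec", 69), ("wm", 70), ("rec", 73)]
closed, still_open = process(events, size=5)
```

After both watermarks, the windows starting at 1:00 and 1:05 are closed, while the window starting at 1:10 stays open because no watermark at or beyond 1:15 has arrived.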

Epoch: a set of records arriving between two watermarks. A window may span multiple epochs. [Figure: an infinite input stream with records 1:13, 1:09, 1:11, 1:08, 1:03, 1:02 and watermarks 1:05 and 1:10; the records between the two watermarks form an epoch.]
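
Splitting an arrival sequence into epochs can be sketched as below. The event encoding, the arrival order, and the name `split_into_epochs` are assumptions for illustration; event times are integers in seconds.

```python
def split_into_epochs(events):
    """Cut the arrival sequence at each watermark.

    Returns (epochs, trailing): epochs is a list of
    (end_watermark, records) pairs in arrival order, and trailing holds
    records whose end watermark has not arrived yet.
    """
    epochs = []
    current = []
    for kind, t in events:
        if kind == "rec":
            current.append(t)
        else:  # a watermark closes the current epoch
            epochs.append((t, current))
            current = []
    return epochs, current


# 1:02, 1:03, watermark 1:05, 1:08, 1:11, 1:09, watermark 1:10, 1:13
events = [("rec", 62), ("rec", 63), ("wm", 65), ("rec", 68),
          ("rec", 71), ("rec", 69), ("wm", 70), ("rec", 73)]
epochs, trailing = split_into_epochs(events)
```

Here record 1:11 falls in the epoch ending at watermark 1:10 while record 1:13 falls in the next epoch, yet both belong to the window from 1:10 to 1:15, so that window spans two epochs.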

Stream processing engines. Most stream engines optimize for distributed systems. They neglect efficient multicore implementation and assume that a single machine is incapable of handling stream data.

Goal: a stream engine for multicore. Multicore hardware offers high-throughput I/O, terabyte-scale DRAM, and a large number of cores. Correctness: respect dependences with minimal synchronization. Dynamic parallelism: process any records in any epochs. Target throughput and latency. [Figure: a pipeline of Transform 0, Transform 1, and Transform 2 with epochs 10:00, 15:00, and 10:00 in flight, mapped onto a machine with NUMA 0 to NUMA 3, Core 0 to Core 13, cache sizes labeled 960 KB and 3584 KB, and a 35 MB L3.]

Challenges. Correctness: guarantee watermark semantics by meeting two invariants. Throughput: never stall the pipeline. Latency: do not relax the watermark. Dynamically adjust parallelism to relieve bottlenecks.

Invariant 1: watermark ordering. Transforms consume watermarks in order. Transforms consume all records in an epoch before consuming the watermark. [Figure: Epoch 1 and Epoch 2 with watermarks 0:10 and 0:20 and records 0:22, 0:12, 0:18, 0:05, 0:11 entering a transform.]
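
Invariant 1 can be stated as a check over the order in which a transform consumes records and watermarks. The sketch below is a hypothetical trace checker for illustration, not part of StreamBox; epochs are identified by integers that increase with time, and the trace format is an assumption.

```python
def satisfies_invariant_1(trace, records_per_epoch):
    """Check a consumption trace against invariant 1.

    trace: list of ("rec", epoch) or ("wm", epoch) in consumption order.
    records_per_epoch: number of records each epoch contains.
    """
    consumed = {epoch: 0 for epoch in records_per_epoch}
    last_wm = None
    for kind, epoch in trace:
        if kind == "rec":
            consumed[epoch] += 1
            continue
        # Watermarks must be consumed in increasing epoch order.
        if last_wm is not None and epoch <= last_wm:
            return False
        # All records of the epoch must precede its watermark.
        if consumed[epoch] != records_per_epoch[epoch]:
            return False
        last_wm = epoch
    return True


counts = {1: 2, 2: 2}
# A record of epoch 2 is consumed before epoch 1 finishes: still valid.
ok = [("rec", 2), ("rec", 1), ("rec", 1), ("wm", 1), ("rec", 2), ("wm", 2)]
# Epoch 1's watermark is consumed before its second record: invalid.
early_wm = [("rec", 1), ("wm", 1), ("rec", 1), ("rec", 2), ("rec", 2), ("wm", 2)]
```

The first trace shows that the invariant constrains only watermarks: records from different epochs may interleave freely.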

Invariant 2: respect epoch boundaries. Once a transform assigns a record an epoch, the record never changes epochs. [Figure: Epoch 1 and Epoch 2 with watermarks 0:10 and 0:20 and records 0:22, 0:12, 0:18, 0:05, 0:11, shown before and after a transform with each record in the same epoch.]

Invariant 2: respect epoch boundaries. What if a record changes to a later epoch? This violates the watermark guarantee. [Figure: the same stream before and after a transform, with a record moved into a later epoch.]

Invariant 2: respect epoch boundaries. What if records change to an earlier epoch? This relaxes the watermark and delays window completion. [Figure: the same stream before and after a transform, with records moved into an earlier epoch.]

Our solution: cascading containers. Each cascading container corresponds to an epoch, tracks the epoch state and the relationship between records and the watermark, and orchestrates worker threads to consume watermarks and records. [Figure: an epoch with end watermark 20:00 and its container.]

Each transform has multiple containers. A transform has multiple epochs, and each epoch corresponds to a container. [Figure: Transform 0 with containers 20:00 (newest) and 15:00 (oldest).]

Each container links to a downstream container defined by the transform. [Figure: containers 20:00 and 15:00 in Transform 0 linked to containers in Transform 1.]

Records and watermarks flow through the pipeline by following the links. This meets invariant 2 (records respect epoch boundaries) and avoids relaxing the watermark.

A watermark is processed after all records within its container have been processed. This guarantees invariant 1: watermark ordering.

Watermarks are processed in order. This guarantees invariant 1: watermark ordering.

All records in all containers can be processed in parallel. This avoids stalling the pipeline.
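
The container rules above can be illustrated with a small single-threaded model. This is a sketch under simplifying assumptions, not the StreamBox data structure: every container has already received all records of its epoch and its end watermark, there are no worker threads, and the class and method names are hypothetical.

```python
class Container:
    """One epoch inside one transform (simplified, hypothetical)."""

    def __init__(self, end_watermark, downstream=None):
        self.end_watermark = end_watermark
        self.records = []             # records not yet processed
        self.downstream = downstream  # linked container in the next transform
        self.retired = False          # True once its watermark is consumed


class Transform:
    def __init__(self, fn):
        self.fn = fn
        self.containers = []  # oldest epoch first

    def open_container(self, end_watermark, downstream=None):
        container = Container(end_watermark, downstream)
        self.containers.append(container)
        return container

    def process_record(self, container):
        """Process one record from any container, in any order."""
        out = self.fn(container.records.pop())
        if container.downstream is not None:
            # Following the link keeps the record in the same epoch.
            container.downstream.records.append(out)

    def try_consume_watermarks(self):
        """Consume watermarks oldest first; stop at the first container
        that still holds unprocessed records."""
        consumed = []
        for container in self.containers:
            if container.retired:
                continue
            if container.records:
                break
            container.retired = True
            consumed.append(container.end_watermark)
        return consumed


t1 = Transform(lambda x: x)
t0 = Transform(lambda x: x + 1)
d15, d20 = t1.open_container(15), t1.open_container(20)
c15, c20 = t0.open_container(15, d15), t0.open_container(20, d20)
c15.records, c20.records = [1, 2], [10]

t0.process_record(c20)                  # newer epoch processed first
blocked = t0.try_consume_watermarks()   # 15:00 still has records
t0.process_record(c15)
t0.process_record(c15)
released = t0.try_consume_watermarks()  # both watermarks, in order
```

Processing the newer container first does not stall on the older one, but its watermark is held back until the older container has drained, which is how the model keeps both invariants.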

Big picture. A pipeline consists of multiple transforms. Containers form a network, and records and watermarks flow through the links. The result is a highly parallel pipeline that guarantees watermark semantics, avoids stalling the pipeline (for throughput), and avoids relaxing the watermark (for latency). [Figure: Transform 0 (upstream) through Transform 3 (downstream), with containers ordered newest to oldest and labeled 25:00, 20:00, 15:00, 10:00, 09:00, 04:00.]

Other key optimizations. Organizing records into bundles: minimize synchronization. Multi-input transforms: defer container ordering in downstream. Pipeline scheduling: prioritize externalization to minimize latency. Pipeline state management: target NUMA-awareness and coarse-grained allocation.

StreamBox implementation. Built from scratch in 22K SLoC of C++11. Supported transforms: Windowing, GroupBy, Aggregation, Mapper, Reducer, Temporal Join, Grep. C++ libraries: Intel TBB, Facebook folly, jemalloc, boost. Concurrent hash tables: wrapped TBB's concurrent hash map. Source code: http://xsel.rocks/p/streambox

StreamBox implementation. Benchmarks: windowed grep, word count, counting distinct URLs, network latency monitoring, tweets sentiment analysis. Machine configurations: CM12 (2 x 6 cores, 256 GB DRAM) and CM56 (4 x 14 cores, 256 GB DRAM).

Evaluation. Throughput and scalability. Comparison with existing stream engines. Handling out-of-order input streaming data. Epoch parallelism effectiveness.

Good throughput and scalability. [Figure: six panels of throughput (KRec/s) vs. number of cores (4, 12, 32, 56) on CM56 and CM12. Windowed Grep, Word Count, Temporal Join, and Counting Distinct URLs show series labeled (1sec) and (50ms); Network Latency Monitoring and Tweets Sentiment Analysis show series labeled (1sec) and (500ms).]

StreamBox vs. existing stream engines. StreamBox achieves significantly better throughput and scalability. Versions compared: Spark v2.1.0, Beam v0.5.0. [Figure: throughput (KRec/s, 0 to 8000) vs. number of cores (4, 12, 32, 56) for StreamBox, Spark Streaming, and Beam; data labels 7K, 10K, 10K, 8K.]

Handling out-of-order records. StreamBox achieves good throughput even with lots of out-of-order records. [Figure: three panels (WordCount, Netmon, Tweets) of throughput (KRec/s) vs. number of cores (4, 12, 32, 56), with series labeled 0%, 20%, and 40%; one annotation reads "Drop 7%".]

Epoch parallelism is effective. [Figure: two panels (Grep, WordCount) of throughput (KRec/s) at 32 and 56 cores, comparing StreamBox with a baseline whose legend reads "NO In-order parallel"; one annotation reads "Drop 87%". Shown next to the prior-work and StreamBox pipeline diagrams.]

Summary: StreamBox on multicores. Processes any records in any epochs in parallel by using all CPU cores. Achieves high throughput with low latency: throughput of millions of records per second, on par with distributed engines on a cluster with a few hundred CPU cores, and latency of tens of milliseconds, 20x shorter than other large-scale engines. http://xsel.rocks/p/streambox