StreamBox: Modern Stream Processing on a Multicore Machine

StreamBox: Modern Stream Processing on a Multicore Machine Hongyu Miao and Heejin Park, Purdue ECE; Myeongjae Jeon and Gennady Pekhimenko, Microsoft Research; Kathryn S. McKinley, Google; Felix Xiaozhu Lin, Purdue ECE http://xsel.rocks/p/streambox

IoT, data centers, humans: the high velocity of streaming data requires real-time processing.

Streaming Pipeline: an infinite data stream flows from the Input through Transform 0, Transform 1, and Transform 2 to the Output. 7

Why is it hard? Records arrive out-of-order, and high performance on multicore demands data parallelism, pipeline parallelism, and memory locality. (Diagram: the Input, Transform 0, Transform 1, Transform 2, Output pipeline mapped onto an Intel Xeon E7-4830 v4 with four NUMA nodes, 14 cores and a 35MB shared L3 per node, and per-core caches.) 9

Prior work: out-of-order processing within epochs, i.e., processing only one epoch in each transform at a time. (Animation: epochs 0:00 through 20:00 advance through Transform 0, Transform 1, and Transform 2, one epoch per transform at a time.) 12

StreamBox insight: out-of-order processing across epochs, processing all epochs in all transforms in parallel. (Diagram: different epochs occupy Transform 0, Transform 1, and Transform 2 at the same time.) 13

Prior work vs. StreamBox: prior work processes only one epoch in each transform at a time; StreamBox processes all epochs in all transforms in parallel. StreamBox is a highly pipeline- and data-parallel processing system. 14

Result: StreamBox vs. existing systems on multicore, showing high throughput and utilization of multicore hardware. (Chart: throughput in KRec/s on 4, 12, 32, and 56 cores for StreamBox, Spark Streaming, and Beam; the Spark Streaming and Beam bars are annotated 7K, 10K, 10K, and 8K.) 15

Roadmap Background Stream pipeline, streaming data, window, watermark, and epoch StreamBox Design Invariants to guarantee correctness Out-of-order epoch processing Evaluation 16

Streaming pipeline for data analytics. A transform is a computation that consumes and produces streams; a pipeline is a dataflow graph of transforms. (Example: a simple WordCount pipeline: Ingress, group by word, count word occurrences, Egress.) 17
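To make the transform/pipeline vocabulary concrete, here is a minimal, single-threaded C++ sketch of such a WordCount pipeline. The Record type and the group_by_word/count_words functions are illustrative assumptions, not StreamBox's API.

```cpp
// A minimal, single-threaded sketch of a WordCount pipeline; the Record type and the
// group_by_word/count_words functions are illustrative assumptions, not StreamBox's API.
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

struct Record {
    std::string data;   // a line of text
    long event_time;    // event time, e.g. in seconds
};

// Transform 1: split each record into words and group occurrences by word.
std::map<std::string, std::vector<long>> group_by_word(const std::vector<Record>& in) {
    std::map<std::string, std::vector<long>> groups;
    for (const Record& r : in) {
        std::istringstream ss(r.data);
        std::string w;
        while (ss >> w) groups[w].push_back(r.event_time);
    }
    return groups;
}

// Transform 2: count the occurrences of each word.
std::map<std::string, long> count_words(const std::map<std::string, std::vector<long>>& groups) {
    std::map<std::string, long> counts;
    for (const auto& kv : groups) counts[kv.first] = static_cast<long>(kv.second.size());
    return counts;
}

int main() {
    // Ingress: a tiny batch standing in for one slice of the infinite stream.
    std::vector<Record> input = {{"the quick brown fox", 3}, {"the lazy dog", 5}};
    // Egress: print the per-word counts.
    for (const auto& kv : count_words(group_by_word(input)))
        std::cout << kv.first << ": " << kv.second << "\n";
}
```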

Stream records = data + event time. Records arrive out of order because they travel diverse network paths and upstream computations execute at different rates. (Diagram: records with event times 0:03, 0:05, and 0:02 arriving at the processing system.) 18

Window: a temporal processing scope of records that chops infinite data into finite pieces along temporal boundaries; transforms do their computation per window. (Diagram: records with event times 1:02 through 1:13 being assigned to event-time windows such as [1:00, 1:05).) 19

(Animation: the records 1:02, 1:03, 1:08, 1:09, 1:11, and 1:13 are placed one by one into the event-time windows [1:00, 1:05), [1:05, 1:10), and [1:10, 1:15).) 26
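A minimal sketch of this fixed-window assignment, assuming 5-minute windows as in the diagrams; window_start() is a hypothetical helper, not StreamBox's windowing transform.

```cpp
// A minimal sketch, assuming fixed 5-minute event-time windows as in the diagrams;
// window_start() is a hypothetical helper, not StreamBox's windowing transform.
#include <cstdio>

// Map an event time (in minutes) to the start of its fixed-size window.
long window_start(long event_min, long width_min) {
    return (event_min / width_min) * width_min;
}

int main() {
    // Event times from the slides: 1:08, 1:02, 1:03, 1:13, 1:09, 1:11 (hours:minutes).
    const long times[] = {68, 62, 63, 73, 69, 71};
    for (long t : times) {
        long w = window_start(t, 5);   // e.g. 1:08 falls into window [1:05, 1:10)
        std::printf("record %ld:%02ld -> window [%ld:%02ld, %ld:%02ld)\n",
                    t / 60, t % 60, w / 60, w % 60, (w + 5) / 60, (w + 5) % 60);
    }
}
```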

Out-of-order records. (Diagram: records 1:13, 1:09, and 1:11 are still in the input stream while record 1:08 already sits in the window [1:05, 1:10).) 27

When is a window complete? (Diagram: records 1:13 and 1:09 are still in the input stream while the windows [1:05, 1:10) and [1:10, 1:15) hold 1:08 and 1:11.) 28

Watermark: input completeness as indicated by the data source. A watermark X means that all input data with event times less than X have arrived. (Diagram: watermarks 1:05 and 1:10 interleaved with records 1:02 through 1:13 in the input stream.) 29

Handling out-of-order records with watermarks. (Animation: as the records 1:08, 1:09, 1:11, and 1:13 and the watermarks 1:05 and 1:10 flow in, the records are placed into the windows [1:05, 1:10) and [1:10, 1:15) by event time.) 34
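A minimal sketch of how a watermark lets windows be finalized, assuming 5-minute windows keyed by their start time (in minutes); the map of per-window counts is a stand-in, not StreamBox's state management.

```cpp
// A minimal sketch (not StreamBox code): a watermark of X lets us finalize every window
// ending at or before X, because no record with a smaller event time can still arrive.
#include <cstdio>
#include <map>

int main() {
    // Per-window partial results (e.g. counts), keyed by window start: 1:00, 1:05, 1:10.
    std::map<long, long> window_counts = {{60, 2}, {65, 3}, {70, 1}};
    const long watermark = 70;   // "all records with event time < 1:10 have arrived"

    for (auto it = window_counts.begin(); it != window_counts.end(); ) {
        const long window_end = it->first + 5;
        if (window_end <= watermark) {
            // Windows [1:00, 1:05) and [1:05, 1:10) are complete and can be emitted.
            std::printf("window [%ld:%02ld, %ld:%02ld) complete, count=%ld\n",
                        it->first / 60, it->first % 60,
                        window_end / 60, window_end % 60, it->second);
            it = window_counts.erase(it);
        } else {
            ++it;   // window [1:10, 1:15) must stay open until a later watermark
        }
    }
}
```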

Epoch: the set of records arriving between two watermarks. A window may span multiple epochs. (Diagram: the records arriving between watermark 1:05 and watermark 1:10 form one epoch.) 35
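To make the epoch notion concrete, here is a minimal sketch that cuts an arriving stream into epochs at watermarks; the Item type and the example arrival order are illustrative assumptions.

```cpp
// A minimal sketch of splitting an arriving stream into epochs at watermarks;
// the Item type and the arrival order below are illustrative assumptions.
#include <cstdio>
#include <vector>

struct Item {
    bool is_watermark;   // true for a watermark, false for a data record
    long time;           // event time (record) or watermark time, in minutes
};

int main() {
    // Arrival order loosely following the slides: records may arrive out of order,
    // and watermarks 1:05 and 1:10 close the first two epochs.
    std::vector<Item> stream = {
        {false, 62}, {false, 63}, {true, 65},               // epoch 1, closed by watermark 1:05
        {false, 68}, {false, 71}, {false, 69}, {true, 70},  // epoch 2 (1:11 arrives before 1:09)
        {false, 73}                                          // epoch 3, still open
    };

    int epoch = 1;
    std::vector<long> current;
    for (const Item& it : stream) {
        if (it.is_watermark) {
            std::printf("epoch %d: %zu records, closed by watermark %ld:%02ld\n",
                        epoch++, current.size(), it.time / 60, it.time % 60);
            current.clear();
        } else {
            current.push_back(it.time);
        }
    }
    std::printf("epoch %d: %zu records, still open\n", epoch, current.size());
}
```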

Roadmap Background StreamBox Design Invariants to guarantee correctness Out-of-order epoch processing Evaluation 36

Stream processing engines. Most stream engines optimize for a distributed setting, neglect efficient multicore implementations, and assume a single machine is incapable of handling stream data. 37

Goal: a stream engine for multicore. Multicore hardware offers high-throughput I/O, terabytes of DRAM, and a large number of cores. The engine must provide correctness (respect dependences with minimal synchronization), dynamic parallelism (process any records in any epochs), and the target throughput and latency. (Diagram: a pipeline of Transforms 0-2 mapped onto a machine with four NUMA nodes, each with 14 cores and a 35MB L3.) 38

Challenges. Correctness: guarantee watermark semantics by meeting two invariants. Throughput: never stall the pipeline. Latency: do not relax the watermark, and dynamically adjust parallelism to relieve bottlenecks. 39

Invariant 1, watermark ordering: transforms consume watermarks in order, and a transform consumes all records in an epoch before consuming that epoch's watermark. (Diagram: epochs 1 and 2, with records 0:05 through 0:22 and watermarks 0:10 and 0:20, flowing into a transform.) 40

Invariant 2, respect epoch boundaries: once a transform assigns a record to an epoch, the record never changes epochs. If a record moved to a later epoch, it would violate the watermark guarantee; if records moved to an earlier epoch, the watermark would have to be relaxed, delaying window completion. 44

Our solution: cascading containers. Each cascading container corresponds to an epoch, tracks the epoch's state and the relationship between its records and its watermark, and orchestrates worker threads to consume the watermark and records. (Diagram: an epoch ending at watermark 20:00 and the container that holds it.) 45

Each transform has multiple containers: a transform has multiple epochs, and each epoch corresponds to a container, ordered from newest to oldest. (Diagram: Transform 0 with containers for epochs ending at 15:00 and 20:00.) 46

Each container is linked to a downstream container, as defined by the transform. (Diagram: containers of Transform 0 linked to containers of Transform 1.) 47

Records and watermarks flow through the pipeline by following these links. This meets Invariant 2 (records respect epoch boundaries) and avoids relaxing the watermark. 48

A watermark is processed only after all records within its container have been processed, guaranteeing Invariant 1 (watermark ordering). 49

Watermarks are processed in order, which also guarantees Invariant 1 (watermark ordering). 50

All records in all containers can be processed in parallel, which avoids stalling the pipeline. 51

Big picture: a pipeline consists of multiple transforms whose containers form a network, and records and watermarks flow through the links. The result is a highly parallel pipeline that guarantees watermark semantics, avoids stalling the pipeline (for throughput), and avoids relaxing the watermark (for latency). (Diagram: containers ordered from newest to oldest across upstream Transform 0 through downstream Transform 3, holding epochs from 25:00 down to 04:00.) 52
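An illustrative sketch of the cascading-container idea under the invariants above; the Container, Transform, and Record types here are hypothetical simplifications, far smaller than StreamBox's real data structures.

```cpp
// An illustrative sketch of a cascading container (not StreamBox's actual classes):
// each container owns one epoch's pending records, its end watermark, and a link to
// the downstream container that its output flows into.
#include <atomic>
#include <deque>
#include <memory>
#include <string>

struct Record {
    std::string data;
    long event_time;
};

struct Container {
    std::deque<Record> pending;           // this epoch's records not yet consumed
    long end_watermark = 0;               // the watermark that closes this epoch
    std::atomic<long> unconsumed{0};      // Invariant 1: the watermark fires only at zero
    Container* downstream = nullptr;      // Invariant 2: output always goes to this one
                                          // container, so records never change epochs
};

struct Transform {
    // One container per open epoch, ordered newest to oldest.
    std::deque<std::unique_ptr<Container>> containers;
};

int main() {
    Transform t0, t1;
    t0.containers.push_back(std::unique_ptr<Container>(new Container));
    t1.containers.push_back(std::unique_ptr<Container>(new Container));
    // Link the epoch in Transform 0 to the container its output flows into in Transform 1.
    t0.containers.front()->downstream = t1.containers.front().get();
    t0.containers.front()->end_watermark = 20 * 60;   // e.g. epoch closed by watermark 20:00
}
```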

Other key optimizations. Organizing records into bundles to minimize synchronization. Multi-input transforms: defer container ordering to downstream transforms. Pipeline scheduling: prioritize externalization to minimize latency. Pipeline state management: target NUMA-awareness and coarse-grained allocation. 53
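A minimal sketch of the bundling idea: workers hand off and claim whole bundles of records, so synchronization cost is paid once per bundle rather than once per record. The BundleQueue class is an illustrative assumption, not StreamBox's implementation.

```cpp
// Sketch of the "bundle" idea (names are illustrative): workers dequeue a whole bundle
// of records under one lock acquisition instead of locking per record, amortizing
// synchronization cost across many records.
#include <cstdio>
#include <deque>
#include <mutex>
#include <string>
#include <vector>

struct Record { std::string data; long event_time; };
using Bundle = std::vector<Record>;

class BundleQueue {
    std::deque<Bundle> queue_;
    std::mutex mu_;
public:
    void push(Bundle b) {
        std::lock_guard<std::mutex> lock(mu_);
        queue_.push_back(std::move(b));
    }
    bool pop(Bundle& out) {               // one lock per bundle, not per record
        std::lock_guard<std::mutex> lock(mu_);
        if (queue_.empty()) return false;
        out = std::move(queue_.front());
        queue_.pop_front();
        return true;
    }
};

int main() {
    BundleQueue q;
    q.push({{"a", 1}, {"b", 2}, {"c", 3}});  // in practice a bundle holds many records
    Bundle b;
    while (q.pop(b)) std::printf("got a bundle of %zu records\n", b.size());
}
```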

StreamBox implementation. Built from scratch in 22K SLoC of C++11. Supported transforms: Windowing, GroupBy, Aggregation, Mapper, Reducer, Temporal Join, Grep. Source code @ http://xsel.rocks/p/streambox. C++ libraries: Intel TBB, Facebook folly, jemalloc, boost. Concurrent hash tables: wraps TBB's concurrent_hash_map. 54
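A small sketch of the kind of concurrent hash table the slide refers to, here TBB's concurrent_hash_map used for a parallel word count. How StreamBox actually wraps it is not shown; this is only an assumed usage pattern.

```cpp
// A sketch of using TBB's concurrent_hash_map for a parallel word count; this shows the
// container StreamBox is said to wrap, not StreamBox's own wrapper.
#include <tbb/concurrent_hash_map.h>
#include <tbb/parallel_for.h>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

using WordTable = tbb::concurrent_hash_map<std::string, long>;

int main() {
    std::vector<std::string> words = {"stream", "box", "stream", "epoch", "stream"};
    WordTable table;

    // Many worker threads can update the table concurrently; the accessor holds the
    // entry's lock only for the duration of the update.
    tbb::parallel_for(std::size_t(0), words.size(), [&](std::size_t i) {
        WordTable::accessor acc;
        table.insert(acc, words[i]);   // inserts {word, 0} if absent, then locks the entry
        acc->second += 1;
    });

    for (WordTable::const_iterator it = table.begin(); it != table.end(); ++it)
        std::printf("%s: %ld\n", it->first.c_str(), it->second);
}
```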

StreamBox implementation. Benchmarks: windowed grep, word count, counting distinct URLs, network latency monitoring, tweets sentiment analysis. Machine configurations: CM12, two 6-core sockets (12 cores) with 256GB DRAM; CM56, four 14-core sockets (56 cores) with 256GB DRAM. 55

Roadmap Background StreamBox Design Evaluation 56

Evaluation Throughput and scalability Comparison with existing stream engines Handling out-of-order input streaming data Epoch parallelism effectiveness 57

Good throughput and scalability. (Chart: Tweets Sentiment Analysis throughput in KRec/s on 4, 12, 32, and 56 cores, for CM56 and CM12 at the 1-second and 500-ms settings.) 60

Good throughput and scalability. (Charts: throughput in KRec/s on 4, 12, 32, and 56 cores for Windowed Grep, Word Count, Temporal Join, Counting Distinct URLs, Network Latency Monitoring, and Tweets Sentiment Analysis, each for CM56 and CM12 at a 1-second setting and at a 50-ms or 500-ms setting.) 61

StreamBox vs. existing stream engines. (Chart: throughput in KRec/s on 4, 12, 32, and 56 cores for StreamBox, Spark Streaming v2.1.0, and Beam v0.5.0; the Spark Streaming and Beam bars are annotated 7K, 10K, 10K, and 8K.) StreamBox achieves significantly better throughput and scalability. 62

Handling out-of-order records. (Charts: throughput in KRec/s on 4, 12, 32, and 56 cores for WordCount, Netmon, and Tweets with 0%, 20%, and 40% of records out of order; one chart is annotated "Drop 7%".) StreamBox achieves good throughput even with lots of out-of-order records. 63

Epoch parallelism is effective. (Charts: throughput in KRec/s on 32 and 56 cores for Grep and WordCount, comparing StreamBox against in-order parallel processing; one chart is annotated "Drop 87%". Diagrams contrast prior work, which keeps one epoch per transform, with StreamBox, which runs all epochs in all transforms.) 64

Summary: StreamBox on multicores. Processes any records in any epochs in parallel, using all CPU cores. Achieves high throughput with low latency: throughput of millions of records per second, on par with distributed engines running on clusters with a few hundred CPU cores, and latency of tens of milliseconds, 20x shorter than other large-scale engines. http://xsel.rocks/p/streambox 65