Spark Streaming. Large-scale near-realtime stream processing. Tathagata Das (TD), UC Berkeley


Motivation. Many important applications must process large data streams at second-scale latencies: site statistics, intrusion detection, spam filtering, ... Scaling these apps to 100s of nodes requires fault-tolerance (for both crashes and stragglers) and efficiency (to be cost-effective); current streaming frameworks don't meet both goals together. We would also like a simple programming model and integration with batch + ad hoc queries.

Traditional Streaming Systems. Record-at-a-time processing model: each node has mutable state, and for each record it updates that state and sends new records downstream. [Diagram: input records pushed through nodes 1 and 2 to node 3, each node holding mutable state.]

Traditional Streaming Systems. Fault tolerance via replication or upstream backup. Replication: fast recovery, but 2x hardware cost. Upstream backup: only need 1 standby, but slow to recover. [Diagram: replication keeps two synchronized copies of each node; upstream backup buffers input for replay to a standby node.]

Traditional Streaming Systems. Fault tolerance via replication or upstream backup: neither approach handles stragglers. [Same diagram as above.]

Observation. Batch processing models, like MapReduce, do provide fault tolerance efficiently: divide the job into deterministic tasks, and rerun failed/slow tasks in parallel on other nodes.

Idea: run a streaming computation as a series of very small, deterministic batch jobs. E.g., process a stream of tweets in 1-second batches, applying the same recovery schemes at a smaller timescale. Try to make the batch size as small as possible: lower batch size → lower end-to-end latency. State between batches is kept in memory; deterministic stateful ops → fault-tolerance. (A minimal sketch of the micro-batch idea follows below.)
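
Purely as an illustration of the micro-batch idea, here is a minimal plain-Scala sketch (not the Spark API): records are grouped into 1-second intervals by arrival time and each interval is processed by a deterministic batch function, with the running state kept in memory between batches. The record type and timestamps are assumptions made for this sketch.

// Minimal sketch of the micro-batch idea in plain Scala (not the Spark API).
// Assumption: records are (timestampMillis, word) pairs; batches are 1 second long.
object MicroBatchSketch {
  type Record = (Long, String)

  // A deterministic "batch job": count the words in one batch.
  def countWords(batch: Seq[Record]): Map[String, Int] =
    batch.groupBy(_._2).map { case (w, rs) => (w, rs.size) }

  def main(args: Array[String]): Unit = {
    // Toy input stream (in a real system these records would arrive continuously).
    val stream: Seq[Record] = Seq((0L, "a"), (300L, "b"), (1200L, "a"), (1900L, "b"))

    // Discretize the stream into 1-second intervals, as D-Streams do with arriving input.
    val batches = stream.groupBy { case (t, _) => t / 1000 }.toSeq.sortBy(_._1)

    // State between batches is kept in memory and updated by each deterministic job.
    var state = Map.empty[String, Int]
    for ((interval, batch) <- batches) {
      val counts = countWords(batch)
      state = counts.foldLeft(state) { case (s, (w, c)) => s.updated(w, s.getOrElse(w, 0) + c) }
      println(s"interval $interval: $state")
    }
  }
}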

Discretized Stream Processing. [Diagram: the input stream is divided into intervals; at time 0-1 the batch of input forms an immutable dataset, stored reliably; batch operations produce further immutable datasets (output or state), stored in memory as Spark RDDs; repeating this at time 1-2, etc. turns the input stream into a state stream.]

Fault Recovery. All datasets are modeled as RDDs with a dependency graph → fault-tolerant without full replication. Fault/straggler recovery is done in parallel on other nodes → fast recovery. [Diagram: a map from the input dataset to the output dataset.] Fast recovery without the cost of full replication.

How Fast Can It Go? The prototype can process 4 GB/s (40M records/s) of data on 100 nodes at sub-second latency. [Charts: cluster throughput (GB/s) vs. number of nodes for Grep and TopKWords, showing the max throughput achievable with a given latency bound (1 or 2 sec).]

Comparison with Storm. [Charts: throughput (MB/s/node) vs. record size (100 to 10,000 bytes) for Grep and TopK, Spark vs. Storm.] Storm is limited to 10,000 records/s/node. We also tried Apache S4: 7,000 records/s/node. Commercial systems report O(100K).

How Fast Can It Recover? Recovers from faults/stragglers within 1 sec. [Chart: interval processing time (s) over a 75-second run of sliding WordCount on 10 nodes with a 30s checkpoint interval, showing a brief spike when the failure happens.]

Programming Interface. A Discretized Stream, or DStream, is a sequence of RDDs representing a stream of data, with an API very similar to RDDs. DStreams can be created either from live streaming data or by transforming other DStreams.

DStream Operators. Transformations build new streams from existing streams, using existing RDD ops (map, etc.) plus new stateful ops. Output operators send data to the outside world (save results to external storage, print to screen, etc.).

Example 1: count the words received every second.

words = createNetworkStream("http://...")
counts = words.count()

[Diagram: the words and counts DStreams as sequences of RDDs for time = 0-1, 1-2, 2-3, with the count transformation applied to each RDD of words to produce the corresponding RDD of counts.]

Example 2: count the frequency of words received every second.

words = createNetworkStream("http://...")
ones = words.map(w => (w, 1))      // w => (w, 1) is a Scala function literal
freqs = ones.reduceByKey(_ + _)

[Diagram: the words, ones, and freqs DStreams for time = 0-1, 1-2, 2-3, linked by the map and reduce transformations.]

Example 2: full code.

// Create the context and set the batch size
val ssc = new SparkStreamContext("local", "test")
ssc.setBatchDuration(Seconds(1))

// Process a network stream
val words = ssc.createNetworkStream("http://...")
val ones = words.map(w => (w, 1))
val freqs = ones.reduceByKey(_ + _)
freqs.print()

// Start the stream computation
ssc.run
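
For comparison, here is a sketch of the same word count written against the DStream API that later shipped in Spark Streaming releases; the socket text source on localhost:9999 is an assumption for the example (the slide above uses the earlier alpha API with SparkStreamContext and createNetworkStream).

// Sketch only: per-1-second word counts with the released Spark Streaming API.
// Assumes a text source on localhost:9999 (e.g. started with `nc -lk 9999`).
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))       // 1-second batch interval

    val lines = ssc.socketTextStream("localhost", 9999)    // DStream of input lines
    val words = lines.flatMap(_.split(" "))
    val freqs = words.map(w => (w, 1)).reduceByKey(_ + _)  // per-batch word frequencies
    freqs.print()

    ssc.start()             // start the streaming computation
    ssc.awaitTermination()  // run until stopped
  }
}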

Example 3: count the frequency of words received in the last minute.

ones = words.map(w => (w, 1))
freqs = ones.reduceByKey(_ + _)
freqs_60s = freqs.window(Seconds(60), Seconds(1))   // sliding window operator: window length, window movement
                 .reduceByKey(_ + _)

[Diagram: the words, ones, freqs, and freqs_60s DStreams for time = 0-1, 1-2, 2-3, linked by map, reduce, and window + reduce transformations.]

Simpler window-based reduce.

freqs = ones.reduceByKey(_ + _)
freqs_60s = freqs.window(Seconds(60), Seconds(1))
                 .reduceByKey(_ + _)

freqs = ones.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(1))

Incremental window operators. [Diagram comparing the two ways to compute a windowed reduce over intervals t-1 ... t+4: with a plain aggregation function, each window re-aggregates every interval it covers; with an invertible aggregation function, each new window is computed from the previous one by adding the newest interval and subtracting the interval that dropped out.]

Smarter window-based reduce.

freqs = ones.reduceByKey(_ + _)
freqs_60s = freqs.window(Seconds(60), Seconds(1))
                 .reduceByKey(_ + _)

freqs = ones.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(1))

freqs = ones.reduceByKeyAndWindow(
            _ + _, _ - _, Seconds(60), Seconds(1))
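
To show what the invertible variant computes, here is a small plain-Scala sketch (independent of the Spark API): each new window aggregate is derived from the previous one by adding the newest interval's counts and subtracting the counts of the interval that just fell out of the window, rather than re-reducing the whole window. The per-interval counts and the 3-interval window are illustrative assumptions.

// Sketch of incremental window aggregation with an invertible function (+ and -).
// Illustrative per-interval word counts; window length = 3 intervals, slide = 1 interval.
object IncrementalWindowSketch {
  def add(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (m, (k, v)) => m.updated(k, m.getOrElse(k, 0) + v) }

  def subtract(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (m, (k, v)) => m.updated(k, m.getOrElse(k, 0) - v) }.filter(_._2 != 0)

  def main(args: Array[String]): Unit = {
    val perInterval = Seq(Map("a" -> 2), Map("b" -> 1), Map("a" -> 1, "b" -> 3), Map("b" -> 2))
    val windowLen = 3

    var windowAgg = Map.empty[String, Int]
    for (t <- perInterval.indices) {
      windowAgg = add(windowAgg, perInterval(t))            // add the newest interval
      if (t >= windowLen)
        windowAgg = subtract(windowAgg, perInterval(t - windowLen)) // drop the oldest interval
      println(s"window ending at interval $t: $windowAgg")
    }
  }
}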

Output Operators. save writes results to any Hadoop-compatible storage system (e.g. HDFS, HBase):

freqs.save("hdfs://...")

foreachRDD runs a Spark function on each RDD:

freqs.foreachRDD(freqsRDD => {
  // any Spark/Scala processing, maybe save to database
})

Live + Batch + Interactive. Combining DStreams with historical datasets:

freqs.join(oldFreqs).map(...)

Interactive queries on stream state from the Spark interpreter:

freqs.slice("21:00", "21:05").topK(10)

One stack to rule them all. The promise of a unified data analytics stack: write algorithms only once; cut the complexity of maintaining separate stacks for live and batch processing; query live stream state instead of waiting for an import. Feedback from some recent experiences has been very exciting.

Implementation on Spark. Optimizations on current Spark: optimized scheduling for < 100 ms tasks, a new block store with fast NIO communication, and pipelining of jobs from different time intervals. The changes are already in the dev branch of Spark at http://www.github.com/mesos/spark; an alpha will be released with Spark 0.6 soon.

More Details. You can find more about Spark Streaming in our paper: http://tinyurl.com/dstreams. Thank you!

Fault Recovery. [Charts: interval processing time (s) before failure vs. at the time of failure for WordCount with 30s checkpoints, WordMax with no checkpoints, and WordMax with 10s checkpoints; and interval processing time for WordCount and Grep with no straggler vs. with a straggler and speculation enabled.]

Interactive Ad-Hoc Queries

Related Work. Bulk incremental processing (CBP, Comet): periodic (~5 min) batch jobs on Hadoop/Dryad; on-disk, replicated FS for storage instead of RDDs. Hadoop Online: does not recover stateful ops or allow multi-stage jobs. Streaming databases: record-at-a-time processing, generally with replication for FT. Approximate query processing and load shedding: do not support the loss of arbitrary nodes; different math because the drop rate is known exactly. Parallel recovery (MapReduce, GFS, RAMCloud, etc.).

Timing Considerations. D-Streams group input into intervals based on when records arrive at the system. For apps that need to group by an external time and tolerate network delays, we support: slack time (delay starting a batch for a short fixed time to give records a chance to arrive) and application-level correction (e.g. give a result for time t at time t+1, then use later records to update it incrementally at time t+5; see the sketch below).
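
A minimal plain-Scala sketch of the application-level correction idea (not a Spark API): emit a provisional count for an external time as soon as its interval closes, then fold in records that arrive late as incremental corrections. The Event type and the one-interval "on time" horizon are assumptions for illustration.

// Sketch: provisional per-second counts, later corrected by late-arriving records.
object LateCorrectionSketch {
  case class Event(eventTimeSec: Long, arrivalTimeSec: Long)

  def main(args: Array[String]): Unit = {
    // Two events for external time 0 arrive on time, a third arrives 5 seconds late.
    val events = Seq(Event(0, 0), Event(0, 1), Event(1, 1), Event(1, 2), Event(0, 5))

    var counts = Map.empty[Long, Int]
    for (e <- events.sortBy(_.arrivalTimeSec)) {            // process in arrival order
      counts = counts.updated(e.eventTimeSec, counts.getOrElse(e.eventTimeSec, 0) + 1)
      val late = e.arrivalTimeSec > e.eventTimeSec + 1      // arrived after the provisional result
      val tag = if (late) s"correction applied at t=${e.arrivalTimeSec}" else "counted on time"
      println(s"count for external time ${e.eventTimeSec} -> ${counts(e.eventTimeSec)} ($tag)")
    }
  }
}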