Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters


Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters
Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica
UC Berkeley

Motivation
Many important applications need to process large data streams arriving in real time:
- User activity statistics (e.g. Facebook's Puma)
- Spam detection
- Traffic estimation
- Network intrusion detection
Our target: large-scale apps that must run on tens to hundreds of nodes with O(1 sec) latency.

Challenge
To run at large scale, the system has to be both:
- Fault-tolerant: recover quickly from failures and stragglers
- Cost-efficient: not require significant hardware beyond that needed for basic processing
Existing streaming systems don't have both properties.

Traditional Streaming Systems
Record-at-a-time processing model:
- Each node has mutable state
- For each record, update state & send new records
[Diagram: records pushed through nodes, each node holding mutable state]

Traditional Streaming Systems
Fault tolerance via replication or upstream backup:
- Replication: fast node recovery, but 2x hardware cost
- Upstream backup: only need one standby, but slow to recover
[Diagrams: replicated node pairs kept in sync vs. a single standby node]
Neither approach tolerates stragglers.

Observation
Batch processing models for clusters (e.g. MapReduce) provide fault tolerance efficiently:
- Divide the job into deterministic tasks
- Rerun failed/slow tasks in parallel on other nodes
Idea: run a streaming computation as a series of very small, deterministic batches
- Same recovery schemes at a much smaller timescale
- Work to make the batch size as small as possible
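The micro-batching idea can be sketched in plain Scala (this is a toy stand-in, not the system's implementation; `Record`, `intervalOf`, and the word-count batch job are hypothetical names): incoming records are grouped into small fixed intervals, and each interval is processed by an ordinary deterministic batch function.

```scala
// Minimal sketch of micro-batching: discretize a record stream into
// fixed-size intervals and run a deterministic batch job on each one.
// A real system would pull records from a network source continuously.
object MicroBatchSketch {
  case class Record(arrivalMs: Long, word: String)

  // Assign each record to a batch interval (here: 1000 ms batches).
  def intervalOf(r: Record, batchMs: Long = 1000L): Long = r.arrivalMs / batchMs

  // A deterministic batch operation: word count over one interval.
  def countWords(batch: Seq[Record]): Map[String, Int] =
    batch.groupBy(_.word).map { case (w, rs) => (w, rs.size) }

  def main(args: Array[String]): Unit = {
    val records = Seq(
      Record(100, "a"), Record(700, "b"), Record(900, "a"), // interval 0
      Record(1200, "a"), Record(1800, "b")                  // interval 1
    )
    // The "stream" is just a sequence of small batches, processed in order.
    val results = records.groupBy(r => intervalOf(r)).toSeq.sortBy(_._1)
      .map { case (t, batch) => (t, countWords(batch)) }
    results.foreach { case (t, counts) => println(s"t=$t -> $counts") }
  }
}
```

Because each batch is deterministic, rerunning a failed batch (or part of one) yields the same result, which is what makes the MapReduce-style recovery scheme applicable.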

Discretized Stream Processing
- t = 1: pull the interval's records into an immutable dataset (stored reliably), then run a batch operation over it
- t = 2: the result (output or state) is again an immutable dataset, stored in memory without replication
[Diagram: streams 1 and 2 discretized into per-interval immutable datasets]

Parallel Recovery
- Checkpoint state datasets periodically
- If a node fails/straggles, recompute its dataset partitions in parallel on other nodes
[Diagram: a map from an input dataset to an output dataset, with lost partitions recomputed in parallel]
Faster recovery than upstream backup, without the cost of replication.
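A toy illustration of lineage-based recovery in plain Scala (hypothetical names, and sequential where a real cluster would recompute on many nodes in parallel): because each output partition is a deterministic function of its input partition, a lost partition can simply be recomputed, and independent lost partitions can be recomputed concurrently.

```scala
// Sketch of lineage-based parallel recovery: each output partition is a
// deterministic function of one input partition, so lost partitions can
// be recomputed (in parallel, on other nodes) rather than restored from
// a replica. Here recomputation is sequential for simplicity.
object ParallelRecoverySketch {
  type Partition = Vector[Int]

  // Deterministic per-partition transformation (a "map" task).
  def task(p: Partition): Partition = p.map(_ * 2)

  // Compute the output dataset; simulate losing some partitions (None).
  def run(input: Vector[Partition], lost: Set[Int]): Vector[Option[Partition]] =
    input.indices.toVector.map { i =>
      if (lost(i)) None else Some(task(input(i)))
    }

  // Recovery: recompute only the missing partitions from their lineage.
  def recover(input: Vector[Partition],
              output: Vector[Option[Partition]]): Vector[Partition] =
    output.indices.toVector.map { i =>
      output(i).getOrElse(task(input(i))) // rerun the task for lost partitions
    }
}
```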

How Fast Can It Go?
A prototype built on the Spark in-memory computing engine can process 2 GB/s (20M records/s) of data on 50 nodes at sub-second latency.
[Charts: cluster throughput (GB/s) vs. number of nodes for Grep and WordCount; max throughput within a given latency bound (1 or 2 s)]

How Fast Can It Go?
Recovers from failures within 1 second.
[Chart: interval processing time (s) over time, with the failure marked; sliding WordCount on 10 nodes with a 30 s checkpoint interval]

Programming Model
- A discretized stream (D-stream) is a sequence of immutable, partitioned datasets
  - Specifically, resilient distributed datasets (RDDs), the storage abstraction in Spark
- Deterministic transformation operators produce new streams

API
LINQ-like language-integrated API in Scala, with new stateful operators for windowing:

  pageViews = readStream("...", "1s")
  ones = pageViews.map(ev => (ev.url, 1))            // Scala function literal
  counts = ones.runningReduce(_ + _)
  sliding = ones.reduceByWindow("5s", _ + _, _ - _)  // incremental version with add and subtract functions

[Diagram: the pageViews, ones, and counts D-streams at t = 1, t = 2, ..., connected by map and reduce steps; each box is an RDD, each subdivision a partition]
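The incremental `reduceByWindow` can be mimicked in plain Scala (a sketch of the idea only, not the system's operator; `slidingSums` is a hypothetical name): instead of re-reducing the whole window each interval, the new window value is the old value plus the newest interval's result minus the interval that just expired.

```scala
// Sketch of incremental windowed reduction with add and subtract functions:
// window(t) = window(t-1) + newInterval(t) - expiredInterval(t - win).
object IncrementalWindowSketch {
  // `perInterval` holds one reduced value per interval; `win` is the
  // window length in intervals. Returns the running windowed sum.
  def slidingSums(perInterval: Seq[Int], win: Int): Seq[Int] = {
    var acc = 0
    perInterval.zipWithIndex.map { case (x, i) =>
      acc += x                                  // add the newest interval
      if (i >= win) acc -= perInterval(i - win) // subtract the expired one
      acc
    }
  }
}
```

This only works when the reduce function has an inverse (subtraction undoes addition), which is exactly why the operator takes both an add and a subtract function.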

Other Benefits of Discretized Streams
- Consistency: each record is processed atomically
- Unification with batch processing:
  - Combining streams with historical data:
      pageViews.join(historicCounts).map(...)
  - Interactive ad-hoc queries on stream state:
      pageViews.slice("21:00", "21:05").topK(10)

Conclusion
- D-Streams forgo traditional streaming wisdom by batching data in small timesteps
- They enable an efficient, new parallel recovery scheme
- They let users seamlessly intermix streaming, batch, and interactive queries

Related Work
- Bulk incremental processing (CBP, Comet): periodic (~5 min) batch jobs on Hadoop/Dryad; on-disk, replicated FS for storage instead of RDDs
- Hadoop Online: does not recover stateful operators or allow multi-stage jobs
- Streaming databases: record-at-a-time processing, generally replication for fault tolerance
- Parallel recovery (MapReduce, GFS, RAMCloud, etc.): Hwang et al. [ICDE '07] give a parallel recovery protocol for streams, but allow only one failure and do not handle stragglers

Timing Considerations
- D-streams group records into intervals based on when they arrive at the system
- For apps that need to group by an external time and tolerate network delays, we support:
  - Slack time: delay starting a batch for a short fixed time to give records a chance to arrive
  - Application-level correction: e.g. give a result for time t at time t+1, then use later records to update it incrementally at time t+5
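The slack-time idea can be sketched in plain Scala (hypothetical names; not the system's actual scheduler): records are assigned to intervals by their external timestamp, but an interval is only finalized once the wall clock has passed its end plus a fixed slack, so records delayed by up to the slack still land in the right interval.

```scala
// Sketch of slack time: bucket records by external (event) time, and
// only close an interval once `slackMs` has elapsed past its end, so
// slightly late records are still counted in the correct interval.
object SlackTimeSketch {
  // Interval index for an event timestamp, with `batchMs`-sized intervals.
  def intervalOf(eventMs: Long, batchMs: Long): Long = eventMs / batchMs

  // First interval that may NOT yet be finalized at wall-clock `nowMs`:
  // interval t (ending at (t+1)*batchMs) is closable once
  // nowMs >= (t+1)*batchMs + slackMs, i.e. all t < (nowMs-slackMs)/batchMs.
  def firstOpenInterval(nowMs: Long, batchMs: Long, slackMs: Long): Long =
    (nowMs - slackMs) / batchMs
}
```

With 1 s intervals and 500 ms of slack, interval 1 (covering 1000-2000 ms of event time) is finalized only once the clock reaches 2500 ms, trading a little extra latency for fewer misplaced records.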

D-Streams vs. Traditional Streaming

  Concern                | Discretized Streams                           | Record-at-a-time Systems
  -----------------------+-----------------------------------------------+--------------------------------------------------
  Latency                | 0.5-2 s                                       | 1-100 ms
  Consistency            | Yes, batch-level                              | Not in msg. passing systems; some DBs use waiting
  Failures               | Parallel recovery                             | Replication or upstream backup
  Stragglers             | Speculation                                   | Typically not handled
  Unification with batch | Ad-hoc queries from Spark shell, join w. RDD  | Not in msg. passing systems; in some DBs