Università degli Studi di Roma Tor Vergata
Dipartimento di Ingegneria Civile e Ingegneria Informatica
Spark Streaming: Hands-on Session A.A. 2017/18
Matteo Nardelli
Laurea Magistrale in Ingegneria Informatica - II anno

The reference Big Data stack (figure): High-level Interfaces, Data Processing, Data Storage, Resource Management, Support / Integration

Book: Karau, Konwinski, Wendell, Zaharia. Learning Spark. O'Reilly, 2015.
Papers:
Zaharia, Das, Li, Hunter, Shenker, Stoica. Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing. Berkeley EECS.
Zaharia, Chowdhury, Franklin, Shenker, Stoica. Spark: Cluster Computing with Working Sets. HotCloud '10.
Online references:
Apache Spark: http://spark.apache.org/
Spark documentation: http://spark.apache.org/docs/latest/api/python/index.html

Apache Spark
A fast and general-purpose engine for large-scale data processing:
- not a modified version of Hadoop, but the leading candidate to succeed MapReduce
- can efficiently support more types of computations, such as interactive queries and stream processing
- can read from and write to any Hadoop-supported system (e.g., HDFS)
- speed: in-memory data storage enables very fast iterative queries; the system is also more efficient than MapReduce for complex applications running on disk (up to 40x faster than Hadoop)

Apache Spark
Data can be ingested from many sources, such as Kafka, Twitter, HDFS, or TCP sockets. Results can be pushed out to file systems, databases, live dashboards, and more.

Apache Spark
- Spark Core: the basic functionality of Spark (task scheduling, memory management, fault recovery, interaction with storage systems)
- Spark SQL: the package for working with structured data, queried via SQL as well as HiveQL
- Spark Streaming: the component that enables processing of live streams of data (e.g., log files, status update messages)

Spark Streaming: Abstractions
Spark Streaming uses a micro-batch architecture:
- the stream is treated as a series of batches of data
- new batches are created at regular time intervals
- the size of the time interval is called the batch interval
- the batch interval is typically between 500 ms and several seconds

Spark Streaming: DStream
The Discretized Stream (DStream) is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either an input stream or a stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs; each RDD in a DStream contains the data from a certain interval.

Spark Streaming: DStream
Any operation applied on a DStream translates into operations on the underlying RDDs. The RDD transformations are computed by the Spark engine; the DStream operations hide most of these details.

Spark Streaming: DStream (basic data sources)
- File streams: for reading data from files on any file system compatible with the HDFS API (that is, HDFS, S3, NFS, etc.), a DStream can be created as:
  ... = streamingContext.fileStream<...>(directory);
- Streams based on custom receivers: DStreams can be created from data streams received through custom receivers, extending the Receiver<T> class.
- Queue of RDDs as a stream: for testing a Spark Streaming application with test data, one can also create a DStream based on a queue of RDDs (see the sketch below), using:
  ... = streamingContext.queueStream(queueOfRDDs)
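As a concrete illustration of the queue-based test source, here is a minimal sketch; it assumes an already defined JavaStreamingContext ssc, the queue contents and variable names are purely illustrative, and it needs java.util.Arrays, java.util.LinkedList, java.util.Queue and org.apache.spark.api.java.JavaRDD:

  // each RDD pushed into the queue is consumed as one micro-batch of the stream
  Queue<JavaRDD<String>> rddQueue = new LinkedList<>();
  rddQueue.add(ssc.sparkContext().parallelize(Arrays.asList("spark", "streaming", "test")));

  JavaDStream<String> testStream = ssc.queueStream(rddQueue);
  testStream.print();   // an output operation, so the test batches are actually processed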

Spark Streaming: Streaming Context
To execute a Spark Streaming application, we need to define the StreamingContext. It specializes SparkContext for streaming applications. In Java it can be defined as follows:
  JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, batchInterval);
where:
- master (set in sparkConf) is a Spark, Mesos or YARN cluster URL; to run your code in local mode, use "local[K]", where K >= 2 represents the parallelism
- appName (set in sparkConf) is the name of your application
- batchInterval is the time interval (in seconds) of each batch
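A minimal, self-contained sketch of this setup follows (the class name, application name, and the 1-second batch interval are illustrative choices, not taken from the slides):

  import org.apache.spark.SparkConf;
  import org.apache.spark.streaming.Durations;
  import org.apache.spark.streaming.api.java.JavaStreamingContext;

  public class StreamingContextSetup {
      public static void main(String[] args) {
          // master and appName are carried by the SparkConf object
          SparkConf sparkConf = new SparkConf()
                  .setMaster("local[2]")           // local mode with K = 2 threads
                  .setAppName("MyStreamingApp");   // the application name shown in the Spark UI

          // new micro-batches are created every second
          JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));

          // ... define DStreams and output operations here, then start the context
      }
  }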

Spark Streaming: DStream
Once built, DStreams offer two types of operations:
- Transformations, which yield a new DStream from a previous one. For example, one common transformation is filtering data.
  - stateless transformations: the processing of each batch does not depend on the data of previous batches. Examples are map(), filter(), and reduceByKey().
  - stateful transformations: use data from previous batches to compute the results of the current batch. They include sliding windows and tracking state across time.
- Output operations, which write data to an external system. Each streaming application has to define at least one output operation.
Note that a streaming context can be started only once, and must be started only after we set up all the DStreams and output operations (a minimal lifecycle sketch follows).
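A minimal sketch of the resulting driver lifecycle (assuming ssc and at least one output operation have already been defined; awaitTermination() may throw InterruptedException):

  ssc.start();               // start receiving data and processing it
  ssc.awaitTermination();    // block until the computation terminates or fails
  // ssc.stop();             // optional explicit stop (by default it also stops the SparkContext)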

Spark Streaming: DStream
Most of the transformations have the same syntax as the ones applied to RDDs:
- map(func): return a new DStream by passing each element of the source DStream through a function func.
- flatMap(func): similar to map, but each input item can be mapped to 0 or more output items.
- filter(func): return a new DStream by selecting only the records of the source DStream on which func returns true.
- union(otherStream): return a new DStream that contains the union of the elements in the source DStream and otherStream.
- join(otherStream): when called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.

Spark Streaming: DStream
- reduce(func): return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative so that it can be computed in parallel.
- reduceByKey(func): when called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function.
- count(): return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
- collect(): return all the elements from the RDD.
- transform(func): return a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.
- updateStateByKey(func): return a new "state" DStream where the state for each key is updated by applying the given function to the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key (a sketch follows after this list).
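As an illustration of updateStateByKey (a stateful transformation), the following sketch keeps a running count per word. The pairs DStream of (word, 1) tuples, the checkpoint path, and the variable names are hypothetical, not from the slides; note that a checkpoint directory must be set for stateful transformations (imports: java.util.List, org.apache.spark.api.java.Optional):

  // assumes: JavaPairDStream<String, Integer> pairs of (word, 1) tuples
  ssc.checkpoint("/tmp/spark-checkpoint");   // hypothetical checkpoint directory

  JavaPairDStream<String, Integer> runningCounts = pairs.updateStateByKey(
      (List<Integer> newValues, Optional<Integer> state) -> {
          int sum = state.isPresent() ? state.get() : 0;   // previous count for the key, if any
          for (Integer v : newValues) {                    // add the counts of the current batch
              sum += v;
          }
          return Optional.of(sum);                         // becomes the new state for the key
      });
  runningCounts.print();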

Example: Network Word Count

  ...
  SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount");
  JavaStreamingContext ssc = ...
  JavaReceiverInputDStream<String> lines = ssc.socketTextStream(...);
  JavaDStream<String> words = lines.flatMap(...);
  JavaPairDStream<String, Integer> wordCounts = words
      .mapToPair(s -> new Tuple2<>(s, 1))
      .reduceByKey((i1, i2) -> i1 + i2);
  ...
  wordCounts.print();

Excerpt of NetworkWordCount.java
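To try the example locally, a common approach (not shown in the excerpt, so host and port here are illustrative) is to feed the socket with netcat and point socketTextStream at the same port:

  // in a shell, start a simple text server first:  nc -lk 9999
  JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999);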

Spark Streaming: Window
Windowed computations allow you to apply transformations over a sliding window of data. Any window operation needs to specify two parameters:
- window length: the duration of the window (in seconds)
- sliding interval: the interval at which the window operation is performed (in seconds)
Both parameters must be multiples of the batch interval.
Example (from the slide's figure): batch interval 1 s, window length 3 s, sliding interval 2 s.

Spark Streaming: Window
window(windowLength, slideInterval): return a new DStream which is computed based on windowed batches.

  ...
  JavaStreamingContext ssc = ...
  JavaReceiverInputDStream<String> lines = ...
  JavaDStream<String> linesInWindow = lines.window(WINDOW_SIZE, SLIDING_INTERVAL);
  JavaPairDStream<String, Integer> wordCounts = linesInWindow
      .flatMap(SPLIT_LINE)
      .mapToPair(s -> new Tuple2<>(s, 1))
      .reduceByKey((i1, i2) -> i1 + i2);
  ...

Excerpt of WindowBasedWordCount.java

Spark Streaming: Window
- reduceByWindow(func, invFunc, windowLength, slideInterval): return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func (which should be associative). The reduce value of each window is calculated incrementally: func reduces the new data that enters the sliding window, while invFunc "inverse reduces" the old data that leaves the window.
- reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval): when called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window.
To perform these incremental transformations, we need to define a checkpoint directory.

A Window-based WordCount
We now perform only the reduce operation within a sliding window. We change our code as follows.

  ...
  JavaPairDStream<String, Integer> wordCountPairs = ssc.socketTextStream(...)
      .flatMap(x -> Arrays.asList(SPACE.split(x)).iterator())
      .mapToPair(s -> new Tuple2<>(s, 1));
  JavaPairDStream<String, Integer> wordCounts = wordCountPairs.reduceByKeyAndWindow(
      (i1, i2) -> i1 + i2, WINDOW_SIZE, SLIDING_INTERVAL);
  ...
  wordCounts.print();
  wordCounts.foreachRDD(new SaveAsLocalFile());

Excerpt of WindowBasedWordCount2.java

A (More Efficient) Window-based WordCount
A more efficient version: the reduce value of each window is calculated incrementally. A reduce function handles the new data that enters the sliding window; an inverse reduce function handles the old data that leaves the window. Note that checkpointing must be enabled to use this operation.

  ...
  ssc.checkpoint(LOCAL_CHECKPOINT_DIR);
  ...
  JavaPairDStream<String, Integer> wordCounts = wordCountPairs.reduceByKeyAndWindow(
      (i1, i2) -> i1 + i2,
      (i1, i2) -> i1 - i2,
      WINDOW_SIZE, SLIDING_INTERVAL);
  ...

Excerpt of WindowBasedWordCount3.java

Spark Streaming: Output Operations
Output operations allow a DStream's data to be pushed out to external systems, like a database or a file system:
- print(): prints the first ten elements of every batch of data in the DStream on the driver node running the application.
- saveAsTextFiles(prefix, [suffix]): save this DStream's contents as text files. The file name at each batch interval is generated based on prefix.
- saveAsHadoopFiles(prefix, [suffix]): save this DStream's contents as Hadoop files.
- saveAsObjectFiles(prefix, [suffix]): save this DStream's contents as SequenceFiles of serialized Java objects.
- foreachRDD(func): generic output operator that applies a function, func, to each RDD generated from the stream (a sketch follows after this list).
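As an example of the generic operator, here is a minimal sketch of foreachRDD that writes each batch of word counts to standard output; the wordCounts DStream is assumed to exist, and a real sink would open its connection once per partition instead of printing:

  wordCounts.foreachRDD(rdd -> {
      rdd.foreachPartition(partition -> {
          // a real implementation would open a connection to the external system here
          while (partition.hasNext()) {
              Tuple2<String, Integer> record = partition.next();
              System.out.println(record._1() + ": " + record._2());
          }
          // ... and close the connection here
      });
  });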