Lecture 09. Spark for batch and streaming processing FREDERICK AYALA-GÓMEZ. CS-E4610 Modern Database Systems


CS-E4610 Modern Database Systems, 05.01.2018-05.04.2018. Lecture 09: Spark for batch and streaming processing. Frederick Ayala-Gómez, PhD student in Computer Science, ELTE University; visiting researcher, Aalto University. 03/16/2018.

Agenda
- Why not MapReduce? Iterative algorithms, interactive analytics
- Spark at a glance; how to use Spark?
- Scala at a glance
- Distributed data parallelism
- Fault tolerance
- Programming model; Spark runtime
- RDDs: transformations (lazy), actions (eager)
- Reduction operations
- Pair RDDs; join
- Shuffling and partitioning

Why not MapReduce? [Figure 1: Execution overview of MapReduce: the user program forks a master, which assigns map and reduce tasks to workers; map workers read input splits and write intermediate files to local disk, and reduce workers remote-read them and write the output files.] MapReduce flows are acyclic, which is not efficient for some applications.

Why not MapReduce? Zaharia, Matei, et al. "Spark: Cluster Computing with Working Sets." HotCloud 10 (2010). Iterative algorithms: many common machine learning algorithms repeatedly apply the same function to the same dataset (e.g., gradient descent). MapReduce repeatedly reloads (reads and writes) the data, which is costly.

Why not MapReduce? Zaharia, Matei, et al. "Spark: Cluster Computing with Working Sets." HotCloud 10 (2010). Interactive analytics: load data into memory and query it repeatedly. MapReduce would re-read the data on every query.

Spark at a Glance. Lightning-fast cluster computing.

Before we talk about Spark, let's talk about Scala. Scala was created by Prof. Martin Odersky (who also co-designed Java Generics) and is backed by Lightbend (formerly Typesafe). See the Coursera course Functional Programming in Scala, École Polytechnique Fédérale de Lausanne.

From Fast Single Cores to Multicores. Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten and Krste Asanovic. Martin Odersky, "Working Hard to Keep It Simple", OSCON Java 2011.

Typical Bare Metal Servers. https://softlayer.com

High Performance Computing (Top500, November 2015)
1. National Super Computer Center in Guangzhou, China - 3,120,000 cores
2. DOE/SC/Oak Ridge National Laboratory, United States - 560,640 cores
3. DOE/NNSA/LLNL, United States - 1,572,864 cores
4. RIKEN Advanced Institute for Computational Science (AICS), Japan - 705,024 cores
5. DOE/SC/Argonne National Laboratory, United States - 786,432 cores
6. DOE/NNSA/LANL/SNL, United States - 301,056 cores
7. Swiss National Supercomputing Centre (CSCS), Switzerland - 115,984 cores
8. HLRS - Höchstleistungsrechenzentrum Stuttgart, Germany - 185,088 cores
9. King Abdullah University of Science and Technology, Saudi Arabia - 196,608 cores
10. Texas Advanced Computing Center/Univ. of Texas, United States - 462,462 cores
http://top500.org/lists/2015/11/

Concurrency and Parallelism. Manage concurrent execution threads; execute programs faster by using multiple cores.

What can go wrong? Concurrent threads with shared mutable state:

var x = 0
async { x = x + 1 }
async { x = x * 2 }
// x could be 0, 1, or 2
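Why can x end up as 0, 1, or 2? Each thread reads x and later writes its result, and the scheduler may interleave those steps. A minimal sketch (not Scala's async, just an enumeration of the possible read/write orderings in plain Python):

```python
from itertools import permutations

def run(schedule):
    # Thread A computes x = x + 1, thread B computes x = x * 2.
    # Each is modeled as a read step followed by a write step on shared x.
    x = 0
    local = {}
    for step in schedule:
        if step == "A_read":
            local["a"] = x
        elif step == "A_write":
            x = local["a"] + 1
        elif step == "B_read":
            local["b"] = x
        elif step == "B_write":
            x = local["b"] * 2
    return x

# Enumerate all interleavings where each thread reads before it writes.
steps = ["A_read", "A_write", "B_read", "B_write"]
outcomes = set()
for order in permutations(steps):
    if order.index("A_read") < order.index("A_write") and \
       order.index("B_read") < order.index("B_write"):
        outcomes.add(run(order))

print(sorted(outcomes))  # [0, 1, 2]
```

Losing an update entirely (the 0 case) happens when B reads x before A writes, then B's write overwrites A's: exactly the hazard that immutability avoids.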

Threads in Java. Managing concurrent and parallel executions: Threads, Executors, ForkJoin, Parallel Streams, Completable Futures. http://booxs.biz/en/java/threads%20in%20java.html

Scala at a Glance. Static typing, lightweight syntax, object oriented, and functional:
- Higher-order functions
- Immutable over mutable
- Avoid shared mutable state
- Efficient immutable data structures

Scala collections help to organize data. Immutable collections never change. Collections are sequential or parallel. https://docs.scala-lang.org/overviews/collections/overview.html

Shared-Memory Data Parallelism (Scala parallel collections). result = ratings.map( ... ): split the data, CPU threads 0-2 run independently, then combine the results.
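The split/run-independently/combine pattern can be sketched in plain Python with a thread pool (the ratings values here are made-up sample data, not from the lecture):

```python
from concurrent.futures import ThreadPoolExecutor

ratings = [3.5, 4.0, 2.5, 5.0, 1.0, 4.5]  # hypothetical sample data

def normalize(r):
    # Map each rating from the 0-5 scale to the 0-1 scale.
    return r / 5.0

# The pool splits the work across threads; each element is mapped
# independently, and the results are combined back in input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    result = list(pool.map(normalize, ratings))

print(result)  # [0.7, 0.8, 0.5, 1.0, 0.2, 0.9]
```

Because normalize touches no shared mutable state, the threads cannot interfere with each other: the functional style of the previous slides is what makes this parallel map safe.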

Distributed Data Parallelism. result = ratings.map( ... ): split the data among nodes 1-4, the nodes run independently, then combine the results. Network latency is a problem.

Distribution: Failure and Latency
Task | Latency | Humanized (latency x 1 billion)
L1 cache reference | 0.5 ns | 0.5 s - one heart beat
Branch mispredict | 5 ns | 5 s - a yawn
L2 cache reference | 7 ns | 7 s - a long yawn
Mutex lock/unlock | 25 ns | 25 s - making a coffee
Main memory reference | 100 ns | 100 s - brushing your teeth
Compress 1K bytes with Zippy | 3,000 ns = 3 µs | 50 min - one episode of a TV show (including ad breaks)
Send 2K bytes over 1 Gbps network | 20,000 ns = 20 µs | 5.5 hr - from lunch to end of work day
SSD random read | 150,000 ns = 150 µs | 1.7 days - a normal weekend
Read 1 MB sequentially from memory | 250,000 ns = 250 µs | 2.9 days - a long weekend
Round trip within same datacenter | 500,000 ns = 0.5 ms | 5.8 days - a medium vacation
Read 1 MB sequentially from SSD | 1,000,000 ns = 1 ms | 11.6 days - waiting almost 2 weeks for a delivery
Disk seek | 10,000,000 ns = 10 ms | 16.5 weeks - a semester in university
Read 1 MB sequentially from disk | 20,000,000 ns = 20 ms | 7.8 months - almost producing a new human being
Send packet CA->Netherlands->CA | 150,000,000 ns = 150 ms | 4.8 years - average time it takes to get a bachelor's degree
Memory, disk, network: fast to slow.
https://gist.github.com/jboner/2841832 http://norvig.com/21-days.html
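The "humanized" column is just the latency scaled by one billion: n nanoseconds becomes n seconds. A small sketch of that conversion (the unit breakpoints are my own, chosen to reproduce the table's figures):

```python
def humanize(latency_ns):
    # Multiplying nanoseconds by one billion turns n ns into n seconds,
    # so the scaled value is numerically the same as the input.
    seconds = latency_ns
    for unit, size in [("years", 365 * 24 * 3600), ("weeks", 7 * 24 * 3600),
                       ("days", 24 * 3600), ("hours", 3600),
                       ("minutes", 60), ("seconds", 1)]:
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.1f} seconds"

print(humanize(3_000))        # compressing 1K with Zippy -> "50.0 minutes"
print(humanize(10_000_000))   # a disk seek -> "16.5 weeks"
print(humanize(150_000_000))  # CA->Netherlands->CA -> "4.8 years"
```

The takeaway stands out at this scale: a network round trip costs years against a heartbeat for a cache hit, which is why Spark works so hard to keep data in memory and minimize shuffles.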

Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing." Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, 2012.

Spark Resilient Distributed Datasets (RDD). An RDD is an immutable (read-only) collection of objects, partitioned across machines. Once defined, the programmer treats it as available (the system re-builds it if it is lost or leaves memory). Users can explicitly cache RDDs in memory and re-use them across MapReduce-like parallel operations.

Main challenge: efficient fault tolerance. It should be easy to re-build part of the data (e.g., a partition) if it is lost. This is achieved through coarse-grained transformations and lineage.

Fault tolerance. Coarse-grained transformations (e.g., map) apply the same function to all data items. Lineage is the series of transformations that led to a dataset. If a partition is lost, there is enough information to reapply the transformations and re-compute it.
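The idea can be illustrated with a toy class (this is a teaching sketch, not Spark's implementation): a partition stores only its source data and the chain of transformations, so a lost result can be replayed from lineage rather than restored from a replica.

```python
class LineagePartition:
    """Toy model of one RDD partition that records its lineage."""

    def __init__(self, source, transformations=()):
        self.source = list(source)                # base data (e.g., an input split)
        self.transformations = list(transformations)
        self._cached = None

    def map(self, fn):
        # A coarse-grained transformation: extend the lineage, compute nothing.
        return LineagePartition(self.source, self.transformations + [fn])

    def compute(self):
        if self._cached is None:
            data = self.source
            for fn in self.transformations:       # replay the lineage
                data = [fn(x) for x in data]
            self._cached = data
        return self._cached

    def lose(self):
        self._cached = None                       # simulate losing the partition

part = LineagePartition([1, 2, 3]).map(lambda x: x * 10).map(lambda x: x + 1)
first = part.compute()          # [11, 21, 31]
part.lose()
assert part.compute() == first  # recovered by re-running the lineage
```

Because the transformations are deterministic functions of the whole partition, the lineage is tiny to store compared with checkpointing the data itself.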

Programming Model. Developers write a driver program with the high-level control flow. Think of RDDs as objects that represent datasets, which you distribute among several workers and transform and apply actions to in parallel. Programs can also use restricted types of shared variables.

Spark runtime. The driver sends tasks to the workers and receives results; each worker holds a share of the input data and caches RDD partitions in RAM.

RDD. An immutable (read-only) collection of objects, partitioned across a set of machines, that can be re-built if a partition is lost. An RDD is constructed in the following ways:
- From a file in a shared file system (e.g., HDFS)
- By parallelizing a collection (e.g., an array): divide it into partitions and send them to multiple nodes
- By transforming an existing RDD (e.g., applying a map operation)

Hadoop Distributed File System (HDFS) architecture. Different from Spark.

RDD. An RDD does not exist at all times. Instead, there is enough information to compute it when needed. RDDs are lazily created and ephemeral. Lazy: materialized only when information is extracted from them (through actions!). Ephemeral: they might be discarded after use.

Transformations (Lazy). Lazy operations: the results are not immediately computed. A transformation creates a new RDD.

Actions (Eager). RDDs are computed every time you run an action. An action returns a value to the program or writes the results out (e.g., to HDFS).
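Python generators give a close analogue of the lazy/eager split (this is an illustration of the semantics, not Spark code): building the "transformation" runs nothing, and only the "action" forces the work.

```python
log = []

def traced(x):
    log.append(x)    # record when the function actually executes
    return x * 2

data = range(5)
transformed = (traced(x) for x in data)  # "transformation": nothing runs yet
assert log == []                         # still lazy, no element processed

result = list(transformed)               # "action": forces the computation
assert log == [0, 1, 2, 3, 4]            # now every element was processed
print(result)  # [0, 2, 4, 6, 8]
```

This is also why running an action twice on an uncached RDD recomputes the whole chain: like an exhausted generator, there is no stored result to reuse, only the recipe.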

Why Spark? Zaharia, Matei, et al. "Spark: Cluster Computing with Working Sets." HotCloud 10 (2010). MapReduce: simple API (map, reduce), fault-tolerant. Spark: simple and rich API, fault-tolerant, and it reduces latency using ideas from functional programming (immutability, in-memory computation). Up to 100x more performant than MapReduce (Hadoop), and more productive! Example: logistic regression in Hadoop and Spark.

What does a Spark program look like? (PySpark) Driver, RDD, transformation, action. Out[]: There are 20 movies about zombies... scary.

What does a Spark program look like? (PySpark) Driver, RDD, transformation, action. Out[]: There are 20 movies about zombies... scary! Out[]: There are 524 movies about romance <3. Let's think about what happened.

Caching and Persistence. To avoid re-computing RDDs, we can persist the data. cache: memory-only storage. persist: persistence can be customized at different levels (e.g., memory, disk); the default persistence is at the memory level.

What does a Spark program look like? (PySpark) Driver, persist, RDD, transformation, action. Out[]: There are 20 movies about zombies... scary! Out[]: There are 524 movies about romance <3. The second time, the lines RDD was loaded from memory.

Cluster Topology. The driver program contains the main function, creates RDDs, and coordinates the execution (transformations and actions). A cluster manager (YARN/Mesos) allocates resources. The worker nodes (executors) run tasks, return results, and persist RDDs.

YARN-cluster mode. https://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/

YARN-client mode. https://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/

Difference between running modes. https://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/

Cluster Topology - Evaluation. Out[]: There are 20 movies about zombies... scary. After the action runs, where are the zombie movies printed?

Cluster Topology - Evaluation. Actions usually communicate between the worker nodes and the driver node, so it is important to think about where the tasks are going to be executed. Large RDDs may cause out-of-memory errors on the driver node for some actions (e.g., collect). In that case, it is a good idea to write the output directly from the workers.

Reduction Operations. Traverse a collection and combine its elements to produce a single combined result.
- reduce(op): reduces the elements of the RDD using the specified commutative and associative operator.
- fold(zeroValue, op): aggregates the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral "zero value". The return type must match the element type.
- aggregate(zeroValue, seqOp, combOp): aggregates the elements of each partition, and then the results for all the partitions, using the given combine functions and a neutral "zero value". The return type may differ from the element type.
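The per-partition-then-across-partitions evaluation can be mirrored in plain Python (a sketch of the semantics over a hand-made list of partitions, not the PySpark API itself):

```python
from functools import reduce

partitions = [[1, 2, 3], [4, 5], [6]]  # hypothetical 3-partition RDD

# reduce(op): no zero value; op must be commutative and associative.
total = reduce(lambda a, b: a + b,
               (reduce(lambda a, b: a + b, p) for p in partitions))

# fold(zeroValue, op): same idea with a neutral zero value; note the zero
# is applied once per partition and once for the final combine.
folded = reduce(lambda a, b: a + b,
                (reduce(lambda a, b: a + b, p, 0) for p in partitions), 0)

# aggregate(zeroValue, seqOp, combOp): the accumulator type may differ
# from the element type -- here (sum, count) pairs, good for a mean.
def seq_op(acc, x):      # folds one element into a partition's accumulator
    return (acc[0] + x, acc[1] + 1)

def comb_op(a, b):       # merges two partitions' accumulators
    return (a[0] + b[0], a[1] + b[1])

per_part = [reduce(seq_op, p, (0, 0)) for p in partitions]
agg = reduce(comb_op, per_part, (0, 0))

print(total, folded, agg)  # 21 21 (21, 6)
```

The split into seqOp and combOp is what lets aggregate change the return type: seqOp mixes elements into accumulators inside a partition, while combOp only ever merges accumulators across partitions.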

Pair RDD. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113. Usually we have large datasets that we can organize by a key (e.g., movie_id, user_id). This is useful because it improves how we handle the RDD: pair RDDs have special methods for working with the data associated with the keys.

GroupByKey example

ReduceByKey example
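The difference between the two examples can be sketched in plain Python over hand-made partitions (made-up keys and values; not the PySpark API): groupByKey ships every value per key, while reduceByKey combines within each partition first, so far less data would cross the network in a real cluster.

```python
from collections import defaultdict

partitions = [[("a", 1), ("b", 2), ("a", 3)],
              [("b", 4), ("a", 5)]]

# groupByKey: all values are collected per key before any combining.
groups = defaultdict(list)
for part in partitions:
    for k, v in part:
        groups[k].append(v)

# reduceByKey: combine inside each partition ("map-side combine"),
# then merge the small per-partition maps.
def reduce_partition(part):
    local = defaultdict(int)
    for k, v in part:
        local[k] += v
    return dict(local)

merged = defaultdict(int)
for local in map(reduce_partition, partitions):
    for k, v in local.items():
        merged[k] += v   # only one value per key per partition is "shuffled"

print(dict(groups))  # {'a': [1, 3, 5], 'b': [2, 4]}
print(dict(merged))  # {'a': 9, 'b': 6}
```

This is why reduceByKey is generally preferred over groupByKey followed by a reduction: the combining function runs before the shuffle, not after it.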

Word Count example. Out[]: ('the', 20711), ('and', 15343), ('a', 10785), ('of', 9892), ('film', 9206), ('by', 8377), ('in', 7646), ('The', 7362), ('is', 6290), ('was', 5728)
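The pipeline behind this output (flatMap the lines into words, map each word to a count of 1, reduceByKey with addition) can be mirrored in plain Python; the sample lines below are made up, not the lecture's movie corpus:

```python
import re
from collections import Counter

lines = [
    "the film was directed by the director",
    "the film and the book",
]

# flatMap: one line expands into many words.
words = (w for line in lines for w in re.findall(r"\S+", line))

# map + reduceByKey: Counter sums a 1 for every occurrence of a word.
counts = Counter(words)

print(counts.most_common(2))  # [('the', 4), ('film', 2)]
```

Note that, as in the slide's output, 'the' and 'The' would count as different keys here, since no case normalization is applied.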

Join. You can combine pair RDDs using a join. The combination can be by:
- Inner joins (join): keep the keys that appear in both pair RDDs.
- Outer joins (leftOuterJoin/rightOuterJoin): guarantee that all the keys of one side (left or right) are present in the result; keys that do not appear in the other RDD get a None value.
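A toy sketch of the two behaviors over (key, value) lists (made-up keys and values; for brevity it assumes unique keys on the right side, whereas real pair-RDD joins pair every matching value):

```python
left = [("m1", "Zombieland"), ("m2", "Up")]
right = [("m1", 2009), ("m3", 1999)]

def inner_join(l, r):
    # Keep only keys present on both sides.
    rd = dict(r)                 # assumes unique right-side keys
    return [(k, (v, rd[k])) for k, v in l if k in rd]

def left_outer_join(l, r):
    # Every left key appears; missing right values become None.
    rd = dict(r)
    return [(k, (v, rd.get(k))) for k, v in l]

print(inner_join(left, right))
# [('m1', ('Zombieland', 2009))]
print(left_outer_join(left, right))
# [('m1', ('Zombieland', 2009)), ('m2', ('Up', None))]
```

In a cluster, a join generally forces a shuffle so that matching keys from both RDDs land in the same partition, which leads directly into the next slide.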

Shuffling and Partitioning. Shuffling and partitioning are expensive, because they send data over the network!

Spark Streaming Motivation. Big Data never stops: data is being produced all the time. Spark provides DStreams to handle sources that send data constantly.

DStream. A Discretized Stream represents a continuous stream of data: either the input data stream received from a source, or the processed data stream generated by transforming the input stream. Internally, a DStream is a continuous series of RDDs; each RDD in a DStream contains the data from a certain interval. We can apply transformations and actions to DStreams just as we do to RDDs.
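A toy micro-batch loop in the spirit of DStreams (a plain-Python sketch with made-up log events, not the Spark Streaming API): the "stream" arrives as a series of small batches, each batch is an ordinary collection we transform like one interval's RDD, and state carries across intervals.

```python
from collections import Counter

stream = [                       # one list per batch interval
    ["error", "ok", "ok"],
    ["error", "error"],
    ["ok"],
]

running = Counter()
per_batch = []
for batch in stream:             # in Spark, each interval yields one RDD
    batch_counts = Counter(batch)   # transformation on this interval's data
    running.update(batch_counts)    # stateful update across intervals
    per_batch.append(dict(batch_counts))

print(per_batch[-1], dict(running))
# {'ok': 1} {'error': 3, 'ok': 3}
```

Discretizing the stream this way is what lets Spark reuse the whole RDD machinery, including lineage-based fault tolerance, for streaming workloads.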

As a summary:
- Spark is a distributed big data processing framework.
- Distribution brings new concerns: node failure and latency.
- Spark uses Resilient Distributed Datasets (RDDs) to distribute and parallelize the data.
- RDDs are lazily created and ephemeral.
- Caching and persistence preserve an RDD in memory, on disk, or both.
- RDDs are fault tolerant: the state of an RDD can be recovered using coarse-grained transformations and lineage.
- Transformations are lazy (e.g., map, filter, groupBy, sortBy, reduceByKey).
- Actions are eager (e.g., take, collect, reduce, first, foreach).
- The topology of the cluster matters.
- Working with RDDs implies shuffling and partitioning, which impact performance due to latency.
- Spark provides big data stream processing via DStreams.

That's all for now! Tutorial: Batch and streaming processing with PySpark. Monday, 19 March, 14:15-16:00, T2 / C105 (T2), Tietotekniikka, Konemiehentie 2. Thanks! Questions? Frederick Ayala-Gómez, frederick.ayala@aalto.fi

Credits and References
- Slides from Michael Mathioudakis from a previous edition of Aalto's Modern Database Systems course.
- https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
- Big Data Analysis with Scala and Spark, Dr. Heather Miller, École Polytechnique Fédérale de Lausanne.
- Zaharia, Matei, et al. "Spark: Cluster Computing with Working Sets." HotCloud 10 (2010).
- Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing." Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, 2012.
- Learning Spark: Lightning-Fast Big Data Analysis, by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia.