CS-E4610 Modern Database Systems 05.01.2018-05.04.2018 Lecture 09: Spark for batch and streaming processing FREDERICK AYALA-GÓMEZ, PHD STUDENT IN COMPUTER SCIENCE, ELTE UNIVERSITY; VISITING RESEARCHER, AALTO UNIVERSITY 03/16/2018 LECTURE 09 SPARK FOR BATCH AND STREAMING PROCESSING 1
Agenda: Why not MapReduce? · Spark at a glance · How to use Spark? · Iterative algorithms · Interactive analytics · Scala at a glance · Distributed data parallelism · Fault tolerance · Programming model · Spark runtime · RDDs · Transformations (lazy) · Actions (eager) · Reduction operations · Pair RDDs · Join · Shuffling and partitioning
Why not MapReduce? MapReduce flows are acyclic and not efficient for some applications. (Figure 1: Execution overview. The user program forks a master, which assigns map and reduce tasks to workers; map workers read input splits and write intermediate files to local disk; reduce workers read them remotely and write the output files.)
Why not MapReduce? Iterative algorithms. Many common machine learning algorithms repeatedly apply the same function to the same dataset (e.g., gradient descent). MapReduce repeatedly reloads (reads and writes) the data, which is costly. [Zaharia, Matei, et al. "Spark: Cluster Computing with Working Sets." HotCloud 10 (2010).]
Why not MapReduce? Interactive analytics. Load data in memory and query it repeatedly; MapReduce would re-read the data on every query. [Zaharia, Matei, et al. "Spark: Cluster Computing with Working Sets." HotCloud 10 (2010).]
Lightning-fast cluster computing. Spark at a Glance.
Before we talk about Spark, let's talk about Scala. Java Generics → Scala. Lightbend (formerly Typesafe). Prof. Martin Odersky. Coursera: Functional Programming in Scala, École Polytechnique Fédérale de Lausanne.
From Fast Single Cores to Multicores. Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanovic. Martin Odersky, "Working Hard to Keep It Simple," OSCON Java 2011.
Typical Bare Metal Servers https://softlayer.com
High Performance Computing (Top500 list, November 2015)
1. National Super Computer Center in Guangzhou, China: 3,120,000 cores
2. DOE/SC/Oak Ridge National Laboratory, United States: 560,640 cores
3. DOE/NNSA/LLNL, United States: 1,572,864 cores
4. RIKEN Advanced Institute for Computational Science (AICS), Japan: 705,024 cores
5. DOE/SC/Argonne National Laboratory, United States: 786,432 cores
6. DOE/NNSA/LANL/SNL, United States: 301,056 cores
7. Swiss National Supercomputing Centre (CSCS), Switzerland: 115,984 cores
8. HLRS - Höchstleistungsrechenzentrum Stuttgart, Germany: 185,088 cores
9. King Abdullah University of Science and Technology, Saudi Arabia: 196,608 cores
10. Texas Advanced Computing Center/Univ. of Texas, United States: 462,462 cores
http://top500.org/lists/2015/11/
Concurrency and Parallelism. Manage concurrent execution threads; execute programs faster by using multiple cores.
What can go wrong? Concurrent threads with shared mutable state:
var x = 0
async { x = x + 1 }
async { x = x * 2 }
// x could be 0, 1, 2
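Why can x end up as 0, 1, or 2? A plain-Python sketch (the slide's example is Scala) that models each async block as a read step followed by a write step and enumerates every legal interleaving:

```python
from itertools import permutations

# Each "async" block is two steps: read the shared x, then write a new value.
# Thread A computes x + 1, thread B computes x * 2.
def run(schedule):
    x = 0
    local = {}  # per-thread register holding the value that was read
    for thread, step in schedule:
        if step == "read":
            local[thread] = x
        else:  # write
            x = local[thread] + 1 if thread == "A" else local[thread] * 2
    return x

# All interleavings that keep each thread's read before its own write.
steps = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]
schedules = {
    p for p in permutations(steps)
    if p.index(("A", "read")) < p.index(("A", "write"))
    and p.index(("B", "read")) < p.index(("B", "write"))
}
print(sorted({run(s) for s in schedules}))  # [0, 1, 2]
```

The 0 outcome is the classic lost update: both threads read x = 0 before either writes.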
Threads in Java. Manage concurrent and parallel executions: Threads, Executors, ForkJoin, Parallel Streams, Completable Futures. http://booxs.biz/en/java/threads%20in%20java.html
Scala at a Glance. Static typing, lightweight syntax, object oriented, and functional: higher-order functions, immutable over mutable, avoid shared mutable state, efficient immutable data structures.
Scala Collections Help to Organize Data. Immutable collections never change. Collections are sequential or parallel. https://docs.scalalang.org/overviews/collections/overview.html
Shared Memory Data Parallelism (Scala parallel collections): result = ratings.map( ). Split the data; CPU threads run independently; combine the results.
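The split/run/combine pattern can be sketched in plain Python (the slide uses a Scala parallel collection; the ratings data and the normalize function are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical ratings data standing in for the slide's collection.
ratings = list(range(100))

def normalize(chunk):
    # The per-element function passed to map(); here, scale into [0, 1].
    return [r / 99 for r in chunk]

# Split the data, let threads run independently, then combine the results.
chunks = [ratings[i:i + 25] for i in range(0, len(ratings), 25)]
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(normalize, chunks))
result = [x for part in parts for x in part]
```

Because normalize has no shared mutable state, the threads cannot interfere with each other, which is exactly why the functional style matters here.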
Distributed Data Parallelism: result = ratings.map( ). Split the data among nodes (Node 1 ... Node 4); nodes run independently; combine the results. Network latency is a problem.
Distribution: Failure and Latency

Task | Latency | Humanized (latency × 1 billion)
L1 cache reference | 0.5 ns | 0.5 s: one heartbeat
Branch mispredict | 5 ns | 5 s: a yawn
L2 cache reference | 7 ns | 7 s: a long yawn
Mutex lock/unlock | 25 ns | 25 s: making a coffee
Main memory reference | 100 ns | 100 s: brushing your teeth
Compress 1K bytes with Zippy | 3,000 ns = 3 µs | 50 min: one episode of a TV show (including ad breaks)
Send 2K bytes over 1 Gbps network | 20,000 ns = 20 µs | 5.5 hr: from lunch to end of work day
SSD random read | 150,000 ns = 150 µs | 1.7 days: a normal weekend
Read 1 MB sequentially from memory | 250,000 ns = 250 µs | 2.9 days: a long weekend
Round trip within same datacenter | 500,000 ns = 0.5 ms | 5.8 days: a medium vacation
Read 1 MB sequentially from SSD | 1,000,000 ns = 1 ms | 11.6 days: waiting almost 2 weeks for a delivery
Disk seek | 10,000,000 ns = 10 ms | 16.5 weeks: a semester in university
Read 1 MB sequentially from disk | 20,000,000 ns = 20 ms | 7.8 months: almost producing a new human being
Send packet CA->Netherlands->CA | 150,000,000 ns = 150 ms | 4.8 years: average time it takes to get a bachelor's degree

Memory, then disk, then network: fast to slow.
https://gist.github.com/jboner/2841832 http://norvig.com/21-days.html
Zaharia, Matei, et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012.
Spark Resilient Distributed Datasets (RDD). An immutable (read-only) collection of objects, partitioned across machines. Once defined, the programmer treats it as available (the system re-builds it if it is lost or leaves memory). Users can explicitly cache RDDs in memory and re-use them across MapReduce-like parallel operations.
Main Challenge: Efficient fault tolerance. It should be easy to re-build part of the data (e.g., a partition) if it is lost. Achieved through coarse-grained transformations and lineage.
Fault tolerance. Coarse-grained transformations: e.g., map applies the same function to all data items. Lineage: the series of transformations that led to a dataset. If a partition is lost, there is enough information to re-apply the transformations and re-compute it.
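A minimal sketch of lineage-based recovery in plain Python (the partitions and the two map functions are made-up examples; real RDDs track this bookkeeping internally):

```python
# Each partition stores the coarse-grained transformations that produced
# it, not a replica of the data itself.
base = {0: [1, 2, 3], 1: [4, 5, 6]}            # partitions of the source data
lineage = [lambda x: x * 10, lambda x: x + 1]  # map(*10), then map(+1)

def compute(pid):
    # Re-apply the whole lineage to rebuild one partition from the source.
    part = base[pid]
    for f in lineage:
        part = [f(x) for x in part]
    return part

cached = {0: compute(0), 1: compute(1)}
del cached[1]               # simulate losing a partition on node failure
cached[1] = compute(1)      # rebuilt from lineage, no replication needed
print(cached[1])            # [41, 51, 61]
```

Only the lost partition is recomputed, which is why this is cheaper than replicating every dataset across nodes.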
Programming Model. Developers write a driver program with the high-level control flow. Think of RDDs as objects that represent datasets that you distribute among several workers, then transform and apply actions to in parallel. Programs can also use restricted types of shared variables.
Spark runtime. (Diagram: the driver sends tasks to the workers and receives results; each worker holds its input data in RAM.)
RDD. An immutable (read-only) collection of objects partitioned across a set of machines, which can be re-built if a partition is lost. Constructed in the following ways: from a file in a shared file system (e.g., HDFS); by parallelizing a collection (e.g., an array), which is divided into partitions and sent to multiple nodes; or by transforming an existing RDD (e.g., applying a map operation).
Hadoop Distributed File System (HDFS) architecture. Different from Spark.
RDD. An RDD does not exist at all times. Instead, there is enough information to compute it when needed. RDDs are lazily created and ephemeral. Lazy: materialized only when information is extracted from them (through actions!). Ephemeral: might be discarded after use.
Transformations (Lazy). Lazy operations: the results are not immediately computed. They create a new RDD.
Actions (Eager). RDDs are computed every time you run an action. Actions return a value to the program or output the results (e.g., to HDFS).
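The lazy/eager split can be sketched with a Python generator: the "transformation" only builds a recipe, and nothing runs until an "action" asks for a result (my_map and the log are illustrative names, not Spark APIs):

```python
# Sketch of lazy transformations vs. eager actions using a generator.
log = []

def my_map(f, data):           # "transformation": yields lazily
    for x in data:
        log.append(f"map({x})")
        yield f(x)

nums = my_map(lambda x: x * 2, [1, 2, 3])
print(log)                     # [] -- the map has not run yet

total = sum(nums)              # "action": forces the computation
print(total, log)              # 12 ['map(1)', 'map(2)', 'map(3)']
```

This is also why, without caching, running a second action replays the transformations from scratch, which the next slides address.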
Why Spark? MapReduce: simple API (map, reduce), fault-tolerant. Spark: simple and rich API, fault-tolerant, reduces latency using ideas from functional programming (immutability, in-memory computation). Up to 100x faster than MapReduce (Hadoop), and more productive! (Figure: logistic regression in Hadoop and Spark.) [Zaharia, Matei, et al. "Spark: Cluster Computing with Working Sets." HotCloud 10 (2010).]
What does a Spark program look like? (PySpark) Driver, RDD, Transformation, Action. Out[]: There are 20 movies about zombies... scary.
What does a Spark program look like? (PySpark) Driver, RDD, Transformation, Action. Out[]: There are 20 movies about zombies... scary! Out[]: There are 524 movies about romance <3. Let's think about what happened.
Caching and Persistence. To avoid re-computing RDDs, we can persist the data. cache: memory-only storage. persist: persistence can be customized at different levels (e.g., memory, disk). The default persistence is at the memory level.
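A plain-Python sketch of why caching helps: without it, every action replays the whole lineage back to the input; with it, the materialized result is reused (load_lines and its tiny dataset are stand-ins for reading the movie file):

```python
# Count how many times the "input" is re-read with and without caching.
reads = 0

def load_lines():              # stands in for re-reading the input file
    global reads
    reads += 1
    return ["zombie movie", "romance movie", "zombie series"]

# Without caching: each action re-runs the load (2 reads).
a = len([l for l in load_lines() if "zombie" in l])
b = len([l for l in load_lines() if "romance" in l])

# With caching: materialize once, answer both actions from memory (1 read).
reads = 0
cached = load_lines()
a = len([l for l in cached if "zombie" in l])
b = len([l for l in cached if "romance" in l])
print(reads)  # 1
```

This mirrors the zombie/romance example on the surrounding slides: the second query is served from the persisted lines RDD.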
What does a Spark program look like? (PySpark) Driver, Persist, RDD, Transformation, Action. Out[]: There are 20 movies about zombies... scary! Out[]: There are 524 movies about romance <3. The second time, the lines RDD was loaded from memory.
Cluster Topology. Driver Program: contains the main, creates RDDs, coordinates the execution (transformations and actions). Cluster Manager (YARN/Mesos). Worker Nodes (Executors): run tasks, return results, persist RDDs.
YARN-cluster mode https://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
YARN-client mode https://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
Difference between running modes https://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
Cluster Topology - Evaluation. Out[]: There are 20 movies about zombies... scary. Action. Where are the zombie movies printed?
Cluster Topology - Evaluation. Actions usually communicate between the worker nodes and the driver node, so it is important to think about where tasks are going to be executed. Large RDDs may cause out-of-memory errors on the driver node for some actions (e.g., collect). In that case, it is a good idea to write the output directly from the workers.
Reduction Operations. Traverse a collection and combine the elements to produce a single combined result. reduce(op): reduces the elements of the RDD using the specified commutative and associative binary operator. fold(zeroValue, op): aggregates the elements of each partition, and then the results for all partitions, using a given associative function and a neutral "zero value"; the return type must match the element type. aggregate(zeroValue, seqOp, combOp): aggregates the elements of each partition, and then the results for all partitions, using given combine functions and a neutral "zero value"; the return type may differ from the element type.
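A plain-Python sketch of how the three reductions behave per partition (the real methods live on PySpark RDDs; the two partitions here are illustrative):

```python
from functools import reduce

partitions = [[1, 2, 3], [4, 5]]

# reduce(op): combine all elements with a commutative, associative operator.
r = reduce(lambda a, b: a + b, [x for p in partitions for x in p])   # 15

# fold(zeroValue, op): like reduce, but each partition (and the final
# merge) starts from the zero value; input and output types must match.
f = reduce(lambda a, b: a + b,
           [reduce(lambda a, b: a + b, p, 0) for p in partitions], 0)  # 15

# aggregate(zeroValue, seqOp, combOp): seqOp folds a partition into an
# accumulator of a *different* type; combOp merges the accumulators.
# Here: build (sum, count) pairs to compute an average.
seq = lambda acc, x: (acc[0] + x, acc[1] + 1)
comb = lambda a, b: (a[0] + b[0], a[1] + b[1])
s, c = reduce(comb, [reduce(seq, p, (0, 0)) for p in partitions], (0, 0))
print(r, f, s / c)  # 15 15 3.0
```

The aggregate case shows why the changed return type matters: a plain reduce over integers could never produce the (sum, count) pair needed for the average.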
Pair RDD. Usually we have large datasets that we can organize by a key (e.g., movie_id, user_id). This is useful because it improves how we handle the RDD: pair RDDs have special methods for working with the data associated with each key. [Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." Communications of the ACM 51.1 (2008): 107-113.]
GroupByKey example
ReduceByKey example
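The two examples above can be sketched in plain Python over (key, value) tuples (the genres and counts are made up for illustration):

```python
from collections import defaultdict

pairs = [("horror", 1), ("romance", 1), ("horror", 1), ("horror", 1)]

# groupByKey: gather *all* values for a key in one place, then combine.
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)
grouped_counts = {k: sum(vs) for k, vs in groups.items()}

# reduceByKey: combine values per key as you go, so in Spark far less
# data has to be shuffled across the network.
counts = {}
for k, v in pairs:
    counts[k] = counts.get(k, 0) + v

print(grouped_counts == counts, counts)
```

Both produce the same answer; the difference in Spark is that groupByKey ships every value over the network before combining, while reduceByKey combines locally first.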
Word Count example. Out[]: ('the', 20711) ('and', 15343) ('a', 10785) ('of', 9892) ('film', 9206) ('by', 8377) ('in', 7646) ('The', 7362) ('is', 6290) ('was', 5728)
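The word count pipeline behind that output can be sketched step by step in plain Python (the two-line corpus is a made-up stand-in for the slide's movie plot file):

```python
lines = ["the film was the best", "the film by a director"]

words = [w for line in lines for w in line.split()]   # flatMap
pairs = [(w, 1) for w in words]                       # map to (word, 1)
counts = {}
for w, n in pairs:                                    # reduceByKey
    counts[w] = counts.get(w, 0) + n

top = sorted(counts.items(), key=lambda kv: -kv[1])
print(top[0])  # ('the', 3)
```

Note that the slide's real output counts 'the' and 'The' separately; lowercasing before splitting would merge them.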
Join. You can combine pair RDDs using a join. The combination can be: Inner join (join): keeps keys that appear in both pair RDDs. Outer joins (leftOuterJoin/rightOuterJoin): guarantee that all the keys of one side (left or right) will be present in the result; keys that do not appear in the other RDD get a None value.
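The inner vs. left-outer distinction can be sketched over two lists of (key, value) pairs (the movie ids, titles, and ratings are made up for illustration):

```python
titles = [(1, "Alien"), (2, "Up"), (3, "Heat")]
ratings = [(1, 4.5), (3, 3.0), (3, 5.0)]

# Inner join: only keys present in both sides survive.
inner = [(k, (t, r)) for k, t in titles for k2, r in ratings if k == k2]

# Left outer join: every left key survives; keys with no match on the
# right get a None value, like Spark's leftOuterJoin.
rated = {k for k, _ in ratings}
left = inner + [(k, (t, None)) for k, t in titles if k not in rated]

print(sorted(left))
# [(1, ('Alien', 4.5)), (2, ('Up', None)), (3, ('Heat', 3.0)), (3, ('Heat', 5.0))]
```

Key 3 appears twice because joins emit one pair per matching combination, and key 2 survives only in the outer join.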
Shuffling and Partitioning. Shuffling and partitioning are expensive, because they have to send data over the network!
Spark Streaming Motivation. Big data never stops: data is being produced all the time. Spark provides DStreams to handle sources that send data constantly.
DStream. A Discretized Stream represents a continuous stream of data: either the input data stream received from a source, or the processed data stream generated by transforming the input stream. Internally, a DStream is a continuous series of RDDs; each RDD in a DStream contains data from a certain interval. We can apply transformations and actions to DStreams just as we do to RDDs.
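The core DStream idea, cutting a continuous stream into small batches and processing each batch as an ordinary collection, can be sketched in plain Python (the events and batch interval are made up for illustration):

```python
# A made-up stream of incoming text events over time.
incoming = ["a b", "b c", "c c", "a a"]
batch_interval = 2                     # events per micro-batch

batches = [incoming[i:i + batch_interval]
           for i in range(0, len(incoming), batch_interval)]

running = {}                           # state carried across batches
for batch in batches:                  # one RDD-like collection per interval
    for line in batch:
        for w in line.split():
            running[w] = running.get(w, 0) + 1

print(running)  # {'a': 3, 'b': 2, 'c': 3}
```

In real Spark Streaming the batching is driven by wall-clock time rather than event count, and each batch is a genuine RDD processed by the same engine as batch jobs.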
As a summary. Spark is a distributed big data processing framework. Distribution brings new concerns: node failure and latency. Spark uses Resilient Distributed Datasets (RDDs) to distribute and parallelize the data. RDDs are lazily created and ephemeral. Caching and persistence preserve an RDD in memory, disk, or both. RDDs are fault-tolerant: the state of an RDD can be recovered using coarse-grained transformations and lineage. Transformations are lazy (e.g., map, filter, groupBy, sortBy, reduceByKey). Actions are eager (e.g., take, collect, reduce, first, foreach). The topology of the cluster matters. Working with RDDs implies shuffling and partitioning, which impact performance due to latency. Spark provides big data streaming processing via DStreams.
That's all for now! Tutorial: Batch and streaming processing with PySpark. Monday, 19 March, 14:15-16:00, T2 / C105 (T2), Tietotekniikka, Konemiehentie 2. Thanks! Questions? Frederick Ayala-Gómez frederick.ayala@aalto.fi
Credits and References. Slides from Michael Mathioudakis from the previous Aalto Modern Database Systems course. https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html. Big Data Analysis with Scala and Spark, Dr. Heather Miller, École Polytechnique Fédérale de Lausanne. Zaharia, Matei, et al. "Spark: Cluster Computing with Working Sets." HotCloud 10 (2010): 10-10. Zaharia, Matei, et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012. Learning Spark: Lightning-Fast Big Data Analysis, by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia.