Big Data Analytics with Apache Spark (Part 2) Nastaran Fatemi


What we've seen so far: Spark's Programming Model
- We saw that, at a glance, Spark looks like Scala collections.
- However, internally, Spark behaves differently than Scala collections: Spark uses laziness to save time and memory.
- We saw transformations and actions.
- We saw caching and persistence (i.e., cache in memory, save time!).
- We saw how the cluster topology comes into the programming model.

In the following
1. We'll discover Pair RDDs (key-value pairs).
2. We'll learn about Pair RDD specific operators.
3. We'll get a glimpse of what shuffling is, and why it hits performance (latency).

Pair RDDs (Key-Value Pairs)
Often when working with distributed data, it's useful to organize data into key-value pairs. In Spark, these are Pair RDDs. They are useful because:
- Pair RDDs allow you to act on each key in parallel or regroup data across the network.
- Spark provides powerful extension methods for RDDs containing pairs (e.g., RDD[(K, V)]).
Some of the most important extension methods are:

  def groupByKey(): RDD[(K, Iterable[V])]
  def reduceByKey(func: (V, V) => V): RDD[(K, V)]
  def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

Depending on the operation, data in an RDD may have to be shuffled among worker nodes, using worker-worker communication. This is often the case for many operations on Pair RDDs!

Pair RDDs
Key-value pairs are known as Pair RDDs in Spark. When an RDD is created with a pair as its element type, Spark automatically adds a number of extra useful methods (extension methods) for such pairs.
Creating a Pair RDD
Pair RDDs are most often created from already-existing non-pair RDDs, for example by using the map operation on RDDs:

  val lines = sc.textFile("README.md")
  val pairs = lines.map(x => (x.split(" ")(0), x))

In Scala, for the functions on keyed data to be available, we need to return tuples. An implicit conversion on RDDs of tuples exists to provide the additional key/value functions.
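Not shown on the original slides, but the RDD API also offers keyBy, which builds the same kind of Pair RDD by computing a key for each element and keeping the whole element as the value. A minimal sketch, assuming the lines RDD above:

  // Keys each line by its first whitespace-separated token; the full line stays as the value.
  val pairsByKey = lines.keyBy(line => line.split(" ")(0)) // RDD[(String, String)]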

Pair RDD Transformation: groupByKey
groupByKey groups values which have the same key. When called on a dataset of (K, V) pairs, it returns a dataset of (K, Iterable<V>) pairs.

  val data = Array((1, 2), (3, 4), (3, 6))
  val myRDD = sc.parallelize(data)
  val groupedRDD = myRDD.groupByKey()
  groupedRDD.collect().foreach(println)
  // (1,CompactBuffer(2))
  // (3,CompactBuffer(4, 6))

Note: if you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.

groupByKey: another example

  case class Event(organizer: String, name: String, budget: Int)

  val events: List[Event] = List(
    Event("HEIG-VD", "Baleinev", 42000),
    Event("HEIG-VD", "Stages WINS", 14000),
    Event("HEIG-VD", "CyberSec", 20000),
    Event("HE-ARC", "Portes ouvertes", 25000),
    Event("HE-ARC", "Cérémonie diplômes", 10000),
    ...
  )

  val eventsRDD = sc.parallelize(events)
  val eventsKvRDD = eventsRDD.map(event => (event.organizer, event.budget))
  val groupedRDD = eventsKvRDD.groupByKey()
  groupedRDD.collect().foreach(println)
  // (HEIG-VD,CompactBuffer(42000, 14000, 20000))
  // (HE-ARC,CompactBuffer(25000, 10000))
  // ...

Pair RDD Transformation: reduceByKey
Conceptually, reduceByKey can be thought of as a combination of groupByKey and reduce-ing on all the values per key. It is more efficient, though, than using each separately. (We'll see why later.)

  def reduceByKey(func: (V, V) => V): RDD[(K, V)]

When called on a dataset of (K, V) pairs, it returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V.
Example: let's use eventsKvRDD from the previous example to calculate the total budget per organizer over all of their organized events.

  val eventsRDD = sc.parallelize(events)
  val eventsKvRDD = eventsRDD.map(event => (event.organizer, event.budget))
  val totalBudget = eventsKvRDD.reduceByKey(_ + _)
  totalBudget.collect().foreach(println)
  // (HEIG-VD,76000)
  // (HE-ARC,35000)
  // ...

Pair RDD Transformation: mapValues and Action: countByKey

  def mapValues[U](f: V => U): RDD[(K, U)]

mapValues can be thought of as a short-hand for:

  rdd.map { case (x, y) => (x, f(y)) }

That is, it simply applies a function to only the values in a Pair RDD.
countByKey (def countByKey(): Map[K, Long]) simply counts the number of elements per key in a Pair RDD, returning a normal Scala Map (remember, it's an action!) mapping from keys to counts.
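The slides don't demonstrate countByKey; here is a minimal sketch (my addition), assuming the eventsKvRDD built in the earlier example, that counts how many events each organizer has:

  // countByKey is an action: it returns a local Scala Map on the driver, not an RDD.
  val eventsPerOrganizer = eventsKvRDD.countByKey()
  // With the sample events above (ignoring the elided ones): Map(HEIG-VD -> 3, HE-ARC -> 2)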

Pair RDD Transformation: mapValues and Action: countByKey
Example: we can use each of these operations to compute the average budget per event organizer.

  // Calculate a pair (as the key's value) containing (budget, #events)
  val intermediate = eventsKvRDD
    .mapValues(b => (b, 1))
    .reduceByKey((v1, v2) => (v1._1 + v2._1, v1._2 + v2._2))
  // intermediate: RDD[(String, (Int, Int))]

  val avgBudgets = ???

Pair RDD Transformation: mapValues and Action: countByKey
Example: we can use each of these operations to compute the average budget per event organizer.

  // Calculate a pair (as the key's value) containing (budget, #events)
  val intermediate = eventsKvRDD
    .mapValues(b => (b, 1))
    .reduceByKey((v1, v2) => (v1._1 + v2._1, v1._2 + v2._2))
  // intermediate: RDD[(String, (Int, Int))]

  val avgBudgets = intermediate.mapValues {
    case (budget, numberOfEvents) => budget / numberOfEvents
  }
  avgBudgets.collect().foreach(println)
  // (HEIG-VD,25333)
  // (HE-ARC,17500)
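The note on the groupByKey slide also mentions aggregateByKey, which the slides never show. As a sketch (my addition, not from the original slides), the same per-organizer average can be computed with aggregateByKey, building the (sum, count) pair directly instead of going through mapValues first:

  // Zero value (0, 0) = (running budget sum, running event count) per key.
  val avgBudgets2 = eventsKvRDD
    .aggregateByKey((0, 0))(
      (acc, budget) => (acc._1 + budget, acc._2 + 1), // seqOp: fold one budget into the accumulator
      (a, b) => (a._1 + b._1, a._2 + b._2)            // combOp: merge per-partition accumulators
    )
    .mapValues { case (sum, count) => sum / count }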

Joins
Joins are another sort of transformation on Pair RDDs. They're used to combine multiple datasets, and they are one of the most commonly-used operations on Pair RDDs!
There are two kinds of joins:
- Inner joins (join)
- Outer joins (leftOuterJoin/rightOuterJoin)
The difference between the two types of joins is exactly the same as in databases.

Inner Joins (join)

  def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

Example: let's pretend the CFF has two datasets. One RDD represents customers and their subscriptions (abos), and another represents customers and the cities they frequently travel to (locations). (E.g., gathered from the CFF smartphone app.)
How do we combine only customers that have a subscription and for which there is location info?

  val abos = ...      // RDD[(Int, (String, Abonnement))]
  val locations = ... // RDD[(Int, String)]
  val trackedCustomers = ???

Inner Joins (join)

  def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

Example: let's pretend the CFF has two datasets. One RDD represents customers, their names and their subscriptions (abos), and another represents customers and the cities they frequently travel to (locations). (E.g., gathered from the CFF smartphone app.)
How do we combine only customers that have a subscription and for which there is location info?

  val abos = ...      // RDD[(Int, (String, Abonnement))]
  val locations = ... // RDD[(Int, String)]
  val trackedCustomers = abos.join(locations)
  // trackedCustomers: RDD[(Int, ((String, Abonnement), String))]

Inner Joins (join)
Example continued with concrete data:

  val as = List(
    (101, ("Ruetli", AG)),
    (102, ("Brelaz", DemiTarif)),
    (103, ("Gress", DemiTarifVisa)),
    (104, ("Schatten", DemiTarif)))
  val abos = sc.parallelize(as)

  val ls = List(
    (101, "Bern"), (101, "Thun"),
    (102, "Lausanne"), (102, "Geneve"), (102, "Nyon"),
    (103, "Zurich"), (103, "St-Gallen"), (103, "Chur"))
  val locations = sc.parallelize(ls)

  val trackedCustomers = abos.join(locations)
  trackedCustomers.foreach(println)
  // trackedCustomers: RDD[(Int, ((String, Abonnement), String))]

Inner Joins (join)
Example continued with concrete data:

  trackedCustomers.collect().foreach(println)
  // (101,((Ruetli,AG),Bern))
  // (101,((Ruetli,AG),Thun))
  // (102,((Brelaz,DemiTarif),Nyon))
  // (102,((Brelaz,DemiTarif),Lausanne))
  // (102,((Brelaz,DemiTarif),Geneve))
  // (103,((Gress,DemiTarifVisa),St-Gallen))
  // (103,((Gress,DemiTarifVisa),Chur))
  // (103,((Gress,DemiTarifVisa),Zurich))
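The outer joins mentioned earlier (leftOuterJoin/rightOuterJoin) are not shown on the slides. A sketch (my addition), assuming the same abos and locations RDDs: leftOuterJoin keeps every customer that has a subscription, whether or not location info exists, wrapping the location in an Option.

  val withOptionalLocations = abos.leftOuterJoin(locations)
  // withOptionalLocations: RDD[(Int, ((String, Abonnement), Option[String]))]
  // Customer 104 has no recorded locations, so it still appears, paired with None:
  // (104,((Schatten,DemiTarif),None))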

Optimizing...
Now that we understand Spark's programming model and a majority of Spark's key operations, we'll see how we can optimize what we do with Spark to keep it practical.
It's very easy to write code that takes tens of minutes to compute when it could be computed in only tens of seconds!

Grouping and Reducing, Example
Let's start with an example. Given:

  case class CFFPurchase(customerId: Int, destination: String, price: Double)

Assume we have an RDD of the purchases that users of the CFF mobile app have made in the past month.
Goal: calculate how many trips, and how much money was spent, by each individual customer over the course of the month.

  val purchasesRDD: RDD[CFFPurchase] = sc.textFile(...)

Grouping and Reducing, Example
Goal: calculate how many trips, and how much money was spent, by each individual customer over the course of the month.

  val purchasesRDD: RDD[CFFPurchase] = sc.textFile(...)

  val purchasesPerMonth =
    purchasesRDD.map(p => (p.customerId, p.price)) // Pair RDD

Grouping and Reducing, Example
Goal: calculate how many trips, and how much money was spent, by each individual customer over the course of the month.

  val purchasesRDD: RDD[CFFPurchase] = sc.textFile(...)

  val purchasesPerMonth =
    purchasesRDD.map(p => (p.customerId, p.price)) // Pair RDD
      .groupByKey() // groupByKey returns RDD[(K, Iterable[V])]

Grouping and Reducing, Example
Goal: calculate how many trips, and how much money was spent, by each individual customer over the course of the month.

  val purchasesRDD: RDD[CFFPurchase] = sc.textFile(...)

  val purchasesPerMonth =
    purchasesRDD.map(p => (p.customerId, p.price)) // Pair RDD
      .groupByKey() // groupByKey returns RDD[(K, Iterable[V])]
      .map(p => (p._1, (p._2.size, p._2.sum)))
      .collect()

Grouping and Reducing, Example: What's Happening?
Let's start with an example dataset:

  val purchases = List(
    CFFPurchase(100, "Geneva", 22.25),
    CFFPurchase(300, "Zurich", 42.10),
    CFFPurchase(100, "Fribourg", 12.40),
    CFFPurchase(200, "St. Gallen", 8.20),
    CFFPurchase(100, "Lucerne", 31.60),
    CFFPurchase(300, "Basel", 16.20))

What might the cluster look like with this data distributed over it?

Grouping and Reducing, Example: What's Happening?
What might the cluster look like with this data distributed over it? Starting with purchasesRDD:
[Cluster diagram: the purchases spread across the worker nodes.]

Grouping and Reducing, Example: What's Happening?
What might this look like on the cluster?
[Cluster diagram: the map step producing (customerId, price) pairs on each node.]

Grouping and Reducing, Example
Goal: calculate how many trips, and how much money was spent, by each individual customer over the course of the month.

  val purchasesRDD: RDD[CFFPurchase] = sc.textFile(...)

  val purchasesPerMonth =
    purchasesRDD.map(p => (p.customerId, p.price)) // Pair RDD
      .groupByKey() // groupByKey returns RDD[(K, Iterable[V])]

Note: groupByKey results in one key-value pair per key, and this single key-value pair cannot span multiple worker nodes.

Grouping and Reducing, Example: What's Happening?
What might this look like on the cluster?
[Cluster diagrams stepping through the shuffle performed by groupByKey.]

Can we do a better job?
Perhaps we don't need to send all pairs over the network. Perhaps we can reduce before we shuffle. This could greatly reduce the amount of data we have to send over the network.

Grouping and Reducing, Example Optimized
We can use reduceByKey. Conceptually, reduceByKey can be thought of as a combination of first doing groupByKey and then reduce-ing on all the values grouped per key. It is more efficient, though, than using each separately. We'll see how in the following example.
Signature:

  def reduceByKey(func: (V, V) => V): RDD[(K, V)]

Grouping and Reducing, Example Optimized
Goal: calculate how many trips, and how much money was spent, by each individual customer over the course of the month.

  val purchasesRDD: RDD[CFFPurchase] = sc.textFile(...)

  val purchasesPerMonth =
    purchasesRDD.map(p => (p.customerId, (1, p.price))) // Pair RDD
      .reduceByKey(...) // ?

Notice that the function passed to map has changed. It is now p => (p.customerId, (1, p.price)).
What function do we pass to reduceByKey in order to get a result that looks like (customerId, (numTrips, totalSpent))?

Grouping and Reducing, Example Optimized

  val purchasesPerMonth =
    purchasesRDD.map(p => (p.customerId, (1, p.price))) // Pair RDD
      .reduceByKey((v1, v2) => (v1._1 + v2._1, v1._2 + v2._2))
      .collect()

What might this look like on the cluster?
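For reference (my addition, derived from the sample purchases list shown earlier), the collected result would contain, in some order and up to floating-point rounding:

  // (100,(3,66.25))  customer 100: Geneva 22.25 + Fribourg 12.40 + Lucerne 31.60
  // (300,(2,58.3))   customer 300: Zurich 42.10 + Basel 16.20
  // (200,(1,8.2))    customer 200: St. Gallen 8.20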

Grouping and Reducing, Example Optimized
What might this look like on the cluster?
[Cluster diagrams stepping through the local reduce on each node followed by a much smaller shuffle.]

Grouping and Reducing, Example Optimized
What are the benefits of this approach? By reducing the dataset first, the amount of data sent over the network during the shuffle is greatly reduced. This can result in non-trivial gains in performance!

groupByKey and reduceByKey Running Times
Benchmark results on a real cluster: [chart comparing the running times of the groupByKey and reduceByKey versions, shown on the original slide].

References
The content of this section is partly taken from the slides of the course "Parallel Programming and Data Analysis" by Heather Miller at EPFL. Other references used here:
- Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia. O'Reilly, February 2015.
- Spark documentation, available at http://spark.apache.org/docs/latest/