
Spark and Spark SQL Amir H. Payberah amir@sics.se SICS Swedish ICT Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 1 / 71

What is Big Data? Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 2 / 71

Big Data... everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it. - Dan Ariely Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 3 / 71

Big Data Big data is data characterized by 4 key attributes: volume, variety, velocity, and value. - Oracle. (Buzzwords) Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 4 / 71

Big Data In Simple Words Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 5 / 71

The Four Dimensions of Big Data Volume: data size Velocity: data generation rate Variety: data heterogeneity This 4th V is for Vacillation: Veracity/Variability/Value Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 6 / 71

How To Store and Process Big Data? Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 7 / 71

Scale Up vs. Scale Out Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 8 / 71

Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 9 / 71

The Big Data Stack Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 10 / 71

Data Analysis Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 11 / 71

Programming Languages Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 12 / 71

Platform - Data Processing Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 13 / 71

Platform - Data Storage Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 14 / 71

Resource Management Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 15 / 71

Spark Processing Engine Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 16 / 71

Why Spark? Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 17 / 71

Motivation (1/4) Most current cluster programming models are based on acyclic data flow from stable storage to stable storage. Benefits of data flow: runtime can decide where to run tasks and can automatically recover from failures. E.g., MapReduce Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 18 / 71

Motivation (2/4) MapReduce programming model has not been designed for complex operations, e.g., data mining. Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 19 / 71

Motivation (3/4) MapReduce is very expensive (slow), i.e., it always goes to disk and HDFS. Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 20 / 71

Motivation (4/4) Spark extends MapReduce with more operators. Support for advanced data flow graphs. In-memory and out-of-core processing. Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 21 / 71

Spark vs. MapReduce (1/2) Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 22 / 71

Spark vs. MapReduce (2/2) Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 23 / 71

Challenge How to design a distributed memory abstraction that is both fault tolerant and efficient? Solution Resilient Distributed Datasets (RDD) Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 24 / 71

Resilient Distributed Datasets (RDD) (1/2) A distributed memory abstraction. Immutable collections of objects spread across a cluster. Like a LinkedList<MyObjects>. Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 25 / 71

Resilient Distributed Datasets (RDD) (2/2) An RDD is divided into a number of partitions, which are atomic pieces of information. Partitions of an RDD can be stored on different nodes of a cluster. Built through coarse-grained transformations, e.g., map, filter, join. Fault tolerance via automatic rebuild (no replication). Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 26 / 71
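As a small illustrative sketch (not from the slides), assuming a SparkContext sc, the number of partitions can be requested when an RDD is created and inspected afterwards:
val rdd = sc.parallelize(1 to 100, numSlices = 4) // request 4 partitions
println(rdd.partitions.length)                    // 4: each partition can live on a different node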

Programming Model Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 27 / 71

Spark Programming Model A data flow is composed of any number of data sources, operators, and data sinks by connecting their inputs and outputs. Operators are higher-order functions that execute user-defined functions in parallel. Two types of RDD operators: transformations and actions. Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 28 / 71

RDD Operators (1/2) Transformations: lazy operators that create new RDDs. Actions: launch a computation and return a value to the program or write data to external storage. Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 29 / 71
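A minimal sketch of the distinction, assuming a SparkContext sc; the transformations only record lineage, and nothing runs until the action at the end:
val nums = sc.parallelize(1 to 1000)
val doubled = nums.map(_ * 2)       // transformation: lazy, no job launched
val large = doubled.filter(_ > 100) // transformation: still lazy
val n = large.count()               // action: launches the computation and returns a value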

RDD Operators (2/2) Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 30 / 71

RDD Transformations - Map All pairs are independently processed.
// Passing each element through a function.
val nums = sc.parallelize(Array(1, 2, 3))
val squares = nums.map(x => x * x) // {1, 4, 9}
// Selecting those elements for which the function returns true.
val even = squares.filter(_ % 2 == 0) // {4}
Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 31 / 71

RDD Transformations - Reduce Pairs with identical key are grouped. Groups are independently processed.
val pets = sc.parallelize(Seq(("cat", 1), ("dog", 1), ("cat", 2)))
pets.groupByKey() // {(cat, (1, 2)), (dog, (1))}
pets.reduceByKey((x, y) => x + y) // or pets.reduceByKey(_ + _)
// {(cat, 3), (dog, 1)}
Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 32 / 71

RDD Transformations - Join Performs an equi-join on the key. Join candidates are independently processed.
val visits = sc.parallelize(Seq(("h", "1.2.3.4"), ("a", "3.4.5.6"), ("h", "1.3.3.1")))
val pageNames = sc.parallelize(Seq(("h", "Home"), ("a", "About")))
visits.join(pageNames)
// ("h", ("1.2.3.4", "Home"))
// ("h", ("1.3.3.1", "Home"))
// ("a", ("3.4.5.6", "About"))
Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 33 / 71

RDD Transformations - CoGroup Groups each input on key. Groups with identical keys are processed together.
val visits = sc.parallelize(Seq(("h", "1.2.3.4"), ("a", "3.4.5.6"), ("h", "1.3.3.1")))
val pageNames = sc.parallelize(Seq(("h", "Home"), ("a", "About")))
visits.cogroup(pageNames)
// ("h", (("1.2.3.4", "1.3.3.1"), ("Home")))
// ("a", (("3.4.5.6"), ("About")))
Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 34 / 71

RDD Transformations - Union and Sample Union: merges two RDDs and returns a single RDD using bag semantics, i.e., duplicates are not removed. Sample: similar to mapping, except that the RDD stores a random number generator seed for each partition to deterministically sample parent records. Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 35 / 71
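A brief sketch of both operators, assuming a SparkContext sc; the fraction and seed below are illustrative values:
val a = sc.parallelize(Array(1, 2, 2, 3))
val b = sc.parallelize(Array(3, 4))
a.union(b).collect() // Array(1, 2, 2, 3, 3, 4): bag semantics, duplicates kept
a.sample(withReplacement = false, fraction = 0.5, seed = 42).collect() // deterministic for a fixed seed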

Basic RDD Actions (1/2) Return all the elements of the RDD as an array.
val nums = sc.parallelize(Array(1, 2, 3))
nums.collect() // Array(1, 2, 3)
Return an array with the first n elements of the RDD.
nums.take(2) // Array(1, 2)
Return the number of elements in the RDD.
nums.count() // 3
Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 36 / 71

Basic RDD Actions (2/2) Aggregate the elements of the RDD using the given function.
nums.reduce((x, y) => x + y) // or nums.reduce(_ + _) => 6
Write the elements of the RDD as a text file.
nums.saveAsTextFile("hdfs://file.txt")
Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 37 / 71

SparkContext Main entry point to Spark functionality. Available in the shell as the variable sc. Only one SparkContext may be active per JVM.
// master: the master URL to connect to, e.g.,
// "local", "local[4]", "spark://master:7077"
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 38 / 71

Creating RDDs Turn a collection into an RDD.
val a = sc.parallelize(Array(1, 2, 3))
Load a text file from the local FS, HDFS, or S3.
val a = sc.textFile("file.txt")
val b = sc.textFile("directory/*.txt")
val c = sc.textFile("hdfs://namenode:9000/path/file")
Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 39 / 71

Example 1
val textFile = sc.textFile("hdfs://...")
val words = textFile.flatMap(line => line.split(" "))
val ones = words.map(word => (word, 1))
val counts = ones.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 40 / 71

Example 2
val textFile = sc.textFile("hdfs://...")
val sics = textFile.filter(_.contains("sics"))
val cachedSics = sics.cache()
val ones = cachedSics.map(_ => 1)
val count = ones.reduce(_ + _)
Or, equivalently:
val textFile = sc.textFile("hdfs://...")
val count = textFile.filter(_.contains("sics")).count()
Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 41 / 71

Execution Engine Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 42 / 71

Spark Programming Interface A Spark application consists of a driver program that runs the user's main function and executes various parallel operations on a cluster. Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 43 / 71

Lineage Lineage: the transformations used to build an RDD. RDDs are stored as a chain of objects capturing the lineage of each RDD.
val file = sc.textFile("hdfs://...")
val sics = file.filter(_.contains("sics"))
val cachedSics = sics.cache()
val ones = cachedSics.map(_ => 1)
val count = ones.reduce(_ + _)
Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 44 / 71
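As an illustrative aside (not on the slides), the lineage chain built above can be printed with toDebugString; the exact output below is only indicative:
println(ones.toDebugString)
// e.g. something like:
// (2) MapPartitionsRDD[3] at map ...
//  |  MapPartitionsRDD[2] at filter ...
//  |  hdfs://... HadoopRDD[0] at textFile ...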

RDD Dependencies (1/3) Two types of dependencies between RDDs: Narrow and Wide. Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 45 / 71

RDD Dependencies: Narrow (2/3) Narrow: each partition of a parent RDD is used by at most one partition of the child RDD. Narrow dependencies allow pipelined execution on one cluster node, e.g., a map followed by a filter. Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 46 / 71

RDD Dependencies: Wide (3/3) Wide: each partition of a parent RDD is used by multiple partitions of the child RDDs. Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 47 / 71
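A small sketch contrasting the two dependency types, assuming a SparkContext sc; map and filter are narrow and can be pipelined in one stage, while reduceByKey introduces a shuffle (wide):
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val scaled = pairs.map { case (k, v) => (k, v * 2) }   // narrow
val filtered = scaled.filter { case (_, v) => v > 2 }  // narrow: pipelined with the map above
val summed = filtered.reduceByKey(_ + _)               // wide: shuffles values of the same key together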

Job Scheduling (1/2) When a user runs an action on an RDD, the scheduler builds a DAG of stages from the RDD lineage graph. A stage contains as many pipelined transformations with narrow dependencies as possible. The boundary of a stage: shuffles for wide dependencies, and already computed partitions. Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 48 / 71

Job Scheduling (2/2) The scheduler launches tasks to compute missing partitions from each stage until it computes the target RDD. Tasks are assigned to machines based on data locality: if a task needs a partition that is available in the memory of a node, the task is sent to that node. Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 49 / 71

RDD Fault Tolerance Logging lineage rather than the actual data. No replication. Recompute only the lost partitions of an RDD. Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 50 / 71

Spark SQL Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 51 / 71

Spark and Spark SQL Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 52 / 71

DataFrame A DataFrame is a distributed collection of rows with a homogeneous schema. Equivalent to a table in a relational database. Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 53 / 71

Adding Schema to RDDs Spark + RDD: functional transformations on partitioned collections of opaque objects. SQL + DataFrame: declarative transformations on partitioned collections of tuples. Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 54 / 71
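As a rough side-by-side sketch (illustrative names, not from the slides): the same selection expressed on an RDD of case-class objects and on a DataFrame df assumed to have an age column:
case class Person(name: String, age: Int)
// RDD: the engine only sees opaque objects and an opaque function.
val adultsRdd = peopleRdd.filter(p => p.age >= 18)
// DataFrame: the engine sees a named column and a declarative predicate it can optimize.
val adultsDf = df.filter(df("age") >= 18)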

Creating DataFrames The entry point into all functionality in Spark SQL is the SQLContext. With a SQLContext, applications can create DataFrames from an existing RDD, from a Hive table, or from data sources.
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json(...)
Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 55 / 71

DataFrame Operations (1/2) Domain-specific language for structured data manipulation.
// Show the content of the DataFrame
df.show()
// age  name
// null Michael
// 30   Andy
// 19   Justin
// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// Select only the "name" column
df.select("name").show()
// name
// Michael
// Andy
// Justin
Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 56 / 71

DataFrame Operations (2/2) Domain-specific language for structured data manipulation.
// Select everybody, but increment the age by 1
df.select(df("name"), df("age") + 1).show()
// name    (age + 1)
// Michael null
// Andy    31
// Justin  20
// Select people older than 21
df.filter(df("age") > 21).show()
// age name
// 30  Andy
// Count people by age
df.groupBy("age").count().show()
// age  count
// null 1
// 19   1
// 30   1
Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 57 / 71

Running SQL Queries Programmatically The sql function on a SQLContext runs a SQL query programmatically and returns the result as a DataFrame.
val sqlContext = ... // An existing SQLContext
val df = sqlContext.sql("SELECT * FROM table")
Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 58 / 71

Converting RDDs into DataFrames Inferring the schema using reflection.
// Define the schema using a case class.
case class Person(name: String, age: Int)
// Create an RDD of Person objects and register it as a table.
val people = sc.textFile(...).map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")
// SQL statements can be run by using the sql method provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")
// The results of SQL queries are DataFrames.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
teenagers.map(t => "Name: " + t.getAs[String]("name")).collect().foreach(println)
Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 59 / 71

Data Sources Spark SQL supports a variety of data sources. A DataFrame can be operated on as a normal RDD or registered as a temporary table. Registering a DataFrame as a table allows you to run SQL queries over its data. Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 60 / 71
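A minimal sketch of registering and querying, assuming the Spark 1.x SQLContext shown earlier and a DataFrame df with name and age columns (illustrative names):
df.registerTempTable("people") // make the DataFrame visible to SQL under a table name
val adults = sqlContext.sql("SELECT name FROM people WHERE age > 21") // the result is again a DataFrame
adults.show()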

Advanced Programming Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 61 / 71

Shared Variables When Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes a variable needs to be shared across tasks, or between tasks and the driver program. General read-write shared variables across tasks would be inefficient. Two types of shared variables: accumulators and broadcast variables. Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 62 / 71

Accumulators (1/2) Aggregating values from worker nodes back to the driver program. Example: counting events that occur during job execution. Worker code can add to the accumulator with its += method. The driver program can access the value by calling the value property on the accumulator.
scala> val accum = sc.accumulator(0)
accum: spark.Accumulator[Int] = 0
scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
...
scala> accum.value
res2: Int = 10
Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 63 / 71

Accumulators (2/2) How many lines of the input file were blank?
val sc = new SparkContext(...)
val file = sc.textFile("file.txt")
// Create an Accumulator[Int] initialized to 0.
val blankLines = sc.accumulator(0)
val callSigns = file.flatMap(line => {
  if (line == "") {
    blankLines += 1 // Add to the accumulator
  }
  line.split(" ")
})
Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 64 / 71

Broadcast Variables (1/4) The broadcast values are sent to each node only once and should be treated as read-only variables. Tasks that use a broadcast variable can access its value with the value property.
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: spark.Broadcast[Array[Int]] = spark.Broadcast(b5c40191-...)
scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 65 / 71

Broadcast Variables (2/4)
// Load RDD of (URL, name) pairs
val pageNames = sc.textFile("pages.txt").map(...)
// Load RDD of (URL, visit) pairs
val visits = sc.textFile("visits.txt").map(...)
val joined = visits.join(pageNames)
Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 66 / 71

Broadcast Variables (3/4)
// Load RDD of (URL, name) pairs
val pageNames = sc.textFile("pages.txt").map(...)
val pageMap = pageNames.collect().toMap
// Load RDD of (URL, visit) pairs
val visits = sc.textFile("visits.txt").map(...)
val joined = visits.map(v => (v._1, (pageMap(v._1), v._2)))
Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 67 / 71

Broadcast Variables (4/4)
// Load RDD of (URL, name) pairs
val pageNames = sc.textFile("pages.txt").map(...)
val pageMap = pageNames.collect().toMap
val bc = sc.broadcast(pageMap)
// Load RDD of (URL, visit) pairs
val visits = sc.textFile("visits.txt").map(...)
val joined = visits.map(v => (v._1, (bc.value(v._1), v._2)))
Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 68 / 71

Summary Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 69 / 71

Summary Dataflow programming Spark: RDD Two types of operations: Transformations and Actions. Spark execution engine Spark SQL: DataFrame Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 70 / 71

Questions? Amir H. Payberah (SICS) Spark and Spark SQL June 29, 2016 71 / 71