SparkSQL 11/14/2018


Willa Melton


2 Where are we?
(Stack diagram, top layer first)
    Pig Latin | HiveQL
    Pig | Hive | ???
    Hadoop MapReduce | Spark RDD
    HDFS

3 Where are we?
(Stack diagram, top layer first)
    Pig Latin | HiveQL | SQL
    Pig | Hive | ???
    Hadoop MapReduce | Spark RDD
    HDFS

4 Shark (Spark on Hive)
A small side project that aimed to run RDD jobs on Hive data using HiveQL
Still limited to the data model of Hive
Tied to the Hadoop world

5 SparkSQL
Redesigned to consider the Spark query model
Supports all the popular relational operators
Can be intermixed with RDD operations
Uses the Dataframe API as an enhancement to the RDD API
Dataframe = RDD + schema

6 Dataframes
SparkSQL's counterpart to relations or tables in an RDBMS
Consists of rows and columns
A dataframe is NOT in 1NF. Why?
Can be created from various data sources
    CSV file
    JSON file
    MySQL database
    Hive

7 Dataframe vs. RDD
Dataframe
    Lazy execution
    Spark is aware of the data model
    Spark is aware of the query logic
    Can optimize the query
RDD
    Lazy execution
    The data model is hidden from Spark
    The transformations and actions are black boxes
    Cannot optimize the query

8 Built-in operations in SparkSQL
Filter (Selection)
Select (Projection)
Join
GroupBy (Aggregation)
Load/Store in various formats
Cache
Conversion to and from RDD (back and forth)

9 SparkSQL Examples

10 Project Setup
    <!-- In the dependencies section of pom.xml -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>2.2.1</version>
    </dependency>

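11 Code Setup
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    // Start (or reuse) a local Spark session
    SparkSession sparks = SparkSession.builder()
        .appName("Spark SQL examples")
        .master("local")
        .getOrCreate();

    // Load a tab-separated file with a header row and let Spark infer column types
    Dataset<Row> log_file = sparks.read()
        .option("delimiter", "\t")
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("nasa_log.tsv");
    log_file.show();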

12 Filter Example
    // Select OK lines
    Dataset<Row> ok_lines = log_file.filter("response=200");
    long ok_count = ok_lines.count();
    System.out.println("Number of OK lines is " + ok_count);

    // Grouped aggregation using SQL
    // The query refers to a view named log_lines, so the dataframe must be registered under that name first
    log_file.createOrReplaceTempView("log_lines");
    Dataset<Row> bytespercode = log_file.sqlContext()
        .sql("select response, sum(bytes) from log_lines GROUP BY response");

13 SparkSQL Features
Catalyst query optimizer
Code generation
Integration with libraries

14 SparkSQL Query Plan
(Pipeline diagram, stages in order; the step that produces each stage is in parentheses)
    SQL AST / DataFrame
    -> Unresolved Logical Plan
    -> (Analysis, using the Catalog) Logical Plan
    -> (Logical Optimization) Optimized Logical Plan
    -> (Physical Planning) Physical Plans
    -> (Cost Model) Selected Physical Plan
    -> (Code Generation) RDDs
DataFrames and SQL share the same optimization/execution pipeline
Credits: M. Armbrust

15 Catalyst Query Optimizer
Extensible rule-based optimizer
Users can define their own rules
Example: filter push-down (each plan is listed from the root operator down to the table)
    Original Plan:    Project name <- Filter id = 1 <- Project id,name <- People
    Filter Push-Down: Project name <- Project id,name <- Filter id = 1 <- People
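The filter push-down rewrite on this slide can be sketched in plain Java without any Spark dependency. This is only a toy model of a rule-based optimizer: the classes Plan, Scan, Project, Filter, FilterPushDown and PushDownDemo are hypothetical names invented for illustration and are not Catalyst's actual classes. It also assumes each filter reads a single column.

```java
import java.util.Arrays;
import java.util.List;

// Toy logical-plan nodes. Hypothetical classes for illustration; not Spark's Catalyst API.
abstract class Plan {}

final class Scan extends Plan {
    final String table;
    Scan(String table) { this.table = table; }
    @Override public String toString() { return "Scan(" + table + ")"; }
}

final class Project extends Plan {
    final List<String> columns;
    final Plan child;
    Project(List<String> columns, Plan child) { this.columns = columns; this.child = child; }
    @Override public String toString() { return "Project" + columns + " <- " + child; }
}

final class Filter extends Plan {
    final String column;     // the single column the predicate reads (simplifying assumption)
    final String predicate;  // kept as text, e.g. "id = 1"
    final Plan child;
    Filter(String column, String predicate, Plan child) {
        this.column = column; this.predicate = predicate; this.child = child;
    }
    @Override public String toString() { return "Filter(" + predicate + ") <- " + child; }
}

final class FilterPushDown {
    // One rewrite rule, applied recursively:
    // Filter over Project becomes Project over Filter, as long as the
    // projection keeps the column that the predicate reads.
    static Plan apply(Plan plan) {
        if (plan instanceof Filter) {
            Filter f = (Filter) plan;
            Plan child = apply(f.child);
            if (child instanceof Project) {
                Project p = (Project) child;
                if (p.columns.contains(f.column)) {
                    Plan pushed = apply(new Filter(f.column, f.predicate, p.child));
                    return new Project(p.columns, pushed);
                }
            }
            return new Filter(f.column, f.predicate, child);
        }
        if (plan instanceof Project) {
            Project p = (Project) plan;
            return new Project(p.columns, apply(p.child));
        }
        return plan; // Scan: nothing to rewrite
    }
}

final class PushDownDemo {
    public static void main(String[] args) {
        // The "Original Plan" from the slide, root first.
        Plan original = new Project(Arrays.asList("name"),
            new Filter("id", "id = 1",
                new Project(Arrays.asList("id", "name"), new Scan("People"))));
        System.out.println(original);
        // prints: Project[name] <- Filter(id = 1) <- Project[id, name] <- Scan(People)
        System.out.println(FilterPushDown.apply(original));
        // prints: Project[name] <- Project[id, name] <- Filter(id = 1) <- Scan(People)
    }
}
```

The rule can fire only because the plan is data that the optimizer can inspect. Had the filter been an opaque function, there would be no way to check which column it reads before moving it.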

16 Code Generation
Shift from black-box UDF to Expressions
Example
    // Filter
    Dataset<Row> ok_lines = log_file.filter("response=200");

    // Grouped aggregation (assumes log_file has been registered as the view log_lines)
    Dataset<Row> bytespercode = log_file.sqlContext()
        .sql("select response, sum(bytes) from log_lines GROUP BY response");
SparkSQL understands the logic of user queries and rewrites them in a more concise way
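To make the contrast between a black-box function and an expression concrete, here is a minimal plain-Java sketch with no Spark dependency. The classes Expr, Column, IntLiteral, EqualTo and ExpressionDemo are hypothetical names invented for this illustration, and the generated text (including the `row.getInt` call in it) is a made-up target rather than the code Spark actually emits.

```java
import java.util.Collections;
import java.util.Map;
import java.util.function.Predicate;

// Toy expression tree. Hypothetical classes for illustration; not Spark's expression classes.
abstract class Expr {
    // Produce source text for this expression (a stand-in for real code generation).
    abstract String toJava();
}

final class Column extends Expr {
    final String name;
    Column(String name) { this.name = name; }
    @Override String toJava() { return "row.getInt(\"" + name + "\")"; }
}

final class IntLiteral extends Expr {
    final int value;
    IntLiteral(int value) { this.value = value; }
    @Override String toJava() { return Integer.toString(value); }
}

final class EqualTo extends Expr {
    final Expr left;
    final Expr right;
    EqualTo(Expr left, Expr right) { this.left = left; this.right = right; }
    @Override String toJava() { return "(" + left.toJava() + " == " + right.toJava() + ")"; }
}

final class ExpressionDemo {
    // Walk the tree and emit one statement of specialized code.
    static String generate(Expr predicate) {
        return "boolean keep = " + predicate.toJava() + ";";
    }

    public static void main(String[] args) {
        // Expression form of response=200: the engine can see the column and the constant.
        Expr visible = new EqualTo(new Column("response"), new IntLiteral(200));
        System.out.println(generate(visible));
        // prints: boolean keep = (row.getInt("response") == 200);

        // Black-box form: the engine can only call test(); it cannot look inside.
        Predicate<Map<String, Integer>> opaque = row -> row.get("response") == 200;
        System.out.println(opaque.test(Collections.singletonMap("response", 200)));
        // prints: true
    }
}
```

Both forms express the same condition, but only the expression tree can be inspected, rewritten, or turned into specialized code.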

17 Integration
SparkSQL is integrated with other high-level interfaces such as MLlib, PySpark, and SparkR
SparkSQL is also integrated with the RDD interface and they can be mixed in one program

18 Further Reading
Documentation
SparkSQL paper: M. Armbrust et al. "Spark SQL: Relational Data Processing in Spark." SIGMOD 2015
