Spark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM


Spark 2 Alexey Zinovyev, Java/BigData Trainer in EPAM

About: with IT since 2007, with Java since 2009, with Hadoop since 2012, with EPAM since 2015

Secret Word from EPAM: itsubbotnik 3

Contacts E-mail : Alexey_Zinovyev@epam.com Twitter : @zaleslaw @BigDataRussia Facebook: https://www.facebook.com/zaleslaw vk.com/big_data_russia Big Data Russia vk.com/java_jvm Java & JVM langs 4

Sprk Dvlprs! Let's start! 5

< SPARK 2.0 6

Modern Java in 2016 Big Data in 2014 7

Big Data in 2017 8

Machine Learning EVERYWHERE 9

Machine Learning vs Traditional Programming 10

Something wrong with HADOOP 12

Hadoop is not SEXY 13

Whaaaat? 14

Map Reduce Job Writing 15

MR code 16

Hadoop Developers Right Now 17

Iterative Calculations 10x 100x 18

MapReduce vs Spark 19

MapReduce vs Spark 20

MapReduce vs Spark 21

SPARK 2.0 DISCUSSION 22

Spark Family 23

Spark Family 24

Case #0 : How to avoid DStreams with RDD-like API? 25

Continuous Applications 26

Continuous Applications cases: updating data that will be served in real time; extract, transform and load (ETL); creating a real-time version of an existing batch job; online machine learning 27

Write Batches 28

Catch Streaming 29

The main concept of Structured Streaming: you can express your streaming computation the same way you would express a batch computation on static data. 30

Batch
// Read JSON once from S3
logsDF = spark.read.json("s3://logs")
// Transform with DataFrame API and save
logsDF.select("user", "url", "date").write.parquet("s3://out") 31

Real Time
// Read JSON continuously from S3
logsDF = spark.readStream.json("s3://logs")
// Transform with DataFrame API and save
logsDF.select("user", "url", "date").writeStream.parquet("s3://out").start() 32

Unlimited Table 33

WordCount from Socket
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()

Don't forget to start the streaming query (see the sketch below)
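
Nothing runs until you attach a sink with writeStream and call start(). A minimal sketch, assuming the wordCounts Dataset above and a hypothetical checkpoint path:

val query = wordCounts.writeStream
  .outputMode("complete")                                     // re-emit the full counts table on every trigger
  .format("console")
  .option("checkpointLocation", "/tmp/wordcount-checkpoint")  // hypothetical path; enables recovery after failure
  .start()

query.awaitTermination()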

WordCount with Structured Streaming 38

Structured Streaming provides: fast & scalable, fault-tolerant, end-to-end exactly-once stream processing; the ability to use the DataFrame/Dataset API for streaming 39

Structured Streaming provides (in dreams): fast & scalable, fault-tolerant, end-to-end exactly-once stream processing; the ability to use the DataFrame/Dataset API for streaming 40

Let's UNION streaming and static DataSets 41

Let's UNION streaming and static DataSets: org.apache.spark.sql.AnalysisException: Union between streaming and batch DataFrames/Datasets is not supported; 42

Let's UNION streaming and static DataSets: go to UnsupportedOperationChecker.scala and check your operation 43
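
A minimal sketch of the attempt that raises this error in Spark 2.0, assuming a SparkSession named spark and hypothetical S3 paths:

val staticDF = spark.read.json("s3://logs/archive")          // batch DataFrame
val streamDF = spark.readStream.json("s3://logs/incoming")   // streaming DataFrame (a file stream may also need an explicit schema)
// Spark 2.0 refuses the mix:
// org.apache.spark.sql.AnalysisException: Union between streaming and batch DataFrames/Datasets is not supported
val unioned = streamDF.union(staticDF)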

Case #1 : We should think about optimization in RDD terms 44

Single Thread collection 45

No perf issues, right? 46

The main concept: more partitions = more parallelism 47

Do it in parallel 48

I'd like NARROW 49

Map, filter, filter 50

GroupByKey, join 51
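
A minimal sketch of the partitions-equal-parallelism idea and of narrow (map, filter) versus wide (groupByKey, join) transformations, assuming a SparkSession named spark; the numbers are illustrative:

val rdd = spark.sparkContext.parallelize(1 to 1000000, 4)    // 4 partitions
println(rdd.getNumPartitions)                                // 4
val wider = rdd.repartition(16)                              // more partitions = more tasks can run in parallel
val narrow = wider.map(_ * 2).filter(_ % 3 == 0)             // narrow: each partition processed independently, no shuffle
val wide   = narrow.map(x => (x % 10, x)).groupByKey()       // wide: groupByKey shuffles data across partitions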

Case #2 : DataFrames suggest mixing SQL and Scala functions 52

History of Spark APIs 53

The same filter, expressed in each API:
rdd.filter(_.age > 21)             // RDD
df.filter("age > 21")              // DataFrame, SQL-style
df.filter(df.col("age").gt(21))    // DataFrame, Expression style
dataset.filter(_.age > 21)         // Dataset API

Case #2 : DataFrame refers to data attributes by name 60

DataSet = RDD's types + DataFrame's Catalyst: RDD API, compile-time type safety, off-heap storage mechanism, performance benefits of the Catalyst query optimizer and Tungsten 61

Structured APIs in SPARK 63

Unified API in Spark 2.0: DataFrame = Dataset[Row]; a DataFrame is now just an untyped Dataset of generic Rows 64

// Define case class
case class User(email: String, footSize: Long, name: String)

// Read JSON: DataFrame -> Dataset with Users
val userDS = spark.read.json("/home/tmp/datasets/users.json").as[User]

// Filter by field
userDS.map(_.name).collect()
userDS.filter(_.footSize > 38).collect()

userDS.rdd // IF YOU REALLY WANT 65

Case #3 : Spark has many contexts 68

Spark Session: the new entry point in Spark for creating Datasets; replaces SQLContext, HiveContext and StreamingContext; the move from SparkContext to SparkSession signifies a move away from RDDs 69

Spark Session
val sparkSession = SparkSession.builder
  .master("local")
  .appName("spark session example")
  .getOrCreate()

val df = sparkSession.read
  .option("header", "true")
  .csv("src/main/resources/names.csv")

df.show() 70

No, I want to create my lovely RDDs 71

Where's the parallelize() method? 72
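
The old entry points are still reachable from the session, so RDDs can still be created; a minimal sketch, assuming a SparkSession named spark:

import spark.implicits._
val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3))            // classic RDD API via the underlying SparkContext
val ds  = spark.createDataset(Seq(1, 2, 3))                       // or build a Dataset directly
val df  = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")     // or a DataFrame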

case class User(email: String, footSize: Long, name: String)
// DataFrame -> Dataset with Users
val userDS = spark.read.json("/home/tmp/datasets/users.json").as[User]

userDS.map(_.name).collect()
userDS.filter(_.footSize > 38).collect()

// RDD?
userDS.rdd // IF YOU REALLY WANT 73

Case #4 : Spark uses Java serialization A LOT 74

Two choices to distribute data across the cluster: Java serialization (the default, via ObjectOutputStream); Kryo serialization (classes should be registered; does not rely on Serializable) 75
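
A minimal sketch of switching to Kryo; the registered class is the User case class from the earlier slides:

import org.apache.spark.SparkConf
val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[User]))   // registration avoids writing full class names into every record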

The main problem: the overhead of serializing. Each serialized object contains the class structure as well as the values 76

The main problem: the overhead of serializing. Each serialized object contains the class structure as well as the values. Don't forget about GC 77

Tungsten Compact Encoding 78

Encoder's concept: generate bytecode to interact with off-heap data & give access to attributes without ser/deser 79
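
A minimal sketch of obtaining an Encoder explicitly instead of relying on implicits; it assumes the User case class and a SparkSession named spark:

import org.apache.spark.sql.{Encoder, Encoders}
val userEncoder: Encoder[User] = Encoders.product[User]   // Catalyst-generated encoder, no Java serialization
val users = spark.createDataset(Seq(User("a@b.com", 42, "Alice")))(userEncoder)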

Encoders 80

No custom encoders 81

Case #5 : Not enough storage levels 82

Caching in Spark: frequently used RDDs can be stored in memory; one method, one shortcut: persist() and cache(); SparkContext keeps track of cached RDDs; serialized or deserialized Java objects 83

Full list of options: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2 84

Spark Core Storage Level: MEMORY_ONLY (default for Spark Core), MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2 85

Spark Streaming Storage Level: MEMORY_ONLY (default for Spark Core), MEMORY_AND_DISK, MEMORY_ONLY_SER (default for Spark Streaming), MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2 86
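
A minimal sketch of picking a level explicitly with persist() rather than the cache() shortcut; the input path is hypothetical and a SparkSession named spark is assumed:

import org.apache.spark.storage.StorageLevel
val logs = spark.sparkContext.textFile("/data/logs")
val cached = logs.persist(StorageLevel.MEMORY_AND_DISK_SER)   // serialized in memory, spills to disk if it doesn't fit
cached.count()                                                 // the first action materializes the cache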

Developer API to make new Storage Levels 87

What's the most popular file format in BigData? 88

Case #6 : External libraries to read CSV 89

Easy to read CSV
data = sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/datasets/samples/users.csv")

data.cache()
data.createOrReplaceTempView("users")
display(data) 90
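
In Spark 2.0 the csv data source ships with Spark itself, so the same read no longer needs the external spark-csv package; a minimal Scala sketch, assuming a SparkSession named spark:

val users = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/datasets/samples/users.csv")
users.createOrReplaceTempView("users")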

Case #7 : How to measure Spark performance? 91

You'd measure performance! 92

TPC-DS 99 Queries http://bit.ly/2dobmsh 93


How to benchmark Spark 95

Special tool from Databricks: Benchmark Tool for Spark SQL https://github.com/databricks/spark-sql-perf 96

Spark 2 vs Spark 1.6 97

Case #8 : What's faster: SQL or DataSet API? 98

Job Stages in old Spark 99

Scheduler Optimizations 100

Catalyst Optimizer for DataFrames 101

Unified Logical Plan DataFrame = Dataset[Row] 102

Bytecode 103

DataSet.explain()
== Physical Plan ==
Project [avg(price)#43, carat#45]
+- SortMergeJoin [color#21], [color#47]
   :- Sort [color#21 ASC], false, 0
   :  +- TungstenExchange hashpartitioning(color#21,200), None
   :     +- Project [avg(price)#43, color#21]
   :        +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Final,isDistinct=false)], output=[color#21,avg(price)#43])
   :           +- TungstenExchange hashpartitioning(cut#20,color#21,200), None
   :              +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Partial,isDistinct=false)], output=[cut#20,color#21,sum#58,count#59L])
   :                 +- Scan CsvRelation(-----)
   +- Sort [color#47 ASC], false, 0
      +- TungstenExchange hashpartitioning(color#47,200), None
         +- ConvertToUnsafe
            +- Scan CsvRelation(----) 104
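
A minimal sketch of the kind of query that yields a plan like this; the diamonds DataFrame and its cut, color, price and carat columns are assumptions made for illustration:

import org.apache.spark.sql.functions.avg
val avgByCutColor = diamonds.groupBy("cut", "color").agg(avg("price"))   // partial + final TungstenAggregate
avgByCutColor.join(diamonds, "color")                                    // SortMergeJoin on color
  .select("avg(price)", "carat")
  .explain()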

Case #9 : Why does explain() show so many Tungsten things? 105

How to be effective with the CPU: runtime code generation, exploiting cache locality, off-heap memory management 106

Tungsten's goal: push performance closer to the limits of modern hardware 107

Maybe something UNSAFE? 108

UnsafeRow format: a bit set for tracking null values; small values are inlined; variable-length values store a relative offset into the variable-length data section; rows are always 8-byte word aligned; equality comparison and hashing can be performed on raw bytes without additional interpretation 109

Case #10 : Can I influence Memory Management in Spark? 110
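
Yes, within limits: the unified memory manager exposes a few knobs. A minimal sketch; the values are illustrative, not recommendations:

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder
  .appName("memory-tuning")
  .config("spark.memory.fraction", "0.6")           // share of the heap used for execution + storage
  .config("spark.memory.storageFraction", "0.5")    // part of that region protected for cached data
  .getOrCreate()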

Case #11 : Should I tune generation's stuff? 111

Cached Data 112

During operations 113

For your needs 114

For Dark Lord 115

IN CONCLUSION 116

What we still lack: the ability to join structured streaming with other sources and handle it; one unified ML API; a GraphX rethinking and redesign; custom encoders; Datasets everywhere; integration with something important 117

Roadmap: support other data sources (not only S3 + HDFS); transactional updates; Dataset as the one DSL for all operations; GraphFrames + Structured MLlib; Tungsten: custom encoders; the RDD-based API is expected to be removed in Spark 3.0 118

And we can DO IT! [Architecture diagram: real-time data ingestion feeding real-time ETL & CEP, real-time data marts and real-time dashboarding; external batch data ingestion feeding batch ETL & a raw area on HDFS (CFS as an option; all raw data backup is stored here) and batch data marts; a search index and social data; Titan & KairosDB storing the relations graph, ontology/metadata and time-series data in Cassandra; an internal scheduler; push events & alarms (Email, SNMP etc.)] 119

First Part 120

Second Part 121

Contacts E-mail : Alexey_Zinovyev@epam.com Twitter : @zaleslaw @BigDataRussia Facebook: https://www.facebook.com/zaleslaw vk.com/big_data_russia Big Data Russia vk.com/java_jvm Java & JVM langs 122

Any questions? 123