Data processing in Apache Spark

Pelle Jakovits
5 October 2015, Tartu

Outline

- Introduction to Spark
- Resilient Distributed Datasets (RDD)
- Data operations: RDD transformations, examples
- Fault tolerance
- Frameworks powered by Spark

Spark

- Directed acyclic graph (DAG) task execution engine
- Supports cyclic data flow and in-memory computing
- Works with Scala, Java, Python and R
- Integrated with Hadoop YARN and HDFS
- Extended with tools for SQL-like queries, stream processing and graph processing
- Uses Resilient Distributed Datasets (RDDs) to abstract the data that is to be processed

Hadoop YARN

(Figure: diagram of the Hadoop YARN architecture.)

Performance vs Hadoop

(Figure: time per iteration in seconds. K-Means clustering: Hadoop 155 s vs Spark 4.1 s. Logistic regression: Hadoop 110 s vs Spark 0.96 s.)

Source: Introduction to Spark, Patrick Wendell, Databricks

Resilient Distributed Datasets

- Collections of objects spread across a cluster, stored in RAM or on disk
- Built through parallel transformations
- Automatically rebuilt on failure
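
As a minimal sketch of this model (assuming an already-created JavaSparkContext named sc, as in the later examples):

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaRDD;

    // Distribute a local collection across the cluster as an RDD
    JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

    // A parallel transformation: builds a new RDD, nothing is computed yet
    JavaRDD<Integer> squares = numbers.map(x -> x * x);

    // An action triggers the computation; lost partitions would be rebuilt automatically
    System.out.println(squares.collect()); // [1, 4, 9, 16, 25]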

Working in Java

Tuples:

    Tuple2 pair = new Tuple2(a, b);
    pair._1 // => a
    pair._2 // => b

Functions: in Java 8 you can use lambda functions:

    JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);

But in older Java you have to use the predefined function interfaces:

- Function, Function2, Function3
- FlatMapFunction
- PairFunction

Java Spark Function types

    class GetLength implements Function<String, Integer> {
        public Integer call(String s) {
            return s.length();
        }
    }

    class Sum implements Function2<Integer, Integer, Integer> {
        public Integer call(Integer a, Integer b) {
            return a + b;
        }
    }

Java Example - MapReduce

    // l is a list of inputs and slices the number of partitions (defined elsewhere)
    JavaRDD<Integer> dataset = jsc.parallelize(l, slices);
    int count = dataset.map(new Function<Integer, Integer>() {
        public Integer call(Integer integer) {
            // Sample a random point in the square [-1, 1] x [-1, 1]
            double x = Math.random() * 2 - 1;
            double y = Math.random() * 2 - 1;
            // 1 if the point falls inside the unit circle, 0 otherwise
            return (x * x + y * y < 1) ? 1 : 0;
        }
    }).reduce(new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer integer, Integer integer2) {
            return integer + integer2;
        }
    });

Python example

Word count in Spark's Python API:

    file = spark.textFile("hdfs://...")
    file.flatMap(lambda line: line.split()) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)

Persisting data

Spark is lazy. To force Spark to keep any intermediate data in memory, we can use:

    lineLengths.persist(storageLevel);

which would cause the lineLengths RDD to be saved in memory after the first time it is computed. It should be used when we want to process the same RDD multiple times.

Persistence levels

- DISK_ONLY
- MEMORY_ONLY
- MEMORY_AND_DISK
- MEMORY_ONLY_SER - stores serialized data: more space-efficient, but uses more CPU
- MEMORY_ONLY_2 - replicates the data on 2 executors
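
As a small sketch of using these levels (assuming the lineLengths RDD from the previous slide), the level is passed to persist() and takes effect the first time an action computes the RDD:

    import org.apache.spark.storage.StorageLevel;

    // Keep the RDD in memory and spill partitions to disk if RAM runs out
    lineLengths.persist(StorageLevel.MEMORY_AND_DISK());

    long total = lineLengths.count(); // first action computes and caches the RDD
    long again = lineLengths.count(); // later actions reuse the cached partitions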

RDD operations

Actions:

- Creating RDDs
- Storing RDDs
- Extracting data from an RDD on the fly

Transformations:

- Restructure or transform the data of an RDD
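
The distinction matters because transformations are lazy while actions are eager; a short illustrative sketch:

    // Transformations only record how to compute the result
    JavaRDD<String> lines = sc.textFile("hdfs://...");
    JavaRDD<Integer> lengths = lines.map(s -> s.length());

    // The action at the end forces the whole pipeline to run
    int totalLength = lengths.reduce((a, b) -> a + b);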

Spark Actions

Loading Data

Local data:

    List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
    JavaRDD<Integer> distData = sc.parallelize(data, slices);

External data:

    JavaRDD<String> input = sc.textFile("file.txt");
    sc.textFile("directory/*.txt");
    sc.textFile("hdfs://xxx:9000/path/file");

Broadcast

Broadcast a copy of the data to every node in the Spark cluster:

    Broadcast<int[]> broadcastVar = sc.broadcast(new int[] {1, 2, 3});
    int[] values = broadcastVar.value();
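
A short sketch of how the broadcast copy is typically read back inside a transformation on the workers (the lookup table and its contents are illustrative):

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.broadcast.Broadcast;

    Map<String, Integer> table = new HashMap<>();
    table.put("a", 1);
    table.put("b", 2);

    // One read-only copy is shipped to every node instead of once per task
    Broadcast<Map<String, Integer>> lookup = sc.broadcast(table);

    JavaRDD<Integer> resolved = sc.parallelize(Arrays.asList("a", "b", "a"))
            .map(key -> lookup.value().get(key)); // read the broadcast value on the workers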

Storing data

    counts.saveAsTextFile("hdfs://...");
    counts.saveAsObjectFile("hdfs://...");
    DataCube.saveAsHadoopFile("testfile.seq", LongWritable.class,
        LongWritable.class, SequenceFileOutputFormat.class);

Other actions

- reduce() - we already saw it in the example
- collect() - retrieve the RDD content
- count() - count the number of elements in the RDD
- first() - take the first element from the RDD
- take(n) - take the n first elements from the RDD
- countByKey() - count the values for each unique key
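
A sketch showing these actions on a small RDD (the values are illustrative):

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import scala.Tuple2;

    JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(5, 3, 1, 2, 4));

    long n = nums.count();                      // 5
    Integer head = nums.first();                // 5
    List<Integer> firstTwo = nums.take(2);      // [5, 3]
    List<Integer> everything = nums.collect();  // whole RDD content on the driver
    Integer sum = nums.reduce((a, b) -> a + b); // 15

    JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
            new Tuple2<>("a", 1), new Tuple2<>("a", 2), new Tuple2<>("b", 3)));
    Map<String, Long> perKey = pairs.countByKey(); // {a=2, b=1}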

Spark RDD Transformations

Map

    JavaPairRDD<String, Integer> ones = words.map(
        new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2(s, 1);
            }
        });

groupBy

    JavaPairRDD<Integer, List<Tuple2<Integer, Float>>> grouped =
        values.groupBy(new Partitioner(splits), splits);

    public class Partitioner extends Function<Tuple2<Integer, Float>, Integer> {
        // r (a Random) and partitions are fields of the class, not shown on the slide
        public Integer call(Tuple2<Integer, Float> t) {
            return r.nextInt(partitions);
        }
    }

reduceByKey

    JavaPairRDD<String, Integer> counts = ones.reduceByKey(
        new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer i1, Integer i2) {
                return i1 + i2;
            }
        });

Filter

    JavaRDD<String> logData = sc.textFile(logFile);

    long ERRORS = logData.filter(new Function<String, Boolean>() {
        public Boolean call(String s) {
            return s.contains("ERROR");
        }
    }).count();

    long INFOS = logData.filter(new Function<String, Boolean>() {
        public Boolean call(String s) {
            return s.contains("INFO");
        }
    }).count();

Other transformations

- sample(withReplacement, fraction, seed)
- distinct([numTasks])
- union(otherDataset)
- flatMap(func)
- groupByKey()
- join(otherDataset, [numTasks]) - when called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key, as sketched below
- cogroup(otherDataset, [numTasks]) - when called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples
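
A small sketch of the join semantics described above (the datasets are illustrative):

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    JavaPairRDD<String, Integer> ages = sc.parallelizePairs(Arrays.asList(
            new Tuple2<>("alice", 30), new Tuple2<>("bob", 25)));
    JavaPairRDD<String, String> cities = sc.parallelizePairs(Arrays.asList(
            new Tuple2<>("alice", "Tartu"), new Tuple2<>("bob", "Tallinn")));

    // (K, V) joined with (K, W) yields (K, (V, W))
    JavaPairRDD<String, Tuple2<Integer, String>> joined = ages.join(cities);
    // e.g. ("alice", (30, "Tartu")) and ("bob", (25, "Tallinn"))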

Fault Tolerance

Lineage

Lineage is the history of an RDD. RDDs keep track of, for each of their partitions:

- what functions were applied to produce it
- which input data partitions were involved

Lost RDD partitions are rebuilt according to the lineage, using the latest still-available partitions. There is no performance cost if nothing fails (as opposed to checkpointing).
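
The recorded lineage can be inspected with toDebugString(); a minimal sketch, assuming the usual sc context:

    import scala.Tuple2;

    JavaRDD<String> lines = sc.textFile("hdfs://...");
    JavaPairRDD<String, Integer> counts = lines
            .filter(s -> s.contains("ERROR"))
            .mapToPair(s -> new Tuple2<>(s, 1))
            .reduceByKey((a, b) -> a + b);

    // Prints the chain of parent RDDs Spark would replay to rebuild lost partitions
    System.out.println(counts.toDebugString());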

Frameworks powered by Spark

Frameworks powered by Spark

- Spark SQL - seamlessly mix SQL queries with Spark programs. Similar to Pig and Hive.
- MLlib - machine learning library
- GraphX - Spark's API for graphs and graph-parallel computation
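
For instance, a minimal Spark SQL sketch in the Java API of that era (the input file people.json and its schema are hypothetical):

    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    SQLContext sqlContext = new SQLContext(sc.sc());

    // people.json is a hypothetical input file with name and age fields
    DataFrame people = sqlContext.read().json("people.json");
    people.registerTempTable("people");

    // Mix a SQL query directly into the Spark program
    DataFrame adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18");
    adults.show();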

Advantages of Spark

- Much faster than solutions built on top of Hadoop when the data fits into memory (except Impala, maybe), although it is harder to keep track of how (well) the data is distributed
- More flexible fault tolerance
- Spark has a lot of extensions and is constantly updated

Disadvantages of Spark

- What if the data does not fit into memory?
- Saving as text files can be very slow
- Java Spark is not as convenient to use as Pig for prototyping, but you can:
  - use Python Spark instead
  - use Spark DataFrames
  - use Spark SQL

Conclusion

- RDDs offer a simple and efficient programming model for a broad range of applications
- Spark achieves fault tolerance by providing coarse-grained operations and tracking lineage
- It provides a definite speedup when the data fits into the collective memory of the cluster

That's All

This week's practice session: processing data with Spark.

Next week's lecture is about higher-level Spark:

- Scripting and prototyping in Spark
- Spark SQL
- DataFrames
- Spark Streaming