Data processing in Apache Spark
Pelle Jakovits
8 October 2014, Tartu

Outline
- Introduction to Spark
- Resilient Distributed Datasets (RDD)
- Available data operations
- Examples
- Advantages and disadvantages
- Frameworks powered by Spark

Spark
- Directed acyclic graph (DAG) execution engine that supports cyclic data flow and in-memory computing
- Works with Scala, Java and Python
- Integrated with Hadoop and HDFS
- Extended with tools for SQL-like queries, stream processing and graph processing
- Uses Resilient Distributed Datasets (RDDs) to abstract the data that is to be processed

In-memory distributed computing

Performance vs Hadoop
[Chart: iteration time in seconds vs number of machines (30, 60); Spark completes iterations in roughly 14-23 s where Hadoop takes 80-171 s. Source: Introduction to Spark, Patrick Wendell, Databricks.]

Performance vs Hadoop
[Chart: time per iteration in seconds. K-Means Clustering: Spark 4.1 vs Hadoop 155. Logistic Regression: Spark 0.96 vs Hadoop 110. Source: Introduction to Spark, Patrick Wendell, Databricks.]

Resilient Distributed Datasets (RDD)

Resilient Distributed Dataset
- Collections of objects spread across a cluster, stored in RAM or on disk
- Built through parallel transformations
- Automatically rebuilt on failure

Working in Java
Tuples:
    Tuple2 pair = new Tuple2(a, b);
    pair._1 // => a
    pair._2 // => b
Functions:
    public class Partitioner extends Function<Tuple2<Integer, Float>, Integer> {
        public Integer call(Tuple2<Integer, Float> t) {
            return r.nextInt(partitions); // r (a Random) and partitions are defined elsewhere
        }
    }

Java Example - MapReduce
    JavaRDD<Integer> dataset = jsc.parallelize(l, slices); // l is a List<Integer> defined elsewhere, one element per sample
    int count = dataset.map(new Function<Integer, Integer>() {
        public Integer call(Integer i) {
            double x = Math.random() * 2 - 1;
            double y = Math.random() * 2 - 1;
            return (x * x + y * y < 1) ? 1 : 0; // 1 if the random point falls inside the unit circle
        }
    }).reduce(new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer a, Integer b) {
            return a + b;
        }
    });
    // count / l.size() approximates pi / 4 (Monte Carlo estimation of pi)

Python example - word count in Spark's Python API
    file = spark.textFile("hdfs://...")  # spark is the SparkContext
    counts = file.flatMap(lambda line: line.split()) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda a, b: a + b)
    # nothing is computed until an action such as counts.saveAsTextFile(...) is called

Persisting data
Spark is lazy. To force Spark to keep intermediate data in memory, we can use:
    lineLengths.persist(storageLevel);
which causes the lineLengths RDD to be saved in memory after the first time it is computed. Persisting should be used when we want to process the same RDD multiple times.

Persistence levels
- DISK_ONLY
- MEMORY_ONLY
- MEMORY_AND_DISK
- MEMORY_ONLY_SER - more memory-efficient, but uses more CPU
- MEMORY_ONLY_2 - replicates the data on 2 executors
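To make the choice concrete, here is a minimal hedged sketch of persisting with an explicit level (assuming a JavaSparkContext sc; the variable names are illustrative, not from the slides):

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.storage.StorageLevel;

    JavaRDD<String> logData = sc.textFile("hdfs://...");
    // keep deserialized objects in RAM, spilling partitions that do not fit to disk
    logData.persist(StorageLevel.MEMORY_AND_DISK());
    long total = logData.count(); // first action computes the RDD and caches it
    long errors = logData.filter(new Function<String, Boolean>() {
        public Boolean call(String s) { return s.contains("ERROR"); }
    }).count(); // second pass reuses the cached partitions instead of re-reading HDFS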

RDD operations
Actions:
- Creating RDDs
- Storing RDDs
- Extracting data from an RDD on the fly
Transformations:
- Restructure or transform the data of an RDD
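As a small hedged illustration of this split (Java API, with the imports from the earlier examples; names are illustrative): the map below is a transformation and only records lineage; nothing runs until the count() action.

    JavaRDD<String> lines = sc.textFile("hdfs://file.txt"); // lazy
    JavaRDD<Integer> lengths = lines.map(new Function<String, Integer>() {
        public Integer call(String s) { return s.length(); }
    }); // transformation: still lazy, only lineage is recorded
    long n = lengths.count(); // action: triggers the actual computation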

Spark Actions

Loading Data
Local data (Scala):
    val data = Array(1, 2, 3, 4, 5)
    val distData = sc.parallelize(data, slices)
External data (Java):
    JavaRDD<String> input = sc.textFile("file.txt");
    sc.textFile("directory/*.txt");
    sc.textFile("hdfs://xxx:9000/path/file");

Broadcast
    Broadcast<int[]> broadcastVar = sc.broadcast(new int[] {1, 2, 3});
    int[] values = broadcastVar.value();
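A hedged usage sketch: the broadcast value is typically read inside a transformation, so each worker receives it once instead of once per task closure (ids is an illustrative JavaRDD<Integer>):

    final Broadcast<int[]> lookup = sc.broadcast(new int[] {10, 20, 30});
    JavaRDD<Integer> mapped = ids.map(new Function<Integer, Integer>() {
        public Integer call(Integer i) { return lookup.value()[i]; } // read only, never modify
    });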

Storing data
    counts.saveAsTextFile("hdfs://...");
    counts.saveAsObjectFile("hdfs://...");
    DataCube.saveAsHadoopFile("testfile.seq", LongWritable.class,
        LongWritable.class, SequenceFileOutputFormat.class);

Other actions
- reduce() - we already saw it in the example
- collect() - retrieve the content of an RDD
- count() - count the number of elements in an RDD
- first() - take the first element of an RDD
- take(n) - take the first n elements of an RDD
- countByKey() - count the values for each unique key
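A short hedged sketch tying a few of these together (assuming counts is the JavaPairRDD<String, Integer> word-count result from earlier):

    import java.util.List;
    import scala.Tuple2;

    long n = counts.count(); // total number of (word, count) pairs
    Tuple2<String, Integer> head = counts.first(); // the first pair
    List<Tuple2<String, Integer>> sample = counts.take(5); // first five pairs, returned to the driver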

Spark RDD Transformations

Map
    JavaPairRDD<String, Integer> ones = words.map(
        new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        }
    ); // note: in the Spark 1.x Java API, pair-producing maps use mapToPair(...)

groupBy
    JavaPairRDD<Integer, List<Tuple2<Integer, Float>>> grouped =
        values.groupBy(new Partitioner(splits), splits);

    public class Partitioner extends Function<Tuple2<Integer, Float>, Integer> {
        public Integer call(Tuple2<Integer, Float> t) {
            return r.nextInt(partitions); // assign each tuple to a random group
        }
    }

reduceByKey
    JavaPairRDD<String, Integer> counts = ones.reduceByKey(
        new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer i1, Integer i2) {
                return i1 + i2;
            }
        }
    );

Filter
    JavaRDD<String> logData = sc.textFile(logFile);
    long errors = logData.filter(new Function<String, Boolean>() {
        public Boolean call(String s) { return s.contains("ERROR"); }
    }).count();
    long infos = logData.filter(new Function<String, Boolean>() {
        public Boolean call(String s) { return s.contains("INFO"); }
    }).count();

Other transformations
- sample(withReplacement, fraction, seed)
- distinct([numTasks])
- union(otherDataset)
- flatMap(func)
- join(otherDataset, [numTasks]) - when called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key (see the sketch below)
- cogroup(otherDataset, [numTasks]) - when called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples
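A minimal hedged sketch of join semantics on two small pair RDDs (the data and variable names are illustrative):

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    JavaPairRDD<String, Integer> left = sc.parallelizePairs(Arrays.asList(
        new Tuple2<String, Integer>("a", 1), new Tuple2<String, Integer>("b", 2)));
    JavaPairRDD<String, String> right = sc.parallelizePairs(Arrays.asList(
        new Tuple2<String, String>("a", "x"), new Tuple2<String, String>("a", "y")));
    // join keeps only keys present on both sides:
    // ("a", (1, "x")) and ("a", (1, "y")); "b" has no match and is dropped
    JavaPairRDD<String, Tuple2<Integer, String>> joined = left.join(right);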

Fault Tolerance

Lineage
- Lineage is the history of an RDD
- RDDs keep track, for each of their partitions, of:
  - what functions were applied to produce it
  - which input data partitions were involved
- Lost RDD partitions are rebuilt according to the lineage, using the latest still-available partitions
- No performance cost if nothing fails (as opposed to checkpointing)
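The recorded lineage can be inspected; a hedged sketch using toDebugString(), which prints the chain of parent RDDs Spark would replay to rebuild lost partitions (the exact output format varies between Spark versions):

    JavaRDD<String> errors = sc.textFile("hdfs://...")
        .filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("ERROR"); }
        });
    System.out.println(errors.toDebugString()); // e.g. FilteredRDD <- MappedRDD <- HadoopRDD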

Frameworks powered by Spark

Frameworks powered by Spark
- Spark SQL - seamlessly mix SQL queries with Spark programs; similar to Pig and Hive
- MLlib - machine learning library
- GraphX - Spark's API for graphs and graph-parallel computation

Advantages of Spark
- Not as strict as Hive when it comes to schemas
- Much faster than solutions built on top of Hadoop when the data fits into memory (except perhaps Impala)
- However, it can be hard to keep track of how (well) the data is distributed

Disadvantages of Spark
- Performance suffers if the data does not fit into memory
- Saving as text files is extremely slow
- Not as convenient as Pig for prototyping

Conclusion
- RDDs offer a simple and efficient programming model for a broad range of applications
- Spark achieves fault tolerance by providing coarse-grained operations and tracking lineage
- It provides a definite speedup when the data fits into the collective memory of the cluster

That's all
- This week's practice session: processing data with Spark
- Next week's lecture: NoSQL (Jürmo Mehine) - large-scale non-relational databases