Distributed Systems 22: Spark
Paul Krzyzanowski, Rutgers University, Fall 2016


Apache Spark
- Goal: generalize MapReduce
  - Uses a similar shard-and-gather approach to MapReduce
  - Adds fast data sharing and general DAGs (directed acyclic graphs of operations)
- Generic data storage interfaces
  - Storage agnostic: use HDFS, the Cassandra database, whatever
- Resilient Distributed Datasets (RDDs)
  - An RDD is a chunk of data that gets processed: a large collection of stuff
  - In-memory caching
- More general functional programming model
  - Transformations and actions
  - In MapReduce, transformation = map, action = reduce

High-level view
- A job = a bunch of transformations & actions on RDDs
- The client (driver program) runs the client app and submits the job through a Spark context
- The cluster manager allocates worker nodes
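
A minimal sketch of the driver side in Scala may make this concrete; the app name, master URL, and data below are placeholders, not part of the lecture:

    import org.apache.spark.{SparkConf, SparkContext}

    object DriverExample {
      def main(args: Array[String]): Unit = {
        // The SparkContext is the driver's handle to the cluster
        val conf = new SparkConf()
          .setAppName("driver-example")   // hypothetical app name
          .setMaster("local[*]")          // in a real cluster: spark://..., yarn, or mesos://...
        val sc = new SparkContext(conf)

        // A job: a chain of transformations ended by an action
        val evens = sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()
        println(s"even numbers: $evens")

        sc.stop()
      }
    }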

High-level view
- The driver breaks the job into tasks
- It sends the tasks to the worker nodes where the data lives

Worker node
- Runs one or more executors
  - An executor is a JVM process that talks with the cluster manager
  - It receives tasks: JVM code (e.g., compiled Java, Clojure, Scala, JRuby, ...)
  - A task = a transformation or an action
- Cache: local data
  - Holds the data to be processed (RDDs), local to the node
  - Stores frequently-used data in memory: the key to high performance

Data & RDDs
- Data is organized into RDDs
  - Big data: partition it across lots of computers
- How are RDDs created?
  1. From any file stored in HDFS or other storage supported by Hadoop (Amazon S3, HDFS, HBase, Cassandra, etc.)
     - Created externally (e.g., event stream, text files, database)
     - Example: query a database & make the query results an RDD
     - Any Hadoop InputFormat works, such as a list of files or a directory
  2. From streaming sources (via Spark Streaming)
     - A fault-tolerant stream with a sliding window
  3. As the output of a Spark transformation
     - Example: filter out data, select key-value pairs
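
A sketch of creation routes 1 and 3, assuming an existing SparkContext named sc; the path and data are made up, and route 2 appears in the Spark Streaming slides below:

    // 1. From storage (hypothetical HDFS path)
    val fromFile = sc.textFile("hdfs://namenode:9000/data/events.log")

    // Small in-memory data can also be turned into an RDD
    val fromMemory = sc.parallelize(Seq(("a", 1), ("b", 2)))

    // 3. As the output of a transformation on an existing RDD
    val derived = fromFile.filter(_.nonEmpty)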

Properties of RDDs
- Main properties
  - Immutable: you cannot change an RDD; you can only create new RDDs
    - The framework will eventually collect unused RDDs
  - Partitioned: parts of an RDD go to different servers
    - Default partitioning function = hash(key) mod server_count
- Optional properties
  - Typed: they're not BLOBs; elements have an embedded data structure, e.g., a key-value set
  - Ordered: elements in an RDD can be sorted
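
A sketch of explicit partitioning and ordering on a key-value RDD, assuming a SparkContext sc; the data and partition count are illustrative:

    import org.apache.spark.HashPartitioner

    // Spread pairs across 4 partitions by hash(key) mod 4
    val pairs = sc.parallelize(Seq(("apple", 3), ("pear", 1), ("plum", 7)))
    val partitioned = pairs.partitionBy(new HashPartitioner(4))

    // Immutability: sorting does not change `partitioned`; it creates a new RDD
    val sorted = partitioned.sortByKey()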

Operations on RDDs
- There are two types of operations on RDDs
- Transformations
  - Lazy: not computed immediately
  - A transformed RDD is recomputed when an action is run on it
  - Spark works backwards: which RDDs do you need to apply to get an action? Which RDDs do you need to apply to get the input to this RDD?
  - An RDD can be persisted into memory or disk storage
- Actions
  - Finalizing operations: reduce, count, grab samples, write to file
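
A small sketch of this lazy evaluation model, assuming a SparkContext sc; nothing executes until the final action:

    val nums = sc.parallelize(1 to 10)
    val squares = nums.map(n => n * n)      // transformation: recorded, not run
    val evens = squares.filter(_ % 2 == 0)  // transformation: still nothing runs
    evens.persist()                         // ask Spark to keep the result in memory
    val total = evens.reduce(_ + _)         // action: the whole chain now executes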

Spark Transformations

  Transformation                             Description
  map(func)                                  Pass each element through the function func
  filter(func)                               Select the elements of the source on which func returns true
  flatMap(func)                              Each input item can be mapped to 0 or more output items
  sample(withReplacement, fraction, seed)    Sample a fraction of the data, with or without replacement, using the given random number generator seed
  union(otherDataset)                        Union of the elements in the source dataset and otherDataset
  distinct([numTasks])                       The distinct elements of the source dataset

Spark Transformations

  Transformation                           Description
  groupByKey([numTasks])                   When called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs
  reduceByKey(func, [numTasks])            Aggregate the values for each key using the given reduce function
  sortByKey([ascending], [numTasks])       Sort keys in ascending or descending order
  join(otherDataset, [numTasks])           Combines two datasets, (K, V) and (K, W), into (K, (V, W))
  cogroup(otherDataset, [numTasks])        Given (K, V) and (K, W), returns (K, Seq[V], Seq[W])
  cartesian(otherDataset)                  For two datasets of types T and U, returns a dataset of (T, U) pairs
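
An illustrative run of a few of these pair-RDD transformations, assuming a SparkContext sc; the datasets are invented:

    val views  = sc.parallelize(Seq(("page1", 3), ("page2", 5), ("page1", 2)))
    val owners = sc.parallelize(Seq(("page1", "alice"), ("page2", "bob")))

    val totals  = views.reduceByKey(_ + _)  // ("page1", 5), ("page2", 5)
    val grouped = views.groupByKey()        // ("page1", [3, 2]), ("page2", [5])
    val joined  = totals.join(owners)       // ("page1", (5, "alice")), ("page2", (5, "bob"))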

Spark Actions

  Action                                    Description
  reduce(func)                              Aggregate the elements of the dataset using func
  collect()                                 Return all elements of the dataset as an array
  count()                                   Return the number of elements in the dataset
  first()                                   Return the first element of the dataset
  take(n)                                   Return an array with the first n elements of the dataset
  takeSample(withReplacement, num, seed)    Return an array with a random sample of num elements of the dataset

Spark Actions

  Action                       Description
  saveAsTextFile(path)         Write the elements of the dataset as a text file
  saveAsSequenceFile(path)     Write the elements of the dataset as a Hadoop SequenceFile
  countByKey()                 For (K, V) RDDs, return a map of (K, Int) pairs with the count of each key
  foreach(func)                Run func on each element of the dataset
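
A sketch exercising some of these actions, again assuming a SparkContext sc; the output path is hypothetical:

    val words = sc.parallelize(Seq("spark", "rdd", "spark"))

    val n     = words.count()                   // 3
    val first = words.first()                   // "spark"
    val two   = words.take(2)                   // Array("spark", "rdd")
    val freq  = words.map((_, 1)).countByKey()  // Map("spark" -> 2, "rdd" -> 1)

    words.saveAsTextFile("hdfs://namenode:9000/out/words")  // hypothetical path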

Data Storage
- Spark does not care how source data is stored
  - The RDD connector determines that
  - E.g., read RDDs from tables in a Cassandra database & write new RDDs to Cassandra tables

RDD fault tolerance
- RDDs track the sequence of transformations used to create them
- This enables the recomputing of lost data: go back to the previous RDD and apply the transformations again
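
The recorded lineage can be inspected directly: toDebugString prints the chain of transformations Spark would replay to rebuild lost partitions. A sketch assuming a SparkContext sc and a hypothetical input path:

    val logs   = sc.textFile("hdfs://namenode:9000/logs")
    val errors = logs.filter(_.startsWith("ERROR"))

    // Print the transformation chain recorded for this RDD
    println(errors.toDebugString)

    // If a partition of `errors` is lost, Spark re-reads the matching block
    // of `logs` and re-applies the filter; no replication is needed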

Example: processing logs
- Transformations (create new RDDs):
  - Grab error messages from a log
  - Grab only ERROR messages & extract the source of the error
- Actions: count mysql & php errors

    // base RDD
    val lines = sc.textFile("hdfs://...")

    // transformed RDDs
    val errors = lines.filter(_.startsWith("ERROR"))
    val messages = errors.map(_.split("\t")).map(r => r(1))
    messages.cache()

    // action 1
    messages.filter(_.contains("mysql")).count()

    // action 2
    messages.filter(_.contains("php")).count()

Spark Streaming
- MapReduce & Pregel expect static data
- Spark Streaming enables the processing of live data streams
  - Same programming operations
  - Input data is chunked into batches
  - The programmer specifies the time interval

Spark Streaming: DStreams
- Discretized Stream = DStream
  - A continuous stream of data (from a source or from a transformation)
  - Appears as a continuous series of RDDs, each covering a time interval
  - Each operation on a DStream translates to operations on the RDDs
  - Join operations allow combining multiple streams
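
A minimal DStream sketch in the style of the classic streaming word count, assuming an existing SparkContext sc; the socket host, port, and batch interval are placeholders:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Chunk the input into 10-second batches; each batch becomes one RDD
    val ssc = new StreamingContext(sc, Seconds(10))

    // A DStream from a socket source
    val lines = ssc.socketTextStream("localhost", 9999)

    // These operations run on each interval's RDD
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()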

Spark Summary
- Supports streaming
  - Handles continuous data streams via Spark Streaming
- Fast
  - Often up to 10x faster on disk and 100x faster in memory than MapReduce
- General execution graph model
  - No need to add useless phases just to fit the model
  - In-memory storage for RDDs
- Fault tolerant
  - RDDs can be regenerated: you know what the input data set was, what transformations were applied to it, and what output it creates

The end