Chapter 4: Apache Spark


Lecture Notes, Winter Semester 2016/2017
Ludwig-Maximilians-University Munich
PD Dr. Matthias Renz, 2015
Based on lectures by Donald Kossmann (ETH Zürich), as well as Jure Leskovec, Anand Rajaraman, and Jeff Ullman (Stanford University)

Motivation

Spark is becoming the new standard for MapReduce applications:
- Logistic regression in Hadoop vs. Spark (performance chart not transcribed)
- Cloudera replaces the classic MapReduce framework with Spark
- IBM puts 3,500 researchers to work on Spark-related projects

Big Data Management and Analytics 168

Motivation

Most algorithms require a chain of MapReduce steps:
- Tedious to program
- Writes to and reads from disk between steps are expensive

Idea: use memory instead of disk

Apache Spark

- Keeps data between operations in memory
- Many convenience functions (e.g. filter, join)
- The framework imposes no restrictions on the order of operations (not just Map -> Reduce)
- A Spark program is a pipeline of operations on distributed datasets (RDDs)
- APIs: Java, Scala, Python, R

Resilient Distributed Dataset (RDD)

- Read-only collection of objects
- Partitioned across machines
- Enables operations on partitions in parallel

Creation:
- By parallelizing a collection
- From data in files (e.g. HDFS)
- As the result of a transformation of another RDD
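
To make "parallelizing a collection" concrete, here is a plain-Python sketch (not the Spark API) of how a local collection can be split into a fixed number of partitions; the helper name and chunking scheme are illustrative, not Spark's actual implementation:

```python
# Illustrative sketch: split a local collection into num_partitions
# roughly equal slices, as Spark's parallelize() conceptually does.
def parallelize(collection, num_partitions):
    n = len(collection)
    size, extra = divmod(n, num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # the first `extra` partitions get one additional element
        end = start + size + (1 if i < extra else 0)
        partitions.append(collection[start:end])
        start = end
    return partitions

parts = parallelize(list(range(10)), 3)
# concatenating the partitions yields the original collection
```

In real Spark, the number of partitions chosen here is exactly what determines the parallelism level discussed on the next slide.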

Resilient Distributed Dataset (RDD)

- The number of partitions determines the level of parallelism
- Can be cached in memory between operations
- Graph-based representation (lineage)
- Fault-tolerant: in case of a machine failure, the RDD can be reconstructed

RDD Transformations and Actions

Two types of operations:
- Transformations (lazily evaluated; they create the lineage)
- Actions (trigger the transformations)

(Diagram: input data flows through transformations, which produce new RDDs and record the lineage; an action computes the final result.)

RDD Transformations and Actions

Transformations:
- A recipe for how the new dataset is generated from the existing one
- Lazily evaluated
- Organized as a directed acyclic graph (DAG), so the required calculations can be optimized
- The DAG scheduler defines stages for execution; each stage comprises tasks based on particular data partitions
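
Lazy evaluation can be sketched in plain Python (this is a conceptual model, not the Spark API): transformations only record a recipe, and the action replays the recorded lineage.

```python
# Conceptual sketch of lazy transformations and an eager action.
class LazyDataset:
    def __init__(self, data, lineage=()):
        self._data = data
        self._lineage = lineage          # recorded transformations (the recipe)

    def map(self, f):                    # transformation: nothing runs yet
        return LazyDataset(self._data, self._lineage + (("map", f),))

    def filter(self, p):                 # transformation: nothing runs yet
        return LazyDataset(self._data, self._lineage + (("filter", p),))

    def collect(self):                   # action: replay the lineage
        out = list(self._data)
        for kind, f in self._lineage:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

rdd = LazyDataset(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# no work has happened yet; collect() triggers the whole pipeline
result = rdd.collect()
```

A real DAG scheduler would additionally analyze the recorded graph to fuse operations into stages before executing them.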

Architecture

(Architecture diagram not transcribed.)

RDD Narrow and Wide Dependencies

- Narrow dependency: each partition of the new RDD depends only on partitions located on the same worker, so the transformation is executed locally on the workers
- Wide dependency: a new partition depends on partitions on several workers, so a shuffle is necessary
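
The distinction can be sketched in plain Python (illustrative, not Spark API): a narrow dependency such as map transforms each partition independently, while a wide dependency such as grouping by key must route records across partitions, i.e. shuffle.

```python
from collections import defaultdict

def narrow_map(partitions, f):
    # narrow: each output partition depends only on the matching input partition
    return [[f(x) for x in part] for part in partitions]

def wide_group_by_key(partitions, num_out):
    # wide: each output partition may receive records from every input partition
    out = [defaultdict(list) for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            out[hash(key) % num_out][key].append(value)   # the shuffle step
    return [dict(d) for d in out]

parts = [[("a", 1), ("b", 2)], [("a", 3)]]
mapped = narrow_map(parts, lambda kv: (kv[0], kv[1] * 10))
grouped = wide_group_by_key(parts, 2)
```

Note that narrow_map never looks outside the current partition, whereas wide_group_by_key reads all partitions to build each output partition.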

Shuffle

- Internal map and reduce tasks organize and aggregate the data
- Large costs:
  - In-memory data structures consume a lot of memory => disk I/O (shuffle spill) and garbage collection
  - Many intermediate files on disk (kept for RDD reconstruction in case of failure) => garbage collection
  - Data serialization
  - Network I/O
- Reduce the amount of data transferred in the shuffle phase by pre-aggregation

Letter-count Examples

(Code listing showing the transformations and actions not transcribed.)
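
The slide's code listing did not survive transcription; as a stand-in, here is a hedged plain-Python sketch of a letter count in the same map-then-reduce-by-key style (the function name and structure are illustrative, not the Spark API):

```python
from collections import Counter

def letter_count(lines):
    # flatMap-like transformation: emit a (letter, 1) pair per character
    pairs = [(ch, 1) for line in lines for ch in line if ch.isalpha()]
    # reduceByKey-like step: sum the ones per letter
    counts = Counter()
    for letter, one in pairs:
        counts[letter] += one
    # returning the counts plays the role of the action
    return dict(counts)

result = letter_count(["spark", "rdd"])
```

In Spark, building the pairs would be a lazy transformation and collecting the summed counts would be the action that triggers execution.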

Shuffle: reduceByKey

(Diagram not transcribed.)

Shuffle: groupByKey

(Diagram not transcribed.)
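
The point the two diagrams make can be shown by counting shuffled records in a plain-Python sketch (illustrative, not Spark API): reduceByKey pre-aggregates per partition before records cross the network, groupByKey ships every record as-is.

```python
from collections import defaultdict

def shuffled_records_group_by_key(partitions):
    # groupByKey: every (key, value) record is sent over the network
    return sum(len(part) for part in partitions)

def shuffled_records_reduce_by_key(partitions):
    # reduceByKey: values are first combined per key within each partition,
    # so at most one record per (partition, key) crosses the network
    total = 0
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value          # map-side pre-aggregation
        total += len(local)
    return total

parts = [[("a", 1), ("a", 1), ("b", 1)], [("a", 1), ("b", 1), ("b", 1)]]
```

For these two partitions, groupByKey would ship all six records while reduceByKey ships only four, which is why pre-aggregation is the standard advice for shrinking the shuffle.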

RDD Persistence

Precomputed RDDs are reused: if RDD B was already computed, it is reused and stage 1 is skipped (diagram not transcribed).

RDD Persistence

- Computed RDDs are held in memory as deserialized Java objects
- Old data partitions are dropped in least-recently-used (LRU) fashion to free memory; a discarded RDD is recomputed if it is needed again
- To advise Spark to keep an RDD in memory, call cache() or persist() on it
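
The eviction-plus-recomputation behavior can be sketched with a small LRU cache in plain Python (a conceptual model, not Spark internals); the lineage is represented as a function that rebuilds a partition on demand.

```python
from collections import OrderedDict

class PartitionCache:
    def __init__(self, capacity, compute):
        self.capacity = capacity
        self.compute = compute           # "lineage": how to rebuild a partition
        self.cache = OrderedDict()
        self.recomputations = 0          # counts every compute, including first builds

    def get(self, partition_id):
        if partition_id in self.cache:
            self.cache.move_to_end(partition_id)     # mark as recently used
        else:
            self.recomputations += 1
            self.cache[partition_id] = self.compute(partition_id)
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)       # evict least recently used
        return self.cache[partition_id]

cache = PartitionCache(2, compute=lambda pid: [pid * 10])
cache.get(0); cache.get(1); cache.get(2)   # partition 0 is evicted
cache.get(0)                               # rebuilt from the lineage function
```

Spark does the same at partition granularity: dropping a cached partition is safe because the lineage graph always allows recomputing it.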

RDD Persistence

An RDD can be persisted differently by passing an argument to the persist function (in Python, persisted objects are always serialized):
- As deserialized Java objects in memory (default)
- As deserialized Java objects in memory and on disk
- As serialized Java objects in memory
- As serialized Java objects in memory and on disk
- Serialized on disk only
- Off heap

RDD Persistence

Off-heap RDD persistence:
- RDDs are persisted outside of the Java heap
- Reduces JVM garbage-collection pauses

Tachyon:
- Memory-centric distributed storage system
- Lineage function
- Enables data sharing between different jobs
- Data is safe even if the computation crashes

Shared Variables

- The driver program passes functions to the cluster
- If a passed function uses variables defined in the driver program, these are copied to each worker
- Updates on these variables are not allowed

Shared Variables

- The common data a function needs is shipped within each stage; within the stage the data is serialized and deserialized before each task
- Broadcast variables are used to avoid repeated shipping and de/serialization: a broadcast variable is shipped once and cached in deserialized form
- A broadcast variable should not be modified, but it can be recreated
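
A small back-of-the-envelope sketch in plain Python (the worker/task counts are made-up illustration values, not Spark defaults) shows why this matters: with closure capture the data travels once per task, with a broadcast variable once per worker.

```python
import pickle

lookup = {"a": 1, "b": 2}        # common data every task needs
workers, tasks_per_worker = 3, 4

# closure capture: the table is serialized and shipped with every task
ship_count_plain = workers * tasks_per_worker

# broadcast variable: shipped once per worker and cached there, deserialized
ship_count_broadcast = workers

payload = len(pickle.dumps(lookup))
bytes_plain = ship_count_plain * payload
bytes_broadcast = ship_count_broadcast * payload
```

Here the broadcast approach moves a quarter of the data; with realistic lookup tables of megabytes and hundreds of tasks, the gap is what makes broadcast variables worthwhile.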

Shared Variables

Broadcast variable example (code listing not transcribed).

Shared Variables

- Accumulators are the only updatable shared variables in Spark
- Only an associative add operation on an accumulator is allowed; custom add operations for new types can be defined
- Tasks can update an accumulator's value, but only the driver program can read it
- Accumulator updates are applied when an action is executed: a task updates the accumulator each time the action is run, but restarted tasks apply their update only once
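
These semantics can be sketched in plain Python (a conceptual model, not the Spark Accumulator API): tasks may only add, the driver reads the final value, and remembering which task attempts already contributed mimics Spark applying a restarted task's update exactly once.

```python
class Accumulator:
    def __init__(self, zero=0):
        self.value = zero                 # meant to be read by the driver only
        self._applied = set()             # task ids whose update was applied

    def add(self, task_id, amount):
        if task_id in self._applied:      # restarted task: update applied once
            return
        self._applied.add(task_id)
        self.value += amount

acc = Accumulator()
acc.add("task-1", 5)
acc.add("task-2", 7)
acc.add("task-1", 5)    # retry of task-1 is ignored, not double-counted
```

Deduplicating by task id is why a failed-and-restarted task cannot inflate the count, while re-running the whole action (new task ids) updates the accumulator again.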

Shared Variables

Accumulator example (code listing not transcribed).

Other Relevant Spark Projects

Spark Streaming:
- Objects from a stream are processed in small groups (batches), similar to batch processing

Spark SQL:
- Processing of structured data (SchemaRDD)
- Data is stored in columns and analyzed in SQL fashion
- The data is still an RDD and can be processed by other Spark frameworks
- JDBC/ODBC interface

Other Relevant Spark Projects

GraphX:
- Distributed computations on graphs

Machine learning libraries:
- MLlib
- H2O (Sparkling Water)
- KeystoneML

Sources

- http://www.datacenterknowledge.com/archives/2015/09/09/cloudera-aims-to-replace-mapreduce-with-spark-as-default-hadoop-framework/
- http://spark.apache.org/images/logistic-regression.png
- https://www-03.ibm.com/press/us/en/pressrelease/47107.wss
- Zaharia, Matei, et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012.
- http://de.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs?related=1