
Spark Overview 2015 Professor Sasu Tarkoma www.cs.helsinki.fi

Apache Spark
Spark is a general-purpose computing framework for iterative tasks. APIs are provided for Java, Scala and Python. The model is based on MapReduce, enhanced with new operations and an engine that supports execution graphs. Tools include Spark SQL, MLlib for machine learning, GraphX for graph processing and Spark Streaming.

Spark Aim
Spark unifies batch, streaming and interactive computing, making it easy to build sophisticated applications.

Key idea in Spark
Resilient distributed datasets (RDDs): immutable collections of objects spread across a cluster, built with parallel transformations (map, filter, ...), automatically rebuilt when a failure is detected, and with controllable persistence (in-memory operation).
Transformations on RDDs: lazy operations that build new RDDs from other RDDs; a transformation always creates a new RDD.
Actions on RDDs: count, collect, save.
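A minimal sketch in Scala of this idea, assuming an existing SparkContext named sc; the input path "events.log" and output path "errors-out" are illustrative:

  val lines   = sc.textFile("events.log")          // base RDD, nothing computed yet
  val errors  = lines.filter(_.contains("ERROR"))  // transformation: builds a new RDD lazily
  val lengths = errors.map(_.length)               // transformation: another new RDD
  println(errors.count())                          // action: triggers the whole lineage
  errors.saveAsTextFile("errors-out")              // action: writes the filtered RDD

Nothing is computed until the first action runs; the transformations only record how each RDD is derived from its parents.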

Spark terminology I/II
Application: a user program built on Spark, consisting of a driver program and executors on the cluster.
Driver program: the process running the main() function of the application and creating the SparkContext.
Cluster manager: an external service for acquiring resources on the cluster.
Worker node: any node that can run application code in the cluster.
Executor: a process launched for an application on a worker node; it runs tasks and keeps data in memory or on disk across them. Each application has its own executors.

Spark terminology II/II
Task: a unit of work that is sent to one executor.
Job: a parallel computation consisting of multiple tasks, spawned in response to a Spark action (e.g. save, collect).
Stage: each job is divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce).

Spark overview
[Diagram: the Driver Program (SparkContext) connects to a Cluster Manager, which manages Worker Nodes; each Worker Node runs an Executor with Tasks and a Cache.]
The SparkContext connects to a cluster manager, obtains executors on cluster nodes, sends application code to them, and sends tasks to the executors.

Spark cluster components
The SparkContext object in the main program controls the Spark application. Applications are executed as independent sets of processes on a cluster. The SparkContext can connect to several types of cluster managers (Spark's native standalone manager, or Mesos/YARN) that are responsible for allocating resources across the cluster.
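As a minimal sketch, a SparkContext is typically created from a SparkConf that names the cluster manager to connect to; the application name and master URL below are illustrative:

  import org.apache.spark.{SparkConf, SparkContext}

  // "local[4]" runs Spark locally with 4 worker threads.
  // A real deployment would use e.g. "spark://host:7077" (standalone),
  // "mesos://host:5050" or "yarn", depending on the cluster manager.
  val conf = new SparkConf().setAppName("SparkOverviewExample").setMaster("local[4]")
  val sc   = new SparkContext(conf)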

Three observations on the cluster architecture
1. Applications are isolated from each other: tasks from different applications run in different JVMs. As a consequence, data cannot be shared across Spark applications without using external storage.
2. Spark is flexible regarding the cluster manager; the manager only needs to be able to acquire executor processes that communicate with each other.
3. The driver schedules tasks on the cluster, so it should reside near the worker nodes.

Cluster manager types
Standalone: a simple cluster manager included with Spark.
Apache Mesos: a general cluster manager.
Hadoop YARN: the resource manager of Hadoop 2.

How Spark works
RDD -> Transformations -> RDD -> Action -> Value

Resilient Distributed Datasets (RDD)
RDDs document the sequence of transformations that were used to create them (the lineage). This lineage can be used to recompute lost data.
Example lineage: HadoopRDD -> FilteredRDD -> MappedRDD
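A small illustration of inspecting such a lineage, assuming the sc from the sketches above and an illustrative input file:

  // Hypothetical lineage: a text file RDD, filtered and then mapped.
  val mapped = sc.textFile("events.log")
    .filter(_.contains("ERROR"))
    .map(_.toUpperCase)
  // toDebugString prints the chain of parent RDDs used to recompute lost partitions.
  println(mapped.toDebugString)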

RDD Partitions
Partitions support the management of parallel collections. One task is run for each partition of the cluster. Typically 2-4 partitions are used for each CPU. Spark sets the number of partitions automatically, but it can also be tuned manually (e.g. sc.parallelize(data, 10)). The term slices is also used for partitions.
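A minimal sketch of requesting and inspecting the partition count; the data and the count of 10 are illustrative:

  val data = 1 to 1000
  val rdd  = sc.parallelize(data, 10)   // explicitly request 10 partitions (slices)
  println(rdd.partitions.length)        // prints 10; one task is run per partition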

RDD I/II
The RDD is the primary abstraction: a fault-tolerant collection of elements with intrinsic support for parallel operations.
There are two different types of RDD:
Parallelized collections, based on Scala collections.
Hadoop datasets, which allow the execution of functions on each record of a file in HDFS or some other storage system supported by Hadoop.

RDD II/II
Two types of operations are supported by Spark: transformations (lazy) and actions, both operating on RDDs. A transformed RDD is recomputed when an action is applied to it. An RDD can be stored in memory or on disk.

Persistence in Spark
Spark can cache (persist) an RDD in memory. Each node stores in memory any partitions of an RDD that it computes, and can reuse these cached results when executing new actions on the RDD; this can result in 10x performance compared to Hadoop. The cache is fault tolerant: if any partition of an RDD is lost, the lost part can be recomputed using the history of transformations (the lineage).

Persistence in Spark II
Spark monitors the cache usage of each node, and old data partitions are removed using an LRU (Least Recently Used) scheme. It is also possible to manually remove an RDD from the cache with the RDD.unpersist() method.
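A minimal sketch of caching and uncaching an RDD; the input path and the choice of storage level are illustrative:

  import org.apache.spark.storage.StorageLevel

  val errors = sc.textFile("events.log").filter(_.contains("ERROR"))
  errors.persist(StorageLevel.MEMORY_ONLY)   // same effect as errors.cache()
  println(errors.count())                    // first action computes and caches the partitions
  println(errors.count())                    // second action reuses the cached partitions
  errors.unpersist()                         // manually remove the RDD from the cache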

Spark Persistence: storage levels
MEMORY_ONLY: The RDD is stored as deserialized Java objects in the JVM. If the RDD does not fit into memory, it will be only partially cached and the missing partitions will be recomputed when they are required. This is the default level.
MEMORY_AND_DISK: The RDD is stored as deserialized Java objects in the JVM. If the RDD does not fit into memory, the remaining partitions are stored on disk and accessed from there.
MEMORY_ONLY_SER: The RDD is stored as serialized Java objects (one byte array per partition). This approach is more space-efficient than the deserialized levels, but requires more CPU to read. Missing partitions are recomputed when required.
MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but uses disk instead of recomputing the missing partitions.
DISK_ONLY: The RDD is stored only on disk.
MEMORY_ONLY_2 (and the other _2 variants): Same as the corresponding level, but with each partition replicated on two cluster nodes.

Shared Variables
A function given to a Spark operation (for example map or reduce) is typically executed on a remote cluster node. The function works on separate copies of the data and variables, which are not shared between the remote cluster nodes: the necessary data and variables are copied to the remote node, and no updates are propagated back to the driver program. Spark supports two kinds of shared variables: broadcast variables and accumulators.
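A minimal sketch of both kinds of shared variable; the lookup table and counting logic are illustrative, and sc.accumulator is the classic RDD-era API:

  // Broadcast variable: a read-only lookup table shipped to each node once.
  val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
  // Accumulator: tasks can only add to it; only the driver reads its value.
  val missing = sc.accumulator(0)

  val resolved = sc.parallelize(Seq("a", "b", "c")).map { key =>
    if (!lookup.value.contains(key)) missing += 1
    lookup.value.getOrElse(key, 0)
  }
  resolved.collect()       // running the action also applies the accumulator updates
  println(missing.value)   // number of keys not found in the broadcast map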

Task Scheduler
Supports general task graphs. Pipelines functions where possible. Cache-aware data reuse and locality. Partitioning-aware to avoid shuffles.
[Diagram: an example job graph split into three stages (groupBy, join, map/filter); cached partitions are reused instead of being recomputed.]
Source: M. Zaharia. Parallel programming with Spark. O'Reilly Strata Conference, February 2013.