Big data systems 12/8/17


Today: basic architecture, two levels of scheduling, Spark overview

Basic architecture

[Diagram sequence: a cluster of machines (each with 64GB RAM and 32 cores) coordinated by a cluster manager. A client submits WordCount.java; the cluster manager launches a driver container and executor containers for it on the cluster. A second client then submits FindTopTweets.java and gets its own driver and executors alongside WordCount's, and a third application (App3) is added the same way, so multiple applications share the cluster.]

Basic architecture
Clients submit applications to the cluster manager
The cluster manager assigns cluster resources to applications and launches containers for each application
Driver containers run the main method of the user program
Executor containers run the actual computation

Two levels of scheduling
Cluster-level: the cluster manager assigns resources to applications
Application-level: the driver assigns tasks to run on executors
A task is a unit of execution that corresponds to one partition of your data
E.g. running Spark on a YARN cluster: Hadoop YARN is a cluster scheduling framework, Spark is a distributed computing framework
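As a concrete illustration, here is a minimal Scala sketch (not from the lecture; the application name, master setting, and resource numbers are invented, and it assumes a YARN cluster reachable via HADOOP_CONF_DIR) of how an application states its resource needs to the cluster manager while its driver does the application-level scheduling:

import org.apache.spark.{SparkConf, SparkContext}

// Cluster-level: these settings tell YARN how many executor containers to
// launch for this application and how large each container should be.
val conf = new SparkConf()
  .setAppName("WordCount")
  .setMaster("yarn")                        // hand resource management to YARN
  .set("spark.executor.instances", "4")     // request 4 executor containers
  .set("spark.executor.cores", "8")
  .set("spark.executor.memory", "16g")

// Application-level: the driver created here splits each job into tasks
// (one per partition) and assigns them to the executors YARN launched.
val sc = new SparkContext(conf)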

Two levels of scheduling
Cluster-level: the cluster manager assigns resources to applications
Application-level: the driver assigns tasks to run on executors
A task is a unit of execution that corresponds to one partition of your data
What are some advantages of having two levels?
Applications need not be concerned with resource fairness
The cluster manager need not be concerned with individual tasks (there are too many)
Easy to implement priorities and preemption

Spark overview

What is Apache Spark?
Fast and general engine for big data processing
Fast to run code: in-memory data sharing, general computation graphs
Fast to write code: rich APIs in Java, Scala, Python; interactive shell

What is Apache Spark?
The Spark stack: Spark SQL (structured data), Spark Streaming (real-time), MLlib (machine learning), GraphX (graph processing), all built on Spark Core

Spark computation model
Recall that the map phase defines how each machine processes its individual partition, and the reduce phase defines how to merge map outputs from the previous phase
Most computation can be expressed in terms of these two phases
Spark expresses computation as a DAG of maps and reduces
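To make the DAG concrete, here is a small sketch (not from the slides; the ratings data is made up) that builds a two-phase computation and prints the lineage Spark records for it; a reduce-style operation such as groupByKey introduces the stage boundary:

// Map phase: per-partition work (pairing, transforming) stays in one stage.
// Reduce phase: groupByKey forces a shuffle, i.e. a synchronization barrier.
val ratings = sc.parallelize(Seq((1, 4.0), (2, 3.5), (1, 5.0)))
val avgByMovie = ratings.groupByKey().mapValues(rs => rs.sum / rs.size)

// toDebugString prints the lineage: a shuffled RDD built on top of the mapped input.
println(avgByMovie.toDebugString)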

[Diagram: a map phase, a synchronization barrier, then a reduce phase.]

More complicated DAG

Spark basics
Transformations express how to process a dataset
Actions express how to turn a transformed dataset into results

sc.textFile("declaration-of-independence.txt")                           // input data
  .flatMap { line => line.split(" ") }                                   // transformations
  .map { word => (word, 1) }
  .reduceByKey { case (counts1, counts2) => counts1 + counts2 }
  .collect()                                                             // action

Spark optimization #1
Transformations can be pipelined until we hit a synchronization barrier (e.g. reduce) or an action
Example: data.map { ... }.filter { ... }.flatMap { ... }.groupByKey().count()
The first three operations (map, filter, flatMap) can all be run in the same task
This allows lazy execution; we don't need to eagerly execute map
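A small sketch of what pipelining and laziness look like from the driver's side (the file name is invented for illustration):

val logs = sc.textFile("access.log")         // nothing is read yet
val errorWords = logs
  .map(_.toLowerCase)                        // \
  .filter(_.contains("error"))               //  > pipelined into the same tasks
  .flatMap(_.split(" "))                     // /
// Still nothing has executed: transformations only build up the DAG.
val n = errorWords.count()                   // the action triggers a single pass over the file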

Spark optimization #2
In-memory caching: store intermediate results in memory to bypass disk access
Example:
val cached = data.map { ... }.filter { ... }.cache()
(1 to 100).foreach { i => cached.reduceByKey { ... }.saveAsTextFile(...) }
By caching the transformed data in memory, we skip the map, the filter, and reading the original data from disk on every iteration
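A runnable version of the same idea, with invented data, paths, and iteration count; only the first iteration pays for reading and transforming the input:

val events = sc.textFile("events.txt")
val cached = events
  .map(line => (line.split(",")(0), 1))      // (userId, 1), assuming a CSV with the id first
  .filter(_._1.nonEmpty)
  .cache()                                   // keep the transformed partitions in memory

(1 to 100).foreach { i =>
  // iterations 2..100 read `cached` from memory instead of re-running map/filter
  cached.reduceByKey(_ + _).saveAsTextFile(s"counts-$i")
}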

Spark optimization #3
Reusing map outputs (aka shuffle files) allows Spark to skip map stages
Along with caching, this makes iterative workloads much faster
Example:
val transformed = data.map { ... }.filter { ... }.reduceByKey { ... }
transformed.collect()
transformed.collect() // does not run the map phase again
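The same effect spelled out with a concrete (made-up) pipeline; the second action reuses the shuffle files written by the first run, so only the post-shuffle work repeats:

val transformed = sc.textFile("events.txt")
  .map(line => (line.split(",")(0), 1))
  .filter(_._1.nonEmpty)
  .reduceByKey(_ + _)

transformed.collect()   // runs the map stage and writes shuffle files
transformed.collect()   // map stage is skipped; Spark reads the existing shuffle files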

Recap
Spark is expressive because its computation model is a DAG
Spark is fast because of many optimizations, in particular:
Pipelined transformations
In-memory caching
Reusing map outputs

Brainstorm: Top K
Top K is the problem of finding the largest K values from a set of numbers
How would you express this as a distributed application?
In particular, what would the map and reduce phases look like?
Hint: use a heap...

Top K
Assuming that a set of K integers fits in memory, the key idea is:
Map phase: everyone maintains a heap of K elements
Reduce phase: merge the heaps until you're left with one
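A Spark sketch of this idea (K and the sample numbers are invented; this is not the lecture's code): each partition plays the role of a map task and keeps a min-heap of its K largest values, and a reduce merges the per-partition results:

import scala.collection.mutable

val K = 3
val numbers = sc.parallelize(Seq(10, 5, 3, 700, 18, 4, 16, 523, 100, 88), 4)

val topK = numbers
  .mapPartitions { it =>
    // min-heap of size K: the smallest of the kept values is always on top
    val heap = mutable.PriorityQueue.empty[Int](Ordering[Int].reverse)
    it.foreach { x =>
      heap.enqueue(x)
      if (heap.size > K) heap.dequeue()     // evict the smallest, keeping the K largest
    }
    Iterator(heap.toSeq)                    // one "heap" per partition
  }
  .reduce { (a, b) => (a ++ b).sorted(Ordering[Int].reverse).take(K) }

// topK == Seq(700, 523, 100)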

Top K
Problem: what are the keys and values here?
There is no notion of a key here, so just assign one key to all the values (e.g. key = 1)
Map task 1: [10, 5, 3, 700, 18, 4] -> (1, heap(700, 18, 10))
Map task 2: [16, 4, 523, 100, 88] -> (1, heap(523, 100, 88))
Map task 3: [3, 3, 3, 3, 300, 3] -> (1, heap(300, 3, 3))
Map task 4: [8, 15, 20015, 89] -> (1, heap(20015, 89, 15))
Then all the heaps will go to a single reducer responsible for the key 1
This works, but is clearly not scalable...

Top K
Idea: use X different keys to balance load (e.g. X = 2 here)
Map task 1: [10, 5, 3, 700, 18, 4] -> (1, heap(700, 18, 10))
Map task 2: [16, 4, 523, 100, 88] -> (1, heap(523, 100, 88))
Map task 3: [3, 3, 3, 3, 300, 3] -> (2, heap(300, 3, 3))
Map task 4: [8, 15, 20015, 89] -> (2, heap(20015, 89, 15))
Then all the heaps will (hopefully) go to X different reducers
Rinse and repeat (what's the runtime complexity?)
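For reference, Spark's built-in operations follow this same pattern: top(K) computes per-partition top-K values and merges them at the driver, while treeReduce performs the multi-round merge that the "rinse and repeat" step describes. A sketch reusing numbers and K from the earlier example:

val viaTop  = numbers.top(K)                 // Array(700, 523, 100)
val viaTree = numbers
  .map(Seq(_))
  .treeReduce((a, b) => (a ++ b).sorted(Ordering[Int].reverse).take(K), depth = 2)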