Resilient Distributed Datasets

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
UC Berkeley

Motivation
MapReduce greatly simplified big data analysis on large, unreliable clusters
But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (e.g. iterative machine learning & graph processing)
» More interactive ad-hoc queries
Response: specialized frameworks for some of these apps (e.g. Pregel for graph processing)

Motivation
Complex apps and interactive queries both need one thing that MapReduce lacks:
Efficient primitives for data sharing
In MapReduce, the only way to share data across jobs is stable storage, which is slow!

Examples
[diagram: an iterative job does an HDFS read and HDFS write around every iteration (iter. 1, iter. 2, ...); each interactive query (query 1, 2, 3, ...) re-reads the input from HDFS to produce its result]
Slow due to replication and disk I/O, but necessary for fault tolerance

Goal: In-Memory Data Sharing
[diagram: the iterations now pass data in memory, and a one-time processing step lets queries 1, 2, 3, ... run against in-memory data]
10-100x faster than network/disk, but how to get fault tolerance?

Challenge
How to design a distributed memory abstraction that is both fault-tolerant and efficient?

Challenge
Existing storage abstractions have interfaces based on fine-grained updates to mutable state
» RAMCloud, databases, distributed memory, Piccolo
Requires replicating data or logs across nodes for fault tolerance
» Costly for data-intensive apps
» 10-100x slower than memory write

Solution: Resilient Distributed Datasets (RDDs)
Restricted form of distributed shared memory
» Immutable, partitioned collections of records
» Can only be built through coarse-grained deterministic transformations (map, filter, join, ...)
Efficient fault recovery using lineage
» Log one operation to apply to many elements
» Recompute lost partitions on failure
» No cost if nothing fails
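A minimal sketch of this idea in Scala (assuming the modern org.apache.spark package layout rather than the 2012 prototype's; the app name and local master URL are placeholders): each transformation returns a new immutable RDD and logs one coarse-grained operation in the lineage graph, so nothing needs to be replicated.

    import org.apache.spark.{SparkConf, SparkContext}

    object LineageSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("lineage-sketch").setMaster("local[*]"))
        // Each step logs one operation applied to many elements; a lost
        // partition is rebuilt by replaying these steps on its input partition.
        val nums    = sc.parallelize(1 to 1000000)   // immutable, partitioned collection
        val evens   = nums.filter(_ % 2 == 0)        // coarse-grained, deterministic
        val squares = evens.map(n => n.toLong * n)
        println(squares.count())                     // action: triggers computation
        sc.stop()
      }
    }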

RDD Recovery
[diagram: the same in-memory sharing picture as the goal slide; a lost partition is rebuilt by recomputing it from its inputs rather than from a replica]

Generality of RDDs
Despite their restrictions, RDDs can express surprisingly many parallel algorithms
» These naturally apply the same operation to many items
Unify many current programming models
» Data flow models: MapReduce, Dryad, SQL, ...
» Specialized models for iterative apps: BSP (Pregel), iterative MapReduce (HaLoop), bulk incremental, ...
Support new apps that these models don't

Tradeoff Space
[chart: granularity of updates (fine to coarse) vs. write throughput (low to high). Fine-grained: K-V stores, databases, RAMCloud; bounded by network bandwidth; best for transactional workloads. Coarse-grained: HDFS (low throughput) and RDDs (bounded by memory bandwidth); best for batch workloads]

Outline Spark programming interface Implementation Demo How people are using Spark

Spark Programming Interface
DryadLINQ-like API in the Scala language
Usable interactively from the Scala interpreter
Provides:
» Resilient distributed datasets (RDDs)
» Operations on RDDs: transformations (build new RDDs), actions (compute and output results)
» Control of each RDD's partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.)

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")           // base RDD
errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
messages = errors.map(_.split("\t")(2))
messages.persist()
messages.filter(_.contains("foo")).count       // action
messages.filter(_.contains("bar")).count

[diagram: the driver (Master) ships tasks to Workers holding input Blocks 1-3; each worker caches its message partition (Msgs. 1-3) and returns results]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
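A self-contained version of the slide's example, as a hedged sketch: the SparkContext setup, local master URL, and HDFS path are placeholders not shown on the slide.

    import org.apache.spark.{SparkConf, SparkContext}

    object LogMining {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("log-mining").setMaster("local[*]"))
        val lines    = sc.textFile("hdfs://namenode:8020/logs/app.log") // hypothetical path
        val errors   = lines.filter(_.startsWith("ERROR"))
        val messages = errors.map(_.split("\t")(2)) // assumes tab-separated fields, message in column 2
        messages.persist() // keep the filtered messages in memory across queries
        println(messages.filter(_.contains("foo")).count()) // action: runs the pipeline
        println(messages.filter(_.contains("bar")).count()) // reuses the cached messages
        sc.stop()
      }
    }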

Fault Recovery
RDDs track the graph of transformations that built them (their lineage) to rebuild lost data
E.g.: messages = textFile(...).filter(_.contains("error")).map(_.split("\t")(2))
[lineage diagram: HadoopRDD (path = hdfs://...) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split(...))]
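One way to inspect this lineage in practice, as a fragment reusing the SparkContext sc from the sketch above (toDebugString comes from later open-source Spark releases, not necessarily the prototype described in this talk):

    // Prints the chain of RDDs (HadoopRDD -> FilteredRDD -> MappedRDD on the slide)
    // that Spark would replay to rebuild lost partitions.
    val messages = sc.textFile("hdfs://namenode:8020/logs/app.log") // hypothetical path
      .filter(_.contains("error"))
      .map(_.split("\t")(2))
    println(messages.toDebugString)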

Fault Recovery Results
[chart: iteration time (s) across 10 iterations: 119, 57, 56, 58, 58, 81, 57, 59, 57, 59; a failure happens at the start of iteration 6, which recovers via recomputation in 81 s instead of ~57 s]

Example: PageRank
1. Start each page with a rank of 1
2. On each iteration, update each page's rank to Σ_{i ∈ neighbors} rank_i / |neighbors_i|

links = // RDD of (url, neighbors) pairs
ranks = // RDD of (url, rank) pairs
for (i <- 1 to ITERATIONS) {
  ranks = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }.reduceByKey(_ + _)
}
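A runnable sketch of the same algorithm on a toy graph (the three-page graph and the iteration count of 10 are made up for illustration; this follows the slide's simplified update rule, with no damping factor):

    import org.apache.spark.{SparkConf, SparkContext}

    object PageRank {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("pagerank").setMaster("local[*]"))
        // Toy link graph of (url, neighbors) pairs; a real job would load this from storage.
        val links = sc.parallelize(Seq(
          ("a", Seq("b", "c")),
          ("b", Seq("c")),
          ("c", Seq("a")))).cache()
        var ranks = links.mapValues(_ => 1.0) // 1. start each page with a rank of 1
        for (_ <- 1 to 10) {                  // 2. redistribute rank along outgoing links
          ranks = links.join(ranks).flatMap {
            case (_, (neighbors, rank)) =>
              neighbors.map(dest => (dest, rank / neighbors.size))
          }.reduceByKey(_ + _)
        }
        ranks.collect().foreach { case (url, rank) => println(s"$url: $rank") }
        sc.stop()
      }
    }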

Optimizing Placement
[diagram: Links (url, neighbors) joins Ranks_0 (url, rank) -> Contribs_0 -> reduce -> Ranks_1 -> join -> Contribs_1 -> reduce -> Ranks_2 -> ...]
links & ranks repeatedly joined
Can co-partition them (e.g. hash both on URL) to avoid shuffles
Can also use app knowledge, e.g., hash on DNS name
links = links.partitionBy(new URLPartitioner())
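The URLPartitioner named on the slide is not defined in the talk; a hypothetical version that hashes on the URL's host (the "DNS name" heuristic) might look like this, assuming well-formed URL keys:

    import java.net.URI
    import org.apache.spark.Partitioner

    // Hypothetical partitioner: pages from the same host land in the same
    // partition, so the repeated links-ranks join needs no shuffle once
    // both RDDs are partitioned this way.
    class URLPartitioner(val partitions: Int) extends Partitioner {
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int = {
        val host = Option(new URI(key.toString).getHost).getOrElse("")
        ((host.hashCode % partitions) + partitions) % partitions
      }
      // Equal partitioners let Spark detect co-partitioned RDDs and skip the shuffle.
      override def equals(other: Any): Boolean = other match {
        case p: URLPartitioner => p.partitions == partitions
        case _                 => false
      }
      override def hashCode: Int = partitions
    }

    // Usage: val links = rawLinks.partitionBy(new URLPartitioner(64)).cache()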

PageRank Performance
[chart: time per iteration (s): Hadoop 171, Basic Spark 72, Spark + Controlled Partitioning 23]

Implementation
Runs on Mesos [NSDI 11] to share clusters with Hadoop
Can read from any Hadoop input source (HDFS, S3, ...)
[diagram: Spark, Hadoop, and MPI running side by side on Mesos across cluster nodes]
No changes to Scala language or compiler
» Reflection + bytecode analysis to correctly ship code
www.spark-project.org

Programming Models Implemented on Spark
RDDs can express many existing parallel models
» MapReduce, DryadLINQ
» Pregel graph processing [200 LOC]
» Iterative MapReduce [200 LOC]
» SQL: Hive on Spark (Shark) [in progress]
All are based on coarse-grained operations
Enables apps to efficiently intermix these models

Demo

Open Source Community
15 contributors, 5+ companies using Spark, 3+ application projects at Berkeley
User applications:
» Data mining 40x faster than Hadoop (Conviva)
» Exploratory log analysis (Foursquare)
» Traffic prediction via EM (Mobile Millennium)
» Twitter spam classification (Monarch)
» DNA sequence analysis (SNAP)
» ...

Related Work
RAMCloud, Piccolo, GraphLab, parallel DBs
» Fine-grained writes requiring replication for resilience
Pregel, iterative MapReduce
» Specialized models; can't run arbitrary / ad-hoc queries
DryadLINQ, FlumeJava
» Language-integrated distributed dataset API, but cannot share datasets efficiently across queries
Nectar [OSDI 10]
» Automatic expression caching, but over distributed FS
PacMan [NSDI 12]
» Memory cache for HDFS, but writes still go to network/disk

Conclusion
RDDs offer a simple and efficient programming model for a broad range of applications
Leverage the coarse-grained nature of many parallel algorithms for low-overhead recovery
Try it out at www.spark-project.org

Behavior with Insufficient RAM
[chart: iteration time (s) vs. percent of working set in memory: 0% = 68.8, 25% = 58.1, 50% = 40.7, 75% = 29.7, 100% = 11.5]

Scalability
[charts: iteration time (s) vs. number of machines (25 / 50 / 100)]
Logistic Regression: Hadoop 184 / 111 / 76; HadoopBinMem 116 / 80 / 62; Spark 15 / 6 / 3
K-Means: Hadoop 274 / 157 / 106; HadoopBinMem 197 / 121 / 87; Spark 143 / 61 / 33

Breaking Down the Speedup
[chart: iteration time (s) for text / binary input: in-memory HDFS 15.4 / 8.4; in-memory local file 13.1 / 6.9; Spark RDD 2.9 / 2.9]

Spark Operations
Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues
Actions (return a result to the driver program): collect, reduce, count, save, lookup(key)
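A few of these operations chained together, as a fragment on toy data (again assuming the SparkContext sc from the earlier sketches):

    val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val other  = sc.parallelize(Seq(("a", "x"), ("b", "y")))
    val summed = pairs.reduceByKey(_ + _)        // transformation: ("a", 4), ("b", 2)
    val joined = summed.join(other).sortByKey()  // transformations
    joined.collect().foreach(println)            // action: returns results to the driver
    println(summed.lookup("a"))                  // action: values for key "a"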

Task Scheduler
Dryad-like DAGs
Pipelines functions within a stage
Locality & data reuse aware
Partitioning-aware to avoid shuffles
[diagram: a job DAG of RDDs A-G split into three stages: a groupBy between A and B in Stage 1, map and union pipelined within Stage 2 (C, D, E, F), and a join producing G in Stage 3; shaded boxes = cached data partitions]
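A small sketch of where those stage boundaries fall (the corpus path is hypothetical, and toDebugString is from later Spark releases):

    // flatMap and map are pipelined within one stage; groupByKey forces a
    // shuffle and therefore starts a new stage, as in the DAG above.
    val counts = sc.textFile("hdfs://namenode:8020/corpus.txt") // hypothetical path
      .flatMap(_.split(" "))
      .map(word => (word, 1))   // pipelined with flatMap: same stage
      .groupByKey()             // shuffle boundary: new stage
      .mapValues(_.sum)         // pipelined after the shuffle
    println(counts.toDebugString) // lineage printout shows the shuffle dependency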