Spark: A Brief History. https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf


A Brief History
Timeline: 2002 MapReduce @ Google; 2004 MapReduce paper; 2006 Hadoop @ Yahoo!; 2008 Hadoop Summit; 2010 Spark paper; 2014 Apache Spark becomes a top-level project.

A Brief History: MapReduce
circa 1979: Stanford, MIT, CMU, etc. Set/list operations in LISP, Prolog, etc., for parallel processing. www-formal.stanford.edu/jmc/history/lisp/lisp.htm
circa 2004: Google. MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat. research.google.com/archive/mapreduce.html
circa 2006: Apache Hadoop, originating from the Nutch Project. Doug Cutting. research.yahoo.com/files/cutting.pdf
circa 2008: Yahoo. Web-scale search indexing; Hadoop Summit, HUG, etc. developer.yahoo.com/hadoop/
circa 2009: Amazon AWS Elastic MapReduce. Hadoop modified for EC2/S3, plus support for Hive, Pig, Cascading, etc. aws.amazon.com/elasticmapreduce/

A Brief History: MapReduce
MapReduce use cases showed two major limitations: 1. the difficulty of programming directly in MR; 2. performance bottlenecks, or batch processing not fitting the use cases. In short, MR doesn't compose well for large applications. Therefore, people built specialized systems as workarounds.

A Brief History: MapReduce
(Figure: MapReduce as general batch processing, surrounded by specialized systems for iterative, interactive, streaming, graph, etc. workloads: Pregel, Giraph, Dremel, Drill, Tez, Impala, GraphLab, Storm, S4.)
The State of Spark, and Where We're Going Next. Matei Zaharia, Spark Summit (2013). youtu.be/nu6vo2ejab4

A Brief History: Spark
Timeline: 2002 MapReduce @ Google; 2004 MapReduce paper; 2006 Hadoop @ Yahoo!; 2008 Hadoop Summit; 2010 Spark paper; 2014 Apache Spark becomes a top-level project.
Spark: Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. USENIX HotCloud (2010). people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI (2012). usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf

A Brief History: Spark
Unlike the various specialized systems, Spark's goal was to generalize MapReduce to support new apps within the same engine. Two reasonably small additions are enough to express the previous models: fast data sharing and general DAGs. This allows for an approach that is more efficient for the engine and much simpler for the end users.

A Brief History: Spark
The State of Spark, and Where We're Going Next. Matei Zaharia, Spark Summit (2013). youtu.be/nu6vo2ejab4 (the specialized systems' functionality is used as libraries, instead of separate specialized systems)

A Brief History: Spark
Some key points about Spark: it handles batch, interactive, and real-time processing within a single framework; it has native integration with Java, Python, and Scala; it enables programming at a higher level of abstraction; and it is more general: map/reduce is just one set of supported constructs.

https://www.safaribooksonline.com/library/view/data-analytics-with/9781491913734/assets/dawh_0401.png

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica. UC Berkeley

Motivation
MapReduce greatly simplified big data analysis on large, unreliable clusters. But as soon as it got popular, users wanted more: more complex, multi-stage applications (e.g. iterative machine learning and graph processing) and more interactive ad-hoc queries. The response was specialized frameworks for some of these apps (e.g. Pregel for graph processing).

Motivation
Complex apps and interactive queries both need one thing that MapReduce lacks: efficient primitives for data sharing. In MapReduce, the only way to share data across jobs is stable storage, which is slow!

Examples
(Figure: an iterative job reads and writes HDFS between every iteration (HDFS read, iter. 1, HDFS write, HDFS read, iter. 2, ...), and interactive queries each re-read the input from HDFS (query 1, query 2, query 3, ...).)
Slow due to replication and disk I/O, but necessary for fault tolerance.

Goal: In-Memory Data Sharing
(Figure: iterations share data in memory (Input, iter. 1, iter. 2, ...), and one-time processing of the input lets queries 1, 2, 3, ... share it in memory.)
10-100x faster than network/disk, but how to get fault tolerance?

Challenge
How to design a distributed memory abstraction that is both fault-tolerant and efficient?

Challenge
Existing storage abstractions have interfaces based on fine-grained updates to mutable state (RAMCloud, databases, distributed memory, Piccolo). This requires replicating data or logs across nodes for fault tolerance: costly for data-intensive apps, and 10-100x slower than a memory write.

Solution: Resilient Distributed Datasets (RDDs)
A restricted form of distributed shared memory: immutable, partitioned collections of records that can only be built through coarse-grained deterministic transformations (map, filter, join, ...). Efficient fault recovery using lineage: log one operation to apply to many elements, recompute lost partitions on failure, and pay no cost if nothing fails.

RDD Recovery
(Figure: the same in-memory data-sharing diagram as above: Input, iter. 1, iter. 2, ...; one-time processing feeding queries 1, 2, 3, ...)

Generality of RDDs
Despite their restrictions, RDDs can express surprisingly many parallel algorithms, since these naturally apply the same operation to many items. They unify many current programming models: data flow models (MapReduce, Dryad, SQL, ...) and specialized models for iterative apps (BSP/Pregel, iterative MapReduce/HaLoop, bulk incremental, ...), and they support new apps that these models don't.

Tradeoff Space
(Figure: granularity of updates (fine to coarse) plotted against write throughput (low to high). K-V stores, databases, and RAMCloud sit at fine-grained updates with network-bandwidth throughput, best for transactional workloads. HDFS sits at coarse-grained updates with network-bandwidth throughput. RDDs sit at coarse-grained updates with memory-bandwidth throughput, best for batch workloads.)

Spark Programming Interface
A DryadLINQ-like API in the Scala language, usable interactively from the Scala interpreter. Provides: resilient distributed datasets (RDDs); operations on RDDs: transformations (build new RDDs) and actions (compute and output results); and control of each RDD's partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.).

Spark Operations
Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues.
Actions (return a result to the driver program): collect, reduce, count, save, lookupKey.
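A minimal sketch of the transformation/action split, assuming an existing SparkContext sc and a small in-memory dataset (the word list here is made up for illustration):

    val words = sc.parallelize(Seq("spark", "rdd", "spark", "cluster"))

    // Transformations only define new RDDs; nothing runs yet (lazy evaluation).
    val pairs    = words.map(word => (word, 1))                  // map
    val counts   = pairs.reduceByKey(_ + _)                      // reduceByKey
    val filtered = counts.filter { case (w, _) => w != "rdd" }   // filter

    // Actions trigger computation and return results to the driver.
    val total  = words.count()       // count
    val result = filtered.collect()  // collect
    result.foreach(println)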

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.

    lines = spark.textFile("hdfs://...")              // base RDD
    errors = lines.filter(_.startsWith("ERROR"))      // transformed RDD
    messages = errors.map(_.split("\t")(2))
    messages.persist()
    messages.filter(_.contains("foo")).count          // action
    messages.filter(_.contains("bar")).count

(Figure: the master sends tasks to workers; each worker reads one block of the file and caches its messages partition: Block 1/Msgs. 1, Block 2/Msgs. 2, Block 3/Msgs. 3.)
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).
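A self-contained sketch of the same log-mining flow as a standalone Scala application. The application name is a placeholder and the "hdfs://..." path is left elided as in the slide:

    import org.apache.spark.{SparkConf, SparkContext}

    object LogMining {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("LogMining"))

        val lines    = sc.textFile("hdfs://...")             // base RDD
        val errors   = lines.filter(_.startsWith("ERROR"))   // transformed RDD
        val messages = errors.map(_.split("\t")(2))
        messages.persist()                                    // keep in memory

        // Each count is an action; the second one reuses the cached messages.
        println(messages.filter(_.contains("foo")).count())
        println(messages.filter(_.contains("bar")).count())

        sc.stop()
      }
    }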

Task Scheduler
Dryad-like DAGs; pipelines functions within a stage; locality and data-reuse aware; partitioning-aware to avoid shuffles. (Figure: a job DAG over RDDs A through G, split into three stages built from groupBy, map, union, and join; shaded squares mark cached data partitions.)

https://databricks-training.s3.amazonaws.com/slides/advanced-spark-training.pdf

What is an RDD? Resilient Distributed Dataset: a big collection of data with the following properties: immutable, distributed, lazily evaluated, type inferred, cacheable. https://www.slideshare.net/datamantra/anatomy-of-rdd

Pseudo-Monad
An RDD wraps an iterator plus the distribution of partitions, keeps track of its history for fault tolerance, and is lazily evaluated, allowing chaining of expressions. https://www.slideshare.net/deanchen11/scala-bay-spark-talk

Partitions
A partition is a logical division of data, a concept derived from Hadoop MapReduce. All input, intermediate, and output data is represented as partitions. Partitions are the basic unit of parallelism; RDD data is just a collection of partitions.
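A small sketch showing that an RDD is backed by a set of partitions whose count drives parallelism. It assumes an existing SparkContext sc; the path is left as a placeholder:

    val data = sc.textFile("hdfs://...", minPartitions = 8)
    println(data.partitions.length)   // number of partitions backing this RDD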

Partition from Input Data
(Figure: the input format maps chunks of data in HDFS (Chunk 1, Chunk 2, Chunk 3) to RDD partitions (Partition 1, Partition 2, Partition 3).)

Partition and Immutability
All partitions are immutable; every transformation generates a new partition. Partition immutability is driven by the underlying storage, such as HDFS, and it is what allows for fault recovery.

Partitions and Distribution
Partitions derived from HDFS are distributed by default. Partitions are also location-aware, and this location awareness allows for data locality. For computed data, caching lets us distribute it in memory as well.

Accessing Partitions
We can access a whole partition at a time rather than a single row; the mapPartitions API of RDD allows this. Accessing a partition at a time enables partition-wise operations that cannot be done by accessing a single row at a time.
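A sketch of a partition-wise operation with mapPartitions, assuming an existing SparkContext sc. The expensiveSetup helper is hypothetical; the point is that it runs once per partition rather than once per row:

    def expensiveSetup(): String => Int = _.length   // stand-in for a per-partition resource

    val lengths = sc.parallelize(Seq("a", "bb", "ccc"), numSlices = 3)
      .mapPartitions { rows =>
        val f = expensiveSetup()   // executed once per partition
        rows.map(f)                // then applied to every row in that partition
      }
    lengths.collect().foreach(println)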

Partitioning of Transformed Data
Partitioning will be different for key/value pairs generated by a shuffle operation. Partitioning is driven by the partitioner specified; by default a HashPartitioner is used, but you can also use your own partitioner.

Hash Partitioning
(Figure: records in the input RDD's partitions (Partition 1, 2, 3) are routed to the output RDD's partitions (Partition 1, 2) by hash(key) % number of partitions.)

Custom Partitioner
A custom partitioner partitions the data according to your data structure. Custom partitioning gives control over the number of partitions and over how data is distributed across them when grouping or reducing is done, as sketched below.
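A sketch of a custom partitioner, assuming string keys and an existing SparkContext sc. The FirstLetterPartitioner class and its routing rule are made up for illustration:

    import org.apache.spark.Partitioner

    class FirstLetterPartitioner(parts: Int) extends Partitioner {
      override def numPartitions: Int = parts
      override def getPartition(key: Any): Int =
        math.abs(key.toString.headOption.getOrElse(' ').toInt) % parts
    }

    val pairs    = sc.parallelize(Seq(("apple", 1), ("avocado", 2), ("banana", 3)))
    val byLetter = pairs.partitionBy(new FirstLetterPartitioner(2))  // instead of the default HashPartitioner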

Lookup Operation
Partitioning allows faster lookups. The lookup operation retrieves the values for a given key. Using the partitioner, lookup determines which partition to search and then only needs to scan that partition. If no partitioner is specified, it falls back to a filter over all partitions.
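A small sketch of lookup on a pre-partitioned pair RDD, assuming an existing SparkContext sc; because the partitioner is known, only the partition that can hold the key is scanned:

    import org.apache.spark.HashPartitioner

    val users = sc.parallelize(Seq((1, "ann"), (2, "bob"), (3, "cat")))
      .partitionBy(new HashPartitioner(4))
      .cache()

    val hits = users.lookup(2)   // Seq("bob"), without scanning every partition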

Parent (Dependency)
Each RDD has access to its parent RDD; for the first RDD the parent is Nil. Before computing its own value, an RDD always computes its parent. This chain of computation is what allows for laziness.
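A sketch of the parent chain, assuming an existing SparkContext sc: each transformation wraps the previous RDD, and toDebugString prints that lineage back to the first, parent-less RDD:

    val base    = sc.parallelize(1 to 10)      // first RDD: its parent is Nil
    val doubled = base.map(_ * 2)              // parent: base
    val evens   = doubled.filter(_ % 4 == 0)   // parent: doubled

    println(evens.toDebugString)               // prints the lineage back to the first RDD
    println(evens.dependencies.map(_.rdd))     // direct parent RDD(s)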

Subclassing
Each Spark operator creates an instance of a specific subclass of RDD: the map operator results in a MappedRDD, flatMap in a FlatMappedRDD, etc. The subclass allows the RDD to remember the operation performed in the transformation.

RDD Transformations

    val dataRDD  = sc.textFile(args(1))
    val splitRDD = dataRDD.flatMap(value => value.split(" "))

Lineage: Nil <- HadoopRDD <- dataRDD: MappedRDD <- splitRDD: FlatMappedRDD

Compute
compute is the function that evaluates each partition of an RDD. It is an abstract method of RDD; each subclass of RDD, such as MappedRDD or FilteredRDD, has to override this method.
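A hypothetical subclass sketch showing the compute signature (as declared on org.apache.spark.rdd.RDD) being overridden; the DoubledRDD class here is made up for illustration:

    import org.apache.spark.{Partition, TaskContext}
    import org.apache.spark.rdd.RDD

    // Abstract member on RDD[T]: def compute(split: Partition, context: TaskContext): Iterator[T]
    class DoubledRDD(parent: RDD[Int]) extends RDD[Int](parent) {
      override def getPartitions: Array[Partition] = parent.partitions
      override def compute(split: Partition, context: TaskContext): Iterator[Int] =
        parent.iterator(split, context).map(_ * 2)   // evaluate the parent's partition, then transform it
    }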

RDD Actions

    val dataRDD    = sc.textFile(args(1))
    val flatMapRDD = dataRDD.flatMap(value => value.split(" "))
    flatMapRDD.collect()

Evaluation chain: Nil <- HadoopRDD (compute) <- MappedRDD (compute) <- FlatMappedRDD (compute) <- runJob

runJob API
The runJob API of RDD is the API used to implement actions. runJob takes each partition and lets you evaluate it; all Spark actions internally use the runJob API.
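A sketch of calling SparkContext.runJob directly, assuming an existing SparkContext sc: a function is evaluated over each partition and one result per partition comes back to the driver:

    val nums = sc.parallelize(1 to 100, numSlices = 4)

    // One Int per partition: the local sum of that partition.
    val perPartitionSums: Array[Int] =
      sc.runJob(nums, (iter: Iterator[Int]) => iter.sum)

    println(perPartitionSums.mkString(", "))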

Caching
cache internally uses the persist API; persist sets a specific storage level for a given RDD. The Spark context tracks persistent RDDs. When the RDD is first evaluated, its partitions are put into memory by the block manager.
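A minimal sketch of the cache/persist relationship, assuming an existing SparkContext sc and a placeholder input path:

    import org.apache.spark.storage.StorageLevel

    val logs = sc.textFile("hdfs://...")
    logs.persist(StorageLevel.MEMORY_ONLY)   // cache() is shorthand for exactly this level

    logs.count()   // first action: partitions are computed and stored by the block manager
    logs.count()   // later actions read the cached blocks instead of recomputing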

Block Manager
The block manager handles all in-memory data in Spark. It is responsible for cached data (BlockRDD), shuffle data, and broadcast data. A partition is stored in a block with id (RDD.id, partition_index).

How Caching Works
The partition iterator checks the storage level; if a storage level is set, it calls CacheManager.getOrCompute(partition). Because the iterator runs for every RDD evaluation, caching is transparent to the user.

A Brief History: Spark The State of Spark, and Where We're Going Next Matei Zaharia Spark Summit (2013) youtu.be/nu6vo2ejab4

Fault Recovery
RDDs track the graph of transformations that built them (their lineage) to rebuild lost data. E.g.:

    messages = textFile(...).filter(_.contains("error")).map(_.split("\t")(2))

Lineage: HadoopRDD (path = hdfs://...) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split(...))

Fault Recovery Results
(Figure: iteration time in seconds over 10 iterations; iteration 1 takes 119 s, the iteration where a failure happens takes 81 s, and the remaining iterations take 56-59 s.)