Fast, Interactive, Language-Integrated Cluster Computing

Spark Fast, Interactive, Language-Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica www.spark-project.org UC BERKELEY

Project Goals Extend the MapReduce model to better support two common classes of analytics apps: » iterative algorithms (machine learning, graphs) » interactive data mining Enhance programmability: » integrate into the Scala programming language » allow interactive use from the Scala interpreter

Motivation Most current cluster programming models are based on acyclic data flow from stable storage to stable storage [diagram: Input → Map → Reduce → Output]

Motivation Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures

Motivation Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data: » iterative algorithms (machine learning, graphs) » interactive data mining tools (R, Excel, Python) With current frameworks, apps reload data from stable storage on each query

Solution: Resilient Distributed Datasets (RDDs) Allow apps to keep working sets in memory for efficient reuse Retain the attractive properties of MapReduce: » fault tolerance, data locality, scalability Support a wide range of applications

RDD - Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (2012) - Most machine learning algorithms require iterative computation. - The iterations on MapReduce incur large overhead between Map and Reduce: - Data replication - Disk I/O - Serialization

RDD - The iterations are computationally expensive because Hadoop uses HDFS for sharing data between them - HDFS incurs frequent file I/O, which is slow - Solutions: - Reduce the use of file I/O - Use RAM

RDD - Using RAM is much more efficient - However, how do we handle fault tolerance? - Reload the data into memory after every failure? - Instead of updating data in RAM in place, make all data in RAM read-only.

RDD - Designed around lineage and a Directed Acyclic Graph (DAG) - An RDD records the full history of how its data was computed - When a fault occurs, Spark checks the data's lineage and recomputes the lost data from it Fast recovery - All computation is stored as a DAG, so recovery is efficient.

RDD - Lineage and DAG

Outline Spark programming model Implementation Demo User applications

Programming Model Resilient distributed datasets (RDDs) » Immutable, partitioned collections of objects » Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage » Can be cached for efficient reuse Actions on RDDs » count, reduce, collect, save, …
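The slides' examples are in Scala on Spark; as a rough illustration of the same data model, here is a minimal plain-Python sketch (no Spark involved) of lazy transformations that only execute when an action is called. The LazyDataset class and its internals are hypothetical, chosen only to mirror the RDD API shape.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them lazily."""
    def __init__(self, data, ops=()):
        self._data = data      # base collection (stands in for stable storage)
        self._ops = list(ops)  # recorded transformations (the lineage)

    # Transformations: return a new dataset, nothing executes yet.
    def map(self, f):
        return LazyDataset(self._data, self._ops + [("map", f)])

    def filter(self, p):
        return LazyDataset(self._data, self._ops + [("filter", p)])

    def _run(self):
        items = self._data
        for kind, f in self._ops:
            if kind == "map":
                items = [f(x) for x in items]
            else:
                items = [x for x in items if f(x)]
        return items

    # Actions: these actually execute the recorded transformations.
    def count(self):
        return len(self._run())

    def collect(self):
        return self._run()

lines = LazyDataset(["ERROR\tdisk full", "INFO\tok", "ERROR\ttimeout"])
errors = lines.filter(lambda l: l.startswith("ERROR"))  # nothing runs yet
msgs = errors.map(lambda l: l.split("\t")[1])
print(msgs.count())    # 2
print(msgs.collect())  # ['disk full', 'timeout']
```

Because the dataset is immutable and only the recipe is recorded, the same chain can be re-run at will, which is exactly what makes lineage-based recovery possible.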

Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")           // Base RDD
errors = lines.filter(_.startsWith("ERROR"))   // Transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
...

[diagram: driver ships tasks to workers; each worker scans its block (Block 1-3) and keeps results in its cache (Cache 1-3)]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

RDD Fault Tolerance RDDs maintain lineage information that can be used to reconstruct lost partitions Ex: messages = textFile(...).filter(_.startsWith("ERROR")).map(_.split('\t')(2))

[diagram: HDFS File → filter (func = _.contains(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD]
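As a hedged sketch of the recovery idea (plain Python, not Spark's actual implementation): if a computed partition is lost, it can be rebuilt by replaying the recorded transformations against the still-available base data. The recompute helper and the lineage encoding here are illustrative assumptions.

```python
# Lineage for the slide's example: a filter, then a map.
base = ["ERROR\tio\tdisk", "WARN\tnet\tretry", "ERROR\tio\ttimeout"]
lineage = [
    ("filter", lambda l: l.startswith("ERROR")),
    ("map", lambda l: l.split("\t")[2]),
]

def recompute(partition, lineage):
    """Rebuild a lost partition by replaying its lineage over base data."""
    for kind, f in lineage:
        if kind == "filter":
            partition = [x for x in partition if f(x)]
        else:
            partition = [f(x) for x in partition]
    return partition

messages = recompute(base, lineage)   # first computation
messages = None                       # simulate losing the cached partition
messages = recompute(base, lineage)   # recovery: replay the lineage
print(messages)  # ['disk', 'timeout']
```

The key point the slide makes is that only the lineage (a small recipe) needs to be stored reliably, not a replica of the in-memory data itself.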

Word Count Use a few transformations to build a dataset of (String, Int) pairs called counts and then save it to a file. The map operation produces exactly one output value for each input value, whereas the flatMap operation produces an arbitrary number (zero or more) of values for each input value
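To make the map vs flatMap distinction concrete, here is a small plain-Python analogue (the deck's real Word Count is Scala on Spark; the data and variable names below are made up for illustration):

```python
from collections import Counter

lines = ["to be or", "not to be"]

# map: exactly one output per input — each line becomes one list of words.
mapped = [line.split() for line in lines]
# → [['to', 'be', 'or'], ['not', 'to', 'be']]  (still nested)

# flatMap: zero or more outputs per input — the word lists are flattened
# into a single stream of words.
words = [w for line in lines for w in line.split()]
# → ['to', 'be', 'or', 'not', 'to', 'be']

# Build the (word, count) pairs, as in the Word Count example.
counts = Counter(words)
print(counts["to"])  # 2
```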

Pi estimation This code estimates π by "throwing darts" at a circle. We pick random points in the unit square ((0, 0) to (1,1)) and see how many fall in the unit circle. The fraction should be π / 4, so we use this to get our estimate.
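A minimal sequential Python version of the same Monte Carlo estimate (the slide's actual code parallelizes the sampling across the cluster; the function name and seed here are illustrative):

```python
import random

def estimate_pi(num_samples, seed=0):
    """Throw darts at the unit square ((0,0) to (1,1)); the fraction
    landing inside the quarter-circle of radius 1 approximates pi/4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples

print(estimate_pi(100_000))  # roughly 3.14
```

The sampling is embarrassingly parallel, which is why it maps so naturally onto a parallelized RDD transformation followed by a count.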

Logistic Regression Performance [chart: running time (s) vs number of iterations, 1-30] Hadoop: 127 s / iteration Spark: first iteration 174 s, further iterations 6 s

Spark Applications In-memory data mining on Hive data (Conviva) Predictive analytics (Quantifind) City traffic prediction (Mobile Millennium) Twitter spam classification (Monarch) Collaborative filtering via matrix factorization

Interactive Spark Modified Scala interpreter to allow Spark to be used interactively from the command line Required two changes: » Modified wrapper code generation so that each line typed has references to objects for its dependencies » Distribute generated classes over the network

Conclusion Spark provides a simple, efficient, and powerful programming model for a wide range of apps Download our open source release: www.spark-project.org matei@berkeley.edu

Related Work DryadLINQ, FlumeJava » Similar distributed collection API, but cannot reuse datasets efficiently across queries Relational databases » Lineage/provenance, logical logging, materialized views GraphLab, Piccolo, BigTable, RAMCloud » Fine-grained writes similar to distributed shared memory Iterative MapReduce (e.g. Twister, HaLoop) » Implicit data sharing for a fixed computation pattern Caching systems (e.g. Nectar) » Store data in files, no explicit control over what is cached

Behavior with Not Enough RAM [chart: iteration time (s) vs % of working set in memory] Cache disabled: 68.8 s 25%: 58.1 s 50%: 40.7 s 75%: 29.7 s Fully cached: 11.5 s

Fault Recovery Results [chart: iteration time (s) for iterations 1-10, comparing a no-failure run against a run with a failure in the 6th iteration; most iterations take 56-59 s, rising to 81-119 s around the failure while lost partitions are reconstructed from lineage]