
Apache Spark is a fast and general-purpose engine for large-scale data processing. Spark aims at achieving the following goals in the Big Data context:
- Generality: diverse workloads, operators, and job sizes
- Low latency: sub-second
- Fault tolerance: faults are the norm, not the exception
- Simplicity: often comes from generality
Spark was originally developed at the University of California, Berkeley's AMPLab.

With MapReduce, iterative jobs involve a lot of disk I/O at each iteration and stage.
[Figure: two consecutive MapReduce stages (Stage 1, Stage 2), each with its own mappers and reducers.]

Disk I/O is very slow (even if it is local I/O).
Motivation: using MapReduce for complex iterative jobs, or for multiple jobs on the same data, involves a lot of disk I/O.
Opportunity: the cost of main memory has decreased, so large main memories are available in each server.
Solution: keep more data in main memory. This is the basic idea of Spark.

MapReduce, iterative job: the data are written to and read from disk between iterations (iteration 1, iteration 2, ...).
Spark, iterative job: the data (or at least part of them) are shared between the iterations through main memory, which is 10 to 100 times faster than disk.
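To make this concrete, here is a minimal Scala sketch of an iterative job over cached data (a toy gradient-descent fit on synthetic points; all names and values are illustrative and not from the original slides):

    import org.apache.spark.{SparkConf, SparkContext}

    object IterativeSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("IterativeSketch").setMaster("local[*]"))
        // (x, y) points; cache() keeps them in the executors' memory after the first pass
        val points = sc.parallelize(Seq((1.0, 2.1), (2.0, 3.9), (3.0, 6.2))).cache()
        var w = 0.0
        for (_ <- 1 to 10) {
          // every iteration after the first reads `points` from memory, not from disk
          val gradient = points.map { case (x, y) => (w * x - y) * x }.mean()
          w -= 0.1 * gradient
        }
        println(s"fitted slope: $w")
        sc.stop()
      }
    }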

MapReduce, multiple analyses of the same data: every query (query 1, query 2, query 3, ...) reads the input from disk again to compute its result (result 1, result 2, result 3, ...).
Spark, multiple analyses of the same data: the data are read from disk only once and stored in main memory, split across the main memory of the servers (distributed memory); all subsequent queries operate on the in-memory copy.
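A minimal sketch of this reuse pattern, assuming a hypothetical local log file access.log:

    import org.apache.spark.{SparkConf, SparkContext}

    object MultiQuerySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("MultiQuerySketch").setMaster("local[*]"))
        val logs = sc.textFile("access.log").cache() // hypothetical input file
        // the first action reads from disk and materializes `logs` in memory;
        // the second is served from the in-memory copy
        val errors = logs.filter(_.contains("ERROR")).count()
        val warnings = logs.filter(_.contains("WARN")).count()
        println(s"errors: $errors, warnings: $warnings")
        sc.stop()
      }
    }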

Data are represented as Resilient Distributed Datasets (RDDs):
- Partitioned/distributed collections of objects spread across the nodes of a cluster
- Stored in main memory (when it is possible) or on local disk
Spark programs are written in terms of operations on resilient distributed datasets. RDDs are built and manipulated through a set of parallel:
- Transformations: map, filter, join, ...
- Actions: count, collect, save, ...
RDDs are automatically rebuilt on machine failure.
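A minimal sketch of this programming model (the input and output paths are hypothetical examples); note that transformations are lazy and only actions trigger execution:

    import org.apache.spark.{SparkConf, SparkContext}

    object RddSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("RddSketch").setMaster("local[*]"))
        val lines = sc.textFile("input.txt")           // hypothetical input file
        val errors = lines.filter(_.contains("ERROR")) // transformation: lazily defines a new RDD
        val lengths = errors.map(_.length)             // transformation: still nothing executed
        println(lengths.count())                       // action: triggers the actual computation
        errors.saveAsTextFile("errors-out")            // action: writes one file per partition
        sc.stop()
      }
    }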

Spark provides a programming abstraction (based on RDDs) and transparent mechanisms to execute code in parallel on RDDs. It hides the complexities of fault tolerance and slow machines, and manages scheduling and synchronization of the jobs.

                          Hadoop MapReduce   Spark
Storage                   Disk only          In-memory or on disk
Operations                Map and Reduce     Map, Reduce, Join, Sample, etc.
Execution model           Batch              Batch, interactive, streaming
Programming environments  Java               Scala, Java, Python, and R

Spark has lower overhead for starting jobs and less expensive shuffles. Reported running times for two iterative machine learning algorithms:
- K-means clustering: 121 sec with Hadoop MR vs. 4.1 sec with Spark
- Logistic regression: 80 sec with Hadoop MR vs. 0.96 sec with Spark

Spark set the Daytona Gray 100 TB sort benchmark record (tied for 1st place).

The Spark stack:
- High-level components: Spark SQL (structured data), Spark Streaming (real-time), MLlib (machine learning and data mining), GraphX (graph processing)
- Spark Core
- Schedulers: Standalone Spark Scheduler, YARN Scheduler (the same used by Hadoop), Mesos
Spark is based on a basic component (the Spark Core component) that is exploited by all the high-level data analytics components. This provides a more uniform and efficient solution than Hadoop, where many non-integrated tools are available. When the efficiency of the core component is increased, the efficiency of the other high-level components increases as well.

Spark Core contains the basic functionalities of Spark, exploited by all components:
- Task scheduling
- Memory management
- Fault recovery
It provides the APIs that are used to create RDDs and to apply transformations and actions on them.

Spark SQL (structured data) is used to interact with structured datasets by means of the SQL language. It also supports the Hive Query Language (HQL), and it interacts with many data sources: Hive tables, Parquet, JSON.
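A minimal Spark SQL sketch, using the SparkSession entry point of newer Spark versions and a hypothetical JSON file people.json with name and age fields:

    import org.apache.spark.sql.SparkSession

    object SqlSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("SqlSketch").master("local[*]").getOrCreate()
        val people = spark.read.json("people.json") // hypothetical input file
        people.createOrReplaceTempView("people")    // register the dataset for SQL queries
        spark.sql("SELECT name FROM people WHERE age > 30").show()
        spark.stop()
      }
    }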

Spark Streaming (real-time) is used to process live streams of data in real time. The APIs of the streaming component operate on RDDs and are similar to the ones used to process standard RDDs associated with static data sources.

MLlib is a machine learning/data mining library. It can be used to apply parallel versions of some machine learning/data mining algorithms:
- Data preprocessing and dimensionality reduction
- Classification algorithms
- Clustering algorithms
- Itemset mining
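A minimal Spark Streaming sketch (the classic word count over a socket; the host and port are assumptions for illustration):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(1))    // one micro-batch per second
        val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
        // DStream operations look like ordinary RDD transformations
        val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }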

GraphX is a graph processing library. It provides many algorithms for manipulating graphs, such as subgraph searching and PageRank.

Spark can exploit many schedulers to execute its applications (see the sketch below):
- Hadoop YARN: the standard scheduler of Hadoop
- Mesos: another popular cluster scheduler
- Standalone Spark Scheduler: a simple cluster scheduler included in Spark
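The scheduler is selected through the application's master URL; a sketch follows, where host names and ports are placeholders:

    import org.apache.spark.SparkConf

    object SchedulerSketch {
      // The same application code runs under any of the cluster managers;
      // only the master URL changes. Host names and ports are placeholders.
      val standalone = new SparkConf().setMaster("spark://master-host:7077") // Standalone Spark Scheduler
      val mesos      = new SparkConf().setMaster("mesos://mesos-host:5050")  // Mesos
      val yarn       = new SparkConf().setMaster("yarn")                     // Hadoop YARN (Spark 2.x URL)
      val local      = new SparkConf().setMaster("local[*]")                 // local testing, no cluster
    }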