An Introduction to Apache Spark


History. Developed in 2009 at UC Berkeley's AMPLab and open sourced in 2010, Spark has become one of the largest big-data projects, with more than 400 contributors from 50+ organizations, including Databricks, Yahoo!, Intel, Cloudera, and IBM.

What is Spark? A fast and general cluster computing system, interoperable with Hadoop datasets.

What are Spark's improvements? Spark improves efficiency through in-memory computing primitives and general computation graphs. It improves usability through rich APIs in Scala, Java, and Python, and an interactive shell (Scala/Python).

In general, a MapReduce computation is a DAG.

MapReduce. MapReduce is great for single-pass batch jobs, but many use cases need to run MapReduce in a multi-pass manner...

What improvements did Spark make over MapReduce? Spark improves MapReduce's performance for multi-pass analytics, interactive queries, and real-time distributed computation on top of Hadoop. Note: Spark is a Hadoop successor.

How did Spark do it? Wise data sharing!

Data Sharing in Hadoop MapReduce

Data Sharing in Spark

Data Sharing in Spark. 10-100x faster than network and disk!
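The payoff of in-memory data sharing can be illustrated with a small, self-contained sketch (plain Python, no Spark required): an iterative job that re-reads its input on every pass, MapReduce-style, versus one that caches it in memory after the first read, Spark-style. The load_from_disk counter below is a hypothetical stand-in for an expensive HDFS read.

```python
# Simulate data sharing: MapReduce-style jobs re-read input on every
# pass, while Spark-style jobs cache it in memory after the first read.

disk_reads = 0

def load_from_disk():
    """Stand-in for an expensive read from HDFS."""
    global disk_reads
    disk_reads += 1
    return list(range(10))

def iterative_job_without_cache(passes):
    results = []
    for _ in range(passes):
        data = load_from_disk()          # re-read on every pass
        results.append(sum(x * x for x in data))
    return results

def iterative_job_with_cache(passes):
    cached = load_from_disk()            # read once, keep in memory
    return [sum(x * x for x in cached) for _ in range(passes)]

iterative_job_without_cache(3)
reads_mapreduce_style = disk_reads       # 3 reads for 3 passes

disk_reads = 0
iterative_job_with_cache(3)
reads_spark_style = disk_reads           # 1 read for 3 passes

print(reads_mapreduce_style, reads_spark_style)
```

The disk-read count grows with the number of passes in the first version and stays at one in the second; on a real cluster that difference is what the 10-100x claim refers to.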

Spark Programming Model. At a high level, every Spark application consists of a driver program that runs the user's main function. Spark encourages you to write programs in terms of transformations on distributed datasets.

Spark Programming Model. The main abstraction Spark provides is the resilient distributed dataset (RDD): a collection of elements partitioned across the cluster (in memory or on disk), which can be accessed and operated on in parallel (map, filter, ...) and is automatically rebuilt on failure.
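The "partitioned collection" idea can be sketched in a few lines (plain Python, no Spark; the helper names partition, map_partitions, and collect are hypothetical): the dataset is split into chunks, each chunk can be processed independently, and the results are gathered back.

```python
# Sketch of an RDD-like partitioned collection: the data is split
# into chunks ("partitions"); each partition is processed
# independently, as it would be on separate cluster nodes.

def partition(data, num_partitions):
    """Split a list into roughly equal chunks."""
    size = (len(data) + num_partitions - 1) // num_partitions
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_partitions(partitions, fn):
    """Apply fn to every element of every partition independently."""
    return [[fn(x) for x in part] for part in partitions]

def collect(partitions):
    """Gather all partitions back into a single list at the driver."""
    return [x for part in partitions for x in part]

parts = partition(list(range(8)), num_partitions=4)   # [[0,1],[2,3],[4,5],[6,7]]
squared = map_partitions(parts, lambda x: x * x)
print(collect(squared))   # [0, 1, 4, 9, 16, 25, 36, 49]
```

What the sketch cannot show is the "rebuilt on failure" part: a real RDD also remembers the lineage of transformations that produced each partition, so a lost partition can be recomputed from its inputs.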

Spark Programming Model. RDDs support two kinds of operations. Transformations create a new dataset from an existing one (example: map()). Actions return a value to the driver program after running a computation on the dataset (example: reduce()).
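The transformation/action split can be sketched with a tiny stand-in class (plain Python, no Spark; LazyDataset is a hypothetical name, not a Spark API): map and filter only record what to do, and nothing actually runs until an action such as reduce forces evaluation.

```python
from functools import reduce as _reduce

class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded lazily,
    actions trigger the actual computation."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []

    # Transformations: return a new dataset, compute nothing yet.
    def map(self, fn):
        return LazyDataset(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        return LazyDataset(self._data, self._ops + [("filter", pred)])

    def _evaluate(self):
        items = self._data
        for kind, fn in self._ops:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    # Action: forces evaluation and returns a value to the "driver".
    def reduce(self, fn):
        return _reduce(fn, self._evaluate())

ds = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 10)
print(ds.reduce(lambda a, b: a + b))  # 90  (20 + 30 + 40)
```

Laziness is what lets a real Spark scheduler see the whole chain of transformations at once and plan an efficient execution graph before anything runs.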


Spark Programming Model. Another abstraction is shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which workers can only add to.
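The two kinds of shared variables can be sketched as follows (plain Python, single-process stand-ins, not Spark's API): a broadcast value is read-only state shipped once to every worker, while an accumulator is write-only for the workers, and only the driver reads its final value.

```python
# Single-process sketch of Spark's shared variables: a broadcast
# value is read-only on the workers; an accumulator is add-only on
# the workers and read by the driver at the end.

class Broadcast:
    def __init__(self, value):
        self._value = value          # shipped once, cached on every node
    @property
    def value(self):
        return self._value

class Accumulator:
    def __init__(self, initial=0):
        self._total = initial
    def add(self, amount):           # workers may only add
        self._total += amount
    @property
    def value(self):                 # driver reads the final value
        return self._total

lookup = Broadcast({"a": 1, "b": 2})
bad_records = Accumulator()

def process(record):
    if record not in lookup.value:   # read-only use of the broadcast
        bad_records.add(1)
        return 0
    return lookup.value[record]

results = [process(r) for r in ["a", "b", "x", "a"]]
print(results, bad_records.value)   # [1, 2, 0, 1] 1
```

Restricting accumulators to addition is what makes them safe to update from many tasks in parallel on a real cluster.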


Ease of Use. Spark offers over 80 high-level operators that make it easy to build parallel apps, plus Scala and Python shells for using it interactively.

A General Stack

Apache Spark Core

Apache Spark Core. Spark Core is the general engine for the Spark platform. Its in-memory computing capabilities deliver speed, its general execution model supports a wide range of use cases, and it eases development with native APIs in Java, Scala, and Python (plus SQL, Clojure, and R).

Spark SQL

Spark SQL is Spark's module for working with structured data, letting programs mix SQL queries with the rest of the Spark API.

Spark Streaming

Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications.


Spark Streaming example (Scala):

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()

A DStream is a sequence of distributed datasets (RDDs) representing a distributed stream of data. flatMap is a DStream transformation: it modifies the data in one DStream to create another DStream. window(Minutes(1), Seconds(1)) is a sliding window operation with a window length of one minute and a sliding interval of one second.
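The window-plus-countByValue pipeline can be simulated without Spark (plain Python sketch; the batch data and window size are illustrative): each incoming batch of hashtags joins a sliding window of the most recent batches, and the tags inside the window are counted.

```python
from collections import Counter, deque

def sliding_window_counts(batches, window_batches):
    """For each new batch, count tag occurrences over the last
    `window_batches` batches -- a stand-in for
    hashTags.window(...).countByValue() on a DStream."""
    window = deque(maxlen=window_batches)   # old batches fall out
    results = []
    for batch in batches:
        window.append(batch)
        counts = Counter(tag for b in window for tag in b)
        results.append(dict(counts))
    return results

batches = [["#spark"], ["#spark", "#bigdata"], ["#bigdata"]]
counts = sliding_window_counts(batches, window_batches=2)
print(counts)
# the last result counts only the two most recent batches
```

In real Spark Streaming the window is expressed in time (one minute of batches, sliding every second) rather than in a fixed number of batches, but the counting logic is the same.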

MLlib

MLlib. MLlib is Spark's scalable machine learning library. It works with any Hadoop data source, such as HDFS, HBase, and local files.

MLlib algorithms: linear SVM and logistic regression; classification and regression trees; k-means clustering; recommendation via alternating least squares; singular value decomposition; linear regression with L1 and L2 regularization; multinomial naive Bayes; basic statistics; feature transformations.
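To give a flavor of one algorithm on the list, here is a minimal k-means sketch in plain Python (not MLlib's implementation; one-dimensional points and fixed initial centers are chosen for determinism):

```python
# Minimal k-means on 1-D points: assign each point to its nearest
# center, then move each center to the mean of its assigned points.

def kmeans_1d(points, centers, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
final_centers = kmeans_1d(points, centers=[0.0, 5.0])
print(final_centers)  # converges to [2.0, 11.0]
```

MLlib runs the same assign-then-average loop, but distributes the assignment step over the partitions of an RDD so it scales to datasets that do not fit on one machine.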

GraphX

GraphX. GraphX is Spark's API for graphs and graph-parallel computation. It works with both graphs and collections.

GraphX. Its performance is comparable to the fastest specialized graph processing systems.

GraphX algorithms: PageRank, connected components, label propagation, SVD++, strongly connected components, triangle count.
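PageRank, the first algorithm on the list, can be sketched in a few lines of plain Python (power iteration with the usual 0.85 damping factor; not GraphX's implementation, and the example graph is illustrative):

```python
# Minimal PageRank by power iteration on an adjacency-list graph.

def pagerank(graph, damping=0.85, iterations=50):
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # every node keeps a base share, plus rank from in-links
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n, out_links in graph.items():
            if out_links:
                share = damping * rank[n] / len(out_links)
                for m in out_links:
                    new_rank[m] += share
        rank = new_rank
    return rank

# A -> B, A -> C, B -> C, C -> A
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # C gathers the most rank
```

GraphX expresses the same iteration as message passing along the edges of a distributed graph, which is what "graph-parallel computation" refers to.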

Spark Runs Everywhere. Spark runs on Hadoop, on Mesos, standalone, or in the cloud, and accesses diverse data sources including HDFS, Cassandra, HBase, and S3.

Resources: http://spark.apache.org; "Intro to Apache Spark" by Paco Nathan; "Building a Unified Data Pipeline in Spark" by Aaron Davidson.