Beyond MapReduce: Apache Spark Antonino Virgillito


Why Spark?
Most machine learning algorithms are iterative, since each iteration can improve the results. With a disk-based approach, each iteration's output is written to disk, which makes it slow.
Diagrams: Hadoop execution flow vs. Spark execution flow. Spark is not tied to the two-stage MapReduce paradigm:
1. Extract a working set
2. Cache it
3. Query it repeatedly
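A minimal sketch of this "extract, cache, query repeatedly" pattern, assuming a local master; the input file path is a placeholder:

import org.apache.spark.{SparkConf, SparkContext}

object WorkingSetSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WorkingSetSketch").setMaster("local[2]"))

    // 1. Extract a working set (placeholder input file)
    val errors = sc.textFile("logs.txt").filter(_.contains("ERROR"))
    // 2. Cache it so repeated queries reuse the in-memory data instead of re-reading from disk
    errors.cache()
    // 3. Query it repeatedly; only the first action pays the loading cost
    val total = errors.count()
    val timeouts = errors.filter(_.contains("timeout")).count()
    println("total errors: " + total + ", timeouts: " + timeouts)

    sc.stop()
  }
}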

About Apache Spark
Initially started at UC Berkeley in 2009.
A fast and general-purpose cluster computing system: 10x (on disk) to 100x (in-memory) faster than Hadoop MapReduce.
Most popular for running iterative machine learning algorithms.
Provides high-level APIs in three programming languages: Scala, Java and Python.
Integrates with Hadoop and its ecosystem and can read existing data on HDFS.

Spark Stack
Spark SQL: SQL and structured data processing
MLlib: machine learning algorithms
GraphX: graph processing
Spark Streaming: stream processing of live data streams
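A minimal Spark SQL sketch (Spark 1.x-era API), shown only as an illustration; the Person case class, the people.txt path and the existing SparkContext sc are assumptions:

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Turn a text file of "name,age" lines into a table and query it with SQL
val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
people.registerTempTable("people")

sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19").collect().foreach(println)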

Execution Flow

Terminology
Application Jar: the user program and its dependencies (except the Hadoop and Spark jars) bundled into a jar file
Driver Program: the process that starts the execution (runs the main() function)
Cluster Manager: an external service that manages resources on the cluster (standalone manager, YARN, Apache Mesos)
Deploy Mode:
cluster: the driver runs inside the cluster
client: the driver runs outside the cluster

Terminology (contd.)
Worker Node: a node that runs the application code in the cluster
Executor: a process launched on a worker node that runs the tasks and keeps data in memory or on disk storage
Task: a unit of work that is sent to an executor
Job: consists of multiple tasks; created in response to an action
SparkContext: represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster
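A minimal sketch, assuming a local master, of how these terms map to code: the SparkContext is the connection to the cluster, transformations only describe work, and the count() action submits a job whose tasks (one per partition) run on the executors.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("TerminologySketch").setMaster("local[2]")
val sc = new SparkContext(conf)          // connection to the (local) cluster

val data = sc.parallelize(1 to 1000, 4)  // RDD with 4 partitions -> 4 tasks per stage
val evens = data.filter(_ % 2 == 0)      // transformation: no job submitted yet
println(evens.count())                   // action: submits a job made of tasks

sc.stop()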

Resilient Distributed Datasets
Resilient Distributed Datasets (RDDs) are the distributed memory abstraction that lets programmers perform in-memory parallel computations on large clusters in a highly fault-tolerant manner. This is the main concept around which the whole Spark framework revolves.
There are currently two types of RDDs:
Parallelized collections: created by calling the parallelize method on an existing Scala collection. The developer can specify the number of slices to cut the dataset into; ideally 2-3 slices per CPU.
Hadoop datasets: distributed datasets created from any file stored on HDFS or other storage systems supported by Hadoop (S3, HBase, etc.). These are created using SparkContext's textFile method. The default number of slices in this case is one slice per file block.
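A short illustration of the two RDD types, assuming an existing SparkContext sc (e.g. in the spark-shell); the HDFS path is hypothetical:

// Parallelized collection: cut into 8 slices (roughly 2-3 per CPU core)
val parallel = sc.parallelize(1 to 10000, 8)
println(parallel.partitions.length)      // 8

// Hadoop dataset: by default one slice per HDFS block of the file
val lines = sc.textFile("hdfs://namenode:9000/data/input.txt")
println(lines.partitions.length)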

Cluster Deployment
Standalone Deploy Mode: the simplest way to deploy Spark on a private cluster
Amazon EC2: EC2 scripts are available; launching a new cluster is very quick
Hadoop YARN
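A minimal sketch of pointing an application at a standalone cluster manager; the master hostname is hypothetical (7077 is the usual standalone default port). On YARN the master is typically set through spark-submit rather than hard-coded.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("DeploymentSketch")
  .setMaster("spark://master-host:7077")  // standalone cluster manager (hypothetical host)
val sc = new SparkContext(conf)

println(sc.parallelize(1 to 100).sum())
sc.stop()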

Spark Shell
./bin/spark-shell --master local[2]
The --master option specifies the master URL for a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing.
./bin/run-example SparkPi 10
This will run 10 iterations to calculate the value of Pi.

Basic operations
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
scala> textFile.count() // Number of items in this RDD
res0: Long = 126
scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09
Simpler - single liner:
scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15

Map - Reduce
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Int = 15
scala> import java.lang.Math
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res5: Int = 15
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8
scala> wordCounts.collect()

With Caching
scala> linesWithSpark.cache()
res7: spark.RDD[String] = spark.FilteredRDD@17e51082
scala> linesWithSpark.count()
res8: Long = 15
scala> linesWithSpark.count()
res9: Long = 15

With HDFS
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(line => line.startsWith("ERROR"))
println("Total errors: " + errors.count())

Standalone (Java)
/* SimpleApp.java */
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "YOUR_SPARK_HOME/README.md"; // Should be some file on your system
    SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> logData = sc.textFile(logFile).cache();

    long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

    long numBs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("b"); }
    }).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
  }
}

Job Submission
$SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.10/simple-project_2.10-1.0.jar

Configuration
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("CountingSheep")
  .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)