Analytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation

Similar documents
Cloud, Big Data & Linear Algebra

An Introduction to Apache Spark

Spark Overview. Professor Sasu Tarkoma.

Processing of big data with Apache Spark

2/4/2019 Week 3- A Sangmi Lee Pallickara

DATA SCIENCE USING SPARK: AN INTRODUCTION

CSC 261/461 Database Systems Lecture 24. Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101

RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Integration of Machine Learning Library in Apache Apex

CSE 444: Database Internals. Lecture 23 Spark

08/04/2018. RDDs. RDDs are the primary abstraction in Spark RDDs are distributed collections of objects spread across the nodes of a clusters

Big Data Infrastructures & Technologies

Spark, Shark and Spark Streaming Introduction

Chapter 4: Apache Spark

Big data systems 12/8/17

2/26/2017. RDDs. RDDs are the primary abstraction in Spark RDDs are distributed collections of objects spread across the nodes of a clusters

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

RDDs are the primary abstraction in Spark RDDs are distributed collections of objects spread across the nodes of a clusters

Cloud Computing & Visualization


Apache Spark 2.0. Matei

Reactive App using Actor model & Apache Spark. Rahul Kumar Software

A Tutorial on Apache Spark

Analyzing Flight Data

Spark. Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.

COSC 6339 Big Data Analytics. Introduction to Spark. Edgar Gabriel Fall What is SPARK?

Fast, Interactive, Language-Integrated Cluster Computing

Introduction to Apache Spark

Resilient Distributed Datasets

Cloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018

Big Data Analytics. C. Distributed Computing Environments / C.2. Resilient Distributed Datasets: Apache Spark. Lars Schmidt-Thieme

Backtesting with Spark

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications

MapReduce, Hadoop and Spark. Bompotas Agorakis

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

Agenda. Spark Platform Spark Core Spark Extensions Using Apache Spark

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Summary of Big Data Frameworks Course 2015 Professor Sasu Tarkoma

Analytics in Spark. Yanlei Diao Tim Hunter. Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig

Dell In-Memory Appliance for Cloudera Enterprise

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Twitter data Analytics using Distributed Computing

Distributed Computing with Spark and MapReduce

Big Data Infrastructures & Technologies Hadoop Streaming Revisit.

CS435 Introduction to Big Data FALL 2018 Colorado State University. 10/22/2018 Week 10-A Sangmi Lee Pallickara. FAQs.

CDS. André Schaaff1, François-Xavier Pineau1, Gilles Landais1, Laurent Michel2 de Données astronomiques de Strasbourg, 2SSC-XMM-Newton

Accelerating Spark Workloads using GPUs

Spark: A Brief History.

An Introduction to Apache Spark Big Data Madison: 29 July William Red Hat, Inc.

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

Beyond MapReduce: Apache Spark Antonino Virgillito

Data-intensive computing systems

Unifying Big Data Workloads in Apache Spark

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

An Introduction to Big Data Analysis using Spark

Specialist ICT Learning

Distributed Machine Learning" on Spark

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications

Khadija Souissi. Auf z Systems November IBM z Systems Mainframe Event 2016

Olivia Klose Technical Evangelist. Sascha Dittmann Cloud Solution Architect

Apache Spark Internals

Webinar Series TMIP VISION

An Overview of Apache Spark

SCALABLE, LOW LATENCY MODEL SERVING AND MANAGEMENT WITH VELOX

Turning Relational Database Tables into Spark Data Sources

Principal Software Engineer Red Hat Emerging Technology June 24, 2015

The Evolution of Big Data Platforms and Data Science

Big Graph Processing. Fenggang Wu Nov. 6, 2016

Stream Processing on IoT Devices using Calvin Framework

The Datacenter Needs an Operating System

An Introduction to Apache Spark

Data processing in Apache Spark

/ Cloud Computing. Recitation 13 April 14 th 2015

Big Data Architect.

Article from. CompAct. April 2018 Issue 57

a Spark in the cloud iterative and interactive cluster computing

Hadoop Development Introduction

Batch Processing Basic architecture

Lightning Fast Cluster Computing. Michael Armbrust Reflections Projections 2015 Michast

Outline. CS-562 Introduction to data analysis using Apache Spark

Big Data Analytics using Apache Hadoop and Spark with Scala

Big Data Analytics with Apache Spark. Nastaran Fatemi

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

Distributed Systems. 22. Spark. Paul Krzyzanowski. Rutgers University. Fall 2016

Apache Spark and Scala Certification Training

Data processing in Apache Spark

Distributed Computing with Spark

Today s content. Resilient Distributed Datasets(RDDs) Spark and its data model

Certified Big Data Hadoop and Spark Scala Course Curriculum

A Gentle Introduction to

Lecture 09. Spark for batch and streaming processing FREDERICK AYALA-GÓMEZ. CS-E4610 Modern Database Systems

Data processing in Apache Spark

Lecture 11 Hadoop & Spark

Spark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM

Research on improved K - nearest neighbor algorithm based on spark platform

Applied Spark. From Concepts to Bitcoin Analytics. Andrew F.

Survey on Frameworks for Distributed Computing: Hadoop, Spark and Storm Telmo da Silva Morais

Transcription:

Analytic Cloud with Shelly Garion IBM Research -- Haifa 2014 IBM Corporation

Why Spark? Apache Spark is a fast and general open-source cluster computing engine for big data processing Speed: Spark is capable to run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk Ease of use: Write applications quickly in Java, Scala, Python and R, also with notebooks Generality: Combine SQL, streaming, and complex analytics machine learning, graph processing Runs everywhere: runs on Apache Mesos, Hadoop YARN cluster manager, standalone, or in the cloud, and can read any existing Hadoop data, and data from HDFS, object store, databases etc. 2

History of Spark Started in 2009 as a research project of UC Berkley Now it is an open source Apache project Built by a wide set of developers from over 200 companies more than 1000 developers have contributed to Spark IBM has decided to bet big on Spark at June 2015 Created Spark Technology Center (STC) - http://www.spark.tc/ Spark as a Service on Bluemix 3

How to Analyze BigData? 4

Basic Example: Word Count (Spark & Python) 5 Holden Karau, Making interactive BigData applications fast and easy, Spark Workshop April 2014, http://stanford.edu/~rezab/sparkworkshop/

Basic Example: Word Count (Spark & Scala) 6 Holden Karau, Making interactive BigData applications fast and easy, Spark Workshop April 2014, http://stanford.edu/~rezab/sparkworkshop/

Spark RDD (Resilient Distributed Dataset) Immutable, partitioned collections of objects spread across a cluster, stored in RAM or on Disk Built through lazy parallel transformations Fault tolerance automatically built at failure Partition can be cached RAM var myrdd = sc.sequencefile( hdfs:/// ) myrdd array Partition DISK Partition We can apply Transformations or Actions on RDD Partition 7

Spark Cluster Driver program The process running the main() function of the application and creating the SparkContext Cluster manager External service for acquiring resources on the cluster (e.g. standalone, Mesos, YARN) Worker node - Any node that can run application code in the cluster Executor A process launched for an application on a worker node 8

Spark Scheduler Task - A unit of work that will be sent to one executor Job - A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action Stage - Each job gets divided into smaller sets of tasks called stages that depend on each other 9

Scala Spark was originally written in Scala Java and Python API were added later Scala: high-level language for the JVM Object oriented Functional programming Immutable Inspired by criticism of the shortcomings of Java Static types Comparable in speed to Java Type inference saves us from having to write explicit types most of the time Interoperates with Java Can use any Java class Can be called from Java code 10

Scala vs. Java 11 Holden Karau, Making interactive BigData applications fast and easy, Spark Workshop April 2014, http://stanford.edu/~rezab/sparkworkshop/

Spark & Scala: Creating RDD or SoftLayer object store >sc.textfile("swift://containername.spark/objectname") 12 Holden Karau, Making interactive BigData applications fast and easy, Spark Workshop April 2014, http://stanford.edu/~rezab/sparkworkshop/

Spark & Scala: Basic Transformations 13 Holden Karau, Making interactive BigData applications fast and easy, Spark Workshop April 2014, http://stanford.edu/~rezab/sparkworkshop/

Spark & Scala: Basic Actions 14 Holden Karau, Making interactive BigData applications fast and easy, Spark Workshop April 2014, http://stanford.edu/~rezab/sparkworkshop/

Spark & Scala: Key-Value Operations 15 Holden Karau, Making interactive BigData applications fast and easy, Spark Workshop April 2014, http://stanford.edu/~rezab/sparkworkshop/

Example: Spark Core API 16 Aaron Davidson, A deeper understanding of Spark internals, Spark Summit July 2014, https://spark-summit.org/2014/

Example: Spark Core API 17 Aaron Davidson, A deeper understanding of Spark internals, Spark Summit July 2014, https://spark-summit.org/2014/

Example: Spark Core API 18 Aaron Davidson, A deeper understanding of Spark internals, Spark Summit July 2014, https://spark-summit.org/2014/

Example: Spark Core API Better implementation: 19 Aaron Davidson, A deeper understanding of Spark internals, Spark Summit July 2014, https://spark-summit.org/2014/

Example: PageRank How to implement PageRank algorithm using Map/Reduce? 20 Hossein Falaki, Numerical Computing with Spark, Spark Workshop April 2014, http://stanford.edu/~rezab/sparkworkshop/

Spark Platform 21 Patrick Wendell, Big Data Processing, Spark Workshop April 2014, http://stanford.edu/~rezab/sparkworkshop/

Spark Platform: GraphX 22 Patrick Wendell, Big Data Processing, Spark Workshop April 2014, http://stanford.edu/~rezab/sparkworkshop/

Spark Platform: GraphX Example: PageRank PageRank is implemented using Pregel graph processing 23

Spark Platform: MLLib 24 Patrick Wendell, Big Data Processing, Spark Workshop April 2014, http://stanford.edu/~rezab/sparkworkshop/

Spark Platform: MLLib Example: K-Means Clustering Goal: Segment tweets into clusters by geolocation using Spark MLLib K-means clustering https://chimpler.wordpress.com/2014/07/11/segmenting-audience-with-kmeans-and-voronoi-diagram-using-spark-and-mllib/ 25

Spark Platform: MLLib Example: K-Means Clustering https://chimpler.wordpress.com/2014/07/11/segmenting-audience-with-kmeans-and-voronoi-diagram-using-spark-and-mllib/ 26

Spark Platform: MLLib Example: K-Means Clustering https://chimpler.wordpress.com/2014/07/11/segmenting-audience-with-kmeans-and-voronoi-diagram-using-spark-and-mllib/ 27

Spark Platform: Streaming 28 Patrick Wendell, Big Data Processing, Spark Workshop April 2014, http://stanford.edu/~rezab/sparkworkshop/

Spark Platform: Streaming Example 29

Spark Platform: SQL and DataFrames 30 Patrick Wendell, Big Data Processing, Spark Workshop April 2014, http://stanford.edu/~rezab/sparkworkshop/

Spark Platform: SQL and DataFrames Example Michael Armbrust, Spark DataFrames: Simple and Fast Analytics on Structured Data, Spark Summit 2015 31

Machine Learning Pipeline with Spark ML Patrick Wendell, Matei Zaharia, Spark community update, https://spark-summit.org/2015/events/keynote-1/ 32

Combined Analytics of Data Analyze tabular data with SQL Analyze graph data using GraphX graph analytics engine Use same machine learning Infrastructure Use same solution for streaming data 33 Joseph Gonzalez, Reynold Xin, Ankur Dave, Daniel Crankshaw, Michael Franklin, and Ion Stoica, GRAPHX: UNIFIED GRAPH ANALYTICS ON SPARK, spark summit July 2014