Cloud, Big Data & Linear Algebra

Similar documents
Analytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation

An Introduction to Apache Spark

Spark Overview. Professor Sasu Tarkoma.

Processing of big data with Apache Spark

DATA SCIENCE USING SPARK: AN INTRODUCTION

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

08/04/2018. RDDs. RDDs are the primary abstraction in Spark RDDs are distributed collections of objects spread across the nodes of a clusters

2/26/2017. RDDs. RDDs are the primary abstraction in Spark RDDs are distributed collections of objects spread across the nodes of a clusters

RDDs are the primary abstraction in Spark RDDs are distributed collections of objects spread across the nodes of a clusters

Cloud Computing & Visualization

2/4/2019 Week 3- A Sangmi Lee Pallickara

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Integration of Machine Learning Library in Apache Apex

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

Cloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018

RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING

Big Data Infrastructures & Technologies

A Tutorial on Apache Spark

MapReduce, Hadoop and Spark. Bompotas Agorakis

Big data systems 12/8/17

Twitter data Analytics using Distributed Computing

Summary of Big Data Frameworks Course 2015 Professor Sasu Tarkoma

CSE 444: Database Internals. Lecture 23 Spark

CSC 261/461 Database Systems Lecture 24. Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101

Spark: A Brief History.

COSC 6339 Big Data Analytics. Introduction to Spark. Edgar Gabriel Fall What is SPARK?

Spark, Shark and Spark Streaming Introduction


An Introduction to Apache Spark Big Data Madison: 29 July William Red Hat, Inc.

Beyond MapReduce: Apache Spark Antonino Virgillito

/ Cloud Computing. Recitation 13 April 14 th 2015

Introduction to Apache Spark

Distributed Computing with Spark and MapReduce

Spark. Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.

Big Graph Processing. Fenggang Wu Nov. 6, 2016

Fast, Interactive, Language-Integrated Cluster Computing

SCALABLE, LOW LATENCY MODEL SERVING AND MANAGEMENT WITH VELOX

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Big Data Analytics. C. Distributed Computing Environments / C.2. Resilient Distributed Datasets: Apache Spark. Lars Schmidt-Thieme

Distributed Machine Learning" on Spark

Chapter 4: Apache Spark

Hadoop Development Introduction

Resilient Distributed Datasets

Analyzing Flight Data

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Specialist ICT Learning

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

An Introduction to Big Data Analysis using Spark

Agenda. Spark Platform Spark Core Spark Extensions Using Apache Spark

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Dell In-Memory Appliance for Cloudera Enterprise

Principal Software Engineer Red Hat Emerging Technology June 24, 2015

Accelerating Spark Workloads using GPUs

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

Khadija Souissi. Auf z Systems November IBM z Systems Mainframe Event 2016

Olivia Klose Technical Evangelist. Sascha Dittmann Cloud Solution Architect

CDS. André Schaaff1, François-Xavier Pineau1, Gilles Landais1, Laurent Michel2 de Données astronomiques de Strasbourg, 2SSC-XMM-Newton

International Journal of Advance Engineering and Research Development. Performance Comparison of Hadoop Map Reduce and Apache Spark

An Introduction to Apache Spark

Apache Spark Internals

Stream Processing on IoT Devices using Calvin Framework

Reactive App using Actor model & Apache Spark. Rahul Kumar Software

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Machine learning library for Apache Flink

Scalable Tools - Part I Introduction to Scalable Tools

Certified Big Data Hadoop and Spark Scala Course Curriculum

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem

Distributed Systems. 22. Spark. Paul Krzyzanowski. Rutgers University. Fall 2016

a Spark in the cloud iterative and interactive cluster computing

Applied Spark. From Concepts to Bitcoin Analytics. Andrew F.

Map Reduce Group Meeting

Big Data Architect.

The Datacenter Needs an Operating System

Big Data Infrastructures & Technologies Hadoop Streaming Revisit.

Webinar Series TMIP VISION

Analysis of Extended Performance for clustering of Satellite Images Using Bigdata Platform Spark

Data-intensive computing systems

/ Cloud Computing. Recitation 13 April 12 th 2016

Analytics in Spark. Yanlei Diao Tim Hunter. Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig

An Overview of Apache Spark

MapReduce & Resilient Distributed Datasets. Yiqing Hua, Mengqi(Mandy) Xia

The Evolution of Big Data Platforms and Data Science

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Correlating Apache Spark and Map Reduce with Performance Analysis using K-Means

Outline. CS-562 Introduction to data analysis using Apache Spark

Big Data Analytics using Apache Hadoop and Spark with Scala

Databases and Big Data Today. CS634 Class 22

Lecture 11 Hadoop & Spark

The MapReduce Framework

Turning Relational Database Tables into Spark Data Sources

A REVIEW: MAPREDUCE AND SPARK FOR BIG DATA ANALYTICS

Lecture 09. Spark for batch and streaming processing FREDERICK AYALA-GÓMEZ. CS-E4610 Modern Database Systems

CS435 Introduction to Big Data FALL 2018 Colorado State University. 10/22/2018 Week 10-A Sangmi Lee Pallickara. FAQs.

Machine Learning for Large-Scale Data Analysis and Decision Making A. Distributed Machine Learning Week #9

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Getting Started with Spark

Survey on Frameworks for Distributed Computing: Hadoop, Spark and Storm Telmo da Silva Morais

Introduction to Apache Spark

Transcription:

Cloud, Big Data & Linear Algebra Shelly Garion IBM Research -- Haifa 2014 IBM Corporation

What is Big Data? 2

Global Data Volume in Exabytes What is Big Data? 2005 2012 2017 3

Global Data Volume in Exabytes What is Big Data? 2005 2012 2017 4

Big Data in the Cloud TV 5

How to Analyze Big Data? 6

How to Analyze Big Data? 7

Basic Example: Word Count (Spark & Python) 8 Holden Karau, Making interactive BigData applications fast and easy, Spark Workshop April 2014, http://stanford.edu/~rezab/sparkworkshop/

Basic Example: Word Count (Spark & Scala) 9 Holden Karau, Making interactive BigData applications fast and easy, Spark Workshop April 2014, http://stanford.edu/~rezab/sparkworkshop/

Some History Map/Reduce was invented by Google: Inspired by functional programming languages map and reduce functions Seminal paper: Jeffrey Dean and Sanjay Ghemawat (OSDI 2004), "MapReduce: Simplified Data Processing on Large Clusters" Used at Google to completely regenerate Google's index of the World Wide Web Hadoop open source implementation matches Google s specifications Amazon EMR (Elastic MapReduce) running on Amazon EC2 Spark started in 2009 as a research project of UC Berkley Spark is now an open source Apache project Built by a wide set of developers from over 200 companies more than 1000 developers have contributed to Spark IBM created Spark Technology Center (STC) - http://www.spark.tc/ 10

Why Spark? Apache Spark is a fast and general open-source cluster computing engine for big data processing Speed: Spark is capable to run programs up to 100x faster than Hadoop Map/Reduce in memory, or 10x faster on disk Ease of use: Write applications quickly in Java, Scala, Python and R, also with notebooks Generality: Combine streaming, SQL and complex analytics machine learning, graph processing Runs everywhere: on Apache Mesos, Hadoop YARN cluster manager, standalone, or in the cloud, and can read any existing Hadoop data, and data from HDFS, object store, databases etc. https://spark.apache.org/ 11

Combined Analytics of Data with Spark Analyze tabular data with SQL Analyze graph data using GraphX graph analytics engine Use same machine learning Infrastructure Use same solution for streaming data 12 Joseph Gonzalez, Reynold Xin, Ankur Dave, Daniel Crankshaw, Michael Franklin, and Ion Stoica, GRAPHX: UNIFIED GRAPH ANALYTICS ON SPARK, spark summit July 2014

Spark Example 13 Aaron Davidson, A deeper understanding of Spark internals, Spark Summit July 2014, https://spark-summit.org/2014/

Spark Example 14 Aaron Davidson, A deeper understanding of Spark internals, Spark Summit July 2014, https://spark-summit.org/2014/

Spark Example 15 Aaron Davidson, A deeper understanding of Spark internals, Spark Summit July 2014, https://spark-summit.org/2014/

PageRank 17 Sergei Brin and Lawrence Page, "The anatomy of a large-scale hypertextual Web search engine, Computer Networks and ISDN Systems. (1998) 30: 107 117.

18

19

20

21

22

B k v+(b k-1 +B k-2 + +B 2 +B+I)U= B k v+(i-b k )(I-B) -1 U 23

B k v 0 B k v+(i-b k )(I-B) -1 U (I-B) -1 U 24

A is a stochastic matrix, 25 A k v ce B k v 0

PageRank 26 Sergei Brin and Lawrence Page, "The anatomy of a large-scale hypertextual Web search engine, Computer Networks and ISDN Systems. (1998) 30: 107 117.

PageRank 27 Hossein Falaki, Spark for Numerical Computing, Spark Workshop April 2014, http://stanford.edu/~rezab/sparkworkshop/

PageRank in Spark 28 Hossein Falaki, Spark for Numerical Computing, Spark Workshop April 2014, http://stanford.edu/~rezab/sparkworkshop/

Machine Learning: K-Means Clustering Goal: Segment tweets into clusters by geolocation using Spark MLLib K-means clustering https://chimpler.wordpress.com/2014/07/11/segmenting-audience-with-kmeans-and-voronoi-diagram-using-spark-and-mllib/ 29

Machine Learning: K-Means Clustering https://chimpler.wordpress.com/2014/07/11/segmenting-audience-with-kmeans-and-voronoi-diagram-using-spark-and-mllib/ 30

Machine Learning: K-Means Clustering (from Wikipedia) 31

K-Means Clustering with Spark MLLib https://chimpler.wordpress.com/2014/07/11/segmenting-audience-with-kmeans-and-voronoi-diagram-using-spark-and-mllib/ 32

Combined Analytics of Data with Spark Analyze tabular data with SQL Analyze graph data using GraphX graph analytics engine Use same machine learning Infrastructure Use same solution for streaming data 33 Joseph Gonzalez, Reynold Xin, Ankur Dave, Daniel Crankshaw, Michael Franklin, and Ion Stoica, GRAPHX: UNIFIED GRAPH ANALYTICS ON SPARK, spark summit July 2014

How Does Spark Work? 34

Spark RDD (Resilient Distributed Dataset) Immutable, partitioned collections of objects spread across a cluster, stored in RAM or on Disk Built through lazy parallel transformations Fault tolerance automatically built at failure Partition can be cached RAM var myrdd = sc.sequencefile( hdfs:/// ) myrdd array Partition DISK Partition We can apply Transformations or Actions on RDD Partition 35

Spark Cluster Driver program The process running the main() function of the application and creating the SparkContext Cluster manager External service for acquiring resources on the cluster (e.g. standalone, Mesos, YARN) Worker node - Any node that can run application code in the cluster Executor A process launched for an application on a worker node 36

Spark Scheduler Task - A unit of work that will be sent to one executor Job - A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action Stage - Each job gets divided into smaller sets of tasks called stages that depend on each other 37