Massive Online Analysis - Storm, Spark


Massive Online Analysis - Storm, Spark. Presentation by R. Kishore Kumar, Research Scholar, Department of Computer Science & Engineering, Indian Institute of Technology, Kharagpur, Kharagpur-721302, India. 17/06/2016.

Overview

Introduction to Big Data - Examples
Weather forecasting: Raytheon, working with NASA and NOAA, receives about 60 GB of data per day from weather satellites. Using sophisticated algorithms, they process that data into environmental data records such as cloud coverage and height, sea-ice concentrations, surface temperatures, and atmospheric pressure.
Flight takeoff and landing: each flight produces 5 to 10 GB of data.
Facebook: produces 500+ TB of user-behaviour data per day.

Introduction
What is Big Data? Big Data is data of greater volume and variety than a conventional database can handle. The term also refers to the many tools and techniques that have emerged to help users mine valuable information from these massive datasets.

Introduction (cont.)
Characteristics of Big Data: huge volume of data; complexity of data types and structures; speed (velocity) of new data creation.
Criteria for Big Data projects: speed of decision making; analysis flexibility; throughput.

Different Types of Data
Structured: e.g. transaction data.
Semi-structured: e.g. XML data files that are self-describing and defined by an XML Schema.
Quasi-structured: e.g. web clickstream data that may contain inconsistencies in data values and formats.
Unstructured: e.g. text documents, PDFs, images, audio, video.
Scale and typical systems: 1 TB (1024 gigabytes) - Oracle RDBMS; 1 PB (1024 terabytes) - PDF, Excel, Word, PPT in content management systems; 1 EB (1024 petabytes) - YouTube, tweets, Facebook, wikis on NoSQL and Hadoop.

HDFS Architecture
The Hadoop Distributed File System (HDFS) is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. HDFS is the primary storage system used by Hadoop applications. Figure: HDFS Architecture.

HDFS Features
Highly fault tolerant (data is replicated on at least 3 different nodes). Suitable for applications with large data sets. Can be built out of commodity hardware.
HDFS Drawback: every intermediate result must be written to disk.

Apache Spark
Spark uses a restricted abstraction of distributed shared memory: Resilient Distributed Datasets (RDDs). An RDD is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are typically created from a file in the Hadoop file system; Spark operations are then invoked on each RDD to do computation. RDDs offer efficient fault tolerance.
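The partitioned-collection idea above can be sketched in plain Python (a hypothetical single-machine simulation, not the Spark API): the dataset is split into partitions, and the same operation is applied to each partition in parallel, the way Spark would apply it on different cluster nodes.

```python
from multiprocessing.dummy import Pool  # thread pool; stands in for cluster nodes

# A "dataset" split into partitions, as an RDD would be across nodes.
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

def square_partition(part):
    # The same coarse-grained operation is applied to every partition.
    return [x * x for x in part]

with Pool(len(partitions)) as pool:
    result = pool.map(square_partition, partitions)

print(result)  # [[1, 4, 9], [16, 25, 36], [49, 64, 81]]
```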

Resilient Distributed Datasets: Motivation
A fault-tolerant abstraction for in-memory cluster computing. MapReduce greatly simplified big data analysis on large, unreliable clusters. But as soon as it got popular, users wanted more: more complex, multi-stage applications (e.g. iterative machine learning and graph processing) and more interactive ad-hoc queries. The initial response was specialized frameworks for some of these applications (e.g. Pregel for graph processing).

Resilient Distributed Datasets: Motivation
Complex applications and interactive queries both need one thing that MapReduce lacks: efficient primitives for data sharing. In MapReduce, the only way to share data across jobs is stable storage, which is slow.

Resilient Distributed Datasets (figures)

Challenge
How do we design a distributed memory abstraction that is both fault-tolerant and efficient?

Challenge
Existing storage abstractions have interfaces based on fine-grained updates to mutable state (RAMCloud, databases, distributed shared memory, Piccolo). Fault tolerance then requires replicating data or logs across nodes, which is costly for data-intensive applications: 10-100x slower than a memory write.

Solution: Resilient Distributed Datasets (RDDs)
A restricted form of distributed shared memory: immutable, partitioned collections of records that can only be built through coarse-grained deterministic transformations (map, filter, join, ...).
RDDs provide efficient fault recovery using lineage: log one operation that applies to many elements, recompute lost partitions on failure, and pay no cost if nothing fails.
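Lineage-based recovery can be sketched in plain Python (a hypothetical simulation, not Spark code): an RDD remembers its source partitions and the logged transformations applied to them, so a lost partition can be recomputed on demand instead of being replicated up front.

```python
# Stable source partitions, as read from a file system like HDFS.
base = {0: [1, 2, 3], 1: [4, 5, 6]}

# The lineage: one logged operation applies to many elements at once.
lineage = [
    lambda xs: [x + 1 for x in xs],           # map: add 1
    lambda xs: [x for x in xs if x % 2 == 0], # filter: keep even values
]

def compute(pid):
    # Replay the logged transformations over the source partition.
    data = base[pid]
    for op in lineage:
        data = op(data)
    return data

cache = {pid: compute(pid) for pid in base}  # materialized partitions
del cache[1]                                 # simulate losing a partition
recovered = compute(1)                       # rebuild only the lost one
print(recovered)  # [6]
```

Note that only the lost partition is recomputed; the surviving partition in `cache` is untouched, which is the "no cost if nothing fails" property.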

Resilient Distributed Datasets (figure)

Generality of RDDs
Despite their restrictions, RDDs can express surprisingly many parallel algorithms, because these algorithms naturally apply the same operation to many items. RDDs unify many current programming models: data-flow models (MapReduce, Dryad, SQL, ...) and specialized models for iterative applications (BSP/Pregel, iterative MapReduce/HaLoop, bulk incremental, ...). They also support new applications that these models don't.

Tradeoff Space

Spark Programming Interface
A DryadLINQ-like API in the Scala language, usable interactively from the Scala interpreter. It provides: resilient distributed datasets (RDDs); operations on RDDs, namely transformations (build new RDDs) and actions (compute and output results); and control of each RDD's partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.).

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.
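The log-mining pattern can be sketched in plain Python (a hypothetical stand-in: in Spark, `lines` would come from HDFS via `sc.textFile`, the filter would be lazy, and `.cache()` would pin the error subset in cluster memory for repeated interactive queries). The log lines here are made up for illustration.

```python
lines = [
    "ERROR php: dying for unknown reasons",
    "WARN dav: missing file",
    "ERROR mysql: connection refused",
]

# Transformation: keep only error messages (in Spark this is lazy,
# and caching would keep the result in memory for reuse).
errors = [line for line in lines if line.startswith("ERROR")]

# Actions: interactively search the cached subset for patterns.
print(len(errors))                          # 2
print([e for e in errors if "mysql" in e])  # ['ERROR mysql: connection refused']
```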

Fault Recovery
RDDs track the graph of transformations that built them (their lineage) and use it to rebuild lost data.

Fault Recovery Results

Example: PageRank

Optimizing Placement

PageRank Performance

Implementation

Programming Models Implemented on Spark

Open Source Community

Related Works

Behavior with Insufficient RAM

Scalability

Breaking Down the Speedup

Spark Operations

Task Scheduler

Apache Spark
RDDs can only be created by reading data from stable storage or through Spark operations over existing RDDs. Spark defines two kinds of RDD operations: transformations, which apply the same operation to every record of the RDD to generate a separate, new RDD (e.g. map()), and actions, which aggregate the RDD to compute a result (e.g. reduce(), which runs on all nodes in parallel). In our implementation, each iteration consists of two map operations and two reduce operations. Like Hadoop, the efficiency of Spark depends heavily on the parallelism of the algorithm itself.
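The transformation/action split can be illustrated with a minimal word count in plain Python (a hypothetical simulation, not the Spark API): two map-style steps each produce a new collection, and a reduce-style step aggregates pairs by key, the way reduceByKey would.

```python
from functools import reduce

text = ["to be or not to be"]

# Transformation-style steps: each produces a new collection.
words = [w for line in text for w in line.split()]  # "map" 1: split lines
pairs = [(w, 1) for w in words]                     # "map" 2: emit (word, 1)

# Action-style step: aggregate pairs by key, like reduceByKey.
def merge(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

counts = reduce(merge, pairs, {})
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```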

Apache Spark
A second abstraction in Spark is shared variables that can be used in parallel operations. When Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only added to, such as counters and sums.
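The two kinds of shared variables can be sketched in plain Python (a hypothetical thread-based simulation, not the Spark API): the broadcast value is read-only from every task, while the accumulator only ever has values added to it.

```python
from multiprocessing.dummy import Pool  # threads stand in for cluster tasks

broadcast = {"threshold": 3}  # read-only value shipped to every task
acc = []                      # add-only "accumulator" for a count

def task(x):
    # Each task reads the broadcast value and only adds to the accumulator.
    if x > broadcast["threshold"]:
        acc.append(1)
    return x

with Pool(4) as pool:
    pool.map(task, [1, 2, 3, 4, 5, 6])

print(sum(acc))  # 3  (values 4, 5, and 6 exceeded the threshold)
```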

Apache Spark
Figure: Spark Architecture

Spark Engine
Figure: Spark Engine

Approaching with Apache Spark
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html

Approaching with Apache Spark

val input = sc.textFile("log.txt")
val splitLines = input
  .map(line => line.split(" "))
  .map(words => (words(0), 1))
  .reduceByKey((a, b) => a + b)

Conclusion

Thank You