Spark, Shark and Spark Streaming Introduction


Spark, Shark and Spark Streaming Introduction Tushar Kale tusharkale@in.ibm.com June 2015

This Talk: Introduction to Shark, Spark and Spark Streaming; Architecture; Deployment Methodology; Performance; References

Project History: 2002 MapReduce @ Google; 2004 MapReduce paper; 2006 Hadoop @ Yahoo; 2008 Hadoop Summit; 2010 Spark paper; 2014 Apache Spark becomes a top-level Apache project; December 2014 1.2.0 release; March 2015 1.3.0 release. Spark is HOT!!! It is the most active project in the Hadoop ecosystem and one of the top 3 most active Apache projects. Databricks was founded by the creators of Spark from UC Berkeley's AMPLab. (Activity for 6 months in 2014, from Matei Zaharia's 2014 Spark Summit talk.)

What is Spark? Apache Spark is a fast, general-purpose, easy-to-use cluster computing system for large-scale data processing.
- Fast: leverages aggressively cached in-memory distributed computing and JVM threads; faster than MapReduce
- Generality: covers a wide range of workloads; provides SQL, streaming and complex analytics
- Ease of use for programmers: Spark is written in Scala, an object-oriented, functional programming language; Scala, Python and Java APIs; Scala and Python interactive shells
- Runs on YARN, Mesos, standalone or in the cloud
(Figure: logistic regression in Hadoop and Spark, from http://spark.apache.org)

What is Big Data used For? Reports, e.g., - Track business processes, transactions Diagnosis, e.g., - Why is user engagement dropping? - Why is the system slow? - Detect spam, worms, viruses, DDoS attacks Decisions, e.g., - Personalized medical treatment - Decide what feature to add to a product - Decide what ads to show Data is only as useful as the decisions it enables

Data Processing Goals Low latency (interactive) queries on historical data: enable faster decisions - E.g., identify why a site is slow and fix it Low latency queries on live data (streaming): enable decisions on real-time data - E.g., detect & block worms in real-time (a worm may infect 1mil hosts in 1.3sec) Sophisticated data processing: enable better decisions - E.g., anomaly detection, trend analysis

Our Goal: a single stack! Support batch, streaming, and interactive computations and make it easy to compose them; make it easy to develop sophisticated algorithms (e.g., graph and ML algorithms).

The Need for Unification (1/2) Today's state-of-the-art analytics stack: logs are demuxed into a streaming stack (e.g., Storm) for real-time analytics, a batch stack (e.g., Hadoop) for ad-hoc queries on historical data, and an interactive-query stack (e.g., HBase, Impala, SQL) for interactive queries on historical data. Challenges:
- Need to maintain three separate stacks: expensive and complex; hard to compute consistent metrics across stacks
- Hard and slow to share data across stacks

The Need for Unification (2/2) Make real-time decisions - detect DDoS attacks, fraud, etc. E.g., what is needed to detect a DDoS attack? 1. Detect the attack pattern in real time (streaming). 2. Is the traffic surge expected? (interactive queries). 3. Making those queries fast requires pre-computation (batch). And we also need to implement complex algorithms (e.g., ML)!

Key Idea: Resilient Distributed Datasets (RDDs) - Spark's basic unit of data. An RDD is an immutable, fault-tolerant collection of elements that can be operated on in parallel across a cluster. Two types of operations:
- Transformations: build the data lineage as a DAG (Directed Acyclic Graph); a single run with many stages, versus multiple jobs with MapReduce; lazily evaluated
- Actions: trigger execution of the accumulated transformations and return a value
Fault tolerance: if data in memory is lost, it is recreated from the lineage. Caching, persistence (memory, spilling, disk) and checkpointing.
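The difference between lazy transformations and eager actions is easiest to see in a few lines of code. A minimal sketch, assuming a SparkContext named sc (as in the Spark shell) and a hypothetical log file path:

  // Transformations only record lineage; nothing executes yet.
  val lines  = sc.textFile("hdfs://.../logs")      // hypothetical input path
  val errors = lines.filter(_.contains("ERROR"))   // lazy
  val codes  = errors.map(_.split(" ")(0))         // lazy
  errors.cache()                                   // keep this RDD in memory after it is first computed

  // Actions run the whole lineage DAG and return a value to the driver.
  val numErrors  = errors.count()                  // triggers execution
  val firstCodes = codes.take(10)                  // reuses the cached errors RDD

If a cached partition is lost, Spark recomputes just that partition from the recorded lineage, which is the fault-tolerance mechanism described above.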

Motivation Many important applications must process large streams of live data and provide results in near-real-time - Social network trends - Website statistics - Ad impressions A distributed stream processing framework is required to - Scale to large clusters (100s of machines) - Achieve low latency (a few seconds)

Integration with Batch Processing Many environments require processing the same data both as a live stream and in batch post-processing. Existing frameworks cannot do both - Either do stream processing of 100s of MB/s with low latency - Or do batch processing of TBs / PBs of data with high latency It is extremely painful to maintain two different stacks - Different programming models - Double the implementation effort - Double the number of bugs

Stateful Stream Processing Traditional streaming systems have a record-at-a-time processing model - Each node has mutable state - For each record, update the state and send out new records - State is lost if a node dies! (Diagram: input records flowing through nodes 1-3, each holding mutable state.) Making stateful stream processing fault-tolerant is challenging.

Existing Streaming Systems Storm - Replays a record if it is not processed by a node - Processes each record at least once - May update mutable state twice! - Mutable state can be lost due to failure! Trident - Uses transactions to update state - Processes each record exactly once - But per-state transactions to an external database are slow

Spark Streaming

DStream Input Sources Spark Streaming provides scalable, high-throughput, fault-tolerant stream processing of live data streams.

Discretized Stream Processing Run a streaming computation as a series of very small, deterministic batch jobs - Chop up the live stream into batches of X seconds - Spark treats each batch of data as an RDD (Resilient Distributed Dataset) and processes it using RDD operations - Finally, the processed results of the RDD operations are returned in batches (Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results.)

Discretized Stream Processing Run a streaming computation as a series of very small, deterministic batch jobs - Batch sizes as low as ½ second, latency of about 1 second - Potential for combining batch processing and streaming processing in the same system (Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results.)
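How the batch interval ("batches of X seconds") is declared in code: a minimal sketch of setting up a StreamingContext with the classic Spark Streaming API; the master URL and application name below are placeholders.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // The batch interval (here 1 second) is fixed when the context is created;
  // every DStream on this context is chopped into 1-second RDDs.
  val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingSketch")
  val ssc  = new StreamingContext(conf, Seconds(1))

  // ... define input DStreams, transformations and output operations here ...

  ssc.start()             // begin receiving and processing data
  ssc.awaitTermination()  // block until the streaming computation is stopped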

Example - Get hashtags from Twitter: val tweets = ssc.twitterStream() - DStream: a sequence of RDDs representing a stream of data coming from the Twitter Streaming API. (Diagram: the tweets DStream as batches @ t, t+1, t+2, each stored in memory as an RDD - immutable, distributed.)

Example - Get hashtags from Twitter: val tweets = ssc.twitterStream() val hashtags = tweets.flatMap(status => getTags(status)) - transformation: modify data in one DStream to create a new DStream. (Diagram: flatMap applied to each batch @ t, t+1, t+2 of the tweets DStream produces the hashtags DStream [#cat, #dog, ...]; new RDDs are created for every batch.)

Example - Get hashtags from Twitter: val tweets = ssc.twitterStream() val hashtags = tweets.flatMap(status => getTags(status)) hashtags.saveAsHadoopFiles("hdfs://...") - output operation: push data to external storage. (Diagram: for each batch @ t, t+1, t+2, flatMap then save; every batch is saved to HDFS.)

Example - Get hashtags from Twitter: val tweets = ssc.twitterStream() val hashtags = tweets.flatMap(status => getTags(status)) hashtags.foreach(hashtagRDD => { ... }) - foreach: do whatever you want with the processed data. (Diagram: for each batch @ t, t+1, t+2, flatMap then foreach.) Write to a database, update an analytics UI, do whatever you want.
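A hedged sketch of what "do whatever you want" can look like, reusing the hashtags DStream from the example above; in current Spark Streaming the operation is called foreachRDD, while the older API shown on the slide names it foreach.

  // For every batch interval, Spark hands the driver program the RDD of hashtags in that batch.
  hashtags.foreachRDD { hashtagRDD =>
    // Compute a small summary on the cluster (countByValue here is an RDD action returning a Map)...
    val topTags = hashtagRDD.countByValue().toSeq.sortBy { case (_, count) => -count }.take(10)
    // ...then act on it from the driver: print it, write it to a database, update a dashboard, etc.
    topTags.foreach { case (tag, count) => println(s"$tag -> $count") }
  }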

Java Example Scala: val tweets = ssc.twitterStream() val hashtags = tweets.flatMap(status => getTags(status)) hashtags.saveAsHadoopFiles("hdfs://...") Java: JavaDStream<Status> tweets = ssc.twitterStream(); JavaDStream<String> hashtags = tweets.flatMap(new Function<...>() { ... }); hashtags.saveAsHadoopFiles("hdfs://..."); (Function object)

Window-based Transformations val tweets = ssc.twitterStream() val hashtags = tweets.flatMap(status => getTags(status)) val tagCounts = hashtags.window(Minutes(1), Seconds(5)).countByValue() - sliding window operation over the DStream, parameterized by window length and sliding interval. (Diagram: a window of the given length sliding over the DStream of data at the given interval.)

Arbitrary Stateful Computations Specify a function to generate new state based on the previous state and the new data - Example: maintain per-user mood as state, and update it with their tweets updateMood(newTweets, lastMood) => newMood moods = tweets.updateStateByKey(updateMood _)
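A hedged sketch of how the mood example could be written against updateStateByKey; the mood scoring, the extractMood helper and the tweetsByUser pair DStream of (userId, tweetText) are illustrative assumptions, not part of the talk, and updateStateByKey additionally requires checkpointing to be enabled on the StreamingContext.

  // Hypothetical scorer: turn one tweet into a mood delta.
  def extractMood(tweet: String): Double = if (tweet.contains(":)")) 1.0 else -1.0

  // Called once per key per batch: combine this batch's tweets with the previous state.
  def updateMood(newTweets: Seq[String], lastMood: Option[Double]): Option[Double] = {
    val previous = lastMood.getOrElse(0.0)
    Some(previous + newTweets.map(extractMood).sum)   // new state kept for the next batch
  }

  ssc.checkpoint("hdfs://.../checkpoints")                  // placeholder path; required for stateful ops
  val moods = tweetsByUser.updateStateByKey(updateMood _)   // DStream[(userId, mood)]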

Arbitrary Combinations of Batch and Streaming Computations Inter-mix RDD and DStream operations! - Example: join incoming tweets with a spam HDFS file to filter out bad tweets tweets.transform(tweetsRDD => { tweetsRDD.join(spamHDFSFile).filter(...) })
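Filling in that fragment: a minimal sketch of how the RDD-DStream join could be wired up, assuming a keyed tweetsByUser DStream of (userId, tweetText) pairs and a hypothetical spammer list on HDFS; a leftOuterJoin is used here instead of a plain join so that tweets from unknown users are kept rather than dropped.

  // Static RDD loaded once from HDFS: (userId, isSpammer) pairs.
  val spamInfoRDD = ssc.sparkContext.textFile("hdfs://.../spammers").map(id => (id, true))

  // Each batch's RDD is joined against the static spam RDD, and known spammers are filtered out.
  val cleanedTweets = tweetsByUser.transform { tweetsRDD =>
    tweetsRDD.leftOuterJoin(spamInfoRDD)                               // (userId, (tweet, Option[isSpammer]))
             .filter { case (_, (_, spamFlag)) => spamFlag.isEmpty }   // drop tweets from known spammers
             .mapValues { case (tweet, _) => tweet }                   // back to (userId, tweet)
  }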

Fault-tolerance: Worker RDDs remember the operations that created them. Batches of input data are replicated in memory for fault tolerance, so data lost due to a worker failure can be recomputed from the replicated input data: lost partitions are recomputed on other workers. All transformed data is fault-tolerant, and transformations are exactly-once. (Diagram: replicated input data → flatMap → hashtags RDD.)

Fault-tolerance: Master Master saves the state of the DStreams to a checkpoint file - Checkpoint file saved to HDFS periodically If master fails, it can be restarted using the checkpoint file More information in the Spark Streaming guide - Link later in the presentation Automated master fault recovery coming soon
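A hedged sketch of driver (master) recovery from a checkpoint, using StreamingContext.getOrCreate as available in Spark Streaming's 1.x API; the checkpoint directory and application name are placeholders.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val checkpointDir = "hdfs://.../streaming-checkpoints"   // placeholder path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("FaultTolerantApp")
    val ssc  = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint(checkpointDir)        // DStream metadata and state are saved here periodically
    // ... define DStreams and output operations here ...
    ssc
  }

  // On a fresh start this calls createContext(); after a driver failure it
  // rebuilds the StreamingContext from the saved checkpoint file instead.
  val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
  ssc.start()
  ssc.awaitTermination()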

Unifying Batch and Stream Processing Models Spark program on a Twitter log file using RDDs: val tweets = sc.hadoopFile("hdfs://...") val hashtags = tweets.flatMap(status => getTags(status)) hashtags.saveAsHadoopFile("hdfs://...") Spark Streaming program on a Twitter stream using DStreams: val tweets = ssc.twitterStream() val hashtags = tweets.flatMap(status => getTags(status)) hashtags.saveAsHadoopFiles("hdfs://...")

Vision - one stack to rule them all Explore data interactively using the Spark shell to identify problems, use the same code in standalone Spark programs to identify problems in production logs, and use similar code in Spark Streaming to identify problems in live log streams.

$ ./spark-shell
scala> val file = sc.hadoopFile("smallLogs")
scala> val filtered = file.filter(_.contains("ERROR"))
scala> val mapped = filtered.map(...)

object ProcessProductionData {
  def main(args: Array[String]) {
    val sc = new SparkContext(...)
    val file = sc.hadoopFile("productionLogs")
    val filtered = file.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}

object ProcessLiveStream {
  def main(args: Array[String]) {
    val ssc = new StreamingContext(...)
    val stream = ssc.kafkaStream(...)
    val filtered = stream.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}

Vision - one stack to rule them all. (Diagram: Spark + Shark + Spark Streaming as a single stack covering batch processing, stream processing, and ad-hoc queries.)

Part 2 continued. THANK YOU