Big Data Processing: Data-Parallel Computation. COS 418: Distributed Systems, Lecture 21. Michael Freedman

Ex: Word count using partial aggregation. Putting it together:
1. Compute word counts from individual files
2. Then merge the intermediate output
3. Compute word counts on the merged outputs
The pipeline runs in stages: map, combine, partition, reduce.
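The combine step is what the slide calls partial aggregation. Below is a minimal, single-process Python sketch of the pipeline; the helper names and sample inputs are hypothetical, and real MapReduce distributes these stages across many workers.

    from collections import Counter, defaultdict

    def map_file(text):
        # Map: emit (word, 1) for every word in one input file
        return [(word, 1) for word in text.split()]

    def combine(pairs):
        # Combine: partial aggregation of counts within a single map task
        counts = Counter()
        for word, n in pairs:
            counts[word] += n
        return counts.items()

    def partition(word, num_reducers):
        # Partition: pick the reducer responsible for this key
        return hash(word) % num_reducers

    def reduce_partition(pairs):
        # Reduce: merge partial counts that share a key
        counts = Counter()
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    files = ["the quick brown fox", "the lazy dog", "the fox"]
    num_reducers = 2
    shuffled = defaultdict(list)          # reducer id -> list of (word, count)
    for text in files:
        for word, n in combine(map_file(text)):
            shuffled[partition(word, num_reducers)].append((word, n))
    word_counts = {}
    for r in range(num_reducers):
        word_counts.update(reduce_partition(shuffled[r]))
    print(word_counts)                    # e.g., {'the': 3, 'fox': 2, ...}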

Synchronization barrier. Fault tolerance in MapReduce: a map worker writes its intermediate output to local disk, separated by partition, and tells the master node once it completes. A reduce worker is told the locations of the map task outputs, pulls its partition's data from each mapper, and executes the reduce function across that data. Note: there is an all-to-all shuffle between mappers and reducers, and data is written to disk ("materialized") between each stage.

Graph-Parallel Computation. Graphs are everywhere: social networks (users), collaborative filtering (Netflix: users and movies), probabilistic analysis, text analysis (Wiki: docs and words).

Properties of graph-parallel algorithms: a dependency graph (what I like depends on what my friends like), factored computation, and iterative computation.

Iterative algorithms: MapReduce doesn't efficiently express iterative algorithms. Each iteration ends in a barrier across all CPUs, so a single slow processor stalls every other worker in every iteration.

MapAbuse: iterative MapReduce. The system is not optimized for iteration: every iteration pays a startup penalty and a disk penalty, repeated round after round.

The GraphLab framework: a graph-based data representation, user-defined update functions (user computation), and a consistency model.

Data graph: data is associated with both vertices and edges. Example graph: a social network. Vertex data: user profile, current interest estimates. Edge data: the relationship (friend, classmate, relative).

Distributed graph: partition the graph across multiple machines. Ghost vertices maintain the adjacency structure and replicate remote data.

Update function: a user-defined program, applied to a vertex, that transforms the data in the scope of that vertex. Example, PageRank:

    Pagerank(scope) {
      // Update the current vertex data
      vertex.pagerank = α
      ForEach inpage in scope:
        vertex.pagerank += (1 - α) * inpage.pagerank
      // Reschedule neighbors if needed
      if vertex.pagerank changes then
        reschedule_all_neighbors
    }

Selectively triggers computation at neighbors.
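A single-machine Python sketch of this update-function model (illustrative only, not the GraphLab API; the toy graph, damping factor, and tolerance are assumptions, and the standard division by out-degree is added so the toy example converges): a scheduler keeps a queue of vertices, applies the PageRank update to one vertex at a time, and reschedules out-neighbors only when the value changes.

    from collections import deque

    # Hypothetical toy graph: vertex -> in-neighbors and out-neighbors
    in_neighbors = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    out_neighbors = {"a": ["c"], "b": ["a"], "c": ["a", "b"]}

    alpha, tol = 0.15, 1e-6
    rank = {v: 1.0 for v in in_neighbors}

    def update(v):
        # Recompute this vertex from its in-neighbors' data (the slide's
        # pseudocode leaves the division by out-degree implicit)
        incoming = sum(rank[u] / len(out_neighbors[u]) for u in in_neighbors[v])
        new_rank = alpha + (1 - alpha) * incoming
        changed = abs(new_rank - rank[v]) > tol
        rank[v] = new_rank
        return changed

    # Scheduler: run updates until no vertex needs rescheduling
    queue, queued = deque(in_neighbors), set(in_neighbors)
    while queue:
        v = queue.popleft()
        queued.discard(v)
        if update(v):
            # Selectively trigger computation at neighbors that depend on v
            for w in out_neighbors[v]:
                if w not in queued:
                    queue.append(w)
                    queued.add(w)

    print(rank)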

Update function, continued. The update function is applied (asynchronously) in parallel until convergence, and many schedulers are available to prioritize the computation. As before, the PageRank update selectively triggers computation at neighbors whose values may need to change.

How to handle machine failure? What happens when machines fail? How do we provide fault tolerance?

Strawman scheme: synchronous snapshot checkpointing. 1. Stop the world. 2. Write each machine's state to disk.

Chandy-Lamport checkpointing. Step 1: atomically, one initiator (1) turns red, (2) records its own state, and (3) sends a marker to its neighbors. Step 2: on receiving a marker, a non-red node atomically (1) turns red, (2) records its own state, and (3) sends markers along all outgoing channels. Assumes first-in, first-out channels between nodes.
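A compact Python sketch of the marker rule above (a single-process simulation with hypothetical node names and channels; real systems run this over FIFO network channels, and this sketch omits recording in-flight channel contents): the initiator records its state and sends markers, and every other node records its state the first time it sees a marker, then forwards markers downstream.

    from collections import deque

    # Hypothetical topology: node -> outgoing channels
    out_channels = {"A": ["B"], "B": ["C"], "C": ["A"]}
    state = {"A": 10, "B": 20, "C": 30}   # each node's local state
    red = set()                            # nodes that have snapshotted
    snapshot = {}                          # recorded state per node

    def receive_marker(node, inbox):
        # Marker rule: on first marker, atomically turn red, record local
        # state, and send markers along all outgoing channels
        if node in red:
            return
        red.add(node)
        snapshot[node] = state[node]
        for dst in out_channels[node]:
            inbox.append(("marker", dst))

    inbox = deque()
    # Step 1: the initiator snapshots itself and sends markers to neighbors
    receive_marker("A", inbox)
    # Step 2: deliver markers until every node has recorded its state
    while inbox:
        _, dst = inbox.popleft()
        receive_marker(dst, inbox)

    print(snapshot)   # consistent snapshot of all nodes' states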

Stream Processing.

Simple stream processing: a single node reads data from a socket, processes it, and writes output.

Examples: stateless conversion. Convert Celsius temperature to Fahrenheit; a stateless operation:

    emit (input * 9 / 5) + 32

Examples: stateless filtering. The function can filter inputs:

    if (input > threshold) { emit input }

Examples: stateful conversion (EWMA). Compute an exponentially weighted moving average of the Fahrenheit temperature:

    new_temp = α * input + (1 - α) * last_temp
    last_temp = new_temp
    emit new_temp
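These three operators can be written as ordinary Python generators; this is a sketch with assumed values for the threshold and α, not tied to any particular streaming framework.

    def c_to_f(stream):
        # Stateless conversion: Celsius -> Fahrenheit
        for c in stream:
            yield c * 9 / 5 + 32

    def above(stream, threshold=70.0):       # threshold is an assumed value
        # Stateless filtering: drop readings at or below the threshold
        for x in stream:
            if x > threshold:
                yield x

    def ewma(stream, alpha=0.2):             # alpha is an assumed value
        # Stateful conversion: exponentially weighted moving average
        last = None
        for x in stream:
            last = x if last is None else alpha * x + (1 - alpha) * last
            yield last

    celsius_readings = [20.0, 25.0, 30.0, 18.0, 35.0]
    for smoothed in ewma(above(c_to_f(celsius_readings))):
        print(smoothed)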

Examples: aggregation (stateful). E.g., the average value per window. A window can be defined by a number of elements (e.g., 10) or by time (e.g., 1s). Windows can be disjoint ("tumbling", e.g., every 5s) or overlapping ("sliding", e.g., a 5s window every 1s).

Stream processing as a chain: a linear pipeline of operators, e.g., a conversion stage feeding an Avg stage.

Stream processing as a directed graph: e.g., sensor type 1 and sensor type 2 feed a KtoF conversion and an Avg operator, whose outputs flow to alerts and storage.

Enter BIG DATA.
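A small Python sketch of windowed averaging over a stream (count-based windows stand in for time-based ones; window sizes and the sample input are assumptions for illustration):

    from collections import deque

    def tumbling_avg(stream, size=10):
        # Disjoint (tumbling) windows: emit one average per `size` elements
        window = []
        for x in stream:
            window.append(x)
            if len(window) == size:
                yield sum(window) / size
                window = []

    def sliding_avg(stream, size=5, every=1):
        # Overlapping (sliding) windows: a `size`-element window emitted every
        # `every` elements, analogous to "a 5s window every 1s"
        window = deque(maxlen=size)
        for i, x in enumerate(stream, 1):
            window.append(x)
            if len(window) == size and i % every == 0:
                yield sum(window) / size

    readings = list(range(1, 21))
    print(list(tumbling_avg(readings)))   # [5.5, 15.5]
    print(list(sliding_avg(readings)))    # rolling averages of the last 5 values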

The challenge of stream processing: large amounts of data to process in real time. Examples: social network trends (#trending), intrusion detection systems (networks, datacenters), sensors (detect earthquakes by correlating vibrations of millions of smartphones), fraud detection (Visa: 2,000 transactions/sec on average, peak ~47,000/sec).

Scale up. Tuple-by-tuple processing:

    input = read
    if (input > threshold) { emit input }

Micro-batch processing:

    inputs = read
    out = []
    for input in inputs {
      if (input > threshold) { out.append(input) }
    }
    emit out

Tuple-by-tuple gives lower latency but lower throughput; micro-batch gives higher latency but higher throughput. Why? Each read/write is a system call into the kernel: more cycles are spent on kernel/application transitions (context switches), and fewer are actually spent processing data.

Scale out. Stateless operations are trivially parallelized: e.g., run many independent C-to-F converters in parallel.

State complicates parallelization. Aggregations need to join results across parallel computations: e.g., parallel C-to-F converters or Cnt operators feeding a shared Avg operator.

Parallelization complicates fault tolerance: if one of the parallel counts fails, the join blocks waiting for its missing results.
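A sketch of why aggregation requires a join step (plain Python with hypothetical partition data): each parallel worker keeps a mergeable partial aggregate, and the global average is only correct once every partition's partial has been merged.

    # Each parallel worker processes one hash partition of the stream and
    # keeps a mergeable partial aggregate: (sum, count)
    partitions = [
        [70.2, 71.0, 69.8],     # worker 0's tuples (hypothetical values)
        [68.5, 72.3],           # worker 1's tuples
        [70.9, 71.7, 70.1],     # worker 2's tuples
    ]

    def partial_aggregate(values):
        return (sum(values), len(values))

    partials = [partial_aggregate(p) for p in partitions]

    # The join/merge step: if any worker's partial is missing (failure),
    # the global average cannot be computed correctly
    total, count = 0.0, 0
    for s, n in partials:
        total += s
        count += n
    print("global average:", total / count)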

Can parallelize joins: e.g., compute trending keywords over hash-partitioned tweets. But parallelization complicates fault tolerance here too: a failed or blocked partition stalls the join.

A Tale of Four Frameworks:
1. Record acknowledgement (Storm)
2. Micro-batches (Spark Streaming, Storm Trident)
3. Transactional updates (Google Cloud Dataflow)
4. Distributed snapshots (Flink)

Apache Storm. Architectural components: data is streams of tuples, e.g., Tweet = <Author, Msg, Time>; sources of data are "spouts"; operators that process data are "bolts"; a topology is a directed graph of spouts and bolts.

Fault tolerance via record acknowledgement (Apache Storm, at-least-once semantics). Goal: ensure each input is fully processed. Approach: DAG/tree edge tracking. Record the edges that get created as a tuple is processed, wait for all edges to be marked done, and inform the source (spout) of the data when complete; otherwise, it resends the tuple. Challenge: at-least-once means operators (bolts) can receive a tuple more than once, and replay can be out of order; the application needs to handle this.

Mechanics: the spout assigns a new unique ID to each tuple. When a bolt emits a dependent tuple, it informs the system of the dependency (a new edge). When a bolt finishes processing a tuple, it calls ACK (or can FAIL). Acker tasks keep track of all emitted edges and receive ACK/FAIL messages from bolts; once messages have been received about all edges in the graph, they inform the originating spout, which garbage-collects the tuple or retransmits it. Note: best-effort delivery is possible by not generating dependencies on downstream tuples.

Apache Spark Streaming: discretized stream processing. Split the stream into a series of small, atomic batch jobs (each of X seconds). Process each individual batch using the Spark batch framework, akin to in-memory MapReduce, and emit each micro-batch result. RDD = Resilient Distributed Dataset. The live data stream enters Spark Streaming, which hands batches of X seconds to Spark, which produces the processed results.
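A minimal PySpark sketch of discretized stream processing; this assumes Spark's legacy DStream API, a local master, and a text source on localhost:9999 (batch length, host, and port are assumptions). The stream is cut into 1-second micro-batches, and each batch is processed as an ordinary Spark job.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "MicroBatchWordCount")
    ssc = StreamingContext(sc, 1)                    # 1-second micro-batches

    lines = ssc.socketTextStream("localhost", 9999)  # assumed data source
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b)) # per-batch word counts

    counts.pprint()        # emit each micro-batch's result
    ssc.start()
    ssc.awaitTermination()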

Fault tolerance via micro-batches (Apache Spark Streaming, Storm Trident). Can build on batch frameworks (Spark) and on tuple-by-tuple systems (Storm); the tradeoff is higher throughput for higher latency. Each micro-batch may succeed or fail, and the original inputs are replicated (memory, disk), so on failure the latest micro-batch can simply be recomputed (trickier if stateful). The DAG is a pipeline of transformations from micro-batch to micro-batch, and lineage information in each RDD specifies how it was generated from other RDDs. To support failure recovery, the system occasionally checkpoints RDDs (state) by replicating them to other nodes. To recover, another worker (1) gets the last checkpoint, (2) determines the upstream dependencies, and (3) starts recomputing from the checkpoint using those upstream dependencies (downstream operators might filter the replayed results).

Fault tolerance via transactional updates (Google Cloud Dataflow). The computation is a long-running DAG of continuous operators. For each intermediate record at an operator, create a commit record including the input record, the state update, and the derived downstream records generated, and write the commit record to a transactional log / DB. On failure, replay the log to restore a consistent state of the computation and to replay lost records (operators further downstream might filter them). Requires high-throughput writes to a distributed store.

Fault tolerance via distributed snapshots (Apache Flink). Rather than logging each record at each operator, take system-wide snapshots. Use markers ("barriers") in the input data stream to tell downstream operators when to consistently snapshot. Snapshotting: determine a consistent snapshot of system-wide state (including in-flight records and operator state) and store it in durable storage. Recovery: restore the latest snapshot from durable storage, rewind the stream source to the snapshot point, and replay the inputs. The algorithm is based on Chandy-Lamport distributed snapshots, but also captures the stream topology.
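A toy Python sketch of the barrier mechanism (a simulation, not Flink's API; the channel layout, input trace, and counted state are assumptions, and real barrier alignment also buffers post-barrier records from channels whose barrier has already arrived, which this sketch omits): an operator with two input channels waits until it has seen the barrier on all inputs, snapshots its local state at that point, and then forwards the barrier downstream.

    BARRIER = "BARRIER"

    class Operator:
        # Counts records; snapshots its count once a barrier has arrived on
        # every input channel, then forwards the barrier downstream
        def __init__(self, num_inputs):
            self.num_inputs = num_inputs
            self.count = 0
            self.barrier_channels = set()
            self.snapshots = []            # stand-in for durable storage

        def receive(self, channel, item):
            if item == BARRIER:
                self.barrier_channels.add(channel)
                if len(self.barrier_channels) == self.num_inputs:  # alignment
                    self.snapshots.append(self.count)  # consistent local state
                    self.barrier_channels.clear()
                    return BARRIER                     # forward downstream
                return None
            self.count += 1
            return None

    op = Operator(num_inputs=2)
    # Interleaved input from two upstream channels, each carrying one barrier
    for channel, item in [(0, "a"), (1, "b"), (0, BARRIER), (1, "c"), (1, BARRIER)]:
        if op.receive(channel, item) == BARRIER:
            print("snapshot taken, state =", op.snapshots[-1])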

Optimizing stream processing. Keeping the system performant requires careful optimization of the DAG: scheduling (the choice of parallelization and the use of resources) and deciding where to place computation. Often, many queries and systems use the same cluster concurrently: multi-tenancy.