Fault Tolerance in K3. Ben Glickman, Amit Mehta, Josh Wheeler


Outline Background Motivation Detecting Membership Changes with Spread Modes of Fault Tolerance in K3 Demonstration

Outline Background Motivation Detecting Membership Changes with Spread Modes of Fault Tolerance in K3 Demonstration

Big Systems Several sources of big data: sciences, healthcare, enterprise, and more. Need systems that scale to the volume of the data. Single-machine supercomputers are expensive, so scale-out systems have become popular: clusters of affordable machines, massively parallel, communicating over a network. e.g., MapReduce (Hadoop), distributed DBMSs

Main-Memory Systems Disk was the bottleneck of first-generation systems, motivating a new class of data systems that compute entirely in memory. A cluster provides a large pool of RAM (TB scale), making it feasible to store entire datasets (spilling to disk if necessary). Improves throughput by orders of magnitude. e.g., Spark, Stratosphere, etc.

K3 Background Programming framework for building shared-nothing, main-memory data systems. A high-level language for systems building, compiled into high-performance native code. Under development at the Data Management Systems Lab at JHU: http://damsl.cs.jhu.edu K3 GitHub: https://github.com/damsl/k3

K3 Programming Model Functional-imperative language: currying, higher-order functions, mutable variables, loops. Asynchronous distributed computation: triggers act as event handlers, and message passing between triggers defines a dataflow. Collections library: high-level operations (map, filter, fold, etc.), fine-grained updates (insert, update, etc.), nested collections.

K3 Execution Model Several peers run the same K3 executable program. Shared-nothing: each peer can access only its own local data segment; movement and coordination across peers are achieved through message passing. Partitioned execution model: large datasets are partitioned and distributed evenly among peers and kept in memory.

K3 Execution Model Focused on large-scale analytics (read-only) workloads; transactions and fine-grained updates are future work. A single master peer coordinates the distributed computation and collects results at a single site. The remaining peers are workers that compute over local partitions of a dataset and communicate through messaging.

K3 Execution Model [Diagram: master and worker peers connected over the network, exchanging messages]

K3 Execution Model Used to build multi-stage, complex dataflows. [Diagram: a multi-stage dataflow over partitioned data; this is the flow for the program that will be demonstrated after the presentation]

K3 Performance Outperforms two state-of-the-art systems, Spark and Impala, on SQL processing and on iterative machine learning and graph algorithms.

K3 Example Given an Employees dataset (Employee: name String, age Integer), find the oldest employee in the dataset. The dataset is partitioned among the K3 workers. [Diagram: master and workers, each worker holding rows such as "Bob Smith, 57", "Saul Goodman, 37", "Walter White, 67"]

K3 Example 1) Master: Instruct workers to compute the local maximum. Find the oldest employee in the dataset. K3 Code:
// Send a message to each peer's computelocalmax trigger
trigger start: () = \_ -> (
  workers.iterate (\p -> (computelocalmax, p.addr) <- ())
)

K3 Example 1) Master: Instruct workers to compute the local maximum. Find the oldest employee in the dataset. K3 Code:
// Send a message to each peer's computelocalmax trigger
trigger start: () = \_ -> (
  workers.iterate (\p -> (computelocalmax, p.addr) <- ())
)
[Diagram: the master sends a computelocalmax() message to each worker]

K3 Example 2) Workers: Compute the local maximum and send it to the master. Find the oldest employee in the dataset. K3 Code:
// Send the local max to the collectmax trigger at the master
trigger computelocalmax: () = \_ ->
  let max = local_data.fold
    (\acc -> \elem -> if elem.age > acc.age then elem else acc)
    local_data.peek()
  in (collectmax, master) <- max
[Diagram: each worker runs computelocalmax()]

K3 Example 2) Workers: Compute the local maximum and send it to the master. Find the oldest employee in the dataset. K3 Code:
// Send the local max to the collectmax trigger at the master
trigger computelocalmax: () = \_ ->
  let max = local_data.fold
    (\acc -> \elem -> if elem.age > acc.age then elem else acc)
    local_data.peek()
  in (collectmax, master) <- max
[Diagram: the workers send collectmax("Bob Smith", 57), collectmax("Saul Goodman", 37), and collectmax("Walter White", 67) to the master]

K3 Example 3) Master: Wait to receive all messages, keeping track of the max. Find the oldest employee in the dataset. K3 Code:
// Wait to receive from each peer.
trigger collectmax: (string, int) = \(name, age) -> (
  global_max = if age > global_max.age then (name, age) else global_max;
  responses_recv += 1;
  if responses_recv == workers.size() then print "Finished!"
)
[Diagram: the master receives collectmax("Bob Smith", 57), collectmax("Saul Goodman", 37), and collectmax("Walter White", 67)]

K3 Example 3) Master: Wait to receive all messages, keeping track of the max. Find the oldest employee in the dataset. K3 Code:
// Wait to receive from each peer.
trigger collectmax: (string, int) = \(name, age) -> (
  global_max = if age > global_max.age then (name, age) else global_max;
  responses_recv += 1;
  if responses_recv == workers.size() then print "Finished!"
)
[Diagram: the master prints "Finished!" with global_max = (Walter White, 67)]
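To make the message flow concrete, the following is an illustrative Python sketch of the same start / computelocalmax / collectmax protocol, with threads standing in for worker peers and a queue standing in for the master's mailbox. The names mirror the K3 triggers above, but this is not K3 code.

# Illustrative Python sketch of the distributed-max protocol above.
# Threads stand in for K3 worker peers; a queue stands in for the
# master's mailbox. Names mirror the K3 triggers, but this is not K3.
import queue
import threading

# Each worker holds a local partition of (name, age) records.
partitions = [
    [("Bob Smith", 57), ("Ann Lee", 41)],
    [("Saul Goodman", 37), ("Kim Wexler", 36)],
    [("Walter White", 67), ("Jesse Pinkman", 25)],
]

master_mailbox = queue.Queue()

def compute_local_max(local_data):
    # Worker trigger: fold over the local partition and send the local
    # maximum to the master's collectmax "trigger" (the queue).
    local_max = max(local_data, key=lambda rec: rec[1])
    master_mailbox.put(local_max)

def collect_max(num_workers):
    # Master trigger: wait for one response per worker, tracking the
    # running global maximum.
    global_max = None
    for _ in range(num_workers):
        name, age = master_mailbox.get()
        if global_max is None or age > global_max[1]:
            global_max = (name, age)
    print("Finished!", global_max)

# "start" trigger: the master instructs every worker to compute its local max.
workers = [threading.Thread(target=compute_local_max, args=(p,)) for p in partitions]
for w in workers:
    w.start()
collect_max(len(workers))
for w in workers:
    w.join()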

Outline Background Motivation Detecting Membership Changes with Spread Modes of Fault Tolerance in K3 Demonstration

Fault Tolerance Motivation Analytical queries are often long-running: large volumes of data, limited CPU throughput, and iterative algorithms that take time to converge. The likelihood of failure increases with the number of machines: hardware failures, bad disks, power loss, etc. At the largest scale (hours of computation across hundreds or thousands of machines), periodic faults will occur.

Mid-Query Fault Tolerance Without fault tolerance: restart the entire computation; any progress made toward a solution before the crash is lost. Start from scratch and hope for no failures, or else repeat! Existing solutions tolerate crashes via replicated input, optional checkpointing of intermediate program state (expensive), and replaying work that has been lost. Hadoop and Spark both replay missing work.

Fault Tolerance in K3 Before our project, K3 did not handle faults. When a process crashed, other peers might become stuck waiting for messages, or might attempt to send messages to a missing peer. Consider the max example.

K3 Crash Example 3) Master: Wait to receive all messages, keeping track of the max. Find the oldest employee in the dataset. K3 Code:
// Wait to receive from each peer.
trigger collectmax: (string, int) = \(name, age) -> (
  global_max = if age > global_max.age then (name, age) else global_max;
  responses_recv += 1;
  if responses_recv == workers.size()   // Will never occur!
  then print "Finished!"
)
[Diagram: a worker crashes before responding, so the master never receives all of the collectmax messages]

Fault Tolerance in K3 Need to offer the programmer a way to react to crashes, allowing them to implement application-specific logic for handling a crash. We explored several applications/modes of failure handling. Alternatively, a general solution might leverage static analysis of a K3 program to provide fault tolerance automatically, placing less burden on the programmer; this is a potential area for future work, not covered in this project.

Outline Background Motivation Detecting Membership Changes with Spread Modes of Fault Tolerance in K3 Demonstration

Spread Group communication toolkit to assist in building reliable distributed systems. Allows processes to join groups for communication; processes are alerted when others join or leave groups, e.g., due to a process crash or network partition. We incorporated a Spread client into the K3 runtime for its membership functionality.

K3 / Spread K3 processes act as Spread clients; K3 command-line args specify the connection parameters. Startup protocol: wait for all processes to join a public group, then the processes agree on the initial set of peers sent to the K3 program. The Spread event loop runs in a separate thread from the K3 event loop: when the Spread client code receives a membership change, it creates the appropriate K3 message and injects it into the program's queue, as sketched below.
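The following is a rough Python sketch of that integration pattern under simplified assumptions: a listener running on its own thread converts group-membership events into ordinary messages on the engine's queue. FakeMembershipSource is a stand-in invented for this example, not the real Spread client API.

# Sketch of the integration pattern: a membership listener on its own
# thread turns group-membership events into ordinary engine messages.
# FakeMembershipSource is a stand-in, not the real Spread client API.
import queue
import threading
import time

engine_queue = queue.Queue()   # stands in for the K3 program's message queue

class FakeMembershipSource:
    """Pretends to be a group-communication client that reports membership views."""
    def __init__(self):
        self._views = [["peer1", "peer2", "peer3"], ["peer1", "peer3"]]

    def next_view(self):
        # Block until the next membership view; here we just replay two views.
        time.sleep(0.1)
        return self._views.pop(0) if self._views else None

def membership_loop(source, out_queue):
    # Runs in its own thread, separate from the engine's event loop.
    # Each membership change is wrapped as a message and injected into
    # the engine queue, where it is dispatched like any other trigger call.
    while True:
        view = source.next_view()
        if view is None:
            break
        out_queue.put(("membership_change", view))

threading.Thread(target=membership_loop,
                 args=(FakeMembershipSource(), engine_queue),
                 daemon=True).start()

# Engine event loop: membership messages interleave with normal messages.
for _ in range(2):
    msg, payload = engine_queue.get()
    print("dispatching trigger:", msg, "with members:", payload)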

K3 / Spread We allow programmers to designate a special trigger for handling a membership change, indicated with a @:Membership annotation. The trigger receives the set of new members as an argument and contains arbitrary application-specific logic for reacting to the change. After startup, it is called after each membership change. K3 Code:
trigger t: [address]@set = (\members ->
  print "Oh no! A membership change!";
  ...
) @:Membership

Outline Background Motivation Detecting Membership Changes with Spread Modes of Fault Tolerance in K3 Demonstration

Fault Tolerance in K3 We explored 3 example models of fault tolerance: Terminate gracefully after a crash Remaining peers continue after a crash (approximate solution) Replay missing work after a crash

Fault Tolerance in K3 We explored 3 example models of fault tolerance: Terminate gracefully after a crash Remaining peers continue after a crash (approximate solution) Replay missing work after a crash

Graceful Termination Simply exit the program when a membership change is received. A baby step towards fault tolerance: all peers are aware of the crash, which prevents a peer from becoming stuck, and the approach is general enough for all programs. However, it only prevents the stuck state; it does not help with producing output. K3 Code:
trigger t: [address]@set = (\members ->
  shutdown()
) @:Membership

Fault Tolerance in K3 We explored 3 models of fault tolerance: Terminate gracefully after a crash Remaining peers continue after a crash (approximate solution) Replay missing work after a crash

Continue After a Crash For example, make the following changes to prevent the stuck state in the max program. The master keeps track of which peers are expected to respond at any time, instead of counting responses. After a membership change: the master stops expecting messages from any missing peer, and workers exit if the master has been lost. A sketch of this bookkeeping follows.
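A minimal Python sketch of that bookkeeping, assuming the master keeps a set of peers it still expects to hear from rather than a response counter; the peer names and handler names are illustrative, not K3.

# Sketch: the master tracks the set of peers it still expects to hear
# from, instead of counting responses. Names are illustrative, not K3.

expected = {"worker1", "worker2", "worker3"}   # peers that owe a response
global_max = None

def on_collect_max(sender, name, age):
    """Handle a local-max response from a worker."""
    global global_max
    if sender in expected:
        expected.discard(sender)
        if global_max is None or age > global_max[1]:
            global_max = (name, age)
    maybe_finish()

def on_membership_change(current_members):
    """Stop expecting responses from any peer that has disappeared."""
    expected.intersection_update(current_members)
    maybe_finish()

def maybe_finish():
    if not expected:
        print("Finished (possibly approximate):", global_max)

# Example: worker2 crashes before responding.
on_collect_max("worker1", "Bob Smith", 57)
on_collect_max("worker3", "Walter White", 67)
on_membership_change({"master", "worker1", "worker3"})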

Continue After a Crash Able to reach an approximate solution: partitions of data have been lost, which may affect the answer. In the max example, the partition containing the true maximum may have been lost. Appropriate in certain situations only, e.g., training a statistical model; it is up to the developer to decide whether this is acceptable.

Fault Tolerance in K3 We explored 3 models of fault tolerance: Terminate gracefully after a crash Remaining peers continue after a crash (approximate solution) Replay missing work after a crash

Recovery by Replay Motivated by Spark's Resilient Distributed Datasets (RDDs): applications that apply coarse-grained transformations to partitioned datasets. Many algorithms can be encoded in this model. Input datasets must be replicated, e.g., HDFS replicates input data 3 times.

Recovery by Replay In the RDD model, each partition of data has a lineage, or set of dependencies: input data comes directly from disk, and intermediate data is the result of applying transformations to previously defined partitions. When a partition is lost, it can be re-computed by replaying its lineage, which bottoms out at disk if there are still replicas available. See the example in the demonstration and the sketch below.
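An illustrative Python sketch of the lineage idea under simplified assumptions (the Partition class is invented for this example, not Spark's or K3's API): each partition records either its replicated on-disk source or the transformation and parent partitions it was derived from, so a lost partition can be recomputed on demand.

# Sketch of lineage-based recovery: each partition knows either its
# replicated on-disk source or the transformation + parents it came
# from, so a lost partition can be recomputed instead of checkpointed.

class Partition:
    def __init__(self, source=None, transform=None, parents=()):
        self.source = source          # replicated input data, if a leaf
        self.transform = transform    # function applied to parent data
        self.parents = parents        # lineage: partitions we depend on
        self.data = None              # in-memory contents (may be lost)

    def compute(self):
        """Return this partition's data, replaying its lineage if it was lost."""
        if self.data is None:
            if self.source is not None:
                self.data = list(self.source)        # re-read the replica
            else:
                parent_data = [p.compute() for p in self.parents]
                self.data = self.transform(*parent_data)
        return self.data

# Leaf partition backed by (replicated) input, plus one derived partition.
raw = Partition(source=[3, 1, 4, 1, 5, 9, 2, 6])
evens = Partition(transform=lambda xs: [x for x in xs if x % 2 == 0],
                  parents=(raw,))

print(evens.compute())   # [4, 2, 6]
evens.data = None        # simulate losing the in-memory partition
print(evens.compute())   # recomputed from lineage: [4, 2, 6]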

Recovery by Replay In the RDD model, when a machine crashes, the partitions that it was hosting are re-assigned to several other machines, allowing the lost work to be replayed in parallel. This does not require expensive checkpointing and replication of logs or intermediate datasets, which is a big issue when datasets are large.

Outline Background Motivation Detecting Membership Changes with Spread Modes of Fault Tolerance in K3 Demonstration

Proof of Concept We implemented a multi-stage SQL query from the AMPLab Big Data Benchmark, https://amplab.cs.berkeley.edu/benchmark/ (Query 2). It replays missing work in the event of a crash and can handle as many crashes as there are replicas of the input data. It picks a new master if the master is lost, so there is no single point of failure. We demonstrate a proof of concept using 6 processes across 2 physical machines on a sample dataset.

SQL Example Dataset: Rankings lists websites and their page rank. Schema: pageURL VARCHAR(300), pageRank INT, avgDuration INT. UserVisits stores server logs for each web page. Schema: sourceIP VARCHAR(116), destURL VARCHAR(100), adRevenue FLOAT, ... (remaining columns omitted). English query: for the user that generated the most ad revenue during 1980, report the sourceIP, total revenue, and average page rank of the pages visited by this user. SQL Query:
SELECT sourceIP, totalRevenue, avgPageRank
FROM (SELECT sourceIP,
             AVG(pageRank) AS avgPageRank,
             SUM(adRevenue) AS totalRevenue
      FROM Rankings AS R, UserVisits AS UV
      WHERE R.pageURL = UV.destURL
        AND UV.visitDate BETWEEN Date('1980-01-01') AND Date('1981-01-01')
      GROUP BY UV.sourceIP)
ORDER BY totalRevenue DESC LIMIT 1

SQL Example: Logical Plan Rankings, UserVisits -> FILTER WHERE visitDate between 1980 and 1981 -> JOIN ON pageURL = destURL -> GROUP BY sourceIP: SUM(adRevenue), AVG(pageRank) -> MAX totalRevenue
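For reference, a plain-Python rendering of the same plan on toy rows; the column names follow the benchmark schema, but the rows are made up for illustration.

# Plain-Python rendering of the logical plan on toy rows.
# Column names follow the benchmark schema; the rows are made up.
from collections import defaultdict

rankings = [   # (pageURL, pageRank)
    ("a.com", 10), ("b.com", 50), ("c.com", 30),
]
uservisits = [  # (sourceIP, destURL, visitDate, adRevenue)
    ("1.1.1.1", "a.com", "1980-03-01", 5.0),
    ("1.1.1.1", "b.com", "1980-07-15", 7.5),
    ("2.2.2.2", "c.com", "1980-05-20", 9.0),
    ("2.2.2.2", "a.com", "1982-01-01", 99.0),   # filtered out by date
]

# FILTER on visitDate, then JOIN on pageURL = destURL.
page_rank = dict(rankings)
filtered = [uv for uv in uservisits if "1980-01-01" <= uv[2] <= "1981-01-01"]
joined = [(ip, page_rank[url], rev) for ip, url, _, rev in filtered
          if url in page_rank]

# GROUP BY sourceIP: SUM(adRevenue), AVG(pageRank).
groups = defaultdict(lambda: [0.0, 0.0, 0])   # ip -> [revenue, rank_sum, count]
for ip, rank, rev in joined:
    g = groups[ip]
    g[0] += rev
    g[1] += rank
    g[2] += 1

# ORDER BY totalRevenue DESC LIMIT 1.
ip, (total_rev, rank_sum, n) = max(groups.items(), key=lambda kv: kv[1][0])
print(ip, total_rev, rank_sum / n)   # 1.1.1.1 12.5 30.0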

SQL Example: Physical Plan [Diagram: Rankings partitioned on pageURL (8 partitions); UserVisits filtered into Filtered UserVisits, partitioned on destURL (8 partitions); the two are joined, grouped by sourceIP (partitioned on sourceIP), and the maximum is collected at the master]

SQL Example: Gold Crashed [Diagram: the same physical plan after the machine "gold" crashes, with partitions color-coded. Red: requires a resend. Yellow: may require a resend. Green: no resend]

SQL Example: Implementation The assignment/location of all partitions is known by all peers: locations for replicas of the input data are provided in the deployment config, and assignments are a pure function of the current membership (see the sketch below). Request/response model: request all dependencies required to perform local computation. When a request is received: compute locally and respond if all dependency data is local (always possible at the leaves of the plan); otherwise, request the dependencies required for local computation and respond after all requests are fulfilled. In the event of a membership change: reassign all partitions and reissue requests as necessary.
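An illustrative Python sketch of the assignment rule, assuming a simple round-robin mapping over the sorted membership (the real assignment also accounts for the replica locations given in the deployment config): every peer can compute the same assignment independently after any membership change and determine which requests must be reissued.

# Sketch: partition assignment as a pure function of the current
# membership, so every peer independently agrees on who owns what
# and which requests must be reissued after a change.
# Peer names ("gold", "silver", "bronze") are illustrative.

def assign_partitions(members, num_partitions):
    """Deterministically map partition ids to peers (round-robin over sorted members)."""
    owners = sorted(members)
    return {pid: owners[pid % len(owners)] for pid in range(num_partitions)}

before = assign_partitions({"gold", "silver", "bronze"}, num_partitions=8)
after = assign_partitions({"silver", "bronze"}, num_partitions=8)   # "gold" crashed

# Any partition whose owner changed must be reassigned and re-requested.
to_reissue = [pid for pid in before if before[pid] != after[pid]]
print("reassigned partitions:", to_reissue)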

Demonstration 4 versions of the query: No Fault Tolerance (gets stuck) Terminate Gracefully Continue with missing data Replay missing work