ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective
Part II: Data Center Software Architecture, Topic 3: Programming Models
CIEL: A Universal Execution Engine for Distributed Dataflow Computing
Presented by Saeed Shokri

Outline
1. Overview
2. Why CIEL
3. Goals
4. Design
5. Fault Tolerance
6. Performance
7. Conclusion
("Ciel" means "sky" in French and is pronounced "see-elle".)

Overview
Some existing distributed execution engines for cluster hardware:
- MapReduce
- Pregel
- Dryad
- Piccolo
- CIEL

Overview
A distributed execution engine is a software system that automatically executes a program, in parallel, on a cluster of networked computers that provides a large aggregate amount of computational and I/O performance.
Distributed execution engines are attractive because they shield developers from the challenging aspects of distributed and parallel computing, such as synchronization, scheduling, data transfer, and dealing with failures.
Data-dependent control flow is the fundamental concept that enables a machine to change its behavior on the basis of intermediate results. This ability increases the computational power of a machine, because it enables the machine to execute iterative algorithms.

Overview
Google's MapReduce runs programs defined by two functions: map(), which operates on the input records to produce intermediate data, and reduce(), which operates on the intermediate data to produce a final result.
Apache Hadoop MapReduce: the simplicity of MapReduce led to several clones, including a popular open-source version called Hadoop.
Dryad: Microsoft developed a more general execution engine, called Dryad, which operates on programs that are written as data-flow graphs.

Comparison of Distributed Execution Engines
CIEL provides:
- Distributed data-flow computing
- Task dependencies
- Dynamic coordination
- Transparency (fault tolerance, scaling, locality)

Why CIEL
Why do we need another distributed execution engine, one called CIEL? MapReduce and Dryad have two disadvantages:
1. They are designed to maximize throughput, not to minimize latency.
2. They perform scheduling before running the algorithm, so the resulting schedule is static.
These make MapReduce and Dryad unsuitable for iterative algorithms. Moreover, many algorithms contain data-dependent control flow and cannot be expressed using these earlier execution engines.

Sample Iterative Algorithm
The result of the do_lots_of_work() function is used to decide whether or not the while loop should terminate. The amount of work depends on the input data and can only be determined by actually running the algorithm.
MapReduce and Dryad require a complete list of tasks to be provided when a job is submitted, so they cannot natively handle this type of algorithm (see the sketch below). Instead, the user must write a separate driver program, which submits multiple jobs, fetches their results, and makes the decision about when to terminate the computation.
Because the driver program runs outside the cluster, it does not enjoy the benefits of running on an execution engine, in particular transparent fault tolerance. If the driver program crashes, or loses network connectivity to the cluster, the entire computation is lost.
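A minimal sketch of the loop described above, in the C-like syntax used by the later Skywriting examples; do_lots_of_work() is the name used on the slide, and the surrounding structure is an illustrative assumption:

    // Illustrative sketch only: the termination test is an assumption;
    // do_lots_of_work() is the function named on the slide.
    converged = false;
    while (!converged) {
        // Whether another iteration is needed is only known after the work
        // runs, so the full task list cannot be enumerated at submission time.
        converged = do_lots_of_work();
    }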

Goals
Design a distributed execution framework that can:
1. efficiently run iterative algorithms;
2. provide a simple interface;
3. offer transparent fault tolerance.

CIEL's Model
CIEL implements an execution model for distributed execution engines that supports data-dependent control flow. The model is based on dynamic task graphs, in which each vertex is a sequential computation that may decide, on the basis of its input, to spawn additional computation and hence rewrite the graph.
Data-dependent control flow can be supported in a distributed execution engine by adding the facility for a task to spawn further tasks. A dynamic task graph is like a Dryad data-flow graph, but it also allows tasks to rewrite the graph by spawning new tasks and delegating their outputs.

Primitives of the Model: Dynamic Task Graphs
Objects: The goal of a CIEL job is to produce one or more output objects. An object is an unstructured, finite-length sequence of bytes, and every object has a unique name. To simplify consistency and replication, an object is immutable once it has been written, although it is sometimes possible to append to an object.
References: A reference describes an object without possessing its full contents. It comprises a name and a set of locations (e.g., hostname-port pairs) where the object with that name is stored. The set of locations may be empty: in that case, the reference is a future reference to an object that has not yet been produced. Otherwise, it is a concrete reference, which may be consumed.

Dynamic Task Graphs
Tasks: A CIEL job makes progress by executing tasks. A task is a non-blocking, atomic computation that executes completely on a single machine. A task has one or more dependencies, which are represented by references, and the task becomes runnable when all of its dependencies become concrete. The dependencies include a special object that specifies the behavior of the task (such as an executable binary or a Java class). A task also has one or more expected outputs, which are the names of objects that the task will either create or delegate another task to create.
(Figure: objects, references, and tasks in a dynamic task graph.)

Data-Dependent Control Flow
For each of its expected outputs, a task must either publish a concrete reference or spawn a child task with that name as an expected output. Publishing objects for its expected outputs may cause other tasks that depend on those outputs to become runnable; when the children eventually terminate, any task that depends on the parent's output will eventually become runnable.
A child task must only depend on concrete references (i.e., objects that already exist) or future references to the outputs of tasks that have already been spawned (i.e., objects that are already expected to be published). This prevents deadlock, as a cycle cannot form in the dependency graph (see the sketch below).
The key feature of CIEL is the dynamic task graph.
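A minimal Skywriting sketch of this dependency rule, with f and g as hypothetical functions:

    x = spawn(f, []);    // x is a future reference to f's output
    y = spawn(g, [x]);   // legal: g depends on the output of an already-spawned task
    return *y;           // cycles cannot form, so evaluation cannot deadlock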

A Dynamic Task Graph
(Figure: a dynamic task graph, in which a task spawns further tasks.)

System Architecture
A CIEL cluster has a single master and many workers. Clients submit jobs to the master, and the master dispatches tasks to the workers for execution. After a task completes, the worker publishes a set of objects and may spawn further tasks.
(Figure: a CIEL cluster.)

System Architecture: Master
The master maintains the current state of the dynamic task graph in the object table and the task table. Each row in the object table contains the latest reference for an object, including its locations, and a pointer to the task that is expected to produce it. Each row in the task table corresponds to a spawned task and contains pointers to the references on which the task depends.
The master's scheduler is responsible for making progress in a CIEL computation: it lazily evaluates output objects and pairs runnable tasks with idle workers. Because task I/O may be large (gigabytes per task), all bulk data is stored on the workers themselves, and the master handles only references. The master uses a multiple-queue-based scheduler to dispatch each task to the worker nearest its input data.

Task and Object Tables Maintained in the Master
(Figure: the object table contains the latest reference for each object, including its locations, and a pointer to the task that is expected to produce it.)

System Architecture: Workers
The workers execute tasks and store objects. If a worker needs a remote object, it reads the object directly from another worker. A worker registers with the master and periodically sends a heartbeat to demonstrate its availability.
When a task is dispatched to a worker, the appropriate executor is invoked. An executor is a generic component that prepares input data for consumption and invokes some computation on it. When a worker finishes executing a task, it replies to the master with the set of references that it wishes to publish and a list of any new tasks that it wishes to spawn. The master then updates the object table and task table and re-evaluates the set of tasks that are now runnable.

Skywriting
Skywriting is a language for expressing task-level parallelism that runs on top of CIEL.
Task creation in Skywriting: task creation is the distinctive feature that facilitates data-dependent control flow. The essential ways to create tasks in Skywriting are:
1. spawn(f, [args, ...]): spawns a parallel task that computes f(args, ...) and returns a reference to its result.
2. spawn_exec(executor, args, n): spawns a parallel task that runs the given executor with the given args, producing n outputs.
3. exec(executor, args, n): the synchronous counterpart of spawn_exec.
4. Dereference (unary *): a unary operator that applies to a reference; it loads the referenced data and evaluates to the resulting data structure.

Skywriting Script for Computing Fibonacci Numbers
The Fibonacci sequence starts with a zero (or a one), followed by a one, and proceeds by the rule that each number (called a Fibonacci number) is the sum of the preceding two: fib(0) = 0, fib(1) = 1, and fib(n) = fib(n - 1) + fib(n - 2). The sequence is thus 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, ...
Fibonacci numbers are of interest to biologists and physicists because they are frequently observed in various natural objects and phenomena: the branching patterns in trees and leaves, for example, and the distribution of seeds in a raspberry follow Fibonacci numbers.

Skywriting Script for Computing Fibonacci Numbers
Skywriting can be used to define a data-dependent parallel algorithm. For n > 1, the fib(n) function spawns two tasks to calculate fib(n - 1) and fib(n - 2), dereferences the results of these tasks, adds them together, and returns the sum. The dereference operator (*) is applied to x and y, which blocks the current task until the future references have become concrete (a sketch of the script follows). This example suggests the possibility of using Skywriting and CIEL to execute parallel divide-and-conquer algorithms, such as decision tree learning.
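The script itself did not survive transcription; the following sketch reconstructs it along the lines of the example in the CIEL paper:

    function fib(n) {
        if (n <= 1) {
            return n;
        } else {
            x = spawn(fib, [n - 1]);   // parallel task for fib(n - 1)
            y = spawn(fib, [n - 2]);   // parallel task for fib(n - 2)
            return *x + *y;            // block until both futures are concrete
        }
    }
    return fib(10);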

Spawning Tasks
The distinctive feature of Skywriting is its ability to spawn new tasks in the middle of executing a job. The language provides two explicit mechanisms for spawning new tasks (the spawn() and spawn_exec() functions) and one implicit mechanism (the * operator).

Skywriting Tasks
The spawn() function creates a new task to run the given Skywriting function. The Skywriting runtime first creates a data object that contains the new task's environment, including the text of the function to be executed and the values of any arguments passed to the function. This object is called a Skywriting continuation, because it encapsulates the state of a computation. The runtime then creates a task descriptor for the new task, which includes a dependency on the new continuation. Finally, it assigns a reference for the task result, which it returns to the calling script.
(Figure: blocking on futures.)

Non-Skywriting Task Creation
The spawn_exec() function is a lower-level task-creation mechanism that allows the caller to invoke code written in a different language. Typically, this function is not called directly, but rather through a wrapper for the relevant executor. When spawn_exec() is called, the runtime serializes the arguments into a data object and creates a task that depends on that object. If the arguments to spawn_exec() include references, the runtime adds those references to the new task's dependencies, to ensure that CIEL will not schedule the task until all of its arguments are available. Finally, the runtime creates references for the task outputs and returns them to the calling script.
(Figure: a non-Skywriting task created with spawn_exec().)
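A hedged sketch of the call pattern; the "java" executor name and the argument keys are illustrative assumptions rather than details taken from the slides:

    // spawn_exec() serializes its arguments into a data object, records
    // input_ref as a dependency, and returns a list of n (here 1) output
    // references for the new task.
    outputs = spawn_exec("java", {"class": "ExampleTask", "inputs": [input_ref]}, 1);
    return *outputs[0];   // dereference the single output once it is produced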

Implicit Task Creation
If a task attempts to dereference an object that has not yet been created (i.e., the result of a call to spawn()), the current task must block. However, CIEL tasks are non-blocking: all synchronization (and data flow) must be made explicit in the dynamic task graph. Instead, the runtime implicitly creates a continuation task that depends on the dereferenced object and the current continuation (i.e., the current Skywriting execution stack). The new task will therefore only run when the dereferenced object has been produced, which provides the necessary synchronization.
(Figure: a Skywriting script that spawns two tasks and blocks on their results.)
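The script in the figure did not survive transcription; a minimal sketch of its shape, with f, g, a, and b as hypothetical names:

    x = spawn(f, [a]);   // two tasks are spawned and run in parallel
    y = spawn(g, [b]);
    return *x + *y;      // each * implicitly creates a continuation task that
                         // runs only once the corresponding future is concrete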

Task Termination
A task terminates when it reaches a return statement (or when it blocks on a future reference). A Skywriting task has a single output, which is the value of the expression in the return statement. On termination, the runtime stores the output in the local object store, publishes a concrete reference to the object, and sends a list of spawned tasks to the master, in order of creation.

Fault Tolerance
- Client: trivial, since no driver program is required.
- Worker: monitored by the master (similar to Dryad).
- Master: master state can be derived from the set of active jobs. This is accomplished with persistent logging, object-table reconstruction by the workers, and secondary (hot-standby) masters.

Master Fault Tolerance (Persistent-Log Approach)
The persistent-log approach creates one log file per job. When a job is submitted, a new log file is created, and the initial log entry, containing the job-submission message, is written synchronously to that file. All spawn and publish messages that the master receives can be written to the log asynchronously. After the master fails, a new master replays the log, applying each operation in order to rebuild the dynamic task graph for the job.

Master Fault Tolerance (Secondary-Master Approach)
The secondary-master approach is similar to the persistent-log approach. The job-submission message and all spawn and publish messages are forwarded to a secondary master, which immediately applies these operations to build a hot-standby version of the dynamic task graph. To maintain the same reliability guarantees, the master must wait until the secondary master acknowledges the job-submission message before returning an acknowledgement to the client. All other messages may be sent asynchronously.

Performance: Comparison with a Production System
(Figure: distributed grep on Hadoop and CIEL.)

Performance of an Iterative Algorithm
(Figure: k-means on Hadoop and CIEL with 20 workers.)

Related Work
- Pregel: Google's distributed execution engine, designed primarily for graph algorithms.
- HaLoop: the task scheduler is made loop-aware by adding caching mechanisms (lacks fault tolerance).
- Apache Mahout: uses Hadoop as its execution engine, with a driver program running iterative algorithms (lacks master fault tolerance and requires a driver program).
- Dryad: allows data flow to follow a more general directed acyclic graph (does not support dynamic, data-dependent control flow).
- Naiad: a timely-dataflow system; a distributed system for executing data-parallel, cyclic dataflow programs.

Conclusion
CIEL and Skywriting are not good for:
- sharing large amounts of data
- fine-grained parallelization
- fully automatic parallelism
- relational-algebra environments
- serving as a distributed operating system
CIEL and Skywriting are good for:
- writing iterative algorithms
- data-dependent control flow using dynamic task graphs
- transparent fault tolerance and automatic distribution
- scaling across hundreds of machines

Reference
Derek G. Murray, Malte Schwarzkopf, Christopher Smowton, Steven Smith, Anil Madhavapeddy, and Steven Hand. CIEL: a universal execution engine for distributed data-flow computing. In NSDI 2011: Proceedings of the 8th USENIX Symposium on Networked Systems Design and Implementation, 2011.

Discussion Questions
1. In what aspect(s) is CIEL different from MapReduce and Dryad? (See the first paragraph of Section 1.)
2. What weaknesses of Pregel and MapReduce, respectively, are addressed by CIEL? (See the 4th, 5th, and 6th paragraphs of Section 2.)