ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models CIEL: A Universal Execution Engine for Distributed Dataflow Computing Presented by Saeed Shokri
Outline: 1. Overview 2. Why CIEL 3. Goals 4. Design 5. Fault Tolerance 6. Performance 7. Conclusion (CIEL means "sky" in French and is pronounced "see-elle")
Overview Some existing distributed execution engines for cluster hardware: MapReduce, Pregel, Dryad, Piccolo, CIEL
Overview A distributed execution engine is a software system that automatically executes a program, in parallel, on a cluster of networked computers, providing a large aggregate amount of computational and I/O performance. Distributed execution engines are attractive because they shield developers from the challenging aspects of distributed and parallel computing, such as synchronization, scheduling, data transfer, and dealing with failures. Data-dependent control flow is the fundamental concept that enables a machine to change its behavior on the basis of intermediate results. This ability increases the computational power of a machine, because it enables the machine to execute iterative algorithms.
Overview Google's MapReduce runs programs defined by two functions: map(), which operates on the input records to produce intermediate data; and reduce(), which operates on the intermediate data to produce a final result. Apache Hadoop MapReduce: the simplicity of MapReduce led to several clones, including a popular open-source version called Hadoop. Dryad: Microsoft developed a more general execution engine, called Dryad, which operates on programs written as data-flow graphs.
Comparison of Distributed Execution Engines CIEL provides: distributed data-flow computing; task dependencies; dynamic coordination; transparency (fault tolerance, scaling, locality)
Why CIEL Why do we need another distributed execution engine, CIEL? MapReduce/Dryad have disadvantages: 1. They are designed to maximize throughput, not to minimize latency. 2. They perform scheduling before running the algorithm, so the resulting schedule is static. This makes MapReduce/Dryad unsuitable for iterative algorithms. Many algorithms contain data-dependent control flow, and cannot be expressed using previous execution engines.
Sample Iterative Algorithm The result of the do_lots_of_work() function is used to decide whether or not the while loop should terminate. The amount of work depends on the input data, and can only be determined by actually running the algorithm. MapReduce and Dryad require a complete list of tasks to be provided when a job is submitted, so they cannot natively handle this type of algorithm. In MapReduce and Dryad, the user must write a separate driver program, which submits multiple jobs, fetches their results, and makes the decision about when to terminate the computation. The driver program runs outside the cluster, so it doesn't enjoy the benefits of running on an execution engine, in particular transparent fault tolerance. If the driver program crashes, or loses network connectivity to the cluster, the entire computation is lost.
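The shape of such an algorithm can be sketched in Python; do_lots_of_work() and the convergence test below are hypothetical stand-ins, not code from the paper:

```python
# Hypothetical sketch of a data-dependent loop: the number of
# iterations can only be discovered by running the computation.
def do_lots_of_work(data):
    # Stand-in for one parallel iteration (e.g. one MapReduce job).
    return [x / 2 for x in data]

def converged(data, threshold=1.0):
    return sum(data) < threshold

data = [8.0, 4.0, 2.0]
iterations = 0
while not converged(data):
    data = do_lots_of_work(data)  # the result decides whether to loop again
    iterations += 1
```

No static schedule can say in advance how many times the loop body runs; that is exactly what forces MapReduce/Dryad users into an external driver program.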
Goals Design a distributed execution framework that can: 1. efficiently run iterative algorithms 2. provide a simple interface 3. offer transparent fault tolerance
CIEL's Model CIEL is an execution model for distributed execution engines that supports data-dependent control flow. The model is based on dynamic task graphs, in which each vertex is a sequential computation that may decide, on the basis of its input, to spawn additional computation and hence rewrite the graph. Data-dependent control flow can be supported in a distributed execution engine by adding the facility for a task to spawn further tasks. A dynamic task graph is like a Dryad data-flow graph, but it also allows tasks to rewrite the graph by spawning new tasks and delegating their outputs.
Primitives of the model: Dynamic task graphs Objects: The goal of a CIEL job is to produce one or more output objects. An object is an unstructured, finite-length sequence of bytes, and every object has a unique name. To simplify consistency and replication, an object is immutable once it has been written, but it is sometimes possible to append to an object. References: A reference describes an object without possessing its full contents. It comprises a name and a set of locations (e.g. hostname-port pairs) where the object with that name is stored. The set of locations may be empty: in that case, the reference is a future reference to an object that has not yet been produced. Otherwise, it is a concrete reference, which may be consumed.
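The future/concrete distinction can be modeled in a few lines of Python (field names here are illustrative, not CIEL's actual representation):

```python
from dataclasses import dataclass, field

# Minimal sketch of a CIEL reference: a name plus a (possibly
# empty) set of storage locations.
@dataclass
class Reference:
    name: str
    locations: set = field(default_factory=set)

    def is_future(self):
        # An empty location set means the object has not been produced yet.
        return not self.locations

ref = Reference("obj:result")           # no locations: a future reference
was_future = ref.is_future()
ref.locations.add(("worker-1", 8001))   # a worker publishes the object
now_concrete = not ref.is_future()      # same name, now consumable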
Dynamic task graphs Tasks: A CIEL job makes progress by executing tasks. A task is a non-blocking atomic computation that executes completely on a single machine. A task has one or more dependencies, which are represented by references, and it becomes runnable when all of its dependencies become concrete. The dependencies include a special object that specifies the behavior of the task (such as an executable binary or a Java class). A task also has one or more expected outputs, which are the names of objects that the task will either create or delegate another task to create. [Figure: objects, references, and tasks]
Data-dependent control flow For each expected output, a task must either publish a concrete reference or spawn a child task with that name as an expected output. A task can publish objects for its expected outputs, which may cause other tasks to become runnable if they depend on those outputs. When the children eventually terminate, any task that depends on the parent's output will eventually become runnable. A child task must only depend on concrete references (i.e. objects that already exist) or future references to the outputs of tasks that have already been spawned (i.e. objects that are already expected to be published). This prevents deadlock, as a cycle cannot form in the dependency graph. The key feature of CIEL is the dynamic task graph.
A Dynamic Task Graph [Figure: a dynamic task graph, in which a task spawns further tasks]
System Architecture A CIEL cluster has a single master and many workers. The master dispatches tasks to the workers for execution. After a task completes, the worker publishes a set of objects and may spawn further tasks. Clients submit jobs to the master. [Figure: a CIEL cluster]
System Architecture Master: The master maintains the current state of the dynamic task graph in the object table and task table. Each row in the object table contains the latest reference for that object, including its locations, and a pointer to the task that is expected to produce it. Each row in the task table corresponds to a spawned task and contains pointers to the references on which the task depends. The master scheduler is responsible for making progress in a CIEL computation: it lazily evaluates output objects and pairs runnable tasks with idle workers. Because task I/O may be large (gigabytes per task), all bulk data is stored on the workers themselves, and the master handles only references. The master uses a multiple-queue-based scheduler to dispatch tasks to the worker nearest the data.
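Lazy evaluation can be sketched as a walk backwards from a requested output over the two tables. The dictionary encodings below are a toy simplification, not CIEL's actual schema:

```python
def lazy_runnable(output, producer, deps, concrete):
    """Walk back from a requested output object and collect the tasks
    whose dependencies are all concrete (illustrative simplification)."""
    runnable, stack, seen = [], [output], set()
    while stack:
        obj = stack.pop()
        if obj in concrete or obj in seen:
            continue
        seen.add(obj)
        task = producer[obj]                       # object table lookup
        missing = [d for d in deps[task] if d not in concrete]
        if missing:
            stack.extend(missing)                  # evaluate inputs first
        elif task not in runnable:
            runnable.append(task)                  # all inputs concrete
    return runnable

# "t1" turns "input" into "b"; "t2" turns "b" into "c"; only "input" exists.
producer = {"b": "t1", "c": "t2"}
deps = {"t1": ["input"], "t2": ["b"]}
ready = lazy_runnable("c", producer, deps, concrete={"input"})
```

Here only "t1" is runnable at first; once it publishes "b", the same walk would find "t2".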
Task and Object Tables Maintained on the Master Node [Figure: the object table, holding the latest reference for each object with its locations and a pointer to the producing task, and the task table]
System Architecture Workers: The workers execute tasks and store objects. If a worker needs to fetch a remote object, it reads the object directly from another worker. A worker registers with the master and periodically sends a heartbeat to demonstrate its availability. When a task is dispatched to a worker, the appropriate executor is invoked. An executor is a generic component that prepares input data for consumption and invokes some computation on it. When a worker completes a task, it replies to the master with the set of references that it wishes to publish and a list of any new tasks that it wishes to spawn. The master then updates the object table and task table, and re-evaluates the set of tasks that are now runnable.
Skywriting A language for expressing task-level parallelism that runs on top of CIEL. Task Creation in Skywriting: Task creation is the distinctive feature that facilitates data-dependent control flow. The essential primitives for creating and consuming tasks in Skywriting: 1. spawn(f, [args, ...]): spawns a parallel task that computes f(args, ...) and returns a pointer to the result. 2. spawn_exec(executor, args, n): spawns a parallel task to run executor with the given args, producing n outputs. 3. exec(executor, args, n): the synchronous version of spawn_exec(). 4. dereference (unary *): applies to a reference; loads the referenced data and evaluates to the resulting data structure.
Skywriting script for computing the Fibonacci number The Fibonacci sequence is a set of numbers that starts with a one or a zero, followed by a one, and proceeds based on the rule that each number (called a Fibonacci number) is equal to the sum of the preceding two numbers. The Fibonacci Sequence is the series of numbers: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, ... Fibonacci numbers are of interest to biologists and physicists because they are frequently observed in various natural objects and phenomena. The branching patterns in trees and leaves, for example, and the distribution of seeds in a raspberry are based on Fibonacci numbers.
Skywriting script for computing the Fibonacci number Skywriting can be used to define a data-dependent parallel algorithm. For n > 1, the fib(n) function spawns two tasks to calculate fib(n - 1) and fib(n - 2), dereferences the results of these tasks, adds them together, and returns the sum. The dereference operator (*) is applied to x and y, which blocks the current thread until the future reference has become concrete. This example suggests the possibility of using Skywriting and CIEL to execute parallel divide-and-conquer algorithms, such as decision tree learning.
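The same divide-and-conquer shape can be mimicked with Python futures. This is only an analogy: CIEL tasks are non-blocking and synchronize through continuation tasks, whereas .result() here blocks a real thread:

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=32)

def fib(n):
    if n <= 1:
        return n
    x = pool.submit(fib, n - 1)      # analogue of x = spawn(fib, [n - 1])
    y = pool.submit(fib, n - 2)      # analogue of y = spawn(fib, [n - 2])
    return x.result() + y.result()   # analogue of return *x + *y

result = fib(6)  # fib(6) == 8
```

Keep n small with a thread pool: unlike CIEL, a blocked fib() call occupies a worker thread, so deep recursion can exhaust the pool.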
Spawning Tasks A key feature of Skywriting is its ability to spawn new tasks in the middle of executing a job. The language provides two explicit mechanisms for spawning new tasks (the spawn() and spawn_exec() functions) and one implicit mechanism (the *-operator).
Skywriting Task Creation The spawn() function creates a new task to run the given Skywriting function. The Skywriting runtime first creates a data object that contains the new task's environment, including the text of the function to be executed and the values of any arguments passed to the function. This object is called a Skywriting continuation, because it encapsulates the state of a computation. The runtime then creates a task descriptor for the new task, which includes a dependency on the new continuation. Finally, it assigns a reference for the task result, which it returns to the calling script. [Figure: blocking on futures]
Non-Skywriting Task Creation The spawn_exec() function is a lower-level task-creation mechanism that allows the caller to invoke code written in a different language. This function is not called directly, but rather through a wrapper for the relevant executor. When spawn_exec() is called, the runtime serializes the arguments into a data object and creates a task that depends on that object. If the arguments to spawn_exec() include references, the runtime adds those references to the new task's dependencies, to ensure that CIEL will not schedule the task until all of its arguments are available. Finally, the runtime creates references for the task outputs and returns them to the calling script. [Figure: a non-Skywriting task created with spawn_exec()]
Implicit Task Creation If a task attempts to dereference an object that has not yet been created (i.e. the result of a call to spawn()), the current task must block. However, CIEL tasks are non-blocking: all synchronization (and data flow) must be made explicit in the dynamic task graph. Instead, the runtime implicitly creates a continuation task that depends on the dereferenced object and the current continuation (i.e. the current Skywriting execution stack). The new task will therefore only run when the dereferenced object has been produced, which provides the necessary synchronization. [Figure: a Skywriting script that spawns two tasks and blocks on their results]
Task Termination A task terminates when it reaches a return statement (or it blocks on a future reference). A Skywriting task has a single output, which is the value of the expression in the return statement. On termination, the runtime stores the output in the local object store, publishes a concrete reference to the object, and sends a list of spawned tasks to the master, in order of creation.
Fault Tolerance Client: Trivial, since no driver program is required. Worker: Monitored by the master (similar to Dryad). Master: Master state can be derived from the set of active jobs. This is accomplished with persistent logging, object table reconstruction by workers, or secondary masters.
Master Fault Tolerance (Log Approach) The persistent-log approach creates one log file per job. When a job is submitted, a new log file is created and the initial log entry, containing the job submission message, is written synchronously to that file. All spawn and publish messages that the master receives can be written to the log asynchronously. If the master fails, a new master will replay the log, applying each operation in order to rebuild the dynamic task graph for the job.
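Replay can be sketched as folding the logged operations back into the two tables. The JSON message format below is hypothetical; CIEL's actual log schema differs:

```python
import json

def replay(log_lines):
    # Rebuild minimal task and object tables by applying each
    # logged operation in order.
    tasks, objects = {}, {}
    for line in log_lines:
        msg = json.loads(line)
        if msg["type"] == "spawn":
            tasks[msg["task"]] = msg["deps"]          # task table entry
        elif msg["type"] == "publish":
            objects[msg["object"]] = msg["locations"]  # object now concrete
    return tasks, objects

log = [
    '{"type": "spawn", "task": "t1", "deps": ["input"]}',
    '{"type": "publish", "object": "out1", "locations": ["worker-1:8001"]}',
]
tasks, objects = replay(log)
```

Because the submission entry is written synchronously and later entries are idempotent to re-apply, the new master ends up with the same graph state regardless of where the old master crashed.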
Master Fault Tolerance (Secondary Master Approach) The secondary-master approach is similar to the persistent-log approach. The job submission message and all spawn and publish messages are forwarded to a secondary master. The secondary master immediately applies these operations to build a hot-standby version of the dynamic task graph. To maintain the same reliability guarantees, the master must wait until the secondary master acknowledges the job submission message before returning an acknowledgement to the client. All other messages may be sent asynchronously.
Performance Comparison with a production system. [Figure: Distributed Grep on Hadoop and CIEL]
Performance of an Iterative Algorithm [Figure: k-means on Hadoop and CIEL with 20 workers]
Related Work Pregel: Google's distributed execution engine, designed primarily for graph algorithms. HaLoop: the task scheduler is made loop-aware by adding caching mechanisms (lacks fault tolerance). Apache Mahout: uses Hadoop as its execution engine, with a driver program running iterative algorithms (lacks master fault tolerance and requires a driver program). Dryad: allows data flow to follow a more general directed acyclic graph (does not support dynamic/data-dependent control flow). Naiad: a timely dataflow system; a distributed system for executing data-parallel, cyclic dataflow programs.
Conclusion CIEL and Skywriting are not good for: sharing large amounts of data; fine-grained parallelization; fully automatic parallelism; relational algebra environments; serving as a distributed operating system. CIEL and Skywriting are good for: writing iterative algorithms; data-dependent control flow using dynamic task graphs; transparent fault tolerance and automatic distribution; scaling across hundreds of machines.
Reference Derek G. Murray, Malte Schwarzkopf, Christopher Smowton, Steven Smith, Anil Madhavapeddy, and Steven Hand. CIEL: a universal execution engine for distributed data-flow computing. In NSDI 2011: Proceedings of the 8th USENIX Symposium on Networked Systems Design and Implementation, 2011.
Discussion Questions 1. In what aspect(s) is CIEL different from MapReduce and Dryad? (the first paragraph of Section 1) 2. What weaknesses of Pregel and MapReduce are addressed by CIEL, respectively? (the 4th, 5th, 6th paragraphs in Section 2)