Coded Distributed Computing: Straggling Servers and Multistage Dataflows


Songze Li, Mohammad Ali Maddah-Ali, and A. Salman Avestimehr
Department of Electrical Engineering, University of Southern California, Los Angeles, CA, USA
Nokia Bell Labs, Holmdel, NJ, USA

Abstract—In this paper, we first review the Coded Distributed Computing (CDC) framework, recently proposed to significantly slash the data shuffling load of distributed computing via coding, and then discuss the extension of the CDC techniques to cope with two major challenges in general distributed computing problems, namely straggling servers and multistage computations. When faced with straggling servers in a distributed computing cluster, we describe a unified coding scheme that superimposes CDC with Maximum-Distance-Separable (MDS) coding on computation tasks, which allows a flexible tradeoff between computation latency and communication load. On the other hand, for a general multistage computation task expressed as a directed acyclic graph (DAG), we propose a coded framework that, given the load of computation on each vertex of the DAG, applies the generalized CDC scheme individually on each vertex to minimize the communication load.

I. INTRODUCTION

Recently in [1]–[3], coding was introduced into distributed computing in order to reduce the overhead of shuffling intermediate results across computing servers, hence speeding up the overall computation. In a general MapReduce-type distributed computing structure, input files are processed distributedly using designed Map functions across a server cluster, generating some intermediate values. Then the servers exchange the calculated intermediate values (a.k.a. data shuffling), in order to calculate the final output results distributedly using the designed Reduce functions. For such a structure, it was demonstrated in [3] that coding can be applied on both Map task placement and data shuffling, significantly slashing the load of communication. A tradeoff between the communication load (normalized total number of shuffled bits) and the computation load (normalized total number of computed Map functions) was formalized and exactly characterized in [3]. In particular, for a distributed computing application run on K servers and a computation load of r, r ∈ {1, ..., K}, the minimum required communication load was characterized as L*(r) = (1/r)(1 − r/K). A coded computing framework, namely Coded Distributed Computing (CDC), was proposed in [3] to achieve this tradeoff. CDC utilizes a carefully designed repetitive mapping of input files at r distinct servers, creating coded multicast messages that simultaneously satisfy the data demands of r servers. Hence, compared with an uncoded data shuffling scheme, CDC reduces the communication load by exactly a factor of the computation load r. This effect is demonstrated in a numerical evaluation in Fig. 1.

Fig. 1: Comparison of the communication load achieved by Coded Distributed Computing with that of the uncoded scheme. For r ∈ {1, ..., K}, CDC is r times better than the uncoded scheme.
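As a quick numerical check of the factor-r gain stated above, the short sketch below (illustrative only, with an arbitrary choice of K) evaluates the uncoded load 1 − r/K against the coded load (1/r)(1 − r/K):

```python
# Minimal sketch: communication load of uncoded shuffling vs. CDC,
# using the characterization L*(r) = (1/r) * (1 - r/K) quoted above.
K = 10  # number of servers (illustrative value, not from the paper)

for r in range(1, K + 1):
    uncoded = 1 - r / K          # each needed bit is unicast once
    coded = (1 - r / K) / r      # CDC: multicast gain equal to r
    print(f"r={r:2d}  uncoded={uncoded:.3f}  coded={coded:.3f}  gain={r}x")
```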
In this paper, we consider two extensions of the CDC framework, namely CDC with Straggling Servers and CDC for Multistage Dataflows, which focus on applying the principle of CDC to a broader class of distributed computing problems.

A. CDC with Straggling Servers

As mentioned before, the execution of a MapReduce-type distributed computing job consists of the Map phase, the Shuffle phase, and the Reduce phase. The CDC scheme proposed in [3] focuses on minimizing the communication load in the Shuffle phase, and we term this coding approach the Minimum Bandwidth Code. On the other hand, in a recent work [4], the authors proposed to apply Maximum-Distance-Separable (MDS) codes to create some redundant Map tasks, so that the run-time of the Map phase is not affected by up to a certain number of straggling servers. This coding scheme, which we term the Minimum Latency Code, results in a significant reduction of the Map computation latency. We proposed in [5] a unified coding framework for distributed computing with straggling servers, by introducing a tradeoff between latency of computation and load of communication for a distributed matrix multiplication problem. We show that the Minimum Bandwidth Code in [3] and the Minimum Latency Code in [4] can then be viewed as special instances of the proposed coding framework by considering two extremes of this tradeoff: minimizing either the load of communication or the latency of computation individually.

B. CDC for Multistage Dataflows

Unlike simple computation tasks like Grep, Join and Sort, many distributed computing applications contain multiple stages of MapReduce computations.

Examples of these applications include machine learning algorithms [6], SQL queries for databases [7], [8], and scientific analytics [9]. One can express the computation logic of a multistage application as a directed acyclic graph (DAG) [10], in which each vertex represents a logical step of data transformation, and each edge represents the dataflow across processing vertices.

We formalize a distributed computing model for multistage dataflow applications. We express a multistage dataflow as a layered DAG, in which the processing vertices within a particular computation stage are grouped into a single layer. Each vertex represents a MapReduce-type computation, transforming a set of input files into a set of output files. The set of edges specifies 1) the order of the computations, such that the head vertex of an edge does not start its computation until the tail vertex finishes, and 2) the input-output relationships between vertices, such that the input files of a vertex consist of the output files of all vertices connected to it through incoming edges. For a given layered DAG, we propose a coded computing scheme to achieve a set of computation-communication tuples, which characterizes the load of computation for each processing vertex and the load of communication within each layer. The proposed scheme first specifies the computation loads of the Map and Reduce functions for each vertex (i.e., how many times a Map or a Reduce function should be calculated), and then exploits the CDC scheme in [3] to perform the computation for each vertex individually.

II. OVERVIEW OF CODED DISTRIBUTED COMPUTING

In this section, we first briefly describe the problem of distributed computing, and then review our results in [3] on characterizing the tradeoff between the computation load and the communication load.

A. Distributed Computing Framework

In a distributed computing problem, the goal is to compute Q output functions from N input files. As shown in Fig. 2, the overall computation is decomposed into computing a set of Map functions, one for each input file, and a set of Reduce functions, one for each output function. In particular, each Map function computes Q intermediate values, one for each output function. Each Reduce function takes in all N intermediate values from all input files, and calculates the final output result.

Fig. 2: Illustration of a two-stage distributed computing framework.

The computation is carried out over K distributed computing servers, on which the computations of the Q output functions are uniformly distributed. Following the above decomposition, the computation proceeds in three phases: Map, Shuffle and Reduce. In the Map phase, each server computes a subset of Map functions locally. Then in the Shuffle phase, each server creates messages based on the local Map results and multicasts them to the intended servers. In the Reduce phase, each server recovers the required intermediate values from the received messages and the local Map results, and uses them to reduce the assigned output functions.

The computation load, denoted by r, 1 ≤ r ≤ K, is defined as the total number of Map functions computed across the K servers, normalized by the number of files N. The communication load, denoted by L, 0 ≤ L ≤ 1, is defined as the total number of bits communicated in the Shuffle phase, normalized by the total number of bits in all QN intermediate values.
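To make the decomposition concrete, here is a toy sketch of the two-stage structure with made-up Map and Reduce functions (the specific functions and sizes are assumptions chosen only for illustration):

```python
# Toy instance of the two-stage framework: N input files, Q output functions.
# Each Map call emits Q intermediate values (one per output function);
# each Reduce call combines the N intermediate values of one output function.
N, Q = 6, 3
files = [list(range(n, n + 4)) for n in range(N)]    # hypothetical input files

def map_fn(w):
    # one intermediate value per output function q (here: q-th power sums)
    return [sum(x ** (q + 1) for x in w) for q in range(Q)]

def reduce_fn(values):
    # combine the N intermediate values of one output function
    return sum(values)

intermediate = [map_fn(w) for w in files]            # N x Q intermediate values
outputs = [reduce_fn([intermediate[n][q] for n in range(N)]) for q in range(Q)]
print(outputs)
```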
A computation-communication pair (r, L) is feasible if there exists a placement of the Map tasks and a data shuffling scheme such that all output functions can be successfully reduced. The computation-communication function of this framework is defined as L*(r) ≜ inf{L : (r, L) is feasible}. (1)

B. Computation-Communication Tradeoff

The computation-communication function was exactly characterized in [3], as stated in the following theorem.

Theorem 1. The computation-communication function of the distributed computing framework, L*(r), is given by L*(r) = L_coded(r) ≜ (1/r)(1 − r/K), r ∈ {1, ..., K}, (2) for sufficiently large N. For general 1 ≤ r ≤ K, L*(r) is the lower convex envelope of the above points.

The tradeoff in Theorem 1 is achieved by the Coded Distributed Computing (CDC) scheme proposed in [3]. The key idea of the scheme is to repeat each Map computation across servers following a specific pattern, in order to create coded multicast messages in the Shuffle phase that are simultaneously useful for multiple servers.

In [3], the CDC scheme was also generalized to tackle a cascaded distributed computing framework, in which each output function is computed by s servers, for some s ∈ {1, ..., K}. The computation-communication function for the cascaded framework, which is achieved by a generalized CDC scheme, is stated in the following theorem.

Theorem 2. The computation-communication function of the cascaded distributed computing framework, for r ∈ {1, ..., K}, is characterized by

L*(r, s) = L_coded(r, s) ≜ Σ_{ℓ=max{r+1,s}}^{min{r+s,K}} ℓ C(ℓ−2, r−1) C(r, ℓ−s) C(K, ℓ) / (r C(K, r) C(K, s)), (3)

for sufficiently large Q and N, and s ∈ {1, ..., K}, where C(·, ·) denotes the binomial coefficient. For general 1 ≤ r ≤ K, L*(r, s) is the lower convex envelope of the above points {(r, L_coded(r, s)) : r ∈ {1, ..., K}}.
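The expression in Theorem 2 can be evaluated directly; the sketch below (illustrative parameters) computes L_coded(r, s) with Python's math.comb and checks that it collapses to the Theorem 1 expression when s = 1:

```python
from math import comb

def L_coded(r, s, K):
    """Cascaded CDC communication load, evaluating Eq. (3) of Theorem 2."""
    total = 0.0
    for l in range(max(r + 1, s), min(r + s, K) + 1):
        total += l * comb(l - 2, r - 1) * comb(r, l - s) * comb(K, l) \
                 / (r * comb(K, r) * comb(K, s))
    return total

K = 10  # illustrative cluster size
for r in range(1, K + 1):
    # with s = 1 the cascaded load reduces to Theorem 1: (1/r)(1 - r/K)
    assert abs(L_coded(r, 1, K) - (1 - r / K) / r) < 1e-12
print(L_coded(3, 2, K))   # e.g., load for r = 3, s = 2
```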

III. CDC WITH STRAGGLING SERVERS

We introduced in [5] a MapReduce-type distributed computing framework for a matrix multiplication problem. When the CDC scheme (or the Minimum Bandwidth Code) is applied to this framework, while the shuffling load is minimized, a high Map phase latency would occur since the system needs to wait for all straggling servers to finish their Map computations. In order to extend the CDC scheme to optimize the performance of systems with straggling servers, we formalized in [5] a tradeoff between the computation latency in the Map phase and the communication load in the Shuffle phase, and proposed a unified coding scheme that systematically concatenates the Minimum Bandwidth Code in [3] and the Minimum Latency Code in [4]. Next, we first describe the considered distributed matrix multiplication problem, then state our main results, and finally demonstrate the proposed unified coding scheme using an illustrative example.

A. Problem Formulation

1) System Model: We consider a matrix multiplication problem in which, given a matrix A ∈ F_{2^T}^{m×n} for some integers T, m and n, and N input vectors x_1, ..., x_N ∈ F_{2^T}^n, we want to compute the N output vectors y_1 = A x_1, ..., y_N = A x_N. We perform the computations using K distributed servers. Each server has a local memory of size µmnT bits, for some 1/K ≤ µ ≤ 1. We allow applying linear codes for storing the rows of A at each server. Specifically, Server k, k ∈ {1, ..., K}, designs an encoding matrix E_k ∈ F_{2^T}^{µm×m}, and stores U_k = E_k A. (4) The collection of the encoding matrices {E_k}_{k=1}^K is denoted as the storage design. Thus, enough information to recover the entire matrix A can be stored collectively on the K servers.

2) Distributed Computing Model: We assume that the input vectors x_1, ..., x_N are known to all the servers, that N ≥ K, and that |Q| divides N for all Q ⊆ {1, ..., K}. The computation proceeds in Map, Shuffle and Reduce phases.

Map Phase. For all j = 1, ..., N, Server k, k = 1, ..., K, computes the intermediate vectors z_{j,k} = U_k x_j = E_k A x_j = E_k y_j. (5) We denote the latency for Server k to compute z_{1,k}, ..., z_{N,k} as S_k. S_1, ..., S_K are i.i.d. random variables. We denote the qth order statistic, i.e., the qth smallest of S_1, ..., S_K, as S_(q), for all q ∈ {1, ..., K}, and focus on a class of distributions of S_k such that E{S_(q)} = µN g(K, q), (6) for some function g(K, q). The Map phase terminates when a subset of servers, denoted by Q ⊆ {1, ..., K}, have finished their Map computations in (5). A necessary condition for selecting Q is that the output vectors y_1, ..., y_N can be reconstructed by jointly utilizing the intermediate vectors calculated by the servers in Q, i.e., {z_{j,k} : j = 1, ..., N, k ∈ Q}.

Definition 1 (Computation Latency). We define the computation latency, denoted by D, as the average amount of time spent in the Map phase.

After the Map phase, the job of computing the output vectors y_1, ..., y_N continues exclusively over the servers in Q. The final computations of the output vectors are distributed uniformly across the servers in Q.

Shuffle Phase. Each server k in Q generates a message X_k from the locally computed intermediate vectors z_{1,k}, ..., z_{N,k}, through an encoding function φ_k, i.e., X_k = φ_k(z_{1,k}, ..., z_{N,k}), such that upon receiving all messages {X_k : k ∈ Q}, every server in Q can reduce the assigned output vectors. We assume that the servers are connected by a shared bus link. After generating X_k, Server k multicasts X_k to all the other servers in Q.

Definition 2 (Communication Load). We define the communication load, denoted by L, as the average total number of bits in all messages {X_k : k ∈ Q}, normalized by mT (i.e., the total number of bits in an output vector).

Reduce Phase. Server k, k ∈ Q, uses the locally computed vectors z_{1,k}, ..., z_{N,k} and the received multicast messages {X_{k'} : k' ∈ Q} to reduce the assigned N/|Q| output vectors.
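A minimal numerical sketch of the storage design in (4) and the Map computation in (5) is given below, using real-valued random encoding matrices in place of the finite-field code of the paper (an assumption made purely for readability); any set of servers whose stacked encoders have rank m can recover the output vectors:

```python
import numpy as np

# Sketch of the coded storage / Map phase over the reals (the paper works over
# a finite field; real-valued random coding is used here only for illustration).
# K servers each store U_k = E_k A with E_k of size (mu*m) x m.
rng = np.random.default_rng(0)
K, m, n, N = 4, 12, 5, 3
mu = 1 / 2
rows = int(mu * m)

A = rng.standard_normal((m, n))
X = rng.standard_normal((n, N))                      # input vectors as columns
E = [rng.standard_normal((rows, m)) for _ in range(K)]

# Map phase: server k computes z_{j,k} = E_k A x_j for every input vector x_j.
Z = [E[k] @ A @ X for k in range(K)]                 # each block is (rows x N)

# Any subset Q whose stacked encoders have full column rank can recover Y = AX.
Q = [0, 2]                                           # e.g., the two fastest servers
E_Q = np.vstack([E[k] for k in Q])
Z_Q = np.vstack([Z[k] for k in Q])
Y_hat = np.linalg.lstsq(E_Q, Z_Q, rcond=None)[0]
print(np.allclose(Y_hat, A @ X))                     # True
```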
For such a distributed computing system, we say a latency-load pair (D, L) ∈ R² is feasible if there exist a storage design {E_k}_{k=1}^K, a Map phase computation with latency D, and a shuffling scheme with communication load L, such that all output vectors can be successfully reduced.

Definition 3. We define the latency-load region as the closure of the set of all feasible (D, L) pairs.

3) Illustrating Example: In order to clarify the formulation, we use the following simple example to illustrate the latency-load pairs achieved by the two coded approaches discussed in Section I. We consider a matrix A consisting of m = 12 rows a_1, ..., a_12. We have N = 4 input vectors x_1, ..., x_4, and the computation is performed on K = 4 servers, each with a storage size µ = 1/2. We assume that the Map latency S_k, k = 1, ..., 4, has a shifted-exponential distribution function F_S(t) = 1 − e^{−(t/(µN) − 1)}, t ≥ µN, (7) and by, e.g., [11], the average latency for the fastest q, 1 ≤ q ≤ 4, servers to finish the Map computations is D(q) = E{S_(q)} = µN (1 + Σ_{j=K−q+1}^{K} 1/j). (8)

Minimum Bandwidth Code (or CDC) [3]. As shown in Fig. 3(a), a Minimum Bandwidth Code repeats the multiplication of each row of A with all input vectors x_1, ..., x_4, µK = 2 times across the 4 servers, according to the mapping strategy of CDC. The Map phase continues until all servers have finished their computations, achieving a computation latency D(4) = 2(1 + Σ_{j=1}^{4} 1/j) ≈ 6.17. For k = 1, ..., 4, Server k will be reducing the output vector y_k. In the Shuffle phase, every server multicasts 3 bit-wise XORs, each of which is simultaneously useful for two other servers. Hence, the Minimum Bandwidth Code achieves a communication load L = 3 · 4/12 = 1.

Minimum Latency Code [4]. A Minimum Latency Code first has each server k, k = 1, ..., 4, independently and randomly generate 6 random linear combinations of the rows of A, denoted by c_{6(k−1)+1}, ..., c_{6(k−1)+6} (see Fig. 3(b)), achieving a (24, 12) MDS code. Therefore, for any subset D ⊆ {1, ..., 24} of size |D| = 12, using the intermediate values {c_i x_j : i ∈ D} one can recover the output vector y_j. The Map phase terminates once the fastest 2 servers have finished their computations (e.g., Servers 1 and 3), achieving a computation latency D(2) = 2(1 + 1/3 + 1/4) ≈ 3.17. Then Server 1 continues to reduce y_1 and y_2, and Server 3 continues to reduce y_3 and y_4. As illustrated in Fig. 3(b), Servers 1 and 3 respectively unicast the intermediate values they have calculated that are needed by the other server to complete the computation, achieving a communication load L = 6 · 4/12 = 2.

Fig. 3: Illustration of the Minimum Bandwidth Code in [3] and the Minimum Latency Code in [4]. (a) Minimum Bandwidth Code: every row of A is multiplied with the input vectors twice. For k = 1, 2, 3, 4, Server k reduces the output vector y_k. In the Shuffle phase, each server multicasts 3 bit-wise XORs of the calculated intermediate values, each of which is simultaneously useful for two other servers. (b) Minimum Latency Code: A is encoded into 24 coded rows c_1, ..., c_24. Servers 1 and 3 finish their Map computations first. They then exchange enough (6 for each output vector) intermediate values to reduce y_1, y_2 at Server 1 and y_3, y_4 at Server 3. The Minimum Bandwidth Code spends about twice the time in the Map phase compared with the Minimum Latency Code, and achieves half of the communication load in the Shuffle phase. They represent the two end points of a general latency-load tradeoff characterized in the next subsection.
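Plugging the example's parameters into (8) reproduces the latencies and loads of the two codes; the short script below simply evaluates the formulas quoted above:

```python
# Worked numbers for the illustrative example (K = 4 servers, mu = 1/2, N = 4),
# using D(q) = E[S_(q)] = mu*N*(1 + sum_{j=K-q+1}^{K} 1/j) from Eq. (8).
K, N, mu = 4, 4, 1 / 2

def D(q):
    return mu * N * (1 + sum(1 / j for j in range(K - q + 1, K + 1)))

# Minimum Bandwidth Code: wait for all K servers, load L = 1.
print(f"Minimum Bandwidth Code: D(4) = {D(4):.2f}, L = 1")
# Minimum Latency Code: wait for the fastest 1/mu = 2 servers, load L = 2.
print(f"Minimum Latency Code:   D(2) = {D(2):.2f}, L = 2")
```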

B. Main Results

The main results of [5] are 1) a characterization of a set of achievable latency-load pairs, obtained by developing a unified coded framework, and 2) an outer bound of the latency-load region, which are stated in the following two theorems.

Theorem 3. For a distributed matrix multiplication problem of computing N output vectors using K servers, each with a storage size µ ≥ 1/K, the latency-load region contains the lower convex envelope of the points {(D(q), L(q)) : q = 1/µ, ..., K}, (9) in which D(q) = E{S_(q)} = µN g(K, q), (10) and L(q) is the communication load achieved by the unified coding scheme for a given q, whose closed-form expression is derived in [5]. Here S_(q) is the qth smallest latency of the K i.i.d. latencies S_1, ..., S_K with some distribution F, and g(K, q) is a function of K and q computed from F.

The latency-load pairs in Theorem 3 are achieved by a unified coding framework that organically superimposes the Minimum Bandwidth Code and the Minimum Latency Code. The key idea is to appropriately concatenate the MDS code and the repetitive computations specified by the CDC scheme for the Map computations, in order to take advantage of the redundancies to both combat the stragglers and slash the shuffling load. We demonstrate this unified scheme through an illustrative example in the next subsection.

Remark 1. The Minimum Latency Code and the Minimum Bandwidth Code correspond to q = 1/µ and q = K, and achieve the two end points (E{S_(1/µ)}, N(1 − µ)) and (E{S_(K)}, N(1 − µ)/(µK)) respectively.

Fig. 4: Comparison of the latency-load pairs achieved by the proposed scheme with the outer bound, for computing N = 180 output vectors using K = 18 servers, each with a storage size µ = 1/3, assuming the distribution function in (7).

Remark 2. As numerically evaluated in Fig. 4, the tradeoff achieved by the unified coding framework approximately exhibits an inverse-linearly proportional relationship between the latency and the load. For instance, doubling the latency from 120 to 240 results in a drop of the communication load from 43 to 23, i.e., by a factor of 1.87.
Theorem 4. The latency-load region is contained in the lower convex envelope of the points {(D(q), L̄(q)) : q = 1/µ, ..., K}, (12) in which D(q) is given by (10) and L̄(q) is a lower bound on the communication load whose closed-form expression is derived in [5].

For each q = 1/µ, ..., K, the lower bound L̄(q) was proved as a cut-set bound on multiple instances of the problem, each corresponding to a specific assignment of the output vectors. At the two end points of the tradeoff, the unified coding scheme was shown in [5] to achieve the lower bound to within a constant multiplicative gap.

C. Unified Coding Framework

In this subsection, we demonstrate the key ideas of the unified coding framework that achieves the latency-load pairs in (9), through the following example. We consider a problem of multiplying a matrix A ∈ F_{2^T}^{m×n} of m = 20 rows with N = 12 input vectors x_1, ..., x_12 to compute 12 output vectors y_1 = A x_1, ..., y_12 = A x_12, using K = 6 servers, each with a storage size µ = 1/2. We assume that we can afford to wait for q = 4 servers to finish their Map computations.

Storage Design. As illustrated in Fig. 5, we first independently generate 30 random linear combinations c_1, ..., c_30 ∈ F_{2^T}^n of the 20 rows of A. Then we partition these coded rows c_1, ..., c_30 into 15 batches, each of size 2, and store every batch of coded rows at a unique pair of servers.

Fig. 5: Storage design when the Map phase is terminated when 4 servers have finished the computations.

WLOG, due to the symmetry of the storage design, we assume that Servers 1, 2, 3 and 4 are the first 4 servers that finish their Map computations. Then we assign the Reduce tasks such that Server k reduces the output vectors y_{3(k−1)+1}, y_{3(k−1)+2} and y_{3(k−1)+3}, for all k ∈ {1, ..., 4}. Since Server 1 has computed {c_1 x_j, ..., c_10 x_j : j = 1, ..., 12}, for it to reduce y_1 = A x_1, it needs any subset of 10 intermediate values c_i x_1 with i ∈ {11, ..., 30} from Servers 2, 3 and 4 in the Shuffle phase. Similar data demands hold for all 4 servers and the output vectors they are reducing.

Coded Shuffle. We first group the 4 servers into 4 subsets of size 3 and perform coded shuffling within each subset. We illustrate the coded shuffling scheme for Servers 1, 2 and 3 in Fig. 6. Each server multicasts 3 bit-wise XORs of the locally computed intermediate values to the other two. After receiving the 2 multicast messages (each containing 3 coded values), each server recovers 6 needed intermediate values.

Fig. 6: Multicasting 9 coded intermediate values across Servers 1, 2 and 3. Similar coded multicast communications are performed for another 3 subsets of 3 servers.

Similarly, we perform the above coded shuffling for another 3 subsets of 3 servers. Each server recovers 18 needed intermediate values (6 for each output vector it is reducing). As mentioned before, since each server needs a total of 3 × (20 − 10) = 30 intermediate values to reduce the 3 assigned output vectors, it needs another 30 − 18 = 12 after decoding all multicast messages. We satisfy the residual data demands by simply having the servers unicast enough (i.e., 12 × 4 = 48) intermediate values for reduction. Overall, 36 + 48 = 84 (possibly coded) intermediate values are communicated, achieving a communication load of L = 84/20 = 4.2.
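The storage design of this example can be sketched in a few lines; random real-valued coding stands in for the MDS code, and the batch-to-pair assignment below is one arbitrary choice consistent with the description above:

```python
from itertools import combinations
import numpy as np

# Sketch of the storage design: m = 20 rows of A are encoded into 30 random
# linear combinations, partitioned into C(6,2) = 15 batches of size 2, with one
# batch stored at each pair of the K = 6 servers.
rng = np.random.default_rng(1)
K, m, coded_rows, batch = 6, 20, 30, 2

G = rng.standard_normal((coded_rows, m))             # (30, 20) random code
pairs = list(combinations(range(K), 2))              # 15 pairs of servers
batches = {pairs[i]: list(range(batch * i, batch * i + batch))
           for i in range(len(pairs))}

storage = {k: [] for k in range(K)}
for pair, rows in batches.items():
    for k in pair:
        storage[k].extend(rows)

assert all(len(storage[k]) == 10 for k in range(K))  # each server stores 10 = (1/2)*20 rows

# Suppose Servers 0..3 finish the Map phase first: together they know every
# coded row stored on at least one of them (only the batch of pair {4, 5} is missing).
finishers = {0, 1, 2, 3}
known = set().union(*(storage[k] for k in finishers))
print(len(known), "of", coded_rows, "coded rows are known by the 4 finishers")
```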
IV. CDC FOR MULTISTAGE DATAFLOWS

While the distributed computing model in [3] deals with a single pair of Map and Reduce operations, the logical dataflow of a general distributed computing application consists of multiple stages of MapReduce computations. We can express a multistage dataflow as a directed acyclic graph (DAG). The DAG of an application, denoted by G, consists of a set of vertices V and a set of directed edges A, i.e., G = (V, A). The vertices represent the user-defined operations on the data, e.g., MapReduce, and the edges represent the flow of data between operation vertices. In this section, we formalize a multistage computation task represented by a DAG, and propose a general coded scheme for DAGs as an extension of the CDC scheme in [3].

A. Problem Formulation: Layered DAG

We consider a computing task that processes N input files w_1, ..., w_N ∈ F_{2^F} to generate Q output files u_1, ..., u_Q ∈ F_{2^B}, for some parameters F, B ∈ N. The overall computation is represented by a layered DAG G = (V, A), in which the set of vertices V is composed of D layers, denoted by L_1, ..., L_D, for some D ∈ N. For each d = 1, ..., D, we label the ith vertex in Layer d as m^{d,i}, for all i = 1, ..., |L_d|. See Fig. 7 for the illustration of a 4-layer DAG.

Fig. 7: A 4-layer DAG.

Each vertex m^{d,i} processes N^{d,i} input files w^{d,i}_1, ..., w^{d,i}_{N^{d,i}} ∈ F_{2^F}, and computes Q^{d,i} output files u^{d,i}_1, ..., u^{d,i}_{Q^{d,i}} ∈ F_{2^B}, for some system parameters N^{d,i}, Q^{d,i}, F, B ∈ N. In particular, the input files of G are distributed as the inputs to the vertices in Layer 1, i.e., {w_1, ..., w_N} = ∪_{i=1,...,|L_1|} {w^{1,i}_1, ..., w^{1,i}_{N^{1,i}}}, and the output files of G are distributed as the outputs of the vertices in Layer D, i.e., {u_1, ..., u_Q} = ∪_{i=1,...,|L_D|} {u^{D,i}_1, ..., u^{D,i}_{Q^{D,i}}}. Edges in A are between vertices in consecutive layers, i.e., A ⊆ ∪_{d=1,...,D−1} {(m^{d,i}, m^{d+1,j}) : i = 1, ..., |L_d|, j = 1, ..., |L_{d+1}|}. (14)

The input files of a vertex in Layer d, d = 2, ..., D, consist of the output files of the vertices it connects to in the preceding layer. More specifically, for any d ∈ {2, ..., D} and i ∈ {1, ..., |L_d|}, N^{d,i} = Σ_{j:(m^{d−1,j}, m^{d,i}) ∈ A} Q^{d−1,j} and {w^{d,i}_1, ..., w^{d,i}_{N^{d,i}}} = ∪_{j:(m^{d−1,j}, m^{d,i}) ∈ A} {u^{d−1,j}_1, ..., u^{d−1,j}_{Q^{d−1,j}}}. (15)

For example, in Fig. 7, the input files to the vertex m^{3,1} consist of the output files of the vertices m^{2,1} and m^{2,3}. As a result, other than the number of input files for the vertices in Layer 1, we only need the number of output files at each vertex as the system parameters.
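A minimal piece of bookkeeping code for this formulation is sketched below; the edge set and output-file counts are illustrative assumptions, except that m^{3,1} takes its inputs from m^{2,1} and m^{2,3} as in the Fig. 7 example:

```python
# Minimal layered-DAG bookkeeping: vertices are (layer, index) pairs, 1-indexed.
# Q[(d, i)] is the number of output files of vertex m^{d,i}; the number of input
# files of a vertex in layer d >= 2 is the sum of Q over its in-neighbors, Eq. (15).
layers = {1: [1, 2], 2: [1, 2, 3], 3: [1, 2], 4: [1]}
Q = {(1, 1): 4, (1, 2): 4, (2, 1): 3, (2, 2): 3, (2, 3): 3,
     (3, 1): 5, (3, 2): 5, (4, 1): 2}                # hypothetical output counts
edges = {((1, 1), (2, 1)), ((1, 1), (2, 2)), ((1, 2), (2, 2)), ((1, 2), (2, 3)),
         ((2, 1), (3, 1)), ((2, 3), (3, 1)), ((2, 2), (3, 2)),
         ((3, 1), (4, 1)), ((3, 2), (4, 1))}

def num_inputs(d, i):
    # N^{d,i} = sum of Q^{d-1,j} over in-neighbors of (d, i)
    return sum(Q[tail] for tail, head in edges if head == (d, i))

for d in range(2, 5):
    for i in layers[d]:
        print(f"m^({d},{i}) has N = {num_inputs(d, i)} input files")
```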

The computation of the output file u^{d,i}_q, q = 1, ..., Q^{d,i}, of the vertex m^{d,i}, for all d = 1, ..., D, i = 1, ..., |L_d|, is decomposed as follows:

u^{d,i}_q(w^{d,i}_1, ..., w^{d,i}_{N^{d,i}}) = h^{d,i}_q(g^{d,i}_{q,1}(w^{d,i}_1), ..., g^{d,i}_{q,N^{d,i}}(w^{d,i}_{N^{d,i}})), (16)

where:
1) The Map functions g^{d,i}_n ≜ (g^{d,i}_{1,n}, ..., g^{d,i}_{Q^{d,i},n}) : F_{2^F} → (F_{2^T})^{Q^{d,i}}, n ∈ {1, ..., N^{d,i}}, map the input file w^{d,i}_n into Q^{d,i} length-T intermediate values {v^{d,i}_{q,n} = g^{d,i}_{q,n}(w^{d,i}_n) ∈ F_{2^T} : q = 1, ..., Q^{d,i}}, for some T ∈ N.
2) The Reduce functions h^{d,i}_q : (F_{2^T})^{N^{d,i}} → F_{2^B}, q ∈ {1, ..., Q^{d,i}}, map the intermediate values of the output function u^{d,i}_q in all input files into the output file u^{d,i}_q = h^{d,i}_q(v^{d,i}_{q,1}, ..., v^{d,i}_{q,N^{d,i}}).

We compute the above layered DAG using a K-server cluster, for some K ∈ N. At each time instance, the servers only perform the computations of the vertices within a single layer. Each vertex in a layer is computed by a subset of servers. We denote the set of servers computing the vertex m^{d,i} as K^{d,i} ⊆ {1, ..., K}, where the selection of K^{d,i} is a design parameter. For each k ∈ K^{d,i}, Server k computes a subset of Map functions of m^{d,i} with indices M^{d,i}_k ⊆ {1, ..., N^{d,i}}, and a subset of Reduce functions with indices W^{d,i}_k ⊆ {1, ..., Q^{d,i}}, where M^{d,i}_k and W^{d,i}_k are design parameters. We denote the placements of the Map and Reduce functions for m^{d,i} as M^{d,i} ≜ {M^{d,i}_k : k ∈ K^{d,i}} and W^{d,i} ≜ {W^{d,i}_k : k ∈ K^{d,i}} respectively.

Data Locality. We prohibit transferring input files (or output files calculated in the preceding layer) across servers, i.e., every node either stores the needed input files to compute the assigned Map functions (only to initiate the computations in Layer 1) or computes them locally from the assigned Reduce functions in the preceding layer. This implementation provides better fault-tolerance since the Reduce functions have to be calculated independently across servers.

The computation of Layer d, d = 1, ..., D, proceeds in three phases: Map, Shuffle, and Reduce.

Map phase. For each vertex m^{d,i}, i = 1, ..., |L_d|, in Layer d, each server k in K^{d,i} computes its assigned Map functions g^{d,i}_n(w^{d,i}_n) = (v^{d,i}_{1,n}, ..., v^{d,i}_{Q^{d,i},n}), for all n ∈ M^{d,i}_k.

Definition 4 (Computation Load). We define the computation load of vertex m^{d,i}, d ∈ {1, ..., D}, i ∈ {1, ..., |L_d|}, denoted by r^{d,i}, as the total number of Map functions of m^{d,i} computed across the servers in K^{d,i}, normalized by the number of input files N^{d,i}, i.e., r^{d,i} ≜ Σ_{k ∈ K^{d,i}} |M^{d,i}_k| / N^{d,i}.

Shuffle phase. Each server k, k ∈ {1, ..., K}, creates a message X^d_k as a function, denoted by ψ^d_k, of the intermediate values from all input files it has mapped in Layer d, i.e., X^d_k = ψ^d_k({v^{d,i}_{q,n} : q ∈ {1, ..., Q^{d,i}}, n ∈ M^{d,i}_k}_{i=1}^{|L_d|}), and multicasts it to a subset of the servers.

Definition 5 (Communication Load). We define the communication load of Layer d, denoted by L_d, as the total number of bits communicated in the Shuffle phase of Layer d.

By the end of the Shuffle phase, each server k, k = 1, ..., K, recovers all required intermediate values for the assigned Reduce functions in Layer d, i.e., {v^{d,i}_{q,1}, ..., v^{d,i}_{q,N^{d,i}} : q ∈ W^{d,i}_k}_{i=1}^{|L_d|}, from either the local Map computations or the multicast messages from the other servers.

Reduce phase. Each server k, k = 1, ..., K, computes the assigned Reduce functions to generate the output files of the vertices in Layer d, i.e., {u^{d,i}_q = h^{d,i}_q(v^{d,i}_{q,1}, ..., v^{d,i}_{q,N^{d,i}}) : q ∈ W^{d,i}_k}, for all i = 1, ..., |L_d|.

We say that a computation-communication tuple {(r^{d,1}, ..., r^{d,|L_d|}, L_d)}_{d=1}^{D} is achievable if there exists an assignment of the Map and Reduce computations {M^{d,1}, W^{d,1}, ..., M^{d,|L_d|}, W^{d,|L_d|}}_{d=1}^{D} and D shuffling schemes such that Server k, k = 1, ..., K, can successfully compute all the Reduce functions in W^{d,i}_k, for all d ∈ {1, ..., D} and i ∈ {1, ..., |L_d|}.

Definition 6. We define the computation-communication region of a layered DAG G = (V, A), denoted by C(G), as the closure of the set of all achievable computation-communication tuples.
B. CDC for Layered DAG

We propose a general Coded Distributed Computing (CDC) scheme for an arbitrary layered DAG, which achieves the computation-communication tuples characterized in the following theorem.

Theorem 5. For a layered DAG G = (V, A) of D layers, the following computation-communication tuples are achievable:

∪_{{r^{d,1}, ..., r^{d,|L_d|} ∈ {1,...,K}}_{d=1}^{D}} {(r^{d,1}, ..., r^{d,|L_d|}, L^u_d)}_{d=1}^{D}, where L^u_d = Σ_{i=1}^{|L_d|} L_coded(r^{d,i}, s^{d,i}, K) Q^{d,i} N^{d,i}.

Here L_coded(r, s, K) ≜ Σ_{ℓ=max{r+1,s}}^{min{r+s,K}} ℓ C(ℓ−2, r−1) C(r, ℓ−s) C(K, ℓ) / (r C(K, r) C(K, s)); s^{d,i} = max_{j:(m^{d,i}, m^{d+1,j}) ∈ A} r^{d+1,j} for d < D, and s^{D,i} = 1; and N^{d,i} = Σ_{j:(m^{d−1,j}, m^{d,i}) ∈ A} Q^{d−1,j} for d = 2, ..., D.

The above computation-communication tuples are achieved by the proposed CDC scheme for the layered DAG, which first designs the parameters {s^{d,1}, ..., s^{d,|L_d|}}_{d=1}^{D} that specify the placements of the computations of the Reduce functions, and then applies the CDC scheme for a cascaded distributed computing framework (see Theorem 2) to compute each of the vertices individually.

Remark 3. The achieved communication load for vertex m^{d,i}, L_coded(r^{d,i}, s^{d,i}, K) Q^{d,i} N^{d,i}, decreases as r^{d,i} increases (more locally available Map results) and as s^{d,i} decreases (fewer data demands). Due to the specific way the parameter s^{d,i} is chosen in Theorem 5, increasing the computation load r^{d+1,j} of some vertex m^{d+1,j} connected to m^{d,i} can cause s^{d,i} to increase. In general, while more Map computations result in a smaller communication load in the current layer, they impose a larger communication load on the preceding layer.

Next, we describe and analyze the proposed general CDC scheme to compute a layered DAG. To start, we employ a uniform resource allocation such that every vertex is computed over all K servers, i.e., K^{d,i} = {1, ..., K} for all d = 1, ..., D and i = 1, ..., |L_d|.
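The sketch below applies Theorem 5 to a small hypothetical diamond-shaped layered DAG (one vertex in the first and last layers, two in the middle): it picks the reduce factors from the successors' computation loads as prescribed above and evaluates the per-vertex shuffle loads. All parameter values are illustrative assumptions:

```python
from math import comb

def L_coded(r, s, K):
    # cascaded CDC load, as in Theorem 2 / Theorem 5
    return sum(l * comb(l - 2, r - 1) * comb(r, l - s) * comb(K, l)
               / (r * comb(K, r) * comb(K, s))
               for l in range(max(r + 1, s), min(r + s, K) + 1))

# Hypothetical diamond-shaped DAG: v1 -> {v2, v3} -> v4, computed on K servers.
K = 6
Q = {"v1": 6, "v2": 6, "v3": 6, "v4": 6}         # illustrative output-file counts
r = {"v1": 2, "v2": 2, "v3": 1, "v4": 2}         # chosen computation loads
succ = {"v1": ["v2", "v3"], "v2": ["v4"], "v3": ["v4"], "v4": []}

# Reduce factors: max computation load among successors, and 1 in the last layer.
s = {v: max((r[u] for u in succ[v]), default=1) for v in r}

N = {"v1": 6, "v2": Q["v1"], "v3": Q["v1"], "v4": Q["v2"] + Q["v3"]}
for v in ["v1", "v2", "v3", "v4"]:
    load = L_coded(r[v], s[v], K) * Q[v] * N[v]
    print(f"{v}: s = {s[v]}, per-vertex shuffle load = {load:.1f}")
```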

Remark 4. We note that the communication load L_coded(r, s, K) in Theorem 5 is a decreasing function of K. That is, for fixed r and s, performing the computation of a vertex over a smaller number of servers yields a smaller communication load. However, the disadvantages of using fewer servers are: 1) each server needs to compute more Map and Reduce functions, incurring a higher computation load, and 2) it may affect the symmetry of the data placement, increasing the communication load in the next layer (see the discussions in the next subsection).

For each vertex m^{d,i}, i = 1, ..., |L_d|, in Layer d, we specify a computation load r^{d,i} ∈ {1, ..., K}, such that the computation of each Map function of m^{d,i} is placed on r^{d,i} servers. We also define the reduce factor of m^{d,i}, denoted by s^{d,i} ∈ {1, ..., K}, as the number of servers that compute each Reduce function of m^{d,i}. To satisfy the data locality requirements (explained later), we select the reduce factor s^{d,i} equal to the largest computation load of the vertices connected to m^{d,i} in Layer d + 1, i.e., s^{d,i} = max_{j:(m^{d,i}, m^{d+1,j}) ∈ A} r^{d+1,j} for d < D, and s^{D,i} = 1 for d = D. (17)

As an example, for the diamond DAG in Fig. 8, since the output files of m_1 will be used as the inputs for both m_2 and m_3, we should compute each Reduce function of m_1 at s_1 = max{r_2, r_3} servers. Also, since m_2 and m_3 both only connect to m_4, we shall choose s_2 = s_3 = r_4.

Fig. 8: A diamond DAG. The reduce factors s_1, ..., s_4 are determined by the computation loads r_2, r_3, r_4.

Having selected the computation load r^{d,i} and the reduce factor s^{d,i}, we employ the CDC scheme in [3] to compute the vertex m^{d,i} over all K servers. We next briefly describe the CDC computation for m^{d,i}.

Map Phase Design. The N^{d,i} input files are evenly partitioned into C(K, r^{d,i}) disjoint batches of size N^{d,i}/C(K, r^{d,i}), each of which is labelled by a subset T ⊆ {1, ..., K} of size r^{d,i}: {1, ..., N^{d,i}} = {B^{d,i}_T : T ⊆ {1, ..., K}, |T| = r^{d,i}}, (18) where B^{d,i}_T denotes the batch corresponding to the subset T. Given this partition, Server k, k ∈ {1, ..., K}, maps the files in B^{d,i}_T if k ∈ T.

Reduce Functions Assignment. The Q^{d,i} Reduce functions are evenly partitioned into C(K, s^{d,i}) disjoint batches of size Q^{d,i}/C(K, s^{d,i}), each of which is labelled by a subset P of s^{d,i} nodes: {1, ..., Q^{d,i}} = {D^{d,i}_P : P ⊆ {1, ..., K}, |P| = s^{d,i}}, (19) where D^{d,i}_P denotes the batch corresponding to the subset P. Given this partition, Server k, k ∈ {1, ..., K}, computes the Reduce functions whose indices are in D^{d,i}_P if k ∈ P.

Coded Data Shuffling. In the Shuffle phase, within a subset of ℓ servers, max{r^{d,i} + 1, s^{d,i}} ≤ ℓ ≤ min{r^{d,i} + s^{d,i}, K}, every r^{d,i} of them share some intermediate values that are simultaneously needed by the remaining ℓ − r^{d,i} servers. Each server multicasts enough linear combinations of the segments of these intermediate values until they can be decoded by all the intended servers. This achieves a communication load L_coded(r^{d,i}, s^{d,i}, K) for vertex m^{d,i}, where L_coded(r, s, K) is given in Theorem 5.

Next we demonstrate that the above CDC scheme can be applied to compute every vertex subject to the data locality constraint, using the reduce factors s^{d,i} specified in (17). To do that, we focus on the computation of a vertex m^{d,i} in Layer d. WLOG, we assume that m^{d,i} only connects to a single vertex m^{d−1,1} in Layer d − 1, hence the input files of m^{d,i} are the output files of m^{d−1,1} and N^{d,i} = Q^{d−1,1}. Out of all vertices in Layer d connected to m^{d−1,1}, say vertex m^{d,j} has the largest computation load, such that by (17), s^{d−1,1} = r^{d,j}, and each of the output files of m^{d−1,1} is available on r^{d,j} servers after the computation of Layer d − 1.
By the above assignment of the Reduce functions, a batch of Q^{d−1,1}/C(K, r^{d,j}) output files of m^{d−1,1} (or input files of m^{d,i}), denoted by D^{d−1,1}_P, are available at all r^{d,j} servers in a subset P. To execute the Map phase of m^{d,i}, we first evenly partition D^{d−1,1}_P into C(r^{d,j}, r^{d,i}) sub-batches of size Q^{d−1,1}/(C(K, r^{d,j}) C(r^{d,j}, r^{d,i})), each of which is sub-labelled by a subset T of P of r^{d,i} nodes: D^{d−1,1}_P = {D^{d−1,1}_{P,T} : T ⊆ P, |T| = r^{d,i}}, (20) where D^{d−1,1}_{P,T} denotes the sub-batch corresponding to T. Then, each server k maps all files in D^{d−1,1}_{P,T} if k ∈ T. Finally, we repeat this Map process for all subsets P of size r^{d,j}. Since every subset T of r^{d,i} servers is contained in C(K − r^{d,i}, r^{d,j} − r^{d,i}) subsets of size r^{d,j}, the servers in T map a total of [Q^{d−1,1}/(C(K, r^{d,j}) C(r^{d,j}, r^{d,i}))] · C(K − r^{d,i}, r^{d,j} − r^{d,i}) = Q^{d−1,1}/C(K, r^{d,i}) input files of m^{d,i}. This is consistent with the above Map phase design for m^{d,i}, i.e., for all T ⊆ {1, ..., K} of size r^{d,i}, B^{d,i}_T = ∪_{P ⊆ {1,...,K}: |P| = r^{d,j}, T ⊆ P} D^{d−1,1}_{P,T}, (21) where B^{d,i}_T, as defined in (18), is the batch of input files of m^{d,i} mapped by the servers in T.

We demonstrate in Fig. 9 the Map computations of the vertices m_2 and m_3 of the diamond DAG in Fig. 8, with Q_1 = 6 output files of m_1, computation loads r_2 = 2 and r_3 = 1, using K = 3 servers. First we select the reduce factor of m_1, s_1 = max{r_2, r_3} = 2, such that every output file of m_1, u_1, ..., u_6, is reduced on two servers. Having computed the output files of m_1, which are also the input files of m_2 and m_3, each server computes the Map functions of m_2 on all locally available files. However, since m_3 has a computation load r_3 = 1, each file is only mapped once on one server, e.g., u_3 and u_4 are both available on Servers 1 and 2 after computing m_1, but u_3 is mapped only on Server 1 and u_4 is mapped only on Server 2 in the Map phase of m_3.

Fig. 9: Illustration of the mapped files in the Map phases of the vertices m_2 and m_3 in the diamond DAG, for the case Q_1 = N_2 = N_3 = 6, r_2 = 2, and r_3 = 1.
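The subset-labelled placements in (18) and (19) can be generated mechanically; the sketch below (illustrative parameters) builds both partitions for a single vertex and checks the per-server Map and Reduce counts:

```python
from itertools import combinations

# Placement of Map and Reduce computations for one vertex, following the
# subset-labelled partitions of Eqs. (18)-(19): files go to r-subsets of servers,
# reduce functions to s-subsets. Parameter values are illustrative.
K, r, s = 4, 2, 2
N = 2 * len(list(combinations(range(K), r)))     # 2 files per r-subset
Q = 3 * len(list(combinations(range(K), s)))     # 3 functions per s-subset

file_batches = {T: list(range(2 * i, 2 * i + 2))
                for i, T in enumerate(combinations(range(K), r))}
func_batches = {P: list(range(3 * i, 3 * i + 3))
                for i, P in enumerate(combinations(range(K), s))}

maps_of = {k: [n for T, batch in file_batches.items() if k in T for n in batch]
           for k in range(K)}
reduces_of = {k: [q for P, batch in func_batches.items() if k in P for q in batch]
              for k in range(K)}

assert all(len(maps_of[k]) == r * N // K for k in range(K))      # r*N/K files each
assert all(len(reduces_of[k]) == s * Q // K for k in range(K))   # s*Q/K functions each
print(len(maps_of[0]), "files mapped and", len(reduces_of[0]), "functions reduced per server")
```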

Using the above CDC scheme for each vertex, we can achieve a communication load L^u_d in Layer d, d = 1, ..., D: L^u_d = Σ_{i=1}^{|L_d|} L_coded(r^{d,i}, s^{d,i}, K) Q^{d,i} N^{d,i}. (22) Taking the union over all combinations of the computation loads achieves the computation-communication tuples in Theorem 5.

Remark 5. Having characterized a set of computation-communication tuples using CDC, one can optimize the overall job execution time over the computation loads. Varying the computation loads affects the Map time, the Shuffle time and the Reduce time in each layer in different ways. For example, a smaller computation load can lead to a shorter Map time in the current layer and also a shorter Reduce time in the preceding layer, but may cause a long Shuffle phase in the current layer. In general, the design of the optimum computation loads depends on the system parameters, including the input/output sizes, the sizes of the intermediate values, the server processing speeds and the network bandwidth.

C. Is the Uniform Resource Allocation Optimal?

In the above proposed CDC scheme for layered DAGs, we allocate all processing resources to compute each vertex in the DAG. However, this uniform resource allocation strategy does not always lead to a better performance. We demonstrate this phenomenon through the following example.

Consider again the diamond DAG in Fig. 8, in which the vertices m_2 and m_3 have the same number of output functions, i.e., Q_2 = Q_3. When computing m_2 and m_3 in Layer 2, we split the computation resources such that half of the servers exclusively compute m_2 and the remaining half exclusively compute m_3. That is, we select K_2 and K_3 such that K_2 ∩ K_3 = ∅ and |K_2| = |K_3| = K/2. We choose the reduce factors s_2 = s_3 = r_4, and then apply the CDC scheme on K_2 and K_3 to compute m_2 and m_3 respectively. This achieves a total communication load in Layer 2 of L^s_2 = (L_coded(r_2, r_4, K/2) + L_coded(r_3, r_4, K/2)) Q_1 Q_2. (23) The above communication load is less than the load L^u_2 = (L_coded(r_2, r_4, K) + L_coded(r_3, r_4, K)) Q_1 Q_2 achieved in Layer 2 using the uniform resource allocation. This is because when using a smaller number of servers to compute a vertex, each server will compute more Map functions and obtain more useful local information (i.e., more of the needed intermediate values are available locally), and thus less information needs to be transferred over the network.

However, when computing the vertex m_4 in Layer 3, since the output results of the preceding vertices m_2 and m_3 reside on completely separate sets of servers, no coding can be applied for the communication between K_2 and K_3. We can use CDC within K_2 and K_3 respectively to achieve a communication load of (1/r_4)(1 − 2r_4/K) Q_2 Q_4, and uncoded communication between K_2 and K_3 that incurs another communication load of Q_2 Q_4. Hence, the total communication load achieved in Layer 3 when splitting the computation resources is L^s_3 = ((1/r_4)(1 − 2r_4/K) + 1) Q_2 Q_4. (24) On the other hand, computing m_2 and m_3 using all servers in Layer 2 induces a more symmetric placement of the input files of m_4 over all K servers. Therefore, it can take better advantage of the coding opportunities in the data shuffling of Layer 3, achieving a smaller communication load L^u_3 = (2/r_4)(1 − r_4/K) Q_2 Q_4 ≤ L^s_3.
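For concreteness, the sketch below evaluates the Layer-2 and Layer-3 loads above for one illustrative choice of parameters, showing that splitting helps in Layer 2 but hurts in Layer 3:

```python
from math import comb

def L_coded(r, s, K):
    # cascaded CDC load, as in Theorem 5
    return sum(l * comb(l - 2, r - 1) * comb(r, l - s) * comb(K, l)
               / (r * comb(K, r) * comb(K, s))
               for l in range(max(r + 1, s), min(r + s, K) + 1))

# Diamond DAG, split vs. uniform allocation in Layer 2 (illustrative values).
K, Q1, Q2, Q4, r2, r3, r4 = 8, 10, 10, 10, 2, 2, 2

L2_split = (L_coded(r2, r4, K // 2) + L_coded(r3, r4, K // 2)) * Q1 * Q2
L2_unif  = (L_coded(r2, r4, K) + L_coded(r3, r4, K)) * Q1 * Q2

L3_split = (1 / r4) * (1 - 2 * r4 / K) * Q2 * Q4 + Q2 * Q4
L3_unif  = (2 / r4) * (1 - r4 / K) * Q2 * Q4

print(f"Layer 2: split {L2_split:.1f} < uniform {L2_unif:.1f}")
print(f"Layer 3: split {L3_split:.1f} > uniform {L3_unif:.1f}")
```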
To summarize, for a different resource allocation strategy in which the computation resources are split to compute m_2 and m_3 in the second layer, we can achieve a smaller communication load in Layer 2 at the cost of a higher communication load in Layer 3.

V. CONCLUSION

We described two extensions of the CDC scheme in [3] to solve distributed computing problems with straggling servers and multistage computations, respectively. In particular, when faced with straggling servers, we presented a unified coding scheme that superimposes the CDC scheme on top of the MDS code, achieving a flexible tradeoff between computation latency and communication load. On the other hand, for a multistage computation expressed as a DAG, we proposed a general coded scheme that first specifies the computation load for each processing vertex of the DAG, and then applies the CDC scheme to each vertex individually.

REFERENCES
[1] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, "Coded MapReduce," 53rd Allerton Conference, Sept. 2015.
[2] ——, "Fundamental tradeoff between computation and communication in distributed computing," IEEE ISIT, July 2016.
[3] S. Li, M. A. Maddah-Ali, Q. Yu, and A. S. Avestimehr, "A fundamental tradeoff between computation and communication in distributed computing," arXiv e-print, 2016, submitted to IEEE Trans. Inf. Theory.
[4] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, "Speeding up distributed machine learning using codes," arXiv e-print, Dec. 2015.
[5] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, "A unified coding framework for distributed computing with straggling servers," arXiv e-print, Sept. 2016; a shorter version to appear in IEEE NetCod 2016.
[6] C. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun, "Map-Reduce for machine learning on multicore," Advances in Neural Information Processing Systems, vol. 19.
[7] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," ACM SIGOPS Operating Systems Review, vol. 41, no. 3, June 2007.
[8] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin, "HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads," Proceedings of the VLDB Endowment, vol. 2, no. 1, 2009.
[9] J. Ekanayake, T. Gunarathne, G. Fox, A. S. Balkir, C. Poulain, N. Araujo, and R. Barga, "DryadLINQ for scientific analyses," in Fifth IEEE International Conference on e-Science, 2009.
[10] B. Saha, H. Shah, S. Seth, G. Vijayaraghavan, A. Murthy, and C. Curino, "Apache Tez: A unifying framework for modeling and building data processing applications," in Proceedings of the 2015 ACM SIGMOD, 2015.
[11] B. C. Arnold, N. Balakrishnan, and H. N. Nagaraja, A First Course in Order Statistics. SIAM, 1992, vol. 54.


More information

6. Lecture notes on matroid intersection

6. Lecture notes on matroid intersection Massachusetts Institute of Technology 18.453: Combinatorial Optimization Michel X. Goemans May 2, 2017 6. Lecture notes on matroid intersection One nice feature about matroids is that a simple greedy algorithm

More information

Network Coding for Distributed Storage Systems* Presented by Jayant Apte ASPITRG 7/9/13 & 7/11/13

Network Coding for Distributed Storage Systems* Presented by Jayant Apte ASPITRG 7/9/13 & 7/11/13 Network Coding for Distributed Storage Systems* Presented by Jayant Apte ASPITRG 7/9/13 & 7/11/13 *Dimakis, A.G.; Godfrey, P.B.; Wu, Y.; Wainwright, M.J.; Ramchandran, K. "Network Coding for Distributed

More information

THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER

THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER Akhil Kumar and Michael Stonebraker EECS Department University of California Berkeley, Ca., 94720 Abstract A heuristic query optimizer must choose

More information

Observations on Client-Server and Mobile Agent Paradigms for Resource Allocation

Observations on Client-Server and Mobile Agent Paradigms for Resource Allocation Observations on Client-Server and Mobile Agent Paradigms for Resource Allocation M. Bahouya, J. Gaber and A. Kouam Laboratoire SeT Université de Technologie de Belfort-Montbéliard 90000 Belfort, France

More information

Max-Flow Protection using Network Coding

Max-Flow Protection using Network Coding Max-Flow Protection using Network Coding Osameh M. Al-Kofahi Department of Computer Engineering Yarmouk University, Irbid, Jordan Ahmed E. Kamal Department of Electrical and Computer Engineering Iowa State

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Mathematical and Algorithmic Foundations Linear Programming and Matchings

Mathematical and Algorithmic Foundations Linear Programming and Matchings Adavnced Algorithms Lectures Mathematical and Algorithmic Foundations Linear Programming and Matchings Paul G. Spirakis Department of Computer Science University of Patras and Liverpool Paul G. Spirakis

More information

Graph based codes for distributed storage systems

Graph based codes for distributed storage systems /23 Graph based codes for distributed storage systems July 2, 25 Christine Kelley University of Nebraska-Lincoln Joint work with Allison Beemer and Carolyn Mayer Combinatorics and Computer Algebra, COCOA

More information

On the Complexity of Multi-Dimensional Interval Routing Schemes

On the Complexity of Multi-Dimensional Interval Routing Schemes On the Complexity of Multi-Dimensional Interval Routing Schemes Abstract Multi-dimensional interval routing schemes (MIRS) introduced in [4] are an extension of interval routing schemes (IRS). We give

More information

Section 3.1: Nonseparable Graphs Cut vertex of a connected graph G: A vertex x G such that G x is not connected. Theorem 3.1, p. 57: Every connected

Section 3.1: Nonseparable Graphs Cut vertex of a connected graph G: A vertex x G such that G x is not connected. Theorem 3.1, p. 57: Every connected Section 3.1: Nonseparable Graphs Cut vertex of a connected graph G: A vertex x G such that G x is not connected. Theorem 3.1, p. 57: Every connected graph G with at least 2 vertices contains at least 2

More information

HadoopDB: An open source hybrid of MapReduce

HadoopDB: An open source hybrid of MapReduce HadoopDB: An open source hybrid of MapReduce and DBMS technologies Azza Abouzeid, Kamil Bajda-Pawlikowski Daniel J. Abadi, Avi Silberschatz Yale University http://hadoopdb.sourceforge.net October 2, 2009

More information

MapReduce in Streaming Data

MapReduce in Streaming Data MapReduce in Streaming Data October 21, 2013 Instructor : Dr. Barna Saha Course : CSCI 8980 : Algorithmic Techniques for Big Data Analysis Scribe by : Neelabjo Shubhashis Choudhury Introduction to MapReduce,

More information

Realizing Common Communication Patterns in Partitioned Optical Passive Stars (POPS) Networks

Realizing Common Communication Patterns in Partitioned Optical Passive Stars (POPS) Networks 998 IEEE TRANSACTIONS ON COMPUTERS, VOL. 47, NO. 9, SEPTEMBER 998 Realizing Common Communication Patterns in Partitioned Optical Passive Stars (POPS) Networks Greg Gravenstreter and Rami G. Melhem, Senior

More information

Definition: A graph G = (V, E) is called a tree if G is connected and acyclic. The following theorem captures many important facts about trees.

Definition: A graph G = (V, E) is called a tree if G is connected and acyclic. The following theorem captures many important facts about trees. Tree 1. Trees and their Properties. Spanning trees 3. Minimum Spanning Trees 4. Applications of Minimum Spanning Trees 5. Minimum Spanning Tree Algorithms 1.1 Properties of Trees: Definition: A graph G

More information

Improving VoD System Efficiency with Multicast and Caching

Improving VoD System Efficiency with Multicast and Caching Improving VoD System Efficiency with Multicast and Caching Jack Yiu-bun Lee Department of Information Engineering The Chinese University of Hong Kong Contents 1. Introduction 2. Previous Works 3. UVoD

More information

2.3 Algorithms Using Map-Reduce

2.3 Algorithms Using Map-Reduce 28 CHAPTER 2. MAP-REDUCE AND THE NEW SOFTWARE STACK one becomes available. The Master must also inform each Reduce task that the location of its input from that Map task has changed. Dealing with a failure

More information

CHARACTERIZING the capacity region of wireless

CHARACTERIZING the capacity region of wireless IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 5, MAY 2010 2249 The Balanced Unicast and Multicast Capacity Regions of Large Wireless Networks Abstract We consider the question of determining the

More information

ARELAY network consists of a pair of source and destination

ARELAY network consists of a pair of source and destination 158 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 55, NO 1, JANUARY 2009 Parity Forwarding for Multiple-Relay Networks Peyman Razaghi, Student Member, IEEE, Wei Yu, Senior Member, IEEE Abstract This paper

More information

Performance Evaluation of a Novel Direct Table Lookup Method and Architecture With Application to 16-bit Integer Functions

Performance Evaluation of a Novel Direct Table Lookup Method and Architecture With Application to 16-bit Integer Functions Performance Evaluation of a Novel Direct Table Lookup Method and Architecture With Application to 16-bit nteger Functions L. Li, Alex Fit-Florea, M. A. Thornton, D. W. Matula Southern Methodist University,

More information

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Huayu Wu Institute for Infocomm Research, A*STAR, Singapore huwu@i2r.a-star.edu.sg Abstract. Processing XML queries over

More information

An Optimal Disk Allocation Strategy for Partial Match Queries on Non-Uniform. 1.1 Cartesian Product Files

An Optimal Disk Allocation Strategy for Partial Match Queries on Non-Uniform. 1.1 Cartesian Product Files An Optimal Disk Allocation Strategy for Partial Match Queries on Non-Uniform Cartesian Product Files Sajal K. Das Department of Computer Science University of North Texas Denton, TX 76203-1366 E-mail:

More information

V10 Metabolic networks - Graph connectivity

V10 Metabolic networks - Graph connectivity V10 Metabolic networks - Graph connectivity Graph connectivity is related to analyzing biological networks for - finding cliques - edge betweenness - modular decomposition that have been or will be covered

More information

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2 Introduction :- Today single CPU based architecture is not capable enough for the modern database that are required to handle more demanding and complex requirements of the users, for example, high performance,

More information

A Level-wise Priority Based Task Scheduling for Heterogeneous Systems

A Level-wise Priority Based Task Scheduling for Heterogeneous Systems International Journal of Information and Education Technology, Vol., No. 5, December A Level-wise Priority Based Task Scheduling for Heterogeneous Systems R. Eswari and S. Nickolas, Member IACSIT Abstract

More information

PACKING DIGRAPHS WITH DIRECTED CLOSED TRAILS

PACKING DIGRAPHS WITH DIRECTED CLOSED TRAILS PACKING DIGRAPHS WITH DIRECTED CLOSED TRAILS PAUL BALISTER Abstract It has been shown [Balister, 2001] that if n is odd and m 1,, m t are integers with m i 3 and t i=1 m i = E(K n) then K n can be decomposed

More information

Degrees of Freedom in Cached Interference Networks with Limited Backhaul

Degrees of Freedom in Cached Interference Networks with Limited Backhaul Degrees of Freedom in Cached Interference Networks with Limited Backhaul Vincent LAU, Department of ECE, Hong Kong University of Science and Technology (A) Motivation Interference Channels 3 No side information

More information

Notes for Lecture 24

Notes for Lecture 24 U.C. Berkeley CS170: Intro to CS Theory Handout N24 Professor Luca Trevisan December 4, 2001 Notes for Lecture 24 1 Some NP-complete Numerical Problems 1.1 Subset Sum The Subset Sum problem is defined

More information

Matching and Planarity

Matching and Planarity Matching and Planarity Po-Shen Loh June 010 1 Warm-up 1. (Bondy 1.5.9.) There are n points in the plane such that every pair of points has distance 1. Show that there are at most n (unordered) pairs of

More information

BELOW, we consider decoding algorithms for Reed Muller

BELOW, we consider decoding algorithms for Reed Muller 4880 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 11, NOVEMBER 2006 Error Exponents for Recursive Decoding of Reed Muller Codes on a Binary-Symmetric Channel Marat Burnashev and Ilya Dumer, Senior

More information

Randomized rounding of semidefinite programs and primal-dual method for integer linear programming. Reza Moosavi Dr. Saeedeh Parsaeefard Dec.

Randomized rounding of semidefinite programs and primal-dual method for integer linear programming. Reza Moosavi Dr. Saeedeh Parsaeefard Dec. Randomized rounding of semidefinite programs and primal-dual method for integer linear programming Dr. Saeedeh Parsaeefard 1 2 3 4 Semidefinite Programming () 1 Integer Programming integer programming

More information

On the Robustness of Distributed Computing Networks

On the Robustness of Distributed Computing Networks 1 On the Robustness of Distributed Computing Networks Jianan Zhang, Hyang-Won Lee, and Eytan Modiano Lab for Information and Decision Systems, Massachusetts Institute of Technology, USA Dept. of Software,

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

SPARSE COMPONENT ANALYSIS FOR BLIND SOURCE SEPARATION WITH LESS SENSORS THAN SOURCES. Yuanqing Li, Andrzej Cichocki and Shun-ichi Amari

SPARSE COMPONENT ANALYSIS FOR BLIND SOURCE SEPARATION WITH LESS SENSORS THAN SOURCES. Yuanqing Li, Andrzej Cichocki and Shun-ichi Amari SPARSE COMPONENT ANALYSIS FOR BLIND SOURCE SEPARATION WITH LESS SENSORS THAN SOURCES Yuanqing Li, Andrzej Cichocki and Shun-ichi Amari Laboratory for Advanced Brain Signal Processing Laboratory for Mathematical

More information

Lecture Notes 2: The Simplex Algorithm

Lecture Notes 2: The Simplex Algorithm Algorithmic Methods 25/10/2010 Lecture Notes 2: The Simplex Algorithm Professor: Yossi Azar Scribe:Kiril Solovey 1 Introduction In this lecture we will present the Simplex algorithm, finish some unresolved

More information

Optimal Algorithms for Cross-Rack Communication Optimization in MapReduce Framework

Optimal Algorithms for Cross-Rack Communication Optimization in MapReduce Framework Optimal Algorithms for Cross-Rack Communication Optimization in MapReduce Framework Li-Yung Ho Institute of Information Science Academia Sinica, Department of Computer Science and Information Engineering

More information

INTERLEAVING codewords is an important method for

INTERLEAVING codewords is an important method for IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 2, FEBRUARY 2005 597 Multicluster Interleaving on Paths Cycles Anxiao (Andrew) Jiang, Member, IEEE, Jehoshua Bruck, Fellow, IEEE Abstract Interleaving

More information

Module 7. Independent sets, coverings. and matchings. Contents

Module 7. Independent sets, coverings. and matchings. Contents Module 7 Independent sets, coverings Contents and matchings 7.1 Introduction.......................... 152 7.2 Independent sets and coverings: basic equations..... 152 7.3 Matchings in bipartite graphs................

More information

Part II. Graph Theory. Year

Part II. Graph Theory. Year Part II Year 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2017 53 Paper 3, Section II 15H Define the Ramsey numbers R(s, t) for integers s, t 2. Show that R(s, t) exists for all s,

More information

Availability of Coding Based Replication Schemes. Gagan Agrawal. University of Maryland. College Park, MD 20742

Availability of Coding Based Replication Schemes. Gagan Agrawal. University of Maryland. College Park, MD 20742 Availability of Coding Based Replication Schemes Gagan Agrawal Department of Computer Science University of Maryland College Park, MD 20742 Abstract Data is often replicated in distributed systems to improve

More information

New Constructions of Non-Adaptive and Error-Tolerance Pooling Designs

New Constructions of Non-Adaptive and Error-Tolerance Pooling Designs New Constructions of Non-Adaptive and Error-Tolerance Pooling Designs Hung Q Ngo Ding-Zhu Du Abstract We propose two new classes of non-adaptive pooling designs The first one is guaranteed to be -error-detecting

More information

arxiv: v2 [cs.ds] 18 May 2015

arxiv: v2 [cs.ds] 18 May 2015 Optimal Shuffle Code with Permutation Instructions Sebastian Buchwald, Manuel Mohr, and Ignaz Rutter Karlsruhe Institute of Technology {sebastian.buchwald, manuel.mohr, rutter}@kit.edu arxiv:1504.07073v2

More information

SCALING UP OF E-MSR CODES BASED DISTRIBUTED STORAGE SYSTEMS WITH FIXED NUMBER OF REDUNDANCY NODES

SCALING UP OF E-MSR CODES BASED DISTRIBUTED STORAGE SYSTEMS WITH FIXED NUMBER OF REDUNDANCY NODES SCALING UP OF E-MSR CODES BASED DISTRIBUTED STORAGE SYSTEMS WITH FIXED NUMBER OF REDUNDANCY NODES Haotian Zhao, Yinlong Xu and Liping Xiang School of Computer Science and Technology, University of Science

More information

Pebble Sets in Convex Polygons

Pebble Sets in Convex Polygons 2 1 Pebble Sets in Convex Polygons Kevin Iga, Randall Maddox June 15, 2005 Abstract Lukács and András posed the problem of showing the existence of a set of n 2 points in the interior of a convex n-gon

More information

Mobility-Aware Coded Storage and Delivery

Mobility-Aware Coded Storage and Delivery Mobility-Aware Coded Storage and Delivery Emre Ozfatura and Deniz Gündüz Information Processing and Communications Lab Department of Electrical and Electronic Engineering Imperial College London Email:

More information

6 Randomized rounding of semidefinite programs

6 Randomized rounding of semidefinite programs 6 Randomized rounding of semidefinite programs We now turn to a new tool which gives substantially improved performance guarantees for some problems We now show how nonlinear programming relaxations can

More information