Complexity results for throughput and latency optimization of replicated and data-parallel workflows


Anne Benoit and Yves Robert
GRAAL team, LIP, École Normale Supérieure de Lyon
Alpage meeting, June 2007

Introduction and motivation
Mapping workflow applications onto parallel platforms is a difficult challenge; heterogeneous clusters and fully heterogeneous platforms make it even more difficult!
Structured programming approach:
- easier to program (avoids deadlocks and process starvation)
- range of well-known paradigms (pipeline, farm)
- algorithmic skeletons: a help for mapping
Here: mapping pipeline and fork workflows.

Rule of the game
Consecutive data sets are fed into the workflow.
Period T_period = time interval between the beginning of execution of two consecutive data sets (throughput = 1/T_period).
Latency T_latency(x) = time elapsed between the beginning and the end of execution for a given data set x, and T_latency = max_x T_latency(x).
Map each pipeline/fork stage onto one or several processors.
Goal: minimize T_period, or T_latency, or a bi-criteria combination.

Replication and data-parallelism
Replicate stage S_k on P_1, ..., P_q: processors take whole data sets in round-robin order,
S_k on P_1: data sets 1, 4, 7, ...
S_k on P_2: data sets 2, 5, 8, ...
S_k on P_3: data sets 3, 6, 9, ...
Data-parallelize stage S_k on P_1, ..., P_q: each data set is split across the processors; e.g. a stage with w = 16 shared among P_1 (s_1 = 2), P_2 (s_2 = 1) and P_3 (s_3 = 1). A quick numeric check of both schemes follows below.
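To make the two schemes concrete, here is a minimal check in Python of the stage pictured above (w = 16 on speeds 2, 1, 1), using the period formulas made precise later in the talk; this is a sketch of the transcription, not code from the slides.

```python
# Quick check of the two schemes for one stage with w = 16 on processors of
# speeds (2, 1, 1), under the cost model stated later (no communication costs).

w, speeds = 16, [2, 1, 1]

# Data-parallelism: each data set is split across the processors, so both the
# period and the latency equal w divided by the cumulated speed.
t_dp = w / sum(speeds)                    # 16 / 4 = 4.0

# Replication: processors take whole data sets round-robin, so a new data set
# starts every w / (q * min_speed) time units: the slowest processor sets the pace.
t_rep = w / (len(speeds) * min(speeds))   # 16 / 3 ≈ 5.33

print(t_dp, t_rep)
```

Data-parallelism wins on both criteria here: with the latency formulas given later, T_latency = 4 for data-parallelism versus 16/1 = 16 for replication.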

Major contributions
Theoretical approach to the problem:
- definition of replication and data-parallelism
- formal definition of T_period and T_latency in each case
Problem complexity: focus on pipeline and fork workflows.

Outline
1 Framework
2 Working out an example
3 The problem
4 Complexity results
5 Conclusion

Pipeline graphs
[Figure: a linear chain S_1 → S_2 → ... → S_n, with computation weights w_1, ..., w_n and data sizes δ_0, δ_1, ..., δ_n along the edges.]
n stages S_k, 1 ≤ k ≤ n. Stage S_k:
- receives an input of size δ_{k−1} from S_{k−1}
- performs w_k computations
- outputs data of size δ_k to S_{k+1}

Fork graphs
[Figure: a root stage S_0 of weight w_0 sending data of size δ_0 to independent stages S_1, ..., S_n, with weights w_1, ..., w_n and output sizes δ_1, ..., δ_n.]
n + 1 stages S_k, 0 ≤ k ≤ n. S_0 is the root stage; S_1 to S_n are independent stages.
A data set first goes through stage S_0; then all other stages can execute simultaneously on it.

The platform
[Figure: processors P_in, P_u, P_v, P_out with speeds s_in, s_u, s_v, s_out, connected by links of bandwidths b_{in,u}, b_{u,v}, b_{v,out}.]
p processors P_u, 1 ≤ u ≤ p, fully interconnected.
s_u: speed of processor P_u.
Bidirectional link link_{u,v}: P_u ↔ P_v, of bandwidth b_{u,v}.
One-port model: each processor can either send, receive, or compute at any time step.

Different platforms
Fully Homogeneous: identical processors (s_u = s) and links (b_{u,v} = b): typical parallel machines.
Communication Homogeneous: different-speed processors (s_u ≠ s_v), identical links (b_{u,v} = b): networks of workstations, clusters.
Fully Heterogeneous: different-speed processors (s_u ≠ s_v) and different links (b_{u,v} ≠ b_{u',v'}): hierarchical platforms, grids.

Back to pipeline: mapping strategies
[Figure: the pipeline S_1, S_2, ..., S_k, ..., S_n under each strategy.]
One-to-one Mapping: each stage on a distinct processor.
Interval Mapping: intervals of consecutive stages on processors.
General Mapping: arbitrary sets of stages on processors.
In this work: Interval Mapping.

Chains-on-chains
Load-balance contiguous tasks among processors.
[Figure: a chain of weighted tasks; with p = 4 identical processors, the best contiguous partition achieves T_period = 20.]
NP-hard for different-speed processors, even without communications. A small solver sketch for the identical-processor case follows.
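The identical-processor case can be solved by a binary search on the period combined with a greedy feasibility check; here is a minimal sketch (the task weights are made up, since the slide's figure did not survive transcription).

```python
# Chains-on-chains on identical processors: binary-search the integer period,
# checking feasibility with a greedy left-to-right packing (exact here).

def feasible(weights, p, period):
    used, load = 1, 0
    for w in weights:
        if w > period:
            return False
        if load + w > period:          # open a new interval
            used, load = used + 1, 0
        load += w
    return used <= p

def min_period(weights, p):
    lo, hi = max(weights), sum(weights)
    while lo < hi:                     # smallest feasible period
        mid = (lo + hi) // 2
        if feasible(weights, p, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

print(min_period([5, 7, 3, 8, 6, 2, 9, 4], p=4))   # hypothetical weights -> 13
```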

29 Outline 1 Framework 2 Working out an example 3 The problem 4 Complexity results 5 Conclusion Anne.Benoit@ens-lyon.fr June 2007 Workflows complexity results Alpage meeting 14/ 33

Working out an example
[Figure: a four-stage pipeline S_1, S_2, S_3, S_4 with its workloads.]
Interval mapping, 4 processors, s_1 = 2 and s_2 = s_3 = s_4 = 1.
Optimal period? T_period = 7: S_1 → P_1, S_2 S_3 → P_2, S_4 → P_3 (T_latency = 17).
Optimal latency? T_latency = 12: S_1 S_2 S_3 S_4 → P_1 (T_period = 12).
Minimum latency if T_period ≤ 10? T_latency = 14: S_1 S_2 S_3 → P_1, S_4 → P_2.
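These three answers can be verified exhaustively. The sketch below assumes stage workloads w = (14, 4, 2, 4): only w_1 = 14, w_2 + w_3 = 6 and w_4 = 4 can actually be read off the slides' numbers, so the (4, 2) split is hypothetical but consistent with every stated optimum.

```python
# Brute force over interval mappings for the 4-stage example: no replication,
# no data-parallelism, one processor per interval, no communication costs.
from itertools import combinations, permutations

w = [14, 4, 2, 4]        # assumed workloads, consistent with the slide
speeds = [2, 1, 1, 1]
n = len(w)

def mappings():
    for k in range(1, n + 1):                          # number of intervals
        for cuts in combinations(range(1, n), k - 1):  # interval borders
            bounds = [0, *cuts, n]
            loads = [sum(w[a:b]) for a, b in zip(bounds, bounds[1:])]
            for procs in permutations(range(n), k):    # distinct processors
                times = [load / speeds[q] for load, q in zip(loads, procs)]
                yield max(times), sum(times)           # (period, latency)

print(min(p for p, _ in mappings()))                   # 7.0
print(min(l for _, l in mappings()))                   # 12.0
print(min(l for p, l in mappings() if p <= 10))        # 14.0
```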

Example with replication and data-parallelism
Interval mapping, 4 processors, s_1 = 2 and s_2 = s_3 = s_4 = 1.
Replicate an interval [S_u .. S_v] on P_1, ..., P_q: processor P_i executes data sets i, q + i, 2q + i, ... (round-robin):
T_period = (Σ_{k=u}^{v} w_k) / (q · min_i s_i)   and   T_latency = q · T_period
Data-parallelize a single stage S_k on P_1, ..., P_q:
T_period = w_k / (Σ_{i=1}^{q} s_i)   and   T_latency = T_period
(e.g. a stage with w = 16 on P_1 (s_1 = 2), P_2 (s_2 = 1), P_3 (s_3 = 1): T_period = 16/4 = 4).

Optimal period?
- Data-parallelize S_1 on P_1 P_2 and replicate S_2 S_3 S_4 on P_3 P_4:
  T_period = max(14/(2+1), 10/(2·1)) = 5, T_latency = 14/3 + 10 = 14.67.
- Data-parallelize S_1 on P_2 P_3 P_4 and map S_2 S_3 S_4 on P_1:
  T_period = max(14/(1+1+1), 10/2) = 5, T_latency = 14/3 + 10/2 = 9.67 (optimal).


Interval Mapping for pipeline graphs
Several consecutive stages onto the same processor: increases the computational load, reduces the communications.
Partition [1..n] into m intervals I_j = [d_j, e_j], with d_j ≤ e_j for 1 ≤ j ≤ m, d_1 = 1, d_{j+1} = e_j + 1 for 1 ≤ j ≤ m − 1, and e_m = n.
Interval I_j is mapped onto processor P_alloc(j):

T_period = max_{1 ≤ j ≤ m} { δ_{d_j − 1} / b_{alloc(j−1),alloc(j)} + (Σ_{i=d_j}^{e_j} w_i) / s_{alloc(j)} + δ_{e_j} / b_{alloc(j),alloc(j+1)} }

T_latency = Σ_{1 ≤ j ≤ m} { δ_{d_j − 1} / b_{alloc(j−1),alloc(j)} + (Σ_{i=d_j}^{e_j} w_i) / s_{alloc(j)} + δ_{e_j} / b_{alloc(j),alloc(j+1)} }
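As a sketch, the two formulas can be evaluated directly for a given mapping. All conventions below are this sketch's own, not the talk's: stages are 1-based with a dummy w[0] = 0; delta[k] is the data size between S_k and S_{k+1}; the bandwidth table b also holds ("in", u) and (u, "out") entries for the processors of the first and last intervals; s maps each processor to its speed.

```python
# Evaluate the period and latency formulas for an interval mapping.

def interval_cost(j, intervals, alloc, w, delta, b, s):
    d, e = intervals[j]                          # I_j = [d_j, e_j]
    src = alloc[j - 1] if j > 0 else "in"        # where the input comes from
    dst = alloc[j + 1] if j + 1 < len(intervals) else "out"
    return (delta[d - 1] / b[(src, alloc[j])]    # receive the input
            + sum(w[d:e + 1]) / s[alloc[j]]      # compute the interval
            + delta[e] / b[(alloc[j], dst)])     # send the output

def period_and_latency(intervals, alloc, w, delta, b, s):
    costs = [interval_cost(j, intervals, alloc, w, delta, b, s)
             for j in range(len(intervals))]
    return max(costs), sum(costs)                # (T_period, T_latency)
```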

Fork graphs
Map any partition of the graph onto the processors: q intervals, q ≤ p. The first interval contains S_0, and possibly S_1 to S_k; the next intervals contain independent stages.
T_period? It depends on the communication model: is it possible to start communicating as soon as S_0 is done? In which order do the communications take place?
Informally:
T_period = maximum time needed by a processor to receive its data, compute, and output its result.
T_latency = time elapsed between the input of a data set to S_0 and the completion of the last computation for this data set.
A simpler model is used for the formal analysis.

Back to a simpler problem
No communication costs nor overheads.
Cost to execute S_i on P_u alone: w_i / s_u.
Cost to data-parallelize [S_i, S_j] (i = j for a pipeline; 0 < i ≤ j or i = j = 0 for a fork) on k processors P_{q_1}, ..., P_{q_k}:
(Σ_{l=i}^{j} w_l) / (Σ_{u=1}^{k} s_{q_u})
This cost is both the T_period of the assigned processors and the delay to traverse the interval.
Cost to replicate [S_i, S_j] on k processors P_{q_1}, ..., P_{q_k}:
(Σ_{l=i}^{j} w_l) / (k · min_{1 ≤ u ≤ k} s_{q_u})
This cost is the T_period of the assigned processors; the delay to traverse the interval is the time needed by the slowest processor:
t_max = (Σ_{l=i}^{j} w_l) / min_{1 ≤ u ≤ k} s_{q_u}
With these formulas (spelled out in code below), it is easy to compute T_period for both graphs, and T_latency for pipeline graphs.
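The three costs, as a short sketch in code (the helper names are mine, not the talk's):

```python
def alone_cost(w_i, s_u):
    # S_i executed by P_u alone
    return w_i / s_u

def data_parallel_cost(works, speeds):
    # period of the assigned processors == delay to traverse the interval
    return sum(works) / sum(speeds)

def replicate_period(works, speeds):
    # k processors take whole data sets round-robin
    return sum(works) / (len(speeds) * min(speeds))

def replicate_delay(works, speeds):
    # t_max: traversal delay, paid on the slowest assigned processor
    return sum(works) / min(speeds)
```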

Latency for a fork?
Partition the stages into q sets I_r (1 ≤ r ≤ q ≤ p), with S_0 ∈ I_1, mapped onto k processors P_{q_1}, ..., P_{q_k}.
t_max(r) = delay of the r-th set (1 ≤ r ≤ q), computed as before.
Flexible communication model: the computations of I_r, r ≥ 2, start as soon as the computation of S_0 is completed.
s_0 = speed at which S_0 is processed:
s_0 = Σ_{u=1}^{k} s_{q_u} if I_1 is data-parallelized,
s_0 = min_{1 ≤ u ≤ k} s_{q_u} if I_1 is replicated.

T_latency = max( t_max(1), w_0 / s_0 + max_{2 ≤ r ≤ q} t_max(r) )
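A sketch of this formula in code, reusing the costs of the previous slide. Each set I_r is described here by a triple (works, speeds, scheme), with scheme "dp" or "rep"; intervals[0] is I_1 and its works include w_0. The representation is this sketch's choice, not the talk's.

```python
def t_max(works, speeds, scheme):
    # traversal delay of one set, per the simpler cost model
    return sum(works) / (sum(speeds) if scheme == "dp" else min(speeds))

def fork_latency(w0, intervals):
    works1, speeds1, scheme1 = intervals[0]
    s0 = sum(speeds1) if scheme1 == "dp" else min(speeds1)
    rest = max((t_max(*iv) for iv in intervals[1:]), default=0.0)
    return max(t_max(works1, speeds1, scheme1), w0 / s0 + rest)
```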

Optimization problem
Given:
- an application graph (n-stage pipeline or (n + 1)-stage fork),
- a target platform (Homogeneous with p identical processors, or Heterogeneous with p different-speed processors),
- a mapping strategy with replication, either with or without data-parallelization,
- an objective (period T_period or latency T_latency),
determine an interval-based mapping that minimizes the objective.
2 × 2 × 2 × 2 = 16 optimization problems.

Bi-criteria optimization problem
Given a threshold period P_threshold, determine a mapping whose period does not exceed P_threshold and that minimizes T_latency.
Given a threshold latency L_threshold, determine a mapping whose latency does not exceed L_threshold and that minimizes T_period.


Complexity results
Without data-parallelism, Homogeneous platforms:

                 period        latency      bi-criteria
Hom. pipeline    -
Het. pipeline    Poly (str)
Hom. fork        -             Poly (DP)
Het. fork        Poly (str)    NP-hard

With data-parallelism, Homogeneous platforms:

                 period        latency      bi-criteria
Hom. pipeline    -
Het. pipeline    Poly (DP)
Hom. fork        -             Poly (DP)
Het. fork        Poly (str)    NP-hard

Without data-parallelism, Heterogeneous platforms:

                 period          latency      bi-criteria
Hom. pipeline    Poly (*)        -            Poly (*)
Het. pipeline    NP-hard (**)    Poly (str)   NP-hard
Hom. fork        Poly (*)
Het. fork        NP-hard         -

With data-parallelism, Heterogeneous platforms:

                 period       latency      bi-criteria
Hom. pipeline    NP-hard
Het. pipeline    -
Hom. fork        NP-hard
Het. fork        -

Most interesting case: without data-parallelism on Heterogeneous platforms (the table above), detailed next.

No data-parallelism, Heterogeneous platforms
For a pipeline, minimizing the latency is straightforward: map all stages onto the fastest processor.
Minimizing the period is NP-hard for a general pipeline (an involved reduction, similar to the heterogeneous chains-on-chains one).
Homogeneous pipeline (all stages have the same workload w): polynomial complexity.
Polynomial bi-criteria algorithm for a homogeneous pipeline.

Lemma: form of the solution
Pipeline, no data-parallelism, Heterogeneous platform.
Lemma. If an optimal solution minimizing the pipeline period uses q processors, consider the q fastest processors P_1, ..., P_q, ordered by non-decreasing speeds: s_1 ≤ ... ≤ s_q. There exists an optimal solution which replicates intervals of stages onto k intervals of processors I_r = [P_{d_r}, P_{e_r}], 1 ≤ r ≤ k, with d_1 = 1, e_k = q, and d_{r+1} = e_r + 1 for 1 ≤ r < k.
Proof: an exchange argument, which does not increase the latency.

Binary-search/dynamic-programming algorithm
Given a latency L and a period K:
- loop on the number of processors q;
- dynamic-programming algorithm to minimize the latency;
- success if L is obtained.
Binary search on L to minimize the latency for a fixed period; binary search on K to minimize the period for a fixed latency.

Dynamic programming algorithm
Compute L(n, 1, q), where L(m, i, j) = minimum latency to map m pipeline stages on processors P_i to P_j, while fitting in period K:

L(m, i, j) = min of
(1) m·w / s_i, if m·w / ((j − i + 1)·s_i) ≤ K   (replicating the m stages onto P_i, ..., P_j)
(2) min_{1 ≤ m' < m, i ≤ k < j} { L(m', i, k) + L(m − m', k + 1, j) }   (splitting the interval)

Initialization:
L(1, i, j) = w / s_i if w / ((j − i + 1)·s_i) ≤ K, +∞ otherwise
L(m, i, i) = m·w / s_i if m·w / s_i ≤ K, +∞ otherwise

Complexity of the dynamic programming: O(n²·p⁴). The number of iterations of the binary search is formally bounded, and very small in practice.
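A minimal sketch of the whole scheme, assuming the processors are globally sorted by non-decreasing speed s[1] ≤ ... ≤ s[p] (s[0] unused), so that the q fastest are P_{p−q+1}, ..., P_p and the leftmost index of any range is its slowest processor. The structure and names are this sketch's, not the authors' code.

```python
from functools import lru_cache

INF = float("inf")

def min_latency_for_period(n, w, s, K):
    """Minimum latency over interval mappings whose period is at most K."""
    p = len(s) - 1

    @lru_cache(maxsize=None)
    def L(m, i, j):
        # min latency to map m stages onto processors P_i..P_j (all used)
        best = INF
        if m * w / ((j - i + 1) * s[i]) <= K:    # (1) replicate the m stages
            best = m * w / s[i]                  #     delay of the slowest one
        for mp in range(1, m):                   # (2) split stages/processors
            for k in range(i, j):
                best = min(best, L(mp, i, k) + L(m - mp, k + 1, j))
        return best

    # loop on the number q of (fastest) processors actually used
    return min(L(n, p - q + 1, p) for q in range(1, p + 1))

def min_period_for_latency(n, w, s, Lmax):
    """Binary search on K over the finite set of achievable interval periods,
    keeping the smallest K whose optimal latency still fits in Lmax."""
    cands = sorted({m * w / (q * s[i])
                    for m in range(1, n + 1)
                    for i in range(1, len(s))
                    for q in range(1, len(s) - i + 1)})
    best, lo, hi = None, 0, len(cands) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if min_latency_for_period(n, w, s, cands[mid]) <= Lmax:
            best, hi = cands[mid], mid - 1       # feasible: tighten the period
        else:
            lo = mid + 1
    return best
```

Searching over the finite candidate set rather than over real values keeps the number of iterations trivially bounded, in the spirit of the bound mentioned on the slide.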


Related work
- Subhlok and Vondran: extension of their work (pipeline on homogeneous platforms).
- Chains-on-chains: in our work, the possibility to replicate or data-parallelize.
- Mapping pipelined computations onto clusters and grids: DAG [Taura et al.], DataCutter [Saltz et al.].
- Energy-aware mapping of pipelined computations: [Melhem et al.], three-criteria optimization.
- Mapping pipelined computations onto special-purpose architectures: FPGA arrays [Fabiani et al.]; fault-tolerance for embedded systems [Zhu et al.].
- Mapping skeletons onto clusters and grids: use of stochastic process algebra [Benoit et al.].

Conclusion
Mapping structured workflow applications onto computational platforms, with replication and data-parallelism.
Complexity of the most tractable instances: an insight into the combinatorial nature of the problem.
Pipeline and fork graphs, with an extension to fork-join; Homogeneous and Heterogeneous platforms with no communications.
Minimizing period or latency, and bi-criteria optimization problems.
A solid theoretical foundation for the study of single- and bi-criteria mappings, with the possibility to replicate and data-parallelize application stages.

Future work
Short term:
- select polynomial instances of the problem and assess their complexity when adding communication;
- design heuristics to solve combinatorial instances of the problem.
Longer term:
- heuristics based on our polynomial algorithms for general application graphs structured as combinations of pipeline and fork kernels;
- real experiments on heterogeneous clusters, comparing effective performance against theoretical performance.


More information

Auto Source Code Generation and Run-Time Infrastructure and Environment for High Performance, Distributed Computing Systems

Auto Source Code Generation and Run-Time Infrastructure and Environment for High Performance, Distributed Computing Systems Auto Source Code Generation and Run-Time Infrastructure and Environment for High Performance, Distributed Computing Systems Minesh I. Patel Ph.D. 1, Karl Jordan 1, Mattew Clark Ph.D. 1, and Devesh Bhatt

More information

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Donald S. Miller Department of Computer Science and Engineering Arizona State University Tempe, AZ, USA Alan C.

More information

A Parallel Hardware Architecture for Information-Theoretic Adaptive Filtering

A Parallel Hardware Architecture for Information-Theoretic Adaptive Filtering A Parallel Hardware Architecture for Information-Theoretic Adaptive Filtering HPRCTA 2010 Stefan Craciun Dr. Alan D. George Dr. Herman Lam Dr. Jose C. Principe November 14, 2010 NSF CHREC Center ECE Department,

More information

Multi-resource Energy-efficient Routing in Cloud Data Centers with Network-as-a-Service

Multi-resource Energy-efficient Routing in Cloud Data Centers with Network-as-a-Service in Cloud Data Centers with Network-as-a-Service Lin Wang*, Antonio Fernández Antaº, Fa Zhang*, Jie Wu+, Zhiyong Liu* *Institute of Computing Technology, CAS, China ºIMDEA Networks Institute, Spain + Temple

More information

LINEAR CODES WITH NON-UNIFORM ERROR CORRECTION CAPABILITY

LINEAR CODES WITH NON-UNIFORM ERROR CORRECTION CAPABILITY LINEAR CODES WITH NON-UNIFORM ERROR CORRECTION CAPABILITY By Margaret Ann Bernard The University of the West Indies and Bhu Dev Sharma Xavier University of Louisiana, New Orleans ABSTRACT This paper introduces

More information

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents E-Companion: On Styles in Product Design: An Analysis of US Design Patents 1 PART A: FORMALIZING THE DEFINITION OF STYLES A.1 Styles as categories of designs of similar form Our task involves categorizing

More information

Spatial Data Structures

Spatial Data Structures 15-462 Computer Graphics I Lecture 17 Spatial Data Structures Hierarchical Bounding Volumes Regular Grids Octrees BSP Trees Constructive Solid Geometry (CSG) April 1, 2003 [Angel 9.10] Frank Pfenning Carnegie

More information

Hierarchical Chubby: A Scalable, Distributed Locking Service

Hierarchical Chubby: A Scalable, Distributed Locking Service Hierarchical Chubby: A Scalable, Distributed Locking Service Zoë Bohn and Emma Dauterman Abstract We describe a scalable, hierarchical version of Google s locking service, Chubby, designed for use by systems

More information

Scheduling Algorithms to Minimize Session Delays

Scheduling Algorithms to Minimize Session Delays Scheduling Algorithms to Minimize Session Delays Nandita Dukkipati and David Gutierrez A Motivation I INTRODUCTION TCP flows constitute the majority of the traffic volume in the Internet today Most of

More information

Maximum Differential Graph Coloring

Maximum Differential Graph Coloring Maximum Differential Graph Coloring Sankar Veeramoni University of Arizona Joint work with Yifan Hu at AT&T Research and Stephen Kobourov at University of Arizona Motivation Map Coloring Problem Related

More information

CS599: Convex and Combinatorial Optimization Fall 2013 Lecture 14: Combinatorial Problems as Linear Programs I. Instructor: Shaddin Dughmi

CS599: Convex and Combinatorial Optimization Fall 2013 Lecture 14: Combinatorial Problems as Linear Programs I. Instructor: Shaddin Dughmi CS599: Convex and Combinatorial Optimization Fall 2013 Lecture 14: Combinatorial Problems as Linear Programs I Instructor: Shaddin Dughmi Announcements Posted solutions to HW1 Today: Combinatorial problems

More information

A Distributed Hash Table for Shared Memory

A Distributed Hash Table for Shared Memory A Distributed Hash Table for Shared Memory Wytse Oortwijn Formal Methods and Tools, University of Twente August 31, 2015 Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash

More information

6.896 Topics in Algorithmic Game Theory March 1, Lecture 8

6.896 Topics in Algorithmic Game Theory March 1, Lecture 8 6.896 Topics in Algorithmic Game Theory March 1, 2010 Lecture 8 Lecturer: Constantinos Daskalakis Scribe: Alan Deckelbaum, Anthony Kim NOTE: The content of these notes has not been formally reviewed by

More information

16/10/2008. Today s menu. PRAM Algorithms. What do you program? What do you expect from a model? Basic ideas for the machine model

16/10/2008. Today s menu. PRAM Algorithms. What do you program? What do you expect from a model? Basic ideas for the machine model Today s menu 1. What do you program? Parallel complexity and algorithms PRAM Algorithms 2. The PRAM Model Definition Metrics and notations Brent s principle A few simple algorithms & concepts» Parallel

More information

PHX: Memory Speed HPC I/O with NVM. Pradeep Fernando Sudarsun Kannan, Ada Gavrilovska, Karsten Schwan

PHX: Memory Speed HPC I/O with NVM. Pradeep Fernando Sudarsun Kannan, Ada Gavrilovska, Karsten Schwan PHX: Memory Speed HPC I/O with NVM Pradeep Fernando Sudarsun Kannan, Ada Gavrilovska, Karsten Schwan Node Local Persistent I/O? Node local checkpoint/ restart - Recover from transient failures ( node restart)

More information

Bayeux: An Architecture for Scalable and Fault Tolerant Wide area Data Dissemination

Bayeux: An Architecture for Scalable and Fault Tolerant Wide area Data Dissemination Bayeux: An Architecture for Scalable and Fault Tolerant Wide area Data Dissemination By Shelley Zhuang,Ben Zhao,Anthony Joseph, Randy Katz,John Kubiatowicz Introduction Multimedia Streaming typically involves

More information

Modalis. A First Step to the Evaluation of SimGrid in the Context of a Real Application. Abdou Guermouche and Hélène Renard, May 5, 2010

Modalis. A First Step to the Evaluation of SimGrid in the Context of a Real Application. Abdou Guermouche and Hélène Renard, May 5, 2010 A First Step to the Evaluation of SimGrid in the Context of a Real Application Abdou Guermouche and Hélène Renard, LaBRI/Univ Bordeaux 1 I3S/École polytechnique universitaire de Nice-Sophia Antipolis May

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google* 정학수, 최주영 1 Outline Introduction Design Overview System Interactions Master Operation Fault Tolerance and Diagnosis Conclusions

More information

Parallel Query Optimisation

Parallel Query Optimisation Parallel Query Optimisation Contents Objectives of parallel query optimisation Parallel query optimisation Two-Phase optimisation One-Phase optimisation Inter-operator parallelism oriented optimisation

More information

On Modularity Clustering. Group III (Ying Xuan, Swati Gambhir & Ravi Tiwari)

On Modularity Clustering. Group III (Ying Xuan, Swati Gambhir & Ravi Tiwari) On Modularity Clustering Presented by: Presented by: Group III (Ying Xuan, Swati Gambhir & Ravi Tiwari) Modularity A quality index for clustering a graph G=(V,E) G=(VE) q( C): EC ( ) EC ( ) + ECC (, ')

More information

ZHT: Const Eventual Consistency Support For ZHT. Group Member: Shukun Xie Ran Xin

ZHT: Const Eventual Consistency Support For ZHT. Group Member: Shukun Xie Ran Xin ZHT: Const Eventual Consistency Support For ZHT Group Member: Shukun Xie Ran Xin Outline Problem Description Project Overview Solution Maintains Replica List for Each Server Operation without Primary Server

More information

Spatial Data Structures

Spatial Data Structures 15-462 Computer Graphics I Lecture 17 Spatial Data Structures Hierarchical Bounding Volumes Regular Grids Octrees BSP Trees Constructive Solid Geometry (CSG) March 28, 2002 [Angel 8.9] Frank Pfenning Carnegie

More information