Complexity Results for Throughput and Latency Optimization of Replicated and Data-parallel Workflows

Anne Benoit and Yves Robert, GRAAL team, LIP, École Normale Supérieure de Lyon. HeteroPar'07, September 2007.

Introduction and motivation

Mapping applications onto parallel platforms is a difficult challenge; heterogeneous clusters and fully heterogeneous platforms make it even more difficult!

Structured programming approach:
- easier to program (avoids deadlocks and process starvation)
- a range of well-known paradigms (pipeline, farm)
- algorithmic skeletons: a help for the mapping

Here: mapping skeleton workflows (pipeline, fork) onto heterogeneous platforms.

Rule of the game

Map each pipeline stage on a single processor (extended later: replication and data-parallelism). Goal: minimize the execution time (extended later: throughput and latency).

Several mapping strategies for the pipeline S_1 → S_2 → ... → S_k → ... → S_n: One-to-one Mapping, Interval Mapping, General Mapping.

Major contributions

Theory:
- a formal approach to the problem
- definition of replication and data-parallelism (stages on several processors)
- several optimization criteria considered
- problem complexity for several cases

Practice: wait for my next talk!

Outline

1 Framework
2 Working out an example
3 Complexity results
4 Conclusion

The application: pipeline graphs

A linear chain S_1 → S_2 → ... → S_k → ... → S_n with workloads w_1, ..., w_n and data sizes δ_0, δ_1, ..., δ_n.

n stages S_k, 1 ≤ k ≤ n. Stage S_k:
- receives an input of size δ_{k−1} from S_{k−1}
- performs w_k computations
- outputs data of size δ_k to S_{k+1}

The application: fork graphs

n + 1 stages S_k, 0 ≤ k ≤ n. S_0 is the root stage (workload w_0), which sends an output of size δ_0 to each of the stages S_1 to S_n; these are independent stages with workloads w_1, ..., w_n and output sizes δ_1, ..., δ_n. A data set first goes through stage S_0, and can then be processed simultaneously by all the other stages.

The platform

p processors P_u, 1 ≤ u ≤ p, fully interconnected, plus an input processor P_in (speed s_in) and an output processor P_out (speed s_out), connected through links b_in,u and b_v,out.

- s_u: speed of processor P_u
- bidirectional link link_{u,v}: P_u → P_v, of bandwidth b_{u,v}
- one-port model: each processor can either send, receive or compute at any time step

Different platforms (NO COMMUNICATIONS)

Homogeneous: identical processors (s_u = s): typical parallel machines.
Heterogeneous: different-speed processors (s_u ≠ s_v), identical links since we do not consider communications (b_{u,v} = b): networks of workstations, clusters.

Rule of the game

Consecutive data sets are fed into the workflow.

Period T_period = time interval between the beginnings of execution of two consecutive data sets (throughput = 1/T_period).
Latency T_latency(x) = time elapsed between the beginning and the end of execution for a given data set x, and T_latency = max_x T_latency(x).

Map each pipeline/fork stage onto one or several processors. Goal: minimize T_period or T_latency, or bi-criteria minimization.
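To make the two metrics concrete, here is a toy simulation (my own illustration, not from the deck): when data sets are fed every T_period = max interval time, each data set experiences a latency equal to the sum of the interval times.

```python
# Two intervals with processing times t[0] and t[1] on their processors.
# Data set x starts interval j once (a) the same processor finished x-1 and
# (b) x has left the previous interval; data sets are fed every T time units.
t = [3.0, 2.0]
T = max(t)                                   # the period of this mapping
finish = [[0.0] * len(t) for _ in range(5)]  # finish[x][j] for 5 data sets
for x in range(5):
    for j in range(len(t)):
        prev_same = finish[x - 1][j] if x else 0.0
        prev_stage = finish[x][j - 1] if j else x * T
        finish[x][j] = max(prev_same, prev_stage) + t[j]
print([finish[x][-1] - x * T for x in range(5)])  # latency sum(t) = 5.0 for all
```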

Stage types

Monolithic stages: must be mapped onto one single processor, since the computation for a data set may depend on the result of the previous computation.
Replicable stages: can be replicated on several processors, but are not parallel, i.e. a data set must be entirely processed on a single processor.
Data-parallel stages: inherently parallel stages; one data set can be computed in parallel by several processors.

Replication

Replicate stage S_k on P_1, ..., P_q:

... S_{k−1} → S_k on P_1: data sets 1, 4, 7, ...
             S_k on P_2: data sets 2, 5, 8, ...
             S_k on P_3: data sets 3, 6, 9, ...  → S_{k+1} ...

S_{k+1} may be monolithic: the output order must be respected. A round-robin rule ensures the output order; as a consequence, fast processors cannot be fed more data sets than slow ones, so replication is most efficient with similar-speed processors.
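A minimal sketch of the round-robin rule (my own illustration): with q replicas, data set number d (1-based) goes to replica (d − 1) mod q, which reproduces the assignment pictured above and lets S_{k+1} consume results in order.

```python
# Round-robin distribution of data sets over q replicas of stage S_k.
# Data set d (1-based) is processed by replica (d - 1) % q, so replicas
# emit their results in the order 1, 2, 3, ... expected by S_{k+1}.
q = 3
for d in range(1, 10):
    print(f"data set {d} -> S_k on P{(d - 1) % q + 1}")
# P1 gets 1, 4, 7, ...; P2 gets 2, 5, 8, ...; P3 gets 3, 6, 9, ...
```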

Data-parallelism

Data-parallelize stage S_k (w = 16) on P_1 (s_1 = 2), P_2 (s_2 = 1), P_3 (s_3 = 1): with perfect sharing of the work, each data set is processed in 16/(2 + 1 + 1) = 4 time units.

- Perfect sharing of the work
- Only single stages can be data-parallelized
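The perfect-sharing assumption reduces to one line of arithmetic; this small sketch (illustrative only) rederives the slide's numbers for w = 16 on speeds (2, 1, 1).

```python
# Perfect sharing of one data set across processors of speeds s_u:
# processing time t = w / (sum of speeds); processor u performs s_u * t work.
w, speeds = 16, [2, 1, 1]
t = w / sum(speeds)                     # 16 / 4 = 4.0 time units
chunks = [s_u * t for s_u in speeds]    # [8.0, 4.0, 4.0] units of work
print(t, chunks)
```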

Interval Mapping for pipeline graphs

Map several consecutive stages onto the same processor: this increases the computational load but reduces communications.

Partition [1..n] into m intervals I_j = [d_j, e_j] (with d_j ≤ e_j for 1 ≤ j ≤ m, d_1 = 1, d_{j+1} = e_j + 1 for 1 ≤ j ≤ m − 1, and e_m = n). Interval I_j is mapped onto processor P_{alloc(j)}.

$T_{\mathrm{period}} = \max_{1 \le j \le m} \frac{\sum_{i=d_j}^{e_j} w_i}{s_{\mathrm{alloc}(j)}}$

$T_{\mathrm{latency}} = \sum_{1 \le j \le m} \frac{\sum_{i=d_j}^{e_j} w_i}{s_{\mathrm{alloc}(j)}}$
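As a sanity check on the two formulas, here is a short sketch (an illustration; the function name and data are placeholders) that evaluates T_period and T_latency for a given interval mapping.

```python
def period_latency(w, s, intervals, alloc):
    """w: stage workloads; s: processor speeds;
    intervals: list of (d_j, e_j), 1-based inclusive, partitioning [1..n];
    alloc: for each interval, the index of its processor in s."""
    times = [sum(w[d - 1:e]) / s[alloc[j]] for j, (d, e) in enumerate(intervals)]
    return max(times), sum(times)   # (T_period, T_latency)

# example: 4 stages split as [S1, S2] and [S3, S4] on processors of speeds 2 and 1
print(period_latency([4, 3, 2, 1], [2, 1], [(1, 2), (3, 4)], [0, 1]))
# -> (3.5, 6.5)
```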

Replication and data-parallelism

No data-parallelism overheads. Cost to execute S_i alone on P_u: $w_i / s_u$.

Cost to data-parallelize [S_i, S_j] (i = j for pipeline; 0 < i ≤ j or i = j = 0 for fork) on k processors P_{q_1}, ..., P_{q_k}:

$\frac{\sum_{l=i}^{j} w_l}{\sum_{u=1}^{k} s_{q_u}}$

This cost is both the T_period of the assigned processors and the delay to traverse the interval.

Cost to replicate [S_i, S_j] on k processors P_{q_1}, ..., P_{q_k}:

$\frac{\sum_{l=i}^{j} w_l}{k \cdot \min_{1 \le u \le k} s_{q_u}}$

This cost is the T_period of the assigned processors. The delay to traverse the interval is the time needed by the slowest processor:

$t_{\max} = \frac{\sum_{l=i}^{j} w_l}{\min_{1 \le u \le k} s_{q_u}}$

With these formulas, it is easy to compute T_period and T_latency for pipeline graphs.
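The three costs just defined fit in a few lines of code; this sketch (illustrative, with hypothetical helper names) is reused in the example of the next section.

```python
def dp_cost(W, speeds):
    """Data-parallelized interval of total work W: period = delay = W / sum(speeds)."""
    return W / sum(speeds)

def rep_period(W, speeds):
    """Replicated interval: the round-robin is paced by the slowest processor."""
    return W / (len(speeds) * min(speeds))

def rep_delay(W, speeds):
    """Delay to traverse a replicated interval = time on the slowest processor."""
    return W / min(speeds)

print(dp_cost(16, [2, 1, 1]), rep_period(10, [1, 1]), rep_delay(10, [1, 1]))
# -> 4.0 5.0 10.0
```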


Working out an example

Pipeline S_1 → S_2 → S_3 → S_4 with interval mapping on 4 processors, s_1 = 2 and s_2 = s_3 = s_4 = 1. The stage workloads satisfy w_1 = 14, w_2 + w_3 = 6, w_4 = 4 (only these values matter for the computations below).

Optimal period? T_period = 7: S_1 → P_1, S_2 S_3 → P_2, S_4 → P_3 (T_latency = 17).
Optimal latency? T_latency = 12: S_1 S_2 S_3 S_4 → P_1 (T_period = 12).
Minimum latency if T_period ≤ 10? T_latency = 14: S_1 S_2 S_3 → P_1, S_4 → P_2.
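The three answers can be verified by exhaustive search over interval mappings. The sketch below assumes the split w = [14, 4, 2, 4]: only w_2 + w_3 = 6 is fixed by the slide, so this particular split is an assumption consistent with all the numbers above.

```python
from itertools import combinations, permutations

w = [14, 4, 2, 4]   # assumed split of w2 + w3 = 6; w1 = 14 and w4 = 4 as above
s = [2, 1, 1, 1]

def interval_partitions(n):
    """All partitions of stages 1..n into consecutive intervals (index bounds)."""
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            b = [0, *cuts, n]
            yield [(b[i], b[i + 1]) for i in range(len(b) - 1)]

def all_mappings():
    for part in interval_partitions(len(w)):
        for procs in permutations(range(len(s)), len(part)):
            times = [sum(w[a:b]) / s[p] for (a, b), p in zip(part, procs)]
            yield max(times), sum(times)        # (T_period, T_latency)

M = list(all_mappings())
print(min(T for T, L in M))                 # optimal period: 7.0
print(min(L for T, L in M))                 # optimal latency: 12.0
print(min(L for T, L in M if T <= 10))      # min latency with T_period <= 10: 14.0
```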

Example with replication and data-parallelism

Same pipeline, interval mapping, 4 processors, s_1 = 2 and s_2 = s_3 = s_4 = 1.

Optimal period?
- S_1 data-parallelized on P_1 P_2, S_2 S_3 S_4 replicated on P_3 P_4:
  T_period = max(14/(2+1), 10/(2·1)) = 5, T_latency = 14/3 + 10 ≈ 14.67.
- S_1 data-parallelized on P_2 P_3 P_4, S_2 S_3 S_4 → P_1:
  T_period = max(14/(1+1+1), 10/2) = 5, T_latency = 14/3 + 5 ≈ 9.67 (optimal).
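A short numeric check of the two mappings, using the replication and data-parallelism costs defined in the framework section (an illustrative sketch):

```python
dp   = lambda W, speeds: W / sum(speeds)                    # data-parallel cost
repP = lambda W, speeds: W / (len(speeds) * min(speeds))    # replication period
repD = lambda W, speeds: W / min(speeds)                    # replication delay

# mapping 1: S1 data-parallel on P1, P2 (speeds 2, 1); S2 S3 S4 replicated on P3, P4
print(max(dp(14, [2, 1]), repP(10, [1, 1])))    # T_period  = 5.0
print(dp(14, [2, 1]) + repD(10, [1, 1]))        # T_latency ~ 14.67

# mapping 2: S1 data-parallel on P2, P3, P4; S2 S3 S4 alone on P1 (speed 2)
print(max(dp(14, [1, 1, 1]), 10 / 2))           # T_period  = 5.0
print(dp(14, [1, 1, 1]) + 10 / 2)               # T_latency ~ 9.67
```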


Complexity results

Setting: pipeline and fork graphs; no communications; Homogeneous or Heterogeneous platforms; Interval Mapping only; replicable stages, with or without data-parallelism; bi-criteria optimization.

Complexity results: without data-parallelism, Homogeneous platforms

                 period        latency       bi-criteria
Hom. pipeline    -             -             -
Het. pipeline    Poly (str)    Poly (str)    Poly (str)
Hom. fork        -             -             Poly (DP)
Het. fork        Poly (str)    Poly (str)    NP-hard

str = straightforward (map everything on the same proc...); DP = dynamic programming; * = interesting case

Complexity results: with data-parallelism, Homogeneous platforms

                 period        latency       bi-criteria
Hom. pipeline    -             -             -
Het. pipeline    Poly (DP)     Poly (DP)     Poly (DP)
Hom. fork        -             -             Poly (DP)
Het. fork        Poly (str)    Poly (str)    NP-hard

Complexity results: without data-parallelism, Heterogeneous platforms

                 period         latency       bi-criteria
Hom. pipeline    Poly (*)       -             Poly (*)
Het. pipeline    NP-hard (**)   Poly (str)    NP-hard
Hom. fork        Poly (*)       Poly (*)      Poly (*)
Het. fork        NP-hard        NP-hard       -

Complexity results: with data-parallelism, Heterogeneous platforms

                 period      latency     bi-criteria
Hom. pipeline    NP-hard     NP-hard     NP-hard
Het. pipeline    -           -           -
Hom. fork        NP-hard     NP-hard     NP-hard
Het. fork        -           -           -

Most interesting case: without data-parallelism on Heterogeneous platforms (the corresponding table above).

No data-parallelism, Heterogeneous platforms

For pipeline graphs, minimizing the latency is straightforward: map all stages on the fastest processor. Minimizing the period is NP-hard for a general pipeline (an involved reduction, similar to the heterogeneous chain-to-chain one). Homogeneous pipeline (all stages have the same workload w): polynomial complexity in this case, and a polynomial bi-criteria algorithm exists.

Lemma: form of the solution

Pipeline, no data-parallelism, Heterogeneous platform.

Lemma. If an optimal solution which minimizes the pipeline period uses q processors, consider the q fastest processors P_1, ..., P_q, ordered by non-decreasing speeds: s_1 ≤ ... ≤ s_q. There exists an optimal solution which replicates intervals of stages onto k intervals of processors I_r = [P_{d_r}, P_{e_r}], with 1 ≤ r ≤ k ≤ q, d_1 = 1, e_k = q, and e_r + 1 = d_{r+1} for 1 ≤ r < k.

Proof: exchange argument, which does not increase the latency.

Binary-search / dynamic-programming algorithm

Given a latency L and a period K:
- loop on the number of processors q
- dynamic-programming algorithm to minimize the latency
- success if L is obtained

Binary search on L to minimize the latency for a fixed period; binary search on K to minimize the period for a fixed latency.

Dynamic programming algorithm

Compute L(n, 1, q), where L(m, i, j) = minimum latency to map m pipeline stages on processors P_i to P_j, while fitting in period K:

$L(m, i, j) = \min \begin{cases} \frac{m \cdot w}{s_i} & \text{if } \frac{m \cdot w}{(j-i+1) \cdot s_i} \le K \quad (1) \\ \min_{1 \le m' < m, \; i \le k < j} \; L(m', i, k) + L(m - m', k + 1, j) \quad (2) \end{cases}$

Case (1): replicating the m stages onto processors P_i, ..., P_j. Case (2): splitting the interval.

Initialization:
$L(1, i, j) = w / s_i$ if $\frac{w}{(j-i+1) \cdot s_i} \le K$, $+\infty$ otherwise;
$L(m, i, i) = m \cdot w / s_i$ if $m \cdot w / s_i \le K$, $+\infty$ otherwise.

Complexity of the dynamic programming: $O(n^2 \cdot p^4)$. The number of iterations of the binary search is formally bounded, and very small in practice.
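A direct transcription of the recurrence and of the binary search might look as follows (a sketch under the slide's assumptions: identical stage workloads w, speeds sorted in non-decreasing order; the function names are mine, not the paper's).

```python
from functools import lru_cache

def min_latency(n, w, speeds, K):
    """Minimum latency to map n stages of workload w each, with period at most K,
    over intervals of the fastest processors (speeds sorted non-decreasingly)."""
    p = len(speeds)
    INF = float("inf")

    @lru_cache(maxsize=None)
    def L(m, i, j):                      # m stages on processors P_i..P_j (1-based)
        best = INF
        # case (1): replicate the m stages on the j - i + 1 processors P_i..P_j;
        # the slowest one (P_i) paces the round-robin and fixes the delay
        if m * w / ((j - i + 1) * speeds[i - 1]) <= K:
            best = m * w / speeds[i - 1]
        # case (2): split both the stage interval and the processor interval
        for mp in range(1, m):
            for k in range(i, j):
                best = min(best, L(mp, i, k) + L(m - mp, k + 1, j))
        return best

    # loop on the number q of (fastest) processors actually used
    return min(L(n, p - q + 1, p) for q in range(1, p + 1))

def min_period(n, w, speeds, latency_bound):
    """Binary search on K: smallest period whose optimal latency meets the bound
    (returns the upper end of the search range if the bound is unreachable)."""
    lo, hi = 0.0, n * w / min(speeds)
    for _ in range(60):                  # fixed-precision search, monotone predicate
        mid = (lo + hi) / 2
        if min_latency(n, w, speeds, mid) <= latency_bound:
            hi = mid
        else:
            lo = mid
    return hi

print(min_latency(4, 1, [1, 1, 2], K=2))   # toy run -> 2.0 (all stages on P_3)
```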


Conclusion

Theoretical side: complexity results for several cases; a solid theoretical foundation for the study of single- and bi-criteria mappings, with the possibility to replicate and data-parallelize application stages.
Practical side: optimal polynomial algorithms; some heuristics for particular cases (stay for the next talk).
Future work: heuristics based on our polynomial algorithms for general application graphs structured as combinations of pipeline and fork kernels; lots of open problems.

Related work

- Subhlok and Vondran: we extend their work (pipeline on homogeneous platforms).
- Chains-to-chains: our work adds the possibility to replicate or data-parallelize.
- Mapping pipelined computations onto clusters and grids: DAG [Taura et al.], DataCutter [Saltz et al.].
- Energy-aware mapping of pipelined computations: [Melhem et al.], three-criteria optimization.
- Mapping pipelined computations onto special-purpose architectures: FPGA arrays [Fabiani et al.]; fault-tolerance for embedded systems [Zhu et al.].
- Mapping skeletons onto clusters and grids: use of stochastic process algebra [Benoit et al.].
