Mapping pipeline skeletons onto heterogeneous platforms


Mapping pipeline skeletons onto heterogeneous platforms
Anne Benoit and Yves Robert
GRAAL team, LIP, École Normale Supérieure de Lyon
January 2007

Introduction and motivation
Mapping applications onto parallel platforms is a difficult challenge, and heterogeneous clusters or fully heterogeneous platforms make it even more difficult. A structured programming approach is easier to program (no deadlocks, no process starvation) and offers a range of well-known paradigms (pipeline, farm). Algorithmic skeletons help with the mapping.

Why restrict to pipelines?
Chains-on-chains partitioning problem:
- no communications
- identical processors
Extensions (done):
- with communications
- with heterogeneous processors/links
- goal: assess complexity, design heuristics
Extensions (current work):
- deal with DEALs
- deal with DAGs

Chains-on-chains
Load-balance contiguous tasks. With p = 4 processors?
Back to the Bokhari and Iqbal partitioning papers; see the survey by Pinar and Aykanat, JPDC 64, 8 (2004).
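The basic chains-on-chains problem (contiguous tasks, identical processors, no communications) can be solved exactly by a parametric search: bisect on the bottleneck value and check feasibility greedily. A minimal sketch with illustrative names, assuming integer task weights:

```python
def feasible(weights, p, period):
    # Greedy check: sweep left to right, opening a new block
    # whenever adding the next task would exceed the period.
    blocks, current = 1, 0
    for x in weights:
        if x > period:
            return False
        if current + x > period:
            blocks, current = blocks + 1, x
        else:
            current += x
    return blocks <= p

def chains_on_chains(weights, p):
    # Bisect on the bottleneck value (integer weights assumed);
    # the answer lies between max(weights) and sum(weights).
    lo, hi = max(weights), sum(weights)
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(weights, p, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo
```

For example, `chains_on_chains([4, 1, 3, 2, 6, 1, 3], 4)` returns 6: the task of weight 6 forces the bottleneck, and a 4-block partition achieving it exists.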

Rule of the game
Map each pipeline stage on a single processor (no deals). Goal: minimize execution time.
Several mapping strategies for the pipeline application S_1, S_2, ..., S_k, ..., S_n: One-to-one Mapping, Interval Mapping, General Mapping.

Major contributions
Theory: formal approach to the problem; problem complexity; integer linear program for exact resolution.
Practice: heuristics for Interval Mapping on clusters; experiments to compare the heuristics and evaluate their absolute performance.

Outline
1 Framework
2 Complexity results
3 Heuristics
4 Experiments
5 Linear programming formulation
6 Conclusion

The application
n stages S_k, 1 ≤ k ≤ n. Stage S_k:
- receives an input of size δ_{k-1} from S_{k-1}
- performs w_k computations
- outputs data of size δ_k to S_{k+1}
S_0 and S_{n+1}: virtual stages representing the outside world.

The platform
p processors P_u, 1 ≤ u ≤ p, fully interconnected.
- s_u: speed of processor P_u
- bidirectional link link_{u,v}: P_u → P_v, of bandwidth b_{u,v}
- one-port model: each processor can either send, receive or compute at any time-step
- P_in: input data; P_out: output data

Different platforms
- Fully Homogeneous: identical processors (s_u = s) and links (b_{u,v} = b): typical parallel machines
- Communication Homogeneous: different-speed processors (s_u ≠ s_v), identical links (b_{u,v} = b): networks of workstations, clusters
- Fully Heterogeneous: different-speed processors and links, s_u ≠ s_v and b_{u,v} ≠ b_{u',v'}: hierarchical platforms, grids

Mapping problem: One-to-one Mapping
Requires n ≤ p: map each stage S_k onto a distinct processor P_alloc(k).
Period of P_alloc(k): minimum delay between the processing of two consecutive tasks.
With t = alloc(k-1), u = alloc(k), v = alloc(k+1), the cycle-time of P_u is
cycle_u = δ_{k-1} / b_{t,u} + w_k / s_u + δ_k / b_{u,v}
Optimization problem: find the allocation function alloc : [1, n] → [1, p] which minimizes
T_period = max_{1 ≤ k ≤ n} cycle_alloc(k)
(with alloc(0) = in and alloc(n+1) = out).
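The cycle-time formula is straightforward to evaluate for a given allocation. A small sketch, under assumed conventions (1-based stage arrays with a dummy entry at index 0; alloc given as a dict including the virtual 'in'/'out' endpoints; these names are illustrative, not from the paper):

```python
def one_to_one_period(delta, w, s, b, alloc):
    # Period of a one-to-one mapping on a fully heterogeneous platform.
    # delta[k]: data size between stage k and k+1 (delta[0] = input,
    # delta[n] = output); w[k]: work of stage k (w[0] unused);
    # s[u]: speed of processor u; b[(t, u)]: bandwidth of link t -> u;
    # alloc[k]: processor of stage k, alloc[0] = 'in', alloc[n+1] = 'out'.
    n = len(w) - 1
    cycles = []
    for k in range(1, n + 1):
        t, u, v = alloc[k - 1], alloc[k], alloc[k + 1]
        cycles.append(delta[k - 1] / b[(t, u)]
                      + w[k] / s[u]
                      + delta[k] / b[(u, v)])
    return max(cycles)
```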

Mapping problem: Interval Mapping
Several consecutive stages onto the same processor: increases the computational load but reduces communications; mandatory when p < n.
Partition [1..n] into m intervals I_j = [d_j, e_j] (with d_j ≤ e_j for 1 ≤ j ≤ m, d_1 = 1, d_{j+1} = e_j + 1 for 1 ≤ j ≤ m-1, and e_m = n). Interval I_j is mapped onto processor P_alloc(j), and
T_period = max_{1 ≤ j ≤ m} { δ_{d_j - 1} / b_{alloc(j-1),alloc(j)} + (Σ_{i=d_j}^{e_j} w_i) / s_alloc(j) + δ_{e_j} / b_{alloc(j),alloc(j+1)} }
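On a Communication Homogeneous platform (a single link bandwidth b), the period formula above reduces to a simple evaluation. A sketch with illustrative names, where each interval is a (d, e, speed) triple covering stages 1..n in order:

```python
def interval_period(delta, w, b, intervals):
    # T_period of an interval mapping with homogeneous links:
    # input of the interval, its total work on its processor,
    # and its output; delta[k] is the data size out of stage k
    # (delta[0] = input), w is 1-based with a dummy w[0].
    return max(delta[d - 1] / b + sum(w[d:e + 1]) / s + delta[e] / b
               for d, e, s in intervals)
```

A One-to-one Mapping is the special case where every interval has length 1.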

Mapping problem: General Mapping
Does not suit the one-port model very well: on Communication Homogeneous platforms, a General Mapping can always be replaced by an Interval Mapping that is as good. It can be the optimal mapping on Fully Heterogeneous platforms in some particular cases. More general, but requires threads, and may lead to idle times and races with the one-port model.

Complexity results

                     Fully Hom.    Comm. Hom.
One-to-one Mapping   polynomial    polynomial
Interval Mapping     polynomial    NP-complete
General Mapping      same complexity as Interval Mapping

- Binary search polynomial algorithm for One-to-one Mapping
- Dynamic programming algorithm for Interval Mapping on Homogeneous platforms
- General Mapping: same complexity as Interval Mapping
- All problem instances are NP-complete on Fully Heterogeneous platforms

Back to chains-on-chains
- Chains-on-chains + homogeneous communications: polynomial
- Chains-on-chains + different-speed processors: NP-complete

One-to-one / Comm. Hom.: binary search algorithm
Greedy assignment procedure for a prescribed period T_period:
- Work with the fastest n processors, numbered P_1 to P_n, where s_1 ≤ s_2 ≤ ... ≤ s_n
- Mark all stages S_1 to S_n as free
- For u = 1 to n: pick any free stage S_k s.t. δ_{k-1}/b + w_k/s_u + δ_k/b ≤ T_period, assign S_k to P_u, and mark S_k as already assigned; if no stage is found, return failure
Proof: exchange argument.
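The greedy check above, wrapped in a binary search over the finitely many candidate periods (the n·p possible cycle-times), yields the optimal One-to-one Mapping. A sketch under the same assumed conventions as before (1-based stage arrays with a dummy entry at index 0; illustrative names):

```python
def fits(delta, w, b, k, s, period):
    # Cycle-time of stage k on a processor of speed s (hom. links b).
    return delta[k - 1] / b + w[k] / s + delta[k] / b <= period

def feasible(delta, w, speeds, b, period):
    # Greedy check from the slide: the fastest n processors, taken
    # slowest first, each grab any still-free stage that fits.
    n = len(w) - 1
    if len(speeds) < n:
        return False
    free = set(range(1, n + 1))
    for s in sorted(speeds)[-n:]:
        fit = [k for k in sorted(free) if fits(delta, w, b, k, s, period)]
        if not fit:
            return False
        free.remove(fit[0])
    return True

def optimal_one_to_one(delta, w, speeds, b):
    # The optimal period is one of the n*p possible cycle-times,
    # so a binary search over that sorted set suffices.
    n = len(w) - 1
    cands = sorted({delta[k - 1] / b + w[k] / s + delta[k] / b
                    for k in range(1, n + 1) for s in speeds})
    lo, hi, best = 0, len(cands) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if feasible(delta, w, speeds, b, cands[mid]):
            best, hi = cands[mid], mid - 1
        else:
            lo = mid + 1
    return best
```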

Interval, Fully Hom.: dynamic programming algorithm
c(i, j, k): optimal period to map stages S_i to S_j using exactly k processors. Goal: min_{1 ≤ k ≤ p} c(1, n, k).
c(i, j, k) = min_{q + r = k, 1 ≤ q, r ≤ k-1} min_{i ≤ l ≤ j-1} max( c(i, l, q), c(l+1, j, r) )
c(i, j, 1) = δ_{i-1}/b + (Σ_{k=i}^{j} w_k)/s + δ_j/b
c(i, j, k) = +∞ if k > j - i + 1
Proof: search over all possible partitionings into two subintervals, using every possible number of processors for each interval. Complexity: O(n³ p²).
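The recurrence transcribes directly into a memoized function (illustrative names; 1-based w with a dummy w[0]); the loops over i, j, l and the processor counts give the O(n³p²) bound:

```python
from functools import lru_cache

def interval_dp(delta, w, s, b, p):
    # DP from the slide, Fully Homogeneous case (speed s, bandwidth b);
    # returns the optimal interval-mapping period for stages 1..n.
    n = len(w) - 1

    @lru_cache(maxsize=None)
    def c(i, j, k):
        if k > j - i + 1:              # more processors than stages
            return float("inf")
        if k == 1:                     # whole interval on one processor
            return delta[i - 1] / b + sum(w[i:j + 1]) / s + delta[j] / b
        # split into two subintervals using q and k - q processors
        return min(max(c(i, l, q), c(l + 1, j, k - q))
                   for q in range(1, k)
                   for l in range(i, j))

    return min(c(1, n, k) for k in range(1, p + 1))
```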

Heterogeneous platforms: NP-complete
Reduction from MINIMUM METRIC BOTTLENECK WANDERING SALESPERSON PROBLEM.
[Figure: reduction gadget, a platform built from P_in, processors P_1..P_4 and P_{u,v}, and P_out, with link costs derived from the distances d(c_i, c_j).]

Greedy heuristics (1/2)
Target clusters: Communication Homogeneous platforms and Interval Mapping.
L = ⌈n/p⌉ consecutive stages per processor: the set of intervals is fixed, and ⌈n/L⌉ processors are used. One-to-one Mapping when n ≤ p (L = 1).
This rule is applied in all greedy heuristics except random interval length.

Greedy heuristics (2/2)
- H1a-GR (random): random choice of a free processor for each interval
- H1b-GRIL (random interval length): idem with random interval sizes: average length L, 1 ≤ length ≤ 2L-1
- H2-GSW (biggest w): place the interval with the most computations on the fastest processor
- H3-GSD (biggest δ_in + δ_out): intervals sorted by communications (δ_in + δ_out); in: first stage of the interval, out-1: last one
- H4-GP (biggest period on fastest processor): balance computation and communication: processors sorted by decreasing speed s_u; for the current processor u, choose the interval with the biggest period (δ_in + δ_out)/b + Σ_{i ∈ Interval} w_i / s_u
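As an illustration, a sketch in the spirit of H2-GSW under the fixed-interval rule L = ⌈n/p⌉ (names and data layout are assumptions, not the authors' code): the intervals are fixed, then matched to processors by decreasing work against decreasing speed.

```python
import math

def greedy_sum_w(w, speeds):
    # H2-GSW-style sketch: cut stages 1..n into consecutive intervals
    # of length L = ceil(n/p); the interval with the most computation
    # goes to the fastest processor, and so on.  w is 1-based with a
    # dummy w[0]; returns {(d, e): processor index}.
    n, p = len(w) - 1, len(speeds)
    L = math.ceil(n / p)
    intervals = [(d, min(d + L - 1, n)) for d in range(1, n + 1, L)]
    by_work = sorted(intervals, key=lambda iv: -sum(w[iv[0]:iv[1] + 1]))
    by_speed = sorted(range(p), key=lambda u: -speeds[u])
    return {iv: by_speed[i] for i, iv in enumerate(by_work)}
```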

Sophisticated heuristics
- H5-BS121 (binary search for One-to-one Mapping): the optimal algorithm for One-to-one Mapping; when p < n, the application is cut into fixed intervals of length L.
- H6-SPL (splitting intervals): processors sorted by decreasing speed, all stages on the first processor. At each step, select the used processor j with the largest period and split its interval, giving a fraction of its stages to a free processor j', so as to minimize max(period(j), period(j')); split only if the maximum period improves.
- H7a-BSL and H7b-BSC (binary search, longest/closest): binary search on the period P: starting from stage s = 1, build intervals (s, s') fitting on processors. For each u and each s' ≥ s, compute period(s..s', u) and check whether it is smaller than P. H7a maximizes s'; H7b chooses the closest period.
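A simplified sketch of the splitting idea behind H6-SPL for the Communication Homogeneous case (illustrative names; as a simplification of the heuristic described above, it only ever splits the worst interval and only tries the next fastest free processor):

```python
def period_of(delta, w, b, d, e, s):
    # Period of interval [d, e] on a processor of speed s (hom. links).
    return delta[d - 1] / b + sum(w[d:e + 1]) / s + delta[e] / b

def splitting(delta, w, speeds, b):
    # All stages start on the fastest processor; repeatedly split the
    # interval with the largest period, giving its tail to the next
    # fastest unused processor, as long as the maximum period improves.
    n = len(w) - 1
    order = sorted(speeds, reverse=True)
    assign = [[1, n, order[0]]]          # mutable [d, e, speed] triples
    unused = order[1:]
    while unused:
        worst = max(assign, key=lambda a: period_of(delta, w, b, *a))
        d, e, s = worst
        if d == e:                       # single stage: cannot split
            break
        s2 = unused[0]
        best = min((max(period_of(delta, w, b, d, l, s),
                        period_of(delta, w, b, l + 1, e, s2)), l)
                   for l in range(d, e))
        if best[0] >= period_of(delta, w, b, d, e, s):
            break                        # no improvement: stop
        worst[1] = best[1]
        assign.append([best[1] + 1, e, s2])
        unused.pop(0)
    return max(period_of(delta, w, b, *a) for a in assign)
```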

Plan of experiments
- Assess the performance of the polynomial heuristics
- Random applications, n = 1 to 50 stages
- Random platforms, p = 10 and p = 100 processors
- b = 10 (comm. hom.), processor speed between 1 and 20
- Relevant parameters: the ratios δ/b and w/s
- Average over 100 similar random application/platform pairs

Experiment 1 - balanced comm/comp, hom comm
δ_i = 10, computation time between 1 and 20; p = 10 and p = 100 processors.
[Plots: maximum period vs. number of stages, for p = 10 and p = 100, comparing H1a-GreedyRandom, H1b-GreedyRandomIntervalLength, H2-GreedySumW, H3-GreedySumDinDout, H4-GreedyPeriod, H5-BinarySearch1to1, H6-SPLitting, H7a-BinarySearchLongest, H7b-BinarySearchClosest.]

Experiment 2 - balanced comm/comp, het comm
Communication time between 1 and 100, computation time between 1 and 20.
[Plots: maximum period vs. number of stages for the nine heuristics, p = 10 and p = 100.]

Experiment 3 - large computations
Communication time between 1 and 20, computation time between 10 and 1000.
[Plots: maximum period vs. number of stages for the nine heuristics, p = 10 and p = 100.]

Experiment 4 - small computations
Communication time between 1 and 20, computation time between 0.01 and 10.
[Plots: maximum period vs. number of stages for the nine heuristics, p = 10 and p = 100.]

Summary of experiments
Much more efficient than random mappings. Three dominant heuristics for different cases:
- Insignificant communications (hom. or small) and many processors: H5-BS121 (One-to-one Mapping)
- Insignificant communications (hom. or small) and few processors: H7b-BSC (clever choice of where to split)
- Important communications (het. or big): H6-SPL (splitting choice relevant for any number of processors)

Integer linear programming
Integer LP to solve Interval Mapping on Fully Heterogeneous platforms. Many integer variables, so there is no efficient algorithm to solve it: the approach is limited to small problem instances, for which it yields the absolute performance of the heuristics.

Linear program: variables
- x_{k,u}: 1 if S_k is on P_u, 0 otherwise
- y_{k,u}: 1 if S_k and S_{k+1} are both on P_u, 0 otherwise
- z_{k,u,v}: 1 if S_k is on P_u and S_{k+1} is on P_v, 0 otherwise
- first_u and last_u: integers denoting the first and last stage assigned to P_u (to enforce interval constraints)
- T_period: period of the pipeline
Objective function: minimize T_period.

Linear program: constraints
- ∀k ∈ [0..n+1]: Σ_u x_{k,u} = 1
- ∀k ∈ [0..n]: Σ_{u≠v} z_{k,u,v} + Σ_u y_{k,u} = 1
- ∀k ∈ [0..n], ∀u, v ∈ [1..p] ∪ {in, out}, u ≠ v: x_{k,u} + x_{k+1,v} ≤ 1 + z_{k,u,v}
- ∀k ∈ [0..n], ∀u ∈ [1..p] ∪ {in, out}: x_{k,u} + x_{k+1,u} ≤ 1 + y_{k,u}
- ∀k ∈ [1..n], ∀u ∈ [1..p]: first_u ≤ k · x_{k,u} + n · (1 − x_{k,u})
- ∀k ∈ [1..n], ∀u ∈ [1..p]: last_u ≥ k · x_{k,u}
- ∀k ∈ [1..n−1], ∀u, v ∈ [1..p], u ≠ v: last_u ≤ k · z_{k,u,v} + n · (1 − z_{k,u,v})
- ∀k ∈ [1..n−1], ∀u, v ∈ [1..p], u ≠ v: first_v ≥ (k+1) · z_{k,u,v}
- ∀u ∈ [1..p]: Σ_{k=1}^{n} ( Σ_{t≠u} δ_{k−1} z_{k−1,t,u} / b_{t,u} + w_k x_{k,u} / s_u + Σ_{v≠u} δ_k z_{k,u,v} / b_{u,v} ) ≤ T_period

Linear program: experiments
O(np²) variables and as many constraints: experiments only on small problem instances, averaged over 10 instances of each application, solved with GLPK. Largest experiment: p = 8, n = 4, a 14-hour computation time. Parameters similar to Experiment 1: homogeneous communications and balanced comm/comp.
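For instances as small as those the LP can handle, the exact Interval Mapping optimum on a Communication Homogeneous platform can also be obtained by brute force, which is another way to measure the heuristics' absolute performance. A sketch with illustrative names (1-based w with a dummy w[0]):

```python
from itertools import combinations, permutations

def exact_interval_mapping(delta, w, speeds, b):
    # Exhaustive optimum for Interval Mapping with homogeneous links:
    # enumerate every partition of stages 1..n into m contiguous
    # intervals and every choice of m distinct processors.
    # Only viable for tiny n and p.
    n, p = len(w) - 1, len(speeds)
    best = float("inf")
    for m in range(1, min(n, p) + 1):
        for cuts in combinations(range(1, n), m - 1):
            bounds = [0, *cuts, n]   # interval j = stages bounds[j]+1 .. bounds[j+1]
            for chosen in permutations(speeds, m):
                period = max(delta[bounds[j]] / b
                             + sum(w[bounds[j] + 1:bounds[j + 1] + 1]) / chosen[j]
                             + delta[bounds[j + 1]] / b
                             for j in range(m))
                best = min(best, period)
    return best
```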

Linear program: experiment p = 8
[Plot: maximum period vs. number of stages (p = 8) for the Linear Program and heuristics H3-GreedySumDinDout, H5-BinarySearch1to1, H6-SPLitting, H7a-BinarySearchLongest, H7b-BinarySearchClosest.]

Linear program: experiment p = 8
[Table: period obtained by the LP vs. H5-BS121 and H7b-BSC for each value of n; numerical values not recovered.]

Linear program: experiment p = 4
Homogeneous communications (Experiment 1).
[Plot: maximum period vs. number of stages (p = 4) for the Linear Program and heuristics H3, H5, H6, H7a, H7b.]
H7b very close to the optimal (< 3% error).

Linear program: experiment p = 4
Heterogeneous communications (Experiment 2).
[Plot: maximum period vs. number of stages (p = 4) for the Linear Program and heuristics H3, H5, H6, H7a, H7b.]
H6 very close to the optimal (< 0.05% error).

Related work
- Scheduling task graphs on heterogeneous platforms: acyclic task graphs scheduled on different-speed processors [Topcuoglu et al.]; communication contention under the 1-port model [Beaumont et al.]
- Mapping pipelined computations onto special-purpose architectures: FPGA arrays [Fabiani et al.]; fault-tolerance for embedded systems [Zhu et al.]
- Mapping pipelined computations onto clusters and grids: DAG [Taura et al.], DataCutter [Saltz et al.]
- Mapping skeletons onto clusters and grids: use of stochastic process algebra [Benoit et al.]

86 Conclusion
Theoretical side: complexity for different mapping strategies and different platform types
Practical side:
Optimal polynomial algorithm for One-to-one Mapping
Design of several heuristics for Interval Mapping on Communication-Homogeneous platforms
Comparison of their performance
Linear program to assess the absolute performance of the heuristics, which turns out to be quite good
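The optimal One-to-one Mapping algorithm in the paper handles communication costs via a binary search; when communications are negligible, the optimal pairing reduces to a simple greedy rule, pairing the largest stage with the fastest processor. A hedged sketch under that assumption (no communication costs, at most one stage per processor, n ≤ p):

```python
def one_to_one_period(work, speeds):
    """Optimal one-to-one mapping when communications are negligible:
    pair the largest stage with the fastest processor (requires n <= p)."""
    if len(work) > len(speeds):
        raise ValueError("one-to-one mapping needs a processor per stage")
    # Sort both descending; the slowest stage/processor pair sets the period.
    pairs = zip(sorted(work, reverse=True), sorted(speeds, reverse=True))
    return max(w / s for w, s in pairs)

# 4 stages on 4 processors of speeds 1, 2, 2 and 4
print(one_to_one_period([4, 3, 2, 1], [1, 2, 2, 4]))  # -> 1.5
```

Optimality of the greedy pairing follows from a standard exchange argument: swapping any two stages between their processors can only increase the larger of the two stage/speed ratios.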

87 Future work
Short term:
Heuristics for Fully Heterogeneous platforms
Extension to DAG-trees (a DAG which is a tree when un-oriented)
Extension to stage replication
LP with replication and DAG-trees
Longer term:
Real experiments on heterogeneous clusters, using an already-implemented skeleton library and MPI
Comparison of effective performance against theoretical performance


More information

Greedy Algorithms. T. M. Murali. January 28, Interval Scheduling Interval Partitioning Minimising Lateness

Greedy Algorithms. T. M. Murali. January 28, Interval Scheduling Interval Partitioning Minimising Lateness Greedy Algorithms T. M. Murali January 28, 2008 Algorithm Design Start discussion of dierent ways of designing algorithms. Greedy algorithms, divide and conquer, dynamic programming. Discuss principles

More information

Bi-Objective Optimization for Scheduling in Heterogeneous Computing Systems

Bi-Objective Optimization for Scheduling in Heterogeneous Computing Systems Bi-Objective Optimization for Scheduling in Heterogeneous Computing Systems Tony Maciejewski, Kyle Tarplee, Ryan Friese, and Howard Jay Siegel Department of Electrical and Computer Engineering Colorado

More information

CSC 373: Algorithm Design and Analysis Lecture 3

CSC 373: Algorithm Design and Analysis Lecture 3 CSC 373: Algorithm Design and Analysis Lecture 3 Allan Borodin January 11, 2013 1 / 13 Lecture 3: Outline Write bigger and get better markers A little more on charging arguments Continue examples of greedy

More information

1 The Traveling Salesperson Problem (TSP)

1 The Traveling Salesperson Problem (TSP) CS 598CSC: Approximation Algorithms Lecture date: January 23, 2009 Instructor: Chandra Chekuri Scribe: Sungjin Im In the previous lecture, we had a quick overview of several basic aspects of approximation

More information