Mapping pipeline skeletons onto heterogeneous platforms

Anne Benoit and Yves Robert
GRAAL team, LIP, École Normale Supérieure de Lyon
Yves.Robert@ens-lyon.fr
January 2007

Introduction and motivation
- Mapping applications onto parallel platforms: a difficult challenge
- Heterogeneous clusters, fully heterogeneous platforms: even more difficult!
- Structured programming approach: easier to program (avoids deadlocks and process starvation), with a range of well-known paradigms (pipeline, farm)
- Algorithmic skeletons: a help for mapping

Why restrict to pipelines?
Chains-on-chains partitioning problem:
- no communications
- identical processors
Extensions (done):
- with communications
- with heterogeneous processors/links
- goal: assess complexity, design heuristics
Extensions (current work):
- deal with DEALs
- deal with DAGs

Chains-on-chains
Load-balance contiguous tasks, for instance the chain 5 7 3 4 8 1 3 8 2 9 7 3 5 2 3 6 with p = 4 processors: cut the chain into 4 contiguous pieces whose maximum load is as small as possible.
Back to the Bokhari and Iqbal partitioning papers; see the survey by Pinar and Aykanat, JPDC 64, 8 (2004).
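
To make the chains-on-chains problem concrete, here is a minimal Python sketch (not from the talk) for the homogeneous case with no communications: binary-search the bottleneck load, with a greedy check of whether the chain fits into at most p contiguous pieces of that load. All names are illustrative.

    def fits(weights, p, bound):
        # Greedy check: can the chain be cut into <= p contiguous pieces,
        # each of total weight <= bound?
        pieces, load = 1, 0
        for x in weights:
            if x > bound:
                return False
            if load + x > bound:
                pieces, load = pieces + 1, x
            else:
                load += x
        return pieces <= p

    def chains_on_chains(weights, p):
        # Smallest achievable bottleneck load (integer weights assumed).
        lo, hi = max(weights), sum(weights)
        while lo < hi:
            mid = (lo + hi) // 2
            if fits(weights, p, mid):
                hi = mid
            else:
                lo = mid + 1
        return lo

    # The chain from the slide, with p = 4 processors:
    print(chains_on_chains([5, 7, 3, 4, 8, 1, 3, 8, 2, 9, 7, 3, 5, 2, 3, 6], 4))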

Rule of the game
- Map each pipeline stage onto a single processor (no deals)
- Goal: minimize execution time
- Several mapping strategies for the pipeline application S_1, S_2, ..., S_k, ..., S_n: One-to-one Mapping, Interval Mapping, General Mapping

Major contributions
Theory:
- Formal approach to the problem
- Problem complexity
- Integer linear program for exact resolution
Practice:
- Heuristics for Interval Mapping on clusters
- Experiments to compare the heuristics and evaluate their absolute performance

Outline
1 Framework
2 Complexity results
3 Heuristics
4 Experiments
5 Linear programming formulation
6 Conclusion

The application
A pipeline of n stages S_k, 1 ≤ k ≤ n, with data sizes δ_0, δ_1, ..., δ_n and computation weights w_1, ..., w_n. Stage S_k:
- receives an input of size δ_{k-1} from S_{k-1}
- performs w_k computations
- outputs data of size δ_k to S_{k+1}
S_0 and S_{n+1} are virtual stages representing the outside world.

The platform
p processors P_u, 1 ≤ u ≤ p, fully interconnected.
- s_u: speed of processor P_u
- bidirectional link link_{u,v}: P_u → P_v, of bandwidth b_{u,v}
- one-port model: each processor can either send, receive or compute at any time-step
- P_in: input data; P_out: output data

Different platforms
- Fully Homogeneous: identical processors (s_u = s) and links (b_{u,v} = b): typical parallel machines
- Communication Homogeneous: different-speed processors (s_u ≠ s_v), identical links (b_{u,v} = b): networks of workstations, clusters
- Fully Heterogeneous: fully heterogeneous architectures, s_u ≠ s_v and b_{u,v} ≠ b_{u',v'}: hierarchical platforms, grids

Mapping problem: One-to-one Mapping
Requires n ≤ p: map each stage S_k onto a distinct processor P_{alloc(k)}.
Period of P_{alloc(k)}: minimum delay between the processing of two consecutive tasks.
With t = alloc(k-1), u = alloc(k), v = alloc(k+1), the cycle-time of P_u is
  cycle_u = δ_{k-1} / b_{t,u} + w_k / s_u + δ_k / b_{u,v}
Optimization problem: find the allocation function alloc: [1, n] → [1, p] which minimizes
  T_period = max_{1 ≤ k ≤ n} cycle_{alloc(k)}
(with alloc(0) = in and alloc(n+1) = out).
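
As a sanity check, the period of a given one-to-one mapping can be evaluated directly from this definition. A minimal Python sketch (indexing conventions are mine: delta[0..n], w[1..n] with w[0] unused, and alloc[0] / alloc[n+1] pointing to the virtual input/output processors):

    def one_to_one_period(delta, w, s, b, alloc):
        # T_period = max over stages of cycle_{alloc(k)}
        n = len(w) - 1
        period = 0.0
        for k in range(1, n + 1):
            t, u, v = alloc[k - 1], alloc[k], alloc[k + 1]
            cycle = delta[k - 1] / b[t][u] + w[k] / s[u] + delta[k] / b[u][v]
            period = max(period, cycle)
        return period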

Mapping problem: Interval Mapping
- Several consecutive stages onto the same processor
- Increases the computational load, reduces communications
- Mandatory when p < n
Partition [1..n] into m intervals I_j = [d_j, e_j] (with d_j ≤ e_j for 1 ≤ j ≤ m, d_1 = 1, d_{j+1} = e_j + 1 for 1 ≤ j ≤ m-1, and e_m = n). Interval I_j is mapped onto processor P_{alloc(j)}, and
  T_period = max_{1 ≤ j ≤ m} { δ_{d_j - 1} / b_{alloc(j-1),alloc(j)} + (Σ_{i=d_j}^{e_j} w_i) / s_{alloc(j)} + δ_{e_j} / b_{alloc(j),alloc(j+1)} }
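
The same kind of direct evaluation works for an interval mapping; the sketch below (again with my own indexing: intervals given as (d_j, e_j) pairs and explicit indices p_in, p_out for the virtual processors) just transcribes the formula.

    def interval_period(delta, w, s, b, intervals, alloc, p_in, p_out):
        # intervals: list of (d_j, e_j) covering 1..n contiguously
        # alloc[j]: processor hosting interval j
        period = 0.0
        for j, (d, e) in enumerate(intervals):
            prev = p_in if j == 0 else alloc[j - 1]
            nxt = p_out if j == len(intervals) - 1 else alloc[j + 1]
            u = alloc[j]
            cost = (delta[d - 1] / b[prev][u]
                    + sum(w[d:e + 1]) / s[u]
                    + delta[e] / b[u][nxt])
            period = max(period, cost)
        return period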

Mapping problem: General Mapping
- Does not suit the one-port model very well: on Communication Homogeneous platforms, a General Mapping can always be replaced by an Interval Mapping that is at least as good
- Can be the optimal mapping for Fully Heterogeneous platforms in some particular cases
- More general, but requires threads, and may lead to idle times and races with the one-port model

Complexity results

                     Fully Hom.    Comm. Hom.
One-to-one Mapping   polynomial    polynomial
Interval Mapping     polynomial    NP-complete
General Mapping      same complexity as Interval Mapping

- Binary search polynomial algorithm for One-to-one Mapping
- Dynamic programming algorithm for Interval Mapping on Homogeneous platforms
- General Mapping: same complexity as Interval Mapping
- All problem instances are NP-complete on Fully Heterogeneous platforms

Back to chains-on-chains
- Chains-on-chains + homogeneous communications: polynomial
- Chains-on-chains + different-speed processors: NP-complete

One-to-one / Comm. Hom.: binary search algorithm
Binary search on the period T_period; for each candidate period, run the following greedy assignment:
- Work with the fastest n processors, numbered P_1 to P_n, where s_1 ≤ s_2 ≤ ... ≤ s_n
- Mark all stages S_1 to S_n as free
- For u = 1 to n: pick any free stage S_k such that δ_{k-1}/b + w_k/s_u + δ_k/b ≤ T_period; assign S_k to P_u and mark it as assigned; if no such stage is found, return failure
Proof: exchange argument
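
A possible Python rendering of this algorithm (my own transcription: Communication Homogeneous, so a single bandwidth b; the binary search is done over the finite set of candidate cycle-times, which contains the optimal period):

    def greedy_fits(delta, w, s, b, period):
        # Greedy check from the slide: the n fastest processors, taken in
        # non-decreasing speed order, each grab any free stage they can
        # process within the target period.
        n = len(w) - 1
        if n > len(s):
            return False
        free = set(range(1, n + 1))
        for su in sorted(s)[-n:]:
            ok = [k for k in free
                  if delta[k - 1] / b + w[k] / su + delta[k] / b <= period]
            if not ok:
                return False
            free.remove(ok[0])
        return True

    def one_to_one_comm_hom(delta, w, s, b):
        # Optimal one-to-one period: binary search over candidate cycle-times.
        n = len(w) - 1
        candidates = sorted({delta[k - 1] / b + w[k] / su + delta[k] / b
                             for k in range(1, n + 1) for su in s})
        lo, hi, best = 0, len(candidates) - 1, None
        while lo <= hi:
            mid = (lo + hi) // 2
            if greedy_fits(delta, w, s, b, candidates[mid]):
                best, hi = candidates[mid], mid - 1
            else:
                lo = mid + 1
        return best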

Interval, Fully Hom.: dynamic programming algorithm
c(i, j, k): optimal period to map stages S_i to S_j using exactly k processors.
Goal: min_{1 ≤ k ≤ p} c(1, n, k)
  c(i, j, k) = min_{q + r = k, 1 ≤ q ≤ k-1, 1 ≤ r ≤ k-1}  min_{i ≤ l ≤ j-1} max( c(i, l, q), c(l+1, j, r) )
  c(i, j, 1) = δ_{i-1}/b + (Σ_{k=i}^{j} w_k)/s + δ_j/b
  c(i, j, k) = +∞ if k > j - i + 1
Proof: search over all possible partitionings into two subintervals, using every possible number of processors for each interval.
Complexity: O(n³ p²)
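
A direct memoized transcription of this recurrence in Python (Fully Homogeneous: single speed s and bandwidth b; indexing as before, delta[0..n] and w[1..n]):

    from functools import lru_cache

    def interval_fully_hom(delta, w, s, b, p):
        # c(i, j, k): optimal period for stages S_i..S_j on exactly k processors.
        n = len(w) - 1

        @lru_cache(maxsize=None)
        def c(i, j, k):
            if k > j - i + 1:
                return float('inf')
            if k == 1:
                return delta[i - 1] / b + sum(w[i:j + 1]) / s + delta[j] / b
            best = float('inf')
            for q in range(1, k):              # q + r = k, with q, r >= 1
                r = k - q
                for l in range(i, j):          # split between S_l and S_{l+1}
                    best = min(best, max(c(i, l, q), c(l + 1, j, r)))
            return best

        return min(c(1, n, k) for k in range(1, p + 1))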

Heterogeneous platforms: NP-complete
Reduction from MINIMUM METRIC BOTTLENECK WANDERING SALESPERSON PROBLEM.
[Figure: reduction gadget, a platform with processors P_in, P_1, ..., P_4, one processor P_{i,j} per pair of cities, and P_out; link labels involve 1 and quantities 2/d(c_i, c_j)]

Greedy heuristics (1/2)
Target: clusters (Communication Homogeneous platforms) and Interval Mapping.
- L = n/p consecutive stages per processor: the set of intervals is fixed, and n/L processors are used
- One-to-one Mapping when n ≤ p (L = 1)
- This rule is applied in all greedy heuristics except the random-interval-length one

Greedy heuristics (2/2)
- H1a-GR (random): random choice of a free processor for each interval
- H1b-GRIL (random interval length): idem with random interval sizes: average length L, 1 ≤ length ≤ 2L-1
- H2-GSW (biggest w): place the interval with the most computations on the fastest processor
- H3-GSD (biggest δ_in + δ_out): intervals are sorted by communications (δ_in + δ_out), where in is the first stage of the interval and out-1 the last one
- H4-GP (biggest period on fastest processor): balance computation and communication: processors sorted by decreasing speed s_u; for the current processor P_u, choose the interval with the biggest period (δ_in + δ_out)/b + Σ_{i ∈ Interval} w_i / s_u

Sophisticated heuristics
- H5-BS121 (binary search for One-to-one Mapping): the optimal algorithm for One-to-one Mapping; when p < n, the application is first cut into fixed intervals of length L
- H6-SPL (splitting intervals): processors sorted by decreasing speed, all stages assigned to the first processor; at each step, select the used processor j with the largest period and split its interval, giving a fraction of its stages to an unused processor j', so as to minimize max(period(j), period(j')); split only if the maximum period improves
- H7a-BSL and H7b-BSC (binary search, longest/closest): binary search on the period P: starting with stage s = 1, build intervals (s, s') fitting on processors; for each processor u and each s' ≥ s, compute period(s..s', u) and check whether it is smaller than P; H7a maximizes s', H7b chooses the closest period
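
To illustrate, here is a rough sketch of the H6-SPL idea in Python (Communication Homogeneous, single bandwidth b; a simplification of the heuristic: when splitting, the left part stays on the current processor and the right part goes to the next unused one, which is only one of the possible splitting conventions):

    def h6_splitting(delta, w, s, b, p):
        # Start with all stages on the fastest processor, then repeatedly split
        # the interval of the processor with the largest period.
        n = len(w) - 1
        speeds = sorted(s, reverse=True)[:min(p, n)]

        def cost(d, e, speed):
            # Period of interval [d..e] on a processor of the given speed.
            return delta[d - 1] / b + sum(w[d:e + 1]) / speed + delta[e] / b

        intervals = [(1, n, speeds[0])]        # (d_j, e_j, speed of its processor)
        next_proc = 1
        while next_proc < len(speeds):
            idx = max(range(len(intervals)), key=lambda i: cost(*intervals[i]))
            d, e, sp = intervals[idx]
            if d == e:
                break                          # a single stage cannot be split
            new_sp = speeds[next_proc]
            split, best = None, cost(d, e, sp)
            for m in range(d, e):              # split minimizing the max of both periods
                val = max(cost(d, m, sp), cost(m + 1, e, new_sp))
                if val < best:
                    split, best = m, val
            if split is None:
                break                          # no split improves the maximum period
            intervals[idx] = (d, split, sp)
            intervals.append((split + 1, e, new_sp))
            next_proc += 1
        return max(cost(*iv) for iv in intervals), intervals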

Plan of experiments
- Assess the performance of the polynomial heuristics
- Random applications, n = 1 to 50 stages
- Random platforms, p = 10 and p = 100 processors
- b = 10 (comm. hom.), processor speeds between 1 and 20
- Relevant parameters: the ratios δ/b and w/s
- Average over 100 similar random application/platform pairs
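
For reference, a small generator of random instances in the spirit of this plan (the exact distributions used in the talk are not specified, so uniform draws are an assumption on my part):

    import random

    def random_instance(n, p, b=10.0, comm=(10.0, 10.0), comp=(1.0, 20.0), seed=None):
        # Communication Homogeneous platform: single bandwidth b,
        # processor speeds drawn in [1, 20]; the default comm/comp ranges
        # mimic Experiment 1 below (delta_i = 10, computations in [1, 20]).
        rng = random.Random(seed)
        delta = [rng.uniform(*comm) for _ in range(n + 1)]    # delta_0 .. delta_n
        w = [0.0] + [rng.uniform(*comp) for _ in range(n)]    # w_1 .. w_n
        s = [rng.uniform(1.0, 20.0) for _ in range(p)]
        return delta, w, s, b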

Experiment 1 - balanced comm/comp, hom. comm.
δ_i = 10, computation time between 1 and 20.
[Plots: maximum period versus number of stages, for p = 10 and p = 100 processors; one curve per heuristic: H1a-GreedyRandom, H1b-GreedyRandomIntervalLength, H2-GreedySumW, H3-GreedySumDinDout, H4-GreedyPeriod, H5-BinarySearch1to1, H6-SPLitting, H7a-BinarySearchLongest, H7b-BinarySearchClosest]

Experiment 2 - balanced comm/comp, het. comm.
Communication time between 1 and 100, computation time between 1 and 20.
[Plots: maximum period versus number of stages, for p = 10 and p = 100 processors, same heuristics as above]

Experiment 3 - large computations
Communication time between 1 and 20, computation time between 10 and 1000.
[Plots: maximum period versus number of stages, for p = 10 and p = 100 processors]

Experiment 4 - small computations
Communication time between 1 and 20, computation time between 0.01 and 10.
[Plots: maximum period versus number of stages, for p = 10 and p = 100 processors]

Summary of experiments
Much more efficient than random mappings. Three dominant heuristics for different cases:
- Insignificant communications (hom. or small) and many processors: H5-BS121 (One-to-one Mapping)
- Insignificant communications (hom. or small) and few processors: H7b-BSC (clever choice of where to split)
- Important communications (het. or big): H6-SPL (splitting choice relevant for any number of processors)

Integer linear programming
- Integer LP to solve Interval Mapping on Fully Heterogeneous platforms
- Many integer variables: no efficient algorithm to solve it
- Approach limited to small problem instances
- Gives the absolute performance of the heuristics on such instances

Linear program: variables
- x_{k,u}: 1 if S_k is on P_u, 0 otherwise
- y_{k,u}: 1 if S_k and S_{k+1} are both on P_u, 0 otherwise
- z_{k,u,v}: 1 if S_k is on P_u and S_{k+1} is on P_v, 0 otherwise
- first_u and last_u: integers denoting the first and last stage assigned to P_u (to enforce the interval constraints)
- T_period: period of the pipeline
Objective function: minimize T_period

Linear program: constraints
- ∀k ∈ [0..n+1]: Σ_u x_{k,u} = 1
- ∀k ∈ [0..n]: Σ_{u ≠ v} z_{k,u,v} + Σ_u y_{k,u} = 1
- ∀k ∈ [0..n], ∀u, v ∈ [1..p] ∪ {in, out}, u ≠ v: x_{k,u} + x_{k+1,v} ≤ 1 + z_{k,u,v}
- ∀k ∈ [0..n], ∀u ∈ [1..p] ∪ {in, out}: x_{k,u} + x_{k+1,u} ≤ 1 + y_{k,u}
- ∀k ∈ [1..n], ∀u ∈ [1..p]: first_u ≤ k·x_{k,u} + n·(1 - x_{k,u})
- ∀k ∈ [1..n], ∀u ∈ [1..p]: last_u ≥ k·x_{k,u}
- ∀k ∈ [1..n-1], ∀u, v ∈ [1..p], u ≠ v: last_u ≤ k·z_{k,u,v} + n·(1 - z_{k,u,v})
- ∀k ∈ [1..n-1], ∀u, v ∈ [1..p], u ≠ v: first_v ≥ (k+1)·z_{k,u,v}
- ∀u ∈ [1..p]: Σ_{k=1}^{n} [ Σ_{t ≠ u} (δ_{k-1}/b_{t,u}) z_{k-1,t,u} + (w_k/s_u) x_{k,u} + Σ_{v ≠ u} (δ_k/b_{u,v}) z_{k,u,v} ] ≤ T_period
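
For illustration, these constraints can be written with an off-the-shelf modeling layer. The sketch below uses PuLP (a choice of mine, not the setup of the talk, which used GLPK directly); it covers the assignment, linking and period constraints, fixes S_0 and S_{n+1} on the virtual processors, and leaves out the first_u / last_u interval-enforcement constraints, which would be added in the same style. Names and indexing are mine.

    import pulp

    def build_mapping_ilp(delta, w, s, b, n, procs, p_in, p_out):
        # procs: real processor indices; p_in, p_out: virtual input/output processors.
        # s[u]: speeds, b[(u, v)]: bandwidths (including links to/from p_in, p_out),
        # delta[0..n], w[1..n].
        all_procs = list(procs) + [p_in, p_out]
        stages = range(0, n + 2)                   # S_0 and S_{n+1} are virtual

        prob = pulp.LpProblem("pipeline_mapping", pulp.LpMinimize)
        T = pulp.LpVariable("T_period", lowBound=0)
        x = pulp.LpVariable.dicts("x", (stages, all_procs), cat="Binary")
        y = pulp.LpVariable.dicts("y", (range(n + 1), all_procs), cat="Binary")
        z = pulp.LpVariable.dicts("z", (range(n + 1), all_procs, all_procs), cat="Binary")

        prob += T                                  # objective: minimize the period
        prob += x[0][p_in] == 1                    # alloc(0) = in, alloc(n+1) = out
        prob += x[n + 1][p_out] == 1
        for k in stages:                           # each stage on exactly one processor
            prob += pulp.lpSum(x[k][u] for u in all_procs) == 1
        for k in range(n + 1):                     # the link S_k -> S_{k+1} is used once
            prob += (pulp.lpSum(z[k][u][v] for u in all_procs
                                for v in all_procs if u != v)
                     + pulp.lpSum(y[k][u] for u in all_procs)) == 1
            for u in all_procs:
                prob += x[k][u] + x[k + 1][u] <= 1 + y[k][u]
                for v in all_procs:
                    if u != v:
                        prob += x[k][u] + x[k + 1][v] <= 1 + z[k][u][v]
        for u in procs:                            # period bound on each real processor
            prob += pulp.lpSum(
                pulp.lpSum(delta[k - 1] / b[(t, u)] * z[k - 1][t][u]
                           for t in all_procs if t != u)
                + w[k] / s[u] * x[k][u]
                + pulp.lpSum(delta[k] / b[(u, v)] * z[k][u][v]
                             for v in all_procs if v != u)
                for k in range(1, n + 1)) <= T
        return prob, x, T

The resulting model is solved with the usual PuLP call, e.g. prob.solve(), or prob.solve(pulp.GLPK_CMD()) if GLPK is installed and one wants the same solver as in the talk.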

Linear program: experiments
- O(np²) variables, and as many constraints
- Experiments only on small problem instances
- Average over 10 instances of each application
- Solver: GLPK
- Largest experiment: p = 8, n = 4: 14 hours of computation time
- Parameters similar to Experiment 1: homogeneous communications and balanced comm/comp

Linear program: experiment p = 8
[Plot: maximum period versus number of stages (p = 8), comparing the linear program with H3-GreedySumDinDout, H5-BinarySearch1to1, H6-SPLitting, H7a-BinarySearchLongest and H7b-BinarySearchClosest]

n    LP          H5-BS121    H7b-BSC
1    2.576857    2.576882    2.576882
2    2.749913    2.749934    2.749934
3    2.879871    2.879900    2.883072
4    2.760960    2.760981    2.770690

Linear program: experiment p = 4, homogeneous communications (Experiment 1)
[Plot: maximum period versus number of stages (p = 4), homogeneous case, comparing the linear program with the same heuristics]
H7b is very close to the optimal (< 3% error).

Linear program: experiment p = 4, heterogeneous communications (Experiment 2)
[Plot: maximum period versus number of stages (p = 4), heterogeneous case]
H6 is very close to the optimal (< 0.05% error).

Related work
- Scheduling task graphs on heterogeneous platforms: acyclic task graphs scheduled on different-speed processors [Topcuoglu et al.]; communication contention, 1-port model [Beaumont et al.]
- Mapping pipelined computations onto special-purpose architectures: FPGA arrays [Fabiani et al.]; fault-tolerance for embedded systems [Zhu et al.]
- Mapping pipelined computations onto clusters and grids: DAGs [Taura et al.], DataCutter [Saltz et al.]
- Mapping skeletons onto clusters and grids: use of stochastic process algebra [Benoit et al.]

Conclusion
Theoretical side:
- Complexity results for different mapping strategies and different platform types
Practical side:
- Optimal polynomial algorithm for One-to-one Mapping
- Design of several heuristics for Interval Mapping on Communication Homogeneous platforms
- Comparison of their performance
- Linear program to assess the absolute performance of the heuristics, which turns out to be quite good

Future work
Short term / longer term:
- Heuristics for Fully Heterogeneous platforms
- Extension to DAG-trees (a DAG which is a tree when un-oriented)
- Extension to stage replication
- LP with replication and DAG-trees
- Real experiments on heterogeneous clusters, using an already-implemented skeleton library and MPI
- Comparison of effective performance against theoretical performance