Complexity results for throughput and latency optimization of replicated and data-parallel workflows

Complexity results for throughput and latency optimization of replicated and data-parallel workflows
Anne Benoit and Yves Robert (Anne.Benoit@ens-lyon.fr)
GRAAL team, LIP, École Normale Supérieure de Lyon
Alpage meeting, June 2007

Introduction and motivation

- Mapping workflow applications onto parallel platforms is a difficult challenge; on heterogeneous clusters and fully heterogeneous platforms it is even more difficult.
- A structured programming approach makes applications easier to program (no deadlocks, no process starvation) and offers a range of well-known paradigms (pipeline, farm).
- Algorithmic skeletons help with the mapping.
- This talk: mapping pipeline and fork workflows.

Rule of the game

- Consecutive data sets are fed into the workflow.
- Period: $T_{period}$ = time interval between the beginnings of execution of two consecutive data sets (throughput = $1/T_{period}$).
- Latency: $T_{latency}(x)$ = time elapsed between the beginning and the end of execution for a given data set $x$, and $T_{latency} = \max_x T_{latency}(x)$.
- Each pipeline/fork stage is mapped onto one or several processors.
- Goal: minimize $T_{period}$, or $T_{latency}$, or solve a bi-criteria minimization problem.

Replication and data-parallelism

Replicate stage $S_k$ on $P_1, \dots, P_q$: consecutive data sets are processed round-robin,
$\dots S_{k-1} \to$ [$S_k$ on $P_1$: data sets $1, 4, 7, \dots$ | $S_k$ on $P_2$: data sets $2, 5, 8, \dots$ | $S_k$ on $P_3$: data sets $3, 6, 9, \dots$] $\to S_{k+1} \dots$

Data-parallelize stage $S_k$ on $P_1, \dots, P_q$: each data set is split among the processors. For instance, a stage with $w = 16$ on $P_1$ ($s_1 = 2$), $P_2$ ($s_2 = 1$) and $P_3$ ($s_3 = 1$) processes each data set in $16/(2+1+1) = 4$ time units.

Major contributions

- A theoretical approach to the problem: definitions of replication and data-parallelism, and a formal definition of $T_{period}$ and $T_{latency}$ in each case.
- Problem complexity, with a focus on pipeline and fork workflows.

Outline

1 Framework
2 Working out an example
3 The problem
4 Complexity results
5 Conclusion

Part 1: Framework

Pipeline graphs

$\delta_0 \to [S_1, w_1] \to \delta_1 \to [S_2, w_2] \to \dots \to \delta_{k-1} \to [S_k, w_k] \to \delta_k \to \dots \to [S_n, w_n] \to \delta_n$

- $n$ stages $S_k$, $1 \le k \le n$.
- $S_k$ receives an input of size $\delta_{k-1}$ from $S_{k-1}$, performs $w_k$ computations, and outputs data of size $\delta_k$ to $S_{k+1}$.

Fork graphs

The root $S_0$ (weight $w_0$) sends data of size $\delta_0$ to each of $S_1, S_2, \dots, S_n$ (weights $w_1, \dots, w_n$, output sizes $\delta_1, \dots, \delta_n$).

- $n + 1$ stages $S_k$, $0 \le k \le n$: $S_0$ is the root stage, $S_1$ to $S_n$ are independent stages.
- A data set first goes through stage $S_0$; it can then be processed simultaneously by all the other stages.

The platform

- $p$ processors $P_u$, $1 \le u \le p$, fully interconnected; an input node $P_{in}$ (speed $s_{in}$) and an output node $P_{out}$ (speed $s_{out}$), connected through links of bandwidths $b_{in,u}$ and $b_{v,out}$.
- $s_u$: speed of processor $P_u$.
- Bidirectional links $link_{u,v}: P_u \to P_v$ of bandwidth $b_{u,v}$.
- One-port model: at any time step, each processor can either send, receive, or compute.

Different platforms

- Fully Homogeneous: identical processors ($s_u = s$) and links ($b_{u,v} = b$); typical parallel machines.
- Communication Homogeneous: different-speed processors ($s_u \ne s_v$), identical links ($b_{u,v} = b$); networks of workstations, clusters.
- Fully Heterogeneous: fully heterogeneous architectures, $s_u \ne s_v$ and $b_{u,v} \ne b_{u',v'}$; hierarchical platforms, grids.

Back to pipeline: mapping strategies

For the pipeline application $S_1 \to S_2 \to \dots \to S_k \to \dots \to S_n$, three mapping strategies:
- One-to-one Mapping: each stage on a distinct processor;
- Interval Mapping: intervals of consecutive stages mapped onto processors;
- General Mapping: arbitrary assignment of stages to processors.

In this work: Interval Mapping.

Chains-on-chains

Load-balance contiguous tasks: 5 7 3 4 8 1 3 8 2 9 7 3 5 2 3 6

With $p = 4$ identical processors? An optimal partition is 5 7 3 4 | 8 1 3 8 | 2 9 7 | 3 5 2 3 6, giving $T_{period} = 20$.

The problem is NP-hard for different-speed processors, even without communications.
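
As a concrete illustration (a minimal Python sketch of ours, not from the slides), the chains-on-chains problem on identical processors can be solved by dynamic programming over contiguous blocks; it recovers $T_{period} = 20$ on the instance above:

```python
import functools

def chains_on_chains(weights, p):
    """Minimum achievable max block sum when splitting `weights`
    into at most p contiguous blocks (p identical processors)."""
    n = len(weights)
    prefix = [0] * (n + 1)
    for i, x in enumerate(weights):
        prefix[i + 1] = prefix[i] + x

    @functools.lru_cache(maxsize=None)
    def best(i, k):
        # minimum over partitions of tasks i..n-1 into at most k blocks
        if i == n:
            return 0
        if k == 1:
            return prefix[n] - prefix[i]
        return min(max(prefix[j] - prefix[i], best(j, k - 1))
                   for j in range(i + 1, n + 1))

    return best(0, p)

tasks = [5, 7, 3, 4, 8, 1, 3, 8, 2, 9, 7, 3, 5, 2, 3, 6]
print(chains_on_chains(tasks, 4))  # -> 20
```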

Part 2: Working out an example

Working out an example

Pipeline $S_1 \to S_2 \to S_3 \to S_4$ with $w = (14, 4, 2, 4)$; interval mapping, 4 processors, $s_1 = 2$ and $s_2 = s_3 = s_4 = 1$.

- Optimal period? $T_{period} = 7$: $S_1 \to P_1$, $S_2 S_3 \to P_2$, $S_4 \to P_3$ (then $T_{latency} = 17$).
- Optimal latency? $T_{latency} = 12$: $S_1 S_2 S_3 S_4 \to P_1$ (then $T_{period} = 12$).
- Minimum latency subject to $T_{period} \le 10$? $T_{latency} = 14$: $S_1 S_2 S_3 \to P_1$, $S_4 \to P_2$.
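
The search space of interval mappings is tiny here, so a brute-force sketch (our Python, assuming the communication-free cost model these numbers imply) can confirm all three answers:

```python
from itertools import combinations, permutations

w = [14, 4, 2, 4]
speeds = [2, 1, 1, 1]  # s_1 = 2, s_2 = s_3 = s_4 = 1

def interval_mappings():
    """Yield (period, latency) of every interval mapping: stages are cut
    into contiguous intervals, each mapped onto a distinct processor."""
    n, p = len(w), len(speeds)
    for k in range(1, min(n, p) + 1):
        for cuts in combinations(range(1, n), k - 1):
            bounds = [0, *cuts, n]
            loads = [sum(w[bounds[i]:bounds[i + 1]]) for i in range(k)]
            for procs in permutations(range(p), k):
                times = [loads[i] / speeds[procs[i]] for i in range(k)]
                yield max(times), sum(times)

print(min(t for t, _ in interval_mappings()))             # 7.0: optimal period
print(min(l for _, l in interval_mappings()))             # 12.0: optimal latency
print(min(l for t, l in interval_mappings() if t <= 10))  # 14.0: latency with period <= 10
```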

Example with replication and data-parallelism

Same pipeline $S_1 \to S_2 \to S_3 \to S_4$ with $w = (14, 4, 2, 4)$; interval mapping, 4 processors, $s_1 = 2$ and $s_2 = s_3 = s_4 = 1$.

Replicate an interval $[S_u..S_v]$ on $P_1, \dots, P_q$ (data sets processed round-robin: $P_1$ gets data sets $1, 4, 7, \dots$; $P_2$ gets $2, 5, 8, \dots$; $P_3$ gets $3, 6, 9, \dots$):
$T_{period} = \frac{\sum_{k=u}^{v} w_k}{q \cdot \min_i s_i}$ and $T_{latency} = q \cdot T_{period}$

Data-parallelize a single stage $S_k$ (e.g. $w = 16$) on $P_1, \dots, P_q$:
$T_{period} = \frac{w_k}{\sum_{i=1}^{q} s_i}$ and $T_{latency} = T_{period}$

Optimal period?
- Data-parallelize $S_1$ on $P_1 P_2$ and replicate $S_2 S_3 S_4$ on $P_3 P_4$: $T_{period} = \max\left(\frac{14}{2+1}, \frac{4+2+4}{2 \cdot 1}\right) = 5$, $T_{latency} \approx 14.67$.
- Data-parallelize $S_1$ on $P_2 P_3 P_4$ and map $S_2 S_3 S_4$ on $P_1$: $T_{period} = \max\left(\frac{14}{1+1+1}, \frac{4+2+4}{2}\right) = 5$, $T_{latency} \approx 9.67$ (optimal).
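
Checking these two mappings numerically (a small Python sketch of ours, using the replication and data-parallelism costs just defined):

```python
def t_dp(work, procs):    # data-parallelized interval: T_period = delay
    return work / sum(procs)

def t_rep(work, procs):   # replicated interval: T_period; delay = q * T_period
    return work / (len(procs) * min(procs))

# Mapping 1: data-parallelize S1 on P1 P2, replicate S2 S3 S4 on P3 P4
period1 = max(t_dp(14, [2, 1]), t_rep(10, [1, 1]))       # 5.0
latency1 = t_dp(14, [2, 1]) + 2 * t_rep(10, [1, 1])      # ~14.67

# Mapping 2: data-parallelize S1 on P2 P3 P4, map S2 S3 S4 on P1 alone
period2 = max(t_dp(14, [1, 1, 1]), 10 / 2)               # 5.0
latency2 = t_dp(14, [1, 1, 1]) + 10 / 2                  # ~9.67

print(period1, round(latency1, 2), period2, round(latency2, 2))
```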

Part 3: The problem

Interval Mapping for pipeline graphs

- Several consecutive stages are mapped onto the same processor: this increases the computational load but reduces the communications.
- Partition of $[1..n]$ into $m$ intervals $I_j = [d_j, e_j]$, with $d_j \le e_j$ for $1 \le j \le m$, $d_1 = 1$, $d_{j+1} = e_j + 1$ for $1 \le j \le m - 1$, and $e_m = n$.
- Interval $I_j$ is mapped onto processor $P_{alloc(j)}$:

$T_{period} = \max_{1 \le j \le m} \left\{ \frac{\delta_{d_j - 1}}{b_{alloc(j-1),alloc(j)}} + \frac{\sum_{i=d_j}^{e_j} w_i}{s_{alloc(j)}} + \frac{\delta_{e_j}}{b_{alloc(j),alloc(j+1)}} \right\}$

$T_{latency} = \sum_{1 \le j \le m} \left\{ \frac{\delta_{d_j - 1}}{b_{alloc(j-1),alloc(j)}} + \frac{\sum_{i=d_j}^{e_j} w_i}{s_{alloc(j)}} + \frac{\delta_{e_j}}{b_{alloc(j),alloc(j+1)}} \right\}$
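
A direct transcription of these two formulas (our Python sketch; treating $alloc(0)$ as the input node and $alloc(m+1)$ as the output node, which the slides leave implicit):

```python
def period_latency(w, delta, s, b, intervals, alloc, p_in, p_out):
    """Evaluate T_period and T_latency of an interval mapping.
    w[i]: work of stage S_{i+1}; delta[k]: size of data delta_k (0 <= k <= n);
    s[u]: speed of P_u; b[u][v]: bandwidth of link u -> v;
    intervals: contiguous (d_j, e_j) pairs, 1-based; alloc[j]: processor
    of the j-th interval; p_in / p_out: indices of the in/out nodes."""
    m = len(intervals)
    costs = []
    for j, (d, e) in enumerate(intervals):
        src = p_in if j == 0 else alloc[j - 1]
        dst = p_out if j == m - 1 else alloc[j + 1]
        comp = sum(w[d - 1:e]) / s[alloc[j]]
        costs.append(delta[d - 1] / b[src][alloc[j]]
                     + comp
                     + delta[e] / b[alloc[j]][dst])
    return max(costs), sum(costs)
```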

Fork graphs: mapping

- Map any partition of the graph onto the processors: $q$ intervals, $q \le p$; the first interval contains $S_0$ and possibly $S_1$ to $S_k$; the next intervals contain independent stages.
- $T_{period}$? It depends on the communication model: is it possible to start communicating as soon as $S_0$ is done? In which order are the communications performed?
- Informally, $T_{period}$ = maximum time needed by a processor to receive its data, compute, and output its result.
- $T_{latency}$ = time elapsed between the input of a data set to $S_0$ and the completion of the last computation for this data set.
- Hence a simpler model is used for the formal analysis.

Back to a simpler problem

No communication costs nor overheads.

- Cost to execute $S_i$ on $P_u$ alone: $w_i / s_u$.
- Cost to data-parallelize $[S_i, S_j]$ ($i = j$ for pipeline; $0 < i \le j$ or $i = j = 0$ for fork) on $k$ processors $P_{q_1}, \dots, P_{q_k}$: $\frac{\sum_{l=i}^{j} w_l}{\sum_{u=1}^{k} s_{q_u}}$. This cost is both the $T_{period}$ of the assigned processors and the delay to traverse the interval.
- Cost to replicate $[S_i, S_j]$ on $k$ processors $P_{q_1}, \dots, P_{q_k}$: $\frac{\sum_{l=i}^{j} w_l}{k \cdot \min_{1 \le u \le k} s_{q_u}}$. This cost is the $T_{period}$ of the assigned processors; the delay to traverse the interval is the time needed by the slowest processor, $t_{max} = \frac{\sum_{l=i}^{j} w_l}{\min_{1 \le u \le k} s_{q_u}}$.

With these formulas it is easy to compute $T_{period}$ for both graphs, and $T_{latency}$ for pipeline graphs.

Latency for a fork?

- Partition of the stages into $q$ sets $I_r$ ($1 \le r \le q \le p$); $S_0 \in I_1$, which is mapped onto $k$ processors $P_{q_1}, \dots, P_{q_k}$.
- $t_{max}(r)$ = delay of the $r$-th set ($1 \le r \le q$), computed as before.
- Flexible communication model: the computations of $I_r$, $r \ge 2$, start as soon as the computation of $S_0$ is completed.
- $s_0$ = speed at which $S_0$ is processed: $s_0 = \sum_{u=1}^{k} s_{q_u}$ if $I_1$ is data-parallelized, $s_0 = \min_{1 \le u \le k} s_{q_u}$ if $I_1$ is replicated.

$T_{latency} = \max\left( t_{max}(1), \; \frac{w_0}{s_0} + \max_{2 \le r \le q} t_{max}(r) \right)$
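
The formula translates directly (a Python sketch of ours; `sets[0]` is $I_1$ and its work total includes $w_0$):

```python
def fork_latency(w0, sets, mode):
    """T_latency of a fork under the flexible communication model.
    sets[r] = (total_work, speeds) of the (r+1)-th set; sets[0] is I_1
    and its total work includes w0.  mode[r] is 'dp' (data-parallelized)
    or 'rep' (replicated)."""
    def delay(work, procs, how):                  # t_max of one set
        return work / sum(procs) if how == 'dp' else work / min(procs)

    t1 = delay(sets[0][0], sets[0][1], mode[0])   # t_max(1)
    s0 = sum(sets[0][1]) if mode[0] == 'dp' else min(sets[0][1])
    rest = max((delay(wk, pr, mode[r + 1])        # t_max(r), r >= 2
                for r, (wk, pr) in enumerate(sets[1:])), default=0)
    return max(t1, w0 / s0 + rest)
```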

Optimization problem

Given
- an application graph ($n$-stage pipeline or $(n + 1)$-stage fork),
- a target platform (Homogeneous with $p$ identical processors, or Heterogeneous with $p$ different-speed processors),
- a mapping strategy with replication, either with or without data-parallelization,
- an objective (period $T_{period}$ or latency $T_{latency}$),
determine an interval-based mapping that minimizes the objective.

This yields $2 \times 2 \times 2 \times 2 = 16$ optimization problems.

Bi-criteria optimization problem

- Given a threshold period $P_{threshold}$, determine a mapping whose period does not exceed $P_{threshold}$ and that minimizes $T_{latency}$.
- Given a threshold latency $L_{threshold}$, determine a mapping whose latency does not exceed $L_{threshold}$ and that minimizes $T_{period}$.

Part 4: Complexity results

Complexity results: without data-parallelism, Homogeneous platforms

Objective       period        latency       bi-criteria
Hom. pipeline   -
Het. pipeline   Poly (str)
Hom. fork       -             Poly (DP)
Het. fork       Poly (str)    NP-hard

(str): straightforward algorithm; (DP): dynamic programming.

Complexity results: with data-parallelism, Homogeneous platforms

Objective       period        latency       bi-criteria
Hom. pipeline   -
Het. pipeline   Poly (DP)
Hom. fork       -             Poly (DP)
Het. fork       Poly (str)    NP-hard

Complexity results: without data-parallelism, Heterogeneous platforms

Objective       period          latency       bi-criteria
Hom. pipeline   Poly (*)        -             Poly (*)
Het. pipeline   NP-hard (**)    Poly (str)    NP-hard
Hom. fork       Poly (*)
Het. fork       NP-hard         -

(*): binary-search/dynamic-programming algorithm (below); (**): involved reduction, similar to the heterogeneous chains-on-chains one.

Complexity results: with data-parallelism, Heterogeneous platforms

Objective       period        latency       bi-criteria
Hom. pipeline   NP-hard
Het. pipeline   -
Hom. fork       NP-hard
Het. fork       -

The most interesting case: without data-parallelism, Heterogeneous platforms (third table above).

No data-parallelism, Heterogeneous platforms

- For pipeline, minimizing the latency is straightforward: map all stages onto the fastest processor.
- Minimizing the period is NP-hard for a general pipeline (an involved reduction, similar to the heterogeneous chains-on-chains one).
- Homogeneous pipeline, where all stages have the same workload $w$: polynomial complexity.
- There is a polynomial bi-criteria algorithm for the homogeneous pipeline.

Lemma: form of the solution

Pipeline, no data-parallelism, Heterogeneous platform.

Lemma. If an optimal solution minimizing the pipeline period uses $q$ processors, consider the $q$ fastest processors $P_1, \dots, P_q$, ordered by non-decreasing speeds: $s_1 \le \dots \le s_q$. Then there exists an optimal solution which replicates intervals of stages onto $k$ intervals of processors $I_r = [P_{d_r}, P_{e_r}]$, with $1 \le r \le k \le q$, $d_1 = 1$, $e_k = q$, and $e_r + 1 = d_{r+1}$ for $1 \le r < k$.

Proof: an exchange argument, which does not increase the latency.

Binary-search/dynamic-programming algorithm

- Given a latency $L$ and a period $K$: loop on the number of processors $q$; run a dynamic programming algorithm to minimize the latency; success if $L$ is achieved.
- Binary search on $L$ to minimize the latency for a fixed period.
- Binary search on $K$ to minimize the period for a fixed latency.

Dynamic programming algorithm

Compute $L(n, 1, q)$, where $L(m, i, j)$ is the minimum latency needed to map $m$ pipeline stages onto processors $P_i$ to $P_j$ while fitting in period $K$:

$L(m, i, j) = \min \left\{ \begin{array}{ll} \frac{m \cdot w}{s_i} & \text{if } \frac{m \cdot w}{(j - i + 1) \cdot s_i} \le K \quad (1) \\ \min_{1 \le m' < m, \; i \le k < j} \; L(m', i, k) + L(m - m', k + 1, j) & (2) \end{array} \right.$

Case (1): replicate the $m$ stages onto processors $P_i, \dots, P_j$; case (2): split the interval.

Initialization:
$L(1, i, j) = \frac{w}{s_i}$ if $\frac{w}{(j - i + 1) \cdot s_i} \le K$, and $+\infty$ otherwise;
$L(m, i, i) = \frac{m \cdot w}{s_i}$ if $\frac{m \cdot w}{s_i} \le K$, and $+\infty$ otherwise.

Complexity of the dynamic programming: $O(n^2 \cdot p^4)$. The number of iterations of the binary search is formally bounded, and is very small in practice.
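
A memoized sketch of this recurrence (our Python; we read the replication test as $(j - i + 1) \cdot s_i$ since $P_i, \dots, P_j$ comprises $j - i + 1$ processors, and speeds are sorted non-decreasingly so $s_i$ is the minimum of the interval):

```python
import functools

def min_latency(n, w, speeds, K):
    """Minimum latency to map n stages of equal weight w with period <= K.
    speeds: processor speeds sorted in non-decreasing order, so speeds[i]
    is the slowest (hence critical) processor of any interval starting at i."""
    p = len(speeds)
    INF = float('inf')

    @functools.lru_cache(maxsize=None)
    def L(m, i, j):  # m stages onto processors i..j (0-based, inclusive)
        best = INF
        if m * w / ((j - i + 1) * speeds[i]) <= K:  # (1) replicate all m stages
            best = m * w / speeds[i]
        for mp in range(1, m):                       # (2) split stages and processors
            for k in range(i, j):
                best = min(best, L(mp, i, k) + L(m - mp, k + 1, j))
        return best

    # loop on q: try the q fastest processors, as in the lemma
    return min(L(n, p - q, p - 1) for q in range(1, p + 1))
```

Wrapping this call in a binary search on $K$ then minimizes the period under a latency bound, as described above; the call itself already minimizes the latency for a fixed period.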

Part 5: Conclusion

Related work

- Subhlok and Vondran: extension of their work (pipeline on homogeneous platforms).
- Chains-on-chains: our work adds the possibility to replicate or data-parallelize.
- Mapping pipelined computations onto clusters and grids: DAGs [Taura et al.], DataCutter [Saltz et al.].
- Energy-aware mapping of pipelined computations: [Melhem et al.], three-criteria optimization.
- Mapping pipelined computations onto special-purpose architectures: FPGA arrays [Fabiani et al.]; fault-tolerance for embedded systems [Zhu et al.].
- Mapping skeletons onto clusters and grids: use of stochastic process algebra [Benoit et al.].

Conclusion

- Mapping structured workflow applications onto computational platforms, with replication and data-parallelism.
- The complexity of the most tractable instances gives insight into the combinatorial nature of the problem.
- Pipeline and fork graphs, with an extension to fork-join.
- Homogeneous and Heterogeneous platforms with no communications.
- Minimizing the period or the latency, and bi-criteria optimization problems.
- A solid theoretical foundation for the study of single- and bi-criteria mappings, with the possibility to replicate and data-parallelize application stages.

Future work

Short term:
- Select polynomial instances of the problem and assess their complexity when communication costs are added.
- Design heuristics to solve the combinatorial instances of the problem.

Longer term:
- Heuristics based on our polynomial algorithms for general application graphs structured as combinations of pipeline and fork kernels.
- Real experiments on heterogeneous clusters, comparing effective performance against theoretical performance.