Lecture 9: Load Balancing & Resource Allocation


Introduction

Moler's law and Sullivan's theorem give upper bounds on the speed-up that can be achieved using multiple processors. But to approach these bounds, the different concurrent processes that make up a concurrent program must be assigned efficiently to the available processors. This is called Load Balancing.

Load balancing is a special case of the more general Resource Allocation Problem in a parallel/distributed system; in the load balancing situation, the resources are processors. Before stating the load balancing problem precisely, we need to formalise models of the concurrent program and the concurrent system. To do this, we can use methods from Graph Theory.

Sources of Parallel Imbalance

- Individual processor performance: typically in the memory system.
- Too much parallelism overhead: thread creation, synchronisation, communication.
- Load imbalance: different amounts of work across processors (computation-to-communication ratio); processor heterogeneity (which may itself be caused by the load distribution).

Recognising load imbalance: time spent at synchronisation is high and/or uneven across processors.

Aside: Graph Theory

Directed graphs are useful in the context of load balancing: nodes can represent tasks, and the links can represent data or communication dependencies. We need to partition the graph so as to minimise execution time.

The graph partitioning problem is formally defined on data represented in the form of a graph G = (V, E) with vertex set V and edge set E. It is possible to partition G into smaller components with specific properties. For instance, a k-way partition divides the vertex set V into k smaller components. A good partition is defined as one in which the number of edges running between separated components is small.
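As a concrete illustration of this partition-quality measure, here is a minimal Python sketch that counts the edges running between components of a partition. The graph, the partition, and the function name edge_cut are all invented for illustration; they do not come from the notes.

```python
# Minimal sketch: counting the edge cut of a k-way partition.

def edge_cut(edges, part):
    """Count edges whose endpoints lie in different components.

    edges: iterable of (u, v) pairs
    part:  dict mapping each vertex to its component index
    """
    return sum(1 for u, v in edges if part[u] != part[v])

# A 6-vertex graph split into a 2-way partition.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 2), (3, 5)]
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(edge_cut(edges, part))  # -> 1: only edge (2, 3) crosses the cut
```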

Graph Theory (cont'd)

Partition G = (V, E) such that V = V1 ∪ V2 with |V1| ≈ |V2| ≈ |V|/2, and with as few edges of E connecting V1 with V2 as possible. If V = {tasks}, each of unit cost, and each edge e = (i, j) represents the comms between task i and task j, then partitioning V = V1 ∪ V2 with |V1| ≈ |V2| is load balancing, and minimising the number of edges cut between V1 and V2 minimises comms.

As optimal graph partitioning is NP-complete, heuristics are used instead. This trades off partitioner speed against quality of partition: a better load balance costs more to compute, with a law of diminishing returns.

Formal Models in Load Balancing: Task Graphs

A task graph is a directed acyclic graph where:
- nodes denote the concurrent processes in a concurrent program;
- edges between nodes represent process comms/synchronisation;
- the nodal weight is the computational load of the process the node represents;
- the edge weight between two nodes is the amount of comms between the two processes represented by those nodes.

[Figure: an example task graph with nodal and edge weights; the numeric labels are not recoverable from the transcription.]
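To make the task graph model concrete, the following sketch represents a small task graph as plain dictionaries. The tasks, loads, and communication volumes are invented for illustration, not taken from the (unrecoverable) figure.

```python
# Illustrative task graph: a DAG with nodal weights (computational load)
# and edge weights (communication volume between the two tasks).

task_load = {"A": 5, "B": 2, "C": 3, "D": 4}           # nodal weights
comms = {("A", "B"): 2, ("A", "C"): 1, ("B", "D"): 5,  # edge weights
         ("C", "D"): 2}

def predecessors(task):
    """Tasks that must finish (and send data) before `task` can start."""
    return [u for (u, v) in comms if v == task]

print(predecessors("D"))  # -> ['B', 'C']
```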

Formal Models in Load Balancing: Processor Graphs

The processor graph defines the configuration of the parallel or distributed system. Each node represents a processor, and the nodal weight is the computation speed of that processor. The edges between nodes represent the communication links between the processors represented by the nodes; the edge weight is the speed of that communications link.

[Figure: an example processor graph with processor speeds and link speeds; the numeric labels are not recoverable from the transcription.]

Load Balancing Based on Graph Partitioning: Typical Example

- The nodes represent tasks.
- The edges represent communication cost.
- The node values represent processing cost.
- A second node value could represent reassignment cost.
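Putting the two models together, here is a hedged sketch of how an example of this kind might be evaluated: given processing costs, communication costs, and an assignment of tasks to processors, compute each processor's load and the inter-processor communication incurred. All names and numbers are illustrative assumptions.

```python
# Sketch: evaluating one assignment for a "typical example" style graph.

proc_cost = {"A": 4, "B": 2, "C": 3, "D": 5}            # processing cost
comm_cost = {("A", "B"): 3, ("B", "C"): 1, ("C", "D"): 4}
assign = {"A": 0, "B": 0, "C": 1, "D": 1}               # task -> processor

load = {}
for t, p in assign.items():
    load[p] = load.get(p, 0) + proc_cost[t]

# Intra-processor comms is deemed instantaneous, so count only crossing edges.
cross_comms = sum(w for (u, v), w in comm_cost.items()
                  if assign[u] != assign[v])

print(load)         # -> {0: 6, 1: 8}
print(cross_comms)  # -> 1 (only the B-C edge crosses processors)
```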

Load Balancing: The Problem

The problem is to partition a set of interacting tasks among a set of interconnected processors so as to maximise performance. Basically, the idea in load balancing is to balance the processor loads so that all processors can proceed at the same rate. More formally, maximising performance can be defined as:

- minimising the makespan C_max, i.e. min(C_max) = min(max_j C_j), where the makespan is the maximum completion time C_j over all tasks j;
- minimising the response time;
- minimising the total idle time;
- or any other reasonable goal.

A general assumption is that comms between tasks on the same processor is much faster than comms between tasks on different processors, so intra-processor comms is deemed to be instantaneous.

Load Balancing: Allocation & Scheduling

Load balancing has two aspects:
- the allocation of the tasks to processors, and
- the scheduling of the tasks allocated to a processor.

Allocation is usually seen as the more important issue; as a result, some load balancing algorithms only address allocation.

Complexity of the problem:
- Finding an allocation of arbitrarily intercommunicating tasks, constrained by precedence relationships, to an arbitrarily interconnected network of m processing nodes, meeting a given deadline, is an NP-complete problem.
- Finding min(C_max) for a set of tasks, where any task can execute on any node and is allowed to pre-empt another task, is NP-complete even when the number of processing nodes is limited to two.
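As a small worked illustration of minimising min(C_max) = min(max_j C_j): for independent tasks on identical processors, the classic longest-processing-time (LPT) greedy heuristic approximates the optimal makespan. LPT is a standard heuristic brought in here for illustration (the notes do not name it), and the function name and task times are invented.

```python
# Sketch: LPT places each task (longest first) on the least-loaded processor;
# the makespan of the resulting allocation is the largest processor load.

def lpt_allocate(times, n_procs):
    loads = [0] * n_procs
    alloc = [[] for _ in range(n_procs)]
    for t in sorted(times, reverse=True):
        p = loads.index(min(loads))   # least-loaded processor
        loads[p] += t
        alloc[p].append(t)
    return alloc, max(loads)          # allocation and its makespan

alloc, makespan = lpt_allocate([7, 5, 4, 3, 3, 2], n_procs=2)
print(alloc, makespan)   # -> [[7, 3, 2], [5, 4, 3]] 12
```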

Casavant & Kuhl's Taxonomy

A hierarchical taxonomy of load balancing algorithms is due to Casavant and Kuhl:

local  vs  global
  global: static  vs  dynamic
    static: optimal  vs  sub-optimal
      optimal: enumerative, graph theory, mathematical programming, queuing theory
      sub-optimal: approximate  vs  heuristic
    dynamic: physically distributed  vs  physically non-distributed
      physically distributed: cooperative  vs  non-cooperative
        cooperative: optimal  vs  sub-optimal (approximate or heuristic)

Casavant & Kuhl (cont'd): Static v Dynamic

Static algorithms:
- nodal assignment, once made to processors, is fixed;
- use only information about the average behaviour of the system;
- ignore the current state/load of the nodes in the system;
- are obviously much simpler.

Dynamic algorithms:
- use runtime state information to make decisions, i.e. tasks can be moved from one processor to another as the system state changes;
- collect state information and react to the system state as it changes;
- are able to give significantly better performance.

Casavant & Kuhl (cont'd): Centralised v Distributed

Centralised algorithms:
- collect information to a server node, which makes the assignment decision;
- can make efficient decisions, but have lower fault-tolerance;
- must take account of information collection/allocation times.

Distributed algorithms:
- contain entities that make decisions on a predefined set of nodes;
- avoid the bottleneck of collecting state information, and can react faster;
- don't have to take account of information collection times.

Load Balancing: Coffman's Algorithm

This is an optimal static algorithm that works on arbitrary task (program) graphs. Since in general the problem is NP-complete, some simplifying assumptions must be made:
1. All tasks have the same execution time.
2. Comms is negligible versus computation; only the precedence ordering remains.

The Algorithm
1. Assign labels 1, ..., m to the m terminal (i.e. end) tasks.
   a) Let labels 1, ..., j-1 be assigned, and let S be the set of tasks with no unlabelled successors.
   b) For each node x in S, define l(x) as the decreasing sequence of the labels of the immediate successors of x.
   c) Label x as j if l(x) ≤ l(y) (lexicographically) for all y in S.
2. Assign the highest-labelled ready task to the next available time slot among the two processors.
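The labelling rule and the two-processor list schedule can be sketched in Python. This is a hedged sketch under the stated assumptions (unit-time tasks, two processors), not the notes' exact formulation; the example DAG and all names are invented.

```python
# Coffman-Graham style sketch: label tasks, then schedule two per time step.
# `succ` maps each task to its immediate successors in the task graph.

def coffman_graham(succ):
    tasks = list(succ)
    label = {}
    while len(label) < len(tasks):
        # candidates: unlabelled tasks with no unlabelled successors
        ready = [t for t in tasks if t not in label
                 and all(s in label for s in succ[t])]
        key = lambda t: sorted((label[s] for s in succ[t]), reverse=True)
        nxt = min(ready, key=key)     # lexicographically smallest l(x)
        label[nxt] = len(label) + 1
    return label

def schedule_two_procs(succ, label):
    preds = {t: [u for u in succ if t in succ[u]] for t in succ}
    done, time, slots = set(), 0, []
    while len(done) < len(succ):
        ready = [t for t in succ if t not in done
                 and all(p in done for p in preds[t])]
        step = sorted(ready, key=lambda t: -label[t])[:2]  # 2 processors
        slots.append((time, step))
        done |= set(step)
        time += 1
    return slots

succ = {"T1": ["T3"], "T2": ["T3"], "T3": ["T4", "T5"], "T4": [], "T5": []}
lab = coffman_graham(succ)
print(lab)                        # -> {'T4': 1, 'T5': 2, 'T3': 3, 'T1': 4, 'T2': 5}
print(schedule_two_procs(succ, lab))
# -> [(0, ['T2', 'T1']), (1, ['T3']), (2, ['T5', 'T4'])]
```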

Coffman's Algorithm: Example

[Figure: a worked example. Starting from the nodes with no unlabelled successors, the task graph's nodes are labelled using the inverse lexicographic order of their successors' labels, and the resulting two-processor schedule is shown as a Gantt chart for P1 and P2. The numeric labels are not recoverable from the transcription.]

Scheduling Algorithms

The concepts of load balancing and scheduling are closely related. The goal of scheduling is to maximise system performance by switching tasks from busy to less busy or idle processors. A scheduling strategy involves two important decisions:
1. determine the tasks that can be executed in parallel, and
2. determine where to execute the parallel tasks.

A decision is normally taken either based on prior knowledge, or on information gathered during execution.

Scheduling Algorithms: Difficulties

The design of a scheduling strategy depends on the tasks' properties:
a) Cost of tasks: do all tasks have the same computation cost? If not, when are the costs known: before execution, on creation, or on termination?
b) Dependencies between tasks: can we execute the tasks in any order? If not, when are the task dependencies known: again, before execution, when the task is created, or only when it terminates?
c) Locality: is it important that some tasks execute on the same processor to reduce communication costs? When do we know the communication requirements?

We have come up against a lot of these ideas already in the MPI lectures.

Scheduling Algorithms: Differences

Like allocation algorithms, scheduling algorithms can be either static or dynamic. A key question is when certain information about the load balancing problem is known; this leads to a spectrum of solutions.

1. Static scheduling: all information is available to the job scheduling algorithm, which can therefore run before any real computation starts. For this case, we can run off-line algorithms, e.g. graph partitioning algorithms.

Scheduling: Semi-Static Algorithms

2. Semi-static scheduling: here, information about load balancing may be known at program startup, at the beginning of each timestep, or at other well-defined points in the execution of the program. Offline algorithms may be used even though the problem has dynamic aspects, e.g. the Kernighan-Lin graph partitioning algorithm.

Kernighan-Lin (KL) is an O(n² log n) heuristic algorithm for solving the graph partitioning problem. A closely related heuristic by the same authors (Lin-Kernighan) is commonly applied to the Travelling Salesman Problem (TSP), which, ordinarily, is NP-complete.

Scheduling: Semi-Static Algorithms (cont'd)

KL tries to split V into two disjoint subsets A and B of equal size, partitioned such that the sum of the weights of the edges between nodes in A and B is minimised. It proceeds by finding an optimal series of interchanges between elements of A and B that maximises the gain (iterating as necessary); it then executes the operations, partitioning V into A and B.

Kernighan-Lin has many applications in areas as diverse as:
- circuit board design, where edges represent solder connections on a circuit board and we need to minimise crossings between components, represented by vertices; and
- DNA sequencing, where edges represent a similarity measure between DNA fragments and the vertices represent the DNA fragments themselves.
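Below is a hedged sketch of a single, simplified Kernighan-Lin pass on an unweighted graph. The full algorithm tentatively swaps all pairs, keeps the best prefix of swaps, and repeats passes until no improvement; this sketch just greedily applies improving swaps and locks each swapped pair. The graph and variable names are invented.

```python
# Simplified KL pass: repeatedly swap the (a, b) pair that most reduces
# the cut between A and B, locking swapped vertices.

def cut_size(edges, A):
    return sum(1 for u, v in edges if (u in A) != (v in A))

def kl_pass(edges, A, B):
    A, B, locked = set(A), set(B), set()
    while len(locked) < 2 * min(len(A), len(B)):
        best = None
        for a in A - locked:
            for b in B - locked:
                trial_A = (A - {a}) | {b}
                gain = cut_size(edges, A) - cut_size(edges, trial_A)
                if best is None or gain > best[0]:
                    best = (gain, a, b)
        gain, a, b = best
        if gain <= 0:
            break                      # no improving swap left
        A = (A - {a}) | {b}
        B = (B - {b}) | {a}
        locked |= {a, b}
    return A, B

edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A, B = kl_pass(edges, {0, 1, 5}, {2, 3, 4})
print(sorted(A), sorted(B), cut_size(edges, A))  # -> [0, 1, 2] [3, 4, 5] 1
```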

Scheduling: Dynamic Algorithms

3. Dynamic scheduling: here, load balancing information only becomes known mid-execution. This gives rise to two sub-divisions under which dynamic algorithms can be classified:
a. source-initiative algorithms, where the processor that generates the task decides which processor will serve the task; and
b. server-initiative algorithms, where each processor determines which tasks it will serve.

Examples of source-initiative algorithms are random splitting, cyclic splitting, and join-shortest-queue. Examples of server-initiative algorithms are random service, cyclic servicing, serve-longest-queue, and shortest-job-first.

Scheduling: Dynamic Algorithms (cont'd)

Server-initiative algorithms tend to out-perform source-initiative algorithms with the same information content, provided communications costs are not a dominating effect. However, they are more sensitive to the distribution of load generation, and deteriorate quickly when one load source generates more tasks than another. In heavily loaded environments, server-initiative algorithms dominate source-initiative algorithms.
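To contrast two of the source-initiative policies named above, here is a toy sketch comparing random splitting with join-shortest-queue. The task sizes, server count, and function names are invented; real behaviour depends on arrival and service distributions.

```python
# Each arriving task is dispatched to one of n server queues; the resulting
# maximum queue length stands in for the makespan if servers drain in parallel.

import random

def simulate(policy, task_sizes, n_servers=3, seed=1):
    random.seed(seed)
    queues = [0] * n_servers            # outstanding work per server
    for size in task_sizes:
        if policy == "random":
            i = random.randrange(n_servers)
        else:                           # join shortest queue
            i = queues.index(min(queues))
        queues[i] += size
    return max(queues)

tasks = [5, 1, 4, 2, 6, 3, 2, 4]
print(simulate("random", tasks))   # random splitting: often unbalanced
print(simulate("jsq", tasks))      # join-shortest-queue: near-even load
```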

Scheduling in Real-Time Systems (RTS)

The goal of scheduling here is to guarantee that all critical tasks meet their deadlines, and that as many essential tasks as possible meet theirs. RTS scheduling can be synchronous or asynchronous.

1. Synchronous Scheduling Algorithms
These are static algorithms in which the available processing time is divided by a hardware clock into frames. Into each frame is allocated a set of tasks which are guaranteed to complete by the end of the frame. If a task is too big for a frame, it is artificially divided into highly dependent smaller tasks that can be scheduled into the frames.

RTS Scheduling (cont'd)

2. Asynchronous Scheduling
This can be either static or dynamic. In general, dynamic scheduling algorithms are preferred, as static algorithms cannot react to changes in state such as hardware or software failure in some subsystem.

Dynamic asynchronous scheduling algorithms in a hard real-time system must still guarantee that all critical tasks meet their deadlines under specified failure conditions. So critical tasks are scheduled statically, replicas of them are statically allocated to several processors, and the active state information of each task is also duplicated. In the event of a processor failure, the state information is sent to a replica of the task and all further inputs are rerouted to the replica.
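A minimal sketch of the frame-based (synchronous) idea: packing tasks into fixed-length frames. First-fit packing is an assumption for illustration, as the notes do not specify the packing rule; the frame length and task times are invented.

```python
# First-fit packing of task execution times into fixed-length frames.
# A task larger than a frame would have to be split into smaller,
# highly dependent tasks first, as described above.

def pack_into_frames(exec_times, frame_len):
    frames = []                  # each frame holds tasks summing <= frame_len
    for t in exec_times:
        if t > frame_len:
            raise ValueError(f"task of size {t} must be split to fit a frame")
        for frame in frames:
            if sum(frame) + t <= frame_len:
                frame.append(t)
                break
        else:
            frames.append([t])
    return frames

print(pack_into_frames([4, 3, 2, 5, 1], frame_len=6))
# -> [[4, 2], [3, 1], [5]]  (first-fit; not necessarily optimal)
```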