
AN ABSTRACT OF THE THESIS OF

Boontee Kruatrachue for the degree of Doctor of Philosophy in Electrical and Computer Engineering, presented on June 10, 1987.

Title: Static Task Scheduling and Grain Packing in Parallel Processing Systems

Abstract approved: Redacted for Privacy (Theodore G. Lewis)

We extend previous results for optimally scheduling concurrent program modules, called tasks, on a fixed, finite number of parallel processors in two fundamental ways: (1) we introduce a new heuristic which considers the time delay imposed by message transmission among concurrently running tasks; and (2) we introduce a second heuristic which maximizes program execution speed by duplicating tasks. Simulation results are given which suggest an order of magnitude improvement in program execution speed over previous scheduling heuristics. The first solution, ISH (insertion scheduling heuristic), provides only a small improvement over current solutions but has a smaller time complexity than DSH, namely O(N^2). DSH (duplication scheduling heuristic) is an O(N^4) heuristic that (1) gives up to an order of magnitude improvement in performance, (2) solves the max-min problem of parallel processor scheduling by duplicating selected scheduled tasks on some PEs, and (3) gives monotonically growing improvements as the number of PEs is increased. The max-min problem is caused by the trade-off between maximum parallelism

versus minimum communication delay. DSH is also applied in "grain packing", a new way to define the grain size for a user program on a specific parallel processing system. Instead of defining the grain size before scheduling, grain packing uses a fine-grain schedule to construct larger grains. In this way all available parallelism is considered, as well as the communication delay.

Static Task Scheduling and Grain Packing in Parallel Processing Systems

by
Boontee Kruatrachue

A THESIS submitted to Oregon State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Completed June 10, 1987
Commencement June 1988

APPROVED:

Professor of Computer Science, in charge of major
Head of Department of Electrical & Computer Engineering
Dean of Graduate School

Date thesis is presented: June 10, 1987
Typed by Boontee Kruatrachue

DEDICATION

This dissertation is dedicated to
my aunt, Dr. Fuangfoong Kruatrachue,
my parents, Dr. Mongkol Kruatrachue, Dr. Foongfuang Kruatrachue, and Mrs. Praneet Kruatrachue,
and my brother and sister, Samapap Kruatrachue and Kritawan Kruatrachue.

ACKNOWLEDGEMENTS

I am deeply grateful to Dr. Ted Lewis for his help, guidance, inspiration, and the many hours he spent reading the thesis and commenting on it. Also, I thank him for his constant encouragement and confidence in me throughout the years of my research at Oregon State University. I would like to express my gratitude to my parents and my aunts for their love, support and encouragement. I thank my sister for all her love and support. Finally, I express my gratitude to both the Electrical and Computer Engineering Department and the Computer Science Department for financial support and computer equipment.

TABLE OF CONTENTS

I. INTRODUCTION
   1.1 Motivation and Purpose of this Research
   1.2 Previous Works
   1.3 Scope
   1.4 Outline of the Dissertation ... 15

II. THE PARALLEL PROCESSOR SCHEDULING ENVIRONMENT
   2.1 Background
       Scheduling Definition, Function and Goal
       Critical Path Nature of a Task Graph and Gantt Chart
       Difference between Allocation and Scheduling
       List Scheduling
   2.2 The Parallel Processor Scheduling Problem
       Communication and Parallelism Trade-off Problem
       Grain Size Problem
       Level Alteration and the Critical Path ... 24

III. DUPLICATION SCHEDULING HEURISTIC (DSH) & INSERTION SCHEDULING HEURISTIC (ISH)
   3.1 Introduction
   3.2 Definitions ... 30

TABLE OF CONTENTS (continued)

   3.3 Insertion Scheduling Heuristic (ISH)
   3.4 Duplication Scheduling Heuristic (DSH)
   3.5 Complexity of ISH and DSH ... 44

IV. EXPERIMENT RESULTS ... 65

V. OPTIMAL GRAIN DETERMINATION FOR PARALLEL PROCESSING SYSTEM
   5.1 Introduction
   5.2 Grain Packing Approach ... 82

VI. CONCLUSION
   6.1 Significance of this Research
   6.2 Future Related Research ... 94

BIBLIOGRAPHY ... 96
APPENDIX ... 101

LIST OF FIGURES

1.1  An Example of Program Represented by Task Graph
2.1  Comparison between Allocation and Scheduling
2.2  The Allocation Consideration due to Communication Delay
2.3  The Comparison between Parallelism and Communication Delay
2.4  The Comparison of Large Grain Versus Fine Grain Scheduling
3.1  The Segment of a Three-processor Schedule after Node 7's Assignment
3.2  The Main List Scheduler
3.3  The Update_R_queue
3.4  The Example of Task Graph List Scheduling
3.5  The Locate-PE of ISH
3.6  The Assigned-Node of ISH
3.7  The Example of ISH Task Insertion and Scheduling
3.8  The Choices in Implementing ISH Step
3.9  The Average Speedup Ratio Comparison between ISH Versions on 10 Random-generated 350 Nodes Task Graphs
3.10 The Task Duplication Concept
3.11 The Task Duplication Process (TDP)
3.12 The Copy_LIP of DSH
3.13 The Sample Task Graph of 11 Nodes and Its Intermediate Gantt Chart using DSH ... 60

LIST OF FIGURES (continued)

3.14 The Example of Duplication Task List (CTlst) on PEi Constructed by TDP for Node
3.15 The Locate_PE of DSH
3.16 The Assigned-Node of DSH
4.1  The Average Speedup Ratio Comparison (20 Nodes)
4.2  The Average Speedup Ratio Comparison (50 Nodes)
4.3  The Average Speedup Ratio Comparison (100 Nodes)
4.4  The Average Speedup Ratio Comparison (150 Nodes)
4.5  The Average Speedup Ratio Comparison (250 Nodes)
4.6  The Average Speedup Ratio Comparison (350 Nodes)
4.7  The Average DSH Speedup Ratio Comparison for Different Delay
4.8  The Average Speedup Ratio Comparison for Non-identical Node Size and Non-identical Communication Delay
5.1  An Example of User Program and Its Task Graph Construction
5.2  An Example of Fine Grain Task Graph Construction
5.3  An Example of Fine Grain Node Size Calculation ... 89

LIST OF FIGURES (continued)

5.4 An Example of Communication Delay Calculation for the Specific Architecture
5.5 An Example of Fine Grain Scheduling using DSH in Comparison with Load Balancing and Single PE
5.6 An Example of User Program Restructure ... 92

LIST OF TABLES

2.1 The Complexity Comparison of Scheduling Problems
4.1 ISH's Speedup Ratio Improvement over Hu and Yu D Heuristics
4.2 DSH's Speedup Ratio Improvement over Hu and Yu D Heuristics
4.3 The Effect of Communication Delay on Speedup Ratio Comparison
4.4 DSH and ISH Speedup Ratio Improvement over Hu and Yu D Heuristics for Variable Node Size and Communication Delay ... 78

STATIC TASK SCHEDULING AND GRAIN PACKING IN PARALLEL PROCESSING SYSTEMS

CHAPTER I
INTRODUCTION

1.1 Motivation and Purpose of this Research

The goal of parallel processing is to achieve very high-speed computing by partitioning a sequential program into concurrent parts and then executing the concurrent parts simultaneously. This usually means that a programmer must manually schedule the concurrent parts on the available processors. Besides being time-consuming and prone to errors, this approach limits the number and kind of applications that can take advantage of parallel processing. Instead, an automatic means of allocating and scheduling parts of a program on multiple processors is desired.

The research problem in this dissertation is the problem of optimal scheduling of concurrent program modules, called tasks, on a fixed, finite number of parallel processors. We call this the scheduling problem for parallel processor systems. To be specific, consider a system of fully connected identical processors, and assume the program or tasks submitted to the system are specified by an acyclic directed task graph TG(N,E). N is a set of numbered nodes; each node is assigned an integer weight representing the execution time of a task. The node execution time is also called its size. E is a set of edges representing precedence constraints among nodes. Edges are also assigned an integer representing the amount of communication between nodes, in the form of a message which is

sent from one node to another. A node must wait for all of its input messages before it can start execution. A node N1 is called an immediate predecessor of node N2 if N1 sends a message to N2. Likewise, node N2 is called the immediate successor of N1 because N2 receives a message from N1. A node may have zero, one, or more immediate predecessors (successors).

Figure 1.1 is an example of a program represented by TG. The label inside a node is its node number. The number adjacent to a node is the node's size, and represents the execution time for that node. The number on each branch is the message's size, and represents the transmission time for that branch. Node 1 has no predecessor, but node 1 is the immediate predecessor of nodes 2, 3, 4, 5 and 6. Node 7 is the immediate successor of nodes 5 and 6. All nodes are of size 1, and all messages are of size 1.

Given TG for some program P, we seek a scheduling of tasks to processors such that the shortest possible execution time is obtained. Furthermore, we assume that only one program is executed on the parallel processor at a time. The resulting shortest-time execution is called the optimal scheduling length for program P. Ullman [Ullm 75] has shown that finding the optimal schedule length for this type of problem is generally very hard and is an NP-complete problem. Because of the computational complexity of optimal solution strategies, a need has arisen for a simplified suboptimal ("near optimal") approach to this scheduling problem.
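The task-graph model just defined can be made concrete with a small sketch. The Python below (illustrative names, not from the thesis) encodes the Figure 1.1 graph, with the edge set reconstructed from the Pascal listing in that figure and unit node and message sizes as stated above:

```python
# Sketch of the task-graph model TG(N, E) described above.
# Edge set reconstructed from the Pascal listing in Figure 1.1;
# all node sizes and message sizes are 1, as stated in the text.

size = {n: 1 for n in range(1, 13)}          # node execution times
edges = {                                    # (pred, succ): message size
    (1, 2): 1, (1, 3): 1, (1, 4): 1, (1, 5): 1, (1, 6): 1,
    (5, 7): 1, (6, 7): 1, (7, 8): 1, (8, 9): 1,
    (9, 10): 1, (4, 10): 1, (10, 11): 1, (3, 11): 1,
    (11, 12): 1, (2, 12): 1,
}

def immediate_predecessors(node):
    # nodes that send a message to `node`
    return sorted(p for (p, s) in edges if s == node)

def immediate_successors(node):
    # nodes that receive a message from `node`
    return sorted(s for (p, s) in edges if p == node)

print(immediate_successors(1))    # node 1 precedes nodes 2 through 6
print(immediate_predecessors(7))  # node 7 follows nodes 5 and 6
```

A scheduler would consume exactly this information: node sizes, message sizes, and the predecessor relation that determines when a node's inputs are complete.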

Scheduling research for this problem has emphasized heuristic approaches. Without communication delay, this type of problem is solved by using algorithms belonging to the class of HLF (highest level first) or CP (critical path) algorithms [Hu 61] (see the previous work section). In HLF, tasks are executed from the highest level down, level by level, where the level of a node is defined as the length of the longest path from that node to the ending node. HLF algorithms provide "the nearest optimal" schedule most of the time compared to other heuristics [Adam 74].

In order to be more realistic, we need to add communication delay to this scheduling problem. This addition makes the problem harder. In fact, an algorithm for finding the optimal solution for an arbitrary task graph is not known. Moreover, the addition of communication delays introduces a key difficulty in parallel processor scheduling, the so-called max-min problem. The max-min problem occurs because of the trade-off possible between maximum parallelism versus minimum communication delay. If tasks are allocated to parallel processors in such a manner as to maximize the amount of simultaneous execution of tasks without regard for the cost of message transmission, the result may be a program that runs slower than on a single processor. This case arises when communication costs are high compared to execution time delays. Alternately, when available parallelism is not exploited

to advantage, the parallel processor may be underutilized. Therefore, a "good near optimal" scheduling algorithm must solve the max-min problem and consider available communication bandwidth as well as available concurrency.

While various scheduling problems have been studied for many years, previous results are extended here in three fundamental ways: 1) by introducing a new method which includes the time delay imposed by message transmission among tasks, 2) by proposing an entirely new heuristic which maximizes program execution speed by duplicating tasks on the fixed set of processors, and 3) by proposing a solution to the max-min problem by duplicating tasks (this max-min problem has not been widely recognized in the literature and so has never been solved by previous researchers). The scheduling results of these two new schedulers will also show that load balancing does not yield "near optimal" schedules for this scheduling problem. Moreover, a new method to solve the "near optimal grain size" problem, called "grain packing", is proposed. Grain packing uses the fine-grain scheduling results of the new heuristic to find a "near-optimal grain size" for an application program on a parallel processing system.

1.2 Previous Works

The scheduling problem for parallel processor systems has been widely studied [Coff 76]. A detailed survey of earlier results can be found in a paper by Gonzalez [Gonz 77]. This survey presents an extensive discussion of processor scheduling, as well as results on job scheduling, taking into account a wide range of scheduling constraints. Research results in parallel processor scheduling can be roughly classified into five groups based on the type of tasks to be scheduled, the type of parallel processor system, and the goal of the scheduler. The five groups are 1) optimal precedent scheduling, 2) optimal communication precedent scheduling, 3) load balancing communication scheduling, 4) dynamic task scheduling, and 5) independent task scheduling.

1: Optimal precedent scheduling. The objective of the first type of scheduler is to minimize the schedule length. The tasks can be represented by an acyclic directed task graph G(T,<), where T is a set of nodes representing tasks (with known execution times), and < is a set of edges representing the precedence constraints. We also assume zero communication delay between any two communicating tasks. Finding the optimal schedule length for this type of problem is generally very hard and is an NP-complete problem. The problem is NP-complete even in two simple restricted cases: 1) scheduling unit-time tasks on an arbitrary number of processors, and 2) two-processor scheduling with all tasks requiring one or two time units [Ullm 75].

With the addition of more restrictions this problem can be solved in polynomial time [Lens 78]. For example, when the task graph is a tree and all tasks execute in one time unit, a solution can be found in O(n) time, see [Hu 61]. Hu's list scheduling algorithm uses a level number equal to the length of the longest path from the node to the ending node as a priority number, i.e. tasks are executed level by level, from the highest level first. Coffman and Graham [Coff 72] gave an O(n^2) list scheduling algorithm similar to Hu's except that the task scheduling priorities are assigned in such a way that nodes at the same level have different priorities. The algorithm gives an optimal-length schedule for an arbitrary graph containing unit-time tasks on a 2-processor system. Sethi [Seth 76] gave a less complex algorithm which provides the same schedule in O(n a(n) + e), where e is the number of edges in the graph and a(n) is almost a constant function of n. Nett [Nett 76] extended Hu's algorithm to provide the same optimal schedule for 2 processors. Task priorities are still equal to the node's level number, but tasks at the same level are ordered by the number of each task's immediate successors. Kaufman [Kauf 74] reports an algorithm similar to Hu's that works on a tree containing tasks with arbitrary execution times. This algorithm finds a schedule in time bounded by 1 + (p-1)t/T, where p = number of processors, t = longest task

execution time, and T = the summation of all task execution times. All of the algorithms described above belong to the more general class of HLF (highest level first) algorithms, which have also been called CP (critical path), LP (longest path), or LPT (largest processing time) algorithms. Since no schedule for an arbitrary graph can be shorter than its critical path, the level number or critical path length is the key to successful precedence task scheduling. In fact, experimental studies of simulated task graphs show that very simple HLF algorithms can produce optimal schedules most of the time for unit-time tasks (within 90% of optimal for over 700 cases) [Bash 83], and that HLF algorithms yield "near optimal" schedules most of the time for arbitrary-time tasks (4.4 percent away from optimal in 899 of 900 cases). HLF algorithms also provided the best results compared to other list scheduling algorithms (847 cases of 900 cases) [Adam 74]. Bounds on the ratio of HLF algorithm schedules versus optimal schedules are summarized as follows:

Precedence   P    T    Algorithm   Bound        Ref.
Tree         Ar*  =1   [Hu 61]     optimal      [Hu 61]
Ar           =2   =1   [Coff 72]   optimal      [Coff 72]
Ar           =2   Ar   HLF         4/3          [Chen 75]
Ar           Ar   Ar   HLF         2 - 1/(P-1)  [Chen 75]
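The level numbers that drive these HLF algorithms can be computed in one backward pass over the graph. The sketch below (hypothetical helper names, and zero communication delay as assumed in this section) illustrates the idea on a tiny example:

```python
# Level number = length of the longest path from a node to the exit
# node, counting node execution times. Tasks are then listed highest
# level first (HLF). A minimal sketch; zero communication delay.

def levels(size, succs):
    """size: node -> execution time; succs: node -> list of successors."""
    memo = {}
    def level(n):
        if n not in memo:
            # a node's level is its own size plus the best successor level
            memo[n] = size[n] + max((level(s) for s in succs.get(n, [])),
                                    default=0)
        return memo[n]
    return {n: level(n) for n in size}

# Unit-time chain 1 -> 2 -> 3 plus an independent node 4:
size = {1: 1, 2: 1, 3: 1, 4: 1}
succs = {1: [2], 2: [3]}
lv = levels(size, succs)                      # {1: 3, 2: 2, 3: 1, 4: 1}
hlf_order = sorted(size, key=lambda n: -lv[n])
print(hlf_order)                              # node 1 first, exit nodes last
```

A list scheduler would then repeatedly take the highest-level ready node from this order and place it on the next free processor.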

* where P = number of processors, T = task execution time, Ar = arbitrary (precedence, number of processors, execution times).

Kohler [Kohl 75] suggested a branch-and-bound algorithm to obtain the optimal solution for an arbitrary task graph. The algorithm begins with the construction of a search tree. Then an elimination rule is used to eliminate branches that violate the constraints, and a selection rule is used to select the most promising branches first. Therefore, only part of the solution space is searched. The algorithm is very general and can be applied to many kinds of scheduling problems. Branch-and-bound guarantees an optimal solution, but the solution space for the precedence-constrained scheduling problem is very large. Therefore, the branch-and-bound algorithm is much slower than HLF algorithms.

2: Optimal communication precedent scheduling. The second type of scheduling problem is exactly the same as the first type but with the addition of communication delays between communicating nodes located in different processors. Because of the computational complexity of optimal solution strategies, a need has arisen for a simplified suboptimal approach to this scheduling problem. Recent research in this area has emphasized heuristic approaches. Yu's heuristics [Yu 84] are based on Hu's algorithm [Hu 61]. Yu's improvements were to consider communication delays when

making task assignments, and to use a combinatorial min-max weight matching scheme at each task assignment step. Yu's results were compared with results from Hu's algorithms, and the comparisons were good for large communication delays. However, the results were not significantly different from those of Hu for small communication delays. Also, nodes in the application task graph must have identical sizes (execution times), and so must the communication delays.

3: Load balancing communication scheduling. This type of problem can be represented by a task graph G(T,E), where T is a set of nodes representing tasks with known execution times, and E is a set of edges representing the amount of communication between nodes located on different processors (no precedence constraint is assumed). The scheduling (allocation) objective is to minimize the communication delay between processors and to balance the load among processors. Since there are no precedence constraints, load balancing yields a system with high throughput and faster response time. Exclusion of the precedence constraints results in a different schedule objective, a completely different schedule solution, and a totally different scheduling problem from the previous two types. This type of problem is more suited to partitioning (allocation) of a task graph than to scheduling tasks on processors. While we are not concerned with this kind of problem, three approaches have been taken by the following researchers:

Graph theoretical approach: [Ston 77], [Ston 78], [Bokh 81]. Integer programming approach: [Chu 80], [Ma 82]. Heuristic approach: [Etc 82], [Ma 84].

4: Dynamic task scheduling. This is about the same as the second type of problem except that every parameter (task execution time, amount of communication, precedence constraints, number of nodes in the task graph) is dynamic and can change during runtime. An example of this type of task graph is a graph which represents a program with loop or branching statements. We generally do not know beforehand how many times a loop will be executed or which branch will be taken. Solutions to this type of problem are obtained through stochastic scheduling algorithms. A critical path cannot be found until the dynamic graph is executed. If the objective is to minimize the schedule length, a schedule for this type of problem has the following properties: 1) scheduling must be done during task execution because there is not enough information for scheduling before execution; 2) statistical data about the dynamic task graph may have to be gathered during runtime in order to predict the behavior of the task graph [Schw 87]; 3) since the schedule is determined at runtime, a complex scheduling strategy might introduce excessive overhead with no guarantee of a "near optimal" schedule length;

4) the "goodness" of the schedule length computed by the dynamic scheduler depends on 1) how closely the scheduler can predict the future behavior of the task graph, and 2) how much excess overhead is introduced. Several scheduling models have been proposed for dynamic scheduling [Rama 76], [Jens 77], [Schw 87], along with scheduler heuristics [Rama 76], [Kung 81]. Some researchers have focused on special cases, for example loops and branches [Tows 86].

5: Independent task scheduling. The tasks for this type of scheduling problem are completely independent of each other. There are several objectives, such as load balancing and individual task deadlines. Since there are no precedence constraints among tasks, this type of scheduling is different from the first two. Examples of this type of scheduler are found in [Ni 85] and [Krau 75].

1.3 Scope

Given a program P which can be partitioned into separate cooperating and communicating parts called tasks, the goal of this research is to devise an efficient heuristic scheduler to statically assign a set of cooperative tasks (a task graph) of P to a finite number of processors in the parallel processor system. The assumptions on the task graph and parallel processor system are:

1: (Data Flow Property): Each task pi of P must wait for its input before it can begin to execute; but once its inputs have

been obtained, the task can execute to completion in time ti. Therefore, the task assignment must consider the order of execution of each task (precedence constraint).

2: (Non-homogeneous Time Delays): Nodes of a task graph to be scheduled need not be of the same size (execution times ti may vary from task to task); hence the assignment of tasks to processors must be optimized with regard to variable but statically defined time delays.

3: (Non-homogeneous Communication Delays): Tasks communicate with one another. The time taken to communicate is measured as a communication time delay. These delays are related to the interconnection topology of the underlying hardware, the method of transmission, and the number of bits of information to be transmitted. The underlying hardware parameters are assumed to be fixed, but the number of bits per message may vary; hence delays may vary from connection to connection. Thus, communication delays are not required to be identical.

4: (Max Parallelism): Parallelism must be exploited as much as possible.

5: (Min Communication): Communication time delays must be suppressed as much as possible.

The main objective of this research is to devise a heuristic that takes advantage of parallelism while at the same time reducing the communication delay, so that the execution of a task graph is completed as early as possible. Unfortunately, maximal parallelism

and minimal communication delay compete with one another, leading to a trade-off problem (max-min), as is discussed in Chapter II. Load balancing is not included in our objectives, since load balancing tends to distribute tasks evenly to every processor in the system even when tasks could be distributed to a smaller number of processors. This tends to increase the communication delay, hence increasing the runtime.

The assumptions above can be rephrased in terms of the task graph as follows:

1: (Acyclic Graph): The task graph is acyclic.
2: (Static Topology): The task graph is defined beforehand, and remains unchanged during program execution.
3: (Non-homogeneous Labels): Message and node sizes need not be identical.
4: (Non-preemption): Once a task begins executing, it executes to completion.
5: (Static Scheduling): Once a task has been scheduled on a processor, the schedule is not changed.

In addition, we assume the following properties of the parallel processing system.

1: (Connectivity): All processors are connected so that a message can be sent from any processor to any other processor (for simplicity, the queuing delay that may occur due to congestion on the interconnection network is not

analyzed).
2: (Locality): Transmission time between nodes located on the same processor is assumed to be zero time units.
3: (Identicality): Each processor is the same (speed, function).
4: (Co-Processor I/O): A processor can execute tasks and communicate with another processor at the same time. This is typical with I/O processors (channels) and direct memory access. Hence, the term "processing element" (PE) is used instead of "processor" to imply the existence of an I/O processor.
5: (Single Application): Only one program is executed at a time on the parallel processing system; this is to maximize execution speed.

In spite of the connectivity and identicality assumptions, the proposed DSH scheduling algorithm can also be applied to partially connected parallel processor systems with non-identical processors of different speeds.

An efficient solution to the optimal scheduling problem described above is a significant contribution to parallel processing system research, because the solution enables more effective and widespread use of parallel processing systems for general purpose applications. The new scheduling heuristic, DSH, produces near-optimal solutions with speedup ratios several times better than the earlier solutions (the Hu and Yu heuristics [Hu 61], [Yu 84]). Also, the heuristic can be used in related disciplines to solve problems in

operations research, job assignment in industrial management, and programmer work assignment in software engineering.

1.4 Outline of the Dissertation

The research is organized into six chapters as follows: The second chapter gives background and details on the scheduling problem. Problems and difficulties in parallel processor precedence task scheduling with communication delay are presented, including the parallelism and communication trade-off problem. Heuristic scheduling algorithms and their complexities are described in Chapter Three, along with examples of scheduling of sample task graphs. Chapter Four presents results from the proposed heuristics and compares these results with selected scheduling algorithms described in the literature. The software used to conduct these experiments includes a random task graph generator, Hu's algorithm, Yu's algorithm, and a report generator. Chapter Five presents "grain packing", the method used to define optimal grain sizes for a user application program on a specific parallel processing system using DSH. Finally, Chapter Six contains a summary, conclusions, and recommendations for future research.

y = communication delay, x = node number, z = node size

Program Parallel;
Var a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q : real;
Begin
  a := 1; b := 2; c := 3; d := 4; e := 5; f := 6;  {Node 1}
  g := a*b;   {Node 2}
  h := c*d;   {Node 3}
  i := d*e;   {Node 4}
  j := e*f;   {Node 5}
  k := d*f;   {Node 6}
  l := j*k;   {Node 7}
  m := 4*l;   {Node 8}
  n := 3*m;   {Node 9}
  o := n*i;   {Node 10}
  p := o*h;   {Node 11}
  q := p*g;   {Node 12}
end.

Figure 1.1 An Example of Program Represented by Task Graph.

CHAPTER II
THE PARALLEL PROCESSOR SCHEDULING ENVIRONMENT

2.1 Background: Scheduling Definition, Function and Goals

A scheduler is an algorithm that takes as input a task graph and the number of available processors, and produces as output an allocation and execution order of the tasks to be processed by each processor. An optimal schedule results when the scheduler guarantees the shortest time to complete all tasks in a graph. However, it is known that finding an optimal schedule is a difficult problem and generally intractable. Consequently, restrictions are added to reduce the computational complexity of this class of problems. Even so, most of these problems are still classified as NP-hard problems [Coff 76], as shown in Table 2.1.

Scheduling problems can be classified by the type of jobs that schedulers distribute, the type of processor system that processes the jobs, and the restrictive conditions added to the problem. First, jobs vary from a set of independent tasks to a set of precedence-constrained tasks. The latter can be modeled by a precedence graph, where each node represents a task and each arc represents a precedence constraint between the nodes connected by the arc. An example of a precedence task graph is shown in Figure 2.1. Second, processor systems differ in their connection topology, number of processors, communication protocol among

processors, and the similarity of the processors in the system. Third, restrictive conditions can be based on characteristics of the scheduler itself (preemption or non-preemption). The difficulty of the scheduling problem differs among these classes, as shown in Table 2.1.

Critical Path Nature of the Task Graph and Gantt Chart

The length of a path in a task graph is the sum of all node and branch sizes along the path. The critical path of a task graph is the longest path from the start node to the exit node. For example, there are two identical longest-length paths in Figure 2.1, so there are two critical paths: one through nodes 1, 5, 7, 8, 9, 10, 11, 12 and the other through nodes 1, 6, 7, 8, 9, 10, 11, 12. The lower bound on the execution time of the task graph is the length of the critical path. That is, a program cannot execute to completion in less time than the length of the critical path, regardless of the number of processors in the system.

Processor run time is given by the difference (t_stop - t_start), where t_start and t_stop are the times a processor starts and finishes node execution. The execution time of a task graph on a parallel processor system is the longest run time among the processors in the system, given by max_i (t_stop,i - t_start,i). Processor run time includes both communication delay time and execution time. A form of Gantt chart [Clar 52], as shown in Figure 2.1, is used to show the scheduling of nodes on processors. From the scheduling

Gantt chart in Figure 2.1, the longest run time belongs to processor PE1, and the execution time of the task graph is 10 time units. Notice that this time includes the communication delay time (2 units, from node 1 to 5 and node 5 to 7) plus the node execution time in the critical path (8 units: nodes 1, 5, 7, 8, 9, 10, 11, 12).

Difference between Allocation and Scheduling

Figure 2.1 also shows how an allocation might differ from a schedule. This difference is due to the bound on the execution time defined by the critical path of the task graph. An allocation defines where each node is to execute, but not the order of execution of each node. The order of execution is dictated by the availability of input messages to each node on each processor. That is, the node that receives all of its messages begins execution first. But this execution order may not be the optimal one. In Figure 2.1, nodes 2, 3, 4, 5, and 6 are the successor nodes of node 1. Since the directed graph does not specify the order of sending communication messages (this order depends on the communication protocol), nodes 2, 3, and 4 receive their messages either before or at the same time as node 5. If they are assigned to the same PE (PE2) and an execution order is not specified, nodes 2, 3, and 4 may be scheduled for execution before node 5, as shown in the allocation Gantt chart. This results in a longer task graph execution time because node 5 is on the critical path, and any execution before node 5 (nodes 2, 3, 4) is included in the overall run time. But if the scheduler specifies the order of node execution so

node 5 can be assigned and executed before nodes 2, 3, and 4, then an overall shorter execution time is obtained, as shown in the Scheduling Gantt chart of Figure 2.1.

List Scheduling

One class of scheduling heuristics, into which many parallel processing schedulers fall, is list scheduling. In list scheduling each task (node) is assigned a priority, and a list of nodes is constructed in decreasing priority order. Whenever a processor is available, a ready node with the highest priority is selected from the list and assigned to the processor. If more than one node has the same priority, a node is selected randomly. The schedulers in this class differ only in the way each scheduler assigns priorities to nodes. Different priority assignments result in different schedules because nodes are selected in a different order. The comparison between different node priorities (level, co-level, random) has been studied by Adam et al. [Adam 74]. The comparison suggests that using the level number as the node priority comes nearest to optimal.

2.2 The Parallel Processor Scheduling Problem

The principal difficulties encountered when designing schedulers of this type are reviewed before presenting the new heuristic schedulers. The first two problems are due to communication delay, and the third problem is due to the alteration of critical paths of a task graph, which is a subset of the dynamic scheduling problem discussed in Chapter One.
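The list-scheduling loop described above can be sketched as follows. This is an illustrative skeleton, not the thesis pseudocode: it uses level numbers as priorities and, for simplicity, assumes zero communication delay (the case addressed by later sections).

```python
import heapq

def list_schedule(nodes, preds, level, exec_time, num_pe):
    """Level-priority list scheduling with zero communication delay.
    Returns a list of (node, pe, start, finish) tuples."""
    succ = {n: [] for n in nodes}
    for n in nodes:
        for p in preds[n]:
            succ[p].append(n)
    indeg = {n: len(preds[n]) for n in nodes}
    # Max-priority queue: negate the level so the highest level pops first.
    ready = [(-level[n], n) for n in nodes if indeg[n] == 0]
    heapq.heapify(ready)
    pe_free = [0] * num_pe          # time each PE becomes free
    finish = {}                     # node -> finish time
    schedule = []
    while ready:
        _, n = heapq.heappop(ready)
        pe = min(range(num_pe), key=lambda p: pe_free[p])
        start = max([pe_free[pe]] + [finish[p] for p in preds[n]])
        finish[n] = start + exec_time[n]
        pe_free[pe] = finish[n]
        schedule.append((n, pe, start, finish[n]))
        for s in succ[n]:           # newly ready successors enter the queue
            indeg[s] -= 1
            if indeg[s] == 0:
                heapq.heappush(ready, (-level[s], s))
    return schedule

# Fork graph: node 1 precedes nodes 2 and 3; node 3 has the higher level.
sched = list_schedule([1, 2, 3], {1: [], 2: [1], 3: [1]},
                      {1: 3, 2: 1, 3: 2}, {1: 1, 2: 1, 3: 2}, 2)
```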

Communication and Parallelism Trade-off Problem

When there is no communication delay between tasks, all ready tasks can be allocated to all available processors so that the overall execution time of the task graph is reduced. This situation may occur in a shared-memory parallel processor where messages are passed at memory-cycle speeds. In fact, this is the basis of earlier schedulers that do not consider communication delays, for example Hu's heuristic [Hu 61]. On the other hand, when there is a communication delay, scheduling must be based on both the communication delay and the point in time when each processor is ready for execution. It is possible for ready nodes with long communication delays to end up assigned to the same processor as their immediate predecessors, as shown in Figure 2.2, Gantt chart A. Node 3's start time on PE2 is later than node 3's start time on PE1, since the communication delay of node 3 is greater than the execution time of node 2. So node 3 should be assigned to PE1, which holds its immediate predecessor, node 1. Conversely, as shown in Gantt chart B, if the communication delay is less than the execution time of node 2, node 3 should be assigned to PE2 instead. Hence, adding a communication delay constraint increases the difficulty of arriving at an optimal schedule, because the scheduler must examine the start time of each node on each available processor in order to select the best one.
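The decision illustrated by Figure 2.2 reduces to comparing earliest start times across candidate PEs. A sketch, with illustrative names and numbers chosen to mimic the two Gantt charts (node 1 of size 1 and node 2 of size 2 already on PE1):

```python
def earliest_start(node, pe, pe_ready, preds, finish, assigned_pe, delay):
    """Earliest start of `node` on `pe`: the later of the PE's ready time
    and the arrival of the last predecessor message. A message costs
    nothing when the predecessor runs on the same PE."""
    msg_ready = max(
        (finish[p] + (0 if assigned_pe[p] == pe else delay[(p, node)])
         for p in preds[node]),
        default=0,
    )
    return max(pe_ready[pe], msg_ready)

pe_ready = {"PE1": 3, "PE2": 0}     # PE1 busy with nodes 1 and 2 until t=3
finish = {1: 1}
assigned = {1: "PE1"}
preds = {3: [1]}
# Delay 4 > node 2's size: PE1 starts node 3 at 3, PE2 only at 5 -> pick PE1.
# Delay 1 < node 2's size: PE2 starts node 3 at 2 -> pick PE2.
```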

As shown above, it would be a mistake to always increase the amount of parallelism by simply starting each node as soon as possible. Distributing parallel nodes to as many processors as possible tends to increase the communication delay, which contributes to the overall execution time. In short, there is a trade-off between taking advantage of maximal parallelism and minimizing communication delay. This problem has not been widely recognized; it will be called the max-min problem for parallel processing. The task graph in Figure 2.3 demonstrates the dramatic effect of the max-min problem. If the communication delay D3 between node 1 and node 3 is less than the execution time of node 2, node 3 is assigned to PE2 in order to begin its execution sooner. Because node 2 and node 3 are the immediate predecessors of node 4, and they are assigned to different processors, node 4 cannot avoid the communication delay from one of its immediate predecessors. Thus, the execution time of this task graph is the sum of the execution times of nodes 1, 2, and 4, plus communication delay Dx or Dy, depending on where node 4 is assigned. But what happens if node 4's communication delays are larger than node 3's execution time? Then assigning node 3 to PE1 results in a shorter task graph execution time. This is so even if node 3 finishes its execution later than in the previous assignment, as shown in Gantt chart C. Current communication delay scheduling heuristics try to take

advantage of parallelism and reduce communication delay, but none of the current heuristics solve the max-min problem. A new method is proposed for the solution of the max-min problem: duplicating tasks where necessary to reduce the overall communication delay and maximize parallelism at the same time. The method is called DSH.

Grain Size Problem

Another problem closely related to the max-min problem is the grain size problem. The challenge of this problem is to determine the "best" node size for every node in the task graph such that the task graph execution time is minimized. The size of a node can be altered by adding or removing tasks from the node. Such nodes are called grains, to indicate the packing of tasks within a node. If a grain is too big, parallelism is reduced because potentially concurrent tasks are grouped into a node and executed sequentially by one processor. If a grain is too small, more overhead in the form of context switching, scheduling time, and communication delay is added to the overall execution time. The solution to the max-min problem can be used to solve the grain size problem, since the grain size problem also involves a trade-off between parallelism (small grains) and communication (large grains). As shown in Figure 2.4, the small grain scheduler can take advantage of parallelism, but the large grain scheduler cannot

because there is no parallelism in the large grain task graph. Also, for large grains, the order of execution of the small tasks grouped inside a larger grain is fixed before schedule time, and that order may not be the optimal one. As shown in Figure 2.4, Gantt chart D, fixing the order of execution of the small grains too early in the algorithm results in sequential execution of the whole task graph. Figure 2.4 shows the technique used to define the best grain size. The grain size is defined by grouping the scheduled tasks obtained from the small grain schedule shown in Gantt chart A. This forms the larger grain schedule shown in Gantt chart B. The grouping decision depends on the underlying parallel processor system hardware and software. Usually, the more we can group smaller grains, the shorter the task graph execution time, because of the reduction in overhead.

Level Alteration and the Critical Path

Another important scheduling problem caused by the introduction of non-zero communication delays is the alteration of node levels and its impact on critical path calculation. Any heuristic that uses level numbers or critical path length faces this problem. The level of a node is defined as the length of the longest path from the node to the exit node. This length includes all node execution times and all communication delay times along the path. Node level was first used in scheduling by Hu [Hu 61].

Adam [Adam 74] showed that among all priority schedulers, level priority schedulers come closest to the optimal schedule. Unfortunately, level numbers do not remain constant when communication delays are considered, because the level of each node changes as the length of the path leading to the exit node changes. The path length varies depending on communication delay, and the communication delay changes depending on task allocation. Communication delay is zero if tasks are allocated to the same PE, and non-zero if tasks are allocated to different PEs. We call this the level number problem for parallel processor scheduling. The scheduling techniques used in this paper do not solve the level number problem, nor does any other known scheduling technique. The node level in this paper is the sum of the node sizes along the path to the exit node, excluding the communication delay. The reason that this is an unsolved problem is as follows. The level number is used as the node priority, which has to be defined before schedule time in order to construct a schedule. But the communication delay, which is part of a level number as previously described, is not defined until nodes are scheduled, because the communication delay is a function of the assigned PE. A better approximation of the level number may be obtained by iteration: schedule, then calculate node levels, schedule again, and so on. But the time complexity would be tremendously increased, and the resulting level number would still be only an approximation.
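The delay-free level used in this work is a longest-path computation over node sizes only. A minimal sketch over an illustrative graph:

```python
def node_levels(nodes, succ, size):
    """Level of a node: length of the longest path from the node to the
    exit node, counting node sizes only (communication delays excluded)."""
    memo = {}

    def level(n):
        if n not in memo:
            memo[n] = size[n] + max((level(s) for s in succ.get(n, [])),
                                    default=0)
        return memo[n]

    return {n: level(n) for n in nodes}

# Diamond graph 1 -> {2, 3} -> 4; node 1's level follows the heavier path.
levels = node_levels([1, 2, 3, 4], {1: [2, 3], 2: [4], 3: [4]},
                     {1: 1, 2: 2, 3: 3, 4: 1})
```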

[Figure 2.1 appears here in the original: a sample task graph (x = node number, y = communication delay, z = node size) with its Allocation Gantt chart and Scheduling Gantt chart for PE1 and PE2.]

Figure 2.1 Comparison between Allocation and Scheduling.

[Figures 2.2 and 2.3 appear here in the original. Figure 2.2 shows Gantt charts A (Dx > node 2) and B (Dx < node 2) for PE1 and PE2; Figure 2.3 shows Gantt charts A, B, and C for the max-min example.]

Figure 2.2 The Allocation Consideration due to Communication Delay.

Figure 2.3 The Comparison between Parallelism and Communication Delay.

[Figure 2.4 appears here in the original: Gantt chart A (small grain scheduling), Gantt chart B (after grain packing, i.e., grouping), and Gantt chart D (large grain scheduling).]

Figure 2.4 The Comparison of Large Grain versus Fine Grain Scheduling.
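The grouping step of Figure 2.4 (forming Gantt chart B from Gantt chart A) can be sketched as follows. The (node, PE, start, finish) tuple format and the one-grain-per-PE policy are assumptions of this sketch; as noted above, real grouping decisions also depend on the target hardware and software.

```python
def pack_grains(schedule):
    """Group a fine-grain schedule into larger grains: the tasks assigned
    to each PE, taken in start-time order, become one grain per PE.
    `schedule` is a list of (node, pe, start, finish) tuples."""
    grains = {}
    for node, pe, start, _finish in sorted(schedule, key=lambda t: t[2]):
        grains.setdefault(pe, []).append(node)
    return grains

# Fine-grain schedule: nodes 1 and 2 on PE 0, node 3 on PE 1.
grains = pack_grains([(1, 0, 0, 1), (3, 1, 1, 3), (2, 0, 1, 2)])
```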

Graph Topology   Task Execution Time   Number of Processors (P)   Complexity
tree             identical             arbitrary                  O(N)
arbitrary        identical             2                          O(N)
arbitrary        identical             arbitrary                  NP-hard
arbitrary        1 or 2 time units     fixed, P >= 2              NP-hard
arbitrary        arbitrary             arbitrary                  strong NP-hard

Table 2.1 The Complexity Comparison of Scheduling Problems.

CHAPTER III

DUPLICATION SCHEDULING HEURISTIC (DSH) & INSERTION SCHEDULING HEURISTIC (ISH)

3.1 Introduction

Two new scheduling heuristics are proposed to solve the scheduling problem discussed in Chapter I: the insertion scheduling heuristic (ISH) and the duplication scheduling heuristic (DSH). Both heuristics are improvements over Hu's heuristic that address the communication delay and max-min problems. ISH is essentially Hu's heuristic with an improvement for the communication delay problem: it inserts tasks into available communication delay time slots. However, ISH does not solve the max-min problem. DSH solves the max-min problem using task duplication. The inputs of each heuristic are the task graph (directed task graph model) and the parallel processing system as described earlier. The output is a Gantt chart of all PEs in the system. The Gantt chart shows the order of execution of each task of the task graph on each PE. Some definitions and terminology pertaining to these heuristics, with an example from Figure 3.1, are given as follows.

Definitions

A node is ready if all of its immediate predecessors have been assigned. Node 8 is ready since node 7 was assigned to PE2. The ready time of a PE (trp) is the time when a PE has

finished its assigned task and is ready to execute the next task. The ready times for PE1, PE2, PE3 are 10, 14, 9 respectively. The message ready time of a node (tmr) is the time when all the messages to the node have been received by the PE containing the node. This time is determined by the latest communication delay among all the messages sent from the node's immediate predecessors. The immediate predecessor of the node that has the latest communication delay is called the latest immediate predecessor (LIP). The times that PE2 receives messages from nodes 4, 5, 6 are 12, 13, 11 respectively. Hence tmr = 13 and LIP = node 5. The starting time (tsn) of a node for a PE is the earliest time when the node can start its execution on the PE. It is the larger of the ready time of the PE and the message ready time of the node. Node 7's tsn is 13. The idle time slot (tidp) is the time interval between the PE ready time and the assigned node's starting time, if the node's starting time is greater than the PE ready time; otherwise the idle time is zero. PE2's tidp is 11 to 13.

tidp = 0, if node start time <= PE ready time;
otherwise tidp = node start time - PE ready time

The finishing time (tf) of a node for a PE is the time when the node finishes its execution on that PE. It is the starting time of that node plus the size of that node.

tf = tsn + node size
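Taken together, these timing definitions can be sketched as one small function; the usage numbers reproduce the Figure 3.1 example for node 7 on PE2 (tmr = 13, LIP = node 5, tsn = 13, idle slot 11 to 13, tf = 14).

```python
def node_timing(pe_ready, msg_arrival, node_size):
    """Sketch of the timing definitions above. `msg_arrival` maps each
    immediate predecessor to the time its message reaches the candidate PE.
    Returns (tmr, LIP, tsn, idle_slot, tf)."""
    lip = max(msg_arrival, key=msg_arrival.get)   # latest immediate predecessor
    tmr = msg_arrival[lip]                        # message ready time
    tsn = max(pe_ready, tmr)                      # node starting time
    idle = (pe_ready, tsn) if tsn > pe_ready else None
    tf = tsn + node_size                          # finishing time
    return tmr, lip, tsn, idle, tf

# Node 7 on PE2: PE ready at 11, messages from nodes 4, 5, 6 arrive at
# 12, 13, 11; node size 1.
timing = node_timing(11, {4: 12, 5: 13, 6: 11}, 1)
```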

PE2's finishing time is 14. The ready queue is a queue of ordered ready nodes. The order is defined by the node priority; a node's priority is its level. The highest-level node is first in the queue, and so it is scheduled first; the lowest-level node is scheduled last. Nodes at the same level are ordered according to the number of their immediate successors; the node with the greatest number of immediate successors is scheduled first. The assigned node (AN) is the highest-priority node selected from the ready queue. After node 7 is assigned to PE2, node 8 is inserted into the ready queue and becomes the assigned node. The assigned PE is the PE chosen to execute the assigned node. The assigned PE for node 7 is PE2. The first ready PE (PERF) is the first PE in the set of all PEs to become ready after each assigned node is scheduled on the assigned PE. After node 7 is assigned, PE3 is PERF and is ready at time unit 9. ISH and DSH share the same list scheduling heuristic, shown in pseudocode in Figure 3.2. The heuristic tries to minimize the schedule length by minimizing the finishing time of each assigned node. First, the level of each node in the task graph is calculated and used as the node's priority. (An example of node levels is shown in Figure 3.4.) The ready queue is then initialized by inserting all nodes with no immediate predecessors. This

usually means that only one node is inserted into the ready queue initially. Then the node at the top of the queue (with the highest priority, i.e., the highest level) is assigned to a PE. The PE is selected by the processor selection routine, Locate_PE. For the first node, any PE can be selected because no PE has been assigned a task yet. The heuristic continues assigning nodes and updating the ready queue until all the nodes in the task graph are assigned. Each time a node is assigned, the assigned PE is marked; then the marked PE with the earliest ready time is located. The last node assigned to that PE is called the event node and is used to update the ready queue. The reason for marking and unmarking a PE is to prevent one event node from updating the ready queue a second time. Each time an event node is selected, the PERF is unmarked, so that event node cannot be used to update the ready queue again. A PE is marked each time a new node is assigned to it, and unmarked each time an event node is chosen. Pseudocode for the Update_R_queue routine is shown in Figure 3.3. The ready queue is updated by inserting new ready nodes chosen after the assignment of an event node. The new ready nodes are selected as follows. The count of unassigned immediate predecessors of each immediate successor of the event node is decremented by one. If the count reaches zero, that immediate successor is chosen as a new ready node and is inserted into the ready queue. Decrementing is repeated for all immediate successors of the event node. The new ready nodes are placed in the ready queue according to their priority, thus maintaining the order of the queue.
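The Update_R_queue step can be sketched as follows. Names are illustrative, and for brevity the sketch breaks level ties arbitrarily rather than by the number of immediate successors as the thesis specifies.

```python
import bisect

def update_ready_queue(ready, event_node, succ, pred_count, level):
    """Decrement each immediate successor's remaining-predecessor count;
    a successor whose count reaches zero becomes ready and is inserted
    into the queue in priority (level) order, highest level first."""
    for s in succ.get(event_node, []):
        pred_count[s] -= 1
        if pred_count[s] == 0:
            # Negated level keeps the front of the list at the highest level.
            bisect.insort(ready, (-level[s], s))

# Event node 1 has successors 2 and 3; only node 2 becomes ready.
ready = []
pred_count = {2: 1, 3: 2}
update_ready_queue(ready, 1, {1: [2, 3]}, pred_count, {2: 5, 3: 7})
```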

A node is assigned by getting a ready node from the front of the ready queue. Thus, the node with the highest level (and, among those, the maximum number of immediate successors) is assigned first. Then the processor selection routine (Locate_PE) selects the PE, and the Assign_node routine assigns the node to the selected PE. Figure 3.4 illustrates a step-by-step example of the main list scheduler, along with the scheduling result in Gantt chart form, the ready queue, and the task graph at each scheduling step, given a sample task graph and two PEs in the parallel processing system. The Locate_PE routine used in this example selects the assigned PE that has the earliest start time for each assigned node, without any task duplication but considering the communication delay. Assign_node, in this example, assigns an assigned node to the selected PE at the start time calculated by Locate_PE, without any task insertion.

3.2 Insertion Scheduling Heuristic (ISH)

ISH is essentially Hu's heuristic with two improvements. First, Hu assumed no communication delay, so the first improvement is a processor selection routine that takes communication delay into account. The second improvement is in the Assign_node routine, which makes use of the idle time of a PE by trying to assign ready nodes (hole tasks) to idle time slots. ISH tries to minimize the schedule length by utilizing idle time. However, ISH

does not solve the max-min problem of trading communication delay for parallelism. The Locate_PE routine returns the PE (PEL) that can start executing the assigned node earliest, together with the assigned node's start time on PEL, ST. If there is no communication delay, PEL equals PERF. But in the presence of communication delays, the PEs containing the assigned node's immediate predecessors (the PEs that communicate with the assigned node) must be considered as well, because communication delay is assumed to be zero when the message source and destination tasks are located on the same PE. On the other hand, the communication delay can be the major part of an assigned node's starting time. Therefore, the assigned PE is the first PE that can start executing the assigned node, selected from among PERF and the PEs that hold the assigned node's immediate predecessors. This means that minimum schedule length is enforced instead of load balancing, because the assigned PE is selected by the assigned node's starting time. The details of the Locate_PE routine are shown in Figure 3.5. The purpose of the Assign_node routine is not only to assign node AN to the PEL column of the Gantt chart but also to insert hole tasks into the idle time slots created by communication delays. Instead of changing the node priority to select the assigned node that creates the smallest idle time, the node with the highest level is still the highest-priority node and is assigned first. But the idle time created by this strategy is exploited by searching through all the

ready nodes from the front of the ready queue to find nodes (hole tasks) that can be inserted into the idle time slot. Searching continues until the idle time slot is filled or no hole task is found. The details of the Assign_node routine are shown in Figure 3.6. Figure 3.7 contains a step-by-step example of the Locate_PE and Assign_node procedures used by the ISH scheduler, along with the intermediate scheduling results in Gantt chart form and the ready queue, given a sample task graph and two PEs. Figure 3.7 starts with Locate_PE for node 7 (nodes 1, 4, 5, and 6 have already been assigned, Gantt chart A). Locate_PE returns PE1 as the assigned PE, with node 7's start time at time unit 4 and an idle time slot from time unit 3 to 4. Then Assign_node tries to assign a hole task to the idle time slot, and finally assigns node 7 to PE1, starting at time unit 4. The hole task assignment (step 2 of Figure 3.6) starts by searching from the top of the ready queue for a ready node that can start execution within the idle time slot. The first hole task candidate from the ready queue is node 2. Although node 2 can start execution within the idle time slot, it can start execution earlier on PE3, so the hole task search continues to the next node, node 3. Node 3 is selected as a hole task and assigned to the idle time slot (since it can start execution within the idle time slot and cannot start execution earlier on another PE). Then node 3 is used to update the ready queue (Update_R_queue), and the idle time slot is modified due to the assignment of node 3.
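The hole-task search just described can be sketched as follows. The message-ready-time criterion is the ISH1 rule discussed later in this section; requiring the task to also finish inside the slot is an assumption of this sketch, and the numbers only loosely mimic the node 2 / node 3 walkthrough above.

```python
def find_hole_task(ready_queue, idle_start, idle_end, msg_ready, exec_time):
    """Scan ready nodes in priority order; accept the first node that can
    start inside the idle slot and whose message ready time is no sooner
    than the slot's start (a heuristic proxy for 'cannot start earlier on
    another PE'). Returns (node, start) or (None, None)."""
    for node in ready_queue:
        start = max(idle_start, msg_ready[node])
        if (msg_ready[node] >= idle_start
                and start + exec_time[node] <= idle_end):
            return node, start
    return None, None

# Idle slot 3-4: node 2's message is ready at 1 (rejected), node 3's at 3.
hole = find_hole_task([2, 3], 3, 4, {2: 1, 3: 3}, {2: 1, 3: 1})
```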

After the modification, the idle time slot is empty and the hole task assignment step (step 2) is done. Then node 7 is assigned at its start time on PE1. Finally, Figure 3.7 shows the improvement of the final ISH schedule over the schedule obtained without hole task insertion. As the task graph scheduling example in Figure 3.7 suggests, there are choices in implementing ISH. Fifty task graphs, with the number of PEs varied from 2 to 70, were simulated for a total of 1750 runs in order to decide among those choices. The choices and their simulation results are as follows:

1: NISH, the non-insertion scheduling heuristic, is level-list scheduling with communication delay but no task insertion (no step 2 in Figure 3.6).

2: ISH0 is about the same as ISH; the difference is in the step shown in Figure 3.8. The hole task for ISH0 is any ready task that can start execution within the idle time slot, even if it could start execution earlier on another PE.

3: ISH1 is ISH. There are two criteria for a ready task to be a hole task. The first criterion is that the ready task must be able to start execution within the idle time slot. The second criterion is that the ready task must not be able to start execution earlier on another PE. ISH1 heuristically enforces the second criterion by inserting only ready tasks whose message ready time is no sooner than the idle time

slot's start time, since a node with an earlier message ready time has a tendency to start execution earlier on another PE (in Figure 3.7, node 2 is not a hole task since its message ready time on PE1 is time unit 1 and the idle time slot starts at time unit 3).

4: ISH2 has the same criteria as ISH1 except that it enforces the second criterion more strictly (at greater runtime cost). ISH2 tests the second criterion by performing Locate_PE on each candidate hole task to ensure that it cannot start earlier on another PE.

From the simulation results in Figure 3.9, we can see that:

1: ISH2 has the best speedup ratio in most cases (except at PE = 6-12, where ISH1 is better). On average, the descending order of speedup ratio is ISH2, ISH1, NISH, ISH0.

2: ISH0 is better than NISH only when the number of PEs is small (2-6). The main reason is that ISH0 does not enforce the second criterion; hence a hole task may have been able to start execution earlier on another PE than in the idle time slot, finally resulting in a longer overall task graph execution time than the scheduler without task insertion (NISH). The reason ISH0 is better for a small number of PEs is that it can squeeze some extra PE execution time out of otherwise wasted idle time, and for a small number of PEs (relative to the number of nodes in the task graph) most PEs are busy, so the saved idle time is beneficial (as seen in Figure 3.7). On the other hand, when the number of PEs is large, most PEs are idle, so it is risky to try to save idle time if hole tasks can start execution on

the other PE earlier. In conclusion, ISH1 is selected as ISH, since its speedup performance is better than NISH's and its time complexity is less than ISH2's (ISH2 gives only a small improvement, 4 percent, over ISH1).

3.3 Duplication Scheduling Heuristic (DSH)

DSH is an improvement over ISH because it duplicates tasks to reduce the cost of communication. The duplication of a task is called the task duplication concept (TDC). Task duplication has not been explored by other researchers. TDC solves the max-min problem by duplicating the task nodes that influence the communication delay. As shown in Figure 3.10, node 1 is duplicated to run on both PE1 and PE2. The duplication decreases the starting time of node 3 on PE2, so that the parallelism of nodes 2 and 3 is fully exploited. Node 2 and node 3 run in parallel on different PEs, and node 5 can start execution sooner than if node 2 and node 3 were assigned to the same PE. TDC takes advantage of parallelism and reduces communication delays at the same time. Since node 3 is also duplicated to run on PE1, there is no communication delay for node 4 except the time it takes to run the duplicated node 3 (for "fine grain" task graphs, the communication delay is usually larger than the duplicated nodes' execution time). TDC is used in a task duplication heuristic (TDP), which is shown in detail in Figure 3.11. The inputs to TDP are an assigned node and a PE that is a candidate for the assigned PE (PEc). TDP

calculates the starting time ST of the assigned node and constructs the duplicated task list, CTlst, for that starting time, if there are any duplicated tasks. The duplicated task list is a list of duplicated tasks and their starting times on the candidate PE. TDP begins by calculating the message ready time of the assigned node and finding the latest immediate predecessor (LIP) that causes that starting time. Then, if the assigned node's starting time is greater than PEc's ready time (there is a communication delay) and the LIP was not assigned to PEc, TDP tries to minimize the communication delay by copying predecessor(s) of the assigned node to PEc, with the hope that the copied node will reduce the communication delay. To copy the LIP, there are two cases, depending on where the LIP is located. In the first case, the LIP was assigned to another PE in the system (not PEc). TDP tries to copy the LIP node into the idle time slot of PEc, since the duplication of the LIP may improve the start time of the assigned node. In the second case, the LIP has already been duplicated in the idle time slot of PEc. To start the assigned node sooner, the LIP has to start its execution sooner. Thus, CTlst is searched to find the node that affects the start time of the LIP node. The search starts with the LIP of the LIP of the assigned node. If that node is assigned to some other PE, the search process stops and that node is the search node. Otherwise, the process searches deeper levels until it finds a LIP of the LIP node that is located on some other PE, and

that node becomes the search node. Once the search node is found, it is copied into the idle time slot of PEc. Then all the duplicated task nodes that start after the copied node are removed and re-copied, due to the duplication of the search node. The reason for re-copying is that the start times (and the order in CTlst) of the nodes located after the copied search node may change due to the presence of the search node. The re-copy may indirectly cause the LIP node's starting time to decrease, so the assigned node can start execution sooner. The duplication process continues until duplication fails or the LIP is already assigned to PEc (no reason to copy). If some idle time remains, hole tasks are inserted into the time slot as in the ISH heuristic. The details of the copy routine (COPY_LIP) are shown in Figure 3.12. To copy a node, the starting time of the copied node and the LIP node of the copied node have to be determined first. If the copied node's starting time is within the idle time slot and no duplicated node is assigned at that time, the duplication is successful and the LIP of the copied node is recorded for the purpose of searching, as previously described. If the copied node's starting time is not in the idle time slot, the LIP node of the copied node has to be duplicated, if possible. Otherwise, the copy fails and the duplication process is terminated at that point. If the idle time slot is large enough, TDP will continue to duplicate the predecessors of the assigned node in order to make the node start sooner.
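The core TDC decision, whether copying the LIP into the candidate PE's idle time helps, can be sketched numerically. All names and numbers below are illustrative, not the thesis routine; the first usage case mirrors Figure 3.10, where duplicating node 1 (size 1, message arriving remotely at time 6) lets node 3 start at time 1 instead of 6.

```python
def start_with_duplication(pe_ready, remote_arrival, lip_ready_local, lip_size):
    """Assigned node's earliest start on the candidate PE:
    - without copying: wait for the LIP's message from the remote PE;
    - with copying: first run a local duplicate of the LIP.
    TDP keeps whichever alternative is sooner."""
    start_no_copy = max(pe_ready, remote_arrival)
    start_copy = max(pe_ready, lip_ready_local) + lip_size
    return min(start_no_copy, start_copy)

# Duplication wins: remote message at 6 vs. local duplicate finishing at 1.
best = start_with_duplication(0, 6, 0, 1)
# Duplication loses: a large LIP (size 5) is slower than waiting until 2.
worst = start_with_duplication(0, 2, 0, 5)
```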

TDP is used in Locate_PE and Assign_node in order to find the starting time and the duplicated task list for each assigned node on any specified PE. Figure 3.14 illustrates a step-by-step example of the CTlst created by TDP, given the sample task graph in Figure 3.13, three PEs in the parallel processing system, and the intermediate scheduling result of DSH in Gantt chart A. In Figure 3.14, all nodes in the task graph have already been assigned except the last node (node 11, the exit node, Gantt chart A). Locate_PE of DSH has to find the start time of node 11 by running TDP on all three PEs in order to compare them and make the assignment decision. Figure 3.14 shows only TDP (CTlst) on PE1, which is also the PE that Locate_PE will select for node 11's assignment. The Locate_PE of DSH is shown in Figure 3.15. It is basically the same as that of ISH; the difference is the use of TDP to find the assigned node's starting time, instead of calculating it from the communication delay without TDC. Assign_node's function is the same as that of ISH except for the use of TDP to find the assigned node's starting time along with the duplicated task list. The details of Assign_node are shown in Figure 3.16. An example of a schedule computed by DSH is shown in Figure 3.13, Gantt chart B (final assignment). This example illustrates how DSH solves the max-min problem. From Gantt chart A (after node 10's assignment), we see that DSH has taken full advantage of all

existing parallelism within the task graph, and no PE is idle due to communication delay. This is because no messages are passed, owing to the duplication of tasks. At this point, nodes 2, 3, 4, nodes 5, 6, 7, and nodes 8, 9, 10 are scheduled to execute in parallel. But without task duplication (ISH or NISH), none of this parallelism would be exploited, because all nodes would run on the same PE due to the extremely high communication delay. Node 11's assignment illustrates the max-min problem. Instead of scheduling node 11 at time 11, nodes 6, 9, 7, and 10 are duplicated so that node 11 can be scheduled at time 10. In this extreme case, the parallelism of nodes 8, 9, 10 and nodes 5, 6, 7 gives no improvement in the overall task graph execution time. On the contrary, that parallelism is harmful to node 11's assignment. Consequently, DSH has to duplicate those nodes and schedule them to run serially on PE1 to avoid the communication delays. Before node 11's assignment, DSH tried to take advantage of as much parallelism as possible, even though the parallelism is harmful to future node assignments (node 11). The duplication of nodes 6, 9, 7, 10 at node 11's assignment can be viewed as feedback (correction) from previous scheduling. From this feedback DSH decides that maximum parallelism should not be exploited; only the parallelism of nodes 2 and 4 is. In short, DSH's scheduling strategy is to exploit all possible parallelism first, but when it discovers that the parallelism is harmful, DSH duplicates nodes and schedules them to

execute serially. Notice the redundant node allocations in Gantt chart B (all nodes on PE2, and nodes 2, 7, 10 on PE3). A simple clean-up algorithm could be constructed to get rid of the redundant tasks, resulting in the final Gantt chart C. Alternatively, the redundancies could be used for fault tolerance. This simple example was intentionally constructed to show the max-min problem. It shows the feedback case and the creation of a duplicated task list (CTlst). If one of the communication delays from nodes 8, 9, or 10 to node 11 were less than 6, the result would show great improvement over scheduling without duplication (in this case, assigning all nodes to one PE, for a task graph execution time of 12), and the parallelism of nodes 8, 9, 10 and nodes 5, 6, 7 would be beneficial to node 11's assignment.

3.4 Complexity of ISH and DSH

The differences between ISH and DSH are in the processor selection routine (Locate_PE) and the way a node is assigned to the selected PE (Assign_node). The node priority (highest level) and the main algorithm are the same. Hence, the time complexity of the main algorithm is described first, where P = number of PEs in the system, N = number of nodes in the task graph, and B = number of branches in the task graph. For the main algorithm in Figure 3.2: the complexity of the node level calculation (step 1) is O(B). Since N nodes are taken from a

ready queue, the complexity of steps 4 and 7.1 (taking a node from the ready queue) is O(N). N nodes are also inserted into the ready queue, so node insertion is O(N); but before each ready node is inserted, every one of its predecessors must have been assigned to some PE, and checking predecessor assignments traverses each branch once, so the complexity of steps 3 and 6 is O(N+B). Each PE is initialized in step 2 with complexity O(P). Locating the first-ready PE (PERF) costs O(P) before each node assignment, or O(NP) over all assignments. The complexity of the main algorithm is therefore O(B + N + P + NP + O(Locate_PE) + O(Assign_node)).

In Locate_PE of ISH, the node start time is calculated for each PE that has been assigned one of the assigned node's immediate predecessors (bounded by P). The start time calculation compares the communication delay of each immediate predecessor node, with complexity O(N), so one call of Locate_PE costs O(PN). Locate_PE is called once per node assignment, so its total complexity is O(PN^2). For Assign_node of ISH, each time a node is assigned and an idle time slot exists, the ready queue is searched for a hole task to fill the slot, so the total complexity of Assign_node is O(N^2). Therefore the execution time complexity of ISH is O(B + N + P + NP + PN^2 + N^2), which is O(N^2) for constant P.

TDP is used to calculate a node start time in Locate_PE of DSH. The order of TDP is O(N^3); since TDP is called for up to P candidate PEs on each of the N node assignments, the complexity of Locate_PE in DSH is O(PN^4). The complexity of Assign_node in DSH is O(N^2). Therefore the execution time complexity of DSH is O(B + N + P + NP + PN^4 + N^2), which is O(N^4) for constant P. The time complexity of both Hu's heuristic and Yu's heuristic D is O(N) [Hu 61], [Yu 84]. The complexities of ISH and DSH are O(N^2) and O(N^4), respectively.
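The O(B) level calculation of step 1 can be sketched as a single memoized traversal of the task graph. This is an illustrative rendering, not the thesis's code: `succ` maps each node to its immediate successors and `size` gives node execution times, and the level of a node is taken as the longest path (in node sizes) down to an exit node, the priority used by the main list scheduler.

```python
def node_levels(succ, size):
    """Level number of every node in a DAG.

    succ: dict node -> list of immediate successors
    size: dict node -> execution time
    Each node and each branch is visited once, so the cost is O(N + B).
    """
    memo = {}

    def level(n):
        # Level = own size + longest level among successors (0 at exits).
        if n not in memo:
            memo[n] = size[n] + max((level(s) for s in succ[n]), default=0)
        return memo[n]

    return {n: level(n) for n in succ}
```

For the 4-node diamond graph 1 -> {2, 3} -> 4 with unit node sizes, the exit node 4 gets level 1, nodes 2 and 3 get level 2, and the entry node gets level 3.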

[Gantt chart for PE1-PE3 over time, with the ready queue; x = don't care, i = idle time slot.]

Figure 3.1 The Segment of a Three-Processor Schedule after Node 7's Assignment.

Main List Scheduler
input:  1) TG, directed task graph
        2) NP, number of PEs in the parallel processing system
output: PE_GC, Gantt chart: an array [1..NP], one entry per PE (the list of
        tasks ordered by their execution time on that PE, including each
        task's start time and finish time)
Begin
  1) Level_graph(TG)      { calculate the level number of each node in TG }
  2) Init_Gantt(PE_GC)    { reset each PE's Gantt chart to empty }
  3) Init_R_queue(RQ)     { insert all nodes having no immediate
                            predecessors into RQ, the ready queue,
                            ordered by their level number }
  4) Get_node(AN, RQ)     { get AN, the assigned node, from the front of RQ }
  5) Assign_node(AN, 0, PE_GC, 1, RQ)   { assign AN to PE1 at start time 0 }
  repeat
    6) Update_R_queue(RQ, AN, TG)   { update the ready queue using the
                                      assigned node }
    7) if RQ, the ready queue, is not empty then
       7.1) Get_node(AN, RQ)        { get a new AN from the front of RQ }
       7.2) Locate_PE(AN, PE_GC, PEL, ST)   { return the PEL that has the
                                              smallest ST, AN's start time }
       7.3) Assign_node(AN, ST, PE_GC, PEL, RQ)
  until all nodes in TG are assigned
End

Figure 3.2 Main List Scheduler.
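The main loop of Figure 3.2 can be sketched compactly in Python. This is a minimal rendering under simplifying assumptions: communication delays, hole-task insertion, and the ISH/DSH Locate_PE variants are all omitted, so each ready node simply goes to the first-ready PE and starts once its predecessors have finished. The `succ`, `size`, and `levels` names are illustrative.

```python
import heapq

def list_schedule(succ, size, levels, n_pe):
    # Build predecessor lists and counts from the successor map.
    pred = {n: [] for n in succ}
    for n in succ:
        for s in succ[n]:
            pred[s].append(n)
    n_pred = {n: len(pred[n]) for n in succ}
    # Ready queue ordered by level number, highest level first.
    ready = [(-levels[n], n) for n in succ if n_pred[n] == 0]
    heapq.heapify(ready)
    pe_free = [0] * n_pe          # time at which each PE becomes free
    finish = {}                   # finish time of each scheduled node
    gantt = {p: [] for p in range(n_pe)}
    while ready:
        _, task = heapq.heappop(ready)                  # Get_node
        pe = min(range(n_pe), key=pe_free.__getitem__)  # first-ready PE
        start = max([pe_free[pe]] + [finish[p] for p in pred[task]])
        finish[task] = start + size[task]
        gantt[pe].append((task, start, finish[task]))
        pe_free[pe] = finish[task]
        for s in succ[task]:                            # Update_R_queue
            n_pred[s] -= 1
            if n_pred[s] == 0:
                heapq.heappush(ready, (-levels[s], s))
    return gantt, max(finish.values())
```

On the unit-size diamond graph 1 -> {2, 3} -> 4 with two PEs this yields a schedule of length 3: node 1 at time 0, nodes 2 and 3 in parallel at time 1, node 4 at time 2.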

Update_R_queue (AN, RQ, TG)
input:  1) AN, assigned node
        2) RQ, ready queue
        3) TG, task graph
output: RQ, ready queue
Begin
  Let IMS be the set of immediate successor nodes of AN
  For each node IN in IMS Do
    1) N = Num_of_Immediate_Pred(IN)
         { N = number of immediate predecessors of node IN }
    2) N = N - 1
         { store N as the number of predecessors of node IN for the next
           Update_R_queue call }
    3) if N = 0 then
         Ordered_Insert(IN, RQ)
         { keep the ready queue ordered by level number }
End

Figure 3.3 Update_R_queue.

[Walkthrough of list scheduling on a small task graph, showing the Gantt chart (PE1, PE2) and the ready queue (columns N# = node number, #Pr = number of immediate predecessors of N#, Level = level number of N#; bold number = node level) after initializing the Gantt chart and ready queue (step 3 of Figure 3.2), after assigning node 1 to PE1 (step 5 of Figure 3.2), and after updating the ready queue using node 1 (step 6).]

Figure 3.4 An Example of Task Graph List Scheduling.

[Continuation of the walkthrough: the Gantt chart and ready queue after assigning node 3 (step 7.3 of Figure 3.2), after updating the ready queue using node 3 (repeat loop back to step 6), after assigning node 2 (step 7.3), after updating the ready queue using node 2 (step 6), and after assigning node 4 (done).]

Figure 3.4 An Example of Task Graph List Scheduling (Continued).

Locate_PE ISH (AN, PE_GC, PEL, ST)
input:  1) AN, assigned node
        2) PE_GC, array of Gantt charts of all PEs
output: 1) PEL, assigned PE
        2) ST, AN's start time on PEL
Begin
  1) First_Ready_PE(PE_GC, PERF)
       { over all PEs in the system, compare PE ready times in PE_GC and
         find the one that is ready earliest (PERF) }
  2) Initially, PEL = PERF
  3) if Num_of_Immediate_Pred(AN) > 0 then
       Start_time(AN, PERF, ST, PE_GC)
       { calculate AN's start time on PERF: the larger of PERF's ready
         time and AN's message ready time, which includes the
         communication delay from the LIP node. Figure 3.1 shows an
         example of the calculation of node 7's start time. }
  4) Let IMP be the set of all immediate predecessor nodes of AN
  5) For all P such that X is in IMP and X is in PE_GC[P]
       { for every PE that executed one of AN's immediate predecessors }
     Do
       5.1) Start_time(AN, P, STA, PE_GC)   { AN's start time on PE P }
       5.2) if STA < ST then ST = STA; PEL = P
End

Figure 3.5 The Locate_PE of ISH.
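The start-time rule and the candidate-PE search of Figure 3.5 can be sketched as follows. The data layout is an assumption made for illustration: `finish` holds predecessor finish times, `placed_on` maps each scheduled node to its PE, `comm` gives the communication delay on each branch, and `pe_free` gives each PE's ready time.

```python
def earliest_start(task, pe, pred, finish, placed_on, comm, pe_free):
    # AN's start time on `pe` is the larger of the PE's ready time and
    # the latest message arrival over AN's immediate predecessors; a
    # predecessor already on `pe` contributes no communication delay.
    msg_ready = 0
    for p in pred[task]:
        delay = 0 if placed_on[p] == pe else comm[(p, task)]
        msg_ready = max(msg_ready, finish[p] + delay)
    return max(pe_free[pe], msg_ready)

def locate_pe(task, pred, finish, placed_on, comm, pe_free):
    # Candidates: the first-ready PE plus every PE that executed one of
    # the task's immediate predecessors (Figure 3.5, steps 1-5).
    cands = {min(range(len(pe_free)), key=pe_free.__getitem__)}
    cands |= {placed_on[p] for p in pred[task]}
    best_pe, best_st = None, None
    for pe in sorted(cands):
        st = earliest_start(task, pe, pred, finish, placed_on, comm, pe_free)
        if best_st is None or st < best_st:
            best_pe, best_st = pe, st
    return best_pe, best_st
```

Note that the idle first-ready PE is not always the winner: a PE holding a predecessor with a heavy outgoing delay can start the task earlier because that message costs nothing locally.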

Assign_Node ISH (AN, ST, PE_GC, PEL, RQ)
input:  1) AN, assigned node
        2) PEL, assigned PE
        3) ST, AN's start time on PEL
        4) PE_GC, the PEs' Gantt charts
        5) RQ, ready queue
output: PE_GC[PEL], PEL's Gantt chart
Begin
  1) Idle time slot = ST - Ready_Time(PEL)
  2) if idle time slot > 0 then
     2.1) repeat
          2.1.1) Initially, HT (hole task) = the task at the front of RQ
          2.1.2) repeat
                   Start_time(HT, PEL, STH, PE_GC)
                     { compute STH, the start time of HT on PEL }
                   if STH is within the idle time slot then begin
                     FTH = size(HT) + STH   { hole task's finish time }
                     if FTH is within the idle time slot then
                       HT is the hole task
                   end
                   if HT is not a hole task then
                     HT = Next_task(HT, RQ)
                     { HT becomes the task following HT in RQ }
                 until a hole task is found or RQ has been searched through
                 { find a hole task that can be assigned within PEL's idle
                   time slot by searching the ready queue }
          2.1.3) if a hole task was found then
                   insert the hole task into PE_GC[PEL] at time STH;
                   Update_R_queue(HT, RQ, TG);
                   idle time slot = idle time slot minus the STH-to-FTH slot
          until no hole task is found or no idle time remains
     2.2) Insert the remaining idle time slots into PE_GC[PEL]
  3) Insert AN into PE_GC[PEL] at time ST
End

Figure 3.6 The Assign_Node of ISH.
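The inner search of step 2.1.2 can be sketched in a few lines. This sketch follows the ISH1 variant of Figure 3.8, which shifts an early-ready task forward to the start of the slot; `data_ready` stands in for the message-ready time that Start_time would compute, and all names are illustrative.

```python
def find_hole_task(slot_start, slot_end, ready, data_ready, size):
    # Scan the ready queue front to back for the first task that fits
    # entirely inside the idle slot [slot_start, slot_end].
    for task in ready:
        sth = max(data_ready[task], slot_start)  # shifted start time
        if sth + size[task] <= slot_end:
            return task, sth                     # hole task found
    return None, None                            # no ready task fits
```

In the Figure 3.7 situation, with an idle slot from time 4 to 5 on PE1 and node 3 ready early, node 3 is returned as the hole task at time 4 while node 7 (whose data arrives too late to finish inside the slot) is skipped.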

Gantt chart A and ready queue A, after the Locate_PE routine selects PE1 as the assigned PE for node 7, starting at time unit 4 with idle time slot 4-5. Gantt chart B shows the chart after the assignment of node 3 (the hole task) followed by node 7's assignment. (Locate_PE returns PE1 for node 7 at time 4, but node 7 has not yet been assigned to PE1.) The final Gantt chart of ISH is shown in comparison with the Gantt chart of level-list scheduling without task insertion (NISH).

Figure 3.7 An Example of ISH Task Insertion and Scheduling.

ISH0, step 2.1.2:
  repeat
    Start_time(HT, PEL, STH, PE_GC)
    if STH < the idle time slot's start time then
      STH = the idle time slot's start time
    FTH = size(HT) + STH
    if FTH is within the idle time slot then HT is the hole task
    if HT is not a hole task then HT = Next_task(HT, RQ)
  until a hole task is found or RQ has been searched through

ISH1 (ISH), step 2.1.2:
  repeat
    Start_time(HT, PEL, STH, PE_GC)
    if STH is within the idle time slot then begin
      FTH = size(HT) + STH
      if FTH is within the idle time slot then HT is the hole task
    end
    if HT is not a hole task then HT = Next_task(HT, RQ)
  until a hole task is found or RQ has been searched through

ISH2, step 2.1.2:
  repeat
    Start_time(HT, PEL, STH, PE_GC)
    hole task found = false
    if STH <= the idle time slot's end time then begin
      Locate_PE(HT, PE_GC, PEL, STH2)
      if STH <= STH2 then begin
        FTH = size(HT) + STH
        if FTH is within the idle time slot then HT is the hole task
      end
    end
    if HT is not a hole task then HT = Next_task(HT, RQ)
  until a hole task is found or RQ has been searched through

Figure 3.8 The Choices in Implementing ISH Step 2.1.2.

[Curves of average speedup ratio versus number of PEs for NISH and the ISH variants of Figure 3.8.]

Figure 3.9 The Average Speedup Ratio Comparison between ISH Versions on 10 Randomly Generated 350-Node Task Graphs.

[Two two-PE Gantt charts of the same graph, scheduled with and without TDC; an underlined node number indicates a duplicated node.]

Figure 3.10 The Task Duplication Concept.

Task Duplication Process, TDP (AN, PEc, ST, CTlst, CTcnt)
input:  1) AN, an assigned node
        2) PEc, assigned-PE candidate
output: 1) ST, AN's start time
        2) CTlst of PEc, a list of duplicated tasks and their start times
           on PEc
        3) CTcnt, the number of tasks duplicated in PEc's CTlst
Begin
  CTlst is empty and CTcnt = 0
  repeat
    Start_time(AN, PEc, ST, ANLIP, CTlst, PE_GC)
      { the same Start_time used in ISH, except that it uses both PE_GC
        and CTlst to calculate AN's start time on PEc; it also returns
        ANLIP, the LIP node of AN }
    COPY = false   { flag indicating the LIP has successfully been copied }
    if (ST > Ready_Time(PEc)) and ANLIP is not in PE_GC[PEc] then
      1) if ANLIP is not in PEc's CTlst then
           Start_time(ANLIP, PEc, STL, LIPLIP, CTlst, PE_GC)
           Copy_LIP(ANLIP, LIPLIP, CTlst, STL, CTcnt, COPY, CTPT)
           if COPY then
             Shift_Task_in_CTlst(CTPT, CTlst)
             { remove all nodes in CTlst located after the copied node;
               let SRN be the set of removed nodes }
             For all removed nodes RN in SRN Do
               Start_time(RN, PEc, STRN, RNLIP, CTlst, PE_GC)
               Copy_LIP(RN, RNLIP, CTlst, STRN, CTcnt, COPY, CTPT)
      2) if ANLIP is in PEc's CTlst then
         2.1) Search_CTlst_LIP(ANLIP, LIP, Found)
              { search the ancestors of ANLIP in CTlst to find a LIP node
                that is not in CTlst, or until LIP becomes an entry node }
         2.2) if Found and LIP is not in PE_GC[PEc] then
                Start_time(LIP, PEc, STL, LIPLIP, CTlst, PE_GC)
                Copy_LIP(LIP, LIPLIP, CTlst, STL, CTcnt, COPY, CTPT)
                if COPY then Shift_Task_in_CTlst(CTPT, CTlst)
  until not COPY or ST <= Ready_Time(PEc)
End

Figure 3.11 The Task Duplication Process (TDP).
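The core idea of TDP can be sketched without the CTlst bookkeeping. This is a deliberately simplified illustration, not the thesis's algorithm: while the task's start time on the candidate PE is dominated by a message from another PE, the last-arriving predecessor (the LIP) is copied onto that PE. Unlike the real TDP/Copy_LIP pair, the copy's own inputs are assumed to be available locally, so no recursive copying of the LIP's predecessors is attempted; all parameter names are illustrative.

```python
def duplicate_until_no_gain(task, pred, finish, placed_on, comm, size,
                            pe, pe_time):
    # Returns (start_time_on_pe, list_of_duplicated_nodes).
    local_finish = dict(finish)                 # finish times seen from `pe`
    on_pe = {n for n, p in placed_on.items() if p == pe}
    copies = []
    while True:
        # Find the LIP: the predecessor whose message arrives last and
        # later than the PE's own ready time.
        lip, arrival = None, pe_time
        for p in pred[task]:
            t = local_finish[p] + (0 if p in on_pe else comm[(p, task)])
            if t > arrival:
                lip, arrival = p, t
        if lip is None:
            return pe_time, copies              # PE ready time dominates
        copy_finish = pe_time + size[lip]
        if copy_finish >= arrival:
            return arrival, copies              # copying would not help
        on_pe.add(lip)                          # duplicate the LIP on `pe`
        local_finish[lip] = copy_finish
        pe_time = copy_finish
        copies.append(lip)
```

With two remote predecessors of size 1 finishing at time 1 and a communication delay of 6 on each branch, the sketch copies both onto the candidate PE and starts the task at time 2 instead of time 7, the max-min trade made by DSH in miniature.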

Copy_LIP (LIP, LIPLIP, CTlst, STL, CTcnt, COPY, CTPT)
input:  1) LIP, the LIP node to be copied into CTlst
        2) LIPLIP, the LIP of the LIP node
        3) STL, start time of the LIP node
        4) CTlst, linked list of copied tasks
        5) CTcnt, number of copied tasks in CTlst
output: 1) CTlst
        2) CTcnt
        3) CTPT, pointer to the copied task in CTlst
        4) COPY, boolean indicating whether Copy_LIP succeeded
Begin
  1) Insert the LIP node into CTlst
  2) if the insert succeeded then
       COPY = true; CTPT points to the copied node in CTlst
     else
       COPY = false; CTPT = nil
  3) if not COPY and LIP is not an entry node and LIPLIP is not in
     PE_GC[PEc] then
     3.1) if LIPLIP is not in PEc's CTlst then
            Start_time(LIPLIP, PEc, STLL, LIPLIPLIP, CTlst, PE_GC)
            Copy_LIP(LIPLIP, LIPLIPLIP, CTlst, STLL, CTcnt, COPY, CTPT)
     3.2) if LIPLIP is in PEc's CTlst then
          3.2.1) Search_CTlst_LIP(LIPLIP, LIP, Found)
                 { search the ancestors of LIPLIP in CTlst to find a LIP
                   node that is not in CTlst, or until LIP becomes an
                   entry node }
          3.2.2) if Found and LIP is not in PE_GC[PEc] then
                   Start_time(LIP, PEc, STL, LIPLIP, CTlst, PE_GC)
                   Copy_LIP(LIP, LIPLIP, CTlst, STL, CTcnt, COPY, CTPT)
End

Figure 3.12 The Copy_LIP of DSH.

Gantt chart A is an intermediate DSH scheduling result before node 11's assignment. Gantt chart B is the final result after Assign_Node(node 11, CTlst). Gantt chart C is Gantt chart B after removing redundant tasks.

[Three Gantt charts (PE1-PE3) over time for the 11-node sample graph.]

Figure 3.13 The Sample Task Graph of 11 Nodes and Its Intermediate Gantt Charts using DSH.

CTlst after the call to Start_time for node 11 (LIP = 9, start time = 11); no node has been duplicated yet. The first entry is a dummy node recording the last node on PE1 (node 8) and its finish time (5); the last entry records the LIP of the assigned node (node 11) and node 11's start time.

1: LIP = node 9, and node 9 is not in CTlst (case 1 of Figure 3.11).
2: Start_time(node 9) returns the LIP of node 9 (node 6) and node 9's start time (10).
3: Copy_LIP(node 9) (node 9 is the LIP of node 11).
4: Copy succeeds; node 9 is duplicated in CTlst at time 10.

1: Recalculate the start time of node 11 (LIP = 9, ST = 11).
2: LIP (node 9) is already in CTlst (case 2 of Figure 3.11).
3: Search_CTlst_LIP(node 9): node 9's LIP = node 6, and node 6 is not in CTlst.
4: Start_time(node 6), Copy_LIP(node 6).
5: Copy succeeds; node 6 is duplicated in CTlst at time 5.

1: Remove node 9 from CTlst (since node 6 was duplicated successfully).
2: Start_time(node 9): LIP = node 6, ST = 6.
3: Re-Copy_LIP(node 9) with the new start time (6).

(Each CTlst snapshot lists N# = node number of the duplicated node, LIP = LIP of N#, ST = start time of N# in CTlst, FT = finish time of N#, MSRT = message ready time of N# in CTlst.)

Figure 3.14 An Example of the Duplication Task List (CTlst) on PE1 Constructed by TDP for Node 11.

1: Recalculate the start time of node 11 (LIP = 10, ST = 11).
2: LIP (node 10) is not in CTlst (case 1 of Figure 3.11).
3: Start_time(node 10) returns ST = 11, LIP = node 7; Copy_LIP(node 10).
4: Copying node 10 fails, since node 10's ST = 11.
5: Node 10's LIP (node 7) is not in CTlst (case 3.1 of Figure 3.12).
6: Start_time(node 7), Copy_LIP(node 7).
7: Copy succeeds; node 7 is duplicated in CTlst at time 7.

1: Recalculate the start time of node 11 (LIP = 10, ST = 11).
2: LIP (node 10) is not in CTlst (case 1 of Figure 3.11).
3: Start_time(node 10) returns ST = 8, LIP = node 7; Copy_LIP(node 10).
4: Copy succeeds; node 10 is duplicated in CTlst at time 8.

1: Recalculate the start time of node 11 (LIP = 10, ST = 9).
2: No idle time slot is left; TDP terminates with node 11's start time on PE1 equal to 9 and four nodes duplicated.
3: The final Gantt chart (B) is shown in Figure 3.13.

Figure 3.14 An Example of the Duplication Task List (CTlst) on PE1 Constructed by TDP for Node 11 (Continued).

Locate_PE DSH (AN, PE_GC, PEL, ST, DTlst)
input:  1) AN, assigned node
        2) PE_GC, array of Gantt charts of all PEs
output: 1) PEL, assigned PE
        2) ST, AN's start time on PEL
        3) DTlst, duplication task list
Begin
  1) First_Ready_PE(PE_GC, PERF)
       { over all PEs in the system, compare PE ready times in PE_GC and
         find the one that is ready earliest (PERF) }
  2) PEL = PERF
  3) if Num_of_Immediate_Pred(AN) > 0 then
       TDP(AN, PERF, ST, DTlst, CTcnt)
  4) Let IMP be the set of immediate predecessor nodes of AN
  5) For all P such that X is in IMP and X is in PE_GC[P]
       { every PE that executed one of AN's immediate predecessors }
     Do
       5.1) TDP(AN, P, STA, CTlst, CTAcnt)   { AN's start time on PE P }
       5.2) if (STA < ST) or (STA = ST and CTAcnt < CTcnt) then
              ST = STA; PEL = P; DTlst = CTlst; CTcnt = CTAcnt
End

Figure 3.15 The Locate_PE of DSH.

Assign_Node DSH (AN, PE_GC, PEL, ST, DTlst)
input:  1) AN, assigned node
        2) PEL, assigned PE
        3) ST, AN's start time on PEL
        4) DTlst, duplication task list
output: PE_GC, array of Gantt charts of all PEs
Begin
  1) if DTlst is not empty then
       insert all the tasks in the duplication task list into PE_GC[PEL]
  2) Insert hole tasks into all idle time slots between the duplicated
     tasks; the details are as described in step 2 of Assign_Node ISH
  3) Insert idle tasks into all remaining idle time slots
  4) Insert AN into PE_GC[PEL] at time ST
End

Figure 3.16 The Assign_Node of DSH.

CHAPTER IV

EXPERIMENTAL RESULTS

This chapter describes the test results for ISH and DSH compared with the results for Hu's heuristic [Hu 61] and Yu's heuristic D [Yu 84]. The comparison consists of applying the heuristics to a wide range of randomly generated task graphs with 20, 50, 100, 150, 250, and 350 nodes, for a total of 340 task graphs. This approach follows closely the approach of [Adam 74]. The only difference from Adam's test is that we compare our results with Hu's and Yu's results instead of the optimal solution, because there is no algorithm for finding the optimal solution for precedence task graphs that include communication delays. We use the speedup ratio, the ratio of the execution time of the task graph on a uniprocessor (no communication delay) to its execution time on the parallel processing system, as our measure of performance improvement. The number of PEs in the parallel processing system was varied from 2 to 15 in the 20-node tests, 2 to 30 in the 150-node tests, and 2 to 70 in the 250-node and 350-node tests.

The random graphs can be classified into two groups. The first group has identical node sizes and identical communication delays, in order to study the effect of communication delay on the performance of each heuristic. The second group has variable node sizes and variable communication delays, in order to study the stability of each heuristic when node and communication delay sizes

change. For the first group, the task graph node size is one time unit and the communication delay sizes are varied: 1, 3, 5, 10, and 20 time units. For each communication delay size, 10 task graphs were randomly generated and scheduled. An average speedup ratio was computed from the 10 task graph runs for each communication delay, and the results are plotted in Figures 4.1 through 4.6. The minimum, maximum, and average of each data point in these figures are summarized in the Appendix. The speedup ratios at the saturation points marked in the figures and the percentage improvements of ISH and DSH are summarized in Table 4.1 and Table 4.2, respectively. Table 4.3 summarizes the percentage of the average speedup ratio relative to the average speedup ratio for unit-delay task graphs for all four heuristics. This percentage, the ratio of the speedup ratio at each delay to the speedup ratio at unit delay, shows how the speedup decreases as the delay increases.

From Table 4.1, we conclude that ISH gives improvements of up to 45% over previous heuristics. DSH gives much larger improvements, as shown in Table 4.2, and its percentage of improvement increases as the communication delay increases. This means that DSH handles communication delays better than previous methods. The improvement is up to 420% over Hu's heuristic and 270% over Yu's heuristic at unity speedup for the 20-node, 20-delay tests. For the 350-node, 20-delay tests, the improvement is up to 378% and 158%, respectively.

From Table 4.3, the percentage of the speedup ratio of DSH decreases slowly as the communication delay increases, and the average speedup ratio never falls below 1. This indicates that DSH handles communication delay better than previous heuristics. The speedup ratios of DSH for the different delays are plotted in Figure 4.7. It is also interesting to note that the speedup ratio of DSH never decreases when the number of PEs available in the system is increased, in contrast to the other methods (especially Hu's heuristic). This is an important property of a good scheduling algorithm, since the total task execution time should not increase as the number of available PEs in the system increases. If adding PEs to the system would cause greater delay, a good scheduler should be able to decide not to use the additional PEs.

For the second group, the task graphs have 20, 150, 250, and 350 nodes. For each number of nodes, 10 task graphs were randomly generated with different node sizes and communication delays, as shown in Table 4.4. The scheduling results are plotted in Figure 4.8. The speedup ratios at the saturation points marked in Figure 4.8 and the percentage improvements of DSH are summarized in Table 4.4. The results are about the same as for the first group at the same communication delay ratios.
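The performance measure used throughout this chapter can be stated in two lines. The names `size` (node execution times) and `schedule_length` (the finish time of the heuristic's Gantt chart) are illustrative; a ratio below 1 signals that the schedule is worse than running the whole graph on one PE.

```python
def speedup_ratio(size, schedule_length):
    # Serial time = sum of all node sizes on one PE with no
    # communication delay; parallel time = the schedule length.
    return sum(size.values()) / schedule_length
```

For example, a 12-unit serial workload scheduled in 9 time units yields a speedup of 12/9 (about 1.33), while the same workload scheduled in 15 time units yields a ratio below 1, meaning the parallel schedule should be rejected in favor of a single PE.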

[Four panels of average speedup ratio versus number of PEs for DSH, ISH, Yu, and Hu at several communication delay sizes.]

Figure 4.1 The Average Speedup Ratio Comparison (20 Nodes).

[Four panels of average speedup ratio versus number of PEs for DSH, ISH, Yu, and Hu at several communication delay sizes.]

Figure 4.2 The Average Speedup Ratio Comparison (50 Nodes).

[Four panels of average speedup ratio versus number of PEs for DSH, ISH, Yu, and Hu at several communication delay sizes.]

Figure 4.3 The Average Speedup Ratio Comparison (100 Nodes).

[Four panels of average speedup ratio versus number of PEs for DSH, ISH, Yu, and Hu at several communication delay sizes.]

Figure 4.4 The Average Speedup Ratio Comparison (150 Nodes).

[Four panels of average speedup ratio versus number of PEs for DSH, ISH, Yu, and Hu at several communication delay sizes.]

Figure 4.5 The Average Speedup Ratio Comparison (250 Nodes).

[Four panels of average speedup ratio versus number of PEs for DSH, ISH, Yu, and Hu at several communication delay sizes.]

Figure 4.6 The Average Speedup Ratio Comparison (350 Nodes).

[Panels for 20-, 150-, 100-, and 350-node graphs: average DSH speedup ratio versus number of PEs, one curve per communication delay size.]

Figure 4.7 The Average DSH Speedup Ratio Comparison for Different Delays.

[Panels of average speedup ratio versus number of PEs for DSH, ISH, Yu, and Hu on the variable-size task graphs.]

Figure 4.8 The Average Speedup Ratio Comparison for Non-identical Node Sizes and Non-identical Communication Delays.

Table 4.1 ISH's Speedup Ratio Improvement over the Hu and Yu D Heuristics. (Columns: % Hu and % Yu for 20, 150, 250, and 350 nodes at each communication delay size D, in time units; node sizes are 1 time unit. % A, the percentage improvement over heuristic A, = (ISH's speedup - A's speedup) / A's speedup * 100. [Numeric entries not reproduced.])

Table 4.2 DSH's Speedup Ratio Improvement over the Hu and Yu D Heuristics. (Same layout as Table 4.1, with DSH in place of ISH. [Numeric entries not reproduced.])

Table 4.3 The Effect of Communication Delay on Speedup Ratio Comparison. (For 20, 150, 250, and 350 nodes, the columns are D = communication delay size, DSH S = average speedup ratio of DSH, and S% X for X = Hu, Yu, ISH, and DSH, where S% X = (speedup ratio of heuristic X at delay D / speedup ratio of X at D = 1) * 100. [Numeric entries not reproduced.])

Table 4.4 DSH and ISH Speedup Ratio Improvement over the Hu and Yu D Heuristics for Variable Node Sizes and Communication Delays. (Columns: Nodes, Node size, Delay size, RD, then % Hu and % Yu for ISH and for DSH. Node size 1-5(3) means node sizes varied from 1 to 5 with an average of 3 time units; delay size (communication delay) uses the same format; RD is the ratio of the average communication delay to the average node size. [Numeric entries not reproduced.])

CHAPTER V

OPTIMAL GRAIN DETERMINATION FOR PARALLEL PROCESSING SYSTEMS

5.1 Introduction

This chapter gives solutions to the "grain size" problem for parallel processing systems and an example of grain size determination. The grain size problem, previously described in Chapter II, is stated as follows:

Grain Size Problem. How should a given program be partitioned into concurrent modules to obtain the shortest possible program execution time? What is the "best" size for each concurrent module?

While these problems have been widely studied [Babb 83], we propose a new solution called "grain packing" which provides:

1: a new way to determine grain size for any underlying parallel processing architecture and any kind of application program, with the advantage that each grain is of the best size for scheduling, reducing communication delay, and enhancing parallelism,

2: a new way to schedule a given application program to execute on a given parallel processing system, with the advantage that program execution time is as short as possible,

3: an automatic parallel program development scheme which saves the user time and reduces the errors that occur when program development (grain size determination and scheduling) is done by hand,

4: a parallel processor simulator tool: given a user program and a specification of a target parallel system, a simulator can compute the speedup ratio and execution time of the application without actually running it on the real system. This tells the user whether the program has too much communication overhead to take advantage of the parallelism in the target system, saving both the cost and the time of finding out whether the target system is appropriate.

Researchers have taken two general approaches to solving these problems:

1: Basic scheduling strategy [Grah 72]. This strategy assigns a task whose predecessors have all completed to the first available processor; a processor is therefore idle only when no task can be assigned to it. Examples of this type of scheduling are load balancing and list scheduling [Hu 61]. It is known that this strategy may not produce the best schedule for a given task graph: for example, forcing some processors to remain idle can decrease the execution time of some task graphs, rather than increasing it as might be expected [Rama 72]. On the other hand, this type of scheduler yields "near optimal" schedules most of the time [Adam 74]. Because of this general performance, the strategy has been used in real parallel processing systems. Most parallel processing system users believe it is best to take advantage of all

parallelism, keep all processors busy (load balancing), and start each task as soon as possible. It is generally thought that these rules of thumb provide the best possible program execution time. While load balancing works very well in an ideal system, it yields "poor" results in the presence of the unavoidable communication delays of real systems. By "poor" results we mean that program execution time is "far" from optimal. In fact, as shown in Chapter IV, for an application with intensive communication, execution time on several processors can be greater than execution time on one processor! The main reason for this failure is that load balancing attempts to utilize all available parallelism without regard for the corresponding high cost of communication.

2: Large grain data flow [Babb 84]. This strategy is based on an awareness of the communication delay between tasks on different processors. Instead of taking advantage of all available parallelism in a program, the program is partitioned in such a way that the execution time of each task is much greater than its communication delay. This gives the appearance that communication delays are negligible. The strategy seems to be a good solution, but it still has problems: 1) even though Babb states that "what qualifies as large-grain processing will vary somewhat depending on the underlying architecture. In general, 'large grain' means that the amount of

processing a lowest level program performs during an execution is large compared to the overhead of scheduling its execution", Babb offers no method of grain size determination for a particular system [Babb 84]. If the grain size is defined manually, the process is time-consuming and prone to errors. 2) Since grains are typically large, some parallelism is serialized inside each large grain, and hence the application fails to take full advantage of parallelism. Again, large grain dataflow does not solve the max-min problem; instead it tries to reduce the communication delay by "throwing away" available parallelism in the user program.

5.2 Grain Packing Approach

Grain packing is an alternative approach that automatically determines the best-size grains to schedule on a target parallel processing system. Instead of trying to define the best grain size first and then scheduling those grains, grain packing starts from the smallest grain size, schedules these small grains, and then defines larger grains. Since the final grain sizes are defined after the scheduling of fine grains, all parallelism is taken into consideration, and the only parallelism discarded is parallelism that leads to decreased performance. Grain packing can be divided into four main steps.

1: Fine grain task graph construction. A fine grain task graph is constructed from the user program.

This step involves: 1.1) parallelism extraction [Rama 69] from the user program, as shown in Figure 5.1; 1.2) node size calculation: the size of each node is an estimate of the time to execute all tasks in the node, measured in units of CPU machine cycles, so if the CPUs in the system have different speeds, the size of each node can be calculated differently for each CPU. As shown in Figure 5.3, some move instructions are not included, since moves are needed only when a task is not in the same node; 1.3) edge size calculation: the communication delay is calculated by estimating the time taken to transmit a message between processors. If the links have different transmission rates, if the distance between two processors in the system varies (more than one hop in a hypercube, for instance), or if the message length varies, then the communication delay is calculated by adding up all of these effects.

2: Fine grain scheduling. The task graph from step 1 is scheduled on the parallel processing system using a fine grain scheduler that can take advantage of all available parallelism while reducing the communication delay (for example, DSH in Chapter III). The application program's execution time is determined at this step from each node's execution time and the communication delays, using information from the fine grain task graph and the specific architecture of the target parallel processing system.

Since the grain size and the user program's execution time depend on the scheduler used, the choice of scheduler is critical. The scheduler used in grain packing must provide a solution to the max-min problem and give monotonically growing improvement as the number of processors is increased (speedup ratio >= 1). An example of a communication delay calculation is shown in Figure 5.4, where T1 and T5 represent the MOV instructions of Figure 5.3, T2 and T4 represent the DMA fetch and set-up delays, T3 is the transmission delay on the serial link, and T6 is the delay of the communication protocol. The communication delay is a function of both the application program and the specific architecture of the target parallel processing system. The final schedule is shown in Figure 5.5; the schedule obtained by DSH yields a speedup ratio of 2.39, while load balancing yields a lower speedup ratio. This underscores the importance of selecting a scheduler that solves the max-min problem.

3: Grain packing. In this step, the Gantt chart from step 2 is analyzed, and fine grains are "packed" together to form larger grains in order to reduce overhead. Here "overhead" includes all optional move instructions and all the overhead caused by communication protocols. Since the overhead is system dependent, the way fine grains are packed depends on each specific system; usually, the larger the grain, the smaller the overhead.
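The delay components T1 through T6 of Figure 5.4 add up to a single edge size. The sketch below illustrates the summation only; the parameter names and cost model are assumptions for the example (all quantities in CPU cycles), not the thesis's measured values.

```python
def comm_delay(n_words, mov_cost, dma_setup, link_rate, protocol_cost):
    # T1 + T5: register-memory moves at the sending and receiving ends.
    t_moves = 2 * n_words * mov_cost
    # T2 + T4: DMA fetch and set-up at both ends of the link.
    t_dma = 2 * dma_setup
    # T3: serial-link transmission, link_rate cycles per word.
    t_link = n_words * link_rate
    # T6: communication-protocol overhead.
    return t_moves + t_dma + t_link + protocol_cost
```

For a hypothetical 4-word message with 2-cycle moves, 10-cycle DMA set-up, a 5-cycle-per-word link, and an 8-cycle protocol, the edge size is 16 + 20 + 20 + 8 = 64 cycles. Because the fixed terms (T2, T4, T6) are paid per message, packing two fine grains into one large grain removes those terms entirely for their internal edges.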

An example of grain packing is shown in Figure 5.5. The grain boundaries are obtained from the scheduler, so they reflect the best schedule trading off communication delay against parallelism. Some fine grains may also be duplicated and grouped into more than one large grain to reduce the communication delay and increase the parallelism (see the duplication details in DSH).

4: Parallel module generation. Based on the grain information from step 3, a compiler can construct modules to run in parallel on the parallel processing system. Alternatively, a user program can be restructured to achieve the optimal run time, as shown by the Occam program in Figure 5.6. That figure also shows the steps taken to find the best grain size, starting from a user-defined grain size, to fine grain size, and finally to large, packed grain size. The run time is the same for DSH and load balancing on both the user-defined grain task graph (536 time units for each) and the packed grain task graph (361 time units for each), because no more duplication is possible. Once all grains are packed, they can be rescheduled by a simple scheduler on the given real system; such a scheduler might be an operating system scheduler executed while the user program is running.

By using grain packing, users do not need to learn a specific parallel programming language such as Occam. A programmer typically does not know details such as grain execution time and communication delay cost, so the grain size and parallelism selected by a programmer are not optimal. For example, a programmer who

tries to take advantage of parallelism in matrix multiplication will produce too fine a grain and, as a consequence, introduce more communication delay than necessary. For example, the Occam matrix multiplication program in Figure 5.1 contains no more parallelism than a corresponding C or Pascal matrix multiplication program. In short, the parallelism identified by a programmer using a parallel programming language is not useful information for producing an optimal parallel program. This counter-intuitive result may seem controversial, but it is observed in even the simplest examples.

To sum up, grain packing provides a new way to develop a program on a particular system. It:

1: gives an optimal way to partition a serial or parallel program on a specific computer architecture.

2: gives a run time estimate of a particular program on a particular system before running the program. A speedup of less than or equal to one means the program is not suitable to run on the specific architecture.

3: gives automatic grain packing, which saves the user time and reduces errors which might occur if grain packing were done by hand.

4: applies to any language, such as C, Pascal, Fortran, or Modula-2. Grain packing allows more applications to take advantage of parallel processing systems because existing programs do not need to be re-written in a new language.
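The fine-grain decomposition that grain packing starts from can be sketched for the 2x2 matrix multiplication of Figure 5.1. This is an illustrative construction, not the thesis's tool: one task per scalar multiply, one add task per C entry, and a final summation task, with the precedence edges between them. The task-naming scheme is an assumption.

```python
# Hedged sketch: fine-grain task graph for an n x n matrix multiplication,
# one task per scalar multiply/add, as in the Figure 5.2 decomposition.
def fine_grain_matmul_tasks(n=2):
    """Return (tasks, deps): a multiply task per A[i][k]*B[k][j], an add
    task per C[i][j], and a final summation task."""
    tasks, deps = [], []
    for i in range(n):
        for j in range(n):
            muls = []
            for k in range(n):
                m = f"mul_{i}{k}{j}"   # computes A[i][k] * B[k][j]
                tasks.append(m)
                muls.append(m)
            add = f"add_C{i}{j}"       # computes C[i][j]
            tasks.append(add)
            deps += [(m, add) for m in muls]
            deps.append((add, "sum"))  # Sum = C11 + C12 + C21 + C22
    tasks.append("sum")
    return tasks, deps

tasks, deps = fine_grain_matmul_tasks()
print(len(tasks), len(deps))  # 13 tasks (8 muls, 4 adds, 1 sum), 12 edges
```

Each edge in deps is a message that costs communication delay if its endpoints land on different PEs, which is why this graph is the input to the scheduler rather than the final grain structure.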

OCCAM Matrix Multiplication

PAR
  INT A11, A12, B11, B21, C11 :
  SEQ
    Chan1in ? A11
    Chan1in ? A12
    Chan1in ? B11
    Chan1in ? B21
    C11 := (A11*B11) + (A12*B21)
    Chan1out ! C11
  INT A11, A12, B12, B22, C12 :
  SEQ
    Chan2in ? A11
    Chan2in ? A12
    Chan2in ? B12
    Chan2in ? B22
    C12 := (A11*B12) + (A12*B22)
    Chan2out ! C12
  INT A21, A22, B11, B21, C21 :
  SEQ
    Chan3in ? A21
    Chan3in ? A22
    Chan3in ? B11
    Chan3in ? B21
    C21 := (A21*B11) + (A22*B21)
    Chan3out ! C21
  INT A21, A22, B12, B22, C22 :
  SEQ
    Chan4in ? A21
    Chan4in ? A22
    Chan4in ? B12
    Chan4in ? B22
    C22 := (A21*B12) + (A22*B22)
    Chan4out ! C22
  INT C11, C12, C21, C22, Sum :
  SEQ
    Chan1out ? C11
    Chan2out ? C12
    Chan3out ? C21
    Chan4out ? C22
    Sum := C11 + C12 + C21 + C22
    Chan5out ! Sum

[ A11 A12 ]   [ B11 B12 ]   [ C11 C12 ]
[ A21 A22 ] * [ B21 B22 ] = [ C21 C22 ]

C11 = A11*B11 + A12*B21
C12 = A11*B12 + A12*B22
C21 = A21*B11 + A22*B21
C22 = A21*B12 + A22*B22
Sum = C11 + C12 + C21 + C22

Task Graph representation of the OCCAM program.

Figure 5.1 An Example of User Program and Its Task Graph Construction.

User Specified Grain. Fine Grain Decomposition from User Defined Grain.

Figure 5.2 An Example of Fine Grain Task Graph Construction.

M68000 ASSEMBLY LANGUAGE                     CPU CYCLES

Ann * Bnn:
            MOVE.W  Axx, D1                  15
            MOVE.W  Bxx, D2                  15
            MULU    D1, D2                   71
  CM. OPT.* MOVE.L  D2, PAR                  20
            Node size = 101

Ann*Bnn + Ann*Bnn:
  CM. OPT.* MOVE.L  PAR1, D1                 20
  CM. OPT.* MOVE.L  PAR2, D2                 20
            ADD.L   D1, D2                    8
  CM. OPT.* MOVE.L  D2, PSUM                 20
            Node size = 8

CM. OPT. = optional move for communication
Communication delay (Cm) = CM. OPT. + extra delay = 20 + extra delay

Fine Grain Decomposition from User Defined Grain.

Figure 5.3 An Example of Fine Grain Node Size Calculation.
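The node-size arithmetic of Figure 5.3 is just a sum of the mandatory instruction cycles; the optional CM. OPT. moves are charged to communication delay instead. A minimal sketch, with the cycle table taken from the figure:

```python
# Hedged sketch of the Figure 5.3 node-size calculation: a node's size is
# the sum of its mandatory instruction cycles; optional CM.OPT. moves
# count toward communication delay, not node size.
M68000_CYCLES = {"MOVE.W": 15, "MULU": 71, "ADD.L": 8, "MOVE.L": 20}

def node_size(mandatory_instrs):
    return sum(M68000_CYCLES[op] for op in mandatory_instrs)

mul_node = node_size(["MOVE.W", "MOVE.W", "MULU"])  # Ann * Bnn
add_node = node_size(["ADD.L"])                     # Ann*Bnn + Ann*Bnn
print(mul_node, add_node)  # 101 8
```

This reproduces the node sizes of 101 and 8 given in the figure.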

PE1 and PE2: each PE has a CPU, a DMA controller, and memory, connected by a serial link.

Com. Delay = T1 + T2 + T3 + T4 + T5 + Com. Protocol,
where T1 + T2 + T3 + T4 + T5 = 112.

T3 = 32-bit transmission time at 20 Mbit/sec, normalized to M68000 cycles at 20 MHz.

Com. Protocol:
1. Protocol code execution time
2. Synchronization time
3. Routing time (number of hops)

Figure 5.4 An Example of Communication Delay Calculation for the Specific Architecture.
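The T3 normalization in Figure 5.4 converts link time into CPU cycles. A minimal sketch of that conversion (the function name is illustrative):

```python
# Hedged sketch of the T3 calculation: the time to ship 32 bits over a
# 20 Mbit/s serial link, expressed in M68000 cycles at 20 MHz.
def link_cycles(bits, link_mbit_per_s, cpu_mhz):
    seconds = bits / (link_mbit_per_s * 1e6)   # link transmission time
    return round(seconds * cpu_mhz * 1e6)      # normalized to CPU cycles

t3 = link_cycles(bits=32, link_mbit_per_s=20, cpu_mhz=20)
print(t3)  # 32 cycles
```

Because the link rate (20 Mbit/s) and clock rate (20 MHz) match numerically, one bit costs exactly one cycle here, giving T3 = 32.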

Fine Grain Decomposition from User Defined Grain.

Communication delay = 212, assuming the communication protocol equals 5 MOV instructions.

Gantt Chart A (DSH); Gantt Chart B (Load Balancing); Gantt Chart C (single PE); fine grain for scheduling.

Figure 5.5 An Example of Fine Grain Scheduling using DSH in Comparison with Load Balancing and Single PE.
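The comparison between the Gantt charts rests on the speedup ratio: single-PE run time divided by the parallel schedule's finish time. A minimal sketch; the numbers below are illustrative, not taken from the charts:

```python
# Hedged sketch of the speedup ratio used throughout these comparisons.
def speedup(single_pe_time, parallel_makespan):
    return single_pe_time / parallel_makespan

print(speedup(800, 400))  # 2.0 -> parallel execution pays off
print(speedup(400, 400))  # 1.0 -> run on a single PE instead
```

A ratio of 1.0 or less is precisely the "not suitable for this architecture" signal that grain packing's run-time estimate provides.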

Occam matrix multiplication program (reconstructed):

PAR
  INT A11, A12, B11, B21, C11 :
  SEQ
    Chan1in ? A11
    Chan1in ? A12
    Chan1in ? B11
    Chan1in ? B21
    C11 := (A11*B11) + (A12*B21)
    Chan1out ! C11
  INT A11, B12, Mul2 :
  SEQ
    Chan2in ? A11
    Chan2in ? B12
    Mul2 := A11*B12
    Chan2out ! Mul2
  INT A22, B22, Mul7 :
  SEQ
    Chan7in ? A22
    Chan7in ? B22
    Mul7 := A22*B22
    Chan7out ! Mul7
  INT C11, Mul2, Mul3, Mul4, Mul5, Mul6, Mul7, Sum :
  SEQ
    Chan1out ? C11
    Chan2out ? Mul2
    Chan7out ? Mul7
    Sum := C11 + Mul2 + Mul3 + Mul4 + Mul5 + Mul6 + Mul7
    Chan8out ! Sum

User defined grain; fine grain for scheduling (T1, T2 = 536, 536); grain after grain packing.

Figure 5.6 An Example of User Program Restructure. T1 is the task graph run time using DSH. T2 is the task graph run time using Load Balancing.

CHAPTER VI

CONCLUSION

6.1 Significance of this Research

We proposed two new scheduling heuristics for task graphs that have communication delays. ISH (insertion scheduling heuristic) provides an improvement of up to 45% over current solutions and has a smaller time complexity than DSH, O(N^2). DSH (duplication scheduling heuristic) is an O(N^4) heuristic which we recommend be used to solve the scheduling problem, for three main reasons.

First, the schedules produced by DSH provide up to an order of magnitude improvement in performance over current solutions. The improvement keeps growing as the ratio of communication delay to node size increases. From a wide range of randomly generated test data (340 task graphs) run with a varied number of PEs, for a total of 9040 test runs, DSH provided an improvement in almost all tests. Performance is about the same in the tests with a small communication delay ratio and a small number of PEs, but in none of the tests did a program scheduled by DSH perform worse.

Second, DSH solves the max-min problem by duplicating some scheduled tasks on some PEs. The max-min problem has not been

fully explored elsewhere, yet it is of major importance to obtaining optimal or near-optimal schedules.

Third, DSH gives monotonically growing improvements as the number of PEs is allowed to increase. In cases where the communication delay ratio is very high, DSH gives a speedup ratio of 1.0, which indicates that such a task graph should be executed on a single processor. Furthermore, the small grain schedule obtained from DSH can be used in "grain packing" to find a "near optimal" grain size for parallel programs.

The main problem that might influence the "near optimal" schedule result of DSH is the level alteration and critical path problem mentioned in Chapter II. The schedule result might not be near optimal because the critical path changes as allocation is done. This is an unsolved problem and is part of the dynamic scheduling problem.

6.2 Future Related Research

In this section, some possible future related research areas are discussed. The main directions lie in relaxing some of the assumptions in the task graph and parallel system model.

There are many extensions that can be made to our parallel processing model. First, the scheduler assumes all PEs are fully connected. Second, the calculation of communication delay does not

take into account either the queuing delay or the number of hops in a network. The inclusion of the shortest path, the routing algorithm, and the scheduling of messages in a parallel processor scheduler would be another useful area to explore.

An extension of the task graph model is to handle a dynamic task graph, where the node execution time, the amount of message passing, the precedence constraints, and the number of nodes in the task graph can change during runtime. An example of this is a task graph with loops and branches. This extension is extremely hard to achieve because critical path information is not available until runtime. Hence, the performance of such a scheduler depends on 1) how closely the scheduler can predict the future behavior of the task graph, and 2) how much overhead is introduced by the scheduler if scheduling is done during runtime. This extension is the most important problem in the parallel processing system area because it will allow many more application programs to run on parallel processing systems.
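The insertion idea behind ISH (Section 6.1) can be sketched in a few lines. This is an illustrative simplification under an assumed interface, not the thesis's algorithm: when a PE sits idle waiting for a message, try to fill the idle slot with another ready task that fits.

```python
# Hedged sketch of the hole-filling idea attributed to ISH: fill a PE's
# communication-induced idle slot with a ready task that fits in it.
def fill_idle_slot(slot_start, slot_end, ready_tasks, exec_time):
    """Return the first ready task whose execution fits in the idle slot,
    or None if no ready task fits."""
    for t in ready_tasks:
        if exec_time[t] <= slot_end - slot_start:
            return t
    return None

ready = ["T4", "T7"]
exec_time = {"T4": 50, "T7": 10}
print(fill_idle_slot(100, 130, ready, exec_time))  # T7 (T4 is too long)
```

Filling such holes is what lets ISH improve on list scheduling without the O(N^4) cost of task duplication.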

BIBLIOGRAPHY

Adam 74   T. L. Adam, K. M. Chandy, and J. R. Dickson, "A Comparison of List Schedules for Parallel Processing Systems," Comm. ACM, Vol. 17, Dec. 1974.

Babb 84   R. G. Babb, "Parallel Processing with Large-Grain Data Flow Techniques," Computer, Vol. 17, No. 7, July 1984.

Bash 83   A. F. Bashir, V. Susarla, and K. Vairavan, "A Statistical Study of a Task Scheduling Algorithm," IEEE Trans. Comput., Vol. C-32, No. 8, Aug. 1983.

Blaz 84   J. Blazewicz and J. Weglarz, "Scheduling Independent 2-Processor Tasks to Minimize Schedule Length," Information Processing Letters, Vol. 18, No. 5, June 1984.

Bokh 81   S. H. Bokhari, "On the Mapping Problem," IEEE Trans. Computers, Vol. C-30, No. 3, March 1981.

Chen 75   N. F. Chen and C. L. Liu, "On a Class of Scheduling Algorithms for Multiprocessing Systems," Proc. 1975 Sagamore Computer Conference on Parallel Processing, T. Feng, ed., Springer, Berlin, 1975.

Chou 82   T. Chou and J. Abraham, "Load Balancing in Distributed Systems," IEEE Transactions on Software Engineering, Vol. SE-8, No. 4, July 1982.

Chu 80    W. W. Chu et al., "Task Allocation in Distributed Data Processing," Computer, Vol. 13, No. 11, Nov. 1980.

Clar 52   W. Clark, The Gantt Chart, 3rd edition, London: Pitman and Sons, 1952.

Coff 72   E. G. Coffman, Jr., and R. L. Graham, "Optimal Scheduling for Two-Processor Systems," Acta Informatica, Vol. 1, No. 3, 1972.

Coff 76   E. G. Coffman, Computer and Job-Shop Scheduling Theory, New York: Wiley, 1976.

Dogr 78   A. Dogramaci and J. Surkis, "Limitation of a Parallel

          Processor Scheduling Algorithm," Int. J. Prod. Res., Vol. 16, No. 1, 1978.

Efe 82    Kemal Efe, "Heuristic Models of Task Assignment Scheduling in Distributed Systems," Computer, Vol. 15, No. 6, June 1982.

Gabo 82   H. N. Gabow, "An Almost-Linear Algorithm for Two-Processor Scheduling," J. ACM, Vol. 29, No. 3, July 1982.

Gonz 77   M. J. Gonzalez, Jr., "Deterministic Processor Scheduling," ACM Computing Surveys, Vol. 9, No. 3, Sept. 1977.

Grah 72   R. L. Graham, "Bounds on Multiprocessing Anomalies and Related Packing Algorithms," AFIPS 1972 Conf. Proc., Vol. 40, AFIPS Press, Montvale, N.J., 1972.

Hory 77   E. C. Horvath, S. Lam, and R. Sethi, "A Level Algorithm for Preemptive Scheduling," J. ACM, Vol. 24, No. 1, 1977.

Hu 61     T. C. Hu, "Parallel Sequencing and Assembly Line Problems," Operations Research, Vol. 9, No. 6, 1961.

Jens 77   John E. Jensen, "A Fixed-Variable Scheduling Model for Multiprocessors," Proc. of the 1977 International Conference on Parallel Processing, 1977.

Kasa 84   H. Kasahara and S. Narita, "Practical Multiprocessor Scheduling Algorithms for Efficient Parallel Processing," IEEE Transactions on Computers, Vol. C-33, No. 11, Nov. 1984.

Kauf 74   M. T. Kaufman, "An Almost-Optimal Algorithm for the Assembly Line Scheduling Problem," IEEE Trans. Comput., Vol. C-23, No. 11, Nov. 1974.

Kohl 75   W. H. Kohler, "A Preliminary Evaluation of the Critical Path Method for Scheduling Tasks on Multiprocessor Systems," IEEE Transactions on Computers, Vol. C-24, No. 12, Dec. 1975.

Krau 75   K. L. Krause, V. Y. Shen, and H. D. Schwetman, "Analysis of Several Task-Scheduling Algorithms for a Model of Multiprogramming Computer Systems," J. ACM, Vol. 22, No. 4, October 1975.

Kund 81   M. Kunde, "Nonpreemptive LP-Scheduling on Homogeneous Multiprocessor Systems," SIAM J. Comput., Vol. 10, No. 1, Feb. 1981.

Kung 81   H. T. Kung, "Synchronized and Asynchronous Parallel Algorithms for Multiprocessors," Tutorial on Parallel Processing, IEEE Computer Society Press, 1981.

Lam 77    S. Lam and R. Sethi, "Worst Case Analysis of Two Scheduling Algorithms," SIAM Journal on Computing, Vol. 6, 1977.

Lens 78   J. K. Lenstra and A. H. G. Rinnooy Kan, "Complexity of Scheduling under Precedence Constraints," Operations Research, Vol. 26, No. 1, Jan.-Feb. 1978.

Lo 81     Virginia Lo and Jane W. S. Liu, "Task Assignment in Distributed Multiprocessor Systems," Proc. of the 1981 International Conference on Parallel Processing, 1981.

Ma 82     P. Ma, E. Y. Lee, and M. Tsuchiya, "A Task Allocation Model for Distributed Computing Systems," IEEE Trans. Computers, Vol. C-31, No. 1, Jan. 1982.

Ma 84     P. Ma, "A Model to Solve Timing-Critical Application Problems in Distributed Computer Systems," Computer, Vol. 17, No. 1, Jan. 1984.

Nett 76   E. Nett, "On Further Applications of the Hu Algorithm to Scheduling Problems," Proc. of the 1976 International Conference on Parallel Processing, 1976.

Ni 85     Lionel M. Ni and Kai Hwang, "Optimal Load Balancing in a Multiprocessor System with Many Job Classes," IEEE Transactions on Software Engineering, Vol. SE-11, No. 5, May 1985.

Rama 69   C. V. Ramamoorthy and M. J. Gonzalez, "A Survey of Techniques for Recognizing Parallel Processable Streams in Computer Programs," AFIPS FJCC, 1969.

Rama 72   C. V. Ramamoorthy, K. M. Chandy, and M. J. Gonzalez, "Optimal Scheduling Strategies in a Multiprocessor System," IEEE Trans. Comput., Vol. C-21, No. 2, Feb. 1972.

Rama 76   C. V. Ramamoorthy and W. H. Leung, "A Scheme for the Parallel Execution of Sequential Programs," Proc. of the 1976 International Conference on Parallel Processing, 1976.

Schw 87   K. Schwan, R. Ramnath, S. Vasudevan, and D. Ogle, "A System for Parallel Programming," Proc. of the Ninth International Conference on Software Engineering, Mar. 1987.

Seth 76   R. Sethi, "Scheduling Graphs on Two Processors," SIAM J. Comput., Vol. 5, No. 1, March 1976.

Shen 85   Chien-Chung Shen and Wen-Hsiang Tsai, "A Graph Matching Approach to Optimal Task Assignment in Distributed Computing Systems Using a Minimax Criterion," IEEE Trans. Computers, Vol. C-34, No. 3, March 1985.

Ston 77   H. S. Stone, "Multiprocessor Scheduling with the Aid of Network Flow Algorithms," IEEE Trans. Software Engineering, Vol. SE-3, Jan. 1977.

Ston 78   H. S. Stone and S. H. Bokhari, "Control of Distributed Processes," Computer, Vol. 11, No. 7, July 1978.

Tows 86   Don Towsley, "Allocating Programs Containing Branches and Loops Within a Multiple Processor System," IEEE Transactions on Software Engineering, Vol. SE-12, No. 10, Oct. 1986.

Ullm 75   J. D. Ullman, "NP-Complete Scheduling Problems," J. of Computer and System Sciences, 1975.

Yu 84     Wang Ho Yu, "LU Decomposition on a Multiprocessing System with Communication Delay," Ph.D. dissertation, Department of Electrical Engineering and Computer Science, University of California, Berkeley, 1984.

APPENDIX

Data points of Figure 4.1 (20 nodes, delays = 1)
Data points of Figure 4.1 (20 nodes, delays = 5)
Data points of Figure 4.1 (20 nodes, delays = 10)
Data points of Figure 4.1 (20 nodes, delays = 20)
Data points of Figure 4.6 (350 nodes, delays = 1)
Data points of Figure 4.6 (350 nodes, delays = 5)
Data points of Figure 4.6 (350 nodes, delays = 10)

Columns in each table: P #, min Hu, max Hu, ave Hu, min Yu, max Yu, ave Yu, min ISH, max ISH, ave ISH, min DSH, max DSH, ave DSH.


More information

Grid Scheduling Strategy using GA (GSSGA)

Grid Scheduling Strategy using GA (GSSGA) F Kurus Malai Selvi et al,int.j.computer Technology & Applications,Vol 3 (5), 8-86 ISSN:2229-693 Grid Scheduling Strategy using GA () Dr.D.I.George Amalarethinam Director-MCA & Associate Professor of Computer

More information

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY BHARAT SIGINAM IN

More information

Crew Scheduling Problem: A Column Generation Approach Improved by a Genetic Algorithm. Santos and Mateus (2007)

Crew Scheduling Problem: A Column Generation Approach Improved by a Genetic Algorithm. Santos and Mateus (2007) In the name of God Crew Scheduling Problem: A Column Generation Approach Improved by a Genetic Algorithm Spring 2009 Instructor: Dr. Masoud Yaghini Outlines Problem Definition Modeling As A Set Partitioning

More information

A STUDY OF BNP PARALLEL TASK SCHEDULING ALGORITHMS METRIC S FOR DISTRIBUTED DATABASE SYSTEM Manik Sharma 1, Dr. Gurdev Singh 2 and Harsimran Kaur 3

A STUDY OF BNP PARALLEL TASK SCHEDULING ALGORITHMS METRIC S FOR DISTRIBUTED DATABASE SYSTEM Manik Sharma 1, Dr. Gurdev Singh 2 and Harsimran Kaur 3 A STUDY OF BNP PARALLEL TASK SCHEDULING ALGORITHMS METRIC S FOR DISTRIBUTED DATABASE SYSTEM Manik Sharma 1, Dr. Gurdev Singh 2 and Harsimran Kaur 3 1 Assistant Professor & Head, Department of Computer

More information

Scheduling on Parallel Systems. - Sathish Vadhiyar

Scheduling on Parallel Systems. - Sathish Vadhiyar Scheduling on Parallel Systems - Sathish Vadhiyar Parallel Scheduling Categories Job Scheduling [this class] A set of jobs arriving at a parallel system Choosing an order of jobs for execution to minimize

More information

Chapter 5: CPU Scheduling

Chapter 5: CPU Scheduling Chapter 5: CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Thread Scheduling Multiple-Processor Scheduling Operating Systems Examples Algorithm Evaluation Chapter 5: CPU Scheduling

More information

Lecture 5 / Chapter 6 (CPU Scheduling) Basic Concepts. Scheduling Criteria Scheduling Algorithms

Lecture 5 / Chapter 6 (CPU Scheduling) Basic Concepts. Scheduling Criteria Scheduling Algorithms Operating System Lecture 5 / Chapter 6 (CPU Scheduling) Basic Concepts Scheduling Criteria Scheduling Algorithms OS Process Review Multicore Programming Multithreading Models Thread Libraries Implicit

More information

LIST BASED SCHEDULING ALGORITHM FOR HETEROGENEOUS SYSYTEM

LIST BASED SCHEDULING ALGORITHM FOR HETEROGENEOUS SYSYTEM LIST BASED SCHEDULING ALGORITHM FOR HETEROGENEOUS SYSYTEM C. Subramanian 1, N.Rajkumar 2, S. Karthikeyan 3, Vinothkumar 4 1 Assoc.Professor, Department of Computer Applications, Dr. MGR Educational and

More information

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions Data In single-program multiple-data (SPMD) parallel programs, global data is partitioned, with a portion of the data assigned to each processing node. Issues relevant to choosing a partitioning strategy

More information

CHAPTER-III WAVELENGTH ROUTING ALGORITHMS

CHAPTER-III WAVELENGTH ROUTING ALGORITHMS CHAPTER-III WAVELENGTH ROUTING ALGORITHMS Introduction A wavelength routing (WR) algorithm selects a good route and a wavelength to satisfy a connection request so as to improve the network performance.

More information

Multiprocessor Scheduling Using Task Duplication Based Scheduling Algorithms: A Review Paper

Multiprocessor Scheduling Using Task Duplication Based Scheduling Algorithms: A Review Paper Multiprocessor Scheduling Using Task Duplication Based Scheduling Algorithms: A Review Paper Ravneet Kaur 1, Ramneek Kaur 2 Department of Computer Science Guru Nanak Dev University, Amritsar, Punjab, 143001,

More information

CPU Scheduling. Operating Systems (Fall/Winter 2018) Yajin Zhou ( Zhejiang University

CPU Scheduling. Operating Systems (Fall/Winter 2018) Yajin Zhou (  Zhejiang University Operating Systems (Fall/Winter 2018) CPU Scheduling Yajin Zhou (http://yajin.org) Zhejiang University Acknowledgement: some pages are based on the slides from Zhi Wang(fsu). Review Motivation to use threads

More information

Design of Parallel Algorithms. Models of Parallel Computation

Design of Parallel Algorithms. Models of Parallel Computation + Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes

More information

Job-shop scheduling with limited capacity buffers

Job-shop scheduling with limited capacity buffers Job-shop scheduling with limited capacity buffers Peter Brucker, Silvia Heitmann University of Osnabrück, Department of Mathematics/Informatics Albrechtstr. 28, D-49069 Osnabrück, Germany {peter,sheitman}@mathematik.uni-osnabrueck.de

More information

Lecture 5: Search Algorithms for Discrete Optimization Problems

Lecture 5: Search Algorithms for Discrete Optimization Problems Lecture 5: Search Algorithms for Discrete Optimization Problems Definitions Discrete optimization problem (DOP): tuple (S, f), S finite set of feasible solutions, f : S R, cost function. Objective: find

More information

Computer Science 4500 Operating Systems

Computer Science 4500 Operating Systems Computer Science 4500 Operating Systems Module 6 Process Scheduling Methods Updated: September 25, 2014 2008 Stanley A. Wileman, Jr. Operating Systems Slide 1 1 In This Module Batch and interactive workloads

More information

Parallel Computing in Combinatorial Optimization

Parallel Computing in Combinatorial Optimization Parallel Computing in Combinatorial Optimization Bernard Gendron Université de Montréal gendron@iro.umontreal.ca Course Outline Objective: provide an overview of the current research on the design of parallel

More information

Optimal Crane Scheduling

Optimal Crane Scheduling Optimal Crane Scheduling IonuŃ Aron Iiro Harjunkoski John Hooker Latife Genç Kaya March 2007 1 Problem Schedule 2 cranes to transfer material between locations in a manufacturing plant. For example, copper

More information

On Adaptive Confidences for Critic-Driven Classifier Combining

On Adaptive Confidences for Critic-Driven Classifier Combining On Adaptive Confidences for Critic-Driven Classifier Combining Matti Aksela and Jorma Laaksonen Neural Networks Research Centre Laboratory of Computer and Information Science P.O.Box 5400, Fin-02015 HUT,

More information

6LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃ7LPHÃIRUÃDÃ6SDFH7LPH $GDSWLYHÃ3URFHVVLQJÃ$OJRULWKPÃRQÃDÃ3DUDOOHOÃ(PEHGGHG 6\VWHP

6LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃ7LPHÃIRUÃDÃ6SDFH7LPH $GDSWLYHÃ3URFHVVLQJÃ$OJRULWKPÃRQÃDÃ3DUDOOHOÃ(PEHGGHG 6\VWHP LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃLPHÃIRUÃDÃSDFHLPH $GDSWLYHÃURFHVVLQJÃ$OJRULWKPÃRQÃDÃDUDOOHOÃ(PEHGGHG \VWHP Jack M. West and John K. Antonio Department of Computer Science, P.O. Box, Texas Tech University,

More information

Proceedings of the 2012 International Conference on Industrial Engineering and Operations Management Istanbul, Turkey, July 3 6, 2012

Proceedings of the 2012 International Conference on Industrial Engineering and Operations Management Istanbul, Turkey, July 3 6, 2012 Proceedings of the 2012 International Conference on Industrial Engineering and Operations Management Istanbul, Turkey, July 3 6, 2012 Solving Assembly Line Balancing Problem in the State of Multiple- Alternative

More information

A Modified Genetic Algorithm for Task Scheduling in Multiprocessor Systems

A Modified Genetic Algorithm for Task Scheduling in Multiprocessor Systems A Modified Genetic Algorithm for Task Scheduling in Multiprocessor Systems Yi-Hsuan Lee and Cheng Chen Department of Computer Science and Information Engineering National Chiao Tung University, Hsinchu,

More information

Chapter 16. Greedy Algorithms

Chapter 16. Greedy Algorithms Chapter 16. Greedy Algorithms Algorithms for optimization problems (minimization or maximization problems) typically go through a sequence of steps, with a set of choices at each step. A greedy algorithm

More information

On the Max Coloring Problem

On the Max Coloring Problem On the Max Coloring Problem Leah Epstein Asaf Levin May 22, 2010 Abstract We consider max coloring on hereditary graph classes. The problem is defined as follows. Given a graph G = (V, E) and positive

More information

Mobile Cloud Multimedia Services Using Enhance Blind Online Scheduling Algorithm

Mobile Cloud Multimedia Services Using Enhance Blind Online Scheduling Algorithm Mobile Cloud Multimedia Services Using Enhance Blind Online Scheduling Algorithm Saiyad Sharik Kaji Prof.M.B.Chandak WCOEM, Nagpur RBCOE. Nagpur Department of Computer Science, Nagpur University, Nagpur-441111

More information

Contention-Aware Scheduling with Task Duplication

Contention-Aware Scheduling with Task Duplication Contention-Aware Scheduling with Task Duplication Oliver Sinnen, Andrea To, Manpreet Kaur Department of Electrical and Computer Engineering, University of Auckland Private Bag 92019, Auckland 1142, New

More information

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Stanley Bak Abstract Network algorithms are deployed on large networks, and proper algorithm evaluation is necessary to avoid

More information

Simplicial Global Optimization

Simplicial Global Optimization Simplicial Global Optimization Julius Žilinskas Vilnius University, Lithuania September, 7 http://web.vu.lt/mii/j.zilinskas Global optimization Find f = min x A f (x) and x A, f (x ) = f, where A R n.

More information

Branch-and-bound: an example

Branch-and-bound: an example Branch-and-bound: an example Giovanni Righini Università degli Studi di Milano Operations Research Complements The Linear Ordering Problem The Linear Ordering Problem (LOP) is an N P-hard combinatorial

More information

A STUDY OF THE PERFORMANCE TRADEOFFS OF A TRADE ARCHIVE

A STUDY OF THE PERFORMANCE TRADEOFFS OF A TRADE ARCHIVE A STUDY OF THE PERFORMANCE TRADEOFFS OF A TRADE ARCHIVE CS737 PROJECT REPORT Anurag Gupta David Goldman Han-Yin Chen {anurag, goldman, han-yin}@cs.wisc.edu Computer Sciences Department University of Wisconsin,

More information

Decreasing a key FIB-HEAP-DECREASE-KEY(,, ) 3.. NIL. 2. error new key is greater than current key 6. CASCADING-CUT(, )

Decreasing a key FIB-HEAP-DECREASE-KEY(,, ) 3.. NIL. 2. error new key is greater than current key 6. CASCADING-CUT(, ) Decreasing a key FIB-HEAP-DECREASE-KEY(,, ) 1. if >. 2. error new key is greater than current key 3.. 4.. 5. if NIL and.

More information

Constraint Satisfaction Problems

Constraint Satisfaction Problems Constraint Satisfaction Problems Search and Lookahead Bernhard Nebel, Julien Hué, and Stefan Wölfl Albert-Ludwigs-Universität Freiburg June 4/6, 2012 Nebel, Hué and Wölfl (Universität Freiburg) Constraint

More information

A Duplication Based List Scheduling Genetic Algorithm for Scheduling Task on Parallel Processors

A Duplication Based List Scheduling Genetic Algorithm for Scheduling Task on Parallel Processors A Duplication Based List Scheduling Genetic Algorithm for Scheduling Task on Parallel Processors Dr. Gurvinder Singh Department of Computer Science & Engineering, Guru Nanak Dev University, Amritsar- 143001,

More information

DESIGN AND ANALYSIS OF ALGORITHMS

DESIGN AND ANALYSIS OF ALGORITHMS DESIGN AND ANALYSIS OF ALGORITHMS QUESTION BANK Module 1 OBJECTIVE: Algorithms play the central role in both the science and the practice of computing. There are compelling reasons to study algorithms.

More information

A Genetic Algorithm for Multiprocessor Task Scheduling

A Genetic Algorithm for Multiprocessor Task Scheduling A Genetic Algorithm for Multiprocessor Task Scheduling Tashniba Kaiser, Olawale Jegede, Ken Ferens, Douglas Buchanan Dept. of Electrical and Computer Engineering, University of Manitoba, Winnipeg, MB,

More information

Search Algorithms for Discrete Optimization Problems

Search Algorithms for Discrete Optimization Problems Search Algorithms for Discrete Optimization Problems Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. 1 Topic

More information

Comparing Gang Scheduling with Dynamic Space Sharing on Symmetric Multiprocessors Using Automatic Self-Allocating Threads (ASAT)

Comparing Gang Scheduling with Dynamic Space Sharing on Symmetric Multiprocessors Using Automatic Self-Allocating Threads (ASAT) Comparing Scheduling with Dynamic Space Sharing on Symmetric Multiprocessors Using Automatic Self-Allocating Threads (ASAT) Abstract Charles Severance Michigan State University East Lansing, Michigan,

More information

A Heuristic Algorithm for Designing Logical Topologies in Packet Networks with Wavelength Routing

A Heuristic Algorithm for Designing Logical Topologies in Packet Networks with Wavelength Routing A Heuristic Algorithm for Designing Logical Topologies in Packet Networks with Wavelength Routing Mare Lole and Branko Mikac Department of Telecommunications Faculty of Electrical Engineering and Computing,

More information

COMP/CS 605: Introduction to Parallel Computing Topic: Parallel Computing Overview/Introduction

COMP/CS 605: Introduction to Parallel Computing Topic: Parallel Computing Overview/Introduction COMP/CS 605: Introduction to Parallel Computing Topic: Parallel Computing Overview/Introduction Mary Thomas Department of Computer Science Computational Science Research Center (CSRC) San Diego State University

More information

Column Generation Method for an Agent Scheduling Problem

Column Generation Method for an Agent Scheduling Problem Column Generation Method for an Agent Scheduling Problem Balázs Dezső Alpár Jüttner Péter Kovács Dept. of Algorithms and Their Applications, and Dept. of Operations Research Eötvös Loránd University, Budapest,

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information

OPERATING SYSTEM. The Process. Introduction Process creation & termination Process state diagram Process scheduling & its criteria

OPERATING SYSTEM. The Process. Introduction Process creation & termination Process state diagram Process scheduling & its criteria OPERATING SYSTEM The Process Introduction Process creation & termination Process state diagram Process scheduling & its criteria Process The concept of process is fundamental to the structure of operating

More information

Multiprocessor and Real- Time Scheduling. Chapter 10

Multiprocessor and Real- Time Scheduling. Chapter 10 Multiprocessor and Real- Time Scheduling Chapter 10 Classifications of Multiprocessor Loosely coupled multiprocessor each processor has its own memory and I/O channels Functionally specialized processors

More information

Worst-Case Utilization Bound for EDF Scheduling on Real-Time Multiprocessor Systems

Worst-Case Utilization Bound for EDF Scheduling on Real-Time Multiprocessor Systems Worst-Case Utilization Bound for EDF Scheduling on Real-Time Multiprocessor Systems J.M. López, M. García, J.L. Díaz, D.F. García University of Oviedo Department of Computer Science Campus de Viesques,

More information

CONSTRUCTION AND EVALUATION OF MESHES BASED ON SHORTEST PATH TREE VS. STEINER TREE FOR MULTICAST ROUTING IN MOBILE AD HOC NETWORKS

CONSTRUCTION AND EVALUATION OF MESHES BASED ON SHORTEST PATH TREE VS. STEINER TREE FOR MULTICAST ROUTING IN MOBILE AD HOC NETWORKS CONSTRUCTION AND EVALUATION OF MESHES BASED ON SHORTEST PATH TREE VS. STEINER TREE FOR MULTICAST ROUTING IN MOBILE AD HOC NETWORKS 1 JAMES SIMS, 2 NATARAJAN MEGHANATHAN 1 Undergrad Student, Department

More information

Local-Deadline Assignment for Distributed Real-Time Systems

Local-Deadline Assignment for Distributed Real-Time Systems Local-Deadline Assignment for Distributed Real-Time Systems Shengyan Hong, Thidapat Chantem, Member, IEEE, and Xiaobo Sharon Hu, Senior Member, IEEE Abstract In a distributed real-time system (DRTS), jobs

More information