Efficient Algorithms for Scheduling and Mapping of Parallel Programs onto Parallel Architectures


Efficient Algorithms for Scheduling and Mapping of Parallel Programs onto Parallel Architectures

by Yu-Kwong KWOK

A Thesis Presented to The Hong Kong University of Science and Technology in Partial Fulfilment of the Requirements for the Degree of Master of Philosophy in Computer Science

Hong Kong, June 1994

Copyright by Yu-Kwong KWOK 1994

Authorisation

I hereby declare that I am the sole author of the thesis. I authorise the Hong Kong University of Science and Technology to lend this thesis to other institutions or individuals for the purpose of scholarly research. I further authorise the Hong Kong University of Science and Technology to reproduce the thesis by photocopying or by other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research.

Efficient Algorithms for Scheduling and Mapping of Parallel Programs onto Parallel Architectures

by Yu-Kwong KWOK

APPROVED:
Dr. Ishfaq AHMAD, Lecturer (Advisor)
Dr. Jogesh K. MUPPALA, Lecturer
Dr. Helen C. SHEN, Senior Lecturer
Department of Computer Science
June 10, 1994

Acknowledgements

I would like to sincerely thank my advisor, Dr. Ishfaq Ahmad, for his patience, guidance and invaluable advice on my studies. I am very grateful for his continual support in both academic and personal matters. I am most grateful to my wife, Fyon, for her love and patience, which keep me working whenever I am frustrated. Without the encouragement of Dr. Ahmad and Fyon, it would have been difficult to continue my graduate studies. I am also very grateful to Dr. Jogesh K. Muppala and Dr. Helen C. Shen for their helpful reviews of and suggestions on the thesis. Thanks are also extended to Dr. Siu-Wing Cheng and Dr. Michael Kaminski for their invaluable advice on my studies. I enjoyed working with Richard Li, Eric Yeung and Warren Tsui on the CASCH project, and I thank them for their precious contributions. I would like to acknowledge the Hong Kong Research Grants Council for supporting this work (under contract number HKUST 179/93E). Finally, I thank the Computer Science Department for its generosity in providing a nice and convenient work environment for its graduate students. I hope that the department will continue to do so in the future.

Table of Contents

Title Page
Authorisation Page
Signature Page
Acknowledgements
Table of Contents
List of Figures
List of Tables
Abstract

Chapter 1 Introduction
  1.1 Overview
  1.2 Parallel Architectures and the Scheduling Problem
  1.3 A Taxonomy of Approaches to the Scheduling Problem
  1.4 Outline of the Thesis

Chapter 2 Evolution of the Scheduling Problem
  2.1 Introduction
  2.2 Problem Statement
  2.3 Optimal Static Scheduling Algorithms
    2.3.1 Optimal Scheduling of Tree-structured Task Graphs
    2.3.2 Optimal Scheduling for Two-processor Systems
  2.4 Heuristic Approaches
  2.5 State-of-the-Art Scheduling Algorithms
    2.5.1 The EZ Algorithm
    2.5.2 The MCP Algorithm
    2.5.3 The MD Algorithm
    2.5.4 The DSC Algorithm
  2.6 Scheduling Using Task Duplication
    2.6.1 The DSH Algorithm
    2.6.2 The BTDH Algorithm
  2.7 Scheduling and Mapping Algorithms
    2.7.1 The MH Algorithm
    2.7.2 The DLS Algorithm
  2.8 Summary

Chapter 3 The Dynamic Critical Path Scheduling Algorithm
  3.1 Introduction
  3.2 The Proposed Algorithm
    3.2.1 Design Principles
    3.2.2 The DCP Algorithm
  3.3 An Application Example
  3.4 Workload Generation
  3.5 The Performance Comparison
    3.5.1 Comparison of Schedule Lengths
    3.5.2 A Global Comparison
    3.5.3 Number of Processors
    3.5.4 Comparison of Running Times
  3.6 Summary

Chapter 4 Exploiting Task Duplication in Scheduling
  4.1 Introduction
  4.2 The Proposed Algorithms
    4.2.1 Algorithm for Unlimited Number of Processors
    4.2.2 Algorithm for Limited Number of Processors
    4.2.3 Algorithm for Heterogeneous Processors
  4.3 An Application Example
  4.4 The Performance Comparison
    4.4.1 Unlimited Homogeneous Processors
    4.4.2 Limited Homogeneous Processors
    4.4.3 Heterogeneous Processors
  4.5 Summary

Chapter 5 The Bubble Scheduling and Allocation Algorithm
  5.1 Introduction
  5.2 The Proposed Algorithm
    5.2.1 Definitions and Notations
    5.2.2 Description of the Algorithm
    5.2.3 Characteristics of the Proposed Approach
  5.3 An Application Example
  5.4 The Performance Comparison
  5.5 Summary

Chapter 6 Conclusions and Future Work

References

List of Figures

Figure 1.1: (a) A shared-memory architecture; (b) Message-passing (distributed memory) architectures
Figure 1.2: (a) A taxonomy of the approaches to the scheduling problem; (b) A task interaction graph; (c) A task precedence graph
Figure 2.1: (a) A simple tree-structured task graph with unit-cost tasks and without communication among tasks; (b) The optimal schedule of the task graph using three processors
Figure 2.2: (a) A simple task graph with unit-cost tasks and without communication among tasks; (b) The optimal schedule of the task graph in a two-processor system
Figure 2.3: (a) A task graph; (b) The schedule generated by the HLFET algorithm (schedule length = 43 time units); (c) The best possible schedule (schedule length = 34 time units)
Figure 2.4: The schedule of the task graph in Figure 2.3(a) generated by the EZ algorithm (schedule length = 35 time units)
Figure 2.5: (a) ASAP binding and (b) ALAP binding of the task graph in Figure 2.3(a)
Figure 2.6: (a) A randomly generated task graph; (b) An initial schedule without duplication (schedule length = 299 time units); (c) The final schedule produced by the DSH algorithm (schedule length = 275 time units); (d) An intermediate schedule in which the duplication of n_2 increases the start times of n_3 and n_5; (e) The final schedule produced by the BTDH algorithm (schedule length = 246 time units)
Figure 2.7: (a) A simple task graph; (b) an intermediate schedule generated by MH after node n_4 is scheduled; (c) another intermediate schedule generated by MH after node n_5 is scheduled; and (d) the final schedule produced by the MH algorithm
Figure 3.1: A parallel Gaussian elimination task graph
Figure 3.2: The schedule of the Gaussian elimination task graph generated by the EZ algorithm (schedule length = 600 time units)
Figure 3.3: The schedule of the Gaussian elimination task graph generated by the MCP algorithm and the DLS algorithm (schedule length = 5 time units)
Figure 3.4: The schedule of the Gaussian elimination task graph generated by the DSC algorithm (schedule length = 460 time units)
Figure 3.5: The schedule of the Gaussian elimination task graph generated by the MD algorithm (schedule length = 460 time units)
Figure 3.6: The schedule of the Gaussian elimination task graph generated by the DCP algorithm (schedule length = 4 time units)
Figure 3.7: (a) An in-tree task graph; (b) An out-tree task graph; (c) A fork-join task graph; (d) An LU-decomposition task graph; (e) A mean value analysis task graph; (f) A Laplace equation solver task graph; (g) An FFT task graph
Figure 3.8: Average normalized schedule lengths (with respect to lower bounds) at various graph sizes for the Gaussian elimination graph; algorithm ranking: DCP, (MD, MCP), DLS, DSC, EZ
Figure 3.9: Average normalized schedule lengths (with respect to lower bounds) at various graph sizes for the Laplace equation graph; algorithm ranking: DCP, DLS, DSC, MCP, MD, EZ
Figure 3.10: Average normalized schedule lengths (with respect to lower bounds) at various graph sizes for the LU-decomposition graph; algorithm ranking: DCP, MCP, DLS, MD, DSC, EZ
Figure 3.11: Average normalized schedule lengths (with respect to lower bounds) at various graph sizes for the fast Fourier transform graph; algorithm ranking: DCP, MD, MCP, DLS, EZ, DSC
Figure 3.12: Average normalized schedule lengths (with respect to lower bounds) at various graph sizes for the mean value analysis graph; algorithm ranking: DCP, (DSC, MCP), (DLS, EZ), MD
Figure 3.13: Average normalized schedule lengths (with respect to lower bounds) at various graph sizes for in-tree graphs; algorithm ranking: DCP, EZ, DSC, MD, MCP, DLS
Figure 3.14: Average normalized schedule lengths (with respect to lower bounds) at various graph sizes for out-tree graphs; algorithm ranking: DCP, DLS, DSC, (MCP, EZ), MD
Figure 3.15: Average normalized schedule lengths (with respect to lower bounds) at various graph sizes for completely random graphs; algorithm ranking: DCP, MD, (MCP, DLS), EZ, DSC
Figure 3.16: Average normalized schedule lengths (with respect to lower bounds) at various graph sizes for fork-join graphs; algorithm ranking: DCP, MD, MCP, DSC, DLS, EZ
Figure 3.17: A global comparison of the six algorithms in terms of better, worse and equal performance
Figure 3.18: Average number of processors used by each algorithm
Figure 3.19: Average running time for each algorithm
Figure 4.1: (a) A parallel Gaussian elimination task graph and (b) its optimal schedule without using duplication (schedule length = 3 time units)
Figure 4.2: The schedule generated by the DSH/HLFET algorithm and the BTDH/HLFET algorithm (schedule length = 3 time units)
Figure 4.3: The schedule generated by the CPFD algorithm (schedule length = 0 time units)
Figure 4.4: The schedule generated by the ECPFD algorithm (schedule length = 310 time units)
Figure 4.5: The schedule generated by the HCPFD algorithm onto heterogeneous processors with variance factor = 0.5 (schedule length = 182 time units)
Figure 4.6: Average number of processors used by the DSH, BTDH and CPFD algorithms
Figure 4.7: Efficiency of the BTDH, CPFD and ECPFD (100%, % and 50% processors) algorithms for various task graph sizes and CCRs
Figure 4.8: Percentage improvement in schedule length with HCPFD (VF = 0.2, 0.3, 0.5, 0.9) over CPFD at various task graph sizes and CCRs
Figure 5.1: A task graph representing the mean value analysis algorithm for a matrix of dimension 5x
Figure 5.2: The schedules generated by (a) the MH algorithm (schedule length = 983, total communication cost incurred = 833) and (b) the DLS algorithm (schedule length = 998, total communication cost incurred = 748)
Figure 5.3: The intermediate schedules generated by the BSA algorithm (a) after phase-1 (schedule length = 11, total communication cost incurred = 187) and (b) after phase-2 (schedule length = 1187, total communication cost incurred = 3)
Figure 5.4: (a) The intermediate schedule generated by the BSA algorithm after phase-3 (schedule length = 1056, total communication cost incurred = 493); (b) The final schedule generated by the BSA algorithm (schedule length = 936, total communication cost incurred = 6)
Figure 5.5: Normalized schedule lengths of regular graphs with varying dimensions on (a) 8-node ring, (b) 8-node hypercube, (c) 16-node ring, (d) 16-node hypercube, (e) 8-node random, and (f) 8-node fully connected topology
Figure 5.6: Normalized schedule lengths of random graphs with varying numbers of nodes on (a) 8-node ring, (b) 8-node hypercube, (c) 16-node ring, (d) 16-node hypercube, (e) 8-node random, and (f) 8-node fully connected topology

List of Tables

Table 3.1: Symbols and their meanings
Table 4.1: Notations and their meanings
Table 4.2: A performance comparison of the DSH, BTDH and CPFD scheduling algorithms using an unlimited number of processors
Table 4.3: A comparison of the BTDH and ECPFD algorithms with a limited number of processors
Table 5.1: Notations and their meanings
Table 5.2: A relative comparison of the MH, DLS and BSA algorithms for a 500-node mean value analysis task graph on various topologies
Table 5.3: A relative comparison of the MH, DLS and BSA algorithms for a 500-node Gaussian elimination task graph on various topologies
Table 5.4: A relative comparison of the MH, DLS and BSA algorithms for a 500-node Laplace equation solver task graph on various topologies
Table 5.5: A relative comparison of the MH, DLS and BSA algorithms for a 500-node LU-decomposition task graph on various topologies
Table 5.6: A relative comparison of the MH, DLS and BSA algorithms for a 500-node random task graph on various topologies

Efficient Algorithms for Scheduling and Mapping of Parallel Programs onto Parallel Architectures

by Yu-Kwong KWOK

Department of Computer Science
Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong

June 1994

Abstract

Scheduling and mapping of computations onto processors is one of the crucial components of a parallel processing environment. Since the scheduling and mapping problems are known to be NP-complete in many variants, most previous solutions are based on heuristics. However, these approaches make simplifying assumptions about the parallel program and the target machine architecture and are therefore useful only in very limited environments. In this thesis, we propose efficient algorithms for compile-time scheduling and mapping of parallel programs onto parallel processing systems under more realistic assumptions. Our first algorithm, called the Dynamic Critical Path (DCP) algorithm, is designed for scheduling arbitrary task graphs on an unlimited number of fully connected processors and is based on new principles. The DCP algorithm outperforms all of the contemporary scheduling algorithms known to us. The second algorithm, called the Critical Path Fast Duplication (CPFD) algorithm, is designed to exploit task duplication in scheduling. Using task duplication can drastically reduce the communication overhead. The CPFD algorithm outperforms two recently reported algorithms. We have developed a number of versions of this algorithm which can be used with a limited or unlimited number of homogeneous as well as heterogeneous processors, such as a cluster of workstations. The most attractive feature of these algorithms is that they can automatically adjust the degree of duplication depending upon the speed of the communication network and the number of processors available in the target system. The third algorithm, named the Bubble Scheduling and Allocation (BSA) algorithm, is based on a novel technique which differs from the classical methods. It works by first injecting all of the tasks of a parallel program into one processor in a serial fashion. The tasks are then bubbled up to other processors depending upon the network topology. The BSA algorithm not only schedules tasks but also schedules communication edges on the communication channels. The algorithm can be used with any routing strategy and can optimize itself on any network topology. Our five proposed algorithms, as well as a number of algorithms proposed by others, have been implemented in an interactive software tool called CASCH (Computer-Aided SCHeduling).

Chapter 1 Introduction

1.1 Overview

During the past few years, we have witnessed a spectacular growth of parallel computing hardware platforms [23], []. This is because a variety of architectures have emerged exploiting advancements in processor technology, low-overhead switches, fast communication channels, and rich interconnection network topologies. As the hardware of parallel processing systems evolves towards the goal of teraflop performance, the software designers of these systems face increasingly difficult challenges. These include designing new algorithms, programming models, languages, automated programming aids and performance assessment tools. Perhaps the most crucial component of efficient parallel processing software is the scheduling and allocation of the modules of a parallel program to the processors, because the modules of the parallel program must be properly arranged in time and space in order to optimize performance. Given a parallel program represented by a task graph, in which the nodes represent the tasks and the edges represent the communication costs and precedence constraints among the tasks, a scheduling algorithm determines the execution order of the tasks and a mapping algorithm determines the allocation of these tasks to processors. The prime objective of scheduling and mapping is to minimize the execution time. This is equivalent to maximizing the speedup, which is defined as the time required for sequential execution divided by the time required for parallel execution. It is well known that the multiprocessor scheduling problem is NP-complete [26] in its many variants except for a few highly simplified cases [29], [36], [38], [59]. To tackle the

problem, many heuristic algorithms, based on simplifying assumptions about the structure of the parallel programs as well as the underlying parallel processing systems, have been reported in the literature [1], [11], [16], [19], [33], [39], [64], [66], [67], [69]. For more realistic cases, a scheduling algorithm needs to address a number of issues. It should exploit the parallelism by identifying the task graph structure, and take into consideration task granularity, load balancing, arbitrary computation and communication costs, the number of processors, and interprocessor communication. Moreover, in order to be of practical use, a scheduling algorithm should be fast and economical in terms of the number of processors used. Addressing all of these issues makes the scheduling problem even more complex and challenging.

In this thesis, we propose efficient and fast algorithms for compile-time scheduling and mapping of parallel programs onto scalable parallel processing systems, under realistic assumptions. Our first algorithm, called the Dynamic Critical Path (DCP) algorithm, is designed for a virtual architecture composed of an unlimited number of fully connected processors and is based on new principles. The algorithm determines the critical path of the task graph at each scheduling step, and it schedules tasks to processors by rearranging the schedule dynamically, in the sense that the tasks in the partial schedules remain mobile until the scheduling process finishes. The DCP algorithm outperforms all of the contemporary scheduling algorithms known to us. The second algorithm, called the Critical Path Fast Duplication (CPFD) algorithm, is designed to exploit task duplication in scheduling. Using duplication can drastically reduce the communication overhead. The CPFD algorithm outperforms two recently reported algorithms. We have developed a number of versions of this algorithm which can be used with a limited or unlimited number of homogeneous as well as heterogeneous processors, such as a cluster of workstations. The most attractive feature of these algorithms is that they can automatically adjust the degree of duplication depending upon the speed of the communication network and the number of processors available in the target system. The third algorithm, named the Bubble Scheduling and Allocation (BSA) algorithm, is based on a novel technique which differs from the classical methods. It works by first injecting all of the tasks of a parallel program into one processor in a serial fashion. The tasks are then bubbled up to other processors depending upon the network topology. The BSA algorithm not only schedules tasks but also schedules communication edges on the communication channels. The algorithm can be used with any routing strategy and can optimize itself on any network topology. All of our proposed

algorithms are evaluated with a number of suites of task graphs. These include randomly generated graphs of various structures, as well as task graphs for a number of parallel algorithms such as mean value analysis, Gaussian elimination, FFT, LU-decomposition, and the Laplace equation solver.

With the increasing advancement of hardware technology, high-performance parallel processing machines are becoming more readily accessible. However, the lack of efficient parallel programming and algorithm design tools is a major hurdle to applying parallel machines to general applications. For example, without an efficient simulation tool, researchers usually find it very difficult to evaluate the performance of a scheduling algorithm. CASCH (Computer-Aided SCHeduling) is an interactive software tool for studying the performance of scheduling algorithms. We have implemented our proposed algorithms, as well as a number of algorithms developed by others, in the CASCH tool. With this versatile and flexible tool, a researcher or programmer can design and evaluate parallel program scheduling and mapping algorithms in a very efficient manner. The user can interactively draw task graphs which represent parallel programs, draw architecture graphs which represent the target parallel processing systems, and execute the various scheduling algorithms to observe and compare the performance of the parallel programs as well as the scheduling algorithms.

1.2 Parallel Architectures and the Scheduling Problem

Parallel computers can be broadly classified into two categories: shared-memory (Figure 1.1(a)) and message-passing (distributed memory) (Figure 1.1(b)) architectures. Shared-memory machines (e.g., the BBN Butterfly [22]) present a uniform-address-space view of the memory to the programmer; interprocessor communication is through writing and reading shared variables. The hardware generally provides equal-cost access to any shared variable from any of the processors, and there is no notion of communication locality. Message-passing architectures (e.g., hypercubes [32], [57]) use direct communication links between processors. Interprocessor communication and synchronization are achieved through explicit message passing. Each processing element (PE) is connected to a fixed number of PEs in some regular geometry such as a ring or a hypercube (see Figure 1.1(b)). The advantage of this approach over the shared-memory approach is the greater communication bandwidth in the system, due to the large number of simultaneous communications possible on the independent interprocessor links. Another advantage is scalability.

[Figure 1.1: (a) A shared-memory architecture, with processors PE_1, ..., PE_n connected through a shared bus to shared memory modules M_1, ..., M_k; (b) Message-passing (distributed memory) architectures: a 4-processor ring and an 8-processor hypercube.]

We can add new processors as well as communication channels to a message-passing multicomputer at a very low cost. The disadvantage is the longer communication delay when the destination processor is not directly connected to the source processor. Efficient scheduling of a parallel program onto the processors is vital to achieving high performance in both shared-memory and message-passing architectures. The scheduling algorithms proposed in this thesis can be applied to message-passing parallel architectures. These algorithms can be linked with parallel program compilers for performing static (deterministic) scheduling of macro data-flow graphs, which can be generated for the SPMD or MPMD style of parallel programs. These scheduling algorithms are executed off-line, and we do not assume any preemption in our study.
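To make these regular geometries concrete, the following minimal Python sketch (an illustration added here, not part of the thesis) builds the adjacency lists of the two topologies shown in Figure 1.1(b); in a d-dimensional hypercube, two PEs are linked exactly when their binary labels differ in a single bit:

```python
def ring(p):
    # Each PE i in a p-processor ring has exactly two neighbours.
    return {i: [(i - 1) % p, (i + 1) % p] for i in range(p)}

def hypercube(d):
    # Each PE in a 2^d-processor hypercube is linked to the d PEs whose
    # binary labels differ from its own in exactly one bit.
    return {i: [i ^ (1 << b) for b in range(d)] for i in range(1 << d)}

print(ring(4))       # the 4-processor ring of Figure 1.1(b)
print(hypercube(3))  # the 8-processor hypercube of Figure 1.1(b)
```

Such adjacency information is what a topology-aware scheduler, for example the BSA algorithm of Chapter 5, consults when it routes messages and accounts for contention on the links.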

1.3 A Taxonomy of Approaches to the Scheduling Problem

The scheduling problem is of crucial importance for the effective utilization of large-scale parallel computers and distributed computer networks. In a very broad sense, it is usual to subdivide the general scheduling problem into two categories: job scheduling, and scheduling and mapping (see Figure 1.2(a)). In the former class, independent jobs are to be scheduled among the processors of a distributed computing system to optimize overall system performance. In contrast, the scheduling and mapping problem requires the allocation of multiple interacting tasks of a single parallel program in order to minimize the completion time on the parallel computer system [10], [15], [17], [45], [55], [65]. While job scheduling requires dynamic run-time scheduling that is not a priori decidable, the scheduling and mapping problem can be addressed in both static [4], [7], [9], [18], [39], [42], [43], [44], [50], [51], [52], [56], [58], [61], [62], [67], [70] as well as dynamic contexts [2], [3], [37]. When the structure of the parallel program, in terms of its task execution times, task dependencies, task communications and synchronization, is known a priori, scheduling can be accomplished statically at compile time. On the contrary, dynamic scheduling is required when such a priori information is not available and scheduling is done on-the-fly according to the state of the system [2], [3]. Two distinctly different models of the parallel program have been considered extensively in the context of static scheduling: the task interaction graph (TIG) model and the task precedence graph (TPG) model. They are shown in Figure 1.2(b) and Figure 1.2(c). With the task interaction graph model, graph vertices represent parallel processes and edges denote the inter-process interaction [47]. However, temporal execution dependencies are not explicitly represented; thus, all tasks are considered essentially simultaneously and independently executable. For example, a TIG can be used to model the finite element method (FEM) [55]. The objective of mapping is the minimization of the parallel program completion time [48]. This requires balancing the computation load uniformly among the processors while simultaneously keeping communication costs as low as possible. The mapping problem is analogous to graph-to-graph mapping, since both the problem and machine models can be represented as graphs. The research in this area was pioneered by Stone and Bokhari [12], [13], [14]. Stone [63] applied network-flow algorithms to solve the assignment problem. Bokhari described the mapping problem as being equivalent to the graph isomorphism, quadratic assignment and sparse matrix bandwidth reduction problems [11]. However, this approach does not consider the precedence constraints among the tasks; it can be useful for assigning clusters of tasks to the processors if the tasks have already been scheduled into clusters.

[Figure 1.2: (a) A taxonomy of the approaches to the scheduling problem: parallel program scheduling splits into job scheduling (independent tasks) and scheduling and mapping (multiple interacting tasks); the latter splits into dynamic and static scheduling, and static scheduling uses either the task interaction graph or the task precedence graph model; (b) A task interaction graph; (c) A task precedence graph.]

Furthermore, since the temporal dependencies within the clusters are ignored, this approach cannot consider the sequencing of messages and the contention on the communication channels. With the task precedence graph model, a parallel program is viewed as a directed acyclic graph, in which the nodes represent the tasks and the directed edges represent the execution dependencies as well as the amounts of communication. Thus, in the task

precedence graph shown in Figure 1.2(c), task n_4 cannot commence execution before tasks n_2 and n_3 finish execution and it gathers all the communication data from n_2 and n_3. The scheduling objective is again to properly schedule the tasks in time and space so as to minimize the program completion time or, equivalently, maximize the speedup, which is defined as the time required for sequential execution divided by the time required for parallel execution. For most parallel applications, a task precedence graph can model the program more accurately because it captures the temporal dependencies among tasks. This is the model we use for the scheduling problem addressed in this thesis.

1.4 Outline of the Thesis

In Chapter 2, we present a discussion of the evolution of the static scheduling problem. We describe the earlier proposed classical scheduling algorithms as well as the current state of the art, and we briefly discuss the merits and limitations of these algorithms. In Chapter 3, we present our proposed Dynamic Critical Path (DCP) scheduling algorithm. We first describe the design principles of our algorithm. Then, we present the DCP algorithm, followed by an application example to illustrate the algorithm's effectiveness. Finally, the results and comparisons of the performance of our algorithm with other algorithms on a large set of task graphs with various types of structures are presented. In Chapter 4, we present our proposed Critical Path Fast Duplication (CPFD) scheduling algorithm. We first discuss the potential benefits of using task duplication in scheduling. Afterwards, we discuss the design principles of the CPFD algorithm. We then describe the CPFD algorithm and its variants, which are designed to tackle the cases of a limited number of homogeneous or heterogeneous processors. We also illustrate the functionality of the CPFD algorithm by presenting an application example. Finally, we present and discuss the experiments we conducted to investigate the performance of our algorithms compared with other duplication-based algorithms. In Chapter 5, we present our proposed Bubble Scheduling and Allocation (BSA) algorithm. We first discuss the issues of scheduling under realistic system constraints. We then describe the principles we used in designing an efficient and robust scheduling and mapping algorithm. Afterwards, we describe the BSA algorithm, followed by an application example which illustrates the superior performance of the BSA algorithm over other algorithms. Finally, we present and compare the performance of the BSA algorithm and two other algorithms. Chapter 6 concludes the thesis and suggests future research directions.

Chapter 2 Evolution of the Scheduling Problem

2.1 Introduction

In this chapter, we present a discussion of the evolution of the static scheduling problem by describing scheduling algorithms of different generations. First, we describe earlier reported scheduling algorithms, which were designed based on simplifying assumptions about the task graphs as well as the underlying multiprocessor systems. Second, we describe the succeeding generation of scheduling algorithms, which cannot guarantee optimal solutions but can be applied to more realistic environments. These algorithms are also called scheduling heuristics. Essentially, these heuristics extend previous ideas to adopt more realistic constraints, such as arbitrary computation costs, and take communication among tasks into consideration. Third, we describe four state-of-the-art scheduling algorithms, which were recently reported and have been shown to be efficient compared with many other algorithms. We also describe two recently reported scheduling algorithms that are based on task duplication. Finally, we describe two contemporary scheduling algorithms that also consider the scheduling of communication edges on communication links.

The scheduling problem existed even before the advent of parallel computers, since the allocation of a set of tasks to a single processor is also a non-trivial problem [29]. Static scheduling is an old problem and has been studied extensively in the operations research community for a long time. The scheduling problem in the context of parallel computers has benefited from the approaches employed in operations research, since the allocation of parallel programs to parallel processors is analogous to the allocation of a set of jobs to a set of machines. However, such approaches assume a very simple model of the parallel computer

[16], [19]. Over the years, with the rapid advancements in computer architectures, the scheduling problem has evolved through a number of generations. Every scheduling algorithm reported in the early literature works under different circumstances and assumptions. However, there are three fundamental questions to ask about a scheduling algorithm: (1) does the algorithm make realistic assumptions? (2) is it sophisticated enough to capture the architectural details of the system? and (3) does the complexity of the algorithm permit it to be practically used for compile-time scheduling?

The first question relates to the assumptions made by the scheduling algorithm about the parallel program and architecture models. As elaborated in later sections, earlier scheduling algorithms made simplifying assumptions such as equal computation times for all the tasks in the task graph, simple graph structures such as trees, or ignoring the communication delays among tasks altogether. Similarly, scheduling strategies that ignore precedence relations among tasks and the contention on the communication links of the system may work only in certain environments.

The second question is concerned with the optimization of the scheduling strategy with respect to the target architecture. The scheduling problem is not only problem-dependent but also machine-dependent. A scheduling algorithm tailored for one particular architecture may not generate efficient solutions on another architecture. Recently, a wide variety of architectures have emerged employing various design methodologies. Architectural attributes such as the system topology, the routing strategy, and overlapped communication and computation, if taken into account, can result in different allocation decisions. Therefore, more sophisticated algorithms are required that can optimize the allocation strategy for a given machine architecture.

The third question, which relates to the complexity of the heuristic, is an important consideration. For example, consider the task graph for Gaussian elimination [70]. The task graph for Gaussian elimination on a 4 by 4 matrix consists of 18 nodes and 28 edges. The number of nodes, n, in the task graph is roughly O(N^2) for a matrix of dimension N. Thus, a scheduling algorithm whose complexity is O(n^2) will have a complexity of O(N^4). A number of reported scheduling algorithms exhibit good performance by considering only a set of small task graphs. Such algorithms do not carry enough potential to be used for practical purposes. Some low-complexity algorithms, on the other end of the spectrum, do not always

perform well. We would like to design efficient algorithms whose complexities are low enough to make them scalable for scheduling very large task graphs on a large number of processors.

In the beginning, the architecture of the parallel machine and the parallel program were represented in a very abstract fashion. In order to tackle the problem, simplifying assumptions were made regarding the task graph structure representing the program and the model of the parallel processor system [], [24], [25], [33]. However, the problem is NP-complete even in two simple cases: (1) scheduling unit-time tasks to an arbitrary number of processors, and (2) scheduling one- or two-time-unit tasks to two processors [46]. There are only two special cases for which optimal polynomial-time algorithms exist: scheduling tree-structured task graphs with identical computation costs on an arbitrary number of processors [36], and scheduling arbitrary task graphs with identical computation costs on two processors [19]. However, even in these cases, no communication is assumed among the tasks of the parallel program [].

There are many approaches that can be employed in static scheduling [27], [28]. These include queuing theory, graph-theoretic approaches, mathematical programming [34], [35] and state-space search [31], []. In the classical approach [1], [53], which is also called list scheduling, the basic idea is to make an ordered list of nodes by assigning them priorities, and then repeatedly execute the following two steps until a valid schedule is obtained:

1) Select from the list the node with the highest priority for scheduling.
2) Select a processor to accommodate this node.

The priorities are determined statically before the scheduling process begins. In the scheduling process, the node with the highest priority is chosen for scheduling. In the second step, the best possible processor, that is, the one which allows the earliest start time, is selected to accommodate this node. Most of the earlier reported scheduling algorithms are based on this concept, employing variations in the priority assignment methods such as HLF (Highest Level First), LP (Longest Path), LPT (Longest Processing Time) and CP (Critical Path) [], [29]. However, static priority assignment may not always precisely order the nodes for scheduling according to their relative importance. A node is more important than other

nodes if timely scheduling of that node can eventually lead to a better schedule. The drawback of the static approach is that an inefficient schedule may be generated if a relatively less important node is chosen for scheduling before the more important ones. Static priority assignment fails to capture the variation in the relative importance of nodes during the scheduling process. In order to avoid scheduling less important nodes before more important ones, node priorities need to be determined dynamically during the scheduling process. The priorities of the nodes are re-computed after a node has been scheduled, in order to capture the changes in the relative importance of the remaining nodes. Thus, the following three steps are repeatedly executed in such scheduling algorithms:

1) Determine new priorities for all unscheduled nodes.
2) Select the node with the highest priority for scheduling.
3) Select the most suitable processor to accommodate this node.

Scheduling algorithms which employ the above three-step approach can potentially generate better schedules [43], [61]. However, this can increase the complexity of the algorithm.

2.2 Problem Statement

A parallel program is represented by a directed acyclic graph. A node in the parallel program graph represents a task, which is a set of instructions that must be executed serially in the same processor. Associated with each node is its computation cost, denoted by w(n_i), which indicates the amount of computation required. The edges in the parallel program graph correspond to the communication messages and precedence constraints among the nodes. Associated with each edge is a number indicating the amount of communication data sent from one node to another. This number is called the communication cost of the edge and is denoted by c_ij, where the subscript ij indicates that the directed edge emerges from the source node n_i and is incident on the destination node n_j. The source node and the destination node of an edge are called the parent node and the child node, respectively. The communication-to-computation ratio (CCR) of a parallel program is defined as its average communication cost divided by its average computation cost on a given system. We assume each processor in the system possesses dedicated communication hardware, so that communication can take place simultaneously with computation. In a task graph, a node which does not have any parent is called an entry node, while a node which does not have any child is called an exit node. A node cannot start execution before it gathers all of the messages from its

parent nodes. The communication cost between two nodes assigned to the same processor is assumed to be zero. The scheduling of nodes is non-preemptive: a task scheduled to a processor cannot be interrupted before its completion. The scheduling problem is defined as the allocation of a set of tasks to a set of processors such that the cumulative schedule length, or makespan, is minimized without violating the precedence constraints among the tasks. A schedule is considered efficient if the schedule length is short and the number of processors used is reasonable.

2.3 Optimal Static Scheduling Algorithms

There are notably few known polynomial-time scheduling algorithms for determining minimum-length schedules, even when severe constraints are imposed on the task graphs and the underlying parallel processing systems. Indeed, there are only two cases for which polynomial-time optimal scheduling algorithms are known: (1) when the task graph is a rooted tree, and (2) when there are only two processors available. In both cases, every task in the task graph has unit computation cost, and there is no communication assumed among the tasks. That is, w(n_i) = 1 and c_ij = 0 for all i and j. In the following, we describe the algorithms for these two highly simplified cases.

2.3.1 Optimal Scheduling of Tree-structured Task Graphs

Hu proposed a polynomial-time algorithm for determining minimum-length schedules for tree-structured task graphs with unit-cost tasks and without communication among tasks [36]. The first step in Hu's algorithm involves the labelling of the nodes. A node n_i is recursively given the label α_i = X_i + 1, where X_i is the length of the longest path from n_i to the exit node in the graph. Here, it should be noted that the rooted tree is assumed to be an in-tree; that is, each task in the graph has only one successor and there is only one exit node, which is the root of the tree. The labelling process begins with the exit node, which is given the label α_1 = 1. Nodes that are one edge above the exit node are given the label 2, and so on. It is clear that the minimum time T_min to process the graph is related to α_max, the highest numbered label, by the following inequality: T_min ≥ α_max.
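As an illustration of the labelling (a minimal sketch added here under the stated unit-cost, in-tree assumptions; the node names in the example are hypothetical), the labels α_i can be computed by walking each node down to the exit node, given the in-tree as a map from each node to its unique successor:

```python
def hu_labels(succ):
    # succ maps each node to its unique successor; the exit node maps to None.
    # alpha[i] = 1 + length of the path from node i down to the exit node.
    alpha = {}
    def label(i):
        if i not in alpha:
            alpha[i] = 1 if succ[i] is None else 1 + label(succ[i])
        return alpha[i]
    for i in succ:
        label(i)
    return alpha

# Hypothetical in-tree: n4 and n5 feed n2; n2 and n3 feed the exit node n1.
succ = {"n1": None, "n2": "n1", "n3": "n1", "n4": "n2", "n5": "n2"}
print(hu_labels(succ))  # {'n1': 1, 'n2': 2, 'n3': 2, 'n4': 3, 'n5': 3}
```

For this tree, α_max = 3, so by the inequality above no schedule, on any number of processors, can finish in fewer than 3 time units.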

Using the above labelling procedure, an optimal schedule can be obtained for m processors by processing a tree-structured task graph in the following steps:

(1) Schedule the first m (or fewer) nodes with the highest numbered labels, i.e., the entry nodes, to the processors. If the number of entry nodes is greater than m, choose the m nodes whose α_i are greater than the others'; in case of a tie, choose a node arbitrarily.
(2) Remove the m scheduled nodes from the graph. Treat the nodes with no predecessors as the new entry nodes.
(3) Repeat steps (1) and (2) until all nodes are scheduled.

The labelling process of the algorithm partitions the task graph into a number of levels. In the scheduling process, each level of tasks is assigned to the available processors. Schedules generated using the above steps are optimal under the stated constraints. This is illustrated by the simple task graph and its optimal schedule shown in Figure 2.1. The complexity of the algorithm is linear in the number of nodes, because each node in the task graph is visited a constant number of times.

[Figure 2.1: (a) A simple tree-structured task graph with unit-cost tasks and without communication among tasks; (b) The optimal schedule of the task graph using three processors.]

Although Hu's algorithm can generate optimal schedules, it is not useful in practical parallel processing systems because it requires too severe constraints on the structure of the parallel programs.
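Despite this restriction, the algorithm is compact. As a concrete illustration (again a sketch added here, reusing the hu_labels function and the hypothetical tree succ from the previous fragment), the three steps amount to repeatedly running the at most m ready nodes with the highest labels:

```python
def hu_schedule(succ, m):
    # Schedule the in-tree succ on m processors with Hu's algorithm.
    alpha = hu_labels(succ)
    pending, done, steps = set(succ), set(), []
    while pending:
        # A node is ready once all of its predecessors have been executed.
        ready = [i for i in pending
                 if all(j in done for j in succ if succ[j] == i)]
        # Pick the m ready nodes with the highest labels (ties broken by name).
        batch = sorted(ready, key=lambda i: (-alpha[i], i))[:m]
        steps.append(batch)
        done.update(batch)
        pending.difference_update(batch)
    return steps

print(hu_schedule(succ, 3))
# [['n4', 'n5', 'n3'], ['n2'], ['n1']] -- schedule length 3 = alpha_max
```

In this example the schedule meets the lower bound T_min ≥ α_max, as the optimality result guarantees for this restricted class of graphs.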

2.3.2 Optimal Scheduling for Two-processor Systems

Optimal results for static scheduling have also been obtained by Coffman, Graham and Sethi [19], [59]. They developed an algorithm for generating optimal schedules for arbitrary task graphs with unit-cost tasks and without communication among tasks on two-processor systems. Their algorithm works on principles similar to Hu's algorithm. The algorithm first assigns labels to each node in the task graph. The assignment process proceeds up the graph, considering as candidates for the assignment of the next label all nodes whose successors have already been assigned labels. After all nodes are assigned labels, a list is formed by ordering the tasks in decreasing label numbers, beginning with the last label assigned. The optimal schedule is then obtained by scheduling ready tasks in this list to idle processors. This is elaborated in the following steps:

(1) Assign label 1 to one of the exit nodes.
(2) Assume that labels 1, 2, ..., j-1 have been assigned. Let S be the set of unassigned nodes with no unlabelled successors. Select an element of S to be assigned label j as follows: for each node x in S, let y_1, y_2, ..., y_k be the immediate successors of x; then define l(x) to be the decreasing sequence of integers formed by ordering the labels of the y's. Suppose that l(x) ≤ l(x') lexicographically for all x' in S; then assign the label j to x.
(3) After all tasks have been labelled, use the list of tasks in descending order of labels for scheduling. Beginning from the first task in the list, schedule each task to the one of the two given processors that allows the earlier execution of the task.

Schedules generated using the above algorithm are optimal under the given constraints. An example is illustrated in Figure 2.2. Through the use of counter-examples, Coffman and Graham demonstrated that their algorithm can generate sub-optimal solutions when the number of processors is increased to three or more, or when the number of processors is two and tasks are allowed to have arbitrary computation costs. This is true even when the computation costs are restricted to one or two units. The complexity of the algorithm is O(n^2), where n is the number of nodes in the task graph. Similar to Hu's algorithm, the above algorithm is not useful for modern parallel processing environments in which there may be many processing elements.

2.4 Heuristic Approaches

Since the scheduling problem has been shown to be computationally intractable even in very

[Figure 2.2: (a) A simple task graph with unit-cost tasks and without communication among tasks; (b) The optimal schedule of the task graph in a two-processor system.]

simple cases [], researchers in this area resort to heuristic approaches, from which reasonably good schedules can be obtained in a very efficient manner [60]. Adam, Chandy and Dickson [1] performed extensive simulations to compare the performance of several such earlier reported scheduling heuristics: the Highest Levels First with Estimated Times (HLFET), the Highest Levels First with No Estimated Times (HLFNET), Random list scheduling, the Smallest Co-levels First with Estimated Times (SCFET), and the Smallest Co-levels First with No Estimated Times (SCFNET). All of these heuristics take task graphs of arbitrary structure with arbitrary task computation costs as input and schedule the graphs to an arbitrary number of processors. Each heuristic first constructs a list of nodes and schedules the nodes one after the other in a way similar to Hu's algorithm. However, it should be noted that no communication is assumed among the nodes in the task graph. Each heuristic assigns node priorities in a different manner, as elaborated below.

1) In the HLFET algorithm, the term level refers to the sum of the computation costs of all nodes on the longest path from a node to an exit node. This level is used as the priority of the node.
2) The HLFNET algorithm is the same as the HLFET algorithm except that all nodes are assumed to have equal computation costs.
3) In the Random list scheduling algorithm, nodes are assigned random priorities.
4) In the SCFET algorithm, the co-level of a node is calculated in the same way as its level, except that the length of the path is computed from the entry node rather than from the exit node. Node priorities are assigned according to co-levels (i.e., the smaller the co-level, the higher the priority).

5) The SCFNET algorithm is the same as the SCFET algorithm except that all nodes are assumed to have equal computation costs. This is equivalent to an earliest precedence partition when computation costs are ignored.

In their study, Adam et al. defined a scheduling heuristic as near-optimal if the schedule lengths obtained by the heuristic are within five percent of the optimal schedule lengths in 90 percent of the cases. Using this criterion, extensive simulations based on real and randomly generated task graphs showed that the order of accuracy of the five algorithms is: HLFET, HLFNET, SCFNET, Random, SCFET. The near-optimal performance of HLFET indicates that longest-path (LP) scheduling can generate better schedules [8], [60]. This was also confirmed by Kohler [41]. This important attribute of a task graph is also used to design efficient scheduling algorithms which work under more realistic constraints, as elaborated in the next section.

2.5 State-of-the-Art Scheduling Algorithms

In this section, four contemporary scheduling algorithms and their characteristics are described: the Edge-zeroing (EZ) algorithm [56], the Modified Critical Path (MCP) algorithm [70], the Mobility Directed (MD) algorithm [70], and the Dominant Sequence Clustering (DSC) algorithm [67]. These algorithms can be considered the state of the art because they were recently reported and have been shown to be efficient for scheduling arbitrary task graphs. Furthermore, all of these algorithms consider arbitrary communication among tasks. Taking communication into consideration makes a scheduling algorithm more sophisticated [6]. Before describing these algorithms, we discuss the reasons that motivated their development. Although the HLFET algorithm has near-optimal performance, it cannot generate efficient schedules for task graphs with communication among tasks. For example, consider the task graph shown in Figure 2.3(a). Here, a schedule is produced using the HLFET algorithm (communication costs are incorporated in calculating the levels of nodes). The schedule is shown in Figure 2.3(b), in which all the nodes are scheduled to one processor and the schedule length is 43 time units. The HLFET algorithm schedules the nodes in the order: n_1, n_2, n_3, n_4. However, the schedule length can be reduced by using one more processor. This can be seen from the schedule shown in Figure 2.3(c). This schedule, which is generated by hand, is produced according to the order: n_1, n_3, n_2, n_4. At the second scheduling step, n_3 is a relatively more important node than n_2 because, if it is not scheduled to start earlier on a


More information

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11 Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed

More information

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 The Encoding Complexity of Network Coding Michael Langberg, Member, IEEE, Alexander Sprintson, Member, IEEE, and Jehoshua Bruck,

More information

Scheduling on clusters and grids

Scheduling on clusters and grids Some basics on scheduling theory Grégory Mounié, Yves Robert et Denis Trystram ID-IMAG 6 mars 2006 Some basics on scheduling theory 1 Some basics on scheduling theory Notations and Definitions List scheduling

More information

Critical Path Scheduling Parallel Programs on an Unbounded Number of Processors

Critical Path Scheduling Parallel Programs on an Unbounded Number of Processors Critical Path Scheduling Parallel Programs on an Unbounded Number of Processors Mourad Hakem, Franck Butelle To cite this version: Mourad Hakem, Franck Butelle. Critical Path Scheduling Parallel Programs

More information

FUTURE communication networks are expected to support

FUTURE communication networks are expected to support 1146 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL 13, NO 5, OCTOBER 2005 A Scalable Approach to the Partition of QoS Requirements in Unicast and Multicast Ariel Orda, Senior Member, IEEE, and Alexander Sprintson,

More information

ADAPTIVE VIDEO STREAMING FOR BANDWIDTH VARIATION WITH OPTIMUM QUALITY

ADAPTIVE VIDEO STREAMING FOR BANDWIDTH VARIATION WITH OPTIMUM QUALITY ADAPTIVE VIDEO STREAMING FOR BANDWIDTH VARIATION WITH OPTIMUM QUALITY Joseph Michael Wijayantha Medagama (08/8015) Thesis Submitted in Partial Fulfillment of the Requirements for the Degree Master of Science

More information

On the Complexity of List Scheduling Algorithms for Distributed-Memory Systems.

On the Complexity of List Scheduling Algorithms for Distributed-Memory Systems. On the Complexity of List Scheduling Algorithms for Distributed-Memory Systems. Andrei Rădulescu Arjan J.C. van Gemund Faculty of Information Technology and Systems Delft University of Technology P.O.Box

More information

Parallel Job Scheduling

Parallel Job Scheduling Parallel Job Scheduling Lectured by: Nguyễn Đức Thái Prepared by: Thoại Nam -1- Scheduling on UMA Multiprocessors Schedule: allocation of tasks to processors Dynamic scheduling A single queue of ready

More information

Controlled duplication for scheduling real-time precedence tasks on heterogeneous multiprocessors

Controlled duplication for scheduling real-time precedence tasks on heterogeneous multiprocessors Controlled duplication for scheduling real-time precedence tasks on heterogeneous multiprocessors Jagpreet Singh* and Nitin Auluck Department of Computer Science & Engineering Indian Institute of Technology,

More information

ACCELERATED COMPLEX EVENT PROCESSING WITH GRAPHICS PROCESSING UNITS

ACCELERATED COMPLEX EVENT PROCESSING WITH GRAPHICS PROCESSING UNITS ACCELERATED COMPLEX EVENT PROCESSING WITH GRAPHICS PROCESSING UNITS Prabodha Srimal Rodrigo Registration No. : 138230V Degree of Master of Science Department of Computer Science & Engineering University

More information

The Automatic Design of Batch Processing Systems

The Automatic Design of Batch Processing Systems The Automatic Design of Batch Processing Systems by Barry Dwyer, M.A., D.A.E., Grad.Dip. A thesis submitted for the degree of Doctor of Philosophy in the Department of Computer Science University of Adelaide

More information

Scheduling Algorithms in Large Scale Distributed Systems

Scheduling Algorithms in Large Scale Distributed Systems Scheduling Algorithms in Large Scale Distributed Systems Prof.dr.ing. Florin Pop University Politehnica of Bucharest, Faculty of Automatic Control and Computers (ACS-UPB) National Institute for Research

More information

Bi-Objective Optimization for Scheduling in Heterogeneous Computing Systems

Bi-Objective Optimization for Scheduling in Heterogeneous Computing Systems Bi-Objective Optimization for Scheduling in Heterogeneous Computing Systems Tony Maciejewski, Kyle Tarplee, Ryan Friese, and Howard Jay Siegel Department of Electrical and Computer Engineering Colorado

More information

A Fuzzy Logic Approach to Assembly Line Balancing

A Fuzzy Logic Approach to Assembly Line Balancing Mathware & Soft Computing 12 (2005), 57-74 A Fuzzy Logic Approach to Assembly Line Balancing D.J. Fonseca 1, C.L. Guest 1, M. Elam 1, and C.L. Karr 2 1 Department of Industrial Engineering 2 Department

More information

11. APPROXIMATION ALGORITHMS

11. APPROXIMATION ALGORITHMS 11. APPROXIMATION ALGORITHMS load balancing center selection pricing method: vertex cover LP rounding: vertex cover generalized load balancing knapsack problem Lecture slides by Kevin Wayne Copyright 2005

More information

A Modified Genetic Algorithm for Task Scheduling in Multiprocessor Systems

A Modified Genetic Algorithm for Task Scheduling in Multiprocessor Systems A Modified Genetic Algorithm for Task Scheduling in Multiprocessor Systems Yi-Hsuan Lee and Cheng Chen Department of Computer Science and Information Engineering National Chiao Tung University, Hsinchu,

More information

EFFICIENT ATTACKS ON HOMOPHONIC SUBSTITUTION CIPHERS

EFFICIENT ATTACKS ON HOMOPHONIC SUBSTITUTION CIPHERS EFFICIENT ATTACKS ON HOMOPHONIC SUBSTITUTION CIPHERS A Project Report Presented to The faculty of the Department of Computer Science San Jose State University In Partial Fulfillment of the Requirements

More information

Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS

Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Structure Page Nos. 2.0 Introduction 4 2. Objectives 5 2.2 Metrics for Performance Evaluation 5 2.2. Running Time 2.2.2 Speed Up 2.2.3 Efficiency 2.3 Factors

More information

DEGENERACY AND THE FUNDAMENTAL THEOREM

DEGENERACY AND THE FUNDAMENTAL THEOREM DEGENERACY AND THE FUNDAMENTAL THEOREM The Standard Simplex Method in Matrix Notation: we start with the standard form of the linear program in matrix notation: (SLP) m n we assume (SLP) is feasible, and

More information

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT PhD Summary DOCTORATE OF PHILOSOPHY IN COMPUTER SCIENCE & ENGINEERING By Sandip Kumar Goyal (09-PhD-052) Under the Supervision

More information

Design of Parallel Algorithms. Models of Parallel Computation

Design of Parallel Algorithms. Models of Parallel Computation + Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes

More information

Scan Scheduling Specification and Analysis

Scan Scheduling Specification and Analysis Scan Scheduling Specification and Analysis Bruno Dutertre System Design Laboratory SRI International Menlo Park, CA 94025 May 24, 2000 This work was partially funded by DARPA/AFRL under BAE System subcontract

More information

Load Balancing and Termination Detection

Load Balancing and Termination Detection Chapter 7 Load Balancing and Termination Detection 1 Load balancing used to distribute computations fairly across processors in order to obtain the highest possible execution speed. Termination detection

More information

Parallel Algorithm Design. Parallel Algorithm Design p. 1

Parallel Algorithm Design. Parallel Algorithm Design p. 1 Parallel Algorithm Design Parallel Algorithm Design p. 1 Overview Chapter 3 from Michael J. Quinn, Parallel Programming in C with MPI and OpenMP Another resource: http://www.mcs.anl.gov/ itf/dbpp/text/node14.html

More information

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,

More information

Scheduling in Distributed Computing Systems Analysis, Design & Models

Scheduling in Distributed Computing Systems Analysis, Design & Models Scheduling in Distributed Computing Systems Analysis, Design & Models (A Research Monograph) Scheduling in Distributed Computing Systems Analysis, Design & Models (A Research Monograph) by Deo Prakash

More information

CSC630/CSC730 Parallel & Distributed Computing

CSC630/CSC730 Parallel & Distributed Computing CSC630/CSC730 Parallel & Distributed Computing Analytical Modeling of Parallel Programs Chapter 5 1 Contents Sources of Parallel Overhead Performance Metrics Granularity and Data Mapping Scalability 2

More information

Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206

Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 1 Today Characteristics of Tasks and Interactions (3.3). Mapping Techniques for Load Balancing (3.4). Methods for Containing Interaction

More information

A Genetic Algorithm for Multiprocessor Task Scheduling

A Genetic Algorithm for Multiprocessor Task Scheduling A Genetic Algorithm for Multiprocessor Task Scheduling Tashniba Kaiser, Olawale Jegede, Ken Ferens, Douglas Buchanan Dept. of Electrical and Computer Engineering, University of Manitoba, Winnipeg, MB,

More information

Resource Allocation Strategies for Multiple Job Classes

Resource Allocation Strategies for Multiple Job Classes Resource Allocation Strategies for Multiple Job Classes by Ye Hu A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Mathematics in Computer

More information

Algorithms and Applications

Algorithms and Applications Algorithms and Applications 1 Areas done in textbook: Sorting Algorithms Numerical Algorithms Image Processing Searching and Optimization 2 Chapter 10 Sorting Algorithms - rearranging a list of numbers

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

On Universal Cycles of Labeled Graphs

On Universal Cycles of Labeled Graphs On Universal Cycles of Labeled Graphs Greg Brockman Harvard University Cambridge, MA 02138 United States brockman@hcs.harvard.edu Bill Kay University of South Carolina Columbia, SC 29208 United States

More information

A Connection between Network Coding and. Convolutional Codes

A Connection between Network Coding and. Convolutional Codes A Connection between Network Coding and 1 Convolutional Codes Christina Fragouli, Emina Soljanin christina.fragouli@epfl.ch, emina@lucent.com Abstract The min-cut, max-flow theorem states that a source

More information

Parallel Program Execution on a Heterogeneous PC Cluster Using Task Duplication

Parallel Program Execution on a Heterogeneous PC Cluster Using Task Duplication Parallel Program Execution on a Heterogeneous PC Cluster Using Task Duplication YU-KWONG KWOK Department of Electrical and Electronic Engineering The University of Hong Kong, Pokfulam Road, Hong Kong Email:

More information

A Parallel Algorithm for Exact Structure Learning of Bayesian Networks

A Parallel Algorithm for Exact Structure Learning of Bayesian Networks A Parallel Algorithm for Exact Structure Learning of Bayesian Networks Olga Nikolova, Jaroslaw Zola, and Srinivas Aluru Department of Computer Engineering Iowa State University Ames, IA 0010 {olia,zola,aluru}@iastate.edu

More information

SCHEDULING OF PRECEDENCE CONSTRAINED TASK GRAPHS ON MULTIPROCESSOR SYSTEMS

SCHEDULING OF PRECEDENCE CONSTRAINED TASK GRAPHS ON MULTIPROCESSOR SYSTEMS ISSN : 0973-7391 Vol. 3, No. 1, January-June 2012, pp. 233-240 SCHEDULING OF PRECEDENCE CONSTRAINED TASK GRAPHS ON MULTIPROCESSOR SYSTEMS Shailza Kamal 1, and Sukhwinder Sharma 2 1 Department of Computer

More information

Grid Scheduling Strategy using GA (GSSGA)

Grid Scheduling Strategy using GA (GSSGA) F Kurus Malai Selvi et al,int.j.computer Technology & Applications,Vol 3 (5), 8-86 ISSN:2229-693 Grid Scheduling Strategy using GA () Dr.D.I.George Amalarethinam Director-MCA & Associate Professor of Computer

More information

Course: Operating Systems Instructor: M Umair. M Umair

Course: Operating Systems Instructor: M Umair. M Umair Course: Operating Systems Instructor: M Umair Process The Process A process is a program in execution. A program is a passive entity, such as a file containing a list of instructions stored on disk (often

More information

Implementation of Dynamic Level Scheduling Algorithm using Genetic Operators

Implementation of Dynamic Level Scheduling Algorithm using Genetic Operators Implementation of Dynamic Level Scheduling Algorithm using Genetic Operators Prabhjot Kaur 1 and Amanpreet Kaur 2 1, 2 M. Tech Research Scholar Department of Computer Science and Engineering Guru Nanak

More information

Parallel Fast Fourier Transform implementations in Julia 12/15/2011

Parallel Fast Fourier Transform implementations in Julia 12/15/2011 Parallel Fast Fourier Transform implementations in Julia 1/15/011 Abstract This paper examines the parallel computation models of Julia through several different multiprocessor FFT implementations of 1D

More information

A SIMULATION OF POWER-AWARE SCHEDULING OF TASK GRAPHS TO MULTIPLE PROCESSORS

A SIMULATION OF POWER-AWARE SCHEDULING OF TASK GRAPHS TO MULTIPLE PROCESSORS A SIMULATION OF POWER-AWARE SCHEDULING OF TASK GRAPHS TO MULTIPLE PROCESSORS Xiaojun Qi, Carson Jones, and Scott Cannon Computer Science Department Utah State University, Logan, UT, USA 84322-4205 xqi@cc.usu.edu,

More information

Framework for Design of Dynamic Programming Algorithms

Framework for Design of Dynamic Programming Algorithms CSE 441T/541T Advanced Algorithms September 22, 2010 Framework for Design of Dynamic Programming Algorithms Dynamic programming algorithms for combinatorial optimization generalize the strategy we studied

More information

ARELAY network consists of a pair of source and destination

ARELAY network consists of a pair of source and destination 158 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 55, NO 1, JANUARY 2009 Parity Forwarding for Multiple-Relay Networks Peyman Razaghi, Student Member, IEEE, Wei Yu, Senior Member, IEEE Abstract This paper

More information

Multi-Way Number Partitioning

Multi-Way Number Partitioning Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence (IJCAI-09) Multi-Way Number Partitioning Richard E. Korf Computer Science Department University of California,

More information

The Pulled-Macro-Dataflow Model: An Execution Model for Multicore Shared-Memory Computers

The Pulled-Macro-Dataflow Model: An Execution Model for Multicore Shared-Memory Computers Brigham Young University BYU ScholarsArchive All Theses and Dissertations 2011-09-13 The Pulled-Macro-Dataflow Model: An Execution Model for Multicore Shared-Memory Computers Daniel Joseph Richins Brigham

More information

GEO BASED ROUTING FOR BORDER GATEWAY PROTOCOL IN ISP MULTI-HOMING ENVIRONMENT

GEO BASED ROUTING FOR BORDER GATEWAY PROTOCOL IN ISP MULTI-HOMING ENVIRONMENT GEO BASED ROUTING FOR BORDER GATEWAY PROTOCOL IN ISP MULTI-HOMING ENVIRONMENT Duleep Thilakarathne (118473A) Degree of Master of Science Department of Electronic and Telecommunication Engineering University

More information

Treewidth and graph minors

Treewidth and graph minors Treewidth and graph minors Lectures 9 and 10, December 29, 2011, January 5, 2012 We shall touch upon the theory of Graph Minors by Robertson and Seymour. This theory gives a very general condition under

More information

A Static Scheduling Heuristic for. Heterogeneous Processors. Hyunok Oh and Soonhoi Ha

A Static Scheduling Heuristic for. Heterogeneous Processors. Hyunok Oh and Soonhoi Ha 1 Static Scheduling Heuristic for Heterogeneous Processors Hyunok Oh and Soonhoi Ha The Department of omputer Engineering, Seoul National University, Seoul, 11-742, Korea: e-mail: foho,shag@comp.snu.ac.kr

More information

A Task Scheduling Method for Data Intensive Jobs in Multicore Distributed System

A Task Scheduling Method for Data Intensive Jobs in Multicore Distributed System 第一工業大学研究報告第 27 号 (2015)pp.13-17 13 A Task Scheduling Method for Data Intensive Jobs in Multicore Distributed System Kazuo Hajikano* 1 Hidehiro Kanemitsu* 2 Moo Wan Kim* 3 *1 Department of Information Technology

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna16/ [ 4 ] Scheduling Theory Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture and

More information

Integer Programming ISE 418. Lecture 7. Dr. Ted Ralphs

Integer Programming ISE 418. Lecture 7. Dr. Ted Ralphs Integer Programming ISE 418 Lecture 7 Dr. Ted Ralphs ISE 418 Lecture 7 1 Reading for This Lecture Nemhauser and Wolsey Sections II.3.1, II.3.6, II.4.1, II.4.2, II.5.4 Wolsey Chapter 7 CCZ Chapter 1 Constraint

More information

Scheduling Using Multi Objective Genetic Algorithm

Scheduling Using Multi Objective Genetic Algorithm IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 3, Ver. II (May Jun. 2015), PP 73-78 www.iosrjournals.org Scheduling Using Multi Objective Genetic

More information

High Level Synthesis

High Level Synthesis High Level Synthesis Design Representation Intermediate representation essential for efficient processing. Input HDL behavioral descriptions translated into some canonical intermediate representation.

More information

Lecture 3: Sorting 1

Lecture 3: Sorting 1 Lecture 3: Sorting 1 Sorting Arranging an unordered collection of elements into monotonically increasing (or decreasing) order. S = a sequence of n elements in arbitrary order After sorting:

More information

Galgotias University: (U.P. India) Department of Computer Science & Applications

Galgotias University: (U.P. India) Department of Computer Science & Applications The Society of Digital Information and Wireless Communications, (ISSN: -98) A Critical-Path and Top-Level attributes based Task Scheduling Algorithm for DAG (CPTL) Nidhi Rajak, Ranjit Rajak and Anurag

More information

Task Allocation for Minimizing Programs Completion Time in Multicomputer Systems

Task Allocation for Minimizing Programs Completion Time in Multicomputer Systems Task Allocation for Minimizing Programs Completion Time in Multicomputer Systems Gamal Attiya and Yskandar Hamam Groupe ESIEE Paris, Lab. A 2 SI Cité Descartes, BP 99, 93162 Noisy-Le-Grand, FRANCE {attiyag,hamamy}@esiee.fr

More information

Trees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph.

Trees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph. Trees 1 Introduction Trees are very special kind of (undirected) graphs. Formally speaking, a tree is a connected graph that is acyclic. 1 This definition has some drawbacks: given a graph it is not trivial

More information

CHAPTER 6 ORTHOGONAL PARTICLE SWARM OPTIMIZATION

CHAPTER 6 ORTHOGONAL PARTICLE SWARM OPTIMIZATION 131 CHAPTER 6 ORTHOGONAL PARTICLE SWARM OPTIMIZATION 6.1 INTRODUCTION The Orthogonal arrays are helpful in guiding the heuristic algorithms to obtain a good solution when applied to NP-hard problems. This

More information

Cost Models for Query Processing Strategies in the Active Data Repository

Cost Models for Query Processing Strategies in the Active Data Repository Cost Models for Query rocessing Strategies in the Active Data Repository Chialin Chang Institute for Advanced Computer Studies and Department of Computer Science University of Maryland, College ark 272

More information

Hashing. Hashing Procedures

Hashing. Hashing Procedures Hashing Hashing Procedures Let us denote the set of all possible key values (i.e., the universe of keys) used in a dictionary application by U. Suppose an application requires a dictionary in which elements

More information

A DAG-BASED ALGORITHM FOR DISTRIBUTED MUTUAL EXCLUSION ATHESIS MASTER OF SCIENCE

A DAG-BASED ALGORITHM FOR DISTRIBUTED MUTUAL EXCLUSION ATHESIS MASTER OF SCIENCE A DAG-BASED ALGORITHM FOR DISTRIBUTED MUTUAL EXCLUSION by Mitchell L. Neilsen ATHESIS submitted in partial fulfillment of the requirements for the degree MASTER OF SCIENCE Department of Computing and Information

More information

Mapping of Parallel Tasks to Multiprocessors with Duplication *

Mapping of Parallel Tasks to Multiprocessors with Duplication * Mapping of Parallel Tasks to Multiprocessors with Duplication * Gyung-Leen Park Dept. of Comp. Sc. and Eng. Univ. of Texas at Arlington Arlington, TX 76019-0015 gpark@cse.uta.edu Behrooz Shirazi Dept.

More information

Approximation Algorithms

Approximation Algorithms Approximation Algorithms Given an NP-hard problem, what should be done? Theory says you're unlikely to find a poly-time algorithm. Must sacrifice one of three desired features. Solve problem to optimality.

More information

Efficient Non-domination Level Update Approach for Steady-State Evolutionary Multiobjective Optimization

Efficient Non-domination Level Update Approach for Steady-State Evolutionary Multiobjective Optimization Efficient Non-domination Level Update Approach for Steady-State Evolutionary Multiobjective Optimization Ke Li 1, Kalyanmoy Deb 1, Qingfu Zhang 2, and Sam Kwong 2 1 Department of Electrical and Computer

More information

ROUTING ALGORITHMS FOR RING NETWORKS

ROUTING ALGORITHMS FOR RING NETWORKS ROUTING ALGORITHMS FOR RING NETWORKS by Yong Wang B.Sc., Peking University, 1999 a thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in the School of Computing

More information

6. Lecture notes on matroid intersection

6. Lecture notes on matroid intersection Massachusetts Institute of Technology 18.453: Combinatorial Optimization Michel X. Goemans May 2, 2017 6. Lecture notes on matroid intersection One nice feature about matroids is that a simple greedy algorithm

More information

CS 580: Algorithm Design and Analysis. Jeremiah Blocki Purdue University Spring 2018

CS 580: Algorithm Design and Analysis. Jeremiah Blocki Purdue University Spring 2018 CS 580: Algorithm Design and Analysis Jeremiah Blocki Purdue University Spring 2018 Chapter 11 Approximation Algorithms Slides by Kevin Wayne. Copyright @ 2005 Pearson-Addison Wesley. All rights reserved.

More information

Performing MapReduce on Data Centers with Hierarchical Structures

Performing MapReduce on Data Centers with Hierarchical Structures INT J COMPUT COMMUN, ISSN 1841-9836 Vol.7 (212), No. 3 (September), pp. 432-449 Performing MapReduce on Data Centers with Hierarchical Structures Z. Ding, D. Guo, X. Chen, X. Luo Zeliu Ding, Deke Guo,

More information

Energy-Constrained Scheduling of DAGs on Multi-core Processors

Energy-Constrained Scheduling of DAGs on Multi-core Processors Energy-Constrained Scheduling of DAGs on Multi-core Processors Ishfaq Ahmad 1, Roman Arora 1, Derek White 1, Vangelis Metsis 1, and Rebecca Ingram 2 1 University of Texas at Arlington, Computer Science

More information

HARNESSING CERTAINTY TO SPEED TASK-ALLOCATION ALGORITHMS FOR MULTI-ROBOT SYSTEMS

HARNESSING CERTAINTY TO SPEED TASK-ALLOCATION ALGORITHMS FOR MULTI-ROBOT SYSTEMS HARNESSING CERTAINTY TO SPEED TASK-ALLOCATION ALGORITHMS FOR MULTI-ROBOT SYSTEMS An Undergraduate Research Scholars Thesis by DENISE IRVIN Submitted to the Undergraduate Research Scholars program at Texas

More information

Randomized algorithms have several advantages over deterministic ones. We discuss them here:

Randomized algorithms have several advantages over deterministic ones. We discuss them here: CS787: Advanced Algorithms Lecture 6: Randomized Algorithms In this lecture we introduce randomized algorithms. We will begin by motivating the use of randomized algorithms through a few examples. Then

More information