On the Complexity of List Scheduling Algorithms for Distributed-Memory Systems.


Andrei Rădulescu    Arjan J. C. van Gemund
Faculty of Information Technology and Systems
Delft University of Technology
P.O. Box 5031, 2600 GA Delft, The Netherlands

Abstract

This paper presents a novel heuristic, called Fast Critical Path (FCP), intended as a compile-time scheduling algorithm for distributed-memory systems. While similar to existing list scheduling algorithms, FCP has two important differences: (a) it does not sort all the tasks at the beginning, but maintains only a limited number of tasks sorted at any given time, and (b) instead of considering all processors as possible targets for a given task, the choice is restricted to either the processor from which the last message to the given task arrives or the processor which becomes idle the earliest. As a result, the time complexity is drastically reduced to O(V log(P) + E), where V and E are the number of tasks and edges in the task graph, respectively, and P is the number of processors. We demonstrate through theory and experiments that FCP performs comparably to existing list scheduling algorithms of much higher complexity.

1 Introduction

Efficient scheduling is essential to obtain high performance from a parallel program. If the structure of the program and the target machine are known in advance, scheduling can be done automatically at compile time, thus saving considerable overhead at run-time. Task scheduling on distributed-memory systems is a tradeoff between exploiting as much parallelism as possible and, at the same time, reducing communication, so as to minimize the parallel completion time of the program. Except for very restricted cases, the scheduling problem has been shown to be NP-complete [2]. Therefore, for realistic cases, scheduling is performed using heuristics.
Moreover, in order to be of practical use for large parallel applications, scheduling heuristics must have a low complexity. [Footnote: This research is part of the Automap project granted by the Netherlands Computer Science Foundation (SION) with financial support from the Netherlands Organization for Scientific Research (NWO) under grant number SION-2519/. To appear in the 1999 ACM International Conference on Supercomputing, June 1999, Rhodes, Greece.]

For shared-memory systems, it has been proven that even a low-cost scheduling heuristic is guaranteed to produce acceptable performance [3]. In the distributed-memory case, however, such a guarantee does not exist, and the scheduling problem remains a challenge, especially for algorithms where low cost is of principal interest.

Scheduling can be done either for a bounded or for an unbounded number of processors. Although attractive from both a cost and a performance perspective, scheduling for an unbounded number of processors is rarely of practical use, because the required number of processors is usually not available. Hence, such algorithms typically find their application within multi-step scheduling methods for a bounded number of processors [7, 8, 11]. Scheduling for a bounded number of processors can be done either with duplication (e.g., DSH [5] or CPFD [1]) or without duplication (e.g., MCP [10], ETF [4] or DSC-LLB [7, 11]). Duplicating tasks results in better performance, but significantly increases cost compared to non-duplicating heuristics. Among non-duplicating heuristics, list scheduling algorithms obtain good performance at a low cost [6, 7]. However, when compiling large programs for large systems, the complexity of current list scheduling approaches is still prohibitive.

This paper presents a new compile-time task scheduling algorithm, called Fast Critical Path (FCP). The main objective is to reduce the scheduling costs as much as possible, while maintaining performance comparable to existing list scheduling algorithms.
The FCP algorithm is inspired by list scheduling algorithms which statically compute task priorities. The algorithms in this class already have a low complexity and good output performance. In our approach, we drastically reduce this complexity, while maintaining equivalent performance. Moreover, FCP obtains better results than multi-step scheduling, at an even lower complexity.

This paper is organized as follows: The next section describes the scheduling problem and introduces some definitions used in the paper. Section 3 briefly reviews some well-known scheduling algorithms. Section 4 presents the FCP algorithm, while Section 5 describes its performance. Section 6 concludes the paper.

2 Preliminaries

A parallel program can be modeled by a directed acyclic graph G = (V, E), where V is the set of nodes and E is the set of edges (we also use V and E to denote the number of nodes and edges, respectively). A node in the DAG represents a task, containing instructions that execute sequentially without preemption. Each task t has a weight comp(t) associated with it, which represents the computation cost of executing the task. The edges correspond to task dependencies (communication messages or precedence constraints). Each edge (t, t') has a weight comm(t, t') associated with it, which represents the communication cost incurred to satisfy the dependence. The communication-to-computation ratio (CCR) of a parallel program is defined as the ratio between its average communication cost and its average computation cost. A task with no input edges is called an entry task, while a task with no output edges is called an exit task. A task is said to be ready if all its parents have finished their execution. A task can start its execution only after all its dependencies have been satisfied. If two tasks are mapped to the same processor, the communication cost between them is assumed to be zero.

As a distributed system, we assume a set P of P processors connected in a homogeneous clique topology. Interprocessor communication is performed without contention, and tasks are executed without preemption. Once scheduled, a task t is associated with a processor PE(t), a start time ST(t) and a finish time FT(t). If the task is not scheduled, these three values are not defined. The processor idle time of a given processor p on a partial schedule is defined as the finish time of the last task scheduled on that processor: PIT(p) = max over {t ∈ V : PE(t) = p} of FT(t). The objective of the scheduling problem is to find a schedule of the tasks in V on the target system such that the parallel completion time (schedule length) is minimized. The parallel completion time is defined as T_par = max over {p ∈ P} of PIT(p).
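As an illustration of these definitions, the following minimal sketch (hypothetical names, not taken from the paper) models a task graph with a given schedule and computes PIT(p) and T_par from it.

```python
# Minimal sketch of the scheduling model of Section 2 (names hypothetical).
# A task graph is a DAG with computation costs comp(t) and edge costs comm(t, t').

comp = {"a": 2, "b": 3, "c": 1}          # computation cost per task
comm = {("a", "b"): 4, ("a", "c"): 2}    # communication cost per edge

# A schedule assigns each task a processor PE(t) and a start time ST(t).
PE = {"a": 0, "b": 0, "c": 1}
ST = {"a": 0, "b": 2, "c": 4}            # c starts at FT(a) + comm(a, c) = 4

# Finish time: FT(t) = ST(t) + comp(t).
FT = {t: ST[t] + comp[t] for t in comp}

def pit(p):
    """Processor idle time: finish time of the last task scheduled on p."""
    return max((FT[t] for t in comp if PE[t] == p), default=0)

# Parallel completion time (schedule length): T_par = max over processors of PIT(p).
t_par = max(pit(p) for p in set(PE.values()))
print(t_par)  # both processors finish at time 5
```

Note that b, mapped to the same processor as a, pays no communication cost, while c, mapped to the other processor, can start only once the message from a arrives.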
3 Related Work

In this section, four existing scheduling algorithms and their characteristics are described: (a) three list scheduling algorithms, Modified Critical Path (MCP) [10], Critical Path Method (CPM) [9] and Earliest Task First (ETF) [4], and (b) a multi-step method (DSC-LLB) composed of Dominant Sequence Clustering (DSC) [11] and List-Based Load Balancing (LLB) [7].

3.1 MCP

The MCP algorithm is a list scheduling algorithm in which task priorities are based on the latest possible start times of the tasks. The latest possible start time is computed as the difference between the critical path length of the graph and the length of the longest path from the current task to any exit task. A path length is the sum of the execution times and communication costs of the tasks and edges belonging to the path. The critical path is the longest path in the graph. A task with the smallest latest possible start time has the highest priority. Ties are broken by considering the priorities of the task's descendants. The tasks are selected in the order of their priorities, and each is assigned to the processor that can execute it the earliest. The time complexity of MCP is O(V²(log(V) + P)). MCP is relatively fast compared with other list scheduling algorithms. Furthermore, its scheduling performance has been shown to be superior to most of the other algorithms for a bounded number of processors [6, 7]. MCP can be modified to run faster by choosing a random tie-breaking scheme, at a negligible loss of performance. In this case, the time complexity is reduced to O(V log(V) + (E + V)P).

3.2 CPM

Like MCP, the CPM algorithm also uses the latest possible start times as static task priorities. However, the processor selection is different: the task with the highest priority is scheduled on the processor that becomes idle the earliest. The time complexity of CPM is O(V(log(V) + log(P))). CPM was originally designed without taking communication into consideration. In the case of non-zero communication delays, the scheduling performance of CPM decreases significantly. The reason is that CPM does not try to reduce communication costs, but only balances the load of the processors. Therefore, CPM is closer to a load balancing scheme.

3.3 ETF

The ETF algorithm is a list scheduling algorithm based on a dynamic task priority scheme. At each scheduling step, the priorities of the ready unmapped tasks are computed. The task priority is the earliest start time, which is determined by tentatively mapping the given tasks to all processors. The task with the minimum priority is selected and mapped to the processor corresponding to this priority. Ties are broken by considering a statically computed priority. The time complexity of ETF is O(V²P). ETF has a higher complexity than MCP because at each step it is required to recompute the task priorities. Yet, the performance does not seem to improve [6]. The main idea is to keep the processors busy, in this respect being close to a load balancing scheme. Because of that, the ETF algorithm does not always map the most important ready tasks first (i.e., the tasks on the critical path).

3.4 DSC-LLB

DSC-LLB is a multi-step scheduling algorithm. The first step, applying DSC, is intended to minimize communication by grouping the highly communicating tasks together in clusters. The second step, using LLB, maps the clusters to

the existing processors and orders the tasks within the clusters. In the DSC algorithm, task priorities are dynamically computed as the sum of the task's top level and bottom level. The top level and the bottom level are the sums of the computation and communication costs along the longest path from the given task to an entry task and to an exit task, respectively. Again, communication costs are assumed to be zero between two tasks mapped to the same processor. While the bottom levels are statically computed at the beginning, the top levels are computed incrementally during the scheduling process. The tasks are scheduled in the order of their priorities. The destination processor is either the processor from which the last message arrives, or a new processor, depending on which of the two allows the given task to start earlier. The time complexity of DSC is O((E + V) log(V)).

In the LLB algorithm, a task is mapped to a processor if there is at least one other task from the same cluster already scheduled on that processor, and is unmapped otherwise. LLB is a load balancing scheme. First, the destination processor is selected as the processor that becomes idle the earliest. Second, the task to be scheduled is selected. There are two candidates: (a) the task already mapped to the selected processor having the least bottom level, or (b) the unmapped task with the least bottom level. The one able to start the earliest is scheduled. The time complexity of LLB is O(C log(C) + V), where C is the number of clusters obtained in the clustering step.

DSC-LLB is a low-cost algorithm. Not surprisingly, compared to a higher-cost scheduling algorithm such as MCP, it has worse scheduling performance. However, the DSC-LLB output performance has still been shown to stay within a few tens of percent of the MCP output performance, while outperforming other known multi-step scheduling algorithms [7].
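The bottom level used by DSC (and, later, a natural static priority for list schedulers) can be computed in a single memoized traversal. The sketch below is hypothetical (names and graph are illustrative, not from the paper): the bottom level of a task is the length of the longest computation-plus-communication path from that task to an exit task.

```python
# Hypothetical sketch: computing bottom levels by memoized recursion.
# Each task and edge is visited once, so the cost is O(E + V).

comp = {"a": 2, "b": 3, "c": 1, "d": 2}
comm = {("a", "b"): 4, ("a", "c"): 2, ("b", "d"): 1, ("c", "d"): 5}
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

blevel = {}
def bottom_level(t):
    if t not in blevel:  # memoize so each task is expanded only once
        blevel[t] = comp[t] + max(
            (comm[(t, s)] + bottom_level(s) for s in succ[t]), default=0)
    return blevel[t]

levels = {t: bottom_level(t) for t in comp}
print(levels)  # {'a': 12, 'b': 6, 'c': 8, 'd': 2}
```

Here d is an exit task, so its bottom level is just comp(d); a's bottom level is 2 + max(4 + 6, 2 + 8) = 12, the critical path length of this graph.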
4 The FCP Algorithm

4.1 Rationale

As mentioned earlier, list scheduling algorithms generally perform better than other scheduling algorithms for a bounded number of processors, such as multi-step methods [7]. However, list scheduling algorithms have a higher complexity than multi-step scheduling algorithms. Our goal is to reduce the time complexity of list scheduling algorithms, while maintaining their good results.

Analyzing list scheduling algorithms, one can distinguish several steps. The first step is the computation of the task priorities, which takes at least O(E + V) time, since the whole task graph has to be traversed. The second step, sorting the tasks according to their priorities, takes O(V log(V)) time. The third step, task scheduling, schedules the sorted tasks one at a time on each task's best processor. In list scheduling, the best processor is usually considered to be the processor on which the task to be scheduled starts the earliest (e.g., MCP, ETF). Computing the start times for all tasks requires traversing all tasks and edges, leading to a time complexity of O(E + V). As each task is tentatively scheduled on all processors to find its earliest start time, processor selection takes O((E + V)P) time. Finally, scheduling a task on a processor takes only O(1) time, because the start time of the task on the selected processor was already computed in the previous step. Therefore, the highest-complexity steps of list scheduling algorithms are the second and the third steps, which have O(V log(V)) and O((E + V)P) time complexity, respectively. As for practical problems E has the same order of magnitude as V, the second step usually has the higher cost.

A first way to reduce the complexity of the task sorting step is not to sort all the tasks from the beginning, but to keep only the ready tasks sorted throughout the scheduling process. However, although the sorting time is reduced, it still has the same complexity in the worst case: O(V log(V)). This complexity can be effectively reduced by maintaining only a constant-size sorted list of ready tasks. The other ready tasks are stored in an unsorted FIFO list, which has O(1) access time. When a task becomes ready, it is added to the sorted list if there is room to accommodate it; otherwise it is added to the FIFO list. For this reason, as long as the sorted list is not full, there can be no tasks in the FIFO list. Tasks are always dequeued from the sorted list. After a dequeue operation, if the FIFO list is not empty, one task is moved from the FIFO list to the sorted list. The time complexity of sorting the tasks using a sorted list of size H decreases to O(V log(H)), as each task is enqueued in and dequeued from the sorted list only once.

A possible drawback of using a fixed-size sorted task list is that the task with the highest priority may not be included in the sorted list, but may be temporarily stored in the FIFO list. The size of the sorted list must therefore be large enough not to affect the performance of the algorithm too much. At the same time, it should not be too large in view of the time complexity. In our experiments, we find that a size of H = P is sufficient to achieve a performance comparable to the original list scheduling algorithm (see Section 5). A sorted list size of P results in a task sorting complexity of O(V log(P)).

The O((E + V)P) time complexity of list scheduling processor selection can also be reduced, by restricting the choice of the destination processor from all processors to only two: (a) the processor from which the last message to the given task arrives, or (b) the processor which becomes idle the earliest. In the appendix we prove that the start time of a given task is minimized on one of these two destination processors. The proof is based on the fact that the start time of a task t on a candidate processor p is defined as the maximum of (a) the time the last message to t arrives from a different processor, and (b) the time p becomes idle.
The task start time is minimized on one of the two processors that minimize these two components of the start time. As a consequence, there are two possible destination processors: (a) the processor from which the

last message to the given task arrives, because mapping the task on this processor is the only case in which the cost of the last message is zeroed, and (b) the processor becoming idle the earliest. This implies that restricting the selection to these two processors does not affect the performance of the algorithm, while drastically reducing its complexity to O(V log(P) + E). Using the described techniques for task sorting and processor selection, the total complexity of the scheduling algorithm decreases to O(V log(P) + E), which is clearly a significant improvement over the typical time complexity O(V log(V) + (E + V)P) of current list scheduling approaches.

4.2 The FCP Algorithm

The FCP algorithm is described in this section. It is based on three procedures, AddReadyTask, SelectReadyTask and SelectProcessor, which are described first.

AddReadyTask (task)
BEGIN
  IF size(priority_list) < H THEN
    Enqueue_sorted(task, priority_list);
  ELSE
    Enqueue_FIFO(task, FIFO_list);
  END IF
END

AddReadyTask adds a ready task to the partially sorted ready task set. The task set is implemented as a fixed-size priority list and a FIFO list. If the fixed-size priority list is not full, the task is added to the priority list; otherwise it is added to the FIFO list.

SelectReadyTask ()
BEGIN
  task := Dequeue(priority_list);
  IF FIFO_list is not empty THEN
    t := Dequeue(FIFO_list);
    Enqueue_sorted(t, priority_list);
  END IF
  RETURN task;
END

SelectReadyTask returns the task with the highest priority from the priority list. The priority list must be full as long as there are tasks in the FIFO list. Therefore, if there are tasks in the FIFO list after a task is dequeued from the priority list, one task from the FIFO list must be moved to the priority list. Using this approach, the priority list is always full whenever there are tasks in the FIFO list.
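In Python, the AddReadyTask/SelectReadyTask pair could be sketched as below (a hypothetical sketch using a size-H heap plus a FIFO; the paper does not prescribe a concrete implementation):

```python
import heapq
from collections import deque

class ReadyTasks:
    """Hypothetical sketch of AddReadyTask/SelectReadyTask: at most H
    tasks are kept sorted (in a heap); the rest wait in a FIFO list.
    Each task enters and leaves the heap at most once: O(log H) per task."""
    def __init__(self, H):
        self.H, self.heap, self.fifo = H, [], deque()

    def add(self, priority, task):            # AddReadyTask
        if len(self.heap) < self.H:
            heapq.heappush(self.heap, (-priority, task))  # highest priority first
        else:
            self.fifo.append((priority, task))

    def pop(self):                            # SelectReadyTask
        _, task = heapq.heappop(self.heap)
        if self.fifo:                         # keep the heap full
            p, t = self.fifo.popleft()
            heapq.heappush(self.heap, (-p, t))
        return task

rt = ReadyTasks(H=2)
for prio, name in [(5, "t5"), (1, "t1"), (9, "t9"), (7, "t7")]:
    rt.add(prio, name)
print([rt.pop() for _ in range(4)])  # ['t5', 't9', 't7', 't1']
```

Note how t9, the globally highest-priority task, overflows into the FIFO and is returned only second. This is exactly the drawback discussed in Section 4.1, and the reason the list size H must be chosen large enough (H = P in the paper).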
SelectProcessor (task)
BEGIN
  pA := processor from which the last message arrives;
  pB := processor becoming idle the earliest;
  IF ST(task, pA) < ST(task, pB) THEN
    p := pA;
  ELSE
    p := pB;
  END IF
  RETURN p;
END

SelectProcessor is the processor selection procedure. The two processor candidates are (a) the processor from which the last message is received (pA) and (b) the processor becoming idle the earliest (pB). The first is used to reduce communication, while the latter is intended to ensure processor load balancing. The task start time is computed for both candidate processors, and the one yielding the earlier start time is selected.

FCP ()
BEGIN
  FOR t ∈ V DO
    ComputePriority(t);
    IF t is an entry task THEN
      AddReadyTask(t);
    END IF
  END FOR
  WHILE NOT all tasks scheduled DO
    task := SelectReadyTask();
    p := SelectProcessor(task);
    ScheduleTask(task, p);
    FOR t ∈ new ready task set DO
      AddReadyTask(t);
    END FOR
  END WHILE
END

The FCP algorithm uses static task priorities, which are computed before the scheduling loop starts. Also, the ready task set is initialized with the entry tasks. The scheduling loop is repeated as long as there are unscheduled tasks. At each iteration, one task is scheduled. The task to be scheduled is selected among the ready tasks using SelectReadyTask, described above. The destination processor for the given task is selected using SelectProcessor, also described above. The task is scheduled on the selected processor after the last task scheduled on that processor. Before continuing with the next iteration, the ready task set is updated by adding the successors that become ready as a result of the current scheduling decision.

The complexity of the FCP algorithm is as follows. Computing the task priorities takes O(E + V). Each task is added once to and removed once from the partially sorted ready task set. Both operations take O(log(H)) per task. As there are V tasks, maintaining the task lists takes O(V log(H)).
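Putting the pieces together, the full FCP loop can be sketched in Python as follows. This is a hypothetical illustration, not the authors' code: it uses bottom levels as the static priority (one plausible choice; the paper leaves ComputePriority abstract) and, for entry tasks, falls back to processor 0 as the "last message" candidate.

```python
import heapq
from collections import deque

def fcp_schedule(comp, comm, succ, num_procs, H=None):
    """Hypothetical FCP sketch: size-H sorted heap + FIFO for ready tasks,
    processor choice restricted to last-message and earliest-idle processors."""
    tasks = list(comp)
    pred = {t: [] for t in tasks}
    for t in tasks:
        for s in succ.get(t, ()):
            pred[s].append(t)

    blevel = {}                           # static priority: bottom level
    def bl(t):
        if t not in blevel:
            blevel[t] = comp[t] + max(
                (comm[(t, s)] + bl(s) for s in succ.get(t, ())), default=0)
        return blevel[t]
    for t in tasks:
        bl(t)

    H = num_procs if H is None else H
    heap, fifo = [], deque()              # sorted list (size <= H) + FIFO overflow
    def add_ready(t):
        if len(heap) < H:
            heapq.heappush(heap, (-blevel[t], t))
        else:
            fifo.append(t)
    def select_ready():
        _, t = heapq.heappop(heap)
        if fifo:                          # refill the sorted list from the FIFO
            u = fifo.popleft()
            heapq.heappush(heap, (-blevel[u], u))
        return t

    PE, ST, FT = {}, {}, {}
    idle = [0] * num_procs                # PIT(p) per processor
    waiting = {t: len(pred[t]) for t in tasks}
    for t in tasks:
        if waiting[t] == 0:
            add_ready(t)

    for _ in range(len(tasks)):
        t = select_ready()
        def start_on(p):                  # earliest start time of t on p
            arrivals = [FT[x] + (0 if PE[x] == p else comm[(x, t)])
                        for x in pred[t]]
            return max([idle[p]] + arrivals)
        # Candidate (a): processor of the last arriving message (0 for entry tasks).
        p_last = (PE[max(pred[t], key=lambda x: FT[x] + comm[(x, t)])]
                  if pred[t] else 0)
        # Candidate (b): processor becoming idle the earliest.
        p_idle = min(range(num_procs), key=idle.__getitem__)
        p = min((p_last, p_idle), key=start_on)
        PE[t], ST[t] = p, start_on(p)
        FT[t] = ST[t] + comp[t]
        idle[p] = FT[t]
        for s in succ.get(t, ()):
            waiting[s] -= 1
            if waiting[s] == 0:
                add_ready(s)
    return PE, ST, max(idle)

comp = {"a": 1, "b": 1, "c": 1, "d": 1}
comm = {("a", "b"): 10, ("a", "c"): 10, ("b", "d"): 10, ("c", "d"): 10}
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
PE, ST, makespan = fcp_schedule(comp, comm, succ, num_procs=2)
print(makespan)  # heavy communication keeps all four tasks on one processor: 4
```

On this diamond graph the communication costs dominate, so the last-message candidate wins every comparison and the whole graph is serialized on one processor, exactly the behavior one would expect from a communication-aware heuristic at high CCR.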
For each task, finding the processor from which the last message is received implies scanning the task's incoming edges; over all tasks and edges this takes O(E + V). Finding the processor becoming idle the earliest takes O(log(P)) per task, implying O(V log(P)) for all tasks. As a result, the total complexity of FCP is O(V(log(H) + log(P)) + E). A priority list of size P yields good results, as indicated by the experiments, which implies a total complexity of O(V log(P) + E).

5 Performance Results

The FCP algorithm is compared with the four algorithms described in Section 3: MCP, ETF, CPM and DSC-LLB. The four algorithms are well known, use different scheduling schemes, and have been shown to obtain competitive results [6, 7, 10, 11]. We selected the lower-cost version of MCP, in which, if there are several tasks with the same priority, the task to be scheduled is chosen randomly. For CPM, we also considered communication when computing the task priorities; however, no improvement was made to its processor selection scheme. For FCP, we use a priority list size of P when comparing it with the other algorithms (this choice is explained in Section 5.3).

We consider task graphs representing various types of parallel algorithms. The selected problems are LU decomposition ("LU"), a Laplace equation solver ("Laplace"), a stencil algorithm ("Stencil") and the Fast Fourier Transform ("FFT"). Miniature sample task graphs of each type are shown in Figure 1. For each of these problems, we adjusted the problem size to obtain task graphs of about 2000 nodes. For each problem, we varied the task graph granularity by varying the communication-to-computation ratio (CCR). The values used for the CCR were 0.2, 1.0 and 5.0. For each problem and each CCR value, we generated 5 graphs with random execution times and communication delays (i.i.d. uniform distribution with unit coefficient of variation).

[Figure 1: Miniature task graphs]

5.1 Running Times

Our main objective is to reduce the task scheduling cost (i.e., the running time), while maintaining performance. In Figure 2, the average running times of the algorithms are shown in CPU seconds, as measured on a Pentium Pro/233MHz PC with 64 MB RAM running Linux. ETF is the most costly among the compared algorithms: its running time increases from 2.5 s for 2 processors up to 65 s for 32 processors. MCP also has a runtime proportional to the number of processors, but its cost is significantly lower: for P = 2 it runs for 0.3 s, while for P = 32 the running time is 1.7 s. CPM is the fastest list scheduling algorithm in this particular comparison.
Its running times increase logarithmically with P, from 0.17 s for 2 processors to 0.28 s for 32 processors. DSC-LLB does not vary with P at all, as its most costly step, clustering, is independent of the number of processors; the DSC-LLB running times vary around 0.7 s. FCP's running time is the lowest, comparable with CPM, and increases logarithmically with P from 0.17 s for 2 processors to 0.26 s for 32 processors.

[Figure 2: Scheduling algorithm cost comparison]

5.2 Scheduling Performance

In this section we show that the FCP algorithm achieves the same performance as the existing, more expensive list scheduling algorithms. For the performance comparison, we use the normalized schedule length (NSL), which is defined as the ratio between the schedule length produced by the given algorithm and the schedule length produced by a reference algorithm. In Figure 3, we show the average normalized schedule lengths for the selected algorithms and problems, where the reference algorithm is MCP. For each of the considered CCR values, a set of NSL values is presented. Note that the NSL generally increases with P, as a result of the limited parallelism in the task graphs.

MCP and ETF consistently yield relatively good schedules. Depending on the problem and its granularity, either one or the other performs better. The differences are greater for fine-grained problems, which are more sensitive to scheduling decisions. For LU, MCP schedules are better by up to roughly 20%, while for Laplace and Stencil, ETF schedules are better by up to roughly 10%. For coarse-grain problems, the results are comparable; the only exception is LU, for which ETF performs up to 20% worse. The third list scheduling algorithm, CPM, does not try to reduce communication costs, but only balances the load of the processors. Consequently, it has a very simple processor selection scheme, which explains its low cost. However, its performance worsens even further when communication is considered, the degradation rising to more than a factor of 2 for coarse-grain LU on 32 processors.

DSC-LLB is a multi-step method intended to obtain good results while aiming for minimal complexity. Its scheduling performance is not much worse than that of MCP and ETF: typically, its schedule lengths are no more than about 20% larger than the MCP and ETF schedule lengths. Although in some cases the difference can be higher, there are also cases in which DSC-LLB performs up to 10% better than a list scheduling algorithm. FCP has both low cost and good performance. Compared to the other provably low-cost algorithms, FCP has consistently better performance, while its cost is lower.

[Figure 3: Scheduling algorithms performance comparison (NSL per problem, for CCR = 0.2, 1.0 and 5.0)]

[Figure 4: FCP speedup (per problem, for CCR = 0.2, 1.0 and 5.0)]

Compared to the two more expensive algorithms, MCP and ETF, one can note that FCP usually performs comparably to the better of the two. The only case in which FCP performs comparably to only the second best is fine-grained Stencil, for which ETF has a slightly better performance.

Finally, in Figure 4 we show the FCP speedup for the considered problems. For all the problems, FCP obtains significant speedup. For coarse-grain problems, the speedup is almost linear, while for fine-grain problems, because of the limited parallelism available, the speedup starts levelling off earlier. For LU and Laplace, there are a large number of join operations; as a consequence, there is not much parallelism available and the speedup is lower. Stencil and FFT are more regular, therefore more parallelism can be exploited and better speedup is obtained.

5.3 Priority List Size Sensitivity on Performance

As mentioned earlier, an important improvement of our algorithm is based on the fact that only a fixed number of tasks needs to be kept sorted at each scheduling step. In Figure 5, we study the influence of the priority list size on the scheduling performance of FCP. As the reference algorithm when computing the NSL, the FCP algorithm with a priority list of size P was selected. The results represent mean values over the four considered problems.

[Figure 5: The influence of the priority list size on FCP performance, for list sizes from H = 0 up to H = 2P]

For small numbers of processors, when there is more parallelism to be exploited, FCP yields good results even for small priority list sizes. Even for a zero-length priority list, which amounts to scheduling the tasks in FIFO order as they become ready, FCP still obtains good results. For larger numbers of processors, the parallelism in the problems decreases, and as a consequence the performance degrades for small priority list sizes. However, one can note that for a priority list size greater than P, little further improvement is obtained. If the number of ready tasks is greater than P, the scheduling process tends to become a load balancing scheme. The reason is that, after mapping the first P ready tasks, the communication costs of the remaining ready tasks tend to be overlapped with the execution of the previously mapped P tasks. The task priorities, used to select the task least likely to be delayed by communication costs, therefore become less important, and a priority list of smaller size can be used. When there is only a small number of ready tasks at each scheduling step, the task priorities become more important; in this case, the tasks must be kept sorted to obtain good performance. From the above experiments, it can be concluded that a priority list size of P is a good choice for the FCP algorithm: a smaller size penalizes problems with limited parallelism, while a greater size does not yield further improvements.

6 Conclusion

In this paper, a new list scheduling algorithm, called Fast Critical Path (FCP), has been presented. FCP is intended as a compile-time scheduling algorithm for distributed-memory systems. While similar to existing list scheduling algorithms, FCP has two important differences: (a) it does not sort all the tasks at the beginning, but maintains only a limited number of tasks sorted at any given time, and (b) instead of considering all processors as possible destinations for a given task, the choice is restricted to either the processor from which the last message to the given task arrives or the processor which becomes idle the earliest. Using this approach, the time complexity is reduced to O(V(log(H) + log(P)) + E). It is shown that a priority list size of P (H = P) yields good scheduling performance; as a result, the FCP complexity becomes O(V log(P) + E).

Experimental results show that, compared with known scheduling algorithms, FCP obtains schedules comparable to those of the more expensive list scheduling algorithms, MCP and ETF. Yet the FCP complexity is lower even when compared to low-cost scheduling algorithms like DSC-LLB and CPM, which have complexities of O((E + V) log(V)) and O(V(log(V) + log(P))), respectively, while FCP's scheduling performance is consistently better. In summary, despite its very low complexity, FCP outperforms other low-cost algorithms and even matches the better-performing, higher-cost list scheduling algorithms.

References

[1] I. Ahmad and Y.-K. Kwok. A new approach to scheduling parallel programs using task duplication. In Proc. ICPP, pages 47-51, Aug.
[2] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Co., 1979.
[3] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM J. on Applied Mathematics, 17(2), Mar. 1969.
[4] J.-J. Hwang, Y.-C. Chow, F. D. Anger, and C.-Y. Lee. Scheduling precedence graphs in systems with interprocessor communication times. SIAM J. on Computing, 18, Apr. 1989.
[5] B. Kruatrachue and T. G. Lewis. Grain size determination for parallel processing. IEEE Software, pages 23-32, Jan. 1988.
[6] Y.-K. Kwok and I. Ahmad. Benchmarking the task graph scheduling algorithms. In Proc. 1st Merged IPPS/SPDP, Mar. 1998.
[7] A. Rădulescu, A. J. C. van Gemund, and H.-X. Lin. LLB: A fast and effective scheduling algorithm for distributed-memory systems. In Proc. 2nd Merged IPPS/SPDP, San Juan, Puerto Rico, Apr. 1999. IEEE.
[8] V. Sarkar. Partitioning and Scheduling Parallel Programs for Execution on Multiprocessors. PhD thesis, MIT, 1987.
[9] B. Shirazi, M. Wang, and G. Pathak. Analysis and evaluation of heuristic methods for static task scheduling. J. of Parallel and Distributed Computing, 10(3), Nov. 1990.
[10] M.-Y. Wu and D. D. Gajski. Hypertool: A programming aid for message-passing systems. IEEE Trans. on Parallel and Distributed Systems, 1(7), July 1990.
[11] T. Yang and A. Gerasoulis. DSC: Scheduling parallel tasks on an unbounded number of processors. IEEE Trans. on Parallel and Distributed Systems, 5(9), Dec. 1994.

A Processor Selection

In this appendix we prove that, given a task, the choice of the processor on which the task starts the earliest can be restricted to only two processors. The two possible destination processors are either (a) the processor from which the last message to the given task arrives, or (b) the processor which becomes idle the earliest.
The start time of a task t on a candidate processor p is defined as the maximum of (a) the time the last message to t arrives from a different processor, and (b) the time p becomes idle. The task start time is minimized on one of the two processors that minimize these two components of the start time. As a consequence, there are two possible destination processors: (a) the processor from which the last message to the given task arrives, because mapping the task on this processor is the only case in which the cost of the last message is zeroed, and (b) the processor becoming idle the earliest. This is formalized in the following.

Definition 1  The arrival time of a message sent by a task t_x to a task t is defined as:
  T_m(t_x, t) = FT(t_x) + comm(t_x, t)

Definition 2  The time the last message arrives at a task t is defined as the maximum of the message arrival times, or 0 in case the task is an entry task:
  T_lm(t) = max( max over {(t_x, t) ∈ E} of T_m(t_x, t), 0 )

Definition 3  Let t_l be the task from which the last message arrives at task t:
  t_l = t_x ∈ V : T_m(t_x, t) = T_lm(t)

Definition 4  Let p_l denote the processor from which the last message arrives:
  p_l = PE(t_l)

Definition 5  The processor becoming idle the earliest is:
  p_r = p ∈ P : PIT(p) = min over {p_x ∈ P} of PIT(p_x)

Definition 6  The start time of a tentative scheduling of task t on a processor p is defined as:
  TST(t, p) = max( max over {(t_x, t) ∈ E, PE(t_x) ≠ p} of T_m(t_x, t), PIT(p) )

Theorem 1  Let t be the task to be scheduled, and let p = PE(t) be such that
  TST(t, p) = min over {p_x ∈ P} of TST(t, p_x).
Then p ∈ {p_l, p_r}.

Proof  For any p_y ≠ p_l, the last message is sent from a processor other than p_y, that is,
  T_m(t_l, t) ∈ { T_m(t_x, t) : (t_x, t) ∈ E, PE(t_x) ≠ p_y },
which implies
  ∀ p_y ≠ p_l : T_m(t_l, t) ≤ max over {(t_x, t) ∈ E, PE(t_x) ≠ p_y} of T_m(t_x, t).
From Definitions 2 and 3, it follows that
  ∀ p_y ≠ p_l : T_lm(t) = T_m(t_l, t).
As a result,
  ∀ p_y ≠ p_l : max over {(t_x, t) ∈ E, PE(t_x) ≠ p_y} of T_m(t_x, t) = T_lm(t),
which is the maximum value this term can attain, according to Definition 2. As a consequence, the only processor for which the first term of TST(t, p) can be decreased below T_lm(t) is p_l. This implies that the first term of TST(t, p) is minimized for p_l.
According to Definition 5, the second term of $T_{ST}(t,p)$ is minimized for $p = p_I$. Since each of the two terms of $T_{ST}(t,p)$ is minimized either for $p_{LM}$ or for $p_I$, it follows that $p \in \{p_{LM}, p_I\}$. $\Box$
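The theorem can also be checked empirically: compute the tentative start time on every processor by brute force and confirm that the minimum is always attained either on the processor sending the last message or on the earliest-idle processor. The following is a minimal sketch of that check; the function and variable names (`tentative_start`, `pit`, `proc_of`) are illustrative, not taken from the paper.

```python
import random

def tentative_start(msgs, proc_of, pit, p):
    """Start time of a task tentatively placed on processor p:
    the maximum of (latest message arriving from a *different*
    processor) and (the time p becomes idle)."""
    last_remote = max(
        (arrival for (src, arrival) in msgs if proc_of[src] != p),
        default=0.0,
    )
    return max(last_remote, pit[p])

random.seed(1)
P = 4  # number of processors
for trial in range(500):
    # time each processor becomes idle
    pit = [random.uniform(0, 10) for _ in range(P)]
    # incoming messages: (source task id, arrival time); each source
    # task is placed on a random processor
    n_msgs = random.randint(1, 6)
    proc_of = {s: random.randrange(P) for s in range(n_msgs)}
    msgs = [(s, random.uniform(0, 10)) for s in range(n_msgs)]

    # the two candidate processors of Theorem 1
    p_lm = proc_of[max(msgs, key=lambda m: m[1])[0]]  # sender of the last message
    p_idle = min(range(P), key=lambda p: pit[p])      # earliest-idle processor

    best_all = min(tentative_start(msgs, proc_of, pit, p) for p in range(P))
    best_two = min(tentative_start(msgs, proc_of, pit, p) for p in (p_lm, p_idle))
    assert best_all == best_two
```

This is also why FCP's restriction to two candidate processors loses nothing: examining all P processors per task only re-derives a minimum already attained at one of the two candidates.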


More information

Key Grids: A Protocol Family for Assigning Symmetric Keys

Key Grids: A Protocol Family for Assigning Symmetric Keys Key Grids: A Protocol Family for Assigning Symmetric Keys Amitanand S. Aiyer University of Texas at Austin anand@cs.utexas.edu Lorenzo Alvisi University of Texas at Austin lorenzo@cs.utexas.edu Mohamed

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information

A 2-Approximation Algorithm for the Soft-Capacitated Facility Location Problem

A 2-Approximation Algorithm for the Soft-Capacitated Facility Location Problem A 2-Approximation Algorithm for the Soft-Capacitated Facility Location Problem Mohammad Mahdian Yinyu Ye Ý Jiawei Zhang Þ Abstract This paper is divided into two parts. In the first part of this paper,

More information

A Duplication Based List Scheduling Genetic Algorithm for Scheduling Task on Parallel Processors

A Duplication Based List Scheduling Genetic Algorithm for Scheduling Task on Parallel Processors A Duplication Based List Scheduling Genetic Algorithm for Scheduling Task on Parallel Processors Dr. Gurvinder Singh Department of Computer Science & Engineering, Guru Nanak Dev University, Amritsar- 143001,

More information

APPROXIMATING A PARALLEL TASK SCHEDULE USING LONGEST PATH

APPROXIMATING A PARALLEL TASK SCHEDULE USING LONGEST PATH APPROXIMATING A PARALLEL TASK SCHEDULE USING LONGEST PATH Daniel Wespetal Computer Science Department University of Minnesota-Morris wesp0006@mrs.umn.edu Joel Nelson Computer Science Department University

More information

A Note on Karr s Algorithm

A Note on Karr s Algorithm A Note on Karr s Algorithm Markus Müller-Olm ½ and Helmut Seidl ¾ ½ FernUniversität Hagen, FB Informatik, LG PI 5, Universitätsstr. 1, 58097 Hagen, Germany mmo@ls5.informatik.uni-dortmund.de ¾ TU München,

More information