Mapping of Parallel Tasks to Multiprocessors with Duplication *


Gyung-Leen Park, Dept. of Comp. Sc. and Eng., Univ. of Texas at Arlington, Arlington, TX, gpark@cse.uta.edu
Behrooz Shirazi, Dept. of Comp. Sc. and Eng., Univ. of Texas at Arlington, Arlington, TX, shirazi@cse.uta.edu
Jeff Marquis, Prism Parallel Tech., Inc., N. Plano Rd., Richardson, Texas

Abstract

Duplication Based Scheduling (DBS) is a relatively new approach for solving multiprocessor scheduling problems. The problem is defined as finding an optimal schedule which minimizes the parallel execution time of an application on a target system. This paper proposes a new DBS algorithm which achieves considerable performance improvement over existing DBS algorithms of equal or lower time complexity. The proposed algorithm obtains performance comparable to DBS algorithms of higher complexity. The paper also proposes a variation of the proposed algorithm which adjusts the extent of duplication according to the limited number of processors available in the target system. Our simulation study reveals the gradual performance degradation of the proposed algorithm as the number of processors available in the system is decreased.

1. Introduction

Efficient scheduling of parallel programs, represented as Directed Acyclic Graphs (DAGs), onto the processing elements of parallel and distributed computer systems is an extremely difficult and important problem [10, 22-26, 29, 31]. The goals of the scheduling process are to utilize resources efficiently and to achieve the performance objectives of the application (e.g., to minimize the program's parallel execution time). It has been shown that the multiprocessor scheduling problem is NP-complete in its general forms. The typical approach to the problem is list scheduling, where tasks are put into lists according to priorities assigned by heuristics [1, 19]. Duplication Based Scheduling is a relatively new approach to the scheduling problem.
The DBS algorithms are capable of reducing communication overhead by duplicating remote parent tasks on local processing elements. Since DBS methods have also been shown to be NP-complete in their general forms [17], many of the proposed DBS algorithms are based on heuristics. This paper classifies DBS algorithms into two categories according to the task duplication approach used: Scheduling with Partial Duplication (SPD) and Scheduling with Full Duplication (SFD). SPD algorithms do not duplicate the parent of a join node unless the parent is critical. A join node is defined as a node with an in-degree greater than one (i.e., a node with more than one incoming edge). Instead, SPD algorithms try to find the primary iparent, defined later in this paper as an immediate parent which gives the largest start time to the join node. The join node is then scheduled on the processor where the primary iparent has been scheduled. Because of the limited task duplication, algorithms in this category have low complexity, but they may not be appropriate for systems with high communication overhead. They typically provide good schedules for input DAGs in which computation cost is strictly larger than communication cost. Critical Path Method (CPM) [5], Search and Duplication Based Scheduling (SDBS) [6], and Scalable Task Duplication Scheduling (STDS) [8] belong to this category. SFD algorithms attempt to duplicate all the parents of a join node and apply the task duplication algorithm to all the processors that hold any of the parents of the join node. Thus, algorithms in this category have higher complexity but typically show better performance than SPD algorithms. Duplication Scheduling Heuristic (DSH) [13], Bottom-up Top-down Duplication Heuristic (BTDH) [4], Linear Clustering with Task Duplication (LCTD) [3, 27], Critical Path Fast Duplication (CPFD) [2], and Economical Critical Path Fast Duplication (ECPFD) [14] belong to this category.
* This work has been supported in part by grants from NSF (CDA and MIPS) and the State of Texas ATP.

A trade-off exists between algorithms in these two categories: performance (better application parallel execution time) versus time complexity (longer time to

carry out the scheduling algorithm itself). This paper proposes a new DBS algorithm that attempts to achieve the performance of SFD algorithms with a time complexity approaching that of SPD algorithms. The proposed algorithm, called Duplication First and Reduction Next (DFRN), duplicates the parents of any join node, as done in SFD algorithms, but with reduced computational complexity. In general, most DBS algorithms, including DFRN, assume the availability of an unlimited number of processors, with few exceptions such as STDS [8] and ECPFD [14]. Since this assumption may not hold in practice, we also propose a variation of the DFRN algorithm, Scalable scheduling with DFRN (SDFRN), which adjusts the extent of duplication according to the limited number of processors available in the target system. The SDFRN algorithm with N available processors provides the same schedule as that obtained with an unbounded number of processors, where N is the number of task nodes in the input DAG. Our simulation study shows that the DFRN algorithm achieves considerable performance improvement over existing algorithms of equal or lower time complexity, while it obtains performance comparable to algorithms of higher time complexity. It is also shown that the performance improvement grows as the Communication to Computation Ratio (CCR) is increased. The simulation study also reveals the graceful performance degradation of the SDFRN algorithm as the number of available processors decreases. The remainder of this paper is organized as follows. Section 2 presents the system model and the problem definition. Section 3 briefly covers related work. The two proposed DBS algorithms, DFRN and SDFRN, are presented in Section 4. The performance of the DFRN algorithm is compared with that of the existing algorithms in Section 5. Section 5 also shows the effect of the number of available processors on the SDFRN algorithm. Finally, Section 6 concludes this paper.
2. System model and problem definition

A parallel program is usually represented by a Directed Acyclic Graph (DAG), also called a task graph. As defined in [6], a DAG consists of a tuple (V, E, T, C), where V, E, T, and C are the set of task nodes, the set of communication edges, the set of computation costs associated with the task nodes, and the set of communication costs associated with the edges, respectively. T(V_i) is the computation cost of task V_i, and C(V_i, V_j) is the communication cost of the edge E(V_i, V_j) which connects tasks V_i and V_j. The edge E(V_i, V_j) represents the precedence constraint between nodes V_i and V_j; in other words, task V_j can start execution only after the output of V_i is available to V_j. When the two tasks V_i and V_j are assigned to the same processor, C(V_i, V_j) is assumed to be zero, since intra-processor communication cost is negligible compared with interprocessor communication cost. The weights associated with nodes and edges are obtained by estimation [30]. This paper defines two relations for precedence constraints. The relation V_i → V_j indicates the strong precedence relation between V_i and V_j: V_i is an immediate parent of V_j and V_j is an immediate child of V_i. The terms iparent and ichild are used to represent immediate parent and immediate child, respectively. The relation V_i ⇒ V_j indicates the weak precedence relation between V_i and V_j: V_i is a parent of V_j but not necessarily an immediate one. V_i ⇒ V_j and V_j ⇒ V_k imply V_i ⇒ V_k. V_i → V_j and V_j → V_k do not imply V_i → V_k, but do imply V_i ⇒ V_k. The ⇒ relation is transitive; the → relation is not. A node without any parent is called an entry node and a node without any child is called an exit node. Graphically, a node is represented as a circle with a dividing line in the middle.
The number in the upper portion of the circle is the node's ID and the number in the lower portion is the node's computation cost. For example, for the sample DAG in Figure 1, the entry node is V_1, which has a computation cost of 10. In the graph representation of a DAG, the communication cost of each edge is written on the edge itself. For each node, the in-degree is the number of incoming edges and the out-degree is the number of outgoing edges. For example, in Figure 1, the in-degree and out-degree of node V_5 are 3 and 1, respectively. A few terms are defined here for a clearer presentation. Definition 1: A node is called a fork node if its out-degree is greater than 1. Definition 2: A node is called a join node if its in-degree is greater than 1. Note that fork node and join node are not exclusive terms: one node can be both a fork and a join node, i.e., both its in-degree and out-degree are greater than one. Similarly, a node can be neither a fork nor a join node, i.e., both its in-degree and out-degree are one. In the task graph of Figure 1, nodes V_1, V_2, V_3, and V_4 are fork nodes while nodes V_5, V_6, V_7, and V_8 are join nodes. In this example, all the fork nodes have out-degree three and all the join nodes have in-degree three.
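The DAG tuple (V, E, T, C) and the fork/join definitions above can be sketched directly in code; the following Python encoding, with made-up node IDs and costs, is illustrative rather than taken from the paper:

```python
from collections import defaultdict

# Hypothetical encoding of a DAG (V, E, T, C): comp[v] is T(v);
# comm[(u, v)] is C(u, v), and its keys double as the edge set E.
comp = {1: 10, 2: 20, 3: 30, 4: 60}
comm = {(1, 2): 50, (1, 3): 20, (1, 4): 40, (2, 4): 30, (3, 4): 25}

in_deg = defaultdict(int)
out_deg = defaultdict(int)
for (u, v) in comm:
    out_deg[u] += 1
    in_deg[v] += 1

fork_nodes = [v for v in comp if out_deg[v] > 1]   # Definition 1: out-degree > 1
join_nodes = [v for v in comp if in_deg[v] > 1]    # Definition 2: in-degree > 1
entry_nodes = [v for v in comp if in_deg[v] == 0]  # no parents
exit_nodes = [v for v in comp if out_deg[v] == 0]  # no children
```

In this toy graph V_1 is both an entry node and a fork node, and V_4 is both a join node and the exit node, illustrating that the terms are not exclusive.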

Figure 1. Sample DAG

Definition 3: The Earliest Start Time, EST(V_i, P_k), and Earliest Completion Time, ECT(V_i, P_k), are the times at which task V_i starts and finishes its execution on processor P_k, respectively.

Definition 4: The message arriving time (MAT) from V_i to V_j, or MAT(V_i, V_j), is the time at which the message from V_i arrives at V_j. If V_i and V_j are scheduled on the same processor P_k, MAT(V_i, V_j) becomes ECT(V_i, P_k). Otherwise, MAT(V_i, V_j) = ECT(V_i, P_k) + C(V_i, V_j).

Definition 5: An iparent of a join node is called its primary iparent if it provides the largest MAT to the join node. The primary iparent is denoted V_i = PIP(V_j) if V_i is the primary iparent of V_j. More formally, V_i = PIP(V_j) if and only if MAT(V_i, V_j) ≥ MAT(V_k, V_j) for all V_k with V_k → V_j, where V_i → V_j and i ≠ k. If more than one iparent provides the same largest MAT, the PIP is chosen arbitrarily.

Definition 6: An immediate parent of a join node is called the secondary iparent of the join node if it provides the second largest MAT to the join node. The secondary iparent is denoted V_i = SIP(V_j) if V_i is the secondary iparent of V_j. Formally, V_i = SIP(V_j) if and only if MAT(V_i, V_j) ≥ MAT(V_k, V_j) for all V_k with V_k → V_j and V_k ≠ PIP(V_j), where V_i → V_j, V_i ≠ PIP(V_j), and i ≠ k. If more than one iparent provides the same second largest MAT, the SIP is chosen arbitrarily. EST(V_j, P_c) becomes Max(ECT(PIP(V_j), P_c), MAT(SIP(V_j), V_j)) if V_j is scheduled without any task duplication on the processor P_c where PIP(V_j) has been scheduled.

Definition 7: The processor holding the primary iparent of V_i is called the primary processor of V_i.

Definition 8: The critical path of a task graph is the path from an entry node to an exit node which has the largest sum of computation and communication costs of the nodes and edges on the path.
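Definitions 4 through 6 translate into a few lines of code. In this sketch the completion times, processor assignments, and communication costs are hypothetical values standing in for a partial schedule:

```python
# ECT and processor of already-scheduled iparents of a join node V_5 (made up).
ect = {1: 10, 2: 30, 3: 40}
proc = {1: 'P1', 2: 'P2', 3: 'P3'}
comm = {(1, 5): 100, (2, 5): 80, (3, 5): 90}   # C(V_i, V_5)

def mat(vi, vj, pk):
    """Definition 4: message arriving time from vi to vj if vj runs on pk."""
    if proc[vi] == pk:
        return ect[vi]                  # same processor: communication is free
    return ect[vi] + comm[(vi, vj)]

def pip_sip(vj, pk, iparents):
    """Definitions 5 and 6: iparents with the largest and second-largest MAT."""
    ranked = sorted(iparents, key=lambda vi: mat(vi, vj, pk), reverse=True)
    return ranked[0], ranked[1]
```

With these numbers, scheduling V_5 on a fresh processor gives MATs of 110, 110, and 130 for V_1, V_2, and V_3, so V_3 is the primary iparent and the tie for secondary iparent is broken arbitrarily.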
The Critical Path Including Communication cost (CPIC) is the length of the critical path including the communication costs on the path, while the Critical Path Excluding Communication cost (CPEC) is its length excluding the communication costs. For example, the critical path of the sample graph in Figure 1 consists of nodes V_1, V_4, V_7, and V_8. Then CPIC is T(V_1) + C(V_1, V_4) + T(V_4) + C(V_4, V_7) + T(V_7) + C(V_7, V_8) + T(V_8), which is 400. CPEC is T(V_1) + T(V_4) + T(V_7) + T(V_8), which is 150.

Definition 9: The level of a node is defined recursively as follows. The level of an entry node, V_0, is zero: Lv(V_0) = 0. For a non-join node V_j with V_i → V_j, Lv(V_j) = Lv(V_i) + 1. For a join node V_j, Lv(V_j) = Max(Lv(V_i)) + 1 over all V_i → V_j. For example, the levels of nodes V_1, V_2, V_5, and V_8 are 0, 1, 2, and 3, respectively. Even though there is an edge from node V_1 to node V_5, the level of node V_5 is still 2, not 1, since Lv(V_5) = Max(Lv(V_i)) + 1 over all V_i → V_5 for the join node V_5.

As in existing DBS algorithms, the number of processors is assumed to be unbounded. This assumption is relaxed when we present the scalable version of the DBS algorithm in Section 4.2. The topology of the target system is also assumed to be a complete graph, i.e., all processors can communicate directly with each other. This assumption may be justified by noting that, with current technologies such as wormhole routing, the distance between processors is no longer an important factor in communication delays. Thus, the multiprocessor scheduling process becomes a mapping of the task nodes in the input DAG to the processors in the target system, with the goal of minimizing the execution time of the entire program. The execution time of the entire program after scheduling is called the parallel time, to distinguish it from the completion time of an individual task node.
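Definition 9 and the critical-path lengths are straightforward recursions. The sketch below uses a small hypothetical DAG, not the one in Figure 1:

```python
import functools

comp = {1: 10, 2: 20, 3: 30, 4: 60, 5: 50}                     # T(v)
comm = {(1, 2): 50, (1, 3): 20, (1, 5): 40, (2, 4): 30,
        (3, 4): 25, (4, 5): 35}                                # C(u, v)

parents = {}
for (u, v) in comm:
    parents.setdefault(v, []).append(u)

@functools.lru_cache(maxsize=None)
def level(v):
    # Definition 9: entry nodes are level 0; a child is one level below
    # its deepest iparent (the max handles join nodes).
    ps = parents.get(v, [])
    return 0 if not ps else 1 + max(level(p) for p in ps)

@functools.lru_cache(maxsize=None)
def cpic(v):
    # Longest path ending at v, counting node and edge weights (CPIC);
    # dropping the comm term would give CPEC instead.
    ps = parents.get(v, [])
    return comp[v] + (max(cpic(p) + comm[(p, v)] for p in ps) if ps else 0)
```

Note that node 5 has level 3 even though the edge (1, 5) exists, mirroring the V_1-to-V_5 example above.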
3. Related work

This section briefly covers several typical scheduling algorithms from the literature. They are used later in this paper for performance comparison.

3.1. Heavy Node First (HNF) algorithm

The HNF algorithm [21] assigns the nodes in a DAG to processors level by level. At each level, the scheduler selects the eligible nodes in descending order of computational weight, with the heaviest node (i.e., the node with the largest computation cost) selected first. A node is selected arbitrarily if multiple nodes at the same level have the same computation cost. The selected node is assigned to the processor which gives it the earliest start time.

3.2. Linear Clustering (LC) algorithm

The LC algorithm [12] is a traditional critical-path-based clustering method. The scheduler identifies the critical path, removes the nodes on the path from the DAG, and assigns them to a linear cluster. The process is repeated until no task node remains in the DAG. Each cluster is then scheduled onto a processor.

3.3. Scalable Task Duplication based Scheduling (STDS) algorithm

The STDS algorithm [8] first calculates the start time and the completion time of each node by traversing the input DAG. The algorithm then generates clusters by performing a depth-first search starting from the exit node. During the task assignment process, only critical tasks, i.e., those essential to establish a path from a particular node to the entry node, are duplicated. The algorithm has a small complexity because of the limited duplication. If the number of processors available is less than the number needed, the algorithm executes a processor reduction procedure. In this paper, an unbounded number of processors is used for STDS in the performance comparison of Section 5.

3.4. Critical Path Fast Duplication (CPFD) algorithm

The CPFD algorithm [2] classifies the nodes in a DAG into three categories: Critical Path Node (CPN), In-Branch Node (IBN), and Out-Branch Node (OBN). A CPN is a node on the critical path. An IBN is a node from which there is a path to a CPN. An OBN is a node which is neither a CPN nor an IBN.
CPFD tries to schedule CPNs first. If there is any unscheduled IBN for a CPN, CPFD traces the IBN and schedules it first. OBNs are scheduled after all the IBNs and CPNs have been scheduled. The motivation behind CPFD is that the parallel time is likely to be reduced by scheduling CPNs first. Performance comparisons show that CPFD outperforms DSH and BTDH in most cases [2].

3.5. Comparison

We have classified the existing DBS algorithms into two categories: SPD (Scheduling with Partial Duplication) and SFD (Scheduling with Full Duplication). Both SPD and SFD approaches duplicate a fork node if the ichild of the fork node is not a join node. On the other hand, the SPD approach does not duplicate any parent of a join node except the PIP, while the SFD approach tries to duplicate all the parents. Naturally, there exists a trade-off between better performance (smaller parallel time of the application, typically achieved by SFD algorithms) and better running time (smaller time to carry out the scheduling process itself, typically achieved by SPD algorithms). Our goal is to introduce a new task duplication scheduling algorithm with performance better than, but a running time comparable to, the SPD algorithms. Table I summarizes the time complexity of these algorithms and indicates the class each belongs to (i.e., whether it is an SPD or SFD algorithm). Note that, for a DAG with V nodes, all the SFD algorithms have a complexity of O(V^4) while the SPD algorithms have a complexity of O(V^2).

Table I. Comparison of scheduling algorithms

  SCHEDULER   CLASSIFICATION   COMPLEXITY
  HNF         non-DBS          O(V log V)
  LC          non-DBS          O(V^3)
  DSH         SFD              O(V^4)
  BTDH        SFD              O(V^4)
  CPM         SPD              O(V^2)
  SDBS        SPD              O(V^2)
  STDS        SPD              O(V^2)
  LCTD        SFD              O(V^4)
  CPFD        SFD              O(V^4)
  ECPFD       SFD              O(V^4)

As an illustration, Figure 2 presents the schedules obtained by each algorithm for the sample DAG of Figure 1.
In this example, P_i represents processing element i; PT is the Parallel Time of the DAG; and [EST(V_i, P_k), i, ECT(V_i, P_k)] represents the earliest start time and earliest completion time of task i. For example, in Figure 2(a), task V_1 starts its execution at time 0 and finishes at time 10 on processor P_1.

P1: [0, 1, 10] [10, 4, 70] [190, 7, 260] [260, 8, 270]
P2: [60, 3, 90] [170, 6, 230]
P3: [60, 2, 80] [160, 5, 210]
(a) Schedule by HNF (PT = 270)

P1: [0, 1, 10] [10, 4, 70] [140, 7, 210] [210, 8, 220]
P2: [0, 1, 10] [10, 3, 40]
P3: [0, 1, 10] [10, 2, 30]
P4: [0, 1, 10] [10, 4, 70] [100, 6, 160]
P5: [0, 1, 10] [10, 4, 70] [110, 5, 160]
(b) Schedule by STDS (PT = 220)

P1: [0, 1, 10] [10, 4, 70] [190, 7, 260] [260, 8, 270]
P2: [60, 3, 90] [120, 5, 170]
P3: [60, 2, 80] [170, 6, 230]
(c) Schedule by LC (PT = 270)

P1: [0, 1, 10] [10, 4, 70] [70, 3, 100] [110, 7, 180] [180, 8, 190]
P2: [0, 1, 10] [10, 3, 40]
P3: [0, 1, 10] [10, 2, 30]
P4: [0, 1, 10] [10, 4, 70] [70, 3, 100] [100, 6, 160]
P5: [0, 1, 10] [10, 4, 70] [70, 3, 100] [100, 5, 150]
(d) Schedule by DFRN (PT = 190)

P1: [0, 1, 10] [10, 4, 70] [70, 3, 100] [100, 5, 150]
P2: [0, 1, 10] [10, 3, 40] [40, 4, 100] [110, 7, 180] [180, 8, 190]
P3: [0, 1, 10] [10, 2, 30] [30, 4, 90] [100, 6, 160]
(e) Schedule by CPFD (PT = 190)

Figure 2. Schedules by various schedulers

4. The proposed algorithms

This section presents the two proposed algorithms, DFRN and SDFRN. The motivations and high-level descriptions of the DFRN and SDFRN algorithms are given in Section 4.1 and Section 4.2, respectively. The worst-case performance and optimality analysis are also provided in this section.

4.1. DFRN

4.1.1. Motivation. When we ran existing scheduling algorithms on a DAG with about 400 nodes, we observed that an SFD algorithm takes about 50 minutes to generate a schedule while an SPD algorithm takes less than one second. We need a scheduler with performance better than SPD algorithms but with a running time adequate for applications consisting of a large number of tasks. This need became our goal, and the goal was achieved by employing a new task duplication approach called DFRN (Duplication First and Reduction Next). The DFRN approach behaves the same as the SPD and SFD approaches in handling fork nodes, but differs in handling join nodes.
An SFD algorithm recursively estimates the effect of a possible duplication and decides whether to duplicate each node one by one. As a consequence, for a DAG with V nodes, each node may be considered V times for duplication in the worst case. Unlike the SFD approach, DFRN first duplicates all parent nodes in a bottom-up fashion, up to the parent which has already been scheduled on the same processor, without estimating the effect of the duplications. Then each duplicated task is removed if it does not meet certain conditions. Also, SFD algorithms are applied to all the processors on which any iparent of the join node has been scheduled. We observed that, after the duplication process, the completion time of the join node on the primary processor was in most cases shorter than on other processors. Thus, DFRN applies the duplication only to the primary processor, with the expectation that the primary processor is the best candidate for the join node. These two differences provide a far shorter running time than, but comparable performance to, SFD algorithms, as shown in Section 5. At the same time, the DFRN approach also achieves considerable performance improvement over SPD approaches.

4.1.2. Description of the proposed algorithm. The high-level description of the DFRN algorithm is shown in Figure 3. In this figure, the notations P_c, P_u, PIP, IP, LN, and JN denote the primary processor, an unused processor, the primary iparent, an iparent, the last node, and a join node, respectively. In addition, a new term used in the algorithm is defined first.

Definition 10: At any step of the scheduling process, the last node of processor P_i is the most recent node assigned to P_i. In Figure 2(a), the last nodes of P_1, P_2, and P_3 are V_8, V_6, and V_5, respectively.

The term iparent, as used in the algorithm of Figure 3, indicates the iparent image which has the minimum EST if there is more than one iparent image across different processors. For example, in Figure 2.
(d), V_3 on P_2 is identified as the iparent of its ichild, since EST(V_3, P_2) = 10 while EST(V_3, P_k) = 70 for k = 1, 4, and 5. The primary iparent and the primary processor are used in the same way. For example, V_3 on P_2 is identified as the primary iparent if V_3 is the primary iparent of any node; the primary processor is P_2 in this case. Note that the algorithm is presented in a generic form so that any list scheduling algorithm can be used as the node selection algorithm. The node selection algorithm

decides which node is considered first. HNF is used as the node selection algorithm in this paper. In step (1), initialize() reads the input DAG and identifies the level of each node. All the nodes at the same level are sorted in descending order of their computation costs (as per the HNF heuristic). Step (2) considers each node according to the priority given in step (1). The node under consideration, V_i, may or may not be a join node. Steps (3) through (10) handle non-join nodes. The iparent in step (4) may or may not be the last node. If the iparent is the last node, V_i is scheduled after the iparent, as shown in step (6). If the iparent is not the last node, the tasks scheduled on the processor up to the iparent are copied (i.e., duplicated) onto an unused processor, as shown in step (8). Then V_i is scheduled onto the unused processor in step (9), making the EST of V_i equal to the ECT of the iparent. Otherwise, the EST of V_i would be increased by the computation time of the tasks between the iparent and the last node in the schedule. If V_i is a join node, the primary iparent of V_i and the primary processor are identified in step (12). DFRN is applied to the join node in step (14) or (17), after the last-node check is handled in the same way. DFRN(P_a, V_i) consists of two procedures, try_duplication(P_a, V_i) and try_deletion(P_a, V_i), as shown in steps (21) and (22). try_duplication(P_a, V_i) first tries to duplicate the iparent giving the largest MAT to V_i. The procedure recursively searches iparents from V_i in a bottom-up fashion until it finds a parent which has already been scheduled on P_a, as shown in steps (24) and (25). When it finds such a parent on P_a, it stops the search and duplicates the parents found so far, as shown in step (27). As a result, V_i is duplicated before V_j when V_i ⇒ V_j, and try_deletion(P_a, V_i) considers each duplicated node one by one in the same sequence.
After the duplication step, try_deletion(P_a, V_i) decides whether to delete any of the duplicated tasks based on the two conditions in step (30). The first condition covers the case when the output of the duplicated task is available earlier through a message from the task's copy on another processor than from the duplicated task itself; the duplicated task is deleted since the duplication is not necessary. The second condition covers the case when the duplication no longer decreases EST(V_i, P_c). By the second condition, the EST of any node obtained by the DFRN algorithm is guaranteed to be less than or equal to the EST of the same node obtained by SPD algorithms, since the second condition results in EST(V_i, P_c) ≤ MAT(SIP(V_i), V_i), while EST(V_i, P_c) = MAT(SIP(V_i), V_i) in the SPD approach, assuming ECT(PIP(V_i), P_c) ≤ MAT(SIP(V_i), V_i). The parallel time obtained from DFRN is likewise less than or equal to that from an SPD algorithm, since the parallel time is the largest ECT over all the nodes in the DAG.

Scheduling algorithm with DFRN
(1) initialize() // build a priority queue using HNF
(2) for each node V_i in the queue // in FIFO manner
(3)   if V_i is not a JN // V_i has only one IP
(4)     identify the IP
(5)     if the IP is the LN
(6)       schedule V_i to the PE having the IP
(7)     else // the IP is not the LN
(8)       copy the schedule up to the IP onto P_u // now the IP is the LN on P_u
(9)       schedule V_i to P_u
(10)   endif
(11) else // V_i is a join node
(12)   identify PIP and P_c
(13)   if PIP is the LN
(14)     DFRN(P_c, V_i) // apply DFRN to P_c
(15)   else // PIP is not the LN
(16)     copy the schedule up to PIP onto P_u
(17)     DFRN(P_u, V_i) // apply DFRN to P_u
(18)   endif
(19) endif
(20) end for

DFRN(P_a, V_i)
(21) try_duplication(P_a, V_i)
(22) try_deletion(P_a, V_i)

try_duplication(P_a, V_i)
(23) for each V_p with MAT(V_p, V_i) ≥ MAT(V_q, V_i), V_p → V_i, V_q → V_i, p ≠ q, where V_p and V_q are not on P_a yet // from the node giving the largest MAT to the node giving the smallest MAT
(24)   if there is any V_x such that MAT(V_x, V_p) ≥ MAT(V_y, V_p), V_x → V_p, V_y → V_p, x ≠ y, where V_x and V_y are not on P_a yet // some IP of V_p is not scheduled on P_a
(25)     try_duplication(P_a, V_x) // trace the IP which is not on P_a
(26)   else // all of its IPs are scheduled on P_a
(27)     schedule V_p onto P_a // duplicate the IP which is not on P_a
(28)   endif
(29) end for

try_deletion(P_a, V_i)
(30) delete any duplicated task V_k if
     (i) ECT(V_k, P_a) > MAT(V_k, V_d), where V_d is the ichild of V_k for which V_k was duplicated, or
     (ii) ECT(V_k, P_a) > MAT(SIP(V_i), V_i)

Figure 3. Description of the DFRN algorithm
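The reduction half of DFRN, i.e. try_deletion with conditions (i) and (ii) of step (30), can be sketched as follows. This is a deliberately simplified illustration with hypothetical inputs, not the authors' implementation: each duplicate carries its completion time on P_a, and it is deleted when a message from the original copy would arrive no later, or when it no longer improves on the secondary iparent's MAT:

```python
def try_deletion(duplicated, mat_from_original, mat_sip):
    """Keep a duplicate only if neither deletion condition of step (30) holds."""
    kept = []
    for task, ect_on_pa in duplicated:
        if ect_on_pa > mat_from_original[task]:  # (i) the original's message wins
            continue
        if ect_on_pa > mat_sip:                  # (ii) no further EST reduction
            continue
        kept.append(task)
    return kept

# Duplicates of V2 and V3 on the primary processor, with made-up times.
dups = [('V2', 40), ('V3', 90)]
mats = {'V2': 60, 'V3': 70}   # MAT if the original copy sent a message instead
print(try_deletion(dups, mats, mat_sip=80))   # prints ['V2']: V3 fails both tests
```

The duplicates are visited in the order they were created (ancestors first), mirroring the sequence used by try_duplication.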

The dominant part of the algorithm is DFRN(P_a, V_i). Since try_duplication(P_a, V_i) duplicates parents in order of MAT, the sorting takes O(V^2), which makes the complexity of the routine O(V^2). try_deletion(P_a, V_i) also takes O(V^2) time, since it considers deletion m times and takes O(p) time to recalculate EST(V_i, P_a) whenever a node is deleted, where m is the number of duplicated tasks and p is the number of deleted iparents of the node, with m ≤ V and p ≤ V. Thus the complexity of try_deletion(P_a, V_i) is O(V^2). The overall complexity becomes O(V^3), since DFRN(P_a, V_i) is executed q times, where q is the number of join nodes in the DAG, q ≤ V.

4.1.3. Analysis of the proposed algorithm. The worst-case analysis of the proposed algorithm is especially important for real-time systems, which are an important application area of parallel processing. The proposed algorithm has the following two properties. Due to space limitations, we omit the proofs; interested readers are referred to [18].

1. The worst-case parallel time obtained by the proposed algorithm for any input DAG is guaranteed to be less than or equal to CPIC.
2. The proposed algorithm always provides an optimal schedule for tree-structured input DAGs.

4.2. SDFRN (Scalable scheduling with DFRN)

4.2.1. Motivation. Most DBS algorithms, including DFRN, assume that an unlimited number of processors is available for scheduling, since this assumption makes the design of DBS algorithms simpler. Recently, scalable DBS algorithms [8, 14] have been considered due to the limited number of processors in real-world situations. We propose a scalable DFRN algorithm which adjusts the extent of duplication according to the number of processors available in the target system.

4.2.2. Description of the proposed algorithm. Scalability can be achieved by adding one condition to the algorithm of Figure 3. The DFRN algorithm requires an unused processor for the duplications in steps (8) and (16), as shown in Figure 3.
By inserting a condition which checks the availability of an unused processor at these two lines, the DFRN algorithm can adjust the extent of duplication according to the number of processors available. In other words, the SDFRN algorithm executes the duplications in lines (8) and (16) only if there is an unused processor available in the system. Since the algorithm is executed once for each node, N processors are enough to guarantee all the necessary duplications, where N is the number of task nodes in the input DAG. That is, the two properties shown above are still valid for the SDFRN algorithm if at least N processors are available in the target system. For illustration, Figure 4 contains the schedules obtained by the SDFRN algorithm with various numbers of processors for the sample DAG of Figure 1. In this example, when we try to use more than 5 processors, the algorithm limits the number of processors to 5; thus, the cases with more than 5 processors are not shown in this figure.

P1: [0, 1, 10] [10, 4, 70] [70, 3, 100] [110, 7, 180] [180, 8, 190]
P2: [0, 1, 10] [10, 3, 40]
P3: [0, 1, 10] [10, 2, 30]
P4: [0, 1, 10] [10, 4, 70] [70, 3, 100] [100, 6, 160]
P5: [0, 1, 10] [10, 4, 70] [70, 3, 100] [100, 5, 150]
(a) Schedule by SDFRN with 5 processors (PT = 190)

P1: [0, 1, 10] [10, 4, 70] [70, 3, 100] [110, 7, 180] [180, 5, 230] [230, 8, 240]
P2: [0, 1, 10] [10, 3, 40]
P3: [0, 1, 10] [10, 2, 30]
P4: [0, 1, 10] [10, 4, 70] [70, 3, 100] [100, 6, 160]
(b) Schedule by SDFRN with 4 processors (PT = 240)

P1: [0, 1, 10] [10, 4, 70] [70, 3, 100] [110, 7, 180] [180, 6, 240] [240, 5, 290] [290, 8, 300]
P2: [0, 1, 10] [10, 3, 40]
P3: [0, 1, 10] [10, 2, 30]
(c) Schedule by SDFRN with 3 processors (PT = 300)

P1: [0, 1, 10] [10, 4, 70] [70, 2, 90] [90, 3, 120] [120, 7, 190] [190, 6, 250] [250, 5, 300] [300, 8, 310]
P2: [0, 1, 10] [10, 3, 40]
(d) Schedule by SDFRN with 2 processors (PT = 310)

Figure 4. Schedules obtained by SDFRN with various numbers of available processors.
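The single change that turns DFRN into SDFRN, gating steps (8) and (16) on processor availability, can be sketched with two hypothetical helpers:

```python
def next_unused_processor(used, total):
    """Return a fresh processor index, or None when the pool is exhausted."""
    return len(used) if len(used) < total else None

def maybe_duplicate(used, total):
    """SDFRN gate for steps (8) and (16): duplicate only if a P_u exists."""
    pu = next_unused_processor(used, total)
    if pu is None:
        return None      # pool exhausted: skip the duplication, schedule in place
    used.add(pu)
    return pu            # DFRN behaviour: copy the schedule onto P_u
```

With a pool of at least N processors the gate never fires, so SDFRN reproduces the unbounded-processor schedule, consistent with the claim above.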
5. Performance comparison

We generated 1000 random DAGs to compare the performance of DFRN with that of existing scheduling algorithms. We used two parameters whose effects we were interested in investigating: the number of nodes and the CCR (Communication to Computation Ratio). The numbers of nodes used are 20, 40, 60, 80, and 100, while the CCR values used are 0.1, 0.5, 1.0, 5.0, and 10.0. CCR is the ratio of the average communication cost to the average computation cost. 40 DAGs were generated for each of the 25 combinations, giving 1000 DAGs in total. The scheduling techniques presented in Section 3 are used for comparison.
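A random-DAG generator of the kind used in such studies might look like the sketch below; the cost ranges, edge fan-in, and seeding are assumptions for illustration, not the paper's actual generator:

```python
import random

def random_dag(n, ccr, seed=0):
    """Random DAG whose average communication cost is roughly ccr times the
    average computation cost. Node IDs are topologically ordered by design."""
    rng = random.Random(seed)
    comp = {v: rng.randint(10, 50) for v in range(n)}   # T(v)
    avg_comp = sum(comp.values()) / n
    comm = {}
    for v in range(1, n):
        # Connect each node to at least one earlier node, keeping the graph acyclic.
        for u in rng.sample(range(v), k=min(v, rng.randint(1, 3))):
            comm[(u, v)] = max(1, round(rng.uniform(0.5, 1.5) * ccr * avg_comp))
    return comp, comm

comp, comm = random_dag(20, ccr=5.0)
```

Because every edge points from a lower ID to a higher ID, the result is guaranteed to be acyclic with a single pass.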

For the performance comparison, we define a normalized performance measure named Relative Parallel Time (RPT), which is the ratio of the parallel time to CPEC. For example, if the parallel time obtained by DFRN is 200 and CPEC is 100, the RPT of DFRN is 2.0. A smaller RPT value indicates a shorter parallel time. The RPT of any scheduling algorithm cannot be lower than one, since CPEC is a lower bound. One of our objectives is to observe the trade-off between performance (the parallel time obtained) and running time (the time taken to generate a schedule) among the scheduling algorithms. Table II shows the actual average running times of the five algorithms. The running time is the user time obtained by the time command on a Sun Sparc10 workstation. For an input DAG with 400 nodes, the time taken to obtain a schedule was 5.97 seconds for HNF, 0.34 seconds for STDS, 2.95 minutes for LC, 17.3 seconds for DFRN, and 46.4 minutes for CPFD. There are significant differences among the running times.

Table II. Comparison of running times (in seconds)

  N   HNF   STDS   LC   CPFD   DFRN

Table III shows the result of the comparison between each pair of algorithms. Each entry of the table consists of three elements in the format > a, = b, < c, meaning that the algorithm in the row provides a longer parallel time than the algorithm in the column a times, the same parallel time b times, and a shorter parallel time c times. For example, to compare DFRN and HNF, we look up DFRN in the fifth row and HNF in the first column, or vice versa. In this case the entry is > 2, = 22, < 976, which means that over the 1000 randomly generated DAGs, DFRN provides a longer parallel time than HNF 2 times, the same parallel time 22 times, and a shorter parallel time 976 times. The comparison shows that using DFRN instead of HNF shortens the parallel time in 97.6% of the cases.
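The pairwise tallies in Table III reduce to a simple counting routine; the parallel times below are toy values for illustration only:

```python
def compare(pt_row, pt_col):
    """Count how often the row algorithm's parallel time is longer (>),
    equal (=), or shorter (<) than the column algorithm's, per DAG."""
    longer = sum(1 for a, b in zip(pt_row, pt_col) if a > b)
    same = sum(1 for a, b in zip(pt_row, pt_col) if a == b)
    shorter = sum(1 for a, b in zip(pt_row, pt_col) if a < b)
    return longer, same, shorter

# Toy parallel times for two schedulers over five DAGs.
dfrn = [190, 200, 180, 210, 190]
hnf = [270, 200, 260, 240, 185]
print(compare(dfrn, hnf))   # prints (1, 1, 3)
```

The three counts always sum to the number of DAGs compared, which is why each row-column pair of Table III totals 1000.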
Comparing DFRN with LC, which has the same complexity as DFRN, DFRN generates a shorter parallel time 829 times, the same parallel time 171 times, and a longer parallel time 0 times, while the running time of DFRN was shorter than that of LC. We also confirmed that the parallel time obtained by DFRN is always less than CPIC in the 1000 runs. Against CPFD, DFRN generates a shorter parallel time 27 times, the same parallel time 685 times, and a longer parallel time 288 times. Note that DFRN provides the same parallel time as CPFD in 68.5% of the cases with only a small fraction of the running time of CPFD, which shows the effectiveness of the DFRN approach. Given the incomparably long running time of CPFD, DFRN is a good candidate for application programs consisting of a large number of tasks. For a DAG with a very large number of nodes, STDS is appropriate because of its very short running time.

Table III. Comparison of parallel times (row vs. column, "> a = b < c" format)

         HNF               STDS              LC                CPFD              DFRN
HNF      = 1000            > 885 = 48 < 67   > 587 = 39 < 374  > 978 = 22 < 0    > 976 = 22 < 2
STDS     > 67 = 48 < 885   = 1000            > 27 = 165 < 808  > 575 = 425 < 0   > 567 = 430 < 3
LC       > 374 = 39 < 587  > 808 = 165 < 27  = 1000            > 829 = 171 < 0   > 829 = 171 < 0
CPFD     > 0 = 22 < 978    > 0 = 425 < 575   > 0 = 171 < 829   = 1000            > 27 = 685 < 288
DFRN     > 2 = 22 < 976    > 3 = 430 < 567   > 0 = 171 < 829   > 288 = 685 < 27  = 1000

Graphical representations of the performance comparison are shown in Figure 5 and Figure 6 with respect to N (the number of nodes) and CCR, respectively. Each case in Figure 5 is an average of 200 runs over the varying CCR values; the average CCR value is 3.3. As shown in Figure 5, the number of nodes does not significantly affect the relative performance of the scheduling algorithms: the comparison shows similar patterns regardless of N. In that pattern, DFRN shows a much shorter parallel time than existing algorithms with equal or lower time complexity, while showing performance comparable to CPFD. CCR, in contrast, is a critical parameter.
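The "> a = b < c" entries of Table III are simple pairwise tallies over the test DAGs. A minimal sketch, using hypothetical parallel times purely for illustration:

```python
def pairwise_entry(times_row, times_col):
    """Tally a Table III entry: over all test DAGs, count how often the
    row algorithm's parallel time is longer than, equal to, or shorter
    than the column algorithm's."""
    longer = sum(r > c for r, c in zip(times_row, times_col))
    same = sum(r == c for r, c in zip(times_row, times_col))
    shorter = sum(r < c for r, c in zip(times_row, times_col))
    return longer, same, shorter

# Hypothetical parallel times for five DAGs (illustration only).
hnf = [10, 12, 9, 15, 11]
dfrn = [8, 12, 7, 13, 11]
a, b, c = pairwise_entry(dfrn, hnf)
print(f"> {a} = {b} < {c}")  # > 0 = 2 < 3
```

Note that the three counts always sum to the number of test DAGs, which is why every entry of Table III sums to 1000, and that swapping the arguments transposes the entry.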
As CCR increases, the performance gap becomes larger, as shown in Figure 6. The differences among the five algorithms are negligible for CCR values up to one, but when CCR is 5, the RPTs of HNF, STDS, LC, DFRN, and CPFD become 3.38, 2.57, 3.61, 1.67, and 1.61, respectively. When CCR is 10, they are 5.79, 5.01, 7.68, 2.45, and 2.27, respectively. As expected, duplication-based scheduling algorithms show considerable performance improvement for DAGs with high CCR values.

Figure 5. Comparison with respect to N

Figure 6. Comparison with respect to CCR

Figure 7 and Figure 8 show the performance degradation of the SDFRN algorithm as the number of available processors decreases. In these figures, aN denotes a target system with a x N available processors; for example, 0.9N indicates that the number of available processors is 90% of the number of task nodes. The figures show that, in most cases, SDFRN achieves performance comparable to that of DFRN with an unbounded number of processors until the number of available processors is reduced to 60% of the number of task nodes. The performance degradation becomes significant as the number of processors decreases below 50% of the number of nodes. The values of N and CCR do not significantly change the pattern of the performance degradation.

Figure 7. Performance degradation with respect to N (CCR = 3.3, D = 3.5)

Figure 8. Performance degradation with respect to CCR (N = 100, D = 3.8)

6. Conclusion

This paper classified existing DBS algorithms into two categories, SPD and SFD algorithms, according to the duplication method used for a join node: SFD algorithms try to duplicate all the iparents of a join node, while SPD algorithms do not. As a result, an SFD algorithm outperforms an SPD algorithm, but its running time is incomparably longer. This paper presented a new duplication-based scheduling algorithm, DFRN, which combines the good features of the two approaches. The motivation is to duplicate the iparents of a join node when the duplication reduces the earliest start time of the join node, as done in SFD algorithms, but without adding much complexity, so that the new approach is well suited for applications consisting of a large number of tasks.
In general, most DBS algorithms, including the proposed one, assume the availability of an unlimited number of processors. Since this assumption may not hold in practice, we also proposed a variation of DFRN, called SDFRN, which adjusts the extent of duplication according to the limited number of processors available in real-world situations. SDFRN shows the same performance as DFRN if at least N processors are available in the target system. Our performance study showed that DFRN has a running time comparable to SPD and non-duplicating scheduling algorithms, while outperforming them by generating schedules with much shorter parallel times. Compared to SFD algorithms, DFRN offers comparable performance with a running time that is several orders of magnitude shorter. The study also revealed the graceful performance degradation of the SDFRN algorithm as the number of processors available in the system decreases. Since the performance comparison study is based on random DAGs, we are currently investigating the characteristics of DAGs from real applications.

References

[1] T. L. Adam, K. Chandy, and J. Dickson, "A Comparison of List Scheduling for Parallel Processing Systems," Communications of the ACM, vol. 17, no. 12, Dec. 1974.
[2] I. Ahmad and Y.-K. Kwok, "A New Approach to Scheduling Parallel Programs Using Task Duplication," Proc. Int'l Conf. on Parallel Processing, vol. II, Aug. 1994.
[3] H. Chen, B. Shirazi, and J. Marquis, "Performance Evaluation of a Novel Scheduling Method: Linear Clustering with Task Duplication," Proc. Int'l Conf. on Parallel and Distributed Systems, Dec. 1993.
[4] Y. C. Chung and S. Ranka, "Application and Performance Analysis of a Compile-Time Optimization Approach for List Scheduling Algorithms on Distributed-Memory Multiprocessors," Proc. Supercomputing '92, Nov. 1992.
[5] J. Y. Colin and P. Chretienne, "C.P.M. Scheduling with Small Communication Delays and Task Duplication," Operations Research, 1991.
[6] S. Darbha and D. P. Agrawal, "SDBS: A Task Duplication Based Optimal Scheduling Algorithm," Proc. Scalable High Performance Computing Conf., May 1994.
[7] S. Darbha and D. P. Agrawal, "A Fast and Scalable Scheduling Algorithm for Distributed Memory Systems," Proc. Symp. on Parallel and Distributed Processing, Oct. 1995.
[8] S. Darbha, "Task Scheduling Algorithms for Distributed Memory Systems," PhD Thesis, North Carolina State Univ.
[9] S. Darbha and D. P. Agrawal, "Scalable Scheduling Algorithm for Distributed Memory Machines," Proc. Symp. on Parallel and Distributed Processing, Oct. 1996.
[10] H. El-Rewini, T. G. Lewis, and H. H. Ali, Task Scheduling in Parallel and Distributed Systems, New York: Prentice Hall.
[11] A. Gerasoulis and T. Yang, "A Comparison of Clustering Heuristics for Scheduling DAGs on Multiprocessors," J. Parallel and Distributed Computing, vol. 16, no. 4, Dec. 1992.
[12] S. J. Kim and J. C. Browne, "A General Approach to Mapping of Parallel Computation upon Multiprocessor Architectures," Proc. Int'l Conf. on Parallel Processing, vol. III, 1988.
[13] B. Kruatrachue and T. G. Lewis, "Grain Size Determination for Parallel Processing," IEEE Software, Jan. 1988.
[14] Y.-K. Kwok and I. Ahmad, "Exploiting Duplication to Minimize the Execution Times of Parallel Programs on Message-Passing Systems," Proc. Symp. on Parallel and Distributed Processing, Oct. 1994.
[15] Y.-K. Kwok, I. Ahmad, and J. Gu, "FAST: A Low-Complexity Algorithm for Efficient Scheduling of DAGs on Parallel Processors," Proc. Int'l Conf. on Parallel Processing, vol. II, Aug. 1996.
[16] C. McCreary and H. Gill, "Automatic Determination of Grain Size for Efficient Parallel Processing," Comm. ACM, vol. 32, Sept. 1989, pp. 1073-1078.
[17] C. H. Papadimitriou and M. Yannakakis, "Towards an Architecture-Independent Analysis of Parallel Algorithms," ACM Proc. Symp. on Theory of Computing (STOC), 1988.
[18] G.-L. Park, B. Shirazi, and J. Marquis, "Employing Task Duplication for Multiprocessor Scheduling," Tech. Report, Dept. of Computer Science and Engineering, University of Texas.
[19] G.-L. Park, B. Shirazi, J. Marquis, and H. Choo, "Decisive Path Scheduling: A New List Scheduling Method," Proc. 26th Int'l Conf. on Parallel Processing, Chicago, USA, Aug. 1997.
[20] V. Sarkar, Partitioning and Scheduling Parallel Programs for Multiprocessors, Cambridge, Mass.: MIT Press.
[21] B. Shirazi, M. Wang, and G. Pathak, "Analysis and Evaluation of Heuristic Methods for Static Task Scheduling," Journal of Parallel and Distributed Computing, vol. 10, no. 3, 1990.
[22] B. Shirazi and A. R. Hurson, "Scheduling and Load Balancing: Guest Editors' Introduction," Journal of Parallel and Distributed Computing, Dec. 1992.
[23] B. Shirazi and A. R. Hurson, "A Mini-track on Scheduling and Load Balancing: Track Coordinator's Introduction," Hawaii Int'l Conf. on System Sciences (HICSS-26), Jan. 1993.
[24] B. Shirazi, K. Kavi, A. R. Hurson, and P. Biswas, "PARSA: A PARallel Program Scheduling and Assessment Environment," Proc. Int'l Conf. on Parallel Processing, vol. II, Aug. 1993.
[25] B. Shirazi, H. B. Chen, K. Kavi, J. Marquis, and A. R. Hurson, "PARSA: A Parallel Programs Software Development Tool," Proc. Symp. on Assessment of Quality Software Development Tools, June 1994.
[26] B. Shirazi, A. R. Hurson, and K. Kavi, Scheduling and Load Balancing, IEEE Press.
[27] B. Shirazi, H.-B. Chen, and J. Marquis, "Comparative Study of Task Duplication Static Scheduling versus Clustering and Non-Clustering Techniques," Concurrency: Practice and Experience, vol. 7, no. 5, Aug. 1995.
[28] T. Yang and A. Gerasoulis, "DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors," IEEE Trans. on Parallel and Distributed Systems, vol. 5, no. 9, Sept. 1994.
[29] T. Yang and A. Gerasoulis, "A Dedicated Track on Partitioning and Scheduling for Parallel and Distributed Computation," Hawaii Int'l Conf. on Systems Sciences, Jan.
[30] M. Y. Wu and D. D. Gajski, "Hypertool: A Programming Aid for Message-Passing Systems," IEEE Trans. on Parallel and Distributed Systems, vol. 1, no. 3, Jul. 1990.
[31] M. Y. Wu, "A Dedicated Track on Program Partitioning and Scheduling in Parallel and Distributed Systems," Hawaii Int'l Conf. on Systems Sciences, Jan.


More information

arxiv: v2 [cs.ds] 22 Jun 2016

arxiv: v2 [cs.ds] 22 Jun 2016 Federated Scheduling Admits No Constant Speedup Factors for Constrained-Deadline DAG Task Systems Jian-Jia Chen Department of Informatics, TU Dortmund University, Germany arxiv:1510.07254v2 [cs.ds] 22

More information

A Localized Algorithm for Reducing the Size of Dominating Set in Mobile Ad Hoc Networks

A Localized Algorithm for Reducing the Size of Dominating Set in Mobile Ad Hoc Networks A Localized Algorithm for Reducing the Size of Dominating Set in Mobile Ad Hoc Networks Yamin Li and Shietung Peng Department of Computer Science Hosei University Tokyo 18-858 Japan {yamin, speng}@k.hosei.ac.jp

More information

Redundancy Resolution by Minimization of Joint Disturbance Torque for Independent Joint Controlled Kinematically Redundant Manipulators

Redundancy Resolution by Minimization of Joint Disturbance Torque for Independent Joint Controlled Kinematically Redundant Manipulators 56 ICASE :The Institute ofcontrol,automation and Systems Engineering,KOREA Vol.,No.1,March,000 Redundancy Resolution by Minimization of Joint Disturbance Torque for Independent Joint Controlled Kinematically

More information

Achieving Distributed Buffering in Multi-path Routing using Fair Allocation

Achieving Distributed Buffering in Multi-path Routing using Fair Allocation Achieving Distributed Buffering in Multi-path Routing using Fair Allocation Ali Al-Dhaher, Tricha Anjali Department of Electrical and Computer Engineering Illinois Institute of Technology Chicago, Illinois

More information

Energy-Constrained Scheduling of DAGs on Multi-core Processors

Energy-Constrained Scheduling of DAGs on Multi-core Processors Energy-Constrained Scheduling of DAGs on Multi-core Processors Ishfaq Ahmad 1, Roman Arora 1, Derek White 1, Vangelis Metsis 1, and Rebecca Ingram 2 1 University of Texas at Arlington, Computer Science

More information

Programmers of parallel computers regard automated parallelprogramming

Programmers of parallel computers regard automated parallelprogramming Automatic Parallelization Ishfaq Ahmad Hong Kong University of Science and Technology Yu-Kwong Kwok University of Hong Kong Min-You Wu, and Wei Shu University of Central Florida The authors explain the

More information

A Framework for Space and Time Efficient Scheduling of Parallelism

A Framework for Space and Time Efficient Scheduling of Parallelism A Framework for Space and Time Efficient Scheduling of Parallelism Girija J. Narlikar Guy E. Blelloch December 996 CMU-CS-96-97 School of Computer Science Carnegie Mellon University Pittsburgh, PA 523

More information

Lecture 9: Load Balancing & Resource Allocation

Lecture 9: Load Balancing & Resource Allocation Lecture 9: Load Balancing & Resource Allocation Introduction Moler s law, Sullivan s theorem give upper bounds on the speed-up that can be achieved using multiple processors. But to get these need to efficiently

More information

Layer-Based Scheduling Algorithms for Multiprocessor-Tasks with Precedence Constraints

Layer-Based Scheduling Algorithms for Multiprocessor-Tasks with Precedence Constraints Layer-Based Scheduling Algorithms for Multiprocessor-Tasks with Precedence Constraints Jörg Dümmler, Raphael Kunis, and Gudula Rünger Chemnitz University of Technology, Department of Computer Science,

More information

A Modular Genetic Algorithm for Scheduling Task Graphs

A Modular Genetic Algorithm for Scheduling Task Graphs A Modular Genetic Algorithm for Scheduling Task Graphs Michael Rinehart, Vida Kianzad, and Shuvra S. Bhattacharyya Department of Electrical and Computer Engineering, and Institute for Advanced Computer

More information

An Effective Load Balancing Task Allocation Algorithm using Task Clustering

An Effective Load Balancing Task Allocation Algorithm using Task Clustering An Effective Load Balancing Task Allocation using Task Clustering Poornima Bhardwaj Research Scholar, Department of Computer Science Gurukul Kangri Vishwavidyalaya,Haridwar, India Vinod Kumar, Ph.d Professor

More information

On the Hardness of Counting the Solutions of SPARQL Queries

On the Hardness of Counting the Solutions of SPARQL Queries On the Hardness of Counting the Solutions of SPARQL Queries Reinhard Pichler and Sebastian Skritek Vienna University of Technology, Faculty of Informatics {pichler,skritek}@dbai.tuwien.ac.at 1 Introduction

More information

3 No-Wait Job Shops with Variable Processing Times

3 No-Wait Job Shops with Variable Processing Times 3 No-Wait Job Shops with Variable Processing Times In this chapter we assume that, on top of the classical no-wait job shop setting, we are given a set of processing times for each operation. We may select

More information

CASCH: A Software Tool for Automatic Parallelization and Scheduling of Programs on Message-Passing Multiprocessors

CASCH: A Software Tool for Automatic Parallelization and Scheduling of Programs on Message-Passing Multiprocessors CASCH: A Software Tool for Automatic Parallelization and Scheduling of Programs on Message-Passing Multiprocessors Abstract Ishfaq Ahmad 1, Yu-Kwong Kwok 2, Min-You Wu 3, and Wei Shu 3 1 Department of

More information

APPROXIMATING A PARALLEL TASK SCHEDULE USING LONGEST PATH

APPROXIMATING A PARALLEL TASK SCHEDULE USING LONGEST PATH APPROXIMATING A PARALLEL TASK SCHEDULE USING LONGEST PATH Daniel Wespetal Computer Science Department University of Minnesota-Morris wesp0006@mrs.umn.edu Joel Nelson Computer Science Department University

More information

Quantized Iterative Message Passing Decoders with Low Error Floor for LDPC Codes

Quantized Iterative Message Passing Decoders with Low Error Floor for LDPC Codes Quantized Iterative Message Passing Decoders with Low Error Floor for LDPC Codes Xiaojie Zhang and Paul H. Siegel University of California, San Diego 1. Introduction Low-density parity-check (LDPC) codes

More information

is the Capacitated Minimum Spanning Tree

is the Capacitated Minimum Spanning Tree Dynamic Capacitated Minimum Spanning Trees Raja Jothi and Balaji Raghavachari Department of Computer Science, University of Texas at Dallas Richardson, TX 75083, USA raja, rbk @utdallas.edu Abstract Given

More information

URSA: A Unified ReSource Allocator for Registers and Functional Units in VLIW Architectures

URSA: A Unified ReSource Allocator for Registers and Functional Units in VLIW Architectures Presented at IFIP WG 10.3(Concurrent Systems) Working Conference on Architectures and Compliation Techniques for Fine and Medium Grain Parallelism, Orlando, Fl., January 1993 URSA: A Unified ReSource Allocator

More information

How Much Logic Should Go in an FPGA Logic Block?

How Much Logic Should Go in an FPGA Logic Block? How Much Logic Should Go in an FPGA Logic Block? Vaughn Betz and Jonathan Rose Department of Electrical and Computer Engineering, University of Toronto Toronto, Ontario, Canada M5S 3G4 {vaughn, jayar}@eecgutorontoca

More information

Embedding Large Complete Binary Trees in Hypercubes with Load Balancing

Embedding Large Complete Binary Trees in Hypercubes with Load Balancing JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 35, 104 109 (1996) ARTICLE NO. 0073 Embedding Large Complete Binary Trees in Hypercubes with Load Balancing KEMAL EFE Center for Advanced Computer Studies,

More information

Foundations of Discrete Mathematics

Foundations of Discrete Mathematics Foundations of Discrete Mathematics Chapter 12 By Dr. Dalia M. Gil, Ph.D. Trees Tree are useful in computer science, where they are employed in a wide range of algorithms. They are used to construct efficient

More information

QoS-Aware Hierarchical Multicast Routing on Next Generation Internetworks

QoS-Aware Hierarchical Multicast Routing on Next Generation Internetworks QoS-Aware Hierarchical Multicast Routing on Next Generation Internetworks Satyabrata Pradhan, Yi Li, and Muthucumaru Maheswaran Advanced Networking Research Laboratory Department of Computer Science University

More information

Subset sum problem and dynamic programming

Subset sum problem and dynamic programming Lecture Notes: Dynamic programming We will discuss the subset sum problem (introduced last time), and introduce the main idea of dynamic programming. We illustrate it further using a variant of the so-called

More information

Fault-Tolerant Wormhole Routing Algorithms in Meshes in the Presence of Concave Faults

Fault-Tolerant Wormhole Routing Algorithms in Meshes in the Presence of Concave Faults Fault-Tolerant Wormhole Routing Algorithms in Meshes in the Presence of Concave Faults Seungjin Park Jong-Hoon Youn Bella Bose Dept. of Computer Science Dept. of Computer Science Dept. of Computer Science

More information

Multithreaded Algorithms Part 1. Dept. of Computer Science & Eng University of Moratuwa

Multithreaded Algorithms Part 1. Dept. of Computer Science & Eng University of Moratuwa CS4460 Advanced d Algorithms Batch 08, L4S2 Lecture 11 Multithreaded Algorithms Part 1 N. H. N. D. de Silva Dept. of Computer Science & Eng University of Moratuwa Announcements Last topic discussed is

More information

An Efficient Method for Constructing a Distributed Depth-First Search Tree

An Efficient Method for Constructing a Distributed Depth-First Search Tree An Efficient Method for Constructing a Distributed Depth-First Search Tree S. A. M. Makki and George Havas School of Information Technology The University of Queensland Queensland 4072 Australia sam@it.uq.oz.au

More information

Hierarchical Intelligent Cuttings: A Dynamic Multi-dimensional Packet Classification Algorithm

Hierarchical Intelligent Cuttings: A Dynamic Multi-dimensional Packet Classification Algorithm 161 CHAPTER 5 Hierarchical Intelligent Cuttings: A Dynamic Multi-dimensional Packet Classification Algorithm 1 Introduction We saw in the previous chapter that real-life classifiers exhibit structure and

More information

Distributed Clustering Method for Large-Scaled Wavelength Routed Networks

Distributed Clustering Method for Large-Scaled Wavelength Routed Networks Distributed Clustering Method for Large-Scaled Wavelength Routed Networks Yukinobu Fukushima Graduate School of Information Science and Technology, Osaka University - Yamadaoka, Suita, Osaka 60-08, Japan

More information

An algorithm for Performance Analysis of Single-Source Acyclic graphs

An algorithm for Performance Analysis of Single-Source Acyclic graphs An algorithm for Performance Analysis of Single-Source Acyclic graphs Gabriele Mencagli September 26, 2011 In this document we face with the problem of exploiting the performance analysis of acyclic graphs

More information

Comprehensive Solution for Anomaly-free BGP

Comprehensive Solution for Anomaly-free BGP Comprehensive Solution for Anomaly-free BGP Ravi Musunuri, Jorge A. Cobb Department of Computer Science, The University of Texas at Dallas, Richardson, TX-7083-0688 musunuri, cobb @utdallas.edu Abstract.

More information

Efficiently Utilizing ATE Vector Repeat for Compression by Scan Vector Decomposition

Efficiently Utilizing ATE Vector Repeat for Compression by Scan Vector Decomposition Efficiently Utilizing ATE Vector Repeat for Compression by Scan Vector Decomposition Jinkyu Lee and Nur A. Touba Computer Engineering Research Center University of Teas, Austin, TX 7872 {jlee2, touba}@ece.uteas.edu

More information

An Empirical Performance Study of Connection Oriented Time Warp Parallel Simulation

An Empirical Performance Study of Connection Oriented Time Warp Parallel Simulation 230 The International Arab Journal of Information Technology, Vol. 6, No. 3, July 2009 An Empirical Performance Study of Connection Oriented Time Warp Parallel Simulation Ali Al-Humaimidi and Hussam Ramadan

More information

Power and Locality Aware Request Distribution Technical Report Heungki Lee, Gopinath Vageesan and Eun Jung Kim Texas A&M University College Station

Power and Locality Aware Request Distribution Technical Report Heungki Lee, Gopinath Vageesan and Eun Jung Kim Texas A&M University College Station Power and Locality Aware Request Distribution Technical Report Heungki Lee, Gopinath Vageesan and Eun Jung Kim Texas A&M University College Station Abstract With the growing use of cluster systems in file

More information

The Relationship between Slices and Module Cohesion

The Relationship between Slices and Module Cohesion The Relationship between Slices and Module Cohesion Linda M. Ott Jeffrey J. Thuss Department of Computer Science Michigan Technological University Houghton, MI 49931 Abstract High module cohesion is often

More information

Parallel Traveling Salesman. PhD Student: Viet Anh Trinh Advisor: Professor Feng Gu.

Parallel Traveling Salesman. PhD Student: Viet Anh Trinh Advisor: Professor Feng Gu. Parallel Traveling Salesman PhD Student: Viet Anh Trinh Advisor: Professor Feng Gu Agenda 1. Traveling salesman introduction 2. Genetic Algorithm for TSP 3. Tree Search for TSP Travelling Salesman - Set

More information

3 Competitive Dynamic BSTs (January 31 and February 2)

3 Competitive Dynamic BSTs (January 31 and February 2) 3 Competitive Dynamic BSTs (January 31 and February ) In their original paper on splay trees [3], Danny Sleator and Bob Tarjan conjectured that the cost of sequence of searches in a splay tree is within

More information

ADAPTIVE SORTING WITH AVL TREES

ADAPTIVE SORTING WITH AVL TREES ADAPTIVE SORTING WITH AVL TREES Amr Elmasry Computer Science Department Alexandria University Alexandria, Egypt elmasry@alexeng.edu.eg Abstract A new adaptive sorting algorithm is introduced. The new implementation

More information

A Preprocessor Approach to Persistent C++ Cem Evrendilek, Asuman Dogac and Tolga Gesli

A Preprocessor Approach to Persistent C++ Cem Evrendilek, Asuman Dogac and Tolga Gesli A Preprocessor Approach to Persistent C++ Cem Evrendilek, Asuman Dogac and Tolga Gesli Software Research and Development Center Scientific and Technical Research Council of Türkiye E-Mail:asuman@vm.cc.metu.edu.tr

More information

Optimizing Data Scheduling on Processor-In-Memory Arrays y

Optimizing Data Scheduling on Processor-In-Memory Arrays y Optimizing Data Scheduling on Processor-In-Memory Arrays y Yi Tian Edwin H.-M. Sha Chantana Chantrapornchai Peter M. Kogge Dept. of Computer Science and Engineering University of Notre Dame Notre Dame,

More information