
A Fast Recursive Mapping Algorithm

Song Chen and Mary M. Eshaghian
Department of Computer and Information Science
New Jersey Institute of Technology
Newark, NJ

Abstract

This paper presents a generic technique for mapping parallel algorithms onto parallel architectures. The proposed technique is a fast recursive mapping algorithm which is a component of the Cluster-M programming tool. The other components of Cluster-M are the Specification module and the Representation module. In the Specification module, for a given task specified by a high-level machine-independent program, a clustered task graph called a Spec graph is generated. In the Representation module, for a given architecture or computing organization, a clustered system graph called a Rep graph is generated. Given a task (or system) graph, a Spec (or Rep) graph can be generated using one of the clustering algorithms presented in this paper. The clustering is done only once for a given task graph (system graph) independent of any system graphs (task graphs). It is a machine-independent (application-independent) clustering; therefore, it is not repeated for different mappings. The Cluster-M mapping algorithm presented produces a sub-optimal matching of a given Spec graph containing M task modules onto a Rep graph of N processors in O(MN) time. This generic algorithm is suitable for both the allocation problem and the scheduling problem. Its performance is compared to other leading techniques. We show that Cluster-M produces better or similar results in significantly less time, using fewer or equally many processors, as compared to the other known methods.

1 Introduction

An efficient parallel algorithm designed for a parallel architecture includes a detailed outline of the accurate assignment of the concurrent computations onto processors, and of the data transfers onto communication links, such that the overall execution time is minimized. This process may be complex depending on the application task and the underlying organization.
Given the same application task, this process must be repeated for every different architecture. To remedy this problem, one should design portable software which can run on any architecture or organization. It is essential to use portable programming tools with intelligent mapping modules which can support this process efficiently. The design of efficient techniques for mapping parallel programs onto parallel computers is the focus of this paper. The mapping problem has been described in various ways throughout the literature [5, ]. Generally, the mapping problem can be viewed as assigning a given program, which consists of a collection of task modules, to the processing elements of the underlying architecture, so that some performance measure, e.g., total execution time, is optimized. A program can be represented in the form of a task graph, and a parallel computer system can be represented in the form of a system graph. Mapping can be either static or dynamic. In static mapping, the assignments of the nodes of the task graph onto the system graph are determined prior to the execution and are not changed until the end of the execution. A static task graph or system graph can be either uniform or non-uniform. A graph is called non-uniform if the weights of its nodes are not all the same or the weights of its edges differ. Otherwise, it is uniform. Mapping of directed task graphs (where there is a precedence relation among the task modules) is called task scheduling []. If the task graphs to be mapped are undirected, then it is called task allocation []. Whether the graphs are directed or undirected, uniform or non-uniform, there are basically four types of static mappings based on the topological structures of the task and system graphs. These are (1) mapping of specialized tasks onto specialized systems (e.g., mapping of a chain-structured task onto chain-linked processors) [8, 4, 4], (2) mapping of specialized tasks onto arbitrary systems (e.g., mapping of trees onto any architectures) [, ], (3) mapping of arbitrary tasks onto specialized systems (e.g., mapping of any tasks onto a hypercube or a completely connected network) [9,,, 5, 4, 6], and (4) mapping of arbitrary tasks onto arbitrary systems [,, 6, 7, 9, 6]. The focus of this paper is on the mapping of arbitrary tasks onto arbitrary systems. One of the earliest mapping algorithms which can map an arbitrary task onto an arbitrary system is Lo's heuristic [7]. Basically, this heuristic repetitively uses a max-flow min-cut algorithm to find mappings of task modules onto heterogeneous processors. The time complexity of Lo's heuristic is O(MN|E_p| log M), where M is the number of task modules, N is the number of processors, and |E_p| is the number of communication links between processors. El-Rewini and Lewis in [9] presented their mapping heuristic (MH).
MH is a list scheduling algorithm which maps an arbitrary task graph onto an arbitrary system graph. In list scheduling, each task module is assigned a priority. Whenever a processor is available, a task module with the highest priority is selected from the list and assigned to that processor. MH has a time complexity of O(M²N). The mapping problem can also be viewed as a graph matching problem [, ]. In this case, the task graph is to be matched against the system graph in order to minimize the overall execution time. This problem is known to be NP-complete in its general form as well as in several restricted forms []. In an attempt to solve the problem in the general case, a number of heuristics have been introduced [,,, 6, ]. Bokhari in [] searches for the best matching of the edges of the undirected task graph versus the system graph. This heuristic algorithm is based on local search and pair-wise exchange. Lee and Aggarwal's mapping strategy is another example of this approach, but it considers directed task graphs [6]. Both assume the number of nodes of the task graph to be no greater than that of the system graph. The time complexities of both algorithms are O(N³). To reduce the complexity of the mapping problem, a number of approaches such as graph contraction and clustering have been studied [, 5,, 5, 6]. However, in all of these graph-matching based techniques, only the task graph is clustered, and the entire task graph is then matched against the entire system graph. In this paper, we will present a new mapping technique which not only clusters the task graph but also clusters the system graph for more efficient mapping. The clustering is done only once for a given task graph (system graph) independent of any system graphs (task graphs). It is a machine-independent (application-independent) clustering; therefore, it is not repeated for different mappings. The mapping algorithm presented in this paper has a time complexity of only O(MN). It is part of the Cluster-M programming tool []. This generic algorithm is suitable for both the allocation problem and the scheduling problem. The task and system graphs studied in this paper have uniform weight on edges. An extended mapping algorithm which works for non-uniform task graphs and system graphs is presented in [7].

The paper continues as follows. In section 2, we present the Cluster-M clustering algorithms. The mapping algorithm is then detailed in section 3. This is followed by comparisons of our algorithm with other existing mapping algorithms (for both allocation and scheduling) in section 4. Finally, a brief conclusion is given in section 5.

2 Cluster-M Clustering

Cluster-M is a programming tool which facilitates the design and mapping of portable parallel programs. Cluster-M has three main components: Cluster-M Specification, Cluster-M Representation and the Cluster-M mapping module. Cluster-M Specifications are high-level machine-independent programs represented in the form of a multi-level clustered task graph called a Spec graph. Each clustering level in the Spec graph represents a set of concurrent computations, called Spec clusters. A Cluster-M Representation, on the other hand, represents a multi-level partitioning of a system graph, called a Rep graph. At every partitioning level of the Rep graph, there are a number of clusters called Rep clusters. Each Rep cluster represents a set of processors with a certain degree of connectivity. Given a task (or system) graph, a Spec (or Rep) graph can be generated using one of the clustering algorithms described below.
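The multi-level cluster hierarchy described above (a cluster's size is the sum of its sub-cluster sizes, down to single task modules or processors) can be captured by a small recursive structure. The sketch below is our own illustration, not code from the Cluster-M tool; the class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Cluster:
    """A Spec or Rep cluster: either a single leaf node or a merge of sub-clusters."""
    label: str
    children: List["Cluster"] = field(default_factory=list)

    @property
    def size(self) -> int:
        # A leaf cluster holds one task module or processor.
        if not self.children:
            return 1
        # A cluster's size is the sum of its sub-cluster sizes.
        return sum(c.size for c in self.children)

# Leaves a..d, merged level by level as the clustering algorithms would do.
a, b, c, d = (Cluster(x) for x in "abcd")
ab = Cluster("ab", [a, b])
abcd = Cluster("abcd", [ab, c, d])
print(abcd.size)  # 4
```

The size invariant S_i = S_i1 + ... + S_ik holds by construction, since `size` recurses over the sub-clusters.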
The clustering is done only once for a given task graph (system graph) independent of any system graphs (task graphs). It is a machine-independent (application-independent) clustering; therefore it is not repeated for different mappings. For this reason, the time complexity of the clustering algorithms is not included in the time complexity of the mapping algorithm presented in section 3.

2.1 Clustering directed graphs

Many clustering techniques have been developed to reduce the order and size of task graphs [, 9,,, 8, 4, ]. For example, a cluster can be a clan [9], which is a set of nodes with common outside ancestors and descendants in the task graph. Cluster-M based mapping requires clustering of the task graph as well as the system graph to obtain better and faster solutions. For clustering either the task graph or the system graph, we use the following algorithm if the input graph is directed; otherwise we use the algorithm presented in the next section (2.2). In the scheduling problem, task graphs are directed, while in the task allocation problem they are not. The system graphs, on the other hand, are always assumed to be undirected (today's computers have bi-directional links). Therefore, the algorithm presented below is to be used only for directed task graphs.

In the following, we give a formal definition of directed task graphs, which is also applicable to undirected task graphs with the exception that in undirected graphs, for every i, j, (t_i, t_j) = (t_j, t_i). A task can be represented by a task graph G_t(V_t, E_t), where V_t = {t_1, ..., t_M} is a set of task modules to be executed, and E_t is a set of edges representing the partial orders and communication directions between task modules. A directed edge (t_i, t_j) represents that a data communication exists from module t_i to t_j and that t_i must be completed before t_j may begin. Furthermore, each task module t_i is associated with its amount of computation A_i. Each edge (t_i, t_j) is associated with D_ij, the amount of data required to be transmitted from module t_i to module t_j. Note that A_i and D_ij are positive, for 1 ≤ i, j ≤ M. If a directed edge (t_i, t_j) exists, we call t_i a parent node of t_j and t_j a child node of t_i. If a node has more than one child, it is called a broadcast node. If a node has more than one parent, it is called a merge node. According to the data and operational precedence, nodes can be grouped into execution steps and edges can be grouped into execution phases, as described below. An execution step (phase) represents a set of computations (communications) which can be carried out in parallel. Task nodes in execution step 1 are those without parent nodes. Task nodes in step i (i > 1) are those with at least one parent node in step i−1 but no parent node in step j (j ≥ i). Edges in phase i are those (t_x, t_y) where t_x is in execution step i. In this paper, we assume that the amount of data communication between any two task modules is uniform, i.e., D_ij = 1, for 1 ≤ i, j ≤ M, (t_i, t_j) ∈ E_t. This assumption leads to the simple greedy clustering in the clustering algorithm which will be described later. An extended clustering algorithm which clusters non-uniformly weighted directed task graphs is presented in [7].
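The grouping of task nodes into execution steps is a topological leveling of the DAG: a node's step is one more than the latest step among its parents. A minimal sketch (our own helper name, assuming the graph is given as a list of directed edges) might look like:

```python
from collections import defaultdict, deque

def execution_steps(edges, nodes):
    """Group DAG nodes into execution steps: step 1 holds nodes with no
    parents; a node's step is 1 + the latest step among its parents."""
    parents = defaultdict(set)
    children = defaultdict(set)
    for u, v in edges:
        children[u].add(v)
        parents[v].add(u)
    step = {}
    queue = deque(n for n in nodes if not parents[n])
    for n in queue:
        step[n] = 1                      # step 1: nodes without parents
    indeg = {n: len(parents[n]) for n in nodes}
    while queue:                         # process in topological order
        u = queue.popleft()
        for v in children[u]:
            step[v] = max(step.get(v, 0), step[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return step

# Diamond-shaped task graph: a feeds b and c, which both feed d.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
print(sorted(execution_steps(edges, ["a", "b", "c", "d"]).items()))
# [('a', 1), ('b', 2), ('c', 2), ('d', 3)]
```

Edges in phase i then fall out directly: they are the edges whose source node lies in step i.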
The algorithm for clustering directed graphs is presented in Figure 1. The basic idea is to merge all the nodes in each execution step if they have a common parent node or a common child node. If a parent node t_i has one or more children, one of them must be embedded into t_i. Each cluster has a size, which is the number of member nodes in the cluster. A member node can be either a task module or a "supernode". A supernode is obtained by embedding one task module into another task module or supernode (this embedding process is also known as linear clustering [4]). If a Spec cluster has size S_i and the sizes of its sub-clusters at the lower level are S_i1, ..., S_ik, it is obvious that S_i = S_i1 + ... + S_ik. The complexity of the clustering-directed-graphs algorithm is in the order of the number of edges of the task graph, which is O(M²) in the worst case, where M is the number of nodes of the task graph.

Clustering-directed-graphs Algorithm:
  group nodes of the given task graph into corresponding steps
  group edges of the given task graph into corresponding phases
  for all nodes at step 1, do make each into a cluster
  for all phases, do
    for all edges (t_i, t_j), do
      if t_j is a merge node, then
        embed t_j into t_i
        if the parent nodes of t_j are not in a cluster, then
          merge them into a cluster
          increase the cluster size
      if t_i is a broadcast node, then
        k = number of nodes in the cluster t_i belongs to
        if t_i has more than k children, then
          embed the first k children into the above k nodes
          merge the rest into the above cluster
          increase the cluster size
        else embed all children

Figure 1: Clustering-directed-graphs Algorithm.

To illustrate this algorithm, the following example is presented. A task graph of 15 modules is shown in Figure 2. Each module has a computation amount of 1, and each edge carries a data communication amount of 1. This task graph contains two subgraphs which are not connected. This means that the two subtasks can be executed in parallel. The Spec graph is constructed by merging the clusters when they have communication needs, as illustrated in Figure 2. The input task graph has nodes a to o (15 nodes). The final Spec graph is a multi-layered graph containing member nodes a to i (9 nodes). For example, j, k and l are embedded into d, since j, k and l are in different execution steps and cannot be executed concurrently. This will not only save processor resources and communication cost, but will also reduce the mapping cost, since the Spec graph now contains only 9 member nodes instead of the original 15.

2.2 Clustering undirected graphs

The algorithm presented in this section can be used for generating the Spec graph of an undirected task graph (for the allocation problem), as well as the Rep graph of a system graph (undirected). Since the definition of directed task graphs presented in the last section is also applicable to undirected task graphs (with the exception that (t_i, t_j) = (t_j, t_i), for all i, j), in this section we only present the definition of (undirected) system graphs. We then present the algorithm for generating a clustered graph (Spec graph for a task graph, or Rep graph for a system graph) out of such an undirected input graph. A parallel system can be modeled as an undirected system graph G_p(V_p, E_p), where V_p = {p_1, ..., p_N} is a set of processors forming the underlying architecture, and E_p is a set of edges representing the interconnection topology of the parallel system. We assume the connections between adjacent processors of the parallel systems studied here are bi-directional. Therefore, an edge (p_i, p_j) represents that there is a direct connection between processors p_i and p_j.
The speed of processor p_i is denoted by S_i, and the transmission rate over edge (p_i, p_j) is denoted by R_ij. In this paper, we assume that S_i = 1 and R_ij = 1 for 1 ≤ i, j ≤ N, (p_i, p_j) ∈ E_p. This assumption leads to the simple greedy clustering. An extended clustering algorithm which clusters a non-uniform undirected graph is given in [7].

[Figure 2: A task graph (nodes a to o) and the obtained Spec graph (member nodes a to i, with j, k, l embedded into d; m into e; n into g; and o into f).]

To construct a clustered graph (Rep graph or Spec graph) from an undirected input graph, initially every node forms a cluster. This node is represented by p_i in the case of a system graph and by t_i in the case of a task graph. Then clusters which are completely connected are merged to form a new cluster. This is continued until no more merging is possible. Two clusters x and y are connected if x contains a node p_x (or t_x) and y contains a node p_y (or t_y) such that (p_x, p_y) ∈ E_p (or (t_x, t_y) ∈ E_t). Each cluster has a size, which is the number of nodes it contains. If a Rep (or Spec) cluster has size R_i (or S_i) and the sizes of its sub-clusters at the lower level are R_i1, ..., R_ik (or S_i1, ..., S_ik), it is obvious that R_i = R_i1 + ... + R_ik (or S_i = S_i1 + ... + S_ik). The algorithm for clustering undirected graphs is shown in Figure 3.

Clustering-undirected-graphs Algorithm:
  for all nodes p_i (t_i), do make a cluster for p_i (t_i) at clustering level 1
  set the clustering level to 1
  while merging is possible, do
    for all clusters c at the current level, do
      make c into cluster c' at the next level
      delete cluster c from the current level
      for all clusters x in the current level, do
        if x is connected to all sub-clusters of c', then
          merge x into c'
          delete x from the current level
    increment the clustering level by 1

Figure 3: Clustering-undirected-graphs Algorithm.

[Figure 4: An undirected graph (nodes A to H) and its clustering.]

Figure 4 shows an example. The undirected graph shown can represent a system graph; therefore the generated output as shown is a Rep graph. However, if the same input is an undirected task graph for the allocation problem, then the generated output is a Spec graph.
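The merging process above can be sketched compactly. The following is our own simplification (one level of the hierarchy per pass, clusters merged greedily while each newcomer is connected to every sub-cluster already absorbed), not the paper's exact implementation:

```python
def cluster_undirected(nodes, edges):
    """Build a multi-level clustering: at each level, greedily merge
    clusters that are connected to every sub-cluster merged so far."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def connected(a, b):
        # Two clusters are connected if some node of a touches some node of b.
        return any(v in adj[u] for u in a for v in b)

    levels = [[frozenset([n]) for n in nodes]]
    while True:
        current = list(levels[-1])
        nxt = []
        while current:
            subs = [current.pop(0)]          # seed a new next-level cluster
            rest = []
            for x in current:
                if all(connected(x, s) for s in subs):
                    subs.append(x)           # x is connected to all sub-clusters
                else:
                    rest.append(x)
            current = rest
            nxt.append(frozenset().union(*subs))
        if len(nxt) == len(levels[-1]):      # no merging happened: done
            break
        levels.append(nxt)
    return levels

# 4-node ring a-b-c-d-a: adjacent pairs merge first, then the whole ring.
levels = cluster_undirected("abcd", [("a","b"), ("b","c"), ("c","d"), ("d","a")])
print([sorted("".join(sorted(c)) for c in lvl) for lvl in levels])
# [['a', 'b', 'c', 'd'], ['ab', 'cd'], ['abcd']]
```

The cluster sizes at each level satisfy the size invariant from the text: each next-level cluster's size is the sum of the sizes of the clusters merged into it.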

We now analyze the running time of this implementation. For each level, we compare each cluster in that level with the other clusters in the same level and check if they form a clique. Suppose at a certain level of the system graph (undirected task graph) there are m clusters c_1, ..., c_m, with each cluster c_i containing P_i processors (T_i task modules). We have Σ_{i=1..m} P_i = N (Σ_{i=1..m} T_i = M), where N is the number of underlying processors (M is the number of task modules). The time of clustering at this level is dominated by the total number of comparisons made to determine if each cluster is connected to all sub-clusters of another cluster at the next level, which is at most Σ_{i=1..m} Σ_{j=i+1..m} P_i P_j ≤ (Σ_{i=1..m} P_i)² = N² (respectively Σ_{i=1..m} Σ_{j=i+1..m} T_i T_j ≤ M²). The number of levels can be at most N−1 (or M−1). Therefore the total time complexity of this algorithm is O(N³) (or O(M³)).

3 Cluster-M Mapping Algorithm

For a given problem, a high-level machine-independent parallel solution can be presented in the form of a Cluster-M Specification, thus directly representing a Cluster-M Spec graph []. However, a Spec graph can also be generated directly from a given task graph, using one of the algorithms in the last section (the clustering-directed-graphs algorithm is used for directed task graphs in scheduling, and the clustering-undirected-graphs algorithm is used for allocation). On the other hand, given a system graph representing an underlying architecture or organization, a Rep graph can be generated using the algorithm presented in section 2.2. In this section, given a Spec graph and a Rep graph as the input to the mapping module, we present an efficient mapping algorithm which produces a sub-optimal matching of them in O(MN) time. The mapping procedure presented in this paper has a much lower time complexity than the traditional mappings, since it contains a graph matching procedure for which both of the input graphs have been clustered.
In the following, we first present a set of preliminaries and then give a high-level description of the mapping algorithm.

3.1 Preliminaries

First we define the mapping function f_m : V_t → V_p. Following the precedence constraints and the computation and communication requirements of the original task graph, a schedule can be obtained which places each task module t_i on processor f_m(t_i) at the proper time (the earliest possible starting time). Since the edges of both task and system graphs are uniformly weighted, we assume that the communication time of the task graph edge (t_i, t_j) is equal to dist(f_m(t_i), f_m(t_j)), where dist(p_i, p_j) is the shortest distance between processors p_i and p_j. We also assume it takes no time to communicate data within the same processor, i.e., dist(p_i, p_i) = 0. A schedule can be illustrated by a Gantt chart, which consists of a list of all processors and, for each processor, a list of all task modules allocated to that processor, ordered by their execution time []. We define the total execution time of a schedule, T_m, to be the latest finishing computation time of the last scheduled task module on any processor. Obviously, T_m is equal to the total execution time of a given task on a given system. As we consider the shortest execution time of a given task on a system to be the ultimate goal in mapping, we take T_m as our measure of quality for how good a mapping is. However, since T_m can only be calculated once a schedule has been obtained, it is difficult to predict T_m in the process of mapping. Therefore, we shall present another objective function as part of the mapping heuristic to guide the mapping process. This function is described in section 3.2.

3.2 Mapping Algorithm

A detailed description of the Cluster-M mapping algorithm is presented in Figure 5. In the following, we give an overview of the algorithm. Before starting the mapping, we need to compute a reduction factor, denoted by f, which is essential for mapping task graphs having more nodes than the system graphs. The reduction factor f is the ratio of the total size of the Rep clusters over the total size of the Spec clusters. It is used to estimate how many computation nodes will share a processor. The mapping is done recursively at each clustering level, where we find the best matching between Spec clusters and Rep clusters. To map each of the Spec clusters (denoted by S_i) with size S_i, we search for the Rep cluster (denoted by R_j) with the best matched size, i.e., closest to f · S_i. Therefore, we try to minimize the function formulated in Equation (1). If no Rep cluster with a matching size can be found for a Spec cluster, we either merge or split (unmerge) Rep clusters until a matching Rep cluster is found.

    |f_m| = Σ_i |f · S_i − R_{f_m(S_i)}|    (1)

When the mapping at the top level is completed, for each pair of Spec and Rep clusters, the same mapping procedure is continued recursively at a lower level until the mapping is fine-grained down to the processor level. As the numbers of task modules and processors in the original task and system graphs are M and N respectively, the total numbers of all Spec and Rep clusters at all clustering levels are O(M) and O(N) respectively. Thus, the time complexity of sorting all the Spec and Rep clusters is O(M log M + N log N).
In the mapping at each level, the time complexity of finding the best match for each Spec cluster is O(N), since the total number of clusters in the Rep graph is O(N). This includes the time spent splitting a Rep cluster (which is itself a recursive procedure) and inserting the extra part back into the sorted Rep cluster list. Suppose the total number of Spec clusters at level i is K_i. The time complexity of finding the best matches of all K_i Spec clusters at level i is thus O(K_i N). Since the total number of Spec clusters is O(M), i.e., Σ_i K_i = O(M), the total time complexity of this mapping algorithm is O(M log M + N log N + MN) = O(MN).

3.3 Mapping Examples

In section 2, we constructed a Spec graph and a Rep graph from the original task graph and system graph, as shown in Figures 3 and 4. Figure 6 shows the mapping from the obtained Spec graph to the Rep graph, following the mapping algorithm described above. First, we calculate |S| = 9, |R| = 8 and f = 8/9. The Spec cluster of size 5 is mapped onto the Rep cluster of the same size; however, the Spec cluster of

Cluster-M Mapping Algorithm

Input:
  input Spec graph; sort all Spec clusters at each level in descending order of sizes.
  input Rep graph; sort all Rep clusters at each level in descending order of sizes.

Recursive Mapping Procedure: {for all Spec and Rep clusters at the top level}
  calculate f; if f > 1, let f = 1.
  calculate the required size of the Rep cluster matching S_i to be f × |S_i|.
  for each Spec cluster in the top-level sorted list, do begin
    if a Rep cluster of the required size is found, then
      match the Spec cluster to the Rep cluster
      delete the Spec and Rep clusters from the Spec and Rep lists
  for each unmatched Spec cluster, do begin
    if the size of the first Rep cluster > the required size, then begin
      split the Rep cluster into two parts, with one part of the required size
      match the Spec cluster to this part
      insert the other part at the proper position of the sorted Rep cluster list
    else begin
      merge Rep clusters until the sum of their sizes ≥ the required size
      if the sum = the required size, then
        match the Spec cluster to the merged Rep cluster
      else begin
        split the merged Rep cluster into two parts, with one of the required size
        match the Spec cluster to this part
        insert the other part into the sorted Rep list
  for each matched pair of Spec cluster and Rep cluster, do begin
    if the Rep cluster contains only one processor, then
      map all the modules in the Spec cluster onto the processor
    else begin
      go to the sub-clusters of the Spec and Rep clusters (thus they are pushed to the top level)
      call the recursive mapping procedure for these clusters

Figure 5: Mapping Algorithm.

size 4 has to be mapped onto the Rep cluster of size 3, since this is the closest size match. Then the same procedure is applied to the Spec clusters at the lower level. As shown in step 2 of Figure 6, task module a is mapped onto Rep cluster H, which contains a single processor. In step 3, modules b, e, f, g, h and i are mapped onto their corresponding processors. Finally, in step 4, modules c and d are both mapped onto processor F. Since modules j, k and l are embedded in module d (see Figure 3), they are also mapped onto processor F, to which d is mapped. Similarly, modules m, n and o are mapped onto processors D, A and E, respectively. Now all the task modules in the original task graph have been mapped onto their corresponding processors. Figure 7 shows the final schedule obtained from the above mapping by following the data and operational precedence of the task graph. As we can see in the Gantt chart, T_m = 6.

Figure 6: A mapping example.

Figure 7: The obtained mapping result.
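The recursive procedure of Figure 5 can be sketched compactly. In the following Python sketch (ours, simplified), a cluster is a nested list whose leaves are module names, and the merging and splitting of Rep clusters is approximated by slicing a flat processor pool proportionally to sub-cluster sizes; it illustrates the recursive structure rather than reproducing Figure 5 exactly.

```python
def size(c):
    """Number of task modules (leaves) in a nested-list cluster."""
    return sum(size(x) for x in c) if isinstance(c, list) else 1

def leaves(c):
    """Flatten a cluster into its list of module names."""
    return [m for x in c for m in leaves(x)] if isinstance(c, list) else [c]

def map_clusters(spec, procs, mapping):
    """Recursively map a Spec cluster (nested lists; leaves are task modules)
    onto a pool of processors.  Rep-cluster merging/splitting is approximated
    by slicing the pool proportionally to sub-cluster sizes."""
    if len(procs) == 1 or size(spec) == 1:
        for m in leaves(spec):
            mapping[m] = procs[0]        # one processor left: map everything here
        return
    subs = sorted(spec, key=size, reverse=True)   # larger sub-clusters first
    total, start = size(spec), 0
    for k, s in enumerate(subs):
        end = len(procs) if k == len(subs) - 1 \
            else start + round(len(procs) * size(s) / total)
        share = procs[start:end] or [procs[-1]]   # never hand down an empty pool
        map_clusters(s, share, mapping)
        start = max(start, end)

# The 9-module / 8-processor example from the text:
mapping = {}
map_clusters([['a', 'b', 'c', 'd'], ['e', 'f', 'g', 'h', 'i']],
             list('ABCDEFGH'), mapping)
```

On this example the sketch assigns every module a processor, with exactly two modules sharing one processor, mirroring the f = 8/9 reduction.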

Figure 8: Comparison example with Bokhari: task and system graph.

4 Comparison Results

In this section, we present a set of experimental results obtained by comparing our algorithm with other leading techniques. The examples selected in this paper are the same as those presented and experimented with by the authors of the other techniques. The following five criteria are used to evaluate the performance of the algorithms examined: (1) the total time complexity of executing the mapping algorithm, T_c; (2) the total execution time of the generated mappings, T_m; (3) the speedup S_m = T_s / T_m, where T_s is the sequential execution time of the task; (4) the efficiency S_m / N_m, where N_m is the number of processors used; and (5) the actual time of running the mapping algorithm on a certain computer, T_r. In the following, we present our comparison results for both the allocation and scheduling problems.

4.1 Allocating undirected task graphs

The goal of task allocation is to minimize the communication delay between processors and to balance the load among processors. The problem of task allocation arises when specifying the order of executing the task modules is not required. Therefore, the task graph in task allocation is undirected, and the clustering-undirected-graphs algorithm is used to generate the Spec graph in this case. We consider the measure of mapping quality in task allocation to be T_m. We compare our results to Bokhari's mapping (allocation) algorithm [2] using undirected task graphs. Bokhari's algorithm has a running time complexity of O(N^3), while ours is O(MN). Bokhari's algorithm assumes that the computation amount of each task module, the amount of data communication along each task graph edge, the computation speed of each processor, and the data transmission rate along each communication link are all uniform, i.e., equal to 1.
It further assumes that the number of task modules is no greater than the number of processors, so that the mapping can be one-to-one. In this case, a lower bound on T_m is δ + 1, where δ is the degree of the given task graph. In comparing Cluster-M with Bokhari's algorithm, we use the example shown in Figure 8, which has a task

Table 1: Mapping of Bokhari's algorithm and Cluster-M (mapped processor for each task module, with the resulting T_m and T_r).

Table 2: Comparison of Bokhari's algorithm and Cluster-M on randomly generated graphs (T_m against the lower bound, and T_r).

graph and a 6 × 6 finite element machine (FEM) [2]. A Sun SPARC station was used for the experiments. The results are shown in Table 1. Note that the running time of clustering the task graph and system graph by Cluster-M, which is 0.7 second, is not included in T_r, as our clustering is independent of the mapping. However, even if we included it, the running time of Cluster-M would still be many times faster than that of Bokhari's algorithm. The lower bound on T_m as described before is 9, and both Cluster-M and Bokhari's algorithm obtained near-optimal values of T_m. The above example uses the same structured task and system graphs as tried in [2]. We have also tested other randomly generated task and system graphs. Table 2 shows the mapping results and comparisons for a set of randomly generated task and system graphs. Similar results were obtained across the set of random graphs.

4.2 Scheduling directed task graphs

We first compare our algorithm with El-Rewini and Lewis's mapping heuristic (MH) algorithm [9]. The time complexity of MH is O(M^2 N^3), while ours has a O(MN) time complexity. Given a task graph and the system graph of a 3-cube, as shown in Figure 9, the schedule obtained by MH is illustrated by a Gantt chart in Figure 10(a) [9]. Similarly, the Gantt chart of the schedule obtained by Cluster-M mapping is shown in Figure 10(b). An optimal schedule is also shown in Figure 10(c). We can see that both the MH and Cluster-M mappings have produced close to optimal T_m for this example, yet Cluster-M is faster by a factor of O(MN^2). We next compare with Lee and Aggarwal's mapping strategy [16]. Their mapping strategy treats the task graph as a directed graph and differentiates nodes and edges into different computation stages and communication phases, in order to accurately calculate the actual communication cost between two non-adjacent processors. However, it maps the entire task graph onto the system graph without graph contraction or clustering.
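Under the uniform model used throughout these comparisons, the communication cost between two processors is simply their hop distance in the system graph, and T_m follows from an earliest-start simulation of the mapped task graph. A minimal sketch (ours; the data structures are illustrative assumptions, not the paper's implementation):

```python
from collections import deque

def hop_dists(adj):
    """All-pairs BFS hop distances on the unweighted system graph;
    dist(p, p) = 0."""
    dist = {}
    for src in adj:
        d, q = {src: 0}, deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    q.append(v)
        dist[src] = d
    return dist

def schedule_length(order, preds, fm, adj):
    """T_m for mapping fm (module -> processor): unit computation per module,
    data from predecessor s reaches t after dist(fm(s), fm(t)) time units."""
    dist = hop_dists(adj)
    finish = {}
    free = {p: 0 for p in adj}            # next free time slot per processor
    for t in order:                        # 'order' is a topological order
        p = fm[t]
        ready = max((finish[s] + dist[fm[s]][p] for s in preds[t]), default=0)
        finish[t] = max(ready, free[p]) + 1
        free[p] = finish[t]
    return max(finish.values())

# Two parallel modules feeding a third, on a two-processor system p0 - p1:
adj = {0: [1], 1: [0]}
tm = schedule_length(['a', 'b', 'c'],
                     {'a': [], 'b': [], 'c': ['a', 'b']},
                     {'a': 0, 'b': 1, 'c': 0}, adj)
print(tm)   # -> 3
```

In the small example, module c must wait one extra time unit for b's result to cross the link, so T_m = 3 rather than 2.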
Also, it assumes that the number of nodes in the task graph is no greater than that of the system graph. The time complexity of Lee and Aggarwal's algorithm is O(N^3), while ours is O(MN) (if M = N, then ours is O(N^2)). Given a task graph and the system graph of a 4-cube, as shown in Figure 11, the comparison of the mapping results is shown in Figure 12. In this example, A_i = 1 for 0 ≤ i ≤ 15. The task graph for the second comparison example with Lee and Aggarwal is shown in Figure 13, where A_i = 1 for 0 ≤ i ≤ 7. The system graph for this problem is a 3-cube. The mapping results are shown in Figure 14. Lee and Aggarwal's mapping strategy was later extended by Chaudhary and Aggarwal for mapping larger task graphs onto smaller system graphs [6]. The time complexity of this algorithm is O(M^4). Next, we compare our mapping results with Chaudhary and Aggarwal's. We present two examples. In the first example, the task graph of Figure 11 is mapped onto a 3-cube. The mapping results for this example are shown in Figure 15. In the second example, the task graph of Figure 13 is mapped onto a 2-cube. The mapping results for this example are shown in Figure 16. As we can see from all the examples in this section, Cluster-M mapping has a superior running time, and the results obtained are similar to or better than those from the other algorithms.
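The figures of merit quoted in the comparisons, criteria (2) to (4), can be computed directly from any schedule. A small sketch (ours; the schedule representation {module: (processor, start, finish)} is an assumption for illustration):

```python
def evaluate(schedule, t_seq):
    """Return (T_m, S_m, efficiency) for a schedule given as
    {module: (processor, start, finish)}, with sequential time T_s = t_seq."""
    t_m = max(fin for _, _, fin in schedule.values())      # criterion (2)
    s_m = t_seq / t_m                                      # criterion (3): speedup
    n_m = len({proc for proc, _, _ in schedule.values()})  # processors used
    return t_m, s_m, s_m / n_m                             # criterion (4): efficiency

# Three unit-time modules, two of them running in parallel:
sched = {'a': ('P0', 0, 1), 'b': ('P1', 0, 1), 'c': ('P0', 2, 3)}
print(evaluate(sched, 3))   # -> (3, 1.0, 0.5)
```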

Figure 9: Comparison example with MH: task and system graph (the figure also defines the computation amounts A_i of the task nodes).

Figure 10: Comparison example with MH: mapping results ((a) MH, (b) Cluster-M, (c) optimal).

Figure 11: Comparison example with Lee and Aggarwal: task and system graph.

Figure 12: Comparison example with Lee and Aggarwal: mapping results ((a) Lee and Aggarwal, (b) Cluster-M, (c) optimal).

Figure 13: Comparison example with Lee and Aggarwal: task and system graph.

Figure 14: Comparison example with Lee and Aggarwal: mapping results ((a) Lee and Aggarwal, (b) Cluster-M, (c) optimal).

Figure 15: Comparison example with Chaudhary and Aggarwal: mapping results ((a) Chaudhary and Aggarwal, (b) Cluster-M, (c) optimal).

Figure 16: Comparison example with Chaudhary and Aggarwal: mapping results ((a) Chaudhary and Aggarwal, (b) Cluster-M, (c) optimal).

5 Conclusion

In this paper, we have presented and implemented a generic algorithm for mapping portable parallel algorithms onto various multiprocessor systems or computing organizations. The input to the mapping algorithm is a Spec graph, which corresponds to a clustered (layered) task graph, and a Rep graph, which corresponds to a clustered (partitioned) system graph. Unlike other mapping approaches, which cluster only the task graph, our Cluster-M based mapping algorithm requires a clustering of the system graph as well as the task graph before executing the mapping process. The clustering is done only once for a given task graph (system graph), independent of any system graphs (task graphs). It is a machine-independent (application-independent) clustering and therefore is not repeated for different mappings. The complexity of our mapping algorithm is O(MN), where M is the number of task modules and N is the number of processors. We have presented experimental results comparing our implemented algorithm with others. Compared to other leading techniques, Cluster-M mapping finds better or similar results in much faster time and uses an equal or smaller number of processors. Furthermore, this generic algorithm is suitable for both the allocation problem and the scheduling problem. This work has been extended to the case where both the task graph and the system graph are non-uniform [7].

References

[1] F. Berman and L. Snyder. On mapping parallel algorithms into parallel architectures. Journal of Parallel and Distributed Computing, 4:439-458, 1987.

[2] S. H. Bokhari. On the mapping problem. IEEE Trans. on Computers, C-30(3):207-214, March 1981.

[3] S. H. Bokhari. A shortest tree algorithm for optimal assignments across space and time in a distributed processor system. IEEE Trans. on Software Engineering, SE-7(6):583-589, November 1981.

[4] S. H. Bokhari. Partitioning problems in parallel, pipelined, and distributed computing. IEEE Trans. on Computers, 37(1):48-57, January 1988.

[5] T. L. Casavant and J. G. Kuhl. A taxonomy of scheduling in general-purpose distributed computing systems. IEEE Trans. on Software Engineering, 14(2):141-154, February 1988.

[6] V. Chaudhary and J. K. Aggarwal. A generalized scheme for mapping parallel algorithms. IEEE Trans. on Parallel and Distributed Systems, 4(3):328-346, March 1993.

[7] S. Chen, M. M. Eshaghian, and Y. Wu. Mapping arbitrary non-uniform task graphs onto arbitrary non-uniform system graphs. In Proc. International Conference on Parallel Processing, August 1995.

[8] E. G. Coffman and R. L. Graham. Optimal scheduling for two-processor systems. Acta Informatica, 1:200-213, 1972.

[9] H. El-Rewini and T. G. Lewis. Scheduling parallel program tasks onto arbitrary target machines. Journal of Parallel and Distributed Computing, 9:138-153, 1990.

[10] H. El-Rewini, T. G. Lewis, and H. H. Ali. Task Scheduling in Parallel and Distributed Systems. Prentice Hall, 1994.

[11] F. Ercal, J. Ramanujam, and P. Sadayappan. Task allocation onto a hypercube by recursive mincut bipartitioning. Journal of Parallel and Distributed Computing, 10:35-44, 1990.

[12] M. M. Eshaghian and M. E. Shaaban. Cluster-M parallel programming paradigm. International Journal of High Speed Computing, 6(2):287-309, June 1994.

[13] D. Fernandez-Baca. Allocating modules to processors in a distributed system. IEEE Trans. on Software Engineering, 15(11):1427-1436, November 1989.

[14] A. Gerasoulis and T. Yang. A comparison of clustering heuristics for scheduling directed acyclic graphs on multiprocessors. Journal of Parallel and Distributed Computing, 16:276-291, 1992.

[15] S. J. Kim and J. C. Browne. A general approach to mapping of parallel computation upon multiprocessor architectures.
In Proc. International Conference on Parallel Processing, 1988.

[16] S. Lee and J. K. Aggarwal. A mapping strategy for parallel processing. IEEE Trans. on Computers, C-36(4):433-442, April 1987.

[17] V. M. Lo. Heuristic algorithms for task assignment in distributed systems. IEEE Trans. on Computers, 37(11):1384-1397, November 1988.

[18] V. M. Lo, S. Rajopadhye, S. Gupta, D. Keldsen, M. A. Mohamed, and J. A. Telle. OREGAMI: Software tools for mapping parallel computations to parallel architectures. In Proc. International Conference on Parallel Processing, 1990.

[19] C. McCreary and H. Gill. Automatic determination of grain size for efficient parallel processing. Communications of the ACM, 32(9):1073-1078, September 1989.

[20] R. Ponnusamy, N. Mansour, A. Choudhary, and G. C. Fox. Mapping realistic data sets on parallel computers. In Proc. 7th International Parallel Processing Symposium, April 1993.

[21] P. Sadayappan, F. Ercal, and J. Ramanujam. Cluster partitioning approaches to mapping parallel programs onto a hypercube. Parallel Computing, 13:1-16, 1990.

[22] V. Sarkar. Partitioning and Scheduling Parallel Programs for Execution on Multiprocessors. MIT Press, 1989.

[23] C. Shen and W. Tsai. A graph matching approach to optimal task assignment in distributed computing systems using a minimax criterion. IEEE Trans. on Computers, C-34(3):197-203, March 1985.

[24] H. S. Stone. Multiprocessor scheduling with the aid of network flow algorithms. IEEE Trans. on Software Engineering, SE-3(1):85-93, January 1977.

[25] M. Y. Wu and D. Gajski. Hypertool: A programming aid for message-passing systems. IEEE Trans. on Parallel and Distributed Systems, 1(3):330-343, 1990.

[26] T. Yang and A. Gerasoulis. DSC: Scheduling parallel tasks on an unbounded number of processors. IEEE Trans. on Parallel and Distributed Systems, 5(9):951-967, September 1994.