A static mapping heuristics to map parallel applications to heterogeneous computing systems


CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2005; 17. Published online 24 June 2005 in Wiley InterScience. DOI: 10.1002/cpe.902

A static mapping heuristics to map parallel applications to heterogeneous computing systems

Ranieri Baraglia 1, Renato Ferrini 1 and Pierluigi Ritrovato 2

1 High Performance Computing Laboratory, Information Science and Technologies Institute (ISTI), Italian National Research Council (CNR), Via G. Moruzzi 1, Pisa, Italy
2 Centro di Ricerca in Matematica Pura ed Applicata, Università degli Studi di Salerno, Fisciano (SA), Italy

Correspondence to: Ranieri Baraglia, High Performance Computing Laboratory, Information Science and Technologies Institute (ISTI), Italian National Research Council (CNR), Via G. Moruzzi 1, Pisa, Italy. E-mail: ranieri.baraglia@cnuce.cnr.it

Received 24 March 2003; Revised 17 March 2004; Accepted 17 March 2004
Copyright © 2005 John Wiley & Sons, Ltd.

SUMMARY

In order to minimize the execution time of a parallel application running on a heterogeneous distributed computing system, an appropriate mapping scheme is needed to allocate the application tasks to the processors. The general problem of mapping tasks to machines is a well-known NP-hard problem, and several heuristics have been proposed to approximate its optimal solution. In this paper we propose a static graph-based mapping algorithm, called Heterogeneous Multi-phase Mapping (HMM), which permits a suboptimal mapping of a parallel application onto a heterogeneous distributed computing system by using a local search technique together with the tabu search meta-heuristic. HMM allocates parallel tasks by exploiting the information embedded in the parallelism forms used to implement an application, and by considering an affinity parameter that identifies which machine in the heterogeneous computing system is most suitable to execute a task. We compare HMM with some leading techniques and with an exhaustive mapping algorithm. We also give an example of the mapping of two real applications using HMM. Experimental results show that HMM performs well, demonstrating the applicability of our approach. Copyright © 2005 John Wiley & Sons, Ltd.

KEY WORDS: parallel processing; static mapping; task mapping; performance evaluation

1. INTRODUCTION

Heterogeneous computing (HC) is a set of techniques enabling the use of different interconnected machines that provide a variety of computational capabilities to execute a number of application tasks with different requirements [1,2].

There are many types of heterogeneous systems [3]. In this paper we refer to a HC system as a heterogeneous suite of independent machines connected by high-speed networks and used as a single HC distributed system [1]. This kind of HC system is also known as a Metacomputer [4] or as a Grid [5]. A HC system requires compile- and run-time services supporting the execution of applications. Computationally intensive parallel and distributed applications made up of coarse-grained, loosely coupled components that require the use of different computational types (i.e. sequential, SIMD, MIMD, etc.) are the best candidates to exploit the full potential of a HC system. In order to achieve high performance, the critical aspects of HC systems are to efficiently assign the application tasks to the most suitable machines and to determine the execution order of the tasks on each selected machine. The assignment-scheduling process is called mapping. The task-mapping process has been shown to be NP-complete in its general form as well as in some restricted forms [6].

Mapping algorithms are classified as static or dynamic [6]. In the former the mapping decisions are taken before executing an application and remain unchanged until the end of the execution. In the latter the mapping decisions are taken while the program is running. Since static mapping does not add overhead to the execution time of the mapped application, more complex mapping solutions than the dynamic ones can be adopted. When the characteristics of a parallel application, such as the task computational cost, the amount of data exchanged among tasks, and the task dependencies, are known before the application execution, static mapping can be exploited.

In the general form of static mapping a parallel application is represented by a directed acyclic graph (DAG), where the nodes represent application tasks and the edges represent data dependencies between two tasks. Each node is associated with a weight representing the computational cost of the corresponding task, and the weight associated with each edge represents the amount of data exchanged between the two interconnected tasks [7]. The goal of the DAG mapping process is to minimize the overall application makespan by appropriately allocating the tasks to the processors and ordering the task execution sequence in such a manner that the task precedence constraints are preserved. Despite the number of scheduling heuristics proposed in the past for both homogeneous and heterogeneous distributed systems, the mapping problem remains an open research problem.

In this paper we propose Heterogeneous Multi-phase Mapping (HMM), a static general mapping heuristics to map parallel applications represented by DAGs onto a metacomputer so as to minimize the application completion time. HMM takes mapping decisions by exploiting the information embedded in the parallelism forms, or skeletons (i.e. farm, pipeline, geometric, etc.), used to implement an application, and by considering an affinity parameter identifying which machine in the HC system is most suitable for executing a task [1,3,8]. It also considers the task computation cost, the processor computational power and workload (i.e. the workload induced by the application mapped), and the amount of data to be transferred between tasks running on different processors.
Moreover, HMM uses a local search technique together with the tabu search meta-heuristic [9] in order to improve the quality of the HMM initial mapping solution.

Most parallel programs are designed by means of parallelism forms that can be used by adopting models allowing arbitrary computation structures (unrestricted programming model) [10,11] and also models restricting the forms in which computation may be expressed (restricted programming model) [12,13].

Various coordination languages [14-17] have been proposed in which sets of parallel constructs (skeletons) are used as program composition forms. Parallelism forms allow the implementation of parallel computations in which the communication patterns are structured and well defined, and a considerable amount of communication is generally concentrated within a parallelism form. Parallel applications suited for HC are made up of large-grained, loosely coupled components that can be either sequential or parallel. When a parallel component has been designed by using parallelism forms, the tasks inside a parallelism form can be seen as naturally grouped in a cluster that can be allocated on processors of the same machine to reduce the task communication costs. Each cluster can embed one or more nested parallelism forms to be handled in a top-down manner. This allows us to find the size of the application grain that better exploits the computation and communication characteristics of the target HC platform. Exploiting the clustering induced by the parallelism forms permits us to obtain good schedule quality while reducing the heuristics execution times with respect to other solutions, such as the cluster-based ones.

This paper is organized as follows. Section 2 describes some related work. Section 3 gives a description of the problem, and Section 4 describes our algorithm. In Section 5 the algorithm time complexity is evaluated. Section 6 outlines and evaluates the experimental tests. Finally, we summarize our work in Section 7.

2. RELATED WORK

In the past a number of static graph-based mapping algorithms for heterogeneous computing environments have been proposed [3,18-30] and evaluated [31] to approximate the optimal mapping solution.

In [3], Eshaghian et al. proposed the Cluster-M and the Augmented Cluster-M mapping algorithms based on the Cluster-M programming model. Cluster-M uses a directed graph and an undirected graph to represent a meta-application and a metacomputer, respectively. The directed graph is horizontally and vertically partitioned. Horizontal partitioning builds layers (clusters) of tasks that can be executed in parallel. Vertical partitioning allows consecutive layers to be merged or embedded in order to minimize communication costs. The undirected graph, represented in a multilayered format, is such that each layer includes a set (cluster) of completely connected processing units. Clusters of communication-intensive tasks are then mapped onto clusters of communication-efficient processing units by running Cluster-M. The latter presents a time complexity equal to $O(MP)$, where $M$ is the number of tasks belonging to an application graph and $P = \max(M, N)$, with $N$ the number of processors belonging to a metacomputer graph. Augmented Cluster-M is an extension of Cluster-M. As it treats the task computational type-machine type affinity, which drives the mapping of a task onto a machine with the same computation type, it is more suitable for HC. The major drawback of this solution is the need to re-compute the clustering of the system graph when the metacomputer configuration changes. Metacomputers are dynamic computing environments in which the availability of computational units cannot be established a priori. Moreover, the machine and network workload can lead to different mappings for different time requests of the same application execution.

In [19], Lo proposed a Max-Flow/Min-Cut algorithm to find a mapping of task modules on heterogeneous processors as an extension of Stone's model [32]. The algorithm combines recursive invocations of the Max-Flow/Min-Cut algorithm with a greedy algorithm to find a suboptimal assignment of task modules on heterogeneous processors. It has a complexity equal to $O(M^2 N E_p \log M)$, where $M$, $N$ and $E_p$ are the number of parallel tasks, processors and communication links among processors, respectively. This algorithm presents two major drawbacks: high time complexity and non-treatment of task data dependencies.

In [18], Shen and Tsai proposed the A* Searching algorithm, which exploits a graph matching model called weak homomorphism to map a task graph to a system graph. The search for the optimal weak homomorphism corresponds to the task assignment with minimum task turnaround time. It is formulated as a state-space search problem solved using the A* algorithm [33]. The latter is a branch and bound algorithm and, therefore, due to its time complexity, it is suitable only for small problems.

In [34] the Levelized-Min Time (LMT) algorithm, based on list scheduling heuristics, was proposed. The algorithm is structured in two phases. In the first phase it groups the tasks that can be executed in parallel, and in the second phase each task is assigned to the processor that minimizes the completion time of the whole application. Within the same group, the task with the highest computation cost has the highest selection priority. The time complexity of LMT is $O(n^2 p^2)$, where $n$ and $p$ are the number of tasks and the number of processors, respectively.

In [29], Topcuoglu et al. proposed the Heterogeneous Earliest-Finish-Time (HEFT) and Critical-Path-on-a-Processor (CPOP) algorithms, two list scheduling-based algorithms to map parallel applications, represented by DAGs, on a bounded number of heterogeneous processors. The algorithms are structured in two phases. In the former, HEFT computes the upward rank value of each task, while in the latter it selects the task with the highest upward rank value and assigns the selected task to the processor which minimizes its earliest finish time. On the other hand, in the first phase CPOP computes the upward and downward rank values of each task, and uses the summation of the upward and downward rank values to prioritize the tasks on the critical path. The algorithms have time complexity equal to $O(e \cdot p)$ for $e$ graph edges and $p$ processors. HEFT time complexity is equal to $O(n^2 p)$ ($n$ is the number of tasks) for dense application graphs, where the number of edges is proportional to $n^2$. Even if both algorithms are able to find good scheduling solutions with fast execution times, HEFT finds better scheduling solutions than CPOP.
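For reference, the upward and downward rank values on which HEFT and CPOP rely are usually defined as follows (this is the standard formulation of [29], restated here for the reader's convenience; $\overline{w_i}$ and $\overline{c_{i,j}}$ denote the computation cost of task $n_i$ and the communication cost of edge $(i,j)$ averaged over the processors):

$$\mathrm{rank}_u(n_i) = \overline{w_i} + \max_{n_j \in \mathrm{succ}(n_i)} \left( \overline{c_{i,j}} + \mathrm{rank}_u(n_j) \right), \qquad
\mathrm{rank}_d(n_i) = \max_{n_j \in \mathrm{pred}(n_i)} \left( \mathrm{rank}_d(n_j) + \overline{w_j} + \overline{c_{j,i}} \right)$$

with $\mathrm{rank}_u$ of the exit task equal to its average computation cost and $\mathrm{rank}_d$ of the entry task equal to zero.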
3. PROBLEM DESCRIPTION

The proposed heuristics assumes that the parallel application is decomposed into multiple tasks whose data dependencies are known, and that the parallel application is modeled by a DAG. It is also assumed that the task computation and communication costs are known before the application execution, and that task execution is not preemptive. Task computation and communication costs are usually obtained by estimation, using profiling information of operations such as numerical operations, memory access operations and message-passing primitives [35]. Assumptions like these are typically made in this type of study [3,24,29].

The DAG representing an application is denoted as $TG = (N, E)$, where $N$ is a set of nodes and $E$ is a set of directed edges. A node represents a computational task that can be sequential or parallel, called atomic and hierarchical, respectively.

Figure 1. DAG representing a parallel application.

Atomic nodes are indivisible units of execution and have an associated weight representing the estimated task computational cost. Hierarchical nodes are sets of atomic and/or hierarchical nodes structured according to the parallelism forms used to implement a parallel program. Parallelism forms can be nested one inside the other. Hierarchical nodes are represented with a DAG as well, whose structure respects the order in which the parallelism forms are used in a parallel task. From the point of view of task mapping, a hierarchical node is considered to be atomic. In Figure 1, an example of a DAG representing a parallel application is shown.

The application DAG is considered to be structured in $S$ layers that represent the depth of the nested levels. In the first layer it consists of only one hierarchical node, which is considered to be atomic, and in the next layers it may consist of atomic and/or hierarchical nodes represented with a DAG denoted as $TG_s$, with $1 \le s \le S$. In the last layer all nodes are atomic. The introduction of the layer concept allows us to find the layer whose mapping guarantees the shortest execution times of the nodes it contains. Although the problem partitioning is known a priori in the static mapping model, the elaboration of the layers in top-down order allows us to find the size of the grain that better exploits the computation and communication characteristics of the target HC platform.

A generic node, atomic or hierarchical, belonging to a graph $TG_s$ is identified as $n_i^s$, with $1 \le i \le N_s$ and $N_s$ the number of nodes at layer $s$. An edge $e_{i,j}$ represents a data dependence between nodes $n_i^s$ and $n_j^s$, with $i \neq j$, and it has an associated weight specifying the amount of data exchanged between the nodes.
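To fix ideas, the task graph just defined can be captured with a very small data structure. The following Python sketch is purely illustrative and is not part of HMM: the class and field names (Node, TaskGraph, comp_cost, subgraph, etc.) are our own assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class Node:
    """Atomic or hierarchical node of the application graph TG (illustrative)."""
    name: str
    comp_cost: float                          # estimated computational cost C_n
    model: str = "sequential"                 # computational model: "sequential", "SIMD", "MIMD", ...
    memory: float = 0.0                       # memory required
    software: str = ""                        # software required
    procs_required: int = 1                   # processors needed (meaningful for hierarchical nodes)
    subgraph: Optional["TaskGraph"] = None    # nested DAG of a hierarchical node (next layer)

    def is_hierarchical(self) -> bool:
        return self.subgraph is not None

@dataclass
class TaskGraph:
    """TG_s: a set of nodes plus weighted edges e_{i,j} (amount of data exchanged)."""
    nodes: List[Node] = field(default_factory=list)
    edges: Dict[Tuple[str, str], float] = field(default_factory=dict)   # (parent, child) -> data volume

    def parents(self, node: Node) -> List[Node]:
        """Backward star BS(n): the nodes on which `node` depends."""
        names = {p for (p, c) in self.edges if c == node.name}
        return [n for n in self.nodes if n.name in names]
```

A hierarchical node simply carries the nested DAG of the next layer, so expanding a layer amounts to replacing such nodes with their subgraph.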

Figure 2. Layers inside a TG graph representing a parallel application.

In Figure 2, an example of a three-stage expansion of a hierarchical node is given. The first layer consists of a hierarchical node; the second layer is represented by a task graph of four nodes, one atomic and three hierarchical (a pipeline and two task farms); and in the last layer the task graph includes 14 atomic nodes.

In order to reduce the mapping complexity, a $TG_s$ is horizontally divided into $F$ phases, which are executed sequentially (top-down) one at a time. The nodes in the first phase do not have parents, while the nodes in a phase are those with at least one parent among the nodes in a previous phase. By structuring $TG_s$ into phases, the precedence relations among its nodes reflect the execution sequence of the phases in the application graph, and the mapping problem is simplified because HMM finds the initial suboptimal mapping solution by operating on each phase separately. The initial problem is thus decomposed into more manageable subproblems. Since the phases are linked by the edges, to map the $h$th phase HMM considers the data exchanged between the nodes belonging to phase $h$ and the nodes of a phase $k$ with $k < h$. The nodes in a phase can be executed in parallel, and a node in a phase is executed when all its predecessors have been completed and it has received all the data needed for its execution.

A HC system is represented by a fully connected undirected graph denoted as $MG$. Each node of $MG$ represents a machine of the HC system, denoted as $m$ with $1 \le m \le M$, where $M$ is the number of machines in the HC system. Each node is associated with a number specifying the computational speed of the machine, and each edge denotes a link between two machines and has an associated value which specifies its communication bandwidth. A vector $V_m$, with size equal to the number of processors $P_m$ of $m$, is associated with each machine in order to find the processors available on a machine $m$ at any time. The $u$th entry of this vector contains the time at which the $u$th processor of $m$ will be available to execute any new work. HMM updates the vector $V_m$ on the basis of the workload due to the application tasks allocated on each processor of machine $m$.
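The machine graph and the availability vector $V_m$ can be sketched in the same illustrative style (again, Machine and its fields are hypothetical names of ours, not HMM's code):

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Machine:
    """Node of the system graph MG (illustrative; field names are our assumptions)."""
    name: str
    speed: float                     # computational power S_m (assumed equal for all its processors)
    arch_class: str                  # architectural class, e.g. "MIMD", "SIMD", "sequential"
    memory: float                    # memory available
    software: Set[str] = field(default_factory=set)       # software available
    available: List[float] = field(default_factory=list)  # V_m: per-processor availability times
    workload: float = 0.0            # W_m: workload induced by the tasks already mapped

    @property
    def procs_available(self) -> int:
        return len(self.available)

    def earliest_processor(self):
        """Index k and ready time of the processor that frees up first (the waiting time stored in V_m)."""
        k = min(range(len(self.available)), key=self.available.__getitem__)
        return k, self.available[k]

    def assign(self, k: int, finish_time: float, cost: float) -> None:
        """Update V_m and W_m after a task has been mapped onto processor p_k^m."""
        self.available[k] = finish_time
        self.workload += cost
```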

To enable us to compute the affinity parameter, it is useful to characterize both the applications and the HC systems by relating them to some real-world parameters. Several different parameters can be used to characterize application tasks and machines so as to select the HC system machine most suitable to execute a task. In this work, without loss of generality, we associate with each atomic node the computational model adopted for its implementation (e.g. SIMD, MIMD, sequential), and the amount of memory and the software required for its execution. The same parameters are used to characterize a hierarchical node. A hierarchical node is also associated with the number of processors needed to run its atomic nodes in parallel. We assume that these parameters can be specified by the programmer and/or provided by the programming tools used to implement a parallel application. Each $MG$ node is associated with its architectural class, the software and memory available, the number of processors, and the processor power.

HMM adopts the following strategy to carry out a suboptimal solution. Starting from node $n_1^1$ (the node at layer one), HMM finds the most suitable machine for its execution on the basis of the affinity parameter value, the node computational cost, and the computational power of the machine. HMM estimates the execution time of the task graph $TG_1$, which consists of node $n_1^1$, on the selected machine, and then $TG_1$ is expanded in the next layer, producing the task graph $TG_2$. To find the best mapping of the nodes in the second layer, HMM divides $TG_2$ into phases and, starting from the first, it estimates the execution time of each $TG_2$ node. To do this, the nodes inside a phase are sorted in descending order with respect to their computational cost and, starting from the heaviest, the execution time of each node is estimated on the machine selected on the basis of machine power and workload, node computational cost, amount of data exchanged between a node and its parents, and affinity parameter value. When no machine is found to allocate a node, it is flagged as not allocated, and it will be allocated at the first subsequent layer for which a suitable machine is available. Then, the estimated execution time of $TG_2$ is compared with the one found for $TG_1$. The shortest one leads to the current initial mapping solution. This process is repeated for all the remaining $S - 2$ layers of the application graph, to find the $TG_s$ leading to the initial mapping solution with the shortest application execution time. Then, to refine the initial mapping solution, a local search algorithm is adopted together with the tabu search meta-heuristic. The suboptimal mapping solution is found by keeping the task communication costs as low as possible, and by trying to make a good trade-off between the communication costs and the workload balancing among the machines of the target HC platform.

4. ALGORITHM DESCRIPTION

HMM takes in input the TG and MG graphs, which represent a parallel application and a HC system, respectively. The algorithm is structured in two phases. In the former an initial mapping solution is carried out. In the latter, to improve the quality of the initial solution, a local search function is applied to the atomic nodes in the critical path of the $TG_s$ corresponding to the initial solution. Figure 3 shows the pseudo-code of the HMM first phase, which operates according to the following steps.

Step 1 computes the affinity value Aff of all the $n_i^s$ nodes belonging to each layer with respect to all the $M$ machines. The affinity value of a node is computed by considering both the required computational resources and the adopted computational model.
When a machine $m$ does not provide the memory and software requirements, or the computational model, needed to run a node $n_i^s$, the affinity parameter is set equal to zero. Otherwise, it assumes a value in the range $[0, 1]$, computed as a function of the number of processors available on $m$ and the number of processors required to run $n_i^s$.
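As an illustration only, the affinity rule just described (and spelled out in the Set affinity pseudo-code of Figure 4 below) could be coded as follows, reusing the hypothetical Node and Machine sketches given above:

```python
def affinity(node, machine) -> float:
    """Affinity of node n_i^s with machine m (illustrative sketch of the rule in Figure 4).

    Returns 0 when the machine cannot satisfy the node's requirements; otherwise a
    value in [0, 1] that decreases when fewer processors are available than required.
    """
    if node.memory > machine.memory:            # not enough memory
        return 0.0
    if node.software and node.software not in machine.software:   # required software missing
        return 0.0
    if node.model != machine.arch_class:        # computational model mismatch
        return 0.0
    if node.procs_required > machine.procs_available:
        return machine.procs_available / node.procs_required
    return 1.0
```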

HMM_Initial_Solution(TG)
Begin
Step 1:  SET_AFFINITY(TG)
         INITIALIZE S_init
         LOOP on the layers s (1 <= s <= S)
           ALLOCATE(TG_s)
           Begin
Step 2:      LOOP on the nodes n_i^s (1 <= i <= N_s)
               DIVIDE the TG_s graph into F phases
             LOOP on the phases k (1 <= k <= F)
Step 3:        SORT the nodes n_i^s within phase k in descending order with respect to C_{n_i^s}
               LOOP on the ordered nodes n_i^s of phase k
Step 4:          LOOP on the machines m (1 <= m <= M)
                   FIND the best machine m to execute the node n_i^s
Step 5:          IF at least one machine m is found THEN
                   COMPUTE the completion time T_e of the node n_i^s on machine m
                 ELSE
                   FLAG the node n_i^s as not allocated
           End
Step 6:    LOOP on the nodes n_i^s (n_i^s in {not allocated} nodes)
             WHILE node n_i^s is flagged not allocated
               s = s + 1
               ALLOCATE(n_i^s)
Step 7:    IF completion time of the TG_s solution (S_s) < completion time of S_init THEN S_init = S_s
End

Figure 3. Pseudo-code of the first phase of HMM.

HMM ends if at least one atomic node has an affinity equal to zero with all the HC system machines. Figure 4 shows the pseudo-code of the Set affinity function used to compute the affinity parameter.

Step 2 divides all the nodes of a $TG_s$ into $F$ phases. The nodes in the first phase do not have parents, while the nodes belonging to a phase $i$, with $1 < i \le F$, are those that have at least one parent among the nodes in a phase $j$ with $j < i$. In Figure 5 an example of the subdivision into phases of a hierarchical node at layer 2 is shown.

Step 3 arranges the nodes in a phase in descending order with respect to their computational cost. This ordering is used in the next step to select the computationally heavier nodes first. It improves the probability of the computationally heavier hierarchical nodes being allocated on the same machine.
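Steps 2 and 3 can be pictured with the following illustrative Python fragment, which builds the phases level by level from the parent relation (one natural reading of the rule above) and then sorts each phase by decreasing computational cost; it reuses the hypothetical TaskGraph sketch of Section 3 and is not the authors' code.

```python
def divide_into_phases(tg):
    """Step 2 (illustrative): split TG_s into phases.

    Phase 1 holds the nodes without parents; a node enters the first phase that
    follows all of the phases containing its parents.
    """
    remaining = {n.name for n in tg.nodes}
    placed = set()
    phases = []
    while remaining:
        phase = [n for n in tg.nodes
                 if n.name in remaining
                 and all(p.name in placed for p in tg.parents(n))]
        if not phase:
            raise ValueError("cycle detected: TG_s is not a DAG")
        phases.append(phase)
        placed.update(n.name for n in phase)
        remaining.difference_update(n.name for n in phase)
    return phases

def sort_phase(phase):
    """Step 3 (illustrative): heaviest nodes first, so they are allocated first."""
    return sorted(phase, key=lambda n: n.comp_cost, reverse=True)
```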

SET_AFFINITY(TG)
Begin
  LOOP on the layers s (1 <= s <= S)
    LOOP on the nodes n_i^s (1 <= i <= N_s)
      LOOP on the machines m (1 <= m <= M)
        Aff(n_i^s, m) = 0
        IF memory required by n_i^s <= memory size of m THEN
          IF software required by n_i^s = software available on m THEN
            IF computational model required by n_i^s = architectural class of m THEN
              Aff(n_i^s, m) = 1
              IF processors required by n_i^s > processors available on m THEN
                Aff(n_i^s, m) = processors available on m / processors required by n_i^s
End

Figure 4. Pseudo-code of the Set affinity function.

Figure 5. Example of a subdivision into phases of a hierarchical node at layer 2.

Step 4 selects, according to the ordering given by the previous step, a node $n_i^s$ which has not yet been analyzed and finds a machine $m$ that can execute it. In order to select a machine, expression (1) is applied to all the $M_{\mathit{Aff}}$ machines with which the selected node has a value of the affinity parameter greater than zero:

$$\min_{m} \left( \frac{W_m + C_{n_i^s}}{\mathit{Aff}(n_i^s, m)\, S_m} + \max_{j} \left\{ \frac{D_c(n_i^s, n_j^s)}{B_{m,n}} \right\} \right), \qquad m \in \{1, 2, \ldots, M_{\mathit{Aff}}\},\; n \in \{1, 2, \ldots, M\},\; j \ \text{with}\ n_j^s \in BS(n_i^s) \tag{1}$$

where $W_m$ is the workload of $m$ due to the previously allocated tasks, $S_m$ the computational power of $m$, and $C_{n_i^s}$ the computational cost of $n_i^s$. $D_c$ is the amount of data exchanged between node $n_i^s$ and a node $n_j^s$ belonging to its backward star $BS(n_i^s)$ (i.e. its parents). $D_c$ is zero if $n_i^s$ and the parent with which it has the highest communication cost can be allocated on the same machine. $B_{m,n}$ is the communication bandwidth of the link connecting the machine $m$ on which $n_i^s$ will be allocated and the machine $n$ on which $n_j^s$ has been allocated.
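Read operationally, expression (1) amounts to something like the following illustrative sketch (reusing the hypothetical Node/Machine classes and the affinity function above, plus an assumed bandwidth(m, n) helper); the $W_m/S_m$ tie-break discussed in the next paragraph is also included:

```python
def select_machine(node, machines, tg, alloc, bandwidth):
    """Step 4 (illustrative sketch of expression (1)).

    tg:        the TaskGraph containing `node` (for edges and parents)
    alloc:     dict node name -> Machine on which that node has been allocated
    bandwidth: function (machine_a, machine_b) -> link bandwidth B_{m,n}
    """
    best, best_cost = None, float("inf")
    for m in machines:
        aff = affinity(node, m)
        if aff == 0.0:
            continue                               # only the M_Aff machines are considered
        compute = (m.workload + node.comp_cost) / (aff * m.speed)
        comm = 0.0
        for parent in tg.parents(node):            # backward star BS(n_i^s)
            pm = alloc.get(parent.name)
            if pm is None or pm is m:
                continue                           # simplified: D_c = 0 when the parent is co-located
            d_c = tg.edges.get((parent.name, node.name), 0.0)
            comm = max(comm, d_c / bandwidth(m, pm))
        cost = compute + comm
        # tie-break: the machine with the smallest W_m / S_m ratio wins
        if cost < best_cost or (cost == best_cost and best is not None
                                and m.workload / m.speed < best.workload / best.speed):
            best, best_cost = m, cost
    return best        # None means the node must be flagged as "not allocated"
```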

The machine corresponding to the minimum value computed by (1) is selected to allocate node $n_i^s$. When the same minimum value corresponds to several machines, HMM selects the machine $m$ corresponding to the smallest value of the ratio $W_m / S_m$. If no machine is found, the hierarchical node $n_i^s$ is flagged as not allocated and the next node is analyzed. Otherwise, step 5 is executed to estimate the execution time of $n_i^s$ on the candidate machine. At the first layer $W_m$ and $D_c$ are equal to zero; therefore, in order to select the most suitable machine to allocate node $n_1^1$, only the task computational cost, the affinity parameter and the machine computational power are considered.

Step 5 estimates the overall completion time of a node $n_i^s$ on the selected machine $m$, according to the node's precedence relations. If $n_i^s$ is a hierarchical node, all the atomic nodes it contains are allocated on the machine processors by using expression (2), and its execution time is equal to the execution time of the atomic node that is the last to finish:

$$\min_{k} \left( \frac{W_{p_k^m} + C_{n_i^s}}{S_{p_k^m}} + \max_{j} \left\{ \frac{D_c(n_i^s, n_j^s)}{B_m} \right\} \right), \qquad k \in \{1, 2, \ldots, P_m\},\; j \ \text{with}\ n_j^s \in BS(n_i^s) \tag{2}$$

where $S_{p_k^m}$ is the computational power of the $k$th processor of the machine $m$ on which $n_i^s$ will be allocated, $W_{p_k^m}$ the processor workload, and $B_m$ the machine network bandwidth.

The atomic node execution time $T_e$ is computed by applying Equation (3):

$$T_e(n_i^s, m) = T_w(n_i^s, m) + T_x(n_i^s, m) \tag{3}$$

where $T_w(n_i^s, m)$ is the amount of time the node $n_i^s$ has to wait before executing on the machine $m$, and $T_x(n_i^s, m)$ is the execution time of $n_i^s$ on $m$. $T_w(n_i^s, m)$ is given by

$$T_w(n_i^s, m) = \max\{ T_c(BS(n_i^s)),\; T_a(n_i^s, p_k^m) \} \tag{4}$$

where $T_c$ is the waiting time due to the communications of $n_i^s$ with the nodes belonging to $BS(n_i^s)$, and $T_a(n_i^s, p_k^m)$ is the waiting time to get the $k$th processor $p$ of the machine $m$ needed to execute $n_i^s$. $T_a$ is obtained from the vector $V_m$, where the processors are stored in increasing order with respect to the completion time of the tasks executed. The $u$th entry of $V_m$ stores the time at which the $u$th processor becomes available. The waiting time $T_c(BS(n_i^s))$ is given by

$$T_c(BS(n_i^s)) = \max_{j} \left\{ T_{\mathrm{end}}(n_j^s) + \frac{D_c(n_i^s, n_j^s)}{B_{m,n}} \right\}, \qquad j \ \text{with}\ n_j^s \in BS(n_i^s) \ \text{and}\ n \in \{1, 2, \ldots, M\} \tag{5}$$

where $T_{\mathrm{end}}$ is the time needed by node $n_j^s$ to end its execution. When $n_i^s$ and $n_j^s$ are allocated onto processors of the same machine (e.g. a multicomputer), $B_{m,n}$ is the machine network bandwidth. $T_x(n_i^s, m)$ is computed as

$$T_x(n_i^s, m) = \frac{C_{n_i^s}}{S_{p_k^m}} \tag{6}$$
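Similarly, Equations (3)-(6) can be summarized by the following illustrative fragment (the end_time and bandwidth helpers, and the use of a single per-machine speed, are assumptions of ours):

```python
def completion_time(node, machine, k, tg, alloc, end_time, bandwidth):
    """Step 5 (illustrative): T_e = T_w + T_x for node n_i^s on processor p_k^m, per Equations (3)-(6).

    end_time:  dict node name -> T_end of an already scheduled node
    bandwidth: function (machine_a, machine_b) -> B_{m,n}; for two nodes on the same
               machine it should return the machine network bandwidth B_m
    """
    # Equation (5): waiting time for the data coming from the parents in BS(n_i^s)
    t_c = 0.0
    for parent in tg.parents(node):
        pm = alloc[parent.name]
        d_c = tg.edges.get((parent.name, node.name), 0.0)
        t_c = max(t_c, end_time[parent.name] + d_c / bandwidth(machine, pm))
    # T_a: time at which processor p_k^m becomes free, read from the vector V_m
    t_a = machine.available[k]
    t_w = max(t_c, t_a)                      # Equation (4)
    t_x = node.comp_cost / machine.speed     # Equation (6), assuming homogeneous processors per machine
    return t_w + t_x                         # Equation (3)
```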

HMM_Local_Search(TG_s)
Begin
  SET the initial solution as the best solution: S_best = S_init
  WHILE total iteration number K < K_max AND iterations without improvements I < I_max
    FIND the critical path of the current solution S_init
    LOOP on the nodes n_i^s of the critical path (1 <= i <= N_cp)
      FIND the best processor to execute the node n_i^s
      IF at least one processor is found THEN
        MOVE the node n_i^s to the selected processor
        COMPUTE the new solution S_cp
        IF completion time of S_cp < completion time of S_init THEN S_init = S_cp
    IF completion time of S_init < completion time of S_best THEN
      S_best = S_init
    ELSE
      INCREMENT the iterations without improvements: I = I + 1
    INCREMENT the total iteration number: K = K + 1
End

Figure 6. Pseudo-code of the second phase of the HMM algorithm.

When the list of nodes within the phase currently analyzed is empty, step 6 is performed. Otherwise, execution continues from step 4 to allocate the next node.

Step 6 checks whether there are any nodes flagged as not allocated. If there are, these are allocated at the first deeper layer for which there is a machine that can execute each related node. This is made possible by selecting each node according to the order established in step 3 and expanding it to the next layer. It is then allocated by adopting the procedure described in steps 2-5. If there are no nodes flagged as not allocated, this process ends by carrying out the initial partial solution of the phase analyzed, and the execution continues from step 3 to elaborate the next phase.

Step 7 analyzes all the initial partial mapping solutions to find the layer completion time ($S_s$), which corresponds to the last $TG_s$ node ending its execution. $S_s$ is then compared with the best initial mapping solution $S_{init}$ carried out in a previous layer. If $S_s$ is better than $S_{init}$, it becomes $S_{init}$, and execution continues from step 2 to analyze the next layer. When all the layers have been evaluated, the algorithm ends returning $S_{init}$.

Figure 6 shows the pseudo-code of the HMM second phase. It considers all the nodes in $S_{init}$ as atomic, and analyzes the neighborhood of $S_{init}$ to search for a better solution. We say that a neighborhood of a solution is the set of solutions obtainable by moving a node of the critical path from processor $p_i^m$ (its current allocation) to a processor $p_j^m$ (with $j \neq i$) or to $p_i^h$ (with $h \neq m$). The new processor is selected according to the following criterion:

$$\max \{ \mathit{Aff}(n_i^s, m) \cdot S_{p_j^m} \}, \qquad m \in \{1, 2, \ldots, M_{\mathit{Aff}}\},\; j \in \{1, 2, \ldots, P_m\} \tag{7}$$

If several processors are selected, the one with the smallest value of the ratio $W_{p_j^m} / S_{p_j^m}$ is chosen.
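The iteration of Figure 6 can be condensed into the following Python skeleton; it is only a sketch of the published pseudo-code, and the critical_path, makespan, candidate_processor and move helpers are hypothetical placeholders, not part of HMM:

```python
def local_search(solution, k_max, i_max, critical_path, makespan, candidate_processor, move):
    """Second phase of HMM (illustrative sketch of Figure 6).

    Repeatedly tries to move critical-path nodes to the processor maximizing
    Aff(n, m) * S_p (ties broken by the smallest workload/speed ratio), keeping
    the best solution seen so far; it stops after k_max iterations or i_max
    iterations without improvement.
    """
    best = solution
    k = i = 0
    while k < k_max and i < i_max:
        improved = False
        for node in critical_path(solution):
            proc = candidate_processor(node)      # criterion (7) + tie-break
            if proc is None:
                continue                          # the node cannot be moved
            trial = move(solution, node, proc)
            if makespan(trial) < makespan(solution):
                solution = trial
        if makespan(solution) < makespan(best):
            best = solution
            improved = True
        if not improved:
            i += 1
        k += 1
    return best
```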

If there is no processor able to allocate $n_i^s$, the node is not moved, and the process continues by selecting the next node in the critical path. Otherwise, the selected node $n_i^s$ is moved, and HMM estimates the application execution time corresponding to the new solution $S_{cp}$. When all the nodes on the critical path have been analyzed, the local search algorithm carries out a new solution, i.e. the solution corresponding to the smallest execution time among all the $S_{cp}$ solutions previously found. If it is better than the initial solution, it becomes the best solution. The search for a suboptimal solution is an iterative process. HMM ends by returning the final mapping solution when one of the following conditions is verified: (a) the total number of iterations performed reaches a user-defined threshold; (b) the number of iterations performed without improving the solution reaches a user-defined threshold.

4.1. Mapping example

In order to show the choices taken by HMM, an example is given of the mapping of a real quantum chemistry application on three different HC system configurations (see Figures 7(b), 8(b) and 9(b)). The chemistry application deals with the integration of a two-dimensional Schrödinger equation using a sector-diabatic coupled-channel approach [36]. In the following figures the application tasks are represented in a task graph by nodes labeled $t_i$. The number inside each circle representing an atomic node specifies the computational cost of the task, and the communication cost is specified by the number associated with each edge. In a system graph the nodes labeled $p_i$ represent the processors inside a machine. The number inside each circle specifies the computational power of the processor, and the communication bandwidth of the channel linking two processors or two machines is specified by the number associated with each link.

Figure 7(b) shows the HC system configuration in which all machines have processors with equal computational power and equal communication bandwidth. Figures 8(b) and 9(b) represent the two other configurations in which machine $m_1$, and machines $m_1$, $m_2$, $m_3$, $m_4$, respectively, have processors with half the computational power and half the communication bandwidth with respect to the same machines in the HC system graph of Figure 7(b). This situation can occur, for example, when a machine is loaded at 50%, and therefore we can suppose that its computational power is cut down by a half. In Figures 7, 8 and 9, the arrows together with the gray levels indicate the layer at which the initial mapping solution ($S_{init}$) was carried out, and the machines onto which the tasks were mapped.

Figure 7 shows the case in which the machine $m_1$ in the HC system is large enough to carry out the application mapping at layer 1, where the execution time corresponding to $S_{init}$ is equal to 79. Table I shows the values of the affinity parameter ($\mathit{Aff}$), machine computational power ($S_m$), machine workload ($W_m$), computational cost ($C_n$) of the hierarchical node $t_{19}$, and the ratio between the data exchanged ($D_c$) and the communication bandwidth ($B_m$) used to compute expression (1) to allocate the task graph of Figure 7 at layer 1. Table II shows the values used to compute expression (1) in order to allocate the same task graph at layer 2 on the HC system in Figure 8.
In the same way, expression (1) was computed to allocate at layer 3 the task graph of the example shown in Figure 9. In all of the examples the mapping corresponding to the smallest value of $S_{init}$ was chosen. Figures 7(c), 8(c) and 9(c) show the final suboptimal mapping solutions obtained after the local search process.

Figure 7. Mapping carried out at layer 1.

Figure 8. Mapping carried out at layer 2.

Figure 9. Mapping carried out at layer 3.

Table I. Values computed by HMM to evaluate the mapping at layer 1 of the application graph on the HC system shown in Figure 7. (Columns: Machine, $\mathit{Aff}$, $S_m$, $W_m$, $C_n$, $D_c/B_m$, expression (1).)

5. ALGORITHM COMPLEXITY

To evaluate the algorithm time complexity we consider the number of operations performed to carry out the initial mapping solution and to run the local search algorithm.

5.1. Initial solution

The cost $C(S_{init})$ of the initial solution is computed according to the number of operations executed to allocate the atomic and/or hierarchical nodes within all the layers of TG. The cost $C(S_s)$ to allocate the $N_s$ nodes of a $TG_s$ (a graph expanded at layer $s$) is given by

$$C(S_s) = \frac{N_s^3}{3} + F\, N_s \log_2 N_s + N_s\, M \tag{8}$$

where $N_s^3/3$ is the number of operations performed to divide $TG_s$ into $F$ phases, $N_s \log_2 N_s$ is the number of operations performed to sort the nodes within a phase (supposing that all the $TG_s$ nodes are in each phase), and $N_s\, M$ is the number of operations executed to choose a machine on which to allocate the $N_s$ nodes. Since the worst case, with as many layers as the number of atomic nodes, is rare, in order to compute $C(S_{init})$ we consider the average case in which $TG_s$ is structured in $\log_2 N + 1$ layers, each layer $i$ including $2^{(i-1)}$ nodes with $1 \le i \le \log_2 N + 1$. In this case $C(S_{init})$ is given by

$$C(S_{init}) = \sum_{i=1}^{\log_2 N + 1} C(S_i) \tag{9}$$

Since every $TG_i$ includes $2^{i-1}$ nodes, expression (9) can be written as

$$C(S_{init}) = \sum_{i=1}^{\log_2 N + 1} \left( \frac{(2^{(i-1)})^3}{3} + F\, 2^{(i-1)} \log_2 2^{(i-1)} + 2^{(i-1)} M \right) \tag{10}$$

Table II. Values computed by HMM to evaluate the mapping at layer 2 of the application graph on the HC system shown in Figure 8. (Columns: Machine, $\mathit{Aff}$, $(W_m + C_n)/(\mathit{Aff} \cdot S_m)$, $D_c/B_m$, expression (1); one group of rows for each of the nodes $t_{14}$-$t_{18}$.)

which is equal to

$$C(S_{init}) = \sum_{i=1}^{\log_2 N + 1} \left( \frac{2^{3(i-1)}}{3} + F\, 2^{(i-1)} \log_2 2^{(i-1)} + 2^{(i-1)} M \right) \tag{11}$$

which is equal to

$$C(S_{init}) = N^3 + F\, N \ln N + N\, M \tag{12}$$

5.2. Local search

The cost to find $S_{best}$ by using the local search function is computed according to the number of operations executed at each iteration to reallocate the atomic nodes in the critical path of a new solution. This is given by

$$C(S_{best}) = \frac{N^3}{3} + N \log_2 N + K \left( \frac{N(N-1)}{2} \right) + N_c\, P + N \tag{13}$$

where $N^3/3$ is the number of operations executed to divide the atomic nodes of TG into phases, $N \log_2 N$ is the number of operations performed to sort the nodes within a phase (we suppose that there are $N$ nodes in each phase), $K\,(N(N-1)/2)$ is the number of operations to find the critical path ($K$ is the number of iterations), $N_c\, P$ are the operations executed to reallocate the $N_c$ nodes in the critical path onto the $P$ HC system processors, and $N$ is the cost paid to compute the end time of all the atomic nodes belonging to a new solution.

Since expressions (12) and (13) can be approximated to $O(N^3)$, the algorithm time complexity, in the average case, results to be $O(N^3)$.
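For completeness, the step from (11) to (12) rests on standard geometric-series identities; writing $L = \log_2 N + 1$ and ignoring the constant factors dropped in the text:

$$\sum_{i=1}^{L} 2^{3(i-1)} = \frac{8^{L} - 1}{7} = \frac{8N^3 - 1}{7} = O(N^3), \qquad
\sum_{i=1}^{L} 2^{(i-1)} = 2^{L} - 1 = 2N - 1 = O(N), \qquad
\sum_{i=1}^{L} (i-1)\,2^{(i-1)} = O(N \log_2 N)$$

which yields the $N^3$, $F\,N\ln N$ and $N\,M$ terms of (12).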

6. PERFORMANCE EVALUATION

Comparing mapping heuristics is generally a difficult task because of the different underlying assumptions made in the original studies of each heuristic [31]. The HMM assessment was made possible by comparison with six other graph-based mapping algorithms, namely the Cluster-M mapping algorithm (CM), Augmented Cluster-M mapping (ACM), Lo's Max-Flow/Min-Cut (MFMC), Shen and Tsai's A* Searching (AS), Heterogeneous Earliest-Finish-Time (HEFT) and Levelized-Min Time (LMT), and with an exhaustive mapping algorithm (EM). EM carries out all the possible $P^N$ mappings, where $N$ is the number of atomic tasks of an application and $P$ is the number of machine processors in the HC system. Moreover, two more tests were conducted by using a quantum chemistry application [37] and a molecular dynamics Monte Carlo application [38].

The first four tests used as input the application DAGs and the HC system graphs from [3], and the results are shown in Gantt charts. For each task $t_i$ the machine $m$ on which it was allocated and the units of time required for its execution (numbers at the top of the chart) are shown. Figure 10 shows the application DAG and the metacomputer graphs used in the first three tests, performed to compare HMM with CM and EM. Figures 11(a), (b) and (c) show the mapping of the application DAG of Figure 10(a) on the HC systems 1, 2 and 3 of Figure 10(b), respectively.

Figure 10. Application DAG and HC system graphs used in tests 1, 2 and 3.

Figure 11. Gantt charts describing the mapping carried out in tests 1, 2 and 3.

Figure 12. Application DAG and HC system graph used in test 4.

In the first test (see Figure 11(a)) HMM converged to the optimal solution by analyzing only 25 mappings out of the 2187 possible mappings analyzed by EM. The solution was obtained by allocating the graph expanded at the last layer. In order to reduce the task communications, $t_0$ and $t_1$ were allocated onto the same machine $m_1$. In the second test (see Figure 11(b)) HMM found a suboptimal mapping corresponding to a total application execution time of 15.1 units, which is worse than the optimal mapping by about 0.7%. It was carried out by allocating the more communicating tasks onto the same machine. HMM converged to the suboptimal solution after 27 possible mappings had been analyzed. In the third test (see Figure 11(c)) HMM converged to the optimal solution after 16 possible mappings, by allocating all the application tasks onto the highest-speed machine. In all three tests the HMM execution required about 10 ms.

Figure 12 shows the application and the HC system graphs used in test four to compare HMM with CM, ACM, MFMC and EM. The application consists of an MIMD code and a vector code (Gaussian elimination). The HC system used in the test models an MIMD and a vector machine. As in [3], we defined the HMM affinity parameter so that the MIMD code was allocated on the MIMD machine and the vector code on the vector machine.

Figure 13. Gantt charts describing the mapping carried out in test 4.

For a better view of the total application execution time, the execution times of both code types were placed in the same Gantt chart, even if MFMC, AS and EM do not treat heterogeneity in computation and machine types. Their mapping results were obtained by mapping the MIMD and the vector code separately onto the corresponding type of processor. As shown in Figure 13, HMM and ACM carried out the same mapping, which is worse than the optimal mapping by 1.6%. In order to map the vector code, EM analyzed all the possible mappings to seek the optimal solution. HMM converged to the suboptimal solution by analyzing only 19 of the possible mappings. The HMM execution required 40 ms, while the EM execution required 1350 ms. Mapping the MIMD code, HMM converged to the suboptimal solution by analyzing only 21 of the possible mappings analyzed by EM. The HMM suboptimal solution is 3.6% worse than the optimal mapping. The HMM execution required 10 ms, while the EM execution required 600 ms. The longer execution time required by the exhaustive algorithm to map the vector code with respect to the MIMD code was due to the longer time required to compute the overall application execution time of the vector task.

The fifth test was conducted to compare HMM with the HEFT and LMT algorithms when used to map a real chemical parallel application, which exploits the Monte Carlo method to model gas reactive systems [38], on a heterogeneous multi-cluster computing platform. The chemical process simulated is made up of a large set of independent tasks of unpredictable duration.

Figure 14. Results obtained by HMM, HEFT and LMT when mapping the Monte Carlo application.

The task farm parallelism form was adopted to implement the parallel program used to run the simulation. In our test a task farm with 1000 slave nodes was used, and 10 runs were executed by varying the number of tasks, starting from $10^3$. Task computational and communication costs were randomly fixed according to a uniform distribution, with mean and variance defined on the basis of results obtained by real executions of the chemical application. The HC system used in this test consisted of three clusters with the following configurations: (a) four workstations with processor powers of 25, 20, 20 and 15 Mflop/s, connected by a 1 Mbit/s Ethernet; (b) three workstations with processor powers of 25, 15 and 12 Mflop/s, connected by a 0.75 Mbit/s Ethernet; (c) four workstations with processor powers of 20, 15, 12 and 10 Mflop/s, connected by a 0.8 Mbit/s Ethernet. All the clusters were considered to be connected by a 0.25 Mbit/s LAN.

Figure 14 shows the application makespans obtained by HMM, HEFT and LMT when mapping the parallel program on the target HC system, increasing the number of tasks elaborated by each slave in each run. Each value was obtained by running the mapping algorithms 10 times using different randomly generated sets of communication and computation costs. Although in [29] Topcuoglu et al. showed that HEFT performs better than the critical-path-based mapping algorithm they proposed, on this kind of application, as shown in Figure 14, HMM obtained better results than HEFT.

The last test was performed to evaluate HMM when used to map the real parallel quantum chemistry application cited in Section 4.1 (but implemented using a different parallel programming model). The application execution times and the machine utilization obtained by mapping the application using HMM were compared with those obtained by simulating the execution of the application on the same target HC system. The chemical simulation is composed of a set of independent processes, each computing the propagation of particles in environments with different physical and chemical conditions.

Figure 15. DAG representing the chemical parallel application.

The DAG in Figure 15 represents the parallel program used to perform a simulated process. A detailed description of this parallel application can be found in [37]. In the DAG, the computational cost associated with each task was derived by analyzing previous works [37,39]; the computational and communication costs of the tasks labeled $t_7$-$t_{11}$ and of the related edges, respectively, vary by ±20% with respect to the value of the physical entities (i.e. energy) used in a simulation. The HC system used to conduct our experiments consisted of an eight-node machine with a processor power of 35 MFlop/s, fully connected at 8 Mbit/s; a 64-node hypercube multicomputer with a processor power of 6 MFlop/s and a network bandwidth of 2.4 Mbit/s; and a cluster of four workstations with a processor power of 25 MFlop/s. All of the machines were connected by a 0.25 Mbit/s LAN. Previous works [37,39] demonstrated that, for this application, good execution times were obtained by running the application structured according to a task farm programming model on clusters of processors. Therefore, the following configuration was used for our tests: (a) the eight-node machine, partitioned into two clusters of four nodes each (machine 0); (b) the 64-node hypercube multicomputer, partitioned into four clusters of 16 nodes each (machine 1); (c) one cluster of four workstations (machine 2).

In order to simulate the application execution on this HC system configuration we used a program (called Task-Farm) implementing a task farm to dynamically assign the parallel tasks to the clusters and to estimate the application execution time. Figure 16 shows the schedule lengths obtained by both HMM and the Task-Farm program elaborating problems of dimension from one to 32 chemical processes. Since the values assigned to tasks $t_7$-$t_{11}$ and to the related edges were randomly generated according to a normal distribution in the interval ±20%, each run was performed 10 times, and the average execution time related to each problem dimension was plotted.

Figure 16. Application execution times obtained by using HMM and Task-Farm.

Figure 17. Machine utilization obtained by mapping the application with HMM.

As can be seen in Figure 16, the application execution time noticeably improves when the problem dimension increases; with a dimension equal to 32, using HMM we obtain an application schedule length equal to 255 units, compared with 660 units obtained using the Task-Farm program. Figures 17 and 18 show the workload distribution on the HC system machines: when the number of chemical processes simulated is greater than 16, the use of HMM leads to a smaller utilization of the more powerful machine 0 and a higher utilization of the other machines. This better workload distribution leads to a smaller execution time with respect to that obtained by simulating the application execution. All tests were conducted on a dual-processor Pentium III 800 MHz with 512 MBytes of RAM.

Figure 18. Machine utilization obtained by simulating the application execution.

7. CONCLUSIONS

In this work we have presented a new static graph-based mapping algorithm, which allows us to find a suboptimal mapping of a parallel application onto a HC system in order to minimize the total application execution time. The algorithm exploits the information embedded in the parallelism forms used to implement an application, and uses a local search technique together with the tabu search meta-heuristic. In order to evaluate the quality of the mappings performed, several tests were conducted using: (a) case studies previously used to evaluate other static mapping algorithms employing different leading techniques; (b) two chemical applications; (c) an exhaustive mapping algorithm. The metric used to compare the mappings obtained was the application makespan. The test results show that our algorithm performs better than, or as well as, the other mapping algorithms, and that it obtains good results when mapping two large real applications on heterogeneous platforms composed of many processors. Comparisons with the mappings obtained by an exhaustive mapping algorithm show that our algorithm always carried out mappings which are very close to the optimal solution. This is because HMM allocates parallel tasks in a way that optimizes the execution of each form of parallelism used within a parallel application. The mapping quality carried out by HMM can be enhanced by introducing several improvements, such as the exploitation of the native mapping mechanisms available on each HC system machine and a better estimation of the code-machine affinity values by using data from specific benchmarks. In addition, the introduction of dynamic mapping functionalities to permit the migration of parallel tasks among the HC system machines can enhance the throughput of the HC system.


More information

SMD149 - Operating Systems - Multiprocessing

SMD149 - Operating Systems - Multiprocessing SMD149 - Operating Systems - Multiprocessing Roland Parviainen December 1, 2005 1 / 55 Overview Introduction Multiprocessor systems Multiprocessor, operating system and memory organizations 2 / 55 Introduction

More information

Overview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy

Overview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy Overview SMD149 - Operating Systems - Multiprocessing Roland Parviainen Multiprocessor systems Multiprocessor, operating system and memory organizations December 1, 2005 1/55 2/55 Multiprocessor system

More information

A Framework for Space and Time Efficient Scheduling of Parallelism

A Framework for Space and Time Efficient Scheduling of Parallelism A Framework for Space and Time Efficient Scheduling of Parallelism Girija J. Narlikar Guy E. Blelloch December 996 CMU-CS-96-97 School of Computer Science Carnegie Mellon University Pittsburgh, PA 523

More information

Data Flow Graph Partitioning Schemes

Data Flow Graph Partitioning Schemes Data Flow Graph Partitioning Schemes Avanti Nadgir and Harshal Haridas Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802 Abstract: The

More information

Bi-Objective Optimization for Scheduling in Heterogeneous Computing Systems

Bi-Objective Optimization for Scheduling in Heterogeneous Computing Systems Bi-Objective Optimization for Scheduling in Heterogeneous Computing Systems Tony Maciejewski, Kyle Tarplee, Ryan Friese, and Howard Jay Siegel Department of Electrical and Computer Engineering Colorado

More information

Toward the joint design of electronic and optical layer protection

Toward the joint design of electronic and optical layer protection Toward the joint design of electronic and optical layer protection Massachusetts Institute of Technology Slide 1 Slide 2 CHALLENGES: - SEAMLESS CONNECTIVITY - MULTI-MEDIA (FIBER,SATCOM,WIRELESS) - HETEROGENEOUS

More information

Indexing by Shape of Image Databases Based on Extended Grid Files

Indexing by Shape of Image Databases Based on Extended Grid Files Indexing by Shape of Image Databases Based on Extended Grid Files Carlo Combi, Gian Luca Foresti, Massimo Franceschet, Angelo Montanari Department of Mathematics and ComputerScience, University of Udine

More information

MIRROR SITE ORGANIZATION ON PACKET SWITCHED NETWORKS USING A SOCIAL INSECT METAPHOR

MIRROR SITE ORGANIZATION ON PACKET SWITCHED NETWORKS USING A SOCIAL INSECT METAPHOR MIRROR SITE ORGANIZATION ON PACKET SWITCHED NETWORKS USING A SOCIAL INSECT METAPHOR P. Shi, A. N. Zincir-Heywood and M. I. Heywood Faculty of Computer Science, Dalhousie University, Halifax NS, Canada

More information

Process Allocation for Load Distribution in Fault-Tolerant. Jong Kim*, Heejo Lee*, and Sunggu Lee** *Dept. of Computer Science and Engineering

Process Allocation for Load Distribution in Fault-Tolerant. Jong Kim*, Heejo Lee*, and Sunggu Lee** *Dept. of Computer Science and Engineering Process Allocation for Load Distribution in Fault-Tolerant Multicomputers y Jong Kim*, Heejo Lee*, and Sunggu Lee** *Dept. of Computer Science and Engineering **Dept. of Electrical Engineering Pohang University

More information

Ruled Based Approach for Scheduling Flow-shop and Job-shop Problems

Ruled Based Approach for Scheduling Flow-shop and Job-shop Problems Ruled Based Approach for Scheduling Flow-shop and Job-shop Problems Mohammad Komaki, Shaya Sheikh, Behnam Malakooti Case Western Reserve University Systems Engineering Email: komakighorban@gmail.com Abstract

More information

5. Computational Geometry, Benchmarks and Algorithms for Rectangular and Irregular Packing. 6. Meta-heuristic Algorithms and Rectangular Packing

5. Computational Geometry, Benchmarks and Algorithms for Rectangular and Irregular Packing. 6. Meta-heuristic Algorithms and Rectangular Packing 1. Introduction 2. Cutting and Packing Problems 3. Optimisation Techniques 4. Automated Packing Techniques 5. Computational Geometry, Benchmarks and Algorithms for Rectangular and Irregular Packing 6.

More information

Cost Models for Query Processing Strategies in the Active Data Repository

Cost Models for Query Processing Strategies in the Active Data Repository Cost Models for Query rocessing Strategies in the Active Data Repository Chialin Chang Institute for Advanced Computer Studies and Department of Computer Science University of Maryland, College ark 272

More information

Edge Equalized Treemaps

Edge Equalized Treemaps Edge Equalized Treemaps Aimi Kobayashi Department of Computer Science University of Tsukuba Ibaraki, Japan kobayashi@iplab.cs.tsukuba.ac.jp Kazuo Misue Faculty of Engineering, Information and Systems University

More information

Index. ADEPT (tool for modelling proposed systerns),

Index. ADEPT (tool for modelling proposed systerns), Index A, see Arrivals Abstraction in modelling, 20-22, 217 Accumulated time in system ( w), 42 Accuracy of models, 14, 16, see also Separable models, robustness Active customer (memory constrained system),

More information

DOWNLOAD PDF SYNTHESIZING LINEAR-ARRAY ALGORITHMS FROM NESTED FOR LOOP ALGORITHMS.

DOWNLOAD PDF SYNTHESIZING LINEAR-ARRAY ALGORITHMS FROM NESTED FOR LOOP ALGORITHMS. Chapter 1 : Zvi Kedem â Research Output â NYU Scholars Excerpt from Synthesizing Linear-Array Algorithms From Nested for Loop Algorithms We will study linear systolic arrays in this paper, as linear arrays

More information

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization

More information

Scheduling in Multiprocessor System Using Genetic Algorithms

Scheduling in Multiprocessor System Using Genetic Algorithms Scheduling in Multiprocessor System Using Genetic Algorithms Keshav Dahal 1, Alamgir Hossain 1, Benzy Varghese 1, Ajith Abraham 2, Fatos Xhafa 3, Atanasi Daradoumis 4 1 University of Bradford, UK, {k.p.dahal;

More information

An Improved Heft Algorithm Using Multi- Criterian Resource Factors

An Improved Heft Algorithm Using Multi- Criterian Resource Factors An Improved Heft Algorithm Using Multi- Criterian Resource Factors Renu Bala M Tech Scholar, Dept. Of CSE, Chandigarh Engineering College, Landran, Mohali, Punajb Gagandeep Singh Assistant Professor, Dept.

More information

Complexity results for throughput and latency optimization of replicated and data-parallel workflows

Complexity results for throughput and latency optimization of replicated and data-parallel workflows Complexity results for throughput and latency optimization of replicated and data-parallel workflows Anne Benoit and Yves Robert GRAAL team, LIP École Normale Supérieure de Lyon June 2007 Anne.Benoit@ens-lyon.fr

More information

Metaheuristic Development Methodology. Fall 2009 Instructor: Dr. Masoud Yaghini

Metaheuristic Development Methodology. Fall 2009 Instructor: Dr. Masoud Yaghini Metaheuristic Development Methodology Fall 2009 Instructor: Dr. Masoud Yaghini Phases and Steps Phases and Steps Phase 1: Understanding Problem Step 1: State the Problem Step 2: Review of Existing Solution

More information

CHAPTER 5 ANT-FUZZY META HEURISTIC GENETIC SENSOR NETWORK SYSTEM FOR MULTI - SINK AGGREGATED DATA TRANSMISSION

CHAPTER 5 ANT-FUZZY META HEURISTIC GENETIC SENSOR NETWORK SYSTEM FOR MULTI - SINK AGGREGATED DATA TRANSMISSION CHAPTER 5 ANT-FUZZY META HEURISTIC GENETIC SENSOR NETWORK SYSTEM FOR MULTI - SINK AGGREGATED DATA TRANSMISSION 5.1 INTRODUCTION Generally, deployment of Wireless Sensor Network (WSN) is based on a many

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

6. Parallel Volume Rendering Algorithms

6. Parallel Volume Rendering Algorithms 6. Parallel Volume Algorithms This chapter introduces a taxonomy of parallel volume rendering algorithms. In the thesis statement we claim that parallel algorithms may be described by "... how the tasks

More information

Designing for Performance. Patrick Happ Raul Feitosa

Designing for Performance. Patrick Happ Raul Feitosa Designing for Performance Patrick Happ Raul Feitosa Objective In this section we examine the most common approach to assessing processor and computer system performance W. Stallings Designing for Performance

More information

EE/CSCI 451 Midterm 1

EE/CSCI 451 Midterm 1 EE/CSCI 451 Midterm 1 Spring 2018 Instructor: Xuehai Qian Friday: 02/26/2018 Problem # Topic Points Score 1 Definitions 20 2 Memory System Performance 10 3 Cache Performance 10 4 Shared Memory Programming

More information

Foundation of Parallel Computing- Term project report

Foundation of Parallel Computing- Term project report Foundation of Parallel Computing- Term project report Shobhit Dutia Shreyas Jayanna Anirudh S N (snd7555@rit.edu) (sj7316@rit.edu) (asn5467@rit.edu) 1. Overview: Graphs are a set of connections between

More information

Mapping pipeline skeletons onto heterogeneous platforms

Mapping pipeline skeletons onto heterogeneous platforms Mapping pipeline skeletons onto heterogeneous platforms Anne Benoit and Yves Robert GRAAL team, LIP École Normale Supérieure de Lyon January 2007 Yves.Robert@ens-lyon.fr January 2007 Mapping pipeline skeletons

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

Adaptive Runtime Partitioning of AMR Applications on Heterogeneous Clusters

Adaptive Runtime Partitioning of AMR Applications on Heterogeneous Clusters Adaptive Runtime Partitioning of AMR Applications on Heterogeneous Clusters Shweta Sinha and Manish Parashar The Applied Software Systems Laboratory Department of Electrical and Computer Engineering Rutgers,

More information

An Efficient Algorithm to Test Forciblyconnectedness of Graphical Degree Sequences

An Efficient Algorithm to Test Forciblyconnectedness of Graphical Degree Sequences Theory and Applications of Graphs Volume 5 Issue 2 Article 2 July 2018 An Efficient Algorithm to Test Forciblyconnectedness of Graphical Degree Sequences Kai Wang Georgia Southern University, kwang@georgiasouthern.edu

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Reference Point Based Evolutionary Approach for Workflow Grid Scheduling

Reference Point Based Evolutionary Approach for Workflow Grid Scheduling Reference Point Based Evolutionary Approach for Workflow Grid Scheduling R. Garg and A. K. Singh Abstract Grid computing facilitates the users to consume the services over the network. In order to optimize

More information

COPYRIGHTED MATERIAL. Introduction: Enabling Large-Scale Computational Science Motivations, Requirements, and Challenges.

COPYRIGHTED MATERIAL. Introduction: Enabling Large-Scale Computational Science Motivations, Requirements, and Challenges. Chapter 1 Introduction: Enabling Large-Scale Computational Science Motivations, Requirements, and Challenges Manish Parashar and Xiaolin Li 1.1 MOTIVATION The exponential growth in computing, networking,

More information

CSE 421 Greedy Alg: Union Find/Dijkstra s Alg

CSE 421 Greedy Alg: Union Find/Dijkstra s Alg CSE 1 Greedy Alg: Union Find/Dijkstra s Alg Shayan Oveis Gharan 1 Dijkstra s Algorithm Dijkstra(G, c, s) { d s 0 foreach (v V) d[v] //This is the key of node v foreach (v V) insert v onto a priority queue

More information

Parallel Programs. EECC756 - Shaaban. Parallel Random-Access Machine (PRAM) Example: Asynchronous Matrix Vector Product on a Ring

Parallel Programs. EECC756 - Shaaban. Parallel Random-Access Machine (PRAM) Example: Asynchronous Matrix Vector Product on a Ring Parallel Programs Conditions of Parallelism: Data Dependence Control Dependence Resource Dependence Bernstein s Conditions Asymptotic Notations for Algorithm Analysis Parallel Random-Access Machine (PRAM)

More information

Optimizing Molecular Dynamics

Optimizing Molecular Dynamics Optimizing Molecular Dynamics This chapter discusses performance tuning of parallel and distributed molecular dynamics (MD) simulations, which involves both: (1) intranode optimization within each node

More information

Branch-and-bound: an example

Branch-and-bound: an example Branch-and-bound: an example Giovanni Righini Università degli Studi di Milano Operations Research Complements The Linear Ordering Problem The Linear Ordering Problem (LOP) is an N P-hard combinatorial

More information

Network-on-chip (NOC) Topologies

Network-on-chip (NOC) Topologies Network-on-chip (NOC) Topologies 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and performance

More information

What is Parallel Computing?

What is Parallel Computing? What is Parallel Computing? Parallel Computing is several processing elements working simultaneously to solve a problem faster. 1/33 What is Parallel Computing? Parallel Computing is several processing

More information

Parallel FEM Computation and Multilevel Graph Partitioning Xing Cai

Parallel FEM Computation and Multilevel Graph Partitioning Xing Cai Parallel FEM Computation and Multilevel Graph Partitioning Xing Cai Simula Research Laboratory Overview Parallel FEM computation how? Graph partitioning why? The multilevel approach to GP A numerical example

More information

Self-Organizing Maps for cyclic and unbounded graphs

Self-Organizing Maps for cyclic and unbounded graphs Self-Organizing Maps for cyclic and unbounded graphs M. Hagenbuchner 1, A. Sperduti 2, A.C. Tsoi 3 1- University of Wollongong, Wollongong, Australia. 2- University of Padova, Padova, Italy. 3- Hong Kong

More information

Processing and Others. Xiaojun Qi -- REU Site Program in CVMA

Processing and Others. Xiaojun Qi -- REU Site Program in CVMA Advanced Digital Image Processing and Others Xiaojun Qi -- REU Site Program in CVMA (0 Summer) Segmentation Outline Strategies and Data Structures Overview of Algorithms Region Splitting Region Merging

More information

PatternRank: A Software-Pattern Search System Based on Mutual Reference Importance

PatternRank: A Software-Pattern Search System Based on Mutual Reference Importance PatternRank: A Software-Pattern Search System Based on Mutual Reference Importance Atsuto Kubo, Hiroyuki Nakayama, Hironori Washizaki, Yoshiaki Fukazawa Waseda University Department of Computer Science

More information

BI-OBJECTIVE EVOLUTIONARY ALGORITHM FOR FLEXIBLE JOB-SHOP SCHEDULING PROBLEM. Minimizing Make Span and the Total Workload of Machines

BI-OBJECTIVE EVOLUTIONARY ALGORITHM FOR FLEXIBLE JOB-SHOP SCHEDULING PROBLEM. Minimizing Make Span and the Total Workload of Machines International Journal of Mathematics and Computer Applications Research (IJMCAR) ISSN 2249-6955 Vol. 2 Issue 4 Dec - 2012 25-32 TJPRC Pvt. Ltd., BI-OBJECTIVE EVOLUTIONARY ALGORITHM FOR FLEXIBLE JOB-SHOP

More information

HETEROGENEOUS COMPUTING

HETEROGENEOUS COMPUTING HETEROGENEOUS COMPUTING Shoukat Ali, Tracy D. Braun, Howard Jay Siegel, and Anthony A. Maciejewski School of Electrical and Computer Engineering, Purdue University Heterogeneous computing is a set of techniques

More information

A 3. CLASSIFICATION OF STATIC

A 3. CLASSIFICATION OF STATIC Scheduling Problems For Parallel And Distributed Systems Olga Rusanova and Alexandr Korochkin National Technical University of Ukraine Kiev Polytechnical Institute Prospect Peremogy 37, 252056, Kiev, Ukraine

More information

Determining Differences between Two Sets of Polygons

Determining Differences between Two Sets of Polygons Determining Differences between Two Sets of Polygons MATEJ GOMBOŠI, BORUT ŽALIK Institute for Computer Science Faculty of Electrical Engineering and Computer Science, University of Maribor Smetanova 7,

More information

sflow: Towards Resource-Efficient and Agile Service Federation in Service Overlay Networks

sflow: Towards Resource-Efficient and Agile Service Federation in Service Overlay Networks sflow: Towards Resource-Efficient and Agile Service Federation in Service Overlay Networks Mea Wang, Baochun Li, Zongpeng Li Department of Electrical and Computer Engineering University of Toronto {mea,

More information

Consensus clustering by graph based approach

Consensus clustering by graph based approach Consensus clustering by graph based approach Haytham Elghazel 1, Khalid Benabdeslemi 1 and Fatma Hamdi 2 1- University of Lyon 1, LIESP, EA4125, F-69622 Villeurbanne, Lyon, France; {elghazel,kbenabde}@bat710.univ-lyon1.fr

More information

Introduction to Multithreaded Algorithms

Introduction to Multithreaded Algorithms Introduction to Multithreaded Algorithms CCOM5050: Design and Analysis of Algorithms Chapter VII Selected Topics T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein. Introduction to algorithms, 3 rd

More information

CLOUD WORKFLOW SCHEDULING BASED ON STANDARD DEVIATION OF PREDICTIVE RESOURCE AVAILABILITY

CLOUD WORKFLOW SCHEDULING BASED ON STANDARD DEVIATION OF PREDICTIVE RESOURCE AVAILABILITY DOI: http://dx.doi.org/10.26483/ijarcs.v8i7.4214 Volume 8, No. 7, July August 2017 International Journal of Advanced Research in Computer Science RESEARCH PAPER Available Online at www.ijarcs.info ISSN

More information

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06 Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,

More information

LECTURE 3:CPU SCHEDULING

LECTURE 3:CPU SCHEDULING LECTURE 3:CPU SCHEDULING 1 Outline Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling Real-Time CPU Scheduling Operating Systems Examples Algorithm Evaluation 2 Objectives

More information

Simple mechanisms for escaping from local optima:

Simple mechanisms for escaping from local optima: The methods we have seen so far are iterative improvement methods, that is, they get stuck in local optima. Simple mechanisms for escaping from local optima: I Restart: re-initialise search whenever a

More information

Query-Driven Document Partitioning and Collection Selection

Query-Driven Document Partitioning and Collection Selection Query-Driven Document Partitioning and Collection Selection Diego Puppin, Fabrizio Silvestri and Domenico Laforenza Institute for Information Science and Technologies ISTI-CNR Pisa, Italy May 31, 2006

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information

Scheduling Real Time Parallel Structure on Cluster Computing with Possible Processor failures

Scheduling Real Time Parallel Structure on Cluster Computing with Possible Processor failures Scheduling Real Time Parallel Structure on Cluster Computing with Possible Processor failures Alaa Amin and Reda Ammar Computer Science and Eng. Dept. University of Connecticut Ayman El Dessouly Electronics

More information

Estimation of Bilateral Connections in a Network: Copula vs. Maximum Entropy

Estimation of Bilateral Connections in a Network: Copula vs. Maximum Entropy Estimation of Bilateral Connections in a Network: Copula vs. Maximum Entropy Pallavi Baral and Jose Pedro Fique Department of Economics Indiana University at Bloomington 1st Annual CIRANO Workshop on Networks

More information

Concurrency Preserving Partitioning Algorithm

Concurrency Preserving Partitioning Algorithm VLSI DESIGN 1999, Vol. 9, No. 3, pp. 253-270 Reprints available directly from the publisher Photocopying permitted by license only (C) 1999 OPA (Overseas Publishers Association) N.V. Published by license

More information

Load Balancing of Parallel Simulated Annealing on a Temporally Heterogeneous Cluster of Workstations

Load Balancing of Parallel Simulated Annealing on a Temporally Heterogeneous Cluster of Workstations Load Balancing of Parallel Simulated Annealing on a Temporally Heterogeneous Cluster of Workstations Sourabh Moharil and Soo-Young Lee Department of Electrical and Computer Engineering Auburn University,

More information

Scheduling on clusters and grids

Scheduling on clusters and grids Some basics on scheduling theory Grégory Mounié, Yves Robert et Denis Trystram ID-IMAG 6 mars 2006 Some basics on scheduling theory 1 Some basics on scheduling theory Notations and Definitions List scheduling

More information

Parallel Computing in Combinatorial Optimization

Parallel Computing in Combinatorial Optimization Parallel Computing in Combinatorial Optimization Bernard Gendron Université de Montréal gendron@iro.umontreal.ca Course Outline Objective: provide an overview of the current research on the design of parallel

More information

Parallelization Strategy

Parallelization Strategy COSC 6374 Parallel Computation Algorithm structure Spring 2008 Parallelization Strategy Finding Concurrency Structure the problem to expose exploitable concurrency Algorithm Structure Supporting Structure

More information

Scheduling Tasks Sharing Files from Distributed Repositories

Scheduling Tasks Sharing Files from Distributed Repositories from Distributed Repositories Arnaud Giersch 1, Yves Robert 2 and Frédéric Vivien 2 1 ICPS/LSIIT, University Louis Pasteur, Strasbourg, France 2 École normale supérieure de Lyon, France September 1, 2004

More information

Scalability of Heterogeneous Computing

Scalability of Heterogeneous Computing Scalability of Heterogeneous Computing Xian-He Sun, Yong Chen, Ming u Department of Computer Science Illinois Institute of Technology {sun, chenyon1, wuming}@iit.edu Abstract Scalability is a key factor

More information

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna16/ [ 4 ] Scheduling Theory Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture and

More information

Improving the Data Scheduling Efficiency of the IEEE (d) Mesh Network

Improving the Data Scheduling Efficiency of the IEEE (d) Mesh Network Improving the Data Scheduling Efficiency of the IEEE 802.16(d) Mesh Network Shie-Yuan Wang Email: shieyuan@csie.nctu.edu.tw Chih-Che Lin Email: jclin@csie.nctu.edu.tw Ku-Han Fang Email: khfang@csie.nctu.edu.tw

More information

The Generalized Weapon Target Assignment Problem

The Generalized Weapon Target Assignment Problem 10th International Command and Control Research and Technology Symposium The Future of C2 June 13-16, 2005, McLean, VA The Generalized Weapon Target Assignment Problem Jay M. Rosenberger Hee Su Hwang Ratna

More information

CS4311 Design and Analysis of Algorithms. Lecture 1: Getting Started

CS4311 Design and Analysis of Algorithms. Lecture 1: Getting Started CS4311 Design and Analysis of Algorithms Lecture 1: Getting Started 1 Study a few simple algorithms for sorting Insertion Sort Selection Sort Merge Sort About this lecture Show why these algorithms are

More information

A COMPARATIVE STUDY OF FIVE PARALLEL GENETIC ALGORITHMS USING THE TRAVELING SALESMAN PROBLEM

A COMPARATIVE STUDY OF FIVE PARALLEL GENETIC ALGORITHMS USING THE TRAVELING SALESMAN PROBLEM A COMPARATIVE STUDY OF FIVE PARALLEL GENETIC ALGORITHMS USING THE TRAVELING SALESMAN PROBLEM Lee Wang, Anthony A. Maciejewski, Howard Jay Siegel, and Vwani P. Roychowdhury * Microsoft Corporation Parallel

More information

A Genetic Algorithm for Multiprocessor Task Scheduling

A Genetic Algorithm for Multiprocessor Task Scheduling A Genetic Algorithm for Multiprocessor Task Scheduling Tashniba Kaiser, Olawale Jegede, Ken Ferens, Douglas Buchanan Dept. of Electrical and Computer Engineering, University of Manitoba, Winnipeg, MB,

More information

Trace Signal Selection to Enhance Timing and Logic Visibility in Post-Silicon Validation

Trace Signal Selection to Enhance Timing and Logic Visibility in Post-Silicon Validation Trace Signal Selection to Enhance Timing and Logic Visibility in Post-Silicon Validation Hamid Shojaei, and Azadeh Davoodi University of Wisconsin 1415 Engineering Drive, Madison WI 53706 Email: {shojaei,

More information