A static mapping heuristics to map parallel applications to heterogeneous computing systems


CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2005; 17. Published online 24 June 2005 in Wiley InterScience. DOI: 10.1002/cpe.902

A static mapping heuristics to map parallel applications to heterogeneous computing systems

Ranieri Baraglia 1, Renato Ferrini 1 and Pierluigi Ritrovato 2

1 High Performance Computing Laboratory, Information Science and Technologies Institute (ISTI), Italian National Research Council (CNR), Via G. Moruzzi 1, Pisa, Italy
2 Centro di Ricerca in Matematica Pura ed Applicata, Università degli Studi di Salerno, Fisciano (SA), Italy

Correspondence to: Ranieri Baraglia, High Performance Computing Laboratory, Information Science and Technologies Institute (ISTI), Italian National Research Council (CNR), Via G. Moruzzi 1, Pisa, Italy. E-mail: ranieri.baraglia@cnuce.cnr.it

Received 24 March 2003; Revised 17 March 2004; Accepted 17 March 2004
Copyright © 2005 John Wiley & Sons, Ltd.

SUMMARY

In order to minimize the execution time of a parallel application running on a heterogeneous distributed computing system, an appropriate mapping scheme is needed to allocate the application tasks to the processors. The general problem of mapping tasks to machines is a well-known NP-hard problem, and several heuristics have been proposed to approximate its optimal solution. In this paper we propose a static graph-based mapping algorithm, called Heterogeneous Multi-phase Mapping (HMM), which permits a suboptimal mapping of a parallel application onto a heterogeneous distributed computing system by using a local search technique together with the tabu search meta-heuristic. HMM allocates parallel tasks by exploiting the information embedded in the parallelism forms used to implement an application, and by considering an affinity parameter that identifies which machine in the heterogeneous computing system is most suitable to execute a task. We compare HMM with some leading techniques and with an exhaustive mapping algorithm. We also give an example of the mapping of two real applications using HMM. Experimental results show that HMM performs well, demonstrating the applicability of our approach. Copyright © 2005 John Wiley & Sons, Ltd.

KEY WORDS: parallel processing; static mapping; task mapping; performance evaluation

1. INTRODUCTION

Heterogeneous computing (HC) is a set of techniques enabling the use of different interconnected machines that provide a variety of computational capabilities to execute a number of application tasks with different requirements [1,2].

There are many types of heterogeneous systems [3]. In this paper we refer to a HC system as a heterogeneous suite of independent machines connected by high-speed networks and used as a single HC distributed system [1]. This kind of HC system is also known as a Metacomputer [4] or as a Grid [5]. A HC system requires compile- and run-time services supporting the execution of applications. Computationally intensive parallel and distributed applications made up of coarse-grained, loosely coupled components that require the use of different computational types (i.e. sequential, SIMD, MIMD, etc.) are the best candidates to exploit the full potential of a HC system. In order to achieve high performance, the critical aspects of HC systems are to efficiently assign the application tasks to the most suitable machines and to determine the execution order of the tasks on each selected machine. The assignment-scheduling process is called mapping. The task-mapping process has been shown to be NP-complete in its general form as well as in some restricted forms [6].

Mapping algorithms are classified as static or dynamic [6]. In the former the mapping decisions are taken before executing an application and remain unchanged until the end of the execution. In the latter the mapping decisions are taken while the program is running. Since static mapping does not add overhead to the execution time of the mapped application, more complex mapping solutions than the dynamic ones can be adopted. When the characteristics of a parallel application, such as the task computational cost, the amount of data exchanged among tasks, and the task dependencies, are known before the application execution, static mapping can be exploited.

In the general form of static mapping a parallel application is represented by a directed acyclic graph (DAG), where the nodes represent application tasks and the edges represent data dependencies between two tasks. Each node is associated with a weight representing the computational cost of the corresponding task, and the weight associated with each edge represents the amount of data exchanged between the two interconnected tasks [7]. The goal of the DAG mapping process is to minimize the overall application makespan by appropriately allocating the tasks to the processors and ordering the task execution sequence in such a manner that the task precedence constraints are preserved. Despite the number of scheduling heuristics proposed in the past for both homogeneous and heterogeneous distributed systems, the mapping problem remains an open research problem.

In this paper we propose Heterogeneous Multi-phase Mapping (HMM), a static general mapping heuristics to map parallel applications represented by DAGs onto a metacomputer so as to minimize the application completion time. HMM takes mapping decisions by exploiting the information embedded in the parallelism forms, or skeletons (i.e. farm, pipeline, geometric, etc.), used to implement an application, and by considering an affinity parameter identifying which machine in the HC system is most suitable for executing a task [1,3,8]. It also considers the task computation cost, the processor computational power and workload (i.e. the workload induced by the application mapped), and the amount of data to be transferred between tasks running on different processors.
Moreover, HMM uses a local search technique together with the tabu search meta-heuristic [9] in order to improve the quality of the HMM initial mapping solution.

Most parallel programs are designed by means of parallelism forms that can be used by adopting models allowing arbitrary computation structures (unrestricted programming model) [10,11] and also models restricting the forms in which computation may be expressed (restricted programming model) [12,13].

Various coordination languages [14-17] have been proposed in which sets of parallel constructs (skeletons) are used as program composition forms. Parallelism forms allow the implementation of parallel computations in which the communication patterns are structured and well defined, and a considerable amount of communication is generally concentrated within a parallelism form. Parallel applications suited for HC are made up of large-grained, loosely coupled components that can be either sequential or parallel. When a parallel component has been designed by using parallelism forms, the tasks inside a parallelism form can be seen as naturally grouped in a cluster that can be allocated on processors of the same machine to reduce the task communication costs. Each cluster can embed one or more nested parallelism forms to be handled in a top-down manner. This allows us to find the size of the application grain that better exploits the computation and communication characteristics of the target HC platform. Exploiting the clustering induced by the parallelism forms permits us to obtain good schedule quality while reducing the heuristics execution times with respect to other solutions, such as the cluster-based ones.

This paper is organized as follows. Section 2 describes some related work. Section 3 gives a description of the problem, and Section 4 describes our algorithm. In Section 5 the algorithm time complexity is evaluated. Section 6 outlines and evaluates the experimental tests. Finally, we summarize our work in Section 7.

2. RELATED WORK

In the past a number of static graph-based mapping algorithms for heterogeneous computing environments have been proposed [3,18-30] and evaluated [31] to approximate the optimal mapping solution.

In [3], Eshaghian et al. proposed the Cluster-M and the Augmented Cluster-M mapping algorithms based on the Cluster-M programming model. Cluster-M uses a directed graph and an undirected graph to represent a meta-application and a metacomputer, respectively. The directed graph is horizontally and vertically partitioned. Horizontal partitioning builds layers (clusters) of tasks that can be executed in parallel. Vertical partitioning allows consecutive layers to be merged or embedded in order to minimize communication costs. The undirected graph, represented in a multilayered format, is such that each layer includes a set (cluster) of completely connected processing units. Clusters of communication-intensive tasks are then mapped onto clusters of communication-efficient processing units by running Cluster-M. The latter presents a time complexity equal to $O(MP)$, where $M$ is the number of tasks belonging to an application graph and $P = \max(M, N)$, with $N$ the number of processors belonging to a metacomputer graph. Augmented Cluster-M is an extension of Cluster-M. As it treats the task computational type-machine type affinity, which drives the mapping of a task onto a machine with the same computation type, it is more suitable for HC. The major drawback of this solution is the need to re-compute the clustering of the system graph when the metacomputer configuration changes. Metacomputers are dynamic computing environments in which the availability of computational units cannot be established a priori. Moreover, the machine and network workload can lead to different mappings for different time requests of the same application execution.

In [19], Lo proposed a Max-Flow/Min-Cut algorithm to find a mapping of task modules on heterogeneous processors as an extension of Stone's model [32]. The algorithm combines recursive invocations of the Max-Flow/Min-Cut algorithm with a greedy algorithm to find a suboptimal assignment of task modules on heterogeneous processors. It has a complexity equal to $O(M^2 N E_p \log M)$, where $M$, $N$ and $E_p$ are the number of parallel tasks, processors and communication links among processors, respectively. This algorithm presents two major drawbacks: high time complexity and non-treatment of task data dependencies.

In [18], Shen and Tsai proposed the A* Searching algorithm, which exploits a graph matching model called weak homomorphism to map a task graph to a system graph. The search for the optimal weak homomorphism corresponds to the task assignment with minimum task turnaround time. It is formulated as a state-space search problem solved using the A* algorithm [33]. The latter is a branch and bound algorithm and, therefore, due to its time complexity, it is suitable only for small problems.

In [34] the Levelized-Min Time (LMT) algorithm, based on list scheduling heuristics, was proposed. The algorithm is structured in two phases. In the first phase it groups the tasks that can be executed in parallel, and in the second phase each task is assigned to the processor that minimizes the completion time of the whole application. Within the same group, the task with the highest computation cost has the highest selection priority. The time complexity of LMT is $O(n^2 p^2)$, where $n$ and $p$ are the number of tasks and the number of processors, respectively.

In [29], Topcuoglu et al. proposed the Heterogeneous Earliest-Finish-Time (HEFT) and Critical-Path-on-a-Processor (CPOP) algorithms, two list scheduling-based algorithms to map parallel applications, represented by DAGs, on a bounded number of heterogeneous processors. The algorithms are structured in two phases. In the former, HEFT computes the upward rank value of each task, while in the latter it selects the task with the highest upward rank value and assigns the selected task to the processor which minimizes its earliest finish time. On the other hand, in the first phase CPOP computes the upward and downward rank values of each task, and uses the summation of the upward and downward rank values to prioritize the tasks on the critical path. The algorithms have time complexity equal to $O(e \cdot p)$ for $e$ graph edges and $p$ processors. HEFT time complexity is equal to $O(n^2 p)$ ($n$ is the number of tasks) for dense application graphs, where the number of edges is proportional to $n^2$. Even if both algorithms are able to find good scheduling solutions with fast execution times, HEFT finds better scheduling solutions than CPOP.
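For reference, the upward and downward rank values on which HEFT and CPOP rely are usually defined as follows (this is the standard formulation of [29], restated here for the reader's convenience; $\overline{w_i}$ and $\overline{c_{i,j}}$ denote the computation cost of task $n_i$ and the communication cost of edge $(i,j)$ averaged over the processors):

$$\mathrm{rank}_u(n_i) = \overline{w_i} + \max_{n_j \in \mathrm{succ}(n_i)} \left( \overline{c_{i,j}} + \mathrm{rank}_u(n_j) \right), \qquad
\mathrm{rank}_d(n_i) = \max_{n_j \in \mathrm{pred}(n_i)} \left( \mathrm{rank}_d(n_j) + \overline{w_j} + \overline{c_{j,i}} \right)$$

with $\mathrm{rank}_u$ of the exit task equal to its average computation cost and $\mathrm{rank}_d$ of the entry task equal to zero.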
3. PROBLEM DESCRIPTION

The proposed heuristics assumes that the parallel application is decomposed into multiple tasks whose data dependencies are known, and that the parallel application is modeled by a DAG. It is also assumed that the task computation and communication costs are known before the application execution, and that task execution is not preemptive. Task computation and communication costs are usually obtained by estimation, using profiling information of operations such as numerical operations, memory access operations and message-passing primitives [35]. Assumptions like these are typically made in this type of study [3,24,29].

The DAG representing an application is denoted as $TG = (N, E)$, where $N$ is a set of nodes and $E$ is a set of directed edges. A node represents a computational task that can be sequential or parallel, called atomic and hierarchical, respectively.

Figure 1. DAG representing a parallel application.

Atomic nodes are indivisible units of execution and have an associated weight representing the estimated task computational cost. Hierarchical nodes are sets of atomic and/or hierarchical nodes structured according to the parallelism forms used to implement a parallel program. Parallelism forms can be nested one inside the other. Hierarchical nodes are represented with a DAG as well, whose structure respects the order in which the parallelism forms are used in a parallel task. From the point of view of task mapping, a hierarchical node is considered to be atomic. In Figure 1, an example of a DAG representing a parallel application is shown.

The application DAG is considered to be structured in $S$ layers that represent the depth of the nested levels. In the first layer it consists of only one hierarchical node, which is considered to be atomic, and in the next layers it may consist of atomic and/or hierarchical nodes represented with a DAG denoted as $TG_s$, with $1 \le s \le S$. In the last layer all nodes are atomic. The introduction of the layer concept allows us to find the layer whose mapping guarantees the shortest execution times of the nodes it contains. Although the problem partitioning is known a priori in the static mapping model, the elaboration of the layers in top-down order allows us to find the size of the grain that better exploits the computation and communication characteristics of the target HC platform.

A generic node, atomic or hierarchical, belonging to a graph $TG_s$ is identified as $n_i^s$, with $1 \le i \le N_s$ and $N_s$ the number of nodes at layer $s$. An edge $e_{i,j}$ represents a data dependence between nodes $n_i^s$ and $n_j^s$, with $i \neq j$, and it has an associated weight specifying the amount of data exchanged between the nodes.
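To fix ideas, the task graph just defined can be captured with a very small data structure. The following Python sketch is purely illustrative and is not part of HMM: the class and field names (Node, TaskGraph, comp_cost, subgraph, etc.) are our own assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class Node:
    """Atomic or hierarchical node of the application graph TG (illustrative)."""
    name: str
    comp_cost: float                          # estimated computational cost C_n
    model: str = "sequential"                 # computational model: "sequential", "SIMD", "MIMD", ...
    memory: float = 0.0                       # memory required
    software: str = ""                        # software required
    procs_required: int = 1                   # processors needed (meaningful for hierarchical nodes)
    subgraph: Optional["TaskGraph"] = None    # nested DAG of a hierarchical node (next layer)

    def is_hierarchical(self) -> bool:
        return self.subgraph is not None

@dataclass
class TaskGraph:
    """TG_s: a set of nodes plus weighted edges e_{i,j} (amount of data exchanged)."""
    nodes: List[Node] = field(default_factory=list)
    edges: Dict[Tuple[str, str], float] = field(default_factory=dict)   # (parent, child) -> data volume

    def parents(self, node: Node) -> List[Node]:
        """Backward star BS(n): the nodes on which `node` depends."""
        names = {p for (p, c) in self.edges if c == node.name}
        return [n for n in self.nodes if n.name in names]
```

A hierarchical node simply carries the nested DAG of the next layer, so expanding a layer amounts to replacing such nodes with their subgraph.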

Figure 2. Layers inside a TG graph representing a parallel application.

In Figure 2, an example of a three-stage expansion of a hierarchical node is given. The first layer consists of a hierarchical node; the second layer is represented by a task graph of four nodes, one atomic and three hierarchical (a pipeline and two task farms); and in the last layer the task graph includes 14 atomic nodes.

In order to reduce the mapping complexity, a $TG_s$ is horizontally divided into $F$ phases, which are executed sequentially (top-down) one at a time. The nodes in the first phase do not have parents, while the nodes in a phase are those with at least one parent among the nodes in a previous phase. By structuring $TG_s$ into phases, the precedence relations among its nodes reflect the execution sequence of the phases in the application graph, and the mapping problem is simplified because HMM finds the initial suboptimal mapping solution by operating on each phase separately. The initial problem is thus decomposed into more manageable subproblems. Since the phases are linked by the edges, to map the $h$th phase HMM considers the data exchanged between the nodes belonging to phase $h$ and the nodes of a phase $k$ with $k < h$. The nodes in a phase can be executed in parallel, and a node in a phase is executed when all its predecessors have been completed and it has received all the data needed for its execution.

A HC system is represented by a fully connected undirected graph denoted as $MG$. Each node of $MG$ represents a machine of the HC system, denoted as $m$ with $1 \le m \le M$, where $M$ is the number of machines in the HC system. Each node is associated with a number specifying the computational speed of the machine, and each edge denotes a link between two machines and has an associated value which specifies its communication bandwidth. A vector $V_m$, with size equal to the number of processors $P_m$ of $m$, is associated with each machine in order to find the processors available on a machine $m$ at any time. The $u$th entry of this vector contains the time at which the $u$th processor of $m$ will be available to execute any new work. HMM updates the vector $V_m$ on the basis of the workload due to the application tasks allocated on each processor of machine $m$.
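The machine graph and the availability vector $V_m$ can be sketched in the same illustrative style (again, Machine and its fields are hypothetical names of ours, not HMM's code):

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Machine:
    """Node of the system graph MG (illustrative; field names are our assumptions)."""
    name: str
    speed: float                     # computational power S_m (assumed equal for all its processors)
    arch_class: str                  # architectural class, e.g. "MIMD", "SIMD", "sequential"
    memory: float                    # memory available
    software: Set[str] = field(default_factory=set)       # software available
    available: List[float] = field(default_factory=list)  # V_m: per-processor availability times
    workload: float = 0.0            # W_m: workload induced by the tasks already mapped

    @property
    def procs_available(self) -> int:
        return len(self.available)

    def earliest_processor(self):
        """Index k and ready time of the processor that frees up first (the waiting time stored in V_m)."""
        k = min(range(len(self.available)), key=self.available.__getitem__)
        return k, self.available[k]

    def assign(self, k: int, finish_time: float, cost: float) -> None:
        """Update V_m and W_m after a task has been mapped onto processor p_k^m."""
        self.available[k] = finish_time
        self.workload += cost
```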

To enable us to compute the affinity parameter, it is useful to characterize both the applications and the HC systems by relating them to some real-world parameters. Several different parameters can be used to characterize application tasks and machines so as to select the HC system machine most suitable to execute a task. In this work, without loss of generality, we associate with each atomic node the computational model adopted for its implementation (e.g. SIMD, MIMD, sequential), and the amount of memory and the software required for its execution. The same parameters are used to characterize a hierarchical node. A hierarchical node is also associated with the number of processors needed to run its atomic nodes in parallel. We assume that these parameters can be specified by the programmer and/or provided by the programming tools used to implement a parallel application. Each $MG$ node is associated with its architectural class, the software and memory available, the number of processors, and the processor power.

HMM adopts the following strategy to carry out a suboptimal solution. Starting from node $n_1^1$ (the node at layer one), HMM finds the most suitable machine for its execution on the basis of the affinity parameter value, the node computational cost, and the computational power of the machine. HMM estimates the execution time of the task graph $TG_1$, which consists of node $n_1^1$, on the selected machine, and then $TG_1$ is expanded in the next layer, producing the task graph $TG_2$. To find the best mapping of the nodes in the second layer, HMM divides $TG_2$ into phases and, starting from the first, it estimates the execution time of each $TG_2$ node. To do this, the nodes inside a phase are sorted in descending order with respect to their computational cost and, starting from the heaviest, the execution time of each node is estimated on the machine selected on the basis of machine power and workload, node computational cost, amount of data exchanged between a node and its parents, and affinity parameter value. When no machine is found to allocate a node, it is flagged as not allocated, and it will be allocated at the first subsequent layer for which a suitable machine is available. Then, the estimated execution time of $TG_2$ is compared with the one found for $TG_1$. The shortest one leads to the current initial mapping solution. This process is repeated for all the remaining $S - 2$ layers of the application graph, to find the $TG_s$ leading to the initial mapping solution with the shortest application execution time. Then, to refine the initial mapping solution, a local search algorithm is adopted together with the tabu search meta-heuristic. The suboptimal mapping solution is found by keeping the task communication costs as low as possible, and by trying to make a good trade-off between the communication costs and the workload balancing among the machines of the target HC platform.

4. ALGORITHM DESCRIPTION

HMM takes in input the TG and MG graphs, which represent a parallel application and a HC system, respectively. The algorithm is structured in two phases. In the former an initial mapping solution is carried out. In the latter, to improve the quality of the initial solution, a local search function is applied to the atomic nodes in the critical path of the $TG_s$ corresponding to the initial solution. Figure 3 shows the pseudo-code of the HMM first phase, which operates according to the following steps.

Step 1 computes the affinity value Aff of all the $n_i^s$ nodes belonging to each layer with respect to all the $M$ machines. The affinity value of a node is computed by considering both the required computational resources and the adopted computational model.
When a machine $m$ does not provide the memory and software requirements, or the computational model, needed to run a node $n_i^s$, the affinity parameter is set equal to zero. Otherwise, it assumes a value in the range $[0, 1]$, computed as a function of the number of processors available on $m$ and the number of processors required to run $n_i^s$.
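As an illustration only, the affinity rule just described (and spelled out in the Set affinity pseudo-code of Figure 4 below) could be coded as follows, reusing the hypothetical Node and Machine sketches given above:

```python
def affinity(node, machine) -> float:
    """Affinity of node n_i^s with machine m (illustrative sketch of the rule in Figure 4).

    Returns 0 when the machine cannot satisfy the node's requirements; otherwise a
    value in [0, 1] that decreases when fewer processors are available than required.
    """
    if node.memory > machine.memory:            # not enough memory
        return 0.0
    if node.software and node.software not in machine.software:   # required software missing
        return 0.0
    if node.model != machine.arch_class:        # computational model mismatch
        return 0.0
    if node.procs_required > machine.procs_available:
        return machine.procs_available / node.procs_required
    return 1.0
```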

HMM_Initial_Solution(TG)
Begin
Step 1:  SET_AFFINITY(TG)
         INITIALIZE S_init
         LOOP on the layers s (1 <= s <= S)
           ALLOCATE(TG_s)
           Begin
Step 2:      LOOP on the nodes n_i^s (1 <= i <= N_s)
               DIVIDE the TG_s graph into F phases
             LOOP on the phases k (1 <= k <= F)
Step 3:        SORT the nodes n_i^s within phase k in descending order with respect to C_{n_i^s}
               LOOP on the ordered nodes n_i^s of phase k
Step 4:          LOOP on the machines m (1 <= m <= M)
                   FIND the best machine m to execute the node n_i^s
Step 5:          IF at least one machine m is found THEN
                   COMPUTE the completion time T_e of the node n_i^s on machine m
                 ELSE
                   FLAG the node n_i^s as not allocated
           End
Step 6:    LOOP on the nodes n_i^s (n_i^s in {not allocated} nodes)
             WHILE node n_i^s is flagged not allocated
               s = s + 1
               ALLOCATE(n_i^s)
Step 7:    IF completion time of the TG_s solution (S_s) < completion time of S_init THEN S_init = S_s
End

Figure 3. Pseudo-code of the first phase of HMM.

HMM ends if at least one atomic node has an affinity equal to zero with all the HC system machines. Figure 4 shows the pseudo-code of the Set affinity function used to compute the affinity parameter.

Step 2 divides all the nodes of a $TG_s$ into $F$ phases. The nodes in the first phase do not have parents, while the nodes belonging to a phase $i$, with $1 < i \le F$, are those that have at least one parent among the nodes in a phase $j$ with $j < i$. In Figure 5 an example of the subdivision into phases of a hierarchical node at layer 2 is shown.

Step 3 arranges the nodes in a phase in descending order with respect to their computational cost. This ordering is used in the next step to select the computationally heavier nodes first. It improves the probability of the computationally heavier hierarchical nodes being allocated on the same machine.
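Steps 2 and 3 can be pictured with the following illustrative Python fragment, which builds the phases level by level from the parent relation (one natural reading of the rule above) and then sorts each phase by decreasing computational cost; it reuses the hypothetical TaskGraph sketch of Section 3 and is not the authors' code.

```python
def divide_into_phases(tg):
    """Step 2 (illustrative): split TG_s into phases.

    Phase 1 holds the nodes without parents; a node enters the first phase that
    follows all of the phases containing its parents.
    """
    remaining = {n.name for n in tg.nodes}
    placed = set()
    phases = []
    while remaining:
        phase = [n for n in tg.nodes
                 if n.name in remaining
                 and all(p.name in placed for p in tg.parents(n))]
        if not phase:
            raise ValueError("cycle detected: TG_s is not a DAG")
        phases.append(phase)
        placed.update(n.name for n in phase)
        remaining.difference_update(n.name for n in phase)
    return phases

def sort_phase(phase):
    """Step 3 (illustrative): heaviest nodes first, so they are allocated first."""
    return sorted(phase, key=lambda n: n.comp_cost, reverse=True)
```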

SET_AFFINITY(TG)
Begin
  LOOP on the layers s (1 <= s <= S)
    LOOP on the nodes n_i^s (1 <= i <= N_s)
      LOOP on the machines m (1 <= m <= M)
        Aff(n_i^s, m) = 0
        IF memory required by n_i^s <= memory size of m THEN
          IF software required by n_i^s = software available on m THEN
            IF computational model required by n_i^s = architectural class of m THEN
              Aff(n_i^s, m) = 1
              IF processors required by n_i^s > processors available on m THEN
                Aff(n_i^s, m) = processors available on m / processors required by n_i^s
End

Figure 4. Pseudo-code of the Set affinity function.

Figure 5. Example of a subdivision into phases of a hierarchical node at layer 2.

Step 4 selects, according to the ordering given by the previous step, a node $n_i^s$ which has not yet been analyzed and finds a machine $m$ that can execute it. In order to select a machine, expression (1) is applied to all the $M_{\mathit{Aff}}$ machines with which the selected node has a value of the affinity parameter greater than zero:

$$\min_{m} \left( \frac{W_m + C_{n_i^s}}{\mathit{Aff}(n_i^s, m)\, S_m} + \max_{j} \left\{ \frac{D_c(n_i^s, n_j^s)}{B_{m,n}} \right\} \right), \qquad m \in \{1, 2, \ldots, M_{\mathit{Aff}}\},\; n \in \{1, 2, \ldots, M\},\; j \ \text{with}\ n_j^s \in BS(n_i^s) \tag{1}$$

where $W_m$ is the workload of $m$ due to the previously allocated tasks, $S_m$ the computational power of $m$, and $C_{n_i^s}$ the computational cost of $n_i^s$. $D_c$ is the amount of data exchanged between node $n_i^s$ and a node $n_j^s$ belonging to its backward star $BS(n_i^s)$ (i.e. its parents). $D_c$ is zero if $n_i^s$ and the parent with which it has the highest communication cost can be allocated on the same machine. $B_{m,n}$ is the communication bandwidth of the link connecting the machine $m$ on which $n_i^s$ will be allocated and the machine $n$ on which $n_j^s$ has been allocated.
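Read operationally, expression (1) amounts to something like the following illustrative sketch (reusing the hypothetical Node/Machine classes and the affinity function above, plus an assumed bandwidth(m, n) helper); the $W_m/S_m$ tie-break discussed in the next paragraph is also included:

```python
def select_machine(node, machines, tg, alloc, bandwidth):
    """Step 4 (illustrative sketch of expression (1)).

    tg:        the TaskGraph containing `node` (for edges and parents)
    alloc:     dict node name -> Machine on which that node has been allocated
    bandwidth: function (machine_a, machine_b) -> link bandwidth B_{m,n}
    """
    best, best_cost = None, float("inf")
    for m in machines:
        aff = affinity(node, m)
        if aff == 0.0:
            continue                               # only the M_Aff machines are considered
        compute = (m.workload + node.comp_cost) / (aff * m.speed)
        comm = 0.0
        for parent in tg.parents(node):            # backward star BS(n_i^s)
            pm = alloc.get(parent.name)
            if pm is None or pm is m:
                continue                           # simplified: D_c = 0 when the parent is co-located
            d_c = tg.edges.get((parent.name, node.name), 0.0)
            comm = max(comm, d_c / bandwidth(m, pm))
        cost = compute + comm
        # tie-break: the machine with the smallest W_m / S_m ratio wins
        if cost < best_cost or (cost == best_cost and best is not None
                                and m.workload / m.speed < best.workload / best.speed):
            best, best_cost = m, cost
    return best        # None means the node must be flagged as "not allocated"
```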

The machine corresponding to the minimum value computed by (1) is selected to allocate node $n_i^s$. When the same minimum value corresponds to several machines, HMM selects the machine $m$ corresponding to the smallest value of the ratio $W_m / S_m$. If no machine is found, the hierarchical node $n_i^s$ is flagged as not allocated and the next node is analyzed. Otherwise, step 5 is executed to estimate the execution time of $n_i^s$ on the candidate machine. At the first layer $W_m$ and $D_c$ are equal to zero; therefore, in order to select the most suitable machine to allocate node $n_1^1$, only the task computational cost, the affinity parameter and the machine computational power are considered.

Step 5 estimates the overall completion time of a node $n_i^s$ on the selected machine $m$, according to the node's precedence relations. If $n_i^s$ is a hierarchical node, all the atomic nodes it contains are allocated on the machine processors by using expression (2), and its execution time is equal to the execution time of the atomic node that is the last to finish:

$$\min_{k} \left( \frac{W_{p_k^m} + C_{n_i^s}}{S_{p_k^m}} + \max_{j} \left\{ \frac{D_c(n_i^s, n_j^s)}{B_m} \right\} \right), \qquad k \in \{1, 2, \ldots, P_m\},\; j \ \text{with}\ n_j^s \in BS(n_i^s) \tag{2}$$

where $S_{p_k^m}$ is the computational power of the $k$th processor of the machine $m$ on which $n_i^s$ will be allocated, $W_{p_k^m}$ the processor workload, and $B_m$ the machine network bandwidth.

The atomic node execution time $T_e$ is computed by applying Equation (3):

$$T_e(n_i^s, m) = T_w(n_i^s, m) + T_x(n_i^s, m) \tag{3}$$

where $T_w(n_i^s, m)$ is the amount of time the node $n_i^s$ has to wait before executing on the machine $m$, and $T_x(n_i^s, m)$ is the execution time of $n_i^s$ on $m$. $T_w(n_i^s, m)$ is given by

$$T_w(n_i^s, m) = \max\{ T_c(BS(n_i^s)),\; T_a(n_i^s, p_k^m) \} \tag{4}$$

where $T_c$ is the waiting time due to the communications of $n_i^s$ with the nodes belonging to $BS(n_i^s)$, and $T_a(n_i^s, p_k^m)$ is the waiting time to get the $k$th processor $p$ of the machine $m$ needed to execute $n_i^s$. $T_a$ is obtained from the vector $V_m$, where the processors are stored in increasing order with respect to the completion time of the tasks executed. The $u$th entry of $V_m$ stores the time at which the $u$th processor becomes available. The waiting time $T_c(BS(n_i^s))$ is given by

$$T_c(BS(n_i^s)) = \max_{j} \left\{ T_{\mathrm{end}}(n_j^s) + \frac{D_c(n_i^s, n_j^s)}{B_{m,n}} \right\}, \qquad j \ \text{with}\ n_j^s \in BS(n_i^s) \ \text{and}\ n \in \{1, 2, \ldots, M\} \tag{5}$$

where $T_{\mathrm{end}}$ is the time needed by node $n_j^s$ to end its execution. When $n_i^s$ and $n_j^s$ are allocated onto processors of the same machine (e.g. a multicomputer), $B_{m,n}$ is the machine network bandwidth. $T_x(n_i^s, m)$ is computed as

$$T_x(n_i^s, m) = \frac{C_{n_i^s}}{S_{p_k^m}} \tag{6}$$
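Similarly, Equations (3)-(6) can be summarized by the following illustrative fragment (the end_time and bandwidth helpers, and the use of a single per-machine speed, are assumptions of ours):

```python
def completion_time(node, machine, k, tg, alloc, end_time, bandwidth):
    """Step 5 (illustrative): T_e = T_w + T_x for node n_i^s on processor p_k^m, per Equations (3)-(6).

    end_time:  dict node name -> T_end of an already scheduled node
    bandwidth: function (machine_a, machine_b) -> B_{m,n}; for two nodes on the same
               machine it should return the machine network bandwidth B_m
    """
    # Equation (5): waiting time for the data coming from the parents in BS(n_i^s)
    t_c = 0.0
    for parent in tg.parents(node):
        pm = alloc[parent.name]
        d_c = tg.edges.get((parent.name, node.name), 0.0)
        t_c = max(t_c, end_time[parent.name] + d_c / bandwidth(machine, pm))
    # T_a: time at which processor p_k^m becomes free, read from the vector V_m
    t_a = machine.available[k]
    t_w = max(t_c, t_a)                      # Equation (4)
    t_x = node.comp_cost / machine.speed     # Equation (6), assuming homogeneous processors per machine
    return t_w + t_x                         # Equation (3)
```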

HMM_Local_Search(TG_s)
Begin
  SET the initial solution as the best solution: S_best = S_init
  WHILE total iteration number K < K_max AND iterations without improvements I < I_max
    FIND the critical path of the current solution S_init
    LOOP on the nodes n_i^s of the critical path (1 <= i <= N_cp)
      FIND the best processor to execute the node n_i^s
      IF at least one processor is found THEN
        MOVE the node n_i^s to the selected processor
        COMPUTE the new solution S_cp
        IF completion time of S_cp < completion time of S_init THEN S_init = S_cp
    IF completion time of S_init < completion time of S_best THEN
      S_best = S_init
    ELSE
      INCREMENT the iterations without improvements: I = I + 1
    INCREMENT the total iteration number: K = K + 1
End

Figure 6. Pseudo-code of the second phase of the HMM algorithm.

When the list of nodes within the phase currently analyzed is empty, step 6 is performed. Otherwise, execution continues from step 4 to allocate the next node.

Step 6 checks whether there are any nodes flagged as not allocated. If there are, these are allocated at the first deeper layer for which there is a machine that can execute each related node. This is made possible by selecting each node according to the order established in step 3 and expanding it to the next layer. It is then allocated by adopting the procedure described in steps 2-5. If there are no nodes flagged as not allocated, this process ends by carrying out the initial partial solution of the phase analyzed, and the execution continues from step 3 to elaborate the next phase.

Step 7 analyzes all the initial partial mapping solutions to find the layer completion time ($S_s$), which corresponds to the last $TG_s$ node ending its execution. $S_s$ is then compared with the best initial mapping solution $S_{init}$ carried out in a previous layer. If $S_s$ is better than $S_{init}$, it becomes $S_{init}$, and execution continues from step 2 to analyze the next layer. When all the layers have been evaluated, the algorithm ends returning $S_{init}$.

Figure 6 shows the pseudo-code of the HMM second phase. It considers all the nodes in $S_{init}$ as atomic, and analyzes the neighborhood of $S_{init}$ to search for a better solution. We say that a neighborhood of a solution is the set of solutions obtainable by moving a node of the critical path from processor $p_i^m$ (its current allocation) to a processor $p_j^m$ (with $j \neq i$) or to $p_i^h$ (with $h \neq m$). The new processor is selected according to the following criterion:

$$\max \{ \mathit{Aff}(n_i^s, m) \cdot S_{p_j^m} \}, \qquad m \in \{1, 2, \ldots, M_{\mathit{Aff}}\},\; j \in \{1, 2, \ldots, P_m\} \tag{7}$$

If several processors are selected, the one with the smallest value of the ratio $W_{p_j^m} / S_{p_j^m}$ is chosen.
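The iteration of Figure 6 can be condensed into the following Python skeleton; it is only a sketch of the published pseudo-code, and the critical_path, makespan, candidate_processor and move helpers are hypothetical placeholders, not part of HMM:

```python
def local_search(solution, k_max, i_max, critical_path, makespan, candidate_processor, move):
    """Second phase of HMM (illustrative sketch of Figure 6).

    Repeatedly tries to move critical-path nodes to the processor maximizing
    Aff(n, m) * S_p (ties broken by the smallest workload/speed ratio), keeping
    the best solution seen so far; it stops after k_max iterations or i_max
    iterations without improvement.
    """
    best = solution
    k = i = 0
    while k < k_max and i < i_max:
        improved = False
        for node in critical_path(solution):
            proc = candidate_processor(node)      # criterion (7) + tie-break
            if proc is None:
                continue                          # the node cannot be moved
            trial = move(solution, node, proc)
            if makespan(trial) < makespan(solution):
                solution = trial
        if makespan(solution) < makespan(best):
            best = solution
            improved = True
        if not improved:
            i += 1
        k += 1
    return best
```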

If there is no processor able to allocate $n_i^s$, the node is not moved, and the process continues by selecting the next node in the critical path. Otherwise, the selected node $n_i^s$ is moved, and HMM estimates the application execution time corresponding to the new solution $S_{cp}$. When all the nodes on the critical path have been analyzed, the local search algorithm carries out a new solution, i.e. the solution corresponding to the smallest execution time among all the $S_{cp}$ solutions previously found. If it is better than the initial solution, it becomes the best solution. The search for a suboptimal solution is an iterative process. HMM ends by returning the final mapping solution when one of the following conditions is verified: (a) the total number of iterations performed reaches a user-defined threshold; (b) the number of iterations performed without improving the solution reaches a user-defined threshold.

4.1. Mapping example

In order to show the choices taken by HMM, an example is given of the mapping of a real quantum chemistry application on three different HC system configurations (see Figures 7(b), 8(b) and 9(b)). The chemistry application deals with the integration of a two-dimensional Schrödinger equation using a sector-diabatic coupled-channel approach [36]. In the following figures the application tasks are represented in a task graph by nodes labeled $t_i$. The number inside each circle representing an atomic node specifies the computational cost of the task, and the communication cost is specified by the number associated with each edge. In a system graph the nodes labeled $p_i$ represent the processors inside a machine. The number inside each circle specifies the computational power of the processor, and the communication bandwidth of the channel linking two processors or two machines is specified by the number associated with each link.

Figure 7(b) shows the HC system configuration in which all machines have processors with equal computational power and equal communication bandwidth. Figures 8(b) and 9(b) represent the two other configurations in which machine $m_1$, and machines $m_1$, $m_2$, $m_3$, $m_4$, respectively, have processors with half the computational power and half the communication bandwidth with respect to the same machines in the HC system graph of Figure 7(b). This situation can occur, for example, when a machine is loaded at 50%, and therefore we can suppose that its computational power is cut down by a half. In Figures 7, 8 and 9, the arrows together with the gray levels indicate the layer at which the initial mapping solution ($S_{init}$) was carried out, and the machines onto which the tasks were mapped.

Figure 7 shows the case in which the machine $m_1$ in the HC system is large enough to carry out the application mapping at layer 1, where the execution time corresponding to $S_{init}$ is equal to 79. Table I shows the values of the affinity parameter ($\mathit{Aff}$), machine computational power ($S_m$), machine workload ($W_m$), computational cost ($C_n$) of the hierarchical node $t_{19}$, and the ratio between the data exchanged ($D_c$) and the communication bandwidth ($B_m$) used to compute expression (1) to allocate the task graph of Figure 7 at layer 1. Table II shows the values used to compute expression (1) in order to allocate the same task graph at layer 2 on the HC system in Figure 8.
In the same way, expression (1) was computed to allocate at layer 3 the task graph of the example shown in Figure 9. In all of the examples the mapping corresponding to the smallest value of $S_{init}$ was chosen. Figures 7(c), 8(c) and 9(c) show the final suboptimal mapping solutions obtained after the local search process.

Figure 7. Mapping carried out at layer 1.

Figure 8. Mapping carried out at layer 2.

Figure 9. Mapping carried out at layer 3.

Table I. Values computed by HMM to evaluate the mapping at layer 1 of the application graph on the HC system shown in Figure 7. (Columns: Machine, $\mathit{Aff}$, $S_m$, $W_m$, $C_n$, $D_c/B_m$, expression (1).)

5. ALGORITHM COMPLEXITY

To evaluate the algorithm time complexity we consider the number of operations performed to carry out the initial mapping solution and to run the local search algorithm.

5.1. Initial solution

The cost $C(S_{init})$ of the initial solution is computed according to the number of operations executed to allocate the atomic and/or hierarchical nodes within all the layers of TG. The cost $C(S_s)$ to allocate the $N_s$ nodes of a $TG_s$ (a graph expanded at layer $s$) is given by

$$C(S_s) = \frac{N_s^3}{3} + F\, N_s \log_2 N_s + N_s\, M \tag{8}$$

where $N_s^3/3$ is the number of operations performed to divide $TG_s$ into $F$ phases, $N_s \log_2 N_s$ is the number of operations performed to sort the nodes within a phase (supposing that all the $TG_s$ nodes are in each phase), and $N_s\, M$ is the number of operations executed to choose a machine on which to allocate the $N_s$ nodes. Since the worst case, with as many layers as the number of atomic nodes, is rare, in order to compute $C(S_{init})$ we consider the average case in which $TG_s$ is structured in $\log_2 N + 1$ layers, each layer $i$ including $2^{(i-1)}$ nodes with $1 \le i \le \log_2 N + 1$. In this case $C(S_{init})$ is given by

$$C(S_{init}) = \sum_{i=1}^{\log_2 N + 1} C(S_i) \tag{9}$$

Since every $TG_i$ includes $2^{i-1}$ nodes, expression (9) can be written as

$$C(S_{init}) = \sum_{i=1}^{\log_2 N + 1} \left( \frac{(2^{(i-1)})^3}{3} + F\, 2^{(i-1)} \log_2 2^{(i-1)} + 2^{(i-1)} M \right) \tag{10}$$

Table II. Values computed by HMM to evaluate the mapping at layer 2 of the application graph on the HC system shown in Figure 8. (Columns: Machine, $\mathit{Aff}$, $(W_m + C_n)/(\mathit{Aff} \cdot S_m)$, $D_c/B_m$, expression (1); one group of rows for each of the nodes $t_{14}$-$t_{18}$.)

which is equal to

$$C(S_{init}) = \sum_{i=1}^{\log_2 N + 1} \left( \frac{2^{3(i-1)}}{3} + F\, 2^{(i-1)} \log_2 2^{(i-1)} + 2^{(i-1)} M \right) \tag{11}$$

which is equal to

$$C(S_{init}) = N^3 + F\, N \ln N + N\, M \tag{12}$$

5.2. Local search

The cost to find $S_{best}$ by using the local search function is computed according to the number of operations executed at each iteration to reallocate the atomic nodes in the critical path of a new solution. This is given by

$$C(S_{best}) = \frac{N^3}{3} + N \log_2 N + K \left( \frac{N(N-1)}{2} \right) + N_c\, P + N \tag{13}$$

where $N^3/3$ is the number of operations executed to divide the atomic nodes of TG into phases, $N \log_2 N$ is the number of operations performed to sort the nodes within a phase (we suppose that there are $N$ nodes in each phase), $K\,(N(N-1)/2)$ is the number of operations to find the critical path ($K$ is the number of iterations), $N_c\, P$ are the operations executed to reallocate the $N_c$ nodes in the critical path onto the $P$ HC system processors, and $N$ is the cost paid to compute the end time of all the atomic nodes belonging to a new solution.

Since expressions (12) and (13) can be approximated to $O(N^3)$, the algorithm time complexity, in the average case, results to be $O(N^3)$.
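For completeness, the step from (11) to (12) rests on standard geometric-series identities; writing $L = \log_2 N + 1$ and ignoring the constant factors dropped in the text:

$$\sum_{i=1}^{L} 2^{3(i-1)} = \frac{8^{L} - 1}{7} = \frac{8N^3 - 1}{7} = O(N^3), \qquad
\sum_{i=1}^{L} 2^{(i-1)} = 2^{L} - 1 = 2N - 1 = O(N), \qquad
\sum_{i=1}^{L} (i-1)\,2^{(i-1)} = O(N \log_2 N)$$

which yields the $N^3$, $F\,N\ln N$ and $N\,M$ terms of (12).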

6. PERFORMANCE EVALUATION

Comparing mapping heuristics is generally a difficult task because of the different underlying assumptions made in the original studies of each heuristic [31]. The HMM assessment was made possible by comparison with six other graph-based mapping algorithms, namely the Cluster-M mapping algorithm (CM), Augmented Cluster-M mapping (ACM), Lo's Max-Flow/Min-Cut (MFMC), Shen and Tsai's A* Searching (AS), Heterogeneous Earliest-Finish-Time (HEFT) and Levelized-Min Time (LMT), and with an exhaustive mapping algorithm (EM). EM carries out all the possible $P^N$ mappings, where $N$ is the number of atomic tasks of an application and $P$ is the number of machine processors in the HC system. Moreover, two more tests were conducted by using a quantum chemistry application [37] and a molecular dynamics Monte Carlo application [38].

The first four tests used as input the application DAGs and the HC system graphs from [3], and the results are shown in Gantt charts. For each task $t_i$ the machine $m$ on which it was allocated and the units of time required for its execution (numbers at the top of the chart) are shown. Figure 10 shows the application DAG and the metacomputer graphs used in the first three tests, performed to compare HMM with CM and EM. Figures 11(a), (b) and (c) show the mapping of the application DAG of Figure 10(a) on the HC systems 1, 2 and 3 of Figure 10(b), respectively.

Figure 10. Application DAG and HC system graphs used in tests 1, 2 and 3.

Figure 11. Gantt charts describing the mapping carried out in tests 1, 2 and 3.

Figure 12. Application DAG and HC system graph used in test 4.

In the first test (see Figure 11(a)) HMM converged to the optimal solution by analyzing only 25 mappings out of the 2187 possible mappings analyzed by EM. The solution was obtained by allocating the graph expanded at the last layer. In order to reduce the task communications, $t_0$ and $t_1$ were allocated onto the same machine $m_1$. In the second test (see Figure 11(b)) HMM found a suboptimal mapping corresponding to a total application execution time of 15.1 units, which is worse than the optimal mapping by about 0.7%. It was carried out by allocating the more communicating tasks onto the same machine. HMM converged to the suboptimal solution after 27 possible mappings had been analyzed. In the third test (see Figure 11(c)) HMM converged to the optimal solution after 16 possible mappings, by allocating all the application tasks onto the highest-speed machine. In all three tests the HMM execution required about 10 ms.

Figure 12 shows the application and the HC system graphs used in test four to compare HMM with CM, ACM, MFMC and EM. The application consists of an MIMD code and a vector code (Gaussian elimination). The HC system used in the test models an MIMD and a vector machine. As in [3], we defined the HMM affinity parameter so that the MIMD code was allocated on the MIMD machine and the vector code on the vector machine.

Figure 13. Gantt charts describing the mapping carried out in test 4.

For a better view of the total application execution time, the execution times of both code types were placed in the same Gantt chart, even if MFMC, AS and EM do not treat heterogeneity in computation and machine types. Their mapping results were obtained by mapping the MIMD and the vector code separately onto the corresponding type of processor. As shown in Figure 13, HMM and ACM carried out the same mapping, which is worse than the optimal mapping by 1.6%. In order to map the vector code, EM analyzed all the possible mappings to seek the optimal solution. HMM converged to the suboptimal solution by analyzing only 19 of the possible mappings. The HMM execution required 40 ms, while the EM execution required 1350 ms. Mapping the MIMD code, HMM converged to the suboptimal solution by analyzing only 21 of the possible mappings analyzed by EM. The HMM suboptimal solution is 3.6% worse than the optimal mapping. The HMM execution required 10 ms, while the EM execution required 600 ms. The longer execution time required by the exhaustive algorithm to map the vector code with respect to the MIMD code was due to the longer time required to compute the overall application execution time of the vector task.

The fifth test was conducted to compare HMM with the HEFT and LMT algorithms when used to map a real chemical parallel application, which exploits the Monte Carlo method to model gas reactive systems [38], on a heterogeneous multi-cluster computing platform. The chemical process simulated is made up of a large set of independent tasks of unpredictable duration.

Figure 14. Results obtained by HMM, HEFT and LMT when mapping the Monte Carlo application.

The task farm parallelism form was adopted to implement the parallel program used to run the simulation. In our test a task farm with 1000 slave nodes was used, and 10 runs were executed by varying the number of tasks, starting from $10^3$. Task computational and communication costs were randomly fixed according to a uniform distribution, with mean and variance defined on the basis of results obtained by real executions of the chemical application. The HC system used in this test consisted of three clusters with the following configurations: (a) four workstations with processor powers of 25, 20, 20 and 15 Mflop/s, connected by a 1 Mbit/s Ethernet; (b) three workstations with processor powers of 25, 15 and 12 Mflop/s, connected by a 0.75 Mbit/s Ethernet; (c) four workstations with processor powers of 20, 15, 12 and 10 Mflop/s, connected by a 0.8 Mbit/s Ethernet. All the clusters were considered to be connected by a 0.25 Mbit/s LAN.

Figure 14 shows the application makespans obtained by HMM, HEFT and LMT when mapping the parallel program on the target HC system, increasing the number of tasks elaborated by each slave in each run. Each value was obtained by running the mapping algorithms 10 times using different randomly generated sets of communication and computation costs. Although in [29] Topcuoglu et al. showed that HEFT performs better than the critical-path-based mapping algorithm they proposed, on this kind of application, as shown in Figure 14, HMM obtained better results than HEFT.

The last test was performed to evaluate HMM when used to map the real parallel quantum chemistry application cited in Section 4.1 (but implemented using a different parallel programming model). The application execution times and the machine utilization obtained by mapping the application using HMM were compared with those obtained by simulating the execution of the application on the same target HC system. The chemical simulation is composed of a set of independent processes, each computing the propagation of particles in environments with different physical and chemical conditions.

Figure 15. DAG representing the chemical parallel application.

The DAG in Figure 15 represents the parallel program used to perform a simulated process. A detailed description of this parallel application can be found in [37]. In the DAG, the computational cost associated with each task was derived by analyzing previous works [37,39]; the computational and communication costs of the tasks labeled $t_7$-$t_{11}$ and of the related edges, respectively, vary by ±20% with respect to the value of the physical entities (i.e. energy) used in a simulation. The HC system used to conduct our experiments consisted of an eight-node machine with a processor power of 35 MFlop/s, fully connected at 8 Mbit/s; a 64-node hypercube multicomputer with a processor power of 6 MFlop/s and a network bandwidth of 2.4 Mbit/s; and a cluster of four workstations with a processor power of 25 MFlop/s. All of the machines were connected by a 0.25 Mbit/s LAN. Previous works [37,39] demonstrated that, for this application, good execution times were obtained by running the application structured according to a task farm programming model on clusters of processors. Therefore, the following configuration was used for our tests: (a) the eight-node machine, partitioned into two clusters of four nodes each (machine 0); (b) the 64-node hypercube multicomputer, partitioned into four clusters of 16 nodes each (machine 1); (c) one cluster of four workstations (machine 2).

In order to simulate the application execution on this HC system configuration we used a program (called Task-Farm) implementing a task farm to dynamically assign the parallel tasks to the clusters and to estimate the application execution time. Figure 16 shows the schedule lengths obtained by both HMM and the Task-Farm program elaborating problems of dimension from one to 32 chemical processes. Since the values assigned to tasks $t_7$-$t_{11}$ and to the related edges were randomly generated according to a normal distribution in the interval ±20%, each run was performed 10 times, and the average execution time related to each problem dimension was plotted.

Figure 16. Application execution times obtained by using HMM and Task-Farm.

Figure 17. Machine utilization obtained by mapping the application with HMM.

As can be seen in Figure 16, the application execution time noticeably improves when the problem dimension increases; with a dimension equal to 32, using HMM we obtain an application schedule length equal to 255 units, compared with 660 units obtained using the Task-Farm program. Figures 17 and 18 show the workload distribution on the HC system machines: when the number of chemical processes simulated is greater than 16, the use of HMM leads to a smaller utilization of the more powerful machine 0 and a higher utilization of the other machines. This better workload distribution leads to a smaller execution time with respect to that obtained by simulating the application execution. All tests were conducted on a dual-processor Pentium III 800 MHz with 512 MBytes of RAM.

Figure 18. Machine utilization obtained by simulating the application execution.

7. CONCLUSIONS

In this work we have presented a new static graph-based mapping algorithm, which allows us to find a suboptimal mapping of a parallel application onto a HC system in order to minimize the total application execution time. The algorithm exploits the information embedded in the parallelism forms used to implement an application, and uses a local search technique together with the tabu search meta-heuristic. In order to evaluate the quality of the mappings performed, several tests were conducted using: (a) case studies previously used to evaluate other static mapping algorithms employing different leading techniques; (b) two chemical applications; (c) an exhaustive mapping algorithm. The metric used to compare the mappings obtained was the application makespan. The test results show that our algorithm performs better than, or as well as, the other mapping algorithms, and that it obtains good results when mapping two large real applications on heterogeneous platforms composed of many processors. Comparisons with the mappings obtained by an exhaustive mapping algorithm show that our algorithm always carried out mappings which are very close to the optimal solution. This is because HMM allocates parallel tasks in a way that optimizes the execution of each form of parallelism used within a parallel application. The mapping quality carried out by HMM can be enhanced by introducing several improvements, such as the exploitation of the native mapping mechanisms available on each HC system machine and a better estimation of the code-machine affinity values by using data from specific benchmarks. In addition, the introduction of dynamic mapping functionalities to permit the migration of parallel tasks among the HC system machines can enhance the throughput of the HC system.


More information

SMD149 - Operating Systems - Multiprocessing

SMD149 - Operating Systems - Multiprocessing SMD149 - Operating Systems - Multiprocessing Roland Parviainen December 1, 2005 1 / 55 Overview Introduction Multiprocessor systems Multiprocessor, operating system and memory organizations 2 / 55 Introduction

More information

Overview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy

Overview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy Overview SMD149 - Operating Systems - Multiprocessing Roland Parviainen Multiprocessor systems Multiprocessor, operating system and memory organizations December 1, 2005 1/55 2/55 Multiprocessor system

More information

A Framework for Space and Time Efficient Scheduling of Parallelism

A Framework for Space and Time Efficient Scheduling of Parallelism A Framework for Space and Time Efficient Scheduling of Parallelism Girija J. Narlikar Guy E. Blelloch December 996 CMU-CS-96-97 School of Computer Science Carnegie Mellon University Pittsburgh, PA 523

More information

Data Flow Graph Partitioning Schemes

Data Flow Graph Partitioning Schemes Data Flow Graph Partitioning Schemes Avanti Nadgir and Harshal Haridas Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802 Abstract: The

More information

Bi-Objective Optimization for Scheduling in Heterogeneous Computing Systems

Bi-Objective Optimization for Scheduling in Heterogeneous Computing Systems Bi-Objective Optimization for Scheduling in Heterogeneous Computing Systems Tony Maciejewski, Kyle Tarplee, Ryan Friese, and Howard Jay Siegel Department of Electrical and Computer Engineering Colorado

More information

Toward the joint design of electronic and optical layer protection

Toward the joint design of electronic and optical layer protection Toward the joint design of electronic and optical layer protection Massachusetts Institute of Technology Slide 1 Slide 2 CHALLENGES: - SEAMLESS CONNECTIVITY - MULTI-MEDIA (FIBER,SATCOM,WIRELESS) - HETEROGENEOUS

More information

Indexing by Shape of Image Databases Based on Extended Grid Files

Indexing by Shape of Image Databases Based on Extended Grid Files Indexing by Shape of Image Databases Based on Extended Grid Files Carlo Combi, Gian Luca Foresti, Massimo Franceschet, Angelo Montanari Department of Mathematics and ComputerScience, University of Udine

More information

MIRROR SITE ORGANIZATION ON PACKET SWITCHED NETWORKS USING A SOCIAL INSECT METAPHOR

MIRROR SITE ORGANIZATION ON PACKET SWITCHED NETWORKS USING A SOCIAL INSECT METAPHOR MIRROR SITE ORGANIZATION ON PACKET SWITCHED NETWORKS USING A SOCIAL INSECT METAPHOR P. Shi, A. N. Zincir-Heywood and M. I. Heywood Faculty of Computer Science, Dalhousie University, Halifax NS, Canada

More information

Process Allocation for Load Distribution in Fault-Tolerant. Jong Kim*, Heejo Lee*, and Sunggu Lee** *Dept. of Computer Science and Engineering

Process Allocation for Load Distribution in Fault-Tolerant. Jong Kim*, Heejo Lee*, and Sunggu Lee** *Dept. of Computer Science and Engineering Process Allocation for Load Distribution in Fault-Tolerant Multicomputers y Jong Kim*, Heejo Lee*, and Sunggu Lee** *Dept. of Computer Science and Engineering **Dept. of Electrical Engineering Pohang University

More information

Ruled Based Approach for Scheduling Flow-shop and Job-shop Problems

Ruled Based Approach for Scheduling Flow-shop and Job-shop Problems Ruled Based Approach for Scheduling Flow-shop and Job-shop Problems Mohammad Komaki, Shaya Sheikh, Behnam Malakooti Case Western Reserve University Systems Engineering Email: komakighorban@gmail.com Abstract

More information

5. Computational Geometry, Benchmarks and Algorithms for Rectangular and Irregular Packing. 6. Meta-heuristic Algorithms and Rectangular Packing

5. Computational Geometry, Benchmarks and Algorithms for Rectangular and Irregular Packing. 6. Meta-heuristic Algorithms and Rectangular Packing 1. Introduction 2. Cutting and Packing Problems 3. Optimisation Techniques 4. Automated Packing Techniques 5. Computational Geometry, Benchmarks and Algorithms for Rectangular and Irregular Packing 6.

More information

Cost Models for Query Processing Strategies in the Active Data Repository

Cost Models for Query Processing Strategies in the Active Data Repository Cost Models for Query rocessing Strategies in the Active Data Repository Chialin Chang Institute for Advanced Computer Studies and Department of Computer Science University of Maryland, College ark 272

More information

Edge Equalized Treemaps

Edge Equalized Treemaps Edge Equalized Treemaps Aimi Kobayashi Department of Computer Science University of Tsukuba Ibaraki, Japan kobayashi@iplab.cs.tsukuba.ac.jp Kazuo Misue Faculty of Engineering, Information and Systems University

More information

Index. ADEPT (tool for modelling proposed systerns),

Index. ADEPT (tool for modelling proposed systerns), Index A, see Arrivals Abstraction in modelling, 20-22, 217 Accumulated time in system ( w), 42 Accuracy of models, 14, 16, see also Separable models, robustness Active customer (memory constrained system),

More information

DOWNLOAD PDF SYNTHESIZING LINEAR-ARRAY ALGORITHMS FROM NESTED FOR LOOP ALGORITHMS.

DOWNLOAD PDF SYNTHESIZING LINEAR-ARRAY ALGORITHMS FROM NESTED FOR LOOP ALGORITHMS. Chapter 1 : Zvi Kedem â Research Output â NYU Scholars Excerpt from Synthesizing Linear-Array Algorithms From Nested for Loop Algorithms We will study linear systolic arrays in this paper, as linear arrays

More information

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization

More information

Scheduling in Multiprocessor System Using Genetic Algorithms

Scheduling in Multiprocessor System Using Genetic Algorithms Scheduling in Multiprocessor System Using Genetic Algorithms Keshav Dahal 1, Alamgir Hossain 1, Benzy Varghese 1, Ajith Abraham 2, Fatos Xhafa 3, Atanasi Daradoumis 4 1 University of Bradford, UK, {k.p.dahal;

More information

An Improved Heft Algorithm Using Multi- Criterian Resource Factors

An Improved Heft Algorithm Using Multi- Criterian Resource Factors An Improved Heft Algorithm Using Multi- Criterian Resource Factors Renu Bala M Tech Scholar, Dept. Of CSE, Chandigarh Engineering College, Landran, Mohali, Punajb Gagandeep Singh Assistant Professor, Dept.

More information

Complexity results for throughput and latency optimization of replicated and data-parallel workflows

Complexity results for throughput and latency optimization of replicated and data-parallel workflows Complexity results for throughput and latency optimization of replicated and data-parallel workflows Anne Benoit and Yves Robert GRAAL team, LIP École Normale Supérieure de Lyon June 2007 Anne.Benoit@ens-lyon.fr

More information

Metaheuristic Development Methodology. Fall 2009 Instructor: Dr. Masoud Yaghini

Metaheuristic Development Methodology. Fall 2009 Instructor: Dr. Masoud Yaghini Metaheuristic Development Methodology Fall 2009 Instructor: Dr. Masoud Yaghini Phases and Steps Phases and Steps Phase 1: Understanding Problem Step 1: State the Problem Step 2: Review of Existing Solution

More information

CHAPTER 5 ANT-FUZZY META HEURISTIC GENETIC SENSOR NETWORK SYSTEM FOR MULTI - SINK AGGREGATED DATA TRANSMISSION

CHAPTER 5 ANT-FUZZY META HEURISTIC GENETIC SENSOR NETWORK SYSTEM FOR MULTI - SINK AGGREGATED DATA TRANSMISSION CHAPTER 5 ANT-FUZZY META HEURISTIC GENETIC SENSOR NETWORK SYSTEM FOR MULTI - SINK AGGREGATED DATA TRANSMISSION 5.1 INTRODUCTION Generally, deployment of Wireless Sensor Network (WSN) is based on a many

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

6. Parallel Volume Rendering Algorithms

6. Parallel Volume Rendering Algorithms 6. Parallel Volume Algorithms This chapter introduces a taxonomy of parallel volume rendering algorithms. In the thesis statement we claim that parallel algorithms may be described by "... how the tasks

More information

Designing for Performance. Patrick Happ Raul Feitosa

Designing for Performance. Patrick Happ Raul Feitosa Designing for Performance Patrick Happ Raul Feitosa Objective In this section we examine the most common approach to assessing processor and computer system performance W. Stallings Designing for Performance

More information

EE/CSCI 451 Midterm 1

EE/CSCI 451 Midterm 1 EE/CSCI 451 Midterm 1 Spring 2018 Instructor: Xuehai Qian Friday: 02/26/2018 Problem # Topic Points Score 1 Definitions 20 2 Memory System Performance 10 3 Cache Performance 10 4 Shared Memory Programming

More information

Foundation of Parallel Computing- Term project report

Foundation of Parallel Computing- Term project report Foundation of Parallel Computing- Term project report Shobhit Dutia Shreyas Jayanna Anirudh S N (snd7555@rit.edu) (sj7316@rit.edu) (asn5467@rit.edu) 1. Overview: Graphs are a set of connections between

More information

Mapping pipeline skeletons onto heterogeneous platforms

Mapping pipeline skeletons onto heterogeneous platforms Mapping pipeline skeletons onto heterogeneous platforms Anne Benoit and Yves Robert GRAAL team, LIP École Normale Supérieure de Lyon January 2007 Yves.Robert@ens-lyon.fr January 2007 Mapping pipeline skeletons

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

Adaptive Runtime Partitioning of AMR Applications on Heterogeneous Clusters

Adaptive Runtime Partitioning of AMR Applications on Heterogeneous Clusters Adaptive Runtime Partitioning of AMR Applications on Heterogeneous Clusters Shweta Sinha and Manish Parashar The Applied Software Systems Laboratory Department of Electrical and Computer Engineering Rutgers,

More information

An Efficient Algorithm to Test Forciblyconnectedness of Graphical Degree Sequences

An Efficient Algorithm to Test Forciblyconnectedness of Graphical Degree Sequences Theory and Applications of Graphs Volume 5 Issue 2 Article 2 July 2018 An Efficient Algorithm to Test Forciblyconnectedness of Graphical Degree Sequences Kai Wang Georgia Southern University, kwang@georgiasouthern.edu

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Reference Point Based Evolutionary Approach for Workflow Grid Scheduling

Reference Point Based Evolutionary Approach for Workflow Grid Scheduling Reference Point Based Evolutionary Approach for Workflow Grid Scheduling R. Garg and A. K. Singh Abstract Grid computing facilitates the users to consume the services over the network. In order to optimize

More information

COPYRIGHTED MATERIAL. Introduction: Enabling Large-Scale Computational Science Motivations, Requirements, and Challenges.

COPYRIGHTED MATERIAL. Introduction: Enabling Large-Scale Computational Science Motivations, Requirements, and Challenges. Chapter 1 Introduction: Enabling Large-Scale Computational Science Motivations, Requirements, and Challenges Manish Parashar and Xiaolin Li 1.1 MOTIVATION The exponential growth in computing, networking,

More information

CSE 421 Greedy Alg: Union Find/Dijkstra s Alg

CSE 421 Greedy Alg: Union Find/Dijkstra s Alg CSE 1 Greedy Alg: Union Find/Dijkstra s Alg Shayan Oveis Gharan 1 Dijkstra s Algorithm Dijkstra(G, c, s) { d s 0 foreach (v V) d[v] //This is the key of node v foreach (v V) insert v onto a priority queue

More information

Parallel Programs. EECC756 - Shaaban. Parallel Random-Access Machine (PRAM) Example: Asynchronous Matrix Vector Product on a Ring

Parallel Programs. EECC756 - Shaaban. Parallel Random-Access Machine (PRAM) Example: Asynchronous Matrix Vector Product on a Ring Parallel Programs Conditions of Parallelism: Data Dependence Control Dependence Resource Dependence Bernstein s Conditions Asymptotic Notations for Algorithm Analysis Parallel Random-Access Machine (PRAM)

More information

Optimizing Molecular Dynamics

Optimizing Molecular Dynamics Optimizing Molecular Dynamics This chapter discusses performance tuning of parallel and distributed molecular dynamics (MD) simulations, which involves both: (1) intranode optimization within each node

More information

Branch-and-bound: an example

Branch-and-bound: an example Branch-and-bound: an example Giovanni Righini Università degli Studi di Milano Operations Research Complements The Linear Ordering Problem The Linear Ordering Problem (LOP) is an N P-hard combinatorial

More information

Network-on-chip (NOC) Topologies

Network-on-chip (NOC) Topologies Network-on-chip (NOC) Topologies 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and performance

More information

What is Parallel Computing?

What is Parallel Computing? What is Parallel Computing? Parallel Computing is several processing elements working simultaneously to solve a problem faster. 1/33 What is Parallel Computing? Parallel Computing is several processing

More information

Parallel FEM Computation and Multilevel Graph Partitioning Xing Cai

Parallel FEM Computation and Multilevel Graph Partitioning Xing Cai Parallel FEM Computation and Multilevel Graph Partitioning Xing Cai Simula Research Laboratory Overview Parallel FEM computation how? Graph partitioning why? The multilevel approach to GP A numerical example

More information

Self-Organizing Maps for cyclic and unbounded graphs

Self-Organizing Maps for cyclic and unbounded graphs Self-Organizing Maps for cyclic and unbounded graphs M. Hagenbuchner 1, A. Sperduti 2, A.C. Tsoi 3 1- University of Wollongong, Wollongong, Australia. 2- University of Padova, Padova, Italy. 3- Hong Kong

More information

Processing and Others. Xiaojun Qi -- REU Site Program in CVMA

Processing and Others. Xiaojun Qi -- REU Site Program in CVMA Advanced Digital Image Processing and Others Xiaojun Qi -- REU Site Program in CVMA (0 Summer) Segmentation Outline Strategies and Data Structures Overview of Algorithms Region Splitting Region Merging

More information

PatternRank: A Software-Pattern Search System Based on Mutual Reference Importance

PatternRank: A Software-Pattern Search System Based on Mutual Reference Importance PatternRank: A Software-Pattern Search System Based on Mutual Reference Importance Atsuto Kubo, Hiroyuki Nakayama, Hironori Washizaki, Yoshiaki Fukazawa Waseda University Department of Computer Science

More information

BI-OBJECTIVE EVOLUTIONARY ALGORITHM FOR FLEXIBLE JOB-SHOP SCHEDULING PROBLEM. Minimizing Make Span and the Total Workload of Machines

BI-OBJECTIVE EVOLUTIONARY ALGORITHM FOR FLEXIBLE JOB-SHOP SCHEDULING PROBLEM. Minimizing Make Span and the Total Workload of Machines International Journal of Mathematics and Computer Applications Research (IJMCAR) ISSN 2249-6955 Vol. 2 Issue 4 Dec - 2012 25-32 TJPRC Pvt. Ltd., BI-OBJECTIVE EVOLUTIONARY ALGORITHM FOR FLEXIBLE JOB-SHOP

More information

HETEROGENEOUS COMPUTING

HETEROGENEOUS COMPUTING HETEROGENEOUS COMPUTING Shoukat Ali, Tracy D. Braun, Howard Jay Siegel, and Anthony A. Maciejewski School of Electrical and Computer Engineering, Purdue University Heterogeneous computing is a set of techniques

More information

A 3. CLASSIFICATION OF STATIC

A 3. CLASSIFICATION OF STATIC Scheduling Problems For Parallel And Distributed Systems Olga Rusanova and Alexandr Korochkin National Technical University of Ukraine Kiev Polytechnical Institute Prospect Peremogy 37, 252056, Kiev, Ukraine

More information

Determining Differences between Two Sets of Polygons

Determining Differences between Two Sets of Polygons Determining Differences between Two Sets of Polygons MATEJ GOMBOŠI, BORUT ŽALIK Institute for Computer Science Faculty of Electrical Engineering and Computer Science, University of Maribor Smetanova 7,

More information

sflow: Towards Resource-Efficient and Agile Service Federation in Service Overlay Networks

sflow: Towards Resource-Efficient and Agile Service Federation in Service Overlay Networks sflow: Towards Resource-Efficient and Agile Service Federation in Service Overlay Networks Mea Wang, Baochun Li, Zongpeng Li Department of Electrical and Computer Engineering University of Toronto {mea,

More information

Consensus clustering by graph based approach

Consensus clustering by graph based approach Consensus clustering by graph based approach Haytham Elghazel 1, Khalid Benabdeslemi 1 and Fatma Hamdi 2 1- University of Lyon 1, LIESP, EA4125, F-69622 Villeurbanne, Lyon, France; {elghazel,kbenabde}@bat710.univ-lyon1.fr

More information

Introduction to Multithreaded Algorithms

Introduction to Multithreaded Algorithms Introduction to Multithreaded Algorithms CCOM5050: Design and Analysis of Algorithms Chapter VII Selected Topics T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein. Introduction to algorithms, 3 rd

More information

CLOUD WORKFLOW SCHEDULING BASED ON STANDARD DEVIATION OF PREDICTIVE RESOURCE AVAILABILITY

CLOUD WORKFLOW SCHEDULING BASED ON STANDARD DEVIATION OF PREDICTIVE RESOURCE AVAILABILITY DOI: http://dx.doi.org/10.26483/ijarcs.v8i7.4214 Volume 8, No. 7, July August 2017 International Journal of Advanced Research in Computer Science RESEARCH PAPER Available Online at www.ijarcs.info ISSN

More information

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06 Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,

More information

LECTURE 3:CPU SCHEDULING

LECTURE 3:CPU SCHEDULING LECTURE 3:CPU SCHEDULING 1 Outline Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling Real-Time CPU Scheduling Operating Systems Examples Algorithm Evaluation 2 Objectives

More information

Simple mechanisms for escaping from local optima:

Simple mechanisms for escaping from local optima: The methods we have seen so far are iterative improvement methods, that is, they get stuck in local optima. Simple mechanisms for escaping from local optima: I Restart: re-initialise search whenever a

More information

Query-Driven Document Partitioning and Collection Selection

Query-Driven Document Partitioning and Collection Selection Query-Driven Document Partitioning and Collection Selection Diego Puppin, Fabrizio Silvestri and Domenico Laforenza Institute for Information Science and Technologies ISTI-CNR Pisa, Italy May 31, 2006

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information

Scheduling Real Time Parallel Structure on Cluster Computing with Possible Processor failures

Scheduling Real Time Parallel Structure on Cluster Computing with Possible Processor failures Scheduling Real Time Parallel Structure on Cluster Computing with Possible Processor failures Alaa Amin and Reda Ammar Computer Science and Eng. Dept. University of Connecticut Ayman El Dessouly Electronics

More information

Estimation of Bilateral Connections in a Network: Copula vs. Maximum Entropy

Estimation of Bilateral Connections in a Network: Copula vs. Maximum Entropy Estimation of Bilateral Connections in a Network: Copula vs. Maximum Entropy Pallavi Baral and Jose Pedro Fique Department of Economics Indiana University at Bloomington 1st Annual CIRANO Workshop on Networks

More information

Concurrency Preserving Partitioning Algorithm

Concurrency Preserving Partitioning Algorithm VLSI DESIGN 1999, Vol. 9, No. 3, pp. 253-270 Reprints available directly from the publisher Photocopying permitted by license only (C) 1999 OPA (Overseas Publishers Association) N.V. Published by license

More information

Load Balancing of Parallel Simulated Annealing on a Temporally Heterogeneous Cluster of Workstations

Load Balancing of Parallel Simulated Annealing on a Temporally Heterogeneous Cluster of Workstations Load Balancing of Parallel Simulated Annealing on a Temporally Heterogeneous Cluster of Workstations Sourabh Moharil and Soo-Young Lee Department of Electrical and Computer Engineering Auburn University,

More information

Scheduling on clusters and grids

Scheduling on clusters and grids Some basics on scheduling theory Grégory Mounié, Yves Robert et Denis Trystram ID-IMAG 6 mars 2006 Some basics on scheduling theory 1 Some basics on scheduling theory Notations and Definitions List scheduling

More information

Parallel Computing in Combinatorial Optimization

Parallel Computing in Combinatorial Optimization Parallel Computing in Combinatorial Optimization Bernard Gendron Université de Montréal gendron@iro.umontreal.ca Course Outline Objective: provide an overview of the current research on the design of parallel

More information

Parallelization Strategy

Parallelization Strategy COSC 6374 Parallel Computation Algorithm structure Spring 2008 Parallelization Strategy Finding Concurrency Structure the problem to expose exploitable concurrency Algorithm Structure Supporting Structure

More information

Scheduling Tasks Sharing Files from Distributed Repositories

Scheduling Tasks Sharing Files from Distributed Repositories from Distributed Repositories Arnaud Giersch 1, Yves Robert 2 and Frédéric Vivien 2 1 ICPS/LSIIT, University Louis Pasteur, Strasbourg, France 2 École normale supérieure de Lyon, France September 1, 2004

More information

Scalability of Heterogeneous Computing

Scalability of Heterogeneous Computing Scalability of Heterogeneous Computing Xian-He Sun, Yong Chen, Ming u Department of Computer Science Illinois Institute of Technology {sun, chenyon1, wuming}@iit.edu Abstract Scalability is a key factor

More information

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna16/ [ 4 ] Scheduling Theory Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture and

More information

Improving the Data Scheduling Efficiency of the IEEE (d) Mesh Network

Improving the Data Scheduling Efficiency of the IEEE (d) Mesh Network Improving the Data Scheduling Efficiency of the IEEE 802.16(d) Mesh Network Shie-Yuan Wang Email: shieyuan@csie.nctu.edu.tw Chih-Che Lin Email: jclin@csie.nctu.edu.tw Ku-Han Fang Email: khfang@csie.nctu.edu.tw

More information

The Generalized Weapon Target Assignment Problem

The Generalized Weapon Target Assignment Problem 10th International Command and Control Research and Technology Symposium The Future of C2 June 13-16, 2005, McLean, VA The Generalized Weapon Target Assignment Problem Jay M. Rosenberger Hee Su Hwang Ratna

More information

CS4311 Design and Analysis of Algorithms. Lecture 1: Getting Started

CS4311 Design and Analysis of Algorithms. Lecture 1: Getting Started CS4311 Design and Analysis of Algorithms Lecture 1: Getting Started 1 Study a few simple algorithms for sorting Insertion Sort Selection Sort Merge Sort About this lecture Show why these algorithms are

More information

A COMPARATIVE STUDY OF FIVE PARALLEL GENETIC ALGORITHMS USING THE TRAVELING SALESMAN PROBLEM

A COMPARATIVE STUDY OF FIVE PARALLEL GENETIC ALGORITHMS USING THE TRAVELING SALESMAN PROBLEM A COMPARATIVE STUDY OF FIVE PARALLEL GENETIC ALGORITHMS USING THE TRAVELING SALESMAN PROBLEM Lee Wang, Anthony A. Maciejewski, Howard Jay Siegel, and Vwani P. Roychowdhury * Microsoft Corporation Parallel

More information

A Genetic Algorithm for Multiprocessor Task Scheduling

A Genetic Algorithm for Multiprocessor Task Scheduling A Genetic Algorithm for Multiprocessor Task Scheduling Tashniba Kaiser, Olawale Jegede, Ken Ferens, Douglas Buchanan Dept. of Electrical and Computer Engineering, University of Manitoba, Winnipeg, MB,

More information

Trace Signal Selection to Enhance Timing and Logic Visibility in Post-Silicon Validation

Trace Signal Selection to Enhance Timing and Logic Visibility in Post-Silicon Validation Trace Signal Selection to Enhance Timing and Logic Visibility in Post-Silicon Validation Hamid Shojaei, and Azadeh Davoodi University of Wisconsin 1415 Engineering Drive, Madison WI 53706 Email: {shojaei,

More information