Task Allocation for Minimizing Programs Completion Time in Multicomputer Systems


Gamal Attiya and Yskandar Hamam
Groupe ESIEE Paris, Lab. A2SI, Cité Descartes, BP 99, 93162 Noisy-Le-Grand, FRANCE
{attiyag,hamamy}@esiee.fr

Abstract. Task allocation is one of the biggest issues in the area of parallel and distributed computing. Given a parallel program composed of M communicating modules (tasks) and a multicomputer system of N processors with a specific interconnection network, the problem is how to assign the program modules to the available processors in the system so as to minimize the entire program completion time. This problem is known to be NP-complete and is therefore intractable as soon as the number of tasks and/or processors exceeds a few units. This paper presents a heuristic algorithm for this problem, derived from Simulated Annealing, which takes several kinds of constraints into account. The performance of the algorithm is evaluated through an experimental study on randomly generated instances allocated to a multicomputer system with a bus topology. Furthermore, the quality of the solutions is compared with that obtained by Branch-and-Bound on the same sample problems.

1 Introduction

A fundamental issue affecting the performance of a parallel program running on a multicomputer system is the assignment of the program modules (tasks) to the available processors of the system. Modules of a program may be executed on the same or on different processors. Because processor capabilities differ, the cost of executing a module may vary from one processor to another. The module execution cost depends on the work to be performed by the module and on processor attributes such as clock rate (speed), instruction set and cache memory.
On the other hand, modules that are executed on different processors but communicate with one another do so over the interconnection network and thus incur communication costs, due to the overhead of the communication protocols and the transmission delays in the network. To realize the performance potential, two goals need to be met: the interprocessor communication cost has to be minimized and the module execution costs need to be balanced among processors. These two goals seem to conflict with one another. On one hand, placing all modules on one processor removes the interprocessor communication cost but results in a poor balance of the processing load. On the other hand, an even distribution of modules among processors maximizes processor utilization but might

A. Laganà et al. (Eds.): ICCSA 2004, LNCS 3044, pp. 97-106, 2004. © Springer-Verlag Berlin Heidelberg 2004

also increase the interprocessor communication costs. Thus, the purpose is to balance the two often conflicting objectives of minimizing the interprocessor communication costs and maximizing processor utilization.

Several approaches, in both computer science and operations research, have been suggested for solving the allocation problem. They may be roughly classified into four broad categories, namely graph theory [1,2], state space search [3,4], mathematical programming [5,6] and heuristics [7]-[13]. However, most of the existing approaches deal with homogeneous systems. Furthermore, they do not consider many kinds of constraints related to the application requirements and the availability of system resources. The allocation problem may be further complicated when the system contains heterogeneous components, such as processors of different speeds and different resource availabilities (memory and/or communication capacities).

This paper addresses the allocation problem in heterogeneous multicomputer systems, taking into account both memory and communication capacity constraints. It first models the allocation problem as an optimization problem. It then proposes a heuristic algorithm, derived from Simulated Annealing (SA), to allocate the modules of a parallel program to the processors of a multicomputer system so as to minimize the entire program completion time. The performance of the proposed algorithm is evaluated through experimental studies on randomly generated instances. Furthermore, the quality of the resulting solutions is compared with that obtained by applying the Branch-and-Bound algorithm [14] to the same instances.

The remainder of this paper consists of five sections. Section 2 defines the task allocation problem. Section 3 describes how the allocation problem may be formulated as an optimization problem.
Section 4 presents the Simulated Annealing (SA) technique and describes how it may be employed for solving the allocation problem. The simulation results are discussed in Section 5 and the conclusions are given in Section 6.

2 Problem Definition

The problem addressed in this paper is concerned with allocating the modules (tasks) of a parallel program to the processors of a heterogeneous multicomputer system so as to minimize the program completion time. A multicomputer system consists of a set of N heterogeneous processors connected via an interconnection network. Each processor has some computation facilities and its own memory. Furthermore, the interconnection network has some communication capacity and a cost of transferring a data unit from a sender computer to a receiver computer. An example of a multicomputer system is shown in Figure 1(b). A parallel program consists of a set of M communicating modules corresponding to the nodes of a task interaction graph G(V, E), as shown in Figure 1(a), where V represents the set of M modules and E represents the set of edges. In the graph, each module i ∈ V is labeled by its memory requirements and each arc (i, j) ∈ E is labeled by the communication requirements between the modules. Furthermore, a vector is associated with each module representing the execution cost of the module at

different processors in the system, as shown in Figure 1(c). The problem is how to map the tasks (modules) of the task graph to the processors of the system so as to minimize the program completion time under one or more constraints. Briefly, we are given a set of M communicating modules representing a parallel program to be executed on a heterogeneous multicomputer system of N processors. The modules of the program require certain capacitated computer resources: they have computational and memory capacity requirements. Furthermore, they communicate with each other at a given rate. On the other hand, the processor and communication resources are also capacitated. Thus, the purpose is to find an assignment of the program modules to the system processors such that the program completion time is minimized, the requirements of the modules and edges are met, and the capacities of the system resources are not violated.

[Fig. 1. An Example of a Task Interaction Graph and a Distributed System: (a) a task interaction graph of seven tasks t1-t7 labeled with memory and communication requirements, (b) a distributed system of four processors P1-P4 connected by an interconnection network, and (c) the execution cost of each task on each processor:]

       P1  P2  P3  P4
  t1   10  16  11  50
  t2   12  60  30   8
  t3   15  40   8  50
  t4    4  15  25  10
  t5   14  10  90  12
  t6   12  70   6  45
  t7   18   8  33   6

3 Model Description

Designing a model for the allocation problem involves two steps: (i) formulating a cost function representing the objective of the task allocation, and (ii) formulating a set of constraints in terms of the module requirements and the availability of the system resources. To describe the allocation model, let A be an assignment vector such that A(i) = p if module i is assigned to processor p, and let TC_p be the task cluster representing the set of tasks assigned to processor p.

3.1 Allocation Cost Development

For an assignment A, a processor's load comprises all the execution and communication costs associated with its assigned tasks.
The time needed by the bottleneck processor (i.e., the heaviest loaded processor) will determine the program completion time. Therefore, minimizing the entire program completion time may be achieved by minimizing the cost at the maximum loaded processor.

Actual Execution Cost: The actual execution load of a processor p is the cost of processing all tasks assigned to p under an assignment A. Define C_ip as the cost of processing task i on processor p; then the actual execution cost may be formulated as

  EXEC_p = Σ_{i ∈ TC_p} C_ip

Actual Communication Cost: The actual communication load of a processor p is the cost of communicating data between the tasks resident on p and tasks resident on other processors q. Let cc_avg be the average cost of transferring a data unit through the network transmission media and d_ij the data flow between two communicating tasks i and j; then the actual communication cost may be formulated as

  COMM_p = Σ_{i ∈ TC_p} Σ_{j ≠ i, (i,j) ∈ E, A(j) ≠ p} d_ij · cc_avg

It is worth noting that if two communicating tasks are assigned to different processors p and q, the communication cost contributes to the load of both processors. Furthermore, if two communicating tasks are assigned to the same processor, the communication cost is assumed to be zero, as they use the local system memory for data exchange.

Bottleneck in the System: For an assignment A, the workload of a processor p comprises all the execution and communication costs associated with its assigned tasks. Hence, the workload of p may be formulated as

  L_p = EXEC_p + COMM_p

The bottleneck of the system is the critical processor, which has the maximum cost over all processors. Thus the maximum load, at the bottleneck processor, may be formulated as

  L_max = max { L_p | 1 ≤ p ≤ N }

Objective Function: To minimize the entire program completion time, the cost at the maximum loaded processor must be minimized.
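The load quantities EXEC_p, COMM_p, L_p and L_max defined above can be sketched in a few lines of Python. The instance data and variable names below are illustrative only; the paper's own experiments were coded in Matlab.

```python
# Hypothetical instance: exec_cost[i][p] = C_ip, edges[(i, j)] = d_ij.
exec_cost = [[10, 16], [12, 60], [15, 40]]
edges = {(0, 1): 4, (1, 2): 6}
cc_avg = 1.0
N = 2
A = [0, 1, 1]          # assignment vector: task i runs on processor A[i]

def loads(A, exec_cost, edges, cc_avg, N):
    L = [0.0] * N
    # EXEC_p: sum of C_ip over the tasks assigned to p.
    for i, p in enumerate(A):
        L[p] += exec_cost[i][p]
    # COMM_p: each cut edge charges d_ij * cc_avg to BOTH endpoints;
    # edges internal to a processor cost nothing (local memory exchange).
    for (i, j), d in edges.items():
        if A[i] != A[j]:
            L[A[i]] += d * cc_avg
            L[A[j]] += d * cc_avg
    return L

L = loads(A, exec_cost, edges, cc_avg, N)
L_max = max(L)    # bottleneck load = program completion time
print(L, L_max)
```

Here task 0 runs alone on P1 while tasks 1 and 2 share P2, so only edge (0, 1) is cut and its cost is charged to both processors.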
Therefore the objective function of the allocation model may be formulated as

  min L_max

3.2 Assignment Constraints

The assignment constraints depend on the characteristics of the application involved (such as its memory and communication capacity requirements) and on the system resources, including the computation speed of the processors and the available memory and communication network capacities.

Memory Constraints: Let m_i be the memory capacity requirement of a task i and M_p the available memory capacity of a processor p. For an assignment A, the total memory required by all modules assigned to processor p must be less than or equal to the available memory capacity of p. That is, the following inequality must hold at each processor p:

  Σ_{i ∈ TC_p} m_i ≤ M_p

Network Capacity Constraints: Let b_ij be the communication capacity requirement of edge (i, j) and R the available communication capacity of the network transmission media. For an assignment A, the total communication capacity required by all arcs mapped to the transmission media must be less than or equal to the available communication capacity of the media. That is, the following inequality must hold at the network transmission media:

  (1/2) Σ_p Σ_{i ∈ TC_p} Σ_{j ≠ i, (i,j) ∈ E, A(j) ≠ p} b_ij ≤ R

where the factor 1/2 accounts for each cut edge being counted once from each of its two endpoint processors.

3.3 Allocation Model

This paper considers a model where each task must be allocated to exactly one processor, taking into account both the memory and the network capacity constraints. Let X be an M × N binary matrix whose element X_ip = 1 if module i is assigned to processor p and X_ip = 0 otherwise. Then the allocation problem may be formulated as follows:

  min L_max = max_{1 ≤ p ≤ N} { EXEC_p + COMM_p }

  subject to
    Σ_p X_ip = 1                                              for all tasks i
    Σ_{i ∈ TC_p} m_i ≤ M_p                                    for all processors p
    (1/2) Σ_p Σ_{i ∈ TC_p} Σ_{j ≠ i, (i,j) ∈ E, A(j) ≠ p} b_ij ≤ R    for the network

4 Heuristic Algorithm

This section presents a heuristic algorithm, derived from Simulated Annealing (SA), for the allocation problem. It first defines the basic concepts of SA and then explains how it may be employed for solving the allocation problem.

4.1 Basic Concepts

Simulated Annealing (SA) is a global optimization technique which attempts to find the lowest point in an energy landscape [15,16]. The technique was derived

from observations of how slowly cooled molten metal can settle into a regular crystalline structure. It emulates the physical concepts of temperature and energy to represent and solve optimization problems. The procedure is as follows: the system is subjected to a high temperature and is then slowly cooled through a series of temperature levels. At each level, the algorithm searches for the system's equilibrium state through elementary transformations, which are accepted if they reduce the system energy. Uphill moves may also be accepted, with a probability that shrinks as the temperature decreases, so that only smaller energy increments remain likely to be accepted and the system eventually settles into a low-energy state close to the global minimum. Several functions have been proposed to determine the probability with which an uphill move of size Δ may be accepted. The algorithm presented in this paper uses exp(-Δ/T), where T is the temperature.

4.2 Simulated Annealing Algorithm

To describe the SA algorithm, some definitions are needed. The set of all possible allocations of tasks to processors is called the problem space. A point in the problem space is a mapping of tasks to processors. The neighbors of a point are the set of all points reachable by moving any single task from one processor to any other processor. The energy of a point is a measure of the suitability of the allocation represented by that point.
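The neighbor relation just described (move one randomly chosen task to a different, randomly chosen processor) can be sketched as follows; the function name and the use of Python are illustrative:

```python
import random

# One elementary transformation: pick a task at random and move it to a
# different, randomly chosen processor, yielding a neighboring point.
def neighbor(A, N):
    n = list(A)                       # copy: a point is an assignment vector
    i = random.randrange(len(n))      # task to move
    p = random.randrange(N - 1)       # draw from N-1 candidates...
    n[i] = p if p < n[i] else p + 1   # ...skipping the current processor
    return n

random.seed(1)
print(neighbor([0, 1, 1], N=3))
```

The skip-over trick in the last assignment guarantees the new processor differs from the old one, so every call produces a genuine neighbor.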
The structure of the algorithm may be sketched as follows:

  Randomly select an initial solution s;
  Compute the cost E_s of this solution;
  Select an initial temperature T;
  Select a cooling factor α < 1;
  Select an initial chain length n_rep;
  Select a chain-increasing factor β > 1;
  Repeat
    Repeat
      Select a neighbor solution n of s;
      Compute the cost E_n of n;
      Δ = E_n - E_s;
      If Δ < 0
        s = n; E_s = E_n;
      Else
        Generate a random value x in the range (0, 1);
        If x < exp(-Δ/T), s = n; E_s = E_n;
      End
    Until iteration = n_rep   (equilibrium state at T)
    Set T = α · T;
    Set n_rep = β · n_rep;
  Until stopping condition = true

Here s is the current solution and E_s its cost.
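The sketched procedure can be turned into a short runnable program. The small instance below, the parameter values, and all names are illustrative only, not the paper's code (which was written in Matlab); the energy here is simply the bottleneck load L_max.

```python
import math
import random

random.seed(0)

# Hypothetical instance: exec_cost[i][p] = C_ip, edges[(i, j)] = d_ij.
exec_cost = [[10, 16], [12, 60], [15, 40], [4, 15]]
edges = {(0, 1): 4, (1, 2): 6, (2, 3): 3}
N = 2

def energy(A):
    # Bottleneck load L_max: execution costs plus cut-edge communication.
    L = [0.0] * N
    for i, p in enumerate(A):
        L[p] += exec_cost[i][p]
    for (i, j), d in edges.items():
        if A[i] != A[j]:
            L[A[i]] += d
            L[A[j]] += d
    return max(L)

s = [random.randrange(N) for _ in exec_cost]   # initial solution
E_s = energy(s)
T, alpha, n_rep, beta = 100.0, 0.90, 20.0, 1.05

while T > 0.1:                                 # stopping condition
    for _ in range(int(n_rep)):                # equilibrium chain at T
        n = list(s)
        # Simplified move: re-drawing the same processor is allowed here.
        n[random.randrange(len(n))] = random.randrange(N)
        E_n = energy(n)
        delta = E_n - E_s
        if delta < 0 or random.random() < math.exp(-delta / T):
            s, E_s = n, E_n                    # accept (downhill or uphill)
    T *= alpha                                 # geometric cooling
    n_rep *= beta                              # longer chains when cold

print(s, E_s)
```

Note that this sketch returns the final state of the chain, exactly as in the pseudocode; a practical implementation would usually also track the best solution seen so far.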

4.3 Applying the Algorithm

As can be seen from the above, the SA algorithm requires an energy function, a neighbor function, a cooling function and some annealing parameters. The energy function is the heart of the algorithm. It shapes the energy landscape, which affects how the annealing algorithm reaches a solution. For the allocation problem, the energy function represents the objective function to be optimized and has to penalize the following characteristics: (i) task duplication, (ii) processors with a memory utilization > 100%, and (iii) transmission media with a capacity utilization > 100%. These characteristics are penalized in order to meet the application requirements and respect the availability of the system resources. In the solution development, the first property is handled by constructing an allocation vector A whose element A(i) represents the processor p to which task i is assigned; therefore each task cannot be allocated to more than one processor. For an assignment A, the second property is penalized by comparing the memory capacity required by all the tasks allocated to a processor p with the available memory capacity of p. An energy component E_mem is determined such that E_mem = 1 if the memory constraints are not satisfied and 0 otherwise. By the same strategy, the third property is penalized by testing and returning an energy component E_cap such that E_cap = 1 if the communication constraints are not satisfied and 0 otherwise. Let k be a penalty factor; then the energy function E may be formulated as

  E = L_max + k · (E_mem + E_cap)

In this paper, a neighboring solution is obtained by choosing a task i at random from the current allocation vector A and assigning it to another randomly selected processor p. For the cooling process, a geometric cooling schedule is used. The initial temperature T is set after executing a sufficiently large number of random moves, such that the worst move would be allowed.
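A minimal sketch of this penalized energy, with E_mem and E_cap computed from the memory and network-capacity constraints of Section 3.2; the data, the value of k, and all names are illustrative assumptions, not values from the paper:

```python
# E = L_max + k * (E_mem + E_cap): infeasible assignments stay
# representable but pay a large penalty k per violated constraint class.
def penalized_energy(L_max, A, mem_req, mem_cap, cap_req, R, k=1000):
    # E_mem = 1 if any processor's assigned memory exceeds its capacity.
    used = [0] * len(mem_cap)
    for i, p in enumerate(A):
        used[p] += mem_req[i]
    E_mem = 1 if any(u > c for u, c in zip(used, mem_cap)) else 0
    # E_cap = 1 if the total b_ij over cut edges exceeds the medium's R.
    # (Counting each cut edge once equals the halved double sum of 3.2.)
    cut = sum(b for (i, j), b in cap_req.items() if A[i] != A[j])
    E_cap = 1 if cut > R else 0
    return L_max + k * (E_mem + E_cap)

# A memory-infeasible assignment (task 1 needs 3 units, P2 has 1) pays k:
E = penalized_energy(50, [0, 1], [2, 3], [4, 1], {(0, 1): 2}, R=10)
print(E)
```

With k large relative to typical loads, any feasible assignment has lower energy than every infeasible one, so the annealing search is steered back into the feasible region.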
The temperature is then reduced geometrically, T = α · T, where α = 0.90. At each temperature, the chain length n_rep is updated in a similar manner: n_rep = β · n_rep, where β = 1.05. Figure 2 shows the behaviour of the SA algorithm when allocating a randomly generated task graph of 100 tasks to a distributed system of 4 computers. The figure shows that the cost is unstable at the beginning but rapidly improves and, after a short latency time, converges to the best cost.

5 Performance Evaluation

The proposed algorithm is coded in Matlab and tested on a large number of randomly generated graphs mapped to a distributed system of 4 computers with a bus topology. Furthermore, the quality of the resulting solutions is compared with that obtained by the Branch-and-Bound (BB) algorithm [14], which is also coded in Matlab and applied to the same instances.
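With α = 0.90 and β = 1.05 as stated, the schedule can be traced in a couple of lines; the starting temperature, chain length, and stopping threshold below are illustrative assumptions, since the paper does not report them:

```python
# Trace of the geometric schedule T <- 0.90*T, n_rep <- 1.05*n_rep,
# assuming T starts at 100.0, n_rep at 20, and annealing stops at T <= 0.1.
T, n_rep = 100.0, 20.0
levels = 0
while T > 0.1:
    T *= 0.90        # cooling factor alpha = 0.90
    n_rep *= 1.05    # chain-increasing factor beta = 1.05
    levels += 1
print(levels, round(n_rep))
```

Because β > 1, the chain length grows roughly 25-fold over these levels: the search spends few iterations at hot, chaotic temperatures and many at cold ones, where careful local refinement pays off.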

[Fig. 2. Typical Behaviour of Simulated Annealing: cost versus time (sec).]

The simulation results are shown in Figures 3, 4, 5 and 6. Figure 3 illustrates the computation time of the SA and BB algorithms as a function of the number of tasks. The figure shows that the SA algorithm finds a solution very fast in comparison with the BB algorithm. Furthermore, the computation time of the SA algorithm increases only slowly as the number of tasks increases. Figure 4 compares the quality of the solutions obtained by SA with those derived by BB on the same instances. Figures 5 and 6 illustrate the workload distribution over the 4 processors of the system obtained by the SA and BB algorithms, respectively. As shown in the figures, the workload distribution obtained by the SA algorithm is very close to the optimal workload distribution.

[Fig. 3. Algorithms Computation Time: computation time (sec) versus number of tasks for SA and BB.]

[Fig. 4. Optimal and Suboptimal Completion Time: cost versus number of tasks for SA and BB.]

[Fig. 5. Suboptimal Workload Distribution by SA: load on processors P1-P4 versus number of tasks.]

[Fig. 6. Optimal Workload Distribution by BB: load on processors P1-P4 versus number of tasks.]

6 Conclusions

A heuristic algorithm for allocating a parallel program to a heterogeneous distributed computing system is presented in this paper. The goal is to minimize the entire program completion time. The algorithm is derived from the well-known Simulated Annealing (SA) and is tested on a large number of randomly generated task graphs. The effectiveness of the algorithm is evaluated by comparing the quality of its solutions with those derived by the Branch-and-Bound (BB) technique on the same sample problems. The simulation results show that SA is an efficient approach to the task allocation problem. The algorithm yields a near-optimal allocation in an acceptable amount of computation time.

References

1. C.-H. Lee, D. Lee, and M. Kim. Optimal Task Assignment in Linear Array Networks. IEEE Trans. Computers, 41(7):877-880, July 1992.
2. C.-H. Lee and K. G. Shin. Optimal Task Assignment in Homogeneous Networks. IEEE Trans. Parallel and Distributed Systems, 8(2):119-129, Feb. 1997.
3. M. Kafil and I. Ahmed. Optimal Task Assignment in Heterogeneous Distributed Computing Systems. IEEE Concurrency, 42-51, July-Sept. 1998.
4. A. Tom and C. Murthy. Optimal Task Allocation in Distributed Systems by Graph Matching and State Space Search. The Journal of Systems and Software, 46:59-75, 1999.
5. Y.-C. Ma and C.-P. Chung. A Dominance Relation Enhanced Branch-and-Bound Task Allocation. Journal of Systems and Software, 58:125-134, 2001.
6. Gamal Attiya and Yskandar Hamam. Static Task Assignment in Distributed Computing Systems. 21st IFIP TC7 Conference on System Modeling and Optimization, Sophia Antipolis, Nice, France, 2003.
7. V. M. Lo. Heuristic Algorithms for Task Assignment in Distributed Systems. IEEE Trans. Computers, 37(11):1384-1397, Nov. 1988.
8. P. Sadayappan, F. Ercal, and J. Ramanujam. Cluster Partitioning Approaches to Mapping Parallel Programs onto a Hypercube. Parallel Computing, 13:1-16, 1990.
9. T. Chockalingam and S. Arunkumar. A Randomized Heuristics for the Mapping Problem: The Genetic Approach. Parallel Computing, 18(10):1157-1165, 1992.
10. T. Bultan and C. Aykanat. A New Mapping Heuristic Based on Mean Field Annealing. Journal of Parallel and Distributed Computing, 10:292-305, 1992.
11. P. Bouvry, J. Chassin, and D. Trystram. Efficient Solution for Mapping Parallel Programs. Proceedings of EuroPar '95, Vol. 966 of LNCS, pp. 379-390, Springer-Verlag, August 1995.
12. J. Aguilar and E. Gelenbe. Task Assignment and Transaction Clustering Heuristics for Distributed Systems. Information and Computer Sciences, 97:199-219, 1997.
13. M. A. Senar, A. Ripoll, A. Cortés, and E. Luque. Clustering and Reassignment-Based Mapping Strategy for Message-Passing Architectures. Journal of Systems Architecture, 48:267-283, 2003.
14. Gamal Attiya and Yskandar Hamam. Optimal Allocation of Tasks onto Networked Heterogeneous Computers Using Minimax Criterion. Proceedings of the International Network Optimization Conference (INOC 2003), pp. 25-30, Evry/Paris, France, 2003.
15. S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by Simulated Annealing. Science, 220:671-680, May 1983.
16. E. Aarts and J. Korst. Simulated Annealing and Boltzmann Machines. John Wiley and Sons, New York, 1989.