Task Distribution in a Workstation Cluster with a Concurrent Network


Frank Burchert, Michael Koch, Gunther Hipper, Djamshid Tavangarian
Universität Rostock, Fachbereich Informatik, Institut für Technische Informatik
Albert-Einstein-Straße 21, D Rostock (Germany)
Tel.: ++49 (381) , Fax: ++49 (381)
Burchert@Informatik.Uni-Rostock.de

Abstract

This paper concentrates on task allocation and load balancing within a Concurrent Network Architecture (CNA) cluster using simple strategies for distributing tasks. A sender-initiated approach, a receiver-initiated approach with reservation strategy, a deterministic Greedy algorithm similar to the sender-initiated method, and a gradient algorithm adapted to the special communication possibilities of CNA are investigated and compared. First results gained by simulation show that the gradient model load balancing method yields the best increase in performance. In addition, simple methods for obtaining the parameters required by these load balancing algorithms in a UNIX environment are introduced.

Key words

distributed systems, load balancing, load sharing, workstation cluster

1. Introduction

Over the past few years, demands on the performance of computer systems have grown steadily. In order to satisfy these demands, new techniques and architectures are necessary for both software and hardware. Parallel computing has proved to be a suitable tool, but requires specialized parallel hardware structures. The use of workstation clusters (WSC) as distributed memory parallel computers, which has gained increasing acceptance, has proved to be a suitable and cost-effective approach.

The main bottleneck of a distributed memory computer architecture is the performance of the communication system connecting the processing elements. Conventional clusters rely on a single, shared communication channel, which becomes a disadvantage if the amount of exchanged data exceeds the usual load for which the network was primarily dimensioned (e.g. file service).
Future LAN technologies will be able to reduce this bottleneck but not eliminate it. A modified communication architecture called Concurrent Network Architecture (CNA) extends the concurrency of the processing elements to their communication network [1]. The following chapter examines the basic idea of CNA.

Another approach to increasing system performance is the application of load balancing and load sharing strategies to workstation clusters. Over the last decade, many load balancing algorithms have been developed in order to equalize system load over all workstations (WS) within a network [2]. Load balancing prevents some workstations from staying idle while others within the cluster are under heavy load. The length of time during which workstations are idle is thereby reduced and system performance is increased.

This paper concentrates on task allocation and load balancing within a CNA cluster using simple strategies for distributing serial tasks. With respect to the special CNA communication features, the following strategies are compared: a sender-initiated approach [3], a receiver-initiated approach with reservation strategy [3], a deterministic Greedy algorithm similar to the sender-initiated method [4], and a gradient algorithm adapted to the special communication possibilities of CNA [5].

First Sino-German Workshop on Advanced Parallel Processing Technologies, APPT 95, Beijing, China, September / 7

Section 2 introduces the concept of the Concurrent Network Architecture. In Section 3, different load balancing algorithms and their parameters are presented. Section 4 concentrates on methods used to determine these parameters in a real, heterogeneous UNIX environment. In Section 5, simulation results concerning the improvement of job throughput gained by load balancing and sharing within a CNA cluster are presented. The paper ends with conclusions and a description of future work.

2. The Concurrent Network Architecture

Concurrent computing in workstation clusters as an alternative to MIMD or SIMD computer architectures has gained increasing acceptance within the last few years. The main reason for this is the superior cost/performance ratio of a workstation cluster compared to conventional parallel architectures. In many places, workstations are already connected by a LAN which is sufficient for the usual file server activities. Such a workstation cluster can be regarded as a distributed memory parallel computer.

In principle, clusters with a single communication channel (denoted as conventional clusters) can be used to execute concurrent applications. Research has shown, however, that the network often becomes a bottleneck which reduces system performance. In order to avoid this, a new communication architecture called CNA (Concurrent Network Architecture) is proposed in [1], which extends the concurrency of the processing elements to the network. Modifying existing networks in this way is an economical solution that enhances communication performance whilst maintaining the cost-effectiveness of workstation clusters.

Within a CNA cluster, workstations (WS) are structured in a regular n-dimensional mesh. Every WS is connected to n different and independent communication network channels (e.g. Ethernet, FDDI, ATM).
Each workstation has two different IP addresses, and the CPU of each workstation has to perform the routing tasks of the IP protocols between its communication channels [6].

Figure 1: Conventional cluster architecture vs. Concurrent Network Architecture. a) Conventional cluster with a single communication channel; b) example of a two-dimensional 3x3 CNA cluster (nodes labelled 1,1 to 3,3).

In order to build a CNA cluster, both additional communication hardware for implementing the concurrency of the communication system (see fig. 1) and software components are needed. Some of these components, the TCP/IP and UDP/IP protocol stacks, are part of the UNIX operating system. Another part of this software is an optimized programming environment for the CNA cluster, which is based on a message-passing model. These programs and libraries enable the programmer of a parallel application to use the special feature of the CNA, namely independent communication channels which provide a higher system bandwidth. A realization of such communication software is described in [6].
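The mesh structure described above can be illustrated with a minimal Python sketch. It assumes that one channel links all workstations that differ in exactly one coordinate (one channel per row and per column in the two-dimensional case); the function names `cna_nodes`, `channels_of`, `neighbours` and `hops` are hypothetical and do not come from the CNA software described in [6].

```python
from itertools import product


def cna_nodes(size, dims=2):
    """Coordinates of a regular dims-dimensional mesh with `size` nodes per axis."""
    return list(product(range(1, size + 1), repeat=dims))


def channels_of(node):
    """Channels a node is attached to, one per dimension.

    Assumption: a channel is identified by the axis along which its nodes
    vary plus the remaining (fixed) coordinates.
    """
    return {(axis, node[:axis] + node[axis + 1:]) for axis in range(len(node))}


def neighbours(node, nodes):
    """All other nodes that share at least one channel with `node`."""
    own = channels_of(node)
    return [other for other in nodes if other != node and own & channels_of(other)]


def hops(a, b):
    """Channel traversals needed between a and b: one per differing coordinate."""
    return sum(1 for x, y in zip(a, b) if x != y)


nodes = cna_nodes(3)
all_channels = set().union(*(channels_of(n) for n in nodes))
```

In this model a 3x3 cluster has six channels, every workstation reaches four others directly, and a transfer such as (1,3) to (2,2) needs two hops, i.e. one intermediate workstation acting as gateway.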

Fig. 2 shows examples of possible routes within a CNA cluster. There are different possibilities for sending messages from one workstation to another, depending on the load of each communication channel. It is also possible to divide a message into halves and transmit each half along its own channel. In this way, the bandwidth between two workstations integrated in the CNA cluster is extended.

Figure 2: Example of independent communications in a two-dimensional CNA cluster (3x3 mesh with channels X1 to X3 and Y1 to Y3). Patterns of independent communications between workstations: a) (3,1) and (3,2); b) (1,1) and (2,1); c) (1,3) and (2,2).

3. Load balancing strategies

Load balancing strategies can be divided into two main categories: static policies, which allocate tasks only before they have started, and dynamic methods, which also reallocate tasks at run time. The benefit of dynamic load balancing is disputed in the literature. Some authors show that dynamic load balancing offers additional performance [9][10][11][12], whereas others show that the overhead reaches or exceeds the benefits [13][14][15]. The emphasis of this paper is on static strategies, because they are simple to implement.

If arbitrary tasks are scheduled in a CNA cluster, one has to distinguish between scheduling parallel and sequential tasks. Usually, parallel tasks produce much more traffic on the communication channel than sequential tasks. Hence, allocating parallel tasks requires algorithms which specifically take the communication costs between the individual modules into account. Several load balancing algorithms are based on graph theory [7][8]. The basic ideas are presented in fig. 3: tasks t_i (and, in Stone's approach, also the processing elements P_k) are represented by nodes.
The edges connecting the nodes are weighted with communication costs c_ij (edges between two task nodes) or with specially defined execution costs w_ij (edges between processes and processors) denoting the costs if task t_i is not executed on P_j. Through the use of the Ford-Fulkerson algorithm, task allocations which cause minimum costs can be obtained.

Figure 3: Task allocation by use of min cut / max flow algorithm (n-way cut) [7]
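For the two-processor case, the min cut / max flow allocation can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the implementation used in [7]: the function name `min_cut_assignment`, the node labels "P1" and "P2", and the input format are hypothetical, and the breadth-first (Edmonds-Karp) variant of Ford-Fulkerson is used. Following the weighting described above, the edge between a task and a processor carries the cost of running the task on the other processor.

```python
from collections import defaultdict, deque


def min_cut_assignment(exec_cost, comm_cost):
    """exec_cost: {task: (cost_on_P1, cost_on_P2)}, comm_cost: {(task_a, task_b): cost}.

    Task names must differ from the reserved labels "P1" and "P2".
    Returns ({task: processor}, total_cost).
    """
    cap = defaultdict(lambda: defaultdict(int))
    for task, (on_p1, on_p2) in exec_cost.items():
        cap["P1"][task] += on_p2  # this edge is cut if the task runs on P2
        cap[task]["P2"] += on_p1  # this edge is cut if the task runs on P1
    for (a, b), cost in comm_cost.items():
        cap[a][b] += cost  # cut if a and b end up on different processors
        cap[b][a] += cost

    def reachable_from_source():
        """Breadth-first search in the residual graph; returns the parent map."""
        parent = {"P1": None}
        queue = deque(["P1"])
        while queue:
            u = queue.popleft()
            for v, residual in cap[u].items():
                if residual > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    total = 0
    while True:
        parent = reachable_from_source()
        if "P2" not in parent:
            break  # no augmenting path left: the flow is maximal
        # Collect the edges of the augmenting path and find its bottleneck.
        path, v = [], "P2"
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(cap[u][w] for u, w in path)
        for u, w in path:
            cap[u][w] -= flow
            cap[w][u] += flow
        total += flow

    # Tasks still reachable from P1 lie on P1's side of the minimum cut.
    assignment = {t: ("P1" if t in parent else "P2") for t in exec_cost}
    return assignment, total


example_exec = {"a": (1, 10), "b": (10, 1), "c": (4, 4)}
example_comm = {("a", "c"): 5, ("b", "c"): 1}
example_assignment, example_cost = min_cut_assignment(example_exec, example_comm)
```

In the small example, task c is equally expensive on both processors, so its heavier communication with task a pulls it onto the same processor as a.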

This min-cut algorithm yields optimal task allocations only for two-processor systems. If there are more than two processing elements, the policy becomes much more complex, since the Ford-Fulkerson algorithm has to be performed numerous times. In addition, optimal allocations are no longer guaranteed. Therefore, Lo proposes an extension with a heuristic approach [8], which leads to suboptimal distributions. This is a suitable basis for efficient scheduling of parallel tasks within CNA clusters.

If sequential tasks have to be distributed within a cluster, uncomplicated strategies are sufficient to improve performance distinctly. Table 1 shows some simple algorithms proposed in [3][4][5], which were adapted to CNA and its special communication capabilities. For this purpose, the definition of immediate neighbours within the gradient model was extended to all workstations which are directly connected by a common communication channel. The applied load index is the number of processes running on a workstation. Although this is a simple load index, Eager et al. showed in [16] that a distinct increase in performance is possible. Extensions to this load index include the main memory and swap space in use and the CPU utilization.

Table 1: Simple load balancing strategies with threshold policies

name                  short description
sender-initiated      heavily loaded nodes try to find lightly loaded nodes at random and then transfer the newly arrived task [3]
receiver-initiated    lightly loaded nodes try to find heavily loaded nodes at random, which then transfer the newly arrived task [3]
reservation method    nodes at which a task has just terminated place a reservation at a heavily loaded node, which transfers its next arriving task to them on arrival [3]
greedy algorithm      similar to the sender-initiated algorithm, but with a deterministic choice of target nodes [4]
gradient method       a newly arrived task is sent to the lowest loaded node in the neighbourhood (up to n hops) [5]

4. Detecting load parameters under UNIX

Load balancing in homogeneous environments is a well-investigated topic, but difficulties appear if strategies have to take heterogeneity into account. The complexity of these algorithms and the fact that adapted software is needed for detecting the required parameters impede general implementations of load balancing strategies for heterogeneous clusters.

UNIX systems offer a number of user commands, maintenance commands and system calls for information about system and process state. Some of them are listed in table 2. The fact that there is no common standard for UNIX process or system information makes it difficult to implement universal programs for monitoring the most interesting system parameters. Moreover, professional monitoring programs do not offer a uniform interface from which the required parameters can be obtained. In general, it would be useful to use only commands and system calls which exist on every machine and guarantee the same output. However, calling these commands frequently triggers fork-exec combinations, which considerably increase the load at the moment of measurement and falsify system parameters such as CPU use and memory accesses [2]. Hence, other methods are needed to obtain information about the system state within a single process.
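As an illustration of measuring inside a single process, the following Python sketch derives the process-count load index by reading a Linux-style process file system directly instead of spawning ps. It is only a sketch under assumptions: it relies on the Linux layout with one numeric directory per process and a `stat` file whose state field follows the command name in parentheses, and the names `process_state` and `load_index` are hypothetical. Other UNIX versions would need a different backend.

```python
import os


def process_state(stat_text):
    """Extract the state letter from the contents of a Linux /proc/<pid>/stat file.

    The command name is enclosed in parentheses and may itself contain spaces
    or parentheses, so the text is split at the last closing parenthesis.
    """
    return stat_text.rsplit(")", 1)[1].split()[0]


def load_index(proc_root="/proc"):
    """Return (number of processes, number of runnable processes).

    Only directory listings and file reads are used, so no fork-exec
    combination disturbs the measurement.
    """
    total = runnable = 0
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # entries such as "self" or "cpuinfo" are not processes
        try:
            with open(os.path.join(proc_root, name, "stat")) as stat_file:
                state = process_state(stat_file.read())
        except (OSError, IndexError):
            continue  # process vanished during the scan or unexpected format
        total += 1
        if state == "R":
            runnable += 1
    return total, runnable
```

The `proc_root` parameter exists only so that the function can be pointed at a test directory; on a Linux host the default `/proc` would be used.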

Table 2: General sources of system and process information under several UNIX versions

Columns: name; user command; system call; maintenance command; LINUX; SunOS; System V; description

ps        X X X X      report process status
pstat     X X          print system facts
iostat    X X X        report i/o statistics
vmstat    X (X) X X    report virtual memory statistics
w         X (X) X X    who is logged in and what they are doing
uname     X X X X X    system identification
sysinfo   X X X *)     system information
netstat   X X X        show network status
top       X X          system and process information
time      X X X X      time a command
times     X X X X      get current process times
/proc     X X          process file system
acct...   X (X) X X    process accounting
sa        X X X        system accounting

*) different meaning under System V

Newer UNIX versions offer information on the most important system facts within the process file system /proc, which is maintained by the kernel in main memory. Unfortunately, even here no standard is in sight. The volume of information offered differs, as do the forms of representation. LINUX, for instance, offers system and process information in the form of ASCII files, whereas Solaris yields pointers to special memory areas where the information is stored.

Figure 4: The necessity of a uniform load information basis (the scheduler accesses a uniform load information basis on top of LINUX, SunOS, System V, Solaris, OSF/1, ...)

A uniform basis for load and system facts which provides the scheduler with the required information (see fig. 4) is useful for a general implementation of load balancing policies in a heterogeneous environment. First realizations of such a uniform load information basis exist for SunOS4 and LINUX. Under SunOS4, direct access to the process table is necessary, whereas under LINUX the process file system is used in combination with system calls. Further development will cover other operating systems.

5. Simulation model and first results

An event-driven simulation was used to compare the behaviour of the algorithms described in table 1 within a 3x3 CNA cluster using Ethernet communication channels. Within this model, workstations are represented by dynamic process lists, and the processing of the inserted events is controlled individually for each workstation depending on the length of its process list (load-dependent characteristic). The predefined process times (the time the processes have to stay in the system) are increased depending on the final location of the process within the cluster if the scheduler does not assign local processing. Communication costs and the time interval determining the load measurement period are varied. In order to investigate the behaviour of these algorithms in a heterogeneous environment, performance factors were used to differentiate between more and less powerful workstations within the cluster.

Fig. 5a and 5b show the behaviour of the tested policies at different transfer costs and the effect of the update period for load information. The figures show the mean response time of the simulated processes normalized to the mean response time when no load balancing is performed. Transfer costs refer to direct transfers without any gateway. In the simulation, these costs increase to twice the value if the transfer is routed via a gateway (recent research has shown that the real factor in the CNA cluster is approximately 1.5 [6], which improves the results).
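The cost model and the gradient-style placement can be made concrete with a short Python sketch. It is a simplified illustration, not the simulator used for the results: the names `transfer_penalty` and `choose_target`, the threshold rule and the tie-breaking are assumptions, and the gateway factor is applied once to any transfer that needs more than one hop, which matches the two-dimensional case only.

```python
def hops(a, b):
    """Channel traversals between two mesh coordinates (one per differing axis)."""
    return sum(1 for x, y in zip(a, b) if x != y)


def transfer_penalty(src, dst, transfer_cost, gateway_factor=2.0):
    """Extra process time charged when a task is moved from src to dst.

    A direct transfer costs `transfer_cost`; a transfer routed via a gateway
    costs `transfer_cost * gateway_factor` (2.0 in the simulation, about 1.5
    reported for the real CNA cluster).
    """
    distance = hops(src, dst)
    if distance == 0:
        return 0.0
    return transfer_cost if distance == 1 else transfer_cost * gateway_factor


def choose_target(origin, load, threshold, max_hops=1):
    """Pick the least loaded node within max_hops of origin (gradient-style).

    `load` maps node coordinates to the process-count load index. The task
    stays local if the origin is below the threshold or no node is less loaded.
    """
    if load[origin] < threshold:
        return origin
    candidates = [node for node in load if hops(origin, node) <= max_hops]
    best = min(candidates, key=lambda node: (load[node], hops(origin, node)))
    return best if load[best] < load[origin] else origin


# Example load table for a 3x3 cluster (values are made up for illustration).
load = {(x, y): 3 for x in range(1, 4) for y in range(1, 4)}
load[(1, 1)] = 5
load[(2, 1)] = 1
load[(3, 3)] = 0
```

With this load table, a task arriving at (1,1) goes to (2,1) when only direct neighbours are considered, and to (3,3) when two hops are allowed, at the price of the higher gateway transfer cost.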
Figure 5: The reduction of the response time depending on transfer costs and update period for load information. a) Effect of different transfer costs (x-axis: transfer costs in % of processing costs); b) effect of the update period for load information (x-axis: update period in abstract time units). In both panels the y-axis is the normalized mean response time (None = 1), with curves for Sender, Greedy, Gradient and Reserv.

The gradient strategy yields the shortest mean response time if the update period for load information is short enough compared to the global task arrival rate, which was set to 10 time units. For update periods much longer than the global arrival rate, the information basis of the scheduler no longer corresponds to the real load in the system, so that the mean response time even exceeds the value obtained without scheduling.

Although the aforementioned strategies were originally created for homogeneous environments, further investigation has shown that they also yield good performance improvements within heterogeneous environments. This can be explained by the load index used, which allows more powerful machines to be recognized indirectly, since their higher processing performance leads to more task terminations within an update period than on other machines.

6. Conclusions

This paper presented the concept of a CNA cluster and introduced static load balancing strategies suitable for distributing arbitrary tasks within such an environment. Different approaches were proposed to distribute parallel and sequential tasks. With respect to sequential tasks, first results show that the gradient model load balancing method is superior to sender-initiated and receiver-initiated strategies if it is adapted to the special communication possibilities. Future work will concentrate on simulating the proposed load balancing algorithm for distributing parallel tasks and on extending the uniform load information basis to other UNIX versions such as System V. The aim is the development of a comprehensive scheduling system which enables the use of CNA clusters as general servers for arbitrary tasks.

References

[1] Klein; Tavangarian; Hipper; Koch: A New Concurrent Network Architecture (CNA) for an Efficient Parallel Computing in Workstation Clusters, Workshop Parallele Datenverarbeitung im Verbund von Hochleistungs-Workstations, Fachberichte Informatik, 09/1994, Universität Koblenz-Landau
[2] Burchert: Verteilung von Applikationen in einem konzentrierten Workstation-Cluster, Diplomarbeit am Lehrgebiet Technische Informatik II, FernUniversität Hagen, 1995
[3] Eager; Lazowska; Zahorjan: A Comparison of Receiver-Initiated and Sender-Initiated Adaptive Load Sharing, Performance Evaluation, Vol. 6, No. 1, 03/1986, p
[4] Chowdhury: The Greedy Load Sharing Algorithm, Journal of Parallel and Distributed Computing (Academic Press), No. 9, 1990, p
[5] Lin; Keller: The Gradient Model Load Balancing Method, IEEE Transactions on Software Engineering, Vol. SE-13, No. 1, 01/1987, p
[6] Hipper; Koch; Tavangarian: Eine parallele Kommunikationsarchitektur für Workstation-Cluster, 3. GI/ITG Fachtagung Arbeitsplatzrechensysteme (APS), Hannover 1995, ISBN , p
[7] Stone: Multiprocessor Scheduling with the Aid of Network Flow Algorithms, IEEE Transactions on Software Engineering, Vol. SE-3, No. 1, 01/1977, p
[8] Lo, V. M.: Heuristic Algorithms for Task Assignment in Distributed Systems, IEEE Transactions on Computers, Vol. 37, No. 11, 11/1988, p
[9] Barak, Amnon et al.: The MOSIX Distributed Operating System - Load Balancing for UNIX, Springer 1993, ISBN
[10] Douglis, Fred; Ousterhout, John: Transparent Process Migration: Design Alternatives and the Sprite Implementation, Software - Practice and Experience, Vol. 21, No. 8, 08/1991, p
[11] Krueger; Livny: A Comparison of Preemptive and Non-Preemptive Load Distributing, Proceedings of the 8th International Conference on Distributed Computing Systems (IEEE), 06/1988, p
[12] Beerbohm; Bresgen; Hofestädt; Huang: Dynamic Load Balancing on Workstation Clusters, Workshop Parallele Datenverarbeitung im Verbund von Hochleistungs-Workstations, Fachberichte Informatik, 09/1994, Universität Koblenz-Landau
[13] Eager; Lazowska; Zahorjan: The Limited Performance Benefits of Migrating Active Processes for Load Sharing, ACM Performance Evaluation Review, Vol. 16, No. 1, 05/1988, p
[14] Joosen; Verbaeten: On the Use of Process Migration in Distributed Systems, Microprocessing and Microprogramming (North-Holland), 1986, p
[15] Leland; Ott; Teunis: Load Balancing Heuristics and Process Behavior, ACM Performance Evaluation Review, Vol. C-33, No. 1, 05/1986, p
[16] Eager; Lazowska; Zahorjan: Adaptive Load Sharing in Homogeneous Distributed Systems, IEEE Transactions on Software Engineering, Vol. SE-12, No. 5, 05/1986, p
