Transactions on Information and Communications Technologies vol 15, 1997 WIT Press, ISSN


Balanced workload distribution on a multi-processor cluster

J.L. Bosque*, B. Moreno^, L. Pastor^

*Departamento de Automática, Escuela Universitaria Politécnica de la Universidad de Alcalá, Alcalá de Henares, Madrid, Spain. E-mail: jbosque@aut.alcala.es

^Departamento de Tecnología Fotónica, Facultad de Informática, Universidad Politécnica de Madrid, 28660 Boadilla del Monte, Spain. E-mail: lpastor@fi.upm.es, bmoreno@sidra.dtf.fi.upm.es

Abstract

This paper presents LoadBalancer, an application that aims to execute compute-intensive tasks over a cluster of machines linked through a local or wide area network. Work packages are evenly distributed among the computers that compose the virtual machine. The available processors are arranged using a "farm" strategy, in which a master process sends work packages to the slave processors that are integrated in the cluster. LoadBalancer has been developed for Construcciones Aeronauticas, S.A. (Space Division), under the EU ESPRIT Programme.

1 Introduction

Parallel processing and multiprocessor machines have been posed as a solution for making high-performance computing both available and affordable for many scientific and engineering problems [1][2]. Nevertheless, parallel machines beyond shared-memory multiprocessors with a few CPUs command higher hardware and software prices, and often require skills that are outside many users' background. These reasons have prevented, or at least constrained, their widespread use in companies and research centres. On the other hand, it is very common to find users whose computing needs have been met over time by purchasing workstations that were later interconnected by a relatively fast network. In a logical evolution, distributed systems [3][10] have naturally appeared as a low-cost alternative for users wishing to achieve high computing power for solving their problems at an affordable cost.
The main idea behind this approach is to configure a group or cluster of workstations, interconnected through a network, to form a parallel virtual machine with high computing power, able to solve compute-intensive problems in short times, using hardware resources that often

114 High Performance Computing

are already available. This has been made possible thanks to the speed, price and reliability improvements achieved by computer networks, in particular with the arrival of optical fiber. A cluster or parallel virtual machine can be seen as a set of possibly heterogeneous, independent machines, connected through a fast communication network, working together under the management of distributed software on the solution of a particular problem. Data communication and process synchronisation are performed using message-passing primitives, usually under a client/server architecture. This paper presents LoadBalancer, a distributed application implemented with a master/slave architecture under PVM (Parallel Virtual Machine), with the following objectives:

- To execute compute-intensive applications over a cluster of machines linked by an interconnection network, while performing a balanced workload distribution among the heterogeneous set of processors which compose the cluster.
- To keep the communication overhead associated with the work distribution as low as possible (communication overheads strongly affect multiprocessor performance).
- To decrease the overall system latency (the user response time from the instant when the execution is started to the moment when the results are produced).

LoadBalancer has been developed for Construcciones Aeronauticas, S.A. (Space Division) within the framework of the EU ESPRIT programme, focusing on the development of parallel Monte Carlo methods for structure analysis. The following sections describe the application environment and structure, the tests performed and the results achieved. Finally, the conclusions that can be drawn from the experimental results are presented.

2 Application description

2.1 Environment

The hardware over which the application runs is composed of a set of independent nodes, interconnected through a communication network.
The nodes can have heterogeneous architectures, although all of them have to run under the UNIX operating system [6] [9]. Therefore, the hardware can be seen as a distributed system [7].

The communication network used for linking the nodes can be local or wide area (the communication network can also be heterogeneous). An important aspect to take into account is the network traffic: heavily loaded networks can become a bottleneck, largely determining the overall application performance. The hardware used is conceptually similar to a distributed memory multiprocessor. We will refer to it in the rest of the paper as the virtual machine (VM).

2.2 System configuration

The VM configuration is done dynamically. It can therefore be changed between different executions of the application. This process is transparent from the user's point of view: the user only needs to provide a configuration file with the IP addresses of all the machines that can take part in the VM. The application starts by reading the configuration file on a first machine, attempting later on the connection to the specified computers. If the connection process succeeds, the remote node is added to the VM. Otherwise, the user is informed of the resulting error, and the operation continues with the remaining machines. This process is performed using PVM primitives [9]. Once the final VM configuration is achieved, the user is presented with a graphical schematic describing the system configuration.

2.3 Application structure

The application is basically composed of a computing process, called 'solver', which has to process a (large) number of files. As stated in the introduction, the first objective posed for LoadBalancer is the even workload distribution among the available processors. For that purpose, a "farm" [7] strategy was selected: a master process is executed on a central node, in charge both of the configuration of the VM and of the distribution of the work packages among the different slave nodes.
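As an illustration, the node-connection loop of section 2.2 can be sketched as follows. This is a minimal Python sketch, not the actual PVM-based implementation: `connect` is a hypothetical stand-in for the PVM host-addition step.

```python
def configure_vm(config_path, connect):
    """Build the virtual machine from a configuration file of IP
    addresses, skipping hosts that cannot be reached.

    `connect` is a stand-in for the PVM host-addition primitive: a
    callable that tries to add one host and returns True on success.
    """
    vm = []
    with open(config_path) as f:
        hosts = [line.strip() for line in f if line.strip()]
    for host in hosts:
        if connect(host):
            vm.append(host)  # node joins the virtual machine
        else:
            # report the error and continue with the remaining machines
            print(f"warning: could not connect to {host}, skipping")
    return vm
```

Failed hosts are reported but do not abort configuration, matching the behaviour described above.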
The master process has to perform a number of steps before the solver can start processing each of the data files associated to each run: first, a number of user-defined parameters have to be read in order to set up the application environment. Figure 1 presents a Motif window [9][10][11] showing the required data. After the data is read, the master configures the VM using the IP addresses provided by the user.

Figure 1: User-defined parameters for the application setup.

The third step performed by the master is the execution of the slave processes on each node, which include different solver instances. The master has to supply each slave with the execution parameters required by the solver as well as with the raw data files to be processed, waiting then to gather the results provided by each slave. Slave processes, on the other hand, have to store the received file on the local node, start the solver execution using the data contained in the file and return the results produced to the master process. During the solver execution, each slave process has to check the execution time, aborting the solver if the time exceeds a predetermined span. Last, the master has to gather the results provided by each slave for each of the allocated raw data files, storing them in a results database. During the whole process, the master presents the user with real-time graphics describing the application execution. Once all of the files have been processed, the final statistics are computed, an accounting file summarizing the whole process is generated, and the application is finished.
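The farm scheduling described above can be sketched in miniature as follows. Here a shared queue stands in for the master's message-based dispatch, Python threads replace the PVM slave processes, and the per-file timeout check is omitted for brevity; this is an illustration of the scheduling idea, not the actual implementation.

```python
import queue
import threading

def farm(files, num_slaves, solve):
    """Farm strategy in miniature: each 'slave' pulls the next work
    package as soon as it finishes the previous one, so faster nodes
    naturally process more files (the load-balancing effect)."""
    work = queue.Queue()
    for f in files:
        work.put(f)
    results = {}
    lock = threading.Lock()

    def slave():
        while True:
            try:
                f = work.get_nowait()   # ask the master for work
            except queue.Empty:
                return                  # no files left: slave finishes
            r = solve(f)                # run the solver on this file
            with lock:
                results[f] = r          # return the result to the master

    threads = [threading.Thread(target=slave) for _ in range(num_slaves)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because work is pulled on demand rather than pre-assigned, no static partitioning of the file set among heterogeneous processors is needed.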

LoadBalancer can be used with different solvers, keeping the processing structure independent of the data processing algorithms. In fact, it could be used with any application that performs heavy computation on blocks of data stored in registers. Figure 2 describes the general application structure.

Figure 2: General application structure.

The graphical information presented by the master during the execution allows the user to find out the structure of the virtual machine (specifying whether the nodes are active, not active or communicating with the master), as well as the workload handled by each CPU from the beginning of the application up to any given moment, and the mean communication time between each host and the master. Figure 3 displays the way this information is presented to the user.

3 Experimental results

A number of tests have been performed to check LoadBalancer's performance when clusters and problems of different sizes are taken. This section presents first the experimental setup (including both hardware and software), describing afterwards the execution times obtained during the trials.

Figure 3: Real-time execution information displayed by the master process (nodes shown as not active, active or communicating).

3.1 Hardware and software setup

The hardware available for testing the application consisted of nine ALPHA 400 workstations from DEC. One of the workstations is a server, being the machine selected both for the execution of the solver when only one processor was used and for running the master process when more than one processor was used. The other eight workstations were selected for the execution of slave processes. The ALPHA workstations' most salient features are:

- Server:
  Processor: AS400 at 144 MHz
  Memory: 64 MB
  Mass storage: 2.5 GB on 1 SCSI disk
  Operating system: DEC/OSF1 v3.2 (UNIX)
- Slaves:

  Processor: AS400 at 100 MHz
  Memory: 32 MB
  Mass storage: 1.2 GB on 1 SCSI disk
  Operating system: DEC/OSF1 v3.2 (UNIX)

The available workstations are linked through a departmental LAN belonging to the Laboratory of Telematics (Dept. of Automática, Univ. of Alcalá de Henares). The fact that a departmental network has been used is relevant for two reasons:

- The situation is closer to "real world" working conditions.
- The LAN traffic conditions can affect subsequent executions differently, introducing a small degree of distortion in the times reported in this paper.

The LAN used is an Ethernet using TCP/IP protocols. The network is divided into four segments, with a 16-port hub available to perform efficient routing. The network bandwidth is 10 Mbit/s. With respect to software considerations, it was mentioned before that LoadBalancer can work with different solvers. Although the application was developed within a structure analysis environment, the experiments presented here have used a simple matrix multiplication solver. Therefore, each of the input data files used for the tests contains two matrices and their respective dimensions. Three different trials will be presented here. They involve processing three sets of 50, 75 and 100 files, each file having a random problem dimension (the matrices' dimensions, although compatible for matrix product, are selected randomly between a minimum value of 20 and a maximum of 500). For each of these trials different executions have been done, changing the number of processors while keeping the input data files constant. It has to be noted that the figures given for executions using only one processor have been obtained using an entirely sequential algorithm (only the solver was started on the server, with therefore no parallelism or communications overheads).
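The trial setup and the matrix-multiplication solver can be sketched as follows. This is a hypothetical in-memory stand-in: in the real trials each pair of matrices is stored in a data file shipped to a slave.

```python
import random

def make_trial(num_files, lo=20, hi=500, seed=0):
    """Generate one trial's work packages: each 'file' holds two
    randomly sized matrices whose dimensions are compatible for
    multiplication (dimensions drawn from [lo, hi], as in the paper)."""
    rng = random.Random(seed)
    trial = []
    for _ in range(num_files):
        m, k, n = (rng.randint(lo, hi) for _ in range(3))
        A = [[rng.random() for _ in range(k)] for _ in range(m)]  # m x k
        B = [[rng.random() for _ in range(n)] for _ in range(k)]  # k x n
        trial.append((A, B))
    return trial

def solver(package):
    """Plain matrix product C = A x B: the compute kernel each slave runs."""
    A, B = package
    return [[sum(a * b for a, b in zip(row, col))
             for col in zip(*B)] for row in A]
```

For example, `solver(([[1, 2], [3, 4]], [[5, 6], [7, 8]]))` yields `[[19, 22], [43, 50]]`. Random dimensions make the work packages deliberately unequal in cost, which is exactly what the farm strategy absorbs.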
3.2 Execution times

The experimental results obtained with the hardware and software setup are summarized in figures 4 to 7. Figure 4 gives the execution time dependence on the number of available slave processors (the figures do not include the master processor). Three problem sizes have been considered: the input data set was composed of 50, 75 and 100 files

respectively. Times given in figure 4 are total user response times. The time needed by the user to enter the input data has not been taken into consideration for these latency values, although the times needed for the configuration of the VM have been included. Figures 5 and 6 show the speedup and efficiency factors [4][5] for processing 50, 75 or 100 files when one to eight slave machines are used.

Figure 4: Execution time versus number of slave machines for different numbers of processed files.

Figure 5: Speedup figures for different VM configurations and numbers of processed files.

Figure 6: Efficiency figures for different VM configurations and numbers of processed files.
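The metrics plotted in figures 5 to 7 follow the standard definitions [4][5]; a small sketch, with illustrative numbers that are not taken from the paper's measurements:

```python
def speedup(t_serial, t_parallel):
    """S(n) = T(1) / T(n): purely sequential time over parallel time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_slaves):
    """E(n) = S(n) / n: the fraction of ideal linear speedup achieved."""
    return speedup(t_serial, t_parallel) / n_slaves

def comm_overhead(total_comm_time, total_comp_time):
    """Overhead ratio of figure 7: total communication time for n slaves
    divided by total computation time for the same VM configuration."""
    return total_comm_time / total_comp_time

# Illustrative numbers only: a 1000 s sequential run that finishes in
# 156.25 s on 8 slaves gives S = 6.4 and E = 0.8, i.e. within the
# 72-85% efficiency band reported in the conclusions.
```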

Last, figure 7 shows the dependence of the communications overhead on problem size and the number of slave machines. This overhead is given by the ratio of the total communication time for "n" slaves to the total computation time for the same VM configuration.

Figure 7: Dependence of communications overhead on problem size and number of slave machines.

4 Conclusions

The analysis of the experimental results allows the formulation of a number of conclusions.

First, the exploitation of asynchronous communication protocols such as the one implemented in LoadBalancer for the master/slave communications allows the achievement of low communication overheads. As can be seen in figure 7, these overheads have always been below 10%, with an average value of around 7 to 8%.

Second, the numbers obtained for both speedup and efficiency are quite good: for larger jobs, the efficiency stays around or above 70%. For smaller jobs, the initialization times negatively affect the application performance. It has to be remembered that the values used for execution on only one processor include just the solver, without parallelism or communications overhead. Moreover, the machine used for these serial executions is the most powerful one, making the results look worse.

The application structure also allows it to reach a good degree of scalability: increasing the number of slave processors from one to eight for processing 100 files makes the efficiency vary between 72% and 85%.

Last, it has to be noted that the results given in this paper have been obtained with a communication network shared with other users. Since the executions on just one processor do not use the network, these results could be further improved by restricting other users' network usage.

5 References

[1] Kevin Dowd, 'High Performance Computing', O'Reilly & Associates, Inc., 1995.
[2] Bruce P. Lester, 'The Art of Parallel Programming', Prentice-Hall International, 1994.
[3] Andrew S. Tanenbaum, 'Distributed Operating Systems', Prentice-Hall, 1996.
[4] Kai Hwang, 'Advanced Computer Architecture', McGraw-Hill, 1993.
[5] V. de Carlini and U. Villano, 'Transputers and Parallel Architectures', Ellis Horwood, 1991.
[6] Kay Robbins & Steven Robbins, 'Practical UNIX Programming', Prentice-Hall, 1996.
[7] G. Coulouris, 'Distributed Systems: Concepts and Design', Addison-Wesley, 1996.
[8] Shivaratri et al., 'Load Distributing for Locally Distributed Systems' (web).
[9] 'PVM 3 User's Guide and Reference Manual', ORNL/TM-12187, May 1994.
[10] Open Software Foundation, 'OSF/Motif Style Guide' for OSF/Motif Release 1.1, Prentice-Hall, 1991.
[11] Open Software Foundation, 'OSF/Motif Programmer's Guide' for OSF/Motif Release 1.1, Prentice-Hall, 1991.
[12] Open Software Foundation, 'OSF/Motif Programmer's Reference' for OSF/Motif Release 1.1, Prentice-Hall, 1991.