Multi-site Jobs Management System (MJMS): A tool to manage multi-site MPI applications execution in Grid environment


This work has been partially supported by the Italian Ministry of Education, University and Research (MIUR) within the activities of the WP9 work package "Grid Enabled Scientific Libraries", part of the MIUR FIRB RBNE01KNFP Grid.it project.

Jaime Frey (Department of Computer Science, University of Wisconsin, Madison, WI; email: jfrey@cs.wisc.edu)
Francesco Gregoretti and Gennaro Oliva (Institute for High Performance Computing and Networking, ICAR-CNR, Naples, Italy; email: {francesco.gregoretti,gennaro.oliva}@na.icar.cnr.it)
Giuliano Laccetti and Almerico Murli (University of Naples Federico II, Naples, Italy; email: {giuliano.laccetti,almerico.murli}@dma.unina.it)

Abstract

Multi-site parallel applications consist of multiple distributed processes running on one or more potentially heterogeneous resources in different locations. This class of applications can effectively use the Grid for high-performance computing. We propose a Multi-site Jobs Management System (MJMS), based on the Globus Toolkit, the MPICH-G2 library and Condor-G, for the effective, reliable and secure execution of multi-site parallel applications in the Grid environment. The system allows the user to submit execution requests specifying application requirements and preferences in a high-level language (following the Condor submit file syntax), freeing her from the tasks of resource discovery and co-allocation: it efficiently selects available computing resources for execution according to the application requirements and interacts with the local management systems through the Condor-G system.

I. Introduction and state of the art

Multi-site parallel applications consist of multiple distributed processes running on one or more potentially heterogeneous resources in different locations. This class of applications can effectively use the Grid for high-performance computing by grouping computing resources for a single application execution. In such a distributed environment programmers can benefit from the message passing paradigm, and in particular from the MPI standard, because it is easy to understand and use, architecture-independent and portable. Furthermore, MPI is a widely used standard for high performance computing and can therefore make Grid application development more accessible to programmers with parallel computing skills. The first MPI standard (MPI-1) was released in 1994 and since then many libraries and tools have been developed based on it.

There are several grid-enabled MPI libraries [1]: MagPIe, MPICH-G2, MPI Connect, MetaMPICH, Stampi, PACX-MPI, etc. Many of these implementations allow the grouping of multiple machines, potentially based on heterogeneous architectures, for the execution of MPI programs, and the use of vendor-supplied MPI libraries over high performance networks for intra-machine messaging. All of the available libraries support co-allocation as well as process synchronization, but they do not provide any execution management tools: users must explicitly specify the resources to be used for the execution and directly handle possible failures. Among these libraries, MPICH-G2 [2] is one of the most complete and widespread, and many of its features are very useful in a Grid environment. MPICH-G2 provides the user with an advanced interconnection topology description with multiple levels of depth, which gives a detailed view of the underlying execution environment.
It uses the Globus Security Infrastructure (GSI) [3] for authorization and authentication, the Grid Resource Allocation and Management (GRAM) protocol [4] for resource allocation, and the DUROC library [5] for process synchronization.

The adoption of job management systems like Condor-G [6] and the DataGrid WMS [7] has facilitated the use of the Grid for the execution of sequential and parallel applications. These systems accept user execution requests consisting of input files, execution binaries, environment requirements and other parameters, without necessarily naming the specific resources to run the job. They transparently handle resource selection, resource allocation, file transfers, application monitoring and failure events on the users' behalf. Moreover, during the life cycle of their jobs, users can query the systems to retrieve detailed information about the current and past status of the jobs. The use of a job management system therefore hides the complexity of the Grid by raising the level of abstraction of the architecture. Existing management systems, however, do not allow the execution of multi-site parallel applications, because they do not handle requests for multiple resources within a single job

execution. Furthermore, the available systems do not interact with existing grid-enabled MPI libraries for process synchronization. We propose an execution management tool called Multi-site Jobs Management System (MJMS), based on the Globus Toolkit [8], the MPICH-G2 library and Condor-G, for the effective, reliable and secure execution of multi-site parallel applications in the Grid environment.

II. MJMS: Multi-site Jobs Management System

MJMS manages the execution of parallel MPI applications on multiple resources in a Grid environment. It interacts with the Condor-G daemons, exploiting the Condor-G system's features for job execution management, and uses the services provided by DUROC to handle job synchronization across sites. The Condor-G system has many features that are useful for our purposes:

it is capable of interacting with the GRAM service provided by the Globus Toolkit for job submission and monitoring;
it provides a high-level language called ClassAds (Classified Advertisements) [9] to describe job requirements and preferences as well as the characteristics of Grid resources;
it has fault-tolerance features;
it can manage distributed I/O operations;
it provides a robust mechanism called Matchmaking to match a job request with a computing resource.

However, the Condor-G system has three limitations that prevent its use for our class of applications:

it does not support the execution of MPICH-G2 jobs;
its Matchmaking mechanism matches each single job request with a single resource, precluding co-allocation, whereas we need to request a consistent collection of resources for a single MPI execution;
the Condor collector daemon, responsible for collecting all the information about the status and characteristics of the available Grid resources, does not interact with the Globus Information System (MDS) [10].

The system we propose allows users to submit execution requests for multi-site parallel applications by specifying requirements and preferences in a high-level language (following the Condor submit file syntax; see figure 1). Requirements and preferences should reflect the application's needs in terms of computational and communication costs, as assessed by the user. The requirements and preferences of parallel and multi-site parallel applications may differ from those of sequential applications: parallel application performance can depend on both communication and computational costs. For this reason users should be able to specify network characteristics such as bandwidth and latency alongside job requirements and preferences such as CPU performance. A user can utilize multiple Grid resources for a single MPI job execution by splitting her application into subjobs; MJMS assigns each subjob to a single Grid resource. MJMS accepts the user's submit description file and locates a matching pool of available resources, among those published by the Globus Information System, according to the application requirements and preferences.
For resource location MJMS uses Gangmatching [11], an extended version of Condor Matchmaking [9] capable of matching multiple resource requests within a single job execution request. Once compatible resources have been located, MJMS submits the subjobs to the Condor-G system, specifying the target resource for each subjob. Process synchronization is achieved by a barrier controlled by the Coordinator: each process placed in execution stops at the barrier and, once all processes have been successfully started, MJMS starts the computation by releasing the barrier.

(subjob1)
universe = globus
executable = bcg_dist
arguments = s3rmt3m3.mtx 3 bs3rmt3m3.mtx 7 15 16 17 18 24 25 31
machine_count = 17
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = s3rmt3m3.mtx,s3rmt3m3_rhs.mtx
output = outfile.$(cluster)
error = errfile.$(cluster)
log = logfile.$(cluster)
environment = P4_SETS_ALL_ENVVARS=1; MGF_PRIVATE_SLAN=1; LD_LIBRARY_PATH=/opt/globus-2.4.3/lib/
globusrsl = (jobtype=mpi) (count=17) (label=subjob 0)
requirements = (OpSys == "LINUX" && Arch == "i686") && (ClusterNodeCount >= 17)
rank = 10000/ClusterNetLatency + 10*ClusterCPUsSpecFloat
queue

(subjob2)
universe = globus
executable = bcg_dist
arguments = s3rmt3m3.mtx 3 bs3rmt3m3.mtx 7 15 16 17 18 24 25 31
machine_count = 4
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = s3rmt3m3.mtx,s3rmt3m3_rhs.mtx
output = outfile.$(cluster)
error = errfile.$(cluster)
log = logfile.$(cluster)
environment = P4_SETS_ALL_ENVVARS=1; MGF_PRIVATE_SLAN=1; LD_LIBRARY_PATH=/opt/globus-2.4.3/lib/
globusrsl = (jobtype=mpi) (count=4) (label=subjob 17)
requirements = (OpSys == "LINUX" && Arch == "i686") && (ClusterNodeCount >= 4)
queue

(subjob3)
universe = globus
executable = bcg_dist
arguments = s3rmt3m3.mtx 3 bs3rmt3m3.mtx 7 15 16 17 18 24 25 31
machine_count = 14
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = s3rmt3m3.mtx,s3rmt3m3_rhs.mtx
output = outfile.$(cluster)
error = errfile.$(cluster)
log = logfile.$(cluster)
environment = P4_SETS_ALL_ENVVARS=1; MGF_PRIVATE_SLAN=1; LD_LIBRARY_PATH=/opt/globus-2.4.3/lib/
globusrsl = (jobtype=mpi) (count=14) (label=subjob 21)
+Subnet1 = subjob1.subnet
requirements = (OpSys == "LINUX" && Arch == "i686") && (ClusterNodeCount >= 14)
rank = 10*ClusterCPUsSpecFloat
queue

Fig. 1. MJMS submit description file example.
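As a quick illustration of how such a rank expression orders candidate clusters, the snippet below (our own illustration, not part of MJMS) evaluates subjob1's rank against the attribute values published in figures 4, 5 and 6 later in the paper; we take the submit file's ClusterNetLatency to be the zero-byte MPI latency that the machine ads publish as ClusterNetMPI0Latency (the two names differ in the text).

# Illustration only: evaluating subjob1's rank expression from figure 1
# against the cluster attributes of figures 4-6.
clusters = {
    "vega":    {"latency_us": 38,  "specfloat": 558},
    "altair":  {"latency_us": 117, "specfloat": 6},
    "beocomp": {"latency_us": 65,  "specfloat": 13},
}

def rank(ad):
    # rank = 10000/ClusterNetLatency + 10*ClusterCPUsSpecFloat
    return 10000 / ad["latency_us"] + 10 * ad["specfloat"]

for name, ad in sorted(clusters.items(), key=lambda kv: -rank(kv[1])):
    print(f"{name}: rank = {rank(ad):.0f}")
# vega: 5843, beocomp: 284, altair: 145 -> Vega is preferred for subjob1.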

Input and output file staging and subjob logging are handled by the Condor-G system.

The MJMS system architecture, depicted in figure 2, consists of three main components: the Coordinator, the GangMatchMaker and the Advertiser.

Fig. 2. MJMS architecture.

A. Coordinator

The Coordinator accepts user submit description files, invokes the GangMatchMaker to locate a pool of matching resources and delegates the management of subjob execution to the Condor-G system. An MJMS job may be in one of the following states:

ACCEPTED: the job has been accepted by the system;
MATCHED: a pool of matching resources has been located;
UNMATCHED: no set of resources matches the job request;
BARRIERED: not all subjobs are yet held at the DUROC-controlled barrier;
RUNNING: the subjobs have been released from the barrier.

When the job is ACCEPTED, all of its subjobs are enqueued and held in the Condor-G queue; target Grid resources have not yet been identified. The DUROC services are initialized and a synchronization barrier for all the subjobs is set up by modifying subjob environment variables (the synchronization method depends on the grid-enabled MPI library in use: MPICH-G2 process synchronization is implemented through the DUROC library, which relies on environment variables).

The Coordinator creates a single ClassAd containing the Requirements and Rank of all subjobs as stated in the corresponding submit description file, and sends the resulting ClassAd to the GangMatchMaker, which is responsible for locating a pool of matching Grid resources. If the GangMatchMaker does not find compatible resources, the job state becomes UNMATCHED; otherwise it becomes MATCHED and the Coordinator sets the target Grid resources by rewriting each subjob's GlobusResource attribute with the Gatekeeper contact string of the matching resource. The Coordinator then releases the subjobs from the Condor-G queue and the job state becomes BARRIERED. Once all subjobs are running on the selected resources, the Coordinator starts the computation by releasing the synchronization barrier; the job state becomes RUNNING and the execution management of all subjobs is from this point onward in charge of the Condor-G system.
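To summarize this life cycle, here is a minimal sketch (our own Python illustration, not MJMS source code; the collaborators gang_matcher, condor_g and duroc and all of their method names are hypothetical):

# Hypothetical sketch of the Coordinator's job life cycle (not MJMS source).
from enum import Enum, auto

class JobState(Enum):
    ACCEPTED = auto()    # subjobs enqueued and held in the Condor-G queue
    UNMATCHED = auto()   # no consistent set of resources found
    MATCHED = auto()     # a pool of matching resources located
    BARRIERED = auto()   # subjobs held at the DUROC-controlled barrier
    RUNNING = auto()     # barrier released, computation in progress

def coordinate(job, gang_matcher, condor_g, duroc):
    job.state = JobState.ACCEPTED
    condor_g.enqueue_held(job.subjobs)        # target resources still unknown
    duroc.init_barrier(job.subjobs)           # via subjob environment variables

    gang = gang_matcher.match(job.classad())  # one ClassAd with all subjob ports
    if gang is None:
        job.state = JobState.UNMATCHED
        return
    job.state = JobState.MATCHED
    for subjob, resource in zip(job.subjobs, gang):
        subjob.GlobusResource = resource.gatekeeper_contact  # set target

    condor_g.release(job.subjobs)             # subjobs start on the Grid
    job.state = JobState.BARRIERED
    duroc.wait_all_at_barrier(job.subjobs)
    duroc.release_barrier()                   # computation starts
    job.state = JobState.RUNNING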
B. GangMatchMaker

The GangMatchMaker uses the Gangmatching model to locate resources for multi-site job execution. In the co-allocation problem for a multi-site parallel application, we need a matchmaking scheme able to match the subjobs' execution requests with the available resources while simultaneously taking into account the subjobs' requirements, preferences and interdependencies; moreover, each subjob has to be matched with a different resource. The Gangmatching model is a possible solution to this multilateral matchmaking problem. It is an extended version of the bilateral Condor Matchmaking and makes use of the ClassAds library for job and resource description.

The ClassAd representing the job execution request is generated by the Coordinator from the submit description file. A job ClassAd for the multi-site parallel application problem, generated from the submit description file of figure 1, is illustrated in figure 3. The most notable feature of this example is the Ports attribute: in the Gangmatching model the bilateral Matchmaking algorithm operates between ports of ClassAds instead of entire ClassAds, and the Ports attribute defines the number and characteristics of the matching offers required for that ClassAd to be satisfied [11].

[
  Type = "Job";
  Ports =
  {
    [
      Label = "subjob1";
      Requirements = subjob1.Type == "Machine" &&
                     subjob1.Arch == "i686" &&
                     subjob1.OpSys == "LINUX" &&
                     subjob1.ClusterNodeCount >= 17;
      Rank = 10000 / subjob1.ClusterNetMPI0Latency +
             10 * subjob1.ClusterCPUsSpecFloat;
    ],
    [
      Label = "subjob2";
      Requirements = subjob2.Type == "Machine" &&
                     subjob2.Arch == "i686" &&
                     subjob2.OpSys == "LINUX" &&
                     subjob2.ClusterNodeCount >= 4;
    ],
    [
      Label = "subjob3";
      Subnet1 = subjob1.Subnet;
      Requirements = subjob3.Type == "Machine" &&
                     subjob3.Arch == "i686" &&
                     subjob3.OpSys == "LINUX" &&
                     subjob3.ClusterNodeCount >= 14;
      Rank = 10 * subjob3.ClusterCPUsSpecFloat;
    ]
  }
]

Fig. 3. ClassAd example describing a multi-site parallel job.

The Gangmatching model allows the job request ClassAd to convey information about matched resources to the resource ClassAds through a specific port: the subjob3 port can refer to an attribute of the resource ClassAd already matched with a preceding port, and it conveys the value of this attribute to the resource ClassAds through the Ports[2].attribute syntax. For example, the job ClassAd of figure 3 conveys the value of the Subnet attribute of the resource ClassAd already matched with the subjob1 port through the Ports[2].Subnet1 syntax. A user can thus request that subjob1 and subjob3 run on resources belonging to the same LAN by adding the expression subjob3.Subnet == Subnet1 among the subjob3 Requirements.

In the above example the Requirements of each port are not influenced by the attributes of the other ports, although one could make some port's Requirements depend on the attributes of a preceding port. As a consequence of label scoping, the dependency relations between ports are in any case limited: labels of successive ports may not be referred to by preceding ports.

Example ClassAds of resource offers are illustrated in figures 4, 5 and 6. The resource ClassAds are very similar to their bilateral matchmaking counterparts, except for the presence of a port that explicitly indicates their intent to match exactly one request port. No Requirements are expressed in these resource ClassAd examples, but complex Requirements may be expressed in an offer ClassAd to implement sophisticated resource management policies; such Requirements could refer to attributes of the job ports.

[
  MyType = "Machine";
  Name = "vega.na.icar.cnr.it";
  gatekeeper_url = "vega.na.icar.cnr.it:2122";
  Arch = "i686";
  OpSys = "LINUX";
  Subnet = "140.164.14";
  ClusterNodeCount = 17;
  ClusterCPUType = "Intel(R) Pentium(R) 4 CPU 1500MHz";
  ClusterNetMPIBandwidth = 88;
  ClusterCPUsSpecFloat = 558;
  ClusterCPUsSpecInt = 535;
  ClusterNetMPI0Latency = 38;
  Ports = { [ Label = "subjob"; ] }
]

Fig. 4. ClassAd extract describing a cluster named Vega.

[
  MyType = "Machine";
  Name = "altair.dma.unina.it";
  gatekeeper_url = "altair.dma.unina.it:2122";
  Arch = "i686";
  OpSys = "LINUX";
  Subnet = "192.167.11";
  ClusterNodeCount = 11;
  ClusterCPUType = "Pentium Pro";
  ClusterNetMPIBandwidth = 74;
  ClusterCPUsSpecFloat = 6;
  ClusterCPUsSpecInt = 8;
  ClusterNetMPI0Latency = 117;
  Ports = { [ Label = "subjob"; ] }
]

Fig. 5. ClassAd extract describing a cluster named Altair.

[
  MyType = "Machine";
  Name = "beocomp.dma.unina.it";
  gatekeeper_url = "beocomp.dma.unina.it:2122";
  Arch = "i686";
  OpSys = "LINUX";
  Subnet = "192.167.11";
  ClusterNodeCount = 14;
  ClusterCPUType = "Pentium II (Deschutes)";
  ClusterNetMPIBandwidth = 82;
  ClusterCPUsSpecFloat = 13;
  ClusterCPUsSpecInt = 17;
  ClusterNetMPI0Latency = 65;
  Ports = { [ Label = "subjob"; ] }
]

Fig. 6. ClassAd extract describing a cluster named Beocomp.

The algorithm implemented by the GangMatchMaker is the simple naive algorithm of [11]: it marshals a consistent gang of ClassAds for each job ClassAd in the system. In our example the gang consists of four ads, one job and three clusters; the resulting match is the tree illustrated in figure 7.

Fig. 7. Gangmatching operation for our job example: each port of the job ad is bilaterally matched with the port of one cluster ad (Vega, Beocomp, Altair), and the four matched ads form the gang.
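The naive search can be pictured with a small backtracking sketch (our own simplified illustration, not MJMS source: requirements are modeled as Python predicates over dictionaries rather than ClassAd expressions, and Rank-based preferences are ignored).

# Illustrative naive gangmatching: assign a distinct resource ad to every
# port of the job ad so that each port's requirement holds; backtrack when
# a partial gang cannot be extended.
def gang_match(ports, resources, gang=None):
    if gang is None:
        gang = []
    if len(gang) == len(ports):
        return gang                      # every port matched: gang complete
    port_ok = ports[len(gang)]
    for ad in resources:
        if ad not in gang and port_ok(ad, gang):
            result = gang_match(ports, resources, gang + [ad])
            if result is not None:
                return result
    return None                          # no consistent gang from here

clusters = [
    {"Name": "vega",    "ClusterNodeCount": 17, "Subnet": "140.164.14"},
    {"Name": "altair",  "ClusterNodeCount": 11, "Subnet": "192.167.11"},
    {"Name": "beocomp", "ClusterNodeCount": 14, "Subnet": "192.167.11"},
]
# The three ports of figure 3; a port may also inspect previously matched
# ads, e.g. the same-LAN variant: ad["Subnet"] == gang[0]["Subnet"].
ports = [
    lambda ad, gang: ad["ClusterNodeCount"] >= 17,   # subjob1
    lambda ad, gang: ad["ClusterNodeCount"] >= 4,    # subjob2
    lambda ad, gang: ad["ClusterNodeCount"] >= 14,   # subjob3
]
print([ad["Name"] for ad in gang_match(ports, clusters)])
# -> ['vega', 'altair', 'beocomp'], the gang of figure 7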
C. Advertiser

The Advertiser is responsible for periodically sending ClassAds describing the status and characteristics of the available resources to the GangMatchMaker and to the Condor-G system. It gathers this information by periodically querying a set of Globus MDS servers, extracting the information that is relevant to MJMS.
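Since MDS2 is LDAP-based, the Advertiser's polling can be pictured as an LDAP search. The sketch below is our own illustration using the python-ldap package; the host name is a placeholder, and port 2135 and the base DN "Mds-Vo-name=local, o=Grid" are common MDS2 defaults assumed here, not values taken from the paper.

# Illustration only: how a component like the Advertiser might poll a
# Globus MDS (GRIS/GIIS) server over LDAP.
import ldap  # python-ldap

def poll_mds(host="mds.example.org", port=2135):
    conn = ldap.initialize(f"ldap://{host}:{port}")
    conn.simple_bind_s("", "")          # MDS typically allows anonymous binds
    entries = conn.search_s(
        "Mds-Vo-name=local, o=Grid",
        ldap.SCOPE_SUBTREE,
        "(objectClass=*)",
        ["cluster-network-mpi-0-latency",      # attributes published by the
         "cluster-network-mpi-1024-bandwidth"] # custom MJMS schema (figure 8)
    )
    conn.unbind_s()
    return entries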

Parallel and multi-site parallel applications generally have requirements different from those of sequential applications. In particular, a user may be interested in information about the high-performance network of a parallel computing resource. For this reason a user should be able to express, in her job requirements and preferences, conditional expressions on the network characteristics (i.e. MPI communication bandwidth and latency for given packet sizes). The Information System provided by the Globus Toolkit does not supply this information, so we extended it with a custom schema and a corresponding custom Information Provider. A sample of the information supplied by the MJMS Information Provider is shown in figure 8.

system-network-vendor: Quadrics
cluster-network-model: Elan
cluster-network-version: 4
cluster-network-mpi-0-latency: 2.63
cluster-network-mpi-1024-latency: 6.32
cluster-network-mpi-32768-latency: 44.40
cluster-network-mpi-64-bandwidth: 25.64
cluster-network-mpi-1024-bandwidth: 177.95
cluster-network-mpi-32768-bandwidth: 718.91

Fig. 8. Network-related information provided by the MJMS Information Provider.

Latency and bandwidth measurements are static information; we used values obtained with the Intel MPI Benchmark suite [12]. The MJMS schema defines where this information is published and retrieved in the MDS tree.

D. The role of Condor-G

The management of subjob execution through the Condor-G system proceeds as follows:

the Condor-G Scheduler creates a new GridManager daemon to submit and manage all subjobs;
the GridManager job submission request results in the creation of one Globus JobManager daemon on each Grid resource;
the JobManager daemon connects to the GridManager using the Globus GASS service [13] in order to transfer the executable binary and standard input files, and to provide the streaming of standard output and error;
the JobManager submits the subjob to the local management system of the target resource;
the JobManager sends subjob status updates back to the GridManager, which in turn updates the Scheduler.

To allow Condor-G to make remote requests on behalf of the user, a proxy credential has to be created. The Condor-G system uses the Globus GSI authentication and authorization system, which makes it possible to authenticate a user just once, so that she need not re-authenticate herself each time a Condor-G computation management service accesses a target remote resource. Condor-G is able to tolerate four types of failure events: a crash of the Globus JobManager; a crash of the machine that manages the target resource (on which the GateKeeper and JobManager are executing); a crash of the GridManager or of the machine on which the GridManager is executing; and failures in the network connecting the two machines.

Condor-G can transfer any files needed by a job for the computation from the machine where MJMS submitted the subjobs to the target resources, and can likewise transfer output files back to MJMS; the user simply specifies in the submit description file which files have to be transferred and when.

III. Preliminary Experiences

MJMS was used to run an iterative solver of sparse linear systems of equations with multiple right-hand sides in the Grid environment. In this context the Block version of the Conjugate Gradient algorithm (BCG) [14] becomes attractive.
In fact, compared with the standard parallel Conjugate Gradient algorithm, it reduces the number of synchronization points and increases the size of messages, improving latency tolerance; it is therefore better suited to a distributed Grid environment. The BCG algorithm simultaneously produces a solution vector for each right-hand side vector. It consists of eight concurrent tasks with different computational complexities. To balance the computational load correctly we introduced a two-level parallelism scheme in its implementation: the first level reflects the task decomposition, while the second is introduced within the most computationally intensive tasks. The resulting multi-site job consists of several subjobs, each grouping one or more tasks, and each subjob must be assigned to a computing resource. For the implementation we used a grid-enabled MPI library that extends MPICH-G2, called MGF [15], which allows transparent and efficient usage of Grid resources, with particular attention to clusters with private networks. We used MJMS to assign the subjobs to the available parallel computing resources according to their computational costs and inter-task communication costs. This allowed us to balance the workload across resources more effectively, which improved the overall speedup.

IV. Discussion and future work

MJMS was designed on top of the Condor-G system, which is a personal desktop agent; even so, its features can be useful in a multi-institution, multi-user environment. The DataGrid Workload Management System, which is based on the Condor-G system, is used in the EGEE/LCG production Grid [16] by thousands of users and hundreds of institutions, and is proof of Condor-G's flexibility. The use of MJMS in a multi-user and multi-institutional environment can be achieved, as done for the EGEE/LCG Grid, by managing resource selection on the basis of Virtual Organization (VO) membership and by using a system such as VOMS [17] to verify users' VO associations.

An important issue concerning the management of multi-site parallel applications that emerged during the MJMS implementation is how to handle possible subjob or process failures. A robust management system should allow application fault recovery. A possible way to handle faults for MPI applications, easily integrable into our system, is to provide process-level fault tolerance at the MPI API level, like that provided by FT-MPI [18].

Another important issue related to the use of our system in a real production Grid is the interaction with the local batch management systems. The GangMatchMaker should take the batch system queues into account when matching subjobs with resources. Queue characteristics and status can be published by the Globus Information System and therefore easily included in the ClassAd representing a computing resource. Resource selection becomes hard when not enough free resources are available and some subjobs need to be queued on the local batch systems before running.

The MJMS implementation is based on MPICH-G2 and pre-WS GRAM, but its design is independent of the MPI implementation and of the Grid technology. The implementation can therefore be extended to support other grid-enabled MPI libraries and the other Grid systems usable through Condor-G (NorduGrid, Unicore, WS GRAM and remote Condor pools).

A limitation of the ClassAd Gangmatching used in MJMS is that the decomposition of a job into subjobs is constrained: the number of subjobs is fixed in the job ClassAd. The GangMatchMaker should allow an aggregate matching policy to increase scheduling flexibility. A proposed extension to Gangmatching called Set-Matching [19] would overcome that limitation: with Set-Matching, a job ClassAd can request to be matched with an unspecified number of resource ClassAds and, in addition to Requirements on individual resource ClassAds, it can place Requirements on aggregate properties of the set of matching resources. For example, the job ClassAd can request a certain number of total CPUs, and the MatchMaker will add resources to the set until that number is reached, as sketched below.
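A minimal sketch of such an aggregate policy follows (our own illustration of the idea behind Set-Matching [19], not its actual implementation; attribute names reuse those from our earlier figures).

# Illustrative greedy set construction in the spirit of Set-Matching [19]:
# keep adding acceptable resource ads until the aggregate requirement
# (total CPUs) is satisfied.
def set_match(resources, per_resource_req, total_cpus_needed):
    chosen, total = [], 0
    # Prefer larger clusters first so the set stays small (a policy choice).
    for ad in sorted(resources, key=lambda a: -a["ClusterNodeCount"]):
        if per_resource_req(ad):
            chosen.append(ad)
            total += ad["ClusterNodeCount"]
            if total >= total_cpus_needed:
                return chosen
    return None  # aggregate requirement cannot be met

clusters = [
    {"Name": "vega",    "ClusterNodeCount": 17, "OpSys": "LINUX"},
    {"Name": "altair",  "ClusterNodeCount": 11, "OpSys": "LINUX"},
    {"Name": "beocomp", "ClusterNodeCount": 14, "OpSys": "LINUX"},
]
match = set_match(clusters, lambda a: a["OpSys"] == "LINUX", 30)
print([a["Name"] for a in match])  # -> ['vega', 'beocomp'] (31 CPUs >= 30)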
V. Conclusions

One of the most interesting and exciting fields of application of Grid computing has always been High-Performance Computing (HPC). Multi-site parallel applications can achieve high performance by grouping HPC resources for a single application execution, and we therefore argue that this class of applications can effectively use the Grid for HPC. The use of Globus Grid technologies, with their middleware design, has given uniform access to computing resources, while the use of the MPI standard has facilitated application development. MJMS was developed to support multi-site application execution and testing. Our system has all the basic features for job management: it allows the submission of multi-site jobs to the Grid and raises the abstraction level by transparently handling the job throughout its entire life cycle. MJMS has been designed to be extensible, in order to support additional features such as fault tolerance and multi-user usage, and to be independent of any particular grid-enabled MPI implementation and Grid technology.

VI. Acknowledgment

The authors would like to thank the Condor Team at the Computer Sciences Department of the University of Wisconsin-Madison.

References

[1] C. Lee, S. Matsuoka, D. Talia, A. Sussmann, M. Mueller, G. Allen, and J. Saltz, "A grid programming primer," 2001.
[2] N. T. Karonis, B. Toonen, and I. Foster, "MPICH-G2: a grid-enabled implementation of the Message Passing Interface," J. Parallel Distrib. Comput., vol. 63, no. 5, pp. 551-563, 2003.
[3] I. Foster, C. Kesselman, G. Tsudik, and S. Tuecke, "A security architecture for computational grids," in Proceedings of the 5th ACM Conference on Computer and Communications Security. ACM Press, 1998, pp. 83-92.
[4] K. Czajkowski, I. Foster, N. Karonis, C. Kesselman, S. Martin, W. Smith, and S. Tuecke, "A resource management architecture for metacomputing systems," Lecture Notes in Computer Science, vol. 1459, pp. 62-??, 1998.
[5] K. Czajkowski, I. T. Foster, and C. Kesselman, "Resource co-allocation in computational grids," in HPDC, 1999.
[6] J. Frey, T. Tannenbaum, I. Foster, M. Livny, and S. Tuecke, "Condor-G: A computation management agent for multi-institutional grids," in Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing (HPDC), San Francisco, California, August 2001, pp. 7-9.
[7] G. Avellino, S. Beco, B. Cantalupo, A. Maraschini, F. Pacini, M. Sottilaro, A. Terracina, D. Colling, F. Giacomini, E. Ronchieri, A. Gianelle, M. Mazzucato, R. Peluso, M. Sgaravatto, A. Guarise, R. M. Piro, A. Werbrouck, D. Kouril, A. Krenek, L. Matyska, M. Mulac, J. Pospísil, M. Ruda, Z. Salvet, J. Sitera, J. Skrabal, M. Vocu, M. Mezzadri, F. Prelz, S. Monforte, and M. Pappalardo, "The DataGrid workload management system: Challenges and results," J. Grid Comput., vol. 2, no. 4, pp. 353-367, 2004.
[8] I. Foster and C. Kesselman, "Globus: A metacomputing infrastructure toolkit," The International Journal of Supercomputer Applications and High Performance Computing, vol. 11, no. 2, pp. 115-128, Summer 1997.
[9] R. Raman, M. Livny, and M. Solomon, "Matchmaking: Distributed resource management for high throughput computing," in Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing (HPDC7), Chicago, IL, July 1998.
[10] K. Czajkowski, S. Fitzgerald, I. Foster, and C. Kesselman, "Grid information services for distributed resource sharing," 2001.
[11] R. Raman, M. Livny, and M. Solomon, "Policy driven heterogeneous resource co-allocation with gangmatching," in HPDC '03: Proceedings of the 12th IEEE International Symposium on High Performance Distributed Computing. Washington, DC, USA: IEEE Computer Society, 2003, p. 80.
[12] Intel MPI Benchmark suite, code and documentation available from http://www.intel.com/software/products/cluster/downloads/mpibenchmarks.htm.
[13] J. Bester, I. Foster, C. Kesselman, J. Tedesco, and S. Tuecke, "GASS: A data movement and access service for wide area computing systems," in Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems. Atlanta, GA: ACM Press, 1999, pp. 78-88.

[14] D. P. O'Leary, "Parallel implementation of the block conjugate gradient algorithm," Parallel Computing, vol. 5, no. 1-2, pp. 127-139, 1987.
[15] F. Gregoretti, G. Laccetti, A. Murli, G. Oliva, and U. Scafuri, "MGF: a grid-enabled MPI library with a delegation mechanism to improve collective operations," in PVM/MPI, ser. Lecture Notes in Computer Science, B. D. Martino, D. Kranzlmüller, and J. Dongarra, Eds., vol. 3666. Springer, 2005, pp. 285-292.
[16] D. Kranzlmüller, "The EGEE project - a multidisciplinary, production-level grid," in HPCC, ser. Lecture Notes in Computer Science, L. T. Yang, O. F. Rana, B. D. Martino, and J. Dongarra, Eds., vol. 3726. Springer, 2005, p. 2.
[17] R. Alfieri, R. Cecchini, V. Ciaschini, L. dell'Agnello, Á. Frohner, A. Gianoli, K. Lörentey, and F. Spataro, "VOMS, an authorization system for virtual organizations," in European Across Grids Conference, ser. Lecture Notes in Computer Science, F. F. Rivera, M. Bubak, A. G. Tato, and R. Doallo, Eds., vol. 2970. Springer, 2003, pp. 33-40.
[18] E. Gabriel, G. Fagg, A. Bukovsky, T. Angskun, and J. Dongarra, "A fault tolerant communication library for grid environments," in Proceedings of the 17th Annual ACM International Conference on Supercomputing (ICS '03), International Workshop on Grid Computing, 2003.
[19] C. Liu, L. Yang, I. Foster, and D. Angulo, "Design and evaluation of a resource selection framework for grid applications," in Proceedings of the 11th IEEE Symposium on High-Performance Distributed Computing, July 2002.