Multi-site Jobs Management System (MJMS): A tool to manage multi-site MPI applications execution in Grid environment


This work has been partially supported by the Italian Ministry of Education, University and Research (MIUR) within the activities of the WP9 work package "Grid Enabled Scientific Libraries", part of the MIUR FIRB RBNE01KNFP Grid.it project.

Jaime Frey (Department of Computer Science, University of Wisconsin, Madison, WI; email: jfrey@cs.wisc.edu)
Francesco Gregoretti and Gennaro Oliva (Institute for High Performance Computing and Networking, ICAR-CNR, Naples, Italy; email: {francesco.gregoretti,gennaro.oliva}@na.icar.cnr.it)
Giuliano Laccetti and Almerico Murli (University of Naples Federico II, Naples, Italy; email: {giuliano.laccetti,almerico.murli}@dma.unina.it)

Abstract

Multi-site parallel applications consist of multiple distributed processes running on one or more potentially heterogeneous resources in different locations. This class of applications can effectively use the Grid for high-performance computing. We propose a Multi-site Jobs Management System (MJMS), based on the Globus Toolkit, the MPICH-G2 library and Condor-G, for the effective, reliable and secure execution of multi-site parallel applications in the Grid environment. The system allows the user to submit execution requests specifying application requirements and preferences in a high-level language (following the Condor submit file syntax), freeing her from the tasks of resource discovery and co-allocation: it efficiently selects available computing resources for execution according to the application requirements and interacts with the local management systems through the Condor-G system.

I. Introduction and state of the art

Multi-site parallel applications consist of multiple distributed processes running on one or more potentially heterogeneous resources in different locations. This class of applications can effectively use the Grid for high-performance computing by grouping computing resources for a single application execution. In such a distributed environment programmers can benefit from the message passing paradigm, and in particular from the MPI standard, because it is easy to understand and use, architecture-independent and portable. Furthermore, MPI is a widely used standard for high performance computing and can therefore make Grid application development more accessible to programmers with parallel computing skills. The first MPI standard (MPI-1) was released in 1994 and since then many libraries and tools have been developed based on it.

There are several grid-enabled MPI libraries [1]: MagPIe, MPICH-G2, MPI Connect, MetaMPICH, Stampi, PACX-MPI, etc. Many of these implementations allow the grouping of multiple machines, potentially based on heterogeneous architectures, for the execution of MPI programs, and the use of vendor-supplied MPI libraries over high performance networks for intra-machine messaging. All of the available libraries support co-allocation as well as process synchronization, but they do not provide any execution management tools: users must explicitly specify the resources to be used for the execution and directly handle possible failures. Among these libraries, MPICH-G2 [2] is one of the most complete and widespread, and many of its features are very useful in a Grid environment. MPICH-G2 provides the user with an advanced interconnection topology description with multiple levels of depth, which gives a detailed view of the underlying execution environment.
It uses the Globus Security Infrastructure (GSI) [3] for authorization and authentication, the Grid Resource Allocation and Management (GRAM) protocol [4] for resource allocation, and the DUROC library [5] for process synchronization.

The adoption of job management systems like Condor-G [6] and the DataGrid WMS [7] has facilitated the use of the Grid for the execution of sequential and parallel applications. These systems accept user execution requests consisting of input files, execution binaries, environment requirements and other parameters, without necessarily naming the specific resources to run the job. They transparently handle resource selection, resource allocation, file transfers, application monitoring and failure events on the users' behalf. Moreover, during the life cycle of their jobs, users can query the systems to retrieve detailed information about the current and past status of the jobs. The use of a job management system therefore hides the complexity of the Grid by raising the level of abstraction of the architecture. Existing management systems, however, do not allow the execution of multi-site parallel applications, because they do not handle requests for multiple resources within a single job

execution. Furthermore, the available systems do not interact with existing grid-enabled MPI libraries for process synchronization. We propose an execution management tool called Multi-site Jobs Management System (MJMS), based on the Globus Toolkit [8], the MPICH-G2 library and Condor-G, for the effective, reliable and secure execution of multi-site parallel applications in the Grid environment.

II. MJMS: Multi-site Jobs Management System

MJMS manages the execution of parallel MPI applications on multiple resources in a Grid environment. It interacts with the Condor-G daemons, exploiting the Condor-G system's features for job execution management, and uses the services provided by DUROC to handle job synchronization across sites. The Condor-G system has many features that are useful for our purposes:

it is capable of interacting with the GRAM service provided by the Globus Toolkit for job submission and monitoring;
it provides a high-level language called ClassAds (Classified Advertisements) [9] to describe job requirements and preferences as well as the characteristics of Grid resources;
it has fault-tolerance features;
it can manage distributed I/O operations;
it provides a robust mechanism called Matchmaking to match a job request with a computing resource.

However, the Condor-G system has three limitations that prevent its use for our class of applications:

it does not support the execution of MPICH-G2 jobs;
its Matchmaking mechanism matches each single job request with a single resource, precluding co-allocation, whereas we need to request a consistent collection of resources for a single MPI execution;
the Condor collector daemon, responsible for collecting all the information about the status and characteristics of the available Grid resources, does not interact with the Globus Information System (MDS) [10].

The system we propose allows users to submit execution requests for multi-site parallel applications by specifying requirements and preferences in a high-level language (following the Condor submit file syntax; see figure 1). Requirements and preferences should reflect the application's needs in terms of computational and communication costs, as assessed by the user. The requirements and preferences of parallel and multi-site parallel applications may differ from those of sequential applications: parallel application performance can depend on both communication and computational costs. For this reason users should be able to specify network characteristics such as bandwidth and latency alongside job requirements and preferences such as CPU performance. A user can utilize multiple Grid resources for a single MPI job execution by splitting her application into subjobs; MJMS assigns each subjob to a single Grid resource. MJMS accepts the user's submit description file and locates a matching pool of available resources, among those published by the Globus Information System, according to the application requirements and preferences.
For resource location MJMS uses Gangmatching [11], an extended version of Condor Matchmaking [9] capable of matching multiple resource requests within a single job execution request. Once compatible resources have been located, MJMS submits the subjobs to the Condor-G system, specifying the target resource for each subjob. Process synchronization is achieved by a barrier controlled by the Coordinator: each process placed in execution stops at the barrier and, once all processes have been successfully started, MJMS starts the computation by releasing the barrier.

(subjob1)
universe = globus
executable = bcg_dist
arguments = s3rmt3m3.mtx 3 bs3rmt3m3.mtx 7 15 16 17 18 24 25 31
machine_count = 17
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = s3rmt3m3.mtx,s3rmt3m3_rhs.mtx
output = outfile.$(cluster)
error = errfile.$(cluster)
log = logfile.$(cluster)
environment = P4_SETS_ALL_ENVVARS=1; MGF_PRIVATE_SLAN=1; LD_LIBRARY_PATH=/opt/globus-2.4.3/lib/
globusrsl = (jobtype=mpi) (count=17) (label=subjob 0)
requirements = (OpSys == "LINUX" && Arch == "i686") && (ClusterNodeCount >= 17)
rank = 10000/ClusterNetLatency + 10*ClusterCPUsSpecFloat
queue

(subjob2)
universe = globus
executable = bcg_dist
arguments = s3rmt3m3.mtx 3 bs3rmt3m3.mtx 7 15 16 17 18 24 25 31
machine_count = 4
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = s3rmt3m3.mtx,s3rmt3m3_rhs.mtx
output = outfile.$(cluster)
error = errfile.$(cluster)
log = logfile.$(cluster)
environment = P4_SETS_ALL_ENVVARS=1; MGF_PRIVATE_SLAN=1; LD_LIBRARY_PATH=/opt/globus-2.4.3/lib/
globusrsl = (jobtype=mpi) (count=4) (label=subjob 17)
requirements = (OpSys == "LINUX" && Arch == "i686") && (ClusterNodeCount >= 4)
queue

(subjob3)
universe = globus
executable = bcg_dist
arguments = s3rmt3m3.mtx 3 bs3rmt3m3.mtx 7 15 16 17 18 24 25 31
machine_count = 14
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = s3rmt3m3.mtx,s3rmt3m3_rhs.mtx
output = outfile.$(cluster)
error = errfile.$(cluster)
log = logfile.$(cluster)
environment = P4_SETS_ALL_ENVVARS=1; MGF_PRIVATE_SLAN=1; LD_LIBRARY_PATH=/opt/globus-2.4.3/lib/
globusrsl = (jobtype=mpi) (count=14) (label=subjob 21)
+Subnet1 = subjob1.subnet
requirements = (OpSys == "LINUX" && Arch == "i686") && (ClusterNodeCount >= 14)
rank = 10*ClusterCPUsSpecFloat
queue

Fig. 1. MJMS submit description file example.
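As a quick illustration of how such a rank expression orders candidate clusters, the snippet below (our own illustration, not part of MJMS) evaluates subjob1's rank against the attribute values published in figures 4, 5 and 6 later in the paper; we take the submit file's ClusterNetLatency to be the zero-byte MPI latency that the machine ads publish as ClusterNetMPI0Latency (the two names differ in the text).

# Illustration only: evaluating subjob1's rank expression from figure 1
# against the cluster attributes of figures 4-6.
clusters = {
    "vega":    {"latency_us": 38,  "specfloat": 558},
    "altair":  {"latency_us": 117, "specfloat": 6},
    "beocomp": {"latency_us": 65,  "specfloat": 13},
}

def rank(ad):
    # rank = 10000/ClusterNetLatency + 10*ClusterCPUsSpecFloat
    return 10000 / ad["latency_us"] + 10 * ad["specfloat"]

for name, ad in sorted(clusters.items(), key=lambda kv: -rank(kv[1])):
    print(f"{name}: rank = {rank(ad):.0f}")
# vega: 5843, beocomp: 284, altair: 145 -> Vega is preferred for subjob1.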

Input and output file staging and subjob logging are handled by the Condor-G system.

The MJMS system architecture, depicted in figure 2, consists of three main components: the Coordinator, the GangMatchMaker and the Advertiser.

Fig. 2. MJMS architecture.

A. Coordinator

The Coordinator accepts user submit description files, invokes the GangMatchMaker to locate a pool of matching resources and delegates the management of subjob execution to the Condor-G system. An MJMS job may be in one of the following states:

ACCEPTED: the job has been accepted by the system;
MATCHED: a pool of matching resources has been located;
UNMATCHED: no set of resources matches the job request;
BARRIERED: not all subjobs are yet held at the DUROC-controlled barrier;
RUNNING: the subjobs have been released from the barrier.

When the job is ACCEPTED, all of its subjobs are enqueued and held in the Condor-G queue; target Grid resources have not yet been identified. The DUROC services are initialized and a synchronization barrier for all the subjobs is set up by modifying subjob environment variables (the synchronization method depends on the grid-enabled MPI library in use: MPICH-G2 process synchronization is implemented through the DUROC library, which relies on environment variables).

The Coordinator creates a single ClassAd containing the Requirements and Rank of all subjobs as stated in the corresponding submit description file, and sends the resulting ClassAd to the GangMatchMaker, which is responsible for locating a pool of matching Grid resources. If the GangMatchMaker does not find compatible resources, the job state becomes UNMATCHED; otherwise it becomes MATCHED and the Coordinator sets the target Grid resources by rewriting each subjob's GlobusResource attribute with the Gatekeeper contact string of the matching resource. The Coordinator then releases the subjobs from the Condor-G queue and the job state becomes BARRIERED. Once all subjobs are running on the selected resources, the Coordinator starts the computation by releasing the synchronization barrier; the job state becomes RUNNING and the execution management of all subjobs is from this point onward in charge of the Condor-G system.
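To summarize this life cycle, here is a minimal sketch (our own Python illustration, not MJMS source code; the collaborators gang_matcher, condor_g and duroc and all of their method names are hypothetical):

# Hypothetical sketch of the Coordinator's job life cycle (not MJMS source).
from enum import Enum, auto

class JobState(Enum):
    ACCEPTED = auto()    # subjobs enqueued and held in the Condor-G queue
    UNMATCHED = auto()   # no consistent set of resources found
    MATCHED = auto()     # a pool of matching resources located
    BARRIERED = auto()   # subjobs held at the DUROC-controlled barrier
    RUNNING = auto()     # barrier released, computation in progress

def coordinate(job, gang_matcher, condor_g, duroc):
    job.state = JobState.ACCEPTED
    condor_g.enqueue_held(job.subjobs)        # target resources still unknown
    duroc.init_barrier(job.subjobs)           # via subjob environment variables

    gang = gang_matcher.match(job.classad())  # one ClassAd with all subjob ports
    if gang is None:
        job.state = JobState.UNMATCHED
        return
    job.state = JobState.MATCHED
    for subjob, resource in zip(job.subjobs, gang):
        subjob.GlobusResource = resource.gatekeeper_contact  # set target

    condor_g.release(job.subjobs)             # subjobs start on the Grid
    job.state = JobState.BARRIERED
    duroc.wait_all_at_barrier(job.subjobs)
    duroc.release_barrier()                   # computation starts
    job.state = JobState.RUNNING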
B. GangMatchMaker

The GangMatchMaker uses the Gangmatching model to locate resources for multi-site job execution. In the co-allocation problem for a multi-site parallel application, we need a matchmaking scheme able to match the subjobs' execution requests with the available resources while simultaneously taking into account the subjobs' requirements, preferences and interdependencies; moreover, each subjob has to be matched with a different resource. The Gangmatching model is a possible solution to this multilateral matchmaking problem. It is an extended version of the bilateral Condor Matchmaking and makes use of the ClassAds library for job and resource description.

The ClassAd representing the job execution request is generated by the Coordinator from the submit description file. A job ClassAd for the multi-site parallel application problem, generated from the submit description file of figure 1, is illustrated in figure 3. The most notable feature of this example is the Ports attribute: in the Gangmatching model the bilateral Matchmaking algorithm operates between ports of ClassAds instead of entire ClassAds, and the Ports attribute defines the number and characteristics of the matching offers required for that ClassAd to be satisfied [11].

[
  Type = "Job";
  Ports =
  {
    [
      Label = "subjob1";
      Requirements = subjob1.Type == "Machine" &&
                     subjob1.Arch == "i686" &&
                     subjob1.OpSys == "LINUX" &&
                     subjob1.ClusterNodeCount >= 17;
      Rank = 10000 / subjob1.ClusterNetMPI0Latency +
             10 * subjob1.ClusterCPUsSpecFloat;
    ],
    [
      Label = "subjob2";
      Requirements = subjob2.Type == "Machine" &&
                     subjob2.Arch == "i686" &&
                     subjob2.OpSys == "LINUX" &&
                     subjob2.ClusterNodeCount >= 4;
    ],
    [
      Label = "subjob3";
      Subnet1 = subjob1.Subnet;
      Requirements = subjob3.Type == "Machine" &&
                     subjob3.Arch == "i686" &&
                     subjob3.OpSys == "LINUX" &&
                     subjob3.ClusterNodeCount >= 14;
      Rank = 10 * subjob3.ClusterCPUsSpecFloat;
    ]
  }
]

Fig. 3. ClassAd example describing a multi-site parallel job.

The Gangmatching model allows the job request ClassAd to convey information about matched resources to the resource ClassAds through a specific port: the subjob3 port can refer to an attribute of the resource ClassAd already matched with a preceding port, and it conveys the value of this attribute to the resource ClassAds through the Ports[2].attribute syntax. For example, the job ClassAd of figure 3 conveys the value of the Subnet attribute of the resource ClassAd already matched with the subjob1 port through the Ports[2].Subnet1 syntax. A user can thus request that subjob1 and subjob3 run on resources belonging to the same LAN by adding the expression subjob3.Subnet == Subnet1 among the subjob3 Requirements.

In the above example the Requirements of each port are not influenced by the attributes of the other ports, although one could make some port's Requirements depend on the attributes of a preceding port. As a consequence of label scoping, the dependency relations between ports are in any case limited: labels of successive ports may not be referred to by preceding ports.

Example ClassAds of resource offers are illustrated in figures 4, 5 and 6. The resource ClassAds are very similar to their bilateral matchmaking counterparts, except for the presence of a port that explicitly indicates their intent to match exactly one request port. No Requirements are expressed in these resource ClassAd examples, but complex Requirements may be expressed in an offer ClassAd to implement sophisticated resource management policies; such Requirements could refer to attributes of the job ports.

[
  MyType = "Machine";
  Name = "vega.na.icar.cnr.it";
  gatekeeper_url = "vega.na.icar.cnr.it:2122";
  Arch = "i686";
  OpSys = "LINUX";
  Subnet = "140.164.14";
  ClusterNodeCount = 17;
  ClusterCPUType = "Intel(R) Pentium(R) 4 CPU 1500MHz";
  ClusterNetMPIBandwidth = 88;
  ClusterCPUsSpecFloat = 558;
  ClusterCPUsSpecInt = 535;
  ClusterNetMPI0Latency = 38;
  Ports = { [ Label = "subjob"; ] }
]

Fig. 4. ClassAd extract describing a cluster named Vega.

[
  MyType = "Machine";
  Name = "altair.dma.unina.it";
  gatekeeper_url = "altair.dma.unina.it:2122";
  Arch = "i686";
  OpSys = "LINUX";
  Subnet = "192.167.11";
  ClusterNodeCount = 11;
  ClusterCPUType = "Pentium Pro";
  ClusterNetMPIBandwidth = 74;
  ClusterCPUsSpecFloat = 6;
  ClusterCPUsSpecInt = 8;
  ClusterNetMPI0Latency = 117;
  Ports = { [ Label = "subjob"; ] }
]

Fig. 5. ClassAd extract describing a cluster named Altair.

[
  MyType = "Machine";
  Name = "beocomp.dma.unina.it";
  gatekeeper_url = "beocomp.dma.unina.it:2122";
  Arch = "i686";
  OpSys = "LINUX";
  Subnet = "192.167.11";
  ClusterNodeCount = 14;
  ClusterCPUType = "Pentium II (Deschutes)";
  ClusterNetMPIBandwidth = 82;
  ClusterCPUsSpecFloat = 13;
  ClusterCPUsSpecInt = 17;
  ClusterNetMPI0Latency = 65;
  Ports = { [ Label = "subjob"; ] }
]

Fig. 6. ClassAd extract describing a cluster named Beocomp.

The algorithm implemented by the GangMatchMaker is the simple naive algorithm of [11]: it marshals a consistent gang of ClassAds for each job ClassAd in the system. In our example the gang consists of four ads, one job and three clusters; the resulting match is the tree illustrated in figure 7.

Fig. 7. Gangmatching operation for our job example: each port of the job ad is bilaterally matched with the port of one cluster ad (Vega, Beocomp, Altair), and the four matched ads form the gang.
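The naive search can be pictured with a small backtracking sketch (our own simplified illustration, not MJMS source: requirements are modeled as Python predicates over dictionaries rather than ClassAd expressions, and Rank-based preferences are ignored).

# Illustrative naive gangmatching: assign a distinct resource ad to every
# port of the job ad so that each port's requirement holds; backtrack when
# a partial gang cannot be extended.
def gang_match(ports, resources, gang=None):
    if gang is None:
        gang = []
    if len(gang) == len(ports):
        return gang                      # every port matched: gang complete
    port_ok = ports[len(gang)]
    for ad in resources:
        if ad not in gang and port_ok(ad, gang):
            result = gang_match(ports, resources, gang + [ad])
            if result is not None:
                return result
    return None                          # no consistent gang from here

clusters = [
    {"Name": "vega",    "ClusterNodeCount": 17, "Subnet": "140.164.14"},
    {"Name": "altair",  "ClusterNodeCount": 11, "Subnet": "192.167.11"},
    {"Name": "beocomp", "ClusterNodeCount": 14, "Subnet": "192.167.11"},
]
# The three ports of figure 3; a port may also inspect previously matched
# ads, e.g. the same-LAN variant: ad["Subnet"] == gang[0]["Subnet"].
ports = [
    lambda ad, gang: ad["ClusterNodeCount"] >= 17,   # subjob1
    lambda ad, gang: ad["ClusterNodeCount"] >= 4,    # subjob2
    lambda ad, gang: ad["ClusterNodeCount"] >= 14,   # subjob3
]
print([ad["Name"] for ad in gang_match(ports, clusters)])
# -> ['vega', 'altair', 'beocomp'], the gang of figure 7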
C. Advertiser

The Advertiser is responsible for periodically sending ClassAds describing the status and characteristics of the available resources to the GangMatchMaker and to the Condor-G system. It gathers this information by periodically querying a set of Globus MDS servers, extracting the information that is relevant to MJMS.
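Since MDS2 is LDAP-based, the Advertiser's polling can be pictured as an LDAP search. The sketch below is our own illustration using the python-ldap package; the host name is a placeholder, and port 2135 and the base DN "Mds-Vo-name=local, o=Grid" are common MDS2 defaults assumed here, not values taken from the paper.

# Illustration only: how a component like the Advertiser might poll a
# Globus MDS (GRIS/GIIS) server over LDAP.
import ldap  # python-ldap

def poll_mds(host="mds.example.org", port=2135):
    conn = ldap.initialize(f"ldap://{host}:{port}")
    conn.simple_bind_s("", "")          # MDS typically allows anonymous binds
    entries = conn.search_s(
        "Mds-Vo-name=local, o=Grid",
        ldap.SCOPE_SUBTREE,
        "(objectClass=*)",
        ["cluster-network-mpi-0-latency",      # attributes published by the
         "cluster-network-mpi-1024-bandwidth"] # custom MJMS schema (figure 8)
    )
    conn.unbind_s()
    return entries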

Parallel and multi-site parallel applications generally have requirements different from those of sequential applications. In particular, a user may be interested in information about the high-performance network of a parallel computing resource. For this reason a user should be able to express, in her job requirements and preferences, conditional expressions on the network characteristics (i.e. MPI communication bandwidth and latency for given packet sizes). The Information System provided by the Globus Toolkit does not supply this information, so we extended it with a custom schema and a corresponding custom Information Provider. A sample of the information supplied by the MJMS Information Provider is shown in figure 8.

system-network-vendor: Quadrics
cluster-network-model: Elan
cluster-network-version: 4
cluster-network-mpi-0-latency: 2.63
cluster-network-mpi-1024-latency: 6.32
cluster-network-mpi-32768-latency: 44.40
cluster-network-mpi-64-bandwidth: 25.64
cluster-network-mpi-1024-bandwidth: 177.95
cluster-network-mpi-32768-bandwidth: 718.91

Fig. 8. Network-related information provided by the MJMS Information Provider.

Latency and bandwidth measurements are static information; we used values obtained with the Intel MPI Benchmark suite [12]. The MJMS schema defines where this information is published and retrieved in the MDS tree.

D. The role of Condor-G

The management of subjob execution through the Condor-G system proceeds as follows:

the Condor-G Scheduler creates a new GridManager daemon to submit and manage all subjobs;
the GridManager job submission request results in the creation of one Globus JobManager daemon on each Grid resource;
the JobManager daemon connects to the GridManager using the Globus GASS service [13] in order to transfer the executable binary and standard input files, and to provide the streaming of standard output and error;
the JobManager submits the subjob to the local management system of the target resource;
the JobManager sends subjob status updates back to the GridManager, which in turn updates the Scheduler.

To allow Condor-G to make remote requests on behalf of the user, a proxy credential has to be created. The Condor-G system uses the Globus GSI authentication and authorization system, which makes it possible to authenticate a user just once, so that she need not re-authenticate herself each time a Condor-G computation management service accesses a target remote resource. Condor-G is able to tolerate four types of failure events: a crash of the Globus JobManager; a crash of the machine that manages the target resource (on which the GateKeeper and JobManager are executing); a crash of the GridManager or of the machine on which the GridManager is executing; and failures in the network connecting the two machines.

Condor-G can transfer any files needed by a job for the computation from the machine where MJMS submitted the subjobs to the target resources, and can likewise transfer output files back to MJMS; the user simply specifies in the submit description file which files have to be transferred and when.

III. Preliminary Experiences

MJMS was used to run an iterative solver of sparse linear systems of equations with multiple right-hand sides in the Grid environment. In this context the Block version of the Conjugate Gradient algorithm (BCG) [14] becomes attractive.
In fact, compared with the standard parallel Conjugate Gradient algorithm, it reduces the number of synchronization points and increases the size of messages, improving latency tolerance; it is therefore better suited to a distributed Grid environment. The BCG algorithm simultaneously produces a solution vector for each right-hand side vector. It consists of eight concurrent tasks with different computational complexities. To balance the computational load correctly we introduced a two-level parallelism scheme in its implementation: the first level reflects the task decomposition, while the second is introduced within the most computationally intensive tasks. The resulting multi-site job consists of several subjobs, each grouping one or more tasks, and each subjob must be assigned to a computing resource. For the implementation we used a grid-enabled MPI library that extends MPICH-G2, called MGF [15], which allows transparent and efficient usage of Grid resources, with particular attention to clusters with private networks. We used MJMS to assign the subjobs to the available parallel computing resources according to their computational costs and inter-task communication costs. This allowed us to balance the workload across resources more effectively, which improved the overall speedup.

IV. Discussion and future work

MJMS was designed on top of the Condor-G system, which is a personal desktop agent; even so, its features can be useful in a multi-institution, multi-user environment. The DataGrid Workload Management System, which is based on the Condor-G system, is used in the EGEE/LCG production Grid [16] by thousands of users and hundreds of institutions, and is proof of Condor-G's flexibility. The use of MJMS in a multi-user and multi-institutional environment can be achieved, as done for the EGEE/LCG Grid, by managing resource selection on the basis of Virtual Organization (VO) membership and by using a system such as VOMS [17] to verify users' VO associations.

An important issue concerning the management of multi-site parallel applications that emerged during the MJMS implementation is how to handle possible subjob or process failures. A robust management system should allow application fault recovery. A possible way to handle faults for MPI applications, easily integrable into our system, is to provide process-level fault tolerance at the MPI API level, like that provided by FT-MPI [18].

Another important issue related to the use of our system in a real production Grid is the interaction with the local batch management systems. The GangMatchMaker should take the batch system queues into account when matching subjobs with resources. Queue characteristics and status can be published by the Globus Information System and therefore easily included in the ClassAd representing a computing resource. Resource selection becomes hard when not enough free resources are available and some subjobs need to be queued on the local batch systems before running.

The MJMS implementation is based on MPICH-G2 and pre-WS GRAM, but its design is independent of the MPI implementation and of the Grid technology. The implementation can therefore be extended to support other grid-enabled MPI libraries and the other Grid systems usable through Condor-G (NorduGrid, Unicore, WS GRAM and remote Condor pools).

A limitation of the ClassAd Gangmatching used in MJMS is that the decomposition of a job into subjobs is constrained: the number of subjobs is fixed in the job ClassAd. The GangMatchMaker should allow an aggregate matching policy to increase scheduling flexibility. A proposed extension to Gangmatching called Set-Matching [19] would overcome that limitation: with Set-Matching, a job ClassAd can request to be matched with an unspecified number of resource ClassAds and, in addition to Requirements on individual resource ClassAds, it can place Requirements on aggregate properties of the set of matching resources. For example, the job ClassAd can request a certain number of total CPUs, and the MatchMaker will add resources to the set until that number is reached, as sketched below.
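A minimal sketch of such an aggregate policy follows (our own illustration of the idea behind Set-Matching [19], not its actual implementation; attribute names reuse those from our earlier figures).

# Illustrative greedy set construction in the spirit of Set-Matching [19]:
# keep adding acceptable resource ads until the aggregate requirement
# (total CPUs) is satisfied.
def set_match(resources, per_resource_req, total_cpus_needed):
    chosen, total = [], 0
    # Prefer larger clusters first so the set stays small (a policy choice).
    for ad in sorted(resources, key=lambda a: -a["ClusterNodeCount"]):
        if per_resource_req(ad):
            chosen.append(ad)
            total += ad["ClusterNodeCount"]
            if total >= total_cpus_needed:
                return chosen
    return None  # aggregate requirement cannot be met

clusters = [
    {"Name": "vega",    "ClusterNodeCount": 17, "OpSys": "LINUX"},
    {"Name": "altair",  "ClusterNodeCount": 11, "OpSys": "LINUX"},
    {"Name": "beocomp", "ClusterNodeCount": 14, "OpSys": "LINUX"},
]
match = set_match(clusters, lambda a: a["OpSys"] == "LINUX", 30)
print([a["Name"] for a in match])  # -> ['vega', 'beocomp'] (31 CPUs >= 30)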
V. Conclusions

One of the most interesting and exciting fields of application of Grid computing has always been High-Performance Computing (HPC). Multi-site parallel applications can achieve high performance by grouping HPC resources for a single application execution, and we therefore argue that this class of applications can effectively use the Grid for HPC. The use of Globus Grid technologies, with their middleware design, has given uniform access to computing resources, while the use of the MPI standard has facilitated application development. MJMS was developed to support multi-site application execution and testing. Our system has all the basic features for job management: it allows the submission of multi-site jobs to the Grid and raises the abstraction level by transparently handling the job throughout its entire life cycle. MJMS has been designed to be extensible, in order to support additional features such as fault tolerance and multi-user usage, and to be independent of any particular grid-enabled MPI implementation and Grid technology.

VI. Acknowledgment

The authors would like to thank the Condor Team at the Computer Sciences Department of the University of Wisconsin-Madison.

References

[1] C. Lee, S. Matsuoka, D. Talia, A. Sussmann, M. Mueller, G. Allen, and J. Saltz, "A grid programming primer," 2001.
[2] N. T. Karonis, B. Toonen, and I. Foster, "MPICH-G2: a grid-enabled implementation of the Message Passing Interface," J. Parallel Distrib. Comput., vol. 63, no. 5, pp. 551-563, 2003.
[3] I. Foster, C. Kesselman, G. Tsudik, and S. Tuecke, "A security architecture for computational grids," in Proceedings of the 5th ACM Conference on Computer and Communications Security. ACM Press, 1998, pp. 83-92.
[4] K. Czajkowski, I. Foster, N. Karonis, C. Kesselman, S. Martin, W. Smith, and S. Tuecke, "A resource management architecture for metacomputing systems," Lecture Notes in Computer Science, vol. 1459, pp. 62-??, 1998.
[5] K. Czajkowski, I. T. Foster, and C. Kesselman, "Resource co-allocation in computational grids," in HPDC, 1999.
[6] J. Frey, T. Tannenbaum, I. Foster, M. Livny, and S. Tuecke, "Condor-G: A computation management agent for multi-institutional grids," in Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing (HPDC), San Francisco, California, August 2001, pp. 7-9.
[7] G. Avellino, S. Beco, B. Cantalupo, A. Maraschini, F. Pacini, M. Sottilaro, A. Terracina, D. Colling, F. Giacomini, E. Ronchieri, A. Gianelle, M. Mazzucato, R. Peluso, M. Sgaravatto, A. Guarise, R. M. Piro, A. Werbrouck, D. Kouril, A. Krenek, L. Matyska, M. Mulac, J. Pospísil, M. Ruda, Z. Salvet, J. Sitera, J. Skrabal, M. Vocu, M. Mezzadri, F. Prelz, S. Monforte, and M. Pappalardo, "The DataGrid workload management system: Challenges and results," J. Grid Comput., vol. 2, no. 4, pp. 353-367, 2004.
[8] I. Foster and C. Kesselman, "Globus: A metacomputing infrastructure toolkit," The International Journal of Supercomputer Applications and High Performance Computing, vol. 11, no. 2, pp. 115-128, Summer 1997.
[9] R. Raman, M. Livny, and M. Solomon, "Matchmaking: Distributed resource management for high throughput computing," in Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing (HPDC7), Chicago, IL, July 1998.
[10] K. Czajkowski, S. Fitzgerald, I. Foster, and C. Kesselman, "Grid information services for distributed resource sharing," 2001.
[11] R. Raman, M. Livny, and M. Solomon, "Policy driven heterogeneous resource co-allocation with gangmatching," in HPDC '03: Proceedings of the 12th IEEE International Symposium on High Performance Distributed Computing. Washington, DC, USA: IEEE Computer Society, 2003, p. 80.
[12] Intel MPI Benchmark suite, code and documentation available from http://www.intel.com/software/products/cluster/downloads/mpibenchmarks.htm.
[13] J. Bester, I. Foster, C. Kesselman, J. Tedesco, and S. Tuecke, "GASS: A data movement and access service for wide area computing systems," in Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems. Atlanta, GA: ACM Press, 1999, pp. 78-88.

[14] D. P. O'Leary, "Parallel implementation of the block conjugate gradient algorithm," Parallel Computing, vol. 5, no. 1-2, pp. 127-139, 1987.
[15] F. Gregoretti, G. Laccetti, A. Murli, G. Oliva, and U. Scafuri, "MGF: a grid-enabled MPI library with a delegation mechanism to improve collective operations," in PVM/MPI, ser. Lecture Notes in Computer Science, B. D. Martino, D. Kranzlmüller, and J. Dongarra, Eds., vol. 3666. Springer, 2005, pp. 285-292.
[16] D. Kranzlmüller, "The EGEE project - a multidisciplinary, production-level grid," in HPCC, ser. Lecture Notes in Computer Science, L. T. Yang, O. F. Rana, B. D. Martino, and J. Dongarra, Eds., vol. 3726. Springer, 2005, p. 2.
[17] R. Alfieri, R. Cecchini, V. Ciaschini, L. dell'Agnello, Á. Frohner, A. Gianoli, K. Lörentey, and F. Spataro, "VOMS, an authorization system for virtual organizations," in European Across Grids Conference, ser. Lecture Notes in Computer Science, F. F. Rivera, M. Bubak, A. G. Tato, and R. Doallo, Eds., vol. 2970. Springer, 2003, pp. 33-40.
[18] E. Gabriel, G. Fagg, A. Bukovsky, T. Angskun, and J. Dongarra, "A fault tolerant communication library for grid environments," in Proceedings of the 17th Annual ACM International Conference on Supercomputing (ICS '03), International Workshop on Grid Computing, 2003.
[19] C. Liu, L. Yang, I. Foster, and D. Angulo, "Design and evaluation of a resource selection framework for grid applications," in Proceedings of the 11th IEEE Symposium on High-Performance Distributed Computing, July 2002.