P-GRADE: a Grid Programming Environment


Article submission for Journal of Grid Computing

P-GRADE: a Grid Programming Environment

P. Kacsuk, G. Dózsa, J. Kovács, R. Lovas, N. Podhorszki, Z. Balaton and G. Gombás
MTA SZTAKI Lab. of Parallel and Distributed Systems, H-1518 Budapest P.O.Box 63, Hungary
{kacsuk, dozsa, smith, rlovas, pnorbert, balaton, gombasg}@sztaki.hu
Correspondence: Peter Kacsuk, kacsuk@sztaki.hu

Key words: grid computing, parallel programming, graphical development environment

Abstract

P-GRADE provides a high-level graphical environment to develop parallel applications transparently both for parallel systems and the Grid. P-GRADE supports the interactive execution of parallel programs as well as the creation of a Condor, Condor-G or Globus job to execute parallel programs in the Grid. In P-GRADE, the user can generate either PVM or MPI code according to the underlying Grid where the parallel application should be executed. PVM applications generated by P-GRADE can migrate between different Grid sites, and as a result P-GRADE guarantees reliable, fault-tolerant parallel program execution in the Grid. The GRM/PROVE performance monitoring and visualisation toolset has been extended towards the Grid and connected to a general Grid monitor (Mercury) developed in the EU GridLab project. Using the Mercury/GRM/PROVE Grid application monitoring infrastructure, any parallel application launched by P-GRADE can be remotely monitored and analysed at run time even if the application migrates among Grid sites. P-GRADE supports workflow definition and co-ordinated multi-job execution for the Grid. Such workflow management can provide parallel execution at both the inter-job and intra-job level. An automatic checkpoint mechanism for parallel programs supports the migration of parallel jobs inside the workflow, providing a fault-tolerant workflow execution mechanism. The paper describes all of these features of P-GRADE and their implementation concepts.

1 Introduction

The concept of the Grid emerged as a natural extension and generalisation of distributed supercomputing or meta-computing. Although today the Grid means much more than distributed supercomputing, one of its important goals is still to provide a dynamic collection of resources from which applications requiring very large computational power can select and use the actually available resources. In that sense the Grid is a natural extension of the concepts of supercomputers and clusters towards an even more distributed and dynamic parallel program execution platform and infrastructure. Components of such a Grid infrastructure include supercomputers and clusters and hence, parallel application programs and programming tools of supercomputers and clusters are expected to be used in the Grid, too. These days there are many projects with the idea of extending and generalising the existing parallel program development and execution environments towards the Grid. The current idea is that a parallel program that was developed for supercomputers and clusters should be used in the Grid, too. Indeed, execution of a

3 P-GRADE: A Grid Programming Environment 3 parallel program in the Grid does not differ semantically from the execution in a parallel computing system. The main differences come from the speed, reliability, heterogeneity, availability and accessing mechanisms (including security and local policy) of the exploited resources. In the traditional parallel systems such as supercomputers, the speed of the processors, communication networks, memory access, I/O access is steadily high, the components of the supercomputers are homogeneous, highly reliable and statically available (after allocating them by a resource scheduler). Clusters have more or less the same parameters as supercomputers though their components might be heterogeneous, they are usually less reliable, a bit slower and the availability of their components are less static than in the supercomputers. The Grid represents a significant modification in these parameters. The Grid is very heterogeneous, the components (resources) dynamically change both in respect of their speed and availability and there is no guarantee for their reliability. It means that a parallel program that was developed for a homogeneous, high-speed, static and reliable execution environment should perform satisfactorily well even in a heterogeneous, changing-speed, dynamic and unreliable execution environment. From the point of view of a Grid end-user there are several important aspects of executing parallel programs in the Grid. The first aspect is the creation of the Grid program. Here two exciting questions should be answered. How different is the creation of a Grid program from the creation of a parallel program for a cluster? Once the Grid program is created can it run on any kind of Grid or on just one specific Grid? In the case of supercomputers and clusters MPI [35] and PVM [19] were introduced in order to write parallel applications in a system independent way. The GAT (Grid Application Toolkit) [2] tries to introduce a similar approach for creating Grid programs in a Grid middleware independent way. However, just as PVM and MPI require the application developer to learn a lot of APIs for writing parallel programs, GAT will require the user to learn the GAT APIs for creating Grid applications. A more elegant way would be to provide a very high-level graphical environment that hides all the low-level details of the various APIs and would be able to generate either PVM, MPI or GAT code according to the actual execution platform. This is exactly the aim of P-GRADE (Parallel Grid Run-time and Application Development Environment) that has been developed by MTA SZTAKI. P-GRADE currently generates either PVM or MPI code from the same graphical notation according to the user s needs. Once the GAT API will be established and stabilised, P-GRADE will generate GAT code as well. The generated PVM code can be executed on supercomputers, clusters and as a Condor job in Condor managed Grid systems like the Hungarian ClusterGrid [39]. The generated MPI code can be executed on supercomputers, clusters and in Globus managed Grid systems like the Hungarian SuperGrid [28]. The second aspect is related to the execution of the parallel program in the Grid. In most cases the parallel program should be executed as a job that can run on any available resources (supercomputers or clusters) of the Grid according to the specified user requirements and the offered resource services. 
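To make the dual code generation concrete, the sketch below shows how a single logical GRAPNEL-level send operation could be compiled against either message-passing library. It is only an illustration of the idea under the assumption of a compile-time switch: the grp_send_int name and the GRP_USE_PVM macro are invented here and are not the actual GRAPNEL Library interface described later in Section 2.1.

    /* Illustrative only: one logical send primitive backed by either PVM or MPI.
     * The grp_* name and the GRP_USE_PVM switch are hypothetical. */
    #ifdef GRP_USE_PVM
    #include <pvm3.h>
    #else
    #include <mpi.h>
    #endif

    /* Send 'len' integers to logical partner 'dest' on logical channel 'tag'. */
    static int grp_send_int(int dest, int tag, int *buf, int len)
    {
    #ifdef GRP_USE_PVM
        pvm_initsend(PvmDataDefault);   /* pack into PVM's default send buffer */
        pvm_pkint(buf, len, 1);
        return pvm_send(dest, tag);     /* 'dest' is a PVM task id here        */
    #else
        /* 'dest' is an MPI rank here */
        return MPI_Send(buf, len, MPI_INT, dest, tag, MPI_COMM_WORLD);
    #endif
    }

In the real system this choice is made when the graphical code is pre-compiled and linked against either the GRP-PVM or the GRP-MPI version of the GRAPNEL Library (see Section 2.1).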
Finding and allocating the necessary resources in the Grid is the task of the Grid information system and the resource broker. Ideally, they should perform the resource allocation in an optimal and transparent way. Unfortunately, this is not the case in most existing Grid systems: the Grid user is typically burdened with the selection of the necessary Grid resources via some Grid portal [20], [24]. Once that is done, resource managers like Condor-G [18] and SGEE [51] are used to maintain a persistent Grid-wide job queue

4 4 Kacsuk et al. and to control the job execution in the Grid as well as to notify the user about the job status changes. P-GRADE currently generates Condor, Condor-G and Globus jobs and can be easily modified to generate other kinds of jobs like SGEE jobs. Another important aspect of executing a parallel program in the Grid is job migration. The Grid is an inherently dynamic and unreliable execution environment. The optimal selection of a Grid resource for a particular job does not mean that the selected resource remains optimal for the whole execution life of the job. The selected resource could be overloaded by newly submitted higher priority jobs or even worse it can go down partially or completely due to some hardware or software errors. These are situations that should be resolved by the Grid middleware transparently to the user. There are some partial solutions for this problem (for example Condor can checkpoint sequential jobs and migrate them from the overloaded or corrupted resource to a new one) but it is not solved yet in its full complexity. P-GRADE represents an important step in that direction. It contains a parallel checkpoint and migration module that enables the checkpoint and migration of generic PVM programs either inside a Grid site, like a cluster, or among Grid sites when the PVM programs are executed as Condor or Condor-G jobs. Naturally, we need tools by which the performance of parallel programs can be monitored and tuned in the Grid. When a parallel application has been developed for a cluster and one would like to run the same application program in the Grid, the main question is how well such a parallel program can perform in the Grid. Therefore it is not surprising that typically the performance monitoring tools are the first candidates to be extended towards the Grid. For example, GRM/PROVE [5] is extended and adapted towards the Grid within the EU DataGrid project, VAMPIR [50] in the EuroGrid project and OMIS [34] in the framework of the EU CrossGrid project. Of course, as stated above parallel programs have to migrate in the Grid and hence a Grid application monitoring infrastructure that can monitor even migrating applications should be established in the Grid. We developed such an infrastructure by adapting GRM/PROVE towards the Grid in the EU DataGrid project and by creating a new Grid monitoring tool, called Mercury, in the GridLab [23] project. The integration of GRM/PROVE with Mercury resulted in a generic Grid application monitoring infrastructure that is integrated into the P-GRADE environment, too. As a result no matter whether the parallel application is executed in a supercomputer, cluster or in the Grid, the application developer and the end-user can observe the execution of their applications in a seamless way even when the application migrates among different sites of the Grid. Finally, there are complex applications where a single parallel program is not sufficient to solve the problem, rather several existing or newly defined components (sequential and/or parallel codes) located at different Grid sites should collaborate. The usual approach to solve such complex problems is the creation of a workflow. P- GRADE also provides a graphical layer for creating such workflows. Components of the workflow can be any sequential or parallel code either provided as a service or developed in the P-GRADE environment. P-GRADE generates a DAGMan specification for the workflow that is executed under the control of Condor or Condor-G. 
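Each node of such a workflow is ultimately handed to Condor or Condor-G as an ordinary job. As a point of reference, the fragment below is a hand-written approximation of a Condor-G submit description of that era, not a file produced by P-GRADE: the gatekeeper contact, file names and arguments are hypothetical, and the descriptions P-GRADE actually generates carry further attributes (resource requirements, job type, etc.).

    # Hypothetical Condor-G submit description for one job of the workflow
    universe        = globus
    globusscheduler = grid1.example.org/jobmanager-pbs
    executable      = meander_delta
    arguments       = -hours 6
    output          = delta.out
    error           = delta.err
    log             = delta.log
    queue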
Jobs of the workflow can migrate in the Grid, and P-GRADE provides full observability of the workflow execution through the integrated Grid application monitoring infrastructure mentioned above. In summary, P-GRADE has the following main features: 1. Supports the whole life-cycle of parallel Grid program development.

5 P-GRADE: A Grid Programming Environment 5 2. Provides a unified graphical support for program design, editing, debugging, monitoring and execution in the Grid. 3. Can generate either PVM or MPI code for execution on supercomputers, clusters and in the Grid. 4. Its run-time system is highly portable: a parallel program developed with P-GRADE can run without any changes on supercomputers, clusters and in the Grid. The only difference is in the possible working modes and in the mapping possibilities. 5. Is integrated with Condor, Condor-G and Globus-2 and hence it can be used for any Grid system where Condor, Condor-G or Globus-2 is available. 6. Its run-time system is extended with automatic parallel program checkpoint and migration unit for PVM programs and hence it allows complete PVM applications to migrate in the Grid. 7. Is integrated with a generic Grid application monitoring infrastructure that enables the monitoring and visualisation of Grid application execution even when applications migrate in the Grid. 8. Is a layered environment. The top layer is a workflow layer that generates Condor DAGMan governed jobs. The lower layers are used to develop and run parallel programs, and they support fast parallelisation of sequential programs written in C, C++ or FORTRAN. 9. Provides an easy-to-use solution for parallel Grid program development and execution that can be used even by non-specialist programmers (like chemists, biologists, engineers, etc.). The goal of the Grid is to integrate different hardware and software systems and to make them accessible for users in a seamless and transparent way. P-GRADE works exactly according to this concept. It integrates middleware components such as message passing communication systems (PVM, MPI, MPICH-G2), checkpoint system, migration unit, monitor, job managers, and high-level user interface tools such as workflow generator, parallel programming language, compilers, graphical editor, mapping tool, debugger, and execution and performance visualisation tool. By developing P-GRADE our goal was to use and integrate existing Grid middleware layers (Globus, Condor, Condor-G, Condor DAGMan) and message passing layers (PVM, MPI, MPICH-G2) whenever they were available. We developed only those layers that were missing: a checkpoint system and migration unit for PVM programs running as Condor jobs in the Grid a generic Grid application monitoring infrastructure The goal of this paper is two-folded. On the one hand we describe the main features of P-GRADE for potential Grid programmers and show that P-GRADE provides a rich set of high level functionalities by which even non-specialist programmers (like chemists, biologists, engineers, etc.) can easily develop parallel applications for the Grid. On the other hand we describe those Grid middleware tools (checkpoint server, migration unit, Grid application monitoring infrastructure) that we have developed and by which P-GRADE provides an easy-to-use Grid execution environment for complex, parallel Grid applications. According to these two goals the paper contains two main parts. The first part (Section 2) describes those features of P-GRADE that assist the user in constructing complex, parallel Grid applications. Section 2 gives a short general introduction to P- GRADE overviewing its interactive and job mode, and finally explains the workflow

6 6 Kacsuk et al. concept of P-GRADE. The second part consisting of Section 4, 5 and 6 describes those middleware tools that make the P-GRADE Grid execution environment more powerful than other existing Grid environments. Section 3 describes the checkpoint and migration unit of P-GRADE that enables the migration of PVM Condor jobs in the Grid. Section 4 gives details on the components of the Grid application monitoring infrastructure (Mercury/GRM/PROVE toolset) that we developed for monitoring and visualising the executions of Grid applications. This infrastructure can also be used independently from P-GRADE to monitor the execution of PVM and MPI programs in the Grid. Section 6 explains the workflow execution mechanism of P-GRADE. We conclude the paper after comparing our results with related research. 2 P-GRADE 2.1 P-GRADE interactive mode In order to cope with the extra complexity of parallel and distributed programs due to inter-process communication and synchronisation, we have designed a graphical programming environment called P-GRADE. Its major goal is to provide an easy-touse, integrated set of programming tools for the development of generic messagepassing (MP) applications to be run on both homogeneous and heterogeneous distributed computing systems like supercomputers, clusters and Grid systems. The central idea of P-GRADE is to support each stage of the parallel program development life-cycle by an integrated graphical environment where all the graphical views applied at the various stages are associated with the original program graph designed and edited by the user. A program developed by P-GRADE can be executed either in an interactive way or as a job. The parallel program development life-cycle supported by P-GRADE in the interactive execution mode is shown in Fig. 1. Design/ Edit Visualize Compile P-GRADE Interactive Mode Monitor Map Debug Fig. 1. Program development cycle supported by P-GRADE The first stage of the life-cycle is the program design that is supported by the GRAPNEL (GRAphical Process NEt Language) language and the GRED graphical editor. In GRAPNEL, all process management and inter-process communication activities are defined graphically in the user's application. Low-level details of the underlying message-passing system are hidden. P-GRADE generates automatically all message-passing library calls (either PVM or MPI) on the basis of the graphical code of GRAPNEL. Since graphics hides all the low level details of message-passing, P- GRADE is an ideal programming environment for application programmers who are

7 P-GRADE: A Grid Programming Environment 7 not experienced in parallel programming (e.g., for chemists, biologists, etc.). GRAPNEL is a hybrid language: while graphics is introduced to define parallel activities (process definition, message passing), textual language parts (C/C++ or FORTRAN) are used to describe sequential activities. For illustration purposes we use a meteorology application [31], called MEANDER that was developed by the Hungarian Meteorology Service. The main aim of MEANDER is to analyse and predict in the ultra short-range (up to 6 hours) those weather phenomena, which might be dangerous for life and property. Typically such events are snowstorms, freezing rain, fog, convective storms, wind gusts, hail storms and flash floods. The complete MEANDER package consists of more than ten different algorithms from which we have selected four to compose a P-GRADE application for demonstration purposes. Each calculation algorithm is computation intensive and implemented as a parallel program containing C/C++ or FORTRAN sequential code. The structure of the parallel MEANDER application is illustrated in Fig. 2. GRAPNEL is based on a hierarchical design concept supporting both the bottom-up and top-down design approaches. A GRAPNEL program has three hierarchical layers which are as follows from top to bottom: Application layer is a graphical layer that is used to define the component processes, their communication ports as well as their connecting communication channels. Shortly, the Application layer serves for describing the interconnection topology of the component processes. (See Fig. 2.) Process layer is also a graphical layer to define the internal structure of the component processes by a flow-chart like graph (see Fig. 2). The basic goal is to provide a graphical representation for the message passing function calls. As a consequence, every structure that contains message passing calls should be graphically represented. The following types of graphical blocks are applied: loop construct, conditional construct, sequential block, message passing activity block and graph block. Sequential blocks must not contain any message passing calls. Text layer is used to define those parts of the program that are inherently sequential and hence a textual language like C/C++ or FORTRAN can be applied at this level. These textual codes are defined inside the sequential blocks of the Process layer (see Fig. 2). Legacy code from existing libraries or object files can easily be reused as parts of the Text layer providing an efficient and fast way of parallelisation. The top-down design method can be used to describe parallel activities of the application program. At the top level the topology and protocols of the interprocess communication can be defined and then in the next layer the internal structure of individual processes can be specified. At this level and in the Text layer the bottomup and top-down design methods can be used in a mixed way. In the case of the topdown design method, the user can define the graphical structure of the process and then can use the Text layer to define the C/C++ or FORTRAN code for the sequential blocks. In the bottom-up design approach, the user can inherit code from existing C/C++ or FORTRAN libraries and then can build up the internal process structure

8 8 Kacsuk et al. based on these inherited functions. Moreover, GRAPNEL provides predefined scalable process communication templates (process farm, pipeline, ring, mesh, cylinder, torus) that allow the user to generate large process farm, pipeline, etc. applications fast and safely. Fig. 2 illustrates how several process farm templates can be connected in a single application. Fig. 2: Hierarchical layers of GRAPNEL programs in the MEANDER application The GRED editor helps the user to construct the graphical parts of GRAPNEL programs in an efficient and fast way. GRAPNEL programs edited by GRED are saved into an internal file, called the GRP file that contains both the graphical and textual information of GRAPNEL programs. The main concepts of GRAPNEL and GRED are described in detail in [25] and [36]. The second stage is the pre-compilation and compilation of GRAPNEL programs. The goal of pre-compilation is to translate the graphical language information of the GRP file into PVM or MPI function calls and to generate the C or C++ source code of the GRAPNEL program. For the sake of flexibility, PVM and MPI function calls are not called directly in the resulting C code; they are hidden in an

9 P-GRADE: A Grid Programming Environment 9 internal library, called the GRAPNEL Library that has two versions. In the first version (GRP-PVM Library) the GRAPNEL Library functions are realised by PVM calls, and in the second version (GRP-MPI Library) they are realised by MPI function calls. For compiling and linking any standard C/C++ compiler and linker can be used. The linker uses the following libraries: PVM, GRP-PVM (in the case of PVM communication system), GRM MPI, GRP-MPI (in the case of MPI communication system), GRM The GRM monitoring and instrumentation library is optional; it is needed only if performance monitoring is applied at run time. Details of the code generator are described in [12]. The third stage of the program development life-cycle is mapping in order to allocate processes to processors. Mapping can be done by a simple mapping table generated by P-GRADE. The mapping table can be easily modified by simple mouse clicks. The mapping information is also inserted into the GRP file. Having the necessary executables for the parallel/distributed computing system, the next stage is validating and debugging the code. The DIWIDE distributed debugger has been developed for P-GRADE in which the debugging information is presented on the user's graphical code. DIWIDE applies a novel macrostep debugging approach [26] based on which, both replay and systematic debugging is possible. DIWIDE automatically detects deadlock in the message-passing code. GRAPNEL programs can be executed step-by-step both at the C/C++ instruction level, at the higher-level graphical icon level or at the novel macrostep level. These features significantly facilitate parallel debugging which is the most time-consuming stage of parallel program development. After debugging the code, the next step is performance analysis. First, it requires performance monitoring that generates an event trace file at program execution time and then, performance visualisation that displays performance oriented information by several graphical views. Performance monitoring is performed by the P-GRADE Monitor (GRM) that can support both PVM and MPI. PROVE, the performance visualisation tool of P-GRADE, can provide both on-line and off-line visualisation based on the output trace file of GRM. We note that parallel applications developed by P-GRADE can run as standalone executables independently from the P-GRADE environment after the development process has finished. In addition to run parallel programs on one dedicated cluster or supercomputer, P-GRADE also supports the utilization of several resources to execute the same application. This can be done on the basis of the MPICH-G2 message passing system that makes possible for an MPI application to distribute its processes among loosely coupled computational resources, i.e., various clusters or supercomputers may participate to run different parts of the same MPI program. Practically, MPICH-G2 can be considered as another implementation of the MPI standard, i.e. it consists of an MPI library and a corresponding header file. To enable the distributed execution of an MPI program, the user must link the application against the MPICH-G2 library. In order to start such an application we use the Globus-2 job start mechanism. P-GRADE creates automatically an appropriate multirequest RSL file (needed by Globus-2) to start an MPICH-G2 application and passes it to the globusrun command. 
Globus accepts multi-request RSL files only in interactive mode, so P-GRADE, too, exploits MPICH-G2 in its interactive execution mode.
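The multi-request RSL mentioned above bundles one GRAM request per site into a single submission that is co-allocated as a whole. The fragment below is a hand-written sketch rather than a file generated by P-GRADE; the contact strings, process counts and paths are hypothetical, and attributes that a complete MPICH-G2 run needs (such as the subjob labelling) are omitted for brevity.

    +
    ( &(resourceManagerContact="cluster1.example.org/jobmanager-pbs")
       (count=16)
       (jobtype=mpi)
       (directory=/home/user/meander)
       (executable=/home/user/meander/visib)
    )
    ( &(resourceManagerContact="cluster2.example.org/jobmanager-condor")
       (count=8)
       (jobtype=mpi)
       (directory=/home/user/meander)
       (executable=/home/user/meander/visib)
    )

A file of this form is then passed to the globusrun command, which starts the two sub-jobs on the two sites and lets MPICH-G2 join them into a single MPI_COMM_WORLD.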

10 10 Kacsuk et al. From the P-GRADE user point of view, the difference between the single-site and multi-site interactive executions of an application is merely to select some options in the code generation phase of P-GRADE, i.e., the MPICH-G2 option must be set instead of MPI or PVM in the compilation dialog window. Furthermore, in the mapping phase, instead of a single computing node several Globus-2 resources can be assigned to various parts of the GRAPNEL application. An important feature of the multi-site execution is that performance monitoring and analysing tools of P-GRADE work just as the same way as in case of a single site. It means that the code of the application can be automatically instrumented and the GRM tool can collect all the necessary trace information from all the Globus-2 resources that participate in the execution provided that Mercury monitor is available on those sites. Details of performance monitoring and visualisation support of P-GRADE can be found in Sect. 4. So, from the user point of view, P-GRADE practically makes no difference to execute (and monitor, visualise) a parallel application on a specific resource or on multiple resources. All the necessary auxiliary files (e.g. RSL file) are generated automatically and all the necessary control commands (e.g. globusrun) are executed automatically by P-GRADE. 2.2 P-GRADE job mode The interactive working mode of P-GRADE can be used in supercomputers, clusters and Globus-based Grid systems provided that those resources are dedicated for a particular application. However, Grid resources are typically not dedicated for a single Grid application and hence those resources are exploited by jobs that can be located and controlled by Grid job managers. As a consequence, P-GRADE supports the Grid job execution mode, too. There are two different Grid job execution scenarios that are supported by P- GRADE: Scenario A. There is a Grid constructed as a collection of friendly Condor pools and the Condor flocking mechanism takes care of distributing processes of PVM jobs among the free resources, i.e., processes of a PVM application can run simultaneously in several Condor pools. The Hungarian ClusterGrid [36] is such a Grid and hence its support by P-GRADE is extremely important. Scenario B. There is a Globus-2 Grid containing supercomputers and/or clusters and a parallel program can be launched from P-GRADE to any of these multiprocessor sites, i.e., all the processes of the parallel application will run on the same multi-processor site of the Globus Grid. In order to support these scenarios we have integrated P-GRADE with Condor and Globus-2 and tested these scenarios in the framework of the Hungarian Supercomputing Grid project [25] Scenario A. Condor is a local resource manager in order to support highthroughput computing in clusters and Condor-G is a global resource manager to support high-throughput computing in the Grid [31], [48]. One of the most significant features of Condor is the ClassAds mechanism by which it can match application programs with execution resources. When a user submits a job she has to describe the resource requirements and preferences of her job. Similarly, resource providers should advertise their resources by configuration files. The Condor Matchmaker process tries

11 P-GRADE: A Grid Programming Environment 11 to match jobs and resource ClassAds to find matching resource requirements and resources. When such a matching occurs, the Matchmaker process notifies both the submitter machine and the selected resource. Then, the submitter machine can send the job to the selected machine that will act as an execution machine. Integration of P-GRADE and Condor means that after developing a parallel program in the interactive mode the P-GRADE user can switch to batch-mode inside P-GRADE and in such a case program execution will result in the automatic generation of a parallel Condor job. P-GRADE will automatically construct the necessary Condor job description file containing the resource requirements of the parallel job. The mapping function of P-GRADE is changed according to the needs of Condor. In Condor the user can define machine classes from which Condor can reserve as many machines as it is defined by the user in the job description file. After submitting the Condor job under P-GRADE the user can detach P-GRADE from the job. It means that the job does not need the supervision of P-GRADE when it is executed in the Grid by Condor. Meanwhile the P-GRADE generated Condor job is running in the Grid, P-GRADE can be turned off or it can be used for developing other parallel applications. However, at any time the user can attach back to the job by the help of P-GRADE in order to watch the current status and results of the job. The program development and execution mechanism of P-GRADE for Condor Grids is shown in Fig. 3. The Grid mapping in this case is Condor mapping as described above. Design/ Edit Attach Compile P-GRADE Job Mode Detach Grid Map Submit job Fig. 3. P-GRADE services in job mode There is an important advantage of the integrated Condor/P-GRADE system compared to Condor: generic parallel PVM Condor jobs can be automatically checkpointed and migrated among friendly Condor pools. The realisation of this advance feature is explained in Section Scenario B In addition to pure Condor pools, Globus-2 Grids can also be utilized by P- GRADE to run parallel jobs on Grid resources. Globus-2 provides a more complete set of tools for establishing Grids than Condor does especially with respect to its security infrastructure (GSI). On the other hand, to actually execute jobs Globus-2 relies usually on local job managers like Condor, so it is possible to access a Condor pool via Globus.
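The ClassAds matching that drives Scenario A can be made concrete with a small example. The first fragment below is the kind of requirements expression a submit description may carry, the second an excerpt of a machine ClassAd advertised by a pool node; all attribute values are hypothetical, and a real P-GRADE generated description also names the executable, the universe and the number of machines.

    # job side (submit description excerpt)
    requirements = (Arch == "INTEL") && (OpSys == "LINUX") && (Memory >= 256)
    rank         = KFlops

    # machine side (ClassAd excerpt)
    Arch   = "INTEL"
    OpSys  = "LINUX"
    Memory = 512
    KFlops = 850000

The Matchmaker pairs the job with a machine whose ClassAd satisfies the requirements expression (and whose own requirements accept the job), then notifies both sides as described above.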

12 12 Kacsuk et al. The user can simply select the Globus based execution mode in the appropriate P-GRADE dialog window if she wants to run her parallel application on a Globus-2 resource instead of a dedicated cluster or a Condor pool. After the selection has been made, the mapping tool automatically lists all eligible resources, i.e., all resources that have Globus-2 installed. It is defined in the so called resource configuration files for P-GRADE. The resource configuration files specify what resources are available for program execution and what capabilities they have, i.e., whether they are dedicated clusters or supercomputers, pure Condor pools or Globus- 2 resources with various local job managers, etc. Naturally, in order to run the program by Globus, the user must select a Globus resource in the mapping window of P-GRADE. In addition to selecting a Globus site, a proper local job manager must also be chosen if more than one such managers are available on the specific site. P- GRADE takes care about various restrictions that may arise due to conflicting program types and execution environments. For example, Globus-2 does not support execution of PVM applications by default. However, if the local job manager in the Globus site is Condor, then it is possible to let PVM programs run on the resource by using special RSL attributes to Condor. Accordingly, P-GRADE automatically prevents the user from selecting Globus-2 execution mode for GRAPNEL/PVM programs unless a Globus-2 resource with Condor job manager is available. From the resource and job manager selection information P-GRADE automatically generates the proper RSL file for the parallel program and executes the appropriate Globus-2 commands to run the program. During the execution, the performance monitoring and visualisation tools of P-GRADE can be used exactly the same way (from the user s point of view) as in the case of the interactive execution mode provided that Mercury monitor is available on the specific resource. Detach and attach operations can be applied similarly to the Condor case so, the development cycle for this scenario is the same as in Fig. 3 but in this case the applied Grid mapping corresponds to the Globus-2 requirements. 2.3 P-GRADE workflow layer: Component based Grid programming The P-GRADE layers described in Section 2 enable to develop single-job Grid applications. However, in many cases the Grid application can be so complex that it requires the execution of several jobs according to some predefined order. The execution order of the component jobs are defined by the workflow that connects existing sequential or parallel programs into an interoperating set of jobs. Connections define dependency relations among the components of the workflow with respect to their execution order that can naturally be represented as graphs. Such representation of the above mentioned meteorology application is depicted in Fig. 4. Nodes (labelled as delta, visib, etc.) represent different jobs from the following four types: sequential, PVM, MPI, or GRAPNEL (denoted as GRP) job. Small rectangles (ports labelled by numbers) around nodes represent data files. Dark grey ones (green on a colour screen) are input files, light grey ones are output files of the corresponding job, and directed arcs interconnect pairs of input and output files if an output file serves as input for another job. In other words, arcs denote the necessary file transfers between jobs.
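The DAGMan specification generated for such a graph is, at its core, a list of jobs plus their precedence constraints. The sketch below uses the MEANDER jobs named above and detailed in the next paragraphs, and assumes, for illustration only, that the four parallel algorithms all feed the final visualisation step; the submit file names are hypothetical, and the file staging described here is handled by the P-GRADE workflow manager on top of what DAGMan itself provides.

    # Hypothetical DAGMan description of the MEANDER workflow
    JOB delta  delta.submit
    JOB cummu  cummu.submit
    JOB visib  visib.submit
    JOB satel  satel.submit
    JOB ready  ready.submit

    PARENT delta cummu visib satel CHILD ready

A file of this form is submitted with condor_submit_dag, after which DAGMan releases each job as soon as all of its parents have completed.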

13 P-GRADE: A Grid Programming Environment 13 As a result, the workflow describes both the control-flow and the data-flow of the application. A job can be started when all the necessary input files are available and transferred (e.g. by GridFTP) to the site where the job is allocated for execution. Managing the file-transfers and recognition of the availability of the necessary files is the task of our workflow manager that extends the capabilities of Condor DAGMan. Fig. 4. Workflow representation of MEANDER meteorological application and the underlying design layers of P-GRADE parallel programming environment For illustration purposes, we use the MEANDER meteorology application again. The first graph depicted in Fig. 4 (see Workflow Layer) consists of four jobs (nodes) corresponding four different parallel algorithms of the MEANDER ultra-short range

14 14 Kacsuk et al. weather prediction package and a sequential visualisation job that collects the final results and presents them to the user as a kind of meteorological map: Delta: a P-GRADE/GRAPNEL program compiled as a PVM program with 25 processes Cummu: a PVM application with 10 processes Visib: a P-GRADE/GRAPNEL program compiled as an MPI program with 20 worker processes (see the Application window with the process farm and the master process in Fig. 4.) Satel: an MPI program with 5 processes Ready: a sequential C program This distinction among job types is necessary because the job manager on the selected Grid site should be able to support the corresponding parallel execution mode, and the workflow manager is responsible for handling the various job types by generating the appropriate submit files. Generally, the executables of the jobs can be existing legacy applications or can be developed by P-GRADE. A GRAPNEL job can be translated into either a PVM or an MPI job but it should be distinguished from the other types of parallel jobs since P- GRADE provides fully interactive development support for GRAPNEL jobs; for designing, debugging, performance evaluation and testing the parallel code as described in Section 2. By clicking on such a node of the workflow graph, P-GRADE invokes the Application window, in which the inter-process communication topology of the GRAPNEL job can be defined and modified graphically (see Fig. 4. Application window). Then, from this Application window the lower design layers, such as the Process and the Text levels, are also accessible by the user to change the graphically or the textually described program code of the current parallel algorithm (see the Process and Text window of visibility calculation in Fig. 4.). It means that the workflow represents the top level P-GRADE layer among the hierarchical design layers. Besides the type of the job and the name of the executable (see Fig. 4), the user can specify the necessary arguments and the hardware/software requirements (architecture, operating system, minimal memory and disk size, number of processors, etc.) for each job. To specify the resource requirements, the application developer can currently use either the Condor resource specification syntax and semantics for Condor based Grids or the explicit declaration of the Grid site where the job is to be executed for Globus based Grids (see Fig. 4. Job Attributes window, Requirement field). In order to define the necessary file operations (see Fig. 4) of the workflow execution, the user should define the attributes of the file symbols (ports of the workflow graph) and file transfer channels (arcs of the workflow graph). The main attributes of the file symbols are as follows: file name type The type can be permanent or temporary. Permanent files should be preserved during the workflow execution but temporary files can be removed immediately when the job using it (as input file) has been finished. It is the task of the workflow manager to transfer the input files to the selected site where the corresponding job will run. The transfer can be done in two ways. The off-line transfer mode means that the whole file

15 P-GRADE: A Grid Programming Environment 15 should be transferred to the site before the job is started. The on-line transfer mode enables the producer job and the consumer job of the file to run in parallel. When a part of the file is produced the workflow manager will transfer it to the consumer's site (i.e. a kind of stream connection must be established by the workflow manager). This can improve the performance of some workflows however, this working mode obviously assumes a restricted usage of the file both at the producer and consumer site and hence, it should be specified by the user that the producer and consumer (which may include legacy executables or object files without sources) meet these special conditions. In the current implementation only the off-line transfer mode is supported. 3 P-GRADE Grid execution environment: process and application migration support 3.1 Structure of the parallel application The P-GRADE compiler generates [12] executables which contain the code of the processes defined by the programmer - we refer them as clients hereafter - and an extra system process, called the GRAPNEL Server (see Fig. 5) which coordinates the run-time set-up of the application. Executables of client processes contain the user defined code, the message passing primitives and the GRAPNEL library that manages logical connections among them. To set-up the application first the GRAPNEL Server is activated and then it spawns all the client processes performing the user defined computations. Console built-in grapnel server Files client C grapnel library client A message passing layer client D client B Fig. 5. Structure of the GRAPNEL application generated by P-GRADE As a result of the co-operation between the GRAPNEL Server and the GRAPNEL library the message passing communication topology is built up. GRAPNEL provides special file handling functions for the user which make possible to delegate all file IO operations of the clients to the GRAPNEL Server, i.e., all input and output files of the clients can be physically managed by the GRAPNEL Server process on its host. For instance, if the user inserts a grp_fopen function call into the code of a client then it generates a message (either PVM or MPI) at run-time sent to the GRAPNEL Server that opens the physical file in turn and sends back an abstract file ID for the client process. The client then can use that abstract ID in further grp_fprintf function calls to make the GRAPNEL Server write some data into that file.
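A client-side use of this server-managed file I/O might look like the fragment below. grp_fopen and grp_fprintf are the calls named above, but their exact signatures are not given here, so the argument lists, the return convention and the grp_fclose counterpart are assumptions made only for the sake of the example.

    /* Assumed prototypes; the real GRAPNEL library header is not shown in the text. */
    extern int grp_fopen(const char *path, const char *mode);
    extern int grp_fprintf(int fid, const char *fmt, ...);
    extern int grp_fclose(int fid);

    static void write_result(int cell, double value)
    {
        int fid = grp_fopen("meander_result.dat", "w");  /* GS opens the file on its own host */
        if (fid >= 0) {
            /* each call travels as a PVM/MPI message; GS performs the physical write */
            grp_fprintf(fid, "visibility(%d) = %f\n", cell, value);
            grp_fclose(fid);                             /* assumed counterpart of grp_fopen  */
        }
    }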

16 16 Kacsuk et al. 3.2 Motivations for migration Process migration in distributed systems is a special event when a process running on a particular resource is moved to another one in such a way that the migration does not cause any change in the process execution. It means that the process is not restarted instead, its execution is temporarily suspended and then resumed on the new resource. In order to provide this capability special techniques are necessary to save the total memory image of the target process and to reconstruct it. This technique is called checkpointing. During checkpointing a tool suspends the execution of the process, collects all the internal status information necessary for resumption and terminates the process. Later it creates a new process and all the collected information is restored for the process to continue its execution from where it was suspended. Such migration mechanism can be advantageously used in several scenarios. First, in supercomputing applications load-balancing is a crucial issue. Migration can solve the problem of unbalanced parallel sites of the Grid. Processes on overloaded machines can migrate to underloaded machines without terminating the entire application. Similarly, load-balancing can be ensured among different sites of the Grid, i.e., when a site becomes overloaded complete applications can migrate to other sites. The second situation is related with high-throughput computing where free cycles of underloaded machines are exploited. In such a scenario the owner of the Grid site always has priority over the guest applications and hence, all the guest applications should be removed from the site when the owner increases the load of the resource. Third, the migration module can be used for providing fault-tolerance capability for long-running applications if machines are corrupted or need system maintenance during the execution. Fourth, migration can be driven by resource needs, i.e., processes can be moved in order to access special unmovable or local resources. For example, processes may need to use special equipments or huge databases existing on dedicated machine in the Grid. A migration tool typically consists of a checkpoint server, a checkpoint information store and a checkpoint library. To provide fault-tolerance, applicationwide checkpoint saving is performed, i.e., checkpoint information is stored into files for roll-back if necessary. These files are maintained by a checkpoint server and written/read by the checkpoint library attached to the process to be migrated. There are several existing libraries performing sequential program checkpointing like [15], [37], [32] or [47]. To start checkpointing the checkpoint libraries should be notified and checkpointed/resumed processes need to be managed. It is done by a migration co-ordination module built in the application. It checkpoints, terminates and restarts processes belonging to the application. The co-ordination module is responsible for keeping the application alive at all the time to realise fault-tolerant execution where the application adapts to the dynamically changing execution environment. 3.3 Migration working modes P-GRADE supports the migration of Generic PVM programs in three different working modes: 1. Interactive (or dedicated) mode (details in Section 3.4) This mode is used inside a cluster or supercomputer without the supervision of a local job manager. 
In such a case P-GRADE directly starts up the PVM daemons and controls the execution of the parallel application. There are two typical scenarios in which this mode is applied. First, the Load-Balancer module of P-GRADE can take care of balancing

the load on the dedicated cluster or supercomputer. Second, in order to provide fault-tolerance on the dedicated resource, P-GRADE checkpoints the application periodically. When node loss is detected, user processes are resumed from the last checkpoint.

2. Job mode with Condor (details in Section 3.5). This mode is used inside a cluster or among clusters using the Condor flocking technique. There are two typical scenarios in which migration is necessary. In the first scenario process migration is applied inside a Grid site (cluster), while in the second scenario process migration takes place among Grid sites using Condor flocking. Interestingly, the applied checkpointing and migration protocol is the same in both cases. P-GRADE submits the application to Condor, which allocates nodes for the application. When a node vacation signal is detected by the application, checkpointing of the user processes is performed, new nodes are allocated by Condor either inside the cluster or among friendly Condor pool clusters and, finally, the migrated user processes are resumed on the newly allocated nodes.

3. Job mode with Global Application Manager (details in Section 3.6). In this mode the whole application (job) can migrate from one Grid site to another one using a Global Application Manager that includes a Grid broker and a global Grid job manager (for example Condor-G). P-GRADE passes the application to the Global Application Manager that selects a Grid site where nodes are allocated for the application. When the Global Application Manager detects that the Grid site is overloaded, it initiates a total application checkpoint by removing the job from the local queue, allocates another Grid site, moves the necessary files to the new Grid site and finally re-submits and resumes the application there. Of course, within the selected Grid site the Condor job migration mode is possible if the local job manager is Condor.

Table 1 summarises the various working modes and their realisation techniques and infrastructures.

Table 1. Combination of working modes and their realisation techniques and infrastructures:

                                  Interactive      Condor                 Global App. Manager
    Chkpt. of user processes      X                X                      X
    Detection of node loss        X                X                      X
    Automatic node addition       X                X
    Chkpt. of server process                                              X
    Migration by Global App. Mgr                                          X
    Required infrastructure       Single cluster   Single/Multi cluster   Multi cluster

3.4 The checkpoint mechanism in interactive mode

There are two main concepts to handle process checkpointing and migration of parallel programs. The first approach (uncoordinated checkpoint) creates checkpoints for processes individually, independently of each other, and hence processes can

18 18 Kacsuk et al. migrate without stopping the progress of other processes of the same application. That approach was used in the Dynamite project [10], [11] but it requires the modification of the underlying MP system. The other approach (coordinated checkpoint) generates a global checkpoint suspending the whole application at checkpoint time. The drawback of this solution is that it needs larger overhead than the other approach but its advantage is that no modification of the underlying MP system is necessary. This solution was used in the CoCheck [38], [39] system and we adapted and extended CoCheck for P-GRADE. The checkpointing procedure is controlled by the GRAPNEL library, so no modification of the user code or the underlying message passing library is required to support process and application migration. The GRAPNEL Server performs a consistent checkpoint of the whole application. Checkpoint files contain the state of the individual processes including in-transit messages so the whole application can be rebuilt at any time and on the appropriate site. The checkpoint system of a GRAPNEL application contains the following elements (see Fig. 6.): GRAPNEL Server (GS): an extra co-ordination process that is part of the application and generated by P-GRADE. It sets up the application by spawning the processes and defining the communication topology for them. GRAPNEL library: a layer between the message passing library and the user code, automatically compiled with the application, co-operates with the server, performs preparation for the user process environment and provides a bridge between the server process and the user code. Checkpoint module in GRAPNEL library: in client processes it prepares for checkpoint, performs synchronisation of messages and re-establishes connection to the application after a process is rebuilt from checkpoint; in GS it coordinates the checkpointing activities of the client processes. Dynamic checkpoint library: loaded at process start-up and activated by receiving a predefined chkpt signal, reads the process memory image and passes this information to the Checkpoint Server Checkpoint Server: a component that receives data via a socket and puts it into the chkpt file of the chkpt storage and vice versa. First an instance of the Checkpoint Server (CS) is started in order to transfer checkpoint files to/from the dynamic checkpoint libraries linked to the application. After starting the application, each process of the application automatically loads the dynamic checkpoint library at start-up that checks the existence of a previous checkpoint file of the process by connecting to the Checkpoint Server. If it finds a checkpoint file for the process, the resumption of the process is automatically initiated by restoring the process image from the checkpoint file otherwise, it starts from the beginning. When the application is launched, the first process that starts is the GRAPNEL Server (GS) performing the coordination of the Client processes. It starts spawning the Client processes to create the topology of the parallel application. Whenever a process becomes alive, it first checks the checkpoint file and gets contacted to GS in order to download parameters, settings, etc. When each process has performed the initialisation, GS instructs them to start execution and the application is running.
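The resume-or-start decision taken at process start-up can be summarised by the following self-contained sketch. Talking to the Checkpoint Server over a socket is replaced here by a local stub driven by a hypothetical CKPT_DIR environment variable, so the code only illustrates the control flow described above, not the real checkpoint library.

    #include <stdio.h>
    #include <stdlib.h>

    /* Stand-in for the socket query to the Checkpoint Server. */
    static int checkpoint_available(const char *proc_name)
    {
        const char *dir = getenv("CKPT_DIR");            /* hypothetical location */
        char path[512];
        FILE *f;
        if (!dir) return 0;
        snprintf(path, sizeof path, "%s/%s.ckpt", dir, proc_name);
        f = fopen(path, "rb");
        if (!f) return 0;
        fclose(f);
        return 1;
    }

    int main(void)
    {
        const char *me = "client_A";                     /* illustrative process name */
        if (checkpoint_available(me)) {
            printf("%s: checkpoint found, restoring memory image\n", me);
            /* real system: read the image back from the Checkpoint Server,
             * rebuild the process state and re-register at the GRAPNEL Server */
        } else {
            printf("%s: no checkpoint, starting from the beginning\n", me);
            /* real system: contact the GRAPNEL Server for parameters and wait
             * for the instruction to start execution                          */
        }
        return 0;
    }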

19 P-GRADE: A Grid Programming Environment 19 submit machine Terminal Files executor machines in local/remote pool Grapnel server/ co-ordination module chkpt lib Checkpoint Server Client A chkpt lib message passing layer Client B chkpt lib Client C user code Storage Client D chkpt lib chk pt lib grapnel lib mp lib Fig. 6. Structure of application in checkpoint mode Fig. 7 shows the main steps of the checkpoint protocol applied between the clients and the GS. While the application is running and the processes are doing their tasks the migration mechanism is inactive. Migration is activated when a Client process detects that it is about to be killed (TERM signal for Client A in Fig. 7). The Client process immediately informs GS (REQUEST_for_chkpt) that in turn initiates the checkpointing of all the Client processes of the application. For a Client process checkpointing is initialised either by a signal or by a checkpoint message (DO_chkpt) sent by GS in order to make sure that all processes are notified regardless of performing calculation or communication. When notified processes are prepared for checkpointing (READY_to_chkpt), they are instructed to initiate synchronisation (DO_sync) of messages aiming at receiving all the in-transit messages and store them in the memory. Finally, Client processes send their memory image to the Checkpoint Server. All checkpointed processes then wait (DONE_chkpt) for further instruction from GS whether to terminate or continue the execution. GS terminates the clients to be migrated by the DO_exit command (Client A in fig. 7) and then GS initiates new node allocations for these terminated processes. (The new node selection decision comes from the P-GRADE Load Balancer in the case of the interactive mode, and from Condor in the case of Condor job mode.) When host allocations are performed, migrating processes are resumed on the newly allocated nodes. Each migrated process automatically loads the checkpoint library that checks for the existence of a previous checkpoint file of the process by connecting to the Checkpoint Server. This time the migrated processes will find their checkpoint file and hence their resumption is automatically initiated by restoring the process image from the checkpoint file. The migrated processes execute post-checkpoint instructions before resuming the real user code. The post-checkpoint instructions serve for initialising the

20 20 Kacsuk et al. message-passing layer and for registering at GS (DONE_restoration). When all the checkpointed and migrated processes are ready to run, GS allows them to continue their execution (DO_continue). Grapnel server/coordinator REQUEST for chkpt.. Client A Termination Client B DO chkpt DO chkpt READY to chkpt READY to chkpt DO sync DO sync SYNC messages READY to save READY to save DO sync DO sync DONE chkpt + need exit SAVE DONE chkpt SAVE DO exit SPAWN RESTORE DONE restoration DO continue DO continue Fig. 7. Checkpoint protocol in the GRAPNEL application 3.5 Process migration under Condor The checkpoint system has been originally integrated with PVM in order to migrate PVM processes inside a cluster. However, in order to provide migration among clusters we have to integrate the checkpoint system with a Grid-level job manager that takes care of finding new nodes in other clusters of the Grid. Condor flocking mechanism provides exactly this functionality among friendly Condor pools and hence the next step was to integrate the checkpoint system with Condor. In order to do that we adapted the Condor MW (Master-Worker) execution mechanism. The basic principle of the Condor MW model is that the master process spawns workers to perform the calculation and continuously watches whether the workers successfully finish their calculation. In case of a failure the master process simply spawns new workers passing the unfinished work to them. The situation when a worker fails to finish its calculation usually comes from the fact that Condor removes the worker because the executor node is no longer available. This action is called vacation of the PVM process. In this case the master node receives a notification message indicating that a particular node has been removed from the PVM machine. As an answer the master process tries to add new PVM host(s) to the virtual machine with the help of Condor, and gets notified when host inclusion is done successfully. At this time it spawns new worker(s).
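The two PVM mechanisms underlying this master-worker behaviour are task-exit notification and dynamic host addition. The sketch below shows them in isolation and is not the GRAPNEL Server code: the message tag, the worker binary name and the host request string are invented, and under Condor-PVM the actual node selection is negotiated by Condor rather than determined by the name passed to pvm_addhosts.

    #include <stdio.h>
    #include <pvm3.h>

    #define TAG_WORKER_GONE 42                     /* arbitrary user-chosen tag */

    int main(void)
    {
        int tid, info;
        char *hosts[] = { "any" };                 /* placeholder host request  */

        /* spawn one worker and ask PVM to notify us if it exits (e.g. vacated) */
        if (pvm_spawn("worker", NULL, PvmTaskDefault, "", 1, &tid) != 1)
            return 1;
        pvm_notify(PvmTaskExit, TAG_WORKER_GONE, 1, &tid);

        for (;;) {
            if (pvm_recv(-1, TAG_WORKER_GONE) < 0) /* blocks until a worker is lost */
                break;
            printf("worker lost, requesting a replacement node\n");
            pvm_addhosts(hosts, 1, &info);         /* ask for a new node            */
            if (info >= 0 &&
                pvm_spawn("worker", NULL, PvmTaskDefault, "", 1, &tid) == 1)
                pvm_notify(PvmTaskExit, TAG_WORKER_GONE, 1, &tid);
            /* in P-GRADE the replacement process is resumed from its checkpoint
             * rather than restarted from the beginning                           */
        }
        return pvm_exit();
    }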

21 P-GRADE: A Grid Programming Environment 21 For running a GRAPNEL application, the application continuously requires the minimum amount of nodes to execute the processes. Whenever the number of the nodes drops below the minimum, the GRAPNEL Server (GS) tries to extend the number of PVM machines above the critical level. It means that the GS process behaves exactly the same way as the master process does in the Condor MW system. Under Condor the master PVM process is always started on the submit machine and is running until the application is finished. It is not shut down by Condor, even if the submit machine becomes overloaded. Condor assumes that the master process of the submitted PVM application is designed as a work distributor. The functionality of the GRAPNEL Server process fully meets this requirement, so GRAPNEL applications can be executed under Condor without any structural modification and the GS can act as the coordinator of the checkpointing and migration mechanisms as it was described in the previous section. Step 1. spawn ABC GS Condor spawn spawn P P spawn P GS: Grapnel Server CS: Checkpoint Server P: PVM daemon A,B,C: User processes submit node Step 2. CS A B C GS notify ABC client nodes P Condor P terminate P vacate Step 1: Starting the application Step 2: One node to vacate Step 3: Checkpointing Step 4: Migration to friendly pool CS A B C synchronising messages Step 3. Condor vacate GS P P P CS A B C checkpoint write Friendly Condor Pool Step 4. addhost spawn C GS P Condor P spawn P Condor allocate node CS A B C reconnect checkpoint read Fig. 8. Steps of process migration among friendly Condor pools Whenever a process is to be killed (see Fig. 7) (e.g. because its node is being vacated), an application-wide checkpoint must be performed and the exited process should be resumed on another node. The application-wide checkpointing is driven by GS, but it can be initiated by any client process which detects that Condor tries to kill it. In this

case the client process notifies GS to perform a checkpoint. After this notification GS sends the DO_chkpt signal or message to every client process. After checkpointing, all the client processes wait for further instruction from the server on whether to terminate or to continue the execution. GS sends a terminate signal to those processes that must migrate (DO_exit for Client A in Fig. 7). At this point GS waits for the decision of Condor, which tries to find underloaded nodes either in the home Condor pool of the submit machine or in a friendly Condor pool. The resume phase is performed only when the PVM master process (GS) receives a notification from Condor about new host(s) connected to the PVM virtual machine. When every terminated process has migrated to a new node allocated by Condor, the application can continue its execution according to the protocol shown in Fig. 7. This working mode enables the PVM application to continuously adapt itself to the changing PVM virtual machine by migrating processes from the machines being vacated to new ones that have just been added. Figure 8 shows the main steps of the migration between friendly Condor pools. Notice that the GRAPNEL Server and Checkpoint Server processes remain on the submit machine of the home pool even if every client process of the application migrates to another pool. It should also be noted that Condor does not provide checkpointing for arbitrary PVM applications; it provides application-level support for the fault-tolerant execution of MW-type PVM applications. The advantage of the integrated P-GRADE/Condor system is that it extends the Condor fault-tolerant execution mechanism to generic PVM applications, too.

3.6 Application level migration by a Global Application Manager

Condor flocking cannot be applied in generic Grid systems where the pools (clusters) are separated by firewalls, and hence global Grid job managers should be used. In such systems, if the cluster is overloaded, i.e. the local job manager cannot allocate nodes to replace the vacated ones, the whole application should migrate to another, less loaded cluster of the Grid. This means that not only the client processes but also the GRAPNEL Server must leave the overloaded cluster. We call this kind of migration total migration, as opposed to partial migration where the GRAPNEL Server does not migrate. In order to leave the pool, i.e. to migrate the whole application to another pool, two extra capabilities are required. First, an upper-level layer, called a Grid Application Manager, is needed, which submitted the application and is able to recognise the situation when a total migration of the GRAPNEL application to another pool is required. Secondly, the checkpoint saving mechanism must include the server itself, i.e. after checkpointing all the client processes, the server should checkpoint itself as well. Before the server checkpoint, the server should avoid sending any messages to the client processes and should store the status of all open files in order to be able to reopen them after resume. The checkpoint support is built into the application; the rest, e.g. removal from the local queue, file movements, resource reselection and resubmission, is the task of the upper Grid layers.
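As an illustration of the open-file bookkeeping mentioned above, the following C sketch records, for each open descriptor, the file name and current offset before a server self-checkpoint and reopens and repositions the file after resume. It is only a sketch of the idea under Linux (using /proc), with hypothetical helper names; the actual P-GRADE checkpoint library may handle file state differently.

    /* Illustrative sketch only: one possible way to save and restore the
     * state of open files around a self-checkpoint.  Error handling is
     * minimal and the helper names are hypothetical. */
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <limits.h>
    #include <sys/types.h>

    typedef struct {
        int   fd;
        char  path[PATH_MAX];
        off_t offset;
    } file_state_t;

    /* Record the pathname and current offset of an open descriptor. */
    static int save_file_state(int fd, file_state_t *st)
    {
        char link[64];
        ssize_t n;

        snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);  /* Linux-specific */
        n = readlink(link, st->path, sizeof(st->path) - 1);
        if (n < 0)
            return -1;
        st->path[n] = '\0';
        st->fd = fd;
        st->offset = lseek(fd, 0, SEEK_CUR);    /* current file position */
        return 0;
    }

    /* After resume: reopen the file and seek back to the saved position. */
    static int restore_file_state(const file_state_t *st, int flags)
    {
        int fd = open(st->path, flags);
        if (fd < 0)
            return -1;
        lseek(fd, st->offset, SEEK_SET);
        return fd;
    }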

4 P-GRADE Grid execution environment: Grid application monitoring infrastructure

Once the user has generated a Grid application, she would like to observe how its execution progresses on the selected Grid site(s). In order to do that, some kind of application monitoring infrastructure is needed. In the P-GRADE environment we apply the Mercury/GRM/PROVE toolset. Mercury is a generic Grid monitoring service created according to the GMA (Grid Monitoring Architecture) recommendation of the Global Grid Forum. Its main task is to provide monitoring information about Grid entities (such as resources, applications, etc.) to consumers, both via query and subscribe, and to transfer data to consumers subscribed for the specific monitoring information. GRM is the GRAPNEL instrumentation and monitoring tool by which GRAPNEL applications can be automatically instrumented by the P-GRADE system. Finally, PROVE is an execution visualisation system that can be used by the Grid programmer for performance analysis purposes. The GRM instrumentation generates the trace events that are delivered to the client machine by the Mercury Grid monitor, and then the trace events are processed and displayed for the user by the PROVE visualisation tool. All these activities are on-line. The trace events and the visualisation can be requested and served at any time during (or after) the execution of the Grid program, no matter on which site of the Grid the application is running. The Mercury/GRM/PROVE toolset can be used even if the application migrates in the Grid, giving exact information about the execution at the source and destination sites as well as about the migration time. Naturally, the Mercury/GRM/PROVE toolset supports the monitoring and observation of complex workflows running under P-GRADE in the Grid. In the following subsections we give details of the architecture and usage of the Mercury/GRM/PROVE toolset.

Basically, there are two possible solutions for monitoring applications in the Grid. The first approach does not require any monitoring infrastructure on the Grid sites: the monitor code is transferred with the application code to the selected Grid site, and the monitor is then forked from the application at start-up time. The second approach assumes the installation of a Grid monitor on every Grid site: when the application code arrives at the selected Grid site, the application is connected to the local instance of the Grid monitor. In the case of P-GRADE we have investigated and realised both approaches. The first approach was realised by the Grid adaptation of the GRM monitor within the EU DataGrid project, while the second concept was realised by the Mercury monitor in the EU GridLab project.

4.1 Using GRM in the Grid

In the EU DataGrid project we created the Grid version of the P-GRADE monitor, called GRM, that was first developed for clusters. GRM's original structure could be easily adapted to the Grid to collect trace data when the application is running on a remote site. In this case, GRM's components are placed as can be seen in Fig. 9. GRM consists of Local Monitors on each host where applications are running and a main process at the host where the user is working and analysing trace information. A Local Monitor creates a shared-memory buffer to store trace data. Application processes (the instrumentation functions within them) put trace records directly into this buffer without any further notifications to the Local Monitor. This

way, the intrusion of monitoring is as low as possible for a source-code-based instrumentation. The main process of GRM collects the content of the trace buffers.

Fig. 9. Structure of GRM on the Grid (the GRM main monitor and PROVE run on the user's machine; on each host of the Grid resource a GRM Local Monitor with a shared-memory buffer serves the local application processes)

The above structure works for application monitoring, but only if it can be set up. In the local cluster or supercomputer environment the user can start GRM easily: the execution machine is known by the user and it is accessible, i.e. the user can start GRM manually on it. When GRM is set up, the application can be started and it will be monitored by GRM. In the case of the Grid, however, we do not know in advance which resource will execute our application. Even if we knew it, we would not be able to start GRM manually on that resource. This is one of the main problems of all traditional tools used for applications on local resources when moving towards the Grid. In the EU DataGrid project we chose the fork solution to minimise the re-implementation effort and keep GRM's original code to the maximum extent possible. The code of the local monitor process has been compiled into a library, which can be linked into the application at the same time as the instrumentation library. When the application process is started and calls the first instrumentation function, the instrumentation library forks a new process that becomes the local monitor. If such a monitor already exists, because more than one process has been launched on the same multi-processor machine, the application process connects to the already running instance. The local monitor connects to GRM's main monitor, creates the shared-memory buffer and initialises the connection between itself and the process. After that, trace generation and GRM work in the same way as in the original cluster version. However, this solution works only if there are no firewalls between the Grid sites where the local monitor and main monitor processes run. (For example, the Hungarian ClusterGrid realises such a Grid system by utilising VPN technology.) As this is not the usual case, this version of GRM has limited use. There are two other limitations of the forking solution. First, some local job managers do not tolerate jobs that fork new processes; Condor, for example, is then not able to do checkpointing and clean-up when needed. Second, the forked process duplicates the statically allocated memory areas of the original process. If the job is a large code and its static allocations are comparable to the available memory of the execution machine, this can cause memory allocation problems.
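The fork-based set-up can be sketched as follows. The fragment is only an illustration of the mechanism, not the GRM source; all names are hypothetical, error handling is omitted, and connecting to an already running monitor instance is only indicated in a comment.

    /* Illustrative sketch of the fork-at-first-instrumentation-call idea:
     * on the first traced event the library creates a shared-memory trace
     * buffer and forks a child that becomes the local monitor. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    #define TRACE_BUF_SIZE (1 << 20)

    static char *trace_buf = NULL;        /* shared-memory trace buffer */

    static void local_monitor_loop(char *buf)
    {
        /* In a real monitor: connect to the main monitor and periodically
         * forward the contents of the buffer.  Here we only illustrate. */
        for (;;)
            sleep(1);
    }

    void instr_event(const char *event)   /* hypothetical instrumentation call */
    {
        if (trace_buf == NULL) {          /* first call: set up monitoring */
            /* A real implementation would first check whether a local
             * monitor already runs on this host and connect to it instead. */
            int shmid = shmget(IPC_PRIVATE, TRACE_BUF_SIZE, IPC_CREAT | 0600);
            trace_buf = shmat(shmid, NULL, 0);

            if (fork() == 0) {            /* child becomes the local monitor */
                local_monitor_loop(trace_buf);
                _exit(0);
            }
        }
        /* Write the trace record directly into the shared buffer (a real
         * implementation would manage offsets, locking, record framing). */
        snprintf(trace_buf, TRACE_BUF_SIZE, "%s", event);
    }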

In order to solve the problems mentioned above, we have developed a generic Grid resource and job monitoring system called Mercury. The following section describes Mercury in detail.

4.2 The Mercury Grid monitor

The Mercury Monitor system [6] is being developed in the GridLab project. It is a general Grid resource and job monitoring system. Its architecture is based on the Grid Monitoring Architecture (GMA) [45] proposed by the Global Grid Forum (GGF). The input of the monitoring system consists of measurements generated by sensors. Sensors are controlled by producers that can transfer measurements to consumers when requested. The GMA proposal defines two methods of information retrieval: query and subscribe. Our monitoring system supports further producer-consumer interaction functions. Consumers can request buffering of measurements in the producer, and multiple logical connections between a producer and a consumer (called channels) are also available. Channels can be either consumer-initiated or producer-initiated, but producer-initiated channels must have been explicitly requested previously by a consumer. Consumer-initiated channels are used mainly for interactive monitoring or execution control, while producer-initiated channels can be used for event reporting and data archiving in a storage service. The two possible channel directions can also be used for getting through firewalls that block communication in one direction.

The monitoring system components (MS, MM, LM) are drawn with solid lines in Fig. 10. The figure depicts a Grid resource consisting of N nodes. A Local Monitor (LM) service runs on each node and collects information from the processes running on the node as well as from the node itself. Sensors (S) are implemented as shared libraries that are loaded into the LM at run time according to the starting configuration and the incoming requests for different measurements. The collected information is sent to the consumers. The Main Monitor (MM) service is also a consumer, acting as a central access point for local users (i.e. site administrators and non-Grid users). Grid users can access information via the Monitoring Service (MS), which is also a client of the MM. In large Grid resources there may be more than one MM to balance network load. The modularity of the monitoring system also allows that, on Grid resources where an MM is not needed (e.g. on a supercomputer), it can be omitted and the MS can talk directly to the LMs.

Fig. 10. Structure of the Mercury monitor (MS, MM and the LMs with their sensors S are drawn with solid lines; the Grid user's resource broker, the jobmanager, the LRMS and the application processes are drawn with dashed lines)

The elements drawn with dashed lines are not part of the monitoring system. The resource broker is the Grid service that accepts jobs from Grid users, selects a resource for execution and submits the job to the selected resource. The jobmanager is the public service that accepts Grid jobs for the Grid resource (e.g. GRAM in the Globus toolkit [53]). The LRMS is the local resource management system which handles job queues and executes jobs on the Grid resource (e.g. Condor [31], SGE [51], etc.).

As the Grid itself consists of several layers of services, the monitoring system also follows this layout. The LM-MM-MS triplet demonstrates how a multi-level monitoring system is built. The compound producer-consumer entities described in the GMA proposal are exploited here: each level acts as a consumer for the lower level and as a producer for the higher level. This setup has several advantages. Complex compound metrics are easily introduced by sensors at intermediate levels, which get raw data from lower levels and provide processed data for higher levels. Transparent access to pre-processed data or to data from multiple sources can be supported in the same way. In cases when there is no direct network connection (e.g. because of a firewall) between the consumer and a target host, an intermediary producer can be installed on a host that has connectivity to both sides. This producer can act as a proxy between the consumer and the target host. With proper authentication and authorisation policies at this proxy, this setup is more secure and more manageable than opening the firewall. The possibilities described above are exploited in the MS. The MS acts as a proxy between the Grid resource and the Grid users. This is the component where Grid security rules are enforced and mapped to local rules. As the Grid is a heterogeneous environment, it is very likely that different Grid resources (or even different nodes of the same resource) provide different kinds of information. In order to make it possible for data from different sources to be interpreted and analysed in a uniform way, the raw data must be converted to a well-defined form. Such a conversion is also done in the MS, which takes resource-specific metrics from the MM and converts them to Grid metrics that are independent of the physical characteristics of the resource.

4.3 Integration of GRM and the Mercury monitor

When connecting GRM and the Mercury monitor, the trace delivery mechanism of the original GRM is essentially replaced with that of Mercury. The local monitors of GRM are no longer used. The instrumentation library has been rewritten to publish trace data using the Mercury monitor API, sending trace events directly to the Local Monitor of the Mercury monitor. The LMs of the Mercury monitor run on every machine of the Grid resource, so application processes connect to an LM locally. Application monitoring data is treated as just another type of monitoring data, whatever its content may be. To support application monitoring, therefore, a dedicated sensor, the application sensor, is created and loaded into the LM as a shared library. This sensor accepts incoming data from the processes on the machine, using the Mercury monitor API. The Mercury monitor uses metrics to distinguish different types of measurable quantities; thus, GRM uses one predefined metric (application.message) to publish its data.
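The publishing side of this integration can be pictured with the following sketch. The function and type names are hypothetical stand-ins, not the real Mercury monitor API; the sketch only illustrates the idea that the instrumentation library attaches once to the local LM and pushes each trace record as a string value of the application.message metric.

    /* Hypothetical illustration only: these functions are NOT the real Mercury
     * monitor API, merely stand-ins that show the publishing idea. */
    #include <stdio.h>

    typedef struct { int connected; } lm_connection_t;

    /* Stand-in for "connect to the Local Monitor on this host". */
    static lm_connection_t *lm_connect_local(void)
    {
        static lm_connection_t conn = { 1 };
        return &conn;
    }

    /* Stand-in for "publish one metric value to the application sensor". */
    static int lm_publish(lm_connection_t *c, const char *metric, const char *value)
    {
        (void)c;
        return printf("publish %s: %s\n", metric, value);
    }

    static lm_connection_t *lm = NULL;

    /* Called by the instrumented GRAPNEL code around each traced event. */
    void grm_trace_event(const char *proc_name, const char *event, double t)
    {
        char record[256];

        if (lm == NULL)
            lm = lm_connect_local();        /* attach to the local LM once */

        /* The record is an opaque string for Mercury; it is forwarded unchanged. */
        snprintf(record, sizeof(record), "%.3f %s %s", t, proc_name, event);
        lm_publish(lm, "application.message", record);
    }

    int main(void)
    {
        grm_trace_event("child1", "MSG_SEND to child2", 12.345);
        return 0;
    }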
The application.message metric carries the message from the application process as a string, and it is not processed within the components of the Mercury monitor. The main process of GRM behaves as a consumer of the Mercury monitor: it contacts an MM and requests measurement of the application.message metric. If there are producers of this metric (the application processes), the Mercury monitor creates channels from them to the main process of GRM and data streaming starts. As

there may be several applications generating trace data at the same time, an application ID is used at subscription time to identify the application from which trace information should be transferred for the given request.

The use of the GRM tool as an application monitor has not changed compared to the original cluster usage. First, the application should be instrumented with GRM calls. Then the job should be started which, in the case of the Grid, means submitting the job to the resource broker (see the boxes drawn with dashed lines in Fig. 10) until it is started by the LRMS on a Grid resource. GRM can be started by the user on the client machine; while the application is being handled by the resource broker, GRM can already connect to Mercury and subscribe for trace information about the application. When the application is started and generates trace data, Mercury forwards the trace to GRM based on the subscription information.

4.4 Using the Mercury/GRM/PROVE toolset for monitoring and visualising job migration

All the features mentioned above were demonstrated in Klagenfurt at the EuroPar'2003 conference. Three clusters were connected (two from Budapest and one from London) to provide a Grid based on friendly Condor pools, like the Hungarian ClusterGrid. A parallel urban traffic simulation application [22] developed at the University of Westminster was launched from Klagenfurt on the SZTAKI cluster. Then the cluster was artificially overloaded and Condor, recognising the situation, vacated the nodes of the cluster. The GRAPNEL Server of P-GRADE controlled the checkpointing of the application and then asked Condor to allocate new resources for the application processes. Condor found the Westminster cluster and P-GRADE migrated all the processes except for the GRAPNEL Server to the Westminster cluster. After resuming the application at Westminster, we artificially overloaded the Westminster cluster and, as a result, the application was migrated to the last underloaded cluster of the Grid at the Technical University of Budapest. All these actions were monitored using the Grid application monitoring infrastructure described above and visualised as shown in the PROVE snapshot windows of Fig. 11. The horizontal axis represents time. The application had been running for 8 minutes and 48 seconds when the snapshot was taken, as shown in the upper part of the window. Each process (parent, child1, ..., child16) is represented by a horizontal line (only 5 of them are shown in the figure, using the filter technique of PROVE) and arrows among them represent process communication. The three dense parts, in the first minute, between the 3rd and 4th minutes, and after the 7th minute, represent the active working periods on the three clusters. The two intervals between them show the two migration periods. Here some hanging messages can be seen that were handled in a special way during checkpointing, as described in Section 3.

Regarding the performance of checkpointing, the overall time spent on migration includes checkpoint writing and reading, the allocation of new resources and some coordination overhead. The time spent on writing or reading the checkpoint information through a TCP/IP connection depends on the size of the process to be checkpointed and on the bandwidth of the connection between the nodes (or Grid sites) where the processes (including the checkpoint library and the checkpoint server)

are running. The overall time of a complete process migration also includes the response time of the resource scheduling system: for example, the time while Condor vacates a machine, the matchmaking mechanism finds a new resource, allocates it, initialises pvmd and notifies the application. Finally, the cost of message synchronisation and of coordinating the application processes is negligible, less than one percent of the overall migration time.

Fig. 11. Screenshot of the migrated application in PROVE

5 P-GRADE Grid execution environment: workflow execution and monitoring

Two different scenarios can be distinguished according to the underlying Grid infrastructure:
- Condor-G/Globus based Grid
- pure Condor based Grid
In this section we describe the more complex Condor-G/Globus scenario in detail, but the major differences concerning the pure Condor support are also pointed out. The execution of the designed workflow is a generalisation of the Condor job mode of P-GRADE [9]; to execute the workflow in the Grid we utilise the Condor-G and DAGMan tools [1][2] to schedule and control the execution of the workflow on Globus resources, by generating a Condor submit file for each node of the workflow graph and a DAGMan input file that contains the following information:
1. the list of jobs of the workflow (associating the jobs with their submit files);
2. the execution order of the jobs, in textual form, as relations;
3. the number of re-executions allowed after each job's abort;
4. the tasks to be executed before starting a job and after finishing it (implemented in PRE and POST scripts).
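As an illustration, the fragment below sketches what such a generated DAGMan input file could look like for a two-job workflow. The job, submit-file and script names are made up for the example; the directives themselves (JOB, PARENT/CHILD, RETRY, SCRIPT PRE/POST) are standard Condor DAGMan keywords covering items 1-4 above.

    # Hypothetical two-job workflow: the output of job1 is an input of job2.
    # Each JOB line associates a workflow node with its generated submit file.
    JOB job1 job1.submit
    JOB job2 job2.submit
    # Execution order of the jobs, expressed as a relation.
    PARENT job1 CHILD job2
    # Number of automatic re-executions after a job's abort.
    RETRY job1 3
    RETRY job2 3
    # Tasks executed before starting / after finishing a job
    # (e.g. staging the input and output files with GridFTP).
    SCRIPT PRE job2 pre_job2.sh
    SCRIPT POST job1 post_job1.sh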

The PRE and POST scripts are generated automatically from the abstract workflow designed by the user, and they realise the necessary input and output file transfer operations between the jobs. In the current implementation GridFTP commands [19] are applied to deliver the input and output files between Grid sites in a secure way (in the pure Condor scenario this can be done by simple file operations). These scripts are also responsible for detecting successful file transfers, since a job can be started only if all its input files are already available. In order to improve efficiency, the data files are transferred in parallel if the same output file serves as an input file of more than one job.

The actual execution of the workflow can be started automatically from P-GRADE. If P-GRADE is running on a submit machine of a Condor pool, the command is executed locally and immediately interpreted by Condor. If P-GRADE is running on a machine that is not in a Condor pool, the following extra operations are supported by P-GRADE:
1. A remote Condor pool is selected by the user via the Mapping Window of P-GRADE.
2. All the necessary files (executables, input files, DAGMan input file, Condor submit files) are transferred to a submit machine of the selected Condor pool.
3. The "condor_submit_dag" command is called in the selected Condor pool.
4. After the workflow execution has finished, the necessary files are transferred back to the P-GRADE client machine.
5. The on-line visualisation with PROVE can be performed locally.
Actions 2-4 are performed automatically by P-GRADE. During the execution, the job status (like submitted, idle, running, finished) of each component job is reflected by a different colour of the corresponding node in the graph, i.e. the progress of the whole workflow is animated within the workflow editor.

Before the execution of each job, a new instance of GRM [3] is launched that attaches to Mercury's main monitor [4] located at the Grid site where the current job will be executed, and GRM subscribes for the traces of the particular job. In order to visualise the trace information collected on the jobs by the GRM/Mercury monitoring infrastructure, the PROVE performance visualisation tool [4] is used. Fig. 12 shows that both the progress of the whole workflow and the progress of individual jobs can be visualised by PROVE. Fig. 12 depicts the space-time diagram of our workflow-based meteorological application (see Fig. 2) and of one of its parallel component jobs, cummu. In the workflow space-time diagram, horizontal bars represent the progress of each component job in time (see the time axis at the bottom of the diagram), and the arrows among the bars represent the file operations that make the output file of one job accessible as an input of another. The interpretation of the same diagram elements is slightly different in the case of (parallel) component jobs (like job cummu in Fig. 12): here the horizontal bars represent the progress of each process comprising the parallel job, whereas the arrows between the bars represent (PVM or MPI) message transfers among the processes.

Notice that currently we use Condor DAGMan as the basis of our workflow engine. However, in the near future we are going to create a generic Grid Application Manager that takes care of possible optimisations concerning the selection of computing sites and file resources in the Grid, controls the migration of jobs of the workflow among different Grid resources, handles the user's control requests during execution, etc. The Grid Application Manager will allow a workflow application to run across several different Grids, provided that those Grids have a gateway mechanism to communicate. For example, Globus Grids support the execution of MPI programs (they do not support PVM program execution), while Condor provides a more advanced parallel execution mechanism for PVM programs than for MPI programs. Under such circumstances the Grid Application Manager can allocate the MPI jobs of the workflow to Globus-based Grids and the PVM jobs to Condor-based Grids. Since GRAPNEL jobs can be executed either as PVM or as MPI jobs, the Grid Application Manager can decide whether to execute them as PVM or MPI jobs according to the status information of the Globus- and Condor-based Grids. For example, the Hungarian ClusterGrid is a Condor-based Grid, while the Hungarian SuperGrid is a Globus-based Grid. The Grid Application Manager can dynamically allocate jobs of the workflow between the two Grids, provided that the user has access to both Grids.

Fig. 12. Space-time diagram of the whole workflow and of one of its component jobs

6 Related work

There is a tremendous Grid community effort to define the core Grid infrastructure. However, so far not much attention has been paid to how to develop Grid applications and how to supervise running applications from the point of view of application monitoring, trouble-shooting and performance monitoring. The basic user interface for Grid activities is the Grid portal, which helps the user obtain information about available Grid resources and their basic (static and/or dynamic) properties [20], [24]. It also helps the user launch jobs in the Grid and query their status. Nevertheless, typical Grid portals give the user little support for constructing Grid applications out of existing or newly written Grid services. An important Grid project that works on providing a Grid Application Toolkit (GAT), by which Grid-enabled applications can be constructed, is the EU GridLab project. GAT will play a similar role in the case of the Grid as PVM and MPI played in the case


More information

IOS: A Middleware for Decentralized Distributed Computing

IOS: A Middleware for Decentralized Distributed Computing IOS: A Middleware for Decentralized Distributed Computing Boleslaw Szymanski Kaoutar El Maghraoui, Carlos Varela Department of Computer Science Rensselaer Polytechnic Institute http://www.cs.rpi.edu/wwc

More information

Micro Focus Desktop Containers

Micro Focus Desktop Containers White Paper Security Micro Focus Desktop Containers Whether it s extending the life of your legacy applications, making applications more accessible, or simplifying your application deployment and management,

More information

The University of Oxford campus grid, expansion and integrating new partners. Dr. David Wallom Technical Manager

The University of Oxford campus grid, expansion and integrating new partners. Dr. David Wallom Technical Manager The University of Oxford campus grid, expansion and integrating new partners Dr. David Wallom Technical Manager Outline Overview of OxGrid Self designed components Users Resources, adding new local or

More information

<Insert Picture Here> Enterprise Data Management using Grid Technology

<Insert Picture Here> Enterprise Data Management using Grid Technology Enterprise Data using Grid Technology Kriangsak Tiawsirisup Sales Consulting Manager Oracle Corporation (Thailand) 3 Related Data Centre Trends. Service Oriented Architecture Flexibility

More information

DIRAC pilot framework and the DIRAC Workload Management System

DIRAC pilot framework and the DIRAC Workload Management System Journal of Physics: Conference Series DIRAC pilot framework and the DIRAC Workload Management System To cite this article: Adrian Casajus et al 2010 J. Phys.: Conf. Ser. 219 062049 View the article online

More information