Programming Grid Applications with GRID Superscalar


1 Journal of Grid Computing 1: , Kluwer Academic Publishers. Printed in the Netherlands. 151 Programming Grid Applications with GRID Superscalar Rosa M. Badia, Jesús Labarta, Raül Sirvent, Josep M. Pérez, José M. Cela and Rogeli Grima CEPBA-IBM Research Institute, UPC, Spain Key words: Grid middleware, Grid programming models Abstract The aim of GRID superscalar is to reduce the development complexity of Grid applications to the minimum, in such a way that writing an application for a computational Grid may be as easy as writing a sequential application. Our assumption is that Grid applications would be in a lot of cases composed of tasks, most of them repetitive. The granularity of these tasks will be of the level of simulations or programs, and the data objects will be files. GRID superscalar allows application developers to write their application in a sequential fashion. The requirements to run that sequential application in a computational Grid are the specification of the interface of the tasks that should be run in the Grid, and, at some points, calls to the GRID superscalar interface functions and link with the run-time library. GRID superscalar provides an underlying run-time that is able to detect the inherent parallelism of the sequential application and performs concurrent task submission. In addition to a data-dependence analysis based on those input/output task parameters which are files, techniques such as file renaming and file locality are applied to increase the application performance. This paper presents the current GRID superscalar prototype based on Globus Toolkit 2.x, together with examples and performance evaluation of some benchmarks. 1. Introduction Grid computing is becoming a very important research and development area in this decade. However, one of the important concerns of the Grid community is whether or not a killer application will appear. This concern comes partially because of the difficulty of writing applications for a computational Grid. Although skilled programmers may be willing and able to write applications with complex programming models, scientists usually expect to find easy programming methodologies that allow them to develop their applications with both flexibility and ease of use. Furthermore, different scientific communities (high-energy physics, gravitational-wave physics, geophysics, astronomy, bioinformatics and others) deal with applications with large data sets whose input consists of non monolithic codes composed of standalone application components which can be combined This work has been funded by the Ministry of Science and Technology of Spain under CICYT TIC CO2-01. in different ways. Examples of this kind of applications can be found in the field of astronomy where thousands of tasks need to be executed during the identification of galaxy clusters [10]. These kinds of applications can be described as workflows. Different tools for the development of workflow based applications for GRID environments have been recently presented in the literature [10, 31]. However, in all of them the user has to specify the task dependence graph in a non-imperative language. The goal of this paper is to present GRID superscalar, a programming paradigm that eases the development of Grid applications to the point that writing such an application can be as simple as programming a sequential program to be run on a single processor and the hardware resources remain totally transparent to the programmer. 
GRID superscalar takes advantage of the way in which superscalar processors execute assembler codes [28]. Even though the assembler codes are sequential, the implicit parallelism of the code is exploited in order to take advantage of the functional

2 152 units of the processor. The processor explores the concurrency of the instructions and assigns them to the functional units. Indeed, the execution order defined in the sequential assembler code may not be followed. The processor will establish the mechanisms to guarantee that the results of the program remain identical. While the result of the application is the same, or even better, if the performance achieved by the application is better than the one that would have been initially obtained, the programmers are freed of any concern with the matter. Another mechanism exploited by processors is the forwarding of data generated by one instruction that is needed by next ones. This reduces the number of stall cycles. All these ideas are exportable to the Grid application level. What changes is the level of granularity: in the processors we have instructions lasting in the order of nanoseconds and, in computational Grids, functions or programs that may last from some seconds to hours. Also, what it changes is the objects: in assembler the objects are registers or memory positions, while in GRID superscalar we will deal with files, similar to scripting languages. In GRID superscalar, applications are described in imperative language (currently C/C++ orperl), and theinherentparallelism of the tasks specified in the sequential code is exploited by the run-time, which is totally transparent to the application programmer. This paper presents these ideas and a prototype that has been developed over Globus Toolkit 2.x [32] GRID Superscalar Behavior and Structure GRID superscalar is a new programming paradigm for GRID enabling applications, composed of an interface and a run-time. With GRID superscalar a sequential application composed of tasks of a certain granularity is automatically converted into a parallel application where the tasks are executed in different servers of a computational GRID. The behavior of the application when run with GRID supescalar is the following: for each task candidate to be run in the GRID, the GRID superscalar runtime inserts a node in a task graph. Then, the GRID superscalar run-time system seeks for data dependences between the different tasks of the graph. These data dependences are defined by the input/output of the tasks which are files. If a task does not have any dependence with previous tasks which have not been finished or which are still running (i.e., the task is not waiting for any data that has not been already generated), it can be submitted for execution to the GRID. If that occurs, the GRID superscalar run-time requests a GRID server to the broker and if a server is provided, it submits the task. Those tasks that do not have any data dependence between them can be run on parallel on the grid. This process is automatically controlled by the GRID superscalar run-time, without any additional effort for the user. The GRID superscalar is notified when a task finishes. Next, the data structures are updated and any task than now have its data dependences resolved, can be submitted for execution. Figure 1 shows an overview of the behavior that we have described above. The reason for only considering data dependences defined by parameter files is that we assume that the tasks of the applications which will take advantage of GRID superscalar will be simulations, finite element solvers, biology applications... In all such cases, the main parameters of these tasks are passed through files. 
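To make the behavior just described more concrete, the following C-like sketch shows the decision taken every time a task is instantiated. It is only an illustration of the mechanism: all the type and function names (add_task_node, find_file_dependences, request_server and so on) are hypothetical and do not correspond to the actual run-time internals.

    /* Hypothetical names; shown only to illustrate the behavior described above. */
    typedef struct task   task_t;
    typedef struct server server_t;

    extern void      add_task_node(task_t *t);
    extern void      find_file_dependences(task_t *t);
    extern int       has_pending_dependences(task_t *t);
    extern server_t *request_server(task_t *t);
    extern void      submit(task_t *t, server_t *s);
    extern void      enqueue_waiting(task_t *t);

    void on_task_instantiated(task_t *t)
    {
        add_task_node(t);                    /* insert a node in the task graph        */
        find_file_dependences(t);            /* edges derived from the file parameters */

        if (!has_pending_dependences(t)) {
            server_t *s = request_server(t); /* ask the broker for a Grid server       */
            if (s) { submit(t, s); return; } /* stage the input files and launch       */
        }
        enqueue_waiting(t);                  /* run later, when dependences and resources allow */
    }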
In any case, we do not rule out that future versions of GRID superscalar will take all data dependences into account. GRID superscalar applications are composed of a client binary, run on the client host, and one server binary for each server host available in the computational Grid. However, this structure is hidden from the application programmer. The structure of the paper is the following: Section 2 presents the user view or interface; Section 3 presents the run-time, a library that automatically gridifies the application while it is being run; Section 4 presents some results obtained with the current prototype, together with a performance analysis of those results; Section 5 presents previous and related work; and finally Section 6 presents ideas for future work and some conclusions.

2. User Interface

To develop an application in the GRID superscalar paradigm, a programmer must go through the following three stages: 1. Task definition: identify those subroutines/programs in the application that are going to be executed in the computational Grid.

3 153 Figure 1. Overview of GRID superscalar behavior. Figure 2. Task and parameters definition example. 2. Task parameters definition: identify which parameters are input/output files and which are input/output generic scalars. These two first steps are equivalent at the architecture processor level to defining the instruction set. 3. Write the sequential program (main program and task code). In the current prototype, stages 1 and 2 (task definition and task parameters definition) are performed by writing an interface definition file (idl file). This interface definition file is based in the CORBA IDL language [23]. CORBA IDL allows for an elegant and easy way to write and understand syntax. We selected that language simply because it was the one that best fitted our needs, although GRID superscalar does not have any relation to CORBA. Figure 2 shows an example of a task and parameters definition in GRID superscalar. Each task that the user identifies as a candidate to be run in the GRID appears in the IDL file. The list of parameters of the tasks are also specified, indicating its type and if it is an input (in), output (out) or input/output (inout) parameter. Files are a special type of parameters, since they define the tasks data dependences. For that reason, a special type File has been defined. The main program that the user writes for a GRID superscalar application is basically identical to the one that would be written for a sequential version of the application. The differences would be that at some points of the code, some primitives of the GRID superscalar are called. For example, GS_On() and GS_Off() are respectively called at the beginning and at the end of the application. GS_On performs some initializations and GS_Off some final treatments. More implementation details will be given in Section 3. Another change would be necessary on those parts of the main program where files are read or written. Since the files are the objects that define the data dependences, the run-time needs to be aware of any operation performed on a file. The current version offers four primitives for file handling: GS_Open, GS_Close, GS_FOpen and

4 154 Figure 3. Code of the search example written with the GRID superscalar paradigm. GS_FClose, which at user level implement the same behavior as the functions open, close, fopen and fclose. In addition, the GS_Barrier function has been defined to allow the programmers to explicitly control the tasks flow. This function waits till all GRID tasks finish. The current set of specific GRID superscalar primitives is relatively small, and we do not discard the possibility that more primitives could be included in future versions. However, what is more probable is that these functions will be hidden to the programmer by writing wrappers functions. Regarding the file functions, only functions that open, close, copy or rename files should be wrapped, but functions that read or write them could remain as they are. An example of an application written with GRID superscalar paradigm is shown in Figure 3. In this example, a set of N parametric simulations performed using a given simulator (in this case, we used the performance prediction simulator Dimemas [11]) are launched, varying some parameters of the simulation. Later, the range of the parameters is modified according to the simulation results in order to move towards a goal. The application runs until a given goal is reached. Each called function performs the following operations: filter: substitutes two parameters (latency and bandwidth) of the configuration file bh.cfg, generating a new file bh_tmp.cfg dimemas_funct: calls the Dimemas simulator with the bh_tmp.cfg configuration file. trace.trf is the input tracefile and the results of the simulation are stored in the file dimemas_out.txt extract: gets the result of the simulation from the dim_out.txt file and stores it in final_result.txt file generate_new_range: from the data in the final_result.txt file generates the new range for parameters L and BW. The interface definition file for this example is the one shown in Figure 2. Function generate_new_range does not appear in this file, because it would be run locally on the client. For that reason, when opening the file final_result.txt the GRID superscalar specific file functions are used. Figure 4 partially shows the code of this function. Additionally, the user provides the code of the functions that have been selected to be run on the Grid (filter, dimemas_funct, extract). The code of those functions does not differ from the code of the functions for a sequential application. The only current requirement is that they should be provided in a separate file from the main program. As an example, Figure 5 shows the code for the dimemas_funct function. Variable gs_result is a global variable in GRID superscalar, which allows the system to control execution problems of the tasks. In case the system call to DIMEMAS fails for any reason, the GRID superscalar runtime will detect it through the gs_result variable and reacts accordingly. Another change in the user code is that the primitive GS_System should be used in the user functions instead of the system call. The current prototype also allows for the specification of applications in Perl language. For this case, the interface definition file has the same syntax as for the C/C++ case. Similarly, the user can write a main program and its functions in Perl.
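Since the figures themselves are not reproduced here, the following sketch gives a flavor of how the pieces of this example fit together for the C/C++ binding: an IDL file in the style of Figure 2 and a main program in the style of Figure 3. The task names (filter, dimemas_funct, extract, generate_new_range), the file names and the primitives GS_On, GS_Off, GS_FOpen and GS_FClose come from the text; the header names, parameter lists, constants and the convergence test are assumptions made for illustration.

    /* Sketch of an IDL file in the style of Figure 2 (parameter lists assumed):
     *
     *   interface OPT {
     *     void filter(in File cfg, in double L, in double BW, out File newcfg);
     *     void dimemas_funct(in File cfg, in File trace, out File result);
     *     void extract(in File cfg, in File result, out File final);
     *   };
     */
    #include <stdio.h>
    #include "GS_master.h"     /* hypothetical GRID superscalar client header     */
    #include "app-stubs.h"     /* hypothetical header for the stubgen-built stubs */

    #define ITERS 12           /* number of parametric simulations per iteration  */

    int main(void)
    {
        double L[ITERS], BW[ITERS];
        int converged = 0;

        /* ... initialize L[] and BW[] with the initial parameter ranges ... */

        GS_On();                                 /* initialize the run-time */
        while (!converged) {
            for (int i = 0; i < ITERS; i++) {
                /* each call becomes a remote task; the run-time uses the file
                   parameters to build the task dependence graph */
                filter("bh.cfg", L[i], BW[i], "bh_tmp.cfg");
                dimemas_funct("bh_tmp.cfg", "trace.trf", "dim_out.txt");
                extract("bh_tmp.cfg", "dim_out.txt", "final_result.txt");
            }
            /* generate_new_range runs locally, so the result file is accessed
               through the GRID superscalar file primitives */
            FILE *f = GS_FOpen("final_result.txt", "r");
            converged = generate_new_range(f, L, BW);
            GS_FClose(f);
        }
        GS_Off();                                /* wait for pending tasks and clean up */
        return 0;
    }

Note that reusing the same file names (bh_tmp.cfg, dim_out.txt) in every iteration is deliberate: as described in Section 3, the run-time removes the resulting WaR and WaW dependences by renaming, so successive iterations can still overlap.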

5 155 Figure 4. Part of the code of the generate_new_range function. Figure 5. Code of the dimemas_funct function Automatic Code Generation From the interface definition some code is automatically generated by stubgen, a tool provided with the GRID superscalar distribution. This automatically generated code is mainly two files: the function stubs and the skeleton for the code that will be run on the servers. Figure 6 shows a part of the stubs file that will be generated for the idl file of Figure 2 when the C/C++ interface is used. For each function in the idl file, a wrapper function is defined. In the wrapper function, the parameters of the function are encoded using base64 format [14]. Then, the Execute function is called. The Execute function is the main primitive of the GRID superscalar interface. In the next section, the behavior of this primitive will be described. The other file automatically generated by stubgen is shown in Figure 7. This is the main program of the code executed in the servers. This code will be called from GRID superscalar by means of the Globus toolkit middleware. Details of all this process will be described in the next section. Inside this program, calls to the original user functions are performed. Before calling the user functions, the parameters are decoded. Figure 8 shows how files are linked to obtain the final application binaries. One executable will exist in the client host and one in each server host. In the client, the original main program (app.c) is linked with the generated stubs (app-stubs.c). In the servers, the skeleton (app-worker.c) and the file with the code of the original user functions (app-functions.c) are linked to obtain each server s binary. Currently, the deployment process is statically performed by hand, although this process will be automated in subsequent releases. In the case of Perl programs, the process is slightly different (see Figure 9 for a summary). Functioning as initial files that are written by the application programmer, we have the main program in Perl (app.pl), the applications functions code also in Perl (appfunctions.pm) and the interface definition file app.idl). Again, stubgen (called with specific flags for Perl language) is used to generate the required files that enables the execution of the application in a computational GRID. Three files are generated for the Perl binding: appstubs.c, app.i and app-worker.pl. File app-stubs.c is exactly the same file as the generated for the C/C++

case.

Figure 6. Example of stubs generated for the user functions.
Figure 7. Example of skeleton generated for the server code.

File app-worker.pl is the main program of the code executed in the servers (similar to app-worker.c for the C/C++ case, but in Perl). The file app.i is an interface file that will be used by SWIG [4]. Basically, it is a translation of the application functions interface from the idl syntax to the interface syntax required by SWIG. SWIG is a software development tool that connects programs written in C and C++ with a variety of high-level programming languages, primarily common scripting languages such as Perl, Python, Tcl/Tk and Ruby.
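Since app-stubs.c is shared by both bindings, it is worth sketching what such a stub looks like. The structure below only illustrates what the text describes (encode the non-file parameters in base64, then call Execute); the real Execute signature and the encoding helpers are not given in the paper, so the ones used here are assumptions.

    /* Hypothetical sketch of the stub that stubgen could generate for filter(). */
    #include "gs_base64.h"     /* assumed helper for base64 encoding */

    extern void Execute(const char *task_name,
                        int n_in_files,  const char **in_files,
                        int n_out_files, const char **out_files,
                        int n_scalars,   const char **scalars);   /* assumed signature */

    void filter(const char *cfg, double L, double BW, const char *newcfg)
    {
        char L_enc[64], BW_enc[64];
        const char *in[]  = { cfg };
        const char *out[] = { newcfg };
        const char *sc[]  = { L_enc, BW_enc };

        /* non-file parameters are shipped to the server encoded in base64;
           file parameters are passed by name so that the run-time can track
           data dependences and stage the files */
        gs_encode_double_base64(L,  L_enc,  sizeof(L_enc));
        gs_encode_double_base64(BW, BW_enc, sizeof(BW_enc));

        Execute("filter", 1, in, 1, out, 2, sc);
    }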

Figure 8. GRID superscalar application files organization.
Figure 9. Automatic code generation for Perl applications.

From file app.i SWIG generates two files: app-wrapper.c and app.pm. File app-wrapper.c is linked with file app-stubs.c and with the GRID superscalar library (GRIDsuperscalar.so), and a dynamic library (app.so) is generated. File app.pm tells the Perl interpreter to dynamically load the library (app.so) when the application functions specified by the idl file are called from the client program. Finally, the files that compose the client part of the application are app.pl, app.so and app.pm, and the files that compose the server part are app-worker.pl and app-functions.pm.

3. GRID Superscalar Run-Time

The core functionality of the GRID superscalar run-time is the Execute function. When the client program calls one of the user functions, the Execute function is called instead, since the main program has been linked with the application stubs; an instance of that user function will then be executed on a server at some point. We will call each instance of the user functions a task. The run-time uses the Globus Toolkit 2.x APIs as underlying middleware. As usual, some of the imple-

8 158 mentation decisions are bound to the used underlying middleware. For example, a previous prototype version of GRID superscalar was implemented over the MW [20], a C++ API defined over Condor-PVM [8]. Some restriction on the task scheduling and brokering make us consider using other possible basic middlewares. However, the core of the GRID superscalar is independent of the Grid middleware, and therefore future versions of GRID superscalar could be based on other software. The rest of this section give details of the GRID superscalar run-time features Data Dependence Analysis The data dependence analysis [19] is performed inside the Execute function and a task dependence graph is automatically and dynamically built. A task dependence graph is such that the vertices of the graph denote tasks and the edges data dependences between pairs of tasks. Currently, the data dependence analysis is only performed for the parameters that are input/output files, but not for the rest of parameters. The rationale for that decision is that the kind of applications which from our point of view can benefit from a model such as GRID superscalar are those that call functions of the type of simulations, FET solvers, bioinformatic applications such as BLAST, etc. All those have in common that the main parameters are data files. However, we do not discard extending the data dependence analysis to all function parameters in the future. Those tasks from the data dependence graph that do not depend on each other (no path between them in the task dependence graph exists) can be concurrently executed. The data dependencies can be classified in three types: Read after Write (RaW): exists when a task reads a parameter that is written by a previous one. For example: filter( bh.cfg, L, BW, bh_tmp.cfg ); dimemas_funct( bh_tmp.cfg, trace.trf, dim_out.txt ); The examples presented in this section all refer to the idl file given in Figure 2. In this case, the result of task filter is written in bh_tmp.cfg,andtask dimemas_funct uses that file as input. Therefore, task dimemas_funct should be executed after task filter since it needs an output of this task. Write after Read (WaR): exists when a task writes a parameter that is read by a previous one. For example: extract( bh_tmp.cfg, dim_out.txt, final_result.txt ); dimemas_funct ( bh_tmp.cfg, trace.trf, dim_out.txt ); In this case the problem is with parameter dim_out.txt which is read by task extract and written by task dimemas_funct. Initially, if sequential execution flow is followed, no problem should arise. However, if dimemas_funct is executed before task extract it may overwrite some data needed by task extract. Write after Write (WaW): exists when a task writes a parameter that is also written by a previous one. For example: filter( bh1.cfg, L1, BW1, bh_tmp.cfg ); filter( bh2.cfg, L2, BW2, bh_tmp.cfg ); Both tasks, are writing the same file bh_tmp. A problem with this data dependence may arise if the first instance of task filter is executed before the second one. Although the execution of both tasks will not be affected, if there is any task that later reads bh_tmp from second function, instead it would read the resulting file from the first function. As has already been pointed out, the data dependence analysis is performed for the input/output files, but not for the generic parameters. 
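The three cases can be pictured with the following sketch, which compares the file parameters of a newly created task against those of each earlier task that has not finished yet. The types and helpers are illustrative, not the actual run-time structures.

    /* Hypothetical types and helpers, for illustration only. */
    typedef struct task task_t;
    typedef struct fileset fileset_t;
    typedef enum { DEP_NONE, DEP_RAW, DEP_WAR, DEP_WAW } dep_t;

    extern fileset_t *reads(const task_t *t);    /* input  file parameters */
    extern fileset_t *writes(const task_t *t);   /* output file parameters */
    extern int intersects(const fileset_t *a, const fileset_t *b);

    dep_t classify(const task_t *earlier, const task_t *later)
    {
        if (intersects(writes(earlier), reads(later)))  return DEP_RAW;  /* Read after Write  */
        if (intersects(reads(earlier),  writes(later))) return DEP_WAR;  /* Write after Read  */
        if (intersects(writes(earlier), writes(later))) return DEP_WAW;  /* Write after Write */
        return DEP_NONE;
    }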
If a dependence involving any of the remaining (non-file) parameters must be respected in the task flow execution, the programmer can force it with the GS_Barrier function, which waits until all previously called tasks in the graph have finished.

Renaming

Although the program will finish correctly if all types of task dependences are respected, only the RaW dependences are unavoidable; for this reason RaW dependences are also called true dependences. However, WaR and WaW dependences can be eliminated by means of renaming. Renaming is just, as the word suggests, changing the name of the parameter (in our case, the name of a file). For example, in the WaR case described above, if the system renames the dim_out.txt file parameter of the dimemas_funct task instead of reusing the same name, the WaR dependence disappears. Of course, those tasks that later read the dimemas_funct

9 159 output file as input should also have their input file renamed with the same name as the output file of the dimemas_funct task. The WaW case is almost identical. As the data dependence analysis, the renaming is performed automatically inside the Execute function by the GRID superscalar run-time. GRID superscalar handles the renaming by means of two hash-tables: one that, when given the original filename, returns the name of the last renamed filename and a second one that, when given a renamed filename, returns the original name. During the data dependence analysis, the run-time internally substitutes the filenames of the files involved in a task by the last renamed filenames. Then, all data dependences are checked. The run-time has different behaviors depending on the type of dependence. When a RaW dependence is found, the data dependence is taken into account in the graph structures (the behavior does not differ from the case when no renaming is used). When a WaR or WaW dependence is found, a new name for the output file that causes the data dependence is generated. This will eliminate the data dependence. Data structures are updated and both tasks could eventually be run in parallel Shared Disks Management and File Transfer Policy In the current version of the run-time the tasks output files remain on the server where the task has been executed. The run-time also keeps track of the server where each file is located by means of a hash-table which stores for each filename the server where it is located. Then, the files are transferred only on demand and if required (if a task is executed in a server and all its input files are already there, no file transfers are required). Together with a locality-aware scheduling policy, the number of file transfers is dramatically reduced. The next step was to enable the run-time to take into account disk sharing. By disk sharing we mean, for example, multinode servers with disks mounted through systems such as NFS [26]. The run-time receives this information in two ways: 1. In the list of servers provided to the resource broker, the working directory for each server and client is indicated together with a virtual name of the disk where this working directory is mounted. If two servers have their working directory in the same disk but different areas, the file transfers between them can be reduced to a local copy (this is already detected by gsiftp). Furthermore, if the same working directory is used, the run-time detects that the the copies are not necessary. 2. An additional list of virtual shared disks is provided, together with its absolute path to the root where it is mounted in each server. This is useful for large input files, for example, DNA files, large databases or similar. If they are located on a shared disk visible to some or all of the servers, the file transfers are not required. Even if the virtual disk does not really exists, but a local copy of the data-base is located in each server, the transfers will be reduced as well (the existence of a real disk is transparent to the run-time) Resource Brokering It is not intended to provide a resource broker with the GRID superscalar distribution. However, an interface between a resource broker and GRID superscalar is required. This interface is called from the Execute function when a task is going to be submitted for execution and, therefore, a hardware resource is required. 
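The paper does not fix this interface, but conceptually the run-time needs something along the following lines; everything in this sketch (names, fields, signature) is an assumption used only to illustrate the information exchanged.

    /* Hypothetical broker interface: given a task and the current location of
       its input files, the broker either returns a server or none. */
    typedef struct task   task_t;
    typedef struct server server_t;

    typedef struct {
        task_t      *task;         /* task about to be submitted            */
        const char **input_files;  /* (renamed) input file names            */
        const char **locations;    /* server (or client) holding each file  */
        int          n_inputs;
    } broker_request_t;

    /* Returns the chosen server, or NULL if no slot is currently free. */
    server_t *broker_select(const broker_request_t *req);

The simple broker shipped with the current prototype, described next, fills this role using statically provided machine and network information.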
In the current prototype, we have included a very simple resource broker, but further versions should be able to interface with other external and more powerful brokers. The current resource broker receives two inputs from the user:
1. For each server that will be available for the execution of the application, the name of the server, the maximum number of tasks that can be concurrently sent to that server, the name of the queue available on that server (in case any queue should be used), the absolute path to the working directory and a virtual name of the disk where the working directory resides. In addition to this, the name of the client machine, the absolute path to the working directory of the client, and a virtual name of the disk where the client working directory resides are provided.
2. For each server, the maximum bandwidth of the connection between the client and the server and an estimation of the duration of each of the tasks when run on that server.
We acknowledge that this kind of information is rather low level for an application developer; future versions of GRID superscalar will obtain it dynamically from the system. The resource broker receives requests for hardware resources from the run-time (in the Execute function). The

10 160 resource broker answers those requests taking into account the time required to transfer the task input files from the current location to the candidate server and the duration of the task in the candidate server. In view of this, slower servers could be selected if the input files are already located in their disks or if faster links connect them with the input file location. If only one server is available, it will be selected, although this means that all input files have to be transferred or even if it is a very slow server. Although this is a very greedy policy, it pursues the idea of using whatever is available at the moment. This is an example of scheduling policy, but any dynamic scheduler could have been implemented. For example, dynamic information of the system load, network parameters or load, or expected remaining time of running tasks could have been used as inputs for the decision of the scheduler Task Submission The task submission is again performed inside the Execute function. If the task that is currently been called does not have any unresolved data dependence, and a hardware resource is available (provided by the broker), then the task is submitted for execution. If any of the previous conditions are not accomplished, then the task will be inserted in an internal ready queue. The task submission is composed of two steps: file submission and task submission. In the file submission step, those files that are input of the task are sent to the destination machine where the task is going to be executed. The rest of the input parameters (which are not files) are passed to the destination machine too, encoded in base64 format. In the submission step, the task itself is called to be executed in the destination machine. The current version of the Resource Specification Language (RSL) used in Globus allows us to specify which files have to be transferred to the destination machine, when a new scratch directory have to be created or not and which files have to be transferred as output to the original machine. The GRID superscalar takes advantage of these features and does these two steps together in the same job submission. The scheme followed initially regarding the file management is that one temporal directory is made for each task in the destination machine where the task is going to be executed. All required input files which are not already in the server disk are sent from their initial location to the server working directory. For those input files that are located on shared disks and which are accessible from the server, the correct absolute path to the location of the file in the server is calculated by means of cutting and pasting the root of the filename. Copies from the working directory to the temporal task directory are avoided by means of soft links. Also, output files are written directly to the server working directory, although a soft linked copy exists on the task temporal directory. All temporary files created by the task which are not specified in its interface will be written in the temporal task directory. Once the task finishes, the temporal task directory is erased, together with all temporary files, but the input/output files remain in the working directory. 
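The per-task file layout described above can be sketched as follows, using ordinary POSIX calls for clarity; in reality the staging is expressed through the RSL job description and gsiftp rather than executed locally, and the helper below is purely illustrative.

    /* Illustrative sketch of the per-task directory scheme described above. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/stat.h>

    void prepare_task_dir(const char *workdir, int task_id,
                          const char **inputs, int n_inputs)
    {
        char taskdir[512];
        snprintf(taskdir, sizeof(taskdir), "%s/task_%d", workdir, task_id);
        mkdir(taskdir, 0700);                 /* temporal directory, one per task */

        for (int i = 0; i < n_inputs; i++) {
            char src[512], dst[512];
            snprintf(src, sizeof(src), "%s/%s", workdir, inputs[i]);
            snprintf(dst, sizeof(dst), "%s/%s", taskdir, inputs[i]);
            symlink(src, dst);                /* avoid copies from the working directory */
        }
        /* outputs are written to the working directory, with a soft link in the
           task directory; when the task ends, the task directory is erased */
    }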
The task submission is based on the Resource Management Client API [25], using the calls globus_gram_client_job_request for task submission, globus_gram_client_callback_allow for asynchronous end-of-task synchronization, and globus_poll_blocking for end-of-task synchronization at the end of the program.

End of Task Notification

When a task finishes, the GRID superscalar run-time must be notified. The task dependence graph is then updated according to the data dependences that have been satisfied by the end of this task. Tasks that depend on the finished task and that do not have any other unresolved data dependence can now be submitted for execution. To notify the GRID superscalar run-time when a submitted task has finished, the asynchronous state-change callback monitoring system provided by the Resource Management Client API [25] is used. The globus_gram_client_callback_allow() function opens a TCP port which listens for messages from the Globus job manager (one of the parameters of this call is a callback_func function, provided by the programmer). That function returns a callback_contact, which can then be used in globus_gram_client_job_request() calls, for which the callback_func will be called on state changes of the submitted jobs. In the GRID superscalar run-time, when a task enters the done state, the provided callback function is executed: the data structures are updated with the information of which task has finished, and then the broker is notified. Job manager error handling is also performed. These are the only functionalities included in the callback function.
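The submission and notification flow can be sketched on top of these calls as follows. The argument lists and constants are written from memory of the GT 2.x GRAM client API and should be taken as approximate; only the three function names come from the text, and the run-time hooks are hypothetical.

    #include "globus_gram_client.h"   /* GT 2.x GRAM client API */

    extern void mark_task_finished(const char *job_contact);   /* hypothetical hooks */
    extern void notify_broker(const char *job_contact);
    extern void handle_job_manager_error(const char *job_contact, int errorcode);

    static char *callback_contact;    /* returned once by callback_allow at start-up */

    /* Kept minimal on purpose, as described above: mark the task as finished,
       notify the broker and handle job manager errors; the task graph itself
       is updated later, from Execute and the other primitives. */
    static void state_callback(void *arg, char *job_contact, int state, int errorcode)
    {
        if (state == GLOBUS_GRAM_PROTOCOL_JOB_STATE_DONE) {
            mark_task_finished(job_contact);
            notify_broker(job_contact);
        }
        if (errorcode != 0)
            handle_job_manager_error(job_contact, errorcode);
    }

    void gs_gram_init(void)
    {
        /* opens the TCP port that listens for state changes from the job managers */
        globus_gram_client_callback_allow(state_callback, NULL, &callback_contact);
    }

    void gs_gram_submit(const char *server_contact, const char *rsl, char **job_contact)
    {
        /* submits the job described by the RSL string to the selected server */
        globus_gram_client_job_request(server_contact, rsl,
                                       GLOBUS_GRAM_PROTOCOL_JOB_STATE_ALL,
                                       callback_contact, job_contact);
    }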

11 161 The initial implementation was such that inside the callback function we submitted the next tasks that were ready at that moment. However, due to problems with the Globus callbacks, the functionality was changed to the one mentioned above (we were not able to treat some callbacks if we called the globus_gram_client_job_request() from the code of the callback function). The data structures that maintain the information about the state of the tasks are checked at the Execute function when other tasks are submitted for execution and at the rest of the GRID superscalar primitives (file primitives and GS_Barrier) Results Collection The policy followed by the GRID superscalar run-time regarding the output files location after task completion have been detailed in Section 3.3. However, the behavior of the run-time for the rest of output parameters is different. Tasks or inline code in the user main program may need output parameters from preceding tasks. Since the data dependence analysis is only performed for the parameters that are files, the run-time guarantees the correct behavior by means of executing a local barrier synchronization when a task has one or more output parameters which are not files. By local barrier synchronization is meant that the run-time will temporally suspend the execution of the user main program at that point until the current task has been executed. Although this behavior may seem inefficient it is reasonable under the assumption that tasks generally will have files as output parameters and, exceptionally, some tasks can have a non file output parameter. We differentiate that local barrier from a global barrier, which will wait until all submitted tasks end. Once the current task has been executed, the output parameter will be sent from the server to the client host. The run-time supports two ways of sending the output parameters back to the client: by sockets or by files. The socket mechanism is handled by a thread that is created by the Execute function. This thread listens from the server program and collects the results. In fact, its mission is not limited to the collection of the output parameters, it also collects the state of the task and the size of the output files generated in the server, allowing task failures to be detected and to gather information for the scheduler decisions. A similar procedure is followed when the file mechanism is used. Although the socket mechanism is faster, the file mechanism allows the run-time to deal with those servers that do not have external IP address. This case is specially common in multinode systems Explicit Task Synchronization Although GRID superscalar should ideally hide the task parallelization mechanism to the user, it is evident that this is difficult to achieve in all cases. For this reason, the GRID superscalar interface provides an explicit task synchronization mechanism: the GS_Barrier primitive. The GS_Barrier primitive can be inserted in the user main program when it is necessary for all submitted tasks to be finished before resuming the execution of the rest of the application. When the run-time receives a call to GS_Barrier, the execution of the main program is blocked. The primitive globus_poll_blocking() is called several times to receive the callbacks and the different data structures are updated as soon as each task finishes. 
Once all tasks are finished, the program may resume.

File Management Primitives

As the task dependence graph is based on file dependences, the run-time needs to control when a file is locally read or written by the user main program, for example, if the main program modifies a file that is later an input of a given task (i.e., an input file with execution parameters inside that the main program modifies before calling the task). Since the run-time data dependence analysis only takes into account the data dependences between the tasks specified in the idl file, a mechanism must be provided to control these modifications and avoid execution misbehaviors. Currently the interface provides the functions GS_Open(), GS_Close(), GS_FOpen() and GS_FClose(), whose functionality is partially the same as that of open(), close(), fopen() and fclose(). If a file is opened (with GS_Open() or GS_FOpen()) with the write option, the run-time assumes that the file is going to be modified and performs the same data dependence analysis for the file operation as for the tasks in the idl file; if the file has any WaR or WaW dependence with any task, the same renaming techniques explained above are applied. Additionally, if GS_Open() (or GS_FOpen()) is called with the read option, a partial barrier is executed until the task that generates that file as output finishes.
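The behavior of these primitives can be condensed into the following sketch of GS_FOpen. It is a simplification for illustration: the helper names are hypothetical, and, as explained next, the real primitives are in fact handled as local tasks inside the run-time.

    #include <stdio.h>
    /* Hypothetical helpers standing in for the run-time internals. */
    extern int         mode_is_write(const char *mode);
    extern void        analyze_and_rename_for_write(const char *name);
    extern void        wait_for_producer(const char *name);   /* partial barrier */
    extern const char *last_renamed(const char *name);

    FILE *GS_FOpen(const char *name, const char *mode)
    {
        if (mode_is_write(mode)) {
            /* the main program will modify the file: run the same dependence
               analysis as for a task; WaR/WaW cases are solved by renaming */
            analyze_and_rename_for_write(name);
        } else {
            /* the main program will read the file: wait until the task that
               produces it (if any) has finished */
            wait_for_producer(name);
        }
        return fopen(last_renamed(name), mode);
    }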

12 162 Internally these file management functions are handled as local tasks. A task node is inserted for each of them and the data-dependence analysis is performed, but the function is locally executed Task Scheduling The task scheduling mechanism is distributed between the Execute call, the callback function and the GS_Barrier call. Each task enters into the run-time data structures when it is instantiated with an Execute call. Different possibilities may arise at this moment. One such case, which has already been explained, is the case when the task can be submitted immediately after being created. If the task has not been scheduled because of some data dependence that must be solved first, then it has to wait. Once the tasks that are responsible of those data dependences end, the data structures will be updated. The callback function only marks the tasks that have finished. The data structure update is performed inside the subsequent instances of the Execute function. After the data structures have been updated, those tasks that now have their data dependences solved are submitted for execution. This last step is always bound to machine availability. The remaining case is when no more Execute calls are going to be called but some tasks are still pending in the graph. Then, the GRID superscalar performs a non CPU consuming wait inside the GS_Barrier primitive before ending the program File Forwarding One of the factors that reduces the concurrency of the execution of the tasks is the RaW dependences. As has already been explained, these dependences are unavoidable. Furthermore, the file that is output of a task has to be sent to the machines where the receiving task is going to be executed, adding more latency. In this section we present a way to reduce the impact of this fact on the performance of the application. The idea is related with the forwarding mechanism used on the processors pipelines, where data produced by one instruction and needed by the following one is directly forwarded in such a way that cycle stalls are reduced. Once a task has begun to write its output files, the tasks that are waiting can start their execution and can start to read the results. Therefore, a mechanism has been implemented using a socket between the tasks that writes and read the file. To give an example of the idea, the functionality of the two following commands: > simulator1 <file_in.cfg >file_out.txt > simulator2 <file_out.txt >file_out2.txt is equivalent to: > simulator1 <file_in.cfg simulator2 > file_out2.txt Instead of using the pipe, a socket is opened when a file is opened for write inside a task. When the same file is opened for read in another task, the other side of the socket is opened. The write/read operations do not need to be substituted, since the writing task writes into the socket and the reading task reads from the socket. However, in the implementation the writing task also writes the data into the file, since this forwarding mechanism is intended to be transparent to the programmer. This mechanism has been implemented by means of dynamic interception of the open, close, read and write operations by using Dyninst [12]. The scheduling schema is slightly modified with the use of this mechanism. Now, a task (T2) that has a RaW data dependence with a running task (T1) is started when the task T1 opens the file that is responsible for the data dependence. The two tasks will then run concurrently. 
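On the writer side, the interception can be pictured as follows (the reader side is symmetric, reading from the socket instead of the file). The helper name is hypothetical; the actual mechanism intercepts open, close, read and write dynamically with Dyninst rather than relinking the code.

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Hypothetical lookup: the forwarding socket attached to this descriptor,
       or -1 if the file is not being forwarded to a consumer task. */
    extern int forward_fd(int fd);

    /* Writer-side interposition: the data also goes to the socket, so the
       consumer task can start reading, but the file is still written normally. */
    ssize_t forwarded_write(int fd, const void *buf, size_t count)
    {
        int sock = forward_fd(fd);
        if (sock >= 0)
            send(sock, buf, count, 0);
        return write(fd, buf, count);
    }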
Although this increases the degree of concurrency of the application, care has to be taken since deadlock situations may arise. This forwarding mechanism is currently under development and has not been used in the experiments detailed in the next section. Moreover, our initial experiments show that the instrumentation adds a lot of overhead, reducing the expected performance gain. Consequently, we are studying other ways of implementing the file forwarding mechanism.

4. Results and Performance Analysis

Several examples have already been implemented with GRID superscalar, although the current version can still be considered a prototype. We have selected two examples for the paper in order to analyze performance: first, a very simple example that allows us to show the details of an application written with the GRID superscalar paradigm, and second, the NAS Grid Benchmarks. In this section we present some of the results obtained. We have instrumented the run-time to obtain Paraver tracefiles [24] and a performance analysis has also been done. We present the results of the performance analysis for two of the

13 163 cases, and details of an an interesting bioinformatic application are given Simple Optimization Example The simple optimization example was described above in Section 2. Some results of this example are shown in Table 1. The results were obtained by setting the MAX_ITERS parameter of the application to 5 and the ITERS parameter to 12, leading the application to a maximum parallelism of 12 and the total number of remote tasks generated is 180. Two different machines were used: Khafre, an IBM xseries 250 with 4 Intel Pentium III, and Kadesh8 a node of an IBM Power4 with 4 processors. As client machine, a PC based system with Linux was used. In each case, a maximum number of tasks that could be sent to each machine was set. Column Machine describes the server or servers used in each case, column #maxtasksdescribes the maximum number of concurrent processes allowed in each server and column Elapsed time the measured execution time of each case. The number of tasks in each server was set to a maximum value of 4 since the nodes we were using have 4 processors each. For the single machine executions, it is observed that the execution time scales with the number of processes, although it has a better behavior in server Khafre than in server Kadesh8. When using both servers, we obtained execution times between the time obtained in Khafre and the time obtained in Kadesh with the same number of tasks. For example, when using 2 tasks in each server, the elapsed time is between the elapsed times obtained in Khafre and Kadesh with 4 tasks. In this case, the Table 1. Execution times for the simple optimization example. Machine # max tasks Elapsed time Khafre 4 11 min 53 s Khafre 3 14 min 21 s Khafre 2 20 min 37 s Khafre 1 39 min 47 s Kadesh min 37 s Kadesh min 27 s Kadesh min 51 s Kadesh min 31 s Khafre + Kadesh min 45 s Khafre + Kadesh min 11 s Khafre + Kadesh min 33 s 180 tasks executed in the benchmark, 134 were scheduled on the faster server (Khafre) and 46 in the slower one (Kadesh8) NASGridBenchmarks The NAS Grid Benchmarks (NGB, [33]), which are based on the NAS Parallel Benchmarks (NPB), have been recently specified. Each NGB is a Data Flow Graph (DFG) where each node is a slightly modified NPB instance (BT, SP, LU, MG or FT), each defined on a rectilinear discretization mesh. Like NPB, a NGB data flow graph is available for different problems sizes, called classes. Even within the same class there are different mesh sizes for each NPB solver involved in the DFG. In order to use the output of one NPB solver as input for another, a interpolation filter is required. This filter is called MF. Four DFG are defined, named Embarrassingly Distributed (ED), Helical Chain (HC), Visualization Pipe (VP) and Mixed Bag (MB). Each one of these DFG represents an important class of grid applications. 1. ED represent a parameter study, which is formed by a set of independent runs of the same program, with different input parameters. In this case there are not data dependencies between NPB solvers. 2. HC represents a repeating process, such as a set of flow computations that are executed one after another. In this case a NPB solver cannot start before the previous one ends. 3. VP represents a mix of flow simulation, data postprocessing and data visualization. There are dependencies between successive iterations of the flow solver and the visualization module. Moreover, there is a dependence between solver, postprocessor and visualization module in the same iteration. 
BT acts as flow solver, MG as postprocessor and FT as visualization module. 4. MB is similar to VP, but introduces asymmetry in the data dependences. Figure 10 shows the DFG of the four benchmarks for class S. A paper-and-pencil specification is provided for each benchmark. The specification is based on a script file that executes the DFG in sequence on the local host. For each benchmark a verification mechanism for the final data is provided. The developer has the freedom to select the implementation mechanism. We have implemented the benchmarks using the GRID superscalar prototype. However, a modification of the NPB instances was needed to allow GRID superscalar to exploit all its functionalities.
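As an illustration of what such a benchmark looks like under this model, the following sketch renders the ED graph (a set of independent solver runs) in GRID superscalar style. The task name, file names and class parameters are hypothetical and do not correspond to the authors' actual implementation; only the structure of the DFG is taken from the text.

    #include <stdio.h>
    /* npb_solver, GS_On and GS_Off would come from the stubgen-generated stubs
       and the GRID superscalar headers; the names here are illustrative. */

    int main(void)
    {
        GS_On();
        for (int i = 0; i < 9; i++) {
            char in[32], out[32];
            snprintf(in,  sizeof(in),  "ed_in_%d.data",  i);
            snprintf(out, sizeof(out), "ed_out_%d.data", i);
            /* the nine runs share no files, so the run-time finds no dependences
               between them and can execute all of them concurrently */
            npb_solver(in, out);
        }
        GS_Off();
        return 0;
    }

Written the same way, HC would simply reuse one file name from each call to the next, and the run-time's RaW detection would serialize the chain automatically; VP and MB fall in between.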

14 164 Figure 10. Data Flow Graphs of the NAS Grid Benchmarks. We modified the code of the NPB instances in such a way that the names of the input/output files are passed as input parameters. In the original code each NPB instance generates internally these names. The file names were generated in such a way that they were different in each execution of the same NPB program. With our modification we can reuse the same file name in different iterations, and the GRID superscalar prototype guarantees the proper execution using the renaming feature. In this way, the NGB main program is much simpler. We run all the benchmarks in the same testbed as previous example, and thus we can validate GRID superscalar as an operative system to develop grid applications. The maximum parallelism of ED is 9 for all of these classes. HC is totally sequential and for MB and VP the maximum task parallelism is 3. Tables 2 and 3 show the results for the VP and MB benchmarks when run with classes S and W. The benchmarks were run assigning from 1 to 4 tasks to each server. The times reported in the tables are average times from several executions, since the total execution time can vary more than 10% from one execution to another. MB scales with the number of tasks as expected in both servers used alone. VP is not scaling nicely with the number of tasks. A performance analysis of this benchmark is reported later in this section. When both servers are used the time is not scaling as expected. The reason for that behavior was analyzed and is explained later at the end of this section Bioinformatics Example Another example that has been programmed with GRID superscalar prototype is a bioinformatics application currently in development. The application compares the DNA of the mouse with the DNA of humans. In order to be able to perform this comparison, both DNAs must be split into several files and then each file of the mouse set has to be compared with each file in the human set. The BLAST application is used to compare the DNAs. A previous version of the application was based on Perl using LoadLeveler specific functionalities. This application was ported to GRID superscalar using the C/C++ interface. The use of GRID superscalar has simplified the programming of the application and we plan to use the GRID superscalar version of that

15 165 Table 2. Execution times for the NAS Grid Benchmarks MB and VP on a IBM Power4 node (Kadesh8). Benchmark 1 task 2 tasks 3 tasks 4 tasks MB.S s s s s MB.W s s s s VP.S s s s s VP.W s s s s Table 4. Execution times for the NAS Grid Benchmarks. Machines Kadesh8 and Khafre simultaneously. Benchmark 1 + 1task 1+ 2tasks 2+ 1tasks MB.S s s s MB.W s s s VP.S s s s VP.W s s s Table 3. Execution times for the NAS Grid Benchmarks MB and VP. Machine Khafre. Benchmark 1 task 2 tasks 3 tasks 4 tasks MB.S s s s s MB.W s s s s VP.S s s s s VP.W s s s s VP.A s s s s application for production in our systems. The numbers regarding the porting of this application to GRID superscalar are impressive: the number of lines was reduced to the 10% of the original version in Perl and the development time was reduced to half, including the GRID superscalar learning process. Also, this experience allowed us to get a lot of feedback from the users and motivated the implementation of the Perl interface Performance Analysis In order to be able to do a performance analysis of the benchmarks the GRID superscalar run-time was instrumented to generate Paraver tracefiles. Paraver [24] is a performance analysis and visualization tool which has been developed at CEPBA for more than 10 years. It is a very flexible tool that can be used to analyze a wide variety of applications from traditional parallel applications (MPI, OpenMP or mixed) to web applications. The instrumentation of the GRID superscalar is at a preliminary stage, but facilitate the performance analysis process. The traces generated for the GRID superscalar applications were only of the client side. We are considering getting traces of the whole application in the future. However, to take into account the overhead of Globus, time measures of the duration of the servers tasks were also performed. To generate the traces for the GRID superscalar applications, two kind of elements (which are the base of the Paraver tracefiles) were used: the state and the events. The state of the GRID superscalar can be, for example, user (when running user code), Execute (when running run-time primitives)... Additionally, events were inserted to indicate different situations: beginning/end of callback function, task state change (job request, active, done...), file open/close NAS Grid Benchmark VP, Size W Some of the results for the NGB benchmarks presented in Tables 2, 3 and 4 seem to be unreasonable at a first glance. For example, VP benchmark, class W, when run in the IBM Power4 node. As the maximum parallelism of this benchmark is 3, it is not surprising that no benefit is obtained with 4 tasks. However, we expected to get better performance with 3 tasks than with 2, therefore this two cases were re-run and Paraver tracefiles were obtained. Table 5 shows the time the application is in each state for different runs. It is observed that the main part of the execution of the application is in the Execute function. Analyzing with more detail it can be observed that there are 16 different Execute bursts in the tracefiles, one for each of the tasks of the graph (see Figure 10) plus one for a GS_FOpen performed at the end of the benchmark. Next step was to identify for each task, the time invested in each step. In Figure 11 we can see for each task the elapsed time for different steps of the task (except the last GS_FOpen task, which is locally execute and therefore no Globus events are inserted in the tracefile). 
The figure shows for each task: Request to Active: the time elapsed from the moment the client issues the job request until the callback function is invoked notifying that the job is in the active state. Active to Done: the time elapsed from the callback notifying that the job is in the active state until the callback notifying that the job has ended.

16 166 Figure 11. Task elapsed time composition for the NAS Grid Benchmark VP, size W. Figure 12. NAS Grid Benchmark VP: task assignment to servers. In dark color, tasks assigned to server Kadesh; in light color, tasks assigned to server Khafre. Dashed lines lines represent file transfers between servers. Task duration: elapsed time of the task measured in the server side (this time is included in the Active to Done time, but it is indicated separately to outline the difference). It is observed that for each task, the Request to Active time is in average 3.86 seconds and the Active to Done seconds. However, the average elapsed time of the tasks in the server is 1.03 seconds. The Active to Done has an average value rounding the 30 seconds. This time matches the GRAM Job Manager polling interval. This has been reported before in other works [30]. This polling interval can be changed editing the GRAM sources. However, if the granularity of the tasks is large enough, this polling interval would be reasonable. Regarding the performance between the 2 and 3 tasks cases, the corresponding schedules between both cases were observed with some detail. Although the VP data flow graph has a maximum parallelism of 3, this maximum parallelism is only achievable for a part of the graph. With a correct schedule, the same performance can be achieved with 2 tasks as with 3 tasks. The reason why GRID superscalar is not scaling when using 3 tasks is because the schedule with 2 tasks it is good enough to get the maximum performance NAS Grid Benchmark VP, Size S with 2 Servers In this section we describe the results of the analysis of the NAS VP, class S, when run with two servers. The results shown in Table 4 when one task is assigned to each server are worst than the case when one task is assigned to server Khafre. In this case, we are directly analyzing the task schedule. Figure 12 shows the assignment of each VP task to each server. The tasks in light color have been assigned to Khafre, the faster server, and tasks in dark color have been assigned to Kadesh. The dashed line Figure 13. Task elapsed time composition when two servers are used, NAS Grid Benchmark VP, Size S. between tasks assigned to different servers represent that a file transfer is required between both servers. Regarding the assignment, we consider that it is correct, since GRID superscalar has assigned more tasks (and the tasks in the critical path) to the faster server and also the number of file transfers between both servers is very low (only 3 file transfers). Figure 13 shows the elapsed time for each task in the two different servers. In this case, the Request to Active time is different if the task is assigned to one server or the other. In Khafre the average is 2.4 seconds and in Kadesh8 is 4.1 seconds. The overhead of the transfer time from Kadesh8 to Khafre in tasks 9 and 14 makes that the Request to The Active time in these two tasks is above the average in Khafre (6 seconds and 5.8 seconds respectively). Also, for task 6, which receives a file from task 0, the Request to Active time is above the Kadesh8 average (4.6 seconds). The Active to Done time is again around 30 seconds for almost all cases except for two tasks, for which it is around 1.5 seconds. In those two cases, the poll entered much earlier than in the other cases and the end of task was detected with a much shorter time. Finally, Figure 14 show the task schedule for the tasks in each server. Those plots are Paraver windows. 
The window at the top shows the tasks assigned to Khafre and the one at the bottom shows the tasks assigned

17 167 Table 5. General performance information for the NAS Grid Benchmark VP, class W. Task # User Execute GS_On/GS_Off Barrier TOTAL 4 tasks s s s 11 s s 3 tasks s s s 12 s s 2 tasks s s s 11 s s Figure 14. NAS Grid Benchmark VP, size S: task scheduling when two servers are used. to Kadesh8. The segments in dark color represent the Request to Active state. During this time the file transfers (if necessary) are performed. The segments in white represent the Active to Done state. The file transfers between both servers has been highlighted with strong light color lines. Since the overhead of the file transfers is no more than 15 seconds and the schedule is appropriate, it is difficult to understand the low performance achieved with this benchmark when using two servers. Finally, with the comparison with the one server version the reason is detected: to allow the correct execution between the two servers the benchmark is run in ASCII mode. For example, VP.S when run alone in Khafre with two servers in ASCII mode takes seconds, which is above the s when run in binary mode. Also, MB.S when run alone in Khafre with two servers in ASCII mode takes seconds, which is again above the seconds obtained in the binary mode. 5. Related Work Some of the ideas presented in this paper are related to previous work developed by the group in EU projects PEMPAR, PARMAT, and ASRA-HPC. In those projects the PERMAS code was parallelized by means of PTM, a tool that totally hides parallelization from higher-level algorithms [1, 27, 2]. With PTM an operation graph was asynchronously build and executed on top of blocked submatrix operations. A clustering algorithm distributed the work, performing a dynamic load balancing and exploiting data locality such that the communication on the network was kept at a minimum. Furthermore, a distributed data management system allowed free data access from each node. Above PTM the sequential and parallel code was identical. When an application runs on a parallel machine, PTM does an automatic run-time parallelization. Some similarities can be found with the commercial tool ST-ORM [29], although this last is mainly oriented to stochastic studies. In ST-ORM you can define a task graph with dependences. Each task can be anything, from a script to a crash simulation. ST- ORM handles the job submission and results collection into heterogeneous Grids. Again, the difference is that the graph and the dependences should be defined by the user. Also, the work presented may have similarities with the workflow language BPEL4WS [6] or other similar ones, as proposed in the Web Services Choreography Working Group [35]. However, in these languages what we can define is the graph, with the dependences already described. Also, these languages are oriented to medium size graphs, while our system may handle really huge task graphs that are automatically generated. An approach to the automatic
