Adding SRM Functionality to QCDGrid

Albert Antony

MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2007

Abstract

This dissertation discusses an attempt to add Storage Resource Manager (SRM) functionality to QCDGrid. Two approaches to adding the functionality have been considered, and it has been explained why one of them is more practical than the other. The better approach has been discussed in further detail, and a design for its implementation has been presented. The attempt to implement it has also been described along with the issues faced, including a major issue regarding what the actual data transfer mechanism should be. A solution to this issue has also been proposed.

Contents

1 Introduction
2 Background
  2.1 QCDGrid
    2.1.1 The Data Grid Software
  2.2 SRM
    2.2.1 Why SRM for QCDGrid?
3 Design
  3.1 Important QCDGrid Operations
    3.1.1 File Insertion
    3.1.2 File Retrieval
    3.1.3 File Replication
    3.1.4 File Listing
  3.2 Design Possibilities
  3.3 Skeleton Design
    3.3.1 The Data Storage Interface
    3.3.2 The SRM-DSI
  3.4 Detailed Design
    3.4.1 Design of send() and recv() Functions
4 Implementation
  4.1 Preparation of Testbed
    4.1.1 Installation of GridFTP
    4.1.2 Installation of dummy DSI functions
    4.1.3 Installation of SRM
5 Conclusions
  5.1 Project Summary
  5.2 Future Work
A The Project Plan
B Learnings and Observations

List of Figures

1.1 Proposed DMS
2.1 QCDGrid Architecture
2.2 The role of SRMs in the Grid
3.1 File insertion steps
3.2 File retrieval steps
3.3 File replication steps
3.4 File listing steps
3.5 Interaction after code modification
3.6 Request translation
3.7 Interaction after wrapper installation
3.8 The Data Storage Interface
3.9 Working of the GridFTP-SRM translator
3.10 Translation of GRAM executables to SRM commands
3.11 Working of srmcopy
3.12 The working of the first approach to data transfer
3.13 The working of the second approach to data transfer
3.14 The working of the third approach to data transfer
3.15 The working of the fourth approach to data transfer
3.16 The working of the fifth approach to data transfer
3.17 Timeline diagram for chosen approach

Acknowledgements

I would like to thank my supervisor Neil P. Chue Hong for his able guidance, support and understanding throughout the course of this project. I would also like to thank QCDGrid developers George Beckett, Radek Ostrowski and James Perry for their feedback on the proposed design; and Radek also for helping out with the certificate installations. A special thanks to my friends Prashobh, Pana, Miklos and Inma for their company and moral support.

Chapter 1

Introduction

"In 2007... in a tunnel deep beneath the Earth... travelling close to the speed of light... two objects will collide... and the secrets of the universe will be unlocked."

Those are the opening words of Lords of the Ring [1], a short film about the Large Hadron Collider (LHC), the world's most powerful particle accelerator, being built at CERN [2]. Although, owing to some glitches, the LHC looks likely to become operational only in May 2008 [3], the message to take away is that advances in science and technology have made it possible to carry out large-scale scientific experiments and simulations which were hitherto infeasible. A direct consequence of this feasibility has been the generation of enormous amounts of data. The aforementioned LHC is expected to produce about 15 petabytes of data annually [4], or, as physicist Brian Cox puts it in the same film, "10,000 Encyclopedia Britannicas per second!"

The emergence of the Grid [5] has made it possible to store, search and retrieve such large volumes of data from geographically distributed locations. Thus, the LHC has its own Grid, the LHC Computing Grid (LCG) [6], being built to store, maintain and share the 10,000-Encyclopedia-Britannica equivalent of data that it will produce every second. To carry out storage, search and retrieval tasks on the Grid, a number of Grid-based data management systems (DMSs) are being developed. These systems aim to provide secure access to the distributed data while keeping the non-locality of the data and the complexity of the Grid hidden from the user.

This thesis project started out with the aim of designing a simple and lightweight DMS for the Grid. In the project preparation phase, the basic architecture of such a DMS was outlined and some common use-cases were illustrated. The proposed DMS was to use the Metadata Catalog Service (MCS) [7] for metadata cataloging, the Replica Location Service (RLS) [8] for replica location, and the Storage Resource Manager (SRM) [9] for storage management. A schematic diagram of the proposed DMS is given in figure 1.1. The notion of data storage, search and retrieval functionalities accessible from within a single user interface made the design intuitive and user-friendly.

[Figure 1.1: Schematic representation of the Data Management System proposed during the Project Preparation phase]

Following feedback received after the project preparation phase, it was decided to look in detail at an existing data grid software called QCDGrid [10] (now called DiGS, Distributed Grid Storage) to see how it carried out the task of data management. Upon study, the QCDGrid architecture seemed to closely match the architecture of the DMS outlined in the project preparation phase. One capability that QCDGrid appeared to lack, however, was good dynamic management of its storage resources. Such a capability becomes all the more critical if offline storage resources such as tapes are added to the grid. With this thought in mind, and to avoid reinventing the wheel, the project was descoped from designing a DMS from scratch to adding dynamic storage resource management capability to the QCDGrid software.

A Storage Resource Manager (SRM) is a middleware software module whose purpose is "to manage in a dynamic fashion what resides on the storage resource at any one time" [11].

It is responsible for space allocation and reservation, deletion of outdated files, and movement of files to and from the staging disk cache; and while it does not perform any actual file transfer itself, it does perform the protocol negotiation for data transfer and the invocation of the file transfer services. It also provides a uniform interface for storage management on different kinds of storage systems.

It was observed that adding SRM functionality to QCDGrid would have several benefits. The most significant of these would be the ability of QCDGrid to access offline storage systems (at the time of writing, QCDGrid only supports online storage systems). It must also be noted that several SRM-compliant storage elements have already been deployed by organizations within the International Lattice Data Grid (ILDG), of which UKQCD is a part.

The remainder of this dissertation is structured as follows. Chapter 2 provides the background theory: it explains the architecture of the QCDGrid software and describes the role played by SRMs. Chapter 3 begins with illustrations of the most common QCDGrid use-cases; following this, it describes the approaches to adding SRM functionality to QCDGrid, the design of the extension, and the issues involved. Chapter 4 describes the attempt at implementing the formulated design. Finally, Chapter 5 concludes the dissertation, provides a summarized recap of the project, and suggests some future work based on the project.

Chapter 2

Background

This chapter provides the background for the project. Section 2.1 gives a brief description of QCDGrid, while section 2.2 introduces the Storage Resource Manager and explains how the addition of SRM functionality would benefit QCDGrid.

2.1 QCDGrid

QCDGrid is principally a data grid used by the UKQCD community [12] for the study of Quantum Chromodynamics (QCD). It supports the storage, search and retrieval of terabytes of data generated by the QCDOC (Quantum ChromoDynamics On a Chip) [13], a purpose-built supercomputer for lattice QCD simulations. In addition, QCDGrid also has a job submission system that allows the scheduling and execution of data generation and analysis jobs on remote machines [14].

QCDGrid combines the storage resources of six UKQCD collaborators across the UK to provide a conglomerated multi-terabyte storage facility. The software has been built on top of the Globus Toolkit [15], the EGEE application stack [16] and an XML database called eXist [17]. A GUI browser allows users to post queries for searching and retrieving data, making usage of the software quite intuitive. For robust functioning, the QCDGrid software ensures the existence of at least two copies of each file on the Grid.

Figure 2.1 shows the architecture of QCDGrid. The central Control Node plays the role of the co-ordinator on the Grid, hosting the replica catalog, initiating replications, and ensuring the proper functioning of the storage nodes. If a storage node should go down due to some anomaly, the Grid software initiates the task of replicating all files present on that particular node. If the central Control Node itself should go down, the Grid would switch to a backup mode whereby read-only access to the Grid's replica catalog would be provided by a backup Control Node.

The QCDGrid software consists of three main components:

- A low-level data grid
- A metadata catalog
- A job submission system

Of these, the metadata catalog and the job submission system are not relevant to the project at hand, and hence are not discussed in any further detail. The data grid software is discussed in the following subsection.

[Figure 2.1: QCDGrid Architecture (image reproduced from the original at http://www.gridpp.ac.uk/qcdgrid)]

2.1.1 The Data Grid Software

The data grid software in QCDGrid facilitates the storage and retrieval of data. It also ensures the proper functioning of the Grid. Each file on the Grid can be identified by a logical filename (LFN), independent of its location on the Grid. If copies of a file exist at different locations on the Grid, they all share the same LFN. For purposes of data transfer, these LFNs need to be translated to unique physical filenames (PFNs) that contain information about the actual location of the file and the protocol to access it. For example, an LFN might look something like

sequence00191.dat

while the corresponding PFN specifying the location and file access protocol might look like

gsiftp://gx240msc.epcc.ed.ac.uk:7001/dat/sequence00191.dat

To perform the translation from LFN to PFN, QCDGrid employs the Globus Replica Location Service (RLS) [8]. To carry out the actual file transfer, QCDGrid uses GridFTP, a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth and wide-area networks [18]. The Globus Toolkit comes with its own implementation of GridFTP.

As mentioned before, central to the proper functioning of QCDGrid is the Control Node. It is the Control Node that hosts the replica catalog which maintains the LFN to PFN mappings. Apart from this, the Control Node also holds important configuration data, and runs the control thread which performs the following tasks:

- Processing of messages from other nodes
- Checking of all storage nodes for proper working
- Checking of free space on the Grid
- Ensuring that enough copies of each file exist and that extra copies are deleted
- Insertion of new files into the grid
- Verification of the replica catalog
- Verification of file integrity
- Writing of updated information to config files

2.2 SRM

As the name suggests, a Storage Resource Manager (SRM) manages what resides on a storage resource. To this end, it provides dynamic space reservation and information on space availability on a storage node, manages staging, pinning and deletion of files, and also invokes the Grid middleware components that perform file transfer. Depending on the storage resource that it is managing, an SRM could be a Disk Resource Manager (DRM), a Tape Resource Manager (TRM), or even a Hierarchical Resource Manager (HRM) if what it is managing is a robotic tape system with a staging disk cache. But regardless of the underlying storage system, the SRM provides a uniform interface to clients for access to the system. Figure 2.2 illustrates the role of SRMs in the Grid.

[Figure 2.2: The role of SRMs in the Grid (image reproduced from the original at http://atlasgrid.bnl.gov/srm/manuals/hrmclient-guide.htm)]

There are several usage scenarios which are especially suited to the SRM design. One such scenario is a multi-file request. If a client requests the transfer of multiple files via an SRM, the SRM takes care of queueing and streaming the files, allocating space for each file being brought in, pinning the files so that they stay on disk and are not overwritten if the same space is required by another client, managing disk cache usage, and so on. Another powerful feature of SRMs is that they can communicate with peer SRMs to perform the required space reservations and file transfers automatically, without any intervention from the user.

2.2.1 Why SRM for QCDGrid?

Having understood the role of SRMs, let us now take a look at the benefits of adding SRM functionality to QCDGrid.

Greater Robustness

SRMs are well equipped to deal with temporary failures of a storage node. In such a case, where QCDGrid would have aborted the transaction and returned an error, an SRM, due to the asynchronous nature of its requests, can keep the request on hold and resume it when the storage node comes back up again.

Greater Efficiency

An SRM can identify frequently accessed files and keep them in the disk cache for a longer duration, thus obviating repeated staging and reducing access times for them.

Support for Offline Storage

SRMs will make it possible for QCDGrid to access offline storage systems such as tapes. This will be a big advantage, as tapes have historically been the primary means of mass storage. By having access to tapes, QCDGrid could possibly make use of a large amount of data from simulations and experiments done in the past.

Better Collaboration with ILDG

ILDG recommends the use of SRMs. Adding SRM functionality would enable QCDGrid to access data generated by other simulation experiments within the scope of ILDG. Using SRMs for managing QCDGrid storage devices would also make QCDGrid data available to other ILDG collaborators.

Chapter 3

Design

This chapter explains the design of the SRM extension to QCDGrid. It starts with an illustration of the steps involved in the most common QCDGrid usage scenarios. It then shows two ways in which SRM support could be added to QCDGrid, and gives the reasons for choosing one of them over the other. It presents a detailed description of the chosen design and discusses the issues involved. A major issue, related to what the data transfer mechanism should be, is dealt with in detail.

3.1 Important QCDGrid Operations

The most common and most important QCDGrid operations are file insertion, file retrieval, file replication and file listing. This section provides a brief explanation of these operations and the steps involved in them.

3.1.1 File Insertion

This is one of the most basic usage scenarios on the QCDGrid. A user inserts a new file into the Grid using the put-file-on-qcdgrid command.

[Figure 3.1: File insertion steps]

The steps involved in file insertion are as follows. The workflow is illustrated in Figure 3.1.

1. The client transfers the file via GridFTP to the NEW directory on a suitable destination node.
2. The client informs the Control Node to check the destination node for new files.
3. The Control Node, in its next iteration, copies the new file to its actual destination by submitting a GRAM executable. It also deletes the file in the NEW directory using GridFTP.
4. The Control Node registers the new file in its replica catalog.

3.1.2 File Retrieval

Perhaps the most common use-case is that of file retrieval. The user supplies the name of a file to be retrieved from the Grid, and the grid software takes care of locating the file and transferring it to the client.

[Figure 3.2: File retrieval steps]

The steps involved in normal file retrieval are as follows. The workflow is illustrated in Figure 3.2.

1. The client queries the Control Node for the PFN of the requested file.
2. The Control Node responds to the client's query with the PFN of the nearest copy of the requested file.
3. The client copies the file from the returned location using GridFTP.

3.1.3 File Replication

One of the key functions of the Control Node is to ensure that enough replicas of each file exist on the Grid (ref. section 2.1.1). Enough replicas might not exist if a file was newly added to the Grid or if existing data was lost or corrupted. In such cases, the Control Node initiates a replication of the file in question.

[Figure 3.3: File replication steps]

The steps involved in file replication are as follows. The workflow is illustrated in Figure 3.3.

1. The Control Node copies the file to local storage using GridFTP.
2. The Control Node copies the file from local storage to its final destination using GridFTP.
3. The Control Node registers the new copy of the file in its replica catalog.

3.1.4 File Listing

A user might want to get the list of all files available on the Grid. The qcdgrid-list command can be used to get the list of files from the replica catalog. Optionally, a -n switch may be supplied, which will return the number of replicas of each file as well.

[Figure 3.4: File listing steps]

File listing is a simple operation. The steps involved are:

1. The client sends the list request to the Control Node.
2. The Control Node scans its replica catalog for the list of files.
3. The Control Node returns the list of files to the client.

3.2 Design Possibilities

Having looked at the important QCDGrid operations, it can now be said that if QCDGrid were to support SRMs, the only difference from the interactions illustrated in the previous section would be in the steps where the storage node is involved. Note that since the file listing operation does not involve any interaction with the storage node, it can be carried out without any modifications even with SRM nodes.

The present QCDGrid storage nodes take GridFTP commands either from the Control Node or directly from the clients. An SRM-managed node would need SRM commands to carry out these tasks. One could say that all that is needed for QCDGrid to support SRMs is an agent that would map certain GridFTP commands to corresponding SRM commands. However, it turns out this is not as trivial as it sounds.

[Figure 3.5: Interaction with SRM node after modification of QCDGrid code]

To start with, there are two ways in which the mapping could be achieved. One way is to modify the Client software and the Control Node software so that they issue SRM commands instead of GridFTP commands to SRM nodes, while keeping the interaction pattern largely unchanged. This is illustrated in figure 3.5. This means that we can attach SRM nodes to QCDGrid as they come, without the need to install any additional software on them. This approach sounds simple, but there are a few things to take note of.

The kind of command, i.e. GridFTP or SRM, given to a storage node would depend on whether the node is an SRM node or not. This means that the clients and the Control Node must be aware of the nature of each storage node. This is possible if each client and the Control Node maintain a list which holds this SRM or non-SRM information for each node. Further steps would have to be taken to ensure timely update of the list when a new storage node is added to the grid or when an existing storage node is changed from SRM to non-SRM or vice versa. In addition, when a new client joins the grid, there must be a mechanism for it to receive the latest SRM or non-SRM list, possibly from the Control Node. Thus, the process on the whole gets cumbersome.

The other approach to achieving the mapping is to have some kind of a request translator as a wrapper around the SRM. This wrapper would translate incoming GridFTP commands to equivalent SRM commands. This is illustrated in figure 3.6.

[Figure 3.6: The SRM wrapper transforms a GridFTP request to an SRM command]

This means that clients and the Control Node continue to send the same commands as before to the SRM storage node, while the request translator attached to these nodes maps the incoming GridFTP requests to appropriate SRM commands. This approach keeps the QCDGrid code unchanged. It obviates the need for a client or the Control Node to know whether a given storage node is SRM or non-SRM. However, it does mean that SRM nodes cannot be attached as-is to QCDGrid, but would need the request translation wrapper to be installed on them as well. The wrapper could possibly reside on a separate node, if additional software installation is not possible on the SRM node.

With the wrapper installed, the typical QCDGrid interactions with the SRM storage node can be depicted as shown in figure 3.7.

[Figure 3.7: Interaction with SRM node after installation of a request translation wrapper]

The second approach seems the more reasonable one to follow, as it does not involve the potentially problematic issue of maintaining the SRM or non-SRM list. Thus, it was decided that SRM functionality would be added to QCDGrid by designing and implementing the mentioned GridFTP-to-SRM request translation wrapper. The skeleton design of the wrapper is presented in section 3.3 and the detailed design in section 3.4. The attempt at implementation of the designed wrapper is described in Chapter 4.

3.3 Skeleton Design

As we have seen, the data transfer mechanism used by QCDGrid is GridFTP. The GridFTP software provides a feature called the Data Storage Interface (DSI), which turns out to be the key to implementing the request translation wrapper. Let us take a quick look at the DSI before proceeding to see how it can be used to implement the wrapper.

3.3.1 The Data Storage Interface

The Globus implementation of GridFTP comes with default support for accessing a POSIX filesystem across a TCP/IP network [19]. But this does not mean that there is no way to interface Globus GridFTP with other types of file and storage systems, possibly across other types of networks. To make this possible, the GridFTP software comes with a programmable interface called the DSI. The DSI resides with the GridFTP server and can be programmed to encode the interaction details for a non-POSIX data source. These DSIs can be dynamically loaded into a running GridFTP server. When the GridFTP server requires access to such a data source, it passes the request on to the DSI. The concept is illustrated in figure 3.8.

[Figure 3.8: The Data Storage Interface]

The DSI is basically a set of interface functions whose signatures are provided by GridFTP. The implementor of a DSI must program these functions to implement the low-level details of the device-specific and file-system-specific interactions for data access and storage. Generally, the functions are programmed to make calls to system-specific library routines for reading and writing data, and for obtaining meta-information about files and spaces.

3.3.2 The SRM-DSI

If we consider the approach of developing a GridFTP-to-SRM request translation wrapper discussed in section 3.2, we can now see that the DSI is designed to provide this very kind of functionality. We can implement the DSI so that it makes calls to SRM client library routines. These routines will in turn contact the SRM server managing the relevant storage node to carry out the requested tasks of data storage and retrieval. The results of the tasks will be returned to the source by the SRM client routines via the DSI. Figure 3.9 illustrates this concept. The design was presented to the QCDGrid developers, who approved it. The detailed design of the SRM-DSI is explained in the next section. Implementations of the DSI performing a similar function already exist for SRB (Storage Resource Broker) [20] and HPSS (High Performance Storage System) [21], and some time was spent studying the SRB-DSI. A minimal sketch of what such a DSI module skeleton looks like is given below.

[Figure 3.9: How the SRM-DSI will translate GridFTP requests to SRM commands]
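To make the shape of a DSI concrete, the sketch below shows how an SRM-DSI module might declare its entry points to the GridFTP server. It is only a sketch: the field ordering of globus_gfs_storage_iface_t and the hook signatures are assumptions based on the Globus DSI developer examples and may differ between Globus Toolkit versions, and the srm_dsi_* functions are placeholders for the implementations discussed in section 3.4.

/*
 * Minimal sketch of an SRM-DSI module skeleton.  The field ordering of
 * globus_gfs_storage_iface_t and the hook signatures follow the Globus
 * DSI developer examples (an assumption; they may differ between Globus
 * Toolkit versions).  The srm_dsi_* functions are placeholders for the
 * implementations discussed in section 3.4.
 */
#include "globus_gridftp_server.h"

static void srm_dsi_start(globus_gfs_operation_t op,
                          globus_gfs_session_info_t *session_info);
static void srm_dsi_destroy(void *user_arg);
static void srm_dsi_send(globus_gfs_operation_t op,
                         globus_gfs_transfer_info_t *transfer_info,
                         void *user_arg);
static void srm_dsi_recv(globus_gfs_operation_t op,
                         globus_gfs_transfer_info_t *transfer_info,
                         void *user_arg);
static void srm_dsi_command(globus_gfs_operation_t op,
                            globus_gfs_command_info_t *cmd_info,
                            void *user_arg);
static void srm_dsi_stat(globus_gfs_operation_t op,
                         globus_gfs_stat_info_t *stat_info,
                         void *user_arg);

/* Table of entry points handed to the GridFTP server when the DSI
 * module is loaded; hooks the SRM-DSI does not need are left NULL. */
static globus_gfs_storage_iface_t srm_dsi_iface =
{
    GLOBUS_GFS_DSI_DESCRIPTOR_BLOCKING | GLOBUS_GFS_DSI_DESCRIPTOR_SENDER,
    srm_dsi_start,      /* session start            */
    srm_dsi_destroy,    /* session end              */
    NULL,               /* list                     */
    srm_dsi_send,       /* client retrieves a file  */
    srm_dsi_recv,       /* client stores a file     */
    NULL,               /* transfer events          */
    NULL,               /* active data channel      */
    NULL,               /* passive data channel     */
    NULL,               /* data channel destroy     */
    srm_dsi_command,    /* mkdir, rmdir, delete ... */
    srm_dsi_stat,       /* file/resource info       */
    NULL,               /* set credentials          */
    NULL                /* buffer send              */
};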

3.4 Detailed Design

The implementation of the DSI requires essentially the implementation of six interface functions, viz. start(), destroy(), send(), recv(), stat() and command(). Stubs for these functions are provided with the GridFTP source code. The task of each function is briefly outlined below.

start()

This function is called by the GridFTP server whenever a client connects to the server. Any session-specific information is passed to the function via the session_info argument. This function is where all initializations related to the specific session must be carried out.

destroy()

This function is called by the server at the end of a session, when the client disconnects. This function is where all the cleaning up must take place, i.e. all the memory allocated for the session should be deallocated here.

send()

The send() function is called whenever the client wishes to receive a file. It encodes the logic for communicating with the storage device to get the file. It is required that the following functions be called within the body of the send() function, in the given order:

globus_gridftp_server_begin_transfer()
globus_gridftp_server_register_write()
globus_gridftp_server_finished_transfer()

recv()

The recv() function is called whenever the client wishes to send a file. It encodes the logic for communicating with the storage device to write the file to it. It is required that the following functions be called within the body of the recv() function, in the given order:

globus_gridftp_server_begin_transfer()
globus_gridftp_server_register_read()
globus_gridftp_server_finished_transfer()

A sketch of this call ordering for send() is given below.
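The following sketch shows only the required ordering of the server calls inside send(); the actual SRM interaction and buffer management are elided, and srm_dsi_write_cb is a placeholder callback name introduced for illustration. In a real DSI the data pumping is driven from the completion callbacks rather than written as a single linear function.

/* Sketch of the mandatory call ordering inside send().  The SRM
 * interaction and buffer management are elided; srm_dsi_write_cb is a
 * placeholder callback name, and its signature follows the Globus DSI
 * examples (an assumption). */
static void srm_dsi_write_cb(globus_gfs_operation_t op,
                             globus_result_t result,
                             globus_byte_t *buffer,
                             globus_size_t nbytes,
                             void *user_arg);

static void
srm_dsi_send(
    globus_gfs_operation_t          op,
    globus_gfs_transfer_info_t *    transfer_info,
    void *                          user_arg)
{
    globus_byte_t *                 buffer;
    globus_size_t                   block_size;

    globus_gridftp_server_get_block_size(op, &block_size);
    buffer = globus_malloc(block_size);

    /* 1. Tell the server that a transfer is starting. */
    globus_gridftp_server_begin_transfer(op, 0, NULL);

    /* ... fetch the next chunk of the requested file into 'buffer'
     *     (for the SRM-DSI, via the SRM client library) ...          */

    /* 2. Hand each chunk to the server for delivery to the client. */
    globus_gridftp_server_register_write(
        op, buffer, block_size, 0 /* offset */, -1,
        srm_dsi_write_cb, user_arg);

    /* 3. When the last write has completed (inside the callback),
     *    tell the server the transfer is finished:
     *    globus_gridftp_server_finished_transfer(op, GLOBUS_SUCCESS); */
}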

stat()

The stat() function is called by the server when it requires file-related or resource-related information, such as file size, permissions and modification times. An example of when such information is required is when a client issues a LIST to the server.

command()

This function is called by the server when the client issues a command. The commands supported by the DSI are mkdir, rmdir, delete, rename, csum and chmod, which respectively create a directory, remove a directory, delete a file, rename a file, compute a file checksum and change the access permissions of a file.

Each of these functions (except start() and destroy()) would in turn make a call to the corresponding SRM client library routine and return the result of the call. For example, stat() would make a call to srmls. Table 3.1 shows the mapping between the DSI functions and the corresponding SRM client library routines. A sketch of how command() might perform this dispatch is given after the table.

  DSI Function        SRM Client Library Routine
  send                srmcopy
  recv                srmcopy
  stat                srmls
  command (csum)      srmls
  command (mkdir)     srmmkdir
  command (rmdir)     srmrmdir
  command (delete)    srmrm
  command (rename)    srmmv
  command (chmod)     srmsetpermission

Table 3.1: Mapping of DSI functions to SRM client library routines
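To illustrate the dispatch in Table 3.1, the sketch below switches on the incoming command type and calls the matching SRM operation. The GLOBUS_GFS_CMD_* constants and the cmd_info fields follow the Globus DSI header as an assumption, and the srm_* functions are hypothetical wrappers standing in for the actual SRM client library calls.

/* Sketch: dispatching GridFTP commands to SRM client routines as in
 * Table 3.1.  The GLOBUS_GFS_CMD_* constants and cmd_info fields follow
 * the Globus DSI header (assumption); the srm_* functions below are
 * hypothetical wrappers around the SRM client library. */
globus_result_t srm_mkdir(const char *path);
globus_result_t srm_rmdir(const char *path);
globus_result_t srm_rm(const char *path);
globus_result_t srm_mv(const char *from, const char *to);
globus_result_t srm_checksum(const char *path);
globus_result_t srm_chmod(const char *path, int mode);

static void
srm_dsi_command(
    globus_gfs_operation_t          op,
    globus_gfs_command_info_t *     cmd_info,
    void *                          user_arg)
{
    globus_result_t                 result = GLOBUS_SUCCESS;

    switch (cmd_info->command)
    {
        case GLOBUS_GFS_CMD_MKD:          /* mkdir  -> srmmkdir          */
            result = srm_mkdir(cmd_info->pathname);
            break;
        case GLOBUS_GFS_CMD_RMD:          /* rmdir  -> srmrmdir          */
            result = srm_rmdir(cmd_info->pathname);
            break;
        case GLOBUS_GFS_CMD_DELE:         /* delete -> srmrm             */
            result = srm_rm(cmd_info->pathname);
            break;
        case GLOBUS_GFS_CMD_RNTO:         /* rename -> srmmv             */
            result = srm_mv(cmd_info->from_pathname, cmd_info->pathname);
            break;
        case GLOBUS_GFS_CMD_CKSM:         /* csum   -> srmls (checksum)  */
            result = srm_checksum(cmd_info->pathname);
            break;
        case GLOBUS_GFS_CMD_SITE_CHMOD:   /* chmod  -> srmsetpermission  */
            result = srm_chmod(cmd_info->pathname, cmd_info->chmod_mode);
            break;
        default:
            result = GLOBUS_FAILURE;
            break;
    }
    globus_gridftp_server_finished_command(op, result, NULL);
}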

But as it turns out, the mapping from the GridFTP DSI functions to the SRM client library routines is not exactly one-to-one, so merely calling the SRM client library routine within the DSI and passing the result of the call on to the QCDGrid client will not work. One reason for this is that there are differences in how the functionalities are implemented.

Consider the stat() function. For a given filename, the server expects the stat() function to fill in a structure containing the following information about the file:

- name of the file
- size of the file
- the file's access permissions
- the number of symbolic links to the file
- target of a symbolic link
- file owner's user id
- file owner's group id
- time of file creation
- time of last file modification
- time of last file access
- inode number of the file

The SRM client library function which corresponds to stat() is srmls. srmls does not return all the fields required by stat(). Also, not all the information returned by srmls is required by stat(). Dealing with the latter issue is trivial: just ignore the extra information (in fact, srmls also returns a file's checksum, which could be used by the DSI's command() function when calculating checksums). But dealing with the former issue requires some thought. srmls returns a file's creation time and its last modification time, but not its last access time. There are two sensible options to choose from in this case: either the access time could be returned as a NULL field, or it could be assumed that the last time the file was modified was also the last time it was accessed, and the last modification time could be returned as the last access time as well. Information about symbolic links and the inode number is also not returned by srmls; these could be returned as NULL. A sketch of this field mapping is given below.
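The sketch below shows one way stat() could populate the server's stat structure from what srmls does return, substituting neutral values for the missing fields as discussed above. The globus_gfs_stat_t field names follow the Globus DSI header as an assumption, and srm_ls_entry_t is a hypothetical structure representing the parsed srmls output.

/* Sketch: filling the server's stat structure from an srmls result.
 * The globus_gfs_stat_t field names follow the Globus DSI header
 * (assumption); srm_ls_entry_t is a hypothetical structure holding the
 * parsed srmls output. */
#include <time.h>
#include "globus_gridftp_server.h"

typedef struct
{
    char *          name;
    globus_off_t    size;
    int             permissions;
    time_t          creation_time;
    time_t          modification_time;  /* srmls gives no access time */
} srm_ls_entry_t;

static void
srm_fill_stat(globus_gfs_stat_t *st, const srm_ls_entry_t *entry)
{
    st->name  = globus_libc_strdup(entry->name);
    st->size  = entry->size;
    st->mode  = entry->permissions;
    st->ctime = entry->creation_time;
    st->mtime = entry->modification_time;

    /* Second option from the text: treat the last modification time as
     * the last access time as well. */
    st->atime = entry->modification_time;

    /* Not provided by srmls: return neutral/empty values. */
    st->nlink          = 0;
    st->symlink_target = NULL;
    st->uid            = 0;
    st->gid            = 0;
    st->ino            = 0;
    st->dev            = 0;
}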

There is a major issue when it comes to the implementation of the send() and recv() functions. This issue concerns how the actual data transfer will take place in the presence of the request translation wrapper. It is discussed in detail in section 3.4.1.

Another issue to consider is that there are occasions when QCDGrid communicates with the storage nodes not via GridFTP, but via the GRAM protocol [22]. In such cases, a GRAM executable is submitted to the storage node; this is essentially a shell script with shell commands that the node must execute. There are three GRAM executables used by QCDGrid:

- cp, when QCDGrid wants to perform a local copy of a file (within the same storage node). This situation arises during the file insertion operation, when the Control Node moves the file from the NEW directory to the actual destination directory.
- makepathvalid, when QCDGrid must create a non-existent destination directory on a storage node.
- checksum, when QCDGrid must verify the checksum of a file.

The use of the GRAM protocol would not be an issue if the SRM storage node also came with a GRAM server, or if it allowed the installation of one. But this may not always be the case. Hence, to be on the safe side, our request translation wrapper must also be capable of mapping these three GRAM executables to SRM commands. The mapping is quite straightforward, as shown in table 3.2. Unlike the GridFTP-SRM mapping, this mapping is more or less one-to-one. The only deviation is with checksum, where the excess information returned by srmls must be ignored.

  GRAM Executable    SRM Client Library Routine
  cp                 srmcopy
  makepathvalid      srmmkdir
  checksum           srmls

Table 3.2: Mapping of GRAM executables to SRM client library routines

The GRAM-SRM translator would also consist of a GRAM-server-like interface to accept incoming requests, and SRM client library routines for communication with the SRM server to carry out the required tasks, as shown in figure 3.10. But implementation of the GRAM-SRM translator would not be as complicated a process as that of the GridFTP-SRM translator, due to the straightforward mapping and the small number (only three) of executables to be mapped. Thus, a complete request translation wrapper to be deployed on QCDGrid would consist of a GridFTP-SRM translator and a GRAM-SRM translator.

[Figure 3.10: Translation of GRAM executables to SRM commands]

Owing to time constraints in the project, it was decided to follow a design-to-schedule approach for development. There was a choice regarding which functionality to implement first. If we go back to the QCDGrid use-cases that we considered, we will see that three out of the four use-cases would require the send() or the recv() functionality. The fourth one, the file listing operation, does not require any change at all in the presence of SRM nodes, as it does not involve communication with the storage nodes. Thus, it is easy to identify that the critical functions are send() and recv(). These functions will be involved in the storage and retrieval of files and also in the automated replication of files on QCDGrid. Implementing the send() and recv() functionality would also mean addressing the big issue of data transfer discussed in section 3.4.1. Any progress in this regard could also be potentially useful to future projects that work along the lines of adding SRM functionality to QCDGrid. Another point in favour of implementing these two functions first was that it would give a better idea about the data structures required by the DSI for a given session, which will have to be initialized in the start() routine.

For these reasons, it was decided to design and implement the send() and recv() functions first, and then, time permitting, move on to implementing the remaining functionality.

3.4.1 Design of send() and recv() Functions

As we saw in section 3.1.2, file retrieval on QCDGrid is a simple process of receiving the file location from the Control Node and then retrieving the file from the received location. The protocol used by QCDGrid for retrieving the file is GridFTP. In the presence of an intermediate request translation wrapper, however, an issue creeps up. The GridFTP server in the wrapper would receive the QCDGrid client's GridFTP get request, which it will pass on to the send() function in the DSI. The DSI will in turn call the srmcopy function to retrieve the file from the SRM node.

Let us take a closer look at what happens when the srmcopy function is called. The srmcopy function takes a site URL (SURL) as an argument. The SURL specifies the location of the SRM server managing the storage node and the location of the file within the node. So the SURL looks something like

srm://gx240msc.epcc.ed.ac.uk:8443/dat/sequence00910.dat

The client establishes a connection with the specified SRM server and makes a request to stage the file by calling srmPrepareToGet.

Staging implies different things in different storage systems, but it basically means preparing a file for transfer. For example, on a tape system, staging implies bringing the requested file online. The server, upon receiving the client's request, stages the file and returns a transfer URL (TURL) to the client, which contains the details of how the client can retrieve the file. It specifies the service that the client must use for the data transfer, the address of the server hosting the service, and the location of the staged file. Recall that SRM itself does not carry out the actual data transfer. Thus, the TURL returned for the given SURL might look something like

gsiftp://gx240msc.epcc.ed.ac.uk:9001/dat/sequence00910.dat

Here, gsiftp is the protocol that the client must use for the data transfer. Once the client receives the TURL, it invokes the appropriate data transfer service and carries out the transfer. At the end of the transfer, the client calls srmReleaseFiles to tell the SRM server to free the staged file. The whole process is illustrated in figure 3.11.

[Figure 3.11: Timeline diagram showing the working of srmcopy when retrieving a file]

It is possible that the file cannot be staged immediately upon the client's request. In this case, a pending status message is returned to the client. The client must then keep polling the status of the request until a ready status is returned, in order to know when the file is staged and ready for transfer. A sketch of this polling pattern is given below.
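The sketch below illustrates the stage-and-poll pattern just described. All of the srm_* types and calls here are hypothetical wrappers introduced for illustration; the real calls depend on the SRM client library and SRM protocol version in use.

/* Sketch of the stage-and-poll pattern.  All srm_* types and calls are
 * hypothetical wrappers introduced for illustration; the real API
 * depends on the SRM client library and protocol version used. */
#include <stddef.h>
#include <unistd.h>

typedef struct srm_request_s srm_request_t;
typedef enum { SRM_PENDING, SRM_READY, SRM_FAILED } srm_status_t;

int          srm_prepare_to_get(const char *surl, srm_request_t **req);
srm_status_t srm_status_of_request(srm_request_t *req);
int          srm_get_turl(srm_request_t *req, char *turl, size_t len);

static int
srm_stage_and_get_turl(const char *surl, char *turl, size_t turl_len)
{
    srm_request_t * req;
    srm_status_t    status;

    /* Ask the SRM server to stage the file behind the SURL. */
    if (srm_prepare_to_get(surl, &req) != 0)
        return -1;

    /* Poll until the file is staged ("ready") or the request fails. */
    do
    {
        sleep(5);                       /* back off between polls */
        status = srm_status_of_request(req);
    }
    while (status == SRM_PENDING);

    if (status != SRM_READY)
        return -1;

    /* The TURL names the transfer service and the staged copy, e.g.
     * gsiftp://gx240msc.epcc.ed.ac.uk:9001/dat/sequence00910.dat     */
    return srm_get_turl(req, turl, turl_len);
}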

Having understood what goes on behind the scenes when srmcopy is called, let us now get back to the original problem. The DSI makes the srmcopy call, during which it gets a TURL returned from the SRM server. If the DSI simply completes the call, then at the end of the call all the data ends up with the DSI. Considering that our ultimate interest is in the data reaching the client, we must now examine the ways in which this can be achieved. Does the srmcopy call have to proceed normally to completion, or can we take control of the call to optimize the data transfer through the DSI? Or is there a way to carry out the transfer bypassing the DSI altogether? If so, are there any caveats? Five approaches were considered to tackle this issue of data transfer. They are described one by one below.

[Figure 3.12: The working of the first approach to data transfer]

1. Copy everything to the DSI node and then transfer to the client.

This seems to be the most straightforward approach. Just let the srmcopy call complete, so that all the data ends up with the DSI. Once this is done, the DSI can provide this data to the pending GridFTP connection from the client. This is illustrated in figure 3.12. Though this approach is simple to implement, it suffers from one major weakness: the node hosting the DSI must have enough memory to hold the entire file being transferred, and the size of the file could be in gigabytes. This becomes a serious issue when we consider that several such retrieval operations could be in progress at the same time.

Another thing to take note of is that this copying of the entire file to an intermediate node before transferring it to the final destination effectively doubles the file transfer time. Depending on the situation, this could be a serious issue for large files, for which the transfer times could be several tens of minutes.

[Figure 3.13: The working of the second approach to data transfer]

2. Have the data transferred via a small local buffer.

This would seem to be the ideal way to carry out the file transfer: keep queueing the data into a local buffer while it is simultaneously being dequeued from the other end and transferred to the destination node by another thread. This is illustrated in figure 3.13. The advantage of this method comes to light if the storage-node-to-DSI connection is fast and the DSI-to-client connection is slow. Then the SRM storage node does not have to wait until the client has received all the data, but can be freed as soon as the data transfer to the DSI is finished. The implementation of this approach would be slightly more complicated, as there would need to be some form of synchronization between the two threads for access to the buffer queue. An issue arises if there is a significant difference in the transfer speeds. If the storage-node-to-DSI connection is much faster than the DSI-to-client connection, then a sizable amount of the data could be on the DSI node at any given time, which can again cause storage-space-related issues on the node hosting the DSI. On the other hand, if the DSI-to-client transfer speed turns out to be faster than the storage-node-to-DSI speed, then it is as good as having no buffer at all, and not worth the trouble of implementing a buffer queue and thread synchronization.

3. Copy and transfer small chunks of data at a time.

This approach is like the working of the previous approach in the case where the DSI-to-client transfer speed is faster. The difference is that there is only one thread. That thread runs in a loop and, in any given iteration, gets a small chunk of data from the SRM node and transfers this chunk to the destination node. This is shown in figure 3.14. This method does not suffer from the storage space requirements or the time consumption issue of the first approach. It is also simpler to implement than the second approach, which itself is likely to behave much like this one on occasion.

[Figure 3.14: The working of the third approach to data transfer]

4. Return the TURL to the client, and have the client carry out the file transfer directly.

Once the DSI receives the TURL, instead of starting the file transfer and forwarding the data to the client, it just sends the TURL of the staged file to the client on the pending GridFTP connection. The client then places a new request on the returned TURL and carries out the file transfer directly, as shown in figure 3.15. The transfer of data in this case is single-step, as opposed to two-step (storage node to DSI, then DSI to client) in the previous approaches, making this approach faster than the previous ones.

But this approach faces a serious issue. There is no way the DSI can know when the file transfer between the storage node and the client completes. The DSI needs to know this because only after the transfer completes can it issue the srmReleaseFiles command to the SRM server. We could program the DSI to wait for a specified amount of time, depending on the file size, for the transfer to complete. However, this technique would prove to be highly inefficient. Another problem with this approach is that it breaks the standard QCDGrid request-response model: it takes two get requests from the user to retrieve a single file. It is possible to reprogram the client so that the second request happens automatically. But along with changes to the QCDGrid client code, this technique would also require knowledge of which storage nodes are SRM and which are not, thus defeating the very purpose of following the request translation wrapper design in the first place.

[Figure 3.15: The working of the fourth approach to data transfer]

5. Carry out a third-party GridFTP file transfer between the SRM node and the client.

Third-party file transfer is a feature of GridFTP by which data transfer can be carried out by a client between two remote nodes. The only requirement is that both nodes between which the data transfer is carried out must be hosting GridFTP servers. Thus, when the SRM server returns the TURL to the DSI, the DSI could issue a command for a third-party transfer from the returned TURL directly to the QCDGrid client.

This method would be quite effective. Like the previous approach, the transfer of data would be single-step. But unlike the previous approach, in this case, as per the GridFTP protocol specification, the DSI would be notified of the completion of the transfer. The standard QCDGrid request-response model is also respected. However, this approach suffers from two major disadvantages. The first is that, since third-party GridFTP transfers can take place only between GridFTP servers, the client too would need to have a GridFTP server running. The second is that, since the data transfer runs independently, there is no way to monitor the progress of the transfer until the transfer completes (or fails). Another issue this approach faces is that once the file transfer is complete, we still have a pending GridFTP connection from the client, on which the file request was made in the first place. To indicate the success of the file transfer, this call would have to return a success code. This would have to be done by passing the macro GLOBUS_SUCCESS to the globus_gridftp_server_finished_transfer function within the DSI. The pending call would then return a success code even though no data transfer actually took place through that particular connection. Thus, in many ways, this approach is akin to a hack. The approach is illustrated in figure 3.16.

[Figure 3.16: The working of the fifth approach to data transfer]

After considering the advantages and disadvantages of each method, it was decided to follow the approach of copying and transferring small chunks of data at a time. In fact, the SRB-DSI also follows a similar approach of copying and transferring bits of data at a time. It was then necessary to think about how to get small chunks of data at a time and what the optimum size of the chunks to be transferred in each iteration would be. For the former, GridFTP provides a function called globus_ftp_client_partial_get, in which the offset and block size of the data to be retrieved can be specified. For the latter, GridFTP provides a function called globus_gridftp_server_get_optimal_concurrency() to return the optimal size of each chunk of data to be written.

The strategy to implement the chosen approach would be to take control of the srmcopy call by manually calling its constituent functions. The first call would be to srmPrepareToGet, which will cause the server to stage the file and return the TURL. Once the DSI gets the TURL, it will send a globus_gridftp_server_get_optimal_concurrency request to the GridFTP server of which the DSI itself is a part. This will return the optimal size of the chunks of data that must be transferred to the client in any one go. Next, it will call globus_gridftp_server_begin_transfer, as specified in the DSI requirements, to prepare the server to receive data. Having got the optimal chunk size, the DSI will now start a gsiftp file transfer session with the storage node. It will make repeated calls to globus_ftp_client_partial_get with the block size the same as the optimal chunk size. With the return of each such call, it will transfer the acquired chunk of data to the client node by calling globus_gridftp_server_register_write. When the entire file has been transferred, the DSI will call srmReleaseFiles for the SRM node, and globus_gridftp_server_finished_transfer for the GridFTP server. The timeline diagram for the whole process is shown in figure 3.17. A sketch of this chunked transfer loop is given below.
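The sketch below condenses this strategy into a single function. It is only a sketch under strong assumptions: the asynchronous callback plumbing of the globus_ftp_client and gridftp_server APIs is collapsed into a simple loop, srm_stage_and_get_turl and srm_release_files are the hypothetical SRM wrappers used in the earlier sketch, and write_done_cb is a placeholder callback name.

/* Sketch of the chosen chunked-transfer strategy.  The asynchronous
 * callback plumbing of the globus_ftp_client and gridftp_server APIs is
 * collapsed into a simple loop here; a real DSI must drive this from
 * completion callbacks.  srm_stage_and_get_turl(), srm_release_files()
 * and write_done_cb are hypothetical names. */
#include <stddef.h>
#include "globus_gridftp_server.h"

int srm_stage_and_get_turl(const char *surl, char *turl, size_t len);
int srm_release_files(const char *surl);
void write_done_cb(globus_gfs_operation_t op, globus_result_t result,
                   globus_byte_t *buffer, globus_size_t nbytes,
                   void *user_arg);

static void
srm_dsi_send_chunked(
    globus_gfs_operation_t      op,
    const char *                surl,
    globus_off_t                file_size,
    void *                      user_arg)
{
    char            turl[1024];
    int             optimal_count;
    globus_size_t   block_size;
    globus_off_t    offset;
    globus_byte_t * buffer;

    /* 1. Stage the file behind the SURL and obtain the TURL. */
    srm_stage_and_get_turl(surl, turl, sizeof(turl));

    /* 2. Ask the server how many chunks to keep in flight and how large
     *    each chunk should be. */
    globus_gridftp_server_get_optimal_concurrency(op, &optimal_count);
    globus_gridftp_server_get_block_size(op, &block_size);
    buffer = globus_malloc(block_size);

    globus_gridftp_server_begin_transfer(op, 0, NULL);

    /* 3. Fetch one chunk at a time from the staged TURL and forward it
     *    to the client until the whole file has been sent. */
    for (offset = 0; offset < file_size; offset += block_size)
    {
        globus_off_t end = offset + block_size;
        if (end > file_size)
            end = file_size;

        /* In a real DSI this step is globus_ftp_client_partial_get() on
         * the range [offset, end), with register_read calls filling
         * 'buffer', completed via callbacks. */

        globus_gridftp_server_register_write(
            op, buffer, (globus_size_t)(end - offset), offset, -1,
            write_done_cb, user_arg);
    }

    /* 4. Release the staged file and finish the GridFTP transfer. */
    srm_release_files(surl);
    globus_gridftp_server_finished_transfer(op, GLOBUS_SUCCESS);
}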

[Figure 3.17: Timeline diagram for the chosen approach to data transfer]

Chapter 4

Implementation

To carry out the implementation of the chosen design, a testbed had to be prepared first. The following section describes the preparation of the testbed and the issues faced in the process.

4.1 Preparation of Testbed

For the purpose of this project, a machine was allocated on which to carry out the implementation. The machine was an Intel 686 running Scientific Linux and was called gx240msc.epcc.ed.ac.uk.

4.1.1 Installation of GridFTP

The first step in the testbed preparation was the installation of Globus GridFTP. The Globus Toolkit source code was downloaded and a selective build and installation of GridFTP was performed. There were no issues, and the installation went smoothly. The installed software was tested by carrying out the transfer of some dummy files. GridFTP worked fine for ftp file transfers. To test it for gsiftp transfers, host and client certificates were installed on the machine. However, the test failed. Not having much experience with proxies and certificates, Radek's help was sought at this stage. The problem was traced to a mismatch between the hostname as specified in the machine's configuration file (gx240msc) and as specified in the host certificate (gx240msc.epcc.ed.ac.uk). Once the hostname in the configuration file was changed to match that in the certificate, GridFTP started working fine with gsiftp as well.