
Experience with LCG-2 and Storage Resource Management Middleware

Dimitrios Tsirigkas

September 10th, 2004

MSc in High Performance Computing
The University of Edinburgh

Year of Presentation: 2004

Authorship declaration

I, Dimitrios Tsirigkas, confirm that this dissertation and the work presented in it are my own achievement.

1. Where I have consulted the published work of others this is always clearly attributed;
2. Where I have quoted from the work of others the source is always given. With the exception of such quotations this dissertation is entirely my own work;
3. I have acknowledged all main sources of help;
4. If my research follows on from previous work or is part of a larger collaborative research project I have made clear exactly what was done by others and what I have contributed myself;
5. I have read and understand the penalties associated with plagiarism.

Signed:            Date:            Matriculation no:

Abstract

The University of Edinburgh is participating in the ScotGrid project, working with Glasgow and Durham to create a prototype Tier 2 site for the LHC Computing Grid (LCG). This requires that LCG-2, the software release of the LCG project, be installed on the University hardware. As a site that will mainly provide storage, Edinburgh is also actively involved in the development of ways to interface such resources to the Grid. The Storage Resource Manager (SRM) is a protocol for an interface between client applications and storage systems. The Storage Resource Broker (SRB), developed at the San Diego Supercomputer Center (SDSC), is a system that can be used to manage distributed storage resources in Grid-like environments. In this report, we describe work done during a period of sixteen weeks, in the context of an MSc in High Performance Computing. The first part of the work involved helping to set up LCG software at the Edinburgh ScotGrid site and to monitor the hardware using the Ganglia distributed monitoring system. The second part of the work aimed at the development of an interface between the SDSC Storage Resource Broker and an implementation of the SRM specification developed at Lawrence Berkeley National Laboratory (LBNL).

Acknowledgements

I would like to thank James Perry and Philip Clark for supervising my dissertation. I am also grateful to Alasdair Earl and Steve Thorn, who offered a great deal of both practical help and information in the course of my project. Paul Walsh should also be thanked for helping me write the LaTeX file for this document.

Contents

List of Figures
1 Introduction
2 Background on Grid Computing
  2.1 Virtual Organisations
  2.2 Grid Computing
      Security
      Information
      Data and Storage Resources Management
      Job and Computing Resources Management
  2.3 Grid Projects
      Globus
      The European Data Grid
3 LCG, GridPP and ScotGrid
  3.1 The Large Hadron Collider - why Grid Technologies?
  3.2 The LCG Project
  3.3 Status and near future of the LCG Project
  3.4 GridPP
  3.5 ScotGrid
      ScotGrid Hardware in Edinburgh
4 LCG-2
  4.1 Interaction with the user and the applications
  4.2 Interaction with the resources
  4.3 Security
  4.4 Information System
  4.5 Job Management
      The Job Description Language
      Command line tools
  4.6 Data Management
      File names
      Command line tools
  4.7 Relevance to the Dissertation
5 LCFGng
  5.1 The architecture of LCFG
      Source Files
      Profiles
      Components
  5.2 LCFG and LCG-2. Relevance to the Dissertation
      Installing LCFG
      Installing LCG
      Relevance to the Dissertation
6 Monitoring with Ganglia
  6.1 The Ganglia Architecture
  6.2 Metrics
  6.3 Transmitting and Storing Monitoring Information
      Messages on the multicast channel
      XML messages
      Storing Monitoring Information
  6.4 The PHP Front end
  6.5 Using Ganglia for the Scotgrid Hardware. Relevance to the Dissertation
      Using LCFG to configure Ganglia
      The near future
      Monitoring Examples
7 SRM and SRB
  7.1 SRM
      SRM file and storage space types
      SRM functionality
      File pinning
      LCG-2 and SRM
      The LBNL SRM
  7.2 The SDSC SRB
      The SRB architecture
      The Metadata Catalogue
      The S-commands
      The SRB client API
  7.3 An interface between SRM and SRB
      Installing SRB from source
      Installing SRM from source
      Thoughts on an Interface
8 Summary and Conclusions
  8.1 Summary of the MSc Project
  8.2 Post Mortem
      8.2.1 General Issues
      8.2.2 Specific Issues
      8.2.3 Final thoughts
Appendices
A leak.c
B Index of Acronyms
Bibliography

List of Figures

2.1 The structure of the EDG project. Image taken from [3].
3.1 A logical layout of ScotGrid. Image taken from [6].
5.1 The LCFG architecture.
6.1 Ganglia cluster hierarchies.
6.2 Ganglia on the Edinburgh front end.
6.3 The leak program output, a terminal running the top utility and the Ganglia webpage. The program claims to have allocated 1253 MB, the top utility gives a value of 1.3 GB and the value shown on the Ganglia graph for the total memory usage is approximately 1.4 GB.
6.4 Memory usage, free memory and free swap memory. We notice that the first two graphs are consistent. The third graph shows the usage of swap memory for the second run.
6.5 The Ganglia page for Glenmorangie shortly after the file transfer. The start of the file transfer resulted in a quick rise in memory usage. When all the memory was used, Glenmorangie only received and processed data as quickly as it could write it to the hard disk. The result was a drop in CPU and network activity.
6.6 As Glenmorangie receives packets of data it sends confirmation packets back to Glenkinchie. Almost 20 minutes after the transfer started and with almost 5 GB transferred, the process was stopped, since there was not enough disk space on Glenellen to hold a 21 GB file.
7.1 SRM filetypes. Image taken from [22].
7.2 SRM and file transfers. Image taken from [17].
7.3 DRM, TRM and HRM and the systems they interface to the Grid. Image taken from [21].
7.4 The SRB architecture. Image taken from [20].

Chapter 1
Introduction

When the Large Hadron Collider (LHC) comes online in 2007, it will become the largest elementary particle accelerator ever to have operated in the world. Four experiments will be conducted on the LHC and the data generated will scale to Petabytes. Managing this data efficiently across a worldwide network of collaborating institutes and universities is a challenge which the particle physics community has chosen to address using Grid computing. The University of Edinburgh is one of the major contributors to this effort. It possesses substantial storage resources to be used for storing LHC data and is currently in the process of connecting them to a prototype Grid being set up by a number of institutes in the UK. This document details the work completed as part of a dissertation project for the MSc in High Performance Computing at EPCC. The project had two main parts. The first part involved work on setting up and configuring the Edinburgh Grid site. The goal of the second part was to create an interface between two pieces of middleware used to manage storage resources in a distributed environment. The contents of the chapters following this introduction are summarised below.

Chapter 2 provides a background in Grid computing; all the concepts necessary for understanding the following chapters can be found here. There is also a brief description of two Grid-related projects that are very relevant to this work, Globus and the European DataGrid.

In Chapter 3 we explain why modern experimental particle physics can benefit from Grid computing. We then introduce three related projects: the LHC Computing Grid, GridPP and ScotGrid. The LHC Computing Grid is a successor of the European DataGrid and aims at utilising Grid technologies to address the computing needs of the Large Hadron Collider experiments. GridPP is an effort to produce the infrastructure and deploy the technology for the creation of a particle physics Grid in the UK, and ScotGrid is the subset of GridPP that covers the Scottish Grid sites.

Describing LCG-2, the latest release of the LHC Computing Grid, is the main purpose of Chapter 4. We will see how LCG-2 attempts to address the challenges associated with any Grid

project and provide an outline of how it can be used.

LCFGng is a piece of software that was developed at the University of Edinburgh and is used in many Unix clusters. It provides an automatic way to install and configure cluster nodes, making administration easier. Chapter 5 describes the main aspects of LCFGng, how it can be used together with LCG-2 and how it has been used in the context of this dissertation.

Chapter 6 describes Ganglia, a distributed monitoring system for clusters and cluster hierarchies. Ganglia is being used by GridPP sites, including Edinburgh, to monitor their equipment. The chapter also explains how Ganglia was used for this dissertation.

Interfacing storage resources to the Grid is far from trivial. The Storage Resource Manager is a specification defining the ways in which a storage resource should be accessible to applications through the Grid, and has already been implemented for different storage systems. The Storage Resource Broker is a complete solution for using storage resources in a distributed environment. Chapter 7 discusses SRM, SRB and how they could be used together, one of the questions this work aimed to answer.

Chapter 8 closes the dissertation with a brief recap of the work done. It explains how this project resulted in the gaining of knowledge and experience, highlights the problems encountered and attempts to find the reasons why they occurred.

Chapter 2
Background on Grid Computing

2.1 Virtual Organisations

A Virtual Organisation (VO) is a dynamic collection of individuals and/or institutions willing to share information and computing resources to achieve a common goal. This sharing is regulated by a set of agreed-upon rules, which define the role and the privileges of each entity within the organisation. An example of a virtual organisation would be an international collaboration of universities, industry and government agencies aimed at developing and testing a new type of experimental aircraft. Such a large-scale project would require sharing not only technology and scientific expertise, but also computing power to simulate the aircraft operation and storage resources to keep the data used during the R&D process. This organisation would obviously operate under a set of rules and would be dynamic in nature, as members could join and leave and political and financial circumstances could change at any stage.

2.2 Grid Computing

Grid Computing is the science that makes the existence of VOs possible by addressing their computing needs. There are four categories of issues that Grid technology is faced with:

- Information
- Data and Storage Resource Management
- Job and Computing Resource Management
- Security

In this section we will provide a brief description of each of the four areas. When there are standard ways in which the issues are addressed, we will also provide brief outlines.

Security

It is easy to understand why security is an important concern for Grid computing. In a Grid environment, users, institutions and individuals make their resources and data available to a large number of people and certainly need some guarantee that this does not put them at risk. Everyone needs to maintain at least some level of control over what kinds of applications run on their personal machine or cluster, and has to trust their users. Data in a storage system connected to the Grid might be confidential or even classified. How can resources with different security policies safely be made to interoperate?

The standard way to address the issue is based on public key cryptography. Public key cryptography enables entities to authenticate each other. More specifically, every entity on the Grid has two unique strings. One is only known to the entity and the other is made public. The private and the public key share a relationship that makes it very difficult to derive the former from the latter. The public key works with encryption algorithms to produce unreadable (encrypted) forms of data. Encrypted data can then be decrypted by the private key only. This ensures that anyone can safely send confidential data to the owner of the keys. Therefore, if A is able to read encrypted information sent by B, B can be sure that A is the owner of the public key he/she used for the encryption. The private and the public key can be used with reversed roles as well, in which case the reader decrypting the document using the public key can be sure that it has been written by the private key holder.

The above procedure can be prohibitively slow, since encryption and decryption algorithms can take a long time to execute for large texts. Digital signatures make it faster. From any document, a string of characters of standard length (the digest) can be produced by means of a hash function. This string characterises the document - it is extremely unlikely that different documents will produce the same string. If the digest is encrypted with A's private key and sent along with the document, B can use the public key to decrypt it. Then B can pass the document to the hash function, compare the digests and thus authenticate A. Therefore, the digest of any document, encrypted with A's private key, plays the part of a digital signature. Of course, the difference between a digital signature and a real signature is that digital signatures are unique to the document they are used for.

However, in a Grid environment, a resource cannot hold the public key of every user, nor is it possible to know that a public key belongs to the right person. Certification Authorities (CAs) are authorised by all members of a VO to provide them with certificates, and every member has access to the CA's public key. A certificate is a document with the owner's details and public key, digitally signed by the CA. This can be used by two entities on the Grid to authenticate each other. There are different certificate formats, but the one most widely used is the Internet Engineering Task Force's X.509. More specifically, if A and B want to authenticate each other, A sends B her certificate. B uses the CA's public key to make sure that the public key on the certificate belongs to the person detailed on it. Then B can use A's public key to encrypt something and send it back to A. A decrypts it with her private key and returns it to B. If the returned document is identical to the original, A has been authenticated successfully and the same procedure can be repeated for B. In the end, both A and B are sure of each other's identity. The mutual authentication process is automated and happens by having the user type a secret phrase, which allows the use of her encrypted private key.

It is desired, however, that a user only authenticates himself once, at the start of a Grid session, as opposed to every time he interacts with a new resource or user. This is achieved by means of proxy certificates. Proxy certificates contain the same information about their owner as the normal certificate, but are digitally signed by the owner during the initial sign-in and expire sooner. The public and private keys for the proxy certificate are new, and the private key is not stored encrypted and can be accessed without requiring a pass phrase. Therefore, when A needs to prove his identity to B, he sends both his CA-issued certificate and his proxy. B can then use the CA's public key to make sure his certificate is genuine and then use his public key to verify the signature on the proxy. After this point, authentication of A can proceed as described above, but using the public and private keys corresponding to the proxy.
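As a concrete illustration of the ideas above, the following shell commands sketch how a digest-based signature and a certificate can be handled with OpenSSL, and how a short-lived proxy is created with the Globus GSI tools that LCG-2 builds on. The file names are hypothetical and the exact options and output vary between versions; this is only meant as a sketch.

    # Sign a document: hash it and encrypt the digest with A's private key
    openssl dgst -sha1 -sign userkey.pem -out document.sig document.txt
    # B verifies the signature using A's public key
    openssl dgst -sha1 -verify userpub.pem -signature document.sig document.txt

    # Inspect an X.509 certificate: owner, issuing CA and validity period
    openssl x509 -in usercert.pem -noout -subject -issuer -dates

    # Create a short-lived proxy certificate (prompts for the pass phrase once)
    grid-proxy-init
    # Check the subject and remaining lifetime of the proxy
    grid-proxy-info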

Information

A Grid is an environment of a very dynamic nature. The number of users and the status and availability of resources are constantly changing. Moreover, it is desirable that users are not necessarily aware of the details of the resources they have access to, and therefore they may not be able to specify exactly which ones they will use to run their applications or store their data. This means that the Grid should be self-aware, in the sense that information on its status and that of its resources should somehow be made available to a number of services capable of answering requests for that information. This would allow for efficient allocation of resources, sensible management of data and quick diagnosis of technical problems.

There are a number of important requirements that an Information Service for the Grid should fulfil. Perhaps the most important one is that it should be distributed. This would make it independent of one or a small number of specific, centralised servers and would guarantee that it would continue to function even though individual resources on the Grid may fail. Besides that, the way information is accessed should be standard and platform-independent, to make sure that every resource can publish its status and every authorised user or application can view it. Another important requirement arises from the previous one, as some information on available resources may have to be available only to a group of authorised users. This means that the information and the security services need to be made to cooperate to ensure the safekeeping of sensitive information. A standard tool used by Grid Information Services is the Lightweight Directory Access Protocol (LDAP). LDAP is a specification that defines how messages that contain relevant information are formulated and exchanged between applications and the databases that store it.

The usefulness of LDAP has resulted in the development of an open source implementation, OpenLDAP, which is heavily used in most, if not all, major Grid projects.

Data and Storage Resources Management

One of the facts that makes Grid Computing attractive is that a lot of current scientific efforts result in the production of large amounts of data, which need to be safely stored and easily accessed by the scientists - a good example is provided in the next chapter. The management of data is therefore one of the major challenges faced by Grid technologies and is certainly the most relevant to this dissertation. A very important issue that should be addressed is how the great variety of data storage systems are interfaced and connected to the Grid. A VO can be expected to have resources including disk caches, RAID arrays and tape storage systems. The requirement for transparency rules out the possibility of demanding that clients be able to access such a variety of systems; therefore all the different resources should present the same face to the Grid. This is one of the major issues this dissertation is concerned with, and ways of interfacing storage resources to the Grid will be discussed in greater detail in other chapters.

Another requirement is that data can be easily and quickly accessed by those authorised to view and manipulate it. This means that replicas of the same data must be distributed among resource sites and that there have to be fast and reliable ways to access them. A standard protocol used for the transfer of data between sites is GridFTP. In order to use data, one must first be able to locate it and make sure it contains useful information. Therefore, two more issues arise: how is it possible to find a replica on the Grid, and how can it be certified that this replica contains up-to-date information, suited to the user's needs? These questions point out the need for replica and metadata services. Typically, a metadata service provides data about the data contained in a file, so that its quality and suitability can be determined. Afterwards, the replica service returns the physical locations of the files satisfying the user's demands.

Job and Computing Resources Management

The management of jobs and the computing resources they run on is another difficult challenge faced by Grid Computing. A first issue that arises is portability. Very few programs are developed and tested in a way that ensures they can be run on any platform. Furthermore, how can users be guaranteed that the computing resource used for their jobs fulfils their performance requirements? Therefore, there is a need to enable the user to submit a set of requirements together with her job, thus limiting the range of systems the job could be assigned to. This assumes that there is a service deployed which makes the choice of the resource, based on those requirements and on what is available.

Connecting computing resources to the Grid is also an issue in itself. The main problem is that individual resources are governed by their own policies and have their own batch systems. For obvious reasons the owners of the resources will not be willing to abandon their choices, so these different systems all need to be interfaced to the Grid.

2.3 Grid Projects

A brief overview of two of the most important Grid Computing projects to date is provided in this section.

Globus

The Globus project [2] aims at developing Grid technologies. The major contributors are the Argonne National Laboratory, the Information Sciences Institute at the University of Southern California, the University of Chicago, the University of Edinburgh and the Swedish Center for Parallel Computers. The activities of this collaboration fall under four categories: research, development of Grid software tools, development of applications that make use of Grid computing and setting up of testbeds for Grid technologies. The Globus project has released the Globus Toolkit (GT), a collection of middleware addressing the major challenges in Grid Computing. The GT, particularly in its second version, was very successful in providing tools that could be used to build information, data management and resource management services, as well as a solid security infrastructure. As a result it is being used extensively by other projects like the European Data Grid and the LHC Computing Grid. The latest version, GT3, introduces the idea of grid services and proposes a new Grid architecture based on that idea. The next version, GT4, will go one step further and no longer support the grid services paradigm. However, the particle physics community has not yet embraced GT3 and it will probably be long before it embraces GT4, so those two versions are outside the scope of this work.

The European Data Grid

The European Data Grid (EDG) [3] was an EU-funded project pursuing the development and testing of Grid technologies to be used for scientific purposes. The project ran from 2001 to 2004 and was led by CERN. Other contributors included the European Space Agency (ESA), France's Centre National de la Recherche Scientifique (CNRS), Italy's Istituto Nazionale di Fisica Nucleare (INFN), the Dutch National Institute for Nuclear Physics and High Energy Physics (NIKHEF) and the UK's Particle Physics and Astronomy Research Council (PPARC) [3].

The structure of the EDG project is illustrated in Figure 2.1. The work was divided into work packages. Work packages 1 to 5 involved the development of Grid middleware to address the major issues in Grid Computing, work package 6 was about the creation of testbeds for the developed technologies, work package 7 targeted network services and work packages 8 to 10 were concerned with the development of scientific Grid applications. A final work package covered the management of the project. The four working groups that carried out the project were the Testbed and Infrastructure group (WP 6-7), the Applications group (WP 8-10), the Computational and DataGrid Middleware group (WP 1-5), and the Management and Dissemination group (WP 11).

Figure 2.1: The structure of the EDG project. Image taken from [3].

EDG achieved important advances in Grid technology and developed specifications and implementations that are still in use in the context of the programmes that succeeded it - the EGEE project, which stands for Enabling Grids for E-science in Europe, and the LHC Computing Grid, which is specific to elementary particle physics and will be covered in the next chapter.

Chapter 3
LCG, GridPP and ScotGrid

3.1 The Large Hadron Collider - why Grid Technologies?

The Large Hadron Collider (LHC) at CERN is scheduled to begin operation in 2007 and will be the most powerful particle accelerator in the world. There are four experiments currently being prepared which will take data from it: CMS, ATLAS, ALICE and LHCb. All four of these experiments are international collaborations involving hundreds of institutions and approximately six thousand scientists. The purpose of the LHC experiments is to study the fundamental properties of matter at high energies and test the current theories used to describe them. All four experiments are based on the idea of detecting and tracing the interactions and movement of accelerated particles by means of sophisticated detectors and a large number of complicated electronic devices. The digital output produced by the LHC experiments, in total, is expected to be of the order of Petabytes per year. The creation of a single resource devoted to the purpose of storing and analysing this data is impossible, for political and practical reasons. However, many of the contributing institutions possess considerable computing resources, which could be made available to the other members of their collaborations to serve the common purpose. At the same time, the varying nature of those resources and the different individual policies regulating their use make this a complicated and difficult task.

The LHC is faced with problems to which Grid computing is well suited. Each of the four collaborations constitutes a Virtual Organisation, which aims to combine a set of computing resources and to enable its members to access and manipulate data distributed among many different geographical locations and administrative domains. It should now be apparent that the LHC provides an ideal opportunity for the application of Grid technologies.

3.2 The LCG Project

The LHC Computing Grid (LCG) is the project that will prepare the computing infrastructure for the simulation, processing and analysis of LHC data [4]. In other words, the LCG project aims to provide the necessary middleware that will allow the interconnection of the diverse computing resources within the same computational data grid. The LCG middleware will provide transparent access to these resources and serve as a basis for the data-intensive high energy physics applications undertaking the actual science. However, the LCG project is not just about developing middleware. It includes the coordination of the resource sites and the exchange of experience and information among them, the deployment of the services, the extensive testing of the resulting system and the monitoring and support of its operation while the LHC experiments are running.

In the context of LCG, the different sites connected together are grouped into Tiers. The LHC experiments, where the data is produced, constitute Tier 0. CERN, from where the data is distributed to the other sites, is a Tier 1 center. Tier 1 centers are sites with major storage and computing resources that often operate on the national level. Smaller sites, for example universities or laboratories, which may or may not possess considerable resources of one or both kinds, cooperate in forming regional Tier 2 centers. An individual university is itself a Tier 3, whether participating in a Tier 2 or not. Finally, individual desktops are Tier 4. In the UK, the Rutherford Appleton Laboratory is the Tier 1 center and the University of Edinburgh is one of the Tier 3 sites.

3.3 Status and near future of the LCG Project

The LCG project is currently in its first phase, which started in 2002 and is expected to last until 2005. During this phase, prototypes and actual implementations of the LCG Grid services are being developed and tested in increasingly demanding data challenges, separately for each of the experiments. The releases of the LCG middleware resulting from this process are being installed at an increasing number of sites and work is being done to interface them to the local resource management systems. The goal of this phase is to produce a detailed and complete design for the final system that will be in place by the time the LHC begins operation. The next phase or phases of the project will be concerned with the execution of this design and the maintenance and possibly further development of the system during its operation. The latest product of the LCG project, LCG-2, comprises a set of middleware tools. It was released in April 2004 and will be running throughout 2004. The next chapter is devoted to LCG-2 and provides a detailed discussion of its components and functionality.

3.4 GridPP

GridPP is a project involving 19 British universities, the Rutherford Appleton Laboratory and CERN. It began in 2001 with the purpose of creating a Grid for particle physics, which could ultimately expand to provide services to a wider range of scientific disciplines. The areas in which it is active are the development of Grid applications for particle physics, the development of Grid middleware and the deployment of the current technology in testbeds across the UK. Naturally there are close ties between GridPP and LCG. Through the GridPP project, the UK has become the most active LCG participant of all CERN member states. In the context of GridPP, and apart from the Tier 1 center at RAL, there are currently four regional Tier 2 centers in the UK: London, SouthGrid, NorthGrid and ScotGrid [insert figure here].

3.5 ScotGrid

ScotGrid is a collaboration between the Universities of Edinburgh, Glasgow and Durham. Its purpose is to develop a prototype Tier 2 center for the particle physics Grid in the UK. Once connected to the Grid, the resources provided by the ScotGrid institutes will be used by scientists involved with the LHCb and ATLAS experiments to perform simulations and data analysis. A logical layout of the ScotGrid system can be seen in Figure 3.1.

ScotGrid Hardware in Edinburgh

In Edinburgh, the effort is concentrated on storage and data management issues, which is natural since the main part of the available resources comprises storage. More specifically, the ScotGrid storage hardware at Edinburgh comprises an IBM eServer x440 with 8 Intel Xeon 1.9GHz CPUs and 32 GB of RAM, an IBM dual FAStT900 22TB RAID array and 10TB of the 155TB Sun Microsystems Storage Area Network that the university has recently acquired. At the front end, interfacing the ScotGrid hardware with the Grid, are two IBM eServer x205 machines (Intel P-IV 1.8GHz with 256 MB of RAM), Glenellen and Glenlivet, as well as an IBM eServer x305 (dual Intel P-III Xeon 1GHz with 2 GB of RAM), Glenmorangie. Furthermore, there will soon be four dual Intel Xeon 2.8GHz machines, with 2 GB of RAM and a 200GB EIDE hard disk each, to be used as Worker Nodes, i.e. machines running jobs submitted to the site from the Grid.

Figure 3.1: A logical layout of ScotGrid. Image taken from [6].

Chapter 4
LCG-2

This chapter provides an overview of LCG-2, which is the latest release of the LCG project. It is based on work that was conducted by the European DataGrid [3] project and incorporates components of the Globus Toolkit [2]. We provide a brief description of how LCG-2 addresses the following areas, which are crucial to any Grid system: security, information, job management, data management and interaction with the user, the applications and the resources.

4.1 Interaction with the user and the applications

A Command Line Interface (CLI) and a Graphical User Interface (GUI) handle the interaction between LCG-2 and the user. The CLI allows the user to identify himself as someone authorised to use the Grid, to use the information service to find out about the resources available on the Grid, and to submit and manage jobs. The Java GUI provides the same functionality in a more user-friendly way. It contains an editor for the Job Description Language (JDL), as well as two more components for the submission and monitoring of jobs. The functionality of the user interfaces can be passed to applications by means of various APIs provided by LCG-2. Many of the potential uses of the LCG-2 UI will be mentioned in the following sections.

4.2 Interaction with the resources

The ability of the system to interact with the resources on the Grid is provided by the Storage Element (SE) in the case of a storage resource and by the Grid Gate (GG) in the case of a computing resource. The Storage Element was first defined in the context of the European DataGrid project (Work Package 5). Setting up a computer as a storage element means connecting the computer to the Grid and using it as a server to enable access to the storage space of the resource. Users

of the Grid can communicate with the various SEs by means of standard protocols, typically without needing to know any resource-specific details. The SE providing the interface to a resource can handle all necessary internal communications to serve the users' requests. The proposed architecture of the SE software is described in detail in [5].

A number of homogeneous computing nodes (Worker Nodes or WNs) connected to the Grid as a single entity are called a Computing Element (CE). A Computing Element also includes a node called the Grid Gate (GG) which interfaces the resource to the Grid. The Grid Gate uses a Globus tool called the Grid Resource Allocation Manager (GRAM), as well as the local resource manager and a logging and bookkeeping server that keeps track of the functions performed by the resource. By managing the Worker Nodes, the GG can satisfy job requests coming from the Grid even though they are not adapted to the local batch system.

4.3 Security

LCG-2 has adopted the Grid Security Infrastructure (GSI) developed by the Globus project. The way it handles security is based on the principles described in the previous chapter. In order to use LCG-2, users have to be members of one of the LCG VOs. They can then request certificates from one of the recognised CAs. With a certificate installed in a browser, the users must visit the LCG-2 registration webpage and register with the LCG-2 service.

4.4 Information System

The information system provided by LCG-2 has been taken from the Globus Toolkit and the EDG project. Every individual resource runs a service called the Grid Resource Information Service (GRIS). This service can report on the characteristics and the state of that specific resource to the Grid Index Information Service (GIIS) that runs at the aggregate level. The GRIS and the GIIS can be queried by users (and their applications) and can provide information about the resources connected to the Grid. Additionally, there is another service, called the Berkeley DB Information Index (BDII), which stores information from multiple GIISes in a database that can also be queried by the user. Information to and from the databases of the Grid Information Service is accessed using OpenLDAP. The ways in which a user can make indirect use of the Information Service through the CLI will be described in the following two sections, on Job and Data Management. However, the CLI also provides the user with direct access to the service by allowing him to query the information databases for the status of specific resources.
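Because the information system is LDAP-based, it can be queried directly with standard LDAP clients. The following sketch, using hypothetical host names, shows the kind of query that can be sent to a GRIS or a BDII with the OpenLDAP ldapsearch tool; the port numbers (2135 for a GRIS, 2170 for a BDII) and the base DN are the usual defaults in LCG-2-era deployments, but may differ at a given site.

    # Ask the GRIS of a single resource for everything it publishes
    ldapsearch -x -H ldap://se.example.ac.uk:2135 -b "mds-vo-name=local,o=grid"

    # Query a BDII, which aggregates information from many GIISes,
    # restricting the output to Storage Element entries
    ldapsearch -x -H ldap://bdii.example.ac.uk:2170 -b "o=grid" "(objectClass=GlueSE)"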

4.5 Job Management

The set of services that provide the ability to manage jobs are collectively called the Workload Management System (WMS). The WMS includes services for receiving job requests from the user or application, finding suitable resources to run them on, modifying them to run in the environment of the Grid and managing them afterwards.

The Job Description Language

The Job Description Language that LCG-2 uses is a product of the Condor project [7] and is called the Classified Advertisement (ClassAds) language. It allows the user to create descriptions of his job, specifying important characteristics. Examples include the environment under which the job should run, the names of the input and output files, a particular CE where the job should run and the SE where the output files should be uploaded. The JDL also provides the ability to set requirements in terms of proximity and performance capability of the CE and of the available storage space of the SE.

Command line tools

There are a number of tools in the CLI for the purpose of submitting and managing jobs. Through those tools, the WMS provides all the functionality usually required from a job management system, including submitting, cancelling and checking the status of jobs. It also supports the ability to list the available CEs and the submission of interactive jobs. Another useful feature of the system that can be exploited from the CLI is the BrokerInfo file. This file is created for a specific job and contains information about it, which can be obtained through the use of the edg-brokerinfo command. Features that are currently not supported, but are likely to be incorporated in future official releases of LCG-2, include the submission of checkpointable jobs and of jobs making use of the Message Passing Interface (MPI).
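To make the above more concrete, the following is a minimal sketch of a JDL description and of the corresponding CLI commands, with hypothetical file names, VO and requirements; the exact attributes accepted depend on the LCG-2 release in use.

    // hello.jdl - a simple job description in the ClassAds-based JDL
    Executable    = "/bin/hostname";
    StdOutput     = "hello.out";
    StdError      = "hello.err";
    OutputSandbox = {"hello.out", "hello.err"};
    Requirements  = other.GlueCEPolicyMaxCPUTime > 60;

    # Submit the job, check its status and retrieve the output sandbox
    edg-job-submit --vo lhcb hello.jdl
    edg-job-status <jobId>
    edg-job-get-output <jobId>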

4.6 Data Management

The Replica Management System (RMS), the data management component of LCG-2, was developed in the context of the European DataGrid project. It mainly consists of two services. The first one is the Replica Location Service (RLS). The RLS maintains a mapping between the Grid Unique Identifiers (GUIDs - see below) and the physical locations of the files they identify. The second service is the Replica Metadata Catalog (RMC), which maps the Logical File Names (LFNs) of the files to their GUIDs. The RMC also contains metadata about the files. The two services of the RMS can be accessed through the Replica Manager, which is part of the user interface. There is one separate RMS for each of the VOs, and all of them are provided by CERN.

File names

Every user in the LCG-2 Grid environment identifies files by four different types of names. First, for each file, there is one Grid Unique Identifier (GUID). The GUID of a file is guaranteed to refer to that file exclusively and is the same for any user. In contrast, the Logical File Name of a file is the name by which a user refers to a particular file. In other words, it is an alias for a GUID, arbitrarily set up by a user. Besides those two names, there is also a Storage URL (SURL) and a Transport URL (TURL). The reason for those is that GUIDs and LFNs cannot be used to specify the physical location of a particular copy (or replica) of that file. SURLs and TURLs contain information about the hosting SE and the local identifier of the file within that SE. The difference between SURLs and TURLs is mainly that the SURL is used by the RMS and the local SE to locate a replica, whereas the TURL is used by an application running on the Grid and contains all the information necessary to get the replica from a particular location, including the protocol and port that should be used.

Command line tools

It was mentioned earlier that the two services of the RMS can be accessed through the Replica Manager. The LCG-2 CLI contains a Replica Manager client, which can perform a number of data management operations. More specifically, there are commands for assigning GUIDs to files and uploading/registering them to the Grid (the user can specify the SE, but it is not necessary), getting information about the available storage resources, finding the existing replicas of files, deleting replicas or creating new ones and viewing the contents of specific SEs. Moreover, the user can assign LFNs to files to enable their use by his/her applications. Besides using the Replica Manager as an interface, the user has the option of accessing the RMC and the RLS directly by using two clients from the CLI. However, the low-level operations made possible by those clients are dangerous, as they can create inconsistencies between the two catalogues (an update on one of the two catalogues does not enforce an update on the other); therefore, in most cases, the use of the Replica Manager client is preferable.
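As an illustration of the name types and of the Replica Manager functionality described above, the following commands sketch how a file might be uploaded, replicated and located with the lcg data management tools shipped with LCG-2. Host names, the VO and file names are hypothetical, and the exact command names and options depend on the LCG-2 release installed.

    # Copy a local file to a chosen SE and register it with an LFN;
    # the command prints the GUID assigned to the new Grid file
    lcg-cr --vo lhcb -d glenmorangie.example.ac.uk -l lfn:dtsirigkas/test.dat file:/home/user/test.dat

    # List the replicas (SURLs) known for that logical file
    lcg-lr --vo lhcb lfn:dtsirigkas/test.dat

    # Create an additional replica on another SE
    lcg-rep --vo lhcb -d se2.example.ac.uk lfn:dtsirigkas/test.dat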

4.7 Relevance to the Dissertation

An important objective of my MSc project was the gaining of knowledge and understanding of the LCG-2 middleware, and the storage aspects in particular. This was important because, as a Tier 3 center, Edinburgh will make use of this middleware to connect to the rest of the LHC Grid. For this reason, it was necessary to have LCG-2 installed on a front end of machines. During this summer, version 2_1_1 of LCG-2 has indeed been installed on three computers and there is currently a functional SE that can accept commands via the network from any computer running the User Interface software. To be able to follow and participate in this process, I had to study the LCG-2 user guide [9] extensively and learn the architecture and features of LCG-2, what services the LHC Computing Grid will offer and how they can be accessed. I also went through the necessary steps to obtain a Grid certificate and install it on my workstation, so that I could use the User Interface to try the LCG-2 command line tools.

Chapter 5
LCFGng

The Local Configuration System (LCFG) [8], [11] is a piece of software that can be used to install and manage the configuration of large numbers of Unix systems. It started as a project in 1993 at the University of Edinburgh and was originally designed for the Solaris OS. Since then, it has been ported to Linux and has become an open source project distributed under the GNU public license. It has grown and evolved significantly and is today called LCFGng, for next generation. The EDG project modified a version of LCFGng to serve as a system for installing and configuring machines running its Grid middleware, and since then the two versions have evolved independently. In this chapter we always refer to the EDG LCFG, even though most of the information found here holds true for both versions.

5.1 The architecture of LCFG

The architecture of LCFG can be seen in Figure 5.1. The operation of LCFG is based on a central server that holds configuration files for all the other computers to be configured and managed. A configuration file on the central server is written in the LCFG language and may contain settings that apply to one of the computers, or a single setting that applies to many computers. These source files can be passed through the LCFG compiler to create profiles. A profile does not correspond to one source file, but contains the complete configuration of one computer, so a single source file may be compiled into many profiles. The LCFG central server also runs a web server which is used to publish the profiles. Every time an update is available, the server notifies the LCFG clients running on the managed machines. The clients can then download their new profile. A number of Perl scripts, the LCFG components, are then run by the client to turn the profile into the appropriate configuration files and make all the necessary changes.

Figure 5.1: The LCFG architecture.

Source Files

Source files are made up of header file inclusion statements, containing standard configurations, and resources. Resources are two-word statements that specify the necessary details that are not included in the header files and are specific to the configured systems. For example, the standard desktop configuration could have one Ethernet card whereas the system to be configured could have two - differences like this require additional resources. The first word of the resource is the key and the second is the value. The key is made up of the hostname the statement refers to, the relevant component that should make the configuration changes and the parameter. The second word is the value the parameter should take in the new configuration. The program used to compile source files is called mkxprof. Calls to mkxprof can be made explicitly, but it also runs as a daemon, checking the configuration files for updates and recompiling the ones that have undergone changes. In this way, the user only needs to save an updated file; the compilation and publishing will be handled automatically.

Profiles

Source files are compiled into profiles. There is one profile for each machine to be configured and it contains all the necessary information on the configuration of that machine. Profiles are XML files divided into sections, one for each component. Which keys are included in each individual section depends on the component it is meant for, but the general format of a profile is defined by a schema.
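Putting the two descriptions together, a resource in a source file is a key/value pair whose key names the host, the component and the parameter, and mkxprof turns the source file into a per-host XML profile. The sketch below is purely schematic: the component and parameter names are invented for illustration, and the exact source syntax (inclusion directives, spacing, host handling) differs between LCFG versions.

    #include "standard-desktop.h"

    glenkinchie.network.interfaces   eth0 eth1
    glenkinchie.network.ip_eth1      192.168.0.12

After compilation with mkxprof, entries like these would appear in the network section of the XML profile published for the host glenkinchie.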

Components

Components are scripts, usually written in Perl, which are responsible for one configurable aspect of the machine's operation. Once a profile is received by the client, the relevant components are called with the required arguments that specify the method in the script to be run. The components typically implement a configure method and perhaps a start and a stop method as well. What the configure method usually does is create configuration files and the directories where they are meant to be placed. The start and stop methods are there in components that are meant to stop and start daemons, for example in the case of a service that needs to be restarted to pick up a new configuration.

5.2 LCFG and LCG-2. Relevance to the Dissertation

The EDG project used a version of LCFG as a means to install and configure its software. This practice was passed on to the LCG project and, currently, the usual way to install LCG-2 is through an LCFG server. Of course, the installation and configuration can also be done manually. In Edinburgh it was decided to use LCFG for installing and managing the system configuration. The installation took place over the summer and I was involved in the process.

Installing LCFG

In order to install LCG using LCFG, LCFG must first be installed on a central server. In Edinburgh, the LCFG service is set up on Glenellen, an IBM eServer x205 (Intel P-IV 1.8GHz with 256 MB of RAM). All the necessary files, including the essential document with the installation instructions [12], can be downloaded from the LCG-2 CVS repository at [10].

The operating system

The first step is to install Red Hat 7.3 on the server. It is important to partition the hard disk so that /opt is the largest partition, as this is where the software will reside. It should also be kept in mind during the installation that alongside the LCFG server there also need to be a DHCP server, an NFS file server and a web server running on the host. This will affect the packages installed and the firewall configuration. For the installation in Edinburgh, after Red Hat Linux 7.3 was installed, YUM (the Yellow dog Updater, Modified) was used to make sure that the latest and most secure versions of all packages were installed.

Downloading the rpm packages

After the installation of the operating system is complete, three rpm packages must be downloaded from the CVS repository and installed: edg-populate-serverng, edg-updaterep and updaterpmsstatic-server. The only one of those to be used directly by the user, edg-updaterep.rpm, comes with a configuration script that can be set to make it download the rpms for a specific LCG-2 CVS tag from the repository. However, the configuration files for the LCG-2 installation must be checked out from the repository manually and put in a directory of the user's choice before updaterep is run. The final step in installing the LCFG server software is to run a Perl script provided with the distribution, lcfgng_server_update.pl, which checks that all the necessary rpms have been downloaded and generates a script that performs the installation.

LiveOS

LiveOS is an operating system that is loaded onto the LCFG clients when they are booted from the network, at the beginning of the LCG-2 installation. The next step is to have it installed and configured in a directory on the LCFG server. The installation is handled by a Perl script called lcfgng_installroot.pl, and then the user can configure some parameters by editing a file called installparams.

The DHCP server

A DHCP server should run on the machine running the LCFG server. A DHCP server has the duty of giving all the other hosts on the network their network configuration. This includes assigning IP addresses: the server has a range of IP addresses that it is authorised to distribute, and every time a host comes online and requests an address it makes an assignment from this range. A DHCP service needs to be set up on the LCFG server so that the nodes to be configured obtain their configurations dynamically. The DHCP server also tells the network card of the node to boot via PXE and to load the PXE loader, pxelinux.0, using the TFTP server. To make writing a configuration file easy, the LCG-2 distribution includes an example configuration file, dhcpd.conf.ngexample. That file should be edited depending on the characteristics of the network where LCG-2 is to be installed and renamed to /etc/dhcp.conf.
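For illustration, a host entry in the DHCP configuration typically combines the address assignment with the PXE boot instructions. The snippet below is a generic ISC dhcpd sketch with invented addresses, not the contents of the dhcpd.conf.ngexample file shipped with LCG-2.

    subnet 192.168.1.0 netmask 255.255.255.0 {
        option routers 192.168.1.1;
    }

    host node01 {
        hardware ethernet 00:0C:29:12:34:56;   # MAC address of the node's network card
        fixed-address 192.168.1.21;            # IP address handed out to the node
        next-server 192.168.1.10;              # the LCFG/TFTP server
        filename "pxelinux.0";                 # PXE loader downloaded via TFTP
    }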

HTTP

The HTTP server running on the LCFG node needs to be configured to limit access to the LCFG server to computers from domains that are safe. This is very important, as the configuration information stored on the LCFG server can make the other nodes on the network vulnerable to attacks if not kept secret. A simple way of configuring the HTTP server properly is provided in the form of a configuration file, httpd.conf.ngexample73, which has been downloaded at the beginning of the installation. Another important step that should be taken at this point is the encryption of a password and its storage in a file typically named /etc/httpd/.htpasswd. This password will have to be used when trying to access the LCFG server with a web browser.

NFS

An NFS server has to be properly configured on the LCFG server. It is used by the nodes to transfer the LiveOS files during their initial boot. The configuration is very simple and is done by adding two lines to the file /etc/exports. The LCG-2 distribution contains an example file with the name /etc/exports.ngexample73, from which the lines can be copied.

Setting up PXE

PXE allows nodes on a network to boot from the LCFG server. If a node is set to boot from its network card, then, when it starts, its network card sends a request to the DHCP server to obtain its configuration. If the DHCP server is properly configured, it will instruct the node to download the PXE loader, pxelinux.0, from the /tftpboot directory, using TFTP. Once the PXE loader is downloaded, it in turn uses TFTP to download its configuration file from the directory /tftpboot/pxelinux.cfg and a kernel from the directory /tftpboot/kernel/. Then the node boots that kernel. This is the way the LiveOS operating system is passed to the nodes prior to the LCG-2 installation. The user can have multiple configuration files and kernels in the respective directories. If there are multiple configuration files then the user will be presented with a choice of boot types when a node is booted. Obviously, since the PXE loader and the kernel and configuration files are transferred to the node via TFTP, it should be made sure prior to the installation that there is a TFTP server running on the LCFG server.
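The password file and the NFS exports mentioned above are both short, standard configuration steps. The commands and lines below are a generic sketch with hypothetical paths and network ranges; they are not copied from the LCG-2 example files.

    # Create the encrypted password file read by the web server
    htpasswd -c /etc/httpd/.htpasswd lcfgng

    # Example /etc/exports entries making the LiveOS and install trees
    # available read-only to the cluster's private network
    /opt/local/linux/7.3          192.168.1.0/255.255.255.0(ro,no_root_squash)
    /opt/local/linux/installroot  192.168.1.0/255.255.255.0(ro,no_root_squash)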

Installing LCG

Site-wide settings

After the previous steps have been completed, the server is set up and the configuration for the site must be prepared. All the site-wide settings go into LCFG header files that can later be included in the node-specific source files. There are four header files that need to be edited:

- cfgdir-cfg.h, which contains the directory of the configuration files.
- local-cfg.h, which contains modifications to standard Red Hat 7.3 settings.
- private-cfg.h, which contains security settings, including the root password for the site. The password is encrypted using openssl and is stored in encrypted form.
- site-cfg.h, which contains settings applying to the whole LCG-2 site (site name, LCG version, etc.).

LCFG source files for the node types

After the site-wide settings have been made, it is time to specify the configuration for the different types of nodes. Example source files are provided with the LCG-2 downloads, and these can be edited and renamed to the hostname of the node they refer to. Therefore, we end up with a number of source files equal to the number of nodes in the site. Those source files can be compiled into XML profiles using mkxprof.

PXE node installation

The installation can begin by accessing the LCFG server with a web browser. The URL that should be used is of the form http://<LCFG server>/install/install.cgi, the default username is lcfgng and the password is the one that was set during the HTTP configuration. The user is presented with a web interface that allows him/her to specify the node to be installed and the type of the installation. The different installation types correspond to the different configuration files for pxelinux that were created when PXE was set up. When the nodes are rebooted, their configuration will be the one selected. Therefore, in order to install a Storage Element, all that needs to be done is to select the appropriate boot type and reboot the host. Once the machine is booted, it is sent the profile that contains its configuration. The configuration defines the boot type, in other words the kernel to boot and the filesystem to mount. Once the node is rebooted it functions as an LCFG client node, which means that it will pick up any changes made in the LCFG source files located at the server that affect its profile.

Relevance to the Dissertation

Since one of the aims of the MSc project was to understand the way a site running LCG-2 is set up, I had to install an LCFG server on an old computer that used to be on the physics network. The process began by formatting the hard disk and installing Red Hat Linux 7.3. I then followed the steps described in the previous paragraphs and ended up with a functional installation of LCFG on the computer. Unfortunately, the reproduction of the LCG-2 installation had to stop there, since there were no other machines that I could use to install LCG-2 on. However, I did follow and understand the installation process as it took place for the Edinburgh

hardware. I feel that following the installation was instructive and, even though I was, quite understandably, not allowed to work on the hardware myself, I learnt a lot from the process. As an LCG front end there is an IBM eServer x205 (Intel P-IV 1.8GHz with 256 MB of RAM), Glenellen, used as the LCFG server, and another one, Glenlivet, used as the GG to the Worker Nodes. Glenmorangie, an IBM eServer x305 (dual Intel P-III Xeon 1GHz with 2 GB of RAM), is set up as an SE. Finally, the worker nodes of the Computing Element are going to be four dual Intel Xeon 2.8GHz machines, each with 2 GB of RAM and a 200GB EIDE hard disk.

Chapter 6
Monitoring with Ganglia

Due to the dynamic nature of the LHC Computing Grid, resources connected to it need to be monitored. The monitoring tool that will be used for the GridPP sites, and has recently been installed in Edinburgh, is Ganglia [13]. Ganglia is an open source project that started at the University of California, Berkeley, as part of an effort to link university clusters, and has evolved into a complete monitoring system. It is also distributed, in the sense that it can not only be used to monitor clusters but also supports cluster hierarchies. This is achieved by appointing representative nodes in each of the clusters and organising them into trees. Each node is then responsible for reporting to the one above (its parent) on the state of those on the branches below (its children). Ganglia has been ported to many different platforms and tested thoroughly. It is also highly scalable - the communications have been optimised so as to introduce as small an overhead as possible, and it has been successfully used to handle clusters scaling to thousands of nodes [14].

6.1 The Ganglia Architecture

The two main components of Ganglia are the Monitoring Daemon (gmond) and the Meta Daemon (gmetad). gmond is responsible for collecting information at the cluster level and runs on each of the nodes of the monitored cluster. It has three kinds of threads. The collect-and-publish thread collects the monitoring information for the node gmond runs on and publishes it to a multicast channel used by every other gmond in the cluster. The multicast channel is listened to by the listening threads, which pick up information broadcast by other gmond daemons and update the local gmond's information accordingly. Thus, every node that has gmond running has a complete view of the cluster, kept in a hash table. Another kind of thread, the XML export threads, listen for client applications or gmetad daemons requesting information and answer them with messages in XML format.
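The XML export interface makes it easy to look at the raw monitoring data by hand. In a default Ganglia installation gmond answers on TCP port 8649, so a quick check from any machine in the cluster might look like the sketch below (the host name is hypothetical and the port may have been changed in the local configuration):

    # Dump the cluster state held by one gmond as XML
    telnet glenmorangie.example.ac.uk 8649

    # The reply is an XML document with one HOST element per node, e.g.
    # <HOST NAME="glenkinchie" ...> <METRIC NAME="load_one" VAL="0.12" .../> ... </HOST>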

gmetad is a Perl daemon responsible for collecting information for a cluster or a group of clusters. One gmetad runs on each of the representative cluster nodes. Typically, this means that it collects information from a set of nodes running gmond daemons as well as from a number of child gmetad daemons representing clusters lower in the hierarchy. This information arrives in the form of XML messages. The gmetad then puts all its information together in its database and passes it on, when requested, to the gmetad running on its own parent node. Figure 6.1 shows how the Ganglia components are combined to build cluster hierarchies.

Figure 6.1: Ganglia cluster hierarchies

6.2 Metrics

Ganglia distinguishes between two kinds of metrics: built-in and application-specific (user-defined). Built-in metrics describe the state of a cluster node. The number of built-in metrics varies depending on the platform, but the most important ones (number of CPUs, CPU and memory usage, running processes, etc.) are always available. Application-specific metrics are defined by the user, who can explicitly specify the frequency with which they are collected and sent on the multicast channel. Applications can use the gmetric command line tool to publish information about themselves.
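As an example, an application could publish the length of a local job queue with a command along the lines of the one below. The metric name and value are made up; --name, --value, --type and --units are standard gmetric options, but their exact form may differ between Ganglia releases.

    gmetric --name batch_queue_length --value 17 --type uint32 --units jobs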

6.3 Transmitting and Storing Monitoring Information

Messages on the multicast channel

There are two types of messages exchanged between Ganglia nodes. The first is messages on the multicast channel. A multicast message is a message from one host to all other hosts in a group. What distinguishes a multicast message from a broadcast message is that in the case of the latter all hosts in a network receive the message, whereas a multicast message is delivered to a dynamic group of recipients, identified by a single IP address, which does not include all hosts. Multicast messages therefore result in less traffic on the network and are more efficient. Furthermore, nodes enter or leave the group of recipients dynamically, without the need to change the configuration of a central service or restart it: discovery is automatic.

The messages transmitted on the multicast channel are either heartbeats or collected metrics. Heartbeats are messages that signal that a host is up and running. A heartbeat includes the start time of the gmond daemon, so that the rest of the nodes can detect restarting daemons. When a gmond does not send heartbeats for some time, the host is assumed to be down. The other kind of multicast message is in the eXternal Data Representation (XDR) format, which is machine independent and also efficient. XDR messages carry monitoring information sent from one gmond to the others in the group. A gmond sends an updated value for a metric every time a change occurs that exceeds a defined threshold; updates are not sent for values that do not vary significantly.

XML messages

The messages passed from gmond to gmetad nodes, or exchanged between gmetad nodes, contain monitoring information for subsets of cluster federations and are in XML format. XML, being portable and self-describing, enables the integration of Ganglia with other software. These messages are transmitted by gmond or gmetad after a request from another gmetad is received. Any gmetad will periodically send such requests to the gmond daemons of the nodes it represents, as well as to all of its child gmetad daemons; in this way it obtains up-to-date values for the monitored metrics. The XML messages sent from a gmond include information on all the gmond nodes on the same multicast channel. The XML messages from a gmetad include aggregated information on every single node lower than that gmetad in the cluster hierarchy.

Storing Monitoring Information

To store the information they receive through the multicast channel, gmond daemons use an in-memory hash table. The hash table supports simultaneous insertion of data by listening threads accessing different parts of the table, to increase efficiency. The system is also optimised for accesses by the XML export threads.

Data in the table is stored in binary form to reduce its size and to allow quick conversion from the XDR format. Timeouts are the mechanism by which gmond daemons distinguish between valid data and expired data that should be deleted. For every piece of data put in the hash table, gmond records the time of receipt. Data that has not been updated for longer than a soft limit is considered suspicious, and any client applications using it are notified. If a second, hard limit is crossed, the data is removed from the hash table.

The data collected by the gmetad daemons is stored using the Round Robin Database tool (RRDtool). Every host running a gmetad daemon keeps a database with all the data sent to it and creates graphs of that data versus time.

6.4 The PHP Front end

Ganglia has a web front end written in PHP. This system runs on the same host as a gmetad daemon. Its role is to create and periodically update web pages where it publishes the information contained in the gmetad RRD database. It thus allows the user to access the information gathered by Ganglia easily, without having to query the databases directly. Moreover, it provides this information in the form of the graphs created by RRDtool, as opposed to just raw numbers, making the monitoring process even more user friendly. All the metrics that Ganglia uses can be obtained through the PHP front end.

6.5 Using Ganglia for the Scotgrid Hardware. Relevance to the Dissertation

Ganglia is currently being installed at many GridPP sites, to be used not only by the individual site administrators but also as a way to monitor the GridPP hardware centrally. As part of this process, it was recently installed and configured for the Edinburgh site as well. For my MSc project, I had to read the Ganglia documentation, understand how it works and then participate in the installation and configuration process.

Using LCFG to configure Ganglia

In Edinburgh, Ganglia was installed and configured by following the instructions found on the GridPP website. Figure 6.2 shows how the Ganglia daemons interact to gather the monitoring information. The LCFG server, Glenellen, is also the node running gmetad, that is, the representative node for the Edinburgh cluster.

Figure 6.2: Ganglia on the Edinburgh front end

On all the hosts in the cluster, including Glenellen, gmond daemons are running, collecting the monitoring information. The gmond daemons publish their information on the multicast channel, so every host is kept up to date on the state of all the others. Periodically, the gmetad on Glenellen requests the cluster monitoring information from one of the gmond daemons. This information is sent as an XML file and then stored in a database. The PHP front end running on Glenellen can display this information on the browsers of authorised hosts.

The near future

The Edinburgh hardware is now monitored, and the monitoring information collected by the gmetad on Glenellen can be viewed online by all allowed hosts. For the time being, these include only hosts on the university network; in the near future, however, the list of allowed hosts will expand to include other GridPP machines. This will allow centralised monitoring of the GridPP hardware.

There might also be internal changes for the Edinburgh site. When the Worker Nodes are in place they might form a separate Ganglia cluster, with a gmetad representing them separately. The cluster of the WNs will be lower in the hierarchy than the front end cluster, represented by the gmetad of Glenellen. In this way, all the site information will still be accessible from Glenellen, but it will also be possible to monitor the WNs as an isolated group.
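In gmetad terms, such a hierarchy is expressed through data_source lines in gmetad.conf on Glenellen, roughly as sketched below. The cluster names, the placeholder host and the port are illustrative only; 8649 is merely the usual Ganglia default.

    # gmetad.conf on Glenellen (illustrative sketch)
    data_source "Edinburgh front end"    localhost
    data_source "Edinburgh worker nodes" <WN representative host>:8649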

Monitoring Examples

This section and the diagrams included show the way two different operations on the Scotgrid host, Glenellen, are viewed from the PHP front end. The graphs have been produced by carrying out the operations and capturing the images from a browser window.

A memory leak

A C program that creates a memory leak was written and run on Glenellen. The program allocates memory for a long array of pointers to doubles and then loops over the array at time intervals of constant length, allocating a large amount of memory for each of the pointers. The program can be found in appendix A. This results in more and more of the system memory being used until eventually, if left alone, the program either allocates all the memory it asks for or the system runs out of memory. The latter should obviously be avoided, because it may crash the system.

Glenellen has 2 GB of memory. The first run of the program was set to allocate 1 GB of memory, sleeping for 1 second between allocations. The result was the first steep rise in memory usage that can be seen in figures 6.3 and 6.4. On the second run, the program was set to exit after allocating 1.5 GB, to avoid the risk of crashing the system. The time interval between allocations was set to 5 seconds, to create a slower increase. It can be seen on the graphs that the memory usage rises approximately linearly. This is exactly what should be expected, since the allocations happen at regular time intervals and the same amount of memory is allocated every time. It is interesting to observe that, as the amount of allocated memory approaches 1.5 GB, the system starts using swap space to save memory. This did not happen during the first run, because memory usage did not reach a critical level. To cross-check that we get the correct information from Ganglia, we ran the top command. This confirmed the accuracy of the Ganglia graph.

A file transfer

Another operation that was monitored was a 21 GB file transfer from Glenkinchie. Figures 6.5 and 6.6 show the results. At the beginning of the transfer we notice that the use of memory increases quickly and the network and CPU activity are high. However, when the RAM is filled and can no longer be used as a buffer, the speed of the process is limited by the speed of writing to the hard disk, and both the CPU load and the network activity drop. Furthermore, the process becomes even slower as the free space on the hard disk of Glenellen is dramatically reduced. In the end the scp had to be cancelled manually and the file deleted, to stop Glenellen from running out of hard disk space. A final observation to be made from these graphs is that Glenellen sends packets over the network as well as receiving them. The reason is that in an scp transfer the receiver sends messages to the sender to acknowledge receipt of the packets.

Figure 6.3: The leak program output, a terminal running the top utility and the Ganglia web page. The program claims to have allocated 1253 MB, the top utility gives a value of 1.3 GB and the value shown on the Ganglia graph for the total memory usage is approximately 1.4 GB.

Figure 6.4: Memory usage, free memory and free swap memory. We notice that the first two graphs are consistent. The third graph shows the usage of swap memory for the second run.
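For reference, the kind of leak program described above looks roughly like the sketch below. The actual code used is listed in appendix A; the chunk size, number of steps and sleep interval here are illustrative values rather than the ones used in the runs.

    /* Sketch of a deliberate memory leak: the real program is in appendix A. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        enum { STEPS = 100 };                       /* number of allocations          */
        const size_t chunk = 10 * 1024 * 1024;      /* 10 MB per allocation (example) */
        double *blocks[STEPS];

        for (int i = 0; i < STEPS; i++) {
            blocks[i] = malloc(chunk);
            if (blocks[i] == NULL) {
                fprintf(stderr, "allocation failed at step %d\n", i);
                return 1;
            }
            memset(blocks[i], 0, chunk);            /* touch the pages so they are really used */
            printf("allocated %zu MB so far\n", ((size_t)(i + 1) * chunk) / (1024 * 1024));
            sleep(1);                               /* constant interval between allocations   */
        }
        /* nothing is ever freed, so memory usage only rises until the program exits */
        return 0;
    }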

Figure 6.5: The Ganglia page for Glenmorangie shortly after the file transfer. The start of the file transfer resulted in a quick rise in memory usage. When all the memory was used, Glenmorangie only received and processed data as quickly as it could write it on the hard disk. The result was a drop in CPU and network activity.

Figure 6.6: As Glenmorangie receives packets of data it sends confirmation packets back to Glenkinchie. Almost 20 minutes after the transfer started, and with almost 5 GB transferred, the process was stopped, since there was not enough disk space on Glenellen to hold a 21 GB file.

Chapter 7 SRM and SRB

It is obvious that the variety of storage resources on the Grid will be great and that each of those resources will have different functionality and its own data handling policies. It is therefore necessary to have a uniform interface to all those different local systems, so that clients can easily interact with them without having to know how to deal with each of them separately. There are a number of tools aimed at addressing this issue. Those most relevant to this project are summarised in the following paragraphs.

7.1 SRM

The Storage Resource Manager (SRM) is an interface specification defining the ways in which a server running on a storage resource should be able to interact with applications trying to reach it via the Grid. These applications should be able to invoke a specified set of methods and expect standard responses; the role of the SRM interface is to ensure that any implementation of a storage management system supports those methods and responses. SRM is the result of a collaboration between the European DataGrid, CERN, Fermilab and LBNL. There are implementations of the SRM protocols for a number of storage systems: HPSS, Enstore, JasMINE, CASTOR, the EDG SE, ATLAS and RAID arrays. The following sections describe the way files should be viewed and treated by an SRM implementation, as well as the most common methods such an implementation should provide.

SRM file and storage space types

The SRM specification characterises files based on their lifetime on a storage system. More specifically, a file living in an SRM-managed storage system can be permanent, volatile or durable. Depending on the type of the files stored in it, storage space is also assigned one of these three descriptions.

Permanent files in an SRM-managed storage system are of the same nature as permanent files in a typical filesystem. Those files are guaranteed to remain unchanged and inside the storage system unless their owner chooses to delete them. Therefore, a user of the Grid interacting with an SRM server and using permanent files can usually count on finding those files on that system for long periods of time.

Volatile files are those that have a specified lifetime. A volatile file is only guaranteed to be found by a user during its lifetime, and its lifetime is specific to that particular user. For example, if user A is accessing a volatile file, she will be guaranteed by the storage system that it will be accessible for a certain amount of time. If in the meantime another user, B, asks for the same file, a new lifetime for the file will be associated with him and he will get an independent guarantee for a different time period. If, at any moment, the existence of a volatile file is not guaranteed to any user (i.e. all its lifetimes have expired), the file can be removed as soon as the space it resides in needs to be reclaimed by the storage system.

Durable files are usually files of a temporary nature that contain important data. They also have lifetimes, but once those expire the storage system cannot yet delete them. Instead, it has to notify the owner of the files that their lifetime has expired and perhaps copy them to permanent space (depending on the implementation). A suitable candidate for a durable file would be a very large file that contains important information and needs to be accessed quickly. Since durable space will usually be in the disk cache, as opposed to on tape, quick access to the file could be provided and the file would still not be lost if not accessed for some time.

A file of a specific type can always be stored in space of the same type. Moreover, durable files can also use permanent space, and volatile files can be stored in space of any type. This is demonstrated in figure 7.1.

Figure 7.1: SRM filetypes. Image taken from [22].
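These compatibility rules can be summarised in a few lines of code. The sketch below is purely illustrative: the enum and function names are ours and are not part of the SRM specification.

    /* Illustrative encoding of the file-type / space-type rules shown in figure 7.1. */
    #include <stdio.h>

    enum srm_type { VOLATILE, DURABLE, PERMANENT };

    static const char *type_name(int t)
    {
        return t == VOLATILE ? "volatile" : t == DURABLE ? "durable" : "permanent";
    }

    /* A file may be stored in space of the same type; durable files may also use
     * permanent space; volatile files may use space of any type. */
    static int can_store(int file, int space)
    {
        if (file == space)    return 1;
        if (file == VOLATILE) return 1;
        if (file == DURABLE)  return space == PERMANENT;
        return 0;             /* permanent files need permanent space */
    }

    int main(void)
    {
        for (int f = VOLATILE; f <= PERMANENT; f++)
            for (int s = VOLATILE; s <= PERMANENT; s++)
                printf("%-9s file in %-9s space: %s\n",
                       type_name(f), type_name(s), can_store(f, s) ? "allowed" : "not allowed");
        return 0;
    }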

SRM functionality

There are currently many different types of storage systems, such as disk caches, tape storage systems, RAID arrays, etc. A storage facility may contain one or more kinds of storage system. In addition, these systems will support different operations and be governed by different policies. However, in a Grid environment, every storage system should present the same face to the rest of the world. The basic operations that any storage system managed by an SRM implementation should support fall into five categories. A brief description of those five categories of functions is provided here; for a complete listing, refer to the SRM interface specification [16].

Space Management Functions

The first category contains the space management functions. The functions in this category can be used to reserve and release space in a storage system, as well as to find out information about the space and the files contained in it (free space, type, lifetime, etc.). Using these functions it is also possible to modify these parameters, i.e. to prolong the lifetime or change the type of a file or space. Typically, the caller provides as arguments a user name and some information about the space, such as its type or size.

Directory Functions

Directory functions are those that perform the directory tasks that would be needed to manage directories in a Unix-type filesystem. For this reason, their names are the same as those of well-known Unix shell commands with the srm prefix (srmmkdir, srmrmdir, srmmv, etc.). Directories are a virtual construct that essentially provides the user with a way to logically group files on the Grid. Two files can belong to the same directory regardless of their size, type and physical location.

Transfer Functions

Transfer functions are used to transfer files to and from SRM-managed storage systems. The filenames used by these functions to refer to replicas on the Grid are the SURLs and the TURLs. Typically, a client requesting a file will call the PrepareToGet function, providing the server with an SURL for the requested file. The server will then pick a suitable transfer protocol and return it in a TURL that can be used for the transfer. In the case of a client trying to upload a file to the storage system, the PrepareToPut function is used. Once the SRM storage manager allocates space for that file, a TURL is provided to the client and the transfer can begin. It should be noted that SRM does not perform the data transfers for the client; it just provides the client with the necessary information (the TURL of the file and the protocol) to perform them itself.
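The get flow just described can be sketched as follows. This is not real SRM client code: the function names (srm_prepare_to_get, gridftp_download, srm_release_file) and the SURL and TURL strings are placeholders invented for illustration, and the stubs only print what a real client library would do.

    /* Hypothetical sketch of the client side of an SRM "get" request. */
    #include <stdio.h>

    /* Stub: in a real client this call would contact the SRM server, which
     * stages the file and chooses a transfer protocol, returning a TURL. */
    static const char *srm_prepare_to_get(const char *surl)
    {
        printf("PrepareToGet %s\n", surl);
        return "gsiftp://se.example.org/pool/file001";   /* example TURL */
    }

    /* Stub: the client itself moves the data using the protocol in the TURL. */
    static int gridftp_download(const char *turl, const char *local_path)
    {
        printf("downloading %s -> %s\n", turl, local_path);
        return 0;
    }

    /* Stub: tell the SRM the staged copy is no longer needed. */
    static int srm_release_file(const char *surl)
    {
        printf("ReleaseFile %s\n", surl);
        return 0;
    }

    int main(void)
    {
        const char *surl = "srm://se.example.org/data/file001";  /* example SURL */

        /* 1. ask the SRM to prepare the file; the answer is a TURL */
        const char *turl = srm_prepare_to_get(surl);

        /* 2. the client, not the SRM, performs the actual transfer */
        gridftp_download(turl, "/tmp/file001");

        /* 3. release the file so the SRM may reclaim the space */
        return srm_release_file(surl);
    }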

Figure 7.2: SRM and file transfers. Image taken from [17].

Permission Functions

The fourth category of functions is the Permission functions. Permissions are a way of protecting files on the Grid from unauthorised access. Grid file permissions are completely analogous to the permissions set for files on a Unix filesystem. Functions in this category allow users to define who has access to their files and what they are allowed to do with them. They also provide a way of checking the permissions for a given file, which can be used by both clients and SRM servers.

Status Functions

Finally, the SRM specification defines Status functions that can be used to track the progress of an SRM operation. Thus, for the duration of a download from or an upload to a storage system, starting from the moment of the client request, information on the status of the operation can be obtained from the manager by means of the StatusOfGetRequest or the StatusOfPutRequest function respectively.

File pinning

We provide here a description of file pinning, a technique that SRM implementations should support. In short, file pinning is the practice of extending the guaranteed lifetime of a file of a temporary nature, in order to greatly increase the chance that it will still be available after some amount of time. This feature can be implemented with different levels of sophistication.

The simplest example would be to allow the client to request that a mark is placed on the file. This mark does not correspond to a predefined extension of the lifetime, but whenever the storage manager needs to make some space it makes sure it removes the file with the oldest mark first. A more complex implementation of file pinning would take into account the identity of the client requesting the pin, not allowing unlimited consecutive pins by the same client. It would also allow the client to specify the desired duration of the requested pins. For a more detailed discussion of file pinning strategies see [18].

LCG-2 and SRM

LCG-2 currently supports the classic SE solution provided by the DataGrid project, but this is soon going to change. It is intended that the final LCG-2 release will include an SE implementation based on the Storage Resource Manager (SRM). There are a number of differences between the classic and the SRM implementations of the SE, the most important being that the latter supports asynchronous data transfer operations and the pinning of files. In the case of a classic SE implementation, the client can only issue the next request to the SE after the previous one has been completed. Moreover, it is not possible for a client to book multiple files in a storage system for future use.

7.2 The LBNL SRM

Lawrence Berkeley National Laboratory has developed a software package which implements the SRM specification. It includes a Disk Resource Manager (DRM), which can be used for disks and NFS systems, a Tape Resource Manager (TRM), which can be used to manage mass storage systems, and a combination of the two, the Hierarchical Resource Manager (HRM). Figure 7.3 shows how a DRM, a TRM and an HRM can interface storage resources to the Grid. The protocols supported by the LBNL SRM are GridFTP, FTP and http.

The LBNL SRM has been used successfully in the context of STAR [19], an experiment at Brookhaven National Laboratory (BNL) investigating the results of heavy ion collisions. SRM was used to automate the process of transferring data between RCF and NERSC, the storage systems located at BNL and LBNL respectively. It is still being actively developed, and a new version (2.1) is going to be released towards the end of the year. For the rest of this document, the acronym SRM will refer to the LBNL implementation unless we explicitly state that we are referring to the specification.

7.3 The SDSC SRB

A team working at the San Diego Supercomputer Center (SDSC) has developed the Storage Resource Broker (SRB) [20]. SRB is a software suite that can provide uniform access to distributed data resources.

Figure 7.3: DRM, TRM and HRM and the systems they interface to the Grid. Image taken from [21].

SRB can be used as middleware, but it is also a complete solution in itself (it contains a user interface and does not need to be part of an application in order to manage combinations of storage resources). SRB is a mature and useful product that is already being used by a number of sites in the UK.

The SRB architecture

Figure 7.4 illustrates the basic operation of SRB. An SRB server can be installed on top of a storage system, and client applications can use the network to make calls to it in a number of ways, which are detailed below. The SRB server interacts with the Metadata Catalogue (MCAT) server, which provides it with a logical set of filenames for the files stored inside the managed systems. This enables the SRB server to execute the desired I/O operations. SRB has a number of interesting features that increase its functionality and make it a very useful piece of software. Those most important to this work are summarised below.

Figure 7.4: The SRB architecture. Image taken from [20].

SRB Master and SRB Agents

The SRB server is implemented in the following manner. When the server is running, a process called the SRB Master listens on a well-known port for any incoming client requests. As soon as a request arrives from a new client, it spawns an SRB Agent, a process that will handle the interaction with that particular client on a different port. After that, the SRB Master goes back to listening for new client requests and the client starts issuing requests to the agent. Each agent has a high-level request handler and a low-level request handler, and client requests are dispatched to one or the other accordingly. The high-level request handler is responsible for accessing the MCAT, in order to register or physically locate data. The low-level request handler typically makes use of the physical locations of data or storage spaces to perform the I/O operations.

The Metadata Catalogue

MCAT stores metadata about the data stored in the SRB-managed storage systems and is useful in providing an organisation of those files using logical filenames. In this virtual filesystem, the equivalents of directories are called collections. Very much like a Unix directory, a collection may contain a number of files as well as sub-collections, thus enabling the creation of filesystem-like hierarchies. The important difference, however, is that files within the same collection might be physically located on different storage resources. MCAT conveniently keeps the mapping of logical filenames to physical locations hidden from the client applications.
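The master/agent design described above is essentially the classic fork-per-connection server pattern. The sketch below is our own illustration of that pattern, not SRB source code, and the port number is an arbitrary example; a real agent would go on to parse the client's requests, consult MCAT and perform the I/O.

    /* Illustration of the master/agent pattern (not actual SRB code). */
    #include <stdio.h>
    #include <string.h>
    #include <signal.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    int main(void)
    {
        int listener = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;

        memset(&addr, 0, sizeof addr);
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = htons(5544);              /* example port only */

        if (listener < 0 ||
            bind(listener, (struct sockaddr *)&addr, sizeof addr) < 0 ||
            listen(listener, 16) < 0) {
            perror("master");
            return 1;
        }
        signal(SIGCHLD, SIG_IGN);                        /* reap finished agents automatically */

        for (;;) {
            int client = accept(listener, NULL, NULL);   /* the master waits for requests */
            if (client < 0)
                continue;
            if (fork() == 0) {                           /* child process: the "agent" */
                close(listener);
                /* ...handle this client's requests (catalogue lookups, I/O)... */
                close(client);
                _exit(0);
            }
            close(client);                               /* master returns to listening */
        }
    }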
