DiGS (Version 3.1) Setup Guide


George Beckett, Daragh Byrne, Radosław Ostrowski, James Perry, Eilidh Grant
10th May

Contents

Preface

1 An Overview of DiGS
  1.1 A typical DiGS-powered infrastructure
  1.2 Useful terms to know
  1.3 A note on the history of DiGS

2 Overview of Setup Process
  2.1 Prerequisites
    2.1.1 Globus Toolkit
    2.1.2 X.509 Digital Certificates
    2.1.3 Other software dependencies
    2.1.4 The DiGS Package Account
    2.1.5 Running on 32-bit and 64-bit platforms
  2.2 Types of installation
    2.2.1 User Management and Access Control Policy
    2.2.2 The Ant-based build system

3 Details of the Setup Process
  3.1 Client Installation
    3.1.1 Installing the Software (pre-compiled binaries)
    3.1.2 Installing the Software (building from source)
    3.1.3 Installing the Software (building from source from CVS)
    3.1.4 Creating the configuration files
    3.1.5 Configuration of User Environment
  3.2 GridFTP-based Storage Element Setup
    3.2.1 Creating a DiGS Package Account
    3.2.2 Authorising Control Node to access Storage Element
    3.2.3 Setting up User Access to Storage Element
    3.2.4 Setting up Storage Directories
  3.3 SRM Storage Element Setup
  3.4 Control Node Installation
    3.4.1 Creating a DiGS Package Account
    3.4.2 Installing the software
    3.4.3 Creating the Control Node Configuration Files
    3.4.4 Setting up X.509 certificate for Control Thread
    3.4.5 Setting up User Access to the Control Node
    3.4.6 Starting the Control Thread
  3.5 The Metadata Catalogue
    3.5.1 Installing the software
    3.5.2 Setting up the Servlet container
    3.5.3 Deploying the XML database exist
    3.5.4 Database Administration
  3.6 Configuring a Backup Node
    3.6.1 Installing the software
    3.6.2 Mirroring the File Catalogue
    3.6.3 Starting the DiGS Backup Thread

Preface

This document describes how to set up a computer to become part of a DiGS-powered data grid. (DiGS is an abbreviation for Distributed Grid Storage; it was formerly known as QCDgrid.) It lists the steps necessary to install and configure the DiGS software itself, identifies dependencies and prerequisites, and explains how to solve common problems that may be experienced during setup. Finally, it describes how to test the software installation in order to verify that it is correct and complete. Note that this document applies to DiGS Version 3.1.

Chapter 1
An Overview of DiGS

DiGS is a data grid application: a distributed system that supports the management, sharing, publication and preservation of collections of data. It is specifically designed to handle large volumes of scientific data on dispersed, online (that is, disk-based) facilities, and furnishes semantic data management through the binding of application-specific metadata to data files. The key features of DiGS are:

- multi-terabyte, distributed storage.
- automatic and transparent replication of data across multiple sites.
- continuous data validation and consistency checking.
- support for bulk data transport operations.
- data provenance with application-specific metadata.
- simple and intuitive client tools.
- an interface between managed storage and external compute resources.

From the perspective of the user, a DiGS system looks like a local file store containing datasets (and possibly associated metadata), although in reality it is split across multiple, distributed machines that may be geographically dispersed. To hide this complexity, each dataset (or file) in the system is attributed with a unique and persistent identifier called a Logical Filename (LFN).

DiGS can be used to store arbitrary, file-based data. However, its development is motivated by data-driven computing, focusing in particular on semantic-based data access. With this in mind, DiGS organises datasets into collections called experiments. Each experiment has an associated scientific commentary, captured in XML (an open standard for describing data, provided by the W3C), which describes the experiment as a whole. An experiment consists of one or more datasets (or files), which are individually described by separate XML documents that capture the scientific meaning of the particular datasets.

There is a one-to-one correspondence between datasets and the metadata documents that describe them. This connection is captured in a (configurable) text node of the XML description that contains the Logical Filename of the corresponding dataset. There is a one-to-many relationship between an experiment and the datasets it contains. This connection is captured by a text node in the XML description of the datafile, which contains the name of the experiment. This name is matched to a text node in the experiment description, which also contains the name.
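To make the dataset-to-experiment relationship concrete, a dataset metadata document might contain text nodes such as the following. This is a sketch only: the element names here are hypothetical, since the actual names are application-specific and configurable rather than fixed by DiGS.

<dataset_metadata>
  <!-- one-to-one link: the LFN of the dataset this document describes -->
  <lfn>experimentA/config_001.dat</lfn>
  <!-- one-to-many link: the experiment to which this dataset belongs -->
  <experiment_name>experimentA</experiment_name>
  <!-- further application-specific scientific metadata would follow here -->
</dataset_metadata>

The corresponding experiment description would then contain a matching text node (for example, <experiment_name>experimentA</experiment_name>), allowing the two documents to be associated.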

1.1 A typical DiGS-powered infrastructure

The component systems within a typical DiGS-powered infrastructure are illustrated in Figure 1.1. The system contains three types of node: Client, Storage Element, and Control/Backup Node, as described below.

Figure 1.1: Overview of the typical DiGS architecture showing flow of data and control between clients, storage elements, control node and the various catalogue services.

Client

A user accesses the data grid from a Client. This can be any computer on which the DiGS client tools are installed and on which the user has a user account and appropriate credentials. From a Client, a user can:

- Confirm that the data grid is operational.
- List and retrieve files from the data grid.
- Query the Metadata Catalogue for scientific metadata.
- Submit new data/metadata to the data grid.
- Submit a compute job to a remote compute resource.

All of these interactions are supported by a suite of command-line tools called the DiGS Command Line Interface (CLI). In addition, a subset of the above functionality is furnished by a graphical client called the DiGS Browser. More information about the client tools is provided in the DiGS User Guide [1].

Storage Element

A Storage Element provides disk-based storage space to the data grid for holding copies of user data files. The data files held on a Storage Element are assumed to be immediately available (a property often referred to as on-line, in contrast to tape-based storage space, which is referred to as off-line). User file transfers to and from Storage Elements are normally initiated from a Client, so a user need never access the Storage Element directly. Other file transfers are also performed by the Control Node as part of its data replication and validation functionality.

Control Node

The Control Node hosts a persistent agent (commonly referred to as the Control Thread) that continuously tests and validates the integrity and availability of the contents of the data grid. It performs a number of checks, including:

- checks that Storage Elements are responding.
- checks whether new data is awaiting insertion into the data grid.
- checks that sufficient copies of each file exist on the data grid, and performs a replication operation if a file is detected with insufficient copies.
- compares the state of a Storage Element's file systems with the record that it maintains locally in a file catalogue.
- checks that each copy of a file is consistent with its peers, that is, all replicas of a file are identical.
- computes the free capacity that remains in the data grid and takes appropriate action if the capacity is deemed to be too low.

The Control Node is autonomous and typically orchestrates management of the data grid without intervention from a user/administrator. The Control Node is a critical component of the DiGS system architecture. If the Control Node becomes unavailable, then the system switches to a backup control node (called the Backup Node) that provides read-only access to the data grid.

Backup Node

The Backup Node provides a subset of the functionality of the Control Node, permitting users to retrieve data from the grid in the event of a failure of the Control Node. Specifically, the Backup Node hosts a copy of the file catalogue, allowing a Client to locate and retrieve data from the Storage Elements. The Backup Node is intended as a temporary substitute for the Control Node: it cannot add new data to the grid, nor can it replicate or validate existing data on the grid.

1.2 Useful terms to know

The terms below are commonly used to describe the functions of a DiGS-powered data grid. A familiarity with these terms is important, since they are used widely within this document.

Data Grid is a term used to describe a distributed file system (spanning multiple institutional facilities) that supports the management, sharing, publication and preservation of collections of files.

Logical Filename (LFN) is a unique and persistent identifier for a dataset that is stored on a DiGS-powered data grid.

File Catalogue is a specialised database that maintains the mapping between each LFN and the physical locations of the actual copies of the data file. The DiGS File Catalogue also maintains basic file-based metadata, recording the size, the MD5 checksum, the owner, and the access permissions associated with each dataset.

Metadata Catalogue is a database that stores scientific metadata pertaining to the experiments and datasets stored on the data grid. The DiGS Metadata Catalogue is XML-based.

1.3 A note on the history of DiGS

The development of DiGS has been motivated by data management requirements in the computational particle physics community. The first version of DiGS (that is, Version 1) was actually called QCDgrid, a name that reflects the background of the application. One aim of the name change and the more recent development path has been to generalise the application space of the software. However, to maintain some backward compatibility and to reduce the scope for introducing new bugs, the name QCDgrid is still present in many of the configuration files and underlying technologies. Subsequent sub-versions of DiGS will continue to reduce the number of references to the original name.

Chapter 2
Overview of Setup Process

2.1 Prerequisites

Before one can begin to install the DiGS software onto a system, one needs to confirm that various prerequisites are satisfied. These prerequisites are described, in turn, below.

2.1.1 Globus Toolkit

DiGS is a grid application that utilises services and functionality from a middleware called the Globus Toolkit. Each system on a DiGS-powered data grid requires this toolkit to be installed. DiGS Version 3.1 requires Globus Version 2.4 or higher. However, Version 4.0 or higher is recommended, since this includes significant bug fixes and security enhancements. Please refer to [2] for detailed information on how to set up a Globus Toolkit installation for use with DiGS.

2.1.2 X.509 Digital Certificates

Each server (that is, Storage Element and Control Node) on a DiGS-powered data grid needs to have an X.509 host certificate (also known as a server certificate) from a suitable certification authority (CA), such as the UK e-Science CA. Each user of the system will also need an X.509 user certificate to allow them to authenticate to, and be attributed with privileges on, the data grid.

2.1.3 Other software dependencies

There are several other software dependencies that must be fulfilled before one installs DiGS software, as follows:

- The DiGS client tools require access to a Java Virtual Machine. The developers recommend Sun Java Version 1.4 or higher. In addition, if one intends to build DiGS software from source, a Java Development Kit is required, for example Sun Java SDK Version 1.4.
- If one intends to build DiGS software from source, then the Ant build system (Version 1.6 or higher) is required. In addition to Ant, the build script requires access to the GNU Make utility (Version 3.8 or higher) for building various C-based components.
- The DiGS Metadata Catalogue requires access to an XMLDB-compliant database. As distributed, the DiGS software is set up to use the exist database [5] (Version 1.0 or higher).

- If a VOMS proxy is required in order to access any of the storage elements on the grid, the gLite VOMS client tools should be installed to enable users to run the voms-proxy-init command.
- If the software is to be built from source and SRM [3] support is required, the gsoap toolkit and the gsoap GSI plugin should first be installed.

2.1.4 The DiGS Package Account

For a multi-user system, it is strongly recommended that the software is installed and configured using a DiGS Package Account: a Unix account created specifically to manage the DiGS software installation. DiGS imposes no restrictions on the name of the account (over and above the restrictions of the Operating System), though the developers recommend using the name digs (historically, the package account has been called qcdgrid). For a Storage Element, in addition to owning the DiGS software, the package account also owns all data that is held on the Storage Element (here, ownership refers to Unix file system permissions).

For a single-user Client node, one may choose to install DiGS using the target user's account. However, this type of installation is not discussed further in this document.

2.1.5 Running on 32-bit and 64-bit platforms

All versions of DiGS, since and including Version, are compatible with both 64-bit and 32-bit platforms.

2.2 Types of installation

To reflect the different types of node that are present on a DiGS-powered data grid, there are several different installation types, which are described in turn below:

- Client sets up command-line (and possibly graphical) client tools that allow a user to interact with a data grid, for example to upload/download files and perform basic data manipulation. See Section 3.1 for more information about Client node installation.
- Storage Element: DiGS software is not required to be installed on a storage element; however, some configuration is required. See Section 3.2 for more information on how to set up the hosting node as a storage element for the data grid.
- Control Node sets up the hosting node as a Control Node for the data grid. The Control Node installation includes the client toolset. See Section 3.4 for more information about Control Node installation.
- Backup Node sets up the hosting node as a Backup Node, which will automatically assume a limited Control Node role, should the Control Node become unavailable. See Section 3.6 for more information about Backup Node installation.

For each of these installation types, one may choose to complete the setup process using a pre-compiled binary package (provided that one is available for the target architecture) or by building the various tools and services from source.
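By way of illustration, the package account described in Section 2.1.4 might be created as follows on a typical Linux system. This is a minimal sketch only: the commands are run as root, and the account name, group and home directory are suggestions that should follow local policy.

> groupadd digs
> useradd -g digs -m -d /home/digs -c "DiGS package account" digs
> passwd digs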

2.2.1 User Management and Access Control Policy

DiGS employs a group-based access control policy, intended to allow groups of collaborating users to share data without making it accessible to users of the data grid as a whole (unless they choose to do so). Two types of group are defined by DiGS: privileged and guest. Each user must be a member of either a privileged group or a guest group, though (a restriction in Version 3.1 of DiGS) they may only be a member of one such group.

Members of a privileged group may: access (usually view) data that belongs to the group; upload new datasets to the data grid (to be owned by the group); and view public data that belongs to other groups. Members of a guest group may only read public data.

Each dataset on the data grid belongs to a privileged group and is classified as either:

- private data, which may only be accessed by members of the corresponding privileged group.
- public data, which may be read by any user of the data grid.

The implementation of this access control mechanism is described in more detail in Chapter 3.

2.2.2 The Ant-based build system

As of DiGS Version 2.0.1, the process of building DiGS tools and services from source has been greatly simplified, thanks to the use of an Ant-based build system. This build system is encapsulated in an Ant build script that is included in the top-level directory of the DiGS source code bundle (or may be downloaded from the software homepage [6]). The build script defines the following targets:

- checkout: checks out the source code from the web-based repository.
- make-client: compiles the client tools.
- make-client-from-cvs: checks out and compiles the client tools.
- make-storagenode: compiles the Storage Element tools and services (Version 2.x only).
- make-storagenode-from-cvs: checks out and compiles the Storage Element tools and services (Version 2.x only).
- make-server: compiles the Control/Backup Node tools and services.
- make-server-from-cvs: checks out and compiles the Control/Backup Node tools and services.
- dist-source: generates a source distribution.
- dist-ildg-browser: generates a stand-alone DiGS Browser.
- dist-client: generates a client distribution.
- dist-storagenode: generates a storage node distribution (Version 2.x only).
- dist-server: generates a Control Node distribution.
- clean-dist: deletes the distribution and browser directories.

To invoke one of the build targets, run a command of the form:

> ant <build target>

However, before one can run the build script, one needs to check some system-specific properties defined within build.xml, as follows:

- The make property should be set to the full path to the GNU Make utility on the target system.
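For example, the full path and version of GNU Make can be checked with standard shell commands before setting the make property in build.xml:

> which make
> make --version

Note that on some systems GNU Make may be installed as gmake rather than make.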

In addition, if one plans to check out the DiGS software from the project CVS repository, one also needs to check several other properties:

- The cvs.username property should be set to the username with which one accesses the software CVS repository [6]. One is asked to enter the corresponding password during the build process, as required.
- The cvs.tag property should be set to the CVS tag name for the required version of DiGS. Typically, the tag name has the form DIGSx_y_z, where x, y, and z are the version, major issue, and minor issue number, respectively. For example, the tag name for DiGS Version 2.0.2 is DIGS2_0_2.

The build script is designed to perform all intermediate steps, as required, when one invokes one of the dist-* targets. For example, ant dist-client both compiles the client and packages it up.

The output from a build process can be substantial. One may save a copy of the output to file, for diagnostics, with a command of the following form:

> ant <build target> | tee build.log

When building DiGS 3.1 with SRM support, you may encounter compilation errors related to gsoap if the version of the gsoap toolkit on your system is not compatible with the one used to produce the automatically generated SOAP source files included with DiGS. This can be remedied by regenerating the files using your version of gsoap:

> cd StorageElementInterface/src
> wsdl2h -c -o srm2_2.h srm.v2.2.wsdl
> soapcpp2 srm2_2.h
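Having regenerated the SRM stubs, one would then return to the top-level source directory and re-run the appropriate build target. A sketch, assuming a server build is required and that StorageElementInterface sits at the top level of the source tree:

> cd ../..
> ant make-server | tee build.log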

Chapter 3
Details of the Setup Process

The setup process for each of the installation types (client, storage element, and control/backup node) is described, in turn, in this chapter. The instructions below assume that one has created a package account (see Section 2.1.4) and that one runs the various shell commands as the package account user.

The instructions below also assume that the Globus environment has been configured within the command shell. This is usually straightforward, though it depends on the Globus installation approach that has been followed (the initialisation examples assume a Bourne-style shell, though an equivalent setup is available for the C-shell):

- If Globus has been installed using the Virtual Data Toolkit (VDT), then one needs to initialise the $VDT_LOCATION environment variable, for example:

  export VDT_LOCATION=/opt/vdt

  and then source the VDT setup script:

  source $VDT_LOCATION/setup.sh

- If Globus has been installed as a separate toolkit, then one needs to initialise the $GLOBUS_LOCATION environment variable, for example:

  export GLOBUS_LOCATION=/opt/globus

  and then source the Globus setup script:

  source $GLOBUS_LOCATION/etc/globus-user-env.sh

3.1 Client Installation

The client-only installation is the simplest to complete and is common to all installation types. It consists of the following steps:

1. Install the DiGS software (from either a pre-compiled binary package or from source code).
2. Create static configuration files that identify the particular characteristics of the target data grid.
3. Set up the user environment for correct operation of the client tools.

These three steps are discussed, in turn, below.
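Before proceeding, one may wish to confirm that the Globus environment described at the start of this chapter has been picked up by the shell. A minimal check, assuming a standard Globus client installation, is:

> echo $GLOBUS_LOCATION
> which grid-proxy-init

Both commands should produce sensible output (a non-empty location and a path to the grid-proxy-init executable) before continuing.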

3.1.1 Installing the Software (pre-compiled binaries)

Firstly, one should check that a suitable binary distribution is available for the target architecture by visiting the project website [6]. Next, one should decide where to set up the DiGS binaries and configuration files: the home directory of the DiGS Package Account (for example, /home/digs) is a suitable and commonly used location for a multi-user installation. Then, one should download the binary distribution (which is distributed as a compressed TAR ball) and unpack it into the chosen location, for example as follows:

> cd ~digs
> tar -zxvf digs-3.1-linux-client.tar.gz

Note that the above command sequence is assumed to be completed as the DiGS Package Account user (digs). By default, the bundle unpacks into a directory named digs-3.x.y, which we refer to as the DiGS installation directory from here on.

Note: For a multi-user installation, all of the files should be owned by the DiGS Package Account, should be world-readable and (for binaries) world-executable. One should ensure that an appropriate file mode creation mask (for example, via the umask setting) is configured before unpacking the TAR ball.

3.1.2 Installing the Software (building from source)

Firstly, one should decide where to install the DiGS binaries and configuration files: the home directory of the DiGS Package Account (for example, /home/digs) is a suitable and commonly used location for a multi-user installation.

One can download a pre-packaged source code bundle from the project website [6], and then unpack it into the chosen location, as follows:

> cd ~digs
> tar -zxvf <path_to_bundle>/digs-<x.y.z>-source.tar.gz
> cd digs-<x.y.z>-source

At this point one should have a file called build.xml and a directory called digs-3.x.y. Having obtained a suitable source code bundle, one can run the command:

> ant make-client | tee build.log

to build the client tools. At this point, one should have a directory, named digs-3.x.y, which contains the software installation.

3.1.3 Installing the Software (building from source from CVS)

Firstly, one should decide where to install the DiGS binaries and configuration files: the home directory of the DiGS Package Account (for example, /home/digs) is a suitable and commonly used location for a multi-user installation.

Next, one should download the latest version of the Ant build script build.xml (for example, from the project website [6]) to the target location and make any required configuration changes (see Section 2.2.2).

Thirdly, one should obtain a source code bundle. To check out the code, one must have access to the project CVS repository (hosted at [6]); then one can check out a specific version of the source and build the client tools using the make-client-from-cvs build target:

> ant make-client-from-cvs | tee build.log

At this point you should have two directories, named: fromcvs, which contains the source code; and digs-3.x.y, which contains the software installation.

3.1.4 Creating the configuration files

Having built the source code (or unpacked the DiGS client binaries), there are a number of configuration files that need to be created:

- nodes.conf: configuration file that contains information to allow the DiGS client tools to contact the Control Node and Backup Node.
- nodeprefs.conf: configuration file that enumerates the Storage Elements that are available on the data grid, in order of preference.
- browser.properties: configuration file that provides information to allow the DiGS client tools to locate the Metadata Catalogue.

Specific details of the particular values that are required for a particular data grid should be obtained from the data grid administrator. What follows, below, is a description of the parameters that must, or can, be specified in each configuration file.

Main configuration file

The main configuration file, nodes.conf, which should be created in the DiGS installation directory (for example, /home/digs/digs-3.1), contains the information that the client tools need in order to locate the Control Node (and Backup Node). Specifically, the file should contain the fully qualified domain name (FQDN) of the Control Node, the path to the DiGS software on that node, and the fully qualified domain name of the server hosting the Metadata Catalogue. It may also (optionally) contain the fully qualified domain name of the Backup Node, which is contacted if the Control Node is unreachable. The format for nodes.conf is as follows:

node = <FQDN of Control Node>
path = <path to DiGS software on Control Node>
mdc = <FQDN of Metadata Catalogue node, without port (assumed 8080)>
backup_node = <FQDN of Backup Node>
backup_path = <path to DiGS software on Backup Node>

As noted above, the backup_node and backup_path entries are optional: all other entries are mandatory. If you are unsure of the location of the Control Node and Backup Node, then you should contact your data grid administrator.

At the time of writing, the Control Node and Metadata Catalogue for the UKQCD Grid are running on ukqcdcontrol.epcc.ed.ac.uk at the University of Edinburgh, and the backup node is running on pytier2.swan.ac.uk at Swansea University. The nodes.conf file is as follows:

node = ukqcdcontrol.epcc.ed.ac.uk
path = /home/qcdgrid/qcdgrid
mdc = ukqcdcontrol.epcc.ed.ac.uk
backup_node = pytier2.swan.ac.uk
backup_path = /home/qcdgrid/qcdgrid

For the Control Node (and Backup Node), the package account needs to have write access to this configuration file. For example:

> chmod o+rw nodes.conf

Storage Element preferences

The Storage Element preferences file, nodeprefs.conf, lists the Storage Elements that are defined for the data grid, in order of preference (one per line). The file should be located in the DiGS installation directory (for example, /home/digs/digs-3.1), and those Storage Elements that are geographically closest should normally be at the top of the list for reasons of efficiency. For example:

local_node.ed.ac.uk
remote1.liv.ac.uk
remote2.liv.ac.uk
far-away_node.swan.ac.uk

For an up-to-date list of Storage Elements for a particular data grid, one should contact the data grid administrator.

Browser configuration

The Metadata Catalogue properties file, browser.properties, must also be created, and should be located in the java/ildg_browser subdirectory of the DiGS installation directory (for example, /home/digs/digs-3.1/java/ildg_browser). The file contains the authentication information for the database, as follows:

xmldb.user.name=<username for Metadata Catalogue>
xmldb.user.password=<password for Metadata Catalogue>

This properties file is also used by the DiGS Browser, which is an application-specific client (see Section 3.5 for more information). Additional properties may be available for your particular data grid. For example, for the UKQCD data grid the file has the following form:

qcdgrid.browser.mode=<either UKQCD for local or ILDG for guest>
queryrunner.maxresults=0
services.file.location=<location of ILDG Services definition>
xmldb.user.name=<UKQCD or ILDG username>
xmldb.user.password=<UKQCD or ILDG password>

3.1.5 Configuration of User Environment

In addition to the Globus environment requirements (see the discussion at the beginning of this chapter), a user needs to have certain environment variables configured in order that the DiGS software functions correctly. These environment variables are described, in turn, below. Since the environment needs to be initialised each time a shell is created, one might consider folding the command sequence into the normal shell initialisation process. The DiGS-specific requirements for a user's environment are as follows:

- The environment variable DIGS_HOME must be initialised to the DiGS installation directory. For example:

  export DIGS_HOME=/home/digs/digs-3.1

- The DiGS executables must be in the user's path:

  export PATH=$PATH:$DIGS_HOME

  Note: Simply giving a full path when executing a DiGS command is not sufficient for most commands.

- The DiGS client library (which, by default, is in the DiGS installation directory) must be in the library path. For example:

  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$DIGS_HOME

- The Java Runtime Environment must be in the path. This is usually the case by default, and can be checked by a successful invocation of:

  > java -version
  java version "1.5.0_09"
  Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_09-b03)
  Java HotSpot(TM) Server VM (build 1.5.0_09-b03, mixed mode)

- The BROWSER_HOME environment variable must be assigned to the location of the DiGS Browser code; this is usually $DIGS_HOME/java/ildg_browser. For example:

  export BROWSER_HOME=${DIGS_HOME}/java/ildg_browser

- The exist libraries (otherwise known as JARs) must be in the class path (this assumes that exist is the underlying database for the Metadata Catalogue). The exist libraries are distributed with both source and binary DiGS distributions in the exist_libraries sub-directory. For example:

  export EXIST_LIB_DIR=${DIGS_HOME}/exist_libraries
  for i in ${EXIST_LIB_DIR}/*.jar ; do
    export CLASSPATH=${i}:${CLASSPATH}
  done

In addition to the above environment variables, the following additional entries may be required in specific circumstances:

- The assignment:

  export LD_ASSUME_KERNEL=2.2.5

  is required on some Linux-based systems, due to a threading bug in early versions of the Globus Toolkit. Specifically, the assignment is necessary if using Globus Toolkit Version 2.4.x or Version 3.x with a sufficiently new Linux kernel; the bug does not affect older Linux kernels, and it has been fixed properly in Globus 4.0.

- On some operating systems, the Globus command globus-hostname does not return the fully qualified domain name of the system on which it is running (for example, digs-host.ed.ac.uk), as intended. One can run globus-hostname to check whether this is the case. If the command returns either the first element of the hostname (continuing with our example, this would be digs-host) or localhost, then one needs to define the additional environment variable:

  export GLOBUS_HOSTNAME=<fqdn.of.machine>

- The environment variable DIGS_GROUP may need to be set to the DiGS group to which the user belongs:

  export DIGS_GROUP=<users group>

  One only needs to do this if instructed to by the data grid administrator (for more information regarding the requirement for this environment variable, see NeSCForge Problem Report #1374 [6]).

Once the environment is set up as described above, the DiGS software is ready for use. One may wish to create a script to automate this environment setup, so that it can quickly be actioned in any particular shell instance. A template script, called example_setup.sh, is included in the distribution, which may be edited accordingly; it may simply require the appropriate value of DIGS_HOME to be specified at the beginning of the script.

At this point, one may wish to run the DiGS command digs-test-setup to confirm that everything is in order. See the DiGS User Guide [1] for more information on this command.

3.2 GridFTP-based Storage Element Setup

As of DiGS Version 3.0, the setup of a storage element has been simplified and it is no longer necessary to install DiGS software onto the resource. The instructions below should provide sufficient information for a competent Unix user to complete the process of setting up a GridFTP-based storage element. The process consists of the following steps:

- Create a DiGS package account: this step, which is optional for a Client Node installation, is mandatory in this case.
- Map the Control Node's host certificate subject to the DiGS package account, so that the Control Thread process can interact with the Storage Element.
- Set up Unix accounts and groups reflecting the make-up of the user base.
- Set up a directory structure to be used for data storage.

Note: Once a Storage Element has been configured, one must contact the data grid administrator to have it added to the list of Storage Elements for the grid.

3.2.1 Creating a DiGS Package Account

The use of a DiGS package account (see Section 2.1.4) is mandatory during the setup of a Storage Element. For security reasons, all datafiles stored on a DiGS Storage Element are owned by the DiGS package account. It is recommended that a new user account (for example, named digs) be created for this purpose, instead of using an existing user account or the root account.

3.2.2 Authorising Control Node to access Storage Element

Having created a DiGS package account, one is ready to add the Control Node's host certificate to the authorisation list for over-grid access to the DiGS package account on the Storage Element. This is achieved by adding an entry to the machine's grid-mapfile, usually /etc/grid-security/grid-mapfile.

This is a step that requires root permission. (Note: If the target data grid uses VOMS for user management, then the host certificate mapping should be added to the local static grid-mapfile, usually /etc/grid-security/grid-mapfile-local, instead of to the standard grid-mapfile. One should contact the data grid administrator for more information about this step.)

Specifically, one should add a line to the grid-mapfile, associating the Control Node's host certificate with the local DiGS Package Account. For example, as follows:

"/C=UK/O=eScience/OU=Edinburgh/L=NeSC/CN=ukqcdcontrol.epcc.ed.ac.uk/Email=<admin>@epcc.ed.ac.uk" digs

This entry must be added as a single, continuous line, even if it appears wrapped above.

3.2.3 Setting up User Access to Storage Element

The group-based access control policy, which is described in Section 2.2.1, is implemented on each GridFTP-based Storage Element using Unix groups and accounts. On every Storage Element, each DiGS group must be instantiated as a Unix group and each DiGS user must be mapped to a Unix account that belongs to the corresponding Unix group.

There does not need to be a one-to-one mapping between data grid users and Unix accounts. It is more typical to either:

- Create a single Unix account for each group, and map all users (in the corresponding DiGS group) to this Unix account. This approach works well provided that the Storage Element is not also used as a grid compute resource, implying that a DiGS user may need to maintain time-limited though stateful data on the resource.
- Use a technology such as the EDG Pooled Accounts mechanism to create a set of generic, time-leased Unix accounts for each group, with each user being temporarily mapped to a free pooled account whenever they access the Storage Element resource. For more information about the EDG Pooled Accounts mechanism, see the EDG VO files package, which is available (for example) from the project website [6].

Having set up the group structure on a storage element, one must ensure that the DiGS Package Account is added to each of the groups. This is to ensure that the DiGS Package Account can manage each group's data, as actioned by the Control Thread process.

The mechanism used to manage the mapping of user accounts to Unix equivalents is outside the scope of the DiGS software. By default, a Globus installation is configured to use the grid-mapfile mechanism to map each user's certificate credentials to a local Unix account: this is a suitable mechanism for describing the user mappings as described above.

For small-scale data grid installations with small (and stable) user bases, it is possible to administer groups and users manually. However, the DiGS team recommend that, in order to reduce both the administrative load and the scope for errors, one should invest effort in setting up a more automated process: for example, one based on a Virtual Organisation Management System. A selection of tools to assist with the automation of user management is provided in the EDG Files package on the project website [6].

3.2.4 Setting up Storage Directories

The final step in setting up a Storage Element is to create a place (a Storage Directory) in which user data can be stored. The actual target storage can be located anywhere on the file system, though it must be owned by, and writable by, the DiGS package account.

For example, if the Storage Element has a local RAID system that is mounted on the file system as /raid, one may wish to create a target folder for data grid storage (that is, a Storage Directory) called /raid/digs-data. Within the Storage Directory, one must create a sub-directory called data. This directory structure could be implemented by running the following sequence of commands (as root):

> cd /raid
> mkdir digs-data
> chown <package account> digs-data
> cd digs-data
> mkdir data
> chown <package account> data

The data directory should support read access for each of the Unix groups that corresponds to a DiGS privileged group. This can be achieved using:

> chmod a+rx data

or (in a more controlled manner) using access control lists (ACLs), on systems that support this.

An optional Inbox can also be created. This is a folder in which new data may be temporarily stored, in the first stage of being added to the data grid. At least one of the storage elements must have an Inbox. This directory, which must be located in the data directory, should be called NEW. It should be owned by the DiGS Package Account and must be writable by members of each of the Unix groups that corresponds to a DiGS privileged group. Continuing with our example, this could be achieved as follows (again, running as the root user):

> cd /raid/digs-data/data
> mkdir NEW
> chown <package account> NEW
> chmod a+rwx NEW

Note: As above, one may configure group-based write permissions using ACLs, on systems that support this.

Having created a suitable data storage directory (and possibly an Inbox), one needs to inform the data grid administrator of the details, so that the Control Node can be configured as described in Section 3.4.3.

Configuring multiple Storage Directories

It is possible to have multiple data directories on a single Storage Element. For example, this could allow multiple mass storage devices to be accessed from a single Storage Element. The process for adding further data directories is similar to that described above, with two differences of note:

- Symbolic links to each data directory should be created in the same directory as the first data directory, named data for the first entry, data1 for the second entry, then data2, and so on. For example, this can be done with the commands:

  > cd /raid/digs-data
  > ln -s /raid2/digs-data data1
  > chown <package account> data1
  > chmod a+rx data1

- DiGS automatically recognises that multiple Storage Directories are available and aims to share grid datasets between the different directories in order to balance free space on the associated disk units.
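Where the ACL-based approach mentioned above is preferred to world-readable permissions, POSIX ACLs can be used to grant access per group. A minimal sketch, assuming the file system supports ACLs and that a hypothetical Unix group named ukq corresponds to a DiGS privileged group:

> setfacl -m g:ukq:rx /raid/digs-data/data
> setfacl -m g:ukq:rwx /raid/digs-data/data/NEW
> getfacl /raid/digs-data/data

The getfacl command simply displays the resulting entries so that the mapping from DiGS groups to Unix groups can be verified.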

3.3 SRM Storage Element Setup

As of DiGS Version 3.1, storage elements that expose a Storage Resource Manager (SRM) [3] interface are supported. Any SRM v2-compliant implementation should work with DiGS; however, at this stage DPM [4] has been most thoroughly tested.

The details of setting up an SRM server for use with DiGS will vary depending on the implementation used, and are beyond the scope of this document. At a minimum, the following two steps will be necessary:

- Set up authorisation to allow access from DiGS users and the control thread. This is typically done by sourcing user information from a VOMS server or similar.
- Create the directory structure required by DiGS. This may be as simple as one single data directory and an optional inbox directory. The data directory should be writable by the control thread's identity and readable by all users. The inbox directory (if present) should be owned by the control thread certificate but writable by all users with permission to add data to the grid.

As with GridFTP storage elements, details of the SRM storage elements must be added to the control node configuration (see the next section).

3.4 Control Node Installation

As described in Chapter 1, the Control Node is an essential component of a DiGS-powered system: it hosts a daemon process, the Control Thread, that maintains the integrity of the data grid, and it hosts a database, the File Catalogue, describing the layout of data in the data grid. The setup process for the Control Node consists of the following steps:

1. Create a DiGS package account: this step, which is optional for a Client Node installation, is mandatory for the Control Node.
2. Install the DiGS software (from either a pre-compiled binary package or from source code).
3. Set up the Control Node configuration files.
4. Set up an X.509 certificate to allow the Control Thread to authenticate to other nodes in the data grid.
5. Set up user access to the Control Node.
6. Start the Control Thread.

The Control Node can also host the Metadata Catalogue, though this can be installed elsewhere. The setup process for the Metadata Catalogue is described in Section 3.5.

3.4.1 Creating a DiGS Package Account

The use of a DiGS package account (see Section 2.1.4) is mandatory during the setup of the Control Node. It is recommended that a new user account (for example, named digs) be created for this purpose, instead of using an existing user account or the root account.

3.4.2 Installing the software

As with the Client Node installation and the Storage Element installation, one may choose to install the software from either a pre-compiled binary package (if a suitable package is available) or from source code:

- The installation process for a pre-compiled binary distribution is the same as for a client installation (see Section 3.1 for details).
- The installation process for source code is very similar to the client installation, except that one should use the build target make-server:

  > ant make-server | tee build.log

Note that, having completed the unpack/build of the DiGS software, one also needs to set up the various configuration files and configure the shell environment for the DiGS package account, as for the client installation (Section 3.1).

3.4.3 Creating the Control Node Configuration Files

The Control Node hosts a number of configuration files that capture a description of the data grid in its entirety. These configuration files are used by a local daemon process, called the Control Thread, and by remote clients, running on the other nodes, to establish required information about the setup and status of the data grid. To ensure that the configuration files are accessible as required, one should make the files world-readable, though writable only by the DiGS Package Account.

All Control Node configuration files should be created in the DiGS installation directory (for example, /home/digs/digs-3.1). The files are as follows:

- qcdgrid.conf contains general configuration settings.
- mainnodelist.conf contains a description of the Storage Elements on the data grid.
- deadnodes.conf contains a list of those Storage Elements that are currently and unexpectedly not responding to Control Thread requests.
- disablednodes.conf contains a list of those Storage Elements that have been temporarily disabled by the data grid administrator: for example, during scheduled maintenance sessions.
- retiringnodes.conf contains a list of those nodes that are scheduled for removal from the data grid.
- group-mapfile contains a list of group memberships for each user certificate subject. This configuration file is discussed in Section 3.4.5.

The makeup of each of these configuration files (except for the group-mapfile, which is discussed in Section 3.4.5) is described, in turn, below.

The qcdgrid.conf file

The qcdgrid.conf file contains various settings, as follows:

- rc_port: the port number on which the File Catalogue service is listening.
- qcdgrid_port: the port number on which the Control Thread is listening.
- min_copies: the default number of replicas for each user data file.
- disk_space_low: the threshold free space (in bytes) below which a Storage Element should not be written to.
- disk_space_panic: the threshold free space (in bytes) below which a Storage Element must not fall.
- administrator: the identity (that is, the X.509 certificate distinguished name) of the data grid administrator.

- admin_email: the email address of the data grid administrator.

An example configuration is given below:

rc_port=39281
qcdgrid_port=51000
min_copies=2
disk_space_low=
disk_space_panic=
administrator=/o=grid/o=ukhep/ou=epcc.ed.ac.uk/cn=james Perry

The mainnodelist.conf file

The mainnodelist.conf file contains a list of the Storage Elements on the data grid. For each node, it must/may include the following information:

- node: the fully qualified domain name of the Storage Element. This element is required.
- site: a human-readable name for the site at which the Storage Element is located. This can be an arbitrary text string. It is used by the Control Thread to check that two Storage Elements are located at different sites when replicating data files. This element is required.
- disk: the amount of disk space (in kilobytes) on the Storage Element which is available for data grid usage. Initially, this attribute can be given an arbitrary numeric value, as it is automatically updated by the Control Thread as part of its monitoring remit. This attribute is required, though the value is maintained by the Control Thread.
- path: the path containing the data directories on the Storage Element. This element is required.
- type: the type of Storage Element. At the time of writing, globus and srm are supported. This element is required.
- inbox: the full path to the inbox on the storage element. If this element is not supplied, it will be assumed that the storage element does not contain an inbox.
- extrarsl: this attribute is used to specify additional Globus Resource Specification Language (RSL) clauses that are added to the RSL string whenever a job is submitted to the Storage Element. This attribute can, for example, be useful for specifying additional environment variables that must be set. This element is optional.
- extrajsscontact: this attribute specifies an extra string to be added to the job submission service contact string. This can be used to specify a non-standard Globus job manager, for example. This element is optional.
- gpfs: a flag that should be set to 1 to enable GPFS support on the node, and 0 otherwise. GPFS-based systems require the free space on the disk to be calculated in a different manner to normal Unix file servers. This element is optional. (The DiGS development team recommend that you contact them, before enabling this flag, for a more detailed explanation of the support for GPFS.)
- jobtimeout: (Versions 1.x and 2.0 only) this attribute specifies the time-out value in seconds for (Globus GRAM-based) job submission operations to this node. This element is optional.
- ftptimeout: this attribute specifies the time-out value in seconds for short file operations (for example, creating new directories, deleting files, and so on) on this node. This element is optional.
- copytimeout: this attribute specifies the time-out value in seconds for long file operations (usually, data file transfers) on this node. This element is optional.

- endpoint: this attribute is required for SRM storage elements but not for Globus SEs. It specifies the endpoint URL for contacting the SRM web service.

In addition, as of DiGS Version 3.0, entries are required which specify the total disk space allocated to DiGS on each data disk of each node. These properties are named data, data1, data2, etc., the same as the directories to which they refer, and their values are in kilobytes.

An example mainnodelist.conf is given below:

node=trumpton.ph.ed.ac.uk
site=edinburgh
disk=
path=/disk/trumpton0/ukqcd.local/qcdgrid
type=globus
inbox=/home/digs/new
data=
data1=

node=dylan.amtp.liv.ac.uk
site=liverpool
disk=
path=/users/qcdgrid/dylan
type=globus
data=
data1=

node=pygrid1.swan.ac.uk
site=swansea
disk=
path=/home/qcdgrid
gpfs=1
type=globus
data=
data1=
data2=
data3=

node=srm.epcc.ed.ac.uk
site=edinburgh
disk=
path=/dpm/epcc.ed.ac.uk/home/digs
inbox=/dpm/epcc.ed.ac.uk/home/digs/inbox
type=srm
endpoint=srm://srm.epcc.ed.ac.uk:8446/srm/managerv2
data=

Other node status files

The deadnodes.conf, disablednodes.conf and retiringnodes.conf files are managed by the software and should initially be created as empty files, for example by running the following command sequence as the DiGS Package Account user:

> cd ${DIGS_HOME}
> touch deadnodes.conf
> chmod 644 deadnodes.conf
> touch disablednodes.conf


More information

dcache Introduction Course

dcache Introduction Course GRIDKA SCHOOL 2013 KARLSRUHER INSTITUT FÜR TECHNOLOGIE KARLSRUHE August 29, 2013 dcache Introduction Course Overview Chapters I, II and Ⅴ Christoph Anton Mitterer christoph.anton.mitterer@lmu.de ⅤIII.

More information

Grid services. Enabling Grids for E-sciencE. Dusan Vudragovic Scientific Computing Laboratory Institute of Physics Belgrade, Serbia

Grid services. Enabling Grids for E-sciencE. Dusan Vudragovic Scientific Computing Laboratory Institute of Physics Belgrade, Serbia Grid services Dusan Vudragovic dusan@phy.bg.ac.yu Scientific Computing Laboratory Institute of Physics Belgrade, Serbia Sep. 19, 2008 www.eu-egee.org Set of basic Grid services Job submission/management

More information

EMC Documentum Composer

EMC Documentum Composer EMC Documentum Composer Version 6 SP1 User Guide P/N 300 005 253 A01 EMC Corporation Corporate Headquarters: Hopkinton, MA 01748 9103 1 508 435 1000 www.emc.com Copyright 2008 EMC Corporation. All rights

More information

IBM Spectrum Protect Version Introduction to Data Protection Solutions IBM

IBM Spectrum Protect Version Introduction to Data Protection Solutions IBM IBM Spectrum Protect Version 8.1.2 Introduction to Data Protection Solutions IBM IBM Spectrum Protect Version 8.1.2 Introduction to Data Protection Solutions IBM Note: Before you use this information

More information

ForeScout Extended Module for IBM BigFix

ForeScout Extended Module for IBM BigFix ForeScout Extended Module for IBM BigFix Version 1.0.0 Table of Contents About this Integration... 4 Use Cases... 4 Additional BigFix Documentation... 4 About this Module... 4 Concepts, Components, Considerations...

More information

CS 1653: Applied Cryptography and Network Security Fall Term Project, Phase 2

CS 1653: Applied Cryptography and Network Security Fall Term Project, Phase 2 CS 1653: Applied Cryptography and Network Security Fall 2017 Term Project, Phase 2 Assigned: Tuesday, September 12 Due: Tuesday, October 3, 11:59 PM 1 Background Over the course of this semester, we will

More information

Linux Administration

Linux Administration Linux Administration This course will cover all aspects of Linux Certification. At the end of the course delegates will have the skills required to administer a Linux System. It is designed for professionals

More information

Using the VMware vcenter Orchestrator Client. vrealize Orchestrator 5.5.1

Using the VMware vcenter Orchestrator Client. vrealize Orchestrator 5.5.1 Using the VMware vcenter Orchestrator Client vrealize Orchestrator 5.5.1 You can find the most up-to-date technical documentation on the VMware website at: https://docs.vmware.com/ If you have comments

More information

Grid Architectural Models

Grid Architectural Models Grid Architectural Models Computational Grids - A computational Grid aggregates the processing power from a distributed collection of systems - This type of Grid is primarily composed of low powered computers

More information

Guidelines on non-browser access

Guidelines on non-browser access Published Date: 13-06-2017 Revision: 1.0 Work Package: Document Code: Document URL: JRA1 AARC-JRA1.4F https://aarc-project.eu/wp-content/uploads/2017/03/aarc-jra1.4f.pdf 1 Table of Contents 1 Introduction

More information

Boundary control : Access Controls: An access control mechanism processes users request for resources in three steps: Identification:

Boundary control : Access Controls: An access control mechanism processes users request for resources in three steps: Identification: Application control : Boundary control : Access Controls: These controls restrict use of computer system resources to authorized users, limit the actions authorized users can taker with these resources,

More information

Tutorial 1: Introduction to Globus Toolkit. John Watt, National e-science Centre

Tutorial 1: Introduction to Globus Toolkit. John Watt, National e-science Centre Tutorial 1: Introduction to Globus Toolkit John Watt, National e-science Centre National e-science Centre Kelvin Hub Opened May 2003 Kelvin Building Staff Technical Director Prof. Richard Sinnott 6 RAs

More information

g-eclipse A Framework for Accessing Grid Infrastructures Nicholas Loulloudes Trainer, University of Cyprus (loulloudes.n_at_cs.ucy.ac.

g-eclipse A Framework for Accessing Grid Infrastructures Nicholas Loulloudes Trainer, University of Cyprus (loulloudes.n_at_cs.ucy.ac. g-eclipse A Framework for Accessing Grid Infrastructures Trainer, University of Cyprus (loulloudes.n_at_cs.ucy.ac.cy) EGEE Training the Trainers May 6 th, 2009 Outline Grid Reality The Problem g-eclipse

More information

ARCHER Data Services Service Layer

ARCHER Data Services Service Layer ARCHER 1.0 ARCHER Data Services Service Layer System Administrator s Guide ICAT & MCAText Installation Configuration Maintenance ARCHER Data Services Service Layer... 1 About ARCHER Data Services Service

More information

AppDev StudioTM 3.2 SAS. Migration Guide

AppDev StudioTM 3.2 SAS. Migration Guide SAS Migration Guide AppDev StudioTM 3.2 The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2006. SAS AppDev TM Studio 3.2: Migration Guide. Cary, NC: SAS Institute Inc.

More information

Knowledge Discovery Services and Tools on Grids

Knowledge Discovery Services and Tools on Grids Knowledge Discovery Services and Tools on Grids DOMENICO TALIA DEIS University of Calabria ITALY talia@deis.unical.it Symposium ISMIS 2003, Maebashi City, Japan, Oct. 29, 2003 OUTLINE Introduction Grid

More information

Using the Horizon vcenter Orchestrator Plug-In. VMware Horizon 6 6.0

Using the Horizon vcenter Orchestrator Plug-In. VMware Horizon 6 6.0 Using the Horizon vcenter Orchestrator Plug-In VMware Horizon 6 6.0 You can find the most up-to-date technical documentation on the VMware Web site at: https://docs.vmware.com/ The VMware Web site also

More information

ForeScout Extended Module for IBM BigFix

ForeScout Extended Module for IBM BigFix Version 1.1 Table of Contents About BigFix Integration... 4 Use Cases... 4 Additional BigFix Documentation... 4 About this Module... 4 About Support for Dual Stack Environments... 5 Concepts, Components,

More information

Introduction to GT3. Overview. Installation Pre-requisites GT3.2. Overview of Installing GT3

Introduction to GT3. Overview. Installation Pre-requisites GT3.2. Overview of Installing GT3 Introduction to GT3 Background The Grid Problem The Globus Approach OGSA & OGSI Globus Toolkit GT3 Architecture and Functionality: The Latest Refinement of the Globus Toolkit Core Base Services User-Defined

More information

Multi-Tenancy in vrealize Orchestrator. vrealize Orchestrator 7.4

Multi-Tenancy in vrealize Orchestrator. vrealize Orchestrator 7.4 Multi-Tenancy in vrealize Orchestrator vrealize Orchestrator 7.4 Multi-Tenancy in vrealize Orchestrator You can find the most up-to-date technical documentation on the VMware website at: https://docs.vmware.com/

More information

Distribute Call Studio applications to Unified CVP VXML Servers.

Distribute Call Studio applications to Unified CVP VXML Servers. is one of the Cisco Unified Customer Voice Portal (CVP) components and a web-based interface using which you can configure other Unified CVP components and devices in the Unified CVP solution. Use to perform

More information

IBM Tivoli Storage Manager Version Introduction to Data Protection Solutions IBM

IBM Tivoli Storage Manager Version Introduction to Data Protection Solutions IBM IBM Tivoli Storage Manager Version 7.1.6 Introduction to Data Protection Solutions IBM IBM Tivoli Storage Manager Version 7.1.6 Introduction to Data Protection Solutions IBM Note: Before you use this

More information

Design of Distributed Data Mining Applications on the KNOWLEDGE GRID

Design of Distributed Data Mining Applications on the KNOWLEDGE GRID Design of Distributed Data Mining Applications on the KNOWLEDGE GRID Mario Cannataro ICAR-CNR cannataro@acm.org Domenico Talia DEIS University of Calabria talia@deis.unical.it Paolo Trunfio DEIS University

More information

VMware Identity Manager Connector Installation and Configuration (Legacy Mode)

VMware Identity Manager Connector Installation and Configuration (Legacy Mode) VMware Identity Manager Connector Installation and Configuration (Legacy Mode) VMware Identity Manager This document supports the version of each product listed and supports all subsequent versions until

More information

NCP Secure Enterprise Management for Linux Release Notes

NCP Secure Enterprise Management for Linux Release Notes Major Release: 5.00 r39572 Date: May 2018 Prerequisites The following distributions and databases with the associated Connector/C drivers are supported with this release: Linux distribution Database Driver

More information

Using the MyProxy Online Credential Repository

Using the MyProxy Online Credential Repository Using the MyProxy Online Credential Repository Jim Basney National Center for Supercomputing Applications University of Illinois jbasney@ncsa.uiuc.edu What is MyProxy? Independent Globus Toolkit add-on

More information

Front Office and VMware. Automating server provisioning from your Service Catalog

Front Office and VMware. Automating server provisioning from your Service Catalog Front Office and VMware Automating server provisioning from your Service Catalog Contents 1.0 Introduction... 3 2.0 How it works... 3 2.1 Virtual Machine Templates... 3 2.2 Front Office Configuration...

More information

Contents Overview... 5 Downloading Primavera Gateway... 5 Primavera Gateway On-Premises Installation Prerequisites... 6

Contents Overview... 5 Downloading Primavera Gateway... 5 Primavera Gateway On-Premises Installation Prerequisites... 6 Gateway Installation and Configuration Guide for On-Premises Version 17 September 2017 Contents Overview... 5 Downloading Primavera Gateway... 5 Primavera Gateway On-Premises Installation Prerequisites...

More information

VMware AirWatch Content Gateway for Linux. VMware Workspace ONE UEM 1811 Unified Access Gateway

VMware AirWatch Content Gateway for Linux. VMware Workspace ONE UEM 1811 Unified Access Gateway VMware AirWatch Content Gateway for Linux VMware Workspace ONE UEM 1811 Unified Access Gateway You can find the most up-to-date technical documentation on the VMware website at: https://docs.vmware.com/

More information

Setting Up Resources in VMware Identity Manager (SaaS) Modified 15 SEP 2017 VMware Identity Manager

Setting Up Resources in VMware Identity Manager (SaaS) Modified 15 SEP 2017 VMware Identity Manager Setting Up Resources in VMware Identity Manager (SaaS) Modified 15 SEP 2017 VMware Identity Manager Setting Up Resources in VMware Identity Manager (SaaS) You can find the most up-to-date technical documentation

More information

Intellicus Cluster and Load Balancing- Linux. Version: 18.1

Intellicus Cluster and Load Balancing- Linux. Version: 18.1 Intellicus Cluster and Load Balancing- Linux Version: 18.1 1 Copyright 2018 Intellicus Technologies This document and its content is copyrighted material of Intellicus Technologies. The content may not

More information

GUT. GUT Installation Guide

GUT. GUT Installation Guide Date : 17 Mar 2011 1/6 GUT Contents 1 Introduction...2 2 Installing GUT...2 2.1 Optional Extensions...2 2.2 Installation using the Binary package...2 2.2.1 Linux or Mac OS X...2 2.2.2 Windows...4 2.3 Installing

More information

Connector for Microsoft SharePoint 2013, 2016 and Online Setup and Reference Guide

Connector for Microsoft SharePoint 2013, 2016 and Online Setup and Reference Guide Connector for Microsoft SharePoint 2013, 2016 and Online Setup and Reference Guide Published: 2018-Oct-09 Contents 1 Microsoft SharePoint 2013, 2016 and Online Connector 4 1.1 Products 4 1.2 Supported

More information

Installing and Configuring VMware Identity Manager Connector (Windows) OCT 2018 VMware Identity Manager VMware Identity Manager 3.

Installing and Configuring VMware Identity Manager Connector (Windows) OCT 2018 VMware Identity Manager VMware Identity Manager 3. Installing and Configuring VMware Identity Manager Connector 2018.8.1.0 (Windows) OCT 2018 VMware Identity Manager VMware Identity Manager 3.3 You can find the most up-to-date technical documentation on

More information

Question No: 1 In which file should customization classes be specified in the cust-config section (under mds-config)?

Question No: 1 In which file should customization classes be specified in the cust-config section (under mds-config)? Volume: 80 Questions Question No: 1 In which file should customization classes be specified in the cust-config section (under mds-config)? A. web.xml B. weblogic.xml C. adf-config.xml D. adfm.xml Question

More information

Siebel Application Deployment Manager Guide. Version 8.0, Rev. A April 2007

Siebel Application Deployment Manager Guide. Version 8.0, Rev. A April 2007 Siebel Application Deployment Manager Guide Version 8.0, Rev. A April 2007 Copyright 2005, 2006, 2007 Oracle. All rights reserved. The Programs (which include both the software and documentation) contain

More information

BlackBerry Enterprise Server for IBM Lotus Domino Version: 5.0. Administration Guide

BlackBerry Enterprise Server for IBM Lotus Domino Version: 5.0. Administration Guide BlackBerry Enterprise Server for IBM Lotus Domino Version: 5.0 Administration Guide SWDT487521-636611-0528041049-001 Contents 1 Overview: BlackBerry Enterprise Server... 21 Getting started in your BlackBerry

More information

Bedework Calendar Deployment Manual

Bedework Calendar Deployment Manual Bedework Calendar Deployment Manual Bedework version 3.1 Last modified: July 30, 2006 Bedework Deployment Manual The Bedework Deployment Manual contains instructions for customizing and installing a production

More information

EUROPEAN MIDDLEWARE INITIATIVE

EUROPEAN MIDDLEWARE INITIATIVE EUROPEAN MIDDLEWARE INITIATIVE VOMS CORE AND WMS SECURITY ASSESSMENT EMI DOCUMENT Document identifier: EMI-DOC-SA2- VOMS_WMS_Security_Assessment_v1.0.doc Activity: Lead Partner: Document status: Document

More information

vcloud Director Administrator's Guide vcloud Director 9.0

vcloud Director Administrator's Guide vcloud Director 9.0 vcloud Director 9.0 You can find the most up-to-date technical documentation on the VMware Web site at: https://docs.vmware.com/ The VMware Web site also provides the latest product updates. If you have

More information

BIG-IP Access Policy Manager : Portal Access. Version 12.1

BIG-IP Access Policy Manager : Portal Access. Version 12.1 BIG-IP Access Policy Manager : Portal Access Version 12.1 Table of Contents Table of Contents Overview of Portal Access...7 Overview: What is portal access?...7 About portal access configuration elements...7

More information

GT 4.2.0: Community Scheduler Framework (CSF) System Administrator's Guide

GT 4.2.0: Community Scheduler Framework (CSF) System Administrator's Guide GT 4.2.0: Community Scheduler Framework (CSF) System Administrator's Guide GT 4.2.0: Community Scheduler Framework (CSF) System Administrator's Guide Introduction This guide contains advanced configuration

More information

Zadara Enterprise Storage in

Zadara Enterprise Storage in Zadara Enterprise Storage in Google Cloud Platform (GCP) Deployment Guide March 2017 Revision A 2011 2017 ZADARA Storage, Inc. All rights reserved. Zadara Storage / GCP - Deployment Guide Page 1 Contents

More information

QuickStart: Deploying DataSynapse GridServer 5.0 on HP Clusters

QuickStart: Deploying DataSynapse GridServer 5.0 on HP Clusters QuickStart: Deploying DataSynapse GridServer 5.0 on HP Clusters 1 Overview... 2 2 DataSynapse: Quick Overview... 2 2.1 GridServer 5.0 Grid Topology... 5 2.2 The GridServer Administration Tool... 5 3 Requirements...

More information

Advanced Service Design. vrealize Automation 6.2

Advanced Service Design. vrealize Automation 6.2 vrealize Automation 6.2 You can find the most up-to-date technical documentation on the VMware website at: https://docs.vmware.com/ If you have comments about this documentation, submit your feedback to

More information

Relativity Data Server

Relativity Data Server Relativity Data Server Micro Focus The Lawn 22-30 Old Bath Road Newbury, Berkshire RG14 1QN UK http://www.microfocus.com Copyright Micro Focus 2009-2015. All rights reserved. MICRO FOCUS, the Micro Focus

More information

Polarion 18.2 Enterprise Setup

Polarion 18.2 Enterprise Setup SIEMENS Polarion 18.2 Enterprise Setup POL005 18.2 Contents Overview........................................................... 1-1 Terminology..........................................................

More information

Deploying Rubrik Datos IO to Protect MongoDB Database on GCP

Deploying Rubrik Datos IO to Protect MongoDB Database on GCP DEPLOYMENT GUIDE Deploying Rubrik Datos IO to Protect MongoDB Database on GCP TABLE OF CONTENTS INTRODUCTION... 1 OBJECTIVES... 1 COSTS... 2 BEFORE YOU BEGIN... 2 PROVISIONING YOUR INFRASTRUCTURE FOR THE

More information

AutoForm plus R6.0.3 Release Notes

AutoForm plus R6.0.3 Release Notes 0 Release Notes AutoForm plus R6.0.3 Release Notes AutoForm plus R6.0.3 Release Notes...1 1 General Information...2 2 Installation Instructions...3 Front-End and Back-End Windows...3 Prerequisites...3

More information

Bitnami Re:dash for Huawei Enterprise Cloud

Bitnami Re:dash for Huawei Enterprise Cloud Bitnami Re:dash for Huawei Enterprise Cloud Description Re:dash is an open source data visualization and collaboration tool. It was designed to allow fast and easy access to billions of records in all

More information

Using the VMware vrealize Orchestrator Client

Using the VMware vrealize Orchestrator Client Using the VMware vrealize Orchestrator Client vrealize Orchestrator 7.0 This document supports the version of each product listed and supports all subsequent versions until the document is replaced by

More information

Prototype DIRAC portal for EISCAT data Short instruction

Prototype DIRAC portal for EISCAT data Short instruction Prototype DIRAC portal for EISCAT data Short instruction Carl-Fredrik Enell January 19, 2017 1 Introduction 1.1 DIRAC EGI, first European Grid Initiative, later European Grid Infrastructure, and now simply

More information

vcloud Director Administrator's Guide

vcloud Director Administrator's Guide vcloud Director 5.1.1 This document supports the version of each product listed and supports all subsequent versions until the document is replaced by a new edition. To check for more recent editions of

More information

DIGIPASS Authentication for Check Point VPN-1

DIGIPASS Authentication for Check Point VPN-1 DIGIPASS Authentication for Check Point VPN-1 With Vasco VACMAN Middleware 3.0 2007 Integration VASCO Data Security. Guideline All rights reserved. Page 1 of 51 Disclaimer Disclaimer of Warranties and

More information

A Federated Grid Environment with Replication Services

A Federated Grid Environment with Replication Services A Federated Grid Environment with Replication Services Vivek Khurana, Max Berger & Michael Sobolewski SORCER Research Group, Texas Tech University Grids can be classified as computational grids, access

More information

VII. Corente Services SSL Client

VII. Corente Services SSL Client VII. Corente Services SSL Client Corente Release 9.1 Manual 9.1.1 Copyright 2014, Oracle and/or its affiliates. All rights reserved. Table of Contents Preface... 5 I. Introduction... 6 Chapter 1. Requirements...

More information

WORKSTATION SETUP GUIDE FOR ACCESSING THE MIBGAS PLATFORM. Date: 15/12/2016. Version: 4.1 MERCADO IBÉRICO DEL GAS

WORKSTATION SETUP GUIDE FOR ACCESSING THE MIBGAS PLATFORM. Date: 15/12/2016. Version: 4.1 MERCADO IBÉRICO DEL GAS WORKSTATION SETUP GUIDE FOR ACCESSING THE MIBGAS PLATFORM Date: 15/12/2016 Version: 4.1 Alfonso XI, 6. 28014 Madrid (España) www.mibgas.es T (+34) 91 268 26 01 CONTENTS 1 INTRODUCTION 2 2 PRIOR REQUISITES

More information

CA IdentityMinder. Glossary

CA IdentityMinder. Glossary CA IdentityMinder Glossary 12.6.3 This Documentation, which includes embedded help systems and electronically distributed materials, (hereinafter referred to as the Documentation ) is for your informational

More information

Setup Desktop Grids and Bridges. Tutorial. Robert Lovas, MTA SZTAKI

Setup Desktop Grids and Bridges. Tutorial. Robert Lovas, MTA SZTAKI Setup Desktop Grids and Bridges Tutorial Robert Lovas, MTA SZTAKI Outline of the SZDG installation process 1. Installing the base operating system 2. Basic configuration of the operating system 3. Installing

More information

DataMan. version 6.5.4

DataMan. version 6.5.4 DataMan version 6.5.4 Contents DataMan User Guide 1 Introduction 1 DataMan 1 Technical Specifications 1 Hardware Requirements 1 Software Requirements 2 Ports 2 DataMan Installation 2 Component Installation

More information

TIBCO Spotfire Statistics Services Release Notes

TIBCO Spotfire Statistics Services Release Notes TIBCO Spotfire Statistics Services Release Notes Software Release 6.5 April 2014 Two-Second Advantage 2 Important SOME TIBCO SOFTWARE EMBEDS OR BUNDLES OTHER TIBCO SOFTWARE. USE OF SUCH EMBEDDED OR BUNDLED

More information

SAML-Based SSO Configuration

SAML-Based SSO Configuration Prerequisites, page 1 SAML SSO Configuration Task Flow, page 5 Reconfigure OpenAM SSO to SAML SSO Following an Upgrade, page 9 SAML SSO Deployment Interactions and Restrictions, page 9 Prerequisites NTP

More information

GUT. GUT Installation Guide

GUT. GUT Installation Guide Date : 02 Feb 2009 1/5 GUT Table of Contents 1 Introduction...2 2 Installing GUT...2 2.1 Optional Extensions...2 2.2 Installing from source...2 2.3 Installing the Linux binary package...4 2.4 Installing

More information

Lab 1 1 Due Wed., 2 Sept. 2015

Lab 1 1 Due Wed., 2 Sept. 2015 Lab 1 1 Due Wed., 2 Sept. 2015 CMPSC 112 Introduction to Computer Science II (Fall 2015) Prof. John Wenskovitch http://cs.allegheny.edu/~jwenskovitch/teaching/cmpsc112 Lab 1 - Version Control with Git

More information

AMGA metadata catalogue system

AMGA metadata catalogue system AMGA metadata catalogue system Hurng-Chun Lee ACGrid School, Hanoi, Vietnam www.eu-egee.org EGEE and glite are registered trademarks Outline AMGA overview AMGA Background and Motivation for AMGA Interface,

More information