
An investigation of a high availability DPM-based Grid Storage Element

Kwong Tat Cheung

August 17, 2017

MSc in High Performance Computing with Data Science
The University of Edinburgh
Year of Presentation: 2017

Abstract

As the data volume of scientific experiments continues to increase, there is a growing need for Grid Storage Elements to provide a reliable and robust storage solution. This work investigates the single-point-of-failure limitation in DPM's architecture and identifies the components that prevent the use of redundant head nodes to provide higher availability. This work also contributes a prototype of a novel high availability DPM architecture, designed using the findings of our investigation.

Contents

1 Introduction
  1.1 Big data in science
  1.2 Storage on the grid
  1.3 The problem
    1.3.1 Challenges in availability
    1.3.2 Limitations in DPM legacy components
  1.4 Aim
  1.5 Project scope
  1.6 Report structure

2 Background
  2.1 DPM and the Worldwide LHC Computing Grid
  2.2 DPM architecture
    2.2.1 DPM head node
    2.2.2 DPM disk node
  2.3 DPM evolution
    2.3.1 DMLite
    2.3.2 Disk Operations Manager Engine
  2.4 Trade-offs in distributed systems
    2.4.1 Implication of CAP Theorem on DPM
  2.5 Concluding Remarks

3 Setting up a legacy-free DPM testbed
  3.1 Infrastructure
  3.2 Initial testbed architecture
  3.3 Testbed specification
  3.4 Creating the VMs
  3.5 Setting up a certificate authority
    3.5.1 Create a CA
    3.5.2 Create the host certificates
    3.5.3 Create the user certificate
  3.6 Nameserver
  3.7 HTTP frontend
  3.8 DMLite adaptors
  3.9 Database and Memcached
  3.10 Creating a VO
  3.11 Establishing trust between the nodes
  3.12 Setting up the file systems and disk pool
  3.13 Verifying the testbed
  3.14 Problems encountered and lessons learned

4 Investigation
  4.1 Automating the failover mechanism
    4.1.1 Implementation
  4.2 Database
    4.2.1 Metadata and operation status
    4.2.2 Issues
    4.2.3 Analysis
    4.2.4 Options
    4.2.5 Recommendation
  4.3 DOME in-memory queues
    4.3.1 Issues
    4.3.2 Options
    4.3.3 Recommendation
  4.4 DOME metadata cache
    4.4.1 Issues
    4.4.2 Options
    4.4.3 Recommendation
  4.5 Recommended architecture for High Availability DPM
    4.5.1 Failover
    4.5.2 Important considerations

5 Evaluation
  5.1 Durability
    5.1.1 Methodology
    5.1.2 Findings
  5.2 Performance
    5.2.1 Methodology
    5.2.2 Findings

6 Conclusions

7 Future work

A Software versions and configurations
  A.1 Core testbed components
  A.2 Test tools
  A.3 Example domehead.conf
  A.4 Example domedisk.conf
  A.5 Example dmlite.conf
  A.6 Example domeadapter.conf
  A.7 Example mysql.conf
  A.8 Example Galera cluster configuration

B Plots

List of Tables

3.1 Network identifiers of VMs in testbed

List of Figures

2.1 Current DPM architecture
2.2 DMLite architecture
2.3 Simplified view of DOME in head node
2.4 Simplified view of DOME in disk node
3.1 Simplified view of architecture of initial testbed
4.1 Failover using keepalived
4.2 Synchronising records with Galera cluster
4.3 Remodeled work flow of the task queues using replicated Redis caches
4.4 Remodeled work flow of the metadata cache using replicated Redis caches
4.5 Recommended architecture for High Availability DPM
5.1 Plots of average rate of operations compared to number of threads
B.1 Average rate for a write operation
B.2 Average rate for a stat operation
B.3 Average rate for a read operation
B.4 Average rate for a delete operation

Acknowledgements First and foremost, I would like to express my gratitude to Dr Nicholas Johnson for supervising and arranging the budget for this project. Without the guidance and motivation he has provided, the quality of this work would certainly have suffered. I would also like to thank Dr Fabrizio Furano from the DPM development team for putting up with the stream of emails I have bombarded him with, and for answering my queries on the inner-workings of DPM.

Chapter 1 Introduction

1.1 Big data in science

Big data has become a well-known phenomenon in the age of social media. The vast amount of user-generated content has undeniably influenced the research on and advancement of modern distributed computing paradigms [1][2]. However, even before the advent of social media websites, researchers in several scientific fields already faced similar challenges in dealing with the massive amount of data generated by experiments. One such field is high energy physics, including the Large Hadron Collider (LHC) experiments based at the European Organization for Nuclear Research (CERN). In 2016 alone, an estimated 50 petabytes of data were gathered by the LHC detectors post-filtering [3]. Since the financial resources required to host an infrastructure able to process, store, and analyse the data are far too great for any single organisation, the experiments turned to the grid computing approach.

Grid computing, which is mostly developed and used in academia, follows the same principle as its commercial counterpart, cloud computing: computing resources are provided to end-users remotely and on-demand. Similarly, the physical location of the sites which provide the resources, as well as the infrastructure, is abstracted away from the users. From the end-users' perspective, they simply submit their jobs to an appropriate job management system without any knowledge of where the jobs will run or where the data are physically stored. In grid computing, these computing resources are often distributed across multiple locations, where a site that provides data storage capacity is called a Storage Element, and one that provides computation capacity is called a Compute Element.

1.2 Storage on the grid

Grid storage elements have to support some unique requirements found in the grid environment. For example, the grid relies on the concept of Virtual Organisations (VOs) for resource allocation and accounting. A VO represents a group of users, not necessarily from the same organisation but usually involved in the same experiment, and manages their membership. Resources on the grid (e.g. storage space provided by a site) are allocated to specific VOs instead of individual users. Storage elements also have to support file transfer protocols that are not commonly used outside of the grid environment, such as GridFTP [4] and xrootd [5]. Various storage management systems were developed for grid storage elements to fulfil these requirements, and one such system is the Disk Pool Manager (DPM) [6].

DPM is a storage management system developed by CERN. It is currently the most widely deployed storage system on Tier 2 sites, providing the Worldwide LHC Computing Grid (WLCG) with around 73 petabytes of storage across 160 instances [7]. The main functionalities of DPM are to provide a straightforward, low-maintenance solution for creating a disk-based grid storage element, and to support remote file and metadata operations using multiple protocols commonly used in the grid environment.

1.3 The problem

This section presents the main challenges for DPM and the specific limitations that motivate this work, and outlines the project's aim.

1.3.1 Challenges in availability

Due to limitations in the DPM architecture, the current deployment model supports only one metadata server and command node. This deployment model exposes a single point of failure in a DPM-based storage element. There are several scenarios where this deployment model could affect the availability of a site:

- Hardware failure in the host
- A software or OS update that results in the host being offline
- Retirement or replacement of machines

If any of the scenarios listed above happens to the command node, the entire storage element will become inaccessible, which ultimately means expensive downtime for the site.

1.3.2 Limitations in DPM legacy components

Some components in DPM were first developed over 20 years ago. The tightly coupled nature of these components has limited the extensibility of the DPM system and makes it impractical to modify DPM into a multi-server system. As the grid evolves, the number of users and the storage demand have also increased. New software practices and designs have emerged that could better fulfil the requirements of a high-load storage element. In light of this, the DPM development team have put a considerable amount of effort into modernising the system in the past few years, which has resulted in new components that can bypass some limitations of the legacy stack. The extensibility of these new components has opened up an opportunity to modify the current deployment model, which this work aims to explore.

1.4 Aim

The aim of this work is to explore the possibility of increasing the availability of a DPM-based grid storage element by modifying its current architecture and components. Specifically, this work includes:

- An investigation into the availability limitations of the current DPM deployment model.
- Our experience of setting up and configuring a legacy-free DPM instance, including a step-by-step guide.
- An in-depth analysis of the challenges in enabling a highly available DPM instance, along with potential solutions.
- A recommended architecture for a high availability DPM storage element based on the findings of our investigation, along with a prototype testbed for evaluation.

1.5 Project scope

A complete analysis, redesign, and modification of the entire DPM system and of the access protocol frontends DPM supports would not realistically fit into the time frame of this project. As such, this project aims to act as a preliminary study towards the overall effort of implementing a high availability DPM system. As part of the effort to promote a wider adoption of the HTTP ecosystem in the grid environment, this project focuses on providing a high availability solution for the HTTP frontend. However, compatibility with the other access frontends is also taken into consideration in the design process, where possible.

1.6 Report structure

The remainder of the report is structured as follows. Chapter 2 presents the background of DPM, including its deployment environment and information on the components and services which form the DPM system. The evolution of DPM and its current development direction are also discussed. Chapter 3 describes our experience of setting up and configuring a legacy-free DPM instance on our testbed, including a step-by-step guide. Chapter 4 provides an in-depth investigation of the current DPM components which prohibit a high availability deployment model, and describes our suggested modifications. Chapter 5 evaluates the performance and failover behaviour of our prototype high availability testbed. Chapter 6 presents the conclusions of this work, summarising the findings of our investigations and recommendations. Chapter 7 describes some of the future work that is anticipated after the completion of this project.

Chapter 2 Background

DPM is a complex system which includes a number of components and services. As such, before examining potential ways to improve the availability and scalability of a DPM storage element, the architecture and components of DPM must first be understood. This chapter presents an in-depth analysis of DPM, including its architecture, history and evolution, as well as the functionality of each component that makes up a DPM system. Common scenarios which could affect the availability of a distributed system, and the trade-offs in a highly available distributed system, are also discussed in this chapter.

2.1 DPM and the Worldwide LHC Computing Grid

As mentioned in Chapter 1, DPM is designed specifically to allow the setup and management of storage elements on the WLCG. As such, to gain a better understanding of DPM, one must also be familiar with the environment DPM is deployed in. The WLCG is a global e-infrastructure that provides compute and data storage facilities to support the LHC experiments (ALICE, ATLAS, CMS, LHCb). Currently, the WLCG is formed by more than 160 sites, and is organised into three main tiers:

- Tier 0 - The main data centre at the European Organisation for Nuclear Research (CERN), where raw data gathered by the detectors are processed and kept on tape.
- Tier 1 - Thirteen large-scale data centres holding a subset of the raw data. Tier 1 sites also handle the distribution of data to Tier 2 sites.
- Tier 2 - Around 150 universities and smaller scientific institutes providing storage for reconstructed data, and computational support for analysis.

Since DPM supports only disk and not tape storage, it is mostly used in Tier 2 storage elements, storing data files that are used in analysis jobs submitted by physicists. For redundancy and accessibility purposes, popular files often have copies distributed

across different sites; in grid terminology these copies are called replicas. These replicas are stored in filesystems on the DPM disk nodes, and a collection of filesystems forms a DPM disk pool.

2.2 DPM architecture

DPM is a distributed system composed of two types of node: the head node and the disk node. A high-level view of the typical DPM architecture used in most DPM storage elements is shown in Figure 2.1.

[Figure 2.1: Current DPM architecture]

2.2.1 DPM head node

The head node is the entry point to a DPM storage element; it is responsible for handling the file metadata and file access requests that come into the cluster. The head node contains the decision-making logic for load balancing, authorisation and authentication, space quota management, file system status, and the physical location of the replicas it manages. In the DPM system, the head node acts as the brain of the cluster and maintains a logical view of the entire file system.

A DPM head node contains a number of components providing different services. The components can be grouped into two categories: frontends, which facilitate access through different protocols, and backends, which provide the underlying functionality.

Protocol frontends

- Httpd - DPM uses the Apache HTTP server to allow metadata and file operations through HTTP and WebDAV.
- SRM - The Storage Resource Manager [8] daemon, traditionally used to provide dynamic space allocation and file management to grid sites.
- GridFTP, xrootd, RFIO - These frontends provide access to the DPM system through some of the other popular protocols used in the grid environment.

Backends

- DPM - The DPM daemon (not to be confused with the DPM system as a whole) handles client file access requests, manages the asynchronous operation queues, and interacts with the data access frontends.
- DPNS - The DPM nameserver daemon, which handles file and directory related metadata operations, for example adding or renaming a replica.
- MySQL - Two important databases vital to DPM operations are stored in the MySQL backend. The cns_db database contains all the file metadata, the replicas and their locations in the cluster, as well as information on groups and VOs. The dpm_db database stores information on the filesystems on the disk servers, space quotas, and the status of ongoing and pending file access requests. The database can be deployed either on the same host as the head node, or remotely on another host, depending on the expected load.
- Memcached - Memcached [9] is an in-memory cache for key-value pairs. In DPM, it is an optional layer that can be set up in front of the MySQL backend to reduce the query load on the databases.

2.2.2 DPM disk node

Disk nodes in a DPM storage element host the actual file replicas and serve remote metadata and file access requests from clients. Once authenticated and authorised, clients are redirected to the relevant disk nodes by the head node; clients never access the disk nodes directly without this redirection. A disk node will typically contain all the data access frontends supported by the DPM system (e.g. httpd, GridFTP, xrootd, RFIO).

2.3 DPM evolution

Since DPM was first developed in the early 2000s, it has gone through several rounds of major refactoring and enhancement. The historical components of DPM, for example the DPM and DPNS daemons, were written a long time ago and extensibility was not one of their design goals. The daemons also introduced several bottlenecks, such as excessive inter-process communication and threading limitations. As such, a lot of effort has been directed at bypassing the so-called legacy components:

- DPM (daemon)
- DPNS
- SRM
- RFIO
- Other security and configuration helpers (e.g. CSec)

The most significant changes in the recent iterations are the development of the DMLite framework [10] and the Disk Operations Manager Engine (DOME) [11].

2.3.1 DMLite

DMLite is a plugin-based library that is now at the core of a DPM system. DMLite provides a layer of abstraction above the database, pool management, and I/O access. The architecture of DMLite is shown in Figure 2.2.

[Figure 2.2: DMLite architecture]
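Concretely, each backend is enabled by a LoadPlugin directive in the DMLite configuration files. For example, the MySQL namespace backend used later in this work (see Section 3.9) is loaded with a single line of this form:

LoadPlugin plugin_mysql_ns /usr/lib64/dmlite/plugin_mysql.so

Swapping or stacking plugins in this way is what allows the legacy adaptor to be replaced by the DOME adaptor, or a caching plugin to be placed in front of the database backend, without any change to the protocol frontends.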

By providing an abstraction of the underlying layers, additional plugins can be implemented to support other storage types, such as S3 and HDFS. Perhaps more importantly, DMLite also allows a caching layer to be loaded in front of the database backend by using the Memcached plugin, which can significantly reduce the query load on the databases.

2.3.2 Disk Operations Manager Engine

DOME is arguably the most important recent addition to the DPM system because it represents a new direction in DPM development. DOME runs on both the head and disk nodes as a FastCGI daemon; it exposes a set of RESTful APIs which provide the core coordination functions, and it uses HTTP and JSON to communicate with both clients and other nodes in the cluster. By implementing the functionality of the legacy daemons and handling inter-cluster communication itself, DOME makes the legacy components, in theory, optional in a DOME-enabled DPM instance. Simplified views of a DOME-enabled head node and a disk node are shown in Figure 2.3 and Figure 2.4, respectively.

[Figure 2.3: Simplified view of DOME in head node]

The heavy use of in-memory queues and inter-process communication in the legacy components would have made any attempt to modify the single head node deployment model impractical. However, the introduction of DOME has now opened up the possibility of deploying multiple head nodes in a single DPM instance, which will be explored in the next chapter.
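To give a concrete feel for this interface, a DOME head node can in principle be queried directly over HTTPS by any client that presents a certificate whose DN it trusts. The sketch below is illustrative only: the /domehead/ path handled by mod_proxy_fcgi is taken from Figure 2.3, but the command/dome_getspaceinfo part of the URL is an assumption and should be checked against the DOME documentation for the installed release.

# Ask the head node's DOME daemon for space information (illustrative sketch;
# the certificate used must belong to a DN that DOME is configured to trust)
curl --capath /etc/grid-security/certificates \
     --cert <trusted_cert>.pem --key <trusted_key>.pem \
     https://<head-node-fqdn>/domehead/command/dome_getspaceinfo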

[Figure 2.4: Simplified view of DOME in disk node]

2.4 Trade-offs in distributed systems

Eric Brewer introduced an idea in 2000 which is now widely known as the CAP Theorem. The CAP Theorem states that in distributed systems there is a fundamental trade-off between consistency, availability, and partition tolerance [12]: a distributed system can guarantee at most two of the three properties. Informal definitions of these guarantees are as follows.

- Consistency - A read operation should return the most up-to-date result regardless of which node receives the request.
- Availability - In the event of node failure, the system should still be able to function, meaning each request will receive a response within a reasonable amount of time.
- Partition tolerance - In the event of a network partition, the system will continue to function.

2.4.1 Implication of CAP Theorem on DPM

Since we cannot have all three guarantees, as stated by the CAP Theorem, we need to consider carefully which guarantee we are willing to discard based on our requirements. Availability is our highest priority, since our ultimate aim is to design a DPM architecture that is resilient to head node failure. This means deploying multiple head nodes to increase the availability of the DPM system. DPM relies on records in both the database and the cache to function. In a multi-head-node architecture, these data would have to be synchronised across all the head nodes. As such, to ensure operational correctness, consistency is also one of our requirements. Any network partition in a distributed system is, of course, less than ideal.

Realistically, however, as DPM is mostly deployed on machines in close proximity, for instance in a single data centre as opposed to over a WAN, network partition is less of an issue. Any network issue that happens in a typical DPM environment would likely affect all the nodes in the system. Based on the reasoning above, we believe our architecture should prioritise consistency and availability.

2.5 Concluding Remarks

In this chapter, the architecture and core components of DPM were examined. The limitations of the legacy components in DPM and the motivation behind the recent refactoring effort were explained. With the addition of DMLite and DOME, it is now worthwhile to explore whether a multi-head-node deployment is viable with a legacy-free DPM instance. Lastly, we explained the reasoning behind prioritising consistency and availability over partition tolerance in our new architecture.

Chapter 3 Setting up a legacy-free DPM testbed

As DPM is composed of a number of components and services, with many opportunities for misconfiguration that would result in a non-functioning system, manual configuration is discouraged. Instead, DPM storage elements are usually set up using the supplied puppet manifest templates with the puppet configuration manager. However, since this project aims to explore the possibility of a novel DOME-only, multi-head-node DPM deployment model, some of the components have to be compiled from source and then installed and configured manually.

The testbed will serve three purposes. Firstly, we want to find out whether DPM functions correctly if we exclude all the legacy components, meaning that our DPM instance will only include DMLite, DOME, MySQL (MariaDB on CentOS 7), and the Apache HTTP frontend. Secondly, once we have verified that our legacy-free testbed is functional and have redesigned some of the components, the testbed will serve as a foundation for incorporating additional head nodes and the changes in DPM necessary to facilitate this new deployment model. Lastly, the testbed will be used to evaluate the performance impact of the new design.

As DOME has only recently gone into production and no other grid site has yet adopted a DOME-only approach, to the best of our knowledge no one has attempted this outside of the DPM development team. As such, we believe our experience in setting up the testbed will be valuable both to grid sites that may later upgrade to DOME and as feedback for the DPM developers.

The remainder of this chapter describes the steps that were taken to successfully set up a DOME-only DPM testbed, including details on the infrastructure, specifications, and configuration. Major issues encountered during the process are also discussed.

3.1 Infrastructure

For ease of testing and deployment, virtual machines (VMs) were used instead of physical machines. This decision will certainly impact the performance of the cluster and will be taken into account during the performance evaluation. All VMs used in the testbed are hosted on the University's OpenStack instance.

3.2 Initial testbed architecture

As mentioned earlier in this chapter, our first objective is to verify the functionality of a legacy-free DPM instance. As such, our initial testbed will have only one head node. Ultimately, redundant head nodes will be included in the testbed once we have verified the functionality of the single head node instance. The testbed will also include two disk nodes for proper testing of file systems and pool management.

DPM provides the option to host the database server either locally on the head node or remotely on another machine. The remote hosting option will remain open to storage elements, but in our design we will also try to accommodate the local database use-case.

We will also incorporate our own Domain Name System (DNS) server in the testbed. The rationale behind this is, firstly, that we want to evaluate our testbed in isolation. By having our own private DNS server, we will be able to monitor the load on the DNS service and examine whether it becomes a bottleneck in our tests. Secondly, having full control of the DNS service opens up the possibility of hot-swapping the head nodes by changing the IP address mappings in the DNS configuration. The initial architecture of the testbed is shown in Figure 3.1.

3.3 Testbed specification

After consulting with the DPM development team, it was decided that VMs with 4 virtual CPUs (VCPUs) and 8GB of RAM are sufficient for the purpose of this project. Among the VM flavours offered by OpenStack, the m1.large flavour provides 4 VCPUs, 8GB of RAM and 80GB of disk space, which fits our needs perfectly. The nameserver requires minimal disk space and CPU; as such, we have chosen the m1.small flavour, which provides 1 VCPU, 2GB of RAM and 20GB of disk space. All VMs in the testbed run the CentOS 7 operating system. A detailed list of the software used in the testbed and its versions can be found in Appendix A.
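For reference, creating one of these instances from the OpenStack command-line client could look roughly like the following; the image, network and keypair names are placeholders for our project's values and will differ on other OpenStack installations.

# Launch a head node VM with the m1.large flavour (image/network/key names are placeholders)
openstack server create --flavor m1.large --image CentOS-7-x86_64 \
    --network private --key-name dpm-key dpm-head-1

# Allocate a floating IP from the external pool and attach it to the new instance
openstack floating ip create external
openstack server add floating ip dpm-head-1 172.16.49.2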

[Figure 3.1: Simplified view of architecture of initial testbed]

3.4 Creating the VMs

Four VM instances were created using OpenStack in the nova availability zone (.novalocal domain). We then assigned a unique floating IP address to each of these instances so that they can be accessed from outside of the private network. The hostnames and IPs of these instances will be referenced throughout this chapter and are shown in Table 3.1.

Hostname         FQDN                       Private IP     Floating IP
dpm-nameserver   dpm-nameserver.novalocal   192.168.1.10   172.16.49.14
dpm-head-1       dpm-head-1.novalocal       192.168.1.14   172.16.49.2
dpm-disk-1       dpm-disk-1.novalocal       192.168.1.12   172.16.48.224
dpm-disk-2       dpm-disk-2.novalocal       192.168.1.6    172.16.48.216

Table 3.1: Network identifiers of VMs in testbed

The fully qualified domain names (FQDNs) of these nodes are important, as they need to appear in the head and disk node host certificates exactly as listed, otherwise the host will not be trusted by the other nodes. Since DPM and most of the other grid middleware packages are located in the Extra Packages for Enterprise Linux (EPEL) repository, we need to install that repository on each of these VMs:

sudo yum install epel-release
sudo yum update

Then enable the EPEL testing repository for the latest versions of DOME and DMLite:

sudo yum-config-manager --enable epel-testing

Install DOME and its dependencies.

On the head node:

sudo yum install dmlite-dpmhead-dome

On the disk nodes:

sudo yum install dmlite-dpmdisk-dome

Make sure SELinux is disabled on all the nodes, as it sometimes interferes with DPM operations. This is done by setting SELINUX=disabled in /etc/sysconfig/selinux. Before we can further configure the nodes, we need to acquire a host certificate for each of the nodes to be used for SSL communication.

3.5 Setting up a certificate authority

DPM requires a valid grid host certificate to be installed on all its nodes for authentication reasons. Since we do not know how many VMs we will end up using, and to avoid going through the application process with a real CA every time we have to spin up a new VM in the testbed, we decided to set up our own CA to do the signing. It does not matter which host does the signing, as long as the CA is installed on that host and the host holds the CA's private key. In our testbed we used the nameserver to sign certificate requests.

To set up a grid CA, install the globus-gsi-cert-utils-progs and globus-simple-ca packages from the Globus Toolkit. These packages can be found in the EPEL repository.

3.5.1 Create a CA

First we use the grid-ca-create command to create a CA with the X.509 distinguished name (DN) "/O=Grid/OU=DPM-Testbed/OU=dpmCA.novalocal/CN=DPM Testbed CA". This is the CA we use to sign the host certificates for all the nodes in the cluster.
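Putting these steps together, the CA creation on the nameserver amounts to something like the sketch below; the -subject option of grid-ca-create and its attribute ordering are assumptions and should be checked against the Globus Simple CA documentation.

# Install the Globus Simple CA tooling from EPEL
sudo yum install globus-gsi-cert-utils-progs globus-simple-ca

# Create the CA with the DN used in this work (option syntax assumed)
grid-ca-create -subject "cn=DPM Testbed CA, ou=dpmCA.novalocal, ou=DPM-Testbed, o=Grid"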

Our new CA has to be installed on every node in the cluster before the nodes will trust any certificate signed by it. To simplify the process, the CA can be packaged into an RPM using the grid-ca-package command, which gives us an RPM package containing our CA and its signing policy that can be distributed and installed on the nodes using yum localinstall.

3.5.2 Create the host certificates

Each of the nodes in the cluster needs its own host certificate. Since we have control of both the CA and the nodes, we can issue all the requests on the nameserver on behalf of all the nodes.

grid-cert-request -host <FQDN of host>

will generate both a private key (hostkey.pem) for that host and a certificate request, which we then sign with our CA using the grid-ca-sign command:

grid-ca-sign -in certreq.pem -out hostcert.pem

The hostkey.pem and hostcert.pem files then have to be transferred to the VM that corresponds to the FQDN, and stored in the /etc/grid-security/ directory with the correct permissions:

sudo chmod 400 /etc/grid-security/hostkey.pem
sudo chmod 444 /etc/grid-security/hostcert.pem

The certificate and private key also need to be placed in a location used by DPM:

sudo mkdir /etc/grid-security/dpmmgr
sudo cp /etc/grid-security/hostcert.pem /etc/grid-security/dpmmgr/dpmcert.pem
sudo cp /etc/grid-security/hostkey.pem /etc/grid-security/dpmmgr/dpmkey.pem

Make sure the files are owned by the dpmmgr user:

sudo groupadd -g 151 dpmmgr
sudo useradd -c "DPM manager" -g dpmmgr -u 151 -r -m dpmmgr
sudo chown -R dpmmgr.dpmmgr /etc/grid-security/dpmmgr
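The transfer step mentioned above can be done with scp from the nameserver; a minimal sketch, assuming the default centos login user of our OpenStack images, after which the files are moved into place and given the permissions and ownership shown above:

# Copy the signed certificate and key for dpm-head-1 to the target node
scp hostcert.pem hostkey.pem centos@dpm-head-1.novalocal:
# Repeat for dpm-disk-1 and dpm-disk-2 with their respective certificates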

3.5.3 Create the user certificate

We also need to generate a grid user certificate for communicating with the testbed as a client. This certificate will be used during testing, for instance when supplied to stress testing tools. For testing purposes, we generate a user certificate without a password to make the testing process easier. This is done by using grid-cert-request with the -nodes switch. Our user certificate has the DN:

"/O=Grid/OU=DPM-Testbed/OU=dpmCA.novalocal/OU=DPM Testbed CA/CN=Eric Cheung"

3.6 Nameserver

For our nameserver, we chose the popular BIND DNS server [13]. We discuss the configuration of the DNS server in some detail because it is related to how we plan to hot-swap the head nodes. In particular, it is important to note how the FQDN of the head node is mapped to its IP address in the configuration.

sudo yum install bind bind-utils

In /etc/named.conf, add all the nodes in our cluster that will use our DNS server to the trusted ACL group:

acl "trusted" {
    192.168.1.10;   // this nameserver
    192.168.1.14;   // dpm-head-1
    192.168.1.12;   // dpm-disk-1
    192.168.1.6;    // dpm-disk-2
};

Modify the options block:

listen-on port 53 { 127.0.0.1; 192.168.1.10; };
#listen-on-v6 port 53 { ::1; };

Change allow-query to our trusted group of nodes:

allow-query { trusted; };

Finally, add this to the end of the file:

include "/etc/named/named.conf.local";

Now set up the forward zone for our domain in /etc/named/named.conf.local:

zone "novalocal" {
    type master;
    file "/etc/named/zones/db.novalocal"; # zone file path
};

Then we can create the forward zone file, where we map the FQDNs in our zone to their IP addresses:

sudo chmod 755 /etc/named
sudo mkdir /etc/named/zones
sudo vim /etc/named/zones/db.novalocal

$TTL 604800
@ IN SOA dpm-nameserver.novalocal. admin.novalocal. (
              1 ; Serial
         604800 ; Refresh
          86400 ; Retry
        2419200 ; Expire
         604800 ) ; Negative Cache TTL
;
; name servers - NS records
    IN NS dpm-nameserver.novalocal.

; name servers - A records
dpm-nameserver.novalocal.   IN A 192.168.1.10

; 192.168.0.0/16 - A records
dpm-head-1.novalocal.       IN A 192.168.1.14
dpm-disk-1.novalocal.       IN A 192.168.1.12
dpm-disk-2.novalocal.       IN A 192.168.1.6

The two most important things to note in this configuration are the IP addresses in the trusted group in named.conf and the A records of the nodes in db.novalocal. In theory, if we spin up an additional head node with the same FQDN, we can simply substitute its IP address in place of the IP of the old head node to redirect any inter-cluster communication and client requests towards the new head node, as illustrated in Figure x.

In a production site, we recommend also setting up a backup DNS server as well as a reverse zone file, so that looking up FQDNs from IPs is also possible. For the purpose of this project, since we are not studying the availability of the nameserver, nor do we plan on doing reverse lookups, the configuration listed above should suffice.
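For completeness, the reverse zone mentioned above could be added with a block like the following in /etc/named/named.conf.local, together with a matching zone file; the serial number and file name are arbitrary choices, and the PTR records simply mirror the A records listed earlier.

zone "1.168.192.in-addr.arpa" {
    type master;
    file "/etc/named/zones/db.192.168.1"; # reverse zone file path
};

And in /etc/named/zones/db.192.168.1:

$TTL 604800
@ IN SOA dpm-nameserver.novalocal. admin.novalocal. (
              1 ; Serial
         604800 ; Refresh
          86400 ; Retry
        2419200 ; Expire
         604800 ) ; Negative Cache TTL
; name server
    IN NS dpm-nameserver.novalocal.
; PTR records
10  IN PTR dpm-nameserver.novalocal.
14  IN PTR dpm-head-1.novalocal.
12  IN PTR dpm-disk-1.novalocal.
6   IN PTR dpm-disk-2.novalocal.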

3.7 HTTP frontend

The httpd server and a few other modules are required to allow access to DPM through HTTP and the WebDAV extension. Some key configuration steps include ensuring that the mod_gridsite and mod_lcgdm_dav modules are installed, which handle authentication and WebDAV access, respectively.

3.8 DMLite adaptors

The DMLite framework uses plugins to communicate with the underlying backend services. A traditional DPM instance would use the adaptor plugin to route requests to the DPM and DPNS daemons. Since we do not have those legacy daemons on the testbed, we need to replace that plugin with the DOME adaptor, so that requests are routed to DOME instead. This is done by editing dmlite.conf to load the dome_adaptor library instead of the old adaptor, and by removing the adaptor.conf file.

3.9 Database and Memcached

DPM works with MySQL-compatible database management systems (DBMS); on our testbed we used the default relational DBMS on CentOS 7, which is MariaDB. The configuration process is mostly identical to a legacy DPM instance and involves importing the schemas of cns_db and dpm_db, as well as granting access privileges to the DPM process. However, we initially had some trouble getting our database backend to work in a legacy-free instance. We discovered that the issue was caused by DMLite loading some MySQL plugins that are no longer needed in our scenario. We managed to resolve the issue by ensuring that DMLite only loads the namespace plugin for MySQL-related operations. In /etc/dmlite.conf.d/mysql.conf, make sure only the namespace plugin is loaded:

LoadPlugin plugin_mysql_ns /usr/lib64/dmlite/plugin_mysql.so

Since DOME now includes an internal metadata cache which fulfils the same purpose as the Memcached layer in a legacy setup, Memcached is not installed on the testbed.

3.10 Creating a VO

Storage elements on the grid use the grid-mapfile to map all the users from the VOs that are supported on the site. For testing purposes, we will use our own VO and directly

map our local users to the testbed by using a local grid-mapfile. This is done so that we can bypass the Virtual Organization Management Service (VOMS). The conventional VO name for development is dteam, and we will use that on our testbed. To create the mapfile, add this line to /etc/lcgdm-mkgridmap.conf:

gmf_local /etc/lcgdm-mapfile-local

Then create and edit the /etc/lcgdm-mapfile-local file, entering the DN-VO pair for each user we would like to support:

"/O=Grid/OU=DPM-Testbed/OU=dpmCA.novalocal/OU=DPM Testbed CA/CN=Eric Cheung" dteam

Run the supplied script manually to generate the mapfile. On a production site this would be set up as a cron job so that the mapfile stays up to date.

/usr/libexec/edg-mkgridmap/edg-mkgridmap.pl --conf=/etc/lcgdm-mkgridmap.conf --output=/etc/lcgdm-mapfile --safe

3.11 Establishing trust between the nodes

On the head node, edit the /etc/domehead.conf file and add the DNs of the disk nodes to the list of authorised DNs:

glb.auth.authorizedn[]: "CN=dpm-disk-1.novalocal,OU=dpmCA.novalocal,OU=DPM-Testbed,O=Grid", "CN=dpm-disk-2.novalocal,OU=dpmCA.novalocal,OU=DPM-Testbed,O=Grid"

On the disk nodes, edit the /etc/domedisk.conf file and add the DN of the head node to the list of authorised DNs:

glb.auth.authorizedn[]: "CN=dpm-head-1.novalocal,OU=dpmCA.novalocal,OU=DPM-Testbed,O=Grid"

3.12 Setting up the file systems and disk pool

During the configuration process, we encountered some issues with the dmlite-shell, which is used as an administration tool on the head node. In a normal deployment, DPM would be configured by puppet, which would create the skeleton directory tree in

the DPM namespace by inserting the necessary entries into the cns_db database. Since we are manually configuring the system, we have to carry out this step ourselves. The key record is the / entry, which acts as the root of the logical view of the file system. On the head node:

mysql -u root
> use cns_db
> INSERT INTO Cns_file_metadata (parent_fileid, name, owner_uid, gid) VALUES (0, '/', 0, 0);

Then start the dmlite-shell, remove the entry we just added by using unlink -f, then create the entry again, this time using mkdir so that all the required fields are properly set. Once that is done we can also set up the basic directory tree using the shell:

sudo dmlite-shell
> unlink -f /
> mkdir /
> mkdir /dpm
> mkdir /dpm/novalocal (our domain)
> mkdir /dpm/novalocal/home

Add a directory for our VO and set the appropriate ACL:

> mkdir /dpm/novalocal/home/dteam (our VO)
> cd /dpm/novalocal/home/dteam
> chmod dteam 775
> groupadd dteam
> chgrp dteam dteam
> acl dteam d:u::rwx,d:g::rwx,d:o::r-x,u::rwx,g::rwx,o::r-x set

Add a volatile disk pool to our testbed:

> pooladd pool_01 filesystem V

Once we have a disk pool we can add a file system to it. This has to be done on all disk nodes that wish to participate in the pool. In the normal shell, create a directory which DPM can use as a file system mount point and make sure it is owned by DPM so that it can write to it:

sudo mkdir /home/dpmmgr/data
sudo chown -R dpmmgr.dpmmgr /home/dpmmgr/data

Then we can add the file systems on both disk nodes to our pool. On the head node,

inside dmlite-shell:

> fsadd /home/dpmmgr/data pool_01 dpm-disk-1.novalocal
> fsadd /home/dpmmgr/data pool_01 dpm-disk-2.novalocal

Verify our disk pool (it may take a few seconds before DOME registers the new file systems):

> poolinfo
pool_01 (filesystem)
  freespace: 155.14GB
  poolstatus: 0
  filesystems:
    status: 0
    freespace: 77.58GB
    fs: /home/dpmmgr/data
    physicalsize: 79.99GB
    server: dpm-disk-1.novalocal

    status: 0
    freespace: 77.56GB
    fs: /home/dpmmgr/data
    physicalsize: 79.99GB
    server: dpm-disk-2.novalocal
  s_type: 8
  physicalsize: 159.97GB
  defsize: 1.00MB

One last thing we need to do before we can test the instance is to create a space token for our VO, so that we can write to the disk pool:

> quotatokenset /dpm/novalocal/home/dteam pool pool_01 size 10GB desc test_quota groups dteam
Quotatoken written.
poolname: pool_01
t_space: 10737418240
u_token: test_quota

3.13 Verifying the testbed

At this stage, we should have a functional legacy-free testbed that is able to begin serving client requests. To verify the testbed's functionality, we used the Davix HTTP client to issue a series of requests toward the head node. The operations we performed include uploading and downloading replicas, listing the contents of directories and deleting replicas.
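As an illustration, the upload of one of the test replicas with Davix looks like the following (the local file name is arbitrary; the user certificate is the one generated in Section 3.5.3):

davix-put --cert ~/dpmuser_cert/usercert.pem --key ~/dpmuser_cert/userkey_no_pw.pem --capath /etc/grid-security/certificates/ testfile_001.root https://dpm-head-1.novalocal/dpm/novalocal/home/dteam/testfile_001.root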

The outcome of the requests was verified against the log files, the database entries, and the disk nodes' file systems. For example, listing the contents of the home directory of our dteam VO:

[centos@dpm-nameserver ~]$ davix-ls --cert ~/dpmuser_cert/usercert.pem --key ~/dpmuser_cert/userkey_no_pw.pem --capath /etc/grid-security/certificates/ https://dpm-head-1.novalocal/dpm/novalocal/home/dteam -l
drwxrwxr-x 0    12738 2017-08-06 20:18:12 hammer
-rw-rw-r-- 0 10485760 2017-07-26 15:03:49 testfile_001.root
-rw-rw-r-- 0 10485760 2017-07-31 20:21:31 testfile_002.root
-rw-rw-r-- 0 10485760 2017-07-31 20:23:04 testfile_003.root
-rw-rw-r-- 0 10485760 2017-07-31 20:23:17 testfile_004.root
-rw-rw-r-- 0 10485760 2017-08-02 13:41:59 testfile_005.root
-rw-rw-r-- 0 10485760 2017-08-02 13:59:33 testfile_006.root
-rw-rw-r-- 0 10485760 2017-08-02 14:06:41 testfile_007.root
-rw-rw-r-- 0 10485760 2017-08-02 14:33:20 testfile_008.root

Reading the contents of the helloworld.txt file and printing them to stdout:

[centos@dpm-nameserver ~]$ davix-get --cert ~/dpmuser_cert/usercert.pem --key ~/dpmuser_cert/userkey_no_pw.pem --capath /etc/grid-security/certificates/ https://dpm-head-1.novalocal/dpm/novalocal/home/dteam/helloworld.txt
Hello world!

3.14 Problems encountered and lessons learned

Many of the issues we encountered while setting up the testbed were due to our initial lack of knowledge of the grid environment. For instance, we were unaware of the X.509 extensions that are used in signing grid certificates and did not understand why our certificates signed using plain OpenSSL were being rejected. We were also unfamiliar with how members of a VO are authenticated by the DPM system, which resulted in a lot of time spent on log monitoring and debugging before the testbed could even be tested.

Perhaps most importantly, there are many services and plugins that need to be configured correctly in a DPM instance. A single incorrect setting in one of the many configuration files results in a non-functional system. During the setup process, there were many occasions where we had to maximise the log level in DMLite, DOME, httpd and MariaDB and then analyse the log files in order to diagnose the source of the misconfiguration.
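For reference, the debugging loop we fell back on repeatedly was to raise the verbosity of the individual services and follow their logs while replaying a failing request. A rough sketch of the commands involved is shown below; the log locations are the CentOS 7 defaults and may differ, and the DMLite and DOME log levels are raised in their respective configuration files.

# Raise Apache's verbosity (LogLevel debug in httpd.conf or a conf.d snippet), then:
sudo systemctl restart httpd
# Follow the logs while repeating the failing operation
sudo journalctl -u httpd -f
sudo tail -f /var/log/httpd/error_log /var/log/mariadb/mariadb.log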

Chapter 4 Investigation

As mentioned in Chapter 2, DMLite and DOME were designed to replace the legacy components of DPM and aim to bypass some of the limitations imposed by the old stack. However, neither DOME nor DMLite was designed to run on more than a single head node. In order to design a functional high availability DPM architecture, we must first identify all the limiting factors in DMLite and DOME that would prevent us from deploying redundant head nodes, and redesign them where possible. A functional high availability DPM architecture must have the following attributes:

- Resilience to head node failure. The system must continue to function and serve client requests should the primary head node fail.
- Automatic recovery. In the event of head node failure, a DPM instance using the new architecture must fail over automatically in a manner that is transparent to the clients.
- Strong data consistency. The redundant head nodes must have access to the most up-to-date information about the file system and the status of the cluster.

Ultimately, providing availability in DPM means increasing the number of head nodes, and therefore turning DPM into an even more distributed system. Providing availability and consistency guarantees in any distributed system is likely to have performance implications, which we must also keep in mind in our design. The rest of this chapter describes the findings of our investigation and the recommended redesign of the relevant components to allow for a high availability DPM architecture.

4.1 Automating the failover mechanism

Ideally, when a head node fails, the system should automatically reroute client requests to one of the redundant head nodes in a way that is transparent to the clients. One of the

options for achieving this failover mechanism is to use a floating IP address that is shared between all the head nodes, combined with a tool that provides node monitoring and automatic assignment of this floating IP. Keepalived [14] is routing software designed for this use-case; it offers both load balancing and high availability functionality to Linux-based infrastructure. In keepalived, high availability is provided by the Virtual Router Redundancy Protocol (VRRP), which is used to dynamically elect master and backup virtual routers. Figure 4.1 illustrates how keepalived can be used to provide automatic failover to a DPM system.

[Figure 4.1: Failover using keepalived]

In this topology, all client requests are directed at the floating IP address. If the primary head node fails, the keepalived instances on the redundant head nodes will elect a new master based on the priority value of the servers in the configuration file. The new master then reconfigures the network settings of its node and binds the floating IP to its interface. From the clients' perspective, their requests continue to be served via the same IP address even though they are now fulfilled by a different head node. If the primary head node rejoins the network, keepalived will simply promote the primary node to master again if its server has the highest priority value. With this topology, we can use a single DNS entry in the nameserver for all the head nodes in the cluster, since they would all have the same FQDN and use the same floating IP address, thus further simplifying the configuration of the system.

4.1.1 Implementation

Based on our research, keepalived would be the ideal solution for head node failover. Unfortunately, after spending a considerable amount of effort on configuring keepalived, we discovered that in order to set it up successfully on our testbed we would require administrative privileges at the OpenStack level (to configure OpenStack Neutron), which we do not have.
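On infrastructure where this is possible, a minimal keepalived configuration for the primary head node could look like the sketch below; the interface name, virtual router ID and virtual IP address are placeholders, and each redundant head node would carry the same block with state BACKUP and a lower priority value.

vrrp_instance DPM_HEAD {
    state MASTER
    interface eth0            # interface that should carry the floating IP
    virtual_router_id 51      # must match on all head nodes
    priority 150              # the highest priority wins the election
    advert_int 1
    virtual_ipaddress {
        192.168.1.100         # the shared floating IP that clients connect to
    }
}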

However, on a production site this should not be an issue, especially when the site has full control of its network and has deployed DPM on physical machines instead of VMs.

4.2 Database

Some grid sites prefer to host the database backend locally on the head node for performance reasons, and we would like to preserve this use-case. The first step toward achieving this goal is to fully understand what is stored in the databases and what their roles are in the DPM system.

4.2.1 Metadata and operation status

Information stored in the DPM database backend can be categorised into two groups: metadata and operation status.

Metadata

The metadata kept by DPM includes information that is critical for DPM to function correctly, for example for validating a user's DN or for translating the logical file name of a replica to its physical file name. The different groups of data kept are summarised as follows.

- File system information - which file systems are available on the disk nodes and which disk pool they belong to.
- Pool information - the size, status, and policies of the disk pools.
- Space reservations - the space tokens of supported VOs, describing the available and used storage space of a disk pool a VO has access to.
- File metadata - information on each unique file and directory managed by the system, including POSIX file data such as size, type, and ACL.
- Replica metadata - information on the replicas of each file, including which disk pool a replica belongs to, which file system it is located on, and which disk node hosts that file system.
- User and group information - including the DNs of users, privilege levels, and ID mappings that are used internally.

Operation status

If a request (read, write, copy) cannot be immediately fulfilled by DPM, for instance because the requested replica has to be pulled from an external site or because of scheduling

by some job management system, the request is recorded in the dpm_db database as pending. The information recorded includes the protocol used in the request, the DN and host name of the client, the number of retries, error messages, the requested resource, and the status of the request.

4.2.2 Issues

DPM cannot function without access to the information stored in the databases. As our aim is to increase the availability of a DPM storage element, this means we have to provide a certain degree of redundancy for the database backend. Sites that wish to use a dedicated server to host their database service are responsible for providing redundancy to that service and can choose from a number of options that are likely already built into the database. Since we also want to support the local database backend use-case, we have to implement a way to share database records across multiple head nodes.

Grid sites are recommended to install DPM on physical hardware instead of VMs for performance reasons. As such, simply starting another VM with the latest snapshot is not a viable solution, not to mention that the new head node would not have the most up-to-date data when it is swapped in, which would leave the system in an inconsistent state. NoSQL solutions are also deemed unacceptable because we require the ACID properties provided by a transactional database.

4.2.3 Analysis

There are already a number of technologies which aim to increase the availability of relational database services. These technologies differ in which parallel architecture, type of replication, and node management mode they support. A brief overview of these differences is presented below.

Parallel database architecture

Parallel database management systems are typically based on one of two architectures: shared nothing or shared something. In a shared-something architecture, the shared resource may be memory, disk, or a combination of both. Below is a brief overview of each of these architectures.

Shared nothing: As the name implies, in a shared-nothing architecture each node maintains a private view of its memory and disk. Because nothing is shared between the nodes, a distributed database system using this architecture often finds it easier to achieve high availability and extensibility, at the cost of increased data partitioning complexity for load-balancing reasons [15].