Virtualization in a Grid Environment

Nils Dijk - nils.dijk@hva.nl
Hogeschool van Amsterdam, Instituut voor Informatica
July 8, 2010

Abstract

Date: July 8, 2010
Title: Virtualization in a Grid Environment
Author: Nils Dijk
Company: Nikhef

Problem: In the grid computing environment there is, from both the client and the developer side, a demand for the ability to run Virtual Machines on grid resources. While running Virtual Machines exposes a new attack surface on the grid resources, it is believed to be an improvement for the grid infrastructure. Because virtualization is an upcoming technology in the form of the cloud, there is a lot to be investigated and tested before deploying it to a grid infrastructure.

Contents

1 Nikhef & Grid computing
  1.1 Nikhef
  1.2 Participating organizations
  1.3 Grid resources
  1.4 PDP group
2 Assignment
  2.1 Why Virtualization
  2.2 Things to sort out
  2.3 Things explicitly not part of this assignment
3 Requirements for Virtual Machines in existing grid infrastructure
  3.1 Authentication and Authorization
  3.2 Scheduling
  3.3 Destruction
4 Proposed design
5 Implementation
  5.1 Gathering Information
    5.1.1 Image to boot
    5.1.2 OpenNebula user
    5.1.3 Resources
    5.1.4 Network
6 Credits
  6.1 Nikhef & Grid computing
7 Sources

Chapter 1 Nikhef & Grid computing

1.1 Nikhef

Nikhef (Nationaal instituut voor subatomaire fysica) is the Dutch national institute for subatomic physics. It is a collaboration between the Stichting voor Fundamenteel Onderzoek der Materie (FOM), the Universiteit van Amsterdam (UvA), the Vrije Universiteit Amsterdam (VU), the Radboud Universiteit Nijmegen (RU) and the Universiteit Utrecht (UU). The name was originally an acronym for Nationaal Instituut voor Kernfysica en Hoge Energie-Fysica (National institute for nuclear and high energy physics). After the linear electron accelerator was closed down in 1998, research into experimental nuclear physics came to an end, but the Nikhef name has been retained up to the present day. [5]

These days Nikhef is involved in areas dealing with subatomic particles. Most employees at Nikhef are involved with physics projects, some of which, like ATLAS, ALICE and LHCb, are directly related to the Large Hadron Collider (LHC) particle accelerator at the European Organization for Nuclear Research (CERN). Among the technical departments at Nikhef are Mechanical Engineering (EA), the Mechanic Workshop (MA), Electronics Technology (ET) and Computer Technology (CT).

High energy physics experiments generate vast amounts of data, the analysis of which requires equally vast amounts of computing power. In the past supercomputers were used to provide this power, but in order to perform the analysis of high-energy subatomic particle interactions required by the LHC experiments, a new method of pooling computing resources was adopted: Grid computing.

The CT department provides Nikhef's computing infrastructure. The Physics Data Processing (PDP) group is an offshoot of the CT department which develops Grid infrastructure, policy and software.

Figure 1.1: A diagram showing the organizational structure of Nikhef [3]

1.2 Participating organizations

Like supercomputers, Grids attract science. This has led to a community of Grid computing users which advances the Grid computing field on an international scale. Some of the cooperating organisations within the Grid computing community are:

- BiG Grid, the Dutch e-science Grid. An example of a National Grid Initiative (NGI), of which there are many.

- The Enabling Grids for E-sciencE (EGEE) project. A leading body for NGIs, to be transformed into the European Grid Initiative (EGI).
- The LHC Computing Grid (LCG), the Grid employed by CERN to store and analyze the data generated by the Large Hadron Collider (LHC). Also a member of EGEE.
- The Virtual Laboratory for e-science (VL-e). A separate entity that tries to make Grid infrastructure accessible for e-science applications in the Netherlands.

1.3 Grid resources

Here is an example of the resources potentially available at a national (BiG Grid) and international (EGEE) level. These are not static numbers, as the Grid is dynamic in nature: resources shift in and out due to maintenance or upgrades, and the Grid has a tendency to grow in computing and storage capacity.

BiG Grid has between 4500 and 5000 computing cores (not including LISA, which has 3000 cores) and about 4.7 petabytes of storage. The capacity of available tape storage is about 3 petabytes. EGEE has roughly 150,000 computing cores, 28 petabytes of disk storage and 41 petabytes of tape storage. [6]

1.4 PDP group

The Physics Data Processing (PDP) group at Nikhef is associated with BiG Grid, the LHC Computing Grid (LCG), Enabling Grids for E-sciencE (EGEE), the Virtual Laboratory for e-science (VL-e) and the (planned) European Grid Initiative (EGI). Within Nikhef, the PDP group concerns itself with policy and infrastructure decisions pertaining to authentication and authorization for international Grid systems. It facilitates the installation and maintenance of computing, storage and human resources. It provides the Dutch national academic Grid and supercomputing Certificate Authority (CA), and also delivers software such as:

- Grid middleware components (part of the gLite stack)
- Cluster management software (Quattor)

The PDP group employs Application Domain Analysts (ADAs), who try to bridge the gap between Grid technology and its users by developing software solutions and offering domain-specific knowledge to user groups.

Chapter 2 Assignment

The PDP group at Nikhef came up with an assignment involving the preparation of virtualized environments for grid jobs. This should be implemented within the Execution Environment Service. This service is written at Nikhef, and from the beginning it had been claimed that it would be able to produce a virtualized environment for a grid job.

2.1 Why Virtualization

Since the hype of the so-called cloud, virtualization techniques have been used to provide on-demand execution of machines configured by a user with their software. These machines can be configured locally, and once they are ready for production they can be run anywhere.

The grid provides users with an environment to run their software, mainly for scientific purposes. Because of all the different libraries user software depends on, it is very hard to manage the software stack available on worker nodes. Some organizations have their own dedicated hardware within a site's datacenter to provide their users with the right software, but that is an option only for the bigger organizations.

It is believed that many of the software conflicts, and thus job failures, can be reduced by providing users with the ability to run their jobs within a predictable environment, which could be realized by a virtual machine. This machine is supplied either by the user or by the organization they work for.

2.2 Things to sort out

Despite all the work done by the Virtualization Working Group, very little is known about the possibilities of starting virtual machines on grid infrastructure and about their traceability and accountability. Since most employees have no spare time on their hands, it is desirable to put one full-time internship on it.

2.3 Things explicitly not part of this assignment

Because virtualization is an enormous field of research, I specify here some topics that may come up while reading this document but which are not part of my research. This does not mean I did not look into some of these things.

This internship is not about the performance of virtual machines versus the performance of real hardware. There are lots of discussions going on about this topic and there is still no definite answer to this question.

Hypervisor versus hypervisor comparisons are also not addressed within this document. I have been working with the Xen hypervisor simply because it was the one most was known about at Nikhef. Wherever Xen is mentioned it can also be read as KVM or VMware.

Chapter 3 Requirements for Virtual Machines in existing grid infrastructure

To provide users of the grid with the ability to deploy virtual machines on grid infrastructure, while at the same time keeping risks at a minimum, there are multiple requirements on the implementation. In this chapter I give an overview of the high-level requirements for deploying virtual machines on the grid.

3.1 Authentication and Authorization

The grid has over nine thousand users, so authentication and authorization are an important aspect of keeping the grid infrastructure safe. On the batch system this authentication is done by the use of X.509 certificates, which provide a Public Key Infrastructure. Since UNIX does not provide user authentication by certificates, these users are mapped to local UNIX accounts. With this mapping it is important to register which certificate user is mapped to which UNIX account at a specific time, for forensic purposes when a user shows undesired behaviour.

Because virtual machines always run as the root user on the host, it is not possible to start the VM as a simple user process. By using OpenNebula for the deployment of virtual machines, users can deploy virtual machines without having root access themselves. However, OpenNebula supports neither authentication nor authorization in the form of an X.509 certificate; instead it maintains its own user database, in much the same manner as, for example, MySQL does. All actions on virtual machines are run as a dedicated OpenNebula user, which is in most cases the oneadmin account. To allow a grid user to deploy virtual machines by authenticating with his certificate, there has to be the same kind of mapping mechanism as already exists for mapping certificate users to local UNIX accounts. This is implemented as a GridMapDir [1].

3.2 Scheduling

Because of the number of users using the grid for computing, each user has to be scheduled for resources, much the same as processes get scheduled by an operating system to share the machine's resources, e.g. the processor, memory and storage. But instead of the small amounts of time an operating system gives to a process in each scheduling round, the grid supplies the user with much longer times, up to 72 hours of continuous running. When supplying the user with the ability to run not batch jobs but virtualized machines, the total amount of resources still has to be shared among multiple users.

Since OpenNebula is a toolchain for providing cloud services, most of its scheduling algorithms do not provide the rules commonly used on the grid. An example is the fair-share rule, a policy used by the system administrators of the Compute Element located at Nikhef. (Fair share is a rule to suppress users with excessive job submission while other users who submit fewer jobs are also submitting, but to allow them to use otherwise unused resources even if they have already passed their time quota.) The easiest way of scheduling virtual machines is to use the scheduler already active on the Compute Element, but since there are different schedulers in use worldwide, this is easier said than done.

3.3 Destruction

After a job is done or its walltime has passed (walltime is the time on the clock, i.e. on the wall, that the job is allowed to use the computational resources of the Worker Node it is scheduled on), the virtual machine has to be removed by shutting it down. If for some reason this is not done, some resources, e.g. memory, will still be held by the virtual machine, preventing new virtual machines from claiming them. This will result in failures when starting new virtual machines on the Worker Node. Therefore it is essential that the VM is cleaned up afterwards.

Also, in the case of an administrator killing a job due to suspicious activity, it is mandatory to kill the running VM with it. Otherwise a malicious user or machine could still be running and using the infrastructure in a way it is not meant to.
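As an illustration of this cleanup requirement, the sketch below shows one way a cleanup hook could ask Open Nebula to take a running VM down when the job that owns it finishes or is killed. This is only a sketch and not the implementation used at Nikhef: it assumes the Open Nebula XML-RPC endpoint on the front end at http://localhost:2633/RPC2, a session string of the form username:sha1(password) as described in chapter 5, and the one.vm.action call with action names as available in the Open Nebula version of that time; the credentials and the VM identifier are placeholders.

#!/usr/bin/env python
# Sketch only: tear down an Open Nebula VM when the grid job that owns it
# is finished or killed, so its memory and job slot are freed again.
# Endpoint, credentials and action names are assumptions for illustration.
import hashlib
import xmlrpclib  # Python 2, matching the tooling of the time

ONE_ENDPOINT = "http://localhost:2633/RPC2"  # assumed Open Nebula front end

def one_session(username, password):
    """Build the session string Open Nebula expects: username:sha1(password)."""
    return "%s:%s" % (username, hashlib.sha1(password).hexdigest())

def destroy_vm(session, vm_id):
    """Ask Open Nebula to shut the VM down; fall back to a hard cancel."""
    server = xmlrpclib.ServerProxy(ONE_ENDPOINT)
    # one.vm.action(session, action, vm_id); the first element of the reply
    # is a success flag (the exact reply layout depends on the version).
    reply = server.one.vm.action(session, "shutdown", vm_id)
    if not reply[0]:
        # A hung guest may ignore the clean shutdown; cancel it so the
        # resources cannot stay claimed on the Worker Node.
        reply = server.one.vm.action(session, "cancel", vm_id)
    return reply[0]

if __name__ == "__main__":
    session = one_session("oneadmin", "secret")  # placeholder credentials
    destroy_vm(session, 42)  # 42: VM identifier returned at deployment time

In the design of chapter 4 such a call would naturally be tied to the end of the job, or to an administrator killing it, so the VM never outlives the job it belongs to.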

Chapter 4 Proposed design

Since Nikhef prefers that such a system be implemented within their Execution Environment Service, I have looked for a way in which authorization is performed within the Argus framework of the grid, of which the EES is part. Figure 4.1 shows the authorization and booting sequence I came up with, which was approved by the security experts at Nikhef, together with a list explaining in detail the interaction between all the components involved.

1. The job is delivered at the Compute Element of the site.
2. The Compute Element contacts the Policy Enforcement Point with the information of the user job, which is the request for a virtual machine; this is done as an XACML2 [4] request.
3. The Policy Enforcement Point asks the Policy Decision Point for a decision about the request, based on obligations published by the Policy Administration Point.
4. The obligations are returned to the Policy Enforcement Point to be fulfilled.
5. The Policy Enforcement Point uses the Execution Environment Service to fulfill the obligations, making the EES an obligation handler of the PEP.
6. The Execution Environment Service returns to the Policy Enforcement Point with an answer for the fulfilled obligations.
7. The Policy Enforcement Point returns positive or negative to the Compute Element.
8. On a positive answer from the Policy Enforcement Point, the Compute Element passes the job to the Local Resource Management System.
9. The Local Resource Management System schedules the job to a Worker Node with a hypervisor running on it.
10. As the job is deployed, it contacts the Authorization Framework through the Policy Enforcement Point with information about the host the job is running on and the user requesting it.
11. The Policy Enforcement Point uses the Execution Environment Service again as an obligation handler for the incoming request.
12. As the Execution Environment Service sees that the request contains a host to run a Virtual Machine on, it deploys a machine assigned to the requesting user on the specified host.
13. Open Nebula contacts the hypervisor running on the node to start the virtual machine.
14. Open Nebula returns the VM identifier (the unique number assigned to the VM by Open Nebula) to the Execution Environment Service.
15. The Execution Environment Service forwards the VM identifier to the Policy Enforcement Point by means of an obligation.
16. The Policy Enforcement Point forwards the VM identifier in the response to the request sent by the Worker Node, in the same way as the Execution Environment Service did to it.

Figure 4.1: Diagram showing the deployment sequence of Virtual Machines (components: PDP, PAP, CE, LRMS, PEPd, EES, WN/Dom0 and ONE)
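To make steps 12 through 14 more concrete, the following sketch shows how an obligation handler could ask Open Nebula to deploy the requesting user's machine on the host named in the request and obtain the VM identifier that is afterwards handed back through the PEP. It is an illustration rather than the actual EES plugin: the template attributes, the REQUIREMENTS expression used to pin the host, and the helper names are assumptions made for the example.

#!/usr/bin/env python
# Sketch of steps 12-14 in Figure 4.1 (illustrative only): deploy the VM of the
# requesting user on the host named in the request and return the VM identifier
# that the EES would later pass back to the PEP as an obligation.
# Template attributes and helper names are assumptions, not the EES plugin.
import xmlrpclib  # Python 2, matching the tooling of the time

ONE_ENDPOINT = "http://localhost:2633/RPC2"  # assumed Open Nebula front end

VM_TEMPLATE = """
NAME   = "grid-vm-%(user)s"
CPU    = 1
MEMORY = %(memory)d
DISK   = [ SOURCE = "%(image)s", TARGET = "sda" ]
REQUIREMENTS = "HOSTNAME = \\"%(host)s\\""
"""  # the REQUIREMENTS line pins the VM to the Worker Node the job landed on

def deploy_vm(session, user, image, host, memory=2048):
    """Allocate a VM for `user` from `image`, constrained to run on `host`.

    Returns the Open Nebula VM identifier (the number forwarded in step 15).
    """
    server = xmlrpclib.ServerProxy(ONE_ENDPOINT)
    template = VM_TEMPLATE % {"user": user, "image": image,
                              "host": host, "memory": memory}
    reply = server.one.vm.allocate(session, template)
    if not reply[0]:
        raise RuntimeError("allocation failed: %s" % reply[1])
    return int(reply[1])

The returned identifier is the value that, in the proposed design, travels back to the Worker Node as an obligation in steps 15 and 16.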

Chapter 5 Implementation

Implementing a service which is able to boot virtual machines involves several steps, beginning with gathering the information needed for booting.

5.1 Gathering Information

Before a Virtual Machine can be booted it is essential to gather all the required information, e.g. the image to boot, the owner of the virtual machine and the resources the machine needs. I will explain the different attributes needed before booting a virtual machine and why each is implemented the way it is.

5.1.1 Image to boot

For a virtual machine to start you need to know which image to start, as the image contains the virtual machine. Below are some possibilities for obtaining the location of the image to boot, followed by a motivation for the chosen implementation.

User supplied

The user defines the image to boot in his Job Description Language. This way the user has full control over the virtual machine he would like to run on the infrastructure.

Argus

The Argus framework, consisting of the Policy Administration Point, Policy Decision Point and Policy Enforcement Point, is able to set obligations to be fulfilled for the requesting user. This way it is possible to set an obligation for the image to boot for a specific user or for the role he takes within an organization.

GridMapFile

Another way of providing the image is by the use of a GridMapFile (a specification of the GridMapFile is provided by the TWiki at CERN: https://twiki.cern.ch/twiki/bin/view/egee/authzmapfile), which is a file containing mappings from one sort of information to another and looks like "Expression to map" ThingToMapTo,SomethingOther. In this situation it would, for example, map an FQAN (Fully Qualified Attribute Name) to the information needed. FQANs are used to describe a role a person has within the VO he works for. This file is stored on the file system and is maintained by the administrators of a site.

Choice and motivation

In the ideal situation a user should be able to supply the image he would like to boot. Unfortunately it would be very hard to run user-supplied images in such an environment because of the privileges a machine, and thus the user who supplied it, gets. Also, the current interpreters of the JDL do not pass the information all the way to the Argus framework, so this should be addressed before users can have full control over the image they would like to boot. This rules out user-supplied images, at least until the JDL supports it.

Ideally the EES should be able to perform its operation without the need to be invoked through Argus. To keep plugins capable of performing without the need for obligations provided by Argus, the image to boot must be gathered in some other way. Although Argus is more scalable than a file for these mappings, a GridMapFile was chosen for several reasons:

1. It is more portable than relying on the Argus framework.
2. It is easier to implement.
3. It is less work than registering XACML attributes with other organizations.

Since GridMapFiles [2] are already being used on the grid there is a stable implementation, whereas Argus is fairly new.

5.1.2 OpenNebula user

There is a finite number of users in the Open Nebula database. For traceability it is desirable to boot images as separate users, so log files can show who is responsible for booting an image. Because there is a finite set of accounts, they should be dynamically mapped to users of the grid. This is already done for UNIX accounts, with a construction called GridMapDir, and for simplicity it will be implemented the same way: the DN from the user's certificate is mapped to an Open Nebula user the first time it is seen and released after it has not been used for a long time. Hereby it is traceable who booted a Virtual Machine within a certain period of time.

To communicate with Open Nebula you need a session key, which is a concatenation of the username and a hash of the password. To obtain a session for a given user of Open Nebula, the username is mapped, by the use of a GridMapFile, to an Open Nebula session as the XML-RPC layer of Open Nebula expects. (A sketch of this lookup and session construction is given at the end of this chapter.)

5.1.3 Resources

Currently a machine running batch jobs is logically divided into job slots by the number of cores the machine has. Since it is not possible to allocate more RAM for virtual machines than is physically available (minus some overhead for the hosting machine), it is good practice to divide the total amount of RAM available by the number of virtual machines, i.e. the number of cores of the machine. This is currently done by setting a default in the Open Nebula configuration files.

5.1.4 Network

To mimic the behavior of a normal cluster, all machines are currently connected to the same public network as defined by Open Nebula. For some users or groups it could be desirable to put all the machines they have running into the same private network. But at the moment there is no support for virtual local area networks in the tools used, and it is very complex because switching equipment would have to be configured on the fly to place machines in dedicated VLANs. The network administrator at Nikhef's grid site is currently looking into switches which can be configured on the fly. For now the only way to separate networks is through IP ranges, which can be faked by the owner of the virtual machine, and therefore it is not safe to supply users with private networks.
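The sketch referred to in section 5.1.2 follows here. It illustrates the two lookups described above: finding the image to boot for an FQAN in a GridMapFile-style mapping (section 5.1.1) and building the session string the Open Nebula XML-RPC layer expects (section 5.1.2). The mapfile path, its example contents and the helper names are assumptions for the sake of illustration, not the EES implementation.

#!/usr/bin/env python
# Sketch of the information gathering in chapter 5 (illustrative only):
# look up the image to boot for an FQAN in a GridMapFile-style mapping and
# build the session string the Open Nebula XML-RPC layer expects.
# The mapfile path, its contents and the helper names are assumptions.
import hashlib

# Assumed example contents of /etc/grid-security/vm-mapfile, modelled on the
# '"expression to map" value' layout of a GridMapFile:
#   "/atlas/Role=production" /var/lib/one/images/atlas-prod.img
#   "/atlas"                 /var/lib/one/images/atlas-default.img

def lookup_image(fqan, mapfile="/etc/grid-security/vm-mapfile"):
    """Return the image mapped to the first expression matching the FQAN."""
    with open(mapfile) as handle:
        for line in handle:
            parts = line.strip().split(None, 1)
            if len(parts) != 2 or parts[0].startswith("#"):
                continue
            expression, image = parts[0].strip('"'), parts[1].strip()
            if fqan == expression or fqan.startswith(expression + "/"):
                return image
    raise LookupError("no image mapped for FQAN %s" % fqan)

def one_session(username, password):
    """Concatenate the username and a SHA1 hash of the password, the session
    form the Open Nebula XML-RPC layer expects."""
    return "%s:%s" % (username, hashlib.sha1(password).hexdigest())

if __name__ == "__main__":
    print lookup_image("/atlas/Role=production")  # path of the mapped image
    print one_session("gridvm001", "secret")      # gridvm001: assumed pool account

In a real deployment the username passed to one_session would be the Open Nebula pool account leased to the user's DN through the GridMapDir-style mapping described in section 5.1.2.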

Chapter 6 Credits

Some of the contents of this document are a direct copy from another document. This is done because these parts are mandatory to include but have already been written numerous times. Here I state which chapters, or parts of them, were written by someone else. In most cases I have found and contacted the original author of the parts to ask for permission to reuse their effort, but I cannot guarantee that for all parts the original author has been found or contacted.

6.1 Nikhef & Grid computing

This chapter was written by Aram Verstegen during his internship at Nikhef, where he developed the Execution Environment Service also discussed in this document. It was published in his thesis [7] and may therefore be familiar to readers who have also read that work.

Chapter 7 Sources

place sources here

Bibliography

[1] Grid map dir mechanism. https://twiki.cern.ch/twiki/bin/view/lhcb/GridMapDir.
[2] The gridmap file. http://gdp.globus.org/gt3-tutorial/multiplehtml/ch15s01.html.
[3] Organigram Nikhef. http://www.nikhef.nl/over-nikhef/achtergrondinformatie/organisatiestructuur/.
[4] G. Garzoglio (editor). An XACML attribute and obligation profile for authorization interoperability in grids. http://cd-docdb.fnal.gov/cgi-bin/showdocument?docid=2952 and https://edms.cern.ch/document/929867/2, October 2008.
[5] Kees Huyser. Over Nikhef. http://www.nikhef.nl/over-nikhef/.
[6] The EGEE Project. EGEE in numbers. http://project.eu-egee.org/index.php?id=417.
[7] Aram Verstegen. Execution Environment Service, November 2009.

Glossary

ADA - Application Domain Analyst.
Authorization Framework - The set of tools running on the grid to authorize users for the actions they want to perform on the grid.
BiG Grid - The Dutch e-science grid.
CERN - Organisation Européenne pour la Recherche Nucléaire.
Compute Element - A cluster of Worker Nodes located at the same geographical location, announcing themselves as one resource to the outside.
EGEE - Enabling Grids for E-sciencE.
EGI - European Grid Initiative.
Execution Environment Service - A service providing site-specific environments for a job submitted by a user, based on the policies of both the system administrator of a site and the Virtual Organization.
Hypervisor - The software layer enabling the execution of Virtual Machines.
Information Manager - The interface Open Nebula uses to monitor the hypervisors. By implementing it together with a corresponding Virtual Machine Manager, one could expand the hypervisors supported by Open Nebula.
Job Description Language - The syntax for describing the job a user would like to have executed on grid resources. This description contains, but is not limited to, the job to run, the amount of memory the job uses and the files the job needs to access.
LCG - The LHC Computing Grid.
LHC - Large Hadron Collider, the particle accelerator at CERN.
Local Resource Management System - The scheduler on the site that schedules jobs to the nodes they correspond to.
Nikhef - Nationaal instituut voor subatomaire fysica (Dutch national institute for subatomic physics). Originally: Nationaal Instituut voor Kernfysica en Hoge Energie-Fysica.
Open Nebula - A toolchain providing a high-level interface to a Virtual Machine cluster.
PDP - Physics Data Processing.
Policy Administration Point - Service, part of the Authorization Framework, used by the system administrators of a site or by a Virtual Organization to administer policies for their users.
Policy Decision Point - Service, part of the Authorization Framework, local to the site, which collects the policies from the Policy Administration Point and makes decisions based on these policies for incoming requests.
Policy Enforcement Point - Entry point of the Authorization Framework, which enforces the policies provided by the site and the Virtual Organization for the requesting user.
Public Key Infrastructure - A cryptographic method for encrypting messages over untrusted networks.
RU - Radboud Universiteit Nijmegen.
Stichting FOM - Stichting voor Fundamenteel Onderzoek der Materie.
Transfer Manager - Set of tools to manage the deployment of all the images used by Open Nebula.
UU - Universiteit Utrecht.
UvA - Universiteit van Amsterdam.
Virtual Machine Manager - The interface used by Open Nebula to interact with a hypervisor for starting and stopping a Virtual Machine.
Virtual Organization - An administrative container for users working on the same kind of experiments. A good example is ATLAS, which is the Virtual Organization for physicists processing the results of the Large Hadron Collider.
VL-e - Virtual Laboratory for e-science.
VU - Vrije Universiteit Amsterdam.
Worker Node - Computer in a Compute Element where the jobs are executed.
Workload Management System - System which is aware of multiple Compute Elements and their expected response time (the time it will take for a job submitted to the Compute Element to be completed).