Dynamic Extension of a Virtualized Cluster by using Cloud Resources
Oliver Oberst, Thomas Hauth, David Kernert, Stephan Riedel, Günter Quast
Institut für Experimentelle Kernphysik, Karlsruhe Institute of Technology, Wolfgang-Gaede-Strasse 1, Karlsruhe
oliver.oberst@cern.ch

Abstract. The specific requirements concerning the software environment within the HEP community constrain the choice of resource providers for the outsourcing of computing infrastructure. The use of virtualization in HPC clusters and in the context of cloud resources is therefore a subject of recent developments in scientific computing. The dynamic virtualization of worker nodes in common batch systems provided by ViBatch serves each user with a dynamically virtualized subset of worker nodes on a local cluster. It can now be transparently extended by the use of common open-source cloud interfaces like OpenNebula or Eucalyptus, launching a subset of the virtual worker nodes within the cloud. This paper demonstrates how a dynamically virtualized computing cluster is combined with cloud resources by attaching remotely started virtual worker nodes to the local batch system.

1. Introduction
Today's HPC clusters are typically overdimensioned to cope with expected peak loads of the system. By sharing a centralized HPC cluster infrastructure among different user groups, the overheads in terms of hardware, administration effort, infrastructure and energy consumption can be minimized. In cases where a common computing environment cannot meet the software requirements of all participating user groups, virtualization can be used to supply any required operating system and software environment. The intrinsic performance loss through virtualization is negligible if specific user groups with diverging prerequisites are thereby able to use additional HPC resources in a shared computing cluster.
By dynamically virtualizing the worker nodes, the computing cluster is dynamically partitioned, providing several different environments. A second area where virtualization can be adopted is the extension of local HPC resources by adding Cloud worker nodes. This extension is further eased if the common usage scenario of the HPC resource already utilizes virtual machines, which can easily be prepared for off-site use within a cloud infrastructure. Both the dynamic partitioning of a shared HPC cluster and its extension with Cloud worker nodes are summarized in the following.

Published under licence by IOP Publishing Ltd

2. Dynamic Virtualization of Worker Nodes
There are three possible ways to serve user groups with diverging computing infrastructure requirements, as depicted in Figure 1: either each group runs its own separate infrastructure, or a common infrastructure is shared amongst them [1].

Figure 1. Three typical possibilities to run an HPC infrastructure. The scenario on top shows independent clusters maintained by the specific user groups. The second and third scenarios show shared, centralized clusters which run on the same hardware infrastructure. In these cases the cluster can be offered either with statically or with dynamically partitioned sub-clusters.

For the second case, two further scenarios are conceivable: either cluster partitions with their own environments installed are statically provided to each group, or the compute nodes are virtualized, which results in a dynamic partitioning of the cluster on a job-by-job basis. For the latter, one would assume that the resource management system has to be aware of the use of virtualization, and indeed systems like Condor [2] or Open Grid Scheduler [3] offer such functionality. However, we found a way to virtualize the worker nodes without a virtualization-aware batch system in use: mainly by using a standard facility available in most resource managers, the prologue and epilogue scripts, the virtual worker nodes can be handled as part of the actual user job. This functionality is implemented in our tool ViBatch [4].

2.1. ViBatch
Our concept of dynamically virtualized worker nodes requires prologue and epilogue scripting functionality within the batch system and the common virtualization API libvirt [5]. The detailed ViBatch workflow is sketched in Figure 2.
Figure 2. Schematic overview of the ViBatch concept: portable to any batch system with prologue and epilogue scripting functionality, independent from the underlying hypervisor, lightweight setup, transparent to the user, allows a mixed batch system setup with native and virtual worker nodes [4].

In detail, the workflow can be described by these steps:
(i) A user submits a job to the batch system. By submitting to an appropriate queue, which needs to be set up on the batch server, the user decides whether the job runs on a virtual worker node or on the native host OS. This makes it easy to mix virtual and native worker nodes on the same cluster.
(ii) If a virtual queue is selected, the batch system executes the prologue script at the beginning of each job.
(iii) The prologue script prepares the virtual machine image by cloning the VM from a provided template onto the local worker node hard disk.
(iv) The template is modified to accept the actual user job later on. Currently, a non-password-protected, user-specific public ssh key is copied to the authorized_keys file on the VM.
(v) The virtual machine is started via the libvirt API. A proper MAC address is handed over for the virtual network interface to allow an individual network setup via DHCP.
(vi) At the end of the booting process, the VM creates a lock file via an init script on the local or cluster file system.
(vii) The prologue script checks for this lock file to guarantee a completely booted VM.
(viii) The actual user job is piped via ssh to the VM.
(ix) The user job is executed inside the VM.
(x) After the job has finished and the job output has been returned to the user, the epilogue script is executed.
(xi) The VM is shut down and destroyed.

A detailed view of the job hand-over to the VMs is shown in Figure 3. Currently ViBatch runs integrated into the production system of a shared HPC cluster located at the Karlsruhe Institute of Technology (KIT). It is shared among nine different research departments and has the following key specifications: 1600 CPU cores, 200 x 8-core Intel Xeon X5355 (VT-x, 64 bit) CPUs with 2 GB of memory per core, SUSE Linux Enterprise Server 11 SP2 as host OS (default kernel), and the KVM [6] hypervisor (qemu-kvm). The virtualized worker nodes are used by KIT users affiliated with the Compact Muon Solenoid (CMS) [7] experiment at the Large Hadron Collider.
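The core of the prologue steps above can be sketched in Python. This is a minimal illustration, not the actual ViBatch implementation: the template path, the MAC prefix and the helper names are our own placeholders, and the command lists would be executed by the prologue via a subprocess call.

```python
import time
from pathlib import Path

# Hypothetical location of the locally stored VM template.
TEMPLATE = "/var/lib/vibatch/templates/slc5.qcow2"

def clone_command(template, overlay):
    """Step (iii): build the command for a copy-on-write overlay of the
    template, rather than a full copy, which is why cloning stays well
    under two seconds."""
    return ["qemu-img", "create", "-f", "qcow2", "-b", template, overlay]

def job_mac(job_id):
    """Step (v): derive a per-job MAC address (locally administered
    52:54:00 prefix) so DHCP can assign an individual network setup."""
    suffix = job_id % 0x1000000  # keep the job id within 24 bits
    return "52:54:00:%02x:%02x:%02x" % (
        (suffix >> 16) & 0xFF, (suffix >> 8) & 0xFF, suffix & 0xFF)

def wait_for_lockfile(lockfile, timeout=120, poll=2.0):
    """Steps (vi)-(vii): the VM's init script creates a lock file once
    booted; the prologue polls for it before handing over the job."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if Path(lockfile).exists():
            return True
        time.sleep(poll)
    return False

def job_handover_command(vm_host):
    """Step (viii): the user job script is piped into this ssh command's
    stdin and executed inside the VM."""
    return ["ssh", "-o", "StrictHostKeyChecking=no", vm_host, "bash", "-s"]
```

The equivalent logic in ViBatch lives in shell prologue/epilogue scripts; the sketch only makes the clone/boot/poll/hand-over sequence explicit.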
The specific requirements of this group are Scientific Linux CERN 5 [8] and experiment-specific software. The experiment-specific software is imported into the VMs via CernVMFS [9], which outsources the installation of new CMS experiment software releases.

Figure 3. This figure illustrates the hand-over of the jobs from the batch system via the worker node to the virtual machine using ssh shells.

During the last two years several thousand jobs successfully ran via the Maui/TORQUE [10] batch system through the VM queues, as depicted in Figure 4, with the profile of typical High Energy Physics (HEP) applications comprising both CPU- and I/O-intensive job classes. Since no para-virtualization driver has been available up to now, the cluster file system Lustre [11] was exported via NFS [12] from each host to its currently running VMs. As each VM image is deleted after job execution, the logs of these VMs have to be stored to enable forensics for security maintenance and debugging. This is done by using a central syslog-ng [13] server to store the VM logs. The use of the central log server has to be configured within the virtual machine templates of each partition prior to deployment. The cloning and modification steps of the VM are measured to take less than two seconds. This is achieved because cloning here means creating a copy-on-write overlay of the locally stored VM template on the worker node hard drive, in contrast to creating a full copy of the template for each VM instance. These templates are deployed by the ViBatch operators as needed, e.g. after applying security updates or after a change of the VM setup, through the ViBatch helper scripts. It is planned to revisit the deployment for further improvement, e.g. using peer-to-peer techniques between the worker nodes. However, for the current production operation of ViBatch, the VM template deployment has no performance impact, as the jobs only use the local VM template overlay. Moreover, the length of the prologue procedure is heavily correlated with the VM images used: it mainly depends on the VM boot-up time. By optimizing the VMs themselves, removing services that are not needed, e.g. yum auto-update, we reach a boot time of 35 seconds, which adds up to a 40-second start time for jobs in the virtual worker node queue for the SLC5 VMs.

Figure 4. Job success rate of one month of production usage of ViBatch.

3. Extending Batch Systems with Cloud Resources
In addition to the dynamic partitioning of the cluster using virtualization, Infrastructure as a Service (IaaS) Cloud resources can be dynamically attached to the local resource manager to extend the available farm in times of heavy load. During recent years, various IaaS Cloud providers and implementations entered the service market. One of the first to offer Cloud services was Amazon with EC2 [14]. They offer different machine configurations, so-called machine types, on a pay-per-hour basis. The software used by Amazon itself to provide and manage their Cloud services is proprietary and not available to the public. Eucalyptus Systems [15] fills this gap by developing an open-source Cloud Computing infrastructure software called Eucalyptus, which implements the same API as Amazon's EC2. The Cloud Computing research group at the Steinbuch Centre for Computing [16] at the Karlsruhe Institute of Technology (KIT) runs a private Cloud based on OpenNebula (ONE) [17]. At its current stage of expansion, the private Cloud can run up to several hundred single-core and multi-core virtual machines.
Given the possibility to utilize even more resources for our HEP researchers, the Institut für Experimentelle Kernphysik (EKP) at KIT decided to evaluate and develop a dynamic batch system extension tool for our local resources, which resulted in the Cloud meta-scheduler ROCED (Responsive On-demand Cloud-Enabled Deployment) [18, 19, 20].

3.1. ROCED
The modular design of the meta-scheduler ROCED enables the use of different combinations of local batch systems and remote Cloud interfaces. ROCED is composed of three different so-called Adapters, as depicted in Figure 5.

Figure 5. ROCED design baseline. Three individual Adapters are used to interface the local batch system and the remote Cloud software.

The three Adapters are, in detail:
Requirement Adapter: Gathers information from the local resource manager and calculates the required number of Cloud worker nodes.
Site Adapter: Interfaces the Cloud site; boots and stops the Cloud worker nodes.
Integration Adapter: Registers newly provisioned Cloud nodes and removes them when required.

ROCED has two modes of operation, the so-called ROCED topologies. In the first topology, a remote batch server with a fixed number of remote Cloud worker nodes is connected as a slave to the local resource management system. In contrast, in the second topology, which is the one used in our current setup, the remote Cloud worker nodes are dynamically provisioned and attached to the local batch system. The plot in Figure 6 gives an impression of ROCED running in the Topology 2 mode. If the job queue length exceeds a configured threshold, ROCED extends the local cluster by starting additional Cloud nodes. As soon as they are registered with the local batch system, they are filled up with jobs. At a queue length below the threshold, ROCED unregisters the nodes from the batch system and shuts down the Cloud nodes. ROCED is implemented in Python 2.6 and currently supports Torque and Oracle Grid Engine as batch systems, and the Amazon EC2, Eucalyptus and OpenNebula Cloud interfaces.
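The Topology 2 scaling decision described above can be condensed into a small function. This is a sketch under our own assumptions (the function name, the per-node job capacity and the drain-everything policy below the threshold are placeholders, not ROCED's actual API):

```python
def required_cloud_nodes(queued_jobs, running_nodes, threshold,
                         jobs_per_node=1, max_nodes=50):
    """Return how many Cloud worker nodes to request (positive) or
    release (negative), mimicking the threshold behaviour of Figure 6."""
    if queued_jobs > threshold:
        # Boot enough nodes to absorb the backlog above the threshold,
        # capped by the site's maximum capacity.
        needed = -(-(queued_jobs - threshold) // jobs_per_node)  # ceil division
        return min(needed, max_nodes - running_nodes)
    if queued_jobs < threshold and running_nodes > 0:
        # Queue has drained below the threshold: unregister and shut
        # down the Cloud nodes.
        return -running_nodes
    return 0
```

For example, with a threshold of 10 queued jobs and 4 job slots per node, a backlog of 30 jobs would trigger a request for 5 additional Cloud nodes.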
Figure 6. ROCED Topology 2 example. As soon as the queue is filled above a certain threshold, ROCED starts additional remote Cloud worker nodes. As soon as the queue length falls below the threshold again, the remote nodes are shut down and removed. Taken from [20].

3.2. The ROCED Workflow
ROCED runs in management cycles of configurable length. The ROCED workflow is separated into the following steps:
(i) Queue monitoring: Within each cycle, ROCED first gathers the current queue lengths of one or more batch servers and their monitored queues.
(ii) Boot VM: The ROCED Broker then decides how many VMs are required according to the queue lengths. The Site Adapter decides which Cloud provider to contact, using the current Cloud resource prices, and the VMs are started accordingly.
(iii) Add node: The fully booted Cloud VMs are added to the local batch system by the Integration Adapter.
(iv) Execute job: As soon as the VMs are integrated into the local batch system, jobs are started on the free Cloud worker nodes.
(v) Remove and shutdown: If there are no additional submitted jobs in the batch system and the queues drain, the Cloud nodes are removed from the batch system and shut down.

To enable flawless management and operation of the remote Cloud worker nodes, ROCED utilizes a strictly linear state machine, as sketched in Figure 7. For each lifetime step of a VM a distinct Adapter is responsible.

4. The Fusion of ViBatch and ROCED
The combination of both tools, ViBatch and ROCED, leads to a dynamically scalable virtual cluster. This combination is currently being tested in preparation for production usage at the Institut für Experimentelle Kernphysik (IEKP) at KIT. Figure 8 depicts the current design. ViBatch manages the dynamic virtualization of the IC1 cluster at the SCC (Campus South) with SLC5 VM nodes, whereas ROCED attaches SLC5 Cloud VMs from the private ONE campus cloud at the SCC (Campus North).
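The management cycle and the strictly linear state machine from Section 3.2 might be sketched as follows. The paper does not name the eight Cloud node states, so the state names, the class and the cycle function here are placeholders of our own; only the forward-only progression and the one-Adapter-per-transition idea come from the text.

```python
# Placeholder names for the eight Cloud node states (the paper does not
# list them); only their strictly linear ordering reflects Figure 7.
STATES = ["new", "booting", "up", "integrating", "working",
          "pending-disintegration", "disintegrating", "down"]

class CloudNode:
    def __init__(self):
        self.state = "new"

    def advance(self):
        """The state machine is strictly linear: a node only moves
        forward, never back, until it reaches the terminal state."""
        i = STATES.index(self.state)
        if i < len(STATES) - 1:
            self.state = STATES[i + 1]
        return self.state

def management_cycle(queue_length, threshold, nodes):
    """One ROCED-style cycle: monitor the queue, boot a node if the
    threshold is exceeded, drop terminated nodes, and let the Adapter
    responsible for each state advance its node one step."""
    if queue_length > threshold:
        nodes.append(CloudNode())  # Site Adapter boots a new VM
    else:
        nodes[:] = [n for n in nodes if n.state != "down"]
    for node in nodes:
        node.advance()
    return nodes
```

In the real tool, each transition is owned by one Adapter (e.g. the Integration Adapter registers a node once it is up); the sketch compresses that into a single `advance` call per cycle.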
As already mentioned, the Lustre cluster file system is exported to the local VMs via NFS servers running on the hardware nodes. With this technique we can also provide access to Lustre for the remote Cloud VMs, as the remote Cloud site is attached via a powerful 10 GBit network link between the two KIT locations (Campus North and Campus South).

Figure 7. The ROCED state machine. For each of the eight possible Cloud node states a distinct Adapter takes care of the management.

Within the current test setup a dedicated Cloud-enabled queue is prepared within Torque. ROCED monitors only this queue and provides the required ONE resources. In a production scenario, all virtual SLC5 queues will be added to the ROCED configuration to obtain a fully Cloud-extended setup.

5. General Performance Considerations
As the additional layer of virtualization has an impact on the performance of the executed applications, we investigated the performance of our production system with respect to typical HEP applications. In Figure 9 one can see that purely CPU-intensive jobs lose 12% performance, whereas Monte Carlo simulations of High Energy Physics processes lose 17% owing to their increased I/O. In the case of data analysis the HEP users are bound to their experiment software. Within the CMS experiment, the software framework and therefore the analysis results are only validated for SLC operating systems. The host OS SLES11 SP2 on the IC1 cluster is fixed, as it is a compromise between the nine shareholder institutes of the IC1 cluster. Given this, the benefit of being able to run jobs on this shared cluster prevails over any concerns about losing a few percent in performance through virtualization.

6. Conclusion, Outlook and Future Work
The dynamic partitioning of a cluster with ViBatch has proven its stability and performance over the last two years in production usage at KIT. Local and interactive clusters still play a major role in the analysis workflow of today's HEP experiments. They are mainly used as development areas and final analysis resources due to the fast turnaround on interactive machines.
Therefore, the technique of using a meta-scheduler like ROCED to dynamically add transparent Cloud resources is of great importance when trying to intercept peak load times of the local computing infrastructure.

Figure 8. ViBatch + ROCED extending the IC1 cluster.

Figure 9. Performance benchmarks using KVM virtualization for HEP-specific Monte Carlo simulations (binary compatible with the host OS) and a CPU benchmark.

The fusion of ViBatch and ROCED will continue by further merging both tools and testing their scalability and performance. Whereas the current test environment is mainly set up by hand, future development will unify the virtual machine setup and deployment as well as the general configuration setup.

7. Acknowledgments
We thank the staff of the Steinbuch Centre for Computing who were responsible for the general setup of the IC1 cluster and the ONE private Cloud. We wish to acknowledge the financial support of the Bundesministerium für Bildung und Forschung (BMBF).
References
[1] Volker Büge, Hermann Hessling, Yves Kemp, Marcel Kunze, Oliver Oberst, Günter Quast, Armin Scheurer, and Owen Synge. Integration of virtualized worker nodes in standard batch systems. Journal of Physics: Conference Series, 219(5):052010.
[2] Condor - High Throughput Computing
[3] Oracle Grid Engine
[4] A Scheurer, O Oberst, V Büge, G Quast, and M Kunze. Virtualized batch worker nodes: Conception and integration in HPC environments. Journal of Physics: Conference Series, 331(6):062043.
[5] Libvirt Virtualization API
[6] KVM Virtualization
[7] CMS Collaboration. The CMS experiment at the CERN LHC. JINST, 3:S08004.
[8] Scientific Linux Homepage
[9] CernVM File System
[10] The MAUI Scheduler
[11] Lustre File System
[12] S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M. Eisler, and D. Noveck. Network File System (NFS) version 4 Protocol. RFC 3530 (Proposed Standard), April.
[13] syslog-ng log manager
[14] Amazon Elastic Compute Cloud
[15] Eucalyptus Systems
[16] Steinbuch Centre for Computing
[17] The OpenNebula Project
[18] S. Riedel. Einbindung von Cloud-Ressourcen in Workflows der Teilchenphysik und Messung des Underlying Event in Proton-Proton-Kollisionen am LHC, volume IEKP-KA/. Institut für Experimentelle Kernphysik, Karlsruhe Institute of Technology.
[19] T. Hauth. Dynamische Erweiterung von Batchsystemen mit Cloud Ressourcen und Messung der Jetenergieskala des CMS Detektors, volume IEKP-KA/. Institut für Experimentelle Kernphysik, Karlsruhe Institute of Technology.
[20] T Hauth, G Quast, M Kunze, V Büge, A Scheurer, and C Baun. Dynamic extensions of batch systems with cloud resources. Journal of Physics: Conference Series, 331(6):062034.
More informationPowerful Insights with Every Click. FixStream. Agentless Infrastructure Auto-Discovery for Modern IT Operations
Powerful Insights with Every Click FixStream Agentless Infrastructure Auto-Discovery for Modern IT Operations The Challenge AIOps is a big shift from traditional ITOA platforms. ITOA was focused on data
More informationStriped Data Server for Scalable Parallel Data Analysis
Journal of Physics: Conference Series PAPER OPEN ACCESS Striped Data Server for Scalable Parallel Data Analysis To cite this article: Jin Chang et al 2018 J. Phys.: Conf. Ser. 1085 042035 View the article
More informationRACKSPACE ONMETAL I/O V2 OUTPERFORMS AMAZON EC2 BY UP TO 2X IN BENCHMARK TESTING
RACKSPACE ONMETAL I/O V2 OUTPERFORMS AMAZON EC2 BY UP TO 2X IN BENCHMARK TESTING EXECUTIVE SUMMARY Today, businesses are increasingly turning to cloud services for rapid deployment of apps and services.
More informationResearch Challenges in Cloud Infrastructures to Provision Virtualized Resources
Beyond Amazon: Using and Offering Services in a Cloud Future Internet Assembly Madrid 2008 December 9th, 2008 Research Challenges in Cloud Infrastructures to Provision Virtualized Resources Distributed
More informationAn Experimental Study of Load Balancing of OpenNebula Open-Source Cloud Computing Platform
An Experimental Study of Load Balancing of OpenNebula Open-Source Cloud Computing Platform A B M Moniruzzaman, StudentMember, IEEE Kawser Wazed Nafi Syed Akther Hossain, Member, IEEE & ACM Abstract Cloud
More informationSURVEY PAPER ON CLOUD COMPUTING
SURVEY PAPER ON CLOUD COMPUTING Kalpana Tiwari 1, Er. Sachin Chaudhary 2, Er. Kumar Shanu 3 1,2,3 Department of Computer Science and Engineering Bhagwant Institute of Technology, Muzaffarnagar, Uttar Pradesh
More informationUsing Puppet to contextualize computing resources for ATLAS analysis on Google Compute Engine
Journal of Physics: Conference Series OPEN ACCESS Using Puppet to contextualize computing resources for ATLAS analysis on Google Compute Engine To cite this article: Henrik Öhman et al 2014 J. Phys.: Conf.
More informationWhat is Cloud Computing? What are the Private and Public Clouds? What are IaaS, PaaS, and SaaS? What is the Amazon Web Services (AWS)?
What is Cloud Computing? What are the Private and Public Clouds? What are IaaS, PaaS, and SaaS? What is the Amazon Web Services (AWS)? What is Amazon Machine Image (AMI)? Amazon Elastic Compute Cloud (EC2)?
More informationThe CMS data quality monitoring software: experience and future prospects
The CMS data quality monitoring software: experience and future prospects Federico De Guio on behalf of the CMS Collaboration CERN, Geneva, Switzerland E-mail: federico.de.guio@cern.ch Abstract. The Data
More informationIntegration of Cloud and Grid Middleware at DGRZR
D- of International Symposium on Computing 2010 Stefan Freitag Robotics Research Institute Dortmund University of Technology March 12, 2010 Overview D- 1 D- Resource Center Ruhr 2 Clouds in the German
More informationSmarter Systems In Your Cloud Deployment
Smarter Systems In Your Cloud Deployment Hemant S Shah ASEAN Executive: Cloud Computing, Systems Software. 5 th Oct., 2010 Contents We need Smarter Systems for a Smarter Planet Smarter Systems = Systems
More informationVirtualization. Michael Tsai 2018/4/16
Virtualization Michael Tsai 2018/4/16 What is virtualization? Let s first look at a video from VMware http://www.vmware.com/tw/products/vsphere.html Problems? Low utilization Different needs DNS DHCP Web
More informationDemystifying the Cloud With a Look at Hybrid Hosting and OpenStack
Demystifying the Cloud With a Look at Hybrid Hosting and OpenStack Robert Collazo Systems Engineer Rackspace Hosting The Rackspace Vision Agenda Truly a New Era of Computing 70 s 80 s Mainframe Era 90
More informationStratusLab Cloud Distribution Installation. Charles Loomis (CNRS/LAL) 3 July 2014
StratusLab Cloud Distribution Installation Charles Loomis (CNRS/LAL) 3 July 2014 StratusLab What is it? Complete IaaS cloud distribution Open source (Apache 2 license) Works well for production private
More informationPaperspace. Architecture Overview. 20 Jay St. Suite 312 Brooklyn, NY Technical Whitepaper
Architecture Overview Copyright 2016 Paperspace, Co. All Rights Reserved June - 1-2017 Technical Whitepaper Paperspace Whitepaper: Architecture Overview Content 1. Overview 3 2. Virtualization 3 Xen Hypervisor
More informationVirtual Machines. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Virtual Machines Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Today's Topics History and benefits of virtual machines Virtual machine technologies
More informationopennebula and cloud architecture
opennebula and cloud architecture Stefano Bagnasco INFN Torino OpenNebula Cloud Architecture- 1/120 outline Recap from yesterday OpenNebula Open Cloud Reference Architecture OpenNebula internal achitecture
More informationHigh Throughput WAN Data Transfer with Hadoop-based Storage
High Throughput WAN Data Transfer with Hadoop-based Storage A Amin 2, B Bockelman 4, J Letts 1, T Levshina 3, T Martin 1, H Pi 1, I Sfiligoi 1, M Thomas 2, F Wuerthwein 1 1 University of California, San
More informationMission-Critical Databases in the Cloud. Oracle RAC in Microsoft Azure Enabled by FlashGrid Software.
Mission-Critical Databases in the Cloud. Oracle RAC in Microsoft Azure Enabled by FlashGrid Software. White Paper rev. 2017-10-16 2017 FlashGrid Inc. 1 www.flashgrid.io Abstract Ensuring high availability
More informationIdentifying Workloads for the Cloud
Identifying Workloads for the Cloud 1 This brief is based on a webinar in RightScale s I m in the Cloud Now What? series. Browse our entire library for webinars on cloud computing management. Meet our
More informationSystem upgrade and future perspective for the operation of Tokyo Tier2 center. T. Nakamura, T. Mashimo, N. Matsui, H. Sakamoto and I.
System upgrade and future perspective for the operation of Tokyo Tier2 center, T. Mashimo, N. Matsui, H. Sakamoto and I. Ueda International Center for Elementary Particle Physics, The University of Tokyo
More informationImportant DevOps Technologies (3+2+3days) for Deployment
Important DevOps Technologies (3+2+3days) for Deployment DevOps is the blending of tasks performed by a company's application development and systems operations teams. The term DevOps is being used in
More informationAliEn Resource Brokers
AliEn Resource Brokers Pablo Saiz University of the West of England, Frenchay Campus Coldharbour Lane, Bristol BS16 1QY, U.K. Predrag Buncic Institut für Kernphysik, August-Euler-Strasse 6, 60486 Frankfurt
More informationDistributed Systems. 31. The Cloud: Infrastructure as a Service Paul Krzyzanowski. Rutgers University. Fall 2013
Distributed Systems 31. The Cloud: Infrastructure as a Service Paul Krzyzanowski Rutgers University Fall 2013 December 12, 2014 2013 Paul Krzyzanowski 1 Motivation for the Cloud Self-service configuration
More informationBacktesting in the Cloud
Backtesting in the Cloud A Scalable Market Data Optimization Model for Amazon s AWS Environment A Tick Data Custom Data Solutions Group Case Study Bob Fenster, Software Engineer and AWS Certified Solutions
More informationAlteryx Technical Overview
Alteryx Technical Overview v 1.5, March 2017 2017 Alteryx, Inc. v1.5, March 2017 Page 1 Contents System Overview... 3 Alteryx Designer... 3 Alteryx Engine... 3 Alteryx Service... 5 Alteryx Scheduler...
More informationBUILDING A PRIVATE CLOUD. By Mark Black Jay Muelhoefer Parviz Peiravi Marco Righini
BUILDING A PRIVATE CLOUD By Mark Black Jay Muelhoefer Parviz Peiravi Marco Righini HOW PLATFORM COMPUTING'S PLATFORM ISF AND INTEL'S TRUSTED EXECUTION TECHNOLOGY CAN HELP 24 loud computing is a paradigm
More informationEvolution of the ATLAS PanDA Workload Management System for Exascale Computational Science
Evolution of the ATLAS PanDA Workload Management System for Exascale Computational Science T. Maeno, K. De, A. Klimentov, P. Nilsson, D. Oleynik, S. Panitkin, A. Petrosyan, J. Schovancova, A. Vaniachine,
More informationLarge Scale Computing Infrastructures
GC3: Grid Computing Competence Center Large Scale Computing Infrastructures Lecture 2: Cloud technologies Sergio Maffioletti GC3: Grid Computing Competence Center, University
More informationA Case for High Performance Computing with Virtual Machines
A Case for High Performance Computing with Virtual Machines Wei Huang*, Jiuxing Liu +, Bulent Abali +, and Dhabaleswar K. Panda* *The Ohio State University +IBM T. J. Waston Research Center Presentation
More informationUsers and utilization of CERIT-SC infrastructure
Users and utilization of CERIT-SC infrastructure Equipment CERIT-SC is an integral part of the national e-infrastructure operated by CESNET, and it leverages many of its services (e.g. management of user
More informationVirtualization Introduction
Virtualization Introduction Simon COTER Principal Product Manager Oracle VM & VirtualBox simon.coter@oracle.com https://blogs.oracle.com/scoter November 21 st, 2016 Safe Harbor Statement The following
More informationBESIII physical offline data analysis on virtualization platform
BESIII physical offline data analysis on virtualization platform Qiulan Huang huangql@ihep.ac.cn Computing Center, IHEP,CAS CHEP 2015 Outline Overview of HEP computing in IHEP What is virtualized computing
More information13th International Workshop on Advanced Computing and Analysis Techniques in Physics Research ACAT 2010 Jaipur, India February
LHC Cloud Computing with CernVM Ben Segal 1 CERN 1211 Geneva 23, Switzerland E mail: b.segal@cern.ch Predrag Buncic CERN E mail: predrag.buncic@cern.ch 13th International Workshop on Advanced Computing
More informationDesigning the Stable Infrastructure for Kernel-based Virtual Machine using VPN-tunneled VNC
Designing the Stable Infrastructure for Kernel-based Virtual Machine using VPN-tunneled VNC presented by : Berkah I. Santoso Informatics, Bakrie University International Conference on Computer Science
More informationOne Pool To Rule Them All The CMS HTCondor/glideinWMS Global Pool. D. Mason for CMS Software & Computing
One Pool To Rule Them All The CMS HTCondor/glideinWMS Global Pool D. Mason for CMS Software & Computing 1 Going to try to give you a picture of the CMS HTCondor/ glideinwms global pool What s the use case
More informationBaremetal with Apache CloudStack
Baremetal with Apache CloudStack ApacheCon Europe 2016 Jaydeep Marfatia Cloud, IOT and Analytics Me Director of Product Management Cloud Products Accelerite Background Project lead for open source project
More informationCIT 668: System Architecture. Amazon Web Services
CIT 668: System Architecture Amazon Web Services Topics 1. AWS Global Infrastructure 2. Foundation Services 1. Compute 2. Storage 3. Database 4. Network 3. AWS Economics Amazon Services Architecture Regions
More informationBERLIN. 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved
BERLIN 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved Introduction to Amazon EC2 Danilo Poccia Technical Evangelist @danilop 2015, Amazon Web Services, Inc. or its affiliates. All
More informationPerformance Report: Multiprotocol Performance Test of VMware ESX 3.5 on NetApp Storage Systems
NETAPP TECHNICAL REPORT Performance Report: Multiprotocol Performance Test of VMware ESX 3.5 on NetApp Storage Systems A Performance Comparison Study of FC, iscsi, and NFS Protocols Jack McLeod, NetApp
More informationThe evolving role of Tier2s in ATLAS with the new Computing and Data Distribution model
Journal of Physics: Conference Series The evolving role of Tier2s in ATLAS with the new Computing and Data Distribution model To cite this article: S González de la Hoz 2012 J. Phys.: Conf. Ser. 396 032050
More informationIvane Javakhishvili Tbilisi State University High Energy Physics Institute HEPI TSU
Ivane Javakhishvili Tbilisi State University High Energy Physics Institute HEPI TSU Grid cluster at the Institute of High Energy Physics of TSU Authors: Arnold Shakhbatyan Prof. Zurab Modebadze Co-authors:
More informationMODERNISE WITH ALL-FLASH. Intel Inside. Powerful Data Centre Outside.
MODERNISE WITH ALL-FLASH Intel Inside. Powerful Data Centre Outside. MODERNISE WITHOUT COMPROMISE In today s lightning-fast digital world, it s critical for businesses to make their move to the Modern
More informationVxRack FLEX Technical Deep Dive: Building Hyper-converged Solutions at Rackscale. Kiewiet Kritzinger DELL EMC CPSD Snr varchitect
VxRack FLEX Technical Deep Dive: Building Hyper-converged Solutions at Rackscale Kiewiet Kritzinger DELL EMC CPSD Snr varchitect Introduction to hyper-converged Focus on innovation, not IT integration
More informationMulti-Machine Guide vcloud Automation Center 5.2
Multi-Machine Guide vcloud Automation Center 5.2 This document supports the version of each product listed and supports all subsequent versions until the document is replaced by a new edition. To check
More informationUW-ATLAS Experiences with Condor
UW-ATLAS Experiences with Condor M.Chen, A. Leung, B.Mellado Sau Lan Wu and N.Xu Paradyn / Condor Week, Madison, 05/01/08 Outline Our first success story with Condor - ATLAS production in 2004~2005. CRONUS
More informationHuawei FusionCloud Desktop Solution 5.1 Resource Reuse Technical White Paper HUAWEI TECHNOLOGIES CO., LTD. Issue 01.
Huawei FusionCloud Desktop Solution 5.1 Resource Reuse Technical White Paper Issue 01 Date 2014-03-26 HUAWEI TECHNOLOGIES CO., LTD. 2014. All rights reserved. No part of this document may be reproduced
More informationMOHA: Many-Task Computing Framework on Hadoop
Apache: Big Data North America 2017 @ Miami MOHA: Many-Task Computing Framework on Hadoop Soonwook Hwang Korea Institute of Science and Technology Information May 18, 2017 Table of Contents Introduction
More information