Virtualization of the ATLAS Tier-2/3 environment on the HPC cluster NEMO

Ulrike Schnoor (CERN); Anton Gamel, Felix Bührer, Benjamin Rottler, Markus Schumacher (University of Freiburg)
February 02, 2018, Pre-GDB Meeting

Using HPC resources via virtualization: NEMO at Uni Freiburg
- Resource: use NEMO in Freiburg to extend the local Tier-3 resources (Black Forest Grid = BFG)
- Job types: currently mainly local ATLAS analysis and simulation jobs, but easily extendable to any ATLAS jobs
- Setup: full virtualization of the environment, embedded into the existing OpenStack-Torque/Moab infrastructure in a way that is on demand, fully automated, and transparent for the user

HPC center NEMO (bwForCluster)
- Shared by 3 communities in Baden-Württemberg: Elementary Particle Physics, Neuroscience, Microsystems Engineering
- 752 worker nodes, each with 2 x 10 cores, 128 GB RAM, 100 Gbit/s Omni-Path, 240 GB local SSD
- 500 TB workspace (BeeGFS)
- TOP500: ranked 214 in June 2016, 389 in June 2017
- In operation since July 2016
- Hybrid of HPC and cloud approach: OpenStack orchestrates bare-metal jobs and virtual machines in parallel

Virtualization of the ATLAS infrastructure on NEMO: ingredients
- OpenStack: management framework that allows running both virtual machines and bare-metal jobs on NEMO
- Hypervisor: KVM
- User interface: BFG login nodes
- Access to CVMFS and Frontier via the BFG squid proxy
- Scheduler: Slurm (front-end for users @BFG), Torque/Moab (back-end for VMs @NEMO)
- Scheduling for dynamic allocation of VMs: ROCED
- VM image (SL6, CentOS7)
- Access to storage: dCache client, local BeeGFS
- Access to software: CVMFS client (see the configuration sketch after this list)
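
The slides do not spell out the CVMFS client configuration; the following is a minimal sketch of what the contextualization could write to /etc/cvmfs/default.local. The proxy hostname and repository list are assumptions for illustration, not values from the presentation; in the real setup this file is managed by the BFG puppet server.

```python
#!/usr/bin/env python3
"""Sketch: write a minimal CVMFS client configuration.

Illustrative only: the squid proxy hostname and the repository list are
assumptions, not taken from the slides; puppet manages this in production.
"""

CVMFS_CONFIG = """\
CVMFS_REPOSITORIES=atlas.cern.ch,atlas-condb.cern.ch,sft.cern.ch
CVMFS_HTTP_PROXY="http://bfg-squid.example.org:3128"
CVMFS_QUOTA_LIMIT=20000
"""

def write_config(path="/etc/cvmfs/default.local"):
    # Needs root inside the VM image; shown here only to illustrate the content.
    with open(path, "w") as f:
        f.write(CVMFS_CONFIG)

if __name__ == "__main__":
    write_config()
```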

Virtual machine image tool chain
- Requirement: Scientific Linux 6; the CernVM image uses a modified kernel and is therefore not suitable
- Setup: Packer (www.packer.io) for automated image generation (a template sketch follows below)
  - Basis: SL6 ISO
  - Output: VM template image (qcow2)
- Contextualization with puppet: software, services (e.g. CVMFS client), user management etc. are installed with the BFG puppet server, giving an identical and modularized setup
- Important updates? Generate a new VM image
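
As a concrete illustration of the Packer step, the sketch below generates a minimal template for the QEMU builder (qcow2 output from an SL6 ISO) and prints the build command. The ISO URL, checksum, and credentials are placeholders, and the kickstart and puppet provisioning of the actual BFG images are not reproduced here.

```python
#!/usr/bin/env python3
"""Sketch: generate a minimal Packer template for an SL6 qcow2 image.

ISO location, checksum, and SSH credentials are placeholders; the real
template additionally drives a kickstart install and runs puppet for
contextualization.
"""
import json

template = {
    "builders": [{
        "type": "qemu",                            # QEMU/KVM builder, produces a qcow2 image
        "iso_url": "http://example.org/SL-6.iso",  # placeholder
        "iso_checksum": "sha256:0000...",          # placeholder, must be filled in
        "format": "qcow2",
        "output_directory": "output-sl6",
        "ssh_username": "root",                    # placeholder credentials
        "ssh_password": "changeme",
        "shutdown_command": "shutdown -h now",
    }],
    "provisioners": [{
        "type": "shell",
        "inline": ["yum -y install puppet"],       # contextualization then handled by puppet
    }],
}

with open("sl6-image.json", "w") as f:
    json.dump(template, f, indent=2)

print("Now run: packer build sl6-image.json")
```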

Scheduling with Slurm Elastic Computing
- Slurm Elastic Computing: resume and suspend machines on demand via adaptable resume/suspend programs and timeouts (a resume-program sketch follows below)
- Challenge: the 3-layer system of Slurm, Torque/Moab, and OpenStack allows almost no propagation of error messages
- Challenge: not intended for non-permanent resources (queueing in Moab), and the timeouts are not sufficiently adaptable
- Solution: an intermediate layer such as ROCED
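
For illustration, this is roughly what a Slurm ResumeProgram could look like in such a setup. ResumeProgram is a standard slurm.conf hook that receives the hostlist of nodes to power up; how the VM is actually started here (a hypothetical start-vm.sh submitted to Torque/Moab on NEMO) is an assumption, not the BFG production code.

```python
#!/usr/bin/env python3
"""Sketch of a Slurm ResumeProgram for the elastic VM setup.

Slurm calls the configured ResumeProgram with a hostlist of nodes to resume.
The VM start mechanism below (start-vm.sh submitted to Torque/Moab) is a
hypothetical placeholder.
"""
import subprocess
import sys

def expand_hostlist(hostlist):
    # 'scontrol show hostnames' expands e.g. vm[001-004] into individual names.
    out = subprocess.check_output(["scontrol", "show", "hostnames", hostlist])
    return out.decode().split()

def start_vm(node):
    # Hypothetical: one Torque/Moab job per virtual worker node.
    subprocess.check_call(
        ["qsub", "-N", "vm-" + node, "-v", "VM_NODE=" + node, "start-vm.sh"]
    )

if __name__ == "__main__":
    for node in expand_hostlist(sys.argv[1]):
        start_vm(node)
```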

ROCED: Responsive On-Demand Cloud-enabled Deployment
- Tool developed by CMS colleagues in Karlsruhe (KIT): https://github.com/roced-scheduler/roced
- Monitors demand in a batch system and dynamically manages virtual machines accordingly
- Python code with a modular structure, adaptable to different schedulers, VM types, clouds, etc.
- Integration and Requirement Adapters modified for the BFG/Slurm setup: in production
- Architecture (a demand-estimation sketch follows below):
  - Requirement Adapters supply information about needed compute nodes, e.g. queue size (HTCondor, Torque, Grid Engine, Slurm)
  - The ROCED Core / Broker decides which machines to boot or shut down
  - Site Adapters boot machines on various cloud computing sites (hybrid HPC cluster, commercial providers, OpenStack)
  - Integration Adapters integrate booted compute nodes into the existing batch server (HTCondor, Torque, Grid Engine, Slurm)
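
The sketch below illustrates the kind of information a Slurm Requirement Adapter supplies to the ROCED broker: the pending demand in the batch system, translated into a number of virtual worker nodes. It does not use the actual ROCED adapter API; the partition name and the 20-core node size are assumptions.

```python
#!/usr/bin/env python3
"""Sketch: demand estimation as a Slurm Requirement Adapter would provide it.

Not the ROCED adapter API itself; the partition name and cores-per-VM value
are placeholders chosen for illustration.
"""
import math
import subprocess

def pending_jobs(partition="nemo_vm"):
    """Count pending jobs in the given Slurm partition."""
    out = subprocess.check_output(
        ["squeue", "--noheader", "--states=PENDING",
         "--partition", partition, "--format=%i"]
    )
    return len(out.decode().split())

def needed_machines(cores_per_vm=20, cores_per_job=1):
    """Naive estimate: enough virtual worker nodes to run all pending jobs."""
    return math.ceil(pending_jobs() * cores_per_job / cores_per_vm)

if __name__ == "__main__":
    print("virtual worker nodes needed:", needed_machines())
```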

Summary and Outlook
- The Slurm Elastic Computing setup can be used, but it is very fragile and leads to many job failures
- Using ROCED instead of Slurm Elastic Computing: non-elastic Slurm together with ROCED; the Requirement Adapter and Integration Adapter implementations for Slurm and the BFG are in place
- Future possibilities: use of containers; images distributed via CVMFS instead of home-brewed images built with Packer

The Team
- Anton Gamel, Felix Bührer, Benjamin Rottler, Ulrike Schnoor, Markus Schumacher
- Contacts in the computing center (HPC team): Michael Janczyk, Bernd Wiebelt, Dirk von Suchodoletz
- Formerly also: Konrad Meier

Backup

The Black Forest Grid (BFG)
- Tier-2 and Tier-3 site of the WLCG, in operation since 2005
- CPU: 260 nodes with 4700 cores in total (HT), several generations of worker node hardware
- Storage: dCache 1.35 PB (grid), Lustre parallel storage 180 TB (local users)
- Local users from physics, biodynamics, and many other groups
- Future: exclusively Tier-2 and Tier-3 of the WLCG

HPC in Baden-Württemberg
- bwHPC-C5 project: initiative in Baden-Württemberg for a common framework for HPC resources at the state's universities, co-financed by the DFG
- bwForClusters, a federated approach: user groups are defined by research field, not by affiliation
- Freiburg: bwForCluster for Elementary Particle Physics, Neuroscience, and Microsystems Engineering: NEMO

How to run ATLAS jobs on NEMO?
- OS: ATLAS currently needs Scientific Linux 6; NEMO runs CentOS 7
- Software: CVMFS (CernVM File System), the basis for all experiment-specific software, is not installed on NEMO
- Storage: AFS is not available on NEMO
- Solution: virtualize the environment; the virtual machine image and the orchestration/scheduling setup can be used by local jobs as well as by grid jobs

Timeouts in Slurm
- The elasticity of the Slurm Elastic Computing module can be influenced with several timeout parameters (see the sketch below)
- Main issue: ResumeTimeout should be long in order to absorb the Moab queueing time, but short in order to restart quickly if a VM start fails
- Other problem: VMs often stay in COMPLETING (after the job has terminated, before turning IDLE) for a long time
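
For reference, these are the standard slurm.conf parameters that govern this elastic behaviour. The sketch writes a fragment with purely illustrative values and hypothetical script paths; the actual NEMO/BFG settings are not given in the slides.

```python
#!/usr/bin/env python3
"""Sketch: slurm.conf fragment for elastic (power-save) scheduling.

Parameter names are standard Slurm options; the values and script paths are
illustrative placeholders reflecting the trade-off described above.
"""

ELASTIC_FRAGMENT = """\
ResumeProgram=/opt/bfg/slurm/resume_vm.py    # hypothetical path, see the resume sketch above
SuspendProgram=/opt/bfg/slurm/suspend_vm.py  # hypothetical path
ResumeTimeout=1800   # long enough to cover Moab queueing, yet bounded so failed VM starts are retried
SuspendTime=600      # idle time before a virtual worker node is suspended
SuspendTimeout=120   # time allowed for a node to power down
"""

if __name__ == "__main__":
    with open("slurm_elastic.conf", "w") as f:
        f.write(ELASTIC_FRAGMENT)
    print("Include this fragment from slurm.conf (illustrative values only).")
```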