STFC Cloud Developments. Ian Collier & Alex Dibbo, STFC Rutherford Appleton Laboratory, UK Research & Innovation. HEPiX Spring, Madison, WI, 17 May 2018


Contents
- Background & history
- Current design considerations
- New use cases
- (Some of) what this allows us to do
- IRIS (UKT0)

Background
- The OpenNebula cloud has now been running for several years on a stable but pre-production basis, and has served very well.
- The rise of more complex use cases, in particular supporting external user communities, led to deploying OpenStack.
- Now in the final stages of the OpenStack deployment; decommissioning OpenNebula and migrating the remaining users over the coming months.

Hardware
- OpenNebula: reducing from ~700 to ~300 cores.
- OpenStack: ~500 cores, growing to ~3,500 by end of May and ~5,000 by end of 2018.
- Adding hypervisors: 108 Dell C6420 sleds (27 x 4-up 2U C6400 chassis); testing complete, about to enter production. 16 physical cores each, 6 GB RAM/core, 25G NIC.
- Cloud storage: 12 Dell R730xd (12-bay 2U); testing complete, about to enter production. 12 x 4 TB each, 576 TB total, 25G NICs. Replicated Ceph (3x) for VM image storage (see the capacity check below).
- Data storage: 21 x (Dell R630 + 2 x MD1400) sets; just finished testing. Added to ECHO for the UKT0 project: 4 PB raw / ~2.9 PB configured.
- Network: 7 x Mellanox SN2100 (16 x 100Gb ports each).
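As a rough cross-check on the cloud storage figures above, a minimal sketch (drive counts taken from the slide; real usable capacity will be somewhat lower once Ceph and filesystem overheads are included):

```python
# Rough capacity check for the cloud storage nodes listed above.
NODES = 12                 # Dell R730xd servers
DRIVES_PER_NODE = 12       # 12-bay chassis
DRIVE_TB = 4               # 4 TB drives
REPLICATION = 3            # 3x replicated Ceph pool for VM images

raw_tb = NODES * DRIVES_PER_NODE * DRIVE_TB      # 576 TB raw, as on the slide
usable_tb = raw_tb / REPLICATION                 # ~192 TB before Ceph overheads

print(f"raw: {raw_tb} TB, usable at {REPLICATION}x replication: ~{usable_tb:.0f} TB")
```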

OpenStack implementation
- Flexible: within reason, we want a design that can accommodate whatever comes along.
- Multi-tenancy: separating different user communities.
- Highly available: services should be as highly available as possible; OpenStack services sit behind HAProxy (a client sketch follows below).
- Ceph: a replicated Ceph cluster called SIRIUS provides block storage for VMs and volumes; 3x replication, optimised for low latency.
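To illustrate the "single highly available endpoint" point above, a minimal openstacksdk sketch of how a tenant might talk to such a deployment; the auth URL, project, user and region names are placeholders, not the real STFC values:

```python
import openstack

# Clients talk to one load-balanced API endpoint published by HAProxy rather
# than to individual controllers. All names below are illustrative placeholders.
conn = openstack.connect(
    auth_url="https://openstack.example.ac.uk:5000/v3",
    project_name="demo-project",
    username="demo-user",
    password="********",
    user_domain_name="Default",
    project_domain_name="Default",
    region_name="RegionOne",
)

# Simple liveness check: list the servers visible to this project (tenant).
for server in conn.compute.servers():
    print(server.name, server.status)
```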

Ceph RBD
- A replicated Ceph cluster called SIRIUS provides block storage for VMs and volumes: 3x replication, optimised for low latency (usage sketch below).
- Helps speed up the launch of VMs.
- Considering use cases that would benefit from local VM storage.
- Currently being rebuilt as a fresh installation in a new machine room location.
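A hedged openstacksdk sketch of how the SIRIUS cluster is consumed in practice; the clouds.yaml entry, image and flavor names are assumptions. It simply boots a VM from a Cinder volume, which with an RBD backend means the root disk lives on replicated Ceph:

```python
import openstack

conn = openstack.connect(cloud="stfc-cloud")  # hypothetical clouds.yaml entry

# Create a bootable volume from an image. With Cinder backed by the SIRIUS
# Ceph cluster this becomes an RBD image, so the root disk is replicated 3x
# rather than sitting on the hypervisor's local disk.
volume = conn.create_volume(size=20, image="ubuntu-18.04", bootable=True, wait=True)

# Boot a server from that volume (flavor name is a placeholder).
server = conn.create_server(
    name="rbd-backed-vm",
    flavor="m1.medium",
    boot_volume=volume.id,
    wait=True,
)
print(server.id, server.status)
```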

Multi-tenancy
- Projects (tenants) need to be isolated from each other and from the STFC site network.
- Security groups.
- VXLAN private project networks, which bring their own problems (sketch below).
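A minimal sketch of the isolation building blocks mentioned above, using openstacksdk; the names and CIDRs are placeholders. A per-project security group limits ingress, and a project network gets its own VXLAN segment from Neutron:

```python
import openstack

conn = openstack.connect(cloud="stfc-cloud")  # hypothetical clouds.yaml entry

# Security group allowing only SSH into the project's VMs
# (the CIDR is a documentation range, not a real STFC one).
sg = conn.network.create_security_group(
    name="project-ssh", description="SSH only")
conn.network.create_security_group_rule(
    security_group_id=sg.id,
    direction="ingress",
    ethertype="IPv4",
    protocol="tcp",
    port_range_min=22,
    port_range_max=22,
    remote_ip_prefix="192.0.2.0/24",
)

# Tenant-private network: with an ML2/VXLAN deployment Neutron allocates a
# VXLAN segment for it, isolating it from other projects and the site network.
net = conn.network.create_network(name="project-private")
conn.network.create_subnet(
    network_id=net.id,
    name="project-private-subnet",
    ip_version=4,
    cidr="10.0.0.0/24",
)
```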

Private networks
- Virtual machines connect to a private network; VXLAN is used to tunnel these networks across hypervisors.
- Ingress and egress are via a virtual router with NAT (sketch below).
- Distributed Virtual Routing (DVR) is used to minimise this bottleneck: every hypervisor runs a limited version of the virtual router agent.
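A sketch of the ingress/egress path described above, again with openstacksdk and placeholder names; the distributed flag is an admin-only router attribute and is mentioned only to indicate where DVR is switched on:

```python
import openstack

conn = openstack.connect(cloud="stfc-cloud")  # hypothetical clouds.yaml entry

# External (provider) network name is a placeholder.
external = conn.network.find_network("external")

# Project router: SNAT gives the private network outbound access, floating
# IPs handle ingress. On a DVR deployment the operator would also set
# distributed=True (admin-only) so each hypervisor runs part of the router.
router = conn.network.create_router(
    name="project-router",
    external_gateway_info={"network_id": external.id},
)

# Attach the private subnet created earlier so its VMs can reach the router.
subnet = conn.network.find_subnet("project-private-subnet")
conn.network.add_interface_to_router(router, subnet_id=subnet.id)
```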

VXLAN
- VXLAN by default showed significant overheads: performance was ~10% of line rate.
- Tuning (memory pages, CPU allocation, mainline kernel) brings performance to ~40% of line rate.
- Hardware offload: VXLAN offload to the NIC gives ~80% of line rate.
- High performance: a routed network + EVPN achieves 99+% of line rate (a measurement sketch follows).
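The slide does not say which tool produced these figures; as an assumption, the sketch below shows one way such numbers could be reproduced, driving iperf3 between two VMs on different hypervisors and reporting throughput as a fraction of the 25 Gb/s NIC line rate (the target hostname is a placeholder):

```python
import json
import subprocess

LINE_RATE_GBPS = 25.0                 # 25G NICs on the new hypervisors
TARGET = "vm-on-other-hypervisor"     # placeholder: iperf3 server on the VXLAN network

# Run a 30-second iperf3 test and parse its JSON report.
result = subprocess.run(
    ["iperf3", "-c", TARGET, "-t", "30", "--json"],
    capture_output=True, text=True, check=True,
)
report = json.loads(result.stdout)

gbps = report["end"]["sum_received"]["bits_per_second"] / 1e9
print(f"{gbps:.1f} Gb/s ({gbps / LINE_RATE_GBPS:.0%} of line rate)")
```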

Cloud Network

Flexible
- Availability zones across the site: the first will be in ISIS soon; in discussion with other departments (sketch below).
- GPU support.
- AAI.
- APIs: Nova, EC2, OCCI.
- Design decisions should, so far as possible, preclude nothing.
- Pet VMs as well as cattle.
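To show what availability zones buy users once departments host them, a hedged openstacksdk sketch; the availability zone, image and flavor names are hypothetical:

```python
import openstack

conn = openstack.connect(cloud="stfc-cloud")  # hypothetical clouds.yaml entry

# Pin a workload to a department-hosted availability zone at boot time
# (all names below are placeholders).
server = conn.create_server(
    name="isis-analysis-vm",
    image="centos-7",
    flavor="m1.large",
    availability_zone="isis-az",
    wait=True,
)
print(server.id, server.status)
```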

Current user communities
- STFC Scientific Computing Department (internal): self-service VMs, some programmatic use.
- Tier-1: bursting the batch farm.
- ISIS Neutron Source: Data-Analysis-as-a-Service.
- Jenkins build service for other groups in SCD.
- Central Laser Facility: OCTOPUS project.
- Diamond Light Source - XChem: cloud bursting Diamond GridEngine; XChem data processing using OpenShift.
- WLCG Data Lake project.
- Quattor nightlies.
- A range of H2020 & similar projects.

AAI EGI CheckIn - Horizon

AAI EGI CheckIn Horizon 2

AAI EGI CheckIn

AAI Google Login

What do we do?
- Programmes include: Particle Physics, Astronomy, Nuclear Physics, the Large Hadron Collider.
- National laboratories include: Neutron and Muon Source, Synchrotron Radiation Source, High Power Lasers, Scientific Computing/Hartree Centre, Space Science, Electronics, Cryogenics; Daresbury Laboratory.
- International facilities include: CERN, ESO, SKA (Square Kilometre Array), ESS, ESRF & ILL (Grenoble).

STFC computing
- GridPP: HTC for the LHC.
- Hartree Centre: industry.
- DiRAC: HPC for theory & cosmology.
- Scientific Computing Department: HTC and data storage for astronomy, nuclear and astroparticle physics.

Wider UK landscape
- An effort has been running for several years now under the UKT0 banner, bringing together different communities: HTC and HPC for STFC science.
- Linking diverse resources into coherent and complementary support for existing & emerging communities.
- UKT0 (now IRIS) is an umbrella project which represents users, contracts with resource providers, and eventually should provide effort for cross-cutting activities.

IRIS: e-infrastructure for STFC science
- An initiative to bring STFC computing interests together, formed bottom-up by the science communities and compute providers.
- An association of peer interests (PPAN + National Facilities + friends):
  - Particle physics: LHC + other PP experiments (GridPP)
  - DiRAC (UK HPC for theoretical physics)
  - National facilities: Diamond Light Source, ISIS
  - Astro: LOFAR, LSST, EUCLID, SKA,...
  - Astro-particle: LZ, Advanced LIGO
  - STFC Scientific Computing Department (SCD)
  - CCFE (Culham fusion)
- Not a project seeking users - users wanting and ready to work together.

Evolution: UKT0 -> IRIS
- 2015-2017: loose association; community meeting in autumn 2015.
- 2017/2018: first funding, £1.5M to be spent by 31 March 2018, with an interim PMB to manage short-term completion. The grant delivered resources into machine rooms at Cambridge, Edinburgh, Manchester and RAL. First (proto) collaboration meeting March 2018. The project was successful and on time (spent the money / wrote the software).
- 2018: BEIS approved "keep the lights on" funding of £4M p.a. for 4 years for UKT0/ALC. Now finalising organisational structures and delivery planning.

UKT0 idea: science domains remain sovereign where appropriate; share where it makes sense to do so.
- Per-domain activities (e.g. LHC, SKA, LZ, EUCLID,...; Ada Lovelace Centre for facilities users): VO management, reconstruction, data management, analysis.
- Shared infrastructure: federated HTC clusters, federated HPC clusters, federated data storage, tape archive, public & commercial cloud access.
- Shared services: data management, job management, AAI, VO tools,...; monitoring, accounting, incident reporting,...

Long-term aspiration (STFC UKT0 diagram)
- Scientific data sources: ISIS, Diamond, Central Laser Facility, LHC & HEP, observatories; users access them via portals and an AAAI access ring.
- High Throughput Computing (HTC): processing/analysis of experiment/observation and large non-parallel simulations (GridPP, Ada Lovelace Centre, STFC-SCD).
- High Performance Computing (HPC): highly parallel simulations and data-intensive workflows (DiRAC, Hartree Centre, STFC-SCD), in partnership with the public sector, industry & HEIs (ARCHER).
- UKT0 data activities: data analysis & discovery, data modelling & fitting (Hartree Centre, CCFE).
- All data activities contain elements of: collection/generation, simulation, analysis/discovery, modelling/fitting.
- Processing/analysis of experiment/observation and large non-parallel simulations: bespoke systems that are directly connected to experiments, observatories and National Facilities.
- Highly parallel simulations and data-intensive workflows: systems which generate data from a set of starting assumptions/equations.
- Data analysis & discovery: systems which analyse data to discover relationships, structures and meaning within, and between, datasets.
- Data modelling & fitting: systems which use a combination of simulations, experimental data and statistical techniques to test theories and estimate parameters from data.

Initial Architecture

Investment over the next 4 years
- The bulk is from the Department for Business, Energy & Industrial Strategy (BEIS).
- £10.5M for e-infrastructure compute and storage assets.
- £5.5M for e-infrastructure digital assets.
- Minimal resource funding: £250K from BEIS in years 3+4; £250K resource from STFC in years 1+2.

Questions? ian.collier@stfc.ac.uk alexander.dibbo@stfc.ac.uk