Scheduling Computational and Storage Resources on the NRP

Rob Gardner (University of Chicago), Dima Mishin (UCSD)
Second NRP Workshop, Montana State University, August 6-7, 2018
slides: http://bit.ly/nrp-scheduling

Kubernetes scheduling resources
- CPU, RAM: Request, Limit
- QoS: Guaranteed, Burstable, Best Effort
- Preemption
- Eviction
- Pod priority
- Resource starvation
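
The relation between requests, limits and the three QoS classes can be sketched in a few lines. This is a minimal illustration of the rule as described in the Kubernetes documentation, not the kubelet's actual code: the `Container` shape and the `qos_class` name are assumptions made for the example, and quantities are compared as plain strings (so "500m" and "0.5" would not be treated as equal).

```python
from typing import Dict, List

# Hypothetical container description: two optional maps, mirroring the
# "requests" and "limits" fields of a Kubernetes container spec.
Container = Dict[str, Dict[str, str]]

# Only cpu and memory count toward the QoS class.
QOS_RESOURCES = ("cpu", "memory")


def qos_class(containers: List[Container]) -> str:
    """Return 'Guaranteed', 'Burstable' or 'BestEffort' for a pod."""
    any_set = False        # does any container set a cpu/memory request or limit?
    all_guaranteed = True  # does every container have request == limit for both?
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in QOS_RESOURCES:
            limit = limits.get(resource)
            # A missing request defaults to the limit, if a limit is set.
            request = requests.get(resource, limit)
            if request is not None or limit is not None:
                any_set = True
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


print(qos_class([{"requests": {"cpu": "1"}, "limits": {"cpu": "2", "memory": "1Gi"}}]))
```

Under node pressure, Best Effort pods are generally the first candidates for eviction and Guaranteed pods the last, which is why the class matters for the starvation and eviction items listed on this slide.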

Kubernetes scheduling resources
- GPU: Request
- Extended resources: Request, Limit
- Schedulers: default; can write your own; can run multiple
- Example: nrp.io/tpu: 16
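
As a rough sketch of what a pod asking for the extended resource on this slide could look like, the snippet below builds a Pod manifest as a Python dictionary. The resource name `nrp.io/tpu` and the count 16 come from the slide; the pod name, image and scheduler name are placeholders invented for the example. Extended resources are requested in whole units, and when both a request and a limit are given they must be equal, so the helper sets both to the same value. The optional `schedulerName` field is how a pod selects a non-default scheduler.

```python
import json
from typing import Optional


def pod_manifest(name: str, image: str, resource: str, count: int,
                 scheduler: Optional[str] = None) -> dict:
    """Build a Pod manifest requesting `count` units of an extended resource."""
    if not isinstance(count, int) or count < 1:
        raise ValueError("extended resources are requested in whole units")
    quantities = {resource: count}
    spec = {
        "containers": [{
            "name": name,
            "image": image,
            # Request and limit must match for extended resources.
            "resources": {"requests": dict(quantities), "limits": dict(quantities)},
        }],
    }
    if scheduler is not None:
        # Hand the pod to a custom scheduler instead of the default one.
        spec["schedulerName"] = scheduler
    return {"apiVersion": "v1", "kind": "Pod", "metadata": {"name": name}, "spec": spec}


# Placeholder names; only "nrp.io/tpu" and 16 are taken from the slide.
manifest = pod_manifest("tpu-job", "example.org/train:latest", "nrp.io/tpu", 16,
                        scheduler="my-custom-scheduler")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be passed to `kubectl apply -f -`, since Kubernetes accepts JSON as well as YAML manifests.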

Federation of clusters
- Managing policies across clusters
- Common labels for nodes, namespaces
- Respect local policies when scheduling resources
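
One way to picture how common labels and local policies interact is a placement filter: a workload names the labels it needs, and each cluster also decides for itself which namespaces it will host. The sketch below is purely illustrative. The `Cluster` data model, the `allowed_namespaces` policy and the label keys are assumptions made for the example, not part of Kubernetes federation or of any NRP tooling.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Cluster:
    name: str
    # Labels agreed on across the federation (hypothetical keys).
    labels: Dict[str, str]
    # Local policy: the namespaces this site is willing to host.
    allowed_namespaces: Set[str] = field(default_factory=set)


def eligible_clusters(clusters: List[Cluster], namespace: str,
                      selector: Dict[str, str]) -> List[str]:
    """Names of clusters that match the label selector and admit the namespace."""
    return [
        c.name
        for c in clusters
        if namespace in c.allowed_namespaces
        and all(c.labels.get(key) == value for key, value in selector.items())
    ]


# Made-up federation members for illustration.
members = [
    Cluster("site-a", {"gpu": "true", "region": "west"}, {"ml", "atlas"}),
    Cluster("site-b", {"gpu": "true", "region": "midwest"}, {"atlas"}),
    Cluster("site-c", {"gpu": "false", "region": "west"}, {"ml"}),
]
print(eligible_clusters(members, "ml", {"gpu": "true"}))
```

Keeping the local policy as data owned by each cluster, rather than by the federation, is one simple way to respect site autonomy while still scheduling across sites.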

Federation of clusters
- Name conflicts
- Requires custom dataplane running in federation cluster to enforce common policies

Federation of clusters
- Mounting storage from other clusters
- Data locality?

Federation V1
- Sync resources across clusters
- Deploy to multiple clusters
- Cross-cluster discovery
- Allow workloads to communicate
- Common DNS for all clusters
- https://github.com/kubernetes/federation

Federation V2 (documents, almost no code)
- https://github.com/kubernetes-sigs/federation-v2

Approaches to scaling federations (slateci.io)

Federation Topology - how to scale?

Interoperability for Platforms
- Create science platforms across institutions and facilities
- Deal with access, privilege & security in shared environment
- Adopt a virtual organization trust model
- Site autonomy and policies respected
- Cluster groupings, WAN-stretched clusters or regional federations

SLATE Concepts & Components (http://bit.ly/slate-arch)
- Containerized services in managed clusters
- Widely used open source technologies for growth and sustainability
- SLATE additions
- Curated services
- Create a loose federation of clusters & platforms

Globus Auth signup/login developers (& admins) cluster admins

Platform Client Access for App Developers

Platform Client Access for App Developers

Platform Client Access for App Developers
- We'll need both

Policy and Trust
- SLATE applications will be curated into a trusted application catalog
- Applications must define and request all the network, disk, device, and other access they need (think of application permissions on your phone)
- Site policies must be respected
- Access, privileges, and capabilities are controlled and transparent
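
The phone-permissions analogy can be made concrete with a small check: an application declares everything it needs up front, and a site compares that declaration against its own policy before anything is deployed. The sketch below is an illustration only. The permission strings, the category names and both function names are hypothetical and do not reflect the actual SLATE catalog format.

```python
from typing import Dict, Set

# Hypothetical shapes: category name -> set of permission strings.
AppRequest = Dict[str, Set[str]]
SitePolicy = Dict[str, Set[str]]


def denied_permissions(request: AppRequest, policy: SitePolicy) -> Dict[str, Set[str]]:
    """Return the requested permissions that the site policy does not grant."""
    denied: Dict[str, Set[str]] = {}
    for category, wanted in request.items():
        not_allowed = wanted - policy.get(category, set())
        if not_allowed:
            denied[category] = not_allowed
    return denied


def may_deploy(request: AppRequest, policy: SitePolicy) -> bool:
    """An application is deployable only if nothing it asks for is denied."""
    return not denied_permissions(request, policy)


# Made-up example: a caching proxy asking for one port and one host path.
proxy_request = {"network": {"ingress:3128"}, "disk": {"hostPath:/var/cache"}}
site_policy = {"network": {"ingress:3128", "ingress:443"}, "disk": set()}
print(denied_permissions(proxy_request, site_policy))
```

Returning the denied set, instead of a bare yes or no, keeps the decision transparent to both the application developer and the site administrator.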

Deploying an "Application" -like

scheduling pods for science

containerized by Machine Learning for HL-LHC on PRP
- scripted kubectl to PRP

containerized by Teaching Platform on PRP/CHASE-CI
- Users request CPU and GPU
- Service kubectl to PRP
- CHASE-CI GPUs

scheduling persistent services

containerized by Globus Connect Service
- Goal: Automate deployment of Globus endpoints on DTN nodes in SciDMZ
- Containerize Globus Connect Service
- GCSv5 - service deployed, testing
- Helm charts for easy installation
- Investigating how to ease integration to campus storage

containerized by OSG StashCache
- StashCache: XRootD-based data federation used by OSG to support data delivery for LIGO, CMS and individual researchers to computing sites (c.f. next talk by F. Wuerthwein: deployments on PRP and Internet2)
- Bring experiment data from large "data lakes" to fast local caches on the edge, adjacent to the compute
- Provide O(10-100 TB) storage, shared among virtual organizations and connected to HPC LAN
- Progress is good: containerized, and working on creating a Helm chart
- Pattern for other LHC XRootD-based federations (XCache, AAA)

scheduling batch science workflows
- schedule needed service pods first
- overlay an application scheduler

Scheduling Science Applications (w/ familiar tools)

Scheduling Science Applications (w/ familiar tools)
- Minutes after pod deployment on FIONA8, MIT student doing quantum simulations for materials

Scheduling Science Applications (w/ familiar tools)
- OSG to SLATE

Scheduling Science Applications (w/ familiar tools)
- Similar efforts at CERN - Keynote at KubeCon EU 2018 (video) slides

Scheduling Storage
- OSiRIS (CC*DNI DIBBS 2015, NSF award #1541335) is building a distributed Ceph-based, multi-institutional storage infrastructure that lets researchers write, manage, and share data from their own computing facility locations (www.osris.org)
- This storage is VO-allocatable and potentially dynamically schedulable
- Collaboration between SLATE and OSiRIS focuses on software-defined networking to orchestrate our infrastructures
- Next up: exploring OSiRIS block storage hosting SLATE containers, usable across one or more SLATE platform deployments
- Future: OSiRIS storage dynamically allocated for SLATE VOs

Summary & Conclusions
- Multiple projects are tackling scheduling and resource sharing on our emerging national research platform
- Much progress in deployments & federated edge cluster models
- Challenges ahead in applying policy-driven scheduling: cluster-level, federation-level, multi-federation
- Meanwhile we can keep the NRP filled with science apps
- Groups are forging best practices and design patterns for application DevOps, and other tools in a k8s-enabled ecosystem
- All good news for multi-institution science collaborations and the CI engineers providing the infrastructure!

Thank you! & Acknowledgements
- PRP: NSF OAC-1541349 "CC*DNI DIBBs: The Pacific Research Platform"
- CHASE-CI: NSF CNS-1730158 "CI-New: Cognitive Hardware and Software Ecosystem Community Infrastructure"
- SLATE: NSF OAC-1724821 "CIF21 DIBBs: EI: SLATE and the Mobility of Capability"
- OSiRIS: CC*DNI DIBBS 2015, NSF award #1541335
- OSG: NSF PHY-1148698 "The Open Science Grid, The Next Five Years: Distributed High Throughput Computing for the Nation's Scientists, Researchers, Educators, and Students"
- VC3: Department of Energy ASCR/NGNS DDRM project "Virtual Clusters for Community Computation"
- US ATLAS Operations: NSF PHY-1624739 "U.S. ATLAS Operations: Discovery and Measurement at the Energy Frontier"
- slides: http://bit.ly/nrp-scheduling

More "App" Examples

containerized by perfSONAR
- Make containers as lightweight as possible for easy deployment across clusters
- Integration with a SLATE Platform Provider
- Central perfSONAR node
- Automated as much as possible while maintaining full flexibility of the testpoint bundle
- Allows reduction of requirements per testpoint and allows for easy centralized access

containerized by Squid proxies
- Created containers running the OSG packaging of "Frontier-Squid"
- Important for software caching with CERN CVMFS
- Containers are designed to expose the full Squid configuration so that they can be used in an arbitrary environment
- Packaged the containers into a SLATE application
- The application exposes a minimal set of parameters that coordinates the configuration of Squid, Kubernetes and SLATE
- For more information see: http://slateci.io/docs/applications/development/sample_applications/frontiersquid/

slate extra

Scheduling Storage
- OSiRIS (CC*DNI DIBBS 2015, NSF award #1541335) is building a distributed Ceph-based, multi-institutional storage infrastructure that lets researchers write, manage, and share data from their own computing facility locations (www.osris.org)
- This storage is VO-allocatable and potentially dynamically schedulable
- A challenge: multi-site Ceph clusters take a performance hit with latency
- OSiRIS user provisioning chain can be used to assign users and VO storage pools to separate clusters, with data replication out-of-band
- Ceph RGW (S3) has a concept of geographic zones and built-in replication between them, though this is not yet used in OSiRIS (we are one large shared cluster with RGW endpoints at each site)

OSiRIS-SLATE Collaboration
- Collaboration between SLATE and OSiRIS focuses on software-defined networking to orchestrate our infrastructures
- Next up: exploring OSiRIS block storage hosting SLATE containers, usable across one or more SLATE platform deployments
- Current Kubernetes Ceph support is in a state of flux, with migration to Container Storage Interface (CSI) based provisioning
- CSI Ceph plugin requires admin-level credentials to enable dynamic provisioning
- Pre-provisioned volumes can be attached without admin credentials

SLATE Vision: Building Block for Multi-Institution Research Platforms

Services as a Service!
- SLATE is designed from the ground up to be programmable and scalable
- Applications specified and deployed to sites in a declarative way
  $ slate-client app --vo mylab --cluster dmzk8s install myapp
- Applications are monitored through their entire lifecycle, re-deployed if failure occurs
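
Because the install is a single declarative command, it is easy to script. The sketch below only assembles the command line exactly as it appears on this slide. It does not run it, and it assumes nothing about `slate-client` beyond the flags shown here (the client's syntax may have changed since 2018). The function name is made up for the example.

```python
import shlex
from typing import List


def slate_install_command(vo: str, cluster: str, app: str) -> List[str]:
    """Assemble the install command in the form shown on the slide."""
    return ["slate-client", "app", "--vo", vo, "--cluster", cluster, "install", app]


# Values taken from the slide's example.
argv = slate_install_command("mylab", "dmzk8s", "myapp")
print(shlex.join(argv))
```

Building the command as a list rather than a string means the arguments can be passed to `subprocess.run(argv, check=True)` without shell quoting problems, for instance when looping over several clusters.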

A New Stack for Research Platforms
- Rethinking the stack from the ground up
- Minimal OS running a container orchestration engine
- Advanced networking integrated
- Everything is ephemeral
- System state lives "in the cloud"
- Site administrators only have to worry about hardware

Monitoring: Apps & Infrastructure
- SLATE catalog apps have an optional flag for granular logs, shipped to the app's own index in the central Elasticsearch cluster
- All SLATE sites forward metrics automatically to the SLATE Elasticsearch cluster
- All SLATE members have access to the Kibana dashboard

SLATE Core Services

Analytics: leverage CC* SAND-NMA

SLATE Deployment options

SLATE v1.0
- Currently developing a "minimum viable product" for SLATE
- Virtual and hardware-based environments
- Tooling to register and "SLATEify" k8s clusters
- Launch perfSONAR, Globus, and data caching services

SLATE Virtual Environment
- Single-node, CoreOS-based Kubernetes virtual machine
- "Zero to SLATE"
- Downloadable bundle
- KVM or Libvirt
- Register with SLATE, instantiate services, VOs, etc.
- Currently being used at UChicago on FIONA hardware

SLATE VM
- And reachable by web browser
- SLATE cluster & public IPs
- Service available on public IP

SLATE Hardware
- Four components for v1.0
- High performance switch (40/100 Gbps)
- Dedicated perfSONAR and management nodes
- Hypervisor-like "Edge Service" node for K8S
- Publicly posted hardware configurations for interested partners
- Self-service portal for registering and configuring devices for your SLATE cluster

SLATE on existing K8S
- Run SLATE services on an existing cluster
- SLATE merely a user
- Only uses limited "admin" privileges for initial installation
- You control policy
- Allocate resources, make scheduling decisions as you see fit
- Easy to install, easy to remove