Scheduling Computational and Storage Resources on the NRP
Rob Gardner (University of Chicago), Dima Mishin (UCSD)
Second NRP Workshop, Montana State University, August 6-7, 2018
slides: http://bit.ly/nrp-scheduling
Kubernetes scheduling resources: CPU, RAM
- QoS classes driven by Request and Limit: Guaranteed, Burstable, Best Effort
- Preemption, eviction, pod priority
- Resource starvation
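A minimal sketch (not from the slides; pod names and sizes are illustrative) of how requests and limits map onto the QoS classes above: containers with requests equal to limits make the pod Guaranteed, requests below limits make it Burstable, and omitting both yields BestEffort, the first class evicted under resource starvation.

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed          # hypothetical name
spec:
  containers:
  - name: worker
    image: busybox
    command: ["sleep", "3600"]
    resources:
      requests:                 # requests == limits -> Guaranteed QoS class
        cpu: "1"
        memory: 1Gi
      limits:
        cpu: "1"
        memory: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: qos-burstable           # hypothetical name
spec:
  containers:
  - name: worker
    image: busybox
    command: ["sleep", "3600"]
    resources:
      requests:                 # requests < limits -> Burstable QoS class
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 1Gi
EOF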
Kubernetes scheduling resources: GPU
- GPU: Request
- Extended resources: Request, Limit (e.g. nrp.io/tpu: 16)
- Schedulers: default; can write your own; can run multiple
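A hedged example of requesting accelerators as extended resources. The nrp.io/tpu name comes from the slide; the image, counts, and pod name are placeholders, and a node must advertise the resource (e.g. via a device plugin) before the default scheduler will place the pod.

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: accel-demo               # hypothetical name
spec:
  containers:
  - name: trainer
    image: nvidia/cuda:9.0-base  # placeholder GPU-capable image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1        # GPUs/extended resources are integer counts set under limits
        nrp.io/tpu: 16           # custom extended resource from the slide; must be advertised by a node
EOF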
Federation of clusters
- Managing policies across clusters
- Common labels for nodes, namespaces
- Respect local policies when scheduling resources
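One way the "common labels" idea could look in practice (the label keys and node name are hypothetical, not an agreed NRP convention): each member cluster labels its nodes the same way, so the same pod spec schedules correctly anywhere in the federation.

# Hypothetical common labels applied consistently on every member cluster
$ kubectl label node fiona01.example.edu nrp.io/region=west nrp.io/gpu-model=1080ti

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: label-demo               # hypothetical name
spec:
  nodeSelector:                  # schedules only onto nodes carrying the shared labels
    nrp.io/region: west
  containers:
  - name: worker
    image: busybox
    command: ["sleep", "3600"]
EOF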
Federation of clusters
- Name conflicts
- Requires a custom dataplane running in the federation cluster to enforce common policies
Federation of clusters
- Mounting storage from other clusters
- Data locality?
Federation V1
- Sync resources across clusters
- Deploy to multiple clusters
- Cross-cluster discovery: allow workloads to communicate
- Common DNS for all clusters
- https://github.com/kubernetes/federation
Federation V2 (documents, almost no code)
- https://github.com/kubernetes-sigs/federation-v2
Approaches to scaling federations (slateci.io)
Federation Topology - how to scale?
Interoperability for Platforms
- Create science platforms across institutions and facilities
- Deal with access, privilege & security in a shared environment
- Adopt a virtual organization trust model
- Site autonomy and policies respected
- Cluster groupings, WAN-stretched clusters, or regional federations
SLATE Concepts & Components (http://bit.ly/slate-arch)
- Containerized services in managed clusters
- Widely used open source technologies for growth and sustainability
- SLATE additions: curated services
- Creates a loose federation of clusters & platforms
Globus Auth signup/login
- developers (& admins)
- cluster admins
Platform Client Access for App Developers
- We'll need both
Policy and Trust
- SLATE applications will be curated into a trusted application catalog
- Applications must define and request all needed network, disk, device, etc. access - think application permissions on your phone
- Site policies must be respected
- Access, privileges, and capabilities are controlled and transparent
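Purely as an illustration of the "application permissions" analogy - this is not an actual SLATE manifest format - a curated application could carry a declaration like the following, which a site reviews before admitting the app to its catalog:

# Hypothetical permissions declaration bundled with an application
$ cat > myapp-permissions.yaml <<'EOF'
permissions:
  network:
    ingress: ["tcp/8443"]          # ports the app expects to expose
    egress: ["campus squid proxy"] # external services it needs to reach
  storage:
    - hostPath: /cvmfs             # host mount it requests
      readOnly: true
  devices: []                      # no GPUs or other devices requested
EOF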
Deploying an "Application" -like
scheduling pods for science
Machine Learning for HL-LHC on PRP
- Scripted kubectl to PRP
Teaching Platform on PRP/CHASE-CI
- Users request CPU and GPU
- Service: kubectl to PRP
- CHASE-CI GPUs
scheduling persistent services
Globus Connect Service
- Goal: automate deployment of Globus endpoints on DTN nodes in the SciDMZ
- Containerize Globus Connect Service (GCSv5) - service deployed, testing
- Helm charts for easy installation
- Investigating how to ease integration with campus storage
OSG StashCache
- StashCache: XRootD-based data federation used by OSG to support data delivery to computing sites for LIGO, CMS, and individual researchers
- cf. next talk by F. Wuerthwein: deployments on PRP and Internet2
- Bring experiment data from large "data lakes" to fast local caches on the edge, adjacent to the compute
- Provide O(10-100 TB) storage, shared among virtual organizations and connected to the HPC LAN
- Progress is good: containerized, and working on creating a Helm chart
- Pattern for other LHC XRootD-based federations (XCache, AAA)
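Once the Helm chart mentioned above exists, installation could look roughly like this (Helm 2 syntax, current at the time; the chart path, release name, and value names are hypothetical):

# Hypothetical install of a StashCache cache onto a cluster's storage nodes
$ helm install --name stashcache --namespace osg ./stashcache-chart \
    --set cache.sizeTB=100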
scheduling batch science workflows
- schedule needed service pods first
- overlay an application scheduler
Scheduling Science Applications (w/ familiar tools)
- Minutes after pod deployment on FIONA8, an MIT student running quantum simulations for materials
Scheduling Science Applications (w/ familiar tools)
- OSG to SLATE
Scheduling Science Applications (w/ familiar tools)
- Similar efforts at CERN - keynote at KubeCon EU 2018 (video, slides)
Scheduling Storage
- OSiRIS (CC*DNI DIBBS 2015, NSF award #1541335) is building a distributed, Ceph-based, multi-institutional storage infrastructure that lets researchers write, manage, and share data from their own computing facility locations (www.osris.org)
- This storage is VO-allocatable and potentially dynamically schedulable
- Collaboration between SLATE and OSiRIS focuses on software-defined networking to orchestrate our infrastructures
- Next up: exploring OSiRIS block storage hosting SLATE containers usable across one or more SLATE platform deployments
- Future: OSiRIS storage dynamically allocated for SLATE VOs
Summary & Conclusions
- Multiple projects are tackling scheduling and resource sharing on our emerging national research platform
- Much progress in deployments & federated edge cluster models
- Challenges ahead in applying policy-driven scheduling: cluster-level, federation-level, multi-federation
- Meanwhile we can keep the NRP filled with science apps
- Groups are forging best practices and design patterns for application DevOps and other tools in a k8s-enabled ecosystem
- All good news for multi-institution science collaborations and the CI engineers providing the infrastructure!
Thank you! & Acknowledgements
- PRP: NSF OAC-1541349 "CC*DNI DIBBs: The Pacific Research Platform"
- CHASE-CI: NSF CNS-1730158 "CI-New: Cognitive Hardware and Software Ecosystem Community Infrastructure"
- SLATE: NSF OAC-1724821 "CIF21 DIBBs: EI: SLATE and the Mobility of Capability"
- OSiRIS: CC*DNI DIBBS 2015, NSF award #1541335
- OSG: NSF PHY-1148698 "The Open Science Grid, The Next Five Years: Distributed High Throughput Computing for the Nation's Scientists, Researchers, Educators, and Students"
- VC3: Department of Energy ASCR/NGNS DDRM project "Virtual Clusters for Community Computation"
- US ATLAS Operations: NSF PHY-1624739 "U.S. ATLAS Operations: Discovery and Measurement at the Energy Frontier"
slides: http://bit.ly/nrp-scheduling
More "App" Examples
perfSONAR
- Make containers as lightweight as possible for easy deployment across clusters
- Integration with a SLATE platform provider: central perfSONAR node
- Automate as much as possible while maintaining the full flexibility of the testpoint bundle
- Reduces per-testpoint requirements and allows easy centralized access
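For context, the upstream perfSONAR testpoint container (the perfsonar/testpoint image on Docker Hub) can be run standalone; the flags here are a rough sketch rather than the SLATE packaging, which wraps this in a chart:

# Rough standalone sketch; host networking so the measurement tools see the real interface
$ docker run -d --name perfsonar-testpoint --net=host perfsonar/testpoint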
Squid proxies
- Created containers running the OSG packaging of "Frontier-Squid"
- Important for software caching with CERN CVMFS
- Containers are designed to expose the full Squid configuration so that they can be used in an arbitrary environment
- Packaged the containers into a SLATE application
- The application exposes a minimal set of parameters that coordinates the configuration of Squid, Kubernetes, and SLATE
- For more information see: http://slateci.io/docs/applications/development/sample_applications/frontiersquid/
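Installing the packaged application would then reuse the slate-client pattern shown later in the deck (the VO and cluster names below are the same placeholders used there; "frontier-squid" stands in for this app's catalog name):

$ slate-client app --vo mylab --cluster dmzk8s install frontier-squid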
slate extra
Scheduling Storage
- OSiRIS (CC*DNI DIBBS 2015, NSF award #1541335) is building a distributed, Ceph-based, multi-institutional storage infrastructure that lets researchers write, manage, and share data from their own computing facility locations (www.osris.org)
- This storage is VO-allocatable and potentially dynamically schedulable
- A challenge: multi-site Ceph clusters take a performance hit from latency
- The OSiRIS user-provisioning chain can be used to assign users and VO storage pools to separate clusters, with data replication handled out-of-band
- Ceph RGW (S3) has a concept of geographic zones and built-in replication between them, though this is not yet used in OSiRIS (we are one large shared cluster with RGW endpoints at each site)
OSiRIS-SLATE Collaboration
- Collaboration between SLATE and OSiRIS focuses on software-defined networking to orchestrate our infrastructures
- Next up: exploring OSiRIS block storage hosting SLATE containers usable across one or more SLATE platform deployments
- Current Kubernetes Ceph support is in a state of flux with the migration to Container Storage Interface (CSI) based provisioning
- The CSI Ceph plugin requires admin-level credentials to enable dynamic provisioning
- Pre-provisioned volumes can be attached without admin credentials
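A rough sketch of the pre-provisioned path noted above, using the in-tree RBD volume plugin that predates CSI; the monitor address, pool, image, user, and secret names are placeholders an OSiRIS admin would supply, and no admin credential is embedded:

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: osiris-rbd-pv              # hypothetical name
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  rbd:                             # in-tree Ceph RBD plugin (pre-CSI)
    monitors:
      - mon1.osris.org:6789        # placeholder monitor address
    pool: slate-pool               # placeholder pool
    image: slate-volume-01         # RBD image pre-provisioned by the OSiRIS admins
    user: slate                    # unprivileged Ceph user, not client.admin
    secretRef:
      name: osiris-ceph-secret     # Ceph keyring stored as a Kubernetes Secret
    fsType: ext4
EOF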
SLATE Vision: Building Block for Multi-Institution Research Platforms
Services as a Service!
- SLATE is designed from the ground up to be programmable and scalable
- Applications are specified and deployed to sites in a declarative way:
  $ slate-client app --vo mylab --cluster dmzk8s install myapp
- Applications are monitored through their entire lifecycle and re-deployed if a failure occurs
A New Stack for Research Platforms
- Rethinking the stack from the ground up
- Minimal OS running a container orchestration engine
- Advanced networking integrated
- Everything is ephemeral; system state lives "in the cloud"
- Site administrators only have to worry about hardware
Monitoring: Apps & Infrastructure
- SLATE catalog apps have an optional flag to ship granular logs to their own index in the central Elasticsearch cluster
- All SLATE sites forward metrics automatically to the SLATE Elasticsearch cluster
- All SLATE members have access to the Kibana dashboard
SLATE Core Services
Analytics: leverage CC* SAND-NMA
SLATE Deployment options
SLATE v1.0
- Currently developing a "minimum viable product" for SLATE
- Virtual and hardware-based environments
- Tooling to register and "SLATEify" k8s clusters
- Launch perfSONAR, Globus, and data caching services
SLATE Virtual Environment
- Single-node, CoreOS-based Kubernetes virtual machine: "Zero to SLATE"
- Downloadable bundle for KVM or libvirt
- Register with SLATE, instantiate services, VOs, etc.
- Currently being used at UChicago on FIONA hardware
SLATE VM
- SLATE cluster & public IPs
- Service available on public IP and reachable by web browser
SLATE Hardware
- Four components for v1.0:
  - High-performance switch (40/100 Gbps)
  - Dedicated perfSONAR and management nodes
  - Hypervisor-like "Edge Service" node for K8s
- Publicly posted hardware configurations for interested partners
- Self-service portal for registering and configuring devices for your SLATE cluster
SLATE on existing K8s
- Run SLATE services on an existing cluster; SLATE is merely a user
- Only uses limited "admin" privileges for initial installation
- You control policy: allocate resources and make scheduling decisions as you see fit
- Easy to install, easy to remove
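As one hedged example of the "you control policy" point, a site could cap everything SLATE is allowed to consume with a quota on the namespace it runs in (the namespace name and numbers are illustrative):

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: slate-quota
  namespace: slate-system          # hypothetical namespace holding SLATE services
spec:
  hard:
    requests.cpu: "16"             # aggregate CPU the SLATE pods may request
    requests.memory: 64Gi
    limits.cpu: "32"
    limits.memory: 128Gi
    pods: "50"                     # cap on the number of SLATE-managed pods
EOF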