Allowing Users to Run Services at the OLCF with Kubernetes

Jason Kincl, Senior HPC Systems Engineer
Ryan Adamson, Senior HPC Security Engineer

This work was supported by the Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory (ORNL) for the Department of Energy (DOE) under Prime Contract Number DE-AC05-00OR22725. ORNL is managed by UT-Battelle for the US Department of Energy.

What is the Oak Ridge Leadership Computing Facility?
- The Oak Ridge Leadership Computing Facility is charged with helping researchers solve some of the world's most challenging scientific problems with a combination of world-class high-performance computing (HPC) resources and world-class expertise in scientific computing
- The OLCF is run by the National Center for Computational Sciences (NCCS)

HPC Operations
- Tasked with keeping the OLCF leadership supercomputing systems running
- Compute: Titan, Summit
- Storage: Lustre, GPFS, HPSS
- Infrastructure:
  - Lots of industry-standard services like DNS, DHCP, LDAP
  - Internal web applications and databases
  - Monitoring and logging

Our Users
- Users bring their scientific codes and run them on the supercomputer
- Access to cluster resources goes through a batch scheduler
- Batch jobs have a start and an end based on wallclock time
- Initial use cases for user-run services were around scientific workflows

Make the Case: Users
- We were starting to see project needs for longer-running services in addition to existing batch jobs: databases, data portals, web services, ...
- Security constraints at OLCF for workflows
  - Workflows need to be structured to run locally and reach out to pull jobs
- Want the same guarantees as a batch job
  - Runs as a regular user
  - A job file specifies the work package
  - Access to shared filesystems (Lustre, GPFS, NFS)
  - Access to the batch scheduler (qsub)
  - Continues to run as long as the user has an allocation

Basic Workflow Requirements
- Need ways for users to manage their workflow system
- The diverse ecosystem of workflow systems makes it difficult for NCCS Operations to support every one
  - At least 211 as of today [1]
- Upon surveying existing workflow systems, we came up with the following requirements:
  - Run a persistent service locally as a daemon that stays up
  - Talk to the batch submission system for current queue information and job submission
  - Interact with files on GPFS/Lustre/NFS

[1] https://github.com/common-workflow-language/common-workflow-language/wiki/existing-workflow-systems

Make the Case: Staff
- New service requests can take a long time
  - "I have X application that we wrote, how can I get Ops to run it as a service?"
  - "Wouldn't it be great if we offered X to our users?"
- Lots of steps are involved in standing up a new production operational service
- Even if we only shift the initial burden of standing up and testing a service over to the user for prototyping, that is still a big win

Containers
- Exist only in the kernel
  - Just cgroups and kernel namespaces (process, network, IPC, ...)
  - Unix processes, not lightweight virtual machines
- Root filesystem of a container is an image
  - image = application + dependencies
- Stateless: every time a container starts, it is in the state in which it was created

Multiple Container Strategies in HPC
- Automate deploying, scaling, and operating application containers with Kubernetes
  - Focused on a framework for providing resources (CPU, memory, network, ...) for running services and applications
  - Uses its own scheduler
  - Helps users create and run persistent services
- HPC container runtimes with Singularity/Shifter
  - Focused on using containers in a batch job
  - Uses the scheduler from the batch job submission system
  - Provides a portable environment to our users for HPC resources
  - Easier to run software that needs new libraries on outdated HPC resources

Why not a VM infrastructure?
- Virtual machines are very powerful isolation abstractions, running entirely different operating systems completely isolated from the host
- That isolation requires the user to run all services related to running an operating system and to manage those configurations (access and authentication, monitoring and logging, integration with other systems)
- Containers are simply processes with cgroups and namespaces that run in the same kernel as the host
- Generally our users don't require that level of isolation; they just want to be able to run their application in userspace

Platform
Create a layer between the infrastructure and the application. This layer would manage the infrastructure resources and ensure applications are running as intended. It would provide a fully isolated container for each application to run independently of any other application running on the infrastructure.

Kubernetes
- Kubernetes manages containerized applications across nodes and provides mechanisms for deployment, maintenance, and application scaling
- User self-service for allocating CPU, memory, and data volumes, just like batch scheduling
- It provides a common platform that is flexible enough for running ops and user services

Kubernetes Architecture
- Configuration: YAML or JSON data that describes the application being deployed
- Configuration can define:
  - Containers to run
  - HTTP routes and network ports to expose outside of the cluster
  - Data volumes to mount
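The kind of configuration described on this slide can be sketched as plain data. The example below is a minimal illustration, not the NCCS setup: it builds a Pod manifest as a Python dict and prints it as JSON, one of the two formats the slide mentions. The pod name, image path, port, claim name, and mount path are all hypothetical.

```python
import json


def make_pod_manifest(name, image, port, claim_name, mount_path):
    """Return a minimal Pod manifest as a plain dict (all values are illustrative)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            # Container to run, and the network port it listens on.
            "containers": [{
                "name": name,
                "image": image,
                "ports": [{"containerPort": port}],
                "volumeMounts": [{"name": "data", "mountPath": mount_path}],
            }],
            # Data volume to mount, referenced through a persistent volume claim.
            "volumes": [{
                "name": "data",
                "persistentVolumeClaim": {"claimName": claim_name},
            }],
        },
    }


# Hypothetical values, for illustration only.
manifest = make_pod_manifest(
    name="workflow-db",
    image="registry.example.com/project/workflow-db:1.0",
    port=5432,
    claim_name="workflow-db-data",
    mount_path="/var/lib/data",
)
print(json.dumps(manifest, indent=2))
```

The printed JSON is the sort of document a user would hand to the cluster, for example by piping it to `kubectl create -f -`.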

Kubernetes Pods
- Atomic unit of Kubernetes
- Made up of one or more containers deployed together on one host
- Pod lifecycle is defined: a pod is assigned to run on a node and runs until its container(s) exit or it is removed for some other reason
- Volumes that do not share the pod lifecycle can be attached for persistent data
- Each pod gets its own IP address that is accessible in the cluster

Kubernetes Scheduling
- When a pod object is created, the scheduler is responsible for assigning the pod to a node in the cluster
- Candidate nodes are filtered through a configurable set of predicates to select the right node for the pod, based on pod configuration and node status
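Predicate-based filtering can be illustrated with a toy model. This is a simplified sketch and not the Kubernetes scheduler's code: the `Node` and `Pod` classes, the three predicates, and the "most free CPU" tie-breaker are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu: float      # cores still available on the node
    free_mem_mib: int    # memory still available on the node
    labels: dict = field(default_factory=dict)
    ready: bool = True


@dataclass
class Pod:
    name: str
    cpu: float
    mem_mib: int
    node_selector: dict = field(default_factory=dict)


def node_is_ready(pod, node):
    return node.ready


def fits_resources(pod, node):
    return pod.cpu <= node.free_cpu and pod.mem_mib <= node.free_mem_mib


def matches_selector(pod, node):
    return all(node.labels.get(k) == v for k, v in pod.node_selector.items())


# The predicate list is configurable: add or remove checks as needed.
PREDICATES = [node_is_ready, fits_resources, matches_selector]


def schedule(pod, nodes, predicates=PREDICATES):
    """Return the name of the chosen node, or None if no node passes every predicate."""
    feasible = [n for n in nodes if all(p(pod, n) for p in predicates)]
    if not feasible:
        return None
    # Toy ranking step: prefer the feasible node with the most free CPU.
    return max(feasible, key=lambda n: n.free_cpu).name


nodes = [
    Node("node-a", free_cpu=1.0, free_mem_mib=1024),
    Node("node-b", free_cpu=4.0, free_mem_mib=8192, labels={"fs": "lustre"}),
    Node("node-c", free_cpu=8.0, free_mem_mib=16384, labels={"fs": "lustre"}, ready=False),
]
print(schedule(Pod("portal", cpu=2.0, mem_mib=2048, node_selector={"fs": "lustre"}), nodes))
```

A pod that no node can satisfy stays unscheduled, which `schedule` models by returning `None`.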

Kubernetes Replication Controllers
- A pod will not recreate itself if deleted for some reason, such as cluster maintenance or an exceeded quota limit
- A ReplicationController ensures the desired number of pods is running in the cluster
  - Analogy: a thermostat in a room
  - For example: "I want to have three pods running the nginx:1.10 image"
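The thermostat analogy amounts to a control loop: compare the observed state with the desired state and correct the difference. The class below is a toy in-memory model under that assumption, not the real controller; the pod naming scheme is invented for the example.

```python
import itertools


class ReplicationController:
    """Toy model: keep `replicas` pods of `image` present in `self.pods`."""

    def __init__(self, image, replicas):
        self.image = image
        self.replicas = replicas
        self.pods = {}                    # pod name -> image
        self._ids = itertools.count(1)    # source of unique pod names

    def reconcile(self):
        """Create or delete pods until actual == desired; return the difference found."""
        diff = self.replicas - len(self.pods)
        # Too few pods: start replacements.
        for _ in range(max(diff, 0)):
            self.pods[f"pod-{next(self._ids)}"] = self.image
        # Too many pods: remove the surplus.
        for name in list(self.pods)[:max(-diff, 0)]:
            del self.pods[name]
        return diff


rc = ReplicationController("nginx:1.10", replicas=3)
rc.reconcile()            # starts three pods
del rc.pods["pod-2"]      # simulate a pod lost to node maintenance
rc.reconcile()            # notices two pods are running and starts one more
print(sorted(rc.pods))
```

The user only states the target of three pods; the loop decides when pods have to be started or removed.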

Kubernetes Services
- A Service points to where application pods are running in the cluster
- Services get a static cluster IP and DNS name
- Can be implemented with type=NodePort or type=LoadBalancer for external connectivity
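How a Service "points to" pods can be sketched with label selectors. The manifest below uses standard Service fields, while the `endpoints` helper and the simplified pod records (just labels and an IP) are assumptions made for the illustration; names and addresses are hypothetical.

```python
def make_service(name, selector, port, target_port, service_type="ClusterIP"):
    """Return a minimal Service manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": service_type,   # "NodePort" or "LoadBalancer" for external access
            "selector": selector,   # pods carrying these labels receive the traffic
            "ports": [{"port": port, "targetPort": target_port}],
        },
    }


def endpoints(service, pods):
    """Return the IPs of the pods whose labels match the service selector."""
    selector = service["spec"]["selector"]
    return [
        pod["ip"]
        for pod in pods
        if all(pod["labels"].get(k) == v for k, v in selector.items())
    ]


# Simplified pod records with hypothetical addresses.
pods = [
    {"labels": {"app": "portal"}, "ip": "10.0.0.11"},
    {"labels": {"app": "portal"}, "ip": "10.0.0.12"},
    {"labels": {"app": "database"}, "ip": "10.0.0.20"},
]
service = make_service("portal", {"app": "portal"}, port=80, target_port=8080,
                       service_type="NodePort")
print(endpoints(service, pods))
```

Because clients address the Service rather than individual pods, pods can be replaced without clients noticing a change of address.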

Kubernetes Persistent Volumes
- Store stateful data
- Lifespan of data in a volume is independent of the lifespan of the container
- Can be backed by many different options:
  - NFS
  - Lustre/GPFS
  - Host disks (bind mount)
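As a rough sketch of the NFS-backed option, the two helpers below build a PersistentVolume (the administrator's side) and a PersistentVolumeClaim (the user's side) as Python dicts. The server name, export path, sizes, and object names are hypothetical, and the access mode and reclaim policy are illustrative choices rather than the NCCS configuration.

```python
import json


def nfs_volume(name, server, path, size):
    """PersistentVolume backed by an NFS export (defined by an administrator)."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": size},
            "accessModes": ["ReadWriteMany"],
            # Keep the data when the claim that used the volume is deleted.
            "persistentVolumeReclaimPolicy": "Retain",
            "nfs": {"server": server, "path": path},
        },
    }


def volume_claim(name, size):
    """PersistentVolumeClaim: the user states how much storage is needed, not where it lives."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteMany"],
            "resources": {"requests": {"storage": size}},
        },
    }


pv = nfs_volume("project-data", "nfs.example.com", "/export/project", "10Gi")
pvc = volume_claim("workflow-db-data", "5Gi")
print(json.dumps([pv, pvc], indent=2))
```

The claim contains no reference to the NFS server, so the storage backend can change without the user's configuration changing.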

Desired State and Implementation of Actual State
- The real power of offering Kubernetes as a service is in the implementation of actual state
  - Ex.: a user requests X amount of storage and Kubernetes satisfies it with Y storage controller
  - The user does not need to know about the topology of the storage network; Kubernetes handles that

Declarative vs Imperative
- Declarative
  - Focuses on "what"
  - Describes what needs to happen; how is left to the system
  - "Run two copies of this with <= 1 being down at any one time"
- Imperative
  - Focuses on "how"
  - Explicitly states how to do something, with the expectation that the desired outcome will result
  - "Start this process on that server"
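The declarative example from this slide can be made concrete with a small sketch. The user supplies only the specification (two copies, at most one down at a time) and a planner derives the imperative steps. The `rolling_restart_plan` function is a hypothetical illustration and not how Kubernetes implements rolling updates.

```python
def rolling_restart_plan(replicas, max_unavailable=1):
    """Return batches of replica indices to restart, never exceeding max_unavailable at once."""
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    indices = list(range(replicas))
    return [
        indices[i:i + max_unavailable]
        for i in range(0, replicas, max_unavailable)
    ]


# Declarative input: what the user wants.
spec = {"replicas": 2, "max_unavailable": 1}

# Imperative output: the steps the system works out on the user's behalf.
for step, batch in enumerate(rolling_restart_plan(**spec), start=1):
    print(f"step {step}: restart replicas {batch}")
```

Changing the specification, for example to five replicas with two allowed down, changes the derived steps without the user writing any of them.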

Flexible control over what users can request
- Processes in a container run as a regular user (not root)
- Capabilities are stripped from the process before it starts
  - Example: a setuid binary such as sudo runs without setuid privileges
- Most of the kernel is namespace-aware, but the pieces that are not cannot be used from inside a container
- All of these are configurable!
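In a Kubernetes container spec, restrictions of this kind are expressed through the `securityContext` field. The sketch below builds such a container entry and checks it; whether the NCCS clusters use exactly these settings is an assumption, and the container name, image, and UID are hypothetical.

```python
def restricted_container(name, image, uid):
    """Container spec that runs as a regular user with no extra privileges."""
    return {
        "name": name,
        "image": image,
        "securityContext": {
            "runAsNonRoot": True,               # refuse to start as UID 0
            "runAsUser": uid,                   # run as this regular user
            "allowPrivilegeEscalation": False,  # setuid binaries gain nothing
            "capabilities": {"drop": ["ALL"]},  # strip all Linux capabilities
        },
    }


def is_restricted(container):
    """Return True if the container spec applies all four restrictions above."""
    ctx = container.get("securityContext", {})
    return (
        ctx.get("runAsNonRoot") is True
        and ctx.get("runAsUser", 0) != 0
        and ctx.get("allowPrivilegeEscalation") is False
        and "ALL" in ctx.get("capabilities", {}).get("drop", [])
    )


container = restricted_container("portal", "registry.example.com/project/portal:1.0", uid=12345)
print(is_restricted(container))
```

On a multi-user cluster the platform would normally enforce such settings centrally rather than trust each manifest to set them.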

Cluster Resources
- Resource allocation is different from the traditional core hours or node hours we use in HPC
- Quota system based on CPU and memory limits
- The user defines what CPU and memory are required for each container; a container that exceeds its memory limit is killed, while CPU use above the limit is throttled
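A limit-based quota check can be sketched as follows. The parsers handle only a small subset of the Kubernetes quantity syntax (millicores or whole/decimal cores for CPU; Ki, Mi, Gi, or plain bytes for memory), and the quota keys mirror the `limits.cpu` and `limits.memory` names used by ResourceQuota objects. The numbers are hypothetical.

```python
def parse_cpu(quantity):
    """Convert a CPU quantity such as '500m' or '2' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


_MEM_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_memory(quantity):
    """Convert a memory quantity such as '512Mi' or '1Gi' to bytes."""
    for suffix, factor in _MEM_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)


def fits_quota(containers, quota):
    """Return True if the summed container limits stay within the project quota."""
    cpu = sum(parse_cpu(c["resources"]["limits"]["cpu"]) for c in containers)
    mem = sum(parse_memory(c["resources"]["limits"]["memory"]) for c in containers)
    return (cpu <= parse_cpu(quota["limits.cpu"])
            and mem <= parse_memory(quota["limits.memory"]))


containers = [
    {"resources": {"limits": {"cpu": "500m", "memory": "512Mi"}}},
    {"resources": {"limits": {"cpu": "2", "memory": "1Gi"}}},
]
quota = {"limits.cpu": "4", "limits.memory": "2Gi"}
print(fits_quota(containers, quota))
```

Unlike core hours, this quota bounds how much can be held at one time rather than how much is consumed over an allocation period.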

Exposing services
- OpenShift gives users the ability to expose services outside of the cluster
- For HTTP-based services, NCCS will handle initial authentication to ensure the service is accessed only by members of that project
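In OpenShift, an HTTP service is typically exposed with a Route object. The sketch below builds a minimal Route as a Python dict; the route name, service name, port name, and hostname are hypothetical, the TLS setting is an illustrative choice, and the NCCS authentication layer mentioned on the slide is not modeled.

```python
import json


def make_route(name, service_name, target_port, host=None):
    """Return a minimal OpenShift Route that forwards HTTP traffic to a Service."""
    route = {
        "apiVersion": "route.openshift.io/v1",
        "kind": "Route",
        "metadata": {"name": name},
        "spec": {
            "to": {"kind": "Service", "name": service_name},
            "port": {"targetPort": target_port},
            # Terminate TLS at the router and forward plain HTTP to the pods.
            "tls": {"termination": "edge"},
        },
    }
    if host is not None:
        route["spec"]["host"] = host   # otherwise the platform assigns a hostname
    return route


route = make_route("portal", "portal", "http", host="portal.apps.example.com")
print(json.dumps(route, indent=2))
```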

Accessing NCCS resources
- All containers run as an automation user that is tied to a project and has access to the project's allocation and files like a regular user
- Batch job submission from a container
  - Users can base their container image on our NCCS golden image, which comes with the tools to schedule batch jobs or get queue status
- Accessing shared filesystems (GPFS/Lustre/NFS)
  - Shared filesystems can be mounted in the container by Kubernetes, allowing access just like on a login or compute node
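A service running in such a container could submit batch work by calling the scheduler's command-line tools. The sketch below wraps `qsub`, which prints the identifier of the new job on standard output. It assumes `qsub` is on the `PATH` inside the container; the `submit_job` helper and the script path are hypothetical, and the `run` parameter exists only so the function can be exercised without a scheduler.

```python
import subprocess


def submit_job(script_path, run=subprocess.run):
    """Submit a batch script with qsub and return the job identifier it prints."""
    result = run(
        ["qsub", script_path],
        capture_output=True,   # collect stdout/stderr instead of printing them
        text=True,             # decode output as str
        check=True,            # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Hypothetical path on a shared filesystem mounted in the container.
    try:
        print(submit_job("/path/to/job.pbs"))
    except (FileNotFoundError, subprocess.CalledProcessError) as err:
        print(f"submission failed: {err}")
```

A workflow service would call `submit_job` whenever it has work ready and record the returned identifier to query the job's status later.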

HPC Workflow Support Requirements
- Run a persistent service locally
  - DONE: Kubernetes can run user services in NCCS
- Talk to the batch submission system for current queue information and job submission
  - DONE: Containers running on the Kubernetes cluster in NCCS can run qsub/mshow commands to talk to Moab on the Titan/Rhea/DTN clusters
- Interact with files on GPFS/Lustre/NFS
  - DONE: Containers running on the Kubernetes cluster in NCCS can mount Lustre and NFS project and home areas

NCCS Kubernetes Clusters
- Clusters run the Red Hat OpenShift distribution and are split by security domain
- Granite Cluster
  - Ops cluster in our core services security enclave
  - Built for Ops- and staff-managed applications
  - Some applications can run as root in the container
- Marble Cluster
  - User-facing cluster in our moderate security enclave
  - Integration in the moderate enclave:
    - Lustre mounted in container
    - NFS home and project areas mounted in container
    - Torque/Moab job submission in container
  - All applications run as the project automation user in the container
- Onyx Cluster
  - Will be the user-facing cluster in our open security enclave

Questions?
Jason Kincl <kincljc@ornl.gov>
Ryan Adamson <adamsonrm@ornl.gov>