Never Port Your Code Again: Docker functionality with Shifter using SLURM


Never Port Your Code Again: Docker functionality with Shifter using SLURM. Douglas Jacobsen, James Botts, Shane Canon. NERSC, September 15, 2015, SLURM User Group.

Acknowledgements: Larry Pezzaglia (former NERSC, UDI COE, CHOS); Scott Burrow (NERSC, Jesup/MyDock); Shreyas Cholia (NERSC, MyDock); Dave Henseler (Cray, UDI COE); John Dykstra (Cray, UDI COE); Charles Shirron (Cray, UDI COE); Michael Barton (JGI, example Docker image); Lisa Gerhardt (NERSC, HEP cvmfs work); Debbie Bard (NERSC, LSST testing). This work was supported by the Director, Office of Science, Office of Advanced Scientific Computing Research of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

User Defined Images/Containers in HPC. Data-intensive computing often requires complex software stacks, and efficiently supporting big software in HPC environments poses many challenges. Shifter is a NERSC R&D effort, in collaboration with Cray, to support user-defined, user-provided application images: Docker-like functionality on the Cray and on HPC Linux clusters, with efficient job start and native application performance.

Glossary
Image: operating environment for a software stack, including operating system files (e.g., /etc, /usr).
User Defined Image: an image that a user creates for their particular needs.
Linux Container: an instance of an application image and its running processes.
Docker: a software system for creating images and instantiating them in containers; enables distribution of images through well-defined web APIs.
Repository/repo: online store of related Docker images.
Layer: collection of files.
Image (within a repo): ordered collection of layers denoting a particular version.
Tag: a label or pointer to a version (e.g., latest points to the most recent version; 15.04 points to the relevant image).
Shifter: software for securely enabling Docker (and other) images to run in a limited form of Linux container in HPC and cluster environments, specifically for HPC use cases.
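The repo/tag naming defined above appears directly in Docker commands; a quick sketch (image names taken from DockerHub):

```shell
# 'python' is the repository, '2.7' is the tag naming a version within it
docker pull python:2.7

# Omitting the tag pulls whatever the 'latest' tag currently points to
docker pull python
```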

Convergence of Disruptive Technology: increasing role of data; converging HPC and data platforms; new models of application development and delivery.

DOE Facilities Require Exascale Computing and Data. Astronomy, particle physics, chemistry and materials, genomics, fusion: petascale to exascale. Petabyte data sets exist today, many growing exponentially, and processing requirements grow super-linearly. We need to move the entire DOE workload to exascale.

Popular features of a data intensive system and supporting them on Cori (data-intensive workload need → Cori solution):
Local disk → NVRAM burst buffer
Large memory nodes → 128 GB/node on Haswell; option to purchase fat (1 TB) login nodes
Massive serial jobs → shared-node/serial queue on Cori via SLURM
Complex workflows → User Defined Images; CCM mode
Communicate with databases from compute nodes → Advanced Compute Gateway Node
Stream data from observational facilities → Advanced Compute Gateway Node
Easy to customize environment → User Defined Images
Policy flexibility; large capacity of interactive resources → improvements coming with Cori (rolling upgrades, CCM, MAMU); the above CoEs would also contribute

User-Defined Images. User-Defined Images (UDI): a software framework that enables users to accompany applications with portable, customized OS environments, e.g., an Ubuntu base system with the application built for Ubuntu (or Debian, CentOS, etc.). A UDI framework would: enable the HPC platforms to run more applications; increase flexibility for users; facilitate reproducible results; and provide rich, portable environments without bloating the base system.

Use Cases. Large high-energy physics collaborations (e.g., ATLAS and STAR) require validated software stacks: some collaborations will not accept results from non-validated stacks, and simultaneously satisfying compatibility constraints for multiple projects is difficult. Solution: create images with certified stacks. Bioinformatics and cosmology applications have many third-party dependencies that are difficult to install and maintain. Solution: create images with the dependencies. Seamless transition from desktop to supercomputer: users desire consistent environments. Solution: create an image and transfer it among machines.

Development Cycle. User develops/modifies application → user commits application to Docker Hub (or uses a Dockerfile) → user submits a job specifying a Docker image for Shifter → batch system spawns the job and starts the Shifter container → user iterates on the application. Users don't need NERSC to build applications or install dependencies; other users can utilize the image; tagged images mean they can easily re-run the same image; previous/inactive versions of an image are automatically purged.

Value Proposition of UDI for Cori. Expanded application support: many applications currently relegated to commodity clusters could run on the Cray through UDI, which will help position Cori as a viable platform for data-intensive computing. Easier application porting: the need for applications to be ported to Cori will be reduced, and more applications can work out of the box. Better reproducibility: it is easier to re-instantiate an older environment for reproducing results.

UDI Design Goals for Cori: user independence (require no administrator assistance to launch an application inside an image); shared resource availability (e.g., PFS/DVS mounts and network interfaces); leverages or integrates with public image repos (i.e., DockerHub); seamless user experience; robust and secure implementation; fast job startup time; native application execution performance.

Linux Containers vs. Virtual Machines. [Figure: virtual machines stack each app and its bins/libs on a separate guest OS above a hypervisor; containers stack apps and bins/libs directly on the host OS via the Docker Engine / Shifter.] Containers are isolated, but share the kernel and, where appropriate, bins/libraries, and provide close to native performance. A container delivers an application with all the libraries, environment, and dependencies needed to run. Source: IBM Research Report (RC25482).

Docker at a High Level. Process container: uses Linux kernel features (cgroups and namespaces) to create semi-isolated containers. Image management: version-control-style image management front end and image building interface. Ecosystem: can push/pull images from a community-oriented image hub (i.e., DockerHub). Together these cover the user interface, resource management, and application/OS virtualization layers.
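The push/pull ecosystem described above can be sketched with standard Docker commands (the image and repository names here are hypothetical):

```shell
# Build an image from a Dockerfile in the current directory
docker build -t myuser/myapp:15.09 .

# Push it to DockerHub so it can be pulled elsewhere
# (e.g., by Shifter's image gateway)
docker push myuser/myapp:15.09

# Pull and run it on another machine for testing
docker pull myuser/myapp:15.09
docker run -it myuser/myapp:15.09 /bin/bash
```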


Parallel Python Dockerfile Example

FROM python:2.7

## setup optimized numpy
ADD mkl.tar /
ADD numpy-1.9.2.tar.gz /usr/src/
ADD site.cfg /usr/src/numpy-1.9.2/
RUN cd /usr/src/numpy-1.9.2 && \
    chmod -R a+rx /opt/intel && \
    chown -R root:root /opt/intel && \
    python setup.py build && \
    python setup.py install && \
    cd / && rm -rf /usr/src/numpy-1.9.2 && \
    printf "/opt/intel/composer_xe_2015.1.133/mkl/lib/intel64\n" >> /etc/ld.so.conf && \
    ldconfig

ADD scipy-0.16.0b2.tar.gz /usr/src/
ADD site.cfg /usr/src/scipy-0.16.0b2/
RUN cd /usr/src/scipy-0.16.0b2 && \
    apt-get update && \
    apt-get dist-upgrade -y && \
    apt-get install gfortran -y && \
    python setup.py build && \
    python setup.py install && \
    cd / && rm -rf /usr/src/scipy-0.16.0b2

## install mpi4py
ADD mpi4py-1.3.1.tar.gz /usr/src/
ADD optcray_alva.tar /
ADD mpi.cfg /usr/src/mpi4py-1.3.1/
RUN cd /usr/src/mpi4py-1.3.1 && \
    chmod -R a+rx /opt/cray && \
    chown -R root:root /opt/cray && \
    python setup.py build && \
    python setup.py install && \
    cd / && rm -rf /usr/src/mpi4py-1.3.1 && \
    printf "/opt/cray/mpt/default/gni/mpich2-gnu/48/lib\n/opt/cray/pmi/default/lib64\n/opt/cray/ugni/default/lib64\n/opt/cray/udreg/default/lib64\n/opt/cray/xpmem/default/lib64\n/opt/cray/alps/default/lib64\n/opt/cray/wlm_detect/default/lib64\n" >> /etc/ld.so.conf && \
    ldconfig

User Defined Images on the Cray Platforms

Docker on the Cray? Challenges: Docker assumes local disk (aufs and btrfs demand shared-memory access to local disk); it requires running the Docker daemon on each compute node; the Docker security model; missing kernel modules (e.g., LVM); additional complexity. Approach: leverage the Docker image format and integrate with DockerHub, but adopt an alternate approach for instantiating the environment on the compute node (i.e., don't use the Docker daemon). Plus: easier to integrate and support. Minus: can't easily leverage other parts of the Docker ecosystem (e.g., orchestration).

Image Location/Format Options
Image location options: GPFS (no local clients, so the overhead of DVS); Lustre (local clients, large scale); burst buffer (not widely available yet); local disk (not available on Crays).
Image format options: unpacked trees (simple to implement, but metadata performance depends on the metadata performance of the underlying system, i.e., Lustre or GPFS); loopback file systems (moderate complexity, but keeps file system operations local).
Considerations: scalability, manageability, metadata performance, bandwidth consumption.

Layout Options. [Figure: stack view comparing the two layouts. Unpacked tree: each process goes through the Lustre client to the remote Lustre MDS/OSTs for every path in the image (/udi/image1, /udi/image1/usr, /udi/image1/etc), so metadata operations are remote. Loopback: each process sees a local ext4 file system backed by a single file (/udi/image1.ext4) on Lustre, so metadata operations stay local.]

Shifter supports Docker images and CHOS images, and can support other image types (e.g., qcow2, VMware, etc.). Basic idea: convert the native image format to a common format (ext4, squashfs); construct a chroot tree on the compute nodes using the common-format image; modify the image within the container to meet site security/policy needs; directly use Linux VFS namespaces to support multiple Shifter containers on the same compute node.
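The common-format idea can be illustrated with standard Linux tools (the paths are hypothetical, and Shifter's actual conversion is performed by its image gateway rather than by hand):

```shell
# Convert an unpacked image tree into a single compressed squashfs file
mksquashfs /scratch/udi/image1 /scratch/udi/image1.squashfs -comp gzip

# Mount it read-only via a loop device; image metadata operations are
# now served locally from one file rather than hitting the parallel FS
mkdir -p /var/udiMount/image1
mount -t squashfs -o loop,ro /scratch/udi/image1.squashfs /var/udiMount/image1
```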

Shifter components
Command-line interface: instantiate and control Shifter containers interactively; list images; pull images from DockerHub or a private registry.
Central gateway service: manage images; transfer images to the computational resource; convert images from several sources to the common format.
udiroot: sets up the container on the compute node; a CCM-like implementation supports internode communication.
Workload manager integration: pull the image at job submission time if it doesn't already exist; implement the user interface for user-specified volume mapping; launch udiroot on all compute nodes at job start; deconstruct udiroot on all compute nodes at job end. The WLM provides resource management (e.g., cgroups).
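The command-line interface described above might be used roughly as follows; this is a sketch, and the exact command names (shifterimg, shifter) are those of the later open-source Shifter releases, which may differ from the version presented here:

```shell
# List images already available on the system
shifterimg images

# Ask the gateway to pull an image from DockerHub and convert it
shifterimg pull docker:ubuntu:15.04

# Run a command inside a container built from that image
shifter --image=docker:ubuntu:15.04 /bin/bash
```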

[Figure: Shifter architecture; the ALPS integration is shown. Native Slurm uses its plugin capability to execute these features without a MOM node.]

Shifter Delivers Performance. [Chart: time (seconds, 0-450) to complete the Pynamic Python benchmark on 300 nodes, 24 cores per node, for several Python deployment options: local memory, DataWarp, tuned DVS+GPFS, Lustre, DVS+GPFS+read-only, and DVS+GPFS. Shifter balances memory vs. network consumption: the benchmark runs locally in memory, with some support libraries served over the network.]

How is Shifter similar to Docker? It sets up a user-defined image under user control; it allows volume remapping (e.g., mount /a/b/c on /b/a/c in the container); environment variables, the working directory, and entrypoint scripts can be defined and run when containers are started; and multiple containers can be instantiated on the same node.

How does Shifter differ from Docker?
The user runs as their own user in the container, not root.
The image is modified at container construction time: /etc, /var, and /opt are modified; /etc/passwd, /etc/group, and other files are replaced for site/security needs; /var/hostsfile is added to identify the other nodes in the calculation (like $PBS_NODEFILE); some support software is injected in /opt/udiimage; and mount points for parallel filesystems are added. Your home directory can stay the same inside and outside of the container (site configurable).
The image is read-only on the computational platform; to modify your image, push an update using Docker.
Shifter only uses mount namespaces, not network or process namespaces. This allows your application to leverage the HSN and more easily integrate with the system.
Shifter does not use cgroups directly; the site workload manager (e.g., SLURM, Torque) manages resources.
Shifter uses individual compressed filesystem files to store images, not the Docker graph. This uses more disk space, but delivers high performance at scale.
Shifter integrates with your workload manager: it can instantiate a container on thousands of nodes and run parallel MPI jobs; a specialized sshd runs within the container on exclusive nodes for non-native-MPI parallel jobs; and a $PBS_NODEFILE equivalent is provided within the container (/var/hostsfile). This is similar to Cray CCM functionality, and Shifter acts in place of CCM if the Shifter image is pointed at the /dsl VFS tree.

Shifter / SLURM Integration. A custom SPANK plugin adds these options to salloc, sbatch, and srun:
--image=<image descriptor>
--imagevolume=<volume remapping descriptor>
--ccm (uses Shifter to emulate Cray CCM)
--autoshift (automatically run the batch script in the Shifter image)
slurm_spank_job_prolog sets up the Shifter environment on all exclusive compute nodes in the allocation (should use PrologFlags=Alloc). slurm_spank_task_init_privileged instantiates the Shifter container for slurmstepd for trusted images when ccm or autoshift is defined. slurm_spank_job_epilog tears down the Shifter environment on all exclusive compute nodes in the allocation. Other requirements: ensure that all needed cgroup paths, SLURM state directories, and munge sockets are mounted in Shifter containers (the sitefs variable in udiroot.conf).
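A batch job using the --image option added by the SPANK plugin might look like the following sketch (the image name, node counts, and application name are hypothetical):

```shell
#!/bin/bash
#SBATCH --image=docker:myuser/myapp:latest
#SBATCH -N 4
#SBATCH -t 00:30:00

# The plugin pulls the image at submission time if needed, and the job
# prolog sets up the container environment on every allocated node;
# srun then launches each task inside the container
srun -n 96 shifter ./my_mpi_app
```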

Demo

Where are we now? An early version of Shifter is deployed on Edison, and early users are already reporting successes. Shifter is fully integrated with the batch system: users can load a container on many nodes at job-start time, with full access to the parallel file systems and the high-speed interconnect (MPI). Shifter and NERSC were recently featured in HPCwire and other media, and several other HPC sites have expressed interest. Our early users: light sources; structural biology (early production); LHC nuclear physics (testing); cosmology (testing); nucleotid.es genome assembly (proof of concept).

Where are we going? Cori Phase 1 will debut with a new, fully open-source version of Shifter. Users will be able to run multiple containers in a single job (i.e., dynamically select and run Shifter containers), with tighter integration and a simpler user interface. Cray has indicated interest in making a supported product out of Shifter. Ultimate code portability: don't rewrite your code, shift the data center! Support for volume mappings, e.g., /scratch2/scratchdirs/username/code1/date2/a/b/c → /input and /scratch2/scratchdirs/username/code1/date2/out/1 → /output. Support for entrypoint scripts, e.g., autorun /usr/bin/mycontainerscript.sh. Entrypoints + volume mappings = no site-specific batch script or porting delays → increased scientific productivity.
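Those volume mappings might be expressed at submission time roughly as follows; the paths follow the slide's example, but the host:container separator syntax is an assumption based on the --imagevolume option described earlier:

```shell
sbatch --image=docker:myuser/myapp:latest \
       --imagevolume=/scratch2/scratchdirs/username/code1/date2/a/b/c:/input \
       --imagevolume=/scratch2/scratchdirs/username/code1/date2/out/1:/output \
       job.sh
```

Inside the container the application simply reads /input and writes /output, so the same image runs unmodified at any site that maps those paths.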

Thank you! (And we are hiring!) HPC systems engineers, HPC consultants, storage analysts, postdocs. www.nersc.gov

National Energy Research Scientific Computing Center

Shifter Security Model. The user only accesses the container as their own uid, never root or contextual root. A generated, filtered site /etc/passwd and /etc/group is placed in the container, so Shifter containers need not interoperate with LDAP and there are no concerns about the image's PAM configuration. The optional sshd run within the image is statically linked and only accessible to that container's user. All user-provided data (paths within the image, environment variables, command-line arguments) are filtered. Executables that run with root privileges are implemented in C with only glibc dependencies. All filesystems in the container are remounted no-setuid. All processes that run with privilege carefully manage their environment to prevent accidental or intentional manipulation.

Using Shifter to deliver cvmfs for the LHC. The vast majority of LHC computing depends on cvmfs, a network file system that relies on HTTP to deliver the full software environment to the compute node. It needs FUSE, root permissions, and local disk, so implementation on Cray systems has been difficult; most groups extract the needed portions of cvmfs onto a compute-node RAM disk, which breaks automated pipelines. We use Shifter to deliver the full ALICE, ATLAS, and CMS cvmfs repositories to Edison. The ALICE image is 892 GB with 10,116,157 inodes, yet load-up time was negligible. We ran simulations of p-Pb collisions at an energy of 5 TeV and reconstructed the result: actual physics (used to evaluate the EMCal trigger) that worked out of the box. NERSC is making Shifter images available for all cvmfs repositories.

Shifter: deployed containers are read-only to the user. Docker: running containers are mutable.