Deploy containers on your cluster - A proof of concept
What is an HPC cluster (in my world!)

Where do I come from?
- I run and maintain a bioinformatics cluster at the Bioinformatics Research Centre (BiRC), Aarhus University
- E-mail: anders.dannesboe@birc.au.dk

The setup:
- 3000+ cores
- 3.5 PB parallel file system (henceforth known as /faststorage)
- We use SLURM as our scheduler
What is an HPC cluster (in my world!)

- A bunch of servers connected together, with access to a shared file system
- Pipelines are split into parallel pieces and run on multiple nodes at once, to achieve an accumulated speedup
- A multi-user system: pipelines are run by unprivileged users (no root!)
- Everything is orchestrated by a scheduler, which takes care of resource sharing. E.g.:
  - Kills jobs that take too long
  - Enforces the core and memory limits of each job
  - Packs multiple jobs from multiple users together on as few nodes as possible
What is an HPC cluster (in my world!)

What kinds of jobs do we run?
- Lots of data: large input datasets, large shared reference datasets, sensitive data
- Lots of different software used by lots of different people; versions keep on changing
- Work-in-progress pipelines: batches are seldom run twice, but a batch can have 50,000 jobs of the same type

Everything is in flux
Docker

"Docker: A Revolutionary Change in Cloud Computing"
Docker

Docker's focus: make software run the same anywhere
- Uses containers to make software OS independent
- Takes over networking, to make containers independent of the datacenter environment (no static/fixed IPs)
- One storage model, to make containers independent of the storage backing (image/container content is just files in your filesystem)

Docker takes care of many of the nitty-gritty details and lets you focus on packaging your software once and for all
What are Linux containers?

Chroot on steroids
- Each container comes with its own OS
- Spawning a container runs a new init; every running container on the host is an independent OS running on the system
- Uses features in the Linux kernel to achieve process isolation:
  - Cgroups for resource management
  - Linux namespaces for process isolation
- Leverages OverlayFS in its data/deployment model: spawn multiple containers from the same template without copying a thing
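The namespace machinery can be poked at directly from a shell. A minimal illustration (assuming a Linux host with util-linux's unshare; the hostname demo needs root, or a kernel that allows unprivileged user namespaces):

```shell
# Every process lists its namespaces under /proc/<pid>/ns/; two processes
# that share a namespace point at the same inode.
ls -l /proc/self/ns/

# Spawn a shell in a fresh UTS (hostname) namespace: the new hostname is
# only visible inside it. The -r flag maps us to root inside an
# unprivileged user namespace, if the kernel permits it.
unshare -r --uts sh -c 'hostname demo-host && hostname' || true
hostname   # unchanged on the host
```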
What are Linux containers?

Linux namespaces:
- PID namespace
- Network namespace
- UTS namespace (hostname)
- User namespace (uid/gid)
- Mount namespace

Namespaces have been a long time underway. Full support under anything but Ubuntu/Debian can be tricky.
What are Linux containers?

Why is this powerful?
- A container will work the same anywhere
- Each container is isolated: allow unprivileged users to run anything, let them become root
- Utilizes OverlayFS:
  - Spawn a new full OS in under a second
  - Spawning multiple containers from the same template takes up no extra space
- No hypervisor, just native performance: no syscall translation => no overhead. Run 100+ containers on one host

Back to Docker =>
Docker

Docker is by far the most popular container implementation, and its design philosophy has been adopted wholesale:
- Create docker images through recipes (Dockerfile)
- Running containers are ephemeral
- Make docker images reusable by others; images are easy to publish, download and use
- Split your software stack into smaller units by containerizing one service at a time
Docker

Docker has gained serious traction amongst companies and developers working in the cloud. Here Docker and its philosophy help:
- Plan, structure, develop and deploy the software stack
- Lots of effort has been put into containerizing existing software stacks (also in academia)
- Restructure code under a better, more scalable model
- Cloud ready
- Get in while the buzz is hot
Docker

Some of the heavy hitters

From academia: Björn Grüning (bgruening) from the University of Freiburg
Meanwhile in HPC...
Can we get Docker into our HPC clusters?

How can we capitalize? A lot of software has already been dockerized. Projects like:
- https://github.com/biodocker/containers
Or easy-to-containerize collections:
- https://github.com/mulled/mulled
And the list of container resources grows every day.

How can we deploy all these containers, with ready-to-use software inside, on our HPC cluster?
Merging containers into cluster computing

Let's look at the pipeline: individual pieces of software strung together in a chain*. Each link in the chain takes output from the previous link and uses it as input.

Instead of the actual software being the link, how about using containers? To rephrase: split your pipeline into smaller units by containerizing one link at a time.
- Makes your pipelines cluster independent**
- Much of the development can be done off-cluster, on your own system
- Write your awesome software once, and everybody can use it. #citations
- Reuse others' (a little bit less awesome) software in your pipeline

*A lattice I guess, or else we wouldn't be doing stuff in parallel
**Well, no. But a step in the right direction
Use case - The cluster user

Missing a piece of software? Search the web for existing images:
- https://hub.docker.com
- https://github.com/biodocker/containers
- https://github.com/mulled/mulled
- https://docker-ui.genouest.org/app/#/containers

Or query from the cmd*:
$:> docker search bowtie2

Or find a link in a research paper.

*This does require mulled, biodocker etc. to be set up as repos
Use case - The cluster user

No luck? Build your own container.

$:> mkdir bowtie2 && cd bowtie2
$:> vim Dockerfile

FROM ubuntu

RUN apt-get update -qq --fix-missing
RUN apt-get install -qq -y wget unzip
RUN wget -q -O bowtie2.zip https://sourceforge.net/.../bowtie2-2.2.9-linux-x86_64.zip/download
RUN unzip bowtie2.zip -d /opt/
RUN ln -s /opt/bowtie2-2.2.9 /opt/bowtie2
RUN rm bowtie2.zip

ENV PATH $PATH:/opt/bowtie2

$:> docker build -t bowtie2-2.2.9 .
$:> docker images
REPOSITORY      TAG     IMAGE ID      CREATED         SIZE
bowtie2-2.2.9   latest  49c23f71b287  9 seconds ago   289 MB
ubuntu          latest  c73a085dc378  5 days ago      127 MB
$:> docker run --rm -it bowtie2-2.2.9 bowtie2 -h
Bowtie 2 version 2.2.9 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea)
Usage: bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]...
Use case - The cluster user

Push our own work to Docker Hub for others to reuse:
$:> docker push bowtie2-2.2.9

- Docker images can be pushed to repositories (Docker Hub being one), and automatically pulled in when needed.
- Docker Hub can monitor git repositories and rebuild a docker image on commits.
- Set up a (private) docker repository on your local network that pulls content from the most relevant global repos. Each docker daemon can stream in >1 GB docker images within seconds.
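Such a local repository can be sketched with Docker's stock registry image. The registry host name below is a placeholder, and the commands assume a working docker installation:

```shell
# Run a private registry on a machine in the local network (assumed
# reachable from the nodes under the hypothetical name registry.local).
REGISTRY_HOST=registry.local
docker run -d -p 5000:5000 --restart=always --name registry registry:2

# Tag and push a locally built image into it:
docker tag bowtie2-2.2.9 $REGISTRY_HOST:5000/bowtie2-2.2.9
docker push $REGISTRY_HOST:5000/bowtie2-2.2.9

# Any docker daemon on the network can now pull it:
docker pull $REGISTRY_HOST:5000/bowtie2-2.2.9
```

Note that a plain-HTTP registry like this requires the pulling daemons to list it as an insecure registry, or TLS to be set up in front of it.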
What would we like to achieve?

- Make your life as a user easier by reusing existing and working docker images from papers, colleagues and previous projects
- Make your life as an administrator easier by not maintaining a plethora of software compiled from source to custom specifications
- Make our pipelines easier to rerun on a different cluster, by packaging the software into docker images that can run everywhere
What do we need?

1. Mapping of data
   Enable containers to work on the data (massive in size) on the HPC filesystem like any piece of software (within reason ;))
2. Resource limiting
   A way for the docker daemon to run under the resource management of SLURM, so that the scheduler can do resource sharing.
3. Maintain security
   A cluster user should never be able to achieve privilege escalation (of any sort).
   Alice should only be able to run as alice.
   No one but Alice should be able to run as alice.
Mapping of data

Map data from host to container via bind-mount:
docker run -v /storage:/storage debian /bin/bash

Idea: make a 1-1 map of the shared storage into the container. File paths are the same outside and inside a container. Easy to work with.

Example:
#sbatch
tool_a /storage/input -o /storage/output.a
tool_b /storage/output.a -o /storage/output.b
cat /storage/output.b

#sbatch
docker run -v /storage:/storage tool_a /storage/input -o /storage/output.a
docker run -v /storage:/storage tool_b /storage/output.a -o /storage/output.b
docker run -v /storage:/storage cat /storage/output.b
Mapping of data

Problem solved. Let's crack on
Mapping of data

Problem solved. Let's crack on... Major breach of nr. 3: Maintain security

Docker defaults:
- Containers run as root
- Anyone in the docker group can spawn containers
- All are equal in the eyes of the daemon: alice gets to spawn just as much as root does
Mapping of data

Evil Alice: by mapping parts of the host OS into a container, Alice can act like root in the host OS.

What about:
docker run -v /storage/sensitive_data:/unsensitive_data debian /bin/bash

And even worse:
docker run -v /etc/shadow:/root/shadow debian /bin/bash

Read-write access to our password file!
Mapping of data

Unprivileged containers
- Any storage that is mapped inside a container should retain the restrictions of the user spawning it
- Filesystems don't have multiple and separate UID/GID ranges
- Utilize the size of the UID/GID space, and shift containers into unused UIDs/GIDs to isolate them. UIDs/GIDs get translated back and forth when crossing the container boundary.
- Unprivileged containers have existed and been used in LXC for a while. A fairly new (and unknown) option in Docker.
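The translation table itself is visible in /proc, one line per mapped range:

```shell
# Format: <uid inside namespace>  <uid outside namespace>  <range length>
cat /proc/self/uid_map
# In the initial user namespace this is the identity mapping over the
# full 32-bit range:
#   0  0  4294967295
cat /proc/self/gid_map
```

Inside a remapped container the first column starts at 0 (root in the container) while the second column points at the shifted host range.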
Mapping of data

How does it work?
- Assign an isolated UID-space and GID-space to a user
- 2 new files: /etc/subuid and /etc/subgid
- Use these UIDs/GIDs inside the container

$:> usermod --add-subuids 100000-165536 alice
$:> usermod --add-subgids 100000-165536 alice
$:> docker daemon --userns-remap alice:alice &
$:> docker run --rm -it -v /etc/shadow:/root/shadow debian /bin/bash
#:> touch /etc/shadow
#:> touch /root/shadow
touch: cannot touch '/root/shadow': Permission denied

*Available in Ubuntu since 14.04, but not in CentOS 7 yet.
Mapping of data

That was a step too far! What about reference data, input data and output data?

Solution: shift UIDs and GIDs into boring isolation, but keep the UID of the user and the GID of the project.

cat /etc/subuid
alice:100000:1000
alice:1000:1
alice:101001:64535

cat /etc/subgid
plants:100000:10000
plants:10000:1
plants:110000:64535
Mapping of data

Success!

$:> docker daemon --userns-remap alice:plants &
$:> docker run --rm -it \
    -v /etc/shadow:/root/shadow \
    -v /storage:/storage debian /bin/bash
#:> touch /root/shadow
touch: cannot touch '/root/shadow': Permission denied
#:> cd /storage
#:> ls
humans lost+found plants
#:> ls humans/
ls: cannot open directory humans/: Permission denied
#:> ls plants/
some_plant.gene
Mapping of data

What did we need?
- Edit /etc/subuid and /etc/subgid to shift everything but the user uid and project gid into an isolated uid/gid range
- Multiple running docker daemons, one per <user>:<group> mapping
- Add --userns-remap to restrict container file access
- Add --group to restrict access to the docker daemon

docker daemon \
    --graph=/mnt/scratch/$user.$project/docker \
    --pidfile=/mnt/scratch/$user.$project/docker.pid \
    -H unix:///mnt/scratch/$user.$project/docker.sock \
    --group=$user_id \
    --userns-remap=$user_id:$group_id

Your users are now able to run containers on your filesystem!
Resource limiting

In any HPC cluster the scheduler must have total resource control:
- Jobs are run with the privileges of the user
- Processes are subprocesses of slurmd

But:
- The docker daemon must be spawned by root
- Containers run as subprocesses of the docker daemon

So:
1. An unprivileged user must be able to start the docker daemon
2. The scheduler must be able to monitor/control the resources of docker
3. When a job is killed, all containers spawned by that job must die
Resource limiting

SLURM already uses cgroups, and that is all we need:
- Write a setuid script start_docker that asserts permissions and forks out a docker daemon locked to the <user>:<project>
- Run start_docker inside a job to use containers
- The cgroup stays with the daemon, monitoring/limiting its resources
- Use SLURM's epilog hook to clean up afterwards:
  - Kills the docker daemon and containers if still running
  - Deletes any container leftovers
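A sketch of what such an epilog could look like. The paths and variable names are assumptions, following the /mnt/scratch/<user>.<project> layout used for the daemon:

```shell
#!/bin/bash
# Hypothetical SLURM epilog: kill the per-job docker daemon and wipe its
# scratch state when the job ends.

cleanup_job() {
    local scratch="$1"                 # e.g. /mnt/scratch/alice.plants
    local pidfile="$scratch/docker.pid"
    if [ -f "$pidfile" ]; then
        # Killing the daemon takes its containers down with it.
        kill "$(cat "$pidfile")" 2>/dev/null || true
        sleep 2
        kill -9 "$(cat "$pidfile")" 2>/dev/null || true
    fi
    # Delete image/container leftovers (--graph lived under $scratch).
    rm -rf "$scratch"
}

# Invoked from slurm.conf (e.g. Epilog=/etc/slurm/epilog.sh), roughly as:
# cleanup_job "/mnt/scratch/${SLURM_JOB_USER}.${SLURM_JOB_ACCOUNT}"
```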
Resource limiting

Check the process tree:
alice@vm47:~$ pstree 20238 -a
slurmstepd
  ├─bash
  │   └─sudo docker_daemon plants
  │       └─docker_daemon /usr/local/bin/docker_daemo...
  │           └─dockerd --graph=/mnt/scratch/alice.pl...
  │               ├─docker-containe -l unix:///var/ru...
  │               │   └─7*[{docker-containe}]
  │               └─14*[{dockerd}]
  └─5*[{slurmstepd}]

And the cgroup:
alice@vm47:~$ cat /proc/self/cgroup
11:name=systemd:/user/0.user/6.session
10:hugetlb:/user/0.user/6.session...
alice@vm47:~$ cat /proc/`pidof dockerd`/cgroup
11:name=systemd:/user/0.user/6.session
10:hugetlb:/user/0.user/6.session...
Limitations

This is a proof of concept
- Docker locks /etc/passwd and /etc/group: no way to inject user/project names, only UID and GID are available
- Docker's --userns-remap limits a user to one project at a time; limitations in the kernel make this unlikely to change
- Limitations in the kernel allow no more than 5 lines in subgid(!?)*

*"There is an (arbitrary) limit on the number of lines in the file. As at Linux 3.18, the limit is five lines." - user_namespaces manpage
Limitations

- How about network? How to communicate with containers on different nodes? How about RDMA?
- Docker is still in very active development:
  - Docker 1.8 - August 12, 2015
  - Docker 1.9 - November 3, 2015
  - Docker 1.10 - February 4, 2016
  - Docker 1.11 - April 13, 2016
  - Docker 1.12 - June 20, 2016
  All saw major changes and the introduction of new concepts and features.
- Not all features are supported in the major distributions: Ubuntu/Debian, Arch Linux, CentOS