Container mechanics in Linux and rkt FOSDEM 2016

Similar documents
rkt and Kubernetes What's new (and coming) with Container Runtimes and Orchestration

OS Containers. Michal Sekletár November 06, 2016

1 Virtualization Recap

2016 Mesosphere, Inc. All Rights Reserved.

Containers and isolation as implemented in the Linux kernel

MESOS A State-Of-The-Art Container Orchestrator Mesosphere, Inc. All Rights Reserved. 1

Linux Containers Roadmap Red Hat Enterprise Linux 7 RC. Bhavna Sarathy Senior Technology Product Manager, Red Hat

LXC(Linux Container) Lightweight virtual system mechanism Gao feng

深 入解析 Docker 背后的 Linux 内核技术. 孙健波浙江 大学 SEL/VLIS 实验室

Introduction to containers

Container's Anatomy. Namespaces, cgroups, and some filesystem magic 1 / 59

CNI, CRI, and OCI - Oh My!

Introduction to Container Technology. Patrick Ladd Technical Account Manager April 13, 2016

THE ROUTE TO ROOTLESS

Docker A FRAMEWORK FOR DATA INTENSIVE COMPUTING

High Performance Containers. Convergence of Hyperscale, Big Data and Big Compute

Dan Williams Networking Services, Red Hat

ViryaOS RFC: Secure Containers for Embedded and IoT. A proposal for a new Xen Project sub-project

HTCondor: Virtualization (without Virtual Machines)

The Open Container Initiative Establishing standards for an open ecosystem

Travis Cardwell Technical Meeting

The failure of Operating Systems,

Infoblox Kubernetes1.0.0 IPAM Plugin

Singularity CRI User Documentation

RDMA Container Support. Liran Liss Mellanox Technologies

OS Virtualization. Linux Containers (LXC)

Lightweight Containerization at Facebook

Cloud & container monitoring , Lars Michelsen Check_MK Conference #4

PROCESS MANAGEMENT Operating Systems Design Euiseong Seo

STATUS OF PLANS TO USE CONTAINERS IN THE WORLDWIDE LHC COMPUTING GRID

How Container Runtimes matter in Kubernetes?

User Namespaces. Linux Capabilities and Namespaces. Outline. Michael Kerrisk, man7.org c 2018 March 2018

The State of Rootless Containers

An introduction to Docker

Replacing Docker With Podman. By Dan

Container Detection and Forensics, Gotta catch them all!

ISSN (Online)

Kubernetes and the CNI: Where we are and What s Next Casey Callendrello RedHat / CoreOS

Introduction to Containers

Creating a Reproducible Build System for Docker Images

SAINT LOUIS JAVA USER GROUP MAY 2014

Container Networking and Openstack. Fernando Sanchez Fawad Khaliq March, 2016

[Docker] Containerization

How to build and run OCI containers

Engineering Robust Server Software

Container Security and new container technologies. Dan

Kubernetes and the CNI: Where we are and What s Next Casey Callendrello RedHat / CoreOS

OS Security III: Sandbox and SFI

Zdeněk Kubala Senior QA

Overlayfs And Containers. Miklos Szeredi, Red Hat Vivek Goyal, Red Hat

Opendaylight: Enabling 5G through Cloud Native Telco Architecture Edgar Lombara Lumina Networks Inc.

Docker Deep Dive. Daniel Klopp

OPENSHIFT FOR OPERATIONS. Jamie Cloud Guy - US Public Sector at Red Hat

Kata Containers The way to run virtualized containers. Sebastien Boeuf, Linux Software Engineer Intel Corporation

Docker Security. Mika Vatanen

Rootless Containers with runc. Aleksa Sarai Software Engineer

Kubernetes The Path to Cloud Native

CONTAINERS AND MICROSERVICES WITH CONTRAIL

Creating a Reproducible Build System for Docker Images

Multi-Arch Layered Image Build System

On the Performance Impact of Virtual Link Types to 5G Networking

Namespaces and Capabilities Overview and Recent Developments

Understanding user namespaces

systemd in Containers

SOFT CONTAINER TOWARDS 100% RESOURCE UTILIZATION ACCELA ZHAO, LAYNE PENG

for Kerrighed? February 1 st 2008 Kerrighed Summit, Paris Erich Focht NEC

Sandboxing. CS-576 Systems Security Instructor: Georgios Portokalidis Spring 2018

The speed of containers, the security of VMs

Kubernetes networking in the telco space

Project Kuryr. Here comes advanced services for containers networking. Antoni Segura

Container Adoption for NFV Challenges & Opportunities. Sriram Natarajan, T-Labs Silicon Valley Innovation Center

Simplify Container Networking With ican. Huawei Cloud Network Lab

MesosCon Qian Zhang (IBM China), Jie Yu (Mesosphere) OCI Support in Mesos Mesosphere, Inc. All Rights Reserved. 1

Distributed Computing Framework Based on Software Containers for Heterogeneous Embedded Devices

Is it safe to run applications in Linux Containers?

Secure Architecture Principles

Debugging Using Container Technology

State of Containers. Convergence of Big Data, AI and HPC

Container-based virtualization: Docker

Infoblox IPAM Driver for Kubernetes. Page 1

Beyond Init: systemd

Think Small to Scale Big

TEN LAYERS OF CONTAINER SECURITY

Dockercon 2017 Networking Workshop

Landlock LSM: toward unprivileged sandboxing

Bringing Security and Multitenancy. Lei (Harry) Zhang

Infoblox IPAM Driver for Kubernetes User's Guide

Docker und IBM Digital Experience in Docker Container

Software containers are likely to become a very important tool over the

Infrastructure at your Service. Oracle over Docker. Oracle over Docker

TEN LAYERS OF CONTAINER SECURITY

Overview of Container Management

Alternatives to Solaris Containers and ZFS for Linux on System z

Dockerized Tizen Platform

CS 550 Operating Systems Spring Process II

KVM Forum Vancouver, Daniel P. Berrangé

Flatpak a technical walk-through. Alexander Larsson, Red Hat

Securing Microservices Containerized Security in AWS

Multiple Networks and Isolation in Kubernetes. Haibin Michael Xie / Principal Architect Huawei

Improving User Accounting and Isolation with Linux Kernel Features. Brian Bockelman Condor Week 2011

Transcription:

Container mechanics in Linux and rkt FOSDEM 2016

Alban Crequy github.com/alban Jonathan Boulle github.com/jonboulle @baronboulle

a modern, secure, composable container runtime

an implementation of appc (image format, execution environment)

rkt simple CLI tool golang + Linux self-contained

simple CLI tool no (mandatory) daemon: apps run directly under spawning process

bash/runit/systemd rkt application(s)

rkt internals modular architecture execution divided into stages stage0 stage1 stage2

bash/runit/systemd rkt application(s)

bash/runit/systemd/... (invoking process) rkt (stage0) pod (stage1) app1 (stage2) app2 (stage2)

bash/runit/systemd/... (invoking process) rkt (stage0) pod (stage1) app1 (stage2) app2 (stage2)

stage0 (rkt binary) discover, fetch, manage application images set up pod filesystems commands to manage pod lifecycle

bash/runit/systemd/... (invoking process) rkt (stage0) pod (stage1) app1 (stage2) app2 (stage2)

bash/runit/systemd/... (invoking process) rkt (stage0) pod (stage1) app1 (stage2) app2 (stage2)

stage1 "the container" execution environment for pods process lifecycle management resource constraints (isolators)

stage1 (swappable) binary ABI with stage0 rkt's stage0 calls exec(stage1, args...) default implementation based on systemd-nspawn + systemd Linux namespaces + cgroups for isolation kvm implementation based on lkvm + systemd hardware virtualisation for isolation

bash/runit/systemd/... (invoking process) rkt (stage0) pod (stage1) app1 (stage2) app2 (stage2)

bash/runit/systemd/... (invoking process) rkt (stage0) systemd-nspawn (stage1) systemd cached (stage2) workerd (stage2)

bash/runit/systemd/... (invoking process) rkt (stage0) systemd-nspawn (stage1) systemd cached (stage2) workerd (stage2) container

Containers on Linux namespaces cgroups chroot

Linux namespaces

containers host hostname: thunderstorm hostname: rainbow hostname: sunshine

Containers: no guest kernel system calls: example: sethostname() app app rkt kernel API app rkt Host Linux kernel Hardware

Containers with an example Getting and setting the hostname: The system calls for getting and setting the hostname are older than containers int uname(struct utsname *buf); int gethostname(char *name, size_t len); int sethostname(const char *name, size_t len);

Processes in namespaces 1 3 2 6 gethostname() -> rainbow 9 gethostname() -> thunderstorm

Linux Namespaces Several independent namespaces uts (Unix Timesharing System) namespace mount namespace pid namespace network namespace user namespace

Creating new namespaces unshare(clone_newuts); 1 2 6 rainbow

Creating new namespaces 1 6 2 rainbow 6 rainbow

PID namespace

Hiding processes and PID translation the host sees all processes the container only its own processes

Hiding processes and PID translation the host sees all processes the container only its own processes Actually pid 30920

Initial PID namespace 1 2 6 7

Creating a new namespace clone(clone_newpid,...); 1 2 6 7

Creating a new namespace 1 2 6 1 7

rkt rkt run... uses clone() to start the first process in the container with a new pid namespace uses unshare() to create a new network namespace rkt enter... uses setns() to enter an existing namespace

Joining an existing namespace 1 setns(...,clone_newpid); 2 6 1 7

Joining an existing namespace 1 2 6 1 4

Mount namespaces

/ container /my-app / host /etc /var /home user

Storing the container data (Copy-on-write) Container filesystem Overlay fs upper directory /var/lib/rkt/pods/run/<pod-uuid>/overlay/sha512.../upper/ Application Container Image /var/lib/rkt/cas/tree/sha512-...

rkt directories /var/lib/rkt cas tree deps-sha512-19bf... deps-sha512-a5c2... pods run e0ccc8d8 overlay/sha512-19bf.../upper stage1/rootfs/

unshare(..., CLONE_NEWNS); 1 7 3 / /etc /var /home user

1 7 7 3 / / /etc /var /home user /etc /var /home user

Changing root with MS_MOVE /... rootfs / / mount($rootfs, /, MS_MOVE) my-app... rootfs my-app $ROOTFS = /var/lib/rkt/pods/run/e0ccc8d8.../stage1/rootfs

Relationship between the two mounts: - shared - master / slave - private Mount propagation events / / /etc /var /home user /etc /var /home user

Mount propagation events /home /home /home /home /home /home Private Shared Master and slave

How rkt uses mount propagation events / in the container namespace is recursively set as slave: mount(null, "/", NULL, MS_SLAVE MS_REC, NULL) / /etc /var /home user

Network namespace

Network isolation Goal: each container has their own network interfaces Cannot see the network traffic outside the container (e.g. tcpdump) container1 container2 eth0 eth0 host eth0

Network tooling Linux can create pairs of virtual net interfaces Can be linked in a bridge container1 container2 eth0 eth0 veth1 veth2 bridge eth0 IP masquerading via iptables

rkt networking plugin based Container Network Interface (CNI) rkt Kubernetes Calico

Container Runtime (e.g. rkt) Container Networking Interface (CNI) veth macvlan ipvlan OVS

How does rkt do it? rkt uses the network plugins implemented by the Container Network Interface (CNI, https://github.com/appc/cni) network namespace configure via setns + netlink systemd-nspawn create, join exec() exec network plugins /var/lib/rkt/pods/run/$pod_uuid/netns rkt

User namespaces

History of Linux namespaces 1991: Linux 2002: namespaces in Linux 2.4.19 2008: LXC 2011: systemd-nspawn 2013: user namespaces in Linux 3.8 2013: Docker 2014: rkt development still active

Why user namespaces? Better isolation Run applications which would need more capabilities Per user limits Future: Unprivileged containers: possibility to have container without root

User ID ranges 4,294,967,295 (32-bit range) 0 host 0 65535 container 1 0 container 2 65535

User ID mapping /proc/$pid/uid_map: 0 unmapped unmapped 1048576 host container 65536 65536 unmapped 1048576 65536

Problems with container images web server Application Container Image (ACI) container 1 container 2 Container filesystem Container filesystem Overlayfs upper directory Overlayfs upper directory downloading Application Container Image (ACI)

Problems with container images Files UID / GID rkt currently only supports user namespaces without overlayfs Performance loss: no COW from overlayfs chown -R for every file in each container

Problems with volumes bind mount (rw / ro) / /data mounted in several containers No UID translation Dynamic UID maps /my-app / /data /var /home user /data

User namespace and filesystem problem Possible solution: add options to mount() to apply a UID mapping rkt would use it when mounting: the overlay rootfs volumes

Isolators

Isolators in rkt specified in an image manifest limiting capabilities or resources

Isolators in rkt Currently implemented Possible additions capabilities cpu memory block-bandwidth block-iops network-bandwidth disk-space

cgroups

What s a control group (cgroup) group processes together organised in trees applying limits to them as a group

cgroup API /sys/fs/cgroup/*/ /proc/cgroups /proc/$pid/cgroup

cgroups

List of cgroup controllers /sys/fs/cgroup/ cpu devices freezer memory... systemd

Memory isolator limit : 500M Application Image Manifest [Service] ExecStart= MemoryLimit=500M write to memory.limit_in_ bytes systemd service file systemd action

CPU isolator limit : 500m Application Image Manifest [Service] ExecStart= CPUShares=512 systemd service file write to cpu.share systemd action

Unified cgroup hierarchy Multiple hierarchies: one cgroup mount point for each controller (memory, cpu...) flexible but complex cannot remount with a different set of controllers difficult to give to containers in a safe way Unified hierarchy: cgroup filesystem mounted only one time soon to be stable in Linux (mount option DEVEL sane_behavior being removed) initial implementation in systemd-v226 (September 2015) no support in rkt yet

Questions? Join us! github.com/coreos/rkt

@coreosfest coreos.com/fest May 9 & 10, 2016 Berlin, Germany Early bird tickets Sponsorships are still available Submit a talk before February 29th!