Containers and isolation as implemented in the Linux kernel

Technical Deep Dive Session
Hannes Frederic Sowa <hannes@redhat.com>, Senior Software Engineer
13 September 2016

Outline

Containers and isolation as implemented in the Linux kernel: learned from history, then enhanced and innovated in Free Software.

- Overview of not-so-recent history from other operating systems
- Representation and control from user space
- Implementation details in the kernel
- What is to come?

History of operating system isolation

Plan 9: per-process namespaces
- Distributed computing
- Architecture-specific files mapped via bind/union mounts
- User space servers via the 9P protocol
- Directory vnodes had an append operation
- Not yet implemented in Linux: RPC via AF_UNIX over NFS

History of operating system isolation (continued)

POSIX chroot
- Available as a syscall, thus usable in self-written applications
- Provides a new filesystem view, thus only limited isolation

FreeBSD's jails
- Strongly integrated into the operating system
- Only a small helper library available
- No operating system control and tuning
- Limited network isolation, based only on IP addresses

Solaris Zones
- Strongly integrated into the operating system (even the package manager)
- Tooling is dictated by Solaris tools

Namespace API design in Linux

- Isolation and resource management are completely decoupled
- The API was never tightly coupled to any user space library
- Syscalls are openly documented and reusable by third-party software
- Management is available on/with already known kernel primitives
- With rather primitive tools, nearly no new tools were needed
- Fine-grained control over which primitives to namespace
- Paved the path to a lot of user space frameworks (e.g. Docker)
- Opt-in model
- Easy to enhance in user space as well as in the kernel

Isolation vs. Resource Management

Not completely orthogonal, but still...

[Figure: Processes 1 to 4 grouped by cgroups (cgroup1, cgroup2) for resource management and, independently, by namespaces (ns1, ns2) for isolation.]

Namespaces in regular use

Even on non-servers, namespaces see regular use nowadays:

    $ lsns
            NS TYPE NPROCS  PID USER COMMAND
    4026531836 pid      63           /usr/lib/systemd/systemd --user
    4026531837 user     63           /usr/lib/systemd/systemd --user
    4026531838 uts      70           /usr/lib/systemd/systemd --user
    4026531839 ipc      70           /usr/lib/systemd/systemd --user
    4026531840 mnt      70           /usr/lib/systemd/systemd --user
    4026531969 net      63           /usr/lib/systemd/systemd --user
    4026532501 pid       2 3485      /opt/google/chrome/chrome --type=zygote
    4026532503 net       6 3485      /opt/google/chrome/chrome --type=zygote
    4026532621 pid       1 3486      /opt/google/chrome/nacl_helper
    4026532623 net       1 3486      /opt/google/chrome/nacl_helper
    4026532724 user      1 3486      /opt/google/chrome/nacl_helper
    4026532725 user      6 3485      /opt/google/chrome/chrome --type=zygote
    ...

Namespace API wrap-up

- No dependencies on third-party libraries or tools
- No design mandated by the operating system or distributions
- Resource management is independent from isolation
- Made several management tools possible (some specialized): iproute2, systemd, rkt, Docker, LXC, LXD, lmctfy, runc
- Own choice whether to use a complete distribution, a specialized init, or just run the application directly in a namespace
- OpenVZ/Virtuozzo is reusing and contributing to namespaces upstream


Representation and control from user space

Every process is associated with one namespace of each type:

    # ls -l /proc/self/ns/
    total 0
    lrwxrwxrwx. 1 root root 0 12. Sep 22:09 cgroup -> 'cgroup:[4026531835]'
    lrwxrwxrwx. 1 root root 0 12. Sep 22:09 ipc -> 'ipc:[4026531839]'
    lrwxrwxrwx. 1 root root 0 12. Sep 22:09 mnt -> 'mnt:[4026531840]'
    lrwxrwxrwx. 1 root root 0 12. Sep 22:09 net -> 'net:[4026531969]'
    lrwxrwxrwx. 1 root root 0 12. Sep 22:09 pid -> 'pid:[4026531836]'
    lrwxrwxrwx. 1 root root 0 12. Sep 22:09 user -> 'user:[4026531837]'
    lrwxrwxrwx. 1 root root 0 12. Sep 22:09 uts -> 'uts:[4026531838]'
    # unshare -n    # -n: unshare the network namespace
    # ls -l /proc/self/ns/net
    lrwxrwxrwx. 1 root root 0 12. Sep 22:10 /proc/self/ns/net -> 'net:[4026532727]'
    #
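Because each entry under /proc/<pid>/ns is a symlink whose target names the namespace type and its inode number, two processes share a namespace exactly when the two link targets are equal. A minimal sketch of that comparison follows; the helper names ns_type, ns_inode and same_ns are made up for illustration and are not part of any existing tool.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# ns_type: print the type part of a link target such as "net:[4026531969]".
ns_type() {
    printf '%s\n' "${1%%:*}"
}

# ns_inode: print the inode number from the same kind of string.
ns_inode() {
    target=$1
    inode=${target#*\[}   # drop everything up to and including "["
    inode=${inode%\]}     # drop the trailing "]"
    printf '%s\n' "$inode"
}

# same_ns PID1 PID2 TYPE: succeed if both processes share that namespace.
# Returns 2 if one of the links cannot be read.
same_ns() {
    a=$(readlink "/proc/$1/ns/$3") || return 2
    b=$(readlink "/proc/$2/ns/$3") || return 2
    [ "$a" = "$b" ]
}

# Example: print the network namespace inode of the current shell,
# if this system exposes /proc/self/ns.
if [ -e /proc/self/ns/net ]; then
    ns_inode "$(readlink /proc/self/ns/net)"
fi
```

For example, `same_ns $$ 1 net && echo shared` would report whether the current shell and PID 1 live in the same network namespace, provided the caller is allowed to read the links of PID 1.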

Making namespaces persistent

Managing namespaces as a mountpoint:

    # unshare -n    # -n: unshare the network namespace
    # ls -l /proc/self/ns/net
    lrwxrwxrwx. 1 root root 0 12. Sep 22:10 /proc/self/ns/net -> 'net:[4026532727]'
    # touch /run/netns/my_namespace1
    # mount -o bind /proc/self/ns/net /run/netns/my_namespace1
    # ls -i /run/netns/my_namespace1
    4026532727 /run/netns/my_namespace1
    # exit
    # readlink /proc/self/ns/net
    net:[4026531969]
    # nsenter --net=/run/netns/my_namespace1
    # readlink /proc/self/ns/net
    net:[4026532727]
    #

User namespaces

- User namespaces have a special role, as they directly influence permission control
- They allow a user to become root inside a user-created namespace
- They disassociate permissions from the parent namespace

Example:

    $ id -u
    1000
    $ unshare --user -r bash
    # id -u
    0
    # unshare -n
    # nc -l 80    # netcat is allowed to bind to port 80
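The "root inside, ordinary user outside" effect comes from the UID map of the user namespace, which is readable at /proc/<pid>/uid_map as lines of the form "inside-start outside-start length". As a sketch of how such a map is applied, the hypothetical helper map_uid below translates an inside UID to the outside UID. It assumes well-formed input without blank lines.

```shell
#!/bin/sh
# map_uid: hypothetical helper, for illustration only.
# Reads uid_map-style lines on stdin ("inside-start outside-start length")
# and prints the outside UID for the inside UID given as $1.
# Returns 1 if the UID is not covered by any line (unmapped).
map_uid() {
    want=$1
    while read -r inside outside length; do
        if [ "$want" -ge "$inside" ] && [ "$want" -lt $((inside + length)) ]; then
            echo $((outside + want - inside))
            return 0
        fi
    done
    return 1
}

# "unshare --user -r" run by UID 1000 results in the single line "0 1000 1":
# UID 0 inside the namespace is UID 1000 outside of it.
printf '0 1000 1\n' | map_uid 0
```

Any UID outside the mapped range has no counterpart in the parent namespace, which is why the helper reports it as unmapped instead of guessing a value.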

Easier management: netns

OpenStack already uses a lightweight wrapper around these to manage netns:

    # ip netns add foo
    # ip netns add bar
    # ip link add type veth
    # ip link set dev veth0 netns foo
    # ip link set dev veth1 netns bar
    # ip netns exec foo bash
    # ip l
    1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    2: ip_vti0@none: <NOARP> mtu 1332 qdisc noop state DOWN mode DEFAULT group default qlen 1
        link/ipip 0.0.0.0 brd 0.0.0.0
    47: veth0@if48: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
        link/ether ce:e5:a7:2f:d5:69 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    # exit

Representation wrap-up

- Namespaces are internally represented via normal inodes living in their own filesystem, and these are globally valid
- Thus file descriptor passing works as usual
- Persisting a namespace is simply achieved by bind mounting the representative file to a stable location
- Easy, atomic utilities map directly to the respective syscalls:
  - unshare(1) -> unshare(2) or clone(2)
  - nsenter(1) -> setns(2)
  - mount is really just mounting
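To make the mapping from utility to syscall concrete, the sketch below prints the CLONE_NEW* bit mask that unshare(2) would receive for a set of unshare(1) short options. The function name unshare_flags is hypothetical and not part of util-linux. The constants are the ones defined in <linux/sched.h>.

```shell
#!/bin/sh
# unshare_flags: hypothetical helper, for illustration only.
# Prints the flag mask for unshare(2) that corresponds to the given
# unshare(1) short options.
unshare_flags() {
    mask=0
    for opt in "$@"; do
        case $opt in
            -m) mask=$((mask | 0x00020000)) ;;  # CLONE_NEWNS (mount)
            -C) mask=$((mask | 0x02000000)) ;;  # CLONE_NEWCGROUP
            -u) mask=$((mask | 0x04000000)) ;;  # CLONE_NEWUTS
            -i) mask=$((mask | 0x08000000)) ;;  # CLONE_NEWIPC
            -U) mask=$((mask | 0x10000000)) ;;  # CLONE_NEWUSER
            -p) mask=$((mask | 0x20000000)) ;;  # CLONE_NEWPID
            -n) mask=$((mask | 0x40000000)) ;;  # CLONE_NEWNET
            *)  echo "unknown option: $opt" >&2; return 1 ;;
        esac
    done
    printf '0x%08x\n' "$mask"
}

# "unshare -n -m" asks for a new network and a new mount namespace.
unshare_flags -n -m
```

The same flag bits are accepted by clone(2), which is why a container runtime can create the child process and its namespaces in one step instead of calling unshare(2) afterwards.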


Implementation details in the kernel

- struct user_namespace: establishes its own configurable UID and GID mapping
- struct nsproxy
  - struct uts_namespace: isolates hostname and domainname (e.g. for auth purposes)
  - struct ipc_namespace: isolates (POSIX/svipc) mqueue, semaphores, shared memory
  - struct net: controls isolation of network interfaces, routing tables, IP addresses
  - struct pid_namespace: isolates the process tree and PID numbers
  - struct mnt_namespace: abstraction and isolation over the filesystem views
  - struct cgroup_namespace (recent development): control group namespace, isolates resource management

Mount namespace

- The most important namespace, as it also provides the isolation for /proc and (partially) for sysfs, which should get remounted in a new container
- Mount namespaces basically form trees in the kernel, which can be partially overlapping (mount subtrees)
- A process is attached to one subtree
- Discovered via nsproxy

System configuration (netns)

Configuration, routing tables, firewall etc. are all separated per network namespace. How?

- System configuration is mostly done via sysctl
- A lot of sysctls are manageable per namespace
- Each network namespace has its own sysctls in struct net
- Incoming packets use the configuration of the network namespace of the incoming interface
- Outgoing packets can use the socket's namespace (locally generated) or the device context
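From user space, the per-namespace sysctls show up under /proc/sys/net, and a process reading that tree sees the values of its own network namespace. The sketch below only converts a dotted sysctl key to the corresponding path. The helper names sysctl_path and read_sysctl are hypothetical, and keys whose components themselves contain dots (for example VLAN interface names) are not handled.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# sysctl_path: turn a dotted sysctl key into its /proc/sys path.
sysctl_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# read_sysctl: read a key as seen by the calling process, that is, with
# the values of the network namespace the caller currently lives in.
read_sysctl() {
    cat "$(sysctl_path "$1")"
}

# Example: where the IPv4 forwarding switch of this namespace lives.
sysctl_path net.ipv4.ip_forward
```

Running `read_sysctl net.ipv4.ip_forward` once on the host and once inside `ip netns exec foo` is a simple way to observe that the two namespaces keep separate values for this key.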


What is coming?

- Basically, the namespace concept is architecturally completely implemented
- New features added to the kernel are already designed in an orthogonal way or can correctly deal with namespaces
- The network namespace is heavyweight, thus:
  - Connecting a netns to the outside world requires one virtual router or bridge
  - Alternatives exist but are architecturally a dead end
    - ipvlan: multiplexes IP addresses on one interface
    - macvlan: multiplexes MAC addresses on one interface
  - Provide isolation on the IP layer like FreeBSD jails or Solaris
  - Maybe even extended to act like VRF with sockets
