Containers and isolation as implemented in the Linux kernel

Technical Deep Dive Session
Hannes Frederic Sowa <hannes@redhat.com>, Senior Software Engineer
13 September 2016

Outline
Learned from history and enhanced and innovated in Free Software.
- Overview of not-so-recent history from other operating systems
- Representation and control from user space
- Implementation details in the kernel
- What is to come?

History of operating system isolation
Plan 9
- Per-process namespaces
- Distributed computing
- Architecture-specific files mapped via bind/union mounts
- User space servers via the 9P protocol
- Directory vnodes had an append operation
- Not yet implemented in Linux: RPC via AF_UNIX over NFS

History of operating system isolation
POSIX chroot
- Available as a syscall, thus usable in self-written applications
- Provides a new filesystem view, thus only limited isolation
FreeBSD jails
- Strongly integrated into the operating system
- Only a small helper library available
- No operating system control and tuning
- Limited network isolation, based only on IP addresses
Solaris Zones
- Strongly integrated into the operating system (even the package manager)
- Tooling is dictated by Solaris tools

Namespace API design in Linux
- Isolation and resource management are completely decoupled
- The API was never tightly coupled to any user space library
- Syscalls are openly documented and reusable by third-party software
- Management builds on already known kernel primitives
- Rather primitive tools suffice; nearly no new tools were needed
- Fine-grained control over which primitives to namespace
- Paved the way for many user space frameworks (e.g. Docker)
- Opt-in model
- Easy to enhance in user space as well as in the kernel

Isolation vs. Resource Management
Not completely orthogonal but still...
[Figure: Processes 1 to 4 grouped by cgroups (cgroup1, cgroup2: resource management) and by namespaces (ns1, ns2: isolation).]
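
The separation of the two mechanisms is visible from user space: cgroup membership is exposed in /proc/<pid>/cgroup, while namespace membership is exposed under /proc/<pid>/ns. A minimal sketch of reading the cgroup side follows. The function name cgroups_of, the PROC_ROOT override, and the "unified" label are illustrative assumptions, not part of the talk.

```shell
#!/bin/sh
# Sketch: print the cgroup membership of a process.
# PROC_ROOT is a hypothetical override that lets the function read a test tree.
PROC_ROOT="${PROC_ROOT:-/proc}"

# Each line of /proc/<pid>/cgroup has the form
#   hierarchy-ID:controller-list:cgroup-path
# The controller list is empty for the cgroup v2 hierarchy; the label
# "unified" printed in that case is an arbitrary choice for this sketch.
cgroups_of() {
    while IFS=: read -r id controllers path; do
        printf '%s %s\n' "${controllers:-unified}" "$path"
    done < "$PROC_ROOT/${1:-self}/cgroup"
}
```

Calling cgroups_of without an argument inspects the current shell. Nothing in the output refers to a namespace, which matches the decoupling shown on the slide.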

Namespaces in regular use
Even on non-servers, namespaces see regular use nowadays:

    $ lsns
            NS TYPE  NPROCS   PID USER COMMAND
    4026531836 pid       63            /usr/lib/systemd/systemd --user
    4026531837 user      63            /usr/lib/systemd/systemd --user
    4026531838 uts       70            /usr/lib/systemd/systemd --user
    4026531839 ipc       70            /usr/lib/systemd/systemd --user
    4026531840 mnt       70            /usr/lib/systemd/systemd --user
    4026531969 net       63            /usr/lib/systemd/systemd --user
    4026532501 pid        2  3485      /opt/google/chrome/chrome --type=zygote
    4026532503 net        6  3485      /opt/google/chrome/chrome --type=zygote
    4026532621 pid        1  3486      /opt/google/chrome/nacl_helper
    4026532623 net        1  3486      /opt/google/chrome/nacl_helper
    4026532724 user       1  3486      /opt/google/chrome/nacl_helper
    4026532725 user       6  3485      /opt/google/chrome/chrome --type=zygote
    ...
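
The same information can be read without lsns, because every process exposes its namespaces as symbolic links under /proc/<pid>/ns. The following is a minimal sketch under that assumption. The helper names ns_inode and list_ns and the PROC_ROOT override are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Sketch: list the namespaces of one process by reading /proc/<pid>/ns.
# PROC_ROOT is a hypothetical override that lets the functions read a test tree.
PROC_ROOT="${PROC_ROOT:-/proc}"

# Extract the inode number from a link target such as "net:[4026531969]".
ns_inode() {
    target=$1
    target=${target#*\[}    # drop everything up to and including "["
    target=${target%\]}     # drop the trailing "]"
    printf '%s\n' "$target"
}

# Print "<type> <inode>" for every namespace link of a PID (default: self).
list_ns() {
    pid=${1:-self}
    for link in "$PROC_ROOT/$pid/ns"/*; do
        [ -L "$link" ] || continue
        printf '%s %s\n' "${link##*/}" "$(ns_inode "$(readlink "$link")")"
    done
}
```

Running list_ns with no argument prints the namespaces of the shell itself. Inspecting another user's process this way generally needs root.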

Namespace API wrap-up
- No dependencies on third-party libraries or tools
- No design mandated by the operating system or by distributions
- Resource management is independent from isolation
- Made several management tools possible (some specialized): iproute2, systemd, rkt, Docker, LXC, LXD, lmctfy, runc
- Free choice between a complete distribution, a specialized init, or just running the application directly in a namespace
- OpenVZ/Virtuozzo is reusing namespaces and contributing to them upstream

Representation and control from user space
Each process is associated with one namespace of each type:

    # ls -l /proc/self/ns/
    total 0
    lrwxrwxrwx. 1 root root 0 12. Sep 22:09 cgroup -> 'cgroup:[4026531835]'
    lrwxrwxrwx. 1 root root 0 12. Sep 22:09 ipc -> 'ipc:[4026531839]'
    lrwxrwxrwx. 1 root root 0 12. Sep 22:09 mnt -> 'mnt:[4026531840]'
    lrwxrwxrwx. 1 root root 0 12. Sep 22:09 net -> 'net:[4026531969]'
    lrwxrwxrwx. 1 root root 0 12. Sep 22:09 pid -> 'pid:[4026531836]'
    lrwxrwxrwx. 1 root root 0 12. Sep 22:09 user -> 'user:[4026531837]'
    lrwxrwxrwx. 1 root root 0 12. Sep 22:09 uts -> 'uts:[4026531838]'
    # unshare -n    # -n: unshare the network namespace
    # ls -l /proc/self/ns/net
    lrwxrwxrwx. 1 root root 0 12. Sep 22:10 /proc/self/ns/net -> 'net:[4026532727]'
    #
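
Because the link target encodes the namespace identity, two processes share a namespace exactly when their links for that type read the same. A small sketch of such a check follows. The function name same_ns and the PROC_ROOT override are assumptions made for this example.

```shell
#!/bin/sh
# Sketch: test whether two processes are in the same namespace of one type.
# PROC_ROOT is a hypothetical override that lets the function read a test tree.
PROC_ROOT="${PROC_ROOT:-/proc}"

# Usage: same_ns PID1 PID2 TYPE   (TYPE: cgroup, ipc, mnt, net, pid, user, uts)
# Returns 0 if shared, 1 if different, 2 if a link could not be read.
same_ns() {
    a=$(readlink "$PROC_ROOT/$1/ns/$3") || return 2
    b=$(readlink "$PROC_ROOT/$2/ns/$3") || return 2
    [ "$a" = "$b" ]
}
```

For example, `same_ns 1 $$ net && echo shared` would report whether the current shell still shares the network namespace of PID 1. Reading the links of PID 1 usually requires root.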

Making namespaces persistent
Managing namespaces as a mountpoint:

    # unshare -n    # -n: unshare the network namespace
    # ls -l /proc/self/ns/net
    lrwxrwxrwx. 1 root root 0 12. Sep 22:10 /proc/self/ns/net -> 'net:[4026532727]'
    # touch /run/netns/my_namespace1
    # mount -o bind /proc/self/ns/net /run/netns/my_namespace1
    # ls -i /run/netns/my_namespace1
    4026532727 /run/netns/my_namespace1
    # exit
    # readlink /proc/self/ns/net
    net:[4026531969]
    # nsenter --net=/run/netns/my_namespace1
    # readlink /proc/self/ns/net
    net:[4026532727]
    #
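
The pinning steps from the session can be wrapped into two small helpers. This is only a sketch: the names pin_netns, unpin_netns, and run as well as the DRY_RUN switch are hypothetical. By default the commands are only printed; executing them requires root.

```shell
#!/bin/sh
# Sketch: keep a network namespace alive by bind mounting its /proc link.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Pin the network namespace of process $1 under the name $2.
pin_netns() {
    pid=$1 name=$2
    run mkdir -p /run/netns
    run touch "/run/netns/$name"
    run mount -o bind "/proc/$pid/ns/net" "/run/netns/$name"
}

# Remove the pin again. The namespace itself goes away once no process,
# open file descriptor, or bind mount references it any more.
unpin_netns() {
    run umount "/run/netns/$1"
    run rm -f "/run/netns/$1"
}
```

With `DRY_RUN=0 pin_netns $$ my_namespace1` the namespace of the current shell would stay reachable through nsenter even after the shell exits.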

User namespaces
- User namespaces have a special role, as they directly influence permission control
- They allow a user to become root inside a namespace the user created
- Permissions are dissociated from the parent namespace
Example:

    $ id -u
    1000
    $ unshare --user -r bash
    # id -u
    0
    # unshare -n
    # nc -l 80    # netcat is allowed to bind to port 80
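
Becoming root inside the namespace works through an ID mapping that the kernel exposes in /proc/<pid>/uid_map, one range per line in the form "first-ID-inside first-ID-outside length". In the session above, `unshare -r` would typically install the single line "0 1000 1". The sketch below translates a UID through such a mapping; the function names map_uid and uid_to_parent are illustrative assumptions.

```shell
#!/bin/sh
# Sketch: translate a UID as seen inside a user namespace to the UID
# it corresponds to in the parent namespace.

# Usage: map_uid UID INSIDE OUTSIDE COUNT
# INSIDE, OUTSIDE, and COUNT are the three fields of one uid_map line.
map_uid() {
    uid=$1 inside=$2 outside=$3 count=$4
    if [ "$uid" -ge "$inside" ] && [ "$uid" -lt $((inside + count)) ]; then
        echo $((outside + uid - inside))
    else
        return 1    # UID is not covered by this range
    fi
}

# Usage: uid_to_parent UID [MAPFILE]
# Tries every range in MAPFILE (default: the map of the current process).
uid_to_parent() {
    mapfile=${2:-/proc/self/uid_map}
    while read -r inside outside count; do
        if map_uid "$1" "$inside" "$outside" "$count"; then
            return 0
        fi
    done < "$mapfile"
    return 1
}
```

With the mapping "0 1000 1", `map_uid 0 0 1000 1` prints 1000: root inside the namespace is still the unprivileged user 1000 outside of it.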

Easier management: netns
OpenStack already uses a lightweight wrapper around these to manage netns:

    # ip netns add foo
    # ip netns add bar
    # ip link add type veth
    # ip link set dev veth0 netns foo
    # ip link set dev veth1 netns bar
    # ip netns exec foo bash
    # ip l
    1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    2: ip_vti0@none: <NOARP> mtu 1332 qdisc noop state DOWN mode DEFAULT group default qlen 1
        link/ipip 0.0.0.0 brd 0.0.0.0
    47: veth0@if48: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
        link/ether ce:e5:a7:2f:d5:69 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    # exit
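
In the session above both veth ends are still DOWN and have no addresses. One way to continue, sketched here, is to assign addresses and bring the links up so the two namespaces can reach each other. The addresses from the documentation range 192.0.2.0/24, the function name connect_netns, and the DRY_RUN switch are assumptions of this example. By default the commands are only printed; executing them requires root.

```shell
#!/bin/sh
# Sketch: connect two network namespaces with a veth pair.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: connect_netns NS_A NS_B
connect_netns() {
    ns_a=$1 ns_b=$2
    run ip netns add "$ns_a"
    run ip netns add "$ns_b"
    # Create the pair in the current namespace, then move one end into each.
    run ip link add veth0 type veth peer name veth1
    run ip link set dev veth0 netns "$ns_a"
    run ip link set dev veth1 netns "$ns_b"
    # Configure each end from inside its namespace (example addresses).
    run ip -n "$ns_a" addr add 192.0.2.1/24 dev veth0
    run ip -n "$ns_a" link set dev veth0 up
    run ip -n "$ns_b" addr add 192.0.2.2/24 dev veth1
    run ip -n "$ns_b" link set dev veth1 up
    # Verify connectivity from the first namespace to the second.
    run ip netns exec "$ns_a" ping -c 1 192.0.2.2
}
```

`connect_netns foo bar` prints the full command sequence for review; `DRY_RUN=0 connect_netns foo bar` would execute it.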

Representation wrap-up
- Namespaces are internally represented by normal inodes living in their own filesystem; these are globally valid
- Thus file descriptor passing works as usual
- Persisting a namespace is achieved simply by bind mounting the representative file to a stable location
- Simple, atomic utilities map directly to the corresponding syscalls:
  - unshare(1): unshare(2) or clone(2)
  - nsenter(1): setns(2)
  - mount is really just mounting

Implementation details in the kernel
- struct user_namespace: establishes its own configurable UID and GID mapping
- struct nsproxy, which references:
  - struct uts_namespace: isolates hostname and domainname (e.g. for auth purposes)
  - struct ipc_namespace: isolates (POSIX/svipc) mqueue, semaphores, shared memory
  - struct mnt_namespace: abstraction and isolation over the filesystem views
  - struct pid_namespace: isolates the process tree and PID numbers
  - struct net: controls isolation of network interfaces, routing tables, IP addresses
  - struct cgroup_namespace (recent development): control group namespace, isolates resource management

Mount namespace
- The most important namespace, as it also provides the isolation for /proc and (partially) for sysfs, which should be remounted in a new container
- Mount namespaces basically form trees in the kernel, which can partially overlap (mount subtrees)
- A process is attached to one subtree
- Discovered via nsproxy
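
The mount tree a process sees can be inspected through /proc/<pid>/mountinfo, which reflects the mount namespace of that process. The sketch below prints the mount point and filesystem type of every entry. The function name mounts_of and the PROC_ROOT override are assumptions of this example.

```shell
#!/bin/sh
# Sketch: show the mount points visible in the mount namespace of a process.
# PROC_ROOT is a hypothetical override that lets the function read a test tree.
PROC_ROOT="${PROC_ROOT:-/proc}"

# Print "<mount point> <filesystem type>" for each mount of a PID
# (default: self). In mountinfo, field 5 is the mount point; a variable
# number of optional fields follows field 6 and is terminated by "-",
# after which the filesystem type comes first.
mounts_of() {
    awk '{
        for (i = 7; i <= NF; i++)
            if ($i == "-") { print $5, $(i + 1); break }
    }' "$PROC_ROOT/${1:-self}/mountinfo"
}
```

Comparing `mounts_of` from the host with the output inside `unshare --mount --pid --fork --mount-proc sh` (run as root) shows a freshly mounted /proc that belongs to the new namespaces.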

System configuration (netns)
Configuration, routing tables, firewall rules, etc. are all separated per network namespace. How?
- System configuration is mostly done via sysctl
- Many sysctls are manageable per namespace
- Each network namespace has its own sysctl state in struct net
- Incoming packets use the configuration of the network namespace of the incoming interface
- Outgoing packets can use the namespace of the socket (if locally generated) or of the device
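
Per-namespace sysctls can be observed by reading or writing the same key from inside different network namespaces. A minimal sketch using `ip netns exec` follows. The helper name netns_sysctl, the namespace name foo in the usage note, and the DRY_RUN switch are assumptions. By default the command is only printed; executing it requires root and an existing named namespace.

```shell
#!/bin/sh
# Sketch: read or set a sysctl inside a named network namespace.
# DRY_RUN=1 (the default here) prints the command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: netns_sysctl NAMESPACE KEY [VALUE]
# Without VALUE the key is read, with VALUE it is written.
netns_sysctl() {
    ns=$1 key=$2
    if [ $# -ge 3 ]; then
        run ip netns exec "$ns" sysctl -w "$key=$3"
    else
        run ip netns exec "$ns" sysctl "$key"
    fi
}
```

For instance, `DRY_RUN=0 netns_sysctl foo net.ipv4.ip_forward 1` would enable IPv4 forwarding in foo only, while `sysctl net.ipv4.ip_forward` on the host keeps reporting the host's own value.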

What is coming?
- Basically, the namespace concept is architecturally completely implemented
- New features added to the kernel are already designed in an orthogonal way or can correctly deal with namespaces
- Network namespaces are heavyweight, thus:
  - Connecting a netns to the outside world requires a virtual router or bridge
  - Alternatives exist but are architecturally a dead end
    - ipvlan: multiplexes IP addresses on one interface
    - macvlan: multiplexes MAC addresses on one interface
  - Provide isolation on the IP layer, like FreeBSD jails or Solaris
  - Maybe even extended to act like VRF with sockets
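
For reference, the two existing alternatives can be set up with iproute2 roughly as sketched below. The parent interface eth0, the sub-interface names, the chosen modes (bridge for macvlan, l2 for ipvlan), the function name attach_link, and the DRY_RUN switch are all assumptions of this example. By default the commands are only printed; executing them requires root.

```shell
#!/bin/sh
# Sketch: give a network namespace its own macvlan or ipvlan sub-interface.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: attach_link KIND PARENT NAME NAMESPACE
attach_link() {
    kind=$1 parent=$2 name=$3 ns=$4
    case $kind in
        macvlan) mode=bridge ;;   # own MAC address per sub-interface
        ipvlan)  mode=l2 ;;       # shared MAC address, own IP addresses
        *) echo "unknown kind: $kind" >&2; return 1 ;;
    esac
    run ip link add link "$parent" name "$name" type "$kind" mode "$mode"
    run ip link set dev "$name" netns "$ns"
}
```

`attach_link macvlan eth0 mv0 foo` prints the two commands that would create mv0 on top of eth0 and move it into the namespace foo.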

THANK YOU