THE ROUTE TO ROOTLESS

BILL AND TED'S ROOTLESS ADVENTURE

WHAT SECURITY PROBLEM IS GARDEN SOLVING IN CLOUD FOUNDRY?

THE PROBLEM IN CLOUD FOUNDRY
- Public Multi-Tenant Docker Workloads

WHAT IS A CONTAINER?

THE GREATEST TRICK CONTAINERS EVER PULLED WAS CONVINCING THE WORLD THEY EXIST

WHAT IS A CONTAINER?
- Confinement
- Own view of the system
- Fair share of resources
- Unable to modify constraints
- Dependency Management

CONFINEMENT
- Linux Namespaces
- Cgroups
- Dropping Capabilities
- Seccomp
- AppArmor

LINUX NAMESPACES
Linux has global resources such as process trees, mount tables, and network devices. Namespaces wrap these global resources so that processes inside a namespace appear to have their own isolated instance of each one.

LINUX NAMESPACES
- PID - Process IDs
- MNT - Mount points
- NET - Network devices, stacks, ports, etc.
- UTS - Hostname and NIS domain name
- IPC - Inter-process communication
- USER - User and group IDs
- CGROUP - Cgroup root directory
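
To make the namespace list above concrete, here is a minimal sketch that reads the namespace handles of the current process from /proc/self/ns. The function name `current_namespaces` is made up for this illustration; the sketch assumes a Linux host and simply returns an empty result elsewhere.

```python
import os

# Namespace types as they are named under /proc/<pid>/ns on Linux.
KNOWN_NAMESPACES = ("pid", "mnt", "net", "uts", "ipc", "user", "cgroup")


def current_namespaces(proc_ns_dir="/proc/self/ns"):
    """Return {namespace type: identifier} for the calling process.

    Each entry under /proc/self/ns is a symlink whose target looks like
    'pid:[4026531836]'. Processes in the same namespace see the same
    identifier for that namespace type.
    """
    found = {}
    for name in KNOWN_NAMESPACES:
        link = os.path.join(proc_ns_dir, name)
        try:
            found[name] = os.readlink(link)
        except OSError:
            # Not on Linux, or this kernel lacks that namespace type.
            continue
    return found


if __name__ == "__main__":
    for name, ident in current_namespaces().items():
        print(f"{name:7s} {ident}")
```

Running this on the host and again inside a container should show different identifiers for every namespace type the container runtime has unshared.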

SHARING
- Control Groups
  - Resource limiting
  - Prioritization
  - Accounting
  - Control
- Disk Quotas
  - More on this trainwreck later!

CAPABILITIES
Historically, processes were either privileged (effective user ID 0, known as root) or unprivileged. Privileged processes bypass all kernel permission checks. Since kernel 2.2, Linux has divided superuser privileges into distinct units known as capabilities, which can be independently enabled or disabled.

CAPABILITIES
- CAP_SETUID (change uid)
- CAP_NET_BIND_SERVICE (listen on privileged ports)
- CAP_KILL (send signals to any process)
- CAP_CHOWN (chown any files)
- CAP_DAC_OVERRIDE (bypass permission checks)
- CAP_SYS_ADMIN (do all the things!)
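
A process's capabilities are stored as a bitmask, which the kernel exposes as the hex `CapEff` field in /proc/<pid>/status. The sketch below decodes such a mask for the six capabilities listed on this slide, using their bit numbers and spellings from <linux/capability.h>. The helper names `decode_capabilities` and `effective_mask` are invented for this example, and the table is deliberately incomplete.

```python
# Bit positions from <linux/capability.h> for the capabilities on this slide.
# This table is intentionally partial; the kernel defines many more.
CAPABILITY_BITS = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    5: "CAP_KILL",
    7: "CAP_SETUID",
    10: "CAP_NET_BIND_SERVICE",
    21: "CAP_SYS_ADMIN",
}


def decode_capabilities(mask):
    """Return the names of the known capabilities set in a bitmask."""
    names = []
    for bit in sorted(CAPABILITY_BITS):
        if mask & (1 << bit):
            names.append(CAPABILITY_BITS[bit])
    return names


def effective_mask(status_text):
    """Extract the CapEff mask from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    return None


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as f:
            mask = effective_mask(f.read())
    except OSError:
        mask = None  # not on Linux
    if mask is not None:
        print(decode_capabilities(mask))
```

An unprivileged process on the host would normally print an empty list, while the same process as root in a new user namespace would show the capabilities it holds inside that namespace.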

SECCOMP
- Seccomp stands for secure computing mode.
- It is a kernel sandboxing facility, available since Linux 2.6.12.
- Enabling seccomp on a process limits the system calls available to that process.
- It can also limit the arguments allowed in a system call (e.g. namespace clone flags).
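
The two kinds of checks described here (which system call, and which arguments) can be modelled in a few lines. This is a toy decision function, not a real seccomp BPF filter: the allowlist is hypothetical and chosen only for illustration, while the `CLONE_NEW*` values are the flag constants from <linux/sched.h>.

```python
# Namespace-creating clone flags from <linux/sched.h>.
CLONE_NEWNS = 0x00020000
CLONE_NEWCGROUP = 0x02000000
CLONE_NEWUTS = 0x04000000
CLONE_NEWIPC = 0x08000000
CLONE_NEWUSER = 0x10000000
CLONE_NEWPID = 0x20000000
CLONE_NEWNET = 0x40000000

NAMESPACE_FLAGS = (
    CLONE_NEWNS | CLONE_NEWCGROUP | CLONE_NEWUTS | CLONE_NEWIPC
    | CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET
)

# Hypothetical allowlist, for illustration only.
ALLOWED_SYSCALLS = {"read", "write", "exit_group", "clone"}


def decide(syscall, args=()):
    """Toy model of a seccomp policy: return 'allow' or 'deny'."""
    # First check: is the system call on the allowlist at all?
    if syscall not in ALLOWED_SYSCALLS:
        return "deny"
    # Second check: clone is allowed, but not with namespace flags.
    if syscall == "clone":
        flags = args[0] if args else 0
        if flags & NAMESPACE_FLAGS:
            return "deny"
    return "allow"


if __name__ == "__main__":
    print(decide("read"))
    print(decide("mount"))
    print(decide("clone", (CLONE_NEWUSER,)))
```

A real filter expresses the same two decisions as a BPF program that the kernel runs on every system call made by the confined process.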

APPARMOR
- AppArmor is a mandatory access control mechanism.
- Profiles are applied to running processes to limit their access to resources or privileges.
- Example rule: `deny @{PROC}/sysrq-trigger rwklx`

I GET KNOCKED DOWN, BUT I GET UP AGAIN
- CVE-2016-9962 (runc fd traversal): User Namespaces, Capability Dropping, AppArmor
- CVE-2017-16539 (SCSI MICDROP): User Namespaces, AppArmor
- CVE-2017-16995 (eBPF verifier vulnerability): Capability Dropping (sometimes), Seccomp

DEPENDENCY MANAGEMENT
- pivot_root(8)
- Layered filesystems

PIVOT ROOT
[Diagram, three steps: "What's in /?", "pivot_root!", "What's in /?". Labels: run.sh, Boring Host (Ubuntu), Cool Container (Busybox).]

LAYERED FILESYSTEMS
[Diagram, built up over four slides: run.sh; then run.sh, /bin/os-specific; then run.sh, /bin/os-specific, AMI; then run.sh, Δ B, Δ A, Base ROOTFS.]
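
The idea of deltas stacked on a base rootfs can be sketched as a lookup that walks the layers from the top down. This is a simplified model assuming each layer is just a mapping from path to contents; real layered filesystems also handle deletions, directories and metadata. The function name `resolve` and the sample paths are made up for the example.

```python
def resolve(path, layers):
    """Look a path up in a stack of layers.

    `layers` is ordered bottom-first (base rootfs first). The topmost
    layer that contains the path wins, so later deltas shadow earlier ones.
    """
    for layer in reversed(layers):
        if path in layer:
            return layer[path]
    return None


if __name__ == "__main__":
    base = {"/bin/sh": "busybox sh", "/etc/hostname": "base"}
    delta_a = {"/etc/hostname": "from delta A"}
    delta_b = {"/run.sh": "#!/bin/sh"}
    stack = [base, delta_a, delta_b]
    for p in ("/bin/sh", "/etc/hostname", "/run.sh", "/missing"):
        print(p, "->", resolve(p, stack))
```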

WHAT IS A CONTAINER?
- Confinement
- Own view of the system
- Fair share of resources
- Unable to modify constraints
- Dependency Management

SO EVERYTHING IS SECURE IN A CLOUD FOUNDRY CONTAINER RIGHT?

YES...BUT...

ROUTE TO ROOTLESS

CREDITS
- Jessie Frazelle (@jessfraz)
- Aleksa Sarai (@lordcyphar)
- Akihiro Suda (@_AkihiroSuda_)

WHAT IS A CONTAINER?
- Confinement
- Own view of the system
- Fair share of resources
- Unable to modify constraints
- Dependency Management

CONFINEMENT
- Unprivileged user namespaces
  - Since Linux 3.8
- Just need CAP_SYS_ADMIN in the owning user namespace to do the rest:
  - Other namespaces
  - Seccomp
  - AppArmor

HOW DO USER NAMESPACES WORK? Users are living a double life...

IN THE HOST

IN THE CONTAINER

HOW DO USER NAMESPACES WORK?
- /proc/self/uid_map and /proc/self/gid_map specify the user and group ID mappings between the outer and inner user namespaces.
- Mappings are triples of (first inner ID, first outer ID, range length).
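
To illustrate how those triples are applied, the sketch below parses uid_map-style text and translates IDs in both directions. The function names and the sample mapping (container root mapped to host UID 1000, followed by a block starting at 100000) are assumptions made up for the example.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map text into (inner_start, outer_start, length) triples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inner, outer, length = (int(f) for f in fields)
        entries.append((inner, outer, length))
    return entries


def to_outer(entries, inner_id):
    """Translate an ID seen inside the namespace to the ID outside it."""
    for inner, outer, length in entries:
        if inner <= inner_id < inner + length:
            return outer + (inner_id - inner)
    return None  # unmapped


def to_inner(entries, outer_id):
    """Translate an ID outside the namespace to the ID seen inside it."""
    for inner, outer, length in entries:
        if outer <= outer_id < outer + length:
            return inner + (outer_id - outer)
    return None  # unmapped


if __name__ == "__main__":
    # Hypothetical mapping: root in the container is UID 1000 on the host.
    sample = "0 1000 1\n1 100000 65535\n"
    entries = parse_id_map(sample)
    print(to_outer(entries, 0))      # container root as seen by the host
    print(to_inner(entries, 1000))   # the host user as seen in the container
```

This is the double life from the earlier slide: the same process is UID 0 inside the container and an ordinary unprivileged UID on the host.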

CONFINEMENT
- How do unprivileged user namespaces work?
- newuidmap/newgidmap
  - setuid binaries
  - validate mappings against /etc/subuid (and /etc/subgid for group IDs)
- PRed support to runc
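
The validation step can be sketched roughly as follows. This is a simplified model of the kind of check newuidmap performs, not its actual implementation: it assumes /etc/subuid lines of the form `name:start:count`, accepts a mapping only if every outer range lies inside one of the caller's grants, and also lets a user map their own single UID. The user name `alice` and the function names are hypothetical.

```python
def parse_subid(text):
    """Parse /etc/subuid-style text: one 'name:start:count' entry per line."""
    grants = {}
    for line in text.splitlines():
        parts = line.strip().split(":")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, start, count = parts
        grants.setdefault(name, []).append((int(start), int(count)))
    return grants


def mapping_allowed(grants, user, own_id, mappings):
    """Check requested (inner, outer, length) triples against a user's grants."""
    for _inner, outer, length in mappings:
        if length <= 0:
            return False
        # Assumption in this sketch: mapping your own single ID is always fine.
        if length == 1 and outer == own_id:
            continue
        in_grant = any(
            start <= outer and outer + length <= start + count
            for start, count in grants.get(user, [])
        )
        if not in_grant:
            return False
    return True


if __name__ == "__main__":
    grants = parse_subid("alice:100000:65536\n")
    wanted = [(0, 1000, 1), (1, 100000, 65535)]
    print(mapping_allowed(grants, "alice", 1000, wanted))
    print(mapping_allowed(grants, "alice", 1000, [(0, 0, 1)]))
```

The second call asks to map the host's real root into the container, which is exactly what the check is there to refuse.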

SHARING
- There's no way to do cgroups entirely unprivileged yet
- cgroups are a virtual filesystem, like proc
  - /sys/fs/cgroup/memory/**
  - Files are owned by root by default
- Privileged setup can chown cgroups to our container root user, so runc can write to them
- PRed support to runc
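
Because cgroups are exposed as files, setting a limit is an ordinary write once the directory has been chowned to the right user. A minimal sketch, assuming the cgroup v1 memory controller (whose limit file is `memory.limit_in_bytes`); the function names and the directory passed in are placeholders for this example.

```python
import os


def can_manage(cgroup_dir):
    """True if the current user may create and write files in this cgroup."""
    return os.access(cgroup_dir, os.W_OK | os.X_OK)


def set_memory_limit(cgroup_dir, limit_bytes):
    """Write a cgroup v1 memory limit.

    This only succeeds if a privileged setup step has already made
    cgroup_dir writable by the calling user.
    """
    path = os.path.join(cgroup_dir, "memory.limit_in_bytes")
    with open(path, "w") as f:
        f.write(str(limit_bytes))
    return path


if __name__ == "__main__":
    import tempfile

    # A temporary directory stands in for a chowned cgroup directory here.
    with tempfile.TemporaryDirectory() as fake_cgroup:
        if can_manage(fake_cgroup):
            print(set_memory_limit(fake_cgroup, 64 * 1024 * 1024))
```

Against a real, root-owned cgroup directory the same write would fail with a permission error, which is why the one-off privileged chown is needed.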

DEPENDENCY MANAGEMENT
- pivot_root(8)
  - User namespace gives us CAP_SYS_ADMIN

DEPENDENCY MANAGEMENT
- Layered filesystems
  - AUFS mounts
    - Not possible unprivileged
  - BTRFS Snapshots
  - OverlayFS mounts

DEPENDENCY MANAGEMENT
- Layered filesystems
  - AUFS mounts (not possible unprivileged)
  - BTRFS Snapshots
    - Initial setup as privileged
    - Snapshots can be done unprivileged
    - Exploded with quotas at scale :(
  - OverlayFS mounts (seems to be working?!)

DEPENDENCY MANAGEMENT
- Layered filesystems
  - AUFS mounts (not possible unprivileged)
  - BTRFS Snapshots
  - OverlayFS mounts
    - Possible on Ubuntu unprivileged
    - Seems to be working?!
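
For reference, an OverlayFS mount is described by an options string naming the read-only lower layers, a writable upper directory and a work directory. The sketch below only builds that string and does not perform the mount. It relies on the kernel's documented convention that the leftmost `lowerdir` entry is the topmost layer; the function name and all paths are invented for the example.

```python
def overlay_mount_options(layers, upperdir, workdir):
    """Build the options string for an overlay mount.

    `layers` is ordered bottom-first (base rootfs first). OverlayFS wants
    the topmost lower layer listed first, so the order is reversed here.
    """
    if not layers:
        raise ValueError("at least one lower layer is required")
    lower = ":".join(reversed(layers))
    return f"lowerdir={lower},upperdir={upperdir},workdir={workdir}"


if __name__ == "__main__":
    # Hypothetical layer paths: a base rootfs plus two deltas.
    opts = overlay_mount_options(
        ["/layers/base", "/layers/delta-a", "/layers/delta-b"],
        upperdir="/containers/c1/upper",
        workdir="/containers/c1/work",
    )
    print(opts)
```

The resulting string is what would be handed to a mount of type `overlay`; whether that mount is permitted inside an unprivileged user namespace depends on the kernel, as the slide notes.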

OVERLAY ROOTFUL

OVERLAY ROOTLESS

ROAD BLOCKS

DISK QUOTAS
- We use XFS for filesystem quotas
- XFS requires privilege
- Small, focused setuid binary just for quota management

NETWORKING
- Networking used to be integrated into Garden
- Garden supports a plugin architecture
- garden-external-networker is setuid
- Some awesome work going on from Aleksa Sarai and Akihiro Suda on this front!

GDN SETUP
- cgroup chowning in garden setup
  - No user input
  - Before any workload can be running in CF
- At least one attempt to fix this by Aleksa Sarai, but no luck yet

BUT IT'S OK...

REDUCING PRIVILEGE
- Reduce privilege where we can, when we can
- Break apart monoliths to only allow privilege where required
- Some things take time
- Proving out slowly is a positive thing
- It's getting better!

DOES IT WORK?

HOW CAN I TRY IT?
- Experimental config option in BOSH manifest

THANKS

"ET TU ROOT?" - JULIUS CAESAR