THE ROUTE TO ROOTLESS
BILL AND TED'S ROOTLESS ADVENTURE
WHAT SECURITY PROBLEM IS GARDEN SOLVING IN CLOUD FOUNDRY?
THE PROBLEM IN CLOUD FOUNDRY Public Multi-Tenant Docker Workloads
WHAT IS A CONTAINER?
THE GREATEST TRICK CONTAINERS EVER PULLED WAS CONVINCING THE WORLD THEY EXIST
WHAT IS A CONTAINER? Confinement; own view of the system; fair share of resources; unable to modify constraints; dependency management.
CONFINEMENT Linux namespaces, cgroups, dropping capabilities, seccomp, AppArmor.
LINUX NAMESPACES Linux has global resources such as process trees, mount tables, and network devices. Namespaces wrap these global system resources so that processes within a namespace appear to have their own isolated instance of the resource.
LINUX NAMESPACES PID: process IDs; MNT: mount points; NET: network devices, stacks, ports, etc.; UTS: hostname and NIS domain name; IPC: inter-process communication; USER: user and group IDs; CGROUP: cgroup root directory.
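To make the namespace list above concrete: on Linux, each process exposes its namespaces as symlinks under /proc/<pid>/ns, with targets of the form `pid:[4026531836]`. Two processes share a namespace when the type and inode number match. The sketch below parses those link targets; the helper names (`parse_ns_link`, `same_namespace`, `current_namespaces`) are made up for illustration and are not part of Garden or runc.

```python
import os
import re

# Link targets under /proc/<pid>/ns look like "pid:[4026531836]".
NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (namespace type, inode number)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError("not a namespace link target: %r" % target)
    return match.group("kind"), int(match.group("inode"))


def same_namespace(link_a, link_b):
    """Two link targets name the same namespace if type and inode match."""
    return parse_ns_link(link_a) == parse_ns_link(link_b)


def current_namespaces(ns_dir="/proc/self/ns"):
    """Map each entry in ns_dir to its namespace inode (Linux only)."""
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        try:
            target = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # entry vanished or is not readable; skip it
        result[name] = parse_ns_link(target)[1]
    return result
```

On a Linux host, comparing `current_namespaces()` from inside and outside a container should show different inodes for each namespace the container does not share with the host.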
SHARING Control groups: resource limiting, prioritization, accounting, control. Disk quotas: more on this trainwreck later!
CAPABILITIES Historically, processes were either privileged (effective user ID = 0, known as root) or unprivileged. Privileged processes bypassed all kernel permission checks. Since kernel 2.2, Linux has divided superuser permissions into distinct units known as capabilities, which can be independently enabled or disabled.
CAPABILITIES CAP_SETUID (change uid); CAP_NET_BIND_SERVICE (listen on privileged ports); CAP_KILL (send signals to any process); CAP_CHOWN (chown any files); CAP_DAC_OVERRIDE (bypass permission checks); CAP_SYS_ADMIN (do all the things!).
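A process's capability sets are visible as hexadecimal bitmasks in /proc/<pid>/status (for example the `CapEff` line). As a rough sketch, the bitmask can be decoded and "dropped" like this. The bit numbers are those from the Linux capability header for the capabilities listed above; the function names are hypothetical, and this only manipulates the mask as a string, it does not change any real process.

```python
# Bit positions for the capabilities named on the slide.
CAP_BITS = {
    "CAP_CHOWN": 0,
    "CAP_DAC_OVERRIDE": 1,
    "CAP_KILL": 5,
    "CAP_SETUID": 7,
    "CAP_NET_BIND_SERVICE": 10,
    "CAP_SYS_ADMIN": 21,
}


def decode_caps(hex_mask):
    """Return the known capability names set in a CapEff-style hex mask."""
    mask = int(hex_mask, 16)
    return sorted(name for name, bit in CAP_BITS.items() if mask & (1 << bit))


def drop_caps(hex_mask, *names):
    """Return a new 16-digit hex mask with the named capabilities cleared."""
    mask = int(hex_mask, 16)
    for name in names:
        mask &= ~(1 << CAP_BITS[name])
    return format(mask, "016x")
```

For instance, a mask with only bits 7 and 21 set decodes to CAP_SETUID and CAP_SYS_ADMIN, and dropping CAP_SYS_ADMIN leaves only bit 7.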
SECCOMP Seccomp stands for secure computing mode. It has been a kernel sandboxing facility since Linux 2.6.12. Enabling seccomp on a process limits the system calls available to that process. Filters can also limit the arguments allowed in a system call (e.g. namespace clone flags).
APPARMOR AppArmor is a mandatory access control mechanism. Profiles are applied to running processes to limit their access to resources or privileges. `deny @{PROC}/sysrq-trigger rwklx`
I GET KNOCKED DOWN, BUT I GET UP AGAIN CVE-2016-9962 (runc fd traversal): User Namespaces, Capability Dropping, AppArmor. CVE-2017-16539 (SCSI MICDROP): User Namespaces, AppArmor. CVE-2017-16995 (eBPF verifier vulnerability): Capability Dropping (sometimes), Seccomp.
DEPENDENCY MANAGEMENT pivot_root(8) Layered filesystems
PIVOT ROOT [Diagram sequence: run.sh asks "What's in /?", pivot_root! is called, then run.sh asks "What's in /?" again. Labels: Boring Host (Ubuntu), Cool Container (Busybox).]
LAYERED FILESYSTEMS [Diagram build-up of stacks: (1) run.sh; (2) run.sh, /bin/os-specific; (3) run.sh, /bin/os-specific, AMI; (4) run.sh, Δ B, Δ A, Base ROOTFS.]
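The layering idea (Δ B on top of Δ A on top of a base rootfs) can be modelled in a few lines. This is a toy, in-memory model assumed purely for illustration: each layer is a dict from path to content, upper layers shadow lower ones, and a `WHITEOUT` marker hides a path that exists lower down. Real layered filesystems work on directories, not dicts, and the names here are made up.

```python
# Sentinel meaning "this path was deleted in this layer".
WHITEOUT = object()


def resolve(path, layers):
    """Look up path in layers ordered top-most first; None if absent."""
    for layer in layers:
        if path in layer:
            entry = layer[path]
            return None if entry is WHITEOUT else entry
    return None


def merged_view(layers):
    """Flatten layers (top-most first) into the view a container would see."""
    view = {}
    for layer in reversed(layers):  # apply bottom layer first
        for path, entry in layer.items():
            if entry is WHITEOUT:
                view.pop(path, None)
            else:
                view[path] = entry
    return view


# Hypothetical contents for the three layers on the slide.
base_rootfs = {"/bin/sh": "busybox", "/etc/motd": "hello", "/tmp/junk": "x"}
delta_a = {"/etc/motd": "patched"}
delta_b = {"/run.sh": "echo hi", "/tmp/junk": WHITEOUT}
layers = [delta_b, delta_a, base_rootfs]
```

With these layers, `resolve("/etc/motd", layers)` finds the Δ A version, `/tmp/junk` is hidden by the whiteout in Δ B, and the base rootfs itself is never modified, which is what lets many containers share it.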
WHAT IS A CONTAINER? Confinement; own view of the system; fair share of resources; unable to modify constraints; dependency management.
SO EVERYTHING IS SECURE IN A CLOUD FOUNDRY CONTAINER RIGHT?
YES...BUT...
ROUTE TO ROOTLESS
CREDITS Jessie Frazelle (@jessfraz) Aleksa Sarai (@lordcyphar) Akihiro Suda (@_AkihiroSuda_)
WHAT IS A CONTAINER? Confinement; own view of the system; fair share of resources; unable to modify constraints; dependency management.
CONFINEMENT Unprivileged user namespaces (since Linux 3.8). Just need CAP_SYS_ADMIN in the owning user namespace to do the rest: other namespaces, seccomp, AppArmor.
HOW DO USER NAMESPACES WORK? Users are living a double life...
IN THE HOST
IN THE CONTAINER
HOW DO USER NAMESPACES WORK? /proc/self/uid_map and /proc/self/gid_map specify the user and group ID mappings between the outer and inner user namespaces. Each mapping is a triple of (inner ID, outer ID, length of the range).
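The mapping triples can be read as simple range arithmetic. The sketch below parses text in the uid_map/gid_map format (one triple per line) and translates IDs in both directions. The function names and the sample mapping are assumptions for illustration; returning `None` for an unmapped ID is a simplification of what the kernel does.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map text into (inner, outer, length) triples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        inner, outer, length = (int(field) for field in line.split())
        entries.append((inner, outer, length))
    return entries


def to_outer(inner_id, entries):
    """Translate an ID inside the namespace to the ID seen outside it."""
    for inner, outer, length in entries:
        if inner <= inner_id < inner + length:
            return outer + (inner_id - inner)
    return None  # unmapped in this sketch


def to_inner(outer_id, entries):
    """Translate an outside ID to the ID seen inside the namespace."""
    for inner, outer, length in entries:
        if outer <= outer_id < outer + length:
            return inner + (outer_id - outer)
    return None  # unmapped in this sketch


# Hypothetical mapping: container root is host UID 1000, and container
# UIDs 1..65536 are host UIDs 100000..165535.
SAMPLE_MAP = "0 1000 1\n1 100000 65536\n"
```

With this sample, root in the container (UID 0) is the unprivileged user 1000 on the host, which is the "double life" the previous slides refer to.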
CONFINEMENT How do unprivileged user namespaces work? The newuidmap/newgidmap setuid binaries validate the requested mappings against /etc/subuid and /etc/subgid. PRed support to runc.
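The kind of check newuidmap performs can be approximated as follows. This is a simplified model, not the actual shadow-utils logic: it assumes /etc/subuid lines of the form `name:start:count`, ignores numeric-UID entries, and allows a mapping only if it is the caller's own UID (length 1) or lies entirely within one of the caller's delegated ranges. All function names and sample values are hypothetical.

```python
def parse_subid(text, user):
    """Return the (start, count) ranges delegated to user in subuid text."""
    ranges = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, start, count = line.split(":")
        if name == user:
            ranges.append((int(start), int(count)))
    return ranges


def mapping_allowed(outer, length, own_id, ranges):
    """Decide whether mapping `length` outer IDs starting at `outer` is OK."""
    if length < 1:
        return False
    # A user may always map their own single ID.
    if length == 1 and outer == own_id:
        return True
    # Otherwise the whole requested range must fit in one delegated range.
    return any(
        start <= outer and outer + length <= start + count
        for start, count in ranges
    )


SAMPLE_SUBUID = "alice:100000:65536\nbob:165536:65536\n"
```

The point of the check is that an unprivileged user can only claim host IDs an administrator has delegated, so container root can never be mapped onto host root.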
SHARING There's no way to do cgroups entirely unprivileged yet. cgroups are a virtual filesystem like proc (/sys/fs/cgroup/memory/**). Files are owned by root by default. Privileged setup can chown cgroups to our container root user, so runc can write to them. PRed support to runc.
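Because cgroups are just files, "applying a limit" is a file write, and whether it succeeds comes down to file ownership. A minimal sketch, assuming the cgroup v1 memory controller layout and its `memory.limit_in_bytes` file; the function name and the group path in the usage note are made up, and the `cgroup_root` parameter exists only so the sketch can be tried against a scratch directory.

```python
import os


def set_memory_limit(cgroup_root, group, limit_bytes):
    """Write a memory limit into <cgroup_root>/memory/<group>.

    Raises FileNotFoundError if the cgroup does not exist and
    PermissionError if the current user cannot write to it.
    """
    path = os.path.join(cgroup_root, "memory", group, "memory.limit_in_bytes")
    if not os.path.exists(path):
        raise FileNotFoundError(path)
    if not os.access(path, os.W_OK):
        # This is the case the privileged chown step is meant to avoid.
        raise PermissionError("cannot write %s as uid %d" % (path, os.getuid()))
    with open(path, "w") as handle:
        handle.write(str(limit_bytes))
    return path
```

On a real host the call might look like `set_memory_limit("/sys/fs/cgroup", "garden/some-container", 64 * 1024 * 1024)`, where the group name is hypothetical. Without the earlier chown, an unprivileged caller would hit the `PermissionError` branch.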
DEPENDENCY MANAGEMENT pivot_root(8): the user namespace gives us CAP_SYS_ADMIN.
DEPENDENCY MANAGEMENT Layered filesystems. AUFS mounts: not possible unprivileged. BTRFS snapshots: initial setup as privileged; snapshots can be done unprivileged; exploded with quotas at scale :( OverlayFS mounts: possible on Ubuntu unprivileged; seems to be working?!
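For reference, an OverlayFS mount is described by a `lowerdir`/`upperdir`/`workdir` option string, with the left-most `lowerdir` entry being the top-most read-only layer. The sketch below only builds that string and the corresponding `mount` argument list; it does not perform a mount. The helper names and directory paths are hypothetical, and paths containing `:` or `,` are rejected here for simplicity rather than escaped.

```python
def overlay_options(lower_dirs, upper_dir, work_dir):
    """Build the -o option string for an overlay mount.

    lower_dirs is ordered top-most layer first, matching OverlayFS.
    """
    if not lower_dirs:
        raise ValueError("at least one lower directory is required")
    for directory in list(lower_dirs) + [upper_dir, work_dir]:
        if ":" in directory or "," in directory:
            raise ValueError("unsupported character in path: %r" % directory)
    return "lowerdir={},upperdir={},workdir={}".format(
        ":".join(lower_dirs), upper_dir, work_dir
    )


def overlay_mount_argv(lower_dirs, upper_dir, work_dir, merged_dir):
    """Return the argv for a `mount -t overlay` call (not executed here)."""
    options = overlay_options(lower_dirs, upper_dir, work_dir)
    return ["mount", "-t", "overlay", "overlay", "-o", options, merged_dir]
```

Whether an unprivileged user inside a user namespace is allowed to run such a mount depends on the kernel, which is the distribution-specific caveat the slide points at.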
OVERLAY ROOTFUL
OVERLAY ROOTLESS
ROAD BLOCKS
DISK QUOTAS We use XFS for filesystem quotas. XFS requires privilege. Small, focused setuid binary just for quota management.
NETWORKING Networking used to be integrated into Garden. Garden supports a plugin architecture. garden-external-networker is setuid. Some awesome work going on from Aleksa Sarai and Akihiro Suda on this front!
GDN SETUP cgroup chowning in garden setup: no user input; before any workload can be running in CF. At least one attempt to fix this by Aleksa Sarai, but no luck yet.
BUT IT'S OK...
REDUCING PRIVILEGE Reduce privilege where we can, when we can. Break apart monoliths to only allow privilege where required. Some things take time. Proving out slowly is a positive thing. It's getting better!
DOES IT WORK?
HOW CAN I TRY IT? Experimental config option in BOSH manifest
THANKS
"ET TU ROOT?" - JULIUS CAESAR