Overlayfs And Containers. Miklos Szeredi, Red Hat; Vivek Goyal, Red Hat


Introduction to overlayfs

Union or?
- Union: all layers made equal
- How do you take the union of two files? Or a file and a directory?
- NO! Layers can't be treated equal

...overlay!
- Layer upon layer upon layer
- Only upper layer can be modified
  - copy-up (exception: directory contents)
- Objects in one layer cover up objects with the same name in layer(s) below
  - Exception: directories, which are merged
  - Exception for the exception: opaque directories
  - One more exception: whiteout covers up anything and makes it look like nothing
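The covering rules above can be sketched as a toy lookup over in-memory layers. This is a minimal model, not kernel code: the `Dir` class, the `WHITEOUT` marker, and `merged_listing` are hypothetical names introduced only for illustration, and nested directories are not merged recursively.

```python
# Toy model of how one directory looks through an overlay; not kernel code.
WHITEOUT = object()  # hypothetical marker standing in for a whiteout entry


class Dir:
    """One layer's copy of a directory: name -> entry, plus an opaque flag."""

    def __init__(self, entries=None, opaque=False):
        self.entries = dict(entries or {})
        self.opaque = opaque


def merged_listing(layers):
    """Return the visible entries of a directory; `layers` is ordered top first."""
    visible = {}
    hidden = set()
    for layer in layers:
        for name, entry in layer.entries.items():
            if name in visible or name in hidden:
                continue  # covered by an object with the same name above
            if entry is WHITEOUT:
                hidden.add(name)  # covers anything below and shows nothing
            else:
                visible[name] = entry
        if layer.opaque:
            break  # opaque directory: layers below are not merged in
    return visible
```

For example, an upper layer holding `a` and a whiteout for `b`, stacked on a lower layer holding `a`, `b`, and `c`, yields the upper `a` and the lower `c`.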

Design
- Userspace API (most important!)
  - No new object types
  - Whiteout -> char dev with 0/0 device number
  - Opaque dir -> xattr
- Make it as simple as possible (and not a bit simpler)
  - Most of the logic is in a separate filesystem module
  - Some VFS impact but not much; some FS impact but not much
- Upstream early
  - It doesn't have to do everything right; features can be added later...
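Because whiteouts and opaque directories reuse existing object types, a userspace tool can recognize them with ordinary stat and xattr data. The sketch below assumes the caller has already read `st_mode`, `st_rdev`, and the xattrs of an entry in the upper directory. The slides only say "xattr"; the name `trusted.overlay.opaque` with value `y` is taken from the mainline kernel overlayfs documentation. The helper names are hypothetical.

```python
import stat

# Assumption: xattr name and value as documented for mainline overlayfs.
OPAQUE_XATTR = "trusted.overlay.opaque"


def is_whiteout(st_mode, st_rdev):
    """A whiteout is a character device whose device number is 0/0."""
    return stat.S_ISCHR(st_mode) and st_rdev == 0


def is_opaque_dir(st_mode, xattrs):
    """An opaque directory is a directory carrying the opaque xattr set to 'y'."""
    return stat.S_ISDIR(st_mode) and xattrs.get(OPAQUE_XATTR) == b"y"
```

On a real system the inputs would come from `os.lstat()` and `os.getxattr()`; reading `trusted.*` xattrs normally requires privilege.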

Implementation
- Separate cache for the overlay directory tree
  - Allows less impact on VFS/FS
  - BUT bad for memory use
- Shared cache for the file contents
  - Copy-up when opened for write (may be too early)
  - Ugliness when copy-up happens while file is already open read-only
  - BUT great for performance and memory use
- Limitations
  - modifying lower layer -> don't care
  - Not (yet) a POSIX filesystem (st_dev/ino quirks, directory rename, hard link copy-up, etc)

Features added later
- Multiple lower layers
- Renaming directories
- SELinux
- POSIX ACL
- File locking

Features (work in progress)
- RW-RO file consistency after copy-up
  - Just need to fix this case up in VFS
- Fix st_dev, constant st_ino/d_ino
  - Store inode number for copied up files
  - Finding a common ino space for different underlying filesystems
- Hard link copy up
  - Should be very rare
  - Can use a global database for storing inode numbers of copied up hard links

overlayfs usage in docker

overlay graph driver
[Diagram: Container 1 runs a confined process on the merged dir (rootfs); lower dir 1, lower dir 2, ... lower dir N hold Image Layer 1, Image Layer 2, ... Image Layer N; the upper dir is the container writable dir; label: hard links]
- docker daemon option --storage-driver=overlay
- Overlay supported single lower directory
- Hard links created between image layers
- Higher inode utilization

overlay2 graph driver
[Diagram: Container 1 runs a confined process on the merged dir (rootfs); lower dir 1, lower dir 2, ... lower dir N hold Image Layer 1, Image Layer 2, ... Image Layer N; the upper dir is the container writable dir]
- docker daemon option --storage-driver=overlay2
- overlayfs should support multiple lower dirs
- No hardlinks and dir creation in every layer
- Better inode utilization

Container security and overlayfs

How do we handle access permissions?
- DAC (Ownership/Permissions)
- MAC (SELinux)

An example setup
[Diagram: Container 1 (Confined Process 1) and Container 2 (Confined Process 2) each have their own merged dir and upper dir (upper dir 1, upper dir 2) over one shared lower dir (Image Layer)]
- Two containers sharing lower dir with separate upper dir

Escaped container process writes to image dir/files
[Diagram: the example setup, plus Escaped Process 1 outside Container 1]
- -rw-r--r--. root root /etc/passwd
- DAC allows writing to /etc/passwd

Security goal 1
[Diagram: the example setup, plus Escaped Process 1 outside Container 1]
- -rw-r--r--. root root /etc/passwd
- Do not allow writing to image dir/files

Allow access through overlay mount point
[Diagram: open(foo.txt, O_RDWR); Overlay mount; lower/foo.txt; upper/]

Deny write access on underlying file
[Diagram: open(foo.txt, O_RDWR); Overlay mount; lower/foo.txt; upper/]

DAC allows access through both paths (when root inside container is root outside)
[Diagram: open(foo.txt, O_RDWR); Overlay mount; lower/foo.txt; upper/]
- -rw-r--r--. root root

Read only label on lower files
- Process: open(foo.txt, O_RDWR), system_u:system_r:container_t:s0:c16,c585
- Overlay mount
- lower/foo.txt: system_u:object_r:container_share_t:s0
- upper/

Use context mount option for overlay
- Process: open(foo.txt, O_RDWR), system_u:system_r:container_t:s0:c16,c585
- Overlay mount (context=label): system_u:object_r:container_file_t:s0:c16,c585
- lower/foo.txt: system_u:object_r:container_share_t:s0
- upper/
- mount -t overlay -o context=system_u:object_r:container_file_t:s0:c16,c585... merged/
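As a rough illustration of how such a mount might be assembled, the helper below builds the `-o` option string for an overlay mount with several lower directories and an optional SELinux context. The function name is hypothetical. Two details are assumptions that come from the kernel and mount(8) documentation rather than from the slides: a `workdir` is required alongside `upperdir`, and a context containing commas has to be wrapped in double quotes so the commas are not read as option separators.

```python
def overlay_mount_options(lowerdirs, upperdir, workdir, context=None):
    """Build the -o string for `mount -t overlay`.

    `lowerdirs` is ordered top-most layer first, matching the
    colon-separated order overlayfs expects in lowerdir=.
    """
    if not lowerdirs:
        raise ValueError("at least one lower directory is required")
    opts = [
        "lowerdir=" + ":".join(lowerdirs),
        "upperdir=" + upperdir,
        "workdir=" + workdir,  # assumption: required by overlayfs, not shown on the slides
    ]
    if context is not None:
        # The MCS part of the label (e.g. c16,c585) contains a comma, so quote it.
        opts.append('context="%s"' % context)
    return ",".join(opts)
```

The resulting string would be passed as `mount -t overlay overlay -o <options> merged/`; the paths used in any call are placeholders.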

That did not work

Access permission checks in overlay
- inode_permission()
  - overlay inode (context label): MAC
  - real inode (real label): DAC + MAC

Read only label on lower file
- Process: open(foo.txt, O_RDWR), system_u:system_r:container_t:s0:c16,c585
- merged/foo.txt: system_u:object_r:container_file_t:s0:c16,c585
- lower/foo.txt: -rw-r--r--. root root, system_u:object_r:container_share_t:s0
- upper/

Process Overlay inode check
- Process MAY_WRITE on merged/foo.txt (system_u:object_r:container_file_t:s0:c16,c585)

Process lower inode DAC check
- Process MAY_WRITE on lower/foo.txt (-rw-r--r--. root root)

Process lower inode MAC check
- Process MAY_WRITE on lower/foo.txt (system_u:object_r:container_share_t:s0)

What if we don't do WRITE checks on lower inode?

But that will break DAC
- DAC checks happen only at real inode
- open(foo.txt, O_RDWR) on merged/
- lower/foo.txt: -r--r--r--. root root
- Copy Up
- upper/foo.txt: -r--r--r--. root root

Why not do DAC checks on both inodes?
- open(foo.txt, O_RDWR)
- merged/foo.txt: -r--r--r--. root root
- lower/foo.txt: -r--r--r--. root root
- Copy Up
- upper/

That kind of worked but...

Certain overlayfs operations fail MAC checks
- File creation over whiteout
- Use mounter's creds for privileged operations

Two Levels of Permission Checks
- Overlay inode is checked with creds of task
- Underlying inode is checked with creds of mounter
- Certain privileged operations are done with the creds of mounter
- inode_permission()
  - overlay inode (context label): DAC + MAC (Caller Creds)
  - real inode (real label): DAC + MAC (Mounter Creds)

Two levels of checks
- Process: open(foo.txt, O_RDWR), system_u:system_r:container_t:s0:c16,c585
- merged/foo.txt: -rw-r--r--. root root, system_u:object_r:container_file_t:s0:c16,c585
- lower/foo.txt: -rw-r--r--. root root, system_u:object_r:container_share_t:s0
- upper/

Process Overlay inode check
- Process MAY_WRITE on merged/foo.txt (-rw-r--r--. root root, system_u:object_r:container_file_t:s0:c16,c585)

Mounter real lower inode check
- Process MAY_WRITE on merged/foo.txt
- Mounter MAY_READ on lower/foo.txt (-rw-r--r--. root root, system_u:object_r:container_share_t:s0)
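The two-level scheme can be sketched as a small model: the task's credentials are checked against the overlay inode, then the mounter's credentials are checked against the real inode, with a write request on a lower file reduced to a read because the file will be copied up. Everything here is a simplification made for illustration: the `POLICY` table, the `mounter_t` type, and the function names are hypothetical, DAC ignores groups, and root simply bypasses DAC.

```python
from dataclasses import dataclass

MAY_WRITE, MAY_READ = 2, 4  # same bit values as the w and r permission bits

# Toy MAC policy: (subject type, object type) -> allowed mask. Hypothetical.
POLICY = {
    ("container_t", "container_file_t"): MAY_READ | MAY_WRITE,
    ("container_t", "container_share_t"): MAY_READ,
    ("mounter_t", "container_file_t"): MAY_READ | MAY_WRITE,
    ("mounter_t", "container_share_t"): MAY_READ,
}


@dataclass
class Cred:
    uid: int
    setype: str


@dataclass
class Inode:
    uid: int
    mode: int
    setype: str


def permitted(cred, inode, mask):
    """DAC (owner/other only, root bypasses) plus the toy MAC lookup."""
    if cred.uid == 0:
        dac_ok = True
    else:
        shift = 6 if cred.uid == inode.uid else 0
        dac_ok = ((inode.mode >> shift) & mask) == mask
    allowed = POLICY.get((cred.setype, inode.setype), 0)
    return dac_ok and (allowed & mask) == mask


def overlay_permission(task, mounter, overlay_inode, real_inode, mask, real_is_lower):
    """Level 1: task vs. overlay inode. Level 2: mounter vs. real inode."""
    if not permitted(task, overlay_inode, mask):
        return False
    real_mask = mask
    if real_is_lower and mask & MAY_WRITE:
        # The lower file is only read during copy-up, never written.
        real_mask = (mask & ~MAY_WRITE) | MAY_READ
    return permitted(mounter, real_inode, real_mask)
```

In this model a `container_t` task can open the merged file read-write, while the same task asking for `MAY_WRITE` directly on the `container_share_t` lower inode is refused.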

First requirement met

Escaped process accesses other container's data
[Diagram: Container 1 (Confined Process 1) and Container 2 (Confined Process 2) each have their own merged dir and upper dir (upper dir 1, upper dir 2) over one shared lower dir (Image Layer); Escaped Process 1 is outside Container 1]
- -rw-r--r--. root root /etc/data.txt
- Container1 accesses container2's data

Security goal 2
[Diagram: same setup as the previous slide]
- -rw-r--r--. root root /etc/data.txt
- One container should not be able to access other container's data

Label upper files for container access only
- Process: open(foo.txt, O_RDWR), system_u:system_r:container_t:s0:c16,c585
- merged/: system_u:object_r:container_file_t:s0:c16,c585
- lower/foo.txt: -rw-r--r--. root root, system_u:object_r:container_share_t:s0
- upper/foo.txt: -rw-r--r--. root root, system_u:object_r:container_file_t:s0:c16,c585

One container can't access data of another container
- Process: open(foo.txt, O_RDWR), system_u:system_r:container_t:s0:c16,c585
- Second process: open(foo.txt, O_RDWR), system_u:system_r:container_t:s0:c548,c591
- merged/: system_u:object_r:container_file_t:s0:c16,c585
- lower/foo.txt: -rw-r--r--. root root, system_u:object_r:container_share_t:s0
- upper/foo.txt: -rw-r--r--. root root, system_u:object_r:container_file_t:s0:c16,c585
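The separation between containers rests on the category part of the labels (c16,c585 versus c548,c591). A minimal sketch of that comparison follows, assuming the usual MCS rule that a process may access a file only if the process's category set includes every category on the file. It models categories only, not type enforcement, and it does not handle category ranges such as c0.c1023; the function names are hypothetical.

```python
def parse_context(label):
    """Split 'user:role:type:level' and return (type, sensitivity, categories)."""
    _user, _role, setype, level = label.split(":", 3)
    sensitivity, _, cats = level.partition(":")
    categories = frozenset(cats.split(",")) if cats else frozenset()
    return setype, sensitivity, categories


def mcs_allows(process_label, file_label):
    """True if the process's categories cover all of the file's categories."""
    _, _, process_cats = parse_context(process_label)
    _, _, file_cats = parse_context(file_label)
    return file_cats <= process_cats
```

With the labels from the slides, the process at c16,c585 covers the upper file's categories and the process at c548,c591 does not, while the category-free lower file passes this check for both.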

New LSM Hooks
- inode_copy_up()
  - Called during copy up. Returns new set of creds for file creation. For context mounts, file is created with label specified in context= option.
- inode_copy_up_xattr()
  - Called during copy up of xattrs. SELinux blocks copying up of SELinux xattr.
- dentry_create_files_as()
  - Called during creation of new file. Returns new set of creds for file creation. For context mounts, file is created with label specified in context= option.
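The effect of these hooks on labels can be modeled in a few lines. This is a userspace illustration of the described behavior, not the kernel hook signatures: the function names are hypothetical, and the `fallback_label` parameter stands in for whatever label applies on non-context mounts, which the slides do not specify.

```python
SELINUX_XATTR = "security.selinux"


def copy_up_xattrs(lower_xattrs):
    """Copy xattrs to the upper file, skipping the SELinux label xattr."""
    return {name: value for name, value in lower_xattrs.items() if name != SELINUX_XATTR}


def label_for_upper_file(mount_context, fallback_label):
    """For context mounts, new upper files get the context= label.

    `fallback_label` is a placeholder for the non-context case.
    """
    return mount_context if mount_context is not None else fallback_label
```

Skipping the SELinux xattr keeps the lower file's read-only label from being carried onto the copied-up file, which is instead labeled through the creation credentials.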

Overlayfs vs. devicemapper
- In general, faster than devicemapper
  - Page cache sharing
- Not fully POSIX compliant, yet
  - So some workloads might experience issues
- Fedora 26 will have overlay2 as default graph driver
  - Switch back to devicemapper if you face issues

DAC and container security
- DAC will solve these issues if containers run in user namespaces with different mappings
- Needing to do a chown on image continues to be an issue
  - shiftfs or something else?
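To see why different mappings help, and why the image then needs a chown, consider a sketch of the uid translation a user namespace performs. The tuples follow the inside-start, outside-start, count layout of `/proc/<pid>/uid_map`; the function name and the specific ranges used with it are made up for illustration.

```python
def map_uid(uid_map, container_uid):
    """Translate a uid inside the namespace to the uid seen on the host.

    `uid_map` is a list of (inside_start, outside_start, count) tuples.
    Returns None when the uid has no mapping.
    """
    for inside, outside, count in uid_map:
        if inside <= container_uid < inside + count:
            return outside + (container_uid - inside)
    return None
```

If one container maps uid 0 to host uid 100000 and another maps it to 200000, root in the first container is an unrelated, unprivileged uid with respect to the second container's files. Image files owned by host uid 0 match neither mapping, which is the chown problem the slide mentions.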

Thank You