Using seccomp to limit the kernel attack surface

Similar documents
Using seccomp to limit the kernel attack surface

Outline. Basic features of BPF virtual machine

Seccomp. Linux Security and Isolation APIs. Outline. Michael Kerrisk, man7.org c 2017 November 2017

Using seccomp to limit the kernel attack surface

Sandboxing Linux code to mitigate exploitation. (Or: How to ship a secure operating system that includes third-party code)

February Syscall Filtering and You Paul Moore

Sandboxing. CS-576 Systems Security Instructor: Georgios Portokalidis Spring 2018

seccomp-nurse Nicolas Bareil ekoparty 2010 EADS CSC Innovation Works France seccomp-nurse: sandboxing environment

The Linux capabilities model

Capabilities. Linux Capabilities and Namespaces. Outline. Michael Kerrisk, man7.org c 2018 March 2018

User Namespaces. Linux Capabilities and Namespaces. Outline. Michael Kerrisk, man7.org c 2018 March 2018

File Descriptors and Piping

Design Overview of the FreeBSD Kernel CIS 657

Design Overview of the FreeBSD Kernel. Organization of the Kernel. What Code is Machine Independent?

The Kernel Abstraction. Chapter 2 OSPP Part I

Windows architecture. user. mode. Env. subsystems. Executive. Device drivers Kernel. kernel. mode HAL. Hardware. Process B. Process C.

CS 326: Operating Systems. Process Execution. Lecture 5

Introduction to OS Processes in Unix, Linux, and Windows MOS 2.1 Mahmoud El-Gayyar

Getting to know you. Anatomy of a Process. Processes. Of Programs and Processes

OS lpr. www. nfsd gcc emacs ls 1/27/09. Process Management. CS 537 Lecture 3: Processes. Example OS in operation. Why Processes? Simplicity + Speed

POSIX Shared Memory. Linux/UNIX IPC Programming. Outline. Michael Kerrisk, man7.org c 2017 November 2017

Operating System Structure

Process Management! Goals of this Lecture!

Reading Assignment 4. n Chapter 4 Threads, due 2/7. 1/31/13 CSE325 - Processes 1

Processes. Operating System CS 217. Supports virtual machines. Provides services: User Process. User Process. OS Kernel. Hardware

Processes. CS439: Principles of Computer Systems January 30, 2019

OS lpr. www. nfsd gcc emacs ls 9/18/11. Process Management. CS 537 Lecture 4: Processes. The Process. Why Processes? Simplicity + Speed

Computer Systems Assignment 2: Fork and Threads Package

o Reality The CPU switches between each process rapidly (multiprogramming) Only one program is active at a given time

Operating Systems and Networks Assignment 9

CSE 333 SECTION 3. POSIX I/O Functions

Operating systems. Lecture 7

2 Processes. 2 Processes. 2 Processes. 2.1 The Process Model. 2.1 The Process Model PROCESSES OPERATING SYSTEMS

Outline. Relationship between file descriptors and open files

Like select() and poll(), epoll can monitor multiple FDs epoll returns readiness information in similar manner to poll() Two main advantages:

CSE 333 SECTION 3. POSIX I/O Functions

SE350: Operating Systems

SOEN228, Winter Revision 1.2 Date: October 25,

Processes often need to communicate. CSCB09: Software Tools and Systems Programming. Solution: Pipes. Recall: I/O mechanisms in C

PROCESS MANAGEMENT Operating Systems Design Euiseong Seo

CS 33. Architecture and the OS. CS33 Intro to Computer Systems XIX 1 Copyright 2018 Thomas W. Doeppner. All rights reserved.

CSCE Operating Systems Interrupts, Exceptions, and Signals. Qiang Zeng, Ph.D. Fall 2018

(In columns, of course.)

Distribution Kernel Security Hardening with ftrace

Packet Sniffing and Spoofing

CS 33. Architecture and the OS. CS33 Intro to Computer Systems XIX 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.

CSci 4061 Introduction to Operating Systems. Processes in C/Unix

Process management. What s in a process? What is a process? The OS s process namespace. A process s address space (idealized)

C++ Important Questions with Answers

CSci 4061 Introduction to Operating Systems. Programs in C/Unix

Processes. CS439: Principles of Computer Systems January 24, 2018

Unix-Linux 2. Unix is supposed to leave room in the process table for a superuser process that could be used to kill errant processes.

Hardware OS & OS- Application interface

CS153: Process 2. Chengyu Song. Slides modified from Harsha Madhyvasta, Nael Abu-Ghazaleh, and Zhiyun Qian

www nfsd emacs lpr Process Management CS 537 Lecture 4: Processes Example OS in operation Why Processes? Simplicity + Speed

Anne Bracy CS 3410 Computer Science Cornell University

CS 550 Operating Systems Spring System Call

Process Management! Goals of this Lecture!

ISOLATION DEFENSES GRAD SEC OCT

Recitation 8: Tshlab + VM

Operating Systems. Operating System Structure. Lecture 2 Michael O Boyle

Midterm Exam CPS 210: Operating Systems Spring 2013

CS C Primer. Tyler Szepesi. January 16, 2013

CSE 451: Operating Systems Winter Module 4 Processes. Mark Zbikowski Allen Center 476

CS510 Operating System Foundations. Jonathan Walpole

Outline. Overview. Linux-specific, since kernel 2.6.0

UNIX System Calls. Sys Calls versus Library Func

TCSS 422: OPERATING SYSTEMS

Processes. Processes (cont d)

A process. the stack

Linux Signals and Daemons

System Calls and Signals: Communication with the OS. System Call. strace./hello. Kernel. Context Switch

Practical Malware Analysis

Process Time. Steven M. Bellovin January 25,

Computer Science 330 Operating Systems Siena College Spring Lab 5: Unix Systems Programming Due: 4:00 PM, Wednesday, February 29, 2012

REVIEW OF COMMONLY USED DATA STRUCTURES IN OS

by Marina Cholakyan, Hyduke Noshadi, Sepehr Sahba and Young Cha

Linux Operating System

BUFFER OVERFLOW DEFENSES & COUNTERMEASURES

W4118: OS Overview. Junfeng Yang

Process Management 1

Processes. Johan Montelius KTH

Multithreaded Programming

CS 355 Operating Systems. Keeping Track of Processes. When are processes created? Process States 1/26/18. Processes, Unix Processes and System Calls

Prepared by Prof. Hui Jiang Process. Prof. Hui Jiang Dept of Electrical Engineering and Computer Science, York University

Last class: Today: Thread Background. Thread Systems

Process. Prepared by Prof. Hui Jiang Dept. of EECS, York Univ. 1. Process in Memory (I) PROCESS. Process. How OS manages CPU usage? No.

PROCESS MANAGEMENT. Operating Systems 2015 Spring by Euiseong Seo

CS 31: Intro to Systems Processes. Kevin Webb Swarthmore College March 31, 2016

1. Overview This project will help you understand address spaces and virtual memory management.

CSE 153 Design of Operating Systems Fall 2018

Signals. POSIX defines a variety of signal types, each for a particular

CPSC 341 OS & Networks. Processes. Dr. Yingwu Zhu

PROCESS CONTROL BLOCK TWO-STATE MODEL (CONT D)

The Kernel Abstraction

Processes & Threads. Today. Next Time. ! Process concept! Process model! Implementing processes! Multiprocessing once again. ! More of the same J

RCU. ò Walk through two system calls in some detail. ò Open and read. ò Too much code to cover all FS system calls. ò 3 Cases for a dentry:

Intel x86 Jump Instructions. Part 5. JMP address. Operations: Program Flow Control. Operations: Program Flow Control.

Lecture 3. Introduction to Unix Systems Programming: Unix File I/O System Calls

VFS, Continued. Don Porter CSE 506

Transcription:

LinuxCon.eu 2015 Using seccomp to limit the kernel attack surface Michael Kerrisk, man7.org c 2015 man7.org Training and Consulting http://man7.org/training/ 6 October 2015 Dublin, Ireland

Outline 1 Introductions 2 Introduction and history 3 Seccomp filtering and BPF 4 Constructing seccomp filters 5 BPF programs 6 Further details on seccomp filters 7 Applications, tools, and further information

Outline 1 Introductions 2 Introduction and history 3 Seccomp filtering and BPF 4 Constructing seccomp filters 5 BPF programs 6 Further details on seccomp filters 7 Applications, tools, and further information

Who am I? Maintainer of Linux man-pages (since 2004) Documents kernel-user-space + C library APIs 1000 manual pages http://www.kernel.org/doc/man-pages/ API review, testing, and documentation API design and design review Lots of testing, lots of bug reports, a few kernel patches Day job : programmer, trainer, writer LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Introductions 4 / 55

Outline 1 Introductions 2 Introduction and history 3 Seccomp filtering and BPF 4 Constructing seccomp filters 5 BPF programs 6 Further details on seccomp filters 7 Applications, tools, and further information

Goals History of seccomp Basics of seccomp operation Creating and installing BPF filters (AKA seccomp2 ) Mostly: look at hand-coded BPF filter programs, to gain fundamental understanding of how seccomp works Briefly note some productivity aids for coding BPF programs LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Introduction and history 6 / 55

Introduction and history Mechanism to restrict system calls that a process may make Reduces attack surface of kernel A key component for building application sandboxes First version in Linux 2.6.12 (2005) Filtering enabled via /proc/pid/seccomp Writing 1 to file places process (irreversibly) in strict seccomp mode Need CONFIG_SECCOMP LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Introduction and history 7 / 55

Introduction and history Initially, just one filtering mode ( strict ) Only permitted system calls are read(), write(), _exit(), and sigreturn() Note: open() not included (must open files before entering strict mode) sigreturn() allows for signal handlers Other system calls SIGKILL Designed to sandbox compute-bound programs that deal with untrusted byte code Code perhaps exchanged via pre-created pipe or socket LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Introduction and history 8 / 55

Introduction and history Linux 2.6.23 (2007): /proc/pid/seccomp interface replaced by prctl() operations prctl(pr_set_seccomp, arg) modifies caller s seccomp mode SECCOMP_MODE_STRICT: limit syscalls as before prctl(pr_get_seccomp) returns seccomp mode: 0 process is not in seccomp mode Otherwise? SIGKILL (!) prctl() is not a permitted system call in strict mode Who says kernel developers don t have a sense of humor? LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Introduction and history 9 / 55

Introduction and history Linux 3.5 (2012) adds filter mode (AKA seccomp2 ) prctl(pr_set_seccomp, SECCOMP_MODE_FILTER,...) Can control which system calls are permitted, Control based on system call number and argument values Choice is controlled by user-defined filter a BPF program Berkeley Packet Filter (later) Requires CONFIG_SECCOMP_FILTER By now used in a range of tools E.g., Chrome browser, OpenSSH, vsftpd, Firefox OS, Docker LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Introduction and history 10 / 55

Introduction and history Linux 3.8 (2013): The joke is getting old... New /proc/pid/status Seccomp field exposes process seccomp mode (as a number) 0 // SECCOMP_ MODE_ DISABLED 1 // SECCOMP_ MODE_ STRICT 2 // SECCOMP_ MODE_ FILTER Process can, without fear, read from this file to discover its own seccomp mode But, must have previously obtained a file descriptor... LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Introduction and history 11 / 55

Introduction and history Linux 3.17 (2014): seccomp() system call added (Rather than further multiplexing of prctl()) Provides superset of prctl(2) functionality Can synchronize all threads to same filter tree Useful, e.g., if some threads created by start-up code before application has a chance to install filter(s) LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Introduction and history 12 / 55

Outline 1 Introductions 2 Introduction and history 3 Seccomp filtering and BPF 4 Constructing seccomp filters 5 BPF programs 6 Further details on seccomp filters 7 Applications, tools, and further information

Seccomp filtering and BPF Seccomp filtering available since Linux 3.5 Allows filtering based on system call number and argument (register) values Pointers are not dereferenced Filters expressed using BPF (Berkeley Packet Filter) syntax Filters installed using seccomp() or prctl() 1 Construct and install BPF filter 2 exec() new program or invoke function inside dynamically loaded shared library (plug-in) Once installed, every syscall triggers execution of filter Installed filters can t be removed Filter == declaration that we don t trust subsequently executed code LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Seccomp filtering and BPF 14 / 55

BPF origins BPF originally devised (in 1992) for tcpdump Monitoring tool to display packets passing over network http://www.tcpdump.org/papers/bpf-usenix93.pdf Volume of network traffic is enormous must filter for packets of interest BPF allows in-kernel selection of packets Filtering based on fields in packet header Filtering in kernel more efficient than filtering in user space Unwanted packet are discarded early Avoids passing every packet over kernel-user-space boundary LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Seccomp filtering and BPF 15 / 55

BPF virtual machine BPF defines a virtual machine (VM) that can be implemented inside kernel VM characteristics: Simple instruction set Small set of instructions All instructions are same size Implementation is simple and fast Only branch-forward instructions Programs are directed acyclic graphs (DAGs) Easy to verify validity/safety of programs Program completion is guaranteed (DAGs) Simple instruction set can verify opcodes and arguments Can detect dead code Can verify that program completes via a return instruction BPF filter programs are limited to 4096 instructions LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Seccomp filtering and BPF 16 / 55

Generalizing BPF BPF originally designed to work with network packet headers Seccomp 2 developers realized BPF could be generalized to solve different problem: filtering of system calls Same basic task: test-and-branch processing based on content of a small set of memory locations Further generalization ( extended BPF ) is ongoing Linux 3.18: adding filters to kernel tracepoints Linux 3.19: adding filters to raw sockets In progress (July 2015): filtering of perf events LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Seccomp filtering and BPF 17 / 55

Outline 1 Introductions 2 Introduction and history 3 Seccomp filtering and BPF 4 Constructing seccomp filters 5 BPF programs 6 Further details on seccomp filters 7 Applications, tools, and further information

Basic features of BPF virtual machine Accumulator register Data area (data to be operated on) In seccomp context: data area describes system call Implicit program counter (Recall: all instructions are same size) Instructions contained in structure of this form: struct sock_ filter { /* Filter block */ u16 code ; /* Filter code ( opcode )*/ u8 jt; /* Jump true */ u8 jf; /* Jump false */ u32 k; /* Generic multiuse field */ }; See <linux/filter.h> and <linux/bpf_common.h> LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Constructing seccomp filters 19 / 55

BPF instruction set Instruction set includes: Load instructions Store instructions Jump instructions Arithmetic/logic instructions ADD, SUB, MUL, DIV, MOD, NEG OR, AND, XOR, LSH, RSH Return instructions Terminate filter processing Report a status telling kernel what to do with syscall LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Constructing seccomp filters 20 / 55

BPF jump instructions Conditional and unconditional jump instructions provided Conditional jump instructions consist of Opcode specifying condition to be tested Value to test against Two jump targets jt: target if condition is true jf: target if condition is false Conditional jump instructions: JEQ: jump if equal JGT: jump if greater JGE: jump if greater or equal JSET: bit-wise AND + jump if nonzero result jf target no need for JNE, JLT, JLE, and JCLEAR LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Constructing seccomp filters 21 / 55

BPF jump instructions Targets are expressed as relative offsets in instruction list 0 == no jump (execute next instruction) jt and jf are 8 bits 255 maximum offset for conditional jumps Unconditional JA ( jump always ) uses k as offset, allowing much larger jumps LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Constructing seccomp filters 22 / 55

Seccomp BPF data area Seccomp provides data describing syscall to filter program Buffer is read-only Format (expressed as C struct): struct seccomp_ data { int nr; /* System call number */ u32 arch ; /* AUDIT_ ARCH_ * value */ u64 instruction_ pointer ; /* CPU IP */ u64 args [6]; /* System call arguments */ }; LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Constructing seccomp filters 23 / 55

Seccomp BPF data area struct seccomp_ data { int nr; /* System call number */ u32 arch ; /* AUDIT_ ARCH_ * value */ u64 instruction_ pointer ; /* CPU IP */ u64 args [6]; /* System call arguments */ }; nr: system call number (architecture-dependent) arch: identifies architecture Constants defined in <linux/audit.h> AUDIT_ARCH_X86_64, AUDIT_ARCH_I386, AUDIT_ARCH_ARM, etc. instruction_pointer: CPU instruction pointer args: system call arguments System calls have maximum of six arguments Number of elements used depends on system call LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Constructing seccomp filters 24 / 55

Building BPF instructions Obviously, one can code BPF instructions numerically by hand But, header files define symbolic constants and convenience macros (BPF_STMT(), BPF_JUMP()) to ease the task # define BPF_STMT (code, k) \ { ( unsigned short )( code ), 0, 0, k } # define BPF_JUMP (code, k, jt, jf) \ { ( unsigned short )( code ), jt, jf, k } (Macros just plug values together to form structure) LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Constructing seccomp filters 25 / 55

Building BPF instructions: examples Load architecture number into accumulator BPF_ STMT ( BPF_LD BPF_W BPF_ABS, ( offsetof ( struct seccomp_data, arch ))) Opcode here is constructed by ORing three values together: BPF_LD: load BPF_W: operand size is a word BPF_ABS: address mode specifying that source of load is data area (containing system call data) See <linux/bpf_common.h> for definitions of opcode constants offsetof() generates offset of desired field in data area LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Constructing seccomp filters 26 / 55

Building BPF instructions: examples Test value in accumulator BPF_ JUMP ( BPF_JMP BPF_JEQ BPF_K, AUDIT_ ARCH_ X86_ 64, 1, 0) BPF_JMP BPF_JEQ: jump with test on equality BPF_K: value to test against is in generic multiuse field (k) k contains value AUDIT_ARCH_X86_64 jt value is 1, meaning skip one instruction if test is true jf value is 0, meaning skip zero instructions if test is false I.e., continue execution at following instruction Return value that causes kernel to kill process with SIGSYS BPF_ STMT ( BPF_RET BPF_K, SECCOMP_ RET_ KILL ) LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Constructing seccomp filters 27 / 55

Checking the architecture Checking architecture value should be first step in any BPF program Architecture may support multiple system call conventions E.g. x86 hardware supports x86-64 and i386 System call numbers may differ or overlap LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Constructing seccomp filters 28 / 55

Filter return value Once a filter is installed, each system call is tested against filter Seccomp filter must return a value to kernel indicating whether system call is permitted Otherwise EINVAL when attempting to install filter Return value is 32 bits, in two parts: Most significant 16 bits (SECCOMP_RET_ACTION mask) specify an action to kernel Least significant 16 bits (SECCOMP_RET_DATA mask) specify data for return value LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Constructing seccomp filters 29 / 55

Filter return action Filter return action component is one of SECCOMP_RET_ALLOW: system call is executed SECCOMP_RET_KILL: process is immediately terminated Terminated as though process had been killed with SIGSYS SECCOMP_RET_ERRNO: return an error from system call System call is not executed Value in SECCOMP_RET_DATA is returned in errno SECCOMP_RET_TRACE: attempt to notify ptrace() tracer Gives tracing process a chance to assume control See seccomp(2) SECCOMP_RET_TRAP: process is sent SIGSYS signal Can catch this signal; see seccomp(2) for more details LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Constructing seccomp filters 30 / 55

Outline 1 Introductions 2 Introduction and history 3 Seccomp filtering and BPF 4 Constructing seccomp filters 5 BPF programs 6 Further details on seccomp filters 7 Applications, tools, and further information

Installing a BPF program A process installs a filter for itself using one of: seccomp(seccomp_set_mode_filter, flags, &fprog) Only since Linux 3.17 prctl(pr_set_seccomp, SECCOMP_MODE_FILTER, &fprog) &fprog is a pointer to a BPF program: struct sock_ fprog { unsigned short len ; /* Number of instructions */ struct sock_ filter * filter ; /* Pointer to program ( array of instructions ) */ }; LinuxCon.eu 2015 (C) 2015 Michael Kerrisk BPF programs 32 / 55

Installing a BPF program To install a filter, one of the following must be true: Caller is privileged (CAP_SYS_ADMIN) Caller has to set the no_new_privs process attribute: prctl ( PR_SET_NO_NEW_PRIVS, 1); Causes set-uid/set-gid bit / file capabilities to be ignored on subsequent execve() calls Once set, no_new_privs can t be unset Prevents possibility of attacker starting privileged program and manipulating it to misbehave using a seccomp filter! no_new_privs &&! CAP_SYS_ADMIN seccomp() fails with EACCES LinuxCon.eu 2015 (C) 2015 Michael Kerrisk BPF programs 33 / 55

Example: seccomp/seccomp_deny_open.c 1 int main ( int argc, char ** argv ) { 2 prctl ( PR_ SET_ NO_ NEW_ PRIVS, 1, 0, 0, 0); 3 4 install_ filter (); 5 6 open ("/ tmp /a", O_RDONLY ); 7 8 printf (" We shouldn t see this message \ n"); 9 exit ( EXIT_SUCCESS ); 10 } Program installs a filter that prevents open() being called, and then calls open() Set no_new_privs bit Install seccomp filter Call open() LinuxCon.eu 2015 (C) 2015 Michael Kerrisk BPF programs 34 / 55

Example: seccomp/seccomp_deny_open.c 1 static void install_ filter ( void ) { 2 struct sock_ filter filter [] = { 3 BPF_ STMT ( BPF_LD BPF_W BPF_ABS, 4 ( offsetof ( struct seccomp_data, arch ))), 5 BPF_ JUMP ( BPF_JMP BPF_JEQ BPF_K, 6 AUDIT_ARCH_X86_64, 1, 0), 7 BPF_ STMT ( BPF_RET BPF_K, SECCOMP_ RET_ KILL ), 8... Define and initialize array (of structs) containing BPF filter program Load architecture into accumulator Test if architecture value matches AUDIT_ARCH_X86_64 True: jump forward one instruction (i.e., skip next instruction) False: skip no instructions Kill process on architecture mismatch LinuxCon.eu 2015 (C) 2015 Michael Kerrisk BPF programs 35 / 55

Example: seccomp/seccomp_deny_open.c 1 BPF_ STMT ( BPF_LD BPF_W BPF_ABS, 2 ( offsetof ( struct seccomp_data, nr ))), 3 4 BPF_ JUMP ( BPF_JMP BPF_JEQ BPF_K, NR_ open, 5 1, 0), 6 BPF_ STMT ( BPF_RET BPF_K, SECCOMP_ RET_ ALLOW ), 7 8 BPF_ STMT ( BPF_RET BPF_K, SECCOMP_ RET_ KILL ) 9 }; Remainder of filter program Load system call number into accumulator Test if system call number matches NR_open True: advance one instruction kill process False: advance 0 instructions allow system call LinuxCon.eu 2015 (C) 2015 Michael Kerrisk BPF programs 36 / 55

Example: seccomp/seccomp_deny_open.c 1 struct sock_ fprog prog = { 2. len = ( unsigned short ) ( sizeof ( filter ) / 3 sizeof ( filter [0])), 4. filter = filter, 5 }; 6 7 seccomp ( SECCOMP_ SET_ MODE_ FILTER, 0, & prog ); 8 } Construct argument for seccomp() Install filter LinuxCon.eu 2015 (C) 2015 Michael Kerrisk BPF programs 37 / 55

Example: seccomp/seccomp_deny_open.c Upon running the program, we see: $./ seccomp_ deny_ open Bad system call # Message printed by shell $ echo $? # Display exit status of last command 159 Bad system call indicates process was killed by SIGSYS Exit status of 159 (== 128 + 31) also indicates termination as though killed by SIGSYS Exit status of process killed by signal is 128 + signum SIGSYS is signal number 31 on this architecture LinuxCon.eu 2015 (C) 2015 Michael Kerrisk BPF programs 38 / 55

Example: seccomp/seccomp_control_open.c A more sophisticated example Filter based on flags argument of open() O_CREAT specified kill process O_WRONLY or O_RDWR specified cause open() to fail with ENOTSUP error LinuxCon.eu 2015 (C) 2015 Michael Kerrisk BPF programs 39 / 55

Example: seccomp/seccomp_control_open.c struct sock_ filter filter [] = { BPF_ STMT ( BPF_LD BPF_W BPF_ABS, ( offsetof ( struct seccomp_data, arch ))), BPF_ JUMP ( BPF_JMP BPF_JEQ BPF_K, AUDIT_ARCH_X86_64, 1, 0), BPF_ STMT ( BPF_RET BPF_K, SECCOMP_ RET_ KILL ), BPF_ STMT ( BPF_LD BPF_W BPF_ABS, ( offsetof ( struct seccomp_data, nr ))), BPF_ JUMP ( BPF_JMP BPF_JEQ BPF_K, NR_ open, 1, 0), BPF_ STMT ( BPF_RET BPF_K, SECCOMP_ RET_ ALLOW ), Load architecture and test for expected value Load system call number Test if system call number is NR_open True: skip next instruction False: skip 0 instructions permit all other syscalls LinuxCon.eu 2015 (C) 2015 Michael Kerrisk BPF programs 40 / 55

Example: seccomp/seccomp_control_open.c BPF_ STMT ( BPF_LD BPF_W BPF_ABS, ( offsetof ( struct seccomp_data, args [1] ))), BPF_ JUMP ( BPF_JMP BPF_ JSET BPF_K, O_CREAT, 0, 1), BPF_ STMT ( BPF_RET BPF_K, SECCOMP_ RET_ KILL ), Load second argument of open() (flags) Test if O_CREAT bit is set in flags True: skip 0 instructions kill process False: skip 1 instruction LinuxCon.eu 2015 (C) 2015 Michael Kerrisk BPF programs 41 / 55

Example: seccomp/seccomp_control_open.c BPF_ JUMP ( BPF_JMP BPF_ JSET BPF_K, O_ WRONLY O_RDWR, 0, 1), BPF_ STMT ( BPF_RET BPF_K, SECCOMP_ RET_ ERRNO ( ENOTSUP & SECCOMP_ RET_ DATA )), BPF_ STMT ( BPF_RET BPF_K, SECCOMP_ RET_ ALLOW ) }; Test if O_WRONLY or O_RDWR are set in flags True: cause open() to fail with ENOTSUP error in errno False: allow open() to proceed LinuxCon.eu 2015 (C) 2015 Michael Kerrisk BPF programs 42 / 55

Example: seccomp/seccomp_control_open.c int main ( int argc, char ** argv ) { prctl ( PR_ SET_ NO_ NEW_ PRIVS, 1, 0, 0, 0); install_ filter (); if ( open ("/ tmp /a", O_RDONLY ) == -1) perror (" open1 "); if ( open ("/ tmp /a", O_WRONLY ) == -1) perror (" open2 "); if ( open ("/ tmp /a", O_RDWR ) == -1) perror (" open3 "); if ( open ("/ tmp /a", O_CREAT O_RDWR, 0600) == -1) perror (" open4 "); } exit ( EXIT_SUCCESS ); Test open() calls with various flags LinuxCon.eu 2015 (C) 2015 Michael Kerrisk BPF programs 43 / 55

Example: seccomp/seccomp_control_open.c $./ seccomp_ control_ open open2 : Operation not supported open3 : Operation not supported Bad system call $ echo $? 159 First open() succeeded Second and third open() calls failed Kernel produced ENOTSUP error for call Fourth open() call caused process to be killed LinuxCon.eu 2015 (C) 2015 Michael Kerrisk BPF programs 44 / 55

Outline 1 Introductions 2 Introduction and history 3 Seccomp filtering and BPF 4 Constructing seccomp filters 5 BPF programs 6 Further details on seccomp filters 7 Applications, tools, and further information

Installing multiple filters If existing filters permit prctl() or seccomp(), further filters can be installed All filters are always executed, in reverse order of registration Each filter yields a return value Value returned to kernel is first seen action of highest priority (along with accompanying data) SECCOMP_RET_KILL (highest priority) SECCOMP_RET_TRAP SECCOMP_RET_ERRNO SECCOMP_RET_TRACE SECCOMP_RET_ALLOW (lowest priority) LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Further details on seccomp filters 46 / 55

fork() and execve() semantics If seccomp filters permit fork() or clone(), then child inherits parents filters If seccomp filters permit execve(), then filters are preserved across execve() LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Further details on seccomp filters 47 / 55

Cost of filtering, construction of filters Installed BPF filter(s) are executed for every system call there s a performance cost Example on x86-64: Use our deny open seccomp filter Requires 6 BPF instructions / permitted syscall Call getppid() repeatedly (one of cheapest syscalls) +25% execution time (with JIT compiler disabled) (Looks relatively high because getppid() is a cheap syscall) Obviously, order of filtering rules can affect performance Construct filters so that most common cases yield shortest execution paths If handling many different system calls, binary chop techniques can give O(logN) performance LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Further details on seccomp filters 48 / 55

Outline 1 Introductions 2 Introduction and history 3 Seccomp filtering and BPF 4 Constructing seccomp filters 5 BPF programs 6 Further details on seccomp filters 7 Applications, tools, and further information

Applications Possible applications: Building sandboxed environments Whitelisting usually safer than blacklisting Default treatment: block all system calls Then allow only a limited set of syscall / argument combinations Various examples mentioned earlier Failure-mode testing Place application in environment where unusual / unexpected failures occur Blacklist certain syscalls / argument combinations to generate failures LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Applications, tools, and further information 50 / 55

Tools: libseccomp High-level API for kernel creating seccomp filters https://github.com/seccomp/libseccomp Initial release: 2012 Simplifies various aspects of building filters Eliminates tedious/error-prone tasks such as changing branch instruction counts when instructions are inserted Abstract architecture-dependent details out of filter creation Can output generated code in binary (for seccomp filtering) or human-readable form ( pseudofilter code ) Don t have full control of generated code, but can give hints about which system calls to prioritize in generated code http://lwn.net/articles/494252/ Fully documented with man pages that contain examples (!) LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Applications, tools, and further information 51 / 55

Other tools bpfc (BPF compiler) Compiles assembler-like BPF programs to byte code Part of netsniff-ng project (http://netsniff-ng.org/) LLVM has an ebpf back end (merged Jan 2015) ebpf support for seccomp is planned Compiles subset of C to BPF C dialect; does not provide: loops, global variables, FP numbers, vararg functions, passing structs as args... Examples in kernel source: samples/bpf/*_kern.c GCC patches exist, but not (yet?) merged upstream LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Applications, tools, and further information 52 / 55

Other tools In-kernel JIT (just-in-time) compiler Compiles BPF binary to native machine code at load time Execution speed up of 2x to 3x (or better, in some cases) Disabled by default; enable by writing 1 to /proc/sys/net/core/bpf_jit_enable See bpf(2) man page LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Applications, tools, and further information 53 / 55

Resources Kernel source files: Documentation/prctl/seccomp_filter.txt, Documentation/networking/filter.txt http://outflux.net/teach-seccomp/ Shows handy trick for discovering which of an application s system calls don t pass filtering seccomp(2) man page Seccomp sandboxes and memcached example blog.viraptor.info/post/seccomp-sandboxes-and-memcached-example-part-1 blog.viraptor.info/post/seccomp-sandboxes-and-memcached-example-part-2 LinuxCon.eu 2015 (C) 2015 Michael Kerrisk Applications, tools, and further information 54 / 55

Thanks! mtk@man7.org Slides at http://man7.org/conf/ Linux/UNIX system programming training (and more) http://man7.org/training/ The Linux Programming Interface, http://man7.org/tlpi/