Programming in C and C++

Programming in C and C++
10. C Semantics: Undefined Behaviour & Optimization Issues

Dr. Anil Madhavapeddy, University of Cambridge
(based on previous years; thanks to Alan Mycroft, Alastair Beresford and Andrew Moore)
Michaelmas Term 2014-2015

C is Portable Assembly

C is the most widely deployed language in the world:
- Unix was originally written in assembly, and C was created as a thin layer to avoid rewriting all of it for every new CPU.
- C now compiles to almost every CPU architecture ever made!
- Statements typically map to a small set of machine instructions.
- Numerous language extensions exist to expose hardware features.

For example, to add two numbers in x86 assembly with GCC:

    int foo = 10, bar = 15;
    asm volatile ("addl %%ebx,%%eax"
                  :"=a"(foo)
                  :"a"(foo), "b"(bar)
                 );
    printf("foo+bar=%d\n", foo);

The Optimisation Tradeoff

There is a tension between:
- The programmer being able to predict the performance of a block of code.
- The compiler generating fast machine code for specific hardware.
- The language remaining portable on present and future architectures.
- Revisions to C remaining backwards compatible with existing code.

Balance is achieved via an ongoing language specification process:
- Originally a book by Kernighan and Ritchie in 1978 ("K&R C").
- Replaced by an ANSI standard in 1989 ("ANSI C" or "C89").
- Revised by ISO with features like IEEE 754 floating point ("C99").
- Most recent standard in 2011 has a detailed memory model ("C11").

The C Abstract Machine

The C standard defines these broad concepts:
- The semantic descriptions of language features describe the behaviour of an abstract machine in which issues of optimization are irrelevant.
- Side effects are defined as accessing a volatile object, modifying an object or file, or calling a function that does any of those operations.
- Side effects change the state of the execution environment and are produced by expression evaluation.
- Certain points in the control flow are sequence points, at which all side effects of the evaluations so far are guaranteed to be complete.

Examples of Sequence Points in C

- Between the left and right operands of the && or || operators.

    // Side effects on p happen before those on q
    *p++ != 0 && *q++ != 0

- Between the evaluation of the first operand of the ternary (question mark) operator and the second and third operands.

    // the first increment of p is complete before the second *p++ is evaluated
    a = (*p++) ? (*p++) : 0

- At the end of a full expression; e.g. return statements, the controlling expressions of if, switch, while or do/while, and all three expressions in a for statement.

Examples of Sequence Points in C (cont.)

Function calls have several sequence points:

- Before a function call is entered.

    f(i++)  // f() is called with the original value of i
            // but i is incremented before entering the body of f()

- The order in which arguments are evaluated is not specified.

    f(i++) + g(j++) + h(k++)

  f(), g(), h() may be executed in any order. i, j, k may be incremented in any order. Always obeying the previous rule, i is incremented before f() is entered, as is j for g() and k for h() respectively.

    f(i++, j++) + h(k++)

  i and j may be incremented in any order, but both will be incremented before f() is entered.
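The first rule above can be observed directly. The following is a minimal sketch, not part of the lecture: the global counter `i`, the recording variables and the helper `demo_call` are hypothetical names introduced only so that `f()` can see the counter from inside its body.

```c
#include <stdio.h>

/* Hypothetical global counter, visible inside f() so the effect can be observed. */
int i = 0;

/* Values recorded by the most recent call to f(). */
int seen_arg;
int seen_global;

/* Records the argument it received and the value of i at entry. */
void f(int arg) {
    seen_arg = arg;
    seen_global = i;
}

/* Calls f(i++) starting from i == start and prints what f() observed. */
void demo_call(int start) {
    i = start;
    f(i++); /* sequence point after argument evaluation, before the body runs */
    printf("argument=%d, i inside f=%d\n", seen_arg, seen_global);
}
```

Calling `demo_call(0)` should report that the argument still holds the original value while `i` has already been incremented by the time the body of `f()` runs.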

Execution of the Abstract Machine

In the abstract machine, all expressions are evaluated as specified by the language semantics in the standard.

An implementation need not evaluate part of an expression if it can deduce that:
- its value is not used
- no needed side effects are produced (including any caused by function calls or accessing a volatile object)

If an asynchronous signal is received, only the values of objects from the previous sequence point can be relied on.

Freedom to Optimize

Outside of the abstract machine, the C standard leaves plenty of room for compilers to optimize source code to take advantage of specific hardware. The standard defines three types of compiler freedom:
- implementation-defined behaviour means that the compiler must choose and document a behaviour and obey it consistently.
- unspecified behaviour means that from a given set of possibilities, the compiler chooses one (or different ones within the same program).
- undefined behaviour results in arbitrary behaviour from the compiler.

"When the compiler encounters [a given undefined construct] it is legal for it to make demons fly out of your nose"
-- comp.std.c newsgroup, circa 1992

Implementation-defined Behaviour

These are typically set depending on the target hardware architecture and the operating system's Application Binary Interface (ABI). Examples of implementation-defined behaviour include:
- Number of bits in a byte (minimum of 8, and exactly 8 on virtually every modern system; 36-bit machines such as the PDP-10 used other byte sizes).
- sizeof(int): an int is commonly 32 bits wide, but other widths exist.
- Narrowing conversions such as char a = (char)123456;
- Results of some bitwise operations on signed integers.
- Result of converting a pointer to an integer or vice versa.

Compiler warnings help spot accidental dependency on this behaviour:

    test.c:3:13: warning: shift count >= width of type
          [-Wshift-count-overflow]
        int x = a >> 64;
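Several of these values can be queried rather than assumed. A small sketch using only the standard headers; the function names `print_platform_properties` and `int_storage_bits` are made up for this illustration.

```c
#include <limits.h>
#include <stdio.h>

/* Prints a few implementation-defined properties of the current platform. */
void print_platform_properties(void) {
    printf("bits per byte (CHAR_BIT): %d\n", CHAR_BIT);
    printf("sizeof(int): %zu bytes\n", sizeof(int));
    printf("sizeof(long): %zu bytes\n", sizeof(long));
    printf("sizeof(void *): %zu bytes\n", sizeof(void *));
    printf("plain char is %s\n", CHAR_MIN < 0 ? "signed" : "unsigned");
}

/* Number of bits of storage occupied by an int on this platform. */
int int_storage_bits(void) {
    return (int)(sizeof(int) * CHAR_BIT);
}
```

The output differs between platforms, which is the point: code that depends on a particular answer is relying on implementation-defined behaviour.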

Unspecified Behaviour

The compiler can vary these within the same program to maximise the effectiveness of its optimisations. For example:
- Evaluation order of arguments in a function call.
- The order in which side effects take place when not otherwise explicitly specified by the standard.
- Whether a call to an inline function uses the inline or external definition.
- The memory layout of storage for function arguments.
- The order and contiguity of storage allocated by successive calls to the malloc, calloc or realloc functions.

Undefined Behaviour

There is a long list of undefined behaviours in the C standard (J.2). Some examples include:

- Between two sequence points, an object is modified more than once, or is modified and the prior value is read other than to determine the value to be stored.

    int i = 0;
    f(i++, i++); // what arguments will f be called with?

- Conversion of a pointer to an integer type produces a value outside the range that can be represented.

    char *p = malloc(10);
    short x = (short)p; // if address space larger than short?

- The initial character of an identifier is a universal character name designating a digit.

    char \u0031morething;
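The first example becomes well defined once each increment is placed in its own statement. A minimal sketch, assuming a hypothetical two-argument `f` (the slide does not define one) and a hypothetical wrapper `call_in_order`:

```c
/* Hypothetical stand-in for the f() on the slide: combines both arguments. */
int f(int first, int second) {
    return first * 10 + second;
}

/* Well-defined replacement for f(i++, i++): each increment is its own
   full expression, so the order is fixed by the statement order. */
int call_in_order(int *i) {
    int first = (*i)++;
    int second = (*i)++;
    return f(first, second);
}
```

With `*i` starting at 0, `f` receives 0 and then 1, and the counter ends at 2 on every conforming compiler.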

Undefined Behaviour (cont.)

The value of a pointer to an object whose lifetime has ended is used. We already encountered this back in lecture 4:

    #include <stdio.h>

    char *unary(unsigned short s) {
        char local[s+1];
        int i;
        for (i=0;i<s;i++) local[i]='1';
        local[s]='\0';
        return local;   // local's lifetime ends when unary() returns
    }

    int main(void) {
        printf("%s\n",unary(6)); //What does this print?
        return 0;
    }
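One way to avoid the dangling pointer is to give the buffer a lifetime that outlasts the call. This is a sketch of one possible fix rather than the lecture's own solution: it keeps the name `unary` but allocates on the heap, so the caller becomes responsible for calling `free()`.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a heap-allocated string of s '1' characters, or NULL on failure.
   The caller owns the buffer and must free() it. */
char *unary(unsigned short s) {
    char *buf = malloc((size_t)s + 1);
    if (buf == NULL)
        return NULL;
    memset(buf, '1', s);
    buf[s] = '\0';
    return buf;
}
```

A caller would check the result for NULL, use the string, and then release it with `free()`.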

Undefined Behaviour (cont.)

Use of a variable before it has been initialized.

    int main(int argc, char **argv) {
        int i;                  // never initialized
        while (i < 10) {
            printf("%d\n", i);
            i++;
        }
    }

Accessing out-of-bounds memory.

    char *buf = malloc(10);
    buf[10] = '\0';             // valid indices are 0..9

Dereferencing a NULL pointer or wild pointer (e.g. after calling free).

    char *buf = malloc(10);
    buf[0] = 'a';  // what if buf is NULL?
    free(buf);
    buf[0] = '\0'; // buf has been freed
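For comparison, the three fragments above can be written without undefined behaviour. This is only a sketch; `count_to_ten` and `use_buffer_safely` are hypothetical names chosen for the illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Prints 0..9 using an explicitly initialised loop variable and
   returns how many values were printed. */
int count_to_ten(void) {
    int printed = 0;
    for (int i = 0; i < 10; i++) {   /* i is initialised before first use */
        printf("%d\n", i);
        printed++;
    }
    return printed;
}

/* Allocates a 10-byte buffer, writes only inside its bounds, and never
   touches it after free(). Returns 0 on success, -1 if malloc fails. */
int use_buffer_safely(void) {
    size_t len = 10;
    char *buf = malloc(len);
    if (buf == NULL)          /* malloc may fail: check before dereferencing */
        return -1;
    buf[0] = 'a';
    buf[len - 1] = '\0';      /* last valid index is len - 1, not len */
    free(buf);                /* buf is not used again after this point */
    return 0;
}
```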

Undefined Behaviour (cont.)

Signed integer arithmetic is undefined if the result overflows. For instance, for type int, INT_MAX + 1 is not guaranteed to be INT_MIN.

By knowing that values cannot overflow, the compiler can enable useful optimisations:

    for (i = 0; i <= N; ++i) { ... }

- If signed overflow is undefined, then the compiler can assume the loop runs exactly N+1 times.
- But if overflow were defined to wrap around, then the loop could potentially be infinite and hard to optimise. Consider this case:

    for (i = 0; i <= INT_MAX; ++i) { ... }

Undefined behaviour thus enables useful compiler optimisations. But at what cost?
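Since the overflow itself is undefined, a program that needs to detect it has to test before performing the addition. A portable sketch follows; `checked_add` is a hypothetical helper, not something the lecture defines.

```c
#include <limits.h>
#include <stdbool.h>

/* Stores a + b in *result and returns true if the sum fits in an int.
   Returns false, leaving *result untouched, if it would overflow. */
bool checked_add(int a, int b, int *result) {
    if ((b > 0 && a > INT_MAX - b) ||
        (b < 0 && a < INT_MIN - b))
        return false;
    *result = a + b;
    return true;
}
```

The comparisons only ever compute `INT_MAX - b` for positive `b` and `INT_MIN - b` for negative `b`, so the check itself cannot overflow.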

Undefined Behaviour (cont.)

The C standard states (Section 1.3.24):

"Permissible undefined behaviour ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message)."

- Unspecified behaviour usually relates to platform portability issues and can be dealt with (e.g. 32-bit vs 64-bit, Linux vs Windows).
- Undefined behaviour should be avoided at all costs, even if your program appears to work fine in one instance.

Security Issues due to Undefined Behaviour

    void read_from_network(int size) {
        // Catch integer overflow.
        if (size > size+1)
            errx(1, "packet too big");

        char *buf = malloc(size+1);
        if (buf == NULL)
            errx(1, "out of memory");

        read(fd, buf, size);
        //... error checking on read.

        buf[size] = 0;
        process_packet(buf);
        free(buf);
    }

Security Issues due to Undefined Behaviour (cont.)

    void read_from_network(int size) {
        // size > size+1 is impossible since signed
        // overflow is impossible. Optimize it out!
        // if (size > size+1)
        //     errx(1, "packet too big");

        char *buf = malloc(size+1);
        if (buf == NULL)
            errx(1, "out of memory");

        read(fd, buf, size);
        //... error checking on read.

        buf[size] = 0;
        process_packet(buf);
        free(buf);
    }
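A check that the compiler cannot remove has to avoid the overflowing expression altogether. One possible sketch, assuming a hypothetical helper `packet_alloc_size` that `read_from_network` would call before `malloc`:

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Computes the buffer size needed for a packet of `size` bytes plus a
   terminating zero byte. Returns false for negative sizes or when size + 1
   would not fit in an int; no overflowing arithmetic is ever performed. */
bool packet_alloc_size(int size, size_t *alloc_size) {
    if (size < 0 || size == INT_MAX)
        return false;
    *alloc_size = (size_t)size + 1;
    return true;
}
```

Because the test compares `size` against `INT_MAX` instead of computing `size + 1` in signed arithmetic, every path through the function has defined behaviour and the check has to be preserved.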

Security Issues due to Undefined Behaviour (cont.)

"A computation involving unsigned operands can never overflow, because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting type."
-- C99 Standard Sec 6.2.5/9

- This optimisation does not occur with unsigned int, since it is defined to wrap around.
- Only signed overflow is undefined, so using unsigned integers to track buffer sizes is wise.
- Particularly insidious since a security auditor reading the code can easily miss this. There appears to be a security check in the code!
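The wrap-around rule quoted above can be checked with a few lines of code. A minimal sketch; both function names are hypothetical.

```c
#include <limits.h>
#include <stdbool.h>

/* Unsigned arithmetic wraps modulo UINT_MAX + 1, so this is well defined. */
unsigned int wrapping_increment(unsigned int size) {
    return size + 1u;
}

/* Because the wrap is defined, the compiler must keep this check:
   it is true exactly when size == UINT_MAX. */
bool increment_wraps(unsigned int size) {
    return size > size + 1u;
}
```

Here `wrapping_increment(UINT_MAX)` yields 0, and `increment_wraps` is the unsigned counterpart of the `size > size+1` test that was deleted in the signed version.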

Security Issues due to Undefined Behaviour (cont.)

Poor Logic: "The x86 ADD instruction is used to implement C's signed add operation, and it has two's complement behaviour when the result overflows. I'm developing for an x86 platform, so I should be able to expect two's complement semantics when 32-bit signed integers overflow."

Analogy: "Somebody once told me that in football you can't pick up the ball and run. I got a football and tried it and it worked just fine. They obviously didn't understand football."

It is possible to sometimes get away with undefined behaviour, but it will eventually go wrong in very unexpected ways.

Further Reading: http://blog.regehr.org/archives/213

Compiler View on Undefined Behaviour

- The compiler only has to preserve the behaviour of statements whose behaviour is defined, and can optimise the rest away.
- In the C abstract machine, every operation that is performed is either defined or undefined (as classified by the C standard).
- Application developers need to ensure that the program is defined for every input, as the compiler is an adversary that can jump on undefined behaviour and subvert the original intention.

Compiler View on Undefined Behaviour (cont.)

Consider a simplified program fragment, where function definitions:
- terminate for every input
- run in a single thread
- have infinite computing resources.

The compiler categorises these functions into three kinds:
- Always Defined: No restrictions on their inputs, and they are defined for all possible inputs.
- Sometimes Defined: Some restrictions on their inputs, so they can be either defined or undefined depending on the input value.
- Always Undefined: No valid inputs are admitted, and they are always undefined.

Always-Defined Functions

    int32_t
    safe_div_int32_t (int32_t a, int32_t b)
    {
        if ((b == 0) || ((a == INT32_MIN) && (b == -1))) {
            report_integer_math_error();
            return 0;
        } else {
            return a / b;
        }
    }

- Inputs are carefully checked to ensure that undefined operations invoke an error function.
- Error handling functions are not the same as undefined behaviour since they have explicitly defined semantics.

Sometimes-Defined Functions

    int32_t
    safe_div_int32_t (int32_t a, int32_t b)
    {
        return a / b;
    }
    // function call undefined iff
    // ((b == 0) || ((a == INT32_MIN) && (b == -1)))

- All calls to safe_div_int32_t must now (somehow) be checked to ensure that the preconditions are met.
- If the compiler statically detects an input that would be undefined, it can skip the function call entirely.
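One way to make the "somehow" concrete is to expose the precondition as a predicate that call sites test first. This is an illustrative sketch only; `div_int32_is_defined`, `unchecked_div_int32` and `div_or` are hypothetical names, and the fallback-value policy is an assumption rather than anything the lecture prescribes.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when a / b is defined for int32_t operands: the divisor is non-zero
   and the quotient is representable (INT32_MIN / -1 is not). */
bool div_int32_is_defined(int32_t a, int32_t b) {
    return b != 0 && !(a == INT32_MIN && b == -1);
}

/* The unchecked division from the slide, renamed here to make its
   precondition visible at call sites (hypothetical name). */
int32_t unchecked_div_int32(int32_t a, int32_t b) {
    return a / b;
}

/* Call-site pattern: test the precondition, return `fallback` otherwise. */
int32_t div_or(int32_t a, int32_t b, int32_t fallback) {
    return div_int32_is_defined(a, b) ? unchecked_div_int32(a, b) : fallback;
}
```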

Always-Undefined Functions

    int32_t
    safe_div_int32_t (int32_t a, int32_t b)
    {
        bool check_overflow;
        if (check_overflow &&
            ((b == 0) || ((a == INT32_MIN) && (b == -1)))) {
            report_integer_math_error();
            return 0;
        } else {
            return a / b;
        }
    }

Why is this variant of safe_div_int32_t always undefined for all inputs?

Case Analysis in Linux kernel

    static void __devexit agnx_pci_remove (struct pci_dev *pdev)
    {
        struct ieee80211_hw *dev = pci_get_drvdata(pdev);
        struct agnx_priv *priv = dev->priv;

        if (!dev) return;
        //... do stuff using dev...
    }

This code gets a pointer to a device struct, tests for null and uses it. But the pointer is dereferenced before the null check!

An optimising compiler (e.g. gcc at -O2 or higher) performs the following case analysis:
- if dev == NULL then dev->priv has undefined behaviour.
- if dev != NULL then the null pointer check will not fail.

The null pointer check is therefore dead code and may be deleted.

End result: the !dev check is optimised away, which creates a security issue.
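The repair is to test the pointer before the first dereference. The sketch below uses simplified, hypothetical stand-ins (`struct demo_hw`, `struct demo_priv`, `get_priv_checked`) rather than the real kernel types, so that it can be compiled on its own.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct demo_priv { int irq; };
struct demo_hw { struct demo_priv *priv; };

/* Returns the private data for dev, or NULL if dev itself is NULL.
   The pointer is tested before it is dereferenced, so the compiler
   has no grounds to delete the check. */
struct demo_priv *get_priv_checked(struct demo_hw *dev) {
    if (!dev)
        return NULL;
    return dev->priv;
}
```

With this ordering, neither branch of the compiler's case analysis involves undefined behaviour, so the null check has to stay.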

Living with Undefined Behaviour

Long term: Unsafe programming languages will only be used to build safer abstractions (e.g. Java or ML are bootstrapped in C/C++).

Short term: No straightforward answer, as a combination of tools and techniques is needed.
- Use multiple compilers (clang and gcc) and enable all compiler warnings (-Wall).
- Use static analysis tools (like clang-analyzer or Coverity) to spot errors, or dynamic analysis engines like Valgrind.
- Modern clang and gcc have undefined behaviour sanitizers that detect and generate errors for many classes of undefined behaviour via the -fsanitize=undefined command line flag.
- Avoid sometimes-defined functions by checking all inputs at the point of use.
- Use high-quality third-party libraries that obey these rules.

"Be very careful, use good tools, and hope for the best."
-- John Regehr, http://blog.regehr.org/archives/213