C11 Compiler Mappings: Exploration, Verification, and Counterexamples

Similar documents
TriCheck: Memory Model Verification at the Trisection of Software, Hardware, and ISA

Taming release-acquire consistency

RTLCheck: Verifying the Memory Consistency of RTL Designs

Relaxed Memory: The Specification Design Space

Reasoning about the C/C++ weak memory model

Program logics for relaxed consistency

Full-Stack Memory Model Verification with TriCheck

TriCheck: Memory Model Verification at the Trisection of Software, Hardware, and ISA

An introduction to weak memory consistency and the out-of-thin-air problem

TriCheck: Memory Model Verification at the Trisection of Software, Hardware, and ISA

Repairing Sequential Consistency in C/C++11

The C1x and C++11 concurrency model

Declarative semantics for concurrency. 28 August 2017

Repairing Sequential Consistency in C/C++11

Repairing Sequential Consistency in C/C++11

PipeProof: Automated Memory Consistency Proofs for Microarchitectural Specifications

The C/C++ Memory Model: Overview and Formalization

Hardware models: inventing a usable abstraction for Power/ARM. Friday, 11 January 13

Multicore Programming: C++0x

New Programming Abstractions for Concurrency. Torvald Riegel Red Hat 12/04/05

GPU Concurrency: Weak Behaviours and Programming Assumptions

New Programming Abstractions for Concurrency in GCC 4.7. Torvald Riegel Red Hat 12/04/05

Constructing a Weak Memory Model

arxiv: v1 [cs.pl] 29 Aug 2018

RTLCheck: Verifying the Memory Consistency of RTL Designs

CCICheck: Using µhb Graphs to Verify the Coherence-Consistency Interface

Implementing the C11 memory model for ARM processors. Will Deacon February 2015

Taming Release-Acquire Consistency

Overhauling SC atomics in C11 and OpenCL

Load-reserve / Store-conditional on POWER and ARM

Using Weakly Ordered C++ Atomics Correctly. Hans-J. Boehm

RELAXED CONSISTENCY 1

CS533 Concepts of Operating Systems. Jonathan Walpole

Mechanised industrial concurrency specification: C/C++ and GPUs. Mark Batty University of Kent

Check Quick Start. Figure 1: Closeup of icon for starting a terminal window.

Overview: Memory Consistency

Research Faculty Summit Systems Fueling future disruptions

Lowering C11 Atomics for ARM in LLVM

Partially Redundant Fence Elimination for x86, ARM, and Power Processors

The Semantics of x86-cc Multiprocessor Machine Code

<atomic.h> weapons. Paolo Bonzini Red Hat, Inc. KVM Forum 2016

A unified machine-checked model for multithreaded Java

Understanding POWER multiprocessors

Shared Memory Programming with OpenMP. Lecture 8: Memory model, flush and atomics

Distributed Operating Systems Memory Consistency

Memory Consistency Models

Relaxed Memory-Consistency Models

Industrial concurrency specification for C/C++ Mark Batty University of Kent

CS510 Advanced Topics in Concurrency. Jonathan Walpole

Overhauling SC atomics in C11 and OpenCL

Memory Consistency Models: Convergence At Last!

Synthesizing Memory Models

Automated Synthesis of Comprehensive Memory Model Litmus Test Suites

Weak Memory Models with Matching Axiomatic and Operational Definitions

CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. Lecture 19 Memory Consistency Models

High-level languages

The Java Memory Model

THÈSE DE DOCTORAT de l Université de recherche Paris Sciences Lettres PSL Research University

Library Abstraction for C/C++ Concurrency

Memory Consistency Models

Common Compiler Optimisations are Invalid in the C11 Memory Model and what we can do about it

Compositional Verification of Compiler Optimisations on Relaxed Memory

Foundations of the C++ Concurrency Memory Model

Parallel Computer Architecture Spring Memory Consistency. Nikos Bellas

On partial order semantics for SAT/SMT-based symbolic encodings of weak memory concurrency

Programming Language Memory Models: What do Shared Variables Mean?

arxiv: v5 [cs.lo] 9 Jan 2014

Verifying Concurrent Programs

Module 15: "Memory Consistency Models" Lecture 34: "Sequential Consistency and Relaxed Models" Memory Consistency Models. Memory consistency

Symmetric Multiprocessors: Synchronization and Sequential Consistency

Memory barriers in C

X-86 Memory Consistency

TSO-CC: Consistency-directed Coherence for TSO. Vijay Nagarajan

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

Multicore Programming Java Memory Model

Duet: Static Analysis for Unbounded Parallelism

NOW Handout Page 1. Memory Consistency Model. Background for Debate on Memory Consistency Models. Multiprogrammed Uniprocessor Mem.

Can Seqlocks Get Along with Programming Language Memory Models?

WG21/P0868R2: Selected RCU Litmus Tests

Shared Memory Consistency Models: A Tutorial

Weak memory models. Mai Thuong Tran. PMA Group, University of Oslo, Norway. 31 Oct. 2014

Moscova. Jean-Jacques Lévy. March 23, INRIA Paris Rocquencourt

Threads and Locks. Chapter Introduction Locks

Review of last lecture. Peer Quiz. DPHPC Overview. Goals of this lecture. Lock-based queue

C++ Concurrency - Formalised

Hardware Memory Models: x86-tso

Advanced OpenMP. Memory model, flush and atomics

Improving the Practicality of Transactional Memory

Stability in Weak Memory Models

Reasoning between Programming Languages and Architectures

Effective Stateless Model Checking for C/C++ Concurrency

Push-button verification of Files Systems via Crash Refinement

From POPL to the Jungle and back

Typed Assembly Language for Implementing OS Kernels in SMP/Multi-Core Environments with Interrupts

Sequential Consistency & TSO. Subtitle

CS252 Spring 2017 Graduate Computer Architecture. Lecture 15: Synchronization and Memory Models Part 2

Analyzing Snapshot Isolation

CS252 Graduate Computer Architecture Fall 2015 Lecture 14: Synchroniza>on and Memory Models

Designing Memory Consistency Models for. Shared-Memory Multiprocessors. Sarita V. Adve

A Data-flow Approach to Solving the von Neumann Bottlenecks

Transcription:

C11 Compiler Mappings: Exploration, Verification, and Counterexamples Yatin Manerkar Princeton University manerkar@princeton.edu http://check.cs.princeton.edu November 22 nd, 2016 1

Compilers Must Uphold HLL Guarantees High-Level Language (HLL) Program Compiler Assembly Language Program Compiler translates HLL statements into assembly instructions Code generated by compiler must provide functionality required by HLL program 2

Compilers Must Uphold HLL Guarantees C11 Program x.store(1); r1 = y.load(); Compiler X86 C11 Atomic Mapping X86 Assembly Language Program mov [eax], 1 MFENCE mov ebx, [ebx] C/C++11 standards introduced atomic operations Portable, high-performance concurrent code Compiler uses mapping to translate from atomic ops to assembly instructions 3

Compilers Must Uphold HLL Guarantees C11 Program x.store(1); r1 = y.load(); Compiler X86 C11 Atomic Mapping X86 Assembly Language Program mov [eax], 1 MFENCE mov ebx, [ebx] If mapping is correct, then for all programs: C11 Outcome Forbidden implies ISA-Level Outcome Forbidden 4

Exploring Mappings with TriCheck C11 Litmus Test Variants C11 Atomic Mapping ISA-level litmus tests Herd How do HLL outcomes compare to ISA-level outcomes? µcheck? C11 Outcomes ISA-Level Outcomes 5

Exploring Mappings with TriCheck C11 Litmus Test Variants C11 Atomic Mapping ISA-level litmus tests Herd If a mapping is correct, then for all programs: µcheck C11 Outcome Forbidden implies ISA-Level Outcome Forbidden 6

Counterexamples Detected! C11 Litmus Test Variants C11 Power/ARMv7 Trailing-Sync Atomic Mapping Power/ ARMv7-like litmus tests Herd µcheck C11 Outcome Forbidden but ISA-Level Outcome Allowed 7

C11 Litmus Test Variants Herd Counterexamples Detected! C11 Power/ARMv7 Trailing-Sync Atomic Mapping Power/ ARMv7-like litmus tests Counterexample implies mapping is flawed But mapping previously proven correct [Batty et al. POPL 2012] Must be an error in the proof! µcheck C11 Outcome Forbidden but ISA-Level Outcome Allowed 8

Outline Introduction Background on C11 model and mappings IRIW Counterexample and Analysis Loophole in Proof of Batty et al. IBM XL C++ Bugs Conclusions and Future Work 9

C11 Memory Model C11 memory model specifies a C11 program s allowed and forbidden outcomes Axiomatic model defined in terms of program executions Executions that satisfy C11 axioms are consistent Executions that do not satisfy axioms are forbidden Outcome only allowed if consistent execution exists C11 axioms defined in terms of various relations on an execution 10

C11 atomic operations Used to write portable, high-performance concurrent code Atomic ops can have different memory orders seq_cst, acquire, release, relaxed Stronger guarantees: easier correctness, lower performance Weaker guarantees: harder correctness, higher performance Example (y is an atomic variable): y.store(1, memory_order_release); int b = y.load(memory_order_acquire); 11

Relevant C11 Memory Model Relations Happens-before (hb) = sb sw + Transitive closure of statement order and synchronization order Total order on SC operations (sc) Must be acyclic Wsc x = 1 sc edges must not be in opposite direction to hb edges (sc must be consistent with hb) SC read operations cannot read from overwritten writes hb Rsc y = 0 sc 12

Power and ARMv7 Compiler Mappings Trailing-sync mapping: [Boehm 2011][Batty et al. POPL 2012] Power lwsync and ARMv7 dmb prior to releases ensure that prior accesses are made visible before the release 13

Power and ARMv7 Compiler Mappings Trailing-sync mapping: [Boehm 2011][Batty et al. POPL 2012] Power ctrlisync/sync and ARMv7 ctrlisb/dmb after acquires enforce that subsequent accesses are made visible after the acquire Use of sync/dmb for SC loads helps enforce the required C11 total order on SC operations 14

Power and ARMv7 Compiler Mappings Trailing-sync mapping: [Boehm 2011][Batty et al. POPL 2012] Power sync and ARMv7 dmb after SC stores ( trailing-sync ) prevent reordering with subsequent SC loads Ostensibly, this ordering can also be enforced by putting fences before SC loads 15

Power and ARMv7 Compiler Mappings Leading-sync mapping: [McKenney and Silvera 2011] Leading-sync mapping places these fences *before* SC loads Only translations of SC atomics change between the two mappings 16

Both Mappings are Currently Invalid Both supposedly proven correct [Batty et al. POPL 2012] We discovered two counterexamples to trailing-sync mappings on Power and ARMv7 Isolated the proof loophole that allowed flaw Vafeiadis et al. found counterexamples for leading-sync mapping, and have proposed solution 17

Outline Introduction Background on C11 model and mappings IRIW Counterexample and Analysis Loophole in Proof of Batty et al. IBM XL C++ Bugs Conclusions and Future Work 18

IRIW Trailing-Sync Counterexample T0 T1 T2 T3 x.store(1, seq_cst); y.store(1, seq_cst); r1 = x.load(acquire); r3 = y.load(acquire); r2 = y.load(seq_cst); r4 = x.load(seq_cst); Outcome: r1 = 1, r2 = 0, r3 = 1, r4 = 0 Variant of IRIW (Independent-Reads- Independent-Writes) litmus test IRIW corresponds to two cores observing stores to different addresses in different orders At least one of first loads on T2 and T3 is an acquire; all other accesses are SC 19

IRIW Counterexample Compilation T0 T1 T2 T3 x.store(1, seq_cst); y.store(1, seq_cst); r1 = x.load(acquire); r3 = y.load(acquire); r2 = y.load(seq_cst); r4 = x.load(seq_cst); Outcome: r1 = 1, r2 = 0, r3 = 1, r4 = 0 With trailing sync mapping, effectively compiles down to C0 C1 C2 C3 St x = 1 St y = 1 r1 = Ld x r3 = Ld y ctrlisync/ctrlisb ctrlisync/ctrlisb r2 = Ld y r4 = Ld x Allowed by Power model and hardware [Alglave et al. TOPLAS 2014] Allowed by ARMv7 model [Alglave et al. TOPLAS 2014] 20

IRIW Counterexample Compilation T0 T1 T2 T3 x.store(1, seq_cst); y.store(1, seq_cst); r1 = x.load(acquire); r3 = y.load(acquire); r2 = y.load(seq_cst); r4 = x.load(seq_cst); Outcome: r1 = 1, r2 = 0, r3 = 1, r4 = 0 With trailing sync mapping, effectively compiles down to C0 C1 C2 C3 St x = 1 St y = 1 r1 = Ld x r3 = Ld y ctrlisync/ctrlisb ctrlisync/ctrlisb r2 = Ld y Allowed by Power model and hardware [Alglave et al. TOPLAS 2014] Allowed by ARMv7 model [Alglave et al. TOPLAS 2014] r4 = Ld x ctrlisync/ctrlisb are not strong enough to forbid outcome 21

IRIW Trailing-Sync Counterexample T0 T1 T2 T3 x.store(1, seq_cst); y.store(1, seq_cst); r1 = x.load(acquire); r3 = y.load(acquire); r2 = y.load(seq_cst); r4 = x.load(seq_cst); Outcome: r1 = 1, r2 = 0, r3 = 1, r4 = 0 Happens-before edges from c f and from d h by transitivity 22

IRIW Trailing-Sync Counterexample T0 T1 T2 T3 x.store(1, seq_cst); y.store(1, seq_cst); r1 = x.load(acquire); r3 = y.load(acquire); r2 = y.load(seq_cst); r4 = x.load(seq_cst); Outcome: r1 = 1, r2 = 0, r3 = 1, r4 = 0 Happens-before edges from c f and from d h by transitivity 23

IRIW Trailing-Sync Counterexample T0 T1 T2 T3 x.store(1, seq_cst); y.store(1, seq_cst); r1 = x.load(acquire); r3 = y.load(acquire); r2 = y.load(seq_cst); r4 = x.load(seq_cst); Outcome: r1 = 1, r2 = 0, r3 = 1, r4 = 0 Happens-before edges from c f and from d h by transitivity 24

IRIW Trailing-Sync Counterexample SC order must contain edges from c f and from d h to match direction of hb edges Shown below as sc_hb edges c: Wsc x = 1 d: Wsc y = 1 f: Rsc y = 0 h: Rsc x = 0 25

IRIW Trailing-Sync Counterexample SC reads f and h must read from non-sc writes b and a before they are overwritten The SC order must contain f d and h c to satisfy this condition c: Wsc x = 1 d: Wsc y = 1 f: Rsc y = 0 h: Rsc x = 0 26

IRIW Trailing-Sync Counterexample SC reads f and h must read from non-sc writes b and a before they are overwritten The SC order must contain f d and h c to satisfy this condition c: Wsc x = 1 d: Wsc y = 1 Cycle in the SC order Outcome is forbidden as there is no corresponding consistent execution But compiled code allows the behaviour! f: Rsc y = 0 h: Rsc x = 0 27

What went wrong? SC axioms required SC order to contain edges from c f and from d h to match direction of hb edges This requires a sync/dmb ish between e and f as well as between g and h on Power and ARMv7 These fences are NOT provided by trailing-sync mapping 28

What went wrong? SC axioms required SC order to contain edges from c f and from d h to match direction of hb edges This requires a sync/dmb ish between e and f as well as between g and h on Power and ARMv7 These fences are NOT provided by trailing-sync mapping 29

What went wrong? SC axioms required SC order to contain edges from c f and from d h to match direction of hb edges This requires a sync/dmb ish between e and f as well as between g and h on Power and ARMv7 These fences are NOT provided by trailing-sync mapping 30

Outline Introduction Background on C11 model and mappings IRIW Counterexample and Analysis Loophole in Proof of Batty et al. IBM XL C++ Bugs Conclusion 31

Loophole in Batty et al. proof [POPL 2012] Lemma in proof states that SC order for a given Power trace is an arbitrary linearization of po t sc co t sc fr t sc erf t sc This is the transitive closure of program order and coherence edges directly between SC accesses Proof clause checking C11 axiom that sc and hb edges match direction states that having SC order be arbitrary linearization of above relation is sufficient 32

Loophole in Batty et al. proof [POPL 2012] This claim is false in certain scenarios hb edges can arise between SC accesses through the transitive composition of edges to and from a non-sc intermediate access Occurs in IRIW counterexample: 33

Loophole in Batty et al. proof [POPL 2012] This claim is false in certain scenarios hb edges can arise between SC accesses through the transitive composition of edges to and from a non-sc intermediate access Occurs in IRIW counterexample: 34

Loophole in Batty et al. proof [POPL 2012] SC order must be in same direction as these hb edges, but an arbitrary linearization of po t sc co t sc fr t sc erf t sc may not satisfy this condition Result: Proof does not guarantee that sc and hb edges match direction between two accesses, and is incorrect confirmed by Batty et al. 35

Current Compiler and Architecture State Neither GCC nor Clang implement exact flawed trailing-sync mapping Use leading-sync mapping for Power Use trailing-sync for ARMv7, but with stronger acquire mapping (ld; dmb ish or stronger) Sufficient to disallow both our counterexamples Both counterexample behaviours observed on Power hardware [Alglave et al. TOPLAS 2014] ARMv7 model [Alglave et al. TOPLAS 2014] allows counterexample behaviours, but not observed on ARMv7 hardware 36

Outline Introduction Background on C11 model and mappings IRIW Counterexample and Analysis Loophole in Proof of Batty et al. IBM XL C++ Bugs Conclusion 37

What about optimizations? C11 Litmus Test C11 Atomic Mapping Optimizations Assembly Language Program Compiler Even if mapping is correct, optimizations cannot introduce new outcomes Recent work on src-to-src opts and LLVM IR verification [Vafeiadis et al. POPL 2015] [Chakraborty and Vafeiadis CGO 2016] What about commercial compilers? 38

XL C++ Bugs Overview Visited IBM Yorktown Heights to check if XL C++ (v13.1.4) was vulnerable to trailing-sync counterexample XL C++ mapping close to leading-sync Often correct at lower optimization levels, but increasing optimizations to O3 and O4 generated incorrect code for multiple tests Bugs have since been fixed by compiler team Caused by issues in code generator Fixes in v13.1.5 39

Bug #1: Loss of SC Store Release Semantics Message-passing litmus test (mp), relaxed store of x, all other accesses SC T0 T1 x.store(1, relaxed); r1 = y.load(seq_cst); y.store(1, seq_cst); r2 = x.load(seq_cst); Outcome: r1 = 1, r2 = 0 (Forbidden by C++) XL C++ with O3 compiles to: C0 St x = 1 sync St y = 1 C1 sync r1 = Ld y ctrlisync (twice) sync r2 = Ld x ctrlisync (twice) XL C++ with O4 compiles to: C0 St x = 1 ctrlisync St y = 1 sync C1 ctrlisync r1 = Ld y sync ctrlisync r2 = Ld x sync Forbidden Allowed Used litmus utility to exercise outcome of incorrect code 40

Bug #1: Loss of SC Store Release Semantics Message-passing litmus test (mp), relaxed store of x, all other accesses SC T0 T1 x.store(1, relaxed); r1 = y.load(seq_cst); y.store(1, seq_cst); r2 = x.load(seq_cst); Outcome: r1 = 1, r2 = 0 (Forbidden by C++) XL C++ with O3 compiles to: C0 St x = 1 sync St y = 1 C1 sync r1 = Ld y ctrlisync (twice) sync r2 = Ld x ctrlisync (twice) Bug: Ctrlisync is not strong enough to ensure stores are observed in order XL C++ with O4 compiles to: C0 St x = 1 ctrlisync St y = 1 sync C1 ctrlisync r1 = Ld y sync ctrlisync r2 = Ld x sync Forbidden Allowed Used litmus utility to exercise outcome of incorrect code 41

Bug #2: Incorrect Impl. of Releases Message-passing litmus test (mp), with release-acquire atomics, relaxed store of x T0 T1 x.store(1, relaxed); r1 = y.load(acquire); y.store(1, release); r2 = x.load(acquire); Outcome: r1 = 1, r2 = 0 (Forbidden by C++) XL C++ with O3 compiles to: C0 St x = 1 St y = 1 Allowed C1 ctrlisync r1 = Ld y ctrlisync r2 = Ld x Used litmus utility to exercise outcome of incorrect code 42

Bug #2: Incorrect Impl. of Releases Message-passing litmus test (mp), with release-acquire atomics, relaxed store of x T0 T1 x.store(1, relaxed); r1 = y.load(acquire); y.store(1, release); r2 = x.load(acquire); Outcome: r1 = 1, r2 = 0 (Forbidden by C++) XL C++ with O3 compiles to: C0 St x = 1 St y = 1 Allowed C1 ctrlisync r1 = Ld y ctrlisync r2 = Ld x Bug: No ordering enforcement between stores Used litmus utility to exercise outcome of incorrect code 43

Bug #3: Reordering SC Loads and syncs IRIW litmus test with two acquire loads, all other accesses SC T0 T1 T2 T3 x.store(1, seq_cst); y.store(1, seq_cst); r1 = x.load(acquire); r3 = y.load(acquire); r2 = y.load(seq_cst); r4 = x.load(seq_cst); Outcome: r1 = 1, r2 = 0, r3 = 1, r4 = 0 (Forbidden by C++) XL C++ with O3 compiles to: C0 C1 C2 C3 St x = 1 St y = 1 ctrlisync ctrlisync r1 = Ld x r3 = Ld y sync sync r2 = Ld y r4 = Ld x ctrlisync ctrlisync Forbidden XL C++ with O4 compiles to: C0 C1 C2 C3 ctrlisync ctrlisync ctrlisync ctrlisync St x = 1 St y = 1 r1 = Ld x r3 = Ld y ctrlisync ctrlisync r2 = Ld y r4 = Ld x sync sync Allowed 44

Bug #3: Reordering SC Loads and syncs IRIW litmus test with two acquire loads, all other accesses SC T0 T1 T2 T3 x.store(1, seq_cst); y.store(1, seq_cst); Outcome: r1 = 1, r2 = 0, r3 = 1, r4 = 0 (Forbidden by C++) XL C++ with O3 compiles to: C0 C1 C2 C3 St x = 1 St y = 1 ctrlisync ctrlisync r1 = Ld x r3 = Ld y sync sync r2 = Ld y r4 = Ld x ctrlisync ctrlisync Forbidden r1 = x.load(acquire); r3 = y.load(acquire); r2 = y.load(seq_cst); r4 = x.load(seq_cst); Bug: Ctrlisync is not enough to enforce required orderings XL C++ with O4 compiles to: C0 C1 C2 C3 ctrlisync ctrlisync ctrlisync ctrlisync St x = 1 St y = 1 r1 = Ld x r3 = Ld y ctrlisync ctrlisync r2 = Ld y r4 = Ld x sync sync Allowed 45

Future Work XL C++ bugs show that it is particularly hard to maintain C11 orderings across optimizations Need a top-to-bottom verification flow from HLL to assembly code, incorporating compiler optimizations Avenue for future work 46

Conclusions TriCheck provides rapid exploration of different compiler mappings for architectures across C11 litmus test variants Using TriCheck, discovered two trailing-sync counterexamples for Power and ARMv7 Also discovered loophole in proof of mappings Either C11 model or mappings must change to enable correct compilation Experiments with IBM XL C++ revealed bugs (since fixed) in their C11 implementation 47

C11 Compiler Mappings: Exploration, Verification, and Counterexamples Yatin Manerkar Princeton University Tools and papers available at http://check.cs.princeton.edu 48