Atomic operations in C

Similar documents
What is concurrency? Concurrency. What is parallelism? concurrency vs parallelism. Concurrency: (the illusion of) happening at the same time.

Concurrency. Johan Montelius KTH

CS5460: Operating Systems

Memory barriers in C

4) C = 96 * B 5) 1 and 3 only 6) 2 and 4 only

You may work with a partner on this quiz; both of you must submit your answers.

Credits and Disclaimers

Synchronising Threads

administrivia today start assembly probably won t finish all these slides Assignment 4 due tomorrow any questions?

Sungkyunkwan University

Machine Programming 3: Procedures

1. A student is testing an implementation of a C function; when compiled with gcc, the following x86-64 assembly code is produced:

Procedure Calls. Young W. Lim Sat. Young W. Lim Procedure Calls Sat 1 / 27

Program Translation. text. text. binary. binary. C program (p1.c) Compiler (gcc -S) Asm code (p1.s) Assembler (gcc or as) Object code (p1.

recap, what s the problem Locks and semaphores Total Store Order Peterson s algorithm Johan Montelius 0 0 a = 1 b = 1 read b read a

Enabling and Optimizing MariaDB on Qualcomm Centriq 2400 Arm-based Servers

COMP 210 Example Question Exam 2 (Solutions at the bottom)

Other consistency models

What the CPU Sees Basic Flow Control Conditional Flow Control Structured Flow Control Functions and Scope. C Flow Control.

Machine Program: Procedure. Zhaoguo Wang

Synchronization. Disclaimer: some slides are adopted from the book authors slides with permission 1

System Programming and Computer Architecture (Fall 2009)

Homework 0: Given: k-bit exponent, n-bit fraction Find: Exponent E, Significand M, Fraction f, Value V, Bit representation

Locks and semaphores. Johan Montelius KTH

Locks and semaphores. Johan Montelius KTH

CS165 Computer Security. Understanding low-level program execution Oct 1 st, 2015

Multiprocessor Solution

GCC and Assembly language. GCC and Assembly language. Consider an example (dangeous) foo.s

Implementing Threads. Operating Systems In Depth II 1 Copyright 2018 Thomas W. Doeppner. All rights reserved.

Machine-level Programming (3)

CAS CS Computer Systems Spring 2015 Solutions to Problem Set #2 (Intel Instructions) Due: Friday, March 20, 1:00 pm

Operating Systems. Lecture 4 - Concurrency and Synchronization. Master of Computer Science PUF - Hồ Chí Minh 2016/2017

The course that gives CMU its Zip! Machine-Level Programming III: Procedures Sept. 17, 2002

Princeton University Computer Science 217: Introduction to Programming Systems. Assembly Language: Function Calls

Sungkyunkwan University

Procedure Calls. Young W. Lim Mon. Young W. Lim Procedure Calls Mon 1 / 29

Machine-Level Programming III: Procedures

Understanding C/C++ Strict Aliasing

CS 31: Introduction to Computer Systems : Threads & Synchronization April 16-18, 2019

Administrivia. p. 1/20

hnp://

CS213. Machine-Level Programming III: Procedures

Assembly III: Procedures. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

ASSEMBLY III: PROCEDURES. Jo, Heeseung

Assembly III: Procedures. Jo, Heeseung

Memory Consistency Models

Threads and Synchronization. Kevin Webb Swarthmore College February 15, 2018

Intro to Threads. Two approaches to concurrency. Threaded multifinger: - Asynchronous I/O (lab) - Threads (today s lecture)

Credits and Disclaimers

Question 4.2 2: (Solution, p 5) Suppose that the HYMN CPU begins with the following in memory. addr data (translation) LOAD 11110

The Hardware/Software Interface CSE351 Spring 2013

Synchronization 2: Locks (part 2), Mutexes

(2) Accidentally using the wrong instance of a variable (sometimes very hard one to find).

CS510 Advanced Topics in Concurrency. Jonathan Walpole

C++ 11 Memory Consistency Model. Sebastian Gerstenberg NUMA Seminar

Data Representa/ons: IA32 + x86-64

Machine/Assembler Language Putting It All Together

Administrivia. Review: Thread package API. Program B. Program A. Program C. Correct answers. Please ask questions on Google group

THREADS: (abstract CPUs)

CIT Week13 Lecture

Foundations of the C++ Concurrency Memory Model

GCC as Optimising Compiler

Review: Thread package API

CS241 Computer Organization Spring Loops & Arrays

Thread. Disclaimer: some slides are adopted from the book authors slides with permission 1

Compiler Construction D7011E

Credits and Disclaimers

Recap: Thread. What is it? What does it need (thread private)? What for? How to implement? Independent flow of control. Stack

Parallel Programming and Optimization with GCC. Diego Novillo

Final exam. Scores. Fall term 2012 KAIST EE209 Programming Structures for EE. Thursday Dec 20, Student's name: Student ID:

W4118: PC Hardware and x86. Junfeng Yang

Atomicity CS 2110 Fall 2017

CS 2505 Computer Organization I

What is a Compiler? Compiler Construction SMD163. Why Translation is Needed: Know your Target: Lecture 8: Introduction to code generation

Second Part of the Course

CS 2505 Computer Organization I Test 2. Do not start the test until instructed to do so! printed

Shared Memory. SMP Architectures and Programming

Y86 Processor State. Instruction Example. Encoding Registers. Lecture 7A. Computer Architecture I Instruction Set Architecture Assembly Language View

Process Layout and Function Calls

Binary Translation 2

Lecture #16: Introduction to Runtime Organization. Last modified: Fri Mar 19 00:17: CS164: Lecture #16 1

AS08-C++ and Assembly Calling and Returning. CS220 Logic Design AS08-C++ and Assembly. AS08-C++ and Assembly Calling Conventions

administrivia final hour exam next Wednesday covers assembly language like hw and worksheets

CPS104 Recitation: Assembly Programming

An Elegant Weapon for a More Civilized Age

CPSC W Term 2 Problem Set #3 - Solution

Martin Kruliš, v

Assembly Language: Overview!

Introduction to 8086 Assembly

[537] Locks. Tyler Harter

CSE 351 Section 4 GDB and x86-64 Assembly Hi there! Welcome back to section, we re happy that you re here

Assembly Language: Function Calls

Machine-Level Programming II: Arithmetic & Control /18-243: Introduction to Computer Systems 6th Lecture, 5 June 2012

Objectives. Making TOS preemptive Avoiding race conditions

Simple C Program. Assembly Ouput. Using GCC to produce Assembly. Assembly produced by GCC is easy to recognize:

x86 architecture et similia

Lock-Free, Wait-Free and Multi-core Programming. Roger Deran boilerbay.com Fast, Efficient Concurrent Maps AirMap

Instruction Set Architecture

CSE 351: Week 4. Tom Bergan, TA

Assembly Language: Function Calls" Goals of this Lecture"

Transcription:

Atomic operations in C Sergey Vojtovich Software Engineer @ MariaDB Foundation * *

Agenda API overview details, examples, common mistakes a few words about volatile

Rationale Atomic operations are intended to allow access to shared data without extra protection (mutex, rwlock, ). This may improve: single thread performance scalability overall system performance.

Atomic API NoAPI (asm, volatile, read/write/full memory barrier) GCC sync builtins Windows Interlocked Variable Access Solaris atomic operations GCC atomic builtins C11 atomic operations MySQL (removed in 8.0) InnoDB (disaster, removed in MariaDB 10.2) MariaDB (compatible with MySQL, but closer to C11)

MariaDB atomic API Simple operations (cheaper): load store Read-modify-write (more expensive): fetch-and-store add compare-and-swap

Additional C11 atomic operations test-and-set (similar to fetch-and-store) clear (similar to store) sub (same as add(-1)) or xor and

Atomic load my_atomic_load32(int32_t *var) my_atomic_load64(int64_t *var) my_atomic_loadptr(void **var) my_atomic_load32_explicit(int32_t *var, memory_order) my_atomic_load64_explicit(int64_t *var, memory_order) my_atomic_loadptr_explicit(void **var, memory_order) return *var; Order must be one of MY_MEMORY_ORDER_RELAXED, MY_MEMORY_ORDER_CONSUME, MY_MEMORY_ORDER_ACQUIRE, MY_MEMORY_ORDER_SEQ_CST.

Atomic store my_atomic_store#(&var, what) my_atomic_store#_explicit(&var, what, memory_order) *var= what; Order must be one of MY_MEMORY_ORDER_RELAXED, MY_MEMORY_ORDER_RELEASE, MY_MEMORY_ORDER_SEQ_CST.

Atomic fetch-and-store my_atomic_fas#(&var, what) my_atomic_fas#_explicit(&var, what, memory_order) old= *var; var= *what; return old; All memory orders are valid.

Atomic add my_atomic_add#(&var, what) my_atomic_add#_explicit(&var, what, memory_order) old= *var; *var+= what; return old; All memory orders are valid.

Atomic compare-and-swap my_atomic_cas#(&var, &old, new) my_atomic_cas#_weak_explicit(&var, &old, new, succ, fail) my_atomic_cas#_strong_explicit(&var, &old, new, succ, fail) if (*var == *old) { *var= new; return TRUE; } else { *old= *var; return FALSE; } succ - the memory synchronization ordering for the read-modify-write operation if the comparison succeeds. All memory orders are valid. fail - the memory synchronization ordering for the load operation if the comparison fails. Cannot be MY_MEMORY_ORDER_RELEASE or MY_MEMORY_ORDER_ACQ_REL and cannot specify stronger ordering than succ. The weak form is allowed to fail spuriously, that is, act as if *var!= *old even if they are equal. When a compare-and-exchange is in a loop, the weak version will yield better performance on some platforms. When a weak compare-and-exchange would require a loop and a strong one would not, the strong one is preferable.

References https://github.com/mariadb/server/blob/10.3/include/my_atomic.h http://en.cppreference.com/w/c/atomic

Example mutex based solution atomic based solution pthread_mutex_lock(&lock_thread_count); thd->query_id= global_query_id++; pthread_mutex_unlock(&lock_thread_count);

Example mutex based solution atomic based solution pthread_mutex_lock(&lock_thread_count); thd->query_id= global_query_id++; pthread_mutex_unlock(&lock_thread_count); thd->query_id= my_atomic_add64_explicit(&global_query_id, 1, MY_MEMORY_ORDER_RELAXED);

Example mutex based solution atomic based solution pthread_mutex_lock(&lock_thread_count); thd->query_id= global_query_id++; pthread_mutex_unlock(&lock_thread_count); thd->query_id= my_atomic_add64_explicit(&global_query_id, 1, MY_MEMORY_ORDER_RELAXED); ACQUIRE + atomic read-modify-write non-atomic load-increment-store RELEASE + atomic store no memory barriers atomic read-modify-write

The rule Any time two (or more) threads operate on a shared variable concurrently, and one of those operations performs a write, both threads must use atomic operations.

Example thread 1 thread 2 my_atomic_store64_explicit(&var, 1, MY_MEMORY_ORDER_RELAXED); if (var) { } Not guaranteed to observe consistent value, subject of compiler optimizations.

Example thread 1 thread 2 my_atomic_store64_explicit(&var, 1, MY_MEMORY_ORDER_RELAXED); if (var) { } Not guaranteed to observe consistent BAD value, subject of compiler optimizations.

Example thread 1 thread 2 var= 1; if (my_atomic_load64_explicit(&var, MY_MEMORY_ORDER_RELAXED)) { } Subject for compiler optimizations. Not guaranteed to observe consistent value.

Example thread 1 thread 2 var= 1; if (my_atomic_load64_explicit(&var, MY_MEMORY_ORDER_RELAXED)) { } Not guaranteed to observe consistent Subject of compiler optimizations. BAD value.

Example thread 1 thread 2 my_atomic_store64_explicit(&var, 1, MY_MEMORY_ORDER_RELAXED); if (my_atomic_load64_explicit(&var, MY_MEMORY_ORDER_RELAXED)) { }

Example thread 1 thread 2 my_atomic_store64_explicit(&var, 1, MY_MEMORY_ORDER_RELAXED); if (my_atomic_load64_explicit(&var, MY_MEMORY_ORDER_RELAXED)) { } OK

The difference By all means atomic operations do make foreign threads happy: other threads guaranteed to see consistent values disable some compiler optimizations, like: dead store elimination, constant folding allow memory barrier.

Atomicity uint32_t foo= 0; void store() { foo= 0x80286; }

Atomicity uint32_t foo= 0; void store() { foo= 0x80286; }

Atomicity Non-atomic 64 bit store: #include <stdint.h> uint64_t var= 0; void store() { var= 0x100000002; } Now compile this for IA-64 with GCC: $ gcc -O3 -m64 -S atomic.c && cat atomic.s movabsq $4294967298, %rax movq %rax, var(%rip) ret

Atomicity Atomic 64 bit store: #include <stdint.h> uint64_t var= 0; void store() { atomic_store_n(&var, 0x100000002, ATOMIC_RELAXED); } Now compile this for IA-64 with GCC: $ gcc -O3 -m64 -S atomic.c && cat atomic.s movabsq $4294967298, %rax movq %rax, var(%rip) ret

Atomicity Non-atomic 64 bit store: #include <stdint.h> uint64_t var= 0; void store() { var= 0x100000002; } Now compile this for IA-32 with GCC: $ gcc -O3 -m32 -S atomic.c && cat atomic.s movl $2, var movl $1, var+4 ret

Atomicity thread 1 thread 2 var= 0x100000002; var= 0x200000001; Concurrent thread loading var may observe: 0x000000000 0x000000001 0x000000002 0x100000000 0x100000001 0x100000002 0x200000000 0x200000001 0x200000002

Atomicity Atomic 64 bit store: #include <stdint.h> uint64_t var= 0; void store() { atomic_store_n(&var, 0x100000002, ATOMIC_RELAXED); } Now compile this for IA-32 with GCC: $ gcc -O3 -m32 -S atomic.c && cat atomic.s subl $12, %esp.cfi_def_cfa_offset 16 movl $2, %eax movl $1, %edx movl %eax, (%esp) movl %edx, 4(%esp) fildq (%esp) fistpq var addl $12, %esp.cfi_def_cfa_offset 4 ret

Compiler optimizations #include <stdint.h> uint32_t stage= 0; void thread1() { stage= 1; while (stage!= 2); stage= 3; }

Compiler optimizations #include <stdint.h> uint32_t stage= 0; void thread1() { stage= 1; while (stage!= 2); stage= 3; } Now compile this with GCC: $ gcc -O3 -S atomic.c && cat atomic.s movl $1, stage(%rip).l2: jmp.l2

Compiler optimizations #include <stdint.h> uint32_t stage= 0; void thread1() { atomic_store_n(&stage, 1, ATOMIC_RELAXED); while ( atomic_load_n(&stage, ATOMIC_RELAXED)!= 2); atomic_store_n(&stage, 3, ATOMIC_RELAXED); }

Compiler optimizations #include <stdint.h> uint32_t stage= 0; void thread1() { atomic_store_n(&stage, 1, ATOMIC_RELAXED); while ( atomic_load_n(&stage, ATOMIC_RELAXED)!= 2); atomic_store_n(&stage, 3, ATOMIC_RELAXED); } Now compile this with GCC: $ gcc -O3 -S atomic.c && cat atomic.s movl $1, stage(%rip).l2: movl stage(%rip), %eax cmpl $2, %eax jne.l2 movl $3, stage(%rip) ret

Ordering Atomic operations on any particular variable are not allowed to be reordered against each other independently of memory barrier. my_atomic_store32_explicit(&a, 1, MY_MEMORY_ORDER_RELAXED); d= my_atomic_load32_explicit(&a, MY_MEMORY_ORDER_RELAXED); my_atomic_store32_explicit(&a, 2, MY_MEMORY_ORDER_RELAXED); b= my_atomic_store32_explicit(&a, MY_MEMORY_ORDER_RELAXED);

Disadvantages atomic operations may drive you crazy absence of easy understandable manual atomic operations are more expensive than their non-atomic counterparts

References http://preshing.com/20130618/atomic-vs-non-atomic-operations/ https://gcc.gnu.org/wiki/atomic/gccmm/optimizations

volatile type qualifier In C/C++ the volatile keyword was intended to: allow access to memory mapped devices allow uses of variables between setjmp and longjmp allow uses of sig_atomic_t variables in signal handlers.

volatile type qualifier Useful side effects: compiler can t optimize out volatile access compiler can t reorder volatile access relative to some other special access (e.g. another volatile access).

volatile type qualifier Disadvantages: operations on volatile variables are not atomic compiler may reorder volatile access relative to regular variable access CPU may reorder volatile access relative to any other access (e.g. another volatile access).

volatile type qualifier NO According to the C++11 ISO Standard, the volatile keyword is only meant for use for hardware access; do not use it for inter-thread communication.

References https://en.wikipedia.org/wiki/volatile_(computer_programming) http://en.cppreference.com/w/c/language/volatile https://msdn.microsoft.com/en-us/library/12a04hfd.aspx