Advanced Operating Systems (CS 202)

Similar documents
Advanced Operating Systems (CS 202) Memory Consistency, Cache Coherence and Synchronization

Advanced Operating Systems (CS 202) Memory Consistency, Cache Coherence and Synchronization

Portland State University ECE 588/688. Memory Consistency Models

Shared Memory Consistency Models: A Tutorial

Motivations. Shared Memory Consistency Models. Optimizations for Performance. Memory Consistency

Using Relaxed Consistency Models

Distributed Operating Systems Memory Consistency

Relaxed Memory-Consistency Models

Memory Consistency Models

Lecture 13: Consistency Models. Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models

Overview: Memory Consistency

Unit 12: Memory Consistency Models. Includes slides originally developed by Prof. Amir Roth

Beyond Sequential Consistency: Relaxed Memory Models

Designing Memory Consistency Models for. Shared-Memory Multiprocessors. Sarita V. Adve

Symmetric Multiprocessors: Synchronization and Sequential Consistency

Lecture 9: Multiprocessor OSs & Synchronization. CSC 469H1F Fall 2006 Angela Demke Brown

CS533 Concepts of Operating Systems. Jonathan Walpole

Module 15: "Memory Consistency Models" Lecture 34: "Sequential Consistency and Relaxed Models" Memory Consistency Models. Memory consistency

CS510 Advanced Topics in Concurrency. Jonathan Walpole

Parallel Computer Architecture Spring Memory Consistency. Nikos Bellas

Shared Memory Consistency Models: A Tutorial

NOW Handout Page 1. Memory Consistency Model. Background for Debate on Memory Consistency Models. Multiprogrammed Uniprocessor Mem.

Relaxed Memory Consistency

Algorithms for Scalable Synchronization on Shared Memory Multiprocessors by John M. Mellor Crummey Michael L. Scott

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

Shared Memory Consistency Models: A Tutorial

CS5460: Operating Systems

Operating Systems. Synchronization

Synchronization. CS61, Lecture 18. Prof. Stephen Chong November 3, 2011

Page 1. Outline. Coherence vs. Consistency. Why Consistency is Important

Sequential Consistency & TSO. Subtitle

Systèmes d Exploitation Avancés

Administrivia. p. 1/20

Lecture 11: Relaxed Consistency Models. Topics: sequential consistency recap, relaxing various SC constraints, performance comparison

Lecture 12: Relaxed Consistency Models. Topics: sequential consistency recap, relaxing various SC constraints, performance comparison

RCU in the Linux Kernel: One Decade Later

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Synchronization. Coherency protocols guarantee that a reading processor (thread) sees the most current update to shared data.

Release Consistency. Draft material for 3rd edition of Distributed Systems Concepts and Design

Announcements. ECE4750/CS4420 Computer Architecture L17: Memory Model. Edward Suh Computer Systems Laboratory

CSE 451: Operating Systems Winter Lecture 7 Synchronization. Steve Gribble. Synchronization. Threads cooperate in multithreaded programs

CS377P Programming for Performance Multicore Performance Synchronization

Threads. Concurrency. What it is. Lecture Notes Week 2. Figure 1: Multi-Threading. Figure 2: Multi-Threading

Concurrency and Race Conditions (Linux Device Drivers, 3rd Edition (

Lecture 19: Synchronization. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Distributed Shared Memory and Memory Consistency Models

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Typed Assembly Language for Implementing OS Kernels in SMP/Multi-Core Environments with Interrupts

250P: Computer Systems Architecture. Lecture 14: Synchronization. Anton Burtsev March, 2019

CPSC/ECE 3220 Fall 2017 Exam Give the definition (note: not the roles) for an operating system as stated in the textbook. (2 pts.

Implementing Locks. Nima Honarmand (Based on slides by Prof. Andrea Arpaci-Dusseau)

Can Seqlocks Get Along with Programming Language Memory Models?

CSE Traditional Operating Systems deal with typical system software designed to be:

Hardware: BBN Butterfly

CSE 451: Operating Systems Winter Lecture 7 Synchronization. Hank Levy 412 Sieg Hall

Administrivia. Review: Thread package API. Program B. Program A. Program C. Correct answers. Please ask questions on Google group

10/17/ Gribble, Lazowska, Levy, Zahorjan 2. 10/17/ Gribble, Lazowska, Levy, Zahorjan 4

CSE 4/521 Introduction to Operating Systems

Memory Consistency Models. CSE 451 James Bornholt

Today s Outline: Shared Memory Review. Shared Memory & Concurrency. Concurrency v. Parallelism. Thread-Level Parallelism. CS758: Multicore Programming

Algorithms for Scalable Lock Synchronization on Shared-memory Multiprocessors

Distributed Operating Systems

740: Computer Architecture Memory Consistency. Prof. Onur Mutlu Carnegie Mellon University

The Java Memory Model

CSC2/458 Parallel and Distributed Systems Scalable Synchronization

Foundations of the C++ Concurrency Memory Model

EEC 581 Computer Architecture. Lec 11 Synchronization and Memory Consistency Models (4.5 & 4.6)

M4 Parallelism. Implementation of Locks Cache Coherence

Recap: Thread. What is it? What does it need (thread private)? What for? How to implement? Independent flow of control. Stack

Advance Operating Systems (CS202) Locks Discussion

Weak memory models. Mai Thuong Tran. PMA Group, University of Oslo, Norway. 31 Oct. 2014

CS 152 Computer Architecture and Engineering. Lecture 19: Synchronization and Sequential Consistency

Locks. Dongkun Shin, SKKU

Lecture 12: TM, Consistency Models. Topics: TM pathologies, sequential consistency, hw and hw/sw optimizations

Java Memory Model. Jian Cao. Department of Electrical and Computer Engineering Rice University. Sep 22, 2016

CS 252 Graduate Computer Architecture. Lecture 11: Multiprocessors-II

Multiprocessor Systems. Chapter 8, 8.1

High-level languages

CS252 Spring 2017 Graduate Computer Architecture. Lecture 15: Synchronization and Memory Models Part 2

RELAXED CONSISTENCY 1

Concept of a process

Chapter 5: Process Synchronization. Operating System Concepts 9 th Edition

Concurrency: Mutual Exclusion (Locks)

Lecture. DM510 - Operating Systems, Weekly Notes, Week 11/12, 2018

Synchronization for Concurrent Tasks

TSOtool: A Program for Verifying Memory Systems Using the Memory Consistency Model

Lecture 13: Memory Consistency. + course-so-far review (if time) Parallel Computer Architecture and Programming CMU /15-618, Spring 2017

Synchronization Principles

Implementing Mutual Exclusion. Sarah Diesburg Operating Systems CS 3430

Synchronization I. Jo, Heeseung

Lecture 5: Synchronization w/locks

Multiprocessor Synchronization

EECS 570 Lecture 11. Directory-based Coherence. Winter 2019 Prof. Thomas Wenisch

Distributed Systems Synchronization. Marcus Völp 2007

Concurrency: Locks. Announcements

Reminder from last time

CSE 120 Principles of Operating Systems

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization

Chapter 6: Process Synchronization

Transcription:

Advanced Operating Systems (CS 202) Memory Consistency, Cache Coherence and Synchronization (Part II) Jan, 30, 2017 (some cache coherence slides adapted from Ian Watson; some memory consistency slides from Sarita Adve)

What is synchronization? Making sure that concurrent activities don t access shared data in inconsistent ways int i = 0; // shared Thread A i=i+1; Thread B i=i-1; What is in i? 2

What are the sources of concurrency? Multiple user-space processes On multiple CPUs Device interrupts Workqueues Tasklets Timers 3

Pitfalls in scull Race condition: result of uncontrolled access to shared data if (!dptr->data[s_pos]) { } dptr->data[s_pos] = kmalloc(quantum, GFP_KERNEL); if (!dptr->data[s_pos]) { } goto out; Scull is the Simple Character Utility for Locality Loading (an example device driver from the Linux Device Driver book)

Pitfalls in scull Race condition: result of uncontrolled access to shared data if (!dptr->data[s_pos]) { dptr->data[s_pos] = kmalloc(quantum, GFP_KERNEL); if (!dptr->data[s_pos]) { goto out; } }

Pitfalls in scull Race condition: result of uncontrolled access to shared data if (!dptr->data[s_pos]) { dptr->data[s_pos] = kmalloc(quantum, GFP_KERNEL); if (!dptr->data[s_pos]) { goto out; } }

Pitfalls in scull Race condition: result of uncontrolled access to shared data if (!dptr->data[s_pos]) { dptr->data[s_pos] = kmalloc(quantum, GFP_KERNEL); if (!dptr->data[s_pos]) { goto out; } } Memory leak

Synchronization primitives Lock/Mutex To protect a shared variable, surround it with a lock (critical region) Only one thread can get the lock at a time Provides mutual exclusion Shared locks More than one thread allowed (hmm ) Others? Yes, including Barriers (discussed in the paper) 8

Synchronization primitives Lock based (cont d) Blocking (e.g., semaphores, futexes, completions) Non-blocking (e.g., spin-lock, ) Sometimes we have to use spinlocks Lock free (or partially lock free J) Atomic instructions seqlocks RCU Transactions (next time) 9

Naïve implementation of spinlock Lock(L): While(test_and_set(L)); //we have the lock! //eat, dance and be merry Unlock(L) L=0; 10

Why naïve? Works? Yes, but not used in practice Contention Think about the cache coherence protocol Set in test and set is a write operation Has to go to memory A lot of cache coherence traffic Unnecessary unless the lock has been released Imagine if many threads are waiting to get the lock Fairness/starvation 11

Better implementation Spin on read Assumption: We have cache coherence Not all are: e.g., Intel SCC Lock(L): while(l==locked); //wait if(test_and_set(l)==locked) go back; Still a lot of chattering when there is an unlock Spin lock with backoff 12

Bakery Algorithm struct lock { int next_ticket; int now_serving; } Acquire_lock: int my_ticket = fetch_and_inc(l->next_ticket); while(l->new_serving!=my_ticket); //wait //Eat, Dance and me merry! Release_lock: L->now_serving++; Still too much chatter Comments? Fairness? Efficiency/cache coherence? 13

Anderson Lock (Array lock) Problem with bakery algorithm: All threads listening to next_serving A lot of cache coherence chatter But only one will actually acquire the lock Can we have each thread wait on a different variable to reduce chatter? 14

Anderson s Lock We have an array (actually circular queue) of variables Each variable can indicate either lock available or waiting for lock Only one location has lock available Lock(L): my_place = fetch_and_inc (queuelast); while (flags[myplace mod N] == must_wait); Unlock(L) flags[myplace mod N] = must_wait; flags[mypalce+1 mod N] = available; Fair and not noisy compare to spin-on-read and bakery algorithm Any negative side effects? 15

Each node has: struct node { bool got_it; Next; //successor} Lock(L, me) MCS Lock Lock Lock Lock Unlock(L,me) join(l); //use fetch-n-store remove me from L while(got_it == 0); signal successor (setting got it to 0) me me Current 16

Race condition! What if there is a new joiner when the last element is removing itself 17

18

Performance impact 19

Concurrency and Memory Consistency References: Shared Memory Consistency Models: A Tutorial, Sarita V. Adve & Kourosh Gharachorloo, September 1995 A primer on memory consistency and cache coherence, Sorin, Hill and wood, 2011 (chapters 3 and 4) Memory Models: A Case for Rethinking Parallel Languages and Hardware, Adve and Boehm, 2010 20

Memory Consistency Formal specification of memory semantics Guarantees as to how shared memory will behave on systems with multiple processors Ordering of reads and writes Essential for programmer (OS writer!) to understand 21

Why Bother? Memory consistency models affect everything Programmability Performance Portability Model must be defined at all levels Programmers and system designers care 22

Uniprocessor Systems Memory operations occur: One at a time In program order Read returns value of last write Only matters if location is the same or dependent Many possible optimizations Intuitive! 23

Multiprocessor: Example 24

Cont d S2 and S1 reordered Why? How? 25

Example 2 26

Sequential Consistency The result of any execution is the same as if all operations were executed on a single processor Operations on each processor occur in the sequence specified by the executing program P1 P2 P3 Pn Memory 27

One execution sequence 28

29

S.C. Disadvantages Difficult to implement! Huge lost potential for optimizations Hardware (cache) and software (compiler) Be conservative: err on the safe side Major performance hit 30

Relaxed Consistency Program Order relaxations locations) (different W à R; W à W; R à R/W Write Atomicity relaxations Read returns another processor s Write early Combined relaxations Read your own Write (okay for S.C.) Safety Net available synchronization operations Note: assume one thread per core 31

Implementations vary on Processors

Comparison of Models Relaxation W R W W R RW Read Others Read Own Safety net Order Order Order Write Early Write Early SC [16] IBM 370 [14] TSO [20] PC [13, 12] PSO [20] WO [5] RCsc [13, 12] RCpc [13, 12] Alpha [19] RMO [21] PowerPC [17, 4] serialization instructions RMW RMW RMW, STBAR synchronization release, acquire, nsync, RMW release, acquire, nsync, RMW MB, WMB various MEMBAR s SYNC 33

Write à Read Can be reordered: same processor, different locations Hides write latency Different processors? Same location? 1.IBM 370 Any write must be fully propagated before reading 2.SPARC V8 Total Store Ordering (TSO) Can read its own write before that write is fully propagated Cannot read other processors writes before full propagation 3.Processor Consistency (PC) Any write can be read before being fully propagated 34

Example: Write à Read P1 P2 F1 = 1 F2 = 1 A = 1 A = 2 Rg1 = A Rg3 = A Rg2 = F2 Rg4 = F1 Rg1 = 1 Rg3 = 2 Rg2 = 0 Rg4 = 0 P1 P2 P3 A = 1 if(a==1) B = 1 if (B==1) Rg1 = A Rg1 = 0, B = 1 TSO and PC PC only 35

Write à Write Can be reordered: same processor, different locations Multiple writes can be pipelined/overlapped May reach other processors out of program order Partial Store Ordering (PSO) Similar to TSO Can read its own write early Cannot read other processors writes early 36

Example: Write à Write P1 P2 Data = 2000 while (Head == 0) {;} Head = 1... = Data PSO = non sequentially consistent can we fix that? P1 P2 Data = 2000 while (Head == 0) {;} STBAR // write barrier Head = 1... = Data 37