Anne Bracy, CS 3410, Computer Science, Cornell University. The slides are the product of many rounds of teaching CS 3410 by Professors Weatherspoon, Bala, Bracy, McKee, and Sirer, with some slides from Amir Roth and Milo Martin. See P&H Chapters 4.10, 1.7, 1.8, 5.10, and 6.

seconds/program = (instructions/program) × (cycles/instruction) × (seconds/cycle)

Classic goals of architects: reduce the clock period (raise the clock frequency) and reduce cycles per instruction (raise IPC).
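To make the formula concrete (an illustrative plug-in of numbers, not from the slides): a program of 10^9 instructions at 2 cycles per instruction on a 1 GHz clock (10^-9 seconds per cycle) runs in 10^9 × 2 × 10^-9 = 2 seconds.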

The darling of performance improvement for decades. Why is this no longer the strategy? We are hitting limits: pipeline depth, clock frequency, Moore's Law & technology scaling, and power.

Exploiting Intra-instruction parallelism: Pipelining (decode A while fetching B) Exploiting Instruction Level Parallelism (ILP): Multiple issue pipeline (2-wide, 4-wide, etc.) Statically detected by compiler (VLIW) Dynamically detected by HW Dynamically Scheduled (OoO)

a.k.a. Very Long Instruction Word (VLIW). The compiler groups instructions to be issued together and packages them into issue slots. How does the HW detect and resolve hazards? It doesn't. The compiler must avoid hazards. Example: static dual-issue 32-bit MIPS. Instructions come in pairs (64-bit aligned): one ALU/branch instruction (or nop) and one load/store instruction (or nop).

Two-issue packets: one ALU/branch instruction and one load/store instruction, 64-bit aligned, ALU/branch first, then load/store. Pad an unused slot with a nop. One packet issues per cycle:

Address  Instruction type  Pipeline stages
n        ALU/branch        IF ID EX MEM WB
n + 4    Load/store        IF ID EX MEM WB
n + 8    ALU/branch           IF ID EX MEM WB
n + 12   Load/store           IF ID EX MEM WB
n + 16   ALU/branch              IF ID EX MEM WB
n + 20   Load/store              IF ID EX MEM WB

Schedule this for dual-issue MIPS:

Loop: lw   $t0, 0($s1)       # $t0 = array element
      addu $t0, $t0, $s2     # add scalar in $s2
      sw   $t0, 0($s1)       # store result
      addi $s1, $s1, -4      # decrement pointer
      bne  $s1, $zero, Loop  # branch if $s1 != 0

       ALU/branch             Load/store        cycle
Loop:  nop                    lw $t0, 0($s1)    1
       addi $s1, $s1, -4      nop               2
       addu $t0, $t0, $s2     nop               3
       bne $s1, $zero, Loop   sw $t0, 4($s1)    4

Clicker Question: What is the IPC of this machine? (A) 0.8 (B) 1.0 (C) 1.25 (D) 1.5 (E) 2.0
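(Worked arithmetic: the scheduled loop completes 5 instructions every 4 cycles, so IPC = 5/4 = 1.25.)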

Goal: larger instruction windows (to play with): predication, loop unrolling, function in-lining, and basic block modifications (superblocks, etc.); see the unrolling sketch below. Roadblocks: memory dependences (aliasing) and control dependences.
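As a concrete illustration of loop unrolling, here is a hypothetical C example (mine, not from the slides):

    /* Unrolled by 4: the four adds are independent, so the compiler
       (or a VLIW scheduler) can pack them into parallel issue slots. */
    void scale(int *a, int n, int s) {
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            a[i]   += s;
            a[i+1] += s;
            a[i+2] += s;
            a[i+3] += s;
        }
        for (; i < n; i++)   /* remainder loop for leftover elements */
            a[i] += s;
    }

Unrolling enlarges the basic block, giving the static scheduler more independent instructions to fill issue slots with.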

Exploiting Intra-instruction parallelism: Pipelining (decode A while fetching B) Exploiting Instruction Level Parallelism (ILP): Multiple issue pipeline (2-wide, 4-wide, etc.) Statically detected by compiler (VLIW) Dynamically detected by HW Dynamically Scheduled (OoO)

a.k.a. superscalar processor (cf. Intel). The CPU chooses multiple instructions to issue each cycle. The compiler can help by reordering instructions, but the CPU resolves hazards. Even better: speculation / out-of-order execution: execute instructions as early as possible; aggressive register renaming (indirection to the rescue!); guess the results of branches, loads, etc. and roll back if the guesses were wrong; don't commit results until all previous insns have committed.

It was awesome, but then it stopped improving. Limiting factors? Program dependencies: memory dependence detection → must be conservative, e.g. pointer aliasing: A[0] += 1; B[0] *= 2; Hard to expose parallelism: still limited by the fetch stream of the static program. Structural limits: memory delays and limited bandwidth make it hard to keep pipelines full, especially with branches.

Exploiting thread-level parallelism: hardware multithreading improves utilization by multiplexing multiple threads on a single CPU; it sacrifices latency for throughput. A single thread cannot fully utilize the CPU? Try more! Three types: coarse-grain (has a preferred thread), fine-grain (round robin between threads), simultaneous (hyperthreading).

Process: multiple threads, code, data, and OS state. Threads share code, data, and files, but not registers or stack.
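A minimal pthreads sketch of this sharing (my example, not from the slides): both threads see the same global, but each gets its own stack.

    #include <pthread.h>
    #include <stdio.h>

    int shared = 0;                  /* data segment: shared by all threads */

    void *worker(void *arg) {
        (void)arg;
        int local = 1;               /* stack: private to this thread */
        shared += local;             /* shared -- and, as we'll see, racy */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("shared = %d\n", shared);   /* usually 2, but not guaranteed */
        return 0;
    }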

[Figure: time evolution of issue slots on a 4-wide machine; color = thread, white = no instruction.] Superscalar: one thread, many empty slots. CGMT: switch to thread B on thread A's L2 miss. FGMT: switch threads every cycle. SMT: insns from multiple threads coexist in the same cycle.

CPU                   Year  Clock Rate  Pipeline Stages  Issue Width  OoO/Speculation  Cores  Power
i486                  1989  25 MHz      5                1            No               1      5 W
Pentium               1993  66 MHz      5                2            No               1      10 W
Pentium Pro           1997  200 MHz     10               3            Yes              1      29 W
P4 Willamette         2001  2000 MHz    22               3            Yes              1      75 W
UltraSparc III        2003  1950 MHz    14               4            No               1      90 W
P4 Prescott           2004  3600 MHz    31               3            Yes              1      103 W
Core                  2006  2930 MHz    14               4            Yes              2      75 W
Core i5 (Nehalem)     2010  3300 MHz    14               4            Yes              1      87 W
Core i5 (Ivy Bridge)  2012  3400 MHz    14               4            Yes              8      77 W
UltraSparc T1         2005  1200 MHz    6                1            No               8      70 W

Those simpler cores did something very right.

Moore's Law: a law about transistors. Smaller means more transistors per die, and smaller means faster too. But: power consumption is growing too.

[Figure: Moore's Law transistor counts over time, from the 4004, 8008, 8080, and 8088 through the 286, 386, 486, and Pentium, up to the P4, K8, K10, Atom, Itanium 2, and dual-core Itanium 2.]

[Figure: power density of CPUs from 180nm to 32nm (Xeon), climbing toward that of a hot plate, nuclear reactor, rocket nozzle, and the surface of the Sun.]

Power = capacitance × voltage² × frequency. In practice, frequency scales roughly with voltage, so Power ~ voltage³. Reducing voltage helps (a lot), and so does reducing clock speed. Better cooling helps. The power wall: we can't reduce voltage further and we can't remove more heat → lower the frequency.

                               Performance  Power
Single-Core Overclocked +20%   1.2x         1.7x
Single-Core                    1.0x         1.0x
Single-Core Underclocked -20%  0.8x         0.51x
Dual-Core   Underclocked -20%  1.6x         1.02x
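These numbers follow directly from the Power ~ voltage³ rule above: overclocking by 20% costs 1.2³ ≈ 1.7× the power; underclocking by 20% costs only 0.8³ ≈ 0.51× the power at 0.8× performance; and two underclocked cores deliver 2 × 0.8 = 1.6× the performance for 2 × 0.51 ≈ 1.02× the power, essentially the original power budget.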

Q: So let's just all use multicore from now on! A: Software must be written as a parallel program. Multicore difficulties: partitioning work, coordination & synchronization, communication overhead. And how do you write parallel programs... without knowing the exact underlying architecture?

Partition work so all cores have something to do

Need to partition so all cores are actually working

If tasks have a serial part and a parallel part... Example: step 1: divide input data into n pieces; step 2: do work on each piece; step 3: combine all results. Recall Amdahl's Law: as the number of cores increases, the time to execute the parallel part goes to zero, but the time to execute the serial part remains the same, so the serial part eventually dominates.
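For reference, Amdahl's Law in formula form: if a fraction p of the work is parallelizable, then speedup on n cores = 1 / ((1 - p) + p/n). Even with p = 95%, the speedup can never exceed 1/0.05 = 20×, no matter how many cores you add.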

Parallelism is a necessity, not a luxury: the power wall leaves no alternative, and it is not easy to get performance out of. Many solutions: pipelining, multi-issue, multithreading, multicore.

Q: So let's just all use multicore from now on! A: Software must be written as a parallel program. Multicore difficulties: partitioning work, coordination & synchronization, communication overhead. How do you write parallel programs... without knowing the exact underlying architecture?

Cache coherency: processors cache shared data → they see different (incoherent) values for the same memory location. Synchronizing parallel programs: atomic instructions, HW support for synchronization. How to write parallel programs: threads and processes; critical sections, race conditions, and mutexes.

Shared Memory Multiprocessor (SMP). Typical (today): 2-4 processor dies, 2-8 cores each. Hardware provides a single physical address space for all processors. [Figure: Core0 ... CoreN, each with its own cache, connected by an interconnect to memory and I/O.]

Thread A (on Core0):              Thread B (on Core1):
for(int i = 0; i < 5; i++) {      for(int j = 0; j < 5; j++) {
    x = x + 1;                        x = x + 1;
}                                 }

What will the value of x be after both loops finish?

Thread A (on Core0):              Thread B (on Core1):
for(int i = 0; i < 5; i++) {      for(int j = 0; j < 5; j++) {
    x = x + 1;                        x = x + 1;
}                                 }

What will the value of x be after both loops finish? a) 6 b) 8 c) 10 d) Could be any of the above e) Couldn't be any of the above

Thread A (on Core0)            Thread B (on Core1)
LW $t0, addr(x)     $t0=0
                               LW $t0, addr(x)     $t0=0
ADDIU $t0, $t0, 1   $t0=1
                               ADDIU $t0, $t0, 1   $t0=1
SW $t0, addr(x)     x=1
                               SW $t0, addr(x)     x=1    Problem!

[Figure: Core0's cache and Core1's cache each hold x going 0 → 1, while memory still holds x = 0.]

Executing on a write-thru cache:

Time step  Event                CPU A's cache  CPU B's cache  Memory
0                                                             0
1          CPU A reads X        0                             0
2          CPU B reads X        0              0              0
3          CPU A writes 1 to X  1              0              1

Coherence: what values can be returned by a read. We need a globally uniform (consistent) view of a single memory location. Solution: cache coherence protocols. Consistency: when a written value will be returned by a read. We need a globally uniform (consistent) view of all memory locations relative to each other. Solution: memory consistency models.

[Figure: a coherence controller (CC) sits between the CPU and the bus, with access to the D$ tags and D$ data.] Coherence: all copies have the same data at all times. The coherence controller examines bus traffic (addresses and data) and executes the coherence protocol: what to do with the local copy when you see different things happening on the bus. Three processor-initiated events: Ld (load), St (store), WB (write-back). Two remote-initiated events: LdMiss (read miss from another processor), StMiss (write miss from another processor).

VI (valid-invalid) protocol: two states (per block in the cache): V (valid): have the block; I (invalid): don't have the block. + Can implement with a valid bit. [State diagram: in I, a Load or Store moves to V, while remote LdMiss/StMiss events are ignored; in V, Loads and Stores stay in V, while a remote LdMiss or StMiss (plus a WB if dirty) moves back to I.] If you load/store a block: transition to V. If anyone else wants to read/write the block: give it up (transition to the I state), writing back if your own copy is dirty. This is an invalidate protocol. An update protocol would copy the data instead of invalidating: sounds good, but wastes a lot of bandwidth.

Thread A: lw $t0, 0(r3); ADDIU $t0, $t0, 1; sw $t0, 0(r3)
Thread B: lw $t0, 0(r3); ADDIU $t0, $t0, 1; sw $t0, 0(r3)

CPU0  CPU1  Mem
            0
V:0         0
V:1         0
I:    V:1   1
      V:2   1

The lw by Thread B generates an "other load miss" event (LdMiss); Thread A responds by sending its dirty copy and transitioning to I.

The VI protocol is inefficient: only one cached copy is allowed in the entire system, so multiple copies can't exist even if they are read-only. Not a problem in the example, but a big problem in reality. MSI (modified-shared-invalid) fixes the problem by splitting the V state into two states: M (modified): local dirty copy; S (shared): local clean copy. This allows either multiple read-only copies (S state) OR a single read/write copy (M state). [State diagram: I moves to S on a Load and to M on a Store; S stays in S on Loads and remote LdMisses, upgrades to M on a Store, and drops to I on a remote StMiss; M stays in M on Loads and Stores, drops to S on a remote LdMiss, and drops to I on a remote StMiss or WB.]
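A compact sketch of the per-block MSI transitions as C code (my rendering of the protocol above, using the event names from these slides):

    typedef enum { INVALID, SHARED, MODIFIED } msi_state;
    typedef enum { LOAD, STORE, OTHER_LD_MISS, OTHER_ST_MISS } msi_event;

    msi_state msi_next(msi_state cur, msi_event ev) {
        switch (cur) {
        case INVALID:
            if (ev == LOAD)  return SHARED;    /* fetch a clean copy        */
            if (ev == STORE) return MODIFIED;  /* fetch + write permission  */
            return INVALID;                    /* remote misses: no copy    */
        case SHARED:
            if (ev == STORE)         return MODIFIED;  /* upgrade miss      */
            if (ev == OTHER_ST_MISS) return INVALID;   /* another writer    */
            return SHARED;                     /* loads, remote reads: OK   */
        case MODIFIED:
            if (ev == OTHER_LD_MISS) return SHARED;    /* supply dirty data */
            if (ev == OTHER_ST_MISS) return INVALID;   /* supply + give up  */
            return MODIFIED;                   /* local hits stay in M      */
        }
        return INVALID;  /* unreachable */
    }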

Thread A: lw $t0, 0(r3); ADDIU $t0, $t0, 1; sw $t0, 0(r3)
Thread B: lw $t0, 0(r3); ADDIU $t0, $t0, 1; sw $t0, 0(r3)

CPU0  CPU1  Mem
            0
S:0         0
M:1         0
S:1   S:1   1
I:    M:2   1

The lw by Thread B generates an "other load miss" event (LdMiss); Thread A responds by sending its dirty copy and transitioning to S. The sw by Thread B generates an "other store miss" event (StMiss); Thread A responds by transitioning to I.

Coherence introduces two new kinds of cache misses. Upgrade miss: on stores to read-only blocks; the delay to acquire write permission to a read-only block. Coherence miss: a miss to a block evicted by another processor's requests. Making the cache larger doesn't reduce these types of misses; as the cache grows large, these sorts of misses dominate. False sharing: two or more processors share parts of the same block, but not the same bytes within that block (no actual sharing). This creates pathological ping-pong behavior. Careful data placement may help, but is difficult.
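A hypothetical C demonstration of false sharing (the names and the 64-byte line size are my assumptions): two threads update different fields that happen to share a cache line, so the line ping-pongs between cores even though no byte is truly shared.

    #include <pthread.h>
    #include <stdio.h>

    /* a and b sit in the same cache line (assuming 64-byte lines):
       every store by one core invalidates the other core's copy.
       Padding would fix it:
       struct { volatile long a; char pad[64]; volatile long b; }  */
    struct { volatile long a, b; } line;

    void *bump_a(void *arg) {
        (void)arg;
        for (long i = 0; i < 10000000; i++) line.a++;
        return NULL;
    }

    void *bump_b(void *arg) {
        (void)arg;
        for (long i = 0; i < 10000000; i++) line.b++;
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("a=%ld b=%ld\n", line.a, line.b);
        return 0;
    }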

In reality there are many coherence protocols. Snooping: VI, MSI, MESI, MOESI, ... But snooping doesn't scale. Directory-based protocols: caches & memory record each block's sharing status in a directory. Nothing is free → directory protocols are slower! Cache coherency requires that reads return the most recently written value. It is a hard problem!

Thread A: lw $t0, 0(r3); ADDIU $t0, $t0, 1; sw $t0, 0(r3)
Thread B: lw $t0, 0(r3); ADDIU $t0, $t0, 1; sw $t0, 0(r3)

CPU0  CPU1  Mem
            0
S:0         0
S:0   S:0   0
I:    M:1   0
M:1   I:    1

What just happened??? Is the MSI cache coherency protocol broken??

Within a thread: execution is sequential. Between threads? No ordering or timing guarantees; they might even run on different cores at the same time. Problem: hard to program, hard to reason about. Behavior can depend on subtle timing differences, and bugs may be impossible to reproduce. Cache coherency is necessary but not sufficient: we need explicit synchronization to make guarantees about concurrent threads!

A race condition is a timing-dependent error involving access to shared state. Race conditions depend on how threads are scheduled, i.e. on who wins races to update state. Challenges of race conditions: races are intermittent and may occur rarely; timing dependence means small changes can hide the bug; the program is correct only if all possible schedules are safe, and the number of possible schedules is huge. Imagine an adversary who switches contexts at the worst possible time.

Atomic read & write memory operation: between the read & write, no other writes to that address can occur. Many atomic hardware primitives: test-and-set (x86), atomic increment (x86), bus lock prefix (x86), compare-and-exchange (x86, ARM deprecated), load-linked / store-conditional (a pair of insns) (MIPS, ARM, PowerPC, DEC Alpha, ...).

Load linked: LL rt, offset(rs): "I want the value at address X. Also, start monitoring any writes to this address." Store conditional: SC rt, offset(rs): "If no one has changed the value at address X since the LL, perform this store and tell me it worked." Data at the location has not changed since the LL? SUCCESS: performs the store and returns 1 in rt. Data at the location has changed since the LL? FAILURE: does not perform the store and returns 0 in rt.

Load linked: LL rt, offset(rs). Store conditional: SC rt, offset(rs).

i++                            try: atomic(i++)
LW    $t0, 0($s0)                   LL    $t0, 0($s0)
ADDIU $t0, $t0, 1                   ADDIU $t0, $t0, 1
SW    $t0, 0($s0)                   SC    $t0, 0($s0)
                                    BEQZ  $t0, try

Value in memory changed between LL and SC? → SC returns 0 in $t0 → retry.
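For comparison (not on the slides), C11's <stdatomic.h> packages this retry loop behind a single call; on LL/SC machines it compiles to essentially the loop above:

    #include <stdatomic.h>

    atomic_int i;

    void atomic_increment(void) {
        atomic_fetch_add(&i, 1);   /* LL/ADDIU/SC/retry on MIPS; a locked add on x86 */
    }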

Load linked: LL $t0, offset($s0). Store conditional: SC $t0, offset($s0).

Time  Thread A             Thread B             A $t0  B $t0  Mem[$s0]
0                                                              0
1     try: LL $t0, 0($s0)                       0              0
2                          try: LL $t0, 0($s0)  0      0       0
3     ADDIU $t0, $t0, 1                         1      0       0
4                          ADDIU $t0, $t0, 1    1      1       0
5     SC $t0, 0($s0)                            1      1       1   Success!
6     BEQZ $t0, try                             1      1       1
7                          SC $t0, 0($s0)       1      0       1   Failure!
8                          BEQZ $t0, try        1      0       1

Thread A's SC succeeds (returns 1) and updates memory; Thread B's SC fails (returns 0) because memory changed since its LL, so Thread B retries.

Create an atomic version of every instruction? NO: that does not scale or solve the problem. To eliminate races: identify critical sections, which only one thread can be in at a time; contending threads must wait to enter.

time  T1                  T2
      CSEnter();          CSEnter();
      Critical section    # wait
      CSExit();           # wait
                          Critical section
                          CSExit();

Locks: the implementation of CSEnter and CSExit. Only one thread can hold the lock at a time ("I have the lock!").

int m = 0;

mutex_lock(int *m) {
  test_and_set:
    LI   $t0, 1
    LL   $t1, 0($a0)
    BNEZ $t1, test_and_set
    SC   $t0, 0($a0)
    BEQZ $t0, test_and_set
}

mutex_unlock(int *m) {
    SW $zero, 0($a0)
}

This is called a spin lock, a.k.a. spin waiting.

mutex_lock(int *m)

Time  Thread A           Thread B           A $t0  A $t1  B $t0  B $t1  Mem M[$a0]
0                                                                       0
1     try: LI $t0, 1     try: LI $t0, 1     1             1             0
2     LL $t1, 0($a0)     LL $t1, 0($a0)     1      0      1      0      0
3     BNEZ $t1, try      BNEZ $t1, try      1      0      1      0      0
4                        SC $t0, 0($a0)     1      0      1      0      1
5     SC $t0, 0($a0)                        0      0      1      0      1
6     BEQZ $t0, try      BEQZ $t0, try      0      0      1      0      1
7     try: LI $t0, 1     Critical section                               1

Thread B's SC succeeds ($t0 = 1), so B falls through the BEQZ into the critical section (Success!); Thread A's SC fails ($t0 = 0), so A loops back to try (Failure!).
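The same spin lock in portable C11 (my sketch, not the slides' code), using atomic_flag's test-and-set:

    #include <stdatomic.h>

    atomic_flag m = ATOMIC_FLAG_INIT;

    void mutex_lock(void) {
        /* test-and-set returns the previous value: spin while it was already held */
        while (atomic_flag_test_and_set_explicit(&m, memory_order_acquire))
            ;  /* spin waiting */
    }

    void mutex_unlock(void) {
        atomic_flag_clear_explicit(&m, memory_order_release);
    }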

Goal: enforce data structure invariants.

// invariant: data is in A[h .. t-1]
#define n 100
char A[n];
int h = 0, t = 0;

// producer: add to tail if room
void put(char c) {
    A[t] = c;
    t = (t+1)%n;
}

// consumer: take from head
char get() {
    while (t == h) { };
    char c = A[h];
    h = (h+1)%n;
    return c;
}

[Figure: ring buffer snapshots: head → 1 2 3 ← tail; after put(4): head → 1 2 3 4 ← tail; after get(): head → 2 3 4 ← tail.]

// invariant: data is in A[h .. t-1]
#define n 100
char A[n];
int h = 0, t = 0;

Goal: enforce data structure invariants.

// producer: add to tail if room
void put(char c) {
    A[t] = c;
    t = (t+1)%n;
}

// consumer: take from head
char get() {
    while (t == h) { };
    char c = A[h];
    h = (h+1)%n;
    return c;
}

Clicker Q: What's wrong here? a) Will lose an update to t and/or h b) Invariant is not upheld c) Will produce if full d) Will consume if empty e) All of the above

// invariant: data is in A[h .. t-1]
#define n 100
char A[n];
int h = 0, t = 0;

Goal: enforce data structure invariants.

// producer: add to tail if room
void put(char c) {
    A[t] = c;
    t = (t+1)%n;          ←
}

// consumer: take from head
char get() {
    while (t == h) { };   ←
    char c = A[h];
    h = (h+1)%n;          ←
    return c;
}

What's wrong here? We could miss an update to t or h, and the invariants break: we should only produce if not full and only consume if not empty → we need to synchronize access to shared data.

// invariant: data is in A[h .. t-1]
#define n 100
char A[n];
int h = 0, t = 0;

Goal: enforce data structure invariants.

// producer: add to tail if room
void put(char c) {
    acquire-lock();
    A[t] = c;
    t = (t+1)%n;
    release-lock();
}

// consumer: take from head
char get() {
    acquire-lock();
    while (t == h) { };
    char c = A[h];
    h = (h+1)%n;
    release-lock();
    return c;
}

Rule of thumb: all accesses & updates that can affect the invariant become critical sections. Does this fix work?

Lots of synchronization variations. Reader/writer locks: any number of threads can hold a read lock, but only one thread can hold the writer lock. Semaphores: N threads can hold the lock at the same time. Monitors: a concurrency-safe data structure with 1 mutex; all operations on the monitor acquire/release the mutex, so there is one thread in the monitor at a time.
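As one concrete instance (a sketch using POSIX rwlocks, an API the slides don't show), the reader/writer variant looks like:

    #include <pthread.h>

    pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
    int table[100];

    int reader(int i) {
        pthread_rwlock_rdlock(&rw);    /* any number of readers may hold this */
        int v = table[i];
        pthread_rwlock_unlock(&rw);
        return v;
    }

    void writer(int i, int v) {
        pthread_rwlock_wrlock(&rw);    /* exclusive: waits out readers and writers */
        table[i] = v;
        pthread_rwlock_unlock(&rw);
    }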