NUMA-Aware Reader-Writer Locks PPoPP 2013

1 NUMA-Aware Reader-Writer Locks (PPoPP 2013). Irina Calciu, Brown University

2 Authors: Irina Calciu (Brown University), Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, Nir Shavit (MIT)

3 [Figure: Typical NUMA system. Each chip (node) holds several cores with L1 caches, an L2 cache, and local DRAM. Writer threads w1 to w6 and reader threads r1 to r6 run on the cores. The nodes are connected by a shared bus / interconnect.]

4 NUMA
- The interconnect is growing most slowly of all interfaces
- Critical bottleneck on large systems
- Classic NUMA programming: avoid cold and capacity misses served from a remote node
- Concern: home node of the memory vs. node of the thread accessing that memory

5 NUMA
- Our concern: contended locks
- Coherence misses and communication
- Minimize cache-to-cache coherence transfers
- Location of the thread accessing a line
- Caches that hold that line, and their states

6 Background: cohort locks
- Non-FIFO: trade short-term fairness for aggregate throughput [PPoPP 2012]
- [Figure: threads 0 to 13 waiting for the lock, split across NUMA node 0 and NUMA node 1]
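The cohort-lock idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `CohortLock`, the `max_handoffs` bound, and the caller-supplied `node` argument are all assumptions made for the example, and Python locks stand in for the atomic operations a real NUMA lock would use. Python also gives no control over thread placement, so the sketch only models the admission logic: a releasing thread prefers to pass ownership to a waiter on its own node, up to a bound, before giving up the global lock.

```python
import threading


class CohortLock:
    """Sketch: one global lock plus one local lock per (assumed) NUMA node."""

    def __init__(self, num_nodes, max_handoffs=64):
        self.global_lock = threading.Lock()
        self.local_locks = [threading.Lock() for _ in range(num_nodes)]
        self.meta = threading.Lock()          # stands in for atomic counters
        self.waiters = [0] * num_nodes        # threads queued on each local lock
        # The next two lists are only touched while holding the node's local lock.
        self.owns_global = [False] * num_nodes
        self.handoffs = [0] * num_nodes
        self.max_handoffs = max_handoffs      # hypothetical fairness bound

    def acquire(self, node):
        with self.meta:
            self.waiters[node] += 1
        self.local_locks[node].acquire()
        with self.meta:
            self.waiters[node] -= 1
        if not self.owns_global[node]:
            # First thread of this cohort: compete for the global lock.
            self.global_lock.acquire()
            self.owns_global[node] = True
            self.handoffs[node] = 0

    def release(self, node):
        with self.meta:
            has_local_waiter = self.waiters[node] > 0
        if has_local_waiter and self.handoffs[node] < self.max_handoffs:
            # Keep the global lock inside the node; the local successor inherits it.
            self.handoffs[node] += 1
        else:
            # Nobody local is waiting, or the bound is reached: let other nodes in.
            self.owns_global[node] = False
            self.global_lock.release()
        self.local_locks[node].release()


def run_demo(num_nodes=2, threads_per_node=2, iterations=200):
    """Hammer the lock and report whether mutual exclusion ever failed."""
    lock = CohortLock(num_nodes)
    state = {"count": 0, "inside": 0, "violations": 0}

    def worker(node):
        for _ in range(iterations):
            lock.acquire(node)
            state["inside"] += 1
            if state["inside"] != 1:
                state["violations"] += 1
            state["count"] += 1
            state["inside"] -= 1
            lock.release(node)

    threads = [threading.Thread(target=worker, args=(n,))
               for n in range(num_nodes) for _ in range(threads_per_node)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state
```

The handoff bound is what makes the lock fair over the long term: without it, one node could keep the global lock for as long as it has local waiters.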

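7 Reader-Writer Locks: write mode [Figure: writers (W) and readers (R) at the critical section, lock held in write mode]

8 Reader-Writer Locks: read mode [Figure: writers (W) and readers (R) at the critical section, lock held in read mode]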

9 Reader-Writer Locks
- Maximize size of R-groups
- Minimize R-W alternation
- Used in: databases, operating systems, STM
- Alternative roles: stop-the-world garbage collection
  - read confers RW access to heap
  - write confers ability of collector to move

10 Admission Policy: Variations
- Include read/write role in the scheduling decision
- Reader-preference
- Writer-preference
- FIFO: R-groups form from ambient order

11 [Figure: number of threads in the critical section over time, for three lock schedules]
- Thread placement: Node 0: w1, w2, w3, r1, r2, r3; Node 1: w4, w5, w6, r4, r5, r6
- (a) Naïve reader-writer lock schedule
- (b) Lock schedule with aggressive reader batching
- (c) Lock schedule with aggressive reader and writer batching

12 Problems with existing RW locks
- Path length: longer relative to a mutex
- Lock meta-data accesses
  - Centralized: NUMA-oblivious
  - Coherency communication costs
- Simple mutex often yields better results
  - For relatively short critical sections
  - Despite lack of R-R parallelism
- RW lock: benefits of R-R parallelism don't overcome the additional overhead

13 Our design
- Trade short-term fairness for throughput
  - Similar to cohort locks
- Presume reads dominate
- Shift burden of work from the reader lock path to the writer path

14 Our design: Writers
- Single centralized write lock (WL)
- Abstraction: Lock; Unlock; IsLocked
- W-vs-W conflicts
- Best implementation
- [Figure: threads 0 to 13 grouped under NUMA node 0 and NUMA node 1]

15 Our design: Readers
- Reader indicators (RI)
- Publish intent to read to writers
- Abstraction: Arrive; Depart; IsZero
- Conceptually: a counter

16 Reader Indicators
- Global counter
  - Atomic increment and decrement
  - OK on a uniprocessor, horrible on NUMA
- SNZI (Scalable NonZero Indicator)
- [Figure: NUMA node 0, NUMA node 1]

17 Reader Indicators
- Per-node distributed counters: local writes only
- Per-node pairs: ingress and egress fields
  - Arrive: increment ingress
  - Depart: increment egress
- Reduces intra-node fetch-and-add contention
- Preferred implementation
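A per-node reader indicator with split ingress and egress fields might look like the sketch below. The names `DistributedReaderIndicator` and `_Field`, and the caller-supplied `node` index, are assumptions for illustration. Each small lock stands in for a node-local atomic fetch-and-add, since Python has no direct equivalent. Keeping ingress and egress as separate words is what lets arriving and departing readers avoid contending on the same field.

```python
import threading


class _Field:
    """One counter word; the lock stands in for an atomic fetch-and-add."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self._value += 1

    def read(self):
        with self._lock:
            return self._value


class DistributedReaderIndicator:
    """Sketch of the Arrive / Depart / IsZero abstraction with per-node pairs."""

    def __init__(self, num_nodes):
        self._ingress = [_Field() for _ in range(num_nodes)]
        self._egress = [_Field() for _ in range(num_nodes)]

    def arrive(self, node):
        # A reader only writes the ingress field of its own node.
        self._ingress[node].increment()

    def depart(self, node):
        # Must be called with the same node that was passed to arrive().
        self._egress[node].increment()

    def is_zero(self):
        for ingress, egress in zip(self._ingress, self._egress):
            # Read egress first. Egress never exceeds ingress, so if the later
            # ingress read equals the earlier egress read, this node had no
            # readers at the moment egress was read.
            departed = egress.read()
            arrived = ingress.read()
            if arrived != departed:
                return False
        return True
```

Note that `is_zero` scans the nodes one by one and is not an atomic snapshot. In the lock design it is only called by a writer that already holds the write lock, so newly arriving readers back off while the scan runs.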

18 Our design: Readers and Writers
- IsLocked and IsZero: detect and resolve R-vs-W conflicts

Reader:
    start:
      RI.Arrive()
      // Check for writers
      if WL.IsLocked():
        RI.Depart()
        while WL.IsLocked():
          Pause()
        goto start
      <read-critical-section>
      RI.Depart()

Writer:
    WL.Lock()
    // Check for readers
    while not RI.IsZero():
      Pause()
    <write-critical-section>
    WL.Unlock()
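The reader and writer paths on this slide can be turned into a small runnable sketch. The code below is an illustration under stated assumptions, not the authors' code: the class names `RWLockWP` and `ReaderIndicator` are made up for the example, the write lock is a plain `threading.Lock` rather than a cohort lock, the reader indicator is a single lock-protected counter, and `time.sleep(0)` plays the role of `Pause()`.

```python
import threading
import time


class ReaderIndicator:
    """Simplest possible RI: one counter (the lock stands in for atomics)."""

    def __init__(self):
        self._count = 0
        self._lock = threading.Lock()

    def arrive(self):
        with self._lock:
            self._count += 1

    def depart(self):
        with self._lock:
            self._count -= 1

    def is_zero(self):
        with self._lock:
            return self._count == 0


class RWLockWP:
    """Writer-preference reader-writer lock built from WL and RI."""

    def __init__(self):
        self.wl = threading.Lock()
        self.ri = ReaderIndicator()

    def read_lock(self):
        while True:
            self.ri.arrive()
            if not self.wl.locked():     # no writer: reader may proceed
                return
            self.ri.depart()             # defer to the writer
            while self.wl.locked():
                time.sleep(0)            # Pause()

    def read_unlock(self):
        self.ri.depart()

    def write_lock(self):
        self.wl.acquire()                # resolves W-vs-W conflicts
        while not self.ri.is_zero():     # wait for in-flight readers to drain
            time.sleep(0)

    def write_unlock(self):
        self.wl.release()


def run_demo(readers=3, writers=2, iterations=100):
    """Writers update two fields in two steps; readers must never see them differ."""
    lock = RWLockWP()
    state = {"a": 0, "b": 0, "torn_reads": 0}

    def writer():
        for _ in range(iterations):
            lock.write_lock()
            state["a"] += 1
            time.sleep(0)                # widen the window a bad lock would expose
            state["b"] += 1
            lock.write_unlock()

    def reader():
        for _ in range(iterations):
            lock.read_lock()
            if state["a"] != state["b"]:
                state["torn_reads"] += 1
            lock.read_unlock()

    threads = [threading.Thread(target=writer) for _ in range(writers)]
    threads += [threading.Thread(target=reader) for _ in range(readers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state
```

The correctness argument is the usual flag-then-check pattern: a reader publishes itself in RI before checking WL, and a writer takes WL before checking RI, so at least one side always sees the other.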

19 Impatience (I)
- Adaptive RP-WP policy
- Start with writer-preference lock C-RW-WP
  - Writers acquire WL and wait for RI to reach 0
  - Readers increment RI and check WL
  - If locked, decrement and defer to writers

20 Impatience (II)
- Readers are initially patient but can become impatient
  - block inflow of newly arriving writers
  - erect barrier
  - avoids reader starvation
- Bounded bypass: writers can bypass patient readers

21 Impatience (III)
- Effectively: toggling preference policy to avoid starvation
- Promotes large R-groups
- Long chains of writers leverage cohort locks
- Adaptive admission policy
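The slides describe impatience only in outline, so the following is one plausible way to realize it rather than the authors' exact algorithm. Assumptions: the names `ImpatientRWLock`, `barrier`, and `patience_limit` are hypothetical, patience is measured in spin iterations, and lock-protected counters stand in for atomics. A reader that has waited too long raises the barrier. Writers check the barrier before and after taking WL and back off while it is raised, so the impatient reader gets in, then lowers the barrier.

```python
import threading
import time


class _Counter:
    """Shared counter; the lock stands in for atomic add and load."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def add(self, delta):
        with self._lock:
            self._value += delta

    def get(self):
        with self._lock:
            return self._value


class ImpatientRWLock:
    """Writer-preference lock whose readers can become impatient."""

    def __init__(self, patience_limit=100):
        self.wl = threading.Lock()
        self.ri = _Counter()             # reader indicator
        self.barrier = _Counter()        # number of impatient readers
        self.patience_limit = patience_limit

    def read_lock(self):
        raised = False
        spins = 0
        while True:
            self.ri.add(1)               # Arrive
            if not self.wl.locked():
                break
            self.ri.add(-1)              # Depart and defer to the writer
            while self.wl.locked():
                time.sleep(0)            # Pause()
                spins += 1
                if spins > self.patience_limit and not raised:
                    self.barrier.add(1)  # erect barrier against new writers
                    raised = True
        if raised:
            self.barrier.add(-1)         # admitted: lower the barrier again

    def read_unlock(self):
        self.ri.add(-1)

    def write_lock(self):
        while True:
            while self.barrier.get() != 0:
                time.sleep(0)            # impatient readers go first
            self.wl.acquire()
            if self.barrier.get() == 0:
                break
            self.wl.release()            # barrier went up meanwhile: back off
        while self.ri.get() != 0:
            time.sleep(0)                # wait for readers to drain

    def write_unlock(self):
        self.wl.release()
```

A writer that is already inside its critical section when the barrier goes up simply finishes; only newly arriving writers are held back, which matches the "block inflow of newly arriving writers" bullet.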

22 [Figure: performance results for a workload of 98% reads, 2% writes; the plot marks the "Better" direction]

23 Observations
- Distributed RIs beat SNZI
  - Flat array of RIs is better, at least for 4- or 8-node systems
  - SNZI expected to win at some N
- NUMA-like behavior on-chip
  - Core-local L2 caches
  - Treat each core as if it were a NUMA node
- Fixed thread roles vs. variable
  - Variable: models use of thread pools
  - Fixed: our lock family still yields best results

24 Summary (I)
- Family of NUMA-friendly RW locks
- Trivial to substitute RI or WL implementations
- High aggregate throughput
- Fair over the long term for: threads; R/W roles; NUMA nodes

25 Summary (II)
- Long critical sections
  - Quality of scheduling is critical
  - R-group formation
- Short critical sections
  - Lock overheads can dominate
  - Consider a NUMA-friendly mutex
- Fixed preference policies can be problematic
  - Adaptive to avoid starvation
  - Non-preferred role can become impatient

26 Thank you!

27 [Figure: results for a workload of 98% reads, 2% writes]
