High-Performance Distributed RMA Locks

Size: px

Start display at page:

Download "High-Performance Distributed RMA Locks"

Marianna Joseph
5 years ago
Views:

1 High-Performance Distributed RMA Locks PATRICK SCHMID, MACIEJ BESTA, TORSTEN HOEFLER

2 NEED FOR EFFICIENT LARGE-SCALE SYNCHRONIZATION spcl.inf.ethz.ch

3 LOCKS An example structure Proc p Proc q Inuitive semantics Various performance penalties

4 LOCKS: CHALLENGES P1 P2 P3 P4

5 LOCKS: CHALLENGES We need intra- and inter-node topologyawareness We need to cover arbitrary topologies P P P P P P P P P P P P

6 LOCKS: CHALLENGES We need to distinguish Writer between Reader readers Reader and writers Reader Reader Reader Reader Writer Reader Reader We need flexible performance for both types of processes [1] V. Venkataramani et al. Tao: How facebook serves the social graph. SIGMOD 12.

7 What will we use in the design? spcl.inf.ethz.ch

8 WHAT WE WILL USE MCS Locks Proc Proc Proc Proc Cannot enter Cannot enter Cannot enter Can enter Next proc Next proc... Next proc Next proc Pointer to the queue tail

9 WHAT WE WILL USE Reader-Writer Locks W R R R... R

10 How to manage the design complexity? How to ensure tunable performance? What mechanism to use for efficient implementation?

11 REMOTE MEMORY ACCESS (RMA) PROGRAMMING Process p Process q Memory A A put A Memory B get B flush B Cray BlueWaters

12 REMOTE MEMORY ACCESS PROGRAMMING Implemented in hardware in NICs in the majority of HPC networks support RDMA

13 How to manage the design complexity? How to ensure tunable performance? What mechanism to use for efficient implementation?

14 How to manage the design complexity? Each element has its own distributed MCS queue (DQ) of writers R1 R2... R9 W3 W8 Readers and writers synchronize with a distributed counter (DC) Modular design W1 W3 W5 W8 MCS queues form a distributed tree (DT) R4 R2 R1 R5 R3 R8 R6 R7 R9 W1 W2 W3 W4 W5 W6 W7 W8

15 R1 R2... How to ensure tunable performance? R9 W3 W8 Each DQ: fairness vs throughput of writers DC: a parameter for the latency of readers vs writers A tradeoff parameter for every structure W1 W3 W5 W8 DT: a parameter for the throughput of readers vs writers R4 R2 R1 R5 R3 R8 R6 R7 R9 W1 W2 W3 W4 W5 W6 W7 W8

16 DISTRIBUTED MCS QUEUES (DQS) Throughput vs Fairness Larger T L,i : more throughput at level i. Smaller T L,i : more fairness at level i. T L,3 W3 W8 Each DQ: The maximum number of lock passings within a DQ at level i, before it is passed to another DQ at i. T L,i T L,2 T L,2 W1 W3 W5 W8 T L,1 T L,1 T L,1 T L,1 W1 W2 W3 W4 W5 W6 W7 W8

17 DISTRIBUTED TREE OF QUEUES (DT) Throughput of readers vs writers R1 R2... R9 T L,3 W3 W8 DT: The maximum number of consecutive lock passings within readers ( ). T R T L,2 T L,2 W1 W3 W5 W8 T L,1 T L,1 T L,1 T L,1 W1 W2 W3 W4 W5 W6 W7 W8

18 DISTRIBUTED COUNTER (DC) Latency of readers vs writers DC: every kth compute node hosts a partial counter, all of which constitute the DC. k = T DC A writer holds the lock Readers that arrived at the CS b x y T DC = 1 T DC = 2 Readers that left the CS R R4 R R2 R3 R8 R6 R7 R

19 Locality vs fairness (for writers) spcl.inf.ethz.ch THE SPACE OF DESIGNS T L,i Design B Design A Higher throughput of writers vs readers T R T DC

20 LOCK ACQUIRE BY READERS A lightweight acquire protocol for readers: only one atomic fetch-and-add (FAA) operation FAA R FAA R4 A writer holds the lock Readers that arrived at the CS b x y Readers that left the CS FAA R2 FAA R3

21 LOCK ACQUIRE BY WRITERS R1 R2... R9 W9 W3 W8 Acquire the main MCS lock Acquire the main lock Acquire MCS Acquire MCS W9 W1 W3 W5 W W9 W1 W2 W3 W4 W5 W6 W7 W8

22 EVALUATION CSCS Piz Daint (Cray XC30) 5272 compute nodes 8 cores per node 169TB memory

23 EVALUATION CONSIDERED BENCHMARKS Throughput benchmarks: The latency benchmark Empty-critical-section Single-operation Wait-after-release Workload-critical-section DHT Distributed hashtable evaluation

24 EVALUATION DISTRIBUTED COUNTER ANALYSIS Throughput, 2% writers Single-operation benchmark

25 EVALUATION READER THRESHOLD ANALYSIS Throughput, 0.2% writers, Empty-critical-section benchmark

26 EVALUATION COMPARISON TO THE STATE-OF-THE-ART [1] R. Gerstenberger et al. Enabling Highly-scalable Remote Memory Access Programming with MPI-3 One Sided. ACM/IEEE Supercomputing 2013.

27 EVALUATION COMPARISON TO THE STATE-OF-THE-ART Throughput, single-operation benchmark [1] R. Gerstenberger et al. Enabling Highly-scalable Remote Memory Access Programming with MPI-3 One Sided. ACM/IEEE Supercomputing 2013.

28 EVALUATION DISTRIBUTED HASHTABLE 20% writers 10% writers [1] R. Gerstenberger et al. Enabling Highly-scalable Remote Memory Access Programming with MPI-3 One Sided. ACM/IEEE Supercomputing 2013.

29 EVALUATION DISTRIBUTED HASHTABLE 2% of writers 0% of writers [1] R. Gerstenberger et al. Enabling Highly-scalable Remote Memory Access Programming with MPI-3 One Sided. ACM/IEEE Supercomputing 2013.

30 OTHER ANALYSES spcl.inf.ethz.ch

Thank you for your attention Improves latency and throughput

31 CONCLUSIONS Modular distributed RMA lock, correctness with SPIN Parameter-based design, feasible with various RMA libs/languages Thank you for your attention Improves latency and throughput over state-of-the-art Enables high-performance distributed hashtabled

32 DISTRIBUTED TREE OF QUEUES (DT) Throughput of readers vs writers R1 R2... R9 W3 W8 T W = DT: The maximum number of consecutive lock passings within writers ( T W ) and readers ( T ). R N T L,i i=1 T L,2 T L,2 W1 W3 W5 W8 T L,1 T L,1 T L,1 T L,1 W1 W2 W3 W4 W5 W6 W7 W8 36

33 Locality vs fairness (for writers) spcl.inf.ethz.ch THE SPACE OF DESIGNS T L,i Design B Design A Higher throughput of writers vs readers T R T DC

34 EVALUATION D-MCS VS OTHERS Latency (LB) Throughput (ECSB)

35 EVALUATION WRITER THRESHOLD ANALYSIS Throughput, 25% of writers Single-operation benchmark

36 EVALUATION FAIRNESS VS THROUGHPUT ANALYSIS Throughput, 25% of writers, Single-operation benchmark

37 EVALUATION READER THRESHOLD ANALYSIS Throughput, 2% and 5% writers, Empty-critical-section benchmark

38 FEASIBILITY ANALYSIS spcl.inf.ethz.ch

High-Performance Distributed RMA Locks

High-Performance Distributed RMA Locks PATRICK SCHMID, MACIEJ BESTA, TORSTEN HOEFLER Presented at ARM Research, Austin, Texas, USA, Feb. 2017 NEED FOR EFFICIENT LARGE-SCALE SYNCHRONIZATION spcl.inf.ethz.ch