High-Performance Distributed RMA Locks

Similar documents
High-Performance Distributed RMA Locks

High-Performance Distributed RMA Locks

Towards scalable RDMA locking on a NIC

Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided

Advance Operating Systems (CS202) Locks Discussion

Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided

Adaptive Lock. Madhav Iyengar < >, Nathaniel Jeffries < >

Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages

Performance-Centric System Design

Advanced Multiprocessor Programming Project Topics and Requirements

CSC2/458 Parallel and Distributed Systems Scalable Synchronization

Design of MPI Passive Target Synchronization for a Non-Cache- Coherent Many-Core Processor

Hybrid MPI - A Case Study on the Xeon Phi Platform

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures

Efficient and Truly Passive MPI-3 RMA Synchronization Using InfiniBand Atomics

spin: High-performance streaming Processing in the Network

GLocks: Efficient Support for Highly- Contended Locks in Many-Core CMPs

A simple correctness proof of the MCS contention-free lock. Theodore Johnson. Krishna Harathi. University of Florida. Abstract

NUMA-Aware Reader-Writer Locks PPoPP 2013

Uni-Address Threads: Scalable Thread Management for RDMA-based Work Stealing

Slim Fly: A Cost Effective Low-Diameter Network Topology

TORSTEN HOEFLER

Distributed Systems Synchronization. Marcus Völp 2007

Agenda. Lecture. Next discussion papers. Bottom-up motivation Shared memory primitives Shared memory synchronization Barriers and locks

Cray XC Scalability and the Aries Network Tony Ford

Hardware Acceleration of Barrier Communication for Large Scale Parallel Computer

250P: Computer Systems Architecture. Lecture 14: Synchronization. Anton Burtsev March, 2019

Optimization of thread affinity and memory affinity for remote core locking synchronization in multithreaded programs for multicore computer systems

CSE 120 Principles of Operating Systems

Distributed Operating Systems

Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL

GPU-centric communication for improved efficiency

Panu Silvasti Page 1

Message Passing Improvements to Shared Address Space Thread Synchronization Techniques DAN STAFFORD, ROBERT RELYEA

Exploiting Offload Enabled Network Interfaces

GPU-Aware Intranode MPI_Allreduce

High-Performance Key-Value Store on OpenSHMEM

AST: scalable synchronization Supervisors guide 2002

Operational Robustness of Accelerator Aware MPI

Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Mohammadreza Bayatpour, Hari Subramoni, D. K.

NUMA-aware Reader-Writer Locks. Tom Herold, Marco Lamina NUMA Seminar

Lecture 19: Coherence and Synchronization. Topics: synchronization primitives (Sections )

Reactive Synchronization Algorithms for Multiprocessors

Challenges and Opportunities for HPC Interconnects and MPI

Evolving HPCToolkit John Mellor-Crummey Department of Computer Science Rice University Scalable Tools Workshop 7 August 2017

A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems

Fairlocks - A High Performance Fair Locking Scheme

Executing dynamic heterogeneous workloads on Blue Waters with RADICAL-Pilot

The Multikernel: A new OS architecture for scalable multicore systems Baumann et al. Presentation: Mark Smith

Design of Concurrent and Distributed Data Structures

ffwd: delegation is (much) faster than you think Sepideh Roghanchi, Jakob Eriksson, Nilanjana Basu

Hardware: BBN Butterfly

Enabling highly-scalable remote memory access programming with MPI-3 One Sided 1

Performance Improvement via Always-Abort HTM

Lock cohorting: A general technique for designing NUMA locks

l12 handout.txt l12 handout.txt Printed by Michael Walfish Feb 24, 11 12:57 Page 1/5 Feb 24, 11 12:57 Page 2/5

Synchronization. Erik Hagersten Uppsala University Sweden. Components of a Synchronization Even. Need to introduce synchronization.

Do You Know What Your I/O Is Doing? (and how to fix it?) William Gropp

HPC Runtime Support for Fast and Power Efficient Locking and Synchronization

Data Processing on Emerging Hardware

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 3 Nov 2017

Reliable Distributed System Approaches

One-Sided Append: A New Communication Paradigm For PGAS Models

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji

Portland State University ECE 588/688. Cray-1 and Cray T3E

Portland State University ECE 588/688. Graphics Processors

SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers

Scalable Concurrent Hash Tables via Relativistic Programming

Scalable, multithreaded, shared memory machine Designed for single word random global access patterns Very good at large graph problems

CS377P Programming for Performance Multicore Performance Synchronization

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing

DESIGN CHALLENGES FOR SCALABLE CONCURRENT DATA STRUCTURES for Many-Core Processors

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Concurrent programming: From theory to practice. Concurrent Algorithms 2015 Vasileios Trigonakis

Concept of a process

Lecture 36: MPI, Hybrid Programming, and Shared Memory. William Gropp

High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations

Leveraging MPI s One-Sided Communication Interface for Shared-Memory Programming

Distributed Operating Systems

6.852: Distributed Algorithms Fall, Class 15

Instant Recovery for Main-Memory Databases

PLAN-E Workshop Switzerland. Welcome! September 8, 2016

The Exascale Architecture

Lecture 9: Multiprocessor OSs & Synchronization. CSC 469H1F Fall 2006 Angela Demke Brown

Algorithms for Scalable Lock Synchronization on Shared-memory Multiprocessors

Algorithms for Scalable Synchronization on Shared Memory Multiprocessors by John M. Mellor Crummey Michael L. Scott

Scalable Reader-Writer Synchronization for Shared-Memory Multiprocessors

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries

Algorithms for Scalable Lock Synchronization on Shared-memory Multiprocessors

The Tofu Interconnect D

DISCRETE Event Simulation (DES) is a type of simulation

CPSC/ECE 3220 Fall 2017 Exam Give the definition (note: not the roles) for an operating system as stated in the textbook. (2 pts.

State-based Communication on Time-predictable Multicore Processors

Real-Time Scalability of Nested Spin Locks. Hiroaki Takada and Ken Sakamura. Faculty of Science, University of Tokyo

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

Advanced Computer Networks. Flow Control

A Comparison of Relativistic and Reader-Writer Locking Approaches to Shared Data Access

Towards Fair and Efficient SMP Virtual Machine Scheduling

Transcription:

High-Performance Distributed RMA Locks PATRICK SCHMID, MACIEJ BESTA, TORSTEN HOEFLER Presented at ARM Research, Austin, Texas, USA, Feb. 2017

NEED FOR EFFICIENT LARGE-SCALE SYNCHRONIZATION spcl.inf.ethz.ch

LOCKS An example structure Proc p Proc q Inuitive semantics Various performance penalties

LOCKS: CHALLENGES P1 P2 P3 P4 Calciu et al.: NUMA-aware reader-writer locks, PPoPP 13

LOCKS: CHALLENGES We need intra- and inter-node topologyawareness We need to cover arbitrary topologies P P P P P P P P P P P P

LOCKS: CHALLENGES We need to distinguish Writer between Reader readers Reader and writers Reader Reader Reader Reader Writer Reader Reader We need flexible performance for both types of processes [1] V. Venkataramani et al. Tao: How facebook serves the social graph. SIGMOD 12.

What will we use in the design? spcl.inf.ethz.ch

WHAT WE WILL USE MCS Locks Proc Proc Proc Proc Cannot enter Cannot enter Cannot enter Can enter Next proc Next proc... Next proc Next proc Pointer to the queue tail Mellor-Crummey and Scott: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors, ACM TOCS 91

WHAT WE WILL USE Reader-Writer Locks W R R R... R

How to manage the design complexity? How to ensure tunable performance? What mechanism to use for efficient implementation?

REMOTE MEMORY ACCESS (RMA) PROGRAMMING Process p Process q Memory A A put A Memory B get B flush B Cray BlueWaters TH, J. Dinan, R. Thakur, B. Barrett, P. Balaji, W. Gropp, K. Underwood: Remote Memory Access Programming in MPI-3, ACM TOPC 15

REMOTE MEMORY ACCESS PROGRAMMING Implemented in hardware in NICs in the majority of HPC networks (RDMA support).

How to manage the design complexity? How to ensure tunable performance? What mechanism to use for efficient implementation?

P. Schmid, M. Besta, TH: High-Performance Distributed RMA Locks, ACM HPDC 16, best paper spcl.inf.ethz.ch How to manage the design complexity? Each element has its own distributed MCS queue (DQ) of writers R1 R2... R9 W3 W8 Readers and writers synchronize with a distributed counter (DC) Modular design W1 W3 W5 W8 MCS queues form a distributed tree (DT) 2 2 3 2 R4 R2 R1 R5 R3 R8 R6 R7 R9 W1 W2 W3 W4 W5 W6 W7 W8

P. Schmid, M. Besta, TH: High-Performance Distributed RMA Locks, ACM HPDC 16, best paper spcl.inf.ethz.ch R1 R2... How to ensure tunable performance? R9 W3 W8 Each DQ: fairness vs throughput of writers DC: a parameter for the latency of readers vs writers A tradeoff parameter for every structure W1 W3 W5 W8 DT: a parameter for the throughput of readers vs writers 2 2 2 2 R4 R2 R1 R5 R3 R8 R6 R7 R9 W1 W2 W3 W4 W5 W6 W7 W8

P. Schmid, M. Besta, TH: High-Performance Distributed RMA Locks, ACM HPDC 16, best paper spcl.inf.ethz.ch DISTRIBUTED MCS QUEUES (DQS) Throughput vs Fairness Larger T L,i : more throughput at level i. Smaller T L,i : more fairness at level i. T L,3 W3 W8 Each DQ: The maximum number of lock passings within a DQ at level i, before it is passed to another DQ at i. T L,i T L,2 T L,2 W1 W3 W5 W8 T L,1 T L,1 T L,1 T L,1 W1 W2 W3 W4 W5 W6 W7 W8

P. Schmid, M. Besta, TH: High-Performance Distributed RMA Locks, ACM HPDC 16, best paper spcl.inf.ethz.ch DISTRIBUTED TREE OF QUEUES (DT) Throughput of readers vs writers R1 R2... R9 T L,3 W3 W8 DT: The maximum number of consecutive lock passings within readers ( ). T R T L,2 T L,2 W1 W3 W5 W8 T L,1 T L,1 T L,1 T L,1 W1 W2 W3 W4 W5 W6 W7 W8

P. Schmid, M. Besta, TH: High-Performance Distributed RMA Locks, ACM HPDC 16, best paper spcl.inf.ethz.ch DISTRIBUTED COUNTER (DC) Latency of readers vs writers DC: every kth compute node hosts a partial counter, all of which constitute the DC. k = T DC A writer holds the lock Readers that arrived at the CS b x y T DC = 1 T DC = 2 Readers that left the CS 0 9 7 R1 0 3 1 0 8 5 R4 R5 0 5 3 R2 R3 R8 R6 R7 R9 0 12 8 0 13 8

Locality vs fairness (for writers) spcl.inf.ethz.ch THE SPACE OF DESIGNS T L,i Design B Design A Higher throughput of writers vs readers T R T DC P. Schmid, M. Besta, TH: High-Performance Distributed RMA Locks, ACM HPDC 16, best paper

P. Schmid, M. Besta, TH: High-Performance Distributed RMA Locks, ACM HPDC 16, best paper spcl.inf.ethz.ch LOCK ACQUIRE BY READERS A lightweight acquire protocol for readers: only one atomic fetch-and-add (FAA) operation FAA 0 9 7 0 8 7 0 7 7 R1 0 1 1 0 2 1 0 3 1 FAA R4 A writer holds the lock Readers that arrived at the CS b x y Readers that left the CS FAA R2 FAA R3

P. Schmid, M. Besta, TH: High-Performance Distributed RMA Locks, ACM HPDC 16, best paper spcl.inf.ethz.ch LOCK ACQUIRE BY WRITERS R1 R2... R9 W9 W3 W8 Acquire the main MCS lock Acquire the main lock Acquire MCS Acquire MCS W9 W1 W3 W5 W8 0 9 9 1 9 9 0 3 3 1 3 3 0 8 8 1 8 8 0 5 5 1 5 5 W9 W1 W2 W3 W4 W5 W6 W7 W8

EVALUATION CSCS Piz Daint (Cray XC30) 5272 compute nodes 8 cores per node 169TB memory Microbenchmarks: acquire/release: latency, throughput Distributed hashtable

EVALUATION DISTRIBUTED COUNTER ANALYSIS Throughput, 2% writers Single-operation benchmark 0 9 7 0 3 1 0 12 8 P. Schmid, M. Besta, TH: High-Performance Distributed RMA Locks, ACM HPDC 16, best paper

EVALUATION READER THRESHOLD ANALYSIS Throughput, 0.2% writers, Empty-critical-section benchmark P. Schmid, M. Besta, TH: High-Performance Distributed RMA Locks, ACM HPDC 16, best paper

EVALUATION COMPARISON TO THE STATE-OF-THE-ART [1] R. Gerstenberger et al. Enabling Highly-scalable Remote Memory Access Programming with MPI-3 One Sided. ACM/IEEE Supercomputing 2013.

EVALUATION COMPARISON TO THE STATE-OF-THE-ART Throughput, single-operation benchmark [1] R. Gerstenberger et al. Enabling Highly-scalable Remote Memory Access Programming with MPI-3 One Sided. ACM/IEEE Supercomputing 2013.

EVALUATION DISTRIBUTED HASHTABLE 20% writers 10% writers [1] R. Gerstenberger et al. Enabling Highly-scalable Remote Memory Access Programming with MPI-3 One Sided. ACM/IEEE Supercomputing 2013.

EVALUATION DISTRIBUTED HASHTABLE 2% of writers 0% of writers [1] R. Gerstenberger et al. Enabling Highly-scalable Remote Memory Access Programming with MPI-3 One Sided. ACM/IEEE Supercomputing 2013.

Another application area - Databases MPI-RMA for distributed databases? Hash-Join Sort-Join C. Barthels, et al., TH: Distributed Join Algorithms on Thousands of Cores presented in Munich, Germany, VLDB Endowment, Aug. 2017 34

Another application area - Databases MPI-RMA for distributed databases on Piz Daint Network dominating Compute dominating C. Barthels, et al., TH: Distributed Join Algorithms on Thousands of Cores presented in Munich, Germany, VLDB Endowment, Aug. 2017 35

Another application area - Databases MPI-RMA for distributed databases on Piz Daint C. Barthels, et al., TH: Distributed Join Algorithms on Thousands of Cores presented in Munich, Germany, VLDB Endowment, Aug. 2017 36

P. Schmid, M. Besta, TH: High-Performance Distributed RMA Locks, ACM HPDC 16, best paper spcl.inf.ethz.ch OTHER ANALYSES

CONCLUSIONS Modular distributed RMA lock, correctness with SPIN Parameter-based design, feasible with various RMA libs/languages Thank you for your attention Improves latency and throughput over state-of-the-art P. Schmid, M. Besta, TH: High-Performance Distributed RMA Locks, ACM HPDC 16, best paper Accelerates distributed hashtabled