Towards scalable RDMA locking on a NIC

TORSTEN HOEFLER (spcl.inf.ethz.ch): Towards scalable RDMA locking on a NIC, with support of Patrick Schmid, Maciej Besta, and Salvatore di Girolamo @ SPCL. Presented at HP Labs, Palo Alto, CA, USA.

NEED FOR EFFICIENT LARGE-SCALE SYNCHRONIZATION

LOCKS: an example structure (Proc p, Proc q). Intuitive semantics, but various performance penalties.

LOCKS: CHALLENGES

LOCKS: CHALLENGES. We need intra- and inter-node topology awareness, and we need to cover arbitrary topologies.

LOCKS: CHALLENGES. We need to distinguish between readers and writers [1], and we need flexible performance for both types of processes. [1] V. Venkataramani et al. TAO: How Facebook Serves the Social Graph. SIGMOD '12.

What will we use in the design?

WHAT WE WILL USE: MCS locks. Waiting processes form a queue: each process stores a pointer to the next process in the queue, and the lock itself is a pointer to the queue tail. Only the process at the head of the queue can enter the critical section; the others cannot enter until the lock is passed along.
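
To make the queue structure concrete, here is a minimal shared-memory MCS lock sketch in C11 (my own illustration, not code from the talk; the RMA lock presented here distributes such queue nodes across compute nodes):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct mcs_node {
        _Atomic(struct mcs_node *) next;   /* "next proc" link in the queue       */
        atomic_bool locked;                /* true while this process must wait   */
    } mcs_node_t;

    typedef struct {
        _Atomic(mcs_node_t *) tail;        /* pointer to the queue tail (or NULL) */
    } mcs_lock_t;

    static void mcs_acquire(mcs_lock_t *lock, mcs_node_t *me) {
        atomic_store(&me->next, (mcs_node_t *)NULL);
        atomic_store(&me->locked, true);
        /* Append myself; the previous tail becomes my predecessor.              */
        mcs_node_t *prev = atomic_exchange(&lock->tail, me);
        if (prev != NULL) {
            atomic_store(&prev->next, me);     /* link into the queue             */
            while (atomic_load(&me->locked))   /* "cannot enter": spin locally    */
                ;
        }                                      /* prev == NULL: "can enter"       */
    }

    static void mcs_release(mcs_lock_t *lock, mcs_node_t *me) {
        mcs_node_t *succ = atomic_load(&me->next);
        if (succ == NULL) {
            /* No known successor: try to swing the tail back to empty.          */
            mcs_node_t *expected = me;
            if (atomic_compare_exchange_strong(&lock->tail, &expected,
                                               (mcs_node_t *)NULL))
                return;
            while ((succ = atomic_load(&me->next)) == NULL)
                ;                              /* successor is still enqueuing    */
        }
        atomic_store(&succ->locked, false);    /* pass the lock to the next proc  */
    }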

WHAT WE WILL USE: Reader-Writer locks (one writer W, or many concurrent readers R).

How to manage the design complexity? How to ensure tunable performance? What mechanism to use for efficient implementation?

REMOTE MEMORY ACCESS (RMA) PROGRAMMING. Process p can put data A directly into the memory of process q, get data B from it, and flush to complete outstanding operations. (Pictured: Cray Blue Waters.)

REMOTE MEMORY ACCESS PROGRAMMING: implemented in hardware; the NICs in the majority of HPC networks support RDMA.
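
As a concrete sketch of the put/get/flush pattern from the previous slide, here is a minimal MPI-3 one-sided program in C (my own example, not code from the talk); it assumes at least two ranks and uses a trivially small window:

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal put / get / flush example; run with at least two processes.      */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Expose one integer per process as remotely accessible memory.        */
        int *base;
        MPI_Win win;
        MPI_Win_allocate((MPI_Aint)sizeof(int), sizeof(int), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &base, &win);
        *base = rank;
        MPI_Barrier(MPI_COMM_WORLD);           /* local init done everywhere     */

        MPI_Win_lock_all(0, win);              /* passive-target epoch           */

        int val = 42, got = -1;
        if (rank == 0) {
            MPI_Put(&val, 1, MPI_INT, 1, 0, 1, MPI_INT, win);  /* put into rank 1 */
            MPI_Win_flush(1, win);             /* flush: remotely complete put    */
            MPI_Get(&got, 1, MPI_INT, 1, 0, 1, MPI_INT, win);  /* get from rank 1 */
            MPI_Win_flush(1, win);
            printf("rank 0 read %d from rank 1\n", got);
        }

        MPI_Win_unlock_all(win);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }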

How to manage the design complexity? How to ensure tunable performance? What mechanism to use for efficient implementation?

How to manage the design complexity? A modular design: each element has its own distributed MCS queue (DQ) of writers; the MCS queues form a distributed tree (DT); readers and writers synchronize with a distributed counter (DC).

How to ensure tunable performance? A tradeoff parameter for every structure. Each DQ: fairness vs throughput of writers. DT: a parameter for the throughput of readers vs writers. DC: a parameter for the latency of readers vs writers.

DISTRIBUTED MCS QUEUES (DQs): throughput vs fairness. Each DQ has a parameter T_{L,i}: the maximum number of lock passings within a DQ at level i before the lock is passed to another DQ at level i. Larger T_{L,i}: more throughput at level i. Smaller T_{L,i}: more fairness at level i.
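
One possible way to apply this threshold at release time, as a hedged sketch (all names below, T_L, passes, and the helper functions, are hypothetical and only illustrate the policy; the paper's actual mechanism may differ):

    /* Illustrative only: applying the T_{L,i} threshold when a writer releases. */
    extern int  T_L[];                       /* per-level thresholds T_{L,i}      */
    extern int  passes[];                    /* consecutive passings per level    */
    extern int  has_local_successor(int level);
    extern void pass_within_dq(int level);   /* hand the lock on inside this DQ   */
    extern void pass_to_other_dq(int level); /* hand the lock to another DQ       */

    void dq_release(int level) {
        if (has_local_successor(level) && passes[level] < T_L[level]) {
            passes[level]++;        /* stay local: more throughput at this level  */
            pass_within_dq(level);
        } else {
            passes[level] = 0;      /* threshold reached: more fairness           */
            pass_to_other_dq(level);
        }
    }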

DISTRIBUTED TREE OF QUEUES (DT): throughput of readers vs writers. The DT has a parameter T_R: the maximum number of consecutive lock passings among readers.

DISTRIBUTED COUNTER (DC): latency of readers vs writers. Every k-th compute node hosts a partial counter; all of them together constitute the DC (k = T_DC). Each partial counter holds three fields: b (whether a writer holds the lock), x (readers that arrived at the CS), y (readers that left the CS).
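
The fields of one partial counter can be pictured as the following small C struct (an illustrative sketch; the field names follow the slide's b/x/y annotations, but the rank-to-counter mapping counter_home is my assumption based on "every k-th node hosts a partial counter", not taken from the paper):

    #include <stdint.h>

    /* One partial counter of the DC, hosted on every T_DC-th compute node.     */
    typedef struct {
        uint64_t b;   /* nonzero while a writer holds the lock                   */
        uint64_t x;   /* readers that arrived at the critical section            */
        uint64_t y;   /* readers that left the critical section                  */
    } partial_counter_t;

    /* Assumed placement rule: rank r uses the counter hosted on its nearest
     * "home" node. With T_DC = 2: ranks 0,1 use node 0; ranks 2,3 use node 2.   */
    static inline int counter_home(int rank, int T_DC) {
        return (rank / T_DC) * T_DC;
    }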

THE SPACE OF DESIGNS. The three parameters span a design space in which each configuration (e.g., Design A, Design B) is one point: T_{L,i} trades locality vs fairness (for writers), T_R trades throughput of readers vs writers, and T_DC trades latency of readers vs writers; moving through the space yields, for instance, higher throughput of writers vs readers.

LOCK ACQUIRE BY READERS. A lightweight acquire protocol for readers: only one atomic fetch-and-add (FAA) operation, incrementing the "readers that arrived at the CS" field (x) of the reader's partial counter.
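
A hedged sketch of such a single-FAA reader arrival using MPI-3 atomics (the talk's implementation uses the underlying RDMA primitives directly; the window layout, the x_disp offset, and the omission of writer handling are all simplifying assumptions of this illustration):

    #include <mpi.h>
    #include <stdint.h>

    /* Sketch of a reader arrival: one atomic fetch-and-add on the x field of
     * the reader's partial counter. Assumes 'win' exposes partial_counter_t on
     * the counter's home rank, that a passive-target epoch (MPI_Win_lock_all)
     * is already open, and that x_disp is the byte offset of the x field.      */
    void reader_arrive(MPI_Win win, int home_rank, MPI_Aint x_disp) {
        uint64_t one = 1, prev_x;
        /* The single FAA: increment "readers arrived" and fetch the previous
         * value in one network operation.                                      */
        MPI_Fetch_and_op(&one, &prev_x, MPI_UINT64_T, home_rank, x_disp,
                         MPI_SUM, win);
        MPI_Win_flush(home_rank, win);
        (void)prev_x;  /* a full implementation would inspect this value        */
    }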

LOCK ACQUIRE BY WRITERS. A writer (e.g., W9) acquires the MCS lock at each level of the distributed tree, up to the main MCS lock, and the writer flag b of the partial counters is set from 0 to 1.
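
For orientation only, a speculative outline of what the depicted writer acquire could look like; every helper and the drain-readers rule below are my reconstruction from the figure, not the paper's protocol:

    #include <stdint.h>

    extern int      tree_levels;              /* depth of the distributed tree    */
    extern int      num_partial_counters;     /* counters making up the DC        */
    extern void     mcs_acquire_level(int level);   /* join the DQ at this level  */
    extern void     set_writer_flag(int counter);   /* b: 0 -> 1                  */
    extern uint64_t read_arrived(int counter);      /* x field                    */
    extern uint64_t read_left(int counter);         /* y field                    */

    void writer_acquire(void) {
        /* 1. Climb the distributed tree: acquire the MCS lock at every level,
         *    finishing with the main (root) MCS lock.                           */
        for (int level = 0; level < tree_levels; level++)
            mcs_acquire_level(level);

        /* 2. Announce the writer on the distributed counter, then wait until
         *    every reader that arrived at the CS has also left it.              */
        for (int c = 0; c < num_partial_counters; c++)
            set_writer_flag(c);
        for (int c = 0; c < num_partial_counters; c++)
            while (read_arrived(c) != read_left(c))
                ;  /* drain readers */
    }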

EVALUATION: CSCS Piz Daint (Cray XC30), 5,272 compute nodes, 8 cores per node, 169 TB memory.

EVALUATION: CONSIDERED BENCHMARKS. Throughput benchmarks: empty-critical-section, single-operation, wait-after-release, workload-critical-section, and DHT (distributed hashtable evaluation). In addition, a latency benchmark.

EVALUATION: DISTRIBUTED COUNTER ANALYSIS. Throughput, 2% writers, single-operation benchmark.

EVALUATION: READER THRESHOLD ANALYSIS. Throughput, 0.2% writers, empty-critical-section benchmark.

EVALUATION: COMPARISON TO THE STATE OF THE ART. [1] R. Gerstenberger et al. Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided. ACM/IEEE Supercomputing 2013.

EVALUATION: COMPARISON TO THE STATE OF THE ART. Throughput, single-operation benchmark. [1] R. Gerstenberger et al. Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided. ACM/IEEE Supercomputing 2013.

EVALUATION: DISTRIBUTED HASHTABLE. 20% writers and 10% writers. [1] R. Gerstenberger et al. Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided. ACM/IEEE Supercomputing 2013.

EVALUATION: DISTRIBUTED HASHTABLE. 2% writers and 0% writers. [1] R. Gerstenberger et al. Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided. ACM/IEEE Supercomputing 2013.

OTHER ANALYSES

But why stop at RDMA? A brief history: lossy networks (Ethernet), then lossless networks (RDMA), then device programming (offload), roughly spanning the 1980s, the 2000s, and the 2020s. We need an abstraction: how to program QsNet? How to offload in libfabric? How to offload in Portals 4?

EXPRESS non-blocking communications, computations, and their dependencies:
L0: recv a from P1;
L1: b = compute f(buff, a);
L2: send b to P1;
with dependencies L0 (and the CPU) -> L1 and L1 -> L2, then OFFLOAD the schedule from the CPU to the offload engine.

Performance Model (based on LogGP [1]): each process has a CPU and an offload engine (OE); moving an s-byte message costs an overhead o on the OE, a serialization term (s-1)g, and a wire latency m.
P0 { L0: recv m1 from P1; L1: send m2 to P1; }
P1 { L0: recv m1 from P1; L1: send m2 to P1; L0 -> L1 }
[1] A. Alexandrov et al. LogGP: Incorporating Long Messages into the LogP Model: One Step Closer Towards a Realistic Model for Parallel Computation. Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA). ACM, 1995.
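
Reading the diagram with the LogGP terminology of [1], the end-to-end cost of one s-byte message between the two offload engines is roughly (a reconstruction from the figure's labels, not a formula stated on the slide):

    $T_{\mathrm{msg}} \approx o + (s-1)\,g + m + o$

where o is the per-message overhead paid on the offload engine, g the gap per byte, and m the wire latency; the figure places the o and (s-1)g terms on the OE timeline, leaving the CPU free.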

Fully Offloaded Collectives. Collective communication: a communication that involves a group of processes. Non-blocking collective: once initiated, the operation may progress independently of any computation or other communication at participating processes.

Fully offloading means: 1. No synchronization is required in order to start the collective operation. 2. Once a collective operation is started, no further CPU intervention is required in order to progress or complete it.

A Case Study: Portals 4. Based on the one-sided communication model; matching and non-matching semantics can be adopted. (Figure: an initiator with memory descriptors (MDs) and a target with a portals table of match-list entries (MEs) in a priority list and an overflow list, plus a discard path, connected by NICs over the interconnection network.) [2] The Portals 4.0.2 Network Programming Interface.

A Case Study: Portals 4. Communication primitives:
- Put/Get operations: natively supported by Portals 4; one-sided + matching semantics.
- Atomic operations: operands are the data specified by the MD at the initiator and by the ME at the target; available operators include min, max, sum, prod, swap, and, or.
- Counters: associated with MDs or MEs; count specific events (e.g., operation completion).
- Triggered operations: Put/Get/Atomic operations associated with a counter; executed when the associated counter reaches the specified threshold.
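
A hedged fragment showing how a triggered operation chains communication without CPU involvement (setup of the NI, MDs, MEs, and the counter via PtlCTAlloc is assumed to have been done elsewhere; the call follows the Portals 4.0.2 specification [2], but treat the exact arguments here as a sketch rather than definitive usage):

    #include <portals4.h>

    /* Fragment only: initialization is assumed done. Ask the NIC to put the
     * buffer described by 'md' to the parent as soon as recv_ct reaches 2,
     * i.e. once both children's messages have arrived; no CPU is needed then. */
    void forward_when_children_arrived(ptl_handle_md_t md,       /* result buf  */
                                       ptl_handle_ct_t recv_ct,  /* arrivals    */
                                       ptl_size_t len,
                                       ptl_process_t parent,
                                       ptl_pt_index_t pt,
                                       ptl_match_bits_t tag)
    {
        PtlTriggeredPut(md, 0 /* local_offset */, len, PTL_NO_ACK_REQ,
                        parent, pt, tag, 0 /* remote_offset */,
                        NULL /* user_ptr */, 0 /* hdr_data */,
                        recv_ct, 2 /* threshold */);
    }

A triggered put of this kind is the building block that the FFlib schedule below assembles into fully offloaded collectives.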

FFlib: An Example. Proof-of-concept library implemented on top of Portals 4. The schedule below receives from two children, combines each contribution into rbuff, and sends the result to the parent:

    ff_schedule_h sched = ff_schedule_create();

    /* receive the two children's contributions */
    ff_op_h r1 = ff_op_create_recv(tmp + blocksize, blocksize, child1, tag);
    ff_op_h r2 = ff_op_create_recv(tmp + 2*blocksize, blocksize, child2, tag);

    /* combine each contribution into the local result buffer */
    ff_op_h c1 = ff_op_create_computation(rbuff, blocksize, tmp + blocksize, blocksize, operator, datatype, tag);
    ff_op_h c2 = ff_op_create_computation(rbuff, blocksize, tmp + 2*blocksize, blocksize, operator, datatype, tag);

    /* forward the result to the parent */
    ff_op_h s = ff_op_create_send(rbuff, blocksize, parent, tag);

    /* happens-before dependencies: each computation after its receive, the send after both computations */
    ff_op_hb(r1, c1);
    ff_op_hb(r2, c2);
    ff_op_hb(c1, s);
    ff_op_hb(c2, s);

    ff_schedule_add(sched, r1);
    ff_schedule_add(sched, r2);
    ff_schedule_add(sched, c1);
    ff_schedule_add(sched, c2);
    ff_schedule_add(sched, s);

Experimental Results. Target machine: Curie, 5,040 nodes, 2 eight-core Intel Sandy Bridge processors per node, full fat-tree InfiniBand QDR. OMPI/P4: Open MPI 1.8.4 + Portals 4 RL. FFLIB: proof-of-concept library. More about FFLIB at http://spcl.inf.ethz.ch/research/parallel_programming/fflib/

Experimental Results: Latency/Overhead for Broadcast and Allreduce (same setup as above).

Experimental Results: Latency/Overhead for Scatter and Allgather (same setup as above).

Experimental Results: Micro-Benchmarks (PGMRES and 3DFFT) at P=64, P=128, and P=256 (same setup as above).

Simulations. Why? To study offloaded collectives at large scale. How? Extending LogGOPSim [3] to simulate Portals 4 functionalities. Simulated collectives: Broadcast and Allreduce, with network parameters [4]:

             L        o        g        G        m
    P4-SW    5μs      6μs      6μs      0.4ns    0.9ns
    P4-HW    2.7μs    1.2μs    0.5μs    0.4ns    0.3ns

[3] T. Hoefler, T. Schneider, A. Lumsdaine. LogGOPSim: Simulating Large-Scale Applications in the LogGOPS Model. Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC '10). ACM, 2010.
[4] Underwood et al. Enabling Flexible Collective Communication Offload with Triggered Operations. IEEE 19th Annual Symposium on High Performance Interconnects (HOTI '11). IEEE, 2011.

Abstract Machine Model. Offloading Collectives. Solo Collectives. Mapping to Portals 4. FFlib. Results.

CONCLUSIONS. A modular distributed RMA lock, with correctness verified using SPIN. A parameter-based design, feasible with various RMA libraries and languages. Improves latency and throughput over the state of the art. Enables a high-performance distributed hashtable. Thank you for your attention.