EXTENDING AN ASYNCHRONOUS MESSAGING LIBRARY USING AN RDMA-ENABLED INTERCONNECT
Konstantinos Alexopoulos, ECE NTUA, CSLab

MOTIVATION
o HPC, multi-node & heterogeneous systems
o Communication with low latency
o Reliability
o Provide an instant deployment method
o The CERN use case

ZEROMQ
o Messaging library
o Does not employ a messaging broker
o Low latency
o Asynchronous I/O
o Easy deployment of complex topologies

ZEROMQ
o Supports many platforms and language bindings
o Open-source project with active development
o Existing transports: IPC, UDP, TCP/IP
o Goal: port to an RDMA-enabled interconnect

COMMUNICATION PARADIGMS
TCP socket semantics:
o High overhead
o Decent bandwidth
o Low implementation effort
RDMA semantics:
o Low overhead
o Very low latency
o Very high throughput
o High implementation effort

SWITCHED FABRICS
Switched fabric architectures:
o Allow for any topology
o Reliable
o Inside-the-box & outside-the-box
Shared bus architectures:
o Topology restrictions
o Bottlenecks
o Inside-the-box only
[Diagram: endpoints (CPU, GPU, storage, network) attached through switches to a fabric spanning chassis, contrasted with endpoints contending on a single shared bus]

RAPIDIO
o Originally a system-level interconnect
o Independent from physical implementation
o Lately oriented towards chassis-to-chassis and SANs
o Protocol stack processed in hardware
o Destination-based routing

RIO OPERATIONS: MESSAGING INTERFACE
Channelized messages:
o Maximum 4K
o Socket-like interface
o Sent to a channel
Doorbells:
o Hardware signals
o 8B software-defined payload
o Must allocate an exclusive range to receive doorbells
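As a hypothetical illustration of how the software-defined doorbell payload can be used (later slides send a buffer position with each doorbell), a connection id and a slot position fit comfortably in the 8 bytes; the field layout below is an assumption, not taken from the implementation:

    // Hypothetical packing of the 8B doorbell payload: enough to say
    // "data for connection X is ready at slot Y" in a single doorbell.
    #include <cstdint>

    struct doorbell_payload {
        uint32_t conn_id;   // which connection the notification refers to
        uint32_t position;  // circular-buffer slot that was written or consumed
    };
    static_assert(sizeof(doorbell_payload) == 8, "must fit the 8B doorbell payload");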

RIO OPERATIONS: MMIO
Remote Direct Memory Access (read/write):
o Zero-copy
o One-way communication
o Device memory mapped to physical memory: done at boot time via a kernel boot parameter
o Physical memory mapped to the process address space: done through a library call in the application layer
o Supports multicast
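A minimal sketch of the second mapping step under generic Linux assumptions: the deck does this through a RapidIO library call, so mapping the reserved region via /dev/mem below is only a stand-in for that call, and PHYS_BASE / RDMA_REGION_SIZE are made-up placeholder values:

    // Bring an already-reserved, physically contiguous region into the
    // process address space so the application can use it as an RDMA target.
    #include <cstddef>
    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <unistd.h>

    constexpr off_t       PHYS_BASE        = 0x100000000;      // reserved via kernel boot parameter (placeholder)
    constexpr std::size_t RDMA_REGION_SIZE = 32 * 1024 * 1024; // placeholder size

    int main() {
        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0) { perror("open /dev/mem"); return 1; }

        void *rdma_buf = mmap(nullptr, RDMA_REGION_SIZE,
                              PROT_READ | PROT_WRITE, MAP_SHARED, fd, PHYS_BASE);
        if (rdma_buf == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        // rdma_buf is now directly readable/writable by the application.
        munmap(rdma_buf, RDMA_REGION_SIZE);
        close(fd);
        return 0;
    }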

DESIGN & IMPLEMENTATION: RIOZMQ

ZEROMQ INTERNAL ARCHITECTURE
o User creates a (0MQ!) socket
o Socket binds/connects to the network
o Listener/Connecter create a Session
o A Session/Engine pair is created for every new connection
[Diagram: application socket on an OS thread using the API; listeners/connecters, sessions and engines created on I/O threads]

ZEROMQ INTERNAL ARCHITECTURE
o I/O threads handle asynchronous network operations
o in_event() and out_event() callbacks for every io_object_t
[Diagram: each io_thread_t runs a poller_t that dispatches fd readiness to io_object_t instances (listener, connecter, engine); the socket talks to sessions through a mailbox_t, which also exposes an fd]
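A simplified sketch of the pattern described here: an I/O object exposes in_event()/out_event(), and the I/O thread's poller dispatches file-descriptor readiness to it. The names echo libzmq's i_poll_events/poller_t, but the interface shown is illustrative rather than the exact libzmq one:

    #include <map>
    #include <poll.h>
    #include <vector>

    struct i_poll_events {
        virtual void in_event()  = 0;   // fd became readable
        virtual void out_event() = 0;   // fd became writable
        virtual ~i_poll_events() = default;
    };

    class poller_t {
    public:
        void add_fd(int fd, i_poll_events *events) { handlers_[fd] = events; }

        // One poll pass; a real poller tracks in/out interest per fd and loops forever.
        void poll_once(int timeout_ms) {
            std::vector<pollfd> pfds;
            for (auto &kv : handlers_)
                pfds.push_back({kv.first, POLLIN | POLLOUT, 0});
            ::poll(pfds.data(), pfds.size(), timeout_ms);
            for (auto &p : pfds) {
                if (p.revents & POLLIN)  handlers_[p.fd]->in_event();
                if (p.revents & POLLOUT) handlers_[p.fd]->out_event();
            }
        }
    private:
        std::map<int, i_poll_events *> handlers_;
    };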

RIOZMQ EXTENSION
o rio_address
  o Resolves addresses of the rio://[destid]:[channel] format
o rio_connecter / rio_listener
  o connect() / bind()
  o RDMA target address exchange
  o Doorbell range allotment
o rio_engine
  o RDMA write/recv
o rio_mailbox
  o Doorbell operations
  o FD registered with io_object_t
o Glue code in socket/session
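As a usage illustration (not from the deck): with the extension in place, application code stays ordinary ZeroMQ and only the endpoint string changes to the rio://[destid]:[channel] format above. This assumes the RIOZMQ build of libzmq, and the destid/channel values are placeholders:

    #include <zmq.h>
    #include <cstring>

    int main() {
        void *ctx  = zmq_ctx_new();
        void *sock = zmq_socket(ctx, ZMQ_REQ);

        // With stock libzmq this would be e.g. "tcp://10.0.0.5:5555";
        // with the RIOZMQ extension the same code targets the RapidIO transport.
        zmq_connect(sock, "rio://5:1");

        const char msg[] = "ping";
        zmq_send(sock, msg, strlen(msg), 0);

        char buf[64];
        zmq_recv(sock, buf, sizeof(buf), 0);

        zmq_close(sock);
        zmq_ctx_term(ctx);
        return 0;
    }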

MEMORY SCHEME
[Diagram: Node A and Node B connected over the RapidIO network. Each node's reserved memory holds per-connection regions (Conn 0 offset, Conn 1 offset, ...), each organized as a circular buffer of RDMA-buffer-sized slots plus control records; the sender issues write(pos) and the receiver read(pos).]
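A hypothetical sketch of the per-connection bookkeeping this scheme implies; the field names are illustrative and not taken from the implementation:

    #include <cstddef>
    #include <cstdint>

    struct control_record {
        volatile uint32_t write_pos;  // next slot the sender will fill
        volatile uint32_t read_pos;   // next slot the receiver will consume
    };

    struct rio_connection {
        std::size_t     region_offset;     // this connection's offset into reserved memory
        std::size_t     rdma_buffer_size;  // smallest unit transmitted (the "RDMA Buffer Size")
        std::size_t     ring_length;       // number of RDMA buffers in the circular buffer
        control_record *ctrl;              // shared read/write positions

        bool ring_full() const {
            return (ctrl->write_pos + 1) % ring_length == ctrl->read_pos;
        }
    };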

DOORBELLS AS NOTIFICATIONS
[Sequence diagram: Node A (send) runs check_write_idx(pos), write(pos), update_write_idx(pos) and send_doorbell(pos); on doorbell arrival Node B (recv) updates its copy of the write index, runs check_read_idx(pos), read(pos), update_read_idx(pos) and sends a doorbell back, upon which Node A updates its copy of the read index.]
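A minimal sketch of the send-side sequence, continuing the rio_connection sketch above; rdma_write() and send_doorbell() are hypothetical helpers standing in for the real RapidIO calls, and only the ordering of steps mirrors the slide:

    // Hypothetical helpers assumed to exist elsewhere in the transport.
    void rdma_write(rio_connection &conn, uint32_t slot, const void *data, std::size_t len);
    void send_doorbell(rio_connection &conn, uint32_t slot);

    bool try_send(rio_connection &conn, const void *data, std::size_t len) {
        uint32_t pos = conn.ctrl->write_pos;

        if (conn.ring_full())                                  // check_write_idx(pos)
            return false;                                      // receiver has not freed a slot yet

        rdma_write(conn, pos, data, len);                      // write(pos): RDMA write into slot pos
        conn.ctrl->write_pos = (pos + 1) % conn.ring_length;   // update_write_idx(pos)
        send_doorbell(conn, pos);                              // tell the receiver which slot is ready
        return true;
    }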

DOORBELLS - FILE DESCRIPTORS - POLLER
[Diagram: on Node A (send), the poller triggers the rio_engine's out_event(), which issues rdma_write() and a doorbell via rio_mailbox; on Node B (recv), the incoming doorbell is handled by rio_mailbox, which does write(fd) on an FD watched by the poller, triggering the rio_engine's in_event() and rdma_read().]
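The slide shows the receive path turning a doorbell into a write(fd) that wakes the poller. One plausible backing mechanism (an assumption, not confirmed by the deck) is a Linux eventfd:

    #include <cstdint>
    #include <sys/eventfd.h>
    #include <unistd.h>

    int make_doorbell_fd() {
        return eventfd(0, EFD_NONBLOCK);          // registered with the poller via add_fd()
    }

    // Called from the doorbell-handling path (rio_mailbox) when a doorbell arrives.
    void notify_engine(int doorbell_fd) {
        uint64_t one = 1;
        write(doorbell_fd, &one, sizeof(one));    // wakes the poller -> engine's in_event()
    }

    // Called inside the engine's in_event() to clear the notification.
    void drain_notifications(int doorbell_fd) {
        uint64_t count = 0;
        read(doorbell_fd, &count, sizeof(count)); // count = number of pending doorbells
    }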

EVALUATION

HARDWARE SETUP
o 4x 2U quad units, 4 nodes per unit
o Intel Xeon L5640 @ 2.27 GHz
o 48 GB of DDR3 1333 MHz RAM
o IDT Tsi721 RapidIO-to-PCIe bridge cards
o QSFP+ cables
o 38-port Top-of-Rack (ToR) RapidIO Gen2 switch
o CERN CentOS

BENCHMARKS
o Standard ZeroMQ benchmarks
o Measure round-trip time (RTT) between 2 nodes
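A minimal round-trip-time loop in the spirit of that benchmark, using the public ZeroMQ C API; the endpoint, message size and iteration count are placeholders, and the rio:// endpoint assumes the RIOZMQ build:

    #include <zmq.h>
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t msg_size   = 64;
        const int         roundtrips = 10000;

        void *ctx  = zmq_ctx_new();
        void *sock = zmq_socket(ctx, ZMQ_REQ);
        zmq_connect(sock, "rio://5:1");   // or "tcp://..." for the baseline runs

        std::vector<char> buf(msg_size, 0);
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < roundtrips; ++i) {
            zmq_send(sock, buf.data(), msg_size, 0);
            zmq_recv(sock, buf.data(), msg_size, 0);
        }
        auto end = std::chrono::steady_clock::now();

        double us = std::chrono::duration<double, std::micro>(end - start).count();
        printf("average RTT: %.2f us\n", us / roundtrips);

        zmq_close(sock);
        zmq_ctx_term(ctx);
        return 0;
    }

The peer node would run a matching REP socket that simply echoes each message back.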

EVALUATION: LATENCY (SMALL MESSAGES)
o RDMA buffer size: 4K
o Circular buffer length: 1
o RIOZMQ 65% faster
[Plot: latency (us) vs. transaction size from 1B to 4K, RIOZMQ vs. ZeroMQ over TCP/IP]

EVALUATION: LATENCY (LARGE MESSAGES)
o RDMA buffer size: 1M
o Circular buffer length: 1
[Plot: latency (us) vs. transaction size from 8K to 1M, RIOZMQ vs. ZeroMQ over TCP/IP]

EVALUATION: RDMA BUFFER SIZE SCAN
o Message size: 512MB
o Circular buffer length: 8
o An RDMA buffer is the smallest possible data size to transmit
o The circular buffer consists of the number of RDMA-enabled blocks assigned to a connection
[Plot: throughput (Gbps) vs. RDMA buffer size from 4K to 128M]

EVALUATION: CIRCULAR BUFFER LENGTH SCAN
o Message size: 512MB
o RDMA buffer size: 32MB
o An RDMA buffer is the smallest possible data size to transmit
o The circular buffer consists of the number of RDMA-enabled blocks assigned to a connection
[Plot: throughput (Gbps) vs. circular buffer length from 1 to 64]

EVALUATION: RDMA SPEEDS
o Maximum measured speeds
o Measured around the rdma_write() call
o RapidIO <-> PCIe translations
o Library overhead

EVALUATION: THROUGHPUT
o RDMA buffer size: 32M
o Circular buffer length: 16
o Achieved ~75% saturation
[Plot: throughput (Gb/s) vs. transaction size from 1M to 1G for RIOZMQ, against the maximum theoretical and maximum measured speeds]

EVALUATION: BREAKDOWN ANALYSIS, SEND (SMALL MESSAGES)
o RDMA buffer size: 32M
o Circular buffer length: 16
o Main bottleneck: dma_write
o Encode also consumes time

EVALUATION: BREAKDOWN ANALYSIS, SEND (LARGE MESSAGES)
o RDMA buffer size: 32M
o Circular buffer length: 16
o Main bottleneck: dma_write
o Encode also consumes time

EVALUATION: BREAKDOWN ANALYSIS, RECV (SMALL MESSAGES)
o RDMA buffer size: 32M
o Circular buffer length: 16
o in_event: time spent between calls
  o Poll operations
  o FD operations
  o Doorbell handling
o Can't measure asynchronous operations

EVALUATION: BREAKDOWN ANALYSIS, RECV (LARGE MESSAGES)
o RDMA buffer size: 32M
o Circular buffer length: 16
o in_event: time spent between calls
  o Poll operations
  o FD operations
  o Doorbell handling
o Can't measure asynchronous operations
o Decode for larger transactions

CONCLUSIONS
o ZeroMQ extended to use the RapidIO transport
o Achieved better latency for small messages compared to TCP/IP
o Designed for use with an arbitrary number of nodes
o Usable in existing setups by changing the address from tcp:// to rio://
o Used RDMA semantics within ZeroMQ
o Same scheme applicable to other RDMA-enabled interconnects
o Work is open source, at github.com/kostorr/libzmq (soon)

FUTURE WORK
Implementation:
o Optimize the circular buffer
o Zero-copy in the critical path
o Possible removal of file descriptor use
Evaluation:
o More extensive breakdown of recv() performance
o Deploy on a system with more than 16 nodes
o Run a real-life benchmark on a distributed system

THANK YOU!

THE PROBLEM (1)
o Moore's Law: semiconductor performance increases at an exponential rate
o Amdahl's Law: law of diminishing returns
o The performance of a system can only be assessed as the balance between CPU, memory bandwidth, and I/O performance
o The conjunction of these laws leads to an imbalance, limiting performance
o New interconnect technologies!
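For reference (not part of the deck), Amdahl's Law in its standard form: if a fraction p of the work can be sped up by a factor s, the overall speedup is bounded by

    S(s) = \frac{1}{(1 - p) + p/s} \le \frac{1}{1 - p}

so accelerating only one component (CPU, memory bandwidth, or I/O) eventually yields diminishing returns, which is the imbalance the slide refers to.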