EXTENDING AN ASYNCHRONOUS MESSAGING LIBRARY USING AN RDMA-ENABLED INTERCONNECT Konstantinos Alexopoulos ECE NTUA CSLab
MOTIVATION HPC, Multi-node & Heterogeneous Systems Communication with low latency Reliability Provide an instant employment method The CERN use-case 30/10/17 2
ZEROMQ Messaging Library Does not employ a messaging broker Low Latency Asynchronous I/O Easy deployment of complex topologies 30/10/17 3
ZEROMQ Supports many platforms and language bindings Open source project with active development IPC, UDP, TCP/IP Port to an RDMA-enabled interconnect 30/10/17 4
COMMUNICATION PARADIGMS TCP Socket Semantics High Overhead Decent Bandwidth Low implementation effort RDMA Semantics Low Overhead Very Low Latency Very High Throughput High implementation effort 30/10/17 5
SWITCHED FABRICS Switched Fabric Architectures Chassis To another part of the fabric... Allow for any topology Reliable Inside-the-box & Outside-the-box Endpoint Switch Endpoint GPU Endpoint Endpoint Shared Bus Architectures Topology restrictions Switch Bottlenecks Endpoint Endpoint Inside-the-box only Switched Fabric Endpoint Endpoint Storage Network 30/10/17 6
RAPIDIO System-level interconnect originally Independent from Physical Implementation Lately oriented towards chassis-to-chassis and SANs Protocol stack processed in HW Destination Based Routing 30/10/17 7
RIO OPERATIONS MESSAGING INTERFACE Channelized Messages o Maximum 4K o Socket-like interface o Sent to a channel Doorbells o Hardware Signals o 8B software-defined payload o Have to allocate exclusive range to receive doorbells 30/10/17 8
RIO OPERATIONS MMIO Remote Direct Memory Access Read/Write Zero-copy One-way communication Device memory mapped to physical memory Needs to be done at boot time Kernel boot parameter Physical memory mapped to process address space Done through a library call in the application layer Supports multi-cast 30/10/17 9
DESIGN & IMPLEMENTATION RIOZMQ 30/10/17 10
ZEROMQ INTERNAL ARCHITECTURE User creates (0mq!) Socket Network Socket binds/connects Listener/Connecter create Session Session/Engine object for any new connection creates listener I/O thread 0 engine session engine session creates connecter I/O thread 1 socket OS thread API 30/10/17 11
ZEROMQ INTERNAL ARCHITECTURE Network I/O threads for asynchronous operations in_event() and out_event() for every io_object_t io_thread_t creates listener engine session engine session creates connecter poller_t I/O thread 0 I/O thread 1 io_object_t io_object_t X Y mailbox_t fd fd fd socket API OS thread 30/10/17 12
RIOZMQ EXTENSION o rio_address o resolves address of rio://[destid]:[channel] format o rio_connecter / rio_listener o connect()/bind() o RDMA target addresses exchange o doorbell range allotment orio_engine o RDMA write/recv o rio_mailbox o doorbell operations o FD registered with io_object_t o glue code in socket/session 30/10/17 13
MEMORY SCHEME Node A RapidIO Network Node B Conn 0 offset Conn 1 offset Circular Buffer write(pos) Reserved Memory read(pos) Control Records RDMA Buffer Size 30/10/17 14
DOORBELLS AS NOTIFICATIONS Node A (send) check_write_idx(pos) update_write_idx(pos) write(pos) send_doorbell(pos) Node B (recv) update_read_idx(pos) update_write_idx(pos) check_read_idx(pos) update_read_idx(pos) read(pos) send_doorbell(pos) 30/10/17 15
DOORBELLS - FILE DESCRIPTORS - POLLER Node A (send) Node B (recv) out_event() rdma_write() rio_engine rio_engine in_event() rdma_read() poller FD FD poller write(fd) send_doorbell() rio_mailbox rio_mailbox 30/10/17 16
EVALUATION 30/10/17 17
HARDWARE SETUP 4x 2U Quad Units 4 Nodes per Unit Intel Xeon L5640 @ 2.27Ghz 48GB of DDR3 1333MHz RAM IDT Tsi721 RapidIO to PCIe bridge cards QSFP+ cables 38-port Top of Rack (ToR) RapidIO Gen2 switch CERN CentOS 30/10/17 18
BENCHMARKS Standard ZeroMQ Benchmarks Measures round trip time (RTT) between 2 nodes 30/10/17 19
EVALUATION LATENCY SMALL 350 325 RIOZMQ ZeroMQ over TCP/IP Latency o RDMA Buffer Size : 4K o Circular Buffer Length : 1 o RIOZMQ 65% faster Latency (us) 300 275 250 225 200 175 1B 2B 4B 8B 16B 32B 64B 128B 256B 512B Transaction Size 30/10/17 20 1K 2K 4K
EVALUATION LATENCY LARGE 8000 RIOZMQ ZeroMQ over TCP/IP Latency o RDMA Buffer Size : 1M o Circular Buffer Length : 1 Latency (us) 6000 4000 2000 0 8K 16K 32K 64K 128K 256K Transaction Size 30/10/17 21 512K 1M
EVALUATION RDMA BUFFER SIZE SCAN 9 8 Message Size: 512MB Circular Buffer Length: 8 RDMA Buffer Scan An RDMA Buffer is the smallest possible data size to transmit Throughput (Gbps) 7 6 5 4 3 2 1 4K 8K 16K 32K 64K 128K 256K 512K 1M 2M 4M 8M 16M 32M 64M 128M The Circular Buffer consists of the number of RDMA-enabled blocks assigned to a connection RDMA buffer size 30/10/17 22
EVALUATION CIRCULAR BUFFER LENGTH SCAN 8.6 Message Size: 512MB RDMA Buffer Size: 32MB Circular Buffer Length Scan An RDMA Buffer is the smallest possible data size to transmit Throughput (Gbps) 8.4 8.2 8.0 7.8 The Circular Buffer consists of the number of RDMA-enabled blocks assigned to a connection 7.6 1 2 4 8 16 32 Circular Buffer Length 30/10/17 23 64
EVALUATION RDMA SPEEDS o Maximum Measured Speeds o Around rdma_write() call o RapidIO <--> PCIe Translations o Library Overhead 30/10/17 24
EVALUATION THROUGHPUT 16 14 Throughput o RDMA Buffer Size : 32M o Circular Buffer Length : 16 o Achieved ~75% saturation Throughput (Gb/s) 12 10 8 6 4 2 0 1M RIOZMQ Max Theoretical Speed Max Measured Speed 2M 4M 8M 16M 32M Transaction Size 64M 128M 30/10/17 25 256M 512M 1G
EVALUATION BREAKDOWN ANALYSIS SEND SMALL o RDMA Buffer Size : 32M o Circular Buffer Length : 16 o Main bottleneck dma_write o Encode also consumes time 30/10/17 26
EVALUATION BREAKDOWN ANALYSIS SEND LARGE o RDMA Buffer Size : 32M o Circular Buffer Length : 16 o Main bottleneck dma_write o Encode also consumes time 30/10/17 27
EVALUATION BREAKDOWN ANALYSIS RECV SMALL o RDMA Buffer Size : 32M o Circular Buffer Length : 16 o in_event: between calls o poll operations o fd operations o doorbell handling o Can t measure asynchronous operations 30/10/17 28
EVALUATION BREAKDOWN ANALYSIS RECV LARGE o RDMA Buffer Size : 32M o Circular Buffer Length : 16 o in_event: between calls o poll operations o fd operations o doorbell handling o Can t measure asynchronous operations o Decode for larger transactions 30/10/17 29
CONCLUSIONS o ZeroMQ extended to use the RapidIO transport o Achieved better latency for small messages compared to TCP/IP o Designed for use with arbitrary number of nodes o Use in existing setups by changing the address from tcp* to rio* o Used RDMA semantics within ZeroMQ o Same scheme for other RDMA-enabled interconnects o Work is open-source - can be found at github.com/kostorr/libzmq (soon ) 30/10/17 30
FUTURE WORK Implementation o Optimize circular buffer o Zero-Copy in the critical path o Possible removal of file descriptor use Evaluation o More extensive breakdown on recv() performance o Employ on a system with more than 16 nodes o Run a real-life benchmark on a distributed system 30/10/17 31
THANK YOU! 30/10/17 32
THE PROBLEM (1) Moore s Law Semiconductor performance increases at an exponential rate Amdahl s Law Law of diminishing returns The performance of a system can only be assessed as the balance between: CPU Memory Bandwidth I/O Performance The conjunction of these laws leads to an imbalance, limiting performance New Interconnects Technologies!