Evaluating the Impact of RDMA on Storage I/O over InfiniBand

Evaluating the Impact of RDMA on Storage I/O over InfiniBand
J. Liu and D. K. Panda, Computer and Information Science, The Ohio State University
M. Banikazemi, IBM T. J. Watson Research Center

Presentation Outline
- Introduction/Motivation
- RDMA Assisted iSCSI Overview
- Design and Implementation
- Performance Evaluation
- Conclusion

InfiniBand Overview
- Industry-standard interconnect for connecting processing nodes and I/O nodes
- High performance: less than 5 us latency, over 840 MB/s unidirectional bandwidth
- InfiniBand clusters are becoming increasingly popular

Storage for InfiniBand Clusters
- Local storage
- Network storage: Network Attached Storage (NAS), Storage Area Networks (SANs)

SAN for InfiniBand Clusters
- Fibre Channel (FC)
- SCSI RDMA Protocol (SRP)
- Internet SCSI (iSCSI)

FC and SRP
- Fibre Channel (FC): good performance, but requires new hardware (HBAs, switches) and a separate management infrastructure
- SCSI RDMA Protocol (SRP): InfiniBand-native protocol, no new hardware required, but requires implementation from scratch and a new management infrastructure

iSCSI
- Uses TCP/IP as the underlying transport layer
- No additional hardware for hosts (InfiniBand supports IPoIB)
- Relatively less software development effort (existing TCP/IP management infrastructure can be reused)
- Performance may be an issue: high overhead in the TCP/IP stack

iSCSI Data Transfer: Read
[Diagram: Initiator sends a Request to the Target over TCP/IP; the Target returns multiple Data-In PDUs, subject to flow control, followed by a Status PDU]
- All communication goes through TCP/IP
- Multiple data packets may be necessary
- Flow control for data packets may be necessary

iSCSI Data Transfer: Write
[Diagram: Initiator sends a Request to the Target over TCP/IP; the Initiator sends multiple Data-Out PDUs, subject to flow control from the Target; the Target returns a Status PDU]
- All communication goes through TCP/IP
- Multiple data packets may be necessary
- Flow control for data packets may be necessary

Problems with iSCSI
Limited performance because of:
- Protocol overhead in TCP/IP
- Interrupts generated for each network packet
- Extra copies when sending and receiving data

Improving iSCSI Performance
- Eliminating receiver-side copies in the TCP/IP stack: direct data placement in the HBA; special hardware; low compatibility
- iSCSI extension for RDMA: special hardware (RNICs); standard interface (RDMA over IP)
- Using RDMA over InfiniBand: RDMA assisted iSCSI (this work)

Presentation Outline
- Introduction/Motivation
- RDMA Assisted iSCSI
- Design and Implementation
- Performance Evaluation
- Conclusion

RDMA Assisted iSCSI Overview
- Combines IPoIB and RDMA over InfiniBand in iSCSI: control PDUs go through TCP/IP, data transfers use RDMA
- IPoIB for compatibility, RDMA over InfiniBand for performance
- Reuses existing infrastructure and many existing IP-based protocols
- No additional hardware needed for hosts

iSCSI Data Transfer with RDMA: Read
[Diagram: Initiator sends a Request to the Target over TCP/IP; the Target transfers the data with one InfiniBand RDMA Write, then returns a Status PDU over TCP/IP]
- Requests carry buffer information (address, length, tags)
- All control transfer goes through TCP/IP
- All data transfer goes through InfiniBand RDMA
- No need for multiple data packets, and no flow control for data packets is necessary

iSCSI Data Transfer with RDMA: Write
[Diagram: Initiator sends a Request to the Target over TCP/IP; the Target pulls the data with one InfiniBand RDMA Read, then returns a Status PDU over TCP/IP]
- Requests carry buffer information
- All control transfer goes through TCP/IP
- All data transfer goes through InfiniBand RDMA (see the sketch below)
- No need for multiple data packets, and no flow control for data packets is necessary
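The single-RDMA-operation data path above can be illustrated with a short sketch. The paper's prototype uses the kernel InfiniBand Access Layer; the sketch below instead uses the userspace libibverbs API purely as an illustration, and the remote_buf descriptor and function names are hypothetical. Given the buffer address, length, and rkey carried in the request PDU, the target issues one RDMA Write (for a SCSI read) or one RDMA Read (for a SCSI write):

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor for the initiator's buffer information
 * carried in the iSCSI request PDU over TCP/IP. */
struct remote_buf {
    uint64_t addr;   /* initiator's buffer address */
    uint32_t len;    /* transfer length */
    uint32_t rkey;   /* remote key from the initiator's registration */
};

/* Move the whole transfer in a single RDMA operation: an RDMA Write
 * pushes data to the initiator for a SCSI READ, an RDMA Read pulls
 * data from the initiator for a SCSI WRITE. No Data-In/Data-Out PDUs
 * and no iSCSI-level flow control are needed; the Status PDU still
 * travels over the TCP/IP (IPoIB) connection. */
static int rdma_move_data(struct ibv_qp *qp, struct ibv_mr *local_mr,
                          void *local_buf, const struct remote_buf *rb,
                          int scsi_read)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = rb->len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = scsi_read ? IBV_WR_RDMA_WRITE : IBV_WR_RDMA_READ;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* ask for a completion */
    wr.wr.rdma.remote_addr = rb->addr;
    wr.wr.rdma.rkey        = rb->rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```

The work request completion (polled from the send completion queue) tells the target that data placement has finished, at which point it can send the Status PDU over the TCP/IP connection.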

Advantages of Using RDMA
- Improved performance: much less protocol overhead (TCP/IP is bypassed for data transfer)
- No data copies in the protocol
- Reduced number of interrupts at the client

Presentation Outline
- Introduction/Motivation
- RDMA Assisted iSCSI
- Design and Implementation
- Performance Evaluation
- Conclusion

Design Issues
- Memory Registration
- Session Management
- Reliability and Security

Memory Registration
- Memory needs to be registered before it can be used for RDMA in InfiniBand, and the registration cost is high
- Targets can usually pre-register all their memory to avoid this cost (a minimal registration sketch follows)
- Pre-registration is more difficult for hosts (clients) because they do not have total control over the memory buffers
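As a concrete illustration of the registration requirement (again using userspace libibverbs as a stand-in for the kernel Access Layer, with hypothetical names), a target that serves data from a fixed staging pool can pay the pinning and translation-setup cost once, up front:

```c
#include <infiniband/verbs.h>
#include <stdlib.h>

/* Allocate and register a page-aligned staging pool once; the returned
 * lkey/rkey are then reused for every transfer instead of registering
 * memory on a per-request basis. */
struct ibv_mr *register_staging_pool(struct ibv_pd *pd, size_t bytes,
                                     void **buf_out)
{
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, bytes))
        return NULL;

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, bytes,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        free(buf);
        return NULL;
    }
    *buf_out = buf;
    return mr;
}
```

The ibv_reg_mr call is where pages are pinned and address translations are installed in the HCA, which is the expensive step the slides refer to; clients generally cannot pre-register in this way because they do not own the buffers handed to them by the kernel I/O path.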

Techniques to Improve Memory Registration
- Memory Registration Cache (MRC): maintains a cache of registered buffers and de-registers them in a lazy manner; depends on buffer reuse, so it may not be effective for storage buffers (see the cache sketch after this slide)
- Fast Memory Registration (FMR): divides registration into two steps (preparation and mapping), with only the mapping in the critical path; supported in some InfiniBand implementations
- Zero-Cost Kernel Memory Registration (ZKMR): maps physical memory into a kernel virtual address space and pre-registers the mapped region, then does virtual-to-mapped address translation during communication (very fast); has some limitations
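The MRC idea can be sketched as follows. This is a deliberately simplified illustration (exact-address matching, a small fixed table, no locking), not the paper's implementation, and it again uses libibverbs rather than the kernel Access Layer:

```c
#include <infiniband/verbs.h>
#include <stddef.h>

#define MRC_SLOTS 64   /* assumed cache capacity, illustrative only */

struct mrc_entry {
    void          *addr;
    size_t         len;
    struct ibv_mr *mr;
};

static struct mrc_entry cache[MRC_SLOTS];

/* Return a registration covering [addr, addr+len), registering on a miss.
 * Registrations are kept after the I/O completes and are only torn down
 * lazily, when an entry is evicted to make room for a new buffer. */
struct ibv_mr *mrc_get(struct ibv_pd *pd, void *addr, size_t len)
{
    int i, victim = 0;

    for (i = 0; i < MRC_SLOTS; i++) {
        if (cache[i].mr && cache[i].addr == addr && cache[i].len >= len)
            return cache[i].mr;      /* hit: reuse, no new pinning */
        if (!cache[i].mr)
            victim = i;              /* remember a free slot */
    }

    /* Miss: lazily de-register the evicted entry and register the new buffer. */
    if (cache[victim].mr)
        ibv_dereg_mr(cache[victim].mr);

    cache[victim].addr = addr;
    cache[victim].len  = len;
    cache[victim].mr   = ibv_reg_mr(pd, addr, len,
                                    IBV_ACCESS_LOCAL_WRITE |
                                    IBV_ACCESS_REMOTE_READ |
                                    IBV_ACCESS_REMOTE_WRITE);
    return cache[victim].mr;
}
```

De-registration happens only on eviction, so repeated I/O from the same buffers avoids the registration cost entirely. If buffers are rarely reused, as is common for storage I/O, the cache keeps missing and the full cost returns, which is why the slides note that MRC may not be effective for storage buffers.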

Session Management
- TCP/IP is used exclusively for the LOGIN phase, reusing existing bootstrapping and target discovery protocols
- The LOGIN phase negotiates the use of InfiniBand RDMA (a minimal negotiation sketch follows)
- Falls back to the original iSCSI if RDMA cannot be used
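A minimal sketch of how the fallback decision could be made, assuming a hypothetical iSCSI-style text key (here called X-RDMAAssisted) exchanged during the LOGIN phase; the actual key names and negotiation details are not given in the slides:

```c
#include <stdbool.h>
#include <string.h>

/* Both sides must agree on the (illustrative) X-RDMAAssisted=Yes key
 * during LOGIN; otherwise the session keeps the original TCP/IP (IPoIB)
 * iSCSI data path for Data-In/Data-Out PDUs. */
static bool use_rdma_data_path(const char *local_offer,
                               const char *target_reply)
{
    return local_offer && strcmp(local_offer, "Yes") == 0 &&
           target_reply && strcmp(target_reply, "Yes") == 0;
}
```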

Reliability and Security
- Reliability: the TCP/IP checksum may be insufficient for some applications, so iSCSI supports CRC; there is no need to use CRC in RDMA assisted iSCSI because InfiniBand has end-to-end CRC
- Security: RDMA assisted iSCSI can take advantage of existing authentication protocols; however, IPSec cannot be used directly

Implementation
- Linux 2.4.18
- Based on the Intel v18 iSCSI implementation
- InfiniBand Access Layer and IPoIB
- RAM-disk-based target implementation
- Kernel SCSI driver at the client

Presentation Outline
- Introduction/Motivation
- RDMA Assisted iSCSI
- Design and Implementation
- Performance Evaluation
- Conclusion

Experimental Testbed
- SuperMicro SUPER P4DL6 nodes (2.4 GHz Xeon, 400 MHz FSB, 512 KB L2 cache)
- Mellanox InfiniHost MT23108 4X HCAs (A1 silicon), PCI-X 64-bit 133 MHz
- Mellanox InfiniScale MT43132 switch

File Read Bandwidth
[Chart: Bandwidth (MB/s) vs. Block Size (Bytes), Original iSCSI vs. RDMA Assisted iSCSI]
- Buffered I/O used
- RDMA can improve peak bandwidth from 97 MB/s to over 400 MB/s
- 16 KB block size performs best

Impact of Read Ahead
[Chart: Bandwidth (MB/s) vs. Maximum Read-Ahead Window Size, Original iSCSI vs. RDMA Assisted iSCSI]
- Prefetching has a greater impact on the performance of RDMA assisted iSCSI

File Read Latency
[Chart: Original iSCSI vs. RDMA Assisted iSCSI, Block Size (Bytes) on the x-axis]
- Raw I/O used instead of buffered I/O
- RDMA improves performance for large block sizes

Host CPU Utilization (File Read Latency Test)
[Chart: CPU Utilization vs. Block Size (Bytes), Original iSCSI vs. RDMA Assisted iSCSI]
- RDMA reduces CPU utilization for large block sizes

Impact of Buffer Registration
[Chart: Bandwidth (MB/s) vs. Block Size (Bytes), ZKMR vs. on-the-fly registration]
- Registration cost significantly degrades performance
- RDMA is beneficial only when the registration cost can be avoided or reduced

Conclusion
- RDMA assisted iSCSI over InfiniBand evaluates the use of RDMA in storage protocols
- RDMA can significantly improve storage communication performance
- Provides useful insight for other protocols such as iSER and SRP
- A practical storage solution for InfiniBand clusters