Evaluation of Chelsio Terminator 6 (T6) Unified Wire Adapter iSCSI Offload

November 2017

Initiator and target iSCSI offload improve performance and reduce processor utilization.

Executive Summary

The Chelsio T6 adapters offload a variety of protocols, including iSCSI. iSCSI offload enables lower latency, higher IOPS, and lower processor utilization for iSCSI storage solutions. In addition, both initiator and target iSCSI offloads are available, allowing offload on both sides of the connection.

Chelsio commissioned Demartek to evaluate the Chelsio T6225-LL-CR 25GbE and Chelsio T62100-CR 100GbE iSCSI offload adapters with synthetic and real-world workloads, comparing performance and host processor utilization with and without the iSCSI offload functions enabled on the client, with hardware offload enabled on the target. For this project, small-block read and small-block write synthetic workloads were run, followed by a real-world OLTP MySQL database application workload, simulating real applications that customers run in their datacenters. When iSCSI hardware offload was enabled, performance improved while processor utilization held steady or decreased. Overall, the offload features improved processor effectiveness and reduced the time to complete SQL queries.

Key Findings

> For synthetic 4K reads, the T62100-CR offloaded iSCSI target achieved 3 times the IOPS, half the latency, and over 3.5 times the target processor effectiveness of the software iSCSI target. The T6225-LL-CR 25GbE offloaded iSCSI initiator achieved 80% more IOPS, 45% less latency, and a 75% increase in initiator processor effectiveness compared to the software iSCSI initiator.

> For synthetic 4K writes, the T62100-CR offloaded iSCSI target achieved 43% more IOPS, 30% less latency, and a 185% increase in target processor effectiveness compared to the software iSCSI target. The T6225-LL-CR 25GbE offloaded iSCSI initiator achieved 43% more IOPS, 31% less latency, and a 43% increase in initiator processor effectiveness compared to the software iSCSI initiator.

> For our MySQL OLTP real-world workload, the T62100-CR offloaded iSCSI target took half the time to complete 5.6 million OLTP transactions, and achieved 78% more IOPS, 41% less latency, and 5.5 times the target processor effectiveness of the software iSCSI target. The T6225-LL-CR 25GbE offloaded iSCSI initiator took 20% less time to complete the 5.6 million OLTP transactions, and achieved 20% more IOPS, 14% less latency, and a 19% increase in initiator processor effectiveness compared to the software iSCSI initiator.

Chelsio T6 iSCSI Offload Adapter

The Chelsio T6 adapters offer various protocol offloads, relieving the host processor of many low-level protocol functions. The offloaded protocols include:

> TCP/IP
> iWARP RDMA
> iSCSI (both initiator and target)
> FCoE
> NVMe-oF

By offloading these functions, the host processor cycles that would otherwise be spent on Ethernet protocol processing can be applied to applications, getting more useful work done. The T6 retains the high-performance packet switch introduced with the T5, while its latest chipset adds offloads for IPsec, TLS/SSL, DTLS, and SMB 3.x crypto.

Test Environment

A Chelsio T6225-LL-CR 25GbE adapter was installed in each of four Red Hat Enterprise Linux 7.3 initiator servers. Each initiator had a single 25GbE port connected to our switch, and relaxed ordering was disabled. As each initiator server had 6 cores, 6 initiator sessions were configured on each server, for a total of 24 initiator sessions (a sketch of the session setup follows).
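As an illustration, here is a minimal sketch of how the software-initiator sessions could be established, assuming open-iscsi's iscsiadm (the stock RHEL 7.3 initiator; the report does not name the tool) and a hypothetical portal address. The offloaded initiator uses Chelsio's own driver stack instead.

```python
#!/usr/bin/env python3
"""Sketch only: establish software-initiator iSCSI sessions with open-iscsi.
The portal address is hypothetical; the report does not name the initiator tool."""
import subprocess

PORTAL = "10.0.0.100"  # target server portal (assumed address)

# Discover all targets exported by the target server's portal...
subprocess.run(
    ["iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", PORTAL],
    check=True,
)

# ...then log in to every discovered node record (6 sessions per initiator
# in this setup, matching the 6 cores in each initiator server).
subprocess.run(["iscsiadm", "-m", "node", "--login"], check=True)
```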

Five Samsung SM-1715 NVMe SSDs and one Chelsio T62100-CR 100GbE adapter were installed in our four-socket Red Hat Enterprise Linux 7.3 target server. Both of the 100GbE ports were connected to the switch, although bandwidth was limited by the PCIe 3.0 x16 slot. Chelsio provided a t4_perftune.sh script, which was used to spread the adapter queues across processor cores, and relaxed ordering was enabled. Each NVMe SSD was split into 6 partitions, creating 30 backstores for 30 iSCSI targets, with one LUN per target and no multipathing (see the configuration sketch below).

One of the goals of this project was to ensure that the target server was not the bottleneck. To achieve this, we chose a target server with more processor cores (60) than all the initiators combined, and used NVMe storage in the target to ensure fast performance.

Each initiator was mapped to a single LUN on each of the 5 NVMe SSDs, plus one additional LUN from one of those drives, with each initiator's additional LUN coming from a different NVMe device. This gave each initiator 6 iSCSI connections to 6 distinct LUNs spread across the 5 NVMe drives in the target server. Five initiator servers were configured in this manner, so of the 30 iSCSI targets created, 24 had active sessions from the 4 initiator servers. (Although access for a fifth server was also configured, that server was not used in the testing included in this report.)

Jumbo frames were configured throughout the test environment, including the initiator servers, switch, and target server.
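The following sketch illustrates the target-side layout, assuming the software target is the Linux LIO stack configured with targetcli (the report does not name the tool); the IQN prefix and device names are hypothetical.

```python
#!/usr/bin/env python3
"""Sketch only: emit targetcli commands for the 30-backstore / 30-target
layout described above (5 NVMe SSDs x 6 partitions, one LUN per target).
IQN prefix and device names are hypothetical."""

NUM_DRIVES = 5        # Samsung SM-1715 NVMe SSDs
PARTS_PER_DRIVE = 6   # partitions per SSD

for d in range(NUM_DRIVES):
    for p in range(1, PARTS_PER_DRIVE + 1):
        name = f"nvme{d}p{p}"
        dev = f"/dev/nvme{d}n1p{p}"
        iqn = f"iqn.2017-11.com.example:{name}"  # hypothetical IQN
        print(f"targetcli /backstores/block create name={name} dev={dev}")
        print(f"targetcli /iscsi create {iqn}")
        # one LUN per target, no multipathing
        print(f"targetcli /iscsi/{iqn}/tpg1/luns create /backstores/block/{name}")
```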

Workloads and Performance Metrics

We wanted to compare the performance and host processor utilization of various workloads with and without the iSCSI offload functions enabled. The iSCSI protocol has an initiator and a target: the initiator is generally an application host server running various applications, while the target receives the iSCSI storage protocol commands and contains the storage devices. Chelsio T6 adapters offer offload of both the iSCSI initiator and the iSCSI target. Therefore, within our test environment, we created 3 different iSCSI setups for testing:

> 1. Software Target with Hardware Initiator
> 2. Hardware Target with Software Initiator
> 3. Hardware Target with Hardware Initiator

To observe the benefits of target iSCSI offload, results from setups 1 and 3 were compared. To observe the benefits of initiator iSCSI offload, results from setups 2 and 3 were compared. For each iSCSI setup, a set of 3 tests was performed:

> Synthetic Vdbench 4K reads
> Synthetic Vdbench 4K writes
> Real-world MySQL OLTP workload simulation

Synthetic Vdbench

For our synthetic workloads we ran Vdbench, a workload simulation tool from Oracle that lets the user control I/O rate, block size, read/write percentage, and so on to create a custom workload to run against storage targets. Vdbench can run a block workload against raw LUNs with no filesystem, or it can run a filesystem workload. For this testing we ran a raw block workload of 4K reads and 4K writes against all 6 storage targets provided by iSCSI on each of the 4 initiators, with an I/O rate of MAX, the maximum sustainable by the storage targets. (A parameter-file sketch appears at the end of this section.)

Real-World MySQL OLTP

For our real-world workload, MySQL 5.7.18, an open-source relational database package, was installed on each of the 4 initiators. mdadm 3.4-14 was used to create a software RAID 0 array on each initiator server from the 6 LUNs provided by iSCSI, and the default directory for MySQL databases was moved from the local drive to the mdadm array. In addition, while not something recommended in a production environment, a script was run to flush the filesystem cache every second in order to maximize I/O to the storage targets. This ensured that all I/O went straight to the storage targets without us having to worry about the state of the cache during testing, and it forced our setup to send more I/O to storage than it would have with caching. While this setup lowers overall performance, it provides a more demanding test of storage and storage adapter capabilities, especially when processor resources are limited. (A setup sketch also appears at the end of this section.)

HammerDB 2.22, an open-source database workload generator, was used to create a database on each initiator and generate a workload against it. When database creation was complete, the database on each initiator used a total of 260GB. The workload generated was OLTP (Online Transaction Processing), in which multiple users simultaneously manage orders for a product or service. Our test setup ran 70 users against each database, with each user executing 20,000 database transactions: 1.4 million transactions per database, and 5.6 million transactions in total executed against the databases on the storage target.
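As referenced in the Synthetic Vdbench paragraph above, the following is a minimal sketch of how such a parameter file could be generated and run. The device names, run length, random-access setting, and vdbench location are assumptions; the 4K transfer size, the 6 LUNs per initiator, and iorate=max come from the description above.

```python
#!/usr/bin/env python3
"""Sketch only: generate and run a Vdbench parameter file for the raw 4K
block workload described above. Device names, run lengths, and the vdbench
location are assumptions."""
import subprocess

LUNS = [f"/dev/sd{c}" for c in "bcdefg"]  # the 6 iSCSI LUNs (assumed names)

parm = [f"sd=sd{i},lun={lun},openflags=o_direct"
        for i, lun in enumerate(LUNS, start=1)]
parm += [
    "wd=rd4k,sd=sd*,xfersize=4k,rdpct=100,seekpct=100",   # random 4K reads
    "wd=wr4k,sd=sd*,xfersize=4k,rdpct=0,seekpct=100",     # random 4K writes
    "rd=run_rd4k,wd=rd4k,iorate=max,elapsed=300,interval=5",
    "rd=run_wr4k,wd=wr4k,iorate=max,elapsed=300,interval=5",
]
with open("iscsi_4k.parm", "w") as f:
    f.write("\n".join(parm) + "\n")

# Run both workloads back to back (assumes vdbench is in the current directory).
subprocess.run(["./vdbench", "-f", "iscsi_4k.parm"], check=True)
```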

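And as referenced in the Real-World MySQL OLTP paragraph, here is a minimal sketch of the initiator-side storage preparation. The device names are assumptions; the mdadm RAID 0 array over the 6 iSCSI LUNs and the once-per-second cache flush follow the description above.

```python
#!/usr/bin/env python3
"""Sketch only: initiator-side storage setup for the MySQL OLTP test.
Device names are assumptions; the RAID 0 layout and per-second cache
flush follow the description above. Run as root."""
import os
import subprocess
import time

LUNS = [f"/dev/sd{c}" for c in "bcdefg"]  # the 6 iSCSI LUNs (assumed names)

# Software RAID 0 across the 6 LUNs; the MySQL data directory is then
# relocated from the local drive onto the resulting /dev/md0 array.
subprocess.run(
    ["mdadm", "--create", "/dev/md0", "--level=0",
     f"--raid-devices={len(LUNS)}"] + LUNS,
    check=True,
)

# Not recommended in production: flush the filesystem cache every second so
# all database I/O is served by the iSCSI storage rather than host memory.
while True:
    os.sync()
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")  # 3 = drop page cache plus dentries and inodes
    time.sleep(1)
```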
Quantifying iSCSI Offload Performance Improvements

> Number of cores with high utilization - Looking at the number of highly utilized cores in each iSCSI setup is a good way to see where a solution struggles. Often one iSCSI session is assigned to one core, and adapter queues are likewise assigned to a limited number of cores. If a particular core is highly utilized, it becomes a performance bottleneck, because its work often cannot be spread to other cores; keeping individual core utilization lower is therefore beneficial.

> IOPS, throughput, processor utilization, and latency - These are our raw metrics.

> Processor effectiveness - The ratio of IOPS to percent processor utilization: for each 1% of processor utilized, how many IOPS are achieved. This calculation is useful for quantifying the often simultaneous increase in IOPS and decrease in processor utilization seen with iSCSI offload.

> Time - For the real-world tests, we compare the time each setup took to complete a set amount of work. In each MySQL OLTP test, we measure how long each iSCSI setup takes to complete 5.6 million database transactions.

Vdbench Test Results

Target Offload

The Chelsio iSCSI target generally had better synthetic performance than the software iSCSI target. The software target pushed some cores well above 90% with iSCSI processing, while the hardware target provided processor savings, requiring fewer processor cores for iSCSI processing and delivering better performance. For 4K reads on the software target, 15 cores were at or above 90% utilization; for 4K writes, 8 cores were. In the software target, this high utilization of individual processor cores limits performance: if more processing headroom were available on those cores, more iSCSI traffic could be handled. Compare this with the hardware target, where no core ever exceeded 90% utilization for reads or writes. If we consider an active core to be one at 10% utilization or higher, the software target had 31 active cores for 4K reads and 29 for 4K writes, while the hardware target had 24 for both. Fewer of the available cores had to be used with iSCSI target offload. We also saw a general reduction in utilization on each individual core with the iSCSI target offload, a pattern most visible for 4K writes.

We also observed an 18% decrease in total processor utilization for 4K reads and a 49% decrease for 4K writes. The hardware offload achieved over 3.5 times the processor effectiveness for 4K reads and 185% more processor effectiveness for 4K writes. Read IOPS nearly tripled with the offloaded iSCSI target, while write IOPS increased by 43%. We also observed a decrease in latency at the initiator, which correlates with a decrease in the percentage of total processor time spent in I/O wait on the initiators.
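To make the processor-effectiveness ratio concrete, here is a quick normalized check against the 4K-read numbers above (IOPS roughly tripled while total utilization fell by 18%); the values are normalized ratios, not raw measurements.

```python
def processor_effectiveness(iops: float, cpu_util_pct: float) -> float:
    """IOPS delivered per 1% of processor utilized (the ratio defined above)."""
    return iops / cpu_util_pct

# Normalized 4K-read comparison: software target = 1.0 IOPS at 1.0 utilization;
# hardware target ~3x the IOPS at 0.82x the utilization (an 18% decrease).
software = processor_effectiveness(1.0, 1.0)
hardware = processor_effectiveness(3.0, 0.82)
print(round(hardware / software, 2))  # ~3.66, consistent with "over 3.5 times"
```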

Initiator Offload

The Chelsio hardware initiator generally had higher performance than the software initiator at the same processor utilization, showing greater processor effectiveness. For both the hardware and software initiators, in all synthetic test cases, the available initiator processor was essentially fully used, at 97-100%; available initiator processor was clearly the limiting factor in the initiator tests. With the offloaded initiator taking on some of the processing workload, initiator processor was freed up to do other work, which can also be seen in the decreased percentage of total processor time spent in I/O wait with the hardware initiator. At the same processor utilization, there was a significant increase in performance: for 4K reads, IOPS increased by 80%, and for 4K writes, IOPS increased by 43%. Quantified as processor effectiveness ratios, that is a 75% increase for 4K reads and a 43% increase for 4K writes. Because the server can handle more traffic without being overloaded when using the hardware initiator, latency decreased by 45% for 4K reads and 31% for 4K writes.

Real-World MySQL OLTP Test Results

Target Offload

The Chelsio hardware iSCSI target generally had greater performance and lower target processor utilization than the software iSCSI target. With the real-world MySQL workload, we again saw a decrease in individual target processor core utilization: counting the cores with utilization above 20%, there were 24 in the software target and only 9 in the hardware target. Because overall processor utilization is lower in the real-world workload, the decrease in utilization with iSCSI offload makes up a greater percentage of total processor usage, yielding a 60% reduction in processor utilization with the hardware target. This underscores again that processor savings and performance increases will be greater on target systems with less processing power. The difference in processor load is once again due to the iSCSI target offload taking over iSCSI processing and thus reducing total processor load.

We again see an increase in processor effectiveness, with almost 5.5 times the processor effectiveness when the iSCSI offload target is used. Although processor utilization decreased, IOPS increased by 78% and throughput increased by 76% with the hardware target. In a real-world workload, block sizes can vary; the difference between our percentage increases for IOPS and throughput shows that the average block size changed slightly between the two tests. A more efficiently used processor also spends less time in I/O wait, improving latency. Together, these performance improvements let the iSCSI offload target complete 5.6 million database transactions in half the time.

Initiator Offload

The Chelsio hardware initiator generally had higher performance than the software initiator at the same processor utilization, showing greater processor effectiveness. Average processor utilization changed little between the software-initiator and hardware-initiator tests, yet we achieved a 20% increase in IOPS and a 16% increase in throughput at the same initiator server processor utilization. Again, block sizes can vary in a real-world workload, and the difference between our percentage increases for IOPS and throughput shows that the average block size changed slightly between the two tests. Processor time spent in I/O wait decreased slightly, but the majority of the performance improvement comes from the additional iSCSI processing capacity provided by the iSCSI initiator offload. This increase in IOPS at unchanged processor utilization is captured by the 19% improvement in initiator processor effectiveness.

Additional improvements from the offload are a 14% reduction in latency observed at the initiator and a 20% reduction in the time taken to complete 5.6 million database transactions.

Test Environment

Servers
> Initiator Servers (4): 1x Intel Xeon E5-1650 v3 @ 3.50GHz, 6 cores, 12 threads; 64GB RAM; RHEL 7.3 with 4.9.13-chelsio kernel
> Target Server: 4x Intel Xeon E7-4880 v2 @ 2.50GHz, 60 total cores; 416GB RAM; 5x 1.6TB Samsung SM-1715 NVMe SSDs; RHEL 7.3 with 4.9.13-chelsio kernel

Adapters
> Chelsio T6225-LL-CR 25GbE iSCSI offload adapter
> Chelsio T62100-CR 100GbE iSCSI offload adapter

Ethernet Switch
> Mellanox SN2410 25GbE/100GbE switch

Summary and Conclusion

The Chelsio offloaded iSCSI initiator and offloaded iSCSI target available in the T6 adapters improved IOPS, throughput, and latency while increasing processor effectiveness. We recommend using Chelsio offload adapters in both target and initiator servers to get the most out of an iSCSI environment.

> Synthetic tests showed up to 3x the IOPS and throughput, with as little as half the latency.
> Real-world OLTP tests showed up to 60% less processor utilization and up to 5.5 times the processor effectiveness.

Chelsio T6 iSCSI offload adapters should be considered for any iSCSI application where processor utilization is an issue or a performance boost is desired. This includes small-block workloads as well as servers with limited processor resources; real-world OLTP workloads have small block sizes, and SQL queries can add significant processing requirements, both of which can leave the processor limiting performance. Any of these environments and workloads can benefit from Chelsio T6 iSCSI offload adapters.

The most current version of this report is available at http://www./demartek_chelsio_t6_25gbe_100gbe_iscsi_offload_evaluation_2017-11.html on the Demartek website.

Chelsio is a registered trademark of Chelsio Communications. Demartek is a registered trademark of Demartek, LLC. All other trademarks are the property of their respective owners.