MELLANOX MTD2000 NFS-RDMA SDK PERFORMANCE TEST REPORT

The document describes performance testing that was done on the Mellanox OFED 1.2 GA NFS-RDMA distribution.

Mellanox Technologies, July 2007

Test Cluster

[Figure: NFS-RDMA test cluster setup]

The figure above illustrates the setup used for testing the NFS-RDMA server. The switch is a Flextronics 24-port DDR InfiniBand switch. The NFS Filer consists of the Mellanox MTD2000 head-end and a Mellanox MTD2000E JBOD expander, providing a total of 32 15K-RPM 36GB SAS drives. The drives were configured as two RAID0 volumes: a two-disk volume for the operating system and a 30-disk volume for the NFS export. SLES 10 was installed on the server and both volumes were formatted with XFS. NFS-RDMA with OFED 1.2 GA was then installed and configured.

The clients consisted of:

- Two dual-core, dual-processor 64-bit 3.46 GHz Xeon machines with 4GB of memory,
- One dual-core, single-processor 32-bit 3.40 GHz Xeon machine with 1GB of memory, and
- One single-core, dual-processor 64-bit 1.8 GHz AMD Athlon machine with 2GB of memory.

All clients contained a Mellanox dual-ported IB DDR adapter. Two of the clients were installed with RHEL 5, and two were installed with SLES 10. All clients ran the Mellanox NFS-RDMA distribution based on OFED 1.2 GA.

Test Description

All tests were performed using version 3.283 of the iozone filesystem performance benchmarking tool (available for download at http://www.iozone.org). The tool was installed and built individually on each machine.

Cluster Testing

In order to test the scalability characteristics of the server, the cluster testing mode of the iozone tool was used. This mode allows multiple clients to participate in a test: a master node communicates with a subordinate iozone agent running on each client node, and the agents coordinate with the master to ensure that all tests start concurrently and that the test duration reflects the time for all participating nodes to complete. A separate machine on a separate network was used as the master node to ensure that cluster control traffic did not perturb NFS traffic on the IB network.
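As an illustration of this mode, iozone is driven from the master with a client-list file naming each participating client, its working directory on the NFS mount, and the path to its iozone binary. The hostnames, mount point, and paths below are illustrative assumptions, not the exact configuration used for this report.

    # Create a hypothetical cluster-mode client list: one line per client giving
    # <hostname>  <working directory on the NFS mount>  <path to iozone binary>.
    printf '%s\n' \
        'client1  /mnt/nfs-rdma/iozone  /usr/local/bin/iozone' \
        'client2  /mnt/nfs-rdma/iozone  /usr/local/bin/iozone' \
        'client3  /mnt/nfs-rdma/iozone  /usr/local/bin/iozone' \
        'client4  /mnt/nfs-rdma/iozone  /usr/local/bin/iozone' > clients.txt

    # Launch a 4-client sequential write/read throughput test from the master node.
    # -+m : client-list file (cluster mode)    -t 4 : four participating clients
    # -i 0 -i 1 : sequential write then read   -r/-s : record size and file size
    # iozone starts the remote agents over rsh by default; RSH=ssh selects ssh.
    RSH=ssh iozone -+m clients.txt -t 4 -i 0 -i 1 -r 128k -s 1g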

Client Cache

The NFS client uses the Linux buffer cache to improve performance. When an application performs a read or write to a file, the I/O is satisfied through this buffer cache. When performing a write, the data goes to the buffer cache and is asynchronously flushed to the backing store as memory pressure builds, or in response to a synchronization request (e.g., close). Similarly, when an application reads, the NFS client checks whether the data is already in the buffer cache. If the data is present, the read is satisfied from the cache; if it is not, a read is issued to the backing store and the data is brought into the buffer cache. For our purposes, in both cases the backing store is the NFS Filer. In order to evaluate the performance of the NFS Filer, therefore, the operation of the client-side buffer cache must be considered.

Write Testing

When performing write testing, it is important that the time it takes the NFS client to flush dirty buffer-cache pages to the NFS Filer is included in the performance results. To do this, the close option (-c) was specified to iozone. This instructs iozone to include the time it takes to close the file in the performance calculation. Since close will not return until all pages have been flushed to the NFS Filer, this provides an accurate assessment of NFS Filer performance. Normally, iozone cleans up its temporary files after completing a test; therefore, the no-unlink (-w) option was specified to keep the generated file for the read testing described below.

Read Testing

When performing read testing, it is important to ensure that the data is not already in the client's buffer cache because, as discussed above, if the page is already present then the NFS Filer is not involved in the operation. Obviously, if we are performing a read test then the file must already exist, and if it exists its data may well be in the client buffer cache. To ensure that it is not, the filesystem is unmounted and remounted between tests, effectively invalidating all buffer-cache pages for the filesystem. In addition, to ensure that unrelated dirty pages do not inadvertently impact the result, a filesystem sync operation is performed between read tests.
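Taken together, a single write-then-read measurement follows the sequence sketched below. The server name, export path, and mount options are assumptions for illustration; in particular, the proto=rdma,port=20049 mount syntax is that of later in-kernel NFS/RDMA clients, and the exact options for the OFED 1.2 GA release may differ.

    MNT=/mnt/nfs-rdma                 # assumed client-side mount point
    SRV=filer:/export                 # assumed NFS-RDMA export

    # Write test: -c includes the close() time, which forces all dirty pages
    # out to the filer; -w keeps the file so the read test can reuse it.
    iozone -i 0 -c -w -r 128k -s 1g -f $MNT/iozone.tmp

    # Invalidate the client buffer cache by remounting, and sync so that
    # unrelated dirty pages do not perturb the read measurement.
    umount $MNT
    mount -t nfs -o proto=rdma,port=20049 $SRV $MNT
    sync

    # Read test against the file written above (it must still exist and
    # match the -s size, hence the earlier -w).
    iozone -i 1 -w -r 128k -s 1g -f $MNT/iozone.tmp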

Server Cache

The NFS Filer uses the local VFS as its backing store for exported volumes. For our tests, the local filesystem is XFS, backed by a 30-disk striped RAID0 volume. The goal of the testing, however, is to evaluate IB and NFS, not the XFS filesystem. For this reason, it is desirable that file data be served from the buffer cache whenever possible. For read processing, this is accomplished by syncing between tests so that the newly generated read data has the maximum amount of memory available and is not polluted with data from earlier tests or unrelated data on the server. This holds for write testing as well, so that any data flushed to disk by the buffer cache belongs to the current test rather than being left over from an earlier test or an unrelated process.

Test Description

The iozone performance tool was used to test NFS sequential read and write performance across a range of record and file sizes. Two scripts were written: bigfile_master.sh and recsize_master.sh. The bigfile_master.sh script was used to test performance across file sizes from 64M to 64G; for these tests, a constant record size of 128K was used. The recsize_master.sh script was used to test performance across record sizes from 4K to 512K; for these tests, a constant file size of 1GB was used. The marked cells in the table below indicate the record/file size combinations that were specifically tested.

    File Size \ Record Size    4K    8K    16K   32K   64K   128K  256K  512K
    64M                        .     .     .     .     .     x     .     .
    128M                       .     .     .     .     .     x     .     .
    256M                       .     .     .     .     .     x     .     .
    512M                       .     .     .     .     .     x     .     .
    1G                         x     x     x     x     x     x     x     x
    2G                         .     .     .     .     .     x     .     .
    4G                         .     .     .     .     .     x     .     .
    8G                         .     .     .     .     .     x     .     .
    16G                        .     .     .     .     .     x     .     .
    32G                        .     .     .     .     .     x     .     .
    64G                        .     .     .     .     .     x     .     .

All tests were run with 1, 2, 3, and 4 participating nodes to evaluate how performance scaled as node count increased. Note that for any given file size, the amount of data being read or written, from the perspective of the NFS Filer, is cumulative: if 4 nodes each write a 1GB file, the server sees 4GB of data. This observation is important when interpreting the performance results because the tipping point at which the data overflows the server-side buffer cache occurs earlier (relative to client file size) for larger node counts. See the script files bigfile_master.sh and recsize_master.sh for detail on the exact iozone command syntax used.
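Although the driver scripts themselves are not reproduced here, a minimal sketch of what a bigfile-style driver could look like is shown below; the client-list file name, output file naming, and the use of iozone's cluster mode are illustrative assumptions.

    #!/bin/sh
    # Hypothetical reconstruction of a bigfile-style driver: fixed 128K record
    # size, file sizes from 64M to 64G, repeated for 1 to 4 participating clients.
    CLIENTS=clients.txt            # assumed iozone cluster-mode client list
    RECORD=128k

    for NODES in 1 2 3 4; do
        for FSIZE in 64m 128m 256m 512m 1g 2g 4g 8g 16g 32g 64g; do
            # -c: include close() time    -w: keep files for the read pass
            iozone -+m $CLIENTS -t $NODES -i 0 -i 1 -c -w \
                   -r $RECORD -s $FSIZE >> bigfile_${NODES}nodes_${FSIZE}.out
        done
    done

A recsize-style driver would be the analogous loop over record sizes from 4K to 512K at a fixed 1GB file size.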

Results

Read Performance

[Chart: Read Throughput by File Size; throughput (0 to 1,400,000) vs. file size (64 to 65,536 MB) for 1, 2, 3, and 4 nodes]

NFS Filer performance is very good when serving data from cache: up to the point where the cumulative amount of data exceeds the NFS Filer's buffer cache, the server is able to maintain wire rate across four nodes. Note that with larger node counts throughput falls off sharply at a client file size of about 2GB; with 4 nodes, 4 × 2GB = 8GB of cumulative data, which is the total amount of memory in the server.

[Chart: Read Throughput by Record Size; throughput (0 to 1,400,000) vs. record size (4 to 512 KB) for 1, 2, 3, and 4 nodes]

This result strongly implies that record size is not a significant performance factor.

Write Performance

[Chart: Write Throughput by File Size; throughput (0 to 700,000) vs. file size (64 to 65,536 MB) for 1, 2, 3, and 4 nodes]

Write performance rises with node count until the aggregate data size reaches roughly 40% of the server's memory capacity. Beyond this point, performance trends down toward the backing store's random I/O limit.

[Chart: Write Throughput by Record Size; throughput (0 to 400,000) vs. record size (4 to 512 KB) for 1, 2, 3, and 4 nodes]

As with reads, record size does not appear to be a significant factor in write performance.