FhGFS - Performance at the maximum


http://www.fhgfs.com
January 22, 2013

Contents

1. Introduction
2. Environment
3. Benchmark specifications and results
   3.1. Multi-stream throughput
   3.2. Shared file throughput
   3.3. IOPS
   3.4. Metadata performance
4. Conclusion
A. Used command lines for benchmarks
   A.1. Multi-stream throughput
   A.2. Shared file throughput
   A.3. IOPS
   A.4. Metadata performance

1. Introduction

FhGFS [1] is the parallel file system from the Fraunhofer Competence Center for High Performance Computing [2]. The distributed metadata architecture of FhGFS has been designed to provide the scalability and flexibility required to run today's most demanding HPC applications.

In January 2013, Fraunhofer announced the new major release, named 2012.10. Besides adding interesting new features, such as on-the-fly replication of file contents and metadata, the new release introduces a completely redesigned metadata format, which increases performance even further. This document describes a set of benchmarks performed with FhGFS 2012.10 and presents their results.

[1] http://www.fhgfs.com
[2] http://www.itwm.fraunhofer.de/en/departments/hpc.html

2. Environment

Figure 1: Hardware overview (20 combined metadata and storage servers, connected to the clients via QDR InfiniBand)

The benchmarks were performed on a total of 20 server nodes, which were used as storage and metadata servers at the same time (Fig. 1). Each node was equipped with the following hardware:

- 2x Intel Xeon X5660 (2.8 GHz)
- 48 GB DDR3 RAM
- QDR InfiniBand (Mellanox MT26428), only one port connected
- 4x Intel 510 Series SSD (RAID 0 with mdraid)

The operating system was Scientific Linux 6.3 with kernel 2.6.32-279 from the Scientific Linux distribution. All software, except FhGFS and the benchmarking tools IOR and mdtest, was installed from the Scientific Linux repositories. FhGFS version 2012.10-beta1, publicly available since October 2012, was used. All FhGFS services were configured to use RDMA over InfiniBand.
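RDMA usage is a per-service configuration option in FhGFS/BeeGFS. As a minimal sketch, using the option name from current BeeGFS configuration files (the file name, path and surrounding settings are assumptions here and are not taken from the benchmark setup; FhGFS 2012.10 shipped corresponding fhgfs-*.conf files):

# /etc/beegfs/beegfs-client.conf -- analogous settings exist for the storage,
# metadata and management service configs; file name and path are assumptions
connUseRDMA = true    # use native InfiniBand RDMA instead of TCP for communication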

3. Benchmark specifications and results

According to the Parallel File System survey report [3] by the DICE Program, the most important file system metrics for data center representatives are multi-stream throughput, metadata performance, and large-block data performance. Although most of the metrics described by the report target large files, the Johannes Gutenberg University in Mainz gathered statistics that show the importance of small file I/O. They analyzed typical file systems in data centers and found that 90% of the files on actual file systems are only 4 KB or less in size [4]. So even if the majority of file system administrators consider benchmarks related to large file I/O to be the most important, the university's study clearly shows the significance of handling small files in a real-world environment. As a result, we decided to specify a set of benchmarks covering large-block streaming throughput as well as small file I/O and metadata performance.

All of the following benchmarks were performed using the tools IOR [5] and mdtest [6]. Each measurement was run five times and the result was calculated as the mean of the single runs. The command lines used to run the benchmarks can be found in Appendix A.

[3] http://www.avetec.org/applied-computing/dice/projects/pfs/docs/pfs_survey_report_mar2011.pdf
[4] A Study on Data Deduplication in HPC Storage Systems; Dirk Meister et al.; Johannes Gutenberg Universität; SC12
[5] http://sourceforge.net/projects/ior-sio/
[6] http://sourceforge.net/projects/mdtest/

3.1. Multi-stream throughput

In this benchmark the total throughput of sequential read and write requests with multiple streams was measured. The test was performed in two different ways. In the first measurement, a constant number of 160 clients was used and the number of storage servers was scaled from two to 20. The second test was performed with all storage servers, increasing the number of clients up to 768.

# Servers        2     4     6      8     10     12     14     16     18     20
Write (MB/s)  2657  5335  7892  10561  12782  15023  17345  20206  22823  25247
Read (MB/s)   2549  5162  7492  10112  11904  14543  17012  19123  21889  24789

Figure 2: Sequential throughput, storage server scaling; 160 clients, 1-20 storage servers, 512k transfer size, file size of 150 GB per server

Each file in this run was striped across four FhGFS storage servers with a chunk size of 512 KB. With two storage servers, the system achieved a throughput of 2 657 MB/s when writing and 2 549 MB/s when reading. As more storage servers were added to the system, performance scaled nearly linearly (Fig. 2). With all 20 servers, a sustained write throughput of 25 247 MB/s and a sustained read throughput of 24 789 MB/s was measured.

The local RAID of a single node delivered a sustained write performance of 1 332 MB/s and a sustained read performance of 1 317 MB/s. From this, one can calculate the maximum throughput that 20 servers are theoretically able to achieve; FhGFS delivered about 94% of this value (writing: 94.7%, reading: 94.1%). A short calculation illustrating this is given at the end of this section.

With a constant number of 20 storage servers and an increasing number of client processes accessing the file system, the sustained performance remained stable as the number of clients grew, up to 768 (Fig. 3). The minimum number of clients required to reach maximum performance was in the range between 96 and 192. Local benchmarks showed that at least six concurrent processes on a single machine are needed to achieve the maximum local throughput. As this benchmark used the combined resources of 20 machines, it is plausible that the number of processes needed to reach maximum performance is about 20 times higher than on a single machine.

# Clients        6     12     24     48     96    192    384    768
Write (MB/s)  6123  10193  16337  20631  24796  25312  25190  25409
Read (MB/s)   6946  11524  18417  24093  23862  26621  25701  26649

Figure 3: Sequential throughput, client scaling; 20 storage servers, 6-768 clients, 512k transfer size, file size of 150 GB per server
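As a quick illustration of the roughly 94% efficiency quoted above, the theoretical aggregate throughput of 20 independent nodes can be computed from the local RAID results. This is only a back-of-the-envelope check, not part of the original benchmark scripts:

awk 'BEGIN {
    n = 20                          # number of storage servers
    w_local = 1332; r_local = 1317  # local RAID throughput per node in MB/s
    printf "theoretical aggregate write: %d MB/s\n", n * w_local   # 26640 MB/s
    printf "theoretical aggregate read:  %d MB/s\n", n * r_local   # 26340 MB/s
}'

The measured 25 247 MB/s (write) and 24 789 MB/s (read) are about 94% of these aggregates, which is where the efficiency figures above come from.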

3.2. Shared file throughput

# Servers        2     4     6     8    10     12     14     16     18     20
Write (MB/s)  1919  3326  4766  5766  6432   7239   8512   9591  11527  13278
Read (MB/s)   2510  4853  6552  8366  9384  10546  12709  14121  16012  17786

Figure 4: Shared file throughput, storage server scaling; 192 clients, 1-20 storage servers, 600k transfer size, file size of 150 GB per server

In this scenario all clients accessed one single shared file and performed read and write operations on it. Most applications access files in an unaligned manner, as the record size is determined by the type of data and the algorithms used. Therefore we chose 600 KB as the transfer size in IOR. This is, of course, not optimal for the file system, but the unaligned access pattern simulates real-world workloads.

Once again, we performed two different measurements. The first used a constant number of 192 clients and a growing number of storage servers. The second used all storage servers again and scaled up to 768 clients sharing one single file. As we only wrote one file, we needed to set the number of storage targets for this file to 20 to distribute it evenly over all servers (a sketch of how such a striping setting can be applied is shown below).
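In FhGFS/BeeGFS, striping settings are applied to a directory and inherited by newly created files. The following is only an illustrative sketch using the beegfs-ctl syntax of current BeeGFS releases (the successor of FhGFS); the corresponding tool in the FhGFS 2012.10 release was fhgfs-ctl, and the chunk size shown is an example value, not taken from the report:

# set the stripe pattern of the IOR working directory so that new files are
# spread across 20 storage targets (chunk size here is only an example value)
beegfs-ctl --setpattern --numtargets=20 --chunksize=512k /mnt/fhgfs
beegfs-ctl --getentryinfo /mnt/fhgfs    # verify the resulting stripe pattern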

The benchmarks showed good scalability with a growing number of storage servers (Fig. 4). It is notable that the shared file write performance was only about 75% of the read performance. We observed the same behaviour when using the SSDs directly on a local node (1 403 MB/s read and 1 108 MB/s write), so this appears to be a general limitation of the systems used.

# Clients       12    24     48     96    192    384    768
Write (MB/s)  7445  7749  10122  13122  13288  13212  13288

Figure 5: Shared file throughput, client scaling; 20 storage servers, 6-768 clients, 600k transfer size, file size of 150 GB per server

Shared file throughput on 20 servers, with the number of clients growing up to 768, showed that all client processes can write to the same file without any notable performance loss (Fig. 5). Once again, the maximum performance was reached at a point slightly above 96 client processes, which corresponds with the single-node benchmarks: on a local node, six concurrent processes were needed to achieve maximum performance.

3.3. IOPS

In this benchmark the I/O operations per second were measured. This value essentially reflects the mean access time of the file system and is very important for dealing with small files and for algorithms that access files in a very random way. To measure this metric, 4k random writes were used with a total number of 160 clients and a growing number of storage servers.

These measurements showed practically linear scaling. While two storage servers were able to perform 109 992 operations per second, 20 storage servers achieved 1 126 963 operations per second (Fig. 6).

# Servers       2       4       6       8      10      12      14      16       18       20
IOPS       109992  230230  346743  462212  580031  691146  802456  936410  1033324  1126963

Figure 6: IOPS; 1-20 storage servers, 160 clients
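As a rough plausibility check of the linear scaling, the per-server rate can be derived directly from the table above (a back-of-the-envelope calculation, not part of the original benchmarks):

awk 'BEGIN {
    split("2 4 6 8 10 12 14 16 18 20", n, " ")
    split("109992 230230 346743 462212 580031 691146 802456 936410 1033324 1126963", iops, " ")
    for (i = 1; i <= 10; i++)
        printf "%2d servers: about %d IOPS per server\n", n[i], iops[i] / n[i]
}'

The per-server rate stays in a narrow band of roughly 55 000 to 58 500 operations per second across all configurations, which is what the linear curve in Fig. 6 expresses.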

3.4. Metadata performance

# Servers        1      2       4       6       8      10      12      14      16      18      20
Creates/sec  34693  68094  115423  194812  270832  331823  380321  423343  467542  506143  539724

Figure 7: File creation; 1-20 metadata servers, up to 640 client processes (32 × #MDS)

The number of file creates per second and the number of stat operations per second were determined with an increasing number of metadata servers. The number of clients was always proportional to the number of servers and was calculated by multiplying the number of servers by 32. This is important because benchmarks on a single node showed that the local RAID arrays only perform well with between 20 and 40 accessing processes, so the number of processes had to be adjusted according to the number of servers used.

Both metadata benchmarks showed a rise in performance when adding new metadata servers. While the file creation rate on a single metadata server was in the range of 35 000 creates per second, a file system with 20 metadata servers was able to create significantly more than 500 000 files per second (Fig. 7). Stat call performance rose from about 93 000 operations per second on one server to 1 381 339 operations per second on 20 servers (Fig. 8).

# Servers       1       2       4       6       8      10      12       14       16       18       20
Stats/sec   93007  187844  320134  518401  623682  748334  871234  1043231  1192625  1288153  1381339

Figure 8: Stat calls; 1-20 metadata servers, up to 640 client processes (32 × #MDS)

4. Conclusion

FhGFS was designed to be the primary choice for scratch file systems in the HPC market. Our results demonstrate that FhGFS is a highly scalable parallel file system: in most scenarios, throughput is not limited by FhGFS itself, but only by the underlying hardware. Although the metadata benchmarks did not show perfectly linear scalability, the results show that adding metadata servers to the system leads to a significant increase in metadata performance. It is therefore easily possible to load-balance metadata access in busy environments with heavy access from many different users.

A. Used command lines for benchmarks

A.1. Multi-stream throughput

mpirun -hostfile /tmp/nodefile -np ${NUM_PROCS} \
    /usr/bin/ior -wr -C -i5 -t512k -b${BLOCK_SIZE} \
    -F -o /mnt/fhgfs/test.ior

with ${NUM_PROCS} being 160 for the first test and increasing values from six to 768 for the second test. The parameter ${BLOCK_SIZE} depends on the number of client processes and servers: every run writes and reads a total of 150 GB multiplied by the number of active servers, so the per-process block size is this total divided by the number of client processes (a short sketch of this calculation follows after A.3).

A.2. Shared file throughput

mpirun -hostfile /tmp/nodefile -np ${NUM_PROCS} \
    /usr/bin/ior -wr -C -i5 -t600k -b${BLOCK_SIZE} \
    -o /mnt/fhgfs/test.ior

with ${NUM_PROCS} being 192 for the first test and increasing values from six to 768 for the second test. The parameter ${BLOCK_SIZE} is determined as in A.1: a total of 150 GB multiplied by the number of active servers, divided evenly among the client processes.

A.3. IOPS

mpirun -hostfile /tmp/nodefile -np 160 \
    /usr/bin/ior -w -C -i5 -t4k -b${BLOCK_SIZE} \
    -F -z -o /mnt/fhgfs/test.ior

with ${BLOCK_SIZE} depending on the number of servers. Every run writes a total of 150 GB multiplied by the number of active servers, so the per-process block size is (150 GB × #servers) / 160.
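As an illustration of the ${BLOCK_SIZE} calculation described above, a small helper along these lines could be used (this script is not part of the original benchmark runs; variable names follow the appendix):

NUM_SERVERS=20      # active storage servers in the run
NUM_PROCS=160       # client processes started via mpirun
# 150 GB of data per active server, split evenly across all client processes:
BLOCK_SIZE=$(awk -v s="${NUM_SERVERS}" -v p="${NUM_PROCS}" \
    'BEGIN { printf "%dm", 150 * 1024 * s / p }')      # -> "19200m" (18.75 GB per process)
mpirun -hostfile /tmp/nodefile -np ${NUM_PROCS} \
    /usr/bin/ior -wr -C -i5 -t512k -b${BLOCK_SIZE} \
    -F -o /mnt/fhgfs/test.ior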

A.4. Metadata performance

For the create benchmark:

mpirun -hostfile /tmp/nodefile -np ${NUM_PROCS} \
    mdtest -C -d /mnt/fhgfs/mdtest -i 5 -I ${FILES_PER_DIR} -z 2 \
    -b 8 -L -u -F

For the stat benchmark:

mpirun -hostfile /tmp/nodefile -np ${NUM_PROCS} \
    mdtest -T -d /mnt/fhgfs/mdtest -i 5 -I ${FILES_PER_DIR} -z 2 \
    -b 8 -L -u -F

with ${NUM_PROCS} being #servers × 32. The total number of files should always be higher than 1 000 000, so ${FILES_PER_DIR} is calculated as ${FILES_PER_DIR} = 1000000 / 64 / ${NUM_PROCS}.
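The following sketch shows one way to derive ${NUM_PROCS} and ${FILES_PER_DIR} for a given number of metadata servers (a hypothetical helper, not taken from the report; it rounds up so that the total file count stays above 1 000 000):

NUM_SERVERS=20
NUM_PROCS=$(( NUM_SERVERS * 32 ))          # 32 client processes per metadata server
# "-z 2 -b 8 -L" creates 8^2 = 64 leaf directories per process, so the total
# file count is NUM_PROCS * 64 * FILES_PER_DIR; round up to stay above 1 000 000:
FILES_PER_DIR=$(( (1000000 / 64 + NUM_PROCS - 1) / NUM_PROCS ))
echo "running mdtest with ${NUM_PROCS} processes and ${FILES_PER_DIR} files per directory"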