Feedback on BeeGFS: A Parallel File System for High Performance Computing

Feedback on BeeGFS: A Parallel File System for High Performance Computing
Philippe Dos Santos and Georges Raseev
FR 2764 Fédération de Recherche LUmière MATière
December 13, 2016

1. FR LUMAT shared HPC cluster
2. BeeGFS: a file system for HPC clusters
3. All-in-one BeeGFS node benchmark
4. All-in-one BeeGFS node in the end
5. In conclusion

LUmière MATière Research Federation (FR LUMAT): a shared HPC cluster for four labs
- Four laboratories in Université Paris-Saclay (laser-electron interaction, molecular physics, surfaces and nanophysics, chemistry-biology interface)
- Several shared platforms: experiments, an HPC cluster, computing workstations spread across the labs, national computing centers
- Shared cluster ("Mésocentre") in operation since 2008: 1 Tflops
- Two branches in 2016: 15 Tflops

Grappe Massivement Parallèle de Calcul Scientifique (GMPCS): architecture and service continuity
- Two branches (and possibly more): gmpcs-210 and gmpcs-206 (datacenters at ISMO, VirtualData and DI)
- Continuously available (maintenance, power cuts, ...)
- Single point of access (virtual IP + Keepalived)
- Job distribution (Slurm with Multi-Cluster Operation)
BeeGFS storage
- 1x common BeeGFS storage (/home, 6 TB, user files and programs)
- 1x BeeGFS in gmpcs-210 (/scratch, 40 TB, large temporary files)

To make a long story short: how did we get to this architecture?
Educated guess: hardware is the limiting factor
- The standard cluster stalled from time to time, always on the master node, and only when there was too much I/O on /home (an NFS share exported to the compute nodes)
- Disk I/O related? Read/write speed: SATA HDD (hard disk drive) < 200 MB/s, SSD (solid state drive) < 550 MB/s
- Network related? Latency and bandwidth: is 1 Gb/s Ethernet the limit? Switch to 40 Gb/s InfiniBand?

To make a long story not so short: how did we get to this architecture?
In real life: the file system is the limiting factor
- Network File System (NFS): slows down when multiple clients read/write simultaneously, since all data sits on a single data server
- Parallel file system: metadata server(s) handle data placement, data server(s) store the data, so many clients can issue I/O in parallel
- Improved throughput with the same hardware: moving from NFS to BeeGFS took us from 3.2 Gb/s to 9.6 Gb/s (x3!)

GMPCS HPC cluster: why BeeGFS storage? I/O on the shared file system is the bottleneck
NFS known limitations
- NFS (IP) over 1 Gb/s Ethernet, 11/2008: 800 Mb/s max throughput and whole-cluster slowdowns; workaround: stage files on the compute nodes for I/O / MPI?
- NFS (IP over InfiniBand, IPoIB) on a QDR interconnect (40 Gb/s), 08/2013: 3.2 Gb/s max throughput (x4), cluster OK, but jobs are still slow under heavy simultaneous I/O
Which parallel file system? Production ready, good use of the network bandwidth, easy to manage, good price/quality ratio (low entry price)
BeeGFS parallel file system
- 6 TB to start (expandable to 40 TB)
- Standard cluster: 9.6 Gb/s over InfiniBand (40 Gb/s RDMA)
- Two branches: 2.4 Gb/s over optical fiber (10 Gb/s IP)

BeeGFS: a file system for HPC clusters. Production ready
Best-known parallel file systems for HPC and the Top 500
- Lustre, open source (~50% of the Top 500: Titan #2, Sequoia #3, K Computer #4, CEA/TGCC-GENCI #44, ...)
- GPFS, IBM license (~50% of the Top 500: Mira #5, CNRS/IDRIS-GENCI #60, ...)
- PanFS, Panasas license (Cielo #57)
Top 500 list (top500.org) of 26/07/2015, the 500 fastest supercomputers worldwide: BeeGFS in use on 6 of them, mainly in German-speaking countries
Same concepts everywhere: I/O in parallel on multiple disks, metadata server(s), data servers

BeeGFS in a few words
BeeGFS has two main roles: organizing the file namespace, and storing file attributes and contents
- Metadata server(s) organize: data position on disks, file size, owners and permissions, ...
- Storage server(s) hold what users actually care about, the file contents; each file is cut into pieces (chunks or stripes) spread over the storage targets, as the sketch below illustrates
Who is in charge of data integrity? Each metadata and data server manages several drives, and RAID ensures data integrity in case of a drive failure
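
A minimal illustrative sketch of the striping idea described above, in Python: it maps byte offsets of a file onto storage targets with a simple round-robin model. The target names, chunk size and the function itself are assumptions for illustration, not BeeGFS's actual placement logic.

    # Minimal sketch (not BeeGFS code): how a striped file maps onto storage
    # targets under a simple round-robin model. Target names and chunk size
    # are made up for illustration.

    CHUNK_SIZE = 512 * 1024                       # 512 KiB per chunk (assumed)
    TARGETS = ["OST1", "OST2", "OST3", "OST4"]    # hypothetical storage targets

    def chunk_location(offset):
        """Return (target, chunk index on that target) holding the byte at `offset`."""
        chunk_index = offset // CHUNK_SIZE            # which chunk of the file
        target = TARGETS[chunk_index % len(TARGETS)]  # round-robin over targets
        return target, chunk_index // len(TARGETS)

    if __name__ == "__main__":
        for off in (0, CHUNK_SIZE, 4 * CHUNK_SIZE, 10 * CHUNK_SIZE + 17):
            tgt, idx = chunk_location(off)
            print(f"byte {off:>10} -> {tgt}, local chunk {idx}")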

Architecture overview: four main services for metadata, data and client management
GNU/Linux only; runs on top of an existing file system (ext4, xfs, zfs, ...)
- Management Server (MS): knows every service (metadata, data, client); not critical
- Metadata Server (MDS): metadata management; one MDS has exactly one Metadata Target (MDT)
- Object Storage Server (OSS): data management; one OSS may have many Object Storage Targets (OSTs)
- Client: compiles and loads the beegfs kernel module

BeeGFS layers: the four main services for metadata, data and client management (diagram courtesy of BeeGFS)

Good use of network bandwidth: scalability
Metadata Target (MDT) performance
- One directory per MDS (the MDS is chosen randomly, so the load is shared across all MDSs)
- Dedicated SSDs in RAID1 / RAID10 (mirroring, or striping plus mirroring; metadata access is random, so avoid RAID5 and RAID6)
- ext4 file system (efficient with small files); large inodes store the metadata (512-byte ext4 inode extended attributes)
Object Storage Server (OSS)
- Striping is controlled by numtargets and chunksize (how many OSTs, and the chunk size per file); e.g. 1x OST of 40 TB at 500 MB/s becomes 4x OSTs with 160 TB at 2 GB/s: more OSTs mean more storage space and more throughput (see the sketch below)
- Typically 6 to 12 HDDs in RAID6 (RAID6 ensures data integrity), on top of an existing file system (xfs, ext4, zfs, ...)
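
The scaling claim above (1x OST at 40 TB / 500 MB/s versus 4x OSTs at 160 TB / 2 GB/s) is simple arithmetic; the short Python sketch below reproduces it, assuming throughput scales linearly with the number of targets and nothing else saturates.

    # Back-of-the-envelope scaling of capacity and streaming throughput with
    # the number of storage targets (OSTs), using the per-OST figures from the
    # slide (40 TB, 500 MB/s). Linear scaling is an assumption.

    OST_CAPACITY_TB = 40
    OST_THROUGHPUT_MBS = 500

    def aggregate(num_targets):
        """Return (total capacity in TB, total streaming throughput in GB/s)."""
        capacity = num_targets * OST_CAPACITY_TB
        throughput = num_targets * OST_THROUGHPUT_MBS / 1000.0
        return capacity, throughput

    if __name__ == "__main__":
        for n in (1, 2, 4, 8):
            cap, thr = aggregate(n)
            print(f"{n} OST(s): {cap} TB, ~{thr:.1f} GB/s")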

All-in-one BeeGFS node benchmark: the BeeGFS node
All-in-one BeeGFS node (mgmtd, meta and storage services): 1x application server + 1x JBOD
- CPU: 16 cores at 2.4 GHz (2x Intel Xeon E5-2630 v3)
- RAM: 64 GB (8x 8 GB DDR4 at 2133 MHz)
- Metadata: 4x 200 GB SSD (1x MDT, RAID10 => 400 GB; rule of thumb = 0.5% of the storage space; on a dedicated RAID controller), see the sizing sketch below
- Data: 12x 4 TB HDD (1x OST, RAID6 => 40 TB; on a dedicated RAID controller)
- InfiniBand QDR (40 Gbit/s) (Intel True Scale HCA, 1x QSFP port)
- 1 Gb/s Ethernet NIC (Intel I350 on motherboard)
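
The "0.5% of the storage space" metadata sizing rule quoted above can be checked with a one-line computation; the helper below is an illustrative sketch, with only the 40 TB / 400 GB figures taken from the slide.

    # Sketch of the "metadata capacity ~ 0.5% of data capacity" rule of thumb.
    # Only the 40 TB OST and the 4x 200 GB SSD RAID10 (400 GB usable) figures
    # come from the slide; the helper itself is illustrative.

    METADATA_FRACTION = 0.005   # 0.5% of usable data capacity

    def metadata_capacity_gb(data_capacity_tb):
        """Suggested usable MDT capacity (GB) for a given OST capacity (TB)."""
        return data_capacity_tb * 1000 * METADATA_FRACTION

    if __name__ == "__main__":
        for data_tb in (40, 160):
            print(f"{data_tb} TB of data -> ~{metadata_capacity_gb(data_tb):.0f} GB of metadata")
        # 40 TB -> ~200 GB; the node's RAID10 of 4x 200 GB SSDs (400 GB usable)
        # sits comfortably above this rule of thumb.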

All-in-one BeeGFS node benchmark: compute nodes
Benchmark using 8 compute nodes (2x twin-square chassis, 8 nodes)
- CPU: 20 cores at 2.5 GHz (2x Intel Xeon E5-2670 v2)
- RAM: 128 GB (8x 16 GB DDR3 at 1866 MHz)
- HDD: 2 TB (1x per node, 7200 RPM, SATA-3, 3.5")
- InfiniBand QDR (40 Gbit/s) (Intel/QLogic QLE7340, 1x QSFP port)
- 1 Gb/s Ethernet NIC (Intel I350 on motherboard)

All-in-one BeeGFS node benchmark: metadata performance
How to measure metadata performance? With the open source mdtest tool (http://sourceforge.net/projects/mdtest/)
- Performs open/stat/close operations on files and directories (MPI coordinated), with a progressive increase of the I/O load
- Creates a directory tree with files and reports the number of IOPS (a possible invocation is sketched below)
Reference: Metadata Performance Evaluation of BeeGFS (http://www.beegfs.com/docs/metadata_performance_evaluation_of_beegfs_bythinkparq.pdf)
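
A possible way to drive the "progressive increase of IOs" with mdtest is sketched below in Python. The mpirun/mdtest flags, the test directory and the process counts are assumptions; check them against the mdtest version installed on your cluster.

    # Minimal sketch: run mdtest over a growing number of MPI processes so the
    # metadata load on the BeeGFS node increases step by step. Flags and paths
    # are assumptions to be checked against your mdtest build.
    import subprocess

    MDTEST = "mdtest"                      # assumed to be in $PATH
    TEST_DIR = "/beegfs/scratch/mdtest"    # hypothetical BeeGFS mount point
    FILES_PER_PROC = 1000                  # items created per MPI process
    ITERATIONS = 3                         # repeat and average

    for nprocs in (1, 2, 4, 8, 16, 32):    # ramp up the number of clients
        cmd = [
            "mpirun", "-np", str(nprocs),
            MDTEST,
            "-n", str(FILES_PER_PROC),     # files/directories per process
            "-i", str(ITERATIONS),         # iterations
            "-u",                          # unique working directory per task
            "-d", TEST_DIR,                # directory on the parallel file system
        ]
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)    # mdtest prints create/stat/remove IOPS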

All-in-one BeeGFS node metadata performance: results (chart). Reference: Metadata Performance Evaluation of BeeGFS (http://www.beegfs.com/docs/metadata_performance_evaluation_of_beegfs_bythinkparq.pdf)

All-in-one BeeGFS node benchmark: data performance
How to measure data performance? With the open source IOR tool (http://sourceforge.net/projects/ior-sio/)
- This parallel program performs writes and reads (MPI coordinated), with a progressive increase of the I/O load
- Measures parallel file system I/O performance and reports throughput at both the POSIX and MPI-IO level (a possible invocation is sketched below)
Reference: Picking the right number of targets per storage server for BeeGFS (http://www.beegfs.com/docs/picking_the_right_number_of_targets_per_server_for_beegfs_by_thinkparq.pdf)
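
Similarly, a hedged sketch of an IOR run with a progressively increasing number of MPI processes is shown below. The flags (POSIX API, file per process, 1 GiB block, 1 MiB transfers), the paths and the process counts are assumptions to adapt to your IOR build.

    # Minimal sketch: IOR write + read phases over a growing number of MPI
    # processes. Flags and paths are assumptions, not values from the talk.
    import subprocess

    IOR = "ior"                                # assumed to be in $PATH
    TEST_FILE = "/beegfs/scratch/ior/testfile" # hypothetical BeeGFS mount point

    for nprocs in (1, 2, 4, 8, 16):            # progressive increase of the load
        cmd = [
            "mpirun", "-np", str(nprocs),
            IOR,
            "-a", "POSIX",    # POSIX I/O API (MPI-IO is the other reported level)
            "-w", "-r",       # write phase, then read phase
            "-F",             # one file per process
            "-b", "1g",       # data written per process
            "-t", "1m",       # transfer (request) size
            "-e",             # fsync after write so data really reaches the servers
            "-o", TEST_FILE,
        ]
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)        # IOR prints write/read bandwidth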

All-in-one BeeGFS node benchmark: write performance (chart)
Tuning is the key (formatting and mounting options, partition alignment, sysctl, pinning BeeGFS processes, ...)

All-in-one BeeGFS node benchmark: read performance (chart)
Tuning is the key (formatting and mounting options, partition alignment, sysctl, pinning BeeGFS processes, ...); the alignment sketch below shows one concrete example
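
One concrete instance of the "formatting and partition alignment" tuning mentioned on these two slides is aligning the XFS file system of an OST to the geometry of its RAID6 array. The Python sketch below derives the mkfs.xfs stripe options from an assumed geometry (12 disks, 256 KiB chunk); the numbers are not taken from the talk.

    # Sketch: derive XFS stripe-alignment options for an OST sitting on a
    # RAID6 array. The RAID geometry below is an assumption for illustration.

    RAID_CHUNK_KIB = 256    # per-disk stripe unit configured on the controller
    TOTAL_DISKS = 12        # disks in the RAID6 array
    PARITY_DISKS = 2        # RAID6 keeps two parity disks per stripe

    data_disks = TOTAL_DISKS - PARITY_DISKS          # disks holding data
    stripe_width_kib = RAID_CHUNK_KIB * data_disks   # full-stripe size

    # Giving mkfs.xfs the stripe unit (su) and stripe width in units (sw) lets
    # XFS align allocations to full RAID stripes and avoid read-modify-write.
    print(f"full stripe = {stripe_width_kib} KiB")
    print(f"mkfs.xfs -d su={RAID_CHUNK_KIB}k,sw={data_disks} /dev/<ost_device>")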

All-in-one BeeGFS node benchmark: network matters (chart)
Best to have a high-bandwidth network: 1 Gb/s Ethernet vs 10 Gb/s Ethernet vs 40 Gb/s InfiniBand (IPoIB vs RDMA)

All-in-one BeeGFS node benchmark: network matters, continued (chart)
Best to have a high-bandwidth network: 1 Gb/s Ethernet vs 10 Gb/s Ethernet vs 40 Gb/s InfiniBand (IPoIB vs RDMA); the sketch below puts rough upper bounds on each link
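
As a rough yardstick for these comparisons, the sketch below converts the three link rates into theoretical ceilings in MB/s. The encoding overhead used for QDR InfiniBand (8b/10b, so about 32 Gb/s of data rate) is a textbook value, not a measurement from the talk.

    # Rough upper bounds for the networks compared on the slides. Real
    # throughput (IPoIB in particular) is lower due to protocol overhead.

    LINKS = {
        "1 Gb/s Ethernet":  1e9,
        "10 Gb/s Ethernet": 10e9,
        "QDR InfiniBand":   40e9 * 8 / 10,   # 8b/10b encoding on QDR links
    }

    for name, bits_per_s in LINKS.items():
        print(f"{name:17s} ~ {bits_per_s / 8 / 1e6:.0f} MB/s theoretical ceiling")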

All-in-one BeeGFS node in the end: working well and scalable
1x application server + 1x JBOD
- Tune to achieve the best performance (know how users access their data)
- I/O rates and throughput stay high while compute nodes keep being added
- 40 TB scalable to 160 TB (by adding JBODs to the existing application server)
- Throughput increases by adding application servers and JBODs
- Data stays close to the compute nodes (better for throughput)
- RAID6 protects against drive failure (but there is no data replication)

All-in-one BeeGFS node in the end: throughput increase
Write throughput using the all-in-one BeeGFS node (chart), comparing 8x RAID6 arrays of 6 disks each, 4x RAID6 arrays of 12 disks each, and 1x RAID60 array of 48 drives
Reference: Picking the right number of targets per storage server for BeeGFS (http://www.beegfs.com/docs/picking_the_right_number_of_targets_per_server_for_beegfs_by_thinkparq.pdf)

All-in-one BeeGFS node in the end: throughput increase
Read throughput using the all-in-one BeeGFS node (chart), with the same three RAID layouts (8x RAID6 of 6 disks, 4x RAID6 of 12 disks, 1x RAID60 of 48 drives); the capacity sketch below compares their usable space
Reference: Picking the right number of targets per storage server for BeeGFS (http://www.beegfs.com/docs/picking_the_right_number_of_targets_per_server_for_beegfs_by_thinkparq.pdf)
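
The three target layouts compared in these charts trade throughput against usable capacity; the sketch below works out the capacity side, assuming 4 TB drives like the benchmark node and a RAID60 made of four 12-disk RAID6 legs (that split is an assumption).

    # Usable capacity of the three 48-drive layouts from the charts, assuming
    # 4 TB drives. Standard RAID arithmetic: RAID6 loses 2 drives per array;
    # the RAID60 case is modelled as 4 striped RAID6 legs of 12 (assumption).

    DRIVE_TB = 4

    def raid6_usable(arrays, disks_per_array):
        """Total usable TB for `arrays` RAID6 arrays of `disks_per_array` drives."""
        return arrays * (disks_per_array - 2) * DRIVE_TB

    layouts = {
        "8x RAID6, 6 disks each":  raid6_usable(8, 6),
        "4x RAID6, 12 disks each": raid6_usable(4, 12),
        "1x RAID60, 48 drives":    raid6_usable(4, 12),  # 4 RAID6 legs striped
    }

    for name, usable_tb in layouts.items():
        print(f"{name:25s} -> {usable_tb} TB usable of {48 * DRIVE_TB} TB raw")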

In conclusion: production ready and easy to use
1x application server + 1x JBOD
- All-in-one BeeGFS node (metadata and data managed on the same host)
- Avoid commodity hardware (2x RAID controllers, SSDs, high-speed interconnect, ...)
- Choose enterprise-class disks (12x 4 TB HDD, 7200 RPM, Near Line SAS, 3.5")
- Tune to achieve the best performance (know how users access their data)
- Good performance for the money (a low entry price for the HPC world!)
- Easy to administer (well suited to small IT teams)

Icon made by Freepik from www.flaticon.com