Parallel File Systems for HPC

An introduction given at the Advanced School in High Performance and Grid Computing, Scuola Internazionale Superiore di Studi Avanzati, Trieste, November 2008.

Outline: 1. The need for parallel storage; 2. The Lustre file system; 3. Other parallel file systems.

Cluster & Storage
A typical cluster setup has a master node, several computing nodes and shared storage. The nodes have little or no local storage.

Cluster & Storage: The Old-Style Solution
A single storage server quickly becomes a bottleneck if the cluster grows over time (quite typical for initially small installations), and the storage requirements also grow, sometimes at a higher rate. Adding space (disks) is usually easy; adding speed (both bandwidth and IOPS) is hard and usually involves an expensive upgrade of the existing hardware. E.g. you start with an NFS box with a certain amount of disk space, memory and processor power, then add disks to the same box.

Cluster & Storage: The Old-Style Solution /2
E.g. you start with an NFS box with a certain amount of disk space, memory and processor power. Adding space is just a matter of plugging in some more disks, or at worst adding a new controller with an external port to connect external disks. But unless you planned for excess network bandwidth, you cannot benefit from the increased disk bandwidth: the available memory is now shared among a larger number of disks, and the available processor power now has to sustain a larger number of operations (this is usually less of an issue with current multicore, high-GHz processors, so you still probably benefit from the larger number of available spindles).
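
A quick back-of-the-envelope calculation makes the point concrete; the numbers below are illustrative assumptions, not measurements from the cluster described later.

```python
# Capacity scales with the number of disks, delivered bandwidth does not:
# the single server's network link caps what clients can ever see.
disk_capacity_tb = 0.5      # one SATA disk (assumed)
disk_bandwidth_mbs = 80     # streaming bandwidth of one disk (assumed)
nic_bandwidth_mbs = 120     # roughly one Gigabit Ethernet link

for n_disks in (4, 8, 16, 32):
    capacity = n_disks * disk_capacity_tb
    raw_disk_bw = n_disks * disk_bandwidth_mbs
    delivered = min(raw_disk_bw, nic_bandwidth_mbs)   # the NIC is the ceiling
    print(f"{n_disks:2d} disks: {capacity:4.1f} TB, "
          f"disks could stream {raw_disk_bw:4d} MB/s, clients see {delivered} MB/s")
```

Past a handful of disks the clients keep seeing the same ~120 MB/s; only the capacity grows.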

Cluster & Storage: The Parallel Solution
In a parallel solution you add storage servers, not only disk space; each added storage server brings in more memory, more processor power and more bandwidth. The software complexity, however, is much higher: there is no «easy» solution like NFS or CIFS.

Cluster & Storage: The Parallel Solution /2
A parallel solution is usually made of: several storage servers that hold the actual file system data; one or more metadata servers that help clients make sense of the data stored in the file system; and, optionally, monitoring software that ensures continuous availability of all needed components, plus a redundancy layer that replicates information in the storage cluster in some way, so that the file system can survive the loss of a component server.
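
As a purely illustrative toy sketch (not any real file system's API), the division of labour looks like this: a metadata service only records which servers hold which pieces of a file, and clients then exchange the data itself directly with the storage servers.

```python
# Toy model of a parallel file system layout (illustration only).
STRIPE_SIZE = 4                      # bytes per stripe, tiny for readability
DATA_SERVERS = ["oss0", "oss1", "oss2"]

metadata = {}                                # file name -> list of (server, object id)
storage = {s: {} for s in DATA_SERVERS}      # per-server object store

def write_file(name, data):
    layout = []
    for i in range(0, len(data), STRIPE_SIZE):
        server = DATA_SERVERS[(i // STRIPE_SIZE) % len(DATA_SERVERS)]
        obj_id = f"{name}.{i // STRIPE_SIZE}"
        storage[server][obj_id] = data[i:i + STRIPE_SIZE]   # data path
        layout.append((server, obj_id))
    metadata[name] = layout                                  # metadata path

def read_file(name):
    return b"".join(storage[srv][obj] for srv, obj in metadata[name])

write_file("result.dat", b"parallel file systems spread the load")
assert read_file("result.dat") == b"parallel file systems spread the load"
```

Losing a storage server in this toy model would lose the chunks it holds, which is exactly why real systems add the redundancy layer mentioned above.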

Components
A Lustre file system consists of: a single or dual Metadata Server (MDS) with an attached Metadata Target (MDT); multiple (up to 100s) Object Storage Servers (OSS) with attached Object Storage Targets (OST); and the clients.

MDS
The Metadata Server manages all metadata operations on the file system (file names and directories). A single MDS is needed for file system operation; multiple MDSs are possible in an active/standby configuration, all of them attached to a shared MDT storage. [Diagram: MDS1 and MDS2 both attached to the shared MDT.]

OSS
The Object Storage Servers provide the actual I/O service, connecting to the Object Storage Targets. The storage behind an OSS can be almost anything, from local disks to shared storage to a high-end SAN fabric. Each OSS can serve from one to a dozen OSTs, and each OST can be up to 8TB in size. The capacity of a Lustre file system is the sum of the capacities provided by the OSTs, while the aggregate bandwidth can approach the sum of the bandwidths provided by the OSSs.
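
A small worked example of that aggregation rule (the numbers are made up for illustration and are not from the talk):

```python
# Hypothetical Lustre layout: 4 OSSs, each serving 2 OSTs of 6 TB.
n_oss = 4
osts_per_oss = 2
ost_size_tb = 6.0           # stays below the 8 TB per-OST limit cited above
oss_bandwidth_mbs = 400     # what one OSS can push (assumed)

capacity_tb = n_oss * osts_per_oss * ost_size_tb     # capacity: sum over OSTs
peak_bandwidth_mbs = n_oss * oss_bandwidth_mbs       # bandwidth: at best, sum over OSSs

print(f"capacity: {capacity_tb} TB")                 # 48.0 TB
print(f"aggregate bandwidth: up to ~{peak_bandwidth_mbs} MB/s")
```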

Networking
All communication among Lustre servers and clients is managed by LNET. Key features of LNET include: support for many commonly used network types, such as InfiniBand and IP; RDMA, when supported by the underlying network (such as Elan, Myrinet and InfiniBand); high-availability and recovery features enabling transparent recovery in conjunction with failover servers; and simultaneous availability of multiple network types, with routing between them.

Networking /2: In the Real World
LNET can deliver more than 100MB/s on Gigabit Ethernet (with PCIe, server-class NICs and a low- to mid-range Gigabit switch) and more than 400MB/s on 4x SDR InfiniBand, without any special tuning. (These are actual numbers measured on our small Lustre cluster in the worst case of a single OSS; with multiple OSSs the aggregate bandwidth will be much larger.)

A Small Cluster
4 Supermicro 3U servers, each with 2 dual-core Opteron 2216 processors, 16GB of memory, an InfiniBand card, a 16-port Areca SATA RAID controller and 16 500GB SATA disks in a single RAID6 array plus 1 hot spare (6.5TB of available space). Each OSS uses the local array as its OST; one of the servers acts both as OSS and MDS. Clients connect over the InfiniBand network; a separate Ethernet connection is used for management only.

A Small Cluster /2
Initial file system creation is very simple with the provided lustre_config script; it takes time, however, and the machines are basically unusable until mkfs is over. In our setup the RAID controller is the bottleneck: the maximum bandwidth per OSS decreases to around 300MB/s once the cache is full. The single MDS provides no redundancy at all (and if you lose the MDT, the file system is gone). The MDT and one OST share physical disks, so suboptimal performance is expected, especially on metadata-intensive workloads.

A Small Cluster /3
Even with all the limitations mentioned above, we observe a performance level that is «good enough»: write bandwidth around 1GB/s with a small number of concurrent clients («small» meaning more or less equal to the number of servers); write bandwidth around 750MB/s with a moderate number of concurrent clients («moderate» meaning no more than 10 times the number of servers); beyond that point the write bandwidth decreases slowly as the number of clients increases. On the downside, the shared hardware between the MDT and one OST really hurts performance: we regularly observe the affected OSS finishing its work late.

A Small Cluster /4
Read bandwidth is in the 4-8GB/s range when the data are already cached on the servers (the OSSs have 16GB of main memory each). Non-cached read performance starts at 1.5GB/s and decreases with the same pattern as the write performance.
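
The per-client side of such a measurement can be reproduced with a very small script; the sketch below is a simplified stand-in for benchmarks such as IOR or iozone, uses a hypothetical mount point, and does not bypass client-side caching, so the file must be much larger than client memory for honest numbers.

```python
import os
import time

def write_bandwidth(path, total_mb=4096, block_mb=4):
    """Write total_mb of zeros to path and return MB/s as seen by this client."""
    block = b"\0" * (block_mb * 1024 * 1024)
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(total_mb // block_mb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())        # make sure the data has really left the client
    return total_mb / (time.time() - start)

if __name__ == "__main__":
    # /lustre/scratch is a hypothetical mount point; adjust to your system.
    mbps = write_bandwidth(f"/lustre/scratch/bw_test.{os.getpid()}")
    print(f"{mbps:.0f} MB/s from this client")
```

Aggregate figures like those quoted above are obtained by running this concurrently on many clients and summing the per-client results.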

Striping
The default configuration for a newly-created Lustre file system is that each file is assigned to a single OST. File striping can be configured, and the great thing is that it can be configured on a per-file or per-directory basis. When striping is enabled, the file size can grow beyond the space available on a single OST, the bandwidth from several OSSs can be aggregated, and the cache from several OSSs can be aggregated. For each file a different striping pattern can be defined (different stripe width and number of stripes).
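
Per-directory striping is normally set from a client with the lfs utility; a minimal sketch driven from Python is shown below (the directory name is made up, and options beyond the stripe count, such as the stripe size, vary slightly in spelling between Lustre versions).

```python
import subprocess

# Hypothetical directory on a Lustre client; new files created inside it
# will inherit the directory's striping pattern.
target = "/lustre/scratch/big_output"

# Stripe new files in this directory across 4 OSTs (-c sets the stripe count).
subprocess.run(["lfs", "setstripe", "-c", "4", target], check=True)

# Inspect the layout that files in this directory will get.
subprocess.run(["lfs", "getstripe", target], check=True)
```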

Striping /2
In our experience, striping can make a huge difference when a single client is working with huge files (here «huge» means larger than the combined caches of the client and the OSS). However, with the more typical workload of many concurrent clients, it makes no difference at best and can even hurt performance due to the increased metadata access.

GPFS by IBM
General Parallel File System: rock-solid, with a 10-year history; available for AIX, Linux and Windows Server 2003; proprietary license; tightly integrated with IBM cluster management tools.

OCFS2
Oracle Cluster File System: once proprietary, now GPL; available in the vanilla Linux kernel; not widely used outside the database world (as far as I know); 16TiB limit (to be confirmed).

PVFS
Parallel Virtual File System: open source, easy to install; userspace-only server, with a kernel module required only on clients; optimized for MPI-IO; performance through the POSIX compatibility layer is sub-optimal.
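
To show the kind of access pattern MPI-IO oriented file systems like PVFS are tuned for, here is a minimal collective-write sketch using mpi4py; it is generic MPI-IO code, not PVFS-specific, and the output path is hypothetical.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank produces one contiguous chunk of a global array.
chunk = np.full(1024, rank, dtype=np.float64)

# All ranks write their chunks to non-overlapping offsets of one shared file.
fh = MPI.File.Open(comm, "/pvfs/scratch/output.bin",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
fh.Write_at_all(rank * chunk.nbytes, chunk)   # collective write
fh.Close()
```

Run with something like mpirun -np 8 python write_example.py; on a parallel file system each rank's chunk can be serviced by a different storage server.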

pNFS
Parallel NFS: an extension of NFSv4, under development; some proprietary solutions are available (Panasas); it should combine the benefits of parallel I/O with those of using a standard solution (NFS).

<calucci@sissa.it>