ZEST Snapshot Service: A Highly Parallel Production File System
PSC Advanced Systems Group, Pittsburgh Supercomputing Center

Design Motivation
- To optimize science utilization of the machine
  - Maximize the compute time / wall time ratio
  - Maximize aggregate TB/sec/$
  - Without compromising R.A.S.
  - Scalable up to large I/O clusters
- To ensure fault tolerance
  - Application recovery following HW/SW failures
  - Checkpoint I/O
- In exchange for:
  - Temporarily suspending some POSIX semantics while retaining identical calling sequences (transparency)
  - Delayed availability for reading

Design Features
- Client-side parity calculation (see the sketch below)
  - Leverages compute node CPU rather than I/O node CPU
  - Performed in a transparent intercept library
- Sequential I/O
  - Aggregate many small RPCs into fewer large RPCs
  - Disk block allocator takes the next available block
  - Both are effective latency-hiding strategies
- Asynchronous fragment reconstruction
  - Reassembly metadata is carried with the data blocks
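The client-side parity step amounts to XOR-ing the data chunks of a parity group together on the compute node before they are shipped over LNET. The following is a minimal C sketch of that idea, with an assumed chunk size and a hypothetical function name; it illustrates the technique and is not the Zest source.

    /* XOR a parity group's data chunks into one parity chunk on the
     * compute node, so the I/O node spends no CPU on parity generation.
     * (Hypothetical name and chunk size, for illustration only.) */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define CHUNK_SIZE 65536                 /* assumed chunk size */

    void client_make_parity(const uint8_t *const chunks[], size_t nchunks,
                            uint8_t parity[CHUNK_SIZE])
    {
        memset(parity, 0, CHUNK_SIZE);       /* start from the XOR identity */
        for (size_t c = 0; c < nchunks; c++)
            for (size_t i = 0; i < CHUNK_SIZE; i++)
                parity[i] ^= chunks[c][i];   /* fold each data chunk in */
    }

Because XOR is its own inverse, any single lost chunk in the group can later be rebuilt by XOR-ing the surviving chunks with the parity chunk, which is the kind of reconstruction the server's parity threads perform asynchronously.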

Software Components

[Diagram: data path from the compute nodes (Zest client generating parity chunks) through LNET routers on the SIO nodes to the Zest I/O nodes, where in-memory RAID vectors (queues 0-3) feed disk shelves (by color); buffers progress through Incoming, Partial Parity, Full Parity, and Completed states, and syncer threads stage completed data out to an external Lustre FS.]

- Client library (on compute node)
  - Transparent intercept
  - Parity calculation
- LNET router (on SIO node)
  - Routes messages from the HSN to the external IB network
- ZEST daemon (on Zest I/O node)
  - Communications threads: LNET to memory
  - I/O threads: all I/O to/from each internal disk (see the sketch below)
  - Syncer threads: stage data out to the external PFS
  - Parity threads: regenerate lost data buffers
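The one-thread-per-disk design can be pictured as a small producer/consumer loop: communications threads fill buffers from LNET and push them onto a per-disk queue, and that disk's dedicated I/O thread drains the queue, appending each buffer at the next available block so the spindle streams sequentially. The sketch below is a self-contained illustration under assumed structures, names, and sizes, not the Zest API.

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define QLEN  16                    /* assumed queue depth */
    #define BUFSZ 4096                  /* assumed buffer size */

    /* Per-disk queue of filled buffers; comms threads produce,
     * the disk's I/O thread consumes. */
    struct disk_queue {
        char bufs[QLEN][BUFSZ];
        unsigned head, tail;            /* tail - head = buffers queued */
        pthread_mutex_t mtx;
        pthread_cond_t  nonempty, nonfull;
        FILE *disk;                     /* stands in for one raw spindle */
    };

    /* Called by a communications thread once an incoming buffer is full. */
    void push_buffer(struct disk_queue *q, const char *data)
    {
        pthread_mutex_lock(&q->mtx);
        while (q->tail - q->head == QLEN)
            pthread_cond_wait(&q->nonfull, &q->mtx);
        memcpy(q->bufs[q->tail % QLEN], data, BUFSZ);
        q->tail++;
        pthread_cond_signal(&q->nonempty);
        pthread_mutex_unlock(&q->mtx);
    }

    /* One of these threads per physical disk: drain the queue and
     * append at the next available block, keeping the disk streaming. */
    void *io_thread(void *arg)
    {
        struct disk_queue *q = arg;
        char local[BUFSZ];
        for (;;) {
            pthread_mutex_lock(&q->mtx);
            while (q->head == q->tail)
                pthread_cond_wait(&q->nonempty, &q->mtx);
            memcpy(local, q->bufs[q->head % QLEN], BUFSZ);
            q->head++;
            pthread_cond_signal(&q->nonfull);
            pthread_mutex_unlock(&q->mtx);
            fwrite(local, 1, BUFSZ, q->disk);   /* sequential append, no seeks */
        }
        return NULL;
    }

This pairs with the next-available block allocator and the reassembly metadata carried with the data blocks described on the previous slide.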

Scalable Unit Hardware

[Diagram: one scalable unit; service and I/O nodes connect over 1 GB/s (SDR) IB links to the dual quad-core Zest I/O node, which uses PCIe (4 GB/s) and 1.2 GB/s SAS links to reach disk drive shelves of SATA drives (75 MB/s each).]

- No single point of failure
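As a rough sanity check (an inference from the figures on this slide, not a number stated in the deck), the per-drive and per-link rates are consistent with about sixteen drives behind each SAS link:

    1.2 GB/s per SAS link / 75 MB/s per SATA drive ≈ 16 drives per link

which also lines up with the 1.2 GB/s raw spindle speed used as the baseline in the component tests on the next slide.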

Benchmarking Tests
- Standard benchmarking applications (no source modifications required; transparent intercept)
  - IOR (used for the following scaling plots)
  - SPIOBENCH
  - Bonnie++
  - FIO (PSC-developed, highly parallel and configurable suite)
- Zest component tests (sustained)
  - I/O threads: 1.1 GB/sec (91% of the 1.2 GB/sec raw spindle speed)
  - RPC threads: 1.1 GB/sec (27% of 4 GB/sec over IB)

Benchmarking Goals
1. Measure client scaling
2. Measure maximum write bandwidth
3. Compare Zest to Lustre
All of the following measurements use the IOR client application against a single server (identical hardware).

Zest Performance (Linux / SDP)
[IOR scaling plot]
- Good client scaling
- 750 MB/sec

Zest Performance (XT3 / IPoIB)
[IOR scaling plot]
- Excellent client scaling to 1K clients
- Highly consistent per-write performance
- IPoIB-limited: 150 MB/sec

Lustre Performance (RAID0 / o2ib)
[IOR scaling plot]
- For illustration only (maximum bandwidth configuration)
- 16 OSTs, no redundancy
- 430 MB/sec

Lustre Performance (RAID5 / o2ib)
[IOR scaling plot]
- What you get without hardware RAID
- 1 OST: Linux software RAID
- 100 MB/sec

Storage System Write Efficiency
Efficiency = Aggregate Application BW / Aggregate Raw Spindle Speed
- This matters because per-drive cost is a substantial portion of a storage subsystem's cost.
- Zest: highest server efficiency
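Plugging in the sustained component-test numbers quoted earlier for a single Zest server gives roughly:

    Efficiency ≈ 1.1 GB/s (sustained to disk) / 1.2 GB/s (aggregate raw spindle speed) ≈ 92%

which is consistent with the >90% peak server efficiency reported on the dashboard slide.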

Storage System Performance/Cost
Performance/Cost = Aggregate Application BW / Cost of the Storage Subsystem
- Zest: highest TB/sec/$ (no per-server overhead)

The Zest Dashboard
- Server efficiency (aggregate): >90% peak, 80% sustained
- All-disk bandwidth: sustained near maximum
- Ingest buffers: remained empty throughout the benchmarks

zestiondctl
A shell tool that provides interactive reporting and control of Zest primitives, including:
- Controls: logging, syncing, disk online/offline
- Performance: disk I/O, thread activity, communication rates
- Resources: memory structures, inodes, multilock lists, requests, parity groups

ZESTIONDCTL(8)          Pittsburgh Supercomputing Center          ZESTIONDCTL(8)

NAME
    zestiondctl - zestion daemon runtime control

SYNOPSIS
    zestiondctl [-H] [-h table] [-L listspec] [-i iostat] [-M mlistspec]
                [-p paramspec] [-S socket] [-s showspec]

DESCRIPTION
    The zestiondctl utility examines and manipulates zestiond(8) behavior.

Zest Configuration (Linux / SDP)
[Diagram: Linux clients -> SDP -> InfiniBand switch -> Zest server (RAID3) -> SAS disk shelves]

Zest Configuration (XT3 / IPoIB)
[Diagram: Cray XT3 -> routers -> IPoIB -> InfiniBand switch -> Zest server (RAID3) -> SAS disk shelves]

Lustre Configuration (RAID0 / o2ib)
[Diagram: Cray XT3 -> routers -> o2ib -> InfiniBand switch -> Lustre server (RAID0) -> SAS disks]

Lustre Configuration (RAID5 / o2ib)
[Diagram: Cray XT3 -> routers -> o2ib -> InfiniBand switch -> Lustre over Linux software RAID5 -> SAS disks]