libhio: Optimizing IO on Cray XC Systems With DataWarp

libhio: Optimizing IO on Cray XC Systems With DataWarp
Nathan Hjelm, Los Alamos National Laboratory
Cray User Group, May 9, 2017
LA-UR-17-23841

Outline
- Background
- HIO Design
- Functionality
- Performance
- Summary

Background: Terminology
- Burst Buffer: a high-speed, low cost-per-bandwidth storage facility used to reduce the time spent on high-volume IO, thus improving system efficiency.
- Cray DataWarp: a Cray SSD storage product on Trinity. It provides burst buffer (and other) functionality. Initially developed for Trinity and Cori.
- Hierarchical IO Library (HIO): a LANL-developed API and library that facilitates use of the burst buffer and PFS (Parallel File System) for checkpoint and analysis IO on Trinity and future systems.

Background: What Is DataWarp?
- A Cray SSD storage product developed for Trinity
- Consists of:
  - Hardware: service nodes connected directly to the Aries network, each containing two SSDs
  - Software: an API/library with functions to initiate stage in/stage out and query stage state (a hedged sketch follows below)
- Can be configured in multiple modes using the workload manager: striped, cache, private, etc.
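The staging functions can also be called directly from an application, for example to drain a checkpoint from the burst buffer to the parallel file system. The sketch below uses the libdatawarp staging calls (dw_stage_file_out, dw_wait_file_stage); the exact signatures and the DW_STAGE_IMMEDIATE flag are assumptions recalled from the DataWarp documentation and should be verified against datawarp.h.

    /* Hedged sketch: drain a burst-buffer file to the PFS with libdatawarp.
     * Signatures and flags are assumptions -- verify against <datawarp.h>. */
    #include <stdio.h>
    #include <datawarp.h>

    int drain_checkpoint (const char *dw_file, const char *pfs_file)
    {
        /* Ask DataWarp to copy the file out to the parallel file system now. */
        int rc = dw_stage_file_out (dw_file, pfs_file, DW_STAGE_IMMEDIATE);
        if (0 != rc) {
            fprintf (stderr, "dw_stage_file_out failed: %d\n", rc);
            return rc;
        }
        /* Block until the stage-out of this file completes. */
        return dw_wait_file_stage (dw_file);
    }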

Background: Trinity Burst Buffer Architecture
[Diagram: compute nodes on the Aries high-speed network; burst buffer blades (2x burst buffer node, 2x SSD each) and I/O node blades (2x InfiniBand HCA per I/O node) bridge the Aries network to the InfiniBand storage fabric and the Lustre storage servers (OSSs/OSTs).]

Trinity Phase 1: 9500 Haswell + 1000 KNL compute nodes, 300 DataWarp nodes, 1.7 PB DataWarp capacity
Trinity Phase 2: 8900 KNL compute nodes added, 232 DataWarp nodes, 1.3 PB DataWarp capacity

Background: Trinity Burst Buffer Architecture
[Diagram of a burst buffer blade: two Xeon E5 v1 processors share an Aries connection to the high-speed network; each processor attaches two 3.2 TB Intel P3608 SSDs over PCIe Gen3 x8 links.]

Background: Trinity Burst Buffer Architecture

Mode                           Description
Private Scratch                Per-node burst buffer (BB) space
Shared Scratch                 Shared space; files may be striped across all BB nodes. Used for Trinity checkpoints.
Shared Cache                   Parallel File System (PFS) cache. Transparent and explicit options.
Load Balanced Read Only Cache  PFS files replicated onto multiple BB nodes to speed up widely read files

What is HIO? Design Motivation
- Isolate applications from burst buffer and IO technology evolution
  - Support Trinity and future burst buffer implementations
  - Support Lustre and other PFS on all tri-lab systems
- Easy to incorporate into existing applications
  - Lightweight, simple to configure
- Improve checkpoint performance
  - Support N-1, N-N, N-M (and more) IO patterns
  - No collective read or write calls
- Support checkpoint/restart and analysis scenarios (e.g., viz)
- Extend to bridge to future IO technologies
  - e.g., Sierra, HDF5, object store?
- Provide high-performance I/O best practices to many applications

What is HIO? API Motivation
- Codes store multiple types of data in a restart dump
  - Provide named elements in a file
- Codes handle fallback on restart failure (corruption, missing files)
  - Provide a numeric identifier for files
  - Provide "highest" and "last modified" identifiers for open (see the sketch below)
- Need for performance metrics and tracing
- Need to handle staging between IO layers
  - Collective open/close to determine when datasets are complete
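The identifier scheme makes restart fallback simple to express: open the highest-numbered dataset and step back to an older identifier if it cannot be read. The sketch below is hedged; HIO_DATASET_ID_HIGHEST, HIO_SUCCESS, and the shape of hio_dataset_open are assumptions based on the libhio API document and should be checked against hio.h.

    /* Hedged sketch: open the newest restart dataset, falling back on failure.
     * Constants and signatures are assumptions -- verify against hio.h. */
    #include <hio.h>

    int open_newest_restart (hio_context_t ctx, hio_dataset_t *dset_out)
    {
        /* Ask libhio for the highest-numbered "restart" dataset. */
        if (HIO_SUCCESS == hio_dataset_open (ctx, dset_out, "restart",
                                             HIO_DATASET_ID_HIGHEST,
                                             HIO_FLAG_READ, HIO_SET_ELEMENT_UNIQUE)) {
            return 0;
        }
        /* On failure (corruption, missing files) the application retries with
           an explicit older identifier recorded at checkpoint time. */
        return -1;
    }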

What is HIO? Structure and Function
- HIO is packaged as an independent library
- Plug-in architecture to support future storage systems
- Flexible API and configuration functionality
- Configurable diagnostic and performance reporting
- Supports the Cray DataWarp burst buffer on Trinity
  - Staging to the PFS, space management, striping
  - Automatically handles staging of data and management of space
- Supports other PFS on Trinity
  - Lustre-specific support (interrogation, striping)
  - Others (e.g., GPFS on Sierra) can be added as needed

HIO Concepts
- HIO is centered around an abstract name space
- An HIO file is known as a dataset
- An HIO dataset supports:
  - Named binary elements
  - Shared or unique-per-rank element offset space
    - The offset space is per named element
    - This provides traditional N-1 and N-N functionality
  - A version or level identifier
  - A notion of completeness or correctness
- The POSIX on-disk structure is currently a directory and its contents
  - The internal on-disk structure is not specified, to preserve flexibility for future optimization
- Backward compatibility across HIO versions
  - Read any prior dataset formats
  - Support any prior API definitions
(A hedged write example follows below.)
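To make these concepts concrete, the sketch below writes one named element of a versioned dataset from every MPI rank. The call names follow the libhio API document, but the exact signatures, flags, and mode constants shown are simplified assumptions and should be checked against hio.h in the released library.

    /* Hedged sketch: write a checkpoint element with libhio.
     * Signatures, flags, and mode constants are assumptions -- verify
     * against hio.h. */
    #include <stddef.h>
    #include <stdint.h>
    #include <mpi.h>
    #include <hio.h>

    void write_checkpoint (MPI_Comm comm, const void *buf, size_t nbytes, int64_t step)
    {
        hio_context_t ctx;
        hio_dataset_t dset;
        hio_element_t elem;

        /* Create an HIO context from an MPI communicator; configuration can
           come from a file, the environment, or API calls. */
        hio_init_mpi (&ctx, &comm, NULL, NULL, "my_app");

        /* Dataset "restart" with numeric identifier `step`; each rank gets its
           own offset space for each element (traditional N-N). */
        hio_dataset_open (ctx, &dset, "restart", step,
                          HIO_FLAG_CREAT | HIO_FLAG_WRITE, HIO_SET_ELEMENT_UNIQUE);

        /* A named binary element inside the dataset. */
        hio_element_open (dset, &elem, "state", HIO_FLAG_CREAT | HIO_FLAG_WRITE);
        hio_element_write (elem, 0 /* offset */, 0 /* reserved */, buf, 1, nbytes);
        hio_element_close (&elem);

        /* Collective close marks the dataset complete. */
        hio_dataset_close (&dset);
        hio_fini (&ctx);
    }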

HIO IO Modes: Basic Mode
- Provided early access to the API; now provides the fallback path
- Translates directly to POSIX or C stream IO (stdio)
  - Unique element addressing translates directly to a file per element per process
  - Shared element addressing translates directly to a single shared file per element (see the conceptual sketch below)
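The mapping is easy to picture in plain POSIX terms. The sketch below is a conceptual illustration of the two layouts (not libhio code; the file-name pattern is made up): one file per element per rank for unique addressing, and one shared file written at rank-specific offsets for shared addressing.

    /* Conceptual illustration of basic mode's two layouts (not libhio code;
     * the file-name pattern is hypothetical). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Unique element addressing: one file per element per process (N-N). */
    static void write_unique (const char *dir, const char *element, int rank,
                              const void *buf, size_t nbytes)
    {
        char path[4096];
        snprintf (path, sizeof (path), "%s/%s.%06d", dir, element, rank);
        int fd = open (path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        (void) write (fd, buf, nbytes);
        close (fd);
    }

    /* Shared element addressing: one file per element, every rank writes at
     * its own offset (N-1). */
    static void write_shared (const char *dir, const char *element, int rank,
                              const void *buf, size_t nbytes)
    {
        char path[4096];
        snprintf (path, sizeof (path), "%s/%s", dir, element);
        int fd = open (path, O_CREAT | O_WRONLY, 0644);
        (void) pwrite (fd, buf, nbytes, (off_t) rank * (off_t) nbytes);
        close (fd);
    }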

HIO IO Modes: File Per Node
- Targeted originally at Lustre
- Based on ideas from PLFS
  - Turns most writes into contiguous writes
- Does not support read-write mode
- N-1 (element_shared) application view is maintained
- Different read vs. write node counts are supported
- Reduces the number of files by 32:1 or 68:1 (see the sketch below)
  - Less metadata load than traditional N-N
  - Less file contention than traditional N-1
- Cross-process data lookup supported with MPI-3 Remote Memory Access (RMA)
- Typically better performance for shared-file access
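The 32:1 and 68:1 ratios correspond to the core counts per node on Trinity (32 on Haswell, 68 on KNL), with all ranks on a node collapsing into a single file. The sketch below illustrates only the node-level grouping step, using a standard MPI shared-memory communicator split rather than libhio internals; libhio additionally maintains an index, reachable via MPI-3 RMA, so any rank can locate data written by another.

    /* Conceptual sketch of file-per-node grouping: ranks on the same node are
     * grouped with MPI_Comm_split_type, and local rank 0 owns that node's
     * file. Illustration only; not libhio's actual implementation. */
    #include <mpi.h>
    #include <stdio.h>

    int main (int argc, char **argv)
    {
        MPI_Init (&argc, &argv);

        MPI_Comm node_comm;
        MPI_Comm_split_type (MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                             MPI_INFO_NULL, &node_comm);

        int world_rank, node_rank, node_size;
        MPI_Comm_rank (MPI_COMM_WORLD, &world_rank);
        MPI_Comm_rank (node_comm, &node_rank);
        MPI_Comm_size (node_comm, &node_size);

        /* Each node writes one file instead of node_size files, e.g. 32:1 on
           Haswell or 68:1 on KNL when running one rank per core. */
        if (0 == node_rank) {
            printf ("world rank %d aggregates %d ranks into one file\n",
                    world_rank, node_size);
        }

        MPI_Comm_free (&node_comm);
        MPI_Finalize ();
        return 0;
    }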

HIO Development Status
- API fully documented with Doxygen
- Version 1.4.0.0 released on GitHub (open source)
  - ~7.5 KLOC
- HDF5 plugin under development
- Extensive HIO / DataWarp / Lustre regression test suite
- Ongoing work:
  - Import/export utility
  - Checkpoint interval advice
  - Performance enhancements
  - Instrumentation enhancements

libhio Documentation
- API document
- User's Guide
  - Library description
  - HIO APIs
  - Configuration variables
- 40 pages

Performance: DataWarp
- Trinity Phase 2: Cray XC-40
  - 8192 Intel KNL nodes, 68 cores with 4 hyperthreads/core
  - 234 Cray DataWarp nodes, each with 2 x 4 TB Intel P3608 SSDs
    - Aggregate bandwidth of ~1200 GiB/sec
  - Open MPI master, hash 6886c12
- Modified IOR benchmark
  - Added a backend for HIO
  - 1 MiB block size
  - 8 GiB per MPI process
  - 4 MPI processes per node

Performance: DataWarp - File Per Process - Read
[Chart: IOR read bandwidth, file per process, 1k block size with 4 writers per node. Bandwidth (GiB/sec) vs. number of nodes (4 to 4096) for POSIX, HIO basic, and HIO file per node.]

Performance: DataWarp - File Per Process - Write
[Chart: IOR write bandwidth, file per process, 1k block size with 4 writers per node. Bandwidth (GiB/sec) vs. number of nodes (4 to 4096) for POSIX, HIO basic, and HIO file per node.]

Performance: DataWarp - Shared File - Read
[Chart: IOR read bandwidth, shared file, 1k block size with 4 writers per node. Bandwidth (GiB/sec) vs. number of nodes (4 to 4096) for POSIX, HIO basic, and HIO file per node.]

In Summary...
HIO provides:
- High performance I/O
- that isolates applications from I/O technology shifts
- and improves application performance and reliability
- while reducing development costs
HIO is ready to use now:
- Flexible API set
- Documented
- Tested
- Well structured for future enhancement and extensions

Questions? Thank You