Got Burst Buffer. Now What? Early experiences, exciting future possibilities, and what we need from the system to make it work


Got Burst Buffer. Now What? Early experiences, exciting future possibilities, and what we need from the system to make it work
The Salishan Conference on High-Speed Computing, April 26, 2016
Adam Moody
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.

Burst Buffer 101: quickly absorb writes, then stream the data to the parallel file system at a slower rate (a sketch of this write-then-drain pattern follows below).
Two common designs:
- Dedicated storage nodes, shared among compute nodes
- Storage local to each compute node
[Figure: dedicated storage nodes vs. node-local storage; in both designs, compute nodes write to SSDs that drain to the parallel file system (PFS)]
Ning Liu, Jason Cope, Philip Carns, Christopher Carothers, Robert Ross, Gary Grider, Adam Crume, Carlos Maltzahn, "On the Role of Burst Buffers in Leadership-Class Storage Systems," MSST 2012
Kento Sato et al., "A User-level InfiniBand-based File System and Checkpoint Strategy for Burst Buffers," CCGrid 2014
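The sketch below illustrates the basic pattern in plain C for the node-local design: the application dumps its output quickly to local flash, then a background thread drains the file to the PFS at whatever rate the PFS can absorb. The /l/ssd and /p/lustre paths are illustrative mounts, not system-specific guarantees.

    /* Hedged sketch of the burst-buffer idea: fast write to node-local flash,
     * slow background drain to the parallel file system. */
    #include <pthread.h>
    #include <stdio.h>

    static void *drain_to_pfs(void *arg)
    {
        const char *local = (const char *)arg;
        const char *pfs = "/p/lustre/output/ckpt.bin";   /* illustrative PFS path */

        FILE *in  = fopen(local, "rb");
        FILE *out = fopen(pfs, "wb");
        if (in && out) {
            char buf[1 << 20];                           /* stream 1 MiB at a time */
            size_t n;
            while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
                fwrite(buf, 1, n, out);
        }
        if (in)  fclose(in);
        if (out) fclose(out);
        return NULL;
    }

    int main(void)
    {
        const char *local = "/l/ssd/ckpt.bin";           /* illustrative node-local path */

        /* Fast write: absorb the burst on local flash. */
        FILE *fp = fopen(local, "wb");
        static char chunk[1 << 20];
        for (int i = 0; i < 64; i++)                     /* 64 MiB stand-in for real output */
            fwrite(chunk, 1, sizeof(chunk), fp);
        fclose(fp);

        /* Slow drain: stream the file to the PFS while computation continues. */
        pthread_t tid;
        pthread_create(&tid, NULL, drain_to_pfs, (void *)local);

        /* ... application continues computing here ... */

        pthread_join(tid, NULL);                         /* ensure drain completes before exit */
        return 0;
    }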

LLNL has a history (and future) of machines with node-local storage.
Node-local storage:
- Coastal (retired): 1000 nodes, 250 GB SSD
- Catalyst: 324 nodes, 800 GB SSD
- Flash: 16 nodes, 1 TB SSD
- Sierra (2018): at least one node, 800 GB SSD
Dedicated storage nodes:
- Hyperion: 600 nodes, 800 GB SSD; 80 nodes, 600 GB FusionIO

Checkpoint / Restart

Checkpoint levels vary in cost and resiliency; most failures can be recovered from checkpoints that are 100-1000x faster than the PFS.
- Single: store checkpoint data on the node's local storage, e.g., disk or memory
- Partner: write to local storage and to a partner node
- XOR: write the file to local storage; small sets of nodes collectively compute and store parity redundancy data (RAID-5); a sketch of the parity idea follows this list
- Stable storage: write to the parallel file system
[Figure: checkpoint cost and resiliency increase from Level 1 to Level 3]
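The XOR level boils down to RAID-5-style parity across a small group of nodes. The sketch below shows the core idea with a bitwise-XOR reduction: each rank contributes its checkpoint block, and the resulting parity block lets any single lost block be rebuilt by XORing the parity with the surviving blocks. Block size and group layout are illustrative, not SCR's actual on-disk format.

    /* Hedged sketch of XOR parity across a checkpoint group. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const size_t block = 1 << 20;              /* 1 MiB checkpoint block */
        unsigned char *ckpt   = malloc(block);     /* this rank's checkpoint data */
        unsigned char *parity = malloc(block);     /* XOR of all blocks in the group */
        memset(ckpt, rank + 1, block);             /* stand-in for real application state */

        /* Bitwise XOR reduction across the group; rank 0 holds the parity block
         * and would store it on its node-local SSD alongside its own block. */
        MPI_Reduce(ckpt, parity, (int)block, MPI_BYTE, MPI_BXOR, 0, MPI_COMM_WORLD);

        free(ckpt);
        free(parity);
        MPI_Finalize();
        return 0;
    }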

Application developers can use new interfaces to abstract the details of burst buffers.
The Scalable Checkpoint/Restart (SCR) library abstracts burst buffer details from the application (a sketch of its calling sequence follows). But what do I give up?
- No support for sharing files between processes
- Cannot update files after declaring them to be complete
- Can only access checkpoint files during a specific window
Plans to extend SCR to support general output sets (not just checkpoints).
Not the only one:
- Fault Tolerance Interface (FTI) from Argonne National Laboratory
- Hierarchical Input Output (HIO) from Los Alamos National Laboratory
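A minimal sketch of a checkpoint written through SCR: the library routes each rank's file to node-local storage or the burst buffer and applies the configured redundancy level. The calls follow the classic SCR API (SCR_Init, SCR_Route_file, SCR_Complete_checkpoint); newer SCR releases add a more general output interface, so check the version installed on your system.

    #include <mpi.h>
    #include <stdio.h>
    #include "scr.h"

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        SCR_Init();

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int need = 0;
        SCR_Need_checkpoint(&need);              /* let SCR decide whether it's time */
        if (need) {
            SCR_Start_checkpoint();

            /* Ask SCR where this rank's file should actually be written;
             * it returns a path on the burst buffer / node-local storage. */
            char name[SCR_MAX_FILENAME], path[SCR_MAX_FILENAME];
            snprintf(name, sizeof(name), "rank_%d.ckpt", rank);
            SCR_Route_file(name, path);

            FILE *fp = fopen(path, "wb");
            int ok = (fp != NULL);
            if (ok) {
                /* ... write application state here ... */
                fclose(fp);
            }

            /* Tell SCR whether this rank's file is complete and valid. */
            SCR_Complete_checkpoint(ok);
        }

        SCR_Finalize();
        MPI_Finalize();
        return 0;
    }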

These middleware projects need new support from resource managers and burst buffer APIs.
- A portable way to determine the topology and properties of storage: file path, unique identifier, capacity, speed, resiliency. New interfaces being considered for the PMIx standard may help (see the sketch after this list).
- Prestage / poststage: scripts need a way to determine which checkpoint to load, and whether the final checkpoint was written out.
- Keep a checkpoint in cache between consecutive jobs, or extend the runtime of the current job.
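The sketch below is purely hypothetical: no resource manager or PMIx release exposes exactly this interface. It only illustrates the kind of storage description middleware such as SCR, FTI, or libhio would like to query portably (path, identifier, capacity, speed, resiliency). The bb_query_storage() helper, its struct fields, and the values it returns are invented for illustration.

    #include <stdint.h>
    #include <stdio.h>

    struct bb_storage_info {
        const char *mount_path;             /* e.g. a node-local mount like /l/ssd */
        const char *id;                     /* unique identifier for the storage tier */
        uint64_t    capacity_bytes;         /* capacity */
        uint64_t    bandwidth_bytes_per_s;  /* speed */
        int         survives_node_failure;  /* resiliency: 0 = lost with the node */
    };

    /* Hypothetical query call; a real one might come from the scheduler or PMIx.
     * This stub just returns made-up values so the sketch compiles and runs. */
    static int bb_query_storage(struct bb_storage_info *info)
    {
        info->mount_path            = "/l/ssd";
        info->id                    = "node-local-ssd";
        info->capacity_bytes        = 800ULL << 30;      /* 800 GiB */
        info->bandwidth_bytes_per_s = 1ULL << 30;        /* ~1 GiB/s */
        info->survives_node_failure = 0;
        return 0;
    }

    int main(void)
    {
        struct bb_storage_info info;
        if (bb_query_storage(&info) == 0) {
            printf("checkpoint to %s (%llu GiB, %s node failure)\n",
                   info.mount_path,
                   (unsigned long long)(info.capacity_bytes >> 30),
                   info.survives_node_failure ? "survives" : "lost on");
        } else {
            printf("no burst buffer described; fall back to the PFS\n");
        }
        return 0;
    }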

Extended Memory

Data-Intensive Memory-Mapped IO (DI-MMAP) provides fast access to non-volatile memory.
With different classes of memory, apps must manage where and when to move data from slow to fast memory. This can be explicit or implicit; for the implicit case, mmap is nice (a plain-mmap sketch follows).
Linux memory-map runtime:
- Optimized for shared libraries
- Does not expect mmap'd data to churn
- Does not expect mmap'd data to exceed memory capacity
DI-MMAP:
- Optimized for frequent evictions
- Optimized for highly concurrent access
- Expects memory-mapped data to churn
Brian Van Essen, Henry Hsieh, Sasha Ames, Roger Pearce, Maya Gokhale, "DI-MMAP: A Scalable Memory-Map Runtime for Out-of-Core Data-Intensive Applications," Cluster Computing, 2013
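For reference, here is the implicit approach using only standard Linux mmap (not the DI-MMAP runtime itself): a file on node-local flash is mapped into the address space, and pages are faulted in on demand as they are touched. The /l/ssd path is illustrative.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/l/ssd/graph.bin";       /* illustrative node-local file */
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        /* Map the whole file; pages are pulled in from flash as they are touched. */
        unsigned char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }
        madvise(data, st.st_size, MADV_RANDOM);      /* graph traversals access randomly */

        unsigned long sum = 0;
        for (off_t i = 0; i < st.st_size; i += 4096) /* touch one byte per page */
            sum += data[i];
        printf("touched %lld bytes, checksum %lu\n", (long long)st.st_size, sum);

        munmap(data, st.st_size);
        close(fd);
        return 0;
    }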

NVRAM augments DRAM to process trillion-edge graphs [IPDPS 2013].
Our approach can process 32x larger datasets with only a 39% performance degradation in TEPS by leveraging node-local NVRAM for expanded graph data storage. By exploiting both distributed-memory processing and node-local NVRAM, significantly larger datasets can be processed than with either approach in isolation.
Roger Pearce, Maya Gokhale, Nancy M. Amato, "Scaling Techniques for Massive Scale-Free Graphs in Distributed (External) Memory," IPDPS 2013

Machine Learning & IO Scheduling

Using storage to improve machine learning.
Training must read a large input set many times in random order; local storage removes that load from the parallel file system.
What's needed?
- Support from the resource manager to persist the dataset between runs
- Schedule jobs close to the data
- A distributed file system to shard large datasets (a sharding sketch follows this list)
[Figure: ranks 0-3 on nodes N0-N3 train model M0, with the input layer sharded across node-local NVRAM]
Brian Van Essen et al., "LBANN: Livermore Big Artificial Neural Network HPC Toolkit," MLHPC 2015
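A hedged sketch of the sharding idea: each rank copies its byte range of a large input file from the PFS to its local SSD once, so repeated random-order reads during training hit local flash instead of the PFS. The paths and the equal-split policy are illustrative, not LBANN's actual data loader, and the sketch assumes each shard fits in an int-sized MPI count.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        const char *src = "/p/lustre/dataset.bin";   /* illustrative PFS path */
        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, src, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

        MPI_Offset total;
        MPI_File_get_size(fh, &total);
        MPI_Offset shard  = total / nranks;          /* equal split for simplicity */
        MPI_Offset offset = shard * rank;
        if (rank == nranks - 1) shard = total - offset;  /* last rank takes remainder */

        char *buf = malloc(shard);
        MPI_File_read_at_all(fh, offset, buf, (int)shard, MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        /* Cache the shard on node-local storage for repeated random-order reads. */
        char local[256];
        snprintf(local, sizeof(local), "/l/ssd/shard.%d", rank);  /* illustrative mount */
        FILE *fp = fopen(local, "wb");
        if (fp) {
            fwrite(buf, 1, (size_t)shard, fp);
            fclose(fp);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }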

Using machine learning to improve storage.
Since early 2015, LLNL has been tracking:
- Aggregate Lustre counters for each job
- A copy of each job's batch script
For each job waiting in the queue, apply machine learning to features in the job script to predict its expected run time and expected Lustre stats; simulate the job scheduler to predict the start of each job; predict future load on the file system.
Ryan McKenna, Michela Taufer, "Forecasting Storms in Parallel File Systems," SC 2015 poster
Burst buffers and knowledge of job IO requirements enable the resource scheduler to control load on the file system.
Stephen Herbein, Michela Taufer, "Scalable I/O-Aware Job Scheduling for Burst Buffer Enabled HPC Clusters," HPDC 2016. See Stephen's poster here at Salishan tomorrow night!

It's beneficial to coordinate IO between compute nodes and burst buffer nodes.
- Which BB nodes should a job write to?
- When should each job write?
- How to control who writes when?
Work shows that controlling these factors matters.
Sagar Thapaliya, Jay Lofstead, "Managing I/O Interference in a Shared Burst Buffer System"

User-level file systems

Consider the possibilities of specialized, user-level file systems that are set up and torn down with each job.
- Enable an application to mount more than one user-level file system
- Specialize each file system for a different task
- Intercept the application's libc calls (e.g., techniques from Darshan); an LD_PRELOAD sketch follows this list
- Intercept file system operations from the batch script using FUSE
- Let the app mount file systems based on its different IO requirements
- Store data on the parallel file system between jobs using pre/post stage if needed
Precursors: SCR, FTI, PLFS, HIO, CRUISE
[Figure: compute nodes 1 through 1000, each with its own SSD]
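A minimal sketch of libc interception in the style Darshan uses: an LD_PRELOAD shared library wraps open() so a user-level file system could redirect or record the call. The policy here (log only) is illustrative; a real interposer would rewrite paths onto the burst buffer mount.

    /* Build:  gcc -shared -fPIC -o libintercept.so intercept.c -ldl
     * Run:    LD_PRELOAD=./libintercept.so ./my_app                  */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <stdarg.h>
    #include <stdio.h>

    typedef int (*open_fn)(const char *, int, ...);

    int open(const char *path, int flags, ...)
    {
        /* Look up the real libc open() the first time we are called. */
        static open_fn real_open = NULL;
        if (real_open == NULL)
            real_open = (open_fn)dlsym(RTLD_NEXT, "open");

        mode_t mode = 0;
        if (flags & O_CREAT) {                 /* mode argument only present with O_CREAT */
            va_list ap;
            va_start(ap, flags);
            mode = (mode_t)va_arg(ap, int);
            va_end(ap);
        }

        /* A real user-level FS would rewrite 'path' here, e.g. onto a local SSD mount. */
        fprintf(stderr, "[intercept] open(%s)\n", path);

        return (flags & O_CREAT) ? real_open(path, flags, mode)
                                 : real_open(path, flags);
    }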

What's needed for distributed services?
- Efficient asynchronous services: where is my MPI_THREAD_MULTIPLE performance?
- Services are processes, too: MPI can't hog the CPU or the network, and services will need the same performance as the MPI job.
- Blocking vs. polling for incoming data: some HPC networks only provide polling; enable blocking in MPI without losing significant performance; ability to disable blocking from within the app (MPI_T).
A sketch of a service thread relying on MPI_THREAD_MULTIPLE follows.
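A hedged sketch of a node-local service running alongside the application: a helper thread blocks in MPI_Recv on its own communicator while the main thread does application work. It relies on MPI_THREAD_MULTIPLE; whether the blocking receive polls (burning a core) or truly blocks depends on the MPI implementation, which is exactly the concern raised above. The tag-0 shutdown convention is illustrative.

    #include <mpi.h>
    #include <pthread.h>
    #include <stdio.h>

    static MPI_Comm service_comm;

    static void *service_loop(void *arg)
    {
        (void)arg;
        int msg;
        MPI_Status st;
        for (;;) {
            /* Blocks until some rank sends a request; tag 0 means shut down. */
            MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, service_comm, &st);
            if (st.MPI_TAG == 0)
                break;
            /* ... handle request: stage a file, answer a metadata query, etc. ... */
        }
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) {
            fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        MPI_Comm_dup(MPI_COMM_WORLD, &service_comm);   /* keep service traffic separate */

        pthread_t tid;
        pthread_create(&tid, NULL, service_loop, NULL);

        /* ... application work and its own MPI communication happen here ... */

        int rank, stop = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Send(&stop, 1, MPI_INT, rank, 0, service_comm);  /* tag 0: stop my local service */
        pthread_join(tid, NULL);

        MPI_Comm_free(&service_comm);
        MPI_Finalize();
        return 0;
    }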

Recap

Recap: What's needed from system software to utilize burst buffers and node-local storage?
Resource manager:
- Interface to query storage properties
- User hooks in pre/post stage; avoid unnecessary data staging
- Persist datasets between runs
- Schedule jobs to the nodes where their data are (like Big Data job schedulers)
- Track IO counters and the batch script for each job
- Schedule for IO, not just node utilization
- Fast launch and wire-up for distributed services
Memory management:
- Good first step: ensure mmap works well for extended memory
MPI:
- Enable more asynchronous operations / services
- Enable other services to use the network and CPU (MPI can't be the only one)
- Enable efficient blocking while waiting for incoming messages
- Support true communication overlap for large Isend/Irecv