On the Use of Burst Buffers for Accelerating Data-Intensive Scientific Workflows


On the Use of Burst Buffers for Accelerating Data-Intensive Scientific Workflows
Rafael Ferreira da Silva, Scott Callaghan, Ewa Deelman
12th Workshop on Workflows in Support of Large-Scale Science (WORKS), SuperComputing '17, November 13, 2017
Funded by the US Department of Energy under Grant #DE-SC0012636

OUTLINE
Introduction: Next-generation Supercomputers; Data-intensive Workflows; In-transit / In-situ
Burst Buffers: Overview; Node-local / Remote-shared; NERSC BB
Model and Design: BB Reservations; Execution Directory; I/O Read Operations
Workflow Application: CyberShake Workflow; Implementation; I/O Performance with Darshan
Experimental Results: Overall Write/Read Operations; I/O Performance per Process; Cumulative CPU Time; Rupture Files
Conclusion: Summary of Findings; Future Directions

A Brief Introduction and Contextualization
Traditionally, workflows have used the file system to communicate data between tasks. To cope with increasing application demands on I/O operations, solutions targeting in situ and/or in transit processing have become mainstream approaches to mitigate I/O performance bottlenecks.
Next Generation of Exascale Supercomputers
Processing capabilities will increase to over 10^18 Flop/s, and memory and disk capacity will also grow significantly, along with better power consumption management. However, the I/O performance of the parallel file system (PFS) is not expected to improve much.

Data-Intensive Scientific Workflows
CyberShake Workflow: consumes/produces over 700 GB of data.
1000 Genome Workflow: consumes/produces over 4.4 TB of data, and requires over 24 TB of memory across all tasks.
While in situ processing is well adapted for computations that conform to the data distribution imposed by simulations, in transit processing targets applications where intensive data transfers are required.

Improving I/O Performance with Burst Buffers: In Transit Processing
Burst buffers have emerged as a non-volatile storage solution positioned between the processors' memory and the PFS, buffering the large volume of data produced by the application at a higher rate than the PFS, while seamlessly draining the data to the PFS asynchronously.
A burst buffer combines rapidly accessed persistent memory with its own processing power (e.g., DRAM) and a block of symmetric multi-processor compute accessible through high-bandwidth links (e.g., PCI Express).
Node-local vs. Remote-shared
Figure: Placement of the burst buffer nodes within the Cori system (NERSC).

First Burst Buffers Use at Scale: Cori System (NERSC)
Cori is a petascale HPC system, ranked #6 on the June 2017 Top500 list.
The NERSC BB is based on Cray DataWarp (Cray's implementation of the BB concept). Each BB node contains a Xeon processor, 64 GB of DDR3 memory, and two 3.2 TB NAND flash SSD modules attached over two PCIe Gen3 x8 interfaces; the node itself is attached to the Cray Aries network interconnect over a PCIe Gen3 x16 interface. Each BB node delivers ~6.5 GB/s of sequential read and write bandwidth.
Figure: Architectural overview of a burst-buffer node on Cori at NERSC (compute nodes reach the BB nodes over Aries; the BB nodes drain data to the storage servers through the storage area network).

Model and Design: Enabling BB in Workflow Systems
Automated BB Reservations
BB reservation operations (either persistent or scratch) consist of creation and release, as well as stage-in and stage-out operations. Transient reservations need stage-in/out operations to be performed at the beginning/end of each job execution.
I/O Read Operations
Read operations from the BB should be transparent to the applications.
Execution Directory
Automated mapping between the workflow execution directory and the BB reservation: no changes to the application code are necessary, and the application job writes its output directly to the BB reservation. Approaches: point the execution directory to the BB reservation, or create symbolic links to data endpoints in the BB; a minimal sketch of both approaches follows below.
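As a minimal sketch only (assuming Cori's Slurm/DataWarp setup, where a persistent reservation named csbb is exposed to jobs through the DW_PERSISTENT_STRIPED_csbb environment variable; the application binary and paths below are hypothetical):

```bash
#!/bin/bash
#SBATCH -p regular
#SBATCH -N 1
#SBATCH -C haswell
#SBATCH -t 00:30:00
#DW persistentdw name=csbb

# Assumption: the persistent reservation "csbb" already exists (created by a setup job)
# and DataWarp exposes its mount point through DW_PERSISTENT_STRIPED_csbb.
BB_DIR="${DW_PERSISTENT_STRIPED_csbb}/run-${SLURM_JOB_ID}"
mkdir -p "$BB_DIR"

# Approach 1: point the job's execution directory to the BB reservation.
cd "$BB_DIR"

# Approach 2 (alternative): keep the PFS execution directory and create symbolic
# links to the data endpoints inside the BB, so I/O is transparently redirected.
# ln -s "$BB_DIR" "$SCRATCH/workflow-run/data"

srun -n 32 ./workflow_task   # hypothetical application binary
```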

Workflow Application: CyberShake
CyberShake is a high-performance computing software platform that uses 3D waveform modeling to calculate probabilistic seismic hazard analysis (PSHA) estimates for populated areas of California.
It constructs and populates a 3D mesh of ~1.2 billion elements with seismic velocity data to compute Strain Green Tensors (SGTs). In post-processing, the SGTs are convolved with slip time histories for each of about 500,000 different earthquakes to generate synthetic seismograms for each event.
We focus on the two CyberShake job types which together account for 97% of the compute time: the wave propagation code AWP-ODC-SGT, and the post-processing code DirectSynth.
Figure: CyberShake hazard map for Southern California, showing the spectral accelerations at a 2-second period exceeded with a probability of 2% in 50 years.

Burst Buffers: Workflow Implementation
The workflow, managed with Pegasus WMS, is composed of two tightly-coupled parallel jobs (SGT_generator and direct_synth) and two system jobs (bb_setup and bb_delete). SGT_generator produces the SGT files (fx.sgt, fx.sgtheader, fy.sgt, fy.sgtheader), which flow into the direct_synth jobs that produce the seismogram, rotd, and peakvals outputs. The workflow generates/consumes about 550 GB of data.
https://github.com/rafaelfsilva/bb-workflow
bb_setup job (goes through the regular queuing process):
#SBATCH -p debug
#SBATCH -N 1
#SBATCH -C haswell
#SBATCH -t 00:05:00
#BB create_persistent name=csbb capacity=700gb access=striped type=scratch
Compute jobs (SGT_generator / direct_synth):
#SBATCH -p regular
#SBATCH -N 64
#SBATCH -C haswell
#SBATCH -t 05:00:00
#DW persistentdw name=csbb
Figure: A general representation of the CyberShake test workflow (data flow and control flow between bb_setup, SGT_generator, the direct_synth jobs, and bb_delete).
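The bb_delete system job's script is not shown on the slide; the following is a sketch of what it could look like, assuming the Slurm/DataWarp directive for releasing a persistent reservation (the exact directive used in the actual workflow is not given in the source):

```bash
#!/bin/bash
#SBATCH -p debug
#SBATCH -N 1
#SBATCH -C haswell
#SBATCH -t 00:05:00
#BB destroy_persistent name=csbb

# The #BB directive above asks the Slurm burst buffer (Cray DataWarp) plugin to tear
# down the persistent reservation created by bb_setup; the job body has no other work.
echo "Requested release of burst buffer reservation csbb"
```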

Collecting I/O Performance Data with Darshan
Darshan: HPC I/O Characterization Tool
Darshan is a lightweight HPC I/O profiling tool that captures an accurate picture of I/O behavior (including POSIX I/O, MPI-IO, and HDF5 I/O) in MPI applications. Darshan is part of the default software stack on Cori.
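As an illustrative sketch of how the counters analyzed in the following slides can be extracted (the log file name below is hypothetical, and the log directory is site-configured; only the standard darshan-util command-line tools are used):

```bash
# Hypothetical Darshan log produced by a direct_synth run.
LOG=fsilva_direct_synth_id12345.darshan

# Dump all captured modules and counters in human-readable form.
darshan-parser "$LOG" | less

# Keep only the MPI-IO and POSIX counters (e.g., bytes read/written, time spent in I/O),
# the two modules whose data are analyzed in the results slides.
darshan-parser "$LOG" | grep -E "MPIIO_|POSIX_"

# Optional: generate the graphical per-job summary report.
darshan-job-summary.pl "$LOG"
```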

Experimental Results: Overall Write Operations
Figure: Average I/O performance estimate (MiB/s) for write operations at the MPI-IO layer (left), and average I/O write performance gain (right), for the SGT_generator job as a function of the number of nodes (1 to 313), with and without the BB.
Overall, write operations to the PFS (No-BB) show nearly constant I/O performance: ~900 MiB/s regardless of the number of nodes used.
Base values obtained for the BB executions (1 node, 32 cores) are over 4,600 MiB/s, and peak values scale up to 8,200 MiB/s for 32 nodes (1,024 cores).
A slight drop in I/O performance is observed at 64 nodes and above, due to the large number of concurrent write operations.
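As a back-of-the-envelope check (notation introduced here, not in the original), the write performance gain reported in the conclusion follows directly from these bandwidth estimates:

\[
\text{gain}_{\text{write}} \;\approx\; \frac{BW^{\text{peak}}_{\text{BB}}}{BW_{\text{PFS}}} \;\approx\; \frac{8{,}200\ \text{MiB/s}}{900\ \text{MiB/s}} \;\approx\; 9
\]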

Experimental Results: Overall Read Operations
Figure: I/O performance estimate (MiB/s) for read operations at the MPI-IO layer (left), and average I/O read performance gain (right), for the direct_synth job as a function of the number of nodes (1 to 128), with and without the BB.
I/O read operations from the PFS yield similar performance regardless of the number of nodes used: ~500 MiB/s.
With the BB, single-node performance is about 4,000 MiB/s, with peak values up to about 8,000 MiB/s.
A small drop in performance for runs using 64 nodes or above may indicate an I/O bottleneck when draining the data to/from the underlying parallel file system.

Experimental Results: I/O Performance per Process
Figure: Average time (seconds) per process spent in I/O read operations of fx.sgt and fy.sgt for the direct_synth job, from the POSIX module (left) and the MPI-IO module (right), as a function of the number of nodes, with and without the BB.
The POSIX operations represent buffering and synchronization operations with the system; these POSIX values are negligible when compared to the job's total runtime (~8h for 64 nodes).
At the MPI-IO layer, the BB accelerates I/O read operations by up to 10 times on average; for larger configurations (32 nodes or more), the average time is nearly the same as when running with 16 nodes in the No-BB case.

Experimental Results: Cumulative CPU Time
Figure: Ratio between the cumulative time spent in the user (utime) and kernel (stime) spaces for the direct_synth job, for different numbers of nodes, with and without the BB. The stime component captures, among other things, the handling of I/O-related interruptions.
Averaged values (over up to thousands of cores) may mask slower processes: in some cases, e.g. 64 nodes, the slowest process's time spent in I/O read operations can slow the application down by up to 12 times the averaged value.
Performance at 64 nodes is similar to 32 nodes, suggesting that gains in application parallel efficiency would outweigh the slight I/O performance hit at 64 nodes and lead to decreased overall runtime.

Experimental Results: Rupture Files
Figure: Ratio between the cumulative time spent in the user (utime) and kernel (stime) spaces for the direct_synth job, for different numbers of rupture files (workflow runs with 64 nodes), with and without the BB.
A typical execution of the CyberShake workflow for a selected site in our experiment processes about 5,700 rupture files, and the processing of rupture files drives most of the CPU (user-space) activity for the direct_synth job.
The use of a BB reduces the I/O processing time of the workflow jobs by about 15%, for both read and write operations.

Conclusion and Future Work
Major Findings
I/O write performance was improved by a factor of 9, and I/O read performance by a factor of 15. Performance decreased slightly at node counts of 64 and above (a potential I/O ceiling). I/O performance must be balanced with parallel efficiency when using burst buffers with highly parallel applications. I/O contention may limit the broad applicability of burst buffers to all workflow applications (e.g., in situ processing).
What's Next?
Solutions such as I/O-aware scheduling or in situ processing may also not fulfill all application requirements. We intend to investigate the use of combined in situ and in transit analysis, and to develop a production solution for the Pegasus workflow management system.

ON THE USE OF BURST BUFFERS FOR ACCELERATING DATA-INTENSIVE SCIENTIFIC WORKFLOWS
Thank You. Questions?
Funded by the US Department of Energy under Grant #DE-SC0012636
Rafael Ferreira da Silva, Ph.D.
Research Assistant Professor, Computer Science Department
Computer Scientist, USC Information Sciences Institute
rafsilva@isi.edu http://rafaelsilva.com