Early Evaluation of the "Infinite Memory Engine" Burst Buffer Solution

Early Evaluation of the "Infinite Memory Engine" Burst Buffer Solution Wolfram Schenck Faculty of Engineering and Mathematics, Bielefeld University of Applied Sciences, Bielefeld, Germany Salem El Sayed, Maciej Foszczynski, Wilhelm Homberg, Dirk Pleiter Jülich Supercomputing Centre, Forschungszentrum Jülich, Jülich, Germany WOPSSS 2016 Frankfurt, 23.06.2016

Outline: Introduction: The Burst Buffer Concept; Test System; General Benchmarks (IOR); NEST Benchmarks; Data Retention Time Analysis; Conclusions and Outlook. Slide 2

Introduction: The Burst Buffer Concept Slide 3

Need for New Storage Architectures Address the growing performance gap: floating-point performance B_fp grows faster than I/O bandwidth B_io, i.e. B_io/B_fp becomes smaller. For JUQUEEN we have B_io/B_fp = 1 Byte / 40,000 Flops. Mitigation strategy: hierarchical storage architecture with a fast but low-capacity storage tier and a large-capacity but slow storage tier. Emerging data-intensive applications need large storage capacity C_io, high bandwidth B_io, and high IOPS rates. Slide 4

Application Classes Dominant read: applications processing data retrieved by experiments or collected by observatories; applications analyzing data from huge databases ("big data"). Dominant write: applications from the area of simulation science, generating large amounts of data. Transient write/read: applications (or sets of applications) producing and consuming significant amounts of data on the same system; transient data: long-term storage often not necessary. [Figure: cluster connected to the main storage system] Slide 5

Conventional Storage System [Figure: cluster writing to the main storage system (arrow direction: dominant write); timeline of 10 time steps, each consisting of time spent with I/O and time spent with non-I/O operations] Slide 6

Enhanced by Burst Buffer Scenario: Sustained Performance [Figure: without a burst buffer, a full simulation cycle of compute work plus an I/O burst to the main storage system takes 10 time steps; with a burst buffer between cluster and main storage system, the same cycle takes 6 time steps] SPEEDUP = 10/6 = 1.67 Slide 7

Enhanced by Burst Buffer Scenario: Short-Term Peak Performance [Figure: without a burst buffer, the full simulation cycle including the I/O burst to the main storage system takes 18 time steps; with a burst buffer it takes 6 time steps] SPEEDUP = 18/6 = 3.0 Slide 8

Burst Buffer Concept Capacities: conventional main storage: large; burst buffer: small. Bandwidth: between cluster and burst buffer: high; between burst buffer and main storage: low. The speedup obtained via the burst buffer depends theoretically on (for dominant write): the I/O pattern of the application (continuous vs. in bursts), the I/O intensity of the application (low vs. high), and the runtime of the application (long vs. short); the more bursty and I/O-intensive the application and the shorter its runtime, the larger the expected speedup. Slide 9
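
To make the speedup argument concrete, here is a small back-of-the-envelope model in Python. It is only an illustration: the time-step counts and bandwidth values are chosen so that the result matches the sustained-performance scenario above; they are not measured values.

```python
# Toy model of the burst-buffer speedup for a "dominant write" application.
# One simulation cycle = compute time + time to drain one I/O burst, either
# directly to the main storage system (slow path) or to the burst buffer
# (fast path). All numbers are illustrative, not measurements.

def cycle_length(compute_steps, burst_volume, drain_bandwidth):
    """Length of one full simulation cycle in time steps."""
    return compute_steps + burst_volume / drain_bandwidth

compute_steps = 5        # time steps spent with non-I/O operations
burst_volume = 10.0      # data per I/O burst (arbitrary units)
bw_main_storage = 2.0    # units per time step towards the main storage system
bw_burst_buffer = 10.0   # units per time step towards the burst buffer

without_bb = cycle_length(compute_steps, burst_volume, bw_main_storage)  # 10 time steps
with_bb = cycle_length(compute_steps, burst_volume, bw_burst_buffer)     #  6 time steps
print(f"speedup = {without_bb / with_bb:.2f}")  # 1.67, as in the sustained scenario
```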

Infinite Memory Engine (by DDN) Realisation of a storage hierarchy. Upper tier = IME: very small capacity C_io, with C_io/B_io on the order of 10 min; leverages NVM technologies. External storage: very large C_io, with C_io/B_io on the order of 1 day; leverages HDD technologies. Benefits: high bandwidth + IOPS rate; compatibility with and support of any POSIX-compliant parallel file system. Challenges: re-organisation of I/O may be required to leverage the performance. [Figure: compute servers, IME tier, external storage] Slide 10

Using IME MPI I/O interface: uses the namespace of the parallel file system (PFS); a prefix controls where the created file is allocated, e.g. ime://gpfs/data/pleiter/file.dat; software-controlled sync from IME to PFS. POSIX interface: IME storage devices mounted using FUSE; uses the namespace of the parallel file system (PFS), but with a special mountpoint for IME (use the path via this mountpoint for direct access to IME); the choice of path controls whether IME or the PFS is used; software-controlled sync from IME to PFS. Slide 11
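
As an illustration of the two access paths, here is a minimal mpi4py sketch. The ime:// prefix follows the example on the slide; the FUSE mountpoint /ime and all file paths are hypothetical placeholders, and the exact prefix and mount conventions may differ between IME installations and software versions.

```python
# Minimal sketch of the two IME access paths described above.
# Assumptions: the ime:// prefix is accepted by the MPI-IO layer as shown on
# the slide, and /ime is a hypothetical FUSE mountpoint for the POSIX path.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
data = np.full(1024, rank, dtype=np.int32)

# (1) MPI I/O interface: the prefix decides where the created file is allocated.
fh = MPI.File.Open(comm, "ime://gpfs/data/pleiter/file.dat",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
fh.Write_at_all(rank * data.nbytes, data)  # contiguous, rank-ordered layout
fh.Close()

# (2) POSIX interface: writing through the (hypothetical) IME FUSE mountpoint
# sends the data to the burst buffer; using the plain GPFS path instead would
# bypass IME. Sync from IME to the PFS is triggered by software in both cases.
with open(f"/ime/gpfs/data/pleiter/rank{rank}.dat", "wb") as f:
    f.write(data.tobytes())
```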

Benchmarking Central goal of our study: benchmarking with a real-world system to check whether IME fulfills the theoretical expectations. Benchmarks: General performance: IOR [LLNL, 2003], a benchmarking tool for testing the performance of parallel file systems using various interfaces and access patterns. Computational science software from the dominant write class: NEST. Slide 12

Test System Slide 13

JUlich Dedicated GPU Environment (JUDGE) (decommissioned end of 2015). For our tests: up to 64 compute nodes from JUDGE, Scientific Linux 6.7, pre-release version of the IME software stack (Dec. 2015). [Figure: JUDGE system; credit: JSC] Slide 14

Test System Schematic overview of the integration of the IME servers at JSC: [Figure: JUDGE, the IME servers and the JUST/GPFS storage, connected by links of 64, 32, 20 and 10 Gbit/s; legend: IME = IME server, 24 SSDs with 200 GiB each (overall ca. 4.7 TiB), 2 IB host adapters (QDR)] Bandwidth to IME: 128 Gbit/s = 16 GByte/s. Bandwidth to GPFS: 20 Gbit/s = 2.5 GByte/s. Slide 15
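
The quoted capacity and bandwidth figures follow from straightforward unit conversions; the short Python sketch below reproduces them from the numbers on the slide.

```python
# Reproduce the capacity and bandwidth figures of the IME test system.
ssd_count, ssd_capacity_gib = 24, 200
print(f"IME capacity: {ssd_count * ssd_capacity_gib / 1024:.1f} TiB")  # ca. 4.7 TiB

ime_link_gbit, gpfs_link_gbit = 128, 20
print(f"Bandwidth to IME:  {ime_link_gbit / 8:.1f} GByte/s")   # 16.0 GByte/s
print(f"Bandwidth to GPFS: {gpfs_link_gbit / 8:.2f} GByte/s")  # 2.50 GByte/s
```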

General Benchmarks (IOR) IOR Settings [Table: IOR parameter settings used for the measurements] Slide 16

IOR Read Performance Bandwidth saturation reached with 4 nodes (GPFS) or 8 nodes (IME) Max. GPFS read bandwidth: 0.63 GByte/s (25% of nominal value) Max. IME read bandwidth: 13.8 GByte/s (86% of nominal value) Slide 17

IOR Write Performance Bandwidth saturation reached with 4 nodes (GPFS) or 8 nodes (IME) Max. GPFS write bandwidth: 0.75 GByte/s (33% of nominal value) Max. IME write bandwidth: 15.63 GByte/s (98% of nominal value) Slide 18

NEST Benchmarks Slide 19

The Human Brain Project HBP: Future & Emerging Technologies flagship project (co-)funded by European Commission Science-driven, seeded from FET, extending beyond ICT Ambitious, unifying goal, large-scale Goal To build an integrated ICT infrastructure enabling a global collaborative effort towards understanding the human brain, and ultimately to emulate its computational capabilities Slide 20

Brain Simulation (1) Simulation software: NEST (NEural Simulation Tool). Open source: www.nest-simulator.org / www.nest-initiative.org. Purpose: large-scale simulations of biologically realistic neuronal networks (focus on large networks, use of simple point neurons). [Figure: neuron with dendrites, soma and axon; a spike travels along the axon] Slide 21

Brain Simulation (2) In the human brain: ca. 100 bn neurons ca. 10,000 incoming connections per neuron Largest simulation so far: Simulation with 1 bn neurons (feasibility study on the K computer in Japan) I/O challenge: Simulations can produce huge amounts of data Right fig.: E. Torre, INM-6, Forschungszentrum Jülich Slide 22

Parallel Processing in NEST (VP: Virtual Process) [Figure: grid of virtual processes VP0...VP5, organized by the number of MPI ranks M and the number of threads per rank T; each VP hosts N_VP neurons] In the whole network: N neurons with N = M · T · N_VP. Slide 23
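
The sketch below illustrates the decomposition with small numbers: it computes the total network size from M, T and N_VP and shows a simple round-robin assignment of neurons to virtual processes of the kind NEST uses (the round-robin detail is background knowledge about NEST, not shown on the slide, and simplified here).

```python
# Virtual-process decomposition in NEST: M MPI ranks x T threads per rank
# gives M*T virtual processes (VPs), each hosting N_VP neurons.
M, T, N_VP = 2, 3, 4          # small illustrative values
n_vps = M * T
N = M * T * N_VP              # total number of neurons in the whole network
print(f"{n_vps} VPs, N = {N} neurons")

# Simplified round-robin distribution of neurons over VPs (NEST distributes
# neurons over VPs in a round-robin fashion by their global ID; the exact
# mapping of VPs to MPI ranks is an implementation detail omitted here).
for gid in range(1, N + 1):
    vp = (gid - 1) % n_vps
    print(f"neuron {gid:2d} -> VP{vp}")
```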

Simulation Cycle [Figure: cycle repeated every communication interval: process-internal routing of spike events to their target neurons (incl. synapse update); updating of neuronal states (incl. spike generation); exchange of spike events between MPI processes] Slide 24

Creating Spike Events during Neuron Update [Figure: the VP grid from before (M MPI ranks, T threads per rank, N_VP neurons per VP); red dots mark single spike events created during the update of the neurons in each VP] Slide 25

Simulation Cycle (revisited) [Figure: the same cycle per communication interval: process-internal routing of spike events to their target neurons (incl. synapse update); updating of neuronal states (incl. spike generation); exchange of spike events between MPI processes] Slide 26

Creation of Rank-Local Spike Buffers [Figure: the spike events created by the VPs of each MPI rank (M ranks, T threads per rank) are collected into a rank-local spike buffer] Slide 27

MPI Communication: Every rank receives all spike events [Figure: the rank-local spike buffers are exchanged between all MPI ranks] Slide 28

Simulation Cycle (revisited) [Figure: the same cycle per communication interval: process-internal routing of spike events to their target neurons (incl. synapse update); updating of neuronal states (incl. spike generation); exchange of spike events between MPI processes] Slide 29

I/O in NEST Data collected during simulations: spike events (recording device: spike detector) and state variables, e.g. the membrane potential of neurons (recording device: multimeter). Recording devices belong to an abstract node class: they are connected to the neurons from which measurements are collected, receive spike events (spike detector), send out measurement events (multimeter), and are updated like neurons (writing data during the update). Each recording device exists on every virtual process (VP) and writes its data via a C++ output stream into a text file (one file per device per VP). Slide 30
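
For readers unfamiliar with NEST's recording devices, the following minimal PyNEST sketch (NEST 2.x API; not the benchmark script used in this study) creates a small population, attaches a multimeter and a spike detector with file output enabled, and simulates 100 ms. With to_file set, NEST writes one text file per recording device per virtual process.

```python
# Minimal PyNEST (NEST 2.x) example of file-based recording; illustrative
# only, not the benchmark script used in the study.
import nest

nest.ResetKernel()
nest.SetKernelStatus({"local_num_threads": 2})  # 2 VPs per MPI rank here

neurons = nest.Create("iaf_psc_alpha", 100)
noise = nest.Create("poisson_generator", params={"rate": 8000.0})
nest.Connect(noise, neurons, syn_spec={"weight": 10.0})

# One multimeter and one spike detector; with to_file=True each device
# writes one text file per virtual process via a C++ output stream.
multimeter = nest.Create("multimeter",
                         params={"record_from": ["V_m"], "to_file": True})
spike_detector = nest.Create("spike_detector", params={"to_file": True})

nest.Connect(multimeter, neurons)        # multimeter polls the neurons
nest.Connect(neurons, spike_detector)    # neurons send spikes to the detector

nest.Simulate(100.0)                     # 100 ms biological time, as in the benchmark
```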

Simulation Script for Benchmark: Random Balanced Network One spike detector and one multimeter per population (created last, after all neurons). Overall 4 recording devices (= C++ output streams) per VP. [Figure: network diagram; credit: Nadine Daivandy (JSC)] Slide 31

Simulation Cycle (revisited) [Figure: the cycle per communication interval, now including I/O: process-internal routing of spike events to their target neurons (incl. synapse update); update of recording devices (I/O BURST); updating of neuronal states (incl. spike generation); exchange of spike events between MPI processes] Slide 32

Design of Experiment Factor 1: number of compute nodes: 1, 2, 4, 8, 16; strict weak scaling design: number of neurons per node constant. Factor 2: amount of written data per node, manipulated via the number of state variables recorded by each multimeter (1 to 22), corresponding to 1 GiB/node to 8 GiB/node (the amount of spike data is insignificant). Factor 3: output file system: 1. POSIX I/O to GPFS; 2. POSIX I/O to IME; 3. POSIX I/O to /dev/null: baseline condition, "infinitely fast storage device". Further experimental settings: simulated biological time: 100 ms; network size: 258,750 neurons per compute node, ca. 3e8 synapses per compute node; 23 MPI ranks per compute node; 5 runs per condition, minimum reported. Slide 33
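
The three factors span a simple run matrix; the sketch below enumerates it. The node counts, data volumes and the POSIX2IME/POSIX2DEVNULL labels come from the slides; the POSIX2GPFS label is an assumed analogue, and the two data volumes stand for the extremes of the range obtained by varying the number of recorded state variables.

```python
# Enumerate the experimental conditions of the NEST benchmark
# (5 runs per condition, minimum runtime reported).
from itertools import product

nodes = [1, 2, 4, 8, 16]                            # strict weak scaling
data_per_node = ["1 GiB", "8 GiB"]                  # extremes of the measured range
filesystems = ["POSIX2GPFS", "POSIX2IME", "POSIX2DEVNULL"]
runs_per_condition = 5

conditions = list(product(nodes, data_per_node, filesystems))
print(f"{len(conditions)} conditions, {len(conditions) * runs_per_condition} runs in total")
for n_nodes, volume, fs in conditions:
    print(f"{n_nodes:2d} nodes | {volume}/node | {fs}")
```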

Bandwidth (1 GiB/node) Slide 34

Bandwidth (8 GiB/node) Slide 35

Bandwidth (1 and 8 GiB/node) POSIX2IME very close to POSIX2DEVNULL: IME close to "ideal" performance. Very good scaling behavior of IME: the observed bandwidth nearly doubles with each doubling of the number of compute nodes. Poor scaling behavior of GPFS beyond 4 compute nodes. The observed bandwidth is small compared to the IOR measurements. Slide 36

Simulation Cycle (revisited) [Figure: the cycle per communication interval, now including I/O: process-internal routing of spike events to their target neurons (incl. synapse update); update of recording devices (I/O BURST); updating of neuronal states (incl. spike generation); exchange of spike events between MPI processes] Slide 37

Simulation Time (1 GiB/node) Effective simulation time = simulation time without step 3 (MPI synchr.) Slide 38

Simulation Times (8 GiB/node) Effective simulation time = simulation time without step 3 (MPI synchr.) Slide 39

Simulation Time: Observations The larger the number of nodes, the stronger the advantage of writing to IME or /dev/null. The very good scaling behavior of IME is clearly visible in the plots. The GPFS setting suffers heavily from imbalance between ranks. IME nearly reaches the performance of /dev/null; barely any I/O-induced additional imbalance between ranks. Slide 40

Relative Runtime Reduction Reported values are based on the average over all measured I/O loads. Slide 41

Data Retention Time Analysis Slide 42

Motivation: Interactive Supercomputing Data retention time analysis: classification of data depending on how long it will be retained. Interactive supercomputing/HPC: the user can interact with the application(s) running on the supercomputer/cluster. Misc. use cases for NEST. Slide 43

NEST: Data Retention Times Data retention time analysis: Classification of data depending on how long it will be retained Slide 44

Conclusions and Outlook Slide 45

Conclusions IOR results: IME saturated ca. 90% of the nominal bandwidth in reading and writing; a promising finding for all considered application classes. NEST results: barely any I/O-induced imbalance between ranks with IME (in contrast to GPFS); IME performance close to the baseline condition (/dev/null), with nearly perfect weak scaling behavior; at the largest problem size, a speedup of nearly 4 was achieved vs. GPFS; easy handling: no code changes in NEST required. Conclusions: IME actually works as theoretically expected for applications from the dominant write class (writing in bursts); NEST users would strongly profit from the incorporation of IME in compute clusters (I/O no longer a limiting factor in gathering simulation results). Slide 46

Outlook and Recommendations Recommendations for the future development of IME: Data pre-fetching: for "dominant read" applications, data pre-fetching before job start would be highly beneficial; integration into job managers? Development of tools for managing short-term and transient data, and their integration into job managers. Support for end-to-end data integrity as within GPFS. Final word: IME shows that working burst buffer solutions exist for complex parallel applications; this is an opportunity to scale compute and I/O performance, or alternatively to reduce the bandwidth requirements for the external storage system. Slide 47

Questions? Thank you for your attention! Acknowledgements: We would like to thank DDN for making an IME test system available at Jülich Supercomputing Centre. In particular, we gratefully acknowledge the continuous support by Tommaso Cecchi and Toine Beckers. Slide 48