Early Evaluation of the "Infinite Memory Engine" Burst Buffer Solution
Wolfram Schenck (Faculty of Engineering and Mathematics, Bielefeld University of Applied Sciences, Bielefeld, Germany)
Salem El Sayed, Maciej Foszczynski, Wilhelm Homberg, Dirk Pleiter (Jülich Supercomputing Centre, Forschungszentrum Jülich, Jülich, Germany)
WOPSSS 2016, Frankfurt, 23.06.2016
Outline
- Introduction: The Burst Buffer Concept
- Test System
- General Benchmarks (IOR)
- NEST Benchmarks
- Data Retention Time Analysis
- Conclusions and Outlook
Slide 2
Introduction: The Burst Buffer Concept Slide 3
Need for New Storage Architectures
- Address the growing performance gap: floating-point performance B_fp grows faster than I/O bandwidth B_io, i.e. B_io/B_fp becomes smaller
  - For JUQUEEN: B_io/B_fp = 1 Byte / 40,000 Flops
- Mitigation strategy: hierarchical storage architecture
  - Fast but low-capacity storage tier
  - Large-capacity but slow storage tier
- Emerging data-intensive applications need large storage capacity C_io, high bandwidth B_io, and high IOPS rates
Slide 4
Application Classes
- Dominant read
  - Applications processing data retrieved by experiments or collected by observatories
  - Applications analyzing data from huge databases ("big data")
- Dominant write
  - Applications from the area of simulation science, generating large amounts of data
- Transient write/read
  - Applications (or sets of applications) producing and consuming significant amounts of data on the same system
  - Transient data: long-term storage often not necessary
[Figure: cluster connected to main storage system]
Slide 5
Conventional Storage System
[Figure: cluster writing to main storage system; arrow direction indicates dominant write. A full simulation cycle takes 10 time steps, some spent on I/O and the rest on non-I/O operations.]
Slide 6
Enhanced by Burst Buffer (Scenario: Sustained Performance)
[Figure: without a burst buffer, the full simulation cycle including the I/O burst to the main storage system takes 10 time steps; with a burst buffer between cluster and main storage, it takes 6 time steps.]
SPEEDUP = 10/6 = 1.67
Slide 7
Enhanced by Burst Buffer (Scenario: Short-Term Peak Performance)
[Figure: without a burst buffer, the full simulation cycle including the I/O burst takes 18 time steps; with a burst buffer, it takes 6 time steps.]
SPEEDUP = 18/6 = 3.0
Slide 8
Burst Buffer Concept
- Capacities: conventional main storage: large; burst buffer: small
- Bandwidth: between cluster and burst buffer: high; between burst buffer and main storage: low
- Speedup obtained via burst buffer depends theoretically on (for dominant write; see the model sketch below):
  - I/O pattern of application: continuous vs. in bursts
  - I/O intensity of application: low vs. high
  - Runtime of application: long vs. short
  (arrow in the original figure: increasing speedup)
Slide 9
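The speedup values on the two preceding slides follow from a simple cycle-time model. A minimal sketch in our own notation (t_compute, D, and B_bb are our symbols, not from the slides):

```latex
% Burst-buffer speedup model for the dominant-write case (sketch):
% t_compute = non-I/O time per simulation cycle,
% D         = data volume of one I/O burst,
% B_io      = bandwidth from cluster to main storage,
% B_bb      = bandwidth from cluster to burst buffer (B_bb > B_io).
\[
  S = \frac{T^{\text{no BB}}_{\text{cycle}}}{T^{\text{BB}}_{\text{cycle}}}
    = \frac{t_{\text{compute}} + D/B_{\text{io}}}{t_{\text{compute}} + D/B_{\text{bb}}}
\]
% Slide 7 (sustained): 10 vs. 6 time steps, S = 10/6 = 1.67.
% Slide 8 (short-term peak): 18 vs. 6 time steps, S = 18/6 = 3.0.
```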
Infinite Memory Engine (by DDN)
- Realisation of a storage hierarchy:
  - Upper tier = IME: very small C_io/B_io (about 10 min); leverages NVM technologies
  - External storage: very large C_io/B_io (O(1 day)); leverages HDD technologies
- Benefits: high bandwidth + IOPS rate; compatibility with and support of any POSIX-compliant parallel file system
- Challenges: re-organisation of I/O may be required to leverage performance
[Figure: compute servers, IME tier, external storage]
Slide 10
Using IME
- MPI I/O interface:
  - Uses the namespace of the parallel file system (PFS)
  - A prefix controls where a created file is allocated, e.g. ime://gpfs/data/pleiter/file.dat
  - Software-controlled sync from IME to PFS
- POSIX interface:
  - IME storage devices mounted using FUSE
  - Uses the namespace of the parallel file system (PFS), but with a special mountpoint for IME (use a path via this mountpoint for direct access to IME)
  - Choice of path allows control over use of IME or PFS
  - Software-controlled sync from IME to PFS
(See the sketch below for both access paths.)
Slide 11
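To make the two access paths concrete, here is a minimal sketch (not from the slides; the exact prefix handling and the FUSE mountpoint /ime are site-specific assumptions):

```cpp
// Sketch: writing one file to IME via MPI I/O and one via the POSIX/FUSE path.
// Assumptions: the MPI-IO driver resolves the "ime://" prefix shown on the
// slide, and the IME FUSE mountpoint is "/ime" (hypothetical).
#include <mpi.h>
#include <fstream>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<char> buf(1 << 20, 'x');  // 1 MiB payload per rank

    // (1) MPI I/O interface: the prefix routes the file to IME, while the
    //     path itself stays within the GPFS namespace.
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "ime://gpfs/data/pleiter/file.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at(fh, static_cast<MPI_Offset>(rank) * buf.size(),
                      buf.data(), static_cast<int>(buf.size()), MPI_CHAR,
                      MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    // (2) POSIX interface: the same namespace, reached through the special
    //     IME mountpoint; the chosen path decides between IME and PFS.
    if (rank == 0) {
        std::ofstream out("/ime/gpfs/data/pleiter/file_posix.dat",
                          std::ios::binary);
        out.write(buf.data(), static_cast<std::streamsize>(buf.size()));
    }

    MPI_Finalize();
    return 0;
}
```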
Benchmarking
- Central goal of our study: benchmarking with a real-world system to check whether IME fulfills the theoretical expectations
- Benchmarks:
  - General performance: IOR [LLNL, 2003], a benchmarking tool for testing the performance of parallel file systems using various interfaces and access patterns
  - Computational science software from the dominant-write class: NEST
Slide 12
Test System Slide 13
JUlich Dedicated GPU Environment (JUDGE) (decommissioned end of 2015)
For our tests:
- Up to 64 compute nodes from JUDGE
- Scientific Linux 6.7
- Pre-release version of the IME software stack (Dec. 2015)
Figure: JSC
Slide 14
Test System
Schematic overview of the integration of the IME servers at JSC:
[Figure: compute nodes connected via InfiniBand to the IME servers and to the JUST GPFS storage system; individual link bandwidths range from 10 to 64 Gbit/s.]
- IME = IME server: 24 SSDs with 200 GiB each (overall ca. 4.7 TiB), 2 IB host adapters (QDR)
- Bandwidth to IME: 128 Gbit/s = 16 GByte/s
- Bandwidth to GPFS: 20 Gbit/s = 2.5 GByte/s
Slide 15
General Benchmarks (IOR): IOR Settings
[Table: IOR settings]
Slide 16
IOR Read Performance
- Bandwidth saturation reached with 4 nodes (GPFS) or 8 nodes (IME)
- Max. GPFS read bandwidth: 0.63 GByte/s (25% of nominal value)
- Max. IME read bandwidth: 13.8 GByte/s (86% of nominal value)
Slide 17
IOR Write Performance
- Bandwidth saturation reached with 4 nodes (GPFS) or 8 nodes (IME)
- Max. GPFS write bandwidth: 0.75 GByte/s (33% of nominal value)
- Max. IME write bandwidth: 15.63 GByte/s (98% of nominal value)
Slide 18
NEST Benchmarks Slide 19
The Human Brain Project
- HBP: Future & Emerging Technologies flagship project (co-)funded by the European Commission
  - Science-driven, seeded from FET, extending beyond ICT
  - Ambitious, unifying goal, large-scale
- Goal: to build an integrated ICT infrastructure enabling a global collaborative effort towards understanding the human brain, and ultimately to emulate its computational capabilities
Slide 20
Brain Simulation (1)
- Simulation software: NEST (NEural Simulation Tool)
- Open source: www.nest-simulator.org / www.nest-initiative.org
- Purpose: large-scale simulations of biologically realistic neuronal networks (focus on large networks, use of simple point neurons)
[Figure: neuron with soma, dendrites, and axon; a spike travelling along the axon]
Slide 21
Brain Simulation (2)
- In the human brain: ca. 100 bn neurons, ca. 10,000 incoming connections per neuron
- Largest simulation so far: 1 bn neurons (feasibility study on the K computer in Japan)
- I/O challenge: simulations can produce huge amounts of data
Right fig.: E. Torre, INM-6, Forschungszentrum Jülich
Slide 22
Parallel Processing in NEST (VP: virtual process)
[Figure: grid of virtual processes VP0 to VP5, arranged as M MPI ranks times T threads per rank; each VP hosts N_VP neurons]
In the whole network: N neurons with N = M · T · N_VP
Slide 23
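Written out as a formula, with an illustrative choice of M and T that matches the six VPs shown in the figure (the concrete values are our assumption, not from the slides):

```latex
% Total network size in NEST (VP = virtual process):
\[ N = M \cdot T \cdot N_{\mathrm{VP}} \]
% Illustrative example: M = 2 MPI ranks and T = 3 threads per rank
% yield M*T = 6 virtual processes (VP0..VP5), hence N = 6 * N_VP.
```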
Simulation Cycle
[Figure: circular diagram; one pass corresponds to one communication interval]
1. Process-internal routing of spike events to their target neurons (incl. synapse update)
2. Updating of neuronal states (incl. spike generation)
3. Exchange of spike events between MPI processes
Slide 24
Creating Spike Events during Neuron Update
[Figure: VP grid as before; red dots mark single spike events created in the neurons of each VP]
Slide 25
Simulation Cycle (revisited)
[Figure: circular diagram; one pass corresponds to one communication interval]
1. Process-internal routing of spike events to their target neurons (incl. synapse update)
2. Updating of neuronal states (incl. spike generation)
3. Exchange of spike events between MPI processes
Slide 26
Creation of Rank-Local Spike Buffers
[Figure: VP grid as before; the spike events of all threads of a rank are collected into one rank-local spike buffer]
Slide 27
MPI Communication: Every Rank Receives All Spike Events
[Figure: the rank-local spike buffers are exchanged via MPI so that every rank holds all spike events]
(A sketch of this exchange pattern follows below.)
Slide 28
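As an illustration of this exchange, a minimal MPI sketch (our simplification: spike events are reduced to sender ids; NEST's actual buffer layout and communication scheme differ in detail):

```cpp
// Sketch of step 3: every rank receives all spike events of the
// communication interval. Simplified stand-in for NEST's implementation.
#include <mpi.h>
#include <vector>

using SpikeBuffer = std::vector<int>;  // spike event = global id of sender

SpikeBuffer exchange_spikes(const SpikeBuffer& local, MPI_Comm comm) {
    int nranks;
    MPI_Comm_size(comm, &nranks);

    // Step A: share how many spike events each rank contributes.
    int nlocal = static_cast<int>(local.size());
    std::vector<int> counts(nranks);
    MPI_Allgather(&nlocal, 1, MPI_INT, counts.data(), 1, MPI_INT, comm);

    // Step B: gather all spike events on every rank.
    std::vector<int> displs(nranks, 0);
    for (int r = 1; r < nranks; ++r)
        displs[r] = displs[r - 1] + counts[r - 1];
    SpikeBuffer global(displs.back() + counts.back());
    MPI_Allgatherv(local.data(), nlocal, MPI_INT, global.data(),
                   counts.data(), displs.data(), MPI_INT, comm);
    return global;  // each rank now routes these to its local targets
}
```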
Simulation Cycle (revisited)
[Figure: circular diagram; one pass corresponds to one communication interval]
1. Process-internal routing of spike events to their target neurons (incl. synapse update)
2. Updating of neuronal states (incl. spike generation)
3. Exchange of spike events between MPI processes
Slide 29
I/O in NEST
- Data collected during simulations:
  - Spike events; recording device: spike detector
  - State variables (e.g., membrane potential of neurons); recording device: multimeter
- Recording devices belong to an abstract node class:
  - Connected to neurons (from which measurements are collected)
  - Receive spike events (spike detector)
  - Send out measurement events (multimeter)
  - Updated like neurons (writing data during update)
- Each recording device exists on every virtual process (VP) and writes data via a C++ output stream into a text file (one file per device per VP); see the sketch below
Slide 30
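A rough sketch of how such a per-VP recording device might look (the class and the file-naming scheme are our illustration, not NEST's actual code):

```cpp
// Sketch of a NEST-style recording device: one instance per virtual process,
// each writing through its own C++ output stream into a text file
// (one file per device per VP). Names and layout are illustrative only.
#include <fstream>
#include <sstream>
#include <string>

class RecordingDevice {
public:
    RecordingDevice(const std::string& label, int device_id, int vp) {
        std::ostringstream name;
        name << label << "-" << device_id << "-" << vp << ".dat";
        out_.open(name.str());
    }

    // Called during the device update (devices are updated like neurons);
    // this write is what produces the I/O burst once per cycle.
    void record(int sender_gid, double time_ms, double value) {
        out_ << sender_gid << '\t' << time_ms << '\t' << value << '\n';
    }

private:
    std::ofstream out_;
};

// Usage example: a multimeter stream on virtual process 5.
// RecordingDevice mm("multimeter", 42, 5);
// mm.record(1001, 0.1, -65.0);
```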
Simulation Script for Benchmark: Random Balanced Network
- One spike detector and one multimeter per population (created last, after all neurons)
- Overall 4 recording devices (= C++ output streams) per VP
Fig.: Nadine Daivandy (JSC)
Slide 31
Simulation Cycle (revisited)
[Figure: circular diagram; one pass corresponds to one communication interval]
1. Process-internal routing of spike events to their target neurons (incl. synapse update)
2. Updating of neuronal states (incl. spike generation); the recording devices are updated here as well, producing the I/O BURST
3. Exchange of spike events between MPI processes
Slide 32
Design of Experiment
- Factor 1: number of compute nodes: 1, 2, 4, 8, 16
  - Strict weak scaling design: number of neurons per node constant
- Factor 2: amount of written data per node, manipulated via the number of state variables recorded by each multimeter (1 to 22), corresponding to 1 GiB/node up to 8 GiB/node (the amount of spike data is insignificant)
- Factor 3: output file system:
  1. POSIX I/O to GPFS
  2. POSIX I/O to IME
  3. POSIX I/O to /dev/null: baseline condition, "infinitely fast storage device"
- Further experimental settings:
  - Simulated biological time: 100 ms
  - Network size: 258,750 neurons per compute node, ca. 3e8 synapses per compute node
  - 23 MPI ranks per compute node
  - 5 runs per task condition, minimum reported
Slide 33
Bandwidth (1 GiB/node)
[Figure: observed bandwidth vs. number of compute nodes for GPFS, IME, and /dev/null]
Slide 34
Bandwidth (8 GiB/node)
[Figure: observed bandwidth vs. number of compute nodes for GPFS, IME, and /dev/null]
Slide 35
Bandwidth (1 and 8 GiB/node)
- POSIX2IME very close to POSIX2DEVNULL: IME close to "ideal" performance
- Very good scaling behavior of IME: observed bandwidth nearly doubles with each doubling of the number of compute nodes
- Poor scaling behavior of GPFS beyond 4 compute nodes; observed bandwidth small compared to the IOR measurements
Slide 36
Simulation Cycle (revisited)
[Figure: circular diagram; one pass corresponds to one communication interval]
1. Process-internal routing of spike events to their target neurons (incl. synapse update)
2. Updating of neuronal states (incl. spike generation); the recording devices are updated here as well, producing the I/O BURST
3. Exchange of spike events between MPI processes
Slide 37
Simulation Time (1 GiB/node)
[Figure: simulation times across ranks for GPFS, IME, and /dev/null]
Effective simulation time = simulation time without step 3 (MPI synchronisation)
Slide 38
Simulation Time (8 GiB/node)
[Figure: simulation times across ranks for GPFS, IME, and /dev/null]
Effective simulation time = simulation time without step 3 (MPI synchronisation)
Slide 39
Simulation Time: Observations
- The larger the number of nodes, the stronger the advantage of writing to IME or /dev/null
- The very good scaling behavior of IME is clearly visible in the plots
- The GPFS setting suffers heavily from imbalance between ranks
- IME nearly reaches the performance of /dev/null; barely any I/O-induced additional imbalance between ranks
Slide 40
Relative Runtime Reduction
[Figure: relative runtime reduction vs. number of compute nodes]
Reported values are based on the average over all measured I/O loads
Slide 41
Data Retention Time Analysis Slide 42
Motivation: Interactive Supercomputing
- Data retention time analysis: classification of data depending on how long it will be retained
- Interactive supercomputing/HPC: the user can interact with the application(s) running on the supercomputer/cluster
- Misc. use cases for NEST
Slide 43
NEST: Data Retention Times
Data retention time analysis: classification of data depending on how long it will be retained
[Figure/table: classification of NEST data by retention time]
Slide 44
Conclusions and Outlook Slide 45
Conclusions
- IOR results:
  - IME saturated ca. 90% of the nominal bandwidth in reading and writing
  - Promising finding for all considered application classes
- NEST results:
  - Barely any I/O-induced imbalance between ranks with IME (in contrast to GPFS)
  - IME performance close to the baseline condition (/dev/null), nearly perfect weak scaling behavior
  - At the largest problem size: a speedup of nearly 4 achieved vs. GPFS
  - Easy handling: no code changes in NEST required
- Conclusions:
  - IME actually works as theoretically expected for applications from the dominant-write class (writing in bursts)
  - NEST users would strongly profit from the incorporation of IME in compute clusters (I/O no longer a limiting factor in gathering simulation results)
Slide 46
Outlook and Recommendations
Recommendations for the future development of IME:
- Data pre-fetching: for "dominant read" applications, data pre-fetching before job start would be highly beneficial; integration into job managers?
- Development of tools for managing short-term and transient data, integration into job managers
- Support for end-to-end data integrity as provided within GPFS
Final word:
- IME shows that working burst buffer solutions exist for complex parallel applications
- Opportunity to scale compute and I/O performance
- Alternatively: opportunity to reduce bandwidth requirements for the external storage system
Slide 47
Questions?
Thank you for your attention!
Acknowledgements: We would like to thank DDN for making an IME test system available at Jülich Supercomputing Centre. In particular, we gratefully acknowledge the continuous support by Tommaso Cecchi and Toine Beckers.
Slide 48