Concurrency-Optimized I/O For Visualizing HPC Simulations: An Approach Using Dedicated I/O Cores


Concurrency-Optimized I/O For Visualizing HPC Simulations: An Approach Using Dedicated I/O Cores. Matthieu Dorier, Franck Cappello, Marc Snir, Bogdan Nicolae, Gabriel Antoniu. 4th Workshop of the Joint Laboratory for Petascale Computing, Urbana, November 2010. NCSA, University of Illinois at Urbana-Champaign; INRIA Saclay, Grand-Large Team; INRIA Rennes, KerData Team; ENS de Cachan - Brittany.

Context: I/O-efficient visualization for post-petascale HPC simulations

Case study: the CM1 simulation/visualization pipeline

The CM1 parallel simulation. Cloud Model 1st Generation (CM1). The sky is discretized along a 3D grid, using an approximation of the finite element method. Layout: a two-dimensional grid of subdomains, onto which the MPI processes are mapped. Each MPI process solves the equations on its subdomain, then exchanges ghost data with its neighbors. The main loop is:

    Initialize, time t = 0
    While t < time_max
        Solve equations
        Exchange ghost data (MPI)
        If t % K == 0 then
            Write file (HDF5)
        Endif
        t++
    Endwhile
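
The loop above can be read as the following minimal C/MPI sketch; solve_equations(), exchange_ghosts() and write_hdf5_backup() are hypothetical stubs standing in for CM1's actual solver and I/O routines, not the real code:

    #include <mpi.h>

    /* Hypothetical stubs standing in for CM1's solver and I/O routines. */
    static void solve_equations(void) { /* advance the local subdomain by one step */ }
    static void exchange_ghosts(MPI_Comm comm) { (void)comm; /* halo exchange with neighbors */ }
    static void write_hdf5_backup(int rank, int step) { (void)rank; (void)step; /* one HDF5 file per core */ }

    int main(int argc, char **argv)
    {
        int rank;
        const int time_max = 1000;  /* illustrative value */
        const int K = 100;          /* backup period, as on the slide */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int t = 0; t < time_max; t++) {
            solve_equations();
            exchange_ghosts(MPI_COMM_WORLD);
            if (t % K == 0)
                write_hdf5_backup(rank, t);   /* every K steps, each core writes its own file */
        }

        MPI_Finalize();
        return 0;
    }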

Output description. Output format: HDF5 (Hierarchical Data Format), providing dataset organization and description through metadata; gzip compression (level 4); one file per core per backup; one backup every K steps (K is configurable). In a Blue Waters deployment: 100K files per backup, all written at the same time; a backup every 100 steps (depending on the configuration, 100 steps take 25 to 400 seconds); only 100 GPFS servers to handle all the requests: a bottleneck!
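
As an illustration, a per-core backup with gzip level 4 can be produced roughly as follows; the dataset name, dimensions and file naming are assumptions for the sketch, not CM1's actual output schema:

    #include <stdio.h>
    #include <hdf5.h>

    /* Write one 3D variable into a per-core file, compressed with gzip level 4. */
    int write_backup(int rank, int step, const float *data,
                     hsize_t nx, hsize_t ny, hsize_t nz)
    {
        char fname[256];
        snprintf(fname, sizeof(fname), "cm1_rank%05d_step%06d.h5", rank, step);

        hid_t file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hsize_t dims[3]  = { nx, ny, nz };
        hsize_t chunk[3] = { nx, ny, 1 };        /* compression requires a chunked layout */
        hid_t space = H5Screate_simple(3, dims, NULL);

        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 3, chunk);
        H5Pset_deflate(dcpl, 4);                 /* gzip, level 4 as in the slides */

        hid_t dset = H5Dcreate2(file, "/theta", H5T_NATIVE_FLOAT, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

        H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space); H5Fclose(file);
        return 0;
    }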

General behavior: all 100K processes have to wait for the slowest one to complete its write.

The visualization tool: VisIt. Developed at LLNL within the Department of Energy (DOE) Advanced Simulation and Computing Initiative (ASCI). It relies on VTK and distributed rendering.

Challenge: how to efficiently cope with various read patterns? The VisIt parallel rendering engine is typically deployed on 10% of the computational resources (rule of thumb). A wide variety of purposes means a wide variety of access patterns. E.g.: how can 10,000 rendering engines gather and render only horizontal slices of a subset of variables (arrays) stored in more than 100,000 files and accessed through 100 GPFS servers? The data needs to be filtered at some point.

This problem is not specific to CM1. Other simulation applications write one file per core per step and exhibit a similar behavior, such as: the GTC fusion modeling code (Gyrokinetic Toroidal Code), which studies micro-turbulence in magnetic confinement fusion using plasma theory; and the Pixie3D code, a 3D extension of an MHD (Magneto-Hydro-Dynamics) core. The "too-many-files" problem and the problem of post-processing and visualizing those files have already been the subject of studies.

Issues and possible solutions. The current approach involves: inefficient concurrent writing (waiting for the slowest process); too many files at every step, a bottleneck in GPFS; read patterns too different from the write patterns. Possible solutions: using PHDF5 to write in a collective manner, but it remains slow compared to individual writes and does not allow any compression; using filters, but filtering in the simulation slows down the simulation, and filtering in the visualization tools slows down the visualization.
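
For reference, the PHDF5 alternative mentioned above looks roughly as follows: all ranks write collectively into a single shared file through MPI-IO (the dataset name and the 1D decomposition are illustrative). Note that, at the time, such collective writes could not be combined with gzip compression:

    #include <mpi.h>
    #include <hdf5.h>

    /* Each rank writes its own slab of a shared dataset, collectively. */
    void collective_write(MPI_Comm comm, const float *local, hsize_t local_n,
                          int nprocs, int rank)
    {
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);      /* open the file with MPI-IO */
        hid_t file = H5Fcreate("cm1_shared.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        hsize_t gdims[1]  = { (hsize_t)nprocs * local_n };
        hsize_t ldims[1]  = { local_n };
        hsize_t offset[1] = { (hsize_t)rank * local_n };

        hid_t fspace = H5Screate_simple(1, gdims, NULL);
        hid_t dset = H5Dcreate2(file, "/theta", H5T_NATIVE_FLOAT, fspace,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        hid_t mspace = H5Screate_simple(1, ldims, NULL);
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, offset, NULL, ldims, NULL);

        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);     /* collective data transfer */
        H5Dwrite(dset, H5T_NATIVE_FLOAT, mspace, fspace, dxpl, local);

        H5Pclose(dxpl); H5Sclose(mspace); H5Sclose(fspace);
        H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    }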

What we are looking for: optimize simulation/visualization I/O concurrency; reduce the overhead of the write sections; reduce the overall number of files; adapt the output layout to the way it will be read by the visualization; increase the compression level; leverage the capabilities of HDF5.

Proposal: use dedicated I/O cores. One core per node handles a shared buffer. The computation cores write into this shared buffer (possible, as CM1 only uses 2% of the 128 GB of RAM per node). The write phase of the dedicated I/O core overlaps the next computation phase.
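
One way to carve out such a dedicated core in MPI is sketched below, assuming a block mapping of ranks to nodes and taking the last core of each node as the I/O core; the actual implementation may organize processes differently:

    #include <mpi.h>

    /* Split MPI_COMM_WORLD into a communicator of computation cores and one of
       dedicated I/O cores (one per node), assuming ranks are packed by node. */
    void split_io_cores(int cores_per_node, MPI_Comm *compute_comm, MPI_Comm *io_comm)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int core_id = rank % cores_per_node;            /* position within the node */
        int is_io   = (core_id == cores_per_node - 1);  /* last core of the node does I/O */

        /* Color 0 = computation cores, color 1 = dedicated I/O cores. */
        MPI_Comm_split(MPI_COMM_WORLD, is_io ? 1 : 0, rank,
                       is_io ? io_comm : compute_comm);
        if (is_io) *compute_comm = MPI_COMM_NULL;
        else       *io_comm = MPI_COMM_NULL;
    }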

Potential usage of the spare time. The I/O core may: filter data; reformat data for more efficient visualization; redistribute data; better compress the data; directly handle VisIt requests (inline visualization), avoiding read pressure on GPFS.

Implementation: a dedicated MPI process in each node. IPC is used to handle the communication between each computation core and the I/O core. A shared memory segment is opened by the I/O core when the application starts. Each computation core writes at a specific offset in the segment (no sharing, no synchronization).
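
A minimal sketch of such a node-local segment using POSIX shared memory, with one fixed-size slot per computation core so that writers never overlap; the slot size, segment name and lack of error handling are assumptions for illustration, not the actual implementation:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SLOT_SIZE (96UL << 20)   /* illustrative: one 96 MB slot per computation core */

    /* I/O core: create the per-node segment with one slot per computation core. */
    void *create_segment(const char *name, int ncompute)
    {
        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        ftruncate(fd, (off_t)(SLOT_SIZE * ncompute));
        void *base = mmap(NULL, SLOT_SIZE * ncompute,
                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return base;
    }

    /* Computation core: map the segment and copy its output into its own slot. */
    void write_slot(const char *name, int ncompute, int core_id,
                    const void *data, size_t len)
    {
        int fd = shm_open(name, O_RDWR, 0600);
        char *base = mmap(NULL, SLOT_SIZE * ncompute,
                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        memcpy(base + (size_t)core_id * SLOT_SIZE, data, len);   /* disjoint offsets: no locking */
        munmap(base, SLOT_SIZE * ncompute);
        close(fd);
    }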

Implementation (continued). Goal: gather data into larger arrays; we want cores to handle contiguous subdomains. Some modifications are required in the simulation: a new hierarchical layout built from the knowledge of the node ID and core ID, and a precise organization of the processes across cores and nodes.
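
Such a hierarchical layout can be derived from node and core IDs roughly as follows; the sketch assumes ranks are packed by node and uses, as an example, the 8x8-node, 5x3-core decomposition given on a later slide (illustrative, not the actual code):

    /* Map a computation rank to its position in a hierarchical 2D grid of subdomains.
       Example parameters: nodes_x = 8, cores_per_node_x = 5, cores_per_node_y = 3,
       which yields the 40x24 grid discussed later. */
    typedef struct { int gx, gy; } GridPos;

    GridPos hierarchical_position(int compute_rank, int nodes_x,
                                  int cores_per_node_x, int cores_per_node_y)
    {
        int cores_per_node = cores_per_node_x * cores_per_node_y;
        int node_id = compute_rank / cores_per_node;
        int core_id = compute_rank % cores_per_node;

        int node_x = node_id % nodes_x,          node_y = node_id / nodes_x;
        int core_x = core_id % cores_per_node_x, core_y = core_id / cores_per_node_x;

        /* Neighboring cores of the same node own contiguous subdomains. */
        GridPos p = { node_x * cores_per_node_x + core_x,
                      node_y * cores_per_node_y + core_y };
        return p;
    }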

Experimental settings.
On BluePrint: 64 nodes, 16 cores/node, 2 GPFS servers; domain: 960x960x300 points. Output example (15 variables enabled, uncompressed): a backup every 50 steps (i.e. every 3 min), 15 MB/core/backup, 1024 files/backup (with the current approach), about 16 GB per backup in total.
On Blue Waters: 3200 nodes, 32 cores/node, 100 GPFS servers; domain: 12800x12800x1000 points. Output example (15 variables enabled, uncompressed): a backup every 100 steps (i.e. every 2 min), 96 MB/core/backup, 102400 files/backup (with the current approach), about 10 TB per backup in total.
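
As a sanity check, the per-backup totals follow from the per-core figures (the slides round them to 16 GB and 10 TB respectively):

    64 \times 16 \times 15\,\mathrm{MB} \approx 15.4\,\mathrm{GB} \quad \text{(BluePrint)}
    3200 \times 32 \times 96\,\mathrm{MB} \approx 9.8\,\mathrm{TB} \quad \text{(Blue Waters)}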

Data layout. Current approach: 16 cores/node participating in the computation; 32x32 grid layout; 30x30x300 points per subdomain. Proposed approach: 15 cores/node participating in the computation plus 1 I/O core/node; an 8x8 node layout with a 5x3 core layout per node, i.e. a 40x24 grid layout; 24x40x300 points per subdomain.
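
The proposed decomposition works out as follows (for the 64-node, 960x960x300-point BluePrint configuration above):

    64 \times 15 = 960 \ \text{computation cores}, \qquad (8 \times 5) \times (8 \times 3) = 40 \times 24 \ \text{grid}
    960 / 40 = 24, \quad 960 / 24 = 40 \;\Rightarrow\; 24 \times 40 \times 300 \ \text{points per subdomain}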

Comparison of the two approaches, as seen from the computation cores.

Theoretical limits. First limit: the write time must be less than the computation time between two backups, in order to overlap the I/O without slowing down the application.
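
With notation introduced here only for illustration (T_comp: computation time of the K steps between two backups, T_write: time needed by the I/O core to flush one backup), this first limit can be written as:

    T_{\mathrm{write}} \;\le\; T_{\mathrm{comp}}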

Theoretical limits. Second limit: the gain of overlapping I/O and computation must be worth removing one core from the computation. Steps achieved in 1 hour: 814 with the current approach, 866 with the new one.
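
A back-of-the-envelope version of this second limit, under the simplifying assumption that the computation scales linearly with the number of cores: with n cores per node, moving one core to I/O inflates the computation time T_comp between backups by roughly a factor n/(n-1), while it removes the synchronous write time T_wait that the current approach pays per backup. The new approach wins when:

    \frac{n}{n-1} \cdot T_{\mathrm{comp}} \;<\; T_{\mathrm{comp}} + T_{\mathrm{wait}}
    \quad\Longleftrightarrow\quad
    T_{\mathrm{wait}} \;>\; \frac{T_{\mathrm{comp}}}{n-1}

The reported 814 versus 866 steps per hour indicate that this condition held in the tested configuration.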

Summary: using dedicated I/O cores. The proposed solution consists in removing one core per node from the computation and using this core to gather, filter and flush the data. The proof-of-concept implementation uses MPI processes and opens shared memory segments to handle the IPC.

Contribution at this stage: an approach which reduces the overhead of the I/O phases, spares time for filtering, compression and other post-processing, and globally achieves higher performance on BluePrint. Considering the experiments conducted on BluePrint and the theoretical analysis, we think this approach will also achieve higher performance on Blue Waters.

Next steps: evaluation. Very short term: more extensive experiments, with hybrid programming models and at larger scales; the approach could also be used with other back-ends, such as PHDF5. Great results with CM1, but we need to evaluate genericity, e.g. with GTC (nuclear fusion simulation).

Next steps: leverage the I/O cores. Goal: use the I/O cores' spare time to reduce visualization I/O. The I/O core can stay connected to VisIt for inline visualization: instead of reading files from GPFS, data is read directly from the I/O cores. Generate summaries (metadata) for specific visualization patterns: some visualization requests could rely on metadata only, reducing concurrent accesses to the shared segment. Adaptive metadata management depending on the desired visual output.

Next steps: concurrency-optimized metadata management. Idea: leverage the BlobSeer approach, a concurrency-optimized, versioning-based data management scheme (KerData team, INRIA Rennes Bretagne Atlantique). It relies on concurrency-optimized metadata management: writes generate new data chunks and new metadata, with no overwriting; multiple versions of the data remain available and can be used for advanced post-processing or visualization; concurrent reads and writes proceed without synchronization. (Figure: BlobSeer's segment-tree metadata, with nodes covering ranges such as [0,4], [0,2], [2,2], [0,1], [1,1], [2,1], [3,1].)

Next steps: concurrency-optimized metadata management (continued). Leveraging BlobSeer: what is needed? Adapting BlobSeer's metadata layer to the needs of visualization, and an efficient GPFS back-end for BlobSeer. What can we expect? Enhanced support for inline visualization, either from metadata only or directly from the I/O cores, in both cases without synchronization. Longer term: even fewer GPFS files overall. Initial approach: one file/core/step; our approach: one file/node/step; future: one file/multiple nodes/step.

Thank you