Parallel I/O using standard data formats in climate and NWP
Parallel I/O using standard data formats in climate and NWP
Project ScalES, funded by BMBF
Deike Kleberg and Luis Kornblueh, Max-Planck-Institut für Meteorologie, Hamburg
Outline
- Introduction
- Proposed solution
- Implementation
- Results
- Conclusions, input and outlook
Thanks to
- Uwe Schulzweida, MPIM
- Mathias Pütz, IBM
- Christoph Pospiech, IBM
- Colleagues at DKRZ
ScalES (4hpc) Parallel I/O
Targets:
- Scalability of Earth System models
- Usability of infrastructure components
- Portability of implemented solutions
- Solutions prepared for future computer development
- Free availability for the modelling community
Atmospheric and oceanographic data
Constraints:
- Large amounts of data
- Comparably small spatial extent (O(2,3))
- Large time extent (O(8))
- Large number of variables (O(3))
... continued
Requirements:
- long-term storage of metadata and data
- standardization
- compression
Solutions:
- WMO GRIB standard
- lowest-entropy data subsampling
- two-stage compression: lossy, entropy-based packing followed by lossless compression of the resulting image
- metadata stay uncompressed!
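The two-stage idea can be illustrated with GRIB-style simple packing: values are quantized to fixed-width integers relative to the field minimum using a binary scale factor (the lossy, entropy-reducing step); a lossless coder then runs over the packed integers. A minimal sketch in C, assuming the GRIB simple-packing scheme; this is not CDI's actual encoder:

    #include <math.h>
    #include <stddef.h>
    #include <stdint.h>

    /* GRIB-style simple packing (sketch): quantize doubles to nbits-wide
     * integers relative to the field minimum. This is the lossy stage;
     * a lossless coder would process 'packed' afterwards. nbits <= 31. */
    static void pack_field(const double *v, size_t n, int nbits,
                           double *ref, int *binscale, uint32_t *packed)
    {
        double lo = v[0], hi = v[0];
        for (size_t i = 1; i < n; i++) {
            if (v[i] < lo) lo = v[i];
            if (v[i] > hi) hi = v[i];
        }
        *ref = lo;                       /* reference value, kept in metadata */
        int e = 0;                       /* binary scale factor */
        if (hi > lo)
            e = (int)ceil(log2((hi - lo) / (double)((1u << nbits) - 1)));
        *binscale = e;
        double scale = pow(2.0, -e);
        for (size_t i = 0; i < n; i++)
            packed[i] = (uint32_t)llround((v[i] - lo) * scale);
    }

The reference value and scale factor belong to the metadata, which, as the slide notes, stay uncompressed.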
... continued
Problem:
- The model decomposition is based on two-dimensional horizontal slicing.
- The storage unit of the model data (a GRIB record holds one horizontal level of one variable) is based on vertical slicing.
- Writing therefore requires a transpose and a gathering of the data.
File writing in ECHAM, AS-IS
[Diagram: compute PEs (horizontal decomposition) gather onto PE 0 (vertical/variable decomposition); PE 0 copies the data and calls streamwritevar in libcdi, which writes GRIB or NetCDF to the filesystem (GPFS or Lustre).]
All processes are compute processes. After one I/O timestep, all processes gather their data on process 0. Process 0 writes the data to the filesystem. All processes then compute until the next I/O timestep.
Problem: all processes wait until process 0 has written the data.
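The AS-IS pattern is a gather followed by a serial write. A minimal sketch of that pattern in C with MPI; the routine name and arguments are illustrative, not ECHAM's actual interface:

    #include <mpi.h>
    #include <stdio.h>

    /* AS-IS (sketch): every PE sends its horizontal patch to rank 0, which
     * assembles the full field and writes it serially; all other ranks
     * idle until rank 0 is done. */
    void write_field_serial(const double *patch, int patch_len,
                            double *full, const int *counts,
                            const int *displs, int npes,
                            FILE *fp, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);
        MPI_Gatherv(patch, patch_len, MPI_DOUBLE,
                    full, counts, displs, MPI_DOUBLE, 0, comm);
        if (rank == 0) {
            /* in ECHAM this is streamwritevar() in libcdi, not raw fwrite */
            size_t total = (size_t)displs[npes - 1] + (size_t)counts[npes - 1];
            fwrite(full, sizeof(double), total, fp);
        }
    }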
Solution strategy
1. Decompose the I/O such that all variables are distributed over the collector/concentrator PEs (the I/O PEs).
2. Store each compute PE's data in buffers to be collected by the I/O PEs; the buffers should reside in RDMA-capable memory areas.
3. Instead of doing I/O, copy the data to the buffer and continue the simulation.
4. The collectors gather their respective data via one-sided (RDMA-based) MPI calls and do the transpose.
5. Compress each individual record.
6. Write... sort of.
The first five steps run fine, but the last one makes a lot of trouble!
File writing in ECHAM, TO-BE
[Diagram: compute PEs copy their data each I/O timestep; I/O PEs gather and call streamwritevar in libcdi, which writes GRIB or NetCDF to the filesystem (GPFS or Lustre).]
After calculating one I/O timestep, the compute processes copy their data to a buffer and go on calculating until the next I/O timestep. The I/O processes fetch the data using MPI one-sided communication. Gather and transpose of the data are based on callback routines supplied by the model.
Most important property: the compute processes are not disturbed by file writing.
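A minimal sketch of this hand-off with MPI one-sided communication; the window layout and the omitted compute/IO synchronization are simplifying assumptions, not CDI-pio's actual interface:

    #include <mpi.h>

    static double *buf;    /* RDMA-capable buffer exposed by a compute PE */
    static MPI_Win win;

    void setup_window(int bufsize, MPI_Comm comm)
    {
        MPI_Alloc_mem(bufsize * sizeof(double), MPI_INFO_NULL, &buf);
        MPI_Win_create(buf, bufsize * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, comm, &win);
    }

    /* compute PE: "writing" is only a local copy; the simulation continues */
    void compute_pe_write(const double *data, int n)
    {
        for (int i = 0; i < n; i++) buf[i] = data[i];
    }

    /* I/O PE: pull a compute PE's buffer without involving it */
    void io_pe_fetch(double *dst, int n, int target_rank)
    {
        MPI_Win_lock(MPI_LOCK_SHARED, target_rank, 0, win);
        MPI_Get(dst, n, MPI_DOUBLE, target_rank, 0, n, MPI_DOUBLE, win);
        MPI_Win_unlock(target_rank, win);   /* the get completes here */
        /* real code must ensure the buffer is complete before fetching;
         * that coordination is omitted in this sketch */
    }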
Algorithm alternatives
[Diagram: how each alternative collects, writes, and communicates across PEs 0..n and files 1..m.]
(a) Classic serial: POSIX, no communication.
(b) MPI writer: MPI I/O, no visible communication.
(c) Offset sharing: MPI I/O or POSIX, offsets are communicated.
(d) Offset guard: MPI I/O or POSIX, offsets are communicated.
(e) POSIX writer: POSIX, data are communicated.
Timesteps
[Flow chart: newtimestep(tid); if tid equals the current timestep, continue; otherwise piowrite(queue) followed by MPI_Barrier(comm).]
Characteristics:
- Within files, the data have to be ordered by time.
- In the case of serial writing there is a natural barrier between the timesteps.
- When writing in parallel, buffering of the data is necessary for performance.
- Buffers have to be flushed after each I/O timestep to keep the data in the right time order.
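The flow chart reduces to a small guard. A sketch, with piowrite and the queue taken from the slide and their types assumed:

    #include <mpi.h>

    static int current_timestep = -1;

    extern void piowrite(void *queue);   /* drains the record queue (slide) */

    /* flush buffered records and synchronize when the I/O timestep changes,
     * so records stay time-ordered within each file */
    void newtimestep(int tid, void *queue, MPI_Comm comm)
    {
        if (tid != current_timestep) {
            piowrite(queue);        /* write everything of the old timestep */
            MPI_Barrier(comm);      /* nobody runs ahead into the new step */
            current_timestep = tid;
        }
    }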
Program flow (1), Classic serial
[Flow chart, PE 0: streamwritevar(data, id) enters cdilib, which runs gribencode(data), compress(data), and fwrite(data, fp, id).]
Collect and write serially using fwrite.
Advantages:
- No communication is needed.
- Simple algorithm.
- Simple multi-file handling.
Disadvantages:
- Does not scale.
Program flow (2), MPI writer (1)
[Flow chart, PE io: streamwritevar(data, id, tid) enters cdilib, which runs encodenbuffer(data, id, tid); once isfilled(buffer, id), piowrite(node) calls MPI_File_write_all(buffer, fh, id).]
Collect and write in parallel using MPI_File_write_all.
Advantages:
- No user-visible communication between the I/O processes is needed.
- Straightforward implementation.
- Performs best of all MPI file-writing routines.
Disadvantages:
- Collective call; it does not work for an inhomogeneous allocation of variables per process/file.
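A minimal sketch of the collective variant; the file layout and offset handling are simplified assumptions:

    #include <mpi.h>

    /* sketch: each I/O PE writes its filled buffer with one collective call.
     * Because the call is collective, every PE on the file's communicator
     * must reach it equally often, which is exactly what fails for an
     * inhomogeneous allocation of variables per process/file. */
    void piowrite_collective(const char *path, const char *buffer, int nbytes,
                             MPI_Offset my_offset, MPI_Comm iocomm)
    {
        MPI_File fh;
        MPI_Status status;
        MPI_File_open(iocomm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        MPI_File_seek(fh, my_offset, MPI_SEEK_SET);
        MPI_File_write_all(fh, buffer, nbytes, MPI_BYTE, &status);
        MPI_File_close(&fh);
    }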
Program flow (3), MPI writer (2)
[Flow chart, PE io: streamwritevar(data, id, tid) enters cdilib, which runs encodenbuffer(data, id, tid); once isfilled(buffer, id), piowrite(node) calls MPI_Wait(request), MPI_File_iwrite_shared(buffer, fp, id, request), and switch(buffer, id).]
Collect and write in parallel using MPI_File_iwrite_shared.
Advantages:
- No user-visible communication between the I/O processes is needed.
- Use of double buffering is possible.
Disadvantages:
- Very bad performance on GPFS.
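A minimal sketch of the double-buffered, non-blocking shared-file-pointer write; the buffer size is an arbitrary assumption:

    #include <mpi.h>

    static char        bufpair[2][1 << 20];   /* double buffer */
    static MPI_Request req = MPI_REQUEST_NULL;
    static int         cur = 0;

    /* while bufpair[cur] is in flight, encoding continues into the other
     * half; MPI_Wait on a null request returns immediately the first time */
    void piowrite_shared(MPI_File fh, int nbytes)
    {
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_File_iwrite_shared(fh, bufpair[cur], nbytes, MPI_BYTE, &req);
        cur = 1 - cur;                   /* the switch(buffer) from the slide */
    }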
Program flow (4), Offset sharing
[Flow chart, PEs io/i: streamwritevar(data, id, tid) enters cdilib, which runs encodenbuffer(data, id, tid); once isfilled(buffer, id), piowrite(node) runs MPI_Win_lock(id), MPI_Get(offset, id), MPI_Accumulate(amount, offset, id), MPI_Win_unlock(id), then fwrite(buffer, fp, id).]
Collect and write in parallel, communicating file offsets using MPI RMA with passive target.
Advantages:
- All collectors write.
- Use of POSIX AIO is possible.
Disadvantages:
- Complex locking is needed; the performance is bad.
Remark: double buffering not done yet.
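The offset exchange is a fetch-and-add on a shared file offset. A sketch following the slide's MPI_Get/MPI_Accumulate pair under an exclusive lock; note that MPI does not guarantee the get is ordered before the accumulate within one epoch, and MPI-3's MPI_Fetch_and_op would do the same thing atomically:

    #include <mpi.h>

    /* sketch: reserve 'amount' bytes in the shared file by reading and
     * advancing the offset that lives in an RMA window on rank 'home' */
    MPI_Offset reserve_file_space(MPI_Win win, int home, MPI_Offset amount)
    {
        MPI_Offset old = 0;
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, home, 0, win);
        MPI_Get(&old, 1, MPI_OFFSET, home, 0, 1, MPI_OFFSET, win);
        MPI_Accumulate(&amount, 1, MPI_OFFSET, home, 0, 1, MPI_OFFSET,
                       MPI_SUM, win);
        MPI_Win_unlock(home, win);
        return old;          /* the caller then fwrite()s its buffer at 'old' */
    }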
Program flow (5), Offset guard
[Flow chart, PE col: streamwritevar(data, id, tid) enters cdilib, which runs encodenbuffer(data, buffer, id); once isfilled(buffer, id), piowrite(node) runs MPI_Sendrecv(amount, offset, id), then fwrite(buffer, fp, id). PE spec: MPI_Probe(status); on tag UPDATE_OFFSET, MPI_Sendrecv(amount, offset, id) and update(amount, offset, id).]
Collect and write in parallel, using one dedicated process to administrate the file offsets.
Advantages:
- Performs best of the tested versions.
- Use of POSIX AIO is possible.
Disadvantages:
- One process is idle most of the time.
- Problems arise if the I/O PEs span different nodes.
Remark: double buffering not done yet.
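A sketch of both sides of the offset guard; the tag value and the single-file simplification are assumptions:

    #include <mpi.h>

    #define UPDATE_OFFSET 1          /* message tag, assumed */

    /* guard process: serve fetch-and-add requests on the file offset; each
     * collector sends the amount it needs and gets back the offset at
     * which it may write */
    void offset_guard(MPI_Offset *offset, MPI_Comm comm)
    {
        for (;;) {                   /* a termination tag is omitted */
            MPI_Status st;
            MPI_Offset amount;
            MPI_Probe(MPI_ANY_SOURCE, UPDATE_OFFSET, comm, &st);
            MPI_Recv(&amount, 1, MPI_OFFSET, st.MPI_SOURCE, UPDATE_OFFSET,
                     comm, MPI_STATUS_IGNORE);
            MPI_Send(offset, 1, MPI_OFFSET, st.MPI_SOURCE, UPDATE_OFFSET,
                     comm);
            *offset += amount;       /* reserve the range for that collector */
        }
    }

    /* collector side: one MPI_Sendrecv swaps the amount for the offset */
    MPI_Offset reserve(MPI_Offset amount, int guard_rank, MPI_Comm comm)
    {
        MPI_Offset off;
        MPI_Sendrecv(&amount, 1, MPI_OFFSET, guard_rank, UPDATE_OFFSET,
                     &off, 1, MPI_OFFSET, guard_rank, UPDATE_OFFSET,
                     comm, MPI_STATUS_IGNORE);
        return off;
    }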
Program flow (6), POSIX writer
[Flow chart, PE col: streamwritevar(data, id, tid) enters cdilib, which runs encodenbuffer(data, buffer, id); once isfilled(buffer, id), piowrite(node) runs MPI_Wait(request), MPI_Isend(buffer, id, request), and switch(buffer, id). PE spec: MPI_Probe(status); on tag WRITE_BUFFER, MPI_Recv(buffer, id) and fwrite(buffer, fp, id). Timings were taken for both fwrite and aio_write.]
Collect in parallel and write serially using fwrite.
Advantages:
- Use of POSIX AIO is possible.
- Use of double buffering is possible.
- Using one writer process per file is possible, which would probably gain performance; furthermore, these processes could be located on different nodes.
Disadvantages:
- The buffer has to be communicated.
- With one writer process the speedup is limited.
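A sketch of the collector/writer pair; the tag value and the buffer size are assumptions:

    #include <mpi.h>
    #include <stdio.h>

    #define WRITE_BUFFER 2            /* message tag, assumed */

    /* collector: ship the filled buffer to the writer without blocking;
     * with double buffering the transfer hides behind the next encode */
    void collector_flush(char *buffer, int nbytes, int writer_rank,
                         MPI_Request *req, MPI_Comm comm)
    {
        MPI_Wait(req, MPI_STATUS_IGNORE);          /* previous send done? */
        MPI_Isend(buffer, nbytes, MPI_BYTE, writer_rank, WRITE_BUFFER,
                  comm, req);
    }

    /* writer: receive whatever buffer arrives next and append it */
    void writer_loop(FILE *fp, MPI_Comm comm)
    {
        static char buffer[1 << 20];
        for (;;) {                    /* a termination tag is omitted */
            MPI_Status st;
            int nbytes;
            MPI_Probe(MPI_ANY_SOURCE, WRITE_BUFFER, comm, &st);
            MPI_Get_count(&st, MPI_BYTE, &nbytes);
            MPI_Recv(buffer, nbytes, MPI_BYTE, st.MPI_SOURCE, WRITE_BUFFER,
                     comm, MPI_STATUS_IGNORE);
            fwrite(buffer, 1, (size_t)nbytes, fp);
        }
    }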
Parallel performance profile analysis
We made several runs with the tools Scalasca and TAU; some of the results follow on the next slides. For these runs we let the processes write 5.9 GB distributed over 10 files.
Remark: the instrumentation of the code increased the runtime of the programs by up to 130%!
MPI writer, MPI_File_iwrite_shared
[Table: per-routine profile of the MPI writer with 8 PEs and a second, larger PE count; rows cdiwrite, grbencode, MPI_File_iwrite_shared, and other MPI, with columns Time(s), %, and #calls per PE. The numeric values were lost in transcription.]
POSIX writer, aio_write, 5 output streams
[Table: per-routine profile of the POSIX writer with 4 PEs and a second, larger PE count; rows cdiwrite, grbencode, MPI_Wait, MPI_Recv, and aio_suspend, with columns Time(s), %, and #calls per PE. The numeric values were lost in transcription.]
Offset guard
[Table: per-routine profile of the offset guard with 4 PEs and a second, larger PE count; rows cdiwrite, grbencode, fseek/fwrite, MPI_Barrier, and MPI_Probe, with columns Time(s), %, and #calls per PE. The numeric values were lost in transcription.]
POSIX writer, fwrite
[Table: per-routine profile of the POSIX writer with 4 PEs and a second, larger PE count; rows cdiwrite, grbencode, MPI_Wait, MPI_Barrier, MPI_Recv, and fwrite, with columns Time(s), %, and #calls per PE. The numeric values were lost in transcription.]
Timings for Power6, AIX, IB
Target: 2 GB/s.
[Table: throughput in MB/s for the write methods Classic serial (fwrite), MPI writer (MPI_File_iwrite_shared), Offset guard (fwrite), and POSIX writer (fwrite), against #PEs, #collectors, and #writers. The numeric values were lost in transcription.]
Not all possibilities might be fully optimized.
Timings for Opteron, Linux, IB
Linux cluster test: 16 MB buffer, 10 files, 49 GB in total.
[Table: throughput in MB/s for the write methods Classic serial (fwrite), MPI writer (MPI_File_iwrite_shared), Offset guard (fwrite), and POSIX writer (fwrite), against #nodes and #PEs. The numeric values were lost in transcription.]
Not all possibilities might be fully optimized.
Comparison
[Four panels over #PEs comparing POSIX writer (aio_write, 5 prefetch streams), MPI writer (MPI_File_iwrite_shared), POSIX writer (fwrite), and Offset guard (fwrite): elapsed time in s (Classic serial baseline: 319 s), speedup S_p = T_s / T_p, throughput in MB/s (Classic serial baseline: 148 MB/s), and efficiency E_p = S_p / #PEs. The curves themselves were lost in transcription.]
Some initial conclusions
- For our kind of models, the write speed on a single IBM node is sufficient to go on for the next months.
- The RDMA concept is meeting the expectations; we may need more RMA windows.
- We did part of the MPI developers' work on a very unusual usage model of MPI-IO...
- ...and expect them to take over and beat our results by at least a factor of ?.