The DEEP-ER take on I/O

1 Wolfgang Frings, Jülich Supercomputing Centre
Workshop "Exascale I/O: Challenges, Innovations and Solutions", SC16, Salt Lake City, 18 November 2016
The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/ ) under Grant Agreement n and n

2 DEEP and DEEP-ER
- EU-Exascale projects
- 20 partners
- Total budget: 28.3 M
- EU funding: 14.5 M
- Nov 2011 to Mar

3 I/O Hardware Innovation
- On-node NVM
- Network attached memory

4 Driven by Applications
Pilot applications:
- Space weather simulation (KULeuven)
- High temperature superconductivity (CINECA)
- Human exposure to electromagnetic fields (Inria)
- Geoscience (BADW-LRZ)
- Radio astronomy (ASTRON)
- Oil exploration (BSC)
- Lattice QCD (UREG)
Goals:
- Analyse I/O and resiliency requirements of HPC codes
- Evaluate the DEEP-ER I/O and resiliency concept and its programmability

5 Scalable I/O
- Improve I/O scalability on all usage levels
- Also used for checkpointing

6 BeeGFS and Caching
Two instances:
- Global FS on HDD server
- Cache FS on NVM at node
API for cache domain handling:
- Synchronous version
- Asynchronous version

7 Architecture of the asynchronous cache API
Diagram components: user application, application workflow, DEEP-ER cache API, DEEP-ER cache library, messaging facility, DEEP-ER cache daemon, FIFO with new requests, lookup list of requests in processing, worker threads, I/O subsystem, cache file system, global file system.
Workflow of an asynchronous API call:
1. Asynchronous API call
2. Add request to FIFO
3.a Worker takes request from FIFO
3.b Add request to list
4. Process work
5. Remove request from list
Workflow of a wait call:
1. Wait()
2.a Check list (lookup list of requests in processing)
2.b Check list (FIFO with new requests)
Workflow of a synchronous API call: executed by the thread of the caller process.
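
The request flow on this slide (FIFO of new requests, list of requests in processing, worker threads, wait call) can be illustrated with a small thread-based model. This is only a sketch under assumptions: the class and method names (`AsyncCacheModel`, `submit`, `wait`) are hypothetical and are not the DEEP-ER cache API, and the "work" is an arbitrary Python callable standing in for a cache operation.

```python
import queue
import threading


class AsyncCacheModel:
    """Toy model of the asynchronous call path; NOT the DEEP-ER cache API."""

    def __init__(self, num_workers=2):
        self._fifo = queue.Queue()      # FIFO with new requests
        self._pending = set()           # ids of requests still in the FIFO
        self._in_progress = set()       # list of requests in processing
        self._cond = threading.Condition()
        self._next_id = 0
        for _ in range(num_workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, work):
        """Steps 1-2: asynchronous call, add the request to the FIFO."""
        with self._cond:
            req_id = self._next_id
            self._next_id += 1
            self._pending.add(req_id)
        self._fifo.put((req_id, work))
        return req_id

    def _worker(self):
        while True:
            req_id, work = self._fifo.get()          # 3.a take request from FIFO
            with self._cond:
                self._pending.discard(req_id)
                self._in_progress.add(req_id)        # 3.b add request to list
            try:
                work()                               # 4. process work
            finally:
                with self._cond:
                    self._in_progress.discard(req_id)  # 5. remove request from list
                    self._cond.notify_all()

    def wait(self, req_id):
        """Wait call: block while the request is queued (2.b) or in processing (2.a)."""
        with self._cond:
            self._cond.wait_for(
                lambda: req_id not in self._pending
                and req_id not in self._in_progress
            )


if __name__ == "__main__":
    cache = AsyncCacheModel(num_workers=2)
    flushed = []
    rid = cache.submit(lambda: flushed.append("chunk-0"))
    cache.wait(rid)
    print(flushed)
```

The caller returns from `submit` immediately and only blocks in `wait`, which mirrors the split between the asynchronous call workflow and the wait workflow in the diagram.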

8 DEEP-ER Cache Integration in ROMIO
New MPI-IO hints:
- e10_cache
- e10_cache_path
- e10_cache_flush_flag
- e10_cache_discard_flag
- e10_cache_threads
Tested in DEEP cluster
Diagram labels: Processes; Global Sync Group (MPI_COMM_WORLD); Collective Buffers; Collective I/O; Independent I/O; DEEP-ER Cache; Parallel File System
Software stack: MPI, MPI-IO, ROMIO, ADIO (Abstract Device IO) with Lustre, GPFS, UFS and BeeGFS drivers, Parallel File System

9 Workflows Using DEEP-ER Cache
Two workflows are compared: cache enabled and cache disabled.
- S(k): amount of data written to the file at phase k (here constant)
- Tc(k): time to write S(k) to the cache
- Ts(k): time to sync S(k) with the parallel file system
- C(k): compute time at phase k
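
The quantities defined on this slide suggest a simple cost model, sketched below. The overlap rule is an assumption, not something the slide states: with the cache enabled, the sync of phase k is taken to run in the background during the compute of phase k+1, and the next write waits for any sync that has not finished. For the cache-disabled case the slide defines no direct write time, so the function takes a hypothetical list `Tw` supplied by the caller.

```python
def total_time_cache_enabled(C, Tc, Ts):
    """Total runtime when writes go to the cache and syncs run in the background.

    Assumption: the sync of phase k-1 overlaps with the compute of phase k;
    whatever part of the sync is not hidden delays the write of phase k.
    """
    total = 0.0
    pending_sync = 0.0  # background sync left over from the previous phase
    for k in range(len(C)):
        exposed = max(0.0, pending_sync - C[k])  # sync time not hidden by compute
        total += C[k] + exposed + Tc[k]
        pending_sync = Ts[k]
    return total + pending_sync  # the last sync has no compute phase to hide behind


def total_time_cache_disabled(C, Tw):
    """Total runtime when every phase writes straight to the parallel file system.

    Tw[k] is a hypothetical direct write time; it is not defined on the slide.
    """
    return float(sum(C) + sum(Tw))


if __name__ == "__main__":
    C = [10, 10, 10]   # compute time per phase
    Tc = [1, 1, 1]     # time to write to the cache
    Ts = [4, 4, 4]     # time to sync with the parallel file system
    print(total_time_cache_enabled(C, Tc, Ts))
    print(total_time_cache_disabled(C, Ts))  # using Ts as a stand-in for Tw
```

Under this model the cache pays off when C(k+1) is at least Ts(k), because the sync is then hidden entirely behind computation.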

10 MPIWRAP Support Library
- MPI-IO hints are defined in a config file and injected by libmpiwrap into the middleware
- Gives users deeper and more flexible control of MPI-IO functionality
- Integrates E10 functionality transparently into applications
- Works with any high-level library (e.g. phdf5)
- MPI_{Init,Finalize}, MPI_File_{open,close}
Software stack: MPI, MPI-IO, MPIWRAP, ROMIO, ADIO (Abstract Device IO) with Lustre, GPFS, UFS and BeeGFS drivers, Parallel File System
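
The idea of defining hints in a config file and injecting them at file-open time can be sketched as follows. Everything about the file format here is an assumption: the slides do not show the real libmpiwrap config syntax, so this sketch uses an INI layout with one section per file name pattern, and the hint values (`enable`, `/nvme/cache`, `flush_immediate`, `2`) are placeholders. Only the hint names come from the slides. The function `hints_for` is hypothetical and returns the key/value pairs that a wrapper could attach to the file's MPI info object.

```python
import configparser
from fnmatch import fnmatch

# Hypothetical config layout: one section per file name pattern.
EXAMPLE_CONFIG = """
[*/checkpoint/*.h5]
e10_cache = enable
e10_cache_path = /nvme/cache
e10_cache_flush_flag = flush_immediate
e10_cache_threads = 2
"""


def hints_for(filename, config_text):
    """Collect the hints of every section whose pattern matches the file name."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    hints = {}
    for pattern in parser.sections():
        if fnmatch(filename, pattern):
            hints.update(parser.items(pattern))
    return hints


if __name__ == "__main__":
    # A wrapper around the file-open call could look up hints like this
    # before handing them to the MPI-IO layer.
    for name, value in hints_for("/work/run1/checkpoint/state.h5", EXAMPLE_CONFIG).items():
        print(name, "=", value)
```

Because the lookup is keyed on the file name, the application itself needs no changes, which matches the transparency claim on the slide.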

11 SIONlib: Shared Files for Task-local Data
Diagram: a parallel application (HDF5, NETCDF, MPI-I/O, POSIX I/O) whose tasks t1, t2, ..., tn-1, tn each write their own file to the parallel file system.
#files: O(10000) (./checkpoint/file.0001 ... ./checkpoint/file.nnnn)

12 SIONlib: Shared Files for Task-local Data
Diagram: the tasks t1, t2, t3, ..., tn-2, tn-1, tn of a parallel application (HDF5, NETCDF, MPI-I/O, POSIX I/O), or a serial program, access logical task-local files that are stored in a physical multi-file on the parallel file system.
#files: O(10)

13 SIONlib: Strategy for Local Storage
- Tasks t1 ... tn/2 and tn/2+1 ... tn are mapped to shared file 1 ... shared file n of a file container (multi-file approach, with metablock 1 and metablock 2 per file)
- Local storage: node groups with storage
- Staging via I/O nodes to global storage
- Transparent access from application to container
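
The mapping of tasks onto the shared files of a container can be illustrated with a contiguous block mapping, which is what the diagram suggests (the first half of the tasks go to the first file, the second half to the next). This is a sketch under that assumption; the function name `map_task_to_file` is hypothetical and the library's actual mapping options are not described on the slide.

```python
def map_task_to_file(rank, num_tasks, num_files):
    """Return (file_index, rank_within_file) for a contiguous block mapping.

    Assumption: tasks are split into num_files consecutive blocks whose sizes
    differ by at most one; the first blocks take the extra tasks.
    """
    if num_files < 1 or not 0 <= rank < num_tasks:
        raise ValueError("invalid rank or file count")
    base, extra = divmod(num_tasks, num_files)
    boundary = extra * (base + 1)  # tasks covered by the larger blocks
    if rank < boundary:
        return rank // (base + 1), rank % (base + 1)
    rest = rank - boundary
    return extra + rest // base, rest % base


if __name__ == "__main__":
    # 8 tasks on 2 shared files: tasks 0-3 in file 0, tasks 4-7 in file 1.
    for rank in range(8):
        print(rank, map_task_to_file(rank, 8, 2))
```

With a mapping like this, each node group can hold the shared file of its own tasks on local storage while the application still addresses one logical container.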

14 SIONlib: Local checkpointing
Diagram: tasks (t) in the MPI/OpenMP layer above the global file system.
- Mapping: each task writes to local storage, using one physical file of the file container
- Transparent access to data from the application
- Transparent access to data when files are migrated to global storage
- Re-distribution of task-local data in the SIONlib layer possible (Buddy-CP)

15 SIONlib: Buddy-CP
Diagram: logical mapping of tasks (t) in the MPI/OpenMP layer to buddy nodes, above the global file system.
Mapping:
- 1:1: task writes to local storage and to the local storage of a buddy node
- 1:m: task writes to local storage and to the local storage of m buddy nodes
Data exchange to the buddy node is done via the MPI/OpenMP layer.
Collective checkpoint calls required:
- Open: sid=sion_paropen_mpi(, bw,buddy=m,mpi_comm_world, )
- Write: sion_coll_write_mpi(,size,elements,sid)
- Close: sion_parclose(sid)
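
The 1:1 and 1:m mappings can be made concrete with a small model of which nodes receive a copy of a task's checkpoint. The ring assignment used here (the buddies of node i are the next m nodes, wrapping around) is purely an assumption for illustration; the slide does not say how buddies are chosen, and the function names are hypothetical.

```python
def buddy_nodes(node, num_nodes, m=1):
    """Return the m buddy nodes of `node`, assuming a ring of nodes."""
    if not 0 <= node < num_nodes:
        raise ValueError("node out of range")
    if not 1 <= m < num_nodes:
        raise ValueError("m must be between 1 and num_nodes - 1")
    return [(node + i) % num_nodes for i in range(1, m + 1)]


def write_targets(node, num_nodes, m=1):
    """Nodes whose local storage holds the checkpoint: own node plus buddies."""
    return [node] + buddy_nodes(node, num_nodes, m)


if __name__ == "__main__":
    print(write_targets(0, 4, m=1))  # 1:1 mapping
    print(write_targets(3, 4, m=2))  # 1:m mapping with m = 2, wrapping around
```

A larger m tolerates more simultaneous node failures at the cost of m extra copies per checkpoint.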

16 Restore checkpoint after failure
Diagram: tasks (t) in the MPI/OpenMP layer and buddy nodes, above the global file system.
- Automatic analysis of data availability on local storage
- On missing data: falls back to the buddy if the first open fails
- Open: sid=sion_paropen_mpi(, br,buddy=m,mpi_comm_world, )
- Read: sion_coll_read_mpi(,size,elements,sid)
- Close: sion_parclose(sid)
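
The fallback on restore can be sketched as a lookup that first tries the task's own storage and then its buddies. The sketch assumes a ring-style buddy assignment (buddies of node i are the next m nodes), which the slides do not specify, and models storage availability as a plain set of node ids; `restore_source` is a hypothetical name.

```python
def restore_source(node, num_nodes, m, readable):
    """Return the node whose storage should be read on restart, or None.

    `readable` is the set of nodes whose local storage survived the failure.
    Assumption: buddies are the next m nodes on a ring.
    """
    if node in readable:            # first open succeeds: local data is available
        return node
    for i in range(1, m + 1):       # otherwise fall back to the buddies in order
        buddy = (node + i) % num_nodes
        if buddy in readable:
            return buddy
    return None                     # no copy left on any local storage


if __name__ == "__main__":
    # Node 1 failed; with one buddy its data is read from node 2.
    print(restore_source(1, 4, 1, {0, 2, 3}))
```

If the function returns None, every local copy is gone and the checkpoint can only come from a copy that was migrated elsewhere, such as global storage.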

17 SIONlib and SCR interoperability
Call sequence:
- SCR_Start_checkpt()
- SCR_Route_file(fn, fn_scr), with fn = check1 and fn_scr = /abspath/check1
- sid=sion_paropen_mpi(fn_scr, wb,buddy...)
- info=sion_get_io_info(sid)
- sion_parclose_mpi(sid)
- SCR_Complete_checkpt()
Files:
- (node0) /abspath/check1
- (node1) /abspath/check
- (node0) /abspath/check1_buddy_
- (node1) /abspath/check1_buddy_
info: list of filenames opened on this task and bytes written
SCR_update_filename(nfiles, info.names,info.sizes, info.roles)

18 Preliminary results
MAXW-DGTD: Human exposure to electromagnetic fields (Inria)
- SIONlib reduces number of files, hence less filesystem overhead
[Figure: output times on DEEP. Data write time [s] versus # of tasks for P1, P1 with SIONlib, P3 and P3 with SIONlib.]

19 Preliminary results NVMe
[Figure: Impact of NVMe, Oil exploration (BSC). Propagation time [s] for backward and forward propagation, sdv-work versus NVMe.]
[Figure: I/O performance of MAXW-DGTD, Human exposure to electromagnetic fields (Inria). Writing time [s] for P1, P2, P3 and P4, sdv-work versus NVMe.]

20 Summary
- DEEP-ER explores future directions of I/O
  - On POSIX optimization level: SIONlib
  - On filesystem level: BeeGFS
  - On MPI-IO level: E10
- Aims to be able to test and combine the approaches
- Exploration and validation of new hardware: NVMe, NAM
More information on
