Parallel I/O using standard data formats in climate and NWP
Parallel I/O using standard data formats in climate and NWP
Project ScalES, funded by BMBF
Deike Kleberg and Luis Kornblueh, Max-Planck-Institut für Meteorologie, Hamburg
Outline
- Introduction
- Proposed solution
- Implementation
- Results
- Conclusions, input and outlook
Thanks to
- Uwe Schulzweida, MPIM
- Mathias Pütz, IBM
- Christoph Pospiech, IBM
- Colleagues at DKRZ
ScalES (4hpc) Parallel I/O
Targets:
- Scalability of Earth System models
- Usability of infrastructure components
- Portability of implemented solutions
- Solutions prepared for future computer development
- Free availability for the modelling community
Atmospheric and oceanographic data
Constraints:
- Large amounts of data
- Comparably small spatial extent (O(2,3))
- Large time extent (O(8))
- Large number of variables (O(3))
... continued
Requirements:
- long-term storage of metadata and data
- standardization
- compression
Solutions:
- WMO GRIB standard
- lowest-entropy data subsampling
- two-stage compression: lossy, entropy-based packing followed by lossless compression of the resulting image
- metadata stay uncompressed!
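The two-stage idea can be illustrated with GRIB-style simple packing: values are quantized to fixed-width integers relative to the field minimum using a binary scale factor (the lossy, entropy-reducing step); a lossless coder then runs over the packed integers. A minimal sketch in C, assuming the GRIB simple-packing scheme; this is not CDI's actual encoder:

    #include <math.h>
    #include <stddef.h>
    #include <stdint.h>

    /* GRIB-style simple packing (sketch): quantize doubles to nbits-wide
     * integers relative to the field minimum. This is the lossy stage;
     * a lossless coder would process 'packed' afterwards. nbits <= 31. */
    static void pack_field(const double *v, size_t n, int nbits,
                           double *ref, int *binscale, uint32_t *packed)
    {
        double lo = v[0], hi = v[0];
        for (size_t i = 1; i < n; i++) {
            if (v[i] < lo) lo = v[i];
            if (v[i] > hi) hi = v[i];
        }
        *ref = lo;                       /* reference value, kept in metadata */
        int e = 0;                       /* binary scale factor */
        if (hi > lo)
            e = (int)ceil(log2((hi - lo) / (double)((1u << nbits) - 1)));
        *binscale = e;
        double scale = pow(2.0, -e);
        for (size_t i = 0; i < n; i++)
            packed[i] = (uint32_t)llround((v[i] - lo) * scale);
    }

The reference value and scale factor belong to the metadata, which, as the slide notes, stay uncompressed.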
... continued
Problem:
- The model decomposition is based on two-dimensional horizontal slicing.
- The storage unit of the model data (a GRIB record holds one horizontal level of one variable) is based on vertical slicing.
- Writing therefore requires a transpose and a gathering of the data.
File writing in ECHAM, AS-IS
[Diagram: compute PEs (horizontal decomposition) gather onto PE 0 (vertical/variable decomposition); PE 0 copies the data and calls streamwritevar in libcdi, which writes GRIB or NetCDF to the filesystem (GPFS or Lustre).]
All processes are compute processes. After one I/O timestep, all processes gather their data on process 0. Process 0 writes the data to the filesystem. All processes then compute until the next I/O timestep.
Problem: all processes wait until process 0 has written the data.
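The AS-IS pattern is a gather followed by a serial write. A minimal sketch of that pattern in C with MPI; the routine name and arguments are illustrative, not ECHAM's actual interface:

    #include <mpi.h>
    #include <stdio.h>

    /* AS-IS (sketch): every PE sends its horizontal patch to rank 0, which
     * assembles the full field and writes it serially; all other ranks
     * idle until rank 0 is done. */
    void write_field_serial(const double *patch, int patch_len,
                            double *full, const int *counts,
                            const int *displs, int npes,
                            FILE *fp, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);
        MPI_Gatherv(patch, patch_len, MPI_DOUBLE,
                    full, counts, displs, MPI_DOUBLE, 0, comm);
        if (rank == 0) {
            /* in ECHAM this is streamwritevar() in libcdi, not raw fwrite */
            size_t total = (size_t)displs[npes - 1] + (size_t)counts[npes - 1];
            fwrite(full, sizeof(double), total, fp);
        }
    }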
Solution strategy
1. Decompose the I/O such that all variables are distributed over the collector/concentrator PEs (the I/O PEs).
2. Store each compute PE's data in buffers to be collected by the I/O PEs; the buffers should reside in RDMA-capable memory areas.
3. Instead of doing I/O, copy the data to the buffer and continue the simulation.
4. The collectors gather their respective data via one-sided (RDMA-based) MPI calls and do the transpose.
5. Compress each individual record.
6. Write... sort of.
The first five steps run fine, but the last one makes a lot of trouble!
File writing in ECHAM, TO-BE
[Diagram: compute PEs copy their data each I/O timestep; I/O PEs gather and call streamwritevar in libcdi, which writes GRIB or NetCDF to the filesystem (GPFS or Lustre).]
After calculating one I/O timestep, the compute processes copy their data to a buffer and go on calculating until the next I/O timestep. The I/O processes fetch the data using MPI one-sided communication. Gather and transpose of the data are based on callback routines supplied by the model.
Most important property: the compute processes are not disturbed by file writing.
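A minimal sketch of this hand-off with MPI one-sided communication; the window layout and the omitted compute/IO synchronization are simplifying assumptions, not CDI-pio's actual interface:

    #include <mpi.h>

    static double *buf;    /* RDMA-capable buffer exposed by a compute PE */
    static MPI_Win win;

    void setup_window(int bufsize, MPI_Comm comm)
    {
        MPI_Alloc_mem(bufsize * sizeof(double), MPI_INFO_NULL, &buf);
        MPI_Win_create(buf, bufsize * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, comm, &win);
    }

    /* compute PE: "writing" is only a local copy; the simulation continues */
    void compute_pe_write(const double *data, int n)
    {
        for (int i = 0; i < n; i++) buf[i] = data[i];
    }

    /* I/O PE: pull a compute PE's buffer without involving it */
    void io_pe_fetch(double *dst, int n, int target_rank)
    {
        MPI_Win_lock(MPI_LOCK_SHARED, target_rank, 0, win);
        MPI_Get(dst, n, MPI_DOUBLE, target_rank, 0, n, MPI_DOUBLE, win);
        MPI_Win_unlock(target_rank, win);   /* the get completes here */
        /* real code must ensure the buffer is complete before fetching;
         * that coordination is omitted in this sketch */
    }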
Algorithm alternatives
[Diagram: how each alternative collects, writes, and communicates across PEs 0..n and files 1..m.]
(a) Classic serial: POSIX, no communication.
(b) MPI writer: MPI I/O, no visible communication.
(c) Offset sharing: MPI I/O or POSIX, offsets are communicated.
(d) Offset guard: MPI I/O or POSIX, offsets are communicated.
(e) POSIX writer: POSIX, data are communicated.
Timesteps
[Flow chart: newtimestep(tid); if tid equals the current timestep, continue; otherwise piowrite(queue) followed by MPI_Barrier(comm).]
Characteristics:
- Within files, the data have to be ordered by time.
- In the case of serial writing there is a natural barrier between the timesteps.
- When writing in parallel, buffering of the data is necessary for performance.
- Buffers have to be flushed after each I/O timestep to keep the data in the right time order.
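The flow chart reduces to a small guard. A sketch, with piowrite and the queue taken from the slide and their types assumed:

    #include <mpi.h>

    static int current_timestep = -1;

    extern void piowrite(void *queue);   /* drains the record queue (slide) */

    /* flush buffered records and synchronize when the I/O timestep changes,
     * so records stay time-ordered within each file */
    void newtimestep(int tid, void *queue, MPI_Comm comm)
    {
        if (tid != current_timestep) {
            piowrite(queue);        /* write everything of the old timestep */
            MPI_Barrier(comm);      /* nobody runs ahead into the new step */
            current_timestep = tid;
        }
    }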
Program flow (1), Classic serial
[Flow chart, PE 0: streamwritevar(data, id) enters cdilib, which runs gribencode(data), compress(data), and fwrite(data, fp, id).]
Collect and write serially using fwrite.
Advantages:
- No communication is needed.
- Simple algorithm.
- Simple multi-file handling.
Disadvantages:
- Does not scale.
Program flow (2), MPI writer (1)
[Flow chart, PE io: streamwritevar(data, id, tid) enters cdilib, which runs encodenbuffer(data, id, tid); once isfilled(buffer, id), piowrite(node) calls MPI_File_write_all(buffer, fh, id).]
Collect and write in parallel using MPI_File_write_all.
Advantages:
- No user-visible communication between the I/O processes is needed.
- Straightforward implementation.
- Performs best of all MPI file-writing routines.
Disadvantages:
- Collective call; it does not work for an inhomogeneous allocation of variables per process/file.
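A minimal sketch of the collective variant; the file layout and offset handling are simplified assumptions:

    #include <mpi.h>

    /* sketch: each I/O PE writes its filled buffer with one collective call.
     * Because the call is collective, every PE on the file's communicator
     * must reach it equally often, which is exactly what fails for an
     * inhomogeneous allocation of variables per process/file. */
    void piowrite_collective(const char *path, const char *buffer, int nbytes,
                             MPI_Offset my_offset, MPI_Comm iocomm)
    {
        MPI_File fh;
        MPI_Status status;
        MPI_File_open(iocomm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        MPI_File_seek(fh, my_offset, MPI_SEEK_SET);
        MPI_File_write_all(fh, buffer, nbytes, MPI_BYTE, &status);
        MPI_File_close(&fh);
    }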
Program flow (3), MPI writer (2)
[Flow chart, PE io: streamwritevar(data, id, tid) enters cdilib, which runs encodenbuffer(data, id, tid); once isfilled(buffer, id), piowrite(node) calls MPI_Wait(request), MPI_File_iwrite_shared(buffer, fp, id, request), and switch(buffer, id).]
Collect and write in parallel using MPI_File_iwrite_shared.
Advantages:
- No user-visible communication between the I/O processes is needed.
- Use of double buffering is possible.
Disadvantages:
- Very bad performance on GPFS.
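A minimal sketch of the double-buffered, non-blocking shared-file-pointer write; the buffer size is an arbitrary assumption:

    #include <mpi.h>

    static char        bufpair[2][1 << 20];   /* double buffer */
    static MPI_Request req = MPI_REQUEST_NULL;
    static int         cur = 0;

    /* while bufpair[cur] is in flight, encoding continues into the other
     * half; MPI_Wait on a null request returns immediately the first time */
    void piowrite_shared(MPI_File fh, int nbytes)
    {
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_File_iwrite_shared(fh, bufpair[cur], nbytes, MPI_BYTE, &req);
        cur = 1 - cur;                   /* the switch(buffer) from the slide */
    }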
Program flow (4), Offset sharing
[Flow chart, PEs io/i: streamwritevar(data, id, tid) enters cdilib, which runs encodenbuffer(data, id, tid); once isfilled(buffer, id), piowrite(node) runs MPI_Win_lock(id), MPI_Get(offset, id), MPI_Accumulate(amount, offset, id), MPI_Win_unlock(id), then fwrite(buffer, fp, id).]
Collect and write in parallel, communicating file offsets using MPI RMA with passive target.
Advantages:
- All collectors write.
- Use of POSIX AIO is possible.
Disadvantages:
- Complex locking is needed; the performance is bad.
Remark: double buffering not done yet.
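The offset exchange is a fetch-and-add on a shared file offset. A sketch following the slide's MPI_Get/MPI_Accumulate pair under an exclusive lock; note that MPI does not guarantee the get is ordered before the accumulate within one epoch, and MPI-3's MPI_Fetch_and_op would do the same thing atomically:

    #include <mpi.h>

    /* sketch: reserve 'amount' bytes in the shared file by reading and
     * advancing the offset that lives in an RMA window on rank 'home' */
    MPI_Offset reserve_file_space(MPI_Win win, int home, MPI_Offset amount)
    {
        MPI_Offset old = 0;
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, home, 0, win);
        MPI_Get(&old, 1, MPI_OFFSET, home, 0, 1, MPI_OFFSET, win);
        MPI_Accumulate(&amount, 1, MPI_OFFSET, home, 0, 1, MPI_OFFSET,
                       MPI_SUM, win);
        MPI_Win_unlock(home, win);
        return old;          /* the caller then fwrite()s its buffer at 'old' */
    }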
Program flow (5), Offset guard
[Flow chart, PE col: streamwritevar(data, id, tid) enters cdilib, which runs encodenbuffer(data, buffer, id); once isfilled(buffer, id), piowrite(node) runs MPI_Sendrecv(amount, offset, id), then fwrite(buffer, fp, id). PE spec: MPI_Probe(status); on tag UPDATE_OFFSET, MPI_Sendrecv(amount, offset, id) and update(amount, offset, id).]
Collect and write in parallel, using one dedicated process to administrate the file offsets.
Advantages:
- Performs best of the tested versions.
- Use of POSIX AIO is possible.
Disadvantages:
- One process is idle most of the time.
- Problems arise if the I/O PEs span different nodes.
Remark: double buffering not done yet.
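A sketch of both sides of the offset guard; the tag value and the single-file simplification are assumptions:

    #include <mpi.h>

    #define UPDATE_OFFSET 1          /* message tag, assumed */

    /* guard process: serve fetch-and-add requests on the file offset; each
     * collector sends the amount it needs and gets back the offset at
     * which it may write */
    void offset_guard(MPI_Offset *offset, MPI_Comm comm)
    {
        for (;;) {                   /* a termination tag is omitted */
            MPI_Status st;
            MPI_Offset amount;
            MPI_Probe(MPI_ANY_SOURCE, UPDATE_OFFSET, comm, &st);
            MPI_Recv(&amount, 1, MPI_OFFSET, st.MPI_SOURCE, UPDATE_OFFSET,
                     comm, MPI_STATUS_IGNORE);
            MPI_Send(offset, 1, MPI_OFFSET, st.MPI_SOURCE, UPDATE_OFFSET,
                     comm);
            *offset += amount;       /* reserve the range for that collector */
        }
    }

    /* collector side: one MPI_Sendrecv swaps the amount for the offset */
    MPI_Offset reserve(MPI_Offset amount, int guard_rank, MPI_Comm comm)
    {
        MPI_Offset off;
        MPI_Sendrecv(&amount, 1, MPI_OFFSET, guard_rank, UPDATE_OFFSET,
                     &off, 1, MPI_OFFSET, guard_rank, UPDATE_OFFSET,
                     comm, MPI_STATUS_IGNORE);
        return off;
    }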
Program flow (6), POSIX writer
[Flow chart, PE col: streamwritevar(data, id, tid) enters cdilib, which runs encodenbuffer(data, buffer, id); once isfilled(buffer, id), piowrite(node) runs MPI_Wait(request), MPI_Isend(buffer, id, request), and switch(buffer, id). PE spec: MPI_Probe(status); on tag WRITE_BUFFER, MPI_Recv(buffer, id) and fwrite(buffer, fp, id). Timings were taken for both fwrite and aio_write.]
Collect in parallel and write serially using fwrite.
Advantages:
- Use of POSIX AIO is possible.
- Use of double buffering is possible.
- Using one writer process per file is possible, which would probably gain performance; furthermore, these processes could be located on different nodes.
Disadvantages:
- The buffer has to be communicated.
- With one writer process the speedup is limited.
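A sketch of the collector/writer pair; the tag value and the buffer size are assumptions:

    #include <mpi.h>
    #include <stdio.h>

    #define WRITE_BUFFER 2            /* message tag, assumed */

    /* collector: ship the filled buffer to the writer without blocking;
     * with double buffering the transfer hides behind the next encode */
    void collector_flush(char *buffer, int nbytes, int writer_rank,
                         MPI_Request *req, MPI_Comm comm)
    {
        MPI_Wait(req, MPI_STATUS_IGNORE);          /* previous send done? */
        MPI_Isend(buffer, nbytes, MPI_BYTE, writer_rank, WRITE_BUFFER,
                  comm, req);
    }

    /* writer: receive whatever buffer arrives next and append it */
    void writer_loop(FILE *fp, MPI_Comm comm)
    {
        static char buffer[1 << 20];
        for (;;) {                    /* a termination tag is omitted */
            MPI_Status st;
            int nbytes;
            MPI_Probe(MPI_ANY_SOURCE, WRITE_BUFFER, comm, &st);
            MPI_Get_count(&st, MPI_BYTE, &nbytes);
            MPI_Recv(buffer, nbytes, MPI_BYTE, st.MPI_SOURCE, WRITE_BUFFER,
                     comm, MPI_STATUS_IGNORE);
            fwrite(buffer, 1, (size_t)nbytes, fp);
        }
    }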
Parallel performance profile analysis
We made several runs with the tools Scalasca and TAU; some of the results follow on the next slides. For these runs we let the processes write 5.9 GB distributed over 10 files.
Remark: the instrumentation of the code increased the runtime of the programs by up to 130%!
MPI writer, MPI_File_iwrite_shared
[Table: per-routine profile of the MPI writer with 8 PEs and a second, larger PE count; rows cdiwrite, grbencode, MPI_File_iwrite_shared, and other MPI, with columns Time(s), %, and #calls per PE. The numeric values were lost in transcription.]
POSIX writer, aio_write, 5 output streams
[Table: per-routine profile of the POSIX writer with 4 PEs and a second, larger PE count; rows cdiwrite, grbencode, MPI_Wait, MPI_Recv, and aio_suspend, with columns Time(s), %, and #calls per PE. The numeric values were lost in transcription.]
Offset guard
[Table: per-routine profile of the offset guard with 4 PEs and a second, larger PE count; rows cdiwrite, grbencode, fseek/fwrite, MPI_Barrier, and MPI_Probe, with columns Time(s), %, and #calls per PE. The numeric values were lost in transcription.]
POSIX writer, fwrite
[Table: per-routine profile of the POSIX writer with 4 PEs and a second, larger PE count; rows cdiwrite, grbencode, MPI_Wait, MPI_Barrier, MPI_Recv, and fwrite, with columns Time(s), %, and #calls per PE. The numeric values were lost in transcription.]
Timings for Power6, AIX, IB
Target: 2 GB/s.
[Table: throughput in MB/s for the write methods Classic serial (fwrite), MPI writer (MPI_File_iwrite_shared), Offset guard (fwrite), and POSIX writer (fwrite), against #PEs, #collectors, and #writers. The numeric values were lost in transcription.]
Not all possibilities might be fully optimized.
Timings for Opteron, Linux, IB
Linux cluster test: 16 MB buffer, 10 files, 49 GB in total.
[Table: throughput in MB/s for the write methods Classic serial (fwrite), MPI writer (MPI_File_iwrite_shared), Offset guard (fwrite), and POSIX writer (fwrite), against #nodes and #PEs. The numeric values were lost in transcription.]
Not all possibilities might be fully optimized.
Comparison
[Four panels over #PEs comparing POSIX writer (aio_write, 5 prefetch streams), MPI writer (MPI_File_iwrite_shared), POSIX writer (fwrite), and Offset guard (fwrite): elapsed time in s (Classic serial baseline: 319 s), speedup S_p = T_s / T_p, throughput in MB/s (Classic serial baseline: 148 MB/s), and efficiency E_p = S_p / #PEs. The curves themselves were lost in transcription.]
Some initial conclusions
- For our kind of models, the write speed on a single IBM node is sufficient to go on for the next months.
- The RDMA concept is meeting the expectations; we may need more RMA windows.
- We did part of the MPI developers' work on a very unusual usage model of MPI-IO...
- ...and expect them to take over and beat our results by at least a factor of ?.