Application Example Running on Top of GPI-Space Integrating D/C

1 Application Example Running on Top of GPI-Space Integrating D/C
Tiberiu Rotaru, Fraunhofer ITWM
This project is funded from the European Union's Horizon 2020 Research and Innovation programme under Grant Agreement no

2 GPI-Space Software platform intended to help the domain developers to build specific complex parallel applications. The main idea behind GPI-Space is separating the domain specific programming layer from the HPC software infrastructure. Application domains: seismic imaging, financial services, life sciences, mechanical engineering.

3 The Distributed Runtime System of GPI-Space
Components shown in the architecture diagram: User, Orchestrator, Agents, Workers, Virtual Memory Manager.
Event-driven architecture (asynchronous communication).
The components may have multiple masters, slaves or subscribers.
The system can be deployed in dynamically configurable topologies.
Workers may be assigned specific capabilities (e.g. I/O, compute).
Tasks may have requirements.
Workers may join and leave the system at any time without affecting the execution of the workflow or other workers.
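The matching of task requirements against worker capabilities can be pictured with a minimal sketch. The types `Worker` and `Task` and the functions `is_eligible` and `eligible_workers` are hypothetical names chosen for illustration; they are not part of the GPI-Space API, and the actual scheduler may use different criteria.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical types for illustration only; not GPI-Space classes.
struct Worker
{
  std::string name;
  std::set<std::string> capabilities;  // e.g. "io", "compute"
};

struct Task
{
  std::string name;
  std::set<std::string> requirements;  // capabilities the task needs
};

// A worker is eligible if it offers every capability the task requires.
bool is_eligible (Worker const& worker, Task const& task)
{
  for (std::string const& requirement : task.requirements)
  {
    if (worker.capabilities.count (requirement) == 0)
    {
      return false;
    }
  }
  return true;
}

// Names of all currently registered workers that could run the task.
std::vector<std::string> eligible_workers
  (std::vector<Worker> const& workers, Task const& task)
{
  std::vector<std::string> names;
  for (Worker const& worker : workers)
  {
    if (is_eligible (worker, task))
    {
      names.push_back (worker.name);
    }
  }
  return names;
}
```

Because eligibility is recomputed from the current list of workers, a worker that joins or leaves only changes the candidate set for later tasks, which is in the spirit of the last point of the slide.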

4 The virtual memory layer of GPI-Space
Used for storing data shared between tasks. There is no direct communication between tasks or workers.
Features:
  The storage layer can be easily changed, depending on the application's memory requirements (e.g. in-memory or on disk).
  Tolerance to worker failures: a failure has no impact on other workers or on the execution of the workflow.
Shortcomings of the current virtual memory layer:
  It lacks a caching mechanism.
  It is not an independent software module that can be developed and tested standalone.

5 Directory/Cache client-server architecture
[Diagram labels: Node 0 to Node 3; Segment 0; Segment 1; GASPI, MPI, BeeGFS; Original Data; Server; Cache; Scratch Copies; Client]
Multiple segments, caches and clients.
The caches may be shared between clients.
The local servers coordinate with each other to carry out operations on memory ranges in a consistent way.
External programs may connect to an already running D/C.
Tolerance to client failures (client and server run in different processes).

6 D/C prototype implementation A working prototype implementation of D/C API is available at: Used C++11 advanced features (variadic templates, lambda expressions, asynchronous launching, futures, etc.) The prototype provides implementation for GASPI, MPI and BeeGFS segments. Other segment implementations can be added without modifying the D/C existing implementation. Different allocation policies for segments and data are supported. Direct coupling with applications possible: e.g. Jacobi iteration for solving the Laplace equation on a 2D mesh of processes (tested with GASPI and MPI segments). Intended to be used by the task-based runtimes, primarily.

7 Integration of GPI-Space with the Directory/Cache
Goals:
  Replace the existing virtual memory layer with D/C and achieve the same functionality, while ensuring that all tests still pass.
  Take advantage of the D/C features.
Scenario: use a shared cache per node and a private cache per worker.
Changes required at multiple levels:
  Runtime API: methods for allocating and freeing segments or data.
  Workflow engine: workflow description language, parser, module loader.
  Bootstrapping: on each node, before starting the workers, start one or multiple D/C server instances and create the shared caches.
  Workers: create D/C inter-process clients and private caches at startup, and inform the masters about the caches used.
  Master agent: use the transfer costs provided by the D/C API for scheduling.

8 Memory buffers
In GPI-Space the transfers are managed by the runtime.
The buffers used in transfers are described in the workflow.
Buffers are either const or mutable.
A buffer in GPI-Space corresponds to a local range in the D/C.
Buffers may have attributes (e.g. whether the buffer is read-only).
  <memory-buffer name="seek_table" read-only="true">
    <size> ${size_of_seek_table} </size>
  </memory-buffer>
  <memory-buffer name="data" read-only="false">
    <size> ${data_buffer.size} </size>
  </memory-buffer>

9 Managed memory transfers
The workflow parser generates, for each task, lists of get and put operations. The runtime executes them when it runs the task, using one of the D/C API methods.
  <memory-get>
    <global>
      ${range.handle} := ${seek_table.handle};
      ${range.offset} := 0UL;
      ${range.size} := ${size_of_seek_table};
      stack_push (List(), ${range});
    </global>
    <local>
      ${range.buffer} := "seek_table";
      ${range.offset} := 0UL;
      ${range.size} := ${size_of_seek_table};
      stack_push (List(), ${range});
    </local>
  </memory-get>
  <memory-put>
    <global>
      ${range.handle} := ${data.handle};
      ${range.offset} := 0UL;
      ${range.size} := ${size_of_data};
      stack_push (List(), ${range});
    </global>
    <local>
      ${range.buffer} := "data";
      ${range.offset} := 0UL;
      ${range.size} := ${data_buffer.size};
      stack_push (List(), ${range});
    </local>
  </memory-put>
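How a runtime might execute such get and put lists can be sketched as follows. This is a simplified illustration, not the GPI-Space or D/C implementation: global memory and local buffers are modelled as plain byte vectors keyed by name, and all type and function names (`GlobalRange`, `LocalRange`, `Transfer`, `Memory`, `execute_gets`, `execute_puts`) are hypothetical. The sketch also assumes that the global and the local range of one transfer have the same size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical mirrors of the <global> and <local> range descriptions.
struct GlobalRange { std::string handle; std::size_t offset; std::size_t size; };
struct LocalRange { std::string buffer; std::size_t offset; std::size_t size; };

using Transfer = std::pair<GlobalRange, LocalRange>;
// Stand-in for both the global memory and the set of local buffers.
using Memory = std::map<std::string, std::vector<char>>;

// Throws if [offset, offset + size) does not lie inside the named region.
void check_range ( Memory const& memory, std::string const& name
                 , std::size_t offset, std::size_t size
                 )
{
  auto const region = memory.find (name);
  if (region == memory.end())
  {
    throw std::out_of_range ("unknown region: " + name);
  }
  if (offset > region->second.size() || size > region->second.size() - offset)
  {
    throw std::out_of_range ("range does not fit into: " + name);
  }
}

void copy_range ( Memory const& source, std::string const& source_name
                , std::size_t source_offset
                , Memory& destination, std::string const& destination_name
                , std::size_t destination_offset
                , std::size_t size
                )
{
  check_range (source, source_name, source_offset, size);
  check_range (destination, destination_name, destination_offset, size);
  std::copy_n ( source.at (source_name).data() + source_offset
              , size
              , destination.at (destination_name).data() + destination_offset
              );
}

// A get copies from global memory into a local buffer (before the task runs).
void execute_gets ( std::vector<Transfer> const& gets
                  , Memory const& global, Memory& local
                  )
{
  for (Transfer const& transfer : gets)
  {
    if (transfer.first.size != transfer.second.size)
    {
      throw std::invalid_argument ("get: size mismatch");
    }
    copy_range ( global, transfer.first.handle, transfer.first.offset
               , local, transfer.second.buffer, transfer.second.offset
               , transfer.first.size
               );
  }
}

// A put copies from a local buffer back into global memory (after the task).
void execute_puts ( std::vector<Transfer> const& puts
                  , Memory const& local, Memory& global
                  )
{
  for (Transfer const& transfer : puts)
  {
    if (transfer.first.size != transfer.second.size)
    {
      throw std::invalid_argument ("put: size mismatch");
    }
    copy_range ( local, transfer.second.buffer, transfer.second.offset
               , global, transfer.first.handle, transfer.first.offset
               , transfer.first.size
               );
  }
}
```

In this picture the task code itself only ever touches the local buffers; the gets run before it and the puts after it.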

10 The renderer Splotch
Splotch: publicly available rendering software used for exploration and visual discovery in particle-based datasets produced by astronomical observations or numerical simulations. Development started at the Max-Planck Institute for Astrophysics in München. Several academic institutions are contributing. The rendering algorithm optimizes the ray-tracing calculation by ordering the particles according to a depth function. The algorithm produces high-quality imagery (realistic 3D impressions) and works with very large-scale data sets. Original version: OpenMP/MPI hybrid implementation. Problem: low parallel efficiency (only 26% on the SuperMUC cluster at the Leibniz Supercomputing Centre).

11 Example: formation and evolution of a galaxy cluster
The upper panels show a region of the universe at an early time. It is rich in filamentary structures, and later, when the universe evolves, almost all these structures are going to collapse to form a single, prominent galaxy cluster. The lower panels show the same region at a later time, where the two main progenitors of the final galaxy cluster are going to merge.
From: Splotch: visualizing cosmological simulations (New Journal of Physics, Volume 10, Issue 12 (2008)).

12 First version of Splotch implemented on top of GPI-Space
Task-based implementation instead of the hybrid OpenMP/MPI implementation.
The data flow was reorganized so that the rendering part can overlap with the non-parallel parts of the algorithm (e.g. I/O or the aggregation of partial images).
The user is required to provide implementations for only three interface functions, which can be developed, tested and compiled as a dynamic library without GPI-Space.
GPI-Space provides automatic support for scheduling, dynamic load balancing and fault tolerance.
GPI-2 is used for data transfers (low latency and high bandwidth).

13 Workflow of the parallelized Splotch algorithm The Splotch algorithm operational scenario consists of a number of stages: a) read data from one or more files, b) process data (e.g. for normalization), c) render data and d) save the final image.

14 Splotch parameters For each simulation Splotch uses a parameter file (contains color, intensity, brightness, palette, etc). Scene file: contains scene descriptions (e.g. camera positions) The input contains specific data for a number of scenes. Splotch produces for each scene a picture (frame). The workflow combines 4 types of tasks: load, make_picture, reduce, store. Invariants: the number of make_picture, reduce and store tasks is proportional to the number of scenes.

15 GPI-Space implementation vs the original version
Test data: data set with 5 billion particles (167 GByte) frames to render (fly-through of the data set at a fixed time step).
Output: 161 GByte of picture data.
Data chunks of 2 GByte size make_picture tasks, reduce tasks, 3245 store tasks.
32 GByte memory used on each node.
On 128 nodes, the OpenMP/MPI Splotch has the same runtime as on 32 nodes.
The GPI-Space version outperforms the OpenMP/MPI version by a factor of 10.

16 Running Splotch on top of GPI-Space integrating D/C In Splotch, scene_parameters and the seek_table are constant parameters. The node-local shared cache is used for storing const local ranges. The private cache of a worker is used for storing mutable local ranges of assigned tasks. Policies used: evenly_distributed for segments and fill_from_left for data allocations. The application required minor modifications w.r.t segment and data allocation for running on GPI-Space integrating D/C (the D/C API is called under the hood).
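The placement rule described on this slide (const local ranges go to the node-local shared cache, mutable local ranges go to the worker's private cache) can be written down as a small sketch. `CacheKind`, `Buffer`, `cache_for` and `place` are hypothetical names used only for illustration; the D/C API itself is not shown in the slides.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

enum class CacheKind { node_shared, worker_private };

// Hypothetical buffer description, mirroring the read-only attribute
// of a memory buffer in the workflow.
struct Buffer
{
  std::string name;
  bool read_only;
};

// Const ranges can be shared by all workers of a node; mutable ranges
// stay in the private cache of the worker executing the task.
CacheKind cache_for (Buffer const& buffer)
{
  return buffer.read_only ? CacheKind::node_shared
                          : CacheKind::worker_private;
}

// Groups the buffer names of one task by the cache that would hold them.
std::map<CacheKind, std::vector<std::string>> place
  (std::vector<Buffer> const& buffers)
{
  std::map<CacheKind, std::vector<std::string>> placement;
  for (Buffer const& buffer : buffers)
  {
    placement[cache_for (buffer)].push_back (buffer.name);
  }
  return placement;
}
```

For Splotch, this would put `scene_parameters` and `seek_table` into the shared cache and a mutable buffer such as `data` into the private cache.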

17 Allocating memory for the global data by the runtime
The D/C API is called under the hood by the runtime; the application developer is not required to call D/C API methods explicitly.
  gspc::scoped_vmem_segment_and_allocation const allocation_scene_parameters
    ( drts.alloc_and_fill ( gspc::vmem::gaspi_segment_description()
                          , memory_for_scene_parameters_in_bytes
                          , scenes_parameter.data()
                          )
    );
  gspc::scoped_vmem_segment_and_allocation const allocation_seek_table
    ( drts.alloc_and_fill ( gspc::vmem::gaspi_segment_description()
                          , seek_table.size()
                          , seek_table.data()
                          )
    );
  gspc::scoped_vmem_segment_and_allocation const allocation_data
    ( drts.alloc ( gspc::vmem::gaspi_segment_description()
                 , size_of_data
                 )
    );
  gspc::scoped_vmem_segment_and_allocation const allocation_picture
    ( drts.alloc ( gspc::vmem::gaspi_segment_description()
                 , memory_for_pictures
                 )
    );

18 Splotch running on top of GPI-Space + D/C on 30 nodes (100 scenes)

19 Output validation
Tests were performed with various numbers of nodes and scenes. The reference output contains 3425 frames (161 GB of data). In all cases the output was compared against the reference output produced by the original application. The produced frames are identical to those in the reference output (up to a very small tolerance error). This proves that the Directory/Cache API operations are correctly implemented, resulting in no data corruption. Note: the correctness of data transfers was also previously checked with the Jacobi 2D example implemented directly on top of D/C.

20 Execution times for varying numbers of scenes and nodes
#Scenes  #Tasks
80       13524
90       15204
100      16884
Execution times of simulations with 80, 90, and 100 scenes on sets of 20 to 30 cluster nodes, each with 16 Intel Xeon E cores.
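The three task counts reported for this experiment (13524 tasks for 80 scenes, 15204 for 90, 16884 for 100) are consistent with a linear relation between the number of scenes and the number of tasks. The sketch below derives the slope and the constant term from the first two data points and checks the third. What the resulting constants correspond to in the workflow is not stated in the slides, so the names `tasks_per_scene` and `fixed_tasks` are only descriptive labels, and the fit is an observation about these three data points rather than a documented formula.

```cpp
#include <cassert>

// Slope and constant term fitted to the first two data points:
// 80 scenes -> 13524 tasks, 90 scenes -> 15204 tasks.
constexpr long tasks_per_scene = (15204 - 13524) / (90 - 80);
constexpr long fixed_tasks = 13524 - 80 * tasks_per_scene;

// Empirical fit only; not a formula taken from GPI-Space or Splotch.
constexpr long number_of_tasks (long scenes)
{
  return tasks_per_scene * scenes + fixed_tasks;
}

// The third data point (100 scenes -> 16884 tasks) matches the fit.
static_assert (number_of_tasks (100) == 16884, "third data point fits");
```

The fit gives 168 additional tasks per scene plus 84 tasks that do not depend on the number of scenes.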

21 Evaluating the cache reuse
[Figure: total number of cache misses compared with the number of cache hits per rank for a simulation with 20 nodes and 100 scenes.]
The maximum number of data transfers from the global memory into a node-local shared cache is the number of scenes plus one. As the graph shows, this limit (101 for 100 scenes) is never exceeded.
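One way to see where a bound of "number of scenes plus one" can come from is a toy model of a node-local shared cache. The model rests on assumptions that are not spelled out in the slides: the const data consists of one seek table plus one parameter set per scene, the cache never evicts, and every task reads the seek table and the parameters of its scene. `SharedCache` and `simulate` are hypothetical names; this is not the D/C cache implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Toy node-local shared cache without eviction.
class SharedCache
{
public:
  // Returns true on a hit. On a miss the range is considered transferred
  // from global memory and stays resident afterwards.
  bool access (std::string const& range_id)
  {
    bool const inserted = resident_.insert (range_id).second;
    if (inserted)
    {
      ++misses_;
    }
    else
    {
      ++hits_;
    }
    return !inserted;
  }

  std::size_t hits() const { return hits_; }
  std::size_t misses() const { return misses_; }

private:
  std::set<std::string> resident_;
  std::size_t hits_ = 0;
  std::size_t misses_ = 0;
};

// Worst case for one node: it executes tasks of every scene.
SharedCache simulate (std::size_t scenes, std::size_t tasks_per_scene)
{
  SharedCache cache;
  for (std::size_t scene = 0; scene < scenes; ++scene)
  {
    for (std::size_t task = 0; task < tasks_per_scene; ++task)
    {
      cache.access ("seek_table");
      cache.access ("scene_parameters/" + std::to_string (scene));
    }
  }
  return cache;
}
```

Under these assumptions each distinct const range misses exactly once per node, so a node that sees all 100 scenes incurs at most 101 misses, and every further access is a hit.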

22 Improvement in terms of absolute times
Cache reuse improves the execution times compared to running without caching. The improvement depends on the size of the cached data and the number of const transfers. The gain is higher when the size of the cached data is larger.

23 Splotch on GPI-Space integrating D/C vs Splotch on GPI-Space using the current virtual memory layer
The version using the current virtual memory layer is up to 2.2% faster than the one over GPI-Space + D/C. Not too bad, taking into account that the D/C implementation is just a prototype and the current virtual memory layer is highly optimized, having been in use for years already. The D/C prototype implementation can be further optimized.
[Table columns: number of scenes; with D/C; current; difference in %. Only fragments of the last column survive: ,38% ,19% ,02% ,13% ,12%]

24 Conclusions and future work
The results produced by the application (Splotch) are correct, which proves that the API is correctly implemented.
The application scales and benefits from automatic caching.
The architectural design of D/C allows GPI-Space to preserve important features such as easy switching between storage layers, depending on the application requirements, and tolerance to worker failures.
Long-term goal: replace the current version of the virtual memory manager with an optimized implementation of D/C.
In order to exploit the full potential of the API, further changes touching multiple layers of the GPI-Space architecture are planned.
