MB3 D6.1 Report on profiling and benchmarking of the initial set of applications on ARM-based HPC systems Version 1.1


Document Information

Contract Number:
Project Website:
Contractual Deadline: PM12
Dissemination Level: Public
Nature: Report
Authors: Jesus Labarta (BSC)
Contributors: Alban Lumi (UGRAZ), Dmitry Nikolaenko (UGRAZ), Patrick Schiffmann (AVL), Marc Josep (BSC), Julian Morillo (BSC), Filippo Mantovani (BSC), Vladimir Marjanovic (HLRS), Jose Gracia (HLRS)
Reviewers: Roxana Rusitoru (ARM)
Keywords: Applications, performance analysis, co-design insight

Notices: This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. © Mont-Blanc 3 Consortium Partners. All rights reserved.

Change Log

v0.1 (19/09/2016): Initial draft released for internal reviewer.
v0.2 (26/09/2016): Update with corrections suggested by internal reviewer.
v1.0 (05/10/2016): Final updates for internal review.
v1.1 (13/10/2016): Integration of reviewer comments. Submitted to EC.

Contents

Executive Summary
1 Introduction
2 Application selection
3 Alya - BSC (About Alya; Alya scaling on ThunderX; Analysis of Alya on ThunderX; Analysis of Alya on Intel based systems; Co-design insight)
4 EC-Earth - BSC (About EC-Earth; EC-Earth in Mont-Blanc 3; EC-Earth on MareNostrum; Analysis of EC-Earth on MareNostrum; Co-design insight)
5 MiniFE - BSC (About MiniFE; MiniFE scaling on ThunderX; Analysis of MiniFE on ThunderX; Co-design insight)
6 NTChem - BSC (About NTChem; NTChem scaling; Analysis of NTChem on ThunderX; Analysis of NTChem on MareNostrum; Co-design insight)
7 Lulesh - BSC (About Lulesh; Lulesh scaling on ThunderX; Analysis of Lulesh on ThunderX; Lulesh on MareNostrum; Co-design insights)
8 CoMD - BSC (About CoMD; CoMD scaling on ThunderX; Analysis of CoMD on ThunderX; Co-design insights)

9 Jacobi Solver - UGRAZ (Review; MPI parallelization; Benchmarking; Profiling; Co-design insights; Future Work)
10 Algebraic Multigrid Solver - UGRAZ (Review; Geometrical Multigrid; Algebraic Multigrid; Benchmarking; Co-design insight; Future Work)
11 CARP - UGRAZ (Review; CARP in Mont-Blanc; Co-design insight; Future work)
12 Eikonal Solver - UGRAZ (Review; Local Solver; Benchmarking; Profiling; Co-design insight; Future work)
13 MercuryDPM - UGRAZ (Overview; Code analysis, benchmarking and profiling; Co-design insights)
14 Intake port - AVL (AVL Fire Overview; Scaling on Intel platform; Analysis on Intel platform; Co-design insights)
15 Parallel mesh generation and domain decomposition - AVL (Overview; Analysis on Intel; Co-design insights)

16 Shape optimization by an adjoint Solver - AVL (Overview; Scaling on Intel platform; Analysis on Intel platform; Co-design insights)
17 Aerodynamics - AVL (Overview; Scaling on Intel platform; Analysis on Intel platform; Co-design insights)
18 Quenching - AVL (Overview; Scaling on Intel platform; Analysis on Intel platform; Co-design insights)
19 Air borne noise simulation of an internal combustion engine - AVL (Overview; Scaling on Intel platform; Analysis on Intel platform; Co-design insights)
20 HPGMG - HLRS (Overview; Platforms and Software Stack; Benchmarking; Co-design insights)
Conclusions
Acronyms and Abbreviations

Executive Summary

This deliverable describes the analyses performed on the different applications being considered as part of the co-design process in Mont-Blanc 3. We present different types of results for 18 applications. For most of them we performed evaluations on Mont-Blanc platforms and analyses on both ARM- and Intel-based platforms. The analyses try to identify fundamental issues that limit performance on the currently available platforms, and through predictive studies we identify issues that will be relevant at larger scales. For each application we summarize, at the end of the corresponding section, the fundamental issues and possible co-design alternatives to consider.

Overall, load imbalance is identified as a very important issue. Operating system noise, the interaction between NUMAness and process migration, and memory bandwidth limitations are issues with different impacts depending on the application characteristics. Our observations reinforce the perception that a hybrid MPI + task-based OpenMP/OmpSs model is a potentially very interesting approach to address the behavioral issues observed. Such a model provides the application malleability necessary to dynamically adapt the exploited concurrency to the available resources. Features in the runtime library, such as the Dynamic Load Balancing library (DLB) being developed at BSC, are able to dynamically shift cores between processes to compensate for imbalances.

We find cases where the granularity of computation is rather fine and a pure runtime implementation will have overheads. It would be important to provide architectural mechanisms to reallocate resources with finer granularity than full cores. APIs to shift frequency, memory bandwidth or cache capacity between processes would allow the runtime to compensate for small imbalances that will be important at large scale. Architectural or operating system support to detect imbalances at all levels of granularity should be a target of research. Collaboration between the architecture, the operating system and the runtime is the key to achieving high efficiency.

1 Introduction

This deliverable reports on the porting status, performance achieved and detailed profiling analyses of the initial set of applications selected for co-design and evaluation purposes in the Mont-Blanc 3 project. It is produced as a result of the work performed in tasks T6.1 and T6.2.

The objective of the application work is to provide a set of kernels, mini-apps and applications for co-design and evaluation of both the Mont-Blanc 3 architectural proposals and prototype, as well as the system software developments. The objective of any co-design work is to identify fundamental issues in the applications and determine whether they are best addressed by restructuring the application or by providing support in the system software or architecture. The aim is to devise the most cost-effective and efficient solution, which can imply introducing architectural or system software features or changes in the application.

For each application we provide a very short description of the application area, algorithms and problems considered, followed by a report of the status of execution on ARM-based prototypes and detailed performance analyses. Although we report performance numbers on ARM-based systems (especially the Cavium ThunderX), we also perform some of the analyses on different platforms for comparison purposes or as a way to look at platform-dependent or platform-independent application characteristics. For each application we summarize its potential scalability, its fundamental limitations, and what type of techniques addressed by the project at either the architectural or the system software level might be relevant to address them.

2 Application selection

Table 1 shows all the applications targeted by the project. They have been chosen by the different partners with different motivations of technical and economic interest. Some of them are of strategic interest for our research teams, of direct commercial interest, or constitute frequently used co-design benchmarks at the worldwide level (US, Japan). The last column identifies those applications for which some evaluation or analysis is reported in this deliverable.

3 Alya - BSC

3.1 About Alya

The Alya System [Aly16] is the BSC simulation code for multi-physics problems, specifically designed to run efficiently on supercomputers. It solves time-dependent Partial Differential Equations (PDEs). Variational formulations and the Finite Element Method (FEM) are preferred, but Finite Volumes (FV), Finite Differences (FD) or other more exotic discretization methods can be easily programmed. Time advance is numerically integrated by FD with trapezoidal rules. Monolithic implicit, by-block segregated or explicit schemes are used. Higher-order time schemes, such as Runge-Kutta or BDF, are also programmed. Alya uses unstructured meshes and is parallelized with MPI and OpenMP. METIS is used for domain decomposition at the MPI level. Alya is especially well suited to simulate complex problems in different domains of science and technology such as incompressible flows, compressible flows, non-linear solid mechanics, species transport equations, excitable media, thermal flows or N-body collisions. Of special interest is the solution of coupled multi-physics problems.
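Alya relies on METIS for the MPI-level domain decomposition. As a minimal illustration of that step (not Alya's actual code), the sketch below partitions a small dual graph with the METIS 5 k-way routine; the toy chain graph and the number of parts are assumptions chosen only for the example.

```c
#include <stdio.h>
#include <metis.h>

int main(void) {
    /* Toy dual graph of 6 elements connected in a chain (CSR adjacency). */
    idx_t nvtxs = 6, ncon = 1, nparts = 2;
    idx_t xadj[]   = {0, 1, 3, 5, 7, 9, 10};
    idx_t adjncy[] = {1, 0, 2, 1, 3, 2, 4, 3, 5, 4};
    idx_t options[METIS_NOPTIONS];
    idx_t objval, part[6];

    METIS_SetDefaultOptions(options);

    /* k-way partitioning: each element gets a subdomain (MPI rank) id. */
    int rc = METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy,
                                 NULL, NULL, NULL, &nparts,
                                 NULL, NULL, options, &objval, part);
    if (rc != METIS_OK) return 1;

    for (idx_t i = 0; i < nvtxs; i++)
        printf("element %d -> subdomain %d\n", (int)i, (int)part[i]);
    return 0;
}
```

In the real code the dual graph comes from the mesh connectivity and the number of parts equals the number of MPI processes; the quality of this partition is what later shows up as the algorithmic imbalance discussed in the analysis sections.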

Table 1: Selected applications

Application(s) | Class | Promoter | Motivation | Analysis
Lulesh | Mini-app | BSC | Global co-design benchmark (US) | X
CoMD | Mini-app | BSC | Global co-design benchmark (US) | X
MiniFE | Mini-app | BSC | Global co-design benchmark (US) | X
NTChem | Mini-app | BSC | Global co-design benchmark (Japan) | X
Alya | Application | BSC | Strategic research (engineering) | X
IFSkernel | Kernel | BSC | Co-design benchmark (EU) |
EC-EARTH | Application | BSC | Strategic research (earth sciences) | X
Cholesky | Kernel | BSC | Architectural research benchmark |
PARSECS (11) | Mini-app | BSC | Architectural research benchmark |
Smooth Particle Hydrodynamics | Mini-app | USTUTT | |
HPGMG | Mini-app | USTUTT | Global co-design benchmark (US) | X
Jacobi Solver | Mini-app | UGRAZ | Architectural research benchmark | X
Algebraic Multigrid Solvers | App-kernel | UGRAZ | Co-design benchmark | X
Eikonal Solver | App-kernel | UGRAZ | Strategic research | X
CARP | Application | UGRAZ | Strategic research | X
MercuryDPM | App-kernel | UGRAZ | Co-design benchmark | X
Non-Newtonian Fluid Solver | Kernel | UGRAZ | Co-design benchmark |
Rodinia | Micro-app | CNRS | Architectural research benchmark |
Intake port | App-kernel | AVL | Commercial use case | X
Shape Optimization by an Adjoint Solver | App-kernel | AVL | Commercial use case | X
MPI/OpenMP parallel mesh deformation using RBF interpolation | App-kernel | AVL | Commercial use case | X
Air borne noise simulation of an internal combustion engine | App-kernel | AVL | Commercial use case | X
Aerodynamics | App-kernel | AVL | Commercial use case | X
Quenching | App-kernel | AVL | Commercial use case | X
Friction analysis and acoustic excitation of an internal combustion engine | App-kernel | AVL | Commercial use case |

In our evaluations we use three different problems and data sets:

respiratory nastin: simulation of the human respiratory system, modeled with a grid of 17 million elements.
cavity10 hexa: uses a grid of 10 million elements.
Alya-Red: simulation of the human heart. These simulations are of great strategic interest for the BSC teams involved in the Biosimulators Center of Excellence recently awarded by the EC.

3.2 Alya scaling on ThunderX

Scalability results have been gathered for Alya mainly on the ThunderX mini-cluster. Figures 1, 2 and 3 present the strong scaling results for the three tested cases over the range of core counts available in the mini-cluster. Although efficiency is fair up to the few hundred cores available, it is apparent that efficiency will not be good at very large core counts. It is important to understand the fundamental causes limiting such scalability.

3.3 Analysis of Alya on ThunderX

We obtained traces to compare the fine-grain performance of the executions on the ThunderX and the Applied Micro X-Gene 1 mini-clusters. We tried to compare the systems in terms of performance per node, per core, and the cost/performance ratio.

Figure 1: Strong scalability of Alya in ThunderX for the respiratory nastin input set.
Figure 2: Strong scalability of Alya in ThunderX for the cavity10 hexa input set.
Figure 3: Strong scalability of Alya Red in ThunderX.

Figure 4: Synchronized timelines for two executions of the same problem on the X-Gene with 24 MPI processes (top) and ThunderX with 256 processes (bottom). In both cases three nodes of the corresponding machine are used.

In both cases we run on three nodes. Although the problem size is the same, we had to use 24 MPI processes on the X-Gene platform and 256 on the ThunderX. The timelines are shown in Figure 4. We can see that the ThunderX system is about 5x faster than the X-Gene in a node-to-node comparison. Given that the ThunderX node is dual socket, the socket-to-socket performance ratio is about 2.5x for this application, and when comparing on a core-per-core basis the ratio is 0.42x. Based on the acquisition cost of our mini-clusters, the performance-per-dollar ratio is about 4.8x better in the ThunderX case.

The detailed analysis of the two traces shows that both expose some load imbalance. Part of it derives from the fact that the first process only performs general bookkeeping activities and is not involved in the actual computation. The global inefficiency this induces becomes smaller as the number of processes grows and will not be relevant at large scale. The X-Gene trace has hardware counters, and we can verify that the imbalance between the processes involved in the computation is algorithmic and derives from the METIS domain decomposition. Its global impact on performance is a reduction of efficiency of less than 5% in both cases.

We also identify in the traces one region that, although irrelevant at these core counts, might impact scalability at very large core counts. The region performs a logical gather operation by point-to-point messages sent from every processor to rank 0. This serialization may have an impact at very large core counts.

More traces have been obtained for the bigger case (i.e. respiratory nastin) on ThunderX, for 4, 16, 32 and 64 MPI processes. The trace with 32 processes reveals the impact of unexpected persistent imbalances not justified by algorithmic imbalance. This is shown in Figure 5, where only one thread takes significantly more time than the others. Overall, the IPC is not very good (~0.48), but this thread exposes an even lower IPC (~0.38) than the others. This difference is persistent along all solver iterations. The cache counters in the trace indicate that the L2 cache miss ratio of this thread is even better than in the others. The explanation for the lower IPC is therefore not clear. We speculate that it might be larger memory access latencies caused by process migrations, but we do not have enough data to support or reject the hypothesis. An interesting piece of co-design information from this experience affects performance analysis tools rather than the system architecture itself: such tools should detect and display process migrations, as this might provide useful information to explain some observed behaviours.
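The per-socket and per-core ratios quoted above follow directly from the node-level speedup and the core counts involved. As a short check, taking the full node core counts used elsewhere in this section as assumptions (8 cores per X-Gene 1 node, 2 x 48 = 96 cores per dual-socket ThunderX node):

```latex
\underbrace{\approx 5\times}_{\text{node vs. node}}
\quad\Rightarrow\quad
\frac{5}{2\ \text{sockets}} = 2.5\times \ \text{(socket vs. socket)},
\qquad
5 \times \frac{8\ \text{cores (X-Gene node)}}{96\ \text{cores (ThunderX node)}} \approx 0.42\times \ \text{(core vs. core)}.
```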

Figure 5: Imbalance in the solver phase at ThunderX with 32 processes, caused by a lower IPC in just one core.
Figure 6: Paraver snapshot showing the Alya imbalance in matrix assembly. Run with 129 processes.

Another interesting behaviour is exposed by the trace with 64 processes. In this case, one socket is fully loaded (48 threads) while the other one only has 16 threads. The IPCs for these last threads are better (0.49 vs 0.45) and also show less variability. This observation suggests again that memory subsystem issues may have to be studied in the hardware co-design work packages as potential causes of micro-architectural variability.

3.4 Analysis of Alya on Intel based systems

The simulation of the respiratory system exposes two problems that are challenging when getting to very large scales, and for which the Mont-Blanc 3 project aims to propose solutions.

The first issue is the difficulty of balancing a multi-physics code coupling fluid dynamics and particle transport modules with potentially very different computational cost in different regions of space. Figure 6 shows a timeline of a run with 129 processes where the matrix assembly at the beginning of the timeline happens to be very imbalanced, while the solver step in the final part of the timeline seems to be more balanced. We see the same effects in traces of different core counts. The matrix assembly constitutes in all cases an important part of the elapsed time of each timestep computation. This region is a good candidate for automatic balancing with DLB. The code region is a large loop that can be parallelized with OpenMP (or OmpSs). The main potential performance issue in this case is a reduction through an indirection on a large array.
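To make the pattern concrete, the sketch below shows the kind of assembly loop meant here: a generic FEM-style accumulation, not Alya's actual code. Each iteration scatters element contributions into a large global array through an index array, so a naive OpenMP parallelization needs atomics or a sparse-reduction mechanism. The names nelem, conn and elem_contrib are illustrative assumptions.

```c
#include <omp.h>

/* Generic FEM-style assembly: each element adds its local contributions
 * into a large global right-hand-side vector through a connectivity array.
 * The indirect update rhs[conn[...]] is what forces atomics (or a
 * runtime-supported sparse reduction) when the loop is parallelized. */
void assemble_rhs(int nelem, int nodes_per_elem,
                  const int *conn,            /* [nelem * nodes_per_elem] */
                  const double *elem_contrib, /* same layout as conn */
                  double *rhs)                /* large global array */
{
    #pragma omp parallel for schedule(static)
    for (int e = 0; e < nelem; e++) {
        for (int k = 0; k < nodes_per_elem; k++) {
            int node = conn[e * nodes_per_elem + k];
            /* Different elements may touch the same node, so the update
             * must be atomic unless a sparse reduction is available. */
            #pragma omp atomic update
            rhs[node] += elem_contrib[e * nodes_per_elem + k];
        }
    }
}
```

OmpSs and recent OpenMP versions provide array and task reductions that can, in principle, replace the atomic update; evaluating their efficiency for this pattern is exactly the kind of programming-model co-design input listed in Section 3.5.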

The second issue is apparent when zooming into the solver step. The code uses a traditional preconditioned CG algorithm. At large scale, this type of solver results in very fine-grained synchronized computations that expose important waiting times at collective operations. Some of this inefficiency is due to a small load imbalance, as the METIS partitioner is not able to perfectly balance the computational load in cases with complicated topologies such as this one. The second source of inefficiency is micro-imbalances caused by operating system noise, where individual and typically different processes at different iterations get delayed by interrupts, kernel preemptions or system daemons. Both effects can be seen in Figure 7. The lower orange lines represent long computations that are persistent along all iterations and are justified by a correspondingly large number of instructions. The orange line at the top right corresponds to a long execution that does not happen every iteration. Neither the number of instructions nor the number of cycles in this instance is different from the corresponding numbers in other iterations. This is explained by preemptions and by the fact that hardware counters are virtualized and thus only measure the activity of the instrumented process.

Figure 7: Paraver view of the Alya solver execution.

A final general observation from looking at the application source code is that, although its main objective is to perform numerical computations in double precision, a large amount of bookkeeping integer code is used, with a branchy control flow structure. Approaches to efficiently overlap these operations might help this code. For example, it may be interesting to explore the usefulness of very large instruction windows. We have not evaluated the potential usage of the newly released ARM vector instruction set in Alya, but it might be very useful given this instruction mix.

3.5 Co-design insight

The analysis identifies several techniques and co-design directions that we believe are relevant for this application. Some of these techniques are being worked on in WP7, and thus this application will be an important test case for them. The main observations we make are:

Load imbalance at either a macroscopic (long term, persistent across algorithmic iterations) or microscopic (short term, changing at different algorithmic iterations) level is a source of inefficiency. The source of load imbalance can be algorithmic or generated by external system variability (operating system noise, locality differences between what otherwise seem like identical processes, unfairness in the scheduling of resources at the micro-architectural level, process variability, etc.). In many cases the analysis tools allow us to identify or discard sources of such variability. There are situations where the captured data is enough to observe the phenomenon but not enough to find out the ultimate reason. The same application may expose different combinations of imbalance patterns at different phases along its execution, as well as intermixed causes within one phase.

Dynamic load balancing techniques are thus very important. The DLB approach seems appropriate for cases with relatively large granularity (e.g. the matrix assembly imbalance).

A pure runtime-based approach may not be enough in regions with fine granularity. In this case, coordination between the operating system and the runtime will be needed. Architectural support for detecting and quantifying the perturbations on the one side, and for balancing architectural resource usage (e.g. energy, memory bandwidth) on the other, are relevant co-design directions.

Asynchronous iterative solvers are needed to tolerate variability and operating system noise. This may on one side require algorithmic developments. On the other side, programming models should include features that allow the programmer to express the potential asynchrony and overlap in the algorithm in clean and easy-to-read ways. Given that in some cases the granularity of the computations to overlap will be very fine, efficient micro-architectural support for asynchrony will also be relevant.

Reductions on large sparse matrices are an important functionality for this application and, in general, for all finite element applications. Programming model features to support sparse reductions efficiently are needed, and best practices on how to use them should be promoted among programmers. New micro-architectural features will be very useful here.

Developing tools and methods for estimating the sensitivity of applications to variability is needed to predict the impact of these effects. Developing solutions for this problem will be very important at large scale. These solutions may require a close collaboration between the operating system, the runtime library and the micro-architecture.

A co-design input for performance analysis tool designers is to focus on detecting and reporting process migrations and memory access latencies.

4 EC-Earth - BSC

4.1 About EC-Earth

Earth System Models (ESMs), such as EC-Earth [EC-16], are currently the only way of providing society with information on future climate. EC-Earth generates reliable in-house predictions and projections of global climate change, which are a prerequisite to support the development of national adaptation and mitigation strategies. EC-Earth is developed as part of a Europe-wide consortium, thus promoting international cooperation and wide access to knowledge and data. It further enables fruitful interactions between academic institutions and the European climate impact community. The consortium is led by a Steering Group planning and coordinating the development of the model and the consortium. EC-Earth made successful contributions to international climate change projections such as CMIP5. Ongoing development by the consortium will ensure that increasingly more reliable projections can be offered to decision and policy makers at regional, national and international levels. A new version of EC-Earth is under development, and the consortium plans to participate in CMIP6.

4.2 EC-Earth in Mont-Blanc 3

EC-Earth has many components, as depicted in Figure 8. Inside the Mont-Blanc 3 project we are interested in a coupled atmosphere + ocean simulation where the IFS and NEMO modules are involved.

Figure 8: EC-Earth components.

At the time of writing this deliverable, the application has been compiled and built on ThunderX, including all the modules needed for the target coupled IFS+NEMO execution. This includes the IFS, NEMO, XIOS and OASIS components. Unfortunately, some execution problems related to MPI are still present and they are being investigated.

4.3 EC-Earth on MareNostrum

An initial analysis and profiling of the application has been done on the MareNostrum III supercomputer [MN16]. A typical problem when executing these coupled codes is to decide how many processes to devote to each component. We performed an exhaustive search of the configuration space for the targeted problem size, around the setups used in current production practice. The simulated grid was a real production case (T255L91-ORCA1L75). Figure 9 shows the speedups obtained for different combinations of NEMO and IFS process counts. Even if increasing the number of cores devoted to each component reduces the execution time, the efficiency is really low. Only when fewer than ten cores are devoted to each component does the parallel efficiency reach reasonable values. The code is, nevertheless, often run with a few hundred cores in total.

4.4 Analysis of EC-Earth on MareNostrum

A first grasp of the behavior of the code can be seen in Figure 10. It shows an efficiency timeline for an execution interval of about 2.5 seconds. The plot corresponds to a run with 1600 processes, devoting 640 to IFS (top lines) and 960 to NEMO (bottom lines). The dark blue color represents a high efficiency value, meaning that in that region of time and space cores were doing useful user-level computation. A light green indicates that in that region cores were mostly waiting in MPI calls. A gradient from light green to dark blue represents efficiency values between zero and one for that region. We see a region of time where NEMO is doing very little computation compared to its communication, regions where it is essentially waiting for IFS, and a ripple effect of delays that originates at an imbalanced computation (corresponding to special processing at the poles) and propagates through sequences of dependency chains.

Figure 9: EC-Earth Speedup at MareNostrum III.
Figure 10: EC-Earth Efficiency.

We see regions of IFS where some internal imbalance appears. There are different regions with different load balance patterns. A zoom into one of these regions is shown in Figure 11. It displays the MPI calls timeline (yellow are barriers, red are MPI waits) on the left and a histogram of the duration of the computations on the right.

Figure 11: Imbalance in IFS module.

We also observed serialization in some IFS computations before the coupling communications (Figure 12). In the observed pattern, one process performs several broadcasts of very small messages. In this very localized region of code we carried out the simple experiment of packing the data and performing a single broadcast operation. As can be seen in Figure 13, that was an easy way to get a 40% performance improvement in that region.

Figure 12: Coupling serialization.
Figure 13: Improvement by grouping collectives.

The global pattern of imbalances, and the fact that one component has to wait for the other (NEMO for IFS or IFS for NEMO), depends on the statically decided partitioning of the processors between them in a hard-to-predict manner. Furthermore, it may change between iterations of the algorithm, as there are some expensive operations that a given component only executes every certain number of iterations. Our observation from this analysis is that a solution based on the DLB techniques being developed at BSC should help achieve a significantly better efficiency. The application of this technique nevertheless requires a hybrid parallelized code (MPI+OpenMP/OmpSs). Even if the code is huge, with many distributed development teams, the good news is that our proposed approach can be applied incrementally. We also believe that by using OmpSs it will be possible to easily introduce the computation and communication overlapping capabilities that will also help address the observed low efficiency.
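Going back to the coupling serialization shown in Figures 12 and 13, the sketch below illustrates the kind of broadcast grouping tried in that experiment. It is illustrative only, not the actual EC-Earth/IFS code; the field names and sizes are assumptions.

```c
#include <mpi.h>
#include <string.h>

/* Illustrative coupling fields (assumed names and sizes). */
#define N1 8
#define N2 4
#define N3 16

void broadcast_coupling_fields(double *f1, double *f2, double *f3,
                               int root, MPI_Comm comm)
{
    /* Instead of three small broadcasts ...
     *   MPI_Bcast(f1, N1, MPI_DOUBLE, root, comm);
     *   MPI_Bcast(f2, N2, MPI_DOUBLE, root, comm);
     *   MPI_Bcast(f3, N3, MPI_DOUBLE, root, comm);
     * ... pack once and issue a single collective. */
    double buf[N1 + N2 + N3];
    int rank;
    MPI_Comm_rank(comm, &rank);

    if (rank == root) {
        memcpy(buf,           f1, N1 * sizeof(double));
        memcpy(buf + N1,      f2, N2 * sizeof(double));
        memcpy(buf + N1 + N2, f3, N3 * sizeof(double));
    }

    MPI_Bcast(buf, N1 + N2 + N3, MPI_DOUBLE, root, comm);

    if (rank != root) {
        memcpy(f1, buf,           N1 * sizeof(double));
        memcpy(f2, buf + N1,      N2 * sizeof(double));
        memcpy(f3, buf + N1 + N2, N3 * sizeof(double));
    }
}
```

The same effect could be obtained with a derived datatype or MPI_Pack; the point is simply to pay the per-collective latency once instead of once per small field.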

As a final comment, we have also observed that the NEMO module is quite sensitive to OS noise, as it is very fine grained. We consider that approaches with fewer processes, using OpenMP/OmpSs for the finer granularity, can reduce the amount of MPI overhead and dependencies.

4.5 Co-design insight

Among the directions that we consider should be studied with this application are:

Dynamic load balancing by reallocating cores between processes of the different components (IFS and NEMO) at runtime can result in important performance gains.

NEMO has very fine-grained computations. It will be interesting to reduce the number of processes devoted to it and leverage the finer-grained communication and synchronization at the threaded level.

Addressing the load imbalance issue generated at the poles is another topic to look at.

Process/thread mapping issues (processes to nodes and threads to cores) should be considered.

Potential asynchrony between components, as well as within components, should be explored.

Overall, we consider that this application is a real example of complex multi-physics coupled codes, and it has a huge potential to benefit from the proposed features of the Mont-Blanc system software and architecture. The fact that it is a project with multiple contributors makes pushing new approaches a slow process, but there is a lot of potential to improve. As an application of strategic interest for the Earth Sciences department at BSC, we are committed to working in this direction, trying to push the community of developers to understand the potential gains and adopt the programming practices that will enable the techniques suggested above.

5 MiniFE - BSC

5.1 About MiniFE

MiniFE (Implicit Finite Elements) is a mini-app from the Mantevo project set of applications, which mimics the finite element generation, assembly and solution for an unstructured grid problem. It is also part of the CORAL benchmark suite. Its simple code is not intended to be a true physics problem, but rather the best approximation to an unstructured implicit finite element or finite volume application in 8000 lines or fewer.

5.2 MiniFE scaling on ThunderX

We installed and ran MiniFE on the ThunderX mini-cluster. Although the code is hybrid MPI + OpenMP, the OpenMP parallelism is only partial, with most of the code within a process being serial. Thus the parallel efficiency is really low, and using only MPI yields much better results. The scaling behavior is shown in Figure 14. We display results of runs in the default ThunderX environment, which does not pin processes to cores, and of runs with explicit sequential pinning of ranks to cores. We performed the pinning experiments because the analysis reported in the next section showed the huge impact of process migration.
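As an illustration of what "sequential pinning of ranks to cores" means here, the hedged sketch below binds each MPI rank to one core using the Linux affinity API. In practice the same effect is usually obtained through the MPI launcher's binding options; the core numbering and the 96-cores-per-node figure are assumptions for the example.

```c
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Sequential pinning: rank r -> core (r mod cores_per_node).
     * 96 cores per dual-socket ThunderX node is assumed here. */
    int cores_per_node = 96;
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(rank % cores_per_node, &mask);

    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        fprintf(stderr, "rank %d: could not set affinity\n", rank);

    /* ... application work runs on a fixed core from here on ... */

    MPI_Finalize();
    return 0;
}
```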

Figure 14: MiniFE strong scaling at ThunderX.
Figure 15: Useful duration timeline for the default run of MiniFE at ThunderX.

5.3 Analysis of MiniFE on ThunderX

Figure 15 shows the duration of the computation bursts for a region of 322 seconds of the trace of an execution with the default setup (no pinning). Figure 16 shows a zoom on a few iterations. The main issue is clearly a huge imbalance between MPI processes, even if the computations are well balanced in terms of instructions. This imbalance is due to IPC drops in some processes. At intervals on the order of 100 seconds, the IPC drop migrates between processes, following a non-structured pattern. We verified that cache misses stay stable throughout the entire execution and thus speculate that NUMAness and process migrations between sockets are the main reasons for this behavior. We also speculate that this is the cause of the huge standard deviation between the execution times of different production (non-traced) runs, mainly at small core counts.

Figure 16: Zoom on a few inner iterations of Figure 15.

Figure 17: MiniFE IPC at ThunderX when using CPU binding. Same scale as Figure 16.

By using CPU pinning we can avoid the migrations, and we observe that it drastically removes the IPC imbalance and variability over time (Figure 17). The obtained results show a significant performance improvement for single-node core counts. CPU binding also removes the high deviation in duration between the different runs. This analysis demonstrates the importance of NUMAness within a node (more than 50% slowdown for threads accessing data remotely). NUMAness and process migration may also be the explanation for the behavior observed in Alya in Section 3.

Even in the properly pinned runs, the IPC is very low (~0.20). We observe that each L1 miss is an L2 miss and that the miss ratio is very high (close to 20 misses per thousand instructions). The application is thus heavily memory bandwidth limited. This is also the explanation of the efficiency drop in Figure 14 when the application is confined within one node (from 1 to 96 processes). After that, the slope of the efficiency plot flattens as all nodes are fully populated. Nevertheless, it is strange to see that the non-pinned runs behave the same as the pinned ones at this scale. This behavior deserves some further study.

We also performed predictive studies on the scalability of the application using the methodology described in [RGL14]. Figure 18 shows the projection of load balance, serialization, transfer and global efficiencies. This was done based on 10-iteration cuts extracted from real run traces using 4, 8, 24 and 192 processes. The plot shows that this application and problem size should scale decently (efficiency above 0.75) up to a significantly larger number of cores (~4x) than available in the ThunderX mini-cluster. The forecast shows load balancing and data transfer as equally limiting factors. We performed the same forecast based on Dimemas simulations of the same traces for a machine with an idealized MPI behavior, with a latency of 30 microseconds and a bandwidth of 150 MB/s. These values result in a Dimemas prediction error of less than 5% for all the available traces. The result is shown in Figure 19. It shows that in this case the application should scale up to a larger core count (~8x). The difference between the two forecasts is in the data transfer component. We interpret this as the effect of internal variability of the MPI implementation on the ThunderX mini-cluster.

One thing is not apparent from the timelines of useful duration for different core counts, but shows up in Figures 18 and 19: the prediction indicates a strong impact of load imbalance, while the timelines (e.g. Figure 17) seem fairly balanced. This observation motivated the search for the detailed origin of the imbalance. Looking at traces of 48 processes, we see an imbalance in the number of neighbors and bytes sent by the different processes. Beyond the imbalance caused by the MPI overhead, this also implies a certain imbalance in the duration of the code region that packs the data to be sent. The imbalance in communication overheads and packing overhead is thus correlated. Better domain decompositions would be an alternative to consider for improving the balance in the number of neighbours and the size of messages. The trace of 192 processes showed a more important imbalance in the packing region, as shown in Figure 20. For some reason, the packing region between irecvs (pink) and sends (blue)

20 MB3 D6.1 - Report on profiling and benchmarking of the Figure 18: Scalability projection for the MiniFE execution based on actual execution on ThunderX minicluster Figure 19: Scalability projection for the MiniFE execution based on actual execution on ThunderX minicluster took significantly longer ( 5.14 ms vs less than 1.4 ms) in rank 123 than in other ranks even if it is not the one with the larger message sizes. The effective clock frequency for the packing region in Figure 20 is also somewhat lower than the nominal processor frequency. The effective clock frequency for a region is computed by dividing the number of cycles returned by PAPI by the duration of the region. It should typically match the processor frequency, but lower values can be observed if the thread is preempted or performs system calls where the process releases the core. To check our observations, we obtained another trace for the same core count and the effect appears again but on different processes. In both cases the effect appears on one of the processes of each socket, for the whole run on the same process, but on different processes for different runs. We then looked for this phenomena in larger core counts, and it also seems to be present there (at 288 and 384 processes). Strangely, in some of these cases it appears between two of the back to back sequence of isends (always the same isends) issued 20

21 D6.1 - Report on profiling and benchmarking of the by a process. We do not have an explanation for this behavior, but it does show a structure that leads to the conviction that it is not noise. Whether this is due to issues of the specific platform, its operating system, MPI runtime or any other cause, our conclusion for co-design is that the runtime should be ready to detect (OS support may be helpful for this) and react to these pseudo-random effects. Approaches based on asynchronous task-based programming and our DLB library may be enablers for such a runtime. Figure 20: outlier duration of packing between irecvs (pink) and sends(blue) for one process. We also observe other load imbalance effects. One of them is a small computational imbalance where the number of instructions slightly grows with rank. The other imbalance effect that we observe happens in regions of code where there is essentially no computational imbalance and still we see a difference in IPC between odd and even ranks. This results in a difference of between 3 to 4% in execution time for those sections. This difference in IPC is not correlated with differences in L2 cache misses. Although we do not have much information on the internal architecture of the Cavium chip, we do suspect it may have architectural heterogeneities or NU- MAness that may cause these imbalances. To illustrate it we show the histograms of IPC for the longest computational burst of 24, 48 and 192 processes in figure 21. The three histograms have the same scale, the lowest IPC is and the largest one 0.223). Besides the reduction in IPC that we observe when the number of processes on each node is increased, we also see the aforementioned difference between odd and even cores. The case of 192 processes fills two nodes and shows an interesting behavior. In the first node we observe the bimodal distribution of IPCs, while in the second one all threads achieve an intermediate IPC value. These behaviors may have their roots in micro-architectural issues. It would be important to understand the detailed reasons of these behaviors for a better co-design of the micro-architecture as well as the runtime. At the architectural level, achieving an average IPC in all threads is better than having some threads faster than others, at least in the very synchronous and statically-scheduled approach that dominates current programming practices. If the heterogeneity is unavoidable, the runtime could be optimized to handle it properly. Given the very small imbalance (< 5%), architectural support might be very useful for this runtime functionality. Other domain decompositions might result in better balance, but it is probably difficult to balance both computation and communication very early in the application execution cycle. We consider that the other direction that should be considered is to introduce asynchrony in the application by means of a task-based model. The idea would be to allow for these packing and communication tasks to be executed concurrently with the main computation tasks. Given the fact that these are very small tasks, they could even be offloaded to small, symbiotic processors as targeted by the Mont-Blanc 3 vision. An interesting plot is presented in Figure 22. It displays efficiency measured by the tools from the whole range of traced ranks (from 4 to 388). 
It is interesting to see how the time-based efficiency follows the shape we presented for untraced runs (Figure 18), thus indicating that the tracing process is not perturbing the data or the insight derived from it. Another interesting insight is that the IPC reduction explains the loss of efficiency when scaling to fill the node.
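The effective clock frequency metric used in this analysis (cycles reported by the hardware counters divided by elapsed time) can be reproduced with a few PAPI calls. The sketch below is a minimal, stand-alone illustration of that computation, not the instrumentation actually used by the tracing tools; the busy loop stands in for the region of interest.

```c
#include <stdio.h>
#include <papi.h>

/* Effective clock frequency of a code region: PAPI_TOT_CYC divided by the
 * elapsed wall-clock time. Values below the nominal frequency hint at
 * preemptions or time spent outside the instrumented process. */
int main(void)
{
    long long cycles;
    int eventset = PAPI_NULL;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
    if (PAPI_create_eventset(&eventset) != PAPI_OK) return 1;
    if (PAPI_add_event(eventset, PAPI_TOT_CYC) != PAPI_OK) return 1;

    long long t0 = PAPI_get_real_usec();
    PAPI_start(eventset);

    /* ... region of interest (e.g. the packing code) ... */
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; i++) x += 1e-9;

    PAPI_stop(eventset, &cycles);
    long long t1 = PAPI_get_real_usec();

    double ghz = (double)cycles / (double)(t1 - t0) / 1e3; /* cycles/us -> GHz */
    printf("effective frequency: %.2f GHz\n", ghz);
    return 0;
}
```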

22 MB3 D6.1 - Report on profiling and benchmarking of the Figure 21: Scalability projection for the MiniFE execution based on actual execution on ThunderX minicluster Figure 22: Different efficiencies for the whole range of traced core counts executions This reduction is most probably caused by the limited memory bandwidth of the socket. It is also curious how the plot indicates a certain improvement in IPC when scaling the number of nodes after that. This could be further studied but it is probably related to the reduction of the problem size per core in strong scaling runs. Finally, although not clearly visible at the scale of the plot, the data actually shows a bit on inefficiency due to an increase in the number of total instructions executed in user mode per iteration. 5.4 Co-design insight The analysis identifies several techniques and co-design directions that we believe are relevant for this application. Some of these techniques are being worked on in WP7 and thus this application will be an important test case for them. The main observations we draw are: NUMAness combined with process migration may result in significant performance losses. These effect should be taken into account during OS design. Additionally, runtimes should be ready to react and compensate this effect, in cases it happens. Architectural designs to reduce/tolerate the cost of remote memory accesses is an important topic to take into account in the design of the system. Even if not targeting NUMA nodes, memory bandwidth is very important for this application, and either architectural approaches to maximize the throughput obtained from 22

6 NTChem - BSC

6.1 About NTChem

NTChem is a software package for molecular electronic structure calculation, developed by the RIKEN Advanced Institute for Computational Science (AICS) for the K computer. As part of the Mont-Blanc project we are running NTChem-mini, which includes a packaged version of NTChem/RI-MP2, a scheme to account for electron correlations based on the resolution-of-identity second-order Møller-Plesset perturbation method, which enables the computation of large molecular systems such as nano-molecules and biological molecules.

6.2 NTChem scaling

NTChem-mini has been run on the ThunderX mini-cluster using a small test case (H2O). The scaling performance is shown in Figure 23 for both the pinned and non-pinned runs. In this case, the difference does not seem to be as important as in the MiniFE case. The efficiency certainly drops to very low values as we scale.

We also executed the application on other platforms such as Pi (a FUJITSU machine similar to K but at a smaller scale) and MareNostrum. The scaling plots are given in Figure 24 and Figure 25. We used a larger problem size (taxol) and larger core counts in this case. In the MareNostrum executions we show the efficiency of the pure MPI run (full line) and of the hybrid MPI+OpenMP version (dotted lines). All plots show the same effect of a significant efficiency loss for the pure MPI runs when the core count gets sufficiently high compared to the problem size. The MareNostrum plot also shows that at large core counts the hybrid version with a few threads outperforms the pure MPI one, but if many OpenMP threads per process are used, the performance drops drastically.

Figure 23: NTChem strong scaling at ThunderX. H2O test case.
Figure 24: NTChem strong scaling at Pi. Taxol test case.
Figure 25: NTChem strong scaling at MareNostrum. Pure MPI and Hybrid (MPI+OpenMP). Taxol test case.

6.3 Analysis of NTChem on ThunderX

In this section we analyze a trace of the execution of the H2O problem size with 48 processes on a ThunderX node. The application structure is shown in Figure 26 and consists of a prologue and a loop. As we can see, the prologue is highly unbalanced and represents nearly half of the application execution time. Nevertheless, we focus our analysis on the iterative part, as we expect this to be more relevant in real production cases.

Figure 26: NTChem H2O case overview at ThunderX.

In the actual computational region, each process iterates between phases with sequences of several local computations of a fixed instruction count (which happen to be DGEMM operations) and phases where, after each DGEMM, some data is sent to the next process in a circular way. The communication pattern results in delays that propagate and stack over iterations, as shown in Figure 27. The histogram of the duration of the DGEMMs (Figure 28) shows a trimodal distribution with an increase of around 20% in duration from the first mode to the second and the third. This is so even if the histogram of instructions shows that all DGEMMs have the same number of instructions. The histograms of L1 and L2 miss ratios show that there is no variability in these metrics, thus pointing again to process migration and the NUMAness of the architecture. In this case, the impact in time (20% increase in duration) is not as important as it was in the MiniFE case (100% increase), given that the DGEMM has better locality. This is also apparent in the miss ratios for the L1 (~8.4 misses per thousand instructions) and L2 (~2.1 misses per thousand instructions) caches.

To address the aforementioned issues, we use CPU pinning. This eliminates the trimodal distribution of the DGEMM duration, but does not entirely eliminate the accumulated delays described previously (Figure 29). These delays derive from an algorithmic imbalance in the code: the number of DGEMMs executed in the local phase differs between processes. We also observe that the IPC for the DGEMM operations is only 0.45, which is fairly low.

Figure 27: NTChem loop. Left: first iterations. Right: last iterations and synchronization.
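A schematic of the compute-then-pass-to-the-next-rank pattern described above is sketched below. It is illustrative only, not NTChem's actual code; the block size and the naive matrix product stand in for the real DGEMM calls.

```c
#include <mpi.h>

#define B 64  /* assumed block size for the illustration */

/* Stand-in for the DGEMM performed in the local phase: C += A * Bm. */
static void local_dgemm(const double *a, const double *bm, double *c)
{
    for (int i = 0; i < B; i++)
        for (int k = 0; k < B; k++)
            for (int j = 0; j < B; j++)
                c[i * B + j] += a[i * B + k] * bm[k * B + j];
}

/* Each process computes on its block and then forwards it to the next
 * rank in a ring, receiving the previous rank's block for the next step. */
void ring_step(double *a, double *bm, double *c, double *recv_buf,
               MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int next = (rank + 1) % size;
    int prev = (rank - 1 + size) % size;

    local_dgemm(a, bm, c);

    MPI_Request reqs[2];
    MPI_Irecv(recv_buf, B * B, MPI_DOUBLE, prev, 0, comm, &reqs[0]);
    MPI_Isend(a,        B * B, MPI_DOUBLE, next, 0, comm, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* The received block becomes the operand for the next iteration. */
    for (int i = 0; i < B * B; i++) a[i] = recv_buf[i];
}
```

If the delays in Figure 27 come from ranks reaching the send phase at different times, taskifying the local DGEMMs and the communications (as discussed in Section 6.5) would let a runtime overlap them instead of letting the delays stack.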

26 MB3 D6.1 - Report on profiling and benchmarking of the Figure 28: Histogram of the duration of DGEMM operations. Figure 29: NTChem loop when pinning. Left: first iterations. Right: last iterations and synchronization. 6.4 Analysis of NTChem on MareNostrum We were interested in trying to identify the fundamental factor behind the poor scaling behavior on MareNostrum from the initial part of the curve in Figure 25. The traces analysis determined that the issue is caused by the performance and power features of the Intel processor. On MareNostrum, the TurboBoost mode is enabled. The result is that the sequential execution used as reference leaves the other 15 processor cores idle. This results in a high boost in frequency, and thus a performance boost for the reference run. When more cores are used, the power comsumption and temperature go up and the cores end up running at lower frequencies. This explains the observed reduction in efficiency. 6.5 Co-design insight The analysis identifies several techniques and co-design directions that we believe are relevant for this application. Some of these techniques are being worked on in WP7 and thus this application will be an important test case for them. The main observations we draw are: The application tries to overlap communication and computation by using non-blocking calls. The benefit is, however, diminished on the current machines by the dominating load balance issues. The fact that this might allow the application to run with the same efficiency on machines with very low communication bandwidth is a co-design input. Quantifying the bandwidth that would be tolerated is something to consider when codesigning systems and new software. This application is a perfect test case for the asynchrony support of the hybrid MPI+OmpSs work being done in WP7. We consider this would allow the simplification of the source 26

27 D6.1 - Report on profiling and benchmarking of the code while achieving similar of better performance by enabling more flexible asynchronous executions than what has been hardwired in the current implementation. The poor performance of the Hybrid version at large thread counts seems to indicate that the granularity gets too fine. It is valuable to seek coarser-grained parallelism. In particular, if the different DGEMMs are independent, they could provide a clean outer level of parallelism to exploit. Besides granularity, this would provide the malleable structure required for DLB. Whether that level would be enough to efficiently strong scale to very large core counts or nested parallelism is needed should be studied. Dynamic load balancing would be needed and the use of DLB should be explored. The application source code also has CUDA. We think that the use of the OmpSs support for CUDA would help simplify the code significantly. The DGEMM IPC seems to be very low. It will be necessary to identify whether optimized versions provided by WP7 could be used or whether this is a specific architectural issue of the ThunderX architecture. 7 Lulesh - BSC 7.1 About Lulesh LULESH (Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics) approximates hydrodynamics equations to describe the motion of different materials between each other when they are subject to forces. It is also part of the CORAL benchmark suite. As a mini-app, Lulesh is a highly simplified version of a hydrodynamic simulation, and is hard-coded to only solve a simple Sedov blast problem with analytic answers. 7.2 Lulesh scaling on ThunderX Lulesh limits the amount of MPI processes so it must be a cube of an integer (1, 8, 27,...). Therefore, we cannot run it on core counts which may be more representatives for our architecture, and we cannot use the cluster s full potential (for example, at ThunderX, the closest core count to 384 cores which is a cube of an integer is 343). We run the code on the ThunderX cluster with and without pinning. Even if the default scaling mode for Lulesh is weak, we ran it in strong scaling mode as we are interested in the behavior of applications towards very large core counts, where strong scaling will play an important role. The number of elements used for the largest core count is 9,261,000 and the iteration count set to 150. The scaling results are shown in Figure 30. Again here there is an important difference between pinning and not pinning that can go up to 30%. Notice that there is a super-linear speedup when running with 8 MPI processes. We have not yet determined the cause of this phenomenon. In order to have a better coverage of all the core counts available on the machine, we run a hybrid version using multiple OpenMP threads within each MPI process. This way we can fully populate the nodes using certain configurations. In Figure 31 we show the resultant efficiency for multiple configurations. Each line shows a different amount of MPI processes, listing all the cubes of integers which can be hosted within the ThunderX mini-cluster. Looking at the orange line (1 MPI process), we can see the low efficiency obtained when using OpenMP. We can achieve the best efficiency result (when using 27

28 MB3 D6.1 - Report on profiling and benchmarking of the Figure 30: Lulesh strong scaling at ThunderX. the full or nearly full machine) when running 125 MPI processes, with 3 OpenMP threads each. Nevertheless, since using 64 MPI processes with 6 OpenMP each completely populates the machine, performance-wise this configuration is slightly better. Figure 31: Hybrid Lulesh strong scaling at ThunderX. 7.3 Analysis of Lulesh on ThunderX The overall structure of an execution with 64 cores is shown in Figure 32. Lulesh consists mainly of a large loop, where each iteration alternates between a set of communication and computation phases. Most of the phases expose some imbalance when executing in the default non-pinned mode. One of them shows an important load imbalance and the associated long waiting time during the final synchronization phase. The use of CPU binding drastically reduces the imbalance for most of the phases (Figure 33), except for the highly imbalanced one, for which it now shows a more structured pattern. The histograms of instructions and IPC show that this imbalance is only computational (number of instructions) while all processes have about the same poor IPC ( 0.22) in this region. Other regions have better IPCs even if still low (0.48, 0.55). The fact that for this execution one socket has less threads than the other does not reflect in its threads IPC being better. The L2 miss ratios vary between 14.4 misses per thousand instructions for the imbalanced region to 3.0 or 28

29 D6.1 - Report on profiling and benchmarking of the Figure 32: Lulesh loop pattern without CPU binding. 3.8 for other regions. In general, a large fraction of the L1 misses also result in L2 misses. Figure 33: Lulesh loop pattern with CPU binding. 7.4 Lulesh on MareNostrum We also performed several analyses of Lulesh on an Intel platform, specially looking at the efficiency of the hybrid MPI + OpenMP version, with the objective of understanding how different number of cores can be used in cases where the number of MPI processes is restricted. In the case of Lulesh, forcing the execution to use exactly a cube number of processes is, from our point of view, a strong limitation. The hybrid MPI+OpenMP had very poor scaling when multiple threads are used within each MPI process. We identified two important reasons for that. A first reason is the high overhead of the many malloc and free operations that are done throughout run and which are serialized at the OpenMP level. A second and more important reason is related to the granularity of the parallel loops. Within each timestep iteration, each MPI process executes parallel loops on the different blocks of data assigned to the process. It comes out that although some of those blocks are sufficiently large, for many others the granularity of the OpenMP loops is very small and overheads and variability result in very inefficient execution (34). We consider that coarser-grained parallelism should be exploited at the OpenMP level, as in this case at the block level. Of course, if the number of blocks is not huge and their sizes vary, a single level of parallelism will be inefficient. For that reason it is very important to perform nested parallelization, where blocks are internally parallelized, but small blocks are executed on few cores and large blocks on many cores. Nesting is already supported in OpenMP, but there is a lack of experience on how to better use it and efficiently implement it. We have started using OmpSs in Lulesh to explore in detail the issues and potential of such approach. This is a co-design activity with the OmpSs model and runtime development. Figure 35 corresponds to one timestep iteration. It shows that it is possible to maintain a sufficiently coarse granularity to make runtime overheads irrelevant. At different points of the execution, tasks from different blocks overlap asynchronously, fixing the coarse-grain imbalance issue 29

previously identified. In other parts there is still a global algorithmic synchronization that exposes small imbalances deriving from fixed-granularity tasks. New features in OmpSs/OpenMP should be devised, maybe with architectural support, to address these fine-grained imbalances. The work done also shows how the long malloc and free operations can to a large extent be overlapped with other computations. Even so, there is room for a lot of best programming practices and runtime optimizations in this area.

Figure 34: Lulesh OpenMP parallel loops. Colors correspond to different OpenMP outlined routines. The background color indicates that worker threads are waiting for work or that the master is executing sequential code.
Figure 35: Lulesh tasks and nesting.

7.5 Co-design insights

Some of the observations from the application analysis are:

Nested parallelism at the OpenMP level has to be exploited. Co-design experience has to be gained between the application and the programming model/runtime. This application is being used to develop best practices on how to use nesting at the application level and on how to support it efficiently at the runtime level (see the sketch after this list).

It will be necessary to co-design a mechanism between the runtime and the architecture to enable the dynamic balancing of very fine-grained code regions.

Dynamic memory management (malloc and free) is a potential cause of important overheads when running multi-threaded programs. We need to develop and promote best programming practices that retain the flexibility this mechanism provides but allow the runtime to optimize how memory is handled.

There is a need for tools to help the programmer identify code dependencies.
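As a concrete illustration of the block-level tasking with nesting discussed in Section 7.4, the sketch below uses plain OpenMP tasks with a nested taskloop inside large blocks. It is a schematic under assumed names (block_t, nblocks, update_element), not Lulesh's actual code.

```c
#include <omp.h>

typedef struct { int n; double *data; } block_t;

/* Placeholder per-element update used only for the illustration. */
static void update_element(block_t *blk, int i) { blk->data[i] *= 1.001; }

/* One coarse task per block keeps runtime overhead low; large blocks are
 * parallelized internally (nesting), while small blocks stay sequential so
 * their tasks remain coarse enough. */
void process_blocks(block_t *blocks, int nblocks, int threshold)
{
    #pragma omp parallel
    #pragma omp single
    {
        for (int b = 0; b < nblocks; b++) {
            #pragma omp task firstprivate(b)
            {
                if (blocks[b].n > threshold) {
                    #pragma omp taskloop grainsize(1024)
                    for (int i = 0; i < blocks[b].n; i++)
                        update_element(&blocks[b], i);
                } else {
                    for (int i = 0; i < blocks[b].n; i++)
                        update_element(&blocks[b], i);
                }
            }
        }
        #pragma omp taskwait
    }
}
```

The threshold-based switch is what lets small blocks run on few cores and large blocks on many, which is the behavior argued for in Section 7.4.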

8 CoMD - BSC
8.1 About CoMD
Like MiniFE, CoMD (Molecular Dynamics) is a miniapp which is part of the Mantevo Project Suite, created and maintained by The Exascale Co-Design Center for Materials in Extreme Environments (ExMatEx). It is a reference implementation of classical molecular dynamics algorithms and workloads. The CoMD simulation only evaluates the interatomic potentials between atoms within a cutoff range, thus reducing the problem size and the computation cost compared to evaluating all the system's atoms. Only the Lennard-Jones (LJ) and Embedded Atom Method (EAM) potentials are considered for this proxy app.
8.2 CoMD scaling on ThunderX
We ran CoMD on the ThunderX mini-cluster. Unlike the previous mini-applications, CoMD does not show a relevant improvement when CPU binding is used. The strong scaling plot is shown in Figure 36. Efficiency drops noticeably when increasing the core count.
Figure 36: CoMD strong scaling at ThunderX.
8.3 Analysis of CoMD on ThunderX
The trace of the execution for 96 processes on a ThunderX node is shown in Figure 37. The application follows a simple iterative pattern with a well-balanced computational burst and a point-to-point exchange phase. For this problem size and core count, the cost of the communication phase is a small part of the total iteration time. Every ten iterations there is an MPI Allreduce call. The IPC of the computation burst is 0.58 with an L2 miss ratio of around 0.53 misses per thousand instructions, and the L1 data cache miss ratio is also small. These values are much lower than in previous applications, and this is a possible explanation for the observed lack of impact of the binding process. Using traces of 12, 24, 96, 192, 288 and 384 processes we performed a scalability study to extrapolate the efficiency for larger core counts. The results are shown in Figure 38.

Figure 37: CoMD loop, including three synchronizations.
The efficiency values obtained from the traces are in accordance with the non-traced values reported in Figure 36, thus verifying that the instrumentation does not significantly perturb the traces. The extrapolation plot indicates that load imbalance and data transfer are equally dominant issues at large scale. A detailed analysis of the traces looking for the cause of the imbalance shows that there is a small computational imbalance (0.98), but the main problem comes from a few cores (typically one per node) that execute user code at a slightly lower effective frequency than the others. This behavior is persistent throughout the successive iterations. It is also correlated with an increase in the number of instructions reported by PAPI for those processes. We suspect this effect may be related to address translation or some other operating system issue, but deeper analyses should be performed. Irrespective of the cause, the granularity of the imbalance is sufficient to allow for runtime-based compensation techniques (e.g. DLB).
Figure 38: Extrapolation of efficiency factors for CoMD
8.4 Co-design insights
Some of the observations from the analysis of this application are:
- The application is not sensitive to memory bandwidth.
- The executions on the ThunderX platform suffer an imbalance potentially related to operating system issues. The runtime should be able to dynamically compensate for this heterogeneity between cores irrespective of its actual cause.

- The algorithmic imbalance in the application is small, but it has a regular, repetitive structure, so techniques to adapt micro-architectural resource consumption (e.g. frequency) might be useful.
- Data transfer is another dominant cause of efficiency loss. Asynchrony and communication/computation overlap should be explored.
9 Jacobi Solver - UGRAZ
9.1 Review
The Jacobi solver is a micro-app for exploring new hardware and/or software environments, and its basic algorithms are reused in the Algebraic Multigrid and the CARP code later on. We consider the potential problem

    −∇ᵀ( λ(x) ∇u(x) ) = f(x),   x ∈ Ω,   + boundary conditions on ∂Ω,

wherein mainly Dirichlet boundary conditions are used for the experiments, but mixed boundary conditions with Neumann and Robin parts are also possible (pure Neumann boundary conditions are avoided). The computational domain Ω ⊂ R² is discretized with triangular elements, and linear shape functions are used for the local approximation in each finite element. This finite element approach ends in a linear system of equations K u = f with a symmetric and positive definite n × n matrix K. This system matrix K is sparse and unstructured, and therefore it is stored in the compressed row storage (CRS) format; other sparse storage formats can be realized easily. The Jacobi iteration

    u^{k+1} = u^k + ω D^{−1} ( f − K u^k ),   with the residual r := f − K u^k,

is performed with u^0 = 0 until the relative error ⟨r^k, w^k⟩ / ⟨r^0, w^0⟩, with w := D^{−1} r, is smaller than a given tolerance ε. The solver uses the scalar parameter ω = 1, D contains the diagonal entries of K, and the inverse of this diagonal matrix is pre-computed before the iteration. The basic ingredients of the Jacobi iteration above are the (sparse) matrix-vector product and the scalar product ⟨r, w⟩. Having implemented and parallelized these operations already opens the opportunity for implementing more sophisticated iterative solvers such as Krylov/Lanczos methods and multigrid methods, as we do in the Algebraic Multigrid and the CARP code.
9.2 MPI parallelization
Our parallelization for distributed memory systems is based on distributing the finite elements (using ParMETIS or SCOTCH) in a non-overlapping way, see the pt-part in Fig. 47. The degrees of freedom on the interface belong to more than one MPI process, which requires some care when handling those data. We distinguish between accumulated data, i.e., each process stores the full value, and distributed data, in which each process stores only its share of the full value, see Fig. 39. The formal mapping from the global space of all unknowns to the unknowns possessed by process s is denoted by A_s.

This rectangular matrix is never stored, but its application and the application of A_s^T are realized in the appropriate data exchange routines that include MPI communication. Please note that the system matrix K is automatically stored in a distributed way (because of the non-overlapping element distribution), i.e., K = Σ_{s=1}^{P} A_s^T K_s A_s with the local matrices K_s = A_s K A_s^T.
Figure 39: Handling of interface data (accumulated vs. distributed storage).
Clearly, the data distribution above requires no MPI communication for operations with vectors of the same storage type. Interestingly, the matrix-vector multiplication with the local matrices K_s also requires no communication, but it changes the storage type of the vector. The inner product ⟨w, r⟩ = Σ_{s=1}^{P} ⟨w_s, r_s⟩ calculates the local inner products and performs a parallel reduction of a scalar afterwards, provided the two vectors have different storage schemes. This requires that the preconditioned residuum w_s contains the full values, which can only be achieved by an explicit data exchange denoted by A_s Σ_{k=1}^{P} A_k^T. The resulting parallel Jacobi iteration is presented in Fig. 40.
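To make these two communication patterns more concrete before the algorithm itself (Fig. 40), the following sketch shows one possible MPI realization of the accumulation exchange A_s Σ_k A_k^T and of the inner product of an accumulated with a distributed vector. This is our own simplified example, not the solver source; the interface bookkeeping (neighbour ranks and shared-node index lists) is assumed to have been set up during the domain decomposition.

    // Hedged sketch of the interface-data accumulation and the inner product.
    #include <mpi.h>
    #include <vector>

    struct Interface {
        int neighbour;                 // rank of the neighbouring process
        std::vector<int> shared;       // local indices of the shared (interface) nodes
    };

    // r holds distributed values on entry; accumulated values on exit.
    void accumulate(std::vector<double>& r,
                    const std::vector<Interface>& ifc, MPI_Comm comm)
    {
        std::vector<std::vector<double>> sendbuf(ifc.size()), recvbuf(ifc.size());
        std::vector<MPI_Request> req;

        for (std::size_t n = 0; n < ifc.size(); ++n) {
            for (int idx : ifc[n].shared) sendbuf[n].push_back(r[idx]);
            recvbuf[n].resize(ifc[n].shared.size());
            req.emplace_back();
            MPI_Irecv(recvbuf[n].data(), static_cast<int>(recvbuf[n].size()),
                      MPI_DOUBLE, ifc[n].neighbour, 0, comm, &req.back());
            req.emplace_back();
            MPI_Isend(sendbuf[n].data(), static_cast<int>(sendbuf[n].size()),
                      MPI_DOUBLE, ifc[n].neighbour, 0, comm, &req.back());
        }
        MPI_Waitall(static_cast<int>(req.size()), req.data(), MPI_STATUSES_IGNORE);

        // Add the neighbours' shares to obtain the full (accumulated) values.
        for (std::size_t n = 0; n < ifc.size(); ++n)
            for (std::size_t i = 0; i < ifc[n].shared.size(); ++i)
                r[ifc[n].shared[i]] += recvbuf[n][i];
    }

    // Inner product of an accumulated vector w with a distributed vector r.
    double inner_product(const std::vector<double>& w,
                         const std::vector<double>& r, MPI_Comm comm)
    {
        double local = 0.0, global = 0.0;
        for (std::size_t i = 0; i < w.size(); ++i) local += w[i] * r[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        return global;
    }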

    D^{−1} := diag^{−1}( Σ_{s=1}^{P} A_s^T K_s A_s )      // vector accumulation + inverting the vector elements
    Choose u^0 = 0
    r := f − K u^0
    w := D^{−1} Σ_{s=1}^{P} A_s^T r_s                      // vector accumulation
    σ := σ_0 := (w, r)                                     // reduction of a scalar
    k := 0
    while σ > ε² σ_0 do
        k := k + 1
        u^k := u^{k−1} + ω w
        r := f − K u^k
        w := D^{−1} Σ_{s=1}^{P} A_s^T r_s                  // vector accumulation
        σ := (w, r)                                        // reduction of a scalar
    end
Figure 40: Parallel Jacobi iteration: Jacobi(K, u^0, f).
9.3 Benchmarking
In order to build a benchmark of the Jacobi solver, we performed a set of executions on different platforms using a fixed problem size of 1024 × 1024 for 1000 iterations. Scalability results for the Jacobi solver are shown in Tables 2, 3, 4 and 5.
Table 2: Jacobi execution on the Mont-Blanc prototype with one OpenMP thread
MPI Processes | Execution time (seconds) | Speed-Up | Efficiency
Table 3: Jacobi execution on the Mont-Blanc prototype with two OpenMP threads
MPI Processes | Execution time (seconds) | Speed-Up | Efficiency
Table 4: Jacobi execution on ThunderX with one OpenMP thread per process
MPI Processes | Execution time (seconds) | Speed-Up | Efficiency
Table 5: Jacobi execution on ThunderX with two OpenMP threads per process
MPI Processes | Execution time (seconds) | Speed-Up | Efficiency
Scalability of the Jacobi solver is very low when we measure it for more than 16 MPI processes on the Mont-Blanc prototype. This can be seen in Figure 41, where the parallel efficiency reaches values below 0.5%. On the ThunderX cluster we achieve better scalability of the Jacobi solver, which starts to drop for more than 64 MPI processes, as shown in Figure 42.

Figure 41: Efficiency of Jacobi solver on the Mont-Blanc prototype.
Figure 42: Efficiency of Jacobi solver on ThunderX.
9.4 Profiling
Profiling is performed based on a trace generated on an Intel platform with 10 threads. In Figure 43 we can see 4 MPI processes with 4 OpenMP threads per process. There is some imbalance between processes, so the time each process takes varies, which delays the start of the next computation phase, as can be seen better in Figure 44. This can be regarded as noise and is represented with orange color. At some points we see a bigger noise impact where threads compete for shared resources such as memory; the faster threads are those that do not share resources.
Figure 43: Jacobi MPI calls on an Intel platform.
Figure 44: Jacobi MPI calls, zoomed, on an Intel platform.

9.5 Co-design insights
- The observed load imbalances in the MPI communication (Fig. 43, 44) might be caused by hyper-threading.
- The slow interconnect (1 Gbps) decreases the efficiency of this application on the Mont-Blanc prototype. Although the parallel efficiency is dramatically better on ThunderX (2 x 10 Gbps), an improvement of the communication structure will be of great benefit for larger numbers of cores.
- The performance of the Jacobi solver is limited by the available memory bandwidth; with more nodes, more data fits globally into the faster caches, which gives the chance for a superlinear speedup. Thanks to the very fast internal communication between nodes, we even see this superlinear speedup on ThunderX.
- The main problem of the Jacobi solver is the poor ratio of arithmetic operations to memory accesses, as is typical for PDEs with unstructured discretizations. This is also a major limitation for power-efficient computation, because memory access is by far more energy expensive than arithmetic.
9.6 Future Work
The Jacobi solver is a kind of sandbox for AMG, as well as for CARP, because here we can easily improve components that might be of great impact for the other applications. The next steps for improving the communication in the Jacobi solver will be:
- Use of non-blocking MPI point-to-point communication for data exchange.
- Local renumbering of interface nodes such that no copy into MPI buffers is needed. This will also allow an overlap of computation (on inner nodes) and communication (of interface nodes); a sketch of this overlap is given at the end of this subsection.
- Application of the non-blocking MPI_Iallgatherv in combination with overlapping communication and computation.
Improving the ratio of arithmetic operations to memory accesses (which includes communication) requires more than just some numerical changes in the case of elliptic PDEs. The most promising idea consists of an unstructured (coarse) discretization with several regular refinement steps afterwards. This enables the calculation of the fine grid element matrices on-the-fly from the available coarse grid information, such that the matrix never has to be stored and a matrix-free solver can be realized. This approach will decrease the memory footprint at the cost of more (cheap) computations. The element refinement can be done either by increasing the polynomial degree of the finite element functions (Hierarchical Hybrid Grids (HHG) from the group of S. Turek, Darmstadt) or by an h-refinement using the same functions as in the original element (group of U. Rüde, Erlangen). For the Jacobi solver, we will try:
- A matrix-free implementation based on the h-refinement, with a loop over the coarse elements and dense computational kernels in each element.
- Taking vectorization in the kernels into account by using a local Hilbert curve numbering.
- An example with coefficients varying inside a coarse element.
- A 3D example with tetrahedra.
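The intended overlap of interior computation with the interface exchange can be sketched as follows. This is our own illustration under the same assumptions as the accumulation sketch in Section 9.2 (precomputed neighbour lists; the exchange_start/exchange_finish helpers are hypothetical names for the two halves of that routine), not the actual solver code, and the recomputation of the residual is omitted for brevity.

    // Hedged sketch: overlap of inner-node updates with the non-blocking
    // exchange of interface shares of the residual r.
    #include <mpi.h>
    #include <vector>

    // Post MPI_Irecv/MPI_Isend for the interface shares of r (hypothetical helper).
    void exchange_start(const std::vector<double>& r, std::vector<MPI_Request>& req);
    // MPI_Waitall and addition of the neighbours' contributions (hypothetical helper).
    void exchange_finish(std::vector<double>& r, std::vector<MPI_Request>& req);

    void jacobi_update_overlapped(std::vector<double>& u,
                                  std::vector<double>& r,          // distributed residual
                                  const std::vector<double>& dinv, // accumulated diag(K)^{-1}
                                  const std::vector<int>& inner_nodes,
                                  const std::vector<int>& interface_nodes,
                                  double omega)
    {
        std::vector<MPI_Request> req;
        exchange_start(r, req);               // interface shares travel in the background

        for (int i : inner_nodes)             // inner entries of r are already complete
            u[i] += omega * dinv[i] * r[i];

        exchange_finish(r, req);              // now the interface entries are accumulated
        for (int i : interface_nodes)
            u[i] += omega * dinv[i] * r[i];
    }

With the local renumbering of interface nodes mentioned above, the send buffers in exchange_start could even be replaced by contiguous views into r itself.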

10 Algebraic Multigrid Solver - UGRAZ
10.1 Review
The CARP application contains two major time-consuming parts. The most time-consuming part for linear PDEs is the linear equation solver. This still holds for non-linear PDEs, although in this case the re-computation of the matrix entries also requires a big chunk of the compute time. The computation of the finite element matrices is needed in each non-linear iteration of the Newton solver for the non-linear elasticity problem (with large deformations), but it is needed only once for solving the linear system of equations arising from the bidomain equations. Here we use preconditioned Krylov methods with an algebraic multigrid (AMG) preconditioner. Because the performance of this AMG preconditioner is crucial for the overall performance of CARP, we extracted the AMG solver part as an extra mini-app.
10.2 Geometrical Multigrid
We assume that there exists a series of regular (FE) meshes/grids {T_q}, q = 1, ..., l, where the finer grid T_{q+1} was derived from the coarser grid T_q by element refinement. The simplest case in 2D is to subdivide all triangles into 4 congruent ones, i.e., by connecting the bisection points of the three edges. We assume that the grids are nested, T_1 ⊂ T_2 ⊂ ... ⊂ T_l. Discretizing the given differential equation on each grid T_q results in a series of systems of equations

    K_q u_q = f_q                                          (1)

with the symmetric, positive definite sparse stiffness matrices K_q. The intergrid transfer of data is realized via an interpolation operator/matrix P and its transpose P^T as restriction operator. The operator P should be a local approximation of the harmonic extension from the coarse to the fine grid with respect to the operator K_q. Now, the basic idea of a two-grid method consists of handling the high-frequency parts of the error on the finer mesh, while the coarser mesh has to resolve the remaining low-frequency part. Applying the same idea again to the coarser mesh results in a recursive algorithm named multigrid.

    if q == 1 then
        Solve ( Σ_{s=1}^{P} A_s^T K_1 A_s ) u_1 = f_1      // Coarse grid solver
    else
        ũ_q ← Smooth(K_q, u_q, f_q, ν)                     // Pre-smoothing (fine grid)
        d_q ← f_q − K_q ũ_q
        d_{q−1} ← P^T d_q                                  // Restriction
        w_{q−1} ← 0
        pmg_γ(K_{q−1}, w_{q−1}, d_{q−1}, q−1)              // Coarse grid
        w_q ← P w_{q−1}                                    // Interpolation
        û_q ← ũ_q + w_q
        u_q ← Smooth(K_q, û_q, f_q, ν)                     // Post-smoothing
    end if
Figure 45: Multigrid algorithm pmg_γ(K_q, u_q, f_q, q) for solving K_q u_q = f_q (V-cycle: γ = 1).
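As a code-level illustration of the recursion in Fig. 45, the following sketch shows a serial V-cycle (γ = 1). It is a simplified, hypothetical example: the Matrix/Vector types, the smooth and coarse-grid routines and the transfer operators are placeholders to be supplied by the actual code, not the mini-app source.

    // Hedged sketch of the multigrid recursion of Fig. 45 (serial, V-cycle).
    #include <vector>

    struct Vector { std::vector<double> data; };
    struct Matrix { /* sparse CRS storage, omitted */ };

    // Provided elsewhere (placeholders): operators and grid-transfer routines.
    Vector operator*(const Matrix&, const Vector&);
    Vector operator-(const Vector&, const Vector&);
    Vector operator+(const Vector&, const Vector&);
    void   smooth(const Matrix& K, Vector& u, const Vector& f, int nu); // e.g. Jacobi sweeps
    void   coarse_solve(const Matrix& K1, Vector& u1, const Vector& f1);
    Vector restrict_to_coarse(const Vector& d);  // application of P^T
    Vector prolongate(const Vector& w);          // application of P
    Vector zeros_like(const Vector& v);

    // One V-cycle on level q; K, u, f hold the per-level operators and vectors.
    void pmg(const std::vector<Matrix>& K, std::vector<Vector>& u,
             std::vector<Vector>& f, int q, int nu)
    {
        if (q == 1) { coarse_solve(K[1], u[1], f[1]); return; }

        smooth(K[q], u[q], f[q], nu);            // pre-smoothing
        Vector d = f[q] - K[q] * u[q];           // defect on level q
        f[q - 1] = restrict_to_coarse(d);        // restriction: d_{q-1} = P^T d_q
        u[q - 1] = zeros_like(f[q - 1]);         // w_{q-1} = 0
        pmg(K, u, f, q - 1, nu);                 // coarse-grid correction
        u[q] = u[q] + prolongate(u[q - 1]);      // interpolation: u_q += P w_{q-1}
        smooth(K[q], u[q], f[q], nu);            // post-smoothing
    }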

The algorithm above uses ν pre-smoothing as well as ν post-smoothing steps with an appropriate iteration scheme, e.g., the Jacobi iteration from Section 9. Operations involving the vector accumulation Σ_{s=1}^{P} A_s^T(·) require MPI communication; the same holds for solving the coarse system (q == 1), wherein a parallel solver (e.g., PaStiX) or a bunch of Jacobi iterations can be used. As long as the refined elements of the finer mesh are associated with the same process as the original finite element on the coarse grid, no MPI communication is needed for the intergrid transfer. If smoothers and intergrid transfer operators are chosen correctly, then the number of multigrid iterations needed to solve a system of equations with dimension n is independent of n, i.e., it is the perfect method for large systems of equations.
10.3 Algebraic Multigrid
The geometric multigrid needs an already existing hierarchy of discretizations T_1 ⊂ T_2 ⊂ ... ⊂ T_l as a starting point. Most practical applications have only the fine mesh T_l together with the matrix K_l available, without the coarser discretizations and without the coarser operators K_q, q < l. Algebraic multigrid is the step to generate all the missing operators in order to apply the multigrid algorithm from Fig. 45 afterwards. Let us denote the index set of discretization points on the given fine mesh by ω^h. A subset of ω^h named ω^h_C represents the indices of those nodes which are still available on the next coarser discretization after a coarsening step. The interpolation matrix P represents the intergrid transfer operator.
1. Coarsening (strong connections; aggregation): ω^h = ω^h_C ∪ ω^h_F.
2. Interpolation weights (matrix dependent; agglomeration): P = {α_ij}, i ∈ ω^h, j ∈ ω^h_C : R^H → R^h.
3. Coarse mesh matrix (fully in parallel): Galerkin approach K_H = P^T K P.
4. Steps 1–3 have to be applied recursively.
5. Apply the standard MG-procedure.
Figure 46: Main ingredients of AMG, with steps 1–4 as the AMG setup.
10.4 Benchmarking
The benchmarking results for AMG are contained in the benchmark results for CARP, especially for the Mont-Blanc prototype in Table 7. A more detailed analysis will be generated for the more interesting ThunderX platform.

10.5 Co-design insight
The AMG solver for the elliptic part of the bidomain equations is limited by the memory bandwidth on the one hand, and has a worse computation-to-communication ratio on coarser grids on the other hand. The latter can be influenced to a certain extent by merging subsystems onto one process, i.e., we use fewer computational cores for the coarser systems of equations. This has to be adapted for the ThunderX.
10.6 Future Work
The redesign of AMG includes a bunch of ideas from the Jacobi solver in Section 9 and needs some adaptations which are closely related to the given coarse discretization and its regular refinement. We have to investigate:
- Setup of AMG based on the coarse discretization with finite elements, and the appropriate stiffness matrix as usual. Afterwards, a sequence of finer discretizations is generated based on the regular h-refinement.
- Matrix-free patch smoothers (in each coarse element) to be used on the finest discretization.
- Matrix-free interpolation/restriction from the fine to the coarse discretization has to be implemented.
11 CARP - UGRAZ
11.1 Review
The Cardiac Arrhythmia Research Package (CARP) [VHPL03], which is built on top of the MPI-based library PETSc [BBE+08], was used as a framework for solving the cardiac bidomain equations in parallel. PETSc [BBE+08] served as the basic infrastructure for handling parallel matrices and vectors. Hypre [Hyp06], providing advanced algebraic multigrid methods such as BoomerAMG, and ParMETIS [KK09], providing graph-based domain decomposition of unstructured grids, were compiled with PETSc as external packages. An additional package, the publicly available Parallel Toolbox (pt) library [HLDP09], which can be compiled for both CPUs and GPUs, was interfaced with PETSc. The parallelization strategy was based on recent extensions of the CARP framework [NMSP11]. Briefly, achieving good load balancing, where both computational load and communication costs are distributed as evenly as possible, is of critical importance. While this is achieved with relative ease for structured grids [MPS09], since the nodal numbering relationship is the same everywhere, mirroring the spatial relationship, it is far more challenging in the more general case of unstructured grids, which are preferred for cardiac simulations [VWdSP+08]. For general applicability, unstructured grids are mandatory to accommodate complex geometries with smooth surfaces, which prevent spurious polarizations when applying extracellular fields. To obtain a well-balanced grid partitioning, ParMETIS computes a k-way element-based partition of the mesh's dual graph to redistribute finite elements among partitions. Depending on whether PETSc or pt was employed, two different strategies were devised. These fundamental differences are illustrated in Figure 47.

Figure 47: Strategy for interfacing the linear algebra package PETSc with the domain decomposition solver pt.
In both scenarios, following the domain decomposition step, nodal indices in each domain were renumbered. Inner nodes were linearly numbered, forming the main diagonal block of the global matrix, which maps onto the local rows of the linear system to solve. In the PETSc scenario, interface nodes have to be split and assigned to one parallel partition.
Figure 48: Heart hrecon
Since entries in the off-diagonal part of the matrix are more expensive in terms of communication cost, we aimed at evenly distributing interface nodes across the computed partitions to load-balance communication. Interfacial nodes were split equally between the minimum and maximum numbered partitions. Both grid partitioning and nodal renumbering were tightly integrated to compute partitioning information very fast, in parallel, on the fly. Permutation vectors kept track of the relationship between the reordered mesh and the user-provided canonical mesh.

Figure 49: Hierarchic heart
11.2 CARP in Mont-Blanc 3
CARP is a full-sized application for solving the cardiac bidomain equations, consisting of an elliptic PDE, a parabolic PDE, and ODEs, with the elliptic part solved by an algebraic multigrid method. For the first experimental case we use the solver only for the monodomain setting, excluding the elliptic PDE part. Runs where we vary the number of MPI processes can be seen in Table 6.
Table 6: CARP execution on ThunderX
MPI Processes | Total time (sec) | Parabolic | ODE | Efficiency
Figure 50 shows that CARP scales very well on the ThunderX mini-cluster. When only MPI is used, we achieve the same efficiency as when we use both MPI and OpenMP in the CARP code, for up to 8 OpenMP threads per MPI process. The full timeline of CARP is shown in Figure 51. During the setup phase most of the time is spent in group communication, which is shown in orange color. The useful duration of CARP for the same timeline is shown in Figure 52. The gradient between the dark blue and light green colors shows the balance across processes. Here some load imbalances in the setup phase can be noticed.

Figure 50: CARP Efficiency on ThunderX.
Figure 51: CARP timeline on ThunderX
Figure 52: CARP Useful duration on ThunderX
The algebraic multigrid solver timeline and useful duration are shown in Figures 53 and 54. Here we can see a good balance across processors. From the MPI call profile for this execution, we can see that the AMG solver achieves a good parallel efficiency: only 4.4% of the time is spent in MPI communication, load balance is 99% and communication efficiency is 96.7%. Testing the full bidomain equations (elliptic PDE + parabolic PDE + ODEs) is more time consuming, and so we only had the chance to perform tests on the Mont-Blanc prototype. The results in Table 7 indicate a very good scaling for the ODEs, while the parabolic PDE and the elliptic solver including AMG achieve a good scaling. The reason for the lower performance of the PDE solvers is, as for the Jacobi solver, the low ratio of arithmetic operations to memory/interconnect accesses.

Figure 53: CARP solver timeline on ThunderX
Figure 54: CARP Useful duration for solver on ThunderX
Table 7: CARP execution on the Mont-Blanc prototype, including Elliptic PDE
MPI Processes | Total time (sec) | Elliptic | Parabolic | ODE | Efficiency
Figure 55: CARP Efficiency on the Mont-Blanc prototype, including Elliptic PDE

Figure 56: CARP Efficiency on ThunderX, including Elliptic PDE
11.3 Co-design insight
The CARP code for the monodomain equations shows a similarly good efficiency on ThunderX as the simple Jacobi solver from Section 9, because similar components are included and the additional ODE system increases the arithmetic load without increasing the communication load. Applying the elliptic solver in the full bidomain equations shows a different picture because of the more complicated AMG preconditioner. Here, we have only preliminary results on the Mont-Blanc prototype (see Fig. 55) and on ThunderX (see Fig. 56). The elliptic solver, as well as the parabolic solver, is limited by memory bandwidth. Here, a real breakthrough in performance can be achieved only by reducing the memory footprint, which is not trivial for unstructured discretizations. Therefore, we will need a discretization approach similar to the one planned for the Jacobi solver, applied to the AMG in CARP. This change of discretization will not be easy, because the coarse discretization has to approximate the complicated computational domain sufficiently well. The use of isogeometric elements in the coarse discretization would be a good choice, but this is beyond this project.
11.4 Future work
The planned work for the CARP application is as follows:
- Perform analysis on individual components of CARP for the bidomain equations on the Mont-Blanc platforms, especially on ThunderX.
- Obtain more traces for CARP.
- Further analysis using Paraver traces to understand key behaviours and bottlenecks.
- Incorporate improvements from the Jacobi solver and from AMG whenever they pay off.
- Try a matrix-free implementation in some solver parts.

12 Eikonal Solver - UGRAZ
12.1 Review
The Eikonal equation and its variations (forms of the static Hamilton-Jacobi and level-set equations) are used as models in a variety of applications. These applications include virtually any problem that entails the finding of shortest paths, possibly with inhomogeneous or anisotropic metrics. The Eikonal equation is a special case of the non-linear Hamilton-Jacobi partial differential equations (PDEs). In this work, we consider the numerical solution of this equation on a 3D domain with an inhomogeneous, anisotropic speed function:

    H(x, ∇ϕ) = √( (∇ϕ)ᵀ M ∇ϕ ) = 1,   x ∈ Ω ⊂ R³,
    ϕ(x) = B(x),   x ∈ B ⊂ Ω,

where Ω is a 3D domain, ϕ(x) is the travel time at position x from a collection of given (known) sources within the domain, M(x) is a 3 × 3 symmetric positive-definite matrix encoding the speed information on Ω, and B is a set of smooth boundary conditions which adhere to the consistency requirements of the PDE. We approximate the domain Ω by a planar-sided tetrahedralization denoted by Ω_T. Based upon this tetrahedralization, we form a piecewise linear approximation of the solution by maintaining the values of the approximation on the set of vertices V and employing linear interpolation within each tetrahedral element in Ω_T.
12.2 Local Solver
One of the main building blocks of the proposed algorithm is the local solver, a method for determining the arrival time at a vertex assuming a linear characteristic across a tetrahedron emanating from the planar face defined by the other three vertices, whose solution values are presumed known. The Fast Iterative Method (FIM) was initially proposed as a new iterative method to solve the Eikonal equation on parallel architectures, especially on GPUs. The proposed method uses a modification of the active-list update scheme combined with the local solver described above, designed for unstructured tetrahedral meshes with inhomogeneous anisotropic speed functions. We adopted this method in our work and modified the algorithm in such a way that fewer computations are done during the expansion phase. The parallel algorithms have been tested on workstations and on Android devices such as the NVIDIA SHIELD tablet K1. We observe a very short convergence time of the algorithm and good quality results; see Figure 57, wherein the wave propagation looks very smooth.
Figure 57: Arrival time ϕ(x) ranging from 0 (bottom) to 1 (top).
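To clarify the active-list structure referred to above, the following hedged sketch shows a generic FIM-style update loop on the vertex graph. It is a simplified illustration only: the mesh interface, the neighbour lists and the local_solver routine are hypothetical placeholders, and it omits the GPU-oriented details and the expansion-phase optimization used in this work.

    // Hedged sketch of a generic FIM-style active-list iteration.
    #include <limits>
    #include <set>
    #include <vector>

    struct Mesh {
        int num_vertices() const;
        const std::vector<int>& neighbours(int v) const; // vertices sharing a tetrahedron
    };

    // Solve the local problem at vertex v from the current values of its
    // neighbours (linear characteristic across the adjacent tetrahedra).
    double local_solver(const Mesh& mesh, int v, const std::vector<double>& phi);

    void fim(const Mesh& mesh, const std::vector<int>& sources,
             std::vector<double>& phi, double eps = 1e-12)
    {
        phi.assign(mesh.num_vertices(), std::numeric_limits<double>::infinity());

        std::set<int> active;                          // the active list
        for (int s : sources) phi[s] = 0.0;
        for (int s : sources)
            for (int v : mesh.neighbours(s)) active.insert(v);

        while (!active.empty()) {
            std::set<int> next;
            for (int v : active) {
                const double p = local_solver(mesh, v, phi);
                if (p < phi[v] - eps) {                // value improved: keep expanding
                    phi[v] = p;
                    for (int n : mesh.neighbours(v))
                        if (phi[n] > p) next.insert(n);
                    next.insert(v);                    // not converged yet, stay active
                }
                // converged vertices simply drop out of the active list
            }
            active.swap(next);
        }
    }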

12.3 Benchmarking
In order to benchmark the Eikonal application, we performed a set of runs on different platforms using the TBunnyC mesh, an unstructured tetrahedral heart mesh. Table 8 shows the execution time, speed-up and efficiency of the Eikonal solver, parallelized solely with OpenMP, on the Mont-Blanc prototype. Since the nodes of the Mont-Blanc prototype have only two threads, we performed two different runs, and we achieved an efficiency of 71.81%.
Table 8: Eikonal Solver execution on the Mont-Blanc prototype
Execution time using 1 thread (seconds) | Execution time using 2 threads (seconds) | Speed-Up | Efficiency (%)
Below, we present the results of the first hybrid MPI + OpenMP version of the Eikonal application. Tables 9 and 10 show the execution time, speed-up and efficiency of the Eikonal application running on the Mont-Blanc prototype and the ThunderX cluster.
Table 9: Eikonal Solver execution on the Mont-Blanc prototype (one thread per MPI process)
MPI Processes | Execution time (seconds) | Speed-Up | Efficiency
Table 10: Eikonal solver execution on ThunderX
MPI+OpenMP Processes | Execution time (seconds) | Speed-Up | Efficiency
Figures 58 and 59 show the efficiency of the Eikonal application. Since this is a first version, the performance is not yet good and there is a lot to improve based on the obtained traces.

Figure 58: Efficiency of Eikonal solver on the Mont-Blanc prototype.
Figure 59: Efficiency of Eikonal solver on ThunderX.
12.4 Profiling
Profiling of the Eikonal application is performed using Extrae to generate trace files and Paraver to visualize and analyse the traces. The first MPI + OpenMP version of the Eikonal application is visualized here with Paraver; the trace was generated with Extrae on the Mont-Blanc prototype running Eikonal with 14 MPI processes and 2 OpenMP threads per process.
Figure 60: Eikonal MPI calls on the Mont-Blanc prototype.
In Figure 60 we see the allreduce operation in pink and the broadcast operation in yellow. Together they decrease the efficiency of this application. Figure 61 shows the major computation phases and their balance across threads. It also displays a serialization issue where all the processes stay idle during this time.

Figure 61: Eikonal Useful duration on the Mont-Blanc prototype
12.5 Co-design insight
The analysis has shown opportunities for several techniques that will be appropriate for this application.
- The hybrid MPI + OpenMP version is dominated by message transfer during the overall execution: first an Allreduce operation, followed by a Broadcast operation with long messages of 2 MB. This significantly reduces the efficiency of the Eikonal application. To avoid this, the source code will be changed to replace the Allreduce operation with Scatterv and Gatherv. This way the message size between processes is significantly decreased.
- The algorithmic structure of the Eikonal solver limits scalability a lot. Distributing the computation done in the master thread to the other threads will increase performance significantly. To do this, it is necessary to transfer more data to the other MPI processes, which will result in a small performance degradation.
- Implementing a domain decomposition is expected to solve the poor performance of the hybrid MPI + OpenMP version of Eikonal.
12.6 Future work
- The next version will include the domain decomposition that is expected to solve all the current issues. This will increase the efficiency and performance of the Eikonal application.
- The code will be parallelized fully using MPI.
- Code improvements based on the obtained traces.
13 MercuryDPM - UGRAZ
13.1 Overview
MercuryDPM [Mer16] is an open-source code implementing the Discrete Particle Method, simulating flows of granular media and interactions of particles due to short-ranged forces and torques. MercuryDPM is intended for simulations of granular flows and designed to handle pairwise interactions between particles of varying sizes in arbitrary domains.

The code is written in C++ using an object-oriented approach and describes particles (their physical properties such as position, velocity, etc.), their interactions with each other and with confining walls, and their constituents called particle species (defining their material properties such as density, stiffness, etc.) for the simulation of mixtures. The design of MercuryDPM is based on contact laws of granular materials which are given in [Lud08]. The two novel features introduced to discrete particle modeling by MercuryDPM [TKtV+13] are 1) a multilevel hierarchical grid speeding up collision detection and 2) a coarse-graining statistical package allowing macroscopic fields of a particulate system to be derived from its microscopic properties. Among other features are a self-test suite, a number of demo programs (one called HourGlass3DDemo is chosen to run the simulation tests), and a visualization tool. MercuryDPM allows the results to be stored for easy application restart and in a format suitable for visualization in ParaView. In this work, we focus only on the parallelization of collision detection based on feature 1, the multilevel hierarchical grid, followed by the force computation. The code modernization is done with respect to the most time-consuming part, namely the contact detection algorithm and the multilevel hierarchical grid. One of the most important parts of particle-based modeling involving short-range interactions is collision detection. Often, the contact detection algorithm is the most time-consuming part of particle simulation codes. In traditional DPM packages, contact detection is realized by some sort of neighborhood search algorithm in a single-level grid, e.g. the linked-cell method. In monodisperse particle systems, its performance is as fast as O(N), where N is the total number of particles. However, in the case of polydisperse flows with particles of widely varying sizes, the complexity of the algorithm becomes of order O(N²). The contact detection (CD) algorithm developed in MercuryDPM is based on generating a multilevel hierarchical grid. In this algorithm, particles are positioned in the corresponding hierarchy levels according to their sizes during the first, mapping phase, and potential contacts are searched during the second, contact detection phase. The search is performed first among particles at the same hierarchy level ("level-of-insertion" search) and then across different levels ("cross-level" search). The CD algorithm in MercuryDPM allows the number of hierarchy levels and the cell size distribution to be set arbitrarily. These parameters have a great influence on the performance, and a study has already been carried out in [KOL14] giving recommendations for the optimal selection of these parameters. In this report, I propose a code modernization of MercuryDPM to efficiently utilize shared-memory parallel architectures using OpenMP on Intel multicore and many-core architectures. Additionally, a first attempt to exploit data dependencies and to implement task-based parallelism using the OmpSs programming model on ARM processors has been made.
13.2 Code analysis, benchmarking and profiling
The simulation tests have been conducted on Intel Xeon, Intel Xeon Phi and ARM processors, and performance results are demonstrated for OpenMP and OmpSs and compared with the initial serial code.
For testing the parallelization strategy, the HourGlass3DDemo demo program was chosen (see Figure 62). It simulates falling spherical particles in a cylinder-like volume with a horizontal platform, which at first is fixed in the middle and after a period of time starts to shift downwards. The only input parameters modified in the main program for the test simulations are as follows:
- Dimensions of the hourglass (e.g., Width=100 cm, Height=1000 cm)
- The wedge of the contraction (e.g., ContractionWidth=25 cm, ContractionHeight=50 cm)

- The minimal number of particles (e.g., N=1000 or N=10000); the actual number is set by the setupInitialConditions() function in the main program.
The radii (from which the masses are calculated), initial positions and velocities of the particles are set up in a loop using a random number generator, so that all the particles are located in the upper part of the cylinder. Let us refer to this beginning part of the code as IC (from "initial conditions"). Before computing forces and doing integration time steps, the multilevel hierarchical grid is set up (number of levels, cells and their sizes) and particles are assigned to grid levels according to their sizes. At the same time, the map of buckets for the quick search of neighbor particles is created by hashing the grid (based on cell coordinates and level), and particles are assigned to buckets according to their position. For convenience, we refer later to this phase of the program as neighbor list build, or NL for brevity. The time steps are done in a loop which consists of:
1. Contact detection (referred to as CD for brevity): finding all interactions of particles with each other and with walls, and calculating the short-range force contributions resulting from their contacts;
2. Force computation, where forces are applied to particles (referred to as FC for brevity; more precisely, we should call it force application, as the force contributions are already calculated in the previous phase);
3. Time integration (referred to as TI for brevity): calculating new positions and velocities.
So, now we are ready to represent the structure of a typical MercuryDPM application schematically as the sequence of the following phases: IC + NL + loop(CD + FC + TI).
Figure 62: The demo program HourGlass3DDemo used for benchmarking and profiling. (a) The domain in the HourGlass3DDemo test application: a cylinder with a bottom wall (not shown) fixed at first in the middle of the cylinder and then shifting downwards after 0.9 seconds. (b) A ParaView snapshot of particles falling down with different velocities (depicted by colors) shortly after the start.
The initial code analysis by means of Intel Advisor (Figure 63) has revealed the part of the code where the overwhelming share of the time is spent (namely, the CD and FC phases) and predicted good scalability using OpenMP on Intel Xeon and Xeon Phi. As a matter of fact, CD and FC in the initial code were implemented in one function called computeAllForces(). Therefore, this function has been split up into the CD and FC phases, so that different parallelization strategies could be applied to these phases separately.
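As an illustration of the kind of shared-memory parallelization applied to the split CD and FC phases, the following hedged sketch parallelizes a bucket-based contact search with OpenMP and collects contacts in per-thread lists. The data structures and helper functions are simplified placeholders, not the actual MercuryDPM classes.

    // Hedged sketch: OpenMP parallelization of the contact detection (CD)
    // followed by force application (FC); types and helpers are placeholders.
    #include <cstddef>
    #include <omp.h>
    #include <vector>

    struct Particle { double x[3], v[3], r, f[3]; };
    struct Contact  { int i, j; double overlap; };

    // Candidate partners of particle p found via the hierarchical grid
    // (level-of-insertion and cross-level search); read-only here.
    std::vector<int> candidates_from_grid(std::size_t p);
    bool in_contact(const Particle& a, const Particle& b, Contact& c);
    void apply_contact_force(std::vector<Particle>& parts, const Contact& c);

    void cd_and_fc(std::vector<Particle>& parts)
    {
        std::vector<std::vector<Contact>> per_thread;
        #pragma omp parallel
        {
            #pragma omp single
            per_thread.resize(omp_get_num_threads());

            std::vector<Contact>& mine = per_thread[omp_get_thread_num()];
            // CD phase: each thread collects contacts into its private list.
            #pragma omp for schedule(dynamic, 64)
            for (std::size_t p = 0; p < parts.size(); ++p) {
                for (int q : candidates_from_grid(p)) {
                    Contact c;
                    if (static_cast<std::size_t>(q) > p &&
                        in_contact(parts[p], parts[q], c))
                        mine.push_back(c);     // forces are applied later, in FC
                }
            }
        }

        // FC phase: apply the collected force contributions
        // (done serially here to avoid write conflicts on shared particles).
        for (const auto& list : per_thread)
            for (const Contact& c : list)
                apply_contact_force(parts, c);
    }

The dynamic schedule mitigates the varying number of candidates per particle; other strategies (colouring, atomics, or per-thread force buffers) are possible for parallelizing the FC phase as well.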

Figure 63: Reports by Intel Advisor for the HourGlass3DDemo test program with dimensions w = 100 cm, h = 1000 cm; wedge of the contraction w = 25 cm, h = 50 cm; simulation time t = 0.01 s; simulated for N = 1952 particles and predicted for 125N particles (with 125× the number of grid cells). (a) Survey report for the initial serial code by Intel Advisor. (b) CPU, OpenMP, 16 threads. (c) Xeon Phi, OpenMP, 128 threads.
Performance of the OpenMP implementation (as well as of the first attempt in OmpSs) is measured as the combined number of all particle+particle and particle+wall interactions for the contact detection (CD) and force computation (FC) phases. Figure 64 shows a very good speedup (close to ideal) on one Intel dual-socket processor (OpenMP on 1 node) and a speedup of about 12x achieved on up to 16 threads of an Intel Xeon Phi co-processor (native mode). Further, on 2 sockets of a multicore Intel Xeon and on more than 16 threads of a many-core Intel Xeon Phi, efficiency starts to drop significantly as the memory bandwidth reaches its peak. Performance on ARM (on a single node) is approximately half of that on Intel, as can be seen in Figure 65. The chosen parallelization strategy of the CD+FC phases scales worse on ARM than on Intel due to the lower memory bandwidth. The Cavium ThunderX with its dual socket demonstrates slightly better scalability than the Applied Micro X-Gene 2. The OpenMP parallelization on ThunderX scales well up to 8 cores, then efficiency slightly decreases, and after 24 cores performance drops again due to memory bandwidth limitations and a poor ratio of computation to memory access.
Figure 64: Benchmark and speedup for the contact detection (CD) and force computation (FC) phases on Intel-based architectures using OpenMP
Profiling of the HourGlass3DDemo test application has been done using the Extrae and Paraver tools, and traces were obtained for 8 threads for both the OpenMP and OmpSs realizations.

Figure 65: Benchmark and speedup for the contact detection (CD) and force computation (FC) phases on ARM-based architectures using OpenMP and OmpSs
Figures 66 and 67 illustrate the higher overhead of the fine-grained OmpSs tasks compared to the OpenMP ones. Therefore, a further investigation of task-based parallelization is needed, for example regarding the possibilities of coarse-grained tasks.
Figure 66: Parallel efficiency of MercuryDPM on ThunderX using OpenMP. (a) The entire timeline. (b) Start of the program.

Figure 67: OmpSs tasks in MercuryDPM on ThunderX
13.3 Co-design insights
The current status shows poor computational intensity (ratio of arithmetic operations to memory accesses) for large numbers of cores. The following improvements are suggested:
1. Adding MPI for coarse-grained domain decomposition on many nodes of a cluster.
2. Improving load balance by graph partitioning of the interaction matrix representing the connectivity of buckets in the grid, by means of the Scotch library (a first version for a single node is implemented).
3. Improving memory access and data locality by applying loop tiling.
4. Vectorization, switching between Array-of-Structures (AoS) and Structure-of-Arrays (SoA) layouts, and data alignment.
5. Exploring algorithmic changes in the hierarchical grid part of the MercuryDPM code.
14 Intake port - AVL
14.1 AVL Fire
AVL FIRE is a commercial CFD simulation tool. It specializes in the accurate prediction of processes related to internal combustion engines, such as injection nozzle flow, fuel injection, combustion, emission and exhaust gas aftertreatment modelling, but also aerodynamics and quenching simulations. The software also supports the development of electrified powertrains and drivelines.

14.2 Overview
The description of this application follows G. Kotnig, M. Rainer: Multi-objective Adjoint Optimization of Intake Port Designs, NAFEMS 2016. For any modern spark ignition engine, a proper in-cylinder flow pattern is nowadays a very important factor for the resulting engine performance and emissions. Well-known measures for intake port quality judgement are the tumble and discharge ratios. General requirements for an intake port are providing sufficient filling capacity and enhancing the turbulence intensity at spark time (increasing combustion stability, reducing emissions, improving fuel economy, ...). Additionally, engine designers need to concentrate on improving knock resistance and reducing heat losses. The latter can be accomplished with a combustion process at lower temperature by the introduction of cooled exhaust gas recirculation (EGR). This is a pure MPI application. The benchmark case solves the steady flow on a mesh of 1.3 million elements. Most computation time is spent in an AMG solver.
14.3 Scaling on Intel platform
Figure 68 shows the strong scaling behavior of this test case. The plots show the speed-up and efficiency for the whole run, both for the case of existing partitioned meshes and when the meshes need to be partitioned prior to the calculation. Both cases are shown because users are likely to perform multiple runs using the same geometry and number of cores, so the partitioning overhead only occurs once. Furthermore, the cases used for the benchmarks simulate a short time scale compared to the average user, leading to an artificially increased proportion of time spent in the mostly serial partitioning. The speedup achieved is fair (above 0.6 efficiency) up to 8 cores, but after that it increases very slowly with the core count. A least-squares fit of Amdahl's law gives a parallel code fraction of 93% excluding partitioning.
Figure 68: Strong scaling efficiency and speedup for the intake port case (Partitioning + Calc, Calculation, Ideal).
14.4 Analysis on Intel platform
The application was traced on a development machine at AVL with 8 MPI processes. The timeline shows the iterative structure of the program. The parallel efficiency of the whole computational part is 0.84, and the load imbalance (0.88) is the dominant factor. Serialization efficiency is 0.95, leaving a transfer efficiency close to 1. We thus see that for this problem size and core count, latency and bandwidth of the network are not a limiting factor.

The granularity of the computations is above 1 ms, which is fine for this core count, but might be insufficient when strong scaling the execution. The internal structure of one iteration is presented in the two top timelines of Figure 69 in terms of MPI calls and useful duration. We can see some imbalance in the longer computation phases, but also in the finer-grain phases zoomed in the two lower timelines. The magnitude of the imbalance is different in the different phases of the program. This variation of the imbalance throughout the program execution explains the 5% efficiency loss assigned in the previous paragraph to serialization, which actually corresponds to variations in the load balance structure of the application at the fine-grain/microscopic level. There seems to be some correlation in that the same processes take more time in some of these phases. This potentially indicates that the imbalance is algorithmic and has to do with the domain decomposition. Not having hardware counters in the trace, we cannot verify this hypothesis.
Figure 69: Useful duration of one iteration of the intake port case.
14.5 Co-design insights
Some of the observations that should be considered when working to improve its performance are:
- The load imbalance changes in magnitude along the different phases of the iteration. This suggests that dynamic balancing techniques would be useful for this application. Since the code uses a geometric decomposition and distributed memory parallelism, manually enabling DLB is a major task. Investigations on system and runtime support to simplify DLB are suggested.
- For short runs, the serial startup and finalising tasks such as I/O limit scalability, which suggests the exploration of parallel I/O.

15 Parallel mesh generation and domain decomposition - AVL
15.1 Overview
The numerical treatment of many simulation problems in science and industry has to handle changing computational domains originating from the given PDE (system) or from design variables in optimization and optimal control problems. A re-meshing in the case of a direct problem, i.e. the PDE, is possible but rather inefficient because all mesh-dependent data have to be reallocated and recomputed. In the context of an optimization problem, the re-meshing would destroy the continuous differentiability of the objective functional, and therefore we are forced to apply a mesh deformation instead [HMO15]. We use interpolation with radial basis functions (RBF interpolation) for mesh deformation as proposed by de Boer. In this project we focus on the parallelization of RBF interpolation with its application to mesh deformation in view. Within this context, parallelization covers shared memory and distributed memory parallel computing. Calculating an RBF interpolant requires the solution of a dense system of linear equations. There have been several achievements to overcome the ill-conditioning of the linear system. Nevertheless, a direct solution of the system is inhibited if the problem size exceeds certain limits, thus iterative methods have to be used. Due to the ill-conditioning of the linear system, some preconditioning has to be applied. One way is to use domain decomposition methods. We employ a Krylov-subspace method that uses approximate Lagrange functions as preconditioner, namely the Faul-Goodsell-Powell (FGP) algorithm, for a shared memory solution. Our approach to a distributed memory solution is applying well-known domain decomposition methods. Because our application is the deformation of given computational meshes, our data distribution is predetermined. In particular, we work on distributed finite volume discretizations with one cell-layer overlap. This is a hybrid MPI + OpenMP application. The benchmark case uses a mesh of 30,000 points as starting point for the interpolation.
15.2 Analysis on Intel
The application was traced on a development machine at AVL with 4 MPI processes and 4 threads per process. Figure 70 shows the structure of the program in terms of MPI calls at the top, OpenMP parallel functions at the first level of zooming, and useful duration at the third level. The initial part of the execution shows an iterative pattern. Within each iteration, MPI calls are made only by the main process, except for one region where all threads call MPI_Comm_rank, which does not require communication. Looking at the OpenMP outlined functions timeline, we identify several regions within one iteration. A first phase is serialized at the MPI level but parallelized at the OpenMP level (dark green function). In the second phase (red function, corresponding to a loop in a function related to an evaluation for inner and ghost vertices), every process is statically scheduled and the computation is very imbalanced at the OpenMP level. In the third phase, every process seems to iterate a different number of times (two in the first and last process, three in the two other processes) over a static parallel loop that in some cases does have imbalance. The final phase includes a parallel loop with dynamic schedule where the MPI_Comm_rank calls are done by every thread for each of its assigned iterations.

Figure 70: Useful duration of one iteration of the mesh generation case.
The granularity of the statically scheduled parallel loops is very coarse (order of seconds). A dynamic or guided schedule might help achieve better balance in the OpenMP loops. In the last part of the trace, each process executes a dynamically scheduled OpenMP loop where the duration of the different iterations shows an important variability, as shown by the timeline and histogram at the bottom of Figure 70.
15.3 Co-design insights
Some of the observations that should be considered when working to improve its performance are:
- Load balance is a very important issue for this application. Sometimes it is at the MPI level and sometimes at the OpenMP level, or both. DLB would help in the first case, and coarser-grain dynamic scheduling (or guided scheduling) would help at the OpenMP level. The interaction between these two levels of imbalance and the alternatives to address it is a very important topic for co-design between application programming and the runtime.
16 Shape optimization by an adjoint Solver - AVL
16.1 Overview
The adjoint solver is a modern optimization approach which allows optimizing shapes with an effort that depends only weakly on the number of design variables. This is opposite to straightforward black-box parameter optimization, which requires a computational effort that is directly proportional to the number of optimization variables. For one optimization step, the primal equations and the adjoint equations each have to be solved only once to compute the deformation vectors needed for the optimization of the shape of an object. Even if the number of linear equation systems which have to be solved is independent of the number of design variables, the size of the systems increases when large finite volume meshes are used.

It is a pure MPI application.
16.2 Scaling on Intel platform
Figure 71 shows the strong scaling behavior of this test case. The plots show the speed-up and efficiency for the whole run. Mesh partitioning was not a significant factor here, thus the serial partitioning time is not shown separately. The maximum speed-up achieved is 3.1; efficiency immediately drops to 0.6 for two cores and declines approximately linearly to 0.2 for 16 cores. A least-squares fit of Amdahl's law gives a parallel code fraction of only 72%, which corresponds to a maximum speed-up of 3.5 for a large number of cores.
Figure 71: Strong scaling efficiency and speedup for the 3D pipe case (Partitioning + Calc, Calculation).
16.3 Analysis on Intel platform
We obtained a trace of the execution of the shape optimization process for a small problem on an 8-core node at AVL. The hierarchical structure of its behavior is shown in Figure 72. At the outermost level (top timeline) we see that the execution alternates parallel phases with serial ones, where only the first process works. In other phases the execution is parallelized with apparently different granularities. The timelines at the middle layers zoom into two such regions. It turns out that these regions internally have areas with the same behavior as represented in the bottom timeline. The granularity at this level is quite fine (from several tens to a couple of hundred microseconds). This region has a load balance efficiency of 0.91 and a lower data transfer efficiency, so communication is the dominant factor, confirming the typical experience that when granularity is very small, the runtime overhead is an important bottleneck.
16.4 Co-design insights
From the analysis we envisage that several topics should be researched in cooperation between application developers and system software development:
- The granularity of the dominant computation steps at the innermost level is very fine. For larger problems we envisage that the granularity may increase, but if larger core counts are targeted, the overhead of the runtime (MPI in this case) will still be relevant.

Figure 72: Hierarchical structure of Useful duration timelines for the Shape optimization problem run on 8 processes.
Optimization of the MPI runtime will certainly help reduce the overheads and improve the efficiency of this core computation.
- There is a small amount of imbalance, but with very fine granularity. Getting to large scales will require increasing the granularities. This may be possible at the algorithmic level and would require proper nesting support at the programming model level.
17 Aerodynamics - AVL
17.1 Overview
Aerodynamics is a CFD application which uses meshes with a very large number of degrees of freedom. This is typically in the range of 300 million cells, which is significantly higher compared to the combustion chamber application case, which typically considers up to 1 million. Due to the resulting memory requirements, the use of distributed memory paradigms (e.g. MPI) is unavoidable. Additionally, high demands are put on mesh quality to ensure the stability of solutions. The mesh structure is also important in finding optimal geometric decompositions. This is a pure MPI application. The benchmark case solves the flow around an Ahmed body for 3 timesteps using 10 iterations each. The Ahmed body is a standard benchmark geometry in aerodynamics which resembles a ground vehicle. Here a mesh of 8 million elements was used.
17.2 Scaling on Intel platform
Figure 73 shows the strong scaling behavior for the Ahmed body case with 8 million cells. The plots show the speedup and efficiency for the whole run for two cases: once when the partitioned meshes are already available, and once when the geometry first needs to be partitioned for the desired number of MPI ranks.

Since the impact of this serial partitioning diminishes for users who run longer time-scale or repeated simulations, we show both. The speedup achieved is fair (above 0.6 efficiency) up to 8 cores, but after that it increases very slowly with the core count.
Figure 73: Strong scaling efficiency and speedup for the Ahmed body case (Partitioning + Calc, Calculation, Ideal).
17.3 Analysis on Intel platform
A first trace was obtained for a small problem size on a development machine at AVL with 8 MPI processes. It was used to detect the structure of the execution, which shows the iterative behavior presented in Figure 74. In the top timeline we see the useful duration of four iterations of the very repetitive pattern. The overall parallel efficiency is 0.76 and the dominant factor is load balance (0.78). Inside each iteration there are two main phases. A phase with finer-grain subiterations is shown at the bottom left in terms of useful duration and MPI call timelines. There is an important imbalance in this region, and the granularity at this core count is still relatively coarse (up to 10 ms). Of course, if we scale the number of processes we will soon run into very fine granularity issues. The other phase, shown in the right timelines, actually has a finer substructure. It has some computations that are relatively balanced, with granularities in the order of tenths of milliseconds, but other regions have a few iterations of the same imbalanced pattern as the first phase. The MPI calls used include Allreduce and Allgather as collective operations within the iteration, Barriers separating the outer iterations, and Sendrecv_replace for point-to-point exchanges. The number of messages and the amount of data sent/received by each process in these calls is fairly imbalanced (≈ 0.65) and not correlated with the computational imbalance. Overall, the application achieves a fair efficiency, mainly limited by load imbalance. Although its structure varies between phases, it is probable that improving the static domain decomposition may help. A dynamic approach would most probably lead to better efficiencies. The granularity for this problem size will be too fine if we significantly scale the core count. Overheads may then limit the applicability of a dynamic runtime balancing library (e.g. DLB). Approaches based on dynamically adapting the core frequency might then be possible.

Figure 74: Structure of the aerodynamics case. Useful duration of four outer iterations in the top timeline; zooms into two regions showing useful duration and MPI calls.

Comparing that trace for 8 processes with the corresponding ones for 4 and 16 cores, we get the efficiencies reported in Figure 75. It is apparent that the scaling is very poor. In particular, there is actually a slowdown from 8 to 16 processes, resulting in a very similar time per iteration between 4 and 16 processes. As the plot shows, load balance is the main fundamental behavioral factor to blame. Serialization inefficiencies are also important. In reality, this factor captures effects of microscopic load imbalance, where the process(es) exposing higher computation times than the others change at very fine granularity across the different phases separated by globally synchronizing communications. Communication itself (transfer) is not an issue at these core counts.

Figure 75: Efficiency and fundamental factors for 4, 8 and 16 processes for the small problem size.

A second set of traces was obtained for a problem size that scales better at these core counts. In this case, the overall structure is similar to that of Figure 74, but the granularities are now much coarser (from tens of milliseconds to a couple of seconds). The scaling analysis and its projection are shown in Figure 76. Of special interest is the fact that load balance is the dominating factor at low core counts, but transfer is predicted to be the main bottleneck beyond a few tens of processes.
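For reference, the efficiency and fundamental factors reported here follow the usual trace-based model in which parallel efficiency is the product of load balance and communication efficiency, computed from the useful computation time of each process (the further split of communication efficiency into serialization and transfer requires a replay on an ideal network and is not shown). A minimal C sketch of these metrics, with inputs assumed to be extracted from the traces, is:

/* Sketch of the basic trace-based efficiency metrics used in the analysis.
 * useful[p] is the total useful computation time of process p and runtime
 * is the elapsed time of the traced interval; both are assumed to come
 * from the trace, this is not the API of any particular tool. */
#include <stdio.h>

void fundamental_factors(const double *useful, int nprocs, double runtime)
{
    double sum = 0.0, max = 0.0;
    for (int p = 0; p < nprocs; p++) {
        sum += useful[p];
        if (useful[p] > max)
            max = useful[p];
    }
    double avg = sum / nprocs;

    double load_balance = avg / max;      /* spread of work across ranks */
    double comm_eff     = max / runtime;  /* time lost waiting or in MPI */
    double parallel_eff = avg / runtime;  /* = load_balance * comm_eff   */

    printf("LB = %.2f  CommEff = %.2f  ParEff = %.2f\n",
           load_balance, comm_eff, parallel_eff);
}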

Figure 76: Efficiencies and fundamental factors projection (based on traces of 4, 8 and 16 processes) for the large problem size.

17.4 Co-design insights

Load balance is an important issue for this application. Dynamic approaches (DLB) will certainly be beneficial, but tuning the static domain decomposition might also help improve the efficiency. Data transfer is a predicted bottleneck for large problem sizes. It will be necessary to estimate the network bandwidth needs for these large problems and the possibility of overlapping communication with computation, given the very coarse grain of the computation.

18 Quenching - AVL

18.1 Overview

Quenching is the process of rapid cool-down, used to obtain a desired material microstructure, hardness, strength or toughness. The computational model demands coupling of a fluid domain and one or more solid domains. The simulation of the fluid domain implies modeling multi-phase flows, and the complexity of the solid domains appearing in real-life examples results in large meshes. A characteristic requirement of this application is high mesh quality, especially at domain boundaries. This is a pure MPI application.

For a quenching calculation, two AVL Fire processes need to be coupled via a non-MPI communication server. One of the MPI runs calculates the solid object inserted into the fluid, the other the fluid. The computational effort for the solid calculation is orders of magnitude below that of the fluid. For the fluid, 120,000 elements have been used.

18.2 Scaling on Intel platform

Figure 77 shows the strong scaling behaviour of this test case. Since the computation consists of two communicating application instances with the fluid instance dominating, the solid instance

was fixed to 2 cores. The scaling data was then generated by varying the number of cores for the fluid instance from 1 to 14. Here a good speed-up, with efficiency above 0.8, is achieved up to 8 cores. At 14 cores the run time is still decreasing but the efficiency has dropped below 0.7. A least-squares fit of Amdahl's law yields a parallel code fraction of 96%. Since the benchmark case is shorter than actual usage of the application, the serial startup and IO time accounts for a large part of the 4% serial fraction.

Figure 77: Strong scaling efficiency and speedup for the quenching case (wall time efficiency and speedup versus number of cores, for Partitioning + Calc and for Calculation only).

18.3 Analysis on Intel platform

The application was traced on a development machine at AVL with 4 MPI processes. Figure 78 shows the structure of the execution. The top timeline shows the duration of the computation bursts for the whole run (approximately 6.5 seconds). Different phases seem to appear, of which we have focused on two regions. The timeline on the left in the second row focuses on a region where a high computational cost (bursts of approximately 30 seconds) appears in one process and then seems to migrate to other processes. From the zoomed view it really looks like a case of migrating load imbalance, although hardware counters would be needed to certify that it is really useful computation imbalance (and not system noise). The second region is shown on the two bottom right timelines. We end up seeing a similar effect of migrating load imbalance, but now at much finer granularity (a couple of milliseconds). Again, this seems to stem from the algorithmic structure of the program.

18.4 Co-design insights

Some of the observations that should be considered when working to improve its performance are:

- Migrating load imbalance plays an important role in the application and appears at different granularities.
- The DLB mechanism would potentially be very useful, but at finer granularities additional architectural support might be needed, especially if the application is run at larger core counts.
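The parallel fraction quoted above can be obtained with a very simple fit: Amdahl's law T(p) = T(1)(s + (1-s)/p) is linear in 1/p, so an ordinary linear least-squares regression of the measured wall times against 1/p yields the serial fraction s from the intercept. A minimal C sketch follows; the timing values in it are placeholders, not the measured data.

/* Sketch of a least-squares Amdahl fit: regress wall time against 1/p.
 * Intercept = T1*s, slope = T1*(1-s), so the parallel fraction is
 * slope / (intercept + slope). The sample data below is made up. */
#include <stdio.h>

static double amdahl_parallel_fraction(const int *cores, const double *time,
                                       int n)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (int i = 0; i < n; i++) {
        double x = 1.0 / cores[i];
        sx  += x;
        sy  += time[i];
        sxx += x * x;
        sxy += x * time[i];
    }
    double slope     = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double intercept = (sy - slope * sx) / n;   /* T1 * s */
    double t1        = intercept + slope;       /* T(1)   */
    return slope / t1;                          /* 1 - s  */
}

int main(void)
{
    int    cores[] = { 1, 2, 4, 8, 14 };                 /* placeholder */
    double time[]  = { 100.0, 52.0, 28.0, 16.5, 11.5 };  /* placeholder */
    printf("parallel fraction = %.2f\n",
           amdahl_parallel_fraction(cores, time, 5));
    return 0;
}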

Figure 78: Structure of the quenching test case.

19 Air borne noise simulation of an internal combustion engine - AVL

19.1 Overview

AVL EXCITE is a commercial software package for the simulation of rigid and flexible multi-body dynamics of powertrains. It is a specialized tool that calculates the dynamics, strength, vibration and acoustics of combustion engines, transmissions and conventional or electrified powertrains. In industrial applications, engine noise radiation is a typical example of a one-direction coupled (from structure to fluid) exterior radiation problem, where the velocity boundary conditions are derived from the structural vibrations of the engine surface. Consequently, these applications are typically investigated in a sequential two-step process.

In a first step, structure-borne noise is simulated. For this purpose a flexible multi-body dynamics model is set up. This considers both the component structures (engine block, crankshaft, connecting rods, etc.) and the highly non-linear contacts between these components (e.g. radial slider bearings). Phenomena such as friction power loss and wear, as well as acoustic excitation and oil consumption of the overall lubrication system, are mainly affected by these contacts. From a mathematical perspective, a number of differential algebraic equations (DAEs), each representing a component, and a number of partial differential equations (PDEs), each representing a contact between two component surfaces, need to be coupled and solved in the time domain. The elastic velocities of the outer engine block surface, the cylinder head and the oil pan are results of the corresponding DAEs and serve as boundary conditions for the second step.

In the second step, air-borne noise is computed based on the wave-based technique. This technique is used for solving steady-state acoustic problems in the mid-frequency range and is based on an indirect Trefftz approach. The field variables are expressed in terms of globally defined shape functions, which are exact solutions of the homogeneous governing differential equation but do not necessarily satisfy the boundary conditions. Radiation in terms of acoustic pressure is solved in frequency steps (e.g. 10 Hz, 15 Hz, ...) up to a defined limiting frequency (e.g. 1 kHz). For this purpose a dense equation system with complex values has to be solved for each frequency step. This is a pure OpenMP application.
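The kernel of this second step can be pictured as a loop over frequencies, each of which assembles and solves a dense complex system. The following C sketch is only an assumption about that structure, not the EXCITE implementation: the assembly routine is a placeholder, and LAPACKE_zgesv stands in for whatever threaded MKL routine is actually used.

/* Hypothetical sketch of the per-frequency dense complex solve described
 * above. assemble_system() is a placeholder; LAPACKE_zgesv (LU solve for
 * complex double systems) stands in for the actual library call. */
#include <stdlib.h>
#include <lapacke.h>

/* placeholder assembly: in the real code this builds the wave-based
 * (indirect Trefftz) system for frequency f */
static void assemble_system(double f, lapack_int n,
                            lapack_complex_double *A, lapack_complex_double *b)
{
    for (lapack_int i = 0; i < n; i++) {
        for (lapack_int j = 0; j < n; j++)
            A[i * n + j] = (i == j) ? 1 : 0;   /* dummy identity system */
        b[i] = f;                              /* dummy right-hand side */
    }
}

void solve_frequency_range(double f_start, double f_step, double f_max,
                           lapack_int n)
{
    lapack_complex_double *A = malloc(sizeof *A * n * n);
    lapack_complex_double *b = malloc(sizeof *b * n);
    lapack_int *ipiv = malloc(sizeof *ipiv * n);

    for (double f = f_start; f <= f_max; f += f_step) {
        assemble_system(f, n, A, b);
        /* dense complex LU solve; a threaded BLAS/LAPACK such as MKL
         * parallelizes this internally */
        LAPACKE_zgesv(LAPACK_ROW_MAJOR, n, 1, A, n, ipiv, b, n);
        /* b now holds the acoustic pressure solution for frequency f */
    }
    free(A);
    free(b);
    free(ipiv);
}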

19.2 Scaling on Intel platform

Figure 79 shows the strong scaling efficiency and speed-up for this application. The speedup is poor, with a maximum of 1.2 at 4 cores, and the efficiency drops almost as if the code were completely serial. Since the code uses both a parallel BLAS implementation (Intel MKL 11.1) and OpenMP worksharing constructs, understanding the scaling behaviour requires the closer analysis performed in the following section.

Figure 79: Strong scaling efficiency and speedup for the air borne simulation (wall time efficiency and speedup versus number of cores, Calculation only, with the ideal curve for reference).

19.3 Analysis on Intel platform

The application was traced on a development machine at AVL with 8 threads. The timeline shows the iterative structure of the program. At a coarse grain we see that only a part of each iteration is parallelized with OpenMP, and a very important part remains serial. Digging into the parallelized region, the same effect appears, with sequences of parallel loops separated by significantly large serial parts. The structure of this inner part is shown in Figure 80. The duration of the sequential parts is in the range of several milliseconds, while the granularity of the parallel parts is at best in the order of 200 microseconds for some loops, but very often less than 30 microseconds.

19.4 Co-design insights

The application shows poor scaling and a lot of effort will be required to target very large scale systems. Some of the observations that should be considered when working to improve its performance are:

- It is really necessary to parallelize a more significant part of the application.
- Coarser granularity should be aimed at (see the sketch below).
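The following minimal OpenMP sketch (an illustration, not EXCITE code) reproduces the structure seen in the trace: multi-millisecond serial phases interleaved with parallel loops of only a few tens to hundreds of microseconds. With this structure, Amdahl's law caps the speedup close to the values measured above, which is why both recommendations target the serial parts and the loop granularity.

/* Illustrative sketch of the observed iteration structure: long serial
 * phases surrounding sequences of very fine-grained OpenMP loops. The
 * helper below is a placeholder for the serial setup and solver glue. */
#include <omp.h>

static void serial_work(double *v, int n)     /* several ms, one thread */
{
    for (int i = 0; i < n; i++)
        v[i] *= 1.000001;
}

void iteration(double *x, double *y, int n, int nloops)
{
    serial_work(x, n);                        /* serial part of the iteration */

    for (int k = 0; k < nloops; k++) {
        #pragma omp parallel for              /* ~30-200 us of parallel work */
        for (int i = 0; i < n; i++)
            y[i] += x[i] * (double)(k + 1);

        serial_work(y, n);                    /* serial glue between loops */
    }
}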

Figure 80: Air borne simulation parallel loops timeline.

20 HPGMG - HLRS

20.1 Overview

After the TOP500 project started in 1993, High-Performance Linpack (HPL, [DLP03]) became the most widely used benchmark to measure the floating-point operation execution rate of a computer and the basis for ranking the fastest supercomputers. Lately, HPL has been criticized more and more often for its lack of representativity. In 2014, High Performance Conjugate Gradient (HPCG, [DH13]) was developed with this criticism in mind, aiming at a superior representativity. Several studies have shown that the HPCG benchmark is heavily memory bound for large problem sizes. High-performance Geometric Multigrid (HPGMG, [Ada14]) became popular shortly after HPCG. It contains two solvers, one based on the finite element method and one on the finite volume method; we will review the finite volume method. HPGMG aims to improve representativity compared to HPL and has a higher efficiency in terms of Flop/s than HPCG. The complexity of communication and computation is the same as for HPCG, leading to the same issue regarding large problem sizes and communication cost. In HPGMG runs, three different problem sizes are executed, where the user chooses only the biggest one and the two smaller sizes are fixed fractions of the initial problem size.

HPGMG-FV solves variable-coefficient elliptic problems on isotropic Cartesian grids using the finite volume method (FV) and Full Multigrid (FMG). The method is fourth-order accurate in the max norm, as demonstrated by the FMG convergence. FMG interpolation (prolongation) is quartic, V-cycle interpolation is quadratic, and restriction is piecewise constant. Recursive decomposition is used to construct a space filling curve akin to Z-Mort in order to distribute work among processes. Out-of-place red-black Gauss-Seidel (GSRB), preconditioned by the diagonal, is used for smoothing. FMG convergence is observed using a V(3,3) cycle. Thus convergence is reached in a total of 13 fine-grid operator applications (3 pre-smooth GSRBs, residual, 3 post-smooth GSRBs).

20.2 Platforms and Software Stack

In this section we describe the hardware and the software stack used in this work. Real performance data of HPGMG was collected on three different platforms, which shall be referred to as A, B, and C. Platform A is based on the Cray XC40 architecture. For the experiments we use up to 4096 nodes, each with two Intel Xeon E5-2680v3 (Haswell) chips running at 2.5 GHz with 12 cores and 2 hardware threads per core. Each chip has 30 MB of shared L3 cache and 4 memory channels.
