Parallel computation performance of Serpent and Serpent 2 on the KTH Parallel Dator Centrum
KTH ROYAL INSTITUTE OF TECHNOLOGY, SH2704, 9 MAY

Parallel computation performance of Serpent and Serpent 2 on the KTH Parallel Dator Centrum

Belle Andrea, Pourcelot Gregoire

Abstract - The aim of this project was to investigate the computational efficiency of Serpent and Serpent 2 on the KTH supercomputer. Several simulations were run with different input parameters and parallel-mode configurations, in order to obtain a broad view of the parallelization process. The resulting increases and decreases of the computation time, as various parameters and configurations were changed, were studied.

I. INTRODUCTION

Parallel calculation in a Monte Carlo code such as Serpent or Serpent 2 consists in splitting the size and the computational cost of a simulation across several parts, so that the computation time is likely to decrease. This kind of application, however, usually has to be run on a powerful machine. For this project, the KTH supercomputer, the so-called Parallel Dator Centrum or PDC, was used for all the simulations. The codes used were Serpent and Serpent 2.

A. Parallel Dator Centrum

The Parallel Dator Centrum, or PDC, is the KTH supercomputer. It consists of two main parts, called clusters: Beskow and Tegner. Each of these machines is formed by several units called nodes, and each node contains many cores, or CPUs. Many hardware configurations are available for parallel calculation, depending on the required computational power and the complexity of the simulation [1].

B. PDC and Serpent configuration

As mentioned before, Serpent, Serpent 2 and the Tegner machine were used. In particular, a specific partition of Tegner was available, with 46 nodes. Each node has 24 Intel E5-2690v3 Haswell cores, in a 2x12 configuration, and 512 GB of RAM [1].
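The splitting idea described in the introduction can be pictured with a toy example (a plain-Python sketch, not Serpent code; the seeds and history counts are arbitrary). Each part below plays the role of one task of a parallel run, with its own random number stream, and the partial results are summed at the end:

```python
import random

def simulate_part(seed, histories):
    """One independent part of a toy Monte Carlo estimate of pi:
    count random points that fall inside the unit quarter circle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(histories):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1.0:
            hits += 1
    return hits

# Split 400,000 histories into 4 independent parts, as a parallel
# run would, then combine the partial results at the end.
parts = [simulate_part(seed, 100_000) for seed in range(4)]
pi_estimate = 4.0 * sum(parts) / 400_000
```

Run in parallel, the wall-clock time of such a calculation would ideally drop by a factor equal to the number of parts; the overhead discussed in the results below is what prevents this ideal from being reached.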
The codes were compiled with gcc/7.2.0 and openmpi/3.0-gcc-7.2, and each simulation was launched through the sbatch submission script, in order to follow the required submission procedure of the supercomputer.

C. Parallel computation mode

Both Serpent and Serpent 2 support parallel calculation. In Serpent only the MPI mode (Message Passing Interface) is available. It consists in splitting the simulation into a given number of parts, called tasks, among which the total available memory is distributed. Each task runs a small part of the total simulation, and the results are combined at the end of the whole run, using the independent simulations scheme. In this work, each node was divided into 24 MPI tasks, each corresponding to a single core. The batch size, i.e. the number of neutron histories simulated per cycle, is divided among a certain number of cores, and the results are then combined with the aforementioned independent simulations scheme.

Serpent 2 can use both the MPI and the OpenMP parallel modes. The MPI mode is the same as described for Serpent, while OpenMP splits the simulation into a number of parts called threads. In this case the memory is not distributed, but shared among all the threads. Serpent 2 can also run in the so-called hybrid MPI-OpenMP mode, which merges the features of the two modes in order to find an optimal configuration. Each node can be divided into some MPI tasks, and each task into some OpenMP threads. The total memory is then divided equally among the MPI tasks, and the memory of each task is shared among the OpenMP threads inside the task itself. In this case the batch size is split among the MPI tasks, and each neutron history is then simulated in a different thread. Within each MPI task, the results of the OpenMP threads are combined with a sort of master/slave scheme.
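As a rough illustration of the independent simulations scheme just described (a minimal statistical sketch, not Serpent's actual implementation; the k_eff values and uncertainties are invented for the example):

```python
import math

def combine_independent(means, stds):
    """Combine N equally weighted independent simulations: the pooled
    estimate is the mean of the per-task means, and its standard
    deviation follows from summing the per-task variances."""
    n = len(means)
    mean = sum(means) / n
    std = math.sqrt(sum(s * s for s in stds)) / n
    return mean, std

# Example: k_eff estimates from four independent MPI tasks.
k_eff, sigma = combine_independent(
    [1.1002, 1.1010, 1.0995, 1.1001],
    [0.0008, 0.0008, 0.0008, 0.0008],
)
# With equal per-task uncertainties, the combined one is halved: 0.0004.
```

With N equal-variance tasks the combined uncertainty shrinks as 1/sqrt(N); splitting the same batch size over more tasks does not by itself improve the statistics, since each task then runs fewer histories.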
Results from different MPI tasks are then combined at the end as independent simulations.

II. SIMULATION PROCEDURE

A. Serpent input files

For all the simulations a single input file was used. It describes a BWR 2D fuel assembly [2], whose geometry is shown in figure 1. The pin pitch and the assembly pitch are specified in the input file. The fuel is UO2, with different concentrations of 235U and 238U; in the figure, different pin colors correspond to different enrichment levels. In some fuel pins, shown in blue in the figure, the uranium dioxide is mixed with gadolinium. The moderator is light water, and the cladding and box material is a zirconium alloy. Several types of detectors are also present.

Fig. 1. Geometry of the BWR fuel assembly

In order to evaluate the influence of the input file geometry on the computational efficiency, a different type of fuel assembly, shown in figure 2, was also used. It describes a CANDU 2D fuel cluster [2]. The fuel material is uranium dioxide, UO2, with a 235U enrichment of 0.7% (natural uranium). The moderator is heavy water (D2O) and the structural materials are different zirconium alloys. No detector is present.

Fig. 2. Geometry of the CANDU fuel assembly

Different combinations of batch size and active/inactive cycles were used during the study, in order to optimize the simulations and to evaluate the influence of the batch size on the efficiency. The same seed (1.5E7) was used for all the simulations, in order to preserve the random number series and to keep the results unbiased by statistical fluctuations.

B. Speed-up parameter

The main goal of this study is to evaluate the changes in the computation time with various input parameters and hardware configurations, such as the number of cores or nodes involved in the parallelization. The easiest way to evaluate the efficiency of a simulation is the figure of merit:

FOM = 1 / (σ² t)

with:
FOM = figure of merit
σ = standard deviation
t = computation time.

The main parameter for analyzing the efficiency of a parallel simulation is the speed-up parameter, defined by Gene Amdahl [3] with the following formula:

s = 1 / ((1 − F) + F/N)

with:
s = speed-up parameter
F = parallelizable fraction of the simulation
N = number of processors used in the simulation.

For simplicity, the speed-up parameter can also be considered as the ratio between the FOM of the simulation to be evaluated and the FOM of a reference simulation. In this study, for each series of simulations, the seed, the batch size and the number of active/inactive cycles are preserved [2], and the standard deviation can therefore be considered constant. The speed-up parameter can thus be expressed as:

s = FOM(n) / FOM(Ref) = (σ² t(Ref)) / (σ² t(n)) = t(Ref) / t(n).

In the case of the MPI mode, the computation time of the simulation run on a single core was taken as the reference for the speed-up parameter. In the case of the hybrid MPI-OpenMP mode with Serpent 2, only the computation time was taken into account in the evaluation of the results.

C. Results evaluation

Each series of simulations was evaluated by considering the changes in the speed-up parameter and in the actual computation time, in seconds, as the number of cores and nodes increases. Only the computation times reported in the Serpent output files were evaluated: the computation times in the PDC output files are slightly longer, due to the execution and procedure time required by the supercomputer, and including this extra time would have biased the results. Each simulation series used up to three nodes.

III. MPI MODE RESULTS

A. Serpent and Serpent 2 comparison

The first series of simulations was run using the BWR input file and the MPI mode for both Serpent and Serpent 2, with a batch size of 20,000 neutrons, 5000 active cycles and 200 inactive cycles. The speed-up parameter and the computation time were evaluated for both Serpent and Serpent 2, and then compared. The simulations were run with different numbers of MPI tasks: [1, 28], 30, 32, 36, 40, 44, [48, 52]. Each task corresponded
to one core, or CPU. These values were chosen to study the behavior of the parameters between the first and the second node, and between the second and the third one. The computation time is plotted in figure 3 and the speed-up parameter in figure 4. It can be clearly seen that, as the number of cores involved in the parallel simulation increases, the computation time decreases exponentially and the speed-up parameter increases linearly.

Fig. 3. Computation time versus number of cores for BWR with batch size 20,000 neutrons, 5000 active cycles, 200 inactive cycles

Fig. 4. Speed-up parameter versus number of cores for BWR with batch size 20,000 neutrons, 5000 active cycles, 200 inactive cycles

From figure 4 it can be noticed that the patterns of the speed-up parameter of Serpent and Serpent 2 are similar, and both of them can be approximated with a linear function. The fitting data are given in table 1.

TABLE I
SPEED-UP PARAMETER FITTING FOR BWR
Serpent: y = 0.7968x
Serpent 2: y = 0.7707x

The slope of the linear fit of the speed-up parameter is 0.7968 for Serpent and 0.7707 for Serpent 2. This means that the increase of the speed-up parameter, i.e. the decrease of the computation time, is slightly faster in Serpent than in Serpent 2. The slope is, as expected, smaller than 1: the decrease of the computation time is not perfectly inversely proportional to the increase of the number of cores. For example, when using two cores rather than one, the computation time is not half of the previous one, but slightly larger. This phenomenon is known as overhead [4], and it is due to the communication, execution and process time required by the machine performing the parallel simulation. Serpent seems to be slightly less affected by this factor. Nevertheless, Serpent 2 is more stable and less prone to instabilities when an extra node is needed: the pattern of its speed-up parameter is more linear, with smaller fluctuations. Serpent, on the other hand, presents a more unstable pattern, with a slight instability between the first and the second node, and a more pronounced fluctuation between the second and the third one. All these differences are probably due to the different internal architectures of the codes.

B. Influence of the geometry

The influence of the geometry was evaluated by running a series of simulations with the same numbers of cores as the previous one, but using the CANDU cluster geometry. The results for the computation time and the speed-up parameter are shown in figures 5 and 6 respectively. Both plots are very similar to the previous ones for the BWR assembly geometry.

Fig. 5. Computation time versus number of cores for CANDU with batch size 20,000 neutrons, 5000 active cycles, 200 inactive cycles

Fig. 6. Speed-up parameter versus number of cores for CANDU with batch size 20,000 neutrons, 5000 active cycles, 200 inactive cycles

The speed-up parameter was again approximated with linear functions, given in table 2. The slope of the Serpent fit is slightly larger, and Serpent again seems to be slightly more efficient. It has to be noticed that the slopes differ by less than 5%, with either the BWR or the CANDU geometry. Moreover, the patterns of the speed-up parameter are very similar. With Serpent, for both the BWR and the CANDU geometry, the pattern is more irregular, with more pronounced instabilities in the interface regions between nodes; with Serpent 2 the pattern is more regular and the fluctuations are less pronounced. It can therefore be concluded that, in this case, a different geometry does not bring any considerable change in the computational efficiency of Serpent and Serpent 2. The small differences between the two series are not particularly relevant, and they are probably caused by statistical fluctuations due to the different input files.

TABLE II
SPEED-UP PARAMETER FITTING FOR CANDU
Serpent: y = 0.7911x
Serpent 2: y = 0.7806x

C. Influence of the batch size

The batch size was changed from 20,000 to 50,000 neutron histories per cycle. The results are shown in figures 7 and 8 and in table 3. As can be clearly seen, the results are similar to the previous ones, and the batch size does not seem to have a considerable impact on the pattern of the computation time or of the speed-up parameter. Serpent is still slightly more efficient and more unstable, with the fluctuations between the second and the third node more pronounced than those between the first and the second node, while Serpent 2 shows a more regular trend. The fitting slopes of the speed-up parameter are similar and comparable to the previous ones.

Fig. 8. Speed-up parameter versus number of cores for BWR with batch size 50,000 neutrons, 5000 active cycles, 200 inactive cycles

D. Influence of the number of active/inactive cycles

The influence of the number of active and inactive cycles was evaluated with this series of simulations. The numbers of cores used were again [1, 28], 30, 32, 36, 40, 44, [48, 52], the batch size was 50,000 neutrons, and the numbers of active and inactive cycles were 12,500 and 500. The computation time and the speed-up parameter are plotted in figures 9 and 10.
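The through-origin linear fits y = ax reported in the tables can be reproduced with an ordinary least-squares formula; the sketch below uses plain Python and synthetic data (an ideal speed-up degraded by a constant 80% efficiency), not the measured values:

```python
def fit_through_origin(x_values, y_values):
    """Least-squares slope a of y = a*x constrained through the origin:
    a = sum(x_i * y_i) / sum(x_i * x_i)."""
    sxy = sum(x * y for x, y in zip(x_values, y_values))
    sxx = sum(x * x for x in x_values)
    return sxy / sxx

# Synthetic speed-up data: s(N) = 0.8 * N for N cores.
cores = [1, 2, 4, 8, 16, 24, 32, 48]
speedups = [0.8 * n for n in cores]
slope = fit_through_origin(cores, speedups)
```

For the measured data the slope comes out below 1 because of the overhead discussed above: a slope of 0.79, for instance, means that each added core contributes roughly 79% of an ideal core.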
The results are again similar to the previous ones, with an exponential decrease of the computation time and a linear increase of the speed-up parameter. The data of the linear fitting of the speed-up parameter are given in table 4.

Fig. 7. Computation time versus number of cores for BWR with batch size 50,000 neutrons, 5000 active cycles, 200 inactive cycles

TABLE III
SPEED-UP PARAMETER FITTING FOR BWR, 50,000 NEUTRONS, 5,000 ACTIVE CYCLES, 200 INACTIVE CYCLES
Serpent: y = 0.8047x
Serpent 2: y = 0.7772x

TABLE IV
SPEED-UP PARAMETER FITTING FOR BWR, 50,000 NEUTRONS, 12,500 ACTIVE CYCLES, 500 INACTIVE CYCLES
Serpent: y = 0.7914x
Serpent 2: y = 0.7755x

Fig. 9. Computation time versus number of cores for BWR with batch size 50,000 neutrons, 12,500 active cycles, 500 inactive cycles

Fig. 10. Speed-up parameter versus number of cores for BWR with batch size 50,000 neutrons, 12,500 active cycles, 500 inactive cycles

The slopes of the speed-up parameter fits are comparable with the previous ones, and the efficiency of the computation time can be considered the same. The big difference with respect to the previous results is the pronounced fluctuations in the interface regions between nodes in Serpent. The instabilities between the first and the second node, and between the second and the third one, are indeed larger than before; in particular, the fluctuation of the speed-up parameter between the first and the second node is considerably larger than in the previous simulations. Serpent 2, on the other hand, confirmed its more stable behavior. The explanation of this different behavior lies in the internal differences of the codes. Serpent could be more prone to instabilities because of the way the batch size is split among the MPI tasks: when a new node becomes necessary due to the increasing number of tasks, Serpent probably requires more time than Serpent 2 to split the batch size when only one or two cores of the new node are included in the simulation. Another factor could be the communication and process time at the beginning and at the end of the simulation: when only a few cores of a new node are used, this communication between MPI tasks may not be optimized. The cause of these fluctuations could therefore lie in the architecture of the Serpent code, and a higher number of active/inactive cycles, i.e. longer simulations, seems to increase the magnitude of the fluctuations in Serpent.

IV. HYBRID MPI-OPENMP RESULTS

The hybrid MPI-OpenMP mode was evaluated on 3 nodes, starting from a pure MPI mode and ending with a pure OpenMP mode, as shown in table 5.

TABLE V
HYBRID MPI-OPENMP COMBINATIONS
Total MPI tasks | MPI tasks per node | OpenMP threads per task
72 | 24 | 1
36 | 12 | 2
24 | 8 | 3
18 | 6 | 4
12 | 4 | 6
9 | 3 | 8
6 | 2 | 12
3 | 1 | 24

These combinations were evaluated with four series of simulations, with different geometries (BWR and CANDU) and different batch sizes (20,000 and 50,000), as shown in figures 11, 12, 13 and 14. The results were evaluated using only the computation time, and they showed a similar trend. Passing from a pure MPI mode, with a total of 72 MPI tasks (24 per node) and 1 OpenMP thread per task, to a hybrid mode with 12 MPI tasks (4 per node) and 6 threads per task, the computation time slightly decreases. Using 9 MPI tasks (3 per node) with 8 threads per task, the computation time increases considerably. This is due to the hardware architecture of the Haswell nodes used for the simulations: each node has 24 cores, divided into 2 separate blocks of 12 cores each. If a node is divided into 3 MPI tasks with 8 threads (cores) per task, one of the tasks has four of its cores in one block and the other four in the other one. The memory of this task is then shared across the two blocks, and additional process time is needed for the communication between them. The same consideration applies to the last point of each simulation series, where each node hosts a single MPI task with 24 OpenMP threads: also in this case the communication between the two blocks increases the total computation time.

Fig. 11. Computation time versus number of OpenMP threads for BWR with batch size 20,000, 5,000 active cycles, 200 inactive cycles

V. CONCLUSION

A. MPI mode

Both Serpent and Serpent 2, as the number of cores used per simulation increases, present an exponential decrease of the
computation time. The speed-up parameter increases linearly in both cases. The fitting slopes are similar, and they can be approximated by a value of 0.78±0.04 in all the simulations; such a value of the speed-up slope indicates a good efficiency. Some clear differences between the two codes emerged during the study. Serpent seems to be slightly more efficient, since the value of its speed-up slope is always somewhat higher than the one for Serpent 2. On the other hand, Serpent 2 presents a more stable trend, with very small instabilities, while Serpent shows more pronounced fluctuations in the interface regions between nodes.

Geometry and batch size do not seem to have a considerable influence on the results for either Serpent or Serpent 2: the computation time and speed-up trends are indeed similar. The number of active/inactive cycles seems to have a stronger influence on Serpent. It was shown that the longer the simulations are, the more pronounced the instabilities become, especially when passing from one to two nodes while adding only one or two cores on the new node. In this particular case, visible in figure 10, it can be clearly seen that adding an extra core to a parallel simulation is not always an advantage for the computational efficiency, since it can lead to an increase of the computation time. Serpent 2, on the other hand, did not show any considerable change. The differences between Serpent and Serpent 2 should be ascribed to the intrinsic differences in the code architectures: in particular, the way the batch size is split among the cores, the communication between the independent parts of the simulation, and the method of collecting and combining the results in the independent simulations scheme.

Fig. 12. Speed-up parameter versus number of OpenMP threads for BWR with batch size 50,000, 5,000 active cycles, 200 inactive cycles

Fig. 13. Computation time versus number of OpenMP threads for CANDU with batch size 20,000, 5,000 active cycles, 200 inactive cycles

Fig. 14. Computation time versus number of OpenMP threads for CANDU with batch size 50,000, 5,000 active cycles, 200 inactive cycles

B. Hybrid MPI-OpenMP

The results of the hybrid MPI-OpenMP mode show a similar pattern for different geometries and batch sizes. The computation time seems to be strongly influenced by the internal hardware architecture of the supercomputer: in particular, the division of each node into 24 cores, arranged in two blocks of 12 cores, plays a key role. If the communication time is not optimized, because of the division into tasks and threads, the total computation time increases. This is verified at the 6th point (3 MPI tasks per node, 8 OpenMP threads per task) and at the 8th point (1 MPI task per node, 24 OpenMP threads per task) of each simulation series. The most efficient point of each simulation series is the 5th one (4 MPI tasks per node, 6 OpenMP threads per task). The other points (1st, 2nd, 3rd, 4th, 7th) are also quite efficient: their computation times differ by less than 10% from the most efficient one. These minor differences and their causes are difficult to evaluate, and they would require a deeper investigation.

REFERENCES
[1] PDC documentation (last access 3 April 2018).
[2] Jaakko Leppänen, Serpent - a Continuous-energy Monte Carlo Reactor Physics Burnup Calculation Code, User's Manual, 18 June.
[3] Amdahl's law (last access 3 April 2018).
[4] Overhead in parallel computing (last access 3 April 2018).
More informationReview of previous examinations TMA4280 Introduction to Supercomputing
Review of previous examinations TMA4280 Introduction to Supercomputing NTNU, IMF April 24. 2017 1 Examination The examination is usually comprised of: one problem related to linear algebra operations with
More informationUsing the Eulerian Multiphase Model for Granular Flow
Tutorial 21. Using the Eulerian Multiphase Model for Granular Flow Introduction Mixing tanks are used to maintain solid particles or droplets of heavy fluids in suspension. Mixing may be required to enhance
More informationSELECTION OF A MULTIVARIATE CALIBRATION METHOD
SELECTION OF A MULTIVARIATE CALIBRATION METHOD 0. Aim of this document Different types of multivariate calibration methods are available. The aim of this document is to help the user select the proper
More informationImportance Sampling Spherical Harmonics
Importance Sampling Spherical Harmonics Wojciech Jarosz 1,2 Nathan A. Carr 2 Henrik Wann Jensen 1 1 University of California, San Diego 2 Adobe Systems Incorporated April 2, 2009 Spherical Harmonic Sampling
More informationA N-dimensional Stochastic Control Algorithm for Electricity Asset Management on PC cluster and Blue Gene Supercomputer
A N-dimensional Stochastic Control Algorithm for Electricity Asset Management on PC cluster and Blue Gene Supercomputer Stéphane Vialle, Xavier Warin, Patrick Mercier To cite this version: Stéphane Vialle,
More informationEdge-Preserving Denoising for Segmentation in CT-Images
Edge-Preserving Denoising for Segmentation in CT-Images Eva Eibenberger, Anja Borsdorf, Andreas Wimmer, Joachim Hornegger Lehrstuhl für Mustererkennung, Friedrich-Alexander-Universität Erlangen-Nürnberg
More informationAssembly dynamics of microtubules at molecular resolution
Supplementary Information with: Assembly dynamics of microtubules at molecular resolution Jacob W.J. Kerssemakers 1,2, E. Laura Munteanu 1, Liedewij Laan 1, Tim L. Noetzel 2, Marcel E. Janson 1,3, and
More informationsimulation framework for piecewise regular grids
WALBERLA, an ultra-scalable multiphysics simulation framework for piecewise regular grids ParCo 2015, Edinburgh September 3rd, 2015 Christian Godenschwager, Florian Schornbaum, Martin Bauer, Harald Köstler
More informationIntroduction to Parallel Programming
Introduction to Parallel Programming January 14, 2015 www.cac.cornell.edu What is Parallel Programming? Theoretically a very simple concept Use more than one processor to complete a task Operationally
More informationQ: Which month has the lowest sale? Answer: Q:There are three consecutive months for which sale grow. What are they? Answer: Q: Which month
Lecture 1 Q: Which month has the lowest sale? Q:There are three consecutive months for which sale grow. What are they? Q: Which month experienced the biggest drop in sale? Q: Just above November there
More informationThe Why and How of HPC-Cloud Hybrids with OpenStack
The Why and How of HPC-Cloud Hybrids with OpenStack OpenStack Australia Day Melbourne June, 2017 Lev Lafayette, HPC Support and Training Officer, University of Melbourne lev.lafayette@unimelb.edu.au 1.0
More informationInvestigation of Intel MIC for implementation of Fast Fourier Transform
Investigation of Intel MIC for implementation of Fast Fourier Transform Soren Goyal Department of Physics IIT Kanpur e-mail address: soren@iitk.ac.in The objective of the project was to run the code for
More informationv MODFLOW Stochastic Modeling, Parameter Randomization GMS 10.3 Tutorial
v. 10.3 GMS 10.3 Tutorial MODFLOW Stochastic Modeling, Parameter Randomization Run MODFLOW in Stochastic (Monte Carlo) Mode by Randomly Varying Parameters Objectives Learn how to develop a stochastic (Monte
More informationInvestigations into Alternative Radiation Transport Codes for ITER Neutronics Analysis
CCFE-PR(17)10 Andrew Turner Investigations into Alternative Radiation Transport Codes for ITER Neutronics Analysis Enquiries about copyright and reproduction should in the first instance be addressed to
More informationv Prerequisite Tutorials Required Components Time
v. 10.0 GMS 10.0 Tutorial MODFLOW Stochastic Modeling, Parameter Randomization Run MODFLOW in Stochastic (Monte Carlo) Mode by Randomly Varying Parameters Objectives Learn how to develop a stochastic (Monte
More informationBagging & System Combination for POS Tagging. Dan Jinguji Joshua T. Minor Ping Yu
Bagging & System Combination for POS Tagging Dan Jinguji Joshua T. Minor Ping Yu Bagging Bagging can gain substantially in accuracy The vital element is the instability of the learning algorithm Bagging
More informationEnemy Territory Traffic Analysis
Enemy Territory Traffic Analysis Julie-Anne Bussiere *, Sebastian Zander Centre for Advanced Internet Architectures. Technical Report 00203A Swinburne University of Technology Melbourne, Australia julie-anne.bussiere@laposte.net,
More information30 Nov Dec Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy
Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy Why serial is not enough Computing architectures Parallel paradigms Message Passing Interface How
More informationPerformance of the 3D-Combustion Simulation Code RECOM-AIOLOS on IBM POWER8 Architecture. Alexander Berreth. Markus Bühler, Benedikt Anlauf
PADC Anual Workshop 20 Performance of the 3D-Combustion Simulation Code RECOM-AIOLOS on IBM POWER8 Architecture Alexander Berreth RECOM Services GmbH, Stuttgart Markus Bühler, Benedikt Anlauf IBM Deutschland
More informationDesigning for Performance. Patrick Happ Raul Feitosa
Designing for Performance Patrick Happ Raul Feitosa Objective In this section we examine the most common approach to assessing processor and computer system performance W. Stallings Designing for Performance
More informationA recipe for fast(er) processing of netcdf files with Python and custom C modules
A recipe for fast(er) processing of netcdf files with Python and custom C modules Ramneek Maan Singh a, Geoff Podger a, Jonathan Yu a a CSIRO Land and Water Flagship, GPO Box 1666, Canberra ACT 2601 Email:
More informationA FLEXIBLE COUPLING SCHEME FOR MONTE CARLO AND THERMAL-HYDRAULICS CODES
International Conference on Mathematics and Computational Methods Applied to Nuclear Science and Engineering (M&C 2011) Rio de Janeiro, RJ, Brazil, May 8-12, 2011, on CD-ROM, Latin American Section (LAS)
More informationThe p-sized partitioning algorithm for fast computation of factorials of numbers
J Supercomput (2006) 38:73 82 DOI 10.1007/s11227-006-7285-5 The p-sized partitioning algorithm for fast computation of factorials of numbers Ahmet Ugur Henry Thompson C Science + Business Media, LLC 2006
More informationHPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Agenda
KFUPM HPC Workshop April 29-30 2015 Mohamed Mekias HPC Solutions Consultant Agenda 1 Agenda-Day 1 HPC Overview What is a cluster? Shared v.s. Distributed Parallel v.s. Massively Parallel Interconnects
More informationPosition Paper: OpenMP scheduling on ARM big.little architecture
Position Paper: OpenMP scheduling on ARM big.little architecture Anastasiia Butko, Louisa Bessad, David Novo, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli, Lionel Torres, and Michel Robert LIRMM
More informationOptimizing Data Locality for Iterative Matrix Solvers on CUDA
Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,
More informationTheoretical Investigations of Tomographic Methods used for Determination of the Integrity of Spent BWR Nuclear Fuel
a UPPSALA UNIVERSITY Department of Radiation Sciences Box 535, S-751 1 Uppsala, Sweden http://www.tsl.uu.se/ Internal report ISV-6/97 August 1996 Theoretical Investigations of Tomographic Methods used
More informationState of the art of Monte Carlo technics for reliable activated waste evaluations
State of the art of Monte Carlo technics for reliable activated waste evaluations Matthieu CULIOLI a*, Nicolas CHAPOUTIER a, Samuel BARBIER a, Sylvain JANSKI b a AREVA NP, 10-12 rue Juliette Récamier,
More informationComputing architectures Part 2 TMA4280 Introduction to Supercomputing
Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:
More informationCover Page. The handle holds various files of this Leiden University dissertation.
Cover Page The handle http://hdl.handle.net/1887/22055 holds various files of this Leiden University dissertation. Author: Koch, Patrick Title: Efficient tuning in supervised machine learning Issue Date:
More informationOn the Performance of MapReduce: A Stochastic Approach
On the Performance of MapReduce: A Stochastic Approach Sarker Tanzir Ahmed and Dmitri Loguinov Internet Research Lab Department of Computer Science and Engineering Texas A&M University October 28, 2014
More informationUsing Excel for Graphical Analysis of Data
Using Excel for Graphical Analysis of Data Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable physical parameters. Graphs are
More informationMPI and OpenMP Paradigms on Cluster of SMP Architectures: the Vacancy Tracking Algorithm for Multi-Dimensional Array Transposition
MPI and OpenMP Paradigms on Cluster of SMP Architectures: the Vacancy Tracking Algorithm for Multi-Dimensional Array Transposition Yun He and Chris H.Q. Ding NERSC Division, Lawrence Berkeley National
More informationParallel Performance Studies for a Clustering Algorithm
Parallel Performance Studies for a Clustering Algorithm Robin V. Blasberg and Matthias K. Gobbert Naval Research Laboratory, Washington, D.C. Department of Mathematics and Statistics, University of Maryland,
More informationThe Pennsylvania State University. The Graduate School. Department of Mechanical and Nuclear Engineering
The Pennsylvania State University The Graduate School Department of Mechanical and Nuclear Engineering IMPROVED REFLECTOR MODELING FOR LIGHT WATER REACTOR ANALYSIS A Thesis in Nuclear Engineering by David
More informationCURRICULUM UNIT MAP 1 ST QUARTER
1 ST QUARTER Unit 1: Pre- Algebra Basics I WEEK 1-2 OBJECTIVES Apply properties for operations to positive rational numbers and integers Write products of like bases in exponential form Identify and use
More informationAccelerating GATE simulations
GATE Simulations of Preclinical andclinical Scans in Emission Tomography, Transmission Tomography and Radiation Therapy Accelerating GATE simulations Parallel computing and GPU GATE Training, INSTN-Saclay,
More informationI. INTRODUCTION FACTORS RELATED TO PERFORMANCE ANALYSIS
Performance Analysis of Java NativeThread and NativePthread on Win32 Platform Bala Dhandayuthapani Veerasamy Research Scholar Manonmaniam Sundaranar University Tirunelveli, Tamilnadu, India dhanssoft@gmail.com
More information6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS
Chapter 6 Indexing Results 6. INTRODUCTION The generation of inverted indexes for text databases is a computationally intensive process that requires the exclusive use of processing resources for long
More informationMulticore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor.
CS 320 Ch. 18 Multicore Computers Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor. Definitions: Hyper-threading Intel's proprietary simultaneous
More informationDetecting Polytomous Items That Have Drifted: Using Global Versus Step Difficulty 1,2. Xi Wang and Ronald K. Hambleton
Detecting Polytomous Items That Have Drifted: Using Global Versus Step Difficulty 1,2 Xi Wang and Ronald K. Hambleton University of Massachusetts Amherst Introduction When test forms are administered to
More informationImproving Hadoop MapReduce Performance on Supercomputers with JVM Reuse
Thanh-Chung Dao 1 Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse Thanh-Chung Dao and Shigeru Chiba The University of Tokyo Thanh-Chung Dao 2 Supercomputers Expensive clusters Multi-core
More informationThe Art of Parallel Processing
The Art of Parallel Processing Ahmad Siavashi April 2017 The Software Crisis As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a
More informationvsan 6.6 Performance Improvements First Published On: Last Updated On:
vsan 6.6 Performance Improvements First Published On: 07-24-2017 Last Updated On: 07-28-2017 1 Table of Contents 1. Overview 1.1.Executive Summary 1.2.Introduction 2. vsan Testing Configuration and Conditions
More informationSamuel Coolidge, Dan Simon, Dennis Shasha, Technical Report NYU/CIMS/TR
Detecting Missing and Spurious Edges in Large, Dense Networks Using Parallel Computing Samuel Coolidge, sam.r.coolidge@gmail.com Dan Simon, des480@nyu.edu Dennis Shasha, shasha@cims.nyu.edu Technical Report
More informationIntel MPI Library Conditional Reproducibility
1 Intel MPI Library Conditional Reproducibility By Michael Steyer, Technical Consulting Engineer, Software and Services Group, Developer Products Division, Intel Corporation Introduction High performance
More informationCS 229: Machine Learning Final Report Identifying Driving Behavior from Data
CS 9: Machine Learning Final Report Identifying Driving Behavior from Data Robert F. Karol Project Suggester: Danny Goodman from MetroMile December 3th 3 Problem Description For my project, I am looking
More informationerror
PARALLEL IMPLEMENTATION OF STOCHASTIC ITERATION ALGORITHMS Roel Mart nez, László Szirmay-Kalos, Mateu Sbert, Ali Mohamed Abbas Department of Informatics and Applied Mathematics, University of Girona Department
More informationWhitepaper Spain SEO Ranking Factors 2012
Whitepaper Spain SEO Ranking Factors 2012 Authors: Marcus Tober, Sebastian Weber Searchmetrics GmbH Greifswalder Straße 212 10405 Berlin Phone: +49-30-3229535-0 Fax: +49-30-3229535-99 E-Mail: info@searchmetrics.com
More informationBootstrapping Method for 14 June 2016 R. Russell Rhinehart. Bootstrapping
Bootstrapping Method for www.r3eda.com 14 June 2016 R. Russell Rhinehart Bootstrapping This is extracted from the book, Nonlinear Regression Modeling for Engineering Applications: Modeling, Model Validation,
More informationarxiv: v1 [cs.dc] 2 Apr 2016
Scalability Model Based on the Concept of Granularity Jan Kwiatkowski 1 and Lukasz P. Olech 2 arxiv:164.554v1 [cs.dc] 2 Apr 216 1 Department of Informatics, Faculty of Computer Science and Management,
More informationExploiting Task-Parallelism on GPU Clusters via OmpSs and rcuda Virtualization
Exploiting Task-Parallelism on Clusters via Adrián Castelló, Rafael Mayo, Judit Planas, Enrique S. Quintana-Ortí RePara 2015, August Helsinki, Finland Exploiting Task-Parallelism on Clusters via Power/energy/utilization
More informationEvaluation of RAPID for a UNF cask benchmark problem
Evaluation of RAPID for a UNF cask benchmark problem Valerio Mascolino 1,a, Alireza Haghighat 1,b, and Nathan J. Roskoff 1,c 1 Nuclear Science & Engineering Lab (NSEL), Virginia Tech, 900 N Glebe Rd.,
More informationHyper-Threading Influence on CPU Performance
João Martins* Jorge Gomes* Mario David* Gonçalo Borges* * LIP Laboratório de Instrumentação e Física Experimental de Particulas HePiX Spring
More informationarxiv: v1 [cs.dc] 27 Sep 2018
Performance of MPI sends of non-contiguous data Victor Eijkhout arxiv:19.177v1 [cs.dc] 7 Sep 1 1 Abstract We present an experimental investigation of the performance of MPI derived datatypes. For messages
More informationOptimised corrections for finite-difference modelling in two dimensions
Optimized corrections for 2D FD modelling Optimised corrections for finite-difference modelling in two dimensions Peter M. Manning and Gary F. Margrave ABSTRACT Finite-difference two-dimensional correction
More information