Parallel computation performance of Serpent and Serpent 2 on the KTH Parallel Dator Centrum
KTH ROYAL INSTITUTE OF TECHNOLOGY, SH2704, 9 MAY

Parallel computation performance of Serpent and Serpent 2 on the KTH Parallel Dator Centrum

Belle Andrea, Pourcelot Gregoire

Abstract - The aim of this project was to investigate the computational efficiency of Serpent and Serpent 2 on the KTH supercomputer. Several simulations were run with different input parameters and parallel-mode configurations, in order to obtain a broad view of the parallelization process. The resulting increases and decreases of the computation time, as various parameters and configurations were changed, were studied.

I. INTRODUCTION

Parallel calculation in a Monte Carlo code such as Serpent or Serpent 2 consists in splitting the size and the computational cost of a simulation across several parts, so that the computation time is likely to decrease. This kind of application, however, usually has to be run on a powerful machine. For this project, the KTH supercomputer, the so-called Parallel Dator Centrum or PDC, was used for all the simulations. The codes used were Serpent and Serpent 2.

A. Parallel Dator Centrum

The Parallel Dator Centrum, or PDC, is the KTH supercomputer. It consists of two main parts, called clusters: Beskow and Tegner. Each of these machines is formed by several units called nodes, and each node contains many cores, or CPUs. Many hardware configurations are available for parallel calculation, depending on the required computational power and the complexity of the simulation [1].

B. PDC and Serpent configuration

As mentioned before, Serpent, Serpent 2 and the Tegner machine were used. In particular, a specific partition of Tegner was available, with 46 nodes. Each node has 24 Intel E5-2690v3 Haswell cores, in a 2x12 configuration, and 512 GB of RAM [1].
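The splitting idea described in the introduction can be pictured with a toy example (a plain-Python sketch, not Serpent code; the seeds and history counts are arbitrary). Each part below plays the role of one task of a parallel run, with its own random number stream, and the partial results are summed at the end:

```python
import random

def simulate_part(seed, histories):
    """One independent part of a toy Monte Carlo estimate of pi:
    count random points that fall inside the unit quarter circle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(histories):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1.0:
            hits += 1
    return hits

# Split 400,000 histories into 4 independent parts, as a parallel
# run would, then combine the partial results at the end.
parts = [simulate_part(seed, 100_000) for seed in range(4)]
pi_estimate = 4.0 * sum(parts) / 400_000
```

Run in parallel, the wall-clock time of such a calculation would ideally drop by a factor equal to the number of parts; the overhead discussed in the results below is what prevents this ideal from being reached.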
The codes were compiled with gcc/7.2.0 and openmpi/3.0-gcc-7.2, and each simulation was launched through the sbatch submission script, in order to follow the required submission procedure of the supercomputer.

C. Parallel computation mode

Both Serpent and Serpent 2 support parallel calculation. In Serpent only the MPI mode (Message Passing Interface) is available. It consists in splitting the simulation into a given number of parts, called tasks, among which the total available memory is distributed. Each task runs a small part of the total simulation, and the results are combined at the end of the whole run, using the independent simulations scheme. In this work, each node was divided into 24 MPI tasks, each corresponding to a single core. The batch size, i.e. the number of neutron histories simulated per cycle, is divided among a certain number of cores, and the results are then combined with the aforementioned independent simulations scheme.

Serpent 2 can use both the MPI and the OpenMP parallel modes. The MPI mode is the same as described for Serpent, while OpenMP splits the simulation into a number of parts called threads. In this case the memory is not distributed, but shared among all the threads. Serpent 2 can also run in the so-called hybrid MPI-OpenMP mode, which merges the features of the two modes in order to find an optimal configuration. Each node can be divided into some MPI tasks, and each task into some OpenMP threads. The total memory is then divided equally among the MPI tasks, and the memory of each task is shared among the OpenMP threads inside the task itself. In this case the batch size is split among the MPI tasks, and each neutron history is then simulated in a different thread. Within each MPI task, the results of the OpenMP threads are combined with a sort of master/slave scheme.
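As a rough illustration of the independent simulations scheme just described (a minimal statistical sketch, not Serpent's actual implementation; the k_eff values and uncertainties are invented for the example):

```python
import math

def combine_independent(means, stds):
    """Combine N equally weighted independent simulations: the pooled
    estimate is the mean of the per-task means, and its standard
    deviation follows from summing the per-task variances."""
    n = len(means)
    mean = sum(means) / n
    std = math.sqrt(sum(s * s for s in stds)) / n
    return mean, std

# Example: k_eff estimates from four independent MPI tasks.
k_eff, sigma = combine_independent(
    [1.1002, 1.1010, 1.0995, 1.1001],
    [0.0008, 0.0008, 0.0008, 0.0008],
)
# With equal per-task uncertainties, the combined one is halved: 0.0004.
```

With N equal-variance tasks the combined uncertainty shrinks as 1/sqrt(N); splitting the same batch size over more tasks does not by itself improve the statistics, since each task then runs fewer histories.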
Results from different MPI tasks are then combined at the end as independent simulations.

II. SIMULATION PROCEDURE

A. Serpent input files

For all the simulations a single input file was used. It describes a BWR 2D fuel assembly [2], whose geometry is shown in figure 1. The pin pitch and the assembly pitch are specified in the input file. The fuel is UO2, with different concentrations of 235U and 238U; in the figure, different pin colors correspond to different enrichment levels. In some fuel pins, shown in blue in the figure, the uranium dioxide is mixed with gadolinium. The moderator is light water, and the cladding and box material is a zirconium alloy. Several types of detectors are also present.

Fig. 1. Geometry of the BWR fuel assembly

In order to evaluate the influence of the input file geometry on the computational efficiency, a different type of fuel assembly, shown in figure 2, was also used. It describes a CANDU 2D fuel cluster [2]. The fuel material is uranium dioxide, UO2, with a 235U enrichment of 0.7% (natural uranium). The moderator is heavy water (D2O) and the structural materials are different zirconium alloys. No detector is present.

Fig. 2. Geometry of the CANDU fuel assembly

Different combinations of batch size and active/inactive cycles were used during the study, in order to optimize the simulations and to evaluate the influence of the batch size on the efficiency. The same seed (1.5E7) was used for all the simulations, in order to preserve the random number series and to keep the results unbiased by statistical fluctuations.

B. Speed-up parameter

The main goal of this study is to evaluate the changes in the computation time with various input parameters and hardware configurations, such as the number of cores or nodes involved in the parallelization. The easiest way to evaluate the efficiency of a simulation is the figure of merit:

FOM = 1 / (σ² t)

with:
FOM = figure of merit
σ = standard deviation
t = computation time.

The main parameter for analyzing the efficiency of a parallel simulation is the speed-up parameter, defined by Gene Amdahl [3] with the following formula:

s = 1 / ((1 − F) + F/N)

with:
s = speed-up parameter
F = parallelizable fraction of the simulation
N = number of processors used in the simulation.

For simplicity, the speed-up parameter can also be considered as the ratio between the FOM of the simulation to be evaluated and the FOM of a reference simulation. In this study, for each series of simulations, the seed, the batch size and the number of active/inactive cycles are preserved [2], and the standard deviation can therefore be considered constant. The speed-up parameter can thus be expressed as:

s = FOM(n) / FOM(Ref) = (σ² t(Ref)) / (σ² t(n)) = t(Ref) / t(n).

In the case of the MPI mode, the computation time of the simulation run on a single core was taken as the reference for the speed-up parameter. In the case of the hybrid MPI-OpenMP mode with Serpent 2, only the computation time was taken into account in the evaluation of the results.

C. Results evaluation

Each series of simulations was evaluated by considering the changes in the speed-up parameter and in the actual computation time, in seconds, as the number of cores and nodes increases. Only the computation times reported in the Serpent output files were evaluated: the computation times in the PDC output files are slightly longer, due to the execution and procedure time required by the supercomputer, and including this extra time would have biased the results. Each simulation series used up to three nodes.

III. MPI MODE RESULTS

A. Serpent and Serpent 2 comparison

The first series of simulations was run using the BWR input file and the MPI mode for both Serpent and Serpent 2, with a batch size of 20,000 neutrons, 5000 active cycles and 200 inactive cycles. The speed-up parameter and the computation time were evaluated for both Serpent and Serpent 2, and then compared. The simulations were run with different numbers of MPI tasks: [1, 28], 30, 32, 36, 40, 44, [48, 52]. Each task corresponded
to one core, or CPU. These values were chosen to study the behavior of the parameters between the first and the second node, and between the second and the third one. The computation time is plotted in figure 3 and the speed-up parameter in figure 4. It can be clearly seen that, as the number of cores involved in the parallel simulation increases, the computation time decreases exponentially and the speed-up parameter increases linearly.

Fig. 3. Computation time versus number of cores for BWR with batch size 20,000 neutrons, 5000 active cycles, 200 inactive cycles

Fig. 4. Speed-up parameter versus number of cores for BWR with batch size 20,000 neutrons, 5000 active cycles, 200 inactive cycles

From figure 4 it can be noticed that the patterns of the speed-up parameter of Serpent and Serpent 2 are similar, and both of them can be approximated with a linear function. The fitting data are given in table 1.

TABLE I
SPEED-UP PARAMETER FITTING FOR BWR
Serpent: y = 0.7968x
Serpent 2: y = 0.7707x

The slope of the linear fit of the speed-up parameter is 0.7968 for Serpent and 0.7707 for Serpent 2. This means that the increase of the speed-up parameter, i.e. the decrease of the computation time, is slightly faster in Serpent than in Serpent 2. The slope is, as expected, smaller than 1: the decrease of the computation time is not perfectly inversely proportional to the increase of the number of cores. For example, when using two cores rather than one, the computation time is not half of the previous one, but slightly larger. This phenomenon is known as overhead [4], and it is due to the communication, execution and process time required by the machine performing the parallel simulation. Serpent seems to be slightly less affected by this factor. Nevertheless, Serpent 2 is more stable and less prone to instabilities when an extra node is needed: the pattern of its speed-up parameter is more linear, with smaller fluctuations. Serpent, on the other hand, presents a more unstable pattern, with a slight instability between the first and the second node, and a more pronounced fluctuation between the second and the third one. All these differences are probably due to the different internal architectures of the codes.

B. Influence of the geometry

The influence of the geometry was evaluated by running a series of simulations with the same numbers of cores as the previous one, but using the CANDU cluster geometry. The results for the computation time and the speed-up parameter are shown in figures 5 and 6 respectively. Both plots are very similar to the previous ones for the BWR assembly geometry.

Fig. 5. Computation time versus number of cores for CANDU with batch size 20,000 neutrons, 5000 active cycles, 200 inactive cycles

Fig. 6. Speed-up parameter versus number of cores for CANDU with batch size 20,000 neutrons, 5000 active cycles, 200 inactive cycles

The speed-up parameter was again approximated with linear functions, given in table 2. The slope of the Serpent fit is slightly larger, and Serpent again seems to be slightly more efficient. It has to be noticed that the slopes differ by less than 5%, with either the BWR or the CANDU geometry. Moreover, the patterns of the speed-up parameter are very similar. With Serpent, for both the BWR and the CANDU geometry, the pattern is more irregular, with more pronounced instabilities in the interface regions between nodes; with Serpent 2 the pattern is more regular and the fluctuations are less pronounced. It can therefore be concluded that, in this case, a different geometry does not bring any considerable change in the computational efficiency of Serpent and Serpent 2. The small differences between the two series are not particularly relevant, and they are probably caused by statistical fluctuations due to the different input files.

TABLE II
SPEED-UP PARAMETER FITTING FOR CANDU
Serpent: y = 0.7911x
Serpent 2: y = 0.7806x

C. Influence of the batch size

The batch size was changed from 20,000 to 50,000 neutron histories per cycle. The results are shown in figures 7 and 8 and in table 3. As can be clearly seen, the results are similar to the previous ones, and the batch size does not seem to have a considerable impact on the pattern of the computation time or of the speed-up parameter. Serpent is still slightly more efficient and more unstable, with the fluctuations between the second and the third node more pronounced than those between the first and the second node, while Serpent 2 shows a more regular trend. The fitting slopes of the speed-up parameter are similar and comparable to the previous ones.

Fig. 8. Speed-up parameter versus number of cores for BWR with batch size 50,000 neutrons, 5000 active cycles, 200 inactive cycles

D. Influence of the number of active/inactive cycles

The influence of the number of active and inactive cycles was evaluated with this series of simulations. The numbers of cores used were again [1, 28], 30, 32, 36, 40, 44, [48, 52], the batch size was 50,000 neutrons, and the numbers of active and inactive cycles were 12,500 and 500. The computation time and the speed-up parameter are plotted in figures 9 and 10.
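The through-origin linear fits y = ax reported in the tables can be reproduced with an ordinary least-squares formula; the sketch below uses plain Python and synthetic data (an ideal speed-up degraded by a constant 80% efficiency), not the measured values:

```python
def fit_through_origin(x_values, y_values):
    """Least-squares slope a of y = a*x constrained through the origin:
    a = sum(x_i * y_i) / sum(x_i * x_i)."""
    sxy = sum(x * y for x, y in zip(x_values, y_values))
    sxx = sum(x * x for x in x_values)
    return sxy / sxx

# Synthetic speed-up data: s(N) = 0.8 * N for N cores.
cores = [1, 2, 4, 8, 16, 24, 32, 48]
speedups = [0.8 * n for n in cores]
slope = fit_through_origin(cores, speedups)
```

For the measured data the slope comes out below 1 because of the overhead discussed above: a slope of 0.79, for instance, means that each added core contributes roughly 79% of an ideal core.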
The results are again similar to the previous ones, with an exponential decrease of the computation time and a linear increase of the speed-up parameter. The data of the linear fitting of the speed-up parameter are given in table 4.

Fig. 7. Computation time versus number of cores for BWR with batch size 50,000 neutrons, 5000 active cycles, 200 inactive cycles

TABLE III
SPEED-UP PARAMETER FITTING FOR BWR, 50,000 NEUTRONS, 5,000 ACTIVE CYCLES, 200 INACTIVE CYCLES
Serpent: y = 0.8047x
Serpent 2: y = 0.7772x

TABLE IV
SPEED-UP PARAMETER FITTING FOR BWR, 50,000 NEUTRONS, 12,500 ACTIVE CYCLES, 500 INACTIVE CYCLES
Serpent: y = 0.7914x
Serpent 2: y = 0.7755x

Fig. 9. Computation time versus number of cores for BWR with batch size 50,000 neutrons, 12,500 active cycles, 500 inactive cycles

Fig. 10. Speed-up parameter versus number of cores for BWR with batch size 50,000 neutrons, 12,500 active cycles, 500 inactive cycles

The slopes of the speed-up parameter fits are comparable with the previous ones, and the efficiency of the computation time can be considered the same. The big difference with respect to the previous results is the pronounced fluctuations in the interface regions between nodes in Serpent. The instabilities between the first and the second node, and between the second and the third one, are indeed larger than before; in particular, the fluctuation of the speed-up parameter between the first and the second node is considerably larger than in the previous simulations. Serpent 2, on the other hand, confirmed its more stable behavior. The explanation of this different behavior lies in the internal differences of the codes. Serpent could be more prone to instabilities because of the way the batch size is split among the MPI tasks: when a new node becomes necessary due to the increasing number of tasks, Serpent probably requires more time than Serpent 2 to split the batch size when only one or two cores of the new node are included in the simulation. Another factor could be the communication and process time at the beginning and at the end of the simulation: when only a few cores of a new node are used, this communication between MPI tasks may not be optimized. The cause of these fluctuations could therefore lie in the architecture of the Serpent code, and a higher number of active/inactive cycles, i.e. longer simulations, seems to increase the magnitude of the fluctuations in Serpent.

IV. HYBRID MPI-OPENMP RESULTS

The hybrid MPI-OpenMP mode was evaluated on 3 nodes, starting from a pure MPI mode and ending with a pure OpenMP mode, as shown in table 5.

TABLE V
HYBRID MPI-OPENMP COMBINATIONS
Total MPI tasks | MPI tasks per node | OpenMP threads per task
72 | 24 | 1
36 | 12 | 2
24 | 8 | 3
18 | 6 | 4
12 | 4 | 6
9 | 3 | 8
6 | 2 | 12
3 | 1 | 24

These combinations were evaluated with four series of simulations, with different geometries (BWR and CANDU) and different batch sizes (20,000 and 50,000), as shown in figures 11, 12, 13 and 14. The results were evaluated using only the computation time, and they showed a similar trend. Passing from a pure MPI mode, with a total of 72 MPI tasks (24 per node) and 1 OpenMP thread per task, to a hybrid mode with 12 MPI tasks (4 per node) and 6 threads per task, the computation time slightly decreases. Using 9 MPI tasks (3 per node) with 8 threads per task, the computation time increases considerably. This is due to the hardware architecture of the Haswell nodes used for the simulations: each node has 24 cores, divided into 2 separate blocks of 12 cores each. If a node is divided into 3 MPI tasks with 8 threads (cores) per task, one of the tasks has four of its cores in one block and the other four in the other one. The memory of this task is then shared across the two blocks, and additional process time is needed for the communication between them. The same consideration applies to the last point of each simulation series, where each node hosts a single MPI task with 24 OpenMP threads: also in this case the communication between the two blocks increases the total computation time.

Fig. 11. Computation time versus number of OpenMP threads for BWR with batch size 20,000, 5,000 active cycles, 200 inactive cycles

V. CONCLUSION

A. MPI mode

Both Serpent and Serpent 2, as the number of cores used per simulation increases, present an exponential decrease of the
computation time. The speed-up parameter increases linearly in both cases. The fitting slopes are similar, and they can be approximated by a value of 0.78±0.04 in all the simulations; such a value of the speed-up slope indicates a good efficiency. Some clear differences between the two codes emerged during the study. Serpent seems to be slightly more efficient, since the value of its speed-up slope is always somewhat higher than the one for Serpent 2. On the other hand, Serpent 2 presents a more stable trend, with very small instabilities, while Serpent shows more pronounced fluctuations in the interface regions between nodes.

Geometry and batch size do not seem to have a considerable influence on the results for either Serpent or Serpent 2: the computation time and speed-up trends are indeed similar. The number of active/inactive cycles seems to have a stronger influence on Serpent. It was shown that the longer the simulations are, the more pronounced the instabilities become, especially when passing from one to two nodes while adding only one or two cores on the new node. In this particular case, visible in figure 10, it can be clearly seen that adding an extra core to a parallel simulation is not always an advantage for the computational efficiency, since it can lead to an increase of the computation time. Serpent 2, on the other hand, did not show any considerable change. The differences between Serpent and Serpent 2 should be ascribed to the intrinsic differences in the code architectures: in particular, the way the batch size is split among the cores, the communication between the independent parts of the simulation, and the method of collecting and combining the results in the independent simulations scheme.

Fig. 12. Speed-up parameter versus number of OpenMP threads for BWR with batch size 50,000, 5,000 active cycles, 200 inactive cycles

Fig. 13. Computation time versus number of OpenMP threads for CANDU with batch size 20,000, 5,000 active cycles, 200 inactive cycles

Fig. 14. Computation time versus number of OpenMP threads for CANDU with batch size 50,000, 5,000 active cycles, 200 inactive cycles

B. Hybrid MPI-OpenMP

The results of the hybrid MPI-OpenMP mode show a similar pattern for different geometries and batch sizes. The computation time seems to be strongly influenced by the internal hardware architecture of the supercomputer: in particular, the division of each node into 24 cores, arranged in two blocks of 12 cores, plays a key role. If the communication time is not optimized, because of the division into tasks and threads, the total computation time increases. This is verified at the 6th point (3 MPI tasks per node, 8 OpenMP threads per task) and at the 8th point (1 MPI task per node, 24 OpenMP threads per task) of each simulation series. The most efficient point of each simulation series is the 5th one (4 MPI tasks per node, 6 OpenMP threads per task). The other points (1st, 2nd, 3rd, 4th, 7th) are also quite efficient: their computation times differ by less than 10% from the most efficient one. These minor differences and their causes are difficult to evaluate, and they would require a deeper investigation.

REFERENCES
[1] PDC documentation (last access 3 April 2018).
[2] Jaakko Leppänen, Serpent - a Continuous-energy Monte Carlo Reactor Physics Burnup Calculation Code, User's Manual, 18 June.
[3] Amdahl's law (last access 3 April 2018).
[4] Overhead in parallel computing (last access 3 April 2018).
More informationReview of previous examinations TMA4280 Introduction to Supercomputing
Review of previous examinations TMA4280 Introduction to Supercomputing NTNU, IMF April 24. 2017 1 Examination The examination is usually comprised of: one problem related to linear algebra operations with
More informationUsing the Eulerian Multiphase Model for Granular Flow
Tutorial 21. Using the Eulerian Multiphase Model for Granular Flow Introduction Mixing tanks are used to maintain solid particles or droplets of heavy fluids in suspension. Mixing may be required to enhance
More informationSELECTION OF A MULTIVARIATE CALIBRATION METHOD
SELECTION OF A MULTIVARIATE CALIBRATION METHOD 0. Aim of this document Different types of multivariate calibration methods are available. The aim of this document is to help the user select the proper
More informationImportance Sampling Spherical Harmonics
Importance Sampling Spherical Harmonics Wojciech Jarosz 1,2 Nathan A. Carr 2 Henrik Wann Jensen 1 1 University of California, San Diego 2 Adobe Systems Incorporated April 2, 2009 Spherical Harmonic Sampling
More informationA N-dimensional Stochastic Control Algorithm for Electricity Asset Management on PC cluster and Blue Gene Supercomputer
A N-dimensional Stochastic Control Algorithm for Electricity Asset Management on PC cluster and Blue Gene Supercomputer Stéphane Vialle, Xavier Warin, Patrick Mercier To cite this version: Stéphane Vialle,
More informationEdge-Preserving Denoising for Segmentation in CT-Images
Edge-Preserving Denoising for Segmentation in CT-Images Eva Eibenberger, Anja Borsdorf, Andreas Wimmer, Joachim Hornegger Lehrstuhl für Mustererkennung, Friedrich-Alexander-Universität Erlangen-Nürnberg
More informationAssembly dynamics of microtubules at molecular resolution
Supplementary Information with: Assembly dynamics of microtubules at molecular resolution Jacob W.J. Kerssemakers 1,2, E. Laura Munteanu 1, Liedewij Laan 1, Tim L. Noetzel 2, Marcel E. Janson 1,3, and
More informationsimulation framework for piecewise regular grids
WALBERLA, an ultra-scalable multiphysics simulation framework for piecewise regular grids ParCo 2015, Edinburgh September 3rd, 2015 Christian Godenschwager, Florian Schornbaum, Martin Bauer, Harald Köstler
More informationIntroduction to Parallel Programming
Introduction to Parallel Programming January 14, 2015 www.cac.cornell.edu What is Parallel Programming? Theoretically a very simple concept Use more than one processor to complete a task Operationally
More informationQ: Which month has the lowest sale? Answer: Q:There are three consecutive months for which sale grow. What are they? Answer: Q: Which month
Lecture 1 Q: Which month has the lowest sale? Q:There are three consecutive months for which sale grow. What are they? Q: Which month experienced the biggest drop in sale? Q: Just above November there
More informationThe Why and How of HPC-Cloud Hybrids with OpenStack
The Why and How of HPC-Cloud Hybrids with OpenStack OpenStack Australia Day Melbourne June, 2017 Lev Lafayette, HPC Support and Training Officer, University of Melbourne lev.lafayette@unimelb.edu.au 1.0
More informationInvestigation of Intel MIC for implementation of Fast Fourier Transform
Investigation of Intel MIC for implementation of Fast Fourier Transform Soren Goyal Department of Physics IIT Kanpur e-mail address: soren@iitk.ac.in The objective of the project was to run the code for
More informationv MODFLOW Stochastic Modeling, Parameter Randomization GMS 10.3 Tutorial
v. 10.3 GMS 10.3 Tutorial MODFLOW Stochastic Modeling, Parameter Randomization Run MODFLOW in Stochastic (Monte Carlo) Mode by Randomly Varying Parameters Objectives Learn how to develop a stochastic (Monte
More informationInvestigations into Alternative Radiation Transport Codes for ITER Neutronics Analysis
CCFE-PR(17)10 Andrew Turner Investigations into Alternative Radiation Transport Codes for ITER Neutronics Analysis Enquiries about copyright and reproduction should in the first instance be addressed to
More informationv Prerequisite Tutorials Required Components Time
v. 10.0 GMS 10.0 Tutorial MODFLOW Stochastic Modeling, Parameter Randomization Run MODFLOW in Stochastic (Monte Carlo) Mode by Randomly Varying Parameters Objectives Learn how to develop a stochastic (Monte
More informationBagging & System Combination for POS Tagging. Dan Jinguji Joshua T. Minor Ping Yu
Bagging & System Combination for POS Tagging Dan Jinguji Joshua T. Minor Ping Yu Bagging Bagging can gain substantially in accuracy The vital element is the instability of the learning algorithm Bagging
More informationEnemy Territory Traffic Analysis
Enemy Territory Traffic Analysis Julie-Anne Bussiere *, Sebastian Zander Centre for Advanced Internet Architectures. Technical Report 00203A Swinburne University of Technology Melbourne, Australia julie-anne.bussiere@laposte.net,
More information30 Nov Dec Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy
Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy Why serial is not enough Computing architectures Parallel paradigms Message Passing Interface How
More informationPerformance of the 3D-Combustion Simulation Code RECOM-AIOLOS on IBM POWER8 Architecture. Alexander Berreth. Markus Bühler, Benedikt Anlauf
PADC Anual Workshop 20 Performance of the 3D-Combustion Simulation Code RECOM-AIOLOS on IBM POWER8 Architecture Alexander Berreth RECOM Services GmbH, Stuttgart Markus Bühler, Benedikt Anlauf IBM Deutschland
More informationDesigning for Performance. Patrick Happ Raul Feitosa
Designing for Performance Patrick Happ Raul Feitosa Objective In this section we examine the most common approach to assessing processor and computer system performance W. Stallings Designing for Performance
More informationA recipe for fast(er) processing of netcdf files with Python and custom C modules
A recipe for fast(er) processing of netcdf files with Python and custom C modules Ramneek Maan Singh a, Geoff Podger a, Jonathan Yu a a CSIRO Land and Water Flagship, GPO Box 1666, Canberra ACT 2601 Email:
More informationA FLEXIBLE COUPLING SCHEME FOR MONTE CARLO AND THERMAL-HYDRAULICS CODES
International Conference on Mathematics and Computational Methods Applied to Nuclear Science and Engineering (M&C 2011) Rio de Janeiro, RJ, Brazil, May 8-12, 2011, on CD-ROM, Latin American Section (LAS)
More informationThe p-sized partitioning algorithm for fast computation of factorials of numbers
J Supercomput (2006) 38:73 82 DOI 10.1007/s11227-006-7285-5 The p-sized partitioning algorithm for fast computation of factorials of numbers Ahmet Ugur Henry Thompson C Science + Business Media, LLC 2006
More informationHPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Agenda
KFUPM HPC Workshop April 29-30 2015 Mohamed Mekias HPC Solutions Consultant Agenda 1 Agenda-Day 1 HPC Overview What is a cluster? Shared v.s. Distributed Parallel v.s. Massively Parallel Interconnects
More informationPosition Paper: OpenMP scheduling on ARM big.little architecture
Position Paper: OpenMP scheduling on ARM big.little architecture Anastasiia Butko, Louisa Bessad, David Novo, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli, Lionel Torres, and Michel Robert LIRMM
More informationOptimizing Data Locality for Iterative Matrix Solvers on CUDA
Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,
More informationTheoretical Investigations of Tomographic Methods used for Determination of the Integrity of Spent BWR Nuclear Fuel
a UPPSALA UNIVERSITY Department of Radiation Sciences Box 535, S-751 1 Uppsala, Sweden http://www.tsl.uu.se/ Internal report ISV-6/97 August 1996 Theoretical Investigations of Tomographic Methods used
More informationState of the art of Monte Carlo technics for reliable activated waste evaluations
State of the art of Monte Carlo technics for reliable activated waste evaluations Matthieu CULIOLI a*, Nicolas CHAPOUTIER a, Samuel BARBIER a, Sylvain JANSKI b a AREVA NP, 10-12 rue Juliette Récamier,
More informationComputing architectures Part 2 TMA4280 Introduction to Supercomputing
Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:
More informationCover Page. The handle holds various files of this Leiden University dissertation.
Cover Page The handle http://hdl.handle.net/1887/22055 holds various files of this Leiden University dissertation. Author: Koch, Patrick Title: Efficient tuning in supervised machine learning Issue Date:
More informationOn the Performance of MapReduce: A Stochastic Approach
On the Performance of MapReduce: A Stochastic Approach Sarker Tanzir Ahmed and Dmitri Loguinov Internet Research Lab Department of Computer Science and Engineering Texas A&M University October 28, 2014
More informationUsing Excel for Graphical Analysis of Data
Using Excel for Graphical Analysis of Data Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable physical parameters. Graphs are
More informationMPI and OpenMP Paradigms on Cluster of SMP Architectures: the Vacancy Tracking Algorithm for Multi-Dimensional Array Transposition
MPI and OpenMP Paradigms on Cluster of SMP Architectures: the Vacancy Tracking Algorithm for Multi-Dimensional Array Transposition Yun He and Chris H.Q. Ding NERSC Division, Lawrence Berkeley National
More informationParallel Performance Studies for a Clustering Algorithm
Parallel Performance Studies for a Clustering Algorithm Robin V. Blasberg and Matthias K. Gobbert Naval Research Laboratory, Washington, D.C. Department of Mathematics and Statistics, University of Maryland,
More informationThe Pennsylvania State University. The Graduate School. Department of Mechanical and Nuclear Engineering
The Pennsylvania State University The Graduate School Department of Mechanical and Nuclear Engineering IMPROVED REFLECTOR MODELING FOR LIGHT WATER REACTOR ANALYSIS A Thesis in Nuclear Engineering by David
More informationCURRICULUM UNIT MAP 1 ST QUARTER
1 ST QUARTER Unit 1: Pre- Algebra Basics I WEEK 1-2 OBJECTIVES Apply properties for operations to positive rational numbers and integers Write products of like bases in exponential form Identify and use
More informationAccelerating GATE simulations
GATE Simulations of Preclinical andclinical Scans in Emission Tomography, Transmission Tomography and Radiation Therapy Accelerating GATE simulations Parallel computing and GPU GATE Training, INSTN-Saclay,
More informationI. INTRODUCTION FACTORS RELATED TO PERFORMANCE ANALYSIS
Performance Analysis of Java NativeThread and NativePthread on Win32 Platform Bala Dhandayuthapani Veerasamy Research Scholar Manonmaniam Sundaranar University Tirunelveli, Tamilnadu, India dhanssoft@gmail.com
More information6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS
Chapter 6 Indexing Results 6. INTRODUCTION The generation of inverted indexes for text databases is a computationally intensive process that requires the exclusive use of processing resources for long
More informationMulticore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor.
CS 320 Ch. 18 Multicore Computers Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor. Definitions: Hyper-threading Intel's proprietary simultaneous
More informationDetecting Polytomous Items That Have Drifted: Using Global Versus Step Difficulty 1,2. Xi Wang and Ronald K. Hambleton
Detecting Polytomous Items That Have Drifted: Using Global Versus Step Difficulty 1,2 Xi Wang and Ronald K. Hambleton University of Massachusetts Amherst Introduction When test forms are administered to
More informationImproving Hadoop MapReduce Performance on Supercomputers with JVM Reuse
Thanh-Chung Dao 1 Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse Thanh-Chung Dao and Shigeru Chiba The University of Tokyo Thanh-Chung Dao 2 Supercomputers Expensive clusters Multi-core
More informationThe Art of Parallel Processing
The Art of Parallel Processing Ahmad Siavashi April 2017 The Software Crisis As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a
More informationvsan 6.6 Performance Improvements First Published On: Last Updated On:
vsan 6.6 Performance Improvements First Published On: 07-24-2017 Last Updated On: 07-28-2017 1 Table of Contents 1. Overview 1.1.Executive Summary 1.2.Introduction 2. vsan Testing Configuration and Conditions
More informationSamuel Coolidge, Dan Simon, Dennis Shasha, Technical Report NYU/CIMS/TR
Detecting Missing and Spurious Edges in Large, Dense Networks Using Parallel Computing Samuel Coolidge, sam.r.coolidge@gmail.com Dan Simon, des480@nyu.edu Dennis Shasha, shasha@cims.nyu.edu Technical Report
More informationIntel MPI Library Conditional Reproducibility
1 Intel MPI Library Conditional Reproducibility By Michael Steyer, Technical Consulting Engineer, Software and Services Group, Developer Products Division, Intel Corporation Introduction High performance
More informationCS 229: Machine Learning Final Report Identifying Driving Behavior from Data
CS 9: Machine Learning Final Report Identifying Driving Behavior from Data Robert F. Karol Project Suggester: Danny Goodman from MetroMile December 3th 3 Problem Description For my project, I am looking
More informationerror
PARALLEL IMPLEMENTATION OF STOCHASTIC ITERATION ALGORITHMS Roel Mart nez, László Szirmay-Kalos, Mateu Sbert, Ali Mohamed Abbas Department of Informatics and Applied Mathematics, University of Girona Department
More informationWhitepaper Spain SEO Ranking Factors 2012
Whitepaper Spain SEO Ranking Factors 2012 Authors: Marcus Tober, Sebastian Weber Searchmetrics GmbH Greifswalder Straße 212 10405 Berlin Phone: +49-30-3229535-0 Fax: +49-30-3229535-99 E-Mail: info@searchmetrics.com
More informationBootstrapping Method for 14 June 2016 R. Russell Rhinehart. Bootstrapping
Bootstrapping Method for www.r3eda.com 14 June 2016 R. Russell Rhinehart Bootstrapping This is extracted from the book, Nonlinear Regression Modeling for Engineering Applications: Modeling, Model Validation,
More informationarxiv: v1 [cs.dc] 2 Apr 2016
Scalability Model Based on the Concept of Granularity Jan Kwiatkowski 1 and Lukasz P. Olech 2 arxiv:164.554v1 [cs.dc] 2 Apr 216 1 Department of Informatics, Faculty of Computer Science and Management,
More informationExploiting Task-Parallelism on GPU Clusters via OmpSs and rcuda Virtualization
Exploiting Task-Parallelism on Clusters via Adrián Castelló, Rafael Mayo, Judit Planas, Enrique S. Quintana-Ortí RePara 2015, August Helsinki, Finland Exploiting Task-Parallelism on Clusters via Power/energy/utilization
More informationEvaluation of RAPID for a UNF cask benchmark problem
Evaluation of RAPID for a UNF cask benchmark problem Valerio Mascolino 1,a, Alireza Haghighat 1,b, and Nathan J. Roskoff 1,c 1 Nuclear Science & Engineering Lab (NSEL), Virginia Tech, 900 N Glebe Rd.,
More informationHyper-Threading Influence on CPU Performance
João Martins* Jorge Gomes* Mario David* Gonçalo Borges* * LIP Laboratório de Instrumentação e Física Experimental de Particulas HePiX Spring
More informationarxiv: v1 [cs.dc] 27 Sep 2018
Performance of MPI sends of non-contiguous data Victor Eijkhout arxiv:19.177v1 [cs.dc] 7 Sep 1 1 Abstract We present an experimental investigation of the performance of MPI derived datatypes. For messages
More informationOptimised corrections for finite-difference modelling in two dimensions
Optimized corrections for 2D FD modelling Optimised corrections for finite-difference modelling in two dimensions Peter M. Manning and Gary F. Margrave ABSTRACT Finite-difference two-dimensional correction
More information