Breaking Through the Barriers to GPU Accelerated Monte Carlo Particle Transport GTC 2018 Jeremy Sweezy Scientist Monte Carlo Methods, Codes and Applications Group 3/28/2018 Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA LA-UR-18-XXXX
What is Monte Carlo Particle Transport? Follows the path of individual particles through a system. Uses pseudo-random numbers to sample physical and non-physical processes. Attributed to Stanislaw Ulam and Enrico Fermi. Named because Ulam had an uncle who would borrow money from relatives because he "just had to go to Monte Carlo". [Figure: FERMIAC] 3/23/18 2
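As a minimal sketch of what "sampling a process with pseudo-random numbers" can look like (not taken from any production code): in a homogeneous medium the distance to the next collision is exponentially distributed, so it can be sampled by inverting the cumulative distribution. The function name, the cross-section value, and the seed below are illustrative assumptions.

```python
import math
import random


def sample_flight_distance(sigma_t, rng):
    """Sample a distance to collision for total macroscopic cross section sigma_t (1/cm).

    Inverts the exponential CDF: s = -ln(xi) / sigma_t with xi uniform on (0, 1].
    """
    xi = 1.0 - rng.random()  # rng.random() is in [0, 1), so xi is never zero
    return -math.log(xi) / sigma_t


rng = random.Random(12345)   # fixed seed so the run is repeatable (assumed value)
sigma_t = 2.0                # hypothetical cross section, 1/cm
samples = [sample_flight_distance(sigma_t, rng) for _ in range(100_000)]
mean_distance = sum(samples) / len(samples)
print(mean_distance)         # should be close to the mean free path 1 / sigma_t
```

The sample mean approaches the mean free path 1/sigma_t as the number of samples grows, which is a quick sanity check on the sampling routine.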
Porting to Specialized Hardware is Prohibitively Expensive. The world's production Monte Carlo codes have decades of development: LANL's MCNP code has been in development since 1977, with an equally extensive amount of V&V effort. Codes have to run on desktop machines and supercomputers. DOE HPC platforms have been in a state of flux for the last 10 years: Cell Broadband Engine, Intel Xeon Phi (MIC), GPUs, ARM??? Barrier #1: Limited Resources (Money, People, Time).
Monte Carlo Random Walk on GPU Hardware has Reached a Performance Wall. At least 6 different research groups have ported the Monte Carlo random walk to GPU hardware for neutron transport. All report results against different numbers of CPUs. All get the same results! Almost all are extremely simplified; production codes will likely have worse performance. What are the limitations? Conditional branching. Random data access. No small, computationally intensive kernel to accelerate. [Chart labels: 4.5x, 3.0x] Barrier #2: Performance of random walk on GPUs.
How do You Define Performance? A computer scientist might measure performance as an increase in speed: $P = T_{\mathrm{CPU}} / T_{\mathrm{GPU}}$. A Monte Carlo specialist would measure performance as a balance between speed and statistical variance using a Figure-of-Merit: $\mathrm{FOM} = \dfrac{\sigma_{\mathrm{CPU}}^2 \, T_{\mathrm{CPU}}}{\sigma_{\mathrm{GPU}}^2 \, T_{\mathrm{GPU}}}$. Example: $\mathrm{FOM} = \dfrac{0.1^2 \times 1\ \mathrm{min}}{0.05^2 \times 2\ \mathrm{min}} = 2$. To date, almost all GPU implementations of Monte Carlo particle transport have focused on increasing speed.
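The two measures can disagree, which a small sketch makes concrete. The functions below assume the speedup is T_CPU / T_GPU and the figure-of-merit ratio is (sigma_CPU^2 T_CPU) / (sigma_GPU^2 T_GPU), with sigma the relative uncertainty of the tally; the function names are illustrative, and reading the slide example as a CPU run at 10% in 1 minute versus a GPU run at 5% in 2 minutes is an assumption based on where the numbers sit in the fraction.

```python
def speedup(t_cpu, t_gpu):
    """Raw speed ratio P = T_CPU / T_GPU."""
    return t_cpu / t_gpu


def fom_ratio(sigma_cpu, t_cpu, sigma_gpu, t_gpu):
    """Figure-of-merit ratio (sigma_cpu^2 * t_cpu) / (sigma_gpu^2 * t_gpu).

    Values above 1 mean the GPU run reaches a given uncertainty in less time.
    """
    return (sigma_cpu ** 2 * t_cpu) / (sigma_gpu ** 2 * t_gpu)


# Assumed reading of the example: 10% uncertainty in 1 min on the CPU,
# 5% uncertainty in 2 min with the GPU.
p = speedup(1.0, 2.0)
fom = fom_ratio(0.10, 1.0, 0.05, 2.0)
print(p)    # the GPU run takes twice as long
print(fom)  # yet the figure of merit favors it by about a factor of 2
```

Under these assumed numbers the run that is slower per history still wins on figure of merit, because halving the uncertainty is worth a factor of four in run time.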
Next Event Estimator. The next-event estimator calculates the probability that a particle from a source or collision event reaches a point without interaction. Typically used for image tallies. $S(R, E) = \sum_{i=1}^{N} \frac{\sigma_i(R, E)}{\sigma_T} \, \frac{w}{2\pi R^2} \, p_i(\mu, E \to E') \exp\left(-\int_0^R \Sigma_T(s, E')\,ds\right)$ [Figure labels: Ray-cast, A, μ, Cell 1, Cell 2, Image Plane, B] One to two orders of magnitude faster on GPU hardware.
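A minimal sketch of a next-event score, assuming the contribution has the usual point-detector form: weight times a nuclide-weighted scattering density p(mu), divided by 2 pi R^2, times the exponential of minus the optical depth along the ray. The sketch further assumes piecewise-constant cross sections in each cell the ray crosses and that the total microscopic cross section is the plain sum of the listed nuclide cross sections; all names are hypothetical and this is not MonteRay's API.

```python
import math


def next_event_score(weight, distance, nuclides, segments):
    """Score at a point a distance `distance` (cm) from the event.

    nuclides: list of (sigma_i, p_i), where sigma_i is a nuclide cross section at
              the event and p_i is its scattering density p_i(mu, E -> E') toward
              the point (p = 0.5 for isotropic scattering in mu).
    segments: list of (sigma_t, length) for each cell crossed by the ray, with
              sigma_t the macroscopic total cross section at the outgoing energy.
    """
    sigma_total = sum(sigma for sigma, _ in nuclides)  # assumed: plain sum over nuclides
    angular = sum(sigma / sigma_total * p for sigma, p in nuclides)
    optical_depth = sum(sigma_t * length for sigma_t, length in segments)
    geometric = 1.0 / (2.0 * math.pi * distance ** 2)
    return weight * angular * geometric * math.exp(-optical_depth)


# Isotropic scatter, detector 2 cm away, ray crossing two 1-cm cells.
score = next_event_score(1.0, 2.0, [(1.0, 0.5)], [(0.5, 1.0), (0.25, 1.0)])
print(score)
```

In a void with isotropic scattering the score reduces to weight / (4 pi R^2), which is a convenient check. Each image pixel needs one such ray per source or collision event, which is why the workload maps naturally onto many GPU threads.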
Traditional Track-Length Estimator. The standard Monte Carlo fluence estimator. Uses the sampled distance in each cell as the fluence estimator. Only contributes to cells through which the particle passes. Easy to compute. Nothing to accelerate on GPU. [Figure labels: B, Cell 1, Cell 2, Cell 3] Computing has changed; we need to change our algorithms too!
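To show how little arithmetic the track-length estimator involves, here is a deliberately simplified sketch: a 1-D mesh along x and a single flight travelling parallel to the x axis, scoring weight times path length in each cell it crosses. The function name and the mesh are illustrative assumptions; a real code would score along the 3-D flight path and divide by cell volume and total source weight.

```python
def score_track_length(tally, edges, x_start, x_end, weight=1.0):
    """Add weight * (path length inside each cell) for one flight along x.

    edges: ascending cell boundaries; cell i spans edges[i] to edges[i + 1].
    """
    lo, hi = min(x_start, x_end), max(x_start, x_end)
    for i in range(len(edges) - 1):
        overlap = min(hi, edges[i + 1]) - max(lo, edges[i])
        if overlap > 0.0:          # cells the flight never enters get nothing
            tally[i] += weight * overlap
    return tally


edges = [0.0, 1.0, 2.0, 3.0]       # three 1-cm cells (hypothetical mesh)
tally = score_track_length([0.0, 0.0, 0.0], edges, 0.5, 1.75)
print(tally)                       # only the first two cells are touched
```

The third cell receives no score because the flight stops short of it, which is the limitation the slide points to: a cell is only informed by particles that actually pass through it.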
Volumetric-Ray-Casting Estimator. For use in place of the traditional track-length estimator on GPU. Multiple pseudo-rays are generated at each source and collision event. Computationally intensive estimator with lower variance. [Figure labels: Ray-cast, B, Cell 1, Cell 2, Cell 3] $F(i, E) = w \, \frac{1 - \exp\left(-\Sigma_{T,i}(E')\, l_i\right)}{N \, \Sigma_{T,i}(E')} \exp\left(-\int_0^{|r' - r|} \Sigma_T(r + \Omega s, E')\,ds\right)$ "A neutron dance for a neutron fan." (P.M. Dawn)
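A minimal sketch of the per-ray scoring, assuming each of the N pseudo-rays contributes to cell i the expected track length w/N × (1 − exp(−Σ_T,i l_i)) / Σ_T,i, attenuated by the optical depth between the event and the point where the ray enters the cell. Cross sections are taken as constant within each cell, and the names and data layout are hypothetical rather than MonteRay's interface.

```python
import math


def ray_cast_scores(weight, n_rays, segments):
    """Expected track-length scores along one pseudo-ray.

    segments: ordered list of (cell_index, sigma_t, length) from the event outward,
              with sigma_t the macroscopic total cross section at the outgoing energy.
    Returns a dict mapping cell_index to its score from this ray.
    """
    scores = {}
    optical_depth = 0.0  # accumulated sigma_t * length from the event to the cell entry
    for cell, sigma_t, length in segments:
        if sigma_t > 0.0:
            expected = (1.0 - math.exp(-sigma_t * length)) / sigma_t
        else:
            expected = length  # void limit of (1 - exp(-x * l)) / x as x -> 0
        contribution = weight / n_rays * expected * math.exp(-optical_depth)
        scores[cell] = scores.get(cell, 0.0) + contribution
        optical_depth += sigma_t * length
    return scores


# One ray through a 1-cm cell (sigma_t = 1.0) and a 0.5-cm cell (sigma_t = 2.0).
scores = ray_cast_scores(1.0, 1, [(0, 1.0, 1.0), (1, 2.0, 0.5)])
print(scores)
```

Unlike the track-length sketch of a single sampled flight, every cell along the ray receives a score, including cells the real particle would rarely reach. That is the source of both the extra arithmetic per event and the lower variance.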
MonteRay - Accelerating Monte Carlo Transport with GPU Ray Tracing. MonteRay: a library for accelerating Monte Carlo tallies with GPUs. The random walk is maintained on the CPU. Ray-casting-based tallies are calculated on the GPU: the Next-Event estimator and the Volumetric-Ray-Casting estimator, a new estimator designed for GPUs. Supports neutron and photon tallies. Can be incorporated into new and legacy Monte Carlo codes. Uses continuous-energy cross-section data. Single-precision ray casting, single-precision attenuation cross-sections, double-precision tallies. Reduces the cost of accelerating an existing Monte Carlo code with GPUs.
MonteRay - Testing. Tests use: a GeForce GTX Titan X GPU with NVIDIA Maxwell architecture; 2 CPUs (Intel Haswell E5-2660 v3 at 2.60 GHz) with 10 cores each; MonteRay linked with LANL's C++ Monte Carlo code MCATK; MCATK uses MPI parallelism, building shared ray buffers using MPI-3 shared memory; 3-D Cartesian structured mesh geometry. 2 tests measured the performance of the Next-Event estimator. 4 tests measured the performance of the Volumetric-Ray-Casting estimator. Volumetric-Ray-Casting estimator performance on the GPU is compared to Track-Length estimator performance on the CPU. Base performance is measured relative to 8 CPU cores.
Testing the Next-Event Estimator on GPU Hardware: Two Radiography Tests
MonteRay Medical X-Ray Imaging Simulation. 50-keV X-ray beam, 0.12 mm spot size. Radiograph used the Next-Event Estimator. Simulation useful for designing a collimator to minimize the scattered contribution.
MonteRay Medical X-Ray Imaging Simulation. Source and collided contributions calculated separately. Source contribution relatively easy to calculate. Collided contribution important for collimator design. Collided performance: 15-18x. [Chart labels: 14.5x, 15.3x]
MonteRay Industrial Radiography. Simulated a physical test object used at the Los Alamos Dual Axis Radiographic Hydrodynamic Test Facility. Used a 4-MeV mono-energetic X-ray beam. 100 x 100 image grid (10,000 estimators) to simulate the image detector. Calculation of the scatter component is needed to design collimators and the experiment, but is too computationally expensive. "I'm a peeping-tom techie with x-ray eyes" (Patrick Lee MacDonald)
MonteRay Industrial Radiography. [Figure: GPU Performance vs Number of CPU Cores; relative performance of the Source and Collided calculations vs number of CPU cores per GPU; chart labels 28.5x and 24.2x] Collided calculation performance 15-32x!
Volumetric-Ray-Casting Estimator on GPU Hardware vs Track-Length Estimator on CPU Hardware
Cancer Treatment Simulation. 2-MeV photon beam (peak of 6 MV medical accelerator photon spectrum), 1-cm beam radius. What is the dose to healthy tissue? [Figure: tumor and 2-MeV photon beam; GPU Performance vs 8 CPU Cores] 14x performance improvement in healthy tissue.
Cancer Treatment Simulation. [Figure: GPU Performance vs Number of CPU Cores in Healthy Tissue; chart labels 14.3x and 10.2x] Performance is 14x vs 8 CPU cores or 10x vs 12 CPU cores.
Pressurized Water Reactor Assembly Simulation. 16x16 fuel assembly. Performance: 7.5x in the control rods, 5x in the fuel, and 4.5x in the coolant. [Figure: GPU Performance vs 8 CPU Cores; labels: Fuel Pin, Control Rod]
Pressurized Water Reactor Assembly Simulation. [Figure: GPU Performance vs Number of CPU Cores; chart labels 7.2x, 6.0x, 5.4x, 4.4x] Compared to 8 CPU cores, performance is 7.2x in the control rod and 6.0x in the fuel.
Criticality Accident Simulation. Critical uranium sphere in the corner of a concrete room. Concrete floor, walls, ceiling, and 4 concrete pillars. [Figure: GPU Performance vs 8 CPU Cores; label: Uranium Sphere] Performance increase of 14-16x in the center of the room.
Criticality Accident Simulation: Smoother Fluence Estimate. [Figure panels: Track-Length Estimator; Volumetric-Ray-Casting Estimator]
Criticality Accident Simulation. [Figure: GPU Performance vs Number of CPU Cores; chart labels 15x and 10.5x] "Things are going great, and they're only getting better" (Patrick Lee MacDonald)
Reflected Godiva Criticality Experiment Simulation. U-235 sphere reflected by water. Performance improvement: 2.5x in the core, 1.0x in the water. [Figure: GPU Performance vs 8 CPU Cores]
Reflected Godiva Criticality Experiment Simulation. The variance of the Volumetric-Ray-Casting estimator approaches that of the Track-Length estimator in strongly scattering material. [Figures: GPU Performance vs. Number of CPU Cores (chart label 2.2x); Variance Ratio ($\sigma_{TL}^2 / \sigma_{VRC}^2$) vs. Number of Samples per Collision (N)] Performance is limited by the estimator variance, not the GPU speed.
Conclusions. MonteRay provides a low-cost method of providing GPU-accelerated Monte Carlo particle transport. Can be incorporated into legacy codes at low cost. Works with standard variance reduction methods. Performance improvements of MonteRay are significant: up to 32 times for the Next-Event estimator as compared to 8 CPU cores; up to 14 times for the Volumetric-Ray-Casting estimator as compared to the Track-Length estimator on 8 CPU cores. MonteRay provides a method of breaking through the barriers of limited resources and limited performance.
Questions? Jeremy Sweezy jsweezy@lanl.gov
Extra
Uncertainty - Pressurized Water Reactor Assembly Simulation. Track-Length Estimator: 600 sec., 8 CPU cores, 124 cycles, 40000 particles/cycle. Volumetric-Ray-Casting Estimator: 600 sec., 8 CPU cores and 1 GPU, 93 cycles, 40000 particles/cycle, 8 rays/collision.