Op#miza#on of CUDA- based Monte Carlo Simula#on for Radia#on Therapy. GTC 2014 N. Henderson & K. Murakami

Similar documents
CUDA- based Geant4 Monte Carlo Simula8on for Radia8on Therapy. N. Henderson & K. Murakami GTC 2013

MPEXS benchmark results

gpmc: GPU-Based Monte Carlo Dose Calculation for Proton Radiotherapy Xun Jia 8/7/2013

The Geant4-DNA project at the Physics-Biology frontier. JFY2015(BIO_02) reports and JFY2016(APP_01) plan Sébastien Incerti, CENBG Takashi Sasaki, KEK

GPU Based Convolution/Superposition Dose Calculation

Basics of treatment planning II

A fast and accurate GPU-based proton transport Monte Carlo simulation for validating proton therapy treatment plans

GPU applications in Cancer Radiation Therapy at UCSD. Steve Jiang, UCSD Radiation Oncology Amit Majumdar, SDSC Dongju (DJ) Choi, SDSC

Outline. Outline 7/24/2014. Fast, near real-time, Monte Carlo dose calculations using GPU. Xun Jia Ph.D. GPU Monte Carlo. Clinical Applications

Electron Dose Kernels (EDK) for Secondary Particle Transport in Deterministic Simulations

GPU-based fast Monte Carlo simulation for radiotherapy dose calculation

Monte Carlo simulations. Lesson FYSKJM4710 Eirik Malinen

Op#miza#on of a Fast CT System for More Accurate Radiotherapy. Erica Cherry, Meng Wu, and Rebecca Fahrig

ELECTRON DOSE KERNELS TO ACCOUNT FOR SECONDARY PARTICLE TRANSPORT IN DETERMINISTIC SIMULATIONS

Representing Range Compensators in the TOPAS Monte Carlo System

Status and plan for the hadron therapy simulation project in Japan

Monte Carlo simulations

Improvements in Monte Carlo simulation of large electron fields

Basics of treatment planning II

Faster Navigation in Voxel Geometries (DICOM) Joseph Perl (SLAC/SCCS) G4NAMU AAPM Houston 27 July 2008

Michael Speiser, Ph.D.

Implementation of the EGSnrc / BEAMnrc Monte Carlo code - Application to medical accelerator SATURNE43

Comparison of absorbed dose distribution 10 MV photon beam on water phantom using Monte Carlo method and Analytical Anisotropic Algorithm

Breaking Through the Barriers to GPU Accelerated Monte Carlo Particle Transport

Software Suite for Particle Therapy Simulation

Arezoo Modiri Department of Radiation Oncology University of Maryland, Baltimore

Transitioning from pencil beam to Monte Carlo for electron dose calculations

Indrin Chetty Henry Ford Hospital Detroit, MI. AAPM Annual Meeting Houston 7:30-8:25 Mon 08/07/28 1/30

Hidenobu Tachibana The Cancer Institute Hospital of JFCR, Radiology Dept. The Cancer Institute of JFCR, Physics Dept.

Outline. Monte Carlo Radiation Transport Modeling Overview (MCNP5/6) Monte Carlo technique: Example. Monte Carlo technique: Introduction

A Geant4 based treatment plan verification tool: commissioning of the LINAC model and DICOM-RT interface

Por$ng Monte Carlo Algorithms to the GPU. Ryan Bergmann UC Berkeley Serpent Users Group Mee$ng 9/20/2012 Madrid, Spain

REAL-TIME ADAPTIVITY IN HEAD-AND-NECK AND LUNG CANCER RADIOTHERAPY IN A GPU ENVIRONMENT

Effects of the difference in tube voltage of the CT scanner on. dose calculation

Release Notes for Dosimetry Check with Convolution-Superposition Collapsed Cone Algorithm (CC)

MCNP4C3-BASED SIMULATION OF A MEDICAL LINEAR ACCELERATOR

EVALUATION OF SPEEDUP OF MONTE CARLO CALCULATIONS OF TWO SIMPLE REACTOR PHYSICS PROBLEMS CODED FOR THE GPU/CUDA ENVIRONMENT

Monte Carlo methods in proton beam radiation therapy. Harald Paganetti

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA

GPU-based high-performance computing for radiotherapy applications

Proton dose calculation algorithms and configuration data

Calculation algorithms in radiation therapy treatment planning systems

Soft GPGPUs for Embedded FPGAS: An Architectural Evaluation

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

Supercomputing the Cascade Processes of Radiation Transport

Artifact Mitigation in High Energy CT via Monte Carlo Simulation

Analysis of Radiation Transport through Multileaf Collimators Using BEAMnrc Code

A dedicated tool for PET scanner simulations using FLUKA

Investigation of tilted dose kernels for portal dose prediction in a-si electronic portal imagers

Geant4 v9.5. Kernel III. Makoto Asai (SLAC) Geant4 Tutorial Course

Detector simulations for in-beam PET with FLUKA. Francesco Pennazio Università di Torino and INFN, TORINO

Comparison of Predictions by MCNP and EGSnrc of Radiation Dose

GPU-based fast gamma index calcuation

Profiling & Tuning Applica1ons. CUDA Course July István Reguly

Op#miza#on of an On- Board Imaging System for Rapid Radiotherapy. Erica Kemmerling, Meng Wu, He Yang, and Rebecca Fahrig

Photon Dose Algorithms and Physics Data Modeling in modern RTP

CUDA and OpenCL Implementations of 3D CT Reconstruction for Biomedical Imaging

Efficient Memory and Bandwidth Management for Industrial Strength Kirchhoff Migra<on

Evaluation of latent variances in Monte Carlo dose calculations with Varian TrueBeam photon phase-spaces used as a particle source

GPU-accelerated ray-tracing for real-time treatment planning

Mesh Human Phantoms with MCNP

Assesing multileaf collimator effect on the build-up region using Monte Carlo method

Introduction to GPU hardware and to CUDA

THE SIMULATION OF THE 4 MV VARIAN LINAC WITH EXPERIMENTAL VALIDATION

ATLAS Tracking Detector Upgrade studies using the Fast Simulation Engine

The OpenGATE Collaboration

Accuracy of treatment planning calculations for conformal radiotherapy van 't Veld, Aart Adeodatus

A Fast GPU-Based Approach to Branchless Distance-Driven Projection and Back-Projection in Cone Beam CT

A CT-based Monte Carlo Dose Calculations for Proton Therapy Using a New Interface Program

Validation of GEANT4 Monte Carlo Simulation Code for 6 MV Varian Linac Photon Beam

An approach to calculate and visualize intraoperative scattered radiation exposure

Measurement of depth-dose of linear accelerator and simulation by use of Geant4 computer code

Suitability Study of MCNP Monte Carlo Program for Use in Medical Physics

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

Code characteristics

Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs

Documentation, examples and user support

Tesla GPU Computing A Revolution in High Performance Computing

General Purpose GPU Computing in Partial Wave Analysis

VALIDATION OF A MONTE CARLO DOSE CALCULATION ALGORITHM FOR CLINICAL ELECTRON BEAMS IN THE PRESENCE OF PHANTOMS WITH COMPLEX HETEROGENEITIES

84 Luo heng-ming et al Vol. 12 Under the condition of one-dimensional slab geometry the transport equation for photons reads ±()N(x; μ; ) du d N(

Validation of Proton Nozzle. Jae-ik Shin (KIRAMS) Sebyeong Lee (NCC)

GATE-RT Applications in Radiation Therapy

From Biological Cells to Populations of Individuals: Complex Systems Simulations with CUDA (S5133)

Geant4-DICOM Interface-based Monte Carlo Simulation to Assess Dose Distributions inside the Human Body during X-Ray Irradiation

Using GPUs to compute the multilevel summation of electrostatic forces

I. INTRODUCTION. Figure 1. Radiation room model at Dongnai General Hospital

Source Model Tuning for a 6 MV Photon Beam used in Radiotherapy

Comparison of measured and Monte Carlo calculated electron beam central axis depth dose in water

Design and performance characteristics of a Cone Beam CT system for Leksell Gamma Knife Icon

Construction of Voxel-type Phantom Based on Computed Tomographic Data of RANDO Phantom for the Monte Carlo Simulations

ARTCOLL PACKAGE FOR GAMMA ART6000 TM ROTATING GAMMA SYSTEM EMISSION SPECTRA CALCULATION

Mathematical computations with GPUs

Development a simple point source model for Elekta SL-25 linear accelerator using MCNP4C Monte Carlo code

Limitations in the PHOTON Monte Carlo gamma transport code

Determination of primary electron beam parameters in a Siemens Primus Linac using Monte Carlo simulation

Monitor Unit (MU) Calculation

Monte Carlo Simulation for Neptun 10 PC Medical Linear Accelerator and Calculations of Electron Beam Parameters

Massively Parallel GPU-friendly Algorithms for PET. Szirmay-Kalos László, Budapest, University of Technology and Economics

UNIVERSITY OF CALGARY. Monte Carlo model of the Brainlab Novalis Classic 6 MV linac using the GATE simulation. platform. Jared K.

Transcription:

Op#miza#on of CUDA- based Monte Carlo Simula#on for Radia#on Therapy GTC 2014 N. Henderson & K. Murakami

The collabora#on Geant4 @ Special thanks to the CUDA Center of Excellence Program Makoto Asai, SLAC Joseph Perl, SLAC Andrea DoO, SLAC Koichi Murakami, KEK Takashi Sasaki, KEK Margot Gerritsen, ICME Nick Henderson, ICME 2

Radiotherapy simula#on methods Analy#c #me: seconds to minutes accurate within 3-5% used in treatment planning Monte Carlo #me: several hours to days of CPU #me accurate within 1-2% used to verify treatment plans in certain cases 3

Project overview GPU based dose calcula#on for radia#on therapy CT data Input Processing GPU/CUDA Dose calc. Output DICOM Visualiza#on gmocren Features GPU code based on Geant4 Voxelized geometry DICOM interface Material is water with variable density Limited Geant4 EM physics: electron/positron/gamma Scoring of dose in each voxel Process secondary par#cles

Geant4 High Energy Physics Space & Radia#on Medical Physics ATLAS LISA gmocren Images from: Geant4 gallery and gmocren

Basic idea Par#cles are transported through material in steps Physics processes act on par#cles and may generate secondaries Par#cles are removed when out of domain or out of energy

A single step

G4CU: CUDA implementa#on Each GPU thread processes a single par#cle un#l the par#cle exits the geometry Each thread has a stack for secondary par#cles Every thread takes a step in each itera#on Algorithm is periodically interrupted for various ac#ons Periodic actions Check termination conditions Generate primary particles Pop secondary particles Check resources Take step 8

G4CU: parallel execu#on kernel launches along-step post-step thread 0 thread 1 thread 2 gamma wait transport Compton scattering wait wait gamma wait transport wait Photoelectric effect wait electron ionization transport wait wait Bremsstrahlung CUDA kernels are applied to all par#cles Threads must wait if physics process does not apply to par#cle Parallel execu#on is achieved: maintenance opera#ons transport process same process on same par#cle

Performance and valida#on Hardware and so`ware: GPU: Tesla K20 Kepler 2496 cores, 706 MHz, 5GB GDDR5 (ECC) CPU Xeon X5680 (3.33GHz) single thread opera#on SDK: CUDA 5.5 Geant4 release 9.6 p2

Phantom geometry 30 cm 30.5 cm 10 cm water {water, lung, bone} water 100 cm 5 cm 20 cm voxel size: 5mm x 5mm x 2mm

Par#cle sources 20 MeV electron / 6 MeV photons (mono- energy) 6 MV & 18 MV spectrum photons

Depth dose distribu#on: water phantom 6MeV photon, 20 MeV electron dose along central axis Sep/25/2013 18th Geant4 Collaboration meeting 18

Comparisons for slab phantoms 6MV photon, Lung/Bone as inner slab material

Visualiza#on with gmocren Image for demonstra#on purposes only! 50M 6MeV photons Pencil beam configura#on

Challenges Geant4 is complex Implement targeted subset Varied character of physics process Bad occupancy No clear op#miza#on target Random simula#on Thread divergence Lookup (interpola#on) tables Thread divergence Non- ideal memory access Energy deposi#on in global array Possible race condi#on Non- ideal memory access

Process Physics Processes Lines of code Ini#aliza#on kernel registers Step length kernel registers Ac#on kernel registers Compton scaiering 116 14 18 49 Photo electric effect 118 14 18 40 Gamma conversion 160 14 18 36 Bremsstrahlung 156 14 18 44 Ioniza#on 394 14 (18,18) (60,36) Scaiering 575 12 28 Positron annihila#on 156 12 26 Electron removal 44 12 8 Transport 491 20 22 (20,22)

Physics Profile (6 MeV gamma) Sum of Time(%) step along-step after-step other init at-rest Grand Total em-helper 18.4 4.2 22.5 ionization 2.5 15.3 2.1 19.9 multiple-scattering 11.6 7.7 19.3 management 3.7 8.7 0.5 12.8 transport 2.4 1.7 3.3 1.4 8.9 bremsstrahlung 5.7 5.7 positron-annihilation 2.1 0.9 0.6 0.7 4.3 compton-scattering 2.6 2.6 gamma-conversion 1.4 1.4 photo-electric-effect 1.3 1.3 electron-removal 1.2 1.2 Grand Total 40.7 24.6 17.4 8.7 6.6 2.0 100.0

Op#miza#on 1: dose storage Old idea: use thrust to join and reduce dose stacks into global array Beier idea: use atomicadd Speedup: ~1.5 (at the #me) index dose dose stack i 5 10 13 3.0 2.1 0.9 join and sort dose stack j 10 5 19 4.3 1.2 7.2 5 5 10 10 13 19 1.2 3.0 2.1 4.3 0.9 7.2 reduce by index 5 10 13 19 4.2 6.4 0.9 7.2

Op#miza#on 2: device configura#on 6 MeV gamma pencil beam 3,000,000 events Table shows run #me / shortest #me Voxels: 61 x 61 x 150 1 = 22 seconds ~ 136 events / ms threads per block blocks 32 64 128 256 512 32 17.98 9.85 5.62 3.21 2.04 64 9.85 5.63 3.2 2 1.48 128 5.62 3.21 1.98 1.39 1.22 256 3.55 2.15 1.36 1.1 1.14 512 2.42 1.46 1.08 1.02 1.15 1024 1.8 1.22 1 1.01 1.61 threads per block blocks 32 64 128 1000 2.2 1.33 1 2000 1.82 1.19 1.03 3000 1.8 1.19 1.04 4000 1.74 1.27 1.15

Other op#miza#ons New hardware! Refactoring New features some#me slow us down Results from: 512 x 512 x 256 geometry 6 MeV gamma pencil beam Date Hardware Feature Time (min) Events / ms March C2070 72 23.1 May K20 60 27.8 June K20 atomicadd 41 40.7 July K20 mul#ple scaiering 42.5 39.2 August K20 world volume 49 34.0 Current K20 Refactoring, device config 23 72.5

Comparison to Geant4 on CPU Geometry:61 x 61 x 150 GPU: Tesla K20 CPU: Xeon X5680 6 core, 12 thread Our measurement was with single threaded G4 New G4 is mul#- threaded We mul'ply by 12 to get CPU events / ms CPU GPU 20 MeV electron (pencil) 6 MeV photon (pencil) Time (h) 29.44 12.42 Events / ms / core 0.94 2.24 Events / ms 11.32 26.85 Time (min) 22.23 9.65 Events / ms 74.99 172.69 speedup (GPU/core) 79.49 77.19 speedup (GPU/CPU) 6.62 6.43

Other op#miza#on experiments 6 MeV gamma pencil beam 1,000,000 events Voxels: 61 x 61 x 150 Experiment Run #me (s) Events / ms standard 9.33 107.2 prefer L1 cache 9.28 107.8 disable L1 cache 9.27 107.9 remove atomicadd 9.31 107.4 limit to 32 registers 9.29 107.6 no thread divergence, no atomic add no thread divergence, no atomic add, limit to 32 registers 3.25 307.7 3.297 303.3 add bounds checking 17.91 55.8

Going forward Hope to get factor 1.5 speed up from spliong par#cles into separate CUDA streams Reduce thread divergence Allow concurrent kernel launches Some possible benefits from using texture memory and hardware interpola#on for lookup tables Look for other opportuni#es in expensive kernels

Acknowledgements NVidia for support of the CUDA Center of Excellence Koichi Murakami Makoto Asai, Joseph Pearl, Andrea DoO @ SLAC Takashi Sasaki @ KEK Akinori Kimura, Ashikaga Ins#tute of Technology 25