GPU-accelerated ray-tracing for real-time treatment planning
Journal of Physics: Conference Series
OPEN ACCESS
To cite this article: H Heinrich et al 2014 J. Phys.: Conf. Ser
H Heinrich 1, P Ziegenhein 1, C P Kamerling 1, H Froening 2 and U Oelfke 1
1 German Cancer Research Center (DKFZ), Heidelberg, Germany
2 Institute of Computer Engineering, University of Heidelberg, Heidelberg, Germany
h.heinrich@dkfz.de

Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI. Published under licence by IOP Publishing Ltd

Abstract. Dose calculation methods in radiotherapy treatment planning require the radiological depth information of the voxels that represent the patient volume to correct for tissue inhomogeneities. This information is acquired by time-consuming ray-tracing-based calculations. For treatment planning scenarios with changing geometries and real-time constraints this is a severe bottleneck. We implemented a ray-tracing algorithm for the graphics processing unit (GPU) that uses a ray-matrix approach to reduce the number of rays to trace. Furthermore, we investigated the impact of different strategies of accessing memory in kernel implementations as well as strategies for rapid data transfers between main memory and the memory of the graphics device. Our study included overlapping computations and memory transfers using Hyper-Q to reduce the overall runtime. We tested our approach on a prostate case (9 beams, coplanar). The measured execution times for a complete ray-tracing range from 28 msec for the computations on the GPU to 99 msec when data transfers to and from the graphics device are included. Our GPU-based algorithm performed the ray-tracing in real-time. The strategies efficiently reduce the time consumption of memory accesses and the data transfer overhead. The achieved runtimes demonstrate the viability of this approach and allow improved real-time performance for dose calculation methods in clinical routine.

1. Introduction
The computation of the radiological depth data is a vital preprocessing step for radiotherapy treatment planning dose calculations. Since the depth data represents the patient geometry, dose algorithms can use it to correct for tissue inhomogeneities along incident rays. The computation of the radiological depth data relies on time-consuming ray-tracing operations that are typically carried out for every voxel of the patient volume. In adaptive radiotherapy (ART) [1][2][3] treatment planning scenarios with changing patient geometries, this is a severe bottleneck. Our group investigates a new IMRT treatment planning paradigm (interactive dose shaping) which requires rapid access to radiological depth information for changing patient geometries [4][5].

GPUs are powerful high-core-count devices that are no longer used solely for graphics applications, but also for general-purpose computing tasks. Compared to CPUs, their individual computing cores are weaker, but the large number of these cores results in a significant performance increase. GPUs also offer memory bandwidth that is about one order of magnitude higher than that of single-socket CPUs. However, GPUs only excel in performance if (1) their vast number of computing cores can be kept busy, and (2) they can perform the computations in-core, i.e. the input and output data associated with the computation is stored in the on-device memory. Otherwise, performance can degrade due to data movements over the bandwidth-limited PCIe interface.

In this work we implemented a GPU-based ray-tracing algorithm using CUDA [6]. We investigated how different strategies for increasing memory access efficiency affect the runtime. This covers the potentially inefficient access patterns of the ray-tracing kernel functions to the on-device graphics memory as well as the overhead resulting from data transfers to and from the host's main memory. In addition, we investigated the benefits of using GPU capabilities for concurrent kernel execution and data movement. Results are obtained for a prostate patient data set.

2. Materials & Methods
2.1. The ray-tracing algorithm
Our ray-tracing implementation is based on the algorithm proposed by Siddon [7], which was adapted for the GPU by utilizing the stepping approach in [8] and the ray-matrix approach in [9]. A ray-matrix is defined at the iso-center of the treatment field, perpendicular to each incident beam. The size of the matrix is defined by the extension of the treatment field plus a scattering margin on each side of a beam. Each matrix element represents a point in 3D space. Rays are defined from the beam source through each matrix element, and tracing is performed until a ray exits the patient. The number of points is chosen so that the distance between two neighboring rays at the patient's exit plane is equal to the voxel size. The ray-matrix approach can greatly reduce the number of ray-tracing processes in comparison to tracing every voxel of the patient volume for each beam.
In addition, every ray can be traced independently of every other ray, which fits the CUDA programming model well and allowed us to decompose the problem so that one thread traces one ray. For each beam we execute one kernel, which spawns a thread for each point in the corresponding ray-matrix.

2.2. Time-to-solution
We use time-to-solution (TTS) as the metric to evaluate the performance of our implementation. We define TTS as the accumulated runtime of the ray-tracing kernel functions on the GPU, the required input data transfer to the graphics device, and the transfer of the computed radiological depth data cubes back to the host's main memory, as illustrated in Figure 1.

Figure 1. Definition of time-to-solution.
2.3. Acceleration strategies
GPUs employ an explicit memory hierarchy that NVIDIA's CUDA framework exposes to the user. Whether data is accessed from, e.g., global or texture memory affects how well the computational resources of the graphics hardware are utilized. Global memory is the main memory of NVIDIA graphics devices. It is cached (although not to reduce access latency, but to reduce contention) and reaches best performance for coalesced access patterns, i.e. when threads access consecutive memory addresses. In contrast, texture memory is optimized for accessing grid data. It reaches best performance for access patterns based on spatial locality.

Our initial implementation of the ray-tracing kernel, the global memory kernel (GMK), accesses the input data stored in global memory. As acceleration strategy (1) we implemented a second version, the texture memory kernel (TMK), which accesses the input data provided as a 3D texture. This strategy focuses on the data transfer between graphics memory and computational units.

The input and output data for both the GMK and TMK kernel implementations are transferred to and from paged host memory. This is a potential bottleneck, since demand paging might swap pages out, increasing access times. For this reason, the CUDA driver has to copy the data from the paged buffer into a pinned buffer before it can invoke the host-to-device (H2D) and device-to-host (D2H) data transfers using direct memory access (DMA). Allocating the host memory as pinned or page-locked memory prevents demand paging and therefore enables the GPU to access the host's main memory directly using DMA without copy overhead. Strategy (2) focuses on the data transfer between the host's main memory and the device memory. We enhanced our implementation for both the GMK and the TMK by using page-locked memory on the host side.
The expected benefit is an approximately 2x higher memory bandwidth for the input/output data transfers and thus a reduction of the TTS.

Table 1. Overview of strategies to reduce the TTS.
Strategy (1)   Enhance data transfer between graphics     GMK vs. TMK, both implemented
               memory and computational units             using paged host memory
Strategy (2)   Enhance data transfer between host's       Paged vs. page-locked memory,
               main memory and graphics memory            implemented for both GMK and TMK
Strategy (3)   Enhance concurrency of data transfers      Hyper-Q, implemented
               and computation                            for both GMK and TMK

Strategy (3) focuses on overlapping computational workloads and data transfers. This can be accomplished using Hyper-Q, a novel feature introduced with CUDA 5.0 and available on NVIDIA's Tesla K20 graphics hardware of the current Kepler architecture [10]. To utilize Hyper-Q, data transfers as well as kernel invocations have to be implemented for asynchronous execution, i.e. the control flow returns to the CPU immediately after invocation. Dependencies between asynchronous invocations can be expressed with streams, which are first-in-first-out queues of work packets. Separate streams can be processed concurrently by the graphics hardware if sufficient resources are available. The actual scheduling of the different computational tasks and data transfers is carried out transparently to the user. We enhanced our implementation with asynchronous data transfers and kernel invocations and organized independent tasks in separate streams to exploit Hyper-Q. We expect the computation and memory transfers of multiple beams to overlap, which can contribute to a reduction in TTS. Table 1 provides a quick overview of the different strategies and versions of the ray-tracing implementation.

2.4. Development & test environment
The proposed acceleration strategies were implemented in CUDA 5.0 and C/C++ using MS Visual Studio. We tested the implementations on a Win7 workstation equipped with an Intel Xeon E5 CPU with 64 GByte RAM and an NVIDIA Tesla K20c with 5 GByte graphics memory. As test data set we used an IMRT treatment plan for a prostate patient with 9 coplanar beams and 256 x 256 x 234 voxels of size (2.62 mm)^3. The ray-matrix size for each of the 9 beams includes the beam extension plus an 8 cm scattering margin on each side. The ray-matrix dimensions range from 140 x 140 to 186 x 186 points. To test the accuracy of our GPU implementations, we used the serial CPU implementation of the ray-tracing algorithm from [9] as reference.

3. Results
The results of our GPU implementations were compared to the reference implementation. Both produce equal results within single-precision accuracy. The runtimes measured for the different versions of our ray-tracing implementation are listed in Table 2. Columns 3 to 5 present the runtimes in msec for the kernel execution on the GPU, the H2D and D2H data transfer times, and the TTS. The runtimes are accumulated over 9 kernel invocations, a single data transfer of the input volume and the 9 data transfers of the output volumes. For the Hyper-Q versions, however, the kernel execution time represents the time from the start of the first kernel until the last kernel has finished.

Table 2. The runtimes for the ray-tracing implementations in msec for a 9 beam prostate case. The presented runtimes are accumulated runtimes over 9 beams.
Runtimes [msec]       Kernel   Kernel       Data transfer   Time to
9 beams                        execution    H2D / D2H       solution
Paged memory          GMK                   /
                      TMK                   /
Page-locked memory    GMK                   /
                      TMK                   /
Hyper-Q               GMK                   /
                      TMK                   /

A snippet of the output of NVIDIA's Visual Profiler, normalized to an identical time line, is shown in Figure 2. It gives an intuitive visual impression of the work flow on the GPU and of the impact of the implemented acceleration strategies. The golden boxes represent the time consumption of data transfers between host and graphics hardware. The aqua blue boxes show the runtime of the GMK and the purple boxes represent the TMK. The first golden box depicts the single input data transfer to the graphics device, and the remaining ones account for the transfers of the output cubes to the host's main memory, one cube per beam.
Figure 2. NVIDIA Visual Profiler output for a 9 beam prostate case. The colored boxes represent runtimes for the different versions of the ray-tracing implementation: aqua blue for the GMK, purple for the TMK, the respective first golden box for the input data transfer and the remaining golden boxes for the output data transfers to the host's main memory.

4. Discussion
We investigated three acceleration strategies for enhancing ray-tracing performance in the radiological depth computation for real-time treatment planning. Applying these strategies can enhance the efficiency of the IMRT treatment planning process, in particular when changing patient geometries are considered. We defined the TTS metric to assess the benefit of offloading the ray-tracing to the graphics device for our in-house TPS framework [4][5]. However, many other applications also require the results back in the host's memory and can benefit from the proposed strategies.

Strategy (1) focuses on the impact of using texture memory as opposed to global memory in the ray-tracing kernel implementations. The TMK performed better, since the memory access patterns of the ray-tracing are based on spatial locality rather than temporal locality. The difference is especially prominent for beam angles that require data access patterns which do not map well to caching strategies based on locality of reference. The profiler output shows smaller purple boxes than aqua blue boxes and indicates that the TMK performance is largely independent of the beam angle.

For strategy (2), we enhanced the implementations by using page-locked host memory with both kernels. The performance gain is visible in the profiler output as reduced sizes of the golden boxes. The observed memory bandwidth across the PCIe interface was stable above 6 GByte/sec, which is approximately a factor of 2 higher than for the implementations using paged host memory.
The final strategy (3) leads to the largest reduction in TTS. It overlaps kernel executions and output data transfers back to the host using Hyper-Q. We did not expect both concurrent implementations, GMK and TMK, to achieve the same TTS. This indicates that the rather inefficient global memory accesses of the GMK were effectively hidden by Hyper-Q.

Combining the strategies, we were able to compute the ray-tracing-related tasks in 28 msec for 9 treatment beams in clinical resolution using graphics hardware of NVIDIA's current Kepler architecture. We exploited its capability of concurrent kernel execution and data movement. We obtained a TTS of 99 msec, which corresponds to a speed-up of 3 in TTS in comparison to the initial implementations of GMK and TMK using paged host memory.

Future work will include the integration of multi-GPU support to study the scaling behavior of our implementation. Furthermore, we will investigate the application of the GPU-based ray-tracing in adaptive treatment planning scenarios for rapid response to changing patient geometries.

5. Conclusion
We have shown that the computation of radiological depth data for each beam and each voxel of a patient volume can be carried out in real-time using graphics hardware. The memory bandwidth between the host and the device is the limiting factor for ray-tracing on the GPU. The presented strategies minimize the time consumption of memory accesses and data movements using novel features of NVIDIA's Kepler GPU architecture. We found an optimal time-to-solution by combining our texture memory kernel with Hyper-Q.

References
[1] Yan D, Vicini F, Wong J and Martinez A 1997 Adaptive radiation therapy Phys. Med. Biol
[2] Wu C, Jeraj R, Olivera G H and Mackie T R 2002 Re-optimization in adaptive radiotherapy Phys. Med. Biol
[3] de la Zerda A, Armbruster B and Xing L 2007 Formulating adaptive radiation therapy (ART) treatment planning into a closed-loop control framework Phys. Med. Biol
[4] Ziegenhein P, Kamerling C P, Oelfke U 2013 Interactive Dose Shaping Efficient Strategies for CPU-based Real-Time Treatment Planning J. Phys.: Conf. Ser. presented at ICCR 2013, not yet published
[5] Kamerling C P, Ziegenhein P, Heinrich H, Oelfke U 2013 A 3D isodose manipulation tool for interactive dose shaping J. Phys.: Conf. Ser. presented at ICCR 2013, not yet published
[6] NVIDIA CUDA Compute Unified Device Architecture Programming Guide 5.0 edition 2012 NVIDIA Corp., Santa Clara, CA
[7] Siddon R L 1985 Fast calculation of the exact radiological path for a three-dimensional CT array Med. Phys
[8] de Greef M, Crezee J, van Eijk J C, Pool R, Bel A 2009 Accelerated ray tracing for radiotherapy dose calculations on a GPU Med. Phys. 36(9)
[9] Siggel M, Ziegenhein P, Nill S, Oelfke U 2012 Boosting runtime-performance of photon pencil beam algorithms for radiotherapy treatment planning Physica Medica 28(4)
[10] Whitepaper 2012 NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110 NVIDIA Corp., Santa Clara, CA
More informationShell: Accelerating Ray Tracing on GPU
Shell: Accelerating Ray Tracing on GPU Kai Xiao 1, Bo Zhou 2, X.Sharon Hu 1, and Danny Z. Chen 1 1 Department of Computer Science and Engineering, University of Notre Dame 2 Department of Radiation Oncology,
More informationimplementation using GPU architecture is implemented only from the viewpoint of frame level parallel encoding [6]. However, it is obvious that the mot
Parallel Implementation Algorithm of Motion Estimation for GPU Applications by Tian Song 1,2*, Masashi Koshino 2, Yuya Matsunohana 2 and Takashi Shimamoto 1,2 Abstract The video coding standard H.264/AVC
More informationAdvanced CUDA Optimization 1. Introduction
Advanced CUDA Optimization 1. Introduction Thomas Bradley Agenda CUDA Review Review of CUDA Architecture Programming & Memory Models Programming Environment Execution Performance Optimization Guidelines
More informationGPU-Based Acceleration for CT Image Reconstruction
GPU-Based Acceleration for CT Image Reconstruction Xiaodong Yu Advisor: Wu-chun Feng Collaborators: Guohua Cao, Hao Gong Outline Introduction and Motivation Background Knowledge Challenges and Proposed
More information3D Registration based on Normalized Mutual Information
3D Registration based on Normalized Mutual Information Performance of CPU vs. GPU Implementation Florian Jung, Stefan Wesarg Interactive Graphics Systems Group (GRIS), TU Darmstadt, Germany stefan.wesarg@gris.tu-darmstadt.de
More informationMonte Carlo methods in proton beam radiation therapy. Harald Paganetti
Monte Carlo methods in proton beam radiation therapy Harald Paganetti Introduction: Proton Physics Electromagnetic energy loss of protons Distal distribution Dose [%] 120 100 80 60 40 p e p Ionization
More informationHybrid Implementation of 3D Kirchhoff Migration
Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation
More informationPhoton beam dose distributions in 2D
Photon beam dose distributions in 2D Sastry Vedam PhD DABR Introduction to Medical Physics III: Therapy Spring 2014 Acknowledgments! Narayan Sahoo PhD! Richard G Lane (Late) PhD 1 Overview! Evaluation
More informationThe Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System
The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Alan Humphrey, Qingyu Meng, Martin Berzins Scientific Computing and Imaging Institute & University of Utah I. Uintah Overview
More informationCUDA OPTIMIZATIONS ISC 2011 Tutorial
CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationCUDA Performance Optimization. Patrick Legresley
CUDA Performance Optimization Patrick Legresley Optimizations Kernel optimizations Maximizing global memory throughput Efficient use of shared memory Minimizing divergent warps Intrinsic instructions Optimizations
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationSignificance of time-dependent geometries for Monte Carlo simulations in radiation therapy. Harald Paganetti
Significance of time-dependent geometries for Monte Carlo simulations in radiation therapy Harald Paganetti Modeling time dependent geometrical setups Key to 4D Monte Carlo: Geometry changes during the
More informationCSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand
More informationADVANCING CANCER TREATMENT
3 ADVANCING CANCER TREATMENT SUPPORTING CLINICS WORLDWIDE RaySearch is advancing cancer treatment through pioneering software. We believe software has un limited potential, and that it is now the driving
More information15 Dose Calculation Algorithms
Dose Calculation Algorithms 187 15 Dose Calculation Algorithms Uwe Oelfke and Christian Scholz CONTENTS 15.1 Introduction 187 15.2 Model-Based Algorithms 188 15.3 Modeling of the Primary Photon Fluence
More informationCh. 4 Physical Principles of CT
Ch. 4 Physical Principles of CT CLRS 408: Intro to CT Department of Radiation Sciences Review: Why CT? Solution for radiography/tomography limitations Superimposition of structures Distinguishing between
More informationarxiv: v1 [physics.ins-det] 11 Jul 2015
GPGPU for track finding in High Energy Physics arxiv:7.374v [physics.ins-det] Jul 5 L Rinaldi, M Belgiovine, R Di Sipio, A Gabrielli, M Negrini, F Semeria, A Sidoti, S A Tupputi 3, M Villa Bologna University
More informationWhen MPPDB Meets GPU:
When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU
More informationAccelerated Library Framework for Hybrid-x86
Software Development Kit for Multicore Acceleration Version 3.0 Accelerated Library Framework for Hybrid-x86 Programmer s Guide and API Reference Version 1.0 DRAFT SC33-8406-00 Software Development Kit
More informationDuksu Kim. Professional Experience Senior researcher, KISTI High performance visualization
Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior
More informationUsing GPUs to compute the multilevel summation of electrostatic forces
Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of
More informationMonte Carlo Simulation for Neptun 10 PC Medical Linear Accelerator and Calculations of Electron Beam Parameters
Monte Carlo Simulation for Neptun 1 PC Medical Linear Accelerator and Calculations of Electron Beam Parameters M.T. Bahreyni Toossi a, M. Momen Nezhad b, S.M. Hashemi a a Medical Physics Research Center,
More informationGPU Performance Optimisation. Alan Gray EPCC The University of Edinburgh
GPU Performance Optimisation EPCC The University of Edinburgh Hardware NVIDIA accelerated system: Memory Memory GPU vs CPU: Theoretical Peak capabilities NVIDIA Fermi AMD Magny-Cours (6172) Cores 448 (1.15GHz)
More informationBasic Radiation Oncology Physics
Basic Radiation Oncology Physics T. Ganesh, Ph.D., DABR Chief Medical Physicist Fortis Memorial Research Institute Gurgaon Acknowledgment: I gratefully acknowledge the IAEA resources of teaching slides
More informationOpenACC Course. Office Hour #2 Q&A
OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle
More informationAn Evaluation of Unified Memory Technology on NVIDIA GPUs
An Evaluation of Unified Memory Technology on NVIDIA GPUs Wenqiang Li 1, Guanghao Jin 2, Xuewen Cui 1, Simon See 1,3 Center for High Performance Computing, Shanghai Jiao Tong University, China 1 Tokyo
More informationMotion artifact detection in four-dimensional computed tomography images
Motion artifact detection in four-dimensional computed tomography images G Bouilhol 1,, M Ayadi, R Pinho, S Rit 1, and D Sarrut 1, 1 University of Lyon, CREATIS; CNRS UMR 5; Inserm U144; INSA-Lyon; University
More informationBuilding NVLink for Developers
Building NVLink for Developers Unleashing programmatic, architectural and performance capabilities for accelerated computing Why NVLink TM? Simpler, Better and Faster Simplified Programming No specialized
More informationGPUfs: Integrating a file system with GPUs
GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Building systems with GPUs is hard. Why? 2 Goal of
More informationParallel Approach for Implementing Data Mining Algorithms
TITLE OF THE THESIS Parallel Approach for Implementing Data Mining Algorithms A RESEARCH PROPOSAL SUBMITTED TO THE SHRI RAMDEOBABA COLLEGE OF ENGINEERING AND MANAGEMENT, FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
More informationREDUCING BEAMFORMING CALCULATION TIME WITH GPU ACCELERATED ALGORITHMS
BeBeC-2014-08 REDUCING BEAMFORMING CALCULATION TIME WITH GPU ACCELERATED ALGORITHMS Steffen Schmidt GFaI ev Volmerstraße 3, 12489, Berlin, Germany ABSTRACT Beamforming algorithms make high demands on the
More informationAdvanced CUDA Optimizations. Umar Arshad ArrayFire
Advanced CUDA Optimizations Umar Arshad (@arshad_umar) ArrayFire (@arrayfire) ArrayFire World s leading GPU experts In the industry since 2007 NVIDIA Partner Deep experience working with thousands of customers
More informationMichael Speiser, Ph.D.
IMPROVED CT-BASED VOXEL PHANTOM GENERATION FOR MCNP MONTE CARLO Michael Speiser, Ph.D. Department of Radiation Oncology UT Southwestern Medical Center Dallas, TX September 1 st, 2012 CMPWG Workshop Medical
More informationSimulation of Mammograms & Tomosynthesis imaging with Cone Beam Breast CT images
Simulation of Mammograms & Tomosynthesis imaging with Cone Beam Breast CT images Tao Han, Chris C. Shaw, Lingyun Chen, Chao-jen Lai, Xinming Liu, Tianpeng Wang Digital Imaging Research Laboratory (DIRL),
More informationThe Case for Heterogeneous HTAP
The Case for Heterogeneous HTAP Raja Appuswamy, Manos Karpathiotakis, Danica Porobic, and Anastasia Ailamaki Data-Intensive Applications and Systems Lab EPFL 1 HTAP the contract with the hardware Hybrid
More informationDeep Scatter Estimation (DSE): Feasibility of using a Deep Convolutional Neural Network for Real-Time X-Ray Scatter Prediction in Cone-Beam CT
Deep Scatter Estimation (DSE): Feasibility of using a Deep Convolutional Neural Network for Real-Time X-Ray Scatter Prediction in Cone-Beam CT Joscha Maier 1,2, Yannick Berker 1, Stefan Sawall 1,2 and
More informationJohn W. Romein. Netherlands Institute for Radio Astronomy (ASTRON) Dwingeloo, the Netherlands
Signal Processing on GPUs for Radio Telescopes John W. Romein Netherlands Institute for Radio Astronomy (ASTRON) Dwingeloo, the Netherlands 1 Overview radio telescopes six radio telescope algorithms on
More informationGPU Fundamentals Jeff Larkin November 14, 2016
GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate
More informationImproving DPDK Performance
Improving DPDK Performance Data Plane Development Kit (DPDK) was pioneered by Intel as a way to boost the speed of packet API with standard hardware. DPDK-enabled applications typically show four or more
More informationRT 3D FDTD Simulation of LF and MF Room Acoustics
RT 3D FDTD Simulation of LF and MF Room Acoustics ANDREA EMANUELE GRECO Id. 749612 andreaemanuele.greco@mail.polimi.it ADVANCED COMPUTER ARCHITECTURES (A.A. 2010/11) Prof.Ing. Cristina Silvano Dr.Ing.
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationBlackBerry AtHoc Networked Crisis Communication Capacity Planning Guidelines. AtHoc SMS Codes
BlackBerry AtHoc Networked Crisis Communication Capacity Planning Guidelines AtHoc SMS Codes Version Version 7.5, May 1.0, November 2018 2016 1 Copyright 2010 2018 BlackBerry Limited. All Rights Reserved.
More informationIterative regularization in intensity-modulated radiation therapy optimization. Carlsson, F. and Forsgren, A. Med. Phys. 33 (1), January 2006.
Iterative regularization in intensity-modulated radiation therapy optimization Carlsson, F. and Forsgren, A. Med. Phys. 33 (1), January 2006. 2 / 15 Plan 1 2 3 4 3 / 15 to paper The purpose of the paper
More informationOpenCL Implementation Of A Heterogeneous Computing System For Real-time Rendering And Dynamic Updating Of Dense 3-d Volumetric Data
OpenCL Implementation Of A Heterogeneous Computing System For Real-time Rendering And Dynamic Updating Of Dense 3-d Volumetric Data Andrew Miller Computer Vision Group Research Developer 3-D TERRAIN RECONSTRUCTION
More informationXIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture
XIV International PhD Workshop OWD 2012, 20 23 October 2012 Optimal structure of face detection algorithm using GPU architecture Dmitry Pertsau, Belarusian State University of Informatics and Radioelectronics
More information