GPU-accelerated ray-tracing for real-time treatment planning

Size: px

Start display at page:

Download "GPU-accelerated ray-tracing for real-time treatment planning"

Natalie Robinson
5 years ago
Views:

Journal of Physics: Conference Series OPEN ACCESS GPU-accelerated ray-tracing for real-time treatment planning To cite this article: H Heinrich et al 2014 J. Phys.: Conf. Ser. 489 012050 View the article online for updates and enhancements.

1 Journal of Physics: Conference Series OPEN ACCESS GPU-accelerated ray-tracing for real-time treatment planning To cite this article: H Heinrich et al 2014 J. Phys.: Conf. Ser View the article online for updates and enhancements. Related content - 3D Scientific Visualization with Blender : Introduction B R Kent - GPU-accelerated track reconstruction in the ALICE High Level Trigger David Rohr, Sergey Gorbunov, Volker Lindenstruth et al. - GPU-accelerated 3D phase-field simulations of dendrite competitive growth during directional solidification of binary alloy S Sakane, T Takaki, M Ohno et al. This content was downloaded from IP address on 17/07/2018 at 21:46

2 GPU-accelerated ray-tracing for real-time treatment planning H Heinrich 1, P Ziegenhein 1, C P Kamerling 1, H Froening 2 and U Oelfke 1 1 German Cancer Research Center (DKFZ), Heidelberg Germany 2 Institute of Computer Engineering, University of Heidelberg, Heidelberg Germany h.heinrich@dkfz.de Abstract. Dose calculation methods in radiotherapy treatment planning require the radiological depth information of the voxels that represent the patient volume to correct for tissue inhomogeneities. This information is acquired by time consuming ray-tracingbased calculations. For treatment planning scenarios with changing geometries and real-time constraints this is a severe bottleneck. We implemented an algorithm for the graphics processing unit (GPU) which implements a ray-matrix approach to reduce the number of rays to trace. Furthermore, we investigated the impact of different strategies of accessing memory in kernel implementations as well as strategies for rapid data transfers between main memory and memory of the graphics device. Our study included the overlapping of computations and memory transfers to reduce the overall runtime using Hyper-Q. We tested our approach on a prostate case (9 beams, coplanar). The measured execution times for a complete ray-tracing range from 28 msec for the computations on the GPU to 99 msec when considering data transfers to and from the graphics device. Our GPU-based algorithm performed the ray-tracing in real-time. The strategies efficiently reduce the time consumption of memory accesses and data transfer overhead. The achieved runtimes demonstrate the viability of this approach and allow improved real-time performance for dose calculation methods in clinical routine. 1. Introduction The computation of the radiological depth data is a vital preprocessing step for radiotherapy treatment planning dose calculations. Since the depth data represents the patient geometry, dose algorithms can use it to correct for tissue inhomogeneities along incident rays. The computation of the radiological depth data relies on time consuming ray-tracing operations typically carried out for every voxel of the patient volume. In adaptive radiotherapy (ART) [1][2][3] treatment planning scenarios with changing patient geometries, this is a severe bottleneck. Our group investigates a new IMRT treatment planning paradigm (interactive dose shaping) which requires rapid access to radiological depth information for changing patient geometries [4][5]. GPUs are powerful high core count devices that are no longer solely used for graphics applications, but also for general-purpose computing tasks. Compared to CPUs, their computing cores are weaker but the massive amount of these cores results in a significant performance increase. Also, GPUs offer memory bandwidth that is about one order of magnitude higher compared to single-socket CPUs. However, GPUs only excel in performance if (1) their vast Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI. Published under licence by IOP Publishing Ltd 1

amount of computing cores can be kept busy, and (2) if they can perform the computations incore, i.e. if the input and output data associated with the computation is stored in the on-device memory.

3 amount of computing cores can be kept busy, and (2) if they can perform the computations incore, i.e. if the input and output data associated with the computation is stored in the on-device memory. Otherwise, performance can degrade due to data movements over the bandwidthlimited PCIe interface. In this work we implemented a GPU-based ray-tracing algorithm using CUDA [6]. We investigated the impact of different strategies to increase the memory access efficiency with respect to the runtime. This includes the potentially inefficient memory access patterns of the ray-tracing kernel functions to the on-device graphics memory as well as the overhead resulting from data transfers to and from the host s main memory. In addition we investigated the benefits of using GPU capabilities for concurrent kernel execution and data movement. Results are obtained for a prostate patient data set. 2. Materials & Methods 2.1. The ray-tracing algorithm Our ray-tracing implementation is based on the algorithm proposed by Siddon [7], which was adapted for the GPU by utilizing the stepping approach in [8] and the ray-matrix approach in [9]. A ray-matrix is defined at the iso-center of the treatment field perpendicular to each incident beam. The size of the matrix is defined by the extension of the treatment field plus a scattering margin on each side of a beam. Each matrix element represents a point in 3D space. Rays are defined from the beam source, through each matrix element and tracing is performed until a ray exits the patient. The number of points is determined so that the distance of two neighboring rays at the patient s exit plane is equal to the voxel size. The ray-matrix approach can greatly reduce the number of ray-tracing processes in comparison to tracing every voxel of the patient volume for each beam. In addition, every ray can be traced independently from every other ray, which fits well to the CUDA programming model and allowed us to decomposed the problem so that one thread processes the ray-tracing along one ray. For each beam we execute one kernel which spawns a thread for each point in the corresponding ray-matrix Time-to-solution We use time-to-solution (TTS) as metric to evaluate the performance of our implementation. We define TTS as the accumulated runtimes of the ray-tracing kernel functions on the GPU, the required input data transfer to the graphics device and the transfer of the computed radiological depth data cubes back to the host s main memory as illustrated in Figure 1. Figure 1. Definition of time-to-solution. 2

4 2.3. Acceleration strategies GPUs employ an explicit memory hierarchy that is exposed to the user using NVidia s CUDA framework. Accessing data from e.g. global or texture memory has an impact on the utilization of the computational resource of the graphics hardware. The global memory is the main memory of NVidia graphics devices. It is cached (although not to reduce access latency, but to reduce contention) and reaches best performance for coalesced access patterns, i.e. when threads access consecutive memory addresses. In contrast, texture memory is optimized for accessing grid data. It reaches best performance for access patterns based on spatial locality. Our initial implementation of the ray-tracing kernel, the global memory kernel (GMK), accesses the input data stored in global memory. As acceleration strategy (1) we implemented a second version, the texture memory kernel (TMK), which accesses the input data provided as a 3D texture. This strategy focuses on the data transfer from graphics memory and computational units. The input and output data for the both the GMK and TMK kernel implementations is transferred to and from paged host memory. This can be a potential bottleneck as demand-paging might swap pages out, decreasing access times. To avoid this, the CUDA driver has to copy the data from a paged memory pointer to pinned memory pointer before it can invoke the Host to device (H2D) and device to host (D2H) data transfers using direct memory access (DMA). Allocating the host memory as pinned or page-lock memory permits demand-paging and therefore enables the GPU to access the host s main memory directly using DMA without copy overhead. Strategy (2) focuses on the data transfer between the host s main memory and the device memory. We enhanced our implementation for both, GMK and TMK kernel by using page-lock memory on the host side. The expected benefit is an approximately 2x higher memory bandwidth for the input/output data transfers and thus a reduction of the TTS. Table 1. Overview of strategies to reduce the TTS. Strategy (1) Enhance data transfer between graphics GMK vs. TMK both implemented memory and computational units using paged host memory Strategy (2) Enhance data transfer between host s Paged vs. Page-lock memory main memory and graphics memory implemented for both GMK and TMK Strategy (3) Enhance concurrency of data transfers Hyper-Q implemented and computation for both GMK and TMK Strategy (3) focuses on overlapping computational workloads and data transfers. This can be accomplished using Hyper-Q, a novel feature introduced with CUDA 5.0, available on NVidia s Tesla K20 graphics hardware of the current Kepler architecture [10]. To utilize Hyper-Q, data transfers as well as kernel invocation are required to be implemented for asynchronous execution, i.e. that the control flow returns immediately to the CPU after invocation. Dependencies between asynchronous invocations can be expressed with streams, which are firstin-first-out queues of work packets. Separate streams can be processed concurrently by the graphics hardware if sufficient resources are available. The actual scheduling of the different computational tasks and data transfers is carried out transparently to the user. We enhanced our implementation with asynchronous data transfers and kernel invocations and organized independent tasks in separate streams to exploit Hyper-Q. It is expected that computation 3

5 and memory transfers of multiple beams overlap which can contribute to reduction in TTS. Table 1 provides a quick overview of the different strategies and versions of the ray-tracing implementation Development & test environment The proposed acceleration strategies were implemented in CUDA 5.0 and C/C++ using MS VisualStudio We tested the implementations on a Win7 workstation equipped with an Intel Xeon E5 CPU with 64 GByte RAM and an NVidia Tesla K20c with 5 GByte graphics memory. We used a IMRT treatment plan for a prostate patient with 9 coplanar beams and 256 x 256 x 234 voxels of size (2.62 mm) 3 as test data set. The ray-matrix size for each of the 9 beams includes the beam extension plus an 8 cm scattering margin on each side. The ray-matrix dimensions range 140 x 140 to 186 x 186 points. To test the accuracy of our GPU implementations used the serial CPU implementation of the ray-tracing algorithm used in [9] as reference. 3. Results The results of our GPU implementations were compared to the reference implementation. Both produce equal results within single precision accuracy. The runtimes measured for the different versions of our ray-tracing implementation are depicted in Table 2. The columns 3 to 5 present the runtimes in msec for the Kernel execution on the GPU, the data transfer times H2D and D2H and the TTS. The runtimes are accumulated over 9 kernel invocations, a single data transfer of the input volume and the 9 data transfers of output volumes. For the Hyper-Q versions though, the kernel execution time represents the time from the first kernel start until the last kernel finished. Table 2. The runtimes for the ray-tracing implementations in msec for a 9 beam prostate case. The presented runtimes are accumulated runtimes over 9 beams. Rutimes [msec] Kernel Kernel Data transfer Time to 9 Beams Execution H2D / D2H Solution Paged Memory Page-lock Memory Hyper-Q GMK / TMK / GMK / TMK / GMK / TMK / A snippet of the output of NVidia s Visual Profiler normalized to an identical time line is shown in Figure 2. It allows an intuitive visual access to the work flow on the GPU and the impact of the implemented acceleration strategies. The golden boxes represent the time consumption of data transfers between host and graphics hardware. The aqua blue boxes show the runtime of the GMK and the purple boxes represent the TMK. The first golden box depicts the single input data transfer to the graphics device and the remaining ones account for the transfers of the output cubes to the host s main memory, one cube for each beam respectively. 4

6 Figure 2. NVidia Visual Profiler output for a 9 beam prostate case. The colored boxes represent runtimes for the different version of the ray-tracing implementations: aqua blue GMK, purple TMK, the respective first golden box the input data transfers and the remaining golden boxes the output data transfers to the host s main memory. 4. Discussion We investigated three acceleration strategies for enhancing ray-tracing performance for the radiological depth computation for real-time treatment planning. The application of the strategies can enhance the efficiency of the IMRT treatment planning process in particular when changing patient geometries are considered. We defined the TTS metric to assess the benefit of offloading the ray-tracing to the graphic device for our in-house TPS framework [4][5]. However, many other applications require the results back in the host s memory and can benefit from the proposed strategies. Our first strategy (1) focuses on the impact of using texture memory as opposed to global memory for the ray-tracing kernel implementations. The TMK showed a better performance, since the ray-tracing memory access patterns are based rather on spatial locality than temporal locality. This difference is especially prominent for beam angles which require data access patterns not mapping well to caching strategies based on locality of reference. The profiler output shows reduced sizes of the purple boxes in comparison to the aqua blue boxes and thus indicates the TMK performance to be rather independent from the beam angle. For strategy (2), we enhanced the implementations by using page-lock host memory with both kernels. The performance gain is shown in the profiler output by reduced sizes of the golden boxes. The memory bandwidth observed was stable above 6 Gbyte/sec across the PCI-e interface which amounts to approximately a factor of 2 versus the implementations using paged host memory. 5

7 The final strategy (3) leads to the largest reduction in TTS. It includes overlapping kernel executions and output data transfers back to the host using Hyper-Q. To find that both concurrent implementations for GMK and TMK achieved the same TTS was not expected. It indicates that the rather inefficient memory accesses using global memory implemented in the GMK were efficiently hidden using Hyper-Q. Combining the strategies, we were able to compute the ray-tracing related tasks in 28 msec for 9 treatment beams in clinical resolution using graphics hardware of NVidia s current Kepler architecture. We exploited its capability of concurrent kernel execution and data movement. We obtained runtimes of 99 msec which corresponds to a speed-up of 3 in TTS in comparison to the initial implementation of GMK and TMK using paged host memory. Future work will include the integration of multi GPU support to study the scaling behavior of our implementation. Furthermore, we will investigate the application of the GPU-based ray-tracing in adaptive treatment planning scenarios for rapid response to changing patient geometries. 5. Conclusion We have shown that the computation of radiological depth data for each beam and each voxel of a patient volume can be carried out in real-time using graphics hardware. The memory bandwidth between the host and the device is the limiting factor for ray-tracing on the GPU. The presented strategies minimize the time consumption of memory accesses and data movements using novel features of NVidia s Kepler GPU architecture. We found an optimal time-to-solution by combining our texture memory kernel with Hyper-Q. References [1] Yan D, Vicini F, Wong J and Martinez A 1997 Adaptive radiation therapy Phys. Med. Biol [2] Wu C, Jeraj R, Olivera G H and Mackie T R 2002 Re-optimization in adaptive radiotherapy Phys. Med. Biol [3] de la Zerda A, Armbruster B and Xing L 2007 Formulating adaptive radiation therapy (ART) treatment planning into a closed-loop control framework Phys. Med. Biol [4] Ziegenhein P, Kamerling C P, Oelfke U 2013 Interactive Dose Shaping Efficient Strategies for CPU-based Real-Time Treatment Planning J. Phys.: Conf. Ser. presented at ICCR 2013, not yet published [5] Kamerling C P, Ziegenhein P, Heinrich H, Oelfke U 2013 A 3D isodose manipulation tool for interactive dose shaping J. Phys.: Conf. Ser. presented at ICCR 2013, not yet published [6] NVIDIA CUDA Compute Unified Device Architecture Programming Guide 5.0 edition 2012 NVIDIA Corp., Santa Clara, CA [7] Siddon R L 1985 Fast calculation of the exact radiological path for a three-dimensional CT array. Med. Phys [8] de Greef M, Crezee J, van Eijk J C, Pool R, Bel A 2009 Accelerated ray tracing for radiotherapy dose calculations on a GPU Med. Phys. 36(9) [9] Siggel M, Ziegenhein P, Nill S, Oelfke U 2012 Boosting runtime-performance of photon pencil beam algorithms for radiotherapy treatment planning Physica Medica 28(4) [10] Whitepaper 2012 NVIDIA s Next Generation CUDA Compute Architecture: Kepler GK110 NVIDIA Corp., Santa Clara, CA 6

GPU applications in Cancer Radiation Therapy at UCSD. Steve Jiang, UCSD Radiation Oncology Amit Majumdar, SDSC Dongju (DJ) Choi, SDSC

GPU applications in Cancer Radiation Therapy at UCSD Steve Jiang, UCSD Radiation Oncology Amit Majumdar, SDSC Dongju (DJ) Choi, SDSC Conventional Radiotherapy SIMULATION: Construciton, Dij Days PLANNING: