Hyperfast parallel-beam and cone-beam backprojection using the cell general purpose hardware


Marc Kachelrieß(a) and Michael Knaup
Institute of Medical Physics, University of Erlangen-Nürnberg, Germany

Olivier Bockenbach
Mercury Computer Systems, Berlin, Germany

(Received 23 May 2006; revised 21 December 2006; accepted for publication 22 January 2007; published 27 March 2007)

Tomographic image reconstruction, such as the reconstruction of computed tomography projection values, of tomosynthesis data, of positron emission tomography or SPECT events, and of magnetic resonance imaging data, is computationally very demanding. One of the most time-consuming steps is the backprojection. Recently, a novel general purpose architecture optimized for distributed computing became available: the cell broadband engine (CBE). To maximize image reconstruction speed we modified our parallel-beam backprojection algorithm (two dimensional, 2D) and our perspective backprojection algorithm (three dimensional, 3D, cone beam for flat panel detectors) and optimized the code for the CBE. The algorithms are pixel or voxel driven, run with floating point accuracy and use linear (LI) or nearest neighbor (NN) interpolation between detector elements. For the parallel-beam case, 512 projections per half rotation, 1024 detector channels, and an image of size 512 × 512 were used. The cone-beam backprojection performance was assessed by backprojecting a full circle scan of 512 projections into a volume of 512³ voxels. The field of view was chosen to lie completely within the field of measurement, and the pixel or voxel size was set to the detector element size projected to the center of rotation divided by √2. Both the PC and the CBE were clocked at 3 GHz. For the parallel backprojection of 512 projections into a 512 × 512 image, a throughput of 11 fps (LI) and 15 fps (NN) was measured on the PC, whereas the CBE achieved 126 fps (LI) and 165 fps (NN), respectively. The cone-beam backprojection of 512 projections into the 512³ volume took 3.2 min on the PC and only 13.6 s on the cell. Thereby, the cell greatly outperforms today's top-notch backprojections based on graphical processing units. Using both CBEs of our dual cell-based blade (Mercury Computer Systems) allows one to backproject 330 images/s in 2D and to complete the 3D cone-beam backprojection in 6.8 s.

American Association of Physicists in Medicine. DOI: /

I. INTRODUCTION

Cell processors are general purpose processors that combine a Power PC element (PPE) with eight synergistic processor elements (SPEs) (Refs. 1-3). The SPEs are the most interesting feature of the cell broadband engine (CBE), as they are the source of its processing power. A single chip contains eight SPEs, each with a synergistic processing unit (SPU), a memory flow controller (MFC), and 256 kB of static random access memory that are used as local store (LS) memory. The LS runs in its own address space at the full 3 GHz clock frequency. An SPU uses 128 bit vector operations, can execute up to eight floating point operations per clock cycle, and provides 128 registers. For our particular focus on backprojecting floating point values (32 bit each), a data vector consists of four floats. A fast element interconnect bus (EIB) with 96 bytes per clock connects the cell processor's PPE with the SPEs (Fig. 1). Up to two instructions per cycle can be issued by each SPU to its seven execution units, which are organized in two pipelines. To overcome memory latency (the "memory wall"), direct memory access (DMA) data transfers from and to the SPU can be scheduled in parallel with core execution.

The PPE can be understood as the controller or manager that distributes small tasks to the eight SPEs, which are the workers. In our case, communication between the manager and the workers is realized via mailboxes and DMA transfers.
The fact that the CBE is freely programmable and not just a special purpose processor makes it especially attractive for high-end applications such as medical imaging. The CBE can be used for all processing steps, ranging from acquisition and image reconstruction to volumetric image display. Other time-consuming algorithms, such as dose calculation or scatter prediction, that require either deterministic or Monte Carlo calculations are also potential candidates to be adapted to run on the CBE.

The bottleneck of tomographic image reconstruction is the backprojection of the raw data into the final image or volume (Ref. 4). The aim of this investigation is to implement a two-dimensional (2D) parallel-beam backprojection algorithm and a three-dimensional (3D) cone-beam perspective backprojection algorithm for the cell processor and to benchmark their performance against our PC-based implementations. The paper does not propose novel image reconstruction techniques, and the novice reader is referred to the basic literature for details about image reconstruction. The dominating applications of backprojection algorithms are the 2D parallel-beam filtered backprojection algorithm (Ref. 9) and the 3D cone-beam Feldkamp image reconstruction algorithm (Ref. 10).

We want to emphasize that, although the computed tomography (CT) world has evolved towards cone-beam, the parallel backprojection algorithm is still of highest relevance. For example, many spiral cone-beam image reconstruction algorithms are based on rebinning the data onto tilted planes followed by parallel filtered backprojection (advanced single-slice rebinning type). Even if the primary reconstruction uses a true cone-beam algorithm, one may opt for subsequent iterative corrections, such as beam hardening correction or metal artifact correction, where one is rather free to choose the forward and backprojection geometry and where one would prefer the parallel beam geometry for performance reasons. It should be noted that our results may also apply, directly or in slightly modified ways, to other imaging situations such as transmission computed tomography or magnetic resonance tomography, where slightly modified backprojections are used or where forward projection is an issue. In both cases a parallel beam geometry, and hence the parallel beam backprojection, would apply. Our results are also applicable to iterative image reconstruction in general, since the algorithmic structure of the forward projection steps is highly related to the backprojection functions (Ref. 16).

The paper is organized as follows. Section II introduces the parallel backprojection and the perspective backprojection algorithms. Analytical expressions as well as a simple reference code example are given, and implementation details are discussed. To give an idea of how the final code actually looks, two simplified code examples that run on the SPU are given. Finally, our way of assessing the performance is introduced. Section III provides the performance values achieved with our implementations. A literature survey that relates other attempts to speed up the backprojection to the results obtained in our study is given in Sec. IV.

FIG. 1. Block diagram of the cell with pictures of one CBE and of the Mercury dual cell-based blade.

Kachelrieß, Knaup, and Bockenbach: Hyperfast backprojection. Med. Phys. 34 (4), April 2007, p. 1474. Am. Assoc. Phys. Med.

II. METHOD

A. Parallel backprojection

We consider a 2D parallel beam backprojection of type

$$ f(\mathbf{r}) = \int d\vartheta \; p\big(\vartheta, \xi(\vartheta, \mathbf{r}), z\big) $$

with

$$ \xi(\vartheta, \mathbf{r}) = c_0 x + c_1 y + c_2, \qquad \text{where } c_i = c_i(\vartheta). $$

The function $f$ is the backprojected image, $p$ are the raw data (typically they would be convolved in the $\xi$ direction), $\mathbf{r} = (x, y, z)$ denotes the pixel location within slice $z$, $\vartheta$ is the view parameter, and $\xi$ is proportional to the distance of the ray to the origin and therefore corresponds to the detector look-up coordinate. The coefficients $c_i$ are arbitrary functions of the projection angle. For example, a scanner with projection angle $\vartheta$ and ray distance $\xi$ to the origin would have $c_0 = \cos\vartheta$, $c_1 = \sin\vartheta$ and $c_2 = 0$, such that the ray parametrized by the pair $(\vartheta, \xi)$ is the line $x \cos\vartheta + y \sin\vartheta = \xi$.

Although this is a 2D backprojection algorithm, where backprojection is done in the x-y plane, we have added the z coordinate on both sides of the equation. This allows the simultaneous backprojection of several sinograms that share the same in-plane ray geometry. Simultaneous backprojection of, say, 16 slices allows for fast innermost loops since the detector look-up index and the linear interpolation weights have to be calculated only once. It further enables straightforward vectorization and unrolling of the innermost loop and is the key to the high performance achieved by our algorithms.

The backprojection integral is usually realized in a discretized version called pixel-driven backprojection. The reference code is shown in listing 1.
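The pixel-driven discretization described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions and not the authors' reference code: the function names, the data layout `sino[n][m][k]`, the unit pixel size, and the parameters `xi0` (position of channel 0) and `dxi` (channel spacing) are all hypothetical. The index names follow the convention of the paper: i and j for the pixel, m and n for channel and view, k for the slice. The look-up index and the interpolation weight are computed once per pixel and then reused for all slices in the innermost loop.

```python
import math

def parallel_coeffs(n_views):
    """Coefficients (c0, c1, c2) for equiangular views over a half rotation (assumed geometry)."""
    return [(math.cos(math.pi * n / n_views), math.sin(math.pi * n / n_views), 0.0)
            for n in range(n_views)]

def backproject_parallel(sino, coeffs, n_pix, xi0, dxi):
    """Pixel-driven parallel backprojection with linear interpolation.

    sino[n][m][k]: view n, detector channel m, slice k (hypothetical layout).
    Returns image[j][i][k] for an n_pix x n_pix grid with unit pixel size.
    """
    n_chan = len(sino[0])
    n_slices = len(sino[0][0])
    half = 0.5 * (n_pix - 1)                      # grid is centered on the origin
    image = [[[0.0] * n_slices for _ in range(n_pix)] for _ in range(n_pix)]
    for n, (c0, c1, c2) in enumerate(coeffs):
        for j in range(n_pix):
            y = j - half
            for i in range(n_pix):
                x = i - half
                xi = c0 * x + c1 * y + c2         # detector coordinate of this pixel
                t = (xi - xi0) / dxi              # fractional channel index
                m = math.floor(t)
                if m < 0 or m + 1 >= n_chan:
                    continue                      # ray misses the detector
                w = t - m                         # linear interpolation weight
                lo, hi = sino[n][m], sino[n][m + 1]
                pix = image[j][i]
                for k in range(n_slices):         # same index and weight for every slice
                    pix[k] += (1.0 - w) * lo[k] + w * hi[k]
    return image
```

Backprojecting a sinogram of ones, for example with `backproject_parallel(sino, parallel_coeffs(8), 4, xi0=-7.5, dxi=1.0)` and 16 channels, gives every pixel a value equal to the number of views, which is a quick sanity check of the interpolation weights.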
Apart from this unoptimized reference code, our highly optimized PC-based implementation, coded in early 1999, that is equivalent to the reference code, is used to benchmark against the new cell-based parallel backprojection. Note that this PC-based or CPU-based implementation is pure C code and uses neither explicit assembler segments nor specific processor intrinsics. Checking the assembler output shows that the compiler (we used the Intel C compiler to compile the PC-based code) automatically vectorizes using SSE2 extensions, hence the high backprojection speed. The only effort that went into optimization was to ensure that the number of simultaneously backprojected images is a multiple of four and to ensure proper data alignment on 16 byte borders.

FIG. 2. Data reorganization (rebinning) is used (a) to align the projection matrix with one axis of the volume (x axis) and with the direction of convolution and (b) to upsample the detector pixels until they are small enough to be suitable for nearest neighbor interpolation.

Listing 1: Reference code for the parallel backprojection. The pixel indices i and j correspond to x and y, and the sinogram indices m and n correspond to ξ and ϑ. The index k denotes the number of the slice and may be regarded as the z position.

Implementation

When porting the code to the cell, several constraints had to be followed. The LS is limited to 256 kB, and only small portions of the full problem can be handled by each worker. To accommodate this limit, the image and the raw data had to be tiled into subimages and subsinograms. The size of the subimages and the size of the subsinograms were chosen to allow for double buffering of the sinogram data: two subsinograms plus one subimage plus code and stack must fit into the 256 kB local store. Only those portions of a projection that are needed by a worker's particular subimage comprise the subsinogram and are DMAed to the worker. Double buffering means that while the worker is busy backprojecting the first subsinogram, the DMA of the other subsinogram is active. Thereby, the DMA latency is almost completely hidden behind the backprojection process. Further, care was taken to make use of the 128 available registers per SPU to fully fill the execution pipelines.
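To illustrate why double buffering hides almost all of the DMA latency, the schedule described above can be modeled with a simple timeline. The sketch below is only a timing model under simplifying assumptions (constant transfer and compute time per tile, two buffers, one DMA in flight at a time); the function names and the example durations are hypothetical and are not measurements from the paper.

```python
def serial_time(n_tiles, t_dma, t_compute):
    """Total time if every transfer must finish before its tile is processed."""
    return n_tiles * (t_dma + t_compute)

def double_buffered_time(n_tiles, t_dma, t_compute):
    """Closed form: only the first transfer is exposed, later ones overlap with compute."""
    if n_tiles == 0:
        return 0.0
    return t_dma + (n_tiles - 1) * max(t_dma, t_compute) + t_compute

def simulate_double_buffering(n_tiles, t_dma, t_compute):
    """Step through the schedule: fetch tile i+1 into the spare buffer while tile i is processed."""
    clock = 0.0
    dma_done = t_dma                      # transfer of tile 0 into buffer 0
    for tile in range(n_tiles):
        clock = max(clock, dma_done)      # wait until the current tile has arrived
        if tile + 1 < n_tiles:
            dma_done = clock + t_dma      # start the next transfer into the other buffer
        clock += t_compute                # backproject the current tile
    return clock
```

With a compute time ten times longer than the transfer time, `simulate_double_buffering(8, 1.0, 10.0)` yields 81 time units compared with 88 for `serial_time(8, 1.0, 10.0)`: only the very first transfer remains visible, which matches the qualitative statement that the latency is almost completely hidden for compute-limited problems.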
Manual loop unrolling and reordering of instructions ensured a throughput of more than one instruction per clock cycle. Vectorization and loop unrolling were achieved by simultaneously backprojecting multiples of four images. In our case 48 images are backprojected simultaneously, which allows for 12-fold loop unrolling in the innermost loop.

B. Perspective backprojection

We consider a cone-beam backprojection of type

$$ f(\mathbf{r}) = \int d\alpha \; w^2(\alpha, \mathbf{r}) \, p\big(\alpha, u(\alpha, \mathbf{r}), v(\alpha, \mathbf{r})\big) $$

with

$$ u(\alpha, \mathbf{r}) = (c_{00} x + c_{01} y + c_{02} z + c_{03}) \, w(\alpha, \mathbf{r}), $$
$$ v(\alpha, \mathbf{r}) = (c_{10} x + c_{11} y + c_{12} z + c_{13}) \, w(\alpha, \mathbf{r}), $$
$$ w(\alpha, \mathbf{r}) = 1 / (c_{20} x + c_{21} y + c_{22} z + c_{23}), \qquad \text{where } c_{ij} = c_{ij}(\alpha). $$

Here, $f$ is the reconstructed volume, $p$ is the preweighted and convolved raw data, $\mathbf{r} = (x, y, z)$ denotes the voxel location, $\alpha$ is the trajectory parameter (for circular scans the trajectory parameter is often chosen to coincide with the rotation angle), and $u$ and $v$ are the detector coordinates and therefore correspond to the detector look-up indices. The coefficients $c_{ij} = c_{ij}(\alpha)$, which define the perspective transform from the detector into the volume, are in general arbitrary functions of the projection parameter. The distance weight $w(\alpha, \mathbf{r})$ is required for cone-beam filtered backprojection (e.g., for Feldkamp-type image reconstruction). To understand that the distance weight can always be split into a product of a detector preweighting function that only depends on $u$ and $v$ and a voxel-dependent weight that is the same as the denominator of the perspective transform, see Appendix A.

The backprojection integral is usually realized in a discretized version called pixel-driven backprojection. Our reference code is shown in listing 2. Apart from this unoptimized reference code, our highly optimized PC-based implementation (pure C++, coded in 2001), which is equivalent to the reference code, is used to benchmark against the new cell-based perspective backprojection. Both our optimized PC-based code and the new optimized cell-based implementation are hybrid algorithms in the sense that they first perform a detector alignment, based on upsampling (oversampling) and bilinear interpolation, followed by a voxel-driven backprojection based on nearest neighbor interpolation (a similar rectification technique is used, and its image quality is analyzed, in Ref. 17). The backprojection part assumes that one detector axis is aligned with the volume's x axis, which yields $c_{00} = c_{20} = 0$ (see Appendix B for details). The optimized implementations take advantage of this fact to speed up the code by reordering the nested loops and by avoiding divisions in the innermost loop. To achieve this alignment (ideal detector), the original data (physical detector or real detector) are transformed into the ideal geometry as the first processing step. This real-to-ideal rebinning includes bilinear interpolation and an upsampling that doubles the number of detector pixels. Thereby, the ideal detector's pixels are small enough to carry out the subsequent voxel-driven backprojection with nearest neighbor interpolation instead of bilinear interpolation without loss in image quality.
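The effect of the upsampling step can be illustrated in one dimension. The following sketch is a deliberately simplified stand-in for the real-to-ideal rebinning (one detector row instead of a 2D detector, no geometric alignment, linear instead of bilinear interpolation); the function names are hypothetical. It shows that a nearest neighbor look-up on a twofold upsampled row stays closer to the linearly interpolated value than a nearest neighbor look-up on the original row.

```python
def lerp(row, t):
    """Linear interpolation of row at fractional position t (in units of the original pixels)."""
    i = min(int(t), len(row) - 2)
    w = t - i
    return (1.0 - w) * row[i] + w * row[i + 1]

def upsample2(row):
    """Twofold upsampling: samples at positions 0, 0.5, 1, ..., len(row) - 1."""
    return [lerp(row, 0.5 * s) for s in range(2 * len(row) - 1)]

def nearest(row, t, spacing=1.0):
    """Nearest neighbor look-up at position t for samples that are `spacing` apart."""
    return row[int(t / spacing + 0.5)]

# Example: a linear ramp sampled at eight detector pixels.
ramp = [float(i) for i in range(8)]
fine = upsample2(ramp)                      # 15 samples with half the pixel size
direct_nn = nearest(ramp, 2.3)              # look-up on the original row
hybrid_nn = nearest(fine, 2.3, spacing=0.5) # look-up on the upsampled row
```

For the ramp used here, the nearest neighbor error relative to linear interpolation is bounded by half a pixel step on the original row and by a quarter of a pixel step on the upsampled row, which is the intuition behind replacing the per-voxel interpolation by a single upsampling pass per projection.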
It should be noted that this kind of real-to-ideal transform is also needed to align the detector's u axis along the direction of convolution before convolution can be carried out. Hence one may regard this preprocessing step as not being part of the backprojection. Nevertheless, the performance values measured for our hybrid algorithms include the time needed for the real-to-ideal rebinning. They do not, however, include the time required for convolution. Figure 2 illustrates the orientation of the real and the ideal detector with respect to the volume.

Listing 2: Reference code for the perspective backprojection. The voxel indices i, j and k correspond to x, y and z, and the raw data indices l, m and n correspond to v, u and α.

It should be emphasized that the data rebinning or rectification process can also be used to switch from curved detectors, as they typically occur in clinical CT, from distorted detector arrays, as they typically occur in image amplifiers, or from any other detector shape to the ideal flat detector. Hence the backprojection times provided here also apply to other detector geometries, at least for the hybrid approach.

We further implemented and optimized the direct (nonhybrid) perspective backprojection that is numerically equivalent to the reference code of listing 2. Since there are no zero-valued perspective transform coefficients and since a bilinear interpolation step must be performed for each voxel update, this direct code is expected to be significantly slower than the hybrid approach.
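A voxel-driven realization of the perspective backprojection formula of Sec. II B can be sketched as follows. This is an illustrative sketch and not the authors' listing 2: it uses nearest neighbor instead of bilinear interpolation to stay short, and the function names, the data layout `proj[n][l][m]`, the unit voxel size, and the helper `example_coeffs` (a single hypothetical view with the source on the negative y axis) are assumptions made for this example.

```python
import math

def example_coeffs(dist_src, dist_det, u0, v0):
    """Hypothetical 3x4 coefficient matrix c_ij for one view.

    Source at (0, -dist_src, 0), flat detector at distance dist_det from the source,
    detector center at pixel (u0, v0): u = dist_det * x / (y + dist_src) + u0, likewise for v.
    """
    return [[dist_det, u0, 0.0, u0 * dist_src],
            [0.0, v0, dist_det, v0 * dist_src],
            [0.0, 1.0, 0.0, dist_src]]

def backproject_perspective(proj, coeffs, n_vox):
    """Voxel-driven perspective backprojection with nearest neighbor look-up.

    proj[n][l][m]: view n, detector row l (v direction), detector column m (u direction).
    Returns vol[k][j][i] for a cubic grid of n_vox**3 voxels with unit voxel size.
    """
    n_rows, n_cols = len(proj[0]), len(proj[0][0])
    half = 0.5 * (n_vox - 1)
    vol = [[[0.0] * n_vox for _ in range(n_vox)] for _ in range(n_vox)]
    for n, c in enumerate(coeffs):
        for k in range(n_vox):
            z = k - half
            for j in range(n_vox):
                y = j - half
                for i in range(n_vox):
                    x = i - half
                    # distance weight: inverse of the perspective denominator
                    w = 1.0 / (c[2][0] * x + c[2][1] * y + c[2][2] * z + c[2][3])
                    u = (c[0][0] * x + c[0][1] * y + c[0][2] * z + c[0][3]) * w
                    v = (c[1][0] * x + c[1][1] * y + c[1][2] * z + c[1][3]) * w
                    m = math.floor(u + 0.5)      # nearest detector column
                    l = math.floor(v + 0.5)      # nearest detector row
                    if 0 <= m < n_cols and 0 <= l < n_rows:
                        vol[k][j][i] += w * w * proj[n][l][m]
    return vol
```

The sketch performs one division per voxel update and evaluates all twelve coefficients in the innermost loop. This is exactly the cost that the hybrid approach avoids by aligning the detector first, so that some coefficients vanish and the loops can be reordered.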


Implementation

The local store limit of 256 kB per worker does not allow the full volume to be updated simultaneously. We rather use a hierarchical memory layout and tile the volume into small subvolumes of 32³ voxels. Such a subvolume occupies half of the LS. The remaining 128 kB are used to hold the code, the stack, two raw data buffers and, in the case of the hybrid algorithm, the ideal detector data that are produced during real-to-ideal rebinning (see Fig. 3). Only those patches of raw data that are actually needed to backproject the current worker's subvolume are DMAed to the worker. While the worker is busy rebinning and backprojecting the first raw data buffer, the DMA of the next raw data patch is active. Just as in the parallel backprojection case, we thereby hide the DMA latency behind the perspective backprojection process. Again, loop unrolling techniques and instruction reordering methods were employed to fully fill the execution pipelines, while care was taken not to demand more than the 128 available registers per SPU; otherwise the compiler would insert slow load and store instructions to accommodate the demand.

FIG. 3. Only small subvolumes fit into the worker. The corresponding raw data patches are DMAed to the worker prior to real-to-ideal rebinning and backprojection.

C. Code example

To give an idea of the cell code, listing 3 shows the innermost loop for the direct perspective backprojection of a 32³ subvolume. To keep the code example short, we removed the bilinear interpolation part and only show the nearest neighbor version, which is not used for the actual timing measurements in this paper. The commands used are SPU-specific types and SPU intrinsics. Since the computation of the detector indices (divisions followed by casts to integers) is vectorized, the loop index k increases by four elements on each pass. The corresponding LI algorithm is almost twice as long and consists of 60 lines of code. Since the loop is passed eight times, the final version, which contains both vectorization and loop unrolling and which is used for the timing measurements, consists of 500 lines of code. For loop unrolling we did not simply repeat the loop body eight times, but we also rescheduled the commands to account for data dependencies and latencies. This rescheduling is shown in listing 4, where one can see that most variables that are loaded into registers are not used before six clock cycles (lines of code) have passed. These six clock cycles are the latency of the commands and correspond to the time needed until the result of the operation is available for further use.

D. Performance assessment

The code was implemented to cope with any number of pixels or voxels (including nonsquare images and noncubic volumes), projections, and pixels per projection. For the parallel backprojection we assessed the performance of backprojecting 512 parallel projections into an image of size 512 × 512. The complexity of the code is O = 512³ operations. The fact that each projection consisted of 1024 channels is irrelevant to our timing measurement. In the case of the cone-beam backprojection, 512 projections were backprojected into a volume of 512³ voxels. The complexity of the cone-beam backprojection code is O = 512⁴ operations. For both types of algorithms the field of view (FOV) was chosen to lie completely within the circular/cylindrical field of measurement (FOM). The pixel or voxel size was chosen to be the detector element size projected to the center of rotation divided by the square root of two.

The standard and the optimized central processing unit (CPU)-based algorithms ran on a single 3.06 GHz Xeon processor with a 533 MHz front side bus, while the cell-based implementation used a 3 GHz CBE running on a dual cell blade (Mercury Computer Systems). For both systems we ensured that the second CPU or the second CBE was idle during our timing measurements.

The time T per slice was measured using the system clock. To improve the timing accuracy and to overcome the granularity of the system clock, we report the average over 512 reconstructed slices. Care was taken that no other significant CPU workload impaired our measurements. Additionally, we compute the number of CPU clock cycles per operation as C/O, with C = FT being the number of clock cycles per reconstructed image and F being the clock frequency, which equals 3.06 GHz for our PC and 3 GHz for the cell system. The CPU times stated below are linearly scaled from 3.06 GHz to the 3 GHz our cell processor uses.

FIG. 4. A simulated noise-free phantom consisting of fat, water, tissue and bone (contrasts of 50, 0, 50, and 1000 HU) was reconstructed using the direct and the hybrid approach. Note the narrow window width of the subtraction image: the differences between the direct and the hybrid method are below the typical noise level of a CT image and hence negligible.

Listing 3: Innermost loop of the direct perspective backprojection shortened to NN interpolation. The number of voxels K in z direction must be a multiple of four. Loop unrolling is not shown here. The comments on the right hand side are the corresponding pseudo-code listing. Note that most variables are four-vectors and operations are element wise.

III. RESULTS

A.
Parallel backprojection

The timing results for a nearest neighbor (NN) and a linear interpolation (LI) parallel-beam backprojection are shown in Table I. The LI reference algorithm is the code provided in listing 1; the NN reference code can be found by a straightforward reduction of the LI reference algorithm to nearest neighbor. The reference algorithm is PC based but not optimized. The PC-based optimized backprojection is a highly optimized backprojection code that has been in use by our group since 1999. It is pure C++ and contains some loop unrolling, but it does not explicitly make use of intrinsics or assembler code. The cell-based code is also highly optimized, as detailed earlier in this paper. All algorithms are equivalent to the reference code.

TABLE I. Timing results (time T per image) for the parallel backprojection for one CPU or one CBE, respectively.

  ParBackProj              T (NN)     T (LI)
  PC based, reference      4.1 s      5.2 s
  PC based, optimized      68 ms      93 ms
  Cell based, optimized    6.1 ms     7.9 ms

The CBE achieves a backprojection rate of 165 fps with nearest neighbor interpolation and 126 fps with linear interpolation. Considering that two cells are available per blade, one may backproject 330 images per second.

Listing 4: Specialization of the inner loop of listing 3 for K=32 that shows eightfold loop unrolling. The ellipses indicate that the actual code is about six times larger. A and B denote even or odd loop index, whereas the integers running from 0 to 7 denote the loop index itself.

How does our implementation compare to the theoretical peak performance? Theoretically, and this assumes optimal optimization, one cannot do better than updating four pixels per step. For nearest neighbor, an update step requires at least two loads, one add and one store. The add runs on the even pipeline and can theoretically be completely hidden by the three load/stores that execute in parallel on the odd pipeline. With eight workers, and assuming that each step updates four pixels, one can theoretically update 32/3 pixels per clock, i.e., C/O ≥ 3/32 ≈ 0.094. Similarly, linear interpolation requires four load/stores and two multiply-adds, which means 32/4 pixel updates per clock. Hence C/O ≥ 4/32 = 0.125 must hold. Compared with the measured values, our implementation reaches 69% (NN) and 71% (LI) of the theoretical peak performance.

B. Perspective backprojection

Table II shows the timing achieved for the perspective backprojection. It should be noted that the direct method is numerically equivalent to the reference code. Due to the intermediate resampling step, this is not exactly the case for the hybrid approaches.

TABLE II. Timing results for the perspective backprojection for one CPU or one CBE, respectively. T is the time per slice and 512 T the time for the complete volume.

  PerBackProj              T          512 T
  PC based, reference      13.6 s     1.93 h
  PC based, hybrid         376 ms     3.21 min
  Cell based, direct       53.1 ms    27.2 s
  Cell based, hybrid       26.6 ms    13.6 s

Let us again compare the performance to the theoretical optimum. An update step requires at least five loads, one add and one store. The add runs on the even pipeline and can theoretically be completely hidden by the six load/stores that execute on the odd pipeline. This means at most 32/6 voxel updates per clock. Hence C/O ≥ 6/32 ≈ 0.19 must hold. Our direct method achieves 15.8% and the hybrid method achieves 31.6% of the theoretical peak performance.

The image quality of the direct and the hybrid approach is nearly equivalent, as shown in Ref. 17. To give additional evidence, Fig. 4 shows an example of a transversal section that was reconstructed with the direct approach and with the hybrid backprojection. The difference image of these noise-free data contains only values that lie below the noise level of typical CT exams. Consequently, the images of both methods can be regarded as equivalent. The overall high image quality achievable with the hybrid backprojection is demonstrated in Fig. 5. These preclinical images show an in vivo mouse scanned with a dedicated small animal imaging micro-CT scanner (TomoScope 30s, VAMP GmbH, Erlangen, Germany).

C.
DMA latency

One of the most prominent features of the CBE is its fast DMA between the main memory and the worker local store. Since cell DMA works in parallel to the SPU's command execution pipeline, the DMA latency may be completely hidden for some CPU-limited problems. To measure the DMA latency for our implementations, we performed dummy reconstructions without DMA transfers and calculated the differences between their total backprojection times and those of real backprojections. The backprojection times were measured with clock-cycle precision via the so-called worker decrementer. The decrementer is a counter on each SPU that is decremented by one at each clock cycle. Statistical errors were estimated by repeating all measurements five times. Table III shows the results for the parallel backprojection and the direct perspective backprojection of a 512³ volume, both using linear interpolation. It turned out that the DMA fraction of the total reconstruction time is about 0.57% for the parallel backprojection and about 0.37% for the perspective backprojection. As expected, there is no significant difference in the latencies for the direct and the hybrid cone-beam backprojection since the same amount of data is transferred in both cases.

TABLE III. DMA latencies for the parallel backprojection with linear interpolation and the direct and hybrid perspective backprojection of a volume. DMA get: raw data flow from manager to worker. DMA put: volume flow from worker to manager. The statistical error for the parallel backprojection is below 0.1 ms and thus shown as zero.

  Cell-based             Without DMA    DMA get     DMA put    Total
  ParBackProj            4047±0 ms      16±0 ms     7±0 ms     4070±0 ms
  PerBackProj, direct    ±1 ms          94±10 ms    6±1 ms     ±12 ms
  PerBackProj, hybrid    ±1 ms          93±10 ms    6±1 ms     ±12 ms

IV. OTHER RECENT ATTEMPTS TO SPEED UP BACKPROJECTION

Other groups have made considerable efforts to speed up CT image reconstruction. Although a fair and quantitative comparison is not always possible, Table IV lists those performance figures that have been published in this millennium, including those published in this paper. Benchmarks found in older literature are considered obsolete due to the ongoing developments in computer technology. To allow for some comparison we scale the values found in the literature to the case of backprojecting 512 projections. For the parallel beam backprojection we scale to 512² pixels, for the cone-beam backprojection to 512³ voxels. The projection size itself is considered irrelevant. For PC-based implementations the CPU clock rate is scaled to 3.0 GHz. This assumption is quite optimistic since backprojection is usually limited by memory latency, and memory speed has not increased that significantly during the last years. Especially for older experiments that have been carried out on slow CPUs, this scaling will overestimate the actual performance that could be achieved with the same algorithm on modern CPUs. Note that in most cases comparing the cost-to-performance ratio would be more adequate than just comparing performance. However, there are no reliable cost figures available to us.

FIG. 5. In vivo study of a mouse scanned with the TomoScope 30s cone-beam micro-CT scanner (VAMP GmbH, Erlangen, Germany). The Feldkamp reconstruction is cell based and uses our hybrid backprojection. C=100 HU, W=750 HU.

TABLE IV. Top: parallel backprojection performance. Bottom: perspective backprojection performance. All values have been scaled to 512 projections and to 512² pixels and 512³ voxels, respectively. All values were further scaled to a single processing unit, i.e., to one CPU, one FPGA, one GPU and one CBE, respectively, and to 3 GHz in the case of CPU-based algorithms. The type column specifies the interpolation type, NN or LI, and the type of arithmetic used: f + number of bits denotes floating point arithmetic, while i + number of bits stands for integer (fixed point) arithmetic.

  Reference                      Type      Hardware   Time        Comment
  Parallel backprojection
  Leeser et al. (a)              LI/i09    CPU        4.66 s
                                 LI/i09    FPGA       125 ms
  Schiwietz et al. (b)           LI/f32    CPU        22.6 s      Includes FFT
                                 LI/?      GPU        176 ms      Includes FFT
  Xue et al. (c)                 ?/f32     CPU        7.13 s
                                 ?/i32     FPGA       273 ms
                                 ?/i32     GPU        295 ms
                                 ?/i16     GPU        143 ms
  Kachelrieß et al. (this work)  NN/f32    CPU        68 ms
                                 LI/f32    CPU        93 ms
                                 NN/f32    CBE        6.1 ms
                                 LI/f32    CBE        7.9 ms
  Perspective backprojection
  Wiesent (d)                    LI/f32    CPU        10.0 min    Includes convolution
  Yu et al. (e)                  ?/?       CPU        8.51 min    Includes convolution
  Goddard, Trepanier (f)         LI/i16    FPGA       66.0 s      Detector ∥ rotation axis
  Xu and Mueller (g)             LI/f32    CPU        7.57 h
                                 LI/f32    GPU        34 min
  Kole and Beekman (h)           NN/?      GPU        17.3 min
                                 LI/?      GPU        25.8 min
  Hornegger (i)                  NN/f32    CBE        1.99 min    Simulation
  Mueller and Xu (j)             ?/f32     CPU        1.28 h      Includes convolution
                                 ?/f32     GPU        17.9 min    Includes convolution
                                 ?/i16     GPU        3.84 min    Includes convolution
  Riddell and Trousset (k)       LI/?      CPU        9.15 min    Hybrid
  Kachelrieß et al. (this work)  LI/f32    CPU        3.21 min    Hybrid
                                 LI/f32    CBE        27.2 s      Direct
                                 LI/f32    CBE        13.6 s      Hybrid

  (a) Reference 25. (b) Reference 26. (c) Reference 27. (d) Reference 28. (e) Reference 29. (f) References 30-32. (g) Reference 33. (h) Reference 34. (i) Reference 35. (j) Reference 36. (k) Reference 17.
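The cycles-per-operation figure C/O = FT/O of Sec. II D and the peak bounds of Sec. III can be checked numerically from the cell timings listed in Table IV. The sketch below assumes O = 512³ operations per image for the parallel case and O = 512⁴ operations per volume for the cone-beam case, a clock frequency of 3 GHz, and the lower bounds of 3/32, 4/32 and 6/32 cycles per update that follow from the stated 32/3, 32/4 and 32/6 updates per clock. The function and variable names are hypothetical.

```python
def cycles_per_operation(time_s, n_ops, clock_hz=3.0e9):
    """C/O with C = F * T clock cycles spent on n_ops pixel or voxel updates."""
    return clock_hz * time_s / n_ops

def peak_fraction(time_s, n_ops, min_cycles_per_op, clock_hz=3.0e9):
    """Fraction of the theoretical peak: bound on C/O divided by the measured C/O."""
    return min_cycles_per_op / cycles_per_operation(time_s, n_ops, clock_hz)

OPS_2D = 512 ** 3   # 512 projections into 512 x 512 pixels
OPS_3D = 512 ** 4   # 512 projections into 512^3 voxels

# (measured time in seconds, operations, lower bound on C/O)
cases = {
    "parallel, NN":        (6.1e-3, OPS_2D, 3.0 / 32.0),
    "parallel, LI":        (7.9e-3, OPS_2D, 4.0 / 32.0),
    "perspective, direct": (27.2,   OPS_3D, 6.0 / 32.0),
    "perspective, hybrid": (13.6,   OPS_3D, 6.0 / 32.0),
}

fractions = {name: peak_fraction(t, ops, bound) for name, (t, ops, bound) in cases.items()}
```

Under these assumptions the computed fractions agree with the percentages quoted in Sec. III after rounding, which also confirms that the time per image and the operation counts are mutually consistent.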

A further complication regarding the cone-beam backprojection algorithms is that the underlying assumptions differ from publication to publication, and it is not always clear whether all assumptions are precisely stated in the paper. One example is the set of assumptions about the detector alignment. Another assumption that is sometimes made is that the scanner performs an exact rotation. In this case the perspective coefficients are not independent but can be transformed into each other using a rotation matrix. This allows one to exploit the resulting symmetries and thereby speed up the reconstruction process.

We further want to point out that it makes a significant difference whether the reconstructed FOV is cuboid, cylindrical or even spherical. A cylindrical FOV contains only π/4 ≈ 79% of the voxels that are contained in the enclosing cuboid. This adds another 21% uncertainty to the values found in the literature if the FOV shape is not disclosed or if voxels outside the FOM are not backprojected. Similarly, the volume ratio between a spherical FOV and its enclosing cube is π/6 ≈ 52%.

Divide-and-conquer-type backprojection, such as Fourier-based image reconstruction, hierarchical backprojection (Ref. 21) or the link method (Ref. 22), is of a completely different type than the standard backprojection algorithms discussed here and is therefore not included in our comparison. It should be noted that these methods have the potential to increase reconstruction speed by a factor of $cN/\ln N$, with $c$ being some, sometimes rather small, constant. Except maybe for Fourier reconstruction, there is no highly optimized implementation that can really compete with the standard backprojection performance values listed here. Further, some of these divide-and-conquer concepts work well in 2D but become difficult or impossible in the cone-beam case.
For example, Fourier reconstruction in 3D only works when the complete Radon data are available [23]. Last but not least, these methods often suffer from a trade-off between reconstruction speed and reconstruction accuracy, except for the Fourier-based algorithms.

We also did not include the interesting distance-driven backprojection algorithm proposed in Ref. 24. Although the authors claim significant speed-ups relative to their pixel-driven backprojection implementation, their approach is not fully optimized. Hence the achievable timing cannot be reliably determined from the paper.

A. Parallel backprojection

Leeser et al. published a field programmable gate array (FPGA)-driven parallel-beam backprojection [25]. Using fixed point arithmetic with 9 bits they can backproject 1024 projections into an image in 0.25 s using 16-way parallel processing. A great deal of their work has to do with bit reduction, which, however, always means a loss of image quality. They assume 12 bit input data, which is not sufficient for clinical CT where the data are acquired with at least 20 bits. They compare their results to a 1 GHz CPU-based version that needs 28 s for the 1024 projections.

Schiwietz et al. compare CPU-based with graphics processing unit (GPU)-based magnetic resonance image reconstruction [26]. Since they start in the Fourier domain, their code includes the inverse fast Fourier transform (FFT) to obtain data ready for backprojection. They use a 3 GHz CPU and an ATI Radeon X1800 XT GPU. The reconstruction of three images from 504 projections requires 16.7 s on the CPU and 130 ms on the GPU. Normalizing this to one image and 512 projections yields 22.6 s and 176 ms, respectively.

Xue and co-authors compared CPU with GPU and with FPGA performance for a parallel-beam backprojection from 165 projections into an image [27]. Their PC runs at 3.4 GHz and they compare the ATI X700 Pro and the Nvidia GF7800 GPUs, whereby the Nvidia greatly outperforms the ATI GPU.
The CPU code uses floating point arithmetic and an image is backprojected in 507 ms. The FPGA uses fixed point arithmetic with 32 bit precision and does the same job in 22 ms. The Nvidia GPU with 32 bit fixed point arithmetic performs in 23.8 ms, and with 16 bit it does the backprojection in 11.5 ms. Scaling these values to 512 projections, to the image size used here, and to 3.0 GHz yields 7.13 s (CPU), 273 ms (FPGA), 295 ms (GPU, 32 bit), and 143 ms (GPU, 16 bit), respectively, for one image.

B. Perspective backprojection

Wiesent et al. use a dual Pentium III Xeon 550 MHz CPU [28]. They reconstruct their volume from 100 projections in about 40 s. In terms of our problem at 3 GHz and a single CPU this scales to 10.0 min.

Yu et al. provide a PC-based implementation [29]. On a 500 MHz Pentium III CPU they can reconstruct a volume from 288 projections within minutes. They use a spherical FOV and do not backproject voxels outside this sphere. Scaled to 3 GHz, to 512 projections, and to a cubic FOV this becomes 8.51 min, whereby we believe that this scaling yields a far too optimistic value since memory speed did not improve in the same way as the CPU clock rates did. Their code utilizes single instruction multiple data (SIMD) instructions.

Goddard and Trepanier present an FPGA-driven reconstruction, which includes convolution, that can reconstruct a volume from 300 projections in between 15 and 38.7 s. The range of values corresponds to using one or more FPGAs. Since the convolution process was completely hidden behind the backprojection, the reconstruction times also correspond to the backprojection performance. Scaling the 38.7 s (one FPGA) to the 512 projections used here we obtain a performance of 66.0 s. Among other assumptions, the algorithm assumes one detector axis to be parallel to the rotation axis, the center of rotation to be the center of the cubic volume, and the distances of the focal spot to the isocenter and to the detector to be constant.
The first assumption implies that their backprojection matrix is of the same type as for our hybrid approach. The real-to-ideal rebinning is not

mentioned and probably not included in their experiment. The other assumptions imply that the perspective coefficients c_ij are generated by a rotation matrix.

Xu and Mueller published on GPU-based image reconstruction [33]. They compare a fairly optimized CPU implementation with the GPU-based approach they propose. The PC runs at 2.66 GHz and the GPU is an Nvidia FX model. Their backprojection (LI) requires 75 s for the fairly optimized CPU algorithm and 5 s for the GPU code for their volume and 80 projections. In terms of our problem at 3.0 GHz these values become 7.57 h for the PC and 34 min for the GPU, respectively.

Kole and Beekman recently optimized a statistical image reconstruction algorithm to run on a GPU [34]. Each iteration consists of one forward and two backprojection steps. Since the forward projection is of about the same speed as the backprojection, we may divide their performance values by three to estimate the GPU performance of a perspective backprojection. They cite a speed of 195 s for NN and of 290 s for LI for one iteration with 256 projections and their volume. The time needed for a backprojection of our problem will be about 17.3 min for the nearest neighbor backprojection and about 25.8 min for the linear interpolation version.

Hornegger recently presented a backprojection code for the cell processor that was tested on a cell simulator and not on a real cell system [35]. They show that six projections/s can be backprojected (NN) onto a volume using a dual cell (16 SPUs) running at 2.1 GHz and speculate that the code can be further sped up by a factor of 5. Scaling their value to 512 projections, 3.0 GHz, and 1 CBE yields 1.99 min for the complete volume.

Lately, Mueller and Xu published new results on GPU-based CT image reconstruction [36]. Since the problem of floating point arithmetic on GPUs seems not yet to be solved, they find integer arithmetic very useful to speed up the process, although image quality becomes inferior. Depending on what arithmetic is used, the timing for their volume and 160 projections achieved on an Nvidia 7800 FX GPU ranges between 1.9 and 42 s. Adequate images are provided by their dual-pass approach that allows for 16 bit accuracy and finishes in 9 s. Full floating point accuracy requires 42 s on the GPU. Their PC-based implementation needs 180 s for the same task in full floating point accuracy (CPU and bus clock frequencies are not stated). Normalizing their values to our problem yields 3.84 min (16 bit integer) and 17.9 min (single precision float) for the GPU and 1.28 h for the CPU. It should be noted that these values include the convolution step, which typically makes up about 10% of the reconstruction time if it cannot be hidden behind the backprojection by using a parallel thread.

Riddell and Trousset implemented a rectification-based perspective backprojection on a 3.4 GHz Pentium 4 CPU [17]. Their code uses the decomposition given in Appendix B and is therefore a hybrid approach. In contrast to our cell-based hybrid algorithm, which first performs the alignment A followed by the backprojection B·C, Riddell and Trousset perform the rectification A·B followed by the backprojection C. The authors state that backprojecting 148 projections into a cylinder of 512 voxels height and diameter takes 110 s. Scaling this to our problem and to 3.0 GHz we find that their code takes 9.15 min.

V. DISCUSSION

The cell broadband engine enables very fast backprojection on general purpose hardware. The parallel backprojection performance allows one to generate 330 images (512 projections each) per second on a dual cell board. For the cone-beam backprojection one may generate a complete volume (512 projections) in 6.8 s.
Convolution of the 512 projections with a 2047-element kernel runs in 0.2 s on the dual cell blade and is therefore negligible compared to the backprojection step. Considering that typical scan times are of the same order, at least for flat-panel detector-based CT, one can potentially achieve real-time imaging at full spatial resolution.

Besides its very high performance, probably the most significant advantage of the CBE over other hardware-based acceleration approaches is its versatility. FPGA-, application specific integrated circuit (ASIC)-, or GPU-based solutions are usually limited to certain functionality. The cell processor, in contrast, is general purpose hardware that can be used for all kinds of tasks ranging from data preprocessing, image reconstruction, image display, and volume rendering to more complicated issues such as dose and scatter calculation. Its high performance may even enable completely new applications or may help to bring other, low performance approaches into clinical routine, such as iterative or statistical CT image reconstruction.

APPENDIX A: DISTANCE WEIGHTING

Distance weighting means multiplying each voxel's update value during backprojection, prior to accumulation, by a function $W^n(\mathbf r - \mathbf s)$, where $n$ is some power, $\mathbf r$ is the voxel location, $\mathbf s$ is the source or vertex position of the perspective transform, and $W$ is a homogeneous function of degree one. We will now show that significant parts of this voxel-based distance weighting can be reorganized such that a detector pixel-based weighting can be performed. Let $\mathbf q = \mathbf r - \mathbf s$, define the perspective projection of point $\mathbf r$ with vertex $\mathbf s$ as

$$u = \frac{\mathbf c_0\cdot\mathbf q}{\mathbf c_2\cdot\mathbf q} \quad\text{and}\quad v = \frac{\mathbf c_1\cdot\mathbf q}{\mathbf c_2\cdot\mathbf q}$$

and verify by expansion that, with $\mathbf c_i = (c_{i0}, c_{i1}, c_{i2})$,

$$\mathbf q = \frac{\mathbf c_0\times\mathbf c_1 + u\,\mathbf c_1\times\mathbf c_2 + v\,\mathbf c_2\times\mathbf c_0}{\mathbf c_0\cdot(\mathbf c_1\times\mathbf c_2)}\;\mathbf c_2\cdot\mathbf q.$$

Obviously, $\mathbf q$ is decomposed into a function of $u$ and $v$ and into a factor that corresponds to the denominator of the perspective transform, i.e., $\mathbf q = \boldsymbol\xi(u,v)\,\mathbf c_2\cdot\mathbf q$ is valid. Hence

$$W(\mathbf q) = W\bigl(\boldsymbol\xi(u,v)\,\mathbf c_2\cdot\mathbf q\bigr) = W\bigl(\boldsymbol\xi(u,v)\bigr)\,\mathbf c_2\cdot\mathbf q$$

is a decomposition of the distance weight into a product of detector weight and voxel weight. The detector weight must be applied to the raw data before they are passed to the backprojecting function. The latter is of type $\mathbf c_2\cdot\mathbf q$ and is the denominator of the perspective transform, which is computed during perspective backprojection anyway.

APPENDIX B: DECOMPOSITION OF THE PERSPECTIVE TRANSFORM

Using homogeneous coordinates, the 3D perspective transform that defines the backprojection geometry can be written as the 3 × 4 matrix

$$C_{\mathrm{orig}} = \begin{pmatrix} c_{00} & c_{01} & c_{02} & c_{03} \\ c_{10} & c_{11} & c_{12} & c_{13} \\ c_{20} & c_{21} & c_{22} & c_{23} \end{pmatrix}.$$

$C_{\mathrm{orig}}$ may be decomposed into a product of two 3 × 3 detector-to-detector perspective transform matrices $A$ and $B$ and a new 3 × 4 perspective backprojection matrix $C$:

$$A \cdot B \cdot C \propto C_{\mathrm{orig}}. \qquad \text{(B1)}$$

Each of the matrices $A$ and $B$ defines a 2D perspective transform and may be realized by rebinning the detector data. The matrix $C$ defines a 3D perspective transform between the volume and the detector. $A$ is designed to align the detector's coordinate axes with an arbitrary vector $\mathbf t$ on the one hand and with the volume's x axis on the other hand. $B$ further aligns the detector such that its u axis is parallel to the volume's y axis and that its v axis remains parallel to the volume's x axis. The new transform matrices are given by

$$A = \begin{pmatrix} \mathbf c_0\cdot\mathbf t & [v_1,w_2] & 0 \\ \mathbf c_1\cdot\mathbf t & [u_2,w_1] & 0 \\ \mathbf c_2\cdot\mathbf t & [u_1,v_2] & 1 \end{pmatrix}, \quad
B = \begin{pmatrix} -w_2 & 0 & 0 \\ -\Delta\,[t_2,w_0] & -\Delta\,[t_1,w_2] & 0 \\ t_2 & 0 & -[t_1,w_2] \end{pmatrix}, \quad
C = \begin{pmatrix} 0 & w_2 & -w_1 & -[s_1,w_2] \\ w_2 & 0 & -w_0 & -[s_0,w_2] \\ 0 & 0 & 1 & -s_2 \end{pmatrix},$$

and the irrelevant constant of proportionality of the right hand side of Eq. (B1) is given by $[t_2,w_1]\,w_2$. Here, we use the commutator $[u_i,v_j] = u_i v_j - u_j v_i$, the scale factor $\Delta = \mathbf c_0\cdot(\mathbf c_1\times\mathbf c_2)$, and the coefficient vector $\mathbf c_i = (c_{i0}, c_{i1}, c_{i2})$ for abbreviation. The vectors $\mathbf s$, $\mathbf t$, $\mathbf u$, $\mathbf v$, and $\mathbf w$, whose components are denoted as $s_0, s_1, \ldots$, can be identified with the source or vertex position, the direction of convolution, the vectors spanning the detector, and the vector connecting the detector origin with the vertex, respectively. Whereas $\mathbf t$ is arbitrary and usually depends on the tangent of the scan trajectory, the others are given by

$$\mathbf s = -\begin{pmatrix} c_{00} & c_{01} & c_{02} \\ c_{10} & c_{11} & c_{12} \\ c_{20} & c_{21} & c_{22} \end{pmatrix}^{-1} \begin{pmatrix} c_{03} \\ c_{13} \\ c_{23} \end{pmatrix}$$

and

$$\mathbf u = \mathbf c_1\times\mathbf c_2/\Delta, \quad \mathbf v = \mathbf c_2\times\mathbf c_0/\Delta, \quad \mathbf w = \mathbf c_0\times\mathbf c_1/\Delta.$$

There are two advantages of this kind of decomposition. One is that convolution must usually be carried out along a certain direction $\mathbf t$ that corresponds to the tangent $\mathbf t = \mathbf s'$ of the source trajectory $\mathbf s$. To avoid convolving across detector rows, which would be highly inefficient, the detector must be rebinned using the transform $A$. Only then can convolution be done along the detector's u axis for each detector row v separately. The second advantage of the detector alignment is that a number of zeroes are introduced into the backprojection matrix. These zeroes help to improve the backprojection speed. Depending on whether there is no detector alignment at all, only the convolution alignment $A$, or both alignment steps $A$ and $B$ are performed, the final backprojection matrix will become one of

$$C_{\mathrm{bp}} = A\cdot B\cdot C \propto C_{\mathrm{orig}}, \qquad C_{\mathrm{bp}} = B\cdot C, \qquad C_{\mathrm{bp}} = C.$$

TABLE V. Possible alignment steps of the projection data to allow for convolution along t and to introduce a number of zeroes in the backprojection matrix. Our hybrid code versions perform the convolution alignment A, assume the convolution to be performed elsewhere, and finally use the 3D perspective transform matrix C_bp = B·C for backprojection.

            Detector A        Detector B       Detector C
            (from scanner)    (convolution)    (backprojection)
u axis      u                 t                y
v axis      v                 x                x
C_bp        A·B·C             B·C              C

The zero entries of $C_{\mathrm{bp}}$ and the detector orientation are illustrated in Table V. Note that whenever one desires full alignment but does not need the intermediate convolution

step, one may perform transforms A and B simultaneously by using the 3 × 3 detector-to-detector transform matrix A·B instead. It must be emphasized that the decomposition shown here corresponds to a particular detector and volume orientation. Aligning the detector parallel to the x and y axis of the volume is reasonable only when the projection is oriented more or less along the z direction. For other projections, detector alignments along the x-z plane or along the y-z plane are required. These situations can be easily handled by swapping the corresponding rows of C_orig.

a) Electronic mail: marc.kachelriess@imp.uni-erlangen.de

[1] H. P. Hofstee, "Power efficient processor architecture and the cell processor," Proceedings of the 11th International Symposium on High-Performance Computer Architecture, February 2005, San Francisco, CA.
[2] D. Pham, S. Asano, M. Bolliger, M. N. Day, H. P. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi, M. Riley, D. Shippy, D. Stasiak, M. Suzuoki, M. Wang, J. Warnock, S. Weitzel, D. Wendel, T. Yamazaki, and K. Yazawa, "The design and implementation of a first-generation cell processor," IEEE International Solid-State Circuits Conference, 6–10 February 2005, San Francisco, CA.
[3] B. Flachs, S. Asano, S. H. Dhong, H. P. Hofstee, G. Gervais, R. Kim, T. Le, P. Liu, J. Leenstra, J. Liberty, B. Michael, H. Oh, S. M. Mueller, O. Takahashi, A. Hatakeyama, Y. Watanabe, and N. Yano, "A streaming processing unit for a cell processor," IEEE International Solid-State Circuits Conference, 6–10 February 2005, San Francisco, CA.
[4] W. A. Kalender, Computed Tomography, 2nd ed. (Wiley, New York).
[5] Gabor T. Herman, Image Reconstruction from Projections: The Fundamentals of Computerized Tomography (Computer Science and Applied Mathematics) (Academic, New York).
[6] H. H. Barrett and W. Swindell, Radiological Imaging (Academic, New York).
[7] A. C. Kak and M. Slaney, Principles of Computerized Tomographic Imaging (SIAM, Philadelphia).
[8] F. Natterer, The Mathematics of Computerized Tomography (Teubner, Stuttgart).
[9] L. A. Shepp and B. F. Logan, "The Fourier reconstruction of head section," IEEE Trans. Nucl. Sci. 21.
[10] L. A. Feldkamp, L. C. Davis, and J. W. Kress, "Practical cone-beam algorithm," J. Opt. Soc. Am. A 1(6).
[11] M. Kachelrieß, S. Schaller, and W. A. Kalender, "Advanced single-slice rebinning in cone-beam spiral CT," Med. Phys. 27(4).
[12] L. M. Chen, D. J. Heuscher, and Y. Liang, "Oblique surface reconstruction to approximate cone-beam helical data in multislice CT," Proc. SPIE 4123.
[13] S. Schaller, K. Stierstorfer, H. Bruder, M. Kachelrieß, and T. Flohr, "Novel approximate approach for high-quality image reconstruction in helical cone beam CT at arbitrary pitch," SPIE Medical Imaging Conference Proc. 4322.
[14] M. Kachelrieß, T. Fuchs, S. Schaller, and W. A. Kalender, "Advanced single-slice rebinning for tilted spiral cone-beam CT," Med. Phys. 28(6).
[15] K. Stierstorfer, T. Flohr, and H. Bruder, "Segmented multiple plane reconstruction: a novel approximate reconstruction for multi-slice spiral CT," Phys. Med. Biol. 47.
[16] M. Knaup, W. A. Kalender, and M. Kachelrieß, "Statistical cone-beam CT image reconstruction using the cell broadband engine," IEEE Medical Imaging Conference Record 2006, Oct 29–Nov 4, San Diego, CA, M11-422.
[17] C. Riddell and Y. Trousset, "Rectification for cone-beam projection and backprojection," IEEE Trans. Med. Imaging 25(7).
[18] H. Stark, J. W. Woods, I. Paul, and R. Hingorani, "An investigation of computerized tomography by direct Fourier inversion and optimum interpolation," IEEE Trans. Biomed. Eng. 28(7).
[19] H. Schomberg and J. Timmer, "The gridding method for image reconstruction by Fourier transformation," IEEE Trans. Med. Imaging 14(3).
[20] S. Schaller, T. Flohr, and P. Steffen, "An efficient Fourier method in 3D reconstruction from cone-beam data," IEEE Trans. Med. Imaging 17.
[21] S. Basu and Y. Bresler, "An O(N² log N) filtered backprojection reconstruction algorithm for tomography," IEEE Trans. Med. Imaging 9(10).
[22] P. E. Danielsson and M. Ingerhed, "Backprojection in O(N² log N) time," IEEE Nucl. Sc. Symp. Rec. 2.
[23] C. Axelsson and P. E. Danielsson, "Three-dimensional reconstruction from cone-beam data in O(N³ log N) time," Phys. Med. Biol.
[24] B. De Man and S. Basu, "Distance-driven projection and backprojection in three dimensions," Phys. Med. Biol. 49.
[25] M. Leeser, S. Coric, E. Miller, H. Yu, and M. Trepanier, "Parallel-beam backprojection: An FPGA implementation optimized for medical imaging," Proceedings of the Tenth Int. Symposium on FPGA, Monterey, CA, February.
[26] T. Schiwietz, T.-C. Chang, P. Speier, and R. Westermann, "MR image reconstruction using the GPU," Proc. SPIE 6142.
[27] X. Xue, A. Cheryauka, and D. Tubbs, "Acceleration of fluoro-CT reconstruction for a mobile C-arm on GPU and FPGA hardware: A simulation study," Proc. SPIE 6142.
[28] K. Wiesent, K. Barth, N. Navab, P. Durlak, T. Brunner, O. Schuetz, and W. Seissler, "Enhanced 3D reconstruction algorithm for C-arm systems suitable for interventional procedures," IEEE Trans. Med. Imaging 19(5).
[29] R. Yu, R. Ning, and B. Chen, "High-speed cone-beam reconstruction on PC," Proc. SPIE 4322.
[30] M. Trepanier and I. Goddard, "Adjunct processors in embedded medical imaging systems," Proc. SPIE 4681.
[31] I. Goddard and M. Trepanier, "High-speed cone-beam reconstruction: An embedded systems approach," Proc. SPIE 4681.
[32] I. Goddard and M. Trepanier, "The role of FPGA-based processing in medical imaging," VMEbus Systems.
[33] F. Xu and K. Mueller, "Accelerating popular tomographic reconstruction algorithms on commodity PC graphics hardware," IEEE Trans. Nucl. Sci. 52(3).
[34] J. S. Kole and F. J. Beekman, "Evaluation of accelerated iterative x-ray CT image reconstruction using floating point graphics hardware," Phys. Med. Biol. 51.
[35] J. Hornegger, Moscow-Bavarian joint advanced student school, www5.informatik.uni-erlangen.de/lehre/ws0506/mb-jass06/
[36] K. Mueller and F. Xu, "Practical considerations for GPU-accelerated CT," IEEE International Symposium on Biomedical Imaging, April 2006, Arlington, Virginia.
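As a numerical cross-check of the relations in Appendices A and B, the following sketch builds the vectors s, u, v, w from a projection matrix, assembles A, B, and C, and tests that A·B·C equals [t_2,w_1] w_2 times C_orig and that q = ξ(u,v) (c_2·q). Minus signs and the scale-factor symbol are not legible in this copy, so the sign and scale conventions used in `decompose` are one self-consistent choice and not necessarily the authors' exact notation. The function names, the example matrix `C_orig`, the direction `t`, and the voxel position `r` are hypothetical.

```python
import numpy as np


def comm(a, i, b, j):
    """Commutator [a_i, b_j] = a_i*b_j - a_j*b_i."""
    return a[i] * b[j] - a[j] * b[i]


def geometry(C_orig):
    """Return the scale factor, the vertex s and the vectors u, v, w."""
    M = C_orig[:, :3]                       # rows are c_0, c_1, c_2
    c0, c1, c2 = M
    delta = c0 @ np.cross(c1, c2)           # scale factor c_0 . (c_1 x c_2)
    s = -np.linalg.solve(M, C_orig[:, 3])   # vertex: C_orig @ (s, 1) = 0
    u = np.cross(c1, c2) / delta
    v = np.cross(c2, c0) / delta
    w = np.cross(c0, c1) / delta
    return delta, s, u, v, w


def decompose(C_orig, t):
    """Build A, B, C and the constant kappa (assumed sign conventions)."""
    c0, c1, c2 = C_orig[:, :3]
    delta, s, u, v, w = geometry(C_orig)
    A = np.array([[c0 @ t, comm(v, 1, w, 2), 0.0],
                  [c1 @ t, comm(u, 2, w, 1), 0.0],
                  [c2 @ t, comm(u, 1, v, 2), 1.0]])
    B = np.array([[-w[2], 0.0, 0.0],
                  [delta * comm(t, 0, w, 2), delta * comm(t, 2, w, 1), 0.0],
                  [t[2], 0.0, comm(t, 2, w, 1)]])
    C = np.array([[0.0, w[2], -w[1], -comm(s, 1, w, 2)],
                  [w[2], 0.0, -w[0], -comm(s, 0, w, 2)],
                  [0.0, 0.0, 1.0, -s[2]]])
    kappa = comm(t, 2, w, 1) * w[2]
    return A, B, C, kappa


# Hypothetical projection matrix (viewing roughly along z) and direction t.
C_orig = np.array([[1000.0, 20.0, 150.0, 5000.0],
                   [-30.0, 1000.0, 120.0, -2000.0],
                   [0.02, -0.01, 1.0, 600.0]])
t = np.array([1.0, 0.2, 0.05])

# Appendix B: A.B.C equals kappa * C_orig.
A, B, C, kappa = decompose(C_orig, t)
decomposition_ok = np.allclose(A @ B @ C, kappa * C_orig)

# Appendix A: q = xi(u, v) * (c_2 . q) for an arbitrary voxel position r.
delta, s, u, v, w = geometry(C_orig)
c0, c1, c2 = C_orig[:, :3]
r = np.array([12.0, -7.0, 30.0])
q = r - s
ud, vd = (c0 @ q) / (c2 @ q), (c1 @ q) / (c2 @ q)   # detector coordinates
xi = w + ud * u + vd * v
weighting_ok = np.allclose(q, xi * (c2 @ q))

print(decomposition_ok, weighting_ok)  # True True
```

The zeros visible in `C` are what the hybrid approach exploits: two of the three rows need only two multiplications per voxel coordinate update instead of three.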
