A Parallel Implementation of the Katsevich Algorithm for 3-D CT Image Reconstruction

The Journal of Supercomputing, 38, 35-47, 2006
© 2006 Springer Science + Business Media, LLC. Manufactured in The Netherlands.

JUNJUN DENG, HENGYONG YU, JUN NI, TAO HE, SHIYING ZHAO, LIHE WANG, GE WANG
Department of Radiology, the University of Iowa, Iowa City, Iowa 52242, USA
jdeng@math.uiowa.edu, hengyong-yu@uiowa.edu, jun-ni@uiowa.edu, tao-he@uiowa.edu, shiying-zhao@uiowa.edu, lihe-wang@uiowa.edu, ge-wang@uiowa.edu

Abstract. Yu and Wang [1, 2] implemented the first theoretically exact spiral cone-beam reconstruction algorithm, developed by Katsevich [3, 4]. This algorithm becomes computationally expensive as the amount of data grows. Here we study a parallel computing scheme for the Katsevich algorithm to accelerate the image reconstruction. Based on the proposed parallel algorithm, several numerical tests are conducted on a high performance computing (HPC) cluster with thirty-two 64-bit AMD Opteron processors. The standard phantom data [5] are used to establish the performance benchmarks. The results show that our parallel algorithm significantly reduces the reconstruction time, achieving high speedup and efficiency.

Keywords: Computed tomography (CT), medical imaging, image reconstruction, Katsevich algorithm, spiral cone-beam CT, high performance computing, parallel computing, MPI

1. Introduction

X-ray computed tomography (CT) is an important medical-imaging modality in which projection data are used to reconstruct a cross-sectional or volumetric image of a patient. In this field, spiral cone-beam CT has become a main mode, in which a data acquisition system consisting of an X-ray tube and a multi-row detector bank rotates while the patient is moved through the scanner gantry [6]. Relative to the patient, the X-ray source scans along a helix and generates cone-beam X-rays through the object.
The attenuated X-ray signals are then recorded on the detectors placed on the other side of the patient. Although the mechanism of spiral cone-beam CT seems simple, the cone-beam divergence and the longitudinal truncation of projection data make exact image reconstruction far from trivial. In 1984, a landmark contribution was made by Feldkamp et al. [7]. The Feldkamp algorithm allows approximate reconstruction from cone-beam data collected along a circular trajectory. In 1993, a generalized Feldkamp algorithm was developed by Wang et al. [8]. The Wang algorithm is primarily for approximate reconstruction in the case of spiral cone-beam CT. Like the Feldkamp algorithm, the Wang algorithm is excellent in terms of efficiency and parallelism. As far as exact reconstruction for spiral cone-beam CT is concerned, a breakthrough was made in 2002 when Katsevich derived a filtered backprojection algorithm which is quite similar to the Feldkamp-type algorithm but does reconstruct images exactly for spiral cone-beam CT [3, 4]. In 2004, the Katsevich algorithm was implemented by Yu and Wang [1] and other groups.

In practice, the Katsevich algorithm requires a long computation time when the amount of data becomes large. There are several approaches to reducing the computation time. From the computing perspective, one can use parallel computing for that purpose. A parallel computing machine can be a single Symmetric Multi-Processing (SMP) system with multiple built-in processors sharing a common memory; a cluster of locally connected processors with distributed interconnected memories; or a cluster comprising multiple workstations linked by a network. In the analysis of parallel computing, a processor participating in a computational process is called a processing element (PE). An overall computational task is typically partitioned into multiple sub-tasks, and the associated data are sent to different PEs through a local connection (with an internal switch) or a networked connection (with an external switch). After the sub-tasks are completed, the results are assembled by a master PE to obtain the final result.

Parallel computing has been successfully used in several medical applications involving image reconstruction, and many parallel algorithms have been developed. A parallel algorithm is usually designed based on a corresponding sequential algorithm. For example, Raman [9] developed a parallel Filtered-BackProjection (FBP) algorithm and implemented it on an Intel Paragon system with 16 processors and a Connection Machine (CM5) system with 32 processors. The performance of these parallel FBP programs was compromised by a large communication overhead, giving a speedup of about 4 on the Paragon and 1.36 on the CM5. In the early 1990s, some parallel Expectation-Maximization (EM) algorithms were proposed [10, 11].
The parallel implementation was directly based on the conventional EM algorithm with various domain partition techniques [11, 12]. Ordered subset techniques were further used to speed up the iterative reconstruction [13]. Recently, Johnson and Sofer investigated various parallelisms in image reconstruction [14]. An OSC (Ordered-Subset Convex)-based parallel statistical cone-beam X-ray CT algorithm was proposed for a shared-memory platform [15]. This algorithm employs two parallelization techniques: (1) processing all the projections within one subset in parallel (OSC-ang), and (2) dividing the whole volume into parts and reconstructing them in parallel (OSC-vol). Both techniques rely heavily on re-projection/back-projection operations. The second parallelization strategy is suitable for distributed memory systems. It was also found that the optimal choice of the OSC-ang and OSC-vol specifics depended on the dataset size. The paradigm of using multiple parallelization techniques is effective in reducing the communication cost of data transfer.

This paper presents a parallel computing scheme for the Katsevich algorithm. The parallelization is built upon the previous numerical implementation [1, 2] of the original Katsevich algorithm [3, 4]. In the following sections, the Katsevich algorithm is first outlined. Then, the parallel computing scheme is presented. In the simulation, the major performance indices are studied, including the computation time, the ratio of communication to computation time, the overall speedup, and the parallel efficiency. Finally, relevant issues are discussed.

2. Katsevich algorithm and its sequential implementation

2.1. Katsevich theorem

As shown in Figure 1, a helical scanning locus $C$ in 3-D Euclidean space $\mathbb{R}^3$ can be mathematically described as [4]

$$C := \left\{ y \in \mathbb{R}^3 : y_1 = R\cos(s),\ y_2 = R\sin(s),\ y_3 = \frac{sh}{2\pi},\ s \in \mathbb{R} \right\}, \tag{1}$$

where $s$ is an angular parameter, $h$ ($> 0$) and $R$ ($> 0$) are the pitch and radius of the locus, and $y$ is a Cartesian-coordinate vector with three components $y_1$, $y_2$ and $y_3$. In a practical CT system, a patient is moved through the gantry while the X-ray source rotates around the patient. Relative to the patient's position, the locus of the X-ray source can be viewed as the helix $C$.

Figure 1. Coordinate systems and variables used for image reconstruction in the case of helical cone-beam CT.

Let $U$ denote an open set that is strictly inside the helix and contains the volume (object) of interest (VOI):

$$U \subset \left\{ x \in \mathbb{R}^3 : x_1^2 + x_2^2 < r^2,\ 0 < r < R \right\}, \tag{2}$$

where $r$ is the radius of the VOI inside the locus and $x$ is the Cartesian-coordinate vector with three coordinate components $x_1$, $x_2$ and $x_3$.
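The geometry of Eqs. (1) and (2) can be sketched in a few lines. This is only an illustration: the function names and the numerical values of R, h and r are hypothetical and are not taken from the paper.

```python
import math

def source_position(s, R, h):
    """Point y(s) on the helix of Eq. (1): radius R, axial advance h per turn."""
    return (R * math.cos(s), R * math.sin(s), s * h / (2.0 * math.pi))

def inside_voi(x, r):
    """True if x = (x1, x2, x3) satisfies x1^2 + x2^2 < r^2, as in Eq. (2)."""
    return x[0] ** 2 + x[1] ** 2 < r ** 2

# Illustrative values only (assumptions, not the paper's scan parameters).
R, h, r = 3.0, 1.0, 1.0
y_one_turn = source_position(2.0 * math.pi, R, h)  # source after one full turn
```

After one full turn the source returns to the same angular position and has advanced by exactly one pitch h along the third axis, which is the behavior Eq. (1) encodes.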

Assume $f$ is a compactly supported function defined on $U$ and let $S^2$ be the unit sphere in $\mathbb{R}^3$; then the cone-beam transform of $f$ is defined as

$$D_f(y, \beta) := \int_0^{\infty} f(y + t\beta)\, dt, \quad \beta \in S^2. \tag{3}$$

The π-line of a given point $x$ is a line segment passing through $x$ with its two endpoints within one helix turn. It has been proved that any point strictly inside the spiral belongs to one and only one π-line [16, 17]. If $s_b(x)$ and $s_t(x)$ are the angular parameters of the two endpoints, the π-interval can be denoted as $I_{PI}(x) := [s_b(x), s_t(x)]$. For a given $s \in I_{PI}(x)$, one can find $s_2 \in I_{PI}(x)$ such that $x$, $y(s)$, $y(s_2)$ and $y(s_1(s, s_2))$ are on the same plane with the constraint $s_1(s, s_2) = (s + s_2)/2$. Denote

$$u(s, x) = \begin{cases} \dfrac{(y(s_1) - y(s)) \times (y(s_2) - y(s))}{\left| (y(s_1) - y(s)) \times (y(s_2) - y(s)) \right|}\, \mathrm{sgn}(s_2 - s), & 0 < |s_2 - s| < 2\pi, \\[2ex] \dfrac{\dot{y}(s) \times \ddot{y}(s)}{\left| \dot{y}(s) \times \ddot{y}(s) \right|}, & s_2 = s. \end{cases} \tag{4}$$

The Katsevich theorem [4] can be given as follows.

Theorem 1. For $f \in C_0^{\infty}(U)$, one has

$$f(x) = -\frac{1}{2\pi^2} \int_{I_{PI}(x)} \frac{1}{|x - y(s)|} \int_0^{2\pi} \left. \frac{\partial}{\partial q} D_f\big(y(q), \Theta(s, x, \gamma)\big) \right|_{q=s} \frac{d\gamma}{\sin\gamma}\, ds, \tag{5}$$

where $\Theta(s, x, \gamma) = \cos(\gamma)\,\beta(s, x) + \sin(\gamma)\,e(s, x)$, $\beta(s, x) = (x - y(s)) / |x - y(s)|$ and $e(s, x) = \beta(s, x) \times u(s, x)$.

2.2. Numerical implementation

As illustrated in Figure 1, with $d_1 = (-\sin(s), \cos(s), 0)$, $d_2 = (0, 0, 1)$ and $d_3 = (-\cos(s), -\sin(s), 0)$, a local coordinate system on the planar detector is formed to numerically implement Katsevich's formula [1]. The cone-beam projection data are measured using planar detector arrays parallel to $d_1$ and $d_2$ at a distance $D$ from $y(s)$. The detector position in the array is given by a pair of values $(u, v)$, which are the signed distances along $d_1$ and $d_2$, respectively. Let $(u, v) = (0, 0)$ be the orthogonal projection of $y(s)$ onto the detector array. For given $s$ and $D$, the projection $(u, v)$ is determined by $\beta$.
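How a direction determines $(u, v)$ can be made concrete with a small sketch. It assumes the detector axes $d_1 = (-\sin s, \cos s, 0)$, $d_2 = (0, 0, 1)$ and $d_3 = (-\cos s, -\sin s, 0)$, where the minus signs are an assumption chosen so that $d_3$ points from the source toward the rotation axis. The function names and test values are hypothetical.

```python
import math

def detector_frame(s):
    """Local axes at view angle s (signs assumed as stated in the lead-in)."""
    d1 = (-math.sin(s), math.cos(s), 0.0)
    d2 = (0.0, 0.0, 1.0)
    d3 = (-math.cos(s), -math.sin(s), 0.0)
    return d1, d2, d3

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def project_to_detector(x, s, R, h, D):
    """Detector coordinates (u, v) of the ray from y(s) through the point x."""
    y = (R * math.cos(s), R * math.sin(s), s * h / (2.0 * math.pi))
    w = tuple(xi - yi for xi, yi in zip(x, y))   # x - y(s)
    d1, d2, d3 = detector_frame(s)
    depth = dot(w, d3)                           # distance from source along d3
    if depth <= 0.0:
        raise ValueError("point is not in front of the source")
    return D * dot(w, d1) / depth, D * dot(w, d2) / depth

# Illustrative geometry only: R, h, D are not the paper's scan parameters.
u0, v0 = project_to_detector((0.0, 0.0, 0.0), 0.0, 3.0, 1.0, 6.0)
u1, v1 = project_to_detector((0.0, 1.0, 0.5), 0.0, 3.0, 1.0, 6.0)
```

A point on the rotation axis at the source height projects to the detector origin, consistent with $(u, v) = (0, 0)$ being the orthogonal projection of $y(s)$ onto the detector.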
If we denote $g(s, u, v) = D_f(y(s), \beta)$ and $D'_g(s, u, v) = \frac{d}{ds} g(s, u, v)$, the Katsevich algorithm can be implemented by the following two steps [1]:

(S1) Hilbert filtering. Define an intermediate function $\psi(s, u, v)$ for this filtering step as

$$\psi(s, u, v) = \int \frac{\sqrt{D^2 + u^2 + v^2}}{\sqrt{D^2 + \tilde{u}^2 + \tilde{v}^2}}\, \frac{D'_g(s, \tilde{u}, \tilde{v})}{\tilde{u} - u}\, d\tilde{u}, \tag{6}$$

where $(\tilde{u}, \tilde{v})$ represents the local coordinates of a variable point on the filtering line determined by $(u, v)$, and $D'_g(s, u, v)$ is the first-order derivative of the cone-beam data, which can be computed by the following equation:

$$D'_g(s, u, v) = \left( \frac{\partial}{\partial s} + \frac{D^2 + u^2}{D} \frac{\partial}{\partial u} + \frac{uv}{D} \frac{\partial}{\partial v} \right) g(s, u, v). \tag{7}$$

(S2) Weighted backprojection. The weighted backprojection is expressed by the following formula:

$$f(x) = -\frac{1}{2\pi^2} \int_{s_b(x)}^{s_t(x)} \frac{1}{|x - y(s)|}\, \psi(s, u^*, v^*)\, ds, \tag{8}$$

$$u^* = \frac{D\,(x - y(s)) \cdot d_1}{(x - y(s)) \cdot d_3}, \qquad v^* = \frac{D\,(x - y(s)) \cdot d_2}{(x - y(s)) \cdot d_3}.$$

To numerically implement the Katsevich algorithm, the cone-beam projections are first uniformly sampled with intervals $\Delta s$, $\Delta u$ and $\Delta v$ for $s$, $u$ and $v$, respectively. The sampled data are denoted as

$$g(s_k, u_m, v_n), \quad 0 \le k < K,\ 0 \le m < M,\ 0 \le n < N, \tag{9}$$

where $k$, $m$ and $n$ are the indexes of the sampling points for $s$, $u$ and $v$, respectively. In practice, $m$ and $n$ are indexes of unit detector positions, and $\Delta u$ and $\Delta v$ represent the unit detector size. Therefore $D'_g(s, u, v)$ can be numerically computed as

$$D'_g(s_k, u_m, v_n) = D_g^s(s_k, u_m, v_n) + \frac{D^2 + u_m^2}{D}\, D_g^u(s_k, u_m, v_n) + \frac{u_m v_n}{D}\, D_g^v(s_k, u_m, v_n), \tag{10}$$

where $D_g^s(s_k, u_m, v_n)$, $D_g^u(s_k, u_m, v_n)$ and $D_g^v(s_k, u_m, v_n)$ are the first-order central differences approximating $\frac{\partial}{\partial s} g(s, u, v)$, $\frac{\partial}{\partial u} g(s, u, v)$ and $\frac{\partial}{\partial v} g(s, u, v)$, respectively. Then the data $\psi(s_k, u_m, v_n)$ can be calculated from Eq. (6). Finally, the filtered data need to be backprojected by Eq. (8), where the π-interval $I_{PI}(x)$ has to be numerically determined. For the details of the numerical implementation, refer to [1, 2].

3. Parallel implementation

As mentioned above, a 3-D image reconstruction requires a great amount of time if the Katsevich algorithm is implemented sequentially. This is not acceptable in demanding biomedical applications. Speeding up the computation is the primary goal of this project.
Specifically, the parallel Katsevich algorithm is implemented on a multiprocessor HPC cluster with 16 nodes at our Medical Imaging High Performance Computing Lab (MIHPC Lab). Each node has two 64-bit AMD Opteron processors (PEs) and 4 GB of memory shared between the processors. The total system storage is 8 TB for archiving and retrieval of high-resolution data and images. The program is written in C and compiled with the Portland C compiler. MPI serves as the parallel library that performs message passing among the PEs. Since the MPI protocol is implemented through low-level sockets, the communication between processors (processes) on the same node is also realized through message passing. In addition, when processes are assigned, processors on different nodes are given priority over processors on the same node. The main message passing functions include MPI-based sending, receiving, broadcasting, and collecting.

As described above, the two major computing procedures are filtration and backprojection (S1 and S2). In the filtering step, the calculation of the numerical differentiation terms $D_g^s(s_k, u_m, v_n)$, $D_g^u(s_k, u_m, v_n)$ and $D_g^v(s_k, u_m, v_n)$ in Eq. (10) and the integration in Eq. (6) are the most time-consuming parts. The computation of $D_g^u(s_k, u_m, v_n)$ and $D_g^v(s_k, u_m, v_n)$ requires only the data collected at one view angle $s_k$ and is therefore independent of the data at other view angles. By this property, the projection data from different view angles can be distributed to different PEs and processed in parallel. The computation of $D_g^s(s_k, u_m, v_n)$ takes the data from the view angles $s_{k+1}$ and $s_{k-1}$. This requires that the projection data be partitioned in such a way that the data from view angles $s_k$, $s_{k+1}$ and $s_{k-1}$ are sent to the same PE. The integration in Eq. (6) uses the data at one view angle $s_k$, thus the partition strategy for $D_g^u(s_k, u_m, v_n)$ and $D_g^v(s_k, u_m, v_n)$ applies here as well.

An important issue is to determine how much projection data should be distributed to each PE in the filtering step. The filtering operation (6) is identical for all projection data indexed by $(s_k, u_m, v_n)$. As a result, each PE should process an amount of projection data consistent with its computing capacity.
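The data dependency described above can be seen directly in a sketch of Eq. (10). A plain three-point central difference is assumed here; the exact stencil used in [1, 2] may differ, and the function name, array layout `g[k][m][n]` and test values are hypothetical.

```python
def derivative_term(g, k, m, n, ds, du, dv, D, u, v):
    """Eq. (10) at an interior sample, assuming three-point central differences."""
    # Difference in s: the only term that reads neighbouring views k-1 and k+1.
    dgs = (g[k + 1][m][n] - g[k - 1][m][n]) / (2.0 * ds)
    # Differences in u and v: both read view k only.
    dgu = (g[k][m + 1][n] - g[k][m - 1][n]) / (2.0 * du)
    dgv = (g[k][m][n + 1] - g[k][m][n - 1]) / (2.0 * dv)
    return dgs + (D * D + u[m] ** 2) / D * dgu + u[m] * v[n] / D * dgv

# Check on synthetic linear data g = a*s + b*u + c*v (illustrative values only).
a, b, c, D = 1.0, 2.0, 3.0, 2.0
s = [0.0, 0.1, 0.2]
u = [0.5, 1.0, 1.5]
v = [-1.0, -0.5, 0.0]
g = [[[a * sk + b * um + c * vn for vn in v] for um in u] for sk in s]
value = derivative_term(g, 1, 1, 1, 0.1, 0.5, 0.5, D, u, v)
expected = a + (D * D + u[1] ** 2) / D * b + u[1] * v[1] / D * c
```

Because only `dgs` indexes `g[k - 1]` and `g[k + 1]`, a PE that owns a contiguous block of views needs just one extra view on each side of its block.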
Since the PC cluster is a homogeneous system, we assume that each PE has the same computing capacity. The computing privilege of individual PEs is not a critical issue here, and load balancing is not a major concern either. Hence, the projection data are simply partitioned evenly, as shown in Figure 2.

In the backprojection step, Eq. (8) is a voxel-driven formulation. The reconstruction of each voxel $x$ can be performed independently and requires the same amount of computation. Therefore, the volume is also partitioned over the PEs consistent with their processing capabilities. Each PE reconstructs its corresponding voxels, as also shown in Figure 2.

Figure 2. Data flow of the parallel Katsevich algorithm.
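The even partitioning can be sketched as follows. This is a minimal illustration under assumptions not stated in the paper: views are assigned as contiguous blocks, each block is padded by one neighbouring view on each side for the derivative in $s$, and the volume is split into contiguous slabs. The helper names are hypothetical.

```python
def even_ranges(total, n_pe):
    """Split range(total) into n_pe contiguous, nearly equal [start, stop) blocks."""
    base, extra = divmod(total, n_pe)
    ranges, start = [], 0
    for rank in range(n_pe):
        stop = start + base + (1 if rank < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

def with_halo(ranges, total, width=1):
    """Pad each block by `width` views per side, clipped to [0, total)."""
    return [(max(0, a - width), min(total, b + width)) for a, b in ranges]

# Filtering step: 10 views over 4 PEs, plus one neighbouring view per side.
views = even_ranges(10, 4)
views_needed = with_halo(views, 10)

# Backprojection step: 512 slices over 32 PEs gives 16 slices per PE.
slabs = even_ranges(512, 32)
```

When the total does not divide evenly, the first few PEs receive one extra item, so the block sizes never differ by more than one.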

To sum up, the overall parallel computation proceeds in the following order. The projection data are first partitioned and distributed over the selected PEs. After each PE receives its assigned data, it performs the filtering operation. After each PE completes the filtering operation, it sends the filtered data to all the PEs. Once each PE has received all of the filtered data, it independently performs the intensive backprojection. Finally, the backprojected data are collected and assembled on the master PE to obtain the final reconstruction. The flowchart of the whole parallel reconstruction process is presented in Figure 3.

Figure 3. Flowchart for the parallel reconstruction process.

4. Numerical results

The parallel implementation of the Katsevich algorithm was evaluated by reconstructing the 3-D Shepp-Logan phantom [5]. The spiral cone-beam projection data were collected with a planar detector, as shown in Figure 1. Different datasets (volumes of $128^3$, $256^3$, $384^3$ and $512^3$) were used to measure the performance (mainly speedup and efficiency) and to study the effects of the sizes of datasets and images. The double precision format was used for all the data and images. The measured computational time in each run was slightly different due to varying computational loads at the nodes. Therefore, the average parallel computation time was calculated from ten runs for each test. The mean values of the computational time (Cases I, II, III and IV for volumes of $128^3$, $256^3$, $384^3$ and $512^3$ voxels, respectively) and the corresponding standard deviations are listed in Table 1. The corresponding semi-log plots are shown in Figure 4(a). From these results, it is observed that the reconstruction time decreases significantly as the number of PEs increases. It is also seen in Table 1 that the standard deviation is relatively large in the case of 4 processors. This is because the cluster has a master node that not only handles its own computing task but also coordinates the whole reconstruction process and sometimes handles tasks submitted by other users. Therefore, it often has more memory allocated than the slave nodes. As a result, the slave nodes sometimes, although not always, need to wait for the master node in our experiments. Such a phenomenon is more prominent when fewer processors are used, causing a higher standard deviation.

Table 1. Average total reconstruction time with the number of processors*

Number of processors (NP)         1      4      8     12     16     20     24     28     32
Case I: Volume = 128^3 (16 MB)
  Reconstruction time           226     66     33     25     21     22     21     20     19
  Standard deviation           0.10   3.19   0.05   0.17   0.05   0.74   0.12   0.40   0.08
Case II: Volume = 256^3 (128 MB)
  Reconstruction time           975    277    117     77     58     52     46     43     37
  Standard deviation           2.82   10.6   3.48   0.06   0.19   1.01   0.08   0.21   0.07
Case III: Volume = 384^3 (432 MB)
  Reconstruction time          2756    733    333    207    156    135    112    100     88
  Standard deviation           9.42   13.4   2.94   0.07   0.15   0.48   0.74   0.07   0.27
Case IV: Volume = 512^3 (1 GB)
  Reconstruction time          6013   1639    738    463    344    285    246    216    185
  Standard deviation           15.2   39.7   7.66   2.67   0.61   0.92   0.89   1.00   0.38

* The values are the means from 10 runs. The unit of time is second.

Figure 4. Comparisons of the performance parameters for the parallel Katsevich algorithm. All the X-axes represent the number of processors. The Y-axes of (a), (b), (c), and (d) are for computational time, speedup, efficiency and ratio, respectively.

The benchmarks of a parallel algorithm are quantified in terms of the speedup $S_p$ and the parallel efficiency $\eta$, which are respectively defined as

$$S_p = \frac{T_s}{T_{n_p}}, \quad \text{and} \quad \eta = \frac{S_p}{n_p}, \tag{11}$$

where $n_p$ is the number of processors, $T_s$ is the total execution time when one processor is used, and $T_{n_p}$ is the total parallel execution time when $n_p$ processors are used. Based on the data in Table 1, the associated speedup was calculated in each case to produce Table 2. Figure 4(b) is a plot of the speedup against the number of processors in each of the four cases.

Table 2. Speedup with the number of processors

Number of processors (NP)         1      4      8     12     16     20     24     28     32
Case I: Volume = 128^3         1.00   3.44   6.77   9.07   10.9   10.1   10.9   11.4   12.1
Case II: Volume = 256^3        1.00   3.51   8.31   12.7   16.9   18.7   21.2   22.7   26.1
Case III: Volume = 384^3       1.00   3.76   8.29   13.3   17.7   20.5   24.7   27.7   31.5
Case IV: Volume = 512^3        1.00   3.67   8.14   13.0   17.5   21.1   24.5   27.8   32.6

The parallel efficiencies in Cases I, II, III and IV with respect to the number of processors are listed in Table 3 and plotted in Figure 4(c). The efficiency curve for the first case stays below the ideal efficiency curve and decreases relatively rapidly, whereas the curves for the other cases descend slowly and are close to the ideal efficiency curve.
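Eq. (11) can be applied directly to the Case IV row of Table 1, as in the short sketch below. The variable and function names are hypothetical. Because Table 1 lists rounded mean times, the results differ slightly from the entries printed in Tables 2 and 3, which were presumably computed from unrounded timings.

```python
def speedup(t_serial, t_parallel):
    """S_p = T_s / T_np from Eq. (11)."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_pe):
    """eta = S_p / n_p from Eq. (11)."""
    return speedup(t_serial, t_parallel) / n_pe

# Case IV (512^3 volume) mean reconstruction times in seconds, from Table 1.
NP = [1, 4, 8, 12, 16, 20, 24, 28, 32]
CASE_IV_TIME = [6013, 1639, 738, 463, 344, 285, 246, 216, 185]

rows = [
    (n, speedup(CASE_IV_TIME[0], t), efficiency(CASE_IV_TIME[0], t, n))
    for n, t in zip(NP, CASE_IV_TIME)
]
for n, s_p, eta in rows:
    print(f"NP={n:2d}  speedup={s_p:6.2f}  efficiency={eta:5.2f}")
```

An efficiency above 1 in this output corresponds to the super-linear behavior visible in the Case II to IV rows of Table 3.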
In addition, the efficiency curves for the latter cases show a common wavy pattern, in which the efficiency first decreases, then increases, and finally decreases again. In region 1, where the number of PEs ranges from 1 to 5, the parallel efficiencies for these cases decrease. In region 2, the efficiencies increase with the number of PEs. The curves reach their peaks when the number of PEs is about 16. In region 3, also called the post-peak performance region, the efficiencies decrease again as the number of PEs further increases.

Table 3. Efficiency with the number of processors

Number of processors (NP)         1      4      8     12     16     20     24     28     32
Case I: Volume = 128^3         1.00   0.86   0.85   0.76   0.68   0.51   0.45   0.41   0.38
Case II: Volume = 256^3        1.00   0.88   1.04   1.06   1.06   0.94   0.88   0.81   0.81
Case III: Volume = 384^3       1.00   0.94   1.04   1.11   1.11   1.02   1.03   0.99   0.98
Case IV: Volume = 512^3        1.00   0.92   1.02   1.08   1.09   1.05   1.02   0.99   1.02

Table 4. Time used in different steps*

Number of processes (NP)          1      4      8     12     16     20     24     28     32
Case I: Volume = 128^3
  Filtration                    104     26     13      9      7      5      4      4      3
  Collecting filtered data        0      5      7      8      7     11     11     12     12
  Backprojection                122     35     13      8      7      6      5      4      4
  Collecting BP data              0      0      0      0      0      0      0      0      0
  Total reconstruction time     226     66     33     25     21     22     21     20     19
Case II: Volume = 256^3
  Filtration                    104     27     13      9      7      5      4      4      3
  Collecting filtered data        0      6      6      7      7     11     11     12     12
  Backprojection                871    243     97     60     42     34     30     26     20
  Collecting BP data              0      1      1      1      2      2      1      1      2
  Total reconstruction time     975    277    117     77     58     52     46     43     37
Case III: Volume = 384^3
  Filtration                    104     26     13      9      7      5      4      4      3
  Collecting filtered data        0      5      6      8      8     11     11     11     12
  Backprojection               2652    698    310    186    138    115     92     80     69
  Collecting BP data              0      4      4      4      4      4      4      4      4
  Total reconstruction time    2756    733    333    207    156    135    112    100     88
Case IV: Volume = 512^3
  Filtration                    104     27     13      9      7      5      4      4      3
  Collecting filtered data        0      5      7      7      8     11     11     11     11
  Backprojection               5909   1598    708    437    319    259    220    190    160
  Collecting BP data              0      9     10     10     10     10     10     10     10
  Total reconstruction time    6013   1639    738    463    344    285    246    216    185

* The unit of time is second. The values are means from 10 runs. The time for broadcasting projection data to PEs is not listed since it is insignificant.

The super-linear effect (a speedup greater than the ideal linear speedup) arises because, in the multiprocessor system, the memory usage associated with each PE is less than that in the single-processor system [18]. For example, during the backprojection process, each processor reconstructs a portion of the object, thus allocating only that portion of memory.
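The memory argument can be made concrete with a little arithmetic, given that all data are stored in double precision (8 bytes per voxel). The sketch below is an illustration only: it assumes the volume is divided equally among the PEs, and the function names are hypothetical.

```python
MIB = 1024 ** 2  # bytes per mebibyte

def volume_bytes(n, bytes_per_voxel=8):
    """Memory for an n x n x n volume stored in double precision."""
    return n ** 3 * bytes_per_voxel

def per_pe_bytes(n, n_pe, bytes_per_voxel=8):
    """Share of the volume held by each PE, assuming an equal split."""
    return volume_bytes(n, bytes_per_voxel) / n_pe

# Whole-volume sizes for the four test cases, in MiB.
sizes_mib = {n: volume_bytes(n) / MIB for n in (128, 256, 384, 512)}

# Per-PE share of the 512^3 volume when 32 PEs are used, in MiB.
share_mib = per_pe_bytes(512, 32) / MIB
```

The whole-volume figures reproduce the sizes quoted in Table 1 (16 MB, 128 MB, 432 MB and 1 GB), while the per-PE share of the largest volume on 32 PEs is only 32 MB.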
In Case IV, reconstructing an object into a volume of $512^3$ voxels requires at least $512^3 \times 8$ bytes = 1 GB of memory for backprojection in a single-processor system, whereas in a multiprocessor system where $n$ ($n > 1$) PEs are used, the memory associated with each processor is $1/n$ of that amount. The impact of memory on the computational ability of the PEs is responsible for the super-linear speedup. Such phenomena are more evident with a larger dataset, which is why they appear more prominently in Cases II through IV.

Table 4 compares the time used in the different steps. The results indicate that the communication time constitutes a smaller percentage of the total reconstruction time as the reconstruction volume becomes larger. Hence, the parallel algorithm is computationally more efficient when a large dataset is processed for higher-resolution reconstruction. The ratio between the communication and computation time for different numbers of processors is also plotted in Figure 4(d). It shows that as the size of a dataset increases this ratio decreases, resulting in a higher performance.

Finally, to verify the correctness of the current parallel implementation, selected slices of the reconstructed objects are compared with the corresponding slices of the 3-D Shepp-Logan phantom. An excellent agreement can be seen in Figure 5.

Figure 5. Representative slices of the reconstructed $256^3$ volume. The top row shows the reconstructed slices of the 3-D Shepp-Logan phantom, while the bottom row shows the differences between the reconstructed and original slices. The gray ranges are [1.00, 1.05] and [-0.05, 0.05] for the reconstructed slices and the differences, respectively.

5. Discussion and conclusion

Some concerns need to be addressed here. One may ask why the 3-D backprojection is not started as soon as some PEs finish their filtering task, so as to avoid waiting for the others. Theoretically, this is feasible. However, there are three reasons that make it unnecessary or impractical. First, the time needed for the filtering constitutes only a small portion of the total reconstruction time. Second, in our homogeneous system, the computation load is balanced among the PEs in the filtering step, which can be observed from the timeline; hence the PEs finish the filtering task at almost the same time. Third, the programming complexity would increase if we did it that way. Nevertheless, in a heterogeneous system where the computation load of the PEs is imbalanced, one needs to consider starting the backprojection asynchronously.

Another concern is that, after the filtration, it does not seem economical in terms of data communication and memory storage to send a full copy of the filtered data to every involved PE. An alternative solution is to search for all the PI-segments and the affiliated

projection data immediately after the filtration, gather and distribute only the necessary data, and finally do the backprojection. This could reduce the needed data storage and communication among PEs, but it would demand more computing resources to compute and store the endpoints of the PI-segments and so on. Therefore, the tradeoff between them should be optimized.

Both these concerns suggest that the parallel implementation of the Katsevich algorithm is not unique. Since this work is to demonstrate the feasibility and advantages of utilizing parallelism, more implementation options are not discussed here. It is also worth noting that the proposed parallel computing structure can be adapted for many other CB-FBP algorithms, such as those described in [19, 21].

In conclusion, the parallel Katsevich algorithm for 3-D CT has been designed and studied. Our algorithm distributes the projection data and image sub-volumes to multiple PEs consistent with their computing abilities. It is feasible to modify the partitioning scheme when the PEs are not identical or more PEs are used. Future work includes studies on the trend of the speedup and efficiency curves when more PEs are used, and on the impact of enlarging the image volume on the speedup and efficiency of the parallel computing system.

Acknowledgments

The project is partially supported by National Institutes of Health (NIH/NIBIB) grants EB001685 and EB002667. The authors are grateful to the anonymous reviewers who made constructive comments. The authors thank Academic Technology-Research Services and Information Technology Services at the University of Iowa for generous support. We also thank Ms. Diane Machatka in Facility Management at the University of Iowa for contributing legacy Gateway Desktop PCs to build a test-bed PC cluster, and Mr. Deepak Bharkhada for editorial help.

References

1. H. Yu and G. Wang. Studies on implementation of the Katsevich algorithm for spiral cone-beam CT. Journal of X-Ray Science and Technology, 12:97-116, 2004.
2. H. Yu and G. Wang. Studies on artifacts of the Katsevich algorithm for spiral cone-beam CT. In Developments in X-Ray Tomography IV, Proceedings of SPIE, 5535, Denver, CO, United States, pp. 540-549, Aug 4-6, 2004.
3. A. Katsevich. Theoretically exact FBP-type inversion algorithm for spiral CT. SIAM J. Appl. Math., 62(6):2012-2026, 2002.
4. A. Katsevich. An improved exact filtered backprojection algorithm for spiral computed tomography. Advances in Applied Mathematics, 32:681-697, 2004.
5. L. A. Shepp and B. F. Logan. The Fourier reconstruction of a head section. IEEE Transactions on Nuclear Science, NS-21(3):21-34, 1974.
6. J. L. Prince and J. M. Links. Medical Imaging Signals and Systems. Pearson Prentice Hall, Upper Saddle River, New Jersey, 2006.
7. L. A. Feldkamp, L. C. Davis, and J. W. Kress. Practical cone-beam algorithm. J. Opt. Soc. Am. A, 1:612-619, 1984.
8. G. Wang, T. H. Lin, P. C. Cheng, and D. M. Shinozaki. A general cone-beam reconstruction algorithm. IEEE Transactions on Medical Imaging, 12(3):486-496, 1993.

9. P. V. R. Raman. Parallel implementation of the filtered back projection for tomographic imaging. Master's Thesis, Dept. Electrical Engineering, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, January 1995.
10. M. Miller and C. Butler. 3-D maximum a posteriori estimation for single photon emission computed tomography on massively-parallel computers. IEEE Trans. Med. Imag., 12:560-565, 1993.
11. C. M. Chen, S. Y. Lee, and Z. H. Cho. A parallel implementation of 3-D CT image reconstruction on hypercube multiprocessor. IEEE Transactions on Nuclear Science, 37(3):1333-1346, 1990.
12. G. Kontaxakis, L. G. Strauss, and G. Tzanakos. An efficient implementation of the iterative MLEM image reconstruction algorithm for PET on a Pentium PC platform. Journal of Computing and Information Technology, 7(2):153-163, 1999.
13. C. Kamphuis and F. J. Beekman. Accelerated iterative transmission CT reconstruction using an ordered subset convex algorithm. IEEE Trans. Med. Imaging, 17:1101-1105, 1998.
14. C. Johnson and A. Sofer. A data-parallel algorithm for iterative tomographic image reconstruction. In Proc. of 7th IEEE Symp. Front. Mass. Parallel Computing, IEEE Computer Society Press, 1999.
15. J. S. Kole and F. J. Beekman. Parallel statistical image reconstruction for cone-beam X-ray CT on a shared memory computation platform. Phys. Med. Biol., 50:1265-1272, 2005.
16. P. E. Danielsson, P. Edholm, J. Eriksson, and M. Magnusson. Towards exact reconstruction for helical cone-beam scanning of long objects. A new detector arrangement and a new completeness condition. In Proc. 1997 Meeting on Fully 3D Image Reconstruction in Radiology and Nuclear Medicine (Pittsburgh), D. W. Townsend and P. E. Kinahan, eds., pp. 141-144, 1997.
17. M. Defrise, F. Noo, and H. Kudo. A solution to the long-object problem in helical cone-beam tomography. Physics in Medicine and Biology, 45:623-643, 2000.
18. B. Wilkinson and M. Allen. Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Prentice Hall Press, 2005.
19. H. Tuy. 3D image reconstruction for helical partial cone beam scanners using wedge beam transform. US Patent 6,104,775, 2000.
20. X. Tang and J. Hsieh. A filtered backprojection algorithm for cone beam reconstruction using rotational filtering under helical source trajectory. Med. Phys., 31:2949-2960, 2004.
21. X. Tang, J. Hsieh, A. Hagiwara, R. A. Nilsen, J. Thibault, and E. Drapkin. A three-dimensional weighted cone beam filtered backprojection (CB-FBP) algorithm for image reconstruction in volumetric CT under a circular source trajectory. Phys. Med. Biol., 50:3889-3905, 2005.