Parallel iterative cone beam CT image reconstruction on a PC cluster

Galley Proof 30/04/2005; 12:41 File: xst127.tex; BOKCTP/ljl p. 1 Journal of X-Ray Science and Technology 13 (2005) 1 10 1 IOS Press Parallel iterative cone beam CT image reconstruction on a PC cluster Xiang Li a, Jun Ni c,d and Ge Wang a,b a Department of Biomedical Engineering, The University of Iowa, Iowa City, IA 52242, USA b Department of Radiology, The University of Iowa, Iowa City, IA 52242, USA c Department of Computer Science, The University of Iowa, Iowa City, IA 52242, USA d Research Services, ITS, The University of Iowa, Iowa City, IA 52242, USA E-mail: {xiang-li, jun-ni, ge-wang}@uiowa.edu Received 30 June 2004 Revised 30 July 2004 Accepted 7 August 2004 Abstract. The iterative reconstruction (IR) algorithms are able to generate CT images with higher quality compared with the filtered backprojection method when the projection data is noisy or the radiation dose is low. IR can also be used when the data is incomplete. The major disadvantage of the IR is its high demand on computation and slow reconstruction. To improve the time performance of the IR we parallelized four representative iterative algorithms: EM, SART and their ordered subset (OS) versions on a Linux PC cluster. A micro-forward-back-projector was implemented for increased parallelization. Parameters are cached at the ray level during forward-projection to further reduce time used for back-projection. The speed-up and efficiency factors of our parallel implementation are reported. 1. Introduction The iterative reconstruction (IR) technique is a branch of methods other than the filtered back-projection (FBP) for the computed tomography (CT). We use IR to include algorithms from both the statistical reconstruction (SR) and the algebraic reconstruction technique (ART) because they all compute the final image through a loop of steps. There are many IR algorithms available. Representative among them are the maximum likelihood (ML) expectation maximization (EM) [1 3] and the simultaneous algebraic reconstruction technique (SART) [4 6]. With ML-EM the image is obtained iteratively as an optimal estimate that maximizes the likelihood of the detection of the actual measured photons of a statistical modeling of the imaging system. The EM method can also be deterministically interpreted as the process of minimizing the I-divergence error between the estimated and measured projection data in nonnegative space [7]. With SART the algorithm repeatedly minimizes the mean square error between the estimated and measured projections in real space. Both EM and SART, together with other IR algorithms, are superior to the FBP method in terms of better image quality (contrast and resolution) under noisy projections [8], which is especially prominent in the positron emission tomography (PET) and single photon emission computed tomography (SPECT) where the noise level in the projection is high. IR is capable of modeling the PET and SPECT systems more precisely by compensation for effects of attenuation, scatter and distance dependent detector response (DDDR) [9 13]. IR can also be 0895-3996/04/$17.00 2005 IOS Press and the authors. All rights reserved

Galley Proof 30/04/2005; 12:41 File: xst127.tex; BOKCTP/ljl p. 2 2 X. Li et al. / Parallel iterative cone beam CT image reconstruction on a PC cluster used with incomplete projection data and for metal artifacts deduction where FBP fails [14]. Modern commercial PET and SPECT scanners have begun to include IR as a standard algorithm in the software package [15]. However, the major disadvantage of the IR is its high demand on computation. For a large set of data from 3D volume scan it might take several hundreds hours for IR to run on a single computer, preventing it from real time clinical applications. The reason is that besides updating the image a single iteration of the IR needs one forward-projection and one back-projection whereas the FBP only needs one back-projection. As an example for EM often at least 30 iterations are needed. One approach to improving the time performance of the IR is to accelerate the algorithm by applying the ordered-subset (OS) concepts [16,17] to the projection data during iteration in which the projections are divided into smaller-sized groups then the IR algorithm is applied to every subset in sequence. For example, the OS-EM has reduced the reconstruction time by an order of magnitude than the EM [16]. Other efforts to speed up the IR algorithm have focused on optimizing the forward and back projection schemes, including the application of the texture rendering hardware to increase the projection calculation speed [18,19]. Another approach has been focused on the parallelization of the computation. As a matter of fact, parallel computing can also quicken the back-projection part of the FBP method. Due to the inherent nature of data parallelization in IR algorithms, parallel implementation applies to all IR algorithms. Various parallelization schemes have been proposed and they can be categorized into two types in terms of the hardware used in the system. The first type utilized a centralized parallel system such as VLSI architecture computers, vector computers, large-scale parallel computers, and shared memory multiprocessor computers [20 26]. The second type of parallel computing systems employed a network of general-purpose computers and the cluster of PCs was connected by a fast local area network (LAN). As the computing technology advances dramatically many recent parallel implementations were based on the second type [27 30]. Clusters have been built on either WinNT or Unix/Linux platform and could be a hybrid of SIMD computers and MIMD system. For example [29], and [31] used a cluster of computers each with four processors. Carefully designed load-balancing and inter-process communication algorithms have also been used to optimize the parallel system s performance [32]. With the development of the distributed computing technology, client-server and peer-to-peer network models have been used to decentralize the image reconstruction system [27,29,33]. Most of these works parallelized EM and OS-EM algorithms for PET data. In this paper we parallelized four representative IR algorithms: EM, OS-EM, SART, and OS-SART on a Linux PC cluster. The projection data were generated from a simulation of the spiral cone-beam X-ray CT scan. A micro-forward-back-projector was implemented for increased parallelization. Parameters are cached at the ray level during forward-projection to further reduce time used for back-projection. In next section we briefly reviewed IR algorithms and introduced the parallelization scheme. The performance of our parallel implementation was reported in the third section. We concluded the paper with discussion in the last section. 2. Methods 2.1. Review of EM and SART algorithms and their ordered-subset versions Compared with the analytic method it is easy to adapt the IR algorithms to various scan geometries and EM, OS-EM, and SART for cone-beam geometry have been reported. In our simulation of the cone-beam X-ray scan, the volume is sampled with a cube of voxels of equal size. (Figure 1) Each voxel

Galley Proof 30/04/2005; 12:41 File: xst127.tex; BOKCTP/ljl p. 3 X. Li et al. / Parallel iterative cone beam CT image reconstruction on a PC cluster 3 y x z (a) (b) Fig. 1. The cone-beam scan geometry. (a) The volume is sampled with a cube of voxels. (b) The spiral X-ray source locus. has a constant attenuation factor x j,j =1,...,N, where N is the total number of voxels. Estimated projection ˆb i,i=1,...,m is the line integral along the i-th ray path and b i,i=1,...,m is the measured projection, where M is the total number of rays. Let x and b represent the corresponding N-dimensional and M-dimensional column vector respectively; A =(a ij ) be the matrix mapping x to b, where a ij measures the contribution of x j to b i. We have the following linear system: Ax = b (1) The CT image reconstruction problem is to find x. The EM algorithm is formulated as follows: M a ij (b i /ˆb i ) x (k+1) j = x (k) i=1 j,j =1,...,N. (2) M a ij i=1 The SART formula is expressed as: M (b i a ˆb i ) ij i=1 N a ij x (k+1) j = x (k) j=1 j + M,j =1,...,N. (3) a ij i=1 In the ordered-subset (OS) version of IR the projection data are divided into smaller sets S t,t = 1, 2,...,L, where L is the total number of subsets. The IR algorithms are applied to each individual

Galley Proof 30/04/2005; 12:41 File: xst127.tex; BOKCTP/ljl p. 4 4 X. Li et al. / Parallel iterative cone beam CT image reconstruction on a PC cluster Updated Image forward-project Data Measurement Compare Measurement Measurement Error (Weighted) Back-project Correction Measurement Fig. 2. Use of ordered subsets for iterative reconstruction. For each_iteration Do For each_projection Do forward-projection get-comparison back-projection update-image (a) For each_iteration Do For each_subset Do For each_projection Do forward-projection get-comparison back-projection update-image (b) Fig. 3. The pseudo-code for IR (a) and its OS version (b). subset. The intermediate reconstructed image is used as the input to the next subset reconstruction. A single iteration is defined as a loop of calculations that pass through all subsets. Figure 2 gives a schematic illustration of the ordered- subset IR. The formulation of OS-EM and OS-SART can be found in [16,34]. In this paper the ordering of subsets is simply sequential according to the original projection view s order. 2.2. Parallel implementation of the IR algorithm In this paper we parallelized four representative IR algorithms: EM, OS-EM, SART, and OS-SART. Other IR algorithms can be parallelized in the same manner. These algorithms can be described by the pseudo-code as Fig. 3. The most time-consuming part of IR is the forward-projection and back-projection. The computation involved in steps of forward-projection, get-comparison, and back-projection is in a piecewise manner for all projection data. As soon as one piece of the projection data is calculated after ray-tracing along a specific path, the data can be compared with the corresponding measured projection and then be

Galley Proof 30/04/2005; 12:41 File: xst127.tex; BOKCTP/ljl p. 5 X. Li et al. / Parallel iterative cone beam CT image reconstruction on a PC cluster 5 Table 1 The time needed in seconds to reconstruct images in Fig. 5 by sequential IR algorithms and their parallel counterparts EM P OS-EM P SART P OS-SART P 32 iterations 2 iterations 16 subsets 32 iterations 2 iterations 16 subsets np = 1 1,420.28 120.79 1,428.18 121.11 np = 2 728.43 100.52 731.50 101.15 np = 4 379.89 98.67 381.96 99.13 np = 8 224.38 108.24 226.14 108.83 np = 16 162.70 140.66 163.20 140.45 Sequential 2,584 192 2,590 192 For each_iteration Do For each_projection_assigned_to_worker_node Do micro-forward-back-projection update-image-on-master-node send-updated-image-to-all-worker-nodes (a) For each_iteration Do For each_subset Do For each_projection_assigned_to_worker_node Do micro-forward-back-projection update-image-on-master-node send-updated-image-to-all-worker-nodes (b) Fig. 4. The pseudo-code for parallel IR (a) and its OS version (b). back-projected. We took an object-oriented view of the abstract definition of a ray which allowed each ray to take care of its three actions (forward-projection, get-comparison, and back-projection) and three pieces of data (estimated projection, measured projection, and the comparison between the two). The ray was implemented as a micro-forward-back-projector and its implementation facilitated parallelization of the IR algorithm. Because the back-projection is along the same ray-path and using the same interpolation, we cached six parameters (three for ray path direction three for interpolation) during the forward-projection step. There are six direction parameters and eight interpolation weight factors in total. We find that more aggressive caching does not reduce computation time further because the cost for saving and retrieving more parameters overruns the time saved by caching. We divided projection data into groups and assigned each group to a process on the worker node. Figure 4 The worker node carried out computation of the micro-forward-back-projection. The master node gathered temporary images after back-projection and combined them to generate the updated image and then send it back to all worker nodes for next iteration of reconstruction. For OS version of IR algorithms projection data within each subset was divided and assigned among worker nodes. We used message passing model and the MPI protocol [35] for communication between master node and worker nodes. 2.3. Numerical simulation A 3D Shepp-Logan phantom bounded by a 2.0 3 volume of arbitrary unit was used as the object. The phantom was sampled by 128 3 voxels of equal size. 96 projection views were measured along a spiral

Galley Proof 30/04/2005; 12:41 File: xst127.tex; BOKCTP/ljl p. 6 6 X. Li et al. / Parallel iterative cone beam CT image reconstruction on a PC cluster Node 3 Switch 2 Client Node 1 Switch 1 Node 7 Node 2 Node 8 Switch 3 Node 12 Fig. 5. The PC cluster topology. with one turn and a total translation of 1.0 along the axial direction. The source-origin distance was 5.0. A planar detector with 128 64 cells was used which defined a cone-angle of 20 and a fan angle of 30. Tri-linear interpolation was used in ray-tracing to obtain the value at sampling points from their eight neighboring voxels along the ray path. These neighboring points and associated interpolation weights were cached for a single ray so that during back-projection these parameters would not be computed again. All data were stored as 4-byte floating-point type. The PC cluster has 14 nodes running Linux operating system with dual 1G Hz Pentium III processors and 1 GB memory. The processor cache size is 32 KB L1 and 256 KB L2. They are connected with gigabit/s switches. Figure 5 illustrates the topology of the cluster. The MPI implementation MPICH 1.2.5 [36,37] is installed as the message-passing interface among nodes. The sequential and parallel C code was compiled with gcc and mpicc, respectively. 3. Performance tests results Figure 6 shows the reconstruction results from four parallel IR algorithms. The image is one slice of the 3D Shepp-Logan phantom at z = 0.25 where all ellipsoids intersect with the cross-section. Table 1 gives the time needed in seconds to reconstruct images in Fig. 6 by sequential IR algorithms and their parallel counterpart. Timing for sequential IR algorithms was taken using system clock. Timing for parallel IR algorithms was taken using MPI built-in function MPI Wtime(), which is often referred to as the wall time. The MPI function MPI Barrier() was called when timing started and ended to synchronize all processes. We can see that reconstruction of images with same visual appearance in parallel is a magnitude faster than in sequence for both EM and SART. For OS-EM and OS-SART the parallel implementation is approximately two times faster than their sequential counterparts. Note that the parallel reconstruction using only 1 process is still faster than the sequential reconstruction with a speed-up gain of 1.81. The reason is that the MPICH C compiler mpicc has internal optimization for multi-processor nodes. Tables 2 and 3 give the speed up and efficacy factors for each parallel algorithm. The speed-up is defined as: Time parallel (np =1) Speed up (np) = (3.1) Time Parallel (np) The efficiency is defined as: Speed up (np) Efficiency = (3.2) np

Galley Proof 30/04/2005; 12:41 File: xst127.tex; BOKCTP/ljl p. 7 X. Li et al. / Parallel iterative cone beam CT image reconstruction on a PC cluster 7 Table 2 The speed-up factors for each parallel IR algorithms EM P OS-EM P SART P OS-SART P np = 1 1 1 1 1 np = 2 1.95 1.20 1.95 1.20 np = 4 3.74 1.22 3.74 1.22 np = 8 6.33 1.12 6.32 1.11 np = 16 8.73 0.859 8.75 0.862 Table 3 The efficiency factors for each parallel IR algorithms EM P OS-EM P SART P OS-SART P np = 1 1 1 1 1 np = 2 0.975 0.60 0.975 0.60 np = 4 0.935 0.305 0.935 0.305 np = 8 0.791 0.140 0.790 0.139 np = 16 0.546 0.0538 0.547 0.0538 (a) (b) (c) (d) Fig. 6. The reconstructed image slice at position z = 0.25. (a) EM 32 iterations. (b) OS-EM 2 iterations 16 subsets. (c) SART 32 iterations. (d) OS-SART 2 iterations 16 subsets. where np is the number of processes. The optimal linear speed-up is obtained if Efficiency = 1. If Efficiency < 1 np the parallel implementation actually exhibits slow down. Please note that the EM P and SART P have almost identical speed up and efficiency performance. The same thing happens to OS-EM P and OS-SART P. Therefore in Fig. 7 only the speed-up and efficiency for EM P and OS-EM P are drawn. From the figure we can see that the greatest speed-up is achieved for EM and OS-EM using 16 processes and 4 processes, respectively. The sub-linear speed up for EM P after nodes 8 is caused by the extra communication when passing from node 3 to node 4 via switch 2. (See Fig. 5) Due to the large load of data communication resulted from ordered subset technique, the parallel OS-EM that partitions data only in the projection domain is very inefficient in terms of processes used. Optimized data partition and optimization schemes are needed to improve the efficiency and to reconstruct large volume. 4. Discussion With the ever-increasing computing power from the commodity desktop computers a PC cluster becomes a practical tool for both research and application purposes. In this paper we parallelized four representative IR algorithms: EM, OS-EM, SART, and OS-SART with spiral cone-beam scan simulation on a Linux PC cluster to demonstrate this capability. For ordered subset versions of EM and SART we used the simple sequential-order grouping of projection data. To achieve the same image reconstruction

Galley Proof 30/04/2005; 12:41 File: xst127.tex; BOKCTP/ljl p. 8 8 X. Li et al. / Parallel iterative cone beam CT image reconstruction on a PC cluster Fig. 7. Top-left and right: the speed-up factors for EM P and OS-EM P, respectively. Bottom-left and right: the efficiency factors for EM P and OS-EM P, respectively quality the parallel EM and SART take less than three minutes while the normal sequential reconstructions takes more than 47 minutes. The speed-up factor is not so prominent, however, for the OS-EM and OS-SART. The reason is that traditional ordered subsets technique is inherently sequential within one iteration as every subset of projections is used in turn. A direct data-parallel partition scheme is very inefficient in this manner due to the large data integration and dispatching overhead occurred at the beginning and the end of each subset iteration. This suggests a future work of the redesign of a parallel algorithm for the ordered subsets. Acknowledgements We appreciate helpful suggestions from Dr. Shiying Zhao with the University of Iowa Radiology Department. We would like to thank Mr. James Hunsaker for the University of Iowa ITS cluster server management. This work is supported in part by an NIH grant (R01 DC03590).

Galley Proof 30/04/2005; 12:41 File: xst127.tex; BOKCTP/ljl p. 9 References X. Li et al. / Parallel iterative cone beam CT image reconstruction on a PC cluster 9 [1] J. Rockmore and A. Macovski, A maximum likelihood approach to image reconstruction, Proc Joint Automatic Control Conf, 1977, pp. 782 786. [2] L.A. Shepp and Y. Valdi, Maximum likelihood reconstruction for emission tomography, IEEE, Trans. Med. Imag. MI-1 (1982), 113 122. [3] K. Lange and R. Carson, EM reconstruction algorithms for emission and transmission tomography, J. Comput. Assist. Tomog. 8(2), (April, 1984), 302 316, [4] A.H. Andersen, Algebraic reconstruction in CT from limited views, IEEE Trans. Med. Imag. 8 (1989), 50 55. [5] A.H. Andersen and A.C. Kak, Simultaneous algebraic reconstruction technique (SART): A superior implementation of the ART algorithm, Ultrasonic Imaging 6 (1984), 81 94. [6] M. Jiang and G. Wang, Convergence of the simultaneous algebraic reconstruction technique (SART), IEEE Trans. Image Processing 12 (2003), 957 961. [7] D.L. Synder et al., Deblurring subject to nonnegativity constraints, IEEE Trans. Signal Processing 40 (1992), 1143 1150. [8] B.F. Hutton et al., A clinical perspective of accelerated statistical reconstruction, European Journal of Nuclear Medicine 24(7) (July, 1997). [9] D. Blocket et al., Maximum-likelihood reconstruction with ordered subsets in bone SPECT, J. Nucl. Med. 40(12) (1999), 1978 1984. [10] C.Y. Bai et al., Postinjection single photon transmission tomography with ordered-subset algorithms for whole-body PET imaging, IEEE Trans. Nucl. Sci. 49(1) (2002), 74 81. [11] C. Kamphuis et al., Dual matrix ordered subsets reconstruction for accelerated 3D scatter compensation in single-photon emission tomography, Eur. J. Nucl. Med 25(1) (1998), 8 18. [12] S.H. Manglos et al., Transmission maximum-likelihood reconstruction with ordered subsets for cone-beam CT, Phys. Med. Biol. 40(7) (1995), 1225 1241. [13] M.V. Narayanan et al., Optimization of regularization of attenuation and scatter- corrected Tc-99m cardiac SPECT studies for defect detection using hybrid images, IEEE Trans. Nucl. Med. 48(3) (2001), 785 789. [14] G. Wang et al., Iterative deblurring for CT metal artifact reduction, IEEE Trans. Med. Imag. 15 (1996), 657 664. [15] R. Leahy and C. Byrne, Recent developments in iterative image reconstruction for PET and SPECT, Editorial, IEEE Trans. Med. Imag. 19(4) (2000). [16] H.M. Hudson and R.S. Larkin, Accelerated image reconstruction using ordered subsets of projection data, IEEE Trans. Med. Imag. 13 (1994), 601 609. [17] C. Kamphuis and F.J. Beekman, Accelerated iterative transmission CT reconstruction using an ordered subsets convex algorithm, IEEE Trans. Med. Imaging 17(6) (December, 1998). [18] K. Mueller et al., Fast implementations of algebraic methods for three-dimensional reconstruction from cone-beam data, IEEE Trans. Med. Imag. 18 (1999), 538 548. [19] K. Mueler et al., Rapid 3-D cone-beam reconstruction with the simultaneous algebraic reconstruction technique using 2-D texture mapping hardware, IEEE Trans. Med. Imag. 19 (2000), 1227 1237. [20] W.F. Jones et al., Design of a super fast three-dimensional projection system for positron emission tomography, IEEE Trans. Nucl. Sci. 37 (1990), 800 804. [21] C. Guerrini and G. Spaletta, An image reconstruction algorithm in tomography: A version for the CRAY X-MP vector computer, Computers and Graphics 13 (1989), 367 372. [22] M. Miller and C. Butler, 3-D maximum a posteriori estimation for single photon emission computed tomography on massively-parallel comuters, IEEE Trans. Med. Imag. 12 (1993), 560 565. [23] A. McCarty and M. Miller, Maximum likelihood SPECT in clinical computation times using mesh-connected parallel computers, IEEE Trans, Med. Imag. 10 (1991), 426 436. [24] M. Atkins et al., Use of transputers in a 3-D positron emission tomography, IEEE Trans. Med. Imag. 10(3) (1991), 276 283. [25] C.M. Chen et al., A parallel implementation of 3-D CT image reconstruction on hypercube multiprocessor, IEEE Trans. Nucl. Sci. 37(3) (1990), 1333 1346. [26] C. Johnson and A. Sofer, A data-parallel algorithm for iterative tomographic image reconstruction, http://www. nih.gov/publications. [27] D. Shattuck et al., Internet2-based 3D PET image reconstruction using a PC cluster, Phys. Med. Biol. 47 (2002), 2785 2795. [28] G. Kontaxakis et al., Grid processing techniques for the iterative reconstruction of large clinical positron emission tomography sonogram datasets, Health Grid 2003, http://ww.die.upm.es/im. [29] S. Vollmar et al., Heinzel Cluster: accelerated reconstruction for FORE and OSEM3D, Phys. Med. Biol. 47 (2002), 2651 2658.

Galley Proof 30/04/2005; 12:41 File: xst127.tex; BOKCTP/ljl p. 10 10 X. Li et al. / Parallel iterative cone beam CT image reconstruction on a PC cluster [30] A. Bevilacqua, Evaluation of a fully 3-D BPF method for small animal PET images on MIMD architectures, International Journal of Modern Physics C 10(4) (1999), 1 17. [31] W. Backfrieder et al., Web-based parallel ML EM reconstruction for SPECT on SMP clusters, Proceeding of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Science, Las Vegas, Nevada, USA, June, 2001. [32] A. Bevilacqua, A dynamic load balancing method on a heterogeneous cluster of workstations, Special Issue: Parallel Computing on Networks of Computers 23(1) (1999), 49 56. [33] X. Li et al., P2P-enhanced distributed computing in medical image reconstruction, Proceeding of the International Conference on PDPTA, Las Vegas, Nevada, USA, June, 2004. http://www.world-academy-of-science.org. [34] X. Li, M. Jiang and G. Wang, A numerical simulator in VC++ on PC for iterative image reconstruction, J. of X-Ray Science and Technology 11 (2003), 61 70. [35] The message passing interface (MPI) standard, http://www-unix.mcs.anl.gov/mpi/. [36] W. Gropp, E. Lusk, N. Doss and A. Skjellum, A high-performance, portable implementation of the message passing interface standard, Parallel Computing 22(6) (1996), 789 828. [37] W.D. Gropp and E. Lusk, User s guide for MPICH, a portable implementation of MPI, ANL-96/6, Mathematics and Computer Science Division, Argonne National Laboratory, 1996.