Efficient Implementation of Parallel Image Reconstruction Algorithms for 3D X-Ray Tomography

C. Laurent a, C. Calvin b, J.M. Chassery a, F. Peyrin c
Christophe.Laurent@imag.fr  Christophe.Calvin@imag.fr

a TIMC-IMAG, IAB, Domaine de la Merci, 3876 La Tronche cedex, France
b LMC-IMAG, INPG, 46 Av. F. Viallet, 3831 Grenoble cedex, France
c CREATIS, URA CNRS 1216, INSA, 69621 Villeurbanne cedex, France

This paper deals with efficient parallel implementations of reconstruction methods in 3D tomography. Depending on the method, we use two main approaches to parallelize the algorithms, and we propose different optimizations to improve the efficiency of the parallel algorithms. These improvements are based either on minimizing the communication time by using an optimized collective communication algorithm, or on overlapping the communications with the computations. Experimental results on different distributed-memory parallel machines are presented which highlight the improvements obtained.

Keywords: Tomography - Parallel 3D reconstruction methods - Overlap of communications.

1. Introduction

Tomography was developed to obtain 2D slices of human anatomy. Truly 3D tomography is a generalization of conventional 2D tomography allowing the reconstruction of volumes (3D images). In 3D X-ray tomography, some prototypes using the rotation of one (or several) cone-beam X-ray source(s) have been built [8,9]. In these cases, the computational problem is to reconstruct a 3D image from a set of 2D conic projections taken from different angles of view. The conventional 2D reconstruction methods are not suited to this geometry, and the problem has to be considered directly in 3D. These reconstruction algorithms involve large amounts of data (536 MBytes for a 512³ image) and long computation times. For instance, the reconstruction of a 128³ image from 128 projections of size 128² requires about 2 hours and 30 minutes on a Sun 4 workstation.
Thus, realistic image sizes for medical applications (256³, 512³) cannot be computed on classical computers. Moreover, the memory requirements cannot be met by these machines. The implementation of these techniques on distributed-memory parallel machines therefore seems to be a good way to solve real problems in suitable times [1-3,11]. We present in this paper efficient implementations of reconstruction algorithms for 3D X-ray tomography on different parallel machines. Depending on the type of method, we either minimize the communication time or overlap it. The experimental results on the different machines highlight the improvement in efficiency of these optimized implementations.

The remainder of the paper is organized as follows: in section 2 we describe some methods to reconstruct 3D images from a set of 2D acquisitions. Section 3 is devoted to the parallelization of these methods and especially to the optimization of the communications. Before concluding, we present experimental results on different parallel machines and compare the different parallel implementations.

2. 3D reconstruction methods

3D reconstruction methods from cone-beam acquisitions may be classified into analytic, algebraic and statistical methods [5]. Although these methods rely on different mathematical bases, their implementations require similar basic operations: the projection and the back-projection operators. The projection operator maps from the 3D space to a 2D one; the back-projection operator corresponds to the inverse operation. Three reconstruction methods have been implemented:

- Feldkamp algorithm [4]: this analytic method is an extension of the 2D filtered back-projection algorithm. It consists in computing a back-projection of a weighted filtered acquisition.

- Block ART algorithm [7]: this algebraic method consists in computing, at each iteration, the difference between the 2D acquisition and the projection of the 3D image obtained at the previous iteration. This difference is then back-projected and summed with the initial 3D image.

- SIRT algorithm [6]: this method is basically the same as the block ART one. In the ART method, the back-projection is done after each computation of a difference image; in the SIRT method, the back-projection is performed once all the difference images have been computed. This method needs more iterations than the ART method to obtain the same result.

The Feldkamp algorithm computes an approximated 3D image, while the ART and SIRT algorithms approach the exact 3D image by successive iterations.

3.
Parallelization and data distribution

The data are divided into two sets: the 2D acquisitions and the 3D image to reconstruct. Each pixel of the 2D acquisitions may contribute to the value of every voxel of the 3D image during the back-projection. In the same way, during the projection, each voxel may contribute to the value of every pixel of the 2D projection images. However, each voxel (respectively pixel) is independent of the other voxels (respectively pixels). In a first implementation, we chose to distribute the 3D image in a load-balanced way among the PE processors. The parallel versions of the three algorithms are based on the parallelization of the basic operators. Two approaches have been used:
The local approach computes the basic operators locally on the data owned by each processor, so the acquisitions are exchanged between the processors. For example, to compute a projection of the 3D image, processor q projects its 3D sub-image onto an acquisition P_j and sends P_j to processor (q + 1) mod PE. The projection of the whole 3D image is obtained when all the processors have received P_j and computed their local projection onto it.

The global approach computes the basic operators through the network. In this approach, each projection is computed using a reduction scheme [1]. To compute the projection onto an acquisition P_j, each processor projects its 3D sub-image onto a partial image P_j. To realize the projection P_j, the processors then send their partial images to the same processor by using a reduction-summation operation.

In order to improve the efficiency of these parallel methods, we have to minimize the share of the communication time in the total execution time. In the local approach, we have overlapped the exchange of the acquisitions with the local computation on another set of acquisitions. As the projection and back-projection algorithms are similar, we present only the parallelization of the projection operator: without overlap in figure 1, and with overlap in figure 2. In these algorithms, m represents the total number of projections to compute, and PE is the number of processors.

Algorithm of processor q
for all P_j ∈ {P_{q·m/PE}, ..., P_{(q+1)·m/PE − 1}} do
  Create a new projection P_j
  for n = 1 to PE do
    Update P_j: compute the projection of the local 3D sub-image
    Send P_j to processor (q + 1) mod PE
    Receive P_j from processor (q − 1) mod PE
  enddo
  Store final projection P_j
enddo

Figure 1. Projection algorithm without overlapping.
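The ring circulation of figure 1 can be sketched as a short sequential simulation (hypothetical: a toy sum-along-one-axis operator stands in for the cone-beam projection, and list indexing stands in for the ring sends and receives):

```python
import numpy as np

def project(volume):
    """Toy stand-in for the cone-beam projection: sum along axis 0."""
    return volume.sum(axis=0)

def ring_projection(volume, PE, m):
    """Sequential simulation of the local approach (figure 1).

    The 3D image is split into PE slabs (slab q lives on processor q).
    Each projection P_j is created on one processor, then passed around
    the ring; after PE update-and-send steps it has accumulated every
    slab's contribution, i.e. the projection of the whole image.
    """
    slabs = np.array_split(volume, PE, axis=0)
    projections = []
    for j in range(m):
        P_j = np.zeros(volume.shape[1:])   # "Create a new projection P_j"
        q = j % PE                         # processor that creates P_j
        for _ in range(PE):                # PE hops around the ring
            P_j += project(slabs[q])       # update with the local sub-image
            q = (q + 1) % PE               # "Send P_j to (q + 1) mod PE"
        projections.append(P_j)            # "Store final projection P_j"
    return projections

# Sanity check: the circulated projection equals a direct projection.
vol = np.arange(4 * 3 * 3, dtype=float).reshape(4, 3, 3)
assert np.allclose(ring_projection(vol, PE=2, m=1)[0], project(vol))
```

The simulation only illustrates the data flow; in the actual implementation each hop is a send/receive pair and the PE slabs are updated concurrently.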
Algorithm of processor q
number_of_projections = 0
while number_of_projections < m do
  if nb_recv((q − 1) mod PE, P, number_of_update) = false then
    Create a new projection P_j
    number_of_update_j = 0
  else
    P_j = P
    number_of_update_j = number_of_update
  endif
  Update P_j: compute the projection of the local 3D sub-image
  number_of_projections = number_of_projections + 1
  number_of_update_j = number_of_update_j + 1
  if number_of_update_j = PE then
    Store final projection P_j
  else
    Send (P_j, number_of_update_j) to processor (q + 1) mod PE
  endif
enddo

where nb_recv(q, msg, ...) is a non-blocking reception routine for a message msg from processor q.

Figure 2. Projection algorithm with overlapping.

For the global approach, we have implemented an optimized version of the reduction scheme using a binary reduction tree [1]. We give in figure 3 an example of the projection algorithm using the global approach.

Algorithm of processor q
for j = 1 to m do
  Compute partial projection P_j
  /* The computation of projection P_j is completed on processor j mod PE */
  P_j = reduce(sum, P_j, j mod PE)
enddo

where reduce(op, buf, dest) is a global combine operation on the variable buf. The operation applied is op and the final result is stored on processor dest.

Figure 3. Projection algorithm using the global approach.

With the Feldkamp and SIRT methods, the basic operators can be computed for all P_j simultaneously, whereas the Block ART method needs to compute the projection onto P_j and the back-projection of P_j before computing the basic operators on the acquisitions P_k
where k > j. Hence, both the Feldkamp and SIRT methods have been parallelized using the local approach, while the parallel algorithm of the Block ART method uses the global approach.

4. Experiments

All the 3D reconstruction methods have been implemented on different distributed-memory parallel machines using PVM. We present here some results on three of them: a Cray T3D (DEC Alpha processors connected via a 3D torus network), an IBM SP2 (Power 1 processors connected via a multi-stage network) and a farm of DEC Alpha processors connected via a multi-stage network. We present in figures 4, 5 and 6 the execution times of the three methods for the reconstruction of a 3D image (128³) from 128 2D acquisitions (128²) on, respectively, the Cray T3D, the SP2 and the farm of processors. For each experiment, we have detailed the share of the communication in the total execution time. The reported execution times for the ART and SIRT methods correspond to one iteration of reconstruction.

[Bar chart: execution time (sec) of ART, SIRT and Feldkamp for PE = 32, 64 and 128, showing total time and communication time on the T3D.]
Figure 4. Execution times of the three reconstruction methods on a Cray T3D.
[Bar chart: execution time (sec) of ART, SIRT and Feldkamp for PE = 8, 16 and 32, showing total time and communication time on the SP2.]
Figure 5. Execution times of the three reconstruction methods on an IBM SP2.

[Bar chart: execution time (sec) of ART, SIRT and Feldkamp for PE = 8 and 16, showing total time and communication time on the farm.]
Figure 6. Execution times of the three reconstruction methods on a farm of processors.

We can notice that the communications are very efficient on the T3D. On the contrary, on the SP2 and on the farm, the share of the communication time is very large and thus has to be minimized. Moreover, because of the communications, some of the parallel algorithms are not scalable, depending on the machine (see for instance the ART method on the SP2 in figure 5, or the SIRT and Feldkamp methods on the processor farm in figure 6).
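One way to lower this communication cost in the global approach is to organize the reduction-summation as a binary tree, which needs about ⌈log2 PE⌉ communication steps instead of the PE − 1 sequential receives of a single root processor. A minimal sequential sketch (hypothetical: plain arrays stand in for the messages exchanged between processors):

```python
import numpy as np

def tree_reduce_sum(partials):
    """Sequential simulation of a binary-tree reduction-summation.

    At step s (step size 2**s), every processor q that is a multiple of
    2**(s+1) receives and adds the buffer of its partner q + 2**s, so
    the full sum reaches processor 0 in ceil(log2(PE)) steps.
    """
    bufs = [np.asarray(p, dtype=float).copy() for p in partials]
    PE = len(bufs)
    step = 1
    while step < PE:
        for q in range(0, PE - step, 2 * step):
            bufs[q] += bufs[q + step]   # q receives partner q + step's buffer
        step *= 2
    return bufs[0]                       # final projection on processor 0

# 8 partial projection images; their sum is 0 + 1 + ... + 7 = 28.
parts = [np.full((2, 2), float(q)) for q in range(8)]
assert np.allclose(tree_reduce_sum(parts), np.full((2, 2), 28.0))
```

In the real implementation each `+=` is a receive-and-accumulate on a distinct processor, so the additions of one step proceed in parallel; the sketch only fixes the combining pattern.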
We present in figures 7 and 8 parallel versions of two methods which illustrate, respectively, the optimized global and local approaches of parallelization.

[Bar charts: total time and communication time of the ART method, without and with optimization, for PE = 4, 8 and 16; left: IBM SP2, image size 64x64x64; right: DEC Alpha farm, image size 32x32x32.]
Figure 7. Comparison of execution times of the ART method without and with optimizations on an IBM SP2 and on a DEC Alpha farm.

These figures illustrate the better efficiency of the optimized version of the ART method. We compare here two versions of the parallel algorithm: the first one (without optimization) has been implemented using the global combine routine of PVM; the second one (with optimization) uses an optimized global combine algorithm. On the processor farm, the communication time is reduced by using the optimized communication algorithm. On the SP2, even if the communication time is not reduced in a significant way, the processor idle times decrease.

[Bar charts: total time and communication time of the Feldkamp method, without and with optimization, for PE = 4, 8 and 16; left: SP2, right: DEC Alpha farm, image size 128x128x128.]
Figure 8. Comparison of execution times of the Feldkamp method without and with optimizations.

For the Feldkamp method, we have implemented a version which allows the overlapping of the communication time. On the SP2, the communication time has been widely reduced (see the left part of figure 8). The overlap is less efficient in the case of the implementation on
the processor farm. This can be explained by perturbations from other users of the communication network of the farm. Moreover, on this machine, no hardware mechanism is dedicated to the communications, so the processor itself has to manage them. Nevertheless, the implementation with overlapped communications minimizes the idle time of the processors.

5. Conclusion

We have presented in this paper efficient parallel implementations of reconstruction methods for 3D tomography. Using communication optimizations, such as overlapping or improved collective communication algorithms, the presented methods lead to scalable parallel algorithms. This allows us to reconstruct realistic image sizes. For example, an image of size 256³ has been reconstructed from 256 acquisitions of size 256² on the Cray T3D with 128 processors in 5 minutes. The same problem is solved in 22 hours on a SUN 4 and in 3 hours on an IBM 3090.

REFERENCES

1. H. Charles, J. Li, and S. Miguet, 3D image processing on distributed memory parallel computers, SPIE, 195 (1993).
2. C. Chen, S. Lee, and Z. Cho, A Parallel Implementation of a 3-D CT Image Reconstruction on Hypercube Multiprocessor, IEEE Transactions on Nuclear Science, 37 (1990), pp. 1333-1346.
3. C. Chen, S. Lee, and Z. Cho, Parallelisation of the EM Algorithm for 3-D PET Image Reconstruction, IEEE Transactions on Medical Imaging, 10 (1991), pp. 513-522.
4. L. Feldkamp, L. Davis, and J. Kress, Practical cone-beam algorithm, Journal of the Optical Society of America A, 1 (1984), pp. 612-619.
5. L. Garnero and F. Peyrin, Methodes de Reconstruction 3D en Tomographie X, tech. rep., GDR TDSI CNRS, France, May 1993. Rapport de synthese 93-1.
6. P. Gilbert, Iterative Methods for the Three-Dimensional Reconstruction of an Object from Projections, Journal of Theoretical Biology, 36 (1972), pp. 105-117.
7. F. Peyrin, R. Goutte, and M. Amiel, 3D Reconstruction from Cone-beam Projections by Block Iterative Technic, in SPIE Medical Imaging IV Proceedings, San Jose, CA, Feb. 1991.
8. E. Ritman, J. Kinsey, R. Robb, L. Harris, and B. Gilbert, Physics and technical considerations in the design of the DSR, American Journal of Roentgenology, 134 (1980), pp. 369-374.
9. D. Saint-Felix et al., A New System for 3D Computerized X-Ray Angiography: First In Vivo Results, in Proceedings of the Annual Conference of the IEEE Engineering in Medicine and Biology Society, 1992, pp. 251-252.
10. R. van de Geijn, Massively Parallel LINPACK Benchmark on the Intel Touchstone Delta and iPSC/860 Systems, Computer Science Technical Report TR-91-28, University of Texas, Aug. 1991.
11. E. Zapata, I. Benavides, F. Rivera, J. Bruguera, and J. Carazo, Image Reconstruction on Hypercube Computers, in Proceedings of the Third Symposium on the Frontiers of Massively Parallel Computation, 1990, pp. 127-133.