Efficient Implementation of Parallel Image Reconstruction Algorithms for 3D X-Ray Tomography

C. Laurent (a), C. Calvin (b), J.M. Chassery (a), F. Peyrin (c)
Christophe.Laurent@imag.fr, Christophe.Calvin@imag.fr

(a) TIMC-IMAG, IAB, Domaine de la Merci, 38706 La Tronche cedex, France
(b) LMC-IMAG, INPG, 46 Av. F. Viallet, 38031 Grenoble cedex, France
(c) CREATIS, URA CNRS 1216, INSA, 69621 Villeurbanne cedex, France

This paper deals with efficient parallel implementations of reconstruction methods in 3D tomography. Depending on the method, we use one of two main approaches to parallelize the algorithms, and we propose different optimizations to improve the efficiency of the parallel algorithms. These improvements rely either on minimizing the communication time with an optimized collective communication algorithm, or on overlapping communication with computation. Experimental results on different distributed parallel machines are presented which highlight the improvements obtained.

Keywords: Tomography - Parallel 3D reconstruction methods - Overlap of communications.

1. Introduction

Tomography was developed to obtain 2D slices of human anatomy. Truly 3D tomography generalizes conventional 2D tomography and allows the reconstruction of volumes (3D images). In 3D X-ray tomography, some prototypes using the rotation of one or several cone-beam X-ray sources have been built [8,9]. In these cases, the computational problem is to reconstruct a 3D image from a set of 2D conic projections taken from different angles of view. Conventional 2D reconstruction methods are not suited to this task, and the problem has to be considered directly in 3D. These reconstruction algorithms involve large amounts of data (536 MBytes for a 512^3 image) and long computation times. For instance, the reconstruction of a 128^3 image from about 1 128^2 projections requires about 2 hours and 3 minutes on a Sun 4 workstation.
Thus, realistic image sizes for medical applications (256^3, 512^3) cannot be computed on classical computers; moreover, these machines cannot meet the memory requirements. Implementing these techniques on distributed-memory parallel machines therefore seems a good way to solve real problems in acceptable times [1-3,11]. In this paper we present efficient implementations of reconstruction algorithms for 3D X-ray tomography on different parallel machines. Depending on the type of method, we either minimize the communication time or overlap it with computation. The experimental results on the different machines highlight the improved efficiency of these optimized implementations.

The remainder of the paper is organized as follows. Section 2 describes some methods to reconstruct 3D images from a set of 2D acquisitions. Section 3 is devoted to the parallelization of these methods, and especially to the optimization of the communications. Before concluding, Section 4 presents experimental results on different parallel machines and compares the different parallel implementations.

2. 3D reconstruction methods

3D reconstruction methods from cone-beam acquisitions may be classified into analytic, algebraic and statistical methods [5]. Although these methods rely on different mathematical bases, their implementations require similar basic operations: the projection and back-projection operators. The projection operator maps the 3D space to a 2D one. The back-projection operator corresponds to the inverse operation. Three reconstruction methods have been implemented:

Feldkamp algorithm [4]: this analytic method is an extension of the 2D filtered back-projection algorithm. It consists in computing a back-projection of a weighted, filtered acquisition.

Block ART algorithm [7]: at each iteration, this algebraic method computes the difference between the 2D acquisition and the projection of the 3D image obtained at the previous iteration. This difference is then back-projected and added to the 3D image from the previous iteration.

SIRT algorithm [6]: this method is basically the same as the block ART one. In the ART method, the back-projection is done after each difference image is computed; in the SIRT method, the back-projection is performed only once all the difference images have been computed. SIRT needs more iterations than ART to obtain the same result.

The Feldkamp algorithm computes an approximate 3D image, while the ART and SIRT algorithms approach the exact 3D image by successive iterations.

3. Parallelization and data distribution

The data are divided into two sets: the 2D acquisitions and the 3D image to reconstruct. During the back-projection, each pixel of the 2D acquisitions may contribute to the value of all the voxels of the 3D image. In the same way, during the projection, each voxel may contribute to the value of all the pixels of the 2D projection images. However, each voxel (respectively pixel) is independent of the other voxels (respectively pixels). In a first implementation, we choose to distribute the 3D image in a load-balanced way among the PE processors. The parallel versions of the three algorithms are based on the parallelization of the basic operators. Two approaches have been used:

The local approach computes the basic operators locally, on the data owned by each processor; the acquisitions are therefore exchanged between the processors. For example, to compute a projection of the 3D image, processor q projects its 3D sub-image onto an acquisition P_j and sends P_j to processor (q + 1) mod PE. The projection of the whole 3D image is complete when all processors have received P_j and added their own contribution.

The global approach computes the basic operators through the network. In this approach, each projection is computed using a reduction scheme [1]. To compute the projection onto an acquisition P_j, each processor projects its 3D sub-image onto a partial image of P_j. To obtain the full projection P_j, the processors send their partial images to the same processor by using a reduction (summation) operation.

In order to improve the efficiency of these parallel methods, we have to minimize the share of communication time in the total execution time. In the local approach, we overlap the exchange of the acquisitions with the local computation on another set of acquisitions. As the projection and back-projection algorithms are similar, we present only the parallelization of the projection operator: Figure 1 gives the version without overlap and Figure 2 the version with overlap. In these algorithms, m is the total number of projections to compute and PE is the number of processors.

Algorithm of processor q
  for all P_j in {P_(q*m/PE), ..., P_((q+1)*m/PE - 1)} do
    Create a new projection P_j
    for n = 1 to PE do
      Update P_j: compute the projection of the local 3D image
      Send P_j to processor (q + 1) mod PE
      Receive P_j from processor (q - 1) mod PE
    enddo
    Store final projection P_j
  enddo

Figure 1. Projection algorithm without overlapping
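The ring exchange of Figure 1 can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the PVM implementation used for the experiments: `project` is a hypothetical toy linear projector (a weighted sum of slices, not a cone-beam model), the 3D image is split into equal slabs of slices, and both the number of slices and m are assumed to be divisible by the number of processors. The point it shows is that, because projection is additive over sub-images, a projection that has travelled once around the ring equals the projection of the whole volume.

```python
def project(slab, z0, j):
    """Toy linear projector (an assumption, NOT a cone-beam model):
    weighted sum of the slab's slices; the weight depends on the global
    slice index z0 + dz and on the view index j."""
    rows, cols = len(slab[0]), len(slab[0][0])
    image = [[0] * cols for _ in range(rows)]
    for dz, plane in enumerate(slab):
        weight = 1 + ((z0 + dz) * (j + 1)) % 3
        for r in range(rows):
            for c in range(cols):
                image[r][c] += weight * plane[r][c]
    return image


def add_images(a, b):
    """Element-wise sum of two 2D images."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def ring_projection(volume, n_proc, m):
    """Sequential simulation of the Figure 1 scheme for all processors."""
    depth = len(volume)
    assert depth % n_proc == 0 and m % n_proc == 0
    step, per_proc = depth // n_proc, m // n_proc
    rows, cols = len(volume[0]), len(volume[0][0])
    # Processor q owns slices [q*step, (q+1)*step) of the 3D image.
    slabs = [volume[q * step:(q + 1) * step] for q in range(n_proc)]
    final = {}
    for q in range(n_proc):
        for j in range(q * per_proc, (q + 1) * per_proc):
            p_j = [[0] * cols for _ in range(rows)]    # "Create a new projection P_j"
            holder = q
            for _ in range(n_proc):
                contribution = project(slabs[holder], holder * step, j)
                p_j = add_images(p_j, contribution)    # "Update P_j"
                holder = (holder + 1) % n_proc         # "Send P_j to (q + 1) mod PE"
            assert holder == q                         # P_j is back on its creator
            final[j] = p_j                             # "Store final projection P_j"
    return final


# Small integer volume: 4 slices of 2 x 3 voxels, 2 simulated processors.
volume = [[[z + r + c for c in range(3)] for r in range(2)] for z in range(4)]
projections = ring_projection(volume, n_proc=2, m=4)
reference = {j: project(volume, 0, j) for j in range(4)}
print(projections == reference)
```

In a real distributed run the PE projections in flight are updated concurrently, one per processor; the simulation follows each projection around the ring in turn, which gives the same result.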

Algorithm of processor q
  number_of_projections = 0
  number_of_update = 0
  while number_of_projections < m do
    if nb_recv((q - 1) mod PE, P, number_of_update) = false then
      Create a new projection P_j
      number_of_update_j = 0
    else
      P_j = P
      number_of_update_j = number_of_update
    endif
    Update P_j: computation of the local 3D image projection
    number_of_projections = number_of_projections + 1
    number_of_update_j = number_of_update_j + 1
    if number_of_update_j = PE then
      Store final projection P_j
    else
      Send (P_j, number_of_update_j) to processor (q + 1) mod PE
    endif
  enddo

where nb_recv(q, msg) is a non-blocking reception routine of message msg from processor q.

Figure 2. Projection algorithm with overlapping

For the global approach, we have implemented an optimized version of the reduction scheme using a binary reduction tree [1]. Figure 3 gives an example of the projection algorithm using the global approach.

Algorithm of processor q
  for j = 1 to m do
    Compute partial projection P_j
    /* The computation of projection P_j is done on processor q */
    P_j = reduce(SUM, P_j, j PE)
  enddo

where reduce(OP, buf, dest) is a global combine operation on the variable buf. The operation applied is OP and the final result is stored on processor dest.

Figure 3. Projection algorithm using global approach

With the Feldkamp and SIRT methods, the basic operators can be computed for all P_j simultaneously, whereas the Block ART method needs to compute the projection onto P_j and the back-projection of P_j before computing the basic operators with the acquisitions P_k where k > j. Therefore, both the Feldkamp and SIRT methods have been parallelized using the local approach, and the parallel algorithm of the Block ART method uses the global approach.

4. Experiments

All the 3D reconstruction methods have been implemented on different distributed-memory parallel machines using PVM. We present here some results on three of them: a Cray T3D (DEC alpha processors connected via a 3D torus network), an IBM SP2 (Power 1 processors connected via a multi-stage network) and a farm of DEC alpha processors connected via a multi-stage network. Figures 4, 5 and 6 show the execution times of the three methods to reconstruct a 3D image (128^3) from 128 2D acquisitions (128^2) on, respectively, a Cray T3D, an SP2 and a farm of processors. For each experiment, we detail the share of communication in the total execution time. The reported execution times for the ART and SIRT methods correspond to one iteration of reconstruction.

[Bar chart: total execution time and communication time, in seconds, for ART, SIRT and Feldkamp with PE = 32, 64 and 128.]

Figure 4. Execution times of the three reconstruction methods on a Cray T3D.

[Bar chart: total execution time and communication time, in seconds, for ART, SIRT and Feldkamp with PE = 8, 16 and 32.]

Figure 5. Execution times of the three reconstruction methods on an IBM SP2.

[Bar chart: total execution time and communication time, in seconds, for ART, SIRT and Feldkamp with PE = 8 and 16.]

Figure 6. Execution times of the three reconstruction methods on a farm of processors.

We can notice that the communications are very efficient on the T3D. On the contrary, on the SP2 and on the farm, the communication time accounts for a large share of the total execution time and thus has to be minimized. Moreover, because of the communication, some of the parallel algorithms are not scalable on some machines (see for instance the ART method on the SP2 in Figure 5, or the SIRT and Feldkamp methods on the processor farm in Figure 6).
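One way to reduce the communication cost of the global approach is the binary reduction tree mentioned in Section 3. The sketch below simulates such a reduction sequentially. It is a generic illustration under stated assumptions: the function name `tree_reduce` and the flattened-list image layout are hypothetical, and the source does not specify how the authors' optimized global combine or the PVM routine is implemented. It shows that the partial images of n processors can be summed onto one destination processor in ceil(log2(n)) communication rounds, instead of n - 1 successive receptions on the destination.

```python
import math


def tree_reduce(partials, dest):
    """Sum one partial image per simulated processor onto processor `dest`
    using a binary tree; returns (summed image, number of communication rounds)."""
    n = len(partials)
    buffers = [list(p) for p in partials]   # one private buffer per processor
    rounds = 0
    stride = 1
    while stride < n:
        # Ranks are taken relative to dest so that the tree is rooted at dest.
        for rel in range(0, n, 2 * stride):
            partner = rel + stride
            if partner < n:
                receiver = (rel + dest) % n
                sender = (partner + dest) % n
                # The receiver adds the sender's partial image to its own.
                buffers[receiver] = [a + b for a, b in
                                     zip(buffers[receiver], buffers[sender])]
        stride *= 2
        rounds += 1
    return buffers[dest], rounds


# Five simulated processors, each holding a flattened 3-pixel partial projection.
partials = [[q, 2 * q, 1] for q in range(5)]
total, rounds = tree_reduce(partials, dest=2)
print(total, rounds, math.ceil(math.log2(len(partials))))
```

Within one round all pairs exchange data at the same time on a real machine, so the number of rounds, not the number of messages, bounds the communication time of the reduction.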

Figures 7 and 8 show parallel versions of two methods that illustrate the optimized global and local approaches to parallelization.

[Two plots: total time and communication time, in seconds, without and with optimization, for PE = 4, 8 and 16. Panels: IBM SP2, image size 64x64x64; DEC alpha farm, image size 32x32x32.]

Figure 7. Comparison of execution times of the ART method without and with optimizations on an IBM SP2 and on a DEC alpha farm.

The two plots of Figure 7 illustrate the better efficiency of the optimized version of the ART method. We compare two versions of the parallel algorithm: the first one (without optimization) has been implemented using the global combine routine of PVM; the second one (with optimization) uses an optimized global combine algorithm. On the processor farm, the communication time is reduced by using the optimized communication algorithm. On the SP2, even if the communication time is not reduced significantly, processor idle times decrease.

[Two plots: total time and communication time, in seconds, without and with optimization, for PE = 4, 8 and 16. Panels: SP2, image size 128x128x128; DEC alpha farm, image size 128x128x128.]

Figure 8. Comparison of execution times of the Feldkamp method without and with optimizations.

In the Feldkamp method, we have implemented a version which overlaps communication with computation. On the SP2, the communication time has been greatly reduced (see the left curves of Figure 8). The overlap is less efficient on the processor farm. This can be explained by perturbations from other users of the farm's communication network. Moreover, on this machine no hardware mechanism is dedicated to communication, and thus the processor itself has to manage the communication. Nevertheless, the implementation with overlapped communication minimizes the idle time of the processors.

5. Conclusion

We have presented in this paper efficient parallel implementations of reconstruction methods for 3D tomography. Using communication optimizations such as overlapping or improved collective communication algorithms, the presented methods lead to scalable parallel algorithms. This allows us to reconstruct images of realistic size. For example, an image of size 256^3 has been reconstructed from 256 acquisitions of size 256^2 on the Cray T3D with 128 processors in 5 minutes. The same problem is solved in 22 hours on a SUN4 and in 3 hours on an IBM 3090.

REFERENCES

1. H. Charles, J. Li, and S. Miguet, 3D image processing on distributed memory parallel computers, SPIE, 1905 (1993).
2. C. Chen, S. Lee, and Z. Cho, A Parallel Implementation of a 3-D CT Image Reconstruction on Hypercube Multiprocessor, IEEE Transactions on Nuclear Science, 37 (1990), pp. 1333-1346.
3. C. Chen, S. Lee, and Z. Cho, Parallelisation of EM Algorithm for a 3-D PET Image Reconstruction, IEEE Transactions on Medical Imaging, 10 (1991), pp. 513-522.
4. L. Feldkamp, L. Davis, and J. Kress, Practical Cone-beam algorithm, Journal of Opt. Soc. Am., 1 (1984), pp. 612-619.
5. L. Garnero and F. Peyrin, Methodes de Reconstruction 3D en Tomographie X, tech. rep., GDR TDSI CNRS, France, May 1993. Rapport de synthese 93-1.
6. P. Gilbert, Iterative Methods for the Three Dimensional Reconstruction of an Object from Projections, Journal Theor. Biol., 36 (1972), pp. 105-117.
7. F. Peyrin, R. Goutte, and M. Amiel, 3D Reconstruction from Cone-beam Projections by Block Iterative Technic, in "SPIE Medical Imaging IV" proceedings, San Jose, CA, Feb. 1991.
8. E. Ritman, J. Kinsey, R. Robb, L. Harris, and B. Gilbert, Physics and technical considerations in the design of the DSR, Journal of Roentgenology, 134 (1980), pp. 369-374.
9. D. Saint-Felix et al., A New System for 3D computerized X-RAY angiography: first in vivo result, in Proceedings of the Annual Conference of the IEEE Engineering in Medicine and Biology Society, 1992, pp. 251-252.
10. R. van de Geijn, Massively Parallel LINPACK Benchmark on the Intel Touchstone and iPSC/860 Systems, Computer Science Technical Report TR-91-28, University of Texas, Aug. 1991.
11. E. Zapata, I. Benavides, F. Rivera, J. Brugera, and J. Crazo, Image Reconstruction on Hypercube Computers, in Proceedings of the Third Symposium on the Frontiers of Massively Parallel Computation, 1990, pp. 127-133.