Parallel computing techniques for computed tomography


University of Iowa, Iowa Research Online: Theses and Dissertations, Spring 2011

Parallel computing techniques for computed tomography
Junjun Deng, University of Iowa
Copyright 2011 Junjun Deng

This dissertation is available at Iowa Research Online.

Recommended Citation: Deng, Junjun. "Parallel computing techniques for computed tomography." PhD (Doctor of Philosophy) thesis, University of Iowa, 2011. Part of the Applied Mathematics Commons.

PARALLEL COMPUTING TECHNIQUES FOR COMPUTED TOMOGRAPHY

by Junjun Deng

An Abstract Of a thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Applied Mathematical and Computational Sciences in the Graduate College of The University of Iowa

May 2011

Thesis Supervisor: Professor Lihe Wang

ABSTRACT

X-ray computed tomography is a widely adopted medical imaging method that uses projections to recover the internal image of a subject. Since the invention of X-ray computed tomography in the 1970s, several generations of CT scanners have been developed. As 3D-image reconstruction increases in popularity, the long processing time associated with these machines has to be significantly reduced before they can be practically employed in everyday applications. Parallel computing is a computing technique that utilizes multiple computer resources to process a computational task simultaneously; each resource computes only a part of the whole task, thereby greatly reducing computation time. In this thesis, we use parallel computing technology to speed up the reconstruction while preserving the image quality. Three representative reconstruction algorithms, namely the Katsevich, EM, and Feldkamp algorithms, are investigated in this work. With the Katsevich algorithm, a distributed-memory PC cluster is used to conduct the experiment. This parallel algorithm partitions and distributes the projection data to different computer nodes to perform the computation. Upon completion of each sub-task, the results are collected by the master computer to produce the final image. This parallel algorithm uses the same reconstruction formula as the sequential counterpart, which gives an identical image result. The parallelization of the iterative CT algorithm uses the same PC cluster as the first. However, because it is based on a local CT reconstruction algorithm, which is different from the sequential EM algorithm, the image results differ from those of the sequential counterpart. Moreover, a special strategy using inhomogeneous resolution was used to further speed up the computation. The results showed that the image quality was largely preserved while the computational time was greatly reduced. Unlike the two previous approaches, the third type of parallel implementation uses a shared-memory computer. Three major accelerating methods, SIMD (single instruction, multiple data), multi-threading, and OS (ordered subsets), were employed to speed up the computation. Initial investigations showed that the image quality was comparable to that of the conventional approach, though the computation speed was significantly increased.

Abstract Approved: Thesis Supervisor, Title and Department, Date

PARALLEL COMPUTING TECHNIQUES FOR COMPUTED TOMOGRAPHY

by Junjun Deng

A thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Applied Mathematical and Computational Sciences in the Graduate College of The University of Iowa

May 2011

Thesis Supervisor: Professor Lihe Wang

Graduate College, The University of Iowa, Iowa City, Iowa

CERTIFICATE OF APPROVAL

PH.D. THESIS

This is to certify that the Ph.D. thesis of Junjun Deng has been approved by the Examining Committee for the thesis requirement for the Doctor of Philosophy degree in Applied Mathematical and Computational Sciences at the May 2011 graduation.

Thesis Committee: Lihe Wang, Thesis Supervisor; Ge Wang; Jun Ni; Yangbo Ye; Keith Stroyan

ACKNOWLEDGMENTS

I would like to express my deepest gratitude to those who have directly or indirectly contributed to this thesis. Without their valuable guidance and advice, this thesis would not have been possible. I am particularly thankful to Professor Lihe Wang, my Ph.D. advisor at the University of Iowa, who provided advice and support while I was working on this project. His remarkable insight into and knowledge of both pure and applied mathematics were of great help to me. On a personal level, his kindness and encouragement have been one of my greatest supports during these past few years. Additionally, I am sincerely grateful to Professor Ge Wang, who led me into the field of computed tomography and parallel computing and guided me through my academic pursuits. His broad and superb knowledge of computed tomography has been essential to the completion of this thesis. He created and led an excellent research group that fostered a positive environment. Without the equipment and people in his lab, this work would never have come to fruition. I would also like to thank my co-advisor, Professor Jun Ni, for guidance throughout my research projects. He has always been easy to approach and ready to help. Numerous seemingly casual discussions with Professor Ni have inspired my research pursuits. I have been consistently impressed by his enthusiasm for and insight into the field of parallel computing. Professor Shiying Zhao originally suggested that I choose this topic for my thesis. I am grateful for this valuable suggestion in the early stages of my research. I owe further thanks to the members of my group: Dr. Hengyong Yu, Dr. Jiehua Zhu, Dr. Kai Zeng, Dr. Wentao He, and Dr. Xiang Li. Working with and learning from them has proven to be an excellent experience. In particular, I want to thank Dr. Hengyong Yu for his willingness to help when I encountered obstacles in my projects. His smart ideas, experience, and persistence resolved many problems quickly. Finally, I thank my manager, Mu Chen, for supporting me while I was preparing this thesis.

The project is partially supported by National Institutes of Health (NIH/NIBIB) grants EB and EB.

ABSTRACT

X-ray computed tomography is a widely adopted medical imaging method that uses projections to recover the internal image of a subject. Since the invention of X-ray computed tomography in the 1970s, several generations of CT scanners have been developed. As 3D-image reconstruction increases in popularity, the long processing time associated with these machines has to be significantly reduced before they can be practically employed in everyday applications. Parallel computing is a computing technique that utilizes multiple computer resources to process a computational task simultaneously; each resource computes only a part of the whole task, thereby greatly reducing computation time. In this thesis, we use parallel computing technology to speed up the reconstruction while preserving the image quality. Three representative reconstruction algorithms, namely the Katsevich, EM, and Feldkamp algorithms, are investigated in this work. With the Katsevich algorithm, a distributed-memory PC cluster is used to conduct the experiment. This parallel algorithm partitions and distributes the projection data to different computer nodes to perform the computation. Upon completion of each sub-task, the results are collected by the master computer to produce the final image. This parallel algorithm uses the same reconstruction formula as the sequential counterpart, which gives an identical image result. The parallelization of the iterative CT algorithm uses the same PC cluster as the first. However, because it is based on a local CT reconstruction algorithm, which is different from the sequential EM algorithm, the image results differ from those of the sequential counterpart. Moreover, a special strategy using inhomogeneous resolution was used to further speed up the computation. The results showed that the image quality was largely preserved while the computational time was greatly reduced. Unlike the two previous approaches, the third type of parallel implementation uses a shared-memory computer. Three major accelerating methods, SIMD (single instruction, multiple data), multi-threading, and OS (ordered subsets), were employed to speed up the computation. Initial investigations showed that the image quality was comparable to that of the conventional approach, though the computation speed was significantly increased.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER
1. INTRODUCTION
   1.1 Overview of X-ray computed tomography
   1.2 Parallel computing
      1.2.1 The demand for parallel computing
      1.2.2 Performance analysis of parallel computing
   1.3 Parallel computing for CT reconstruction
   1.4 Main contributions
   1.5 Organization of the thesis
   1.6 Publications
2. PARALLEL IMPLEMENTATION OF KATSEVICH ALGORITHM
   2.1 Overview of cone-beam CT reconstruction techniques
      2.1.1 Overview of the common scanning trajectories
      2.1.2 Cone-beam CT
      2.1.3 Exact cone-beam CT reconstruction
      2.1.4 3D approximate cone-beam CT reconstruction
   2.2 Parallel implementation of the Katsevich algorithm
      2.2.1 A sequential implementation of the Katsevich algorithm
      2.2.2 Parallel implementation of the Katsevich algorithm
      2.2.3 Numerical simulation results for parallel implementation of the Katsevich algorithm
      2.2.4 Experiments on real-world CT data
   2.3 Summary and conclusion
3. PARALLELISM OF ITERATIVE CT ALGORITHM BASED ON LOCAL RECONSTRUCTION
   3.1 Overview of iterative CT reconstruction
      3.1.1 A brief introduction to algebraic reconstruction technique (ART)
      3.1.2 The Expectation Maximization (EM) algorithm
      3.1.3 Iterative de-blurring for CT reconstruction
      3.1.4 Local iterative CT reconstruction algorithm
   3.2 Parallel reconstruction based on the local CT algorithm
      3.2.1 The scheme of the parallel reconstruction
      3.2.2 Numerical results for the parallel reconstruction based on the local CT algorithm
      3.2.3 A strategy to speed up the computation more
      3.2.4 Experiments on real-world CT data
   3.3 Summary and discussion

4. A FAST ITERATIVE RECONSTRUCTION SCHEME FOR MICRO-CT DATA
   4.1 Background
   4.2 Fast iterative reconstruction scheme
      4.2.1 Sequential implementation of the iterative method
      4.2.2 Implementation of the fast iterative method
   4.3 Experiments
      4.3.1 Data acquisition
      4.3.2 Computation results
   4.4 Discussion and conclusion
5. CONCLUSION AND FUTURE WORK

REFERENCES

LIST OF TABLES

Table 2.1 Average Total Reconstruction Time with the Number of Processors
Speed-up with the Number of Processors
Efficiency with the Number of Processors
Time Used in Different Steps
Configuration of the Scanners
Average Total Reconstruction Time
Speed-up with the Number of Processors
Efficiency with the Number of Processors
Parameters of the Spiral Cone-Beam Geometry
Reconstruction Time with Different Number of Processors (NP)
Speed-up and Efficiency with Different Number of Processors (NP)
Computational Time with Different NP for Heterogeneous Resolution
Speed-up and Efficiency with Different Number of Processors (NP), Heterogeneous Cases
Parameters of the Spiral Cone-Beam Geometry
Reconstruction Time with Different Number of Processors (NP)
Benchmarks with Different Number of Processors (NP)
Reconstruction Time with Different Number of Processors (NP)
Benchmarks with Different Number of Processors (NP)
Configuration of the Scanners
Computational Results for Different Methods

LIST OF FIGURES

1.1 Photograph of a Siemens AG, Medical Solutions 16-slice CT
1.2 Illustrations of the three prevalent X-ray emission geometries. (a) Parallel beam. (b) Fan beam. (c) Cone beam
1.3 General structure of a parallel computing system
2.1 Coordinate systems and variables used for image reconstruction in the case of helical cone-beam CT
2.2 An illustration of circular cone-beam scanning geometry
2.3 Data flow of the parallel Katsevich algorithm
2.4 Flowchart for the parallel reconstruction process
2.5 Comparisons of the performance parameters for the parallel Katsevich algorithm. All of the X-axes represent the number of processors. The Y-axes of (a), (b), (c), and (d) represent computational time, speed-up, efficiency, and ratio, respectively
2.6 Representative slices of reconstructed volume. The top row shows the reconstructed slices of the 3D Shepp-Logan phantom while the bottom reveals the differences between the reconstructed and original slices. The gray ranges are [1.00, 1.05] and [-0.05, 0.05] for the reconstructed slices and the differences, respectively
2.7 Tissue equivalent phantom (TEP). The density of the basic material of the phantom is the same as water (1000 mg/cc). The six structures are marked in the picture with their densities
2.8 Comparisons of the performance parameters for the parallel Katsevich algorithm. All of the X-axes represent the number of processors. The Y-axes of (a), (b), (c), and (d) are speed-up on Inveon and Somatom, and efficiency on Inveon and Somatom, respectively
2.9 Trans-axial images of (a) TEP from Inveon™, and (b) water phantom from Somatom™. The display windows are [-1000, 1000] and [-600, 600] for (a) and (b), respectively
3.1 A schematic graph of ART. The object is represented by the region in the curve. A rectangular area that encloses the object region is discretized into grids, and each block is assumed to have the same attenuation coefficient value. The shaded blocks are on the j-th ray path and have non-zero weight w_ij

3.2 A geometrical illustration of the cone-beam CT system for the local iterative algorithm
3.3 A flowchart of the parallel algorithm
3.4 Illustration where only part of the cone-beam X-ray intersects with the object
3.5 The speed-up (a) and efficiency (b) of the parallel iterative algorithm
3.6 Representative slices of reconstructed volume. (a) original phantom, (b) the sequential EM algorithm, (c) homogeneous step size, (d) double step size. The display window for all cases is [0.95, 1.15], where the value in the range is linearly rescaled to [0, 255]
3.7 Representative profiles of reconstructed slices. (a) the profiles of the original phantom, the reconstruction result of the EM algorithm, and the reconstruction results of the parallel algorithm, respectively. (b) the profiles of the reconstruction results when using homogeneous step size, double step size for the outside sub-ROIs region, and 4-times step size for the outside sub-ROIs region
3.8 The speed-up of the Inveon™ scanner (a) and the Somatom™ scanner (b), and the efficiency of the Inveon™ scanner (c) and the Somatom™ scanner (d) with the parallel iterative algorithm
3.9 Representative slices of reconstructed 256³ volume. Left: sequential EM algorithm; Middle: homogeneous step size; Right: double step size. The images are converted to Hounsfield Units, and the display windows are [-1000, 1000] and [-600, 600] for the TEP phantom (top) and water phantom (bottom), respectively
4.1 Pseudo-code for the conventional iterative reconstruction algorithm
4.2 Illustration of the forward projection scheme. In this example, forward projections from four views were performed simultaneously, and two CPUs were used. The corresponding projections from different CPUs were summed as the final forward projection result
4.3 Reconstructed images for different methods together with the profiles through the center of the images

CHAPTER 1
INTRODUCTION

1.1 Overview of X-ray computed tomography

X-ray computed tomography (CT) is a non-invasive imaging technique that produces cross-sectional images of an object from 2D X-ray projections. Since its introduction in the 1970s, CT has been widely adopted in medical imaging and in many other fields, such as manufacturing, geosciences, agriculture, and botany. Among its applications, medical diagnosis is perhaps the most common. By producing 2D or 3D images of internal organs without an incision, CT imaging enables a more accurate diagnosis without subjecting the patient to a painful, invasive procedure. Nowadays, it is a routine diagnostic test in many hospitals. Traditional 2D medical radiography uses X-ray technology to penetrate an object and to record a 2D image on special film. Before the emergence of CT, X-rays were the primary tool for medical diagnosis. One of the main drawbacks of this traditional 2D imaging technique is the superimposition of different structures; in other words, the internal structures along the ray path overlap on the film or detector, which makes it difficult to distinguish between different objects or to precisely determine the 3D spatial location of an internal feature on the ray path. This issue was resolved in the 1970s when Sir Godfrey Newbold Hounsfield and Allan MacLeod Cormack each independently invented the first CT scanners. They developed both the instruments and the associated methods to produce 3D images of an object from a series of 2D X-ray projections. A milestone in the history of radiology, this contribution to modern medicine earned Hounsfield and Cormack a joint Nobel Prize in Physiology or Medicine in 1979.

Figure 1.1. Photograph of a Siemens AG, Medical Solutions 16-slice CT.

X-ray computed tomography (CT) is an imaging modality whereby an X-ray tube is used to generate an X-ray beam that passes through an object while a detector placed on the other side of the object records the attenuated X-ray signals, which are called projections. During the X-ray exposure, the projection data from different directions are gathered and processed with the aid of a computer to produce 2D or 3D cross-sectional or volumetric images of the object. This technique of imaging by section is sometimes referred to as tomography. Because of its ability to reveal the internal anatomical structures of an object, it has surpassed traditional X-ray radiography as the preferred diagnostic method.

Figure 1.1 shows a photograph of a commercial clinical CT scanner.

In X-ray CT, an X-ray of intensity $I_0$ is emitted and passes through an object; the attenuated intensity $I_1$ is measured at the detector. The empirical Lambert-Beer law relates the intensities to the characteristics of the object radiated by the X-ray:
$$I_1 = I_0 \exp\left(-\int_L \mu(x)\,dx\right), \tag{1.1}$$
where $\mu(x)$ is the intrinsic attenuation coefficient of the material at location $x$, and $L$ is the path of the X-ray through the object. The attenuation coefficient characterizes the rate at which X-rays are weakened by scattering or absorption as they propagate through the object. Roughly speaking, this coefficient is proportional to the material density at that position. Therefore, recovering the coefficients from the projection intensities $I_0$ and $I_1$ is equivalent to drawing a picture of the density distribution of the object. Because malignant tissue usually has a higher density than the nearby benign tissue, the CT technique can be used to find a tumor deep inside the human body. In order to solve this problem, first rearrange Eq. (1.1) as
$$\ln\frac{I_0}{I_1} = \int_L \mu(x)\,dx. \tag{1.2}$$
This modified form gives a more explicit view of the CT reconstruction problem. Given the input and output X-ray intensities $I_0$ and $I_1$, respectively, the line integral of the object attenuation coefficient along the ray path can be determined. The task of computed tomography is to calculate the attenuation coefficient function $\mu(x)$ at various locations based on the available information, $I_0$ and $I_1$, gathered by the detector at different directions. This can be regarded as an inverse problem that looks for the reverse process of the line integration, given certain necessary information.
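To make Eq. (1.2) concrete, the short sketch below simulates one ray: it applies the Lambert-Beer forward model of Eq. (1.1) to a toy attenuation profile and then recovers the line integral from the intensities, which is exactly the input that every reconstruction algorithm discussed later receives. The intensity and attenuation values are hypothetical; numpy is assumed.

```python
import numpy as np

# Hypothetical numbers illustrating Eqs. (1.1) and (1.2).
I0 = 1.0e5                                      # unattenuated intensity (assumed)
mu = np.array([0.0, 0.2, 0.5, 0.5, 0.2, 0.0])   # toy mu(x) samples along the ray, 1/cm
dx = 0.1                                        # sample spacing, cm

# Forward model, Eq. (1.1): Lambert-Beer attenuation along the ray path L.
true_line_integral = mu.sum() * dx
I1 = I0 * np.exp(-true_line_integral)

# Inverse step, Eq. (1.2): the quantity a reconstruction algorithm works with.
measured_line_integral = np.log(I0 / I1)
print(measured_line_integral, true_line_integral)   # agree up to rounding
```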

Figure 1.2. Illustrations of the three prevalent X-ray emission geometries. (a) Parallel beam. (b) Fan beam. (c) Cone beam.

The configuration of the X-ray emission path plays a significant role in computing this inverse problem. The path is given by the scanning locus of the X-ray source as well as the direction of the X-rays that this source emits. X-ray emission geometry is, for the most part, classified into one of three categories: parallel-beam, fan-beam, and cone-beam geometry. The earliest form used in CT, the parallel beam involves the emission of a

narrow beam from the X-ray source during the acquisition. Because the X-ray source follows a rotational shoot-and-translate exposure pattern (one CT scan usually requires thousands of projections), this is an extremely time-consuming method. For example, the first CT machine took more than five minutes to complete a scan. In contrast, fan-beam geometry alleviates this problem by shooting a collection of fan-shaped X-rays from the source and using a detector ring consisting of a sequence of detector units to record the X-ray signals simultaneously. Consequently, this method can acquire projections much faster. As detector technology advanced to allow for the production of area detectors, cone-beam X-ray came into use. In this geometry, the X-ray source generates an X-ray beam in all three dimensions, and an area detector is used to record the projections. In this way, the acquisition time is reduced significantly. Figure 1.2 shows illustrations of these three geometries. In terms of the X-ray scanning trajectory, circular and spiral are the two primary types in use. As the name suggests, in circular scanning mode, the X-ray source rotates along a circular locus within a 2D plane. Generally, the associated 2D reconstruction methods are relatively simple and less computationally intensive. The drawback of this trajectory is the slow scanning speed, which is similar to that of the parallel-beam X-ray geometry described earlier. In other words, although the results might be as useful as those of other methods, the technique requires that the patient remain still during the entire scanning process. Furthermore, it has also been shown that X-ray emission along a circular orbit does not satisfy Smith's data sufficiency condition: on every plane that intersects the object, there exists at least one cone-beam source point [79]. Therefore, it is not possible to exactly recover a 3D image with this method in theory. A spiral scanning trajectory is one of the curves that satisfy the data sufficiency condition. In this mode, an X-ray tube and a multi-row detector bank rotate while the patient is moved into the scanner gantry. Relative to the patient, the X-ray source scans along a helix while generating cone-beam X-rays that pass through the object. The attenuated X-ray signals are then

recorded on the detectors that have been placed on the other side of the patient. The spiral source trajectory in combination with cone-beam X-ray emission geometry has become the most popular scanning geometry for contemporary commercial clinical CT scanners.

1.2 Parallel computing

Parallel computing is a computer science technique that makes use of multiple computer resources to process a computational task simultaneously. Each computes a portion of the task, and the results are compiled to produce the final result. In this way, processing time can be greatly reduced. In this work, we utilize this technique to achieve a significant improvement in the computational speed of CT reconstruction. The rest of this chapter outlines the basics of parallel computing and provides a brief introduction to its applications in medical imaging.

1.2.1 The demand for parallel computing

The invention of microprocessor technology has greatly changed scientific computing. As the computational ability of the microprocessor becomes increasingly powerful, many problems that were unsolvable or unrealistic to solve in the past can now be resolved with relative ease. In general, the computational ability of a processor is proportional to the number of integrated transistors on the processor chip. For more than half a century, the development of the microprocessor has followed Moore's law, which empirically states that the number of transistors that can be inexpensively placed on an integrated circuit increases exponentially, doubling approximately every two years [55]. The current state-of-the-art microprocessor can reach a peak speed as high as several Giga-FLOPS (floating-point operations per second), compared to only 5,000 simple additions or subtractions per second for the first electronic computer, invented in 1946. Although computation speed has been greatly improved, for many large-scale scientific problems, it remains either insufficient or impractical for problems of

substantial size. Conventionally, the modern electronic computer is designed to solve a problem by sequentially executing a series of instructions on a single central processing unit (CPU). With a large computational load, this pattern of execution may translate into long wait times because any given instruction in the series is not able to execute until all of the earlier instructions have been completed. Several approaches have been developed to reduce the computation time. More advanced hardware is one option (e.g., a CPU with a higher clock frequency or specially designed processing chips). Currently, many commercial products use high-end processors or have their own ASIC (application-specific integrated circuit), which can perform specific computational operations faster. However, these approaches are usually expensive or application-restricted. When the hardware is upgraded, the old parts can rarely be re-used in the new product. Meanwhile, there are limits to increasing a processor's clock frequency because of difficulties in fabricating more complicated microprocessor chips. Moreover, contemporary microprocessor technology is impeded by a data-transfer bottleneck; computing speed is limited not by the CPU itself but, rather, by the ability of the memory or hard disk to transfer data to and from it. Second, from a software perspective, existing algorithms can be improved or new algorithms that require less-intensive computation can be designed. This approach generally requires great effort in algorithm research and development and depends on the properties of the problem. Therefore, not all applications can easily be addressed in this way. The idea of parallel computing, in which multiple computing units work together on a task simultaneously, was proposed to address this problem. Programs that are parallelized properly can run much faster than their sequential counterparts and, thus, increase computation-time efficiency. With the use of parallel computing, numerous scientific and engineering problems that were previously unrealistic can already be or have the potential to be resolved (e.g., weather forecasts,

earthquake prediction, nuclear reaction simulation, etc.). One of the well-known examples is the man-machine chess match. Many computer chess programs choose the next move based on searching strategies that evaluate all of the possible moves and pick the optimal one. This search tree can be huge in many chess games, resulting in formidable complexity. A big surprise stirred the world in 1997 when Deep Blue, the IBM supercomputer, beat the World Chess Champion Garry Kasparov. This machine could evaluate 200 million potential moves in just a second. Such numbers have continued to increase: by 2005, the IBM supercomputer Blue Gene/L comprised 131,072 processors and achieved over 280 Tera-FLOPS, and, in 2007, it reached 212,992 cores and 478 Tera-FLOPS. These achievements would not have been possible with only one processor. Another reason for using parallel computing is the bottleneck in hardware design itself. From the perspective of IC (integrated circuit) technology, increasing the processing ability of a single CPU by placing more transistors on a chip is becoming both difficult and costly. As a result, parallel computing, which employs the computational resources of multiple general-purpose CPUs concurrently, is perhaps the most feasible way to continue to increase computational ability. On the most recently disclosed list of the top 500 most powerful computer systems in the world, most systems have been built using parallel computing techniques [53]. Due to the comparative ease of the technique, it is anticipated that this trend will continue to grow. Figure 1.3 gives an illustration of the general structure of such a parallel computing system. In parallel computing, the fundamental instruments are the computers that perform the computing task. A parallel computing machine can be a single Symmetric Multi-Processing (SMP) system with multiple built-in processors sharing a common memory; a cluster of locally connected computer processors with distributed, interconnected memories; or a cluster comprising multiple workstations linked by a network, generally called a PC cluster. For example, a computer installed with Intel dual-

core processors (e.g., an Intel Core Duo) is a local shared-memory system. The shared-memory machine usually has limited bandwidth between the CPUs and the memory, which may greatly affect its performance when the number of CPUs becomes large. Consequently, its application is mainly restricted to small- or moderate-size problems. In contrast to the shared-memory system, a PC cluster consists of multiple general-purpose PCs that are connected by a network and communicate through a message-passing protocol. A well-designed PC cluster may contain thousands of processors working together. Compared to select other methods, it is more scalable, less expensive, and easier to program.

Figure 1.3. General structure of a parallel computing system.
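As a sketch of the message-passing pattern on such a PC cluster, the following master/worker example uses mpi4py (an assumed binding; the implementations in this thesis are not tied to this particular library). The master partitions a toy data array, each processing element works on its own chunk, and the master assembles the final result.

```python
from mpi4py import MPI   # assumes an MPI installation plus the mpi4py binding
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# The master partitions a toy "projection data" array, one chunk per node.
if rank == 0:
    data = np.arange(16, dtype=np.float64)
    chunks = np.array_split(data, size)
else:
    chunks = None

local = comm.scatter(chunks, root=0)   # each PE receives its sub-task
partial = local.sum()                  # stand-in for the real sub-computation
total = comm.reduce(partial, op=MPI.SUM, root=0)  # master assembles the results

if rank == 0:
    print("assembled result:", total)  # run with: mpiexec -n 4 python demo.py
```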

In summary, parallel computing is a computer science technique that makes use of multiple computer resources to work on a computational task simultaneously. Each computes a portion of the task, and the results are gathered for a final result. In this way, processing time can be greatly reduced. In this work, we utilize this technique to achieve significant improvements in computational speed for CT reconstruction. Below we outline the basics of parallel computing and provide a brief introduction related to its application in medical imaging.

1.2.2 Performance analysis of parallel computing

In parallel computing, a processor that participates in a computational process is called a processing element (PE). An overall computational task is typically partitioned into multiple sub-tasks, and then the associated data are sent to different PEs through a local connection (with an internal switch) or a networked connection (with an external switch). After the sub-tasks are completed, the results are assembled by a master PE to obtain the final result. The effectiveness of parallel computing relies heavily on the ability of the algorithm to be parallelized. This ability can be evaluated by several common benchmarks, which are usually calculated in terms of the number of processors involved in the computation. The two most commonly used benchmarks of a parallel algorithm are speed-up, $S_p$, and parallel efficiency, $\eta_p$. Speed-up is defined as the ratio of the time taken to solve a problem on a single processing element to the time required to solve the same problem on a parallel computer with $n_p$ identical processing elements:
$$S_p = \frac{T_s}{T_{np}},$$
where $T_s$ is the total execution time when one processor is used, and $T_{np}$ is the total parallel execution time when $n_p$ processors are used.

Speed-up measures the extent to which an improvement in speed can be reached by parallelizing a given application. It describes the relative benefit of solving a problem with a parallel approach over a sequential implementation. For a given problem, there may be more than one sequential algorithm; when measuring speed-up, the one that solves the problem in the least amount of time on the same PE is usually used. Here, the $n_p$ PEs for the parallel algorithm are assumed to be identical to the one used by the sequential algorithm. The other popular benchmark, efficiency, is closely related to speed-up and is defined as
$$\eta_p = \frac{S_p}{n_p},$$
where $n_p$ is the number of processors. It measures the fraction of time for which a PE is usefully employed in computation, rather than wasted on overhead management. It can be helpful to know the best result that a parallel algorithm could achieve in order to avoid unnecessary effort in the early stages of development. Generally, any large computational problem consists of both parallelizable and non-parallelizable (sequential) parts, and some theory has been developed for this purpose based on how well the sequential algorithm is able to be parallelized. Amdahl's Law, originally formulated by Gene Amdahl in the 1960s [1], states that the portion of the program that cannot be parallelized limits the overall speed-up available from parallelization. This relationship is given by
$$S_p = \frac{1}{1 - \alpha_p},$$
where $\alpha_p$ is the fraction of the algorithm that is parallelizable. This formula gives the maximum expected speed-up that any parallelization scheme could achieve for the algorithm. For example, if the sequential portion of a program is 20% of the runtime, a speed-up of no more than 5 can be achieved, no matter how many processors are used in the computation.
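These benchmarks are easy to compute from measured run times; the helpers below (hypothetical names and timings, plain Python) also check the 20%-sequential example against the Amdahl bound.

```python
def speedup(t_sequential, t_parallel):
    """S_p = T_s / T_np."""
    return t_sequential / t_parallel

def efficiency(t_sequential, t_parallel, n_procs):
    """eta_p = S_p / n_p."""
    return speedup(t_sequential, t_parallel) / n_procs

def amdahl_max_speedup(alpha_p):
    """Amdahl bound 1 / (1 - alpha_p), independent of processor count."""
    return 1.0 / (1.0 - alpha_p)

# The example from the text: 20% sequential (alpha_p = 0.8) caps S_p at 5.
print(amdahl_max_speedup(0.8))       # -> 5.0

# Hypothetical timings: 120 s on one PE, 18 s on 8 identical PEs.
print(speedup(120.0, 18.0))          # -> ~6.67
print(efficiency(120.0, 18.0, 8))    # -> ~0.83
```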

The limitation of Amdahl's law is that it assumes a fixed problem size and that the fraction of the sequential part is independent of the number of PEs in the computation. This is often not true in practice. In fact, as the size of the problem changes, the parallelizable fraction may vary. When this is the case, Gustafson's Law applies:
$$S(n) = n - \alpha (n - 1),$$
where $n$ is the number of processors, $S$ is the speed-up, and $\alpha$ is the non-parallelizable fraction of the process. Under Gustafson's Law, the theoretical maximum speed-up of using $n$ processors is $n$: namely, linear speed-up. This is in accord with common sense: divide the computational task by $n$, and the least time needed is not less than $1/n$ of the original time. However, sometimes a super-linear phenomenon might be observed; a speed-up of more than $n$ may be achieved with $n$ processors. A possible reason for this is the effect of cache or memory aggregation. In parallel computers, not only does the number of processors change, but also the size of the total accumulated cache from the different processors. With a larger accumulated cache size, more of (or even the entire) data set could fit into the caches, reducing memory access time dramatically. In this way, an additional speed-up beyond what arises from the parallel algorithm alone is gained. Another possibility is that the parallel algorithm is better optimized than the sequential one. When the accumulated instructions in the parallel case are computationally simpler than the sequential instructions, the speed-up can exhibit this super-linearity.
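Under Gustafson's assumption that the problem grows with the machine, the scaled speed-up can be tabulated directly; a minimal sketch with a hypothetical 5% serial fraction:

```python
def gustafson_speedup(n_procs, alpha):
    """Scaled speed-up S(n) = n - alpha * (n - 1)."""
    return n_procs - alpha * (n_procs - 1)

# With a hypothetical 5% non-parallelizable fraction, the scaled speed-up
# stays near-linear as processors are added, unlike the fixed-size
# Amdahl bound of 1 / 0.05 = 20.
for n in (2, 8, 32, 128):
    print(n, gustafson_speedup(n, 0.05))   # 1.95, 7.65, 30.45, 121.65
```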

The above discusses the parallelizability of the algorithm in relation to its performance in a parallel system. The sole consideration is the change in computational load on the parallel machine when the algorithm is parallelized. In a situation in which the computing task can be decomposed and executed in parallel with virtually no communication, this gives a good estimate of performance. But this scenario is not very common in real-world applications. Most problems do need PEs to communicate to share data. In a distributed-memory system, such as the PC cluster used in this research, the communication cost may represent a non-negligible portion of the total cost. Consequently, the parallel performance can be substantially diminished. Therefore, it is critical to avoid unnecessary communication overhead in parallel computing. When a message is sent on a parallel computer, there is a minimum latency time regardless of the size of the message. Thus, even the size of the package being sent might need to be taken into consideration, since smaller packages mean that more messages are involved. It has been suggested that small messages should be packed into larger messages to avoid paying the latency cost multiple times. In the end, the overhead depends not only on the design of the parallel algorithm, but also on the architecture of the parallel platform, the network bandwidth, the message-passing policy used, etc. It is difficult to address all factors in advance. A basic criterion in designing a parallel algorithm is to minimize communication overhead as much as possible in order to achieve better performance. In addition to the two factors mentioned above (i.e., the parallelizable portion of the problem and the communication overhead), load balance also affects the performance of the parallel system. Load balancing refers to the distribution of tasks in a balanced way such that all PEs can efficiently participate in the computation throughout the whole process. In other words, each processor receives an amount of work roughly proportional to its computational ability so that all can finish the task simultaneously and no one processor delays the entire solution. In the previous discussion, we assumed that the PEs were identical, but in real-world applications, a parallel computing machine may consist of various PEs. The solution may appear simple; however, challenges arise in applications for which the problem size remains unknown until run time. Sometimes the computing load is difficult to partition without introducing additional communication costs. In the above section, we have briefly outlined speed-up, efficiency, and the factors that could affect them.
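The latency argument can be made quantitative with the simple cost model T = t_latency + size / bandwidth; the constants below are hypothetical, chosen only to show why packing many small messages into one large message pays off:

```python
LATENCY_S = 50e-6      # fixed per-message latency, 50 us (hypothetical)
BANDWIDTH_BPS = 1e9    # link bandwidth, 1 GB/s (hypothetical)

def transfer_time(n_messages, bytes_per_message):
    """Cost model: every message pays the latency plus its transmission time."""
    return n_messages * (LATENCY_S + bytes_per_message / BANDWIDTH_BPS)

# Sending 1 MB as 1000 small messages versus one packed message:
print(transfer_time(1000, 1_000))    # ~0.051 s, dominated by latency
print(transfer_time(1, 1_000_000))   # ~0.001 s, latency paid only once
```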

In addition to speed-up and efficiency, scalability is also a crucial index for solving large-scale computational tasks in parallel computing. Generally, an application is considered to be scalable if a larger parallel configuration can solve proportionally larger problems in the same running time as smaller problems on smaller configurations [19]. Furthermore, scalability encompasses a parallel system's ability to accelerate the computation with the addition of processors. By this definition, a scalable parallel system should achieve an ideal speed-up. However, due to hardware and software restrictions, such as memory access speed limits, bandwidth, algorithm design, etc., at some point, adding more PEs causes performance to decrease. Alternatively, a revised definition of scalability, scaled speed-up, is actually used in practice: an application is said to be scalable if the computational time remains the same when the number of PEs and the problem size are increased by the same factor [27]. This is a more practical definition since it considers the limitations of the real system while still capturing the essence of the definition of scalability. These three measurements serve as basic guidelines for designing, as well as methods for evaluating, a parallel algorithm. In summary, a parallel algorithm should exhibit good speed-up and efficiency, and preserve its performance when more PEs are used for a larger problem.

1.3 Parallel computing for CT reconstruction

CT reconstruction requires a large amount of computing resources. Although analytic 2D reconstruction, due to its simple formulas, can be accomplished in a relatively short time, the popular spiral cone-beam 3D reconstruction remains time-consuming. The dilemma is exacerbated when numerous projection data and a high-resolution image are needed, which is standard in modern clinical diagnosis or pre-clinical experiments. Moreover, when iterative reconstruction methods are used in some applications, such as those with incomplete projection data or projection data that contain noise, the

computation time is so formidable that it must be significantly reduced before the methods can be practically used. Over the past decades, parallel computing technology has been successfully applied in several medical applications for image reconstruction, and many parallel algorithms have been developed. A parallel algorithm is usually designed based on a corresponding sequential algorithm. For example, Chen et al. proposed a parallel implementation of a 3D CT reconstruction algorithm based on an inverse 3D Radon transform on an Intel hypercube multiprocessor [11]. Later, Raman et al. developed a parallel implementation of the parallel-beam FBP algorithm and examined it on the Intel Paragon system with 16 processors and the Connection Machine (CM5) system with 32 processors [61]. The performance of their parallel FBP programs was compromised by a large communication overhead, giving a speed-up of about 4 on the Paragon and 1.36 on the CM5. The authors found that, because their parallel implementation split the back-projection summation among different PEs, the communication overhead for gathering the sub-summation results could cost more time than the CPU execution time; therefore, increasing the number of processors could increase the total execution time. As for iterative algorithms, in the 1990s, some parallel Expectation-Maximization (EM) algorithms were proposed [41 and 54]. The parallel implementation was directly based on the conventional EM algorithm with various domain partition techniques. Ordered-subset techniques were also used to further speed up the iterative reconstruction [34]. Johnson and Sofer investigated various parallelisms in image reconstruction [31]. More recently, an OSC (Ordered-Subset Convex)-based parallel statistical cone-beam X-ray CT algorithm was proposed based on shared memory [40]. This algorithm employs two parallelization techniques: (1) processing all of the projection angles within one subset in parallel (OSC-ang), and (2) dividing the whole volume into various parts and reconstructing them in parallel (OSC-vol). Both of these techniques rely heavily on re-projection/back-projection operations.
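To illustrate the view partitioning behind the ordered-subset idea and the OSC-ang strategy, the sketch below (a hypothetical helper, plain Python) splits the projection views into interleaved ordered subsets; within one subset, the views could then be handled by different threads or PEs:

```python
def ordered_subsets(n_views, n_subsets):
    """Partition projection-view indices into interleaved ordered subsets.

    Interleaving keeps the angular coverage of each subset roughly uniform,
    which is what lets each subset stand in for the full data in one update.
    """
    return [list(range(first, n_views, n_subsets)) for first in range(n_subsets)]

# 12 projection views in 4 subsets; one iterative update is run per subset,
# and the views inside a subset can be processed in parallel (OSC-ang).
for subset in ordered_subsets(12, 4):
    print(subset)   # [0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]
```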

The first parallelization strategy has been proven to be appropriate for shared-memory architectures to avoid high communication overhead, and the second is suitable for distributed-memory systems. Importantly, the optimal choice between OSC-ang and OSC-vol depends on the dataset size.

1.4 Main contributions

In this work, we mainly focus on the most popular acquisition geometries in current commercial CT scanners (i.e., spiral cone-beam CT and circular cone-beam CT). Two major reconstruction methods are used to recover images from projection data: the analytical method and the iterative method. Analytical methods (e.g., the FDK and the Katsevich algorithms) use direct analytic formulas to reconstruct the image. Iterative methods (e.g., Algebraic Reconstruction Techniques [ART] and Expectation-Maximization [EM] [2, 3, 18, 43, 82 and 83]) repeatedly apply an update formula until the final result is achieved. More specifically, iterative methods match the measured projection data with the data calculated from the current object-density-distribution estimate, and subsequently make corrections according to the difference. This procedure is repeated until some predetermined error level or maximum iteration number is reached. Both approaches have advantages and specific ranges of applications. Analytical methods are usually faster than iterative methods and are, thus, more widely used in commercial applications where time is an important consideration, whereas the latter show their strength when projection data contain substantial noise or are incomplete. However, as 3D CT reconstruction becomes popular, both suffer from demanding computation time requirements due to the geometric complexity and the large amount of projection data acquired in a short period of time. Furthermore, the high-resolution images usually required in real-world applications can also significantly slow the reconstruction. In this research, we have utilized computation acceleration techniques to reduce the computational time for the Katsevich algorithm and the EM algorithm, which are representatives of analytical and iterative methods, respectively. Specifically, the parallel

implementation of the Katsevich algorithm gives a result identical to that of the sequential algorithm since the reconstruction formula is intact. As for the parallelization of the EM algorithm, we attempted to reduce the reconstruction time via two different methods. The first method produced a result that was different from the sequential EM algorithm because it was based on iterative local CT reconstruction, a generalization of the EM algorithm originally proposed for incomplete projection data [82 and 83]. The second method, which was conducted on a shared-memory computer, gave a result identical to the sequential counterpart. The results from all implementations showed a significant speed-up over the sequential algorithms with the image quality preserved.

1.5 Organization of the thesis

The topics in this thesis are discussed in the following sections: Chapter 2 gives an overview of the spiral cone-beam geometry that this thesis uses as its foundation. The two most representative analytic algorithms, namely the approximate Feldkamp algorithm for a circular source trajectory and the exact Katsevich algorithm for a spiral trajectory, are briefly introduced. Although the computation acceleration work for iterative algorithms is also based on this geometry, we leave it to a separate chapter since its derivation is independent of the specific acquisition geometry used. Next, the parallel implementation of the Katsevich algorithm is presented. The numerical and real-world data results are reported, and the parallel performance is compared with its sequential counterpart. Chapter 3 first gives some general information about iterative reconstruction algorithms. Then the EM (Expectation Maximization) algorithm is presented, followed by a variation, the iterative de-blurring approach, from which the local iterative CT algorithm is derived. The local iterative algorithm itself is introduced after that. Later, a parallelization of the iterative CT algorithm based on local reconstruction is proposed. Simulation and

real-world data reconstructions are carried out and the results are presented; the associated parallel benchmarks are thereafter calculated. Concerns about image quality, relatively low efficiency, and other potential limitations are discussed at the end. Chapter 4 presents a computation acceleration scheme for the iterative algorithm. Techniques such as multi-threading, SIMD, and OSEM work together so as to achieve maximum speed-up. Real-world data are processed and compared with the standard Feldkamp algorithm to verify that the image quality is not compromised in the effort to reach a high speed-up. Chapter 5 summarizes the work herein described and makes suggestions for future research directions related to this topic.

1.6 Publications

Parts of the results in this thesis have been published or submitted in the following:

Deng, J., Yu, H., Ni, J., He, T., Zhao, S., Wang, L. and Wang, G., A Parallel Implementation of the Katsevich Algorithm for 3-D CT Image Reconstruction, The Journal of Supercomputing 38(1): (2006).

Deng, J., Yu, H., Ni, J., Wang, L., and Wang, G., Parallelism of iterative CT algorithm based on local reconstruction, Developments in X-Ray Tomography V, Proceedings of SPIE, Vol. 6318, Paper ID: 63181P, 10 pages, Aug 15-17, 2006, San Diego, CA, United States.

Ni, J., Deng, J., Yu, H., He, T., and Wang, G., Analytical Model for Performance Evaluation of Parallel Katsevich Algorithm for 3-D CT Image Reconstruction, Int. Journal on Computational Science and Engineering, Inderscience Publisher (accepted).

He, T., Ni, J., Deng, J., Yu, H. and Wang, G., Deployment of One-Sided Communication Technique for Parallel Computing in Katsevich CT Image Reconstruction, IMSCCS (1) 2006:

Deng, J., Hong, I., Burbar, Z., Yan, S. and Chen, M., A fast iterative reconstruction scheme for micro-CT data, International Meeting on Fully 3D Image Reconstruction in Radiology and Nuclear Medicine, 2009 (Beijing, China).

Deng, J., Yan, S., Yu, H., Wang, G., and Chen, M., A study on spiral cone-beam scanning mode for preclinical micro-CT, Nuclear Science Symposium Conference Record (NSS/MIC), IEEE, 2009:

Deng, J., Siegel, S., and Chen, M., 3D cone-beam rebinning and reconstruction for animal PET transmission tomography, Nuclear Science Symposium Conference Record (NSS/MIC), IEEE, 2010 (accepted).

CHAPTER 2
PARALLEL IMPLEMENTATION OF KATSEVICH ALGORITHM

2.1 Overview of cone-beam CT reconstruction techniques

As discussed in Chapter 1, CT technology has undergone several fundamental revolutions since its advent. The first CT scanner, designed by Hounsfield, used a 2D parallel-beam circular acquisition mode; the source emitted a narrow beam of X-rays and operated according to a translation-then-rotation pattern. Later, fan-beam X-ray geometry became the preferred method because it was able to acquire data much faster. When an area detector became available, cone-beam CT was introduced and widely accepted because it could achieve a much higher data acquisition speed and greater coverage than the other methods. In the following section, a brief introduction to the popular CT scanning modes is given, and some major reconstruction algorithms designed specially for them are discussed.

2.1.1 Overview of the common scanning trajectories

Traditionally, CT scans an object following a circular locus. The table remains stationary while the X-ray source rotates around the subject. After a slice is scanned, the table moves to the next position and the scan continues. In the age of the single-slice detector ring, such a scan-and-stop, slice-by-slice method is an inefficient use of time because a full-body scan may comprise hundreds of slices. Moreover, it may result in motion artifacts in the reconstructed images because the patient generally cannot hold his or her breath or keep still during the scan, and, consequently, slices from different movement phases are stacked together. To make the acquisition faster, spiral CT was introduced in the 1990s; in it, the X-ray tube and the detector rotate along the same circular orbit as the table is moved into the scanner gantry. The X-ray source scans along a helix relative to the patient. In this mode, the X-ray tube rotates continuously and does not need to stop during the movement of the table. Consequently, the total acquisition time is

greatly reduced. It should be noted that, in the early stages of this mode, the source still emitted a fan beam of X-rays, which was recorded by a ring of detectors. When only 2D reconstruction methods are available, interpolation is necessary to obtain the projection data required by those 2D methods because each view is at a different axial position. To do so, the projection data can be interpolated at the same view angle (360° apart) but a different axial value. Alternatively, two geometrically opposite projections that are 180° apart can be used to perform the interpolation. It can be shown that both are capable of producing reasonably good image quality, although the latter produces better resolution in the axial direction because the data are intrinsically closer to the slice being interpolated. The drawback of both is that interpolation can result in low image resolution in the axial direction when the translation of the table during one rotation of the source (i.e., the helical pitch) is large.

2.1.2 Cone-beam CT

Although spiral fan-beam CT can complete a scan faster than circular-orbit CT, it is still not fast enough for clinical use, and the motion artifact is not fully removed either. Furthermore, trans-axial 2D cross-sectional images are reconstructed from the projection data in the same plane. To obtain a 3D view of the patient, the 2D images are stacked together in order despite having been acquired at different times. When there is relatively large patient motion or breathing during the table translation, severe artifacts can result. Therefore, the X-ray needs to cover the region of interest (ROI) as much as possible and complete the data acquisition in a single breath-hold. An X-ray tube emanating a cone beam of X-rays can fulfill this goal. Different from 2D fan-beam geometry, cone-beam CT emits photons in all three dimensions toward the object, and a multi-row area detector is used to record the data. This type of CT acquisition allows a larger ROI to be captured in a much shorter period of time. Higher resolution in the axial direction can, thus, be achieved by using the overlapping

projection data in this direction. The range of the cone angle is limited by the span of the detector plane. Since the large-area detector has historically been expensive and hard to design and fabricate, in the early stages only a few slices were used. Nowadays, 128-slice scanners have been developed, and some vendors have announced or plan to provide 256-slice scanners. Although cone-beam CT enables users to collect the projection data at a much faster speed and with a larger coverage area, the divergence of the cone-beam geometry makes the 3D reconstruction far from trivial. With a circular scanning locus, the central plane has complete projection data. Thus the central-plane image can be recovered exactly by parallel-beam or fan-beam 2D CT reconstruction algorithms. However, in other planes, exact reconstruction cannot be accomplished with those algorithms because the X-rays are tilted and do not lie in the same plane. Meanwhile, this cone-shaped tilting makes it difficult to use the interpolation strategy that was used in the spiral fan-beam mode. Therefore, algorithms that account for the cone-beam geometry need to be formulated to efficiently utilize this large amount of projection data. The 3D cone-beam CT reconstruction algorithms can be categorized into two major types: approximate and exact algorithms. An approximate method reconstructs the CT image by a non-exact formula with an acceptable error. In contrast, an exact method recovers the image by a theoretically accurate inversion formula. Although exact methods look more attractive than approximate methods, the approximate ones are, in fact, more widely adopted in practice. This is largely due to their ease of implementation and the fact that they are relatively less sensitive to noise; their reduced computational complexity is another reason. In the following section, both approaches will be discussed, with one or two representatives presented in more detail.

2.1.3 Exact cone-beam CT reconstruction

As early as the 1980s, exact cone-beam reconstruction algorithms had already been a focus of the CT community. In 1983, Tuy derived a formula for exact cone-beam reconstruction with scanning trajectories satisfying the following condition: every plane that intersects the object should cut the X-ray source trajectory at at least one point [79]. Specifically, two perpendicular circular scanning loci form one such curve. This condition has been proven to be sufficient for an exact 3D reconstruction and is referred to as Tuy's data sufficiency condition. It has become the standard for designing CT scanners that use exact CT reconstruction algorithms. In 1991, Grangeat found a fundamental relationship between the X-ray projection and the derivative of the 3D Radon transform. Based on this relationship, Grangeat proposed an exact reconstruction formula that directly used the Radon inversion [24]. However, these algorithms require coverage of the object in all directions on a sphere (or, at least, a semi-sphere by using the property of symmetry). This restriction makes reconstruction of a long object impossible because the X-ray projection data are unavailable or truncated in the longitudinal direction due to the size of the detector. Additionally, the complexity of implementation and computation makes the algorithms unsuitable for real-world applications. Hence, although these algorithms played a very important role in the development of CT reconstruction, they are impractical for clinical use. Despite these drawbacks, Tuy's and Grangeat's efforts promoted the investigation of exact cone-beam reconstruction. Many algorithms have been proposed that were either directly or indirectly based on Grangeat's formula by utilizing the fundamental relation he proved. For instance, Defrise et al. [14] and Kudo et al. [42] used this relation to develop an exact reconstruction formula for an arbitrary trajectory that satisfies Tuy's condition. Their formulas are in the form of filtered back-projection and can be reduced to the famous Feldkamp algorithm, which is approximate and will be introduced in the next section on approximate algorithms. The difficulty of implementation is

similar to that associated with the Grangeat algorithm: the projection data must be filled in all directions. Therefore, a more practical exact formula would be favorable for real-world application. In 2002, Katsevich derived an exact filtered-back-projection (FBP) algorithm for spiral cone-beam CT reconstruction, which is quite similar to the Feldkamp-type algorithms in form. Unlike the other exact algorithms, Katsevich's algorithm is much more computationally efficient, which makes it practical for commercial scanners. In the following section, a brief introduction to the algorithm is given.

Figure 2.1. Coordinate systems and variables used for image reconstruction in the case of helical cone-beam CT.

As shown in Figure 2.1, a helical scanning locus, C, in the 3D Euclidean space \mathbb{R}^3 can be mathematically described as

C := \left\{ y \in \mathbb{R}^3 : y_1 = R\cos(s),\; y_2 = R\sin(s),\; y_3 = \frac{sh}{2\pi},\; s \in \mathbb{R} \right\},   (2.1)

where s is an angular parameter, h (> 0) and R (> 0) are the pitch and radius of the locus, and y is a Cartesian-coordinate vector with three components y_1, y_2, and y_3. As mentioned before, in a practical CT system, a patient is moved through the gantry while the X-ray source rotates around the patient. Relative to the patient's position, the locus of the X-ray source can be viewed as the helix C. Let U denote an open set that is strictly inside the helix and contains the volume (object) of interest (VOI):

U := \left\{ x \in \mathbb{R}^3 : x_1^2 + x_2^2 < r^2,\; 0 < r < R \right\},   (2.2)

where r is the radius of the VOI inside the locus and x is the Cartesian-coordinate vector with three components x_1, x_2, and x_3. Assume f is a compactly supported function defined on U and let S^2 be the unit sphere in \mathbb{R}^3; then the cone-beam transform of f is defined as

D_f(y, \beta) := \int_0^{\infty} f(y + t\beta)\, dt, \qquad \beta \in S^2.   (2.3)
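Before turning to the \pi-line, it may help to see the scanning geometry in code. The following C fragment (a minimal sketch, not code from the thesis implementation) evaluates the source position y(s) of Equation (2.1) for a given angular parameter s:

#include <math.h>

typedef struct { double y1, y2, y3; } Vec3;

/* Source position on the helix C of Equation (2.1):
   R is the radius and h the pitch (table feed per turn). */
static Vec3 helix_position(double s, double R, double h)
{
    Vec3 y;
    y.y1 = R * cos(s);
    y.y2 = R * sin(s);
    y.y3 = s * h / (2.0 * M_PI);
    return y;
}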

The \pi-line of a given point x is a line segment passing through x with its two endpoints on the helix within one turn. It has been proven that any point strictly inside the spiral belongs to one and only one \pi-line [13 and 15]. Assume s_b(x) and s_t(x) are the angular parameters of the two endpoints; the \pi-interval can be denoted as I_{PI}(x) := [s_b(x), s_t(x)]. For a given s \in I_{PI}(x), one can find s_2 \in I_{PI}(x) such that x, y(s), y(s_2), and y(s_1(s, s_2)) are on the same plane with the constraint s_1(s, s_2) = (s + s_2)/2. Denote

u(s, x) = \begin{cases} \dfrac{\big(y(s_1) - y(s)\big) \times \big(y(s_2) - y(s)\big)}{\big|\big(y(s_1) - y(s)\big) \times \big(y(s_2) - y(s)\big)\big|}\, \mathrm{sgn}(s_2 - s), & 0 < |s_2 - s| < 2\pi, \\[2ex] \dfrac{\dot{y}(s) \times \ddot{y}(s)}{\big|\dot{y}(s) \times \ddot{y}(s)\big|}, & s_2 = s. \end{cases}   (2.4)

Katsevich's theorem [37] can then be stated as follows.

Theorem 1. For f \in C_0^{\infty}(U), one has

f(x) = -\frac{1}{2\pi^2} \int_{I_{PI}(x)} \frac{1}{|x - y(s)|} \int_0^{2\pi} \frac{\partial}{\partial q} D_f\big(y(q), \Theta(s, x, \gamma)\big) \Big|_{q = s}\, \frac{d\gamma}{\sin\gamma}\, ds,   (2.5)

where \Theta(s, x, \gamma) := \cos(\gamma)\, \beta(s, x) + \sin(\gamma)\, e(s, x), \beta(s, x) := (x - y(s)) / |x - y(s)|, and e(s, x) := \beta(s, x) \times u(s, x). For details of the proof of this theorem, refer to [36 and 37].

This algorithm resolves the drawback associated with previous approaches by allowing long-object reconstruction in spiral cone-beam CT. The efficiency of this algorithm arises from the fact that it has a filtered-back-projection form similar to that of the Feldkamp algorithm. It has been regarded as a milestone of spiral cone-beam CT reconstruction and is very promising for clinical applications and other commercial CT systems.

3D approximate cone-beam CT reconstruction

Approximate reconstruction is a very important branch of the CT field and is the major 3D reconstruction method used in most contemporary CT scanners. In the early stage of cone-beam CT, the cone angle was relatively small due to the high cost and the technical difficulties associated with placing enough detector units together to form a large detector matrix. For instance, the single-slice CT was introduced in the mid-1990s and dual slices a year later, but only in 2003 did the 16-slice detector appear on the market. At that time, the angle span (cone angle) of the X-rays in the axial direction was relatively small; therefore, many algorithms were modified directly from existing 2D reconstruction methods for the cone-beam geometry. Now consider the helical locus. Due to the

small cone angle, the simplest way to understand it would be to ignore the cone angle and imagine the tilted plane as being flat. Similar to the case of the single-row scanner, the projection data is interpolated using the same (360° apart) or the opposite (180° apart) view angle, except that the detector row nearest to the slice to be interpolated in the Z direction is used. Then a 2D reconstruction algorithm is used to recover the image.
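As a simple illustration of this interpolation, the following C fragment (a minimal sketch, assuming the two projection samples nearest to the target slice position z, acquired at the same view angle one full turn apart, have already been identified) computes the longitudinally interpolated value:

/* 360-degree-type longitudinal interpolation: p_lo and p_hi are samples
   of the same ray measured one helix turn apart at longitudinal positions
   z_lo < z < z_hi, and z is the position of the slice being reconstructed. */
static double interp_360(double p_lo, double z_lo,
                         double p_hi, double z_hi, double z)
{
    double w = (z - z_lo) / (z_hi - z_lo);   /* distance-based weight */
    return (1.0 - w) * p_lo + w * p_hi;
}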

More complicated techniques that take the cone-angle properties into consideration have been designed. One example is the tilted-plane reconstruction first proposed by Larson [47], in which the approximation and reconstruction were performed on a tilted slice covering a half turn of the helix. The tilted images were interpolated onto a 3D Cartesian coordinate system to assemble the flat images. It has been shown that the tilted-plane reconstruction produces better image quality than conventional reconstruction-plane images obtained with direct 2D back-projection methods in the cone-beam geometry. The above techniques are, essentially, 2D reconstructions, since the cone-beam geometry is ignored: 2D images are actually reconstructed, and the 3D volumetric image is formed by stacking or interpolating the 2D slices, as in traditional single-row spiral CT. They require that the cone angle and the helical pitch be small; otherwise, the resolution in the Z direction would be compromised by the inaccurate approximations of the 2D planar reconstruction and by the fact that the two interpolated projections deviate too much in the axial direction.

The Feldkamp algorithm, also called the Feldkamp, Davis, and Kress (FDK) algorithm, is the most well-known and widely used filtered-back-projection-type cone-beam CT reconstruction algorithm. In the original work [21], a circular X-ray source trajectory was used. The central plane that is determined by the circular orbit of the source can be exactly recovered by a 2D fan-beam algorithm, or by a parallel-beam algorithm after rebinning. By heuristically extending the fan-beam FBP algorithm, Feldkamp et al. generalized the central-plane reconstruction to the other planes. The scanning geometric arrangement is shown in Figure 2.2. Letting the rotation axis of the X-ray source be the Z-axis and the rotation plane of the circular locus be perpendicular to the Z-axis, the fan-beam reconstruction for the central plane is given by

f(r, \varphi) = \frac{1}{4\pi^2} \int_0^{2\pi} \frac{d^2}{[\,d + r\cos(\varphi - \beta)\,]^2}\; \tilde{D}_\beta\big(Y(r, \varphi)\big)\, d\beta,

where

Y(r, \varphi) = \frac{d\, r\sin(\varphi - \beta)}{d + r\cos(\varphi - \beta)}, \qquad
\tilde{D}_\beta(u) = \left[ \frac{d}{\sqrt{d^2 + u^2}}\, P_\beta(u) \right] * g(u), \qquad
g(u) = \frac{1}{2} \int_{-\infty}^{+\infty} |\omega|\, e^{i\omega u}\, d\omega,

and (r, \varphi) are polar coordinates of the central plane.

Figure 2.2. An illustration of circular cone-beam scanning geometry.

For a non-central tilted projection plane, defined by the source and a detector row parallel to the central plane, the data are treated as if they belonged to another central plane. By relating the parameters of the tilted plane to those of the central plane, the reconstruction is extended to the non-central planes:

f(x, y, z) = \frac{1}{4\pi^2} \int_0^{2\pi} \frac{d^2}{[\,d + x\cos\beta + y\sin\beta\,]^2}\; \tilde{D}_\beta\big(U(x, y, z), V(x, y, z)\big)\, d\beta,

where

U(x, y, z) = \frac{d\,(y\cos\beta - x\sin\beta)}{d + x\cos\beta + y\sin\beta}, \qquad
V(x, y, z) = \frac{d\, z}{d + x\cos\beta + y\sin\beta},

\tilde{D}_\beta(u, v) = \left[ \frac{d}{\sqrt{d^2 + u^2 + v^2}}\, P_\beta(u, v) \right] * g(u), \qquad
g(u) = \frac{1}{2} \int_{-\infty}^{+\infty} |\omega|\, e^{i\omega u}\, d\omega.

Although this formula is an empirical extension and no rigorous proof has been provided, it is still widely used because of its simplicity of implementation, its efficient filtered-back-projection form, and its good image quality at moderately large cone angles. For planes that do not deviate too much from the central plane, this formula is also robust against the cone-beam artifacts caused by the incompleteness of the source trajectory. As can be observed from the derivation, there are some limitations to the Feldkamp algorithm. First, its application is restricted to objects with a relatively short longitudinal dimension determined by the cone span. However, many real-world objects, such as the human body, are long. Secondly, the X-ray source must be moved along a circular path in the object coordinate system. This trajectory fails the sufficiency condition [1 and 69] stated earlier --- to have an exact inversion formula, almost every plane intersecting the object must cut the X-ray source trajectory at some point. Hence only an approximate reconstruction formula is possible for this trajectory.
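For illustration, the following C fragment (a minimal voxel-driven sketch, not the original FDK implementation; it assumes the pre-weighted, ramp-filtered data Dtil is stored view-major with the detector origin at the array center, and uses a nearest-neighbour lookup where bilinear interpolation would be used in practice) accumulates the back-projection of the formula above for a single voxel:

#include <math.h>

/* Voxel-driven Feldkamp back-projection for one voxel (x, y, z):
   Dtil[(k*M + m)*N + n] is the filtered datum of view k at detector
   cell (m, n); d is the source-to-axis distance; du, dv are the
   detector sampling steps. */
static double fdk_voxel(double x, double y, double z,
                        const double *Dtil, int K, int M, int N,
                        double d, double du, double dv)
{
    double sum = 0.0;
    double dbeta = 2.0 * M_PI / K;
    for (int k = 0; k < K; ++k) {
        double beta = k * dbeta;
        double w = d + x * cos(beta) + y * sin(beta);  /* denominator */
        double U = d * (y * cos(beta) - x * sin(beta)) / w;
        double V = d * z / w;
        int m = (int)floor(U / du + 0.5) + M / 2;  /* nearest detector cell */
        int n = (int)floor(V / dv + 0.5) + N / 2;
        if (m >= 0 && m < M && n >= 0 && n < N)
            sum += (d * d) / (w * w) * Dtil[(k * M + m) * N + n];
    }
    return sum * dbeta / (4.0 * M_PI * M_PI);
}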

Inspired by Feldkamp's work, researchers have developed numerous Feldkamp-type reconstruction algorithms for different types of scanning loci. In clinical applications, a solution for the spiral scanning trajectory is more desirable than others. In 1993, Wang et al. proposed the first approximate reconstruction method for cone-beam CT with general trajectories, especially the spiral X-ray scanning locus [84]. The algorithm, also derived empirically from the standard fan-beam reconstruction, is considered a generalization of the Feldkamp algorithm. It shares the same pattern of filtering and weighted back-projection and inherits the merits of its predecessor in computational efficiency and implementation simplicity, but without the restriction to the reconstruction of short objects. The above algorithms are not perfect, though. They handle small or moderately large cone angles relatively well, but suffer from decreased image quality as the cone angle increases or, in the helical case, as the pitch becomes large. Despite these limitations, approximate algorithms still perform better than existing exact algorithms in terms of noise resistance and dose efficiency. As a result, they have been adopted in most commercial CT scanners.

2.2 Parallel implementation of the Katsevich algorithm

In 2004, the Katsevich algorithm was implemented by Yu and Wang [92] and other groups. The implementation may take a few hours to reconstruct an image of a moderately large matrix size. Compared to a modern CT scanner, which takes only a few minutes to accomplish a full-body scan, this reconstruction speed is far from satisfactory. Therefore, the computation time needs to be reduced greatly. In this chapter, a computational scheme to reconstruct an image in parallel with the Katsevich algorithm is presented. This parallelization is based on Yu and Wang's sequential implementation.

A sequential implementation of the Katsevich algorithm

As illustrated in Figure 2.1, with d_1 = (-\sin(s), \cos(s), 0), d_2 = (0, 0, 1), and d_3 = (-\cos(s), -\sin(s), 0), a local coordinate system on the planar detector is formed to numerically implement Katsevich's formula [92]. The cone-beam projection data is measured using planar detector arrays that are parallel to d_1 and d_2 at a distance D from y(s). The detector position in the array is given by a pair of values (u, v), which are the signed distances along d_1 and d_2, respectively. Let (u, v) = (0, 0) be the orthogonal projection of y(s) onto the detector array. Given s and D, the projection (u, v) is determined by \beta. If we denote g(s, u, v) := D_f(y(s), \beta) and let D_g(s, u, v) := \partial D_f(y(q), \beta)/\partial q \,|_{q=s} be the derivative of the cone-beam data at a fixed ray direction, the Katsevich algorithm can be implemented by the following two steps [92]:

(S1) Hilbert Filtering. Define an intermediate function \psi(s, u, v) for this filtering step as

\psi(s, u, v) = \int_{-\infty}^{+\infty} \frac{\sqrt{D^2 + u^2 + v^2}}{\sqrt{D^2 + \tilde{u}^2 + \tilde{v}^2}}\; \frac{D_g(s, \tilde{u}, \tilde{v})}{\tilde{u} - u}\, d\tilde{u},   (2.6)

where (\tilde{u}, \tilde{v}) represents the local coordinates of a variable point on the filtering line determined by (u, v), and D_g(s, u, v) is the first-order derivative of the cone-beam data, which can be computed by the following equation:

D_g(s, u, v) = \left( \frac{\partial}{\partial s} + \frac{D^2 + u^2}{D}\, \frac{\partial}{\partial u} + \frac{u v}{D}\, \frac{\partial}{\partial v} \right) g(s, u, v).   (2.7)

(S2) Weighted Back-Projection. The weighted back-projection is expressed by the following formula:

f(x) = -\frac{1}{2\pi} \int_{s_b(x)}^{s_t(x)} \frac{1}{|x - y(s)|}\, \psi(s, u^*, v^*)\, ds, \qquad
u^* = D\, \frac{(x - y(s)) \cdot d_1}{(x - y(s)) \cdot d_3}, \qquad
v^* = D\, \frac{(x - y(s)) \cdot d_2}{(x - y(s)) \cdot d_3}.   (2.8)
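To make step (S1) concrete, the following C fragment (a minimal sketch, not the thesis's implementation; it assumes the Ntap sample positions (u[j], v[j]) along one filtering line and the derivative data Dg on that line have been precomputed) evaluates the Hilbert-type integral (2.6) by a simple principal-value quadrature:

#include <math.h>

/* Principal-value quadrature of Equation (2.6) along one filtering line:
   Dg[j] = D_g(s, u[j], v[j]); samples are du apart in u; the singular
   tap j == i is skipped as a simple principal-value evaluation. */
static void hilbert_filter_line(const double *Dg, const double *u,
                                const double *v, double D, double du,
                                double *psi, int Ntap)
{
    for (int i = 0; i < Ntap; ++i) {
        double wi = sqrt(D * D + u[i] * u[i] + v[i] * v[i]);
        double acc = 0.0;
        for (int j = 0; j < Ntap; ++j) {
            if (j == i)
                continue;                       /* principal value    */
            double wj = sqrt(D * D + u[j] * u[j] + v[j] * v[j]);
            acc += (wi / wj) * Dg[j] / (u[j] - u[i]);
        }
        psi[i] = acc * du;                      /* Riemann-sum weight */
    }
}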

To numerically implement the Katsevich algorithm, the cone-beam projections are first uniformly sampled with intervals \Delta s, \Delta u, and \Delta v for s, u, and v, respectively. The sampled data is denoted as

g(s_k, u_m, v_n), \qquad 0 \le k < K,\; 0 \le m < M,\; 0 \le n < N,   (2.9)

where k, m, and n are the indexes of the sampling points for s, u, and v, respectively. In practice, m and n are indexes of unit detector positions, and \Delta u and \Delta v represent the unit detector size. Therefore D_g(s, u, v) can be numerically computed as

D_g(s_k, u_m, v_n) \approx D_g^s(s_k, u_m, v_n) + \frac{D^2 + u_m^2}{D}\, D_g^u(s_k, u_m, v_n) + \frac{u_m v_n}{D}\, D_g^v(s_k, u_m, v_n),   (2.10)

where D_g^s(s_k, u_m, v_n), D_g^u(s_k, u_m, v_n), and D_g^v(s_k, u_m, v_n) are the first-order central-difference approximations of \partial g/\partial s, \partial g/\partial u, and \partial g/\partial v, respectively. Then the filtered data \psi(s_k, u_m, v_n) can be calculated from Equation (2.6). Finally, the filtered data needs to be back-projected by Equation (2.8), where the \pi-interval I_{PI}(x) has to be numerically determined. For more details of the numerical implementation, refer to [92 and 93].
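As a concrete illustration of Equation (2.10), the following C fragment (a minimal sketch, assuming interior sample indices and a detector coordinate origin at the array center) assembles the constant-direction derivative from first-order central differences:

/* Constant-direction derivative of Equation (2.10) at sample (k, m, n),
   assuming 1 <= k <= K-2, 1 <= m <= M-2, 1 <= n <= N-2 (interior samples).
   g is stored view-major: g[(k*M + m)*N + n]. */
static double dg_discrete(const double *g, int M, int N,
                          int k, int m, int n,
                          double D, double ds, double du, double dv)
{
    #define G(kk, mm, nn) g[(((kk) * M) + (mm)) * N + (nn)]
    double um = (m - M / 2) * du;              /* detector coordinate u_m */
    double vn = (n - N / 2) * dv;              /* detector coordinate v_n */
    double gs = (G(k + 1, m, n) - G(k - 1, m, n)) / (2.0 * ds);
    double gu = (G(k, m + 1, n) - G(k, m - 1, n)) / (2.0 * du);
    double gv = (G(k, m, n + 1) - G(k, m, n - 1)) / (2.0 * dv);
    #undef G
    return gs + (D * D + um * um) / D * gu + um * vn / D * gv;
}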

Parallel implementation of the Katsevich algorithm

Specifically, the parallel Katsevich algorithm was implemented on a multiprocessor HPC cluster with 16 nodes at our Medical Imaging High Performance Computing Lab (MIHPC Lab). Each node had two 64-bit AMD Opteron processors (PEs) and 4 GB of memory shared between the processors. The total system storage is 8 TB for archiving and retrieval of high-resolution data and images. The program was written in C and compiled with the Portland C compiler. The Message Passing Interface (MPI) served as the parallel library to perform message-passing among the PEs. Because the MPI protocol was implemented through low-level sockets, the communication between the processes on the same node was also realized through message-passing. Moreover, processors on different nodes were given higher priority for assignment than processors on the same node. The main message-passing functions included MPI-based sending, receiving, broadcasting, and collecting.

As described above, the two major computing procedures were filtration and back-projection (S1 and S2). In the filtering step, the calculation of the numerical differentiation terms D_g^s(s_k, u_m, v_n), D_g^u(s_k, u_m, v_n), and D_g^v(s_k, u_m, v_n) in Equation (2.10) and of the integration in Equation (2.6) were the most time-consuming components. The computation of D_g^u(s_k, u_m, v_n) and D_g^v(s_k, u_m, v_n) required only the data collected at one view angle, s_k, and was therefore independent of the data at other view angles. By this property, the projection data from different view angles could be distributed to different PEs and processed in parallel. The computation of D_g^s(s_k, u_m, v_n) took the data from the view angles s_{k+1} and s_{k-1}. This required that the projection data be partitioned in such a way that the data from view angles s_k, s_{k+1}, and s_{k-1} be sent to the same PE. The integration operation in Equation (2.6) used the data at one view angle, s_k; thus, the previous partition strategy for D_g^u(s_k, u_m, v_n) and D_g^v(s_k, u_m, v_n) applies here as well.

Figure 2.3. Data flow of the parallel Katsevich algorithm: the projection data are distributed over the PEs for the filtering stage; the filtered data are then exchanged so that each PE back-projects its own portion of the volume, and the reconstructed data are collected on the root PE.
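In code, this partition constraint amounts to giving each PE a contiguous block of views plus a one-view halo on each side. The following C helper (a minimal sketch, not taken from the thesis implementation) computes the owned and received view ranges:

/* Even view partition with one halo view on each side, needed because
   the s-derivative at view k uses views k-1 and k+1.  The PE owns views
   [*k0, *k1) and must also receive views [*h0, *h1). */
static void view_range(int rank, int np, int K,
                       int *k0, int *k1, int *h0, int *h1)
{
    *k0 = rank * K / np;            /* first view owned by this PE   */
    *k1 = (rank + 1) * K / np;      /* one past the last owned view  */
    *h0 = *k0 > 0 ? *k0 - 1 : *k0;  /* halo: also receive view k0-1  */
    *h1 = *k1 < K ? *k1 + 1 : *k1;  /* halo: also receive view k1    */
}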

Determining how much projection data should be distributed to each PE in the filtering step is an important issue. The filtering operation (2.6) is identical for all projection data indexed by (s_k, u_m, v_n). As a result, each PE should process an amount of projection data that is consistent with its computing capacity. Since the PC cluster is a homogeneous system, we assume that each PE has the same computing capacity; neither scheduling priority nor load balance is a critical issue here. Hence, the projection data are partitioned evenly, as shown in Figure 2.3. If the PEs had different processing abilities, the computation load on each PE would instead be distributed in proportion to its ability.

After the filtering is finished, the load for each PE during the back-projection stage must be determined. As can be observed, Equation (2.8) is a voxel-driven formulation: the reconstruction of each voxel x can be performed independently. As for the partition ratio, although the integration interval [s_b(x), s_t(x)] in Equation (2.8) may differ from voxel to voxel, the amount of computation for a whole slice in the Z direction is invariant due to the geometric properties of the spiral scanning locus. Therefore, if we cut the object volume along the Z direction, the partition can be based on the processing capability of the PEs. Each PE reconstructs the corresponding voxels, as is also shown in Figure 2.3.

To sum up, the overall parallel computation proceeds in the following order. The projection data is first partitioned and distributed over the selected PEs. After each PE receives its assigned data, it performs the filtering operation. When a PE completes its filtering operation, it sends the filtered data to all other PEs. Once each PE has received all of the filtered data, it independently performs the intensive back-projection. Finally, the back-projected data are collected and assembled on the master PE to obtain the final reconstruction. A flowchart of the whole parallel reconstruction process is presented in Figure 2.4.

Figure 2.4. Flowchart for the parallel reconstruction process: PEs' initialization; projection data generation/distribution; projection data filtration; filtered data collection and distribution; back-projection; reconstructed-result integration on the root PE; PEs' finalization.

The reader might notice that, because the back-projection for a voxel only needs the projection data in the view interval [s_b(x), s_t(x)], it is not, in fact, necessary for each PE to pass its filtered data to all other PEs. In theory, only the data useful for back-projection must be sent to the other PEs, which could save communication overhead during the message passing. Although such an interval could feasibly be determined, the parameters for each voxel are calculated only after the back-projection process begins. There is no way to determine these parameters while the filtered projection data are being passed; the only options would be to calculate them one more time, or to pre-calculate them and consult them later for back-projection. This approach imposes extra calculations, storage requirements, and possible look-up-table communication costs.
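The message-passing structure just described maps onto a few MPI collectives. The following skeleton (a simplified sketch under the assumptions that the K views and Nz slices divide evenly among the PEs and that the one-view halos for the s-derivative are omitted for brevity; filter_views and backproject_slab are hypothetical placeholders for steps S1 and S2) illustrates the data flow of Figures 2.3 and 2.4:

#include <mpi.h>
#include <stdlib.h>

/* Simplified skeleton of the parallel flow in Figures 2.3 and 2.4,
   with dummy problem sizes. */
int main(int argc, char **argv)
{
    int rank, np;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    const int K = 64, M = 32, N = 16;       /* views and detector grid    */
    const int Nz = 64, Nxy = 32 * 32;       /* volume slices and slice sz */
    const int kcnt = K / np;                /* views filtered per PE      */
    const int zcnt = Nz / np;               /* Z-slices rebuilt per PE    */

    double *filt = calloc((size_t)K * M * N, sizeof *filt);
    double *slab = calloc((size_t)zcnt * Nxy, sizeof *slab);
    double *vol  = rank == 0 ? calloc((size_t)Nz * Nxy, sizeof *vol) : NULL;

    /* S1: each PE filters its own views, writing into its own segment:
       filter_views(filt + (size_t)rank * kcnt * M * N, kcnt);            */

    /* Exchange: every PE obtains all filtered views for back-projection. */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  filt, kcnt * M * N, MPI_DOUBLE, MPI_COMM_WORLD);

    /* S2: each PE back-projects its own Z-slab:
       backproject_slab(filt, slab, rank * zcnt, zcnt);                   */

    /* The root PE assembles the slabs into the final volume.             */
    MPI_Gather(slab, zcnt * Nxy, MPI_DOUBLE,
               vol,  zcnt * Nxy, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(filt); free(slab); free(vol);
    MPI_Finalize();
    return 0;
}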

In fact, as the experimental results in the next section show, the communication overhead in the parallel implementation comprises only a small portion of the total cost. Therefore, the above consideration may not improve the parallel performance, but it does have the potential to increase the programming difficulty or to add overhead.

Numerical simulation results for the parallel implementation of the Katsevich algorithm

The parallel implementation of the Katsevich algorithm was evaluated by reconstructing the 3D Shepp-Logan phantom [67]. The spiral cone-beam projection data was collected with a planar detector, as shown in Figure 2.1. Different datasets (volumes of 128³, 256³, 384³, and 512³ voxels) were used to measure the performance (mainly speed-up and efficiency) and to study the effects of various sizes of datasets and images. The double-precision format was used for all of the data and images. The measured computational time in each run was slightly different, possibly due to measuring error, varying computational loads at the nodes, and certain other environmental changes that could cause instability of the CPUs. Therefore, the average parallel computation time was calculated from ten runs of each test. The mean values of the computational time (Cases I, II, III, and IV for volumes of 128³, 256³, 384³, and 512³ voxels, respectively) and the corresponding standard deviations are listed in Table 2.1. For convenience, the corresponding semi-log plots are shown in Figure 2.5(a). From these results, it can be observed that the reconstruction time decreases significantly as the number of PEs increases. Table 2.1 also indicates that the standard deviation is relatively large in the case of four processors. This is because the cluster has a master node, which needs not only to handle one computing task but also to coordinate the whole reconstruction process, and sometimes to handle tasks submitted by other users. Therefore, the master node often has more memory allocated and conducts more computation and resource management than the slave nodes. As a result, sometimes the

slave nodes needed to wait for the master node in our experiments, although not in every instance. Such a phenomenon is more prominent when fewer processors are used, causing a higher standard deviation.

Table 2.1. Average total reconstruction time with the number of processors, for Case I (volume 128³, 16 MB), Case II (volume 256³, 128 MB), Case III (volume 384³, 432 MB), and Case IV (volume 512³, 1 GB): mean reconstruction time and standard deviation for each number of processors. Note: the values are the means from 10 runs; the unit of time is the second.

Based on the data in Table 2.1, the associated speed-up was calculated in each case to produce Table 2.2. Figure 2.5(b) is a plot of the speed-up versus the number of processors in each of the four cases.
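For reference, the speed-up and efficiency benchmarks reported here follow the standard definitions: if T(n) denotes the reconstruction time with n processors, then

S(n) = \frac{T(1)}{T(n)}, \qquad E(n) = \frac{S(n)}{n},

so that ideal linear behavior corresponds to S(n) = n and E(n) = 1.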

Figure 2.5. Comparisons of the performance parameters for the parallel Katsevich algorithm. All of the X-axes represent the number of processors. The Y-axes of (a), (b), (c), and (d) represent computational time, speed-up, efficiency, and ratio, respectively.

Table 2.2. Speed-up with the number of processors for Cases I-IV (volumes of 128³, 256³, 384³, and 512³).

The parallel efficiencies in Cases I, II, III, and IV with respect to the number of processors are listed in Table 2.3 and plotted in Figure 2.5(c). Note that the efficiency curve for the first case stays below the ideal efficiency curve and decreases relatively rapidly, whereas the curves for the other cases descend slowly and stay close to the ideal efficiency curve. In addition, the efficiency curves for the latter cases show a common wavy pattern in which the efficiency decreases first, then increases, and finally decreases again. In region 1, where the number of PEs ranges from 1 to 5, the parallel efficiencies for these cases decrease. In region 2, the efficiencies increase with each increment in the number of PEs; the curves reach their peaks when the number of PEs is about 16. In region 3, also called the post-peak performance region, the efficiencies decrease again as the number of PEs increases further.

Table 2.3. Efficiency with the number of processors for Cases I-IV (volumes of 128³, 256³, 384³, and 512³).

The appearance of the super-linear effect (the behavior in which the speed-up is greater than the ideal linear speed-up) is due to the fact that, in a multiprocessor system, the memory usage associated with each PE is less than that in the single-processor system [87]. For example, during the back-projection process, each processor reconstructs only a portion of the object, and thus allocates only that portion of the memory.

Table 2.4. Time used in the different steps (filtration, collecting the filtered data, back-projection, collecting the back-projected data, and total time) with the number of processors for Cases I-IV. The unit of time is the second; the values are means from 10 runs. The time for broadcasting the projection data to the PEs is not listed because it is insignificant.

To better explain this super-linear effect, consider Case IV: to reconstruct an object into a volume of 512³, at least 512³ × 8 bytes = 1 GB of memory is needed for the back-projection in a single-processor system, whereas in a multiprocessor system where n (n > 1) PEs are used, the memory associated with each processor is 1/n of the total memory (1 GB). The impact of memory on the computational ability of the PEs is responsible for the super-linear speed-up. Such phenomena are more evident with larger datasets, which explains their prominence in Cases II through IV.

Figure 2.6. Representative slices of the reconstructed 256³ volume. The top row shows the reconstructed slices of the 3D Shepp-Logan phantom, while the bottom row reveals the differences between the reconstructed and original slices. The gray ranges are [1.00, 1.05] and [-0.05, 0.05] for the reconstructed slices and the differences, respectively.

Table 2.4 compares the time used in the different steps. The results indicate that the

communication time constitutes a smaller percentage of the total reconstruction time as the reconstruction volume becomes larger. Hence, the parallel algorithm is more computationally efficient when used with a large dataset for higher-resolution reconstruction. The ratio between the communication and the computation time for different numbers of processors is also plotted in Figure 2.5(d). It shows that, as the size of a dataset increases, this ratio decreases, resulting in greater performance. To verify the correctness of the current parallel implementation, selected slices of the reconstructed objects were compared with the corresponding slices of the 3D Shepp-Logan phantom. The difference images were calculated and are displayed in Figure 2.6. Excellent agreement can be observed in the figure.

Experiments on real-world CT data

The parallel implementation was also evaluated on real-world CT data collected by commercial CT scanners (e.g., the Siemens Inveon micro-CT and the Somatom Sensation clinical CT). The configurations of the scanners are listed in Table 2.5. It should be noted that the Somatom CT scanner employs a curved detector, in contrast to the planar detector used in Yu's implementation. Furthermore, the X-ray signals recorded in Yu's experiment were equi-angular (i.e., the angles between consecutive X-rays were equal), whereas our implementation utilized equi-spatial projection data. To address this discrepancy, the data was first re-binned to the equi-spatial geometry so that the projection data satisfy the requirements of the implementation.
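The re-binning itself is a one-dimensional resampling per detector row. The following C fragment (a minimal sketch under simplifying assumptions: a single detector row, a source-to-detector distance Dsd, and detector grids centered on the ray through the rotation axis) interpolates equi-angular samples onto an equi-spatial grid:

#include <math.h>

/* Re-bin one detector row from equi-angular samples p_ang[i], taken at
   fan angles gamma_i = (i - Na/2) * dgamma on the curved detector, onto
   the equi-spatial grid u_j = (j - Ns/2) * du of a flat detector, using
   linear interpolation in the fan angle gamma = atan(u / Dsd). */
static void rebin_row(const double *p_ang, int Na, double dgamma,
                      double *p_sp, int Ns, double du, double Dsd)
{
    for (int j = 0; j < Ns; ++j) {
        double u = (j - Ns / 2) * du;
        double gamma = atan(u / Dsd);        /* target fan angle        */
        double fi = gamma / dgamma + Na / 2; /* fractional source index */
        int i = (int)floor(fi);
        double w = fi - i;                   /* interpolation weight    */
        p_sp[j] = (i < 0 || i + 1 >= Na)
                ? 0.0                        /* outside the fan: pad    */
                : (1.0 - w) * p_ang[i] + w * p_ang[i + 1];
    }
}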

The phantoms used for the experiments are a tissue equivalent phantom (TEP) for the Inveon scanner and a water phantom for the Somatom CT scanner. Figure 2.7 shows a photograph of the tissue equivalent phantom, where the filled materials are marked with their different densities.

Table 2.5. Configuration of the scanners (Inveon vs. Somatom): scanning radius (cm), source-to-detector distance (cm), helical pitch (cm), region of interest (cm), detector size (width, height) (cm), number of projections per turn, number of detector cells, typical current (500 µA vs. 200 mA), typical voltage (80 kV vs. 120 kV), and typical exposure time (200 ms vs. 500 ms).

Figure 2.7. Tissue equivalent phantom (TEP). The density of the basic material of the phantom is the same as water (1000 mg/cc). The six structures are marked in the picture with their densities (1000, 1050, 1250, and 1750 mg/cc, plus two open holes).

The same reconstruction tasks were performed as described in the previous section. Due to the loss of three nodes during the time when this experiment was conducted, a maximum of 26 CPUs was used. Table 2.6 shows the reconstruction times for the real-world data. It can be observed that, for both datasets, the reconstruction time is greatly reduced, similar to the simulation presented in the previous section.

Table 2.6. Average total reconstruction time with the number of processors for the Inveon and Somatom datasets (volumes of 128³, 256³, 384³, and 512³). Note: the values are the means from 10 runs; the unit of time is the second.

Tables 2.7 and 2.8 show the standard benchmarks (i.e., the speed-up and efficiency of the parallel implementation). The benchmarks for the experiments are plotted in Figure 2.8. A pattern of speed-up and efficiency similar to that witnessed in the simulation experiments can be observed.

Table 2.7. Speed-up with the number of processors for the Inveon and Somatom datasets (Cases I-IV: volumes of 128³, 256³, 384³, and 512³).

Table 2.8. Efficiency with the number of processors for the Inveon and Somatom datasets (Cases I-IV: volumes of 128³, 256³, 384³, and 512³).

Finally, Figure 2.9 displays some representative images that have been reconstructed from the Inveon and Somatom data. The images have been transformed into Hounsfield Units (HU) based on the water value as determined from the data from each scanner.

Figure 2.8. Comparisons of the performance parameters for the parallel Katsevich algorithm. All of the X-axes represent the number of processors. The Y-axes of (a), (b), (c), and (d) are speed-up on Inveon and Somatom, and efficiency on Inveon and Somatom, respectively.

Due to the relatively low current flux of the X-ray sources used in the Inveon scanner, the projection data contain much higher noise than that associated with the

Somatom scanner. Consequently, the reconstructed image from the Inveon scanner (Figure 2.9(a)) is noisier than that from the Somatom scanner (Figure 2.9(b)).

Figure 2.9. Trans-axial images of (a) the TEP from the Inveon™ scanner and (b) the water phantom from the Somatom™ scanner. The display windows are [-1000, 1000] and [-600, 600] for (a) and (b), respectively.

To examine the accuracy of the HU values recovered for both phantoms, we used the results from the scanners' factory-default settings as the standard. For the Inveon scanner, a circular scanning locus with the standard Feldkamp algorithm is the factory default. Due to the difference in scanning modes between the default and our experimental spiral scanning trajectory, we were unable to compare the results directly as in the simulation experiment. In contrast, for the Somatom clinical CT, spiral loci are available, and the images were reconstructed according to the methods outlined by Schaller et al. [65]. However, due to the limited exposed geometric parameters, such as the longitudinal position and the Euclidean coordinates on which the reconstruction is based, we were unable to make a fully direct comparison either.
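For completeness, the Hounsfield normalization mentioned above is a simple linear rescaling of the reconstructed attenuation values; a minimal sketch, assuming mu is a reconstructed attenuation value and mu_water the water value measured from each scanner's data:

/* Standard Hounsfield rescaling: water maps to 0 HU and (approximately)
   air to -1000 HU. */
static double to_hounsfield(double mu, double mu_water)
{
    return 1000.0 * (mu - mu_water) / mu_water;
}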


More information

Evaluation of Spectrum Mismatching using Spectrum Binning Approach for Statistical Polychromatic Reconstruction in CT

Evaluation of Spectrum Mismatching using Spectrum Binning Approach for Statistical Polychromatic Reconstruction in CT Evaluation of Spectrum Mismatching using Spectrum Binning Approach for Statistical Polychromatic Reconstruction in CT Qiao Yang 1,4, Meng Wu 2, Andreas Maier 1,3,4, Joachim Hornegger 1,3,4, Rebecca Fahrig

More information

Medical Imaging BMEN Spring 2016

Medical Imaging BMEN Spring 2016 Name Medical Imaging BMEN 420-501 Spring 2016 Homework #4 and Nuclear Medicine Notes All questions are from the introductory Powerpoint (based on Chapter 7) and text Medical Imaging Signals and Systems,

More information

Workshop on Quantitative SPECT and PET Brain Studies January, 2013 PUCRS, Porto Alegre, Brasil Corrections in SPECT and PET

Workshop on Quantitative SPECT and PET Brain Studies January, 2013 PUCRS, Porto Alegre, Brasil Corrections in SPECT and PET Workshop on Quantitative SPECT and PET Brain Studies 14-16 January, 2013 PUCRS, Porto Alegre, Brasil Corrections in SPECT and PET Físico João Alfredo Borges, Me. Corrections in SPECT and PET SPECT and

More information

Constructing System Matrices for SPECT Simulations and Reconstructions

Constructing System Matrices for SPECT Simulations and Reconstructions Constructing System Matrices for SPECT Simulations and Reconstructions Nirantha Balagopal April 28th, 2017 M.S. Report The University of Arizona College of Optical Sciences 1 Acknowledgement I would like

More information

Physical bases of X-ray diagnostics

Physical bases of X-ray diagnostics Physical bases of X-ray diagnostics Dr. István Voszka Possibilities of X-ray production (X-ray is produced, when charged particles of high velocity are stopped) X-ray tube: Relatively low accelerating

More information

RECENTLY, biomedical imaging applications of

RECENTLY, biomedical imaging applications of 1190 IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. 24, NO. 9, SEPTEMBER 2005 A General Exact Reconstruction for Cone-Beam CT via Backprojection-Filtration Yangbo Ye*, Shiying Zhao, Hengyong Yu, Ge Wang, Fellow,

More information

High Performance Computing

High Performance Computing The Need for Parallelism High Performance Computing David McCaughan, HPC Analyst SHARCNET, University of Guelph dbm@sharcnet.ca Scientific investigation traditionally takes two forms theoretical empirical

More information

Computer-Tomography I: Principles, History, Technology

Computer-Tomography I: Principles, History, Technology Computer-Tomography I: Principles, History, Technology Prof. Dr. U. Oelfke DKFZ Heidelberg Department of Medical Physics (E040) Im Neuenheimer Feld 280 69120 Heidelberg, Germany u.oelfke@dkfz.de History

More information

Projection and Reconstruction-Based Noise Filtering Methods in Cone Beam CT

Projection and Reconstruction-Based Noise Filtering Methods in Cone Beam CT Projection and Reconstruction-Based Noise Filtering Methods in Cone Beam CT Benedikt Lorch 1, Martin Berger 1,2, Joachim Hornegger 1,2, Andreas Maier 1,2 1 Pattern Recognition Lab, FAU Erlangen-Nürnberg

More information

(ii) Why are we going to multi-core chips to find performance? Because we have to.

(ii) Why are we going to multi-core chips to find performance? Because we have to. CSE 30321 Computer Architecture I Fall 2009 Lab 06 Introduction to Multi-core Processors and Parallel Programming Assigned: November 3, 2009 Due: November 17, 2009 1. Introduction: This lab will introduce

More information

MEDICAL IMAGING 2nd Part Computed Tomography

MEDICAL IMAGING 2nd Part Computed Tomography MEDICAL IMAGING 2nd Part Computed Tomography Introduction 2 In the last 30 years X-ray Computed Tomography development produced a great change in the role of diagnostic imaging in medicine. In convetional

More information

Fundamentals of CT imaging

Fundamentals of CT imaging SECTION 1 Fundamentals of CT imaging I History In the early 1970s Sir Godfrey Hounsfield s research produced the first clinically useful CT scans. Original scanners took approximately 6 minutes to perform

More information

UNIVERSITY OF SOUTHAMPTON

UNIVERSITY OF SOUTHAMPTON UNIVERSITY OF SOUTHAMPTON PHYS2007W1 SEMESTER 2 EXAMINATION 2014-2015 MEDICAL PHYSICS Duration: 120 MINS (2 hours) This paper contains 10 questions. Answer all questions in Section A and only two questions

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

DUAL energy X-ray radiography [1] can be used to separate

DUAL energy X-ray radiography [1] can be used to separate IEEE TRANSACTIONS ON NUCLEAR SCIENCE, VOL. 53, NO. 1, FEBRUARY 2006 133 A Scatter Correction Using Thickness Iteration in Dual-Energy Radiography S. K. Ahn, G. Cho, and H. Jeon Abstract In dual-energy

More information

Improving Positron Emission Tomography Imaging with Machine Learning David Fan-Chung Hsu CS 229 Fall

Improving Positron Emission Tomography Imaging with Machine Learning David Fan-Chung Hsu CS 229 Fall Improving Positron Emission Tomography Imaging with Machine Learning David Fan-Chung Hsu (fcdh@stanford.edu), CS 229 Fall 2014-15 1. Introduction and Motivation High- resolution Positron Emission Tomography

More information

Theoretically-exact CT-reconstruction from experimental data

Theoretically-exact CT-reconstruction from experimental data Theoretically-exact CT-reconstruction from experimental data T Varslot, A Kingston, G Myers, A Sheppard Dept. Applied Mathematics Research School of Physics and Engineering Australian National University

More information

Determination of Three-Dimensional Voxel Sensitivity for Two- and Three-Headed Coincidence Imaging

Determination of Three-Dimensional Voxel Sensitivity for Two- and Three-Headed Coincidence Imaging IEEE TRANSACTIONS ON NUCLEAR SCIENCE, VOL. 50, NO. 3, JUNE 2003 405 Determination of Three-Dimensional Voxel Sensitivity for Two- and Three-Headed Coincidence Imaging Edward J. Soares, Kevin W. Germino,

More information

Analytical Modeling of Parallel Programs

Analytical Modeling of Parallel Programs 2014 IJEDR Volume 2, Issue 1 ISSN: 2321-9939 Analytical Modeling of Parallel Programs Hardik K. Molia Master of Computer Engineering, Department of Computer Engineering Atmiya Institute of Technology &

More information

1. Deployment of a framework for drawing a correspondence between simple figure of merits (FOM) and quantitative imaging performance in CT.

1. Deployment of a framework for drawing a correspondence between simple figure of merits (FOM) and quantitative imaging performance in CT. Progress report: Development of assessment and predictive metrics for quantitative imaging in chest CT Subaward No: HHSN6801000050C (4a) PI: Ehsan Samei Reporting Period: month 1-18 Deliverables: 1. Deployment

More information

REMOVAL OF THE EFFECT OF COMPTON SCATTERING IN 3-D WHOLE BODY POSITRON EMISSION TOMOGRAPHY BY MONTE CARLO

REMOVAL OF THE EFFECT OF COMPTON SCATTERING IN 3-D WHOLE BODY POSITRON EMISSION TOMOGRAPHY BY MONTE CARLO REMOVAL OF THE EFFECT OF COMPTON SCATTERING IN 3-D WHOLE BODY POSITRON EMISSION TOMOGRAPHY BY MONTE CARLO Abstract C.S. Levin, Y-C Tai, E.J. Hoffman, M. Dahlbom, T.H. Farquhar UCLA School of Medicine Division

More information

Scaling Calibration in the ATRACT Algorithm

Scaling Calibration in the ATRACT Algorithm Scaling Calibration in the ATRACT Algorithm Yan Xia 1, Andreas Maier 1, Frank Dennerlein 2, Hannes G. Hofmann 1, Joachim Hornegger 1,3 1 Pattern Recognition Lab (LME), Friedrich-Alexander-University Erlangen-Nuremberg,

More information