Acceleration of an image reconstruction algorithm for Positron Tomography using a Graphics Processing Unit


Thomas Felder
Schriften des Forschungszentrums Jülich, Reihe Gesundheit / Health, Band / Volume 13
Mitglied der Helmholtz-Gemeinschaft


Forschungszentrum Jülich GmbH, Zentralinstitut für Elektronik (ZEL)

Bibliographic information published by the Deutsche Nationalbibliothek. The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet.

Publisher and Distributor: Forschungszentrum Jülich GmbH, Zentralbibliothek, Verlag, D Jülich
Cover Design: Grafische Medien, Forschungszentrum Jülich GmbH
Printer: Grafische Medien, Forschungszentrum Jülich GmbH
Copyright: Forschungszentrum Jülich 2009

Schriften des Forschungszentrums Jülich, Reihe Gesundheit / Health, Band / Volume 13

Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage and retrieval system, without permission in writing from the publisher.

Declaration of Authorship

I, Thomas Felder, declare that this thesis titled, "Acceleration of an image reconstruction algorithm for Positron Emission Tomography using a Graphics Processor Unit", and the work presented in it are my own. I confirm that:

- This work was done wholly or mainly while in candidature for a research degree at this University.
- Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.
- Where I have consulted the published work of others, this is always clearly attributed.
- Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
- I have acknowledged all main sources of help.
- Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed: Thomas Felder
Date: Valencia, January 20, 2008

Abstract

In this work we applied an iterative image reconstruction algorithm to measured data from a Siemens Biograph16 PET scanner. Furthermore, the reconstruction algorithm was implemented on a graphics processing unit (GPU) to reduce the calculation time. Positron emission tomography (PET) is a medical imaging modality used to study metabolic processes in a human or animal body. To reconstruct images from measurement data of a Siemens Biograph16 PET scanner we applied the maximum likelihood expectation maximization (ML-EM) algorithm. The system matrix, which is needed for ML-EM, is calculated using Siddon's ray-tracing algorithm. Two basic approaches were implemented: the off-line version saves the pre-calculated system matrix to the hard disk and reads the values back during the reconstruction process; the online version computes the elements of the system matrix during the reconstruction. The online version was implemented both on the graphics card and on the CPU. We evaluated whether a mid-range graphics card can already accelerate a reconstruction algorithm compared with a CPU implementation. The same algorithms were applied on the GPU and the CPU; however, due to the limited and specific hardware, the implementation on the graphics card had to be adapted and parallelized. For the GPU applications NVIDIA's programming language Compute Unified Device Architecture (CUDA) was used. The reconstructed images showed artifacts that are typical for iterative reconstruction algorithms, as well as background noise. Apart from these effects the image quality was good. In the GPU application we detected a loss of data due to simultaneous memory accesses in the back-projection process. The error depended, among other factors, on the resolution of the image. We took various measures that reduced the error, so that we finally achieved images at several resolutions that are comparable in quality to the images calculated by the CPU.
We implemented two GPU applications; both run faster than the CPU approaches. The maximum acceleration factor achieved is 3.

Acknowledgements

First of all I would like to thank Magdalena Rafecas for her outstanding supervision of this thesis. She always had time for questions and reviewed this thesis even when she was close to giving birth to her first son. It was a pleasure to work with her, not only professionally but also personally. Furthermore I thank her team, Josep F. Oliver and Moritz Blume, for all the questions and problems they helped me to solve. I would also like to thank the department of Medical Imaging of the IFIC for the great company, the integration into the department, and the lunch and coffee breaks on the terrace. I also would like to thank my two university supervisors: Mr. Karl Ziemons accepted the supervision immediately and helped me to overcome the administrative obstacles of the university; even at a distance he always offered help and supported me. Thanks also to Mr. Thoralf Niendorf for the supervision of the project. Thanks to Bill Jones from Siemens Medical Solutions, USA, for the information about the Biograph scanner. Gracias a la panda! (Thanks to the gang!) Ana and Montse, the time-outs with you have always been great fun; thanks for the mental support, the dance and song performances, and the smiles in the moments when I was stuck! Thanks to my whole family, who always supported me! Great to have you all!


Contents

Declaration of Authorship
Abstract
Acknowledgements
List of Figures
List of Tables

1 Introduction
    Motivation
    Goals
    Thesis outline

2 Positron Emission Tomography
    Positron emission and annihilation
    Coincidence Detection
    Limits of PET
    PET/CT scanner
    PET scanner Siemens Biograph16
    Sinogram
    Arc correction
    Michelogram
    Measurement file
    List-mode data

3 Image Reconstruction
    System Matrix
    Siddon algorithm
    Maximum Likelihood - Expectation Maximization algorithm

4 Implementation of ML-EM on CPU
    Sinogram and Michelogram combined in Biograph16
    Computation of the system matrix A geom
    Calculation of detector coordinates
    Storage of the system matrix
    On-the-fly CPU implementation

5 Graphics Processing Unit and Compute Unified Device Architecture
    NVIDIA Geforce 8600GT
    Some definitions
    Memory model and hardware limitations
    Compute Unified Device Architecture
    CUDA software model
    Function declaration, calls and thread IDs
    Variable declaration and memory usage
    Device runtime library
    Debugging and emulation mode
    General performance issues

6 Implementation of ML-EM in CUDA
    Related work
    Parallelization of the reconstruction process
    Parallelization of forward-projection
    First implementation approach
    Improved implementation
    Evaluation of forward-projection implementation
    Implementation of back-projection
    Threads in parallel and memory usage

7 Results and discussion
    Data used for evaluation
    Time performance
    Image quality
    ECAT7 versus CPU
    Evaluation of images reconstructed on CPU
    Evaluation of images reconstructed on GPU
    Error evaluation at different resolutions
    Our results compared to the related works

8 Conclusions and outlook
    Summary
    Conclusions
    Time performance
    Image quality
    General conclusions
    Outlook

A CUDA implementation example

Bibliography


List of Figures

1.1 Different versions implemented in this work
Positron emission and annihilation process
Scheme of PET process
Line of response (LOR)
Different types of coincidences
Images of a PET/CT scan
The Siemens Biograph16 PET/CT Scanner
Detector block and detector ring
Definition of LOR by angle and distance
Sinogram formation
Projection of parallel LORs
Direct and crossed LORs
Different spacing in curved gantry
Direct and oblique sinograms in a Michelogram
Different spans in a Michelogram
Segments and ring difference in Michelogram
One LOR crosses the image space
Length of intersection (LOI)
Loss of resolution in Siddon algorithm
Flow diagram of ML-EM algorithm
Michelogram of Biograph16
Sections in Michelogram of Biograph16
Coordinate system of gantry
Sketch to illustrate xy-coordinate calculation
Implementation of fully calculated and on-the-fly SM
GFLOP/s development in recent years
Architecture of Geforce 8600GT
The different components of a multiprocessor (MP)
Kernel and grids of an application
Different kernels of ML-EM
Calculation of forward-projection on GPU
Flow chart of forward-projection on GPU
Backprojection implementation on GPU
Intersection of voxels by LORs
7.1 Images used for implementation evaluation
Comparison of computation time
Ecat7 and CPU images
Reconstructed phantom with line profile
Results of projection and their differential images
Comparison of CPU-GPU images with voxels
Comparison of CPU-GPU images with voxels
Comparison of CPU-GPU images with voxels

List of Tables

1.1 Cost and performance of different systems to process image reconstruction
Key data of Biograph16 scanner from Siemens [1], [2], [3]
PETLINK 32-bit word format. First bit is the most significant bit (MSB)
Bit encoding of event word and time marker of 32-bit word format
Example of list-mode data from Biograph16
Parameters used to build the Michelogram and sinogram of Biograph16 scanner
Data size of SM in compressed sparse matrix form
Technical data of NVIDIA Geforce 8800GTX and 8600GT
Variable type qualifiers used by CUDA
Implementations of PET image reconstruction using GPUs
Implemented kernels and the parameters needed to define the number of parallel threads and processor occupancy
Arrays allocated in global memory space to save the output of kernel functions for an image space of voxels
Analysis of region of interest shown in figure
Analysis of regions of interest shown in figures 7.6, 7.7 and


Für meine Eltern, denen ich so viel zu verdanken habe. (For my parents, to whom I owe so much.)


Chapter 1

Introduction

Whenever cost developments in health care have been discussed in recent years, an interdisciplinary and controversial dispute could be observed. Cost in the public health system is a topic with many facets; it is not just about economics, but also about ethics, politics, medicine, technology and social justice, among others. In the research and development of medical technologies, a compromise between economic cost and medical benefit is sought. Companies should not only search for the products with the best medical outcome but also for a price that the social health system can afford. Constant investigation and development is therefore essential in medical technology, both to develop new methods and to improve existing treatments; this can also mean an improvement from the cost-benefit perspective. Several models have been developed to analyze medical technical inventions [4-6]. They attempt to define methods to judge economic aspects, such as a high initial investment versus lower maintenance cost or shorter hospital stays of patients. Furthermore, medical aspects are taken into account and compared with the economic perspective. One of the fields that revolutionized medicine in the last decades is medical imaging, an area that can be seen as high-tech medicine and that involves high cost. Especially the tomographic modalities such as magnetic resonance imaging (MRI), computed tomography (CT), single photon emission computed tomography (SPECT) and positron emission tomography (PET) imply a high investment and high maintenance cost for hospitals. To give an example of what an analysis of a certain technology could include, we take a look at one medical device, positron emission tomography. PET is one of the most expensive imaging modalities, not just regarding the budget needed for installation but also regarding running expenses such as maintenance, staff and material that are needed for each study

of a patient. Keppler et al. [7] analyzed the overall expenses for a hospital that runs a PET system. The initial investment for installation varies, depending on the equipment, from 1.3 million to 6.3 million dollars [7]. The greatest difference between the presented configurations is the addition of a cyclotron to produce the tracer used in the PET studies. But positron emission tomography is not just expensive because of the complex devices and installation that have to be financed; the expenses for maintenance, staff and the radiopharmaceuticals applied to the patients are also high. An average scan is estimated to cost from $1,602 to $2,981 [7], depending on the number of scans per year, the configuration of the system, the radiopharmaceutical used and the type of study performed. The overall cost of this technology seems enormous; however, the share of investments in high-technology medicine, to which PET belongs, in total health care costs in 1999 was relatively low (approximately 0.6%) [8]. Hence, it can be stated that even if this technique is one of the most expensive imaging modalities, the impact of the investment on overall health cost is small. If the technique is used wisely, the high expenses can be converted into savings, due to better diagnosis and better differentiated treatment of diseases. For example, a physician may decide to cancel a planned surgery because of the results of a PET study, or the duration of a surgery can be reduced because the operation can be planned in detail beforehand [8]. Valk et al. [9] were able to demonstrate cost savings due to the use of PET; the saving/cost ratio for non-small cell lung cancer, recurrent colorectal cancer and metastatic melanoma was between 2.2 and 2.5. From the medical point of view, positron emission tomography is a high-tech medical application that is used and necessary in fields such as cardiology, neurology/psychiatry and oncology [8].
As can be seen in the example of positron emission tomography, an economic and medical analysis of a technology includes many aspects, and interdisciplinary experts are needed to judge such a technology. Hence, PET is on the one hand an important medical application used in various fields, and on the other hand an expensive technology that, instead of increasing health system cost, can help to save expenses that would otherwise have been spent on traditional treatment. Nevertheless, even when a technology is established in a field as state-of-the-art, as PET is in certain clinical uses, investigation and development still go on to improve the methods. In many fields of personal and professional life, computer technology has changed methods, work processes, communication and technical possibilities. Tomographic systems are based on analytical calculations processed by computers. The new possibilities in medical imaging developed in the last decades were based, among others, on the increased computational performance of microchips. Consequently, higher investments are necessary for information technology, including hardware, software licenses, maintenance and so on. The computational demand of image processing has increased due to a

higher amount of measurement data from the scanners and further developed reconstruction algorithms that are computationally more expensive. There are different solutions to provide the computational power for image reconstruction, such as a single workstation, a cluster of several workstations or even a remote cluster. They differ in installation cost, maintenance cost and performance. The perfect solution would be a system with an acceptable cost-benefit profile, not just for installation but also concerning the running cost.

1.1 Motivation

Computational power can be a bottleneck for image reconstruction in positron emission tomography. On the one hand, with a long computation time it is possible to obtain a greater degree of detail in the images, which can improve the diagnostic capabilities of this technology; on the other hand, it has to be ensured that a physician receives the image within an acceptable period of time. So one major issue in scientific and industrial investigation is how to handle this enormous calculation demand in an acceptable period of time without limiting the hardware capacities of a scanner. One computer workstation is normally not sufficient to achieve a good compromise between quality and time. Computer clusters of several workstations or a powerful remote cluster that processes the image reconstruction are more suitable. However, these computationally powerful solutions are cost-intensive, so there is always a tradeoff between maintenance cost, computational power and initial investment. A new solution that could provide high computational output for a reasonable price came up in recent years: the scientific community got interested in graphics processing units (GPUs), widely known as graphics cards, for scientific calculations. Graphics cards were developed and are used for 3D computer games and high-end graphical applications.
Basically, a GPU is a co-processor which can execute many identical instructions in parallel and thus may reach a higher data throughput. If a workstation equipped with a GPU performed the image reconstruction well, it would be an economically interesting alternative for medical image reconstruction. Table 1.1 compares the cost and performance of different systems for processing the image reconstruction. The idea of using a co-processor to accelerate the calculation of image reconstruction is not new. In the 1990s the use of floating-point acceleration cards was already investigated. Specific hardware circuits were used for the calculation of the filtered backprojection. For CT image reconstruction the computational speed-up was a factor of 150 or more [10]. However, the development of such cards is expensive and the circuits are specific to one application. In contrast, graphics cards offer a general, commercial

Table 1.1: Cost and performance of different systems to process image reconstruction.

System            Investment    Running cost     Performance
Workstation       low (+)       low (+)          low (+)
Computer cluster  high (-)      high (-)         very high (++)
Remote server     low (+)       very high (- -)  very high (++)
Workstation/GPU   low (+)       low (+)          to be evaluated

(+ and - indicate whether a rating is positive or negative.)

hardware with a software interface for generic implementations. It has to be said that not every application can be implemented efficiently on a GPU, and an implementation for the graphics card is much more specific than one for the CPU. Even though GPUs have been on the market for many years, their use outside of graphical applications was limited by the way the cards had to be programmed, which required highly skilled programmers to obtain good calculation performance. To overcome this problem, new programming languages were introduced by scientists and graphics card producers; however, the breakthrough seems to be the launch of the Compute Unified Device Architecture (CUDA) released by NVIDIA. The first applications of GPUs in medical image reconstruction were made in computed tomography, and the promising results in this area motivated scientists in other fields of medical imaging to explore the capacities of GPUs for other imaging modalities. The work of several developers has shown that the above-mentioned problems in PET image reconstruction could be addressed by using GPUs. The promising results of the first published trials motivate further investigation of the capacities of a GPU for PET technology. NVIDIA claims the launch of the CUDA language as a milestone in GPU programming that opens cost-effective parallel programming to a broad field of users, but it has to be proven whether these marketing promises hold.

1.2 Goals

The final aim of the presented work is a comparison of an image reconstruction implemented on a CPU and on a GPU.
One of the most common iterative algorithms used on PET scanners is the Maximum-Likelihood Expectation-Maximization (ML-EM) algorithm. In this work this algorithm is implemented using clinical measurement data from a Siemens Biograph16 scanner. The system matrix that is used in the ML-EM algorithm is calculated applying Siddon's ray-tracing algorithm. Two versions are implemented on the CPU: an off-line version, which pre-calculates the whole system matrix and saves it to hard disk, and an online

approach that calculates the system matrix on-the-fly. The CPU online applications will be implemented with both double and single floating-point precision. It will be analyzed to what extent the choice of double or single floating-point values affects computation time and quality. On the GPU two versions will be implemented, both using single floating-point precision, as double precision values are not supported by the GPU used. A quality loss can be observed in the implementation on the graphics card. The error occurring in the resolution-dependent implementation (RDI) on the GPU varies with the resolution of the image space: the fewer voxels, the higher the error. The resolution-independent implementation (RII) includes various measures to avoid this dependency. Both implementations will be evaluated for computation time and image quality. As high data transfer between CPU and GPU should be avoided and the GPU only has limited memory capacity, an off-line version cannot be applied efficiently. Figure 1.1 shows the different versions of the image reconstruction that will be implemented in this work. The different implementations will be compared regarding computation speed and image quality.

Figure 1.1: The flow chart shows the different versions that will be implemented in this work: from the measured Biograph16 data, an off-line and an online version on the CPU (with double and single floating-point precision) and two online versions on the GPU (RDI and RII).

We use a medium-range graphics card but a high-end workstation; it will be analyzed whether even a low-budget GPU can provide satisfying computational output compared with an expensive workstation. Furthermore, it will be evaluated how easily an implementation on a GPU can be achieved and what restricts the use of a GPU. Existing works [11-14] achieved different acceleration factors, depending on the applied image reconstruction algorithm and the hardware used; e.g. Barker et al.
achieved a calculation time on the GPU that was 14 times faster than the CPU implementation of the same algorithm [12]. The image quality was stated as acceptable by all authors. The methods and the hardware used in our approach differ from the ones that were

applied by the other groups; therefore it is difficult to estimate a gain factor or the image quality in advance.

1.3 Thesis outline

Chapter 2 introduces the basic concepts of PET technology. The underlying physics is briefly explained, as well as the limits of PET regarding resolution and image quality. Throughout this work, data from the clinical PET scanner Siemens Biograph16 are used, so the key data of this device are presented. Furthermore, the concept of projections and their sampling into sinograms is introduced.

Chapter 3 deals with the image reconstruction process. Several subtopics have to be known to understand the ML-EM algorithm, such as Michelograms, the processing of the measured data and the system matrix, as well as the weighting algorithm used to calculate the system matrix. This leads to the final understanding of how the ML-EM algorithm reconstructs an image.

Chapter 4 covers the implementation of the algorithm on the CPU. The ML-EM is implemented in two ways: first, the calculation is processed with a pre-calculated system matrix; second, the whole reconstruction is done on-the-fly, which means the data of the system matrix are calculated when they are needed. The theoretical basics of the second and third chapters are applied to the measured data from the Biograph16 scanner.

Chapter 5 provides an overview of GPUs. The hardware and memory model of the NVIDIA Geforce 8600GT is presented; this graphics card is used for the GPU implementation of the image reconstruction. The CUDA programming language is introduced, including the software model and its restrictions, as well as the basic commands and variables that have to be known.

In chapter 6 the implementation of the image reconstruction on the GPU is explained. The differences from the CPU implementation are presented.
The implementation of the back-projection in particular will be dealt with in more detail, as problems due to simultaneous memory access occurred in the GPU implementation.

Chapter 7 covers the results and discussion. The GPU and CPU results are compared regarding reconstruction time and image quality. Differences are analyzed and discussed.

Chapter 8 concludes the presented work and gives possible future perspectives on the use of GPUs in image reconstruction.


Chapter 2

Positron Emission Tomography

A clinical PET study starts with the application of a tracer to a patient. This tracer is radioactively labelled: radioisotopes are used to label molecules that are biologically active. For example, the positron-emitting unstable isotope fluorine-18 (18F) is used to create the fluorodeoxyglucose (FDG) molecule. This molecule is chemically very similar to glucose, so that after injection the body takes it up and introduces it into the glucose metabolism, where the molecule is not further metabolized and is excreted later on. There are other isotopes used in PET studies, among others 11C, 15O or 13N; however, 18F bound as FDG is one of the most commonly used tracers in PET.

2.1 Positron emission and annihilation

The isotopes incorporated in the tracer molecules are unstable, so they undergo a radioactive decay process in which a proton (p) is converted into a neutron (n). In this so-called β+ decay a positron (e+) and a neutrino (ν) are emitted:

    p → n + e+ + ν    (2.1)

The neutrino leaves the patient immediately, but the positron travels a certain distance (the positron range) through the tissue, where it loses most of its kinetic energy by causing ionization and excitation, before it finally annihilates with an electron (e-) of the surrounding matter. Two photons are created that depart from the annihilation position in almost opposite directions. In this event the mass of the electron and the mass of the positron are transformed into energy; each of the two photons, also called γ-rays, carries 511 keV (see figure 2.1).

Figure 2.1: Positron emission and annihilation process [15].

2.2 Coincidence Detection

The annihilation process is the basis of coincidence detection. The term coincidence refers to the nearly simultaneous detection of two photons. If an annihilation takes place, the γ-rays pass through the surrounding matter until they leave the body and reach the γ-sensitive detectors. A PET ring scanner consists of several rings of scintillator crystals. A photon gives rise to a light flash when it hits a scintillator crystal. This visible light is amplified by a photomultiplier tube (PMT) and then converted into an electrical signal by a light-sensitive sensor. If two photons are detected within a limited time window (a few nanoseconds), it is assumed that they were emitted in the same annihilation process. The incoming signals are processed by a so-called coincidence electronics circuit (see figure 2.2).

Figure 2.2: Scheme of PET process [15].

The position of an annihilation event identified within the same time window is most likely along the line connecting the two detectors that converted the γ-rays into an electrical signal. This line is called the line of response (LOR) (see figure 2.3). In PET the detected events are either stored in histograms (histogram mode) or in lists (list mode). While in the first approach the count number of a LOR is incremented when the

appropriate event is detected, the latter stores each event separately, including a time marker, which makes it possible to place the detection moment on a time line. It is always possible to generate histograms from list-mode data, but not vice versa, because when the data are collected in a histogram, information such as the time marker and detection energy is lost.

Figure 2.3: The figure illustrates an annihilation event identified by two detectors. The position of the annihilation event is most likely along the line that connects the two crystals; this line is called the Line Of Response (LOR) (picture modified from [11]).

2.3 Limits of PET

The spatial resolution of PET is limited, among others, by the physical characteristics of the isotopes used. The positrons move through the tissue before annihilation takes place; the positron range gives the mean distance from the positron emission to the annihilation position. The positron range varies with the applied isotope and depends on the density of the traversed tissue and on the maximal positron energy, which is e.g. 634 keV for 18F [16]. In general it can be said that the lower the energy, the shorter the mean positron range and the better the obtained spatial resolution. The positrons and electrons are in motion when they annihilate; due to the conservation of momentum the photons are not emitted at exactly 180° to each other, but with a deviation of ±0.5°. This deviation results in additional spatial degradation, which is called annihilation angle blurring. The influence of this effect is less important for small animal scanners, as the photons cover less distance until detection. All detection systems have limits to the rate at which events may be detected and processed. The electronics as well as intrinsic detector characteristics like the afterglow effect impose a finite maximum count rate of events.
When several coincidences occur close together in time, they cannot be detected as single events; this effect is called pulse pile-up. The detector dead time represents the minimum time that has to pass between two events for them to be detected as two separate events. Due to this effect a loss of

events can occur, which can become significant at high count rates. Another problem that limits the image quality is photon attenuation: due to this effect, a reduced number of true counts is obtained. Effects that cause this loss of true counts are Compton scattering, the photoelectric effect and Rayleigh scattering. To reduce the effect of attenuation, a transmission scan can be performed: the object is scanned with an external source, and the measurement is set in relation to a blank scan of the scanner used. This relation allows the overall attenuation of the scan to be computed and compensated for. Random coincidences are two detection events that are matched as one event because they were detected in the same time window, even though their origin is not the same annihilation event. Random coincidences lead to a loss of contrast in the image. Figure 2.4 illustrates the different coincidences.

Figure 2.4: A scattered coincidence is one in which at least one of the detected photons has undergone at least one Compton scattering event prior to detection. Random coincidences occur when two photons not arising from the same annihilation event are incident on the detectors within the coincidence time window of the system. True coincidences occur when both photons from an annihilation event are detected by detectors in coincidence. (Image source: intro/intro src/section2.html)

Another aspect that limits the resolution and image quality of a PET scan can be patient or organ movement. There are several investigations which try to reduce the influence of this movement; one idea is motion tracking using a camera [15], another interesting approach is a complete analytical analysis of images from different time frames to compensate for the motion between the time frames [17].

PET/CT scanner

A PET/CT scanner is a combined system that includes two imaging modalities, Computed Tomography and Positron Emission Tomography. Computed tomography (CT) is based on x-rays and provides exact and very detailed morphological information about the patient. The combination of the two approaches can be very helpful, especially when highly specific tracers are used; sometimes it is crucial to associate a very localized higher uptake with morphological structures. There are several systems on the market, such as the Discovery scanner from General Electric, the Gemini from Philips or the Biograph from Siemens. Applications are widely found in diagnosis in cardiology, oncology, neurology and psychiatry [1, 18].

Figure 2.5: Cranial images taken from a PET/CT scanner. The CT image on the left shows the anatomy. For the PET study 18F-FDG was used as a tracer to indicate metabolic processes (image in the middle). On the right both images are fused.

PET scanner Siemens Biograph16

Throughout this work we will perform image reconstructions of measured data from a Siemens Biograph16 scanner located at the university hospital Klinikum rechts der Isar of the Technische Universitaet Muenchen (TUM), Germany. The Biograph16 is a PET/CT system; it is not the latest generation by Siemens, as the Biograph64 has already been launched, but the Biograph16 is already a fully integrated PET/CT scanner. The fact that the Biograph16 is not just a PET scanner but a PET/CT scanner does not have any influence on this work, so that in the later chapters the focus is placed only on

the PET system. A list of the characteristic data of a Biograph16 scanner can be found in table 2.1; for completeness the key data of the CT system part are also provided.

Table 2.1: Key data of the Biograph16 scanner from Siemens [1], [2], [3].

General:
  Geometry: ring
  Detector ring diameter (cm): 82.7
CT:
  Slices: 16
  Rotation speed (s): 0.42
  Temporal resolution (ms): 105
  Spatial resolution (line pairs/cm): 30
PET:
  Scintillator: LSO
  Detector dimensions (mm):
  Number of detector blocks: 144
  Crystals per detector block: 8 x 8
  Total number of crystals: 9216
  Photomultiplier tubes: 4 per block
  Axial field of view (cm): 16.2

The γ-sensitive scintillator crystals of this tomograph are combined in groups of 8 x 8 that form one detector block; each of these blocks has four PMTs, 48 blocks are unified in a ring, and there are three block rings in a row. Between the block rings there are small gaps; in the case of the given scanner the gap has a size of 0.36 cm. For the geometry it is important to know the dimensions of the detector block rings; in reconstruction, however, we refer to detector or crystal rings, i.e. the number of crystal rings in the axial direction, so for this scanner 24 detector rings, eight in each block, three blocks in a row.

Figure 2.6: PET/CT scanner Biograph16 [3].

The data given in table 2.1 refer to the configuration of the scanner that is in operation at the university hospital in Munich; the Biograph16 is, however, also available with a different number of crystals per block. The Biograph uses lutetium oxyorthosilicate (LSO) crystals, which are attractive for PET because of their physical characteristics. Originally bismuth germanate (BGO) crystals were widely used, but LSO is one of the materials that have been increasingly

used in the last scanner generations. Advantages of LSO scintillators are their relatively fast light decay time and high light yield. The fast scintillation light decay time decreases dead time and allows the use of short coincidence time windows, thus improving counting rate capability and reducing the contribution of random coincidences [1].

Figure 2.7: Example of a detector block with photomultiplier tubes and a detector ring (in this example with 4 block rings, whereas the Biograph16 just has 3 block rings) [15].

Theoretically each detector pair defines a LOR. If n is the number of detectors, the number of LORs N_LOR can be calculated as follows:

N_LOR = n(n − 1)/2 = 42,462,720 LORs   (2.2)

Such a high number of LORs is difficult to handle regarding data size and computation time; hence one tries to reduce the number of LORs taken into account for reconstruction. There are several approaches to reduce the data size; in the Biograph16, sinograms are used to sample the data and Michelograms are applied to define the maximum angular acceptance of the sinograms in the axial direction. In sections 2.5 and 2.7 both techniques will be explained in detail.

2.5 Sinogram

The number of detector crystals in PET scanners has increased immensely in the last decades. Between the development of the Biograph16 and Biograph64 scanners from Siemens lie just a few years, but the number of crystals increased from 9216 to 32,448 for the high-end version of the Biograph64, so the number of crystals was multiplied by a factor of 3.5. If we take all LORs into account and use equation 2.2, we get 42,462,720 and 526,420,128 LORs for the Biograph16 and Biograph64, respectively; this means an
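Equation 2.2 is simply the number of unordered detector pairs; a one-line sketch verifies the figures quoted for both scanners (the 32,448-crystal count for the Biograph64 high-end version is inferred from its quoted LOR total, not stated by the vendor data at hand):

```python
def n_lors(n_crystals):
    """Number of possible LORs among n crystals, eq. (2.2): n(n-1)/2."""
    return n_crystals * (n_crystals - 1) // 2

print(n_lors(9216))   # Biograph16  -> 42462720
print(n_lors(32448))  # Biograph64 high-end (inferred) -> 526420128
```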

increase in the amount of measured data by a factor of about 12.4. Theoretically all LORs could be used for image reconstruction; however, it is difficult to handle the data size and reconstruction time required if all physical coincidence lines are taken into account. Therefore commercial scanners reduce the file size by data sampling. The data are sampled in a sinogram, a diagram that represents a coincidence line as a function of its angular orientation and its minimum distance to the center. A sinogram is discretized; the elements or pixels of a sinogram are called bins. The number of bins in each dimension defines the resolution of the sinogram. When an annihilation event is detected along a LOR, the line is mapped to the corresponding bin in the sinogram and the value of this pixel is incremented by one; hence the value of a bin represents the number of events detected along the coincidence lines that are sampled to that pixel. It is possible that several lines are mapped into one bin; in that case, after data sampling it can no longer be determined along which line an event was detected. The angle and the shortest distance to the center of the gantry that are used to define a LOR are illustrated in figure 2.8. These two values define any line in the transaxial 2D plane; if a line is plotted in a sinogram as a function of the angle and the distance, it is visible as a point (s. figure 2.10).

Figure 2.8: A LOR can be defined by the minimal distance from the line to the center and the angle which is enclosed by one of the coordinate axes and the perpendicular line between the center and the LOR. Several definitions can be found in the literature; the angle is either defined to the x- or to the y-axis, clockwise or counterclockwise. However, as long as the implementation stays with one definition, the only difference that can be noted is a rotated or mirrored image.
If a scan of a point source is processed and all measured coincidence lines are plotted in a sinogram, half a sine wave becomes visible (if the source were in the center of the gantry, a straight line parallel to the y-axis would be visible). The distance between the source and the center of the gantry is given by the amplitude of the sine wave. A large number of LORs originating from many different annihilation locations, plotted in such a graph, will consist of many overlapping sine curves (s. figure 2.9C).
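The mapping from a detector pair to sinogram coordinates, and the sinusoidal trace of a point source, can be sketched as follows (an illustration with one particular angle convention; as noted above, conventions differ between implementations):

```python
import numpy as np

def lor_params(x1, y1, x2, y2):
    """(angle, signed distance to center) of the LOR through two detector
    positions; this (phi, s) pair addresses the sinogram bin."""
    phi = np.arctan2(y2 - y1, x2 - x1) % np.pi   # orientation in [0, pi)
    s = x1 * np.sin(phi) - y1 * np.cos(phi)      # signed distance to origin
    return phi, s

# All LORs through a point source at (x0, y0) satisfy
# s(phi) = x0*sin(phi) - y0*cos(phi): a sine wave whose amplitude
# equals the source's distance from the center of the gantry.
```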

Some scanners, such as the Biograph16, sample the PET data directly into sinograms. To do so, hardware circuits were developed to ensure fast conversion from coincidence signal to sinogram; the electronic cards also support the PETLINK (s. 2.8) data format [19].

Figure 2.9: Sinogram formation. (A) shows an object in a scanner, the center of the gantry is marked with an x; four annihilation events are detected along the LORs A, B, C and D. These LORs can be described as a function of their angular orientation and the minimal distance to the center. (B) shows a sinogram, a graph that represents the angle of a coincidence line versus its displacement from the center. The LORs illustrated in (A) are mapped to this graph. If all possible LORs that pass through the point source of (A) are plotted, they map out half of a sine wave, as shown here. (C) Sinograms of more complex objects are composed of many overlapping sine waves. Here the sinogram of a brain PET scan is shown; the reconstructed image is presented in (D) [20].

The values of one row in the sinogram represent all LORs with the same angle but varying distance to the center. Each pixel value is the sum of events that happened along these coincidence lines. Such a collection of row sums is referred to as a projection; a simple example of such a projection can be seen in figure 2.10. By defining a maximum value for the displacement, an area can be defined in which coincidence events should be detected. Events detected by crystal pairs whose LORs do not pass through this area will not be saved in the sinogram and the information is lost. Coincidence events can be detected by crystals within the same ring or by crystals of different rings. LORs that connect only detectors of the same transaxial plane are called direct, and the sinogram where these lines are mapped into is called a direct sinogram.

Figure 2.10: The figure shows a projection of parallel LORs. The projection is the sum of the events that happened along all the LORs of this sinogram row [21].

Annihilation events that are detected by crystals of different rings are saved into crossed or oblique sinograms. Figure 2.11 illustrates a transaxial cut through a PET scanner with six detector rings; a) shows direct, b) oblique LORs. In image reconstruction either just direct sinograms can be used, or direct and oblique ones. In the first case one direct sinogram samples one ring, and the density distribution detected within this plane is reconstructed to one slice of the image. In the second case crossed LORs are also taken into account.

Figure 2.11: Direct LORs connect detectors within the same transaxial plane (a). Crossed or oblique LORs are between detectors of different rings (b).

2.6 Arc correction

As can be seen in figure 2.12, the spacing between parallel LORs differs depending on the distance to the center of the gantry. This comes from the curved geometry of the gantry. To avoid this problem, arc correction is applied to the raw measurement data, so that an equal spacing over the whole gantry is achieved. It is important to know whether the data used for image reconstruction are corrected, as this can lead

to artifacts in the final image. Arc correction is more important for large objects, where some of the LORs are far away from the center; small objects that only produce events in LORs close to the center are less influenced by uncorrected data. The data used in this work are already arc corrected by the PET scanner; however, it is not known how this correction is done. For the coordinate calculation of the LORs we assume a trigonometrical correction of the center of the bin of the projection (s. ).

Figure 2.12: The figure shows a set of parallel LORs. Due to the curved nature of the gantry, LORs towards the periphery are more closely spaced than those towards the center. Arc correction is applied before image reconstruction to adjust the LORs to an equal spacing all over the gantry.

2.7 Michelogram

A method to visualize the transaxial planes related to direct and oblique sinograms is the Michelogram. The ring where the first event is detected is plotted on the x-axis, and the ring of the second event on the y-axis. Figure 2.13 a) shows direct and indirect LORs that are mapped to different sinograms; b) shows a Michelogram of several sinograms where just direct coincidence lines are mapped to the bins; c) shows all sinograms that are illustrated in a). Direct sinograms are marked by an x, oblique sinograms by two x connected with a line.

Figure 2.13: a) shows direct and indirect LORs that are mapped to different sinograms. b) shows a Michelogram of direct sinograms, just direct coincidence lines are mapped to the bins. c) shows all sinograms that are illustrated in a).
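The Michelogram bookkeeping can be sketched as a small helper that enumerates ring pairs and classifies each cell as direct (same ring) or oblique; the optional maximum ring difference anticipates the RD limit discussed with figure 2.15. This is an illustration, not scanner code:

```python
def michelogram_cells(n_rings, max_ring_difference=None):
    """Cells (r1, r2) of a Michelogram: r1 is the ring of the first
    detected photon, r2 the ring of the second. Diagonal cells belong
    to direct sinograms, off-diagonal cells to oblique ones."""
    cells = {}
    for r1 in range(n_rings):
        for r2 in range(n_rings):
            if max_ring_difference is None or abs(r1 - r2) <= max_ring_difference:
                cells[(r1, r2)] = "direct" if r1 == r2 else "oblique"
    return cells
```

For the Biograph16's 24 crystal rings all 24 x 24 = 576 cells are present; a maximum ring difference of 11 removes the corner cells and leaves 420.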

Note that if the scanner is exposed to a uniform source of activity, a cross plane will contain twice as many counts as a direct plane, since it combines the data of two cells in the diagram; however, this is only valid if the sensitivity of the cells is equal and no inter-plane septa are used. Several sinograms can be aggregated into one, for two reasons: first, some sinograms do not have enough counts, and combining several bins increases the detected events per bin and improves the signal-to-noise ratio (SNR) of the reconstruction. The second reason is a reduction of data size. However, the aggregation of sinograms also means a loss of information; it cannot be reconstructed in which plane an annihilation event was detected, as it is mapped to an area rather than to a line. A Michelogram is used to illustrate the aggregation of sinograms. The term span describes the extent of the axial data combined. The span is the sum of the cells in the Michelogram unified into an oblique sinogram added to the number of cells combined into a direct sinogram. Figure 2.14 shows different span numbers.

Figure 2.14: The span defines the axial acceptance angle. It is calculated using a Michelogram, as the sum of cells of an odd sinogram added to the sum of the cells of an even sinogram [22].

A Michelogram can be divided into several segments. Figure 2.15(A) shows the different segments of a Michelogram; in this case the diagram is divided into five parts. The segments separate the different planes by their axial combinations of rings. Each plane in one segment includes approximately the same rings for the even and odd sinograms. The axial extent of the coincidences is characterized by the maximum ring difference (RD). This parameter defines the maximum number of rings that can lie between two detectors acquiring one coincidence event. In figure 2.15(A) all ring combinations are allowed,

whereas in (B) an RD of 11 is defined, so that e.g. an event detected by crystals of rings 1 and 15 will not be saved in a sinogram; this can be seen in the Michelogram as the cell of column 1 and row 15 is empty.

Figure 2.15: (A) shows the different segments of a Michelogram, here noted with 0, 1, -1, 2 and -2. (B) shows a Michelogram with a maximum ring difference (RD) of 11. The maximum allowed ring combination for ring 1 is ring 12; all events detected in crystals whose rings are further apart than 11 rings are not saved in a sinogram.

It is obvious that the application of an RD leads to a loss of information. Events that are detected along coincidence lines with an axial angle above a certain value can be detected, but are not registered in a sinogram and are not further employed for image reconstruction. Many scanners use detectors with small axial widths to provide fine axial sampling; using small detector widths leads to low in-slice sensitivity and thereby noisy reconstructed images. So it can make sense to increase the span to increase the in-slice counts for the sinograms and thus improve the sensitivity. On the other hand this means a possible loss of spatial resolution. For that reason, some vendors refer to the use of span 3 and 7 as high resolution and high sensitivity modes, respectively [20].

2.8 Measurement file

In this work, the measurement data from the scanner are saved in list-mode using the PETLINK format [23]. The PETLINK format was established by CTI/Siemens in 1995 with the aim of creating a standard format for the measurement output files of PET and SPECT scanners. This format does not just permit saving the indices or coordinates of

crystals that detected a coincidence event, but also information about elapsed acquisition time, physiological events (such as cardiac gating, breathing, patient motion), detector dead time status or scanner motions (such as rotating detectors, rotating rods, bed). Furthermore it is possible, instead of saving the address of a detector pair, to directly save the bin address which refers to a sinogram (s. 2.5).

List-mode data

List-mode data provide, apart from the detection position, also temporal and energy information. There are different ways how the detection position can be saved. Single event detection saves each detection as a single event; in the data analysis two detections can be identified as a coincidence event using the time information, so each event is saved with the detector position, the detection energy and the detection time. Coincidence event detection matches two detections directly to one coincidence event; therefore the two detector positions are saved, the energy of both detections and either one or both detection times. Sinogram list-mode is used in this work: the coincidence event is directly sampled into sinograms, and the bin ID is saved as well as one time stamp and the detection energy. The additional time data make it possible to implement a dynamic image reconstruction, where e.g. a set of images is reconstructed for a set of time-frames. The images are reconstructed independently and are considered static within each time frame, so the temporal change of the distribution of the tracer in a region of interest (ROI) can be observed [24]. An inconvenience of the PETLINK format is that a file header is not mandatory; such a header could include, on the one hand, patient information, but more importantly from the technical side, information about which data format was used. PETLINK offers several formats for the different list-mode types.
The files we worked with were provided without header, so no information about the data type (single, coincidence or sinogram list-mode) was available. Therefore our first step was to determine the data format. Hence we had to implement all data formats and then check the results for plausibility. Comparing the different outputs, just one data format gave plausible results, so that we could conclude that the measurement data were saved in sinogram list-mode using the 32 bit word format. Table 2.2 shows the available information that can be saved in this format and the bit encoding of the tag words. For this work the focus will be put on the event word and on the time marker; the coding of these data is:

Table 2.2: PETLINK 32 bit word format. First bit is the most significant bit (MSB); X = not restricted.

  Tag Bit (T) (1 - Tag / 0 - Event):           TXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX
  Event Word (not a tag word):                 0XXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX
  TAG 1: Time Marker / Block Singles:          10XX XXXX XXXX XXXX XXXX XXXX XXXX XXXX
  TAG 2: Gantry Motions and Positions:         110X XXXX XXXX XXXX XXXX XXXX XXXX XXXX
  TAG 3: Patient Monitoring
         (Gating/Physiological/Head Tracking): 1110 XXXX XXXX XXXX XXXX XXXX XXXX XXXX
  TAG 4: Control / Acquisition Parameters:     1111 XXXX XXXX XXXX XXXX XXXX XXXX XXXX

Table 2.3: Bit encoding of event word and time marker of the 32 bit word format.

  Event Word:   PWB BBBB BBBB BBBB BBBB BBBB BBBB BBBB   (most significant nibble = 0uuu)
    P: Prompt (1 - Prompt; 0 - Delay)
    W: Window (1 - Transmission; 0 - Emission; alternately defined as B29)
    B: Bin Address (note that a 30-bit BA field is now more widely used)
  Time Marker:  M MMMM MMMM MMMM MMMM MMMM MMMM MMMM
    M: Elapsed milliseconds (bits 0-28)

Table 2.4 shows an example of data we decoded from the measurement file; just the time stamp and the bin ID are presented. For our application the time marker was important for the plausibility test of the data, but as we implemented a static reconstruction, the time information has no further relevance for our approach. The bin index is the essential information we needed; to use the data for the ML-EM algorithm they have to be summarized in an array with one element per bin, so that the size of the vector is (N_bins x 1). Each element of the vector contains as value the number of detected events; equation 2.3 shows the structure of the vector.
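The bit patterns of Tables 2.2 and 2.3 translate directly into a few mask tests. The sketch below assumes the 29-bit bin-address variant shown in Table 2.3 (the format also allows a 30-bit field) and only distinguishes the word types needed in this work; function and field names are our own:

```python
def decode_petlink32(word):
    """Decode one PETLINK 32-bit word according to Tables 2.2/2.3 (sketch)."""
    if word & 0x80000000 == 0:                      # MSB 0 -> event word
        return {"type": "event",
                "prompt": bool(word & 0x40000000),  # P: prompt vs. delayed
                "window": bool(word & 0x20000000),  # W: transmission/emission
                "bin": word & 0x1FFFFFFF}           # B: bin address (29 bit)
    if word >> 30 == 0b10:                          # TAG 1: time marker
        return {"type": "time", "ms": word & 0x1FFFFFFF}
    return {"type": "other_tag"}                    # TAGs 2-4, unused here
```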

Table 2.4: Example of list-mode data from the Biograph16, listing time stamp (ms) and bin index.

y = ( events bin 0, events bin 1, events bin 2, ..., events bin N_bins−1 )^T   (2.3)
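Building the measurement vector of equation 2.3 from the decoded event words is then a histogram over bin indices; a minimal sketch (names are our own):

```python
import numpy as np

def counts_vector(bin_indices, n_bins):
    """y of eq. (2.3): element j holds the number of decoded events
    whose bin address equals j."""
    return np.bincount(np.asarray(bin_indices, dtype=np.int64),
                       minlength=n_bins)
```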

Chapter 3 Image Reconstruction

The process of computing the emission density distribution from the detected coincidences is called image reconstruction. Numerous algorithms exist to perform this task; they can be classified into two groups, analytical and iterative approaches. The iterative methods can be further divided into algebraic and statistical approaches; a statistical approach will be applied in this work. Traditionally the analytical approaches [21] were used for PET image reconstruction, as the iterative ones are computationally more expensive. Nowadays, however, computational cost no longer restricts the choice of method. In recent times the iterative methods have become state-of-the-art; this was made possible by the constant improvement of the calculation capacities of computers. Iterative approaches are often preferable because they allow correcting for physical effects of the scan process (s. 3.1), e.g. scattering, blurring or detector sensitivity. Furthermore a higher image quality can be achieved in comparison to analytical methods. Iterative reconstruction methods consider the projections (s. 2.6) and the reconstructed object as discretized functions. The number of projections depends on the sampling of the sinogram; we define the total number of bins as B. The object is discretized into V = m · n · o voxels, so it can be described by V data values. If a_ij is the probability that an annihilation event generated in voxel j is detected along the coincidence line i, and y is the measurement vector, the measurement process can be described using the equation:

y_i = Σ_{j=0}^{V−1} a_ij x_j,   for 0 ≤ i < B   (3.1)

This equation can also be written in matrix form:

y = A · x,   with y ∈ R^B, A ∈ R^(B×V) and x ∈ R^V

In principle the original image x could be reconstructed from the measurement vector y and the probability matrix A; inverting A gives:

x = A^(−1) · y

However, this is not possible as the matrix contains singularities and is not invertible. Instead, iterative approaches are chosen to solve the problem. The measured projections y are compared to theoretical projections y^(k),forward that are determined by forward projection of the estimate [22]:

y^(k),forward = A · x^(k)   (3.2)

From this comparison, correction factors y^(k),corrected are calculated, which can be back projected:

x^(k),back = A^T · y^(k),corrected   (3.3)

which then leads to a new, improved estimate:

x^(k+1) = x^(k) · x^(k),back   (3.4)

This is one approach of an iterative technique to reconstruct an image. For reconstruction we will use one of the standard methods presented by Dempster et al. in 1977 [25], the maximum likelihood expectation maximization (ML-EM) algorithm. This algorithm is applied in emission tomography as an iterative technique for computing maximum likelihood estimates of the activity density parameters. Using this mathematical model, it is possible to calculate the probability that any initial distribution density in the object under study could have produced the measured data. In the set of all possible images, which represent a potential object distribution, the image having the highest probability is the maximum likelihood estimate of the original object [26]. We have chosen this algorithm because ML-EM is one of the standard methods in the field of PET image reconstruction; it is a simple algorithm that provides good output results. It has the advantages that it has no free parameters and that it shows good convergence.

3.1 System Matrix

The System Matrix (SM) A incorporates the physical factors to be taken into account in the reconstruction process. The SM makes it possible to model more accurately the relationship between the object and the projection space. The accuracy with which this matrix is defined has a critical role in the quality of the reconstructed image [27]. We denote the system matrix as A, where each element a_ij models the probability that an event generated in voxel j is detected along a line of response i; therefore the matrix is also called probability matrix or system response matrix. In iterative image reconstruction an approximate solution is searched for by solving the following linear system of equations with an iterative approach:

y_i = Σ_{j=0}^{V−1} a_ij x_j,   for 0 ≤ i < B

where y is the measurement vector (s. equation 2.3), x the unknown image and A the known system matrix with a size of (B × V), where B is the number of bins of all sinograms and V the number of voxels in the image space. The goal of the image reconstruction is to use the data values of y to find the image x. The possibility to include physical factors in the modeling of the system matrix is one of the advantages of the iterative approaches. There exist several methods to model the SM; e.g. using Monte-Carlo simulations permits computing the probability matrix at once [28], another approach is to factorize the matrix [29] into several sub-matrices:

A = A_det.sens · A_positron · A_blur · A_atten · A_geom   (3.5)

Each factor models a specific part of the scan process, such as the sensitivity of the detectors (A_det.sens), the positron range (A_positron), the attenuation through the traversed tissue (A_atten), blurring effects (A_blur) or the geometry of the LORs (A_geom). There are more factors which can be added to make the model of the system more accurate, e.g. Compton scatter or object scatter.
In this work we will focus on the geometrical factor, which models the geometric probability term relating each voxel j to an LOR i. This is convenient as the information we have about the scanner was not detailed enough to define a more accurate model. Applying other factors than the geometrical one would yield better image quality; however, improving image quality is beyond the scope of this work, so that a simple model of the SM is sufficient. Hence our system matrix only consists of A_geom.

A = A_geom   (3.6)

The SM A_geom is sparse, i.e. a matrix that is mostly populated by zero entries. To illustrate the number of non-zero entries, we compute an example. We take the bin with ID 0 out of the sinogram; it belongs to a direct sinogram, counting only coincidence events that were detected by crystals of the same ring. The voxels one LOR intersects depend on its angular orientation (s. figure 3.1); if we assume that 100 voxels are crossed, the row of this bin in the SM has 100 non-zero entries. However, each row contains one entry per voxel of the image space, so all but 100 elements of the row are zero.

Figure 3.1: An arbitrary LOR crosses the image space, presented in 2D and 3D; voxels that are passed by this LOR are filled [22].

3.1.1 Siddon algorithm

There are several methods to model the geometrical part of the probability matrix. They differ in the fundamental approach, the accuracy, the time of calculation and/or whether they are voxel or bin driven. A list of different methods can be found in the bibliography [30-38]. The algorithm used throughout this work was published by R. Siddon in 1985 [39]. This method is widely used even though its accuracy is not that high; as the spatial resolution of the reconstructed image is a minor aspect for this work, Siddon's approach is an adequate method to model the SM. Several modifications of this algorithm have been published, improving the accuracy or the calculation procedure; we will use a version developed by Jacobs et al. in 1998 [40]. The difference to the original Siddon algorithm is a reduction of calculation steps and thus an acceleration of the ray tracing; the principle, however, remains the same, therefore the algorithm used will be referred to as Siddon throughout this work.
The basic idea of Siddon's algorithm is that the contribution of a voxel j to the detection of events in LOR i is proportional to the length of intersection (LOI) of the

LOR through the voxels. Note that the pixels in this section are denoted by their x, y and z position in the image space, indexed by m, n and o, respectively. We denote the density of pixel (m, n) as ρ(m, n) and the LOI contained by this pixel as l(m, n). Hence, the line integral from point p(p1x, p1y) to point p(p2x, p2y) (s. figure 3.2) over the discretized image can be approximated by the following weighted sum [40]:

d_12 = Σ_m Σ_n l(m, n) ρ(m, n)   (3.7)

This can easily be extended to three dimensions:

d_12 = Σ_m Σ_n Σ_o l(m, n, o) ρ(m, n, o)   (3.8)

Figure 3.2: Length of intersection (LOI). The Siddon algorithm weights each voxel by the length of intersection of the LOR with the voxel. Note, m and n define the pixels in x and y direction.

One of the problems of the Siddon algorithm is the assumption that the interaction with the photon happens in the middle of the crystal surface facing the center of the gantry. This is not realistic: an event can be detected in the whole volume of the crystal, and the photon can even pass through several crystals before it is detected. That means that the LOR is in fact not a line but an area in 2D, respectively a volume in 3D reconstruction (s. figure 3.3). These effects are usually modeled in the sub-matrix A_blur; as we are not including this sub-matrix we expect a loss of image quality/resolution.
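A 2D sketch of the intersection-length computation underlying equation 3.7: the line is parametrized, the parameter values at which it crosses the grid lines are collected, and each resulting segment is assigned to the pixel containing its midpoint. This is a simplified illustration of Siddon's idea, not the optimized Jacobs variant used in this work:

```python
import numpy as np

def intersection_lengths_2d(p1, p2, nx, ny, pixel_size):
    """Lengths of intersection of the segment p1 -> p2 with an nx * ny
    pixel grid (lower-left corner at the origin). Returns {(ix, iy): l}."""
    p1 = np.asarray(p1, float)
    p2 = np.asarray(p2, float)
    delta = p2 - p1
    length = float(np.hypot(delta[0], delta[1]))
    alphas = [0.0, 1.0]                    # line parameter at the endpoints
    for axis, n in ((0, nx), (1, ny)):     # crossings with x- and y-grid lines
        if delta[axis] != 0.0:
            a = (np.arange(n + 1) * pixel_size - p1[axis]) / delta[axis]
            alphas.extend(a[(a > 0.0) & (a < 1.0)])
    alphas = np.unique(alphas)
    lois = {}
    for a0, a1 in zip(alphas[:-1], alphas[1:]):
        mid = p1 + 0.5 * (a0 + a1) * delta  # midpoint identifies the pixel
        ix, iy = int(mid[0] // pixel_size), int(mid[1] // pixel_size)
        if 0 <= ix < nx and 0 <= iy < ny:
            lois[(ix, iy)] = lois.get((ix, iy), 0.0) + (a1 - a0) * length
    return lois
```

The diagonal of a 2 x 2 unit grid, for instance, crosses pixels (0,0) and (1,1) with length √2 each; these lengths are exactly the geometric weights a_ij of the corresponding SM row.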

Figure 3.3: The dashed lines show the LORs as they are assumed in the Siddon algorithm; the gray area shows the region where the coincidence event could have happened.

3.2 Maximum Likelihood - Expectation Maximization algorithm

The general approach of an iterative method was explained through equations 3.1, 3.2, 3.3 and 3.4, and we already mentioned that we apply the maximum likelihood expectation maximization algorithm for image reconstruction. We already introduced a relationship for the system to transform from the image domain to the measurement domain and vice-versa; another model is needed that describes how the projection measurements vary around their expected mean values, derived from a basic understanding of the acquisition process. Shepp and Vardi [41] incorporated the Poisson nature of the acquired data in their reconstruction algorithm. The goal of the reconstruction algorithm is to find the distribution x which has the highest probability to have generated the measured projection data y. This probability function is called the likelihood function and is derived from Poisson statistics [42]:

L(Y = y | x) = Π_{i=0}^{B−1} ȳ_i^{y_i} exp(−ȳ_i) / y_i!   (3.9)

To maximize this likelihood Shepp and Vardi used the EM algorithm, which yields the following update towards a new estimate:

x_j^(k+1) = ( x_j^(k) / Σ_{i=0}^{B−1} a_ij ) · Σ_{i=0}^{B−1} a_ij y_i / ( Σ_{j'=0}^{V−1} a_ij' x_j'^(k) )   (3.10)

where x_j^(k+1) is the next image estimate for the voxel j based on the current estimate x^(k). The different computation steps that have been explained in the general approach (s. equations 3.1, 3.2, 3.3 and 3.4) are specified for the ML-EM in figure 3.4. All elements of the initial guess x^(0) are usually set to a constant value. The first step (1) forward-projects the estimate into the projection domain:

y_i^(k),forward = Σ_{j'=0}^{V−1} a_ij' x_j'^(k)   (3.11)

Then, in step (2), these projections are compared to the measured projections y:

y_i^(k),corrected = y_i / y_i^(k),forward   (3.12)

This forms a corrective factor for each projection, which is then back projected into the image domain in (3):

x_j^(k),back = Σ_{i=0}^{B−1} a_ij y_i^(k),corrected   (3.13)

This image correction factor is multiplied by the current image estimate and divided by a weighting term based on the system model to apply the desired strength of each image correction factor:

x_j^(k+1) = ( x_j^(k) / Σ_{i=0}^{B−1} a_ij ) · x_j^(k),back   (3.14)

The new image estimate is now reentered into the algorithm for the next iteration step; the algorithm repeats itself while the estimate approaches the maximum likelihood solution [21].
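For a small dense system matrix, update 3.10 (equivalently, steps 3.11-3.14) fits in a few lines of array code. This is a sketch to make the update concrete, not the CPU/GPU implementation of the following chapters, where A is sparse or computed on the fly:

```python
import numpy as np

def mlem(A, y, n_iter=100, eps=1e-12):
    """ML-EM, eq. (3.10). A: (B, V) system matrix, y: (B,) measured counts."""
    B, V = A.shape
    x = np.ones(V)                              # constant initial guess x^(0)
    sens = A.sum(axis=0)                        # weighting term sum_i a_ij
    for _ in range(n_iter):
        y_fwd = A @ x                           # (1) forward projection, eq. 3.11
        ratio = y / np.maximum(y_fwd, eps)      # (2) compare with measurement, eq. 3.12
        x_back = A.T @ ratio                    # (3) backprojection, eq. 3.13
        x = x / np.maximum(sens, eps) * x_back  # (4) multiplicative update, eq. 3.14
    return x
```

With noise-free, consistent data y = A x_true the iterates approach x_true; with measured Poisson data the estimate approaches the maximum likelihood solution.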

Figure 3.4: Flow diagram of the ML-EM algorithm: (0) initial guess x^{(0)} in the image domain; (1) forward projection into the measurement domain; (2) comparison with the measured projections y; (3) backprojection of the ratios to all voxels; (4) image update. Starting with an initial guess x^{(0)}, the algorithm iteratively creates new image estimates based on the measured projections y.

Chapter 4

Implementation of ML-EM on CPU

Two versions have been implemented using the central processing unit (CPU) of the workstation. In the first version the system matrix is computed for all bins and the non-zero values of the matrix are saved to hard disk; we refer to this version as the off-line or full-matrix implementation. During the image reconstruction the elements of the system matrix a_{ij} are read from hard disk without being calculated again. In the second implementation, the on-the-fly or online version, the elements of the system matrix are calculated at the moment they are used in the reconstruction process; none of the SM elements are saved, so the same elements are calculated several times, depending on the number of iterations executed. In the following, the implementation of the off-line version is briefly explained; then the differences to the on-the-fly version are highlighted. In general, the decision whether an online or an off-line version is implemented for an image reconstruction depends on the application; the online version can be faster for list-mode data or high-resolution PET scanners.

4.1 Sinogram and Michelogram combined in Biograph16

Section 2.8 described how to extract data from the measurement file. Finally, the data were reduced to a list of event counts for each bin. One bin can include the counts of events that happened along several LORs; for image reconstruction it is essential to know which coincidence lines are sampled in one bin. Table 4.1 summarizes the data that define the sinogram resolution and the Michelogram for the scanner.

Table 4.1: Parameters used to build the Michelogram and sinogram of the Biograph16 scanner.

  Sinogram parameters:
    Radial number of bins      192
    Angular number of bins     192
    Displacement               -FOV/2 to FOV/2
    Angle                      -90° to 90°

  Michelogram parameters:
    Span                       7
    Max. RD                    17
    Number of segments         5
    Number of sinograms        175
    Sinograms per segment
    (segments -2 to 2)         25, 39, 47, 39, 25

Hence each sinogram consists of 192 × 192 bins; the angular sampling is about 0.94° and the radial sampling can be calculated as FOV/192. As the FOV is 58.2 cm for the Biograph16, the radial sampling is about 0.30 cm. Knowing the number of bins per sinogram and the total number of sinograms, we can calculate the total number of bins: 192 · 192 · 175 = 6,451,200 bins.

Applying the definitions of section 2.7 and taking into account the scanner data given in table 2.1, the Michelogram can be plotted (s. figure 4.1). Figure 4.2 shows an example of aggregated sinograms. The first picture illustrates the ring combination of several sinograms; the other images show the separated sinograms ordered by segments.

4.2 Computation of the system matrix A_geom

The system matrix A has B rows and V columns; each row is related to one bin of the sinograms and each column to one voxel of the image space. The images we use consist of 128 × 128 × 47 voxels, hence we obtain a matrix with

    N_SM = (192 · 192 · 175) · (128 · 128 · 47) = 6,451,200 · 770,048 = 4,967,733,657,600        (4.1)

elements.
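The dimensions involved can be verified with a few lines of arithmetic; the 128 × 128 × 47 image size used here is an assumption consistent with the matrix size quoted in equation 4.1:

```python
# Bin and voxel counts for the Biograph16 setup described above.
radial_bins, angular_bins, sinograms = 192, 192, 175
bins = radial_bins * angular_bins * sinograms
voxels = 128 * 128 * 47      # assumed image size, consistent with eq. 4.1
print(bins)                  # 6451200 bins
print(voxels)                # 770048 voxels
print(bins * voxels)         # 4967733657600 system-matrix elements
```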

Figure 4.1: Michelogram of the Biograph16 scanner with a span of 7 and a max. RD of 17, showing segments -2 to 2. The alternating markers 'o' and 'x' for the allowed ring combinations are only used to better distinguish the different segments.

Figure 4.2: Aggregated sinograms of the Biograph16. The first image shows several sinograms of the scanner (all 5 segments); the other pictures show the same sinograms ordered by segments (segment 0, segments 1 and -1, segments 2 and -2).

To compute the system model the Siddon algorithm is used (s. section 3.1.1). Several parameters are needed for the calculation of the LOI of a voxel with a LOR. First, the dimensions of the image space have to be known; these parameters can change with the scan, as the number of voxels in radial (x), angular (y) and axial (z) direction can vary, as can the size of the voxels along each axis. Second, the geometry of the scanner has to be given; in this case the coordinates of the detectors are of relevance. A coordinate system is defined whose origin lies in the center of the gantry, i.e. at the midpoint of the axial FOV and in the center of the detector ring (s. figure 4.3).

Figure 4.3: Coordinate system of the gantry, used for the calculation of the system matrix.

To calculate the system response for one bin, it is necessary to split the bin into its elements, i.e. into the different LORs it is built of. The Michelogram tells whether a bin consists of one, two, three or four LORs; the x and y coordinates of all LORs contributing to one bin are the same, they differ only in the z coordinate. Each event is detected by two crystals, and each crystal is characterized by its x, y and z coordinates, so each bin has two xy coordinate pairs and up to eight z coordinates, depending on the number of LORs in it. The Siddon algorithm has to be applied for each LOR and the LOI are summed up for all voxels:

    a_{ij} = \sum_{l=0}^{S_i - 1} w_{lj}        (4.2)

where w_{lj} is the length of intersection of one line of response with the voxel, S_i is the number of LORs that form bin i and a_{ij} is the sum of the intersection lengths of all these lines with voxel j.

4.2.1 Calculation of detector coordinates

The calculation of the z coordinates for a bin is straightforward. First, the bin has to be referenced to a sinogram; knowing the sinogram ID, the detector rings that contribute to it can be read out of the Michelogram. Knowing the ring indices, the z coordinate can be computed as follows:

    z_{ring} = z_0 + ID_{ring} \cdot \frac{FOV_{axial}}{N_{rings}}        (4.3)

where z_{ring} is the z coordinate of the ring of interest, z_0 is the z coordinate of the first ring, FOV_{axial} is the axial extent covered by the detector and the denominator is the total number of rings. The Biograph16 has an axial FOV of 16.2 cm and 24 rings, grouped in 3 block rings. For the calculation of the z coordinate the gaps between the blocks can either be included explicitly, or the size of the gaps can be spread among all 24 rings. Differences between the two approaches are hardly visible in the reconstructed image, so the gap is spread over all rings, which leads to eq. 4.3.

The x and y coordinates of the detectors are calculated using the information from the sinogram. In this approach we actually do not calculate the coordinates of a real detector but the intersections of the LOR defined by the bin coordinates with the detector ring. In this way our SM also incorporates the arc correction that was applied to the data.

Figure 4.4: Sketch to illustrate the xy-coordinate calculation. The intersection points P1 and P2 of the LOR with the detector ring (gantry radius) are searched.
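The coordinate lookup can be sketched in a few lines of Python: ring_z implements equation 4.3, and lor_endpoints anticipates the intersection formulas derived in equations 4.6-4.9 below. All numeric values here are illustrative assumptions, not the actual scanner constants:

```python
import math

def ring_z(ring_id, z0, fov_axial, n_rings):
    """z coordinate of detector ring ring_id (equation 4.3); the gaps
    between the block rings are spread evenly over all rings."""
    return z0 + ring_id * fov_axial / n_rings

def lor_endpoints(rho, theta, r):
    """Intersection points of a LOR given in Hessian normal form
    (rho, theta) with a detector ring of radius r (equations 4.6-4.9)."""
    s = math.sqrt(r * r - rho * rho)
    x1 = rho * math.cos(theta) + math.sin(theta) * s
    y1 = rho * math.sin(theta) - math.cos(theta) * s
    x2 = rho * math.cos(theta) - math.sin(theta) * s
    y2 = rho * math.sin(theta) + math.cos(theta) * s
    return (x1, y1), (x2, y2)

# Biograph16 axial geometry: FOV of 16.2 cm and 24 rings; placing the
# origin at the centre of the axial FOV puts the first ring at
# z0 = -8.1 cm (an assumption for illustration).
fov_axial, n_rings = 16.2, 24
z0 = -fov_axial / 2
print(ring_z(12, z0, fov_axial, n_rings))   # ~0.0, the centre of the FOV

# Hypothetical bin: rho = 10 cm, theta = 30 degrees, ring radius 41 cm.
p1, p2 = lor_endpoints(10.0, math.radians(30.0), 41.0)
print(math.hypot(*p1), math.hypot(*p2))     # both points lie on the ring
```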

For the calculation of the xy coordinates, the LOR and the detector ring are expressed as functions of x and y in the coordinate system (s. figure 4.4). Using the Hessian normal form of a line, we can define the LOR by its minimum distance to the coordinate origin, ρ, and the angle θ between the line and the y axis:

    ρ = x cos θ + y sin θ        (4.4)

The detector ring can be defined as a circle in a plane:

    r^2 = x^2 + y^2        (4.5)

where r is the radius of the circle, i.e. the radius of the detector ring. Solving equations 4.4 and 4.5 for x and y leads to a quadratic equation with two solutions; the xy coordinates of the two detectors (x_1, y_1) and (x_2, y_2) are:

    x_1 = ρ cos θ + sin θ \sqrt{r^2 - ρ^2}        (4.6)
    y_1 = ρ sin θ - cos θ \sqrt{r^2 - ρ^2}        (4.7)
    x_2 = ρ cos θ - sin θ \sqrt{r^2 - ρ^2}        (4.8)
    y_2 = ρ sin θ + cos θ \sqrt{r^2 - ρ^2}        (4.9)

The xy coordinates of one pixel in one specific sinogram are the same for any pixel with the same row and column ID in any other sinogram. So it is possible to calculate the xy coordinates just once, put them in a 2D array and look up the coordinates according to the row/column position of the bin in the related sinogram. The same can be done for the z coordinates, as they are the same for one whole sinogram.

4.2.2 Storage of the system matrix

Due to the sparse form of the system matrix it is not necessary to save all its elements, but only the non-zero ones. There are several compression approaches for saving sparse matrices; the one used for this program is briefly explained below.

Assume, as an example, the following sparse matrix:

           ( 0  3  0  0 )
    SM  =  ( 1  0  0  2 )        (4.10)
           ( 0  0  0  0 )
           ( 0  5  4  0 )

This matrix can be saved in three vectors:

    N_nonzero = ( 1  2  0  2 )        (4.11)
    voxel     = ( 1  0  3  1  2 )     (4.12)
    value     = ( 3  1  2  5  4 )     (4.13)

The first vector gives the number of non-zero entries per row, so its size equals the number of rows of the sparse matrix. The second vector gives the position (column) of each non-zero element within its row; the row an element belongs to has to be derived from the first vector. The third vector gives the value of the non-zero element referenced in vector two. This compression method does not allow direct access to one element given its row and column in the sparse form. Nevertheless, this approach is convenient for our calculation, as the implementation of the ML-EM is bin-driven, i.e. the algorithm works bin by bin, and therefore absolute referencing of elements is not needed.

Due to its size it is not possible to keep the system matrix completely in the random access memory (RAM) of the computer. The calculated matrix for an image space of 128 × 128 × 47 voxels has a total size of 19 GB (s. table 4.2).

Table 4.2: Data size of the SM in compressed sparse-matrix form.

    array            size      data format
    n_bin_nonzero    24.6 MB   integer
    voxels           6 GB      integer
    values           13 GB     double floating point precision
    total            19 GB
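The three-vector scheme can be sketched in a few lines of Python. This is a toy illustration with a small made-up matrix; the real implementation streams the vectors from hard disk:

```python
def compress(matrix):
    """Compress a dense matrix into the three vectors described above:
    non-zeros per row, column (voxel) indices and values."""
    n_nonzero, voxel, value = [], [], []
    for row in matrix:
        count = 0
        for j, v in enumerate(row):
            if v != 0:
                voxel.append(j)
                value.append(v)
                count += 1
        n_nonzero.append(count)
    return n_nonzero, voxel, value

def iterate_rows(n_nonzero, voxel, value):
    """Sequential, bin-driven traversal -- the only access pattern the
    ML-EM implementation needs; no random access to single elements."""
    pos = 0
    for i, count in enumerate(n_nonzero):
        yield i, voxel[pos:pos + count], value[pos:pos + count]
        pos += count

# Small made-up example matrix.
sm = [[0, 3, 0, 0],
      [1, 0, 0, 2],
      [0, 0, 0, 0],
      [0, 5, 4, 0]]
n_nonzero, voxel, value = compress(sm)
print(n_nonzero)   # [1, 2, 0, 2]
print(voxel)       # [1, 0, 3, 1, 2]
print(value)       # [3, 1, 2, 5, 4]
```

Note that recovering the row of an element requires walking through n_nonzero, which is exactly why direct element access is not possible but row-by-row (bin-by-bin) iteration is cheap.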

The values are saved in double floating point precision; single precision should not lead to a loss of accuracy and would reduce the size of the matrix, but it would still be too big to load into the RAM of the computer. This fact has a notable effect on the computation speed of the algorithm, for two reasons. First, RAM access time is much shorter than hard disk access time: the computer can access anything stored in RAM nearly instantly, whereas data on the hard drive need to be located, read and sent to RAM before they can be processed. The order of magnitude of the time difference depends on the system. Second, the system matrix has to be used as many times as iteration steps are processed. As it is not possible to read the SM at once, for each iteration the RAM is filled several times with a certain number of rows of the matrix; forward and backprojection are carried out for these bins, then the next rows are read and the computation is done for this part, until all rows have been read and used for the calculation of the projections. At the end of the iteration the new image estimate is calculated and the next iteration step is processed.

If we were aiming at a speed-optimized implementation of the fully calculated system matrix version, we should try to keep the whole matrix in RAM. A first step would be to save the values in single floating point precision, which would reduce the size of the values array by a factor of 2; but this is still not enough, as only 4 GB of RAM are available on the hardware used. Furthermore, it is possible to exploit symmetries of the system matrix to reduce its size [28]. Another approach is to reduce the total number of bins in the sinogram; this can be done by sinogram mashing or rebinning methods [20, 43], although these approaches can lead to a loss of spatial resolution. Applying a combination of these methods should lead to a system matrix that fits into RAM, but this was out of the scope of this work.

4.3 On-the-fly CPU implementation

The main characteristic of the on-the-fly implementation of the ML-EM algorithm is that the elements of the system matrix are calculated when they are needed in the image reconstruction process. However, not the whole system matrix is calculated, but only the rows for the bins of the measurement vector that have at least one coincidence event recorded. Consequently, the computation time varies with the number of bins that detected an event: the more non-zero bins exist, the more LORs have to be computed using the Siddon algorithm and the longer the reconstruction takes.

The implementations with the fully calculated system matrix and the on-the-fly calculated system matrix are very similar; however, one difference should be mentioned. In the first

implementation, the different LORs of a bin are calculated and their contributions are added into one bin, according to equation 4.2. This makes the calculation of the system matrix slower, as a time-consuming comparison is needed to check whether a voxel of LOR 2 has already been passed by LOR 1 of a specific bin; but later on it reduces the time for the ML-EM. In the on-the-fly calculation, each LOR is fed directly into the ML-EM; that means, if a voxel is passed by several LORs of one bin, the LORs are first projected individually and then summed up. Mathematically there is no difference between the two approaches. Figure 4.5 illustrates the difference in a flow chart.

Figure 4.5: Implementation of the off-line and online calculation, shown for the forward projection of one bin. In the fully calculated SM version the Siddon results of all LORs are first accumulated into one row of the system matrix, which is then forward-projected once; in the on-the-fly version each LOR is forward-projected directly, before the algorithm moves on to bin i+1.
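That the two orderings produce the same projection value can be checked in a few lines of plain Python, with made-up intersection lengths:

```python
# Hypothetical bin built from two LORs; w[l][j] is the Siddon length of
# intersection of LOR l with voxel j, x is the current image estimate.
w = [[0.0, 1.2, 0.8, 0.0],    # LOR 1
     [0.0, 0.9, 1.1, 0.3]]    # LOR 2
x = [2.0, 1.0, 3.0, 4.0]

# Off-line style: accumulate the LOR contributions into one matrix row
# first (equation 4.2), then forward-project that row once.
a_row = [sum(wl[j] for wl in w) for j in range(len(x))]
forward_full = sum(a * xj for a, xj in zip(a_row, x))

# On-the-fly style: forward-project each LOR directly and accumulate.
forward_fly = sum(sum(wl[j] * x[j] for j in range(len(x))) for wl in w)

print(forward_full, forward_fly)   # equal up to floating point rounding
```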


Chapter 5

Graphics Processing Unit and Compute Unified Device Architecture

In recent years the scientific community has discovered graphics processing units (GPUs), widely known as graphics cards, for scientific calculations. Originally, graphics cards were developed and used for 3D computer games and graphical applications; however, there is a huge interest in using GPUs for other applications, and this new field of application is known as general-purpose computation on GPUs. The focus of GPU development was put on parallel calculation, so a graphics card has several microprocessors, each one alone being less powerful than the central processing unit (CPU) of a computer. These multiprocessors (MP) can execute the same instruction on many data elements simultaneously; this method is known as single instruction, multiple data (SIMD). Even though GPUs have been on the market for many years, their use apart from graphical applications was limited by the way the cards had to be programmed, which required highly skilled programmers to achieve good computing performance. In recent years attempts were made to overcome this problem and new programming models were introduced.

First of all, why might a graphics card achieve a higher data throughput than a CPU? The CPU developers tried for years to increase the clock frequencies of their sequentially working processors, and thanks to new and improved technologies this worked well; from 2001 to 2003, for example, the frequency of the Pentium 4 processor could be doubled from 1.5 to 3 GHz. This made sense due to the sequential work flow of the CPU, as an increase in frequency causes an increase in data throughput (strictly speaking this is no longer absolutely true, as already in the first Pentium family a slight degree of parallelism was implemented [44]). The frequency increase has almost come to a halt in recent years:

between 2003 and 2005, frequencies rose only from 3 to 3.8 GHz. The reason is that the chip manufacturers reached physical limits [44], so instead of focusing on one processor core, more than one arithmetic logic unit (ALU) was put on a CPU to increase performance. This development was based on the idea of making a single CPU as fast and powerful as possible; by contrast, the GPU developers followed a completely different approach: they tried to put as many processors as possible on a card and make them work in parallel. Each processor is relatively weak compared to a CPU; however, the combination of 128 small cores (Geforce 8800GTX) turns many small ALUs into one strong community of processors.

Figure 5.1: The development of new architectures increased the number of (giga) floating point operations per second (GFLOP/s) for CPUs and GPUs. The graph shows the enormous potential of GPUs; however, the number of GFLOP/s is a theoretical value that does not represent the achievable reduction of calculation time for real implementations [45].

Figure 5.1 shows the increase in floating point operations per second (FLOP/s) of newly introduced processor architectures over the last years. FLOP/s are a measure of computer performance that is especially interesting for scientific computation, which usually involves many arithmetic operations. Nevertheless, the graph does not represent an absolute speed-up for calculations on a GPU. Firstly, it shows peak values that assume a total occupancy of all processor cores, which is difficult to achieve; secondly, not every code is appropriate for implementation on a GPU, i.e. not every program can be accelerated on graphics cards; and finally, the graph is published by NVIDIA and does

64 Chapter 5. Graphics Processing Unit and Computed Unified Device Architecture 45 not give details about the GPU nor the CPU system configuration. To summarize, from the graphics we cannot conclude which acceleration factor for a given implementation could be archived, but show the potential of GPUs and the difference in flop/s that could be reached by using many parallel processors. 5.1 NVIDIA Geforce 8600GT To program applications for a GPU, it is necessary to have some knowledge about the graphic card hardware. In this work, a Geforce 8600GT is used and its architecture will be presented in this section. This graphic card has been on the market for some years and a new generation is already released. The 8series architecture of NVIDIA, where this card belongs to, was a completely renewed generation in comparison to the former 7series and was the start of the Compute Unified Device Architecture (CUDA) programming language. It should be stated that the 8600GT is not a high-end GPU, table 5.1 shows the technical specification of a 8600GT and a 8800GTX, the 8800GTX is widely used for CUDA programming, as it has a good performance to price ratio. The performance especially depends on the number of cores and on a high bandwidth; as it can be seen, the 8600GT is low in both numbers. Nevertheless the video memory of the used 8600GT is bigger than the one of the 8800GTX, this could be an advantage for some implementations. Table 5.1: Technical data of NVIDIA Geforce 8800GTX and 8600GT [ parameter 8800GTX 8600GT Stream processors Core clock Shader clock Memory clock Memory amount 768MB 1GB Memory interface (bit) Memory bandwidth (GB/sec) Texture fill rate (billion/sec) NVIDIA calls their processor cores stream processors (SP). The 8600GT has 32 SPs, these cores are embedded in multiprocessors, there are 4 MPs having 8 SPs each. 
Apart from that, a MP has a shared, register and constant memory which is shared by all cores of the MP and cannot be accessed by cores of other MPs (s. figure 5.2).

Figure 5.2: Architecture of a Geforce 8600GT. It has 4 multiprocessors (MP), each consisting of 8 stream processors (SP). Each MP has its own shared, register and constant memory as well as a texture cache; the global memory is shared by all MPs.

5.1.1 Some definitions

Before having a closer look at the architecture and the different memory types, we introduce some definitions that are used by NVIDIA and will be used in this work:

Multiprocessor: A GPU has several MPs; each multiprocessor has several cores called stream processors. Furthermore, it has register, shared, local and constant memory as well as a texture cache. Access to these memory banks is limited to the SPs of that MP.

Streaming processor: A streaming processor is a subunit of the MP. NVIDIA defines an SP as a fully generalized, fully decoupled, scalar processor that supports floating point precision similar to the IEEE 754 norm [46].

Thread: A thread is an instruction stream which is processed on a stream processor. Each thread has an index; these indices can be used for referencing elements in arrays. A thread is always executed by one stream processor, while a stream processor can handle several threads simultaneously.

Warp: A warp is a bunch of 32 threads. A multiprocessor executes its parallel threads in warps, so when programming a GPU one should try to group the threads in multiples of 32.

Block: A block is a group of threads; the block size is defined by the programmer. The order of execution of the threads within a block is not defined, so it can be sequential as well as random; this cannot be controlled by the programmer. Threads within a block can be synchronized using the __syncthreads() command, which makes a thread wait until all threads of the block have reached this point. A block is always executed on one MP.

Grid: The blocks are grouped in a grid. It is not possible to synchronize different grids.

Device: The GPU is called the device.

Host: The CPU is called the host.

Kernel: A kernel is a program that runs on the device but is called from the host.

5.1.2 Memory model and hardware limitations

Figure 5.3 shows the different memories available on the GPU. They differ in size, access time, lifetime and access permissions.

Figure 5.3: There are N multiprocessors on one GPU; each one has several cores that share the same shared, register and constant memory [45].

Device Memory: The device memory is also called global memory and is the RAM of the graphics card. This is the largest memory unit of the GPU; the used

Geforce 8600GT has 1 GB of device memory. It can be addressed by the host and by the device, and all threads have permission to read from and write to it. The device memory is the slowest of all memory types on the card, but it is the only one that can be addressed by the host, so it is the interface for data communication between the CPU and the GPU.

Texture Memory: The texture memory is not a physical memory; it uses the global memory as its storage area. There are extra runtime functions to work with this type of memory. According to the CUDA programming guide [45], it is optimized for 2D spatial locality, so threads of the same warp that read texture addresses that are close together will achieve best performance. The use of this memory was not further studied in this work.

Texture Cache: The texture cache is a memory within an MP that caches data from the texture memory, so that potentially a higher bandwidth can be achieved. Threads can only read from this cache, not write to it.

Constant Memory: The constant memory is initialized by the host before execution of an application and has a size of 8 KB per MP. Threads have read-only access to it.

Shared Memory: The shared memory can be read and written by all threads of a block; its size is 16 KB per MP.

Register Memory: The register memory is used for variables that are not declared for other scopes; it is limited to 8192 registers per MP.

Local Memory: The local memory is not really a physical memory. An MP can occupy space in the global memory when it runs out of registers or when bigger arrays are initialized. This should be avoided, as global memory is much slower than register memory. This outsourcing of registers cannot be controlled directly by the programmer; it is done by the device. However, there exist tools to find out whether data were outsourced to local memory. Local memory is restricted to the MP that reserved the space for it.

Register and shared memory are on-chip memories and hence really fast, but small in size. Constant memory is not on-chip, but still fast; efficient usage of these three memory locations can speed up an application, but their limited sizes make efficient use a difficult task. Global memory is much slower, but offers a larger storage space for data.

The number of parallel threads is limited by the register and the shared memory, as all threads of one multiprocessor have to share these resources. Furthermore, the

maximum number of threads per block is limited to 512, and an MP can have up to eight active blocks, but only up to 24 warps (768 threads) at the same time; this makes a total of 3,072 possible parallel threads for a Geforce 8600GT. There are thus several limitations that have to be taken into account; however, NVIDIA helps figuring out the configuration that achieves maximal processor occupancy, i.e. as many parallel threads as possible. On the CUDA Zone internet site a development kit is offered for download which includes several implementation examples. Running the DeviceQuery program returns the technical configuration of the card, including the memory sizes, the maximum number of threads per block, the warp size and the revision number of the compute capability. NVIDIA provides a compiler to build program projects; this compiler is called nvcc. If the argument -cubin or --ptxas-options=-v is added to the compilation instruction, the compiler returns the number of registers and the amount of shared memory occupied by each kernel. A tool called CUDA occupancy calculator comes with the compiler and helps to choose the right configuration for each function that is run on the device. The technical configuration and the memory occupancy of the kernels are needed as input; the calculator then returns the recommended block size, the occupancy of the processors and the limiting factor that prevents the GPU from starting more threads in parallel. Even though the calculations needed to figure out at which parameter a limit is reached are not that complicated, this tool is really helpful: on the one hand it saves work, as one would otherwise have to repeat these calculations for every kernel that is launched, and on the other hand it provides graphs showing the best adjustment for different configurations.
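The basic limit arithmetic can be sketched in a few lines; this simplified calculation ignores the register and shared-memory limits that the occupancy calculator additionally takes into account, and the 128-thread block size is a made-up example:

```python
# Per-MP limits of the 8-series architecture as given above.
max_threads_per_block = 512
max_active_blocks = 8
max_warps = 24
warp_size = 32
n_mp = 4                       # the Geforce 8600GT has 4 multiprocessors

threads_per_mp = min(max_active_blocks * max_threads_per_block,
                     max_warps * warp_size)
print(threads_per_mp)          # 768: the warp limit dominates
print(threads_per_mp * n_mp)   # 3072 parallel threads on a 8600GT

# Occupancy of a hypothetical kernel launched with blocks of 128 threads:
block = 128
active = min(threads_per_mp // block, max_active_blocks)  # 6 active blocks
print(active * block / threads_per_mp)                    # 1.0 -> full occupancy
```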
It is worth mentioning another factor which is not related to what was discussed in this section: the power consumption of the cards. The overall power consumption of computers has increased a lot in the last decade, and GPUs contribute to it, not only because of the number of processors and the memory, but also because of extra on-board cooling units. A Geforce 8600 showed in a test a consumption of 145 W and 195 W for unloaded and loaded operation, respectively. The Geforce 8800GTX already reaches a consumption of 187 W (unloaded) and 277 W (loaded); high-end systems, such as the recently released Tesla 10-series, go up to 700 W (typical power specified by NVIDIA).

5.2 Compute Unified Device Architecture

The Compute Unified Device Architecture is one, but not the first, approach to open the computational capacities of graphics cards to users who are at home in general-purpose programming rather than in graphics programming. General-purpose computation on GPUs (GPGPU) has been a topic of interest in science for several years. First implementations started already in the 1990s, also in the field of tomographic reconstruction [47]; however, the required knowledge of graphics programming and the use of non-standard data types with 9 and 12 bits complicated the use of GPUs for general computations. Application programming interfaces (APIs) such as OpenGL or Direct3D allowed programmers to use the resources of a graphics card, but these libraries were made for graphics applications and for programmers familiar with the particularities of such implementations. The BrookGPU programming environment, developed at Stanford University, tried to overcome this problem and offered an extension to the C language. Another environment, introduced by NVIDIA in cooperation with Microsoft, was C for graphics (Cg), also a programming language based on C that should facilitate the programming of GPUs. There have been several approaches to make GPGPU happen, but the breakthrough was not achieved with any of those technologies. The latest approach is the release of CUDA by NVIDIA and, judging by the number of scientific papers applying CUDA published since the release of this programming environment, we may consider that the breakthrough has been achieved and GPUs are ready for general-purpose computing.

5.2.1 CUDA software model

CUDA is based on the C programming language. The download of CUDA is free of charge, and a list of graphics cards that support this language can be found on the NVIDIA homepage. nvcc is the compiler that comes with CUDA; it first converts the C code into an intermediate language called PTX, which is similar to assembly code and can be analyzed for inefficiencies. The program is then translated into the target code understood by the GPU. CUDA source files have the suffix .cu and kernel header files the suffix .cuh. Apart from the nvcc compiler, a toolkit is provided that includes a profiler, the already mentioned occupancy calculator, a debugger and the GPU drivers.

5.2.2 Function declaration, calls and thread IDs

An application always consists of host and device code. The host code is written in C or C++; any compiler can be used for its compilation and there is no restriction for the

CPU code, hence all known libraries can be used. The device section contains the source code to be run on the GPU. There are two types of functions, global and device ones. The global functions are the kernels; they are called from the host and executed on the device. Device functions, in contrast, can only be called by global functions or by other device functions. To define the functions, the qualifiers __global__ or __device__ are put in front of the function declaration. A program can have several kernels, which are executed sequentially (see figure 5.4(a)). By default a device function is always inlined 4. The __noinline__ function qualifier can be used to ask the compiler not to inline a function, but the compiler ignores it for functions with pointer parameters or long argument lists. In practice inlining can have an important impact on the register usage of a kernel: each time a kernel calls a device function the call is inlined, i.e. each call reserves registers for the variables used in the device function, hence register occupation increases and the number of parallel threads decreases.

Figure 5.4: a) An application can contain several kernels which are called sequentially. b) A grid contains several blocks, each block can consist of up to 512 threads; each grid, block and thread has an index [45].
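The grid/block/thread hierarchy of figure 5.4(b) can be mimicked on the host. The following plain C++ sketch (not CUDA; the two loops merely stand in for the hardware's parallel scheduling) enumerates a one-dimensional grid and composes, for every simulated thread, the unique global index blockIdx.x * blockDim.x + threadIdx.x used further below:

```cpp
#include <cassert>
#include <vector>

// Plain C++ sketch (not CUDA): the two loops stand in for the parallel
// scheduling of blocks and threads shown in figure 5.4(b). Each simulated
// thread composes the unique global thread index.
std::vector<int> enumerateThreadIds(int nBlocks, int blockSize) {
    std::vector<int> ids;
    for (int blockIdx = 0; blockIdx < nBlocks; ++blockIdx)          // one pass per block of the grid
        for (int threadIdx = 0; threadIdx < blockSize; ++threadIdx) // one pass per thread of the block
            ids.push_back(blockIdx * blockSize + threadIdx);        // unique global thread ID
    return ids;
}
```

For nBlocks = 4 and blockSize = 64 this yields the IDs 0 to 255, each exactly once, which is what makes the ID directly usable as an index into a data array.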
When a global function is called, the programmer has to define the size of the blocks and of the grid; therefore, an extra argument is added to the function call. The following source code shows a function declaration and a call:

__global__ void myGlobalFunction(float* parameter); // declaration of kernel function

myGlobalFunction<<<nBlocks, blockSize>>>(parameter); // call of kernel from host

4 The complete body of an inline function is inserted by the compiler in every context where that function is used.

where nBlocks and blockSize define the number of blocks in the grid and the number of threads per block, respectively. Both variables are of type dim3, a vector data type of three integer components; this provides the possibility of three-dimensional blocks and at most two-dimensional grids. The maximum block size in three dimensions is 512 × 512 × 64, still within the restriction of a maximum of 512 threads per block, and 65,535 × 65,535 × 1 for the grid. The values of these variables define how many threads will be executed; a kernel can thus execute up to a maximum of 65,535 × 65,535 × 512 = 2,198,956,147,200 threads. If just one dimension is defined, the others are set to 1 by default. There exist some restrictions for global and device functions: they cannot be called recursively, the number of arguments is fixed, pointers to device functions are not allowed, all global functions have to be of return type void, and global function parameters are limited to a size of 256 bytes. Kernel function calls are asynchronous: they return before the execution on the device has completed. In practice this means the CPU is not blocked during a kernel execution and can be occupied with other instructions. There are several built-in variables that are set by the execution configuration:

- blockIdx gives the index of the block the thread belongs to; it is a data container of type dim3 with an x and a y component.
- threadIdx is the index of the thread within the block; it has an x, y and z component.
- blockDim gives the dimensions of the block; it has x, y and z components. gridDim gives the dimensions of the grid; it has x and y components.

These variables can be used to give a unique index to every thread. For example, for a kernel defined by a one-dimensional block and a one-dimensional grid, the ID can be calculated with the following statement:

int threadID = blockIdx.x * blockDim.x + threadIdx.x;

These thread IDs are very useful, e.g. to reference elements in data arrays.

5.2.3 Variable declaration and memory usage

CUDA also introduces new variable type qualifiers, which define the memory location the data are stored in. Table 5.2 shows the different qualifiers, a declaration example, the scope and the lifetime.

Table 5.2: Variable type qualifiers used by CUDA.

    Declaration                                 Memory     Scope    Lifetime
    __device__ __local__ int localVar;          local      thread   thread
    __device__ __shared__ int sharedVar;        shared     block    block
    __device__ int globalVar;                   global     grid     application
    __device__ __constant__ int constantVar;    constant   grid     application

The __device__ qualifier is optional when used together with __shared__, __constant__ or __local__. If no qualifier is given the variable resides in registers; in some cases, however, the compiler places it in local memory, which cannot be controlled by the user. This is often the case for arrays and big structures that would consume too much register space. Pointers in code that is executed on the device are supported as long as the compiler is able to resolve whether they point to the shared or to the global memory space. To pass data from the host to the device, memory has to be allocated on the graphics card. Global memory is the only memory that can be allocated and referenced from host code; the CUDA instruction cudaMalloc() for memory allocation is similar in structure to the C instruction malloc(). To copy data, the CUDA-specific instruction cudaMemcpy() has to be used. The following example allocates memory for 256 floating point values on the host and on the device, and copies the elements from array_h to array_d. Note that it is convenient to append _h and _d to pointers pointing to host and to device memory, respectively, to avoid confusion: arrays that are sent to the GPU exist on the CPU and on the graphics card, and this suffix helps to distinguish between them.

int size = 256;
float* array_h;
array_h = (float*)malloc(size * sizeof(float)); // C instruction to allocate memory
... // array_h is filled with values
float* array_d;
cudaMalloc((void**)&array_d, size * sizeof(float)); // memory allocation on device
cudaMemcpy(array_d, array_h, size * sizeof(float), cudaMemcpyHostToDevice); // copy of data to device
free(array_h); // clean host memory
cudaFree(array_d); // clean device memory

A simple application showing a kernel definition and call can be found in appendix A. Memory management in CUDA is similar to C; however, one particularity should be mentioned, as it might influence results calculated on the device. Memory access in global or shared scope is not protected, i.e. several threads can read and write the same element simultaneously, which may cause a loss of data. Assume that two threads perform an arithmetic operation on the same array element: they read the data at the

same time, thread 1 terminates first and writes its new value to the array position; then thread 2 terminates, writes its result to the same memory address and thereby overwrites the result of thread 1. To avoid this, CUDA provides built-in atomic functions: such a function reads an element and protects the memory address until the operation is complete, while threads that try to access this element are queued. The drawbacks of these instructions are that, so far, atomic functions are available only for integer values, the arithmetic operations that can be performed are limited, the performance can decrease enormously, and data dependencies between threads have to be studied carefully to avoid the risk of a deadlock.

5.2.4 Device runtime library

The device runtime library contains mathematical functions, synchronization functions, type conversion functions, type casting functions, texture fetching and atomic functions. For detailed information on these functions please refer to the CUDA programming guide. Apart from the built-in device runtime functions, CUDA also contains two libraries, CUBLAS [48] and CUFFT [49], providing GPU implementations of basic linear algebra and of the fast Fourier transform, respectively. The GPU does not support double precision floating point values, only single precision. The implementation of the floating point arithmetic follows the IEEE 754 standard [46], but does not comply with it in all respects (details can be found in the appendix of the programming guide). Another restriction is the accuracy: internally a CPU handles float values with 80 bits, but a GPU with 32 bits.

5.2.5 Debugging and emulation mode

One important issue in programming is debugging of the code. The programming environment does not include any native debug support for code that runs on the device, but provides an emulation mode for the purpose of debugging.
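The lost-update scenario that motivates these atomic functions can be replayed deterministically on the host. The following plain C++ sketch (the fractions 0.3 and 0.2 are made-up example values) reproduces the interleaving described above, where both threads read the element before either has written:

```cpp
#include <cassert>
#include <cmath>

// Unprotected read-modify-write: both "threads" read at t0, thread 2
// writes at t1, thread 4 writes at t2 and overwrites thread 2's result.
double unprotectedUpdate() {
    double cell = 1.0;
    double r2 = cell;   // thread 2 reads at t0
    double r4 = cell;   // thread 4 reads at t0 as well
    cell = r2 + 0.3;    // thread 2 writes at t1
    cell = r4 + 0.2;    // thread 4 writes at t2, so the 0.3 is lost
    return cell;        // 1.2 instead of the correct 1.5
}

// Serialized read-modify-write, which is what an atomic add guarantees:
// the second access is queued until the first operation is complete.
double serializedUpdate() {
    double cell = 1.0;
    cell += 0.3;        // thread 2 completes first
    cell += 0.2;        // thread 4 starts only afterwards
    return cell;        // 1.5: both contributions survive
}
```

The same effect reappears in the back-projection discussed in section 6.3, where many threads accumulate into the same image voxel.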
The emulation is compiled for and runs on the host, allowing the programmer to use the normal debugging support of the host and to debug the code as if it were a host application. The emulation mode is useful for finding algorithmic errors in the code; however, this tool emulates the code but does not simulate the device. That means the parallelized code is executed sequentially on the host; furthermore, the host memory is used, and therefore the limitations of the graphics card are not taken into account. Memory leakage problems, however, are much more delicate on the GPU than on the CPU, i.e. a program with a memory leakage problem

can run without problems in emulation mode but fail on the GPU, which can crash the card and force a hard reboot of the system. CUDA is not really to blame for the memory leakage problem; this is rather a problem of C, which does not properly check for possible memory violations. It is advisable to use a tool like Valgrind, which checks for memory errors, to avoid such a GPU crash. The problems occurring due to the parallel execution are much harder to solve. Many errors that occur while programming are due to the parallelism of the threads, but exactly these errors are not checked by the emulation; and once a kernel is started on the GPU, there is no option for communication, since instructions like printf() are not supported on the device. Consequently a kernel is like a black box: the input can be checked before the kernel launch, the output can be downloaded with the cudaMemcpy() instruction and checked, but whatever happens inside the black box is not visible. If the application stops during execution due to a failure, no output exists; for that case CUDA provides some error instructions to figure out what caused the abort, but the error messages are often vague and unspecific, so that debugging can be a time-consuming task requiring a lot of patience.

5.2.6 General performance issues

The programming guide gives several hints for achieving good performance on the GPU; some should be mentioned here, since they will have an impact on our work:

- GPUs are optimized not simply for a high amount of data-parallel work, but for work with high arithmetic intensity.
- Device memory accesses are more expensive than most calculations, so it may be less time consuming to calculate a value several times than to save it to global memory.
- Memory accesses should be minimized; shared and register memory are fast and should be preferred.
- Communication between host and device is slow and should be minimized.
- Program control statements like if-conditions should be avoided, as they slow down execution.
- A high degree of parallelism is needed; this refers to the division of the algorithm as well as to the threads executed in parallel. For efficient execution NVIDIA recommends a block size between 128 and 256 threads.
- Memory accesses should be organized to reach a high bandwidth.
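One small piece of arithmetic usually accompanies the block-size hint above: the grid size is obtained by rounding the number of data elements up to whole blocks. A minimal C++ sketch of this ceiling division (the function name is ours):

```cpp
#include <cassert>

// Smallest number of blocks of blockSize threads that covers nElements
// data elements; the last block may be only partially filled, which is
// why kernels typically guard with a check of the global thread ID.
int blocksFor(int nElements, int blockSize) {
    return (nElements + blockSize - 1) / blockSize;  // ceiling division
}
```

For example, 1,290,240 elements at 192 threads per block give 6,720 blocks, while 1,000 elements at 256 threads per block give 4 blocks, the last one only partially used.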

The performance that can be achieved depends on several issues; not all codes can be implemented efficiently, and a program might run slower on a GPU than on the CPU.

Chapter 6

Implementation of ML-EM in CUDA

The field of medical image reconstruction faces a high amount of data to be processed and a high arithmetic load; consequently a powerful co-processor such as a GPU could have favorable effects on computation time. In 1994 Cabral et al. [47] proposed an acceleration method that uses the texture-mapping capability of graphics hardware to speed up the Filtered Back Projection (FBP) reconstruction used in CT. The first applications of GPUs in medical imaging focused on CT reconstruction, and this modality is the best investigated regarding the use of graphics cards [50-56]. Iterative as well as analytical methods have been implemented, and acceleration factors of an order of magnitude could be achieved while maintaining an image quality comparable to CPU implementations. After these promising results in CT, applications for MRI [57-59], SPECT [60] and PET reconstruction [11-14] came up; however, the investigation of those modalities is not yet as advanced as for CT.

6.1 Related work

Next, we present in some detail the implementations of GPGPU in PET; table 6.1 shows some parameters of the published works. Some remarks on the publications summarized in table 6.1:

Used algorithms: All four implementations apply the Ordered Subset Expectation Maximization (OS-EM) [61] algorithm for reconstruction. This approach is a modification of the ML-EM that divides a complete iteration over all events into

Table 6.1: Implementations of PET image reconstruction using GPUs.

    Parameters               Pratx et al. [14] 2006          Bai et al. [13] 2006            Barker et al. [12] 2007     Schellmann et al. [11] 2008
    Reconstruction algorithm OS-EM                           OS-EM                           OS-EM                       OS-EM
    Data type                list-mode                       sinogram                        list-mode                   list-mode
    SM algorithm             GPU: tri-linear interpolation   GPU: line integral with        Gaussian line-spread        Siddon
                             technique, Gaussian TOR*;       Joseph's algorithm [31];       function
                             CPU: Siddon, Siddon with        CPU: FORE+FBP,
                             dithering                       FORE+OSEM2D, SSP-OSEM
    Implemented parts        forward and back-projection     forward and back-projection    forward projection          forward and back-projection
    GPU performance          good performance                good performance               14x speedup GPU vs. CPU     good performance
                                                                                            when > 10^6 events
    Image quality            visual inspection: images       comparable, but no             mean difference < 0.001%;   max. relative error < 1%
                             acceptable, but no              quantitative comparison        max. 2%
                             quantitative comparison
    Used GPU                 Geforce 7900GTX                 Geforce 7800GTX                Geforce Quadro 7600         2 Geforce 8800GTX
    Programming language     Cg with OpenGL                  Cg with OpenGL                 CUDA                        CUDA

    * TOR: Tube Of Response

sub-iterations to accelerate the process. Each sub-iteration processes one block of events; these blocks are called subsets.

Performance: It is difficult to evaluate the performance of the GPU within one work, and even more so to compare the works of different groups; this has several reasons: Pratx et al. and Bai et al. implemented different algorithms to calculate the values of the SM on the GPU and on the CPU. The GPU shows a good performance, but a comparison between the two implementations cannot be made, as the algorithms are not the same. Barker et al. implemented the same algorithm on GPU and CPU, but in list-mode reconstruction the computation time depends on the number of detected events; therefore the gain factor is related to the number of events, and images with fewer events would show a lower acceleration factor.
Apart from that, it was stated that the PC used was not state-of-the-art

compared to the graphics card; it was estimated that a more powerful CPU would have reduced the gain factor to 10. Schellmann et al. implemented the same algorithm on a PC with GPU, on a computer cluster, and on a multiprocessor server, but not on a single-CPU system as in our case. Furthermore, the implementation on the GPU works, due to hardware limitations, with single precision floating point values, while the other implementations use double precision, which favors the computation time of the GPU. A comparison of the computation speed between the implementations of the scientific papers does not make sense, because the approaches and the hardware systems vary too much. The algorithm for SM computation used by Pratx et al. contains a bilinear interpolation. Interpolation methods are often used in graphics programming, and the GPU has built-in functions that are faster than self-written ones; the use of such functions gives an advantage over the CPU. The group of Pratx tried to implement the Siddon algorithm for the computation of the SM on the GPU; however, they write that "the computation for each line is done in a single long thread that cannot be parallelized; we were therefore unable to efficiently implement it on the GPU". All groups were very satisfied with the performance of the GPU and confirm that graphics cards used as co-processors present an interesting option in image reconstruction for PET.

Image quality: All stated that the image quality in comparison to a CPU implementation was good; some quality problems still had to be studied, e.g. an artifact in the images of Barker et al. The quality is either already acceptable, or the small differences are expected to be resolved.

Implemented parts: None of the groups implemented the entire algorithm on the GPU; at least the sensitivity image sum was calculated on the CPU. Barker et al.
did not implement the back-projection because of the simultaneous memory access of several threads to one memory address; it was stated that the results are unpredictable. (The problem of the back-projection will be explained in more detail in section 6.3.) Schellmann et al. commented that the discrepancies in the images are due to the simultaneous memory access; however, the images show a good quality and the error is acceptable.

The groups of Pratx and Bai implemented the forward- and the back-projection; however, they did not mention the problem the other two groups faced. None of the groups used atomic functions, as their graphics cards did not support them at all or only insufficiently.

Used GPU: All graphics cards used were state-of-the-art at the time the papers were published; they were high-end versions of the available hardware. Except for the group of Barker, the CPUs were comparable in power to the GPUs; Barker mentioned that with a more powerful CPU the gain factor would decrease.

Programming language: The Geforce 7-series does not support CUDA, so other languages had to be used. Concerning the choice of the programming environment, the publication of Barker et al. is interesting: they started their work with Cg, but NVIDIA launched CUDA during their project. They did some implementations in both CUDA and Cg and realized that CUDA is not only easier to handle and to program but also gives better performance; hence they implemented the rest of their programs in CUDA only. One piece of information should be given here to avoid confusion: the Geforce Quadro 7600, which was used by the group of Barker, is not part of the 7-series architecture; it is equipped with processors of the 8-series, and therefore the card supports CUDA.

6.2 Parallelization of the reconstruction process

Before starting the implementation, some points already mentioned in the section on general performance issues should be emphasized. An algorithm that is going to be implemented on a GPU should be parallelized as much as possible. Taking into account the limited hardware resources that are shared by all threads of one MP, each thread has to occupy few registers and little shared memory. Furthermore, data exchange between host and device should be minimized, as should global memory access. Use of the fast memory banks of registers, shared and constant memory is favorable compared to device memory.
Program control statements such as if-conditions should be avoided or minimized. In the algorithm the Single Instruction Multiple Data approach should be applied; following this basic principle, a high data throughput can be expected on a GPU. Hence we search for instructions that apply the same arithmetic operation to many data. The ML-EM algorithm can be divided into several parts, each of which performs the same computation on a huge amount of data. Figure 6.1 shows our approach to define the kernels.

Figure 6.1: Proposed kernels for the ML-EM algorithm: 1st kernel forward-projection, 2nd kernel back-projection, 3rd kernel reconstruction; the sensitivity image is pre-calculated once on the CPU for all iterations.

6.2.1 Parallelization of forward-projection

The forward-projection is nothing else than a matrix-vector multiplication:

    Σ_{j=0}^{V-1} a_ij x_j^(k)    (6.1)

Linear algebra is a widely studied field in GPGPU; an example of a matrix-matrix multiplication is presented in the CUDA programming guide, and the SDK development kit from NVIDIA gives two implementations of the same problem. Scientific publications have tackled dense as well as sparse matrices and tried to increase the gain factor by changing the number of elements each thread computes and by using the register and shared memory efficiently [62-64]. Generally it can be stated that the GPU is favorable for huge matrices, as the computation load is higher; few elements per thread showed a higher data throughput, because more threads can be launched in parallel. In our case we multiply the system matrix A with the last estimate of the image, x^(k). x^(k) is known, but the system matrix has to be calculated using Siddon's algorithm (section 3.1.1). The Siddon approach computes the lengths of intersection (LOI) of a coincidence line with the image voxels. The algorithm calculates the LOI for each voxel, starting with the entrance point of the LOR into the voxel space until the line reaches the last voxel that

it passes. A whole line is traced at once; the result is a 2D array that includes the LOIs and the indices of the voxels crossed by the coincidence line. In our system matrix each row is related to one sinogram bin (section 2.5). To calculate the probability that an event was detected in one bin, up to four LORs have to be calculated; this is due to the aggregation of direct and oblique sinograms (section 2.7). We use the Michelogram to know which sinograms are mashed. Analyzing Siddon's algorithm, we face the same problem as stated by Pratx et al. [14]: the implementation of Siddon's approach is a long sequence of instructions that cannot be parallelized, hence all intersections have to be calculated for one LOR at once. It is neither possible to calculate only the LOI of one particular voxel passed by a particular LOR, nor to divide the computation for one LOR into several parts. Hence we have to decide whether the algorithm is implemented on the GPU or on the CPU. An implementation on the GPU is challenging for the stated reasons; however, in the case of an implementation on the host, a huge amount of data has to be transferred from the host to the device. A simple calculation shows the size of the data to be transferred. Let us assume that 20% of all sinogram bins (N_bins) stored at least one event; for each bin whose count value is different from zero (N_nonzerobins) the related row in the system matrix has to be calculated. Hence:

    N_nonzerobins = N_bins × 0.2 = 1,290,240 bins

Now we assume that a mean of two LORs are sampled in one bin and that each LOR intersects a mean of 80 voxels, so we get:

    N_intersected voxels = N_nonzerobins × 2 × 80 = 206,438,400 voxels

As the LOIs are saved as single precision floating point values (32 bit), we obtain a data transfer of 787 MB for each projection, i.e. 1.5 GB per iteration, which sums up to 23 GB for 15 iterations.
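The estimate can be verified with a few lines of arithmetic (plain C++ sketch; the total of 6,451,200 bins is the value implied by N_nonzerobins = 0.2 N_bins, and MB/GB are taken as binary units):

```cpp
#include <cassert>
#include <cstdint>

// Reproduces the data-volume estimate from the text: 20% of 6,451,200
// sinogram bins are non-zero, 2 LORs per bin, 80 intersected voxels per
// LOR, 4 bytes per single-precision LOI.
const std::int64_t nonZeroBins = 6451200 / 5;                        // 1,290,240 bins
const std::int64_t intersectedVoxels = nonZeroBins * 2 * 80;         // 206,438,400 voxels
const double mbPerProjection = intersectedVoxels * 4.0 / (1 << 20);  // ~787.5 MB per projection
const double gbPerIteration = 2 * mbPerProjection / 1024.0;          // forward + back: ~1.5 GB
const double gbFifteenIterations = 15 * gbPerIteration;              // ~23 GB
```

The factor of two per iteration assumes one forward and one backward pass over the same rows, which matches the 1.5 GB and 23 GB figures quoted above.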
Due to this huge amount of data we decided to implement the complete ray tracing algorithm on the GPU.

6.2.2 First implementation approach

As we cannot parallelize Siddon's algorithm, one thread computes one row of the matrix and multiplies it with the estimated image vector x^(k). Furthermore, in the first implementation a thread also included the calculation of the coordinates of the LORs sampled in one bin. This implementation had several inlined device functions: one

small for the x, y calculation, one of medium size for the z coordinates, and a long function for the LOIs. We were not able to launch this first program on the device because of too many used registers. This problem occurred due to the inlined functions: each device function is called several times, and each time register memory is reserved. As we could not easily reduce the registers each function occupies, we outsourced the coordinate computation. We implemented two new kernels: the first calculates all radial and angular coordinates of one sinogram, the second computes the axial coordinates for all sinograms. The results of these kernels are saved to data arrays located in global memory, where all threads can access them. This modification reduced our register usage sufficiently to start the calculation on the GPU. However, it was still high, and the shared memory usage was about 6 kbyte per block, which is really high. Due to the register and memory usage we could not launch more than 16 threads in parallel on each MP, with the result that our implementation on the GPU was at that time slower than the one on the CPU. There are two reasons why the shared memory usage is high: firstly, we avoided global memory access and saved intermediate results to shared memory; secondly, the amount of memory allocated for these intermediate values is very high. The LOIs of all voxels of one LOR were kept in memory; this implementation is similar to the host implementation, and a flow chart of the calculation scheme can be seen in figure 4.5. The problem when computing the intersections with the voxel space is that the number of crossed voxels is unknown in advance, so it is difficult to allocate memory efficiently. Hence, as a first approach, memory was allocated for the maximum number of voxels that could be passed. Of course, this approach wastes memory and is not efficient at all.
In the C++ implementation there are several methods to solve this problem: on the one hand, the memory can be allocated dynamically; on the other hand, the vector class can be used, which permits adding an element at the end of an array without allocating memory for it beforehand. Such a class is not available in CUDA. Dynamic memory allocation is generally available, but the implementation is tricky and error-prone. The parallel execution makes it difficult to find memory leaks, and programs such as Valgrind cannot find memory problems arising from the parallel implementation. We were unable to implement it efficiently, so another strategy had to be found.

6.2.3 Improved implementation

These first approaches showed the difficulties of facing GPU programming for the first time. So we checked the whole algorithm again to minimize shared memory usage. After

analyzing the Siddon source code once more, a new approach was developed. The LOIs are computed voxel by voxel, so instead of calculating all lengths at once, we can process each LOI immediately and multiply it with the corresponding element of the estimate vector. In this way we avoid saving any intermediate results. Figure 6.2 shows the matrix-vector multiplication and its separation into several parts; figure 6.3 presents a flow chart of the implementation in CUDA.

Figure 6.2: Calculation of the forward-projection on the GPU. In the upper part of the figure the matrix-vector multiplication is shown. The LOIs of two LORs have to be calculated for bin 1, hence Siddon's algorithm is applied twice. Each calculated distance (element value voxel.w with column index voxel.index) is multiplied directly with the corresponding element of the vector x_j and then summed into the corresponding element of the vector y_k.

Figure 6.3 shows a flow chart of the implementation of the forward-projection in CUDA. The kernel forward_proj() is called by the host. Each thread multiplies one row of the system matrix with the corresponding elements of the vector x^(k). To do so, the kernel first loads the radial, angular and axial coordinates of the first LOR whose events are saved in this bin; these parameters are passed as arguments to the device function Siddon(). The function Siddon() computes the intersections of the coincidence line with the voxel space; each time the LOI of one voxel is derived, the function matr_vect() is called with the voxel index and weight as arguments. The device function matr_vect() multiplies the weighted voxel value with the corresponding element of the vector. This is repeated for all voxels that are passed by the line. When the last LOI has been multiplied with the vector, the kernel forward_proj() checks whether events of another LOR are saved in the

bin; if not, the thread terminates; otherwise the axial coordinates for the second LOR are loaded and the device function Siddon() is called again.

Figure 6.3 (schematic): each thread executes

    binID = threadID;
    if (binID < bin_max) {
        // load data out of the coordinate arrays in global memory
        Siddon(xy, z, binID);
        // if a 2nd LOR exists: new value for z, then Siddon(xy, z, binID) again;
        // same for the 3rd and 4th LOR, if applicable
    }

and each voxel produced by Siddon() is passed with its index and weight to matr_vect(), which performs y_k[BIN] += voxel.w * x_j[voxel.index].

Figure 6.3: The flow chart shows the implementation of the forward-projection in CUDA; for details please refer to the text.
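The voxel-by-voxel scheme of Siddon() and matr_vect() can be sketched in plain C++ (a 2-D host-side sketch under simplifying assumptions: unit-sized voxels, both endpoints inside the voxel space; function and variable names are ours, not the thesis code, and the CUDA version in the text works on 3-D LORs). Each intersection length is consumed immediately by the multiply-accumulate, so no intermediate array is stored:

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// 2-D sketch of the incremental scheme: trace a ray through a grid of unit
// voxels (W columns) and hand each (voxel index, intersection length) pair
// directly to a consumer, instead of storing all lengths first.
void traceRay(double x0, double y0, double x1, double y1, int W,
              const std::function<void(int, double)>& matrVect) {
    const double dx = x1 - x0, dy = y1 - y0;
    const double len = std::hypot(dx, dy);                         // total chord length
    int ix = (int)std::floor(x0), iy = (int)std::floor(y0);        // entry voxel
    const int sx = dx > 0 ? 1 : -1, sy = dy > 0 ? 1 : -1;
    const double big = 1e300;                                      // stands in for "no crossing"
    double tx = dx != 0 ? (ix + (dx > 0 ? 1 : 0) - x0) / dx : big; // param. of next x-plane
    double ty = dy != 0 ? (iy + (dy > 0 ? 1 : 0) - y0) / dy : big; // param. of next y-plane
    const double dtx = dx != 0 ? sx / dx : big;
    const double dty = dy != 0 ? sy / dy : big;
    double t = 0.0;
    while (t < 1.0) {
        const double tNext = std::fmin(std::fmin(tx, ty), 1.0);
        matrVect(iy * W + ix, (tNext - t) * len);                  // consume the LOI immediately
        t = tNext;
        if (t >= 1.0) break;
        if (tx < ty) { ix += sx; tx += dtx; }                      // step to the next voxel column
        else         { iy += sy; ty += dty; }                      // or to the next voxel row
    }
}
```

With the image estimate x set to all ones, the accumulated projection value equals the chord length of the ray, which also serves as a consistency check on the tracer.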
The radial and the angular coordinates stay the same for all LORs of one bin and so need not be loaded again.

6.2.4 Evaluation of the forward-projection implementation

The improved implementation allowed us to reduce the amount of shared memory to an acceptable value; we therefore decided to transfer several variables to shared memory space to reduce the number of registers, which were at that moment the limiting factor for launching more threads in parallel. Finally we used 404 kbyte of shared memory per block, and each thread occupies 40 registers. As there are 8192 registers on an MP, we could launch 204 threads in parallel. As it is favorable for performance to always execute blocks in multiples of the warp size, we chose the next smaller multiple of 32, hence 192 threads. Using the CUDA occupancy calculator, we find that the best processor occupancy

is reached when 64 threads are grouped in one block; with the given register usage we reach an occupancy of 25%.

As already mentioned, one problem of efficiently implementing the Siddon algorithm on the GPU is the size of the implementation, which limits the number of threads that can be executed in parallel. Another problem of the algorithm is its use of control-flow statements, which are penalized more heavily in a GPU program than in a CPU program. The programming guide recommends using as few conditional statements as possible; in the implementation of Siddon's algorithm, however, they cannot be avoided, and in fact the function is full of such instructions. This can have a notable effect on performance and should not be neglected.

6.3 Implementation of back-projection

Basically, the method for the implementation of the back-projection is similar to the forward-projection, but we face one problem due to the parallel structure of our algorithm and the memory management. We already mentioned in section that, if one thread performs a calculation on a memory element, this memory address is not protected against access by other threads. This leads to a problem in the implementation of the back-projection, because several threads access the same memory element at the same time. The back-projection is the multiplication of the transpose of the SM A with the vector y^{(k),corrected} (3.2):

x_j^{(k),back} = \sum_{i=0}^{B-1} a_{ij}^{T} \, y_i^{(k),corrected}

In the forward-projection one thread calculates one row of the SM, i.e., one column of its transpose A^T. For the back-projection, the elements of that column are then multiplied with the corresponding vector element, and the result is added to the corresponding element of the back-projection vector. Figure 6.4 illustrates the problem of simultaneous memory access in the back-projection. Two threads calculate a fraction of element 3 of the vector x_back.
This data array is saved in global memory and is therefore accessible to all threads of the kernel. Threads 2 and 4 read the current value from the array at time t_0, add their fraction, and write the sum back to the same memory address. Thread 2 completes the operation at time t_1; then at time t_2 thread 4 completes its computation and writes a new value to the memory address. The value that was stored by thread 2 before is now overwritten and

the data are lost.

[Figure 6.4: at the top, the multiplication of the transpose A^T with the vector y^{(k),corrected}; below, the parallelized GPU implementation. Threads 2 and 4 both read element 3 of x_back at t_0, thread 2 writes at t_1, and thread 4 overwrites that result at t_2.]

Figure 6.4: The figure shows the computation of the back-projection. At the top the multiplication of the transpose matrix with the vector is shown, below the parallelized implementation on the GPU. The vector x_back is calculated column by column; as threads run in parallel, it is possible that results are overwritten. For more detail see text.

CUDA offers atomic functions, which guarantee that the memory address is protected while the arithmetic operation is processed. However, so far these functions only support integer values. We tried to implement a workaround that converts the float value to an integer, sums up the values and converts them back to a float value. The function worked for small data sets, and the errors caused by the simultaneous access were removed. However, for a full image data set the implementation hangs and has to be aborted. So far we do not know why this happens; further studies have to be done.

The problem can also be explained in a geometrical way. Two lines of response which are calculated in parallel cross the same voxel (see figure 6.5). The voxel intersected by both lines is computed at the same time; one length of intersection will not be summed into the value of the voxel, because both computation instructions accessed the value at the same time.

[Figure 6.5: two lines of response, LOR1 and LOR2, crossing a voxel grid; the voxels crossed by LOR1, by LOR2, and by both are marked.]

Figure 6.5: Two LORs are calculated in parallel. If both lines cross the same voxel and this voxel is computed at the same time, the length of intersection of one LOR is lost due to the simultaneous memory access.

When we increase the number of voxels, each voxel will be intersected by fewer LORs, which means that the probability of simultaneous memory access, and with it the error, decreases. However, increasing the number of voxels while maintaining the same image size means smaller voxels. The voxel size cannot be chosen arbitrarily; it depends on various factors such as the detector size or the ray-tracing algorithm used. We will evaluate the dependency of the error on the resolution; we refer to this version of the program as the resolution dependent implementation (RDI). However, another solution should be found that does not depend on the number of voxels.

The geometrical explanation of the problem leads us to a new approach. If the bins are computed in linear order, the probability that two LORs intersect the same voxel is higher, because bins with a small difference in their bin ID are geometrically close together. Hence, we changed the pattern in which Siddon's algorithm calculates the LOIs to a random order. To further reduce the error we implemented several arrays to store intermediate results; this measure also reduces simultaneous access to the same element. 64 x_back arrays are allocated to store the results of the back-projection. After the complete computation of the projection, the arrays are summed up into one array. This second version of the GPU program will be called the resolution independent implementation (RII).

The problem of simultaneous memory access also occurs if the sensitivity image is calculated on the GPU. We decided to implement the calculation of this vector on the CPU and copy it to the GPU. This decision had two reasons: firstly, small differences in this vector can lead to huge discrepancies in the reconstructed image, so even if we could reduce the error to a minimum, it would still notably affect the image. Secondly, the calculation only has to be done once for a whole computation; if the image space remains the same, it can even be computed once, stored on hard disk and simply loaded for the computation.

Another problem should be mentioned: CUDA applications are compatible with all NVIDIA graphics cards that support CUDA. If older graphics cards are used, it can happen that some functions or libraries are not available, but for future GPUs NVIDIA intends to keep compatibility with previous versions. The problem of simultaneous memory access would affect our results if we ran the application on a GPU with more processor cores. More multiprocessors would mean that more threads can be launched in parallel, i.e. in our case that more lines are computed at the same time, and hence that the probability of two lines intersecting the same voxel would increase, which would also increase the error of the image.

6.4 Threads in parallel and memory usage

In the end six kernels have been implemented. The register and shared memory usage defines the number of threads per block; table 6.2 summarizes the parameters of the kernels.

Table 6.2: Implemented kernels and the parameters that are needed to define the number of parallel threads and processor occupancy.

parameter             Sinogram()   Michelogram()   projections^a   reconstruction()
register
shared memory
block size
threads in parallel
processor occupancy   50%          67%             25%             100%

^a pre_sum(), forward_projection() and backprojection() are equal in size and memory usage.
Kernel Sinogram() calculates the radial and angular coordinates of one sinogram and saves them to the global array xy_coords; Michelogram() computes the number of LORs whose events are saved in a bin, together with the related axial coordinates, and saves all of this to the array z_coords. The global function pre_sum() calculates the normalization sum;

this function is not used on the GPU in all implementations due to the high mean error that is caused by simultaneous memory access. The kernels forward_projection() and backprojection() calculate the projections; the function reconstruction() computes the final image. These last three functions are repeated in a loop as many times as iteration steps are defined.

We tried to reduce global memory usage; however, allocation of memory in these banks could not be avoided, as it was needed to save the results of the kernels. Data in device memory have the advantage that they remain there until the application terminates; furthermore, all threads have access to global memory. Kernels partially load data into shared memory, which is erased after kernel execution; this is favorable if data elements are used more than once. Table 6.3 shows the different arrays that are allocated in global memory space and their size for an image space of voxels. The total memory usage depends on the number of bins of the system and on the voxel space, so that for a different resolution the occupancy would vary.

Table 6.3: Arrays that are allocated in global memory space to save the output of kernel functions, for an image space of voxels.

array         number of elements   type of elements    total size
xy_coords^a                        struct xy_struct^b    0.56 MB
z_coords      175^c                struct z_struct^d          MB
bin_events^e  max. N_bins          integer               max. MB
y_i           N_bins               integer               24.6 MB
pre_sum       N_voxels             float                    3 MB
y_k           N_bins               float                 24.6 MB
x_j           N_voxels             float                    3 MB
x_back        64 N_voxels          float                  192 MB
total                                                    272 MB

^a One for each bin of a sinogram.
^b xy_struct contains 4 float values for x_1, x_2, y_1 and y_2.
^c According to the number of sinograms.
^d z_struct contains 8 float values and one integer value for the z coordinates and the number of LORs sampled in one bin.
^e bin_events contains the bin IDs of the bins whose count rate is different from 0.

90 Chapter 7

Results and discussion

The evaluation of the different implementations is split into two parts: on the one hand the time performance, on the other hand the image quality. For the calculation time, the CPU and the GPU implementations are compared. The image quality of the reconstruction algorithm used will be evaluated; reference pictures from the Biograph16 scanner software help to verify our results. The GPU implementation should provide images that are comparable in quality to the images reconstructed on the CPU; the error expected due to the simultaneous memory access (6.3) will be checked in detail.

All results presented in this work were obtained using a workstation with an Intel Core2 Quad processor that runs at a frequency of 2.8 GHz and has 4 GB of RAM. The CPU implementations run only on one core of the processor. The computer is equipped with a NVIDIA GeForce 8600GT, on which the GPU implementations are processed. The computer uses Scientific Linux as operating system. For the development of the GPU code, the CUDA compiler nvcc in version 2.0 was used.

7.1 Data used for evaluation

Three data sets will be used to evaluate the reconstruction implementations. The first picture shows a point source; compared to the other measurement data this data set has few event counts, and only 13% of the bins saved at least one count. This image is less interesting for image quality, but relevant for computation time, to show the relation between non-zero entries and reconstruction time. For the second data set a phantom was used: a cylindrical phantom with a point source inside. The number of bins that saved an event is high, namely 99.7% of all bins. The third image shows the torso of a patient; the non-zero bins amount to 99.7%. The measurement data are saved in PETLINK format (2.8).

To compare our reconstructed images, we use images that were reconstructed by the Biograph16 scanner software; the files are saved in Ecat7 format. All Ecat7 images have a voxel space of and were reconstructed with an attenuation-weighted OS-EM with 4 iterations and 8 subsets. Apart from the PET scan, the patient image also includes the CT scan. Figure 7.1 presents the three images.

[Figure 7.1: three panels (a), (b), (c) with a colour scale from 100% to 0%.]

Figure 7.1: Three data sets are used to evaluate the implementations. (a) shows a point source, (b) a phantom with a point source inside and (c) the torso of a patient. The U-shaped structure shows the heart of the patient. All images have a voxel space of voxels.

7.2 Time performance

We compare the performance of the implementations for several images and image sizes. We evaluate the computation time for , as this was the size of the ECAT7 images of the scanner. Furthermore, the computation is run with voxels, as some artifacts in the reconstructed image are reduced at that size (s. ). In addition, we verify the dependency between the calculation time and the number of non-zero bins in the measurement file. Figure 7.2 presents the results. The left columns show the computation time for the different reconstructions; the columns on the right compare the time to the fastest implementation. The off-line version is not included in the computation for the higher resolution, as the system matrix was only calculated for the smaller image size.

The measurement data of the point source (s. figure 7.1) contain 13% filled bins, the file of the phantom 99.7% non-zero bins. 13% is quite low for a normal PET study, but it allows us to show the difference in computation time compared to a measurement file with a high count rate over all bins. It can be seen in figure 7.2 (a) and (b) that the number of filled bins only slightly affects the computation time of the off-line

[Figure 7.2: three pairs of bar charts. (a) Point source data after 15 iterations with 128x128x47 voxels; (b) phantom data after 15 iterations with 128x128x47 voxels; (c) phantom data after 15 iterations with 192x192x54 voxels. Each panel shows the computation time in seconds of the off-line CPU, online CPU (double), online CPU (single), RII GPU and RDI GPU implementations and compares it to the fastest implementation.]

Figure 7.2: The left columns show the computation time in seconds. Details of the reconstructions are given above the graphs. The point source data contain 13% non-zero bins, the phantom measurement file 99.7%. The number of filled bins affects the computation time of the on-the-fly reconstruction, but only slightly the off-line performance. The columns on the right compare the computation time to the fastest implementation. The maximum acceleration is 1.4 for the RII GPU version and 3.1 for the resolution dependent implementation.

implementation, but noticeably that of the on-the-fly computations. The reason lies in the implementation: the online versions only calculate the lines of response for the bins that are filled, so the more non-zero bins exist, the more calculations have to be processed. In contrast, the off-line version loads the whole system matrix stepwise from the hard disk into random access memory and performs the forward- and back-projection. The reconstruction time of the off-line version is high; a change from double to single floating point precision alone should already reduce the calculation time noticeably: on the one hand, more values can be loaded into the RAM at once, and on the other hand the calculation itself should run faster. For the online version on the CPU, the change between double and single precision values makes a difference of almost 30% (figure 7.2) in computation time, with no visible effect on the image quality (7.3.2). We compare the GPU implementations to the single-precision online CPU version to get a reasonable comparison, as both GPU versions work only with single floating point precision values.

The measures we took to reduce the error of simultaneous memory access (6.3) influenced the performance of the implementation. The resolution dependent implementation for the GPU runs 2.4 times faster than the resolution independent implementation and around 3 times faster than the online CPU version with single floating point precision. The time gap is caused by the different implementations: the RII program has a random computation order and several arrays to store intermediate results. We added a new kernel function that sums up the different arrays after the back-projection; this computation needs less than a second, so its influence on the computation time is minimal. The disordered access pattern affects the performance much more. CUDA tries to coalesce memory access, i.e.
it tries to read packages of 32, 64 or 128 bytes at once. However, the linearly organized arrays are now accessed in a random way, which results in more memory accesses and a decrease in performance.

7.3 Image quality

We will compare the image quality in several steps. Firstly, the images reconstructed by the scanner software are compared to our online CPU images. Secondly, the results of the different CPU implementations are verified, and finally we compare the images of the two GPU versions with the CPU images.

ECAT7 versus CPU

To compare our reconstructed images we use images that were reconstructed by the scanner software. The files are saved in Ecat7 format. We do not compare quantitatively

the Ecat7 images with our results, because the information about the reconstruction method and about the image processing applied by Siemens is not sufficient. We know that the images were reconstructed with an attenuation-weighted OS-EM algorithm with 4 iterations and 8 subsets. However, we have no information about which methods were used to build the system matrix. Furthermore, it is possible that filters or smoothing techniques were applied. For the patient image a motion compensation was used; this reduces the blurring effect caused by the moving heart.

For the reconstruction of our images we used the ML-EM algorithm with 30 iterations instead of the 15 applied before. The calculation time changes almost linearly with the iteration number; hence 15 iterations were enough to evaluate the time difference between the different implementations, but for the image quality we reached better results with 30 iterations. To build the system matrix, Siddon's algorithm was applied; this algorithm has a low accuracy, which can lead to a lower spatial resolution (3.1.1). Blurring effects can appear, as only the symmetrical matrix has been computed and no blurring effects have been taken into account in modelling the matrix.

Figure 7.3 presents the Ecat7 images and the reconstructed results of the online CPU version. The image space of the original images is ; our reconstructions are presented with and . We observe background noise and artifacts in all images. The artifacts indicated by the arrows in picture (e) are symmetrical and can also be seen in the sensitivity images (not shown) used for the reconstruction. These symmetrical artifacts almost disappear in the higher resolution images. However, the background noise and the edge artifacts are still visible.
The edge artifacts appear at the border of the image space and at the border of the phantom object; this type of artifact is a common side effect of applying iterative algorithms. In the phantom images two blank areas are visible; it is not clear why these areas are blank, as they only appear in these images. The ellipsoid borders which can be seen in images (e) and (f) are the limits of the patient's torso. The patient scan was a PET/CT study; the visible ellipsoid matches the patient's body in the CT image. However, these parts are not visible in the Ecat7 image, as this image is zoomed. The white dotted circle in images (e) and (f) shows the area which is presented in image (d). The structure of the heart muscle (the U-shaped structure) is clearly visible; the surrounding areas are weaker in signal strength and appear blurred.

[Figure 7.3: six panels with a colour scale from 100% to 0%. (a) phantom, Ecat7; (b) phantom, CPU 128x128x47, with a blank area, edge artifacts and symmetrical artifacts marked; (c) phantom, CPU 192x192x54; (d) patient, Ecat7; (e) patient, CPU 128x128x47, with noise marked; (f) patient, CPU 192x192x54.]

Figure 7.3: (a) and (d) present a phantom and a patient's torso, respectively; both are reconstructed by the scanner equipment. These images have a resolution of . (b), (c), (e) and (f) were reconstructed with the on-the-fly CPU implementation using the ML-EM algorithm for reconstruction and Siddon's algorithm to compute the system matrix. 30 iterations were processed for reconstruction. The patient torso images were all smoothed using a tri-linear interpolation technique.

Evaluation of images reconstructed on CPU

Three different versions have been implemented on the CPU. First, the off-line version, which saves the whole system matrix to the hard disk and reads the values during calculation; this implementation uses values in double floating point precision. The second and third implementations calculate the elements of the system matrix on-the-fly; the difference between the two versions is the accuracy of the values, one using single and one double floating point precision.

Figure 7.4 shows an image that was reconstructed using the off-line implementation. The line shows the position of the line profile, which is presented on the right of the figure. The graph compares the three implementations. The circle marks a region of interest whose analysis is given in table 7.1.

[Figure 7.4: reconstructed phantom with a line profile comparing off-line double, online double and online single.]

Figure 7.4: The image presents the reconstructed phantom with a resolution of after 15 iterations. The line profile compares the off-line program, which uses double floating point precision, with the online implementations that use single and double floating point precision values. The analysis of the region of interest marked by the circle in the image is given in table 7.1.

Table 7.1: Analysis of the region of interest shown in figure 7.4.

resolution   image            min. value   max. value   std. dev.   voxels ROI
             off-line double
             online double
             online single

Visually, the quality of the images of the three CPU implementations is almost equal. Between the single and double floating point images that are reconstructed with the on-the-fly program, no difference can be seen visually. A slight discrepancy can be detected between the off-line and the online versions. The visual difference that can be observed in the images is confirmed by the analysis of the ROI in table 7.1 and by the line profile (figure 7.4). We assume that the reason comes from the different way of implementation (s.
4.3), the calculation is slightly different, which could lead to

rounding errors; however, as the discrepancies are small, we did not study the issue further.

Evaluation of images reconstructed on GPU

In chapter 6.3 the problem of simultaneous memory access is explained. As memory access is not restricted on the GPU, we expect an error in the images reconstructed on the GPU. First of all we want to show that the error really occurs during the back-projection and not during the forward-projection. Therefore, we save the results of both processes after the first iteration of the CPU and of the RDI GPU implementation and compare them. The result of the forward-projection has the same size as the sinogram, and the result of the back-projection the same size as the image; hence, we can present the results as images. Figure 7.5 (a) and (b) show one slice of the results of the forward- and back-projection of the CPU based projections; the size of image (a) is and the size of (b) is . (c) and (d) show the differential images: each voxel of image (a) and (b) is subtracted by the corresponding voxel of the image obtained from the GPU computation. In (c) there are almost no differences; in contrast, (d) shows great discrepancies across the image. Hence, it can be stated that the error occurs during the back-projection process. The blank pixels in (c) are bins that did not save any event, for which the forward-projection is therefore not computed.

Error evaluation at different resolutions

The GPU images are compared to the CPU images at different resolutions; therewith we want to show the relationship between the resolution and the expected error, which is explained in detail in chapter 6.3. Simultaneous memory access during the back-projection process causes a loss of data, which leads to a loss of image quality. Figures 7.6, 7.7 and 7.8 present images reconstructed with the on-the-fly CPU version, the resolution dependent implementation (RDI) and the resolution independent implementation (RII) GPU version.
All images were reconstructed with 15 iterations. Each figure contains a line profile that compares the three different images. The circle shows a region of interest (ROI); details of the ROIs are given in table 7.2. The dependence of the error on the resolution for the RDI GPU version can be seen in the images, the line profiles and the data from the ROIs. Only for a resolution of is the image quality comparable to the CPU reconstruction. In contrast, the measures that were taken for the RII GPU version reduced the error to a minimum; the line profile hardly shows any discrepancies and the values for the ROIs vary only slightly. Even for a resolution of , where a high error would be expected due to the low number

of voxels, the result is almost equal in quality to the CPU image. Artifacts can be seen in the image of for both GPU versions; the artifacts are below the line of the line profile. The source of these artifacts is unknown.

[Figure 7.5: four panels; (a) and (b) show a distribution from high to low, (c) and (d) show the difference from none to high.]

Figure 7.5: (a) and (b) show the result of the forward- and back-projection, respectively. (c) and (d) are differential images that show the differences between the results of the CPU and the RDI GPU version. The results are saved after the first iteration; the image space for (a) and (c) is and for (b) and (d) . It can be seen that the error occurs during the back- and not during the forward-projection.

Table 7.2: Analysis of the regions of interest shown in figures 7.6, 7.7 and 7.8.

resolution   image     min. value   max. value   std. dev.   voxels ROI
             CPU
             RDI GPU
             RII GPU
             CPU
             RDI GPU
             RII GPU
             CPU
             RDI GPU
             RII GPU

[Figures 7.6 and 7.7: each shows three reconstructions labelled CPU, RDI GPU and RII GPU, with a line profile comparing them above the images.]

Figure 7.6: The images were reconstructed with 15 iterations and voxels. The line shows the position of the line profile, which is presented above the images. The circle shows the position of a region of interest (ROI); an analysis of this region can be found in table 7.2.

Figure 7.7: The images were reconstructed with 15 iterations and voxels. The line shows the position of the line profile, which is presented above the images. The circle shows the position of a region of interest (ROI); an analysis of this region can be found in table 7.2.


More information

Review of PET Physics. Timothy Turkington, Ph.D. Radiology and Medical Physics Duke University Durham, North Carolina, USA

Review of PET Physics. Timothy Turkington, Ph.D. Radiology and Medical Physics Duke University Durham, North Carolina, USA Review of PET Physics Timothy Turkington, Ph.D. Radiology and Medical Physics Duke University Durham, North Carolina, USA Chart of Nuclides Z (protons) N (number of neutrons) Nuclear Data Evaluation Lab.

More information

BME I5000: Biomedical Imaging

BME I5000: Biomedical Imaging 1 Lucas Parra, CCNY BME I5000: Biomedical Imaging Lecture 4 Computed Tomography Lucas C. Parra, parra@ccny.cuny.edu some slides inspired by lecture notes of Andreas H. Hilscher at Columbia University.

More information

UNIVERSITY OF SOUTHAMPTON

UNIVERSITY OF SOUTHAMPTON UNIVERSITY OF SOUTHAMPTON PHYS2007W1 SEMESTER 2 EXAMINATION 2014-2015 MEDICAL PHYSICS Duration: 120 MINS (2 hours) This paper contains 10 questions. Answer all questions in Section A and only two questions

More information

An educational tool for demonstrating the TOF-PET technique

An educational tool for demonstrating the TOF-PET technique Nuclear Instruments and Methods in Physics Research A 471 (2001) 200 204 An educational tool for demonstrating the TOF-PET technique T.Bȧack a, *, J. Cederkȧall a, B. Cederwall a, A. Johnson a, A. Kerek

More information

Continuation Format Page

Continuation Format Page C.1 PET with submillimeter spatial resolution Figure 2 shows two views of the high resolution PET experimental setup used to acquire preliminary data [92]. The mechanics of the proposed system are similar

More information

Detection of Lesions in Positron Emission Tomography

Detection of Lesions in Positron Emission Tomography Detection of Lesions in Positron Emission Tomography Bachelor Thesis Nina L.F. Bezem Study: Physics and Astronomy Faculty of Science Supervised by: Dr. Andre Mischke Utrecht University, Institute for Subatomic

More information

Digital Image Processing

Digital Image Processing Digital Image Processing SPECIAL TOPICS CT IMAGES Hamid R. Rabiee Fall 2015 What is an image? 2 Are images only about visual concepts? We ve already seen that there are other kinds of image. In this lecture

More information

Determination of Three-Dimensional Voxel Sensitivity for Two- and Three-Headed Coincidence Imaging

Determination of Three-Dimensional Voxel Sensitivity for Two- and Three-Headed Coincidence Imaging IEEE TRANSACTIONS ON NUCLEAR SCIENCE, VOL. 50, NO. 3, JUNE 2003 405 Determination of Three-Dimensional Voxel Sensitivity for Two- and Three-Headed Coincidence Imaging Edward J. Soares, Kevin W. Germino,

More information

FRONT-END DATA PROCESSING OF NEW POSITRON EMIS- SION TOMOGRAPHY DEMONSTRATOR

FRONT-END DATA PROCESSING OF NEW POSITRON EMIS- SION TOMOGRAPHY DEMONSTRATOR SOUDABEH MORADI FRONT-END DATA PROCESSING OF NEW POSITRON EMIS- SION TOMOGRAPHY DEMONSTRATOR Master of Science Thesis Examiners: Prof. Ulla Ruotsalainen MSc Defne Us Examiners and topic approved by the

More information

Index. aliasing artifacts and noise in CT images, 200 measurement of projection data, nondiffracting

Index. aliasing artifacts and noise in CT images, 200 measurement of projection data, nondiffracting Index Algebraic equations solution by Kaczmarz method, 278 Algebraic reconstruction techniques, 283-84 sequential, 289, 293 simultaneous, 285-92 Algebraic techniques reconstruction algorithms, 275-96 Algorithms

More information

Image reconstruction for PET/CT scanners: past achievements and future challenges

Image reconstruction for PET/CT scanners: past achievements and future challenges Review Image reconstruction for PET/CT scanners: past achievements and future challenges PET is a medical imaging modality with proven clinical value for disease diagnosis and treatment monitoring. The

More information

Tomographic Reconstruction

Tomographic Reconstruction Tomographic Reconstruction 3D Image Processing Torsten Möller Reading Gonzales + Woods, Chapter 5.11 2 Overview Physics History Reconstruction basic idea Radon transform Fourier-Slice theorem (Parallel-beam)

More information

CLASS HOURS: 4 CREDIT HOURS: 4 LABORATORY HOURS: 0

CLASS HOURS: 4 CREDIT HOURS: 4 LABORATORY HOURS: 0 Revised 10/10 COURSE SYLLABUS TM 220 COMPUTED TOMOGRAPHY PHYSICS CLASS HOURS: 4 CREDIT HOURS: 4 LABORATORY HOURS: 0 CATALOG COURSE DESCRIPTION: This course is one of a three course set in whole body Computed

More information

Deviceless respiratory motion correction in PET imaging exploring the potential of novel data driven strategies

Deviceless respiratory motion correction in PET imaging exploring the potential of novel data driven strategies g Deviceless respiratory motion correction in PET imaging exploring the potential of novel data driven strategies Presented by Adam Kesner, Ph.D., DABR Assistant Professor, Division of Radiological Sciences,

More information

SPECT QA and QC. Bruce McBride St. Vincent s Hospital Sydney.

SPECT QA and QC. Bruce McBride St. Vincent s Hospital Sydney. SPECT QA and QC Bruce McBride St. Vincent s Hospital Sydney. SPECT QA and QC What is needed? Why? How often? Who says? QA and QC in Nuclear Medicine QA - collective term for all the efforts made to produce

More information

MEDICAL IMAGE ANALYSIS

MEDICAL IMAGE ANALYSIS SECOND EDITION MEDICAL IMAGE ANALYSIS ATAM P. DHAWAN g, A B IEEE Engineering in Medicine and Biology Society, Sponsor IEEE Press Series in Biomedical Engineering Metin Akay, Series Editor +IEEE IEEE PRESS

More information

Introduction to Biomedical Imaging

Introduction to Biomedical Imaging Alejandro Frangi, PhD Computational Imaging Lab Department of Information & Communication Technology Pompeu Fabra University www.cilab.upf.edu X-ray Projection Imaging Computed Tomography Digital X-ray

More information

Introduction to Medical Image Analysis

Introduction to Medical Image Analysis Introduction to Medical Image Analysis Rasmus R. Paulsen DTU Compute rapa@dtu.dk http://courses.compute.dtu.dk/02511 http://courses.compute.dtu.dk/02511 Plenty of slides adapted from Thomas Moeslunds lectures

More information

Validation of GEANT4 for Accurate Modeling of 111 In SPECT Acquisition

Validation of GEANT4 for Accurate Modeling of 111 In SPECT Acquisition Validation of GEANT4 for Accurate Modeling of 111 In SPECT Acquisition Bernd Schweizer, Andreas Goedicke Philips Technology Research Laboratories, Aachen, Germany bernd.schweizer@philips.com Abstract.

More information

COMPARATIVE STUDIES OF DIFFERENT SYSTEM MODELS FOR ITERATIVE CT IMAGE RECONSTRUCTION

COMPARATIVE STUDIES OF DIFFERENT SYSTEM MODELS FOR ITERATIVE CT IMAGE RECONSTRUCTION COMPARATIVE STUDIES OF DIFFERENT SYSTEM MODELS FOR ITERATIVE CT IMAGE RECONSTRUCTION BY CHUANG MIAO A Thesis Submitted to the Graduate Faculty of WAKE FOREST UNIVERSITY GRADUATE SCHOOL OF ARTS AND SCIENCES

More information

GPU implementation for rapid iterative image reconstruction algorithm

GPU implementation for rapid iterative image reconstruction algorithm GPU implementation for rapid iterative image reconstruction algorithm and its applications in nuclear medicine Jakub Pietrzak Krzysztof Kacperski Department of Medical Physics, Maria Skłodowska-Curie Memorial

More information

A dedicated tool for PET scanner simulations using FLUKA

A dedicated tool for PET scanner simulations using FLUKA A dedicated tool for PET scanner simulations using FLUKA P. G. Ortega FLUKA meeting June 2013 1 Need for in-vivo treatment monitoring Particles: The good thing is that they stop... Tumour Normal tissue/organ

More information

UvA-DARE (Digital Academic Repository) Motion compensation for 4D PET/CT Kruis, M.F. Link to publication

UvA-DARE (Digital Academic Repository) Motion compensation for 4D PET/CT Kruis, M.F. Link to publication UvA-DARE (Digital Academic Repository) Motion compensation for 4D PET/CT Kruis, M.F. Link to publication Citation for published version (APA): Kruis, M. F. (2014). Motion compensation for 4D PET/CT General

More information

Fast Timing and TOF in PET Medical Imaging

Fast Timing and TOF in PET Medical Imaging Fast Timing and TOF in PET Medical Imaging William W. Moses Lawrence Berkeley National Laboratory October 15, 2008 Outline: Time-of-Flight PET History Present Status Future This work was supported in part

More information

SUV Analysis of F-18 FDG PET Imaging in the Vicinity of the Bladder. Colleen Marie Allen. Graduate Program in Medical Physics Duke University

SUV Analysis of F-18 FDG PET Imaging in the Vicinity of the Bladder. Colleen Marie Allen. Graduate Program in Medical Physics Duke University SUV Analysis of F-18 FDG PET Imaging in the Vicinity of the Bladder by Colleen Marie Allen Graduate Program in Medical Physics Duke University Date: Approved: Timothy Turkington, Supervisor Terence Wong

More information

MOHAMMAD MINHAZ AKRAM THE EFFECT OF SAMPLING IN HISTOGRAMMING AND ANALYTICAL RECONSTRUCTION OF 3D AX-PET DATA

MOHAMMAD MINHAZ AKRAM THE EFFECT OF SAMPLING IN HISTOGRAMMING AND ANALYTICAL RECONSTRUCTION OF 3D AX-PET DATA MOHAMMAD MINHAZ AKRAM THE EFFECT OF SAMPLING IN HISTOGRAMMING AND ANALYTICAL RECONSTRUCTION OF 3D AX-PET DATA Master of Science Thesis Examiners: Prof. Ulla Ruotsalainen M.Sc. Uygar Tuna Examiners and

More information

ADVANCES IN FLUKA PET TOOLS

ADVANCES IN FLUKA PET TOOLS MCMA2017 ADVANCES IN FLUKA PET TOOLS Caterina Cuccagna Tera Foundation (CERN) and University of Geneva Ricardo Santos Augusto, Caterina Cuccagna, Wioletta Kozlowska,Pablo Garcia Ortega, Yassine Toufique,

More information

COUNT RATE AND SPATIAL RESOLUTION PERFORMANCE OF A 3-DIMENSIONAL DEDICATED POSITRON EMISSION TOMOGRAPHY (PET) SCANNER

COUNT RATE AND SPATIAL RESOLUTION PERFORMANCE OF A 3-DIMENSIONAL DEDICATED POSITRON EMISSION TOMOGRAPHY (PET) SCANNER COUNT RATE AND SPATIAL RESOLUTION PERFORMANCE OF A 3-DIMENSIONAL DEDICATED POSITRON EMISSION TOMOGRAPHY (PET) SCANNER By RAMI RIMON ABU-AITA A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY

More information

HIGH-SPEED THEE-DIMENSIONAL TOMOGRAPHIC IMAGING OF FRAGMENTS AND PRECISE STATISTICS FROM AN AUTOMATED ANALYSIS

HIGH-SPEED THEE-DIMENSIONAL TOMOGRAPHIC IMAGING OF FRAGMENTS AND PRECISE STATISTICS FROM AN AUTOMATED ANALYSIS 23 RD INTERNATIONAL SYMPOSIUM ON BALLISTICS TARRAGONA, SPAIN 16-20 APRIL 2007 HIGH-SPEED THEE-DIMENSIONAL TOMOGRAPHIC IMAGING OF FRAGMENTS AND PRECISE STATISTICS FROM AN AUTOMATED ANALYSIS P. Helberg 1,

More information

Radiology. Marta Anguiano Millán. Departamento de Física Atómica, Molecular y Nuclear Facultad de Ciencias. Universidad de Granada

Radiology. Marta Anguiano Millán. Departamento de Física Atómica, Molecular y Nuclear Facultad de Ciencias. Universidad de Granada Departamento de Física Atómica, Molecular y Nuclear Facultad de Ciencias. Universidad de Granada Overview Introduction Overview Introduction Tecniques of imaging in Overview Introduction Tecniques of imaging

More information

Scatter Correction Methods in Dimensional CT

Scatter Correction Methods in Dimensional CT Scatter Correction Methods in Dimensional CT Matthias Baer 1,2, Michael Hammer 3, Michael Knaup 1, Ingomar Schmidt 3, Ralf Christoph 3, Marc Kachelrieß 2 1 Institute of Medical Physics, Friedrich-Alexander-University

More information

NIH Public Access Author Manuscript J Nucl Med. Author manuscript; available in PMC 2010 February 9.

NIH Public Access Author Manuscript J Nucl Med. Author manuscript; available in PMC 2010 February 9. NIH Public Access Author Manuscript Published in final edited form as: J Nucl Med. 2010 February ; 51(2): 237. doi:10.2967/jnumed.109.068098. An Assessment of the Impact of Incorporating Time-of-Flight

More information

Design and performance characteristics of a Cone Beam CT system for Leksell Gamma Knife Icon

Design and performance characteristics of a Cone Beam CT system for Leksell Gamma Knife Icon Design and performance characteristics of a Cone Beam CT system for Leksell Gamma Knife Icon WHITE PAPER Introduction Introducing an image guidance system based on Cone Beam CT (CBCT) and a mask immobilization

More information

Semi-Quantitative Metrics in Positron Emission Tomography. Michael Adams. Department of Biomedical Engineering Duke University.

Semi-Quantitative Metrics in Positron Emission Tomography. Michael Adams. Department of Biomedical Engineering Duke University. Semi-Quantitative Metrics in Positron Emission Tomography by Michael Adams Department of Biomedical Engineering Duke University Date: Approved: Timothy G. Turkington, Supervisor Adam P. Wax Terence Z.

More information

Computational Medical Imaging Analysis

Computational Medical Imaging Analysis Computational Medical Imaging Analysis Chapter 1: Introduction to Imaging Science Jun Zhang Laboratory for Computational Medical Imaging & Data Analysis Department of Computer Science University of Kentucky

More information

The great interest shown toward PET instrumentation is

The great interest shown toward PET instrumentation is CONTINUING EDUCATION PET Instrumentation and Reconstruction Algorithms in Whole-Body Applications* Gabriele Tarantola, BE; Felicia Zito, MSc; and Paolo Gerundini, MD Department of Nuclear Medicine, Ospedale

More information

Medical Image Analysis

Medical Image Analysis Computer assisted Image Analysis VT04 29 april 2004 Medical Image Analysis Lecture 10 (part 1) Xavier Tizon Medical Image Processing Medical imaging modalities XRay,, CT Ultrasound MRI PET, SPECT Generic

More information

Computational Medical Imaging Analysis

Computational Medical Imaging Analysis Computational Medical Imaging Analysis Chapter 2: Image Acquisition Systems Jun Zhang Laboratory for Computational Medical Imaging & Data Analysis Department of Computer Science University of Kentucky

More information

COMPREHENSIVE QUALITY CONTROL OF NMR TOMOGRAPHY USING 3D PRINTED PHANTOM

COMPREHENSIVE QUALITY CONTROL OF NMR TOMOGRAPHY USING 3D PRINTED PHANTOM COMPREHENSIVE QUALITY CONTROL OF NMR TOMOGRAPHY USING 3D PRINTED PHANTOM Mažena MACIUSOVIČ *, Marius BURKANAS *, Jonas VENIUS *, ** * Medical Physics Department, National Cancer Institute, Vilnius, Lithuania

More information

ISOCS Characterization of Sodium Iodide Detectors for Gamma-Ray Spectrometry

ISOCS Characterization of Sodium Iodide Detectors for Gamma-Ray Spectrometry ISOCS Characterization of Sodium Iodide Detectors for Gamma-Ray Spectrometry Sasha A. Philips, Frazier Bronson, Ram Venkataraman, Brian M. Young Abstract--Activity measurements require knowledge of the

More information

Medical Images Analysis and Processing

Medical Images Analysis and Processing Medical Images Analysis and Processing - 25642 Emad Course Introduction Course Information: Type: Graduated Credits: 3 Prerequisites: Digital Image Processing Course Introduction Reference(s): Insight

More information

Performance Evaluation of the Philips Gemini PET/CT System

Performance Evaluation of the Philips Gemini PET/CT System Performance Evaluation of the Philips Gemini PET/CT System Rebecca Gregory, Mike Partridge, Maggie A. Flower Joint Department of Physics, Institute of Cancer Research, Royal Marsden HS Foundation Trust,

More information

Central Slice Theorem

Central Slice Theorem Central Slice Theorem Incident X-rays y f(x,y) R x r x Detected p(, x ) The thick line is described by xcos +ysin =R Properties of Fourier Transform F [ f ( x a)] F [ f ( x)] e j 2 a Spatial Domain Spatial

More information

Low-Dose Dual-Energy CT for PET Attenuation Correction with Statistical Sinogram Restoration

Low-Dose Dual-Energy CT for PET Attenuation Correction with Statistical Sinogram Restoration Low-Dose Dual-Energy CT for PET Attenuation Correction with Statistical Sinogram Restoration Joonki Noh, Jeffrey A. Fessler EECS Department, The University of Michigan Paul E. Kinahan Radiology Department,

More information

Time-of-Flight Technology

Time-of-Flight Technology Medical Review Time-of-Flight Technology Bing Bai, PhD Clinical Sciences Manager, PET/CT Canon Medical Systems INTRODUCTION Improving the care for every patient while providing a high standard care to

More information

3/27/2012 WHY SPECT / CT? SPECT / CT Basic Principles. Advantages of SPECT. Advantages of CT. Dr John C. Dickson, Principal Physicist UCLH

3/27/2012 WHY SPECT / CT? SPECT / CT Basic Principles. Advantages of SPECT. Advantages of CT. Dr John C. Dickson, Principal Physicist UCLH 3/27/212 Advantages of SPECT SPECT / CT Basic Principles Dr John C. Dickson, Principal Physicist UCLH Institute of Nuclear Medicine, University College London Hospitals and University College London john.dickson@uclh.nhs.uk

More information

Introduction to Neuroimaging Janaina Mourao-Miranda

Introduction to Neuroimaging Janaina Mourao-Miranda Introduction to Neuroimaging Janaina Mourao-Miranda Neuroimaging techniques have changed the way neuroscientists address questions about functional anatomy, especially in relation to behavior and clinical

More information

RADIOMICS: potential role in the clinics and challenges

RADIOMICS: potential role in the clinics and challenges 27 giugno 2018 Dipartimento di Fisica Università degli Studi di Milano RADIOMICS: potential role in the clinics and challenges Dr. Francesca Botta Medical Physicist Istituto Europeo di Oncologia (Milano)

More information

Nuclear Medicine Imaging

Nuclear Medicine Imaging Introduction to Medical Engineering (Medical Imaging) Suetens 5 Nuclear Medicine Imaging Ho Kyung Kim Pusan National University Introduction Use of radioactive isotopes for medical purposes since 1920

More information

Fits you like no other

Fits you like no other Fits you like no other BrightView X and XCT specifications The new BrightView X system is a fully featured variableangle camera that is field-upgradeable to BrightView XCT without any increase in room

More information

RICE UNIVERSITY. Optimization of Novel Developments in Positron Emission Tomography (PET) Imaging. Tingting Chang

RICE UNIVERSITY. Optimization of Novel Developments in Positron Emission Tomography (PET) Imaging. Tingting Chang RICE UNIVERSITY Optimization of Novel Developments in Positron Emission Tomography (PET) Imaging by Tingting Chang A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE Doctor of

More information

Canny Edge Based Self-localization of a RoboCup Middle-sized League Robot

Canny Edge Based Self-localization of a RoboCup Middle-sized League Robot Canny Edge Based Self-localization of a RoboCup Middle-sized League Robot Yoichi Nakaguro Sirindhorn International Institute of Technology, Thammasat University P.O. Box 22, Thammasat-Rangsit Post Office,

More information

Biomedical Image Processing

Biomedical Image Processing Biomedical Image Processing Jason Thong Gabriel Grant 1 2 Motivation from the Medical Perspective MRI, CT and other biomedical imaging devices were designed to assist doctors in their diagnosis and treatment

More information

Research Collection. Localisation of Acoustic Emission in Reinforced Concrete using Heterogeneous Velocity Models. Conference Paper.

Research Collection. Localisation of Acoustic Emission in Reinforced Concrete using Heterogeneous Velocity Models. Conference Paper. Research Collection Conference Paper Localisation of Acoustic Emission in Reinforced Concrete using Heterogeneous Velocity Models Author(s): Gollob, Stephan; Vogel, Thomas Publication Date: 2014 Permanent

More information

Translational Computed Tomography: A New Data Acquisition Scheme

Translational Computed Tomography: A New Data Acquisition Scheme 2nd International Symposium on NDT in Aerospace 2010 - We.1.A.3 Translational Computed Tomography: A New Data Acquisition Scheme Theobald FUCHS 1, Tobias SCHÖN 2, Randolf HANKE 3 1 Fraunhofer Development

More information

Q.Clear. Steve Ross, Ph.D.

Q.Clear. Steve Ross, Ph.D. Steve Ross, Ph.D. Accurate quantitation (SUV - Standardized Uptake Value) is becoming more important as clinicians seek to utilize PET imaging for more than just diagnosing and staging disease, but also

More information

Brilliance CT Big Bore.

Brilliance CT Big Bore. 1 2 2 There are two methods of RCCT acquisition in widespread clinical use: cine axial and helical. In RCCT with cine axial acquisition, repeat CT images are taken each couch position while recording respiration.

More information

Tomographic Image Reconstruction in Noisy and Limited Data Settings.

Tomographic Image Reconstruction in Noisy and Limited Data Settings. Tomographic Image Reconstruction in Noisy and Limited Data Settings. Syed Tabish Abbas International Institute of Information Technology, Hyderabad syed.abbas@research.iiit.ac.in July 1, 2016 Tabish (IIIT-H)

More information

Fits you like no other

Fits you like no other Fits you like no other Philips BrightView X and XCT specifications The new BrightView X system is a fully featured variableangle camera that is field-upgradeable to BrightView XCT without any increase

More information

Evaluation of Centrally Located Sources in. Coincidence Timing Calibration for Time-of-Flight PET

Evaluation of Centrally Located Sources in. Coincidence Timing Calibration for Time-of-Flight PET Evaluation of Centrally Located Sources in Coincidence Timing Calibration for Time-of-Flight PET by Richard Ryan Wargo Graduate Program in Medical Physics Duke University Date: Approved: Timothy G. Turkington,

More information

Lecture 6: Medical imaging and image-guided interventions

Lecture 6: Medical imaging and image-guided interventions ME 328: Medical Robotics Winter 2019 Lecture 6: Medical imaging and image-guided interventions Allison Okamura Stanford University Updates Assignment 3 Due this Thursday, Jan. 31 Note that this assignment

More information

Motion Correction in PET Image. Reconstruction

Motion Correction in PET Image. Reconstruction Motion Correction in PET Image Reconstruction Wenjia Bai Wolfson College Supervisors: Professor Sir Michael Brady FRS FREng Dr David Schottlander D.Phil. Transfer Report Michaelmas 2007 Abstract Positron

More information

Physics 11. Unit 8 Geometric Optics Part 1

Physics 11. Unit 8 Geometric Optics Part 1 Physics 11 Unit 8 Geometric Optics Part 1 1.Review of waves In the previous section, we have investigated the nature and behaviors of waves in general. We know that all waves possess the following characteristics:

More information

Quality control phantoms and protocol for a tomography system

Quality control phantoms and protocol for a tomography system Quality control phantoms and protocol for a tomography system Lucía Franco 1 1 CT AIMEN, C/Relva 27A O Porriño Pontevedra, Spain, lfranco@aimen.es Abstract Tomography systems for non-destructive testing

More information

PURE. ViSION Edition PET/CT. Patient Comfort Put First.

PURE. ViSION Edition PET/CT. Patient Comfort Put First. PURE ViSION Edition PET/CT Patient Comfort Put First. 2 System features that put patient comfort and safety first. Oncology patients deserve the highest levels of safety and comfort during scans. Our Celesteion

More information

in PET Medical Imaging

in PET Medical Imaging Fast Timing and TOF in PET Medical Imaging William W. Moses Lawrence Berkeley National Laboratory October 15, 2008 Outline: Time-of-Flight PET History Present Status Future This work was supported in part

More information

Outline. What is Positron Emission Tomography? (PET) Positron Emission Tomography I: Image Reconstruction Strategies

Outline. What is Positron Emission Tomography? (PET) Positron Emission Tomography I: Image Reconstruction Strategies CE: PET Physics and Technology II 2005 AAPM Meeting, Seattle WA Positron Emission Tomography I: Image Reconstruction Strategies Craig S. Levin, Ph.D. Department of Radiology and Molecular Imaging Program

More information

SNIC Symposium, Stanford, California April The Hybrid Parallel Plates Gas Counter for Medical Imaging

SNIC Symposium, Stanford, California April The Hybrid Parallel Plates Gas Counter for Medical Imaging The Hybrid Parallel Plates Gas Counter for Medical Imaging F. Anulli, G. Bencivenni, C. D Ambrosio, D. Domenici, G. Felici, F. Murtas Laboratori Nazionali di Frascati - INFN, Via E. Fermi 40, I-00044 Frascati,

More information

664 IEEE TRANSACTIONS ON NUCLEAR SCIENCE, VOL. 52, NO. 3, JUNE 2005

664 IEEE TRANSACTIONS ON NUCLEAR SCIENCE, VOL. 52, NO. 3, JUNE 2005 664 IEEE TRANSACTIONS ON NUCLEAR SCIENCE, VOL. 52, NO. 3, JUNE 2005 Attenuation Correction for the NIH ATLAS Small Animal PET Scanner Rutao Yao, Member, IEEE, Jürgen Seidel, Jeih-San Liow, Member, IEEE,

More information

A Fast GPU-Based Approach to Branchless Distance-Driven Projection and Back-Projection in Cone Beam CT

A Fast GPU-Based Approach to Branchless Distance-Driven Projection and Back-Projection in Cone Beam CT A Fast GPU-Based Approach to Branchless Distance-Driven Projection and Back-Projection in Cone Beam CT Daniel Schlifske ab and Henry Medeiros a a Marquette University, 1250 W Wisconsin Ave, Milwaukee,

More information

8/2/2017. Disclosure. Philips Healthcare (Cleveland, OH) provided the precommercial

8/2/2017. Disclosure. Philips Healthcare (Cleveland, OH) provided the precommercial 8//0 AAPM0 Scientific Symposium: Emerging and New Generation PET: Instrumentation, Technology, Characteristics and Clinical Practice Aug Wednesday 0:4am :pm Solid State Digital Photon Counting PET/CT Instrumentation

More information

Principles of PET Imaging. Positron Emission Tomography (PET) Fundamental Principles WHAT IS PET?

Principles of PET Imaging. Positron Emission Tomography (PET) Fundamental Principles WHAT IS PET? Positron Emission Tomography (PET) Fundamental Principles Osama Mawlawi Ph.D Department of Imaging Physics MD Anderson Cancer Center Houston TX. WHAT IS PET? Functional imaging modality as compared to

More information

Monte Carlo modelling and applications to imaging

Monte Carlo modelling and applications to imaging Monte Carlo modelling and applications to imaging The Monte Carlo method is a method to obtain a result through repeated random sampling of the outcome of a system. One of the earliest applications, in

More information

CS4758: Rovio Augmented Vision Mapping Project

CS4758: Rovio Augmented Vision Mapping Project CS4758: Rovio Augmented Vision Mapping Project Sam Fladung, James Mwaura Abstract The goal of this project is to use the Rovio to create a 2D map of its environment using a camera and a fixed laser pointer

More information

The Effects of PET Reconstruction Parameters on Radiotherapy Response Assessment. and an Investigation of SUV peak Sampling Parameters.

The Effects of PET Reconstruction Parameters on Radiotherapy Response Assessment. and an Investigation of SUV peak Sampling Parameters. The Effects of PET Reconstruction Parameters on Radiotherapy Response Assessment and an Investigation of SUV peak Sampling Parameters by Leith Rankine Graduate Program in Medical Physics Duke University

More information

Image Reconstruction from Multiple Projections ECE 6258 Class project

Image Reconstruction from Multiple Projections ECE 6258 Class project Image Reconstruction from Multiple Projections ECE 658 Class project Introduction: The ability to reconstruct an object from multiple angular projections is a powerful tool. What this procedure gives people

More information

MEDICAL IMAGING 2nd Part Computed Tomography

MEDICAL IMAGING 2nd Part Computed Tomography MEDICAL IMAGING 2nd Part Computed Tomography Introduction 2 In the last 30 years X-ray Computed Tomography development produced a great change in the role of diagnostic imaging in medicine. In convetional

More information

3D Registration based on Normalized Mutual Information

3D Registration based on Normalized Mutual Information 3D Registration based on Normalized Mutual Information Performance of CPU vs. GPU Implementation Florian Jung, Stefan Wesarg Interactive Graphics Systems Group (GRIS), TU Darmstadt, Germany stefan.wesarg@gris.tu-darmstadt.de

More information

Biomedical Imaging. Computed Tomography. Patrícia Figueiredo IST

Biomedical Imaging. Computed Tomography. Patrícia Figueiredo IST Biomedical Imaging Computed Tomography Patrícia Figueiredo IST 2013-2014 Overview Basic principles X ray attenuation projection Slice selection and line projections Projection reconstruction Instrumentation

More information

Registration concepts for the just-in-time artefact correction by means of virtual computed tomography

Registration concepts for the just-in-time artefact correction by means of virtual computed tomography DIR 2007 - International Symposium on Digital industrial Radiology and Computed Tomography, June 25-27, 2007, Lyon, France Registration concepts for the just-in-time artefact correction by means of virtual

More information

Computational Medical Imaging Analysis Chapter 4: Image Visualization

Computational Medical Imaging Analysis Chapter 4: Image Visualization Computational Medical Imaging Analysis Chapter 4: Image Visualization Jun Zhang Laboratory for Computational Medical Imaging & Data Analysis Department of Computer Science University of Kentucky Lexington,

More information