University of California at Los Angeles. Electrical Engineering. Iterative Reconstruction for 3D Tomography


University of California at Los Angeles
Electrical Engineering

Iterative Reconstruction for 3D Tomography

Comprehensive Project Report for Master Degree, Non-thesis Option
by Yinan Dong
Advisor: Prof. Dejan Markovic
WINTER 2014

TABLE OF CONTENTS

Abstract
1. Background
   a) 3D tomography
   b) Iterative reconstruction for 3D tomography
      i. SIRT
      ii. SART
2. CPU and GPU architectures
   a) CPU architecture
   b) GPU architecture
3. Detailed implementation
   a) Intel MKL
   b) CUDA
4. Result and Conclusion
5. Reference

Abstract

In this project, I designed C and CUDA implementations of the SIRT and SART algorithms, which are used for reconstruction in 3D tomography, and compared their speed against a MATLAB implementation. The C implementation uses the Intel MKL and the CUDA implementation uses the CUSPARSE library. Measured on the reconstruction loop, the C program is about 200% faster than MATLAB and the CUDA program is about 250% faster than C on average. This project also analyzes the complexity of all three implementations and gives some suggestions for future work.

1. Background

1.1 3D tomography

Tomography refers to imaging by sections or sectioning, through the use of any kind of penetrating wave. A device used in tomography is called a tomograph, while the image produced is called a tomogram. The computed tomography (CT) scanner, the most common application of tomography, was invented by Sir Godfrey Hounsfield in 1972 and made an exceptional contribution to medicine.

In conventional medical X-ray tomography, clinical staff make a sectional image through a body by moving an X-ray source and the film in opposite directions during the exposure. By modifying the direction and extent of the movement, operators can select different focal planes which contain the structures of interest. With CT it became feasible in principle, based on a very large number of measurements, to reconstruct a cross-sectional slice of a patient with fairly high accuracy. The image quality of the slices has improved significantly over the years thanks to advances in computing power and reconstruction methods.

Several algorithms for image reconstruction have been developed, and the techniques can be divided into two main categories:

Analytical, which capitalize on the Fourier Slice Theorem

Figure 1 Parallel projection; each projection is made up of the set of line integrals through the object.

Figure 1 illustrates parallel projection, the simplest scanning geometry and the easiest to visualize. The projection of an object at a given angle is made up of a set of line integrals. In X-ray CT, the line integral represents the total attenuation of the beam of X-rays as it travels in a straight line through the object. The resulting image is a 3D model of the attenuation coefficient. We consider the data to be collected as a series of parallel rays, at position $r$, across a projection at angle $\theta$. This is repeated for various angles. Using the coordinate system of Figure 1, the value of $r$ onto which the point $(x, y)$ will be projected at angle $\theta$ is given by:

$$x\cos\theta + y\sin\theta = r \qquad (1)$$

So the projection can be written as


$$p(r, \theta) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\,\delta(x\cos\theta + y\sin\theta - r)\,dx\,dy \qquad (2)$$

where $f(x, y)$ represents the attenuation coefficient at the point $(x, y)$. This function is known as the Radon transform (or sinogram) of the object.

Figure 2 Shepp-Logan Phantom and its sinogram.

Figure 2 shows the Shepp-Logan Phantom, which is the standard test image for tomographic reconstructions.

Iterative, which solve the reconstruction problem using algebraic reconstruction methods. These are the methods I use in my project.

The analytical (transform) methods perform well when a complete set of projections uniformly distributed over 180° or 360° is available, but they are very sensitive to noise. Algebraic reconstruction techniques were developed to address this. Unlike transform methods, algebraic methods do

not require a complete set of uniformly distributed projections for precise reconstruction, and they are also more stable under noisy conditions. Furthermore, they allow a priori information to be used in the reconstruction process. However, algebraic techniques are computationally very demanding. With the development of parallel programming and high-performance processors, the algebraic methods now present a viable alternative. In the next section I introduce the basic idea of iterative reconstruction for 3D tomography, and then the two methods I implemented in my project: SIRT and SART.

1.2 Iterative reconstruction for 3D tomography

The iterative reconstruction algorithm formulates the reconstruction problem as a linear inverse problem:

$$Ax = b \qquad (3)$$

where $x$ is a column vector of length $N^2$ representing the pixels of the original image, $A$ is an $M \times N^2$ matrix representing the data acquisition process, and $b$ is a column vector of length $M$ representing the measured projection data. We want to find a solution $x$ such that:

$$x = A_{\text{left}}^{-1}\, b \qquad (4)$$

Solving this equation directly by inverting the matrix $A$ is very time and resource consuming when $A$ is large. If $N$ and $M$ were small, it would be possible to use conventional matrix theory methods to invert the system of equations in (3). In practice $N$ may be a large number, and in most cases the

number of ray-sums $M$ will have the same magnitude. For these values of $M$ and $N$, the size of the matrix $A$, $M \times N^2$, precludes any possibility of direct matrix inversion. Iterative methods are used to solve the equation instead. Let $x^{(k)}$ denote the $k$-th estimate of the reconstruction. Then:

$$x^{(k+1)} = x^{(k)} - \lambda A^T (A x^{(k)} - b) \qquad (5)$$

where the relaxation factor $\lambda$ is a scalar. With a proper relaxation factor, $x^{(k)}$ converges to the actual value after several iterations. The update involves only matrix-vector multiplication and addition, which is much less time consuming than matrix inversion.

To normalize the update, we also introduce diagonal matrices $V$ and $W$:

$$x^{(k+1)} = x^{(k)} - \lambda V A^T W (A x^{(k)} - b)$$

$V$ is the diagonal matrix of the inverse of the row sums and $W$ is the diagonal matrix of the inverse of the column sums:

$$V_{i,i} = \frac{1}{\sum_{j=1}^{N^2} a_{i,j}}, \qquad W_{j,j} = \frac{1}{\sum_{i=1}^{M} a_{i,j}} \qquad (6)$$

This normalization keeps the computation from overflowing.

There are several iterative reconstruction methods. The three main ones are ART (Algebraic Reconstruction Technique), SIRT (Simultaneous Iterative Reconstruction Technique) and SART (Simultaneous

Algebraic Reconstruction Technique). ART updates the image after each ray is processed. ART reconstructions usually suffer from salt-and-pepper noise. The computed ray-sums are usually poor approximations to the corresponding measured ray-sums. SIRT updates the image after all rays are processed. It converges more slowly than ART because it needs more iterations, but leads to better looking images. SART updates the image after all rays in a single projection angle are processed. SART combines the best of ART and SIRT: it yields reconstructions of good quality and numerical accuracy in only one iteration.

In my project, I implemented the SIRT and SART algorithms on the CPU and GPU platforms. In the next chapter, I briefly introduce the CPU and GPU architectures and explain the main differences between them.
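The difference between the SIRT and SART update schedules can be sketched in plain C. This is a minimal illustration, not the report's MKL code: it stores A as a dense row-major array for brevity, all function names are hypothetical, and the entries of A are assumed non-negative. So that the dimensions conform, the sketch applies the inverse row sums (one per ray) to the residual and the inverse column sums (one per pixel) to the back-projected update; that assignment is an assumption of this example.

```c
#include <assert.h>
#include <stdlib.h>

/* One relaxed, weighted update over the rows [row0, row1) of a dense
 * row-major matrix A with n columns:
 *   x <- x - lambda * C * A^T * R * (A x - b)
 * R = inverse row sums, C = inverse column sums of that row block
 * (an assumed weighting, see the text above).
 * Returns 0 on success, -1 if scratch memory cannot be allocated. */
int weighted_update(const double *A, const double *b, double *x,
                    size_t n, size_t row0, size_t row1, double lambda)
{
    assert(n > 0 && row0 < row1);
    double *col_sum = calloc(n, sizeof *col_sum);
    double *back = calloc(n, sizeof *back);
    if (!col_sum || !back) { free(col_sum); free(back); return -1; }

    for (size_t i = row0; i < row1; ++i) {
        const double *row = A + i * n;
        double ax = 0.0, row_sum = 0.0;
        for (size_t j = 0; j < n; ++j) {
            ax += row[j] * x[j];          /* forward projection of ray i */
            row_sum += row[j];
            col_sum[j] += row[j];
        }
        if (row_sum == 0.0) continue;     /* ray misses the image */
        double r = (ax - b[i]) / row_sum; /* row-normalised residual */
        for (size_t j = 0; j < n; ++j)
            back[j] += row[j] * r;        /* accumulate A^T * r */
    }
    /* x is changed only after every ray of the block has been processed */
    for (size_t j = 0; j < n; ++j)
        if (col_sum[j] != 0.0)
            x[j] -= lambda * back[j] / col_sum[j];

    free(col_sum);
    free(back);
    return 0;
}

/* SIRT: each iteration uses all m rays at once. */
int sirt(const double *A, const double *b, double *x,
         size_t m, size_t n, double lambda, int iterations)
{
    for (int k = 0; k < iterations; ++k)
        if (weighted_update(A, b, x, n, 0, m, lambda) != 0) return -1;
    return 0;
}

/* SART: each iteration sweeps the projection angles and updates x after
 * every block of rays_per_angle rays (m must be a multiple of it). */
int sart(const double *A, const double *b, double *x, size_t m, size_t n,
         size_t rays_per_angle, double lambda, int iterations)
{
    assert(rays_per_angle > 0 && m % rays_per_angle == 0);
    for (int k = 0; k < iterations; ++k)
        for (size_t r0 = 0; r0 < m; r0 += rays_per_angle)
            if (weighted_update(A, b, x, n, r0, r0 + rays_per_angle,
                                lambda) != 0) return -1;
    return 0;
}
```

With `x` initialised to zeros, `sirt(A, b, x, m, n, 1.0, K)` runs K full updates, while `sart(A, b, x, m, n, D, 1.0, K)` runs K sweeps over blocks of D rays. Sharing one update routine keeps the only difference between the two methods in the loop over row blocks.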

2. CPU and GPU architectures

2.1 CPU Architecture

Most of us are familiar with the CPU. Figure 3 illustrates a pipelined MIPS CPU. It has instruction fetch, instruction decode, execution, memory access and result write-back stages. This is the basic architecture of most modern microprocessors.

Figure 3 Modern microprocessor architecture

Attempts to achieve scalar and better performance have resulted in a variety of design methodologies that cause the CPU to behave less linearly and more in parallel. When referring to parallelism in CPUs, two terms are generally used to classify these design techniques. Instruction level parallelism (ILP) seeks to increase the rate at which instructions are executed within a CPU (that is, to increase the utilization of on-die execution resources), and thread level parallelism (TLP) aims to increase the number of threads (effectively individual programs) that a CPU can

execute simultaneously.

Despite these parallelism techniques, the CPU is still a processor that executes programs serially. Most of a CPU is dedicated to data scheduling, branch prediction and other control units. The execution units occupy only a very small part of the whole CPU. The CPU is therefore good at branch-intensive operations, random access, memory-intensive operations and other programs which require complex flow control and branch decisions. This can also be seen in Figure 3: only the ALU performs computation in the chip; all the other parts are control and multiplexing logic.

2.2 GPU Architecture

Driven by the insatiable market demand for real-time, high-definition 3D graphics, the Graphics Processing Unit has evolved into a highly parallel, multithreaded, manycore processor with tremendous computational horsepower and very high memory bandwidth.

Figure 4 A typical GPU architecture.

Figure 4 gives an architecture overview of an NVIDIA G80 GPU. It has 16 multiprocessor (MP) blocks. Each MP block has:

8 streaming processors (IEEE 754 single-precision floating-point compliant)
16K shared memory
64K constant cache
8K texture cache

Each processor can access all of the memory at 86 GB/s, but with different latencies.

From this simple architecture, we can see that the reason behind the discrepancy in floating-point capability between the CPU and the GPU is that the GPU is specialized for compute-intensive, highly parallel computation and is therefore designed such that more transistors are devoted to data processing rather than data caching and flow control. GPUs also have significantly faster and more advanced memory interfaces, as they need to move much more data than CPUs.

More specifically, the GPU is especially well suited to problems that can be expressed as data-parallel computations, in which the same program is executed on many data elements in parallel, with high arithmetic intensity (the ratio of arithmetic operations to memory operations). Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control, and because it is executed on many data elements and has high arithmetic intensity, the memory access latency can be hidden with calculations instead of big data caches.

In this chapter, I have explained the basic difference between the CPU and GPU architectures and their respective suitable applications. In the next chapter, I introduce the detailed implementation of the 3D tomography reconstruction using Intel MKL (CPU) and CUDA CUSPARSE (GPU).


3. Algorithm implementation

3.1 C implementation (Intel MKL)

First, I introduce the CPU implementation of SIRT and SART, which uses the Intel Math Kernel Library (Intel MKL). Intel MKL improves performance with math routines for software applications that solve large computational problems. It provides BLAS and LAPACK linear algebra routines, fast Fourier transforms, vectorized math functions, random number generation functions, and other functionality. Here I mainly use the BLAS and Sparse BLAS routines. Since matrix A contains many zeros, the Sparse BLAS routines give a significant speed-up for matrix multiplication. The whole procedure can be divided into three parts, which are explained in detail below. For further details, please refer to my code.

Figure 5 C Implementation design flow

3.1.1 MATLAB and C interface

I use MATLAB programs as the input generator and output monitor. The MATLAB program generates the matrix A and the vector B; after the reconstruction is finished, it takes the output of the C code, computes the SNR and reconstruction time, and plots the reconstructed image. To achieve this, I use a mex function as the interface between the MATLAB program and the C program. Details about mex functions can be found on the MathWorks website and are not explained here.

For SIRT, the parameters I pass into the mex function are:

A and A^T, both converted into sparse format in MATLAB. I pass both A and A^T because computing the transpose in C is very time consuming.
V and W. Since both are diagonal matrices, I pre-compute only their diagonal elements in MATLAB, using their definitions, and pass those into the mex function.
B and X0, in their original format. X0 is the initial value of X, which is usually all zeros.
K, the number of iterations.

Compute and conversion to CSR format

To use the Sparse BLAS routines, several arrays need to be computed before the iteration. These three arrays are values, columns and rowindex, which together form the CSR (Compressed Sparse Row) format. They are defined as follows:

values: A real or complex array that contains the non-zero elements of A. Values of the non-zero elements of A are mapped into the values array using the row-major storage mapping.
columns: Element i of the integer array columns is the number of the column in A that contains the i-th value in the values array.
rowindex: Element j of the integer array rowindex gives the index of the element in the values array that is the first non-zero element in row j.

The length of the values and columns arrays is equal to the number of non-zero elements in the matrix. As the rowindex array gives the location of the first non-zero element within a row, and the non-zero elements are stored consecutively, the number of non-zero elements in the i-th row is equal to the difference of rowindex(i) and rowindex(i+1). To have this relationship hold for the last row of the matrix, an additional entry (dummy entry) is added to the end of rowindex. Its value is equal to the number of non-zero elements. This makes the total length of the rowindex array one larger than the number of rows in the matrix.

I obtain these arrays in different ways for the different matrices.
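As an illustration of the CSR layout described above, the following C sketch builds the three arrays from a small dense matrix and uses them for a matrix-vector product. It is only a sketch under stated assumptions: indexing is zero-based, the `Csr` struct and the function names are hypothetical, and none of this is the MKL or mex API.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    size_t rows, cols, nnz;
    double *values;    /* non-zero entries, stored row by row          */
    int    *columns;   /* column index of each entry in values         */
    int    *rowindex;  /* rows + 1 entries; the last one equals nnz    */
} Csr;

/* Build a zero-based CSR copy of a dense row-major matrix.
 * Returns 0 on success, -1 on allocation failure. */
int dense_to_csr(const double *dense, size_t rows, size_t cols, Csr *out)
{
    assert(dense != NULL && out != NULL);
    size_t nnz = 0;
    for (size_t k = 0; k < rows * cols; ++k)
        if (dense[k] != 0.0) ++nnz;

    out->rows = rows;
    out->cols = cols;
    out->nnz = nnz;
    out->values = malloc((nnz ? nnz : 1) * sizeof *out->values);
    out->columns = malloc((nnz ? nnz : 1) * sizeof *out->columns);
    out->rowindex = malloc((rows + 1) * sizeof *out->rowindex);
    if (!out->values || !out->columns || !out->rowindex) {
        free(out->values); free(out->columns); free(out->rowindex);
        return -1;
    }

    size_t p = 0;                              /* next free slot in values */
    for (size_t i = 0; i < rows; ++i) {
        out->rowindex[i] = (int)p;             /* first non-zero of row i */
        for (size_t j = 0; j < cols; ++j) {
            double a = dense[i * cols + j];
            if (a != 0.0) {
                out->values[p] = a;
                out->columns[p] = (int)j;
                ++p;
            }
        }
    }
    out->rowindex[rows] = (int)nnz;            /* the extra dummy entry */
    return 0;
}

/* y = A * x, visiting only the stored non-zero elements. */
void csr_matvec(const Csr *a, const double *x, double *y)
{
    for (size_t i = 0; i < a->rows; ++i) {
        double sum = 0.0;
        for (int p = a->rowindex[i]; p < a->rowindex[i + 1]; ++p)
            sum += a->values[p] * x[a->columns[p]];
        y[i] = sum;
    }
}

void csr_free(Csr *a)
{
    free(a->values);
    free(a->columns);
    free(a->rowindex);
}
```

The inner loop of `csr_matvec` runs from `rowindex[i]` to `rowindex[i + 1]`, which is why the extra entry at the end of rowindex is needed for the last row. An empty row simply has two equal consecutive rowindex entries.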

For A and A^T: the matrices I pass into the mex function are already in sparse format, so I use the mxGetPr(), mxGetJc(), mxGetIr() and mxGetNzmax() functions to get the corresponding arrays. These routines all operate on sparse matrices; their definitions can be found on the MathWorks website. Here, mxGetPr() returns the values array, mxGetJc() returns the rowindex array, and mxGetIr() returns the columns array. mxGetNzmax() is used to find the number of non-zero elements in A and A^T. The BLAS routine for the CSR format needs two further arrays, called pointerb and pointere. Their definitions are:

pointerb(i) = rowindex(i) for i=1,..;    (7)
pointere(i) = rowindex(i+1) for i=1,..;    (8)

The rowindex arrays are converted to pointerb and pointere according to these rules. Also, the Jc and Ir arrays are defined the opposite way round from rowindex and columns, because MATLAB stores sparse matrices column by column. So I use the Jc and Ir of A as the rowindex and columns of A^T, and the Jc and Ir of A^T as the rowindex and columns of A.

For V and W: both are diagonal matrices. The original MATLAB code builds them with the diag() and sparse() operations, which take a very long time. So in my C implementation, I pass only their diagonal elements to the mex function. Passing the whole matrices would take longer because of the redundant zeros, and would be a huge waste of memory. In the C program, I wrote

a function which computes the values, columns and rowindex arrays of V and W according to their definitions. Since the non-zero elements exist only along the diagonal, these arrays are very easy to derive.

Iterative reconstruction using the sparse routine

Once all the arrays are available, the iteration can start. The update equations for SIRT and SART are:

$$x^{(k+1)} = x^{(k)} - \lambda V A^T W (A x^{(k)} - b) \qquad (9)$$

$$x^{(k+1)} = x^{(k)} - \lambda V_j A_j^T W_j (A_j x^{(k)} - b_j) \qquad (10)$$

where the subscript $j$ in (10) selects the sub-matrices for the $j$-th projection angle. I carry out the multiplication one matrix at a time, using different BLAS sparse routines:

mkl_dcsrmv() performs general matrix-vector multiplication. It computes the matrix-vector product of a sparse matrix stored in the CSR format. I use it to compute A*xi and A'*(W*(A*xi-b)). The definition can be found in the MKL manual.
mkl_cspblas_dcsrgemv() computes the matrix-vector product of a sparse general matrix stored in the CSR format (3-array variation) with zero-based indexing. The matrix is an m by m matrix. I use this routine for the W and V multiplications. The definition can also be found in the MKL manual.
cblas_daxpy() computes a vector-scalar product and adds the result to a vector. I use this routine for all the addition and subtraction operations.

Difference between SIRT and SART implementation

For SIRT and SART, most of the implementation is the same. The only difference is that SART updates the image after all rays in a single projection angle are processed, so every time the angle changes, new A(:,:,j), W(:,j) and V(:,j) must be fed to the program. M = D*N_theta, so A, B, W and V are divided into sub-matrices, one per projection angle. In my program, I do not pass the whole matrix to the C program and partition it there. This is because passing a huge matrix takes a long time, and the mxGetJc() and mxGetIr() functions do not work on a sub-matrix inside the C program; they apply only to the mxArray type from the mex function interface. So I use MATLAB to partition the matrix and pass the parts to the C program one angle at a time. This saves time and memory space, but the context switching cost is larger.

Type                           Method 1    Method 2
N=50,  N_theta=24     loop     0           0
                      total
N=100, N_theta=50     loop
                      total
N=150, N_theta=100    loop
                      total

The table above compares the all-pass method with my method. With my method the total time is much larger, due to the context switching cost and the MATLAB operations, but the loop time is a little better. For ease of implementation, one can choose this method.

Also, the program dynamically allocates the memory for all matrices and vectors using the mkl_malloc() and mkl_free() routines, in order to save space. If the problem size is huge, it is necessary to free memory as soon as a variable is no longer needed; otherwise, in very large cases the program can crash for lack of memory.

This concludes the detailed explanation of the C implementation. For reference, please refer to sirt.c and sart.c.

3.2 GPU implementation (CUDA)

The second implementation is the GPU implementation using NVIDIA CUDA (Compute Unified Device Architecture). CUDA comes with a software environment that allows developers to use C as a high-level programming language. It is designed to maintain a low learning curve for programmers familiar with standard programming languages such as C. At its core are three key abstractions, a hierarchy of thread groups, shared memories, and barrier synchronization, that are simply exposed to the programmer as a minimal set of language extensions. Therefore most of my CUDA code follows a


similar style to the C program, except for three key differences, which I explain below.

Figure 6 CUDA Implementation design flow

Memory allocation, copy and sharing

Since all the parallel computing is done on the GPU, all the data must be copied to GPU memory. In the mex function, the data comes from MATLAB and is stored in host (CPU) memory. All the input matrices must be copied from the host to device (GPU) memory; I use the cudaMemcpy() routine for this. All the other matrices, such as V and W, are created in GPU memory using the cudaMalloc() routine, so that they can be computed on the GPU. At the end of the program, the result is copied from the device back to the host, again using cudaMemcpy().

Parallel global kernel

Since the GPU is good at parallel computing, CUDA C extends C by allowing the programmer to define C functions, called kernels. When called, they are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions. A kernel is defined using the __global__ declaration specifier, and the number of CUDA threads that execute that kernel for a given kernel call is specified using the new <<<...>>> execution configuration syntax. Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.

Since there are multiple processors inside the GPU, they can do the computation in parallel. The whole GPU can be modeled as several grids; inside each grid there are several blocks, the number of which equals the number of processors in the system, and inside each block there are several threads which can do the computation in parallel. So the ID of a certain thread is:

idx = blockDim.x * blockIdx.x + threadIdx.x    (11)

Figure 7 illustrates this. blockDim is the number of threads in a block, blockIdx is the ID of a block within the grid, and threadIdx is the ID of a thread inside that block. In my implementation, the block size is no more than 2 and the grid size is equal to the number of threads needed divided by the block size.
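The index formula (11) and the grid-size rule can be checked on the CPU with a serial stand-in for a kernel launch. The sketch below is plain C, not CUDA: the `Dim` struct only mimics the built-in variables, the kernel body (clamping negative entries to zero) is a hypothetical element-wise operation chosen for illustration, and rounding the grid size up when the element count is not a multiple of the block size is an assumption of this example.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for CUDA's built-in blockDim / blockIdx / threadIdx variables. */
typedef struct { unsigned x; } Dim;

/* Body of a hypothetical element-wise kernel: one "thread" per element. */
static void clamp_kernel(Dim blockDim, Dim blockIdx, Dim threadIdx,
                         double *x, size_t n)
{
    size_t idx = (size_t)blockDim.x * blockIdx.x + threadIdx.x; /* Eq. (11) */
    if (idx < n && x[idx] < 0.0)   /* guard: the last block may overshoot n */
        x[idx] = 0.0;
}

/* Number of blocks such that grid * block_size >= n (rounded up). */
size_t grid_size(size_t n, unsigned block_size)
{
    assert(block_size > 0);
    return (n + block_size - 1) / block_size;
}

/* Serial emulation of launching the kernel with grid_size(n, block_size)
 * blocks of block_size threads each. */
void launch_clamp(double *x, size_t n, unsigned block_size)
{
    Dim blockDim = { block_size };
    size_t grid = grid_size(n, block_size);
    for (unsigned b = 0; b < grid; ++b)
        for (unsigned t = 0; t < block_size; ++t) {
            Dim blockIdx = { b }, threadIdx = { t };
            clamp_kernel(blockDim, blockIdx, threadIdx, x, n);
        }
}
```

On a GPU the two loops disappear and every (block, thread) pair runs concurrently. The `idx < n` guard matters whenever the last block has more threads than remaining elements.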

Figure 7 GPU thread hierarchy

I use kernels to compute the V and W matrices and the nonneg() function. For V and W, genv() and genw() compute each element along the diagonal using different threads in parallel. Every thread is in charge of one row or column of A or A^T. So in total N^2 threads are needed for W and M threads are needed for V. For the non-negativity function, N^2 threads are used to set the negative values in X to zero in parallel. For details, refer to the sirt_cuda.cu and sart_cuda.cu files.

Thread synchronization

This problem, though not important in my project, is worth mentioning. Since different threads work in parallel, data dependency is an important problem. For example, if the next row's data depends on the previous row's result, then the rows must be computed in order.

I use the __syncthreads() routine to achieve this. Calling this routine at the end of each kernel lets the threads of a block synchronize with each other. Without it, the result is sometimes incorrect.

The CUSPARSE and CUBLAS libraries are used for the matrix operations. Most of the functions are similar to those of Intel MKL. I use cusparseScsrmv() for the matrix multiplication and cublasSaxpy() for the addition and subtraction. Please refer to the CUSPARSE user guide and CUBLAS user guide for detailed reference.

The CUSPARSE library has two functions which convert a matrix to CSR format and produce all the auxiliary arrays, i.e. rowindex, values and columns. So, unlike the C implementation (where I pass the sparse format of A and A^T and also compute V and W in MATLAB), in the CUDA implementation I pass only the full matrices A and A^T. Kernels compute V and W; then cusparseSnnz() gives the number of non-zero elements and cusparseSdense2csr() produces all the auxiliary arrays: values, columns and rowindex. The parameters passed to the CUDA program are therefore much simpler.

To use CUSPARSE and CUBLAS, the libraries must be initialized at the beginning of the program and destroyed at the end. cusparseCreate() and cusparseDestroy() are the routines to use. A matrix descriptor must also be created and destroyed, using the cusparseCreateMatDescr() and

cusparseDestroyMatDescr() routines. All of them can be found in the CUSPARSE helper function chapter.

This concludes the explanation of my CUDA implementation. The details can be found in the sirt_cuda.cu and sart_cuda.cu code.


4. Result and Conclusion

In this project, I compare the reconstruction speed of the MATLAB, C and CUDA implementations, for both the SIRT and SART algorithms. The results are presented below.

C vs. MATLAB

For this comparison, I set N=200 and N_theta=120, which means the size of A is 34440*40000 and the resolution of the original picture is 200*200 pixels.

Figure 8 Original result, SIRT, SART

SIRT (N=200, N_theta=120)          C          MATLAB     Speed Up
Reconstruction time (only loop)    390 ms     750 ms     192.3%
Reconstruction time (all)          738 ms     775 ms     105%

SART (N=200, N_theta=120)          C          MATLAB     Speed Up
Reconstruction time (only loop)    180 ms        ms      10902%
Reconstruction time (all)             ms         ms      151.2%

We can see that, if we consider the total time, the C code is almost the same speed as MATLAB, since the initialization time accounts for a large part of the C code's run time. If we consider only the loop time, then C is much faster, especially for SART, whose initialization time is longer than that of SIRT due to the matrix partition.

C vs. CUDA

For this comparison, I set N=100, N_theta=48 for SIRT and N=140, N_theta=60 for SART. If N and N_theta are set too large, the GPU program fails easily. This is because I pass the whole matrix A to CUDA and the GPU cannot allocate that much memory for it. This will be fixed in the future.

SIRT (N=100, N_theta=48)           CUDA       C          Speed Up
Reconstruction time (only loop)    10 ms      30 ms      300%

SART (N=140, N_theta=60)           CUDA       C          Speed Up
Reconstruction time (only loop)    40 ms      100 ms     250%
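The report does not define how the "Speed Up" column is computed. The fully legible rows of the tables are consistent with the ratio of the slower time to the faster time, expressed as a percentage; this reading is inferred from the numbers and should be taken as an assumption:

```latex
\[
\text{Speed Up} = \frac{t_{\text{reference}}}{t_{\text{new}}} \times 100\%
\]
\[
\frac{750\ \text{ms}}{390\ \text{ms}} \approx 1.923 = 192.3\%, \qquad
\frac{775\ \text{ms}}{738\ \text{ms}} \approx 1.050 = 105\%, \qquad
\frac{30\ \text{ms}}{10\ \text{ms}} = 3 = 300\%, \qquad
\frac{100\ \text{ms}}{40\ \text{ms}} = 2.5 = 250\%
\]
```

Under this reading, a value of 192.3% means the C loop runs about 1.92 times as fast as the MATLAB loop, that is, it finishes in about 52% of the MATLAB time.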

For CUDA, I measure only the loop time. If the total time were measured, the memory copies would take a very long time, which is unavoidable; the same is true for C. So I compare only the loop time between C and CUDA. We can see that CUDA is faster than C, since the GPU is good at parallel computing. Because the problem size is not very large, the advantage is not so obvious; with larger matrices, the difference should be bigger.

Conclusion

In this project, C and CUDA implementations of the SIRT and SART algorithms were designed. The C implementation uses Intel MKL and the CUDA implementation uses the CUSPARSE library. For SIRT, C is around 200% faster than MATLAB and CUDA is 300% faster than C. For SART, C is around 10000% faster than MATLAB and CUDA is around 250% faster than C. All these figures refer to the loop time, not the total time. If the total time is measured, C and CUDA have no obvious advantage over MATLAB, or even perform worse.

Due to its parallel computing ability, CUDA is the fastest of the three, but its complexity for the programmer is also the highest: it needs a large amount of memory allocation, data movement back and forth, and handle setup. MATLAB code is the easiest to write, since it does not need any setup or memory operations, but it is the slowest. Both C and CUDA use sparse matrix routines, so they can achieve

shorter times if the input matrices contain many zeros, but a large amount of auxiliary array preparation must be done before the real computation. Also, the CUDA implementation still has some stability problems which occasionally cause the GPU program to fail. This may be fixed in the future. In summary:

Complexity: CUDA > C > MATLAB
Stability: MATLAB = C > CUDA
Speed: CUDA > C > MATLAB

In my opinion, the C implementation achieves the best balance between speed, complexity and stability, and should be the best choice for implementing 3D tomography reconstruction.

5. Reference

1. Simultaneous Algebraic Reconstruction Technique for Electron Tomography using OpenCL, Beata Turono, June 9
2. 3D Simultaneous Algebraic Reconstruction Technique for Cone-Beam Projections, Wojciech Chlewicki, Patras
3. Intel Math Kernel Library Reference Manual, MKL 11.0 Update 1
4. CUDA_CUSPARSE_Users_Guide, v5.0, October
5. CUDA_CUBLAS_Users_Guide, v5.0, October
6. CUDA C PROGRAMMING GUIDE, v5.0, October 2012
