University of California at Los Angeles. Electrical Engineering. Iterative Reconstruction for 3D Tomography


University of California at Los Angeles
Electrical Engineering

Iterative Reconstruction for 3D Tomography

Comprehensive Project Report for Master's Degree, Non-thesis Option
by Yinan Dong
Advisor: Prof. Dejan Markovic
Winter 2014

TABLE OF CONTENTS

Abstract
1. Background
   a) 3D tomography
   b) Iterative reconstruction for 3D tomography
      i. SIRT
      ii. SART
2. CPU and GPU architectures
   a) CPU architecture
   b) GPU architecture
3. Detailed implementation
   a) Intel MKL
   b) CUDA
4. Result and Conclusion
5. Reference

Abstract

In this project, I designed C and CUDA implementations of the SIRT and SART algorithms, which are used for reconstruction in 3D tomography, and compared their speed against a MATLAB implementation. The C implementation uses the Intel MKL and the CUDA implementation uses the CUSPARSE library. Measured on the iteration loop alone, the C program runs at about 200% of the MATLAB speed (roughly twice as fast) and the CUDA program runs at about 250% of the C speed on average. This project also analyzes the complexity of all three implementations and gives some suggestions for future work.

1. Background

1.1 3D tomography

Tomography refers to imaging by sections or sectioning, through the use of any kind of penetrating wave. A device used in tomography is called a tomograph, while the image produced is called a tomogram. The computed tomography (CT) scanner, the most common application of tomography, was invented by Sir Godfrey Hounsfield in 1972 and made an exceptional contribution to medicine.

In conventional medical X-ray tomography, clinical staff make a sectional image through a body by moving an X-ray source and the film in opposite directions during the exposure. By modifying the direction and extent of the movement, operators can select different focal planes which contain the structures of interest. It was feasible in principle, based on a very large number of measurements, to reconstruct a cross-sectional slice of a patient with fairly high accuracy. The image quality of the slices has improved significantly over the years thanks to advances in computing power and reconstruction methods.

Several algorithms for image reconstruction have been developed, and the techniques can be divided into two main categories, analytical and iterative. Analytical methods capitalize on the Fourier Slice Theorem.

Figure 1 Parallel projection. Each projection is made up of the set of line integrals through the object.

Figure 1 illustrates parallel projection, which is the simplest scanning geometry to visualize. The projection of an object at a given angle is made up of a set of line integrals. In X-ray CT, the line integral represents the total attenuation of the beam of X-rays as it travels in a straight line through the object. The resulting image is a 3D model of the attenuation coefficient. We consider the data to be collected as a series of parallel rays, at position $r$, across a projection at angle $\theta$. This is repeated for various angles. Using the coordinate system of Figure 1, the value of $r$ onto which the point $(x, y)$ will be projected at angle $\theta$ is given by:

$$x\cos\theta + y\sin\theta = r \quad (1)$$

So the line integral can be rewritten as

$$p_\theta(r) = \int\!\!\int f(x, y)\,\delta(x\cos\theta + y\sin\theta - r)\,dx\,dy \quad (2)$$

where $f(x, y)$ represents the attenuation coefficient at the point $(x, y)$. This function is known as the Radon transform (or sinogram) of the object.

Figure 2 Shepp-Logan phantom and its sinogram.

Figure 2 shows the Shepp-Logan phantom, which is the standard test image for tomographic reconstruction.

Iterative methods solve the reconstruction problem using algebraic reconstruction methods. These are the methods I use in my project. Transform methods perform well when a complete set of projections uniformly distributed over 180 or 360 degrees is available, but they are very sensitive to noise. Algebraic reconstruction techniques were developed to address this. Unlike transform methods, algebraic methods do not require a complete set of uniformly distributed projections for precise reconstruction, and they are also more stable under noisy conditions. Furthermore, they allow a priori information to be used in the reconstruction process. However, algebraic techniques are computationally very demanding. With the development of parallel programming and high-performance processors, algebraic methods now present a viable alternative.

The next section introduces the basic idea of iterative reconstruction for 3D tomography, followed by the two methods I implemented in my project, SIRT and SART.

1.2 Iterative reconstruction for 3D tomography

The iterative reconstruction algorithm formulates the reconstruction problem as a linear inverse problem:

$$Ax = b \quad (3)$$

where $x$ is a column vector of length $N^2$ representing the pixels of the original image, $A$ is an $M \times N^2$ matrix representing the data acquisition process, and $b$ is a column vector of length $M$ representing the measured projection data. We want to find a solution such that:

$$x = A^{-1}_{\text{left}}\, b \quad (4)$$

where $A^{-1}_{\text{left}}$ is a left inverse of $A$. Solving this equation directly by inverting the matrix $A$ is very time and resource consuming if $A$ is very large. If $N$ and $M$ were small, it would be possible to use conventional matrix theory methods to invert the system of equations in (3). In practice $N$ may be large, and in most cases the number of ray-sums $M$ has the same order of magnitude. For such values of $M$ and $N$, the size of the matrix precludes any possibility of direct matrix inversion.

An iterative method is used to solve the equation instead. Let $x^{(k)}$ denote the $k$-th estimate of the reconstruction. Then:

$$x^{(k+1)} = x^{(k)} - \lambda A^T \left(A x^{(k)} - b\right) \quad (5)$$

where the relaxation factor $\lambda$ is a scalar. With a properly chosen relaxation factor, $x^{(k)}$ converges to the solution after a number of iterations. Each iteration involves only matrix-vector multiplication and addition, which is much less time consuming than matrix inversion.

To normalize the update, we introduce diagonal matrices $V$ and $W$:

$$x^{(k+1)} = x^{(k)} - \lambda V A^T W \left(A x^{(k)} - b\right)$$

$V$ is the diagonal matrix of the inverse of the row sums and $W$ is the diagonal matrix of the inverse of the column sums:

$$V_{i,i} = \frac{1}{\sum_{j=1}^{N^2} a_{i,j}}, \qquad W_{j,j} = \frac{1}{\sum_{i=1}^{M} a_{i,j}} \quad (6)$$

This normalization keeps the computation from overflowing.

There are several iterative reconstruction methods. The three main ones are ART (Algebraic Reconstruction Technique), SIRT (Simultaneous Iterative Reconstruction Technique), and SART (Simultaneous Algebraic Reconstruction Technique).

ART updates the image after each ray is processed. ART reconstructions usually suffer from salt-and-pepper noise. The result is that the computed ray-sums are usually poor approximations to the corresponding measured ray-sums.

SIRT updates the image after all rays are processed. It converges more slowly than ART because it needs more iterations, but it leads to better-looking images.

SART updates the image after all rays in a single projection angle are processed. SART combines the best of ART and SIRT. This technique yields reconstructions of good quality and numerical accuracy in only one iteration.

In my project, I implemented the SIRT and SART algorithms on the CPU and GPU platforms. The next chapter briefly introduces the CPU and GPU architectures and explains the main differences between them.
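The difference between the SIRT and SART update schedules can be sketched in plain C on a small dense matrix. This is only an illustration under stated assumptions, not the report's MKL code: the function names are hypothetical, the matrix is dense and row-major, the residual is scaled by the inverse row sums and the back-projection by the inverse column sums (the dimensionally consistent arrangement, so how these map onto the report's V and W is an assumption), and a zero sum is mapped to a zero weight.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Inverse row and column sums of a dense row-major m-by-n matrix.
 * Assumption of this sketch: a zero sum gives a zero weight. */
void inverse_sums(const double *a, size_t m, size_t n,
                  double *inv_row, double *inv_col)
{
    for (size_t j = 0; j < n; ++j)
        inv_col[j] = 0.0;
    for (size_t i = 0; i < m; ++i) {
        double row = 0.0;
        for (size_t j = 0; j < n; ++j) {
            row += a[i * n + j];
            inv_col[j] += a[i * n + j];   /* holds the column sum for now */
        }
        inv_row[i] = (row != 0.0) ? 1.0 / row : 0.0;
    }
    for (size_t j = 0; j < n; ++j)
        inv_col[j] = (inv_col[j] != 0.0) ? 1.0 / inv_col[j] : 0.0;
}

/* One simultaneous update over all rows of a:
 *   x <- x - lambda * C * A^T * R * (A x - b)
 * with R = diag(inverse row sums), C = diag(inverse column sums).
 * Returns 0 on success, -1 if memory allocation fails. */
int simultaneous_update(const double *a, size_t m, size_t n,
                        const double *b, double lambda, double *x)
{
    assert(m > 0 && n > 0);
    double *inv_row = malloc(m * sizeof *inv_row);
    double *inv_col = malloc(n * sizeof *inv_col);
    double *r = malloc(m * sizeof *r);
    if (!inv_row || !inv_col || !r) {
        free(inv_row); free(inv_col); free(r);
        return -1;
    }
    inverse_sums(a, m, n, inv_row, inv_col);
    for (size_t i = 0; i < m; ++i) {          /* r = R (A x - b) */
        double ax = 0.0;
        for (size_t j = 0; j < n; ++j)
            ax += a[i * n + j] * x[j];
        r[i] = inv_row[i] * (ax - b[i]);
    }
    for (size_t j = 0; j < n; ++j) {          /* x -= lambda C A^T r */
        double back = 0.0;
        for (size_t i = 0; i < m; ++i)
            back += a[i * n + j] * r[i];
        x[j] -= lambda * inv_col[j] * back;
    }
    free(inv_row); free(inv_col); free(r);
    return 0;
}

/* SIRT: the image is updated once, after all m rays are processed. */
int sirt_iteration(const double *a, size_t m, size_t n,
                   const double *b, double lambda, double *x)
{
    return simultaneous_update(a, m, n, b, lambda, x);
}

/* SART: the image is updated after the rays of each projection angle.
 * Rows are assumed to be grouped by angle, rays_per_angle rows each. */
int sart_iteration(const double *a, size_t rays_per_angle, size_t n_angles,
                   size_t n, const double *b, double lambda, double *x)
{
    for (size_t t = 0; t < n_angles; ++t) {
        const double *a_t = a + t * rays_per_angle * n;
        const double *b_t = b + t * rays_per_angle;
        if (simultaneous_update(a_t, rays_per_angle, n, b_t, lambda, x) != 0)
            return -1;
    }
    return 0;
}
```

Calling `sirt_iteration` or `sart_iteration` repeatedly on the same `x`, starting from all zeros, plays the role of the outer iteration loop. The only structural difference between the two is how many rows enter each update.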

2. CPU and GPU architectures

2.1 CPU Architecture

Most of us are familiar with the CPU. Figure 3 illustrates a pipelined MIPS CPU. It has instruction fetch, instruction decode, execution, memory access, and result write-back stages. This is the basic architecture of most modern microprocessors.

Figure 3 Modern microprocessor architecture

Attempts to achieve scalar and better performance have resulted in a variety of design methodologies that cause the CPU to behave less linearly and more in parallel. When referring to parallelism in CPUs, two terms are generally used to classify these design techniques. Instruction level parallelism (ILP) seeks to increase the rate at which instructions are executed within a CPU (that is, to increase the utilization of on-die execution resources), and thread level parallelism (TLP) aims to increase the number of threads (effectively individual programs) that a CPU can execute simultaneously.

Despite these parallelism techniques, the CPU is still a processor that executes a program largely serially. Most of a CPU is dedicated to data scheduling, branch prediction, and other control units. The execution units occupy only a small part of the chip. The CPU is therefore good at branch-intensive operations, random access, memory-intensive operations, and other programs which require complex flow control and branch decisions. This can also be seen in Figure 3: only the ALU performs computation, and the other parts are control and multiplexing logic.

2.2 GPU Architecture

Driven by the insatiable market demand for real-time, high-definition 3D graphics, the Graphics Processing Unit has evolved into a highly parallel, multithreaded, manycore processor with tremendous computational horsepower and very high memory bandwidth.

Figure 4 A typical GPU architecture.

Figure 4 gives an architecture overview of an NVIDIA G80 GPU. It has 16 multiprocessor (MP) blocks. Each MP block has:
- 8 streaming processors (IEEE 754 single-precision floating-point compliant)
- 16 KB shared memory
- 64 KB constant cache
- 8 KB texture cache

Each processor can access all of the memory at 86 GB/s, but with different latencies.

This simple architecture shows the reason behind the discrepancy in floating-point capability between the CPU and the GPU: the GPU is specialized for compute-intensive, highly parallel computation and is therefore designed so that more transistors are devoted to data processing rather than data caching and flow control. GPUs also have significantly faster and more advanced memory interfaces, as they need to move far more data than CPUs.

More specifically, the GPU is especially well suited to problems that can be expressed as data-parallel computations, in which the same program is executed on many data elements in parallel with high arithmetic intensity (the ratio of arithmetic operations to memory operations). Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control, and because it is executed on many data elements and has high arithmetic intensity, the memory access latency can be hidden with calculations instead of big data caches.

This chapter has explained the basic differences between the CPU and GPU architectures and their respective suitable applications. The next chapter describes the detailed implementation of 3D tomography reconstruction using Intel MKL (CPU) and CUDA with CUSPARSE (GPU).

3. Algorithm implementation

3.1 C implementation (Intel MKL)

First, I introduce the CPU implementation of SIRT and SART, which uses the Intel Math Kernel Library (Intel MKL). Intel MKL improves performance with math routines for software applications that solve large computational problems. It provides BLAS and LAPACK linear algebra routines, fast Fourier transforms, vectorized math functions, random number generation functions, and other functionality. Here I mainly use the BLAS and Sparse BLAS routines. Since matrix A contains many zeros, the Sparse BLAS routines give a significant speed-up for matrix multiplication. The whole procedure can be divided into three parts, which are explained in detail below. For further details, please refer to my code.

Figure 5 C implementation design flow

3.1.1 MATLAB and C interface

MATLAB programs serve as the input generator and output monitor. The MATLAB program generates the matrix A and the vector b. After the reconstruction is finished, it takes the output of the C code, computes the SNR and the reconstruction time, and plots the reconstructed image. To achieve this, I use a mex function as the interface between the MATLAB program and the C program. Details about mex functions can be found on the MathWorks website and are not repeated here.

For SIRT, the parameters I pass into the mex function are:
- A and A^T, both converted into sparse format in MATLAB. I pass both because transposing in C is very time consuming.
- V and W. Since both are diagonal matrices, I pre-compute only their diagonal elements in MATLAB, using the definition in the slides, and pass those to the mex function.
- b and x0, in their original format. x0 is the initial value of x, which is usually all zeros.
- K, the number of iterations.

3.1.2 Compute and conversion to CSR format

To use the Sparse BLAS routines, several arrays need to be computed before the iteration. The three arrays values, columns, and rowindex together form the CSR (Compressed Sparse Row) format. They are defined as follows:
- values: a real or complex array that contains the non-zero elements of A, mapped into the array in row-major order.
- columns: element i of the integer array columns is the number of the column in A that contains the i-th value in the values array.
- rowindex: element j of the integer array rowindex gives the index of the element in the values array that is the first non-zero element in row j.

The length of the values and columns arrays is equal to the number of non-zero elements in the matrix. Because the rowindex array gives the location of the first non-zero element within a row, and the non-zero elements are stored consecutively, the number of non-zero elements in the i-th row is equal to the difference between rowindex(i+1) and rowindex(i). For this relationship to hold for the last row of the matrix, an additional (dummy) entry is added to the end of rowindex. Its value is equal to the number of non-zero elements. This makes the total length of the rowindex array one larger than the number of rows in the matrix.

These arrays are obtained differently for A and A^T than for V and W.
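The three-array CSR layout described above can be sketched in plain C as follows. This is an illustration only, not MKL code: the struct and function names are hypothetical, and zero-based indexing is assumed (so the dummy last entry of `row_index` equals the number of non-zeros).

```c
#include <assert.h>
#include <stddef.h>

/* Zero-based three-array CSR view of a sparse matrix.
 * Field names are illustrative, following the text. */
typedef struct {
    size_t rows;
    const double *values;   /* non-zero entries, stored row by row      */
    const int *columns;     /* column of each entry in values           */
    const int *row_index;   /* rows + 1 entries; last = non-zero count  */
} csr_matrix;

/* Number of stored entries in row i: row_index[i + 1] - row_index[i]. */
int csr_row_nnz(const csr_matrix *m, size_t i)
{
    assert(i < m->rows);
    return m->row_index[i + 1] - m->row_index[i];
}

/* y = A * x, visiting only the stored non-zeros of each row. */
void csr_matvec(const csr_matrix *m, const double *x, double *y)
{
    for (size_t i = 0; i < m->rows; ++i) {
        double sum = 0.0;
        for (int k = m->row_index[i]; k < m->row_index[i + 1]; ++k)
            sum += m->values[k] * x[m->columns[k]];
        y[i] = sum;
    }
}
```

For example, the 3 by 3 matrix with rows (1, 0, 2), (0, 0, 3), (4, 5, 0) is stored as values = {1, 2, 3, 4, 5}, columns = {0, 2, 2, 0, 1}, and row_index = {0, 2, 3, 5}. The work in `csr_matvec` is proportional to the number of non-zeros rather than to the full matrix size, which is where the speed-up over a dense product comes from.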

For A and A^T, the matrices passed into the mex function are already in sparse format, so I use the mxGetPr(), mxGetJc(), mxGetIr(), and mxGetNzmax() functions to get the corresponding arrays. These routines all operate on sparse-format matrices, and their definitions can be found on the MathWorks website. Here, mxGetPr() returns the values array, mxGetJc() returns the rowindex array, and mxGetIr() returns the columns array. mxGetNzmax() is used to find the number of non-zero elements in A and A^T.

The BLAS routine for the CSR format uses two other arrays, called pointerb and pointere. Their definitions are:

pointerb(i) = rowindex(i) for i = 1, ...     (7)
pointere(i) = rowindex(i+1) for i = 1, ...   (8)

The rowindex arrays are converted to pointerb and pointere according to these rules. Also, the Jc and Ir arrays are defined the opposite way round from the rowindex and columns arrays, because they index by column rather than by row. So I use the Jc and Ir of A as the rowindex and columns of A^T, and the Jc and Ir of A^T as the rowindex and columns of A.

V and W are both diagonal matrices. The original MATLAB code builds them with the diag() and sparse() operations, but these two operations take a very long time. In my C implementation, I pass only their diagonal elements to the mex function. Passing the whole matrix would take longer because of the redundant zeros and would be a huge waste of memory. In the C program, I wrote a function that computes the values, columns, and rowindex arrays of V and W according to their definition. Since the non-zero elements exist only along the diagonal, these arrays are very easy to derive.

3.1.3 Iterative reconstruction using the sparse routine

Once all the arrays are available, the iteration can start. The update equations for SIRT and SART are:

$$x^{(k+1)} = x^{(k)} - \lambda V A^T W \left(A x^{(k)} - b\right) \quad (9)$$

$$x_{j+1} = x_j - \lambda V_j A_j^T W_j \left(A_j x_j - b_j\right) \quad (10)$$

where the subscript $j$ selects the part of each matrix that belongs to the $j$-th projection angle. I carry out the multiplications one matrix at a time using different Sparse BLAS routines:
- mkl_dcsrmv() computes the matrix-vector product of a sparse general matrix stored in CSR format. I use it to compute A*xi and A'*(W*(A*xi-b)). The definition can be found in the MKL manual.
- mkl_cspblas_dcsrgemv() computes the matrix-vector product of a sparse general matrix stored in CSR format (3-array variation) with zero-based indexing. The matrix must be an m by m matrix, so I use this routine for the multiplications by W and V. The definition can also be found in the MKL manual.
- cblas_daxpy() computes a vector-scalar product and adds the result to a vector. I use this routine for all the additions and subtractions.

3.1.4 Difference between SIRT and SART implementation

SIRT and SART are the same in most respects. The only difference is that SART updates the image after all rays in a single projection angle are processed, so every time the angle changes, a new A(:,:,j), W(:,j), and V(:,j) must be fed to the program. Since M = D*N_theta, the matrices A, b, W, and V are partitioned into sub-matrices, one per projection angle.

My program does not pass the whole matrix to the C program and partition it there. Passing a huge matrix takes a long time, and the mxGetJc() and mxGetIr() functions do not work on a sub-matrix inside the C program; they apply only to the mxArray type received through the mex function interface. Instead, I partition the matrix in MATLAB and pass one part to the C program for each angle. This saves time and memory, but the context-switching cost between MATLAB and C is larger.

Type                        Method 1   Method 2
N=50,  N_theta=24    loop   0          0
                     total
N=100, N_theta=50    loop
                     total
N=150, N_theta=100   loop
                     total

The table above compares the method that passes the whole matrix with my method. The total time is much larger because of the context-switching cost and the MATLAB operations, but the loop time is a little better. For ease of implementation, one can choose this method.

The program also dynamically allocates the memory for all matrices and vectors using the mkl_malloc() and mkl_free() routines. The reason is to save space. If the problem size is huge, it is necessary to free memory as soon as a variable is no longer needed. In very large cases, the program will otherwise crash for lack of memory.

This concludes the detailed explanation of the C implementation. For reference, please see sirt.c and sart.c.

3.2 GPU implementation (CUDA)

The second implementation is the GPU implementation using NVIDIA CUDA (Compute Unified Device Architecture). CUDA comes with a software environment that allows developers to use C as a high-level programming language. It is designed to let applications scale their parallelism across many processor cores while maintaining a low learning curve for programmers familiar with standard programming languages such as C. At its core are three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization, which are exposed to the programmer as a minimal set of language extensions. Most of my CUDA code is therefore written in a similar style to the C program, except for three key differences, which are explained below.

Figure 6 CUDA implementation design flow

3.2.1 Memory allocation, copy and sharing

Since all the parallel computing is done on the GPU, all the data must be copied to GPU memory. In the mex function, the data comes from MATLAB and is stored in host (CPU) memory. All the input matrices must be copied from the host to device (GPU) memory. I use the cudaMemcpy() routine for this. All other matrices, such as V and W, are created in GPU memory using the cudaMalloc() routine and can then be computed on the GPU. At the end of the program, the result is copied from the device back to the host, again using cudaMemcpy().

3.2.2 Parallel global kernel

Since the GPU is good at parallel computing, CUDA C extends C by allowing the programmer to define C functions called kernels. When called, they are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions. A kernel is defined using the __global__ declaration specifier, and the number of CUDA threads that execute that kernel for a given kernel call is specified using a new <<< >>> execution configuration syntax. Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.

Since there are multiple processors inside the GPU, they can do the computation in parallel. The whole GPU can be modeled as several grids. Inside each grid there are several blocks, the number of which equals the number of processors in the system, and inside each block there are several threads which can do the computation in parallel. The ID of a given thread is:

idx = blockDim.x * blockIdx.x + threadIdx.x   (11)

Figure 7 illustrates this. blockDim is the number of threads in a block, blockIdx is the ID of a given block within the grid, and threadIdx is the ID of a thread inside that block. In my implementation, the block size is no more than 2 and the grid size is equal to the number of threads needed divided by the block size.
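The index arithmetic of equation (11) can be emulated serially in plain C, which makes the mapping from blocks and threads to array elements easy to check. This is a sketch under stated assumptions and not CUDA code: the function names are hypothetical, the two nested loops stand in for a kernel launch, and the grid size is rounded up so that every element is covered even when the length is not a multiple of the block size (the rounding and the bounds guard are assumptions of this sketch).

```c
#include <assert.h>
#include <stddef.h>

/* Number of blocks so that grid_size * block_size >= n (ceiling division). */
size_t grid_size_for(size_t n, size_t block_size)
{
    assert(block_size > 0);
    return (n + block_size - 1) / block_size;
}

/* Work that one CUDA thread would do: clamp one element to be non-negative. */
static void nonneg_thread(double *x, size_t n, size_t block_dim,
                          size_t block_idx, size_t thread_idx)
{
    size_t idx = block_dim * block_idx + thread_idx;   /* equation (11) */
    if (idx < n && x[idx] < 0.0)   /* guard: the last block may overshoot n */
        x[idx] = 0.0;
}

/* Serial stand-in for a kernel launch over n elements. */
void nonneg_launch(double *x, size_t n, size_t block_size)
{
    size_t grid = grid_size_for(n, block_size);
    for (size_t b = 0; b < grid; ++b)               /* plays blockIdx.x  */
        for (size_t t = 0; t < block_size; ++t)     /* plays threadIdx.x */
            nonneg_thread(x, n, block_size, b, t);
}
```

With five elements and a block size of 2, three blocks are needed, and the sixth thread falls outside the array and does nothing. On the GPU the two loops disappear and each (block, thread) pair runs concurrently.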

Figure 7 GPU thread hierarchy

I use kernels to compute the V and W matrices and for the nonneg() function. For V and W, genv() and genw() compute each element along the diagonal using different threads in parallel. Every thread is in charge of one row or column of A or A^T, so in total N^2 threads are needed for W and M threads are needed for V. For the non-negativity function, N^2 threads are used to set the negative values in x to zero in parallel. For details, see the sirt_cuda.cu and sart_cuda.cu files.

3.2.3 Thread synchronization

This issue, though not critical in my project, is worth mentioning. Since different threads work in parallel, data dependency is an important problem. For example, if the data of the next row depends on the result of the previous row, the rows must be computed in order.

I use the __syncthreads() routine, which synchronizes the threads within a block, to achieve this. Calling it at the end of each kernel lets the threads synchronize with each other. Without it, the result is sometimes incorrect.

The CUSPARSE and CUBLAS libraries are used for the matrix operations. Most of the functions are similar to those in Intel MKL. I use cusparseScsrmv() for the matrix multiplication and cublasSaxpy() for the addition and subtraction. Please refer to the CUSPARSE user guide and the CUBLAS user guide for details.

The CUSPARSE library has two functions which convert a matrix to CSR format and produce all the auxiliary arrays, such as rowindex, values, and columns. So, unlike the C implementation (where I pass the sparse format of A and A^T and also compute V and W in MATLAB), in the CUDA implementation I pass only the full matrices A and A^T. Kernels compute V and W, then cusparseSnnz() gives the number of non-zero elements and cusparseSdense2csr() produces all the auxiliary arrays, including values, columns, and rowindex. The parameters passed to the CUDA program are therefore much simpler.

To use CUSPARSE and CUBLAS, the libraries must be initialized at the beginning of the program and destroyed at the end. cusparseCreate() and cusparseDestroy() are the routines to use. A matrix descriptor must also be created and destroyed using the cusparseCreateMatDescr() and cusparseDestroyMatDescr() routines. All of them can be found in the CUSPARSE helper function chapter.

This concludes the explanation of my CUDA implementation. The details can be found in the sirt_cuda.cu and sart_cuda.cu code.
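The two-step conversion described above (first count the non-zeros, then fill the CSR arrays) can be sketched in plain C. This is an analogue of the pattern only and does not call or reproduce the CUSPARSE API: the function names are hypothetical, the dense matrix is assumed to be row-major single precision, and the output uses zero-based indexing.

```c
#include <assert.h>
#include <stddef.h>

/* Pass 1: count the non-zeros in each row of a dense row-major m-by-n
 * matrix. Returns the total number of non-zeros. */
int dense_count_nnz(const float *a, size_t m, size_t n, int *nnz_per_row)
{
    assert(a != NULL && nnz_per_row != NULL);
    int total = 0;
    for (size_t i = 0; i < m; ++i) {
        int count = 0;
        for (size_t j = 0; j < n; ++j)
            if (a[i * n + j] != 0.0f)
                ++count;
        nnz_per_row[i] = count;
        total += count;
    }
    return total;
}

/* Pass 2: fill the CSR arrays. values and columns must hold the total
 * returned by pass 1; row_index must hold m + 1 entries. */
void dense_to_csr(const float *a, size_t m, size_t n, const int *nnz_per_row,
                  float *values, int *columns, int *row_index)
{
    row_index[0] = 0;
    for (size_t i = 0; i < m; ++i)          /* running sum of row counts */
        row_index[i + 1] = row_index[i] + nnz_per_row[i];
    for (size_t i = 0; i < m; ++i) {
        int k = row_index[i];
        for (size_t j = 0; j < n; ++j) {
            if (a[i * n + j] != 0.0f) {
                values[k] = a[i * n + j];
                columns[k] = (int)j;
                ++k;
            }
        }
    }
}
```

Counting first lets the caller allocate exactly the right amount of memory for values and columns before the second pass, which matters when the arrays live in limited device memory.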

4. Result and Conclusion

In this project, I compare the reconstruction speed of the MATLAB, C, and CUDA implementations for both the SIRT and SART algorithms. The results are presented below.

C vs. MATLAB

For this comparison, I set N=200 and N_theta=120, which means the size of A is 34440 x 40000 and the resolution of the original picture is 200 x 200 pixels.

Figure 8 Original result, SIRT, SART

SIRT (N=200, N_theta=120)          C        MATLAB   Speed Up
Reconstruction time (only loop)    390 ms   750 ms   192.3%
Reconstruction time (all)          738 ms   775 ms   105%

SART (N=200, N_theta=120)          C        MATLAB   Speed Up
Reconstruction time (only loop)    180 ms   ms       10902%
Reconstruction time (all)          ms       ms       151.2%

If we consider the total time, the C code runs at almost the same speed as MATLAB, because the initialization time accounts for a large part of the C run. If we consider only the loop time, C is much faster. This is especially true for SART, whose initialization time is longer than that of SIRT because of the matrix partition.

C vs. CUDA

For this comparison, I set N=100, N_theta=48 for SIRT and N=140, N_theta=60 for SART. If N and N_theta are set too large, the GPU run fails easily. This is because I pass the whole matrix A to CUDA and the GPU cannot allocate that much memory for it. This will be fixed in the future.

SIRT (N=100, N_theta=48)           CUDA     C        Speed Up
Reconstruction time (only loop)    10 ms    30 ms    300%

SART (N=140, N_theta=60)           CUDA     C        Speed Up
Reconstruction time (only loop)    40 ms    100 ms   250%
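The arithmetic behind the tables and the memory limit can be written out as two small C helpers. This is a sketch with hypothetical function names. It assumes that the Speed Up column is the slower time divided by the faster time, expressed in percent (this matches 750 ms / 390 ms = 192.3%), and that a dense matrix is stored as 4-byte single-precision values.

```c
#include <assert.h>
#include <stddef.h>

/* Speed-up as a ratio in percent: slower (baseline) time over faster time. */
double speedup_percent(double baseline_ms, double new_ms)
{
    assert(new_ms > 0.0);
    return 100.0 * baseline_ms / new_ms;
}

/* Bytes needed to store a dense rows-by-cols matrix of 4-byte floats. */
unsigned long long dense_float_bytes(unsigned long long rows,
                                     unsigned long long cols)
{
    return rows * cols * 4ULL;
}
```

Under these assumptions, a dense 34440 x 40000 matrix would need 5,510,400,000 bytes (about 5.5 GB), which illustrates why passing the whole matrix in dense form quickly exhausts GPU memory as N and N_theta grow.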

For CUDA, I measure only the loop time. If the total time is measured, the memory copy takes a very long time, which is unavoidable. The same is true for C, so the comparison between C and CUDA uses the loop time only. CUDA is faster than C because the GPU is good at parallel computing. Since the problem size is not very large, the advantage is not so obvious. With a larger matrix, the difference will be bigger.

Conclusion

In this project, C and CUDA implementations of the SIRT and SART algorithms were designed. The C version uses the Intel MKL and the CUDA version uses the CUSPARSE library. For SIRT, C reaches about 200% of the MATLAB speed and CUDA about 300% of the C speed. For SART, C reaches about 10000% of the MATLAB speed and CUDA about 250% of the C speed. All of these figures refer to the loop time, not the total time. If the total time is measured, C and CUDA have no obvious advantage over MATLAB and may even perform worse.

Thanks to its parallel computing ability, CUDA is the fastest of the three, but it is also the most complex for the programmer. It needs a large amount of memory allocation, data movement back and forth, and handle setup. MATLAB code is the easiest to write, since it needs no setup or memory operations, but it is the slowest. Both C and CUDA use sparse matrix routines, so they can achieve shorter times if the input matrices contain many zeros, but a large amount of auxiliary array preparation must be done before the real computation. Also, the CUDA implementation still has stability problems which occasionally cause the GPU run to fail. This may be fixed in the future.

In summary:
- Complexity: CUDA > C > MATLAB
- Stability: MATLAB = C > CUDA
- Speed: CUDA > C > MATLAB

In my opinion, the C implementation achieves the best balance between speed, complexity, and stability, and should be the best choice for implementing 3D tomography reconstruction.

5. Reference

1. Simultaneous Algebraic Reconstruction Technique for Electron Tomography using OpenCL, Beata Turono, June 9
2. 3D Simultaneous Algebraic Reconstruction Technique for Cone-Beam Projections, Wojciech Chlewicki, Patras
3. Intel Math Kernel Library Reference Manual, MKL 11.0 Update 1
4. CUDA_CUSPARSE_Users_Guide, v5.0, October
5. CUDA_CUBLAS_Users_Guide, v5.0, October
6. CUDA C PROGRAMMING GUIDE, v5.0, October 2012


More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture Rafia Inam Mälardalen Real-Time Research Centre Mälardalen University, Västerås, Sweden http://www.mrtc.mdh.se rafia.inam@mdh.se CONTENTS

More information

Multi-Processors and GPU

Multi-Processors and GPU Multi-Processors and GPU Philipp Koehn 7 December 2016 Predicted CPU Clock Speed 1 Clock speed 1971: 740 khz, 2016: 28.7 GHz Source: Horowitz "The Singularity is Near" (2005) Actual CPU Clock Speed 2 Clock

More information

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1 Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm

More information

High Performance Linear Algebra on Data Parallel Co-Processors I

High Performance Linear Algebra on Data Parallel Co-Processors I 926535897932384626433832795028841971693993754918980183 592653589793238462643383279502884197169399375491898018 415926535897932384626433832795028841971693993754918980 592653589793238462643383279502884197169399375491898018

More information

Parallel FFT Program Optimizations on Heterogeneous Computers

Parallel FFT Program Optimizations on Heterogeneous Computers Parallel FFT Program Optimizations on Heterogeneous Computers Shuo Chen, Xiaoming Li Department of Electrical and Computer Engineering University of Delaware, Newark, DE 19716 Outline Part I: A Hybrid

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

Advanced CUDA Optimization 1. Introduction

Advanced CUDA Optimization 1. Introduction Advanced CUDA Optimization 1. Introduction Thomas Bradley Agenda CUDA Review Review of CUDA Architecture Programming & Memory Models Programming Environment Execution Performance Optimization Guidelines

More information

INTRODUCTION TO GPU COMPUTING WITH CUDA. Topi Siro

INTRODUCTION TO GPU COMPUTING WITH CUDA. Topi Siro INTRODUCTION TO GPU COMPUTING WITH CUDA Topi Siro 19.10.2015 OUTLINE PART I - Tue 20.10 10-12 What is GPU computing? What is CUDA? Running GPU jobs on Triton PART II - Thu 22.10 10-12 Using libraries Different

More information

Graph Partitioning. Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM.

Graph Partitioning. Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM. Graph Partitioning Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM. Partition given graph G=(V,E) in k subgraphs of nearly equal

More information

Central Slice Theorem

Central Slice Theorem Central Slice Theorem Incident X-rays y f(x,y) R x r x Detected p(, x ) The thick line is described by xcos +ysin =R Properties of Fourier Transform F [ f ( x a)] F [ f ( x)] e j 2 a Spatial Domain Spatial

More information

Using a GPU in InSAR processing to improve performance

Using a GPU in InSAR processing to improve performance Using a GPU in InSAR processing to improve performance Rob Mellors, ALOS PI 152 San Diego State University David Sandwell University of California, San Diego What is a GPU? (Graphic Processor Unit) A graphics

More information

Algebraic Iterative Methods for Computed Tomography

Algebraic Iterative Methods for Computed Tomography Algebraic Iterative Methods for Computed Tomography Per Christian Hansen DTU Compute Department of Applied Mathematics and Computer Science Technical University of Denmark Per Christian Hansen Algebraic

More information

High-Performance Computing Using GPUs

High-Performance Computing Using GPUs High-Performance Computing Using GPUs Luca Caucci caucci@email.arizona.edu Center for Gamma-Ray Imaging November 7, 2012 Outline Slide 1 of 27 Why GPUs? What is CUDA? The CUDA programming model Anatomy

More information

Introduction to Biomedical Imaging

Introduction to Biomedical Imaging Alejandro Frangi, PhD Computational Imaging Lab Department of Information & Communication Technology Pompeu Fabra University www.cilab.upf.edu X-ray Projection Imaging Computed Tomography Digital X-ray

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

Computed tomography - outline

Computed tomography - outline Computed tomography - outline Computed Tomography Systems Jørgen Arendt Jensen and Mikael Jensen (DTU Nutech) October 6, 216 Center for Fast Ultrasound Imaging, Build 349 Department of Electrical Engineering

More information

Reconstruction in CT and relation to other imaging modalities

Reconstruction in CT and relation to other imaging modalities Reconstruction in CT and relation to other imaging modalities Jørgen Arendt Jensen November 1, 2017 Center for Fast Ultrasound Imaging, Build 349 Department of Electrical Engineering Center for Fast Ultrasound

More information

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods

More information

CS/EE 217 Midterm. Question Possible Points Points Scored Total 100

CS/EE 217 Midterm. Question Possible Points Points Scored Total 100 CS/EE 217 Midterm ANSWER ALL QUESTIONS TIME ALLOWED 60 MINUTES Question Possible Points Points Scored 1 24 2 32 3 20 4 24 Total 100 Question 1] [24 Points] Given a GPGPU with 14 streaming multiprocessor

More information

CUDA and OpenCL Implementations of 3D CT Reconstruction for Biomedical Imaging

CUDA and OpenCL Implementations of 3D CT Reconstruction for Biomedical Imaging CUDA and OpenCL Implementations of 3D CT Reconstruction for Biomedical Imaging Saoni Mukherjee, Nicholas Moore, James Brock and Miriam Leeser September 12, 2012 1 Outline Introduction to CT Scan, 3D reconstruction

More information

HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER PROF. BRYANT PROF. KAYVON 15618: PARALLEL COMPUTER ARCHITECTURE

HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER PROF. BRYANT PROF. KAYVON 15618: PARALLEL COMPUTER ARCHITECTURE HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER AVISHA DHISLE PRERIT RODNEY ADHISLE PRODNEY 15618: PARALLEL COMPUTER ARCHITECTURE PROF. BRYANT PROF. KAYVON LET S

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

Vector Addition on the Device: main()

Vector Addition on the Device: main() Vector Addition on the Device: main() #define N 512 int main(void) { int *a, *b, *c; // host copies of a, b, c int *d_a, *d_b, *d_c; // device copies of a, b, c int size = N * sizeof(int); // Alloc space

More information

High Resolution Iterative CT Reconstruction using Graphics Hardware

High Resolution Iterative CT Reconstruction using Graphics Hardware High Resolution Iterative CT Reconstruction using Graphics Hardware Benjamin Keck, Hannes G. Hofmann, Holger Scherl, Markus Kowarschik, and Joachim Hornegger Abstract Particular applications of computed

More information

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been

More information

Implementation of a backprojection algorithm on CELL

Implementation of a backprojection algorithm on CELL Implementation of a backprojection algorithm on CELL Mario Koerner March 17, 2006 1 Introduction X-ray imaging is one of the most important imaging technologies in medical applications. It allows to look

More information

CUDA C Programming Mark Harris NVIDIA Corporation

CUDA C Programming Mark Harris NVIDIA Corporation CUDA C Programming Mark Harris NVIDIA Corporation Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory Models Programming Environment

More information

Limited view X-ray CT for dimensional analysis

Limited view X-ray CT for dimensional analysis Limited view X-ray CT for dimensional analysis G. A. JONES ( GLENN.JONES@IMPERIAL.AC.UK ) P. HUTHWAITE ( P.HUTHWAITE@IMPERIAL.AC.UK ) NON-DESTRUCTIVE EVALUATION GROUP 1 Outline of talk Industrial X-ray

More information

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.

More information

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna14/ [ 10 ] GPU and CUDA Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture and Performance

More information

A GPU Implementation of Distance-Driven Computed Tomography

A GPU Implementation of Distance-Driven Computed Tomography University of Tennessee, Knoxville Trace: Tennessee Research and Creative Exchange Masters Theses Graduate School 8-2017 A GPU Implementation of Distance-Driven Computed Tomography Ryan D. Wagner University

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

Translational Computed Tomography: A New Data Acquisition Scheme

Translational Computed Tomography: A New Data Acquisition Scheme 2nd International Symposium on NDT in Aerospace 2010 - We.1.A.3 Translational Computed Tomography: A New Data Acquisition Scheme Theobald FUCHS 1, Tobias SCHÖN 2, Randolf HANKE 3 1 Fraunhofer Development

More information

Warps and Reduction Algorithms

Warps and Reduction Algorithms Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum

More information

Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA. NVIDIA Corporation

Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA. NVIDIA Corporation Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA NVIDIA Corporation Outline! Overview of CG benchmark! Overview of CUDA Libraries! CUSPARSE! CUBLAS! Porting Sequence! Algorithm Analysis! Data/Code

More information

Introduction to CUDA Programming

Introduction to CUDA Programming Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview

More information

Reconstruction in CT and relation to other imaging modalities

Reconstruction in CT and relation to other imaging modalities Reconstruction in CT and relation to other imaging modalities Jørgen Arendt Jensen November 16, 2015 Center for Fast Ultrasound Imaging, Build 349 Department of Electrical Engineering Center for Fast Ultrasound

More information

University of Bielefeld

University of Bielefeld Geistes-, Natur-, Sozial- und Technikwissenschaften gemeinsam unter einem Dach Introduction to GPU Programming using CUDA Olaf Kaczmarek University of Bielefeld STRONGnet Summerschool 2011 ZIF Bielefeld

More information

Parallelism. CS6787 Lecture 8 Fall 2017

Parallelism. CS6787 Lecture 8 Fall 2017 Parallelism CS6787 Lecture 8 Fall 2017 So far We ve been talking about algorithms We ve been talking about ways to optimize their parameters But we haven t talked about the underlying hardware How does

More information

Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA

Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Optimization Overview GPU architecture Kernel optimization Memory optimization Latency optimization Instruction optimization CPU-GPU

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering

More information

The Future of High Performance Computing

The Future of High Performance Computing The Future of High Performance Computing Randal E. Bryant Carnegie Mellon University http://www.cs.cmu.edu/~bryant Comparing Two Large-Scale Systems Oakridge Titan Google Data Center 2 Monolithic supercomputer

More information

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA PROGRAMMING MODEL Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA: COMMON UNIFIED DEVICE ARCHITECTURE Parallel computing architecture and programming model GPU Computing Application Includes

More information

Introduction to CUDA C

Introduction to CUDA C NVIDIA GPU Technology Introduction to CUDA C Samuel Gateau Seoul December 16, 2010 Who should you thank for this talk? Jason Sanders Senior Software Engineer, NVIDIA Co-author of CUDA by Example What is

More information

INTRODUCTION TO GPU COMPUTING IN AALTO. Topi Siro

INTRODUCTION TO GPU COMPUTING IN AALTO. Topi Siro INTRODUCTION TO GPU COMPUTING IN AALTO Topi Siro 12.6.2013 OUTLINE PART I Introduction to GPUs Basics of CUDA PART II Maximizing performance Coalesced memory access Optimizing memory transfers Occupancy

More information

The Role of Linear Algebra in Computed Tomography

The Role of Linear Algebra in Computed Tomography The Role of Linear Algebra in Computed Tomography Katarina Gustafsson katarina.gustafsson96@gmail.com under the direction of Prof. Ozan Öktem and Emer. Jan Boman Department of Mathematics KTH - Royal Institution

More information

Lab # 2 - ACS I Part I - DATA COMPRESSION in IMAGE PROCESSING using SVD

Lab # 2 - ACS I Part I - DATA COMPRESSION in IMAGE PROCESSING using SVD Lab # 2 - ACS I Part I - DATA COMPRESSION in IMAGE PROCESSING using SVD Goals. The goal of the first part of this lab is to demonstrate how the SVD can be used to remove redundancies in data; in this example

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

Overview. Videos are everywhere. But can take up large amounts of resources. Exploit redundancy to reduce file size

Overview. Videos are everywhere. But can take up large amounts of resources. Exploit redundancy to reduce file size Overview Videos are everywhere But can take up large amounts of resources Disk space Memory Network bandwidth Exploit redundancy to reduce file size Spatial Temporal General lossless compression Huffman

More information

CS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra)

CS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra) CS 470 Spring 2016 Mike Lam, Professor Other Architectures (with an aside on linear algebra) Parallel Systems Shared memory (uniform global address space) Primary story: make faster computers Programming

More information

GPU Programming Introduction

GPU Programming Introduction GPU Programming Introduction DR. CHRISTOPH ANGERER, NVIDIA AGENDA Introduction to Heterogeneous Computing Using Accelerated Libraries GPU Programming Languages Introduction to CUDA Lunch What is Heterogeneous

More information

Introduction to Parallel Programming

Introduction to Parallel Programming Introduction to Parallel Programming Pablo Brubeck Department of Physics Tecnologico de Monterrey October 14, 2016 Student Chapter Tecnológico de Monterrey Tecnológico de Monterrey Student Chapter Outline

More information

Compressed Sensing for Electron Tomography

Compressed Sensing for Electron Tomography University of Maryland, College Park Department of Mathematics February 10, 2015 1/33 Outline I Introduction 1 Introduction 2 3 4 2/33 1 Introduction 2 3 4 3/33 Tomography Introduction Tomography - Producing

More information

GPU programming. Dr. Bernhard Kainz

GPU programming. Dr. Bernhard Kainz GPU programming Dr. Bernhard Kainz Overview About myself Motivation GPU hardware and system architecture GPU programming languages GPU programming paradigms Pitfalls and best practice Reduction and tiling

More information

Institute of Cardiovascular Science, UCL Centre for Cardiovascular Imaging, London, United Kingdom, 2

Institute of Cardiovascular Science, UCL Centre for Cardiovascular Imaging, London, United Kingdom, 2 Grzegorz Tomasz Kowalik 1, Jennifer Anne Steeden 1, Bejal Pandya 1, David Atkinson 2, Andrew Taylor 1, and Vivek Muthurangu 1 1 Institute of Cardiovascular Science, UCL Centre for Cardiovascular Imaging,

More information

Report of Linear Solver Implementation on GPU

Report of Linear Solver Implementation on GPU Report of Linear Solver Implementation on GPU XIANG LI Abstract As the development of technology and the linear equation solver is used in many aspects such as smart grid, aviation and chemical engineering,

More information

Scientific discovery, analysis and prediction made possible through high performance computing.

Scientific discovery, analysis and prediction made possible through high performance computing. Scientific discovery, analysis and prediction made possible through high performance computing. An Introduction to GPGPU Programming Bob Torgerson Arctic Region Supercomputing Center November 21 st, 2013

More information

Massively Parallel Computing with CUDA. Carlos Alberto Martínez Angeles Cinvestav-IPN

Massively Parallel Computing with CUDA. Carlos Alberto Martínez Angeles Cinvestav-IPN Massively Parallel Computing with CUDA Carlos Alberto Martínez Angeles Cinvestav-IPN What is a GPU? A graphics processing unit (GPU) The term GPU was popularized by Nvidia in 1999 marketed the GeForce

More information

Adaptive algebraic reconstruction technique

Adaptive algebraic reconstruction technique Adaptive algebraic reconstruction technique Wenkai Lua) Department of Automation, Key State Lab of Intelligent Technology and System, Tsinghua University, Beijing 10084, People s Republic of China Fang-Fang

More information

CUDA Performance Optimization. Patrick Legresley

CUDA Performance Optimization. Patrick Legresley CUDA Performance Optimization Patrick Legresley Optimizations Kernel optimizations Maximizing global memory throughput Efficient use of shared memory Minimizing divergent warps Intrinsic instructions Optimizations

More information

ECE 574 Cluster Computing Lecture 15

ECE 574 Cluster Computing Lecture 15 ECE 574 Cluster Computing Lecture 15 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 30 March 2017 HW#7 (MPI) posted. Project topics due. Update on the PAPI paper Announcements

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

Hardware/Software Co-Design

Hardware/Software Co-Design 1 / 13 Hardware/Software Co-Design Review so far Miaoqing Huang University of Arkansas Fall 2011 2 / 13 Problem I A student mentioned that he was able to multiply two 1,024 1,024 matrices using a tiled

More information

CS377P Programming for Performance GPU Programming - I

CS377P Programming for Performance GPU Programming - I CS377P Programming for Performance GPU Programming - I Sreepathi Pai UTCS November 9, 2015 Outline 1 Introduction to CUDA 2 Basic Performance 3 Memory Performance Outline 1 Introduction to CUDA 2 Basic

More information

Index. aliasing artifacts and noise in CT images, 200 measurement of projection data, nondiffracting

Index. aliasing artifacts and noise in CT images, 200 measurement of projection data, nondiffracting Index Algebraic equations solution by Kaczmarz method, 278 Algebraic reconstruction techniques, 283-84 sequential, 289, 293 simultaneous, 285-92 Algebraic techniques reconstruction algorithms, 275-96 Algorithms

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

Image Acquisition Systems

Image Acquisition Systems Image Acquisition Systems Goals and Terminology Conventional Radiography Axial Tomography Computer Axial Tomography (CAT) Magnetic Resonance Imaging (MRI) PET, SPECT Ultrasound Microscopy Imaging ITCS

More information

Convolution Soup: A case study in CUDA optimization. The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam

Convolution Soup: A case study in CUDA optimization. The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam Convolution Soup: A case study in CUDA optimization The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam Optimization GPUs are very fast BUT Naïve programming can result in disappointing performance

More information

CUDA (Compute Unified Device Architecture)

CUDA (Compute Unified Device Architecture) CUDA (Compute Unified Device Architecture) Mike Bailey History of GPU Performance vs. CPU Performance GFLOPS Source: NVIDIA G80 = GeForce 8800 GTX G71 = GeForce 7900 GTX G70 = GeForce 7800 GTX NV40 = GeForce

More information

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion

More information

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca (Slides http://support.scinet.utoronto.ca/ northrup/westgrid CUDA.pdf) March 12, 2014

More information

Optimization of Cone Beam CT Reconstruction Algorithm Based on CUDA

Optimization of Cone Beam CT Reconstruction Algorithm Based on CUDA Sensors & Transducers 2013 by IFSA http://www.sensorsportal.com Optimization of Cone Beam CT Reconstruction Algorithm Based on CUDA 1 Wang LI-Fang, 2 Zhang Shu-Hai 1 School of Electronics and Computer

More information

Introduction to GPU Computing Junjie Lai, NVIDIA Corporation

Introduction to GPU Computing Junjie Lai, NVIDIA Corporation Introduction to GPU Computing Junjie Lai, NVIDIA Corporation Outline Evolution of GPU Computing Heterogeneous Computing CUDA Execution Model & Walkthrough of Hello World Walkthrough : 1D Stencil Once upon

More information

Multi-slice CT Image Reconstruction Jiang Hsieh, Ph.D.

Multi-slice CT Image Reconstruction Jiang Hsieh, Ph.D. Multi-slice CT Image Reconstruction Jiang Hsieh, Ph.D. Applied Science Laboratory, GE Healthcare Technologies 1 Image Generation Reconstruction of images from projections. textbook reconstruction advanced

More information