Master Thesis Accelerating Image Registration on GPUs


1 Master Thesis: Accelerating Image Registration on GPUs
A proof of concept migration of FAIR to CUDA
Sunil Ramgopal Tatavarty
Prof. Dr. Ulrich Rüde, Dr.-Ing. Harald Köstler
Lehrstuhl für Systemsimulation, Universität Erlangen-Nürnberg
March 5, 2010

2 Outline
FAIR
- Image Registration
- FAIR
- Fixed level experiment
MATLAB on CUDA
- MATLAB MEX interface
- CUDA MEX environment
FAIR on CUDA
- The Design phase
- CUDA MEX Interpolation
- CUDA MEX transformation
- CUDA enabled FAIR registration cycle
Improvements
Summary

3 Image Registration
Given a reference image R and a template image T, find a reasonable transformation y such that the transformed image T[y] is similar to R:
$$J[y] = D[T[y], R] + \alpha S[y - y^{\mathrm{ref}}] \;\stackrel{y}{\longrightarrow}\; \min \qquad (1)$$
where D measures image similarity and S measures the reasonability of the transformation.
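The slide leaves the distance D abstract. As a minimal sketch, assuming D is the sum of squared differences (SSD) used in the profiled experiment, and assuming the common discretization with a factor 0.5 and the cell volume hd (the names `ssd`, `t`, `r` and `hd` are illustrative, not taken from FAIR):

```c
#include <assert.h>
#include <stddef.h>

/* Discrete SSD distance: 0.5 * hd * sum_i (T[y](x_i) - R(x_i))^2.
 * t  : transformed template sampled on the grid (hypothetical name)
 * r  : reference image sampled on the same grid
 * hd : volume of one grid cell (assumed scaling) */
double ssd(const double *t, const double *r, size_t n, double hd)
{
    assert(t != NULL && r != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double res = t[i] - r[i];   /* pointwise residual */
        sum += res * res;
    }
    return 0.5 * hd * sum;
}
```

The value is zero exactly when the transformed template matches the reference at every grid point, which is what the minimization in (1) drives towards, balanced against the regularizer S.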

4 A software viewpoint


5 FAIR: Flexible Algorithms for Image Registration
Image Registration (Optimization Approach)
$$J[y] = D[T[y], R] + \alpha S[y - y^{\mathrm{ref}}] \;\stackrel{y}{\longrightarrow}\; \min$$
Salient features
- Continuous (functional) framework
- Numerical optimization
- Constrained image registration
- Collection of MATLAB files
- Toolbox for image models, transformations, distance measures, regularizers, ...
- Multi-level, multi-scale, multigrid amenable

6 Parametric Image Registration in FAIR: HNSP
[Figure panels: (a) T(xc), (b) R(xc), (c) T(xc) - R(xc); rigid/fine: (d) T(xc) with yc, (e) T(yc), (f) T(yc) - R(xc)]

7 Profiling Results: HNSP PIR SSD rigid2d

Function name                  Calls   Total time (s)   % of total
E6 HNSP PIR SSD rigid2d        1       43.              100
inter = splineinter2d          …       …                59.3
opt = Armijo                   …       …                14
distance = SSD                 …       …                2.6
trafo = rigid2d                …       …                1.5
FAIRplots and others           …       …                22.4
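The percentages in the profile suggest why interpolation is the first module to port. As a rough upper bound, assuming Amdahl's law applies, that the listed percentages are fractions of total runtime, and that the code not moved to the GPU keeps its runtime:

```latex
S_{\max} = \frac{1}{1 - p}, \qquad
p_{\text{inter}} = 0.593 \;\Rightarrow\; S_{\max} = \frac{1}{0.407} \approx 2.46, \qquad
p_{\text{all}} = 0.593 + 0.14 + 0.026 + 0.015 = 0.774 \;\Rightarrow\; S_{\max} = \frac{1}{0.226} \approx 4.42 .
```

Under these assumptions, accelerating interpolation alone cannot yield more than about 2.5x for this experiment, and even an infinitely fast interpolation, optimizer, distance and transformation stay below 4.5x while plotting and other overhead remain on the CPU.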

8 MATLAB MEX interface
Even though MATLAB is built on many well-optimized libraries, some functions perform better when written in a compiled language (e.g. C or Fortran). MATLAB provides a convenient API, the MEX file, for calling code written in C or Fortran as a MATLAB function. MEX files can be used to exploit multi-core processors with OpenMP or threaded code or, as in this case, to offload functions to the GPU.

9 CUDA MEX environment: NVMEX
- The native MATLAB mex script cannot parse CUDA code.
- A new MATLAB script, nvmex.m, compiles CUDA code (.cu) to create MATLAB function files.
- The syntax is similar to the original mex script:
  >> nvmex -f nvmexopts.bat filename.cu -IC:\cuda\include -LC:\cuda\lib -lcudart
- Available for Windows and Linux from com/compute/cuda/1_1/matlab_cuda_1.1.tgz

10 Typical CUDA MEX file
1. Convert from double to single precision
2. Rearrange the data layout for complex data
3. Allocate memory on the GPU
4. Transfer the data from the host to the GPU
5. Perform computation on GPU (library, custom code)
6. Transfer results from the GPU to the host
7. Rearrange the data layout for complex data
8. Convert from single to double
9. Clean up memory and return results to MATLAB
Some of these steps will go away with new versions of the library (2, 7) and new hardware (1, 8).
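The host-side bookkeeping in steps 1, 2 and 8 can be sketched in plain C. This is only an illustration under stated assumptions: the function names are hypothetical, and step 2 assumes MATLAB passes complex data as separate real and imaginary arrays while the GPU library expects interleaved (re, im) pairs.

```c
#include <assert.h>
#include <stddef.h>

/* Step 1: MATLAB hands over doubles; narrow them to single precision. */
void to_single(const double *src, float *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = (float)src[i];
}

/* Step 8: widen the single-precision results back to double. */
void to_double(const float *src, double *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = (double)src[i];
}

/* Step 2: split real/imaginary arrays -> interleaved (re, im) pairs. */
void interleave_complex(const double *re, const double *im, float *dst, size_t n)
{
    assert(re != NULL && im != NULL && dst != NULL);
    for (size_t i = 0; i < n; ++i) {
        dst[2 * i]     = (float)re[i];
        dst[2 * i + 1] = (float)im[i];
    }
}
```

Each of these passes touches every element on the CPU before any GPU work starts, which is why the slide expects them to disappear once the library and hardware accept MATLAB's native layout and precision.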

11 Design requirements, roadmap and considerations
Requirements
- Integration of the FAIR toolbox with the CUDA programming interface.
- Efficient implementations of FAIR functional modules on the GPU.
- Measurement of accuracy and runtime for the complete registration cycle and for individual modules.
Roadmap
1. Set up the CUDA MEX environment within the FAIR toolbox.
2. Implement an optimised version of the FAIR interpolation toolbox on CUDA.
3. Implement the transformation and distance toolboxes on CUDA.
4. Combine all CUDA functional modules to run a complete registration cycle on the GPU.

12 Textures in CUDA
A texture is an object for reading data.
Benefits
- Data is cached (optimized for 2D locality); helpful when coalescing is a problem
- Filtering: linear / bilinear / trilinear, in dedicated hardware
- Wrap modes (for out-of-bounds addresses): clamp to edge / repeat
- Addressable in 1D, 2D, or 3D, using integer or normalized coordinates
Usage
- CPU code binds data to a texture object
- Kernel reads data by calling a fetch function

13 Basic interpolation schemes: host side
    texture<float, 2, cudaReadModeElementType> tex;
    ...
    void mexFunction(int nlhs...){
        ...
        // set texture parameters
        tex.addressMode[0] = cudaAddressModeClamp;
        tex.addressMode[1] = cudaAddressModeClamp;
        // access with unnormalized texture coordinates
        tex.normalized = false;
        ...
        // Bind the array to the texture
        cudaBindTextureToArray(tex, cu_array, channelDesc);
        ...
    }


14 Basic interpolation schemes: device kernel
    __global__ void Inter2DKernel(){
        ...
        T = tex2D(tex, tx, ty);
        ...
    }
Nearest neighbor
    $T^{\mathrm{nn}}(x) := \mathrm{dataT}(j)$, and $T^{\mathrm{nn}}(x) = 0$ for $x \notin \Omega$
    tex.filterMode = cudaFilterModePoint;
Linear (low precision)
    $T^{\mathrm{linear}}(x) := \mathrm{dataT}(p)\,(1 - \xi) + \mathrm{dataT}(p + 1)\,\xi$, with $x = p + \xi$, p an integer and $0 \le \xi < 1$
    tex.filterMode = cudaFilterModeLinear;
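For comparison with the two texture filter modes, here is a one-dimensional CPU sketch of both schemes. It assumes sample j sits at position x = j for j = 1, ..., m, that the domain is [1, m], and that values outside the domain are zero; the function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* data[0..m-1] holds dataT(1..m); sample j sits at x = j (assumed layout). */

/* Nearest neighbour: value of the closest sample, 0 outside [1, m]. */
double inter_nn(const double *data, int m, double x)
{
    assert(data != NULL && m >= 1);
    if (x < 1.0 || x > (double)m)
        return 0.0;
    int j = (int)(x + 0.5);          /* round to nearest; x >= 1 here */
    return data[j - 1];
}

/* Linear: dataT(p) * (1 - xi) + dataT(p + 1) * xi, p = floor(x), xi = x - p. */
double inter_linear(const double *data, int m, double x)
{
    assert(data != NULL && m >= 1);
    if (x < 1.0 || x > (double)m)
        return 0.0;
    int p = (int)x;                  /* equals floor(x) because x >= 1 */
    double xi = x - (double)p;
    if (p == m)                      /* x == m: no right neighbour needed */
        return data[m - 1];
    return data[p - 1] * (1.0 - xi) + data[p] * xi;
}
```

A double-precision reference of this kind is useful for judging the texture path, since the hardware linear filter computes its weights at reduced precision.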

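15 B-Spline Interpolation
$$S[T] = \int_{\Omega} \left(T''(x)\right)^2 \, dx, \qquad (2)$$
$$S[T] \stackrel{!}{=} \min \quad \text{subject to} \quad T(x_j) = \mathrm{dataT}(j), \quad j = 1, \dots, m, \qquad (3)$$
$$b(x) = \begin{cases}
(x + 2)^3, & -2 \le x < -1, \\
-x^3 - 2(x + 1)^3 + 6(x + 1), & -1 \le x < 0, \\
x^3 + 2(x - 1)^3 - 6(x - 1), & 0 \le x < 1, \\
(2 - x)^3, & 1 \le x < 2, \\
0, & \text{else},
\end{cases} \qquad (4)$$
$$T(x) = T^{\mathrm{spline}}(x) = \sum_{j=1}^{m} c_j \, b^j(x). \qquad (5)$$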
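The basis function (4) and the expansion (5) translate directly into code. The sketch below assumes the shifted basis functions are b^j(x) = b(x - j), which the slide does not spell out; `bspline` and `spline_eval` are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

/* Cubic B-spline basis b(x) of equation (4); support is [-2, 2). */
double bspline(double x)
{
    if (x >= -2.0 && x < -1.0) { double t = x + 2.0; return t * t * t; }
    if (x >= -1.0 && x <  0.0) { double t = x + 1.0; return -x * x * x - 2.0 * t * t * t + 6.0 * t; }
    if (x >=  0.0 && x <  1.0) { double t = x - 1.0; return  x * x * x + 2.0 * t * t * t - 6.0 * t; }
    if (x >=  1.0 && x <  2.0) { double t = 2.0 - x; return t * t * t; }
    return 0.0;
}

/* Evaluate T(x) = sum_{j=1..m} c_j * b(x - j); c[0..m-1] holds c_1..c_m.
 * Assumes b^j(x) = b(x - j). */
double spline_eval(const double *c, int m, double x)
{
    assert(c != NULL && m >= 1);
    double sum = 0.0;
    for (int j = 1; j <= m; ++j)
        sum += c[j - 1] * bspline(x - (double)j);
    return sum;
}
```

With this scaling b(0) = 4 and b(-1) = b(1) = 1, so the four basis functions overlapping any point sum to 6 rather than 1. Only four terms of the sum are ever non-zero; the loop over all m coefficients is kept for clarity, not speed.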

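16 B-Spline Interpolation [Sigg, C. and Hadwiger, M.]
With $x = p + \xi$, only four coefficients contribute:
$$T^{\mathrm{spline}}(x) = c_{p-1}\, b(\xi + 1) + c_p\, b(\xi) + c_{p+1}\, b(\xi - 1) + c_{p+2}\, b(\xi - 2) \qquad (6)$$
Linear interpolation reads
$$T^{\mathrm{linear}}(x) := \mathrm{dataT}(p)\,(1 - \xi) + \mathrm{dataT}(p + 1)\,\xi, \qquad (7)$$
so any weighted sum of two neighbouring samples is a scaled linear fetch at $\xi = b/(a + b)$:
$$(a + b)\, T^{\mathrm{linear}}(x) := \mathrm{dataT}(p)\, a + \mathrm{dataT}(p + 1)\, b. \qquad (8)$$
Applying (8) to the two pairs of terms in (6) gives
$$T^{\mathrm{spline}}(x) = g_0(\xi)\, c^{\mathrm{linear}}_{p + h_0} + g_1(\xi)\, c^{\mathrm{linear}}_{p + h_1} \qquad (9)$$
where
$$g_0(\xi) = b(\xi + 1) + b(\xi), \qquad g_1(\xi) = b(\xi - 1) + b(\xi - 2), \qquad (10)$$
$$h_0 = \frac{b(\xi)}{g_0(\xi)} - 1, \qquad h_1 = \frac{b(\xi - 2)}{g_1(\xi)} + 1. \qquad (11)$$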
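Equations (6) and (9) to (11) can be checked against each other on the CPU. In this sketch `fetch_linear` is a hypothetical stand-in for a hardware linear texture fetch, the coefficients c_1..c_m are assumed to sit at positions 1..m, and 0 <= xi < 1.

```c
#include <assert.h>
#include <stddef.h>

/* Cubic B-spline basis of equation (4). */
static double bspline(double x)
{
    if (x >= -2.0 && x < -1.0) { double t = x + 2.0; return t * t * t; }
    if (x >= -1.0 && x <  0.0) { double t = x + 1.0; return -x * x * x - 2.0 * t * t * t + 6.0 * t; }
    if (x >=  0.0 && x <  1.0) { double t = x - 1.0; return  x * x * x + 2.0 * t * t * t - 6.0 * t; }
    if (x >=  1.0 && x <  2.0) { double t = 2.0 - x; return t * t * t; }
    return 0.0;
}

/* Stand-in for a linear texture fetch at fractional index pos (pos >= 1).
 * c[0..m-1] holds c_1..c_m; the caller keeps pos + 1 inside the array. */
static double fetch_linear(const double *c, double pos)
{
    int p = (int)pos;
    double xi = pos - (double)p;
    return c[p - 1] * (1.0 - xi) + c[p] * xi;
}

/* Equation (6): four coefficient reads, four weights. */
double spline_direct(const double *c, int p, double xi)
{
    assert(c != NULL && p >= 2);
    return c[p - 2] * bspline(xi + 1.0) + c[p - 1] * bspline(xi)
         + c[p]     * bspline(xi - 1.0) + c[p + 1] * bspline(xi - 2.0);
}

/* Equations (9)-(11): the same value from two linear fetches. */
double spline_two_fetch(const double *c, int p, double xi)
{
    assert(c != NULL && p >= 2);
    double g0 = bspline(xi + 1.0) + bspline(xi);
    double g1 = bspline(xi - 1.0) + bspline(xi - 2.0);
    double h0 = bspline(xi) / g0 - 1.0;
    double h1 = bspline(xi - 2.0) / g1 + 1.0;
    return g0 * fetch_linear(c, (double)p + h0)
         + g1 * fetch_linear(c, (double)p + h1);
}
```

In double precision on the CPU the two routines agree to rounding error. On the GPU the second form halves the number of texture reads in 1D, but the result then inherits the reduced precision of the hardware filter weights.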

17 Bandwidth Results: Interpolation
[Figure panels: (a) splineinter2d(l), (b) splineinter2d(nn)]
Table columns: Grid size | splineinter2d (NN): measured bandwidth, worst case, best case | splineinter2d (bilinear): measured bandwidth, worst case, best case
[Table values not preserved in this transcript.]

18 Runtime Results: Interpolation
[Figure panels: (a) Runtime comparison, (b) Runtime vs ideal]
Table columns: Grid size | linearinter2d (FAIR) (ms) | splineinter2d (FAIR) (ms) | splineinter2d (NN texture) (ms) | splineinter2d (bilinear texture) (ms)
[Table values not preserved in this transcript.]

19 Results: Interpolation
[Figure panels: (a) Derivative test Inter2D (MATLAB), (b) Derivative test Inter2D (CUDA MEX)]

20 Rigid transformation
An affine linear transformation allows for translation, rotation, shearing, and individual scaling. Its components are
$$y_1 = w_1 x_1 + w_2 x_2 + w_3, \qquad (12)$$
$$y_2 = w_4 x_1 + w_5 x_2 + w_6. \qquad (13)$$
In matrix form,
$$Q(x) = \begin{bmatrix} x_1 & x_2 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & x_1 & x_2 & 1 \end{bmatrix}, \qquad (14)$$
$$y = Q(x)\, w. \qquad (15)$$
Rigid transformation: a special affine linear transformation that allows only rotation and translation,
$$y_1 = \cos(w_1)\, x_1 - \sin(w_1)\, x_2 + w_2, \qquad (16)$$
$$y_2 = \sin(w_1)\, x_1 + \cos(w_1)\, x_2 + w_3. \qquad (17)$$
Although this function is non-linear in w, it can be written with the same matrix Q(x):
$$y(x) = Q(x)\, f(w), \qquad f(w) = [\cos w_1;\ -\sin w_1;\ w_2;\ \sin w_1;\ \cos w_1;\ w_3].$$

21 Persistent Memory and Hybrid Memory
    #include "cuda.h"
    #include "mex.h"
    ...
    /* Static variables retain device memory locations between MEX calls */
    static float *xf_gpu, *yf_gpu;
    static float *yc_gpu;
    static int initialised_rigid = 0;

    /* Routine to clear the CUDA MEX persistent variables */
    __host__ void cleanup(void){
        mexPrintf("mex-file rigid2d is terminating, destroying array ");
        cudaFree(xf_gpu);
        cudaFree(yf_gpu);
        cudaFree(yc_gpu);
    }

    /* Kernel to transform an image.
       y_xf_gpu, y_yf_gpu: output data in global memory
       xf_gpu, yf_gpu: input data (Q) from global memory */
    __global__ void rigid2dkernel(float* y_xf_gpu,..){ }

    /* Gateway function
       function [yc,dy] = rigid2d(w,x,varargin); */
    void mexFunction(int nlhs, mxArray *plhs[],..){
        /* Find the dimensions of the data */
        /* Allocate memory for output */
        ...
        cudaMalloc((void **) &yc_gpu, sizeof(float)*xn*xm/2);
        /* Setup kernel */

22 Persistent Memory and Hybrid Memory (cont.)
        if(!initialised_rigid){
            x = mxGetPr(prhs[1]);
            /* Allocate memory for Q */
            cudaMalloc((void **) &xf_gpu,..);
            cudaMalloc((void **) &yf_gpu,..);
            /* Construct Q using input data */
            cudaMemcpy(xf_gpu,...,cudaMemcpyHostToDevice);
            cudaMemcpy(yf_gpu,...,cudaMemcpyHostToDevice);
            /* Register the cleanup function and set the flag that guards the one-time setup */
            mexAtExit(cleanup);
            initialised_rigid = 1;
            /* Call function to perform rigid2d on GPU */
            rigid2dkernel<<<dimgrid,dimblock>>>(...);
            cutilSafeCall(cudaThreadSynchronize());
        }
        else{
            /* Call function to perform rigid2d on GPU */
        }
        /* Set result to device pointer (hybrid memory): the mxArray carries a
           GPU address and must not be dereferenced on the host */
        mxArray *parray = mxCreateDoubleMatrix(0,0,mxREAL);
        double data[10];
        mxSetPr(parray,(double *) yc_gpu);
        mxSetM(parray,xm); mxSetN(parray,xn);
        /* Clean-up non persistent memory on device and host */
        cudaThreadExit();
    }
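The control flow of the persistent-memory pattern (allocate and upload once, reuse on later calls, release at exit) can be shown without a GPU. In this host-only sketch `malloc`, `free` and `atexit` stand in for `cudaMalloc`, `cudaFree` and `mexAtExit`; `shift_grid`, `grid_buf` and `upload_count` are hypothetical names, and the grid size is assumed not to change between calls.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Host-side analogue of the persistent-memory pattern: malloc/free/atexit
 * stand in for cudaMalloc/cudaFree/mexAtExit (all names hypothetical). */
static float *grid_buf = NULL;      /* survives between calls, like xf_gpu */
static int initialised = 0;
static int upload_count = 0;        /* how often the grid was "uploaded" */

static void cleanup(void)
{
    free(grid_buf);
    grid_buf = NULL;
    initialised = 0;
}

/* Shifts the cached grid by w and writes n results to out.
 * The grid is copied only on the first call; later calls reuse it. */
void shift_grid(float w, const float *grid, size_t n, float *out)
{
    assert(grid != NULL && out != NULL && n > 0);
    if (!initialised) {
        grid_buf = malloc(n * sizeof *grid_buf);
        assert(grid_buf != NULL);
        memcpy(grid_buf, grid, n * sizeof *grid_buf);
        atexit(cleanup);             /* release the buffer at program exit */
        initialised = 1;
        ++upload_count;
    }
    for (size_t i = 0; i < n; ++i)
        out[i] = grid_buf[i] + w;
}
```

Every call after the first skips the allocation and the copy. On the GPU that copy is a host-to-device transfer of the full coordinate grid, which is where the time saved by persistent memory comes from.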

23 Results
Table columns: Grid size X | Grid size Y | rigid2d (non persistent) | rigid2d (persistent) | % time saved using persistent memory
[Table values not preserved in this transcript.]

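24 CUDA MEX Registration cycle
Table columns: Grid size X | Grid size Y | PIR SSD RIGID (MATLAB) | PIR SSD RIGID (CUDA MEX)
[Most table values were not preserved; the surviving entries are 33 s and 92 s, and their row and column are not recoverable.]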

25 FAIR Improvements
- Use of Kronecker products.
- The explicit storage of the large coordinate grids could be avoided.
- Combination of functional modules.
- The stringent requirement for lexicographical ordering.

26 CUDA MEX Improvements
[Figure panels: (a) CUDA driver objects, (b) CUDA driver objects, (c) Improved framework]

27 Summary
1. Successful integration of MATLAB and CUDA.
2. Porting of the FAIR toolbox onto the GPU.
3. Fast implementation of spline interpolation within the CUDA MEX framework.
4. Analysis of accuracy results for texture usage for interpolant derivatives.
5. GPU acceleration of the fixed-level image registration scheme for large discretizations.
6. Implementation of persistent memory on GPUs.
