
Real-Time Rigid 2D-3D Medical Image Registration Using RapidMind Multi-Core Platform
Georgia Tech/AFRL Workshop on Computational Science Challenge Using Emerging & Massively Parallel Computer Architectures
Georgia Institute of Technology, August 17, 2009
Justin W. L. Wan, Canada Research Chair in Scientific Computing
David R. Cheriton School of Computer Science, University of Waterloo
Joint work with Lin Xu (Princess Margaret Hospital)

Outline
- Rigid (2D-3D) image registration
- Modeling and numerical solution
- Multi-core/GPGPU programming
- RapidMind Multi-core Development Platform
- Results

Medical Imaging

Image Registration
[Figure: template, target, and transformed template]
Find a transformation that best maps the template image to the target image.
Best match:
- Minimize differences
- Maximize similarities

Applications
- Diagnosis: combine information from multiple imaging modalities.
- Monitoring disease progression: align tissues and organs that have changed in size, shape, or position over time.
- Image-guided surgery or radiotherapy: align pre-operative images and surgical plans to the images acquired in real time during surgery.
- Patient comparison: compare an individual's anatomy to a standardized atlas.

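Image Registration Problem
- An image is a mapping from an image domain to an intensity range.
- The template image F and the target image G are each such a mapping on their own domain.
- Find a transformation Φ between the two image domains such that corresponding points in F and G are aligned.
[Figure: images F and G, with Φ relating a point x to its corresponding point y]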

Classification
Based on types of transformations:
- Rigid: translation + rotation
- Nonrigid: affine, projective, physically based (elastic, fluid)

Rigid Image Registration
Only rotations and translations are considered.

Rigid Image Registration
- Used for registration where there is little or no distortion.
- Often serves as a pre-alignment step for nonrigid registration.
- 2D-2D: composed of 2 translations and 1 rotation.
- 3D-3D: composed of 3 translations and 3 rotations.
- 2D-3D image registration: aligns a 2D image with a 3D image volume; important in medical applications such as radiation therapy and computer-assisted surgery.
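The six-parameter 3D rigid transformation mentioned above can be sketched as follows. This is a minimal illustration, not the implementation behind the slides: the names `Vec3`, `RigidParams`, and `applyRigid` are hypothetical, and the rotation order (x, then y, then z) is an assumption, since the slides do not state one.

```cpp
#include <array>
#include <cmath>

// Hypothetical names for illustration only.
using Vec3 = std::array<double, 3>;

struct RigidParams {
    double rx, ry, rz;  // rotation angles about the x, y, z axes (radians)
    double tx, ty, tz;  // translations along the x, y, z axes
};

// Rotate p about x, then y, then z (assumed order), then translate.
Vec3 applyRigid(const RigidParams& s, const Vec3& p) {
    const double cx = std::cos(s.rx), sx = std::sin(s.rx);
    const double cy = std::cos(s.ry), sy = std::sin(s.ry);
    const double cz = std::cos(s.rz), sz = std::sin(s.rz);

    // Rotation about the x axis.
    const Vec3 a = {p[0], cx * p[1] - sx * p[2], sx * p[1] + cx * p[2]};
    // Rotation about the y axis.
    const Vec3 b = {cy * a[0] + sy * a[2], a[1], -sy * a[0] + cy * a[2]};
    // Rotation about the z axis.
    const Vec3 c = {cz * b[0] - sz * b[1], sz * b[0] + cz * b[1], b[2]};

    // Translation.
    return Vec3{c[0] + s.tx, c[1] + s.ty, c[2] + s.tz};
}
```

Because no scaling or shearing is involved, distances between points are preserved, which is what makes the transformation rigid.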

2D-3D Image Registration
[Figure: real world and virtual world]
The position of the patient can be estimated when the DRR (digitally reconstructed radiograph) matches the portal X-ray image.

Rigid Image Registration Model
$$\min_{s} \; \|F(s) - G\|_2^2 = \sum_{i,j} \bigl(F(s)_{i,j} - G_{i,j}\bigr)^2$$
- $s = (\theta_x, \theta_y, \theta_z, t_x, t_y, t_z)$: transformation parameters (three rotation angles and three translations)
- $F(s)$: digitally reconstructed radiograph (DRR)
- $G$: portal image
Match the DRR and the portal image by appropriately rotating and translating the 3D volume.
Two major steps:
- Construction of the DRR
- Solving the minimization problem
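The objective in this model is a sum of squared pixel differences between the DRR and the portal image. A minimal serial sketch is shown below; the function name `sumSquaredDifferences` and the flat row-major image layout are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: both images are flat row-major arrays of equal size.
// Returns the sum over all pixels of (DRR - portal)^2.
double sumSquaredDifferences(const std::vector<double>& drr,
                             const std::vector<double>& portal) {
    assert(drr.size() == portal.size());
    double sum = 0.0;
    for (std::size_t k = 0; k < drr.size(); ++k) {
        const double d = drr[k] - portal[k];
        sum += d * d;
    }
    return sum;
}
```

The value is zero only when the two images agree at every pixel, so smaller values indicate a better match.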

Construction of DRR
The DRR is constructed by perspective projection of the 3D image volume onto a given plane.

Volume Rendering of 3D Image
- Ray casting: light rays pass from the source through the pixels of the DRR into the 3D image volume.
- Pixel value = accumulated intensities of the 3D image along the ray.
- Intensities at a sample point x are given by trilinear interpolation.
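The trilinear interpolation step used when sampling along a ray can be sketched as below. The `Volume` struct, its memory layout (x varying fastest), and the choice to return zero outside the volume are all assumptions for this illustration, not details taken from the slides.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical volume container: nx*ny*nz voxels, x varying fastest.
struct Volume {
    std::size_t nx, ny, nz;
    std::vector<double> data;

    double at(std::size_t i, std::size_t j, std::size_t k) const {
        return data[(k * ny + j) * nx + i];
    }
};

// Trilinear interpolation at a continuous position (x, y, z) in voxel units.
// Positions outside the volume contribute zero intensity (an assumption).
double trilinear(const Volume& v, double x, double y, double z) {
    if (x < 0.0 || y < 0.0 || z < 0.0 ||
        x > v.nx - 1.0 || y > v.ny - 1.0 || z > v.nz - 1.0) {
        return 0.0;
    }
    // Lower corner of the enclosing cell; truncation equals floor for x >= 0.
    const std::size_t i0 = static_cast<std::size_t>(x);
    const std::size_t j0 = static_cast<std::size_t>(y);
    const std::size_t k0 = static_cast<std::size_t>(z);
    // Upper corner, clamped so that points on the last face stay in range.
    const std::size_t i1 = std::min(i0 + 1, v.nx - 1);
    const std::size_t j1 = std::min(j0 + 1, v.ny - 1);
    const std::size_t k1 = std::min(k0 + 1, v.nz - 1);
    const double fx = x - static_cast<double>(i0);
    const double fy = y - static_cast<double>(j0);
    const double fz = z - static_cast<double>(k0);

    // Interpolate along x, then y, then z.
    const double c00 = v.at(i0, j0, k0) * (1.0 - fx) + v.at(i1, j0, k0) * fx;
    const double c10 = v.at(i0, j1, k0) * (1.0 - fx) + v.at(i1, j1, k0) * fx;
    const double c01 = v.at(i0, j0, k1) * (1.0 - fx) + v.at(i1, j0, k1) * fx;
    const double c11 = v.at(i0, j1, k1) * (1.0 - fx) + v.at(i1, j1, k1) * fx;
    const double c0 = c00 * (1.0 - fy) + c10 * fy;
    const double c1 = c01 * (1.0 - fy) + c11 * fy;
    return c0 * (1.0 - fz) + c1 * fz;
}
```

A DRR pixel would then be the sum of such samples taken at regular steps along the ray through that pixel. GPUs can perform this interpolation in texture hardware, which is one reason ray casting maps well to them.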

Solving the Registration Min. Problem
Gauss-Newton method. Let $s^n$ be the previous approximation and let $s^{n+1} = s^n + \Delta s$.
The nonlinear least squares problem:
$$\min_{s^{n+1}} \sum_{i,j} \bigl(F(s^{n+1})_{i,j} - G_{i,j}\bigr)^2$$
Taylor expansion:
$$F(s^{n+1})_{i,j} \approx F(s^n)_{i,j} + \nabla F(s^n)_{i,j}^T \, \Delta s$$
The approximate linear least squares problem:
$$\min_{\Delta s} \sum_{i,j} \bigl(F(s^n)_{i,j} + \nabla F(s^n)_{i,j}^T \, \Delta s - G_{i,j}\bigr)^2$$

Gauss-Newton Method
The linear LS problem can be written as:
$$\min_x \|Ax - b\|_2^2$$
- $A = \nabla F(s^n)$: $N^2 \times 6$ matrix
- $b = G - F(s^n)$: $N^2 \times 1$ vector
- $x = \Delta s$: $6 \times 1$ vector
The linear LS problem is solved by the normal equations:
$$A^T A x = A^T b$$
The procedure is repeated until $s^n$ converges.
Forming $A$, $A^T A$, and $A^T b$ is computationally expensive.
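One Gauss-Newton update via the normal equations can be sketched as follows. This is a serial illustration under stated assumptions: the names `Vec6`, `Mat6`, `solve6`, and `gaussNewtonStep` are hypothetical, the 6×6 system is solved by Gaussian elimination with partial pivoting (the slides do not say which solver was used), and $A^T A$ is assumed to be nonsingular.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

constexpr std::size_t kParams = 6;
using Vec6 = std::array<double, kParams>;
using Mat6 = std::array<Vec6, kParams>;

// Solve M x = rhs by Gaussian elimination with partial pivoting.
// M and rhs are taken by value because they are overwritten.
// Assumes M is nonsingular.
Vec6 solve6(Mat6 M, Vec6 rhs) {
    for (std::size_t c = 0; c < kParams; ++c) {
        // Pick the row with the largest pivot in column c.
        std::size_t p = c;
        for (std::size_t r = c + 1; r < kParams; ++r) {
            if (std::fabs(M[r][c]) > std::fabs(M[p][c])) p = r;
        }
        std::swap(M[c], M[p]);
        std::swap(rhs[c], rhs[p]);
        // Eliminate column c from the rows below.
        for (std::size_t r = c + 1; r < kParams; ++r) {
            const double f = M[r][c] / M[c][c];
            for (std::size_t k = c; k < kParams; ++k) M[r][k] -= f * M[c][k];
            rhs[r] -= f * rhs[c];
        }
    }
    // Back substitution.
    Vec6 x{};
    for (std::size_t c = kParams; c-- > 0;) {
        double acc = rhs[c];
        for (std::size_t k = c + 1; k < kParams; ++k) acc -= M[c][k] * x[k];
        x[c] = acc / M[c][c];
    }
    return x;
}

// A holds one row of 6 derivatives per pixel; b holds the residual
// G - F(s^n) per pixel. Returns the update x with A^T A x = A^T b.
Vec6 gaussNewtonStep(const std::vector<Vec6>& A, const std::vector<double>& b) {
    Mat6 AtA{};
    Vec6 Atb{};
    for (std::size_t p = 0; p < A.size(); ++p) {
        for (std::size_t r = 0; r < kParams; ++r) {
            Atb[r] += A[p][r] * b[p];
            for (std::size_t c = 0; c < kParams; ++c) {
                AtA[r][c] += A[p][r] * A[p][c];
            }
        }
    }
    return solve6(AtA, Atb);
}
```

The solve itself is tiny (6 unknowns); the cost lies in the loop over all pixels that accumulates $A^T A$ and $A^T b$, which is the part worth parallelizing.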

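Parallel Computation of A and A^T A
- The entries of $A$ are image gradients at pixel $(i, j)$.
- The gradients are approximated by finite differences, e.g.
$$\frac{\partial F(s^n)_{i,j}}{\partial \theta_x} \approx \frac{F(s^n + \Delta s_x)_{i,j} - F(s^n)_{i,j}}{\Delta\theta_x}, \qquad \Delta s_x = (\Delta\theta_x, 0, 0, 0, 0, 0)$$
- $F(s^n + \Delta s_x)$ is obtained from the DRR by changing the rotation angle $\theta_x \to \theta_x + \Delta\theta_x$ while keeping the other parameters fixed.
- The DRRs, as well as the subtractions and divisions, are computed in parallel.
- The six columns of $A$ are $\partial F/\partial\theta_x$, $\partial F/\partial\theta_y$, $\partial F/\partial\theta_z$, $\partial F/\partial t_x$, $\partial F/\partial t_y$, $\partial F/\partial t_z$.
- To compute $A^T A$, take two derivative arrays, multiply the corresponding elements in parallel, and then use a reduction operation to compute the global sum.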
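The two building blocks described above, a finite-difference column of $A$ and a multiply-then-reduce entry of $A^T A$, can be sketched serially as follows. The function names are hypothetical, and `std::inner_product` stands in for the parallel element-wise multiply and reduction that would run on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Forward-difference approximation of one column of A.
// drrBase is F(s^n); drrShifted is the DRR with one parameter perturbed
// by delta. Both are flattened images of equal size.
std::vector<double> derivativeColumn(const std::vector<double>& drrBase,
                                     const std::vector<double>& drrShifted,
                                     double delta) {
    assert(drrBase.size() == drrShifted.size());
    assert(delta != 0.0);
    std::vector<double> col(drrBase.size());
    for (std::size_t p = 0; p < col.size(); ++p) {
        // Each pixel is independent, so this loop is data parallel.
        col[p] = (drrShifted[p] - drrBase[p]) / delta;
    }
    return col;
}

// One entry of A^T A: element-wise product of two columns, then a sum.
double normalMatrixEntry(const std::vector<double>& colA,
                         const std::vector<double>& colB) {
    assert(colA.size() == colB.size());
    return std::inner_product(colA.begin(), colA.end(), colB.begin(), 0.0);
}
```

Since $A^T A$ is symmetric, only 21 of its 36 entries need to be computed this way; the entries of $A^T b$ follow the same pattern with the residual image in place of the second column.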

2D-3D Image Registration on GPU
[Flow diagram: starting from the parameters s and the portal image, form A, A^T A, and A^T b in parallel on the GPU (matrix and vector computation), then solve A^T A x = A^T b]

Parallel Systems
PC clusters:
- powerful
- not portable
- hard to maintain
[Figures: Roadrunner@LANL, 1.1 PFLOP/s; Jaguar@ORNL, 1.06 PFLOP/s]

Multi-Core Systems
- Dual and quad cores on PCs
- Cell Broadband Engine
- Graphics Processing Units (GPUs)

Multi-Core Processors
Specifications                   Core i7 960 (45nm)                           GTX285 (55nm)
Processing Elements              4 cores, 4 way SIMD @3.2 GHz                 30 cores, 8 way SIMD @1.5 GHz
Resident Strands/Threads (max)   4 cores, 2 threads, 4 way SIMD: 32 strands   30 cores, 32 SIMD vectors, 32 way SIMD: 30720 threads
SP GFLOP/s                       102                                          1080
Memory Bandwidth                 25.6 GB/s                                    159 GB/s
Register File                    -                                            1.875 MB
Local Store                      -                                            480 kB

Cell Broadband Engine Processors

GPU for Computing
GPUs have evolved into very flexible and powerful processors:
- They are programmable using high-level languages.
- They support 32-bit floating point precision.
- They offer lots of GFLOPS.
[Chart: GFLOPS by GPU generation. Legend: G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]
There is a GPU in every PC and workstation.

GPU: Graphics Processing Unit
The GPU is specialized for compute-intensive, highly data-parallel computation (exactly what graphics rendering is about).
[Diagram: CPU and GPU layouts, with labels Control, Cache, ALU, and DRAM]
Low-latency floating point (FP) computation.
Applications:
- Game effects, physics, image processing
- Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting

GPU Programming Model
[Diagram: GPU w/ local DRAM (device) and CPU (host)]

GPU Programming Language: CUDA
Integrated host+device C program:
- Serial code (host) ...
- Parallel kernel (device): KernelA<<< nblk, ntid >>>(args); ...
- Serial code (host) ...
- Parallel kernel (device): KernelB<<< nblk, ntid >>>(args); ...

Grids, Blocks, Threads, and Memory
[Diagram: the host launches Kernel 1 and Kernel 2 on the device as Grid 1 and Grid 2. Each grid is an array of blocks, e.g. Block (0, 0) through Block (2, 1), and each block is an array of threads, e.g. Thread (0, 0) through Thread (4, 2). Each thread has its own registers and local memory, each block has shared memory, and the host accesses the device's global, constant, and texture memory.]

RapidMind Overview
RapidMind provides:
1. A flexible platform that allows an arbitrary algorithm to be expressed and efficiently mapped to both multi-core CPUs and GPUs
2. Accelerated volume processing components that provide core building blocks for medical imaging applications
Copyright 2009 RapidMind Inc.

RapidMind System Architecture
API
- Intuitive, integrates with C++, and requires no new tools or workflow
Platform
- Code Optimizer analyzes and optimizes computations to remove overhead
- Load Balancer plans and synchronizes work to keep all cores fully utilized
- Data Manager reduces data bottlenecks
- Logging/Diagnostics detects and reports performance bottlenecks
Processor Support Modules
- x86 processors from AMD and Intel
- ATI/AMD and NVIDIA GPUs
- Cell Blade, Cell Accelerator Board, PS3
Copyright 2008, RapidMind, Inc.

RapidMind Programming Model
- Interface extracts computation expressed in C++ while eliminating overhead
- Code generator creates native machine code
- Runtime tightly couples multiple optimizations and manages execution over multiple cores
[Diagram labels: Standard C++ using RapidMind interface; Standard C++ Tools; Platform specific code; RapidMind Collection; RapidMind Compilation; Standard executable with embedded RapidMind operations; Massively parallel computation; RapidMind Execution; Multicore Processor; Streaming execution]

RapidMind Platform Interface Summary
Usage:
- Include platform header
- Link to runtime library
Data:
- Values
- Arrays
- Data abstraction
Programs:
- Defined dynamically
- Execute on coprocessors
- Code abstraction
Example:
    #include <rapidmind/platform.hpp>
    using namespace rapidmind;

    // Values
    Value1f f = 2.0f;

    // Arrays
    Array<2,Value3f> a(512,512);
    Array<2,Value3f> b(512,512);

    // Program, defined dynamically
    Program prog = BEGIN {
        In<Value3f> r, s;
        Out<Value3f> q;
        q = (r + s) * f;
    } END;

    a = prog(a,b);
    f = 3.0f;
    stride(a,2,2) = prog(
        slice(a,0,255,0,255),
        slice(b,256,511,0,255));

Numerical Experiments
3D image volume:
- Artificial data: white cube (128×128×128)
- Clinical 3D CT data (Univ. of Iowa Health Care): tripod fracture of a skull (128×128×100)
Simulate the template 2D image by projecting the 3D volume data with known parameters.
Standard PC with NVIDIA GeForce 8800 GTX.
[Figures: artificial data; real CT data]

Numerical Results (Artificial data)
Resolution     Measure                     C++       RapidMind
16×16×16       Time per iteration (sec)    0.015     0.138
               Iterations                  6         6
32×32×32       Time per iteration (sec)    0.200     0.186
               Iterations                  5         5
64×64×64       Time per iteration (sec)    5.308     0.262
               Iterations                  5         5
128×128×128    Time per iteration (sec)    56.720    0.400
               Iterations                  6         6

Numerical Results (Clinical data)
Portal Image Parameters                                  Measure             C++       RapidMind
Rotations: (2, 2, 2); Translations: (2mm, 2mm, 2mm)      Total Time (sec)    168.16    1.75
                                                         Iterations          5         5
Rotations: (4, 4, 4); Translations: (4mm, 4mm, 4mm)      Total Time (sec)    263.30    2.70
                                                         Iterations          8         8
Rotations: (6, 6, 6); Translations: (6mm, 6mm, 6mm)      Total Time (sec)    371.39    3.65
                                                         Iterations          11        11

Numerical Results (Comparison)
[Bar chart: timing (sec); value labels 340, 371, 2.0, 3.7]

Conclusion
- We have developed an efficient 2D-3D rigid image registration algorithm that is amenable to GPU processing.
- We implemented the algorithm using RapidMind to exploit the high parallelism of GPUs.
- Numerical results show that the GPU code is about 100 times faster than the CPU code.
- For real image datasets, 2D-3D image registration takes around 3 seconds.

Challenges
- Portability: a standard (something like MPI) is needed for programming on GPU, Cell, etc.
- Floating point precision.
- Math libraries such as BLAS, LAPACK, FFT.