Georgia Institute of Technology, August 17, Justin W. L. Wan. Canada Research Chair in Scientific Computing


Real-Time Rigid 2D-3D Medical Image Registration Using RapidMind Multi-Core Platform
Georgia Tech/AFRL Workshop on Computational Science Challenge Using Emerging & Massively Parallel Computer Architectures
Georgia Institute of Technology, August 17, 2009
Justin W. L. Wan, Canada Research Chair in Scientific Computing
David R. Cheriton School of Computer Science, University of Waterloo
Joint work with Lin Xu (Princess Margaret Hospital)

Outline
Rigid (2D-3D) image registration
Modeling and numerical solution
Multi-core/GPGPU programming
RapidMind Multi-core Development Platform
Results

Medical Imaging

Image Registration
[Figure: template, target, and transformed template images]
Find a transformation that best maps the template image to the target image.
Best match:
  Minimize differences
  Maximize similarities

Applications
Diagnosis: Combine different information from multiple imaging modalities.
Monitor disease progression: Align tissues/organs that have changed in size, shape, or position over time.
Image-guided surgery or radiotherapy: Align pre-operative images and surgical plans to the images obtained in real time during surgery.
Patient comparison: Compare an individual's anatomy to a standardized atlas.

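Image Registration Problem
Image: a mapping from the image domain to the intensity range.
Template image F and target image G, each defined on its own image domain.
Find a transformation Φ between the two image domains such that corresponding points in F and G are aligned.
[Figure: images F and G, with a point x mapped to y = Φ(x)]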

Classification
Based on types of transformations.
Rigid: translation + rotation
Nonrigid:
  affine, projective
  physically based: elastic, fluid

Rigid Image Registration Only rotations and translations are considered.

Rigid Image Registration
Used for registration where there is little or no distortion.
Often serves as a pre-alignment step for nonrigid registration.
2D-2D: composed of 2 translations and 1 rotation.
3D-3D: composed of 3 translations and 3 rotations.
2D-3D image registration:
  important in medical applications such as radiation therapy and computer-assisted surgery.
  aligns a 2D image with a 3D image volume.

2D-3D Image Registration
[Figure: real world vs. virtual world]
The position of the patient can be estimated when the DRR matches the portal X-ray image.

Rigid Image Registration Model
$$\min_{s} \|F(s) - G\|_2^2 = \min_{s} \sum_{i,j} \big( F(s)_{i,j} - G_{i,j} \big)^2$$
$s$ = transformation parameters $(\theta_x, \theta_y, \theta_z, t_x, t_y, t_z)$ (three rotation angles and three translations)
$F(s)$ = digitally reconstructed radiograph (DRR)
$G$ = portal image
Match the DRR and portal image by appropriately rotating and translating the 3D volume.
Two major steps:
  Construction of DRR
  Solving the minimization problem
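
The objective on this slide is a sum of squared pixel differences, which can be sketched in a few lines of C++. This is only an illustration, not the authors' code: the function name `ssd` and the flat `std::vector<float>` image layout are assumptions made here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum of squared differences between a DRR and a portal image.
// Both images are assumed to be flattened to arrays of equal length.
double ssd(const std::vector<float>& drr, const std::vector<float>& portal) {
    assert(drr.size() == portal.size());
    double sum = 0.0;
    for (std::size_t k = 0; k < drr.size(); ++k) {
        const double diff = static_cast<double>(drr[k]) - portal[k];
        sum += diff * diff;  // accumulate in double to limit round-off
    }
    return sum;
}
```

A registration loop would re-render the DRR for each candidate parameter vector and evaluate this quantity on the result.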

Construction of DRR DRR is constructed by perspective projection of 3D image volume onto a given plane.

Volume Rendering of 3D Image Ray casting: Light rays pass from the source through the pixels of DRR into 3D image volume. Pixel values = accumulated intensities of 3D image. Intensities at x are given by trilinear interpolation.
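
Trilinear interpolation, used to sample the volume along each ray, can be sketched as follows. The `Volume` struct, its x-fastest memory layout, and the function name `trilinear` are hypothetical choices for this sketch; the slides do not say how the volume is stored.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical volume container: nx * ny * nz voxels, x varying fastest.
struct Volume {
    std::size_t nx, ny, nz;
    std::vector<float> data;
    float at(std::size_t i, std::size_t j, std::size_t k) const {
        return data[(k * ny + j) * nx + i];
    }
};

// Trilinear interpolation at a continuous position (x, y, z) in voxel
// coordinates. The caller must keep the point inside the volume:
// 0 <= x <= nx - 1, and likewise for y and z.
float trilinear(const Volume& v, float x, float y, float z) {
    assert(v.nx >= 2 && v.ny >= 2 && v.nz >= 2);
    assert(x >= 0.0f && y >= 0.0f && z >= 0.0f);
    // Lower corner of the enclosing cell, clamped so that corner + 1 is valid.
    auto cell = [](float p, std::size_t n) {
        const std::size_t i = static_cast<std::size_t>(std::floor(p));
        return i > n - 2 ? n - 2 : i;
    };
    const std::size_t i = cell(x, v.nx), j = cell(y, v.ny), k = cell(z, v.nz);
    const float fx = x - static_cast<float>(i);  // fractional offsets in [0, 1]
    const float fy = y - static_cast<float>(j);
    const float fz = z - static_cast<float>(k);

    // Interpolate along x on four cell edges, then along y, then along z.
    const float c00 = v.at(i, j, k) * (1 - fx) + v.at(i + 1, j, k) * fx;
    const float c10 = v.at(i, j + 1, k) * (1 - fx) + v.at(i + 1, j + 1, k) * fx;
    const float c01 = v.at(i, j, k + 1) * (1 - fx) + v.at(i + 1, j, k + 1) * fx;
    const float c11 = v.at(i, j + 1, k + 1) * (1 - fx) + v.at(i + 1, j + 1, k + 1) * fx;
    const float c0 = c00 * (1 - fy) + c10 * fy;
    const float c1 = c01 * (1 - fy) + c11 * fy;
    return c0 * (1 - fz) + c1 * fz;
}
```

A ray caster would call such a function at each sample point along a ray and accumulate the returned intensities into the DRR pixel.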

Solving the Registration Min. Problem
Gauss-Newton method. Let $s^n$ = previous approx. Let $s^{n+1} = s^n + \Delta s$.
The nonlinear least squares problem:
$$\min_{\Delta s} \sum_{i,j} \big( F(s^{n+1})_{i,j} - G_{i,j} \big)^2$$
Taylor expansion:
$$F(s^{n+1})_{i,j} \approx F(s^n)_{i,j} + \nabla F(s^n)_{i,j}^T \, \Delta s$$
The approx. linear LS problem:
$$\min_{\Delta s} \sum_{i,j} \big( F(s^n)_{i,j} + \nabla F(s^n)_{i,j}^T \, \Delta s - G_{i,j} \big)^2$$

Gauss-Newton Method
The linear LS problem can be written as:
$$\min_{x} \|Ax - b\|_2^2$$
$A = \nabla F(s^n)$ = $N^2 \times 6$ matrix
$b = G - F(s^n)$ = $N^2 \times 1$ vector
$x = \Delta s$ = $6 \times 1$ vector
The linear LS problem is solved by the normal equations:
$$A^T A x = A^T b$$
The procedure is repeated until $s^n$ converges.
Computationally expensive to form $A$, $A^T A$, and $A^T b$.
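
Once the 6 × 6 matrix of the normal equations has been formed, the solve itself is cheap. The sketch below uses a Cholesky factorization, which is one standard choice for a symmetric positive definite system; the slides do not state which solver was used, and the names `Mat`, `Vec`, and `solve_spd` are introduced here for illustration. It assumes A has full column rank so that the matrix is positive definite.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

template <std::size_t N>
using Mat = std::array<std::array<double, N>, N>;
template <std::size_t N>
using Vec = std::array<double, N>;

// Solve M x = r for a symmetric positive definite M (for example
// M = A^T A and r = A^T b) using the Cholesky factorization M = L L^T.
template <std::size_t N>
Vec<N> solve_spd(const Mat<N>& M, const Vec<N>& r) {
    Mat<N> L{};
    for (std::size_t i = 0; i < N; ++i) {
        for (std::size_t j = 0; j <= i; ++j) {
            double sum = M[i][j];
            for (std::size_t k = 0; k < j; ++k) sum -= L[i][k] * L[j][k];
            if (i == j) {
                assert(sum > 0.0);  // fails if M is not positive definite
                L[i][i] = std::sqrt(sum);
            } else {
                L[i][j] = sum / L[j][j];
            }
        }
    }
    // Forward substitution: L y = r.
    Vec<N> y{};
    for (std::size_t i = 0; i < N; ++i) {
        double sum = r[i];
        for (std::size_t k = 0; k < i; ++k) sum -= L[i][k] * y[k];
        y[i] = sum / L[i][i];
    }
    // Back substitution: L^T x = y.
    Vec<N> x{};
    for (std::size_t i = N; i-- > 0;) {
        double sum = y[i];
        for (std::size_t k = i + 1; k < N; ++k) sum -= L[k][i] * x[k];
        x[i] = sum / L[i][i];
    }
    return x;
}
```

In a Gauss-Newton iteration the returned vector would be added to the current parameter estimate, and the process repeated until the update becomes negligible.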

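Parallel Computation of $A$ and $A^T A$
The entries of $A$ are image gradients at pixel $(i, j)$.
The gradients are approximated by finite differences. E.g.
$$\frac{\partial F(s^n)_{i,j}}{\partial \theta_x} \approx \frac{F(s^n + \Delta s_x)_{i,j} - F(s^n)_{i,j}}{\Delta\theta_x}, \qquad \Delta s_x = (\Delta\theta_x, 0, 0, 0, 0, 0)$$
$F(s^n + \Delta s_x)$ is obtained from the DRR by changing the rotation angle $\theta_x \to \theta_x + \Delta\theta_x$, keeping the other parameters fixed.
DRRs as well as subtractions and divisions are computed in parallel.
Six columns of $A$ are: $\partial F/\partial\theta_x$, $\partial F/\partial\theta_y$, $\partial F/\partial\theta_z$, $\partial F/\partial t_x$, $\partial F/\partial t_y$, $\partial F/\partial t_z$.
To compute $A^T A$, take 2 derivative arrays, multiply the corresponding elements in parallel, and then use a reduction operation to compute the global sum.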
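
The two building blocks described on this slide, a finite-difference column and an elementwise multiply followed by a reduction, can be sketched serially in C++. The names `Image`, `gradient_column`, `dot`, and `normal_matrix` are hypothetical; on a GPU each loop body would run as one data-parallel operation rather than a serial loop.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

using Image = std::vector<float>;  // flattened N x N image (assumed layout)

// One column of A: forward difference between a perturbed DRR and the
// current DRR, divided by the parameter step h.
Image gradient_column(const Image& drr_perturbed, const Image& drr, float h) {
    assert(drr_perturbed.size() == drr.size() && h != 0.0f);
    Image col(drr.size());
    for (std::size_t k = 0; k < drr.size(); ++k)
        col[k] = (drr_perturbed[k] - drr[k]) / h;
    return col;
}

// One entry of A^T A: multiply two derivative arrays element by element
// and reduce the products to a single sum.
double dot(const Image& u, const Image& v) {
    assert(u.size() == v.size());
    return std::inner_product(u.begin(), u.end(), v.begin(), 0.0);
}

// Assemble the symmetric 6 x 6 matrix A^T A from the six columns of A.
std::array<std::array<double, 6>, 6> normal_matrix(const std::array<Image, 6>& cols) {
    std::array<std::array<double, 6>, 6> ata{};
    for (std::size_t p = 0; p < 6; ++p)
        for (std::size_t q = p; q < 6; ++q)
            ata[p][q] = ata[q][p] = dot(cols[p], cols[q]);
    return ata;
}
```

Only 21 of the 36 reductions are needed because the matrix is symmetric. The right-hand side can be formed the same way, by reducing each column against the residual image.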

2D-3D Image Registration on GPU
[Diagram: iteration loop. From the current parameters s and the portal image, form A, A^T A, and A^T b in parallel on the GPU (matrix & vector computation), then solve A^T A x = A^T b and update s.]

Parallel Systems
PC clusters:
  powerful
  not portable
  hard to maintain
Roadrunner@LANL: 1.1 PF/s
Jaguar@ORNL: 1.06 PF/s

Multi-Core Systems
Dual & quad cores on PCs
Cell Broadband Engine
Graphics Processing Units (GPUs)

Multi-Core Processors
Specifications                    Core i7 960 (45nm)                             GTX285 (55nm)
Processing Elements               4 cores, 4 way SIMD @3.2 GHz                   30 cores, 8 way SIMD @1.5 GHz
Resident Strands/Threads (max)    4 cores, 2 threads, 4 way SIMD: 32 strands     30 cores, 32 SIMD vectors, 32 way SIMD: 30720 threads
SP GFLOP/s                        102                                            1080
Memory Bandwidth                  25.6 GB/s                                      159 GB/s
Register File                     -                                              1.875 MB
Local Store                       -                                              480 kB

Cell Broadband Engine Processors

GPU for Computing
GPUs have evolved into very flexible and powerful processors:
  It's programmable using high-level languages
  It supports 32-bit floating point precision
  It offers lots of GFLOPS
[Chart: GFLOPS by GPU generation. G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]
GPU in every PC and workstation.

GPU: Graphics Processing Unit
GPU is specialized for compute-intensive, highly data-parallel computation (exactly what graphics rendering is about).
[Figure: CPU (control, cache, a few ALUs, DRAM) vs. GPU (many ALUs, DRAM)]
Low-latency floating point (FP) computation.
Applications:
  Game effects, physics, image processing
  Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting.

GPU Programming Model
GPU w/ local DRAM (device)
CPU (host)

GPU Programming Language: CUDA
Integrated host+device C program
  Serial Code (host)
  Parallel Kernel (device): KernelA<<< nblk, ntid >>>(args);
  ...
  Serial Code (host)
  Parallel Kernel (device): KernelB<<< nblk, ntid >>>(args);
  ...

Grids, Blocks, Threads, and Memory
[Figure: The host launches Kernel 1 and Kernel 2 on the device as Grid 1 and Grid 2. Each grid is a 2D array of blocks, e.g. Block (0, 0) to Block (2, 1), and each block is a 2D array of threads, e.g. Thread (0, 0) to Thread (4, 2). Each thread has its own registers and local memory, each block has shared memory, and the host connects to global, constant, and texture memory.]

RapidMind Overview
RapidMind provides:
1. A flexible platform that allows an arbitrary algorithm to be expressed and efficiently mapped to both multi-core CPUs and GPUs
2. Accelerated volume processing components that provide core building blocks for medical imaging applications
Copyright 2009 RapidMind Inc.

Portable Architectures
Trends: Massive parallelism, heterogeneity, and hybrid computing
RapidMind provides portability, scalability and future-proofing

RapidMind System Architecture
API:
  Intuitive, integrates with C++, and requires no new tools or workflow
Platform:
  Code Optimizer analyzes and optimizes computations to remove overhead
  Load Balancer plans and synchronizes work to keep all cores fully utilized
  Data Manager reduces data bottlenecks
  Logging/Diagnostics detects and reports performance bottlenecks
Processor Support Modules:
  x86 processors from AMD and Intel
  ATI/AMD and NVIDIA GPUs
  Cell Blade, Cell Accelerator Board, PS3
Copyright 2008, RapidMind, Inc.

RapidMind Programming Model
Standard C++ using RapidMind interface, built with standard C++ tools
Interface extracts computation expressed in C++ while eliminating overhead
Code generator creates native machine code
Runtime tightly couples multiple optimizations and manages execution over multiple cores
[Diagram labels: RapidMind Collection, RapidMind Compilation, RapidMind Execution; platform-specific code; standard executable with embedded RapidMind operations; massively parallel computation; multicore processor; streaming execution]

RapidMind Basic Types
Type       Purpose
Value      Container for fixed-length data
Array      Container for variable-sized multidimensional data
Program    Container for computations

RapidMind Values
Value<3, float>
Tuple size: 1, 2, 3, 4
Element type: half, double, float, int

RapidMind Values
Value3f
Tuple size: 1, 2, 3, 4
Element type: h, d, f, i

RapidMind Arrays
Array<2, Value3f>
Dimensionality: 1, 2, 3
Item type: Value4d, Value3f, Value2i

RapidMind Programs
    Program p;                   // Declaration
    p = BEGIN {                  // Definition
      In<Value3f> a, b;          // Interface
      Out<Value3f> c;
      Value3f d = f(a, b);       // Computation
      c = d + a * 2.0f;
    } END;

Program Application
Apply programs to arrays, get new arrays
    C = p(a,b);
Invokes parallel execution
All array elements updated simultaneously
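
For readers without access to RapidMind, the idea of applying a per-element program to whole arrays can be approximated in standard C++. The sketch below is an analogy only and does not use or reproduce the RapidMind API: `Float3`, `apply_program`, and `scaled_sum` are names invented here, and `std::transform` runs serially, whereas the platform would distribute the elements over the available cores.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

using Float3 = std::array<float, 3>;  // stand-in for a 3-tuple of floats

// Apply a per-element kernel to two input arrays and return a new array.
template <typename Kernel>
std::vector<Float3> apply_program(Kernel kernel,
                                  const std::vector<Float3>& a,
                                  const std::vector<Float3>& b) {
    assert(a.size() == b.size());
    std::vector<Float3> c(a.size());
    std::transform(a.begin(), a.end(), b.begin(), c.begin(), kernel);
    return c;
}

// Example kernel: componentwise (r + s) * f, with f captured by value.
auto scaled_sum(float f) {
    return [f](const Float3& r, const Float3& s) {
        return Float3{(r[0] + s[0]) * f, (r[1] + s[1]) * f, (r[2] + s[2]) * f};
    };
}
```

Because each output element depends only on the matching input elements, the elements can be computed in any order, which is the property that makes this style of computation easy to parallelize.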

RapidMind Platform Interface Summary
Usage: Include platform header; link to runtime library
Data: Values, Arrays; data abstraction
Programs: Defined dynamically; execute on coprocessors; code abstraction
    #include <rapidmind/platform.hpp>
    using namespace rapidmind;
    Value1f f = 2.0f;
    Array<2,Value3f> a(512,512);
    Array<2,Value3f> b(512,512);
    Program prog = BEGIN {
      In<Value3f> r, s;
      Out<Value3f> q;
      q = (r + s) * f;
    } END;
    a = prog(a,b);
    f = 3.0f;
    stride(a,2,2) = prog(
      slice(a,0,255,0,255),
      slice(b,256,511,0,255));

Numerical Experiments
3D image volume:
  Artificial data: white cube (128×128×128)
  Clinical 3D CT data (Univ. of Iowa Health Care): tripod fracture of a skull (128×128×100)
Simulate the template 2D image by projecting the 3D volume data with known parameters.
Standard PC with NVIDIA GeForce 8800 GTX.
[Figures: artificial data; real CT data]

Numerical Results (Artificial data)
Resolution     Measure                     C++       RapidMind
16×16×16       Time per iteration (sec)    0.015     0.138
               Iterations                  6         6
32×32×32       Time per iteration (sec)    0.200     0.186
               Iterations                  5         5
64×64×64       Time per iteration (sec)    5.308     0.262
               Iterations                  5         5
128×128×128    Time per iteration (sec)    56.720    0.400
               Iterations                  6         6
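
As a reading aid, the speedup per iteration implied by the artificial-data timings (C++ time divided by RapidMind time, from smallest to largest resolution) works out to:

$$\frac{0.015}{0.138} \approx 0.11, \qquad \frac{0.200}{0.186} \approx 1.08, \qquad \frac{5.308}{0.262} \approx 20.3, \qquad \frac{56.720}{0.400} = 141.8$$

At the smallest resolution the RapidMind version is slower than the C++ code, and the advantage grows with problem size.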

Numerical Results (Clinical data)
Portal image parameters                                  Measure             C++       RapidMind
Rotations: (2, 2, 2); Translations: (2mm, 2mm, 2mm)      Total time (sec)    168.16    1.75
                                                         Iterations          5         5
Rotations: (4, 4, 4); Translations: (4mm, 4mm, 4mm)      Total time (sec)    263.30    2.70
                                                         Iterations          8         8
Rotations: (6, 6, 6); Translations: (6mm, 6mm, 6mm)      Total time (sec)    371.39    3.65
                                                         Iterations          11        11
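
For the clinical data, dividing each C++ total time by the corresponding RapidMind total time gives the overall speedup for the three parameter settings:

$$\frac{168.16}{1.75} \approx 96.1, \qquad \frac{263.30}{2.70} \approx 97.5, \qquad \frac{371.39}{3.65} \approx 101.8$$

All three ratios are close to 100.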

Numerical Results (Comparison)
[Bar chart: Timing (sec); labeled values 340, 371, 2.0, 3.7]

Conclusion
We have developed an efficient 2D-3D rigid image registration method that is amenable to GPU processing.
Implemented the algorithm using RapidMind to exploit the high parallelism of GPUs.
Numerical results show that the GPU code is about 100 times faster than the CPU code.
For real image datasets, it takes around 3 seconds to perform a 2D-3D image registration.

Challenges
Portability. Need a standard (something like MPI) for programming on GPU, Cell, etc.
Floating point precision.
Math libraries such as BLAS, LAPACK, FFT.