Accelerating image registration on GPUs

Similar documents
Numerical Algorithms on Multi-GPU Architectures

Multigrid algorithms on multi-gpu architectures

Large scale Imaging on Current Many- Core Platforms

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

Master Thesis Accelerating Image Registration on GPUs

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

How to Optimize Geometric Multigrid Methods on GPUs

Massively Parallel Architectures

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include

GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum

Tesla Architecture, CUDA and Optimization Strategies

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

Software and Performance Engineering for numerical codes on GPU clusters

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

NVIDIA s Compute Unified Device Architecture (CUDA)

NVIDIA s Compute Unified Device Architecture (CUDA)

Real-time Graphics 9. GPGPU

Introduction to CUDA

GPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27

ECE 574 Cluster Computing Lecture 15

Real-time Graphics 9. GPGPU

Introduction to CUDA Programming

Introduction to Parallel Computing with CUDA. Oswald Haan

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

Technical Application Field. Scientific Computing. Applied Numerics

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

Introduction to GPU hardware and to CUDA

PhD Student. Associate Professor, Co-Director, Center for Computational Earth and Environmental Science. Abdulrahman Manea.

Scientific discovery, analysis and prediction made possible through high performance computing.

University of Bielefeld

Comparison of High-Speed Ray Casting on GPU

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 160 Lecture 24. Graphical Processing Units

GPUs and GPGPUs. Greg Blanton John T. Lubia

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

ECE 574 Cluster Computing Lecture 17

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten

GPU programming. Dr. Bernhard Kainz

JCudaMP: OpenMP/Java on CUDA

GPU Programming Using CUDA

CS427 Multicore Architecture and Parallel Computing

GPGPU/CUDA/C Workshop 2012

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

GPU programming: CUDA basics. Sylvain Collange Inria Rennes Bretagne Atlantique

GPU for HPC. October 2010

GPU Programming for Mathematical and Scientific Computing

GPGPU. Alan Gray/James Perry EPCC The University of Edinburgh.

CUDA GPGPU Workshop CUDA/GPGPU Arch&Prog

General Purpose GPU Computing in Partial Wave Analysis

Using a GPU in InSAR processing to improve performance

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture

S WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018

GPU 101. Mike Bailey. Oregon State University. Oregon State University. Computer Graphics gpu101.pptx. mjb April 23, 2017

GPU 101. Mike Bailey. Oregon State University

CS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra)

CUDA (Compute Unified Device Architecture)

Parallel Numerical Algorithms

Graph Partitioning. Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM.

High Performance Computing on GPUs using NVIDIA CUDA

Paralization on GPU using CUDA An Introduction

Programmable Graphics Hardware (GPU) A Primer

High Performance Computing and GPU Programming

CUDA Architecture & Programming Model

"On the Capability and Achievable Performance of FPGAs for HPC Applications"

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS

GPU Programming Using NVIDIA CUDA

High-Performance Computing Using GPUs

CUDA Programming Model

GPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh

high performance medical reconstruction using stream programming paradigms

Geometric Multigrid on Multicore Architectures: Performance-Optimized Complex Diffusion

Introduction to CUDA

Josef Pelikán, Jan Horáček CGG MFF UK Praha

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

GRAPHICS HARDWARE. Niels Joubert, 4th August 2010, CS147

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

Introduction to CELL B.E. and GPU Programming. Agenda

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

Computer Architecture

OpenMP and GPU Programming

GPGPU, 4th Meeting Mordechai Butrashvily, CEO GASS Company for Advanced Supercomputing Solutions

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA CME343 / ME May James Balfour [ NVIDIA Research

Accelerating CFD with Graphics Hardware

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

Introduction to Multicore Programming

General-purpose computing on graphics processing units (GPGPU)

Automatic Generation of Algorithms and Data Structures for Geometric Multigrid. Harald Köstler, Sebastian Kuckuk Siam Parallel Processing 02/21/2014

INTRODUCTION TO GPU COMPUTING WITH CUDA. Topi Siro

PARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5) Rob van Nieuwpoort

GPGPU introduction and network applications. PacketShaders, SSLShader

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto

GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS

Transcription:

Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010

Contents Motivation: Image registration with FAIR GPU Programming Combining Matlab and CUDA Future Work 2

Image registration in FAIR MOTIVATION 3

Image Registration in FAIR 4

Variational model for image registration Find mapping or deformation field u : R d d a R, d { 2,3} such that the energy functional E( u) = D[ J,u] + α S[ u] with data term D and regularizer S is minimized. 5

FAIR: Flexible Algorithms for Image Registration Framework for image registration in Matlab Different transformations, data terms, regularizers Different (multi-level) numerical optimization techniques Constrained image registration FAIR, J. Modersitzki, SIAM, 2009 6

Image Registration cycle 7

Runtime distribution for rigid registration cycle 4% 22% 15% Functional Optimization 59% Interpolation Matlab plots 8

GPUs COMPUTER ARCHITECTURE 9

Nvidia GeForce GTX 295 Costs: 450 Interface: PCI-E 2.0 x16 Shader Clock: 1242 MHz Memory Clock: 999 MHz Memory Bandwidth: 2x112 GB/s FLOPS: 2x894 GFLOPS Max Power Draw: 289 W Framebuffer: 2x896 MB Memory Bus: 2x448 bit Shader Processors: 2x240 10

Why GPUs? Graphics Processing Units (GPUs) are a massively parallel computer architecture GPUs offer high computational performance at low costs GPGPU General-Purpose computing on a GPU Using graphic hardware for non-graphic computations Programming Tools: CUDA or OpenCL Without explicitly parallel algorithms, the performance potential of current hardware architectures cannot be used any more 11

CPU vs. GPU CPUs are great for task parallelism Fast caches Branching adaptability GPUs are great for data parallelism Multiple ALUs Fast onboard memory High throughput on parallel tasks Executes program on each fragment Think of the GPU (device) as a massively-threaded coprocessor 12

Memory Architecture Constant Memory Shared Memory Texture Memory Device Memory 13

Data-parallel Programming What is CUDA? Compute Unified Device Architecture Software architecture for managing data-parallel programming Write kernels (functions) that execute on the device and process multiple data elements in parallel Massive threading Local memory 14

Paradigm of the CUDA toolkit I 0 1 2 3 4 5 Thread 0 Thread 1 Thread 2 OpenMP Divide domain into huge chunks Equally distribute to threads Parallelize outer most loop to minimize overhead 15

Paradigm of the CUDA toolkit II CUDA Divide domain into small pieces But! Data Mapping is different to enblock distribution of OpenMP Alignment constraints must be met No cache and cache lines need to be considered Block 1 Block 2 Thread 0 Thread 1 Thread 2 Thread 0 Thread 1 0 1 2 3 4 Thread 2 5 16

Grid Size and Block Size Grid size: The size and shape of the data that the program will be working on Block size: The block size indicates the sub-area of the original grid that will be assigned to a multiprocessor 17

CUDA Example GPU: global void sum(float* a, float* b, float* c) { int idx = blockidx.x*blockdim.x + threadidx.x; // computes array index for current thread c[idx] = a[idx] + b[idx]; // adds array entry (global memory) } CPU: void vector_add(float *in_a, float *in_b, float *out, unsigned int length){ float* a, *b, *c; Declare data cudamalloc((void**)&a_gpu, sizeof(float)*length); cudamalloc((void**)&b_gpu, sizeof(float)*length); cudamalloc((void**)&out_gpu, sizeof(float)*length); cudamemcpy ( a_gpu,in_a, sizeof ( float ) *length, cudamemcpyhosttodevice ); cudamemcpy ( b_gpu,in_a, sizeof ( float ) *length, cudamemcpyhosttodevice ); dim3 dimblock(256,1); dim3 dimgrid(length/dimblock.x,1); sum<<<dimgrid,dimblock>>>(a,b,c); Allocate data on GPU Copy data to GPU Define block/grid sizes Kernel execution cudamemcpy ( out_gpu,out, sizeof ( float ) *length, cudamemcpydevicetohost );} Write back data to CPU 18

OpenCL OpenCL (Open Computing Language) is an open standard for heterogeneous parallel computing Managed by non-profit technology consortium Khronos group www.khronos.org/opencl Analogous to the open industry standards OpenGL for 3D graphics OpenCL exploits task-based and data-based parallelism Similar to CUDA, OpenCL includes a language (based on C99) for writing kernels executing on devices like GPU, cell, or multicore processors Performance of high level heterogeneous tools? 19

nvmex COMBINING MATLAB AND CUDA 20

MATLAB MEX interface 21

CUDA MEX environment MATLAB script nvmex.m compiles CUDA code to create MATLAB function files 1. Allocate memory on the GPU 2. Transfer the data from the host to the GPU 3. Perform computation on GPU (library, custom code) 4. Transfer results from the GPU to the host 22

Basic interpolation schemes: Host side 23

Basic interpolation schemes: GPU kernel 24

B-Spline Interpolation [Sigg, C. and Hadwiger, M.] Combine 2 linear interpolations to obtain B-spline interpolation 25

Runtime Results Interpolation I 26

Runtime Results Interpolation II 1 0,1 time (ms) 0,01 0,001 SplineInter2D (bilinear texture) Theoretical Best Case runtime Theoretical worst Case runtime 0,0001 2048 8192 32768 131072 27

Discretization Error Accuracy f+hf -f(h) f-f(h) h 28

Persistent Memory and Hybrid Memory Goal: avoid allocation and deallocation of memory for each call to a GPU kernel through Matlab Static variables routine to clear CUDA MEX persistent variable Saves up to 17% runtime for bigger images (2048x1024) 29

CUDA MEX Registration cycle 30

Runtime summary for rigid registration Time includes data transfers and Matlab visualization (plots about 20 %) GPU pays off for large images (speedup factor about 2x, pure GPU kernels about 20x) A lot more potential to move more parts to GPU xsize ysize Matlab Matlab+GPU 128 64 15 s 14 s 256 128 45 s 33 s 512 256 202 s 92 s 31

A Multigrid Solver for Diffusion-based PDEs MULTIGRID ON GPU 32

What is Multigrid? May consume a lot of time during optimization step to find an updated transformation A very fast and efficient iterative solver for huge sparse, linear (and nonlinear) systems A u = f Geometric multigrid on structured grids is well suited for GPU implementation because of regular data access patterns Multigrid exhibits a convergence rate that is independent of the number of unknowns N, i.e. a multigrid solver has complexity O(N) 33

Multigrid Idea Multigrid methods are based on two principles: 1. Smoothing property Smooth error on fine grid 34

Multigrid Idea Multigrid methods are based on two principles: 1. Smoothing property 2. Coarse grid principle Approximate smooth error on coarser grids 35

Complex Diffusion on Nvidia GeForce GTX 295 80 70 runtime V(2,2) in ms 60 50 40 30 20 10 0 1024x1024 1024x2048 2048x2048 2048x4096 4096x4096 Image size Runtime on 2 x C2 Penryn (8 cores) @ 2.8 GHz, 8 x 22,4 GFLOPs is 1100 ms for 4096 x 4096 Speedup factor 14 36

Summary and Future Work When to use GPGPU Suitable algorithms? Get the promised performance? Get it at low effort and cost? Future Work Multi-GPU OpenCL Adapt GPU multigrid to image registration 37

Questions?

Acknowledgements Nvidia Corporation developer.nvidia.com/cuda Technical Brief Architecture Overview CUDA Programming Guide Supercomputing 2008 Education Program: CUDA Introduction, Christian Trefftz / Greg Wolffe (Grand Valley State University) AMD/ATI http://developer.amd.com/gpu_assets/stream_computing_overview. pdf ATI Stream SDK User Guide 39

Possible Improvements 40

Runtime Results Interpolation II 41

Full Multigrid Cycle Smoothing V-cycle Interpolation of solution u Exact solution Restriction of residual r = f - Au Interpolation of error and correction of solution u 42

Multigrid on GTX 295 with red-black splitting 18 16 14 Runtime V(2,2) in ms 12 10 8 6 4 2 0 1024x1024 1024x2048 2048x2048 2048x4096 4096x4096 Image size 43

Memory Bandwidth 100% Percent of memory bandwidth 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% 1024x1024 1024x2048 2048x2048 2048x4096 4096x4096 Image size In percent from maximum measured (rounded) streaming bandwidth (100 GB / s) 44

Frames per second for Image Stitching fps (stitching GTX 295) fps (solver GTX 295) fps (CPU) 143 111 83 71 63 83 45 50 17 9 25 28 4 2 1 1024x1024 1024x2048 2048x2048 2048x4096 4096x4096 CPU: Intel Core2 Quad Q9550@2.83GHz with OpenMP (4 cores) 45

Interpolation and Restriction 46