Master Thesis Accelerating Image Registration on GPUs
1 Master Thesis: Accelerating Image Registration on GPUs
A proof-of-concept migration of FAIR to CUDA
Sunil Ramgopal Tatavarty
Prof. Dr. Ulrich Rüde, Dr.-Ing. Harald Köstler
Lehrstuhl für Systemsimulation, Universität Erlangen-Nürnberg
March 5, 2010
2 Outline
- FAIR
  - Image Registration
  - FAIR
  - Fixed level experiment
- MATLAB on CUDA
  - MATLAB MEX interface
  - CUDA MEX environment
- FAIR on CUDA
  - The Design phase
  - CUDA MEX Interpolation
  - CUDA MEX transformation
  - CUDA enabled FAIR registration cycle
- Improvements
- Summary
3 Image Registration
Given a reference image $R$ and a template image $T$, find a reasonable transformation $y$ such that the transformed image $T[y]$ is similar to $R$:
$$J[y] = D[T[y], R] + \alpha\, S[y - y^{\mathrm{ref}}] \xrightarrow{\;y\;} \min \tag{1}$$
where $D$ measures image similarity, $S$ measures the reasonability of the transformation, and $\alpha$ weights the two terms.
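As one concrete instance of the distance term D, the sum of squared differences (SSD), which also appears in the profiled experiment, can be sketched in plain C. This is a minimal sketch only: the function name `ssd_distance`, the uniform cell volume `h` and the factor 1/2 are assumptions made for illustration and are not taken from FAIR's interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: discretized SSD distance
   D = 0.5 * h * sum_i (T[i] - R[i])^2,
   where h is the (assumed uniform) volume of one grid cell. */
double ssd_distance(const double *T, const double *R, size_t n, double h)
{
    assert(T != NULL && R != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double r = T[i] - R[i];   /* residual at cell i */
        sum += r * r;
    }
    return 0.5 * h * sum;
}
```

Calling `ssd_distance(T, R, n, h)` with the interpolated template values and the reference values on the same grid would give the D part of the objective; the regularizer S would be added separately.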
4 A software viewpoint
5 FAIR: Flexible Algorithms for Image Registration
Image registration (optimization approach):
$$J[y] = D[T[y], R] + \alpha\, S[y - y^{\mathrm{ref}}] \xrightarrow{\;y\;} \min$$
Salient features
- Continuous (functional) framework
- Numerical optimization
- Constrained image registration
- Collection of MATLAB files
- Toolbox for image models, transformations, distance measures, regularizers, ...
- Multi-level, multi-scale, amenable to multigrid
6 Parametric Image Registration in FAIR: HNSP, rigid/fine
[Figure panels: (a) T(xc); (b) R(xc); (c) T(xc) - R(xc); (d) T(xc) with yc; (e) T(yc); (f) T(yc) - R(xc)]
7 Profiling Results: HNSP PIR SSD rigid2d
Function Name                Calls   Total Time (s)   %
E6 HNSP PIR SSD rigid2d                               100
inter = splineinter2d                                 59.3
opt = Armijo                                          14
distance = SSD                                        2.6
trafo = rigid2d                                       1.5
FAIRplots and others                                  22.4
8 MATLAB MEX interface
Even though MATLAB is built on many well-optimized libraries, some functions perform better when written in a compiled language (e.g. C or Fortran). MATLAB provides a convenient API, MEX files, for interfacing code written in C or Fortran with MATLAB functions. MEX files can be used to exploit multi-core processors with OpenMP or threaded code or, as in this case, to offload functions to the GPU.
9 CUDA MEX environment: NVMEX
- The native MATLAB mex script cannot parse CUDA code.
- A new MATLAB script, nvmex.m, compiles CUDA code (.cu) to create MATLAB function files.
- The syntax is similar to the original mex script:
    >> nvmex -f nvmexopts.bat filename.cu -IC:\cuda\include -LC:\cuda\lib -lcudart
- Available for Windows and Linux from com/compute/cuda/1_1/matlab_cuda_1.1.tgz
10 Typical CUDA MEX file
1. Convert from double to single precision
2. Rearrange the data layout for complex data
3. Allocate memory on the GPU
4. Transfer the data from the host to the GPU
5. Perform computation on GPU (library, custom code)
6. Transfer results from the GPU to the host
7. Rearrange the data layout for complex data
8. Convert from single to double
9. Clean up memory and return results to MATLAB
Some of these steps will go away with new versions of the library (2, 7) and new hardware (1, 8).
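The precision round trip in steps 1 and 8 can be sketched on the host alone. The following is a minimal C sketch under stated assumptions: all function names are hypothetical, and `compute_stub` is a CPU stand-in for the GPU part (steps 3 to 6), not an actual kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Step 1: MATLAB hands over doubles; convert to single precision. */
static void to_single(const double *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) out[i] = (float)in[i];
}

/* Step 8: convert single-precision results back to double. */
static void to_double(const float *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) out[i] = (double)in[i];
}

/* Hypothetical stand-in for steps 3-6 (allocate, upload, compute,
   download): squares every element on the CPU. */
static void compute_stub(float *data, size_t n)
{
    for (size_t i = 0; i < n; ++i) data[i] = data[i] * data[i];
}

/* Returns 0 on success, -1 if the staging buffer cannot be allocated. */
int run_single_precision(const double *in, double *out, size_t n)
{
    assert(in != NULL && out != NULL);
    float *staging = (float *)malloc(n * sizeof *staging);
    if (staging == NULL) return -1;
    to_single(in, staging, n);     /* step 1 */
    compute_stub(staging, n);      /* steps 3-6 would run on the GPU */
    to_double(staging, out, n);    /* step 8 */
    free(staging);                 /* step 9 */
    return 0;
}
```

The two conversion loops are pure overhead relative to the computation, which is why the slide expects them to disappear on hardware with native double-precision support.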
11 Design requirements, roadmap and considerations
Requirements
- Integration of the FAIR toolbox with the CUDA programming interface.
- Efficient implementations of FAIR functional modules on the GPU.
- Measurement of accuracy and runtime for the complete registration cycle and for individual modules.
Roadmap
1. Set up the CUDA MEX environment within the FAIR toolbox.
2. Implement an optimised interpolation toolbox for FAIR on CUDA.
3. Implement the transformation and distance toolboxes on CUDA.
4. Combine all CUDA functional modules to run a complete registration cycle on the GPU.
12 Textures in CUDA
A texture is an object for reading data.
Benefits
- Data is cached (optimized for 2D locality); helpful when coalescing is a problem
- Filtering: linear / bilinear / trilinear, in dedicated hardware
- Wrap modes (for out-of-bounds addresses): clamp to edge / repeat
- Addressable in 1D, 2D, or 3D, using integer or normalized coordinates
Usage
- CPU code binds data to a texture object
- Kernel reads data by calling a fetch function
13 Basic interpolation schemes
Host side:
    texture<float, 2, cudaReadModeElementType> tex;
    ...
    void mexFunction(int nlhs, ...) {
        ...
        // set texture parameters: clamp out-of-range coordinates to the edge
        tex.addressMode[0] = cudaAddressModeClamp;
        tex.addressMode[1] = cudaAddressModeClamp;
        // access with unnormalized (texel index) texture coordinates
        tex.normalized = false;
        ...
        // bind the array to the texture
        cudaBindTextureToArray(tex, cu_array, channeldesc);
        ...
    }
14 Basic interpolation schemes
Device kernel:
    __global__ void Inter2DKernel() {
        ...
        T = tex2D(tex, tx, ty);
        ...
    }
Nearest neighbor:
    $T^{\mathrm{nn}}(x) := \mathrm{dataT}(j)$, with $T^{\mathrm{nn}}(x) = 0$ for $x \notin \Omega$
    tex.filterMode = cudaFilterModePoint;
Low-precision linear:
    $T^{\mathrm{linear}}(x) := \mathrm{dataT}(p)\,(1 - \xi) + \mathrm{dataT}(p + 1)\,\xi$, where $x = p + \xi$ with integer $p$ and $0 \le \xi < 1$
    tex.filterMode = cudaFilterModeLinear;
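The two filter modes can be mimicked in software to see what the texture unit computes. This is a 1D C sketch under assumptions: the function names are hypothetical, sample j is taken to sit at x = j (the half-texel offset of real CUDA texture coordinates is ignored), and out-of-range reads are clamped to the edge as in the host-side settings rather than set to zero.

```c
#include <assert.h>
#include <stddef.h>

/* floor() for doubles without libm; the cast truncates toward zero. */
static long floor_index(double x)
{
    long p = (long)x;
    if ((double)p > x) --p;    /* fix up negative non-integers */
    return p;
}

/* Clamp-to-edge read, mimicking cudaAddressModeClamp. */
static double fetch_clamped(const double *data, size_t n, long j)
{
    if (j < 0) j = 0;
    if (j >= (long)n) j = (long)n - 1;
    return data[j];
}

/* Nearest neighbor: value of the closest sample. */
double interp_nn(const double *data, size_t n, double x)
{
    assert(n > 0);
    return fetch_clamped(data, n, floor_index(x + 0.5));
}

/* Linear: data[p] * (1 - xi) + data[p + 1] * xi with x = p + xi. */
double interp_linear(const double *data, size_t n, double x)
{
    assert(n > 0);
    long p = floor_index(x);
    double xi = x - (double)p;
    return fetch_clamped(data, n, p) * (1.0 - xi)
         + fetch_clamped(data, n, p + 1) * xi;
}
```

With samples {0, 10, 20}, `interp_linear` at x = 1.25 blends the second and third sample with weights 0.75 and 0.25, while `interp_nn` at x = 1.4 simply returns the second sample.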
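15 B-Spline Interpolation
$$S[T] = \int_{\Omega} \bigl(T''(x)\bigr)^2 \, dx, \tag{2}$$
$$S[T] \overset{!}{=} \min \quad \text{subject to} \quad T(x_j) = \mathrm{dataT}(j), \quad j = 1, \ldots, m, \tag{3}$$
$$b(x) = \begin{cases}
(x + 2)^3, & -2 \le x < -1, \\
-x^3 - 2(x + 1)^3 + 6(x + 1), & -1 \le x < 0, \\
x^3 + 2(x - 1)^3 - 6(x - 1), & 0 \le x < 1, \\
(2 - x)^3, & 1 \le x < 2, \\
0, & \text{else},
\end{cases} \tag{4}$$
$$T(x) = T^{\mathrm{spline}}(x) = \sum_{j=1}^{m} c_j\, b_j(x). \tag{5}$$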
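The piecewise basis of equation (4) and the expansion of equation (5) translate directly into code. This is a minimal C sketch: the function names are hypothetical, and it assumes the shifted basis functions are b_j(x) = b(x - j) with knots at the integers j = 1, ..., m, which the slide does not state explicitly.

```c
#include <assert.h>
#include <stddef.h>

/* Cubic B-spline basis of equation (4). Note b(0) = 4, so the four
   overlapping shifted copies sum to 6 rather than 1. */
double bspline_basis(double x)
{
    if (x >= -2.0 && x < -1.0)
        return (x + 2.0) * (x + 2.0) * (x + 2.0);
    if (x >= -1.0 && x < 0.0) {
        double u = x + 1.0;
        return -x * x * x - 2.0 * u * u * u + 6.0 * u;
    }
    if (x >= 0.0 && x < 1.0) {
        double u = x - 1.0;
        return x * x * x + 2.0 * u * u * u - 6.0 * u;
    }
    if (x >= 1.0 && x < 2.0)
        return (2.0 - x) * (2.0 - x) * (2.0 - x);
    return 0.0;
}

/* Equation (5): T(x) = sum_j c_j * b(x - j) for j = 1..m.
   The array c is 0-based, so c[j - 1] holds the coefficient c_j. */
double spline_eval(const double *c, size_t m, double x)
{
    assert(c != NULL);
    double t = 0.0;
    for (size_t j = 1; j <= m; ++j)
        t += c[j - 1] * bspline_basis(x - (double)j);
    return t;
}
```

The loop visits all m coefficients for clarity; since b vanishes outside [-2, 2), only the four coefficients nearest to x contribute.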
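16 B-Spline Interpolation [Sigg, C. and Hadwiger, M.]
With $x = p + \xi$, only four basis functions are non-zero:
$$T^{\mathrm{spline}}(x) = c_{p-1}\, b(\xi + 1) + c_p\, b(\xi) + c_{p+1}\, b(\xi - 1) + c_{p+2}\, b(\xi - 2) \tag{6}$$
Linear interpolation,
$$T^{\mathrm{linear}}(x) := \mathrm{dataT}(p)\,(1 - \xi) + \mathrm{dataT}(p + 1)\,\xi, \tag{7}$$
can evaluate any weighted sum of two neighbouring samples; choosing $\xi = b/(a + b)$ gives
$$(a + b)\, T^{\mathrm{linear}}(x) = \mathrm{dataT}(p)\, a + \mathrm{dataT}(p + 1)\, b. \tag{8}$$
The four-term sum (6) therefore reduces to two linear fetches on the coefficients:
$$T^{\mathrm{spline}}(x) = g_0(\xi)\, c^{\mathrm{linear}}_{p + h_0} + g_1(\xi)\, c^{\mathrm{linear}}_{p + h_1} \tag{9}$$
where
$$g_0(\xi) = b(\xi + 1) + b(\xi), \qquad g_1(\xi) = b(\xi - 1) + b(\xi - 2), \tag{10}$$
$$h_0 = \frac{b(\xi)}{g_0(\xi)} - 1, \qquad h_1 = \frac{b(\xi - 2)}{g_1(\xi)} + 1. \tag{11}$$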
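The equivalence of the four-term sum in equation (6) and the two-fetch form in equations (9) to (11) can be checked numerically. This is a C sketch under assumptions: `fetch_linear` is a software stand-in for the hardware linear texture fetch, all names are hypothetical, and the caller must keep p - 1 through p + 2 inside the coefficient array.

```c
#include <assert.h>
#include <stddef.h>

/* Cubic B-spline basis b(x) as used in equation (6). */
static double b(double x)
{
    if (x >= -2.0 && x < -1.0) return (x + 2.0) * (x + 2.0) * (x + 2.0);
    if (x >= -1.0 && x < 0.0) {
        double u = x + 1.0;
        return -x * x * x - 2.0 * u * u * u + 6.0 * u;
    }
    if (x >= 0.0 && x < 1.0) {
        double u = x - 1.0;
        return x * x * x + 2.0 * u * u * u - 6.0 * u;
    }
    if (x >= 1.0 && x < 2.0) return (2.0 - x) * (2.0 - x) * (2.0 - x);
    return 0.0;
}

/* Software stand-in for a linear texture fetch at position s >= 0
   (sample j sits at s = j; no boundary handling). */
static double fetch_linear(const double *c, double s)
{
    long p = (long)s;
    double xi = s - (double)p;
    return c[p] * (1.0 - xi) + c[p + 1] * xi;
}

/* Equation (6): direct four-term evaluation at x = p + xi, 0 <= xi < 1. */
double spline4(const double *c, long p, double xi)
{
    assert(p >= 1 && xi >= 0.0 && xi < 1.0);
    return c[p - 1] * b(xi + 1.0) + c[p] * b(xi)
         + c[p + 1] * b(xi - 1.0) + c[p + 2] * b(xi - 2.0);
}

/* Equations (9)-(11): the same value from two linear fetches. */
double spline2(const double *c, long p, double xi)
{
    assert(p >= 1 && xi >= 0.0 && xi < 1.0);
    double g0 = b(xi + 1.0) + b(xi);
    double g1 = b(xi - 1.0) + b(xi - 2.0);
    double h0 = b(xi) / g0 - 1.0;        /* offset in [-1, 0) */
    double h1 = b(xi - 2.0) / g1 + 1.0;  /* offset in [1, 2)  */
    return g0 * fetch_linear(c, (double)p + h0)
         + g1 * fetch_linear(c, (double)p + h1);
}
```

In one dimension this halves the number of memory reads from four to two; on the GPU the weighting inside each fetch is done by the texture filtering hardware, at the reduced precision of that hardware.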
17 Bandwidth Results: Interpolation
[Figure panels: (a) splineinter2d (l); (b) splineinter2d (nn)]
[Table: measured bandwidth, worst case and best case for splineinter2d (NN) and splineinter2d (bilinear), by grid size; the numeric values are not preserved in this transcription]
18 Runtime Results: Interpolation
[Figure panels: (a) Runtime comparison; (b) Runtime vs ideal]
[Table: runtime in ms by grid size for linearinter2d (FAIR), splineinter2d (FAIR), splineinter2d (NN texture) and splineinter2d (bilinear texture); the numeric values are not preserved in this transcription]
19 Results: Interpolation
[Figure panels: (a) derivative test, Inter2D (MATLAB); (b) derivative test, Inter2D (CUDA MEX)]
20 Rigid transformation
An affine linear transformation allows for translation, rotation, shearing, and individual scaling. Its components are
$$y_1 = w_1 x_1 + w_2 x_2 + w_3, \tag{12}$$
$$y_2 = w_4 x_1 + w_5 x_2 + w_6. \tag{13}$$
In matrix form,
$$Q(x) = \begin{bmatrix} x_1 & x_2 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & x_1 & x_2 & 1 \end{bmatrix}, \tag{14}$$
$$y = Q(x)\, w. \tag{15}$$
Rigid transformation: a special affine linear transformation that allows only rotation and translation,
$$y_1 = \cos(w_1)\, x_1 - \sin(w_1)\, x_2 + w_2, \tag{16}$$
$$y_2 = \sin(w_1)\, x_1 + \cos(w_1)\, x_2 + w_3. \tag{17}$$
Although this function is non-linear in $w$, it can still be written with the same matrix $Q$:
$$y(x) = Q(x)\, f(w), \qquad f(w) = [\cos w_1;\ -\sin w_1;\ w_2;\ \sin w_1;\ \cos w_1;\ w_3].$$
21 Persistent Memory and Hybrid Memory
    #include "cuda.h"
    #include "mex.h"
    ...
    /* Static variables to retain device memory locations between calls */
    static float *xf_gpu, *yf_gpu;
    static float *yc_gpu;
    static int initialised_rigid = 0;

    /* Routine to clear the CUDA MEX persistent variables */
    __host__ void cleanup(void) {
        mexPrintf("MEX-file rigid2d is terminating, destroying array\n");
        cudaFree(xf_gpu);
        cudaFree(yf_gpu);
        cudaFree(yc_gpu);
    }

    /* Kernel to transform an image.
       y_xf_gpu, y_yf_gpu: output data in global memory
       xf_gpu, yf_gpu:     input data (Q) from global memory */
    __global__ void rigid2dkernel(float *y_xf_gpu, ..) { }

    /* Gateway function */
    /* function [yc,dy] = rigid2d(w,x,varargin); */
    void mexFunction(int nlhs, mxArray *plhs[], ..) {
        /* Find the dimensions of the data */
        /* Allocate memory for output */
        ...
        cudaMalloc((void **)&yc_gpu, sizeof(float) * xn * xm / 2);
        /* Setup kernel */
22 Persistent Memory and Hybrid Memory (cont.)
        if (!initialised_rigid) {
            x = mxGetPr(prhs[1]);
            /* Allocate memory for Q */
            cudaMalloc((void **)&xf_gpu, ..);
            cudaMalloc((void **)&yf_gpu, ..);
            /* Construct Q using input data */
            cudaMemcpy(xf_gpu, ..., cudaMemcpyHostToDevice);
            cudaMemcpy(yf_gpu, ..., cudaMemcpyHostToDevice);
            /* Register the cleanup function and set the flag that guards CUDA memory setup */
            mexAtExit(cleanup);
            initialised_rigid = 1;
            /* Call function to perform rigid2d on GPU */
            rigid2dkernel<<<dimgrid, dimblock>>>(...);
            cutilSafeCall(cudaThreadSynchronize());
        } else {
            /* Call function to perform rigid2d on GPU */
        }
        /* Set result to device pointer */
        mxArray *parray = mxCreateDoubleMatrix(0, 0, mxREAL);
        double data[10];
        mxSetPr(parray, (double *)yc_gpu);
        mxSetM(parray, xm);
        mxSetN(parray, xn);
        /* Clean up non-persistent memory on device and host */
        cudaThreadExit();
    }
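The persistence pattern itself (static pointers, an initialisation flag and an exit hook) does not depend on CUDA and can be illustrated in plain C. In this sketch `malloc` and `memcpy` stand in for `cudaMalloc` and `cudaMemcpy`, the standard `atexit` stands in for `mexAtExit`, and every function and variable name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Persistent state: survives between calls, like the static device
   pointers in the MEX file. */
static double *grid_cache = NULL;
static size_t grid_len = 0;
static int initialised = 0;
static int upload_count = 0;   /* counts the expensive initialisations */

/* Released once at program exit; plays the role of the mexAtExit hook. */
static void cleanup(void)
{
    free(grid_cache);
    grid_cache = NULL;
    initialised = 0;
}

/* Hypothetical entry point: copies the grid only on the first call and
   reuses the cached copy afterwards. Returns sum_i (grid[i] + shift). */
double shifted_sum(const double *grid, size_t n, double shift)
{
    if (!initialised) {
        grid_cache = (double *)malloc(n * sizeof *grid_cache);
        assert(grid_cache != NULL);
        memcpy(grid_cache, grid, n * sizeof *grid_cache);  /* "upload" */
        grid_len = n;
        atexit(cleanup);
        initialised = 1;
        ++upload_count;
    }
    double sum = 0.0;
    for (size_t i = 0; i < grid_len; ++i)
        sum += grid_cache[i] + shift;
    return sum;
}

/* Number of uploads performed so far. */
int uploads_so_far(void) { return upload_count; }
```

The trade-off is visible in the sketch: the second and later calls skip the copy entirely, but they also ignore the `grid` argument, so the pattern is only safe while the grid does not change between calls.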
23 Results
[Table: columns are grid size X, grid size Y, rigid2d (non-persistent), rigid2d (persistent), and % time saved using persistent memory; the numeric values are not preserved in this transcription]
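24 CUDA MEX Registration cycle
[Table: runtime in seconds of PIR SSD RIGID in MATLAB and in CUDA MEX, by grid size (X, Y). Only two timings survive in this transcription, 33 s and 92 s; their rows and columns are not recoverable.]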
25 FAIR Improvements
- Use of Kronecker products.
- The explicit storage of the large coordinate grids could be avoided.
- Combination of functional modules.
- The stringent requirement for lexicographical ordering.
26 CUDA MEX Improvements
[Figure panels: (a) CUDA Driver Objects; (b) CUDA Driver Objects; (c) Improved framework]
27 Summary
1. Successful integration of MATLAB and CUDA.
2. Porting of the FAIR toolbox onto the GPU.
3. Fast implementation of spline interpolation within the CUDA MEX framework.
4. Analysis of accuracy results for texture usage for interpolant derivatives.
5. GPU acceleration of the fixed-level image registration scheme for large discretizations.
6. Implementation of persistent memory on GPUs.
COSC 6385 Computer Architecture - Multi-Processors (V) The Intel Larrabee, Nvidia GT200 and Fermi processors Fall 2012 References Intel Larrabee: [1] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M.
More informationOverview of research activities Toward portability of performance
Overview of research activities Toward portability of performance Do dynamically what can t be done statically Understand evolution of architectures Enable new programming models Put intelligence into
More informationGPGPU/CUDA/C Workshop 2012
GPGPU/CUDA/C Workshop 2012 Day-2: Intro to CUDA/C Programming Presenter(s): Abu Asaduzzaman Chok Yip Wichita State University July 11, 2012 GPGPU/CUDA/C Workshop 2012 Outline Review: Day-1 Brief history
More informationGPU Memory Model Overview
GPU Memory Model Overview John Owens University of California, Davis Department of Electrical and Computer Engineering Institute for Data Analysis and Visualization SciDAC Institute for Ultrascale Visualization
More informationGPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA
GPGPU LAB Case study: Finite-Difference Time- Domain Method on CUDA Ana Balevic IPVS 1 Finite-Difference Time-Domain Method Numerical computation of solutions to partial differential equations Explicit
More informationGeneral Purpose GPU programming (GP-GPU) with Nvidia CUDA. Libby Shoop
General Purpose GPU programming (GP-GPU) with Nvidia CUDA Libby Shoop 3 What is (Historical) GPGPU? General Purpose computation using GPU and graphics API in applications other than 3D graphics GPU accelerates
More informationCUDA GPGPU Workshop CUDA/GPGPU Arch&Prog
CUDA GPGPU Workshop 2012 CUDA/GPGPU Arch&Prog Yip Wichita State University 7/11/2012 GPU-Hardware perspective GPU as PCI device Original PCI PCIe Inside GPU architecture GPU as PCI device Traditional PC
More informationGraph Partitioning. Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM.
Graph Partitioning Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM. Partition given graph G=(V,E) in k subgraphs of nearly equal
More informationDIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka
USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods
More informationOffloading Java to Graphics Processors
Offloading Java to Graphics Processors Peter Calvert (prc33@cam.ac.uk) University of Cambridge, Computer Laboratory Abstract Massively-parallel graphics processors have the potential to offer high performance
More informationHPC COMPUTING WITH CUDA AND TESLA HARDWARE. Timothy Lanfear, NVIDIA
HPC COMPUTING WITH CUDA AND TESLA HARDWARE Timothy Lanfear, NVIDIA WHAT IS GPU COMPUTING? What is GPU Computing? x86 PCIe bus GPU Computing with CPU + GPU Heterogeneous Computing Low Latency or High Throughput?
More informationHybrid Implementation of 3D Kirchhoff Migration
Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied
More informationCUDA Kenjiro Taura 1 / 36
CUDA Kenjiro Taura 1 / 36 Contents 1 Overview 2 CUDA Basics 3 Kernels 4 Threads and thread blocks 5 Moving data between host and device 6 Data sharing among threads in the device 2 / 36 Contents 1 Overview
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More information