Technical Application Field. Scientific Computing. Applied Numerics


EVIP: Variational Modeling, Parallel Processing, Rank-efficient operators

Elasticity modeled Image Registration

Motivation: Given a reference image R and a template image T, find a reasonable transformation y such that the transformed template T[y] is similar to R.

Applications

HNSP: Sectioning --> sliced --> flattened --> stained --> mounted... --> digitized; large-scale digital images, up to 10,000 x 20,000 pixels. Courtesy: Oliver Schmitt, Eldad Haber & Jan Modersitzki

HNSP: Microscopy Courtesy: Oliver Schmitt, Eldad Haber & Jan Modersitzki

HNSP: Deformed Images [panels: sec. 3799, sec. 3800; human, affine linear] Courtesy: Oliver Schmitt, Eldad Haber & Jan Modersitzki

HNSP: Results. 3D elastic registration of a part of the visual cortex: 2 hemispheres; 100 sections of 512 x 512 pixels. Courtesy: Oliver Schmitt, Eldad Haber & Jan Modersitzki

Registration in Medical Imaging. Comparing/merging/integrating images from different times, devices, perspectives, or objects, e.g.: pre-/post-surgery CT images/MRI, panorama imaging, atlas/patient mapping, catheter in blood vessel, finding a 2D view in 3D data. HNSP examples: template matching, atlas mapping, serial sectioning. Registration is not restricted to medical applications.

Variational Modelling

Interpolation: continuous models for reference and template, built from discrete data.

Transformation

Eulerian versus Lagrangian View (NOMIR Part I, Eldad Haber & Jan Modersitzki). Euler: T[y](x) = T(y(x)); easy, but is x ∈ y⁻¹(Ω)? Lagrange: (p, T(p)) maps to (x(p), T(p)); an option for constraints.

Distance measures Sum of Squared Differences (SSD)
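A minimal sketch of the SSD measure (the factor 1/2 and the scaling by the grid-cell size h follow FAIR's convention; the function name is hypothetical):

```python
def ssd(T, R, h):
    """Sum of Squared Differences: D(T, R) = (h / 2) * sum_i (T_i - R_i)^2,
    where h is the area/volume of one grid cell."""
    return 0.5 * h * sum((t - r) ** 2 for t, r in zip(T, R))

# Identical images have zero distance; any mismatch increases it.
print(ssd([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], h=1.0))  # 0.0
print(ssd([1.0, 2.0], [1.0, 0.0], h=1.0))            # 2.0
```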


Regularization ill-posedness

Regularization: implicit vs. explicit regularization; parametric regularization; regularized parametric registration; non-parametric regularization.

Elastic Regularizer Elastic potential of u

Numerical optimization

Euler-Lagrange equations (ELE) to PDE: balance of forces; outer forces drive the registration, inner forces model tissue properties.

Discretized Regularizer discretise and

Discretized Cost function

Minimization of J: necessary condition for a minimizer.

Minimization of J: Solve.

Remarks on B: need to solve a HUGE, very sparse system that has a lot of structure. [spy plot of B, nz = 3296]

[spy plot, nz = 6319]

Performance Optimization

Outline: Fundamentals; Architecture and Little's Law; Yesterday's Constraints: ILP/DLP; Today's Constraints: MLP; Summary

Little's Law


Basic Throughput Quantities
Latency: every operation requires time to execute (e.g., instruction, memory, or network latency).
Bandwidth: the number of (parallel) operations completed per cycle (e.g., #FPUs, DRAM, network, etc.).
Concurrency: the total number of operations in flight.


Little's Law relates these three:
Concurrency = Latency * Bandwidth
or, equivalently,
Effective Throughput = Expressed Concurrency / Latency
This concurrency must be filled with parallel operations. You cannot exceed peak throughput with superfluous concurrency (each channel has a maximum throughput).
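A toy numeric sketch of Little's Law; the numbers are illustrative and match the pipelined-processor example later in the slides:

```python
def required_concurrency(latency, bandwidth):
    """Little's Law: operations that must be in flight to sustain peak bandwidth."""
    return latency * bandwidth

# A pipeline with a 3-cycle latency and an issue width of 1 instruction/cycle
# needs 3 independent instructions in flight to stay full.
print(required_concurrency(latency=3, bandwidth=1))  # 3
```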

Basic Traffic Quantities. Traffic often includes: # of floating-point operations (FLOPs); # of bytes moved (from registers, cache, DRAM, or the network).

Performance Optimization: Contending Forces. Improve throughput (Gflop/s, GB/s, etc.); reduce the volume of data (Flops, GBs, etc.). Contending forces of device efficiency and usage/traffic.

Performance Optimization: Contending Forces. Restructure to satisfy Little's Law: implementation & algorithmic optimization.


Architects, Mathematicians, Programmers. Architects invent paradigms to improve (peak) throughput and facilitate(?) Little's Law. Mathematicians invent new algorithms to improve performance by reducing (bottleneck) traffic. Programmers restructure algorithms and implementations to exploit these new features.


Performance Optimization often boils down to several key challenges: management of data/task locality; management of data dependencies; management of communication; management of variable and dynamic parallelism.

Yesterday's Constraint: Instruction Latency & Parallelism

Single-issue, non-pipelined. Consider a single-issue, non-pipelined processor. Little's Law: Bandwidth = issue width = 1; Latency = 1; Concurrency = 1. It is very easy to get good performance even if all instructions are dependent.

Pipelined. By pipelining, we can increase the processor frequency; however, the pipeline must be kept full to achieve good performance. Little's Law: Bandwidth = issue width = 1; Latency = 3; Concurrency = 3. Performance may drop to 1/3 of peak.

Pipelined. There may be inherent but untapped parallelism in the code. Compilers/programmers must find parallelism and unroll/reorder the code to keep the pipeline full.

Out-of-order. Alternatively, the hardware can try to find instruction-level parallelism (ILP). Instructions are queued up, executed out-of-order, reordered, and committed in-order (future instructions --> reservation stations --> out-of-order execution --> reorder buffer --> completed). Useful when parallelism or latency cannot be determined at compile time.

Superscalar. Increase throughput by executing multiple instructions in parallel, usually with separate pipelines for different instruction types: FP, integer, memory. Significantly complicates out-of-order execution.

SIMD. Many codes perform the same operations on different pieces of data (data-level parallelism, DLP). SIMD: Single Instruction, Multiple Data. Register sizes are increased: instead of each register holding one 64-bit FP number, each register holds 2 or 4 FP numbers. A much more efficient solution than superscalar on data-parallel codes.

Multithreaded. Superscalars fail when there is no ILP or DLP; however, there are many codes with thread-level parallelism (TLP). Consider architectures that are virtualized to appear as N cores; in reality, there is one core maintaining multiple contexts and dynamically switching between them. There are 3 main types of multithreaded architectures: coarse-grained multithreading (CGMT); fine-grained multithreading (FGMT), aka vertical multithreading; and simultaneous multithreading (SMT).

Coarse-grained Multithreading. Maintain multiple contexts. On a long-latency instruction: dispatch the instruction, switch to a ready thread, hide the latency with multiple ready threads, and eventually switch back to the original.

Fine-grained Multithreading. Maintain multiple contexts. On every cycle, choose a ready thread. May now satisfy Little's Law through multithreading: threads ≈ latency * bandwidth.

Simultaneous Multithreading. Maintain multiple contexts. On every cycle, choose as many ready instructions from the thread pool as possible. Can be applied to both in-order and out-of-order architectures.

Today's Constraint: The Memory Wall


Abstract Machine Model. Core runs: z=0; i++; z+=x[i]*y[i]; Memory hierarchy: Register File (<6000 GB/s), Cache (<1000 GB/s), DRAM (<50 GB/s). Data in DRAM: float z; int i; float y[n]; float x[n];

Impact on Little's Law? Today, utilizing the full DRAM bandwidth and minimizing memory traffic are paramount. DRAM latency can exceed 1000 CPU cycles. Impact on Little's Law (200 ns * 20 GB/s): 4 KB of data must be in flight. How did we solve this?
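The 4 KB figure follows directly from Little's Law; a one-line check with the slide's numbers:

```python
latency_s = 200e-9       # 200 ns DRAM latency
bandwidth_Bps = 20e9     # 20 GB/s memory bandwidth
bytes_in_flight = latency_s * bandwidth_Bps  # concurrency = latency * bandwidth
print(bytes_in_flight)   # roughly 4000 bytes, i.e. about 4 KB in flight
```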

FAIR on CUDA: A proof of concept of multicore acceleration. Sunil Ramgopal Tatavarty. June 14, 2013

Outline: 1 FAIR (Image Registration; FAIR; Fixed-level experiment) 2 FAIR on CUDA (The design phase; CUDA MEX interpolation; CUDA MEX transformation; CUDA-enabled FAIR registration cycle) 3 Improvements 4 Summary

Image Registration. Given a reference image R and a template image T, find a reasonable transformation y such that the transformed image T[y] is similar to R:
J[y] = D[T[y], R] + α S[y - y_ref] → min over y,  (1)
where D measures image similarity and S measures the reasonability of the transform.

A software viewpoint

FAIR: Flexible Algorithms for Image Registration. Image registration (optimization approach): J[y] = D[T[y], R] + α S[y - y_ref] → min over y. Salient features: continuous (functional) framework; numerical optimization; constrained image registration; a collection of MATLAB files; a toolbox for image models, transformations, distance measures, regularizers, ...; multi-level, multi-scale, multigrid amenable.

Parametric Image Registration in FAIR: HNSP. [panels: (a) T(xc), (b) R(xc), (c) T(xc) - R(xc); rigid/fine: (d) T(xc) with yc, (e) T(yc), (f) T(yc) - R(xc)]

Profiling Results: HNSP PIR SSD rigid2d

Function Name           Calls   Total Time (s)   %
HNSP PIR SSD rigid2d        1        43.25       100
inter = splineinter2d     180        25.64       59.3
opt = Armijo               85         5.95       14
distance = SSD            175         1.12       2.6
trafo = rigid2d           179         0.648      1.5
FAIRplots and others       89         9.688      22.4

Design requirements, roadmap, and considerations.
Requirements: integration of the FAIR toolbox with the CUDA programming interface; efficient implementations of FAIR functional modules on the GPU; measurement of accuracy and runtime for the complete registration cycle and for individual modules.
Roadmap: 1 Set up the CUDA MEX environment within the FAIR toolbox. 2 Implement an optimised FAIR interpolation toolbox within FAIR on CUDA. 3 Implement transformation and distance toolboxes on CUDA. 4 Combine all CUDA functional modules to run a complete registration cycle on the GPU.

Textures in CUDA. A texture is an object for reading data. Benefits: data is cached (optimized for 2D locality), which is helpful when coalescing is a problem; filtering (linear/bilinear/trilinear) in dedicated hardware; wrap modes for out-of-bounds addresses (clamp to edge / repeat); addressable in 1D, 2D, or 3D, using integer or normalized coordinates. Usage: CPU code binds data to a texture object; the kernel reads data by calling a fetch function.


Basic interpolation schemes.
Nearest Neighbor: T_nn(x) := dataT(j) for the nearest grid point j; T_nn(x) = 0 for x outside the domain. Low precision.
Linear: T_linear(x) := dataT(p) (1 - ξ) + dataT(p+1) ξ, where p is the integer part of x and ξ the fractional part.
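A minimal 1D sketch of the two interpolation schemes above (pure Python; function names are hypothetical, and values outside the data are taken as zero, as for T_nn):

```python
def nearest_interp(dataT, x):
    """T_nn(x): value of the nearest grid point, 0 outside the data."""
    j = int(x + 0.5)                      # round to the nearest index
    return dataT[j] if 0 <= j < len(dataT) else 0.0

def linear_interp(dataT, x):
    """T_linear(x) = dataT[p]*(1 - xi) + dataT[p+1]*xi, p = floor(x), xi = x - p."""
    p = int(x)
    if x < 0 or p + 1 >= len(dataT):
        return 0.0
    xi = x - p
    return dataT[p] * (1 - xi) + dataT[p + 1] * xi

print(nearest_interp([0.0, 10.0, 20.0], 1.3))   # 10.0
print(linear_interp([0.0, 10.0, 20.0], 1.25))   # 12.5
```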

B-Spline Interpolation
S[T] = ∫ (T''(x))² dx,  (2)
S[T] → min subject to T(x_j) = dataT(j), j = 1, ..., m,  (3)
b(x) = (x+2)³ for -2 ≤ x < -1; -x³ - 2(x+1)³ + 6(x+1) for -1 ≤ x < 0; x³ + 2(x-1)³ - 6(x-1) for 0 ≤ x < 1; (2-x)³ for 1 ≤ x < 2; 0 else.  (4)
T(x) = T_spline(x) = Σ_{j=1}^{m} c_j b_j(x)  (5)

B-Spline Interpolation [Sigg, C. and Hadwiger, M.]
T_spline(x) = c_{p-1} b(ξ+1) + c_p b(ξ) + c_{p+1} b(ξ-1) + c_{p+2} b(ξ-2)  (6)
T_linear(x) := dataT(p) (1 - ξ) + dataT(p+1) ξ  (7)
(a + b) T_linear(x) := dataT(p) a + dataT(p+1) b  (8)
T_spline(x) = g_0(ξ) c_linear(p + h_0) + g_1(ξ) c_linear(p + h_1)  (9)
where
g_0(ξ) = b(ξ+1) + b(ξ),  g_1(ξ) = b(ξ-1) + b(ξ-2)  (10)
h_0 = b(ξ)/g_0(ξ) - 1,  h_1 = b(ξ-2)/g_1(ξ) + 1  (11)
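The trick in Eqs. (6)-(11) can be checked numerically: two weighted linear fetches reproduce the four-tap cubic spline sum. A sketch under the slide's definitions (function names are hypothetical; both evaluations use the unnormalized b, whose weights sum to 6, so they agree with each other up to that common factor):

```python
def b(x):
    """Cubic B-spline basis from Eq. (4) (unnormalized)."""
    if -2 <= x < -1: return (x + 2) ** 3
    if -1 <= x < 0:  return -x ** 3 - 2 * (x + 1) ** 3 + 6 * (x + 1)
    if 0 <= x < 1:   return x ** 3 + 2 * (x - 1) ** 3 - 6 * (x - 1)
    if 1 <= x < 2:   return (2 - x) ** 3
    return 0.0

def lerp(c, pos):
    """One linear fetch into the coefficient array c at fractional position pos."""
    p = int(pos)
    xi = pos - p
    return c[p] * (1 - xi) + c[p + 1] * xi

def spline_direct(c, p, xi):
    """Four-tap evaluation, Eq. (6)."""
    return (c[p - 1] * b(xi + 1) + c[p] * b(xi)
            + c[p + 1] * b(xi - 1) + c[p + 2] * b(xi - 2))

def spline_two_lerps(c, p, xi):
    """Two weighted linear fetches, Eqs. (9)-(11)."""
    g0 = b(xi + 1) + b(xi)
    g1 = b(xi - 1) + b(xi - 2)
    h0 = b(xi) / g0 - 1
    h1 = b(xi - 2) / g1 + 1
    return g0 * lerp(c, p + h0) + g1 * lerp(c, p + h1)

c = [1.0, 3.0, 2.0, 5.0, 4.0]
print(abs(spline_direct(c, 2, 0.3) - spline_two_lerps(c, 2, 0.3)) < 1e-12)  # True
```

This is exactly why the texture hardware's bilinear filter pays off: a cubic fetch costs two linear fetches instead of four point fetches.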

Bandwidth Results: Interpolation. [plots: (a) splineinter2d (linear), (b) splineinter2d (NN)]

             splineinter2d (NN)                  splineinter2d (bilinear)
Grid Size    Measured   Worst Case   Best Case   Measured   Worst Case   Best Case
64x32        1.44       2.39         0.5         1.44       3.24         0.68
128x64       2.45       7.07         1.49        4.15       12.71        2.67
256x128      4          18.58        3.91        10.66      37.17        7.83
512x256      9.14       33.43        7.04        26.76      113.2        23.83

Runtime Results: Interpolation. [plots: (a) runtime comparison, (b) runtime vs. ideal]

Grid Size   linearinter2d   splineinter2d   splineinter2d       splineinter2d
            (FAIR) (ms)     (FAIR) (ms)     (NN texture) (ms)   (bilinear texture) (ms)
64x32       23.717          28.856          0.065               0.048
128x64      67.898          78.599          0.088               0.049
256x128     216.525         229.961         0.134               0.067
512x256     556.287         575.266         0.298               0.088

Results: Interpolation. [plots: (a) derivative test, inter2D (MATLAB); (b) derivative test, inter2D (CUDA MEX)]

Rigid transformation
An affine linear transformation allows for translation, rotation, shearing, and individual scaling. The components of an affine linear transformation are
y_1 = w_1 x_1 + w_2 x_2 + w_3,  (12)
y_2 = w_4 x_1 + w_5 x_2 + w_6.  (13)
In matrix form, with
Q(x) = [ x_1  x_2  1  0  0  0 ;  0  0  0  x_1  x_2  1 ],  (14)
y = Q(x) w.  (15)
Rigid transformation: a special affine linear transform that allows only rotation and translation:
y_1 = cos(w_1) x_1 - sin(w_1) x_2 + w_2,  (16)
y_2 = sin(w_1) x_1 + cos(w_1) x_2 + w_3.  (17)
Although this function is non-linear in w, y(x) = Q(x) f(w), with f(w) = [cos w_1; -sin w_1; w_2; sin w_1; cos w_1; w_3].
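A small sketch of Eqs. (14)-(17), checking that the factored form y = Q(x) f(w) matches a direct rotation plus translation (function names are hypothetical):

```python
import math

def f(w):
    """f(w) = [cos w1; -sin w1; w2; sin w1; cos w1; w3] for the rigid transform."""
    w1, w2, w3 = w
    return [math.cos(w1), -math.sin(w1), w2,
            math.sin(w1),  math.cos(w1), w3]

def Q(x):
    """Q(x) from Eq. (14): one row per output component."""
    x1, x2 = x
    return [[x1, x2, 1, 0, 0, 0],
            [0, 0, 0, x1, x2, 1]]

def rigid(x, w):
    """y = Q(x) f(w), equivalent to Eqs. (16)-(17)."""
    fw = f(w)
    return [sum(q * v for q, v in zip(row, fw)) for row in Q(x)]

# Rotate (1, 0) by 90 degrees with no translation: expect approximately (0, 1).
y = rigid((1.0, 0.0), (math.pi / 2, 0.0, 0.0))
print(y)
```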

Results: rigid2d with and without persistent memory

Grid Size     rigid2d            rigid2d        % time saved
(X x Y)       (non-persistent)   (persistent)   using persistent memory
64 x 32       0.2181             0.2139         2
128 x 64      0.2369             0.2243         5
256 x 128     0.2289             0.2233         2
512 x 256     0.2247             0.2142         5
512 x 512     0.2320             0.2200         5
1024 x 512    0.2427             0.2135         12
1024 x 1024   0.2683             0.2329         13
2048 x 1024   0.2874             0.2379         17

CUDA MEX Registration cycle

Grid Size    PIR SSD RIGID   PIR SSD RIGID
(X x Y)      (MATLAB)        (CUDA MEX)
128 x 64     14.96 s         14.13 s
256 x 128    45 s            33 s
512 x 256    201.85 s        92 s

FAIR Improvements: use of Kronecker products; the explicit storage of the large coordinate grids could be avoided; combination of functional modules; relaxing the stringent requirement for lexicographical ordering.

CUDA MEX Improvements. [diagrams: (a), (b) CUDA driver objects; (c) improved framework]

Summary. 1 Successful integration of MATLAB and CUDA. 2 Porting of the FAIR toolbox onto the GPU. 3 Fast implementation of spline interpolation within the CUDA MEX framework. 4 Analysis of accuracy results for texture usage for interpolant derivatives. 5 GPU acceleration of the fixed-level image registration scheme for large discretizations. 6 Implementation of persistent memory on GPUs.

Rank efficient operators

HSS: Hierarchically Semi-Separable Representation

Generic HSS structure

Symmetric HSS matrix For Siblings i & j :

Introducing Zeros

Partial factorisation of diagonal blocks

Compression

Merge

Update

Root node: compute full Cholesky

Cholesky based solver

HSS vs Classical

Summary

A continual struggle among computer architects, mathematicians, and computer scientists. Quick solution: satisfy Little's Law. Optimize: data/task locality, data dependencies, communication, variable and dynamic parallelism. Parallel hardware is here to stay. Parallelism & scalability are crucial for success. This presents many important research challenges.