EVIP
Technical Application Field: Scientific Computing / Applied Numerics
EVIP: Variational Modeling, Parallel Processing, Rank-efficient Operators
Elasticity-Modeled Image Registration
Motivation: Given a reference image R and a template image T, find a reasonable transformation y such that the transformed image T[y] is similar to R.
Applications
HNSP: Sectioning --> sliced --> flattened --> stained --> mounted ... --> digitized. Large-scale digital images, up to 10,000 x 20,000 pixels. Courtesy: Oliver Schmitt, Eldad Haber & Jan Modersitzki
HNSP: Microscopy
HNSP: Deformed Images (sections 3799 and 3800, human, affine-linear registration)
HNSP: Results. 3D elastic registration of a part of the visual cortex: 2 hemispheres; 100 sections of 512 x 512 pixels.
Registration in Medical Imaging
Comparing/merging/integrating images from different times, devices, perspectives, or objects, e.g.:
- pre-/post-surgery CT images/MRI
- panorama imaging
- atlas/patient mapping
- catheter in blood vessel
- find a 2D view in 3D data
- HNSP: template matching, atlas mapping, serial sectioning
Registration is not restricted to medical applications.
Variational Modelling
Interpolation: continuous models for reference and template, built from discrete data.
Transformation
Eulerian versus Lagrangian View
Euler: T[y](x) = T(y(x)). Lagrange: (p, T(p)) -> (x(p), T(p)).
Euler: easy, but x in y^{-1}(Omega)? Lagrange: option for constraints.
(From NOMIR Part I, Eldad Haber & Jan Modersitzki.)
Distance measures Sum of Squared Differences (SSD)
Distance measures
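As a concrete sketch, the SSD measure can be discretized with a midpoint rule. The function name `ssd` and the 0.5 * h * ||T[y] - R||^2 scaling below are illustrative of FAIR-style codes, not the toolbox's actual SSD module:

```python
import numpy as np

def ssd(Ty, R, h):
    """Sum of Squared Differences, midpoint-rule discretization:
    D = 0.5 * h * sum((T[y] - R)^2), with h the cell volume.
    Illustrative sketch, not FAIR's actual SSD implementation."""
    res = (Ty - R).ravel()   # pointwise residual between warped template and reference
    return 0.5 * h * float(res @ res)
```

For identical images the residual vanishes and the distance is zero; the measure grows with the squared pointwise mismatch.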
Regularization: needed due to the ill-posedness of the registration problem.
Regularization: implicit vs. explicit regularization; parametric regularization (regularized parametric registration); non-parametric regularization.
Elastic Regularizer: elastic potential of the displacement u.
Numerical optimization
ELE (Euler-Lagrange equations) to PDE: balance of forces. Outer forces drive the registration; inner forces model tissue properties.
Discretized Regularizer: discretize S and u.
Discretized Cost function
Minimization of J: necessary condition for a minimizer.
Minimization of J: Solve.
Remarks on B: need to solve a system with B. B is HUGE, very sparse, and has a lot of structure. (Spy plot, nz = 3296.)
(Spy plot, nz = 6319.)
Performance Optimization
Outline: Fundamentals; Architecture and Little's Law; Yesterday's Constraints (ILP/DLP); Today's Constraints (MLP); Summary
Little's Law
Basic Throughput Quantities
Latency: every operation requires time to execute (e.g., instruction, memory, or network latency).
Bandwidth: number of (parallel) operations completed per cycle (e.g., #FPUs, DRAM, network).
Concurrency: total number of operations in flight.
Little's Law
Little's Law relates these three quantities:
Concurrency = Latency * Bandwidth
or, equivalently,
Effective Throughput = Expressed Concurrency / Latency
This concurrency must be filled with parallel operations.
You can't exceed peak throughput with superfluous concurrency (each channel has a maximum throughput).
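A minimal sketch of the two identities above (function names are illustrative):

```python
def concurrency(latency, bandwidth):
    """Little's Law: operations that must be in flight to sustain full bandwidth."""
    return latency * bandwidth

def effective_throughput(expressed_concurrency, latency):
    """Rearranged form: throughput actually achieved for a given concurrency."""
    return expressed_concurrency / latency
```

For example, a pipeline with latency 3 and issue width 1 needs 3 independent instructions in flight; if only 1 is available, throughput drops to 1/3 of peak.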
Basic Traffic Quantities. Traffic often includes: #floating-point operations (FLOPs); #bytes moved (from registers, cache, DRAM, or the network).
Performance Optimization: Contending Forces. Improve throughput (Gflop/s, GB/s, etc.) vs. reduce the volume of data (FLOPs, GBs, etc.): the contending forces of device efficiency and usage/traffic.
Performance Optimization: Contending Forces. Restructure to satisfy Little's Law: implementation & algorithmic optimization.
Architects, Mathematicians, Programmers
Architects: invent paradigms to improve (peak) throughput and facilitate(?) Little's Law.
Mathematicians: invent new algorithms to improve performance by reducing (bottleneck) traffic.
Programmers: restructure algorithms and implementations to exploit these new features.
Performance Optimization
Often boils down to several key challenges:
- management of data/task locality
- management of data dependencies
- management of communication
- management of variable and dynamic parallelism
Yesterday's Constraint: Instruction Latency & Parallelism
Single-issue, non-pipelined
Consider a single-issue, non-pipelined processor.
Little's Law: bandwidth = issue width = 1; latency = 1; concurrency = 1.
Very easy to get good performance, even if all instructions are dependent.
Pipelined
By pipelining, we can increase the processor frequency. However, the pipeline must be kept full to achieve good performance.
Little's Law: bandwidth = issue width = 1; latency = 3; concurrency = 3.
Performance may drop to 1/3 of peak.
Pipelined
There may be inherent and untapped parallelism in the code. Compilers/programmers must find the parallelism and unroll/reorder the code to keep the pipeline full.
Out-of-order
Alternately, the hardware can try to find instruction-level parallelism (ILP). Instructions are queued up, executed out-of-order, reordered, and committed in-order. Useful when parallelism or latency cannot be determined at compile time.
Superscalar
Increase throughput by executing multiple instructions in parallel. Usually separate pipelines for different instruction types: FP, integer, memory. Significantly complicates out-of-order execution.
SIMD
Many codes perform the same operations on different pieces of data (data-level parallelism, DLP). SIMD: Single Instruction, Multiple Data. Register sizes are increased: instead of each register holding one 64-bit FP number, each register holds 2 or 4 FP numbers. A much more efficient solution than superscalar on data-parallel codes.
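The idea can be illustrated with NumPy, whose vectorized operations are lowered to SIMD instructions where the hardware provides them (a sketch of the concept, not of any particular ISA):

```python
import numpy as np

# Scalar view: one multiply per element, issued one at a time.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([10.0, 20.0, 30.0, 40.0])
z_scalar = np.array([x[i] * y[i] for i in range(4)])

# Data-level parallelism: one vectorized multiply over all 4 lanes,
# the SIMD analogue of packing 4 FP numbers into one wide register.
z_simd = x * y
```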
Multithreaded
Superscalars fail when there is no ILP or DLP. However, many codes have thread-level parallelism (TLP). Consider architectures that are virtualized to appear as N cores; in reality, one core maintains multiple contexts and dynamically switches between them. There are 3 main types of multithreaded architectures:
- coarse-grained multithreading (CGMT)
- fine-grained multithreading (FGMT), aka vertical multithreading
- simultaneous multithreading (SMT)
Coarse-grained Multithreading
Maintain multiple contexts. On a long-latency instruction: dispatch the instruction, then switch to a ready thread. Hide latency with multiple ready threads; eventually switch back to the original.
Fine-grained Multithreading
Maintain multiple contexts. On every cycle, choose a ready thread. May now satisfy Little's Law through multithreading: threads ~ latency * bandwidth.
Simultaneous Multithreading
Maintain multiple contexts. On every cycle, choose as many ready instructions from the thread pool as possible. Can be applied to both in-order and out-of-order architectures.
Today's Constraint: The Memory Wall
Abstract Machine Model
Core executes: z = 0; for each i: z += x[i] * y[i];
Data path: register file (<6000 GB/s), cache (<1000 GB/s), DRAM (<50 GB/s).
Data in DRAM: float z; int i; float x[n], y[n];
Impact on Little's Law?
Today, utilizing the full DRAM bandwidth and minimizing memory traffic are paramount. DRAM latency can exceed 1000 CPU cycles. Impact on Little's Law (200 ns * 20 GB/s): 4 KB of data in flight. How did we solve this?
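The 4 KB figure follows directly from Little's Law applied to the memory interface; a quick sketch of the arithmetic with the numbers from the slide:

```python
# Little's Law at the memory interface: bytes that must be in flight
# to cover DRAM latency at full bandwidth.
latency_s = 200e-9             # DRAM latency: 200 ns
bandwidth_bytes_per_s = 20e9   # DRAM bandwidth: 20 GB/s
bytes_in_flight = latency_s * bandwidth_bytes_per_s  # concurrency, in bytes
# roughly 4000 bytes, i.e. about 4 KB of outstanding memory traffic
```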
FAIR on CUDA
A proof of concept of multicore acceleration
Sunil Ramgopal Tatavarty
June 14, 2013
Outline
1. FAIR: Image Registration; FAIR; fixed-level experiment
2. FAIR on CUDA: the design phase; CUDA MEX interpolation; CUDA MEX transformation; CUDA-enabled FAIR registration cycle
3. Improvements
4. Summary
Image Registration
Given a reference image R and a template image T, find a reasonable transformation y such that the transformed image T[y] is similar to R:

J[y] = D[T[y], R] + α S[y − y_ref] -> min over y   (1)

where D measures image similarity and S measures the reasonability of the transform.
A software viewpoint
FAIR: Flexible Algorithms for Image Registration
Image registration (optimization approach): J[y] = D[T[y], R] + α S[y − y_ref] -> min over y.
Salient features: continuous (functional) framework; numerical optimization; constrained image registration.
A collection of MATLAB files: a toolbox for image models, transformations, distance measures, regularizers, ...; multi-level, multi-scale, multigrid amenable.
Parametric Image Registration in FAIR: HNSP example, rigid/fine. Panels: (a) T(xc), (b) R(xc), (c) T(xc) − R(xc), (d) T(xc) with yc, (e) T(yc), (f) T(yc) − R(xc).
Profiling Results: HNSP PIR SSD rigid2d

Function name            Calls   Total time (s)   %
HNSP PIR SSD rigid2d         1       43.25      100
inter = splineinter2d      180       25.64       59.3
opt = Armijo                85        5.95       14
distance = SSD             175        1.12        2.6
trafo = rigid2d            179        0.648       1.5
FAIRplots and others        89        9.688      22.4
Design Requirements, Roadmap, and Considerations
Requirements:
- integration of the FAIR toolbox with the CUDA programming interface
- efficient implementations of FAIR functional modules on the GPU
- measurement of accuracy and runtime of the complete registration cycle and of individual modules
Roadmap:
1. Set up the CUDA MEX environment within the FAIR toolbox.
2. Implement an optimised FAIR interpolation toolbox within FAIR on CUDA.
3. Implement the transformation and distance toolboxes on CUDA.
4. Combine all CUDA functional modules to run a complete registration cycle on the GPU.
Textures in CUDA
A texture is an object for reading data.
Benefits:
- data is cached (optimized for 2D locality); helpful when coalescing is a problem
- filtering: linear / bilinear / trilinear, in dedicated hardware
- wrap modes for out-of-bounds addresses: clamp to edge / repeat
- addressable in 1D, 2D, or 3D, using integer or normalized coordinates
Usage: CPU code binds data to a texture object; the kernel reads data by calling a fetch function.
Basic interpolation schemes
Nearest neighbor: T_nn(x) := dataT(j) for the nearest node j, and T_nn(x) = 0 for x outside the domain. Low precision.
Linear: T_linear(x) := dataT(p) (1 − ξ) + dataT(p+1) ξ, where p = floor(x) and ξ = x − p.
B-Spline Interpolation

S[T] = ∫ (T''(x))² dx,   (2)

S[T] -> min subject to T(x_j) = dataT(j), j = 1, ..., m,   (3)

b(x) = { (x+2)³,                   −2 ≤ x < −1,
         −x³ − 2(x+1)³ + 6(x+1),   −1 ≤ x < 0,
         x³ + 2(x−1)³ − 6(x−1),     0 ≤ x < 1,
         (2−x)³,                    1 ≤ x < 2,
         0,                         else.   (4)

T(x) = T_spline(x) = Σ_{j=1}^{m} c_j b_j(x)   (5)
B-Spline Interpolation [Sigg, C. and Hadwiger, M.]

T_spline(x) = c_{p−1} b(ξ+1) + c_p b(ξ) + c_{p+1} b(ξ−1) + c_{p+2} b(ξ−2)   (6)
T_linear(x) := dataT(p) (1 − ξ) + dataT(p+1) ξ,   (7)
(a + b) T_linear(x) := dataT(p) a + dataT(p+1) b,   (8)
T_spline(x) = g_0(ξ) c_linear(p + h_0) + g_1(ξ) c_linear(p + h_1)   (9)
where
g_0(ξ) = b(ξ+1) + b(ξ),   g_1(ξ) = b(ξ−1) + b(ξ−2)   (10)
h_0 = b(ξ)/g_0(ξ) − 1,   h_1 = b(ξ−2)/g_1(ξ) + 1   (11)
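The two-fetch reduction in Eqs. (6) through (11) can be checked numerically. The sketch below (hypothetical helper names, 1D, plain Python in place of texture hardware) verifies that two weighted linear fetches reproduce the direct four-tap spline sum:

```python
def b(x):
    """Cubic mother spline of Eq. (4): piecewise cubic with support [-2, 2]."""
    if -2 <= x < -1:
        return (x + 2) ** 3
    if -1 <= x < 0:
        return -x ** 3 - 2 * (x + 1) ** 3 + 6 * (x + 1)
    if 0 <= x < 1:
        return x ** 3 + 2 * (x - 1) ** 3 - 6 * (x - 1)
    if 1 <= x < 2:
        return (2 - x) ** 3
    return 0.0

def spline_4tap(c, x):
    """Direct four-tap evaluation, Eq. (6). Assumes 1 <= x < len(c) - 2."""
    p = int(x)
    xi = x - p
    return (c[p - 1] * b(xi + 1) + c[p] * b(xi)
            + c[p + 1] * b(xi - 1) + c[p + 2] * b(xi - 2))

def linear_fetch(c, pos):
    """Stand-in for one hardware linear texture fetch, Eq. (7)."""
    p = int(pos)
    xi = pos - p
    return c[p] * (1 - xi) + c[p + 1] * xi

def spline_2fetch(c, x):
    """Sigg-Hadwiger evaluation via two weighted linear fetches, Eqs. (9)-(11)."""
    p = int(x)
    xi = x - p
    g0 = b(xi + 1) + b(xi)
    g1 = b(xi - 1) + b(xi - 2)
    h0 = b(xi) / g0 - 1       # fetch offset in [-1, 0], splits taps p-1 and p
    h1 = b(xi - 2) / g1 + 1   # fetch offset in [1, 2], splits taps p+1 and p+2
    return g0 * linear_fetch(c, p + h0) + g1 * linear_fetch(c, p + h1)
```

Each weighted fetch recombines two neighboring coefficients with the correct spline weights, so the four-tap sum costs only two (hardware-accelerated) linear lookups.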
Bandwidth Results: Interpolation

             splineinter2d (NN)          splineinter2d (bilinear)
Grid size    Measured  Worst   Best      Measured  Worst   Best
64x32          1.44     2.39   0.5         1.44     3.24   0.68
128x64         2.45     7.07   1.49        4.15    12.71   2.67
256x128        4       18.58   3.91       10.66    37.17   7.83
512x256        9.14    33.43   7.04       26.76   113.2   23.83

Panels: (a) splineinter2d (linear), (b) splineinter2d (NN).
Runtime Results: Interpolation

Grid size   linearinter2d   splineinter2d   splineinter2d       splineinter2d
            (FAIR) (ms)     (FAIR) (ms)     (NN texture) (ms)   (bilinear texture) (ms)
64x32          23.717          28.856           0.065               0.048
128x64         67.898          78.599           0.088               0.049
256x128       216.525         229.961           0.134               0.067
512x256       556.287         575.266           0.298               0.088

Panels: (a) runtime comparison, (b) runtime vs. ideal.
Results: Interpolation. Panels: (a) derivative test, inter2D (MATLAB); (b) derivative test, inter2D (CUDA MEX).
Rigid transformation
An affine-linear transformation allows for translation, rotation, shearing, and individual scaling. Its components are

y_1 = w_1 x_1 + w_2 x_2 + w_3,   (12)
y_2 = w_4 x_1 + w_5 x_2 + w_6.   (13)

In matrix form, with

Q(x) = [ x_1  x_2  1    0    0   0
          0    0   0   x_1  x_2  1 ],   (14)

y = Q(x) w.   (15)

Rigid transformation: a special affine-linear transform that allows only rotation and translation:

y_1 = cos(w_1) x_1 − sin(w_1) x_2 + w_2,   (16)
y_2 = sin(w_1) x_1 + cos(w_1) x_2 + w_3.   (17)

Although this function is non-linear in w, y(x) = Q(x) f(w) with f(w) = [cos w_1; −sin w_1; w_2; sin w_1; cos w_1; w_3].
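A minimal sketch of the parameterization y(x) = Q(x) f(w); the function name `rigid2d` mirrors the FAIR module, but the implementation is illustrative only:

```python
import math

def rigid2d(w, x1, x2):
    """Rigid 2D transform y = Q(x) f(w), Eqs. (14)-(17):
    rotation by angle w[0], then translation by (w[1], w[2]).
    Plain-Python sketch, not FAIR's rigid2D code."""
    f = [math.cos(w[0]), -math.sin(w[0]), w[1],
         math.sin(w[0]),  math.cos(w[0]), w[2]]
    Q = [[x1, x2, 1.0, 0.0, 0.0, 0.0],    # row for y_1
         [0.0, 0.0, 0.0, x1, x2, 1.0]]    # row for y_2
    return [sum(q * fk for q, fk in zip(row, f)) for row in Q]
```

For example, w = [pi/2, 1, 2] rotates (1, 0) onto (0, 1) and then translates by (1, 2), giving approximately (1, 3).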
Results

Grid size     rigid2d            rigid2d        % time saved
(X x Y)       (non-persistent)   (persistent)   using persistent memory
64 x 32          0.2181            0.2139          2
128 x 64         0.2369            0.2243          5
256 x 128        0.2289            0.2233          2
512 x 256        0.2247            0.2142          5
512 x 512        0.2320            0.2200          5
1024 x 512       0.2427            0.2135         12
1024 x 1024      0.2683            0.2329         13
2048 x 1024      0.2874            0.2379         17
CUDA MEX Registration Cycle

Grid size     PIR SSD RIGID   PIR SSD RIGID
(X x Y)       (MATLAB)        (CUDA MEX)
128 x 64       14.96 s          14.13 s
256 x 128      45 s             33 s
512 x 256      201.85 s         92 s
FAIR Improvements
- Use of Kronecker products.
- The explicit storage of the large coordinate grids could be avoided.
- Combination of functional modules.
- Relaxing the stringent requirement for lexicographical ordering.
CUDA MEX Improvements. Panels: (a), (b) CUDA driver objects; (c) improved framework.
Summary
1. Successful integration of MATLAB and CUDA.
2. Porting of the FAIR toolbox onto the GPU.
3. Fast implementation of spline interpolation within the CUDA MEX framework.
4. Analysis of accuracy results for texture usage for interpolant derivatives.
5. GPU acceleration of the fixed-level image registration scheme for large discretizations.
6. Implementation of persistent memory on GPUs.
Rank efficient operators
HSS: Hierarchically Semi-Separable Representation
Generic HSS structure
Symmetric HSS matrix: for siblings i and j:
Introducing Zeros
Partial factorisation of diagonal blocks
Compression
Merge
Update
Root node Compute full Cholesky
Cholesky based solver
HSS vs Classical
Summary
A continual struggle among computer architects, mathematicians, and computer scientists. The quick solution: satisfy Little's Law. Optimize data/task locality, data dependencies, communication, and variable/dynamic parallelism. Parallel hardware is here to stay; parallelism and scalability are crucial for success. This presents many important research challenges.