EVIP
Technical Application Field: Scientific Computing / Applied Numerics
EVIP: Variational Modeling, Parallel Processing, Rank-efficient Operators
Elasticity-Modeled Image Registration
Motivation: Given a reference image R and a template image T, find a reasonable transformation y such that the transformed image T[y] is similar to R.
Applications
HNSP: Sectioning --> sliced --> flattened --> stained --> mounted ... --> digitized. Large-scale digital images, up to 10,000 x 20,000 pixels. Courtesy: Oliver Schmitt, Eldad Haber & Jan Modersitzki
HNSP: Microscopy
HNSP: Deformed Images (sections 3799 and 3800, human, affine-linear registration)
HNSP: Results. 3D elastic registration of a part of the visual cortex: 2 hemispheres; 100 sections of 512 x 512 pixels.
Registration in Medical Imaging
Comparing/merging/integrating images from different times, devices, perspectives, or objects, e.g.:
- pre-/post-surgery CT images/MRI
- panorama imaging
- atlas/patient mapping
- catheter in blood vessel
- find a 2D view in 3D data
- HNSP: template matching, atlas mapping, serial sectioning
Registration is not restricted to medical applications.
Variational Modelling
Interpolation: continuous models for reference and template, built from discrete data.
Transformation
Eulerian versus Lagrangian View
Euler: T[y](x) = T(y(x)). Lagrange: (p, T(p)) -> (x(p), T(p)).
Euler: easy, but x in y^{-1}(Omega)? Lagrange: option for constraints.
(From NOMIR Part I, Eldad Haber & Jan Modersitzki.)
Distance measures Sum of Squared Differences (SSD)
Distance measures
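As a concrete sketch, the SSD measure can be discretized with a midpoint rule. The function name `ssd` and the 0.5 * h * ||T[y] - R||^2 scaling below are illustrative of FAIR-style codes, not the toolbox's actual SSD module:

```python
import numpy as np

def ssd(Ty, R, h):
    """Sum of Squared Differences, midpoint-rule discretization:
    D = 0.5 * h * sum((T[y] - R)^2), with h the cell volume.
    Illustrative sketch, not FAIR's actual SSD implementation."""
    res = (Ty - R).ravel()   # pointwise residual between warped template and reference
    return 0.5 * h * float(res @ res)
```

For identical images the residual vanishes and the distance is zero; the measure grows with the squared pointwise mismatch.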
Regularization: needed due to the ill-posedness of the registration problem.
Regularization: implicit vs. explicit regularization; parametric regularization (regularized parametric registration); non-parametric regularization.
Elastic Regularizer: elastic potential of the displacement u.
Numerical optimization
ELE (Euler-Lagrange equations) to PDE: balance of forces. Outer forces drive the registration; inner forces model tissue properties.
Discretized Regularizer: discretize S and u.
Discretized Cost function
Minimization of J: necessary condition for a minimizer.
Minimization of J: Solve.
Remarks on B: need to solve a system with B. B is HUGE, very sparse, and has a lot of structure. (Spy plot, nz = 3296.)
(Spy plot, nz = 6319.)
Performance Optimization
Outline: Fundamentals; Architecture and Little's Law; Yesterday's Constraints (ILP/DLP); Today's Constraints (MLP); Summary
Little's Law
Basic Throughput Quantities
Latency: every operation requires time to execute (e.g., instruction, memory, or network latency).
Bandwidth: number of (parallel) operations completed per cycle (e.g., #FPUs, DRAM, network).
Concurrency: total number of operations in flight.
Little's Law
Little's Law relates these three quantities:
Concurrency = Latency * Bandwidth
or, equivalently,
Effective Throughput = Expressed Concurrency / Latency
This concurrency must be filled with parallel operations.
You can't exceed peak throughput with superfluous concurrency (each channel has a maximum throughput).
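A minimal sketch of the two identities above (function names are illustrative):

```python
def concurrency(latency, bandwidth):
    """Little's Law: operations that must be in flight to sustain full bandwidth."""
    return latency * bandwidth

def effective_throughput(expressed_concurrency, latency):
    """Rearranged form: throughput actually achieved for a given concurrency."""
    return expressed_concurrency / latency
```

For example, a pipeline with latency 3 and issue width 1 needs 3 independent instructions in flight; if only 1 is available, throughput drops to 1/3 of peak.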
Basic Traffic Quantities. Traffic often includes: #floating-point operations (FLOPs); #bytes moved (from registers, cache, DRAM, or the network).
Performance Optimization: Contending Forces. Improve throughput (Gflop/s, GB/s, etc.) vs. reduce the volume of data (FLOPs, GBs, etc.): the contending forces of device efficiency and usage/traffic.
Performance Optimization: Contending Forces. Restructure to satisfy Little's Law: implementation & algorithmic optimization.
Architects, Mathematicians, Programmers
Architects: invent paradigms to improve (peak) throughput and facilitate(?) Little's Law.
Mathematicians: invent new algorithms to improve performance by reducing (bottleneck) traffic.
Programmers: restructure algorithms and implementations to exploit these new features.
Performance Optimization
Often boils down to several key challenges:
- management of data/task locality
- management of data dependencies
- management of communication
- management of variable and dynamic parallelism
Yesterday's Constraint: Instruction Latency & Parallelism
Single-issue, non-pipelined
Consider a single-issue, non-pipelined processor.
Little's Law: bandwidth = issue width = 1; latency = 1; concurrency = 1.
Very easy to get good performance, even if all instructions are dependent.
Pipelined
By pipelining, we can increase the processor frequency. However, the pipeline must be kept full to achieve good performance.
Little's Law: bandwidth = issue width = 1; latency = 3; concurrency = 3.
Performance may drop to 1/3 of peak.
Pipelined
There may be inherent and untapped parallelism in the code. Compilers/programmers must find the parallelism and unroll/reorder the code to keep the pipeline full.
Out-of-order
Alternately, the hardware can try to find instruction-level parallelism (ILP). Instructions are queued up, executed out-of-order, reordered, and committed in-order. Useful when parallelism or latency cannot be determined at compile time.
Superscalar
Increase throughput by executing multiple instructions in parallel. Usually separate pipelines for different instruction types: FP, integer, memory. Significantly complicates out-of-order execution.
SIMD
Many codes perform the same operations on different pieces of data (data-level parallelism, DLP). SIMD: Single Instruction, Multiple Data. Register sizes are increased: instead of each register holding one 64-bit FP number, each register holds 2 or 4 FP numbers. A much more efficient solution than superscalar on data-parallel codes.
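The idea can be illustrated with NumPy, whose vectorized operations are lowered to SIMD instructions where the hardware provides them (a sketch of the concept, not of any particular ISA):

```python
import numpy as np

# Scalar view: one multiply per element, issued one at a time.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([10.0, 20.0, 30.0, 40.0])
z_scalar = np.array([x[i] * y[i] for i in range(4)])

# Data-level parallelism: one vectorized multiply over all 4 lanes,
# the SIMD analogue of packing 4 FP numbers into one wide register.
z_simd = x * y
```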
Multithreaded
Superscalars fail when there is no ILP or DLP. However, many codes have thread-level parallelism (TLP). Consider architectures that are virtualized to appear as N cores; in reality, one core maintains multiple contexts and dynamically switches between them. There are 3 main types of multithreaded architectures:
- coarse-grained multithreading (CGMT)
- fine-grained multithreading (FGMT), aka vertical multithreading
- simultaneous multithreading (SMT)
Coarse-grained Multithreading
Maintain multiple contexts. On a long-latency instruction: dispatch the instruction, then switch to a ready thread. Hide latency with multiple ready threads; eventually switch back to the original.
Fine-grained Multithreading
Maintain multiple contexts. On every cycle, choose a ready thread. May now satisfy Little's Law through multithreading: threads ~ latency * bandwidth.
Simultaneous Multithreading
Maintain multiple contexts. On every cycle, choose as many ready instructions from the thread pool as possible. Can be applied to both in-order and out-of-order architectures.
Today's Constraint: The Memory Wall
Abstract Machine Model
Core executes: z = 0; for each i: z += x[i] * y[i];
Data path: register file (<6000 GB/s), cache (<1000 GB/s), DRAM (<50 GB/s).
Data in DRAM: float z; int i; float x[n], y[n];
Impact on Little's Law?
Today, utilizing the full DRAM bandwidth and minimizing memory traffic are paramount. DRAM latency can exceed 1000 CPU cycles. Impact on Little's Law (200 ns * 20 GB/s): 4 KB of data in flight. How did we solve this?
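The 4 KB figure follows directly from Little's Law applied to the memory interface; a quick sketch of the arithmetic with the numbers from the slide:

```python
# Little's Law at the memory interface: bytes that must be in flight
# to cover DRAM latency at full bandwidth.
latency_s = 200e-9             # DRAM latency: 200 ns
bandwidth_bytes_per_s = 20e9   # DRAM bandwidth: 20 GB/s
bytes_in_flight = latency_s * bandwidth_bytes_per_s  # concurrency, in bytes
# roughly 4000 bytes, i.e. about 4 KB of outstanding memory traffic
```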
FAIR on CUDA
A proof of concept of multicore acceleration
Sunil Ramgopal Tatavarty
June 14, 2013
Outline
1. FAIR: Image Registration; FAIR; fixed-level experiment
2. FAIR on CUDA: the design phase; CUDA MEX interpolation; CUDA MEX transformation; CUDA-enabled FAIR registration cycle
3. Improvements
4. Summary
Image Registration
Given a reference image R and a template image T, find a reasonable transformation y such that the transformed image T[y] is similar to R:

J[y] = D[T[y], R] + α S[y − y_ref] -> min over y   (1)

where D measures image similarity and S measures the reasonability of the transform.
A software viewpoint
FAIR: Flexible Algorithms for Image Registration
Image registration (optimization approach): J[y] = D[T[y], R] + α S[y − y_ref] -> min over y.
Salient features: continuous (functional) framework; numerical optimization; constrained image registration.
A collection of MATLAB files: a toolbox for image models, transformations, distance measures, regularizers, ...; multi-level, multi-scale, multigrid amenable.
Parametric Image Registration in FAIR: HNSP example, rigid/fine. Panels: (a) T(xc), (b) R(xc), (c) T(xc) − R(xc), (d) T(xc) with yc, (e) T(yc), (f) T(yc) − R(xc).
Profiling Results: HNSP PIR SSD rigid2d

Function name            Calls   Total time (s)   %
HNSP PIR SSD rigid2d         1       43.25      100
inter = splineinter2d      180       25.64       59.3
opt = Armijo                85        5.95       14
distance = SSD             175        1.12        2.6
trafo = rigid2d            179        0.648       1.5
FAIRplots and others        89        9.688      22.4
Design Requirements, Roadmap, and Considerations
Requirements:
- integration of the FAIR toolbox with the CUDA programming interface
- efficient implementations of FAIR functional modules on the GPU
- measurement of accuracy and runtime of the complete registration cycle and of individual modules
Roadmap:
1. Set up the CUDA MEX environment within the FAIR toolbox.
2. Implement an optimised FAIR interpolation toolbox within FAIR on CUDA.
3. Implement the transformation and distance toolboxes on CUDA.
4. Combine all CUDA functional modules to run a complete registration cycle on the GPU.
Textures in CUDA
A texture is an object for reading data.
Benefits:
- data is cached (optimized for 2D locality); helpful when coalescing is a problem
- filtering: linear / bilinear / trilinear, in dedicated hardware
- wrap modes for out-of-bounds addresses: clamp to edge / repeat
- addressable in 1D, 2D, or 3D, using integer or normalized coordinates
Usage: CPU code binds data to a texture object; the kernel reads data by calling a fetch function.
Basic interpolation schemes
Nearest neighbor: T_nn(x) := dataT(j) for the nearest node j, and T_nn(x) = 0 for x outside the domain. Low precision.
Linear: T_linear(x) := dataT(p) (1 − ξ) + dataT(p+1) ξ, where p = floor(x) and ξ = x − p.
B-Spline Interpolation

S[T] = ∫ (T''(x))² dx,   (2)

S[T] -> min subject to T(x_j) = dataT(j), j = 1, ..., m,   (3)

b(x) = { (x+2)³,                   −2 ≤ x < −1,
         −x³ − 2(x+1)³ + 6(x+1),   −1 ≤ x < 0,
         x³ + 2(x−1)³ − 6(x−1),     0 ≤ x < 1,
         (2−x)³,                    1 ≤ x < 2,
         0,                         else.   (4)

T(x) = T_spline(x) = Σ_{j=1}^{m} c_j b_j(x)   (5)
B-Spline Interpolation [Sigg, C. and Hadwiger, M.]

T_spline(x) = c_{p−1} b(ξ+1) + c_p b(ξ) + c_{p+1} b(ξ−1) + c_{p+2} b(ξ−2)   (6)
T_linear(x) := dataT(p) (1 − ξ) + dataT(p+1) ξ,   (7)
(a + b) T_linear(x) := dataT(p) a + dataT(p+1) b,   (8)
T_spline(x) = g_0(ξ) c_linear(p + h_0) + g_1(ξ) c_linear(p + h_1)   (9)
where
g_0(ξ) = b(ξ+1) + b(ξ),   g_1(ξ) = b(ξ−1) + b(ξ−2)   (10)
h_0 = b(ξ)/g_0(ξ) − 1,   h_1 = b(ξ−2)/g_1(ξ) + 1   (11)
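The two-fetch reduction in Eqs. (6) through (11) can be checked numerically. The sketch below (hypothetical helper names, 1D, plain Python in place of texture hardware) verifies that two weighted linear fetches reproduce the direct four-tap spline sum:

```python
def b(x):
    """Cubic mother spline of Eq. (4): piecewise cubic with support [-2, 2]."""
    if -2 <= x < -1:
        return (x + 2) ** 3
    if -1 <= x < 0:
        return -x ** 3 - 2 * (x + 1) ** 3 + 6 * (x + 1)
    if 0 <= x < 1:
        return x ** 3 + 2 * (x - 1) ** 3 - 6 * (x - 1)
    if 1 <= x < 2:
        return (2 - x) ** 3
    return 0.0

def spline_4tap(c, x):
    """Direct four-tap evaluation, Eq. (6). Assumes 1 <= x < len(c) - 2."""
    p = int(x)
    xi = x - p
    return (c[p - 1] * b(xi + 1) + c[p] * b(xi)
            + c[p + 1] * b(xi - 1) + c[p + 2] * b(xi - 2))

def linear_fetch(c, pos):
    """Stand-in for one hardware linear texture fetch, Eq. (7)."""
    p = int(pos)
    xi = pos - p
    return c[p] * (1 - xi) + c[p + 1] * xi

def spline_2fetch(c, x):
    """Sigg-Hadwiger evaluation via two weighted linear fetches, Eqs. (9)-(11)."""
    p = int(x)
    xi = x - p
    g0 = b(xi + 1) + b(xi)
    g1 = b(xi - 1) + b(xi - 2)
    h0 = b(xi) / g0 - 1       # fetch offset in [-1, 0], splits taps p-1 and p
    h1 = b(xi - 2) / g1 + 1   # fetch offset in [1, 2], splits taps p+1 and p+2
    return g0 * linear_fetch(c, p + h0) + g1 * linear_fetch(c, p + h1)
```

Each weighted fetch recombines two neighboring coefficients with the correct spline weights, so the four-tap sum costs only two (hardware-accelerated) linear lookups.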
Bandwidth Results: Interpolation

             splineinter2d (NN)          splineinter2d (bilinear)
Grid size    Measured  Worst   Best      Measured  Worst   Best
64x32          1.44     2.39   0.5         1.44     3.24   0.68
128x64         2.45     7.07   1.49        4.15    12.71   2.67
256x128        4       18.58   3.91       10.66    37.17   7.83
512x256        9.14    33.43   7.04       26.76   113.2   23.83

Panels: (a) splineinter2d (linear), (b) splineinter2d (NN).
Runtime Results: Interpolation

Grid size   linearinter2d   splineinter2d   splineinter2d       splineinter2d
            (FAIR) (ms)     (FAIR) (ms)     (NN texture) (ms)   (bilinear texture) (ms)
64x32          23.717          28.856           0.065               0.048
128x64         67.898          78.599           0.088               0.049
256x128       216.525         229.961           0.134               0.067
512x256       556.287         575.266           0.298               0.088

Panels: (a) runtime comparison, (b) runtime vs. ideal.
Results: Interpolation. Panels: (a) derivative test, inter2D (MATLAB); (b) derivative test, inter2D (CUDA MEX).
Rigid transformation
An affine-linear transformation allows for translation, rotation, shearing, and individual scaling. Its components are

y_1 = w_1 x_1 + w_2 x_2 + w_3,   (12)
y_2 = w_4 x_1 + w_5 x_2 + w_6.   (13)

In matrix form, with

Q(x) = [ x_1  x_2  1    0    0   0
          0    0   0   x_1  x_2  1 ],   (14)

y = Q(x) w.   (15)

Rigid transformation: a special affine-linear transform that allows only rotation and translation:

y_1 = cos(w_1) x_1 − sin(w_1) x_2 + w_2,   (16)
y_2 = sin(w_1) x_1 + cos(w_1) x_2 + w_3.   (17)

Although this function is non-linear in w, y(x) = Q(x) f(w) with f(w) = [cos w_1; −sin w_1; w_2; sin w_1; cos w_1; w_3].
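A minimal sketch of the parameterization y(x) = Q(x) f(w); the function name `rigid2d` mirrors the FAIR module, but the implementation is illustrative only:

```python
import math

def rigid2d(w, x1, x2):
    """Rigid 2D transform y = Q(x) f(w), Eqs. (14)-(17):
    rotation by angle w[0], then translation by (w[1], w[2]).
    Plain-Python sketch, not FAIR's rigid2D code."""
    f = [math.cos(w[0]), -math.sin(w[0]), w[1],
         math.sin(w[0]),  math.cos(w[0]), w[2]]
    Q = [[x1, x2, 1.0, 0.0, 0.0, 0.0],    # row for y_1
         [0.0, 0.0, 0.0, x1, x2, 1.0]]    # row for y_2
    return [sum(q * fk for q, fk in zip(row, f)) for row in Q]
```

For example, w = [pi/2, 1, 2] rotates (1, 0) onto (0, 1) and then translates by (1, 2), giving approximately (1, 3).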
Results

Grid size     rigid2d            rigid2d        % time saved
(X x Y)       (non-persistent)   (persistent)   using persistent memory
64 x 32          0.2181            0.2139          2
128 x 64         0.2369            0.2243          5
256 x 128        0.2289            0.2233          2
512 x 256        0.2247            0.2142          5
512 x 512        0.2320            0.2200          5
1024 x 512       0.2427            0.2135         12
1024 x 1024      0.2683            0.2329         13
2048 x 1024      0.2874            0.2379         17
CUDA MEX Registration Cycle

Grid size     PIR SSD RIGID   PIR SSD RIGID
(X x Y)       (MATLAB)        (CUDA MEX)
128 x 64       14.96 s          14.13 s
256 x 128      45 s             33 s
512 x 256      201.85 s         92 s
FAIR Improvements
- Use of Kronecker products.
- The explicit storage of the large coordinate grids could be avoided.
- Combination of functional modules.
- Relaxing the stringent requirement for lexicographical ordering.
CUDA MEX Improvements. Panels: (a), (b) CUDA driver objects; (c) improved framework.
Summary
1. Successful integration of MATLAB and CUDA.
2. Porting of the FAIR toolbox onto the GPU.
3. Fast implementation of spline interpolation within the CUDA MEX framework.
4. Analysis of accuracy results for texture usage for interpolant derivatives.
5. GPU acceleration of the fixed-level image registration scheme for large discretizations.
6. Implementation of persistent memory on GPUs.
Rank efficient operators
HSS: Hierarchically Semi-Separable Representation
Generic HSS structure
Symmetric HSS matrix: for siblings i and j:
Introducing Zeros
Partial factorisation of diagonal blocks
Compression
Merge
Update
Root node Compute full Cholesky
Cholesky based solver
HSS vs Classical
Summary
A continual struggle among computer architects, mathematicians, and computer scientists. The quick solution: satisfy Little's Law. Optimize data/task locality, data dependencies, communication, and variable/dynamic parallelism. Parallel hardware is here to stay; parallelism and scalability are crucial for success. This presents many important research challenges.