Efficient Imaging Algorithms on Many-Core Platforms

H. Köstler, Dagstuhl, 22.11.2011

Contents: Imaging Applications. HDR Compression: performance of PDE-based models. Image Denoising: performance of patch-based models. Image Deblurring: algorithmic problems. Image Segmentation: modelling problems.

HDR COMPRESSION: An efficient multigrid solver in action

HDR Compression of 2D X-ray Images. [Figure: original image (960x960) and its HDR-compressed version.] Data: Siemens AG, Healthcare Sector.

HDR Compression. The dynamic range of an image is the ratio between the brightest and darkest portions of the image that are accurately captured or observed. HDR compression is used to bring out more detail in the image. The method is based on "Gradient Domain High Dynamic Range Compression" [Fattal/Lischinski/Werman], SIGGRAPH 2002.

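HDR Compression. Idea: modify the magnitude of the image gradient by applying a position-dependent attenuating function $\Phi : \mathbb{R}^2 \to \mathbb{R}$:
$C = \nabla I \, \Phi$
$\Phi$ is computed on different image resolutions $l = 0, \dots, L$ by
$\varphi_l = \frac{\alpha}{\|\nabla I_l\|} \left( \frac{\|\nabla I_l\|}{\alpha} \right)^{\beta}, \quad \Phi_L = \varphi_L, \quad \Phi_l = P(\Phi_{l+1}) \, \varphi_l, \quad \Phi = \Phi_0$
Here $P$ transfers $\Phi_{l+1}$ to the next finer resolution. $\alpha$ determines which gradient magnitudes are left unchanged; $\beta < 1$ is the attenuating factor of the larger gradients.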
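The multi-resolution recursion for the attenuating function can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the pyramid is built by 2x2 block averaging, the transfer P is nearest-neighbour upsampling, and the names `attenuation`, `downsample`, `upsample` and the small floor `eps` on the gradient magnitude are hypothetical choices, not taken from the talk or from the implementation it describes.

```python
import numpy as np

def gradient_magnitude(img):
    # Finite-difference gradient; np.gradient returns d/d(axis 0), d/d(axis 1).
    gy, gx = np.gradient(img)
    return np.hypot(gx, gy)

def downsample(img):
    # Assumed restriction: 2x2 block averaging (needs even dimensions).
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    # Assumed transfer P: nearest-neighbour upsampling by a factor of 2.
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def attenuation(image, levels=3, alpha=0.1, beta=0.9, eps=1e-6):
    """Return Phi = Phi_0, built from the coarsest level L down to level 0."""
    pyramid = [image]
    for _ in range(levels):
        pyramid.append(downsample(pyramid[-1]))
    phi = None
    for level_image in reversed(pyramid):           # l = L, L-1, ..., 0
        mag = np.maximum(gradient_magnitude(level_image), eps)
        phi_l = (alpha / mag) * (mag / alpha) ** beta
        phi = phi_l if phi is None else upsample(phi) * phi_l
    return phi
```

With this definition, gradients of magnitude exactly alpha get the factor 1, larger gradients get a factor below 1, and smaller ones are slightly amplified, which matches the roles of alpha and beta stated on the slide.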

Imaging in Gradient Space. Energy functional: $E(u) = \int_\Omega \|\nabla u(x) - C\|^2 \, dx \to \min$. Euler-Lagrange equation: $\Delta u = \operatorname{div} C = f$. Solve the PDE by multigrid: $\Delta u = f$ in $\Omega$, $u = 0$ on $\partial\Omega$.
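To make the solver step concrete, here is a minimal geometric multigrid sketch for the Poisson problem with zero Dirichlet boundary values on the unit square. It is a textbook V-cycle, not the GPU solver from the talk: the damped Jacobi smoother, full-weighting restriction, bilinear prolongation, the grid size n = 2^k + 1, and all function names are assumptions made for illustration.

```python
import numpy as np

def residual(u, f, h):
    """r = f - Laplace_h(u) at interior points, 0 on the boundary."""
    r = np.zeros_like(u)
    lap = (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
           - 4.0 * u[1:-1, 1:-1]) / (h * h)
    r[1:-1, 1:-1] = f[1:-1, 1:-1] - lap
    return r

def smooth(u, f, h, sweeps, omega=0.8):
    """Damped Jacobi sweeps for the 5-point stencil (updates u in place)."""
    for _ in range(sweeps):
        nb = u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
        jac = (nb - h * h * f[1:-1, 1:-1]) / 4.0
        u[1:-1, 1:-1] = (1.0 - omega) * u[1:-1, 1:-1] + omega * jac
    return u

def restrict(r):
    """Full weighting from a (2m+1)^2 grid to an (m+1)^2 grid."""
    c = np.zeros(((r.shape[0] + 1) // 2, (r.shape[1] + 1) // 2))
    c[1:-1, 1:-1] = (
        4.0 * r[2:-2:2, 2:-2:2]
        + 2.0 * (r[1:-3:2, 2:-2:2] + r[3:-1:2, 2:-2:2]
                 + r[2:-2:2, 1:-3:2] + r[2:-2:2, 3:-1:2])
        + r[1:-3:2, 1:-3:2] + r[1:-3:2, 3:-1:2]
        + r[3:-1:2, 1:-3:2] + r[3:-1:2, 3:-1:2]) / 16.0
    return c

def prolong(c):
    """Bilinear interpolation from an (m+1)^2 grid to a (2m+1)^2 grid."""
    n = 2 * (c.shape[0] - 1) + 1
    fine = np.zeros((n, n))
    fine[::2, ::2] = c
    fine[1::2, ::2] = 0.5 * (c[:-1, :] + c[1:, :])
    fine[::2, 1::2] = 0.5 * (c[:, :-1] + c[:, 1:])
    fine[1::2, 1::2] = 0.25 * (c[:-1, :-1] + c[1:, :-1] + c[:-1, 1:] + c[1:, 1:])
    return fine

def v_cycle(u, f, h, nu1=2, nu2=2):
    """One V(nu1, nu2) cycle for Laplace(u) = f with fixed boundary values."""
    if u.shape[0] <= 3:
        # Coarsest grid: a single interior unknown, solved exactly.
        u[1, 1] = (u[0, 1] + u[2, 1] + u[1, 0] + u[1, 2] - h * h * f[1, 1]) / 4.0
        return u
    u = smooth(u, f, h, nu1)
    coarse_rhs = restrict(residual(u, f, h))
    error = v_cycle(np.zeros_like(coarse_rhs), coarse_rhs, 2.0 * h, nu1, nu2)
    u = u + prolong(error)
    return smooth(u, f, h, nu2)

def solve_poisson(f, cycles=10):
    """Solve Laplace(u) = f on the unit square with u = 0 on the boundary."""
    h = 1.0 / (f.shape[0] - 1)
    u = np.zeros_like(f)
    for _ in range(cycles):
        u = v_cycle(u, f, h)
    return u
```

In the HDR setting, `f` would hold the discrete divergence of the scaled gradient field C; each cycle performs two pre- and two post-smoothing sweeps, i.e. a V(2,2) cycle.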

General Software Features: high dynamic range compression for 2D CT images on GPU/CPU; image histogram computation and windowing on CPU/GPU; interactive computation on GPU with user-input parameters; interactive visualization of results with OpenGL on GPU.

Runtime Distribution for one Frame: transfer to device 7%; setup RHS, output 10%; transfer from device 17%; multigrid solver 66%. Processing steps: the frame is transferred to gradient space, the gradients are scaled, and the processed image is restored.

Frames per second for HDR Compression. [Figure: fps, fps (solver), and fps (CPU) for image sizes 1024x1024, 1024x2048, 2048x2048, 2048x4096, and 4096x4096.] CPU: Intel Core2 Quad Q9550 @ 2.83 GHz, OpenMP (4 cores); GPU: GTX 295.

Optimized HDR Compression. [Figure: fps for HDR compression (size 2048x2048) on GTX 295/2, GTX 480, and GTX 480 (wavefront).] Half of an NVIDIA GTX 295: 112 GB/s peak bandwidth, compute capability 1.3. NVIDIA GTX 480: 177 GB/s peak bandwidth, compute capability 2.0 (Fermi).

HDR Compression Results. [Figure: original image (960x960) and its HDR-compressed version.] Data: Siemens AG, Healthcare Sector.

HDR Compression Results. [Figure: attenuating function Φ.] Data: Siemens AG, Healthcare Sector.

Hierarchical Hybrid Grids (HHG): solve the 3D Poisson equation on an unstructured tetrahedral input grid; Bey's tetrahedral refinement; finite element discretization; patch-wise regular refinement generates nested grid hierarchies that are naturally suitable for geometric multigrid algorithms.

Data sets for 3D HDR Compression. MRI data provided by Universitätsklinikum Erlangen. [Figure: tetrahedral finite element mesh used in HHG.]

Strong Scaling for Multigrid Solver on Jugene. [Figure: runtime for one V(2,2) cycle in ms for 512x512x288, 1024x1024x576, and 2048x2048x1152.] The ratio of computation to communication is about 3 : 1.
Setups:
512x512x288: 30 465 215 unknowns = 40% mesh cover (16 patches per direction), 5646 cores
1024x1024x576: 201 476 955 unknowns = 33% (32 patches per direction), 37158 cores
2048x2048x1152: 1 617 632 955 unknowns = 33% (32 patches per direction), 37158 cores

Performance for one V(2,2)-cycle. Platforms compared: PowerPC 450, Xeon 5550, M1060, C2050, GTX 480.
2D const stencil (5p): 798.9 Mu/s, 1613 Mu/s
2D variable stencil (5p, complex): 86.2 Mu/s, 418.7 Mu/s
3D const stencil (7p): 7.4 Mu/s, 26.8 Mu/s, 93.2 Mu/s
3D variable stencil (7p): 11.2 Mu/s, 32.9 Mu/s, 88.3 Mu/s
For strong scaling on Jugene (PowerPC 450) we achieve 2.7 Mu/s per node and in total 12.3 Gu/s in HHG!

IMAGE DENOISING: Sparse coding

Image Denoising of 3D CT Volume. Data: Siemens AG, Healthcare Sector.

Noise Model. Assumption: the relation between an original, unknown image $u : \Omega \subset \mathbb{R}^d \to \mathbb{R}$ and an observed image $u_0$ can be expressed by $u_0 = u + \eta$, where $\eta$ stands for the noise, which is estimated locally.

Image Denoising Models.
Variational approach: requires the solution of a nonlinear diffusion-based PDE, which is done by a multigrid solver.
Wavelet-based approach: thresholding of coefficients based on the noise variance; the Haar wavelet proves to be the most efficient.
Sparse coding: the image is coded patch-wise by a sparse representation in an overcomplete basis; the coefficients are computed by the Batch-OMP algorithm.
M. Mayer, A. Borsdorf, H. Koestler, J. Hornegger, and U. Ruede, Nonlinear Diffusion vs. Wavelet Based Noise Reduction in CT Using Correlation Analysis, VMV 2007.
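As a small illustration of the wavelet-based approach, the following sketch applies one level of the orthonormal Haar transform to a 1D signal and zeroes detail coefficients below a noise-dependent threshold. The hard-threshold rule `k * sigma` with `k = 3`, the restriction to 1D and one level, and the function names are assumptions for this example; the talk does not specify them.

```python
import numpy as np

def haar_forward(signal):
    # One level of the orthonormal Haar transform (even-length input).
    approx = (signal[0::2] + signal[1::2]) / np.sqrt(2.0)
    detail = (signal[0::2] - signal[1::2]) / np.sqrt(2.0)
    return approx, detail

def haar_inverse(approx, detail):
    # Exact inverse of haar_forward.
    signal = np.empty(2 * len(approx))
    signal[0::2] = (approx + detail) / np.sqrt(2.0)
    signal[1::2] = (approx - detail) / np.sqrt(2.0)
    return signal

def haar_denoise(signal, sigma, k=3.0):
    # Hard thresholding: keep only detail coefficients larger than k * sigma.
    approx, detail = haar_forward(signal)
    detail = np.where(np.abs(detail) > k * sigma, detail, 0.0)
    return haar_inverse(approx, detail)
```

For a piecewise constant signal with small alternating noise, the detail coefficients fall below the threshold and are removed, while the approximation coefficients carry the signal.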

Patch-based Image Processing. Image processing is performed on many overlapping sub-blocks called patches. M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, 2010.
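Extracting all overlapping patches from an image can be sketched with NumPy's `sliding_window_view` (available since NumPy 1.20). The patch size of 8x8 follows the later slides; the helper name `extract_patches` and the column layout of the result are assumptions for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def extract_patches(image, size=8):
    """Return every overlapping size x size patch as one column of a matrix."""
    windows = sliding_window_view(image, (size, size))   # (H-size+1, W-size+1, size, size)
    count = windows.shape[0] * windows.shape[1]
    return windows.reshape(count, size * size).T          # (size*size, count)
```

Each column is then a vector x that can be coded against a dictionary D; for an H x W image there are (H - 7)(W - 7) overlapping 8x8 patches.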

Sample Dictionaries [figure]

Sparse Coding. A patch $x$ is represented by a linear combination of a few atoms of an overcomplete dictionary $D$. Dictionary $D$: a matrix comprising prototype signal atoms (an extension beyond basis vectors spanning a vector space). Find the sparsest representation $\hat{a}$ for $x$:
$\hat{a} = \arg\min_a \|a\|_0 \quad \text{subject to} \quad \|Da - x\|_2^2 \le \varepsilon$

Batch-OMP Algorithm. Solving the underdetermined linear system while finding the sparsest solution is NP-hard in general.
Efficient Orthogonal Matching Pursuit (OMP): a greedy algorithm that selects atoms sequentially. Select the atom with the highest correlation to the current residual, then project the signal orthogonally onto the span of the selected atoms.
More efficient Batch-OMP on GPU: no need to compute the residual; progressive Cholesky update instead of a full pseudoinverse computation.
D. Bartuschat, M. Stürmer, and H. Köstler, An orthogonal matching pursuit algorithm for image denoising on the Cell Broadband Engine, Parallel Processing and Applied Mathematics, pp. 557-566, 2010.
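The greedy selection can be sketched as plain OMP in NumPy. Note that this is deliberately the simple variant: it recomputes the residual and solves a least-squares problem in every step, which is exactly the work that Batch-OMP avoids with its progressive Cholesky update. The function name `omp`, the stopping tolerance, and the assumption of unit-norm atoms are illustrative choices.

```python
import numpy as np

def omp(D, x, max_atoms, tol=1e-12):
    """Greedy sparse coding of x over dictionary D (columns = unit-norm atoms)."""
    residual = x.copy()
    support = []
    coeffs = np.zeros(0)
    while len(support) < max_atoms and residual @ residual > tol:
        # Atom with the highest correlation to the current residual.
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:
            break
        support.append(k)
        # Orthogonal projection of x onto the span of the selected atoms.
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coeffs
    a = np.zeros(D.shape[1])
    if support:
        a[support] = coeffs
    return a
```

For example, with a dictionary made of the four unit vectors plus the atom (1, 1, 1, 1)/2, the signal (2, 0, -1, 0) is coded by the first and third unit vectors with coefficients 2 and -1.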

Batch-OMP Algorithm. [Figure: algorithm listing; the numbers in parentheses refer to its lines.] Find next atom (3, 4, 11); matrix multiplication for initial data (1); substitutions (5-9); projection (10); error update (12-13). R. Rubinstein, M. Zibulevsky, M. Elad, Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit, Technion, 2008.

Contribution to overall Batch-OMP Runtime. [Figures: Fermi GPU; Cell Broadband Engine.]

Patches per second for Batch-OMP [figure]

Performance of Batch-OMP for a single compute unit [figure]

Performance: Multigrid vs. Batch-OMP. To reach the same frame rate (120 fps) for an image of size 2048 x 2048 as our optimized multigrid solver, we can process 85000 patches if we select only 1 atom, or 17000 patches if we select 16 atoms. For comparison, such an image contains around 65000 non-overlapping patches of size 8x8.
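The patch budget can be checked with a few lines of arithmetic. The sketch below reads the two budgets as patches per frame, which is an interpretation of the slide rather than something it states explicitly; the variable names are illustrative.

```python
# Number of non-overlapping 8x8 patches in a 2048 x 2048 image.
width = height = 2048
patch = 8
non_overlapping = (width // patch) * (height // patch)

# Budgets from the slide, read as patches per frame at 120 fps (assumption).
fps = 120
budget = {1: 85000, 16: 17000}   # atoms selected -> patches per frame

for atoms, patches in budget.items():
    coverage = patches / non_overlapping   # fraction of the image that can be coded
    throughput = patches * fps             # implied patches per second
    print(f"{atoms:2d} atom(s): {coverage:.2f} x non-overlapping tiling, "
          f"{throughput / 1e6:.2f} M patches/s")
```

Under this reading, one atom per patch is enough to cover the full non-overlapping tiling (65536 patches) at 120 fps, while 16 atoms per patch covers roughly a quarter of it.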

IMAGE DEBLURRING: Variational and sparse coding approach

Image Deblurring. Data provided by G. Donnert, MPI Göttingen.

Image Deblurring. Assumption: the image $u$ is blurred (convolved) by a PSF or kernel $K$, resulting in the blurred image $x$: $Ku = x$. For a noise-free $u$, the deblurred image is given by $u = K^{-1} x$. For a noisy $u$, we have to take the noise into account (with original image $u^*$ and additive noise $n$): $u = u^* + n$.
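The difference between the noise-free and the noisy case can be illustrated with a small experiment. The sketch assumes periodic boundary conditions so that the blur becomes a multiplication in Fourier space, and it uses a hypothetical separable 3-tap kernel (0.2, 0.6, 0.2) whose spectrum never vanishes; neither choice comes from the talk.

```python
import numpy as np

def kernel_spectrum(shape, center=0.6, side=0.2):
    """Fourier transform of a separable, periodic 3-tap blur kernel."""
    def spectrum_1d(n):
        k = np.zeros(n)
        k[0], k[1], k[-1] = center, side, side
        return np.fft.fft(k)
    # The 2D FFT of a separable kernel is the outer product of the 1D FFTs.
    return np.outer(spectrum_1d(shape[0]), spectrum_1d(shape[1]))

def blur(u, K_hat):
    # x = K u as a circular convolution.
    return np.real(np.fft.ifft2(K_hat * np.fft.fft2(u)))

def naive_deblur(x, K_hat):
    # u = K^{-1} x: division in Fourier space, no regularization.
    return np.real(np.fft.ifft2(np.fft.fft2(x) / K_hat))
```

Applied to exact data, `naive_deblur(blur(u, K_hat), K_hat)` returns `u` up to rounding. If noise is added to the blurred image, the division amplifies it at frequencies where the kernel spectrum is small, which is why a regularized model is needed.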

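Simple Variational Model for Image Deblurring. The energy functional becomes
$E[u] = \int_\Omega \left( (Ku - x)^2 + \alpha \|\nabla u\|_2^2 \right) dx$
Resulting Euler-Lagrange equations, with the adjoint $K^*$ of $K$ and $f = K^* x$:
$(-\alpha \Delta + K^* K) \, u = f$ in $\Omega$, $\quad \langle \nabla u, n \rangle = 0$ on $\partial\Omega$
Drawback: the PSF can have large support!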

Image Deblurring Results. [Figure: original image, blurred and noisy image, deblurred image.] From: Y. Lou, A. L. Bertozzi, S. Soatto, Direct sparse deblurring, Journal of Mathematical Imaging and Vision, pp. 1-12, 2011.

Deblurring by Sparse Coding. Idea: use a blurred dictionary $D' = KD$. With $Ku = x$ and $u \approx Da$ we get $KDa = D'a \approx x$. Compute the coefficients $a$ with respect to $D'$, but then restore the deblurred image by using $D$:
$\hat{a} = \arg\min_a \|a\|_0 \quad \text{subject to} \quad \|D'a - x\|_2^2 \le \varepsilon$
No inverse problem has to be solved!
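A toy 1D version of this idea: code the blurred signal against the blurred dictionary D' = KD, then synthesize with the sharp dictionary D. Everything specific here is an assumption for illustration: the circulant 3-tap blur, the dictionary made of cosine atoms plus unit vectors, the plain OMP routine (not Batch-OMP), and a signal that consists of exactly one atom.

```python
import numpy as np

def omp(D, x, max_atoms, tol=1e-12):
    """Plain OMP; correlations are normalized because D' is not unit-norm."""
    norms = np.linalg.norm(D, axis=0)
    residual, support, coeffs = x.copy(), [], np.zeros(0)
    while len(support) < max_atoms and residual @ residual > tol:
        k = int(np.argmax(np.abs(D.T @ residual) / norms))
        if k in support:
            break
        support.append(k)
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coeffs
    a = np.zeros(D.shape[1])
    if support:
        a[support] = coeffs
    return a

n = 8
# Sharp dictionary D: normalized cosine atoms plus the unit vectors (16 atoms).
grid = np.arange(n)
cosines = np.cos(np.pi * (grid[:, None] + 0.5) * grid[None, :] / n)
cosines /= np.linalg.norm(cosines, axis=0)
D = np.hstack([cosines, np.eye(n)])

# Circulant blur K with kernel (0.2, 0.6, 0.2), and the blurred dictionary D'.
kernel = np.zeros(n)
kernel[0], kernel[1], kernel[-1] = 0.6, 0.2, 0.2
K = np.stack([np.roll(kernel, j) for j in range(n)], axis=1)
D_blurred = K @ D

# Sharp signal made of one atom, and its blurred observation x = K u.
a_true = np.zeros(D.shape[1])
a_true[5] = 3.0
u_true = D @ a_true
x = K @ u_true

# Code x against D', then restore with D; K is never inverted.
a_hat = omp(D_blurred, x, max_atoms=2)
u_restored = D @ a_hat
```

In this constructed case the restored signal matches the sharp one, because the observation is an exact multiple of a single blurred atom. Real images are only approximately sparse, so the result depends on how well the dictionary fits the data.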

Image Deblurring Results [figure]

Deblurring by Sparse Coding. Open problems: patch boundaries; the best dictionary is learned from original, non-blurred data.

IMAGE SEGMENTATION: Muscle fibres

Image Segmentation. Goal: extract fibres from structural images of a mouse muscle obtained by extended volume imaging. Data provided by O. Röhrle, Universität Stuttgart, from Dane Gerneke, Auckland Bioengineering Institute, University of Auckland, New Zealand.

Image Segmentation [figure]

Segmentation Process. The following steps are employed during the segmentation process:
Step 1: Pre-filtering of the raw image data
Step 2: Circle detection as an initial rough approximation to the shape of a fibre
Step 3: Finding the final contours of the muscle fibres by the method of active contours
Step 4: Post-processing
O. Roehrle, H. Koestler, and M. Loch, Segmentation of skeletal muscle fibers for applications in computational skeletal muscle mechanics, Computational Biomechanics for Medicine, Springer, 2011.

Automatic segmentation result [figure]

Future Work and Future Topics.
Image Deblurring: explore different dictionaries; solve boundary problems with ideas from domain decomposition.
Image Segmentation: include geometric/shape information in the model.