arxiv: v1 [cs.cv] 27 Apr 2018

Size: px
Start display at page:

Download "arxiv: v1 [cs.cv] 27 Apr 2018"

Transcription

1 A MATRIX-FREE APPROACH TO PARALLEL AND MEMORY-EFFICIENT DEFORMABLE IMAGE REGISTRATION LARS KÖNIG, JAN RÜHAAK, ALEXANDER DERKSEN, AND JAN LELLMANN, arxiv: v1 [cs.cv] 27 Apr 2018 Abstract. We present a novel computational approach to fast and memory-efficient deformable image registration. In the variational registration model, the computation of the objective function derivatives is the computationally most expensive operation, both in terms of runtime and memory requirements. In order to target this bottleneck, we analyze the matrix structure of gradient and Hessian computations for the case of the normalized gradient fields distance measure and curvature regularization. Based on this analysis, we derive equivalent matrix-free closed-form expressions for derivative computations, eliminating the need for storing intermediate results and the costs of sparse matrix arithmetic. This has further benefits: (1) matrix computations can be fully parallelized, (2) memory complexity for derivative computation is reduced from linear to constant, and (3) overall computation times are substantially reduced. In comparison with an optimized matrix-based reference implementation, the CPU implementation achieves speedup factors between 3.1 and 9.7, and we are able to handle substantially higher resolutions. Using a GPU implementation, we achieve an additional speedup factor of up to 9.2. Furthermore, we evaluated the approach on real-world medical datasets. On ten publicly available lung CT images from the DIR-Lab 4DCT dataset, we achieve the best mean landmark error of 0.93 mm compared to other submissions on the DIR-Lab website, with an average runtime of only 9.23 s. Complete non-rigid registration of full-size 3D thorax-abdomen CT volumes from oncological follow-up is achieved in 12.6 s. The experimental results show that the proposed matrix-free algorithm enables the use of variational registration models also in applications which were previously impractical due to memory or runtime restrictions. Key words. deformable image registration, computational efficiency, parallel algorithms AMS subject classifications. 92C55, 65K10, 65Y05 1. Introduction. Image registration denotes the process of aligning two or more images for analysis and comparison [30]. Deformable registration approaches, which allow for non-rigid, non-linear deformations, are of particular interest in many areas of medical imaging and contribute to the development of new technologies for diagnosis and therapy. Applications range from motion correction in gated cardiac Positron Emission Tomography (PET) [16] and biomarker computation in regional lung function analysis [15] to inter-subject registration for automated labeling of brain data [1]. For successful clinical adoption of deformable image registration, moderate memory consumption and low runtimes are indispensable, which is made more difficult by the fact that data is usually three-dimensional and therefore large. For example, in order to fuse computed tomography (CT) images for oncological follow-up, volumes with typical sizes of around voxels need to undergo full 3D registration; see Figure 1 for an example of a thorax-abdomen registration. In a clinical setting, these registrations have to be performed for every acquired volume and must be available quickly in order not to slow down the clinical workflow. In other situations, such as in population studies or during screening, the large number of required registrations demands efficient deformable image registration algorithms. In some areas, such as liver ultrasound tracking, real-time performance is even required [27, 11]. Consequently, a lot of research has been dedicated to increasing the efficiency Fraunhofer MEVIS, Lübeck, Germany (lars.koenig@mevis.fraunhofer.de, jan.ruehaak@mevis.fraunhofer.de, alexander.derksen@mevis.fraunhofer.de). Institute of Mathematics and Image Computing, University of Lübeck, Lübeck, Germany (jan.lellmann@mic.uni-luebeck.de) First preliminary results of this work were announced in 4-page conference proceedings [28, 25] 1

2 2 L. KO NIG, J. RU HAAK, A. DERKSEN AND J. LELLMANN (a) Prior scan (b) Current scan (c) Initial difference (d) After registration Fig. 1. Registration of thorax-abdomen 3D CT scans for oncological follow-up. (a) coronal slice of prior scan, (b) coronal slice of current scan with deformation grid overlay, (c) subtraction image before registration, (d) subtraction image after registration. The current scan is deformed onto the prior scan using the proposed variational method. The subtraction image highlights the areas of change (white spots in the lung) corresponding to tumor growth. Image courtesy of Radboud University Medical Center, Nijmegen, The Netherlands. of image registration algorithms, see [45, 47, 12] for an overview. The literature can be divided into approaches that aim at modifying the algorithmic structure of a registration method in order to increase efficiency [17, 18, 48] and approaches that specialize a given algorithm on a particular hardware platform mostly on massively parallel architectures such as Graphics Processing Units (GPUs) [24, 38, 4, 46], but also on Digital Signal Processors (DSPs) [2] or Field Programmable Gate Arrays (FPGAs) [9]. Contributions and overview. In the following, we focus on variational approaches for image registration, which rely on numerical minimization of an energy function that depends on the input data. While such methods have been successfully used in various applications [43, 5, 16, 36], they involve expensive derivative computations, which are not straightforward to implement efficiently and hinder parallelizability. In this work, we aim to work towards closing this gap: We analyze the derivative structure of the classical, widely-known variational image registration model [35, 36] with the normalized gradient fields distance measure [19] and curvature regularization [14], which has been successfully used for various image registration problems [49, 26, 11] (Section 3 and Section 4). We study the use of efficient matrix-free techniques for derivative calculations in order to improve computational efficiency (Section 4). In particular, we present fully matrix-free computation rules, based on the work on affinelinear image registration in [44] and deformable registration in [28, 25], for objective function gradient computations (Section 4.1.1) and Gauss-Newton Hessian-vector multiplications (Section 4.1.2). We perform a theoretical analysis of the proposed approach in terms of computational effort and memory usage (Section 5). Finally, we quantitatively evaluate CPU and GPU implementations (Section 6.3) of the new method with respect to speedup, parallel scalability (Section 6.1) and problem size, and in comparison with two alternative meth-

3 MATRIX-FREE DEFORMABLE IMAGE REGISTRATION 3 ods (Section 6.2). We demonstrate the real-world applicability on two medical applications (Section 6.4). Benefits of the proposed approach are full parallelization of objective function and derivative computations, reduction of derivative computation memory requirements from linear to constant, as well as a large reduction in overall runtime. The strategy can also be applied to other distance measures and regularizers, for which we discuss requirements and limitations. In order to underline the merits of the approach in the clinic, we exemplarily consider two medical applications. Firstly, the registration algorithm is employed for registering lung CT images in inhaled and exhaled position (Section 6.4.1). On the widely-used DIR-Lab 4DCT database [8, 6], we achieve the best mean landmark error of 0.93 mm in comparison with the results reported at [7] with an average runtime of only 9.23 s, computed on a standard workstation. Secondly, we consider oncological follow-up studies in the thorax-abdomen region (Section 6.4.2), where qualitatively convincing results are obtained at a clinically feasible runtime of 12.6 s. 2. Variational image registration framework Continuous model. While there are numerous different applications of image registration, the goal is always the same: to obtain spatial correspondences between two or more images. Here we only consider the case of two images. The first image is denoted the reference image R (sometimes also called fixed image), the second image is the template image T (or moving image) [36]. For three-dimensional gray-scale images, the two images are given by functions R : R 3 R and T : R 3 R on domains Ω R R 3 and Ω T R 3, mapping each coordinate to a gray value. To establish correspondence between the images, a deformation function ϕ : Ω R R 3 is sought, which encodes the spatial alignment by mapping points from reference to template image domain. With the composition T (ϕ) : Ω R R, x T (ϕ(x)), a deformed template image in the reference image domain can be obtained. In order to find a suitable deformation ϕ, two competing criteria must be balanced. Firstly, the deformed template image should be similar to the reference image, determined by some distance measure D(R, T (ϕ)). Secondly, as the distance measure alone is generally not enough to make the problem well-posed and robust, a regularizer S(ϕ) is introduced, which requires the deformation to be reasonable. An optimal deformation meeting both criteria is then found by solving the optimization problem (1) min J (ϕ), ϕ:ω R R 3 J (ϕ) := D(R, T (ϕ)) + αs(ϕ), where α > 0 is a weighting parameter. Numerous different approaches and definitions for distance measures [10, 51, 35] and regularizers [14, 3, 5] have been presented, each with specific properties tailored to the application. This work focuses on the normalized gradient fields (NGF) distance measure. It was first presented in [19] and has since been successfully used in a wide range of applications [43, 41, 26]. The NGF distance is particularly suited for multi-modal registration problems, where image intensities of reference and template image are related in complex and usually non-linear ways, and represents a fast and robust alternative [19] to the widely-used mutual information [10, 51]. It is defined as (2) ( ) 2 T (ϕ), R + τϱ D(ϕ) = 1 dx, Ω R T (ϕ) τ R ϱ

4 4 L. KÖNIG, J. RÜHAAK, A. DERKSEN AND J. LELLMANN where ε :=, + ε 2, and τ, ϱ > 0 are parameters that control filtering of noise in each image. The underlying assumption is that even in different imaging modalities, intensity changes still take place at corresponding locations. Therefore, the NGF distance advocates parallel alignment of image gradient directions, i.e., edge orientations. For the regularizer S, we use curvature regularization [14], S(ϕ) = Ω R j=1 3 ( u j ) 2 dx. S is defined in terms of the displacement u := (u 1, u 2, u 3 ) := ϕ id, where id is the identity transformation. The curvature regularizer penalizes second-order derivatives and thus favors smooth deformations, while not penalizing affine transformations such as translations or rotations. Similar to NGF, curvature regularization has been successfully used in multiple applications, e.g., [43, 26, 27] Discretization. In order to solve the registration problem (1), we follow a discretize-then-optimize approach [36]: Firstly, the deformation ϕ and objective function J are turned into a discretized energy. Secondly, this discretized energy is minimized using standard methods from numerical optimization (Section 2.3). For reference, the notation introduced in the following is summarized in Table 1. The image domain Ω R is discretized on a cuboid grid with m x, m y, m z grid cells (voxels) in each dimension and m := m x m y m z cells in total. The corresponding grid spacings are denoted by h x, h y, h z, and the cell volume by h := h x h y h z. The associated image grid is defined as the cell-centers X := {(îhx hx 2, ĵh y hy 2, ˆkh ) z hz 2 î = 1,..., m x, ĵ = 1,..., m y, ˆk } = 1,..., m z. Using a lexicographic ordering defined by the index mapping i := î + ĵ m x + ˆk m x m y, the elements in X can be assembled into a single vector x := (x 1,..., x 3 m ) R 3 m, which is obtained by first storing all x-, then all y- and finally all z- coordinates in the order prescribed by i. The i-th grid point in X is then x i := (x i, x i+ m, x i+2 m ) R 3. Evaluating the reference image on all grid points can be compactly written as R(x) := R (x i ) i=1,..., m, resulting in a vector R(x) R m of image intensities at the grid points. In a similar way, we define the discretized deformation y := ϕ(x y ) as the evaluation of ϕ on a deformation grid x y. Again we order all components of y in a vector, y = (y 1,..., y 3 m y), so that ϕ(x y i ) = (y i, y i+ m y, y i+2 m y), where the deformation grid has grid spacings h y x, h y y, h y z, and m y x, m y y, m y z denote the number of grid points in each direction with m y := m y xm y ym y z. Importantly, instead of using cell-center coordinates, the deformation is discretized on cell corners, i.e., using a nodal grid. This ensures that the deformation is discretized up to the boundary of the image domain, enabling an efficient grid conversion without requiring extrapolation or additional boundary conditions (Section 4.3). This results in m y x, m y y, m y z coordinates in each direction and m y := m y xm y ym y z grid points in total. The set of all nodal grid points is given by X y y := {(îh x, ĵh y y, ˆkh z) y î= 0,..., m y x 1, ĵ = 0,..., m y y 1, ˆk } = 0,..., m y z 1.

5 MATRIX-FREE DEFORMABLE IMAGE REGISTRATION 5 x, y, z spatial coordinate axes R, T reference, template image functions, R, T : R 3 R Ω R reference image domain, Ω R R 3 ϕ deformation function, ϕ : Ω R R 3 m, m y number of grid cells in the image/deformation grid, m = (m x, m y, m z) m, m y total number of grid cells, m = m xm ym z h, h y grid spacings for each coordinate direction, h = (h x, h y, h z) h, h y volume of a single grid cell, h = h xh yh z i linear index with i = î + ĵm x + ˆkm xm y i x, i + x indices of neighbors in x-direction, taking boundary conditions into account i y, i + y indices of neighbors in y-direction, taking boundary conditions into account i z, i + z indices of neighbors in z-direction, taking boundary conditions into account x, x y vector of lexicographically ordered grid points, x = (x 1,..., x 3 m ) R 3 m x i, x y i i-th grid point, x i = (x i, x i+ m, x i+2 m ) R 3 y deformation discretized on deformation grid, y = (y 1,..., y 3 m y ) R 3 my, (y i, y i+ m y, y i+2 m y ) = ϕ(x y i ) u displacement discretized on deformation grid, u = y x y R 3 my P grid conversion operator from deformation grid to image grid, P : R 3 my R 3 m ŷ deformation discretized on image grid points, ŷ = P (y) R 3 m ŷ i i-th deformation point, ŷ i = (P (y) i, P (y) i+ m, P (y) i+2 m ) R(x) reference image evaluated on image grid, R : R 3 m R m R i scalar value of reference image at i-th grid point, R i = R(x i ) R T (P (y)) template evaluated on deformed image grid, T : R 3 m R m T i scalar value of reference image at deformed i-th grid point, T i = T (ŷ i ) R Table 1 Summary of notation introduced in Section 2.2 for discretization of the continuous registration problem. All notation regarding the deformation grid is identical to the corresponding notation for the image grid, except for an additional superscript y, e.g., m denotes the number of grid cells in the image grid, and m y denotes the number of grid cells in the deformation grid. Analogously to the image grid, using the linear index i := î+ĵ m y x+ˆk m y xm y y, the points from X y can be ordered lexicographically in a vector x y := (x y 1,..., xy 3 m y) R3 my, with a single point having the coordinates x y i := (xy i, xy i+ m y, xy i+2 m y). We refer to [36] for further discussion on different grids. In this work, the deformation grid size is always chosen with m y k 1 m k, k = x, y, z, so that the image grid is as least as fine as the deformation grid. The separate choice of grids allows to use a coarser grid for the deformation which directly affects the number of unknowns and a high-resolution grid for the input images. However, it adds an extra interpolation step, as in order to compute the distance measure, the deformed template image T (ϕ) needs to be evaluated on the same grid as the reference image R. We use a linear interpolation function P : R 3 my R 3 m, mapping between image grid and deformation grid, and define T (P (y)) := (T (ŷ i )) i=1,..., m R m with ŷ i := (P (y) i, P (y) i+ m, P (y) i+2 m ). For evaluating the template image T, tri-linear interpolation with Dirichlet boundary conditions is used. This is a reasonable assumption for medical images, which often exhibit a black background. Discretizing the integral in (2) using the midpoint quadrature rule and denoting by T i, R i the i-th component function, the compact expression ( D(y) = h m T i (P (y)), R ) 2 i (x) + τϱ (3) T i (P (y)) τ R i (x) ϱ i=1 is obtained for the full discretized NGF term. The factor 1/2 is caused by the square-

6 6 L. KÖNIG, J. RÜHAAK, A. DERKSEN AND J. LELLMANN then-average scheme for discretizing the gradient : Given a discretized image I R m, we define the backward difference operator at the i-th grid point, I i : R m R 3, as I i := ( Ii I i x h x where the neighborhood is defined via, I i I i y, I ) i I i z, h y h z i x := max(î 1, 1) + ĵ m x + ˆkm x m y (4) i y := î + max(ĵ 1, 1)m x + ˆkm x m y i z := î + ĵm x + max(ˆk 1, 1)m x m y i + x := min(î + 1, m x ) + ĵ m x + ˆkm x m y i + y := î + min(ĵ + 1, m y ) m x + ˆkm x m y i + z := î + ĵm x + min(ˆk + 1, m z ) m x m y and i 0 := i. This allows to implement Neumann boundary conditions without special treatment of the boundary cells. By + I i we denote the corresponding forward finite differences operator. We combine both operators into one, ( I i := I i, ) (5) + I i with I i : R m R 6, and define I 1 i ε := 2 I i, I i + ε 2. This square-thenaverage scheme for approximating the terms in (2) allows to use short forward and backward differences, which better preserve high frequency gradients than long central differences [21]. The factor 1/2 is required in order to faithfully discretize (2), and can be interpreted as an averaging of the squared norms of the forward and backward operators. Similarly to the NGF, the curvature regularizer is discretized as (6) y m S(y) = h y 2 i=1 d=0 ( ui+d m y ) 2, with the discretized displacement u := (u 1,..., u 3 m y) := y x y and the discretized Laplace operator (7) u i+d m y := k {x,y,z} 1 (h y k )2 (u i k+d m y 2 u i+d m y + u i+k+d m y) with homogeneous Neumann boundary conditions. Overall, we obtain the discretized version (8) min J(y), y R3 my J(y) := D(y) + αs(y) of the minimization problem (1), which can then be solved using quasi-newton methods, as summarized in the following section Numerical optimization. In order to find a minimizer of the discretized objective function (8), we use iterative Newton-like optimization schemes. In each step of the iteration, an equation of the form (9) ˆ 2 J(y k )s k = J(y k )

7 MATRIX-FREE DEFORMABLE IMAGE REGISTRATION 7 is solved for a descent direction s k. Then y k is updated via y k+1 = y k + ηs k, where the step length η is determined by Armijo line search [39, 36]. The matrix ˆ 2 J should approximate the Hessian 2 J(y k ). Here we consider the Gauss-Newton scheme and the L-BFGS scheme, which have both been used in different image registration applications [50, 25, 43]. The minimization is embedded in a coarse-to-fine multi-level scheme, where the problem is solved on consecutively finer deformation and image grids Gauss-Newton. The Gauss-Newton scheme uses a quadratic approximation of the Hessian and is suitable for least-squares type objective functions of the form (10) Ĵ(y) = ˆr(y) ˆr(y) = ˆr(y) 2 2, where ˆr(y) is a residual function, which can depend in a nonlinear way on the unknown y. The gradient in the Newton equation (9) can be written as Ĵ = 2 ˆr dˆr, where dˆr is the Jacobian of ˆr, and the Hessian is approximated by H := ˆ 2 Ĵ = 2dˆr dˆr [39]. This approximation discards second-order derivative parts of the Hessian and guarantees a symmetric positive semi-definite Hessian approximation H, so that s k in (9) is always a descent direction. We apply this approximation only to the Hessian of the distance measure D, 2 D H. For the regularizer S, we use the exact Hessian, which is readily available, as S is a quadratic function. The (quasi-)newton-equation (9) can be written as (H + α 2 S)s k = ( D + α S), which is then approximately solved in each step using a conjugate-gradient (CG) iterative solver [39] L-BFGS. Instead of directly determining a Hessian approximation at each step, the L-BFGS scheme iteratively updates the approximation using previous and current gradient information. L-BFGS requires an initial approximation H 0 of the Hessian, which is often chosen to be a multiple of the identity. However, in our case we can incorporate the known Hessian of the regularizer S by choosing H 0 = 2 S + γi, where I is the identity matrix and γ > 0 is a parameter in order to make H 0 positive definite; further details can be found in [36, 39]. 3. Analytical derivatives. As can be seen from Section 2.3, computing derivatives of distance measure and regularizer is a critically important task for the minimization of the discretized objective function. In this section, algebraic formulations of the required derivatives are presented, which allow for a closer analysis and construction of specialized computational schemes Curvature. The discretized curvature regularizer (6) is a quadratic function involving the (linear) Laplace operator u i. Thus, using the chain rule the i-th element of the gradient can explicitly be computed by ( S(y)) i+d m y = 2 h ( ) (11) y u i+d m y, with d {0, 1, 2} for the directional derivatives. The Hessian is constant and can be immediately seen from (7).

8 8 L. KÖNIG, J. RÜHAAK, A. DERKSEN AND J. LELLMANN 3.2. NGF. The inner computations of the NGF in (3) can be rewritten more compactly by introducing a residual function r : R m R m with components (12) r i (T ) := so that the NGF term in (3) becomes (13) D(y) = h 1 2 T i, R i + τϱ T i τ R i ϱ, m i=1 ( 1 ri (T (P (y))) 2). Introducing a reduction function ψ : R m R, (r 1,..., r m ) h m i=1 (1 r2 i ), it holds (14) D(y) = ψ(r(t (P (y)))). This composite function maps the deformation y R 3 my onto a scalar image similarity in four steps, (15) R 3 my P R 3 m T R m r R m ψ R. Using the chain rule, the gradient of D can now be obtained as a product of (sparse) Jacobian matrices ( ) ψ r P (16) D =. r P y The Gauss-Newton approximation of the Hessian (Section 2.3.1) becomes (17) P H = 2 h r r P y P P y. Note that, in contrast to the classical Gauss-Newton method (10), the residual in (13) has a different sign. Therefore, in order to compute a descent direction in (9), the sign of the Hessian approximation has been inverted (see also [36]). Using these formulations, the evaluation of the gradient and Hessian can be implemented based on sparse matrix representations [36]. However, while insightful for educational and analytic purposes, these schemes have important shortcomings: storing the matrix elements requires large amounts of memory, while assembling the sparse matrices and sparse matrix multiplications impede efficient parallelization. In the following chapters, we will therefore focus on strategies for direct evaluation that do not require intermediate storage and are highly parallel. 4. Derivative analysis and matrix-free computation schemes. From a computational perspective, the NGF (13) and curvature formulations (6) are very amenable to parallelization, as all summands are independent of each other, and there are no inherent intermediate matrix structures required. However, this advantage is lost when the evaluation of the derivatives is implemented using individual matrices for the factors in (16) and (17). In order to exploit the structure of the partial derivatives, in the following sections we will analyze the matrix structure of all derivatives and derive equivalent closedform expressions that do not rely on intermediate storage. We will make substantial use of the fact that the matrix structure of the partial derivatives is independent of the data. Therefore, the locations of non-zero elements are known a priori, which allows to derive efficient sparse schemes.

9 MATRIX-FREE DEFORMABLE IMAGE REGISTRATION Derivative computations for NGF Gradient. The NGF gradient computation in (16) involves four different Jacobian matrices. The derivative of the vector reduction ψ(r) is straightforward: (18) ψ r = 2 hr = 2 h (r 1,..., r m ) R 1 m. In contrast, the computation of r R m m is considerably more involved. In order to calculate a single row of this Jacobian ri, the definition (13) can be used and interpreted as a quotient (19) r i = r 1,i r 2,i = 1 2 T i, R i + τϱ T i τ R i ϱ. Separately differentiating numerator and denominator, this yields r 1,i = 1 2 ( R i ) i R1 m and r 2,i = R i ϱ ( T i ) 2 T i τ i R1 m. Included in both of these terms is the derivative of the finite difference gradient computation T i : R m R 6 as defined in (5), i.e., mapping an image to the forwardand backward difference gradients. As the gradient at a single point only depends on the values at the point itself and at neighboring points, the Jacobian i R6 m is of the form i = i z i y i x i i + x i + y i + z 1 1 h x h x 1 1 h y h y 1 h z 1 h z 1 h x 1 h x 1 h y 1 h y 1 h z 1 h z where only non-zero elements are shown and the non-zero column indices are displayed above the matrix. The quotient rule, applied to (19), leads to r i = 1 ( ) r1,i r2,i 2 r r 2,i 2,i r 1,i = r 1,i 1 r 1,i r 2,i (20) r 2,i r2,i 2 (21) (22) ( = R i ) 1 i 2 T i τ R i ϱ 2 T i, R i + τϱ T i 2 τ R i 2 ϱ ( = 1 ( R i ) 1 i 2 T i τ R i ϱ 2 T i, R i + τϱ T i 3 τ R i ϱ We introduce the abbreviation (23) ρ i (k) := R i + R i+k T i τ R i ϱ R ( i T i ) i ϱ 2 T i τ ) ( T i ) i ( 1 2 T i, R ) i + τϱ ( T i + T i+k ) T i 3 τ R, i ϱ.,

10 10 L. KÖNIG, J. RÜHAAK, A. DERKSEN AND J. LELLMANN and the set of indices of non-zero elements (24) K := { z, y, x, 0, x, y, z}. Furthermore, define ĥk := 1 2(h k ) 2 and (25) { j K\{0} ˆρ i (k) := ĥjρ i (j) if k = 0, ĥ k ρ i (k) otherwise. Then, (22) can be compactly written as (26) r ( i = i z i y i x i i + x i + y i + z ) ˆρ i ( z) ˆρ i ( y) ˆρ i ( x) ˆρ i (0) ˆρ i (x) ˆρ i (y) ˆρ i (z) R 1 m, where only non-zero elements are shown and single element locations are denoted above the vector. It can be seen that (26) exhibits a very sparse pattern with only seven non-zero elements. For a full gradient computation as in (16), it remains to consider the terms P and P y. The first term represents the derivative of the template image interpolation function with respect to the image grid coordinates. Given the lexicographical ordering of the grid points, the image derivative matrix is composed of three diagonal matrices, (27) ( ( ) ( P = diag 1 m P 1... P m, diag 1 P m+1... ) ( m P 2 m, diag 1 P 2 m+1... )) m P 3 m, where each non-zero element represents the derivative at a single point with respect to a single coordinate. The last remaining term P y is the Jacobian of the grid conversion function. This is a linear operator, which will be analyzed in detail in Section 4.3. From the derivations in (18), (26) and (27), it can be seen that the main components of the NGF gradient computation exhibit a sparse structure with fixed patterns. The main effort comes from computing the derivative of r, which has a seven-banded diagonal structure. Exploiting this pattern, a single element of the partial gradient D ψ r P := r P R 1 3 m can be explicitly computed as ( ) ( ) D = 2 P h i (28) r i+k ˆρ i+k ( k), i+d m P i+d m k K where d {0, 1, 2} and K = { z, y, x, 0, x, y, z} as in (24). Algorithm 1 shows a pseudocode implementation for computing the individual elements of the gradient as in (28). This formulation has multiple benefits: Any element of the gradient can be computed directly from the input data, without the need for (sparse) matrices and without having to store intermediate results. Furthermore, gradient elements can be computed independently, which allows for a fully parallel implementation. As will be discussed in Section 5, both of these properties substantially decrease memory usage and computation time.

11 MATRIX-FREE DEFORMABLE IMAGE REGISTRATION 11 Algorithm 1 Pseudocode for matrix-free computation of the elements of the NGF gradient as in (28). The algorithm can be fully parallelized over all loop iterations. 1: for ˆk in [0, m z 1] do 2: for ĵ in [0, m y 1] do 3: for î in [0, m x 1] do 4: i î + m x ĵ + m x m yˆk Compute linear index 5: i ± x, i ± y, i ± z as in (4) Compute neighbor indices 6: [dtx, dty, dtz] imagederivative(t i ) Compute image derivative 7: 8: r [r i z, r i y, r i x, r i, r i+x, r i+y, r i+z ] r i as defined in (19) 9: dr [ˆρ i z (z), ˆρ i y (y), ˆρ i x (x), ˆρ i (0), ˆρ i+x ( x), ˆρ i+y ( y), ˆρ i+z ( z)] 10: ˆρ i as defined in (25) 11: drsum 2 h (r[0]dr[0] + r[1]dr[1] + r[2]dr[2] + r[3]dr[3] 12: + r[4]dr[4] + r[5]dr[5] + r[6]dr[6]) 13: Compute sum of (28) 14: grad[i ] drsum dtx 15: grad[i + m] drsum dty 16: grad[i + 2 m] drsum dtz 17: end for 18: end for 19: end for Output: grad[i + d m] = ( ) D P i+d m Hessian-vector multiplication. As noted in Section 2.3.1, when using the Gauss-Newton scheme for optimization, linear systems involving the quadratic approximation H R 3 my 3 m y of the Hessian (17) need to be solved in each iteration. Although H is sparse, the memory requirements for storing the final as well as intermediate matrices can still be considerable. Therefore, in the following we will consider an efficient scheme for evaluating the matrix-vector product Hp = q with p, q R 3 my, which is the foundation for using iterative methods, such as conjugate gradients, in order to solve the (Gauss-)Newton equation (9). Figure 2 shows a schematic of the computations involved. Again, we consider the Jacobian P y and its transpose in (17) as separate grid conversion steps, which will be discussed in detail in Section 4.3. The main computations consist of computing the matrix product (29) Ĥ := r r P P R3 m 3 m, with the components P R m 3 m r, R m m, which is equivalent to computing the approximate Hessian H in (17) with the exception of the grid conversion steps. We first analyze the matrix-vector product Ĥˆp = ˆq with ˆp, ˆq R3 m. Abbreviating dr := r, the main challenge is efficiently computing the matrix product dr dr. Using the definition and notation of a single row of dr given in (26), a single

12 12 L. KÖNIG, J. RÜHAAK, A. DERKSEN AND J. LELLMANN column of the matrix can be written as (30) r i = ( i z i y i x i i + x i + y i + z ˆρ i z (z) ˆρ i y (y) ˆρ i x (x) ˆρ i (0) ˆρ i+x ( x) ˆρ i+y ( y) ˆρ i+z ( z) ), again only showing non-zero elements at the indices denoted above the vector. As in (26), a single column contains only seven non-zero elements. Abbreviating dr i := r i R m 1, each element of the matrix dr dr is a scalar product of columns dr i, dr j for i, j = 1,..., m. However, due to the sparsity of the columns of dr in (30), there are only few non-zero scalar products. As a basic example, consider the diagonal elements of dr dr. In column dr i, non-zero elements are located at (dr i ) i+k, k K, with K as in (24). This yields the scalar products dr i, dr i = k K (dr i) 2 i+k = k K ˆρ i+k( k) 2, a sum of seven terms. More generally, for arbitrary i, j, define κ := j i, so that dr i, dr j = dr i, dr i+κ. In order to characterize the elements where dr i, dr i+κ can be non-zero, observe that the non-zero elements of dr i are located at indices i + M, (31) M := { m x m y, m x, 1, 0, 1, m x, m x m y }. Consequently, dr i, dr i+κ can only be non-zero if (i + M) (i + κ + M), i.e., if the scalar product involves at least two non-zero elements. Equivalently, κ N := M M, therefore the number of non-zero inner products is bounded from above by the number of elements N in N. Naturally, N M M = 49, however it turns out that this bound can be substantially lowered. In order to do so, define the mapping N(i, j) := j i, N : M M Z, so that N = N(M, M) = { } κ Z κ = N(i, j), i, j M. Inspecting all possible combinations for i and j, it turns out that N = 25 (Table 2), which implies that each column of dr dr has at most 25 non-zero elements. For a given offset κ = j i, the pre-image N 1 (κ) indicates the indices of nonzero elements contributing to the inner product dr i, dr j. As shown in the rightmost column in Table 2, for the main diagonal of dr dr, seven elements are involved in the scalar product, while for all other elements only either one or two products need to be computed. Using κ = j i and (î, ĵ) N 1 (κ), it holds ( dr dr ) (dr i ) (dr i+î j), if j i N, j+ĵ i,j = (32) (î,ĵ) N 1 (j i) 0, otherwise, ˆρ ( î) i+î ˆρ j+ĵ ( ĵ), if j i N, = (î,ĵ) N 1 (j i) 0, otherwise, where, in a slight abuse of notation, ˆρ( m 1 m 2 ) should be interpreted as ˆρ( z), etc. With this and the explicit formulation of ˆρ i (k) in (23) and (25), the computational cost of calculating dr dr can be reduced substantially: multiplications involving zero

13 MATRIX-FREE DEFORMABLE IMAGE REGISTRATION 13 P y P r r P P y p = q Fig. 2. Schematic view of Hessian-vector multiplication Hp = q. The computation involves highly sparse matrices with a fixed pattern of non-zeros. The main computational effort results from the matrix multiplication r r, where r has seven non-zero diagonals. elements are no longer considered, administrative costs for locating and handling sparse matrix elements have been eliminated, and only essential calculations remain. Substituting (32) into the expression (29) for evaluating the matrix-vector product ˆq = Ĥˆp, we obtain ˆq d m+i = κ N l {0,1,2} i P d m+i ( dr dr ) i,i+κ κ P l m+i+κ ˆp l m+i+κ, (33) for d {0, 1, 2}, using the definition of ( dr dr ) i,i+κ from (32). Similar to the matrix-free gradient formulation in (28), this allows a direct computation of each element of the result vector ˆq = Ĥˆp fully in parallel and directly from the input data. An implementation in pseudocode is shown in Algorithm Derivative computations for curvature regularization. In (11), a matrix-free version of the curvature gradient computation was already given. As the curvature computation is a quadratic function, the Hessian-vector multiplication q i+d m y = ( 2 Sp ) i+d m y = 2 h y ( p ) i+d m y, (34) for d {0, 1, 2}, is closely related, with p replacing u in (11). In contrast to NGF, this uses the exact Hessian rather than a Gauss-Newton approximation. As the regularizer operates on the deformation grid, no grid conversion is required Grid conversion. We now consider the grid conversion steps omitted in the previous sections. The grid conversion maps deformation values y R 3 my from the deformation grid to the image grid using the function P : R 3 my R 3 m (Section 2.2). In this section, we separately analyze P and its transpose P. The conversion, i.e., interpolation, of y from deformation grid to image grid is performed using tri-linear interpolation and thus can be written as a matrix-vector product ŷ = P y with P R 3 m 3 my and y R 3 my, ŷ R 3 m. The transformation matrix P has a block-diagonal structure with three identical blocks, each converting one of the three coordinate components of y. As the deformation grid is a nodal grid, no boundary handling is required, as each (cell-centered) image grid point is surrounded by deformation grid points.

14 14 L. KÖNIG, J. RÜHAAK, A. DERKSEN AND J. LELLMANN κ N N 1 (κ) N 1 (κ) 2m 1m 2 {(m 1m 2, m 1m 2)} 1 m 1m 2 m 1 {(m 1, m 1m 2), (m 1m 2, m 1)} 2 m 1m 2 1 {(1, m 1m 2), (m 1m 2, 1)} 2 m 1m 2 {(0, m 1m 2), (m 1m 2, 0)} 2 m 1m {( 1, m 1m 2), (m 1m 2, 1)} 2 m 1m 2 + m 1 {( m 1, m 1m 2), (m 1m 2, m 1)} 2 2m 1 {(m 1, m 1) 1 m 1 1 {(1, m 1), (m 1, 1)} 2 m 1 {(0, m 1), (m 1, 0)} 2 m {( 1, m 1), (m 1, 1)} 2 2 {(1, 1)} 1 1 {(0, 1), (1, 0)} 2 0 {(î, î) î M} 7 1 {(0, 1), ( 1, 0)} 2.. Table 2 Offsets κ := j i, for which dr i, dr j = 0, corresponding to non-zero elements (dr dr) i,j in the matrix-product dr dr. The pre-image sets N 1 (κ) characterize the locations of the non-zero element in dr i and dr j. The 11 omitted cases are identical to the ones shown except for opposite sign. The rightmost column shows the number of products involving non-zero coefficients in the evaluation of dr i, dr j and sums to M M = In order to get closed-form expressions for the non-zero entries of P, define the function c l (k) := (k+0.5)my l m l, which converts a coordinate index from deformation to image grid, and the remainder function r l (k) := c l (k) c l (k). Then, for the point at coordinates (î, ĵ, ˆk) with i = î + ĵm x + ˆkm x m y and d {0, 1, 2}, the interpolated deformation value can be written as (35) (P y)î+ĵmx+ˆkm xm y+d m = 1 1 α=0 β=0 γ=0 1 w α,β,γ (î, ĵ, ˆk) y, ξ(α,β,γ,î,ĵ,ˆk,d) where the indices of the neighboring deformation grid points are (36) ξ(α, β, γ, î, ĵ, ˆk, d) := and the interpolation weights are given by ( ) ( ) ( ) c x (î) + α + m x c y (ĵ) + β + m x m y c z (ˆk) + γ + d m w α,β,γ (î, ĵ, ˆk) := ŵ α x (î)ŵ β y (ĵ)ŵ γ z (ˆk), ŵ s l (k) := { 1 r l (k), if s = 0, r l (k), otherwise. As can be seen in (35), a single interpolated deformation value on the image grid is simply a weighted sum of the values of its eight neighbors on the deformation grid, see also Figure 3(a). For the NGF derivative computations in (16) and (17), the left-sided multiplication ŷ P = (P ŷ) is also required in order to evaluate the transposed operator. For each (nodal) deformation grid point, this amounts to a weighted sum of values from all image grid points in the adjacent deformation grid cells, which can be many more than eight (Figure 3(b)).

15 MATRIX-FREE DEFORMABLE IMAGE REGISTRATION 15 Algorithm 2 Pseudocode for matrix-free computation of the elements of the result vector ˆq in the NGF Hessian-vector multiplication Ĥˆp = ˆq as in (29). The algorithm can be fully parallelized over all loop iterations of î, ĵ, ˆk, computing three elements of the result per thread. 1: N [ 2m 1m 2, m 1m 2 m 1, m 1m 2 1, m 1m 2, m 1m 2 +1, m 1m 2 +m 1, 2m 1, m 1 1, m 1, 2: m 1 + 1, 2, 1, 0, 1, 2, m 1 1, m 1, m 1 + 1, 2m 1, m 1m 2 m 1, m 1m 2 1, m 1m 2, m 1m : m 1m 2 + m 1, 2m 1m 2] Initialize 25 indices in N, see Table 2 4: q 0 5: for ˆk in [0, m z 1] do 6: for ĵ in [0, m y 1] do 7: for î in [0, m x 1] do 8: i î + m x ĵ + m x m yˆk Compute linear index 9: i ± x, i ± y, i ± z as in (4) Compute neighbor indices 10: [dtx, dty, dtz] imagederivative(t i+n ) 11: Compute image derivatives at 25 points 12: for k in [0, 24] do Compute 25 values of dr dr, see (33) 13: drdr 0 14: 15: for (ĩ, j) in N 1 (N[k]) do 16: Compute 49 values of ˆρ i, see (32) and Table 2 17: drdr drdr + ˆρ ( ĩ) i+ĩ ˆρ ( j) i+n[k]+ j 18: end for 19: 20: q[i ] q[i ] + dtx[12] drdr dtx[k] ˆp i+n[k] 21: q[i + m] q[i + m] + dty[12] drdr dty[k] ˆp i+n[k]+ m 22: q[i + 2 m] q[i + 2 m] + dtz[12] drdr dtz[k] ˆp i+n[k]+2 m 23: The 12th element corresponds to the derivative at index i 24: end for 25: end for 26: end for 27: end for Output: q[i + d m] = ˆq d m+i By singling out the contribution of all image grid points within a single deformation grid cell as in Figure 3(c), the weights ŵl s (k) can be re-used and only need to be computed once for each deformation grid cell. This suggests to parallelize the computations on a per-cell basis. However, as a deformation grid point has multiple neighboring grid cells, this would lead to write conflicts. Therefore, in order to achieve both efficient re-use of weights as well as parallelizability, we propose to use a red-black scheme as shown in Figure 3(c). Each parallel unit sequentially processes a single 2D slice of deformation grid cells, first all odd (red) and finally all even slices (black) Extension to related models. By carefully examining the derivative structure of the NGF distance measure and curvature regularizer, we have derived closed-form expressions for computing individual elements of the gradients as well as Hessian matrix-vector products. The new formulations operate directly on the input data, avoid storage of intermediate results (and thus costly memory accesses) and allow for fully parallel execution.

16 16 L. KÖNIG, J. RÜHAAK, A. DERKSEN AND J. LELLMANN y (0,0) y (1,0) w 0,0 w 1,0 w 0,1 w 1,1 y (0,1) y (1,1) y (1,1) y (0,0) y (1,0) y (0,1) y (1,1) (a) (b) (c) Fig. 3. Grid conversion operations (2D example). Circles: nodal deformation grid; squares: cell-centered image grid. (a) deformation grid to image grid: each value on the image grid is computed by a weighted sum of the values of its deformation grid neighbors. (b) transposed operation: each value on the deformation grid is a weighted sum of all image grid values from neighboring deformation grid cells. (c) transposed operation using the proposed red-black scheme. Weighted values on the image grid are accumulated into all surrounding deformation grid points using multiple write accesses. Weights need to be computed only once per cell. Parallelization is performed per row (2D) or slice (3D) using a red-black scheme to avoid write conflicts. It is important to note that the proposed approach is by no means conceptually restricted to the NGF distance measure or curvature regularization. The central idea of replacing sparse matrix construction by on-the-fly coefficient calculation for derivative computation is in principle applicable to all distance measures and regularization schemes. The effectiveness, however, strongly depends on the derivative structure of the considered objective function terms. In more detail, consider an arbitrary distance measure D = ψ(r(t (P ))). We first observe that the proposed reformulation for the grid conversion operations related to P as well as the image interpolation calculations for T are independent of the choice of the distance metric. Hence, any matrix-free reformulation for these functions can be applied to every choice of distance measure D. Since the derivative of ψ is just a vector, the most critical part for a matrix-free computation is the residual function derivative r. For NGF, the non-zero elements are distributed to just seven diagonals, thus allowing to derive a closed expression for r. In the case of the well-known sumof-squared-differences (SSD) distance measure, the derivative r is even the identity and can thus be omitted from the computations [44]. More generally, matrix-free methods may be considered practically applicable as long as the sparsity pattern of r is known and fixed. Examples of suitable matrix types are band matrices, block structures and similar configurations. Unfortunately, the widely used mutual information distance measure [10, 51] does not allow for such r application of the proposed matrix-free concept: here, the sparsity pattern of depends on the image intensity distribution of the deformed template image and thus on both T and the current deformation y [36]. As the deformation changes at each iteration step, the sparsity pattern of r may do so as well, prohibiting the derivation of static matrix-free calculation rules. Again, however, the matrix-free grid change schemes and the discussed template image derivative computations can directly be reused also for mutual information. For the regularization term, we note that widely used energies such as the diffusive regularizer [13] or the linear-elastic potential [35] are essentially calculated using finite difference schemes on the deformation y. Hence, their matrix-free computation is far less complex than for the distance term with its function chain structure,

17 MATRIX-FREE DEFORMABLE IMAGE REGISTRATION 17 and similar techniques may directly be used as done for the curvature regularization in this work. Also more complex regularization schemes such as hyperelastic regularization [5], which penalizes local volume change and generates diffeomorphic transformations, may be similarly implemented in a matrix-free manner. In summary, the underlying idea of our work is widely applicable and not restricted to the exemplary use case of NGF with curvature regularization. An interesting question is what performance improvements can be expected for conceptually different methods such as LDDMM [32]; we refer to [33] for some notes on parallelizing such methods. 5. Algorithm analysis. In this section, we will perform a theoretical analysis and quantification of the expected speed-up obtained by using the proposed matrixfree approach, compared to a matrix-based approach as used in [36] Matrix element computations and memory stores. While the exact cost of sparse matrix multiplications is hard to quantify and highly depends on the implementation and chosen sparse matrix format, meaningful estimates can be made for the cost of matrix coefficient computations and memory stores. As in this section we are primarily concerned with the time consumed by memory write accesses, by memory store we refer to a single memory write operation. The amount of memory required by each scheme is discussed in Section Matrix-based approach. Data term. For the NGF gradient, analyzing the structure of ψ r, r, and P in (18), (26), and (27), an upper bound for the required number of non-zero element computations can be computed. The vector ψ r requires the computation of m residual elements, where m is the number of cells in the image grid (Section 2.2). For the matrix r, with a sevenbanded structure, at most 7 m elements need to be computed, while P contains at most 3 m non-zero matrix elements. The grid conversion, discussed in Section 4.3, involves a matrix with eight coefficients in each row, one per contributing deformation grid point (35), resulting in the computation of 8 m coefficients. In total, this gives 19 m coefficient computations for the NGF term in the matrix-based scheme. As the matrix elements only depend on the size of the image and deformation grids, they can be computed once and stored for each resolution. Regularizer. The curvature regularizer is a quadratic function and can be implemented using a matrix with 25 diagonals, resulting in 25 m y memory stores, where m y is the number of cells in the deformation grid. Hessian. For the Gauss-Newton Hessian-vector multiplication, no additional coefficient computations are needed, as the Hessian is approximated from first-order derivatives. Total number of operations. After computing all matrix elements, these have to be stored in memory for later use, resulting in 19 m + 25 m y memory store operations for one evaluation of the full gradient and Hessian approximation. This assumes that the latter will never be explicitly stored, as discussed in the following section. If multiple matrix-vector multiplications with the same gradient are required, as in the inner CG iterations, the factor matrices can be re-used and no new memory stores are necessary Matrix-free approach.

18 18 L. KÖNIG, J. RÜHAAK, A. DERKSEN AND J. LELLMANN Data term. For the matrix-free formulation of the NGF gradient in (28), excluding grid conversion, the same 11 m coefficient computations as in the matrix-based case have to be performed for one evaluation of the gradient. However, no intermediate storage of elements is needed. For the grid conversion, the coefficients are re-computed during each gradient evaluation and Hessian-vector multiplication. Furthermore, they are re-computed during the application of the transposed operator. With 8 m computations in each grid conversion, this adds up to 16 m additional coefficient computations for application of the forward and transposed operator compared to the matrix-based case. In the proposed approach, the grid conversions are evaluated separately from the remaining derivatives, instead of fully integrating them into the matrix-free computations. While the latter further lowers memory requirements, it requires recomputations of weights even within a single conversion, which in our experience resulted in slower overall computation times. Regularizer. For the curvature regularizer, the coefficients are highly redundant, resulting from the finite-differences stencil. Therefore, there is no computational cost for coefficients in the matrix-free scheme. Hessian. For the matrix-free Gauss-Newton Hessian-vector multiplication, a number of elements are re-computed. As shown in (33) and (32), coefficients ˆρ i (k) are computed 2 49 m times for one Hessian-vector multiplication. With 25 diagonals in dr dr, image derivatives from P are required for 25 points, resulting in additional 25 m coefficient computations. In total, the Hessian-vector multiplication requires 123 m additional coefficient computations compared to the matrix-based method. Note that, in the matrix-based case, we have assumed that the full Hessian is never stored but rather evaluated using its factor matrices (Figure 2). If the final Hessian is also saved, an additional 9 25 m memory stores are required, as dr dr exhibits 25 diagonals. While in both schemes the gradient is fully stored once computed and can be re-used, for the matrix-free Hessian-vector multiplication, all elements need to be recomputed for every single matrix-vector multiplication. Therefore, the exact number of additional arithmetic operations depends on other variables such as the number of CG iterations and line-search steps, and is specific to the input data. Total number of operations. In summary, for matrix-free NGF gradient evaluations, there are no additional coefficient computations compared to the matrix-based scheme and all intermediate memory stores are completely eliminated, making it highly amenable to massive parallelization. For the Gauss-Newton Hessian-vector multiplications, the matrix-free scheme adds a considerable number of re-computations, but again no stores are required. However, memory store operations are generally orders of magnitude slower than arithmetic operations. Therefore, as will be shown in the next section, in practice this trade-off between computations and memory stores results in faster overall runtimes. Moreover, it significantly reduces the amount of required memory, which is a severely limiting factor for Hessian-based methods Memory requirements. As discussed in the previous section, the memory required for the derivative matrices in a classical matrix-based scheme depends on the image grid size m. For the gradient computations, when storing ψ P, r and P, m, 7 m and 3 m values need to be saved. For the grid conversion, 8 m values need to be stored, while the curvature regularizer requires 25 m y values. In terms of complexity, this leads to memory requirements in the order of O( m).

19 MATRIX-FREE DEFORMABLE IMAGE REGISTRATION 19 In comparison, the memory required by the matrix-free schemes does not depend on the image grid size. When integrating the grid conversion steps directly into the computation, as described in the previous section, only constant auxiliary space is required. The matrix-free approach therefore reduces the required memory size for intermediate storage from O( m) to O(1). 6. Experiments and results. In order to characterize the performance at the implementation level, we performed a range of experiments comparing four approaches for derivative computation. Three approaches are implemented on the CPU: an opensource, mixed C and MATLAB implementation in the FAIR toolbox [36] ( FAIR ), an in-house C++ implementation of the matrix-based computations using sparse matrices ( MBC ) and a C++ implementation of the proposed matrix-free computations ( MFC ). Additionally, we evaluated a GPU implementation of the matrix-free computations using NVIDIA CUDA ( MFC GPU ). All CPU implementations use OpenMP for parallelization of critical components. In FAIR and MBC, these include distance measure computations, linear interpolation and sparse-matrix multiplications. FAIR provides a matrix-free mode, which uses matrix-free computations for the curvature regularizer. In contrast to our approach, all other derivative matrices are still built explicitly and require additional storage and memory accesses. This mode was used for all comparisons. Furthermore, MATLAB-MEX versions of FAIR components were used where available. In the matrix-free method MFC, full element-wise parallelization was used for function value and derivative calculations. Both the classical MBC and proposed MFC implementations have undergone thorough optimization. In addition, due to the favorable structure, the NGF gradient computations for MFC on the CPU were manually vectorized using the Intel Advanced Vector Instructions (AVX). As stated in Section 4.3, the grid conversions were computed as separate steps. Additionally, the deformed template was temporarily stored in order to reduce interpolation calls. The experiments on the CPU were performed on a dual processor 12-core Intel Xeon E workstation with 2.0 GHz and 32 GB RAM. The GPU computations were performed on a NVIDIA GeForce GTX980, see Section 6.3 for more detailed specifications. All experiments were averaged over 30 runs with identical input in order to reduce measurement noise Scalability. The proposed matrix-free derivative calculations permit a fully parallel execution with only a single reduction for the function value, leading to high arithmetic intensity, i.e., floating point operations per byte of memory transfer. Thus, excellent scalability with respect to the number of computational cores can be expected. In order to experimentally support this claim, we computed objective function derivatives using a varying number of computational cores on CPU. Two cases were considered: small images (64 3 voxels), typically occurring in multi-level computations of medical images, and larger images (512 3 voxels), representing, e.g., state-of-the-art thoracic computed tomography (CT) scans. As can be seen from the results in Table 3, for both image sizes, all NGF evaluations function value, gradient, and Hessian-vector multiplication exhibit the expected speedup factors ranging from 11.0 to 15.0 on a 12-core system. Grid conversions from deformation grid to image grid also show speedups of approximately 12. For the transposed operator, on small images, a speedup of only 4.34 is achieved: due to the small number of parallel tasks available, not all computational cores can be

20 20 L. KÖNIG, J. RÜHAAK, A. DERKSEN AND J. LELLMANN Small images Large images Method Serial (ms) Par. (ms) Speedup Serial (s) Par. (s) Speedup D D P P y P ŷ Ĥ ˆp S S Sp Table 3 Scaling of the proposed parallel implementation on a 12-core workstation compared to serial computation. Runtimes and speedups are itemized for gradient evaluations, Hessian vectormultiplications and grid conversions. All operations scale almost linearly, with the exception of the transposed grid conversion operator P ŷ on small images, where the number of available parallel tasks is too low, and the curvature regularizer, which is primarily memory-bound. Small images: 64 3 image resolution, 17 3 deformation grid size; large images: image resolution, deformation grid size. fully utilized. On the large images, the expected speedup of 12.4 is achieved. In comparison, the curvature regularizer scales much less favorably, with speedups ranging from 1.54 to As the implementation of the curvature regularizer derivatives is fully parallelized and matrix-free, this indicates that the curvature computation is limited by memory bandwidth rather than arithmetic throughput. However, this has little effect on the overall performance, as the evaluation of the curvature regularizer is multiple orders of magnitude faster than the NGF evaluation Runtime comparison. As discussed in Section 5, the matrix-free approach reduces the memory consumption and improves parallelization, however it introduces additional cost due to multiple re-calculations. Therefore, it is interesting to compare in detail the runtime of the matrix-free implementation MFC to the matrix-based implementations MBC and FAIR Derivative and Hessian evaluation. We separately evaluated runtimes for computation of the derivatives and for matrix-vector multiplication with the Gauss-Newton Hessian, including grid conversions. Image sizes ranged from 64 3 to voxels. To account for various use cases with different requirements for accuracy, we tested three different resolutions for the deformation grid: full resolution, 1/4, and 1/16 of the image resolution. As can be seen from the results in Table 4, for the gradient computation, we achieve speedups from 33.7 to 275 relative to FAIR. Furthermore, FAIR runs out of memory at larger image and deformation grid sizes. Here, the matrix-free scheme allows for computations at much higher resolutions. Relative to the matrix-based C++ implementation MBC, we achieve speedups between 2.34 and These values are particularly remarkable if one considers that MBC is already highly optimized and parallelized, and that MFC includes a large amount of redundant computations. Again, MFC allows to handle the highest resolution of 512 3, while MBC runs out of memory at voxels. Comparing the runtimes of the Hessian-vector multiplication in Table 4, the benefits of the low memory usage of MFC become even more obvious. While MBC and

21 Gradient Hessian-vector mult. MATRIX-FREE DEFORMABLE IMAGE REGISTRATION 21 Runtime m i m y i FAIR (s) MBC (s) Speedup MBC MFC MFC MFC (s) vs. vs. vs. (Proposed) FAIR FAIR MBC * * * * * * * * * * * * * * * * * * * * * Table 4 Runtimes (left) and speedup factors (right) for objective function gradient computations (top) and Hessian matrix-vector multiplications (bottom) on CPU. Three implementations were compared: a classical matrix-based implementation from the FAIR toolbox (FAIR), an optimized matrix-based C++ implementation (MBC), and the proposed matrix-free approach (MFC). For each image grid size m i, three different deformation grid sizes m y i were tested. The proposed matrix-free approach MFC achieves speedups of (derivatives) and (Hessian multiplications) compared to the optimized C++ implementation MBC (rightmost column). It can also handle high resolutions for which FAIR and MFC run out of memory (third and fourth column, marked ). FAIR cannot be used for more than half of the resolutions tested, MFC has no such issues and scales approximately linear in the number of voxels. MFC achieves particularly high speedups for large deformation grids compared to the optimized C++ implementation MBC, ranging from 17.9 to Full registration. The ultimate goal of the presented approach is to improve runtime and memory usage of the full image registration system in real-world applications. Therefore we performed a range of full multi-level image registrations on three levels for different image and deformation grid sizes with all three imple-

22 22 L. KÖNIG, J. RÜHAAK, A. DERKSEN AND J. LELLMANN Runtime (s) , FAIR MBC MFC (Proposed) ,314 1, ,553 * * * * * * L-BFGS 4,299 Gauss-Newton Fig. 4. Runtime comparison for a full multi-level registration on CPU using L-BFGS and Gauss-Newton optimization schemes for different image sizes (logarithmic scale). FAIR: matrixbased, MATLAB, using the FAIR toolbox [36]; MBC: matrix-based, C++, using sparse matrices; MFC: proposed matrix-free algorithm in C++; : computation ran out of memory. Deformation grid sizes were set to one quarter of the image size in each dimension. The proposed algorithm (MFC) enables faster computations at higher resolutions for both optimization schemes and routinely handles resolutions of up to voxels. L-BFGS Gauss-Newton MFC vs. FAIR * * * MFC vs. MBC * * * Table 5 Speedup factors of the proposed matrix-free approach on CPU (MFC is n times faster) relative to the matrix-based MBC and MATLAB-based FAIR methods for the data in Figure 4. : FAIR or MBC ran out of memory. mentations. The input data consisted of thorax-abdomen images as described in Section below, cropped to a size of voxels. In order to avoid potentially large runtime differences caused by the effect of small numerical differences on the stopping criteria, we set a fixed number of 20 iterations on each resolution level. For L-BFGS, the proposed method MFC is between 8.62 and 9.27 times faster than MBC (Figure 4 and Table 5). Computation on the highest resolution with an image size of and deformation grid size of is only possible using MFC due to memory limitations of the other methods. For Gauss-Newton, overall speedups range from 3.12 to Additionally, due to the more memory-intense Hessian computations, even a registration with an image size of and a modest deformation grid size of 65 3 could only be performed with the proposed MFC method. While for L-BFGS the overall speedups are in a similar range as the raw speedups in Table 4, they are slightly reduced for the Gauss-Newton method. This is due to the fact that in the matrix-based MBC method, the stored NGF Hessian is re-used multiple times in the CG solver, while it is re-computed for each evaluation in the matrix-free MFC approach. Solving a linear system with the CG solver using the NGF Hessian is only needed for Gauss-Newton optimization, which could additionally

23 MATRIX-FREE DEFORMABLE IMAGE REGISTRATION 23 L-BFGS Gauss-Newton MFC MBC * * * FAIR * * * Table 6 Total peak memory usage in megabytes for the full multi-level registration on CPU, corresponding to the data in Figure 4. Reported sizes include objective function derivatives, multi-level representations of the images, memory required for the optimization algorithm, CG solver, and further auxiliary space. Compared to MBC and FAIR, the proposed algorithm (MFC) enables computations at high resolutions with moderate memory use. : FAIR or MBC ran out of memory (more than 32 GB required). impose a performance bottleneck. In our evaluations, however, we found that, while the CG solver consumes 97% to 99% of the total registration runtime, the matrix-free NGF Hessian-vector multiplications again consume more than 99% of the overall CG solver runtime, which can thus fully benefit from our parallelized computations. The memory usage of the full multi-level registration for all approaches is shown in Table 6. Besides the objective function derivatives, the measured memory usage includes memory required for multi-level representations of the images, the optimization algorithm, CG solver and further auxiliary space. The matrix-free approach reduces memory requirements by one to two orders of magnitude.. As the L-BFGS method uses additional buffers for storing previously computed gradients, the Gauss-Newton method has a lower memory usage with the matrix-free approach. Overall, the proposed efficient matrix-free method enables both computation of higher resolutions and shorter runtimes GPU Implementation. The presented matrix-free algorithm is not limited to a specific platform or processor type. Due to its parallelizable formulation and low memory requirements, it is also well-suited for implementation on specialized hardware such as GPUs. In comparison to CPUs, GPUs feature a massively parallel architecture and excellent computational performance. Thus, GPUs have become increasingly popular for general purpose computing applications [47]. However, in order to utilize these benefits, specialized implementations are required, which exploit platform-specific features such as GPU topology and different memory classes. In this section, we present a GPU implementation of the matrix-free registration algorithm using NVIDIA CUDA C/C++ [40]. The CUDA framework allows for a comparatively easy implementation and direct access to hardware-related features such as different memory and caching models and has been widely adopted by the scientific community [47]. We implemented optimized versions of all parts of the registration algorithm in CUDA, allowing the full registration algorithm to run on the GPU without intermediate transfers to the host. In the following, we present details on how the implementation makes use of the specialized platform features in order to improve performance Implementation details. The CUDA programming model organizes the GPU code in kernels, which are launched from the host to execute on the GPU. The kernels are executed in parallel in different threads, which are grouped to thread blocks. Memory areas. CUDA threads have access to different memory areas: Global memory, constant memory and shared memory. All data has to be transferred from

24 24 L. KO NIG, J. RU HAAK, A. DERKSEN AND J. LELLMANN (a) Reference (b) Template (c) Initial overlay (d) After registration Fig. 5. Registration of patches from histological serial sections with different stains on GPU. (a) reference image, (b) template image, (c) checkerboard overlay of details from both images before, and (d) after registration. While discontinuities can be seen at the borders of the checker pattern before registration, smooth transitions are achieved after registration. Areas of large differences are indicated by arrows. The images were obtained from the dataset used in [31]. the CPU main memory to the GPU global memory before it can be used for GPU computations. As frequent transfers can limit computational performance, the GPU implementation transfers the initial images R, T exactly once at the beginning of the registration. All computations, including the creation of coarser images for the multi-level scheme, are then performed entirely in GPU memory. While the global memory is comparatively large (up to 16 GB on recent Tesla devices) and can be accessed from all threads, access is slow because caching is limited. Constant and shared memory in contrast are very fast. However, constant memory is read-only, while shared memory can only be accessed within the same thread block. Both are currently limited to 48 to 96 kb, depending on the architecture. We therefore store frequently used parameters, such as m, my, m, h, hy, h in constant memory to minimize access times. Shared memory (private per block) is used for reductions in L-BFGS and function value, with an efficient scheme from [20]. Image interpolation. Current GPUs include an additionally specialized memory area for textures. Texture memory stores the data optimized for localized access in 2D or 3D coordinates, and provides hardware-based interpolation and boundary condition handling. Therefore, we initially employed texture memory for the image interpolation of the deformed template image T (y ). However, we found that the hardware interpolation causes errors in the order of 10 3 in comparison to the CPU implementation, as the GPU only uses 9-bit fixed-point arithmetic for these computations. The errors resulted in erroneous derivative calculations, causing the optimization to fail. Despite an interpolation speedup of approx. 35% with texture memory, we therefore performed the interpolation in software within the kernels rather than in hardware. Grid conversion. In Section 4.3, a red-black scheme for the transposed operator P > y was proposed in order to avoid write conflicts in parallel execution. While on the CPU this results in fast calculations, on the GPU there are specialized atomics available to allow for fast simultaneous write accesses without conflicts. This enables a per-point parallel implementation without the red-black scheme, which also improves the limited speedup for small images shown in Table 3. We took care that all GPU computations return identical results to the CPU code with the exception of numerical and rounding errors, in order to allow for a meaningful comparison Evaluation. Experiments were performed on a GeForce GTX980 graphics card with 4 GB of memory. The GPU features 16 streaming multiprocessors with

A Non-Linear Image Registration Scheme for Real-Time Liver Ultrasound Tracking using Normalized Gradient Fields

A Non-Linear Image Registration Scheme for Real-Time Liver Ultrasound Tracking using Normalized Gradient Fields A Non-Linear Image Registration Scheme for Real-Time Liver Ultrasound Tracking using Normalized Gradient Fields Lars König, Till Kipshagen and Jan Rühaak Fraunhofer MEVIS Project Group Image Registration,

More information

Variational Lung Registration With Explicit Boundary Alignment

Variational Lung Registration With Explicit Boundary Alignment Variational Lung Registration With Explicit Boundary Alignment Jan Rühaak and Stefan Heldmann Fraunhofer MEVIS, Project Group Image Registration Maria-Goeppert-Str. 1a, 23562 Luebeck, Germany {jan.ruehaak,stefan.heldmann}@mevis.fraunhofer.de

More information

Contents. I The Basic Framework for Stationary Problems 1

Contents. I The Basic Framework for Stationary Problems 1 page v Preface xiii I The Basic Framework for Stationary Problems 1 1 Some model PDEs 3 1.1 Laplace s equation; elliptic BVPs... 3 1.1.1 Physical experiments modeled by Laplace s equation... 5 1.2 Other

More information

l ealgorithms for Image Registration

l ealgorithms for Image Registration FAIR: exib Image Registration l F l ealgorithms for Jan Modersitzki Computing And Software, McMaster University 1280 Main Street West, Hamilton On, L8S 4K1, Canada modersit@cas.mcmaster.ca August 13, 2008

More information

ELEC Dr Reji Mathew Electrical Engineering UNSW

ELEC Dr Reji Mathew Electrical Engineering UNSW ELEC 4622 Dr Reji Mathew Electrical Engineering UNSW Review of Motion Modelling and Estimation Introduction to Motion Modelling & Estimation Forward Motion Backward Motion Block Motion Estimation Motion

More information

An Efficient, Geometric Multigrid Solver for the Anisotropic Diffusion Equation in Two and Three Dimensions

An Efficient, Geometric Multigrid Solver for the Anisotropic Diffusion Equation in Two and Three Dimensions 1 n Efficient, Geometric Multigrid Solver for the nisotropic Diffusion Equation in Two and Three Dimensions Tolga Tasdizen, Ross Whitaker UUSCI-2004-002 Scientific Computing and Imaging Institute University

More information

CS231A Course Notes 4: Stereo Systems and Structure from Motion

CS231A Course Notes 4: Stereo Systems and Structure from Motion CS231A Course Notes 4: Stereo Systems and Structure from Motion Kenji Hata and Silvio Savarese 1 Introduction In the previous notes, we covered how adding additional viewpoints of a scene can greatly enhance

More information

Ultrasonic Multi-Skip Tomography for Pipe Inspection

Ultrasonic Multi-Skip Tomography for Pipe Inspection 18 th World Conference on Non destructive Testing, 16-2 April 212, Durban, South Africa Ultrasonic Multi-Skip Tomography for Pipe Inspection Arno VOLKER 1, Rik VOS 1 Alan HUNTER 1 1 TNO, Stieltjesweg 1,

More information

Overview of Proposed TG-132 Recommendations

Overview of Proposed TG-132 Recommendations Overview of Proposed TG-132 Recommendations Kristy K Brock, Ph.D., DABR Associate Professor Department of Radiation Oncology, University of Michigan Chair, AAPM TG 132: Image Registration and Fusion Conflict

More information

Robot Mapping. Least Squares Approach to SLAM. Cyrill Stachniss

Robot Mapping. Least Squares Approach to SLAM. Cyrill Stachniss Robot Mapping Least Squares Approach to SLAM Cyrill Stachniss 1 Three Main SLAM Paradigms Kalman filter Particle filter Graphbased least squares approach to SLAM 2 Least Squares in General Approach for

More information

Graphbased. Kalman filter. Particle filter. Three Main SLAM Paradigms. Robot Mapping. Least Squares Approach to SLAM. Least Squares in General

Graphbased. Kalman filter. Particle filter. Three Main SLAM Paradigms. Robot Mapping. Least Squares Approach to SLAM. Least Squares in General Robot Mapping Three Main SLAM Paradigms Least Squares Approach to SLAM Kalman filter Particle filter Graphbased Cyrill Stachniss least squares approach to SLAM 1 2 Least Squares in General! Approach for

More information

Motion Estimation. There are three main types (or applications) of motion estimation:

Motion Estimation. There are three main types (or applications) of motion estimation: Members: D91922016 朱威達 R93922010 林聖凱 R93922044 謝俊瑋 Motion Estimation There are three main types (or applications) of motion estimation: Parametric motion (image alignment) The main idea of parametric motion

More information

arxiv: v1 [cs.cv] 2 May 2016

arxiv: v1 [cs.cv] 2 May 2016 16-811 Math Fundamentals for Robotics Comparison of Optimization Methods in Optical Flow Estimation Final Report, Fall 2015 arxiv:1605.00572v1 [cs.cv] 2 May 2016 Contents Noranart Vesdapunt Master of Computer

More information

SELECTIVE ALGEBRAIC MULTIGRID IN FOAM-EXTEND

SELECTIVE ALGEBRAIC MULTIGRID IN FOAM-EXTEND Student Submission for the 5 th OpenFOAM User Conference 2017, Wiesbaden - Germany: SELECTIVE ALGEBRAIC MULTIGRID IN FOAM-EXTEND TESSA UROIĆ Faculty of Mechanical Engineering and Naval Architecture, Ivana

More information

B-Spline Registration of 3D Images with Levenberg-Marquardt Optimization

B-Spline Registration of 3D Images with Levenberg-Marquardt Optimization B-Spline Registration of 3D Images with Levenberg-Marquardt Optimization Sven Kabus a,b, Thomas Netsch b, Bernd Fischer a, Jan Modersitzki a a Institute of Mathematics, University of Lübeck, Wallstraße

More information

Why Use the GPU? How to Exploit? New Hardware Features. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. Semiconductor trends

Why Use the GPU? How to Exploit? New Hardware Features. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. Semiconductor trends Imagine stream processor; Bill Dally, Stanford Connection Machine CM; Thinking Machines Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid Jeffrey Bolz Eitan Grinspun Caltech Ian Farmer

More information

Using Subspace Constraints to Improve Feature Tracking Presented by Bryan Poling. Based on work by Bryan Poling, Gilad Lerman, and Arthur Szlam

Using Subspace Constraints to Improve Feature Tracking Presented by Bryan Poling. Based on work by Bryan Poling, Gilad Lerman, and Arthur Szlam Presented by Based on work by, Gilad Lerman, and Arthur Szlam What is Tracking? Broad Definition Tracking, or Object tracking, is a general term for following some thing through multiple frames of a video

More information

Dense Image-based Motion Estimation Algorithms & Optical Flow

Dense Image-based Motion Estimation Algorithms & Optical Flow Dense mage-based Motion Estimation Algorithms & Optical Flow Video A video is a sequence of frames captured at different times The video data is a function of v time (t) v space (x,y) ntroduction to motion

More information

Edge and local feature detection - 2. Importance of edge detection in computer vision

Edge and local feature detection - 2. Importance of edge detection in computer vision Edge and local feature detection Gradient based edge detection Edge detection by function fitting Second derivative edge detectors Edge linking and the construction of the chain graph Edge and local feature

More information

Image Registration with Local Rigidity Constraints

Image Registration with Local Rigidity Constraints Image Registration with Local Rigidity Constraints Jan Modersitzki Institute of Mathematics, University of Lübeck, Wallstraße 40, D-23560 Lübeck Email: modersitzki@math.uni-luebeck.de Abstract. Registration

More information

Non-Rigid Image Registration III

Non-Rigid Image Registration III Non-Rigid Image Registration III CS6240 Multimedia Analysis Leow Wee Kheng Department of Computer Science School of Computing National University of Singapore Leow Wee Kheng (CS6240) Non-Rigid Image Registration

More information

EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI

EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI 1 Akshay N. Panajwar, 2 Prof.M.A.Shah Department of Computer Science and Engineering, Walchand College of Engineering,

More information

A Fast GPU-Based Approach to Branchless Distance-Driven Projection and Back-Projection in Cone Beam CT

A Fast GPU-Based Approach to Branchless Distance-Driven Projection and Back-Projection in Cone Beam CT A Fast GPU-Based Approach to Branchless Distance-Driven Projection and Back-Projection in Cone Beam CT Daniel Schlifske ab and Henry Medeiros a a Marquette University, 1250 W Wisconsin Ave, Milwaukee,

More information

f xx + f yy = F (x, y)

f xx + f yy = F (x, y) Application of the 2D finite element method to Laplace (Poisson) equation; f xx + f yy = F (x, y) M. R. Hadizadeh Computer Club, Department of Physics and Astronomy, Ohio University 4 Nov. 2013 Domain

More information

HYBRID MULTISCALE LANDMARK AND DEFORMABLE IMAGE REGISTRATION. Dana Paquin. Doron Levy. Lei Xing. (Communicated by Yang Kuang)

HYBRID MULTISCALE LANDMARK AND DEFORMABLE IMAGE REGISTRATION. Dana Paquin. Doron Levy. Lei Xing. (Communicated by Yang Kuang) MATHEMATICAL BIOSCIENCES http://www.mbejournal.org/ AND ENGINEERING Volume 4, Number 4, October 2007 pp. 711 737 HYBRID MULTISCALE LANDMARK AND DEFORMABLE IMAGE REGISTRATION Dana Paquin Department of Mathematics,

More information

Accelerating Double Precision FEM Simulations with GPUs

Accelerating Double Precision FEM Simulations with GPUs Accelerating Double Precision FEM Simulations with GPUs Dominik Göddeke 1 3 Robert Strzodka 2 Stefan Turek 1 dominik.goeddeke@math.uni-dortmund.de 1 Mathematics III: Applied Mathematics and Numerics, University

More information

smooth coefficients H. Köstler, U. Rüde

smooth coefficients H. Köstler, U. Rüde A robust multigrid solver for the optical flow problem with non- smooth coefficients H. Köstler, U. Rüde Overview Optical Flow Problem Data term and various regularizers A Robust Multigrid Solver Galerkin

More information

Shape-based Diffeomorphic Registration on Hippocampal Surfaces Using Beltrami Holomorphic Flow

Shape-based Diffeomorphic Registration on Hippocampal Surfaces Using Beltrami Holomorphic Flow Shape-based Diffeomorphic Registration on Hippocampal Surfaces Using Beltrami Holomorphic Flow Abstract. Finding meaningful 1-1 correspondences between hippocampal (HP) surfaces is an important but difficult

More information

1 2 (3 + x 3) x 2 = 1 3 (3 + x 1 2x 3 ) 1. 3 ( 1 x 2) (3 + x(0) 3 ) = 1 2 (3 + 0) = 3. 2 (3 + x(0) 1 2x (0) ( ) = 1 ( 1 x(0) 2 ) = 1 3 ) = 1 3

1 2 (3 + x 3) x 2 = 1 3 (3 + x 1 2x 3 ) 1. 3 ( 1 x 2) (3 + x(0) 3 ) = 1 2 (3 + 0) = 3. 2 (3 + x(0) 1 2x (0) ( ) = 1 ( 1 x(0) 2 ) = 1 3 ) = 1 3 6 Iterative Solvers Lab Objective: Many real-world problems of the form Ax = b have tens of thousands of parameters Solving such systems with Gaussian elimination or matrix factorizations could require

More information

Hybrid Spline-based Multimodal Registration using a Local Measure for Mutual Information

Hybrid Spline-based Multimodal Registration using a Local Measure for Mutual Information Hybrid Spline-based Multimodal Registration using a Local Measure for Mutual Information Andreas Biesdorf 1, Stefan Wörz 1, Hans-Jürgen Kaiser 2, Karl Rohr 1 1 University of Heidelberg, BIOQUANT, IPMB,

More information

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation Optimization Methods: Introduction and Basic concepts 1 Module 1 Lecture Notes 2 Optimization Problem and Model Formulation Introduction In the previous lecture we studied the evolution of optimization

More information

TG 132: Use of Image Registration and Fusion in RT

TG 132: Use of Image Registration and Fusion in RT TG 132: Use of Image Registration and Fusion in RT Kristy K Brock, PhD, DABR, FAAPM Associate Professor Department of Radiation Oncology, University of Michigan Chair, AAPM TG 132: Image Registration and

More information

Methodological progress in image registration for ventilation estimation, segmentation propagation and multi-modal fusion

Methodological progress in image registration for ventilation estimation, segmentation propagation and multi-modal fusion Methodological progress in image registration for ventilation estimation, segmentation propagation and multi-modal fusion Mattias P. Heinrich Julia A. Schnabel, Mark Jenkinson, Sir Michael Brady 2 Clinical

More information

White Pixel Artifact. Caused by a noise spike during acquisition Spike in K-space <--> sinusoid in image space

White Pixel Artifact. Caused by a noise spike during acquisition Spike in K-space <--> sinusoid in image space White Pixel Artifact Caused by a noise spike during acquisition Spike in K-space sinusoid in image space Susceptibility Artifacts Off-resonance artifacts caused by adjacent regions with different

More information

Visual Tracking (1) Feature Point Tracking and Block Matching

Visual Tracking (1) Feature Point Tracking and Block Matching Intelligent Control Systems Visual Tracking (1) Feature Point Tracking and Block Matching Shingo Kagami Graduate School of Information Sciences, Tohoku University swk(at)ic.is.tohoku.ac.jp http://www.ic.is.tohoku.ac.jp/ja/swk/

More information

An Automated Image-based Method for Multi-Leaf Collimator Positioning Verification in Intensity Modulated Radiation Therapy

An Automated Image-based Method for Multi-Leaf Collimator Positioning Verification in Intensity Modulated Radiation Therapy An Automated Image-based Method for Multi-Leaf Collimator Positioning Verification in Intensity Modulated Radiation Therapy Chenyang Xu 1, Siemens Corporate Research, Inc., Princeton, NJ, USA Xiaolei Huang,

More information

Iterative CT Reconstruction Using Curvelet-Based Regularization

Iterative CT Reconstruction Using Curvelet-Based Regularization Iterative CT Reconstruction Using Curvelet-Based Regularization Haibo Wu 1,2, Andreas Maier 1, Joachim Hornegger 1,2 1 Pattern Recognition Lab (LME), Department of Computer Science, 2 Graduate School in

More information

Algebraic Iterative Methods for Computed Tomography

Algebraic Iterative Methods for Computed Tomography Algebraic Iterative Methods for Computed Tomography Per Christian Hansen DTU Compute Department of Applied Mathematics and Computer Science Technical University of Denmark Per Christian Hansen Algebraic

More information

PROGRAMMING OF MULTIGRID METHODS

PROGRAMMING OF MULTIGRID METHODS PROGRAMMING OF MULTIGRID METHODS LONG CHEN In this note, we explain the implementation detail of multigrid methods. We will use the approach by space decomposition and subspace correction method; see Chapter:

More information

Image Registration. Prof. Dr. Lucas Ferrari de Oliveira UFPR Informatics Department

Image Registration. Prof. Dr. Lucas Ferrari de Oliveira UFPR Informatics Department Image Registration Prof. Dr. Lucas Ferrari de Oliveira UFPR Informatics Department Introduction Visualize objects inside the human body Advances in CS methods to diagnosis, treatment planning and medical

More information

Axial block coordinate descent (ABCD) algorithm for X-ray CT image reconstruction

Axial block coordinate descent (ABCD) algorithm for X-ray CT image reconstruction Axial block coordinate descent (ABCD) algorithm for X-ray CT image reconstruction Jeffrey A. Fessler and Donghwan Kim EECS Department University of Michigan Fully 3D Image Reconstruction Conference July

More information

Local Image Registration: An Adaptive Filtering Framework

Local Image Registration: An Adaptive Filtering Framework Local Image Registration: An Adaptive Filtering Framework Gulcin Caner a,a.murattekalp a,b, Gaurav Sharma a and Wendi Heinzelman a a Electrical and Computer Engineering Dept.,University of Rochester, Rochester,

More information

Chapter 18. Geometric Operations

Chapter 18. Geometric Operations Chapter 18 Geometric Operations To this point, the image processing operations have computed the gray value (digital count) of the output image pixel based on the gray values of one or more input pixels;

More information

RIGID IMAGE REGISTRATION

RIGID IMAGE REGISTRATION RIGID IMAGE REGISTRATION Duygu Tosun-Turgut, Ph.D. Center for Imaging of Neurodegenerative Diseases Department of Radiology and Biomedical Imaging duygu.tosun@ucsf.edu What is registration? Image registration

More information

Data mining with sparse grids

Data mining with sparse grids Data mining with sparse grids Jochen Garcke and Michael Griebel Institut für Angewandte Mathematik Universität Bonn Data mining with sparse grids p.1/40 Overview What is Data mining? Regularization networks

More information

Introduction to Optimization

Introduction to Optimization Introduction to Optimization Second Order Optimization Methods Marc Toussaint U Stuttgart Planned Outline Gradient-based optimization (1st order methods) plain grad., steepest descent, conjugate grad.,

More information

arxiv: v3 [math.na] 14 Jun 2018

arxiv: v3 [math.na] 14 Jun 2018 LAP: A LINEARIZE AND PROJECT METHOD FOR SOLVING INVERSE PROBLEMS WITH COUPLED VARIABLES JAMES L. HERRING, JAMES G. NAGY, AND LARS RUTHOTTO arxiv:1705.09992v3 [math.na] 14 Jun 2018 Abstract. Many inverse

More information

Numerical methods for volume preserving image registration

Numerical methods for volume preserving image registration Numerical methods for volume preserving image registration SIAM meeting on imaging science 2004 Eldad Haber with J. Modersitzki haber@mathcs.emory.edu Emory University Volume preserving IR p.1/?? Outline

More information

Visual Tracking (1) Tracking of Feature Points and Planar Rigid Objects

Visual Tracking (1) Tracking of Feature Points and Planar Rigid Objects Intelligent Control Systems Visual Tracking (1) Tracking of Feature Points and Planar Rigid Objects Shingo Kagami Graduate School of Information Sciences, Tohoku University swk(at)ic.is.tohoku.ac.jp http://www.ic.is.tohoku.ac.jp/ja/swk/

More information

REAL-TIME ADAPTIVITY IN HEAD-AND-NECK AND LUNG CANCER RADIOTHERAPY IN A GPU ENVIRONMENT

REAL-TIME ADAPTIVITY IN HEAD-AND-NECK AND LUNG CANCER RADIOTHERAPY IN A GPU ENVIRONMENT REAL-TIME ADAPTIVITY IN HEAD-AND-NECK AND LUNG CANCER RADIOTHERAPY IN A GPU ENVIRONMENT Anand P Santhanam Assistant Professor, Department of Radiation Oncology OUTLINE Adaptive radiotherapy for head and

More information

Lagrangian methods for the regularization of discrete ill-posed problems. G. Landi

Lagrangian methods for the regularization of discrete ill-posed problems. G. Landi Lagrangian methods for the regularization of discrete ill-posed problems G. Landi Abstract In many science and engineering applications, the discretization of linear illposed problems gives rise to large

More information

Subpixel Corner Detection Using Spatial Moment 1)

Subpixel Corner Detection Using Spatial Moment 1) Vol.31, No.5 ACTA AUTOMATICA SINICA September, 25 Subpixel Corner Detection Using Spatial Moment 1) WANG She-Yang SONG Shen-Min QIANG Wen-Yi CHEN Xing-Lin (Department of Control Engineering, Harbin Institute

More information

Contents. I Basics 1. Copyright by SIAM. Unauthorized reproduction of this article is prohibited.

Contents. I Basics 1. Copyright by SIAM. Unauthorized reproduction of this article is prohibited. page v Preface xiii I Basics 1 1 Optimization Models 3 1.1 Introduction... 3 1.2 Optimization: An Informal Introduction... 4 1.3 Linear Equations... 7 1.4 Linear Optimization... 10 Exercises... 12 1.5

More information

Lecture 13 Theory of Registration. ch. 10 of Insight into Images edited by Terry Yoo, et al. Spring (CMU RI) : BioE 2630 (Pitt)

Lecture 13 Theory of Registration. ch. 10 of Insight into Images edited by Terry Yoo, et al. Spring (CMU RI) : BioE 2630 (Pitt) Lecture 13 Theory of Registration ch. 10 of Insight into Images edited by Terry Yoo, et al. Spring 2018 16-725 (CMU RI) : BioE 2630 (Pitt) Dr. John Galeotti The content of these slides by John Galeotti,

More information

On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy

On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy Jan Verschelde joint with Genady Yoffe and Xiangcheng Yu University of Illinois at Chicago Department of Mathematics, Statistics,

More information

3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs

3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs 3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs H. Knibbe, C. W. Oosterlee, C. Vuik Abstract We are focusing on an iterative solver for the three-dimensional

More information

Computational Fluid Dynamics - Incompressible Flows

Computational Fluid Dynamics - Incompressible Flows Computational Fluid Dynamics - Incompressible Flows March 25, 2008 Incompressible Flows Basis Functions Discrete Equations CFD - Incompressible Flows CFD is a Huge field Numerical Techniques for solving

More information

Image registration for motion estimation in cardiac CT

Image registration for motion estimation in cardiac CT Image registration for motion estimation in cardiac CT Bibo Shi a, Gene Katsevich b, Be-Shan Chiang c, Alexander Katsevich d, and Alexander Zamyatin c a School of Elec. Engi. and Comp. Sci, Ohio University,

More information

EE795: Computer Vision and Intelligent Systems

EE795: Computer Vision and Intelligent Systems EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 WRI C225 Lecture 04 130131 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Review Histogram Equalization Image Filtering Linear

More information

A projected Hessian matrix for full waveform inversion Yong Ma and Dave Hale, Center for Wave Phenomena, Colorado School of Mines

A projected Hessian matrix for full waveform inversion Yong Ma and Dave Hale, Center for Wave Phenomena, Colorado School of Mines A projected Hessian matrix for full waveform inversion Yong Ma and Dave Hale, Center for Wave Phenomena, Colorado School of Mines SUMMARY A Hessian matrix in full waveform inversion (FWI) is difficult

More information

What is Multigrid? They have been extended to solve a wide variety of other problems, linear and nonlinear.

What is Multigrid? They have been extended to solve a wide variety of other problems, linear and nonlinear. AMSC 600/CMSC 760 Fall 2007 Solution of Sparse Linear Systems Multigrid, Part 1 Dianne P. O Leary c 2006, 2007 What is Multigrid? Originally, multigrid algorithms were proposed as an iterative method to

More information

1. Introduction. performance of numerical methods. complexity bounds. structural convex optimization. course goals and topics

1. Introduction. performance of numerical methods. complexity bounds. structural convex optimization. course goals and topics 1. Introduction EE 546, Univ of Washington, Spring 2016 performance of numerical methods complexity bounds structural convex optimization course goals and topics 1 1 Some course info Welcome to EE 546!

More information

Image Coding with Active Appearance Models

Image Coding with Active Appearance Models Image Coding with Active Appearance Models Simon Baker, Iain Matthews, and Jeff Schneider CMU-RI-TR-03-13 The Robotics Institute Carnegie Mellon University Abstract Image coding is the task of representing

More information

Fast marching methods

Fast marching methods 1 Fast marching methods Lecture 3 Alexander & Michael Bronstein tosca.cs.technion.ac.il/book Numerical geometry of non-rigid shapes Stanford University, Winter 2009 Metric discretization 2 Approach I:

More information

Combinatorial optimization and its applications in image Processing. Filip Malmberg

Combinatorial optimization and its applications in image Processing. Filip Malmberg Combinatorial optimization and its applications in image Processing Filip Malmberg Part 1: Optimization in image processing Optimization in image processing Many image processing problems can be formulated

More information

Constrained and Unconstrained Optimization

Constrained and Unconstrained Optimization Constrained and Unconstrained Optimization Carlos Hurtado Department of Economics University of Illinois at Urbana-Champaign hrtdmrt2@illinois.edu Oct 10th, 2017 C. Hurtado (UIUC - Economics) Numerical

More information

Registration: Rigid vs. Deformable

Registration: Rigid vs. Deformable Lecture 20 Deformable / Non-Rigid Registration ch. 11 of Insight into Images edited by Terry Yoo, et al. Spring 2017 16-725 (CMU RI) : BioE 2630 (Pitt) Dr. John Galeotti The content of these slides by

More information

Image Registration Lecture 4: First Examples

Image Registration Lecture 4: First Examples Image Registration Lecture 4: First Examples Prof. Charlene Tsai Outline Example Intensity-based registration SSD error function Image mapping Function minimization: Gradient descent Derivative calculation

More information

Bildverarbeitung für die Medizin 2007

Bildverarbeitung für die Medizin 2007 Bildverarbeitung für die Medizin 2007 Image Registration with Local Rigidity Constraints Jan Modersitzki Institute of Mathematics, University of Lübeck, Wallstraße 40, D-23560 Lübeck 1 Summary Registration

More information

Preprocessing II: Between Subjects John Ashburner

Preprocessing II: Between Subjects John Ashburner Preprocessing II: Between Subjects John Ashburner Pre-processing Overview Statistics or whatever fmri time-series Anatomical MRI Template Smoothed Estimate Spatial Norm Motion Correct Smooth Coregister

More information

Numerical Algorithms

Numerical Algorithms Chapter 10 Slide 464 Numerical Algorithms Slide 465 Numerical Algorithms In textbook do: Matrix multiplication Solving a system of linear equations Slide 466 Matrices A Review An n m matrix Column a 0,0

More information

Module 7 VIDEO CODING AND MOTION ESTIMATION

Module 7 VIDEO CODING AND MOTION ESTIMATION Module 7 VIDEO CODING AND MOTION ESTIMATION Lesson 20 Basic Building Blocks & Temporal Redundancy Instructional Objectives At the end of this lesson, the students should be able to: 1. Name at least five

More information

Space Filling Curves and Hierarchical Basis. Klaus Speer

Space Filling Curves and Hierarchical Basis. Klaus Speer Space Filling Curves and Hierarchical Basis Klaus Speer Abstract Real world phenomena can be best described using differential equations. After linearisation we have to deal with huge linear systems of

More information

Multigrid Pattern. I. Problem. II. Driving Forces. III. Solution

Multigrid Pattern. I. Problem. II. Driving Forces. III. Solution Multigrid Pattern I. Problem Problem domain is decomposed into a set of geometric grids, where each element participates in a local computation followed by data exchanges with adjacent neighbors. The grids

More information

Graphics and Interaction Transformation geometry and homogeneous coordinates

Graphics and Interaction Transformation geometry and homogeneous coordinates 433-324 Graphics and Interaction Transformation geometry and homogeneous coordinates Department of Computer Science and Software Engineering The Lecture outline Introduction Vectors and matrices Translation

More information

COMP30019 Graphics and Interaction Transformation geometry and homogeneous coordinates

COMP30019 Graphics and Interaction Transformation geometry and homogeneous coordinates COMP30019 Graphics and Interaction Transformation geometry and homogeneous coordinates Department of Computer Science and Software Engineering The Lecture outline Introduction Vectors and matrices Translation

More information

Medical Image Segmentation Based on Mutual Information Maximization

Medical Image Segmentation Based on Mutual Information Maximization Medical Image Segmentation Based on Mutual Information Maximization J.Rigau, M.Feixas, M.Sbert, A.Bardera, and I.Boada Institut d Informatica i Aplicacions, Universitat de Girona, Spain {jaume.rigau,miquel.feixas,mateu.sbert,anton.bardera,imma.boada}@udg.es

More information

AMS527: Numerical Analysis II

AMS527: Numerical Analysis II AMS527: Numerical Analysis II A Brief Overview of Finite Element Methods Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao SUNY Stony Brook AMS527: Numerical Analysis II 1 / 25 Overview Basic concepts Mathematical

More information

A Brief Look at Optimization

A Brief Look at Optimization A Brief Look at Optimization CSC 412/2506 Tutorial David Madras January 18, 2018 Slides adapted from last year s version Overview Introduction Classes of optimization problems Linear programming Steepest

More information

Prototype of Silver Corpus Merging Framework

Prototype of Silver Corpus Merging Framework www.visceral.eu Prototype of Silver Corpus Merging Framework Deliverable number D3.3 Dissemination level Public Delivery data 30.4.2014 Status Authors Final Markus Krenn, Allan Hanbury, Georg Langs This

More information

Computer Vision 2. SS 18 Dr. Benjamin Guthier Professur für Bildverarbeitung. Computer Vision 2 Dr. Benjamin Guthier

Computer Vision 2. SS 18 Dr. Benjamin Guthier Professur für Bildverarbeitung. Computer Vision 2 Dr. Benjamin Guthier Computer Vision 2 SS 18 Dr. Benjamin Guthier Professur für Bildverarbeitung Computer Vision 2 Dr. Benjamin Guthier 1. IMAGE PROCESSING Computer Vision 2 Dr. Benjamin Guthier Content of this Chapter Non-linear

More information

Chapter 7 Practical Considerations in Modeling. Chapter 7 Practical Considerations in Modeling

Chapter 7 Practical Considerations in Modeling. Chapter 7 Practical Considerations in Modeling CIVL 7/8117 1/43 Chapter 7 Learning Objectives To present concepts that should be considered when modeling for a situation by the finite element method, such as aspect ratio, symmetry, natural subdivisions,

More information

Performance Evaluation of an Interior Point Filter Line Search Method for Constrained Optimization

Performance Evaluation of an Interior Point Filter Line Search Method for Constrained Optimization 6th WSEAS International Conference on SYSTEM SCIENCE and SIMULATION in ENGINEERING, Venice, Italy, November 21-23, 2007 18 Performance Evaluation of an Interior Point Filter Line Search Method for Constrained

More information

Iterative Algorithms I: Elementary Iterative Methods and the Conjugate Gradient Algorithms

Iterative Algorithms I: Elementary Iterative Methods and the Conjugate Gradient Algorithms Iterative Algorithms I: Elementary Iterative Methods and the Conjugate Gradient Algorithms By:- Nitin Kamra Indian Institute of Technology, Delhi Advisor:- Prof. Ulrich Reude 1. Introduction to Linear

More information

Reconstruction of complete 3D object model from multi-view range images.

Reconstruction of complete 3D object model from multi-view range images. Header for SPIE use Reconstruction of complete 3D object model from multi-view range images. Yi-Ping Hung *, Chu-Song Chen, Ing-Bor Hsieh, Chiou-Shann Fuh Institute of Information Science, Academia Sinica,

More information

Comparison of Interior Point Filter Line Search Strategies for Constrained Optimization by Performance Profiles

Comparison of Interior Point Filter Line Search Strategies for Constrained Optimization by Performance Profiles INTERNATIONAL JOURNAL OF MATHEMATICS MODELS AND METHODS IN APPLIED SCIENCES Comparison of Interior Point Filter Line Search Strategies for Constrained Optimization by Performance Profiles M. Fernanda P.

More information

Outlines. Medical Image Processing Using Transforms. 4. Transform in image space

Outlines. Medical Image Processing Using Transforms. 4. Transform in image space Medical Image Processing Using Transforms Hongmei Zhu, Ph.D Department of Mathematics & Statistics York University hmzhu@yorku.ca Outlines Image Quality Gray value transforms Histogram processing Transforms

More information

Recent Developments in Model-based Derivative-free Optimization

Recent Developments in Model-based Derivative-free Optimization Recent Developments in Model-based Derivative-free Optimization Seppo Pulkkinen April 23, 2010 Introduction Problem definition The problem we are considering is a nonlinear optimization problem with constraints:

More information

Lecture 19: November 5

Lecture 19: November 5 0-725/36-725: Convex Optimization Fall 205 Lecturer: Ryan Tibshirani Lecture 9: November 5 Scribes: Hyun Ah Song Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have not

More information

Optical Flow Estimation with CUDA. Mikhail Smirnov

Optical Flow Estimation with CUDA. Mikhail Smirnov Optical Flow Estimation with CUDA Mikhail Smirnov msmirnov@nvidia.com Document Change History Version Date Responsible Reason for Change Mikhail Smirnov Initial release Abstract Optical flow is the apparent

More information

Finite Element Method. Chapter 7. Practical considerations in FEM modeling

Finite Element Method. Chapter 7. Practical considerations in FEM modeling Finite Element Method Chapter 7 Practical considerations in FEM modeling Finite Element Modeling General Consideration The following are some of the difficult tasks (or decisions) that face the engineer

More information

Fast Natural Feature Tracking for Mobile Augmented Reality Applications

Fast Natural Feature Tracking for Mobile Augmented Reality Applications Fast Natural Feature Tracking for Mobile Augmented Reality Applications Jong-Seung Park 1, Byeong-Jo Bae 2, and Ramesh Jain 3 1 Dept. of Computer Science & Eng., University of Incheon, Korea 2 Hyundai

More information

Topology Optimization of Two Linear Elastic Bodies in Unilateral Contact

Topology Optimization of Two Linear Elastic Bodies in Unilateral Contact 2 nd International Conference on Engineering Optimization September 6-9, 2010, Lisbon, Portugal Topology Optimization of Two Linear Elastic Bodies in Unilateral Contact Niclas Strömberg Department of Mechanical

More information

Stereo Matching for Calibrated Cameras without Correspondence

Stereo Matching for Calibrated Cameras without Correspondence Stereo Matching for Calibrated Cameras without Correspondence U. Helmke, K. Hüper, and L. Vences Department of Mathematics, University of Würzburg, 97074 Würzburg, Germany helmke@mathematik.uni-wuerzburg.de

More information

Modern Methods of Data Analysis - WS 07/08

Modern Methods of Data Analysis - WS 07/08 Modern Methods of Data Analysis Lecture XV (04.02.08) Contents: Function Minimization (see E. Lohrmann & V. Blobel) Optimization Problem Set of n independent variables Sometimes in addition some constraints

More information

The Lucas & Kanade Algorithm

The Lucas & Kanade Algorithm The Lucas & Kanade Algorithm Instructor - Simon Lucey 16-423 - Designing Computer Vision Apps Today Registration, Registration, Registration. Linearizing Registration. Lucas & Kanade Algorithm. 3 Biggest

More information

Assignment 4: Mesh Parametrization

Assignment 4: Mesh Parametrization CSCI-GA.3033-018 - Geometric Modeling Assignment 4: Mesh Parametrization In this exercise you will Familiarize yourself with vector field design on surfaces. Create scalar fields whose gradients align

More information

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

More information

arxiv: v1 [math.na] 26 Jun 2014

arxiv: v1 [math.na] 26 Jun 2014 for spectrally accurate wave propagation Vladimir Druskin, Alexander V. Mamonov and Mikhail Zaslavsky, Schlumberger arxiv:406.6923v [math.na] 26 Jun 204 SUMMARY We develop a method for numerical time-domain

More information

Biomedical Image Analysis. Point, Edge and Line Detection

Biomedical Image Analysis. Point, Edge and Line Detection Biomedical Image Analysis Point, Edge and Line Detection Contents: Point and line detection Advanced edge detection: Canny Local/regional edge processing Global processing: Hough transform BMIA 15 V. Roth

More information