arxiv: v1 [cs.cv] 27 Apr 2018

Size: px

Start display at page:

Download "arxiv: v1 [cs.cv] 27 Apr 2018"

Roxanne Cooper
5 years ago
Views:

1 A MATRIX-FREE APPROACH TO PARALLEL AND MEMORY-EFFICIENT DEFORMABLE IMAGE REGISTRATION LARS KÖNIG, JAN RÜHAAK, ALEXANDER DERKSEN, AND JAN LELLMANN, arxiv: v1 [cs.cv] 27 Apr 2018 Abstract. We present a novel computational approach to fast and memory-efficient deformable image registration. In the variational registration model, the computation of the objective function derivatives is the computationally most expensive operation, both in terms of runtime and memory requirements. In order to target this bottleneck, we analyze the matrix structure of gradient and Hessian computations for the case of the normalized gradient fields distance measure and curvature regularization. Based on this analysis, we derive equivalent matrix-free closed-form expressions for derivative computations, eliminating the need for storing intermediate results and the costs of sparse matrix arithmetic. This has further benefits: (1) matrix computations can be fully parallelized, (2) memory complexity for derivative computation is reduced from linear to constant, and (3) overall computation times are substantially reduced. In comparison with an optimized matrix-based reference implementation, the CPU implementation achieves speedup factors between 3.1 and 9.7, and we are able to handle substantially higher resolutions. Using a GPU implementation, we achieve an additional speedup factor of up to 9.2. Furthermore, we evaluated the approach on real-world medical datasets. On ten publicly available lung CT images from the DIR-Lab 4DCT dataset, we achieve the best mean landmark error of 0.93 mm compared to other submissions on the DIR-Lab website, with an average runtime of only 9.23 s. Complete non-rigid registration of full-size 3D thorax-abdomen CT volumes from oncological follow-up is achieved in 12.6 s. The experimental results show that the proposed matrix-free algorithm enables the use of variational registration models also in applications which were previously impractical due to memory or runtime restrictions. Key words. deformable image registration, computational efficiency, parallel algorithms AMS subject classifications. 92C55, 65K10, 65Y05 1. Introduction. Image registration denotes the process of aligning two or more images for analysis and comparison [30]. Deformable registration approaches, which allow for non-rigid, non-linear deformations, are of particular interest in many areas of medical imaging and contribute to the development of new technologies for diagnosis and therapy. Applications range from motion correction in gated cardiac Positron Emission Tomography (PET) [16] and biomarker computation in regional lung function analysis [15] to inter-subject registration for automated labeling of brain data [1]. For successful clinical adoption of deformable image registration, moderate memory consumption and low runtimes are indispensable, which is made more difficult by the fact that data is usually three-dimensional and therefore large. For example, in order to fuse computed tomography (CT) images for oncological follow-up, volumes with typical sizes of around voxels need to undergo full 3D registration; see Figure 1 for an example of a thorax-abdomen registration. In a clinical setting, these registrations have to be performed for every acquired volume and must be available quickly in order not to slow down the clinical workflow. In other situations, such as in population studies or during screening, the large number of required registrations demands efficient deformable image registration algorithms. In some areas, such as liver ultrasound tracking, real-time performance is even required [27, 11]. Consequently, a lot of research has been dedicated to increasing the efficiency Fraunhofer MEVIS, Lübeck, Germany (lars.koenig@mevis.fraunhofer.de, jan.ruehaak@mevis.fraunhofer.de, alexander.derksen@mevis.fraunhofer.de). Institute of Mathematics and Image Computing, University of Lübeck, Lübeck, Germany (jan.lellmann@mic.uni-luebeck.de) First preliminary results of this work were announced in 4-page conference proceedings [28, 25] 1

2 L. KO NIG, J. RU HAAK, A. DERKSEN AND J. LELLMANN (a) Prior scan (b) Current scan (c) Initial difference (d) After registration Fig. 1.

(a) coronal slice of prior scan, (b) coronal slice of current scan with deformation grid overlay, (c) subtraction image before registration, (d) subtraction image after registration.

Image courtesy of Radboud University Medical Center, Nijmegen, The Netherlands. of image registration algorithms, see [45, 47, 12] for an overview.

given algorithm on a particular hardware platform mostly on massively parallel architectures such as Graphics Processing Units (GPUs) [24, 38, 4, 46], but also on Digital Signal Processors (DSPs) [2]

2 2 L. KO NIG, J. RU HAAK, A. DERKSEN AND J. LELLMANN (a) Prior scan (b) Current scan (c) Initial difference (d) After registration Fig. 1. Registration of thorax-abdomen 3D CT scans for oncological follow-up. (a) coronal slice of prior scan, (b) coronal slice of current scan with deformation grid overlay, (c) subtraction image before registration, (d) subtraction image after registration. The current scan is deformed onto the prior scan using the proposed variational method. The subtraction image highlights the areas of change (white spots in the lung) corresponding to tumor growth. Image courtesy of Radboud University Medical Center, Nijmegen, The Netherlands. of image registration algorithms, see [45, 47, 12] for an overview. The literature can be divided into approaches that aim at modifying the algorithmic structure of a registration method in order to increase efficiency [17, 18, 48] and approaches that specialize a given algorithm on a particular hardware platform mostly on massively parallel architectures such as Graphics Processing Units (GPUs) [24, 38, 4, 46], but also on Digital Signal Processors (DSPs) [2] or Field Programmable Gate Arrays (FPGAs) [9]. Contributions and overview. In the following, we focus on variational approaches for image registration, which rely on numerical minimization of an energy function that depends on the input data. While such methods have been successfully used in various applications [43, 5, 16, 36], they involve expensive derivative computations, which are not straightforward to implement efficiently and hinder parallelizability. In this work, we aim to work towards closing this gap: We analyze the derivative structure of the classical, widely-known variational image registration model [35, 36] with the normalized gradient fields distance measure [19] and curvature regularization [14], which has been successfully used for various image registration problems [49, 26, 11] (Section 3 and Section 4). We study the use of efficient matrix-free techniques for derivative calculations in order to improve computational efficiency (Section 4). In particular, we present fully matrix-free computation rules, based on the work on affinelinear image registration in [44] and deformable registration in [28, 25], for objective function gradient computations (Section 4.1.1) and Gauss-Newton Hessian-vector multiplications (Section 4.1.2). We perform a theoretical analysis of the proposed approach in terms of computational effort and memory usage (Section 5). Finally, we quantitatively evaluate CPU and GPU implementations (Section 6.3) of the new method with respect to speedup, parallel scalability (Section 6.1) and problem size, and in comparison with two alternative meth-

3 MATRIX-FREE DEFORMABLE IMAGE REGISTRATION 3 ods (Section 6.2). We demonstrate the real-world applicability on two medical applications (Section 6.4). Benefits of the proposed approach are full parallelization of objective function and derivative computations, reduction of derivative computation memory requirements from linear to constant, as well as a large reduction in overall runtime. The strategy can also be applied to other distance measures and regularizers, for which we discuss requirements and limitations. In order to underline the merits of the approach in the clinic, we exemplarily consider two medical applications. Firstly, the registration algorithm is employed for registering lung CT images in inhaled and exhaled position (Section 6.4.1). On the widely-used DIR-Lab 4DCT database [8, 6], we achieve the best mean landmark error of 0.93 mm in comparison with the results reported at [7] with an average runtime of only 9.23 s, computed on a standard workstation. Secondly, we consider oncological follow-up studies in the thorax-abdomen region (Section 6.4.2), where qualitatively convincing results are obtained at a clinically feasible runtime of 12.6 s. 2. Variational image registration framework Continuous model. While there are numerous different applications of image registration, the goal is always the same: to obtain spatial correspondences between two or more images. Here we only consider the case of two images. The first image is denoted the reference image R (sometimes also called fixed image), the second image is the template image T (or moving image) [36]. For three-dimensional gray-scale images, the two images are given by functions R : R 3 R and T : R 3 R on domains Ω R R 3 and Ω T R 3, mapping each coordinate to a gray value. To establish correspondence between the images, a deformation function ϕ : Ω R R 3 is sought, which encodes the spatial alignment by mapping points from reference to template image domain. With the composition T (ϕ) : Ω R R, x T (ϕ(x)), a deformed template image in the reference image domain can be obtained. In order to find a suitable deformation ϕ, two competing criteria must be balanced. Firstly, the deformed template image should be similar to the reference image, determined by some distance measure D(R, T (ϕ)). Secondly, as the distance measure alone is generally not enough to make the problem well-posed and robust, a regularizer S(ϕ) is introduced, which requires the deformation to be reasonable. An optimal deformation meeting both criteria is then found by solving the optimization problem (1) min J (ϕ), ϕ:ω R R 3 J (ϕ) := D(R, T (ϕ)) + αs(ϕ), where α > 0 is a weighting parameter. Numerous different approaches and definitions for distance measures [10, 51, 35] and regularizers [14, 3, 5] have been presented, each with specific properties tailored to the application. This work focuses on the normalized gradient fields (NGF) distance measure. It was first presented in [19] and has since been successfully used in a wide range of applications [43, 41, 26]. The NGF distance is particularly suited for multi-modal registration problems, where image intensities of reference and template image are related in complex and usually non-linear ways, and represents a fast and robust alternative [19] to the widely-used mutual information [10, 51]. It is defined as (2) ( ) 2 T (ϕ), R + τϱ D(ϕ) = 1 dx, Ω R T (ϕ) τ R ϱ

4 4 L. KÖNIG, J. RÜHAAK, A. DERKSEN AND J. LELLMANN where ε :=, + ε 2, and τ, ϱ > 0 are parameters that control filtering of noise in each image. The underlying assumption is that even in different imaging modalities, intensity changes still take place at corresponding locations. Therefore, the NGF distance advocates parallel alignment of image gradient directions, i.e., edge orientations. For the regularizer S, we use curvature regularization [14], S(ϕ) = Ω R j=1 3 ( u j ) 2 dx. S is defined in terms of the displacement u := (u 1, u 2, u 3 ) := ϕ id, where id is the identity transformation. The curvature regularizer penalizes second-order derivatives and thus favors smooth deformations, while not penalizing affine transformations such as translations or rotations. Similar to NGF, curvature regularization has been successfully used in multiple applications, e.g., [43, 26, 27] Discretization. In order to solve the registration problem (1), we follow a discretize-then-optimize approach [36]: Firstly, the deformation ϕ and objective function J are turned into a discretized energy. Secondly, this discretized energy is minimized using standard methods from numerical optimization (Section 2.3). For reference, the notation introduced in the following is summarized in Table 1. The image domain Ω R is discretized on a cuboid grid with m x, m y, m z grid cells (voxels) in each dimension and m := m x m y m z cells in total. The corresponding grid spacings are denoted by h x, h y, h z, and the cell volume by h := h x h y h z. The associated image grid is defined as the cell-centers X := {(îhx hx 2, ĵh y hy 2, ˆkh ) z hz 2 î = 1,..., m x, ĵ = 1,..., m y, ˆk } = 1,..., m z. Using a lexicographic ordering defined by the index mapping i := î + ĵ m x + ˆk m x m y, the elements in X can be assembled into a single vector x := (x 1,..., x 3 m ) R 3 m, which is obtained by first storing all x-, then all y- and finally all z- coordinates in the order prescribed by i. The i-th grid point in X is then x i := (x i, x i+ m, x i+2 m ) R 3. Evaluating the reference image on all grid points can be compactly written as R(x) := R (x i ) i=1,..., m, resulting in a vector R(x) R m of image intensities at the grid points. In a similar way, we define the discretized deformation y := ϕ(x y ) as the evaluation of ϕ on a deformation grid x y. Again we order all components of y in a vector, y = (y 1,..., y 3 m y), so that ϕ(x y i ) = (y i, y i+ m y, y i+2 m y), where the deformation grid has grid spacings h y x, h y y, h y z, and m y x, m y y, m y z denote the number of grid points in each direction with m y := m y xm y ym y z. Importantly, instead of using cell-center coordinates, the deformation is discretized on cell corners, i.e., using a nodal grid. This ensures that the deformation is discretized up to the boundary of the image domain, enabling an efficient grid conversion without requiring extrapolation or additional boundary conditions (Section 4.3). This results in m y x, m y y, m y z coordinates in each direction and m y := m y xm y ym y z grid points in total. The set of all nodal grid points is given by X y y := {(îh x, ĵh y y, ˆkh z) y î= 0,..., m y x 1, ĵ = 0,..., m y y 1, ˆk } = 0,..., m y z 1.

5 MATRIX-FREE DEFORMABLE IMAGE REGISTRATION 5 x, y, z spatial coordinate axes R, T reference, template image functions, R, T : R 3 R Ω R reference image domain, Ω R R 3 ϕ deformation function, ϕ : Ω R R 3 m, m y number of grid cells in the image/deformation grid, m = (m x, m y, m z) m, m y total number of grid cells, m = m xm ym z h, h y grid spacings for each coordinate direction, h = (h x, h y, h z) h, h y volume of a single grid cell, h = h xh yh z i linear index with i = î + ĵm x + ˆkm xm y i x, i + x indices of neighbors in x-direction, taking boundary conditions into account i y, i + y indices of neighbors in y-direction, taking boundary conditions into account i z, i + z indices of neighbors in z-direction, taking boundary conditions into account x, x y vector of lexicographically ordered grid points, x = (x 1,..., x 3 m ) R 3 m x i, x y i i-th grid point, x i = (x i, x i+ m, x i+2 m ) R 3 y deformation discretized on deformation grid, y = (y 1,..., y 3 m y ) R 3 my, (y i, y i+ m y, y i+2 m y ) = ϕ(x y i ) u displacement discretized on deformation grid, u = y x y R 3 my P grid conversion operator from deformation grid to image grid, P : R 3 my R 3 m ŷ deformation discretized on image grid points, ŷ = P (y) R 3 m ŷ i i-th deformation point, ŷ i = (P (y) i, P (y) i+ m, P (y) i+2 m ) R(x) reference image evaluated on image grid, R : R 3 m R m R i scalar value of reference image at i-th grid point, R i = R(x i ) R T (P (y)) template evaluated on deformed image grid, T : R 3 m R m T i scalar value of reference image at deformed i-th grid point, T i = T (ŷ i ) R Table 1 Summary of notation introduced in Section 2.2 for discretization of the continuous registration problem. All notation regarding the deformation grid is identical to the corresponding notation for the image grid, except for an additional superscript y, e.g., m denotes the number of grid cells in the image grid, and m y denotes the number of grid cells in the deformation grid. Analogously to the image grid, using the linear index i := î+ĵ m y x+ˆk m y xm y y, the points from X y can be ordered lexicographically in a vector x y := (x y 1,..., xy 3 m y) R3 my, with a single point having the coordinates x y i := (xy i, xy i+ m y, xy i+2 m y). We refer to [36] for further discussion on different grids. In this work, the deformation grid size is always chosen with m y k 1 m k, k = x, y, z, so that the image grid is as least as fine as the deformation grid. The separate choice of grids allows to use a coarser grid for the deformation which directly affects the number of unknowns and a high-resolution grid for the input images. However, it adds an extra interpolation step, as in order to compute the distance measure, the deformed template image T (ϕ) needs to be evaluated on the same grid as the reference image R. We use a linear interpolation function P : R 3 my R 3 m, mapping between image grid and deformation grid, and define T (P (y)) := (T (ŷ i )) i=1,..., m R m with ŷ i := (P (y) i, P (y) i+ m, P (y) i+2 m ). For evaluating the template image T, tri-linear interpolation with Dirichlet boundary conditions is used. This is a reasonable assumption for medical images, which often exhibit a black background. Discretizing the integral in (2) using the midpoint quadrature rule and denoting by T i, R i the i-th component function, the compact expression ( D(y) = h m T i (P (y)), R ) 2 i (x) + τϱ (3) T i (P (y)) τ R i (x) ϱ i=1 is obtained for the full discretized NGF term. The factor 1/2 is caused by the square-

6 6 L. KÖNIG, J. RÜHAAK, A. DERKSEN AND J. LELLMANN then-average scheme for discretizing the gradient : Given a discretized image I R m, we define the backward difference operator at the i-th grid point, I i : R m R 3, as I i := ( Ii I i x h x where the neighborhood is defined via, I i I i y, I ) i I i z, h y h z i x := max(î 1, 1) + ĵ m x + ˆkm x m y (4) i y := î + max(ĵ 1, 1)m x + ˆkm x m y i z := î + ĵm x + max(ˆk 1, 1)m x m y i + x := min(î + 1, m x ) + ĵ m x + ˆkm x m y i + y := î + min(ĵ + 1, m y ) m x + ˆkm x m y i + z := î + ĵm x + min(ˆk + 1, m z ) m x m y and i 0 := i. This allows to implement Neumann boundary conditions without special treatment of the boundary cells. By + I i we denote the corresponding forward finite differences operator. We combine both operators into one, ( I i := I i, ) (5) + I i with I i : R m R 6, and define I 1 i ε := 2 I i, I i + ε 2. This square-thenaverage scheme for approximating the terms in (2) allows to use short forward and backward differences, which better preserve high frequency gradients than long central differences [21]. The factor 1/2 is required in order to faithfully discretize (2), and can be interpreted as an averaging of the squared norms of the forward and backward operators. Similarly to the NGF, the curvature regularizer is discretized as (6) y m S(y) = h y 2 i=1 d=0 ( ui+d m y ) 2, with the discretized displacement u := (u 1,..., u 3 m y) := y x y and the discretized Laplace operator (7) u i+d m y := k {x,y,z} 1 (h y k )2 (u i k+d m y 2 u i+d m y + u i+k+d m y) with homogeneous Neumann boundary conditions. Overall, we obtain the discretized version (8) min J(y), y R3 my J(y) := D(y) + αs(y) of the minimization problem (1), which can then be solved using quasi-newton methods, as summarized in the following section Numerical optimization. In order to find a minimizer of the discretized objective function (8), we use iterative Newton-like optimization schemes. In each step of the iteration, an equation of the form (9) ˆ 2 J(y k )s k = J(y k )

7 MATRIX-FREE DEFORMABLE IMAGE REGISTRATION 7 is solved for a descent direction s k. Then y k is updated via y k+1 = y k + ηs k, where the step length η is determined by Armijo line search [39, 36]. The matrix ˆ 2 J should approximate the Hessian 2 J(y k ). Here we consider the Gauss-Newton scheme and the L-BFGS scheme, which have both been used in different image registration applications [50, 25, 43]. The minimization is embedded in a coarse-to-fine multi-level scheme, where the problem is solved on consecutively finer deformation and image grids Gauss-Newton. The Gauss-Newton scheme uses a quadratic approximation of the Hessian and is suitable for least-squares type objective functions of the form (10) Ĵ(y) = ˆr(y) ˆr(y) = ˆr(y) 2 2, where ˆr(y) is a residual function, which can depend in a nonlinear way on the unknown y. The gradient in the Newton equation (9) can be written as Ĵ = 2 ˆr dˆr, where dˆr is the Jacobian of ˆr, and the Hessian is approximated by H := ˆ 2 Ĵ = 2dˆr dˆr [39]. This approximation discards second-order derivative parts of the Hessian and guarantees a symmetric positive semi-definite Hessian approximation H, so that s k in (9) is always a descent direction. We apply this approximation only to the Hessian of the distance measure D, 2 D H. For the regularizer S, we use the exact Hessian, which is readily available, as S is a quadratic function. The (quasi-)newton-equation (9) can be written as (H + α 2 S)s k = ( D + α S), which is then approximately solved in each step using a conjugate-gradient (CG) iterative solver [39] L-BFGS. Instead of directly determining a Hessian approximation at each step, the L-BFGS scheme iteratively updates the approximation using previous and current gradient information. L-BFGS requires an initial approximation H 0 of the Hessian, which is often chosen to be a multiple of the identity. However, in our case we can incorporate the known Hessian of the regularizer S by choosing H 0 = 2 S + γi, where I is the identity matrix and γ > 0 is a parameter in order to make H 0 positive definite; further details can be found in [36, 39]. 3. Analytical derivatives. As can be seen from Section 2.3, computing derivatives of distance measure and regularizer is a critically important task for the minimization of the discretized objective function. In this section, algebraic formulations of the required derivatives are presented, which allow for a closer analysis and construction of specialized computational schemes Curvature. The discretized curvature regularizer (6) is a quadratic function involving the (linear) Laplace operator u i. Thus, using the chain rule the i-th element of the gradient can explicitly be computed by ( S(y)) i+d m y = 2 h ( ) (11) y u i+d m y, with d {0, 1, 2} for the directional derivatives. The Hessian is constant and can be immediately seen from (7).

8 8 L. KÖNIG, J. RÜHAAK, A. DERKSEN AND J. LELLMANN 3.2. NGF. The inner computations of the NGF in (3) can be rewritten more compactly by introducing a residual function r : R m R m with components (12) r i (T ) := so that the NGF term in (3) becomes (13) D(y) = h 1 2 T i, R i + τϱ T i τ R i ϱ, m i=1 ( 1 ri (T (P (y))) 2). Introducing a reduction function ψ : R m R, (r 1,..., r m ) h m i=1 (1 r2 i ), it holds (14) D(y) = ψ(r(t (P (y)))). This composite function maps the deformation y R 3 my onto a scalar image similarity in four steps, (15) R 3 my P R 3 m T R m r R m ψ R. Using the chain rule, the gradient of D can now be obtained as a product of (sparse) Jacobian matrices ( ) ψ r P (16) D =. r P y The Gauss-Newton approximation of the Hessian (Section 2.3.1) becomes (17) P H = 2 h r r P y P P y. Note that, in contrast to the classical Gauss-Newton method (10), the residual in (13) has a different sign. Therefore, in order to compute a descent direction in (9), the sign of the Hessian approximation has been inverted (see also [36]). Using these formulations, the evaluation of the gradient and Hessian can be implemented based on sparse matrix representations [36]. However, while insightful for educational and analytic purposes, these schemes have important shortcomings: storing the matrix elements requires large amounts of memory, while assembling the sparse matrices and sparse matrix multiplications impede efficient parallelization. In the following chapters, we will therefore focus on strategies for direct evaluation that do not require intermediate storage and are highly parallel. 4. Derivative analysis and matrix-free computation schemes. From a computational perspective, the NGF (13) and curvature formulations (6) are very amenable to parallelization, as all summands are independent of each other, and there are no inherent intermediate matrix structures required. However, this advantage is lost when the evaluation of the derivatives is implemented using individual matrices for the factors in (16) and (17). In order to exploit the structure of the partial derivatives, in the following sections we will analyze the matrix structure of all derivatives and derive equivalent closedform expressions that do not rely on intermediate storage. We will make substantial use of the fact that the matrix structure of the partial derivatives is independent of the data. Therefore, the locations of non-zero elements are known a priori, which allows to derive efficient sparse schemes.

9 MATRIX-FREE DEFORMABLE IMAGE REGISTRATION Derivative computations for NGF Gradient. The NGF gradient computation in (16) involves four different Jacobian matrices. The derivative of the vector reduction ψ(r) is straightforward: (18) ψ r = 2 hr = 2 h (r 1,..., r m ) R 1 m. In contrast, the computation of r R m m is considerably more involved. In order to calculate a single row of this Jacobian ri, the definition (13) can be used and interpreted as a quotient (19) r i = r 1,i r 2,i = 1 2 T i, R i + τϱ T i τ R i ϱ. Separately differentiating numerator and denominator, this yields r 1,i = 1 2 ( R i ) i R1 m and r 2,i = R i ϱ ( T i ) 2 T i τ i R1 m. Included in both of these terms is the derivative of the finite difference gradient computation T i : R m R 6 as defined in (5), i.e., mapping an image to the forwardand backward difference gradients. As the gradient at a single point only depends on the values at the point itself and at neighboring points, the Jacobian i R6 m is of the form i = i z i y i x i i + x i + y i + z 1 1 h x h x 1 1 h y h y 1 h z 1 h z 1 h x 1 h x 1 h y 1 h y 1 h z 1 h z where only non-zero elements are shown and the non-zero column indices are displayed above the matrix. The quotient rule, applied to (19), leads to r i = 1 ( ) r1,i r2,i 2 r r 2,i 2,i r 1,i = r 1,i 1 r 1,i r 2,i (20) r 2,i r2,i 2 (21) (22) ( = R i ) 1 i 2 T i τ R i ϱ 2 T i, R i + τϱ T i 2 τ R i 2 ϱ ( = 1 ( R i ) 1 i 2 T i τ R i ϱ 2 T i, R i + τϱ T i 3 τ R i ϱ We introduce the abbreviation (23) ρ i (k) := R i + R i+k T i τ R i ϱ R ( i T i ) i ϱ 2 T i τ ) ( T i ) i ( 1 2 T i, R ) i + τϱ ( T i + T i+k ) T i 3 τ R, i ϱ.,

10 10 L. KÖNIG, J. RÜHAAK, A. DERKSEN AND J. LELLMANN and the set of indices of non-zero elements (24) K := { z, y, x, 0, x, y, z}. Furthermore, define ĥk := 1 2(h k ) 2 and (25) { j K\{0} ˆρ i (k) := ĥjρ i (j) if k = 0, ĥ k ρ i (k) otherwise. Then, (22) can be compactly written as (26) r ( i = i z i y i x i i + x i + y i + z ) ˆρ i ( z) ˆρ i ( y) ˆρ i ( x) ˆρ i (0) ˆρ i (x) ˆρ i (y) ˆρ i (z) R 1 m, where only non-zero elements are shown and single element locations are denoted above the vector. It can be seen that (26) exhibits a very sparse pattern with only seven non-zero elements. For a full gradient computation as in (16), it remains to consider the terms P and P y. The first term represents the derivative of the template image interpolation function with respect to the image grid coordinates. Given the lexicographical ordering of the grid points, the image derivative matrix is composed of three diagonal matrices, (27) ( ( ) ( P = diag 1 m P 1... P m, diag 1 P m+1... ) ( m P 2 m, diag 1 P 2 m+1... )) m P 3 m, where each non-zero element represents the derivative at a single point with respect to a single coordinate. The last remaining term P y is the Jacobian of the grid conversion function. This is a linear operator, which will be analyzed in detail in Section 4.3. From the derivations in (18), (26) and (27), it can be seen that the main components of the NGF gradient computation exhibit a sparse structure with fixed patterns. The main effort comes from computing the derivative of r, which has a seven-banded diagonal structure. Exploiting this pattern, a single element of the partial gradient D ψ r P := r P R 1 3 m can be explicitly computed as ( ) ( ) D = 2 P h i (28) r i+k ˆρ i+k ( k), i+d m P i+d m k K where d {0, 1, 2} and K = { z, y, x, 0, x, y, z} as in (24). Algorithm 1 shows a pseudocode implementation for computing the individual elements of the gradient as in (28). This formulation has multiple benefits: Any element of the gradient can be computed directly from the input data, without the need for (sparse) matrices and without having to store intermediate results. Furthermore, gradient elements can be computed independently, which allows for a fully parallel implementation. As will be discussed in Section 5, both of these properties substantially decrease memory usage and computation time.

11 MATRIX-FREE DEFORMABLE IMAGE REGISTRATION 11 Algorithm 1 Pseudocode for matrix-free computation of the elements of the NGF gradient as in (28). The algorithm can be fully parallelized over all loop iterations. 1: for ˆk in [0, m z 1] do 2: for ĵ in [0, m y 1] do 3: for î in [0, m x 1] do 4: i î + m x ĵ + m x m yˆk Compute linear index 5: i ± x, i ± y, i ± z as in (4) Compute neighbor indices 6: [dtx, dty, dtz] imagederivative(t i ) Compute image derivative 7: 8: r [r i z, r i y, r i x, r i, r i+x, r i+y, r i+z ] r i as defined in (19) 9: dr [ˆρ i z (z), ˆρ i y (y), ˆρ i x (x), ˆρ i (0), ˆρ i+x ( x), ˆρ i+y ( y), ˆρ i+z ( z)] 10: ˆρ i as defined in (25) 11: drsum 2 h (r[0]dr[0] + r[1]dr[1] + r[2]dr[2] + r[3]dr[3] 12: + r[4]dr[4] + r[5]dr[5] + r[6]dr[6]) 13: Compute sum of (28) 14: grad[i ] drsum dtx 15: grad[i + m] drsum dty 16: grad[i + 2 m] drsum dtz 17: end for 18: end for 19: end for Output: grad[i + d m] = ( ) D P i+d m Hessian-vector multiplication. As noted in Section 2.3.1, when using the Gauss-Newton scheme for optimization, linear systems involving the quadratic approximation H R 3 my 3 m y of the Hessian (17) need to be solved in each iteration. Although H is sparse, the memory requirements for storing the final as well as intermediate matrices can still be considerable. Therefore, in the following we will consider an efficient scheme for evaluating the matrix-vector product Hp = q with p, q R 3 my, which is the foundation for using iterative methods, such as conjugate gradients, in order to solve the (Gauss-)Newton equation (9). Figure 2 shows a schematic of the computations involved. Again, we consider the Jacobian P y and its transpose in (17) as separate grid conversion steps, which will be discussed in detail in Section 4.3. The main computations consist of computing the matrix product (29) Ĥ := r r P P R3 m 3 m, with the components P R m 3 m r, R m m, which is equivalent to computing the approximate Hessian H in (17) with the exception of the grid conversion steps. We first analyze the matrix-vector product Ĥˆp = ˆq with ˆp, ˆq R3 m. Abbreviating dr := r, the main challenge is efficiently computing the matrix product dr dr. Using the definition and notation of a single row of dr given in (26), a single

12 12 L. KÖNIG, J. RÜHAAK, A. DERKSEN AND J. LELLMANN column of the matrix can be written as (30) r i = ( i z i y i x i i + x i + y i + z ˆρ i z (z) ˆρ i y (y) ˆρ i x (x) ˆρ i (0) ˆρ i+x ( x) ˆρ i+y ( y) ˆρ i+z ( z) ), again only showing non-zero elements at the indices denoted above the vector. As in (26), a single column contains only seven non-zero elements. Abbreviating dr i := r i R m 1, each element of the matrix dr dr is a scalar product of columns dr i, dr j for i, j = 1,..., m. However, due to the sparsity of the columns of dr in (30), there are only few non-zero scalar products. As a basic example, consider the diagonal elements of dr dr. In column dr i, non-zero elements are located at (dr i ) i+k, k K, with K as in (24). This yields the scalar products dr i, dr i = k K (dr i) 2 i+k = k K ˆρ i+k( k) 2, a sum of seven terms. More generally, for arbitrary i, j, define κ := j i, so that dr i, dr j = dr i, dr i+κ. In order to characterize the elements where dr i, dr i+κ can be non-zero, observe that the non-zero elements of dr i are located at indices i + M, (31) M := { m x m y, m x, 1, 0, 1, m x, m x m y }. Consequently, dr i, dr i+κ can only be non-zero if (i + M) (i + κ + M), i.e., if the scalar product involves at least two non-zero elements. Equivalently, κ N := M M, therefore the number of non-zero inner products is bounded from above by the number of elements N in N. Naturally, N M M = 49, however it turns out that this bound can be substantially lowered. In order to do so, define the mapping N(i, j) := j i, N : M M Z, so that N = N(M, M) = { } κ Z κ = N(i, j), i, j M. Inspecting all possible combinations for i and j, it turns out that N = 25 (Table 2), which implies that each column of dr dr has at most 25 non-zero elements. For a given offset κ = j i, the pre-image N 1 (κ) indicates the indices of nonzero elements contributing to the inner product dr i, dr j. As shown in the rightmost column in Table 2, for the main diagonal of dr dr, seven elements are involved in the scalar product, while for all other elements only either one or two products need to be computed. Using κ = j i and (î, ĵ) N 1 (κ), it holds ( dr dr ) (dr i ) (dr i+î j), if j i N, j+ĵ i,j = (32) (î,ĵ) N 1 (j i) 0, otherwise, ˆρ ( î) i+î ˆρ j+ĵ ( ĵ), if j i N, = (î,ĵ) N 1 (j i) 0, otherwise, where, in a slight abuse of notation, ˆρ( m 1 m 2 ) should be interpreted as ˆρ( z), etc. With this and the explicit formulation of ˆρ i (k) in (23) and (25), the computational cost of calculating dr dr can be reduced substantially: multiplications involving zero

13 MATRIX-FREE DEFORMABLE IMAGE REGISTRATION 13 P y P r r P P y p = q Fig. 2. Schematic view of Hessian-vector multiplication Hp = q. The computation involves highly sparse matrices with a fixed pattern of non-zeros. The main computational effort results from the matrix multiplication r r, where r has seven non-zero diagonals. elements are no longer considered, administrative costs for locating and handling sparse matrix elements have been eliminated, and only essential calculations remain. Substituting (32) into the expression (29) for evaluating the matrix-vector product ˆq = Ĥˆp, we obtain ˆq d m+i = κ N l {0,1,2} i P d m+i ( dr dr ) i,i+κ κ P l m+i+κ ˆp l m+i+κ, (33) for d {0, 1, 2}, using the definition of ( dr dr ) i,i+κ from (32). Similar to the matrix-free gradient formulation in (28), this allows a direct computation of each element of the result vector ˆq = Ĥˆp fully in parallel and directly from the input data. An implementation in pseudocode is shown in Algorithm Derivative computations for curvature regularization. In (11), a matrix-free version of the curvature gradient computation was already given. As the curvature computation is a quadratic function, the Hessian-vector multiplication q i+d m y = ( 2 Sp ) i+d m y = 2 h y ( p ) i+d m y, (34) for d {0, 1, 2}, is closely related, with p replacing u in (11). In contrast to NGF, this uses the exact Hessian rather than a Gauss-Newton approximation. As the regularizer operates on the deformation grid, no grid conversion is required Grid conversion. We now consider the grid conversion steps omitted in the previous sections. The grid conversion maps deformation values y R 3 my from the deformation grid to the image grid using the function P : R 3 my R 3 m (Section 2.2). In this section, we separately analyze P and its transpose P. The conversion, i.e., interpolation, of y from deformation grid to image grid is performed using tri-linear interpolation and thus can be written as a matrix-vector product ŷ = P y with P R 3 m 3 my and y R 3 my, ŷ R 3 m. The transformation matrix P has a block-diagonal structure with three identical blocks, each converting one of the three coordinate components of y. As the deformation grid is a nodal grid, no boundary handling is required, as each (cell-centered) image grid point is surrounded by deformation grid points.

14 14 L. KÖNIG, J. RÜHAAK, A. DERKSEN AND J. LELLMANN κ N N 1 (κ) N 1 (κ) 2m 1m 2 {(m 1m 2, m 1m 2)} 1 m 1m 2 m 1 {(m 1, m 1m 2), (m 1m 2, m 1)} 2 m 1m 2 1 {(1, m 1m 2), (m 1m 2, 1)} 2 m 1m 2 {(0, m 1m 2), (m 1m 2, 0)} 2 m 1m {( 1, m 1m 2), (m 1m 2, 1)} 2 m 1m 2 + m 1 {( m 1, m 1m 2), (m 1m 2, m 1)} 2 2m 1 {(m 1, m 1) 1 m 1 1 {(1, m 1), (m 1, 1)} 2 m 1 {(0, m 1), (m 1, 0)} 2 m {( 1, m 1), (m 1, 1)} 2 2 {(1, 1)} 1 1 {(0, 1), (1, 0)} 2 0 {(î, î) î M} 7 1 {(0, 1), ( 1, 0)} 2.. Table 2 Offsets κ := j i, for which dr i, dr j = 0, corresponding to non-zero elements (dr dr) i,j in the matrix-product dr dr. The pre-image sets N 1 (κ) characterize the locations of the non-zero element in dr i and dr j. The 11 omitted cases are identical to the ones shown except for opposite sign. The rightmost column shows the number of products involving non-zero coefficients in the evaluation of dr i, dr j and sums to M M = In order to get closed-form expressions for the non-zero entries of P, define the function c l (k) := (k+0.5)my l m l, which converts a coordinate index from deformation to image grid, and the remainder function r l (k) := c l (k) c l (k). Then, for the point at coordinates (î, ĵ, ˆk) with i = î + ĵm x + ˆkm x m y and d {0, 1, 2}, the interpolated deformation value can be written as (35) (P y)î+ĵmx+ˆkm xm y+d m = 1 1 α=0 β=0 γ=0 1 w α,β,γ (î, ĵ, ˆk) y, ξ(α,β,γ,î,ĵ,ˆk,d) where the indices of the neighboring deformation grid points are (36) ξ(α, β, γ, î, ĵ, ˆk, d) := and the interpolation weights are given by ( ) ( ) ( ) c x (î) + α + m x c y (ĵ) + β + m x m y c z (ˆk) + γ + d m w α,β,γ (î, ĵ, ˆk) := ŵ α x (î)ŵ β y (ĵ)ŵ γ z (ˆk), ŵ s l (k) := { 1 r l (k), if s = 0, r l (k), otherwise. As can be seen in (35), a single interpolated deformation value on the image grid is simply a weighted sum of the values of its eight neighbors on the deformation grid, see also Figure 3(a). For the NGF derivative computations in (16) and (17), the left-sided multiplication ŷ P = (P ŷ) is also required in order to evaluate the transposed operator. For each (nodal) deformation grid point, this amounts to a weighted sum of values from all image grid points in the adjacent deformation grid cells, which can be many more than eight (Figure 3(b)).

15 MATRIX-FREE DEFORMABLE IMAGE REGISTRATION 15 Algorithm 2 Pseudocode for matrix-free computation of the elements of the result vector ˆq in the NGF Hessian-vector multiplication Ĥˆp = ˆq as in (29). The algorithm can be fully parallelized over all loop iterations of î, ĵ, ˆk, computing three elements of the result per thread. 1: N [ 2m 1m 2, m 1m 2 m 1, m 1m 2 1, m 1m 2, m 1m 2 +1, m 1m 2 +m 1, 2m 1, m 1 1, m 1, 2: m 1 + 1, 2, 1, 0, 1, 2, m 1 1, m 1, m 1 + 1, 2m 1, m 1m 2 m 1, m 1m 2 1, m 1m 2, m 1m : m 1m 2 + m 1, 2m 1m 2] Initialize 25 indices in N, see Table 2 4: q 0 5: for ˆk in [0, m z 1] do 6: for ĵ in [0, m y 1] do 7: for î in [0, m x 1] do 8: i î + m x ĵ + m x m yˆk Compute linear index 9: i ± x, i ± y, i ± z as in (4) Compute neighbor indices 10: [dtx, dty, dtz] imagederivative(t i+n ) 11: Compute image derivatives at 25 points 12: for k in [0, 24] do Compute 25 values of dr dr, see (33) 13: drdr 0 14: 15: for (ĩ, j) in N 1 (N[k]) do 16: Compute 49 values of ˆρ i, see (32) and Table 2 17: drdr drdr + ˆρ ( ĩ) i+ĩ ˆρ ( j) i+n[k]+ j 18: end for 19: 20: q[i ] q[i ] + dtx[12] drdr dtx[k] ˆp i+n[k] 21: q[i + m] q[i + m] + dty[12] drdr dty[k] ˆp i+n[k]+ m 22: q[i + 2 m] q[i + 2 m] + dtz[12] drdr dtz[k] ˆp i+n[k]+2 m 23: The 12th element corresponds to the derivative at index i 24: end for 25: end for 26: end for 27: end for Output: q[i + d m] = ˆq d m+i By singling out the contribution of all image grid points within a single deformation grid cell as in Figure 3(c), the weights ŵl s (k) can be re-used and only need to be computed once for each deformation grid cell. This suggests to parallelize the computations on a per-cell basis. However, as a deformation grid point has multiple neighboring grid cells, this would lead to write conflicts. Therefore, in order to achieve both efficient re-use of weights as well as parallelizability, we propose to use a red-black scheme as shown in Figure 3(c). Each parallel unit sequentially processes a single 2D slice of deformation grid cells, first all odd (red) and finally all even slices (black) Extension to related models. By carefully examining the derivative structure of the NGF distance measure and curvature regularizer, we have derived closed-form expressions for computing individual elements of the gradients as well as Hessian matrix-vector products. The new formulations operate directly on the input data, avoid storage of intermediate results (and thus costly memory accesses) and allow for fully parallel execution.

16 16 L. KÖNIG, J. RÜHAAK, A. DERKSEN AND J. LELLMANN y (0,0) y (1,0) w 0,0 w 1,0 w 0,1 w 1,1 y (0,1) y (1,1) y (1,1) y (0,0) y (1,0) y (0,1) y (1,1) (a) (b) (c) Fig. 3. Grid conversion operations (2D example). Circles: nodal deformation grid; squares: cell-centered image grid. (a) deformation grid to image grid: each value on the image grid is computed by a weighted sum of the values of its deformation grid neighbors. (b) transposed operation: each value on the deformation grid is a weighted sum of all image grid values from neighboring deformation grid cells. (c) transposed operation using the proposed red-black scheme. Weighted values on the image grid are accumulated into all surrounding deformation grid points using multiple write accesses. Weights need to be computed only once per cell. Parallelization is performed per row (2D) or slice (3D) using a red-black scheme to avoid write conflicts. It is important to note that the proposed approach is by no means conceptually restricted to the NGF distance measure or curvature regularization. The central idea of replacing sparse matrix construction by on-the-fly coefficient calculation for derivative computation is in principle applicable to all distance measures and regularization schemes. The effectiveness, however, strongly depends on the derivative structure of the considered objective function terms. In more detail, consider an arbitrary distance measure D = ψ(r(t (P ))). We first observe that the proposed reformulation for the grid conversion operations related to P as well as the image interpolation calculations for T are independent of the choice of the distance metric. Hence, any matrix-free reformulation for these functions can be applied to every choice of distance measure D. Since the derivative of ψ is just a vector, the most critical part for a matrix-free computation is the residual function derivative r. For NGF, the non-zero elements are distributed to just seven diagonals, thus allowing to derive a closed expression for r. In the case of the well-known sumof-squared-differences (SSD) distance measure, the derivative r is even the identity and can thus be omitted from the computations [44]. More generally, matrix-free methods may be considered practically applicable as long as the sparsity pattern of r is known and fixed. Examples of suitable matrix types are band matrices, block structures and similar configurations. Unfortunately, the widely used mutual information distance measure [10, 51] does not allow for such r application of the proposed matrix-free concept: here, the sparsity pattern of depends on the image intensity distribution of the deformed template image and thus on both T and the current deformation y [36]. As the deformation changes at each iteration step, the sparsity pattern of r may do so as well, prohibiting the derivation of static matrix-free calculation rules. Again, however, the matrix-free grid change schemes and the discussed template image derivative computations can directly be reused also for mutual information. For the regularization term, we note that widely used energies such as the diffusive regularizer [13] or the linear-elastic potential [35] are essentially calculated using finite difference schemes on the deformation y. Hence, their matrix-free computation is far less complex than for the distance term with its function chain structure,

17 MATRIX-FREE DEFORMABLE IMAGE REGISTRATION 17 and similar techniques may directly be used as done for the curvature regularization in this work. Also more complex regularization schemes such as hyperelastic regularization [5], which penalizes local volume change and generates diffeomorphic transformations, may be similarly implemented in a matrix-free manner. In summary, the underlying idea of our work is widely applicable and not restricted to the exemplary use case of NGF with curvature regularization. An interesting question is what performance improvements can be expected for conceptually different methods such as LDDMM [32]; we refer to [33] for some notes on parallelizing such methods. 5. Algorithm analysis. In this section, we will perform a theoretical analysis and quantification of the expected speed-up obtained by using the proposed matrixfree approach, compared to a matrix-based approach as used in [36] Matrix element computations and memory stores. While the exact cost of sparse matrix multiplications is hard to quantify and highly depends on the implementation and chosen sparse matrix format, meaningful estimates can be made for the cost of matrix coefficient computations and memory stores. As in this section we are primarily concerned with the time consumed by memory write accesses, by memory store we refer to a single memory write operation. The amount of memory required by each scheme is discussed in Section Matrix-based approach. Data term. For the NGF gradient, analyzing the structure of ψ r, r, and P in (18), (26), and (27), an upper bound for the required number of non-zero element computations can be computed. The vector ψ r requires the computation of m residual elements, where m is the number of cells in the image grid (Section 2.2). For the matrix r, with a sevenbanded structure, at most 7 m elements need to be computed, while P contains at most 3 m non-zero matrix elements. The grid conversion, discussed in Section 4.3, involves a matrix with eight coefficients in each row, one per contributing deformation grid point (35), resulting in the computation of 8 m coefficients. In total, this gives 19 m coefficient computations for the NGF term in the matrix-based scheme. As the matrix elements only depend on the size of the image and deformation grids, they can be computed once and stored for each resolution. Regularizer. The curvature regularizer is a quadratic function and can be implemented using a matrix with 25 diagonals, resulting in 25 m y memory stores, where m y is the number of cells in the deformation grid. Hessian. For the Gauss-Newton Hessian-vector multiplication, no additional coefficient computations are needed, as the Hessian is approximated from first-order derivatives. Total number of operations. After computing all matrix elements, these have to be stored in memory for later use, resulting in 19 m + 25 m y memory store operations for one evaluation of the full gradient and Hessian approximation. This assumes that the latter will never be explicitly stored, as discussed in the following section. If multiple matrix-vector multiplications with the same gradient are required, as in the inner CG iterations, the factor matrices can be re-used and no new memory stores are necessary Matrix-free approach.

18 18 L. KÖNIG, J. RÜHAAK, A. DERKSEN AND J. LELLMANN Data term. For the matrix-free formulation of the NGF gradient in (28), excluding grid conversion, the same 11 m coefficient computations as in the matrix-based case have to be performed for one evaluation of the gradient. However, no intermediate storage of elements is needed. For the grid conversion, the coefficients are re-computed during each gradient evaluation and Hessian-vector multiplication. Furthermore, they are re-computed during the application of the transposed operator. With 8 m computations in each grid conversion, this adds up to 16 m additional coefficient computations for application of the forward and transposed operator compared to the matrix-based case. In the proposed approach, the grid conversions are evaluated separately from the remaining derivatives, instead of fully integrating them into the matrix-free computations. While the latter further lowers memory requirements, it requires recomputations of weights even within a single conversion, which in our experience resulted in slower overall computation times. Regularizer. For the curvature regularizer, the coefficients are highly redundant, resulting from the finite-differences stencil. Therefore, there is no computational cost for coefficients in the matrix-free scheme. Hessian. For the matrix-free Gauss-Newton Hessian-vector multiplication, a number of elements are re-computed. As shown in (33) and (32), coefficients ˆρ i (k) are computed 2 49 m times for one Hessian-vector multiplication. With 25 diagonals in dr dr, image derivatives from P are required for 25 points, resulting in additional 25 m coefficient computations. In total, the Hessian-vector multiplication requires 123 m additional coefficient computations compared to the matrix-based method. Note that, in the matrix-based case, we have assumed that the full Hessian is never stored but rather evaluated using its factor matrices (Figure 2). If the final Hessian is also saved, an additional 9 25 m memory stores are required, as dr dr exhibits 25 diagonals. While in both schemes the gradient is fully stored once computed and can be re-used, for the matrix-free Hessian-vector multiplication, all elements need to be recomputed for every single matrix-vector multiplication. Therefore, the exact number of additional arithmetic operations depends on other variables such as the number of CG iterations and line-search steps, and is specific to the input data. Total number of operations. In summary, for matrix-free NGF gradient evaluations, there are no additional coefficient computations compared to the matrix-based scheme and all intermediate memory stores are completely eliminated, making it highly amenable to massive parallelization. For the Gauss-Newton Hessian-vector multiplications, the matrix-free scheme adds a considerable number of re-computations, but again no stores are required. However, memory store operations are generally orders of magnitude slower than arithmetic operations. Therefore, as will be shown in the next section, in practice this trade-off between computations and memory stores results in faster overall runtimes. Moreover, it significantly reduces the amount of required memory, which is a severely limiting factor for Hessian-based methods Memory requirements. As discussed in the previous section, the memory required for the derivative matrices in a classical matrix-based scheme depends on the image grid size m. For the gradient computations, when storing ψ P, r and P, m, 7 m and 3 m values need to be saved. For the grid conversion, 8 m values need to be stored, while the curvature regularizer requires 25 m y values. In terms of complexity, this leads to memory requirements in the order of O( m).

19 MATRIX-FREE DEFORMABLE IMAGE REGISTRATION 19 In comparison, the memory required by the matrix-free schemes does not depend on the image grid size. When integrating the grid conversion steps directly into the computation, as described in the previous section, only constant auxiliary space is required. The matrix-free approach therefore reduces the required memory size for intermediate storage from O( m) to O(1). 6. Experiments and results. In order to characterize the performance at the implementation level, we performed a range of experiments comparing four approaches for derivative computation. Three approaches are implemented on the CPU: an opensource, mixed C and MATLAB implementation in the FAIR toolbox [36] ( FAIR ), an in-house C++ implementation of the matrix-based computations using sparse matrices ( MBC ) and a C++ implementation of the proposed matrix-free computations ( MFC ). Additionally, we evaluated a GPU implementation of the matrix-free computations using NVIDIA CUDA ( MFC GPU ). All CPU implementations use OpenMP for parallelization of critical components. In FAIR and MBC, these include distance measure computations, linear interpolation and sparse-matrix multiplications. FAIR provides a matrix-free mode, which uses matrix-free computations for the curvature regularizer. In contrast to our approach, all other derivative matrices are still built explicitly and require additional storage and memory accesses. This mode was used for all comparisons. Furthermore, MATLAB-MEX versions of FAIR components were used where available. In the matrix-free method MFC, full element-wise parallelization was used for function value and derivative calculations. Both the classical MBC and proposed MFC implementations have undergone thorough optimization. In addition, due to the favorable structure, the NGF gradient computations for MFC on the CPU were manually vectorized using the Intel Advanced Vector Instructions (AVX). As stated in Section 4.3, the grid conversions were computed as separate steps. Additionally, the deformed template was temporarily stored in order to reduce interpolation calls. The experiments on the CPU were performed on a dual processor 12-core Intel Xeon E workstation with 2.0 GHz and 32 GB RAM. The GPU computations were performed on a NVIDIA GeForce GTX980, see Section 6.3 for more detailed specifications. All experiments were averaged over 30 runs with identical input in order to reduce measurement noise Scalability. The proposed matrix-free derivative calculations permit a fully parallel execution with only a single reduction for the function value, leading to high arithmetic intensity, i.e., floating point operations per byte of memory transfer. Thus, excellent scalability with respect to the number of computational cores can be expected. In order to experimentally support this claim, we computed objective function derivatives using a varying number of computational cores on CPU. Two cases were considered: small images (64 3 voxels), typically occurring in multi-level computations of medical images, and larger images (512 3 voxels), representing, e.g., state-of-the-art thoracic computed tomography (CT) scans. As can be seen from the results in Table 3, for both image sizes, all NGF evaluations function value, gradient, and Hessian-vector multiplication exhibit the expected speedup factors ranging from 11.0 to 15.0 on a 12-core system. Grid conversions from deformation grid to image grid also show speedups of approximately 12. For the transposed operator, on small images, a speedup of only 4.34 is achieved: due to the small number of parallel tasks available, not all computational cores can be

20 20 L. KÖNIG, J. RÜHAAK, A. DERKSEN AND J. LELLMANN Small images Large images Method Serial (ms) Par. (ms) Speedup Serial (s) Par. (s) Speedup D D P P y P ŷ Ĥ ˆp S S Sp Table 3 Scaling of the proposed parallel implementation on a 12-core workstation compared to serial computation. Runtimes and speedups are itemized for gradient evaluations, Hessian vectormultiplications and grid conversions. All operations scale almost linearly, with the exception of the transposed grid conversion operator P ŷ on small images, where the number of available parallel tasks is too low, and the curvature regularizer, which is primarily memory-bound. Small images: 64 3 image resolution, 17 3 deformation grid size; large images: image resolution, deformation grid size. fully utilized. On the large images, the expected speedup of 12.4 is achieved. In comparison, the curvature regularizer scales much less favorably, with speedups ranging from 1.54 to As the implementation of the curvature regularizer derivatives is fully parallelized and matrix-free, this indicates that the curvature computation is limited by memory bandwidth rather than arithmetic throughput. However, this has little effect on the overall performance, as the evaluation of the curvature regularizer is multiple orders of magnitude faster than the NGF evaluation Runtime comparison. As discussed in Section 5, the matrix-free approach reduces the memory consumption and improves parallelization, however it introduces additional cost due to multiple re-calculations. Therefore, it is interesting to compare in detail the runtime of the matrix-free implementation MFC to the matrix-based implementations MBC and FAIR Derivative and Hessian evaluation. We separately evaluated runtimes for computation of the derivatives and for matrix-vector multiplication with the Gauss-Newton Hessian, including grid conversions. Image sizes ranged from 64 3 to voxels. To account for various use cases with different requirements for accuracy, we tested three different resolutions for the deformation grid: full resolution, 1/4, and 1/16 of the image resolution. As can be seen from the results in Table 4, for the gradient computation, we achieve speedups from 33.7 to 275 relative to FAIR. Furthermore, FAIR runs out of memory at larger image and deformation grid sizes. Here, the matrix-free scheme allows for computations at much higher resolutions. Relative to the matrix-based C++ implementation MBC, we achieve speedups between 2.34 and These values are particularly remarkable if one considers that MBC is already highly optimized and parallelized, and that MFC includes a large amount of redundant computations. Again, MFC allows to handle the highest resolution of 512 3, while MBC runs out of memory at voxels. Comparing the runtimes of the Hessian-vector multiplication in Table 4, the benefits of the low memory usage of MFC become even more obvious. While MBC and

21 Gradient Hessian-vector mult. MATRIX-FREE DEFORMABLE IMAGE REGISTRATION 21 Runtime m i m y i FAIR (s) MBC (s) Speedup MBC MFC MFC MFC (s) vs. vs. vs. (Proposed) FAIR FAIR MBC * * * * * * * * * * * * * * * * * * * * * Table 4 Runtimes (left) and speedup factors (right) for objective function gradient computations (top) and Hessian matrix-vector multiplications (bottom) on CPU. Three implementations were compared: a classical matrix-based implementation from the FAIR toolbox (FAIR), an optimized matrix-based C++ implementation (MBC), and the proposed matrix-free approach (MFC). For each image grid size m i, three different deformation grid sizes m y i were tested. The proposed matrix-free approach MFC achieves speedups of (derivatives) and (Hessian multiplications) compared to the optimized C++ implementation MBC (rightmost column). It can also handle high resolutions for which FAIR and MFC run out of memory (third and fourth column, marked ). FAIR cannot be used for more than half of the resolutions tested, MFC has no such issues and scales approximately linear in the number of voxels. MFC achieves particularly high speedups for large deformation grids compared to the optimized C++ implementation MBC, ranging from 17.9 to Full registration. The ultimate goal of the presented approach is to improve runtime and memory usage of the full image registration system in real-world applications. Therefore we performed a range of full multi-level image registrations on three levels for different image and deformation grid sizes with all three imple-

22 22 L. KÖNIG, J. RÜHAAK, A. DERKSEN AND J. LELLMANN Runtime (s) , FAIR MBC MFC (Proposed) ,314 1, ,553 * * * * * * L-BFGS 4,299 Gauss-Newton Fig. 4. Runtime comparison for a full multi-level registration on CPU using L-BFGS and Gauss-Newton optimization schemes for different image sizes (logarithmic scale). FAIR: matrixbased, MATLAB, using the FAIR toolbox [36]; MBC: matrix-based, C++, using sparse matrices; MFC: proposed matrix-free algorithm in C++; : computation ran out of memory. Deformation grid sizes were set to one quarter of the image size in each dimension. The proposed algorithm (MFC) enables faster computations at higher resolutions for both optimization schemes and routinely handles resolutions of up to voxels. L-BFGS Gauss-Newton MFC vs. FAIR * * * MFC vs. MBC * * * Table 5 Speedup factors of the proposed matrix-free approach on CPU (MFC is n times faster) relative to the matrix-based MBC and MATLAB-based FAIR methods for the data in Figure 4. : FAIR or MBC ran out of memory. mentations. The input data consisted of thorax-abdomen images as described in Section below, cropped to a size of voxels. In order to avoid potentially large runtime differences caused by the effect of small numerical differences on the stopping criteria, we set a fixed number of 20 iterations on each resolution level. For L-BFGS, the proposed method MFC is between 8.62 and 9.27 times faster than MBC (Figure 4 and Table 5). Computation on the highest resolution with an image size of and deformation grid size of is only possible using MFC due to memory limitations of the other methods. For Gauss-Newton, overall speedups range from 3.12 to Additionally, due to the more memory-intense Hessian computations, even a registration with an image size of and a modest deformation grid size of 65 3 could only be performed with the proposed MFC method. While for L-BFGS the overall speedups are in a similar range as the raw speedups in Table 4, they are slightly reduced for the Gauss-Newton method. This is due to the fact that in the matrix-based MBC method, the stored NGF Hessian is re-used multiple times in the CG solver, while it is re-computed for each evaluation in the matrix-free MFC approach. Solving a linear system with the CG solver using the NGF Hessian is only needed for Gauss-Newton optimization, which could additionally

23 MATRIX-FREE DEFORMABLE IMAGE REGISTRATION 23 L-BFGS Gauss-Newton MFC MBC * * * FAIR * * * Table 6 Total peak memory usage in megabytes for the full multi-level registration on CPU, corresponding to the data in Figure 4. Reported sizes include objective function derivatives, multi-level representations of the images, memory required for the optimization algorithm, CG solver, and further auxiliary space. Compared to MBC and FAIR, the proposed algorithm (MFC) enables computations at high resolutions with moderate memory use. : FAIR or MBC ran out of memory (more than 32 GB required). impose a performance bottleneck. In our evaluations, however, we found that, while the CG solver consumes 97% to 99% of the total registration runtime, the matrix-free NGF Hessian-vector multiplications again consume more than 99% of the overall CG solver runtime, which can thus fully benefit from our parallelized computations. The memory usage of the full multi-level registration for all approaches is shown in Table 6. Besides the objective function derivatives, the measured memory usage includes memory required for multi-level representations of the images, the optimization algorithm, CG solver and further auxiliary space. The matrix-free approach reduces memory requirements by one to two orders of magnitude.. As the L-BFGS method uses additional buffers for storing previously computed gradients, the Gauss-Newton method has a lower memory usage with the matrix-free approach. Overall, the proposed efficient matrix-free method enables both computation of higher resolutions and shorter runtimes GPU Implementation. The presented matrix-free algorithm is not limited to a specific platform or processor type. Due to its parallelizable formulation and low memory requirements, it is also well-suited for implementation on specialized hardware such as GPUs. In comparison to CPUs, GPUs feature a massively parallel architecture and excellent computational performance. Thus, GPUs have become increasingly popular for general purpose computing applications [47]. However, in order to utilize these benefits, specialized implementations are required, which exploit platform-specific features such as GPU topology and different memory classes. In this section, we present a GPU implementation of the matrix-free registration algorithm using NVIDIA CUDA C/C++ [40]. The CUDA framework allows for a comparatively easy implementation and direct access to hardware-related features such as different memory and caching models and has been widely adopted by the scientific community [47]. We implemented optimized versions of all parts of the registration algorithm in CUDA, allowing the full registration algorithm to run on the GPU without intermediate transfers to the host. In the following, we present details on how the implementation makes use of the specialized platform features in order to improve performance Implementation details. The CUDA programming model organizes the GPU code in kernels, which are launched from the host to execute on the GPU. The kernels are executed in parallel in different threads, which are grouped to thread blocks. Memory areas. CUDA threads have access to different memory areas: Global memory, constant memory and shared memory. All data has to be transferred from

24 L. KO NIG, J. RU HAAK, A. DERKSEN AND J. LELLMANN (a) Reference (b) Template (c) Initial overlay (d) After registration Fig. 5.

(a) reference image, (b) template image, (c) checkerboard overlay of details from both images before, and (d) after registration.

The images were obtained from the dataset used in [31]. the CPU main memory to the GPU global memory before it can be used for GPU computations.

24 24 L. KO NIG, J. RU HAAK, A. DERKSEN AND J. LELLMANN (a) Reference (b) Template (c) Initial overlay (d) After registration Fig. 5. Registration of patches from histological serial sections with different stains on GPU. (a) reference image, (b) template image, (c) checkerboard overlay of details from both images before, and (d) after registration. While discontinuities can be seen at the borders of the checker pattern before registration, smooth transitions are achieved after registration. Areas of large differences are indicated by arrows. The images were obtained from the dataset used in [31]. the CPU main memory to the GPU global memory before it can be used for GPU computations. As frequent transfers can limit computational performance, the GPU implementation transfers the initial images R, T exactly once at the beginning of the registration. All computations, including the creation of coarser images for the multi-level scheme, are then performed entirely in GPU memory. While the global memory is comparatively large (up to 16 GB on recent Tesla devices) and can be accessed from all threads, access is slow because caching is limited. Constant and shared memory in contrast are very fast. However, constant memory is read-only, while shared memory can only be accessed within the same thread block. Both are currently limited to 48 to 96 kb, depending on the architecture. We therefore store frequently used parameters, such as m, my, m, h, hy, h in constant memory to minimize access times. Shared memory (private per block) is used for reductions in L-BFGS and function value, with an efficient scheme from [20]. Image interpolation. Current GPUs include an additionally specialized memory area for textures. Texture memory stores the data optimized for localized access in 2D or 3D coordinates, and provides hardware-based interpolation and boundary condition handling. Therefore, we initially employed texture memory for the image interpolation of the deformed template image T (y ). However, we found that the hardware interpolation causes errors in the order of 10 3 in comparison to the CPU implementation, as the GPU only uses 9-bit fixed-point arithmetic for these computations. The errors resulted in erroneous derivative calculations, causing the optimization to fail. Despite an interpolation speedup of approx. 35% with texture memory, we therefore performed the interpolation in software within the kernels rather than in hardware. Grid conversion. In Section 4.3, a red-black scheme for the transposed operator P > y was proposed in order to avoid write conflicts in parallel execution. While on the CPU this results in fast calculations, on the GPU there are specialized atomics available to allow for fast simultaneous write accesses without conflicts. This enables a per-point parallel implementation without the red-black scheme, which also improves the limited speedup for small images shown in Table 3. We took care that all GPU computations return identical results to the CPU code with the exception of numerical and rounding errors, in order to allow for a meaningful comparison Evaluation. Experiments were performed on a GeForce GTX980 graphics card with 4 GB of memory. The GPU features 16 streaming multiprocessors with

A Non-Linear Image Registration Scheme for Real-Time Liver Ultrasound Tracking using Normalized Gradient Fields

A Non-Linear Image Registration Scheme for Real-Time Liver Ultrasound Tracking using Normalized Gradient Fields Lars König, Till Kipshagen and Jan Rühaak Fraunhofer MEVIS Project Group Image Registration,