Fast Algorithms for Regularized Minimum Norm Solutions to Inverse Problems

Irina F. Gorodnitsky
Cognitive Sciences Dept.
University of California, San Diego
La Jolla, CA 92093-0515
igorodni@ece.ucsd.edu

Dmitry Beransky
Elect. and Computer Engineering Dept.
University of California, San Diego
La Jolla, CA 92093-0407

This work was supported by ONR grant no. N00014-94-1-0856.

Abstract

The computational cost of solving biomedical inverse problems is extremely high. As a result, expensive high-end computational platforms are required for processing, and at times a trade-off must be made between accuracy and cost of computation. In this paper we present two fast computational algorithms for solving regularized inverse problems. The computational advantages are obtained by utilizing the extreme discrepancy between the dimensions of the solution space and of the measured data sets. The algorithms implement two common regularization procedures, Tikhonov regularization and Truncated Singular Value Decomposition (TSVD). The algorithms do not compromise the numerical accuracy of the solutions. Comparisons of the costs of the conventional and proposed algorithms are given. Although the algorithms are presented in the context of biomedical inverse problems, they are applicable to any inverse problem with similar characteristics, such as geophysical inverse problems and non-destructive evaluation.

Solving biomedical inverse problems requires operating on large matrices. The computational cost of these problems is extremely high. This increases the cost of the technology, as expensive high-end computational platforms are required, and the quality of the solutions may also be compromised at times for the sake of manageable computational costs. We measure the computational cost in terms of the time it takes to compute a solution. There are a number of other ways in which computational cost may be measured; in numerical analysis the number of floating point operations (flops) is commonly used. The processing time measure that we use here, however, is the most accurate indicator of the resources used in a computation and is the most relevant to the end user.

We report on a study of the cost of inverse solutions for ill-conditioned problems and present two fast algorithms that improve the computational cost compared to the existing methods. The work here is developed in the context of biomedical inverse problems, but other physical inverse problems, most notably in geophysics and non-destructive evaluation, share the same mathematical characteristics, and the results of this paper apply directly to those problems. The two characteristic properties of physical inverse problems are 1) their severely ill-posed nature, with a great discrepancy between the dimensions of the solution space and the data (a factor of 10^4 or more), and 2) severe ill-conditioning of the forward model. Solutions to ill-posed problems are non-unique, and we focus our attention on the computation of the most commonly used solutions, the minimum 2-norm based solutions. Ill-conditioned inverse problems require regularization to prevent the solutions from being excessively sensitive to noise in the data. While efficient algorithms exist for computing inverses, the role of regularization in increasing the cost of the computations has not been well considered. The two regularization techniques that are most widely employed are Tikhonov regularization and Truncated Singular Value Decomposition (TSVD).

In its standard implementation, Tikhonov regularization requires over an order of magnitude more operations than the computation of the unregularized solution. Another factor that is commonly not considered in numerical analysis is the cost of memory access. This cost becomes significant when data sets are large, as the cost of accessing a number from disk exceeds the cost of a floating point operation by a factor of roughly 10^5. Memory access requirements depend on the size of the processor memory and can generate significant cost overhead on mid-range platforms. Here we develop efficient algorithms for the two common regularization techniques. For Tikhonov regularization we derive a new algorithm, termed the Efficient Tikhonov Regularization (ETR) algorithm, which reduces the number of floating point operations by approximately an order of magnitude. For TSVD, we propose modifications of the existing method that significantly reduce the memory access requirements. In both cases the new algorithms provide a significant reduction in the cost of regularized inverse computations. While standard implementations of Tikhonov regularization are more costly than TSVD, the new ETR algorithm is shown to be more efficient than TSVD in terms of flop count.

1 Background

1.1 The inverse problem

Linear inverse problems can be stated mathematically as the estimation of a signal p from its noisy linear transformation

  b = A p + n,                                              (1)

where A is an m x n linear transformation matrix acting on p, and the vector b represents a set of measurements consisting of the exact values of the transform A p plus the additive noise n. In most biomedical inverse problems A is severely underdetermined, i.e., m << n, with m on the order of 10^2 and n on the order of 10^5 to 10^7. We will refer to such matrices as very wide matrices. The large computational cost arises from operating on such matrices, and those are the problems we consider here. Examples of such problems include bio-electromagnetic imaging of the heart (MCG) and of brain function (EEG and MEG).

When m < n, (1) has no unique solution. We focus our attention here on finding minimum 2-norm (or seminorm) based solutions to (1), as these are by far the most widely used. These solutions are found by satisfying the constraint

  min_p ||W (p - x_0)||   subject to   A p = b.             (2)

The vector x_0 is a reference model that is used in some cases, but more often it is set to zero. W is often the identity matrix; it can also represent a derivative operator, such as a Laplacian, in which case W is either a square banded matrix or a general matrix of weights which bias the solution [3]. In physical tomography W is typically square.

To allow a uniform treatment of the constraints represented by (2), it is convenient to transform the above problem into the standard form. For square W this transformation is straightforward: Ā = A W^†, x = W (p - x_0), and b is replaced by b - A x_0, where ^† indicates the Moore-Penrose pseudo-inverse. We use W^† in place of a regular inverse to accommodate a possibly rank deficient W. The problem in standard form then is

  min ||x||   subject to   Ā x = b,                         (3)

and p is recovered as p = W^† x + x_0. (All norms ||.|| indicate 2-norms unless stated otherwise.) The exact solution to (3) is

  x = Ā^† b,                                                (4)

where Ā^† = Ā^T (Ā Ā^T)^{-1}. Unfortunately, this is not an acceptable solution when Ā is ill-conditioned, that is, when the condition number of Ā is large. Ill-conditioned Ā are to be expected in biomedical inverse problems. In this case, even a small amount of noise in the data can produce arbitrarily large variations in the solution x, rendering these solutions useless. To avoid this problem, regularization must be used, in which the original Ā is replaced by a slightly different, well-conditioned matrix Ā_reg. Regularization aims to provide some kind of optimal trade-off between the error in x due to the change in Ā and the error due to noise. In the next section we discuss regularization techniques and their costs, but before we start we review the main factors that contribute to the cost of numerical computations.

1.2 Workload and cost of computation

The basic units of computer instructions are integer operations. In numerical analysis it is standard to evaluate the complexity and the cost of algorithms in terms of the number of flops. This cost is assumed to be representative of the computational time of the algorithm. Although flops do not reveal the number of integer operations it takes to complete a given instruction, the flop count is nevertheless the accepted, and perhaps the best readily obtainable, analytic measure of algorithmic complexity.
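
For illustration, the following sketch computes the unregularized minimum 2-norm solution (4) of a very wide system with NumPy/SciPy. The dimensions and random data are placeholders, and the code is not the implementation used in this paper; it only fixes the scale of the baseline computation against which the regularization costs are compared. The flop count is dominated by forming the m x m Gram matrix Ā Ā^T, on the order of n m^2 operations.

```python
# Minimal sketch (illustrative only): unregularized minimum 2-norm solution (4),
#   x = A^T (A A^T)^{-1} b,
# computed from the small m x m Gram matrix rather than an explicit pseudo-inverse.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
m, n = 100, 20_000                 # m << n, a "very wide" matrix
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

G = A @ A.T                        # m x m Gram matrix, the dominant ~n*m^2 cost
w = cho_solve(cho_factor(G), b)    # solve (A A^T) w = b
x = A.T @ w                        # x = A^T w, the minimum 2-norm solution

print(np.linalg.norm(A @ x - b))   # ~0 (round-off level): x satisfies A x = b
```
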
Flop count, however, becomes a poor indicator of the execution time when computations involve large data sets. Other factors, most importantly disk memory access, can affect the execution time of an algorithm far more than an increase in the number of flops, and thus must be taken into account. Because memory access is not often considered in numerical algorithm analysis and is rarely described in that literature, we provide a brief overview of this process.

The variables used in a program are stored in cache and working memory and, when these two storage devices are full, on disk. Access to cache is the fastest, but its hardware is expensive, so its size is very small. Access to working memory is also fast; the size of this memory is indicated by the RAM of a given system. The portion of the data which does not fit in cache and working memory is put on the disk. The retrieval of data from the disk is very expensive. When a required number is not found in the working memory, program execution is suspended and the kernel starts an I/O operation from the disk. The retrieval request is registered and put into a queue while the disk I/O processor is executing other instructions. Once the retrieval from the disk is in process, a chunk of storage called a page, which contains the requested number, is retrieved and substituted for some other page in the working memory. Only after the processor accesses the desired number does the execution resume. The whole process is called a swap and involves 10^3 to 10^6 integer operations, compared to about 10^2 for a flop. The speed of a swap is measured in milliseconds, while the speed of a flop is measured in nanoseconds.

Vectors or matrices that exceed a certain dimension cannot be stored entirely in the working memory and must be split between this memory and the disk. Operations on such vectors involve a large number of swaps, which creates a bottleneck in the speed of processing. Programs can be designed to minimize the number of swaps by maximizing access of adjacent array entries in the algorithm. Nevertheless, swaps cannot be avoided in certain matrix operations. One of the more expensive operations in terms of swaps is the matrix transpose, because the source and the destination of an entry in the transposed matrix are likely to reside in different storage areas.

In our development of fast algorithms we consider two factors, the memory access requirements and the number of flops. We control memory access requirements by minimizing the number of very wide matrix transposes in the algorithms. In simulations, the number of matrix transposes correlates well with the total disk access requirements of the regularization algorithms.

2 Regularization

Solutions can be regularized in a number of ways, but only two techniques are predominantly used. We describe these methods next.

Tikhonov regularization [2] is the most common method. The objective min ||x|| subject to Ā x = b is modified to include a data misfit term, leading to the regularized problem

  min_x { ||Ā x - b||^2 + λ^2 ||x||^2 }.                    (5)

λ is the regularization parameter, and before (5) can be solved the value of λ must be chosen. The solution depends critically on this choice, and only a small range of values produces a good approximation to the true x of the noiseless case, if such an approximation can be found at all. Finding the optimal λ is not simple. The existing methods for this can be subdivided into two groups. One group of methods assumes that n is known or can be estimated. This assumption often cannot be fulfilled, and the approach has other pitfalls that can be deduced from the analysis in [4]. We therefore consider only the second approach to finding λ, in which no knowledge of n is assumed. All the methods in this group can be equated to finding the corner of an L-curve, which is the plot of the solution norm ||x_λ|| versus the residual ||Ā x_λ - b|| for various values of λ. The optimal value of λ occurs at the sharp L-shaped corner of this plot. To find this corner, (5) must be solved repeatedly, typically more than 10 times, for different values of λ. Thus a direct implementation of Tikhonov regularization increases the cost of finding a solution by over an order of magnitude.

The solution to (5) can be found by direct differentiation, which leads to the normal equations

  (Ā^T Ā + λ^2 I) x = Ā^T b.                                (6)

These equations are solved by first performing the Cholesky factorization R^T R = Ā^T Ā + λ^2 I and then solving the triangular systems R^T y = Ā^T b and R x = y. The problem with this approach is that the accuracy of the Cholesky factorization is governed by the condition number of Ā^T Ā, which is the square of the condition number of Ā. When Ā is ill-conditioned, this can lead to a significant loss of numerical stability in the solution [4]. Thus the use of the normal equations to solve (5) is not optimal.
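
To illustrate the L-curve selection procedure just described, the short sketch below solves (5) on a grid of λ values and tabulates the residual norm against the solution norm; the corner of this curve marks the λ to use. The data are synthetic, and the per-λ solve uses the normal equations (6) directly for brevity, even though, as discussed next, a factorization-based solve is preferable when Ā is ill-conditioned.

```python
# Sketch of the L-curve procedure (illustrative only): equation (5) is solved
# once per value of lambda, which is what makes a direct implementation of
# Tikhonov regularization an order of magnitude more expensive than a single
# unregularized solve.
import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 1000
A = rng.standard_normal((m, n))
x_true = np.zeros(n); x_true[:5] = 1.0
b = A @ x_true + 0.01 * rng.standard_normal(m)

AtA, Atb = A.T @ A, A.T @ b
for lam in np.logspace(-4, 2, 13):
    x_lam = np.linalg.solve(AtA + lam**2 * np.eye(n), Atb)  # normal equations (6)
    res = np.linalg.norm(A @ x_lam - b)                     # data misfit
    sol = np.linalg.norm(x_lam)                             # solution norm
    print(f"lambda={lam:8.1e}   ||Ax-b||={res:10.3e}   ||x||={sol:10.3e}")
# Plotting log ||x|| against log ||Ax-b|| gives the L-curve; the sharp corner
# of that plot identifies the lambda used for the final solution.
```
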
Instead, the augmented matrix can be factored directly. The QR factorization of the (n+m) x m matrix formed by stacking Ā^T above λI,

  [Ā^T; λI] = Q R,                                          (7)

is most frequently used. Then x = Q_1 R^{-T} b, where Q_1 is the top n x m section of Q and R is m x m upper triangular. x can also be found simply in two steps, by solving the triangular system R^T y = b and forming x = Q_1 y. A slightly more efficient method than QR, but one that is rarely used, is bidiagonalization of the augmented matrix by means of left and right orthogonal transformations; x in this case is obtained by solving the resulting sparse system. Note that these factorizations require a transpose of the wide matrix Ā.

Two algorithms are commonly used to compute the QR factorization: the Householder algorithm and the Modified Gram-Schmidt (MGS) algorithm. Householder QR requires about twice the number of flops of MGS, but it is more stable numerically. Because the matrix [Ā^T; λI] is full rank and well conditioned for an appropriate λ, no column pivoting is needed in the factorization, and the cheaper MGS algorithm is also acceptable here. Note, however, that MGS will not produce an orthogonal Q if [Ā^T; λI] is poorly conditioned. The cost of regularization via both factorization methods is shown in Table I.
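
A minimal sketch of the augmented-matrix QR solve (7) for a single value of λ is given below (SciPy, toy dimensions and random data). The comparison against the normal-equations solution at the end is only a numerical check at this small scale, not part of the method.

```python
# Sketch of the QR-based Tikhonov solve (7), illustrative only: factor the
# (n+m) x m augmented matrix [A^T; lambda*I] and read off the solution as
# x = Q1 R^{-T} b, where Q1 is the top n x m block of Q.
import numpy as np
from scipy.linalg import qr, solve_triangular

rng = np.random.default_rng(2)
m, n, lam = 50, 1000, 1e-2
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

augmented = np.vstack([A.T, lam * np.eye(m)])   # (n+m) x m, requires A^T
Q, R = qr(augmented, mode="economic")           # Q: (n+m) x m, R: m x m
y = solve_triangular(R, b, trans="T")           # solve R^T y = b
x = Q[:n] @ y                                   # x = Q1 y

x_ref = np.linalg.solve(A.T @ A + lam**2 * np.eye(n), A.T @ b)  # equation (6)
print(np.linalg.norm(x - x_ref) / np.linalg.norm(x_ref))        # ~ round-off
```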

Truncated Singular Value Decomposition (TSVD) is the second method used for regularization. Here the ill-conditioned Ā is replaced by a well-conditioned rank-k approximation Ā_k, which is closest to Ā in the 2-norm sense. This approximation is given by the SVD expansion of Ā truncated to its first k components,

  Ā_k = Σ_{i=1..k} σ_i u_i v_i^T = U_k Σ_k V_k^T,           (8)

where σ_1 >= σ_2 >= ... >= σ_k are the leading singular values and u_i and v_i are the corresponding left and right singular vectors. The m x k matrix U_k and the k x n matrix V_k^T are composed of the first k left and right singular vectors; the subscript k indicates the number of columns and rows, respectively, in these matrices. The k-th order TSVD solution is then given by

  x_k = V_k Σ_k^{-1} U_k^T b.                               (9)

The regularization parameter in this method is k, which determines the number of singular subspaces of Ā used in computing x_k. The various methods for selecting k can all be ultimately interpreted as satisfying the L-curve criterion, as in the case of Tikhonov regularization. In the case of TSVD, however, the different k-th order solutions are found from a single SVD. Hence TSVD regularization does not significantly add to the cost of the unregularized solution to (1).

By far the most efficient algorithm for computing the SVD of strongly rectangular matrices is R-SVD [1]. The algorithm first performs the QR factorization

  Ā^T = Q R,                                                (10)

where R is m x m upper triangular. R is then bidiagonalized,

  B = U_B^T R V_B,                                          (11)

where U_B and V_B are orthogonal matrices and B is upper bidiagonal. Defining U_Q = Q U_B, the equivalent bidiagonalization of Ā is

  B^T = V_B^T Ā U_Q.                                        (12)

The SVD is then computed from the bidiagonal B using the Golub-Kahan algorithm [1]. With U Σ V^T denoting the resulting SVD of R, the k-th order solution is

  x_k = Q U_k Σ_k^{-1} V_k^T b.                             (13)

When n >> m, the QR factorization must be applied to Ā^T to preserve the small computational cost of R-SVD, and the solution is computed as the transpose of (13),

  x_k^T = b^T V_k Σ_k^{-1} U_k^T Q^T.                       (14)

Q need not be formed explicitly; it is only applied to the intermediate vector as it develops. A TSVD solution via the R-SVD factorization requires on the order of n m^2 flops (Table I) and one wide matrix transpose. The majority of the cost of the SVD when n >> m is contained in the QR factorization of Ā^T and in forming the transpose of Ā. We can see that, although the SVD is more expensive than a QR factorization, TSVD is a cheaper method of regularization than Tikhonov overall, because only a single factorization of Ā is required.
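
The following sketch illustrates the TSVD solution (9) on synthetic data. NumPy's general-purpose SVD routine stands in for the R-SVD organization described above; the point is only how the truncation index k enters the solution.

```python
# Sketch of the k-th order TSVD solution (9), illustrative only:
#   x_k = V_k Sigma_k^{-1} U_k^T b,
# with the SVD truncated to the k leading singular triplets.
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 50, 2000, 20
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # U: m x m, s: (m,), Vt: m x n
x_k = Vt[:k].T @ ((U[:, :k].T @ b) / s[:k])        # equation (9)

# k is the regularization parameter: increasing k reduces the residual but,
# for ill-conditioned A, lets the small singular values amplify the noise.
print(np.linalg.norm(A @ x_k - b), np.linalg.norm(x_k))
```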

3 Efficient regularization algorithms

Here we describe two efficient regularization algorithms, for Tikhonov regularization and for TSVD, for very wide matrices. Both algorithms are based on the LQ factorization, which we describe next.

3.1 LQ factorization

The LQ factorization provides an orthogonalization of m x n matrices (m <= n) and is defined as

  Ā = L Q,                                                  (15)

where L is an m x m lower triangular matrix and Q is the top m x n section of an orthogonal matrix. LQ is equivalent to the QR factorization of Ā^T, i.e.,

  Ā^T = Q^T L^T.                                            (16)

Householder reflections and Givens rotations can be used to compute the LQ factorization analogously to the QR factorization, so the algorithms for computing LQ have the same numerical and workload properties as the corresponding QR factorization algorithms. The advantage of LQ is that it can be applied directly to Ā when m << n, avoiding the expensive transpose of the matrix.

3.2 Efficient Tikhonov Regularization (ETR)

Here we describe a novel fast algorithm for Tikhonov regularization. Since a regularized solution seeks only to fit the data to within some residual tolerance, we can rewrite the regularized problem (5) as

  min_{x,r} { ||r||^2 / λ^2 + ||x||^2 }   subject to   Ā x + r = b,   (17)

where r is the residual r = b - Ā x. Letting s = [r/λ; x], the regularized problem can be written

  min ||s||   subject to   [λI  Ā] s = b.                   (18)

The standard approach to this problem is to solve it for a range of λ by performing repeated QR factorizations of the augmented matrix, as described above. Instead, in ETR we perform a single LQ factorization of Ā,

  Ā = L Q.                                                  (19)

Then the LQ factorization of the small m x 2m system

  [λI  L] t = b                                             (20)

is computed repeatedly for the different values of λ. The solution for each λ is given by

  x_λ = Q^T z,                                              (21)

where t = [(r/λ)^T  z^T]^T is the minimum norm solution of (20). Note that we only need to find z, not x_λ, for each value of λ: z is the solution of the small m x 2m system and is cheap to compute, and since ||x_λ|| = ||z|| and ||r|| = λ ||r/λ||, it suffices to find the corner of the L-curve and the optimal λ. The final, full-size solution need only be computed for the optimal λ. The total cost of the factorizations in ETR is shown in Table I. The LQ factorization of Ā dominates the flop count of the algorithm when n >> 2m. As we can see, ETR produces over an order of magnitude reduction in flop count compared to the standard implementations of Tikhonov regularization.

3.3 Economy TSVD

Although taking a transpose of a very large matrix cannot be avoided entirely in TSVD, significant savings can be achieved by using the LQ factorization of Ā instead of the QR factorization of Ā^T. We call the SVD computed via the LQ factorization the L-SVD algorithm. With this factorization, and with U Σ V^T denoting the SVD of L, the solution is

  x_k = Q^T V_k Σ_k^{-1} U_k^T b.                           (22)

The matrix V_k^T Q is k x n, where typically k <= m << n. The saving in taking a transpose of this matrix, rather than a transpose of Ā, is that only k rows instead of m need to be moved. In simulations on platforms with 32 MB of RAM and matrices Ā with n/m on the order of 10^4, we observed an order of magnitude cost saving using the L-SVD algorithm compared to R-SVD.

3.4 Comparison of the two novel factorizations

The cost of the factorizations for the two novel regularization methods is shown in Table I. We can see that even for moderate ratios n/m the ETR algorithm is slightly cheaper than TSVD in terms of flop count, and for larger ratios of n/m the flop count for ETR is significantly less than for TSVD. The cost of a wide matrix transpose, on the other hand, is hard to evaluate, because so much depends on the details of the implementation and the computing platform. Thus we cannot make a general statement about the relative cost of the two algorithms, but we can observe a trade-off between the increases in RAM and in the ratio n/m on one hand and the cost advantage of TSVD on the other. It is up to the user to determine where the break point in this trade-off occurs for his or her computational platform.

3.5 Mixed compiler implementation

Taking transposes of large matrices can also be avoided entirely by mixing Fortran and C subroutines in one program. Because Fortran uses column-major storage, i.e., it stores matrices column by column, while C is row-major, a matrix stored by one compiler is naturally interpreted as its transpose by the other. Thus invoking C subroutines from Fortran, and vice versa, for operations requiring a matrix transpose avoids taking the transpose explicitly. In this case, ETR becomes the method of choice for regularization.
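
To make the ETR procedure of Section 3.2 concrete, here is an illustrative sketch with toy sizes and random data. SciPy exposes no LQ routine, so the LQ factorization (19) is emulated by a QR factorization of Ā^T; an actual implementation would apply Householder reflections to the rows of Ā precisely to avoid that transpose. The small systems (20) are solved with a generic minimum-norm solver here rather than the LQ-based solve described above.

```python
# Sketch of ETR (illustrative only): one large LQ factorization of A, then a
# cheap m x 2m minimum-norm problem per lambda; the full-size solution (21)
# is formed once, for the lambda chosen at the L-curve corner.
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(4)
m, n = 50, 2000
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

Qt, Lt = qr(A.T, mode="economic")        # emulated LQ: A = L Q with L = Lt^T
L, Q = Lt.T, Qt.T                        # L: m x m lower triangular, Q: m x n

lambdas = np.logspace(-4, 2, 13)
sol_norms, res_norms, zs = [], [], []
for lam in lambdas:
    small = np.hstack([lam * np.eye(m), L])         # the m x 2m system (20)
    t, *_ = np.linalg.lstsq(small, b, rcond=None)   # min ||t|| s.t. [lam*I L] t = b
    r_over_lam, z = t[:m], t[m:]
    res_norms.append(lam * np.linalg.norm(r_over_lam))  # ||b - A x_lambda||
    sol_norms.append(np.linalg.norm(z))                 # ||x_lambda|| = ||z||
    zs.append(z)

k_best = 6                               # placeholder: index picked at the L-curve corner
x = Q.T @ zs[k_best]                     # equation (21): x = Q^T z, computed once
```
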
Table I

Regularization method       Factorization            Matrix size                  Cost (flops)      Wide matrix transposes
Tikhonov regularization     Householder QR           [Ā^T; λI], (n+m) x m         4 2 8 3 3         complete (m x n)
Tikhonov regularization     MGS QR                   [Ā^T; λI], (n+m) x m         2 2 3             complete (m x n)
Efficient Tikhonov (ETR)    Householder LQ and MGS   Ā (m x n) and [λI  L]        2 2 2 3 3; 2 4 3  complete (m x n)
TSVD (R-SVD)                QR                       Ā^T (n x m)                  6 2 3             complete (m x n)
TSVD (L-SVD)                LQ                       Ā (m x n)                    6 2 3             partial (k x n)

MGS: Modified Gram-Schmidt.

References

[1] T. F. Chan. An improved algorithm for computing the singular value decomposition. ACM Trans. Math. Softw., 8(1):72-83, 1982.

[2] M. Foster. An application of the Wiener-Kolmogorov smoothing theory to matrix inversion. J. SIAM, 9(3):387-392, 1961.

[3] I. F. Gorodnitsky, J. S. George, H. A. Schlitt, and P. S. Lewis. A weighted iterative algorithm for neuromagnetic imaging. In Proc. IEEE Satellite Symposium on Neuroscience and Technology, Lyon, France, pp. 60-64, Nov. 1992.

[4] I. F. Gorodnitsky and B. D. Rao. Analysis of regularization error in Tikhonov regularization and truncated singular value decomposition methods. In Proc. 28th Asilomar Conference on Signals, Systems and Computers, vol. 1, Oct.-Nov. 1994.