Operator-Based Backward Motion Estimation. Aria Nosratinia and Michael T. Orchard. Beckman Institute for Advanced Science and Technology.


Operator-Based Backward Motion Estimation

Aria Nosratinia and Michael T. Orchard
Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, 405 N. Mathews Ave., Urbana, IL

Abstract

Backward motion estimation consists of the class of motion algorithms that do not transmit motion information directly, but rather compute it at the decoder from already transmitted data. Traditional backward algorithms (also known as pel-recursion) rely on continuous image models, and apply the resulting parameters to sampled data. Within the framework of this approach, it is not feasible to implement a least squared error criterion, or to account for and analyze the effects of various segments of the algorithm, e.g. interpolation. This paper investigates backward motion estimation directly in a sampled environment, and proposes an operator-based method for motion estimation and compensation. Our re-examination of backward motion compensation indicates that minimum mean squared estimation of pixel intensities in a recursive coder naturally decomposes into two distinct operations: a search on the image lattice and a least squares optimization. We offer algorithmic solutions to both the least squares and search problems for a variety of cases. We propose and study two interpolation operators. We also investigate families of non-interpolative operators, where computational considerations motivate a recursive least squares (RLS) algorithm. Operator-based methods not only outperform traditional pel-recursion, as indicated by experimental results; they also offer a much more flexible framework to trade complexity vs. performance, or to design hybrid forward/backward algorithms that use partial forward motion information to combat tracking difficulties.

Permission to publish this abstract separately is granted.
Corresponding address: Aria Nosratinia, Department of Electrical Engineering, Princeton University, Princeton, NJ. Tel: (609) Fax: (609) aria@ee.princeton.edu

Submitted to IEEE Trans. Circuits Syst. Video Tech. © IEEE 1995

1 Introduction

Compact coding of image sequences (video) necessitates an efficient representation of inter-frame dependencies. Intuition and empirical results have long shown that in "natural" video sequences, this dependency is well captured through inter-frame motion. In fact, given the liberty of specifying a motion vector for each image pixel, it is almost always possible to predict the intensities in the current frame exactly. However, the bitrate required for such an unconstrained specification can well exceed the bitrate required to transmit the original frame.

To address this issue, motion estimation and compensation algorithms have branched into two major categories. One group of motion estimation algorithms associates a relatively small number of motion vectors with the image pixels of each frame, computes the motion vectors at the encoder, and transmits them to the decoder. This class is known as forward motion estimation/compensation, and includes such methods as block matching, overlapped-block, and warping motion estimators. Another class of motion estimators assigns a separate motion vector to each pixel but, not being able to transmit all these vectors, computes them at the decoder using the already transmitted data. This class is known as backward motion estimation/compensation.

The traditional formulation of backward motion estimation (pel-recursion) is based on modeling two successive frames of a video sequence as samples of two 2-D continuous functions, related by a continuous motion field. Conventional pel-recursion estimates motion by a local analysis of the slope of the signal [1, 2, 3]. The use of such models has been partially motivated and influenced by parallel studies in optical flow [4, 5], whose final aim is finding flow fields rather than coding. These models are based on intuitions of how the underlying continuous images are related in successive frames, and then directly extend that intuition to the sampled frames.
Although based on reasonable assumptions, these traditional models are not necessarily the best choice for estimation and coding purposes. This paper explores backward motion estimation directly in a sampled environment and, based on the resulting insights, offers an operator-based framework for backward motion estimation and compensation. In addition to outperforming traditional pel-recursion (as indicated by experimental results), the operator-based framework offers the designer a great degree of flexibility to trade computation for performance, or to address tracking issues.

Traditional methods are based on a truncated Taylor series approximation to the continuous image signal,

$$I_k(s) = I_{k-1}(s) + v^t \cdot \nabla I_{k-1}(s) + O(\|v\|^2) \qquad (1)$$

where $I_k(s)$ is the intensity of frame $k$ at pixel $s$, $v$ is the motion vector, and $\nabla$ is the differentiation operator. These methods aim to find the coefficient of the first derivative of the signal, which is the flow or motion. They then use the motion (computed with a continuous model), along with an arbitrarily chosen interpolator (typically the bilinear interpolator), to estimate the intensity of a given pixel in the sampled domain.

The traditional model is based on relating two continuous-valued frames, and then relating each of the sampled frames to its continuous counterpart. With this approach, it is difficult to incorporate minimization of error energy into motion estimation algorithms. In fact, the model in (1) in no explicit way reflects minimization of estimation error energy. It is not difficult to see that an estimator designed with a least squares criterion has a much better chance of reducing the estimation error energy. Indeed, operator-based estimators (to be introduced shortly) outperform traditional pel-recursion.

Also, while the conceptual simplicity of assuming a direct continuous-discrete extension has allowed designers to channel their efforts into other directions, and to generate fairly complex pel-recursive algorithms [6], the continuous-discrete transition is not without difficulties. Video signals are often sampled at sub-Nyquist rates and are noisy. Thus, an accurate estimation of the gradient of the underlying continuous image is very difficult. Moreover, on an operational level, compared to a model working entirely in the sampled domain, the traditional formulation offers little flexibility in controlling various properties of the algorithm. For example, little insight is gained into how the computational complexity of the algorithm might be reduced, or how to resolve various tracking difficulties experienced by standard pel-recursive algorithms.

Finally, the relationship between the underlying continuous images (which are unavailable) and the sampled frames is modeled through interpolation.
While a detailed discussion of the criteria involved in interpolation is outside the scope of this paper, motion estimates do depend on the choice of interpolator. Estimated motion vectors are mostly not integer-valued, thus the intensity estimates depend on the interpolation strategy. One would like to compute motions such that, once applied with the interpolator of our choice, the best intensity estimate is obtained. In traditional pel-recursion, it is difficult to incorporate our knowledge of the interpolator in the motion model, therefore its influence is almost always ignored.

This paper presents a discrete formulation of pel-recursive motion estimation. Our approach is to view successive frames of a sequence purely as two arrays of data. This view enables us to find the best (in the MSE sense) motion compensated estimators within specific classes, dictated solely by the data, and not by any a priori assumption on local relationships of continuous motion and intensity fields. We will see that these estimators automatically take into account both the discrete nature of the data and the structure of the interpolator. In this approach, motion estimation assigns to each pixel in the new frame an operator acting on some group of pixels from the past frame. Through this new formulation, we endeavor to expose more clearly the principles and limitations of backward motion estimation, which will in turn allow more flexibility in tailoring pel-recursive techniques to various applications.

In Section 2, we present a view of backward motion estimation in a sampled environment, and show that the estimation problem in such an environment naturally breaks down into two components: a search problem and a least mean squares (LMS) optimization. In Section 2.2, we expand on the LMS optimization of linear inter-frame operators, thus laying the groundwork for Sections 3 and 4. Section 3 defines two interpolative operators with a view towards computational efficiency, and presents solutions of the LMS optimization for them. Section 4 defines general, non-interpolative operators and the corresponding LMS problem, where computational considerations motivate an algorithm incorporating recursive least squares (RLS). Section 5 investigates the other aspect of backward motion estimation, namely the search strategies. Section 6 presents experimental results, and Section 7 closes with a concluding discussion.

2 Operator Framework

2.1 Definitions and Formulation

Let each frame of the image sequence be defined on a rectangular lattice $S$ of pixels, whose members $s \in S$ can be explicitly shown as $s = (i, j)^t$, with $i$ and $j$ denoting row and column indices, respectively. $I_k(s)$ denotes the intensity of pixel $s$ of frame $k$ of the sequence to be coded, and $\tilde I_k(s)$ denotes the pixel intensity of the corresponding decoded frame. Let $v_k(s)$ represent a motion vector relating object motion from frame $k-1$ to frame $k$.
When discussing motion estimation for a single pixel (as during the description of one iteration of the pel-recursive algorithm), we will omit extra superscripts and arguments, and refer to a generic $v = (v_i, v_j)$. The estimated motion is applied during motion compensation to predict pixel intensities. Denote the prediction of $I_k(s)$ by $\hat I_k(s)$. The estimation error is typically quantized before being transmitted to the decoder. In order to preserve synchronism between the coder and decoder, both use the reconstructed past frame, denoted $\tilde I_{k-1}$, as the anchor frame for predicting the current frame.

The formulation of the operator-based paradigm is motivated by the way motion compensation uses the motion vector $v$. Traditional backward motion compensation is based on a continuous model for the image frames. If image frames were indeed continuous, motion compensation would predict $I_k(s)$ as $\hat I_k(s) = \tilde I_{k-1}(s - v)$. However, the frames are not defined continuously, but only over the discrete lattice $S$; thus $\tilde I_{k-1}(s - v)$ may not be (and most often is not) defined. The actual implementation of traditional motion compensation is performed in two stages. First, the integer part of the motion vector, namely $\lfloor v \rfloor = (\lfloor v_i \rfloor, \lfloor v_j \rfloor)$, is used to select a set of pixels in the past frame (as shown in Figure 1), which we denote:

$$a = s - \lfloor v \rfloor - (1, 1), \quad b = s - \lfloor v \rfloor - (1, 0), \quad c = s - \lfloor v \rfloor - (0, 1), \quad d = s - \lfloor v \rfloor - (0, 0). \qquad (2)$$

Figure 1: Determination of the domain of operations based on $\lfloor v \rfloor$, the integer part of the motion vector.

Then, the value $I_k(s)$ is predicted by some operator, $\phi_{v - \lfloor v \rfloor}(\cdot)$. The subscript of the operator indicates that it is parameterized by the fractional displacement of the two frames, and hence is an interpolation operator (the traditional formulation is limited to interpolation operators). Typically this interpolator has a support of four:

$$\hat I_k(s) = \phi_{v - \lfloor v \rfloor}\left( \tilde I_{k-1}(a), \tilde I_{k-1}(b), \tilde I_{k-1}(c), \tilde I_{k-1}(d) \right). \qquad (3)$$

More specifically, standard approaches typically use a bilinear interpolator, defined by:

$$\phi_{(\alpha, \beta)}(I_1, I_2, I_3, I_4) = (1 - \alpha)\left[ (1 - \beta) I_1 + \beta I_2 \right] + \alpha \left[ (1 - \beta) I_3 + \beta I_4 \right] \qquad (4)$$

where $(\alpha, \beta)$ are the fractional positions of the interpolated value in a grid square (see Figure 1), and $\{I_1, I_2, I_3, I_4\}$ are the four luminance values at the four corners of the said grid square.

We consider the objective of motion estimation to be to provide the parameters for motion compensation (i.e. the integer and fractional parts mentioned above) such that the expected motion compensated prediction error is minimized.

Figure 2: Example: discrete formulation of pel-recursive motion estimation. (Panels: previous frame and current frame with its causal window; example motion $(v_i, v_j) = (-0.4, -1.2)$.)

Since motion compensation computes the estimated luminance $\hat I_k(s)$ in two stages, motion estimation itself should be described as having two distinct goals:

1. Motion estimation should direct the motion compensation process to a group of pixels in the previous frame from which $I_k(s)$ can be predicted well. This essentially is a search problem, and the strategy used for this search determines the tracking characteristics of the backward motion estimation algorithm.

2. After selecting a group of pixels (i.e. an integer position) in the past frame, the second goal of the motion estimation process is to select an operator, from within an allowable class of operators, which predicts $I_k(s)$ best when it is applied to the said group of pixels. This goal defines an optimization problem, whose characteristics are determined by the allowable class of operators over which we optimize the predictor of $I_k(s)$.

We now proceed to formalize the operator optimization problem, and investigate a variety of solutions to it. We will return to consider the search problem in Section 5.
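The two-stage compensation of (2)-(4) can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation used in the experiments: the function names, the nested-list frame layout, and the convention that $\alpha$ and $\beta$ are measured from corner $a$ along rows and columns (so that the interpolated position equals $s - v$) are assumptions made here.

```python
import math

def bilinear(alpha, beta, i1, i2, i3, i4):
    """Bilinear interpolator written in the form of Eq. (4)."""
    top = (1 - beta) * i1 + beta * i2      # between corners a and b
    bottom = (1 - beta) * i3 + beta * i4   # between corners c and d
    return (1 - alpha) * top + alpha * bottom

def predict_pixel(prev, s, v):
    """Predict I_k(s) from the reconstructed previous frame `prev`.

    Stage 1: the integer part floor(v) selects pixels a, b, c, d (Eq. 2).
    Stage 2: the fractional part v - floor(v) parameterizes the operator (Eq. 3).
    No bounds checking is done; the caller must keep a..d inside the frame.
    """
    fi, fj = math.floor(v[0]), math.floor(v[1])
    frac_i, frac_j = v[0] - fi, v[1] - fj
    di, dj = s[0] - fi, s[1] - fj          # pixel d = s - floor(v)
    a = prev[di - 1][dj - 1]
    b = prev[di - 1][dj]
    c = prev[di][dj - 1]
    d = prev[di][dj]
    # Assumed convention: (alpha, beta) measured from corner a, so that the
    # interpolated location is d - (frac_i, frac_j) = s - v.
    return bilinear(1 - frac_i, 1 - frac_j, a, b, c, d)

if __name__ == "__main__":
    # A linear intensity ramp: bilinear interpolation is exact on it.
    prev = [[10 * i + j for j in range(6)] for i in range(5)]
    print(predict_pixel(prev, (2, 2), (-0.4, -1.2)))
```

On the linear ramp used above, the prediction coincides with the ramp evaluated at $s - v$, which is a quick sanity check of the index conventions.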

2.2 Optimization of Operators

Let us assume at this point that some search strategy has selected a vector

$$\psi_s = \left[ \tilde I_{k-1}(a), \tilde I_{k-1}(b), \tilde I_{k-1}(c), \tilde I_{k-1}(d) \right]^t \qquad (5)$$

consisting of the four pixel intensities from the previous frame that are to be used to predict $I_k(s)$. Recall that $\tilde I_{k-1}$ represents the reconstructed past frame, available at both the coder and decoder. Now consider a predefined parameterized class of linear operators, which we represent by $\phi_\theta \in \mathbb{R}^4$. Different such classes define distinct backward motion compensation algorithms. Given this class of operators, we seek the parameter set $\theta(s)$ such that $\phi_{\theta(s)}^t \psi_s$ gives the best prediction of $I_k(s)$.

Since $I_k(s)$ is not available at the decoder, $\theta(s)$ must be estimated based on $\tilde I_k$ on a causal neighborhood $N_s \subset S$ of $s$, and some model relating the motion field on $N_s$ to the motion field at $s$. While some work [7] has explored more sophisticated models for the motion field, we use a constant motion field model in order to limit computational complexity. That is, we assume that the best parameter $\theta(s)$ to apply at pixel $s$ is the same parameter which best predicts the neighborhood $N_s$ from the corresponding pixels in the previous frame. Specifically, we select $\theta(s)$ to minimize a sum of squared errors over the neighborhood:

$$\theta(s) = \arg\min_{\theta} \sum_{s' \in N_s} \left[ \phi_\theta^t \psi_{s'} - \tilde I_k(s') \right]^2 \qquad (6)$$

where each $\psi_{s'}$ is specified by (5) and (2), using the same integer motion vector $\lfloor v \rfloor$.

Figure 2 illustrates this problem setup for a pixel $a'$ in the current frame. In this case, $\psi_{a'}$ consists of the intensities at pixels $a$, $b$, $c$, and $d$ in the previous frame, as specified by (5). The neighborhood $N_s$ in the current frame is defined as $\{b', c', d', e'\}$, and $\psi_{b'}$, $\psi_{c'}$, $\psi_{d'}$, and $\psi_{e'}$ are vectors in $\mathbb{R}^4$ containing the intensities of the pixels $\{f, a, e, c\}$, $\{g, h, f, a\}$, $\{h, k, a, b\}$, and $\{k, l, b, m\}$, respectively.

The form of the solution of (6) depends on the form of $\phi_\theta$. For example, for the bilinear interpolation operator we discussed in the previous section, $\theta = [\alpha, \beta]^t$, and

$$\phi_\theta = \left[ (1 - \alpha)(1 - \beta),\ (1 - \alpha)\beta,\ \alpha(1 - \beta),\ \alpha\beta \right]^t. \qquad (7)$$

In this case, there is a direct correspondence between $\theta$ and the fractional portion of the motion vector. In general, the $\theta(s)$ found from (6) may not relate to a motion vector, although it will predict $I_k(s)$ well.

Thus, the operator kernels naturally divide into two classes, which we shall call interpolative and non-interpolative kernels. Interpolative kernels are those kernels for which there exists a parameterization that directly corresponds to spatial locations. Non-interpolative kernels in general do not allow for such a parameterization uniquely. Typically, interpolation kernels are parameterized by two parameters (related to the two degrees of freedom on the image plane) whose values relate in an obvious way to a position on the image plane. The non-interpolative kernels presented in this paper all have more than two degrees of freedom.

Since non-interpolative kernels are free of geometrical constraints, they have a simpler structure than their interpolative counterparts. Among other advantages, the simple structure facilitates the use of matrix notation in the development of the algorithm, thus one can more easily adapt known and efficient computational methods to the optimization of non-interpolative operators. There also exists a duality theory for this class of operators [9]. On the other hand, because of the clear connection between the parameters of interpolation kernels and spatial variables, the development of integer-motion updates is much easier for interpolative kernels. The following sections are devoted to the introduction and detailed analysis of individual kernels.

3 Interpolation Kernels

In this section, we introduce interpolation kernels for backward motion estimation algorithms. Interpolation kernels are those whose parameters have a direct mapping to the points in a unit square. In other words, each interpolation operator specifies a point in the unit square (footnote 1).

Recall that the characteristics of the solution of (6), including the amount of computation needed in the solution, depend on the class of allowable operators. For computational and analytical tractability, we have chosen the class of linear operators. Furthermore, in order to minimize computation, we restrict ourselves to interpolating surfaces that lead to the solution of linear equations, thus avoiding iterative solutions to (6).
This means that these operators are not only linear, they are also linear in their parameters. As an example of a linear operator that is non-linear in its parameters, consider the bilinear operator. Even though the solution of (6) for the bilinear operator is possible, it requires iterations and is computationally prohibitive.

The geometrical constraints of an interpolation operator imply structure, and the nature of the structure governs the behavior and performance of the operator. Generally speaking, slacker constraints lead to more sophisticated operators with better performance, but also with heavier computation. We will offer two examples of interpolation operators, and begin by defining a kernel corresponding to interpolation along lines in the unit square.

Footnote 1: Note that the converse may not necessarily be true. All interpolation operators generate mappings from operator space to image space (no more than one image location corresponds to each operator). However, it is not guaranteed that this mapping is one-to-one, i.e. there may be more than one operator mapped to a given image location. Both operators introduced in this section provide examples of this behavior.

Figure 3: Interpolating lines on the square with corners a, b, c, and d.

Definition 1. $\phi_\theta$ are vectors in $\mathbb{R}^4$ satisfying:
- $\phi_\theta$ has all non-negative elements;
- $\phi_\theta$ has at most two non-zero elements;
- the elements of $\phi_\theta$ sum to 1.

We parameterize this set by letting $\theta = [\theta_0, \theta_1]$, where $\theta_0 \in \{0, 1, 2, 3, 4, 5\}$ is an index identifying one of the six lines connecting any two of the corners of the unit square, and $\theta_1 \in [0, 1]$ is a real number identifying a position along one of those lines. Figure 3 shows the six lines we consider. The value at any position along a line is the linear interpolation of that line's endpoints. For example, if line 0 denotes the line between pixels $a$ and $b$, then $\phi_{\{0, .25\}} = [.75, .25, 0, 0]$.

With $\phi_\theta$ defined as above, minimizing (6) involves solving six one-dimensional least-squares problems, one along each of the lines, and then selecting the minimum cost solution from among the six. To derive the form of each line solution, let us pick an arbitrary line, and label the intensities at its endpoints $I^0(s')$ and $I^1(s')$, for each of the pixels $s' \in N_s$. (Recall: when we refer to an interpolating line, we mean one line corresponding to each pixel $s' \in N_s$.) Let $\theta_1$ denote the distance from the $I^0(s')$ end of the line. Then the optimal position along this generic line is given by:

$$\theta_1 = \arg\min_{\theta} \sum_{s' \in N_s} \left[ \tilde I_k(s') - \theta I^1(s') - (1 - \theta) I^0(s') \right]^2 \qquad (8)$$

The minimum error is computed for each of the lines, and we select the best line. The solution to (8), along with a computationally efficient algorithm, is presented in Appendix A.

The linear interpolation kernel defined above is designed to identify translational motion in exactly vertical, horizontal, or diagonal directions. For more general translational motions, we define a second interpolation kernel corresponding to interpolation on triangular planar patches on the unit square. Figure 4 shows the four triangles we consider.
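The six one-dimensional problems behind the line kernel of Definition 1 can be sketched as follows. This is a hedged illustration rather than the algorithm of Appendix A: the closed form comes from setting the derivative of (8) to zero, clipping the result to $[0, 1]$ is one simple way to respect the constraint on $\theta_1$, and the line ordering (line 0 joining $a$ and $b$) as well as all function names are assumptions.

```python
from itertools import combinations

# Assumed ordering of the six lines; corners are indexed a=0, b=1, c=2, d=3,
# so line 0 joins a and b as in the example following Definition 1.
LINES = list(combinations(range(4), 2))

def fit_line(domains, targets, p, q):
    """Solve the 1-D least squares problem (8) on the line from corner p to q.

    domains: list of 4-vectors [I(a), I(b), I(c), I(d)], one per neighbor s'.
    targets: list of decoded intensities of the neighbors in the current frame.
    """
    num = den = 0.0
    for psi, y in zip(domains, targets):
        diff = psi[q] - psi[p]
        num += (y - psi[p]) * diff
        den += diff * diff
    theta = num / den if den > 0.0 else 0.0
    theta = min(1.0, max(0.0, theta))      # keep the position on the segment
    cost = sum((y - theta * psi[q] - (1.0 - theta) * psi[p]) ** 2
               for psi, y in zip(domains, targets))
    return theta, cost

def best_line_kernel(domains, targets):
    """Return (line index, theta_1, operator phi, cost) of the best line."""
    best = None
    for index, (p, q) in enumerate(LINES):
        theta, cost = fit_line(domains, targets, p, q)
        if best is None or cost < best[2]:
            best = (index, theta, cost)
    index, theta, cost = best
    p, q = LINES[index]
    phi = [0.0, 0.0, 0.0, 0.0]
    phi[p], phi[q] = 1.0 - theta, theta
    return index, theta, phi, cost

if __name__ == "__main__":
    domains = [[10, 20, 33, 47], [12, 40, 5, 9], [30, 18, 2, 50], [7, 15, 60, 1]]
    targets = [0.75 * psi[0] + 0.25 * psi[1] for psi in domains]
    print(best_line_kernel(domains, targets))
```

Because each sum involves only original pixel intensities, the accumulations in `fit_line` are the kind of quantity that can be shared among neighboring pixels.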

Figure 4: Interpolating triangles on the square with corners a, b, c, and d.

Definition 2. $\phi_\theta$ are vectors in $\mathbb{R}^4$ satisfying:
- $\phi_\theta$ has all non-negative elements;
- $\phi_\theta$ has at most three non-zero elements;
- the elements of $\phi_\theta$ sum to 1.

We parameterize this set by letting $\theta = [\theta_0, \theta_1, \theta_2]$, where $\theta_0 \in \{0, 1, 2, 3\}$ is an index identifying one of the four triangles in Figure 4, and $\theta_1$ and $\theta_2$ specify the weights applied to the intensities of two of the vertices (the third weight is given by $1 - \theta_1 - \theta_2$). This kernel corresponds to four planar interpolating surfaces covering the four triangles of Figure 4, each passing through the pixel intensities at three corners of the unit square. Since the triangles of Figure 4 overlap, each position on the unit square is associated with two distinct values of $\theta$, corresponding to two different linear operators. Thus, the mapping $\theta \to v$, from the interpolation kernel to positions in the image plane, is in this case not one-to-one.

With $\phi_\theta$ defined as above, we minimize (6) by solving four two-dimensional least-squares problems on the four triangular patches, and select the minimum cost solution. To derive the form of each planar solution, select an arbitrary triangle and label the intensities at its vertices $I^0(s')$, $I^1(s')$, and $I^2(s')$. Then, the optimal $(\theta_1, \theta_2)$ are given by:

$$(\theta_1, \theta_2) = \arg\min_{(\theta_1, \theta_2)} \sum_{s' \in N_s} \left[ \tilde I_k(s') - \theta_1 I^1(s') - \theta_2 I^2(s') - (1 - \theta_1 - \theta_2) I^0(s') \right]^2 \qquad (9)$$

The estimation error is computed for each of the two possible planar configurations, and the better one is chosen. Once again, the expressions for the optimal point and the residual error, along with computational considerations and savings, are given in Appendix A.

These two interpolation kernels offer considerable flexibility in adapting the computational complexity of an algorithm to available resources.
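A comparable sketch for the triangular kernel of Definition 2 solves the $2 \times 2$ normal equations of (9) by Cramer's rule. It is only an illustration under stated assumptions: the triangle ordering and function names are hypothetical, and a triangle whose unconstrained optimum violates the non-negativity constraints is simply skipped here, which may differ from the treatment in Appendix A.

```python
from itertools import combinations

# Assumed ordering of the four triangles; corners indexed a=0, b=1, c=2, d=3.
# The first vertex of each triple carries the weight 1 - theta_1 - theta_2.
TRIANGLES = list(combinations(range(4), 3))

def fit_triangle(domains, targets, tri):
    """Unconstrained solution of (9) on one triangle via 2x2 normal equations."""
    p0, p1, p2 = tri
    suu = suv = svv = sur = svr = 0.0
    for psi, y in zip(domains, targets):
        u, v, r = psi[p1] - psi[p0], psi[p2] - psi[p0], y - psi[p0]
        suu += u * u
        suv += u * v
        svv += v * v
        sur += u * r
        svr += v * r
    det = suu * svv - suv * suv
    if abs(det) < 1e-12:
        return None                        # degenerate (e.g. flat) neighborhood
    t1 = (sur * svv - suv * svr) / det     # Cramer's rule
    t2 = (suu * svr - suv * sur) / det
    cost = sum((y - t1 * psi[p1] - t2 * psi[p2] - (1 - t1 - t2) * psi[p0]) ** 2
               for psi, y in zip(domains, targets))
    feasible = t1 >= 0.0 and t2 >= 0.0 and t1 + t2 <= 1.0
    return t1, t2, cost, feasible

def best_triangle_kernel(domains, targets):
    """Return (triangle index, operator phi, cost) or None if none is feasible."""
    best = None
    for index, tri in enumerate(TRIANGLES):
        fit = fit_triangle(domains, targets, tri)
        if fit is None or not fit[3]:
            continue
        if best is None or fit[2] < best[2]:
            phi = [0.0, 0.0, 0.0, 0.0]
            phi[tri[0]] = 1.0 - fit[0] - fit[1]
            phi[tri[1]], phi[tri[2]] = fit[0], fit[1]
            best = (index, phi, fit[2])
    return best

if __name__ == "__main__":
    domains = [[10, 20, 33, 47], [12, 40, 5, 9], [30, 18, 2, 50], [7, 15, 60, 1]]
    targets = [0.5 * p[0] + 0.25 * p[1] + 0.25 * p[2] for p in domains]
    print(best_triangle_kernel(domains, targets))
```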
Unlike standard pel-recursive algorithms, which are based on a bilinear interpolation kernel, all expressions to be evaluated are functions of the original pixel intensities rather than bilinearly interpolated pixel values. This allows much of the computation to be shared among multiple pixels via running boxcar filters. It also allows many operations, which involve the previous pixel's motion estimates in the standard algorithms, to be computed prior to (and in parallel with) the actual estimation processing. In addition to these computational savings, one's choice of the number and type of interpolating surfaces over which to search can reflect computational considerations. Of course, in exchange for defining strategies using fewer interpolating surfaces, and employing interpolating lines rather than interpolating triangles (to save computations), one sacrifices prediction accuracy and runs the risk of losing track of the true motion field. However, it is possible to accommodate such tracking problems by implementing search strategies which incorporate information transmitted from the encoder. These hybrid forward/backward strategies will be discussed in Section 5.

4 Non-Interpolative Kernels

Recall that interpolation kernels reflect the geometrical constraints built into the operators. These constraints can be motivated by either computational considerations, or robustness to the iterations inherent in recursive motion estimation. Recall also that non-interpolative operators are those whose parameters do not have a mapping into the unit square. In other words, they are free of geometrical constraints linking them to locations on the image plane. In this section we foray into the realm of non-interpolative kernels.

Non-interpolative kernels generally have larger spaces and often include interpolative solutions as special cases. But more importantly, removal of the constraints simplifies the expression of the optimization problem, facilitating the use of known and tested mathematical techniques (such as Recursive Least Squares) for efficient computation of the operators. This elegance also facilitates extensions of the concept into larger spaces and/or with weighted estimation models (footnote 2), which we will explore in more detail shortly.

Perhaps the most intuitive non-interpolative kernel is the unrestricted four-dimensional linear operator, $\phi_\theta = [\theta_0, \theta_1, \theta_2, \theta_3]^t$, where the parameterization in this case is trivial [10]. The operator $\phi$ (where for simplicity of notation we suppress the explicit dependence on the parameter set $\theta$) is obtained through a least squares operation,

$$A \phi = h \qquad (10)$$

Footnote 2: The price paid for this elegance and generality is the loss of the intuitive concept of 2-D motion vectors.

with

$$A = \begin{bmatrix} \psi_1^t \\ \vdots \\ \psi_N^t \end{bmatrix} \qquad (11)$$

and

$$h = \left[ \tilde I_k(s_1), \ldots, \tilde I_k(s_N) \right]^t \qquad (12)$$

where $\mathcal{N} = \{s_1, \ldots, s_N\}$ is the causal neighborhood in the present frame used for setting up the equations. In the example of Figure 2, $N = 4$ and

$$\begin{aligned}
h &= \left[ \tilde I_k(b'), \tilde I_k(c'), \tilde I_k(d'), \tilde I_k(e') \right]^t \\
\psi_1 &= \left[ \tilde I_{k-1}(f), \tilde I_{k-1}(a), \tilde I_{k-1}(e), \tilde I_{k-1}(c) \right]^t \\
\psi_2 &= \left[ \tilde I_{k-1}(g), \tilde I_{k-1}(h), \tilde I_{k-1}(f), \tilde I_{k-1}(a) \right]^t \\
\psi_3 &= \left[ \tilde I_{k-1}(h), \tilde I_{k-1}(k), \tilde I_{k-1}(a), \tilde I_{k-1}(b) \right]^t \\
\psi_4 &= \left[ \tilde I_{k-1}(k), \tilde I_{k-1}(l), \tilde I_{k-1}(b), \tilde I_{k-1}(m) \right]^t
\end{aligned} \qquad (13)$$

We refer to the $\psi_i$ as the operator domains. The optimum operator is therefore

$$\hat\phi = (A^t A)^{-1} A^t h \qquad (14)$$

This has to be computed at every point in every frame. The amount of computation may be reduced, however, if we take into account the manner in which matrix $A$ changes as the estimator moves along the row scans. Consider the scalar equation corresponding to each row of the system of equations (10). The window at each pixel has a considerable overlap with the past window, hence a number of scalar equations in (10) are shared among successive estimators (see Figure 5). This motivates the use of more efficient solution techniques that carry over some of the work done in finding one solution to the next one. The most notable of these techniques is the recursive least squares (RLS) algorithm.

RLS, a well known algorithm in the field of numerical linear algebra [11], uses the Sherman-Morrison matrix inversion lemma for rank-one updates of matrix $A$ at each iteration. At each iteration of this algorithm, one equation is added to the system of equations, and all other equations are multiplied by a "forgetting factor" $\lambda < 1$. The resulting implementation is elegant, but when applied to our problem results in a severe degradation in performance compared to a direct implementation. For this we offer two explanations. First, the RLS algorithm with the concept of a forgetting factor is inherently one-dimensional. When it is used with a two-dimensional window, a direct inspection shows that the weighting is not consistent with any measure of the relative importance of the equations, including the distance of the corresponding pixels in the window from the current pixel, and this holds true for any ordering of the pixels. Second, the forgetting factor results in a window with an exponential tail. Such a long window is inconsistent with the assumption of uniform motion on the training window. These problems can be circumvented in a variation of RLS that uses QR decomposition.

Figure 5: As the window moves in the raster scan, pixels move in and out of it. (Labels: old window, new window, outgoing pixel, incoming pixel, shared pixel, causal window.)

4.1 RLS with QR Decomposition

We are in search of an efficient (rank-one update) algorithm that allows a finite window with reasonable weighting of the equations over the window. For a boxcar window, such a rank-one update is possible through solving the equations in a QR factorized form:

$$A = QR \ \Rightarrow\ R \phi = Q^t h \qquad (15)$$

The nonzero rows of $R$ constitute an upper triangular matrix, and the least squares solution can be found easily by back substitution.

As the window moves along the row scans, some pixels move into the window and some move out of it, but there is an overlap between successive windows. Using this, we can update the QR decomposition with the data corresponding to incoming pixels, and downdate it by the data corresponding to outgoing pixels. For the theory and computational details of QR factorization and updates/downdates see [12, 13]. QR decomposition can be performed by either Givens rotations or Householder reflections. Givens rotations are preferable because they are more efficient in the update-downdate process. Note that the multiplication $Q^t h$ in (15) is not necessary if the rotations are performed on the right-hand-side vector simultaneously. This reduces the multiplications on the RHS by 50 percent. The algorithm is outlined in Figure 6.

4.2 Formulation with Weighting

The predictor $\hat\phi$ was designed with the assumption of a constant motion field over the causal window and the absence of noise, which led to a uniform weighting in the least squares problem. By relaxing these assumptions, a more realistic setup is created in which better predictors can be designed. In general, one would expect that the scalar equations corresponding to the pixels in the window that are closer to the current pixel should have greater weight. Let us consider such a setting and denote this new (weighted) predictor as $\phi_w$, and the weight vector as $w = [w_1, \ldots, w_n]^t$. The problem now changes to

$$\min_{\phi_w} E(\phi_w; w) \qquad (16)$$

where

$$E(\phi_w; w) = \left\| S^{1/2} (A \phi_w - h) \right\|^2 = \sum_{s_i \in N_s} w_i \left[ \phi_w^t \psi_{s_i} - \tilde I_k(s_i) \right]^2 \qquad (17)$$

with

$$S = \mathrm{diag}(w_1, \ldots, w_n). \qquad (18)$$

The solution to the weighted least squares problem can be written as

$$\hat\phi_w = \left( A^t S A \right)^{-1} A^t S h. \qquad (19)$$

This solution assumes knowledge of the weights. In what follows, we offer a means of finding optimal weighting coefficients, given a fixed support for the causal window. Let $T$ represent a very large ensemble of pixels taken from a training set of sequences. Since the weights are intended to serve as a statistical model applicable to the entire ensemble, we seek a set of weights to optimize

$$E(w) = \sum_{s \in T} \left[ \hat\phi_w^t(s)\, \psi_s - I_k(s) \right]^2. \qquad (20)$$
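For reference, the direct (non-recursive) solutions (14) and (19) can be sketched with plain Python lists. The sketch forms the weighted normal equations and solves them by Gaussian elimination with partial pivoting; with unit weights it reduces to (14). The function names and the rank-deficiency tolerance are assumptions, and a production implementation would more likely use a numerical library.

```python
def solve_linear(m, r):
    """Solve the square system m x = r by Gaussian elimination with pivoting."""
    n = len(r)
    aug = [list(row) + [ri] for row, ri in zip(m, r)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        if abs(aug[pivot][col]) < 1e-12:   # assumed tolerance
            raise ValueError("rank-deficient system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(col + 1, n):
            factor = aug[k][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[k][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def weighted_ls(a, h, w=None):
    """Return (A^t S A)^{-1} A^t S h with S = diag(w), as in Eq. (19).

    a: list of operator domains (rows of A); h: neighbor intensities;
    w: per-equation weights, defaulting to uniform weights (Eq. 14).
    """
    rows, cols = len(a), len(a[0])
    if w is None:
        w = [1.0] * rows
    ata = [[sum(w[i] * a[i][p] * a[i][q] for i in range(rows))
            for q in range(cols)] for p in range(cols)]
    ath = [sum(w[i] * a[i][p] * h[i] for i in range(rows)) for p in range(cols)]
    return solve_linear(ata, ath)

if __name__ == "__main__":
    a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    h = [1.0, 2.0, 4.0]
    print(weighted_ls(a, h))                    # uniform weights
    print(weighted_ls(a, h, [1.0, 1.0, 0.0]))   # third equation ignored
```

Setting a weight to zero removes the corresponding equation, which shows how the weights trade off the influence of individual neighbors.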

At each iteration, $R$ is stored in $A$ and the rotated version of $h$ is stored in place.

1. If the pixel is at the edge of the frame, or if the integer part of the motion has changed since the last iteration, reset the updates (i.e. discard the previous solution), refill the matrix $A$ with the points in the window, and do a complete QR decomposition (with Givens rotations). Store $R$ in $A$.
2. Corresponding to one of the points that are coming into the window, add a new row at the top of $A$ and an element at the top of $h$. Also append a unit vector to $Q$ (first row and column). $A$ was upper triangular, and now becomes upper Hessenberg.
3. Perform $n$ Givens rotations on columns of $A$ to make it upper triangular. Update $Q$ and $h$ with the same rotations.
4. Do a row exchange on $Q$ so that the row corresponding to one of the outgoing points goes to the top row.
5. Perform Givens rotations on the top row of $Q$, and simultaneously apply them to $A$ and $h$.
6. Discard the first row of $A$, the first element of $h$, and the first row and column of $Q$.
7. Continue steps 2 to 6 until the window is completely updated.
8. Solve the (now) upper triangular system $A \phi = h$ to find the best least squares predictor $\hat\phi$. If the system is rank deficient, set $\hat\phi = [\ ]^t$.
9. Move the window to the next position in the row scan. Go to step 1.

Figure 6: RLS algorithm for non-interpolative backward motion estimation.
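The update half of this procedure (steps 2, 3, and 8) can be illustrated with a short Python sketch. It is a simplified variant, not the algorithm of Figure 6 itself: the new equation is rotated directly into the triangular factor instead of being inserted at the top, $Q$ is not stored, and the downdating of outgoing pixels (steps 4 to 6) is omitted. All names are hypothetical.

```python
import math

def givens_add_row(r, z, row, y):
    """Fold one new equation row^t phi = y into the triangular system R phi = z.

    r: n x n upper triangular matrix (list of lists), updated in place.
    z: rotated right-hand side of length n, updated in place.
    Returns the component of y left over after the rotations (the part of the
    residual contributed by this equation).
    """
    n = len(r)
    row = list(row)
    for k in range(n):
        if row[k] == 0.0:
            continue                        # nothing to eliminate in column k
        norm = math.hypot(r[k][k], row[k])
        c, s = r[k][k] / norm, row[k] / norm
        for j in range(k, n):
            rkj, xj = r[k][j], row[j]
            r[k][j] = c * rkj + s * xj
            row[j] = -s * rkj + c * xj
        zk = z[k]
        z[k] = c * zk + s * y               # same rotation on the RHS
        y = -s * zk + c * y
    return y

def back_substitute(r, z):
    """Solve the upper triangular system R phi = z."""
    n = len(r)
    phi = [0.0] * n
    for i in reversed(range(n)):
        if abs(r[i][i]) < 1e-12:
            raise ValueError("rank-deficient system")
        tail = sum(r[i][j] * phi[j] for j in range(i + 1, n))
        phi[i] = (z[i] - tail) / r[i][i]
    return phi

if __name__ == "__main__":
    r = [[0.0, 0.0], [0.0, 0.0]]
    z = [0.0, 0.0]
    for row, y in [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0), ([1.0, 1.0], 4.0)]:
        givens_add_row(r, z, row, y)
    print(back_substitute(r, z))
```

Applying the rotations to the right-hand side as the rows are absorbed is what makes the explicit product $Q^t h$ unnecessary.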

Unlike $E(\theta_w; w)$, which measures the performance of the operator on neighboring pixels, $E(w)$ measures how those operators, trained on neighbors, perform on the pixel $s$ to be predicted. Note that $E(w)$ depends on $w$ through the implicit dependence of $\hat{\theta}_w$ on $w$. This indirect relationship makes it impossible to find an analytic expression for the optimal weights. Instead, a gradient descent strategy can be used to minimize $E(w)$ [14]. Differentiation, after some algebra, gives

$$\frac{\partial}{\partial w_j} E(w) = -2 \sum_{s \in T} e_s\, e_{s,j}\, \phi_s^t \left( A^t S A \right)^{-1} \phi_{s_j} , \tag{21}$$

where

$$e_s = \hat{\theta}_w^t(s)\, \phi_s - I_k(s) \tag{22}$$

is the residual error from applying $\hat{\theta}_w$ at $s$, and

$$e_{s,j} = \hat{\theta}_w^t(s)\, \phi_{s_j} - I_k(s_j) \tag{23}$$

is the residual error from applying $\hat{\theta}_w$ at $s_j$, the j-th neighbor of $s$. Note that this gradient descent optimization is performed off-line.

4.3 A Wider Class of Operators

So far, we have focused on operators in $\mathbb{R}^4$. In the case of interpolation operators, this limitation is a natural byproduct of the rectangular geometry of the sampling lattice. In the case of non-interpolative operators, however, we have no incentive to link the parameters of the operator to the planar geometry of the image lattice. Thus, non-interpolative operators are easily extended to higher dimensional spaces. In fact, for every $n \ge 4$, one can construct at least one operator space on $\mathbb{R}^n$. For each $n$, whole families of operators are made possible by considering symmetric n-element arrangements (templates) of pixels [15].³

To arrive at these new families of operators, we generalize equations (3) and (5) as follows:

$$\phi_i = \left[ \tilde{I}_{k-1}(a_{i,1}) \;\cdots\; \tilde{I}_{k-1}(a_{i,n}) \right]^t , \tag{24}$$

$$\hat{I}_k(s) = \theta^t \left[ \tilde{I}_{k-1}(a_1), \ldots, \tilde{I}_{k-1}(a_n) \right]^t , \tag{25}$$

where $\{a_1, \ldots, a_n\}$ are pixels in the past frame forming the template corresponding to pixel $s$ to be predicted, and $\{a_{i,1}, \ldots, a_{i,n}\}$ is the template corresponding to the i-th neighbor of $s$. Then, equations (10), (11), and (12) once more define the problem, and the solution is given by (14).
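The template construction of (24) and (25) can be made concrete with a small sketch. Everything named here is an assumption for illustration: the (row, column) indexing, the function names, the toy frame, and in particular the choice to obtain a neighbor's template by translating the template of $s$ by the neighbor's offset.

```python
def data_vector(prev_frame, template):
    """Stack previous-frame intensities at the template positions, as in (24)."""
    return [prev_frame[r][c] for (r, c) in template]


def shift(template, offset):
    """Translate every template position by the same (row, col) offset."""
    return [(r + offset[0], c + offset[1]) for (r, c) in template]


def predict(prev_frame, template, operator):
    """Inner product of the operator with the template intensities, as in (25)."""
    phi = data_vector(prev_frame, template)
    return sum(t * p for t, p in zip(operator, phi))


# Toy previous frame and a hypothetical five-element template for s = (2, 2).
prev = [[10, 11, 12, 13],
        [20, 21, 22, 23],
        [30, 31, 32, 33],
        [40, 41, 42, 43]]
template = [(2, 2), (2, 1), (1, 2), (2, 3), (3, 2)]

# Two causal neighbours of s (west and north); each contributes one row of A.
causal_offsets = [(0, -1), (-1, 0)]
A = [data_vector(prev, shift(template, off)) for off in causal_offsets]

copy_centre = predict(prev, template, [1, 0, 0, 0, 0])   # picks prev[2][2]
average = predict(prev, template, [0.2] * 5)             # mean of the template
```

The vector h of the least squares problem would collect the current-frame intensities at the same causal neighbors; it is left out here to keep the sketch short.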
³ Although not all possible arrangements are necessarily useful, or lead to efficient algorithms.

In the same manner as before, the method of Recursive Least Squares can be utilized for

a fast implementation of the algorithm. Also, a weighted least squares approach can be developed much in the same way as in Section 4.2.

While the generality of the non-interpolative operator allows a direct extension from 4-D to higher dimensions, we note that the computational complexity of the optimization grows rapidly with increasing dimension, in fact as $O(n^2)$. This can become very cumbersome even with moderate values of $n$. Furthermore, given a fixed size of neighborhood $N$, the solution becomes less robust as the dimensionality of the operator increases. With increasing $n$, the system of equations solving for a higher-dimensional operator becomes less and less over-determined, leading to noise sensitivity; an effect not unlike the well-known "over-classification" problem in statistical pattern recognition.

Such factors motivate a compromise: using a higher dimensional operator, yet optimizing it in a lower dimensional subspace by forcing upon it a set of constraints. Loosely speaking, this compromise can be considered a middle ground between interpolation operators (fully geometrically constrained) and non-interpolation operators. The solution of the resulting constrained least squares problem is possible either through direct substitution (in lower dimensions and with simple constraints), or through a Lagrangian approach.

Providing a catalog of even a small subset of these operators would take this paper far beyond its size limitations. Instead, we show here the process of developing a specific higher-dimensional constrained operator, through which we demonstrate the underlying ideas and methods. The particular example is a five-dimensional operator (see Figure 7). In this case

$$\begin{aligned}
a_1 &= s - \lfloor v + (0.5, 0.5) \rfloor , \\
a_2 &= s - \lfloor v + (-0.5, 0.5) \rfloor , \\
a_3 &= s - \lfloor v + (0.5, -0.5) \rfloor , \\
a_4 &= s - \lfloor v + (1.5, 0.5) \rfloor , \\
a_5 &= s - \lfloor v + (0.5, 1.5) \rfloor ,
\end{aligned} \tag{26}$$

and $A$ is now an $N \times 5$ matrix whose i-th row consists of intensities of pixels in frame $k-1$ corresponding to $s_i$.
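The template positions of (26) are easy to check numerically. The sketch below assumes that the brackets in (26) denote the componentwise floor operation and that pixels and motion vectors are plain coordinate pairs; `OFFSETS_5D` and `template_5d` are hypothetical names.

```python
import math

# Offsets added to v before flooring, in the order a1..a5 of (26).
OFFSETS_5D = [(0.5, 0.5), (-0.5, 0.5), (0.5, -0.5), (1.5, 0.5), (0.5, 1.5)]


def template_5d(s, v):
    """Return a1..a5 = s - floor(v + offset) for pixel s and real-valued motion v."""
    return [(s[0] - math.floor(v[0] + dx), s[1] - math.floor(v[1] + dy))
            for dx, dy in OFFSETS_5D]


positions = template_5d((10, 10), (1.2, -0.7))
# -> [(9, 11), (10, 11), (9, 12), (8, 11), (9, 10)]
```

Under this reading, $a_1$ is the pixel displaced by the rounded motion vector and the other four points are its horizontal and vertical lattice neighbors, and the same five points are returned for every $v$ whose rounded value is unchanged.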
The least squares solution for the operator, as before, is

$$\hat{\theta} = (A^t A)^{-1} A^t h .$$

This can obviously be solved with the same recursive least squares technique of Section 4.1. But we are interested in a constrained problem. Consider the alternative LS problem with a linear constraint

$$\min_{\theta} \left\| A\theta - h \right\| , \quad \text{s.t.} \quad c^t \theta = 1 \tag{27}$$

where $c$ is some weight vector. Such a linear constraint is equivalent to reducing the dimension of the problem by one.

Figure 7: Templates of a 4D (left) and 5D (right) kernel. These specific sets of points in the past frame are used if the motion vector falls within the solid square.

Figure 8: Relationship of pixels in the present frame to the template in the past frame through the motion vector. (Figure labels: pixel in frame k; motion vector; diamond template with corresponding pixels in frame k-1.)

Let us introduce the following partitions in matrix $A$ and vectors $\theta$ and $c$:

$$A_{N \times n} = \left[ \, B_{N \times (n-1)} \;\; b_{N \times 1} \, \right] , \tag{28}$$

$$\theta = \begin{bmatrix} \bar{\theta} \\ \theta_n \end{bmatrix} , \tag{29}$$

$$c = \begin{bmatrix} d \\ c_n \end{bmatrix} , \tag{30}$$

where $n = 5$, and $\bar{\theta}$ and $d$ are four-dimensional vectors. Solving for $\theta_n$ in terms of $\bar{\theta}$ and

substitution results in a reduced, unconstrained problem:

$$\min_{\bar{\theta}} \left\| \left( B - \frac{b\, d^t}{c_n} \right) \bar{\theta} - \left( h - \frac{b}{c_n} \right) \right\| , \tag{31}$$

whose solution is

$$\hat{\bar{\theta}} = \left[ B^t B + \frac{\|b\|^2}{|c_n|^2}\, d\, d^t - \frac{B^t b\, d^t + d\, b^t B}{c_n} \right]^{-1} \left( B^t - \frac{d\, b^t}{c_n} \right) \left( h - \frac{b}{c_n} \right) . \tag{32}$$

Although this equation seems complex, the inversion is performed on a symmetric $4 \times 4$ matrix, whose number of independent elements is only 10. The main burden is therefore the computation of the matrices themselves at each pixel. The constraint that we use has the form

$$c = [1, 1, 1, 1, 1]^t . \tag{33}$$

This is a constraint on net local amplification between frames, reflecting our belief that inter-frame redundancies can be addressed through translational operations. The matrix in the right hand side of the reduced LS problem can be written as

$$B - \frac{b}{c_n}\, d^t = A \begin{bmatrix} I \\ -\mathbf{1}^t \end{bmatrix} . \tag{34}$$

Equation (32) therefore takes the form

$$\hat{\bar{\theta}} = \left( \left[ \, I \;\; -\mathbf{1} \, \right] A^t A \begin{bmatrix} I \\ -\mathbf{1}^t \end{bmatrix} \right)^{-1} \left[ \, I \;\; -\mathbf{1} \, \right] A^t (h - b) . \tag{35}$$

A little insight into the way $A^t A$ is formed results in considerable savings. Computation of this matrix directly involves 300 multiplications and 300 additions per pixel. However,

$$A^t A = \sum_{i=1}^{N} \phi_i\, \phi_i^t , \tag{36}$$

where $\phi_i$ are vectors whose elements are defined in (26). Further taking into account the symmetry, only 10 multiplications and 120 additions per pixel are necessary. If a sliding window technique is used, the number of additions can be further reduced to only 60 per pixel. After that, forming the necessary matrices of (32) from $A^t A$ and $A$ involves only row and column operations.

$\hat{\bar{\theta}}$ is obtained by multiplying a matrix that depends only on pixel values in the past frame by a vector of pixel values of the present frame. Considering this, the algorithm makes a pre-run where all matrices

$$\left( \left[ \, I \;\; -\mathbf{1} \, \right] A^t A \begin{bmatrix} I \\ -\mathbf{1}^t \end{bmatrix} \right)^{-1} \left[ \, I \;\; -\mathbf{1} \, \right] A^t$$

are computed. Then, as the frame is progressively scanned, the vector $h - b$ is formed and $\hat{\bar{\theta}}$ is computed at each pixel. $\hat{\theta}$ is trivially related to $\hat{\bar{\theta}}$ through the linear constraint.

5 On Search Strategies

In the previous sections, we assumed that each iteration begins with a good guess of the integer portion of the motion vector at each pixel. In this section, we consider strategies for computing this integer motion vector. We begin by discussing a variety of types of motion information that can be the basis for search strategies.

(a) Global Information: initialize the integer motion vector by exhaustively searching a large space of allowable integer motion vectors to find the best match for a neighborhood of pixel s;
(b) Local Smoothness: initialize the integer motion vector with some average of the vectors assigned to neighbors of pixel s;
(c) Local Information: refine the integer motion vector based on how the neighborhood of pixel s fits an initial integer position;
(d) Hierarchical Information: initialize the integer motion vector based on how a coarse version of the neighborhood fits at the current position;
(e) Information from Encoder: initialize the integer motion vector based on information from the encoder.

The first four sources of motion information are computed at the decoder. They are all based on the principle that the best integer motion vector for pixel s is the vector that is best for a neighborhood of pixel s. They differ in how the best integer vector for the neighborhood is determined. The computational complexity of a strategy based on (a) precludes its use in practical systems. Standard pel-recursive estimation algorithms apply search strategies based on a combination of (b) and (c). They initialize the integer motion vector with an average of neighboring vectors, and then refine this initial guess if the local analysis suggests that a different integer vector would provide a better fit for the neighborhood of s.
Due to their reliance on local information, such strategies can

suffer from tracking and convergence problems, particularly in regions with motion field discontinuities and texture. If the motion field is not smooth, neighborhood vectors can be unreliable as initial guesses. In addition, if the initial guess is bad enough, the local analysis of part (c) may yield meaningless results, potentially leading to diverging estimates. Hierarchical search strategies based on (d) could potentially eliminate some of the difficulties with local strategies by resolving local ambiguities with more global information. An example of a hierarchical motion framework built on a multi-resolution image structure can be found in [16]. Finally, motion information from the encoder, computed with the benefit of both current and previous frames, can provide very reliable initial motion vectors. However, such information comes at the expense of the bandwidth needed to transmit the motion information.

The algorithms proposed in this section differ only in the search strategy used for selecting the initial integer motion vector. The first algorithm, labeled NPR1, uses the strategy of standard pel-recursive algorithms. At each pixel $s = (i, j)$, the initial integer vector is provided by the average of the integer motion vectors at $(i-1, j)$ and $(i, j-1)$. Using the interpolating lines described in Section 3, if the optimal linear operator is found to correspond to a position on a line that is closer to some other integer vector than the initial vector, this closer vector replaces the initial vector, and the procedure is repeated. The integer motion vector at the end of the second iteration is stored as the nominal integer motion vector for s. However, up to five additional iterations are allowed for the purpose of finding the best linear operator to apply at each pixel. On the final iteration, we select the optimal linear operator based on interpolating triangles rather than interpolating lines.
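The control flow of the NPR1 search can be sketched as follows. This is only an outline under several assumptions: the rounding convention for the averaged vector is not stated in the text and is chosen here, `fit_position` is a hypothetical callback standing in for the interpolating-line fit of Section 3, and `toy_fit` is an invented stand-in used solely to exercise the loop.

```python
import math


def round_half_up(x):
    """Round to the nearest integer, ties going up (an assumed convention)."""
    return math.floor(x + 0.5)


def initial_vector(mv_a, mv_b):
    """Average the integer vectors of the two causal neighbours and round."""
    return (round_half_up((mv_a[0] + mv_b[0]) / 2),
            round_half_up((mv_a[1] + mv_b[1]) / 2))


def nominal_vector(v0, fit_position, iterations=2):
    """Re-centre the integer vector on the fitted position, at most `iterations` times."""
    v = v0
    for _ in range(iterations):
        p = fit_position(v)                      # real-valued displacement estimate
        nearest = (round_half_up(p[0]), round_half_up(p[1]))
        if nearest == v:                         # the fit is already closest to v
            break
        v = nearest
    return v


def toy_fit(v, target=(2.3, -1.4)):
    """Invented stand-in for the line fit: moves at most one pixel toward target."""
    return tuple(vi + max(-1.0, min(1.0, ti - vi)) for vi, ti in zip(v, target))


start = initial_vector((0, 0), (0, 0))
nominal = nominal_vector(start, toy_fit)         # two re-centring steps
```

The additional iterations used to select the final operator, and the switch to interpolating triangles on the last iteration, are not modeled here.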
The distinction between the nominal integer motion vector assigned at s (defined after two iterations), and the final integer vector, which results after all iterations, deserves some discussion. It can be shown that the minimum error achieved on the optimal line (38) monotonically decreases with each iteration. Thus, each iteration can only improve the neighborhood fit to the interpolating lines. However, empirical studies have demonstrated that, while searching for the optimal neighborhood fit does improve the performance of the linear operator, it does not always provide a good integer starting point for future pixels. This is especially true for regions with small gradients, in which other factors besides motion contribute to the performance of the linear operator. Thus, we impose a type of smoothness constraint on the integer motion field, by limiting the difference between nominal integer vectors of adjacent pixels. In selecting the optimal linear operator, however, we do not impose this constraint.

The second algorithm, labeled NPR2, is a simple modification of NPR1 incorporating block-based motion vectors as the integer motion vector at each pixel. In this algorithm, all pixels are

initialized with the integer-valued motion vector which would be applied by a block-based motion compensation scheme. Motion vectors are computed at the encoder using exhaustive search block matching. After the initialization, the algorithm proceeds like NPR1, with a maximum of seven allowable iterations.

We consider NPR2 to be a partially pel-recursive algorithm, because it shares some, but not all, of the recursive characteristics of standard pel-recursive algorithms. Pel-recursive algorithms are recursive in two ways. First, standard pel-recursive algorithms are based on modeling the motion of previously received pixels in the current frame. Thus, the motion model is recursively refined with the addition of each new pixel value. This general characterization of recursive algorithms contrasts with non-recursive approaches which model the motion at the encoder and transmit motion information to the decoder. Second, pel-recursive algorithms typically propagate an explicit motion estimate from pixel to pixel, refining the estimate at each pixel with the neighborhood motion model. We consider NPR2 as partially pel-recursive because it has the first type of recursiveness, though it does not propagate an explicit motion estimate from one pixel to the next.

The third algorithm, labeled NPR3, combines features of NPR1 and NPR2. NPR3 has two modes of initialization. At each pixel $s = (i, j)$, the magnitudes of the prediction errors at pixels $(i-1, j)$ and $(i, j-1)$ are added and compared with a threshold. If the threshold is exceeded, a block-based motion vector received from the encoder is used as the initial integer motion vector. Otherwise, the integer motion vector is initialized as in NPR1. The threshold is selected to control the number of block motion vectors which must be transmitted over the channel. In the simulations of the following section, several thresholds are tested, reporting both the average prediction error energy and the number of transmitted blocks.
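The mode decision of NPR3 amounts to a one-line test, sketched below. The function name, argument names and the returned flag are hypothetical; the two candidate vectors are assumed to have been computed elsewhere.

```python
def npr3_initial_vector(err_a, err_b, threshold, block_vector, npr1_vector):
    """Choose the initial integer motion vector for one pixel under the NPR3 rule.

    err_a, err_b : prediction errors at the two causal neighbours of the pixel
    block_vector : block-matching vector that the encoder would supply
    npr1_vector  : vector obtained by averaging the neighbours, as in NPR1
    Returns the chosen vector and a flag telling whether the block vector is used.
    """
    if abs(err_a) + abs(err_b) > threshold:
        return block_vector, True      # tracking looks poor: ask the encoder
    return npr1_vector, False          # tracking looks fine: stay pel-recursive


calm = npr3_initial_vector(1.5, -2.0, 10.0, (4, 0), (3, 0))
lost = npr3_initial_vector(9.0, -6.0, 10.0, (4, 0), (3, 0))
```

Because both error values are available at the encoder and the decoder, the same test can be evaluated at both ends.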
Note that the structure of NPR3 does not require the encoder to notify the decoder about which block motion vectors will be sent, since the decision is based on information available to both the encoder and decoder. NPR3 can be viewed as a fully pel-recursive algorithm which occasionally looks to the encoder for a little help.

6 Simulations

The experimental results to follow aim to illuminate the issues involved in both components of operator-based motion estimation, namely the search and optimization problems. Recall that we have presented two types of operator kernels. Going through all combinations of search strategies and operator kernels can be tedious; thus we have chosen to show head-to-head comparisons for non-interpolation operators (with identical integer motions) and illustrate the search experiments

with interpolation operators.

All experiments were performed on the first 50 frames of the "football" sequence. The foreground of this sequence has relatively large amounts of motion, with significant object deformation and occlusion. In addition to foreground motion, a constant camera pan generates motion across the background, thus requiring (in forward motion estimation methods) motion vectors for all blocks. We study the performance of purely backward algorithms for the purpose of investigating the effectiveness of operator kernels, particularly with non-interpolative operators. Algorithms NPR1, NPR2 and NPR3 were designed with interpolative operators, to address motion tracking issues. They are compared with both a standard pel-recursive motion estimation algorithm and block-based motion compensation algorithms using both integer and fractional (1/2 pixel accuracy) block-matching motion estimation. The average displaced frame difference (DFD) energy per pixel is used as the measure of performance for all simulations.

Figure 9 compares the performances of the four-dimensional non-interpolative operators on the sequence "football". For comparison purposes the algorithms are named STANDARD, 4D, and 4DW. STANDARD is the standard pel-recursive algorithm based on [2], but with centralized gradients as suggested in [17]. 4D and 4DW are the algorithms employing the four-dimensional operators $\hat{\theta}$ and $\hat{\theta}_w$ respectively. All of the above utilize a 12-pixel causal window. In order to ensure the fairness of the comparison, the same integer motions are used for all three methods.

Figure 9: Performance comparisons of 4-D operator motion compensation algorithms on the "football" sequence. (Plot of DFD energy per pixel versus frame number; curves: STANDARD, 4-D, 4DW.)

Results of applying the constrained five-dimensional operator are shown in Figure 10. The curve marked 5D represents the constrained five-dimensional operator. The results show some improvement over the four-dimensional operator.
Note that this improvement was obtained with virtually

no increase in computational complexity, since the five-dimensional operator is constrained and has only four degrees of freedom.

Figure 10: Performance of the constrained 5-D operator on the "football" sequence. (Plot of DFD energy per pixel versus frame number; the legend includes the 5-D curve.)

Figure 11 compares the performance of two pel-recursive motion estimation algorithms, STANDARD and NPR1. These algorithms are traditionally pel-recursive, incorporating no explicit motion information received from the encoder. STANDARD is the standard pel-recursion algorithm based on [2], but modified to use temporally averaged gradient estimates as suggested in [17]. NPR1 is a pel-recursion algorithm based on the first search strategy defined in Section 5. While NPR1 shows superior performance to the STANDARD algorithm throughout the sequence, both algorithms exhibit periods of poor performance, reflecting tracking problems common to pel-recursive algorithms. A comparison of the DFD arrays produced by the two algorithms suggested that the new algorithm produced much lower DFD energy over most of the frame, but experienced more severe tracking problems than the STANDARD algorithm in some local regions. A comparison of motion vectors showed that, in these regions of poor performance, the integer motion vectors propagated from pixel to pixel wandered far from the correct vector (namely, the block-matching result). The STANDARD algorithm never deviated as drastically from the correct vector.

The tracking difficulties experienced by NPR1 motivate us to define NPR2, a new, partially pel-recursive algorithm, which avoids tracking problems by using information from the encoder, as described in the previous section. Since NPR2 is designed around the use of block motion vectors, Figure 12 compares NPR2 with both integer and fractional (1/2 pixel accuracy) block-matching algorithms. In comparing NPR2 with the integer algorithm, we compare two algorithms relying on the same motion information from the encoder.
In comparing NPR2 with the fractional algorithm, we compare NPR2 with an algorithm receiving two additional bits from the encoder, to
