Dense Motion Field Reduction for Motion Estimation

Aaron Deever
Center for Applied Mathematics
Cornell University
Ithaca, NY 14853
adeever@cam.cornell.edu

Sheila S. Hemami
School of Electrical Engineering
Cornell University
Ithaca, NY 14853
hemami@ee.cornell.edu

Abstract

A new approach to motion estimation/compensation is presented that uses a morphological pyramidal representation of a dense motion field. The dense field is reduced and coded in a rate-distortion-optimized fashion, in that the compensated frame produced by the encoded motion for a given number of bits minimizes the energy of the residual frame. Relative to standard block-based techniques, the resulting compensated frames are of higher quality while the motion provides a better representation of true motion. Additionally, the motion representation delineates objects well. As such, this technique is useful in low-rate applications and in forthcoming object-based techniques.

1. Introduction

Motion estimation and compensation (ME/C) are essential components of efficient video compression. Standard block-based ME/C methods yield good PSNR values in coding but suffer from several drawbacks. Regions with multiple motions and motion edges are handled poorly by block-based ME/C, and blocking artifacts occur at low bit rates. In addition, the motion vectors are chosen independently of one another and are selected so as to minimize residual error, and consequently may not correspond to the true motion in the scene.

In this paper, a ME/C scheme is presented that addresses these drawbacks of block-based ME/C. A dense motion field is calculated using a variant of the Horn and Schunck algorithm [5], producing pixel-level estimates of the true motion in the scene. These pixel-level estimates allow motion boundaries to be detected and thus the problems of multiple motions within a block are avoided.
A morphological pyramid is constructed, in the style of a Laplacian pyramid [1], to efficiently represent, reduce and code the dense motion field in a rate-distortion-optimized fashion.

Morphological rather than linear filtering is employed because it is better suited to motion data for two reasons. First, motion fields are often approximately piecewise-constant, and morphological decompositions are well suited to representing the shapes and sizes of objects. Morphological filtering avoids the averaging inherent in linear filtering that blurs motion boundaries, resulting in poor estimates for motions near a boundary. Second, given a dense motion field, the efficient representation of this field can be seen as a problem in nonuniform sampling. The nonlinearity of the morphological filters resembles a sampling procedure. Various levels of a morphological decomposition contain samples of (motion) objects large enough not to have been suppressed by the filtering process. By selectively coding various coefficients of the pyramid, a non-uniform sampling approach results.

This technique performs comparably, in quantitative terms, to standard block-based motion compensation on many sequences, but improves visual quality by alleviating blocking artifacts and by allowing refined ME/C at motion boundaries. Because of the improved motion estimation and compensation, the resulting residual frames are generally noisier, containing more high frequency components, and are less correlated than residual frames corresponding to block-based motion compensation. As such, residual-based encoding loses efficiency. However, this technique is well-suited to low-rate applications in which the residual is minimally coded, or not coded at all. Additionally, the motion representation tends to delineate objects well, making it useful for object-based coding and segmentation schemes.

2. Dense Motion Field Computation and Reduction

2.1. Motion Estimation

A dense motion field is constructed using a hierarchical approach to the Horn and Schunck algorithm, which uses a gradient-based technique to calculate optical flow. (In this paper, no distinction will be made between optical flow calculation and motion estimation.) Each pixel has two independent dimensions to its associated motion vector. The gradient constraint is derived from the assumption that image intensity remains constant: $\frac{dI(x, y, t)}{dt} = 0$. Expanding by the chain rule gives $I_x u + I_y v + I_t = 0$, where $(u, v)$ is the motion vector at the pixel and subscripts denote partial derivatives. This relates the spatial derivatives to the temporal derivative, and constrains the motion in the direction of the intensity gradient. However, this does not constrain motion perpendicular to the intensity gradient. A smoothness constraint on the motion field provides an additional necessary restriction.

Several well-known improvements are incorporated into the Horn and Schunck framework to increase the accuracy of the algorithm. First, hierarchical estimation is used to better handle large motion. This entails computing the estimates at a coarse resolution and projecting the results to the next finer level to be used as an initial solution at that resolution. Second, a short, linear smoothing filter [8] is used prior to calculating the spatial derivatives. This provides substantial improvements to the original derivative estimates of Horn and Schunck. Lastly, the smoothness parameter of the Horn and Schunck equation is allowed to vary as a function of the hierarchical level, to allow for finer control of the hierarchical estimation process. At completion, this estimation procedure yields a dense motion field.

2.2. Motion Field Reduction

As it is impractical and quite inefficient to transmit the entire dense motion field to the decoder, it is necessary to reduce this data while retaining the advantage it provides: access to true motion estimates at the pixel level. Pixel-level estimates allow motion boundaries to be detected and thus the problems of multiple motions within a block are avoided. This feature is captured effectively through a multiresolution morphological pyramid.

Structurally, the multiresolution morphological pyramid is identical to the standard Laplacian pyramid.
It consists of a coarse estimate of the motion field and detail layers at finer resolutions. The coarse estimate is obtained through a sequence of smoothing and subsampling operations. In a standard Laplacian pyramid, a linear filter is used for the smoothing operation. However, in a morphological pyramid, a series of dilation and erosion operations is used to smooth the data prior to subsampling. Dilation and erosion are defined respectively as

$$(F \oplus A)(z) = \sup_{h \in A} F_h(z) \tag{1}$$

$$(F \ominus A)(z) = \inf_{h \in A} F_{-h}(z) \tag{2}$$

In each case, $F$ is the input signal, $A$ is the structuring element, and $F_h(z) = F(z - h)$. The structuring element acts as a window, indicating which values of $F$ to consider in the sup or inf operations. Dilation serves to remove isolated minima while erosion removes isolated maxima. In sequence, they can be used as a low-pass smoothing filter. Depending on the order of operations, the filtering sequence is defined as an opening or closing.

$$F \circ A = (F \ominus A) \oplus A \quad \text{(opening)} \tag{3}$$

$$F \bullet A = (F \oplus A) \ominus A \quad \text{(closing)} \tag{4}$$

Morphological pyramids and morphological sampling are discussed in more detail in the literature [2, 3, 4]. For the proposed ME/C scheme, the filtering is performed through the application of a closing followed by an opening, and reconstruction is achieved through upsampling and dilation, a scheme described in [2]. The structuring element, $A$, is 2 × 2.

The nonlinear morphological filtering process avoids the averaging of motion vectors that occurs in linear filtering. Instead, the various levels of a morphological decomposition contain samples of (motion) objects large enough not to have been suppressed by the filtering process, and the efficient coding of this information resembles a non-uniform sampling procedure, as seen in Figure 1.

Figure 1. Morphological Pyramid Example (panels: morphological pyramid and motion difference fields at Level 2, Level 1 and Level 0)

In this figure, different grayscales represent different motion regions.
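
As an illustration of Eqs. (1)-(4) and of the close-open reduction described above, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single scalar motion component with even dimensions, edge replication at the frame borders, and subsampling at even indices. Reconstruction is taken to be 2 × 2 replication, which is what upsampling followed by a 2 × 2 dilation amounts to. All function names are hypothetical.

```python
import numpy as np

def dilate2x2(f):
    # Eq. (1): max of F(z - h) over the 2x2 window h in {0,1}^2 (assumed edge padding).
    p = np.pad(f, ((1, 0), (1, 0)), mode="edge")
    return np.maximum.reduce([p[1:, 1:], p[:-1, 1:], p[1:, :-1], p[:-1, :-1]])

def erode2x2(f):
    # Eq. (2): min of F(z + h) over the same 2x2 window.
    p = np.pad(f, ((0, 1), (0, 1)), mode="edge")
    return np.minimum.reduce([p[:-1, :-1], p[1:, :-1], p[:-1, 1:], p[1:, 1:]])

def opening(f):
    # Eq. (3): erosion followed by dilation.
    return dilate2x2(erode2x2(f))

def closing(f):
    # Eq. (4): dilation followed by erosion.
    return erode2x2(dilate2x2(f))

def reduce_level(f):
    # Closing followed by opening, then subsampling by 2 (even indices assumed).
    return opening(closing(f))[::2, ::2]

def expand_level(c):
    # Upsample-and-dilate modelled as 2x2 replication of each coarse value.
    return np.repeat(np.repeat(c, 2, axis=0), 2, axis=1)

def build_pyramid(field, levels):
    # Returns the coarsest field and the difference fields, finest first.
    diffs, cur = [], field
    for _ in range(levels):
        coarse = reduce_level(cur)
        diffs.append(cur - expand_level(coarse))
        cur = coarse
    return cur, diffs

def reconstruct(coarse, diffs):
    # Expand from the coarsest level, adding each difference field back in.
    cur = coarse
    for d in reversed(diffs):
        cur = expand_level(cur) + d
    return cur

# Toy field: zero background with one 2x2 region of motion value 3.
field = np.zeros((8, 8))
field[2:4, 4:6] = 3.0
coarse, diffs = build_pyramid(field, levels=2)
restored = reconstruct(coarse, diffs)
```

In this toy field the 2 × 2 region survives the first reduction as a single coefficient and is suppressed by the second, so the coarsest level holds only background and the whole region is carried by one nonzero value in the level-1 difference field.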
At the coarsest level, the smaller motion regions have been filtered out, and only the background remains. The smaller motion regions are added through the difference fields at different scales. A non-uniform sampling approach is enacted by transmitting only the necessary coefficients of the difference fields.

2.3. Motion Field Coding

The morphological pyramid can be described as a tree with the nodes at depth one corresponding to the coarsest level motion estimates, and each node (excluding the root node and the leaves) having four children representing motion refinements. This is a direct result of the choice of a 2 × 2 structuring element for the dilating reconstruction process. During reconstruction, the motion field is upsampled and then a dilation is performed. Because of the 2 × 2 structuring element, dilation corresponds to replication in a 2 × 2 window, and so refinements (children) are dependent on only one (parent) value from the previous level.

For each (non-root) node in the tree, a sum of squared errors (SSE) term is calculated that corresponds to the change in SSE of the residual image resulting from the coding of this node during motion compensation. The nodes are then ranked by the distortion improvement they produce. This provides a distortion-based framework for coding motion, whereby the nodes can be coded in distortion-ranked order. However, rate is also considered in the coding process, as variable length codes are used to code the location and motion value of an improvement (node). Special shorter codewords exist for 4-neighbors and children of the previously coded node. Hence both rate and distortion are considered in the coding of the motion field.

The decoded motion field can be obtained from the coded version by beginning with the motion coded at the coarsest level, repeatedly reconstructing by upsampling and dilation, and adding in the refinements coded at each level.

3. Experimental Results

The morphological pyramid (MP) motion compensation scheme was tested on a variety of sequences and bit rates, and compared to full-search block matching (BM) on 16 × 16 blocks with half-pel accuracy. Displaced frame differences (DFD) were coded both by block-based discrete cosine transforms (DCT) and by the wavelet SPIHT algorithm [7]. SPIHT coding of the DFD yielded better results for both compensation methods, demonstrating both a slight PSNR improvement and improved visual quality (decreased blockiness), and hence only these results are given. Experiments were performed on SIF-sized frames (352 × 240) at 30 frames/second.
Groups of Pictures (GOP) were of size 10, with an I frame followed by nine P frames. Results are presented for bit rates in the ranges of 0.7-1 Mb/sec ("high") and 100 kb/sec ("low").

3.1. High Bit Rates

Mobile Calendar at 1 Mb/sec

The Mobile Calendar sequence was coded with 0.8 b/pixel for I frames and approximately 0.3 b/pixel for P frames. Figure 2 shows the PSNR results for the fully coded sequence. The MP method outperformed the BM method by an average of 0.2 dB. The visual results of the two methods were very similar, with the large bit rate allowing for blocking artifacts to be eliminated during residual coding. MP allocated an average of nearly 2400 motion bits/frame, while BM allocated on average 1200 motion bits/frame.

Figure 2. PSNR Comparison of MP and BM at High Bit Rates (Mobile Calendar at 1 Mb/sec; PSNR versus frame number for MP and BM)

However, the number of motion bits in MP is configurable, and provides an insight into the limitations of motion compensation. Figure 3 shows this for Mobile Calendar Frame 2. As the number of bits allocated to motion increases, the PSNR of the compensated frame rapidly reaches an asymptote, as shown in Graph (a). The autocorrelation of the resulting residual frame decreases, though less quickly (Graph (b)), and indicates that the residuals become noisier and less correlated. The combination of these two effects results in the PSNR of the coded frame reaching its maximum when the compensated frame is within 0.5 dB of its asymptote and the residual correlation is still over 0.4 (Graph (c)). The MP method is robust in this instance in terms of the range of motion bits that can be allocated and still maintain optimal PSNR coding of the image.

Flower Garden at 700 kb/sec

The Flower Garden sequence was coded with 0.6 b/pixel for I frames and approximately 0.2 b/pixel for P frames.
Performance of the two methods was nearly equivalent on this sequence, with PSNR results differing by less than 0.07 dB and visual appearance similar as well. Lower autocorrelation of the residual for MP (0.46 vs. 0.50 for BM) resulted in less efficient residual coding, negating gains from motion compensation.

Figure 3. Limits of Motion Compensation (Graph (a): PSNR of motion compensated frame; Graph (b): autocorrelation of residual; Graph (c): PSNR of coded frame)

3.2. Low Bit Rates

Mother and Daughter at 100 kb/sec

To provide a comparison of MP and BM at low bit rates, the Mother and Daughter sequence was coded at 100 kb/sec. I frames were coded at 0.125 b/pixel, while P frames were coded at approximately 0.025 b/pixel. As seen in Figure 4, the PSNR of the two methods was nearly identical, differing on average by less than 0.05 dB. However, the MP method exhibited improved visual quality, due to finer resolution motion coding. This can be seen in Figure 5, which contains a section of Frame 16, coded with equal bits allocated to motion compensation for each method.

Figure 4. PSNR Comparison of MP and BM at Low Bit Rates (PSNR versus offset from starting frame; BM and MP curves for Mother and Daughter and for Table Tennis)

Table Tennis at 100 kb/sec

The Table Tennis sequence was also coded at 100 kb/sec, with the same parameters as in Mother and Daughter. The PSNR results are given again in Figure 4, indicating the quantitatively similar performance of the two methods. The MP method yielded visually superior results, again alleviating the blocking artifacts exhibited at low bit rates by the BM method.

As it is coding a dense motion field containing true motion vectors, the MP method is also capable of delineating moving objects in a scene. Figure 6 shows the locations of motion refinements for increasing allocations of motion bits for Table Tennis Frame 92. Squares inside of squares indicate multiple levels of refinement. As more bits are allocated, a more refined segmentation of the object(s) in motion is possible.

3.3. Discussion

In general, the MP scheme results in a compensated frame with equivalent or higher PSNR than BM when allocating bits equivalent to those used in BM compensation. However, the coding of the dense motion field often produces noisier, less correlated residuals, which contain more high frequency information than the BM residuals, and are coded less efficiently.
As a result, at higher total bit rates, most of the coding gain from the compensation stage is often lost during the residual coding stage. This suggests a limit to the level at which improved motion compensation yields improved compression in a residual-oriented coder. At lower rates, for which bits available for residual coding are scarcer, the improvements of MP over BM are more visually pronounced. Motion boundaries are more precisely defined and blocking artifacts are alleviated through the coding of fine resolution motion blocks.

The computational complexity of the MP method is comparable to the BM method as well as other related methods [6]. The high cost of the sorting of SSE improvements is alleviated by recognizing that only a small subset of the possible fixes needs to be saved. Due to the limited bit budget for motion compensation, only a small fraction ( 1%) of the possible fixes needs to be maintained in a sorted list. The computation of the dense field is the most expensive component of the algorithm, and this can be improved with a slight tradeoff (0.1-0.2 dB) in compensation quality by either decreasing the iterations in each stage of the hierarchical motion estimation algorithm or by eliminating the last stage of the estimation algorithm, at which the finest scale refinements of the dense field are calculated.

Figure 5. Visual Comparison of MP (top) and BM (bottom)

Figure 6. Moving Object Delineation by Motion Compensation (80 bits, 400 bits, 1200 bits)

4. Conclusions

In this paper a new method for motion estimation and compensation in video coding is presented. A morphological pyramid is used to reduce a dense motion field and code the most significant information in a rate-distortion optimized manner. It is comparable to block-matching motion compensation at MPEG-1 coding rates, and provides improved visual quality at low bit rates by eliminating blocking artifacts. Additionally, the refined motion estimation yields information that can be applied to moving object segmentation and other techniques.

References

[1] P. J. Burt and E. H. Adelson. The Laplacian pyramid as a compact image code. IEEE Transactions on Communications, 31(4):532-540, April 1983.
[2] D. A. F. Florêncio and R. W. Schafer. Homotopy and critical morphological sampling. Proc. SPIE, 08:97-109, June 1994.
[3] R. M. Haralick, C. Lin, J. S. J. Lee, and X. Zhuang. Multiresolution morphology. Proceedings, IEEE First International Conference on Computer Vision, pages 516-520, 1987.
[4] H. J. A. M. Heijmans and A. Toet. Morphological sampling. CVGIP: Image Understanding, 54(3):384-400, November 1991.
[5] B. K. P. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 17:185-203, 1981.
[6] P. Moulin, R. Krishnamurthy, and J. W. Woods. Multiscale modeling and estimation of motion fields for video coding. IEEE Transactions on Image Processing, 6(12):1606-1620, December 1997.
[7] A. Said and W. A. Pearlman. A new, fast, and efficient image codec based on set partitioning in hierarchical trees. IEEE Transactions on Circuits and Systems for Video Technology, 6(3):243-249, June 1996.
[8] E. Simoncelli. Distributed Representation and Analysis of Visual Motion.
PhD thesis, Massachusetts Institute of Technology, Cambridge, 1993. Available by anonymous ftp from whitechapel.mit.edu.