Efficient Method for Half-Pixel Block Motion Estimation Using Block Differentials

Tuukka Toivonen and Janne Heikkilä

Machine Vision Group, Infotech Oulu and Department of Electrical and Information Engineering, P. O. Box 4500, FIN-90014 University of Oulu, Finland
{tuukkat,jth}@ee.oulu.fi

Abstract. We present an efficient method for performing half-pixel accuracy block motion estimation, as required by common video coding standards such as H.263 and MPEG-4. The estimation quality is superb, in some cases even slightly better than that of the conventional method, but with 44% less computation. Alternatively, computation can be decreased by 94% with only a small penalty in quality. The method directly interpolates the sum of squared or absolute differences (SSD or SAD) matching criterion from integer pixel positions and subtracts a term based on horizontal, vertical, and diagonal differentials obtained from the search area.

1 Introduction

Most video coding standards, such as MPEG-4 and H.263, use block motion estimation (ME) and compensation (MC) to remove temporal redundancy. Each frame in a video sequence is divided into blocks of typically 16 × 16 picture elements (pixels). Each current block $B$ is compared with overlapping candidate blocks $C$ in the search area of the previous frame, and the displacement between the current block and the most similar candidate block is used as the motion vector for the current block. Typical criteria for measuring the similarity are the sum of absolute differences (SAD) and the sum of squared differences (SSD):

$$E_{y,x} = \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} \left| B_{h,w} - C_{h,w}(y, x) \right|^p \tag{1}$$

where $E_{y,x}$ denotes the criterion value for a candidate motion vector $(y, x)$ corresponding to the candidate block $C(y, x)$. The block size is $H \times W$ pixels, and $p = 1$ for SAD and $p = 2$ for SSD. Block elements are denoted as $X_{h,w}$ for an element or pixel at $(h, w)$.
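As an illustration of criterion (1), the following minimal Python sketch evaluates SAD or SSD for one candidate block and runs an exhaustive integer-pixel search. The function names (`criterion`, `candidate_block`, `integer_search`) and the representation of frames as lists of rows are assumptions made for this example only; they do not come from any encoder.

```python
def criterion(current, candidate, p):
    """Eq. (1): sum of |B - C|^p over the block; p=1 is SAD, p=2 is SSD."""
    return sum(abs(b - c) ** p
               for row_b, row_c in zip(current, candidate)
               for b, c in zip(row_b, row_c))


def candidate_block(frame, y, x, height, width):
    """Extract the H x W block whose top-left pixel is at (y, x)."""
    return [row[x:x + width] for row in frame[y:y + height]]


def integer_search(current, frame, p=1):
    """Test every integer position; return (best_value, (y, x))."""
    height, width = len(current), len(current[0])
    best = None
    for y in range(len(frame) - height + 1):
        for x in range(len(frame[0]) - width + 1):
            block = candidate_block(frame, y, x, height, width)
            value = criterion(current, block, p)
            if best is None or value < best[0]:
                best = (value, (y, x))
    return best


# Toy data: a 2x2 block cut out of a 4x4 frame is found again at (1, 1).
frame = [[1, 2, 3, 4],
         [5, 9, 7, 8],
         [9, 6, 2, 3],
         [4, 5, 6, 0]]
current = candidate_block(frame, 1, 1, 2, 2)
print(integer_search(current, frame, p=2))  # prints (0, (1, 1))
```

A practical encoder would restrict the search to a window around the block position and use a fast search strategy; the exhaustive loop is used here only because it is the simplest correct instance of the criterion.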
The SSD criterion typically gives slightly better image quality than SAD, but the latter is more widely used because of its smaller computational complexity.

Copyright © 2003 Springer-Verlag. Published in the 2003 International Workshop VLBV, scheduled for September 18-19, 2003 in Madrid, Spain. Included in Lecture Notes in Computer Science 289: Visual Content Processing and Representation, available for purchase from http://www.springeronline.com/sgw/cda/frontpage/0,10735,3--22-9322801-0,00.html.

In practice, the displacement of an object between two subsequent frames in a video is not an integer number of pixels. Therefore, modern coding standards also employ fractional pixel motion estimation, in which motion vectors may
point to candidate blocks placed at half-pixel (or sometimes quarter-pixel) locations. As defined in most standards, the pixel values in these fractional candidate blocks are obtained by linearly or bilinearly interpolating the nearest pixels at integer locations. If the motion vector $(y, x)$ points to an integer location, then the pixels of the horizontally half-pixel candidate block $C(y, x + \tfrac{1}{2})$ are obtained by

$$C_{h,w}\left(y, x + \tfrac{1}{2}\right) = \tfrac{1}{2} C_{h,w}(y, x) + \tfrac{1}{2} C_{h,w}(y, x + 1) \tag{2}$$

for $h = 0 \ldots 15$ and $w = 0 \ldots 15$ (from now on, the pixel indices $h$ and $w$ are dropped for convenience). Vertically half-pixel candidate blocks are obtained similarly, and when both motion vector components are fractional,

$$C\left(y + \tfrac{1}{2}, x + \tfrac{1}{2}\right) = \tfrac{1}{4} C(y, x) + \tfrac{1}{4} C(y, x + 1) + \tfrac{1}{4} C(y + 1, x) + \tfrac{1}{4} C(y + 1, x + 1). \tag{3}$$

That is, a motion vector pointing to a fractional candidate block can be thought of as pointing to several candidate blocks at integer locations, whose average is used for motion compensation. The averaging not only compensates for noninteger displacement but also filters out fast image variations and noise. Therefore, a candidate block at a fractional location usually gives a better match than one at an integer location, as shown in Fig. 1.

[Figure 1: plot of the SSD value against the motion vector X coordinate.]
Fig. 1. Behavior of typical fractional pixel motion compensation.
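The interpolation rules (2) and (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `half_pixel_block` is hypothetical, motion vector components are given in non-negative half-pixel units, and the averages are left unrounded, whereas the standards specify integer rounding.

```python
def half_pixel_block(frame, y2, x2, height, width):
    """Candidate block at (y2/2, x2/2); y2 and x2 are in half-pixel units."""
    y, x = y2 // 2, x2 // 2      # integer part of the position
    dy, dx = y2 % 2, x2 % 2      # 1 when the component is fractional
    block = []
    for h in range(height):
        row = []
        for w in range(width):
            a = frame[y + h][x + w]
            b = frame[y + h][x + w + dx]
            c = frame[y + h + dy][x + w]
            d = frame[y + h + dy][x + w + dx]
            # With dx = dy = 0 all four samples coincide (integer position);
            # with one offset set this reduces to the two-pixel mean of (2);
            # with both set it is the four-pixel mean of (3).
            row.append((a + b + c + d) / 4)
        block.append(row)
    return block


frame = [[0, 4, 8],
         [8, 12, 16],
         [16, 20, 24]]
print(half_pixel_block(frame, 0, 1, 2, 2))  # prints [[2.0, 6.0], [10.0, 14.0]]
print(half_pixel_block(frame, 1, 1, 2, 2))  # prints [[6.0, 10.0], [14.0, 18.0]]
```

Reading four samples in every case keeps the sketch short; an implementation concerned with speed would branch on `dx` and `dy` instead.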
2 Previous Search Methods

Conventional encoders perform motion estimation in two steps to save computation. First, they find the criterion minimum at an integer location (IL). Then the search area around the best integer candidate block is interpolated to a higher resolution, and the motion vector is refined to sub-pixel accuracy by computing the criterion between the current block and, usually, the eight half-pixel candidate blocks nearest to the best integer motion vector. However, this requires much computation and may be difficult to perform in real-time encoders. Thus, faster methods have been investigated.

We assume that the criterion values at the eight integer locations surrounding the best integer motion vector have been evaluated and stored in memory, so that they are available during the half-pixel motion estimation without extra computation. This is easily achieved with many fast full search algorithms [4,5], and it is a reasonable assumption even with other fast search methods, because it guarantees that none of the nearest integer locations has a smaller criterion value than the best match found.

Lee et al. [1] propose testing only the four most promising of the eight half-pixel locations, which halves the criterion computations. The eight surrounding criterion values at integer positions are used to decide which half-pixel locations are selected. However, the total computation, compared with the conventional method (CM), decreases by only 38%, because the candidate blocks still need to be interpolated to obtain half-pixel blocks. This increases memory accesses and the amount of memory required for motion estimation. The quality is also slightly lower than with the conventional method.

A straightforward alternative is to interpolate the criterion values directly from the integer locations to fractional motion vector locations and to select the motion vector corresponding to the smallest value.
Some examples presented in the literature are the linear interpolation method (LIM) and the quadratic fit method (QFM) [3]. Both the candidate block interpolation and the direct criterion computation are avoided. Unfortunately, the result is poor, because low-order polynomials cannot approximate the behavior of the fractional matching criterion well, as is evident from Fig. 1.

A more interesting interpolation technique, the MAE approximation method (MAM), was presented by Senda et al. [2]: the half-pixel criterion is interpolated linearly from the two or four nearest integer locations and weighted with a constant factor, $E_{y,x+1/2} = \psi_{hv} (E_{y,x} + E_{y,x+1}) / 2$ and $E_{y+1/2,x+1/2} = \psi_d (E_{y,x} + E_{y,x+1} + E_{y+1,x} + E_{y+1,x+1}) / 4$. The factor is $\psi_{hv}$ horizontally and vertically and $\psi_d$ diagonally. However, it is not clear how to choose the factors: the best values must be obtained experimentally, and they depend on the video content and the encoder bitrate. Even when the optimal values for a certain sequence are used, the encoded quality is clearly worse than with the conventional half-pixel search, as shown in Table 1.

In a later paper [3], Senda derives horizontally and vertically half-pixel SSD values from integer locations and block differentials and applies the results to approximate the factor $\psi_{hv}$ for SAD. However, he does not consider diagonal
cases. We will extend Senda's derivation to diagonally half-pixel locations in the next section and show that it is not necessary to compute the factor at all: an expensive division is avoided and fewer approximations are required. We also investigate fast algorithms for computing the differentials in Section 4.

3 Half-Pixel Criterion

Let us compute the SSD directly at a horizontally half-pixel location by substituting (2) into (1):

$$E_n = \left[ B - \tfrac{1}{2} C(y, x) - \tfrac{1}{2} C(y, x + 1) \right]^2 \tag{4}$$

where $E_n$ is a single term of the sum in (1). By expanding the square and rearranging the terms, we get

$$E_n = \tfrac{1}{2} B^2 + \tfrac{1}{4} C(y, x)^2 - B\,C(y, x) + \tfrac{1}{2} C(y, x)\,C(y, x + 1) + \tfrac{1}{2} B^2 + \tfrac{1}{4} C(y, x + 1)^2 - B\,C(y, x + 1). \tag{5}$$

This can be factored into squares, yielding

$$E_n = \tfrac{1}{2} \left[ B - C(y, x) \right]^2 + \tfrac{1}{2} \left[ B - C(y, x + 1) \right]^2 - \tfrac{1}{4} \left[ C(y, x) - C(y, x + 1) \right]^2. \tag{6}$$

By summing over $h = 0 \ldots H - 1$ and $w = 0 \ldots W - 1$, we get

$$E_{y,x+\frac{1}{2}} = \tfrac{1}{2} E_{y,x} + \tfrac{1}{2} E_{y,x+1} - \tfrac{1}{4} H_{y,x} \tag{7}$$

where $H$ is the horizontal differential of a candidate block

$$H_{y,x} = \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} \left| C_{h,w}(y, x) - C_{h,w}(y, x + 1) \right|^p \tag{8}$$

for SSD with $p = 2$. The SSD criterion can be derived similarly for vertically half-pixel locations, in which case the vertical differential of the candidate block

$$V_{y,x} = \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} \left| C_{h,w}(y, x) - C_{h,w}(y + 1, x) \right|^p \tag{9}$$

is required instead of the horizontal differential. For diagonally half-pixel locations, we substitute (3) into (1):

$$E_n = \left[ B - \tfrac{1}{4} C(y, x) - \tfrac{1}{4} C(y, x + 1) - \tfrac{1}{4} C(y + 1, x) - \tfrac{1}{4} C(y + 1, x + 1) \right]^2. \tag{10}$$
We proceed in the same manner as in the horizontal case, expanding the square. By rearranging the terms and factoring into squares, we get

$$\begin{aligned}
E_n ={}& \tfrac{1}{4} [B - C(y, x)]^2 + \tfrac{1}{4} [B - C(y, x + 1)]^2 + \tfrac{1}{4} [B - C(y + 1, x)]^2 + \tfrac{1}{4} [B - C(y + 1, x + 1)]^2 \\
&- \tfrac{1}{16} [C(y, x) - C(y, x + 1)]^2 - \tfrac{1}{16} [C(y, x) - C(y + 1, x)]^2 - \tfrac{1}{16} [C(y, x) - C(y + 1, x + 1)]^2 \\
&- \tfrac{1}{16} [C(y + 1, x) - C(y, x + 1)]^2 - \tfrac{1}{16} [C(y + 1, x) - C(y + 1, x + 1)]^2 - \tfrac{1}{16} [C(y, x + 1) - C(y + 1, x + 1)]^2.
\end{aligned} \tag{11}$$

Finally, by summing the terms, the SSD criterion is

$$\begin{aligned}
E_{y+\frac{1}{2},x+\frac{1}{2}} ={}& \tfrac{1}{4} E_{y,x} + \tfrac{1}{4} E_{y,x+1} + \tfrac{1}{4} E_{y+1,x} + \tfrac{1}{4} E_{y+1,x+1} \\
&- \tfrac{1}{16} H_{y,x} - \tfrac{1}{16} H_{y+1,x} - \tfrac{1}{16} V_{y,x} - \tfrac{1}{16} V_{y,x+1} - \tfrac{1}{16} N_{y,x} - \tfrac{1}{16} S_{y,x}
\end{aligned} \tag{12}$$

where $H$ and $V$ are the horizontal and vertical block differentials defined in (8) and (9), and $N$ and $S$ are the diagonal differentials in the northwest-southeast and southwest-northeast directions, respectively. The value of the half-pixel SSD criterion between four integer locations is thus the average of the SSD values at the integer locations minus the weighted differentials. The differentials are shown in Fig. 2 as arrows, where the integer pixel locations are denoted as filled circles and the half-pixel location as an open circle.

The SAD criterion cannot be derived similarly for the half-pixel locations. However, as pointed out by Senda [3], there is a close relation between SAD and SSD. We can approximate the SAD criterion value by using $p = 1$ in the differential (8), in which case the approximated horizontally half-pixel SAD is

$$E_{y,x+\frac{1}{2}} \approx \sqrt{\tfrac{1}{2} E_{y,x}^2 + \tfrac{1}{2} E_{y,x+1}^2 - \tfrac{1}{4} H_{y,x}^2} \tag{13}$$

where $E$ is the SAD criterion (1) with $p = 1$. The square root can be removed by squaring both sides of the equation, and the resulting algorithm can use integer pixel SAD values. In the computation of the differentials the multiplication is replaced with an absolute value, although the actual biased interpolation in (13) still requires a few multiplications. The vertical and diagonal cases can be handled in the same manner.
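The SSD identities (7) and (12) can be checked numerically. The sketch below is an illustration, not part of the described encoder: it uses exact rational arithmetic so that both sides can be compared for equality, it takes unrounded interpolation as in (2) and (3), and it draws the four candidate blocks independently at random, since the identities hold for any blocks and do not rely on their overlap. All function names are hypothetical. The last helper evaluates the SAD approximation (13), for which no exact identity exists.

```python
import math
import random
from fractions import Fraction


def ssd(a, b):
    """Sum of squared differences between two equally sized blocks."""
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def average(*blocks):
    """Pixel-wise unrounded mean of the given blocks, as in (2) and (3)."""
    n = len(blocks)
    return [[Fraction(sum(px), n) for px in zip(*rows)] for rows in zip(*blocks)]


def half_ssd_horizontal(e0, e1, h):
    """Eq. (7): half-pixel SSD from two integer SSDs and one differential."""
    return Fraction(e0 + e1, 2) - Fraction(h, 4)


def half_ssd_diagonal(e, diffs):
    """Eq. (12): e holds the four integer SSDs, diffs the six differentials."""
    return Fraction(sum(e), 4) - Fraction(sum(diffs), 16)


def half_sad_horizontal(e0, e1, h):
    """Eq. (13): approximate half-pixel SAD from SADs and a p = 1 differential."""
    return math.sqrt(e0 * e0 / 2 + e1 * e1 / 2 - h * h / 4)


rng = random.Random(0)  # arbitrary seed; any blocks satisfy the identities
B, C00, C01, C10, C11 = ([[rng.randrange(256) for _ in range(4)]
                          for _ in range(4)] for _ in range(5))

# Horizontal case: SSD against the interpolated block versus Eq. (7).
direct_h = ssd(B, average(C00, C01))
formula_h = half_ssd_horizontal(ssd(B, C00), ssd(B, C01), ssd(C00, C01))

# Diagonal case: direct SSD versus Eq. (12); differentials H, H, V, V, N, S.
direct_d = ssd(B, average(C00, C01, C10, C11))
formula_d = half_ssd_diagonal(
    [ssd(B, C00), ssd(B, C01), ssd(B, C10), ssd(B, C11)],
    [ssd(C00, C01), ssd(C10, C11), ssd(C00, C10), ssd(C01, C11),
     ssd(C00, C11), ssd(C10, C01)])

print(direct_h == formula_h, direct_d == formula_d)  # prints True True
```

Here `C00`, `C01`, `C10`, and `C11` stand for $C(y, x)$, $C(y, x+1)$, $C(y+1, x)$, and $C(y+1, x+1)$.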
[Figure 2: the differentials H, V, N, and S drawn as arrows.]
Fig. 2. Differentials for half-pixel motion estimation.

[Figure 3: sliding window panels with the subtracted and added parts marked.]
Fig. 3. Computing the differentials using the sliding window method. The shaded area denotes a single candidate block.

4 Computing the Differentials

To compute the SSD values of the eight or sixteen half-pixel locations nearest to the best integer pixel match, we need six horizontal, six vertical, and eight diagonal differentials, twenty in total. However, the candidate blocks whose differentials are computed mostly overlap. We can first compute the differentials of the columns of the topmost candidate blocks and save the results. Summing the first $W$ of these yields the differential of the first candidate block, as shown at left in Fig. 3. The differential of the next topmost candidate block is obtained from the previous one by subtracting the differential of the leftmost column of the previous candidate block and adding the differential of the new rightmost column (at middle in the figure). This is repeated for all two or three blocks in the row. After each row, the stored column differentials are updated by subtracting the differentials at the topmost pixel row and adding the new differentials at the bottommost pixel row (at right in the figure). Then the process repeats, again obtaining the differential of the first candidate block in the second row by summing the first $W$ stored differentials. This part is very similar to the computation of the reference block norm, which is described in greater detail in [5]. Using this sliding window (SW) technique for computing the horizontal, vertical, and the two diagonal differentials in distinct orientations, the twenty differentials are obtained, and the exact
SSD criterion values can be computed for either the eight or sixteen nearest half-pixel locations with essentially the same number of operations.

Table 1. PSNR in decibels of the predicted images and operation counts for half-pixel motion estimation

| Method | IL | CM | MAM | SW | Max | Sub | IL | CM | MAM | SW | Max | Sub |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Criterion | SSD | SSD | SSD | SSD | SSD | SSD | SAD | SAD | SAD | SAD | SAD | SAD |
| Foreman | 29.57 | 31.9 | 30.85 | 31.9 | 31.3 | 31.09 | 29.0 | 31.25 | 30.62 | 31.10 | 31.11 | 30.51 |
| Munchener | 21.50 | 23.28 | 22.9 | 23.35 | 23.33 | 23.27 | 21.6 | 23.17 | 22.81 | 23.15 | 23.12 | 23.08 |
| Stefan | 22.51 | 2.70 | 2.32 | 2.71 | 2.67 | 2.59 | 22.9 | 2.61 | 2.2 | 2.60 | 2.56 | 2.50 |
| Tempete | 23.76 | 25.92 | 25.5 | 25.92 | 25.88 | 25.86 | 23.77 | 25.83 | 25.6 | 25.82 | 25.80 | 25.77 |
| Tourists | 21.1 | 21.78 | 21.60 | 21.90 | 21.88 | 21.87 | 21.06 | 21.70 | 21.51 | 21.69 | 21.68 | 21.67 |
| Average | 23.69 | 25.3 | 25.05 | 25.7 | 25. | 25.3 | 23.6 | 25.31 | 2.93 | 25.27 | 25.25 | 25.10 |
| Additions | | 5068 | 8 | 2700 | 940 | 296 | | 5068 | 8 | 2684 | 924 | 280 |
| Multiplications | | 2048 | 8 | 1276 | 450 | 128 | | | 8 | 29 | 11 | 11 |
| Abs. values | | | | | | | | 2048 | | 1276 | 450 | 128 |
| Total | | 7116 | 16 | 3976 | 1390 | 424 | | 7116 | 16 | 3989 | 1385 | 419 |

In practice, one can compute only four differentials (horizontal, vertical, and the two diagonal ones) of a single candidate block located at the best integer pixel motion vector. Since the surrounding candidate blocks overlap almost completely with one another, we can assume that each differential in a particular orientation is constant over all of the blocks. This avoids the somewhat cumbersome calculation using the SW method and reduces the number of arithmetic operations, yet gives a very good approximation of the half-pixel SAD or SSD criteria.

Another approximation, which still maintains good quality, is to refrain from computing the diagonal differentials. These can be estimated well using the maximum of the horizontal and vertical differentials, $N \approx S \approx \max\{H, V\}$. Finally, the differentials can be computed from every other pixel, i.e., from subsampled candidate blocks. This still supplies adequate accuracy for some purposes.
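The sliding window procedure of this section can be sketched for the horizontal differential (8) as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the function names and the list-of-rows search area are hypothetical, and the area is assumed to hold at least `rows + height - 1` pixel rows and `cols + width` pixel columns. The vertical and diagonal differentials follow the same pattern with a different neighbor in `d`.

```python
def horizontal_differentials(area, height, width, rows, cols, p=2):
    """H[y][x] of (8) for block positions 0 <= y < rows, 0 <= x < cols."""
    def d(i, j):
        # Per-pixel differential between horizontally adjacent pixels.
        return abs(area[i][j] - area[i][j + 1]) ** p

    ncol = cols + width - 1
    # Column differentials of the topmost row of candidate blocks.
    column = [sum(d(i, j) for i in range(height)) for j in range(ncol)]
    result = []
    for y in range(rows):
        if y > 0:
            # Move one pixel row down: add the new bottom row of each
            # stored column differential and drop the old top row.
            for j in range(ncol):
                column[j] += d(y + height - 1, j) - d(y - 1, j)
        value = sum(column[:width])        # first block in this row
        row = [value]
        for x in range(1, cols):
            # Slide right: add the new rightmost column, drop the leftmost.
            value += column[x + width - 1] - column[x - 1]
            row.append(value)
        result.append(row)
    return result


def direct_differential(area, y, x, height, width, p=2):
    """Reference computation of (8) for one block, without reuse."""
    return sum(abs(area[y + h][x + w] - area[y + h][x + w + 1]) ** p
               for h in range(height) for w in range(width))


area = [[3, 1, 4, 1, 5, 9],
        [2, 6, 5, 3, 5, 8],
        [9, 7, 9, 3, 2, 3],
        [8, 4, 6, 2, 6, 4],
        [3, 3, 8, 3, 2, 7]]
fast = horizontal_differentials(area, 3, 3, 3, 3)
slow = [[direct_differential(area, y, x, 3, 3) for x in range(3)]
        for y in range(3)]
print(fast == slow)  # prints True
```

Calling the function with `p=1` gives the absolute-value differentials used for the SAD approximation.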
5 Experimental Results

The half-pixel motion estimation methods were implemented in Project Mayo's OpenDivX Core MPEG-4 encoder [6]. Five CIF-sized video sequences, each 200 frames long at 30 frames per second, were encoded at 38 kilobits per second. The coding results are shown in Table 1, which lists the peak signal-to-noise ratio (PSNR) between the predicted and the original frames. For MAM, the factors $\psi_{hv}$ = 13/ and $\psi_d$ = 12/ were used with the SSD criterion and $\psi_{hv}$ = 15/ and $\psi_d$ = 1/ with the SAD criterion, as these produced the best results. For the differential SSD-based methods, sixteen half-pixel positions are tested; for the SAD-based methods, only eight are tested, because this yielded the best outcome. The Max method computes only the vertical and horizontal differentials of a single block; the Sub method is similar, except that the
block is also subsampled by two. The SSD criterion obtained from an interpolated image, as in CM, differs slightly from the one computed directly with the differential SW method, because rounding is not accounted for in the latter. However, this is more than compensated for because the SW method examines twice as many half-pixel positions. Therefore, the SW method with SSD produces the best results, with 44% less computation.

6 Conclusions

We presented a new method for estimating motion vectors at half-pixel accuracy for video encoding. The method is based on computing the SSD or SAD criterion corresponding to half-pixel locations using block differentials and precomputed criterion values at integer locations. The method is very efficient, saving 44% of the computation and avoiding image interpolation as compared to the conventional method, while still yielding better image quality, because sixteen half-pixel positions are tested instead of eight. By sacrificing quality only slightly and approximating the differentials, we can reduce computation by up to 94% with only 0.2 dB loss of predicted image quality.

References

1. K.-H. Lee, J.-H. Choi, B.-K. Lee, and D.-G. Kim: Fast Two-Step Half-Pixel Accuracy Motion Vector Prediction. Electronics Letters 36, no. 7 (2000) 625-627
2. Y. Senda, H. Harasaki, and M. Yano: A Simplified Motion Estimation Using An Approximation for the MPEG-2 Real-Time Encoder. International Conference on Acoustics, Speech, and Signal Processing (1995) 2273-2276
3. Y. Senda, H. Harasaki, and M. Yano: Theoretical Background and Improvement of a Simplified Half-Pel Motion Estimation. Proceedings of International Conference on Image Processing 3 (1996) 263-266
4. T. Toivonen, J. Heikkilä, and O. Silvén: A New Algorithm for Fast Full Search Block Motion Estimation Based on Number Theoretic Transforms. Proceedings of the 9th International Workshop on Systems, Signals, and Image Processing (2002) 90 9
5. Y. Naito, T. Miyazaki, and I.
Kuroda: A Fast Full-Search Motion Estimation Method for Programmable Processors with a Multiply-Accumulator. IEEE International Conference on Acoustics, Speech, and Signal Processing 6 (1996) 3221 322
6. Project Mayo: OpenDivX Core 4.0 alpha 50 (2001-02-2) URL: http://download2.projectmayo.com/dnload/divxcore/encore50src.zip