702 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 7, NO. 4, AUGUST 1997
Low-Complexity Block-Based Motion Estimation via One-Bit Transforms

Balas Natarajan, Vasudev Bhaskaran, and Konstantinos Konstantinides

Abstract: We present an algorithm and a hardware architecture for block-based motion estimation that involves transforming video sequences from a multibit to a one-bit/pixel representation and then applying conventional motion estimation search strategies. This results in substantial reductions in arithmetic and hardware complexity and reduced power consumption, while maintaining good compression performance. Experimental results and a custom hardware design using a linear array of processing elements are also presented.

Index Terms: Architectures, CPU performance, instructions, motion estimation, multimedia, video compression standards.

I. INTRODUCTION

Digital video is typically stored and transmitted in compressed form conforming to the MPEG standards for motion sequences [1]. These standards utilize block-based motion estimation as a technique for exploiting the temporal redundancy in a sequence of images, thereby achieving increased compression. The simplest abstraction of the motion estimation problem is as follows. Given two blocks of pixels, a source block of size b × b and a search window larger than the source block, find the b × b subblock in the search window that is closest to the source block.

Manuscript received September 30, 1996; revised January 31, 1997. This paper was recommended by Guest Editors B. Sheu, C.-Y. Wu, H.-D. Lin, and M. Ghanbari. The authors are with Hewlett-Packard Laboratories, Palo Alto, CA 94304 USA. Publisher Item Identifier S 1051-8215(97)05878-3.

The distance between two blocks can be measured by a number of different metrics [2]; typically the ℓ₁ metric (mean absolute deviation) is used. Using this metric and a search strategy, we can evaluate candidate subblocks of the search window to find the subblock that is closest to the source block.
The search strategy may be exhaustive search, evaluating each one of the candidate blocks from the search window and selecting the one that is closest in appearance to the source block. Or we may employ faster but approximate strategies, such as logarithmic search [1], to find a subblock that is close in appearance to the source block but is not necessarily the closest. Whatever the search strategy, evaluating the ℓ₁ metric on pixels of full intensity resolution is computationally expensive. To overcome this obstacle, we propose to transform the current and reference frames to frames of binary-valued pixels. We then apply one of the conventional search strategies to these frames. The ℓ₁ metric then amounts to computing the exclusive-or of a sequence of bits and adding up the number of ones in the result. This can result in substantial savings in software implementations as well as reduced complexity and power consumption in hardware implementations. Our experiments show that a careful choice of the one-bit transform can realize these gains with a small sacrifice in compression efficiency. Previously, a one-bit modification of the ℓ₁ metric was proposed in [3], and we will compare our approach to theirs later in this paper. Recently and independently, Feng et al. [4] proposed a one-bit transform similar to ours, but exploited it as a preprocessing step to exhaustive search with the ℓ₁ metric. Their approach differs from ours on three counts. 1) They use the block mean as the threshold; however, we have found that the block mean does not offer the best results in our experiments. 2) The complexity of their strategy is roughly six times that of ours. 3) Their strategy is adaptive and not suited for simple hardware implementation at low power consumption. In [5], Mizuki et al. describe a binary block matching architecture where block matching is performed on the binary edge maps of the current and the reference frames.
They also present a custom hardware implementation that includes circuitry for edge detection and a two-dimensional (2-D) array of elementary processors, where the number of elementary processors is equal to the number of candidate blocks for full-search motion estimation. Compared to conventional block matching schemes, they estimate that binary block matching for motion estimation reduces the silicon area required by a factor of five. In Section II, we establish the preliminaries and define the problem; in Section III, we give details of the proposed one-bit transform; in Section IV, we present a custom architecture for the one-bit motion estimation strategy; and in Section V, we present experimental results from applying our technique to sample video sequences.

II. PRELIMINARIES

Let s denote the source block of b × b pixels, with s_{i,j} being the pixel at row i and column j. Similarly, let w denote the search window, with w_{i,j} being the pixel at row i and column j. The subblock of w at position (x, y) is denoted by w_{x,y}, and is the block of b × b pixels w_{x+i,y+j}, for i = 1, 2, ⋯, b and j = 1, 2, ⋯, b. The distance between two blocks u and v can be measured in many metrics, but typically the mean absolute deviation is used. The mean absolute deviation or ℓ₁ metric is given by

\[
\|u, v\|_1 = \frac{1}{b^2} \sum_{i,j} |u_{i,j} - v_{i,j}|. \tag{1}
\]
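As a concrete illustration of (1), the following Python sketch evaluates the mean absolute deviation between two blocks; the block values here are hypothetical, chosen only for the example.

```python
def mad(u, v, b):
    """Mean absolute deviation (the l1 metric of (1)) between two b-by-b
    blocks, given as lists of rows of pixel intensities."""
    return sum(abs(u[i][j] - v[i][j])
               for i in range(b) for j in range(b)) / (b * b)

# Hypothetical 2x2 blocks, for illustration only.
u = [[10, 20], [30, 40]]
v = [[12, 18], [33, 40]]
print(mad(u, v, 2))  # (2 + 2 + 3 + 0) / 4 = 1.75
```

A search strategy simply evaluates this metric for each candidate subblock position and keeps the minimizer.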
Fig. 1. Operations for thresholding 8-b frames to 1-b frames.

For two one-bit images, this metric reduces to

\[
\|u, v\|_1 = \frac{1}{b^2} \sum_{i,j} u_{i,j} \oplus v_{i,j} \tag{2}
\]

where ⊕ denotes the exclusive-or operation. The problem of motion estimation is to find the position (x, y) so that the subblock w_{x,y} is closest to the source block s, in that ‖s, w_{x,y}‖₁ is minimum over all subblocks of w.

III. THE ONE-BIT STRATEGY

We now construct a transform Q that maps a frame of multivalued pixels to a frame of binary-valued pixels. Q is defined with respect to a convolution kernel K and is denoted by Q_K. Let F denote a frame and let F̂ denote the filtered version of F obtained by applying the convolution kernel K to F. Let G = Q_K(F) be the frame obtained by applying Q_K to F. The pixels of G are given by

\[
G_{i,j} = \begin{cases} 1, & \text{if } F_{i,j} \ge \hat{F}_{i,j} \\ 0, & \text{otherwise.} \end{cases} \tag{3}
\]

In this paper, we use the 17 × 17 convolution kernel K given below

\[
K_{i,j} = \begin{cases} \frac{1}{25}, & \text{if } i, j \in \{1, 4, 8, 12, 16\} \\ 0, & \text{otherwise.} \end{cases} \tag{4}
\]

Fig. 2. Flow of operations for 1-b block-based motion estimation.

Fig. 3. Pixel coordinates for a 16 × 16 source block and a [−8, 7] search range.

The motivation behind our method rests on the observation that the edges in an image are key to accurate motion estimation. A simple way to extract the edges is to carry out a high-pass thresholding, that is, compare the frame pixel by pixel to a high-pass filtered version of the frame, and threshold the pixels to zero or one, depending on the outcome of the comparison. Unfortunately, this would also cause the thresholded frame to track the high-frequency noise in the original frame. To overcome this, we use band-pass thresholding, wherein the smoothed version is a band-pass filtered version of the original frame, so that the thresholded frame represents the mid-frequency content of the original frame.
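A software sketch of the transform Q_K of (3) and (4), together with the binary metric of (2), might look as follows in Python. Note two assumptions made by this sketch that the text does not pin down: the 17 × 17 window is centered on the pixel being thresholded, and frame borders are handled by clamping indices.

```python
def one_bit_transform(frame, positions=(1, 4, 8, 12, 16)):
    """Sketch of Q_K from (3)-(4): compare each pixel against the average of
    the 25 samples picked out by the sparse 17x17 kernel (value 1/25 at the
    listed row/column offsets) and threshold to 0 or 1. To stay in exact
    integer arithmetic, 'pixel >= average of 25 samples' is tested as
    '25 * pixel >= sum of 25 samples'. Centering the window on the pixel
    and clamping at the borders are assumptions of this sketch."""
    h, w = len(frame), len(frame[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0
            for di in positions:
                for dj in positions:
                    ii = min(max(i + di - 8, 0), h - 1)  # clamp rows
                    jj = min(max(j + dj - 8, 0), w - 1)  # clamp columns
                    acc += frame[ii][jj]
            out[i][j] = 1 if 25 * frame[i][j] >= acc else 0
    return out


def binary_distance(s, w_xy, b):
    """The metric of (2), up to the constant 1/b^2 (which does not affect
    the argmin): XOR the two 1-b blocks and count the ones."""
    return sum(s[i][j] ^ w_xy[i][j] for i in range(b) for j in range(b))
```

On a constant frame the filtered value equals the pixel value, so every output bit is one; a real frame yields a pattern of zeros and ones that tracks its mid-frequency structure.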
The convolution kernel that we propose is motivated by this consideration, as well as by the need to minimize the number of arithmetic operations. For comparison, [3] uses a block averaging kernel, which corresponds to using low-pass thresholding. The operations for the one-bit transform are shown in Fig. 1. Note that there is no global threshold for all pixels in a frame. For video coding, our one-bit motion estimation strategy consists of the following steps (Fig. 2): 1) apply the one-bit transform Q_K to both the current frame and the reference frame; 2) use any motion-vector search strategy in combination with the metric defined in (2).

IV. THE ARCHITECTURE

The proposed one-bit motion estimation strategy is amenable to both single-processor-based and multiprocessor-based implementations. Experimental results using full-search and logarithmic-search strategies on a single-processor, 32-b architecture are presented in the next section. In this section, we consider the design and performance of a custom architecture based on a linear array of custom, but simple, processing elements. Without loss of generality, let us consider the architecture for a block-matching full-search motion estimator for blocks of 16 × 16 pixels and a search range of [−8, 7] pixels. If we shift the coordinate system so that there are only positive pixel indexes, Fig. 3 shows the pixel coordinates for the 16 × 16 source block and the 31 × 31 search window. We assume that the one-bit transform has been completed and all pixels have binary values. For example, if r_{i,j} denotes a pixel in the source block, then r_{i,j} is either zero or one. Let R_i = [r_{i,0}, r_{i,1}, ⋯, r_{i,15}] denote the ith row of the source block, and let V_{i,j} = [v_{i,j}, v_{i,j+1}, ⋯, v_{i,j+15}] denote a 16-b vector from the search window, starting from location (i, j), where i ∈ [0, 30] and j ∈ [0, 15]. The problem of motion estimation can be expressed as
TABLE I. PIXEL FLOW FOR COMPUTING THE FIRST 16 DISTORTIONS USING A 16-PROCESSOR ARRAY

Fig. 4. Linear array for motion estimation.

TABLE II. NUMBER OF OPERATIONS PER PIXEL FOR FULL SEARCH AND LOGARITHMIC SEARCH WITH AND WITHOUT THE ONE-BIT TRANSFORM

Fig. 5. Processor architecture for motion estimation.

TABLE III. AVERAGE PSNR VALUES FOR THE SEARCH STRATEGIES (BLOCK SIZE: 16 × 16; MOTION-VECTOR SEARCH RANGE: ±15 PIXELS; VIDEO FRAME SIZE: 352 × 240)

finding k, l ∈ [0, 15] for which

\[
D(k, l) = \sum_{i=0}^{15} f(R_i, V_{i+k,l}) \tag{5}
\]

is minimized, where f(·) is defined as

\[
f(R_i, V_{i+k,l}) = \sum_{j=0}^{15} r_{i,j} \oplus v_{i+k,j+l}. \tag{6}
\]

That is, the f(·) function counts the number of bit positions in which the R and V binary vectors differ. From (5), each V vector is used in the computation of multiple distortion values. For example, V_{15,0} (Fig. 3) is used in the computation of 16 distortions, namely D(0, 0), D(1, 0), ⋯, D(15, 0). Hence, if the V vectors are distributed to multiple processors, then one can compute multiple distortions in parallel. Fig. 4 shows such an implementation using an array of 16 processors. This is similar to the implementation in [6], except that each processor operates on 16-b vectors instead of on 8-b pixels. The architecture of each processor is shown in detail in Fig. 5. The f(·) function defined in (6) is computed using two 8-b exclusive-or arrays, a dual-port look-up table (LUT) with 256 entries, and a 4-b adder. The look-up table yields the total number of ones (i.e., mismatches) at the output of each exclusive-or array. One xor-array operates on the eight most significant bits of the R and V vectors, and the other on the eight least significant bits. Table I shows in more detail the data flow of operations on processors PE-0, PE-1, and PE-15 for the computation of the first 16 distortion values.
At t = 0, only PE-0 is active, with binary vectors R_0 and V_{0,0} as inputs. At t = 1, PE-0 processes R_1 and V_{1,0}, and PE-1 processes R_0 and V_{1,0}. Following this approach, D(0, 0), in PE-0, will be ready after t = 15, and all of the first 16 distortion values will be computed in 16 + 15 = 31 cycles. However, as shown in Table I, by using two ports for the search memory, processing of the next set of distortions can begin at t = 16. As shown in Fig. 5, a multiplexor in each processor selects the appropriate input from the search memory. The complete set of 256 distortion values can then be computed in 16 × 16 + 15 = 271 cycles. In contrast, the traditional architecture [6] requires 4111 cycles. Thus, the one-bit transform allows for a roughly 15 : 1 speed improvement. This is consistent with the fact that at each cycle we now process 16 binary pixels instead of one 8-b pixel. For higher throughput, multiple arrays could be used. For example, in pipelined mode, two such arrays (which is equivalent to using a 16-processor array of 32-b processors) could compute all distortions in 128 cycles.
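The computation performed by the array of Figs. 4 and 5 can be simulated in software. The sketch below is our own illustration (it makes no claim to cycle-level fidelity): each 16-pixel row is packed into a 16-b integer, so that f(·) of (6) becomes an XOR followed by a population count, the role played by the LUTs in each processing element.

```python
def full_search_distortions(src, win):
    """Compute all 256 distortions D(k, l) of (5)-(6) for a 16x16 1-b source
    block src and a 31x31 1-b search window win, by packing rows into 16-b
    words and counting the ones in their XOR."""
    def pack(bits):
        # Pack a list of 0/1 values into a single integer, MSB first.
        word = 0
        for bit in bits:
            word = (word << 1) | bit
        return word

    R = [pack(row) for row in src]                    # source rows R_i
    V = [[pack(win[i][j:j + 16]) for j in range(16)]  # window vectors V_{i,j}
         for i in range(31)]
    D = {}
    for k in range(16):
        for l in range(16):
            # D(k, l) = sum over i of popcount(R_i XOR V_{i+k, l})
            D[(k, l)] = sum(bin(R[i] ^ V[i + k][l]).count("1")
                            for i in range(16))
    return D
```

The motion vector is then the (k, l) minimizing D; for example, embedding an exact copy of the source block at offset (3, 5) of an otherwise all-ones window yields D(3, 5) = 0.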
Fig. 6. Motion-compensated prediction residual for the Miss America sequence using various search schemes.

Consider now the case of motion estimation using a search range of [−16, 15] pixels. Then, the 1-b search window is 47 × 47 pixels, and we need to compute 1024 distortion values. Since our architecture can compute 16 distortion values in 16 cycles, we can estimate that the 16-processor linear array will now require 16 × 64 + 15 = 1039 cycles to compute all 1024 distortion values.

V. EXPERIMENTAL RESULTS

Custom architectures may provide the highest level of performance for motion estimation; however, binary block matching schemes are also ideally suited for software-only implementations on a general-purpose processor. We studied the performance of the following search strategies on several sequences from the MPEG video test suite.

1) Full8: full search on 8-b data, ℓ₁ metric.
2) Log8: logarithmic search on 8-b data, ℓ₁ metric.
3) Full1: full search after the 1-b transform, distance metric of (2).
4) Log1: logarithmic search after the 1-b transform, distance metric of (2).

Table II shows the computational complexity of the four strategies for two different 32-b architectures. The first one, referred to as 32-b ops, has a native instruction for counting the population of ones in a register. This allows for 32 binary comparisons per instruction. The second one, referred to as 1-b ops, is a traditional one where only one binary comparison per instruction is performed. We also include estimates for the pixel distance criterion (PDC) metric, which is the scheme proposed in [3]. Note that the calculations for the PDC metric in this table are for the full-search scheme. For Full1 and Log1, the additional expense of the filtering and compare operations of Fig. 1 is also included in the number of operations.
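The "32-b ops" case above assumes a processor that can count the ones in a 32-b register with a single instruction. The Python sketch below is our illustration of that idea, with `bin(x).count("1")` standing in for the native popcount: pairs of 16-pixel 1-b rows are packed into 32-b words, so each XOR-plus-popcount covers 32 pixel comparisons.

```python
def pack_rows32(block):
    """Pack consecutive pairs of 16-pixel rows of a 1-b block into 32-b words."""
    words = []
    for r in range(0, len(block), 2):
        word = 0
        for bit in block[r] + block[r + 1]:  # 32 bits per word
            word = (word << 1) | bit
        words.append(word)
    return words


def distance32(a_words, b_words):
    """XOR matched 32-b words and count the ones: each word performs 32
    binary pixel comparisons, mirroring one popcount instruction."""
    return sum(bin(a ^ b).count("1") for a, b in zip(a_words, b_words))
```

A 16 × 16 block thus reduces to eight 32-b words, so the metric of (2) costs eight XOR/popcount pairs instead of 256 pixel-level operations.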
From Table II, we note at least a 200-fold reduction in complexity for Log1 compared to Full8. The effectiveness of each of the search strategies can be measured in
terms of the peak signal-to-noise ratio (PSNR) and entropy values of the motion-compensated difference frames. We compute the PSNR as

\[
\mathrm{PSNR} = 10 \log_{10} \frac{255^2}{\frac{1}{N}\sum_{i,j} F_{i,j}^2}\ \mathrm{dB} \tag{7}
\]

where F is the motion-compensated prediction residual image and N is the number of pixels in the frame. Table III shows the PSNR of several video sequences, averaged over 100 motion-compensated difference frames for each sequence. It is clear that the performance of full search after the 1-b transform (Full1) is comparable to or better than that of logarithmic search on 8-b data (Log8). Also, for typical video sequences, the performance of Log1 compares quite favorably with Full8, considering the 200-fold reduction in complexity.

TABLE IV. ENTROPY OF MOTION-COMPENSATED DIFFERENCES FOR THE SEARCH STRATEGIES (BLOCK SIZE: 16 × 16; MOTION-VECTOR SEARCH RANGE: ±15 PIXELS; VIDEO FRAME SIZE: 352 × 240)

In Table IV, we show the entropy of the motion-compensated difference image. From this table, we note that the one-bit transform scheme with a suboptimum search strategy such as logarithmic search results, on average, in an 8% increase in the entropy of the motion-compensated prediction residual relative to Full8, at substantially lower complexity. In Fig. 6, we show motion-compensated prediction residuals for several search strategies. To improve the performance of the Log1 scheme, we also examined simple extensions to this approach, namely a multicandidate logarithmic search scheme. For example, in a three-candidate logarithmic search [Log1(3)], instead of selecting a single candidate (usually the one that yields the minimum error) after each searching stage, we select three. For the next stage, each of these three locations is then used as a starting seed. From Fig. 6, Log1(3) yields a motion-compensated residual that is comparable to that obtained using the one-bit exhaustive search scheme (Full1). However, in terms of one-bit operations, it has nearly six times lower complexity.

Fig. 7. PSNR versus bit rate for an H.263 coder and various search strategies. Test sequence: Foreman, QCIF, 12 frames/s, fixed quantizer.

In a fixed-rate coder, quantization noise can mask the prediction error resulting from a suboptimum search strategy. This is illustrated in Fig. 7, which shows the output quality of an H.263 decoder, measured in terms of the PSNR, versus bit rate for various search strategies used in macroblock motion estimation. The results were obtained using the Telenor TMN software coder (version 1.5) [7] with a fixed quantizer, fixed frame skipping (one), in arithmetic coding mode, with no advanced prediction modes, and a motion-vector search range of ±15 pixels. At 28 kb/s, the PSNR values for Full8, Full1, Log8, and Log1(3) are 28.48, 27.98, 27.65, and 26.83 dB, respectively. Thus, Full1 is only 0.5 dB worse than Full8, and the one-bit, three-candidate logarithmic search [Log1(3)] yields only 1.65 dB lower performance than the exhaustive full-search (Full8) method, while its complexity is 65 times lower.

VI. CONCLUSIONS

We presented a motion estimation strategy for digital video based on a one-bit transform and gave an architecture for its hardware implementation. The strategy can effectively integrate low-complexity search schemes, such as logarithmic search, to obtain complexity reductions as large as 200-fold relative to classical exhaustive search. The complexity reduction can translate into a proportionate reduction in the power consumption of custom hardware. Experimental results indicate that the reduced arithmetic complexity is accompanied by acceptable levels of performance degradation.

REFERENCES

[1] V. Bhaskaran and K. Konstantinides, Image and Video Compression Standards: Algorithms and Architectures. Boston, MA: Kluwer, 1995.
[2] K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications. New York: Academic, 1990.
[3] H. Gharavi and M. Mills, "Blockmatching motion estimation algorithms: New results," IEEE Trans. Circuits Syst., vol. 37, pp. 649-651, May 1990.
[4] J. Feng, K.-T. Lo, H. Mehrpour, and A. E. Karbowiak, "Adaptive block matching motion estimation algorithm using bit-plane matching," in IEEE Int. Conf. Image Processing, Washington, DC, 1995, pp. 496-499.
[5] M. M. Mizuki, U. Y. Desai, I. Masaki, and A. Chandrakasan, "A binary block matching architecture with reduced power consumption and silicon area requirements," in IEEE ICASSP-96, Atlanta, GA, 1996, vol. 6, pp. 3248-3251.
[6] K.-M. Yang, M.-T. Sun, and L. Wu, "A family of VLSI designs for the motion compensation block-matching algorithm," IEEE Trans. Circuits Syst., vol. 36, pp. 1317-1325, Oct. 1989.
[7] Telenor Research, H.263 coder, http://www.nta.no/brukere/dvc/h263_software/.