FPGA Implementation of a Novel, Fast Motion Estimation Algorithm for Real-Time Video Compression

S. Ramachandran and S. Srinivasan
Department of Electrical Engineering, Indian Institute of Technology, Madras, Chennai - 600 036, India
Tel.: 91 (044) 44 8374
Email: ram@ee.iitm.ernet.in, srini@ee.iitm.ernet.in

ABSTRACT

A novel block matching algorithm for motion estimation in a video frame sequence, well suited to a high-performance FPGA implementation, is presented in this paper. The algorithm is up to 40% faster than one of the fastest existing algorithms, the one-at-a-time step search algorithm, without compromising either the image quality or the compression effected. The speed advantage is preserved even in the event of a sudden scene change in a video sequence. The proposed algorithm is also capable of dynamically detecting the direction of motion of image blocks. The FPGA implementation of the algorithm is capable of processing color pictures of sizes up to 1024x768 pixels at the real-time video rate of 25 frames/second and conforms to the MPEG-2 standard.

Keywords

Block matching algorithm, motion estimation, macroblock, sum of absolute pixel intensity differences, discrete cosine transform, quantization, variable length code.

1. INTRODUCTION

In multimedia applications, the key requirements are speed of processing and compression without sacrificing image quality. In order to process motion pictures at high resolution, one needs a motion estimation algorithm that is highly efficient in terms of processing speed. Several block matching algorithms are available in the literature. The full search algorithm [1] is a straightforward scheme which requires a large number of searches to find a correct match for the image block being processed: it requires (2w+1)^2 search points, where w is the maximum pixel displacement, usually taken as 8.
One-dimensional full search [2] is another method, requiring (4w+3) search points. The hierarchical method [3], which requires (1 + 8 log2 w) search points, is faster than the two methods mentioned earlier. Faster still is the one-at-a-time step search (OSS) algorithm [4], requiring only (2w+3) search points. In this paper, a novel, fast algorithm is proposed in which the maximum number of search points is (2w+1). Although this number is close to that of the OSS method, the present work is faster than the OSS method by up to 40% when the motion of image blocks is small, of the order of 1 or 2 pixels, which is the case in most commonly encountered image frame sequences. The speed-up distribution for various motions is covered in detail in Section 2.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. FPGA 2001, February 11-13, 2001, Monterey, California, USA. Copyright 2001 ACM 1-58113-341-3/01/0002 $5.00.

A number of software and hardware implementations [5], [6] have been reported for some of the motion estimation algorithms described above. Although software implementations are easy to realise on general-purpose microprocessors, multiprocessors or digital signal processors, their instruction sets are not well suited to fast processing of high-resolution moving pictures. In addition, the instructions are executed sequentially, slowing down the processing further. In contrast, hardware implementations based on FPGAs and ASICs can exploit pipelined and massively parallel processing, resulting in faster and more cost-effective motion estimation.
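The worst-case search-point counts quoted for these five block matching algorithms can be compared directly; a minimal sketch (the function names are ours, not from the paper):

```c
#include <assert.h>

/* Worst-case search points per macroblock for each algorithm, with
   w the maximum pixel displacement, as quoted in the text. */
static int ilog2(int w)        { int n = 0; while (w > 1) { w >>= 1; n++; } return n; }
static int full_search(int w)  { return (2 * w + 1) * (2 * w + 1); }  /* [1] */
static int one_dim_full(int w) { return 4 * w + 3; }                  /* [2] */
static int hierarchical(int w) { return 1 + 8 * ilog2(w); }           /* [3] */
static int oss(int w)          { return 2 * w + 3; }                  /* [4] */
static int foss(int w)         { return 2 * w + 1; }                  /* proposed */
```

For the usual w = 8 these give 289, 35, 25, 19 and 17 search points respectively, which is the ordering the text describes.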
ASIC designs are suitable if high-volume production is envisaged. However, in the research and development phase, for rapid prototyping of a new design and for dynamic reconfigurability, an FPGA is the right choice. Further, an FPGA implementation is cost-effective for low-volume applications. For these reasons, the proposed motion estimation scheme, designed using novel circuits, is implemented on FPGAs. The following sections present the new algorithm, its architecture and its FPGA implementation, followed by the results and discussion. Conclusions are presented in the final section.

2. THE FAST ONE STEP SEARCH ALGORITHM

The basic principle involved in motion estimation is depicted in Fig. 1a. The current frame of the image is processed macroblock (MB) by macroblock. The macroblock (j,k) currently being processed is identified in the previous frame, and a search window surrounding it is defined for conducting the search. A minimum sum of absolute pixel intensity differences (A_d) is required to be met in order to locate the shifted macroblock. For a 16x16 macroblock, A_d is defined as

    A_d(x, y) = sum_{j=0}^{15} sum_{k=0}^{15} | i_{j,k} - i_{(j+x),(k+y)} |,   -8 <= x, y <= 8,   (1)

where i_{j,k} is the pixel intensity in the macroblock being processed in the current frame, i_{(j+x),(k+y)} is its corresponding intensity in the search window (32x32 pixels) of the previous frame, and (x,y), referred to as the motion vector, is the shift undergone by the macroblock. For convenience, the computation of one A_d is referred to as a search point.

[Figure 1. Motion estimation: (a) basic principle, showing the macroblock (j,k), the macroblock shifted by (x,y) and the 32x32-pixel search window in the previous frame; (b) processing order and direction in the FOSS method, with pixels A to H and the Up/Right (U/R) direction marked in the search window.]

The proposed Fast One Step Search (FOSS) method is indicated in Fig. 1b. For example, pixels A to H are within the search window in the previous frame, where A is the origin of the macroblock currently being processed. A_d computed for one of the pixels A to H will be a minimum if the image block has shifted to that particular pixel. For instance, let G be the final shifted pixel of the image block. In order to locate pixel G, we first move in the vertical (Up) direction and compute A_d for pixels A and B. If B yields the lower A_d of the two, then A_d is computed for C. If A_d for B is still the lowest, the same procedure is repeated in the horizontal (Right) direction from D until we arrive at the absolute minimum value of A_d at pixel G. In this example, the first moves in the vertical and horizontal directions were Up and Right respectively; this combination is designated the U/R direction. There are seven other possible combinations of directions, namely Right/Up, Up/Left, Left/Up, Down/Left, Left/Down, Down/Right and Right/Down, which yield different numbers of search points depending on the actual motion encountered in the image frame being processed.
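The A_d of Eq. (1) translates directly into software; a minimal C sketch (the function name `sad`, the array layout and the offset-by-w window addressing are our assumptions):

```c
#include <assert.h>
#include <stdlib.h>

#define MB 16   /* macroblock is 16x16 pixels */
#define WMAX 8  /* maximum pixel displacement w */

/* A_d(x, y) of Eq. (1): sum of absolute pixel-intensity differences
   between the current macroblock and the candidate block shifted by
   the motion vector (x, y) in the previous frame.
   cur:  the 16x16 current macroblock.
   prev: the 32x32 search window of the previous frame, stored so that
         offset WMAX corresponds to zero displacement; this keeps all
         indices for -8 <= x, y <= 8 inside the window. */
static int sad(const unsigned char cur[MB][MB],
               const unsigned char prev[MB + 2 * WMAX][MB + 2 * WMAX],
               int x, int y)
{
    int a = 0;
    for (int j = 0; j < MB; j++)
        for (int k = 0; k < MB; k++)
            a += abs((int)cur[j][k] - (int)prev[j + x + WMAX][k + y + WMAX]);
    return a;   /* at most 256 * 255, so it fits in 16 bits */
}
```

The worst-case value, 256 x 255 = 65280, fits in a 16-bit accumulator, which is what a hardware A_d module needs to hold.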
Based on the procedure outlined, the detailed steps of the proposed FOSS algorithm for motion estimation in the U/R direction are given below.

1. Min = A_d(x,y); Up = Down = 1. If A_d(x, y+Up) < Min, then Min = A_d(x, y+Up); go to 2. Otherwise, if A_d(x, y-Down) < Min, then Min = A_d(x, y-Down); go to 3. Else, go to 4.
2. Up = Up + 1. If A_d(x, y+Up) < Min, then Min = A_d(x, y+Up); repeat 2. Otherwise, y = y + Up - 1; go to 4.
3. Down = Down + 1. If A_d(x, y-Down) < Min, then Min = A_d(x, y-Down); repeat 3. Otherwise, y = y - Down + 1; go to 4.
4. Right = Left = 1. If A_d(x+Right, y) < Min, then Min = A_d(x+Right, y); go to 5. Otherwise, if A_d(x-Left, y) < Min, then Min = A_d(x-Left, y); go to 6. Else, go to 7.
5. Right = Right + 1. If A_d(x+Right, y) < Min, then Min = A_d(x+Right, y); repeat 5. Otherwise, x = x + Right - 1; go to 7.
6. Left = Left + 1. If A_d(x-Left, y) < Min, then Min = A_d(x-Left, y); repeat 6. Otherwise, x = x - Left + 1.
7. Completion of motion estimation: Min contains the minimum of A_d(x,y), and (x,y) is its motion vector.

FOSS algorithms for the other directions are similar to the one described for the U/R direction. There is no scope for the algorithm to get caught in a local minimum or to miss motion, since only the actual error is coded and ultimately reconstructed. The number of search points per macroblock for the proposed FOSS method as against the OSS method for variously shifted image blocks can be readily computed from the respective algorithms and is presented in Table 1. Note that the maximum speed advantage of 40% results when the motion of the image block is 1 pixel diagonally in any of the four diagonal directions, and a minimum of 9% for the maximum shift of 8 pixels either horizontally or vertically.

The number of search points per macroblock for the FOSS method is always lower than that for the OSS method, the difference between the two being one search point for horizontal or vertical directions of motion and two search points for all other directions. Of course, there is no speed advantage if no motion is involved. In practice, the speed-up factor changes from macroblock to macroblock, depending upon the actual motion encountered; averaged over a number of frames, it may be anywhere from 9% to 40% for an image sequence. However, in order to derive the maximum speed advantage, one needs to assess the direction of motion of objects on-the-fly. In the next section, a scheme for detecting the direction of motion is presented.
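The seven steps above, for the U/R direction, can be coded directly; a software sketch (the names `sad_fn`, `foss_ur` and `demo_sad` are ours, and the search is bounded to the +/-8-pixel window):

```c
#include <assert.h>
#include <stdlib.h>

#define W 8   /* maximum pixel displacement */

/* A_d(x, y) evaluator supplied by the caller; in hardware this is
   the pipelined SAD module. */
typedef int (*sad_fn)(int x, int y);

/* FOSS search in the Up/Right direction, following steps 1-7.
   Writes the motion vector to (*px, *py) and returns the minimum A_d. */
static int foss_ur(sad_fn sad, int *px, int *py)
{
    int x = 0, y = 0;
    int min = sad(x, y);                        /* step 1 */
    int up = 1, down = 1, right = 1, left = 1;
    int v;

    if ((v = sad(x, y + up)) < min) {           /* steps 1-2: Up */
        min = v;
        while (up < W && (v = sad(x, y + up + 1)) < min) { min = v; up++; }
        y += up;
    } else if ((v = sad(x, y - down)) < min) {  /* step 3: Down */
        min = v;
        while (down < W && (v = sad(x, y - down - 1)) < min) { min = v; down++; }
        y -= down;
    }

    if ((v = sad(x + right, y)) < min) {        /* steps 4-5: Right */
        min = v;
        while (right < W && (v = sad(x + right + 1, y)) < min) { min = v; right++; }
        x += right;
    } else if ((v = sad(x - left, y)) < min) {  /* step 6: Left */
        min = v;
        while (left < W && (v = sad(x - left - 1, y)) < min) { min = v; left++; }
        x -= left;
    }

    *px = x; *py = y;                           /* step 7 */
    return min;
}

/* Illustrative A_d surface with its true minimum at (2, 3). */
static int demo_sad(int x, int y) { return abs(x - 2) + abs(y - 3); }
```

Tracing this routine for a 1-pixel diagonal shift shows it evaluates A_d exactly five times, which is where the 1.40 speed-up over the OSS method's seven search points comes from.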

Table 1. The number of search points and the speed-up factors in a macroblock for the FOSS method over the OSS method

Shift in image block position         | FOSS | OSS | Speed-up factor
1 pixel diagonally                    |   5  |  7  | 1.40
2 pixels diagonally                   |   7  |  9  | 1.29
8 pixels diagonally                   |  17  | 19  | 1.12
1 pixel horizontally or vertically    |   5  |  6  | 1.20
2 pixels horizontally or vertically   |   6  |  7  | 1.17
8 pixels horizontally or vertically   |  11  | 12  | 1.09
No motion                             |   5  |  5  | 1.00

3. ASSESSMENT OF DIRECTION OF MOTION OF IMAGE BLOCKS

The method of finding the direction of motion of image blocks in a picture, as implemented in the present work, is as follows. The assessment is made only for the first P frame after the I frame in every group of pictures (GOP), whose length can be user defined. In the present scheme, a GOP consists of an I frame followed by P frames only and contains no B frames. Since motion estimation is a time-consuming computation, motion estimation for all eight directions of the FOSS algorithm is carried out only for a representative sample of five macroblocks. The total number of search points over the five macroblocks is computed for each of the eight directions, and the direction for which this total is a minimum is reckoned as the optimum direction of motion of the image blocks. The optimum direction thus found is applied to all the P frames in the current GOP. The five sample macroblocks are located at the co-ordinates (M/4, N/4), (3M/4, N/4), (M/2, N/2), (M/4, 3N/4) and (3M/4, 3N/4), where MxN is the picture size in pixels. The processing-time overhead involved in motion estimation for these macroblocks is under 0.4% of the overall processing time for the entire GOP. This placement of macroblocks was arrived at after conducting elaborate experiments on a number of images, trying various locations and numbers of samples in a frame.
This placement yielded the minimum number of search points of all the combinations tried out for various images of sizes up to 720x480 pixels. Beyond this picture size, if required, one can use nine macroblocks instead of five, yielding marginally better performance.

4. DETECTION OF SCENE CHANGE

The FOSS algorithm is robust and adapts seamlessly even in the event of a radical scene change. The algorithm detects scene changes by keeping track of the total number of search points for every frame and comparing it with that of the previous frame. If the total number of search points for the current frame exceeds that for the previous frame by more than 2%, a scene change is deemed to have occurred. This figure of 2% was arrived at after testing with a number of images. In such an event, the frame following the scene-change frame is taken as the reference frame for a fresh group of pictures. The quality of the picture does not degrade, since only the actual error is processed and the appropriate correction effected; for the same reason, no motion of image blocks is lost track of. Compression falls only for the scene-change frame, and normality is restored with the succeeding frame.

5. ARCHITECTURE FOR THE FOSS MOTION ESTIMATION PROCESSOR

The architecture of the FOSS motion estimation processor is shown in Fig. 2. It consists of a motion estimation (ME) controller, which contains the circuit for executing the FOSS algorithm, a dual redundant current MB RAM, a module for evaluating A_d, and an external dynamic RAM to hold the processed I and P frames. To start with, the host processor communicates the picture size, the luminance or colour information and the macroblock number to be processed to the ME controller. After ensuring that the EDATA signal is set, the host writes the image macroblock data into one of the two current macroblock RAMs.
When the ME controller is ready to begin the motion estimation processing, the READY signal is set; on receiving it, the host asserts the START signal, initiating the processing. The EDATA signal is immediately asserted so that the host can enter the next image input concurrently with the motion estimation processing. Before the P frame is processed, the I frame is processed by the DCTQ processor [7] and by the inverse quantization and inverse DCT processor. The processed I frame is stored in the external RAM, designated the previous frame RAM, and serves as the reference frame for carrying out the motion estimation. The ME controller implements the FOSS algorithm as a sequential circuit, executing each step of the algorithm as explained earlier. The minimum value of A_d is stored in an internal 16-bit register. The motion vector variables x and y are reset at the start of motion estimation for every macroblock. A_d is computed using the A_d(x,y) module, which is cleared before every A_d computation. The controller converts the x, y variables into the appropriate addresses for the current macroblock and previous frame RAMs. At a time, one row of a macroblock containing 16 pixels of data is fetched from each of the current and previous frame RAMs, and the sum of absolute differences for all these pixel pairs is computed and accumulated. The controller takes 16 clock cycles to accumulate the sum for the sixteen rows of the macroblock. Since these computations are time consuming, they are pipelined, with an inherent latency of 6 clock cycles. As a result, one A_d computation takes 22 clock cycles to execute.
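The 22-cycle figure follows directly from the pipeline description; a back-of-the-envelope check (assuming, as the text implies, one 16-pixel row accumulated per clock):

```c
#include <assert.h>

/* Cycle count for one A_d computation: sixteen macroblock rows are
   fetched and accumulated one per clock, and the pipelined datapath
   adds an inherent latency of 6 clocks before the final sum emerges. */
enum { ROWS = 16, PIPE_LATENCY = 6 };

static int cycles_per_ad(void)
{
    return ROWS + PIPE_LATENCY;   /* 16 + 6 = 22, as stated in the text */
}
```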

[Figure 2. The architecture of the FOSS motion estimation processor, showing the ME controller, the dual current MB RAM, the A_d(x,y) module, the previous (I/P) frame external RAM, and the host-bus signals RESET, CLOCK, EDATA, READY, START, EOME, SKIP and DVALID, together with the MOTION VECTOR/ERROR bus.]

The motion estimation is completed in about 30 clock cycles per A_d computation, considering the internal steps involved in the algorithm. For a macroblock, a maximum of eight A_d computations is required, as can be seen from the results presented in the next section. When the motion estimation for a macroblock is completed, the motion vector followed by the row-wise intensity errors is output on the MOTION VECTOR/ERROR pins for use as input to the subsequent DCTQ, inverse quantization and inverse DCT processing, thus reconstructing the error. The MOTION VECTOR/ERROR codes are generated in sequential order for the luminance component Y and the colour components Cb and Cr, as per the MPEG-2 format. The corresponding motion-compensated sum of the row-wise intensities of the previous macroblock and the reconstructed error is written into the previous frame RAM to form the P frame, which serves as the previous frame for processing the next frame of the GOP. These operations require 25 clock cycles each for the one luminance and two colour components. As a result, the total execution time for motion estimation and compensation per macroblock is around 315 clock cycles for a colour picture. Motion estimation is applied only to the luminance part, Y. DVALID is a synchronous pulse for writing the MOTION VECTOR/ERROR. If no motion is detected (motion vector x = y = 0), the SKIP signal is issued. The end-of-motion-estimation signal, EOME, is generated after completing the motion estimation for the current macroblock. This process is repeated for all the macroblocks in the frame. The actual colour processing takes place only in the variable length coder (VLC) [8].

6. IMPLEMENTATION OF THE FOSS MOTION ESTIMATION ON AN FPGA

The proposed FOSS motion estimation and compensation processor has been implemented on a single Altera FLEX 10K100A EPLD, each module being realised using either schematic entry or VHDL. The present design uses about 80,000 logic gates, about 80% of the total capacity of the chip. Synthesis, functional testing and timing analysis were carried out for all the individual modules. The proposed architecture was tested with a 40 ns clock and found to work satisfactorily. As mentioned in the previous section, motion estimation and compensation requires 315 clock cycles per macroblock. This works out to a processing speed of 25 frames per second for a colour picture of size 1024x768 pixels. The frame rate can easily be changed to 30 per second with a corresponding trade-off in picture size. The external memory requirement for this design, which holds an I as well as a P frame, is 4.5 MB.

7. RESULTS AND DISCUSSIONS

The FOSS algorithm for all eight directions, together with automatic assessment of the direction of motion of image blocks

and detection of scene changes, has been developed, coded in the C language and successfully tested using a number of images of different sizes. The savings in computation effected by the proposed method compared to the OSS method are given in Table 2. It is clear from the table that, irrespective of the direction chosen initially and applied to all the frames of the GOP, the FOSS method tracks the shifted image block faster than the OSS method. The direction for which the number of search points is minimum is selected as the optimum direction of motion for each image sequence. Plots of the number of search points versus the frame number are presented in Fig. 3 for one of the image sequences, the Car sequence, for both the FOSS and OSS methods. The first

Table 2. Percentage savings effected in the number of search points by the proposed FOSS method over the OSS method, for various directions

Image sequence | L/U  | R/U  | R/D  | L/D  | U/L  | U/R  | D/R  | D/L
Rugby          | 10.8 | 11.1 | 11.3 | 10.6 | 14.8 | 15.3 | 15.0 | 14.7
Table Tennis   | 12.6 | 12.4 | 12.1 | 12.2 | 13.5 | 13.6 | 13.3 | 13.4

[Figure 3. The number of search points versus the frame number for the Car sequence for the FOSS and the OSS methods.]

Table 3. The total number of search points for the various directions over the five sample macroblocks, and the optimum direction of motion found by the FOSS algorithm

Image   | L/U | R/U | R/D | L/D | U/L | U/R | D/R | D/L | Optimum direction found
Rugby 1 |  42 |  39 |  39 |  41 |  37 |  37 |  35 |  35 | D/R
TT 1    |  29 |  30 |  29 |  28 |  25 |  25 |  29 |  29 | U/L
bmw 1   |  45 |  45 |  47 |  47 |  38 |  41 |  43 |  40 | U/L

Table 4. The speed-up ratio in a GOP for the FOSS method over the OSS method

Image sequence                       | Frame numbers | Avg. search points per MB | Speed-up ratio
Rugby (image size: 480x688 pixels)   | 0-9           | 8.1                       | 1.146
TT (image size: 480x720 pixels)      | 0-1           | 7.0                       | 1.134
bmw (image size: 352x288 pixels)     | 90-110        | 8.1                       | 1.114
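The optimum-direction pick of Section 3, illustrated by Table 3, reduces to a first-minimum scan over the eight per-direction totals; a sketch (function name ours, totals illustrative):

```c
#include <assert.h>

/* The eight search directions in the order used in Tables 2 and 3. */
static const char *dirs[8] =
    { "L/U", "R/U", "R/D", "L/D", "U/L", "U/R", "D/R", "D/L" };

/* Section 3: the optimum direction is the one whose total number of
   search points over the five sample macroblocks is smallest; on a
   tie, the first occurrence from the left is taken. */
static int optimum_direction(const int total[8])
{
    int best = 0;
    for (int d = 1; d < 8; d++)
        if (total[d] < total[best])   /* strict '<' keeps the leftmost tie */
            best = d;
    return best;
}
```

With totals that tie between D/R and D/L, as the Rugby sequence does, the scan returns D/R (`dirs[6]`), matching the tie-break rule stated in the text.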

frame is an I frame and all the others are P frames. As can be seen from the plots, the number of search points in the FOSS method is lower than in the OSS method for all the frames. Similar results have been obtained for all the image sequences tested, although they are not presented here. The quality measure, peak signal-to-noise ratio (PSNR), and the compression effected in bits/pixel for the Car sequence, averaged over all the frames, are 33.9 dB and 0.36 respectively for the FOSS method, whereas for the OSS method they are 33.5 dB and 0.45 respectively. This result indicates that the FOSS method does not compromise on the quality of the reconstructed image or on the compression effected, while improving the speed of execution. Similar results, though not presented here, have been obtained for the other sequences as well. The total number of search points for the various directions over the five sample macroblocks, as explained earlier, is given in Table 3, together with the optimum direction of motion found by the FOSS algorithm for each image sequence. Where there is more than one minimum, the first occurrence from the left is taken as the optimum direction. For instance, the Rugby sequence yields a minimum of 35 search points for two directions, D/R and D/L, and D/R is therefore recognised as the optimum direction. Table 4 presents the speed-up ratios of the FOSS method over the OSS method for the various image sequences; the overhead time required for detecting the direction of motion is included in the total number of search points found by the FOSS method. The proposed algorithm preserves the visual quality of the picture, as can be seen from Fig. 4, which presents the original and the reconstructed images for one of the image frames.

[Figure 4. Simulation image: (a) original bmw image (frame number 91, picture size 352x288 pixels); (b) reconstructed bmw image by the FOSS method (PSNR: 35.6 dB).]
Two image sequences, the bmw_tram and the Car_susie, were tested for scene change. Table 5 presents the speed-up ratios of the FOSS method over the OSS method for these two sequences. In spite of a sudden scene change, the FOSS method not only retains its speed advantage but also reconstructs a good-quality image, as is evident from Fig. 5. The FOSS algorithm is therefore flexible enough to accommodate scene changes while preserving both the speed advantage and the quality of the processed image over the OSS method.

Table 5. The speed-up ratio for the FOSS method over the OSS method for image sequences with scene changes

Image sequence                                     | Frame numbers   | Scene change at | Avg. search points per MB | Speed-up ratio
bmw_tram (image size: 352x288 pixels, dir.: R/U)   | 90-110, 170-180 | 170 (Tram)      | 7.9                       | 1.096
Car_susie (image size: 256x256 pixels, dir.: U/R)  | 1-70, 0-19      | 0 (Susie)       | 5.3                       | 1.090
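The 2% scene-change rule of Section 4, whose effect these sequences exercise, amounts to a single comparison per frame; a sketch (function name ours, integer search-point totals assumed):

```c
#include <assert.h>
#include <stdbool.h>

/* Section 4: a scene change is deemed to have occurred when the total
   number of search points for the current frame exceeds that of the
   previous frame by more than 2%. */
static bool scene_changed(long prev_total, long cur_total)
{
    /* cur > prev * 1.02, kept in integer arithmetic */
    return cur_total * 100 > prev_total * 102;
}
```

When this fires, the frame following the scene-change frame becomes the reference frame of a fresh GOP, as described in Section 4.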

a b Figure. Simulation image with scene change from Car to Susie: a Original Susie image (Frame number: 0, Picture size: 26x26 pixels); b Reconstructed Susie image by the FOSS method (PSNR: 3.9 db) 8. CONCLUSIONS A new, fast, one step search method for motion estimation in video frame sequences along with automatic assessment of direction of motion of image blocks and its implementation on FPGA have been presented. The simulation results show that this new method is faster than the OSS method without compromising either on the image quality of picture or the compression effected. Although the present implementation is for processing I and P frames only, the algorithm can be easily extended to cover B frames as well. Presently, we are working on a FPGA based, reconfigurable video encoder system using the FOSS motion estimation and catering to various applications conforming to JPEG, MPEG, H.263 standards. REFERENCES [1] S.C. Cheng and H.M. Hang, A comparison of block matching algorithms mapped to systolic array implementation, IEEE Trans. on circuits and systems for video technology, Vol. 7, No., 1997, pp. 741-77. [2] M.J. Chen, L.G. Chen, T.D Chieuh, One-dimensional full search motion estimation algorithm for video coding, IEEE Trans. on circuits and systems for video technology, Vol. 4, No., Oct. 1994, pp. 04-09. [3] H.M. Jong, L.G. Chen and T.D. Chiueh, Parallel architectures for three step hierarchical search block matching algorithm, IEEE Trans. on circuits and systems for video technology, Vol. 4, No. 4, 1994, pp. 407-417. [4] R. Srinivasan, K.R. Rao, Predictive coding based on efficient motion estimation, IEEE Trans. on communications, Vol. COM-33, No. 8, Aug. 198, pp. 888-89. [] T. G. Venkatesh and S. Srinivasan, A Pruning based fast Rate Control Algorithm for MPEG Coding, ICCIMA, India, pp. 403-407, 1999. [6] Rajesh T.N. 
Rajaram, Optimization of fast search block matching motion estimation algorithms and their VLSI implementation, Thesis work for the degree of Master of Science (by research), Department of Electrical Engineering, Indian Institute of Technology, Madras, June 1999. [7] S. Ramachandran, S. Srinivasan and R. Chen, EPLD-based Architecture of Real Time 2D-Discrete Cosine Transform and Quantization for Image Compression, IEEE International symposium on circuits and systems, Orlando, Florida, pp. iii37-378, May-June 1999. [8] S. Ramachandran and S. Srinivasan, Design and Implementation of an EPLD-based Variable Length Coder for Real Time Image Compression Applications, The IEEE International symposium on circuits and systems (ISCAS), Geneva, Switzerland, pp. I607-610, May 28-31, 2000. 219