Fast Motion Estimation for Shape Coding in MPEG-4

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 13, NO. 4, APRIL 2003, p. 358

Donghoon Yu, Sung Kyu Jang, and Jong Beom Ra

Abstract: Effective shape coding in MPEG-4 requires a motion estimation (ME) procedure, which is a great burden for real-time encoders. This paper deals with fast ME for MPEG-4 shape coding. The proposed algorithm utilizes inherent properties of shape coding, i.e., coding based on the motion vector predictor for shape (MVPS) and motion vector (MV) correlation between neighboring binary alpha blocks (BABs). By properly using these properties, the proposed algorithm can achieve a high processing speed. In addition, the proposed algorithm can be combined with any conventional fast block matching algorithm (BMA) used in texture ME and with a bit-packing technique based on the binary nature of shape information. Simulation results show that the proposed algorithm can reduce the computational complexity of ME for shape coding to 68.3% even in the worst case. Also, by simply combining the proposed algorithm with a fast BMA and a bit-packing technique, we can achieve a complexity suitable for real-time software implementation of MPEG-4 shape encoding.

Index Terms: Binary alpha blocks, motion estimation (ME), motion vector predictor, shape coding in MPEG-4.

I. INTRODUCTION

MPEG-4 provides a set of tools for object-based coding of natural and synthetic video and audio [1]. It attempts to support a wide range of bit rates from 64 kbits/s to 38.4 Mbits/s according to profiles. Its application areas therefore include internet communication, video conferencing, mobile and wireless multimedia, etc. Earlier video-coding standards have focused on representing moving pictures as a single entity and efficiently compressing them as such. In MPEG-4, however, moving pictures are treated as an organized collection of independently coded visual objects.
MPEG-4 supports several types of visual objects, among which is the video object (VO). The VO may be thought of as a sequence of two-dimensional (2-D) images where each image may have an arbitrary shape. A special case of the VO occurs when the shape is rectangular and time invariant with respect to size and position. This corresponds to the familiar definition of the video sequence as dealt with by video coding standards other than MPEG-4. The general MPEG-4 VO can be of any shape, and its shape, size, and position may vary from one frame to the next. The VO can be described with three color components (YUV) and an alpha plane. The alpha plane defines the object's shape frame by frame. Each 2-D image of the VO is called a video object plane (VOP), and each alpha plane requires shape coding [2].

Manuscript received April 6, 2001; revised October 1, 2002. This paper was recommended by Associate Editor K.-H. Tzou. The authors are with the Department of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology, Yuseong-gu, Daejeon 305-701, Republic of Korea (e-mail: dhyu@issserver.kaist.ac.kr; skjang@issserver.kaist.ac.kr; jbra@ee.kaist.ac.kr). Digital Object Identifier 10.1109/TCSVT.2003.811430

Applications of real-time object-based MPEG-4 encoding are currently limited by the lack of a mature automatic segmentation algorithm. However, where automatic segmentation is limited to low-level features (e.g., blue screen technology) or to specific scenarios with simple and predefined semantics (e.g., head-and-shoulder scenes), the remaining obstacle is the complexity of the encoder. Among various methods, the MPEG-4 standard has selected a context-based arithmetic encoding (CAE) algorithm for shape coding. Motion estimation (ME) is performed to increase the coding efficiency of the CAE algorithm. ME is also essential to improve the efficiency of texture coding.
However, ME for shape coding and ME for texture coding together dominate the computational complexity [3] and are a heavy burden for real-time MPEG-4 encoders. For several years, many researchers have focused on fast ME algorithms for texture coding. Fast ME for shape, however, is also imperative for real-time VOP-based encoding in MPEG-4. Recently, several papers have proposed hardware implementation methods for shape coders [4], [5]. In those papers, a shape-coding algorithm is chosen with hardware implementation in mind, and a corresponding efficient hardware structure is proposed. But a hardware solution is usually not applicable to a software-only codec. In this paper, we address a software solution for fast ME in shape coding. The proposed scheme is based on inherent properties of shape coding. By using these properties, we can achieve a high processing speed in ME for shape coding.

In the following section, we review the ME algorithm for shape encoding in the MPEG-4 verification model (VM) [1]. The proposed algorithm is described in Section III. Simulation results are shown in Section IV, and conclusions are drawn in Section V.

II. ME METHOD FOR SHAPE IN VM

The MPEG-4 VM [1] provides an encoding scheme and its corresponding program. In this section, the shape coding scheme it provides is introduced briefly. Shape coding is performed on the alpha plane in units of binary alpha blocks (BABs). A BAB has a size of 16 × 16 pixels, and its position coincides with that of a macroblock in texture coding. Shape coding supports both lossless and lossy coding. To allow an acceptable error between the source BAB and the encoded BAB, a threshold AlphaTH is used in shape encoding. It represents the number of pixels that can have different values between the source and encoded BABs, and is used for comparing the two BABs subblock by subblock. Here, a subblock denotes a 4 × 4 elementary block of a BAB.
Hence, if lossless shape coding is desired, AlphaTH is to be set to zero.
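The subblock-by-subblock comparison can be sketched in C as follows. This is a minimal illustration, not the VM code: the function name `bab_is_acceptable` and the flat 256-byte BAB layout are hypothetical, the per-subblock bound of 16 × AlphaTH is the one quoted in Section II-B, and the error is assumed to be accumulated as absolute differences of alpha values 0 and 255, which the text does not spell out.

```c
#include <stdbool.h>
#include <stdlib.h>

#define BAB_SIZE 16   /* a BAB is 16 x 16 pixels */
#define SUB_SIZE 4    /* a subblock is 4 x 4 pixels */

/* Returns true if every 4 x 4 subblock of the two BABs differs by at most
 * 16 * alpha_th. Both BABs are stored row by row with alpha values 0 or 255
 * (assumed layout). With alpha_th == 0 any differing pixel causes rejection,
 * which corresponds to lossless coding. */
bool bab_is_acceptable(const unsigned char *src,
                       const unsigned char *coded,
                       int alpha_th)
{
    for (int by = 0; by < BAB_SIZE; by += SUB_SIZE) {
        for (int bx = 0; bx < BAB_SIZE; bx += SUB_SIZE) {
            int err = 0;  /* error accumulated over one subblock */
            for (int y = 0; y < SUB_SIZE; y++) {
                for (int x = 0; x < SUB_SIZE; x++) {
                    int i = (by + y) * BAB_SIZE + (bx + x);
                    err += abs((int)src[i] - (int)coded[i]);
                }
            }
            if (err > 16 * alpha_th) {
                return false;  /* one bad subblock rejects the whole BAB */
            }
        }
    }
    return true;
}
```

Checking each subblock separately, rather than the whole BAB at once, keeps the permitted error from concentrating in one small region of the block.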

Fig. 1. Candidates for MVPS.

Generally, shape information is as critical as motion information in texture coding [6]. In most cases, therefore, shapes are encoded losslessly because of their importance, and in this paper we focus only on lossless shape coding. For effective coding, a motion vector of shape (MVS) is obtained through shape ME. An MVS is represented as the sum of a motion vector predictor for shape (MVPS) and a motion vector difference for shape (MVDS).

Fig. 2. Flowchart of the original ME scheme.

A. MVPS

An MVPS is determined by referring to candidate motion vectors of shape (MVS1, MVS2, and MVS3) around the current shape block (or BAB) and to texture MVs (MV1, MV2, and MV3) around the macroblock (MB) corresponding to the current BAB. They are located and denoted as shown in Fig. 1, and MV1, MV2, and MV3 are rounded to integer values. The candidates are examined in the order MVS1, MVS2, MVS3, MV1, MV2, and MV3, and the MVPS is set to the first valid motion vector encountered. In this procedure, the MVS of a transparent BAB or an intra BAB and the MV of a transparent MB are considered invalid, and the MVS of an opaque BAB can be either valid or invalid depending on its mode (see [1] for details). MV1, MV2, and MV3 are not valid if the current VOP is a B-VOP or if the current VO has only a binary alpha plane and no texture. If none of the candidates is valid, the MVPS is taken to be (0, 0).

B. Determination of MVS

Based on the MVPS determined above, the MVS is computed by the following procedure. The motion compensation (MC) error is computed by comparing the BAB indicated by the MVPS with the current BAB. If the computed MC error is less than or equal to 16 × AlphaTH for all 4 × 4 subblocks, the MVPS is directly employed as the MVS and the procedure is terminated.
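The candidate scan of Section II-A can be sketched in C as follows. This is a minimal illustration under stated assumptions: the types `MotionVector` and `MvpsCandidate`, the candidate enumeration, and the function names are hypothetical, and the validity flags are assumed to have been set beforehand according to the VM rules (transparent or intra BAB, B-VOP, and so on), which are not reproduced here.

```c
#include <stdbool.h>

typedef struct { int x, y; } MotionVector;

/* One MVPS candidate together with its validity flag (hypothetical type). */
typedef struct { MotionVector mv; bool valid; } MvpsCandidate;

/* Candidates in the order in which they are examined. */
enum { CAND_MVS1, CAND_MVS2, CAND_MVS3,
       CAND_MV1, CAND_MV2, CAND_MV3, NUM_CANDIDATES };

/* Returns the first valid candidate in scan order, or (0, 0) if none is valid.
 * Texture MVs are assumed to be already rounded to integer values. */
MotionVector select_mvps(const MvpsCandidate cand[NUM_CANDIDATES])
{
    for (int i = 0; i < NUM_CANDIDATES; i++) {
        if (cand[i].valid) {
            return cand[i].mv;
        }
    }
    MotionVector zero = {0, 0};
    return zero;
}

/* An MVS is the sum of its predictor and its coded difference. */
MotionVector reconstruct_mvs(MotionVector mvps, MotionVector mvds)
{
    MotionVector mvs = { mvps.x + mvds.x, mvps.y + mvds.y };
    return mvs;
}
```

Because only the difference from the predictor is coded, a BAB whose best vector equals the MVPS needs no vector bits at all, a property the proposed scheme relies on.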
If the condition above is not satisfied, the MVS is searched around the predicted vector MVPS by computing the 16 × 16 MC error [sum of absolute differences for shape (SADS)] between the BAB indicated by the candidate MVS and the current BAB. The search range is ±16 pixels around the MVPS in both the horizontal and vertical directions. The MVS that minimizes SADS is taken as the final MVS, and MVDS is defined as MVDS = MVS - MVPS. If more than one MVS minimizes SADS with an identical value, the one whose MVDS has the shortest code length is selected. If more than one MVS minimizes SADS with an identical value and generates an identical MVDS code length, the MVDS having the smaller vertical element is selected. If the vertical elements are also the same, the MVDS with the smaller horizontal element is selected. Fig. 2 shows a brief flowchart of this procedure. When the MVS is searched around the MVPS, the full-search algorithm is adopted in the VM. However, a fast-search algorithm can be used instead of the full-search algorithm for high-speed implementation.

It is interesting to note the differences between ME for texture coding and ME for shape coding. In texture coding, the search range is determined around the zero MV. If a macroblock (or 8 × 8 block) has a zero MV, it is classified into a special mode and its MV information is not sent. In shape coding, by contrast, the search range is determined around the MVPS instead of the zero MV, and if a BAB has a zero MVDS, it is classified into a special mode and its MV information is not sent. The proposed algorithm uses this property to reduce the computational complexity.

III. THE PROPOSED SCHEME

Fig. 3. Flowchart of the proposed scheme.

Fig. 3 illustrates the overall proposed scheme for fast ME in shape coding. In the proposed scheme, to enhance the speed of ME, we insert two skipping opportunities by comparing the MC error with a given threshold (THR). Here, it should be noted that THR is different from AlphaTH, which is used in the VM scheme. The proposed algorithm exploits two properties to improve the search speed, namely, MVPS-based coding and MV correlation between neighboring BABs. In this section, we first consider two well-known techniques in Sections III-A and III-B: the bit-packing technique for binary shape data and a fast ME algorithm used in texture coding. The proposed algorithm is then described in detail in Sections III-C and III-D.

Fig. 4. Average number of shape bits additionally produced in coding each BAB by using the MVS providing minimum SADS rather than the MVPS. The graph is plotted as a function of initial MC error.

A. Bit-Packing Technique Using Binary Feature of Shape Information

Shape information described in the alpha plane consists of two levels, i.e., 0 and 255. Here, 0 denotes a transparent region and 255 an opaque one. We may use this binary feature to speed up the MC error calculation. Namely, the shape information of a pixel may be represented with a single bit, so the shape information of eight pixels can be represented with a single variable (of character type in the C language). Therefore, we can compute the MC error corresponding to eight pixels by performing one XOR operation on two 8-bit variables and referencing a predefined look-up table. In this way, eight comparison operations are reduced to one XOR operation, and seven ADD operations to one memory access. Theoretically, the computational cost of the MC error calculation is reduced to about 2/15. In practice, however, the computational cost is reduced only to about 1/6, because of the procedure of packing eight pixels into one variable. Although the local frame memory must be increased for efficient MC error calculation, the burden is negligible.

B. Fast-Search Algorithm for Final Search Procedure

The proposed algorithm, like the VM, needs a regular search step around the MVPS (a ±8 search is used, as in the gray box in Figs. 2 and 3) to find the MVS providing the minimum MC error. Therefore, both algorithms can be combined with any existing fast-search algorithm. BABs usually have highly MVPS-biased motion vectors and tend to have a unimodal error surface, and this tendency is much stronger than for the texture error. Hence, a block-based gradient descent search (BBGDS) algorithm [7] appears to suit shape ME even better than texture ME for a fast search with good performance. However, various existing search algorithms, such as the three-step search, new three-step search, simple and efficient search (SES), and unrestricted center-biased diamond search algorithms [8]–[13], can also be applied, although they are less suitable for shape information. Bit rates and the number of skipped BABs may vary with the fast-search algorithm adopted.

C. ME Skipping

As mentioned before, in shape coding, a BAB having a zero MVDS is classified into a special mode and its motion vector information does not need to be sent. Therefore, a slight increase in CAE bits caused by selecting the zero MVDS, rather than the nonzero optimal MVDS, can be compensated by the saved MVDS bits. Hence, to improve the search speed in the proposed algorithm, we set the MVDS to zero and stop the search if the MC error for the zero MVDS is smaller than a certain threshold THR (see Fig. 3). To see the efficiency of this method, we perform simple experiments using the MPEG-4 Children sequence of 300 frames in a 30-Hz QCIF format. Fig. 4 shows the average number of shape bits (MVDS bits + CAE bits + BAB type bits) additionally produced in coding each BAB by using the MVS providing minimum SADS rather than the MVPS.
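The bit-packed error computation of Section III-A and the skip test just described can be sketched together in C. This is a minimal illustration, not the authors' implementation: all function names are hypothetical, the MC error is assumed to be counted in mismatching pixels, and the reference BAB is assumed to be already packed at the candidate displacement.

```c
#include <stdbool.h>
#include <stdint.h>

#define BAB_SIZE 16
#define BYTES_PER_ROW (BAB_SIZE / 8)              /* 16 pixels -> 2 bytes */
#define PACKED_BYTES (BAB_SIZE * BYTES_PER_ROW)   /* 32 bytes per BAB */

static uint8_t popcount_lut[256];  /* number of set bits in each 8-bit value */

/* Fills the look-up table once, before any error computation. */
void init_popcount_lut(void)
{
    for (int v = 0; v < 256; v++) {
        int n = 0;
        for (int b = 0; b < 8; b++) {
            n += (v >> b) & 1;
        }
        popcount_lut[v] = (uint8_t)n;
    }
}

/* Packs a 16 x 16 BAB of alpha values (0 or 255) into one bit per pixel,
 * leftmost pixel in the most significant bit. */
void pack_bab(const uint8_t *alpha, uint8_t *packed)
{
    for (int y = 0; y < BAB_SIZE; y++) {
        for (int bx = 0; bx < BYTES_PER_ROW; bx++) {
            uint8_t v = 0;
            for (int b = 0; b < 8; b++) {
                int pixel = alpha[y * BAB_SIZE + bx * 8 + b] ? 1 : 0;
                v = (uint8_t)((v << 1) | pixel);
            }
            packed[y * BYTES_PER_ROW + bx] = v;
        }
    }
}

/* MC error as the number of mismatching pixels: one XOR and one table
 * look-up per group of eight pixels. */
int packed_mc_error(const uint8_t *cur, const uint8_t *ref)
{
    int err = 0;
    for (int i = 0; i < PACKED_BYTES; i++) {
        err += popcount_lut[cur[i] ^ ref[i]];
    }
    return err;
}

/* Skip test: keep MVDS = 0 when the error at the MVPS is below THR. */
bool can_skip_me(int mc_error_at_mvps, int thr)
{
    return mc_error_at_mvps < thr;
}
```

A caller would run `init_popcount_lut()` once, pack the current BAB and the BAB indicated by the MVPS, and start the search only when `can_skip_me()` returns false. Handling candidate positions that are not byte aligned is left out of the sketch.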
It is noted in the figure that if the initial MC error is large enough, the total number of shape bits is reduced, because the CAE bits saved by the ME procedure exceed the bits added by MVDS coding. Otherwise, the number of shape bits is not reduced, because the CAE bit saving is similar to the MVDS bit increase. Therefore, the ME procedure is less effective if the initial MC error is small. To understand this phenomenon better, we may examine the distribution of final MC errors obtained by the ME procedure for small initial MC errors (see Fig. 5). As shown in the plots, the MC error reduction achieved through ME becomes smaller and more limited for small initial errors. According to Fig. 4, this reduction is not enough to compensate for the additional MVDS bits. Hence, we may conclude that if the initial MC error corresponding to an MVPS is less than a given threshold (THR), ME skipping in the current BAB does not, statistically, noticeably affect the performance of shape coding. Meanwhile, we can expect this ME skipping to provide a significant reduction of computation time, because a large portion of BABs have small initial MC errors, as shown in Fig. 6.

Fig. 5. Distribution of final MC errors obtained by the ME procedure as a function of the initial MC error.

Fig. 6. Histogram of the initial MC error of BABs.

Therefore, in this algorithm, we first obtain the MVPS of the current BAB by the existing algorithm described in Section II-A. Then, if the MC error corresponding to the MVPS is less than a predefined THR, we set the current MVS to the MVPS without any further ME procedure. With an experimentally determined THR of 10, ME can be skipped for about 30% of BABs via this process even in the worst case, without a noticeable increase in the total number of bits.

D. MV Correlation Between Adjacent BABs

Even if the MC error corresponding to the MVPS is larger than the predefined THR, we try to estimate the MVS so as to further reduce the computational cost (see Fig. 3). For this purpose, we take advantage of the MV correlation between neighboring BABs by assuming that each object is a rigid body. First, as shown in Fig. 7, we categorize shapes into four classes according to the connection status of the object boundary. Let us assume that the gray region denotes an object. Then, we can see that the MVS correlation between adjacent BABs depends on boundary connectivity. In Fig. 7(a), the number of object pixels on the outer left column of the current BAB is neither 0 nor 16, and the number of object pixels on the outer upper row of the current BAB is 16, so we know that the object boundary in the current BAB is connected to that of its left BAB. This implies that the current BAB may have an MVS similar to that of its left BAB. Similarly, the current BAB may have an MVS similar to that of its upper BAB in Fig. 7(b). On the other hand, Fig. 7(c) describes the situation in which the MVSs of two adjacent BABs have a strong correlation with that of the current BAB. Finally, Fig. 7(d) shows that no adjacent BAB has a boundary connected to that of the current BAB.

Fig. 7. Four kinds of object boundary shape correlation between the current BAB and neighboring BABs. The bottom-right square denotes the current BAB.

For the first two cases of Fig. 7(a) and (b), the initial MVS for the current BAB is set to the MVS of the connected BAB. In the third case of Fig. 7(c), we choose the initial MVS as the average of the two MVSs of the connected BABs. To enhance CAE performance, we may need an additional refinement of the initial MVS predicted by using MV correlation. To minimize search complexity, the refinement is performed with a ±1 search around the initial MVS. According to our experiment, the ±1 local search is enough to obtain reliable performance. After this step, about 10% of BABs are additionally skipped by comparing the MC error with THR.

IV. SIMULATION RESULTS

In our experiment, we use six MPEG-4 test video sequences in QCIF (176 × 144) format: the Children, Bream, Hall monitor, Akiyo, Container, and News sequences. Each sequence consists of 300 VOPs of arbitrary shape. Lossless shape coding is performed at 10 and 30 Hz for all the test sequences. We use a search range of ±8 for the VM search step, which was found by experiment to be better than ±16. The total bit rate is set to around 64 kbits/s for QCIF. It should be noted that test results for other bit rates are quite similar to the result for 64 kbits/s because shape coding is lossless. We compare the performance of the VM algorithm with that of the proposed algorithm. In Tables I and II, the number of search points represents computational complexity, the overall complexity shows the percentage of computational cost compared with that of the VM algorithm, and the number of s_bits/vop denotes the average number of bits used to represent the shape per VOP.
Also, in the tables, "VM + bit-packing + BBGDS" and "Proposed + bit-packing + BBGDS" denote the algorithms that adopt the bit-packing technique and the BBGDS algorithm instead of the full-search algorithm used in "VM" and "Proposed." Note that the number of unskipped BABs varies even in the VM, since skipping occurs if the MC error between the BAB indicated by the MVPS and the current BAB is less than or equal to 16 × AlphaTH for all 4 × 4 subblocks. The overall complexity of "VM + bit-packing + BBGDS" and "Proposed + bit-packing + BBGDS" is obtained by dividing the percentage of the number of search points by six, as mentioned in Section III-A. Also, it should be noted that the number of s_bits/vop, which represents the ME performance, is kept nearly the same for all experiments.

TABLE I. PERFORMANCE COMPARISON. VIDEO SEQUENCES OF 30 Hz WITH A BIT RATE OF AROUND 64 KBITS/S ARE USED

In the Container, Akiyo, and News sequences, the tables show that the proposed algorithm has a very small complexity even without bit-packing and BBGDS, since only a few BABs require MVS searches. In the Children, Bream, and Hall monitor sequences, however, the proposed algorithm has a relatively high complexity (68.3% in the worst case of Hall monitor). But this complexity can be reduced to 0.6% by additionally using the bit-packing technique and the BBGDS algorithm (see Table II). This is considered low enough for real-time software implementation of ME for shape coding. It is also interesting to note that by applying bit-packing and BBGDS to the VM, the complexity can be reduced to 0.87% even in the worst case, and this complexity can be further reduced to 0.6% by applying the proposed algorithm. Since ME for shape accounts for about 25% of the complexity in the original VM, we can expect the proposed algorithm to reduce the computational complexity of the total encoder by about 1%–1.4%, on condition that ME for shape coding uses the bit-packing technique and the BBGDS algorithm and the other parts of the encoder are optimized by about 10–15 times.

The proposed algorithm is mainly focused on lossless shape coding, but it can also support lossy shape coding to some extent. In lossy coding, the algorithm still reduces the complexity, but increases the number of shape bits compared to the original coding.
The increase is about 3.8% for the Hall monitor sequence (10 Hz) with an AlphaTH of 48, while the overall computational complexity of ME for shape coding is reduced to about 13.1% (compared with 10.4% in the lossless case). As the loss increases, the algorithm becomes less effective.

TABLE II. PERFORMANCE COMPARISON. VIDEO SEQUENCES OF 10 Hz WITH A BIT RATE OF ABOUT 64 KBITS/S ARE USED

V. CONCLUSIONS

We propose a fast ME algorithm for MPEG-4 shape coding that uses inherent properties of shape coding. Based on these properties, the algorithm can achieve a high processing speed, which is suitable for a real-time software approach instead of a hardware approach. Simulation results show that the proposed algorithm can reduce the overall computational complexity of ME for shape coding to about 68.3%, even in the worst case. By adding a bit-packing technique and the fast-search algorithm BBGDS to the proposed algorithm, we can further reduce it to only 0.6% of that of the full-search-based ME scheme described in the VM.

REFERENCES

[1] MPEG-4 Video Verification Model version 14.2, ISO/IEC JTC1/SC29/WG11, MPEG99/5477, Dec. 1999.
[2] N. Brady, "MPEG-4 standardized method for the compression of arbitrary shaped video objects," IEEE Trans. Circuits Syst. Video Technol., vol. 9, pp. 1170–1189, Dec. 1999.
[3] P. M. Kuhn and W. Stechele, "Complexity analysis of the emerging MPEG-4 standard as a basis for VLSI implementation," in Proc. SPIE Visual Communication and Image Processing 1998, vol. 3309, 1998, pp. 798–809.
[4] D. Gong and Y. He, "Computation complexity analysis and VLSI architectures of shape coding for MPEG-4," in Proc. SPIE Visual Communication and Image Processing 2000, vol. 4067, 2000, pp. 1459–1470.
[5] Y.-C. Wang, H.-C. Chang, W.-M. Chao, and L.-G. Chen, "An efficient architecture of binary motion estimation for MPEG-4 shape coding," in Proc. SPIE Visual Communications and Image Processing 2001, vol. 4310, 2001, pp. 959–967.
[6] H. Shao, W. Zhu, and Y. Zhang, "User and content-aware object-based video streaming over the internet," in Proc. SPIE Visual Communication and Image Processing 2000, vol. 4067, 2000, pp. 653–661.
[7] L.-K. Liu and E. Feig, "A block-based gradient descent search algorithm for block motion estimation in video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 6, pp. 419–422, Aug. 1996.
[8] T. Koga, K. Iinuma, A. Hirano, and T. Ishiguro, "Motion-compensated interframe coding for video conferencing," in National Telecommunications Conf., 1981, pp. G5.3.1–G5.3.5.
[9] R. Li, B. Zeng, and M. L. Liou, "A new three-step search algorithm for block motion estimation," IEEE Trans. Circuits Syst. Video Technol., vol. 4, pp. 438–442, Aug. 1994.
[10] R. Srinivasan and K. Rao, "Predictive coding based on efficient motion estimation," IEEE Trans. Commun., vol. COM-33, pp. 888–896, Aug. 1985.
[11] L. M. Po and W. C. Ma, "A novel four-step search algorithm for fast block motion estimation," IEEE Trans. Circuits Syst. Video Technol., vol. 6, pp. 313–317, June 1996.
[12] J. Lu and M. L. Liou, "A simple and efficient search algorithm for block-matching motion estimation," IEEE Trans. Circuits Syst. Video Technol., vol. 7, pp. 429–433, Apr. 1997.
[13] J. Y. Tham, S. Ranganath, M. Ranganath, and A. A. Kassim, "A novel unrestricted center-biased diamond search algorithm for block motion estimation," IEEE Trans. Circuits Syst. Video Technol., vol. 8, pp. 369–377, Aug. 1998.