Video Compression by Three-dimensional Motion-Compensated Subband Coding

Patrick Waldemar, Michael Rauth and Tor A. Ramstad
Department of Telecommunications, The Norwegian Institute of Technology, N-7034 Trondheim, Norway
Tel: +47 73 59 69 77, Fax: +47 73 26 40, E-mail: waldemar@tele.unit.no

ABSTRACT

Three-dimensional subband coding is an alternative approach to the hybrid coding schemes used in today's video compression standards. In this work, two possible models for three-dimensional subband coding have been tested. In model I the temporal filtering is performed directly on the input sequence. In model II the temporal filtering is performed along the motion trajectories. In both models the resulting lowpass and highpass sequences are coded using interframe and intraframe coding. Motion compensation is performed using block matching. The coding results are compared to the H.263 standard.

1. INTRODUCTION

Today's video coding standards make use of hybrid coding, employing prediction with motion compensation (MC) along the temporal axis and 2-D discrete cosine transform (DCT) coding in the spatial domain [1, 2]. Other work has been reported using subband coding (SBC) in the spatial domain [3, 4]. SBC has emerged as an outstanding technique for encoding 2-D image signals, and has been shown to give increased coding gain compared to the DCT and to remove blocking artifacts [5, 6]. In 3-D SBC schemes, subband decomposition is also applied along the temporal axis of a video sequence. If the temporal-axis decomposition is performed as the first step, the high temporal subbands will represent temporal changes which are related to the motion in the original sequence. If the motion is slow and each frame has lowpass spatial characteristics, the energy in the higher temporal frequency components will be low. Hence, high energy compaction will be achieved even without motion adaptation. Schemes without motion adaptation [7, 8] have mostly been applied to videophone sequences.
If motion occurs and the spatial-domain spectrum is broadband, the correlation along the temporal filter path of the analysis may be drastically lowered. Motion-adaptive 3-D schemes [9, 10] have been proposed to overcome this problem. In order to obtain high energy compaction in the case of motion, it is convenient to use motion-compensated 3-D frequency coding. Schemes with global MC [11, 12] are straightforward, but lack efficiency in the case of inhomogeneous motion vector fields (MVFs). Pioneering work on 3-D SBC and 3-D DCT with spatially variant MC is found in [13, 14]. This scheme needs an additional encoding of a residual error signal at some of the frame positions. This problem is overcome in the method proposed by Ohm in [15, 16, 17]. In [18] the scheme is generalized from being restricted to block motion and 2-tap quadrature mirror filters (QMFs) to MC with arbitrary MVFs and any linear-phase QMFs. To obtain a multiband temporal frequency decomposition, a cascade of two-band analysis and synthesis stages is used. Very good results are obtained at the price of a high coding delay.

In this paper we use only one stage with 2-tap QMFs, corresponding to a 2x2 Hadamard transform, in order to obtain the temporal decomposition. Thus, we avoid a high coding delay. The resulting lowpass (LP) and highpass (HP) frames are coded using a 2-D subband coder. The LP frames are coded using a hybrid coder configuration, and the HP frames are intraframe coded. This paper describes two methods for temporal filtering: one with ordinary temporal filtering and one with temporal filtering along the motion trajectories. The purpose of this work is to study and compare these two systems to the standard hybrid coder (H.263), and to show the potential of 3-D SBC in video compression.

2. SYSTEM DESCRIPTION

This section describes two models for 3-D subband coding. Model I is briefly described as temporal
filtering without motion compensation. Model II uses temporal filtering along the motion trajectories. Model I will be shown to be a special case of model II. This section also describes the 2-dimensional subband coder used in both models.

2.1. Model I

Model I is a very simple 3-D subband coding system. It consists of a 2-tap QMF temporal analysis (TA) filter, splitting the sequence into one lowpass (LP) or average sequence, and one highpass (HP) or scaled difference sequence. Two consecutive frames A(m, n) and B(m, n) are input to the system simultaneously. After filtering, each lowpass frame L(m, n) in the lowpass sequence is coded using a 2-D subband coder in a hybrid coder configuration. Each highpass frame H(m, n) in the highpass sequence is intraframe coded using the same 2-D subband coder. The reconstructed lowpass frame L(m, n) and highpass frame H(m, n) are output from the subband coder. The subband coder is described in Section 2.3. The temporal reconstruction is performed using the inverse 2-tap QMF temporal synthesis (TS) filter, giving as output the reconstructed frames A(m, n) and B(m, n). The system is shown in Figure 1.

Figure 1. Block diagram of model I. Temporal filtering without MC

2.2. Model II

Temporal filtering along the motion trajectories, or motion-compensated temporal filtering, is used to increase the energy compaction of the temporal analysis filters when there is motion in the sequence. By filtering along the motion trajectories, less energy is placed in the HP sequence, and the spatial correlation in the LP frames is preserved. Hence the HP and LP sequences may be coded more efficiently. Using hybrid coding on the LP sequence instead of cascades of motion-compensated temporal filtering is the main difference between the scheme in [18] and the method studied in this paper. By not using cascades of MC temporal filtering we avoid a large coding delay.
However, since we are using a hybrid coder configuration, possible channel errors are not limited to stay within the span of the filter cascade, as is the case in [18].

Figure 2. Block diagram of model II. Temporal filtering with MC

If the motion vector field is homogeneous and zero, there is a one-to-one connection between all pixels in the two input frames. In this case the motion compensation on the analysis side and the reconstruction on the synthesis side are trivial. However, since this is not the case for a practical sequence, the algorithm used must be able to handle the situation where there is no one-to-one relation between the pixels in the two input frames to the temporal filters. This has been solved by applying the scheme given in [18]. For the convenience of the reader we reiterate parts of the explanation in [17, 18] in order to explain our model more exactly.

The motion vectors (k, l) are found between frames A(m, n) and B(m, n) by using ordinary block motion estimation with A(m, n) as the "previous" frame. Figure 3 illustrates the problem of having different connections, given by the motion vectors (k, l), between pixels in frames A(m, n) and B(m, n). By using the geometric analysis algorithm from [18] we classify all pixels as either connected, not connected or double connected. The meaning of the different connections is illustrated in Figure 3. Assume that all motion vectors are found to be zero except for the two blocks shown in frame B(m, n). In the upper frame A(m, n) it is shown how the corresponding blocks should be moved according to the motion vectors. Generally there exists a connection from every block in the B(m, n) frame to a block somewhere in the A(m, n) frame. However, as seen from the lower left part of Figure 3, the A(m, n) frame might have areas which are not connected to the B(m, n) frame because these areas are never chosen by the block matching algorithm. Areas being chosen twice or several times might also occur.
These areas are named double connected. The rest of frame A(m, n) is classified as connected because
every pixel has a one-to-one connection with a pixel in the B(m, n) frame.

Figure 3. Problem of not connected and double connected pixels

In the equations describing the different cases, the pixel positions in the L(m, n) frame are related to the positions in the B(m, n) frame, and the pixel positions in the H(m, n) frame are related to the positions in the A(m, n) frame. It is important to note that the not connected situation is related only to the A(m, n) and H(m, n) frames, and the double connected situation is related only to the B(m, n) and L(m, n) frames. In other words, if the H(m, n) frame is found from the equations for not connected pixels, the L(m, n) frame may be found either from the connected or from the double connected equations. The following temporal filter operations are used:

1. Connected
Analysis filtering:
L(m, n) = 0.5 [B(m, n) + A(m + k, n + l)]
H(m, n) = 0.5 [B(m - k, n - l) - A(m, n)]
Synthesis filtering:
B(m, n) = L(m, n) + H(m + k, n + l)
A(m, n) = L(m - k, n - l) - H(m, n)
The motion vectors used to access frame B(m, n) from frame A(m, n) are the negatives of the motion vectors used to access frame A(m, n) from frame B(m, n).

2. Not connected
Analysis filtering:
H(m, n) = 0.5 [B(m - k̂, n - l̂) - A(m, n)]
Synthesis filtering:
A(m, n) = B(m - k̂, n - l̂) - 2 H(m, n)
The pixel B(m - k̂, n - l̂) is given by the closest connected pixel in A, and (k̂, l̂) is the motion vector at this position. Hence the order of reconstruction during decoding is important: the B(m, n) frame has to be reconstructed before A(m, n). Note that only the block motion parameters (k, l) need to be transmitted. The motion vectors used in the different cases of pixel connections, including (k̂, l̂), are calculated both on the temporal analysis and synthesis side using the geometric analysis algorithm from [18].

3. Double connected
Analysis filtering:
L(m, n) = B(m, n)
Synthesis filtering:
B(m, n) = L(m, n)

The filters shown above were given in [17].
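The three connection cases above can be sketched in a short program. The following is our own minimal illustration, not the authors' implementation: it assumes an integer-pel, per-pixel motion vector field, and for not connected pixels it substitutes (k̂, l̂) = (0, 0) instead of the closest-connected-pixel rule of the geometric analysis algorithm in [18]. All function names are ours.

```python
import numpy as np

def classify(mvf):
    """Map each pixel of B to A via its motion vector (k, l) and record which
    B pixel (if any) first claims each A pixel. Unclaimed A pixels are
    'not connected'; B pixels whose target is already claimed (or falls
    outside the frame) are treated as 'double connected'."""
    rows, cols = mvf.shape[:2]
    owner = np.full((rows, cols, 2), -1, dtype=int)
    for m in range(rows):
        for n in range(cols):
            k, l = mvf[m, n]
            p, q = m + k, n + l
            if 0 <= p < rows and 0 <= q < cols and owner[p, q, 0] < 0:
                owner[p, q] = (m, n)
    return owner

def temporal_analysis(A, B, mvf):
    owner = classify(mvf)
    rows, cols = A.shape
    L, H = np.zeros_like(A), np.zeros_like(A)
    for m in range(rows):
        for n in range(cols):
            k, l = mvf[m, n]
            p, q = m + k, n + l
            if 0 <= p < rows and 0 <= q < cols and tuple(owner[p, q]) == (m, n):
                L[m, n] = 0.5 * (B[m, n] + A[p, q])   # connected: average
                H[p, q] = 0.5 * (B[m, n] - A[p, q])   # connected: difference
            else:
                L[m, n] = B[m, n]                      # double connected
    for p in range(rows):
        for q in range(cols):
            if owner[p, q, 0] < 0:                     # not connected, (k^, l^) = (0, 0)
                H[p, q] = 0.5 * (B[p, q] - A[p, q])
    return L, H

def temporal_synthesis(L, H, mvf):
    owner = classify(mvf)          # recomputed from the block vectors only
    rows, cols = L.shape
    B, A = np.zeros_like(L), np.zeros_like(L)
    for m in range(rows):          # B must be reconstructed first
        for n in range(cols):
            k, l = mvf[m, n]
            p, q = m + k, n + l
            if 0 <= p < rows and 0 <= q < cols and tuple(owner[p, q]) == (m, n):
                B[m, n] = L[m, n] + H[p, q]
            else:
                B[m, n] = L[m, n]
    for p in range(rows):
        for q in range(cols):
            m, n = owner[p, q]
            if m >= 0:
                A[p, q] = L[m, n] - H[p, q]            # connected
            else:
                A[p, q] = B[p, q] - 2.0 * H[p, q]      # not connected
    return A, B
```

With integer-pel vectors this sketch gives perfect reconstruction, and only the vectors (k, l) enter the synthesis, since the connection map is recomputed there. Setting all vectors to zero reduces it to the plain 2-tap split of model I.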
In [18] the performance was improved by allowing the use of the closest-in-time previously reconstructed frame, E(m, n), in the case of not connected pixels. Assuming constant motion across neighbouring frames in the sequence, we use the motion vector between frames A(m, n) and B(m, n) to find the motion vector between frames A(m, n) and E(m, n). Using frame E(m, n) instead of frame B(m, n), the energy in the highpass frame is lowered because all pixels in the E(m, n) frame may be chosen, whereas some pixels in B(m, n) might already be used. Using the E(m, n) frame, the order of reconstruction of frames A(m, n) and B(m, n) is arbitrary. In our model we have used the following equations in the not connected case:
Analysis filtering:
H(m, n) = 0.5 [E(m - k̂, n - l̂) - A(m, n)]
Synthesis filtering:
A(m, n) = E(m - k̂, n - l̂) - 2 H(m, n)

As seen from the equations, special care must be taken because we are using "open-loop" MC instead of "closed-loop" MC relative to the previously reconstructed frame. Looking at each case of the possible pixel connections separately, it is easy to realize that, by using one-pel resolution in the motion vectors, this scheme gives perfect reconstruction when no quantization is applied. Half-pel resolution gives only almost perfect reconstruction, but increases the energy compaction. The system is shown in Figure 2. Note that by setting all motion vectors to zero, all pixels are connected and model II is equal to model I in Figure 1.

2.3. 2-D subband coder

A 2-D subband coder is used for coding both the HP sequence, using intraframe coding, and the LP sequence, using hybrid coding. The encoder consists of an 8-channel separable analysis filter bank where the filters are applied in both the horizontal and vertical directions. The resulting subband image is quantized using uniform quantization and variable length coding (VLC) [19]. The image is reconstructed using a synthesis filter bank. The filter coefficients are optimized for coding gain [5]. The 2-D subband coder is shown in Figure 4.

Figure 4. Block diagram of the 2-D subband coder

2.4. Hybrid coding

A general hybrid coder is shown in Figure 5. By exchanging Q and Q^-1 with the encoder and decoder parts of the 2-D subband coder in Figure 4, respectively, we get the system used to encode the lowpass sequence.

Figure 5. Block diagram of a hybrid coder for image sequences

3. SIMULATIONS

The purpose of our simulations was to evaluate the coders described as model I and model II. Four different sequences were used:

1. Claire: Sequence with low motion. Frames 70-99.
2. Foreman: Sequence with low motion. Frames 1-30.
3.
Mobile & Calendar: Sequence with low global motion, but a high amount of local motion. Frames 10-39.
4. Car: Sequence with large global motion, especially panning. Frames 10-39.

All sequences have a frame rate of 25 Hz. Claire, Mobile & Calendar and Car have a frame size of 256x256 pixels, and Foreman is a QCIF sequence with frame size 144x176. Only the luminance components have been considered. The standard hybrid coder H.263 [19] is used as a reference coder. The H.263 coder used is a base version using transform coding, uniform quantization and variable length coding. The coder has the possibility of using intra mode for each block in the image, spending only one bit on the quantization of such a block. This is particularly useful when there is little or no motion in the sequence being encoded. MC is performed with 16x16 blocks, a search range of 15 pixels and half-pel resolution in all coder types.
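As an illustration of the motion estimation set-up described above, a plain full-search block matcher and a PSNR measure can be sketched as follows. This is our own minimal sketch, not the coder used in the experiments: half-pel refinement is omitted, and block size and search range are parameters.

```python
import numpy as np

def block_match(prev, cur, bsize=16, search=15):
    """Full-search block matching: for each bsize x bsize block of `cur`,
    find the displacement (k, l) within +/-search pixels that minimizes the
    sum of absolute differences (SAD) against `prev`."""
    rows, cols = cur.shape
    mvf = np.zeros((rows // bsize, cols // bsize, 2), dtype=int)
    for bi in range(rows // bsize):
        for bj in range(cols // bsize):
            m, n = bi * bsize, bj * bsize
            block = cur[m:m + bsize, n:n + bsize]
            best = (np.inf, 0, 0)
            for k in range(-search, search + 1):
                for l in range(-search, search + 1):
                    p, q = m + k, n + l
                    if 0 <= p and p + bsize <= rows and 0 <= q and q + bsize <= cols:
                        sad = np.abs(prev[p:p + bsize, q:q + bsize] - block).sum()
                        if sad < best[0]:
                            best = (sad, k, l)
            mvf[bi, bj] = best[1:]
    return mvf

def psnr(ref, rec, peak=255.0):
    """Peak signal-to-noise ratio in dB, as used for the rate-distortion plots."""
    mse = np.mean((ref.astype(float) - rec.astype(float)) ** 2)
    return 10.0 * np.log10(peak * peak / mse)
```

For a frame that is a pure translation of its predecessor, every interior block recovers the true displacement, which is the situation where the motion-compensated temporal filter removes the most highpass energy.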
Figure 6. Results with sequence Claire (PSNR vs. rate, frames 70-99)
Figure 7. Results with sequence Foreman (PSNR vs. rate, frames 1-30)
Figure 8. Results with sequence Mobile & Calendar (PSNR vs. rate, frames 10-39)
Figure 9. Results with sequence Car (PSNR vs. rate, frames 10-39)

In Figures 6, 7, 8 and 9, results are shown for the different sequences. Model II is better than model I when motion is present, but is somewhat inferior for the low-motion sequence Claire. The reason is probably that the motion in this sequence is so low that the gain from using motion compensation in the temporal filters does not outweigh the increase in bit rate caused by the extra motion vectors. Furthermore, we see that both model II and model I are better than the standard hybrid coder for the Mobile & Calendar sequence. Model II is equal or better for the Foreman sequence. For the other sequences, model I and model II perform worse than the H.263 coder. The reason might be that the 2-D subband coder used in this scheme is not yet fully optimized for the 3-D coder systems. Further improvement can be expected by extending the temporal part of the 3-D coder to several stages, as done in [18]. From the results we clearly see that model II performs differently for different sequences. The model assumes that it is possible to lower the energy in the highpass band by using motion-compensated temporal filtering. In the Car sequence the motion is very high, and in many regions the true motion is larger than the search area of the motion estimation algorithm. For the Claire sequence the H.263 coder performs extremely well, probably because it has a more efficient VLC.
The H.263 coder also includes the option of copying the last reconstructed block in the case of very low motion; in this case only one bit is spent on that particular block.

4. CONCLUSION

From the results we can say that 3-D subband coders seem promising for video compression, but the present model has some drawbacks
for very low motion sequences, and for sequences with very high motion or fast panning. However, for some sequences, motion-compensated temporal filtering gives good results which are comparable to the standard hybrid coder H.263. Further work on model II as described in this paper will include improvements in the quantization strategy for the temporal HP bands and a study of using a cascade of motion-compensated temporal filters. This scheme also seems suitable for multiframe motion estimation, since the motion estimation is performed on the original sequence. Creating a smooth motion vector field, avoiding blocking effects in the lowpass frames, will also be investigated in future work.

REFERENCES

[1] ISO/IEC IS 11172, Information Technology - Coding of Moving Pictures and Associated Audio for Digital Storage Up to about 1.5 Mbit/s (MPEG-1).
[2] ISO/IEC IS 13818, Information Technology - Generic Coding of Moving Pictures and Associated Audio Information (MPEG-2).
[3] J. Biemond, P. Westerink, and F. Muller, "Subband coding of image sequences at low bit rates," in Proc. SPIE's Visual Communications and Image Processing, pp. 741-751, 1989.
[4] Y. Zhang and S. Zafar, "Motion-compensated wavelet transform coding for color video compression," in Proc. SPIE's Visual Communications and Image Processing, pp. 301-317, Nov. 1991.
[5] S. O. Aase, Image Subband Coding Artifacts: Analysis and Remedies. PhD thesis, The Norwegian Institute of Technology, Norway, Mar. 1993.
[6] T. A. Ramstad, S. O. Aase, and J. H. Husøy, Subband Compression of Images - Principles and Examples. North Holland: Elsevier Science Publishers BV, to be published, 1995.
[7] G. Karlsson and M. Vetterli, "Subband coding of video signals for packet switched networks," Proc. SPIE's Visual Communications and Image Processing, vol. 845, pp. 446-456, 1987.
[8] F. Bosveld, R. Lagendijk, and J. Biemond, "Hierarchical video coding using a spatiotemporal subband decomposition," in Proc. Int. Conf.
on Acoustics, Speech, and Signal Proc. (ICASSP), pp. III-221 - III-224, Mar. 1992.
[9] C. I. Podilchuk, N. S. Jayant, and P. Noll, "Sparse codebooks for the quantization of non-dominant sub-bands in image coding," in Proc. Int. Conf. on Acoustics, Speech, and Signal Proc. (ICASSP), pp. 2101-2104, 1990.
[10] C. I. Podilchuk and N. Farvardin, "Perceptually based low bit rate video coding," in Proc. Int. Conf. on Acoustics, Speech, and Signal Proc. (ICASSP), pp. 2837-2840, 1991.
[11] T. A. T. Takahashi and K. Takahashi, "Adaptive three-dimensional transform coding for moving pictures," in Proc. Picture Coding Symposium, pp. 8.2-1 - 8.2-2, Mar. 1990.
[12] W. Li and M. Kunt, "Video coding using 3-D subband decomposition," in Proc. Picture Coding Symposium, pp. 11.1-1 - 11.1-2, Mar. 1993.
[13] T. Kronander, Some Aspects of Perception Based Image Coding. PhD thesis, Linköping University, Jan. 1989.
[14] T. Kronander, "New results on 3-dimensional motion compensated subband coding," in Proc. Picture Coding Symposium, pp. 8.5-1, Mar. 1990.
[15] J.-R. Ohm, "Temporal domain subband video coding with motion compensation," in Proc. Int. Conf. on Acoustics, Speech, and Signal Proc. (ICASSP), pp. III-229 - III-232, Mar. 1992.
[16] J.-R. Ohm, "Advanced packet video coding based on layered VQ and SBC techniques," IEEE Trans. Circuits, Syst. for Video Tech., vol. 3, pp. 208-221, June 1993.
[17] J.-R. Ohm, "Three-dimensional motion-compensated subband coding," in SPIE Proc. Internat. Symp. Video Commun. Fiber Optic Services, vol. 1977, pp. 188-197, Apr. 1993.
[18] J.-R. Ohm, "Three-dimensional subband coding with motion compensation," IEEE Trans. Image Processing, vol. 3, pp. 559-571, Sept. 1994.
[19] Telecommunication Standardization Sector, Study Group 15, Working Party 15/1, Expert's Group on Very Low Bitrate Videophone, Video Codec Test Model, TMN5, Jan. 1995.