Three-Dimensional Subband Coding with Motion Compensation

Jens-Rainer Ohm, Member, IEEE¹

IP EDICS category: 1.1

Abstract — Three-dimensional (3-D) frequency coding is an alternative approach to the hybrid coding concepts used in today's standards. The first part of this paper presents a study on concepts for temporal-axis frequency decomposition along the motion trajectory in video sequences. It is shown that, if a 2-band split is used, it is possible to overcome the problem of spatial inhomogeneity in the motion vector field (MVF), which occurs at the positions of uncovered and covered areas. In these cases, original pixel values from one frame are placed into the lowpass-band signal, while displaced-frame-difference values are embedded into the highpass band. This technique is applicable with arbitrary MVFs; examples with block-matching and interpolative motion compensation are given. Derivations are first performed for the example of 2-tap quadrature mirror filters (QMFs), and then generalized to any linear-phase QMFs. With 2-band analysis and synthesis stages arranged as cascade structures, higher-resolution frequency decompositions are realizable. In the second part of the paper, encoding of the temporal-axis subband signals is discussed. A parallel filterbank scheme was used for spatial subband decomposition, and adaptive lattice vector quantization was employed to approach the entropy rate of the 3-D subband samples. Coding results suggest that high-motion video sequences can be encoded at significantly lower rates than those achievable with conventional hybrid coders. The main advantages are the high energy compaction capability and the non-recursive decoder structure. In the conclusion, the scheme is interpreted more generally as a motion-compensated short-time spectral analysis of video sequences, which can adapt to the quickness of changes.
Although a 3-D multiresolution representation of the picture information is produced, a true multiresolution representation of the motion information, based on spatio-temporal decimation and interpolation of the MVF, is regarded as the still-missing part.

¹ Correspondence address: Dr.-Ing. Jens-Rainer Ohm, Technische Universität Berlin, Institut für Fernmeldetechnik, Sekretariat FT 5, Einsteinufer 25, D-… Berlin, Germany. Phone: …; Fax: …; ohm@ftsu00.ee.tu-berlin.de. This work was supported by the Deutsche Forschungsgemeinschaft (DFG) under grant No. 75/…

Introduction

Hybrid coding, employing prediction with motion compensation (MC) along the temporal axis and 2-D DCT coding in the spatial domain, is the path taken in the present digital video standardization activities [1]. Other work has been reported on "motion-compensated SBC", e.g. [2] [3] [4]. These are still hybrid coders, but with the DCT frequency decomposition replaced by 2-D subband filterbanks. Indeed, SBC has emerged as a superior technique for the encoding of 2-D image signals, which can overcome the blocking artefacts inherent in DCT schemes. Transform coding may be regarded as a special case of SBC, with the transform's basis functions interpreted as the impulse responses of a filterbank [5]. In 3-D SBC schemes, subband decomposition must likewise be applied along the temporal axis of a video sequence. One argument in favor of such a scheme is the non-recursive decoder structure (provided that FIR filterbanks are used), which avoids infinite propagation of transmission errors. If the temporal-axis decomposition is performed as the first step, the original sequence is transformed into several subsampled sequences, each of which contains the information about a specific frequency band, representing the "velocity of temporal change". If the amount of motion is low, the amount of energy in the higher-frequency component sequences will be low, and the energy compaction will be high even without motion adaptation. Hence, 3-D SBC schemes without motion adaptation [6] [7] [8] have mostly been applied to videophone sequences. If motion occurs, the correlation along the temporal filter path of the SBC analysis may be drastically lowered. To overcome this problem, motion-adaptive 3-D SBC schemes [9] [10] [11] [12] [13] were proposed, which apply the temporal-axis frequency decomposition only in the areas of low motion.
These schemes are not even applicable to scenes with global motion, because intraframe encoding would then inherently be performed over all frames. To attain high energy compaction in the case of motion, it is convenient to employ motion-compensated 3-D frequency coding. Schemes with global MC [14] [15] are straightforward, but lack efficiency in the cases of inhomogeneous MVFs and covered and uncovered areas. Pioneering work on 3-D SBC and 3-D DCT with spatially-variant MC is due to Kronander [16] [17]. His schemes need an additional encoding of a residual error signal at those frame positions that are not hit by the motion trajectory. A scheme denoted MC-SBC, which can overcome this burden, has been proposed in [18] [19] [20]. The scheme was formerly restricted to the use of block-matching MC and 2-tap QMFs; this paper gives a generalization to perform MC with arbitrary MVFs and any linear-phase QMFs. Another approach, proposed in [21], performs temporal-axis subband decomposition on a signal in which a component of displaced frame difference (DFD) is superimposed upon the original image frames. This seems to be inefficient, because high energy remains present in the higher-frequency temporal bands.

I. Motion-compensated SBC analysis and synthesis along the temporal axis

I.1. Block transforms with global and spatial-variable MC

To simplify the explanations about motion-compensated subband analysis and synthesis, the special case of block transforms is regarded first. Groups of W subsequent frames of the video signal are transformed into W frequency components c_w. The impulse response (basis function) length of all analysis and synthesis filters is W. The last frame in the group of W frames may serve as the reference frame; the motion trajectory is derived with respect to the position in this frame (see fig. 1a). A motion-compensated W-band block transform of the signal x with the analysis basis functions h_w results in the temporal-axis frequency component c_w with column, row and frame indices m, n, o:

c_w(m,n,o) = Σ_{r=0}^{W−1} x(m′,n′,o′)·h_w(r) ;  0 ≤ w < W.   (1)

The global translational motion parameters for the rth frame in the analysis block are [k(r), l(r)]. To prevent the use of pixels from outside the images x of size M×N (numbers of columns/rows), it is convenient to introduce a spatial-circular extension of the images (expressed by modulo functions):

m′ = mod(m + k(r), M) ;  n′ = mod(n + l(r), N) ;  o′ = o·W + r.   (2)

The inverse transform with the synthesis basis functions g_w re-compensates the motion shift:

y(m′,n′,o′) = Σ_{w=0}^{W−1} c_w(m,n,o)·g_w(r) ;  0 ≤ r < W,   (3)

such that perfect reconstruction can be obtained if integer-accurate motion parameters are used. Block transforms as in (1), (3) have similarity to the polyphase realization of subband filterbanks [5]. Spatially-variable MVFs can be caused by object motion, which may also be non-translational (e.g. rotation, dilation). Fig. 1b shows the case of a local object which moves in front of the background, fig. 1c an object with a change of scale. Some motion trajectories overlap, while some positions are not hit by any motion trajectory at all.
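As an illustration, the motion-compensated block transform of (1)-(3) can be sketched as below. This is a minimal sketch under simplifying assumptions: integer-pel global shifts, so the circular extension of (2) reduces to `np.roll`, and an orthonormal W = 4 Hadamard basis, for which choosing g_w = h_w yields perfect reconstruction. All function names are illustrative, not from the paper.

```python
import numpy as np

def mc_block_analyze(x, h, k, l):
    # x: (W, M, N) group of frames; h: (W, W), rows are basis functions h_w(r);
    # k, l: integer global shifts [k(r), l(r)] along the motion trajectory
    W = x.shape[0]
    # spatial-circular extension of (2): sample x(m + k(r), n + l(r), r)
    xc = np.stack([np.roll(x[r], shift=(-k[r], -l[r]), axis=(0, 1))
                   for r in range(W)])
    # (1): c_w(m, n) = sum_r x(m', n', r) * h_w(r)
    return np.einsum('wr,rmn->wmn', h, xc)

def mc_block_synthesize(c, g, k, l):
    W = c.shape[0]
    # (3): y(m', n', r) = sum_w c_w(m, n) * g_w(r), then undo the motion shift
    yc = np.einsum('wr,wmn->rmn', g, c)
    return np.stack([np.roll(yc[r], shift=(k[r], l[r]), axis=(0, 1))
                     for r in range(W)])

# Orthonormal example: W = 4 Hadamard basis (rows are orthonormal, so g = h)
W, M, N = 4, 16, 16
h = np.array([[1, 1, 1, 1], [1, 1, -1, -1],
              [1, -1, -1, 1], [1, -1, 1, -1]]) / 2.0
rng = np.random.default_rng(0)
x = rng.integers(0, 256, (W, M, N)).astype(float)
k = [3, 2, 1, 0]; l = [-2, -1, -1, 0]    # trajectory ends at the reference frame
c = mc_block_analyze(x, h, k, l)
y = mc_block_synthesize(c, h, k, l)
assert np.allclose(x, y)                 # perfect reconstruction, integer shifts
```

With low motion and an accurate trajectory, the energy concentrates in the c_0 band; with a wrong trajectory, it spreads into the higher bands.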
Where a position is not hit by any motion trajectory, it is impossible to reconstruct the signal from (3). If W is greater than 2, the only solution to this problem, as proposed in [16], seems to be the transmission of a residual error signal at those positions, which can be characterized as the parts that are "covered" in the reference frame. Of course, it is possible to exploit the spatio-temporal redundancy in the residual signal, e.g. by application of motion-compensated hybrid coding or a temporal block transform without MC.

I.2. 2-band subband decomposition using 2-tap QMFs with MC

A solution to the problem of inhomogeneous MVFs can be given for the case of a block transform with W = 2, which performs a decomposition into a lowpass signal c₀ and a highpass signal c₁. To simplify the notation in the following equations, some abbreviations are used for the original frames x, the reconstructed frames y and the subband signals c:

A(m,n) ≡ x(m,n,2o) ;  B(m,n) ≡ x(m,n,2o+1)
L(m,n) ≡ c₀(m,n,o) ;  H(m,n) ≡ c₁(m,n,o)   (4)
C(m,n) ≡ y(m,n,2o) ;  D(m,n) ≡ y(m,n,2o+1) ;  E(m,n) ≡ y(m,n,2o−1).

In the case of W = 2, the usual orthonormal block transforms (e.g. DCT, Haar and Hadamard) have the basis functions [√2/2; √2/2] for their lowpass, and [√2/2; −√2/2] for their highpass components. These can also be interpreted as the impulse responses of a perfect-reconstruction, length-2 QMF pair. The problem of inhomogeneous MVFs can be solved by the following provisions (see fig. 2):

– Subband decomposition is performed whenever a unique motion trajectory exists between A and B (this is called the "connected" case). Each sample in H is positioned at the coordinate of the A sample on the "backward" motion trajectory [k̃, l̃], while the L sample is positioned at the coordinate of the B sample on the "forward" trajectory [k, l] (see fig. 2b/c).
– When the MVF indicates that new areas were "uncovered" in B, the original B value is substituted into the L frame. The definition of "uncovered" positions depends on the motion estimation (ME) scheme. Examples for block matching and interpolative ME are given in the following section.
– When the MVF indicates that areas of A are "covered" in B, a motion-compensated DFD value towards the previous reconstructed frame E is substituted into the H frame.

To avoid brightness variations between "connected" and "uncovered" positions in the L frame, it is necessary to use the non-orthonormal subband analysis filter pair H₀(z) = 0.5 + 0.5·z⁻¹ for the lowpass and H₁(z) = −0.5 + 0.5·z⁻¹ for the highpass branch. It is then consistent to multiply the DFD values substituted into the H frame by a factor of 0.5. With polyphase filters, the analysis equations are:

"connected" :  L(m,n) = 0.5·[B(m,n) + Ã(m+k(m,n), n+l(m,n))]   (5)
"uncovered" :  L(m,n) = B(m,n)   (6)
"connected" :  H(m,n) = 0.5·[B̃(m+k̃(m,n), n+l̃(m,n)) − A(m,n)]   (7)
"covered" :  H(m,n) = 0.5·[Ẽ(m+k̄(m,n), n+l̄(m,n)) − A(m,n)].   (8)

The tildes on A, B, E indicate that these values may be estimates at subpel positions (if k, l, k̃, l̃, k̄, l̄ are non-integer values; fig. 2b/c illustrates the definition of these parameters). The reversed motion parameters k̃, l̃ are defined at the "connected" positions, where the "nint" function points to the nearest-integer value:

[k̃, l̃](m,n) = −[k, l](m*,n*) ;  m = nint(m* + k(m*,n*)) ,  n = nint(n* + l(m*,n*)).   (9)

A symbolic program for the derivation of "connected"/"unconnected" positions and of the parameters k̃, l̃ is given as appendix A. At the "covered" positions, it is reasonable to assume homogeneous motion and define k̄, l̄ as the displacement at the adjacent "connected" position (see fig. 2b). The synthesis equations are:

"connected" :  C(m,n) = L̃(m+k̃(m,n), n+l̃(m,n)) − H(m,n)   (10)
"covered" :  C(m,n) = Ẽ(m+k̄(m,n), n+l̄(m,n)) − 2·H(m,n)   (11)
"connected" :  D(m,n) = L(m,n) + H̃(m+k(m,n), n+l(m,n))   (12)
"uncovered" :  D(m,n) = L(m,n).   (13)

Note that now estimates L̃, H̃ are used in the case of subpel-accurate MC. With integer accuracy of the motion parameters, L̃ = L, H̃ = H, and hence C(m,n) = A(m,n), D(m,n) = B(m,n). Perfect reconstruction is guaranteed.

I.3. Estimation of motion parameters

In earlier publications [19] [20], block matching (BM) was the basis of ME within the MC-SBC scheme. This is shown to be a special case of the analysis/synthesis equations given above. The translational motion vector [k, l]_BM^(i,j) for the block of size I×J with the start coordinates (i·I, j·J) can be found by the BM algorithm

[k, l]_BM^(i,j) = arg min_{[k,l] ∈ Π} Σ_{m=i·I}^{(i+1)·I−1} Σ_{n=j·J}^{(j+1)·J−1} d( B(m,n), A(m+k, n+l) ),   (14)

where d(·,·) is the frame difference criterion (e.g. minimum absolute or mean squared error), and Π the search range. Since the MVF is constant over the whole block with BM, we get k(m,n) = k_BM^(⌊m/I⌋,⌊n/J⌋) and l(m,n) = l_BM^(⌊m/I⌋,⌊n/J⌋). The parameters k̃, l̃ are calculated according to (9). "Uncovered" positions in frame B are present if the shifted blocks in the A frame overlap (this case was called "doubly connected" in refs. [19] [20]), while "covered" positions in frame A are indicated by no reference between A and B (the "unconnected" case of [19] [20]). Fig. 3 illustrates the scheme, with frames A and B partitioned into 4 blocks.
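As an illustration of (5), (7), (10) and (12), the following sketch assumes a single global integer-pel displacement, so every position is "connected" and the reversed parameters of (9) reduce to (k̃, l̃) = (−k, −l); the circular border handling and all names are our own simplifications:

```python
import numpy as np

def shift(img, k, l):
    # circular sampling img[m + k, n + l] at integer-pel accuracy
    return np.roll(img, shift=(-k, -l), axis=(0, 1))

def analyze_2tap(A, B, k, l):
    # fully "connected" case with one global vector [k, l]
    L = 0.5 * (B + shift(A, k, l))       # (5): L at the B coordinates
    H = 0.5 * (shift(B, -k, -l) - A)     # (7): H at the A coordinates
    return L, H

def synthesize_2tap(L, H, k, l):
    C = shift(L, -k, -l) - H             # (10): reconstructs A
    D = L + shift(H, k, l)               # (12): reconstructs B
    return C, D

rng = np.random.default_rng(1)
A = rng.random((8, 8))
B = rng.random((8, 8))
L, H = analyze_2tap(A, B, 2, -1)
C, D = synthesize_2tap(L, H, 2, -1)
assert np.allclose(C, A) and np.allclose(D, B)   # perfect reconstruction

# When B is a pure shift of A along [k, l], the highpass band vanishes:
B2 = shift(A, 2, -1)
_, H2 = analyze_2tap(A, B2, 2, -1)
assert np.allclose(H2, 0.0)
```

The last assertion illustrates the energy compaction along the motion trajectory: an accurately compensated trajectory empties the H band.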
If multiple overlaps occur, the positions in B belonging to the lefthand/uppermost block are defined as "connected". Improvement is possible by application of hierarchical BM, a scheme originally developed for MC interpolation [22]. This prevents adjacent blocks from producing largely different motion vectors and raises the number of "connected" pixels. Note that the total number of "uncovered" and "covered" positions is always identical with BM. Two main problems result with the BM procedure outlined above: Inhomogeneities in the MVF are produced whenever the motion of a closed object is non-translational. With the scheme described, parts of rotated or dilated objects would be classified as "uncovered" and "covered", which indeed is not the case.

The placement of "covered" and "uncovered" positions is quite accidental. The positions selected by the described procedure may be totally different from the real occurrence of occlusion effects. The operation of an interpolative MC (IMC) algorithm, which is regarded as a first step to solve these problems, is shown in fig. 4. The MVF is defined by the translational shift of support points; the motion in between these points is derived by bilinear interpolation. Hence, if the support points in frame B form a rectangular grid, the movement of each point influences a region which is bordered by its eight neighbors (fig. 4a). The estimation within this region is performed similarly to BM; the search range Π marks the maximum-allowed shift of the support points. Movements of adjacent points influence each other, which makes it necessary to perform ME iteratively to approach an optimum. Two iterations were found to be sufficient. The first iteration was performed on a subsampled pixel grid, with a large search range and a step size (search accuracy) of two pixels. In the second iteration, the search range was decreased to two pixels and the search accuracy increased to half-pel. The complexity increase, as compared to BM with the same search range, is four additions per pel and search step to interpolate the motion parameters; the number of search steps in the first iteration is the same as in full-search BM with the same search range, while the second iteration has 81 search steps, independent of the search range. Computation time, as compared to BM, was increased by a factor of approximately three. Fig. 4b shows an example of how rotational motion is captured by the procedure. With IMC, usually no "covered"/"uncovered" positions are present, but the area referenced in frame A may become remarkably smaller or larger than the search region in B (see fig. 4c). This occurs in the cases of fast occlusions, or scale changes between the frames.
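The bilinear interpolation of the MVF in between the support points can be sketched as follows (one component of the vector field; a regular support-point grid of spacing `step` is assumed, and all names are hypothetical):

```python
import numpy as np

def interp_mvf(kgrid, step):
    """Bilinearly interpolate one MVF component from support-point values.
    kgrid: (S+1, T+1) shifts at the support points; returns the dense
    (S*step+1, T*step+1) field covering the spanned region."""
    S1, T1 = kgrid.shape
    m = np.arange((S1 - 1) * step + 1)
    n = np.arange((T1 - 1) * step + 1)
    i = np.minimum(m // step, S1 - 2)[:, None]   # grid cell index per row
    j = np.minimum(n // step, T1 - 2)[None, :]   # grid cell index per column
    a = m[:, None] / step - i                    # vertical weight in [0, 1]
    b = n[None, :] / step - j                    # horizontal weight in [0, 1]
    return ((1 - a) * (1 - b) * kgrid[i, j] + (1 - a) * b * kgrid[i, j + 1]
            + a * (1 - b) * kgrid[i + 1, j] + a * b * kgrid[i + 1, j + 1])

# A horizontal ramp of support-point shifts interpolates linearly in between:
k = interp_mvf(np.array([[0.0, 2.0], [0.0, 2.0]]), 2)
assert np.allclose(k[:, 1], 1.0)   # halfway between the support points
assert np.allclose(k[:, 2], 2.0)   # exact at the support points
```

In an iterative ME loop, each candidate shift of one support point changes the dense field only inside the region bordered by its neighbors, which is what makes the iteration local.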
The interpolation is switched off if the area F of the triangles bordered by the support points is altered drastically from frame A to B. For the example of fig. 4c, it is necessary to switch off the interpolation in the areas shown in fig. 4d. This action is performed whenever F_B < 0.9·F_A or F_B > 1.15·F_A (F_A and F_B are the areas of corresponding triangles in frames A and B, respectively). The motion vectors of the nearest support points are extrapolated in those exception areas. The total numbers of "covered" and "uncovered" positions that are introduced may not be equal with IMC, e.g. in the case of a slight change of scale. Two improvements are presently under investigation:

– More exact positions of occlusions could be derived by object-oriented analysis [23], using information from the previous frames.
– Irregularly-spaced support points give a more exact definition of the true MVF. The efficient encoding of such a structure can be regarded under the aspect of a multiresolution representation of the MVF, as further discussed in the conclusion.

I.4. Higher-order QMFs

The concept developed for 2-band split systems with 2-tap QMFs is now extended to arbitrary symmetric (linear-phase) QMFs. Block diagrams of the complete analysis and synthesis MC-SBC filter stages are given in figs. 5 and 6. All switches are shown in "connected" positions. Again, we regard the polyphase realization, which performs decimation prior to analysis filtering, and interpolation after the synthesis filter operation. Motion estimation and the "covered"/"uncovered" analysis must now be applied at each frame position, while it was necessary only at each second frame in the special case of 2-tap filters. The former A/B pairs of frames are those filtered with the center coefficients h(R/2) and h(R/2−1) of an even-length-R symmetric filter. At these positions, the substitution of original and DFD values remains as in (6) and (8) for the "uncovered"/"covered" cases. Let h₀(r) = h(r) be the even-length impulse response of the lowpass analysis filter, and let the highpass filter be defined as h₁(r) = (−1)^r·h(r). For model filters with odd-length impulse responses, add a coefficient h₀(R−1) = 0, and set h₁(0) = 0, h₁(r) = (−1)^(r−1)·h(r−1) for the remaining coefficients to obtain even-length filters. The model filter h(r) must have unity gain, i.e. the sum of its coefficients must be 1. This decomposition is non-orthonormal, as in the case of the 2-tap filters. The delay introduced during analysis is R/2 frames. In the "connected" case, the analysis equations are:

c₀(m,n,o) =   h₀(R−1)·x̃(m+k₋(m,n,R/2−1), n+l₋(m,n,R/2−1), 2o−R/2+1)
            + h₀(R−2)·… + h₀(R/2)·x̃(m+k(m,n), n+l(m,n), 2o)
            + h₀(R/2−1)·x(m,n,2o+1)
            + h₀(R/2−2)·… + h₀(0)·x̃(m+k₊(m,n,R/2−1), n+l₊(m,n,R/2−1), 2o+R/2)   (15)

c₁(m,n,o) =   h₁(R−1)·x̃(m+k₋(m,n,R/2−1), n+l₋(m,n,R/2−1), 2o−R/2+1)
            + h₁(R−2)·… + h₁(R/2)·x(m,n,2o)
            + h₁(R/2−1)·x̃(m+k̃(m,n), n+l̃(m,n), 2o+1)
            + h₁(R/2−2)·… + h₁(0)·x̃(m+k₊(m,n,R/2−1), n+l₊(m,n,R/2−1), 2o+R/2)   (16)

Again, the motion trajectory [k, l] is defined with reference to the B frame, while [k̃, l̃] refers to A. The motion trajectories are composed of the values [k, l]₀ at the center position (which are identical with the 2-tap case), and each R/2 values [k, l]₋ and [k, l]₊ pointing to past and future frames, respectively (see fig. 7a). A symbolic program to derive [k, l]₋ and [k, l]₊ by "motion tracking" from the frame-to-frame motion parameters is given as appendix B. With the synthesis filters defined as g₀(r) = h₀(R−r−1) and g₁(r) = h₁(R−r−1), the synthesis equations are:

y(m,n,2o) =   g₀(R−1)·c̃₀(m+k₋(m,n,R/2−2), n+l₋(m,n,R/2−2), o−R/4)
            + … + g₀(R/2)·c̃₀(m+k̃(m,n), n+l̃(m,n), o) + …
            + g₀(1)·c̃₀(m+k₊(m,n,R/2−3), n+l₊(m,n,R/2−3), o+R/4−1)
            + g₁(R−1)·c̃₁(m+k₋(m,n,R/2−1), n+l₋(m,n,R/2−1), o−R/4)
            + … + g₁(R/2)·c₁(m,n,o) + …
            + g₁(1)·c̃₁(m+k₊(m,n,R/2−4), n+l₊(m,n,R/2−4), o+R/4−1)   (17)

y(m,n,2o+1) =   g₀(R−2)·c̃₀(m+k₋(m,n,R/2−4), n+l₋(m,n,R/2−4), o−R/4+1)
              + … + g₀(R/2−1)·c₀(m,n,o) + …
              + g₀(0)·c̃₀(m+k₊(m,n,R/2−1), n+l₊(m,n,R/2−1), o+R/4)
              + g₁(R−2)·c̃₁(m+k₋(m,n,R/2−3), n+l₋(m,n,R/2−3), o−R/4+1)
              + … + g₁(R/2−1)·c̃₁(m+k(m,n), n+l(m,n), o) + …
              + g₁(0)·c̃₁(m+k₊(m,n,R/2−2), n+l₊(m,n,R/2−2), o+R/4)   (18)

Modify the indices of filters and motion parameters in steps of 2, until reaching the center coefficients, to obtain the full formulation of (17) and (18). These equations are valid for filter lengths R = 4, 8, 12, ... For R = 6, 10, 14, ...: interchange all indices of h between (17) and (18) in lines 1, 3, 4 and 6; let the o-axis indices of c₀ and c₁ run from o−R/4+1/2 to o+R/4−1/2; replace the indices of k and l by R/2−3 in lines 1, 6 and by R/2−2 in lines 3, 4. The outer lines must be omitted if the filters are so short that indices of coefficients or frames would coincide with those in lines 2 and 5. The use of substituted original or DFD values must be avoided at the outer coefficient positions in (17) and (18). This would be the case whenever motion trajectories hit each other, or are not continued due to a detected occlusion, as shown in fig. 7b (in the [k, l]₊ and [k, l]₋ parts of the motion trajectories, this indicates the presence of "covered" and "uncovered" pixels, respectively). A disrupted motion trajectory can be handled by a constant-value-extension method, which is a usual choice for subband analysis/synthesis of finite-length signals [24]. All coefficients remaining at the tail of the filter are multiplied with the value of the pixel situated at the last valid position within the motion trajectory. The total delay after synthesis is R frames.

I.5. Spatial interpolation for subpel-accurate MC

When subpel-accurate MC is applied, spatial interpolation operations are necessary to estimate signal values between known samples, according to (5)-(13) and (15)-(18).
The L image is generated after spatial interpolation in frame A, while spatial interpolation in frame B must be performed to generate the H image. During synthesis, the H image must be interpolated to reconstruct frame B, while L image interpolation is necessary to reconstruct frame A. With higher-order QMFs, more interpolations are necessary at the positions of all outer coefficients. MC-SBC with subpel accuracy allows no perfect-reconstruction synthesis. Bilinear interpolation is a widely used scheme for subpel value estimation. Unfortunately, the equivalent 1-D filter for the bilinear interpolator, e.g. applied to half-pel positions, is a strong lowpass with transition frequency (3 dB attenuation) at Ω = π/2. If the interpolation filter has such a smooth frequency roll-off, the result after reconstruction appears heavily blurred. One approach to obtain subpel values with higher accuracy is the fast algorithm for cubic spline interpolation [25], which has a complexity of 4 multiplications per pixel-to-be-interpolated per spatial dimension. Applied over 4 cascaded analysis/synthesis stages, a slight blurring effect remains visible, but the quality is sufficient at low data rates. Better interpolation results were obtained by parallel (blockwise) interpolation in the DCT frequency domain, as shown in fig. 8. After the blockwise transform, zero values are appended to the DCT spectrum (fig. 8a), then a quadruple-sized IDCT is applied. The positions of the interpolated pixels resulting after the IDCT lie apart from the former original values (fig. 8b). To obtain estimates at any subpel position, bilinear interpolation is still necessary. This interpolation is performed in an upsampled image, and causes no severe degradation of the higher frequencies. The block size of the DCT should be large, because values at the block borders are inaccurate; hence, the interpolation blocks must overlap (fig. 8c). It was found that a DCT block size of 32x32 and an overlap of 3 pixels are sufficient for satisfactory reconstruction results. Differences between the original sequence and the reconstruction over 4 analysis/synthesis stages are hardly visible (some ringing effects may appear when fields instead of frames are interpolated, but these are not visible in the motion video presentation). For the highly-detailed sequence MOBILE&CALENDAR, the reconstruction PSNR is more than 37 dB; other sequences showed PSNR values of … dB.

I.6. Cascade structures

To obtain multi-band frequency decompositions, the 2-band analysis and synthesis stages must be arranged as cascade structures. An example is the octave-band structure shown in fig. 9. For optimum energy packing in the subband signals, it is necessary to optimize the motion parameters at each stage of the cascade. The results of motion analysis from one lower cascade stage are used as a starting point for estimation at the next-higher stage.
Even with the 2-tap filters, where ME is performed only at each second frame position, a simple addition of the local motion parameters from two adjacent A-B frame pairs was sufficient to obtain the initial estimate for the next-higher stage. This reduces the overall complexity, because the search range Π can be kept small at all stages. The result of motion-compensated temporal subband decomposition undergoes a 2-D spatial decomposition. To compare the efficiency of different motion-compensated and uncompensated temporal axis decompositions, it is necessary to have regard to the spectral flatness of the resulting 3-D signals. This is taken into account by the coding gain, which is defined as the ratio of arithmetic mean to geometric mean values from the quadratic expectation values of the resulting frequency components [26]. The octave-band decomposition example in fig. 9 results in a 16:1 decimation of the lowest frequency band. Two more decomposition schemes were compared to that, which result in the same bandwidth of the LLLL band. These are a full-band decomposition with constant-width subbands and an 8-band modified octave-band structure, where the first H band was split again in an octavelike fashion. The resulting frequency band partitions for all three schemes are shown in fig. 10. Coding gains of 3-D coding over 2-D intraframe coding are given in tab.1; for spatial decomposition, the TDAC scheme described in section II.2 was employed. The values were -9-

10 calculated from the 25 Hz video sequences MOBILE&CALENDAR, FLOWER GARDEN and TABLE TENNIS; two sampling formats were compared, each for the cases without and with MC (the latter with full-pel and half-pel accurate BM) : Interlaced (CCIR 601, 720x576 pixels); in this case, the odd fields are the A-, the even fields the B-frames fed into the first stage of the cascade. Progressive (SIF, 352x288 pixels), which were generated by rowwise subsampling of the odd fields from the CCIR 601 sequences. The coding gain clearly increases with a higher number of subbands for the full-band decomposition. The gain achievable by half-pel accuracy is higher by around 2 db for progressive and 2.5 db for interlaced sequences, as compared to full-pel accuracy. With MC, the coding gain in the cases of octave-band (for progressive sequences) and modified octave-band (for interlaced sequences) decompositions, is almost as good as for the full 16-band decomposition. The efficiency of the octave-type structures is important, because less motion parameters have to be calculated and transmitted as side information, than with full-band decomposition. With the 2-tap filters, the octave-band structure equals the Haar wavelet transform, while the full-band structure is equivalent to the Hadamard transform (both except for a scale factor, and only in the "connected" areas). The modified-octave structure may be viewed under the theory of wavelet packets [28]. In the "interlaced" case, the H band contains high energy, if the octave-band structure is employed, which is due to the spatio-temporal shift between adjacent fields. With the modified-octave structure, the information about the brightness of both even and odd fields is concentrated in LLLL, the information about their differences in HLLL. Fig. 11 shows examples of image fields, resulting after temporal modified-octave decomposition of the interlaced MOBILE&CALENDAR sequence. In the case without MC (fig. 
11a), the LLLL image appears heavily blurred, while the HH image still contains a high amount of information. This is no longer the case when MC is applied. The spatial information in the lowest-frequency temporal band LLLL is sharp, and can be regarded as a mean value extract from a number of adjacent frames. Furthermore, it is interesting to note the differences between BM (fig. 11b), and interpolative MC (fig. 11c). In the BM case, blocking effects appear in the LLLL image, which can be expected to cause degradations at higher compression ratios. Experiments with higher-order QMFs were performed, using Johnston's filters 8A and 16C [29]. The 8-tap filter was modified to unity gain. The longer filters were applied up to the second stage of the cascade, in order to keep a reasonable encoding delay. The coding gains over the 2-tap filters were 0.04/0.11 db with BM motion compensation and 0.07/0.13 db with IMC, for the 8-tap/16-tap filters, respectively. These relatively low coding gains indicate the high correlation along the motion trajectory. It can be concluded, that the application of longer filters is not reasonable at high rates, where the coding gain is a measure to determine the rate-distortion efficiency [26]. At low rates, the longer filters were found to eliminate jerky, artificial movements, which are temporal-axis blocking effects, appearing with the 2-tap filters. Unfortunately, the number of motion parameters to be -10-

11 calculated and transmitted is doubled, when longer filters are used. It is suggested that a new strategy of motion representation, including spatio-temporal interpolation of the MVF, is needed instead of the lossless frame-to-frame parameter encoding concept, to gain full advantage of longer QMFs. II. Encoding of the temporal-axis subband signals II.1. Comparison with MC prediction coding The basic decomposition structure of the MC-SBC scheme is shown in fig. 12 for the 2-tap filter case. With the non-orthonormal filters, the resulting L image is the motion-compensated average, while the H image is half of the DFD between frames A and B. If the quantizer step size chosen at the original image level is Q, the optimum step sizes to encode the L and H images must be Q / 2, to achieve the same MSE (this is just the factor distinguishing the filters as used from orthonormal ones). MC prediction coders would perform intraframe coding of A, and DFD encoding of B, both with step size Q. It follows that, with MC-SBC, the DFD signal (H frame) must be encoded by a factor of 2 / 2 less accurate than in MC prediction. As a counterpart, the L frame carries mean value information about both frames A and B, and must be encoded by the same factor more accurate than the original (intra-coded) frame. With the R=1/2 log 2 (σ 2 /D) formula from ratedistortion theory [26], we would come to the conclusion that no coding gain over MC prediction is possible by the application of MC-SBC with 2-tap filters. This effect remains constant with the number of cascaded stages; for example, the four-stage configuration of fig. 9 would result in no coding gain, as compared to MC prediction with a frame refresh at each 16th frame. Indeed, two important differences must be stated : The requirement for a more exact quantization of L indicates, that energy compaction (concentration of information to the lowest-frequency band) is higher in MC-SBC. 
This effect increases with the number of cascaded stages. It is well known that schemes with higher energy compaction are superior for encoding at low data rates; e.g., transform coding of still images clearly outperforms DPCM at rates below 1 bit/pixel. In MC-SBC, the DFD signal is calculated between the original frames A and B; in MC prediction, between a reconstructed A and an original B. This means that coding error feedback (which deteriorates the efficiency of MC prediction at low rates) does not occur. Both effects have their counterparts in a more efficient transmission over lossy channels. The higher energy compaction allows an efficient protection of information, while the non-recursive structure inhibits propagation of transmission errors [19]. Of course, these statements are only true for the "connected" parts of the decomposition. For "uncovered" pixels (which are original values from B), the optimum step size is Q, while for the DFD values at "covered" positions (which carry the whole information about A), the optimum step size is Q/2. Hence, the performance at these positions would be the same as with an MC prediction coder which applies intraframe encoding at the uncovered parts of an image. It follows that the optimum quantizer step sizes differ between the "connected", "covered" and "uncovered" positions. The step sizes at position (m,n) can be calculated for the L and H frames at any cascade stage (where q_A, q_B are the outputs from the next lower stage; set q_A = q_B = Q for the first stage):

"connected" : q_L(m,n) = q_A(m+k(m,n), n+l(m,n)) · q_B(m,n) / sqrt( q_A²(m+k(m,n), n+l(m,n)) + q_B²(m,n) )   (19)

"uncovered" : q_L(m,n) = q_B(m,n)   (20)

"connected" : q_H(m,n) = q_A(m,n) · q_B(m+k(m,n), n+l(m,n)) / sqrt( q_A²(m,n) + q_B²(m+k(m,n), n+l(m,n)) )   (21)

"covered" : q_H(m,n) = 0.5 · q_B(m,n)   (22)

An algorithmically simpler form for the "connected" cases is to proceed with 1/q_L(H)² = 1/q_A² + 1/q_B² from stage to stage.

II.2. Spatial decomposition of the temporal-axis subbands

The 2-D images (L.. and H..) resulting after motion-compensated temporal-axis subband decomposition exhibit spatial correlation. Generally, any 2-D image compression scheme like DCT, SBC, VQ or fractal coding might be employed; e.g., earlier experiments combined MC-SBC with a 2-D DCT [18]. Better coding results than with DCT were, however, obtained by the application of a time-domain aliasing cancellation (TDAC) subband decomposition scheme [30], a parallel filterbank approach resulting in U×V subbands. A fast algorithm for 2-D TDAC is based on a 2-D DCT of size 2U×2V [31]; U=V=8 was chosen in the experiments, resulting in 64 spatial subbands of constant bandwidth.
In fact, TDAC is very similar to the lapped orthogonal transform (LOT) approach proposed more recently for image coding applications [32]; both belong to the class of cosine-modulated filterbanks. It is now described how the requirement for spatially variable quantizer functions q(m,n) from (19)-(22) can be fulfilled. The subband transform "weighs" the local quantizer functions by the absolute values of the impulse responses h_u,v(p,q) (size P×Q subband analysis filters), which are used to calculate the spatial subband coefficients c_u,v(i,j); i=m/U, j=n/V are the coordinate positions in the subband domain. The optimum quantizer step sizes for these coefficients in the case of orthonormal decompositions then are

q_u,v²(i,j) = Σ_{p=0..P−1} Σ_{q=0..Q−1} q²( iU − p + P/2, jV − q + Q/2 ) · h_u,v²(p,q).   (23)

(23) can be realized via a fast transform algorithm in the case of TDAC decomposition.

II.3. Encoding of the spatio-temporal subband signals

To approach the entropy rate of the spatio-temporal subband decomposition, the adaptive lattice VQ (ALVQ) scheme shown in fig. 13 was employed. This scheme was described in more detail in [19]; in an MC prediction coder, a slightly lower rate was achieved than with the VLC of MPEG. The scheme adapts well to the varying statistics of the spatio-temporal subbands. For the lowest-frequency temporal subband, spatially weighted quantization was applied; for this purpose, MPEG's intra_quantizer_matrix was used [1]. The remaining temporal-subband quantizers were designed with a deadzone, which is 3/2 of the usual quantizer step size. In ALVQ, only samples from the same spatio-temporal subband are arranged into a vector. The adaptive components are run-length coding (RLC) and codebook-size adaptation. Two stages of RLC are used: Block-RLC indicates the positions (i,j) where any subband coefficients c_u,v(i,j) have to be quantized; sample-RLC points to the positions of these coefficients inside the block. Block-RLC significantly lowers the rate for the high-frequency temporal subbands, where often only a few samples have to be transmitted. The lattice E8 was employed for rates above 2 bits/sample, Λ16 for the lower rates, as requested by the codebook-size adaptation. All adaptation parameters, and the codebook index vectors resulting from the procedure described in [33], are encoded by simple Huffman VLCs.

II.4. Encoding of the motion parameters

The octave-band cascade structure of MC-SBC results in a sort of pyramid representation of the motion parameters; the higher stages exhibit motion which is present over several frames, while the lowest stage represents the frame-to-frame motion. This fact was used to reduce the search range, as described in section I.6. The redundancy in the spatio-temporal MVF can likewise be exploited for encoding of the motion parameters.
Motion parameters were encoded differentially, proceeding from the bottom of the decomposition cascade to the top: the initial estimate of ME is subtracted from the actual value. Additionally, a spatial prediction from the neighboring left and upper parameter positions is performed (the parameters are the block shifts in BM and the support point shifts in IMC). To encode the parameter differences, MPEG's VLC table was applied. The rate saving, as compared to pure spatial prediction at each cascade stage, is 5-10 %.

II.5. Results

The following coding examples were performed on color (YUV) sequences; ME was performed with the luminance component Y, and the motion parameters were halved according to the

subsampling factors of the color components U and V. Otherwise, the spatio-temporal decomposition and quantization strategies were the same for Y, U and V. To evaluate the performance of the 3-D MC-SBC coder, it was compared to an MC prediction coder and to 3-D SBC without MC. Fig. 14 shows the PSNR results obtained with the CCIR 601 interlaced MOBILE&CALENDAR sequence (the given PSNR is averaged over luminance and chrominance components, and over all frames). All coders used the same scheme for spatial encoding (TDAC with ALVQ). MC-SBC was performed with BM and IMC; MC-SBC/BM and MC prediction used half-pel accuracy. MC prediction was performed with a field/frame adaptive BM, and without frame refresh. In BM, the size of the search blocks was 16×16; the support points in IMC were also on a grid with 16-pixel spacing. The hybrid coder lags behind by approximately 4 dB at 2 Mbit/s and comes closer at higher data rates; this behaviour is as expected from the statements in section II.1. The gain of MC-SBC over SBC without MC remains almost constant at around 4 dB over a wide range of data rates. MC-SBC/IMC outperforms MC-SBC/BM, especially at low rates. To further enhance the coding efficiency, it was found convenient to perform MC prediction of the LLLL images. In the case of a scene change, the cascade decomposition must be interrupted; remaining lowpass images at any cascade stages are then also encoded with MC prediction from their predecessors. Low bit rate coding results for different sequences are illustrated in fig. 15; the rates for the different components are given in tab. 2. In the MOBILE&CALENDAR example, MC prediction of the LLLL fields was applied, but with a frame refresh at every 16th frame; the reader may compare this to the results of an MPEG coder with a GOF length of 16. The examples with SIF sequences (FLOWER GARDEN and TABLE TENNIS) are without frame refresh over the whole sequence, except for the scene changes in TABLE TENNIS.
For TABLE TENNIS, the rates for the first part with a zoom, which consumes most of the bit rate, are given in brackets. The rates for the higher-frequency temporal subbands and motion parameters increase drastically, due to the faster changes. All these examples exhibit compression ratios between 150:1 and 200:1 for full-motion sequences!

III. Conclusions

This paper has described new strategies to apply motion-compensated subband analysis along the temporal axis of video sequences. The technique can easily be extended to a variety of schemes based on 2-band splits, including wavelet approaches [27]. The result is a motion-compensated, spatio-temporal multiresolution representation of the video signal, which depends on an accompanying component of motion information. It is a widespread opinion in the image coding community that frame skipping is sufficient to obtain a multiresolution representation along the temporal axis of video signals; it is argued that the 3-D signal is composed of a pure 2-D image signal and a displacement field [34].

The author does not agree with this point of view. The occurrence of occlusions produces new parts of image information. This effect must not be neglected if we regard the levels of the temporal hierarchy. The motion-compensated 3-D spectrum concentrates as much information as possible at the lowest temporal frequency, if MC is perfect. The MC-SBC scheme can be viewed as a realization of a short-time spectral analysis which adapts to the occurrence and quickness of occlusions. The motion-compensated subband analysis is performed with a finer temporal resolution (subband analysis is switched off, performing the substitutions mentioned) whenever image information vanishes or new areas appear. An effect of this property is visible only in a moving video presentation: the foreground tree of FLOWER GARDEN in fig. 15b moves very fast. Here, the covered/uncovered areas are updated at each frame, even at this low data rate. The tree right in front of the house also covers and uncovers small parts of the house with each frame, but this leads to a relatively small energy in the highpass bands. Updating occurs less frequently, which results in a slightly "gummy" movement of tree and background. Viewers find this a subtle and thoroughly acceptable effect; it is surely less serious than the jerk of whole images, which occurs with frame skipping and is unacceptable for full-motion video. MC-SBC can perform spatio-temporally scalable encoding of video sequences, which may allow a unique hierarchical representation from very low resolution at low bit rates up to a high-quality level. In this context, the non-recursive decoder structure is advantageous; one severe obstruction of hybrid coders is put aside. Up to now, the spatio-temporal multiresolution property has only been realized for the image-information part of the 3-D signal. The MC-SBC scheme still needs a spatio-temporally hierarchical, or scalable, representation of the motion information.
At the present state, lossless encoding of the motion information, as used during analysis, is still required for subband synthesis. To solve this problem, the interactions between image information and motion information have to be further investigated. Spatio-temporal interpolation of the motion parameters is regarded as a convenient approach, which would open the path to higher-order subband filters with better aliasing cancellation properties. An approach in this direction will be presented in a forthcoming paper [35]. Many directions for further improvement can be thought of. The interactions of the subband filters used for temporal and spatial decomposition must be carefully examined, especially from the viewpoint of wavelet theory. With a proper choice of spatial filters, the spatial interpolation for sub-pel-accurate MC might also be integrated into the 3-D subband decomposition; this would replace the DCT interpolation, which seems unnatural for the scheme. MC can be enhanced by use of object-oriented techniques, which may regard not only the information from the previous image frame (as predictive object-oriented coders usually do), but also that from the higher levels of the temporal-axis subband decomposition. Weighted quantization with regard to the spatio-temporal response of the human visual system could be applied. Combinations with nonlinear encoding techniques, like fractal coding for the image information in the temporal lowpass band, can also be suggested. The MC-SBC scheme can not only be combined with most techniques

investigated today to enhance hybrid coders, but may also give rise to further development of new approaches like multiframe motion compensation, which could more efficiently exploit the temporal-axis correlation in video sequences.

Appendix A : Example program in a C-like notation for derivation of the reverse motion parameters [k',l'] (frame A → B) from [k,l] (frame B → A). The array arr_/k,l/ must be calculated in advance and may already contain UNCOVered positions, if the ME procedure allows them; additional conditions for UNCOVerings are stated in the program. The arrays contain the horizontal and vertical displacement components k and l, which may also be used separately as arr_/k/, arr_/l/. OUT_FR defines a displacement reference outside the frame size; ni[·] denotes the nearest-integer function.

arr_/k,l/[number_of_rows, number_of_columns]             /* MVF B → A */
arr_/k',l'/[number_of_rows, number_of_columns] = COVER   /* define all positions COVERed in advance */

for (n=0; n<number_of_rows; n++)
  for (m=0; m<number_of_columns; m++) {
    if (arr_/k,l/[n,m] != UNCOV) {
      if ([n+ni[arr_/l/[n,m]], m+ni[arr_/k/[n,m]]] == OUT_FR)
        arr_/k,l/[n,m] = UNCOV                           /* reference outside the frame */
      else if (arr_/k',l'/[n+ni[arr_/l/[n,m]], m+ni[arr_/k/[n,m]]] != COVER)
        arr_/k,l/[n,m] = UNCOV                           /* target position already assigned */
      else
        arr_/k',l'/[n+ni[arr_/l/[n,m]], m+ni[arr_/k/[n,m]]] = -arr_/k,l/[n,m]
    }
  }

All positions remaining COVERed in arr_/k',l'/ have no references in frame B.

B : Example program in a C-like notation for derivation of the motion trajectory parameters [k,l]+, [k,l]−, [k',l']+ and [k',l']−. The array arr_/k,l/ must be calculated in advance for the R/2 past frames, arr_/k',l'/ for the R/2 future frames (for analysis; for synthesis with filter lengths R = 6, 10, 14, ..., only for R/2−1 frames). The former may contain UNCOVered, the latter COVERed positions, as defined in appendix A. The motion trajectories are derived for position (m,n) of the A and B frames; [k,l]0 and [k',l']0 denote the motion vectors at this position between A and B.

arr_/k,l/[R/2, number_of_rows, number_of_columns]    /* MVFs of R/2 past frames */
arr_/k',l'/[R/2, number_of_rows, number_of_columns]  /* MVFs of R/2 future frames */
arr_/k,l/+[R/2] ; arr_/k',l'/+[R/2] ; arr_/k,l/-[R/2] ; arr_/k',l'/-[R/2]
val_/k,l/+ = [0,0] ; val_/k,l/- = [k,l]0 ; val_/k',l'/+ = [k',l']0 ; val_/k',l'/- = [0,0]

for (r=0; r<R/2; r++) {
  /* backward trajectories through the past MVFs */
  if (arr_/k,l/[r, n+ni[val_/l/-], m+ni[val_/k/-]] != UNCOV && val_/k,l/- != UNCOV) {
    val_/k/- = val_/k/- + arr_/k/[r, n+ni[val_/l/-], m+ni[val_/k/-]] ; arr_/k/-[r] = val_/k/-
    val_/l/- = val_/l/- + arr_/l/[r, n+ni[val_/l/-], m+ni[val_/k/-]] ; arr_/l/-[r] = val_/l/-
  } else { arr_/k,l/-[r] = UNCOV ; val_/k,l/- = UNCOV }

  if (arr_/k,l/[r, n+ni[val_/l'/-], m+ni[val_/k'/-]] != UNCOV && val_/k',l'/- != UNCOV) {
    val_/k'/- = val_/k'/- + arr_/k/[r, n+ni[val_/l'/-], m+ni[val_/k'/-]] ; arr_/k'/-[r] = val_/k'/-
    val_/l'/- = val_/l'/- + arr_/l/[r, n+ni[val_/l'/-], m+ni[val_/k'/-]] ; arr_/l'/-[r] = val_/l'/-
  } else { arr_/k',l'/-[r] = UNCOV ; val_/k',l'/- = UNCOV }

  /* forward trajectories through the future MVFs */
  if (arr_/k',l'/[r, n+ni[val_/l/+], m+ni[val_/k/+]] != COVER && val_/k,l/+ != COVER) {
    val_/k/+ = val_/k/+ + arr_/k'/[r, n+ni[val_/l/+], m+ni[val_/k/+]] ; arr_/k/+[r] = val_/k/+
    val_/l/+ = val_/l/+ + arr_/l'/[r, n+ni[val_/l/+], m+ni[val_/k/+]] ; arr_/l/+[r] = val_/l/+
  } else { arr_/k,l/+[r] = COVER ; val_/k,l/+ = COVER }

  if (arr_/k',l'/[r, n+ni[val_/l'/+], m+ni[val_/k'/+]] != COVER && val_/k',l'/+ != COVER) {
    val_/k'/+ = val_/k'/+ + arr_/k'/[r, n+ni[val_/l'/+], m+ni[val_/k'/+]] ; arr_/k'/+[r] = val_/k'/+
    val_/l'/+ = val_/l'/+ + arr_/l'/[r, n+ni[val_/l'/+], m+ni[val_/k'/+]] ; arr_/l'/+[r] = val_/l'/+
  } else { arr_/k',l'/+[r] = COVER ; val_/k',l'/+ = COVER }
}

The filter paths are broken (constant-value extension as described in section I.4) at the COVERed and UNCOVered positions in arr_/k,l/+, arr_/k',l'/+, arr_/k,l/- and arr_/k',l'/-.

References

[1] ISO-IEC/JTC1 SC29/WG11 (MPEG): "Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s," Part 2: Video.
[2] P. H. Westerink, J. Biemond and F. Muller: "Subband coding of image sequences at low bit rates," Signal Processing: Image Commun., vol. 2, 1990.
[3] M. Pecot, P. Tourtier and Y. Thomas: "Compatible motion compensated subband coding," in Proc. PCS-91, Sept. 1991.
[4] Y.-Q. Zhang and S. Zafar: "Motion-compensated wavelet transform coding for color video compression," IEEE Trans. Circ. Syst. Video Techn., vol. CSVT-2, Sept. 1992.
[5] J. W. Woods (ed.): Subband Image Coding. Boston, MA: Kluwer.
[6] G. Karlsson and M. Vetterli: "Sub-band coding of video signals for packet switched networks," in SPIE Visual Commun. Image Processing, vol. 845, 1987.
[7] F. Bosveld, R. L. Lagendijk and J. Biemond: "Hierarchical video coding using a spatio-temporal subband decomposition," in Proc. ICASSP-92, vol. 3, Mar. 1992.
[8] A. Jacquin and C. Podilchuk: "Very low bit rate 3D subband-based video coding with a dynamic bit allocation," in SPIE Proc. Internat. Symp. Video Commun. and Fiber Optic Services, vol. 1977, Apr. 1993.
[9] C. I. Podilchuk, N. S. Jayant and P. Noll: "Sparse codebooks for the quantization of non-dominant sub-bands in image coding," in Proc. ICASSP-90, vol. 4, Apr. 1990.
[10] C. I. Podilchuk and N. Farvardin: "Perceptually based low bit rate video coding," in Proc. ICASSP-91, vol. 4, May 1991.
[11] C. Podilchuk and A. Jacquin: "Subband video coding with a dynamic bit allocation and geometric vector quantization," in SPIE Proc. IS&T Symp. Electr. Imaging and Tech., vol. 1668, Feb. 1992.
[12] G. Schamel: "Motion adaptive four channel HDTV subband/DCT coding," in Proc. PCS-90, Mar. 1990.
[13] M. P. Queluz: "A 3-dimensional subband coding scheme with motion-adaptive subband selection," in Proc. EUSIPCO-92, Sept. 1992.
[14] T. Akiyama, T. Takahashi and K. Takahashi: "Adaptive three-dimensional transform coding for moving pictures," in Proc. PCS-90, Mar. 1990.
[15] W. Li and M. Kunt: "Video coding using 3D subband decompositions," presented at PCS-93 (the proceedings do not fully reflect the oral presentation), Mar. 1993.
[16] T. Kronander: "Some aspects of perception based image coding," PhD dissertation, Linköping Univ.
[17] ——: "New results on 3-dimensional motion compensated subband coding," in Proc. PCS-90, p. 8.5-1, Mar. 1990.


More information

A Low Bit-Rate Video Codec Based on Two-Dimensional Mesh Motion Compensation with Adaptive Interpolation

A Low Bit-Rate Video Codec Based on Two-Dimensional Mesh Motion Compensation with Adaptive Interpolation IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 11, NO. 1, JANUARY 2001 111 A Low Bit-Rate Video Codec Based on Two-Dimensional Mesh Motion Compensation with Adaptive Interpolation

More information

MPEG-4: Simple Profile (SP)

MPEG-4: Simple Profile (SP) MPEG-4: Simple Profile (SP) I-VOP (Intra-coded rectangular VOP, progressive video format) P-VOP (Inter-coded rectangular VOP, progressive video format) Short Header mode (compatibility with H.263 codec)

More information

Compression of RADARSAT Data with Block Adaptive Wavelets Abstract: 1. Introduction

Compression of RADARSAT Data with Block Adaptive Wavelets Abstract: 1. Introduction Compression of RADARSAT Data with Block Adaptive Wavelets Ian Cumming and Jing Wang Department of Electrical and Computer Engineering The University of British Columbia 2356 Main Mall, Vancouver, BC, Canada

More information

Motion Estimation Using Low-Band-Shift Method for Wavelet-Based Moving-Picture Coding

Motion Estimation Using Low-Band-Shift Method for Wavelet-Based Moving-Picture Coding IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 9, NO. 4, APRIL 2000 577 Motion Estimation Using Low-Band-Shift Method for Wavelet-Based Moving-Picture Coding Hyun-Wook Park, Senior Member, IEEE, and Hyung-Sun

More information

Wavelet-Based Video Compression Using Long-Term Memory Motion-Compensated Prediction and Context-Based Adaptive Arithmetic Coding

Wavelet-Based Video Compression Using Long-Term Memory Motion-Compensated Prediction and Context-Based Adaptive Arithmetic Coding Wavelet-Based Video Compression Using Long-Term Memory Motion-Compensated Prediction and Context-Based Adaptive Arithmetic Coding Detlev Marpe 1, Thomas Wiegand 1, and Hans L. Cycon 2 1 Image Processing

More information

Modified SPIHT Image Coder For Wireless Communication

Modified SPIHT Image Coder For Wireless Communication Modified SPIHT Image Coder For Wireless Communication M. B. I. REAZ, M. AKTER, F. MOHD-YASIN Faculty of Engineering Multimedia University 63100 Cyberjaya, Selangor Malaysia Abstract: - The Set Partitioning

More information

Interframe coding A video scene captured as a sequence of frames can be efficiently coded by estimating and compensating for motion between frames pri

Interframe coding A video scene captured as a sequence of frames can be efficiently coded by estimating and compensating for motion between frames pri MPEG MPEG video is broken up into a hierarchy of layer From the top level, the first layer is known as the video sequence layer, and is any self contained bitstream, for example a coded movie. The second

More information

Adaptive Quantization for Video Compression in Frequency Domain

Adaptive Quantization for Video Compression in Frequency Domain Adaptive Quantization for Video Compression in Frequency Domain *Aree A. Mohammed and **Alan A. Abdulla * Computer Science Department ** Mathematic Department University of Sulaimani P.O.Box: 334 Sulaimani

More information

Wireless Communication

Wireless Communication Wireless Communication Systems @CS.NCTU Lecture 6: Image Instructor: Kate Ching-Ju Lin ( 林靖茹 ) Chap. 9 of Fundamentals of Multimedia Some reference from http://media.ee.ntu.edu.tw/courses/dvt/15f/ 1 Outline

More information

Context based optimal shape coding

Context based optimal shape coding IEEE Signal Processing Society 1999 Workshop on Multimedia Signal Processing September 13-15, 1999, Copenhagen, Denmark Electronic Proceedings 1999 IEEE Context based optimal shape coding Gerry Melnikov,

More information

Motion Estimation for Video Coding Standards

Motion Estimation for Video Coding Standards Motion Estimation for Video Coding Standards Prof. Ja-Ling Wu Department of Computer Science and Information Engineering National Taiwan University Introduction of Motion Estimation The goal of video compression

More information

Lecture 6: Compression II. This Week s Schedule

Lecture 6: Compression II. This Week s Schedule Lecture 6: Compression II Reading: book chapter 8, Section 1, 2, 3, 4 Monday This Week s Schedule The concept behind compression Rate distortion theory Image compression via DCT Today Speech compression

More information

ISSN (ONLINE): , VOLUME-3, ISSUE-1,

ISSN (ONLINE): , VOLUME-3, ISSUE-1, PERFORMANCE ANALYSIS OF LOSSLESS COMPRESSION TECHNIQUES TO INVESTIGATE THE OPTIMUM IMAGE COMPRESSION TECHNIQUE Dr. S. Swapna Rani Associate Professor, ECE Department M.V.S.R Engineering College, Nadergul,

More information

An Embedded Wavelet Video. Set Partitioning in Hierarchical. Beong-Jo Kim and William A. Pearlman

An Embedded Wavelet Video. Set Partitioning in Hierarchical. Beong-Jo Kim and William A. Pearlman An Embedded Wavelet Video Coder Using Three-Dimensional Set Partitioning in Hierarchical Trees (SPIHT) 1 Beong-Jo Kim and William A. Pearlman Department of Electrical, Computer, and Systems Engineering

More information

Data Hiding in Video

Data Hiding in Video Data Hiding in Video J. J. Chae and B. S. Manjunath Department of Electrical and Computer Engineering University of California, Santa Barbara, CA 9316-956 Email: chaejj, manj@iplab.ece.ucsb.edu Abstract

More information

An Embedded Wavelet Video Coder Using Three-Dimensional Set Partitioning in Hierarchical Trees (SPIHT)

An Embedded Wavelet Video Coder Using Three-Dimensional Set Partitioning in Hierarchical Trees (SPIHT) An Embedded Wavelet Video Coder Using Three-Dimensional Set Partitioning in Hierarchical Trees (SPIHT) Beong-Jo Kim and William A. Pearlman Department of Electrical, Computer, and Systems Engineering Rensselaer

More information

10.2 Video Compression with Motion Compensation 10.4 H H.263

10.2 Video Compression with Motion Compensation 10.4 H H.263 Chapter 10 Basic Video Compression Techniques 10.11 Introduction to Video Compression 10.2 Video Compression with Motion Compensation 10.3 Search for Motion Vectors 10.4 H.261 10.5 H.263 10.6 Further Exploration

More information

An Embedded Wavelet Video Coder. Using Three-Dimensional Set. Partitioning in Hierarchical Trees. Beong-Jo Kim and William A.

An Embedded Wavelet Video Coder. Using Three-Dimensional Set. Partitioning in Hierarchical Trees. Beong-Jo Kim and William A. An Embedded Wavelet Video Coder Using Three-Dimensional Set Partitioning in Hierarchical Trees (SPIHT) Beong-Jo Kim and William A. Pearlman Department of Electrical, Computer, and Systems Engineering Rensselaer

More information

FRAME-RATE UP-CONVERSION USING TRANSMITTED TRUE MOTION VECTORS

FRAME-RATE UP-CONVERSION USING TRANSMITTED TRUE MOTION VECTORS FRAME-RATE UP-CONVERSION USING TRANSMITTED TRUE MOTION VECTORS Yen-Kuang Chen 1, Anthony Vetro 2, Huifang Sun 3, and S. Y. Kung 4 Intel Corp. 1, Mitsubishi Electric ITA 2 3, and Princeton University 1

More information

Filterbanks and transforms

Filterbanks and transforms Filterbanks and transforms Sources: Zölzer, Digital audio signal processing, Wiley & Sons. Saramäki, Multirate signal processing, TUT course. Filterbanks! Introduction! Critical sampling, half-band filter!

More information

Image Compression for Mobile Devices using Prediction and Direct Coding Approach

Image Compression for Mobile Devices using Prediction and Direct Coding Approach Image Compression for Mobile Devices using Prediction and Direct Coding Approach Joshua Rajah Devadason M.E. scholar, CIT Coimbatore, India Mr. T. Ramraj Assistant Professor, CIT Coimbatore, India Abstract

More information

An Optimized Template Matching Approach to Intra Coding in Video/Image Compression

An Optimized Template Matching Approach to Intra Coding in Video/Image Compression An Optimized Template Matching Approach to Intra Coding in Video/Image Compression Hui Su, Jingning Han, and Yaowu Xu Chrome Media, Google Inc., 1950 Charleston Road, Mountain View, CA 94043 ABSTRACT The

More information

Overview. Videos are everywhere. But can take up large amounts of resources. Exploit redundancy to reduce file size

Overview. Videos are everywhere. But can take up large amounts of resources. Exploit redundancy to reduce file size Overview Videos are everywhere But can take up large amounts of resources Disk space Memory Network bandwidth Exploit redundancy to reduce file size Spatial Temporal General lossless compression Huffman

More information

Video Transcoding Architectures and Techniques: An Overview. IEEE Signal Processing Magazine March 2003 Present by Chen-hsiu Huang

Video Transcoding Architectures and Techniques: An Overview. IEEE Signal Processing Magazine March 2003 Present by Chen-hsiu Huang Video Transcoding Architectures and Techniques: An Overview IEEE Signal Processing Magazine March 2003 Present by Chen-hsiu Huang Outline Background & Introduction Bit-rate Reduction Spatial Resolution

More information

Video Compression Standards (II) A/Prof. Jian Zhang

Video Compression Standards (II) A/Prof. Jian Zhang Video Compression Standards (II) A/Prof. Jian Zhang NICTA & CSE UNSW COMP9519 Multimedia Systems S2 2009 jzhang@cse.unsw.edu.au Tutorial 2 : Image/video Coding Techniques Basic Transform coding Tutorial

More information

CMPT 365 Multimedia Systems. Media Compression - Video

CMPT 365 Multimedia Systems. Media Compression - Video CMPT 365 Multimedia Systems Media Compression - Video Spring 2017 Edited from slides by Dr. Jiangchuan Liu CMPT365 Multimedia Systems 1 Introduction What s video? a time-ordered sequence of frames, i.e.,

More information

Chapter 11.3 MPEG-2. MPEG-2: For higher quality video at a bit-rate of more than 4 Mbps Defined seven profiles aimed at different applications:

Chapter 11.3 MPEG-2. MPEG-2: For higher quality video at a bit-rate of more than 4 Mbps Defined seven profiles aimed at different applications: Chapter 11.3 MPEG-2 MPEG-2: For higher quality video at a bit-rate of more than 4 Mbps Defined seven profiles aimed at different applications: Simple, Main, SNR scalable, Spatially scalable, High, 4:2:2,

More information

Perfect Reconstruction FIR Filter Banks and Image Compression

Perfect Reconstruction FIR Filter Banks and Image Compression Perfect Reconstruction FIR Filter Banks and Image Compression Description: The main focus of this assignment is to study how two-channel perfect reconstruction FIR filter banks work in image compression

More information

Efficient Method for Half-Pixel Block Motion Estimation Using Block Differentials

Efficient Method for Half-Pixel Block Motion Estimation Using Block Differentials Efficient Method for Half-Pixel Block Motion Estimation Using Block Differentials Tuukka Toivonen and Janne Heikkilä Machine Vision Group Infotech Oulu and Department of Electrical and Information Engineering

More information

Digital Image Processing

Digital Image Processing Imperial College of Science Technology and Medicine Department of Electrical and Electronic Engineering Digital Image Processing PART 4 IMAGE COMPRESSION LOSSY COMPRESSION NOT EXAMINABLE MATERIAL Academic

More information

MULTIDIMENSIONAL SIGNAL, IMAGE, AND VIDEO PROCESSING AND CODING

MULTIDIMENSIONAL SIGNAL, IMAGE, AND VIDEO PROCESSING AND CODING MULTIDIMENSIONAL SIGNAL, IMAGE, AND VIDEO PROCESSING AND CODING JOHN W. WOODS Rensselaer Polytechnic Institute Troy, New York»iBllfllfiii.. i. ELSEVIER AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD

More information

VIDEO COMPRESSION STANDARDS

VIDEO COMPRESSION STANDARDS VIDEO COMPRESSION STANDARDS Family of standards: the evolution of the coding model state of the art (and implementation technology support): H.261: videoconference x64 (1988) MPEG-1: CD storage (up to

More information

FPGA IMPLEMENTATION OF BIT PLANE ENTROPY ENCODER FOR 3 D DWT BASED VIDEO COMPRESSION

FPGA IMPLEMENTATION OF BIT PLANE ENTROPY ENCODER FOR 3 D DWT BASED VIDEO COMPRESSION FPGA IMPLEMENTATION OF BIT PLANE ENTROPY ENCODER FOR 3 D DWT BASED VIDEO COMPRESSION 1 GOPIKA G NAIR, 2 SABI S. 1 M. Tech. Scholar (Embedded Systems), ECE department, SBCE, Pattoor, Kerala, India, Email:

More information

VIDEO AND IMAGE PROCESSING USING DSP AND PFGA. Chapter 3: Video Processing

VIDEO AND IMAGE PROCESSING USING DSP AND PFGA. Chapter 3: Video Processing ĐẠI HỌC QUỐC GIA TP.HỒ CHÍ MINH TRƯỜNG ĐẠI HỌC BÁCH KHOA KHOA ĐIỆN-ĐIỆN TỬ BỘ MÔN KỸ THUẬT ĐIỆN TỬ VIDEO AND IMAGE PROCESSING USING DSP AND PFGA Chapter 3: Video Processing 3.1 Video Formats 3.2 Video

More information

Reversible Wavelets for Embedded Image Compression. Sri Rama Prasanna Pavani Electrical and Computer Engineering, CU Boulder

Reversible Wavelets for Embedded Image Compression. Sri Rama Prasanna Pavani Electrical and Computer Engineering, CU Boulder Reversible Wavelets for Embedded Image Compression Sri Rama Prasanna Pavani Electrical and Computer Engineering, CU Boulder pavani@colorado.edu APPM 7400 - Wavelets and Imaging Prof. Gregory Beylkin -

More information

ECE 533 Digital Image Processing- Fall Group Project Embedded Image coding using zero-trees of Wavelet Transform

ECE 533 Digital Image Processing- Fall Group Project Embedded Image coding using zero-trees of Wavelet Transform ECE 533 Digital Image Processing- Fall 2003 Group Project Embedded Image coding using zero-trees of Wavelet Transform Harish Rajagopal Brett Buehl 12/11/03 Contributions Tasks Harish Rajagopal (%) Brett

More information

Week 14. Video Compression. Ref: Fundamentals of Multimedia

Week 14. Video Compression. Ref: Fundamentals of Multimedia Week 14 Video Compression Ref: Fundamentals of Multimedia Last lecture review Prediction from the previous frame is called forward prediction Prediction from the next frame is called forward prediction

More information

Lecture 10 Video Coding Cascade Transforms H264, Wavelets

Lecture 10 Video Coding Cascade Transforms H264, Wavelets Lecture 10 Video Coding Cascade Transforms H264, Wavelets H.264 features different block sizes, including a so-called macro block, which can be seen in following picture: (Aus: Al Bovik, Ed., "The Essential

More information

Image and Video Watermarking

Image and Video Watermarking Telecommunications Seminar WS 1998 Data Hiding, Digital Watermarking and Secure Communications Image and Video Watermarking Herbert Buchner University of Erlangen-Nuremberg 16.12.1998 Outline 1. Introduction:

More information

5.7. Fractal compression Overview

5.7. Fractal compression Overview 5.7. Fractal compression Overview 1. Introduction 2. Principles 3. Encoding 4. Decoding 5. Example 6. Evaluation 7. Comparison 8. Literature References 1 Introduction (1) - General Use of self-similarities

More information

CHAPTER 6. 6 Huffman Coding Based Image Compression Using Complex Wavelet Transform. 6.3 Wavelet Transform based compression technique 106

CHAPTER 6. 6 Huffman Coding Based Image Compression Using Complex Wavelet Transform. 6.3 Wavelet Transform based compression technique 106 CHAPTER 6 6 Huffman Coding Based Image Compression Using Complex Wavelet Transform Page No 6.1 Introduction 103 6.2 Compression Techniques 104 103 6.2.1 Lossless compression 105 6.2.2 Lossy compression

More information

Compression of Light Field Images using Projective 2-D Warping method and Block matching

Compression of Light Field Images using Projective 2-D Warping method and Block matching Compression of Light Field Images using Projective 2-D Warping method and Block matching A project Report for EE 398A Anand Kamat Tarcar Electrical Engineering Stanford University, CA (anandkt@stanford.edu)

More information

Fast Color-Embedded Video Coding. with SPIHT. Beong-Jo Kim and William A. Pearlman. Rensselaer Polytechnic Institute, Troy, NY 12180, U.S.A.

Fast Color-Embedded Video Coding. with SPIHT. Beong-Jo Kim and William A. Pearlman. Rensselaer Polytechnic Institute, Troy, NY 12180, U.S.A. Fast Color-Embedded Video Coding with SPIHT Beong-Jo Kim and William A. Pearlman Electrical, Computer and Systems Engineering Dept. Rensselaer Polytechnic Institute, Troy, NY 12180, U.S.A. Tel: (518) 276-6982,

More information

Video Compression An Introduction

Video Compression An Introduction Video Compression An Introduction The increasing demand to incorporate video data into telecommunications services, the corporate environment, the entertainment industry, and even at home has made digital

More information

MRT based Adaptive Transform Coder with Classified Vector Quantization (MATC-CVQ)

MRT based Adaptive Transform Coder with Classified Vector Quantization (MATC-CVQ) 5 MRT based Adaptive Transform Coder with Classified Vector Quantization (MATC-CVQ) Contents 5.1 Introduction.128 5.2 Vector Quantization in MRT Domain Using Isometric Transformations and Scaling.130 5.2.1

More information

A LOW-COMPLEXITY AND LOSSLESS REFERENCE FRAME ENCODER ALGORITHM FOR VIDEO CODING

A LOW-COMPLEXITY AND LOSSLESS REFERENCE FRAME ENCODER ALGORITHM FOR VIDEO CODING 2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) A LOW-COMPLEXITY AND LOSSLESS REFERENCE FRAME ENCODER ALGORITHM FOR VIDEO CODING Dieison Silveira, Guilherme Povala,

More information

Block-Matching based image compression

Block-Matching based image compression IEEE Ninth International Conference on Computer and Information Technology Block-Matching based image compression Yun-Xia Liu, Yang Yang School of Information Science and Engineering, Shandong University,

More information

CHAPTER 4 REVERSIBLE IMAGE WATERMARKING USING BIT PLANE CODING AND LIFTING WAVELET TRANSFORM

CHAPTER 4 REVERSIBLE IMAGE WATERMARKING USING BIT PLANE CODING AND LIFTING WAVELET TRANSFORM 74 CHAPTER 4 REVERSIBLE IMAGE WATERMARKING USING BIT PLANE CODING AND LIFTING WAVELET TRANSFORM Many data embedding methods use procedures that in which the original image is distorted by quite a small

More information

Multiresolution Motion Estimation Techniques for Video Compression

Multiresolution Motion Estimation Techniques for Video Compression Multiresolution Motion Estimation Techniques for ideo Compression M. K. Mandal, E. Chan, X. Wang and S. Panchanathan isual Computing and Communications Laboratory epartment of Electrical and Computer Engineering

More information

Lossless and Lossy Minimal Redundancy Pyramidal Decomposition for Scalable Image Compression Technique

Lossless and Lossy Minimal Redundancy Pyramidal Decomposition for Scalable Image Compression Technique Lossless and Lossy Minimal Redundancy Pyramidal Decomposition for Scalable Image Compression Technique Marie Babel, Olivier Déforges To cite this version: Marie Babel, Olivier Déforges. Lossless and Lossy

More information

Filter Bank Design and Sub-Band Coding

Filter Bank Design and Sub-Band Coding Filter Bank Design and Sub-Band Coding Arash Komaee, Afshin Sepehri Department of Electrical and Computer Engineering University of Maryland Email: {akomaee, afshin}@eng.umd.edu. Introduction In this project,

More information

Yui-Lam CHAN and Wan-Chi SIU

Yui-Lam CHAN and Wan-Chi SIU A NEW ADAPTIVE INTERFRAME TRANSFORM CODING USING DIRECTIONAL CLASSIFICATION Yui-Lam CHAN and Wan-Chi SIU Department of Electronic Engineering Hong Kong Polytechnic Hung Hom, Kowloon, Hong Kong ABSTRACT

More information

ELEC Dr Reji Mathew Electrical Engineering UNSW

ELEC Dr Reji Mathew Electrical Engineering UNSW ELEC 4622 Dr Reji Mathew Electrical Engineering UNSW Review of Motion Modelling and Estimation Introduction to Motion Modelling & Estimation Forward Motion Backward Motion Block Motion Estimation Motion

More information

Video coding. Concepts and notations.

Video coding. Concepts and notations. TSBK06 video coding p.1/47 Video coding Concepts and notations. A video signal consists of a time sequence of images. Typical frame rates are 24, 25, 30, 50 and 60 images per seconds. Each image is either

More information

Wavelet Transform (WT) & JPEG-2000

Wavelet Transform (WT) & JPEG-2000 Chapter 8 Wavelet Transform (WT) & JPEG-2000 8.1 A Review of WT 8.1.1 Wave vs. Wavelet [castleman] 1 0-1 -2-3 -4-5 -6-7 -8 0 100 200 300 400 500 600 Figure 8.1 Sinusoidal waves (top two) and wavelets (bottom

More information