
INTERNATIONAL ORGANISATION FOR STANDARDISATION
ORGANISATION INTERNATIONALE DE NORMALISATION
ISO/IEC JTC1/SC29/WG11
CODING OF MOVING PICTURES AND ASSOCIATED AUDIO INFORMATION

ISO/IEC/JTC1/SC29/WG11 M0333
MPEG95/ Nov 1995

Source: Jens-Rainer Ohm, Technische Universität Berlin, Institut für Fernmeldetechnik
Title: Three-Dimensional Subband Coding with Motion Compensation
Status: Proposal

1 Functionalities of the Coder

Improved compression. The motion-compensated 3D subband coder achieves good compression over a wide range of data rates. Its main advantage is the absence of feedback structures: there is no prediction from frame to frame and no error feedback from badly decoded frames, as occurs in hybrid coders. Overlapping motion compensation and the use of a lapped subband transform further improve the efficiency.

Scalability. Spatial and temporal scalability are natural to the 3D subband approach. The coder is also quality-scalable without loss in efficiency, owing to the use of universal variable length entropy coding (UVLC). It is important to note that the motion-compensated temporal-axis subband system works independently of the subsequent parts of the encoding process. This means that scaling can be applied freely, almost as in intraframe coders. Scalability also includes temporal scalability of the motion parameters, while spatial scalability of the motion parameters requires further investigation. Object scalability may be applicable as well, if separate temporal-axis subband transforms are run over background and moving objects. (The scalability functionalities were not provided for the subjective tests; see the note on the last page.)

Robustness in error-prone environments. Due to its highly hierarchical and scalable data structure, the encoded data stream lends itself to efficient error protection. Experiments have been performed with a 2-layer version (1/3 of the information in the base layer, 2/3 in the enhancement layer). This can be realized with only a marginal increase in data rate, caused by the need to provide resynchronization information for both layers independently. Graceful degradation was observed even with severely corrupted enhancement information. (The error-robustness functionality was not provided for the subjective tests; see the note on the last page.)

This proposal contains the description of the coding algorithm used for the subjective test submissions. Two elements of the coder are herein proposed as TOOLS:

1. Motion-compensated interframe subband coding (see description in section 2.1)
2. Motion grid interpolation with contour adaptation (see description in section 2.2)

2 Technical Description

The core technologies of the coder, which are also provided as separate tools in this proposal, are:

- motion-compensated interframe subband coding (along the temporal axis);
- motion compensation based on grid interpolation, with adaptation to object borders.

2.1 Motion-compensated Interframe Subband Coding

This element of the coder is proposed as a TOOL.

Fig. 1. Octave-band decomposition of an FDG (length 16 frames) into the temporal-axis frequency components (top); resulting frequency bands (bottom).

We use a motion-compensated temporal subband filter approach. In this proposal, 2-tap Haar filter bases were used, but longer temporal filters are applicable as well [1]. The basic difference compared to conventional coders (hybrid MC prediction types) lies in what is encoded. Instead of single "intra-coded" frames, we have a temporal lowpass band, which is a transformed representation containing as much information as possible from all frames within a frame decomposition group (called FDG here to mark the difference from MPEG1/2's GOF). Instead of predictive-coded frames, we have different temporal higher-frequency bands, which represent the speed of changes within the FDG.

Refer to figure 1 to see what is happening. This is an example with progressive video formats, e.g. SIF resolution. The use of the Haar filter base implies that the temporal lowpass (L) information is extracted from two frames A and B by motion-compensated averaging (++), while the temporal highpass (H) information is obtained from a difference operation (-+). Both averages and differences are normalized by factors of 0.5, such that the value range of the lowpass frames is always equal to that of the original frames. When motion-compensated averages and differences are again extracted from the H and L frames (which have the same sizes as A and B), the FDG length is enlarged and we obtain a finer resolution of the temporal-axis frequency decomposition. Figure 1 shows an example with an octave-band subband tree, which is the best choice for progressively-sampled sequences. The FDG length is 16 frames. Only the lowpass bands are further decomposed in this case; if motion compensation is exact enough, they look very much like original image frames. The resulting temporal-axis frequency resolution is shown in the lower part of figure 1. During synthesis, the motion compensation is reversed, and the FDG is reconstructed starting from the root (LLLL/LLLH) of the subband tree.

It is straightforward to perform frame rate scalings by power-of-2 factors. Figure 1 indicates which information would be omitted if a 15 Hz reconstruction is required from an original 30 Hz sequence. Remark that neither the "A" frames nor the "B" frames are actually reconstructed in this case. Instead, a sequence of motion-compensated averages, the first lowpass band L, is replayed. Very fast changing parts of the scene may become gradually smoothed, which acts as alias suppression and reduces jerky movements compared to the usual technique of simply skipping the "A" or "B" type frames. No frame recursion loops are present in either the coder or the decoder, which allows temporal subsampling (replay of lowpass sequences) at any level of the subband decomposition tree.

Figure 2 shows an appropriate decomposition structure for interlaced video, as it was actually used in the MPEG4 tests for the class C sequences. In this case, "A" and "B" are the even and odd fields, respectively. Due to the aliasing effects inherent in interlaced sampling, a relatively high amount of information is present in the first highpass band "H". This can be counteracted by again decomposing the "H" frames in an octave-band tree throughout the FDG. Due to the frequency reversals occurring during the subsampling of high-frequency components in subband systems, we obtain the decomposition shown in the lower part of figure 2, which we designate as "mirrored octave-band". The frequency resolution is narrow at low frequencies and near the frequency of the field rate, where most aliasing occurs. The information about the differences between adjacent fields throughout the FDG is now concentrated in the HLLL band.

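For illustration, the following is a minimal sketch of the octave-band temporal Haar analysis of figure 1 and its synthesis, applied to a 16-frame FDG. Motion compensation is omitted for brevity (the coder warps A onto B along the motion paths before averaging and differencing), and the band naming is simplified, so this is a sketch of the filtering structure only, not the tested implementation.

```python
import numpy as np

def haar_temporal_analysis(fdg, levels=4):
    """Octave-band temporal Haar analysis of a frame decomposition group.

    fdg: array of shape (num_frames, height, width); num_frames = 2**levels.
    Returns the highpass bands of each level plus the final lowpass band.
    Motion compensation is omitted in this sketch; the real coder warps
    frame A onto frame B along the motion paths before averaging.
    """
    low = np.asarray(fdg, dtype=np.float64)
    bands = {}
    for lev in range(1, levels + 1):
        a, b = low[0::2], low[1::2]         # "A" and "B" frames of this level
        low = 0.5 * (a + b)                 # temporal lowpass, range preserved
        bands[f"H{lev}"] = 0.5 * (b - a)    # temporal highpass
    bands["L_final"] = low                  # e.g. the LLLL frame for levels=4
    return bands

def haar_temporal_synthesis(bands, levels=4):
    """Inverse of the analysis above (perfect reconstruction)."""
    low = bands["L_final"]
    for lev in range(levels, 0, -1):
        h = bands[f"H{lev}"]
        a, b = low - h, low + h             # invert average/difference
        rec = np.empty((2 * low.shape[0],) + low.shape[1:])
        rec[0::2], rec[1::2] = a, b
        low = rec
    return low
```

Replaying only the level-1 lowpass sequence instead of performing full synthesis yields the half-frame-rate (e.g. 15 Hz) reconstruction discussed above.
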
The subband decomposition frames in this case have the sizes of the fields. If a progressive SIF reconstruction is required, only the L frames of the first subband decomposition level are encoded, after being subsampled in the horizontal direction by a factor of 2 (see section 2.3). All temporal frequencies above 25/30 Hz are discarded. Again, neither the even nor the odd fields, but motion-compensated averages of both are replayed as the SIF reconstruction. This is exactly what was done in the video examples provided for the tests: while analysis was performed on ITU-R-601 sequences, only the SIF resolution information was written to the bit streams. To produce the 601 output format again, the decoder replaces both even and odd fields by the sequence of L-frames.

Fig. 3 illustrates the procedure of motion-compensated subband analysis at the first analysis level. The problem is to guarantee the invertibility of the motion compensation even in the case of inhomogeneous motion. This can be solved as follows (a more detailed description, including symbolic program code, is given in [1]):

1. Perform decomposition into subbands L and H at positions with a unique motion path between A and B. The values in L are placed at the positions referring to B, while the values in H get their positions from A.

2. When no unique motion path exists: insert original values from B at their proper positions into subband frame L, which happens in the case of uncovered areas; insert a motion-compensated prediction from the previous B into subband frame H, which happens in areas covered from A to B.

Fig. 2. Mirrored octave-band decomposition of an FDG (length 8 frames / 16 fields) into the temporal-axis frequency components (top); resulting frequency bands (bottom).

Fig. 3. a) Motion paths, covered and uncovered areas in frames A/B; b) positions of subband decomposition results and of inserted values in H and L.
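A per-pel sketch of these insertion rules, under simplifying assumptions: both subbands are kept on B's sampling grid here (the actual scheme places the H values at A's positions), and the class map and dense motion-compensated inputs are assumed to be precomputed; all names are illustrative.

```python
import numpy as np

# Assumed per-pel classification of the motion field (illustrative names):
UNIQUE, UNCOVERED, COVERED = 0, 1, 2

def mc_haar_level(b, a_warped, h_pred, cls):
    """One motion-compensated Haar analysis step with insertion rules.

    b        : the "B" frame
    a_warped : frame A motion-compensated onto B's pel positions
    h_pred   : motion-compensated prediction from the previous "B" frame,
               used for areas covered from A to B
    cls      : per-pel class map (UNIQUE / UNCOVERED / COVERED); in this
               simplified sketch all maps live on B's grid
    """
    low = np.where(cls == UNCOVERED, b, 0.5 * (a_warped + b))  # insert B originals
    high = 0.5 * (b - a_warped)
    high = np.where(cls == COVERED, h_pred, high)              # insert prediction
    return low, high
```
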

This way, each position in both frames A and B can be uniquely recovered, either by inverse motion-compensated subband synthesis or by simple replacement of the inserted original/predicted values. The same technique is applied over all levels of the subband tree; e.g., at the second level, two subsequent L frames take over the roles of A and B to produce LL and LH.

The temporal-axis subband decomposition as described can be combined with any motion compensation technique. Nevertheless, block matching was found less appropriate due to its inexact description of object borders. If the subband frames at the higher levels of the subband decomposition tree are not similar enough to the original scene contents, artifacts appear in the reconstructed sequence in the case of low-rate encoding. The scheme is universal enough to allow object-oriented techniques as well; the best solution in this case would be a separate subband decomposition of particular objects and background (see section 5).

One crucial point in the system is the spatial interpolation necessary when sub-pel accuracy of motion compensation is used. If bilinear interpolation is applied, the signal gets blurred during analysis over several levels of the subband tree, and it gets even more blurred during subsequent synthesis, when inverse motion compensation is performed. In the tests, the frames were upsampled during processing by a factor of 2 horizontally and vertically with a block-overlapping DCT/IDCT process (a sketch of the upsampling idea is given at the end of this section), and pel values were then taken from these upsampled images by bilinear interpolation. Any other less-blurring interpolation technique, e.g. spline or higher-order linear, might be used as well. It is not mandatory to use the same type of spatial interpolation during analysis and synthesis (see section 4.2).

To exploit the redundancy between adjacent FDGs, motion-compensated predictive encoding of the temporal subband frames LLLL is an appropriate solution, though this redundancy can be low in the case of fast motion. Predictive encoding of the LLLL information must also be weighed against the demand for error-robust transmission (see section 3).

The motion-compensated temporal subband decomposition leaves the frame or field sizes unchanged. The video sequence is decomposed into several temporal subband sequences with different frame rates; e.g., the sequence of LLLL frames in fig. 1 has 1/16th of the original sequence's frame rate. If any intraframe (2D) subband coder is applied to the frames resulting from interframe subband analysis, the coder is a 3D subband device. To emphasize the good properties of the coder for spatio-temporally scalable applications, we performed this full 3D subband decomposition; the properties of spatial (2D) subband coding are further described in section 2.3. It is important to note that the motion-compensated interframe subband coder may freely be combined with other spatial encoding techniques. An eminent advantage, compared to conventional hybrid coders, is the complete independence of the temporal-axis subband decomposition, including motion compensation, from the subsequent encoding process applied to the subband frames, which can then almost be regarded as single-frame coding operations. This allows great freedom to use spatially-scalable and quality-scalable techniques.

What should be provided by MPEG4 syntax to define the system:

1. Type of temporal filter (e.g. Haar or other)
2. Decomposition structure (e.g. octave-band, mirrored octave-band, full-band)
3. FDG length
4. Predictive encoding / frequency of frame refresh in the LLLL subband
5. Motion vector field (the best interface to arbitrary motion compensation would be a pelwise description of the motion vector field, including a description of occlusions and possibly of objects)
6. Spatial interpolation technique with low blur effect

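The block-overlapping DCT/IDCT upsampling mentioned above (parameters are given in section 4.2: 32x32 blocks with 8-pel overlap, zero-padded to 64x64) can be sketched for a single block as follows; the overlap handling is omitted, so this is a minimal sketch of the zero-padding idea only, not the tested implementation.

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_upsample_block(block):
    """Upsample one image block by 2x in each direction via zero-padded DCT.

    Transform the block, append zeros for the new high frequencies, and
    inverse-transform at double size. The tested coder applies this to
    32x32 blocks overlapping by 8 pel on each side; the overlap handling
    is omitted here.
    """
    h, w = block.shape
    coef = dctn(block, type=2, norm='ortho')
    padded = np.zeros((2 * h, 2 * w))
    padded[:h, :w] = coef
    # Factor 2 restores the mean brightness after doubling the size in
    # both directions under the orthonormal DCT scaling.
    return 2.0 * idctn(padded, type=2, norm='ortho')
```
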

2.2 Motion Grid Interpolation with Contour Adaptation

This element of the coder is proposed as a TOOL.

The basic scheme of motion compensation used to encode the test sequences is the control grid interpolation (CGI) approach from [2]. Fig. 4 illustrates how this scheme was applied in the 3D subband coder. The motion vectors are defined at control grid points in the "B" frames and point to references in the corresponding "A" frames. The control grid points in B are spaced in a regular fashion (in the experiments, a distance of GX=GY=16 pel was used in both the horizontal and vertical directions). If motion is heterogeneous, the reference points in A will be irregularly distributed. The motion vector field between the control grid points is interpolated from the 4 adjacent grid values. During motion estimation, the horizontal and vertical shifts are optimized separately for each grid point. Fig. 4 shows the reference region influenced by the central grid point; the displaced frame differences within this region must be taken into account for optimization. In its intention (the ability to capture non-translational motion with the same number of motion parameters as conventional block matching), the scheme is indeed very similar to the overlapped block MC (OBMC) of the H.263 standard. However, CGI is more universal in that the type of motion parameter interpolation is not defined a priori. In our experiments, we used a bilinear interpolation mapping, but e.g. a perspective warp would be applicable as well and might give still more natural motion vector fields.

Fig. 4. Control grid interpolation motion compensation between A and B frames.

Fig. 5. Switching off interpolation. Control grid & reference points (top) and related motion vector fields (bottom) in the presence of a) a covered area, b) an uncovered area.
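A minimal sketch of the bilinear interpolation of the motion vector field within one grid cell, from the four surrounding control grid vectors (array layout and function name are illustrative):

```python
import numpy as np

def interp_cell_vectors(v00, v10, v01, v11, gx=16, gy=16):
    """Bilinearly interpolate motion vectors inside one control grid cell.

    v00..v11: 2-vectors (dx, dy) at the four corner grid points
    (top-left, top-right, bottom-left, bottom-right).
    Returns an array of shape (gy, gx, 2) with one vector per pel.
    """
    ys, xs = np.mgrid[0:gy, 0:gx]
    wx = xs / float(gx)                 # horizontal interpolation weight
    wy = ys / float(gy)                 # vertical interpolation weight
    v00, v10, v01, v11 = map(np.asarray, (v00, v10, v01, v11))
    field = ((1 - wy)[..., None] * ((1 - wx)[..., None] * v00 + wx[..., None] * v10)
             + wy[..., None] * ((1 - wx)[..., None] * v01 + wx[..., None] * v11))
    return field
```

With GX=GY=16, each cell yields 256 interpolated vectors from 4 transmitted ones.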

One eminent disadvantage of CGI and OBMC is the inability to cope with fast motion and occlusions: in the original schemes, interpolation between adjacent grid values or adjacent blocks is always performed. This problem can be solved by switching off the interpolation in the region between adjacent points whenever a discontinuity in the motion vector field is found. A good indicator for the presence of a discontinuity is the distance between the reference points in the first image (A). We switched off interpolation whenever this distance fell below 0.75 G or exceeded 1.25 G, where G denotes the distance GX or GY between the corresponding control grid points (see the sketch at the end of this section). The effect of this procedure is outlined in fig. 5. Fig. 5a is an example where an area is covered (present in frame A, not present in frame B), while fig. 5b shows the case of an uncovered area (not present in frame A, present in frame B). Remark that interpolation is still performed where the distances between reference points (o) are within the prescribed limits.

Switching off the interpolation, though already rendering a better description of the motion vector field and allowing faster motion, does not yet describe the real position of the covered or uncovered areas. In fig. 5, the discontinuity in the motion vector field was assumed to be centered between the particular grid points. To give a more exact description, the scheme in fig. 6 was employed, which was able to reduce artifacts in the decoded sequences at low rates. The necessary information is the shape position of the motion discontinuity (in the case of an area covered in frame B) or the shape of an uncovered area itself. For a rough approximation, it is sufficient for the decoder to know the intersection of this shape with the straight line between two diverging control grid points. In fig. 5, the top left point's motion indicates a separation from the others. Hence, it is necessary to encode the intersection positions between the top left and bottom left points, and between the top left and top right points. If GX and GY are the grid spacings in the horizontal and vertical directions, the intersection can be at one out of GX-1 or GY-1 positions, respectively. Between those intersections, a straight-line (polygon) approximation of the contour was used in our experiments, but other approximations, e.g. spline, might be used as well. For the regions uncovered in B, the contour approximation marks the center of the area.

Fig. 6. Representation of the discontinuities within the motion vector field.

What should be provided by MPEG4 syntax to define the system:

1. Grid spacing (e.g. uniform: GX, GY; nonuniform)
2. Type of interpolation (e.g. bilinear, affine, perspective, quadratic warps)
3. Upper and lower limits of motion discontinuity to switch off interpolation
4. Interface to contour descriptor (number & position of contour points, type of contour interpolation: e.g. polygon, spline)
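The sketch referenced above: a minimal version of the switch-off criterion (function name illustrative):

```python
def keep_interpolation(ref_dist, grid_dist, low=0.75, high=1.25):
    """Discontinuity test between two adjacent control grid points.

    ref_dist : distance between the two reference points in frame A
    grid_dist: the corresponding grid spacing G (GX or GY)
    Interpolation is switched off when the reference distance falls
    below 0.75*G or exceeds 1.25*G, as described in section 2.2.
    """
    return low * grid_dist <= ref_dist <= high * grid_dist

# Example with GX = 16: reference points only 7 pel apart in frame A
# indicate a motion discontinuity, so interpolation is switched off.
assert not keep_interpolation(ref_dist=7.0, grid_dist=16.0)
assert keep_interpolation(ref_dist=15.0, grid_dist=16.0)
```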

2.3 Spatial (2D) Subband Coding of the Temporal Subband Frames

In the case of interlaced video encoding, the first step of spatial coding is a preprocessing of the frames resulting from the temporal subband decomposition. These have the format of ITU-R-601 fields (720x288 or 720x240) and are decomposed into SIF-compatible sizes. This requires downsampling by a factor of 2 in the horizontal direction for the luminance Y, and downsampling by factors of 2 in both horizontal and vertical directions for the chrominance components U and V. Accordingly, a quadrature mirror filter (QMF) based decomposition into 2 horizontal frequency bands is applied to Y, while 4 horizontal/vertical frequency bands are generated from U and V (see fig. 7). An odd-length QMF (9-tap, from [3]) was employed for this purpose (see the sketch at the end of this section).

Fig. 7. Preprocessing/decomposition of ITU-R-601 into SIF.

While the vertical high bands of the chrominance are always discarded in order to encode a 4:1:1 representation, the upper horizontal bands of both luminance and chrominance may additionally be discarded if only SIF resolution is required after decoding. This is exactly what was done in the sequences provided for the tests. The same scheme is also applicable for full compatibility, possibly from HDTV down to QCIF formats; in this case, it would be useful to apply an octave-band (wavelet) decomposition.

The resulting SIF frames (352x288 or 352x240) with 4:1:1 components are then processed by a type of lapped transform, a 2D version of the subband approach called time domain aliasing cancellation (TDAC) [4]. The number of frequency bands is 64 in the horizontal and vertical directions, matching the frequency resolution of a DCT of size 8x8. Any other spatial transform or subband decomposition might be applied as well, but TDAC provides better results than e.g. the DCT. The TDAC coefficients remain organized along with those from the same analysis windows (similar to coefficient block ordering in the DCT) for subsequent processing.

What should be provided by MPEG4 syntax to define the system:

1. Type of transform (e.g. DCT, LOT, TDAC, wavelet, including filter type / basis functions)
2. Number and arrangement of frequency bands (e.g. full-band, octave-band, wavelet-packet, 1D/2D, separable/non-separable)
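The sketch referenced above: a minimal horizontal 2-band QMF analysis for the preprocessing. The 9-tap lowpass coefficients come from [3] and are not reproduced here; they are passed in as a parameter. The highpass derivation and the subsampling phases follow a common QMF convention and are assumptions, not the documented implementation.

```python
import numpy as np

def qmf_split_horizontal(frame, h0):
    """Split a frame into 2 horizontal frequency bands with a QMF pair.

    frame: 2D array (one temporal subband frame)
    h0   : odd-length lowpass prototype, e.g. the 9-tap QMF from [3]
           (coefficients not reproduced here)
    The highpass filter is derived by modulation, h1[n] = (-1)**n * h0[n],
    and both bands are subsampled horizontally by a factor of 2.
    """
    n = np.arange(len(h0))
    h1 = ((-1.0) ** n) * np.asarray(h0)     # quadrature mirror highpass
    low = np.apply_along_axis(np.convolve, 1, frame, h0, 'same')[:, 0::2]
    high = np.apply_along_axis(np.convolve, 1, frame, h1, 'same')[:, 1::2]
    return low, high
```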

2.4 Quantization and Entropy Coding

The pure quantization of the 3D subband information is more or less conventional. The spatial subbands of the temporal LLLL band were encoded using the intra_quantizer_matrix from MPEG1/2, while the higher-frequency bands were processed with a 3/2 deadzone quantizer.

Of higher importance is the global quantizer step size applied to the particular temporal bands. The temporal subband decomposition employs nonorthonormal Haar filters (normalization factors 0.5 instead of √2/2). This is necessary to embed the uncovered areas into the lowpass frames without visible brightness changes. The consequence is that the global quantizer step size has to be lowered by a factor of √2/2 with each level of the subband decomposition tree (a sketch follows below). Additionally, within one lowpass frame, the inserted uncovered areas can be quantized coarser by a factor of √2 than the surrounding "true" lowpass information. Conversely, within one highpass frame, the inserted covered areas, which are predicted from the previous frame, must be quantized finer by the same factor than their neighbors (for a detailed explanation, see [1]). This problem is solved by adjusting the quantizer step sizes locally, according to the number of covered/uncovered pixels under the analysis window of the TDAC transform. Remark that no additional information needs to be transmitted for this purpose; the necessary quantizer step sizes can be derived exactly from the motion information alone.

Fig. 8. Superblock arrangement (64x64) used for 30 Hz sequences (240 lines).

The entropy coder used to encode the quantizer output is the universal variable length coder (UVLC) described in [5]. This coder is a run-length coder working on the bitplanes of the quantized frequency coefficients. A desirable advantage of this coder is the capability of quantizer scaling without any data overhead. A layered representation of the encoded information is obtained if the code for the higher bit planes is transmitted as base information and the lower bit planes as enhancement information.

For efficient UVLC, it is necessary to reorganize the coefficients before processing. Instead of "one block at a time" (as in MPEG1/2), they are encoded "one frequency at a time" (see the sketch at the end of this section). In the original proposal of [5], this is performed in slices containing 90 DCT blocks of size 8x8: first 90 DC coefficients, then 90 AC-first coefficients, etc. are processed. This scheme was modified by using superblocks of size 64x64 pel instead of slices. Each superblock contains the coefficients from 64 TDAC analysis windows for the Y component, and from 2x16 analysis windows for the U&V components. Since the SIF image sizes are not divisible by 64, the superblock arrangements of figs. 8 & 9 were used for the 30 Hz and 25 Hz formats, respectively.
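Returning to the step-size rule above, a minimal sketch; the interpretation of the "3/2 deadzone" as a zero bin of ±1.5 steps is our reading of the text, not a documented formula.

```python
import numpy as np

def step_for_level(base_step, level):
    """Global quantizer step size for a temporal subband level.

    The step size is lowered by a factor of sqrt(2)/2 with each level of
    the temporal decomposition tree (level 0 = first analysis level).
    """
    return base_step * (np.sqrt(2.0) / 2.0) ** level

def deadzone_quantize(coef, step):
    """Deadzone quantizer sketch: values within +-1.5*step map to zero
    (one reading of the "3/2 deadzone" in section 2.4); larger values
    are uniformly quantized."""
    q = np.sign(coef) * np.maximum(0, np.floor(np.abs(coef) / step - 0.5))
    return q.astype(int)
```

Locally, the step would additionally be scaled coarser by √2 over inserted uncovered areas of a lowpass frame and finer by the same factor over inserted covered areas of a highpass frame, both derived from the motion information alone.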

Fig. 9. Superblock arrangement (64x64) used for 25 Hz sequences (288 lines).

The overhead information related to the superblocks is as described in [5]. Coefficients are ordered into eight classes, the quantizer range (necessary bit number) for each class is transmitted, and the VLC codes are adapted according to the number of "1"s in each bit plane. In addition, our realization makes it possible to set complete superblocks to zero; this is frequently done in the higher-frequency temporal subbands at low rates.

The encoded bitstreams provided for the MPEG4 tests originated from an experimental encoder and do not yet provide resynchronization information. A scheme is presently being investigated which, with a negligible amount of information overhead, allows a resync of the decoder once at the beginning of the frame for the first level of the temporal subband decomposition, at every seventh superblock for the second level, at every third superblock for the third, and at every superblock for the fourth level (e.g. the lowest-frequency subband LLLL) of the subband tree.

What should be provided by MPEG4 syntax to define the system:

1. Type of quantizer (e.g. deadzone, quantizer matrix)
2. Arrangement of quantized information (e.g. block structure, superblock structure, slice structure)
3. Type of entropy coding (e.g. UVLC, Huffman VLC, arithmetic)
4. Specific adaptation of entropy coding (e.g. coefficient classes, VLC codes)
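The sketch referenced above: a minimal version of the "one frequency at a time" reordering within one superblock, assuming the coefficients of each analysis window are stored blockwise in a 64x64 array (layout illustrative):

```python
import numpy as np

def reorder_superblock(sb, bs=8):
    """Reorder a 64x64 superblock "one frequency at a time".

    sb: 64x64 array holding the coefficients of 8x8 analysis windows,
        stored block by block (as after a blockwise transform).
    Returns a 1D array with all DC coefficients of the 64 windows first,
    then all coefficients of the next frequency, and so on, as required
    for efficient UVLC bitplane coding.
    """
    h, w = sb.shape
    # frequency axes first, window axes last
    blocks = sb.reshape(h // bs, bs, w // bs, bs).transpose(1, 3, 0, 2)
    return blocks.reshape(bs * bs, -1).ravel()
```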

2.5 Encoding of Motion Parameters

The number of motion parameters to be transmitted depends on the decomposition structure (octave-band or mirrored octave-band), but is generally lower than e.g. in MPEG1/2 with extensive use of B-frame structures. There are three cases:

- Motion information is directly related to image information. This is the case whenever a highpass frame of the subband decomposition forms the end of the subband tree and is encoded. In fig. 1, all marked frames except LLLL are of this type; in fig. 2, all marked frames except LLLL and HLLL.
- There is image information to transmit, but no motion information. This is the case for LLLL (unless MC prediction from the last LLLL is performed) and for HLLL.
- There is motion information to transmit, but no image information. For example, this is the case at the first level of subband decomposition in fig. 2 when ITU-R-601 decoding is required.

If the information about a particular highpass frame at any level of the temporal subband system is not available at the decoder, the corresponding motion information is unnecessary as well. Return to fig. 2: if SIF reconstruction is required, none of the motion parameters related to frames beginning with "H.." need to be known at the decoder, and for the whole FDG, only 7 sets of frame motion parameters are needed instead of 22 for the full 601 reconstruction. The same holds for the example in fig. 1: here, only 7 sets of frame motion parameters need to be transmitted for the 15 Hz reconstruction instead of 15 for the 30 Hz case. The temporal subsampling procedure inherently includes the subsampling of the motion parameters.

Lossless encoding of the motion parameters (horizontal and vertical shifts of the control grid points) is performed within the same 64x64 superblock structure described in the previous section. When the source sequence has ITU-R-601 format and the motion grid spacing is GX=GY=16, as applied in the MPEG4 tests, each superblock in the SIF-downsampled representation contains the vectors of 32 motion grid points (during horizontal downsampling, the virtual horizontal grid spacing is downsampled as well, such that 4 rows of 8 grid points each are contained within each superblock). We have not yet implemented spatial subsampling of the motion parameters.

Encoding is performed separately for the horizontal and vertical shift components, with a spatial prediction and the VLC table from MPEG1. Spatial prediction starts anew at the left grid point within the topmost row of each superblock; the points of the topmost row are predicted from their left neighbors, the points of the left-hand column from their top neighbors. All other values are predicted with factors 0.5/0.5 from both the left and top neighbors (see the sketch at the end of this section). The accuracy of the motion parameters is half pel. Hence, with the VLC used, additional bits must be transmitted if the value range of the shifts exceeds 7.5, as indicated in the following table:

value range     | <15 | <30 | <60 | <120 | <240
additional bits |  1  |  2  |  3  |  4   |  5

The maximum value range is tested within each superblock, and a 3-bit code is used independently for the horizontal and vertical shifts to inform the decoder about the required number of additional bits in this superblock.

The presence of shape parameters is determined by a divergence test on the motion parameters, as described in section 2.2 (fig. 6). Because a discontinuity in the motion vector field may be situated between two grid points belonging to different superblocks, the shape parameters are encoded at once for the whole frame. In our case with GX=GY=16, 15 different shape positions are possible, for which a binary 4-bit code is used. The rate necessary to encode the additional shape information is negligible (typically 1/4 of the rate for the motion vectors). For those areas within the highpass frames which are predicted from the previous "B" frame (see fig. 3), an additional 1-bit code indicates which side of the motion discontinuity exhibits the best displacement vector.

What should be provided by MPEG4 syntax to define the system:

1. Arrangement of motion information in the bit stream (e.g. blockwise, superblock-wise, at once for each frame)
2. Differential encoding (type of prediction, 1D/2D), resync points (e.g. at each superblock)
3. Type of entropy coding (e.g. Huffman VLC, arithmetic)
4. Specific adaptation of entropy coding (VLC codes)
5. Relation between motion and shape information

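The sketch referenced above: spatial prediction of one shift component within a superblock, with the 0.5/0.5 interior predictor (residual computation only; the MPEG1 VLC stage is omitted, and the array layout is illustrative).

```python
import numpy as np

def predict_grid_shifts(shifts):
    """Spatial prediction of one shift component within a superblock.

    shifts: 2D array of grid-point shifts (e.g. 4 rows x 8 points).
    Returns the prediction residuals that would be VLC-coded.
    Topmost row: predicted from the left neighbor; left column: from the
    top neighbor; all others: average (0.5/0.5) of left and top neighbors.
    The top-left point starts the prediction and is sent as-is.
    """
    s = np.asarray(shifts, dtype=float)
    pred = np.zeros_like(s)
    pred[0, 1:] = s[0, :-1]                         # top row: left neighbor
    pred[1:, 0] = s[:-1, 0]                         # left column: top neighbor
    pred[1:, 1:] = 0.5 * (s[1:, :-1] + s[:-1, 1:])  # interior: 0.5/0.5
    return s - pred
```
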
2.6 Type of Subsampling Filter

The subsampling filter (ITU-R-601 to SIF) is different from the one suggested in the MPEG4 PPD. A motion-compensated averaging filter is used for vertical subsampling instead of field skipping, because this operation is a natural part of the temporal-axis subband analysis (section 2.2). For horizontal subsampling, the 9-tap QMF from [3] was employed, which is also part of the spatial (2D) subband coder (section 2.3).

3 Flexibility of the Coder

The 3D subband coder exhibits a wide range of flexibility. Most parts are freely exchangeable with other tools: the motion compensation might be replaced by block matching or even object-oriented techniques; instead of the TDAC, a LOT, any subband transform, or even the DCT might be used, as well as any other spatial (intraframe) coder; the UVLC can be replaced by ordinary Huffman VLC or arithmetic coding. The coder is attractive for combination with layered coding techniques. In this respect, what is really new is the first proposed tool, the motion-compensated temporal-axis subband processing. Even this may be combined in a flexible way with conventional hybrid coders working on the lowest temporal frequency frames (LLLL in our example), or with object-oriented techniques. Remark that, with an FDG length of 1 and MC prediction applied, the scheme reduces to hybrid coding.

Just as MPEG1/2 allows defining the number of B frames and the GOF length, the FDG length of the 3D subband coder should be freely definable, depending on the specific requirements for encoding delay and on the scene contents. It is even mandatory to allow switching the length individually from one FDG to the next: a shorter FDG length is necessary before a scene change (this was indeed applied in the simulations for TABLE TENNIS), and in the case of very fast changes, a large FDG length can be suboptimal. For the last part of STEFAN (frames after #230), better results would be possible with a shorter length. A utility to perform this action is available, but was not yet built into the simulation program.

For the second proposed tool, the CGI motion compensation with border adaptation, I am sure that similar ideas will come up in other proposals. The following points appear mandatory to me for definition in the MPEG4 syntax, to allow a flexible implementation of motion parameter interpolation:

- Allow the definition of several warping techniques, e.g. bilinear, affine, perspective, maybe even the inclusion of quadratic terms for nonplanar surfaces;
- Allow interdependence between the motion and shape representations. Shape parameters are needed to know where the discontinuities in the motion vector field are. Conversely, the motion information can carry information about the presence of shapes, which may be utilized to reduce the data rate.

4 Complexity

4.1 Storage requirements

Full storage of the whole FDG is usually necessary either at the coder or at the decoder. If in-place memory usage is applied (output of subband analysis or synthesis written to the same memory), L+1 frame stores (where L is the number of subband levels) are needed at the end (coder or decoder) that does not provide full FDG storage. Return to fig. 1 to illustrate this point (L=4 in this case).

If the coder does not provide full FDG storage, the subband frames are transmitted in the following order: H0, H1, LH0, H2, H3, LH1, LLH0, H4, H5, LH2, H6, H7, LH3, LLH1, LLLH, LLLL (see the sketch below). Two frame memories are needed for processing. The following frames must be stored intermediately, where "../.." indicates that these frames use one memory in succession: L0/2/4/6 (until L1/3/5/7 are processed), LL0/2 (until LL1/3 are processed), LLL0 (until LLL1 is processed). The decoder must start synthesis at LLLL/LLLH, which are transmitted last, and hence would have to store all frames transmitted up to this point.
An alternative would be to store the bit stream, reorder it as needed (given in the next paragraph), and defer decoding until LLLL/LLLH are present; in this case, the decoder would also need only L+1 frame memories, but an additional decoding delay would be introduced. If the coder does provide full FDG storage, the whole FDG is first processed and then transmitted in the order the frames are needed at the decoder: LLLL, LLLH, LLH0, LH0, H0, H1, LH1, H2, H3, LLH1, LH2, H4, H5, LH3, H6, H7. Intermediate storage would be necessary for LLL1 (until B3 is reconstructed), LL1/LL3 (until B1/B5 are reconstructed), and L1/3/5/7 (until B0/2/4/6 are reconstructed).
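The sketch referenced above reproduces the coder-side transmission order for the streaming case (no full FDG storage): each completed frame pair immediately produces a highpass band and promotes a lowpass frame to the next level.

```python
def streaming_order(levels=4):
    """Order in which subband frames become available when the coder
    processes an FDG of 2**levels frames without full FDG storage.

    Band names follow figure 1: level-1 highpass "H", level-2 "LH",
    level-3 "LLH", etc.; the final lowpass is all-"L".
    """
    order = []
    pending = [0] * (levels + 1)      # completed lowpass frames per level
    counts = [0] * (levels + 1)       # highpass index counter per level
    for _ in range(2 ** levels):
        pending[0] += 1               # one more input frame arrives
        lev = 0
        while pending[lev] == 2:      # a pair is complete: analyze it
            pending[lev] = 0
            name = "L" * lev + "H"
            order.append(f"{name}{counts[lev]}" if lev < levels - 1
                         else name)   # the top-level band carries no index
            counts[lev] += 1
            pending[lev + 1] += 1
            lev += 1
    order.append("L" * levels)        # the final lowpass (LLLL) comes last
    return order

# streaming_order() yields: H0, H1, LH0, H2, H3, LH1, LLH0, H4, H5, LH2,
# H6, H7, LH3, LLH1, LLLH, LLLL
```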

Additional frame memories are recommended at the coder and decoder to store the decoded motion shift parameters (horizontal/vertical) and, along with these, the covered/uncovered information for each pel. This simplifies programming and may be omitted only with block-based motion compensation. One memory of double frame size is needed for spatial interpolation, at least with the technique applied in the simulations.

4.2 Processing complexity

At the coder, the most demanding task is motion estimation. For the MPEG4 tests, a procedure consisting of three steps was employed:

1. Hierarchical block matching (to obtain a smooth vector field) for the initial estimation of the CGI parameters;
2. Refinement of the CGI parameters over the reference region in fig. 4;
3. Determination of the optimum shape intersections according to fig. 6.

The second step was only performed if adjacent grid points showed different motions, the third step only if adjacent grid points violated the continuity conditions given in section 2.2. Though a modified telescopic search was used in steps 1 & 2, the processing time is relatively long due to the large search ranges necessary at the higher levels of the temporal subband tree. Former investigations based on block matching have shown that the search range can be greatly reduced if the motion information from the next-lower level is utilized [1], but this was not yet exploited by the CGI-based coder used in the simulations.

The further complexity considerations apply alike to the coder and decoder sides. Our implementation of the TDAC algorithm takes about three times the processing power needed for a DCT with the same number of coefficients. A more crucial point is the high-quality spatial interpolation necessary to avoid a blurred reconstruction. For this purpose, we use a blockwise DCT of size 32x32 with an 8-pel overlap at each side, blow it up to 64x64 by attaching zeros, and perform an IDCT of this size (see the sketch at the end of section 2.1). The whole procedure, including the bilinear interpolation which is still necessary unless the motion vector points exactly to a half-pel site, costs approximately 40 multiplications per pel. Other interpolation techniques, like spline or higher-order linear, should be investigated at this point. Nevertheless, it is not mandatory to use the same type of spatial interpolation at the coder and decoder. With bilinear interpolation, decoding becomes faster by approximately a factor of 5, and the storage requirements are reduced. The reconstruction does get more blurred in this case, but no annoying artifacts are introduced. The remaining tasks, e.g. the interpolation of the motion vector field from the CGI parameters, the determination of covered and uncovered areas, etc., have a marginal influence on the processing time.

5 Possible Improvements

5.1 Known Bugs in the Simulation

The decoder provided for the MPEG4 tests produces the full ITU-R-601 format (e.g. 720x480 for a 30 Hz frame rate). However, the first step of encoding, the temporal subband analysis, was performed on the full-size (720-pel rows) images, while encoding, as shown in fig. 7, cuts off 8 columns at each side. After decoding and temporal subband synthesis, we found that this may cause serious effects at the left and right borders of the image, because inverse motion compensation sometimes expects information from the omitted columns. This effect was found to be within an acceptable limit for the MOBILE and TABLE TENNIS sequences (it should still be possible to judge the performance of the coder).
With STEFAN, artifacts were detected in some cases even far from the border; here, we changed the spatial coder to rows of width 720 in an ad-hoc action (not documented), providing compatibility by using a spare bit in the bitstream headers. The solution for further experiments is to cut off the columns before the temporal subband processing.

5.2 Possible Improvements to the Present Coder

The quantization criteria outlined in section 2.4 are derived from rate-distortion theory, i.e. the temporal subband signal is quantized more exactly whenever its value affects more reconstructed frames. Visual examination suggests that fast-moving areas, as well as areas being covered or uncovered, are sometimes not handled in an optimum way. This becomes especially visible around the ball in MOBILE, or near STEFAN himself. The reason is the invalidity of rate-distortion based quantizer assignments at low data rates: if high-frequency components have a very low energy, they are set to zero and are in fact quantized with less distortion than other components. Just this is the case here. In areas which cannot be exactly motion-compensated, the temporal high-frequency component usually plays an important role and is not set to zero during quantization. Hence, the quantization error is indeed larger than in areas which can be perfectly reconstructed from the lowpass information alone. Unlike in intraframe coding, where this effect is desirable due to noise masking in detailed areas, artifacts become visible, because the viewer looks at exactly those moving objects. A modified quantization according either to constant-SNR or to psychovisual criteria appears necessary at this point.

To enhance quality, especially at the lowest rates, it seems important to exploit the temporal redundancy of the motion information as well. In the coder used for the MPEG4 tests, the motion parameters were encoded independently for all levels of the subband system. We have already performed experiments with a temporal-axis "differential pyramid" encoding of motion parameters, starting at the highest level. At the same time, such a technique is suitable for subdividing the motion parameters into several relevance levels for transmission in error-prone environments [6]. This, however, requires modifications to the bit stream, possibly including the transmission of all motion parameters for the whole FDG as one package.

In order to provide a wider range of spatial scalability, perhaps from HDTV down to QCIF formats, it would be appropriate to replace the subband decomposition described in section 2.3 (fig. 7) by a wavelet pyramid. Further investigations are necessary at this point to realize a spatial scaling of the motion parameters along with the image information.

5.3 Organization as an Object-Oriented Coder

It is straightforward to combine the temporal-axis subband system with object-oriented encoding techniques. Each lowpass frame has exactly the same coordinate positions as a particular frame of the original sequence (refer to figs. 1 & 3; the following frames show basically the same images: L0 ≈ B0, L1 ≈ B1, L2 ≈ B2, ..., LL0 ≈ B1, LL1 ≈ B3, ..., LLL0 ≈ B3, LLL1 ≈ B7, LLLL ≈ B7). Hence, if a technique is available that tracks objects from frame to frame, separate temporal subband analysis can be performed on objects and background without any problem. The positions of covered and uncovered areas are exactly known in this case, and the LLLL image of the background would contain information about all areas that are visible in any frame within the FDG (remark that, with our technique, the background can be moving as well, or may even consist of several parts with different movements). For an object with arbitrary shape, it is recommended to use some region-oriented spatial encoder in combination with the temporal-axis subband system.
Full object scalability is guaranteed: even approaches may be realizable where the background is encoded with the 3D subband system and particular objects with any other technique.

References

[1] J.-R. Ohm: "Three-dimensional subband coding with motion compensation," IEEE Trans. Image Proc. 3 (1994), pp. 559-571.
[2] G. J. Sullivan and R. L. Baker: "Motion compensation for video compression using control grid interpolation," Proc. IEEE ICASSP (1991), pp. 2713-2716.
[3] E. P. Simoncelli and E. H. Adelson: "Subband transforms," in Subband Image Coding, J. W. Woods (ed.), Boston: Kluwer 1991, p. 186.

[4] J. P. Princen and A. B. Bradley: "Analysis/synthesis filter bank design based on time domain aliasing cancellation," IEEE Trans. Acoust., Speech, Signal Proc. 34 (1986), pp. 1153-1161.
[5] P. Delogne and B. Macq: "Universal variable length coding for an integrated approach to image coding," Annales des Télécommunications 46 (1991), pp. 452-459.
[6] J.-R. Ohm: "Motion-compensated 3-D subband coding with multiresolution representation of motion parameters," Proc. IEEE ICIP (1994), vol. III, pp. 250-254.
[7] J.-R. Ohm: "Advanced packet-video coding based on layered VQ and SBC techniques," IEEE Trans. Circ. Syst. Video Tech. 3 (1993), pp. 208-221.