Figure 1: Representation of moving images using layers Once a set of ane models has been found, similar models are grouped based in a mean-square dist

Size: px

Start display at page:

Download "Figure 1: Representation of moving images using layers Once a set of ane models has been found, similar models are grouped based in a mean-square dist"

Augustine Webster
5 years ago
Views:

1 ON THE USE OF LAYERS FOR VIDEO CODING AND OBJECT MANIPULATION Luis Torres, David Garca and Anna Mates Dept. of Signal Theory and Communications Universitat Politecnica de Catalunya Gran Capita s/n, D Barcelona, Spain fluis, albroto, Abstract Layered representations, originally introduced in [7] are important in several applications. Specically, they are very appropriate for object tracking, object manipulation and content-based scalability which are among the main functionalities of the future standard MPEG-4 [2, 3]. A robust representation of moving images based on layers has been presented in [5]. One of the most crucial stages in the formation of layers is the motion segmentation. In particular, the initial segmentation is very critical for the success of the layered system. In this context, the objective of this paper 1 is to provide some insight on the performance of layers by studying dierent initial segmentations for the construction of layers. In order to have a selfcontained paper, the rst part of the paper reviews the main concepts and results presented in [5] which includes image synthesis and object manipulation. The second part provides dierent type of initial segmentations that help to analyze the performance of the overall layered representation of moving images. Some comments on the use of layers for video coding are also introduced. 1 INTRODUCTION The basic idea behind the new standard MPEG-4 [2] is to represent the world understood as a composition of audio-visual objects, following a script that describes their spatial and temporal relationship. This type of representation should provide the possibility that the user interacts with the various audio-visual objects in the scene, in a way similar to the actions taken in everyday life. Although this content-based approach to the scene representation may be considered evident for a human being, it represents in fact a revolution in terms of video representation architecture used in the available standards, since it allows a jump in the 1 This work was supported in part by the CICYT grant of the Spanish government: TIC C05 type of functionalities that may be provided to the user. A scene represented as a composition of (more or less independent) audio - visual objects oers to the user the possibility to play with the scene content, by changing some of the objects characteristics (e.g. position, motion, texture or shape), by accessing only selected parts of scene or even by cut and pasting objects from one scene to another. Content and interaction are thus central concepts in MPEG-4 [3]. Basic concepts and several approaches to achieve different functionalities in the context of MPEG-4 may be found in [6]. An approach to represent the visual world in terms of objects (called Video Object Planes, VOP's, in MPEG-4) is the representation in layers proposed by Wang and Adelson [7]. Each layer contains three different maps: 1) the intensity map, 2) the alpha map, which denes the opacity or transparency of the layer at each point and 3) the velocity map which describes how the map should be warped over time. To clearly show the concepts behind the layered representation, Figure 1 advances our own results on the layered representation of the Mobile Calendar sequence. The representation has been able to extract four dierent layers corresponding to the ball, the train, the calendar and the background. The last image represents the reconstruction obtained from the dierent layers. In each layer all the visible information of the corresponding object is accumulated in one single extended frame. Notice that this representation allows manipulation of the content of the scene and besides each frame is dened only through the motion parameters what gives a very compact representation very useful in video coding applications Our construction of the layers follows conceptually the approach presented in [7], but using more ecient and robust techniques. The process implies two basic operations: 1) a local motion estimation and 2) a motion segmentation by ane model tting. We have developed a simple and yet robust method to nd the motion eld. The technique is presented in Section 2.

Figure 1: Representation of moving images using layers Once a set of ane models has been found, similar models are grouped based in a mean-square distance between the motion vectors of the pixels

2 Figure 1: Representation of moving images using layers Once a set of ane models has been found, similar models are grouped based in a mean-square distance between the motion vectors of the pixels belonging to each of the models being compared. This gives an ecient motion segmentation strategy which is explained in Section 3. Section 4 provides some results of reconstructed sequences using this layered approach along with an example of object manipulation. Section 5 gives some insight on the performance of layers by giving results using manual and semi-manual initial segmentations. Section 6 provides a few ideas on the use of layers for video coding applications. Finally, Section 7 draws some conclusions. 2 ROBUST MOTION ESTI- MATION In order to estimate the motion eld of the scene, we estimate rst an initial approximation of the motion of some pixels located in a rectangular grid using block matching techniques. Given the position ~r of a pixel and a vector motion ~v = (v x ; v y ) jv x j; jv y j maximum displacement, the following cost function is dened: G(~r; ~v) = X ~r02w (~r) I t (~r0)? I t?1 (~r0? ~v) 2 (1) where W (~r) is a neighboorhod of ~r and I t, I t?1 are the actual and previous images. The motion vectors of the remaining pixels are interpolated from these rst ones. Due to local minima of the cost function, some motion vectors may be erroneous. To improve these motion vectors, a motion gradient is found for each pixel: MG(~r) = 1 4 X ( (1, 0) (0, 1) ~d= (-1, 0) (0, -1) ~v(~r)? ~v(~r? ~ d) 2 (2) Then for those pixels whose motion vectors are above the mean of all the gradients, the motion vectors are recalculated using a bigger W (~r). The interpolated motion vectors are rened with integer precision. The nal motion vector for each pixel of the whole image is obtained through a renement process based on one third-pixel accuracy. Please notice that if occluded areas of I t?1 are uncovered in I t, the cost function G(~r; ~v) cannot give the correct motion estimation vector. To minimize this problem, the above explained technique is applied to estimate the motion eld between I t and I t+1. This information is used to compensate for the errors of the rst motion eld (I t?1 and I t ). Figure 2 shows the motion eld obtained for the rst two images of the Flower Garden sequence. This gure shows a motion vector per pixel, an acceptable motion eld and a quite good performance in the contours. 3 ROBUST MOTION SEG- MENTATION The next step is to segment the motion vector eld by tting an ane motion model. The objective is that those regions of the image with the same ane model will be grouped. As we do not have a reliable initial

models obtained from an original decomposition of the image in rectangular regions of 24 32 pixels.

Figure 4 and Figure 5 show the original sequence Flower Garden and its corresponding segmentation.

For each of these rectangular regions, the best ane model in the minimum square sense is found through an iterative k-means clustering algorithm.

3 models obtained from an original decomposition of the image in rectangular regions of pixels. We believe that this technique represents also an important improvement with respect to the original technique presented in [7]. Figure 4 and Figure 5 show the original sequence Flower Garden and its corresponding segmentation. Figure 2: Motion eld of the Flower Garden sequence image partition, the image is initially partitioned in rectangular regions (same approach as taken in the original paper in [7]). For each of these rectangular regions, the best ane model in the minimum square sense is found through an iterative k-means clustering algorithm. Once a set of ane models have been found, similar models are grouped based in a meansquare distance between the motion vectors of the pixels belonging to each of the models being compared. Two models are grouped each time. The ordering in which all the model pairs are grouped is of paramount importance as it will x the stability and convergence of the technique. The distance between two given af- ne models a i and a j belonging respectively to the regions R i and R j, is dened as dist(a i ; a j ) = 1 n X ~r2r i;r j V ai (~r)? V aj (~r) 2 (3) where n indicates the number of pixels of R i [ R j, V ai (~r) and V aj (~r) the motion vectors represented by a i and a j in the position ~r. Once the distance between two ane models is dened, the grouping of models is done according to the following steps: 1. Find the pair (i; j) such that the distance between the corresponding models is minimum: dist(a i ; a j ) dist(a k ; a l ) 8 (k; l) 6= (i; j) 2. Group a i and a j in a single model, which gives the best estimation of the motion eld of the regions R i and R j. 3. Iterate the process while dist(a i ; a j ) is less than a prespecied threshold. A threshold of 1 pixel is a very reasonable option for most of the sequences. Using this grouping technique, the segmentation algorithm proves to be very ecient as it gives very reliable nal models at a fast convergence rate. As an example, Figure 3 has only needed four iterations to segment the motion eld, starting from a list of ane a) Initial, n = 0. b) n = 1. c) n = 3. d) Post-processed after n = 4. Figure 3: Results of the motion segmentation between frames #1 and #2 of Flower Garden after dierent iterations. White zones represent unassigned pixels 4 RESULTS The robust segmentation technique explained above provides a series of non-overlapped regions that will be used to form the nal layers. Our layer synthesis approach follows very closely to the original one presented in [7]. To prove the feasibility of our approach Figure 6 and Figure 7-b present the reconstruction of three frames of the Flower Garden sequence and the 15th frame of the Mobile Calendar sequence using the approach described above. Although it is not possible to fully appreciate in these still images the improvements of our method over the time domain, the reconstruction of the image it is believed to be of a very good quality. As a very basic example of object manipulation, Figure 7-c shows the same image with the ball removed. Please also notice the quality of the reconstructed image, specially in the location of the ball. All these sequences have been obtained using a rectangular initial partition.

4 a) #5 b) #15 c) #25 Figure 4: Original Flower Garden Sequence a) #5 b) #15 c) #25 Figure 5: Segmentation of Flower Garden a) #5 b) #15 c) #25 Figure 6: Reconstruction of Flower Garden using the proposed robust approach a) Original b) Reconstruction c) Object manipulation Figure 7: Reconstruction of Mobile Calendar and object manipulation using layers

the image in rectangular regions. The rectangular shape of this rst segmentation has been chosen due to the lack of a better segmentation.

5 a) Wall. b) Calendar. c) Ball. d) Train. Figure 8: Layers extracted from the Mobile Calendar sequence using the manual approach. Static layer not shown. 5 RESULTS USING MANUAL INITIAL SEGMENTATIONS Section 4 has shown some results of the layered representation using as an initial segmentation of the motion segmentation algorithm a decomposition of the image in rectangular regions. The rectangular shape of this rst segmentation has been chosen due to the lack of a better segmentation. This procedure has led to a quite acceptable segmentation of the video sequence in layers and to a good reconstruction of the synthesized image sequence. At this stage, one may wonder which can be the performance of the system if a better approach than the initial rectangular segmentation is selected. In order to nd a perfect representation of a moving image in layers, the rst (and easy) temptation is to manually segment all the images of the whole sequence. Although this may sound as a highly impracticable procedure for the majority of applications, it may provide a bound on the performance of the approach with respect to which other sub optimal approaches might be compared. To that end we have performed the following experiment. Using a graphic editor the Mobile Calendar sequence has been manually segmented and then the same process explained in Sections 2 and 3 has been applied to nd the motion and the corresponding segmentation. Figure 8 shows the extracted layers using this manual approach. It can be observed the almost perfect decomposition of the moving image in layers. Figure 9 shows the corresponding result using the rectangular decomposition of the rst image. It can be appreciated that the rectangular approach provides a much worse representation. Notice for instance that in Figure 9 the calendar and the wall are extracted as a single layer. Figure 12 shows the reconstruction of frames 1, 15 and 30 of the Mobile Calendar sequence using the manual approach. Although when presented as a static result it is dicult to appreciate the dierence, the reconstruction using the automatic segementation (rectangular regions) is worse than the reconstruction using the manual approach. This can be fully ap-

a) Wall and Calendar. b) Train. c) Ball. d) Static information. Figure 9: Layers extracted from the Mobile Calendar sequence with the rst image segmented rectangularly.

6 a) Wall and Calendar. b) Train. c) Ball. d) Static information. Figure 9: Layers extracted from the Mobile Calendar sequence with the rst image segmented rectangularly. preciated in Figure 10 where the PSNR between the original Mobile Calendar sequence and the reconstructed sequences using the manual segmentation (solid line) and the automatic segmentation (dashed line) are presented. It can be appreciated that the best reconstruction is provided for frame number 15, since it has been selected as reference for the construction of the layers. Similar results are found with other sequences. This experiment has shown that if an adequate segmentation is found, the decomposition of a moving image using layers is a good approach for image representation, object manipulation and video coding. At this stage the following question arises naturally: is it possible to obtain the same results by using a complete automatic segmentation? Our own results of Section 4 show that it is not possible or at least we do not know how to do it. However we can improve the results of Section 4 by performing another experiment. Instead of segmenting manually the whole sequence, what will the results be of the layered representation if we only segment manually the rst image? Figure 11 shows the layers obtained using this semiautomatic approach. Notice that it is still possible to distinguish between the wall and calendar layers, which was not the case with the totally automatic approach. Figure 13 shows the corresponding reconstructed sequence and Figure 14 presents the PSNR of the reconstructed images. For convenience the three reconstructions are shown together. The solid line represents the semiautomatic approach, the top dashed line represents the manual segmentation and the down dashed line the automatic approach. As one might be expecting, the semiautomatic approach lies half way between the corresponding fully manual and automatic approaches. These results show that some improvements may expected in the layered representation of moving images. Instead of segmenting manually the rst image and in order to have a completely automatic approach we are currently working in using an automatic spatial segmentation of the rst image. First results using this approach look promising and we hope to present them very soon.

7 a) Y component. b) U component. c) V component. Figure 10: PSNR between the original Mobile Calendar sequence and the reconstructed sequence using the manual segmentation (solid line) and the automatic segmentation (dashed line) 6 THE LAYERED REPRE- SENTATION OF MOVING IMAGES FOR VIDEO CO- DING The results of the layered approach shown above gives an idea of the potentiality of the representation for video coding applications. To encode an image sequence represented by layers the information that needs to be transmitted is: the intensity maps, the alpha maps, the motion parameters for each layer and possibly an error signal (delta map). Notice that the intensity map can be coded as a still image, although the contours of the layer need also to be transmitted what requires extra bits. Still image compression techniques such as subband, vector quantization or wavelets may be used to encode the arbitrarily shaped intensity map [1]. The shape of the layer may be encoded using a lossless modied chain coded approach or a lossy scheme such as the multigrid technique [4]. For binary alpha maps, the opacity information is already included in the layer shape. The six parameters of the ane motion information for each layer and each frame need also to be encoded. We are currently developing a new scheme to combine in a unique motion parameter set the motion parameters associated to each layer. This approach will further reduce the number of bits associated to each set of parameters of each frame. Finally, for perfect reconstruction, the error signal needs also to be sent. Our rst results show that if the motion information associated to each frame of each layer is good enough, there is no need to send the error signal, what increases the compression ratio. Although we are not ready yet to provide reliable compression results, our rst experiments look very promising what may proof that the layered approach is very suitable for video coding applications. 7 CONCLUSIONS A robust representation of images using layers has been presented in this paper. To that end, robust motion estimation and segmentation techniques have been presented. Examples of image reconstruction and object manipulation have been provided which proves the feasibility of the approach. Dierent type of segmentations have been provided which have given further insight into the performance bounds of the layered approach. It has been shown that the representation of moving images using layers is very promising for object manipulation as well as for video coding. References [1] R. J. Clarke. Digital compression of still images and video. Academic Press, [2] ISO/IEC JTC1/SC29/WG11. MPEG-4 Proposal Package Description (PPD). July [3] F. Pereira. MPEG-4: a new challenge for the representation of audio-visual information. In Picture Coding Symposium, Melbourne, Australia, March [4] P. Salembier, F. Marques, and A. Gasull. Coding of partition sequences. In L. Torres and M. Kunt, editors, Video coding: the second generation approach, pages 125{169. Kluwer Academic Publishers, [5] L. Torres, D. Garca, and A. Mates. A robust motion estimation and segmentation approach to represent moving images with layers. In International Conference on Acoustics, Speech and Signal Processing, Munich, Germany, April [6] L. Torres and M. Kunt. Video coding: the second generation approach. Kluwer Academic Publishers, Englewood Clis, [7] J. Y. A. Wang and E. H. Adelson. Representing moving images with layers. IEEE Transactions on Image Processing, 3(5):625 { 638, September 1994.

8 a) Wall. b) Calendar. c) Ball. d) Train. e) Static textures. Figure 11: Layers extracted from the Mobile Calendar sequence with the rst image segmented manually.

a) Y component. b) U component. c) V component. Figure 14: PSNR of the reconstructed images.

9 a) #1 b) #15 c) #30 Figure 12: Reconstruction of the Mobile Calendar sequence with the whole sequence segmented manually. a) #1 b) #15 c) #30 Figure 13: Reconstruction of the Mobile Calendar sequence with the rst image segmented manually. a) Y component. b) U component. c) V component. Figure 14: PSNR of the reconstructed images. The solid line represents the semiautomatic approach, the top dashed line represents the manual segmentation and the down dashed line the automatic approach

PROJECTION MODELING SIMPLIFICATION MARKER EXTRACTION DECISION. Image #k Partition #k

PROJECTION MODELING SIMPLIFICATION MARKER EXTRACTION DECISION. Image #k Partition #k TEMPORAL STABILITY IN SEQUENCE SEGMENTATION USING THE WATERSHED ALGORITHM FERRAN MARQU ES Dept. of Signal Theory and Communications Universitat Politecnica de Catalunya Campus Nord - Modulo D5 C/ Gran