Proc. of SPIE, Storage and Retrieval for Image and Video Databases VI, vol. 3312, pp. 202-213, 1998

NeTra-V: Towards an Object-based Video Representation

Yining Deng, Debargha Mukherjee and B. S. Manjunath
Department of Electrical and Computer Engineering
University of California, Santa Barbara, CA 93106
{deng, debu}@iplab.ece.ucsb.edu, manj@ece.ucsb.edu

Abstract

There is a growing need for new representations of video that allow not only compact storage of data but also content-based functionalities such as search and manipulation of objects. We present here a prototype system, called NeTra-V, that is currently being developed to address some of these content-related issues. The system has a two-stage video processing structure: a global feature extraction and clustering stage, and a local feature extraction and object-based representation stage. Key aspects of the system include a new spatio-temporal segmentation and object-tracking scheme, and a hierarchical object-based video representation model. The spatio-temporal segmentation scheme combines color/texture image segmentation and affine motion estimation techniques. Experimental results show that the proposed approach can handle large motion. The output of the segmentation, the alpha plane as it is referred to in MPEG-4 terminology, can be used to compute local image properties. This local information forms the low-level content description module in our video representation. Experimental results illustrating spatio-temporal segmentation and tracking are provided.

Keywords: content-based retrieval, spatio-temporal segmentation, object-based video representation.

1. Introduction

With the rapid developments in multimedia and Internet applications, there is a growing need for new representations of video that allow not only compact storage of data but also content-based functionalities such as search and manipulation of objects, semantic description of the scene, detection of unusual events, and possible recognition of the objects. Current compression standards, such as MPEG-2 and H.263, are designed to achieve good data compression, but do not provide any content-related functionalities. There has been much work done on the emerging MPEG-4 standard [17], which is targeted towards access and manipulation of objects as well as more efficient data compression. However, the functionalities that can be provided at present are limited to cut-and-paste of a few objects in simple scenes. No visual information is extracted from the object itself that can be used for similarity search and high-level understanding. On the other hand, current research in content-based video retrieval [1, 5, 8, 9, 11, 21] has provided simple content descriptions by temporally partitioning the video clip into smaller shots, each of which contains a continuous scene, and extracting visual features such as color, texture, and motion from these shots. These low-level visual features are quite effective in searching for similar video scenes given a query video shot. However, with the exception of a few efforts [20], much of the prior work in this area is restricted to global image features.

This paper describes a video analysis and retrieval system, called NeTra-V*, which is being developed with the objective of providing content-related functionalities. The system is fully automatic and has a two-stage video processing structure: the first stage is global feature extraction and clustering; the second stage is local feature extraction and object-based representation.
Key aspects of the system include a new spatio-temporal segmentation and object-tracking scheme, and a hierarchical object-based video representation model. The spatio-temporal segmentation scheme combines spatial color/texture image segmentation and affine motion estimation techniques.

* NeTra means "eye" in Sanskrit, an ancient Indian language. NeTra is also the name of the image retrieval system described in [14].

Experimental results show that the proposed approach can handle large motion and complex scenes containing several independently moving objects. The output of the segmentation, the alpha plane, can be used both for MPEG-4 coding, which produces compactly stored data, and for local feature extraction, which provides object information. Both global and local features are used to form the low-level content description of the video representation model. The current implementation of NeTra-V allows the user to track regions in a video sequence and search for regions with similar color, texture, shape, motion pattern, location, or size in the database. Identifying more meaningful objects from these low-level region features is a future goal. Some examples from a football game database are shown in this paper. Demonstrations of the NeTra-V system are available on the web at http://copland.ece.ucsb.edu/demo/video/

The rest of the paper is organized as follows. Section 2 gives an overview of the system. Section 3 details the spatio-temporal segmentation. Section 4 illustrates the low-level content description of the video data. Section 5 concludes with discussions.

2. System Overview

Figure 1 shows a schematic diagram of the NeTra-V system. Our research so far covers all the shaded blocks in the figure. Video data, either raw or compressed using current standards, is segmented in the temporal domain into small video shots of consistent visual information. Often each video shot represents a single natural scene delimited by camera breaks or editing cuts. The temporal partitioning algorithm [5] works directly on the MPEG-2 video sequence. It can detect both abrupt scene cuts and gradual transitions by using color and pixel intensities in the I-frames and motion prediction information in the P-frames and B-frames. The partitioned video shots are then processed in two stages:

1. Global features are first extracted [5]. These features help in preliminary scene classification, are quite robust to local perturbations, and are easy to compute. Feature clustering is then performed based on the global feature distances to better organize the data and facilitate indexing and search. Video shots are clustered into different categories according to their content, and object search can be restricted to certain categories so that the search space is greatly reduced. A traditional agglomerative method is chosen for feature clustering instead of the more popular k-means method, because many video shots do not belong to any well-defined category and should not affect the clustering process (a minimal sketch of this thresholded grouping appears after this list). Figure 2 shows some example frames of the categories generated from a football game database. It can be seen that the feature clustering, to a certain degree, captures some semantic-level information in each category, such as zoom-out shots of the football field and zoom-in shots of individual players.

2. The local processing step consists of three blocks. Spatio-temporal segmentation generates a labeled region map, or the alpha plane in MPEG-4 terminology, for each video frame. The alpha plane is essential to MPEG-4 object-based video coding, which generates compactly stored video data while allowing object access and manipulation. With the region map, local features can also be extracted. These features include color, texture, motion, shape, size, and spatial relations among the regions. Note that dimensionality reduction [2] is needed to provide compression of the high-dimensional feature vectors and an efficient indexing structure.
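The agglomerative grouping in step 1 above can be sketched as follows. This is a minimal sketch, assuming each shot is summarized by a single global feature vector; the average-linkage scheme, the Euclidean metric, and the cutoff value are illustrative assumptions rather than details from the paper. Cutting the dendrogram at a fixed distance leaves poorly matching shots in singleton clusters, which is the stated reason for preferring this over k-means.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_shots(shot_features: np.ndarray, cutoff: float) -> np.ndarray:
    """Group video shots by global-feature distance; outlier shots remain in
    singleton clusters instead of being forced into a k-means category."""
    Z = linkage(shot_features, method="average", metric="euclidean")
    return fcluster(Z, t=cutoff, criterion="distance")  # one label per shot

# Example: 100 shots, each described by a 64-bin global color histogram.
labels = cluster_shots(np.random.rand(100, 64), cutoff=0.8)
```
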
A hierarchical object-based video representation model, which provides both compact storage of the video data and content information of the scene, is also illustrated in Figure 1. This model, shown within the dashed box, is composed of four data representation levels. The bottom level stores object-encoded video data and allows object access and manipulation. The next level provides the low-level content description of the video scenes by storing all global and local visual features; these features can be used for content-based search and retrieval, and for semantic abstraction at the next level. High-level content description requires a certain degree of human assistance, and the system should have self-learning ability as well. The top level contains textual annotations of the video data; these could include non-image information such as recording date, recording place, source, category, description of the content, and so on. The following sections give more details on two key aspects of NeTra-V: spatio-temporal segmentation and low-level video content description.
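Before moving on, the four-level model can be rendered schematically as a plain data structure. This is only an illustrative sketch; the field names and types are our assumptions, not an interface defined by the paper.

```python
from dataclasses import dataclass, field

@dataclass
class VideoShotRepresentation:
    """Schematic of the four-level hierarchical model, bottom to top."""
    coded_objects: bytes                                   # level 1: object-encoded video data
    global_features: dict = field(default_factory=dict)   # level 2: low-level
    region_features: list = field(default_factory=list)   #   content description
    semantic_labels: list = field(default_factory=list)   # level 3: high-level (semantic) description
    annotations: dict = field(default_factory=dict)       # level 4: textual annotation
```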

[Figure 1. Schematic diagram of the NeTra-V system. Raw or compressed video data (movies, news, sports, surveillance, ...) is temporally partitioned and fed to a two-step processing structure: global processing (global feature extraction, feature clustering) and local processing (spatio-temporal segmentation, local feature extraction, dimensionality reduction, object-based video coding). These populate the hierarchical object-based video representation model: compactly stored video data; low-level content description (color, texture, motion, shape, spatial relations, ...); high-level content description (semantic abstraction via supervised and unsupervised learning); and textual annotation. A user interface provides search and retrieval over the database index via the network.]

[Figure 2. Example frames of different categories generated from the football game database.]

3. Spatio-temporal Segmentation

Spatio-temporal segmentation continues to be a challenging problem in computer vision research [6, 7, 10, 18, 19]. Many motion segmentation schemes use optical flow methods to estimate motion vectors at the pixel level, and then cluster pixels into regions of coherent motion. There are several drawbacks to this approach. First, optical flow methods do not cope well with large motions. Second, regions of coherent motion may contain multiple objects, for example, the entire background. While such regions are good for coding purposes, they are not useful for local feature extraction and object identification. In general, techniques designed with coding objectives cannot yield good segmentation results; we elaborate on this point in Section 5.

Another approach to spatio-temporal segmentation combines the results of both spatial and motion segmentation. Intuitively, this approach exploits as much information as possible from the data and should yield better results. The general strategy is to spatially segment the first frame and estimate local affine motion parameters for each region to predict subsequent frames. Numerical methods [3, 4, 16] have been proposed to estimate affine motion parameters. The success of this approach depends largely on a good initial spatial segmentation; results using simple region-growing methods are usually not very satisfactory. Recently, a general framework for color and texture image segmentation has been presented [13], which appears to give good segmentation results on a diverse collection of images. This algorithm is used in our spatio-temporal segmentation scheme.

3.1 General Scheme

We borrow the idea of intra- and inter-frame coding from MPEG. Video data is processed in consecutive groups of frames. These groups are non-overlapping and independent of each other; the number of frames in each group is set to 7 in the following experiments. The middle frame of each group is called the I-frame. Spatial segmentation is performed only on the I-frame of each group. The remaining frames in the group are called P-frames. P-frames are segmented by local affine motion predictions from their previous frames; the prediction can be either forward or backward. The insertion of I-frames in the video sequence recovers from failures of affine motion estimation in the case of large object movements in 3D space and ensures the robustness of the algorithm. Some heuristics, based on the motion prediction information, are applied to handle overlapped and uncovered regions. Figure 3 illustrates the general segmentation scheme; a minimal sketch of this control flow follows the figure.

The use of simultaneous forward and backward predictions, in the way B-frames in MPEG-2 are generated, would help in affine motion estimation. However, unlike in the case of block-based prediction in MPEG-2, this creates the problem of region correspondence between frames, since the spatial segmentations of two consecutive I-frames can be significantly different. For this reason, B-frames are not used in our scheme. The current method restricts the maximum number of regions to be the same as the number generated by the spatial segmentation. Regions can disappear because of occlusion or by moving out of the image boundary, but no new regions are labeled during the motion prediction phase; new regions entering the scene are handled by the next I-frame.

[Figure 3. General segmentation scheme. One group of frames (P3, P2, P1, I, P1, P2, P3) is shown. I is the spatially segmented frame; P1, P2, and P3 are the first, second, and third predicted frames, respectively.]
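The sketch below illustrates the group-of-frames control flow just described, assuming two helper functions are available: spatial_segment (the edge-flow segmentation of Section 3.2) and predict_labels (the affine region matching of Section 3.3). Both names are ours, not the paper's.

```python
def segment_group(frames, spatial_segment, predict_labels):
    """One non-overlapping group of frames (7 in the paper's experiments):
    spatial segmentation on the middle I-frame, then affine label
    prediction outward to the P-frames on either side."""
    n = len(frames)
    mid = n // 2
    labels = [None] * n
    labels[mid] = spatial_segment(frames[mid])      # I-frame: spatial only
    for i in range(mid + 1, n):                     # forward predictions
        labels[i] = predict_labels(frames[i - 1], frames[i], labels[i - 1])
    for i in range(mid - 1, -1, -1):                # backward predictions
        labels[i] = predict_labels(frames[i + 1], frames[i], labels[i + 1])
    return labels                                   # one alpha plane per frame
```
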
3.2 Spatial Segmentation

A brief description of the spatial segmentation algorithm [13] is given here. The algorithm integrates color and texture features to compute the segmentation. First, the direction of change in color and texture is identified and integrated at each pixel location. Then a vector is constructed at each pixel, pointing in the direction where a region boundary is most likely to occur. The vector field propagates to neighboring points with similar directions and stops where two neighboring points have opposite flow directions, which indicates the presence of a boundary between the two pixels. After boundary detection, disjoint boundaries are connected to form closed contours. This is followed by region merging based on color and texture similarities as well as the boundary length between the two regions. The algorithm is designed for general images and requires very little parameter tuning from the user; the only parameter to be specified is the scale factor for localizing the boundaries.
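The opposing-flow boundary criterion can be illustrated with a deliberately simplified sketch. Real edge flow [13] integrates color and texture edge energies and iteratively propagates the vector field; here, as a toy stand-in, the flow at each pixel simply points toward higher edge energy at the chosen scale, and a boundary is declared wherever two neighbors point toward each other.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def toy_edge_flow_boundaries(gray: np.ndarray, scale: float = 2.0) -> np.ndarray:
    """Mark pixels where neighboring flow vectors oppose each other."""
    smooth = gaussian_filter(gray.astype(float), scale)  # scale factor of Sec. 3.2
    gy, gx = np.gradient(smooth)
    energy = np.hypot(gx, gy)                  # edge energy at this scale
    fy, fx = np.gradient(energy)               # toy flow: points toward boundaries
    boundary = np.zeros(gray.shape, dtype=bool)
    # Opposite horizontal flows on adjacent pixels indicate a boundary between them.
    boundary[:, :-1] |= (fx[:, :-1] > 0) & (fx[:, 1:] < 0)
    # Likewise for vertically adjacent pixels.
    boundary[:-1, :] |= (fy[:-1, :] > 0) & (fy[1:, :] < 0)
    return boundary
```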

3.3 Motion Segmentation

The results of the spatial segmentation are used for affine motion estimation. A 6-parameter 2D affine transformation is assumed for each region in the frame and is estimated by finding the best match in the next frame; segmentation results for the next frame are thereby obtained. Gaussian smoothing is performed on each image before affine estimation, and the estimation uses only the luminance component of the video data. Mathematically, the following functional f is minimized for each region R:

    f(\mathbf{a}) = \sum_{(x,y) \in R} g\big( I_1(x', y') - I_2(x, y) \big)    (1)

where \mathbf{a} = [\mathbf{a}_x^T \; \mathbf{a}_y^T]^T is the six-parameter affine motion vector, separable into its x and y components \mathbf{a}_x = [a_{x1} \; a_{x2} \; a_{x3}]^T and \mathbf{a}_y = [a_{y1} \; a_{y2} \; a_{y3}]^T, and g is a robust error norm used to reject outliers, defined as in [10]:

    g(e) = \frac{e^2}{\sigma + e^2}    (2)

where \sigma is a scale parameter. I_1 and I_2 are the current frame and the next frame, respectively; x and y are pixel locations; x' = x + dx and y' = y + dy, where the displacements are dx = \mathbf{b}^T \mathbf{a}_x and dy = \mathbf{b}^T \mathbf{a}_y with \mathbf{b} = [1 \; x \; y]^T. Ignoring higher-order terms, a Taylor expansion of (1) gives

    f(\mathbf{a}) = f(\mathbf{a}_0) + \nabla f(\mathbf{a}_0)^T (\mathbf{a} - \mathbf{a}_0) + \tfrac{1}{2} (\mathbf{a} - \mathbf{a}_0)^T \nabla^2 f(\mathbf{a}_0) (\mathbf{a} - \mathbf{a}_0)    (3)

Using a modified Newton's method [12] that ensures both descent and convergence, \mathbf{a} can be solved iteratively by applying the following update at the k-th iteration:

    \mathbf{a}[k+1] = \mathbf{a}[k] - c[k] \, \{\nabla^2 f(\mathbf{a}[k])\}^{-1} \nabla f(\mathbf{a}[k])    (4)

where c[k] is a search parameter selected to minimize f. For (1), \nabla f and \nabla^2 f are calculated as

    \nabla f = \sum_R \frac{\partial g}{\partial e} \begin{bmatrix} \frac{\partial I_1}{\partial x'} \mathbf{b}^T & \frac{\partial I_1}{\partial y'} \mathbf{b}^T \end{bmatrix}^T    (5)

    \nabla^2 f = \sum_R \begin{bmatrix} \frac{\partial^2 g}{\partial e^2} \left(\frac{\partial I_1}{\partial x'}\right)^2 + \frac{\partial g}{\partial e} \frac{\partial^2 I_1}{\partial x'^2} & \frac{\partial^2 g}{\partial e^2} \frac{\partial I_1}{\partial x'} \frac{\partial I_1}{\partial y'} \\ \frac{\partial^2 g}{\partial e^2} \frac{\partial I_1}{\partial x'} \frac{\partial I_1}{\partial y'} & \frac{\partial^2 g}{\partial e^2} \left(\frac{\partial I_1}{\partial y'}\right)^2 + \frac{\partial g}{\partial e} \frac{\partial^2 I_1}{\partial y'^2} \end{bmatrix} \otimes \begin{bmatrix} 1 & x & y \\ x & x^2 & xy \\ y & xy & y^2 \end{bmatrix}    (6)

Note that the gradient components \partial I_1 / \partial x' and \partial I_1 / \partial y' can be precomputed before the iterations start. The method derived here requires the cost function to be convex in affine space, which is not true in general. However, in the vicinity of the actual affine parameter values we can assume this to hold, so a good initialization is needed before the iterations. When the affine parameters of the previous frame are known, we make a first-order assumption that the region keeps the same motion and use those parameters as the initial values for the current frame. When they are unknown (at I-frames, for example), a hierarchical search is performed to obtain the best initial affine values; this is needed only once for every group of frames. The search is done using all three color components to ensure the best results. To reduce complexity, a 4-parameter affine model accounting for x and y translations, scale, and rotation is used: the image is first downsampled, and the results of the search at the lower resolution are projected back to the original-sized image for fine tuning.
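A compact sketch of one update of this estimation for a single region is given below, assuming grayscale float frames. Two simplifications relative to the paper are worth flagging: the Hessian of Eq. (6) is replaced by its standard Gauss-Newton (IRLS) approximation, which drops the second-derivative image terms, and the step size is fixed rather than line-searched as c[k] in Eq. (4). All helper names are ours.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def affine_newton_step(I1, I2, region_mask, a, sigma=100.0, step=1.0):
    """One damped Gauss-Newton update of the 6-vector a = [ax1, ax2, ax3,
    ay1, ay2, ay3] minimizing sum g(I1(x', y') - I2(x, y)) over the region,
    following Eqs. (1)-(5)."""
    I1 = gaussian_filter(I1.astype(float), 1.0)   # pre-smoothing (Sec. 3.3)
    I2 = gaussian_filter(I2.astype(float), 1.0)
    y, x = np.nonzero(region_mask)
    b = np.stack([np.ones_like(x), x, y]).astype(float)  # b = [1 x y]^T, (3, N)
    xp = x + b.T @ a[:3]                          # x' = x + b^T ax
    yp = y + b.T @ a[3:]                          # y' = y + b^T ay
    gy, gx = np.gradient(I1)                      # precomputed gradients
    Ix = map_coordinates(gx, [yp, xp], order=1)   # dI1/dx' at warped points
    Iy = map_coordinates(gy, [yp, xp], order=1)   # dI1/dy' at warped points
    e = map_coordinates(I1, [yp, xp], order=1) - I2[y, x]  # residuals, Eq. (1)
    w = 2.0 * sigma / (sigma + e**2) ** 2         # g'(e)/e for g of Eq. (2)
    J = np.concatenate([Ix * b, Iy * b])          # (6, N), rows matching Eq. (5)
    grad = J @ (w * e)                            # Eq. (5)
    H = (J * w) @ J.T                             # Gauss-Newton approx. of Eq. (6)
    return a - step * np.linalg.solve(H + 1e-6 * np.eye(6), grad)
```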

f = R g ----- I T 1 ------- b T I ------- 1 b T e x' y' (5) 2 f = R 2 g ------- I 1 ------- 2 g ----- 2 I 1 + --------- e 2 x' e x' 2 2 g ------- I 1 e 2 ------- I 1 ------- x' y' 2 g ------- I 1 e 2 ------- I 1 ------- x' y' 2 g ------- I 1 ------- 2 g ----- 2 I 1 + --------- e 2 y' e y' 2 1 x y xx 2 yx yxyy 2 (6) Note that the gradient components I 1 x' and I 1 y' can be precomputed before the iterations start. The method derived here requires the cost function to be convex in affine space, which is not true in general. However, in the vicinity of the actual affine parameter values, we can assume it to be true. Thus a good initialization is needed before the iterations. In the case that affine parameters of the previous frame are known, we can make a first-order assumption that the region is going to keep the same motion and use the affine parameters of the previous frame as the initial values for the current frame. In the case that affine parameters of the previous frame are unknown (I-frames, for example), a hierarchical search is performed to obtain the best initial affine values. This is only needed once for every group of frames. The search is done using all three color components to ensure best results. To reduce the complexity, a 4-parameter affine model which accounts for x and y translations, scale, and rotation is used. Image is downsampled first and results of the search at a lower-resolution are projected back to the original-sized image for fine tuning. 3.4 Results Figure 4 shows two spatio-temporal segmentation examples, one from an MPEG-2 standard test sequence flowergarden, the other from a football game sequence. It can be seen that the results are quite good where regions of sky, clouds, tree, flowers in (a) and helmet, face, jersey in (b) are all segmented out. (Original color images can be found on the web at http://copland.ece.ucsb.edu/demo/video/.) 4. Video Representation 4.1 Low-level Video Content Description Low-level video content description is an important module in NeTra-V. The representation scheme used for this purpose is show in Figure 5. It is organized from bottom to top as follows: 1. Region features extracted from the I-frame are used to represent the entire group of frames since features in the P-frames of the same group should be similar. Also features in the I-frame are more reliable than the ones in the P-frames because there are no propagation errors due to motion segmentation. We refer to these regions in the I- frame as I-regions. 2. Temporal correspondence between regions in consecutive I-frames is established by pairing up I-regions with the most similar features. Motion compensation is used to predict the approximate location of each region in the next I-frame to limit the search area. The correspondences are one-to-one in the forward temporal direction. That is each I-region can only be connected to one I-region in the next I-frame. Some I-regions are left out without any correspondences, indicating disappearances of the objects. Starting from the first frame, corresponding I-regions are tracked through the entire video shot. This is illustrated in Figure 6. A subobject is defined as a group of corresponding I-regions through tracking. We call them subobjects because object definitions are often quite subjective and a segmented region is usually a part of an object. In our experiments, the duration of a subobject is required to be at least 3 I-frames long. 3. Each video shot is composed of a set of subobjects. 

A video shot can now be characterized by its subobject information and by the spatial and temporal relations between these subobjects.
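A minimal sketch of the pairing step between two consecutive I-frames is shown below, with color distance ranking the candidates and the motion-compensated location acting as a gate, as described in step 2 above (and again in the feature-integration discussion of Section 4). The dictionary fields and the threshold values are illustrative assumptions.

```python
import numpy as np

def match_i_regions(prev_regions, next_regions, max_color_dist=0.5, gate=40.0):
    """Greedy one-to-one forward matching; returns {prev_idx: next_idx}.
    Regions left unmatched indicate disappearing objects."""
    pairs, taken = {}, set()
    for i, r in enumerate(prev_regions):
        predicted = r["centroid"] + r["motion"]   # motion-compensated location
        best, best_d = None, max_color_dist
        for j, s in enumerate(next_regions):
            if j in taken:                        # enforce one-to-one pairing
                continue
            if np.linalg.norm(s["centroid"] - predicted) > gate:
                continue                          # outside the search area
            d = np.linalg.norm(r["color_hist"] - s["color_hist"])
            if d < best_d:                        # color ranks the candidates
                best, best_d = j, d
        if best is not None:
            pairs[i] = best
            taken.add(best)
    return pairs
```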

[Figure 4(a). Example of spatio-temporal segmentation on one group of frames (I, P1, P2, P3) in the flower garden sequence. Arrows indicate the actual flow of video frames in time.]

[Figure 4(b). Example of spatio-temporal segmentation on one group of frames (I, P1, P2, P3) from a football game video database. Arrows indicate the actual flow of video frames in time.]

[Figure 5. Structure of low-level content description: a video shot (global features) is composed of subobjects (subobject features), which in turn group the fundamental elements, the I-regions (I-region features). Subobjects are the fundamental elements of this representation level.]

[Figure 6. Tracking of I-regions to form a subobject. Regions labeled A in consecutive I-frames I_k through I_k+5 are matched and tracked, and a subobject is formed by grouping these regions. The subobject is identified despite the occlusion effects.]

Figure 7 shows two examples of identified subobjects; a set of 6 consecutive I-frames is shown in each example. Figure 7(a) is a half zoom-out view of a football field, in which a small subobject, the upper body of a football player, is identified. Figure 7(b) shows the tracking of a person's face. Notice that in the first frame the face is partially occluded, while in the last frame there is a segmentation failure which merges the face with the helmet. In both cases, the tracking algorithm is robust enough to pick up the face.

Table 1 shows the detailed information extracted to characterize the I-region, the subobject, and the video shot.

Table 1: Detailed Content Descriptions

| Feature  | I-region                     | Subobject                                              | Video shot                                              |
| index    | region label                 | subobject label; start and end frames; region indices  | subobject indices; temporal relations among subobjects  |
| color    | region color histogram       | average of I-region features                           | global color histogram                                  |
| texture  | region Gabor texture feature | average of I-region features                           | global Gabor texture feature                            |
| motion   | affine motion parameters     | average and variance of I-region features              | global motion histogram                                 |
| shape    | Fourier-based descriptor using curvature, centroid distance, and complex coordinate functions | average of I-region features | -- |
| size     | number of pixels             | average of I-region features                           | --                                                      |
| location | centroid and bounding box    | average of I-region features                           | spatial relations among subobjects                      |

Integrating information from the different region features is an important issue. Since color is the most dominant feature, it is used to rank the distance measure, while the other features are used only as constraints to eliminate false matches.

Subobjects are the fundamental elements of the low-level content description. Similarity search and retrieval are mainly performed using the subobjects; I-region information can also be used if necessary. For example, in order to answer a query such as "find the subobjects that move from left to right", the motion information of each I-region of the subobject is needed. The subobjects could serve as building blocks for user-subjective object identification in the high-level content description.
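As one concrete example of a Table 1 entry, the complex-coordinate variant of the Fourier-based shape descriptor could be computed as sketched below; the normalization choices and coefficient count are our assumptions. The curvature and centroid-distance variants follow the same pattern with different one-dimensional boundary signatures.

```python
import numpy as np

def fourier_shape_descriptor(boundary_xy: np.ndarray, n_coeffs: int = 16) -> np.ndarray:
    """Complex-coordinate Fourier descriptor of an ordered region boundary
    given as an (N, 2) array of (x, y) points."""
    z = boundary_xy[:, 0] + 1j * boundary_xy[:, 1]  # complex coordinate function
    z = z - z.mean()                                 # translation invariance
    mags = np.abs(np.fft.fft(z))                     # magnitudes discard rotation
                                                     # and start-point phase
    mags = mags / (mags[1] + 1e-12)                  # scale invariance
    return np.concatenate([mags[2:2 + n_coeffs], mags[-n_coeffs:]])
```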

[Figure 7. Two examples, (a) and (b), of identified subobjects, each showing a set of 6 consecutive I-regions.]

5. Discussions

5.1 Video Coding vs. Analysis

It is natural to consider developing a scheme that can be simultaneously optimized for both video coding and analysis. However, this is difficult, because the goal of coding is to compress the data as much as possible, while the goal of analysis is to extract information from the data. These are two separate objectives, and current techniques can do a good job on either of them, but not on both simultaneously. Commonly used image features are not suitable for data reconstruction, and encoded bit streams do not contain much useful visual information either. Further, because the two goals are different, the approaches are different as well. For example, in order to achieve a good spatio-temporal segmentation, information in the next frame is needed; this is not possible for any predictive coding scheme. Coding schemes that seek to minimize mean squared error, such as the block-based motion prediction method used in MPEG-2 and H.263, do not really care whether the segmentation makes sense. Note that the segmentation method presented here is general enough for both purposes; with some small modifications it can be easily adapted for predictive coding [15].

The model proposed in this paper separates coding and analysis into two modules within one general framework, sharing the common results of the spatio-temporal segmentation preprocessing. Given the alpha planes, how to code the data more efficiently becomes an independent issue, as long as the decoder can provide object access and manipulation functionalities. Motion predictions can be recalculated if necessary to achieve results optimized for compression.

5.2 Conclusions and Future Research

In this paper, we have described an implementation of the NeTra-V system, whose main objective is to provide content-based functionalities for video data. Key aspects of the system include a new spatio-temporal segmentation and object-tracking scheme, and a hierarchical object-based video representation model. One of the main focuses of this work has been the low-level content description of the video representation model. Future research will address the high-level semantic abstraction of the video data based on this low-level content description.

Acknowledgments

This work is supported by a grant from NSF under award #IRI-94-1130. We would like to thank Dr. Wei-Ying Ma for providing the software for the spatial segmentation, Gabor texture, and shape feature extraction.

References

[1] E. Ardizzone and M. L. Cascia, "Multifeature image and video content-based storage and retrieval," Proc. of SPIE, vol. 2916, pp. 265-276, 1996.
[2] M. Beatty and B. S. Manjunath, "Dimensionality reduction using multi-dimensional scaling for content-based retrieval," Proc. of IEEE Intl. Conf. on Image Processing, vol. 2, pp. 835-838, 1997.
[3] J. Bergen, P. Burt, R. Hingorani, and S. Peleg, "A three-frame algorithm for estimating two-component image motion," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 14, no. 9, pp. 886-896, 1992.
[4] M. Bober and J. Kittler, "Robust motion analysis," Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, pp. 947-952, 1994.
[5] Y. Deng and B. S. Manjunath, "Content-based search of video using color, texture and motion," Proc. of IEEE Intl. Conf. on Image Processing, vol. 2, pp. 534-537, 1997.
[6] B. Duc, P. Schroeter, and J. Bigun, "Spatio-temporal robust motion estimation and segmentation," Proc. of 6th Intl. Conf. on Computer Analysis of Images and Patterns, pp. 238-245, 1995.
[7] F. Dufaux, F. Moscheni, and A. Lippman, "Spatio-temporal segmentation based on motion and static segmentation," Proc. of IEEE Intl. Conf. on Image Processing, vol. 1, pp. 306-309, 1995.
[8] A. Hampapur et al., "Virage video engine," Proc. of SPIE, vol. 3022, pp. 188-200, 1997.
[9] G. Iyengar and A. B. Lippman, "Videobook: an experiment in characterization of video," Proc. of IEEE Intl. Conf. on Image Processing, vol. 3, pp. 855-858, 1996.
[10] S. Ju, M. Black, and A. Jepson, "Skin and bones: multi-layer, locally affine, optical flow and regularization with transparency," Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, pp. 307-314, 1996.
[11] V. Kobla, D. Doermann, and K. Lin, "Archiving, indexing, and retrieval of video in the compressed domain," Proc. of SPIE, vol. 2916, pp. 78-89, 1996.
[12] D. Luenberger, Linear and Nonlinear Programming, 2nd ed., Addison-Wesley, 1984.
[13] W. Y. Ma and B. S. Manjunath, "Edge flow: a framework of boundary detection and image segmentation," Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, pp. 744-749, 1997.
[14] W. Y. Ma and B. S. Manjunath, "NeTra: a toolbox for navigating large image databases," Proc. of IEEE Intl. Conf. on Image Processing, vol. 1, pp. 568-571, 1997; also in ACM Multimedia Systems Journal.
[15] D. Mukherjee, Y. Deng, and S. K. Mitra, "A region-based video coder using edge flow segmentation and hierarchical affine region matching," Proc. of SPIE, vol. 3309, 1998.
[16] H. Sanson, "Toward a robust parametric identification of motion on regions of arbitrary shape by non-linear optimization," Proc. of IEEE Intl. Conf. on Image Processing, vol. 1, pp. 203-206, 1995.
[17] Special issue on MPEG-4, IEEE Trans. on Circuits and Systems for Video Technology, vol. 7, no. 1, 1997.
[18] J. Wang and E. Adelson, "Spatio-temporal segmentation of video data," Proc. of SPIE, vol. 2182, pp. 120-131, 1994.
[19] L. Wu, J. Benois-Pineau, and D. Barba, "Spatio-temporal segmentation of image sequences for object-oriented low bit-rate image coding," Proc. of IEEE Intl. Conf. on Image Processing, vol. 2, pp. 406-409, 1995.
[20] D. Zhong and S. F. Chang, "Video object model and segmentation for content-based video indexing," Proc. of IEEE Intl. Symposium on Circuits and Systems, 1997.
[21] H. J. Zhang, J. Wu, D. Zhong, and S. W. Smolliar, "An integrated system for content-based video retrieval and browsing," Pattern Recognition, vol. 30, no. 4, pp. 643-658, 1997.