Proc. of SPIE, Storage and Retrieval for Image and Video Databases VI, vol. 3312, pp. 202-213, 1998

NeTra-V: Towards an Object-based Video Representation

Yining Deng, Debargha Mukherjee and B. S. Manjunath
Department of Electrical and Computer Engineering
University of California, Santa Barbara, CA 93106
{deng, debu}@iplab.ece.ucsb.edu, manj@ece.ucsb.edu

Abstract

There is a growing need for new representations of video that allow not only compact storage of data but also content-based functionalities such as search and manipulation of objects. We present here a prototype system, called NeTra-V, that is currently being developed to address some of these content-related issues. The system has a two-stage video processing structure: a global feature extraction and clustering stage, and a local feature extraction and object-based representation stage. Key aspects of the system include a new spatio-temporal segmentation and object-tracking scheme, and a hierarchical object-based video representation model. The spatio-temporal segmentation scheme combines color/texture image segmentation and affine motion estimation techniques. Experimental results show that the proposed approach can handle large motion. The output of the segmentation, the alpha plane as it is referred to in MPEG-4 terminology, can be used to compute local image properties. This local information forms the low-level content description module in our video representation. Experimental results illustrating spatio-temporal segmentation and tracking are provided.

Keywords: content-based retrieval, spatio-temporal segmentation, object-based video representation.

1. Introduction

With the rapid developments in multimedia and Internet applications, there is a growing need for new representations of video that allow not only compact storage of data but also content-based functionalities such as search and manipulation of objects, semantic description of the scene, detection of unusual events, and possible recognition of the objects. Current compression standards, such as MPEG-2 and H.263, are designed to achieve good data compression, but do not provide any content-related functionalities. There has been much work done on the emerging MPEG-4 standard [17], which is targeted towards access and manipulation of objects as well as more efficient data compression. However, the functionalities that can be provided at present are limited to cut-and-paste of a few objects in simple scenes. No visual information is extracted from the object itself that can be used for similarity search and high-level understanding. On the other hand, current research in content-based video retrieval [1, 5, 8, 9, 11, 21] has provided simple content descriptions by temporally partitioning the video clip into smaller shots, each of which contains a continuous scene, and extracting visual features such as color, texture, and motion from these shots. These low-level visual features are quite effective in searching for similar video scenes given a query video shot. However, with the exception of a few efforts [20], much of the prior work in this area is restricted to global image features.

This paper describes a video analysis and retrieval system, called NeTra-V*, which is being developed with the objective of providing content-related functionalities. The system is fully automatic and has a two-stage video processing structure: the first stage is global feature extraction and clustering; the second stage is local feature extraction and object-based representation.
Key aspects of the system include a new spatio-temporal segmentation and object-tracking scheme, and a hierarchical object-based video representation model. The spatio-temporal segmentation scheme combines spatial color/texture image segmentation and affine motion estimation techniques.

* NeTra means "eye" in Sanskrit, an ancient Indian language. NeTra is also the name of the image retrieval system described in [14].

Experimental results show that the proposed approach can handle large motion and complex scenes containing several independently moving objects. The output of the segmentation, the alpha plane, can be used both for MPEG-4 coding, which produces compactly stored data, and for local feature extraction, which provides object information. Both global and local features are used to form the low-level content description of the video representation model. The current implementation of NeTra-V allows the user to track regions in a video sequence and search for regions with similar color, texture, shape, motion pattern, location, or size in the database. Identifying more meaningful objects from these low-level region features is a future goal. Some examples from a football game database are shown in this paper. Demonstrations of the NeTra-V system are available on the web at http://copland.ece.ucsb.edu/demo/video/

The rest of the paper is organized as follows. Section 2 gives an overview of the system. Section 3 details the spatio-temporal segmentation. Section 4 illustrates the low-level content description of the video data. Section 5 concludes with discussions.

2. System Overview

Figure 1 shows a schematic diagram of the NeTra-V system. Our research so far covers all the shaded blocks in the figure. Video data, either raw or compressed using current standards, is segmented in the temporal domain into small video shots of consistent visual information. Often each video shot represents a single natural scene delimited by camera breaks or editing cuts. The temporal partitioning algorithm [5] works directly on the MPEG-2 video sequence. It can detect both abrupt scene cuts and gradual transitions by using color and pixel intensities in the I-frames and motion prediction information in the P-frames and B-frames. The partitioned video shots are then processed in two stages:

1. Global features are first extracted [5]. These features help in preliminary scene classification, are quite robust to local perturbations, and are easy to compute. Feature clustering is then performed based on the global feature distances to better organize the data and facilitate indexing and search. Video shots are clustered into different categories according to their content, and object search can be restricted to certain categories so that the search space is greatly reduced. A traditional agglomerative method is chosen for feature clustering instead of the more popular k-means method, because many video shots do not belong to any well-defined category and should not affect the clustering process (a minimal sketch of this thresholded grouping appears after this list). Figure 2 shows some example frames of the categories generated from a football game database. It can be seen that the feature clustering, to a certain degree, captures some semantic-level information in each category, such as zoom-out shots of the football field and zoom-in shots of individual players.

2. The local processing step consists of three blocks. Spatio-temporal segmentation generates a labeled region map, or the alpha plane in MPEG-4 terminology, for each video frame. The alpha plane is essential to MPEG-4 object-based video coding, which generates compactly stored video data while allowing object access and manipulation. With the region map, local features can also be extracted. These features include color, texture, motion, shape, size, and spatial relations among the regions. Note that dimensionality reduction [2] is needed to provide compression of the high-dimensional feature vectors and an efficient indexing structure.
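The agglomerative grouping in step 1 above can be sketched as follows. This is a minimal sketch, assuming each shot is summarized by a single global feature vector; the average-linkage scheme, the Euclidean metric, and the cutoff value are illustrative assumptions rather than details from the paper. Cutting the dendrogram at a fixed distance leaves poorly matching shots in singleton clusters, which is the stated reason for preferring this over k-means.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_shots(shot_features: np.ndarray, cutoff: float) -> np.ndarray:
    """Group video shots by global-feature distance; outlier shots remain in
    singleton clusters instead of being forced into a k-means category."""
    Z = linkage(shot_features, method="average", metric="euclidean")
    return fcluster(Z, t=cutoff, criterion="distance")  # one label per shot

# Example: 100 shots, each described by a 64-bin global color histogram.
labels = cluster_shots(np.random.rand(100, 64), cutoff=0.8)
```
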
A hierarchical object-based video representation model, which provides both compact storage of the video data and content information of the scene, is also illustrated in Figure 1. This model, shown within the dashed box, is composed of four data representation levels. The bottom level stores object-encoded video data and allows object access and manipulation. The next level provides the low-level content description of the video scenes by storing all global and local visual features; these features can be used for content-based search and retrieval, and for semantic abstraction at the next level. High-level content description requires a certain degree of human assistance, and the system should have self-learning ability as well. The top level contains textual annotations of the video data; these could include non-image information such as recording date, recording place, source, category, description of the content, and so on. The following sections give more details on two key aspects of NeTra-V: spatio-temporal segmentation and low-level video content description.
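Before moving on, the four-level model can be rendered schematically as a plain data structure. This is only an illustrative sketch; the field names and types are our assumptions, not an interface defined by the paper.

```python
from dataclasses import dataclass, field

@dataclass
class VideoShotRepresentation:
    """Schematic of the four-level hierarchical model, bottom to top."""
    coded_objects: bytes                                   # level 1: object-encoded video data
    global_features: dict = field(default_factory=dict)   # level 2: low-level
    region_features: list = field(default_factory=list)   #   content description
    semantic_labels: list = field(default_factory=list)   # level 3: high-level (semantic) description
    annotations: dict = field(default_factory=dict)       # level 4: textual annotation
```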

[Figure 1. Schematic diagram of the NeTra-V system. Raw or compressed video data (movies, news, sports, surveillance, ...) is temporally partitioned and fed to a two-step processing structure: global processing (global feature extraction, feature clustering) and local processing (spatio-temporal segmentation, local feature extraction, dimensionality reduction, object-based video coding). These populate the hierarchical object-based video representation model: compactly stored video data; low-level content description (color, texture, motion, shape, spatial relations, ...); high-level content description (semantic abstraction via supervised and unsupervised learning); and textual annotation. A user interface provides search and retrieval over the database index via the network.]

[Figure 2. Example frames of different categories generated from the football game database.]

3. Spatio-temporal Segmentation

Spatio-temporal segmentation continues to be a challenging problem in computer vision research [6, 7, 10, 18, 19]. Many motion segmentation schemes use optical flow methods to estimate motion vectors at the pixel level, and then cluster pixels into regions of coherent motion. There are several drawbacks to this approach. First, optical flow methods do not cope well with large motions. Second, regions of coherent motion may contain multiple objects, for example, the entire background. While such regions are good for coding purposes, they are not useful for local feature extraction and object identification. In general, techniques designed with coding objectives cannot yield good segmentation results; we elaborate on this point in Section 5.

Another approach to spatio-temporal segmentation combines the results of both spatial and motion segmentation. Intuitively, this approach exploits as much information as possible from the data and should yield better results. The general strategy is to spatially segment the first frame and estimate local affine motion parameters for each region to predict subsequent frames. Numerical methods [3, 4, 16] have been proposed to estimate affine motion parameters. The success of this approach depends largely on a good initial spatial segmentation; results using simple region-growing methods are usually not very satisfactory. Recently, a general framework for color and texture image segmentation has been presented [13], which appears to give good segmentation results on a diverse collection of images. This algorithm is used in our spatio-temporal segmentation scheme.

3.1 General Scheme

We borrow the idea of intra- and inter-frame coding from MPEG. Video data is processed in consecutive groups of frames. These groups are non-overlapping and independent of each other; the number of frames in each group is set to 7 in the following experiments. The middle frame of each group is called the I-frame. Spatial segmentation is performed only on the I-frame of each group. The remaining frames in the group are called P-frames. P-frames are segmented by local affine motion predictions from their previous frames; the prediction can be either forward or backward. The insertion of I-frames in the video sequence recovers from failures of affine motion estimation in the case of large object movements in 3D space and ensures the robustness of the algorithm. Some heuristics, based on the motion prediction information, are applied to handle overlapped and uncovered regions. Figure 3 illustrates the general segmentation scheme; a minimal sketch of this control flow follows the figure.

The use of simultaneous forward and backward predictions, in the way B-frames in MPEG-2 are generated, would help in affine motion estimation. However, unlike in the case of block-based prediction in MPEG-2, this creates the problem of region correspondence between frames, since the spatial segmentations of two consecutive I-frames can be significantly different. For this reason, B-frames are not used in our scheme. The current method restricts the maximum number of regions to be the same as the number generated by the spatial segmentation. Regions can disappear because of occlusion or by moving out of the image boundary, but no new regions are labeled during the motion prediction phase; new regions entering the scene are handled by the next I-frame.

[Figure 3. General segmentation scheme. One group of frames (P3, P2, P1, I, P1, P2, P3) is shown. I is the spatially segmented frame; P1, P2, and P3 are the first, second, and third predicted frames, respectively.]
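The sketch below illustrates the group-of-frames control flow just described, assuming two helper functions are available: spatial_segment (the edge-flow segmentation of Section 3.2) and predict_labels (the affine region matching of Section 3.3). Both names are ours, not the paper's.

```python
def segment_group(frames, spatial_segment, predict_labels):
    """One non-overlapping group of frames (7 in the paper's experiments):
    spatial segmentation on the middle I-frame, then affine label
    prediction outward to the P-frames on either side."""
    n = len(frames)
    mid = n // 2
    labels = [None] * n
    labels[mid] = spatial_segment(frames[mid])      # I-frame: spatial only
    for i in range(mid + 1, n):                     # forward predictions
        labels[i] = predict_labels(frames[i - 1], frames[i], labels[i - 1])
    for i in range(mid - 1, -1, -1):                # backward predictions
        labels[i] = predict_labels(frames[i + 1], frames[i], labels[i + 1])
    return labels                                   # one alpha plane per frame
```
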
3.2 Spatial Segmentation

A brief description of the spatial segmentation algorithm [13] is given here. The algorithm integrates color and texture features to compute the segmentation. First, the direction of change in color and texture is identified and integrated at each pixel location. Then a vector is constructed at each pixel, pointing in the direction where a region boundary is most likely to occur. The vector field propagates to neighboring points with similar directions and stops where two neighboring points have opposite flow directions, which indicates the presence of a boundary between the two pixels. After boundary detection, disjoint boundaries are connected to form closed contours. This is followed by region merging based on color and texture similarities as well as the boundary length between the two regions. The algorithm is designed for general images and requires very little parameter tuning from the user; the only parameter to be specified is the scale factor for localizing the boundaries.
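The opposing-flow boundary criterion can be illustrated with a deliberately simplified sketch. Real edge flow [13] integrates color and texture edge energies and iteratively propagates the vector field; here, as a toy stand-in, the flow at each pixel simply points toward higher edge energy at the chosen scale, and a boundary is declared wherever two neighbors point toward each other.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def toy_edge_flow_boundaries(gray: np.ndarray, scale: float = 2.0) -> np.ndarray:
    """Mark pixels where neighboring flow vectors oppose each other."""
    smooth = gaussian_filter(gray.astype(float), scale)  # scale factor of Sec. 3.2
    gy, gx = np.gradient(smooth)
    energy = np.hypot(gx, gy)                  # edge energy at this scale
    fy, fx = np.gradient(energy)               # toy flow: points toward boundaries
    boundary = np.zeros(gray.shape, dtype=bool)
    # Opposite horizontal flows on adjacent pixels indicate a boundary between them.
    boundary[:, :-1] |= (fx[:, :-1] > 0) & (fx[:, 1:] < 0)
    # Likewise for vertically adjacent pixels.
    boundary[:-1, :] |= (fy[:-1, :] > 0) & (fy[1:, :] < 0)
    return boundary
```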

3.3 Motion Segmentation

The results of the spatial segmentation are used for affine motion estimation. A 6-parameter 2D affine transformation is assumed for each region in the frame and is estimated by finding the best match in the next frame; segmentation results for the next frame are thereby obtained. Gaussian smoothing is performed on each image before affine estimation, and the estimation uses only the luminance component of the video data. Mathematically, the following functional f is minimized for each region R:

    f(\mathbf{a}) = \sum_{(x,y) \in R} g\big( I_1(x', y') - I_2(x, y) \big)    (1)

where \mathbf{a} = [\mathbf{a}_x^T \; \mathbf{a}_y^T]^T is the six-parameter affine motion vector, separable into its x and y components \mathbf{a}_x = [a_{x1} \; a_{x2} \; a_{x3}]^T and \mathbf{a}_y = [a_{y1} \; a_{y2} \; a_{y3}]^T, and g is a robust error norm used to reject outliers, defined as in [10]:

    g(e) = \frac{e^2}{\sigma + e^2}    (2)

where \sigma is a scale parameter. I_1 and I_2 are the current frame and the next frame, respectively; x and y are pixel locations; x' = x + dx and y' = y + dy, where the displacements are dx = \mathbf{b}^T \mathbf{a}_x and dy = \mathbf{b}^T \mathbf{a}_y with \mathbf{b} = [1 \; x \; y]^T. Ignoring higher-order terms, a Taylor expansion of (1) gives

    f(\mathbf{a}) = f(\mathbf{a}_0) + \nabla f(\mathbf{a}_0)^T (\mathbf{a} - \mathbf{a}_0) + \tfrac{1}{2} (\mathbf{a} - \mathbf{a}_0)^T \nabla^2 f(\mathbf{a}_0) (\mathbf{a} - \mathbf{a}_0)    (3)

Using a modified Newton's method [12] that ensures both descent and convergence, \mathbf{a} can be solved iteratively by applying the following update at the k-th iteration:

    \mathbf{a}[k+1] = \mathbf{a}[k] - c[k] \, \{\nabla^2 f(\mathbf{a}[k])\}^{-1} \nabla f(\mathbf{a}[k])    (4)

where c[k] is a search parameter selected to minimize f. For (1), \nabla f and \nabla^2 f are calculated as

    \nabla f = \sum_R \frac{\partial g}{\partial e} \begin{bmatrix} \frac{\partial I_1}{\partial x'} \mathbf{b}^T & \frac{\partial I_1}{\partial y'} \mathbf{b}^T \end{bmatrix}^T    (5)

    \nabla^2 f = \sum_R \begin{bmatrix} \frac{\partial^2 g}{\partial e^2} \left(\frac{\partial I_1}{\partial x'}\right)^2 + \frac{\partial g}{\partial e} \frac{\partial^2 I_1}{\partial x'^2} & \frac{\partial^2 g}{\partial e^2} \frac{\partial I_1}{\partial x'} \frac{\partial I_1}{\partial y'} \\ \frac{\partial^2 g}{\partial e^2} \frac{\partial I_1}{\partial x'} \frac{\partial I_1}{\partial y'} & \frac{\partial^2 g}{\partial e^2} \left(\frac{\partial I_1}{\partial y'}\right)^2 + \frac{\partial g}{\partial e} \frac{\partial^2 I_1}{\partial y'^2} \end{bmatrix} \otimes \begin{bmatrix} 1 & x & y \\ x & x^2 & xy \\ y & xy & y^2 \end{bmatrix}    (6)

Note that the gradient components \partial I_1 / \partial x' and \partial I_1 / \partial y' can be precomputed before the iterations start. The method derived here requires the cost function to be convex in affine space, which is not true in general. However, in the vicinity of the actual affine parameter values we can assume this to hold, so a good initialization is needed before the iterations. When the affine parameters of the previous frame are known, we make a first-order assumption that the region keeps the same motion and use those parameters as the initial values for the current frame. When they are unknown (at I-frames, for example), a hierarchical search is performed to obtain the best initial affine values; this is needed only once for every group of frames. The search is done using all three color components to ensure the best results. To reduce complexity, a 4-parameter affine model accounting for x and y translations, scale, and rotation is used: the image is first downsampled, and the results of the search at the lower resolution are projected back to the original-sized image for fine tuning.
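A compact sketch of one update of this estimation for a single region is given below, assuming grayscale float frames. Two simplifications relative to the paper are worth flagging: the Hessian of Eq. (6) is replaced by its standard Gauss-Newton (IRLS) approximation, which drops the second-derivative image terms, and the step size is fixed rather than line-searched as c[k] in Eq. (4). All helper names are ours.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def affine_newton_step(I1, I2, region_mask, a, sigma=100.0, step=1.0):
    """One damped Gauss-Newton update of the 6-vector a = [ax1, ax2, ax3,
    ay1, ay2, ay3] minimizing sum g(I1(x', y') - I2(x, y)) over the region,
    following Eqs. (1)-(5)."""
    I1 = gaussian_filter(I1.astype(float), 1.0)   # pre-smoothing (Sec. 3.3)
    I2 = gaussian_filter(I2.astype(float), 1.0)
    y, x = np.nonzero(region_mask)
    b = np.stack([np.ones_like(x), x, y]).astype(float)  # b = [1 x y]^T, (3, N)
    xp = x + b.T @ a[:3]                          # x' = x + b^T ax
    yp = y + b.T @ a[3:]                          # y' = y + b^T ay
    gy, gx = np.gradient(I1)                      # precomputed gradients
    Ix = map_coordinates(gx, [yp, xp], order=1)   # dI1/dx' at warped points
    Iy = map_coordinates(gy, [yp, xp], order=1)   # dI1/dy' at warped points
    e = map_coordinates(I1, [yp, xp], order=1) - I2[y, x]  # residuals, Eq. (1)
    w = 2.0 * sigma / (sigma + e**2) ** 2         # g'(e)/e for g of Eq. (2)
    J = np.concatenate([Ix * b, Iy * b])          # (6, N), rows matching Eq. (5)
    grad = J @ (w * e)                            # Eq. (5)
    H = (J * w) @ J.T                             # Gauss-Newton approx. of Eq. (6)
    return a - step * np.linalg.solve(H + 1e-6 * np.eye(6), grad)
```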

f = R g ----- I T 1 ------- b T I ------- 1 b T e x' y' (5) 2 f = R 2 g ------- I 1 ------- 2 g ----- 2 I 1 + --------- e 2 x' e x' 2 2 g ------- I 1 e 2 ------- I 1 ------- x' y' 2 g ------- I 1 e 2 ------- I 1 ------- x' y' 2 g ------- I 1 ------- 2 g ----- 2 I 1 + --------- e 2 y' e y' 2 1 x y xx 2 yx yxyy 2 (6) Note that the gradient components I 1 x' and I 1 y' can be precomputed before the iterations start. The method derived here requires the cost function to be convex in affine space, which is not true in general. However, in the vicinity of the actual affine parameter values, we can assume it to be true. Thus a good initialization is needed before the iterations. In the case that affine parameters of the previous frame are known, we can make a first-order assumption that the region is going to keep the same motion and use the affine parameters of the previous frame as the initial values for the current frame. In the case that affine parameters of the previous frame are unknown (I-frames, for example), a hierarchical search is performed to obtain the best initial affine values. This is only needed once for every group of frames. The search is done using all three color components to ensure best results. To reduce the complexity, a 4-parameter affine model which accounts for x and y translations, scale, and rotation is used. Image is downsampled first and results of the search at a lower-resolution are projected back to the original-sized image for fine tuning. 3.4 Results Figure 4 shows two spatio-temporal segmentation examples, one from an MPEG-2 standard test sequence flowergarden, the other from a football game sequence. It can be seen that the results are quite good where regions of sky, clouds, tree, flowers in (a) and helmet, face, jersey in (b) are all segmented out. (Original color images can be found on the web at http://copland.ece.ucsb.edu/demo/video/.) 4. Video Representation 4.1 Low-level Video Content Description Low-level video content description is an important module in NeTra-V. The representation scheme used for this purpose is show in Figure 5. It is organized from bottom to top as follows: 1. Region features extracted from the I-frame are used to represent the entire group of frames since features in the P-frames of the same group should be similar. Also features in the I-frame are more reliable than the ones in the P-frames because there are no propagation errors due to motion segmentation. We refer to these regions in the I- frame as I-regions. 2. Temporal correspondence between regions in consecutive I-frames is established by pairing up I-regions with the most similar features. Motion compensation is used to predict the approximate location of each region in the next I-frame to limit the search area. The correspondences are one-to-one in the forward temporal direction. That is each I-region can only be connected to one I-region in the next I-frame. Some I-regions are left out without any correspondences, indicating disappearances of the objects. Starting from the first frame, corresponding I-regions are tracked through the entire video shot. This is illustrated in Figure 6. A subobject is defined as a group of corresponding I-regions through tracking. We call them subobjects because object definitions are often quite subjective and a segmented region is usually a part of an object. In our experiments, the duration of a subobject is required to be at least 3 I-frames long. 3. Each video shot is composed of a set of subobjects. 

A video shot can now be characterized by its subobject information and by the spatial and temporal relations between these subobjects.
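A minimal sketch of the pairing step between two consecutive I-frames is shown below, with color distance ranking the candidates and the motion-compensated location acting as a gate, as described in step 2 above (and again in the feature-integration discussion of Section 4). The dictionary fields and the threshold values are illustrative assumptions.

```python
import numpy as np

def match_i_regions(prev_regions, next_regions, max_color_dist=0.5, gate=40.0):
    """Greedy one-to-one forward matching; returns {prev_idx: next_idx}.
    Regions left unmatched indicate disappearing objects."""
    pairs, taken = {}, set()
    for i, r in enumerate(prev_regions):
        predicted = r["centroid"] + r["motion"]   # motion-compensated location
        best, best_d = None, max_color_dist
        for j, s in enumerate(next_regions):
            if j in taken:                        # enforce one-to-one pairing
                continue
            if np.linalg.norm(s["centroid"] - predicted) > gate:
                continue                          # outside the search area
            d = np.linalg.norm(r["color_hist"] - s["color_hist"])
            if d < best_d:                        # color ranks the candidates
                best, best_d = j, d
        if best is not None:
            pairs[i] = best
            taken.add(best)
    return pairs
```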

[Figure 4(a). Example of spatio-temporal segmentation on one group of frames (I, P1, P2, P3) in the flower garden sequence. Arrows indicate the actual flow of video frames in time.]

[Figure 4(b). Example of spatio-temporal segmentation on one group of frames (I, P1, P2, P3) from a football game video database. Arrows indicate the actual flow of video frames in time.]

[Figure 5. Structure of low-level content description: a video shot (global features) is composed of subobjects (subobject features), which in turn group the fundamental elements, the I-regions (I-region features). Subobjects are the fundamental elements of this representation level.]

[Figure 6. Tracking of I-regions to form a subobject. Regions labeled A in consecutive I-frames I_k through I_k+5 are matched and tracked, and a subobject is formed by grouping these regions. The subobject is identified despite the occlusion effects.]

Figure 7 shows two examples of identified subobjects; a set of 6 consecutive I-frames is shown in each example. Figure 7(a) is a half zoom-out view of a football field, in which a small subobject, the upper body of a football player, is identified. Figure 7(b) shows the tracking of a person's face. Notice that in the first frame the face is partially occluded, while in the last frame there is a segmentation failure which merges the face with the helmet. In both cases, the tracking algorithm is robust enough to pick up the face.

Table 1 shows the detailed information extracted to characterize the I-region, the subobject, and the video shot.

Table 1: Detailed Content Descriptions

| Feature  | I-region                     | Subobject                                              | Video shot                                              |
| index    | region label                 | subobject label; start and end frames; region indices  | subobject indices; temporal relations among subobjects  |
| color    | region color histogram       | average of I-region features                           | global color histogram                                  |
| texture  | region Gabor texture feature | average of I-region features                           | global Gabor texture feature                            |
| motion   | affine motion parameters     | average and variance of I-region features              | global motion histogram                                 |
| shape    | Fourier-based descriptor using curvature, centroid distance, and complex coordinate functions | average of I-region features | -- |
| size     | number of pixels             | average of I-region features                           | --                                                      |
| location | centroid and bounding box    | average of I-region features                           | spatial relations among subobjects                      |

Integrating information from the different region features is an important issue. Since color is the most dominant feature, it is used to rank the distance measure, while the other features are used only as constraints to eliminate false matches.

Subobjects are the fundamental elements of the low-level content description. Similarity search and retrieval are mainly performed using the subobjects; I-region information can also be used if necessary. For example, in order to answer a query such as "find the subobjects that move from left to right", the motion information of each I-region of the subobject is needed. The subobjects could serve as building blocks for user-subjective object identification in the high-level content description.
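As one concrete example of a Table 1 entry, the complex-coordinate variant of the Fourier-based shape descriptor could be computed as sketched below; the normalization choices and coefficient count are our assumptions. The curvature and centroid-distance variants follow the same pattern with different one-dimensional boundary signatures.

```python
import numpy as np

def fourier_shape_descriptor(boundary_xy: np.ndarray, n_coeffs: int = 16) -> np.ndarray:
    """Complex-coordinate Fourier descriptor of an ordered region boundary
    given as an (N, 2) array of (x, y) points."""
    z = boundary_xy[:, 0] + 1j * boundary_xy[:, 1]  # complex coordinate function
    z = z - z.mean()                                 # translation invariance
    mags = np.abs(np.fft.fft(z))                     # magnitudes discard rotation
                                                     # and start-point phase
    mags = mags / (mags[1] + 1e-12)                  # scale invariance
    return np.concatenate([mags[2:2 + n_coeffs], mags[-n_coeffs:]])
```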

[Figure 7. Two examples, (a) and (b), of identified subobjects, each showing a set of 6 consecutive I-regions.]

5. Discussions

5.1 Video Coding vs. Analysis

It is natural to consider developing a scheme that can be simultaneously optimized for both video coding and analysis. However, this is difficult, because the goal of coding is to compress the data as much as possible, while the goal of analysis is to extract information from the data. These are two separate objectives, and current techniques can do a good job on either of them, but not on both simultaneously. Commonly used image features are not suitable for data reconstruction, and encoded bit streams do not contain much useful visual information either. Further, because the two goals are different, the approaches are different as well. For example, in order to achieve a good spatio-temporal segmentation, information in the next frame is needed; this is not possible for any predictive coding scheme. Coding schemes that seek to minimize mean squared error, such as the block-based motion prediction method used in MPEG-2 and H.263, do not really care whether the segmentation makes sense. Note that the segmentation method presented here is general enough for both purposes; with some small modifications it can be easily adapted for predictive coding [15].

The model proposed in this paper separates coding and analysis into two modules within one general framework, sharing the common results of the spatio-temporal segmentation preprocessing. Given the alpha planes, how to code the data more efficiently becomes an independent issue, as long as the decoder can provide object access and manipulation functionalities. Motion predictions can be recalculated if necessary to achieve results optimized for compression.

5.2 Conclusions and Future Research

In this paper, we have described an implementation of the NeTra-V system, whose main objective is to provide content-based functionalities for video data. Key aspects of the system include a new spatio-temporal segmentation and object-tracking scheme, and a hierarchical object-based video representation model. One of the main focuses of this work has been the low-level content description of the video representation model. Future research will address the high-level semantic abstraction of the video data based on this low-level content description.

Acknowledgments

This work is supported by a grant from NSF under award #IRI-94-1130. We would like to thank Dr. Wei-Ying Ma for providing the software for the spatial segmentation, Gabor texture, and shape feature extraction.

References

[1] E. Ardizzone and M. L. Cascia, "Multifeature image and video content-based storage and retrieval," Proc. of SPIE, vol. 2916, pp. 265-276, 1996.
[2] M. Beatty and B. S. Manjunath, "Dimensionality reduction using multi-dimensional scaling for content-based retrieval," Proc. of IEEE Intl. Conf. on Image Processing, vol. 2, pp. 835-838, 1997.
[3] J. Bergen, P. Burt, R. Hingorani, and S. Peleg, "A three-frame algorithm for estimating two-component image motion," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 14, no. 9, pp. 886-896, 1992.
[4] M. Bober and J. Kittler, "Robust motion analysis," Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, pp. 947-952, 1994.
[5] Y. Deng and B. S. Manjunath, "Content-based search of video using color, texture and motion," Proc. of IEEE Intl. Conf. on Image Processing, vol. 2, pp. 534-537, 1997.
[6] B. Duc, P. Schroeter, and J. Bigun, "Spatio-temporal robust motion estimation and segmentation," Proc. of 6th Intl. Conf. on Computer Analysis of Images and Patterns, pp. 238-245, 1995.
[7] F. Dufaux, F. Moscheni, and A. Lippman, "Spatio-temporal segmentation based on motion and static segmentation," Proc. of IEEE Intl. Conf. on Image Processing, vol. 1, pp. 306-309, 1995.
[8] A. Hampapur et al., "Virage video engine," Proc. of SPIE, vol. 3022, pp. 188-200, 1997.
[9] G. Iyengar and A. B. Lippman, "Videobook: an experiment in characterization of video," Proc. of IEEE Intl. Conf. on Image Processing, vol. 3, pp. 855-858, 1996.
[10] S. Ju, M. Black, and A. Jepson, "Skin and bones: multi-layer, locally affine, optical flow and regularization with transparency," Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, pp. 307-314, 1996.
[11] V. Kobla, D. Doermann, and K. Lin, "Archiving, indexing, and retrieval of video in the compressed domain," Proc. of SPIE, vol. 2916, pp. 78-89, 1996.
[12] D. Luenberger, Linear and Nonlinear Programming, 2nd ed., Addison-Wesley, 1984.
[13] W. Y. Ma and B. S. Manjunath, "Edge flow: a framework of boundary detection and image segmentation," Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, pp. 744-749, 1997.
[14] W. Y. Ma and B. S. Manjunath, "NeTra: a toolbox for navigating large image databases," Proc. of IEEE Intl. Conf. on Image Processing, vol. 1, pp. 568-571, 1997; also in ACM Multimedia Systems Journal.
[15] D. Mukherjee, Y. Deng, and S. K. Mitra, "A region-based video coder using edge flow segmentation and hierarchical affine region matching," Proc. of SPIE, vol. 3309, 1998.
[16] H. Sanson, "Toward a robust parametric identification of motion on regions of arbitrary shape by non-linear optimization," Proc. of IEEE Intl. Conf. on Image Processing, vol. 1, pp. 203-206, 1995.
[17] Special issue on MPEG-4, IEEE Trans. on Circuits and Systems for Video Technology, vol. 7, no. 1, 1997.
[18] J. Wang and E. Adelson, "Spatio-temporal segmentation of video data," Proc. of SPIE, vol. 2182, pp. 120-131, 1994.
[19] L. Wu, J. Benois-Pineau, and D. Barba, "Spatio-temporal segmentation of image sequences for object-oriented low bit-rate image coding," Proc. of IEEE Intl. Conf. on Image Processing, vol. 2, pp. 406-409, 1995.
[20] D. Zhong and S. F. Chang, "Video object model and segmentation for content-based video indexing," Proc. of IEEE Intl. Symposium on Circuits and Systems, 1997.
[21] H. J. Zhang, J. Wu, D. Zhong, and S. W. Smolliar, "An integrated system for content-based video retrieval and browsing," Pattern Recognition, vol. 30, no. 4, pp. 643-658, 1997.