Video Summarization and Browsing Using Growing Cell Structures


Irena Koprinska and James Clark
School of Information Technologies, The University of Sydney
Madsen Building F09, Sydney NSW 2006, Australia
{irena,

Abstract

We present a new approach to video summarization and browsing of MPEG-2 compressed video based on the Growing Cell Structures (GCS) neural algorithms. It first applies GCS to select keyframes for each shot and then clusters them using TreeGCS to form a hierarchical view of the video for efficient browsing. Keyframe selection is based on histogram features of the dc-images of I frames. It captures the video content well and outperforms two other approaches. The main advantage of the TreeGCS module is its ability to dynamically form a flexible hierarchy that depends on the video content.

I. INTRODUCTION

Recent developments in computing performance, multimedia compression and communication technologies have made the creation of digital video archives possible. Applications such as video-on-demand, digital TV and digital libraries generate and use large collections of video data. It is also expected that the storage of digital video at home will soon overtake the current analogue systems. However, unlike document databases, which use keywords to access data quickly, video databases still lack techniques for efficient organization, searching and retrieval. Text-based video organization based on manual annotation is highly inefficient, subjective and time consuming.

Recently, some content-based prototype systems [1,3,11,12] have been developed to organize video automatically in order to provide fast and meaningful nonlinear access to the relevant material. The generally accepted approach first breaks up the video stream into temporally homogeneous segments called shots [9]. Each shot is then represented by one or more keyframes. The shots are typically indexed using spatial features extracted from the keyframes (e.g.
color, texture, shape) and temporal features extracted from the shot (e.g. motion, camera operations), and are organised into a sequential or hierarchical structure of keyframes. This representation allows the user to browse the content of the video and search quickly for sub-sequences of interest without having to watch the entire video. The user can also query the video database. Retrieval is based on the similarity between the feature vector of the query and the feature vectors representing the shots.

Clustering has been successfully applied to both keyframe selection and video organization. Ferman et al. [4] cluster the frames within each shot using an iterative 3-means algorithm and select as a keyframe the frame closest to the centroid of the largest cluster. As the content of the shot may change significantly due to camera operations and object motion, in their subsequent work [3] the clustering algorithm is modified to extract more than one keyframe. Two keyframes are extracted for clusters with high intercluster variance (the frames closest to and farthest from the centroid). Frames with large deviations from the average luminance of the shot are selected as keyframes as well. Girgensohn and Boreczky [7] applied agglomerative hierarchical clustering to extract a predefined number of keyframes and used them to represent videotaped meetings and presentations. Drew and Au [2] proposed a new feature based on color histograms and then applied an agglomerative hierarchical clustering that merges clusters based on cluster variance and temporal distance. Yeung et al. [12] select one keyframe for each shot and then cluster the keyframes based on visual content and temporal distance to create a scene-transition graph. In ViBE [1] each shot is represented by a tree induced by hierarchical clustering of the frames. The video is then organised into a three-level similarity pyramid where each level contains groups of similar shots organized into a two-dimensional grid.
The pyramid is created by clustering the shots based on temporal, motion, pseudo-semantic and shot-tree distance.

In this paper we present a new approach to keyframe selection and video browsing based on the Growing Cell Structures (GCS) neural algorithm. We use GCS to select keyframes representing the content of each shot. GCS finds the number of keyframes in an unsupervised fashion and also maps similar frames to neighboring nodes, offering a better indication of the shot's structure. TreeGCS is then used to cluster the shots and provide a high-level hierarchical video representation. The main advantage over existing systems is that it creates a flexible hierarchical representation of the

video content, i.e. the number of layers in the hierarchy and the number of clusters in each layer depend on the video content and do not have to be specified in advance. In addition, similar clusters (of frames or shots) are mapped onto neighboring nodes, which makes browsing more convenient. Our system operates directly on MPEG-2 compressed video, which allows faster operation and smaller storage requirements.

II. GROWING CELL STRUCTURE ALGORITHMS

A. GCS

GCS [5] is an incremental self-organizing neural algorithm, an extension of Kohonen's self-organising maps (SOM) [10]. It generates a mapping from high-dimensional input data to a lower-dimensional (typically two-dimensional) space. The main advantage of such a mapping is that it gives insight into the structure of the data thanks to two important properties: topology preservation (similar inputs are mapped onto neighboring neurons) and density preservation (regions of high input density are mapped onto neural structures with more neurons). An important advantage over SOM and most classical clustering algorithms (e.g. k-means) is that GCS automatically finds a suitable network size and structure, i.e. it does not require the number of clusters to be specified in advance. This is achieved through a process of controlled growth and removal of nodes. Unlike SOM, in GCS the number of neighboring neurons connected to a given neuron is not fixed. Finally, GCS is able to form discrete clusters, while in SOM the clusters remain connected and their boundaries are not always easy to find.

The GCS algorithm we implemented starts with a randomly initialised triangle of neurons. At each iteration the best matching unit and its topological neighbours are adapted toward the input vector. There is no "cooling schedule" as in SOM, where the neighbourhood size and learning rate decrease with time.
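The adaptation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `adapt_step`, the data layout, the use of squared Euclidean distance as the accumulated error, and the neighbour learning rate `e_i=0.002` are all assumptions made for illustration (only the winner rate 0.06 appears in the paper's parameter list).

```python
def adapt_step(weights, neighbours, errors, x, e_bmu=0.06, e_i=0.002):
    """Adapt the best matching unit (BMU) and its direct neighbours toward x.

    weights    -- list of weight vectors, one per neuron
    neighbours -- dict: neuron index -> set of directly connected neurons
    errors     -- list of accumulated error values, one per neuron
    Returns the index of the BMU.
    """
    def sq_dist(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))

    # The BMU is the neuron whose weight vector is closest to the input.
    bmu = min(range(len(weights)), key=lambda n: sq_dist(weights[n]))

    # Accumulate the quantisation error at the BMU (assumed error measure);
    # a high accumulated error marks a region where a neuron could be inserted.
    errors[bmu] += sq_dist(weights[bmu])

    # Move the BMU by the fraction e_bmu of its distance to x ...
    weights[bmu] = [wi + e_bmu * (xi - wi) for wi, xi in zip(weights[bmu], x)]
    # ... and each direct neighbour by the smaller fraction e_i.
    for n in neighbours[bmu]:
        weights[n] = [wi + e_i * (xi - wi) for wi, xi in zip(weights[n], x)]
    return bmu


# Initial triangle of three mutually connected neurons (fixed values here
# for clarity; the paper initialises the triangle randomly).
weights = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
neighbours = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
errors = [0.0, 0.0, 0.0]
winner = adapt_step(weights, neighbours, errors, [1.0, 0.2])
```

Because both learning rates stay constant, the same function can be called for every input throughout training, which is what the absence of a cooling schedule amounts to in practice.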
New neurons are inserted at positions with high error, where the current structure under-represents the input data distribution. Superfluous neurons are deleted from regions of low probability density. It is important that the deletion step maintains the consistency of the triangular structure. To ensure this we have implemented a simpler heuristic than Fritzke's tetrahedron-based one. The algorithm iterates until the stopping criterion is satisfied (the maximum number of epochs or the maximum network size is reached). Fritzke has also demonstrated superior performance of GCS over SOM in terms of topology preservation and distribution-modelling error [6]. The algorithm requires 7 user-specified parameters: the maximum number of neurons or training epochs, the insertion period, the deletion period, the learning rates for the winner (e_bmu) and its neighbourhood (e_i), and two error decay factors.

Fig. 1. GCS simulation results on four square-shaped data

B. TreeGCS

TreeGCS [8] is a hierarchical clustering algorithm based on GCS. It maps high-dimensional input vectors onto a multi-depth two-dimensional hierarchy that preserves the topological ordering of the input space. The tree is generated dynamically and adapts to the underlying GCS structure. Initially the root of the tree points to one cluster that contains the initial GCS network. A split in the cluster results in a new node being added to the tree (Fig. 2). When clusters are deleted, the associated tree nodes are deleted and any resulting redundancies are removed. Our implementation strictly follows the original algorithm apart from the introduction of a hierarchy generation threshold (in [8] the tree is generated at the end of each GCS epoch). This threshold is the only user-specified parameter.

Fig. 2. Creating new nodes in TreeGCS when a cluster subdivides

III. VIDEOGCS

A. Data Pre-processing and Feature Extraction

Since MPEG was established as an international standard for the compression of digital video, video is increasingly stored and moved in compressed format. This motivates the


development of methods that process compressed video directly, owing to the computational and storage savings (no need to decode and re-encode the video) and faster operation (lower data rate of compressed video). Our system operates directly on MPEG-2 encoded video. MPEG-2 uses macroblock-based motion compensation to reduce temporal redundancy and the block-based Discrete Cosine Transform (DCT) to reduce spatial redundancy. The only information available in the compressed stream is the DCT coefficients of intra-coded blocks or residual errors, and the motion vectors. Our system uses the DC terms (i.e. the zero-frequency terms of the DCT coefficients) of intra-coded (I) frames. As each DC term is a scaled version of the block's average value, spatially reduced versions of the original images, called dc-images [11], can be constructed. The (i,j) pixel of the dc-image is the average value of the (i,j) block of the image (Fig. 3). For each dc-image we compute a 16-bin grayscale histogram. Histograms have been successfully used as an image representation because they are relatively insensitive to object movement, image rotation and variations in viewing angle and scale.

Fig. 3. A full image (352x288 pixels) and its dc-image (44x36 pixels)

B. Keyframe Selection and Video Representation

After shot boundary detection, we use GCS to cluster the frames in each shot based on their 16-dimensional feature vectors. Depending on the content of the shot, GCS forms a different number of discrete clusters. For each cluster, the frame closest to the centroid is selected as a keyframe. Because GCS preserves topology, similar frames are mapped to neighboring neurons.

The selected keyframes are further clustered using TreeGCS to create a hierarchical view of the video sequence that allows the user to browse at different levels of content. The depth of the hierarchy and the number of nodes in each level depend on the video content. Each node corresponds to a cluster of similar shots and can be represented by a single keyframe chosen as described above. The bottom-level nodes are associated with clusters of similar shots that are mapped onto a two-dimensional GCS grid, allowing efficient visualization and browsing.

IV. EXPERIMENTS

A. Video Sequences

We used 9 video sequences available from [14] and previously used for keyframe selection evaluation (Table I). As the videos were originally MPEG-1 compressed, we decompressed them. After that we concatenated them using cuts (in the order shown in Table I) to form one long sequence. This sequence was then MPEG-2 compressed and the DC terms were extracted. A total of 30 shot boundaries were detected, 11 of them gradual and the others abrupt. Thus the total number of shots was 31. The size of the original video frames was 352x240 pixels, hence the size of their dc-images was 44x30 pixels.

TABLE I
VIDEO STATISTICS

sequence            # frames    # shots
Canada day          768         4: s0-s3
Capilano            778         2: s4-s5
Dragon boat         705         6: s6-s11
Jazz                577         3: s12-s14
Professor           130         1: s15
Steam clock         648         1: s16
Walk with dragon    795         3: s17-s19
Aqua                             : s20-s26
Beach               740         4: s27-s30

B. GCS and TreeGCS Parameters

The following GCS parameters were used: number of iterations = 20000, insertion period = 200, deletion period = 2000, error decay factors 1 and 0.0004, and learning rates e_bmu = 0.06 and e_i = [...]. The hierarchy period of TreeGCS was set to 500. Our preliminary experiments showed that the ratio between the insertion and deletion periods is important. Before a deletion is performed, the GCS network has to grow sufficiently. This ensures that clusters are not formed prematurely.

C. Keyframe Selection Results

The keyframe selection results of GCS are summarized in Table II. The column Correct indicates the correct, human-produced results.
It should be noted that our correct keyframes differ slightly from those reported in [2] for the following sequences: Capilano (1 fewer keyframe in the first shot), Dragon boat (2 fewer in the last shot) and Aquarium (3 more: 1 more in shot 1, 1 for shot 2, which was a missed shot in [2], and 1 more in shot 6). A comparison of GCS with two other approaches, HistInt [3] and Signatures [2], is presented in Table IV, based on the results reported in [2]. Both approaches use color histograms and work on uncompressed video. Some examples are shown in Figs. 4-6 (for GCS the grayscale dc-images are shown).
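For contrast with the color-histogram features of the two baselines, the grayscale dc-image features that GCS works on (Section III) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the dc-image is approximated here by averaging 8x8 pixel blocks of a decoded frame (the system reads the DC terms from the MPEG-2 stream instead), the histogram is normalised, and the names `dc_image`, `histogram16` and `closest_to_centroid` are hypothetical.

```python
def dc_image(pixels, block=8):
    """Approximate a dc-image: the mean of each block x block tile.

    pixels is a list of rows of grayscale values in the range 0..255.
    """
    h, w = len(pixels), len(pixels[0])
    return [
        [
            sum(pixels[r][c]
                for r in range(by, by + block)
                for c in range(bx, bx + block)) / (block * block)
            for bx in range(0, w - block + 1, block)
        ]
        for by in range(0, h - block + 1, block)
    ]


def histogram16(dc, max_value=256):
    """16-bin grayscale histogram of a dc-image, normalised to sum to 1
    (the normalisation is an assumption of this sketch)."""
    counts = [0] * 16
    total = 0
    for row in dc:
        for v in row:
            counts[min(int(v * 16 / max_value), 15)] += 1
            total += 1
    return [c / total for c in counts]


def closest_to_centroid(features):
    """Index of the feature vector nearest to the mean of all vectors,
    i.e. the frame that would represent a cluster as its keyframe."""
    dim = len(features[0])
    centroid = [sum(f[d] for f in features) / len(features) for d in range(dim)]
    return min(
        range(len(features)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(features[i], centroid)),
    )
```

With 8x8 blocks, a 352x240 frame yields a 44x30 dc-image, matching the sizes reported for the test sequences.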

TABLE II
NUMBER OF KEYFRAMES GENERATED BY GCS

sequence       correct   generated   redundant   missed
Canada day     4         8           4           0
Capilano       4         3           0           1
Dragon boat    6         9           3           0
Jazz           6         4           0           2
Professor      1         1           0           0
Steam clock    1         1           0           0
Walk dragon
Aqua
Beach
Total

Overall, GCS performs well and typically selects 1 keyframe for low-activity shots and several keyframes for high-activity shots. It misses just 3 keyframes but generates 13 redundant ones. In half of the cases these redundancies occur in sequences involving panning and zooming. For example, GCS typically generates two keyframes instead of one for shot 3 of Canada day (there is a small zoom and object tracking) and shot 1 (pan), and also for shot 3 of Beach (zoom and object movement). As the image (and its corresponding histogram) changes, GCS generates a new keyframe. But as the semantics do not change, the human does not select a new keyframe. In the other cases the redundant keyframes are selected for small, low-activity shots. For example, two similar keyframes are generated for shot 4 of Aqua (Fig. 4), and three for shot 1 of Walk with the dragon. This happens because GCS always splits the cluster after a pre-specified number of iterations regardless of its quality. This drawback can be eliminated by modifying the GCS deletion step.

Fig. 4. Keyframe selection for the sequence Aqua: a) Correct (9 keyframes), b) GCS (10 keyframes), c) HistInt (36 keyframes), d) Signatures (4 keyframes)

Fig. 5. Keyframe selection for the sequence Steam: a) Correct, b) GCS, c) Signatures (1 keyframe), d) HistInt (6 keyframes)

Fig. 6. Keyframe selection for the sequence Walk with the dragon: a) Correct (4 keyframes), b) GCS (6 keyframes), c) HistInt (18 keyframes), d) Signatures (5 keyframes)

TABLE III
DEFINITION OF RECALL, PRECISION AND F1 MEASURE

keyframes        # assigned as correct    # not assigned as correct
# correct        tp                       fn (missed)
# not correct    fp (redundant)           tn

$$P = \frac{tp}{tp + fp}, \qquad R = \frac{tp}{tp + fn}, \qquad F_1 = \frac{2PR}{P + R}$$
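As a worked illustration of the Table III definitions, the sketch below computes P, R and F1 from the per-sequence counts of Table II. The function name `precision_recall_f1` is hypothetical, and reading tp as "generated minus redundant" is an interpretation of Table III (fp = redundant, fn = missed), not something the paper states explicitly.

```python
def precision_recall_f1(generated, redundant, missed):
    """Precision, recall and F1 from keyframe counts.

    Assumed mapping onto Table III: fp = redundant keyframes,
    fn = missed keyframes, tp = generated keyframes that are not redundant.
    """
    tp = generated - redundant
    fp = redundant
    fn = missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Canada day row of Table II: 8 generated, 4 redundant, 0 missed.
p, r, f1 = precision_recall_f1(generated=8, redundant=4, missed=0)
# p == 0.5, r == 1.0, f1 == 2/3
```

For the Canada day sequence this gives perfect recall but a precision of only 50%, which reflects the pattern discussed above: GCS rarely misses a keyframe but tends to produce redundant ones.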

Nevertheless, GCS compares well with the other two approaches. HistInt tends to generate too many keyframes, while Signatures generates a compact representation but with many misses and redundancies. We have also calculated Recall (R), Precision (P) and the F1 measure, which are standard performance measures in information retrieval (Table III). As can be seen from Table IV, overall GCS is the best approach.

TABLE IV
KEYFRAME SELECTION COMPARISON

              correct   generated   redundant   missed   R [%]   P [%]   F1 [%]
GCS
HistInt
Signatures

D. Hierarchical Video Representation Results

The hierarchical representation generated by clustering the keyframes with TreeGCS is shown in Fig. 7. It organizes the keyframes (and the shots they represent) into a 3-level structure. Each node corresponds to a cluster of similar shots and can be represented by a single keyframe. As can be seen, the keyframes are grouped into two main clusters based on their gray-level histograms: lighter and darker. These two clusters are further split into 3 and 2 sub-clusters of similar shots, respectively. Similar sub-clusters appear close to each other in the tree. The biggest sub-cluster (sub-cluster 5) is less homogeneous than the others; if TreeGCS had been trained longer, it would have split it into further sub-clusters. The number of neurons in the five GCS grids was 8, 8, 11, 18 and 43, respectively. Within each of these bottom-level clusters, similar keyframes were mapped to neighboring neurons in the GCS grid. The keyframe closest to the cluster centroid was selected as the keyframe representing the cluster of similar shots (the framed pictures: s17, s6, s19, s16 and s27). Similarly, keyframes can also be chosen for the two nodes at level 1. Thus, the resulting structure allows the user to browse the video at different levels of detail.

Fig. 7. Hierarchical video representation

The quality of the video summarization depends crucially on the quality of the features extracted to represent each shot. As the example shows, while the keyframe histogram is a useful feature, it may not be enough to capture the semantics of the video well and allow efficient retrieval. High-level semantic features would provide a more useful description but their automated extraction is an open research problem. One of the advantages of clustering-based keyframe selection and video organization is that new features can be easily incorporated. We plan to investigate the use of motion features characterizing each shot (based on the motion vectors that are directly available in the MPEG-2 stream), temporal features that prevent keyframes that are too distant from being grouped together, and also more semantically rich components such as text captions and teletext. The open framework also allows the use of different distance metrics, e.g. the histograms can be compared with the widely used chi-squared test.

The main advantage of the hierarchical representation used in VideoGCS is the ability to dynamically form a hierarchy in which the number of layers and the number of clusters in each layer depend on the video content. In existing systems for video summarization the structure is fixed. For example, in [1] a three-level hierarchy is used with a fixed number of clusters in each level (e.g. 4, 16, 54). In [13] the number of levels and the number of clusters in them were also pre-determined. The agglomerative hierarchical clustering approaches used in [7,12] generate dendrograms that cannot be visualized for large data sets and require the selection of a pre-defined number of nodes. TreeGCS also provides good visualization thanks to the underlying GCS algorithm, which maps high-dimensional inputs to a two-dimensional grid that is topology and density preserving. In contrast to SOM, it is able to find the cluster boundaries automatically.

V. CONCLUSION

In this paper we have presented a new approach to video summarization and browsing based on the GCS neural algorithms. The system, VideoGCS, processes MPEG-2 compressed video directly. It applies GCS to select keyframes for each shot and then clusters them using TreeGCS to form a hierarchical view of the video content for efficient browsing. The results show that the keyframe selection module captures the salient video content well and outperforms two other approaches. The generated hierarchy based on the grayscale histogram of keyframes is useful but it does not capture the

video semantics. However, an advantage of the TreeGCS module over existing systems is its ability to dynamically form a flexible hierarchy that depends on the video content.

Future work will include modifying the GCS algorithm to reduce the number of redundant keyframes for small, low-activity shots, and integrating complementary low-level and semantic features to improve summarization. Another interesting direction for future research is to apply VideoGCS to creating video summaries on-line, as both GCS and TreeGCS can be used in an on-line mode.

ACKNOWLEDGMENT

This work was supported by the SESQUI grant Video Segmentation and Summarization from the University of Sydney. We are very grateful to Damien McMonigal for the extraction of the dc-images.

REFERENCES

[1] J.-Y. Chen, C. Taskiran, A. Albiol, E.J. Delp and C. Bouman, ViBE: A Compressed Video Database Structured for Active Browsing and Search, IEEE Trans. Multimedia.
[2] M.S. Drew and J. Au, Video Keyframe Production by Efficient Clustering of Compressed Chromaticity Signatures, ACM Multimedia.
[3] A.M. Ferman and A.M. Tekalp, Efficient Filtering and Clustering Methods for Temporal Video Representation and Visual Summarization, J. Visual Commun. & Image Rep., vol. 9.
[4] A.M. Ferman and A.M. Tekalp, Multiscale Context Extraction and Representation for Video Indexing, SPIE 3229, pp. 23-31.
[5] B. Fritzke, Growing Cell Structures - A Self-Organizing Network for Unsupervised and Supervised Learning, Neural Networks, vol. 7(9).
[6] B. Fritzke, Kohonen Feature Maps and Growing Cell Structures - A Performance Comparison, Adv. Neural Info. Processing.
[7] A. Girgensohn and J. Boreczky, Time-constrained Keyframe Selection Technique, Multim. Tools & Appl., vol. 11.
[8] V.J. Hodge and J. Austin, Hierarchical Growing Cell Structures: TreeGCS, IEEE Trans. Knowl. & Data Eng., vol. 13(2).
[9] I. Koprinska and S. Carrato, Temporal Video Segmentation: A Survey, Signal Processing: Image Commun., vol. 16.
[10] T. Kohonen, Self-Organizing Maps, 2nd ed., Springer-Verlag.
[11] B.-L. Yeo and B. Liu, Rapid Scene Analysis on Compressed Video, IEEE Trans. Circuits Syst. Video Technol., vol. 5(6).
[12] M. Yeung and B.-L. Yeo, Segmentation of Video by Clustering and Graph Analysis, Comp. Vis. & Image Und., vol. 71(1).
[13] D. Zhong, H. Zhang and S.-F. Chang, Clustering Methods for Video Browsing and Annotation, SPIE 2670.
[14]
