VIDEO SUMMARIZATION USING FUZZY DESCRIPTORS AND A TEMPORAL SEGMENTATION. INPG, 45 Avenue Felix Viallet Grenoble Cedex, FRANCE
VIDEO SUMMARIZATION USING FUZZY DESCRIPTORS AND A TEMPORAL SEGMENTATION

Mickael Guironnet, Denis Pellerin and Patricia Ladret
mickael.guironnet@lis.inpg.fr
Laboratoire des Images et des Signaux
INPG, 45 Avenue Felix Viallet, 38031 Grenoble Cedex, FRANCE

ABSTRACT

In this paper, three new compact fuzzy descriptors (the motion, color and orientation descriptors have respectively 3, 11 and 5 components) are introduced for video summarization. A similarity measure is defined that allows frames to be compared according to several descriptors. The summarization method is based on two stages: video segmentation and segment clustering. First, each video is partitioned into homogeneous segments according to one or several descriptors. This segmentation is compared to the partition into shots, and our approach retrieves the transitions with good precision (> 90%). The segmentation combining the three descriptors provides better results than the segmentation obtained with a single descriptor. Then, segment clustering with a temporal constraint reduces the size of the summary to fewer key frames while preserving temporal coherence. Finally, query by example, tested on a corpus of 3 hours of video data, shows that segments are correctly retrieved and that the index combination (motion, color and orientation) improves the results.

1. INTRODUCTION

The quantity of audiovisual information has increased in a spectacular way with the arrival of high-speed Internet and digital television. Video retrieval among this large quantity of information has become a difficult task. Video indexing aims at facilitating fast, automatic access to information in large video databases. Recent works [1, 2, 3] generally employ low-level criteria to describe a sequence of images, but very few works tackle the problem of their combination [4]. This article presents a method of video summary construction.
To represent the video content, we extract three new fuzzy and compact descriptors (motion, color and orientation). Video summarization first consists in partitioning the video into homogeneous segments; then temporally close segments are gathered together to reduce their number. Moreover, the video summary allows us to carry out a query by example from one or several descriptors. This paper is organized as follows: Sections 2, 3 and 4 respectively describe the motion descriptor, the color descriptor and the orientation descriptor. In Section 5, the video summarization method is presented. Section 6 shows a query by example application. Finally, Section 7 concludes the paper.

2. MOTION ACTIVITY DESCRIPTOR

To describe the dynamic content of a video, it is essential to study motion and more particularly its degree of activity. In general, action movies have many segments with high activity, whereas the news tends to be characterized by low activity. Hence, we define a very compact descriptor which captures intuitive notions of motion intensity. This section describes an optical flow estimation method which leads to a new activity descriptor. The determination of the optical flow is an important stage and conditions the performance of the descriptor. The estimation method that we chose [5] provides a compact and multi-scale motion representation. Let V_j(p_i, t) be the optical flow estimated at pixel p_i = (x_i, y_i) between the images I(p_i, t) and I(p_i, t+1) at resolution level j. The motion is approximated by a linear combination of scale functions:

    V_j(p_i, t) = \sum_{k_1,k_2=1}^{2^j} \theta_{j,k_1,k_2} \Phi_{j,k_1,k_2}(p_i)    (1)

Φ represents the scale function, in our case a B-spline function of degree 1 with 3 levels of resolution (Fig. 1). The indices j, k_1 and k_2 respectively represent the scale level and the horizontal and vertical shifts.
By supposing the conservation of brightness, the algorithm estimates the θ coefficients in an iterative and robust way by minimizing an objective function:

    E = \sum_{p_i} \rho\big( I(p_i + V_j(p_i, t), t+1) - I(p_i, t), \sigma \big)    (2)

The function ρ(·, σ) weights the data according to the error (M-estimator of Geman-McClure). The minimization of the objective function (Eq. 2) is described in [5]. Moreover, knowing the decomposition of the velocity field at a resolution level j, we can obtain the scale coefficients at a lower resolution level. Fine knowledge of motion is not necessary for video indexing, and the estimation is carried out on sub-sampled images (72x88 pixels) to accelerate the processing. The method offers a signature of the motion in the form of 162 coefficients (81 coefficients for each component). As the scale coefficients characterize motion on each area of the image, they offer a local and compact representation of the motion.

Figure 1: B-spline function of degree 1 with three resolution levels: top, j=2; middle, j=1; bottom, j=0.

Motion activity can be directly characterized by the magnitude of the scale coefficients. Indeed, the degree of activity depends on the quantity of high and low magnitudes. In our approach, the magnitude is computed from the scale coefficients according to the equation:

    M = \sqrt{ \theta_x^2 + \theta_y^2 }    (3)

From the grids θ_x and θ_y of each pair of images, a grid of magnitudes M is obtained. Knowing the decomposition of the optical flow at resolution level j=2, we determine the scale coefficients at resolution level j=1. In our implementation, the grid of magnitudes is a 5 by 5 grid (Fig. 2). Each magnitude is then fuzzified according to 3 fuzzy sets: high, medium and low. The membership functions are represented in Fig. 3. We then obtain 3 grids from one grid of magnitudes (Fig. 2); each coefficient is transformed into a degree of membership according to the 3 fuzzy sets. These 3 grids describe the local activity (75 components).

Figure 2: Principle of the activity descriptor construction.

From this description, we characterize the global activity with only 3 components by computing the cardinality of each set (Fig. 3). The set having the greatest cardinality informs about the level of global activity. It is also possible to visualize the local activity using the method of the center of gravity (Fig. 4).
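The fuzzification and cardinality steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the triangular membership functions and their modal values (0, 2 and 10) are assumptions, since the paper only fixes the three sets and the 5x5 grid.

```python
import numpy as np

def fuzzify(m, low_peak=0.0, med_peak=2.0, high_peak=10.0):
    # Triangular membership degrees of each magnitude in the low/medium/high
    # fuzzy sets; the peak (modal) values are illustrative assumptions.
    low = np.clip((med_peak - m) / (med_peak - low_peak), 0.0, 1.0)
    high = np.clip((m - med_peak) / (high_peak - med_peak), 0.0, 1.0)
    med = np.minimum((m - low_peak) / (med_peak - low_peak),
                     (high_peak - m) / (high_peak - med_peak))
    med = np.clip(med, 0.0, 1.0)
    return np.stack([low, med, high])          # shape (3, 5, 5)

def activity_descriptors(theta_x, theta_y):
    m = np.sqrt(theta_x**2 + theta_y**2)       # magnitude grid (Eq. 3)
    local = fuzzify(m)                         # 3 grids -> 75 local components
    glob = local.reshape(3, -1).sum(axis=1)    # cardinality of each fuzzy set
    glob /= glob.sum()                         # normalize so components sum to 1
    return local.ravel(), glob

# A static frame pair (zero flow) yields a purely "low activity" signature.
tx = np.zeros((5, 5)); ty = np.zeros((5, 5))
local, glob = activity_descriptors(tx, ty)
print(len(local), glob.argmax())               # 75 components; dominant set 0 (low)
```

The global descriptor is normalized here so that, like the paper's fuzzy descriptors, its components sum to 1 and it can be fed directly to the similarity measure of Section 5.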
It consists in computing the mean of the modal values of the sets, weighted by their membership degrees. A modal value of a fuzzy set is by definition a value x such that µ(x) = 1. In our implementation, the modal value of the medium set is 2 (Fig. 3).

Figure 3: Membership functions of activity.

Figure 4 gives examples of local and global activity descriptors. White areas correspond to high activity and black areas to low activity. We can observe in the example that the local descriptor captures object displacement. Finally, the global activity descriptor has only 3 components.

Figure 4: Example of activity descriptor extraction. Top: video frame. Middle: local activity descriptor (white: high activity; black: low activity). Bottom: global activity descriptor.

3. COLOR DESCRIPTOR

Image indexing using color histograms is effective for retrieving images or videos in a database. Nevertheless, the various color spaces (RGB, HSV, YCbCr, ...) do not present the same properties. The perception of colors in RGB space is not uniform and depends on the lighting conditions. HSV space is often preferred to RGB space because it directly translates intuitive notions of color. The transformation from RGB to HSV is nonlinear but invertible. Hue H represents the shade of color, saturation S describes the purity of the color, and value V corresponds to the luminosity of the color. The HSV cylinder is often quantized uniformly along each component to reduce the number of colors in the image: each component is divided into regular bins. However, the disadvantage of this method is that it gives the same weight to pixels near the centre of a bin as to those located at its edges. The use of fuzzy sets solves this kind of problem by associating to each pixel a membership degree for each bin.

Figure 5: Color visualization in the saturation-value plane, divided into a color area, an intermediate area and a gray-levels area. In the illustrated example, the projected pixel falls in the intermediate area and receives the weights (w1, w2) = (0.44, 0.56).

Sural et al. [6] observed that saturation determines the transition between colors and gray levels. When S is worth 0 and V increases, the color goes from black to white through all the shades of gray. On the other hand, if saturation increases for a given value V and hue H, the perceived color changes from a shade of gray to the pure color indicated by H. Finally, for low values of S, the color can be represented by a gray level (V), whereas for high values of S, the color can be linked to the hue (H).

Figure 6: Membership functions of the colors (red, yellow, green, cyan, blue, magenta) along hue H.

The method consists in providing each pixel with a representative color or gray level. The pixel is projected onto the saturation-value plane. Three areas are defined in this plane: a color area, a gray-level area and an intermediate area.
Thus the projected pixel is associated with one of these three areas. In Fig. 5, the projected pixel belongs to the intermediate zone. By computing the distances from the projected pixel to the two boundary curves, we determine a membership degree for the color area (w2 = 0.56) and for the gray-level area (w1 = 0.44).

Figure 7: Membership functions of the gray levels along value V.

If the representative area of the pixel is color, it is associated with one of 6 colors: red, yellow, green, cyan, blue or magenta, following the membership functions of Figure 6. If the representative area of the pixel is gray level, it is associated with one of 5 levels: black, dark gray, gray, light gray or white, according to the membership functions of Figure 7. Once the pixels are projected, we have a color quantization on 11 bins (6 components for color and 5 components for gray levels). A histogram is then evaluated for each frame. Figure 8 gives examples of the color descriptor.

Figure 8: Example of color descriptors. Left: video frame. Right: color descriptor where each pixel is associated with the set for which it has the highest membership degree.
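The 11-bin quantization can be sketched per pixel as below. This is only an illustration of the idea: the hue and gray-level bin centers, the triangular memberships, and the hard saturation threshold (standing in for the paper's intermediate area with its two weights) are assumptions; the paper only fixes the bin counts (6 colors + 5 gray levels).

```python
import colorsys
import numpy as np

# Assumed bin centers: 6 hues (degrees) and 5 gray levels (V in [0, 1]).
HUES = [0, 60, 120, 180, 240, 300]        # red, yellow, green, cyan, blue, magenta
GRAYS = [0.0, 0.25, 0.5, 0.75, 1.0]       # black .. white

def tri(x, center, width):
    # Triangular membership of x around `center`.
    return max(0.0, 1.0 - abs(x - center) / width)

def pixel_histogram_votes(r, g, b, s_threshold=0.2):
    """Return an 11-bin fuzzy vote vector for one RGB pixel (values in [0, 1])."""
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    votes = np.zeros(11)
    if s >= s_threshold:                  # color area: fuzzy vote over the 6 hues
        deg = (h * 360.0) % 360.0
        for i, c in enumerate(HUES):
            d = min(abs(deg - c), 360.0 - abs(deg - c))   # circular hue distance
            votes[i] = max(0.0, 1.0 - d / 60.0)
        return votes
    for i, c in enumerate(GRAYS):         # gray-level area: vote over the 5 levels
        votes[6 + i] = tri(v, c, 0.25)
    return votes

v = pixel_histogram_votes(1.0, 0.0, 0.0)  # pure red
print(v[0])                               # full membership in the red bin: 1.0
```

Summing these vote vectors over all pixels and normalizing by the pixel count yields the per-frame fuzzy histogram described above.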
4. EDGE DIRECTION HISTOGRAM

Many works study orientations in images [7]. This index appears to be a powerful feature for discriminating between scenes (for example, indoor/outdoor scenes). In [8], Gabor filters are used to classify natural scenes. In [9], orientations based on the Canny detector are used to display an image in its correct orientation. As in [10], we use the image gradient orientation to distinguish characteristics of scenes. A fuzzy histogram is obtained from the gradient orientation. We start by computing the image derivatives (I_x, I_y), and the gradient orientation histograms are computed only for the pixels whose gradient magnitude is above a given threshold. Edge detection is then carried out by non-maxima suppression. In our experiments, we use 5 bins to represent the orientations. Figure 9 shows the membership functions of the bins for 0, 90, 180 and 270 degrees. The last bin counts the pixels that do not contribute to an edge. Finally, we normalize the histograms by the image size.

Figure 9: Membership functions of orientations.

Figure 10 shows an example of 2 images with their associated edge maps and gradient orientation histograms.

Figure 10: Example of the orientation descriptor. Top: video frame. Middle: edge map. Bottom: gradient orientation histogram.

5. METHOD OF VIDEO SUMMARIZATION

In order to browse a video database or to search for an extract in a sequence, we propose a method of video summarization that has several levels of resolution. As our approach does not depend on the type of descriptors, we give a general explanation. Each video must first be structured to facilitate its visualization. Thus we carry out a method of video partitioning into homogeneous segments according to one or more descriptors. Each segment is then represented by a key frame and constitutes the finest resolution level of the video summary. To reduce the number of key frames, we create a hierarchy thanks to a similarity clustering method with a temporal constraint.
It provides the user with a quick overview of the video content.

5.1. Segmentation using heterogeneous features

We consider a set of features, noted F = {f_1, ..., f_l}, where l is the number of descriptors (l = 3) and f_i is a feature vector extracted from each image of the sequence. A similarity measure must also be associated with each feature. As our features are fuzzy, the sum of their components equals 1. If we use the L1 distance, the maximum distance between 2 feature vectors of the same descriptor is less than or equal to 2. As each descriptor is normalized, the similarity is then defined by the equation:

    s(f_i, g_i) = 1 - \sqrt{ \frac{1}{2} \sum_k | f_{i,k} - g_{i,k} | }    (4)

where f_{i,k} is the k-th component of feature vector f_i and s(f_i, g_i) is the similarity between two images f and g according to descriptor i. The square root attributes less weight to far distances. We can carry out a homogeneous partition according to several descriptors. The index combination is achieved by weighting each descriptor according to its number of bins:

    s(f, g) = \sum_i \frac{w_i}{w_t} \, s(f_i, g_i)    (5)

where w_i is the bin number of feature i and w_t is the total bin number over all features. From the similarity measure, we can build a hierarchical video summary. The first level of resolution (fine resolution), inspired by the work described in [11], consists in gathering images according to their similarity. The first image is compared with the following one. If their descriptors are close, the two images are gathered using the mean of each descriptor. If they are not close, a new cluster is created. This process is repeated until the last frame of the video. This approach allows a homogeneous video segmentation to be created according to the descriptors considered. The principle of the segmentation is as follows:
Step 1: p = 1 (frame number), number of clusters n = 1, center of gravity C_n = F_p = {f_p1, ..., f_pl}, and k = 1 (number of frames in the current cluster).
Step 2: p = p + 1; if there are no frames left, then stop.
Step 3: compute the similarity of frame p+1 with the current center of gravity. If the similarity is less than a threshold, go to step 4; else go to step 5.
Step 4: frame p+1 creates a new cluster and its descriptor becomes the center of gravity; n = n + 1, k = 1, and go to step 2.
Step 5: frame p+1 is added to the current cluster, C_n = k/(k+1) · C_n + 1/(k+1) · F_{p+1}, k = k + 1, and go to step 2.

In order to determine the optimal threshold, we compare the segmentation produced by our method on a video whose partition into shots is known (74 shots and 7598 images). A shot is a portion of video filmed continuously, without special effects or cuts. Nevertheless, a shot is not necessarily homogeneous according to a given index: within a shot, for example, the activity can change because of camera motion. Table 1 shows that the number of segments found depends on the selected threshold. We count a cut as found if a transition between homogeneous segments corresponds to a cut, and a dissolve as found if a transition is located within a dissolve.

Table 1: Comparison of our video segmentation with the ground truth, for the color descriptor alone and for the color, orientation and motion combination (columns: descriptor, threshold, cuts found, dissolves found, number of segments).

We prefer video over-segmentation, provided the segments obtained are homogeneous according to the considered criterion. The selected threshold (0.7) is a good compromise between cuts found (> 90%) and over-segmentation. Moreover, comparing the color descriptor alone with the combination of the three descriptors, the percentage of cuts found by combining the 3 features is worth 96% against 9% for the color descriptor, with an equal number of segments (38 to 4). Segments smaller than 5 images are gathered with the following segment.
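The similarity measures (Eqs. 4-5) and the segmentation steps 1-5 can be sketched as follows. This is a minimal single-descriptor sketch under assumed names; `combined_similarity` illustrates the multi-descriptor weighting of Eq. 5.

```python
import numpy as np

def similarity(f, g):
    # Eq. (4): similarity between two normalized fuzzy descriptors.
    return 1.0 - np.sqrt(0.5 * np.abs(f - g).sum())

def combined_similarity(F, G, bins):
    # Eq. (5): combination over the l descriptors, weighted by bin counts.
    w = np.array(bins, dtype=float)
    return sum(wi / w.sum() * similarity(f, g) for wi, f, g in zip(w, F, G))

def segment(frames, threshold=0.7):
    """Greedy temporal segmentation (steps 1-5): each frame either joins the
    current cluster's running-mean center of gravity or opens a new segment."""
    centers, sizes, labels = [frames[0].copy()], [1], [0]
    for f in frames[1:]:
        if similarity(f, centers[-1]) < threshold:   # step 4: new cluster
            centers.append(f.copy()); sizes.append(1)
        else:                                        # step 5: update running mean
            k = sizes[-1]
            centers[-1] = k / (k + 1) * centers[-1] + 1 / (k + 1) * f
            sizes[-1] = k + 1
        labels.append(len(centers) - 1)
    return labels

a = np.array([1.0, 0.0, 0.0]); b = np.array([0.0, 0.0, 1.0])
print(segment([a, a, b, b]))   # two homogeneous segments: [0, 0, 1, 1]
```

Updating the running mean (step 5) rather than comparing to the previous frame alone makes the cluster center drift slowly, which is what makes gradual transitions such as dissolves detectable as segment boundaries.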
This segmentation corresponds to the finest summary of the video.

5.2. Segment clustering with temporal constraint

The number of segments obtained at the first level is still high, and fast visualization by the user is not yet realizable. In [12, 13], fuzzy c-means clustering is used to create a hierarchy without temporal constraint. Our 2nd stage consists in gathering temporally close and similar segments (but not necessarily adjacent ones) according to one or several descriptors. We impose a temporal constraint on the clustering in order to preserve the temporal coherence of the video: it prevents a segment located at the beginning of a video from being clustered with one at the end. Each homogeneous segment possesses a feature vector C_i, and the frame of the segment closest to this vector is defined as the key frame. A temporal distance d_t and a temporal similarity s_t between segments are then defined:

    d_t(i, j) = | i - j |
    s_t(i, j) = (1 - d_t(i,j)/w)^2  if d_t < w,  and  0  if d_t >= w    (6)

where i, j are the segment positions in the video and w is the width of the temporal window. The weighted similarity s_w between two segments is then obtained by multiplying the temporal similarity s_t by the similarity s (Eq. 5). Segments whose weighted similarity is greater than a threshold are gathered. In order to obtain several levels of hierarchy, the window width increases and the threshold decreases from one level to the next. This method creates clusters locally; the higher we go in the hierarchy, the more global the clusters become. Thus, on the last level of the hierarchy, the number of key frames is reduced. That is why we propose a fine-to-coarse video summary. The principle of the hierarchy is as follows:

Step 1: Let N be the number of segments. Given a threshold θ and a window width w.
Step 2: Compute the similarity s(i, j) between segments (Eq. 5), the temporal similarity s_t(i, j) and the weighted similarity s_w(i, j) = s_t(i, j) · s(i, j), for 1 ≤ i, j ≤ N.
Step 3: Find the segments to cluster:
    for i = 1 to N
        for j = i+1 to i+w
            if s_w(i, j) > θ then fusion{i} = fusion{i} ∪ {j}
Step 4: Cluster the segments (compute the mean vector, weighted by the number of frames contained in each segment to cluster; the new segment is located where the segment containing the most frames was), update N, decrease the threshold (θ = θ · 0.5), double the window (w = 2w) and go to step 2.

Figure 11 presents an example of clustering carried out using the motion, color and orientation information. The initial parameters are θ = 0.7 and w = 5 in this example.

Figure 11: Three examples of clustering with temporal constraint and 3 levels of resolution. The numbers under the frames correspond to the segment numbers.

6. APPLICATION TO QUERY BY EXAMPLE

An application such as query by example allows the effectiveness of the suggested summary-construction method to be judged. It consists in comparing a query image with the various segments created by the method. The segment corresponding to the nearest key frame is regarded as the nearest to the query. This method was tested on 4 videos of the series The Avengers, whose total duration is over 3 hours. We drew a group of query images from each video and checked that the segments to which they belong are correctly matched. We use a popular measure defined in [11] which computes the number of segments found within the α first retrievals. We observe in Table 2 that the color query offers good results, with 72% of segments found for α = 1. However, the combination of motion, color and orientation is more effective, with 79% for α = 1 (and 9% for α = 3).

Table 2: Results of query by example (per video and mean)

Video | Query                         | α = 1 | α = 2 | α = 3
1     | Color                         | 65%   | 82%   | 85%
1     | Motion, color and orientation | 76%   | 9%    | 94%
2     | Color                         | 75%   | 83%   | 85%
2     | Motion, color and orientation | 79%   | 85%   | 89%
3     | Color                         | 63%   | 7%    | 8%
3     | Motion, color and orientation | 73%   | 82%   | 85%
4     | Color                         | 85%   | 89%   | 9%
4     | Motion, color and orientation | 88%   | 9%    | 95%
Mean  | Color                         | 72%   | 8%    | 85%
Mean  | Motion, color and orientation | 79%   | 87%   | 9%

7. CONCLUSION

We have presented a method of hierarchical video summary construction according to various indices, with several levels of resolution.
Three new fuzzy descriptors have been introduced; they are compact (11 components for the color, 3 for the motion and 5 for the orientation). Video summarization is based on two stages: video segmentation and segment clustering. First, the method determines a homogeneous segmentation from one or several descriptors. This segmentation constitutes the finest level of the hierarchy. Then, segment clustering with a temporal constraint reduces the summary and provides the user with a quick overview. Moreover, query by example, tested on the video summaries, shows that the combination of the indices (color, motion and orientation) improves the results.
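To make the two-stage method concrete, the temporal-constraint clustering of Section 5.2 (Eq. 6 and steps 1-4) can be sketched as follows. The function names and the example descriptors are assumptions, and the merge bookkeeping is simplified: the sketch keeps a frame-weighted mean per cluster instead of relocating each new segment to where the largest merged segment was.

```python
import numpy as np

def temporal_similarity(i, j, w):
    # Eq. (6): quadratic falloff inside a temporal window of width w.
    d = abs(i - j)
    return (1.0 - d / w) ** 2 if d < w else 0.0

def cluster_level(centers, sizes, theta, w, sim):
    """One hierarchy level (steps 2-4): merge each segment with later segments
    inside the temporal window whenever s_t(i,j) * s(i,j) exceeds theta."""
    merged, out_c, out_s = set(), [], []
    for i in range(len(centers)):
        if i in merged:
            continue
        group = [i]
        for j in range(i + 1, min(i + int(w), len(centers))):
            if j in merged:
                continue
            if temporal_similarity(i, j, w) * sim(centers[i], centers[j]) > theta:
                group.append(j); merged.add(j)
        n = sum(sizes[g] for g in group)       # frame-count-weighted mean vector
        out_c.append(sum(sizes[g] * centers[g] for g in group) / n)
        out_s.append(n)
    return out_c, out_s

def hierarchy(centers, sizes, sim, theta=0.7, w=5, levels=3):
    # Fine-to-coarse summary: decrease theta and double w at each level.
    summaries = []
    for _ in range(levels):
        centers, sizes = cluster_level(centers, sizes, theta, w, sim)
        summaries.append(list(centers))
        theta, w = theta * 0.5, w * 2
    return summaries

sim = lambda f, g: 1.0 - np.sqrt(0.5 * np.abs(f - g).sum())   # Eq. (4)
segs = [np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
levels = hierarchy(segs, [10, 10, 10], sim)
print([len(l) for l in levels])   # [3, 2, 2]: segments merge as theta drops and w grows
```

The widening window and falling threshold are what turn local merges at the fine levels into increasingly global clusters at the coarse levels, as described in Section 5.2.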
8. REFERENCES

[1] B. S. Manjunath, J. Ohm, V. V. Vasudevan, and A. Yamada, "Color and texture descriptors," IEEE Trans. on Circuits and Systems for Video Technology, June 2001.
[2] E. Veneau, R. Ronfard, and P. Bouthemy, "From video shot clustering to sequence segmentation," in Fifteenth International Conference on Pattern Recognition, ICPR, Barcelona, Spain, September 2000.
[3] Y. Gong and X. Liu, "Video summarization using singular value decomposition," Multimedia Systems, 9(2), 2003.
[11] Y. Zhuang, Y. Rui, T. S. Huang, and S. Mehrotra, "Adaptive key frame extraction using unsupervised clustering," in Proceedings, IEEE ICIP, Chicago, USA, 1998.
[12] A. M. Ferman and A. M. Tekalp, "Two-stage hierarchical video summary extraction to match low-level user browsing preferences," vol. 5, no. 2, June 2003.
[13] X. D. Yu, L. Wang, Q. Tian, and P. Xue, "Multi-level video representation with application to keyframe extraction," in International Multimedia Modelling Conference, Brisbane, Australia, January 2004.
[4] G. Sheikholeslami, W. Chang, and A. Zhang, "SemQuery: Semantic clustering and querying on heterogeneous features for visual data," IEEE Trans. on Knowledge and Data Engineering, Sept/Oct 2002.
[5] E. Bruno and D. Pellerin, "Global motion model based on B-spline wavelets: application to motion estimation and video indexing," in Proc. of the 2nd Int. Symposium on Image and Signal Processing and Analysis, ISPA, Pula, Croatia, June 2001.
[6] S. Sural, G. Qian, and S. Pramanik, "A histogram with perceptually smooth color transition for image retrieval," in Fourth International Conference on Computer Vision, Pattern Recognition and Image Processing, Durham, North Carolina, March 2002.
[7] Z. Aghbari and A. Makinouchi, "Semantic approach to image database classification and retrieval," National Institute of Informatics (NII) Journal, no. 7, 2003.
[8] N. Guyader, H. Le Borgne, J. Herault, and A. Guérin-Dugué, "Towards the introduction of human perception in a natural scene classification system," in Proc. of the 2002 IEEE NNSP, Switzerland, September 2002.
[9] Y. M. Wang and H. Zhang, "Detecting image orientation based on low-level visual content," Computer Vision and Image Understanding (CVIU), vol. 93, no. 3, 2004.
[10] J. Kosecka, L. Zhou, P. Barber, and Z. Duric, "Qualitative image based localization in indoors environments," in Computer Vision and Pattern Recognition, CVPR, Madison, Wisconsin, 2003.
More informationCORRELATION BASED CAR NUMBER PLATE EXTRACTION SYSTEM
CORRELATION BASED CAR NUMBER PLATE EXTRACTION SYSTEM 1 PHYO THET KHIN, 2 LAI LAI WIN KYI 1,2 Department of Information Technology, Mandalay Technological University The Republic of the Union of Myanmar
More informationLast update: May 4, Vision. CMSC 421: Chapter 24. CMSC 421: Chapter 24 1
Last update: May 4, 200 Vision CMSC 42: Chapter 24 CMSC 42: Chapter 24 Outline Perception generally Image formation Early vision 2D D Object recognition CMSC 42: Chapter 24 2 Perception generally Stimulus
More informationClustering Methods for Video Browsing and Annotation
Clustering Methods for Video Browsing and Annotation Di Zhong, HongJiang Zhang 2 and Shih-Fu Chang* Institute of System Science, National University of Singapore Kent Ridge, Singapore 05 *Center for Telecommunication
More informationTEVI: Text Extraction for Video Indexing
TEVI: Text Extraction for Video Indexing Hichem KARRAY, Mohamed SALAH, Adel M. ALIMI REGIM: Research Group on Intelligent Machines, EIS, University of Sfax, Tunisia hichem.karray@ieee.org mohamed_salah@laposte.net
More informationLecture 12 Color model and color image processing
Lecture 12 Color model and color image processing Color fundamentals Color models Pseudo color image Full color image processing Color fundamental The color that humans perceived in an object are determined
More informationScene Text Detection Using Machine Learning Classifiers
601 Scene Text Detection Using Machine Learning Classifiers Nafla C.N. 1, Sneha K. 2, Divya K.P. 3 1 (Department of CSE, RCET, Akkikkvu, Thrissur) 2 (Department of CSE, RCET, Akkikkvu, Thrissur) 3 (Department
More informationInteractive Image Retrival using Semisupervised SVM
ISSN: 2321-7782 (Online) Special Issue, December 2013 International Journal of Advance Research in Computer Science and Management Studies Research Paper Available online at: www.ijarcsms.com Interactive
More informationEffects Of Shadow On Canny Edge Detection through a camera
1523 Effects Of Shadow On Canny Edge Detection through a camera Srajit Mehrotra Shadow causes errors in computer vision as it is difficult to detect objects that are under the influence of shadows. Shadow
More informationCellular Learning Automata-Based Color Image Segmentation using Adaptive Chains
Cellular Learning Automata-Based Color Image Segmentation using Adaptive Chains Ahmad Ali Abin, Mehran Fotouhi, Shohreh Kasaei, Senior Member, IEEE Sharif University of Technology, Tehran, Iran abin@ce.sharif.edu,
More informationWavelet Based Image Retrieval Method
Wavelet Based Image Retrieval Method Kohei Arai Graduate School of Science and Engineering Saga University Saga City, Japan Cahya Rahmad Electronic Engineering Department The State Polytechnics of Malang,
More informationGrouping and Segmentation
Grouping and Segmentation CS 554 Computer Vision Pinar Duygulu Bilkent University (Source:Kristen Grauman ) Goals: Grouping in vision Gather features that belong together Obtain an intermediate representation
More informationArtifacts and Textured Region Detection
Artifacts and Textured Region Detection 1 Vishal Bangard ECE 738 - Spring 2003 I. INTRODUCTION A lot of transformations, when applied to images, lead to the development of various artifacts in them. In
More informationColor-Texture Segmentation of Medical Images Based on Local Contrast Information
Color-Texture Segmentation of Medical Images Based on Local Contrast Information Yu-Chou Chang Department of ECEn, Brigham Young University, Provo, Utah, 84602 USA ycchang@et.byu.edu Dah-Jye Lee Department
More informationAn Approach for Reduction of Rain Streaks from a Single Image
An Approach for Reduction of Rain Streaks from a Single Image Vijayakumar Majjagi 1, Netravati U M 2 1 4 th Semester, M. Tech, Digital Electronics, Department of Electronics and Communication G M Institute
More informationColor Local Texture Features Based Face Recognition
Color Local Texture Features Based Face Recognition Priyanka V. Bankar Department of Electronics and Communication Engineering SKN Sinhgad College of Engineering, Korti, Pandharpur, Maharashtra, India
More informationRobotics Programming Laboratory
Chair of Software Engineering Robotics Programming Laboratory Bertrand Meyer Jiwon Shin Lecture 8: Robot Perception Perception http://pascallin.ecs.soton.ac.uk/challenges/voc/databases.html#caltech car
More informationPeripheral drift illusion
Peripheral drift illusion Does it work on other animals? Computer Vision Motion and Optical Flow Many slides adapted from J. Hays, S. Seitz, R. Szeliski, M. Pollefeys, K. Grauman and others Video A video
More informationContent Based Image Retrieval: Survey and Comparison between RGB and HSV model
Content Based Image Retrieval: Survey and Comparison between RGB and HSV model Simardeep Kaur 1 and Dr. Vijay Kumar Banga 2 AMRITSAR COLLEGE OF ENGG & TECHNOLOGY, Amritsar, India Abstract Content based
More informationCS 2770: Computer Vision. Edges and Segments. Prof. Adriana Kovashka University of Pittsburgh February 21, 2017
CS 2770: Computer Vision Edges and Segments Prof. Adriana Kovashka University of Pittsburgh February 21, 2017 Edges vs Segments Figure adapted from J. Hays Edges vs Segments Edges More low-level Don t
More informationObject detection using non-redundant local Binary Patterns
University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2010 Object detection using non-redundant local Binary Patterns Duc Thanh
More informationColor and Shading. Color. Shapiro and Stockman, Chapter 6. Color and Machine Vision. Color and Perception
Color and Shading Color Shapiro and Stockman, Chapter 6 Color is an important factor for for human perception for object and material identification, even time of day. Color perception depends upon both
More informationContent based Image Retrieval Using Multichannel Feature Extraction Techniques
ISSN 2395-1621 Content based Image Retrieval Using Multichannel Feature Extraction Techniques #1 Pooja P. Patil1, #2 Prof. B.H. Thombare 1 patilpoojapandit@gmail.com #1 M.E. Student, Computer Engineering
More informationA new predictive image compression scheme using histogram analysis and pattern matching
University of Wollongong Research Online University of Wollongong in Dubai - Papers University of Wollongong in Dubai 00 A new predictive image compression scheme using histogram analysis and pattern matching
More informationCAMERA MOTION CLASSIFICATION BASED ON TRANSFERABLE BELIEF MODEL
4th European Signal Processing Conference (EUSIPCO 26), Florence, Italy, September 4-8, 26, copyright by EURASIP CAMERA MOTION CLASSIFICATION BASED ON TRANSFERABLE BELIEF MODEL Mickael Guironnet, Denis
More informationEdge detection. Stefano Ferrari. Università degli Studi di Milano Elaborazione delle immagini (Image processing I)
Edge detection Stefano Ferrari Università degli Studi di Milano stefano.ferrari@unimi.it Elaborazione delle immagini (Image processing I) academic year 2011 2012 Image segmentation Several image processing
More informationTraffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers
Traffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers A. Salhi, B. Minaoui, M. Fakir, H. Chakib, H. Grimech Faculty of science and Technology Sultan Moulay Slimane
More informationLecture 10: Semantic Segmentation and Clustering
Lecture 10: Semantic Segmentation and Clustering Vineet Kosaraju, Davy Ragland, Adrien Truong, Effie Nehoran, Maneekwan Toyungyernsub Department of Computer Science Stanford University Stanford, CA 94305
More informationA Geometrical Key-frame Selection Method exploiting Dominant Motion Estimation in Video
A Geometrical Key-frame Selection Method exploiting Dominant Motion Estimation in Video Brigitte Fauvet, Patrick Bouthemy, Patrick Gros 2 and Fabien Spindler IRISA/INRIA 2 IRISA/CNRS Campus Universitaire
More informationLecture #13. Point (pixel) transformations. Neighborhood processing. Color segmentation
Lecture #13 Point (pixel) transformations Color modification Color slicing Device independent color Color balancing Neighborhood processing Smoothing Sharpening Color segmentation Color Transformations
More informationIMPROVING SPATIO-TEMPORAL FEATURE EXTRACTION TECHNIQUES AND THEIR APPLICATIONS IN ACTION CLASSIFICATION. Maral Mesmakhosroshahi, Joohee Kim
IMPROVING SPATIO-TEMPORAL FEATURE EXTRACTION TECHNIQUES AND THEIR APPLICATIONS IN ACTION CLASSIFICATION Maral Mesmakhosroshahi, Joohee Kim Department of Electrical and Computer Engineering Illinois Institute
More informationAIIA shot boundary detection at TRECVID 2006
AIIA shot boundary detection at TRECVID 6 Z. Černeková, N. Nikolaidis and I. Pitas Artificial Intelligence and Information Analysis Laboratory Department of Informatics Aristotle University of Thessaloniki
More informationTamil Video Retrieval Based on Categorization in Cloud
Tamil Video Retrieval Based on Categorization in Cloud V.Akila, Dr.T.Mala Department of Information Science and Technology, College of Engineering, Guindy, Anna University, Chennai veeakila@gmail.com,
More informationContent Based Image Retrieval (CBIR) Using Segmentation Process
Content Based Image Retrieval (CBIR) Using Segmentation Process R.Gnanaraja 1, B. Jagadishkumar 2, S.T. Premkumar 3, B. Sunil kumar 4 1, 2, 3, 4 PG Scholar, Department of Computer Science and Engineering,
More information2D image segmentation based on spatial coherence
2D image segmentation based on spatial coherence Václav Hlaváč Czech Technical University in Prague Center for Machine Perception (bridging groups of the) Czech Institute of Informatics, Robotics and Cybernetics
More informationA Comparison of Color Models for Color Face Segmentation
Available online at www.sciencedirect.com Procedia Technology 7 ( 2013 ) 134 141 A Comparison of Color Models for Color Face Segmentation Manuel C. Sanchez-Cuevas, Ruth M. Aguilar-Ponce, J. Luis Tecpanecatl-Xihuitl
More informationAUTOMATIC IMAGE ANNOTATION AND RETRIEVAL USING THE JOINT COMPOSITE DESCRIPTOR.
AUTOMATIC IMAGE ANNOTATION AND RETRIEVAL USING THE JOINT COMPOSITE DESCRIPTOR. Konstantinos Zagoris, Savvas A. Chatzichristofis, Nikos Papamarkos and Yiannis S. Boutalis Department of Electrical & Computer
More informationRecall precision graph
VIDEO SHOT BOUNDARY DETECTION USING SINGULAR VALUE DECOMPOSITION Λ Z.»CERNEKOVÁ, C. KOTROPOULOS AND I. PITAS Aristotle University of Thessaloniki Box 451, Thessaloniki 541 24, GREECE E-mail: (zuzana, costas,
More informationSegmentation of Distinct Homogeneous Color Regions in Images
Segmentation of Distinct Homogeneous Color Regions in Images Daniel Mohr and Gabriel Zachmann Department of Computer Science, Clausthal University, Germany, {mohr, zach}@in.tu-clausthal.de Abstract. In
More informationTextural Features for Image Database Retrieval
Textural Features for Image Database Retrieval Selim Aksoy and Robert M. Haralick Intelligent Systems Laboratory Department of Electrical Engineering University of Washington Seattle, WA 98195-2500 {aksoy,haralick}@@isl.ee.washington.edu
More informationMoving Object Segmentation Method Based on Motion Information Classification by X-means and Spatial Region Segmentation
IJCSNS International Journal of Computer Science and Network Security, VOL.13 No.11, November 2013 1 Moving Object Segmentation Method Based on Motion Information Classification by X-means and Spatial
More informationA Novel Algorithm for Color Image matching using Wavelet-SIFT
International Journal of Scientific and Research Publications, Volume 5, Issue 1, January 2015 1 A Novel Algorithm for Color Image matching using Wavelet-SIFT Mupuri Prasanth Babu *, P. Ravi Shankar **
More informationFuzzy sensor for the perception of colour
Fuzzy sensor for the perception of colour Eric Benoit, Laurent Foulloy, Sylvie Galichet, Gilles Mauris To cite this version: Eric Benoit, Laurent Foulloy, Sylvie Galichet, Gilles Mauris. Fuzzy sensor for
More informationAutomatic Logo Detection and Removal
Automatic Logo Detection and Removal Miriam Cha, Pooya Khorrami and Matthew Wagner Electrical and Computer Engineering Carnegie Mellon University Pittsburgh, PA 15213 {mcha,pkhorrami,mwagner}@ece.cmu.edu
More informationImage enhancement for face recognition using color segmentation and Edge detection algorithm
Image enhancement for face recognition using color segmentation and Edge detection algorithm 1 Dr. K Perumal and 2 N Saravana Perumal 1 Computer Centre, Madurai Kamaraj University, Madurai-625021, Tamilnadu,
More informationVideo Key-Frame Extraction using Entropy value as Global and Local Feature
Video Key-Frame Extraction using Entropy value as Global and Local Feature Siddu. P Algur #1, Vivek. R *2 # Department of Information Science Engineering, B.V. Bhoomraddi College of Engineering and Technology
More informationCMPSCI 670: Computer Vision! Grouping
CMPSCI 670: Computer Vision! Grouping University of Massachusetts, Amherst October 14, 2014 Instructor: Subhransu Maji Slides credit: Kristen Grauman and others Final project guidelines posted Milestones
More informationA Fuzzy Colour Image Segmentation Applied to Robot Vision
1 A Fuzzy Colour Image Segmentation Applied to Robot Vision J. Chamorro-Martínez, D. Sánchez and B. Prados-Suárez Department of Computer Science and Artificial Intelligence, University of Granada C/ Periodista
More informationObject Tracking Algorithm based on Combination of Edge and Color Information
Object Tracking Algorithm based on Combination of Edge and Color Information 1 Hsiao-Chi Ho ( 賀孝淇 ), 2 Chiou-Shann Fuh ( 傅楸善 ), 3 Feng-Li Lian ( 連豊力 ) 1 Dept. of Electronic Engineering National Taiwan
More informationOpen Access Self-Growing RBF Neural Network Approach for Semantic Image Retrieval
Send Orders for Reprints to reprints@benthamscience.ae The Open Automation and Control Systems Journal, 2014, 6, 1505-1509 1505 Open Access Self-Growing RBF Neural Networ Approach for Semantic Image Retrieval
More informationCS 664 Segmentation. Daniel Huttenlocher
CS 664 Segmentation Daniel Huttenlocher Grouping Perceptual Organization Structural relationships between tokens Parallelism, symmetry, alignment Similarity of token properties Often strong psychophysical
More informationRESTORATION OF DEGRADED DOCUMENTS USING IMAGE BINARIZATION TECHNIQUE
RESTORATION OF DEGRADED DOCUMENTS USING IMAGE BINARIZATION TECHNIQUE K. Kaviya Selvi 1 and R. S. Sabeenian 2 1 Department of Electronics and Communication Engineering, Communication Systems, Sona College
More informationAutomatic Colorization of Grayscale Images
Automatic Colorization of Grayscale Images Austin Sousa Rasoul Kabirzadeh Patrick Blaes Department of Electrical Engineering, Stanford University 1 Introduction ere exists a wealth of photographic images,
More informationUnsupervised learning in Vision
Chapter 7 Unsupervised learning in Vision The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual
More informationEdge and corner detection
Edge and corner detection Prof. Stricker Doz. G. Bleser Computer Vision: Object and People Tracking Goals Where is the information in an image? How is an object characterized? How can I find measurements
More informationScalable Coding of Image Collections with Embedded Descriptors
Scalable Coding of Image Collections with Embedded Descriptors N. Adami, A. Boschetti, R. Leonardi, P. Migliorati Department of Electronic for Automation, University of Brescia Via Branze, 38, Brescia,
More informationComputer vision: models, learning and inference. Chapter 13 Image preprocessing and feature extraction
Computer vision: models, learning and inference Chapter 13 Image preprocessing and feature extraction Preprocessing The goal of pre-processing is to try to reduce unwanted variation in image due to lighting,
More informationAn Efficient Semantic Image Retrieval based on Color and Texture Features and Data Mining Techniques
An Efficient Semantic Image Retrieval based on Color and Texture Features and Data Mining Techniques Doaa M. Alebiary Department of computer Science, Faculty of computers and informatics Benha University
More informationPictures at an Exhibition
Pictures at an Exhibition Han-I Su Department of Electrical Engineering Stanford University, CA, 94305 Abstract We employ an image identification algorithm for interactive museum guide with pictures taken
More informationSeveral pattern recognition approaches for region-based image analysis
Several pattern recognition approaches for region-based image analysis Tudor Barbu Institute of Computer Science, Iaşi, Romania Abstract The objective of this paper is to describe some pattern recognition
More informationA Semi-Automatic 2D-to-3D Video Conversion with Adaptive Key-Frame Selection
A Semi-Automatic 2D-to-3D Video Conversion with Adaptive Key-Frame Selection Kuanyu Ju and Hongkai Xiong Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China ABSTRACT To
More informationThe ToCAI Description Scheme for Indexing and Retrieval of Multimedia Documents 1
The ToCAI Description Scheme for Indexing and Retrieval of Multimedia Documents 1 N. Adami, A. Bugatti, A. Corghi, R. Leonardi, P. Migliorati, Lorenzo A. Rossi, C. Saraceno 2 Department of Electronics
More information