CHAPTER 3 SHOT DETECTION AND KEY FRAME EXTRACTION


Alexandra Crawford

3.1 INTRODUCTION

The twenty-first century is an age of information explosion, and the volume of digital data is growing rapidly. Increasing access to information drives the need for progress in multimedia technology, and video compression becomes a necessity for transferring huge volumes of data over channels of limited bandwidth.

A video stream consists of a number of shots, each delimited by a boundary such as a cut, fade, dissolve or wipe. A shot is defined as the consecutive frames from the start to the end of recording in a camera; it shows a continuous action in an image sequence. Generally, there are three kinds of shot boundaries: cut, dissolve and wipe. A cut is an abrupt transition between shots that is naturally formed by the video capturing process. A dissolve is a gradual transition between shots, an effect added by video editors in which two adjacent shots partly overlap while the frame intensities of the first shot decrease to zero and those of the second shot increase from zero. In a fade-in or fade-out, the two shots do not overlap, but the variations of frame intensities in the two adjacent shots are similar to those in a dissolve. A wipe is a digital video effect, also generated by video editors, that can take many different forms; in a wipe, a new shot pushes away an old shot. Detecting dissolves and wipes is more difficult than detecting cuts.

A matching process between two consecutive frames is required to identify a scene change. Many researchers have used the pixelwise luminance difference or the luminance or color histogram difference to match two frames (Zhang et al 1993). However, luminance and color are sensitive to small changes, so these features produce false alarms. Traditionally, shot detection techniques compared features between two adjacent frames to locate areas of sudden dissimilarity as shot boundaries (Yeo and Liu 1995; Patel and Sethi 1996). Such systems are limited because they cannot handle long transitions.

Detection of shot changes is useful for many applications, including video browsing and retrieval, video compression, statistical characterization of video in terms of different attributes of a shot, and global clustering of video documents (Sethi and Patel 1995). A commonly used scheme in the literature (Hanjalic et al 1997) detects shot changes by applying a locally computed threshold to the Frame to Frame Histogram Difference (FFD) values. The problem with this approach is that a high threshold increases the number of misses, while a lower threshold increases the number of false alarms.

Each shot may be represented by a set of key frames. Key frame extraction, along with shot segmentation, plays a fundamental role in video compression, video retrieval and video summarization (Ferman et al 2002). This chapter presents a computationally efficient method that uses the SSIM index to detect video shots and then to extract the key frames from each shot.

3.2 STRUCTURAL SIMILARITY (SSIM)

Recently, a new philosophy for image quality measurement was proposed, based on the assumption that the human visual system is highly adapted to extract structural information from the viewing field. It holds that a measure of structural information change can provide a good approximation to perceived image distortion (Wang and Bovik 2002). The measure assesses the visual impact of changes in luminance, contrast and structure in an image. SSIM therefore includes three comparisons between two images x and y, namely luminance l(x,y), contrast c(x,y) and structure s(x,y). SSIM is defined as (Wang et al 2002)

$$\mathrm{SSIM}(x,y) = l(x,y)\, c(x,y)\, s(x,y) \qquad (3.1)$$

$$l(x,y) = \frac{2\mu_x\mu_y + c_1}{\mu_x^2 + \mu_y^2 + c_1} \qquad (3.2)$$

$$c(x,y) = \frac{2\sigma_x\sigma_y + c_2}{\sigma_x^2 + \sigma_y^2 + c_2} \qquad (3.3)$$

$$s(x,y) = \frac{\sigma_{xy} + c_3}{\sigma_x\sigma_y + c_3} \qquad (3.4)$$

The SSIM index is easy to implement and corresponds better with human perception than PSNR (or MSE) (Wang et al 2004). It has been used in video quality monitoring (Wang et al 2003; Chih-Che Lin and Chau 2006), photographic restoration (Channappayya et al 2008), biomedical imaging (Sampat et al 2006), image coding (Wang et al 2007), video compression (Sung et al 2005; Mai et al 2006), and picture enhancement (Cockshott et al 2007). In this dissertation, SSIM is applied to detect scene changes, to identify the key frames within a scene, and to determine the spatial redundancy within a frame, in addition to its use as a fidelity measure.
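The three comparison terms can be combined in a few lines of code. The following is a minimal sketch, not the implementation used in this work. It makes several assumptions: the statistics are computed over the whole frame rather than over local windows, a frame is a flat sequence of gray levels, and c3 = c2/2, a common choice that this chapter does not state. The function name `ssim_global` and its parameter names are hypothetical.

```python
import math


def ssim_global(x, y, dynamic_range=255, k1=0.01, k2=0.03):
    """SSIM of two equally sized gray-level frames given as flat sequences.

    Assumption: global (whole-frame) statistics are used; windowed SSIM
    would average this quantity over local patches instead.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("frames must have equal size and at least two pixels")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)

    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    c3 = c2 / 2  # assumption: not specified in the text

    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    structure = (cov_xy + c3) / (sd_x * sd_y + c3)
    return luminance * contrast * structure
```

With c3 = c2/2 the product of the contrast and structure terms simplifies, so the index can also be written as a single fraction in the means, variances and covariance. A frame compared with itself scores 1, and the score falls as luminance, contrast or structure diverge.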

The SSIM metric is calculated as follows:

$$\mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\,\mathrm{cov}_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} \qquad (3.5)$$

where x and y are two frames in a video sequence, and

$\mu_x$ is the average of x;
$\mu_y$ is the average of y;
$\sigma_x^2$ is the variance of x;
$\sigma_y^2$ is the variance of y;
$\mathrm{cov}_{xy}$ is the covariance of x and y;
$c_1 = (k_1 L)^2$ and $c_2 = (k_2 L)^2$;
L is the dynamic range of the pixel values;
$k_1 = 0.01$ and $k_2 = 0.03$ by default.

The next section explains the shot detection procedure using SSIM, and the results of the proposed concept are compared with the pixel based and singular value decomposition (SVD) based methods.

3.3 SSIM BASED SHOT DETECTION

3.3.1 Procedure

1. Take video sequences with multiple shots (mixed video sequence, shoiab etc).
2. Calculate SSIM between consecutive frames.
3. Determine the Dis-Similarity Index Measure, DSSIM = 1/(1 - SSIM).
4. Plot DSSIM versus frame number.
5. Calculate the mean

$$m = \frac{1}{N-1}\sum_{i=1}^{N-1} \mathrm{DSSIM}(i, i+1)$$

6. Calculate the standard deviation

$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N-1} \bigl(\mathrm{DSSIM}(i, i+1) - m\bigr)^2}$$

where N is the number of frames in the video sequence.
7. Declare a shot boundary between frames i and i+1 if DSSIM(i, i+1) > m + ks, where k is a dissimilarity threshold.
8. Find the number of correctly detected shots (Nc), false alarms (Nf) and missed shots (Nm) for each value of k.
9. Calculate precision, recall and the retrieval success index (RSI), which are defined as (Kolekar and Sengupta 2004)

Precision = Nc/(Nc+Nf)
Recall = Nc/(Nc+Nm)
RSI = Nc/(Nc+Nf+Nm)

3.3.2 Performance Evaluation

The performance of the proposed SSIM based video shot detection is evaluated in terms of three parameters, namely precision, recall and retrieval success index (Kolekar and Sengupta 2004), and the results are compared with existing video shot detection schemes, namely the pixel based and singular value decomposition schemes. Three different video clippings are used for the experimental study: a mixed video sequence generated by combining 15 frames each from the standard video clippings Carphone, Claire, Miss America and Mobile, an aircraft take off video, and a cricket match sequence. Figures 3.1 to 3.3 show the results comprising the different shots of the mixed video sequence based on the SSIM, SVD and pixel based schemes respectively.

Figure 3.1 SSIM based shot detection (dissimilarity measure versus frame number)

It can be seen from Figure 3.1 that each high peak indicates a scene change occurring at that particular frame number. The figure clearly shows scene changes at frame numbers 16, 31, 46, 61, 76, 91 and 106.
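The thresholding step of the procedure can be illustrated with a short sketch. It assumes the dissimilarity values for consecutive frame pairs have already been computed, however they are defined, and it reports the first frame of the new shot (frame i+1, counting from 1) as the boundary, which matches the frame numbers quoted for Figure 3.1. The function name `detect_shot_boundaries` is hypothetical, and the usage values below are synthetic, not the measured data of this chapter.

```python
import statistics


def detect_shot_boundaries(dissimilarity, k):
    """Return 1-based frame numbers where a new shot starts.

    dissimilarity[j] is the dissimilarity between frame j+1 and frame j+2
    (1-based frame numbers), so a sequence of N frames supplies N-1 values.
    A boundary is declared when the value exceeds m + k*s.
    """
    if len(dissimilarity) < 2:
        raise ValueError("need at least two frame pairs")
    m = statistics.mean(dissimilarity)
    # Dividing by the number of pairs (N-1) is the population standard deviation.
    s = statistics.pstdev(dissimilarity)
    return [j + 2 for j, d in enumerate(dissimilarity) if d > m + k * s]


# Synthetic example: 120 frames, eight shots of 15 frames each.
values = [0.05] * 119
for j in range(14, 119, 15):
    values[j] = 0.9  # large dissimilarity across each shot change
boundaries = detect_shot_boundaries(values, k=2)
```

Because m and s are computed from the sequence itself, the threshold adapts to each video; raising k suppresses false alarms at the cost of missing weaker peaks.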

Figure 3.2 SVD based shot detection (dissimilarity measure versus frame number)

Figure 3.2 shows eight different shots in the mixed video sequence, which has 120 frames in total, based on the SVD method of shot detection. A high peak indicates a scene change at that particular frame number, and Figure 3.2 shows scene changes at frame numbers 16, 31, 46, 61, 76, 91 and 106.

Figure 3.3 Pixel based shot detection (dissimilarity measure versus frame number)


Figure 3.3 is a plot of dissimilarity measure versus frame number for the mixed video sequence, which is a combination of Carphone, Claire, Miss America and Mobile. In a few cases, the peaks indicating a scene change are lower in amplitude. For example, the amplitude is lower for the change from Claire to Miss America at frame number 31 and for the change from Mobile to Carphone at frame number 61, as shown in Figure 3.3. When the threshold is set too high, these shot changes may go unnoticed, so proper selection of the threshold is essential.

Figure 3.4 Last frame of each shot in the mixed video sequence

Figure 3.4 shows the last frame of each shot in the mixed video sequence. There are eight shots in total, each shot consists of 15 frames, and the total number of frames in the mixed video sequence is 120.


Figure 3.5 Last frame of each shot of the aircraft landing video sequence

Figure 3.5 shows the last frame of each shot in the aircraft landing video sequence. There are nine shots in total, the shots vary in length, and the total number of frames in the sequence is 50. The performance parameters are evaluated for all three schemes by varying the dissimilarity threshold k, and the results are shown in Tables 3.1 to 3.3 (only the mixed video sequence result is shown).
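The three performance measures follow directly from the counts Nc, Nf and Nm. The sketch below assumes that a detection counts as correct only when its frame number matches a ground-truth boundary exactly; the text does not say whether a tolerance window was used. The function name `evaluate_detection` and the example detections are hypothetical.

```python
def evaluate_detection(detected, ground_truth):
    """Return (precision, recall, rsi) for detected shot boundaries.

    Assumption: exact frame-number matching between detections and truth.
    """
    detected, truth = set(detected), set(ground_truth)
    nc = len(detected & truth)   # correctly detected boundaries
    nf = len(detected - truth)   # false alarms
    nm = len(truth - detected)   # missed boundaries
    precision = nc / (nc + nf) if nc + nf else 0.0
    recall = nc / (nc + nm) if nc + nm else 0.0
    rsi = nc / (nc + nf + nm) if nc + nf + nm else 0.0
    return precision, recall, rsi


# Boundaries of the mixed sequence: eight shots of 15 frames each.
truth = [15 * shot + 1 for shot in range(1, 8)]
# Hypothetical detector output: one miss (frame 61) and one false alarm (frame 50).
precision, recall, rsi = evaluate_detection([16, 31, 46, 50, 76, 91, 106], truth)
```

RSI penalises both kinds of error at once, so it can never exceed either precision or recall.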

Table 3.1 Performance evaluation for mixed video sequence: pixel based approach
Results for Mixed Video Sequence (Pixel Based)
Columns: k, Nc, Nm, Nf, Precision, Recall, RSI

Table 3.2 Performance evaluation for mixed video sequence: SVD based approach
Results for Mixed Sequence (SVD Based)
Columns: k, Nc, Nm, Nf, Precision, Recall, RSI

Table 3.3 Performance evaluation for mixed video sequence: proposed concept
Results for Mixed Sequence (SSIM Based)
Columns: k, Nc, Nm, Nf, Precision, Recall, RSI

Figure 3.6 Precision versus Threshold (precision against threshold k for the Pixel, SVD and SSIM schemes)

Figure 3.6 shows the variation of precision with threshold for the pixel based, SVD based and SSIM based approaches. Precision is comparatively better in the case of SSIM, because in the SSIM based approach precision reaches one even for smaller values of k.

Figure 3.7 Recall versus Threshold (recall against threshold k for the Pixel, SVD and SSIM schemes)

Figure 3.7 shows the variation of recall with threshold for the pixel based, SVD based and SSIM based approaches.

Figure 3.8 RSI versus Threshold (RSI against threshold k for the Pixel, SVD and SSIM schemes)

Figure 3.8 shows the variation of RSI with threshold for the pixel based, SVD based and SSIM based approaches. RSI is comparatively better in the case of SSIM. The following section explains the extraction of key frames from each shot.

3.4 IMPORTANCE OF KEY FRAMES IN VIDEO COMPRESSION

Key frame extraction is of great interest to the multimedia research community because it provides valuable information for video compression, summarization and organization (Luo et al 2009). In video analysis, indexing and summarization, one should eliminate redundant information and highlight the salient frames that carry the significant content details. Several methods have been reported in the literature for extracting key frames from consumer video, sports video, etc. (Li et al 2008).

In general, video analysis adopts several ways of choosing key frames. Some existing key frame extraction algorithms simply take the middle frame of each shot, or the first and the last frame of each shot, as the key frames (Pentland et al 1994). Treating the first frame of the shot as the key frame is a simple technique, but the first frame may not be a good abstraction of the entire shot (Tonomura et al 1993). Other approaches sample the shots at predefined time intervals (Gong et al 2000), so that the key frames are taken from set locations within the shot. Because such procedures ignore the content of the frames, they fail to exploit the correlation properties that are an essential ingredient of the video compression problem. To overcome these difficulties, a simple approach based on the structural similarity index is proposed to detect the key frames of a given video sequence.

A key frame generally provides a compendious representation of the given video sequence, reflecting the slow and fast motion of the selected shot. In compression, a key frame is a frame encoded without reference to any other frame; such frames are referred to as intra coded or I frames. Because video content is complex, many factors, such as camera motion, interaction between moving objects, and scene content, have to be taken into account in order to decide the optimal number and choice of key frames. Further, the type of video content, such as consumer video, sports video or news video, requires the key frame selection process to suit the summarization, indexing or compression problem at hand. In extracting key frames from each shot, an important issue is to determine the number of key frames needed to represent the shot content. Existing approaches to key frame selection tend to be either cluster-based or sequential methods that use some visual similarity measure (Li et al 2008).

3.5 KEY FRAME EXTRACTION

In this section, key frame extraction using the SSIM index is discussed. To compare its performance, two additional techniques, Euclidean distance and entropy difference, are also presented.

3.5.1 SSIM Approach

In general, still video frames exhibit strong spatial correlation, and such dependencies carry the structural details of the objects in the visual scene. It is well known that even though error sensitivity approaches apply linear transformations, they cannot efficiently remove the strong dependencies between pixels. In order to quantify error degradation or deviation, a better approach, referred to as the structural similarity index, has been proposed as a key quality assessment tool for evaluating reconstructed image quality (Wang et al 2007). This approach is extended here to extract the key frames.

Figure 3.9 shows the flow chart for computing SSIM between the video frames in the given sequence.

Figure 3.9 Key frame extraction using SSIM (flowchart steps)
1. Start.
2. Take video clipping.
3. Divide into frames.
4. Calculate SSIM between frames.
5. Choose a threshold value T.
6. Is SSIM index > T? Yes: discard the frame. No: store the frame as key frame.
7. Are all the frames compared? No: continue with the next frame. Yes: display the key frames.
8. End.
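The decision loop of Figure 3.9 can be sketched as follows, assuming the SSIM values between consecutive frames are already available. Two details are assumptions made for this sketch because the flowchart leaves them open: the first frame is kept as an initial key frame, and when a pair falls at or below the threshold it is the later frame of the pair that is stored. The function name `select_key_frames` is hypothetical, and frame indices are 0-based.

```python
def select_key_frames(ssim_values, threshold, keep_first=True):
    """Return 0-based indices of key frames.

    ssim_values[i] is the SSIM between frame i and frame i+1.
    Frames whose SSIM with the preceding frame exceeds the threshold
    are treated as similar and discarded; the rest are stored.
    """
    key_frames = [0] if keep_first else []
    for i, value in enumerate(ssim_values):
        if value > threshold:
            continue  # similar to the previous frame: discard
        key_frames.append(i + 1)
    return key_frames


# Hypothetical SSIM values for a six-frame clip.
keys = select_key_frames([0.99, 0.97, 0.60, 0.98, 0.80], threshold=0.85)
```

Raising the threshold keeps more frames, so it trades compression against how faithfully the key frames follow the motion in the shot.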

In this work, in order to extract key frames, the SSIM value is first calculated between all pairs of consecutive frames in the video sequence. The higher the SSIM value, the more similar the frames are. A threshold value is therefore selected, and frames with SSIM greater than this threshold are treated as similar or identical frames and are discarded. The frames with an SSIM value below the threshold are stored as key frames. A suitable threshold is chosen to vary the effect of SSIM on different video data sets.

3.5.2 Entropy Difference Method

Key frames are the specific frames of the video stream that represent its content. In the entropy based key frame extraction algorithm, the entropy is not considered as a global feature for the entire frame but as a local operator. Entropy is a specific way of representing the impurity or unpredictability of a set of data, since it depends on the context in which the measurement is taken.

Video clippings are divided into frames, and initially the first frame is assumed to be a key frame. Because brightness may change during the key frame comparison, it is necessary to apply colour quantization and then median filtering for region smoothing. Then the gray level entropy of the first frame is calculated. The same procedure is repeated for the next frame, and the absolute difference between the gray level entropy of the first frame and that of the next processed frame is calculated. If the sum of the normalized differences is more than a specific threshold, then the content of the frame sequence has changed and a new key frame is needed. The frame entropy is estimated from equation (3.6):

$$E_{total} = \sum_{k=1}^{K_{max}} e_f(k) \qquad (3.6)$$

where $e_f(k)$ is the gray level entropy, which is defined as

$$e_f(k) = P_f(k)\, Q_f(k) \qquad (3.7)$$

where $P_f(k)$ is the probability of appearance of the k-th gray level in the frame and $Q_f(k)$ is the information quantity transmitted by an element.

$$P_f(k) = \frac{h_f(k)}{M \cdot N} \qquad (3.8)$$

where $h_f(k)$ is the histogram value of the frame at gray level k, and M and N represent the size of the frame. The information quantity $Q_f(k)$ transmitted by an element is equal to

$$Q_f(k) = \log\frac{1}{P_f(k)} = -\log P_f(k) \qquad (3.9)$$

The information quantity $Q_f(k)$ multiplied by its probability of appearance gives the entropy generated by the source for this quantization level. The complete process of implementing the entropy difference algorithm is presented in Figure 3.10.

Figure 3.10 Flowchart of the implementation of entropy difference algorithm (flowchart steps)
1. Start.
2. Select video clip.
3. Select the 1st video frame as the initial key frame.
4. Apply colour quantization (256 colour bits) and median filtering.
5. Compute the grey level entropies.
6. Sort the gray level entropies in descending order and calculate the global frame entropy.
7. Select the next consecutive frame.
8. Calculate the frame entropy.
9. Compute the frame entropy difference.
10. Is the frame entropy difference > T? Yes: (1) store the previously selected key frame in the buffer as a real key frame; (2) put the currently selected key frame in the buffer. No: go to the next step.
11. End of video clip? No: return to step 7. Yes: display the key frames.
12. End.
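Equations (3.6) to (3.9) and the loop of Figure 3.10 can be sketched as below. This is an illustration under stated assumptions rather than the implementation evaluated in this chapter: logarithms are taken to base 2 (the text does not give the base), frames are flat sequences of gray levels, and the colour quantization, median filtering, sorting and normalization steps are omitted. Each frame is compared against the most recent key frame. The function names are hypothetical.

```python
import math
from collections import Counter


def frame_entropy(frame):
    """Total gray-level entropy of a frame, following (3.6) to (3.9).

    Assumption: base-2 logarithm. Gray levels that do not occur in the
    frame contribute nothing to the sum and are skipped.
    """
    total = len(frame)  # M * N pixels
    histogram = Counter(frame)
    # P = h/total and Q = log2(1/P) = log2(total/h)
    return sum((h / total) * math.log2(total / h) for h in histogram.values())


def entropy_key_frames(frames, threshold):
    """Return 0-based indices of key frames chosen by entropy difference."""
    if not frames:
        return []
    key_frames = [0]  # the first frame is the initial key frame
    reference = frame_entropy(frames[0])
    for index in range(1, len(frames)):
        entropy = frame_entropy(frames[index])
        if abs(entropy - reference) > threshold:
            key_frames.append(index)
            reference = entropy  # later frames are compared to the new key frame
    return key_frames


# Tiny synthetic clip: two flat frames followed by two textured frames.
clip = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 2, 3], [0, 1, 2, 3]]
keys = entropy_key_frames(clip, threshold=0.5)
```

Entropy depends only on the histogram, so two frames with different layouts but the same gray-level distribution are indistinguishable to this method.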

3.5.3 Euclidean Distance Method

Edges characterize boundaries, and edge detection is therefore a problem of fundamental importance in image processing. It is often the first step in image processing, since quite useful information can be extracted from the edges. The Canny edge detection algorithm is widely regarded as an optimal edge detector, and it is used here for the initial detection process. In the Euclidean distance method, the edges of the image in each frame are found first and the edge sum is calculated for each frame. If the difference in edge sum is greater than the threshold T, the key frame count is increased and the current frame is also taken as a key frame. Figure 3.11 shows the flow chart for computing the key frames using Euclidean distance.

Figure 3.11 Flowchart of Euclidean Distance Method (flowchart steps)
1. Start.
2. Select video clip.
3. Select the 1st video frame as the initial key frame.
4. Apply gray quantization and Canny edge detection to find the edges of the image in the first frame.
5. Find the sum of the edges of the first (previous) frame.
6. A specific function of the edge sum is fixed as the threshold.
7. Select the next consecutive frame.
8. Apply the edge detection algorithm, find the edges and calculate the edge sum.
9. Find the edge sum difference between the 1st and 2nd frame.
10. Is percentage T < difference sum? Yes: (1) store the previously selected key frame in the buffer as a real key frame; (2) put the currently selected key frame in the buffer as a key frame. No: go to the next step.
11. End of video clip? No: return to step 7. Yes: display the key frames.
12. End.
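The selection loop of Figure 3.11 can be sketched once the edge sum of every frame is known. The sketch deliberately leaves edge detection out: it assumes the per-frame edge sums were produced beforehand by an edge detector such as Canny. It also assumes one particular reading of "a specific function of the edge sum is fixed as the threshold", namely a percentage of the edge sum of the current key frame. The function name `edge_sum_key_frames` and the example values are hypothetical.

```python
def edge_sum_key_frames(edge_sums, percent_threshold):
    """Return 0-based indices of key frames chosen by edge-sum difference.

    edge_sums[i] is the number (or sum) of edge pixels in frame i.
    Assumption: a frame becomes a key frame when its edge sum differs from
    that of the current key frame by more than percent_threshold percent.
    """
    if not edge_sums:
        return []
    key_frames = [0]  # the first frame is the initial key frame
    reference = edge_sums[0]
    for index in range(1, len(edge_sums)):
        difference = abs(edge_sums[index] - reference)
        if difference > (percent_threshold / 100.0) * reference:
            key_frames.append(index)
            reference = edge_sums[index]
    return key_frames


# Hypothetical edge sums for a five-frame clip, with a 20 percent threshold.
keys = edge_sum_key_frames([1000, 1020, 1500, 1490, 900], percent_threshold=20)
```

A relative threshold keeps the test comparable between sparse and cluttered scenes, whereas an absolute threshold would need retuning for each sequence.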


3.6 PERFORMANCE EVALUATION

The performance of the proposed key frame extraction algorithm is evaluated using standard video sequences, and the number of key frames obtained is tabulated in Table 3.4. A total of 100 frames with a resolution of are considered. It can be observed from Table 3.4 that the number of key frames varies with the threshold value. The SSIM technique provides better results than the other techniques, because it exploits the structural information of the video content for both fast and slow movements. Figure 3.12 shows the key frames obtained using the SSIM approach. To further evaluate its applicability to real time applications, the performance is evaluated in terms of three parameters, namely Compression Ratio (CR), Processing Time (PT) and Computational Efficiency (CE).

Figure 3.12 Key frames extracted from Claire video sequence using SSIM concept for Threshold T=0.85

Table 3.4 Key frames obtained using different algorithms
Rows: Video (Akiyo, Claire, Carphone, Mother daughter, Mobile), one row per Threshold (T1 to T4)
Columns: No. of key frames for Entropy, Euclidean, SSIM

The efficiency is evaluated in terms of CR, which is defined as

$$CR = 1 - \frac{TKF}{TNF} \qquad (3.10)$$

where
TKF is the total number of key frames,
TNF is the total number of frames in a sequence, and
PT is the time required for key frame extraction.

$$CE = \frac{CR}{PT} \qquad (3.11)$$

The performance measures CR, PT and CE are calculated at different threshold values and the results are tabulated in Table 3.5.

Table 3.5 Performance analysis of key frame extraction techniques
Rows: Threshold (T1, T2, T3, T4)
Columns: PT, CR, CE for each of Entropy, Euclidean, SSIM

Figure 3.13 shows the performance of the different key frame techniques in terms of the above parameters.
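Equations (3.10) and (3.11) can be worked through with a small sketch. It assumes PT is measured in seconds as wall-clock time around the extraction call, which the text does not specify, and the helper `evaluate_extractor` and the stand-in extractor used in the example are hypothetical.

```python
import time


def compression_ratio(total_key_frames, total_frames):
    """CR = 1 - TKF/TNF, as in (3.10)."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    return 1.0 - total_key_frames / total_frames


def computational_efficiency(cr, processing_time):
    """CE = CR/PT, as in (3.11)."""
    if processing_time <= 0:
        raise ValueError("processing_time must be positive")
    return cr / processing_time


def evaluate_extractor(extractor, frames):
    """Time a key frame extractor and return (PT, CR, CE).

    Assumption: PT is wall-clock time in seconds.
    """
    start = time.perf_counter()
    key_frames = extractor(frames)
    pt = time.perf_counter() - start
    cr = compression_ratio(len(key_frames), len(frames))
    ce = computational_efficiency(cr, pt) if pt > 0 else float("inf")
    return pt, cr, ce


# Stand-in extractor that keeps every tenth frame of a 100-frame sequence.
pt, cr, ce = evaluate_extractor(lambda frames: frames[::10], list(range(100)))
```

For example, 15 key frames out of 100 give CR = 1 - 15/100 = 0.85; if extraction takes 0.5 s, then CE = 0.85/0.5 = 1.7. A method scores well on CE only if it both discards many frames and runs quickly.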

Figure 3.13 Performance analysis curve for different threshold (four panels, one per threshold, each showing PT, CR and CE for the Entropy, Euclidean and SSIM key frame techniques)

Figure 3.13 shows that, among the three key frame techniques, SSIM is superior in terms of compression efficiency as well as computational complexity.

3.7 CONCLUSION

This chapter discusses the detection of shots in a video sequence and the extraction of key frames from the detected shots. An SSIM based technique has been proposed both to detect the shots and to extract the key frames. SSIM based shot detection gives better results in terms of precision, recall and RSI than the pixel difference and singular value decomposition approaches. Key frames are identified based on a threshold value of SSIM, and the performance is evaluated in terms of compression ratio and computational efficiency. The results are compared with existing key frame techniques, namely Euclidean distance and entropy difference. It is concluded from the results that the SSIM technique identifies appropriate key frames more reliably than the other schemes while requiring less computation. The interframe compression concept and its illustration are presented in the following chapter.
