
Semantic Movie Scene Segmentation Using Bag-of-Words Representation

THESIS

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By Sai Luo

Graduate Program in Civil Engineering

The Ohio State University
2017

Master's Examination Committee:
Alper Yilmaz, Advisor
Lei Wang
Rongjun Qin

Copyrighted by Sai Luo 2017

Abstract

Video segmentation is an important procedure in video indexing and archiving. Shots and scenes are two levels of video segments. In this thesis, a shot detection method and a scene segmentation method are introduced. The shot detection method is based on both camera motion and histogram differences between frames. Motion within one shot has a certain continuity, so the frames which break this pattern are the shot transition frames. The motion based method is sensitive to object motion in the frames, which would affect the result, so histogram comparison is introduced. Frames in one shot are very similar in terms of visual content, and this visual content can form a histogram. The drawback of this method is that in scenes with massive camera movement, the visual content may vary considerably. By combining these two methods the shots can be extracted with higher accuracy. Scene segmentation is a higher level of video summarization. In this process, the BOW model and a color histogram method are used. The BOW model is used to construct feature descriptors that represent background information. The model is constructed by generating clusters from the feature vectors of the training images. Since a scene is a complex level of video segmentation, multiple criteria are needed to determine it, so a color histogram is introduced. Color information of the frames in one scene shares similarity; the frames which break this pattern are the frames with a scene transition, thus forming the second level scenes.

Keywords: Shot detection, Scene segmentation, Bag-of-Words

Vita

NO.1 Middle School Affiliated to Central China Normal University
B.E. Information Engineering, China University of Geoscience
2015 to present: M.S. Civil Engineering, The Ohio State University

Fields of Study

Major Field: Civil Engineering

Table of Contents

Abstract
Vita
Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
Chapter 2: Related Work
    2.1 Shot Detection
        2.1.1 Dissolve and Fade Detection
        2.1.2 Wipe Detection
    2.2 Bag-of-Words
        2.2.1 Feature Detecting Descriptors
        2.2.2 Vocabulary Building and Codebook Generation
        2.2.3 Limitations and Recent Research on BOW
    2.3 Scene Transition Detection
    2.4 K-nearest Neighbors Algorithm
Chapter 3: Shot Transition Detection
    3.1 Motion Based Shot Detection
    3.2 Histogram Based Shot Detection
    3.3 Threshold Setting
Chapter 4: Scene Segmentation
    Bag-of-Words
    Support Vector Machine
    Color Histogram Comparison
Chapter 5: Experiments and Results
    Shot Detection
    Scene Segmentation
Chapter 6: Conclusions and Future Work
References

List of Tables

Table 1. Pros and cons of SIFT
Table 2. Example of BOW model accuracy check
Table 3. Description of dataset and shot detection results
Table 4. Comparison between ground truth and shots detected (movie Matrix, partial)
Table 5. Vocabulary size and its precision
Table 6. Part of model accuracy check for vocabulary size of …
Table 7. Part of model accuracy check for vocabulary size of …
Table 8. Movie scene segmentation results
Table 9. A partial example for ground truth of scenes and second level scene segments for the movie Matrix

List of Figures

Figure 1. Main steps of the method
Figure 2. Shot transition illustration (from [2])
Figure 3. (From left to right) A Harris affine region; the normalized region; and the 8 maps of gradient magnitude constituting the SIFT descriptor (from [14])
Figure 4. The steps for BOW model (from MathWorks for image classification with Bag of Visual Words)
Figure 5. Video structure
Figure 6. Example of k-NN classification; the green object is the target, k is chosen to be 3, and the target should be classified as a red triangle since it is the dominant contributor
Figure 7. Patterns in a spatial-temporal slice (from [25])
Figure 8. Illustration of block matching based motion estimation searching for matching blocks
Figure 9. Region-wise shot motion summary of (a) static, (b) tilt, (c) pan and (d) zoom shots. Each region represents the rough direction of a camera motion
Figure 10. Camera projection model
Figure 11. Image regions
Figure 12. Two transition models: (a) abrupt change and (b) gradual transition
Figure 13. Color histogram differences for (a) cut boundaries; (b) gradual boundaries
Figure 14. Workflow of shot transition detection
Figure 15. Shot and scene transition: (a) shot transition, (b) scene transition
Figure 17. SVM Classifier
Figure 18. Expression of Euclidean distance (d) in 2-D space
Figure 19. Example of histograms for three channels of RGB color space
Figure 20. HSV color space illustration
Figure 21. (a) Two consecutive frames with brightness change, (b) color histogram difference in RGB space, (c) color histogram difference in HSV space
Figure 22. Abrupt shot transition: (a) entry and exit frame of the previously detected shot, (b) entry and exit frame of the latter detected shot
Figure 23. Camera motion for abrupt shot transition
Figure 24. Example of gradual (fade out) shot transition
Figure 25. Camera motion in fade out shot transition
Figure 26. Color histogram difference in gradual shot transition
Figure 27. Example of misdetection of shot boundary: (a) entry and exit frame of the previously detected shot, (b) entry and exit frame of the latter detected shot
Figure 28. Examples of training images, set (a) is forest, set (b) is living room
Figure 29. Error rate of the classification against the vocabulary size
Figure 30. (a) Two shots in one scene; (b) a scene transition
Figure 31. RGB to HSV

Chapter 1: Introduction

Nowadays, thousands of movies are produced every year, hence the indexing of movie data becomes a crucial task. We usually build a movie database by setting labels for the movies, like movie names, movie types (action, science fiction, fantasy, etc.), and movie cast. With the number of movies growing rapidly in recent years, it becomes hard to index movies with these labels since there would be movies with the same labels, so we need to find new criteria for movie indexing. Recent research has shown that breaking up movies into smaller segments like shots or scenes makes it easier to access a large amount of video data. By breaking up movies into segments we can index small pieces of a movie and get a series of labels; these labels can help index the movies and make access to the movie database easier. Shots and scenes are two different levels of movie segments: a shot is a combination of frames, and a scene is a combination of shots. The difficulty in this process is to find shot and scene transitions. There are multiple techniques developed to find shot and scene transitions, which will be further introduced in Chapter 2. A shot can be described as a sequence of frames that are taken from the same camera [1], and for every shot, there should be a frame that can represent the whole shot (key frame). The representation of a movie by shots would allow us to browse a movie by viewing just

a few frames, thus making the video data easier to access. Shot boundary detection algorithms need to tackle the difficulty of finding shot boundaries in the presence of camera and object motion and illumination variations. Furthermore, different shot boundaries may have different attributes like abrupt shot changes or smooth shot transitions. There are four types of shot transitions:

Cut (abrupt change, discontinuous shot transition): a sudden change of a scene.
Fade: a scene gradually disappearing/appearing.
Dissolve: one scene gradually disappearing while another gradually appears.
Wipe: one scene gradually entering across the view while another gradually leaves.

Many automatic techniques have been developed to detect shot transitions in videos. One of the simplest ways of measuring the visual content discontinuity is to compare the corresponding pixels between two consecutive frames. This method is rather time-consuming; an alternative to pixel-based methods is applying color histogram comparison methods. Edge changes can also be used as a feature to detect shot boundaries. Abrupt and gradual transitions are both used in narrative film and video to convey story structure: an abrupt transition, also called a cut, is roughly equivalent to a period ending a sentence, while gradual transitions show differences in larger structures such as paragraphs or chapters. Abrupt shot transitions are easy to identify in videos, but gradual transitions have different types, such as fade, dissolve and wipe; these gradual transitions are hard to spot and are of great importance in parsing narrative film and video. There are methodologies developed for extracting summaries of narrative video

programs based on proper identification of shot transitions. Most of these methodologies try to find all kinds of shot transitions simultaneously but often fail to identify dissolve and wipe transitions. Although shots can be a good indexing method for movies, the efficiency of this method can be improved by combining shots into scenes. A scene can be defined as a sequence of shots that are taken in the same background (e.g. a dialogue between characters in a room) or describing the same event (e.g. multiple shots taken in a war scene) [2]. We know that every movie tells a story or multiple stories, and the mechanism of storytelling in movies is breaking the story into segments, where each segment tells part of a story and has different characters in it. By combining shots into scenes, we can reduce the number of segments for one movie and gain more information in one segment, so that the indexing of video data becomes less complex. Movie segmentation into scenes faces two major problems. The first one is shot segmentation of movies. There are different types of content in movies, like car chase scenes, fight scenes, dialogue scenes, etc. In dialogue scenes, the camera tends to be steady, so the camera motion between shots can be used to define the shot boundary. But in war scenes, the camera moves rapidly, which makes it hard to apply the normal camera motion method to detect shot boundaries, so there is a need to apply an adjusted camera motion method in this case. The motion histogram descriptor is introduced to solve this problem [3]. Using the motion histogram, we need to characterize camera motions by a template based background motion estimation, to estimate motion vectors [4]. And by

using a block matching technique, the camera motion vectors can be used with more confidence [5]. The second challenge is the criteria for combining shots. A scene is a complex level of movie segmentation; it is a combination of a sequence of shots, but different shots have different content, so we need to figure out a boundary detecting method. After viewing several movies and setting up the ground truth for scenes and shots, we checked the pattern of the shots in each scene and found two similarities that most shots follow. One pattern is that most shots in a scene have the same background information. For instance, in indoor dialogue scenes, the backgrounds of the shots all contain the indoor information. On the other hand, a scene change usually comes with a background change, so it can be a sign of a scene boundary. To use background information to categorize the shots, we need to introduce the visual bag of words (BOW). The mechanism of BOW is to extract features from the training images and get the descriptors of the images, then use certain criteria to group the descriptors into different clusters, thus forming the visual vocabulary [6]. By applying the visual vocabulary, we can get feature descriptors of the target images, and then we can set up a distance threshold to combine a set of consecutive shots to form a scene. The other pattern is that the color histograms of the shots in a scene, even in a dynamic scene (with a lot of character action and camera rotation and movement), show a relatively high resemblance, but when the scene changes the color histograms mostly change as well, which breaks the resemblance and thus shows the scene boundary.

The process of movie summarization in this thesis can be described as follows: apply shot segmentation to get movie shots, apply BOW to the shots to get the primary set of scenes, then use the color histogram to combine the primary scenes and get the final scenes.

Figure 1. Main steps of the method
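To make the workflow in Figure 1 concrete, the sketch below arranges the three stages as a single Python function. This is only an illustrative skeleton under assumed names: detect_shots, group_shots_by_bow, and merge_scenes_by_histogram are hypothetical placeholders standing in for the methods of Chapters 3 and 4, not functions provided by this thesis or any library.

```python
# Minimal sketch of the pipeline in Figure 1. The three callables are
# hypothetical placeholders for the methods developed in Chapters 3 and 4.

def segment_movie(frames, detect_shots, group_shots_by_bow, merge_scenes_by_histogram):
    """frames: a list of decoded video frames (e.g. NumPy arrays in BGR order)."""
    # Stage 1: shot transition detection (motion + histogram comparison, Chapter 3)
    shots = detect_shots(frames)                              # -> list of (start_idx, end_idx)

    # Stage 2: group consecutive shots whose BOW background descriptors stay close
    primary_scenes = group_shots_by_bow(frames, shots)

    # Stage 3: merge primary scenes whose color histograms remain similar
    final_scenes = merge_scenes_by_histogram(frames, primary_scenes)
    return final_scenes
```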

Chapter 2: Related Work

In this section, we discuss previous work on shot transition detection and scene detection, as well as areas related to our work, like feature extraction methods and the Bag-of-Words classifier. Shot transition detection is the primary task of movie scene segmentation. In the process of shot transition detection, abrupt transitions and gradual transitions are the two most important cases, especially gradual transitions, since they are hard to detect and have great significance in video summarization. Scene segmentation is based on the shot transition results; in our experiments, the Bag-of-Words method is applied to classify the key frames of each shot and get feature vectors to find scene transitions. In the Bag-of-Words classification process, feature extraction is a key step; a good feature extractor gives better results and saves a lot of time.

2.1 Shot Detection

Video shot detection has been researched for a long period of time and there are plenty of methods developed to address this problem. The most widely adopted methods are the frame difference method, the histogram method, and the camera motion method. The basic idea of a shot change is a significant difference between two consecutive frames, and a simple way of comparing two frames is to find accumulated pixel-level differences

[7]. By comparing the number of pixels that have a significant difference, or the number of regions that are significantly different, the shot boundary can be found. The concept of this method is simple and it is easy to apply, but camera motion has a strong effect on its result. Another widely used video summarization method is histogram comparison, and there are two types of application of this method: local histogram comparison and global histogram comparison [8]. Local histogram comparison is based on histograms of consecutive frames; if the histogram changes abruptly there is a shot transition between the two frames. Global histogram comparison is based on histograms of all the frames of the video, and by setting up a threshold the shot transitions can be found. The error rate of this method is acceptable, but when it comes to continuous shot transitions, it works poorly. Motion based comparison methods use camera motion to find shot boundaries. In general, the camera motion within a single shot has a certain continuity. There are two ways of applying motion based comparison: one is extracting motion vectors, the other is frame-by-frame block matching. Motion vectors are easy to obtain from frames in sequence, and the ones which break the continuity should be shot transition frames. Block matching is a method with a similar idea: if two consecutive frames have many matching blocks they can be categorized as the same shot; otherwise, there is a shot transition between the two frames. The drawback of this method is the computational cost. There is much work focused on compressed domain video segmentation algorithms, since most digital videos, especially on Internet storage, are in compressed format. Analyzing video directly in the compressed domain would greatly

reduce the computation complexity, thus achieving real-time performance. The discrete cosine transform (DCT) is used in shot transition detection in MPEG compressed video. One way to achieve this is to get the discrete cosine component of each frame to conduct a color histogram comparison or pixel-level frame differencing on the consecutive low-resolution frames. This method is relatively time-consuming and the accuracy is hard to maintain. An alternative is using the normalized inner product of the motion vectors of DCT coefficients to characterize the changing statistics [9]. Motion vector information embedded in the MPEG data is widely used to detect shot transitions. Doing video segmentation on MPEG video is quite computationally efficient, but gradual shot transitions are still hard to detect, which is the case for all the other techniques mentioned above. Most of the methodologies mentioned above are easy to implement, and all of them can give reasonable results for abrupt shot transition detection. However, algorithms for gradual shot transition detection have a low success rate, especially for dissolve and wipe detection [9]. It has been shown that the histogram differencing algorithm is the most consistent and gives relatively high precision results for gradual shot transition detection. Motion analysis and histogram comparison methods can both distinguish gradual shot transitions from abrupt shot transitions, but when deciding shot boundaries, they perform poorly. There is not much research on gradual shot transitions; we will give a brief discussion of previous work on fade, dissolve, and wipe transition algorithms aside from the methodologies introduced above.

Figure 2. Shot transition illustration (from [2])
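As a concrete illustration of the two simplest measures from Section 2.1, the sketch below scores consecutive frame pairs by accumulated pixel-level difference and by gray-level histogram difference using OpenCV; a shot boundary would then be declared wherever a score exceeds a threshold. This is a minimal sketch, not the detector developed in Chapter 3, and the bin count and pixel threshold are arbitrary illustrative values.

```python
import cv2
import numpy as np

def frame_difference_scores(video_path, pixel_thresh=30):
    """Yield (frame_index, pixel_score, hist_score) for consecutive frame pairs."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    idx = 0
    while ok:
        ok, cur = cap.read()
        if not ok:
            break
        idx += 1
        g_prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
        g_cur = cv2.cvtColor(cur, cv2.COLOR_BGR2GRAY)

        # Accumulated pixel-level difference: fraction of pixels that changed a lot
        diff = cv2.absdiff(g_prev, g_cur)
        pixel_score = np.count_nonzero(diff > pixel_thresh) / diff.size

        # Global gray-level histogram difference (sum of absolute bin differences)
        h_prev = cv2.calcHist([g_prev], [0], None, [64], [0, 256]).ravel()
        h_cur = cv2.calcHist([g_cur], [0], None, [64], [0, 256]).ravel()
        hist_score = np.abs(h_prev - h_cur).sum() / g_prev.size

        yield idx, pixel_score, hist_score
        prev = cur
    cap.release()
```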

2.1.1 Dissolve and Fade Detection

Zabih, Miller, and Mai [22] at Cornell University proposed a new approach to detect various shot transitions, including cut, dissolve, fade, and wipe. To identify dissolves and fades, they look at the relative values of the entering and exiting edge percentages. The edge tracking algorithm has been shown to perform best on gradual shot transition detection, but the accuracy is still not sufficient for practical applications. Moreover, to apply the edge tracking algorithm, we need to conduct motion estimation beforehand to align the consecutive frames, and the performance depends on the robustness of the motion estimation. Hampapur, Jain, and Weymouth [23] introduced a model-driven approach to digital video segmentation called chromatic scaling. From observation, they found that a particular set of operations on the image sequence will yield an image which has a constant value in all its pixels. This algorithm assumes that during the fade in and fade out process, both the object motion and camera motion are at a very low level. They also assume that the edit affects all parts of the image. Yu, Bozdagi, and Harrington [24] proposed a simpler feature-based algorithm for fade and dissolve detection. This algorithm takes advantage of the production aspects of video, can do a good job of identifying fade and dissolve transitions under camera zooming and panning, and can distinguish them from wipe transitions.

2.1.2 Wipe Detection

Hampapur, Jain, and Weymouth introduced a model-driven algorithm for spatial translation detection. The assumption they made to conduct the experiment is that

the brightness of a pixel does not change much during a pure spatial edit. In 1998, Wu, Liu, and Wolf proposed a good wipe detection method using a projection operation. They project the 2D differences of the frames into 1D to obtain the changing statistics of wipe transitions. The drawback of these two algorithms is that they are limited to the detection of left-to-right or right-to-left spatial translation only. In modern video editing, various spatial translations are generated during the process, so the two algorithms mentioned above are not capable of detecting wipe transitions within these videos. A wipe transition is not just a side-to-side translation, and the boundary is not easy to set. Hence algorithms that only focus on left-to-right or right-to-left translations of shots cannot meet the requirements of modern video segmentation. There are many types of wipe transitions; among all of them, the most frequently appearing ones are side-to-side (left-to-right, right-to-left, top-down, and bottom-up), corner-to-corner, boundary-in, and center-out [10]. So the wipe transition detection method should be able to detect and distinguish all four of these types of wipe transitions. This algorithm will be discussed further in a later chapter.

2.2 Bag-of-Words

The Bag-of-Words model is applied to categorize images by extracting features from images and building visual words. The categorization can be achieved in three steps: feature extraction, feature descriptors, and vocabulary generation [11]. Feature extraction is the fundamental step for building the Bag-of-Words model; it is achieved by applying feature detecting descriptors. After feature extraction on the training images, the features that are detected from the images are used to build the classification model. The feature

descriptors are found by taking the video frames into the model. The vocabulary generation is based on the choice of codebook size; a good choice of codebook size significantly increases the efficiency and accuracy of Bag-of-Words. The following is related work on feature detecting descriptors and BOW model building.

2.2.1 Feature Detecting Descriptors

There are multiple ways of detecting features. SIFT is an efficient feature detecting method developed at the University of British Columbia; it is the abbreviation of Scale-Invariant Feature Transform, and it is an algorithm to detect and describe features in images. For an image, there should be one object or multiple objects in it; to describe the objects, the points of interest should be extracted, and these points of interest can provide feature descriptors for the object [12]. There is a pattern to these features: their relative positions for one object do not change. The SIFT descriptor is a robust feature detecting method that can identify objects under different conditions, even among clutter and under partial occlusion. This is because the SIFT feature descriptor is invariant to uniform scaling and orientation, and for affine distortion and illumination changes it also works relatively well [13]. The mechanism of SIFT is to extract points of interest as key points and establish a database; the database is used as a reference, by comparing features extracted from new images to the database and finding matching key points based on the Euclidean distance between their feature vectors. The next step is to find good matches based on locations, scales, and orientations of the matching points. After the good matches are identified, the features of these points are used to form clusters; the features

that agree on an object are put in the same cluster, and with the clusters the model can be built.

Problem | Technique | Advantage
Key localization / scale / rotation | DoG / scale-space pyramid / orientation assignment | Accuracy, stability, scale & rotational invariance
Geometric distortion | Blurring / resampling of local image orientation planes | Affine invariance
Indexing and matching | Nearest neighbor / Best Bin First search | Efficiency / speed
Cluster identification | Hough transform voting | Reliable pose models
Model verification / outlier detection | Linear least squares | Better error tolerance with fewer matches
Hypothesis acceptance | Bayesian probability | Reliability

Table 1. Pros and cons of SIFT

There are three main steps to SIFT; the first is feature detection. To track the features, an image is transformed into feature vectors, and there are similarities among them. Gaussian derivatives are computed at 8 orientation planes over a 4x4 grid of spatial locations, thus forming a 128-dimension vector. The key points are taken as maxima/minima of the

Difference of Gaussians (DoG) that occur at multiple scales, and a DoG image D(x, y, σ) is given by

D(x, y, σ) = L(x, y, k_i σ) − L(x, y, k_j σ)   (1)

where L(x, y, kσ) is the convolution of the original image I(x, y) with the Gaussian blur G(x, y, kσ) at scale kσ,

L(x, y, kσ) = G(x, y, kσ) * I(x, y)   (2)

When the DoG images are obtained, the key points can be identified as local maxima/minima of the DoG images across scales. Figure 3 shows an example of the maps of gradient magnitude corresponding to the 8 orientations [14].

Figure 3. (From left to right) A Harris affine region; the normalized region; and the 8 maps of gradient magnitude constituting the SIFT descriptor (from [14])

The second step is feature matching and indexing. Feature matching is based on the relative positions of feature points, and feature indexing consists of key point storage and matching point identification. The algorithm used to identify nearest neighbors with high probability is called Best-Bin-First (BBF) search, which is based on the k-d tree algorithm. The goal of BBF is to search the bins in the feature space in the order of their closest distance from the query location [12]. The best match for each key point is its nearest

neighbor in the database formed from the training images. The nearest neighbors are found based on the minimum Euclidean distance between feature vectors. The last step is cluster identification. The Hough transform is used in this step to form clusters such that all the features in a cluster agree upon a particular pose. The Hough transform achieves this goal by creating a voting mechanism for the features: if a cluster of features votes for the same pose of an object, this cluster has a higher probability of a correct interpretation than any single feature. Any cluster with at least 3 features in it forms a bin, and the bins are sorted into decreasing order of size. Speeded Up Robust Features (SURF) is an enhanced algorithm for feature detection; it is partially inspired by SIFT, and it is claimed to be much faster than SIFT and more robust against different image transformations. The basic principles of SURF are the same as SIFT, but in each step there are minor differences. There are three main steps in SURF: the detection of points of interest, local neighborhood description, and feature matching. In the feature detection stage, SIFT uses Gaussian smoothing to get DoG images; SURF uses a square-shaped filter as an approximation of it. In the filtering process the integral image is used, thus making it much faster:

s(x, y) = ∑_{i=0}^{x} ∑_{j=0}^{y} I(i, j)   (3)

For finding points of interest, SURF uses the Hessian matrix. The determinant of the Hessian matrix is an important element in this case: the local change around a point can be measured using the determinant, and if the determinant is maximal the point is chosen. The determinant can also be used to select scale. The Hessian matrix at a given point p = (x, y) with scale σ is:

H(p, σ) = ( L_xx(p, σ)  L_xy(p, σ) ; L_yx(p, σ)  L_yy(p, σ) )   (4)

where L_xx(p, σ) is the convolution of the second-order derivative of the Gaussian with the image I(x, y) at the point p. The next stage concerns the location of interest points and the scale-space representation. The interest points can be found at different scales; the scale space is usually represented as a pyramid in other feature detection algorithms. The pyramid is built by repeatedly smoothing the images with Gaussian blur and subsampling them. At every level of the pyramid a mask can be calculated:

σ = (filter size × base filter scale) / base filter size   (5)

The scale space is divided into several levels, and in SURF, the lowest level is obtained from the output of 9 × 9 filters. In SURF the scale space is implemented in a different way than in other algorithms: it is done by up-scaling the filter size rather than reducing the image size. In the process of up-scaling the filter size, non-maximum suppression in a neighborhood is applied to localize interest points in the image and over scales. Interpolating the scale space is of great significance in SURF, since the scale difference between layers is rather large. The final stage is feature matching. The matching stage for SURF is the same as in SIFT; it is also based on the Euclidean distance between feature vectors. Orientation is an important element in feature extraction: for an object, if its orientation differs between images, the extracted features would not be a good match. To

eliminate the rotational effect on feature matching, the orientations of the interest points should be obtained. Haar wavelet responses are introduced to solve this problem.

2.2.2 Vocabulary Building and Codebook Generation

The definition of the BOW model is a histogram representation based on independent features [16]. When the BOW model is used to describe an image, the image is treated as a document, thus the terminology "words" is used. After feature detection, each image is represented by a series of local patches. By transforming these local patches into numerical vectors the feature descriptors are formed. The descriptors should be invariant to intensity, rotation, scale, and affine transforms to some extent. SIFT is usually used in descriptor building, but in our case we use SURF since it is faster in practice and gives similar results. The term "codebook" refers to a collection of words. Several patches that share similar features can form a word. K-means clustering is used in this process. One simple way to categorize images using a codebook V is the Naïve Bayes classifier [14], which can be described by the equation below:

c* = arg max_c p(c | w) = arg max_c p(c) p(w | c) = arg max_c p(c) ∏_{n=1}^{N} p(ω_n | c)   (6)

where each ω_n is a V-dimensional vector that has only one component equal to one and all other components equal to zero; the single component that equals one shows which cluster the patch belongs to. And w is a representation of the image; it is the collection of all the patches in an image. The c represents the category of the image. The assumption of this model is that for each category there is a distribution over the codebook, and the distributions are rather different from each other. For instance, a

category for buildings and a category for humans have different distributions. The building category may have a codebook consisting of windows, doors, antennas and roofs, and the human category should have a codebook which has heads, bodies and legs in it. Given a set of training images the classifier can learn from the samples and obtain the distributions for the categories; this is realized by the equation for the Naïve Bayes classifier. Since this method is simple and effective to some extent, it is usually used as a baseline method for comparison. And since images are treated as documents in the BOW model, some discriminative models that are suitable for text document categorization can be applied; the most popular models are the support vector machine (SVM) [14], AdaBoost [17] and the kernel trick. Among these models, SVM is the most popular method to apply in BOW. A support vector machine constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. A good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class. The classification function of the SVM model for a given observation x can be described as below:

f(x) = sign(wᵀx + b)   (7)

where w and b are the hyperplane parameters.

2.2.3 Limitations and Recent Research on BOW

One of the biggest drawbacks of BOW is that it does not account for the spatial relationships among the patches. The spatial relationships are of great importance in the representation

of an image. Recent research has proposed several methods to deal with spatial information. For feature-level information, correlogram features are introduced to capture spatial co-occurrences of features [18]. Relative positions [19] of the words are introduced to find spatial relationships among patches. Once the spatial relationships among the patches are considered, the results of BOW are greatly improved. For now, the BOW model is not well understood for object segmentation and localization, and for viewpoint invariance and scale invariance, the performance is unclear.

Figure 4. The steps for BOW model (from MathWorks for image classification with Bag of Visual Words)

2.3 Scene Transition Detection

A video usually consists of scenes, and each scene is a combination of shots. A shot is an uninterrupted segment of a video frame sequence with static or continuous camera motion, while a scene is a series of consecutive shots that are coherent from the narrative point of

view [25]. The shots within one scene are either taken in the same place or share similar thematic content. Figure 5 shows the structural content of a video.

Figure 5. Video structure

Scene change detection can be accomplished in two ways: comparing the similarity of background information in shots, and analyzing the content of audio features. There are several research problems along this line of thought: (1) background and foreground segmentation; (2) background and foreground identification; (3) similarity measures; (4) word spotting from the audio signal. The first problem can be settled with good accuracy if the background and foreground objects have different motion patterns. The second problem is a bit more complex; it needs human feedback on the difference between foreground and background information. The third problem has been studied seriously since the beginning of video retrieval research. The last problem is rather hard since the soundtrack of a video is complex and often mixed with many sound sources.

Scene change detection is a complex task in the video summarization field, because it needs to solve all four problems mentioned above. A fully automated system for video scene transition detection is hard to build, since a reliable segmentation of a video cannot be done prior to the detection of a scene. There are two major algorithms for scene transition detection: one uses a time-constrained clustering algorithm to group shots which share similar visual content and are temporally close into a scene; the other applies audiovisual characteristics to detect scene boundaries. In general, these two approaches are based on the video representation scheme and a shot similarity measure. The first approach is to represent a video in a semantically meaningful way: if consecutive shots are visually similar, they tend to be semantically related. The second approach is to simulate human perception, from both the visual and the acoustic perspective. In most of the algorithms for scene transition detection, shot detection is the primary step. The shots are represented by a set of keyframes, and the similarities among the shots depend on the color similarity of those keyframes. A motion-based video representation method for scene change detection was proposed in 2000 [25]. It is done by integrating previous works on video partitioning, motion characterization, and foreground-background segmentation. The problem is solved from four different aspects: (1) represent shots adaptively and compactly through motion characterization; (2) reconstruct background information in the multiple-motion case; (3) reduce the distraction of foreground objects by histogram intersection; (4) impose a time constraint to group shots that are temporally close [25]. In previous work on video scene transition detection, the compact video representation for shot similarity measurement

has not been well settled. The most popular method for scene transition detection in previous research is based on the color similarity of the key frames of two consecutive shots, which may lead to missed detection of scene transitions. In [25], a method is proposed that not only selects key frames from shots, but also sets keyframes based on the annotated motion of shots. By combining the color similarity of shots and shot motions, the accuracy of scene transition detection can be significantly raised. Since the proposed video representation scheme is compact, the histogram intersection, which measures similarity between features based on the intersection of feature points, can be performed more effectively for scene transition detection.

2.4 K-nearest Neighbors Algorithm

The k-nearest neighbors algorithm (k-NN) is a non-parametric method in pattern recognition used for classification and regression. The input of k-NN consists of the k closest training examples in the feature space in both cases; the output depends on whether k-NN is used for classification or regression.

1) In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, usually small).

2) In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.

The k-NN algorithm is a type of lazy learning, where the function is only approximated locally and all computation is deferred until classification. It is one of the simplest algorithms in machine learning.

For both regression and classification, assigning weights to the contributions of the neighbors can be useful, so that the nearer neighbors contribute more to the average than the ones at greater distance. A common weighting method is to give each neighbor a weight of 1/d, where d is the distance to the neighbor. The training samples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm simply stores the feature vectors and class labels of the training samples. In the classification phase, k is a constant defined by the user, and an unlabeled target vector is classified by assigning the label which is most frequent among the k training samples nearest to that query point. A frequently used distance metric is the Euclidean distance. The drawback of the basic majority voting classification occurs when the class distribution is skewed. The samples of a more frequent class tend to dominate the prediction of a new example, since they are common among the k nearest neighbors due to their large number. One way to overcome this problem is to assign weights in the classification phase, where the weights are related to the distance from the test point to each of its k nearest neighbors: the class of each of the k nearest points is multiplied by a weight proportional to the inverse of the distance from that point to the test point.
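A minimal NumPy sketch of the plain and distance-weighted voting rules just described is shown below. It illustrates the general k-NN rule rather than the exact code used in the experiments, and the 1/d weighting follows the convention mentioned above.

```python
import numpy as np

def knn_classify(train_x, train_y, query, k=3, weighted=False):
    """train_x: (N, D) feature vectors, train_y: (N,) labels, query: (D,) vector."""
    dists = np.linalg.norm(train_x - query, axis=1)     # Euclidean distances
    nearest = np.argsort(dists)[:k]                     # indices of the k nearest samples
    votes = {}
    for i in nearest:
        # plain majority vote, or weight each neighbor by 1/d (guard against d = 0)
        w = 1.0 / (dists[i] + 1e-12) if weighted else 1.0
        votes[train_y[i]] = votes.get(train_y[i], 0.0) + w
    return max(votes, key=votes.get)
```

With k = 3 and unweighted voting, this reproduces the behavior illustrated in Figure 6: the query takes the class held by two of its three nearest neighbors.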

Figure 6. Example of k-NN classification; the green object is the target, k is chosen to be 3, and the target should be classified as a red triangle since it is the dominant contributor

The choice of k depends on the data. Generally, a larger k reduces the effect of noise on the classification, but makes the boundaries between classes less distinct. A good k can be selected by various heuristic techniques. The special case where the class is predicted to be the class of the closest training sample (k = 1) is called the nearest neighbor algorithm. The accuracy of the k-NN algorithm can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance. There is research aimed at selecting or scaling features to improve the accuracy of classification. One particularly popular approach is the use of evolutionary algorithms to optimize feature scaling.

The k-NN algorithm in our case is used to set up thresholds for shot and scene boundary detection. Threshold setting is a crucial step in our algorithm; a good threshold will significantly raise the accuracy of the shot and scene boundary detection.

Chapter 3: Shot Transition Detection

Shot transition detection is a crucial task in video summarization. The first level of video segmentation is the frame level, and a shot is a combination of frames that are taken from one camera. This topic has been studied for years, and the techniques are quite mature by now. As discussed in Chapter 1, the very first step of movie scene segmentation is to get the shot segments of the movie, since a scene is a collection of consecutive shots which share similar visual content or are taken at the same place. A good shot transition detection result makes scene detection easier and raises its accuracy. Previous shot detection methods are discussed in Chapter 2; in this thesis, we apply both motion based comparison and histogram based comparison to detect shot transitions, and the gradual transitions are modeled. By setting up all these criteria for shot transitions, a good shot transition detection method is built.

3.1 Motion Based Shot Detection

There are four steps to obtain the features in this method. First, a block matching based motion estimation method is applied to estimate the local camera motion of each block. Secondly, the local camera motion for each block is analyzed, and motion vectors of interest can be found. Then the motion vectors of interest can be used to construct a sequence of compact representations of temporal motion [20]. At last a normalized

histogram can be obtained by using the compact representations; this histogram serves as the local motion feature of one shot.

Figure 7. Patterns in a spatial-temporal slice (from [25])

Block matching based motion estimation is a widely used local motion estimation method, usually used for video compression. The motion estimation methods are used as a function to obtain the optimal optical displacement of the blocks. The optical displacement can be described as a motion vector which gives the coordinate displacements of the matching blocks in the reference frame. For a block in the i-th frame, the motion vector is searched for in the (i+1)-th frame; these two consecutive frames can be represented as f_i and f_{i+1}. By extracting all the motion vectors in one frame, a pack of motion vectors can be constructed. To achieve this goal, f_i is divided into nonoverlapping blocks of size n × n, and in f_{i+1} the most similar block is found by searching an area of size 3n × 3n in that frame. The formula can be described as below:

(u^i(x, y), v^i(x, y)) = arg min_{u,v ∈ {−n+1, …, n}} e(x, y, u, v, i)   (8)

e(x, y, u, v, i) = ∑_{p=0}^{n−1} ∑_{q=0}^{n−1} |f_i(x + p, y + q) − f_{i+1}(x + u + p, y + v + q)|

where u^i(x, y) and v^i(x, y) are the horizontal and vertical displacements.

Figure 8. Illustration of block matching based motion estimation searching for matching blocks

The next step is detecting motion vectors. The goal is to identify regions with motion information that is related to camera movement. The camera movement can be categorized as static, zoom, pan, tilt, and combinations of them. In the modern movie shooting process, cameras tend to be fixed to machines, and the motion feature continuity between consecutive frames is rather strong in the absence of a shot transition, so movie shots are detected with smooth and consistent camera motion no matter whether the objects in the frames are static or dynamic. The motion vectors for regions of static objects in the frames have a direct relationship with camera movement. However, for regions of objects with massive movements or non-rigid bodies (e.g. dripping water), the captured motion vectors are rather random. By searching for areas in the frames that preserve

consistency of camera motion (regions of static objects), this obstacle can be overcome. The motion features obtained from these regions can be integrated to form the motion vectors.

Figure 9. Region-wise shot motion summary of (a) static, (b) tilt, (c) pan and (d) zoom shots. Each region represents the rough direction of a camera motion

The computation of camera motion is expressed in detail as follows. A camera projects a 3D world point A = (X, Y, Z)ᵗ onto the 2D image plane; the corresponding point is expressed as a = (x, y)ᵗ. With the focal length of the camera f, the projection can be written as:

x = f X/Z and y = f Y/Z   (9)

The camera motion vector is calculated with

V = T + Ω × P   (10)

where V = (V_X, V_Y, V_Z)ᵗ is the 3D velocity of the point whose position vector is P = (X, Y, Z)ᵗ, T = (T_X, T_Y, T_Z)ᵗ is the 3D camera translation vector, and Ω = (Ω_X, Ω_Y, Ω_Z)ᵗ is the 3D angular velocity vector of the camera rotation. Let O(x, y) = (u(x, y), v(x, y))ᵗ be the observed optical flow at location p = (x, y)ᵗ. By definition,

u(x, y) = dx/dt and v(x, y) = dy/dt   (11)

From equation (11) we get

u(x, y) = dx/dt = (f/Z)(T_X + Ω_Y Z − Ω_Z Y) − (x/Z)(T_Z + Ω_X Y − Ω_Y X)
        = −(xy/f) Ω_X + (f + x²/f) Ω_Y − y Ω_Z + (f/Z) T_X − (x/Z) T_Z   (12)

v(x, y) = −(f + y²/f) Ω_X + (xy/f) Ω_Y + x Ω_Z − (f/Z) T_Y − (y/Z) T_Z   (13)

Taking the camera zoom factor into account, we add a term to equations (12) and (13):

u(x, y) = −(xy/f) Ω_X + (f + x²/f) Ω_Y − y Ω_Z + (f/Z) T_X − (x/Z) T_Z + f [arctan(x/f)] (1 + x²/f²) r_zoom   (14)

v(x, y) = −(f + y²/f) Ω_X + (xy/f) Ω_Y + x Ω_Z − (f/Z) T_Y − (y/Z) T_Z + f [arctan(y/f)] (1 + y²/f²) r_zoom   (15)

where r_zoom is the camera zoom factor (the angular magnification factor).

Among the nine unknowns in the equations above, only Z varies with the pixel location; the other eight variables, Ω_X, Ω_Y, Ω_Z, T_X, T_Y, T_Z, f and r_zoom, are independent of the pixels in an image. The goal is to find the dominant motion among Ω_X, Ω_Y, Ω_Z, T_X, T_Y, T_Z and r_zoom, or to find that there is no motion at all. By applying optical flow and examining its overall mean value, the existence of camera motion in an image sequence can be found. If camera motion is found in an image sequence, the task is to detect Ω_X (tilt), Ω_Y (pan), Ω_Z (Z-axis rotation), T_X (horizontal translation), T_Y (vertical translation), T_Z (Z-axis translation), or r_zoom (zoom). The projection and camera motion are illustrated in Figure 10.

Figure 10. Camera projection model
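As a worked illustration of equations (14) and (15), the sketch below evaluates the predicted flow field (u, v) on an image grid for a given set of motion parameters, assuming a constant depth Z. The sign conventions follow the reconstruction above, which is an assumption where the original typesetting was ambiguous; this is a forward model for inspection, not the estimation procedure itself.

```python
import numpy as np

def rigid_motion_flow(width, height, f, Z, omega, T, r_zoom=0.0):
    """Predicted optical flow (u, v) from equations (14)-(15).

    omega = (Omega_X, Omega_Y, Omega_Z), T = (T_X, T_Y, T_Z); Z is the scene depth
    (taken as constant here for simplicity), f is the focal length in pixels.
    """
    ox, oy, oz = omega
    tx, ty, tz = T
    # image coordinates measured from the principal point
    x, y = np.meshgrid(np.arange(width) - width / 2.0,
                       np.arange(height) - height / 2.0)

    zoom_u = f * np.arctan(x / f) * (1.0 + x**2 / f**2) * r_zoom
    zoom_v = f * np.arctan(y / f) * (1.0 + y**2 / f**2) * r_zoom

    u = -(x * y / f) * ox + (f + x**2 / f) * oy - y * oz + (f / Z) * tx - (x / Z) * tz + zoom_u
    v = -(f + y**2 / f) * ox + (x * y / f) * oy + x * oz - (f / Z) * ty - (y / Z) * tz + zoom_v
    return u, v
```

Inspecting such synthetic fields region by region is one way to sanity-check the decision rules that follow.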

The assumption in our case is that there is only one type of camera motion in each video segment. If there are multiple types of camera motion and one of the motions dominates the others, the assumption is still applicable. Based on this assumption, we can analyze different types of camera motion separately, which means we will have only one nonzero parameter during each calculation. The calculation of camera motion is done separately in the x (horizontal) and y (vertical) directions. We divide an image into 7 nonoverlapping regions, O, X_0, Y_0, I, II, III, and IV, as shown in Figure 11.

Figure 11. Image regions

Optical flow is calculated in the different regions of an image. Applying this method, we can get the relationship between each kind of camera motion and u(x, y) and v(x, y):

Tilt: u(x, y) = 0 in both X_0, Y_0; v(x, y) is constant at Y_0;
Pan: u(x, y) is constant at X_0; v(x, y) = 0 in both X_0, Y_0;
Z-axis rotation: u(x, y) = 0 at Y_0; v(x, y) = 0 at X_0;
Horizontal translation: u(x, y) is constant; v(x, y) = 0 everywhere;

Vertical translation: u(x, y) = 0 everywhere; v(x, y) is constant;
Z-axis translation: u(x, y) = 0 at X_0; v(x, y) = 0 at Y_0;
Zoom: u(x, y) = 0 at X_0; v(x, y) = 0 at Y_0.

We can see from the above that, apart from Z-axis translation and zoom, the camera motions have unique signatures. We can distinguish each of the camera motions by computing u(x, y) and v(x, y) and checking their values and patterns in the regions of an image. Considering the nondominant camera motion effects and the noise, we take the mean value in each region of an image to calculate the velocity, and the standard deviation to decide whether the velocity is constant. In experiments, a camera pan is usually confused with horizontal translation since they share a similar pattern in the optical flow result. The y component of the optical flow, (xy/f) Ω_Y, is normally very small in a camera pan, and thus is easily mistaken for zero. To solve this problem, we first separate camera pan and horizontal translation from the other camera motions, and then distinguish camera pan from horizontal translation using the fact that u(x, y) of a camera pan has a larger variation than u(x, y) of a horizontal camera translation. The recognition of camera tilt and camera vertical translation follows a similar idea. When the dominant camera motion is zoom,

u(x, y) = f [arctan(x/f)] (1 + x²/f²) r_zoom   (16)

Applying a Taylor expansion,

u(x, y) = r_zoom x + (2 r_zoom x³)/(3 f²) + R(x)   (17)

Ignoring the third-order and higher terms,

u(x, y) ≈ r_zoom x   (18)

When the dominant camera motion is Z-axis translation,

u(x, y) = x T_Z / Z   (19)

Equations (18) and (19) are both linear functions of x. To distinguish T_Z from r_zoom, we note that T_Z is related to Z while r_zoom is independent of Z, which means the T_Z term would have a larger variance than the r_zoom term in normal situations.

3.2 Histogram Based Shot Detection

The idea of this method is to detect gradual transitions from the feature histograms of two consecutive frames. The feature histogram in our method is constructed from visual descriptors. If there is an abrupt change between the visual descriptors of two consecutive frames, there must be a shot transition. But when there is a shot transition, the difference between the two visual descriptors may be small, since in some cases the transition is gradual. For these frames, the shot transition is hard to identify. To solve this problem, a sliding window method is introduced. The window size is N, and a frame is compared with its N-th predecessor [21]. If there is a gradual transition, the difference over the temporal separation between the frame and the predecessor frame is more significant than for frames without a gradual transition. Although this method can work well, the choice of window size and the threshold for the histogram difference are rather crucial. A basic model for transition detection is based on the idea that there is only one shot transition within one video; this makes the model easier to understand, and it will be shown later that this model also works for videos with multiple shots.

The features of the frames in one shot have a certain continuity in nature; all the features of the frames in the first shot in the window would be highly similar to those of the reference frame, which is usually set as the first frame of the shot. But because of object motion in the frames, the features will have a certain difference. To eliminate the effect of this kind of difference, we need to design the descriptor properly. The reference descriptor should be robust enough to avoid the effect of noise. To achieve this goal, we take the average of the first I descriptors in the window as the descriptor of the reference frame; in our experiments, I is chosen as 5. Here we introduce a concept called the similarity value. The similarity value is calculated from the difference between the descriptor of the current frame and the reference descriptor: the smaller the difference, the greater the similarity value. By observing the resulting similarity values for the frames, two patterns are found, one for abrupt transitions, the other for gradual transitions; the two patterns are shown in Figure 12. Although the two patterns seem clear, it is hard to set up a proper threshold value to determine a shot transition, since there are some small gradual changes within one shot. In this thesis, we separate the reference frame temporally by the whole shot from the frames in the transition phase. Thereby the similarity values are relatively different compared to the frames in the first shot in the window. This makes the selection of a proper threshold value easier. The feature extractor to be used in this model should meet two requirements. The first is that the features extracted from the frames within one shot should be very close to each other. The other is that the feature must have discriminative power to distinguish the

within-shot frames from the transition frames in the case of a gradual transition. For these two requirements, edge-based features are applied. The basic idea of this method is to convert the image to a grayscale image and apply the Sobel operator to compute the gradient at each pixel; then the edge image is obtained by a preset threshold, and the features can be extracted from the edge image.

Figure 12. Two transition models: (a) abrupt change and (b) gradual transition
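A minimal sketch of such an edge-based descriptor and of the similarity value computed against the averaged reference descriptor is given below. The descriptor layout (a coarse grid of edge densities), the Sobel kernel size, and the edge threshold are illustrative assumptions, not the exact features used in the experiments.

```python
import cv2
import numpy as np

def edge_descriptor(frame, grid=(8, 8), edge_thresh=100):
    """Coarse edge-density descriptor: fraction of edge pixels in each grid cell."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    edges = (np.hypot(gx, gy) > edge_thresh).astype(np.float64)
    h, w = edges.shape
    gh, gw = h // grid[0], w // grid[1]
    desc = [edges[r * gh:(r + 1) * gh, c * gw:(c + 1) * gw].mean()
            for r in range(grid[0]) for c in range(grid[1])]
    return np.array(desc)

def similarity_values(frames, I=5):
    """Similarity of every frame to a reference built from the first I descriptors."""
    descs = [edge_descriptor(f) for f in frames]
    reference = np.mean(descs[:I], axis=0)          # robust reference descriptor
    # larger value = more similar (smaller difference), as described in the text
    return [-np.linalg.norm(d - reference) for d in descs]
```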

Experiments show that the model works well on short videos with only 2 shots, but in this thesis we are trying to do movie summarization, and a movie consists of many shots. The model should be adjusted a bit to fit shot detection in larger videos like movies. The biggest obstacle for the previous model in larger videos is the choice of window size; a bad choice of window size would result in a failure to find shot transitions. The idea is simple: we set up a range of window sizes. The window size is iteratively updated, and the video shots are updated in this process. Color histogram difference is a better shot boundary detection method. In modern movie filming, the videos taken are usually represented in the RGB color space. In this thesis, we define the color histogram as an array of M elements, where M is the number of different possible colors within the color space. We cannot compute all the possible colors since the computational cost is rather large, and the human visual system cannot distinguish all levels of colors. A solution to this problem is to consider only the most significant bits of each RGB component. By doing so, we can greatly reduce the number of color bins. A better solution is to evaluate the color histogram in the HSV space due to its perceptual similarity characteristics. This color space is defined based on human perception of colors. Visually similar colors are grouped within close quantization levels, while visually different colors are grouped in distant quantization levels. Moreover, the similarity between two colors is evaluated by the distance in this color space. The transformation from RGB to HSV is easy to achieve. After the transformation, a series of bins is defined for the color histogram. In RGB space, we have a series of quantization levels based on the number of bits of each component.
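The sketch below builds such a reduced color histogram by keeping only the top bits of each channel, together with an HSV variant that gives the hue channel more bins than saturation and value. The bin counts are illustrative assumptions; OpenCV stores frames in BGR order, which does not affect the comparison.

```python
import cv2
import numpy as np

def quantized_color_histogram(frame, bits=2):
    """Histogram over (2**bits)**3 colors, keeping only the top `bits` of each channel."""
    q = frame >> (8 - bits)                       # quantize each 8-bit channel
    idx = (q[..., 0].astype(int) << (2 * bits)) | (q[..., 1].astype(int) << bits) | q[..., 2].astype(int)
    hist = np.bincount(idx.ravel(), minlength=(1 << bits) ** 3).astype(np.float64)
    return hist / hist.sum()                      # normalize so frame size does not matter

def hsv_histogram(frame, h_bins=16, s_bins=4, v_bins=4):
    """HSV histogram with more bins for hue than for saturation and value."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, [h_bins, s_bins, v_bins],
                        [0, 180, 0, 256, 0, 256]).ravel()
    return hist / hist.sum()
```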

For the HSV color space, we separate the hue component from the saturation and value components, since the human visual system is more sensitive to hue variations than to saturation and value variations. The color histogram comparison based shot boundary detection in this case is based on the computation of color histogram differences between frames as a measure of discontinuity. The difference is computed as the sum of the absolute differences between the bin values,

D_RGB(X, Y) = ∑_{i=1}^{M} |h_X(i) − h_Y(i)|   (20)

where h_X is the color histogram of image X, which contains M different bins. In the HSV color space, the algorithm is the same, and the result we obtained is similar to that in the RGB color space, which means there is no significant improvement from applying the RGB to HSV transformation. The shot boundary detection method is based on the difference between the color histograms of consecutive frames in a video sequence. The difference can be calculated by

HistDif[i] = ∑_{j=1}^{M} |h_i(j) − h_{i−1}(j)|   (21)

where h_i is the color histogram with M bins of frame i of the video sequence. As shown above, different types of shot transitions have different shapes of color histogram difference. For instance, a single peak implies the appearance of an abrupt shot transition. The cuts are easy to detect since they always present a large amplitude. The equation for cut detection can be written as

HistDif_cut[i] = a_i δ(i − i_cut)   (21)

where a_i represents the amplitude of the delta function and i_cut is the frame number where the cut occurs. Gradual transitions like fades and dissolves appear with lower amplitude, and the transition is rather smooth. Assuming the histogram differences are constant within the transition between two shots, we have the following function:

HistDif_fade,dissolve[i] = β_i rect((i − i_fade,dissolve) / T_fade,dissolve)   (22)

rect(x) = 1 if |x| ≤ 1/2, and 0 if |x| > 1/2   (23)

By combining motion based shot transition detection and histogram comparison based shot detection, the abrupt and gradual shot transitions are well modeled and the detection accuracy is reasonable.

Figure 13. Color histogram differences for (a) cut boundaries; (b) gradual boundaries
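A minimal sketch of equations (20)-(23) in code form is given below: it computes the difference sequence HistDif[i] for a list of per-frame histograms, then flags isolated high peaks as cuts and sustained moderate plateaus as candidate gradual transitions. The two thresholds and the minimum plateau length are placeholders; in this thesis the thresholds are set automatically as described in Section 3.3.

```python
import numpy as np

def hist_diff_sequence(histograms):
    """HistDif[i] = sum_j |h_i(j) - h_{i-1}(j)| for a list of normalized histograms."""
    return [np.abs(histograms[i] - histograms[i - 1]).sum()
            for i in range(1, len(histograms))]

def classify_boundaries(hist_dif, cut_thresh, grad_thresh, min_len=5):
    """Return (cut_positions, gradual_ranges); positions index the second frame of each pair."""
    cuts = [i + 1 for i, d in enumerate(hist_dif) if d >= cut_thresh]
    gradual, run_start = [], None
    for i, d in enumerate(list(hist_dif) + [0.0]):    # sentinel closes any open run
        in_run = grad_thresh <= d < cut_thresh
        if in_run and run_start is None:
            run_start = i
        elif not in_run and run_start is not None:
            if i - run_start >= min_len:              # sustained plateau -> gradual transition
                gradual.append((run_start + 1, i))
            run_start = None
    return cuts, gradual
```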

3.3 Threshold Setting

Threshold setting is a key step in both shot transition detection and scene segmentation. The shot transition patterns are illustrated in the previous sections. The motion-based shot detection method does not need a threshold to distinguish camera motions, but the histogram-based method needs one to separate gradual shot transitions from noise, so the next step is to set a reasonable threshold based on observation of the histograms. The method we apply is based on the k-NN algorithm. The idea is to divide the whole movie into small groups containing the same number of frames, each frame having its own histogram. We compute the variance of the histogram distances within each group; the variance reflects how different the histograms of the frames in that group are, with a larger variance meaning larger differences. Specifically, the histograms of the frames in a group are fed into the k-NN model to compute the distances between them, and the variance of those distances is computed for the group. We then sort the group variances in descending order into a vector and divide the vector into five equal parts. We take the third part of the vector, find the corresponding groups of histograms, compute the distances between consecutive histograms in the k-NN model, and take the mean of those distances as the threshold. Taking the middle part of the vector is what allows gradual shot transitions to be distinguished: this part contains the median variances, meaning the histogram differences in these groups are large enough to contain a gradual shot transition but not so large as to contain an abrupt one. By taking the mean of all the histogram distances in these groups we obtain a

relatively reasonable threshold for detecting gradual shot transitions. To separate gradual shot transitions from abrupt ones, we also need an upper bound on the histogram distances. We take the first part of the sorted vector and calculate the mean of the distances in the corresponding groups of histograms; this mean serves as the threshold that distinguishes gradual from abrupt shot transitions. The method gives relatively accurate results for gradual shot transition detection, and because the thresholds are set automatically, it saves considerable time and human labor.

Figure 14. Workflow of shot transition detection
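A minimal sketch of this automatic threshold selection is given below. It assumes hist_dif is the per-frame difference series from the earlier sketch and that there are at least five groups; it also simplifies by ranking groups on the variance of the difference values themselves rather than on distances from a full k-NN model, so it illustrates the quintile idea rather than reproducing the thesis procedure.

def automatic_thresholds(hist_dif, group_size=50):
    """Split the series into equal groups, rank groups by variance, and take
    t_gradual from the middle (3rd of 5) band and t_cut from the top band."""
    n_groups = len(hist_dif) // group_size
    groups = [hist_dif[g * group_size:(g + 1) * group_size] for g in range(n_groups)]
    order = sorted(range(n_groups), key=lambda g: np.var(groups[g]), reverse=True)
    fifth = max(1, n_groups // 5)
    top_band = order[:fifth]                       # highest-variance groups
    mid_band = order[2 * fifth:3 * fifth]          # median-variance groups
    t_cut = np.mean(np.concatenate([groups[g] for g in top_band]))
    t_gradual = np.mean(np.concatenate([groups[g] for g in mid_band]))
    return t_gradual, t_cut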

Chapter 4: Scene Segmentation

After the video shots are found, scene segmentation is considered. The definition of a scene used in this thesis is given in Chapter 1, and it differs slightly from the definition used in filmmaking, where a scene is generally thought of as the action in a single location and in continuous time. In our case, the action in a scene does not have to take place in the same location: if the background information and color histograms do not change much across consecutive shots, those shots can be grouped into one scene. With this definition we can set up a ground truth for scenes in advance and use it as a reference to test our method. Two criteria are used to segment the movie into scenes: background information and character presence. When generating the ground truth, background information is the first-level criterion; if the background changes abruptly (e.g., from a forest scene to an urban scene), there is a scene transition. Character presence is the secondary criterion: once the first-level scene segments are obtained, we examine the presence of the main characters in consecutive first-level scenes, and if character presence shows continuity, these consecutive scenes are merged into second-level scenes, which form the ground truth for scenes.

In this thesis, the proposed method for automated movie scene segmentation is based on background information and color histogram comparison. Background information here means environmental and spatial information, for instance urban, indoor, or rural environments.

Figure 15. Shot and scene transition, (a) shot transition, (b) scene transition

4.1 Bag-of-Words

To obtain the first-level scenes, the first step is to build a BOW model. We set up 15 categories of background that often appear in movies: Bedroom, Coast, Forest, Highway, Industrial, Inside City, Kitchen, Livingroom, Mountain, Office, Open Country, Store, Street, Suburb and Tall Building. For each type of background we use 100 images as training data, and the feature extraction method applied is SURF, since it is faster than SIFT in practice and gives similar results. We then apply the BOW model introduced in Chapter 2 to obtain the codebook that serves as the reference for background information

categorization. In this process, the most important decision is the vocabulary size: a small vocabulary can raise the misclassification rate, while a very large vocabulary significantly slows down the process and can also hurt categorization precision. The detailed procedure for building the BOW model is given below. The method has four main steps:

(1) Detection and description of image patches
(2) Assigning patch descriptors to a set of predetermined clusters (the vocabulary) with a vector quantization algorithm
(3) Constructing a bag of visual words, which counts the number of patches assigned to each cluster
(4) Applying a multi-class classifier, treating the bag of visual words as the feature vector, to determine which category or categories to assign to the image

These steps are designed to increase accuracy while limiting computational cost. The descriptors extracted in the first step should therefore be invariant to variations that are irrelevant to the categorization task (image transformations, lighting variations, and occlusions) yet rich enough to be discriminative at the category level. The vocabulary should be large enough to capture relevant changes in image parts, but not so large that it also captures irrelevant variations such as noise.
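The encoding part of steps (1) to (3) can be sketched as follows. This is a minimal illustration rather than the exact implementation used in the thesis: it assumes OpenCV and NumPy, uses the SIFT detector (SURF is patented and not available in every OpenCV build), and takes a vocabulary of cluster centers that has already been learned.

import cv2
import numpy as np

def encode_image(image_path, vocabulary):
    """Steps (1)-(3): detect local descriptors, assign each to its nearest
    visual word, and count the assignments into a normalized histogram."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    detector = cv2.SIFT_create()            # SURF could be used where available
    _, descriptors = detector.detectAndCompute(gray, None)
    k = vocabulary.shape[0]
    if descriptors is None:
        return np.zeros(k)
    # squared Euclidean distance from every descriptor to every cluster center
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    bow = np.bincount(words, minlength=k).astype(np.float64)
    return bow / bow.sum()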

Figure 16. Visual Bag-of-Words Steps

We refer to the quantized feature vectors (cluster centers) as visual words, but in our case the visual words do not have a repeatable semantic meaning, nor is there a single best choice of vocabulary. The goal is to find a vocabulary that gives reasonable performance on a given training dataset. In the training process, the algorithm should therefore consider multiple candidate vocabularies:

(1) Detection and description of image patches for a set of labeled training images

(2) Constructing a set of vocabularies: each is a set of cluster centers with respect to which descriptors are vector quantized
(3) Extracting bags of visual words for these vocabularies
(4) Training multi-class classifiers using the bags of visual words as feature vectors
(5) Selecting the vocabulary and classifier giving the best overall classification accuracy

In the first step, a feature extraction algorithm is applied. From previous research we found SIFT and SURF to be two of the most efficient and effective feature extractors. These descriptors have advantages over alternatives such as Gaussian derivatives or differential invariants of the local jet for the following reasons:

1. They are built from simple linear Gaussian derivatives, so we expect them to be more stable to typical image perturbations such as noise than higher-order Gaussian derivatives or differential invariants.

2. The use of a simple Euclidean metric in the feature space seems justified. In the case of differential invariants obtained by combining components of the local jet, a Mahalanobis distance would be more appropriate; for instance, one would expect a second-derivative feature to have higher variance than a first-derivative one, and choosing an appropriate Mahalanobis distance is challenging. It is not appropriate to use the covariance matrix of SIFT descriptors over the entire dataset, since this is predominantly influenced by inter-class variations (or more precisely, by variations between visual words that we do not

wish to ignore). Measuring a Mahalanobis distance would probably require manual specification of multiple homologous matching points between different images of objects of the same category, which works seriously against our objective of producing a simple, automated categorization system.

3. These feature vectors have far more components (128 in SIFT, 64 in the standard SURF descriptor), giving a richer and potentially more discriminative representation.

In our method, the vocabulary is used as a reference set: by comparing descriptors obtained from target images with the vocabulary, the target images can be assigned to clusters. The initial idea would be to compare each descriptor of a target image against all training descriptors, but this is impractical given the huge amount of target data. The number and size of clusters are also important in visual vocabulary construction; smaller clusters tend to increase accuracy while decreasing computational efficiency, so we look for a good tradeoff between cluster number and cluster size. Most clustering or vector quantization algorithms are based on iterative square-error partitioning or on hierarchical techniques. Square-error partitioning algorithms try to obtain the partition that minimizes the within-cluster scatter or, equivalently, maximizes the between-cluster scatter. We apply one simple square-error partitioning method, k-means, which proceeds by iteratively assigning points to their closest cluster centers and recomputing those centers. There are two difficulties with k-means: it converges only to a local optimum of the squared distortion, and the parameter k (the number of clusters) is not determined by the algorithm itself.
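A minimal sketch of vocabulary construction with k-means is shown below. It assumes scikit-learn is available and that all_descriptors stacks the local descriptors extracted from the training images into one (N, 128) array; the resulting cluster centers are the visual words consumed by the encode_image sketch above.

from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, vocabulary_size=1000, seed=0):
    """Cluster the pooled training descriptors; the k cluster centers
    form the visual vocabulary (codebook)."""
    km = KMeans(n_clusters=vocabulary_size, n_init=5, random_state=seed)
    km.fit(all_descriptors)            # all_descriptors: (N, 128) float array
    return km.cluster_centers_         # shape (vocabulary_size, 128)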

In our case, we do not actually need the best possible choice of cluster number; we just need a reasonable tradeoff between accuracy and computational efficiency, since we prefer a system suitable for real-time applications. Once descriptors have been assigned to clusters to form the feature vectors, the problem of generic visual categorization reduces to multi-class supervised learning, with as many classes as defined visual categories. The classifier works in two separate stages to predict the classes of unlabeled images: a training stage and a testing stage. In the training stage, labeled data is fed to the classifier and used to fit a statistical decision procedure for distinguishing the categories. The choice of classifier matters; among the available classifiers we use the Support Vector Machine (SVM), since it is known to produce state-of-the-art results in high-dimensional problems.

4.2 Support Vector Machine

Figure 17. SVM Classifier

SVM constructs a hyperplane or a set of hyperplanes in a high-dimensional space, which can be used for classification, regression, or other tasks. For a good separation, we rely on the hyperplane that has the largest distance to the nearest training data point of any class (the functional margin), because in general the larger the margin is, the lower the generalization error of the classifier. When a problem is stated in a low-dimensional space, it often happens that the sets to discriminate are not linearly separable in that space. To overcome this, the original lower-dimensional space is mapped into a much higher-dimensional space, where separation is easier. To keep the computational load reasonable, the mappings used by SVM schemes are designed so that dot products can be computed easily in terms of the variables in the original space, by defining them through a kernel function k(x, y) selected to suit the problem [26]. The hyperplanes in the higher-dimensional space are defined as the sets of points whose dot product with a fixed vector in that space is constant. The vectors defining the hyperplanes can be chosen as linear combinations, with parameters a_i, of the images of the feature vectors x_i that occur in the database. With this choice of hyperplane, the points x in the feature space that are mapped onto the hyperplane satisfy the relation

\sum_i a_i \, k(x_i, x) = \text{constant}

If k(x, y) becomes small as y grows farther from x, each term in the sum measures the closeness of the test point x to the corresponding database point x_i. The sum of kernels above can therefore be used to measure the relative nearness of each test point to the data points originating in one or the other of the sets to be discriminated.

The hyperplane constructed by the SVM classifier separates two-class data with maximal margin, where the margin is the distance from the closest training data point to the separating hyperplane. For a given observation x with corresponding label y taking values ±1, the classification function can be written as

f(x) = \operatorname{sign}(w^{T} x + b)    (24)

where w and b are the parameters of the hyperplane. The SVM takes two approaches to make datasets separable. First, it introduces a constant C as an error weighting constant, which penalizes misclassification of samples in proportion to their distance from the classification boundary. Second, a mapping Φ is made from the original data space X to another feature space of higher dimension. One advantage of the SVM is that it can be formulated entirely in terms of scalar products in this second feature space, through the kernel

K(u, v) = \Phi(u) \cdot \Phi(v)    (25)

Both the kernel K and the penalty C are problem dependent and need to be determined in advance. In the kernel formulation, the decision function can be expressed as

f(x) = \operatorname{sign}\left( \sum_i y_i a_i K(x, x_i) + b \right)    (26)

where x_i are training feature vectors from the data space X and y_i is the label of x_i. The parameters a_i are zero for most i; equivalently, the sum can be taken over only a select few of the x_i. These feature vectors are known as support vectors, and it has been proven that they are the feature vectors lying nearest to the separating hyperplane.
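As a rough illustration, a multi-class classifier over the bag-of-words vectors could be trained as below. This assumes scikit-learn, that bow_vectors holds one encoded histogram per training image and labels the corresponding category names, and that the kernel and C actually used in the thesis may differ.

from sklearn.svm import SVC

def train_bow_classifier(bow_vectors, labels, C=1.0):
    """Multi-class SVM (one-vs-one internally) on bag-of-words histograms."""
    clf = SVC(kernel="rbf", C=C, gamma="scale")
    clf.fit(bow_vectors, labels)
    return clf

# Example use with the earlier sketch (names are illustrative):
# category = clf.predict([encode_image("keyframe.jpg", vocabulary)])[0]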

After applying the SVM, the BOW model is built and the feature descriptors for the training images are found. The classification accuracy is then measured by sending testing data into the model and comparing the classification results with the ground truth of the testing data; each testing descriptor is categorized to its nearest cluster center. Accuracy is not the most critical issue in our case, since we only need the feature descriptors to detect scene transitions. The first-level scenes are obtained by merging consecutive shots whose feature descriptors lie within a threshold Euclidean distance of each other.

Table 2. Example of BOW model accuracy check (rows are ground-truth categories, columns are predicted categories: bedroom, coast, forest, highway, kitchen)

4.3 Color Histogram Comparison

After the model is built, we pass the key frame of each shot (we take the first frame as the key frame) through the BOW model and obtain a feature descriptor per shot key frame. A descriptor is a vector whose dimension equals the vocabulary size, and each element is the probability that a patch of the frame belongs to the corresponding cluster. The scene boundary

is detected by examining the Euclidean distance between the descriptors of consecutive shot key frames. The Euclidean distance between two vectors p and q is

d(p, q) = d(q, p) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2}    (27)

A threshold is needed for declaring a scene transition; the threshold setting criterion is very similar to the method discussed in Chapter 3. Analyzing the resulting first-level scenes, we found that they contain most of the true scene boundaries, but many scenes are split into sub-scenes, which differs from the ground truth. Therefore, we apply a color histogram comparison after this step.

Figure 18. Expression of Euclidean distance (d) in 2-D space
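A minimal sketch of the first-level grouping under these definitions, assuming shot_descriptors holds one BOW vector per shot key frame in shot order and scene_threshold was chosen as described in Chapter 3.

import numpy as np

def first_level_scenes(shot_descriptors, scene_threshold):
    """Merge consecutive shots whose BOW descriptors are closer than the
    threshold; returns a list of (first_shot, last_shot) index pairs."""
    scenes, start = [], 0
    for i in range(1, len(shot_descriptors)):
        dist = np.linalg.norm(shot_descriptors[i] - shot_descriptors[i - 1])
        if dist > scene_threshold:          # background changed: new scene
            scenes.append((start, i - 1))
            start = i
    scenes.append((start, len(shot_descriptors) - 1))
    return scenes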

Figure 19. Example of histograms for three channels of RGB color space

The color histogram is a simple method for comparing images: the idea is to summarize the color information of each image. Usually RGB information is used, but RGB histograms can differ considerably even between two frames within one shot, so we need a more uniform color representation; HSV is therefore chosen. HSV is a color space designed as an alternative to RGB. It is a more intuitive color-mixing model and consists of three components:

Hue: The hue of a color refers to which pure color it resembles; all tints, tones, and shades of red have the same hue. Hues are described by a number that specifies the position of the corresponding pure color on the color wheel, as a fraction between 0 and 1. The value 0 refers to red, 1/6 is yellow, 1/3 is green, and so forth around the wheel.

Saturation: The saturation of a color describes how much the color is diluted with white. A pure red is fully saturated, with a saturation of 1; tints of red have saturations less than 1, and white has a saturation of 0.

Value: The value of a color describes how dark it is. A value of 0 is black, with lightness increasing as one moves away from black.

The HSV model can be derived geometrically, or thought of as a specific instance of a generalized LHS model. The idea of the RGB-to-HSV transformation is to take the RGB cube, with the amounts of red, green, and blue light in a color denoted R, G, B, and tilt it onto its corner so that black rests at the origin with white directly above it along the vertical axis. The hue of a color in the cube is then measured by its angle around that axis, starting with red at 0°; brightness is characterized along the axis, and saturation ranges from 0 on the axis to 1 at the most colorful point for each setting of the other parameters. The transformation can be written as:

M = \max(R, G, B)
m = \min(R, G, B)
C = M - m

H' = \begin{cases} \left(\dfrac{G - B}{C}\right) \bmod 6, & \text{if } M = R \\[4pt] \dfrac{B - R}{C} + 2, & \text{if } M = G \\[4pt] \dfrac{R - G}{C} + 4, & \text{if } M = B \end{cases}

H = 60^{\circ} \times H'

V = M

S = \begin{cases} 0, & \text{if } V = 0 \\[2pt] \dfrac{C}{V}, & \text{otherwise} \end{cases}
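The equations above translate directly into code. The sketch below converts a single pixel with R, G, B in [0, 1]; in practice a library routine such as OpenCV's cv2.cvtColor(frame, cv2.COLOR_BGR2HSV) would normally be used instead.

def rgb_to_hsv(r, g, b):
    """Direct implementation of the equations above; r, g, b in [0, 1]."""
    M, m = max(r, g, b), min(r, g, b)
    c = M - m
    if c == 0:
        h_prime = 0.0                   # hue is undefined for grays
    elif M == r:
        h_prime = ((g - b) / c) % 6
    elif M == g:
        h_prime = (b - r) / c + 2
    else:
        h_prime = (r - g) / c + 4
    h = 60.0 * h_prime                  # degrees in [0, 360)
    v = M
    s = 0.0 if v == 0 else c / v
    return h, s, v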

The HSV color wheel is used to pick a desired color: hue is represented by the circle of the wheel, while a separate triangle represents saturation and value, with the horizontal axis of the triangle indicating value and the vertical axis indicating saturation. Compared to RGB, the HSV color space is much closer to the way humans perceive color, which makes colors easier to describe. After obtaining the HSV features and forming the vectors, we compute the difference between the key frames of every two consecutive shots, again expressed as a Euclidean distance; by comparing this distance to the pre-set threshold, scene transitions can be found. The color histogram is applied to the last shot of the previous scene and the first shot of the following scene: if the histogram difference is greater than the threshold, there is a scene transition between these two shots. To set the threshold, we first obtain the feature descriptors and build a vector of distances between every two consecutive key frames, where element (i, 1) of this vector is the distance between the i-th and (i+1)-th key frames. We use a window size of 8 to take blocks along this vector and calculate the variance of every block; if the variance is high enough, it is reasonable to conclude that there is a scene transition among these 8 shots. We sort the blocks in descending order of variance and take the first 20 percent of them.

By finding the largest element in each of these blocks and calculating the mean of those elements, we can set a reasonable threshold for scene transitions. This method is described in detail in Chapter 3.

Figure 20. HSV color space illustration

The color histogram difference in RGB space combines the per-channel differences to form a single histogram difference, while in HSV space we compare only the hue channel rather than all three channels, since hue carries the color information. The reason is that in RGB space there are cases where the color difference as perceived by a human is very small, yet the histogram difference between the two frames is relatively large. For example, when there is only a brightness change in a static scene, the color histogram of

two consecutive frames should not change much, but when the comparison is done in RGB space the histogram difference tends to be larger than expected:

Figure 21. (a) Two consecutive frames with brightness change, (b) color histogram difference in RGB space, (c) color histogram difference in HSV space
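A minimal sketch of the hue-only comparison used to refine the first-level scenes is given below. It assumes OpenCV frames in BGR order; the bin count and the threshold are illustrative values, not the thesis's exact settings, which are derived automatically as described above.

import cv2
import numpy as np

def hue_histogram(frame_bgr, bins=36):
    """Normalized histogram of the hue channel (OpenCV hue range is 0-179)."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0], None, [bins], [0, 180]).ravel()
    return hist / hist.sum()

def is_scene_transition(last_keyframe, next_keyframe, threshold=0.4):
    """Compare the last shot of one candidate scene with the first shot of
    the next; a large hue-histogram difference confirms a scene boundary."""
    diff = np.abs(hue_histogram(last_keyframe) - hue_histogram(next_keyframe)).sum()
    return diff > threshold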

Chapter 5: Experiments and Results

In our experiments, we first evaluate the algorithm for shot transition detection. Abrupt shot transitions are easy to detect; the main task in our case is to model gradual shot transitions appropriately. After shot detection, we apply the Bag-of-Words model for background information extraction and identify a good tradeoff between accuracy and computational efficiency. Color histogram comparison is applied last to reduce the error rate of scene transition detection and obtain the second-level scenes. We use 9 movies as our database; a detailed description of this database is given in Table 3.

5.1 Shot detection

Shot detection is the baseline for our scene segmentation method, and its result is important since it directly affects the scene segmentation. For camera motion based shot detection, function (8) is applied. We choose n to be 9, which determines the size of the search area; this allows enough features to be tracked in one block, although very large camera movements may exceed the search range. In histogram comparison based shot detection, the range of window sizes is set to 10 to 200, which is a reasonable range for the number of frames within one shot. After determining n, the histograms can be produced.

Movie name               Movie type         True # shots    # shots detected
Avatar                   Science fiction
Braveheart               History and war
Harry Potter 7 Part 1    Fantasy
Harry Potter 7 Part 2    Fantasy
Harry Potter 6           Fantasy
Lord of The Ring 3       Fantasy
Matrix                   Science fiction
Troy                     History and war
Year One                 Comedy
Table 3. Description of dataset and shot detection results

The final set of shots is the intersection of the results from motion based and histogram comparison based shot transition detection. For abrupt shot transitions, the motion based method has an advantage over the histogram based method: the camera motion at an abrupt transition is much larger than at a gradual transition, whereas the histogram difference between abrupt and gradual transitions is harder to separate and more expensive to compute. By setting a threshold on the camera motion difference and comparing the detected camera motion between consecutive frames to that threshold, abrupt shot transitions are detected.

Figure 22. Abrupt shot transition, (a) entry and exit frame of the previously detected shot, (b) entry and exit frame of the latter detected shot

The camera motion for these two consecutive shots is shown below:

Figure 23. Camera motion for abrupt shot transition

The camera motion pattern shows that when there is an abrupt shot transition, the camera motion around those frames has a very large amplitude, which makes it easy to detect. From Figure 23 we can see three peaks in the motion signal, although there is only one shot transition in this video sequence. The first peak is caused by object motion in the first shot, and in the second shot there is a camera zoom, which produces the third peak. The amplitudes caused by object motion and camera zoom are lower than that of the abrupt shot transition, so we set a threshold to eliminate the effect of object motion and camera zoom (and other gradual camera movements such as tilt and pan) on abrupt shot transition detection. The threshold setting method, introduced in Chapter 3, is automated and reduces the required human labor. For gradual shot transitions, we prefer the histogram comparison based detection method, because during a fade or dissolve the camera motion amplitude tends to stay near 0, which would cause the motion based method to miss the transition.

Figure 24. Example of gradual (fade out) shot transition

The camera motion in this case is shown below:

Figure 25. Camera motion in fade out shot transition

The result shows that the camera motion stays low for the whole duration, which would be classified as no shot transition by the motion based method, even though a gradual shot transition is present. To detect this kind of shot transition, we apply the histogram based method instead. The inter-shot and intra-shot histogram differences follow a strong pattern, shown in Chapter 3, and we can use this

pattern to detect gradual shot transitions. The color histogram difference for the example is given below:

Figure 26. Color histogram difference in gradual shot transition

By combining the camera motion method and the color histogram comparison method, both abrupt and gradual shot transitions are found, giving the shot-level segmentation of the video. From the experimental results we can see that the difference between the true number of shots and the number of shots detected is rather small, and the number of detected shots is usually larger than the true number. When

examining the key frames of the detected shots and comparing them to the ground truth, we found that some shots are still misdetected. Some shots are split when they should not be, mostly because the color histogram changes abruptly within a single shot, which leads to a false shot boundary. An example is shown below:

Figure 27. Example of misdetection of shot boundary, (a) entry and exit frame of the previously detected shot, (b) entry and exit frame of the latter detected shot

We tried to resolve this problem using camera motion, but it did not work well because the camera and the objects in these shots tend to be static. To evaluate our shot detection method, we compare our results with the ground truth we generated. The ground truth was generated by observing the shot transitions while watching the movies and is recorded as the times at which shot transitions occur, while the detected shots are recorded as the frame numbers at which transitions are detected. An example of the comparison between the ground truth and the detected shots is given in Table 4:

Ground Truth (time in movie)    Detected (frame number)
00:00:00 - 00:00:…
00:00:20 - 00:00:…
00:00:57 - 00:01:…
00:01:31 - 00:01:…
00:01:56 - 00:01:…
00:01:59 - 00:02:…
Table 4. Comparison between ground truth and shots detected (movie Matrix, partial)

Given the movie's total frame count and its rate of 22 frames per second, we transformed the ground truth times into frame numbers and found that the differences between the ground truth and the detected shot boundaries are mostly within 6 frames, which indicates that the shot detection method is reliable. Overall, the shot detection works fine, and its accuracy is sufficient for scene transition detection.

5.2 Scene Segmentation

Scene segmentation is the second level of video summarization. A scene is a combination of consecutive shots that share similar background information. For example, in a dialogue scene the camera cuts back and forth between the characters having the conversation; these shots are taken in the same place, so their background information is similar, and they should be grouped into one scene.

Vocabulary size    Accuracy (%)
Table 5. Vocabulary size and its precision

First, the BOW model is applied; formulas (6) and (7) are used in this model. In the experiments we tested several vocabulary sizes, where the vocabulary size is the vector dimension in formula (6). In the experiment stage we chose 15 categories of training images, introduced in Chapter 4. To test the precision we prepared a set of testing images, with 100 images per category matching the training categories. The accuracy here is the overall precision over all BOW classification results, calculated as the mean of the per-category classification accuracy. Table 5 lists the vocabulary sizes and their corresponding precision.

Figure 28. Examples of training images, set (a) is forest, set (b) is living room

From the experiments, we found a pattern relating the vocabulary size k to the classification accuracy of the model: accuracy increases with vocabulary size, but once the vocabulary is large enough, further increases have little effect. The experimental results are shown below:

Figure 29. Error rate of the classification against the vocabulary size

Table 6. Part of model accuracy check for vocabulary size of … (rows are ground-truth categories, columns are predicted categories: bedroom, coast, forest, highway, kitchen)

Table 7. Part of model accuracy check for vocabulary size of 1000 (rows are ground-truth categories, columns are predicted categories: bedroom, coast, forest, highway, kitchen)

In our case we do not chase the absolute best vocabulary size for classification; the tradeoff between accuracy and computational efficiency is our main concern. Since a larger vocabulary reduces computational efficiency, we evaluate the payoff of increasing the vocabulary size. From Figure 29 we can see that once the vocabulary size reaches 1000, the error rate decreases only slowly with further increases, so we choose 1000 as the vocabulary size. The next step is to use the trained BOW model to process the key frames of the shots and obtain their feature descriptors. Each feature descriptor has the dimension of the vocabulary size, and each element is the correlation with the corresponding cluster. The threshold is then set using the method described in Chapter 4, and the first-level scenes are found by comparing the features of consecutive shots, where the difference is the Euclidean distance between the feature vectors: the larger the distance, the larger the difference in background content.

Figure 30. (a) two shots in one scene; (b) a scene transition

This method can find all the scene transitions in a movie, but it also produces misdetections. The problem is that some shots belonging to one scene are not grouped together; in other words, some scenes get split. This has several causes, the most important being significant camera movement: if the camera movement between two shots is too large (e.g., a large change in viewpoint), the background information may change substantially and affect the feature descriptors extracted from the shot key frames. We therefore apply a histogram comparison method to obtain the second-level scenes. This histogram comparison is similar to the color histogram method used in shot transition detection, except that we use the HSV color space rather than RGB. The scene segmentation results are shown in Table 8.

Figure 31. RGB to HSV

We choose HSV rather than RGB because the histogram difference in RGB space is easily influenced by many factors and can be significant even within one shot, so a more uniform color space is needed for the experiments.

Movie name               # of true scenes    # of first level scenes    # of second level scenes
Avatar
Braveheart
Harry Potter 7 Part 1
Harry Potter 7 Part 2
Harry Potter 6
Lord of The Ring 3
Matrix
Troy
Year One
Table 8. Movie scene segmentation results


Beyond Bags of Features

Beyond Bags of Features : for Recognizing Natural Scene Categories Matching and Modeling Seminar Instructed by Prof. Haim J. Wolfson School of Computer Science Tel Aviv University December 9 th, 2015

More information

Fitting: The Hough transform

Fitting: The Hough transform Fitting: The Hough transform Voting schemes Let each feature vote for all the models that are compatible with it Hopefully the noise features will not vote consistently for any single model Missing data

More information

Visual Tracking (1) Tracking of Feature Points and Planar Rigid Objects

Visual Tracking (1) Tracking of Feature Points and Planar Rigid Objects Intelligent Control Systems Visual Tracking (1) Tracking of Feature Points and Planar Rigid Objects Shingo Kagami Graduate School of Information Sciences, Tohoku University swk(at)ic.is.tohoku.ac.jp http://www.ic.is.tohoku.ac.jp/ja/swk/

More information

Fitting: The Hough transform

Fitting: The Hough transform Fitting: The Hough transform Voting schemes Let each feature vote for all the models that are compatible with it Hopefully the noise features will not vote consistently for any single model Missing data

More information

ECG782: Multidimensional Digital Signal Processing

ECG782: Multidimensional Digital Signal Processing Professor Brendan Morris, SEB 3216, brendan.morris@unlv.edu ECG782: Multidimensional Digital Signal Processing Spring 2014 TTh 14:30-15:45 CBC C313 Lecture 10 Segmentation 14/02/27 http://www.ee.unlv.edu/~b1morris/ecg782/

More information

Lecture 6: Edge Detection

Lecture 6: Edge Detection #1 Lecture 6: Edge Detection Saad J Bedros sbedros@umn.edu Review From Last Lecture Options for Image Representation Introduced the concept of different representation or transformation Fourier Transform

More information

The Lucas & Kanade Algorithm

The Lucas & Kanade Algorithm The Lucas & Kanade Algorithm Instructor - Simon Lucey 16-423 - Designing Computer Vision Apps Today Registration, Registration, Registration. Linearizing Registration. Lucas & Kanade Algorithm. 3 Biggest

More information

Local Image Features

Local Image Features Local Image Features Computer Vision Read Szeliski 4.1 James Hays Acknowledgment: Many slides from Derek Hoiem and Grauman&Leibe 2008 AAAI Tutorial Flashed Face Distortion 2nd Place in the 8th Annual Best

More information

Local Image Features

Local Image Features Local Image Features Computer Vision CS 143, Brown Read Szeliski 4.1 James Hays Acknowledgment: Many slides from Derek Hoiem and Grauman&Leibe 2008 AAAI Tutorial This section: correspondence and alignment

More information

Feature Tracking and Optical Flow

Feature Tracking and Optical Flow Feature Tracking and Optical Flow Prof. D. Stricker Doz. G. Bleser Many slides adapted from James Hays, Derek Hoeim, Lana Lazebnik, Silvio Saverse, who in turn adapted slides from Steve Seitz, Rick Szeliski,

More information

Motion Tracking and Event Understanding in Video Sequences

Motion Tracking and Event Understanding in Video Sequences Motion Tracking and Event Understanding in Video Sequences Isaac Cohen Elaine Kang, Jinman Kang Institute for Robotics and Intelligent Systems University of Southern California Los Angeles, CA Objectives!

More information

Yudistira Pictures; Universitas Brawijaya

Yudistira Pictures; Universitas Brawijaya Evaluation of Feature Detector-Descriptor for Real Object Matching under Various Conditions of Ilumination and Affine Transformation Novanto Yudistira1, Achmad Ridok2, Moch Ali Fauzi3 1) Yudistira Pictures;

More information

Region-based Segmentation

Region-based Segmentation Region-based Segmentation Image Segmentation Group similar components (such as, pixels in an image, image frames in a video) to obtain a compact representation. Applications: Finding tumors, veins, etc.

More information

SURF. Lecture6: SURF and HOG. Integral Image. Feature Evaluation with Integral Image

SURF. Lecture6: SURF and HOG. Integral Image. Feature Evaluation with Integral Image SURF CSED441:Introduction to Computer Vision (2015S) Lecture6: SURF and HOG Bohyung Han CSE, POSTECH bhhan@postech.ac.kr Speed Up Robust Features (SURF) Simplified version of SIFT Faster computation but

More information

Using Geometric Blur for Point Correspondence

Using Geometric Blur for Point Correspondence 1 Using Geometric Blur for Point Correspondence Nisarg Vyas Electrical and Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA Abstract In computer vision applications, point correspondence

More information

Fitting: The Hough transform

Fitting: The Hough transform Fitting: The Hough transform Voting schemes Let each feature vote for all the models that are compatible with it Hopefully the noise features will not vote consistently for any single model Missing data

More information

Bridging the Gap Between Local and Global Approaches for 3D Object Recognition. Isma Hadji G. N. DeSouza

Bridging the Gap Between Local and Global Approaches for 3D Object Recognition. Isma Hadji G. N. DeSouza Bridging the Gap Between Local and Global Approaches for 3D Object Recognition Isma Hadji G. N. DeSouza Outline Introduction Motivation Proposed Methods: 1. LEFT keypoint Detector 2. LGS Feature Descriptor

More information

Automatic Tracking of Moving Objects in Video for Surveillance Applications

Automatic Tracking of Moving Objects in Video for Surveillance Applications Automatic Tracking of Moving Objects in Video for Surveillance Applications Manjunath Narayana Committee: Dr. Donna Haverkamp (Chair) Dr. Arvin Agah Dr. James Miller Department of Electrical Engineering

More information

Announcements. Edges. Last Lecture. Gradients: Numerical Derivatives f(x) Edge Detection, Lines. Intro Computer Vision. CSE 152 Lecture 10

Announcements. Edges. Last Lecture. Gradients: Numerical Derivatives f(x) Edge Detection, Lines. Intro Computer Vision. CSE 152 Lecture 10 Announcements Assignment 2 due Tuesday, May 4. Edge Detection, Lines Midterm: Thursday, May 6. Introduction to Computer Vision CSE 152 Lecture 10 Edges Last Lecture 1. Object boundaries 2. Surface normal

More information

CS 4495 Computer Vision Motion and Optic Flow

CS 4495 Computer Vision Motion and Optic Flow CS 4495 Computer Vision Aaron Bobick School of Interactive Computing Administrivia PS4 is out, due Sunday Oct 27 th. All relevant lectures posted Details about Problem Set: You may *not* use built in Harris

More information