Video Syntax Analysis


1 Video Syntax Analysis
Wei-Ta Chu
2008/10/9

2 Outline
Scene boundary detection
Key frame selection

3 Announcement of HW #1: Shot Change Detection
Goal: automatic shot change detection
Requirements
1. Write a program to automatically perform shot change detection
2. Write a report that describes
2.1. How to run your program
2.2. What kinds of features or algorithms you used or compared
2.3. Detection performance in precision and recall, or even ROC/PR curves
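
For the evaluation part of the report, precision and recall could be computed along the following lines. This is only a minimal sketch and not part of the assignment handout: the function name `precision_recall` and the `tolerance` parameter (how many frames a detection may be off and still count as correct) are assumptions made for illustration.

```python
def precision_recall(detected, ground_truth, tolerance=0):
    """Compare detected shot-change frame indices with ground truth.

    A detection is a hit if it lies within `tolerance` frames of a
    ground-truth boundary that has not been matched already.
    """
    unmatched = sorted(ground_truth)
    hits = 0
    for d in sorted(detected):
        for g in unmatched:
            if abs(d - g) <= tolerance:
                unmatched.remove(g)  # each true boundary is matched once
                hits += 1
                break
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall


if __name__ == "__main__":
    print(precision_recall([10, 52, 99, 140], [10, 50, 100], tolerance=2))
```

Sweeping the detector's threshold and recording one (precision, recall) pair per setting gives the points of a PR curve.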

4 Announcement of HW #1: Shot Change Detection
Evaluation data
Homework submission: pack your programs and report into one zip file, and mail it to [address missing]
Deadline: 12:00, Oct. 26, 2008
Grades will be given based on detection performance and the descriptions in your report.

5 Implementation
Do this homework with any tool you are familiar with.
Related resources
libmpeg2
FFmpeg
OpenCV library
MATLAB toolbox

6 Scene Transition Graph
Yeung, et al. "Segmentation of video by clustering and graph analysis," Computer Vision and Image Understanding, vol. 71, no. 1, 1998.

7 Observations
Shots in a scene are often repetitive.
We are able to classify shots by grouping shots of similar visual content.
Often, a scene is made up of temporally adjacent shots, which indicates their interrelationships.

8 Similarity of Video Shots
D(.,.) measures the dissimilarity between two image frames.

9 Similarity of Video Shots
Dissimilarity based on color histogram intersection
Dissimilarity based on luminance projection
Yeung and Liu, "Efficient matching and clustering of video shots," Proc. of IEEE International Conference on Image Processing, vol. 1, 1995.
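
The histogram-intersection variant of D can be sketched as follows. The slide's formulas did not survive extraction, so two assumptions are made here: D is taken as one minus the intersection of normalized histograms, and the shot-level dissimilarity is taken as the smallest frame-level D over all pairs of representative images. The function names are illustrative.

```python
import numpy as np


def frame_dissimilarity(hist_a, hist_b):
    """1 - histogram intersection of two normalized color histograms."""
    a = np.asarray(hist_a, dtype=float)
    b = np.asarray(hist_b, dtype=float)
    a = a / a.sum()
    b = b / b.sum()
    return 1.0 - float(np.minimum(a, b).sum())


def shot_dissimilarity(rep_a, rep_b):
    """Assumed shot-level measure: best match over representative images."""
    return min(frame_dissimilarity(f, g) for f in rep_a for g in rep_b)


if __name__ == "__main__":
    print(frame_dissimilarity([2, 2, 0, 0], [1, 0, 1, 0]))
```

Identical histograms give 0 and histograms with no common bins give 1, so the value can be thresholded directly.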

10 Representative Image Set for a Video Shot
Selection of the representative set is achieved by nonlinear temporal sampling.

11 Representative Image Set for a Video Shot
Only 2 to 5% of frames need to be compared to achieve good matching results.
In addition to temporal subsampling, spatial subsampling can also be used to improve matching efficiency.

12 Clustering of Video Shots
Shots in the same cluster are similar.
Any shot outside a cluster must be more dissimilar to the shots in that cluster than those shots are to each other.
C_i: the i-th cluster

13 Clustering of Video Shots
Dissimilarity between two clusters: the largest dissimilarity over all shot pairs in which the two shots belong to the two different clusters.
The dissimilarity between clusters should be updated at each iteration.
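
The cluster dissimilarity described above corresponds to complete-link agglomerative clustering. The sketch below assumes merging stops once the closest pair of clusters is at least `delta` apart. The stopping rule and all names are assumptions for illustration, and the cluster distances are simply recomputed at every iteration rather than updated incrementally.

```python
def complete_link_clusters(dist, delta):
    """Group shots given a symmetric shot-dissimilarity matrix `dist`.

    Clusters are merged greedily; two clusters are as far apart as their
    most dissimilar pair of shots (complete link).
    """
    clusters = [[i] for i in range(len(dist))]

    def cluster_dist(a, b):
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_dist(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] >= delta:  # nothing similar enough is left to merge
            break
        _, x, y = best
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return clusters


if __name__ == "__main__":
    D = [[0.0, 0.9, 0.1, 0.9],
         [0.9, 0.0, 0.9, 0.2],
         [0.1, 0.9, 0.0, 0.9],
         [0.9, 0.2, 0.9, 0.0]]
    print(complete_link_clusters(D, delta=0.5))
```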

14 Clustering of Video Shots

15 Time-Constrained Clustering
Two shots that are far apart in time, even if they share similar visual content, potentially represent different content or occur in different scenes.
Temporal distance between two shots: the number of frames from the end of the earlier shot to the beginning of the later one.
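
One way to impose the time constraint is to make two shots infinitely dissimilar once their temporal distance exceeds a window. The sketch below assumes shots are given as (start_frame, end_frame) pairs and that the window is expressed in frames. The infinite penalty and the names are assumptions, since the slide's formula was not preserved.

```python
import math


def temporal_distance(shot_a, shot_b):
    """Frames from the end of the earlier shot to the start of the later one."""
    (start_a, end_a), (start_b, end_b) = shot_a, shot_b
    if start_a <= start_b:
        return max(0, start_b - end_a)
    return max(0, start_a - end_b)


def time_constrained(dissimilarity, shot_a, shot_b, window):
    """Keep the visual dissimilarity only for shots within the time window."""
    if temporal_distance(shot_a, shot_b) > window:
        return math.inf  # such shots can never be clustered together
    return dissimilarity


if __name__ == "__main__":
    print(time_constrained(0.1, (0, 99), (300, 399), window=150))
```

Feeding the constrained values into the clustering step keeps visually similar but temporally distant shots in separate clusters.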

16 Scene Transition Graph
A scene transition graph (STG) is a directed graph G = (V, E, F).
V: each node represents a cluster of shots.
E: a directed edge is drawn from node U to node W if a shot represented by U immediately precedes a shot represented by W.
F: a mapping that partitions the set of shots into clusters.
An STG compactly represents the structure of shots and the temporal flow of the story for many video programs.
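
Given the cluster label of every shot in temporal order, V and E can be built in a few lines. This is a minimal sketch: the function name is illustrative, and skipping self-loops when two consecutive shots fall in the same cluster is an assumption.

```python
def build_stg(labels):
    """labels[i] is the cluster label of the i-th shot in temporal order."""
    nodes = set(labels)
    edges = set()
    for prev, curr in zip(labels, labels[1:]):
        if prev != curr:  # assumption: no self-loops
            edges.add((prev, curr))
    return nodes, edges


if __name__ == "__main__":
    print(build_stg(["A", "B", "A", "B", "C"]))
```

The list `labels` plays the role of the mapping F, since it assigns each shot to its cluster.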

17 Example of STG

18 Cut Edges
An edge is a cut edge if removing it results in two disconnected graphs.
Each partitioned STG G_i represents the interactions of shots in a story unit.
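
Cut edges can be found by brute force: remove each directed edge in turn and check whether the graph falls apart. The sketch below assumes "disconnected" means weak connectivity (edge direction is ignored when testing reachability), so a pair of opposite edges U->W and W->U is never cut. All names are illustrative.

```python
def weak_components(nodes, edges):
    """Connected components when edge direction is ignored."""
    neighbours = {n: set() for n in nodes}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            n = stack.pop()
            if n not in component:
                component.add(n)
                stack.extend(neighbours[n] - component)
        seen |= component
        components.append(component)
    return components


def cut_edges(nodes, edges):
    """Edges (given as a set) whose removal increases the component count."""
    base = len(weak_components(nodes, edges))
    return {e for e in edges
            if len(weak_components(nodes, edges - {e})) > base}


def story_units(labels):
    """Split shots (indexed in temporal order) into story units."""
    nodes = set(labels)
    edges = {(a, b) for a, b in zip(labels, labels[1:]) if a != b}
    remaining = edges - cut_edges(nodes, edges)
    units = [[i for i, lab in enumerate(labels) if lab in comp]
             for comp in weak_components(nodes, remaining)]
    return sorted(units)


if __name__ == "__main__":
    # Shot order B1 A1 B2 A2 B3 A3 B4 C1 D1 with clusters
    # X = {B1, B2}, A = {A1, A2, A3}, Y = {B3, B4}, C = {C1}, D = {D1}
    print(story_units(["X", "A", "X", "A", "Y", "A", "Y", "C", "D"]))
```

With the labels in the demo, the first seven shots form one story unit and the last two shots form one unit each.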

19 STG After Time Confining and Cut Edge Finding

20 Framework
Shot segmentation
Time-constrained clustering
Building of scene transition graph
Scene segmentation

21 Influences of Parameters
Without knowing how long each individual scene lasts, the time window T cannot be approximated well.
If T is too large, shots from different scenes are clustered together.
If T is too small, shots in the same scene may be separated into different scenes.
It is less detrimental to have several story units represent a scene than to have one story unit represent several scenes.

22 Influences of Time Constraints
T = 20 s, d_t(B1, B3) > T
Clustering results are {B1, B2}, {A1, A2, A3}, {B3, B4}, {C1}, {D1}
Story unit results are {B1, A1, B2, A2, B3, A3, B4}, {C1}, {D1}
[Figure: STG with nodes {B1, B2}, {A1, A2, A3}, {B3, B4}, {C1}, {D1}]

23 Influences of Time Constraints
T = 20 s
Clustering results are {B1}, {B2, B3}, {A1, A2, A3}, {B4}, {C1}, {D1}
Story unit results are {B1}, {A1, B2, A2, B3, A3}, {B4}, {C1}, {D1}
[Figure: STG with nodes {B1}, {A1, A2, A3}, {B2, B3}, {B4}, {C1}, {D1}]

24 Refined Analysis
Make the time window more elastic.
Compute the duration of each story unit and adjust the time window accordingly.
Given a story unit, examine the next story unit by relaxing the temporal window and reclustering the shots in the two units.
If at least one new cluster contains shots from both units, the two story units are merged into one.

25 Refined Analysis

26 Example

27 Results
STG constructed from the sitcom Friends.
There are [frame count missing] frames, each at a spatial resolution of 320x240.
There are 313 shots.

28 Results
Time-constrained clustering of video shots is able to identify individual story units.
The resulting STG permits rapid nonlinear browsing of long video programs.

29 Variations of Clustering Parameters
Smaller delta values result in more clusters and thus more story units.
Users often prefer over-segmentation to under-segmentation.

30 Refining the Segmentation Results
The first two story units in Scene 1 are merged into one.
The number of story units in Scene [number missing] is reduced from 4 to 2.

31 Conclusion
Analysis based on time-constrained clustering and scene transition graphs contributes to the extraction of story units.
Building the story structure provides nonlinear access to video content.
Identification, integration, and application of domain-dependent and semantic features tend to improve segmentation accuracy.

32 Scene Detection in Movies and TV Shows
Rasheed, et al. "Scene detection in Hollywood movies and TV shows," Proc. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, 2003.

33 Introduction
Every year around 4500 motion pictures are released around the world, spanning approximately 9000 hours of video.
Inexpensive and popular digital technology, such as video on demand, is available through cable and the internet.
Accessing video content
Detect shots and a set of key frames
Combine similar shots to form scenes or story units

34 Problems in Previous Works
A false color match between shots of two different scenes may wrongly combine the scenes.
Action scenes may be broken into many scenes because they do not satisfy the color matching criterion.

35 System Flowchart
BSC: backward shot coherence
PSB: potential scene boundaries

36 Shot Detection
Based on histogram intersection
16-bin normalized HSV color histogram
8 bins for hue, 4 bins each for saturation and value
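
The histogram layout above can be sketched with NumPy as follows. The sketch assumes frames are already converted to HSV with all three channels scaled to [0, 1], that the 16 bins are the concatenation of the three per-channel histograms, and that a boundary is declared when the intersection of consecutive frames drops below a threshold. The threshold value and the function names are illustrative.

```python
import numpy as np


def hsv_histogram(hsv):
    """16-bin histogram of an (H, W, 3) HSV frame with channels in [0, 1]."""
    h, _ = np.histogram(hsv[..., 0], bins=8, range=(0.0, 1.0))
    s, _ = np.histogram(hsv[..., 1], bins=4, range=(0.0, 1.0))
    v, _ = np.histogram(hsv[..., 2], bins=4, range=(0.0, 1.0))
    hist = np.concatenate([h, s, v]).astype(float)
    return hist / hist.sum()  # normalize so the bins sum to 1


def intersection(hist_a, hist_b):
    """Histogram intersection: 1 for identical histograms, 0 for disjoint."""
    return float(np.minimum(hist_a, hist_b).sum())


def shot_boundaries(frames, threshold):
    """Indices i where frame i starts a new shot."""
    hists = [hsv_histogram(f) for f in frames]
    return [i for i in range(1, len(hists))
            if intersection(hists[i - 1], hists[i]) < threshold]


if __name__ == "__main__":
    a = np.zeros((4, 4, 3))
    a[..., 1], a[..., 2] = 0.9, 0.2
    b = np.zeros((4, 4, 3))
    b[..., 0], b[..., 1], b[..., 2] = 0.4, 0.3, 0.9
    print(shot_boundaries([a, a, b, b], threshold=0.8))
```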

37 Key Frame Selection
(1) Initially, the middle frame of the shot is selected and added to the initially empty set K_i.
(2) Each frame within the shot is compared to every frame in K_i.
(3) If the frame differs from all previously chosen key frames by more than a fixed threshold, it is added to K_i.
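
The three steps above can be sketched as follows. The `difference` callable and the threshold are placeholders supplied by the caller (for instance, one minus a histogram intersection), and scanning the frames in temporal order is an assumption, since the slide does not say in which order frames are visited.

```python
def select_key_frames(frames, difference, threshold):
    """Return the indices of the key frames chosen for one shot."""
    middle = len(frames) // 2
    key_indices = [middle]  # step 1: start from the middle frame
    for i in range(len(frames)):
        if i == middle:
            continue
        # steps 2-3: keep the frame only if it differs from every key frame
        if all(difference(frames[i], frames[k]) > threshold
               for k in key_indices):
            key_indices.append(i)
    return sorted(key_indices)


if __name__ == "__main__":
    # Toy "frames" are numbers, compared by absolute difference.
    toy = [0, 1, 2, 10, 11, 12, 30]
    print(select_key_frames(toy, lambda a, b: abs(a - b), threshold=5))
```

A static shot ends up with the middle frame alone, while a shot with large visual changes accumulates several key frames.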

38 Shot-based Features
Shot length
Shot motion content
Estimate the parameters of a global affine motion model
Calculate the difference between the actual and the reprojected motion of blocks
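
For a single frame, the two motion steps can be sketched as a least-squares fit. The sketch assumes block centres and block motion vectors are already available, uses the six-parameter model u = a1*x + a2*y + a3, v = a4*x + a5*y + a6, and takes the mean residual magnitude as the motion measure. How these per-frame values are combined into a per-shot value is not specified here, and the function name is illustrative.

```python
import numpy as np


def affine_motion_residual(positions, motions):
    """Mean distance between actual and affine-reprojected block motion.

    positions: (N, 2) block centres (x, y)
    motions:   (N, 2) block motion vectors (u, v)
    """
    positions = np.asarray(positions, dtype=float)
    motions = np.asarray(motions, dtype=float)
    # Each row is [x, y, 1]; solving for a (3, 2) matrix gives the six
    # affine parameters for u and v at once.
    design = np.column_stack([positions, np.ones(len(positions))])
    params, *_ = np.linalg.lstsq(design, motions, rcond=None)
    reprojected = design @ params
    return float(np.linalg.norm(motions - reprojected, axis=1).mean())


if __name__ == "__main__":
    grid = [(x, y) for x in range(3) for y in range(3)]
    pan = [(2.0, -1.0)] * 9  # pure camera pan: fully explained by the model
    print(affine_motion_residual(grid, pan))
```

Global camera motion such as a pan or zoom is absorbed by the affine model, so the residual mainly reflects independently moving objects.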

39 Shot Motion Content

40 Scene Boundary Detection Algorithm
Pass 1: Detect potential scene boundaries based on color properties
Pass 2: Remove weak scene boundaries by analyzing shot length and motion content

41 Potential Scene Boundaries
Backward shot coherence

42 Potential Scene Boundaries
Backward shot coherence
Compute the shot coherence of shot i with each shot in a window of previous shots.
Take the maximum shot coherence within the window of length N.
Post-processing: if a pair of key frames from two adjacent potential scenes are similar, merge the two scenes into one.
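
Given a matrix of pairwise shot coherence values, the backward shot coherence can be sketched as below. Two assumptions are made: the first shot, which has no predecessors, gets a BSC of 0, and a plain threshold on the BSC stands in for whatever boundary criterion the paper applies, since the slide does not state it. Function names are illustrative.

```python
def backward_shot_coherence(coherence, window):
    """coherence[i][j] is the shot coherence of shot i with shot j."""
    bsc = []
    for i in range(len(coherence)):
        previous = range(max(0, i - window), i)
        bsc.append(max((coherence[i][j] for j in previous), default=0.0))
    return bsc


def potential_boundaries(bsc, threshold):
    """Shots (other than the first) that match none of the recent shots."""
    return [i for i in range(1, len(bsc)) if bsc[i] < threshold]


if __name__ == "__main__":
    scene_of = [0, 0, 0, 1, 1]  # toy data: shots 0-2 and shots 3-4
    C = [[0.9 if a == b else 0.1 for b in scene_of] for a in scene_of]
    print(potential_boundaries(backward_shot_coherence(C, window=2), 0.5))
```

A shot that resembles none of the previous N shots gets a low BSC, which marks it as the likely start of a new scene.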

43 Potential Scene Boundaries

44 Selection of Window Size
The computation of BSC is controlled by the window size N.
N acts as a memory parameter that mimics a human's ability to recall a shot seen in the past.
If N is too large, the window may span several scenes.
If N is too small, the video may be over-segmented.
N = 10 in this paper

45 Scene Dynamics Analysis
Scenes with weak structure are often broken into several scenes, e.g. action scenes, because their shots are non-repetitive.
Scene dynamics (SD)
Action scenes: larger SMC and smaller L
The PSB between two consecutive scenes is removed if the SD of both scenes exceeds a fixed threshold.
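
The boundary-removal rule can be sketched as below. The slide's formula for SD did not survive extraction, so the sketch assumes SD is the total shot motion content of a scene divided by its total shot length, which is consistent with action scenes having larger SMC and smaller L. The names and the threshold are illustrative.

```python
def scene_dynamics(shots):
    """Assumed SD: sum of SMC over sum of shot lengths.

    shots: list of (smc, length) pairs for one scene.
    """
    total_length = sum(length for _, length in shots)
    return sum(smc for smc, _ in shots) / total_length


def merge_action_scenes(scenes, threshold):
    """Drop the boundary between consecutive scenes that are both dynamic."""
    if not scenes:
        return []
    merged = [list(scenes[0])]
    for scene in scenes[1:]:
        if (scene_dynamics(merged[-1]) > threshold
                and scene_dynamics(scene) > threshold):
            merged[-1].extend(scene)  # remove the PSB between them
        else:
            merged.append(list(scene))
    return merged


if __name__ == "__main__":
    scenes = [[(1.0, 100), (2.0, 100)],   # calm dialogue
              [(40.0, 20), (50.0, 30)],   # action
              [(60.0, 25), (30.0, 25)],   # action
              [(1.0, 200)]]               # calm
    print(len(merge_action_scenes(scenes, threshold=1.0)))
```

In this sketch the SD of a merged scene is recomputed over all of its shots before it is compared with the next scene, which is one possible design choice.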

46 Scene Dynamics Analysis

47 Scene Representation
A shot is a good representative when:
The shot is shown several times (higher SC)
The shot spans a longer period of time (larger shot length)
The shot has minimal motion content (smaller SMC)
Multiple faces are preferred.

48 Shot Goodness
A correlation matrix of dimension N x N is constructed, where element (i, j) is the coherence of shot i with shot j.
The three shots with the highest goodness W are selected as candidate shots, and face detection is performed on them.

49 Detection of Faces
A method based on skin detection is adopted.
The middle frames of the candidate shots are tested.
Each isolated skin segment is counted as a face, and the frame with the most votes is taken as the scene key frame.
In the case of a tie or no face, the key frame of the shot with the highest goodness value is selected.

50 Scene Key Frame

51 More Examples

52 Experimental Results
Five movies, one sitcom, and one TV show
Miss (false negative)
False alarm (false positive)

53 Experimental Results
Slight over-segmentation is preferable to under-segmentation.
While browsing a video, it is better to have two segments of one scene than one segment consisting of two scenes.

54 Experimental Results

55 References
Rasheed, et al. "Scene detection in Hollywood movies and TV shows," Proc. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, 2003.
Yeung, et al. "Segmentation of video by clustering and graph analysis," Computer Vision and Image Understanding, vol. 71, no. 1, 1998.
Vendrig, et al. "Systematic evaluation of logical story unit segmentation," IEEE Transactions on Multimedia, vol. 4, no. 4, 2002.

56 Brief Introduction of Montage

57 Montage
Montage refers to the editing of the film: the cutting and piecing together of exposed film in a manner that best conveys the intent of the work.

58 Methods of Montage
Metric
The editing follows a specific number of frames, cutting to the next shot no matter what is happening within the image.
This montage is used to elicit the most basal and emotional of reactions in the audience.
Example

59 Methods of Montage
Rhythmic
Cutting based on time, along with a change in the speed of the metric cuts, to induce more complex meanings than are possible with metric montage.
Once sound was introduced, rhythmic montage also included aural elements (music, dialogue, sounds).
Example

60 Methods of Montage
Tonal
A tonal montage uses the emotional meaning of the shots, rather than only the temporal length of the cuts or their rhythmic characteristics, to elicit a reaction from the audience that is even more complex than that of metric or rhythmic montage.
For example, a sleeping baby conveys calmness and relaxation.
Example: This is the clip following the death of the revolutionary sailor Vakulinchuk, a martyr for sailors and workers.

61 Methods of Montage
Overtonal/Associational
The overtonal montage accumulates metric, rhythmic, and tonal montage and synthesizes their effect on the audience into an even more abstract and complicated effect.
Example: In this clip, the men are workers walking towards a confrontation at their factory, and later in the movie, the protagonist uses ice as a means of escape.

62 Methods of Montage
Intellectual
Uses shots which, combined, elicit an intellectual meaning.
Example: from Eisenstein's October and Strike. In Strike, a shot of striking workers being attacked, cut with a shot of a bull being slaughtered, creates a film metaphor suggesting that the workers are being treated like cattle. This meaning does not exist in the individual shots; it only arises when they are juxtaposed.

63 Next Week
Introduction of content-based image retrieval
