Scene Detection Media Mining I
1 Scene Detection, Media Mining I
Multimedia Computing, Universität Augsburg
Rainer.Lienhart@informatik.uni-augsburg.de
2 Overview
Hierarchical structure of a video sequence:
- Frames
- Shots = sequence taken by a single camera during one ON/OFF cycle
- Scenes (story units) = a single dramatic event taken by a small number of related cameras
A musical analogy: frames are like notes, shots like chords, scenes like musical phrases, and larger story units like themes/acts.
Common approach: shot detection as a prerequisite for scene detection, which is abstracted as a problem of symbolic pattern segmentation, e.g., ABABCDCD
3 Scene Transition Graph
4 Time Constrained Clustering (1)
Step 1: Assign to each shot s_i a label L_i which describes the content of the shot.
- Automatic semantic labels such as "news anchor", "news room", or "man behind table" are difficult to obtain.
- Instead, use low-level features such as color histograms.
Time-constrained clustering: let C_i be the i-th cluster and w, x, y the elements (video shots) in a cluster.
Clustering criteria, for all x, y in C_i:
- Visual dissimilarity: the maximum visual separation between any two shots in cluster C_i is bounded.
- Temporal distance: the maximum temporal distance between the start of the earlier and the end of the later shot in C_i is bounded.
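The two clustering criteria above can be sketched as a membership test (a minimal Python sketch; the shot representation and the `d_visual` dissimilarity function are hypothetical placeholders, not from the slides):

```python
def satisfies_cluster_criteria(cluster, shot, d_visual, delta_sim, delta_time):
    """Check whether `shot` may join `cluster` under time-constrained clustering.

    cluster: list of shots; each shot is a dict with 'start', 'end', 'feature'
             (hypothetical representation).
    d_visual: visual dissimilarity function between two shots.
    delta_sim: max. visual separation between any two shots in a cluster.
    delta_time: max. temporal distance between the start of the earlier
                and the end of the later shot.
    """
    for x in cluster:
        # visual criterion: every pair in the cluster must be similar enough
        if d_visual(x, shot) > delta_sim:
            return False
        # temporal criterion: span from earlier start to later end is bounded
        earlier, later = (x, shot) if x['start'] <= shot['start'] else (shot, x)
        if later['end'] - earlier['start'] > delta_time:
            return False
    return True
```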
5 Time Constrained Clustering (2)
Frame dissimilarity: normalized color histogram difference (8r * 8g * 4b = 256 bins):
D(f_i, f_j) = Sum_b |h_ib - h_jb| / (2N)
where h_ib is bin b of the color histogram of frame i and N is the total number of pixels per frame (an L1 norm scaled to [0, 1]).
Shot dissimilarity: minimum dissimilarity between any two frames of the two shots:
D(S_i, S_j) = min_{k,l} D(f_ki, f_lj)
where f_ki is frame k of shot i and f_lj is frame l of shot j.
Note: this is not a metric; it violates the triangle inequality and the positive-definiteness requirement.
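A minimal sketch of both dissimilarity measures, assuming histograms are given as plain bin-count lists and scaling the L1 difference by 2N so the result lies in [0, 1]:

```python
def frame_dissimilarity(h_i, h_j):
    """L1 difference of two color histograms (bin counts summing to the
    number of pixels N), scaled to [0, 1]."""
    n = sum(h_i)
    return sum(abs(a - b) for a, b in zip(h_i, h_j)) / (2.0 * n)

def shot_dissimilarity(frames_i, frames_j):
    """Minimum frame dissimilarity over all frame pairs of two shots."""
    return min(frame_dissimilarity(fi, fj)
               for fi in frames_i for fj in frames_j)
```

The min-min construction is why the measure is not a metric: two shots each sharing one frame with a third can both be "close" to it while being far from each other.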
6 Time Constrained Clustering (3)
Perform complete-link(-age) hierarchical clustering: in each step merge the two clusters whose merger has the smallest diameter (i.e., the two clusters with the smallest maximum pairwise distance).
NOT single-link(-age) hierarchical clustering, which in each step merges the two clusters whose two closest members have the smallest distance (i.e., the smallest minimum pairwise distance).
Worst-case time complexity: O(n^2 log n). One O(n^2 log n) algorithm:
1. Compute the n^2 distance matrix.
2. Sort the distances for each data point (overall time: O(n^2 log n)).
3. After each merge iteration, the distance matrix can be updated in O(n). The next pair to merge is found as the smallest distance that is still eligible for merging; traversing the n sorted distance lists takes n^2 traversal steps in total by the end of clustering.
4. Adding all this up gives O(n^2 log n).
7 Complete-link Clustering
1. Start with n clusters C_1, ..., C_n, where C_i = {x_i}
2. Minimize the cost function c(C_i, C_j) to find the two best clusters to merge
3. Replace C_i and C_j by C_i ∪ C_j
4. Repeat steps 2 and 3 until all clusters are merged into one
Complete-link cost: c(C_i, C_j) = max_{x ∈ C_i, y ∈ C_j} d(x, y)
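The four steps above can be sketched directly (a naive O(n^3) version for clarity; the O(n^2 log n) bookkeeping from the previous slide is omitted). The `stop_diameter` termination criterion is an assumption for illustration, since the slide merges until one cluster remains:

```python
def complete_link(points, dist, stop_diameter):
    """Naive agglomerative complete-link clustering.

    Repeatedly merges the two clusters with the smallest maximum pairwise
    distance (the complete-link cost), stopping when that cost would
    exceed `stop_diameter`."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # complete-link cost: maximum pairwise distance
                cost = max(dist(x, y)
                           for x in clusters[a] for y in clusters[b])
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        if cost > stop_diameter:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```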
8 Time Constrained Clustering (4)
Result of time-constrained clustering: a partition of the video into clusters C_1, C_2, ..., C_N.
- Each C_j consists of one or more shots.
- Each shot s_i in C_j is assigned the label L_j, yielding a label sequence {L_j}.
Example: a video sequence becomes ABABCDCDABAB
9 Scene Transition Graph (1)
Step 1: Derive story units (scenes).
last(A, b) = max { i ≥ b : L_i == A }, i in the set of shot indices
= the shot index of the last occurrence of label A starting at shot s_b
[Algorithm Detect Story Unit]
1) l := m, e := last(L_l, m)
2) While l <= e do
     if (last(L_l, m) > e) then e := last(L_l, m)
     l++
3) Shots s_m, s_{m+1}, ..., s_e constitute a story unit
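The algorithm can be written out as follows (a sketch assuming the labels are given as a plain list; `last` is implemented by linear search):

```python
def last(labels, label, b):
    """Shot index of the last occurrence of `label` at or after shot b."""
    idxs = [i for i in range(b, len(labels)) if labels[i] == label]
    return idxs[-1] if idxs else b

def detect_story_unit(labels, m):
    """Return (m, e): the shot range of the story unit starting at shot m."""
    l, e = m, last(labels, labels[m], m)
    while l <= e:
        # extend the unit if the current shot's label reoccurs later
        cand = last(labels, labels[l], m)
        if cand > e:
            e = cand
        l += 1
    return m, e
```

On ABABCDCD this yields the story units ABAB (shots 0..3) and CDCD (shots 4..7); on ABABCDCDABAB the trailing ABAB pulls everything into one unit, matching the cyclic-graph example on the next slide.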
10 Scene Transition Graph (2)
Example 1: Short video sequence ABABCDCDABAB. The graph over the labels A, B, C, D is cyclic with no cut edge, so all shots form one story unit.
Example 2: Long video sequence ABABCDCD...ABAB. Because the later ABAB shots are more than the temporal threshold T apart from the earlier ones, time-constrained clustering assigns them new labels, and the sequence becomes ABABCDCDEFEF. The graph over A, B, C, D, E, F has two cut edges.
Example 3: Video sequence ABABCDCD (8 shots, 4 camera positions). The graph over A, B, C, D has one cut edge, from (the last) B to (the first) C.
11 Scene Transition Graph (3)
12 Scene Transition Graph (4)
13 (figure-only slide)
14 Experimental Results
15 Scene Determination based on Video and Audio Features
16 System Overview
17 Audio Feature
Audio sequence := a sequence of shots in which very similar audio signals re-occur.
Pre-filter: background elimination
- Get rid of unimportant sound pieces by thresholding (a) the signal power or (b) a perceptual loudness measure.
Audio cuts = time instances which delimit periods of similar sound => audio shots.
Power spectrum distance measure:
- Each video shot has at least one associated audio shot.
- Audio shots are represented as disaggregated sets of feature vectors; the distance is the minimal Euclidean distance found between any two feature vectors of two audio shots.
18 Video Features
Color Coherence Vectors (CCV), developed by Pass, Zabih, and Miller.
Orientation correlograms.
Shot dissimilarity measure: the dissimilarity between two shots with respect to their color/orientation content is measured on the disaggregated set representation of the shots, using the minimum distance between the most common feature values.
Each shot S_i is described by a set of feature values {f_1^i, ..., f_n^i} derived from its frames and compared with shot S_j = {f_1^j, ..., f_n^j} by
dist(S_i, S_j) = min { d(f_i, f_j) : f_i in S_i, f_j in S_j }
19 Color Coherence Vector
Shots with very similar color content usually belong to a common setting because they share a common background; color content changes more dramatically from setting to setting than within a single setting.
Color content is usually measured by some refined color histogram technique such as the color coherence vector (CCV). The CCV makes use of spatial coherence and discriminates much better than the basic color histogram.
Instead of counting only the number of pixels of a certain color, the CCV additionally distinguishes between coherent and incoherent pixels within each color class j, depending on the size of the color region to which they belong: if the region (i.e., the connected 8-neighbor component of that color) is larger than a threshold t_ccv, a pixel is regarded as coherent, otherwise as incoherent.
Thus, there are two values associated with each color j:
- α_j: the number of coherent pixels of color j
- β_j: the number of incoherent pixels of color j
Two CCVs, CCV_1 = {(α_j, β_j)} and CCV_2 = {(α'_j, β'_j)}, are compared by summing the absolute differences of the coherent and incoherent counts over all colors: Sum_j (|α_j - α'_j| + |β_j - β'_j|)
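A pure-Python sketch of the CCV computation on an already color-quantized image (given as a 2D list of color indices), using breadth-first flood fill to find the 8-connected components:

```python
from collections import deque

def ccv(img, n_colors, t_ccv):
    """Color coherence vector of a quantized image.

    Returns (alpha, beta): per-color counts of coherent / incoherent pixels.
    A pixel is coherent if its 8-connected same-color component contains
    more than t_ccv pixels."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    alpha, beta = [0] * n_colors, [0] * n_colors
    for y in range(h):
        for x in range(w):
            if seen[y][x]:
                continue
            color = img[y][x]
            # flood-fill the 8-connected component of this color
            comp, queue = 0, deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                comp += 1
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and not seen[ny][nx] and img[ny][nx] == color):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if comp > t_ccv:
                alpha[color] += comp
            else:
                beta[color] += comp
    return alpha, beta

def ccv_distance(ccv1, ccv2):
    """Summed absolute differences of coherent and incoherent counts."""
    (a1, b1), (a2, b2) = ccv1, ccv2
    return (sum(abs(x - y) for x, y in zip(a1, a2))
            + sum(abs(x - y) for x, y in zip(b1, b2)))
```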
20 Orientation Correlogram (1)
Local orientation := an image structure in which the gray or color values change in exactly one direction but remain static in the orthogonal direction.
An orientation does not distinguish between a direction Φ and Φ + 180° (unlike a direction vector).
Inertia (structure) tensor: with I(x, y) the grayscale image and D_x, D_y its first derivatives in x and y direction, the tensor is built from the Gaussian-filtered products of the gradient components:
J = G_σ * [ D_x^2, D_x D_y ; D_x D_y, D_y^2 ]
21 Orientation Correlogram (2)
Φ measures the angle of orientation of the image structure around the location (x, y). The relation of the eigenvalues λ_1, λ_2 of the tensor to each other can be used as a certainty measure of the estimated orientation. Three cases can be distinguished:
1. λ_1 >> λ_2: there is an orientation in direction Φ.
2. λ_1 ≈ λ_2: the gray-scale values change similarly in all directions; the structure is isotropic.
3. λ_1, λ_2 ≈ 0: the local environment has a constant gray-scale value.
22 Orientation Correlogram (3)
Captures the local orientation of an image independently of translations and small or medium scalings.
Orientation correlogram := a table indexed by an orientation pair <i, j>. The k-th entry γ_{i,j}^(k) of an orientation pair <i, j> specifies the probability that, within a distance of k of a pixel with orientation i, the orientation j can be found in the image. It describes how the spatial correlation of orientation pairs changes with distance.
The orientation Φ is discretized into n classes K_i.
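A sketch of the correlogram computation. It assumes the L-infinity (chessboard) distance between pixels, which is the common choice for correlograms; the slides do not specify the distance:

```python
import math

def orientation_classes(orientations, n):
    """Discretize angles (radians, taken mod pi since Phi and Phi + 180
    degrees are equivalent) into n classes."""
    return [[int((a % math.pi) / math.pi * n) % n for a in row]
            for row in orientations]

def orientation_correlogram(classes, n, max_k):
    """gamma[i][j][k-1] = probability that a pixel at L-inf distance k from
    a pixel of orientation class i has orientation class j."""
    h, w = len(classes), len(classes[0])
    gamma = [[[0.0] * max_k for _ in range(n)] for _ in range(n)]
    for k in range(1, max_k + 1):
        counts = [[0] * n for _ in range(n)]
        totals = [0] * n
        for y in range(h):
            for x in range(w):
                i = classes[y][x]
                # visit the ring of pixels at exactly L-inf distance k
                for dy in range(-k, k + 1):
                    for dx in range(-k, k + 1):
                        if max(abs(dy), abs(dx)) != k:
                            continue
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w:
                            counts[i][classes[ny][nx]] += 1
                            totals[i] += 1
        for i in range(n):
            for j in range(n):
                if totals[i]:
                    gamma[i][j][k - 1] = counts[i][j] / totals[i]
    return gamma
```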
23 Dialog Detection (1)
Frontal face detection: neural-network detector proposed by Rowley, Baluja, and Kanade.
24 Dialog Detection (2)
Dialog := a sequence of contiguous shots of a minimum length of three shots, where at least one face-based class is present in each shot and the Eigenface-based shot-overlapping relations between face-based classes interlink the shots by crossings.
25 Scene Determination (1)
Observation: the same features need not be present in each shot; human similarity judgment is very adaptive.
Our approach:
- Shot cluster := all shots between two shots which are no further apart than the lookahead (e.g., 3 shots) and whose distance is below a predefined (absolute) threshold.
- Overlapping shot clusters are grouped into scenes.
- Overlapping shot clusters from different feature domains are merged.
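The lookahead/threshold clustering and the grouping of overlapping clusters can be sketched as interval operations (assuming a precomputed shot dissimilarity matrix):

```python
def shot_clusters(dist_matrix, lookahead, threshold):
    """Shot clusters: every pair (i, j) with j - i <= lookahead and
    dissimilarity below the absolute threshold spans a cluster i..j."""
    n = len(dist_matrix)
    clusters = []
    for i in range(n):
        for j in range(i + 1, min(n, i + lookahead + 1)):
            if dist_matrix[i][j] < threshold:
                clusters.append((i, j))
    return clusters

def merge_into_scenes(clusters, n):
    """Group overlapping shot clusters (intervals) into scenes."""
    scenes = []
    for start, end in sorted(clusters):
        if scenes and start <= scenes[-1][1]:
            # overlaps (or touches) the previous scene: extend it
            scenes[-1] = (scenes[-1][0], max(scenes[-1][1], end))
        else:
            scenes.append((start, end))
    return scenes
```

Merging clusters from different domains (color, orientation, audio) would simply pool the interval lists of all domains before calling `merge_into_scenes`.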
26 Scene Determination (2)
27 Ground Truth
Determined manually for each kind of scene (dialog, locale, audio scene, semantic scene) by classifying each shot boundary according to whether the two involved shots belong together: 1 if the two associated shots belong to the same scene, 0 otherwise. At critical points, intensive discussion was sometimes necessary.
A tool supports the creation of the ground truth: it shows the shot boundary being labeled (after shot n) together with the future shots n+1 and n+2, which help to decide; it plays the 2 seconds preceding and following the shot boundary from n to n+1, then advances (n++).
28 Experiments
Test set: Groundhog Day, Forrest Gump.
Methodology:
- Hit rate h = (# of correctly eliminated shot boundaries + # of correctly detected scene boundaries) / (total # of shot boundaries)
- Miss rate m = (# of missed scene boundaries) / (# of shot boundaries)
- False hit rate f = (# of falsely detected scene boundaries) / (# of shot boundaries) = 1 - h - m
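The three rates can be computed from per-boundary labels (a sketch; `ground_truth[i]` and `detected[i]` say whether shot boundary i is a scene boundary). Note that h + m + f = 1 by construction:

```python
def scene_rates(ground_truth, detected):
    """Hit, miss, and false-hit rates over all shot boundaries.

    ground_truth, detected: one boolean per shot boundary, True if it is
    (detected as) a scene boundary, False otherwise."""
    n = len(ground_truth)
    # hits: correctly eliminated (both False) + correctly detected (both True)
    hits = sum(1 for g, d in zip(ground_truth, detected) if g == d)
    misses = sum(1 for g, d in zip(ground_truth, detected) if g and not d)
    false_hits = sum(1 for g, d in zip(ground_truth, detected) if d and not g)
    return hits / n, misses / n, false_hits / n
```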
29 Quality of Audio Sequence Determination (figure: quality as a function of the threshold; annotated values: 85 %, 10 %)
30 Quality of Video Setting Determination (figure: quality as a function of the threshold; annotated values: 80 %, 10 %)
31 Quality of Dialog Determination
              Groundhog Day   Forrest Gump
# shots            -               -
# dialogs          -               -
Hit rate          35%             60%
False hits         5%            118%
32 Examples of Detected Scenes
33 Conclusion
Four types of features, clustered into audio sequences, settings, and dialogs.
Hit rates: 66% - 90% (except dialogs).
False hit rates: 5% - 28% (except dialogs).
34 Unified Shot & Scene Detection
35 Video Structure (figure: temporal vs. non-temporal structure)
36 Core Idea
Use a short-term visual memory buffer. Common models in psychology: sensory buffers are leaky integrators.
- Shot detection: encode the extent to which the present frame reminds the viewer of a prior one.
- Scene detection: encode the extent to which the present shot reminds the viewer of a prior one.
Example: the label sequence A B C B C A B exhibits good coherence; A B C X Y B Z exhibits bad coherence.
37 Basic Notations (Shot Detection)
D(F_i, F_j) = symmetric frame dissimilarity measure between frame F_i at time i and frame F_j at time j, i < j; e.g., the L1 norm on color histograms (see Time Constrained Clustering (2)).
e^(-(j-i)/B) = probability that a frame F_j recalls a (prior) frame F_i (of dissimilarity D(F_i, F_j)); B = size of the short-term memory.
Expected frame-to-frame dissimilarity: D(F_i, F_j) * e^(-(j-i)/B).
If F_j is similar/dissimilar to frame F_i, it is also very likely similar/dissimilar to those frames immediately preceding F_i. Hence the aggregated expected dissimilarity experienced at frame F_j due to all frames occurring before frame F_i is
Recall(i, j) = Integral_{-inf}^{i} D(F_t, F_j) * e^(-(j-t)/B) dt    // D made suitably continuous
The total amount of dissimilarity induced at the interframe gap just after frame F_i is
Dis(i) = Integral_{i}^{inf} Recall(i, j) dj = Integral_{i}^{inf} Integral_{-inf}^{i} D(F_t, F_j) * e^(-(j-t)/B) dt dj
This is simply a double convolution of the pairwise frame dissimilarities with an exponential kernel modeling the memory buffer loss. Due to the integration, this method is more robust than simple adjacent-frame comparison.
Features of Dis(i):
- a peak at shot boundaries
- a local, but more rounded, maximum at dissolve boundaries
- a local minimum at very representative key frames
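A discrete sketch of Dis(i), replacing the integrals by sums and truncating the (in principle infinite) ranges to a window of K frames on each side of the gap, in the spirit of the loss-filter extent K on the Parameter Settings slide:

```python
import math

def dis(frame_features, i, B, frame_dist, K):
    """Discrete Dis(i): double sum of pairwise frame dissimilarities,
    weighted by the exponential memory-loss kernel e^(-(j-t)/B),
    over frames t <= i and j > i within a window of K frames."""
    total = 0.0
    for j in range(i + 1, min(len(frame_features), i + 1 + K)):
        for t in range(max(0, i - K), i + 1):
            total += (frame_dist(frame_features[t], frame_features[j])
                      * math.exp(-(j - t) / B))
    return total
```

Inside a homogeneous shot all pairwise dissimilarities vanish, so Dis stays near zero; at a cut, every (t, j) pair straddles the boundary and Dis peaks.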
38 Dissolve Model (1/2)
Without loss of generality we make the following assumptions:
- Unit of time = buffer size B (normalized equations)
- Edits are centered at t = 0, ranging from -w to w; edit duration = 2Bw
- d = dissimilarity between the frames just before and just after the dissolve
Cross dissolve: the dissimilarity between frames at times i and j within the dissolve (i.e., -w <= i <= j <= w) is given by d(j - i)/(2w).
Before/after the edit, the shot is almost constant in its dissimilarity measure, i.e., D(F_i, F_j) ≈ 0.
Recall(i, j) and Dis(i) can then be computed in closed form.
39 Dissolve Model (2/2)
Three cases have to be considered.
Case 2: -w <= i <= w <= j. The recent past of F_j consists of frames before, during, and after the dissolve. For i > 0:
- Before the dissolve (t < -w): D(t, j) ≈ d
- During the dissolve (-w <= t <= w): the linear dissolve model gives D(t, j) ≈ d(w - t)/(2w), since F_j equals the shot at the end of the dissolve
- After the dissolve (t > w): D(t, j) ≈ 0
Recall(i, j) = Integral_{-inf}^{-w} d * e^(-(j-t)) dt + Integral_{-w}^{i} (d(w - t)/(2w)) * e^(-(j-t)) dt    // linear dissolve
Dis(i) = Integral_{i}^{inf} Recall(i, j) dj
40 Examples (1/3)
Dis(0) = (d/w)(1 - e^(-w))
Theoretic response of Dis(i) to ideal dissolves of size w = 0 to w = 5, with frame distances measured in units of the buffer size. An ideal dissolve with w = 0 is a cut and shows the impulse response.
41 Examples (2/3)
Response of the model to a section of a video:
- Cut at frame #4869
- Dissolve at frame #4936 with width 26
- Small dissimilarities in the middle of the graph are due to a pan and do not affect the segmentation results.
42 Examples (3/3)
- Fade-out centered at frame #2255 with width 19
- Fade-in centered at frame #2293 with width 20
43 Parameter Settings
The larger the buffer size B, the more frames are recalled, and the video tends to be perceived as smoother. The smaller the buffer size B, the less likely it is that incoming frames have similar frames to recall, and local frame differences dominate the measure.
Shot detection: B = 4, K = 120 (±60; extent of the exponential loss filter).
Maxima detection in Dis(i): Dis(i) > Dis(j) for all j in [i-N, i+N], plus monotonic decrease; N = 6.
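The windowed maxima search can be sketched as follows (the additional monotonic-decrease condition is omitted for brevity):

```python
def local_maxima(dis_values, N):
    """Indices i where dis_values[i] strictly exceeds dis_values[j] for all
    j in [i-N, i+N], j != i, as used for shot boundary detection."""
    peaks = []
    for i in range(len(dis_values)):
        lo, hi = max(0, i - N), min(len(dis_values), i + N + 1)
        if all(dis_values[i] > dis_values[j]
               for j in range(lo, hi) if j != i):
            peaks.append(i)
    return peaks
```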
44 Scene Detection
Scene boundary = a shot boundary with a small total coherence recall.
Same approach as for shot detection, but:
- buffer size B significantly larger, e.g., 8 times the average shot length (~1000 frames)
- local maximum search = maximum dissimilarity at shot boundaries, i.e., search only at shot boundaries; e.g., N = 3
Dense mage-based Motion Estimation Algorithms & Optical Flow Video A video is a sequence of frames captured at different times The video data is a function of v time (t) v space (x,y) ntroduction to motion
More informationClustering Part 4 DBSCAN
Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of
More informationPeripheral drift illusion
Peripheral drift illusion Does it work on other animals? Computer Vision Motion and Optical Flow Many slides adapted from J. Hays, S. Seitz, R. Szeliski, M. Pollefeys, K. Grauman and others Video A video
More informationSegmentation Computer Vision Spring 2018, Lecture 27
Segmentation http://www.cs.cmu.edu/~16385/ 16-385 Computer Vision Spring 218, Lecture 27 Course announcements Homework 7 is due on Sunday 6 th. - Any questions about homework 7? - How many of you have
More informationINF 4300 Classification III Anne Solberg The agenda today:
INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15
More informationEE795: Computer Vision and Intelligent Systems
EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 FDH 204 Lecture 14 130307 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Review Stereo Dense Motion Estimation Translational
More informationRobust Shape Retrieval Using Maximum Likelihood Theory
Robust Shape Retrieval Using Maximum Likelihood Theory Naif Alajlan 1, Paul Fieguth 2, and Mohamed Kamel 1 1 PAMI Lab, E & CE Dept., UW, Waterloo, ON, N2L 3G1, Canada. naif, mkamel@pami.uwaterloo.ca 2
More informationModern Medical Image Analysis 8DC00 Exam
Parts of answers are inside square brackets [... ]. These parts are optional. Answers can be written in Dutch or in English, as you prefer. You can use drawings and diagrams to support your textual answers.
More informationMulti-Camera Calibration, Object Tracking and Query Generation
MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Multi-Camera Calibration, Object Tracking and Query Generation Porikli, F.; Divakaran, A. TR2003-100 August 2003 Abstract An automatic object
More informationAn Introduction to Content Based Image Retrieval
CHAPTER -1 An Introduction to Content Based Image Retrieval 1.1 Introduction With the advancement in internet and multimedia technologies, a huge amount of multimedia data in the form of audio, video and
More informationEE795: Computer Vision and Intelligent Systems
EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 WRI C225 Lecture 04 130131 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Review Histogram Equalization Image Filtering Linear
More informationIdea. Found boundaries between regions (edges) Didn t return the actual region
Region Segmentation Idea Edge detection Found boundaries between regions (edges) Didn t return the actual region Segmentation Partition image into regions find regions based on similar pixel intensities,
More informationRapid Natural Scene Text Segmentation
Rapid Natural Scene Text Segmentation Ben Newhouse, Stanford University December 10, 2009 1 Abstract A new algorithm was developed to segment text from an image by classifying images according to the gradient
More informationEE795: Computer Vision and Intelligent Systems
EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 FDH 204 Lecture 11 140311 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Motion Analysis Motivation Differential Motion Optical
More informationSYDE Winter 2011 Introduction to Pattern Recognition. Clustering
SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned
More informationOther Linear Filters CS 211A
Other Linear Filters CS 211A Slides from Cornelia Fermüller and Marc Pollefeys Edge detection Convert a 2D image into a set of curves Extracts salient features of the scene More compact than pixels Origin
More informationFundamental matrix. Let p be a point in left image, p in right image. Epipolar relation. Epipolar mapping described by a 3x3 matrix F
Fundamental matrix Let p be a point in left image, p in right image l l Epipolar relation p maps to epipolar line l p maps to epipolar line l p p Epipolar mapping described by a 3x3 matrix F Fundamental
More informationAnalysis: TextonBoost and Semantic Texton Forests. Daniel Munoz Februrary 9, 2009
Analysis: TextonBoost and Semantic Texton Forests Daniel Munoz 16-721 Februrary 9, 2009 Papers [shotton-eccv-06] J. Shotton, J. Winn, C. Rother, A. Criminisi, TextonBoost: Joint Appearance, Shape and Context
More informationClustering. CS294 Practical Machine Learning Junming Yin 10/09/06
Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,
More informationCS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample
CS 1675 Introduction to Machine Learning Lecture 18 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem:
More informationObtaining Feature Correspondences
Obtaining Feature Correspondences Neill Campbell May 9, 2008 A state-of-the-art system for finding objects in images has recently been developed by David Lowe. The algorithm is termed the Scale-Invariant
More informationCS 4495 Computer Vision Motion and Optic Flow
CS 4495 Computer Vision Aaron Bobick School of Interactive Computing Administrivia PS4 is out, due Sunday Oct 27 th. All relevant lectures posted Details about Problem Set: You may *not* use built in Harris
More informationFingerprint Classification Using Orientation Field Flow Curves
Fingerprint Classification Using Orientation Field Flow Curves Sarat C. Dass Michigan State University sdass@msu.edu Anil K. Jain Michigan State University ain@msu.edu Abstract Manual fingerprint classification
More informationCS 534: Computer Vision Segmentation and Perceptual Grouping
CS 534: Computer Vision Segmentation and Perceptual Grouping Spring 2005 Ahmed Elgammal Dept of Computer Science CS 534 Segmentation - 1 Where are we? Image Formation Human vision Cameras Geometric Camera
More informationFeature extraction. Bi-Histogram Binarization Entropy. What is texture Texture primitives. Filter banks 2D Fourier Transform Wavlet maxima points
Feature extraction Bi-Histogram Binarization Entropy What is texture Texture primitives Filter banks 2D Fourier Transform Wavlet maxima points Edge detection Image gradient Mask operators Feature space
More informationSegmentation (continued)
Segmentation (continued) Lecture 05 Computer Vision Material Citations Dr George Stockman Professor Emeritus, Michigan State University Dr Mubarak Shah Professor, University of Central Florida The Robotics
More informationFiltering Images. Contents
Image Processing and Data Visualization with MATLAB Filtering Images Hansrudi Noser June 8-9, 010 UZH, Multimedia and Robotics Summer School Noise Smoothing Filters Sigmoid Filters Gradient Filters Contents
More informationEdge detection. Convert a 2D image into a set of curves. Extracts salient features of the scene More compact than pixels
Edge Detection Edge detection Convert a 2D image into a set of curves Extracts salient features of the scene More compact than pixels Origin of Edges surface normal discontinuity depth discontinuity surface
More information