Content-based image and video analysis. Event Recognition

Size: px

Start display at page:

Download "Content-based image and video analysis. Event Recognition"

Philip Johnston
5 years ago
Views:

1 Content-based image and video analysis Event Recognition

2 What is an event? a thing that happens or takes place, Oxford Dictionary Examples: Human gestures Human actions (running, drinking, etc.) Human interaction (gunfight, car-crash, etc.) Sports event (tennis, soccer, etc.) Nature event (fire, storm, etc.) 1

3 Why event recognition? Huge amounts of video available Event recognition useful for: Content-based browsing Fast forward to the next goal scoring scene Video search Show me all videos with Bush and Putin shaking hands 2

4 Human actions Human actions are major events in movie content Meaning hidden within visual representation 4

5 What are human actions? Definition 1: Physical body motion: KTH action Database ( 5

6 What are human actions? Definition 2: Interaction with environment on specific purpose Same physical motion different action depending on the context 6

7 Challenges Variations: Lighting, appearance, view, background Individual motion, camera motion Difference in shape Difference in motion Drinking Smoking Both actions are similar in overall shape (human posture) and motion (hand motion) 7

8 Challenges Problems: Existing datasets contain few action classes, captured in controlled and simplified settings Lots of (realistic) training data needed Idea: Realistic human actions are frequent events within movies Perform automatic labeling of video sequences based on movie scripts 8

9 Action event dataset Coffee and Cigarettes dataset 159 annotated Drinking samples 149 annotated Smoking samples Spatial annotation head rectangle First frame Temporal annotation Keyframe Last frame torso rectangle 9

10 Problems: Challenges Existing datasets contain few action classes, captured in controlled and simplified settings Lots of (realistic) training data needed Idea: Realistic human actions are frequent events within movies Perform automatic labeling of video sequences based on movie scripts Hollywood Human actions (HOHA) dataset 10

11 Problems: Script-based annotation No time information available Use subtitles to align scripts with the video Described actions do not always correspond to movie scene Assign an alignment score to scene description: a = (#matched words)/(#all words) Apply a Threshold (eg. a>0.5) Variability of action expressions in text Use regularized Perceptron to classify action description 11

12 Script alignment Scripts publicly available to many movies Subtitles available for the most of movies Transfer time to scripts by text alignment :20:17,240 --> 01:20:20,437 Why weren't you honest with me? Why'd you keep your marriage a secret? RICK Why weren't you honest with me? Why did you keep your marriage a secret? :20:20,640 --> 01:20:23,598 lt wasn't my secret, Richard. Victor wanted it that way :20:23,800 --> 01:20:26,189 Not even our closest friends knew about our marriage. subtitles 01:20:17 01:20:23 Rick sits down with Ilsa. ILSA Oh, it wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage. movie script 12

Script alignment: Problems Perfect alignment does not guarantee perfect action annotation in video From 147 action with correct alignment (a=1), only 70% did match the video Errors Temporal

13 Script alignment: Problems Perfect alignment does not guarantee perfect action annotation in video From 147 action with correct alignment (a=1), only 70% did match the video Errors Temporal misalignment (10%) Ouside FOV (10%) Completely missing in video (10%) (a: quality of subtitle-script matching) Example of false positive for get out of car, Action is not visible in video! A black car pulls up, two army officers get out. 14

14 Script-based Annotation Pros: Realistic variation of actions Many examples per class, many classes No extra overhead for new classes Character names may be used to resolve who is doing what? Problems: No spatial localization Temporal localization may be poor Script does not always follow the movie Automatic annotation useful for training, but not precise enough for evaluation 15

15 Retrieving actions in movies Ivan Laptev and Patric Perez ICCV

16 Actions == space-time objects? stableview objects atomic actions car exit phoning smoking hand shaking drinking Take advantage of spacetime shape Temporal slice time time time time 17

17 Actions == space-time objects? Can actions be considered as space-time objects? Transfer object detectors to action recognition Here: only atomic actions considered (i.e. simple actions) Temporal slice of same action under different circumstances similar So, (atomic) actions = space-time objects 18

Action features Action volume = space-time cuboid region around the head (duration of action) Encoded with block-histogram features f ( ), = (x,y,t,dx,dy,dt,, ), defined by

18 Action features Action volume = space-time cuboid region around the head (duration of action) Encoded with block-histogram features f ( ), = (x,y,t,dx,dy,dt,, ), defined by Location (x,y,t) Space-time extent (dx,dy,dt) Type of block ( ) el Plane, Temp-2, Spat-4 Type of histogram ( ) Histogram of optical flow (HOF) Histogram of oriented gradient (HOG) 19

19 Action features HOG features HOF features 20

20 Histogram features (simplified) Histogram of oriented gradient: Apply gradient operator to each frame within sequence (eg. Sobel) Bin gradients discretized in 4 orientations to block-histogram Histogram of optical flow: Calculate optical flow (OF) between frames Bin OF vectors discretized in 4 direction bins (+1 bin for no motion) to block-histogram Normalized action cuboid has size 14x14x8 with units corresponding to 5x5x5 pixels More than 1 Mio. possible features f ( ) 21

21 Histogram features HOG: histograms of oriented gradient HOF: histograms of optic flow >10^6 possible features 4 grad. orientation bins 4 OF direction bins + 1 bin for no motion 22

22 Action learning Use boosting method (eg. AdaBoost) to classify features within an action volume Features: Block-histogram features selected features boosting AdaBoost: weak classifier Efficient discriminative classifier [Freund&Schapire 97] Good performance for face detection [Viola&Jones 01] 23

23 Action learning: Boosting A weak classifier h is a classifier with accuracy only slightly better than chance selected features weak classifier Boosting: combine a number of weak classifiers so that the ensemble is arbitrarily accurate Allows the use of simple (weak) classifiers without the loss if accuracy Selects features and trains the classifier 24

24 pre-aligned samples Action learning: Boosting Haar features optimal threshold Histogram features Use FLD, select opt. threshold Weak classifier h t : In case of one dimensional features select an optimal decision threshold E.g. for Haar-filter responses (Viola&Jones face detector) Here: m-dimensional features Project data on one dimension using Fisher s Linear Discriminant (FLD), then select optimal threshold in 1-D 25

25 Action classification test Comparison of: Static keyframe classifier with spatial HOG features (BH-Grad4) Use Boosting to classify action using features from keyframes (= frame when hand reaches the mouth) only Space-time action classifier with HOF features (STBH-OF5) Space-time action classifier with HOF and HOG features (STBH-OFGrad9) 28

26 Action classification test Random motion patterns Additional shape information does not seem to improve the space-time classifier Space-time classifier and static key-frame classifier might have complementary properties 29

keyframe classifier(hog) Space-time classifier and

27 Classifier properties Training output: Accumulated feature maps Space-time classifier (HOF) Static keyframe classifier(hog) Space-time classifier and static keyframe classifier might have complementary features 30

28 Classifier properties Region of selected features show most active parts of classifier High activity at beginning and end of sequence, but low activity at keyframe Less accuracy expected when only classifying keyframes Space-time classifier selects most features around hand region Keyframe classifier selects most features in upper part of the key-frame Idea: Use complementary properties to combine classifiers (-> keyframe priming) 31

29 Keyframe priming Combination of static key-frame classifier with space-time classifier Motivated by complementary properties of both classifiers Bootstrap space-time classifier and apply it to keyframes detected by the keyframe detector (boosted space-time window classifier) Speeds up detection Combines complementary models 32

30 Keyframe priming Apply keyframe detector (HOG classification on single frame) to all positions, scales and frames while being set to a high false positive rate Generate space-time blocks aligned with detected keyframes and with different temporal extent Run space-time classifier on each hypothesis 33

31 Keyframe priming Training Positive training sample Negative training samples Keyframe-primed event detection Keyframe detections Test 34

32 Action detection Test on 25min from Coffee and Cigarettes with 38 drinking actions No overlap with the training set in subjects or scenes Keyframe priming is faster and leads to significant better results Keyframe priming No Keyframe priming 35

33 Action detection 36

34 Learning realistic human actions from movies Ivan Laptev, Marcin Marszalek, Cordelia Schmid and Benjamin Rozenfeld CVPR

35 Action recognition in real-world videos Humans can do more than just drinking and smoking Robust detection and classification of all kinds of human actions needed 38

36 Space-Time Features: Detector [Laptev, IJCV 2005] 39

37 Space-Time Features: Detector Space-Time Interest Points (STIP): Space-Time Extension of Harris Operator Add dimensionality of time to the second moment matrix Look for maxima in extended Harris corner function H Detection depends on spatio-temporal scale Extract features at multiple levels of spatio-temporal scales (dense scale sampling) 40

Space-Time Features: Descriptor Multi-scale space-time patches from corner detector Public code available at www.irisa.

38 Space-Time Features: Descriptor Multi-scale space-time patches from corner detector Public code available at Histogram of oriented spatial grad. (HOG) Histogram of optical flow (HOF) 3x3x2x4bins HOG descriptor 3x3x2x5bins HOF descriptor 41

39 Space-Time Features: Descriptor Compute histogram descriptors of space-time volumes in neighborhood of detected points: Compute a 4-bin HOG for each cube in 3x3x2 space-time grid Compute a 5-bin HOF for each cube in 3x3x2 space-time grid Size of each volume related to detection scales 42

40 Spatio-temporal bag of features (BoF) Cluster features (eg. k-means) Visual vocabulary (here, k=4000) Assign each feature to nearest vocabulary word Compute histogram of visual word occurrences over space time volume different spatio-temporal grids explored: 43

41 Action classification Use SVMs with multi-channel chi-square kernel: Channel c is a combination of a spatiotemporal grid and a descriptor (HoG or HoF) D is 2 distance between the BoF-histograms A is mean value of distances between all training samples One-against-all approach in case of multi-class classification 44

42 Results on KTH actions dataset Examples of all six classes and all four scenarios 45

43 Results on KTH actions dataset Average class accuracy Confusion matrix 46

44 Results on HOHA dataset Average Precision for each action class Comparison of results for annotated (clean) and automatic training data Chance denotes results of a random classifier 47

45 Results on HOHA dataset 48

46 Visual Event Recognition in News Video using Kernel Methods with Multi-Level Temporal Alignment Dong Xu and Shih-Fu Chang CVPR

47 Event recognition in news broadcast Challenging task: Complex motion, cluttered backgrounds, occlusions, geometric variations of objects HOF/HOG approaches are sensitive to high-motion regions only News events may have relatively low motion (eg. fire) Broadcast news less constrained domain than human actions Event usually consists of different sub-clips E.g. riot may consist of scenes of fire and smoke at different locations 50

48 Temporally aligned pyramid matching Algorithm overview: each frame is represented by one feature vector decompose video into sub-clips using hierarchical clustering on each hierarchy level, calculate Earth Mover s Distance (EMD) between frame features and use it within SVM framework Constrain EMD that frames from one sub-clip can only be matched to fames from one other sub-clip (alignment) fuse SVM outputs from all levels 51

49 Temporally aligned pyramid matching 52

50 Decomposition of video into sub-clips Temporal-constrained Hierarchical Agglomerative Clustering (T-HAC): First each feature-vector forms a cluster Construct clusters iteratively by combining existing clusters based on their distances (EMD) Only merge neighboring clusters in temporal dimension Clusters form a pyramid like structure (dendrogram) 53

51 Features Low-level global features: Grid Color Moment (GCM) First three moments for each grid region (eg. 5X5 grid) Gabor Texture feature (GT) Edge Direction Histogram (EDH) (same as HoG) Apply Sobel operator and histogram edge directions quantized at 5 degrees Mid-level feature: Concept Score (CS) N-dimensional vector (eg. N=108) each component represents confidence score from a semantic concept classifier (SVMs) 55

52 Temporal alignment Matching in single-level EMD (a) and temporally aligned pyramid matching (TAPM) (b) 58

53 Classification SVM Classification with Earth Movers Distance (EMD) Kernel: D(P,Q): EMD between features from sequence P and Q A: hyper-parameter set empirically through cross-validation 59

54 Fusion Fuse Information from different levels directly: L: number of sub-clip levels g: decision values from the SVMs at different levels l h: level weights (eg. all weights equal to 1) Similar to logistic regression/perceptron 60

55 Experiments Test with LSCOM concepts on TRECVID events/activities annotated (449 total concepts) 10 events chosen Relatively high occurrence frequency May be recognized from visual cues intuitively Number of positive samples for each class:

56 Results Comparison of single-level EMD (SLEMD) with key-frame classification on different features 62

57 Results Average Precision on different levels of TAPM L<x>: classification at level x d: fusion with non-uniform weights (h0=h1=1; h2=2) 63

58 Summary Paper1: Retrieving actions in movies Atomic human actions: smoking, drinking Histogram of oriented gradients (HoG), histograms of oriented flow (HoF), at different spatio-temporal positions & scales + grid-types,.. Boosting to select good features and to build classifier Paper2: Learning realistic human actions from movies Eight atomic human actions: Kiss, AnswerPhone, GetOutCar,.. HoG / HoF at space-time interest points, bag of features SVM for classification Paper 3: Visual event recognition in news videos Decomposes video into sub-clips Temporal alignment improves classification Low- and mid-level features, SVM for classification 64

59 References I. Laptev and P. Perez. Retrieving actions in movies. ICCV '07 I. Laptev, M. Marszalek, C. Schmid and B. Rozenfeld. Learning realistic human actions from movies. CVPR '08 D. Xu and SF. Chang. Visual Event Recognition in News Video using Kernel Methods with Multi-Level Temporal Alignment. CVPR '07 65

60 References ptev.html P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR '01 R. Schapire, Y. Freund, P. Bartlett and WS. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. ICML '97 66

61 Hiwi Interested in action recognition? We are currently looking for a motivated student to teach Armar-III how to recognize situations within rooms (eg. cooking, cleaning) Required skills: Interest in Computer Vision C++ programming experience under Linux lukas.rybok@kit.edu for more details 67

Learning Realistic Human Actions from Movies

Learning Realistic Human Actions from Movies Ivan Laptev*, Marcin Marszałek**, Cordelia Schmid**, Benjamin Rozenfeld*** INRIA Rennes, France ** INRIA Grenoble, France *** Bar-Ilan University, Israel Presented