Class 9 Action Recognition

1 Class 9 Action Recognition Liangliang Cao, April 4, 2013 EECS 6890 Topics in Information Processing Spring 2013, Columbia University Visual Recognition And Search 1

A Historical Overview (2001 to 2012)

Research milestones:
- KTH Dataset, ICPR 04 (1100+ citations)
- Hollywood2 dataset and new version of STIP, CVPR 08 (700+ citations)
- Topic model for actions, IJCV 08 (500+ citations)
- More datasets: UCF50 (2008), MSR (2009), HMDB

Video data:
- Early years: few internet videos, few surveillance cameras
- TRECVID videos: 11 hours, mainly TV news
- YouTube launched, then bought by Google for $1.65 billion
- YouTube ad revenue rises
- Example: 48 video-hours uploaded per minute; 4M security cameras in the UK
- VIRAT 2011 (29 hours), TRECVID SED (100 hours), TRECVID MED (100K clips)
- Today: internet videos flooding in, surveillance cameras everywhere

What Am I Going To Talk About
- Patch-feature based video recognition (KTH, STIP and Hollywood2, topic model for actions, UCF50, MSR, HMDB 2011)
- Non-patch based surveillance techniques
- The big gap
(Overlaid on the historical overview timeline.)

Classical Surveillance Techniques
Many techniques:
- Background subtraction
- Object detection
- Tracking
- People counting
- Trajectory analysis

Classical Surveillance Techniques: Background Mixture Model
Chris Stauffer and Eric L. Grimson, Adaptive Background Mixture Models for Real-Time Tracking, CVPR

GMM For Background Subtraction: Ideal Case
[Figure: image decomposition of the form "= +"; example courtesy of Michael Knowles]
Assumptions:
- Background is fixed
- There is not much noise

GMM For Background Subtraction: Real Case
Background subtraction:
- Construct a background image B as the average of a few images
- For each actual frame I, classify individual pixels as foreground if |B - I| > T (threshold)
[Figure: background image and current image; example courtesy of Latecki et al.]
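The two steps above can be sketched in a few lines. This is a minimal illustration on grayscale frames stored as lists of rows; the function names, the toy pixel values, and the threshold of 25 are assumptions made for the example, not values from the lecture.

```python
def average_background(frames):
    """Build the background image B as the per-pixel mean of a few frames."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(f[r][c] for f in frames) / n for c in range(cols)]
            for r in range(rows)]


def foreground_mask(background, frame, threshold):
    """Mark a pixel as foreground when |B - I| > T."""
    return [[abs(b - i) > threshold for b, i in zip(b_row, i_row)]
            for b_row, i_row in zip(background, frame)]


# Toy 2x3 grayscale frames (hypothetical values).
training_frames = [
    [[10, 10, 10], [12, 12, 12]],
    [[12, 10, 8], [12, 14, 10]],
]
B = average_background(training_frames)

# Two pixels of the current frame differ strongly from the background.
current = [[11, 90, 9], [12, 13, 200]]
mask = foreground_mask(B, current, threshold=25)
```

A single averaged image and a global threshold only work when the background really is fixed, which is the limitation the mixture model is meant to address.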

GMM For Background Subtraction: Why Difficult
- Illumination changes
  - Gradual
  - Sudden
- Repetitive background changes
- Long-term scene changes
- Low resolution
[Figure from Stauffer and Grimson 98: subject stayed and then left; scatter plots of the red and green values of a single pixel over time]

GMM For Background Subtraction: Gaussian Mixture Model
- A mixture model captures multiple components at each pixel location
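For reference, a per-pixel mixture of this kind is commonly written as follows. The notation is an assumption rather than something taken from the slide: $X_t$ is the pixel value at time $t$, $K$ is the number of Gaussians, and $w_{k,t}$, $\mu_{k,t}$, $\Sigma_{k,t}$ are the weight, mean, and covariance of component $k$.

```latex
P(X_t) = \sum_{k=1}^{K} w_{k,t}\, \mathcal{N}\!\left(X_t;\ \mu_{k,t},\ \Sigma_{k,t}\right),
\qquad \sum_{k=1}^{K} w_{k,t} = 1
```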

GMM For Background Subtraction: Adaptive GMM
- Recall that GMM adaptation is used in coding and pooling (lecture 3)
- Now we use adaptation to capture the lighting changes
- The learning rate is used to limit the influence of old data
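One common way to write such an adaptive update, in the style of Stauffer and Grimson, is sketched below. The symbols are assumptions, not recovered from the slide: $\alpha$ is the learning rate, $M_{k,t}$ is 1 for the component matched by the new pixel value $X_t$ and 0 otherwise, and the mean and variance are updated only for the matched component.

```latex
\begin{aligned}
w_{k,t}        &= (1-\alpha)\, w_{k,t-1} + \alpha\, M_{k,t} \\
\mu_{k,t}      &= (1-\rho)\, \mu_{k,t-1} + \rho\, X_t \\
\sigma^2_{k,t} &= (1-\rho)\, \sigma^2_{k,t-1} + \rho\, (X_t - \mu_{k,t})^{\top} (X_t - \mu_{k,t}) \\
\rho           &= \alpha\, \mathcal{N}\!\left(X_t;\ \mu_{k,t-1},\ \sigma^2_{k,t-1}\right)
\end{aligned}
```

Because each update multiplies the old value by $(1-\alpha)$, an observation made $n$ frames ago contributes with weight proportional to $(1-\alpha)^n$, which is how a larger $\alpha$ forgets old data faster.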

GMM For Background Subtraction: Background and Foreground
- The Gaussians are ordered by the ratio of weight to standard deviation (high support and low variance)
- The first few distributions in that order are then chosen as the background model
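The ordering and selection step can be sketched as below. This is an illustration under assumptions: components are ranked by weight divided by standard deviation, and the cut-off is a cumulative-weight threshold (here called `min_background_fraction`, a hypothetical name), which is one common way to decide how many of the first distributions count as background.

```python
def background_components(weights, sigmas, min_background_fraction):
    """Return the indices of the Gaussians treated as background.

    Components are ranked by weight / sigma (high support, low variance)
    and taken in that order until their cumulative weight exceeds
    min_background_fraction.
    """
    order = sorted(range(len(weights)),
                   key=lambda k: weights[k] / sigmas[k],
                   reverse=True)
    chosen, cumulative = [], 0.0
    for k in order:
        chosen.append(k)
        cumulative += weights[k]
        if cumulative > min_background_fraction:
            break
    return chosen


# Hypothetical three-component model for one pixel.
weights = [0.6, 0.3, 0.1]
sigmas = [4.0, 5.0, 20.0]
selected = background_components(weights, sigmas, 0.7)
```

With a lower threshold only the dominant component is kept, so the threshold controls how multi-modal the background is allowed to be.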

GMM For Background Subtraction: Background and Foreground
[Figure legend: foreground, background]

GMM For Background Subtraction: Background Updating
- Pixels that do not match any of the background Gaussians are classified as foreground.
- If the new pixel does not match any of the K existing Gaussians, the least probable distribution is replaced with a new one.
- The new distribution has a high variance and a low prior weight.
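The match-or-replace rule can be sketched for a single grayscale pixel as follows. This is a simplified illustration, not the authors' implementation: the 2.5-standard-deviation match test, the learning rate, the initial variance and weight, and the choice of "lowest weight" as the least probable component are all assumptions made for the example.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def update_pixel_model(model, x, alpha=0.05, match_sigmas=2.5,
                       init_var=900.0, init_weight=0.05):
    """Update one pixel's mixture in place with a grayscale observation x.

    model is a list of dicts with keys "w", "mean", "var". Returns the index
    of the matched component, or None when nothing matched and the lowest
    weight component was replaced by a new wide, low-weight Gaussian.
    """
    matched = None
    for k, g in enumerate(model):
        if abs(x - g["mean"]) <= match_sigmas * math.sqrt(g["var"]):
            matched = k
            break

    if matched is None:
        # No match: replace the least probable component (assumed: lowest weight).
        worst = min(range(len(model)), key=lambda k: model[k]["w"])
        model[worst] = {"w": init_weight, "mean": float(x), "var": init_var}
    else:
        # Match: raise the matched weight, decay the others.
        for k, g in enumerate(model):
            g["w"] = (1 - alpha) * g["w"] + alpha * (1.0 if k == matched else 0.0)
        g = model[matched]
        rho = alpha * gaussian_pdf(x, g["mean"], g["var"])
        g["mean"] = (1 - rho) * g["mean"] + rho * x
        g["var"] = (1 - rho) * g["var"] + rho * (x - g["mean"]) ** 2

    # Keep the weights summing to one.
    total = sum(g["w"] for g in model)
    for g in model:
        g["w"] /= total
    return matched


# Hypothetical two-component model for one pixel.
pixel_model = [
    {"w": 0.7, "mean": 100.0, "var": 25.0},
    {"w": 0.3, "mean": 200.0, "var": 25.0},
]
first = update_pixel_model(pixel_model, 102)   # close to component 0
second = update_pixel_model(pixel_model, 30)   # matches nothing
```

Whether a pixel is then reported as foreground depends on whether its matched component belongs to the background set; an unmatched pixel is foreground by definition.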

GMM For Background Subtraction: Judging New Pixels
- Pixel-wise threshold
- You may use eigenanalysis or neighboring blocks to strengthen the analysis (see Seki et al.'s work)

Local Features for Images
- Semantics, attributes
- Detection
- BOW model
- Local detector/descriptor
Local Features for Video Analysis

Local Features Based Video Analysis: Space-Time Interest Point
- Follows the local detector + descriptor paradigm
- As in the image domain (or even worse), the detectors are mathematically well motivated but their performance is unsatisfying
- Dense sampling is still a good option
- Laptev's widely used descriptors: HOG (histogram of oriented gradients) and HOF (histogram of optical flow)
Ivan Laptev et al., CVPR 08
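HOG and HOF share the same core operation: a magnitude-weighted histogram of vector directions, computed over image gradients for HOG and over optical flow vectors for HOF. The sketch below shows only that core for one cell; the function name, the bin count, and the full 0 to 2π angle range are assumptions, and it omits the spatial and temporal cell grid that a full descriptor concatenates.

```python
import math


def orientation_histogram(vectors, num_bins=8):
    """Magnitude-weighted histogram of vector directions over [0, 2*pi).

    Pass image gradients (gx, gy) for a HOG-style histogram, or optical
    flow vectors (u, v) for a HOF-style histogram. Returns an
    L1-normalised list of num_bins values.
    """
    hist = [0.0] * num_bins
    bin_width = 2 * math.pi / num_bins
    for dx, dy in vectors:
        magnitude = math.hypot(dx, dy)
        if magnitude == 0.0:
            continue  # zero vectors carry no direction
        angle = math.atan2(dy, dx) % (2 * math.pi)
        hist[int(angle / bin_width) % num_bins] += magnitude
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist


# Hypothetical flow vectors from one space-time cell: right, left, down.
cell_flow = [(1.0, 0.1), (-1.0, 0.1), (0.1, -1.0), (0.0, 0.0)]
hof = orientation_histogram(cell_flow, num_bins=4)
```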

Local Features Based Video Analysis: Hierarchically Filtered Motion
- Motion history is informative but often very noisy
- Uses a hierarchical filter + HOG descriptor
- Implementation is faster than STIP (60+ frames/second)
Tian et al., Hierarchical Filtered Motion for Action in Crowded Videos, TSMC 2011

Local Features Based Video Analysis: Dense Trajectory
- Does not use a feature detector but dense sampling
- Tracks densely sampled points for L frames by median filtering in a dense optical flow field
Wang et al., CVPR 2011, IJCV 2012
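The tracking step can be sketched as: at each frame, move the point by the median of the flow vectors in a small window around it. This is a simplified illustration that assumes the dense flow fields are already computed and stored as `flows[t][row][col] = (u, v)`; the function names and the 3x3 window are assumptions.

```python
from statistics import median


def median_flow_at(flow, x, y, radius=1):
    """Median (u, v) in a (2*radius+1)^2 window around integer pixel (x, y)."""
    rows, cols = len(flow), len(flow[0])
    us, vs = [], []
    for r in range(max(0, y - radius), min(rows, y + radius + 1)):
        for c in range(max(0, x - radius), min(cols, x + radius + 1)):
            u, v = flow[r][c]
            us.append(u)
            vs.append(v)
    return median(us), median(vs)


def track_point(start, flows, radius=1):
    """Follow one sampled point through consecutive flow fields.

    Returns the list of (x, y) positions; tracking stops early if the
    point leaves the frame.
    """
    x, y = start
    trajectory = [(x, y)]
    for flow in flows:
        xi, yi = int(round(x)), int(round(y))
        if not (0 <= yi < len(flow) and 0 <= xi < len(flow[0])):
            break
        u, v = median_flow_at(flow, xi, yi, radius)
        x, y = x + u, y + v
        trajectory.append((x, y))
    return trajectory


# Hypothetical 5x5 flow field: uniform motion with one noisy outlier.
field = [[(1.0, 0.5)] * 5 for _ in range(5)]
field[2][2] = (9.0, -9.0)
trajectory = track_point((1, 1), [field, field, field])
```

The median suppresses the single outlier vector, which is the reason for filtering the flow rather than reading it directly at the point.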

Local Features Based Video Analysis: Bag of Words Model
Niebles, Wang, Fei-Fei, IJCV 2008 (559 citations)
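The bag-of-words representation itself can be sketched as: assign every local descriptor to its nearest codeword and count. The sketch below assumes a codebook is already available (for example from k-means) and covers only the histogram step, not the topic model that the cited paper builds on top of it; the names and toy values are illustrative.

```python
def bag_of_words(descriptors, codebook):
    """L1-normalised histogram of nearest-codeword assignments."""
    counts = [0] * len(codebook)
    for d in descriptors:
        nearest = min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(d, codebook[k])),
        )
        counts[nearest] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)


# Hypothetical 2-D descriptors and a two-word codebook.
codebook = [(0.0, 0.0), (10.0, 10.0)]
video_descriptors = [(1.0, 0.0), (0.0, 2.0), (9.0, 9.0), (1.0, 1.0)]
video_histogram = bag_of_words(video_descriptors, codebook)
```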

Local Features Based Video Analysis: Action Detection
- Object detection in images: branch and bound in 2D
  Lampert, Blaschko, and Hofmann, Beyond sliding windows: Object localization by efficient subwindow search, CVPR 08
- Action detection in videos: branch and bound in 3D (xyt)
  Yuan, Liu and Wu, Discriminative Subvolume Search for Efficient Action Detection, CVPR 09

Semantics Based Video Analysis: From Object Bank To Action Bank
Li et al., Object Bank: A High-Level Image Representation for Scene Classification and Semantic Feature Sparsification, NIPS 2010

Semantics Based Video Analysis: From Object Bank To Action Bank
Sadanand and Corso, Action Bank: A High-Level Representation of Activity in Video, CVPR 2012

Semantics Based Video Analysis: Action Attributes
Liu, Kuipers, and Savarese, Recognizing Human Actions by Attributes, CVPR 2011

Local Features Based Video Analysis: Pros and Cons of Local Video Features
Pros
- Easily borrow techniques from image analysis
- No need for tracking or detecting the human body (which can sometimes be frustrating)
Cons
- Most of them are slow (for real-time processing)
- Expensive (the number of local features can be as large as several million)
- Not good enough for low-resolution, crowded scenes

Gap: state-of-the-art action recognition vs. large-scale action/event recognition

VIRAT Dataset
Real-world surveillance:
- Low resolution of subjects
- Both spatial and temporal detection
- Multiple objects, different movements, occlusions
- The majority of the videos contain no events; only a small number of sequences contain events
Figures courtesy of Oh and Perera
Oh et al., A Large-scale Benchmark Dataset for Event Recognition in Surveillance Video, CVPR 11

TRECVID SED
Challenges of SED:
- The majority of the videos contain no events; only a small number of sequences contain events
- Heavy occlusion, significantly different viewing directions
[Figure: results from the CMU-IBM team (best performer at TRECVID SED 12)]
Note: you will get a score of 1.0 with an empty submission.
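The note about empty submissions follows from how detection-cost metrics are built. A minimal sketch, assuming the score is a normalized detection cost of the form P_miss + beta * R_FA (miss probability plus a weighted false-alarm rate per hour, lower is better); the function name, the counts, and the beta value are illustrative assumptions, not taken from the slide.

```python
def normalized_detection_cost(num_missed, num_true_events,
                              num_false_alarms, hours_of_video, beta):
    """Miss probability plus beta times the false-alarm rate per hour."""
    p_miss = num_missed / num_true_events
    false_alarm_rate = num_false_alarms / hours_of_video
    return p_miss + beta * false_alarm_rate


BETA = 0.005  # assumed weighting of false alarms

# Empty submission: every true event is missed, no false alarms.
empty_score = normalized_detection_cost(100, 100, 0, 10.0, BETA)

# Hypothetical system: finds 5 of 100 events with 40 false alarms in 10 hours.
system_score = normalized_detection_cost(95, 100, 40, 10.0, BETA)
```

Under this form, a system has to find events without raising many false alarms just to score below the 1.0 that submitting nothing achieves.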

Why Event/Action Recognition Is Difficult
Features are not good enough
- How to design or learn efficient, accurate features?
- How to make features reliable across different views/scenes?
Training labels are not enough
- How much improvement can we expect from more data?
- How to learn from heavily imbalanced data?
