Learning to Detect Activity in Untrimmed Video. Prof. Bernard Ghanem


1 Learning to Detect Activity in Untrimmed Video Prof.

2 An image is worth a thousand words; a video is worth a million words.
Image: a tiger attacking a person on a grass field.
Video: the tiger is being playful.
Source: YouTube


3 Fun facts about video
By 2017, online video will account for 74% of all online traffic [3].
55% of people watch videos online every day [1].
45% of people watch more than an hour of Facebook or YouTube videos a week [2].
Almost 50% of internet users look for videos related to a product or service before visiting a store [4].
85% of Facebook video is watched without sound [5].
Sources: 1) MWP Statistics, 2015; 2) HubSpot; 3) KPCB; 4) Google, 2016; 5) DIGIDAY, 2016

4 Problem: Detecting Human Activities in Video
Input: a video.
Output: Class: Pole Vault; Bounds: (23.1s, 25.2s)
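The output format above (a class label plus temporal bounds) can be compared against an annotation with temporal intersection-over-union. The slide does not name a metric, so this is only a minimal sketch of one common choice; the ground-truth segment and the `temporal_iou` helper are assumptions made for illustration.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments, in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    # Overlap is zero when the segments do not intersect.
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


# Detection taken from the slide; the ground truth is a hypothetical annotation.
detection = {"label": "Pole Vault", "bounds": (23.1, 25.2)}
ground_truth = (23.0, 25.0)

iou = temporal_iou(detection["bounds"], ground_truth)
print(f"{detection['label']}: temporal IoU = {iou:.3f}")
```

A detection is then typically counted as correct when the label matches and the overlap exceeds a chosen threshold.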


6 Why Activity Detection?

8 Challenges of Detecting Human Activities
Input: a video. Output: Class: Pole Vault; Bounds: (23.1s, 25.2s)
1. Not enough large-scale training data
2. Large number of activities
3. Real-time processing is not enough

9 1. Not enough large-scale training data
1st Version (R1.1): ~200 classes, ~850 hours, class hierarchy
ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding [CVPR 2015]

10 1. Not enough large-scale training data
At CVPR 2017 (July 26 afternoon) http://activity-net.
ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding [CVPR 2015]

11 Classical Activity Detection Pipeline Basketball Dunk Classifier... Volleyball Spiking Classifier


13 Using proposals is important
[Figure: Action Proposal; Basketball Dunk Classifier; Volleyball Spiking Classifier]

14 What have we done?
Fast Temporal Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos [CVPR 2016]: proposals are represented as sparse combinations of STIPs (10FPS on single CPU core)
DAPs: Deep Action Proposals for Action Understanding [ECCV 2016]: multi-scale (sparse) proposals are output by an LSTM in one pass (130FPS on single GPU)
SST: Single-Stream Temporal Action Proposals [CVPR 2017]: multi-scale (dense) proposals are scored by a GRU in one pass + streaming (300FPS on single GPU)

15 SST: Single Stream Temporal Action Proposals
[Figure: Untrimmed Input Video -> Temporal Action Proposals -> Localized Action Detections. The input video passes through a visual encoder ϕ and a sequence encoder; at each time step t the SST classifier output c_t scores k proposals, with sizes from δ up to kδ, the maximum proposal size per output. Horizontal axis: Time.]
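One way to read the SST figure is that each time step emits k confidence scores, one per proposal length, for proposals ending at that step. The sketch below decodes such a score matrix into (start, end, score) proposals. It is an illustration under that assumption, not the authors' code: the function name `decode_proposals`, the threshold, and the toy scores are all hypothetical.

```python
def decode_proposals(confidences, delta, threshold=0.5):
    """Turn per-step scores into temporal proposals.

    confidences[t][j] is assumed to score a proposal that ends after step t
    and spans (j + 1) * delta steps.
    """
    proposals = []
    for t, c_t in enumerate(confidences):
        end = t + 1
        for j, score in enumerate(c_t):
            start = end - (j + 1) * delta
            # Skip proposals that would start before the video does.
            if start >= 0 and score >= threshold:
                proposals.append((start, end, score))
    # Highest-confidence proposals first.
    return sorted(proposals, key=lambda p: -p[2])


# Toy scores: 4 time steps, k = 2 proposal sizes, delta = 1 step.
scores = [[0.1, 0.9], [0.2, 0.8], [0.7, 0.1], [0.3, 0.95]]
proposals = decode_proposals(scores, delta=1)
print(proposals)
```

Because every step is scored once as the video streams in, the whole video is covered in a single pass, which is the property the slide emphasizes.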

16 SS-TAD: Single Stream Temporal Action Detection
[Figure, panels (a), (b), (c): Untrimmed Video Input; Proposals; Classifiers; Frame-level Classifiers; Merging/Smoothing; SS-TAD; Action Detections]
End-to-End, Single-Stream Temporal Action Detection in Untrimmed Videos [BMVC 2017]: multi-scale (dense) detections are scored in one pass + streaming (700FPS on TitanX GPU)

17 SS-TAD: Single Stream Temporal Action Detection
[Video demo. Key: Detection, Ground-truth. Axis: Time. Actions are played at 1x speed; background video is sped up.]

18 2. Large number of activities
Applying activity detectors to a large number of activity classes is expensive. Can we do better than a computational cost that grows linearly with the number of activity classes?

19 Activity-Object and Activity-Scene Relations SCC: Semantic Context Cascade for Efficient Action Detection [CVPR 2017] DAPs: Deep Action Proposals for Action Understanding [ECCV 2016]

20 Typical Activity Detection Pipeline
[Figure: Video Sequence; Action Proposals (Stage 1); Action Proposals; Action Classifiers (Stage 2); Reject]
SCC: Semantic Context Cascade for Efficient Action Detection [CVPR 2017]
DAPs: Deep Action Proposals for Action Understanding [ECCV 2016]

21 SCC: Semantic Context Cascade SCC: Semantic Context Cascade for Efficient Action Detection [CVPR 2017]
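The idea of a cascade that uses activity-object relations to avoid running every classifier on every proposal can be sketched as follows. This is a toy illustration, not the SCC method itself: the `ACTIVITY_OBJECTS` table, the proposal format, and the stand-in `overlap_score` classifier are all assumptions.

```python
# Hypothetical activity-object relations.
ACTIVITY_OBJECTS = {
    "basketball dunk": {"ball", "hoop"},
    "volleyball spiking": {"ball", "net"},
    "washing dishes": {"sink", "plate"},
}


def overlap_score(proposal, activity):
    """Stand-in for an expensive activity classifier."""
    objs = ACTIVITY_OBJECTS[activity]
    return len(objs & proposal["objects"]) / len(objs)


def cascade(proposals, classify):
    """Reject proposals and prune classes before calling any classifier."""
    detections, calls = [], 0
    for p in proposals:
        # Keep only activities that share an object with the proposal.
        candidates = [a for a, objs in ACTIVITY_OBJECTS.items()
                      if objs & p["objects"]]
        if not candidates:
            continue  # reject: no classifier is run on this proposal
        scored = []
        for a in candidates:
            calls += 1
            scored.append((classify(p, a), a))
        score, label = max(scored)
        detections.append((p["segment"], label, score))
    return detections, calls


proposals = [
    {"segment": (0, 5), "objects": {"ball", "hoop"}},
    {"segment": (5, 9), "objects": {"tree"}},
    {"segment": (9, 12), "objects": {"sink"}},
]
detections, calls = cascade(proposals, overlap_score)
print(detections, calls)
```

In this toy run the cascade makes 3 classifier calls instead of the 9 needed to evaluate all three classifiers on all three proposals.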


24 3. Real-time processing is not enough
In the past, real-time processing was good to have, i.e. 1 min of video took 1 min of processing. But not anymore! We need to stay ahead of the increasing video upload rate.
How?
- hardware acceleration (GPUs)
- more efficient implementation
- do we need to visit every frame?
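To put the "1 min video, 1 min processing" baseline in perspective, the throughputs quoted earlier in the deck can be converted into processing time per minute of video. The 30 frames-per-second video rate is an assumption (the slides do not state one), and `processing_seconds` is a hypothetical helper.

```python
def processing_seconds(video_seconds, model_fps, video_fps=30.0):
    """Time to process a video, assuming it is recorded at video_fps."""
    return video_seconds * video_fps / model_fps


# Throughputs as quoted on the slides.
QUOTED_FPS = {
    "Fast Temporal Activity Proposals (CPU core)": 10,
    "DAPs (GPU)": 130,
    "SST (GPU)": 300,
    "SS-TAD (TitanX GPU)": 700,
}

for name, fps in QUOTED_FPS.items():
    secs = processing_seconds(60, fps)
    print(f"{name}: {secs:.1f} s per minute of video")
```

Under this assumption, any model running above 30 FPS is faster than real time, and the gap widens further if not every frame has to be visited.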

25 Do we have to visit every frame?
Log how the human annotator moves the time slider instead of throwing that information away. Can we learn from how humans move the slider to localize activities?
[Figure: search history over time t]
Action Search: Learning to Search for Human Activities in Untrimmed Videos [arxiv 2017][To be submitted to CVPR2018]

26 Action Search: Learning to Search for Human Activities in Untrimmed Videos [arxiv 2017][To be submitted to CVPR2018]


27 [Figure: Action Search model unrolled over search steps i-3, ..., i+1. At each step a ConvNet turns the visual observation X_i into a feature vector v_i, and an LSTM with state h_i, given the target activity, outputs a temporal location f(x_i) on the timeline t. Legend: X: visual observation; v: feature vector; h: LSTM state; f(x): temporal location.]
Action Search: Learning to Search for Human Activities in Untrimmed Videos [arxiv 2017][To be submitted to CVPR2018]


28 Action Search or Action Spotting
[Video examples. Activity: shot put; Activity: basketball dunk]
Action Search: Learning to Search for Human Activities in Untrimmed Videos [arxiv 2017][To be submitted to CVPR2018]

29 SPONSORS

30 Prof. ivul.kaust.edu.sa
[Activity labels shown: baseball throw, dunk, shoveling, washing dishes, pole vault, dancing]
