Long-term Temporal Convolutions for Action Recognition INRIA



1 Long-term Temporal Convolutions for Action Recognition. Gül Varol, Ivan Laptev, Cordelia Schmid. INRIA.


2 Motivation. Current CNN methods for action recognition learn representations for short intervals (1-16 frames), whereas typical actions last several seconds. Actions contain characteristic patterns with specific long-term temporal structure. Space-time convolutions, long temporal extent [Tran 2015]; optical flow [Simonyan 2014].

3 Contributions. The advantages of long-term temporal convolutions. The importance of high-quality optical flow estimation for learning accurate video representations.

4 Approach

5 Network Architecture. 3D convolutions with 3x3x3 filters, ReLU, and 3D max-pooling of 2x2x2. Experiments with T = {16, 20, 40, 60, 80, 100}.
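The layer arithmetic on this slide can be sketched as a small shape calculator. This is a minimal sketch, not the authors' code. It assumes each 3x3x3 convolution is zero-padded so that it preserves the input size and that every stage ends with a stride-2 2x2x2 max-pooling. The function names and the `num_stages` parameter are hypothetical, since the slide does not state the network depth.

```python
def pooled_shape(frames, height, width, num_stages):
    """Return (T, H, W) after `num_stages` conv+pool stages.

    Assumptions (not stated on the slide): 3x3x3 convolutions are
    zero-padded by 1 so they keep the size; each 2x2x2 max-pooling
    uses stride 2, floors odd sizes, and never goes below 1.
    """
    shape = (frames, height, width)
    for _ in range(num_stages):
        shape = tuple(max(1, s // 2) for s in shape)
    return shape


# Temporal extents T listed on the slide.
TEMPORAL_EXTENTS = [16, 20, 40, 60, 80, 100]


def shapes_for_extents(height, width, num_stages):
    """Map each temporal extent T to its pooled (T, H, W) shape."""
    return {
        t: pooled_shape(t, height, width, num_stages)
        for t in TEMPORAL_EXTENTS
    }
```

Calling `shapes_for_extents(58, 58, 4)` lists how much temporal resolution survives four pooling stages for each T, which makes it easy to see why longer inputs leave more time steps at the top of the network.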

6 Network Architecture. Optical flow: 2-channel input (original [x, y] values). RGB: 3-channel input. Temporal extent is increased at the cost of decreased spatial resolution.
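The trade-off between temporal extent and spatial resolution can be checked with simple arithmetic. The sketch below counts input values per clip. It assumes that the 112x112 resolution goes with the 16-frame network and 58x58 with the 60-frame network, a pairing read off the "16 vs 60-frame networks" slide rather than stated explicitly. The helper name `input_values` is hypothetical.

```python
FLOW_CHANNELS = 2  # optical flow: [x, y] components
RGB_CHANNELS = 3   # RGB: three color channels


def input_values(frames, height, width, channels):
    """Number of scalar values in one input clip."""
    return frames * height * width * channels


# Assumed pairing: 16 frames at 112x112 versus 60 frames at 58x58.
short_clip = input_values(16, 112, 112, FLOW_CHANNELS)
long_clip = input_values(60, 58, 58, FLOW_CHANNELS)

# Ratio close to 1 means both inputs carry about the same amount of data.
ratio = long_clip / short_clip
```

Under these assumptions the two input volumes are almost the same size, so the longer clip trades spatial detail for temporal coverage rather than simply adding data.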

7 Experiments

8 Datasets: HMDB51 (Kuehne et al. 2011) and UCF101 (Soomro et al. 2012).

9 Optical Flow. 60-frame training from scratch with different input modalities. 60-frame networks from scratch on UCF101 (split 1).

Input        Clip   Video
RGB
MPEG flow    …5     63.8
Farneback    66.3   71.3
Brox         74.8   79.

Figure panels: RGB, MPEG flow, Farneback flow, Brox flow.

Conclusions: Even low-quality MPEG flow outperforms RGB. Quality of flow impacts the results significantly.

10 16 vs 60-frame networks. Spatial resolution: 112x112 (16f) and 58x58 (60f). The 16f network has the same architecture as [Tran 2015]. Table on UCF101 (split 1): clip and video accuracy for RGB and Flow inputs at 16f and 60f, with the gain. Table on HMDB51 (split 1): clip and video accuracy at 16f and 60f, with the gain, for two pretraining settings (Flow from scratch, Flow from UCF101), compared with [Simonyan 2014].

11 RGB Network Finetuning. RGB from scratch is difficult to learn, so we need pretraining. Table on UCF101 (split 1): clip and video accuracy for RGB from scratch and for C3D pretraining at 16f and two further temporal extents whose labels are not preserved. C3D: 16f 3D convnet trained on Sports1M (Tran 2015). We extend C3D to longer temporal convolutions as follows. Conv5 layer output has T/16 temporal resolution. Max-pool conv5 output over time to recycle pretrained fc layers. Finetune the whole network.
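The "max-pool over time" step can be illustrated as follows. Fully connected layers pretrained on short clips expect a fixed number of time steps, so a longer clip's conv5 output is pooled down to that size. This is a minimal sketch under assumptions: the slide only says to max-pool over time, so the non-overlapping windows, the function name `max_pool_over_time`, and the target size `out_steps` are all illustrative choices.

```python
def max_pool_over_time(features, out_steps):
    """Reduce a sequence of feature vectors to `out_steps` time steps.

    `features` is a list of T feature vectors (one per time step).
    The sequence is split into `out_steps` equal, non-overlapping
    windows and an elementwise max is taken inside each window.
    The windowing scheme is an assumption, not taken from the slides.
    """
    steps = len(features)
    if out_steps <= 0 or steps % out_steps != 0:
        raise ValueError("T must be a positive multiple of out_steps")
    window = steps // out_steps
    pooled = []
    for start in range(0, steps, window):
        chunk = features[start:start + window]
        # zip(*chunk) groups the same feature index across time steps.
        pooled.append([max(values) for values in zip(*chunk)])
    return pooled
```

With `out_steps` set to the temporal size the pretrained fc layers were trained on, the pooled output has the same shape as before, so the pretrained fc weights can be reused and then finetuned.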

12 Varying Temporal and Spatial Resolutions. Plot of performance against temporal extent T on UCF101 (split 1). Dotted curves: clip. Plain curves: video. Pink: Flow from scratch. Blue: RGB from C3D. Panels: high resolution (71x71) and low resolution (58x58). Conclusion: long temporal extent; high spatial resolution; RGB+Flow complementary; RGB > Flow (clips); RGB < Flow (videos); curves less steep for video.

13 Multiple Networks Combined (split 1)

Input                   UCF101   HMDB51
LTCFlow (100f)          82.6     56.7
LTCFlow (60f+100f)      83.8     60.
LTCRGB (100f)           81.8
LTCRGB (60f+100f)       81.5
LTCFlow+RGB
LTCFlow+RGB + IDT
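One common way to combine several networks is late fusion of their per-class scores. The slide does not say which fusion scheme was used, so the plain averaging below is an assumption, and the function names are hypothetical.

```python
def average_scores(score_lists):
    """Average per-class scores from several networks.

    `score_lists` holds one list of class scores per network, all of
    the same length. Plain averaging is an assumed fusion rule.
    """
    if not score_lists:
        raise ValueError("need at least one network")
    num_classes = len(score_lists[0])
    if any(len(scores) != num_classes for scores in score_lists):
        raise ValueError("all networks must score the same classes")
    return [sum(column) / len(score_lists) for column in zip(*score_lists)]


def predict(score_lists):
    """Index of the class with the highest fused score."""
    fused = average_scores(score_lists)
    return max(range(len(fused)), key=fused.__getitem__)
```

For example, passing the class scores of a flow network and an RGB network as `predict([flow_scores, rgb_scores])` returns the class both networks support most strongly on average.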

14 Comparison to the State-of-the-art (UCF101 and HMDB51, 3 splits average). Accuracies that survive are shown next to their method.
Handcrafted: IDT+FV [Wang 2013]; IDT+HSV [Peng 2014]; IDT+MIFS [Lan 2015]; IDT+SFV [Peng 2014] 66.8.
CNN (RGB): Slow fusion (scratch) [Karpathy 2014] 41.3; C3D (scratch) [Tran 2015] 44; Slow fusion [Karpathy 2014] 65.4; Spatial stream [Simonyan 2014]; C3D (1 net) [Tran 2015]; C3D (3 nets) [Tran 2015] 85.2; LTCRGB.
CNN (Flow): Temporal stream [Simonyan 2014]; LTCFlow.
Fusion: Two-stream (avg) [Simonyan 2014]; Two-stream (SVM) [Simonyan 2014]; LSTM (flow+rgb) [Ng 2014] 88.6; TDD [Wang 2015]; C3D+IDT [Tran 2015] 90.4; TDD+IDT [Wang 2015]; LTCFlow+RGB; LTCFlow+RGB + IDT.
Footnotes: (1) Our implementation is 80.2%. (2) No finetuning. (3) Uses multitask learning.

15 Qualitative Analysis

16 Classes with Largest Improvement

Class          16f    60f
JavelinThrow   54.8   96.

JavelinThrow is mostly confused with FloorGymnastics in 16f. FloorGymnastics = running + gymnastics. JavelinThrow = running + throwing javelin.

17 First Layer Filters. Complex motion patterns in local neighborhoods. x and y intensities shown as 2D vectors; t=1 blue, t=2 red, t=3 green. 60f Flow on UCF101 (split 1).

18 Higher Layer Filters. Top activations of filters at conv layers (L1, L2, L3, L4, L5) for 100f and 16f. Colors: classes. Rows: maximum responding test videos. Columns: filters.

19 Thanks! Questions? Project page: Contact: gul.varol@inria.fr

20 Credits. Special thanks to all the people who made and released these awesome resources for free: presentation template by SlidesCarnival, photographs by Unsplash.

arXiv: v1 [cs.CV] 15 Apr 2016. Long-term Temporal Convolutions for Action Recognition. Gül Varol, Ivan Laptev, Cordelia Schmid. Inria. Abstract. Typical human actions last several seconds and exhibit characteristic