Large-scale Video Classification with Convolutional Neural Networks

Large-scale Video Classification with Convolutional Neural Networks
Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, Li Fei-Fei
Note: Slide content mostly from: Bay Area Multimedia Forum, 20 June 2014, Andrej Karpathy, Large-scale Video Classification with Convolutional Neural Networks
16-824 Spring 2015
Presenter: Esha Uboweja

Problem
Classification of videos in sports datasets

Standard approach to video classification
Bag of Words (BoW) approach:
1. Extraction of local visual features (dense/sparse)
2. Visual word encoding of features
3. Training a classifier (e.g. SVM)
Convolutional Neural Networks (CNNs) emulate all these stages in a single neural network
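Step 2 of the BoW pipeline above can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the paper: the 2-D descriptors, the three-word codebook, and the function name `encode_bow` are all made up for the example, and the codebook is assumed to have been learned already (for instance by k-means).

```python
from math import dist

def encode_bow(descriptors, codebook):
    """Return an L1-normalized histogram of nearest visual words."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        # Assign the descriptor to the closest codebook entry (visual word).
        nearest = min(range(len(codebook)),
                      key=lambda k: dist(descriptor, codebook[k]))
        counts[nearest] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

# Toy data (assumed): three visual words and four local descriptors.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1)]
histogram = encode_bow(descriptors, codebook)
```

The resulting histogram is the fixed-length vector that step 3 would hand to a classifier such as an SVM.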

Motivations for using CNNs for video classification
1. CNNs outperform other approaches in image classification tasks (e.g. ImageNet challenge)
2. Features learned in CNNs transfer well to other datasets (e.g. fine-tuning top layers of a network trained using ImageNet for food recognition)

Dataset
Existing video datasets lack the variety and number of videos needed to train a CNN:
UCF-101 dataset: 13,320 videos, 101 classes
KTH (human action): 2,391 videos, 6 classes
New dataset:
Sports-1M dataset: 1.1 million videos, 487 classes

Models

Baseline CNN (Krizhevsky et al. 12)
Layer stack, input to output: Image, Convolution, Max-Pooling, Normalization, Convolution, Max-Pooling, Normalization, Convolution, Convolution, Convolution, Max-Pooling, Fully-Connected, Fully-Connected
Call this the Single-frame baseline
Goal: Extend 2D convolutional layers to 3D to learn spatio-temporal filters

Temporal Fusion in CNNs
Early Fusion: Modify the 1st convolutional layer to be of size 11 x 11 x 3 x T pixels, where T = # frames (authors use 10)
Late Fusion: 2 single-frame networks, 15 frames apart, merge in the 1st fully connected layer. The fully connected layer can compute global motion characteristics
Slow Fusion: Spatial + temporal convolutions throughout, so higher layers get progressively more global information
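The arithmetic behind the fusion variants on this slide (early, late, and slow fusion in the paper's terminology) can be checked with a short sketch. The 11 x 11 x 3 x T filter size, T = 10, and the 15-frame gap come from the slide. The slow-fusion temporal extents and strides below are assumptions chosen for illustration, and both helper names are hypothetical.

```python
def first_layer_weights(kernel=11, channels=3, frames=1):
    """Weights in one first-layer filter that spans `frames` frames."""
    return kernel * kernel * channels * frames

def temporal_outputs(num_frames, extent, stride):
    """Number of positions a temporal filter of `extent` frames can take."""
    return (num_frames - extent) // stride + 1

# Single-frame filter vs. early fusion over T = 10 frames.
single_frame = first_layer_weights()
early_fusion = first_layer_weights(frames=10)

# Late fusion: two single-frame towers look at frames 15 apart.
late_fusion_frames = (0, 15)

# Slow fusion (assumed settings): each layer merges a few neighbouring
# time steps, so the temporal dimension shrinks gradually to 1.
slow_fusion_sizes = []
remaining = 10
for extent, stride in [(4, 2), (2, 2), (2, 2)]:
    remaining = temporal_outputs(remaining, extent, stride)
    slow_fusion_sizes.append(remaining)
```

Under these assumptions a first-layer filter grows from 363 to 3,630 weights with early fusion, while slow fusion reduces 10 input frames to 4, then 2, then 1 temporal position.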

Multiresolution CNNs
To improve runtime performance:
Input = 178 x 178 frame video clip
Low-Res Context stream: the entire frame, downsampled to 89 x 89
High-Res Fovea stream: the center 89 x 89 patch, cropped at full resolution
Both streams merge in the 1st fully connected layer
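A minimal sketch of how the two 89 x 89 inputs could be derived from one 178 x 178 frame. The slide does not say how the context stream is downsampled, so 2 x 2 block averaging is an assumption here, as are the single-channel frame and the helper names `center_crop` and `downsample_2x`.

```python
def center_crop(frame, size):
    """Fovea stream: keep the central size x size patch at full resolution."""
    height, width = len(frame), len(frame[0])
    top, left = (height - size) // 2, (width - size) // 2
    return [row[left:left + size] for row in frame[top:top + size]]

def downsample_2x(frame):
    """Context stream: average each 2 x 2 block (assumed method)."""
    height, width = len(frame), len(frame[0])
    return [[(frame[y][x] + frame[y][x + 1]
              + frame[y + 1][x] + frame[y + 1][x + 1]) / 4.0
             for x in range(0, width - 1, 2)]
            for y in range(0, height - 1, 2)]

# Synthetic grayscale frame whose pixel value encodes its position.
frame = [[float(y * 178 + x) for x in range(178)] for y in range(178)]
fovea = center_crop(frame, 89)
context = downsample_2x(frame)
```

Together the two streams carry 2 x 89 x 89 = 15,842 pixels per channel, half of the 31,684 in the original frame, which is where the runtime saving comes from.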

Train Procedure
1. Randomly sample a video
2. Sample a 15 frame (~0.5 secs) clip from (1)
3. Randomly crop and flip the frames in the clip, and subtract the mean of all pixels in images (data augmentation + preprocessing)
Test Procedure is similar
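The three training steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: frames are tiny grayscale nested lists, the same crop and flip are applied to every frame of a clip, the mean is passed in as a precomputed constant, and the function names are hypothetical.

```python
import random

def sample_clip(video, clip_len=15, rng=random):
    """Step 2: pick clip_len consecutive frames at a random offset."""
    start = rng.randrange(len(video) - clip_len + 1)
    return video[start:start + clip_len]

def augment_clip(clip, crop_size, pixel_mean, rng=random):
    """Step 3: random crop, random horizontal flip, mean subtraction."""
    height, width = len(clip[0]), len(clip[0][0])
    top = rng.randrange(height - crop_size + 1)
    left = rng.randrange(width - crop_size + 1)
    flip = rng.random() < 0.5
    augmented = []
    for frame in clip:
        rows = [row[left:left + crop_size] for row in frame[top:top + crop_size]]
        if flip:
            rows = [row[::-1] for row in rows]
        augmented.append([[pixel - pixel_mean for pixel in row] for row in rows])
    return augmented

rng = random.Random(0)
# Toy dataset (assumed): 3 videos of 40 frames, each frame 8 x 8,
# with every pixel of frame t set to t so clips are easy to inspect.
videos = [[[[float(t)] * 8 for _ in range(8)] for t in range(40)]
          for _ in range(3)]
video = rng.choice(videos)                      # Step 1
clip = sample_clip(video, clip_len=15, rng=rng)
train_clip = augment_clip(clip, crop_size=6, pixel_mean=20.0, rng=rng)
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which helps when debugging an input pipeline.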

Experiments

Feature Histogram Baseline
1. Extraction of local visual features: HoG, Texton, Cuboids, Hue-Saturation, Color moments, #Faces detected
2. Visual word encoding of features: Spatial pyramid encoding in histograms after k-means; finally obtain a 25,000-D feature vector for the entire video
3. Training a classifier: Use a 2-hidden layer neural net (worked better than any linear classifier)

Testing Procedure
1. Randomly sample 20 clips for a given test video
2. Present each clip individually to the network (with different crops and flips)
3. Individual clip class predictions are averaged to get a class result for the entire video
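Step 3 of the testing procedure, averaging per-clip predictions into one video-level prediction, might look like the sketch below. The probability values are invented for the example and only three clips are shown instead of 20; the function names are hypothetical.

```python
def average_predictions(clip_probs):
    """Average class probabilities over all clips of one video."""
    num_clips = len(clip_probs)
    return [sum(per_class) / num_clips for per_class in zip(*clip_probs)]

def predict_video(clip_probs):
    """Return (predicted class index, averaged probabilities)."""
    averaged = average_predictions(clip_probs)
    label = max(range(len(averaged)), key=averaged.__getitem__)
    return label, averaged

# Assumed network outputs for three clips over three classes.
clip_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
]
label, averaged = predict_video(clip_probs)
```

Note that the second clip on its own would vote for class 1, but the averaged prediction still picks class 0, which is the point of pooling over many clips.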

Results on Sports-1M dataset

Video Results
https://www.youtube.com/watch?v=qrzq_ab1dzk
Examples shown: Cycling, Basketball

Quantitative Results
Bar chart values (axis 0 to 100):
Baseline: 55.3
Single Frame: 59.3
Single Frame + Multires: 60
Time-fusion: 60.9
CNN Average: 63.9

Qualitative Results
1. The confusion matrix shows that the network doesn't do well on fine-grained classification
2. Slow-fusion networks are sensitive to small motions, hence motion-aware, but don't work well in the presence of camera translation and zoom

Transfer Learning

UCF-101 dataset (Soomro et al. 12)
5 main categories of data:
1. Human-Object Interaction
2. Body-Motion only
3. Human-Human Interaction
4. Playing Musical Instruments
5. Sports

Transfer learning to Sports data in UCF 101
Results (bar chart, axis 0 to 100):
Soomro et al: 43.9
Remaining bars, in slide order: Baseline, Fine-tune top layer, Fine-tune top 3 layers, Fine-tune all layers, Train CNN from scratch
Remaining values, in slide order: 59, 64.1, 65.4, 62.2, 64.1
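The fine-tuning strategies compared on this slide differ only in which layers are retrained on UCF-101. A rough sketch of that selection follows. The layer names are hypothetical, and reading "top 3 layers" as the two fully connected layers plus the classifier is an assumption based on the baseline layer stack.

```python
# Hypothetical layer names, ordered from input to output.
LAYERS = ["conv1", "conv2", "conv3", "conv4", "conv5",
          "fc6", "fc7", "classifier"]

# How many layers, counted from the top, each strategy retrains.
STRATEGIES = {
    "fine_tune_top_layer": 1,
    "fine_tune_top_3_layers": 3,
    "fine_tune_all_layers": len(LAYERS),
}

def trainable_layers(strategy):
    """Layers whose weights are updated; all others stay frozen."""
    return LAYERS[-STRATEGIES[strategy]:]

def frozen_layers(strategy):
    """Layers that keep their pre-trained Sports-1M weights unchanged."""
    return LAYERS[:len(LAYERS) - STRATEGIES[strategy]]

top_only = trainable_layers("fine_tune_top_layer")
top_three = trainable_layers("fine_tune_top_3_layers")
```

Training from scratch would use the same trainable set as fine-tuning all layers, but start from random weights rather than the pre-trained ones.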

Discussion