Hide-and-Seek: Forcing a network to be Meticulous for Weakly-supervised Object and Action Localization


Hide-and-Seek: Forcing a Network to be Meticulous for Weakly-supervised Object and Action Localization. Krishna Kumar Singh and Yong Jae Lee, University of California, Davis. Paper presentation by Yixian Wang and Ethan Hou.

Outline: Background & Current Problems; Introduction to the Paper's Solution; Approach & Implementation Details; Experiments.

Background: Weakly-supervised learning. Definition: learning from a limited amount of labeled data. Advantages: requires less detailed annotations; compatible with the vast amount of weakly-annotated visual data. Categories: incomplete supervision, inexact supervision, inaccurate supervision, and mixed use of these.

Background: Current Problem. Existing weakly-supervised methods localize only the most discriminative parts of an object.

Main Idea: Hide patches in a training image randomly, which forces the network to seek other relevant parts.

Main Idea: Advantages. Modifies only the input image. Works with any network designed for object localization. Generalizes easily to different neural networks and tasks.

Approach: An object localizer predicts both the category label and the bounding box. Hide Random Image Patches: divide each training image into a grid of fixed-size patches and hide each patch randomly (with probability 0.5), which forces the network to focus on other relevant parts of the object. Note: different patches are hidden for the same training image across epochs.
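The patch-hiding step can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes a single-channel image stored as a list of rows, and the function name `hide_patches` and its parameters are hypothetical.

```python
import random


def hide_patches(image, patch_size, hide_prob=0.5, fill_value=0.0, rng=None):
    """Return a copy of `image` with randomly chosen grid patches hidden.

    `image` is a list of rows of pixel values (an assumption made for this
    sketch). The image is divided into a grid of patch_size x patch_size
    cells, and each cell is independently overwritten with `fill_value`
    with probability `hide_prob`.
    """
    if rng is None:
        rng = random.Random()
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # leave the input image untouched
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            if rng.random() < hide_prob:
                # Overwrite every pixel of this grid cell (clipped at the border).
                for y in range(top, min(top + patch_size, height)):
                    for x in range(left, min(left + patch_size, width)):
                        out[y][x] = fill_value
    return out
```

Calling the function again on the same image draws a fresh mask, which is how the same training image ends up with different hidden patches each time it is used.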

Approach: Hide Random Image Patches. Fig.: Different patches hidden for the same image.

Approach: Dealing with Hidden Pixel Values. Reason: activation distributions differ between training and testing, because training images contain hidden patches and test images do not. Solution: set the hidden pixels to the mean pixel value computed over the entire dataset.
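The fill value only needs to be computed once over the training set. The sketch below assumes each image is a list of rows of `(R, G, B)` tuples; the function name `dataset_mean_pixel` is hypothetical.

```python
def dataset_mean_pixel(images):
    """Per-channel mean over every pixel of every image in the dataset.

    `images` is a list of images; each image is a list of rows; each row is
    a list of pixels; each pixel is a tuple of channel values, e.g. (R, G, B).
    This data layout is an assumption made for the sketch.
    """
    totals = None
    count = 0
    for image in images:
        for row in image:
            for pixel in row:
                if totals is None:
                    totals = [0.0] * len(pixel)
                for channel, value in enumerate(pixel):
                    totals[channel] += value
                count += 1
    if count == 0:
        raise ValueError("dataset contains no pixels")
    return tuple(total / count for total in totals)
```

The intent, as far as the slide describes it, is that a filter looking at a hidden region then sees a typical input value rather than an arbitrary constant, which keeps training-time activations closer to those at test time.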

Approach: Setting the hidden pixel values. Fig.: Three types of activation: completely visible filter, completely hidden filter, and partially hidden filter.

Approach: Setting the hidden pixel values. Set the hidden value to the mean of all pixels. This applies to all three cases: completely visible filter, completely hidden filter, and partially hidden filter.

Approach: Generating CAM. What is a Class Activation Map (CAM)? A heat map that indicates and visualizes the discriminative image regions used by the CNN to identify a category. It enables classification-trained CNNs to perform object localization without using bounding box annotations. Fig.: Class activation map.

Class Activation Map (CAM) Example. Type of environment: outdoor. Scene categories: rock_arch (0.931). Scene attributes: natural light, open area, boating, natural, ocean, swimming, rugged scene, no horizon, rock. Informative region for predicting the category *rock_arch* is: (heat-map figure).

How to Generate CAM: Network Architecture. A CNN trained for classification, with a global average pooling (GAP) layer instead of fully-connected layers before the final output, followed by a softmax layer. B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR, 2016.

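How to Generate CAM: Project back the weights. F = {F_1, F_2, ..., F_M}: the M feature maps of the last convolutional layer. W: the N × M weight matrix of the final classification layer, where N is the number of classes. The CAM for class c is the weighted sum of the feature maps: CAM(c) = \sum_{i=1}^{M} W(c, i) \cdot F_i. B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR, 2016.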
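Projecting the weights back amounts to a per-pixel weighted sum of the M feature maps, using the row of W that belongs to the class of interest. A minimal sketch, assuming feature maps stored as nested lists and a hypothetical function name `class_activation_map`:

```python
def class_activation_map(feature_maps, weights, class_index):
    """Compute the CAM for one class as a weighted sum of feature maps.

    feature_maps: list of M maps, each a list of rows (all the same size).
    weights:      N x M matrix as a list of N rows, one row per class.
    class_index:  which class (row of `weights`) to visualize.
    """
    class_weights = weights[class_index]
    if len(class_weights) != len(feature_maps):
        raise ValueError("need exactly one weight per feature map")
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for w_k, f_k in zip(class_weights, feature_maps):
        for y in range(height):
            for x in range(width):
                cam[y][x] += w_k * f_k[y][x]
    return cam
```

In practice the resulting map is upsampled to the input image size before thresholding; that step is omitted here.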

Approach: Bounding Box. How to generate a bounding box from the CAM: threshold the heat map to segment it, keeping regions whose value is above 20% of the maximum value; the bounding box is the tightest box that covers the largest connected component. B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR, 2016.
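The thresholding and connected-component steps can be sketched as follows. The 20% threshold comes from the slide; the 4-connectivity, the strict `>` comparison, and the function name `cam_to_bbox` are assumptions made for illustration, and the sketch expects the CAM maximum to be positive.

```python
from collections import deque


def cam_to_bbox(cam, rel_threshold=0.2):
    """Return (x_min, y_min, x_max, y_max) around the largest connected
    region whose value exceeds rel_threshold * max(cam), or None if no
    pixel passes the threshold. Coordinates are inclusive pixel indices."""
    height, width = len(cam), len(cam[0])
    cutoff = rel_threshold * max(max(row) for row in cam)
    mask = [[cam[y][x] > cutoff for x in range(width)] for y in range(height)]
    seen = [[False] * width for _ in range(height)]
    best = []
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first search collects one 4-connected component.
            component, queue = [], deque([(y0, x0)])
            seen[y0][x0] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) > len(best):
                best = component
    if not best:
        return None
    ys = [y for y, _ in best]
    xs = [x for _, x in best]
    return (min(xs), min(ys), max(xs), max(ys))
```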

Approach: Action Localization. Goal: predict the label of an action and its start and end times. Approach: hide frames in the video to improve action localization. Implementation: sample frames uniformly from the video; divide the samples into segments; hide segments randomly and feed the result to a deep action localizer network; apply thresholding to the generated CAM.
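The temporal version mirrors the image case, with contiguous frame segments taking the place of patches. The sketch below is illustrative only: the slide does not state the segment length or the hiding probability for video, so `segment_length` is left to the caller and the default `hide_prob=0.5` simply mirrors the image setting. Both function names are hypothetical.

```python
import random


def sample_uniform_indices(num_frames, num_samples):
    """Pick `num_samples` frame indices spread evenly across the video."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    step = num_frames / num_samples
    # Take the middle of each of the num_samples equal-length intervals.
    return [min(int(i * step + step / 2), num_frames - 1)
            for i in range(num_samples)]


def hide_segments(frames, segment_length, hide_prob=0.5, fill_frame=None, rng=None):
    """Return a copy of `frames` in which whole segments are replaced.

    The sequence is cut into consecutive segments of `segment_length`
    frames; each segment is independently replaced by `fill_frame` with
    probability `hide_prob`.
    """
    if rng is None:
        rng = random.Random()
    out = list(frames)
    for start in range(0, len(out), segment_length):
        if rng.random() < hide_prob:
            for t in range(start, min(start + segment_length, len(out))):
                out[t] = fill_frame
    return out
```

At test time the full, unhidden sequence is passed to the network, and the temporal CAM is thresholded to obtain start and end times.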

Experiment: Object Localization. Architecture: modified AlexNet and GoogLeNet; the fully-connected layers before the final output are removed and replaced with GAP followed by a fully-connected softmax layer [*]. Dataset: ILSVRC 2016. Metrics: Top-1 Loc (classification and localization), GT-known Loc (localization only), Top-1 Class (classification only). [*] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR, 2016.
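The relationship between the two localization metrics can be made concrete with a small sketch. The slide does not give the overlap criterion; the 0.5 intersection-over-union (IoU) threshold used below is the customary choice for this benchmark and should be read as an assumption, as should the function names.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def gt_known_loc_correct(pred_box, gt_box, iou_threshold=0.5):
    """GT-known Loc: only the box is judged; the class label is given."""
    return box_iou(pred_box, gt_box) >= iou_threshold


def top1_loc_correct(pred_label, gt_label, pred_box, gt_box, iou_threshold=0.5):
    """Top-1 Loc: the top-1 label and the box must both be correct."""
    return pred_label == gt_label and gt_known_loc_correct(
        pred_box, gt_box, iou_threshold)
```

Because Top-1 Loc requires a correct label as well, it can never exceed GT-known Loc or Top-1 classification accuracy on the same images.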

Experiment: Action Localization. Architecture: C3D + CNN. C3D generates feature maps that are fed into a CNN with 2 conv layers followed by GAP and softmax. Dataset: THUMOS 2014. Metric: mAP.

C3D: the 3D convolution operation and the C3D network architecture. D. Tran et al. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.

Object localization quantitative results. Patch size N = {16, 32, 44, 56}. The difference between patch sizes is small. All are better than the baseline in localization, but have lower classification accuracy.

Object localization quantitative results. Patch size N = {16, 32, 44, 56}: lower classification accuracy. Mixed size: the patch size is chosen randomly from 16, 32, 44, 56, and no hiding. Classification accuracy improves (the network learns complementary information).

Object localization quantitative results. Ensemble model: average the CAMs obtained from models trained with different patch sizes.
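Averaging CAMs is an element-wise mean over maps of the same resolution. A minimal sketch, assuming every CAM is a list of rows and has already been brought to a common size; the function name `average_cams` is hypothetical.

```python
def average_cams(cams):
    """Element-wise mean of several CAMs with identical height and width."""
    if not cams:
        raise ValueError("need at least one CAM")
    height, width = len(cams[0]), len(cams[0][0])
    return [
        [sum(cam[y][x] for cam in cams) / len(cams) for x in range(width)]
        for y in range(height)
    ]
```

The averaged map is then thresholded and boxed in the same way as a single CAM.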

Object localization qualitative results. Fig.: ground-truth boxes compared with the localization results.

Further Analysis: Comparison with Dropout. Difference: Dropout aims to prevent overfitting and drops pixels randomly; HaS aims to improve localization and drops contiguous patches. Results: dropout produces inferior localization performance and still highlights only the most discriminative part, because with randomly dropped pixels the network still has a high probability of seeing the relevant parts.

Further Analysis: GAP vs. GMP. In the baseline, GMP (global max pooling) is inferior to GAP. With HaS, GMP improves substantially and slightly outperforms HaS with GAP. Reason: with hiding, GMP is forced to learn the less discriminative parts, and GMP is more robust to noise.

Further Analysis: HaS in convolutional layers. Divide the conv1 feature maps into a grid and hide patches of size 5 and 11. This shows that HaS can be generalized to convolutional layers.

Further Analysis: Does the hiding probability make a difference? The previous experiments used 50%. Showing more pixels improves classification performance but decreases localization ability.

Further Analysis: Action Localization Result.

Methods      IoU-thresh = 0.1    0.2      0.3      0.4      0.5
Video-full   34.23               25.68    17.72    11.00    6.11
Video-HaS    36.44               27.84    19.49    12.66    6.84

Strengths & Weaknesses. Strength: a simple method with good results. Weakness: the action localization part is introduced only briefly.

Questions?