Multi-Task Self-Supervised Visual Learning


Multi-Task Self-Supervised Visual Learning. Sikai Zhong. March 4, 2018.

Table of contents
1. Introduction
2. Self-supervised Tasks
3. Architectures
4. Experiments

Introduction

Self-supervised Learning
- It requires no manual labeling;
- The quality of the learned representation is measured by performance on downstream tasks;

Multi-task Self-supervised Learning
It combines the following four tasks to boost performance:
- Relative Position;
- Colorization;
- Exemplar;
- Motion Segmentation;

Challenge
- Tasks learn at different rates;
- A naive combination of self-supervision tasks can conflict, because different tasks prefer different features;

Self-supervised Tasks

Relative Position
- Sample two patches randomly from a single image;
- Pairs of patches are taken from adjacent grid points, and the network predicts their relative placement (an eight-way softmax classification);
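As a concrete illustration, here is a minimal sketch (Python/NumPy; the patch size, jitter, and grid layout are illustrative assumptions, not the paper's exact pipeline) of sampling a center patch and one of its eight neighbors, with the neighbor's index as the classification label:

```python
import numpy as np

def sample_relative_position_pair(image, patch_size=96, jitter=7):
    """image: HxWxC array, assumed at least ~3 patch widths in each dimension.
    Returns (center_patch, neighbor_patch, label in 0..7)."""
    # The eight neighbors of the center cell in a 3x3 grid; the list index
    # is the eight-way classification target.
    offsets = [(-1, -1), (-1, 0), (-1, 1),
               ( 0, -1),          ( 0, 1),
               ( 1, -1), ( 1, 0), ( 1, 1)]
    h, w = image.shape[:2]
    cy, cx = h // 2, w // 2
    label = np.random.randint(8)
    dy, dx = offsets[label]
    # A small random jitter keeps the network from solving the task by
    # simply matching edge continuity between adjacent patches.
    jy, jx = np.random.randint(-jitter, jitter + 1, size=2)

    def crop(y, x):
        y0, x0 = y - patch_size // 2, x - patch_size // 2
        return image[y0:y0 + patch_size, x0:x0 + patch_size]

    center = crop(cy, cx)
    neighbor = crop(cy + dy * patch_size + jy, cx + dx * patch_size + jx)
    return center, neighbor, label
```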

Colorization
- Given a grayscale image, predict the color of every pixel [2];
- Colors are vector-quantized into 313 categories;
- A 313-way softmax classification for every region of the image;
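Since every spatial position gets its own 313-way softmax, the loss is an ordinary cross-entropy applied per pixel. A minimal sketch, assuming PyTorch (the talk does not name a framework) and random tensors in place of real network outputs:

```python
import torch
import torch.nn as nn

num_bins = 313                                   # quantized ab color classes
logits = torch.randn(4, num_bins, 64, 64)        # network output: N x 313 x H x W
targets = torch.randint(num_bins, (4, 64, 64))   # ground-truth color bin per pixel

# CrossEntropyLoss accepts N x C x H x W logits with N x H x W integer targets,
# i.e. one independent softmax classification per spatial position.
loss = nn.CrossEntropyLoss()(logits, targets)
```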

Exemplar
1. Create pseudo-classes, where each class is generated by taking a patch from a single image and augmenting it via translation, rotation, scaling, and color shifts [1];
2. Randomly sample two patches x1 and x2 from the same pseudo-class, and a third patch x3 from a different pseudo-class;
3. Train the network with a loss of the form max(d(f(x1), f(x2)) - d(f(x1), f(x3)) + M, 0), where d is a distance in embedding space and M is a margin.
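The loss in step 3 is a standard triplet margin loss. A minimal sketch in PyTorch (assumed framework; the margin value is illustrative, since the slides do not give M):

```python
import torch.nn.functional as F

def exemplar_loss(f_x1, f_x2, f_x3, margin=0.5):
    """f_x1, f_x2: embeddings of two patches from the same pseudo-class;
    f_x3: embedding of a patch from a different pseudo-class. All N x D."""
    d_pos = F.pairwise_distance(f_x1, f_x2)   # d(f(x1), f(x2))
    d_neg = F.pairwise_distance(f_x1, f_x3)   # d(f(x1), f(x3))
    # max(d_pos - d_neg + M, 0): pull same-class patches together,
    # push different-class patches at least a margin apart.
    return F.relu(d_pos - d_neg + margin).mean()
```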

Motion Segmentation
Given a single frame of video, the network is asked to classify which pixels will move in subsequent frames.
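Framed as a learning problem, this is per-pixel binary classification: one logit per pixel, trained against a moving/not-moving mask. A minimal sketch, assuming PyTorch and dummy tensors:

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 1, 30, 40)                  # one logit per pixel: N x 1 x H x W
moving = torch.randint(2, (4, 1, 30, 40)).float()   # 1 where the pixel moves in later frames

# Binary cross-entropy on the per-pixel logits.
loss = nn.BCEWithLogitsLoss()(logits, moving)
```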

Architectures

Convolutional Neural Network Architecture
Figure 1: The structure of the multi-task network, based on ResNet-101, with block 3 containing 23 residual units. (a) Naive shared-trunk approach, where each head is attached to the output of block 3. (b) The lasso architecture, where each head receives a linear combination of unit outputs within block 3, weighted by a matrix that is trained to be sparse.

Separating Features via Lasso
- Different tasks require different features;
- Some information is useful for ImageNet classification but useless for object detection;
- Some tasks need only image patches, while others need the entire image;

Separating Features via Lasso
- Each task has a set of coefficients, one for each of the 23 candidate layers in block 3 (Figure 5);
- A lasso (L1) penalty is used to encourage the matrix to be sparse;
- A sparse matrix encourages the network to concentrate all of the information required by a single task into a small number of layers (see the sketch below);
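A minimal sketch of the lasso idea (PyTorch assumed; names and shapes are illustrative, not the paper's code): each head receives a per-task weighted sum of the 23 unit outputs, and an L1 penalty on the mixing matrix pushes most weights to zero:

```python
import torch
import torch.nn as nn

num_units, num_tasks = 23, 4
mixing = nn.Parameter(torch.randn(num_tasks, num_units))  # one row of coefficients per task

def task_input(unit_outputs, task_idx):
    """unit_outputs: list of 23 tensors (one per block-3 residual unit), each N x C x H x W."""
    stacked = torch.stack(unit_outputs, dim=0)              # 23 x N x C x H x W
    weights = mixing[task_idx].view(num_units, 1, 1, 1, 1)
    return (weights * stacked).sum(dim=0)                   # linear combination fed to the head

def lasso_penalty(strength=1e-3):
    # L1 term added to the total loss; sparsity concentrates each task's
    # information into a small number of layers.
    return strength * mixing.abs().sum()
```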

Harmonizing Network Inputs
- Each self-supervised task pre-processes its data differently, so the low-level image statistics are often very different across tasks;
- Relative position's preprocessing is replaced with the same preprocessing used for colorization;
- Images are converted to Lab, the a and b channels are discarded, and the L channel is replicated 3 times so that the network can be evaluated on color images;
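A minimal sketch of this harmonization step, using scikit-image for the color-space conversion (an assumed library choice; the slides do not name one):

```python
import numpy as np
from skimage import color

def harmonize(rgb_image):
    """rgb_image: H x W x 3 float array in [0, 1]."""
    lab = color.rgb2lab(rgb_image)
    L = lab[..., 0:1]                  # keep lightness; discard the a and b channels
    return np.repeat(L, 3, axis=-1)    # replicate L so the input still has 3 channels
```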

Head for Relative Position
- Input: a batch of patches;
- ResNet-v2-101 is run at a stride of 8 with dilated convolutions (most block 3 convolutions otherwise produce outputs at stride 16);
- The head adds two more residual units: the first has a 1024-channel output, a 128-channel bottleneck, and stride 2; the second has a 512-channel output, a 128-channel bottleneck, and stride 2;
- Three fully-connected residual units process the flattened feature map;
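For illustration, here is a minimal PyTorch sketch of one such bottleneck residual unit with the shapes described above (the ResNet-v2 pre-activation ordering and projection details are simplified here):

```python
import torch.nn as nn
import torch.nn.functional as F

class BottleneckUnit(nn.Module):
    """Residual unit: 1x1 down to the bottleneck, 3x3 at stride 2, 1x1 back up."""
    def __init__(self, in_ch, bottleneck_ch=128, out_ch=1024, stride=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, bottleneck_ch, 1, bias=False),
            nn.BatchNorm2d(bottleneck_ch), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_ch, bottleneck_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(bottleneck_ch), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Projection shortcut so the skip connection matches shape and stride.
        self.skip = nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)

    def forward(self, x):
        return F.relu(self.body(x) + self.skip(x))
```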

Head for Colorization
- Input images are 256×256;
- ResNet-v2-101 is run at a stride of 8 with dilated convolutions;
- The head adds two more standard convolution layers with ReLU nonlinearities (one a 2×2 kernel with stride 1, the other a 1×1 kernel with stride 1; both have 4096 output channels);
- The last layer is a 1×1 convolution with stride 1 and 313 output channels;
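A minimal PyTorch sketch of this head as described (the 1024 input channels are an assumption, matching the usual output width of ResNet-101's block 3):

```python
import torch.nn as nn

colorization_head = nn.Sequential(
    nn.Conv2d(1024, 4096, kernel_size=2, stride=1),  # 2x2 conv, 4096 channels
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 4096, kernel_size=1, stride=1),  # 1x1 conv, 4096 channels
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 313, kernel_size=1, stride=1),   # one logit per color bin
)
```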

Head for Exemplar
- Input: images are resized to 256×256 and patches of 96×96 are sampled;
- ResNet-v2-101 is run at a stride of 8 with dilated convolutions;
- The head has two residual units: the first has a 1024-channel output, a 128-channel bottleneck, and stride 2; the second has a 512-channel output, a 128-channel bottleneck, and stride 2;
- The feature map is used directly to compute the distances needed for the loss.

Head for Motion Segmentation
- Input: images are resized to 240×320;
- ResNet-v2-101 is run at a stride of 8 with dilated convolutions;
- The head has two 1×1 conv layers, each with 4096 channels, followed by another 1×1 conv layer that produces a single value, which is treated as a logit and used for per-pixel classification.
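A minimal PyTorch sketch of this head (input channels assumed to be 1024 from block 3; the placement of ReLUs between the convolutions is also an assumption):

```python
import torch.nn as nn

motion_head = nn.Sequential(
    nn.Conv2d(1024, 4096, kernel_size=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 4096, kernel_size=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 1, kernel_size=1),   # single per-pixel logit: moving / not moving
)
```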

Training
Figure 2: Distributed training setup. Several GPU machines are allocated for each task, and gradients from each task are synchronized and aggregated with separate RMSProp optimizers.
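A heavily simplified, single-machine sketch of that idea in PyTorch: each task keeps its own RMSProp optimizer (and therefore its own optimizer state) over the shared trunk parameters, rather than summing everything into one update. The distributed synchronization shown in Figure 2 is omitted:

```python
import torch

def make_task_optimizers(shared_params, num_tasks=4, lr=1e-3):
    """shared_params: a list (not a generator) of trunk parameters."""
    return [torch.optim.RMSprop(shared_params, lr=lr) for _ in range(num_tasks)]

def training_step(task_losses, optimizers):
    # Apply each task's gradient with that task's own RMSProp state.
    for loss, opt in zip(task_losses, optimizers):
        opt.zero_grad()
        loss.backward()
        opt.step()
```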

Experiments

Datasets
- ImageNet: used for relative position, colorization, and exemplar;
- SoundNet: used for motion segmentation;

Comparison of performance for different self-supervised methods over time
Figure 3: Comparison of performance for different self-supervised methods over time. The x-axis is compute time on the self-supervised task (2.4K GPU-hours per tick). "Random Init" shows performance with no pre-training.

Comparison of performance for different multi-task self-supervised methods over time
Figure 4: Comparison of performance for different multi-task self-supervised methods over time. The x-axis is compute time on the self-supervised task (2.4K GPU-hours per tick). "Random Init" shows performance with no pre-training.

Mediated combination of self-supervision tasks
Figure 5: Comparison of performance with and without the lasso technique for factorizing representations, for a network trained on all four self-supervised tasks for 16.8K GPU-hours.

References
[1] C. Doersch and A. Zisserman. Multi-task self-supervised visual learning. In IEEE International Conference on Computer Vision (ICCV), 2017.
[2] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649-666. Springer, 2016.