Structured Prediction using Convolutional Neural Networks

Transcription:

Overview — Structured Prediction using Convolutional Neural Networks
Bohyung Han (bhhan@postech.ac.kr), Computer Vision Lab.

- Convolutional Neural Networks (CNNs)
- Structured prediction for low-level computer vision: image denoising, super-resolution
- Deconvolutions for structured prediction: object generation, semantic segmentation
- Summary

Convolutional Neural Network (CNN)
- Feed-forward network built from convolutions, a non-linearity (sigmoid or Rectified Linear Unit, ReLU), and pooling (typically a local maximum)
- End-to-end supervised learning; representation learning
- Classic example: LeNet [LeCun89]

[LeCun89] Y. LeCun et al.: Handwritten Digit Recognition with a Back-Propagation Network. NIPS 1989.
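For concreteness, here is a minimal sketch of this convolution/ReLU/pooling recipe in PyTorch; the layer sizes are illustrative and not LeNet's exact 1989 configuration:

```python
# Minimal LeNet-style CNN: stacked convolution + ReLU + max-pooling blocks,
# ending in a linear classifier (illustrative sizes, not the original LeNet).
import torch
import torch.nn as nn

class LeNetStyle(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(16 * 4 * 4, num_classes)

    def forward(self, x):          # x: (N, 1, 28, 28)
        h = self.features(x)       # -> (N, 16, 4, 4)
        return self.classifier(h.flatten(1))

logits = LeNetStyle()(torch.randn(2, 1, 28, 28))   # (2, 10) class scores
```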

Convolutional Neural Network (CNN): AlexNet [Krizhevsky12]
- For a long time CNNs had not shown impressive performance. Reasons for failure: insufficient training data, slow convergence, a bad activation function (sigmoid), too many parameters, limited computing resources, and a lack of theory, so design relied on trial and error.
- CNNs recently draw a lot of attention due to their great success. Reasons: availability of larger training datasets (e.g., ImageNet), powerful GPUs, better model regularization strategies such as dropout, and a simple activation function, ReLU.
- Winner of the ILSVRC 2012 challenge
- Essentially the same architecture as [LeCun89], but trained with much larger data
- Bigger model: 7 hidden layers, 650K neurons, 60 million parameters
- Trained on 2 GPUs for a week, with error back-propagation using a stochastic gradient method

[Krizhevsky12] A. Krizhevsky, I. Sutskever, and G. E. Hinton: ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012.

Other CNNs for Classification

Very Deep ConvNet by VGG [Simonyan15]
- Smaller filters (3x3): more non-linearity and fewer parameters to learn (~140 million); see the comparison sketched below
- A significant performance improvement with 16-19 layers; generalizes to other datasets
- First place for localization and second place for classification in ILSVRC 2014

GoogLeNet [Szegedy15]
- Network in network: inception modules; auxiliary classifiers to facilitate training
- 22-layer network (27 layers if pooling layers are counted)
- Winner of the ILSVRC 2014 classification task

[Simonyan15] K. Simonyan and A. Zisserman: Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015.
[Szegedy15] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich: Going Deeper with Convolutions. CVPR 2015.
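The 3x3-filter trade-off can be checked directly. In this hypothetical comparison (channel width 64 chosen arbitrarily), two stacked 3x3 layers cover the same 5x5 receptive field as a single 5x5 layer, with fewer parameters and one extra non-linearity:

```python
# Parameter-count comparison: one 5x5 convolution vs. a VGG-style stack of
# two 3x3 convolutions covering the same 5x5 receptive field.
import torch.nn as nn

c = 64  # channels in and out (illustrative choice)
five = nn.Conv2d(c, c, kernel_size=5, padding=2)
stack = nn.Sequential(
    nn.Conv2d(c, c, 3, padding=1), nn.ReLU(),
    nn.Conv2d(c, c, 3, padding=1),
)
params = lambda m: sum(p.numel() for p in m.parameters())
print(params(five), params(stack))  # 102464 vs. 73856 parameters
```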

Low-Level Structured Prediction using Convolutional Neural Networks

Denoising Network
- A simple CNN to reconstruct noise-free images
- Input and output are both images of the same dimension
- Activation function: sigmoid; loss function: L2-norm reconstruction error
- 24 filters in each layer but 8 inter-layer connections; 5x5 filters and 624 convolutions (15,697 parameters)

V. Jain and S. Seung: Natural Image Denoising with Convolutional Networks. NIPS 2008.

Super-Resolution CNN
- Network architecture: 3 convolution layers, with a ReLU after each of the first two
- Preprocessing: bicubic interpolation of the input to the desired size
- Training: an L2-norm loss function, chosen to favor a high PSNR, optimized with stochastic gradient descent
- The learned filters are conceptually similar to sparse coding
- Good performance with a small number of training images

C. Dong, C. C. Loy, K. He, and X. Tang: Learning a Deep Convolutional Network for Image Super-Resolution. ECCV 2014.
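A rough sketch of such a three-layer super-resolution network. The 9-1-5 filter sizes and 64/32 channel counts are assumptions following the paper's commonly quoted setting, and the padding is added only to keep the example self-contained:

```python
# SRCNN-style network: bicubic-upsampled input, three convolutions with
# ReLU after the first two, trained with an L2 (MSE) reconstruction loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 64, 9, padding=4)  # patch extraction
        self.conv2 = nn.Conv2d(64, 32, 1)            # non-linear mapping
        self.conv3 = nn.Conv2d(32, 1, 5, padding=2)  # reconstruction

    def forward(self, x):       # x: bicubic-interpolated low-res image
        h = F.relu(self.conv1(x))
        h = F.relu(self.conv2(h))
        return self.conv3(h)    # no non-linearity on the output layer

net = SRCNN()
lowres, target = torch.randn(1, 1, 33, 33), torch.randn(1, 1, 33, 33)
loss = F.mse_loss(net(lowres), target)  # L2 reconstruction loss
```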

Structured Prediction in Low-Level Vision
- A CNN here acts as a collection of non-linear filters: low-level processing with learned, data-driven filters
- No domain shift between input and output
- (Figure slides: test convergence curves and super-resolution results.)

Deconvolution Networks for Structured Prediction
- Generative convolutional neural network; there is a domain shift between input and output
- Advantages: capable of structured prediction (segmentation, matching, object generation); more general than classification, extending the applicability of CNNs
- Challenges: more parameters; difficult to train; requires more training data, which may need heavy human effort; task-specific networks are typically not transferable
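In modern frameworks the deconvolution operation is exposed as a transposed convolution. A minimal illustration of how it enlarges a feature map:

```python
# A transposed ("deconvolution") layer as used in generative/decoder
# networks: it maps a coarse feature map to a larger one.
import torch
import torch.nn as nn

deconv = nn.ConvTranspose2d(16, 8, kernel_size=4, stride=2, padding=1)
x = torch.randn(1, 16, 7, 7)
print(deconv(x).shape)  # torch.Size([1, 8, 14, 14]) — spatial size doubled
```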

Deconvolution for Structured Prediction
- Object generation:
  A. Dosovitskiy, J. T. Springenberg, and T. Brox: Learning to Generate Chairs with Convolutional Neural Networks. CVPR 2015.
- Semantic segmentation:
  J. Long, E. Shelhamer, and T. Darrell: Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
  H. Noh, S. Hong, and B. Han: Learning Deconvolution Network for Semantic Segmentation. arXiv:1505.04366, 2015.
  S. Hong, H. Noh, and B. Han: Decoupled Deep Neural Network for Semi-supervised Semantic Segmentation. arXiv:1506.04924, 2015.

Object Generation

Discriminative vs. Generative CNN
- Discriminative CNN: image → CNN → object class and orientation with respect to the camera
- Generative CNN: high-level inputs → CNN → image
- Goal: generate an object based on high-level inputs such as object class, viewpoint, style, and additional parameters: rotation, translation, zoom; horizontal or vertical stretching; hue, saturation, brightness

[Dosovitskiy15] A. Dosovitskiy, J. T. Springenberg, and T. Brox: Learning to Generate Chairs with Convolutional Neural Networks. CVPR 2015.

Contribution
- Knowledge transfer: given a limited number of viewpoints of an object, the network can use knowledge learned from other similar objects to infer the remaining viewpoints
- Interpolation between different objects: the generative CNN learns the manifold of chairs

Data
- 3D chair model dataset [Aubry14]
- Original dataset: 1393 chair models, 62 viewpoints (31 azimuth angles and 2 elevation angles)
- Sanitized version: 809 models, tight cropping, resized to 128x128

Notation
- Training set $D = \{(c_i, v_i, \theta_i, x_i, s_i)\}$, where $c$ is the class label, $v$ the viewpoint, $\theta$ the additional parameters, $x$ the target RGB output image, and $s$ the segmentation mask

[Aubry14] M. Aubry, D. Maturana, A. Efros, and J. Sivic: Seeing 3D Chairs: Exemplar Part-based 2D-3D Alignment using a Large Dataset of CAD Models. CVPR 2014.

Network Architecture
- Operations: 2x2 unpooling (fixed-location unpooling, sketched below), 5x5 deconvolution, ReLU
- 32M parameters altogether
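Fixed-location unpooling has no standard library call, but it is easy to state directly. A minimal sketch that writes each activation to the top-left corner of its 2x2 block (the corner choice is an assumption):

```python
# Fixed-location 2x2 unpooling: each input activation goes to one fixed
# position of its 2x2 output block; the remaining entries are zero.
import torch

def fixed_unpool2x2(x):
    n, c, h, w = x.shape
    out = x.new_zeros(n, c, 2 * h, 2 * w)
    out[:, :, ::2, ::2] = x   # top-left corner of every 2x2 block
    return out

x = torch.arange(4.0).reshape(1, 1, 2, 2)
print(fixed_unpool2x2(x))     # 4x4 map with values at even rows/columns
```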

Training
- Objective function: minimize the Euclidean error of reconstructing the segmented-out chair image and the segmentation mask,
  $\min \sum_i \| x_i - u_{RGB}(c_i, v_i, \theta_i) \|_2^2 + \lambda \| s_i - u_{segm}(c_i, v_i, \theta_i) \|_2^2$
- Optimization: stochastic gradient descent with momentum of 0.9; learning rate 0.0002 for the first 500 epochs, divided by 2 after every 100 epochs thereafter; orthogonal matrix initialization [Saxe14]
- (A sketch of this training recipe follows this slide group.)

[Saxe14] A. M. Saxe, J. L. McClelland, and S. Ganguli: Exact Solutions to the Nonlinear Dynamics of Learning in Deep Linear Neural Networks. ICLR 2014.

Network Capacity
(Figure: generated chairs while varying translation, rotation, zoom, stretch, saturation, brightness, and color.)

Learned Filters
- Visualization of uconv-3 layer filters in the 128x128 network: RGB stream and segmentation stream
- Facts and observations: the final output at each position is generated from a linear combination of these filters; they include edges and blobs.

Single-Unit Activation
- Images generated from single-unit activations in FC-1 (class), FC-2 (class), FC-3, and FC-4, with all other neurons set to 0.
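A sketch of the optimization recipe above in PyTorch, with a placeholder network standing in for the chair generator; `sched.step()` would be called once per epoch:

```python
# SGD with momentum 0.9, learning rate 2e-4 held for 500 epochs and then
# halved every 100 epochs, plus orthogonal weight initialization [Saxe14].
import torch.nn as nn
import torch.optim as optim

def init_orthogonal(m):
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d, nn.Linear)):
        nn.init.orthogonal_(m.weight)

net = nn.Sequential(nn.Linear(10, 10))  # placeholder for the generator
net.apply(init_orthogonal)

opt = optim.SGD(net.parameters(), lr=2e-4, momentum=0.9)
sched = optim.lr_scheduler.LambdaLR(
    opt,
    lambda epoch: 0.5 ** ((epoch - 500) // 100 + 1) if epoch >= 500 else 1.0,
)
# training loop: forward/backward, opt.step(), then sched.step() per epoch
```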

Hidden Layer Analysis
- Zoom neuron: increasing the activation of the zoom neuron found in the FC-4 feature map
- Interpolation between angles: with knowledge transfer vs. without knowledge transfer (figures show the viewpoints in the training set)
- Spatial mask: chairs generated from a spatially masked 8x8 FC-5 feature map, with 2x2, 4x4, 6x6, and 8x8 masks

Morphing Different Chairs
(Figure: interpolations between different chairs; viewpoints in the training set.)

Summary
- Supervised training of a CNN can also be used to generate images.
- The generative network does not merely memorize, but also generalizes well.
- The proposed network is capable of processing very different inputs using the same standard layers.

Semantic Segmentation
- Segmenting an image based on its semantic content

Deconvolution Network for Semantic Segmentation

FCN for Semantic Segmentation
- Segmentation by a Fully Convolutional Network (FCN) [Long15]
- End-to-end CNN architecture for semantic segmentation
- Converts fully connected layers to convolutional layers: a 500x500x3 input produces a coarse 16x16x21 score map, which a 64x64 bilinear-interpolation deconvolution filter upsamples to the output
- How does this deconvolution work? The deconvolution filter is fixed to bilinear interpolation and is the same for every class: there is no learning, so it is not a real deconvolution. Only the convolution layers of the network are fine-tuned with segmentation ground truth. (A sketch of this fixed filter follows.)

[Long15] J. Long, E. Shelhamer, and T. Darrell: Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
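A sketch of such a fixed bilinear "deconvolution" as a frozen transposed convolution; the 21 classes and 32x upsampling factor follow the FCN numbers quoted above, while the padding convention is an implementation assumption:

```python
# FCN-style fixed "deconvolution": a transposed convolution whose weights
# form a bilinear-interpolation kernel, shared by every class and frozen.
import torch
import torch.nn as nn

def bilinear_kernel(size):
    f = (size + 1) // 2
    c = f - 1 if size % 2 == 1 else f - 0.5
    og = torch.arange(size, dtype=torch.float32)
    k1d = 1 - torch.abs(og - c) / f
    return torch.outer(k1d, k1d)            # (size, size) bilinear kernel

num_classes, factor = 21, 32                 # 32x upsampling -> 64x64 kernel
up = nn.ConvTranspose2d(num_classes, num_classes, kernel_size=2 * factor,
                        stride=factor, padding=factor // 2,
                        groups=num_classes, bias=False)
up.weight.data.copy_(bilinear_kernel(2 * factor).expand_as(up.weight))
up.weight.requires_grad_(False)              # fixed: there is no learning

score = torch.randn(1, num_classes, 16, 16)  # coarse class score map
print(up(score).shape)                       # torch.Size([1, 21, 512, 512])
```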

Limitations of FCN-based Semantic Segmentation
- Coarse output score map: a single bilinear filter must handle the variations of all object classes, making it difficult to capture the detailed structure of objects in an image
- Fixed-size receptive field: unable to handle multiple scales; difficult to delineate objects that are too small or too large relative to the receptive field
- Noisy predictions due to the skip architecture: a trade-off between details and noise, with only a minor quantitative performance improvement

Learning Deconvolution Network
- Learning a deconvolution network is conceptually more reasonable: better at identifying fine structures of objects; designed to generate outputs from a larger solution space; capable of predicting dense output scores
- Difficult to learn: memory intensive
- Instance-wise training and prediction: easy data augmentation; reduced solution space; inference on object proposals followed by aggregation; labels objects at multiple scales

Operations in Deconvolution Network
- Unpooling: places activations back at their pooled locations, preserving the structure of activations (see the sketch after this list)
- Deconvolution: densifies sparse activations; the filters act as bases to reconstruct object shape
- ReLU: same as in the convolution network

H. Noh, S. Hong, and B. Han: Learning Deconvolution Network for Semantic Segmentation. ICCV 2015.
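PyTorch exposes exactly this switch-based pooling/unpooling pair; a minimal illustration:

```python
# Switch-based unpooling as used in DeconvNet: max pooling records the
# argmax locations ("switches"), and unpooling places each activation back
# at its pooled location, leaving everything else zero.
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.randn(1, 1, 4, 4)
pooled, switches = pool(x)           # (1, 1, 2, 2) values + their locations
restored = unpool(pooled, switches)  # (1, 1, 4, 4), sparse: zeros elsewhere
print(restored)
```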

How Does the Deconvolution Network Work?
- Visualization of activations layer by layer: Deconv 14x14 → Unpool 28x28 → Deconv 28x28 → Unpool 56x56 → Deconv 56x56 → Unpool 112x112 → Deconv 112x112

Instance-wise Prediction and Inference
- Pipeline: 1. input image, 2. object proposals, 3. prediction and aggregation, 4. results
- Inference on object proposals: each class corresponds to one channel of the output layer, and the label of a pixel is given by a max operation over all channels
- Aggregation of object proposals: a max operation over all proposals overlapping each pixel (sketched below)
- Accuracy is not sensitive to the number of proposals; 50 proposals are used for evaluation

Inference Results
- Handles multi-scale objects naturally
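A minimal sketch of the aggregation step, with hypothetical proposal windows and random score maps standing in for the network's per-proposal outputs:

```python
# Proposal aggregation: per-proposal class score maps are pasted into
# full-image coordinates, combined by a pixel-wise max, and the label of
# each pixel is the argmax channel.
import torch

H, W, num_classes = 224, 224, 21
# Hypothetical per-proposal outputs: (score map, window coordinates). In
# the real system these come from running DeconvNet on each proposal.
proposal_scores = [
    (torch.randn(num_classes, 100, 100), (0, 0, 100, 100)),
    (torch.randn(num_classes, 120, 80), (50, 60, 170, 140)),
]

agg = torch.full((num_classes, H, W), float('-inf'))
for scores, (y0, x0, y1, x1) in proposal_scores:
    agg[:, y0:y1, x0:x1] = torch.maximum(agg[:, y0:y1, x0:x1], scores)

labels = agg.argmax(dim=0)  # per-pixel label via max over class channels
# (pixels covered by no proposal would need a background default in practice)
```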

Summary
- Confirmation of some conjectures: a deconvolution network is conceptually reasonable, and learning a deep deconvolution network is a feasible option for semantic segmentation.
- A few critical training strategies: data augmentation, multi-stage training, batch normalization
- Good performance: best among all algorithms trained on the PASCAL VOC dataset, and 3rd overall
- Code available: http://cvlab.postech.ac.kr/research/deconvnet

DecoupledNet for Semi-Supervised Semantic Segmentation

Motivation
- Challenges in existing supervised learning approaches: heavy labeling effort in semantic segmentation, since pixel-wise segmentation labels are much more expensive to obtain than other kinds of labels; difficult to extend to other classes and to handle more classes

Problem Setting
- Semi-supervised learning with hybrid annotations: many weak annotations (image-level object class labels, e.g., person, bike, horse, boat, dining table, potted plant, tv/monitor) and few strong annotations (full segmentation labels)

Architecture
- Three components: classification network, bridging layers, and segmentation network

Classification Network
- Specification: input is an image $x$; output is a 20-dimensional class label vector $f_{cls}(x; \theta_c)$
- Construction: fine-tuned from the VGG 16-layer net; transferable from any other existing classification network
- Training: $\min_{\theta_c} \sum_i e_c\big(f_{cls}(x_i; \theta_c), y_i\big)$, where $y_i \in \{0,1\}^{20}$ is the ground-truth class label vector

Segmentation Network
- Specification: input is the class-specific activation map $g$ of the input image; output is a two-channel class-specific segmentation map $f_{seg}(g; \theta_s)$
- Construction: adopts DeconvNet, customized for binary segmentation
- Training: $\min_{\theta_s} \sum_i e_s\big(f_{seg}(g_i; \theta_s), z_i\big)$, where $z_i$ is the binary ground-truth segmentation mask

Bridging Layers
- Specification: input is the concatenation, in the channel direction, of the pool5 feature map and the class-specific saliency map obtained by back-propagating class-specific information down to pool5; output is the class-specific activation map
- Construction: fully connected layers

Class-Specific Information
- Class-specific saliency map [Simonyan14]: given an image $x$, pixels related to a specific class $c$ can be identified by computing the gradient of the class score with respect to the image, $\partial S_c(x) / \partial x$ (sketched below)
- Class-specific activation maps reduce the search space for the segmentation network

[Simonyan14] K. Simonyan, A. Vedaldi, and A. Zisserman: Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. ICLR Workshop 2014.

Segmentation Maps
(Figure slides: example class-specific segmentation maps.)
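A sketch of this gradient-based saliency computation. It back-propagates all the way to the input pixels as in [Simonyan14] (DecoupledNet itself stops at pool5); the VGG-16 network and the class index are placeholders:

```python
# Class-specific saliency: back-propagate a single class score to the
# input image and read off the gradient magnitude.
import torch
import torchvision.models as models

net = models.vgg16(weights=None).eval()      # stand-in classification net
x = torch.randn(1, 3, 224, 224, requires_grad=True)
cls = 7                                      # hypothetical class of interest

score = net(x)[0, cls]                       # class score S_c(x)
score.backward()                             # dS_c/dx lands in x.grad
saliency = x.grad.abs().amax(dim=1)          # (1, 224, 224) saliency map
```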

Inference
- Needs one pass per identified label: a segmentation map is computed for each label identified by the classification network, using the same segmentation network with different class-specific information
- The final prediction combines the class-specific maps by a pixel-wise max, $\hat{z}(p) = \max\big(M_{l_1}(p), M_{l_2}(p), \ldots, M_{l_k}(p)\big)$, over the identified labels $l_1, \ldots, l_k$

Qualitative Results

Quantitative Results
- Comparison to other algorithms on the PASCAL VOC 2012 validation set
- Per-class accuracy on the PASCAL VOC 2012 test set

Summary
- Novel deep neural network architecture for semi-supervised learning with hybrid annotations
- Outstanding performance; easy training; free from iterative procedures for label inference
- More flexible approach: extensible to other classes by fine-tuning existing classification networks; capable of handling many classes without parameter explosion
- Code available: http://cvlab.postech.ac.kr/research/decouplednet

Concluding Remarks
- CNNs are useful for structured prediction: 2D/3D object generation, semantic segmentation, human pose estimation, visual tracking, image enhancement
- More parameters, but still trainable
- Unpooling is approximate but effective.