Martian lava field, NASA, Wikipedia


Old Man of the Mountain, Franconia, New Hampshire

Pareidolia http://smrt.ccel.ca/203/2/6/pareidolia/

Reddit for more : ) https://www.reddit.com/r/pareidolia/top/

Pareidolia: seeing things which aren't really there. DeepDream as reinforcement pareidolia

Powerpoint Alt-text Generator Vision-based caption generator

ConvNets perform classification < 1 millisecond tabby cat 1000-dim vector end-to-end learning [Slides from Long, Shelhamer, and Darrell]

R-CNN: Region-based CNN Figure: Girshick et al.

Stage 2: Efficient region proposals? Brute force on 1000x1000 = 250 billion rectangles. Testing the CNN over each one is too expensive. Let's use B.C. vision! (Before CNNs) Hierarchical clustering for segmentation. Uijlings et al., 2012, Selective Search. Thanks to Song Cao
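The 250-billion figure checks out: an axis-aligned rectangle is a choice of two distinct vertical grid lines and two distinct horizontal grid lines. A quick sanity check (illustrative only):

```python
# Count axis-aligned rectangles in a w x h pixel grid:
# choose 2 of the w+1 vertical edges and 2 of the h+1 horizontal edges,
# i.e. C(w+1, 2) * C(h+1, 2).
def num_rectangles(w, h):
    return (w * (w + 1) // 2) * (h * (h + 1) // 2)

print(num_rectangles(1000, 1000))  # 250,500,250,000 -- about 250 billion
```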

Remember clustering for segmentation? Oversegmentation Undersegmentation Hierarchical Segmentations

Cluster low-level features. Define similarity on color, texture, size, fill. Greedily group regions together by selecting the pair with highest similarity, until the whole image becomes a single region, into a hierarchy. Draw a bounding box around each one.
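The greedy grouping can be sketched in a few lines. This is a toy version: real selective search computes similarity from color, texture, size, and fill over actual superpixels, whereas here the starting regions and the similarity (prefer merging the smallest pair) are invented for illustration.

```python
# Toy selective-search-style grouping. Each region is (box, size),
# with box = (x0, y0, x1, y1). Hypothetical similarity: smaller combined
# size = more similar (real versions also use color/texture/fill).

def merge_box(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def grow_hierarchy(regions):
    """Greedily merge the most similar pair until one region remains.
    Every bounding box seen along the way becomes a proposal."""
    proposals = [r[0] for r in regions]
    while len(regions) > 1:
        i, j = min(
            ((i, j) for i in range(len(regions)) for j in range(i + 1, len(regions))),
            key=lambda ij: regions[ij[0]][1] + regions[ij[1]][1],
        )
        (ba, sa), (bb, sb) = regions[i], regions[j]
        merged = (merge_box(ba, bb), sa + sb)
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged[0])
    return proposals

segs = [((0, 0, 4, 4), 16), ((4, 0, 8, 4), 16), ((0, 4, 8, 8), 32)]
print(grow_hierarchy(segs))  # 3 initial boxes + 2 merged boxes
```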

Vs Ground Truth Thanks to Song Cao

Thanks to Song Cao

R-CNN: Region-based CNN Figure: Girshick et al. 10,000 proposals with recall 0.99 is better, but it still takes ~7 seconds per image to generate them. Then I have to test each one!

Fast R-CNN Multi-task loss RoI = Region of Interest Figure: Girshick et al.

Fast R-CNN - Convolve whole image into feature map (many layers; abstracted) - For each candidate RoI: - Squash feature map weights into fixed-size RoI pool adaptive subsampling! - Divide RoI into H x W subwindows, e.g., 7 x 7, and max pool - Learn classification on RoI pool with own fully connected layers (FCs) - Output classification (softmax) + bounds (regressor) Figure: Girshick et al.
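The adaptive RoI max-pooling step can be sketched as follows; the 7x7 grid matches the slide, but the feature-map shape and RoI coordinates are illustrative, not taken from the paper.

```python
import numpy as np

def roi_max_pool(feat, roi, out_h=7, out_w=7):
    """Max-pool an RoI of a feature map into a fixed out_h x out_w grid.
    feat: (H, W, C) feature map; roi: (y0, x0, y1, x1) in feature-map coords."""
    y0, x0, y1, x1 = roi
    window = feat[y0:y1, x0:x1]
    h, w = window.shape[:2]
    # Adaptive subwindow edges: split the RoI as evenly as possible.
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    out = np.empty((out_h, out_w, feat.shape[2]), feat.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = window[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max(axis=(0, 1))
    return out

feat = np.random.rand(32, 48, 256)          # hypothetical conv feature map
pooled = roi_max_pool(feat, (4, 10, 30, 44))  # any RoI -> fixed 7x7x256
print(pooled.shape)
```

Whatever the RoI size, the output is always 7x7x256, so the same fully connected layers can follow.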

What if we want pixels out? monocular depth estimation Eigen & Fergus 2015 semantic segmentation optical flow Fischer et al. 2015 boundary prediction Xie & Tu 2015 [Long et al.]

R-CNN does detection: many seconds per image. dog, cat [Long et al.]

~1/10 second??? end-to-end learning [Long et al.]

Fully Convolutional Networks for Semantic Segmentation Jonathan Long* Evan Shelhamer* Trevor Darrell UC Berkeley [CVPR 2015] Slides from Long, Shelhamer, and Darrell

A classification network Number of filters, e.g., 64 Number of perceptrons in MLP layer, e.g., 1024 tabby cat [Long et al.]

A classification network tabby cat [Long et al.]

A classification network tabby cat The response of every kernel across all positions is attached densely to the array of perceptrons in the fully-connected layer. [Long et al.]

A classification network tabby cat The response of every kernel across all positions is attached densely to the array of perceptrons in the fully-connected layer. AlexNet: 256 filters over a 6x6 response map. Each of the 256 x 6 x 6 = 9,216 responses is attached to each of the 4,096 perceptrons, leading to ~37 mil params. [Long et al.]

Problem: We want a label at every pixel. The current network gives us a label for the whole image. Approach: Make a CNN for every sub-image size? Convolutionalize all layers of the network, so that we can treat it as one (complex) filter and slide it around our full image.
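To see why convolutionalization works, note that a fully connected layer over an h x w x c response map is exactly a valid convolution with one h x w x c kernel per output unit; slid over a larger input, the same weights produce a spatial map of scores. A minimal NumPy sketch, with illustrative layer sizes (10 output units instead of 4096):

```python
import numpy as np

# Ten "perceptrons" over a 6x6x256 response map, written as 6x6x256 kernels.
rng = np.random.default_rng(0)
W = rng.standard_normal((10, 6, 6, 256))

def conv_valid(feat, kernels):
    """Valid convolution: slide each full-depth kernel over all positions."""
    H, Wd, _ = feat.shape
    kh, kw = kernels.shape[1:3]
    out = np.empty((H - kh + 1, Wd - kw + 1, len(kernels)))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = feat[y:y + kh, x:x + kw]
            out[y, x] = np.tensordot(kernels, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

small = rng.standard_normal((6, 6, 256))
print(conv_valid(small, W).shape)   # (1, 1, 10): exactly the FC-layer output
big = rng.standard_normal((14, 14, 256))
print(conv_valid(big, W).shape)     # (9, 9, 10): a spatial map of class scores
```

On the original 6x6 input the "convolution" fires at a single position, reproducing the fully connected output; on a larger input the same weights give dense predictions for free.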

Long, Shelhamer, and Darrell 2014

A classification network tabby cat The response of every kernel across all positions is attached densely to the array of perceptrons in the fully-connected layer. AlexNet: 256 filters over a 6x6 response map. Each of the 256 x 6 x 6 = 9,216 responses is attached to each of the 4,096 perceptrons, leading to ~37 mil params. [Long et al.]

Convolutionalization Number of filters Number of filters 1x1 convolution operates across all filters in the previous layer, and is slid across all positions. [Long et al.]

Back to the fully-connected perceptron Perceptron is connected to every value in the previous layer (across all channels; visible). [Long et al.]


Convolutionalization # filters, e.g., 1024 # filters, e.g., 64 1x1 convolution operates across all filters in the previous layer, and is slid across all positions. e.g., 64x1x1 kernel, with shared weights over the 3x3 output, x 1024 filters = 64 x 1024 ≈ 65K params. [Long et al.]

Becoming fully convolutional Multiple outputs Arbitrary-sized image When we turn these operations into a convolution, the 3x3 just becomes another parameter and our output size adjusts dynamically. Now we have a vector/matrix output, and our network acts itself like a complex filter. [Long et al.]

Long, Shelhamer, and Darrell 2014

Upsampling the output Some upsampling algorithm to return us to H x W [Long et al.]

End-to-end, pixels-to-pixels network [Long et al.]

End-to-end, pixels-to-pixels network conv, pool, nonlinearity upsampling pixelwise output + loss [Long et al.]

What is the upsampling layer? This one. Hint: it's actually an upsampling network [Long et al.]

Deconvolution networks learn to upsample Often called deconvolution, but that's a misnomer: it is not the deconvolution we saw in deblurring -> that is division in the Fourier domain. "Transposed convolution" is the better term. Zeiler et al., Deconvolutional Networks, CVPR 2010 Noh et al., Learning Deconvolution Network for Semantic Segmentation, ICCV 2015

Upsampling with transposed convolution Convolution

Upsampling with transposed convolution Convolution Transposed convolution = pad/stride a smaller image, then take a weighted sum of input x filter: "stamping" the kernel. 2x2 input, stride 1, 3x3 kernel -> upsample to 4x4; 2x2 input, stride 2, 3x3 kernel -> upsample to 5x5.

[Figure sequence: worked transposed-convolution example. A small input feature map is zero-padded; at each input position the kernel is stamped into the output, scaled by that input value; overlapping stamps are summed; finally the border is cropped to give the output feature map. Inspired by andriys]
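The stamping procedure can be written directly. A minimal single-channel sketch; the output size follows (i - 1) * s + k, matching the 2x2 -> 4x4 (stride 1) and 2x2 -> 5x5 (stride 2) cases above:

```python
import numpy as np

def transposed_conv2d(x, k, stride=1):
    """Transposed convolution as 'kernel stamping': each input value scales
    a copy of the kernel, stamped into the output at stride intervals;
    overlapping stamps are summed."""
    ih, iw = x.shape
    kh, kw = k.shape
    oh, ow = (ih - 1) * stride + kh, (iw - 1) * stride + kw
    out = np.zeros((oh, ow))
    for i in range(ih):
        for j in range(iw):
            out[i * stride:i * stride + kh, j * stride:j * stride + kw] += x[i, j] * k
    return out

x = np.array([[1., 2.], [3., 4.]])   # 2x2 input feature map
k = np.ones((3, 3))                  # 3x3 kernel (all-ones for illustration)
print(transposed_conv2d(x, k, stride=1).shape)  # (4, 4)
print(transposed_conv2d(x, k, stride=2).shape)  # (5, 5)
```

With stride 2 the stamps overlap unevenly, which is exactly the source of the checkerboard artifacts discussed next.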

Is uneven overlap a problem? Uneven overlap across output Yes = causes grid artifacts Could fix it by picking stride/kernel numbers which have no overlap

Is uneven overlap a problem? Uneven overlap across output Yes = causes grid artifacts Could fix it by picking stride/kernel numbers which have no overlap Or think in frequency! Introduce explicit bilinear upsampling before transposed convolution; let the kernels of the transposed convolution learn to fill in only high-frequency detail. https://distill.pub/2016/deconv-checkerboard/
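A minimal sketch of that fix: resize first with fixed bilinear interpolation, then let an ordinary convolution add the detail. Pure NumPy; the 3x3 averaging kernel below stands in for a learned one.

```python
import numpy as np

def bilinear_upsample(x, factor=2):
    """Bilinear upsampling via separable 1-D interpolation (no learned weights)."""
    h, w = x.shape
    ys = np.linspace(0, h - 1, h * factor)
    xs = np.linspace(0, w - 1, w * factor)
    rows = np.array([np.interp(xs, np.arange(w), row) for row in x])
    return np.array([np.interp(ys, np.arange(h), col) for col in rows.T]).T

def conv2d_same(x, k):
    """Plain zero-padded convolution: the learnable part that fills in detail."""
    p = k.shape[0] // 2
    xp = np.pad(x, p)
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = (xp[i:i + k.shape[0], j:j + k.shape[1]] * k).sum()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
up = bilinear_upsample(x, 2)                   # smooth 8x8, no checkerboard
detail = conv2d_same(up, np.full((3, 3), 1/9))  # stand-in for a learned 3x3 conv
print(up.shape, detail.shape)
```

Because the upsampling itself is smooth and uniform, any checkerboard pattern would have to be learned deliberately by the following convolution, which in practice it is not.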

Deconvolution networks learn to upsample Often called deconvolution, but that's a misnomer: it is not the deconvolution we saw in deblurring -> that is division in the Fourier domain. "Transposed convolution" is the better term. Zeiler et al., Deconvolutional Networks, CVPR 2010 Noh et al., Learning Deconvolution Network for Semantic Segmentation, ICCV 2015

But we have downsampled so far. How do we learn to create or restore new high-frequency detail?

Spectrum of deep features Combine where (local, shallow) with what (global, deep) Fuse features into deep jet (cf. Hariharan et al. CVPR15 "hypercolumn") [Long et al.]

Learning upsampling kernels with skip layer refinement interp + sum interp + sum End-to-end, joint learning of semantics and location dense output [Long et al.]

Learning upsampling kernels with skip layer refinement interp + sum interp + sum dense output [Long et al.]

Learning upsampling kernels with skip layer refinement interp + sum interp + sum End-to-end, joint learning of semantics and location dense output [Long et al.]

Skip layer refinement input image stride 32 stride 16 stride 8 ground truth no skips 1 skip 2 skips [Long et al.]
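One skip-fusion step can be sketched as follows, assuming 21 classes (PASCAL VOC) and random score maps standing in for real network activations; FCN initializes its upsampling as bilinear interpolation, while nearest-neighbor is used here for brevity.

```python
import numpy as np

def upsample2x_nearest(p):
    """2x nearest-neighbor upsampling (FCN uses learned, bilinear-initialized interp)."""
    return np.repeat(np.repeat(p, 2, axis=0), 2, axis=1)

# Hypothetical per-pixel class scores from two depths of the network:
coarse = np.random.rand(8, 8, 21)    # stride-32 prediction (deep: "what")
finer  = np.random.rand(16, 16, 21)  # stride-16 prediction (shallow: "where")

# Skip refinement: interp the coarse scores up to the finer grid, then sum.
fused = upsample2x_nearest(coarse) + finer
print(fused.shape)  # (16, 16, 21)
```

Repeating this once more with a stride-8 prediction gives the "2 skips" (FCN-8s) variant shown in the figure.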

Results FCN SDS* Truth Input Relative to prior state-of-the-art SDS: - 30% relative improvement for mean IoU - 286x faster *Simultaneous Detection and Segmentation, Hariharan et al. ECCV14 [Long et al.]

What can we do with an FCN? Long, Shelhamer, and Darrell 2014

How much can an image tell about its geographic location? 6 million geo-tagged Flickr images http://graphics.cs.cmu.edu/projects/im2gps/ im2gps (Hays & Efros, CVPR 2008)

Nearest Neighbors according to gist + bag of SIFT + color histogram + a few others

PlaNet - Photo Geolocation with Convolutional Neural Networks Tobias Weyand, Ilya Kostrikov, James Philbin ECCV 2016

Discretization of Globe

Network and Training Network Architecture: Inception with 97M parameters 26,263 categories (places in the world) 126 Million Web photos 2.5 months of training on 200 CPU cores

PlaNet vs im2gps (2008, 2009)

Spatial support for decision

PlaNet vs Humans

PlaNet vs. Humans

PlaNet summary Very fast geolocalization method by categorization. Uses far more training data than previous work (im2gps) Better than humans!

Even more: Faster R-CNN (FCN) Region Proposal Network uses CNN feature maps. Then, FCN on top to classify. End-to-end object detection. Ren et al. 2016 https://arxiv.org/abs/1506.01497

Even more! Mask R-CNN Extending Faster R-CNN for Pixel Level Segmentation He et al. - https://arxiv.org/abs/1703.06870 Add new training data: segmentation masks Second output which is the segmentation mask