Predicting Ground-Level Scene Layout from Aerial Imagery. Muhammad Hasan Maqbool

Objective: Given an overhead/aerial image, predict its ground-level semantic segmentation. (Slide figure: overhead/aerial image, ground-level image, and the predicted ground-level labeling.)

Overall Architecture (Training): The overhead/aerial image is fed through VGG and a labeling transform to produce the predicted ground-level labeling; the ground-level image is fed through SegNet to produce the ground-level labeling used as ground truth; a cross-entropy loss compares the two.

Ground Truth of Ground-Level Segmentation: 1. Generate the segmentation of the ground-level image using a pre-trained SegNet; this gives the ground truth for training. 2. This avoids manual pixel annotation of their new dataset of overhead imagery.
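Since the pseudo ground truth is just the argmax of a pre-trained segmentation network run over the street-view images, the step can be sketched in a few lines of PyTorch. This is a minimal sketch, assuming `segnet` is any pre-trained segmentation model callable on image batches (the slides only say a pre-trained SegNet is used):

```python
import torch

@torch.no_grad()
def pseudo_ground_truth(segnet, ground_images):
    """Sketch: derive training labels from a pre-trained network.
    ground_images: (N, 3, H, W) float tensor of street-view images.
    Returns (N, H, W) integer label maps used as ground truth."""
    segnet.eval()                    # inference mode, no dropout/BN updates
    logits = segnet(ground_images)   # (N, C, H, W) per-class scores
    return logits.argmax(dim=1)      # hard labels, no manual annotation
```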

Algorithm Overview: Aerial image to ground-level (predicted) segmentation. (Slide diagram: one branch of 1x1 convolutions with 512, 512, and 4 output channels; another branch with 128, 64, and 1 output channels; the 17 x 17 = 289 aerial grid; and a 320 x 289 transformation matrix whose matrix multiplication with the vectorized grid yields the 8 x 40 output.)

Algorithm Overview: Predicted and ground-truth layouts. Distance minimization: minimize the per-pixel distance (a cross-entropy loss) between the ground-truth ground-level layout and the aerial-to-ground layout prediction.
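The distance minimization is a standard per-pixel cross-entropy. A minimal sketch, assuming the prediction is a (N, 4, 8, 40) logit tensor and the SegNet-derived ground truth has been downsampled to the same 8x40 grid (the resolution matching is an assumption, not spelled out on the slide):

```python
import torch.nn.functional as F

def layout_loss(pred_logits, gt_labels):
    """pred_logits: (N, 4, 8, 40) aerial-to-ground class scores;
    gt_labels: (N, 8, 40) integer labels from the SegNet pass."""
    return F.cross_entropy(pred_logits, gt_labels)
```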

Network A: Takes a 17x17x960 hypercolumn and applies three 1x1 convolution layers (17x17x512 -> 17x17x512 -> 17x17x4), reducing the channel dimension from 960 to 4.

Network S: Encodes the global information of the aerial image. Three 1x1 convolution layers (17x17x512 -> 17x17x128 -> 17x17x64 -> 17x17x1) reduce the channel dimension from 512 to 1.
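Both networks are plain stacks of 1x1 convolutions that only shrink the channel dimension, so they can be sketched directly in PyTorch. The ReLU activations between layers are an assumption; the slides do not name the nonlinearity:

```python
import torch.nn as nn

# Network A: 17x17x960 hypercolumn -> 17x17x4 aerial layout
net_A = nn.Sequential(
    nn.Conv2d(960, 512, kernel_size=1), nn.ReLU(),
    nn.Conv2d(512, 512, kernel_size=1), nn.ReLU(),
    nn.Conv2d(512, 4, kernel_size=1),
)

# Network S: 17x17x512 feature map -> 17x17x1 global descriptor
net_S = nn.Sequential(
    nn.Conv2d(512, 128, kernel_size=1), nn.ReLU(),
    nn.Conv2d(128, 64, kernel_size=1), nn.ReLU(),
    nn.Conv2d(64, 1, kernel_size=1),
)
```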

Determine the Size of the Transformation Matrix: The aerial image layout l_a is vectorized (17 x 17 -> 289-D) and multiplied by the transformation matrix M, i.e., vec(l_g) = M * vec(l_a); the resulting 320-D vector is reshaped to the 8 x 40 x 1 predicted ground-level layout l_g. Given the input size (289) and the desired output size (320), the size of M follows: in this case, M is 320 x 289.
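The multiply-and-reshape step is what fixes M's size: 289 inputs and 320 outputs force M to be 320 x 289. A minimal sketch, assuming the same M is applied independently to each class channel (the slide draws a single channel):

```python
import torch

def apply_transform(l_a, M):
    """l_a: (17, 17, C) aerial layout; M: (320, 289) transformation matrix.
    Returns the (8, 40, C) predicted ground-level layout l_g."""
    v = l_a.reshape(289, -1)      # vectorize the 17x17 grid row by row
    g = M @ v                     # (320, C): one output per ground pixel
    return g.reshape(8, 40, -1)   # fold the 320-D vector into 8x40
```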

Construct the Input to Network F: The S-network output (17x17x1) is reshaped to 1x1x289 and tiled down to height H = 320; four coordinate channels (i, j, y, x) are concatenated, giving a grid of height H = 320 and width W = 289. How to set the coordinates for each pixel is explained below.

Network F: Produces the transformation matrix.

How to determine the coordinates (i, j, y, x)?

How to determine the coordinates? Recall that the predicted ground-level layout is obtained by multiplying the vectorized aerial layout by the transformation matrix and reshaping the result.

How to determine the coordinates? The 17x17 aerial layout l_a is vectorized row by row into a 289-D vector. Here (i, j) indexes a pixel of l_a, with i the row index and j the column index, running (0, 0), ..., (0, 16), (1, 0), ..., (1, 16), ..., (16, 16).

How to determine the coordinates? The 320-D output vector is reshaped to the 8x40 ground-level layout l_g. Here (y, x) indexes a pixel of l_g, with y the row index and x the column index, running (0, 0), ..., (0, 39), (1, 0), ..., (1, 39), ..., (7, 39).

How to determine the coordinates? Each row of the 320x289 matrix determines a single pixel value at one location (y, x). For example, the 0th row produces the pixel at (y, x) = (0, 0), and its 289 entries are indexed by (i, j, y, x) = (0, 0, 0, 0), (0, 1, 0, 0), ..., (0, 16, 0, 0), (1, 0, 0, 0), ..., (16, 16, 0, 0). Likewise, the 39th row produces the pixel at (y, x) = (0, 39), with entries indexed by (i, j, y, x) = (0, 0, 0, 39), ..., (16, 16, 0, 39).

Interpret the Transformation Matrix: Each row of the transformation matrix weights the 17x17 pixels in l_a to determine one pixel in l_g.
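Putting the coordinate slides together: the entry of M at row (y, x) and column (i, j) is produced by feeding the tuple (i, j, y, x), together with the tiled S-feature, through Network F. A minimal sketch; the width of net_F and the use of nn.Linear over the 5 concatenated values per entry are assumptions:

```python
import torch
import torch.nn as nn

# Column coordinates: (i, j) enumerates the 17x17 aerial pixels (289 columns).
ii, jj = torch.meshgrid(torch.arange(17), torch.arange(17), indexing="ij")
aerial_ij = torch.stack([ii, jj], dim=-1).reshape(289, 2).float()

# Row coordinates: (y, x) enumerates the 8x40 ground pixels (320 rows).
yy, xx = torch.meshgrid(torch.arange(8), torch.arange(40), indexing="ij")
ground_yx = torch.stack([yy, xx], dim=-1).reshape(320, 2).float()

# Broadcast to the full 320x289 table of (i, j, y, x) tuples.
coords = torch.cat([
    aerial_ij.unsqueeze(0).expand(320, -1, -1),   # (320, 289, 2)
    ground_yx.unsqueeze(1).expand(-1, 289, -1),   # (320, 289, 2)
], dim=-1)                                        # (320, 289, 4)

# Tile the S-network output (one scalar per aerial pixel) down the rows
# and concatenate, giving 5 values per matrix entry.
s = torch.randn(17, 17)                           # stand-in for net_S output
s_map = s.reshape(1, 289, 1).expand(320, -1, -1)  # (320, 289, 1)
inp = torch.cat([coords, s_map], dim=-1)          # (320, 289, 5)

# Network F maps each 5-D entry to one weight of the transformation matrix.
net_F = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Linear(64, 1))
M = net_F(inp).squeeze(-1)                        # (320, 289)
```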

Transformation Matrix Visualization: The transformation matrix encodes the relationship between the aerial-level pixels and the ground-level pixels. (Slide figure: each 289-D row of the 320 x 289 matrix reshaped to a 17 x 17 weight map over the aerial grid.)
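Because each 289-D row of M weights the 17x17 aerial pixels for one ground pixel, a row can be reshaped to 17x17 and rendered as an image, which is presumably what the slide's visualization shows. A minimal sketch, reusing the M from the previous snippet:

```python
import matplotlib.pyplot as plt

row = M[0].detach().reshape(17, 17)   # weights for ground pixel (y, x) = (0, 0)
plt.imshow(row, cmap="viridis")
plt.title("Aerial weights for ground pixel (0, 0)")
plt.colorbar()
plt.show()
```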

Dataset

Cross-View Dataset: CVUSA contains approximately 1.5 million geo-tagged pairs of ground and aerial images from across the United States. Google Street View panoramas from CVUSA serve as the ground-level images, with corresponding aerial images at zoom level 19. 35,532 image pairs are used for training and 8,884 image pairs for testing.

Cross-View Dataset: An Example. In the aerial images, north is the up direction; in the ground images, north is the central column.


Evaluation and Applications

Weakly Supervised Learning

Weakly Supervised Learning for Aerial Image Classification and Segmentation

As Pre-trained Model

Network Pre-training for Aerial Image Segmentation

Network Pre-training for Aerial Image Segmentation: Per-class precision on the ISPRS segmentation task.

Network Pre-training for Aerial Image Segmentation

Orientation Estimation

Orientation Estimation: Given a street-view image I_g, estimate its orientation:
1. Get the corresponding aerial image I_a (using the GPS tag).
2. Infer the semantic labeling L_g of the query image I_g.
3. Predict the ground-image labels from the aerial image using the proposed network: I_a -> L'_g.
4. Compute the cross entropy between L_g and L'_g in a sliding-window fashion across all possible orientations.
5. The orientation with the smallest cross entropy is the estimate.

Orientation Estimation: The CVUSA dataset contains aligned image pairs; therefore, the learned network predicts the ground (panorama) layout in a fixed orientation, which serves as the reference.

Orientation Estimation (pipeline): For a query street-view image, get the corresponding aerial image, get the query's segmentation (e.g., with SegNet), and get the predicted ground-level layout of the panorama using the proposed network. In the W-column predicted panorama layout, one column corresponds to 360 degrees / W of heading.
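Steps 4 and 5 of the procedure amount to circularly shifting the reference-orientation prediction one column at a time and keeping the shift with the lowest cross entropy against the query's own labeling. A minimal sketch; whole-column shifts as the sliding-window step are an assumption:

```python
import torch
import torch.nn.functional as F

def estimate_orientation(pred_logits, query_labels):
    """pred_logits: (C, H, W) predicted panorama layout in the reference
    orientation; query_labels: (H, W) integer labels of the query panorama.
    Returns the estimated heading in degrees."""
    _, _, W = pred_logits.shape
    losses = torch.stack([
        F.cross_entropy(torch.roll(pred_logits, s, dims=2).unsqueeze(0),
                        query_labels.unsqueeze(0))
        for s in range(W)                       # try every column offset
    ])
    return losses.argmin().item() * 360.0 / W   # 1 column = 360 / W degrees
```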

Orientation Estimation

Geocalibration

Fine-Grained Geocalibration

Fine-Grained Geocalibration

Ground-Level Image Synthesis

Ground-Level Image Synthesis

Ground-Level Image Synthesis

Conclusion: A novel strategy to relate aerial imagery and ground-level imagery. A novel strategy to exploit automatically labeled ground images as a form of weak supervision for aerial imagery understanding. The method shows potential for geocalibration and ground-level appearance synthesis.

Questions