Predicting ground-level scene Layout from Aerial imagery. Muhammad Hasan Maqbool

Size: px

Start display at page:

Download "Predicting ground-level scene Layout from Aerial imagery. Muhammad Hasan Maqbool"

Leslie Fox
5 years ago
Views:

1 Predicting ground-level scene Layout from Aerial imagery Muhammad Hasan Maqbool

2 Objective Given the overhead image predict its ground level semantic segmentation Predicted ground level labeling Overhead/Aerial Image Ground Level Image

3 Overall Architecture (Training) VGG Labeling Transform Ground Level Labeling (Predicted) Overhead/Aerial Image Cross Entropy Loss SegNet Ground Level Image Ground Level Labeling

4 Ground Truth of Ground-level Segmentation 1. Generate the segmentation of ground level image using (pre-trained model) 1. SegNet This gives Ground Truth for training 2. To avoid manual pixel annotation of their new dataset (overhead imagery) Ground Level Image

Algorithm Overview Aerial Image to Ground Level (Predicted) Segmentation 512 Conv 1 x 1 512 Conv 1 x 1 4 Conv

5 Algorithm Overview Aerial Image to Ground Level (Predicted) Segmentation 512 Conv 1 x Conv 1 x 1 4 Conv 1 x 1 17 x 17 = 289 Matrix Multiplication 1 x 1 Conv 128 Conv 1 x 1 64 Conv 1 x 1 1 Conv 1 x x x 40

6 Algorithm Overview Predicted and Ground Truth Layouts Distance Minimization Minimize the pixel distance between the ground layout and the aerial-to-ground layout prediction Ground Level Layout (ground truth) Predicted Ground Level Layout Cross Entropy Loss

7 Network A Network A Hypercolumn 17x17x960 17x17x512 17x17x512 17x17x4 Three conv layers employ 1x1 convolution Reduce channel dimension from 960 to 4

8 Network S Encode the global information of the aerial image Network S 17x17x512 17x17x128 17x17x64 17x17x1 Three conv layers employ 1x1 convolution Reduce channel dimension from 512 to 1

9 Determine the Size of the Transformation Matrix Aerial image layout l a is multiplied by transformation matrix M to get the predicted ground-level layout l g M 17x17x1 * Vec = l a Reshape 8x40x1 l g Based on the input size and the desired output size, we can determine the size of M -----> In this case, we have M = 320x289

10 Construct Input to Network F How to set the coordinates for each pixel will be explained later S network (i, j, y, x) 17x17x1 Reshape 1x1x289 Width (W) = 289 Height: H =320 1x1x4 Concatenate coordinates Height: H =320 Width (W) = 289

11 Network F Transformation matrix

12 How to determine the coordinates? (i, j, y, x)

13 How to determine the coordinates? Recall the following procedure is used to get the predicted ground-level layout Reshape Transformation matrix

14 How to determine the coordinates? (i, j) = (0, 0) i j l a i: row index j: column index Vec 17x17 l a (i, j) = (0, 16) (i, j) = (1, 0) (i, j) = (1, 16) 289-D vector (i, j) = (16, 16)

15 How to determine the coordinates? 320-D vector Reshape y x 8x40 l g y: row index x: column index

16 How to determine the coordinates? 320-D vector (y, x) = (0, 0) (y, x) = (0, 39) (y, x) = (1, 0) x 8x40 (y, x) = (1, 39) Reshape y l g y: row index x: column index (y, x) = (7, 39)

17 How to determine the coordinates? 320x289 0th row 39th row * (i, j) = (0, 0) (i, j) = (0, 16) (i, j) = (1, 0) (i, j) = (1, 16) (y, x) = (0, 0) 320-D (rotate to a row vector due to space limit) (y, x) = (0, 39) (y, x) = (1, 0) (y, x) = (1, 39) (I, j, y, x) = (0, 0, 0, 0) (I, j, y, x) = (0, 1, 0, 0) (I, j, y, x) = (0, 16, 0, 0) (I, j, y, x) = (1, 0, 0, 0) (y, x) = (7, 39) (I, j, y, x) = (16, 16, 0, 0) 289-D (i, j) = (16, 16) One row determines a single pixel value at location (y, x)

18 How to determine the coordinates? 320x289 0th row 39th row * (i, j) = (0, 0) (i, j) = (0, 16) (i, j) = (1, 0) (i, j) = (1, 16) (y, x) = (0, 0) (y, x) = (0, 39) (y, x) = (1, 0) (y, x) = (1, 39) (I, j, y, x) = (0, 0, 0, 39) (I, j, y, x) = (0, 1, 0, 39) (I, j, y, x) = (0, 16, 0, 39) (I, j, y, x) = (1, 0, 0, 39) 320-D (y, x) = (7, 39) (I, j, y, x) = (16, 16, 0, 39) 289-D (i, j) = (16, 16) One row determines a single pixel value at location (y, x)

19 Interpret the transformation matrix 320x289 0th row 39th row * 289-D (i, j) = (0, 0) (i, j) = (0, 16) (i, j) = (1, 0) (i, j) = (1, 16) (i, j) = (16, 16) 320-D One row determines a single pixel value at location (y, x) This is Weight the 17x17 pixels in I a Each row of the transformation matrix e.g. Determine one pixel in I g

20 Transformation Matrix Visualization The transformation matrix encodes the relationship between the aerial-level pixels and the ground-level pixels x

21 Dataset

22 Cross-View Dataset CVUSA contains approximately 1.5 million geo-tagged pairs of ground and aerial images from across the United States. Google Street View panoramas of CVUSA as our ground level images Corresponding aerial images at zoom level 19 35,532 image pairs for training and 8,884 image pairs for testing.

23 Cross-View Dataset: An Example In the aerial images, north is the up direction. In the ground images, north is the central column. Aerial Ground

24 Cross-View Dataset

25 Cross-View Dataset

26 Evaluation and Applications

27 Weakly Supervised Learning

28 Weakly Supervised Learning for Aerial Image Classification and Segmentation

29 As Pre-trained Model

30 Network Pre-training for Aerial Image Segmentation

31 Network Pre-training for Aerial Image Segmentation Per-class Precision on the ISPRS Segmentation Task

32 Network Pre-training for Aerial Image Segmentation

33 Orientation Estimation

34 Orientation Estimation Given a street view image I g, estimate its orientation 1. Get the corresponding aerial image (using GPS tag), I a 2. Infer the semantic labeling of the query image, I g L g 3. Predict the ground image labels from the aerial image using the proposed network, I a -> Lg 4. Compute the cross entropy between Lg and Lg in a sliding window fashion across all possible orientations 5. The smallest entropy is the optimal estimated orientation

35 Orientation Estimation Their CVUSA dataset contains aligned image pairs Therefore, the learned network predicts the ground (panorama image) layout in a fixed orientation served as reference

36 Orientation Estimation Query street view image Get aerial image Get segmentation (e.g. SegNet) Get predicted groundlevel layout using the proposed network W H 1 column = 360 degree / W Predicted ground-level layout of the panorama image

37 Orientation Estimation

38 Geocalibration

39 Fine-Grained Geocalibration

40 Fine-Grained Geocalibration

41 Ground Level Image Synthesization

42 Ground-level Image Synthesization

43 Ground-level Image Synthesization

44 Conclusion A novel strategy to relate aerial-level imagery and the ground-level imagery A novel strategy to exploit the automatically labeled ground images as a form of weak supervision for aerial imagery understanding Show the potential of this method in geocalibration and ground appearance synthetization

45 Questions

arxiv: v1 [cs.cv] 8 Dec 2016

arxiv: v1 [cs.cv] 8 Dec 2016 Predicting Ground-Level Scene Layout from Aerial Imagery Menghua Zhai Zachary Bessinger Scott Workman Nathan Jacobs Computer Science, University of Kentucky {ted, zach, scott, jacobs}@cs.uky.edu arxiv:1612.02709v1