Supplementary Material for SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images

Benjamin Coors 1,3, Alexandru Paul Condurache 2,3, and Andreas Geiger 1

1 Autonomous Vision Group, MPI for Intelligent Systems and University of Tübingen
2 Institute for Signal Processing, University of Lübeck
3 Robert Bosch GmbH

Abstract. In this supplementary document, we present a visualization of the sampling error made by regular convolutional filters on omnidirectional images. Next, we provide details on the Spherical Transformer Network, which is able to transform inputs to a canonical orientation in support of a recognition model. Furthermore, we present additional experiments on the impact of varying object scale, object elevation and choice of interpolation on the Omni-MNIST classification task. Finally, we give a qualitative comparison of the CNN and SphereNet transfer learning models on the omnidirectional parked cars (OmPaCa) dataset.

1 Sampling Error

When applying a regular convolutional neural network to omnidirectional images, the normalized average absolute geodesic error is relatively small close to the equator region but grows large towards the poles when compared to the optimal sampling locations used by SphereNet (see Fig. 1).

Fig. 1: Average Geodesic Error of the sampling locations of a regular convolution kernel applied to the equirectangular image representation, shown as a map over longitude θ ∈ [−π, π] and latitude φ ∈ [−π/2, π/2]. Note how the error increases towards the poles φ ∈ {−π/2, +π/2}.
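For reference, the error visualized in Fig. 1 is a great-circle distance between sampling locations on the unit sphere. The following is a minimal NumPy sketch of this distance; the helper names are illustrative and not part of the paper's implementation:

```python
import numpy as np

def sph_to_cart(phi, theta):
    # Latitude phi in [-pi/2, pi/2] and longitude theta in [-pi, pi]
    # mapped to a 3D point on the unit sphere.
    return np.stack([np.cos(phi) * np.cos(theta),
                     np.cos(phi) * np.sin(theta),
                     np.sin(phi)], axis=-1)

def geodesic_distance(phi1, theta1, phi2, theta2):
    # Great-circle (geodesic) distance between two spherical locations,
    # e.g. a regular kernel's sampling point vs. SphereNet's.
    p = sph_to_cart(phi1, theta1)
    q = sph_to_cart(phi2, theta2)
    cos_d = np.clip(np.sum(p * q, axis=-1), -1.0, 1.0)  # guard rounding noise
    return np.arccos(cos_d)
```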

Fig. 2: Spherical Transformer Network. A transformation encoder network outputs the rotation parameters r specifying the axis v and angle β of rotation. The input I_in is warped to the output I_out via bilinear interpolation according to r.

2 Spherical Transformer Network

As a baseline for the classification and detection tasks, we investigate the use of a Spherical Transformer Network (SphereTN), which aims to undistort objects by performing a global rotation of the spherical image conditioned on the input image. Compared to SphereNet, the Spherical Transformer needs to learn how to undistort an object. Additionally, it only performs a single global transformation of the input and may therefore be unable to undistort multiple objects at once.

Similar to a Spatial Transformer Network (STN) [2], the Spherical Transformer Network uses a localization network to predict a set of transformation parameters which are used to resample the input image. However, unlike a Spatial Transformer, which typically outputs an affine transformation, the Spherical Transformer Network predicts and applies a 3D rotation of the spherical image representation. In order to avoid the gimbal lock problem, we do not represent the rotations using Euler angles but instead leverage the axis-angle representation.

More formally, we rotate 3D points y ∈ R³ located on the surface of the unit sphere S by applying a rotation y' = Ry, where the rotation matrix R ∈ SO(3) is given as follows:

R = I + (\sin \beta)\,[v]_\times + (1 - \cos \beta)\,[v]_\times^2    (1)

Here, v ∈ R³ is a unit vector defining the rotation axis and β denotes the rotation angle. [v]_× denotes the cross-product matrix and is an element of the Lie algebra so(3). As v is a unit vector with two degrees of freedom, we are able to encode the rotational component β as its length. More specifically, we parameterize the rotation axis v and rotation angle β with a 3-dimensional vector r = βv, which is the output of the localization network conditioned on the input image. Note that after prediction, r can easily be decomposed into β = ‖r‖₂ and v = r/β. The predicted transformation is applied to the omnidirectional image via differentiable image sampling with bilinear interpolation [2]. See Fig. 2 for an illustration.
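As a concrete reference, Eq. (1) can be evaluated directly from the predicted vector r. Below is a minimal NumPy sketch; the function name and the eps guard are our additions, and the actual SphereTN layer would be implemented in a differentiable deep learning framework:

```python
import numpy as np

def rotation_from_r(r, eps=1e-8):
    # Decompose r = beta * v into rotation angle and unit axis.
    beta = np.linalg.norm(r)
    if beta < eps:               # near-zero rotation: return the identity
        return np.eye(3)
    v = r / beta
    # Cross-product matrix [v]_x, an element of so(3).
    V = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])
    # Rodrigues' formula, Eq. (1).
    return np.eye(3) + np.sin(beta) * V + (1.0 - np.cos(beta)) * (V @ V)

# Usage: rotate a point y on the unit sphere by the predicted parameters r.
# y_rotated = rotation_from_r(r) @ y
```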

Table 1: Digit Scale Evaluation on Omni-MNIST. Performance comparison (test error in %) on the omnidirectional MNIST dataset for varying digit sizes.

Method               | Large | Medium | Small
GCNN [3]             | 17.21 | 20.35  | 28.54
S2CNN [1]            | 11.86 | 19.90  | 38.80
CubeMapCNN           | 10.03 | 11.37  | 24.46
EquirectCNN          |  9.61 |  9.10  |  8.60
EquirectCNN+SphereTN |  8.22 |  8.69  |  8.46
SphereNet (Uniform)  |  7.16 |  6.91  |  8.77
SphereNet (NN)       |  7.03 |  6.32  |  7.51
SphereNet (BI)       |  5.59 |  5.03  |  5.89

Table 2: Digit Elevation Evaluation on Omni-MNIST. Performance comparison (test error in %) on the omnidirectional MNIST dataset for varying digit elevation φ_d.

Method               | φ_d ∈ [0, π/8] | φ_d ∈ [π/8, π/4] | φ_d ∈ [π/4, 3π/8] | φ_d ∈ [3π/8, π/2]
GCNN [3]             | 17.48          | 19.48            | 18.39             | 20.63
S2CNN [1]            | 11.63          | 11.45            | 11.41             | 11.49
CubeMapCNN           |  8.70          |  8.70            |  9.10             | 13.80
EquirectCNN          |  7.02          |  8.64            |  9.17             | 11.34
EquirectCNN+SphereTN |  6.46          |  7.76            |  7.74             | 11.08
SphereNet (Uniform)  |  6.25          |  6.87            |  7.18             |  9.14
SphereNet (NN)       |  6.15          |  6.36            |  6.23             |  8.67
SphereNet (BI)       |  4.50          |  4.87            |  4.89             |  6.84

Table 3: Interpolation Evaluation for SphereNet on Omni-MNIST. Performance comparison of SphereNet when varying the location of the bilinear interpolation (BI) layers.

BI layers  | none | conv 1 | conv 2 | pool 1 | pool 2 | conv 1,2 | pool 1,2 | all
Test error | 7.03 | 6.31   | 6.25   | 6.47   | 6.25   | 6.18     | 6.17     | 5.59
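Table 3 contrasts nearest neighbor and bilinear lookups at the real-valued sampling positions produced by SphereNet's sphere-to-image mapping. For reference, the two modes can be sketched as follows; this is a minimal single-channel NumPy sketch with an illustrative helper name, not the paper's batched, differentiable implementation:

```python
import numpy as np

def sample(img, x, y, mode="bilinear"):
    # Look up an (H, W) image at a real-valued location (x, y) with
    # 0 <= x < W and 0 <= y < H. Longitude (x) wraps around the seam
    # of the equirectangular image; latitude (y) is clamped at the poles.
    H, W = img.shape
    if mode == "nearest":            # NN: snap to the closest pixel
        return img[min(int(round(y)), H - 1), int(round(x)) % W]
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    wx, wy = x - x0, y - y0          # fractional offsets
    x0, x1 = x0 % W, (x0 + 1) % W    # wrap longitude
    y1 = min(y0 + 1, H - 1)          # clamp latitude
    top = (1 - wx) * img[y0, x0] + wx * img[y0, x1]
    bot = (1 - wx) * img[y1, x0] + wx * img[y1, x1]
    return (1 - wy) * top + wy * bot
```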

3 Omni-MNIST Analysis

In order to analyze the image classification results on the Omni-MNIST dataset more closely, we conduct further studies. First, we test the effect of varying the digit scale on the performance of the different models. To do so, we generate the Omni-MNIST dataset at three different scales (small, medium, large). We then train and evaluate each model on each of the three variants. The results are shown in Table 1 and demonstrate that the performance of the nearest neighbor (NN) and bilinear interpolation (BI) variants of SphereNet is not significantly impacted by changes in digit scale. On the other hand, the uniformly sampled variant (Uniform) of SphereNet drops in performance for smaller digits, as in this case important information is lost at a smaller object scale. In contrast, the EquirectCNN baseline performs slightly better for small digit scales, as smaller digits minimize the amount of object distortion in the equirectangular image. The CubeMapCNN, S2CNN and GCNN baselines all show significantly degraded performance for smaller digit sizes, with the S2CNN model particularly struggling with the classification of digits of smaller size.

Second, we evaluate the performance of all models for different ranges of digit elevation φ_d (see Table 2). While the CubeMapCNN and EquirectCNN models perform gradually worse with increasing elevation and object distortion, the SphereNet variants offer near-constant performance for elevations φ_d ≤ 3π/8. When further investigating the decrease in SphereNet's performance for elevations φ_d ∈ [3π/8, π/2], we find it to be caused by a sudden drop in performance to nearly 20% test error at the poles (φ_d = π/2). The reason for SphereNet's loss in performance at this elevation is that its assumption of an upright object orientation no longer holds at the poles. Unlike SphereNet, S2CNN encodes full rotation equivariance and is therefore the only baseline with near-constant performance over all digit elevations.

Finally, we evaluate which network layers in the SphereNet model benefit most from the use of bilinear interpolation (BI) over nearest neighbor interpolation (NN). Table 3 indicates that all layers benefit from bilinear interpolation, with the benefit being slightly larger in the second layers and increasing further when both convolutional or both pooling layers utilize bilinear interpolation.

4 OmPaCa Detection Comparison

In order to visualize the performance benefit of a SphereNet model compared to the EquirectCNN baseline on the omnidirectional parked cars (OmPaCa) detection task, we provide a qualitative comparison of the detection results in Fig. 3.

References

1. Cohen, T.S., Geiger, M., Köhler, J., Welling, M.: Spherical CNNs. In: International Conference on Learning Representations (2018)
2. Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In: Advances in Neural Information Processing Systems (NIPS) (2015)
3. Khasanova, R., Frossard, P.: Graph-based classification of omnidirectional images. In: Proc. of the IEEE International Conf. on Computer Vision (ICCV) Workshops (2017)

Fig. 3: Qualitative Performance Comparison between the (a) EquirectCNN and (b) SphereNet (NN) models on the OmPaCa dataset. The ground truth is shown in green, detections are shown in red. Unlike SphereNet, the baseline EquirectCNN model struggles to detect objects in the polar regions of omnidirectional images (rows 1-3) and, in general, outputs less tight bounding boxes (rows 4-5).