Learning image representations equivariant to ego-motion (Supplementary material)


Dinesh Jayaraman, UT Austin, dineshj@cs.utexas.edu
Kristen Grauman, UT Austin, grauman@cs.utexas.edu

Figure 2. KITTI z_θ architecture producing D = 100-dim. features: 3 convolution layers and a fully connected feature layer, with the nonlinear operations specified along the bottom of the figure (max-pool 3x3 stride 2, avg-pool 3x3 stride 2, avg-pool 3x3, fully connected).

1. KITTI and SUN dataset samples

Some sample images from KITTI and SUN are shown in Fig 1. As they show, these datasets have substantial domain differences. In KITTI, the camera faces the road and has a fixed field of view and camera pitch, and the content is entirely street scenes around Karlsruhe. In SUN, the images are downloaded from the internet and belong to 397 diverse indoor and outdoor scene categories, most of which have nothing to do with roads.

2. Optimization and hyperparameter selection (Main Sec 4.1)

(Elaborating on the paragraph titled "Network architectures and optimization" in Sec 4.1 of the main paper.) As mentioned in the paper, for KITTI we closely follow the cuda-convnet [1] recommended CIFAR-10 architecture: conv(5x5)-max(3x3) - conv(5x5)-avg(3x3) - conv(5x5)-avg(3x3) - D = 100 fully connected feature units. A schematic representation of this architecture is shown in Fig 2.
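For concreteness, the sketch below reproduces this layer layout in PyTorch (the original implementation used Caffe and cuda-convnet, so this is not the authors' code). The 32/32/64 filter counts, 5x5 kernels, and 32x32 grayscale input size are assumptions borrowed from the cuda-convnet CIFAR-10 reference configuration; only the pooling layout and the D = 100 feature dimension come from the text above.

```python
# Minimal sketch of the Fig 2 z_theta network (assumptions noted in the lead-in).
import torch
import torch.nn as nn

D = 100  # feature dimension of z_theta, as stated in the text

z_theta = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=5, padding=2),   # conv1 (filter count assumed)
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),        # max-pool (3x3, stride 2): 32x32 -> 15x15
    nn.Conv2d(32, 32, kernel_size=5, padding=2),  # conv2
    nn.ReLU(),
    nn.AvgPool2d(kernel_size=3, stride=2),        # avg-pool (3x3, stride 2): 15x15 -> 7x7
    nn.Conv2d(32, 64, kernel_size=5, padding=2),  # conv3
    nn.ReLU(),
    nn.AvgPool2d(kernel_size=3, stride=2),        # avg-pool (3x3): 7x7 -> 3x3
    nn.Flatten(),
    nn.Linear(64 * 3 * 3, D),                     # fully connected feature layer
)

features = z_theta(torch.randn(8, 1, 32, 32))     # -> shape (8, 100)
```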
We use Nesterov-accelerated stochastic gradient descent as implemented in Caffe [5], starting from weights randomly initialized according to [3]. The base learning rate and regularization weights λ are selected by greedy cross-validation. Specifically, for each task, the optimal base learning rate (from 0.1, 0.01, 0.001, 0.0001) was identified for CLSNET. Next, with this base learning rate fixed, the optimal regularizer weight (for DRLIM, TEMPORAL and EQUIV) was selected from a logarithmic grid (steps of 10^0.5). For EQUIV+DRLIM, the DRLIM loss regularizer weight fixed for DRLIM was retained, and only the EQUIV loss weight was cross-validated.

The contrastive loss margin parameter δ in Eq (6) was set uniformly to 1.0 for DRLIM, TEMPORAL and EQUIV. Since no other part of these objectives (including the softmax classification loss) depends on the scale of the features, different choices of the margin δ in these methods lead to objective functions with equivalent optima; the features are only scaled by a factor. For EQUIV+DRLIM, we set the DRLIM margin to 1.0 and the EQUIV margin to a smaller value, to reflect the fact that the equivariance map M_g applied to the representation z_θ(gx) of the transformed image must bring it closer to the original image representation z_θ(x) than it was before, i.e. ||M_g z_θ(gx) - z_θ(x)||_2 < ||z_θ(gx) - z_θ(x)||_2.

In addition, to allow fast and thorough experimentation, we set the number of training epochs for each method on each dataset based on a number of initial runs, assessing the time usually taken before the classification softmax loss on validation data began to rise, i.e. before overfitting began. All subsequent runs for that method on that data were run to roughly match (to the nearest 1000) the number of epochs identified above. For most cases, this number was of the order of 10000. Batch sizes (for both the classification stack and the Siamese networks) were set to 16 (found to make no major difference compared to nearby sizes such as 4) for NORB-NORB and KITTI-KITTI, and to 128 (selected from batch sizes between 4 and 128) for KITTI-SUN, where we found it necessary to increase the batch size so that meaningful classification loss gradients were computed in each SGD iteration and training loss began to fall, despite the large number (397) of classes.

On a single Tesla K40 GPU machine, NORB-NORB training tasks took a few minutes, KITTI-KITTI tasks took 30 minutes, and KITTI-SUN tasks took 2 hours.

Footnote 1: Technically, the EQUIV objective may benefit from setting different margins corresponding to the different ego-motion patterns, but we overlook this in favor of scalability and fewer hyperparameters.
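The greedy hyperparameter selection described above amounts to two grid searches performed in sequence; a minimal sketch follows. The function train_and_score is a hypothetical placeholder for a full Caffe training run returning validation accuracy, and the exact grid values are assumptions consistent with the text.

```python
# Minimal sketch of the greedy cross-validation described above (names are placeholders).
def train_and_score(base_lr, reg_weight=0.0, method="CLSNET"):
    """Hypothetical stand-in: train `method` with Nesterov SGD at `base_lr` and
    regularizer weight `reg_weight`, then return validation accuracy."""
    return 0.0

# Step 1: choose the base learning rate using the purely supervised CLSNET.
lr_grid = [0.1, 0.01, 0.001, 0.0001]
best_lr = max(lr_grid, key=lambda lr: train_and_score(lr))

# Step 2: with the base learning rate fixed, choose the regularizer weight lambda
# on a logarithmic grid, separately for each regularized method.
lambda_grid = [10.0 ** (0.5 * k) for k in range(-8, 5)]   # assumed range, steps of 10^0.5
best_lambda = {m: max(lambda_grid, key=lambda lam: train_and_score(best_lr, lam, m))
               for m in ("DRLIM", "TEMPORAL", "EQUIV")}

# For EQUIV+DRLIM, the DRLIM weight found above is kept fixed and only the EQUIV
# loss weight is cross-validated.
best_equiv_weight = max(lambda_grid,
                        key=lambda lam: train_and_score(best_lr, lam, "EQUIV+DRLIM"))
```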

Figure 1. (top) Figure from [2] showcasing images from the 4 KITTI location classes (shown here in color; we use grayscale images), and (bottom) figure from [8] showcasing images from a subset of the 397 SUN classes (shown here in color; see text in the main paper for image pre-processing details).

3. Equivariance measurement (Main Sec 4.2)

Computing ρ_g: details. In Sec 4.2 of the main paper, we proposed the following measure of equivariance. For each ego-motion g, we measure equivariance separately through the normalized error ρ_g:

    ρ_g = E[ ||z_θ(x) - M_g z_θ(gx)||_2 / ||z_θ(x) - z_θ(gx)||_2 ],    (1)

where E[.] denotes the empirical mean, M_g is the equivariance map, and ρ_g = 0 would signify perfect equivariance. We closely follow the equivariance evaluation approach of [6] to solve for the equivariance maps of the features produced by each compared method on held-out validation data (cf. Sec 4.1 of the paper), before computing ρ_g. Such maps are produced explicitly by our method, but not by the baselines. Thus, as in [6], we compute their maps (see Footnote 2) by solving a least squares minimization problem based on the definition of equivariance in Eq (2) of the paper:

    M_g = argmin_M Σ_{m(y_i, y_j) = g} ||z_θ(x_i) - M z_θ(x_j)||^2.    (2)

The M_g's computed as above are used to compute the ρ_g's as in Eq (1). M_g and ρ_g are computed on disjoint subsets of the validation data. Since the output features are relatively low-dimensional (D = 100), we find regularization for Eq (2) unnecessary.

Footnote 2: For uniformity, we do the same recovery of M_g for our method; our results are similar either way.

Equivariance results: details. While results in the main paper were reported as averages over atomic and composite motions, we present here the results for individual motions in Table 1. While the relative trends among the methods remain the same as for the averages reported in the main paper, the new numbers help demonstrate that ρ_g for composite motions is no bigger than for atomic motions, as we would expect from the argument presented in Sec 3.2 of the main paper. To see this, observe that even among the atomic motions, ρ_g for all methods is lower on the small "up" atomic ego-motion than it is for the larger "right" ego-motion (20°). Further, the errors for "right" are close to those for the composite motions ("up+right", "up+left" and "down+right"), establishing that while equivariance is diminished for larger motions, it is not affected by whether the motions were used in training or not. In other words, if trained for equivariance to a suitable discrete set of atomic ego-motions (cf. Sec 3.2 of the paper), the feature space generalizes well to new ego-motions.

                       atomic                   composite
    Method             up (u)      right (r)    u+r       u+l       d+r
    random             1.0000      1.0000       1.0000    1.0000    1.0000
    CLSNET             276         202          222       38        074
    TEMPORAL [7]       0.740       0.8033       0.8089    0.806     0.8207
    DRLIM [4]          0.770       0.7038       0.728     0.782     0.7
    EQUIV              0.8         0.6836       0.693     0.694     0.720
    EQUIV+DRLIM        0.293       0.633        0.0       0.60      0.66

    Table 1. The normalized error equivariance measure ρ_g for individual ego-motions (Eq (1)) on NORB, organized as atomic (motions in the EQUIV training set) and composite (novel) ego-motions.
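A minimal numpy sketch of this evaluation protocol is given below: it fits M_g by ordinary least squares (Eq (2)) on one split of validation pairs and evaluates ρ_g (Eq (1)) on a disjoint split. The array names and the synthetic data are ours, not from the original code.

```python
# Fit M_g (Eq (2)) and compute the normalized equivariance error rho_g (Eq (1)).
import numpy as np

def fit_equivariance_map(Z, Zg):
    # Z, Zg: (N, D) features z_theta(x) and z_theta(gx) for pairs related by ego-motion g.
    # Solve min_M sum_i ||Z[i] - M Zg[i]||^2; lstsq solves Zg @ X ~= Z, so M = X.T.
    M_T, *_ = np.linalg.lstsq(Zg, Z, rcond=None)
    return M_T.T

def rho_g(M, Z, Zg):
    # Normalized error of Eq (1); 0 would signify perfect equivariance.
    num = np.linalg.norm(Z - Zg @ M.T, axis=1)
    den = np.linalg.norm(Z - Zg, axis=1)
    return float(np.mean(num / den))

# Illustration only: random feature matrices with D = 100, fit and eval on disjoint splits.
rng = np.random.default_rng(0)
Z_fit, Zg_fit = rng.normal(size=(500, 100)), rng.normal(size=(500, 100))
Z_eval, Zg_eval = rng.normal(size=(500, 100)), rng.normal(size=(500, 100))
M = fit_equivariance_map(Z_fit, Zg_fit)
print(rho_g(M, Z_eval, Zg_eval))
```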
4. Recognition results (Main Sec 4.3)

Restricted slowness is a weak prior. We now present evidence supporting our claim in the paper that the principle of slowness, which penalizes feature variation within small temporal windows, provides a prior that is rather weak. In every stochastic gradient descent (SGD) training iteration for the DRLIM and TEMPORAL networks, we also computed a slowness measure that is independent of feature scaling (unlike the DRLIM and TEMPORAL losses of Eq (7) themselves), to better understand the shortcomings of these methods. Given training pairs (x_i, x_j) annotated as neighbors or non-neighbors by n_ij = 1(|t_i - t_j| ≤ T) (cf. Eq (7) in the paper), we computed pairwise distances d_ij = d(z_θ(s)(x_i), z_θ(s)(x_j)), where θ(s) is the parameter vector at SGD training iteration s, and d(.,.) is set to the ℓ2 distance for DRLIM and to the ℓ1 distance for TEMPORAL (cf. Sec 4). We then measured how well these pairwise distances d_ij predict the temporal neighborhood annotation n_ij, by measuring the Area Under the Receiver Operating Characteristic curve (AUROC) when varying a threshold on d_ij. These slowness AUROCs are plotted as a function of training iteration number in Fig 3, for DRLIM and COHERENCE networks trained on the KITTI-SUN task.

Compared to the standard random AUROC value of 0.5, these slowness AUROCs tend to be high already even before optimization begins, and reach peak AUROCs very close to 1.0 on both training and testing data within about 4000 iterations (batch size 128). This points to a possible weakness in these methods: even with parameters (temporal neighborhood size, regularization weight λ) cross-validated for recognition, the slowness prior is too weak to regularize feature learning effectively, since strengthening it causes loss of discriminative information. In contrast, our method requires systematic feature space responses to ego-motions, and offers a stronger prior.
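The slowness AUROC diagnostic above can be sketched as follows. The helper assumes the pair features and neighbor labels n_ij have already been extracted from the network at a given SGD iteration; the variable names and the use of scikit-learn are ours, not taken from the original implementation.

```python
# Scale-independent slowness diagnostic: how well do pairwise feature distances
# separate temporal neighbors (n_ij = 1) from non-neighbors (n_ij = 0)?
import numpy as np
from sklearn.metrics import roc_auc_score

def slowness_auroc(Zi, Zj, n_ij, distance="l2"):
    # Zi, Zj: (N, D) features of the two images in each pair at the current iteration;
    # n_ij: binary labels, 1 if |t_i - t_j| <= T. Use "l2" for DRLIM, "l1" for TEMPORAL.
    diff = Zi - Zj
    d = np.linalg.norm(diff, axis=1) if distance == "l2" else np.abs(diff).sum(axis=1)
    # Neighbors should have SMALL distance, so score with the negated distance.
    return roc_auc_score(n_ij, -d)

# Illustration only: unrelated random features give an AUROC near the chance value 0.5.
rng = np.random.default_rng(0)
Zi, Zj = rng.normal(size=(1000, 100)), rng.normal(size=(1000, 100))
labels = rng.integers(0, 2, size=1000)
print(slowness_auroc(Zi, Zj, labels))
```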

Figure 4. (Continued) More examples of nearest neighbor image pairs (cols 3 and 4 in each block) in pairwise equivariant feature difference space for various query image pairs (cols 1 and 2 per block). For comparison, cols 5 and 6 show pixel-wise difference-based neighbor pairs. The direction of ego-motion in query and neighbor pairs (inferred from ego-pose vector differences) is indicated above each block.

Figure 3. Slowness AUROC on training (left) and testing (right) data for (top) DRLIM and (bottom) COHERENCE, plotted against training iteration, showing the weakness of the slowness prior.

5. Next-best view selection (Main Sec 4.4)

We now describe our method for next-best view selection for recognition on NORB. Given one view of a NORB object, the task is to tell a hypothetical robot how to move next to help recognize the object, i.e., which neighboring view would best reduce object prediction uncertainty. We exploit the fact that equivariant features behave predictably under ego-motions to identify the optimal next view.

We limit the choice of the next view g to {"up", "down", "up+right", "up+left"} for simplicity in this preliminary test. We build a k-nearest neighbor (k-NN) image-pair classifier for each possible g, using only training image pairs (x, gx) related by the ego-motion g. This classifier C_g takes as input a vector of length 2D, formed by appending the features of the image pair (each image's representation is of length D), and produces an output probability for each class. So C_g([z_θ(x), z_θ(gx)]) returns class likelihood probabilities for all 25 NORB classes. Output class probabilities for the k-NN classifier are computed from the histogram of class votes of the k nearest neighbors. We set k = 2.

At test time, we first compute the features z_θ(x_0) of the given starting image x_0. Next we predict the feature z_θ(gx_0) corresponding to each possible surrounding view g as M_g z_θ(x_0), per the definition of equivariance (cf. Eq (2) in the paper; see Footnote 3). With these predicted transformed image features and the pair-wise nearest neighbor class probabilities C_g(.), we may now pick the next-best view as:

    g* = argmin_g H( C_g([z_θ(x_0), M_g z_θ(x_0)]) ),    (3)

where H(.) is the information-theoretic entropy function. This selects the view that would produce the least predicted image-pair class prediction uncertainty.

Footnote 3: Equivariance maps M_g for all methods are computed as described in Sec 3 of this document (and Sec 4.2 in the main paper).
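A minimal sketch of the selection rule in Eq (3) appears below. The k-NN pair classifiers C_g and the maps M_g are passed in as placeholders (here a dummy uniform classifier and identity maps); only the entropy-minimization rule itself follows the text above.

```python
# Next-best view selection: pick the ego-motion whose predicted view minimizes
# the entropy of the image-pair classifier's output (Eq (3)).
import numpy as np

def entropy(p, eps=1e-12):
    p = np.clip(p, eps, 1.0)
    return float(-(p * np.log(p)).sum())

def next_best_view(z0, maps, classifiers):
    # z0: (D,) feature of the starting view x_0.
    # maps: {g: (D, D) equivariance map M_g}.
    # classifiers: {g: callable taking the 2D-dim vector [z(x), z(gx)] and returning
    #               class probabilities}, i.e. the k-NN pair classifiers C_g.
    scores = {}
    for g, M_g in maps.items():
        z_pred = M_g @ z0                                   # predicted feature after motion g
        probs = classifiers[g](np.concatenate([z0, z_pred]))
        scores[g] = entropy(probs)                          # predicted class uncertainty
    return min(scores, key=scores.get)                      # Eq (3)

# Toy usage with dummy identity maps and a uniform 25-class "classifier" (both hypothetical):
D, moves = 100, ["up", "down", "up+right", "up+left"]
maps = {g: np.eye(D) for g in moves}
clfs = {g: (lambda pair: np.full(25, 1 / 25)) for g in moves}
print(next_best_view(np.zeros(D), maps, clfs))
```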

6. Qualitative analysis (Main Sec 4.5)

To qualitatively evaluate the impact of equivariant feature learning, we pose a pair-wise nearest neighbor task in the feature difference space: retrieve image pairs related by an ego-motion similar to that of a query image pair. Given a learned feature space z(.) and a query image pair (x_i, x_j), we form the pairwise feature difference d_ij = z(x_i) - z(x_j). In an equivariant feature space, other image pairs (x_k, x_l) with similar feature difference vectors d_kl ≈ d_ij are likely to be related by an ego-motion similar to that of the query pair (see Footnote 4). This can also be viewed as an analogy completion task, x_i : x_j = x_k : ?, where the right answer should apply the ego-motion relating x_i to x_j to the image x_k to obtain x_l. For the results in the paper, the closest pair to the query in the learned equivariant feature space is compared to that in the pixel space. Some more examples are shown in Fig 4.

Footnote 4: Note that in our model of equivariance, this is not strictly true, since the pair-wise difference vector M_g z_θ(x) - z_θ(x) need not actually be fixed for a given transformation g; it can vary with x. For small motions (and the right kinds of equivariant maps M_g), it still holds approximately, as we find in practice.

References

[1] Cuda-convnet. https://code.google.com/p/cuda-convnet/.
[2] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets Robotics: The KITTI Dataset. IJRR, 2013.
[3] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. AISTATS, 2010.
[4] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality Reduction by Learning an Invariant Mapping. CVPR, 2006.
[5] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv, 2014.
[6] K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. CVPR, 2015.
[7] H. Mobahi, R. Collobert, and J. Weston. Deep Learning from Temporal Coherence in Video. ICML, 2009.
[8] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. CVPR, 2010.