Visual Place Recognition in Changing Environments with Time-Invariant Image Patch Descriptors

Boris Ivanovic
Stanford University
borisi@cs.stanford.edu

Abstract

Feature descriptors for images are a mature area of study within computer vision, and as a result, researchers now have access to many attribute-invariant features (e.g. scale, shift, rotation). However, changes to environments caused by the passage of time, i.e. weather and season, still pose a serious problem for current image matching systems. As the use of detailed 3D maps and visual Simultaneous Localization and Mapping (SLAM) for robotics becomes more widespread, the ability to match image points across different weather conditions, illumination, seasons, and vegetation growth becomes a more important problem to solve. In this paper, we propose a method to learn a time-invariant image patch descriptor that can reliably match regions in images across the large-scale scenery changes caused by different weather and seasons. We use Convolutional Neural Networks (CNNs) to learn representations of image patches and in particular train a Siamese network with pairs of (non-)matching patches to enforce descriptor (dis)similarity. We enforce this by minimizing the Euclidean distance between descriptors of matching patches and maximizing it for non-matching patches during training. To improve representation generalization, we work with the seldom-used, large-scale Archive of Many Outdoor Scenes (AMOS) dataset.

Figure 1. Top row: Images of a forest as the season changes from summer on the left to fall in the middle to winter on the right. Bottom row: Images of St. Louis, MO in the summer on the left and winter on the right.

1. Introduction

With the growth of autonomous vehicles for consumer use and their use of visual Simultaneous Localization and Mapping (SLAM), the ability for vision systems to work in all conditions and seasons is paramount. As a result, visual place recognition over a long period of time has been identified as one of the core requirements for any modern robotic system to operate reliably in the real world [8].

Specifically, visual place recognition is the problem of identifying locations previously visited by an agent, enabling the agent to localize itself in an environment during navigation and perform pose estimation, scale-drift correction, map updating, etc. Visual place recognition over a long period of time is the more difficult problem of identifying previously visited locations when the agent's knowledge of a location comes from a different season or weather condition, causing the visual appearance in memory to differ vastly from what is currently seen. Examples of these visual appearance differences can be seen in Fig. 1.

We present an approach to visual place recognition over a long period of time based on local image features. Specifically, we propose a method to learn a time-invariant image patch descriptor that can reliably match regions in images across the large-scale scenery changes caused by different weather and seasons. We use CNNs to learn representations of image patches and in particular train a Siamese network with pairs of corresponding and non-corresponding patches to enforce descriptor similarity and dissimilarity for better patch matching. Fig. 2 illustrates our work's overall idea.
2. Previous Work

Previous approaches for visual place recognition over a long period of time include matching image sequences [9], learning to predict appearance changes to simplify the problem of matching images [12], using holistic image descriptors obtained from Convolutional Neural Networks (CNNs) [14], and combining local image features with CNN feature descriptors to match patches across images and provide resilience to viewpoint changes [10]. More recent work focuses on combining local patch features and holistic descriptors with multi-scale superpixel grids to provide the accurate matching performance of holistic descriptors while still maintaining the viewpoint invariance of local patches [11]. Additionally, work has been done to create an algorithm that identifies sets of heterogeneous features that are invariant to the types of changes that occur across seasons and weather, simplifying the problem of image matching [5].

Our work is most similar to that of Neubert and Protzel [10], as it also combines a feature detector with a CNN feature descriptor. The main differences are that Neubert and Protzel use the vectorized third convolutional layer (conv3) of the VGG-M network [2] as a feature descriptor and perform Hough-based patch matching, whereas our work uses a different CNN architecture and performs direct patch matching via the smallest Euclidean distance between the patches' descriptors. Our work also takes cues from the work of Simo-Serra et al. [13], as we are both trying to learn discriminative feature descriptors. As a result, we use the same CNN architecture and Siamese layout for training.

The main contribution of our work is an approach to visual place recognition over a long period of time based on local image features, using a novel data extraction scheme that generates more general time-invariant image patch descriptors as well as a novel patch dataset for this task.

Figure 2. Illustration of our method. On the left is our training Siamese CNN architecture and on the right is how we use the trained CNN to generate image patch representations, which we then compare via Euclidean distance to determine which patches match.

3. Method

3.1. Data Collection and Feature Extraction

3.1.1 Dataset Selection

This work uses a seldom-used dataset for the task of visual place recognition: the Archive of Many Outdoor Scenes (AMOS) [6]. It is a very large collection of more than 1 billion images from almost 30,000 static cameras located around the world. The reason for using this dataset rather than one of the more popular ones for this task (e.g. the Nordland dataset [9]) is that we aim to create a more general image patch descriptor that can be used in a variety of environments (e.g. cities, forests, roads) rather than specific scenes such as a railroad. Another candidate dataset that we may work with in the future is the Long-term Observation of Scenes with Tracks (LOST) dataset [1], comprised of videos taken from streaming outdoor webcams. The LOST dataset is also very large, with more than 150 million video frames captured to date. The main difference between the LOST and AMOS datasets is that all videos in the LOST dataset are taken during the same 30-minute daytime interval (noon local time), which may hinder the diversity of data collected (e.g. sunsets and sunrises, poor illumination conditions at dusk, city lights, road lights, etc.).

3.1.2 Day/Night Classification and Pruning

The AMOS dataset contains both daytime and nighttime images, which is beneficial for data diversity; however, a majority of the cameras in the dataset have poor low-light performance, so the nighttime images are mostly featureless and black. In order to exclude these images from training, we have to identify which images are taken at night (undesirable) and which are taken during the day (desirable).
First instincts may indicate that we should simply use the images' timestamps to determine day/night boundaries. However, all timestamps are in GMT and there is no accompanying camera location information. Even if camera location and local time of capture were known, it would be difficult to set day/night boundaries manually when they vary so heavily across seasons and locations. Thus, to avoid complex rule-based classification, we used the intuition that nighttime images have more dark pixels than daytime images to formulate this as a binary classification problem using image pixel values as the features. As a result, we decided to train a simple SVM classifier on the frequency of image pixel values. However, using the pixel values directly would create very high-dimensional feature vectors, so we chose to bin the 256 pixel values of each channel into 4 bins: Very Dark [0, 64), Dark [64, 128), Light [128, 192), and Very Light [192, 256). We then count how many of the image's pixels fall in each bin, making it easier to group dark and light images together. Finally, the counts are normalized so that this feature extraction works with images of any size. Fig. 3 illustrates this process. With these features, we expect nighttime images to have much higher values in the Very Dark and Dark bins, whereas daytime images should have much higher values in the Light and Very Light bins.

With the features chosen and defined, we hand-picked training data for this classifier from different cameras in the AMOS dataset, selecting 476 daytime and 420 nighttime images. After training, the classifier was able to linearly separate the data, leading to perfect classification of day and night images. Fig. 4 shows some classification results.

Figure 3. Illustration of the process and results of extracting features for the day/night classifier for two example images.

Figure 4. Example classifications from the SVM classifier, where the top row were classified as Day and the bottom row were classified as Night.

3.1.3 Image Patch Pair Extraction

In order to extract pairs of matching patches across different conditions, we capitalized on the fact that the AMOS dataset contains images from static cameras. For example, if the front door of a building in the summer is contained in a 64x64 patch centered at (x, y), then that same door will be found in a 64x64 patch centered at the same (x, y) coordinates in a wintertime image.

Next, patches from the image must be chosen. To do this we perform keypoint detection with SIFT on the image and select the 10 keypoints with the highest response (the response specifies the strength of the found keypoint). We then define a 64x64 patch centered at each of the 10 keypoints and use each as one of our patches. To find a corresponding patch, we pick a random image in a different environmental condition (e.g. season) and extract patches from the same coordinates as in the original image. To prevent data duplication (i.e. all top 10 keypoints lying within 3 pixels of each other, creating nearly identical patches), we choose the top 10 keypoints such that no two of their patches overlap.

The above steps find matching patch pairs. The process for finding non-corresponding patch pairs is very similar; the only difference is that when choosing a patch in the randomly selected other image, we choose patch coordinates from one of the other non-overlapping patches in our image, guaranteeing that the patch extracted from the other image shares no intentional similarity with the original patch. Fig. 5 illustrates our overall image patch pair extraction process for corresponding patches. Fig. 6 shows a few extracted patch pairs.

3.1.4 Dataset Statistics

Using our patch extraction method, our dataset contains patches from daytime images of 16 different scenes in both the summer and winter, with dates ranging from 2006 to 2015. The specific composition of the dataset is 55,137 pairs of matching patches and 62,444 pairs of non-matching patches.

3.2. CNN Image Patch Descriptor

3.2.1 Model Architecture

We use a three-layer CNN architecture similar to Simo-Serra et al. [13], since we also aim to compute discriminative feature descriptors for image patches. Table 1 details the architecture of the network.
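The non-overlapping keypoint selection described in Sec. 3.1.3 amounts to a greedy filter over detector responses. A minimal sketch of this step (the SIFT detection itself is omitted; keypoints are assumed to arrive as (x, y, response) tuples, and the greedy strongest-first strategy is our reading of the text, not a verbatim reproduction of the authors' code):

```python
def select_nonoverlapping(keypoints, k=10, patch=64):
    """Greedily keep up to k keypoints, strongest response first,
    discarding any whose patch footprint would overlap an
    already-kept one. Keypoints are (x, y, response) tuples."""
    chosen = []
    for x, y, resp in sorted(keypoints, key=lambda kp: -kp[2]):
        # Two axis-aligned patch-by-patch squares overlap iff their
        # centers differ by less than `patch` pixels in BOTH axes.
        if all(abs(x - cx) >= patch or abs(y - cy) >= patch
               for cx, cy, _ in chosen):
            chosen.append((x, y, resp))
        if len(chosen) == k:
            break
    return chosen
```

Because the cameras are static, the centers returned here can be reused directly as patch coordinates in the paired image from the other season.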
Figure 5. Illustration of the process to generate corresponding image patches. On the left is an image of a certain type (in this case summer), in the middle is the image with SIFT keypoints drawn on, and on the right is a randomly chosen image of a different type (in this case winter) to match with. The red dashed boxes indicate 64x64 image patches and the horizontal red lines indicate which patches are matched.

Figure 6. A sample of 64x64 corresponding (top) and non-corresponding (bottom) patches extracted from the AMOS dataset.

Table 1. Architecture of our three-layer CNN.

3.2.2 Model Training

We aim to compute whether two examples are similar or not based on the similarity of their feature descriptors. This is an instance of semi-supervised embedding as defined in Weston et al. [16], and is also why a Siamese CNN architecture like the one outlined in Simo-Serra et al. [13] fits the task well (i.e. identical copies of the same function with shared weights, plus a distance-measuring layer to compute similarity). As a result, we use the margin-based loss function proposed by Hadsell et al. [4], which encourages similar examples to be close and dissimilar ones to be at least some distance away from each other:

    L(f_i, f_j, W_ij) = ‖f_i − f_j‖_2              if W_ij = 1
                        max(0, m − ‖f_i − f_j‖_2)  if W_ij = 0    (1)

where f_i is the representation of image patch x_i, f_j is the representation of image patch x_j, W_ij is a label indicating whether patches x_i and x_j should be similar (W_ij = 1) or dissimilar (W_ij = 0), and m is the minimum distance by which dissimilar patches should be separated. Fig. 2 shows the Siamese CNN architecture used for training.

4. Experiments

4.1. AlexNet's Representation of Different Seasons

In order to gauge how current CNNs perform on the task of grouping together similar environments across different weather conditions and seasons, we visualized how the fc7 layer of AlexNet [7] (trained on ImageNet [3]) represents a handful of images that differ in season, time of day, weather conditions, and vegetation growth. To do this we performed t-distributed Stochastic Neighbor Embedding (t-SNE) [15] and plotted the images at their resulting representation vectors. Fig. 7 shows the results. As can be seen, AlexNet does manage to group similar scenes together (i.e. all the forest images were together, all the construction images were together, etc.). However, due to the heavy variations in color and illumination, scenes in different conditions are not represented alike, as can be seen from the distances between specific images within the similar-scene groups.
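For concreteness, the margin-based loss of Eq. (1) can be written down directly. A minimal NumPy sketch for a single pair of descriptors (the margin value used in the default is illustrative, not the value used in training):

```python
import numpy as np

def contrastive_loss(f_i, f_j, w_ij, m=4.0):
    """Margin-based loss of Hadsell et al. [4]: pull matching pairs
    (w_ij = 1) together and push non-matching pairs (w_ij = 0) to be
    at least margin m apart, measured by Euclidean distance."""
    d = np.linalg.norm(f_i - f_j)
    return d if w_ij == 1 else max(0.0, m - d)
```

A matching pair with identical descriptors incurs zero loss; a non-matching pair already farther apart than m also incurs zero loss, so only violating pairs drive gradient updates.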
Figure 7. t-SNE plot of AlexNet's fc7 representation of various scenes with appearance changes. In the top left are images of St. Louis, MO in the summer and winter; in the bottom left are images of a forest in the summer, fall, and winter; in the middle are images of a school building in the summer and winter; in the top right are images of a construction site in a city during the day, dusk, and night.

4.2. Patch Matching Performance

In order to test the model's performance on patch matching, we devised an experiment where:

1. A query patch's 128-dimensional representation vector is obtained from a trained model.
2. A selection of choice patches from the same location in the opposite season is made (including the query patch's corresponding patch) and their representations obtained.
3. The choice patches are ranked in ascending order of their representation's Euclidean distance to the query patch's representation.

With this experiment, we are looking for the query patch's corresponding patch to have the smallest representation distance from the query patch. A visualization of this experiment is shown in Fig. 8 and the model's performance on our dataset is detailed in Table 2. When the experiment was run, usually 15 choice patches were presented (making the total number of choice patches 16, including the query patch's corresponding patch). The fact that the accuracy increases quickly with the number of included top results is promising, as it shows that the corresponding patch is still close to the query patch even when the model is wrong.

    Accuracy Cutoff    Matching Accuracy
    Top 1              55.2%
    Top 3              77.1%
    Top 5              89.4%

Table 2. The model's performance on our patch dataset. The accuracy cutoffs indicate the position within which the query's corresponding patch must be ranked in order to count as a correct result.

4.3. Representation Analysis

Fig. 9 shows a t-SNE visualization of the model's representation of patches from our dataset.
Unlike typical uses of t-SNE plots, we are not looking for global patterns across the latent space. Instead, we are looking for local groups of patches corresponding to the same scene in different seasons, as such groups show that the model has learned to represent different appearances of the same scene similarly. As can be seen, our model does produce clusters of the same scene in different seasons, lending credence to the notion that our model generates time-invariant image patch descriptors.

4.4. Weight Visualization

Fig. 10 shows some weights from each layer of a trained model. Unfortunately, due to the sizes of the weight matrices and input patches, it is difficult to make out any specific patterns in the weights. Ideally, we would also have performed an analysis of which patches maximally activate specific neurons, to test for the occurrence of semantically-activated neurons such as a foliage neuron, shadow neuron, etc.

5. Conclusion

In conclusion, we propose an approach to visual place recognition over a long period of time based on local image features, using a novel data extraction scheme that generates time-invariant image patch descriptors. We use Convolutional Neural Networks (CNNs) to learn representations of image patches and in particular train a Siamese network with pairs of (non-)matching patches to enforce descriptor (dis)similarity. To improve representation generalization, we work with the seldom-used Archive of Many Outdoor Scenes (AMOS) dataset and show a method to easily extract corresponding and non-corresponding image patches from it.

References

[1] A. Abrams, J. Tucek, N. Jacobs, and R. Pless. LOST: Longterm Observation of Scenes (with Tracks). In IEEE Workshop on Applications of Computer Vision (WACV), pages 297–304, 2012.
Figure 8. A visualization of the patch matching accuracy test for two query patches. The query patches are shown on the left and the choice patches are shown on the right, in ascending order of their representation's Euclidean distance from the query patch. A green border indicates that the patch is the query's corresponding patch. We desire the patch with the green border to be as close as possible to the query patch (i.e. to be the leftmost of the choice patches). The top trial would count as accurate in all of the top 1, 3, and 5 accuracy cutoffs, whereas the bottom trial would only count as accurate in the top 3 and 5 accuracy cutoffs.

Figure 9. A t-SNE visualization of the model's representation of 300 patches from our dataset. Circled in red are local groups of patches corresponding to the same scene in different seasons.
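The matching experiment of Sec. 4.2, visualized in Fig. 8, reduces to sorting the choice descriptors by their Euclidean distance to the query descriptor and checking the rank of the true match. A minimal NumPy sketch (the toy low-dimensional vectors in the example stand in for the trained model's 128-dimensional descriptors):

```python
import numpy as np

def rank_choices(query, choices):
    """Return indices of `choices` sorted by ascending Euclidean
    distance of each choice descriptor to the query descriptor."""
    dists = np.linalg.norm(np.asarray(choices, dtype=float) - query, axis=1)
    return np.argsort(dists)

def top_k_correct(query, choices, true_idx, k):
    """Top-k accuracy criterion of Table 2: is the true corresponding
    patch ranked within the first k choices?"""
    return true_idx in rank_choices(query, choices)[:k]
```

Averaging `top_k_correct` over all query patches with k = 1, 3, 5 yields the three accuracy cutoffs reported in Table 2.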
Figure 10. Visualization of four 7x7 weight matrices from the first convolutional layer (top), four 6x6 weight matrices from the second convolutional layer (middle), and four 5x5 weight matrices from the third convolutional layer (bottom), with each weight's specific location in the model indicated in the plot title.
[2] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014.
[3] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, June 2009.
[4] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 1735–1742, 2006.
[5] F. Han, X. Yang, Y. Deng, M. Rentschler, D. Yang, and H. Zhang. Life-long place recognition by shared representative appearance learning. In Workshop on Robotics: Science and Systems, Ann Arbor, Michigan, June 2016.
[6] N. Jacobs, N. Roman, and R. Pless. Consistent temporal variations in many outdoor scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–6, June 2007.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[8] S. M. Lowry, N. Sünderhauf, P. Newman, J. J. Leonard, D. D. Cox, P. I. Corke, and M. J. Milford. Visual place recognition: A survey. IEEE Transactions on Robotics, 32(1):1–19, 2016.
[9] M. Milford and G. Wyeth. SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights. In IEEE International Conference on Robotics and Automation (ICRA), pages 1643–1649, 2012.
[10] P. Neubert and P. Protzel. Local region detector + CNN based landmarks for practical place recognition in changing environments. In ECMR, 2015.
[11] P. Neubert and P. Protzel. Beyond holistic descriptors, keypoints, and fixed patches: Multiscale superpixel grids for place recognition in changing environments. IEEE Robotics and Automation Letters, 1(1):484–491, Jan 2016.
[12] P. Neubert, N. Sünderhauf, and P. Protzel. Superpixel-based appearance change prediction for long-term navigation across seasons. Robotics and Autonomous Systems, 69:15–27, 2015.
[13] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
[14] N. Sünderhauf, F. Dayoub, S. Shirazi, B. Upcroft, and M. Milford. On the performance of ConvNet features for place recognition. CoRR, abs/1501.04158, 2015.
[15] L. van der Maaten and G. E. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
[16] J. Weston, F. Ratle, H. Mobahi, and R. Collobert. Deep learning via semi-supervised embedding, pages 639–655. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.