Visual Place Recognition in Changing Environments with Time-Invariant Image Patch Descriptors


Boris Ivanovic
Stanford University
borisi@cs.stanford.edu

Abstract

Feature descriptors for images are a mature area of study within computer vision, and as a result, researchers now have access to many attribute-invariant features (e.g. scale, shift, rotation). However, changes to environments caused by the passage of time, i.e. weather and season, still pose a serious problem for current image matching systems. As the use of detailed 3D maps and visual Simultaneous Localization and Mapping (SLAM) for robotics becomes more widespread, the ability to match image points across different weather conditions, illumination, seasons, and vegetation growth becomes a more important problem to solve. In this paper, we propose a method to learn a time-invariant image patch descriptor that can reliably match regions in images across the large-scale scenery changes caused by different weather and seasons. We use Convolutional Neural Networks (CNNs) to learn representations of image patches and in particular train a Siamese network with pairs of matching and non-matching patches to enforce descriptor similarity and dissimilarity, respectively. We enforce this during training by minimizing the Euclidean distance between descriptors of matching patches and maximizing it for non-matching patches. To improve representation generalization, we work with the seldom-used, large-scale Archive of Many Outdoor Scenes (AMOS) dataset.

Figure 1. Top row: Images of a forest as the season changes from summer on the left to fall in the middle to winter on the right. Bottom row: Images of St. Louis, MO in the summer on the left and winter on the right.

1. Introduction

With the growth of autonomous vehicles for consumer use and their reliance on visual Simultaneous Localization and Mapping (SLAM), the ability for vision systems to work in all conditions and seasons is paramount. As a result, visual place recognition over a long period of time has been identified as one of the core requirements for any modern robotic system to operate reliably in the real world [8]. Specifically, visual place recognition is the problem of identifying locations previously visited by an agent, enabling the agent to localize itself in an environment during navigation and to perform pose estimation, scale-drift correction, map updating, etc. Visual place recognition over a long period of time is the more difficult problem of identifying previously visited locations when the agent's knowledge of the location comes from a different season or weather condition, causing the visual appearance in memory to differ vastly from what is currently seen. Examples of these visual appearance differences can be seen in Fig. 1.

We present an approach to visual place recognition over a long period of time based on local image features. Specifically, we propose a method to learn a time-invariant image patch descriptor that can reliably match regions in images across the large-scale scenery changes caused by different weather and seasons. We use CNNs to learn representations of image patches and in particular train a Siamese network with pairs of corresponding and non-corresponding patches to enforce descriptor similarity and dissimilarity for better patch matching. Fig. 2 illustrates our work's overall idea.

Figure 2. Illustration of our method. On the left is our training Siamese CNN architecture and on the right is how we use the trained CNN to generate image patch representations, which we then compare via Euclidean distance to determine which patches match.

2. Previous Work

Previous approaches for visual place recognition over a long period of time include matching image sequences [9], learning to predict appearance changes to simplify the problem of matching images [12], using holistic image descriptors obtained from Convolutional Neural Networks (CNNs) [14], and combining local image features with CNN feature descriptors to match patches across images and provide resilience to viewpoint changes [10]. More recent work focuses on combining local patch features and holistic descriptors with multi-scale superpixel grids to provide the accurate matching performance of holistic descriptors while still maintaining the viewpoint invariance of local patches [11]. Additionally, work has been done to create an algorithm that identifies sets of heterogeneous features that are invariant to the types of changes that occur across seasons and weather, simplifying the problem of image matching [5].

Our work is most similar to Neubert and Protzel's work [10] as it also combines a feature detector with a CNN feature descriptor. The main differences are that Neubert and Protzel use the vectorized third convolutional layer (conv3) of the VGG-M network [2] as a feature descriptor and perform Hough-based patch matching, whereas our work uses a different CNN architecture and performs direct patch matching via the smallest Euclidean distance between the patches' descriptors. Our work also takes cues from the work of Simo-Serra et al. [13], as we are both trying to learn discriminative feature descriptors. As a result, we use the same CNN architecture and Siamese layout for training.

The main contribution of our work is an approach to visual place recognition over a long period of time based on local image features, using a novel data extraction scheme that generates more general time-invariant image patch descriptors, as well as a novel patch dataset for this task.

3. Method

3.1. Data Collection and Feature Extraction

3.1.1 Dataset Selection

This work uses a seldom-used dataset for the task of visual place recognition: the Archive of Many Outdoor Scenes (AMOS) [6]. It is a very large collection of more than 1 billion images from almost 30,000 static cameras located around the world. The reason for using such a dataset rather than one of the more popular ones for this task (e.g. the Nordland dataset [9]) is that we aim to create a more general image patch descriptor that can be used in a variety of environments (e.g. cities, forests, roads, etc.) rather than specific scenes such as a railroad. Another candidate dataset that we may work with in the future is the Long-term Observation of Scenes with Tracks (LOST) dataset [1], comprised of videos taken from streaming outdoor webcams. The LOST dataset is also very large, with more than 150 million video frames captured to date. The main difference between the LOST and AMOS datasets is that all videos in the LOST dataset are taken during the same 30-minute daytime interval (noon local time), which may hinder the diversity of the data collected (e.g. sunsets and sunrises, poor illumination conditions at dusk, city lights, road lights, etc.).
3.1.2 Day/Night Classification and Pruning

The AMOS dataset contains both daytime and nighttime images, which is beneficial for data diversity; however, a majority of the cameras in the dataset have poor low-light performance, and thus the nighttime images are mostly featureless and black. In order to exclude these images from training, we have to identify which images are taken at night (undesirable) and which are taken during the day (desirable).

First instincts may suggest simply using the timestamps of the images to determine day/night boundaries. However, all timestamps are in GMT and there is no accompanying camera location information. Even if camera location and local time of capture were known, it is difficult to set day/night boundaries manually when they vary so heavily across seasons and locations. Thus, in order to avoid complex rule-based classification, we used the intuition that nighttime images have more dark pixels than daytime images to formulate this as a binary classification problem using image pixel values as the features. As a result, we decided to train a simple SVM classifier on the frequency of image pixel values. However, using the pixel values directly would create very high-dimensional feature vectors, so we chose to bin the pixel values into 4 bins per channel. Binning separates the 256 pixel values of each channel into 4 bins: Very Dark [0, 64), Dark [64, 128), Light [128, 192), and Very Light [192, 256); we then count how many pixels in the image fall in each bin, making it easier to group dark and light images together. Finally, the counts are normalized so that this feature extraction works with images of any size. Fig. 3 illustrates this process. With these features, we expect nighttime images to have much higher values in the Very Dark and Dark bins, whereas daytime images should have much higher values in the Light and Very Light bins.

With the features chosen and defined, we hand-picked training data for this classifier from different cameras in the AMOS dataset, selecting 476 daytime and 420 nighttime images. After training, the classifier was able to linearly separate the data, leading to perfect classification of day and night images. Fig. 4 shows some classification results.
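To make the feature construction concrete, the following minimal sketch (not the paper's original code) computes the binned, normalized per-channel histograms and trains a linear SVM with scikit-learn. It assumes images are loaded as H x W x 3 uint8 NumPy arrays (e.g. via OpenCV); the variable names and the choice of LinearSVC are illustrative assumptions.

    import numpy as np
    from sklearn.svm import LinearSVC

    def day_night_features(image):
        # image: H x W x 3 uint8 array; bin each channel into Very Dark, Dark,
        # Light, and Very Light counts, then normalize by the total count.
        feats = []
        for c in range(image.shape[2]):
            counts, _ = np.histogram(image[:, :, c], bins=[0, 64, 128, 192, 256])
            feats.extend(counts)
        feats = np.asarray(feats, dtype=np.float64)
        return feats / feats.sum()  # size-invariant 12-dimensional feature vector

    # Hypothetical usage, with day_images / night_images as lists of loaded images:
    # X = np.stack([day_night_features(im) for im in day_images + night_images])
    # y = np.array([1] * len(day_images) + [0] * len(night_images))
    # classifier = LinearSVC().fit(X, y)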

Figure 3. Illustration of the process and results of extracting features for the day/night classifier for two example images.

Figure 4. Example classifications from the SVM classifier, where the top row was classified as Day and the bottom row as Night.

3.1.3 Image Patch Pair Extraction

In order to extract pairs of matching patches across different conditions, we capitalized on the fact that the AMOS dataset contains images from static cameras. For example, if the front door of a building in the summertime is contained in a 64x64 patch centered at (x, y), then that same door will be found in a 64x64 patch centered at the same (x, y) coordinates in a wintertime image.

Next, patches from the image must be chosen. To do this, we perform keypoint detection with SIFT on the image and select the 10 points with the highest response (the response specifies the strength of the found keypoint). We then define a 64x64 patch centered at each of the 10 keypoints and use it as one of our patches. To find a corresponding patch, we pick a random image from a different environmental condition (e.g. season) and extract patches at the same coordinates as in the original image. In order to prevent data duplication (i.e. all top 10 keypoints lying within 3 pixels of each other, creating nearly identical patches), we choose the top 10 keypoints such that no two of their patches overlap.

These are the steps for finding matching patch pairs. The process for finding non-corresponding patch pairs is very similar. The only difference is that, when choosing a patch in the randomly selected other image, we choose patch coordinates from one of the other non-overlapping patches in our image, guaranteeing that the patch extracted from the other image shares no intentional similarity with the original patch. Fig. 5 illustrates our overall image patch pair extraction process for corresponding patches. Fig. 6 shows a few extracted patch pairs.
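The extraction procedure above can be sketched roughly as follows, using OpenCV's SIFT detector (cv2.SIFT_create, available in OpenCV 4.4 and later). This is not the paper's original code; the greedy non-overlap selection, border handling, and function names are illustrative assumptions.

    import cv2

    PATCH = 64  # patch side length in pixels

    def top_nonoverlapping_keypoints(gray, k=10):
        # Strongest SIFT keypoints whose 64x64 patches do not overlap.
        keypoints = cv2.SIFT_create().detect(gray, None)
        keypoints = sorted(keypoints, key=lambda kp: kp.response, reverse=True)
        chosen = []
        for kp in keypoints:
            x, y = kp.pt
            if all(abs(x - cx) >= PATCH or abs(y - cy) >= PATCH for cx, cy in chosen):
                chosen.append((x, y))
            if len(chosen) == k:
                break
        return chosen

    def crop(image, x, y):
        # 64x64 patch centered at (x, y); assumes the keypoint is far enough from the border.
        r, c = int(y) - PATCH // 2, int(x) - PATCH // 2
        return image[r:r + PATCH, c:c + PATCH]

    def matching_pairs(image_a, image_b):
        # Corresponding patches: the same (x, y) locations in two images taken
        # by the same static camera under different conditions.
        gray = cv2.cvtColor(image_a, cv2.COLOR_BGR2GRAY)
        return [(crop(image_a, x, y), crop(image_b, x, y))
                for x, y in top_nonoverlapping_keypoints(gray)]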
3.1.4 Dataset Statistics

Using our patch extraction method, our dataset contains patches from daytime images of 16 different scenes in both the summer and winter. Dates vary from 2006 to 2015. The specific composition of the dataset is 55,137 pairs of matching patches and 62,444 pairs of non-matching patches.

3.2. CNN Image Patch Descriptor

3.2.1 Model Architecture

We use a three-layer CNN architecture similar to Simo-Serra et al. [13], since we also aim to compute discriminative feature descriptors for image patches. Table 1 details the architecture of the network.
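Since Table 1 itself is not reproduced in this transcription, the sketch below only illustrates the general shape of one Siamese branch in PyTorch: three convolutional layers with 7x7, 6x6, and 5x5 kernels (the filter sizes visible in Fig. 10) mapping a 64x64 patch to a 128-dimensional descriptor (the dimensionality used in Sec. 4.2). The filter counts, nonlinearities, and pooling are assumptions, not the paper's exact configuration.

    import torch
    import torch.nn as nn

    class PatchDescriptorNet(nn.Module):
        # One Siamese branch: 64x64 grayscale patch -> 128-D descriptor.
        # Filter counts, Tanh nonlinearities, and pooling are illustrative
        # guesses; Table 1 holds the actual configuration.
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=7), nn.Tanh(), nn.MaxPool2d(2),   # 64 -> 58 -> 29
                nn.Conv2d(32, 64, kernel_size=6), nn.Tanh(), nn.MaxPool2d(2),  # 29 -> 24 -> 12
                nn.Conv2d(64, 128, kernel_size=5), nn.Tanh(),                  # 12 -> 8
                nn.AdaptiveAvgPool2d(1),                                       # 128 x 1 x 1
            )

        def forward(self, x):
            return self.features(x).flatten(1)  # (batch, 128)

    # descriptor_net = PatchDescriptorNet()
    # d = descriptor_net(torch.randn(8, 1, 64, 64))  # d.shape == (8, 128)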

Figure 5. Illustration of the process to generate corresponding image patches. On the left is an image of a certain type (in this case summer), in the middle is the image with SIFT keypoints drawn on it, and on the right is a randomly chosen image of a different type (in this case winter) to match with. The red dashed boxes indicate 64x64 image patches and the horizontal red lines indicate which patches are matched.

Figure 6. A sample of 64x64 corresponding (top) and non-corresponding (bottom) patches extracted from the AMOS dataset.

Table 1. Architecture of our three-layer CNN.

3.2.2 Model Training

We aim to compute whether two examples are similar or not based on the similarity of their feature descriptors. This is an instance of semi-supervised embedding as defined in Weston et al. [16], which is also why a Siamese CNN architecture like the one outlined in Simo-Serra et al. [13] fits the task well (i.e. identical copies of the same function with shared weights, plus a distance-measuring layer to compute similarity). As a result, we use the margin-based loss function proposed by Hadsell et al. [4], which encourages similar examples to be close and dissimilar ones to be at least some distance away from each other:

L(f_i, f_j, W_{ij}) = \begin{cases} \|f_i - f_j\|_2 & \text{if } W_{ij} = 1 \\ \max(0,\ m - \|f_i - f_j\|_2) & \text{if } W_{ij} = 0 \end{cases}   (1)

where f_i and f_j are the representations of image patches x_i and x_j, W_{ij} is a label indicating whether patches x_i and x_j should be similar (W_{ij} = 1) or dissimilar (W_{ij} = 0), and m is the margin, i.e. the minimum distance by which dissimilar patches should be separated.

Fig. 2 shows the Siamese CNN architecture used for training.
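A minimal PyTorch rendering of Eq. (1) follows, assuming a descriptor network like the branch sketched after Sec. 3.2.1 and a batched (f_i, f_j, W_ij) layout; the margin value and variable names are illustrative, not the paper's training settings.

    import torch

    def contrastive_loss(f_i, f_j, w_ij, margin=1.0):
        # Eq. (1): matching pairs (w_ij = 1) are pulled together, non-matching
        # pairs (w_ij = 0) are pushed at least `margin` apart.
        d = torch.norm(f_i - f_j, dim=1)  # per-pair Euclidean distance
        return (w_ij * d + (1 - w_ij) * torch.clamp(margin - d, min=0)).mean()

    # Siamese training step sketch: the same network (shared weights) embeds
    # both patches of each pair before the loss is computed.
    # f_i, f_j = descriptor_net(patches_i), descriptor_net(patches_j)
    # loss = contrastive_loss(f_i, f_j, labels.float())
    # loss.backward(); optimizer.step()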

4. Experiments

4.1. AlexNet's Representation of Different Seasons

In order to gauge how current CNNs perform at grouping together similar environments across different weather conditions and seasons, we visualized how AlexNet's fc7 layer [7] (trained on ImageNet [3]) represents a handful of images that differ in season, time of day, weather conditions, and vegetation growth. To do this, we performed t-distributed Stochastic Neighbor Embedding (t-SNE) [15] and plotted the images at their resulting representation vectors. Fig. 7 shows the results. As can be seen, AlexNet does manage to group similar scenes together (i.e. all the forest images were together, all the construction images were together, etc.). However, due to the heavy variations in color and illumination, scenes in different conditions are not represented alike, which can be seen in the distances between specific images within the similar scene groups.

Figure 7. t-SNE plot of AlexNet's fc7 representation of various scenes with appearance changes. In the top left are images of St. Louis, MO in the summer and winter; in the bottom left are images of a forest in the summer, fall, and winter; in the middle are images of a school building in the summer and winter; in the top right are images of a construction site in a city during the day, dusk, and night.

4.2. Patch Matching Performance

In order to test the model's performance on patch matching, we devised an experiment where:

1. A query patch's 128-dimensional representation vector is obtained from a trained model.
2. A selection of patches from the same location in the opposite season is chosen (including the query patch's corresponding patch) and their representations are obtained.
3. The choice patches are ranked in ascending order of their representation's Euclidean distance to the query patch's representation.

With this experiment, we are looking for the query patch's corresponding patch to have the smallest representation distance from the query patch. A visualization of this experiment is shown in Fig. 8 and the model's performance on our dataset is detailed in Table 2. When the experiment was run, usually 15 choice patches were presented (making the total number of choice patches 16, including the query patch's corresponding patch). The fact that the accuracy increases quickly with respect to the number of included top results is promising, as it shows that the corresponding patch is still close to the query patch even when the model's top choice is wrong.

Accuracy Cutoff    Matching Accuracy
Top 1              55.2%
Top 3              77.1%
Top 5              89.4%

Table 2. The model's performance on our patch dataset. The accuracy cutoffs indicate within what position the query's corresponding patch must be ranked in order to count as a correct result.
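The ranking protocol can be expressed compactly as below, a sketch under the assumption that descriptors are stored as NumPy arrays; the function and variable names are illustrative.

    import numpy as np

    def rank_of_true_match(query_desc, candidate_descs, true_index):
        # 1-based rank of the query's corresponding patch among the candidates,
        # ordered by ascending Euclidean distance in descriptor space.
        dists = np.linalg.norm(candidate_descs - query_desc, axis=1)
        order = np.argsort(dists)
        return int(np.where(order == true_index)[0][0]) + 1

    # Top-k matching accuracy over a set of trials (cf. Table 2):
    # ranks = [rank_of_true_match(q, C, t) for q, C, t in trials]
    # top_k_accuracy = {k: np.mean([r <= k for r in ranks]) for k in (1, 3, 5)}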
4.3. Representation Analysis

Fig. 9 shows a t-SNE visualization of the model's representation of patches from our dataset. Unlike typical uses of t-SNE plots, we are not looking for global patterns across the latent space with this work. Instead, we are looking for local groups of patches corresponding to the same scene in different seasons, as this shows that the model has learned to represent different appearances of the same scene similarly. As can be seen, our model does produce clusters of the same scene in different seasons, lending credence to the notion that our model generates time-invariant image patch descriptors.

4.4. Weight Visualization

Fig. 10 shows some weights from each layer of a trained model. Unfortunately, due to the sizes of the weight matrices and input patches, it is difficult to make out any specific patterns in the weights. Ideally, we would also have performed an analysis of which patches maximally activate specific neurons, to test for the occurrence of semantically-activated neurons such as a foliage neuron, a shadow neuron, etc.

5. Conclusion

In conclusion, we propose an approach to visual place recognition over a long period of time based on local image features, using a novel data extraction scheme that generates time-invariant image patch descriptors. We use Convolutional Neural Networks (CNNs) to learn representations of image patches and in particular train a Siamese network with pairs of matching and non-matching patches to enforce descriptor similarity and dissimilarity, respectively. To improve representation generalization, we work with the seldom-used Archive of Many Outdoor Scenes (AMOS) dataset and show a method to easily extract corresponding and non-corresponding image patches from it.

Figure 8. A visualization of the patch matching accuracy test for two query patches. The query patches are shown on the left and the choice patches are shown on the right, in ascending order of their representation's Euclidean distance from the query patch. A green border indicates that the patch is the query's corresponding patch. We want the patch with the green border to be as close as possible to the query patch (i.e. to be the leftmost of the choice patches). The top trial would count as accurate under the top 1, 3, and 5 accuracy cutoffs, whereas the bottom trial would only count as accurate under the top 3 and 5 accuracy cutoffs.

Figure 9. A t-SNE visualization of the model's representation of 300 patches from our dataset. Circled in red are local groups of patches corresponding to the same scene in different seasons.

Figure 10. Visualization of four 7x7 weight matrices from the first convolutional layer (top), four 6x6 weight matrices from the second convolutional layer (middle), and four 5x5 weight matrices from the third convolutional layer (bottom), with the specific location in the model indicated in each plot title.

References

[1] A. Abrams, J. Tucek, N. Jacobs, and R. Pless. LOST: Longterm Observation of Scenes (with Tracks). In IEEE Workshop on Applications of Computer Vision (WACV), pages 297-304, 2012.
[2] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference (BMVC), 2014.
[3] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248-255, 2009.
[4] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 1735-1742, 2006.
[5] F. Han, X. Yang, Y. Deng, M. Rentschler, D. Yang, and H. Zhang. Life-long place recognition by shared representative appearance learning. In Workshop on Robotics: Science and Systems, Ann Arbor, Michigan, June 2016.
[6] N. Jacobs, N. Roman, and R. Pless. Consistent temporal variations in many outdoor scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1-6, June 2007.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097-1105. Curran Associates, Inc., 2012.
[8] S. M. Lowry, N. Sünderhauf, P. Newman, J. J. Leonard, D. D. Cox, P. I. Corke, and M. J. Milford. Visual place recognition: A survey. IEEE Transactions on Robotics, 32(1):1-19, 2016.
[9] M. Milford and G. Wyeth. SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights. In IEEE International Conference on Robotics and Automation (ICRA), pages 1643-1649, Saint Paul, Minnesota, 2012.
[10] P. Neubert and P. Protzel. Local region detector + CNN based landmarks for practical place recognition in changing environments. In European Conference on Mobile Robots (ECMR), 2015.
[11] P. Neubert and P. Protzel. Beyond holistic descriptors, keypoints, and fixed patches: Multiscale superpixel grids for place recognition in changing environments. IEEE Robotics and Automation Letters, 1(1):484-491, Jan 2016.
[12] P. Neubert, N. Sünderhauf, and P. Protzel. Superpixel-based appearance change prediction for long-term navigation across seasons. Robotics and Autonomous Systems, 69:15-27, 2015.
[13] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
[14] N. Sünderhauf, F. Dayoub, S. Shirazi, B. Upcroft, and M. Milford. On the performance of ConvNet features for place recognition. CoRR, abs/1501.04158, 2015.
[15] L. van der Maaten and G. E. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, 2008.
[16] J. Weston, F. Ratle, H. Mobahi, and R. Collobert. Deep learning via semi-supervised embedding. Pages 639-655, Springer Berlin Heidelberg, 2012.