Life-Long Place Recognition by Shared Representative Appearance Learning


Fei Han, Xue Yang, Yiming Deng, Mark Rentschler, Dejun Yang and Hao Zhang

Division of Computer Science, EECS Department, Colorado School of Mines, Golden, CO 80401
Department of Electrical Engineering, University of Colorado Denver, Denver, CO
Department of Mechanical Engineering, University of Colorado Boulder, Boulder, CO

Abstract: Place recognition plays an important role in Simultaneous Localization and Mapping (SLAM) in many robotics applications. When long-term SLAM is performed in an outdoor environment, additional challenges can be encountered, including strong perceptual aliasing and appearance variations caused by changes of illumination, vegetation, weather, etc. These issues become even more critical in life-long place recognition. We introduce a novel Shared Representative Appearance Learning (SRAL) approach for long-term loop closure detection. Different from previous place recognition methods using a single feature modality or multi-feature concatenation techniques, our SRAL approach can autonomously learn a set of shared features that are representative in all scene scenarios and integrate them together to represent the long-term appearance of scenes observed by a robot during vision-based navigation. By formulating SRAL as an optimization problem regularized by structured sparsity-inducing norms, we are able to model the interdependence of the feature modalities. An optimization algorithm is also developed to efficiently solve the formulated convex optimization problem, with a theoretical guarantee to find the global optimum. Extensive experiments are performed to assess the SRAL method using large-scale benchmark datasets, including the St Lucia, CMU-VL, and Nordland datasets. Experimental results validate that SRAL outperforms existing place recognition methods and obtains superior performance on life-long place recognition.

Fig. 1. Motivating examples of the algorithm based on learning the shared representative appearance for life-long place recognition. The same place can show great appearance variations in different weather, vegetation, and time (Nordland dataset). The proposed SRAL algorithm automatically identifies a set of heterogeneous features (e.g., color, GIST, HOG, and LBP) that are not only representative but also shared by different scenarios, so image matching in these scenarios can be performed more accurately.

I. INTRODUCTION

Visual place recognition over a long time period is one of the core capabilities needed by a modern intelligent robot to perform long-term visual Simultaneous Localization And Mapping (SLAM) [23], and has thus attracted increasing attention in the robotics community [10, 29, 20, 18, 34, 7]. Place recognition aims at identifying the locations previously visited by a mobile robot, which enables the robot to localize itself in an environment during navigation as well as correct incremental pose drifts during SLAM. However, visual place recognition is a challenging problem due to robot movement, occlusion, dynamic objects (e.g., cars and pedestrians), and other vision challenges such as illumination changes. Moreover, multiple locations may have similar appearance, which results in the so-called perceptual aliasing issue. Long-term place recognition also introduces additional difficulties.
For example, the same place during daytime and night, or across seasons, can look quite different, which we name the Long-term Appearance Change (LAC) problem. In the past several years, life-long place recognition has been intensively investigated [23, 29, 35, 2, 28, 24], due to its importance to long-term robot navigation. For example, FAB-MAP [6] is one of the first techniques to address life-long place recognition for loop-closure detection in SLAM; it detects revisited places by matching individual images according to their visual appearance, and was validated over a 1000 km route [7]. More recently, sequence-based image matching methods have demonstrated very promising performance on long-term place recognition, including SeqSLAM [24], ABLE-M [2], and others [29, 2, 5]. Previous sequence-based place recognition methods either utilize a single visual feature to encode place appearance, or combine two visual features simply through concatenation [34, 38, 39]. However, these methods, which rely on purely hand-crafted features, are not able to learn the importance of heterogeneous features and effectively integrate the features together for robust place recognition, especially over long time spans.

To construct a descriptive representation for life-long place recognition, we propose a novel algorithm, referred to as Shared Representative Appearance Learning (SRAL), which automatically learns a set of heterogeneous visual features that are important to all scene scenarios, and fuses them together to represent the long-term appearance of scenes observed by a robot during vision-based navigation. Our SRAL algorithm models the key insight that visual features that are not only discriminative but also shared among all scene scenarios are the best for long-term place recognition. If visual features are only highly discriminative for a specific scenario (e.g., winter), they are less useful for matching images from different scenarios, because they cannot encode the other scenarios. For the first time, our algorithm automatically learns multimodal shared representative features that are discriminative for the long-term visual appearance in all scene scenarios, under strong perceptual aliasing and LAC issues (e.g., during daytime and night, and/or across multiple seasons). We formulate this as a multi-task multimodal feature learning problem and solve it through sparse convex optimization. Then, the fused multimodal, long-term representation is used with SeqSLAM [24], a sequence-based image matching method, to robustly perform life-long place recognition during daytime/night and across seasons.

Our contributions are threefold. First, we introduce a novel formulation of constructing long-term appearance representations as a multitask, multimodal feature learning and fusion problem, with the objective of automatically identifying a set of heterogeneous visual features that are discriminative for all scene scenarios (e.g., daytime vs. night). Second, we propose a novel SRAL algorithm that addresses this problem based on sparse convex optimization regularized by two structured sparsity-inducing norms to analyze the interrelationships of multimodal features and learn shared representative features. Third, a new optimization method is developed to solve the formulated problem, which is theoretically guaranteed to find the optimal solution.

The rest of the paper is organized as follows. We describe related research on long-term place recognition in Section II. Section III introduces the novel SRAL algorithm for long-term representation learning. Experimental results are discussed in Section IV. Finally, we conclude our paper in Section V.

II. RELATED WORK

Existing SLAM techniques can be broadly classified into three categories based on graph optimization, particle filters, and the Extended Kalman Filter [37]. Based on the relationship of visual inputs and mapping outputs, SLAM can be categorized into 2D SLAM (using 2D images to produce 2D maps) [20, 5, 6], monocular SLAM (using 2D image sequences to generate 3D maps) [8, 27], and 3D SLAM (using 3D point clouds or RGB-D images to build 3D maps) [9, 33, 3]. Place recognition is an integral component for loop closing in all visual SLAM methods; it employs visual features to identify locations previously visited by robots. We provide an overview of image matching methods and visual features used in life-long place recognition.

A. Image Matching for Long-Term Place Recognition

Most previous place recognition techniques are based on image-to-image matching and are broadly categorized into two groups: those based on pairwise similarity scoring and those based on nearest neighbor search. Methods based on pairwise similarity scores calculate a similarity score between the current scene image and each template using a certain distance metric, and select the template with the maximum score [5, 11], as illustrated by the sketch below.
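As a concrete illustration of pairwise similarity scoring, the following minimal Python sketch matches a query descriptor against stored templates; the function name, the descriptor shapes, and the cosine metric are illustrative assumptions rather than details of any cited method.

```python
import numpy as np

def match_template(query_desc: np.ndarray, template_descs: np.ndarray):
    """Pairwise similarity scoring: compare the query descriptor against
    every stored template and return the best-scoring template.

    query_desc:     (d,) global descriptor of the current scene image.
    template_descs: (N, d) descriptors of previously visited places.
    Any distance metric could stand in for the cosine similarity here.
    """
    # Cosine similarity between the query and each template.
    q = query_desc / (np.linalg.norm(query_desc) + 1e-12)
    T = template_descs / (np.linalg.norm(template_descs, axis=1, keepdims=True) + 1e-12)
    scores = T @ q                  # (N,) similarity scores
    best = int(np.argmax(scores))   # template with the maximum score
    return best, scores[best]
```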
On the other hand, techniques based on nearest neighbor search typically build a search tree to efficiently identify the most similar template to the current scene. For example, FAB-MAP [7, 6] employs the Chow-Liu tree to locate the most similar template; RTAB-MAP [18, 19] uses the KD-tree to perform fast nearest neighbor search; and other search trees are also used in different place recognition methods for efficient image-to-image matching [8, 2]. However, due to the limited information provided by a single image, previous image-based matching methods usually suffer from the LAC and perceptual aliasing issues, and are not robust enough to perform life-long place recognition.

To integrate more information, several recent methods use a sequence of image frames to decrease the negative effect of LAC and perceptual aliasing and improve the accuracy of life-long place recognition [24, 2, 5, 4, 25]. Most sequence-based place recognition techniques are based on a similarity score computed from a pair of template and query image sequences. For example, SeqSLAM [24], one of the earliest sequence-based life-long place recognition methods, computes a sum of image-based similarity scores to match image sequences. The ABLE-M approach [2] concatenates binary features extracted from each image in a sequence as a representation to match between sequences. Similarity scores are also used by other sequence-based image matching techniques to perform life-long place recognition for robot navigation [25, 4, 7, 5]; these calculate a sequence-based similarity score between the query and all template sequences to construct a similarity matrix, and then choose the template sequence with a statistically high score as a match. To model the temporal relationships of individual frames in a sequence, data association methods based on Hidden Markov Models (HMMs) [12], Conditional Random Fields (CRFs) [4], and network flows [28] were also proposed to align a pair of template and query sequences. However, these previous methods, usually based on a single type of feature, are not capable of effectively fusing the rich information offered by multiple modalities of visual features. We address this problem by introducing a novel data fusion algorithm that can work with SeqSLAM and other sequence-based or image-based matching methods for life-long place recognition.

B. Visual Features for Place Representation

A large number of visual features have been proposed in previous long-term place recognition techniques to encode the scenes observed by robots during navigation; they can be broadly classified into two categories: local and global features.

Local features apply a detector to find points of interest (e.g., corners) in an image and a descriptor to encode the local information around each detected point of interest. Then, the Bag-of-Words (BoW) model is typically employed by place recognition methods to build a feature vector for each image. The Scale-Invariant Feature Transform (SIFT) feature is used along with a BoW model to detect revisited locations from 2D images [1]. FAB-MAP [6, 7] applies Speeded Up Robust Features (SURF) for visual place recognition. Both features are also used by the RTAB-MAP SLAM [18, 19]. The binary BRIEF and FAST features are applied to build a BoW model to perform fast loop closure detection [9]. The ORB feature is also used for place recognition [26]. Local visual features are usually discriminative enough to differentiate each scene scenario (e.g., daytime vs. night, or different seasons). However, they are often not shared by different scenarios and are thus less representative for encoding all scene scenarios for image matching in long-term place recognition.

Global features extract information from the whole image, and a feature vector is often formed based on concatenation or feature statistics (e.g., histograms). These global features can encode raw image pixels, shape signatures, and other information. GIST features [2], constructed from the responses of steerable filters at different orientations and scales, were employed to perform place recognition [34]. Local Difference Binary (LDB) features were applied to represent scenes by directly using intensity and gradient differences of image grid cells to compute a binary string [2]. SeqSLAM [24] used the sum of absolute differences of low-resolution images as global features to conduct sequence-based place recognition. Convolutional neural networks (CNNs) were also utilized to extract deep features to match image sequences [29]. Other global features were also proposed to perform life-long place recognition and loop-closure detection [25, 28, 3]. Global features can encode whole-image information, and no dictionary-based quantization is required. They have also shown promising performance for long-term place recognition [23]. However, the critical problem of identifying more discriminative global features and effectively fusing them together was not well studied in previous methods based on global features; it is addressed in this paper.

III. SHARED REPRESENTATIVE APPEARANCE LEARNING FOR LIFE-LONG PLACE RECOGNITION

We propose to learn and fuse a set of heterogeneous features that are representative and shared by multiple scenarios (e.g., night vs. daytime, or different seasons), which was not well investigated in previous research, to address the problem of life-long place recognition. We introduce the formulation of long-term feature learning and our novel SRAL algorithm from the perspective of sparse convex optimization. A new optimization solver is also implemented to solve the formulated problem, with a theoretical guarantee of convergence to the global optimal solution.

Notation. In this paper, matrices are written as boldface capital letters, and vectors are written as boldface lowercase letters. Given a matrix $\mathbf{M} = \{m_{ij}\} \in \mathbb{R}^{n \times m}$, we refer to its $i$-th row and $j$-th column as $\mathbf{m}^i$ and $\mathbf{m}_j$, respectively. The $\ell_1$-norm of a vector $\mathbf{v} \in \mathbb{R}^n$ is defined as $\|\mathbf{v}\|_1 = \sum_{i=1}^{n} |v_i|$. The $\ell_2$-norm of $\mathbf{v}$ is defined as $\|\mathbf{v}\|_2 = \sqrt{\mathbf{v}^\top \mathbf{v}}$. The $\ell_{2,1}$-norm and the Frobenius norm of the matrix $\mathbf{M}$ are defined as:

$$\|\mathbf{M}\|_{2,1} = \sum_{i=1}^{n} \sqrt{\sum_{j=1}^{m} m_{ij}^2} = \sum_{i=1}^{n} \|\mathbf{m}^i\|_2, \quad (1)$$

$$\|\mathbf{M}\|_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{m} m_{ij}^2}. \quad (2)$$
A. Problem Formulation

Given a set of images acquired by a robot during navigation in multiple scenarios (e.g., daytime vs. night, or different seasons), the vectors of heterogeneous features extracted from this image set are denoted as $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$, and their corresponding scene scenarios are represented as $\mathbf{Y} = [\mathbf{y}_1, \dots, \mathbf{y}_n]^\top \in \mathbb{R}^{n \times c}$, where $\mathbf{x}_i = [(\mathbf{x}_i^1)^\top, \dots, (\mathbf{x}_i^m)^\top]^\top \in \mathbb{R}^d$ is a vector of heterogeneous features from the $i$-th image and consists of $m$ modalities such that $d = \sum_{j=1}^{m} d_j$, and $\mathbf{y}_i \in \mathbb{Z}^c$ is the indicator vector of scene scenarios, with $y_{ij}$ indicating whether the $i$-th image belongs to the $j$-th scene scenario. Then, the task of shared representative appearance learning can be considered as multitask feature learning and formulated as the following optimization problem:

$$\min_{\mathbf{W}} \ \|\mathbf{X}^\top \mathbf{W} - \mathbf{Y}\|_F^2 + \lambda \|\mathbf{W}\|_{2,1}, \quad (3)$$

where $\mathbf{W} \in \mathbb{R}^{d \times c}$ is a weight matrix with $w_{ij}$ denoting the importance of the $i$-th visual feature with respect to the $j$-th scene scenario, and $\lambda$ is a trade-off parameter. The first term, encoded by the squared Frobenius norm in Eq. (3), is a loss function that models the squared error of using the weighted features to identify scene scenarios. The second term, based on the $\ell_{2,1}$-norm, is a regularization that enforces sparsity of the rows of $\mathbf{W}$ and a grouping effect of the weights within each row. The sparsity enforced by the regularization allows the method to find discriminative features with larger weights (non-discriminative features receive weights very close to 0); the grouping effect pushes discriminative features toward similar weights across all scene scenarios, since $\|\mathbf{m}^i\|_2$ in the $\ell_{2,1}$-norm of Eq. (1) obtains its minimum value when the weights of a visual feature are equal across all scenarios. Therefore, this regularization term models our insight of finding shared representative features to build a descriptive representation that can encode long-term appearance variations for accurate life-long place recognition.
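To make the notation concrete, here is a minimal numpy sketch that evaluates the norms of Eqs. (1)-(2) and the objective of Eq. (3), assuming features are stacked as described above; the function names are ours.

```python
import numpy as np

def l21_norm(M: np.ndarray) -> float:
    """||M||_{2,1}: sum of the l2-norms of the rows of M (Eq. 1)."""
    return float(np.sum(np.linalg.norm(M, axis=1)))

def objective(X: np.ndarray, W: np.ndarray, Y: np.ndarray, lam: float) -> float:
    """Value of the multitask feature-learning objective in Eq. 3:
    ||X^T W - Y||_F^2 + lambda * ||W||_{2,1}.
    X: (d, n) stacked heterogeneous features, W: (d, c) weights,
    Y: (n, c) scene-scenario indicators."""
    loss = np.linalg.norm(X.T @ W - Y, ord="fro") ** 2  # Frobenius-norm loss (Eq. 2)
    return loss + lam * l21_norm(W)
```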

Because different types of visual features typically capture different attributes of the scene (e.g., color, shape, etc.), when multiple modalities of features are used together, it is important to model the underlying structure introduced by the multiple modalities. To realize this capability, we define a new regularization term that enforces a grouping effect on the weights of features belonging to the same modality, as well as sparsity between modalities, to identify shared discriminative modalities:

$$\|\mathbf{W}\|_M = \sum_{i=1}^{m} \sqrt{\sum_{p=1}^{d_i} \sum_{q=1}^{c} (w_{pq}^i)^2} = \sum_{i=1}^{m} \|\mathbf{W}^i\|_F, \quad (4)$$

where $\mathbf{W}^i \in \mathbb{R}^{d_i \times c}$ is the weight matrix for the $i$-th modality. Then, the final problem formulation of our SRAL algorithm becomes:

$$\min_{\mathbf{W}} \ \|\mathbf{X}^\top \mathbf{W} - \mathbf{Y}\|_F^2 + \lambda_1 \|\mathbf{W}\|_{2,1} + \lambda_2 \|\mathbf{W}\|_M, \quad (5)$$

which simultaneously considers the underlying structure of feature modalities and identifies shared representative visual features to encode long-term appearance changes.

Fig. 2. Illustration of the proposed structured sparsity-inducing norms. Given the feature weight matrix $\mathbf{W} = [\mathbf{W}^1; \mathbf{W}^2; \dots; \mathbf{W}^m] \in \mathbb{R}^{d \times c}$, we model the interrelationship of feature modalities using the M-norm regularization term, which allows our SRAL algorithm to identify the modalities shared across different scenarios; we also introduce the $\ell_{2,1}$-norm regularization to identify more representative shared features. This figure is best viewed in color.

Our fusion method can be applied in both single-image-based SLAM and sequence-based SLAM. A significant advantage of our SRAL method is that we are able to learn representative feature modalities shared by different environments under changes of weather and/or illumination conditions. This fusion improves the robustness of feature matching for place recognition, which significantly benefits loop closure detection in life-long SLAM.

B. Place Recognition

The place recognition procedure of our SRAL method is partitioned into two phases: the weight learning phase and the fusion phase. After solving the problem in Eq. (5) during the first phase (the solution is detailed in Section III-C), we obtain the optimal weight matrix $\mathbf{W} = [\mathbf{W}^1; \mathbf{W}^2; \dots; \mathbf{W}^m] \in \mathbb{R}^{d \times c}$. The optimal weight $\omega_i$ ($i = 1, \dots, m$) for each feature modality $i$ is obtained by $\omega_i = \|\mathbf{W}^i\|_F$. Then, in the second (fusion) phase, given a new observation with features $\mathbf{x} = [\mathbf{x}^1; \mathbf{x}^2; \dots; \mathbf{x}^m] \in \mathbb{R}^d$, we first calculate the matching score $s_i$ ($i = 1, \dots, m$) between the query image and an existing image using each feature modality separately. The final score is then obtained by the following fusion:

$$s = \sum_{i=1}^{m} \tilde{\omega}_i s_i, \quad (6)$$

where $\tilde{\omega}_i$ is the normalized weight for modality $i$. Finally, we determine whether two images are matched by comparing the score with a predefined threshold.

C. Optimization Algorithm and Analysis

Because the objective in Eq. (5) comprises two non-smooth regularization terms, the $\ell_{2,1}$-norm and the M-norm, it is difficult to solve in general. To this end, we implement a new iterative algorithm to solve the optimization problem in Eq. (5) with the non-smooth regularization terms. Taking the derivative of the objective with respect to $\mathbf{w}_i$ ($1 \le i \le c$) and setting it to the zero vector, we obtain

$$\mathbf{X}\mathbf{X}^\top \mathbf{w}_i - \mathbf{X}\mathbf{y}_i + \lambda_1 \mathbf{D}\mathbf{w}_i + \lambda_2 \tilde{\mathbf{D}}\mathbf{w}_i = \mathbf{0}, \quad (7)$$

where $\mathbf{D}$ is a diagonal matrix whose $i$-th diagonal element is $\frac{1}{2\|\mathbf{w}^i\|_2}$, and $\tilde{\mathbf{D}}$ is a block diagonal matrix whose $i$-th diagonal block is $\frac{1}{2\|\mathbf{W}^i\|_F}\mathbf{I}^i$, where $\mathbf{I}^i$ is an identity matrix of size $d_i$ and $\mathbf{W}^i$ is the $i$-th row segment of $\mathbf{W}$ containing the weights of the $i$-th feature modality. Thus we have

$$\mathbf{w}_i = (\mathbf{X}\mathbf{X}^\top + \lambda_1 \mathbf{D} + \lambda_2 \tilde{\mathbf{D}})^{-1}\mathbf{X}\mathbf{y}_i. \quad (8)$$

Note that $\mathbf{D}$ and $\tilde{\mathbf{D}}$ depend on $\mathbf{W}$ and are thus also unknown variables. An iterative algorithm is proposed to solve this problem, which is described in Algorithm 1.
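The following compact numpy sketch illustrates this iterative scheme. It assumes the per-modality dimensions $d_1, \dots, d_m$ are given; the small `eps` term and the ridge-stabilized initialization are implementation conveniences we add to avoid division by zero and singular systems, not parts of the formulation above.

```python
import numpy as np

def sral_solve(X, Y, dims, lam1, lam2, iters=50, eps=1e-8):
    """Iteratively reweighted solver sketching Algorithm 1.
    X: (d, n) features, Y: (n, c) scenario indicators,
    dims: per-modality dimensions d_1..d_m with sum(dims) == d.
    Returns W (d, c) and the normalized per-modality weights omega."""
    d = X.shape[0]
    G = X @ X.T                                      # (d, d), fixed across iterations
    W = np.linalg.solve(G + eps * np.eye(d), X @ Y)  # init: min ||X^T W - Y||_F^2 (ridge-stabilized)
    starts = np.cumsum([0] + list(dims))
    for _ in range(iters):
        # D: diagonal, i-th entry 1 / (2 ||w^i||_2)  (row-sparsity reweighting)
        row_norms = np.linalg.norm(W, axis=1) + eps
        D = np.diag(1.0 / (2.0 * row_norms))
        # D~: block diagonal, i-th block (1 / (2 ||W^i||_F)) I  (modality reweighting)
        Dt = np.zeros((d, d))
        for i in range(len(dims)):
            s, e = starts[i], starts[i + 1]
            Dt[s:e, s:e] = np.eye(dims[i]) / (2.0 * (np.linalg.norm(W[s:e], "fro") + eps))
        # Closed-form update of all c columns at once (Eq. 8).
        W = np.linalg.solve(G + lam1 * D + lam2 * Dt, X @ Y)
    # Fusion weights: omega_i = ||W^i||_F, normalized (Sec. III-B).
    omega = np.array([np.linalg.norm(W[starts[i]:starts[i + 1]], "fro")
                      for i in range(len(dims))])
    return W, omega / omega.sum()
```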
In the following, we analyze the convergence of the algorithm and prove that Algorithm 1 converges to the global optimum. First, we present a lemma from Nie et al. [30]:

Lemma 1: For any vectors $\tilde{\mathbf{v}}$ and $\mathbf{v}$, the following inequality holds: $\|\tilde{\mathbf{v}}\|_2 - \frac{\|\tilde{\mathbf{v}}\|_2^2}{2\|\mathbf{v}\|_2} \le \|\mathbf{v}\|_2 - \frac{\|\mathbf{v}\|_2^2}{2\|\mathbf{v}\|_2}$.

From Lemma 1, we can easily obtain the following corollary.

Corollary 1: For any matrices $\tilde{\mathbf{M}}$ and $\mathbf{M}$, the following inequality holds: $\|\tilde{\mathbf{M}}\|_F - \frac{\|\tilde{\mathbf{M}}\|_F^2}{2\|\mathbf{M}\|_F} \le \|\mathbf{M}\|_F - \frac{\|\mathbf{M}\|_F^2}{2\|\mathbf{M}\|_F}$.

Theorem 1: Algorithm 1 monotonically decreases the objective value of the problem in Eq. (5) in each iteration.

Proof: According to Steps 3 and 4 of Algorithm 1, we know that

$$\mathbf{W}(t+1) = \arg\min_{\mathbf{W}} \ \|\mathbf{X}^\top \mathbf{W} - \mathbf{Y}\|_F^2 + \lambda_1 \operatorname{Tr}(\mathbf{W}^\top \mathbf{D}(t+1)\mathbf{W}) + \lambda_2 \operatorname{Tr}(\mathbf{W}^\top \tilde{\mathbf{D}}(t+1)\mathbf{W}). \quad (9)$$

Then we can derive that

$$\mathcal{J}(t+1) + \lambda_1 \operatorname{Tr}(\mathbf{W}^\top(t+1)\mathbf{D}(t+1)\mathbf{W}(t+1)) + \lambda_2 \operatorname{Tr}(\mathbf{W}^\top(t+1)\tilde{\mathbf{D}}(t+1)\mathbf{W}(t+1)) \le \mathcal{J}(t) + \lambda_1 \operatorname{Tr}(\mathbf{W}^\top(t)\mathbf{D}(t+1)\mathbf{W}(t)) + \lambda_2 \operatorname{Tr}(\mathbf{W}^\top(t)\tilde{\mathbf{D}}(t+1)\mathbf{W}(t)),$$

where $\mathcal{J}(t) = \|\mathbf{X}^\top \mathbf{W}(t) - \mathbf{Y}\|_F^2$.

After substituting the definitions of $\mathbf{D}$ and $\tilde{\mathbf{D}}$, we obtain

$$\mathcal{J}(t+1) + \lambda_1 \sum_{i=1}^{d} \frac{\|\mathbf{w}^i(t+1)\|_2^2}{2\|\mathbf{w}^i(t)\|_2} + \lambda_2 \sum_{i=1}^{m} \frac{\|\mathbf{W}^i(t+1)\|_F^2}{2\|\mathbf{W}^i(t)\|_F} \le \mathcal{J}(t) + \lambda_1 \sum_{i=1}^{d} \frac{\|\mathbf{w}^i(t)\|_2^2}{2\|\mathbf{w}^i(t)\|_2} + \lambda_2 \sum_{i=1}^{m} \frac{\|\mathbf{W}^i(t)\|_F^2}{2\|\mathbf{W}^i(t)\|_F}. \quad (10)$$

From Lemma 1 and Corollary 1, we can derive

$$\|\mathbf{w}^i(t+1)\|_2 - \frac{\|\mathbf{w}^i(t+1)\|_2^2}{2\|\mathbf{w}^i(t)\|_2} \le \|\mathbf{w}^i(t)\|_2 - \frac{\|\mathbf{w}^i(t)\|_2^2}{2\|\mathbf{w}^i(t)\|_2}, \quad (11)$$

$$\|\mathbf{W}^i(t+1)\|_F - \frac{\|\mathbf{W}^i(t+1)\|_F^2}{2\|\mathbf{W}^i(t)\|_F} \le \|\mathbf{W}^i(t)\|_F - \frac{\|\mathbf{W}^i(t)\|_F^2}{2\|\mathbf{W}^i(t)\|_F}. \quad (12)$$

Adding Eqs. (10)-(12) on both sides, we have

$$\mathcal{J}(t+1) + \lambda_1 \sum_{i=1}^{d} \|\mathbf{w}^i(t+1)\|_2 + \lambda_2 \sum_{i=1}^{m} \|\mathbf{W}^i(t+1)\|_F \le \mathcal{J}(t) + \lambda_1 \sum_{i=1}^{d} \|\mathbf{w}^i(t)\|_2 + \lambda_2 \sum_{i=1}^{m} \|\mathbf{W}^i(t)\|_F. \quad (13)$$

Therefore, Algorithm 1 monotonically decreases the objective value in each iteration. Because the optimization problem in Eq. (5) is convex, the algorithm converges to the global optimal solution. Moreover, since the solution has a closed form in each iteration, Algorithm 1 converges quickly.

Algorithm 1: An efficient algorithm to solve the optimization problem in Eq. (5)
Input: $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$ and $\mathbf{Y} = [\mathbf{y}_1, \dots, \mathbf{y}_n]^\top \in \mathbb{R}^{n \times c}$
1: Let $t = 1$. Initialize $\mathbf{W}(t)$ by solving $\min_{\mathbf{W}} \|\mathbf{X}^\top \mathbf{W} - \mathbf{Y}\|_F^2$.
2: while not converged do
3: Calculate the diagonal matrix $\mathbf{D}(t+1)$, where the $i$-th diagonal element is $\frac{1}{2\|\mathbf{w}^i(t)\|_2}$; calculate the block diagonal matrix $\tilde{\mathbf{D}}(t+1)$, where the $i$-th diagonal block is $\frac{1}{2\|\mathbf{W}^i(t)\|_F}\mathbf{I}^i$.
4: For each $\mathbf{w}_i$ ($1 \le i \le c$), $\mathbf{w}_i(t+1) = (\mathbf{X}\mathbf{X}^\top + \lambda_1 \mathbf{D}(t+1) + \lambda_2 \tilde{\mathbf{D}}(t+1))^{-1}\mathbf{X}\mathbf{y}_i$.
5: $t = t + 1$.
6: end while
Output: $\mathbf{W} = \mathbf{W}(t) \in \mathbb{R}^{d \times c}$

IV. EXPERIMENTAL RESULTS

A. Experiment Setup

We use three large-scale benchmark datasets to validate the performance of our SRAL method under different conditions and over various time spans. The dataset statistics are summarized in Table I. Four types of visual features were employed in our experiments for all of these datasets: color features [22], GIST features [2], HOG features [28], and LBP features [32], each applied on downsampled images. Instead of concatenating all these features into a single large vector for the scene representation, we learn the weight of each feature modality in order to select features that are robust to varying environments, in service of the long-term place recognition objective in SLAM loop closure detection.

TABLE I
STATISTICS AND SCENARIOS OF THE PUBLIC BENCHMARK DATASETS USED FOR VALIDATION OF OUR SRAL METHOD

Dataset       | Sequence   | Image Statistics             | Scenario
St Lucia [10] | 10 × 12 km | 10 × 22,000 frames at 15 FPS | Different times of the day
CMU-VL [3]    | 5 × 8 km   | 5 × 13,000 frames at 15 FPS  | Different months
Nordland [35] | 4 × 728 km | 4 × 900,000 frames at 25 FPS | Different seasons

We implemented both the proposed SRAL method and the simple multi-modal feature concatenation baseline for long-term place recognition tasks. Our implementation uses a mixture of unoptimized Matlab and C++ on a Linux laptop with an i7 3.0 GHz CPU, 16GB of memory, and a 2GB GPU. Similar to other state-of-the-art methods [29, 36], the implementation at this stage is not able to perform large-scale long-term loop closure detection in real time. A key limiting factor is that the runtime is proportional to the number of previously visited places. Utilizing memory management techniques [19], combined with an optimized implementation, can potentially overcome this challenge and achieve real-time performance. In these experiments, we qualitatively and quantitatively evaluate our algorithms and compare them with several state-of-the-art methods, including BRIEF-GIST [34] (using a single BRIEF feature), SeqSLAM [24] (using a single grayscale image feature), and others.
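For concreteness, the fusion phase of Section III-B (Eq. 6) used throughout these experiments can be sketched as follows; the weight and score values in the usage example are purely illustrative, and `sral_solve` refers to the sketch given earlier.

```python
import numpy as np

def fused_match_score(per_modality_scores: np.ndarray, omega: np.ndarray) -> float:
    """Fusion phase (Eq. 6): combine matching scores s_1..s_m, computed
    independently with each feature modality, into a single score
    s = sum_i omega_i * s_i using the normalized modality weights
    learned in the weight learning phase (e.g., output of sral_solve)."""
    return float(np.dot(omega, per_modality_scores))

# Hypothetical usage: four modalities (color, GIST, HOG, LBP) and a
# free decision threshold; all numbers are illustrative only.
omega = np.array([0.10, 0.01, 0.88, 0.01])         # learned, normalized weights
scores = np.array([0.42, 0.55, 0.91, 0.30])        # per-modality matching scores
is_match = fused_match_score(scores, omega) > 0.8  # threshold is a tunable parameter
```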
B. Results on the St Lucia Dataset (Various Times of the Day)

The St Lucia dataset [10] was collected by a single camera installed on a car in the suburban area of St Lucia in Australia, at various times over several days during a two-week period. Each data instance is a video of one traversal of the route. GPS data was also recorded, which is used in the experiments as the ground truth for place recognition.

The dataset contains several challenges, including appearance variations due to illumination changes at different times of the day, dynamic objects including pedestrians and vehicles, and viewpoint variations due to slight route deviations. The dataset statistics are shown in Table I.

Fig. 3. Experimental results over the St Lucia dataset. (a) An example showing matched images recorded at 15:45 on 08/18/2009 and 10:00 on 09/10/2009, respectively. (b) Precision-recall curves comparing our SRAL algorithm with BRIEF-GIST, feature concatenation, SeqSLAM, and single color and LBP features. (c) The weights of each feature modality learned automatically by Algorithm 1. The figures are best viewed in color.

Loop closure detection results over the St Lucia dataset are illustrated in Fig. 3. The quantitative performance is evaluated using a standard precision-recall curve, as shown in Fig. 3(b). The high precision and recall values indicate that our SRAL method with the $\ell_{2,1}$-norm and M-norm obtains high performance and matches morning and afternoon video sequences well. We also implemented the feature concatenation method, which simply concatenates all feature modalities into one large vector for the scene representation. Our SRAL method instead learns the weight of each feature modality, which enables a more powerful and robust scene representation. The precision-recall curves of these two methods in Fig. 3(b) show that our SRAL method outperforms the simple feature concatenation scheme, which underscores the importance of representative, shared feature modalities.

To qualitatively evaluate the experimental results, an intuitive example of the sequence-based matching is presented in Fig. 3(a). We show the image (upper row of Fig. 3(a)) that has the maximum matching score for a query image (bottom row of Fig. 3(a)) within the sequence. These qualitative results demonstrate that the proposed SRAL algorithm works well in the presence of dynamic objects and other vision challenges, including camera motions and illumination changes.

Comparisons with some of the main state-of-the-art methods are also graphically presented in Fig. 3(b). It is observed that for long-term loop closure detection, our SRAL algorithm outperforms the methods based on single features, including BRIEF-GIST [34], SeqSLAM [24], color-based SLAM [22], and LBP-based localization [32], due to the significant appearance variations of the same location at different times. It can be observed from the matched images in Fig. 3(a) that scene appearance varies substantially during different times of the day, which makes some features less representative for identifying the same place in different scenarios. After the feature modality learning procedure in our SRAL method, the weights of these less useful features are decreased, resulting in better final place recognition performance.

Fig. 4. Experimental results over the CMU-VL dataset. (a) An example demonstrating the matched images and query images recorded in December and October, respectively. (b) Precision-recall curves comparing our SRAL algorithm with BRIEF-GIST, feature concatenation, SeqSLAM, and single color and LBP features. (c) The weights of each feature modality learned automatically by Algorithm 1 (color 9.79%, LBP 0.98%, GIST 0.98%, HOG 88.25%). The figures are best viewed in color.
C. Results on the CMU-VL Dataset (Different Months)

The CMU Visual Localization (VL) dataset [3] was gathered using two cameras installed on a car that traveled the same route five times in the Pittsburgh area of the USA during different months, in varying climatological, environmental, and weather conditions. GPS information is also available, which is used as the ground truth for algorithm evaluation. This dataset contains seasonal changes caused by vegetation, snow, and illumination variations, as well as urban scene changes due to construction and dynamic objects.

The visual data from the left camera is used in this set of experiments.

The qualitative and quantitative testing results obtained by our SRAL algorithm on the CMU-VL dataset are shown in Fig. 4. The qualitative results in Fig. 4(a) show the images (upper row) with the maximum score for each query image (bottom row) in an observed sequence. It is clearly observed that our SRAL method can match scene images well and recognize the same locations across different months that exhibit significant weather, vegetation, and illumination changes. The quantitative results in Fig. 4(b) indicate that the SRAL method with M-norm and $\ell_{2,1}$-norm regularizations obtains strong performance. To demonstrate the significance of our SRAL method with weighted feature fusion, we also implemented the baseline feature concatenation method and compared its performance. The comparison in Fig. 4(b) indicates that our SRAL method is again better able to address the LAC problem and obtain better scene recognition results, the same phenomenon observed in the experiment on the St Lucia dataset. Fig. 4(b) also compares our SRAL method with several previous loop closure detection approaches, supporting the same conclusion as in the St Lucia experiment: our SRAL approach significantly outperforms methods based on single-feature matching for long-term place recognition.

Fig. 5. Experimental results over the Nordland dataset. (a) An example illustrating the matched images and query images recorded in winter and spring, respectively. (b) Precision-recall curves comparing our SRAL algorithm with BRIEF-GIST, feature concatenation, SeqSLAM, and single color and LBP features. (c) The weights of each feature modality learned automatically by Algorithm 1 (color 2.49%, GIST 0.2%, LBP 4%, HOG 96.68%). The figures are best viewed in color.

D. Results on the Nordland Dataset (Different Seasons)

The Nordland dataset [35] contains visual data from a ten-hour train journey, recorded in each of the four seasons from the viewpoint of the train's front car (around 3000 km of visual data in total). GPS data was also collected, which is employed as the ground truth for algorithm evaluation. Because the dataset was recorded across different seasons, the visual data contains strong scene appearance variations under different climatological, environmental, and illumination conditions. In addition, since multiple places in the wilderness along the trip exhibit similar appearance, this dataset contains strong perceptual aliasing. The visual information is also very limited in several locations such as tunnels. These difficulties make the Nordland dataset one of the most challenging datasets for loop closure detection.

Fig. 5 presents the experimental results obtained by matching the spring data to the winter video in the Nordland dataset. Fig. 5(a) shows several example images from one of the matched sequences (upper row) and the query images (bottom row) within the sequence. Fig. 5(a) validates that our SRAL algorithm can accurately match images that contain dramatic appearance changes across seasons.
Fig. 5(b) illustrates the quantitative results obtained by our SRAL algorithm and the comparison with previous techniques, showing the improvement of our SRAL method over existing methods. It can also be observed from Fig. 5(b) that our SRAL method outperforms the feature concatenation method, showing the significance of feature weight learning.

V. CONCLUSION

In this paper, we propose a novel SRAL algorithm that learns shared representative features and fuses the learned heterogeneous feature modalities together to address the LAC problem and accurately perform life-long place recognition. The shared representability of each feature modality is automatically learned, formulated as an optimization problem regularized by structured sparsity-inducing norms. A new optimization algorithm is also implemented to efficiently solve the formulated problem. To evaluate our SRAL algorithm, extensive experiments were performed using three large-scale benchmark datasets. The qualitative results validate that SRAL robustly performs long-term place recognition under significant scene variations across environments in different days, months, and seasons. In addition, the quantitative evaluation shows that the proposed algorithm outperforms previous techniques and obtains promising life-long place recognition performance.

REFERENCES

[1] Adrien Angeli, David Filliat, Stéphane Doncieux, and Jean-Arcady Meyer. Fast and incremental method for loop-closure detection using bags of visual words. TRO, 24(5), 2008.
[2] R. Arroyo, P. F. Alcantarilla, L. M. Bergasa, and E. Romera. Towards life-long visual localization using an efficient matching of binary sequences from images. In ICRA, 2015.

[3] Hernán Badino, Daniel Huber, and Takeo Kanade. Real-time topometric localization. In ICRA, 2012.
[4] César Cadena, Dorian Gálvez-López, Juan D. Tardós, and José Neira. Robust place recognition with stereo sequences. TRO, 28(4):871-885, 2012.
[5] Cheng Chen and Han Wang. Appearance-based topological Bayesian inference for loop-closing detection in a cross-country environment. IJRR, 25(10), 2006.
[6] Mark Cummins and Paul Newman. FAB-MAP: Probabilistic localization and mapping in the space of appearance. IJRR, 27(6), 2008.
[7] Mark Cummins and Paul Newman. Highly scalable appearance-only SLAM: FAB-MAP 2.0. In RSS, 2009.
[8] Jakob Engel, Thomas Schöps, and Daniel Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In ECCV, 2014.
[9] Dorian Gálvez-López and Juan D. Tardós. Bags of binary words for fast place recognition in image sequences. TRO, 28(5):1188-1197, 2012.
[10] Arren J. Glover, William P. Maddern, Michael J. Milford, and Gordon F. Wyeth. FAB-MAP + RatSLAM: Appearance-based SLAM for multiple times of day. In ICRA, 2010.
[11] Jens-Steffen Gutmann and Kurt Konolige. Incremental mapping of large cyclic environments. In ISCIRA, 1999.
[12] Paul Hansen and Brett Browning. Visual place recognition using HMM sequence matching. In IROS, 2014.
[13] Peter Henry, Michael Krainin, Evan Herbst, Xiaofeng Ren, and Dieter Fox. RGB-D mapping: Using Kinect-style depth cameras for dense 3D modeling of indoor environments. IJRR, 31(5), 2012.
[14] Kin Leong Ho and Paul Newman. Detecting loop closure with scene sequences. IJCV, 74(3):261-286, 2007.
[15] Edward Johns and Guang-Zhong Yang. Feature co-occurrence maps: Appearance-based localisation throughout the day. In ICRA, 2013.
[16] Michael Kaess, Hordur Johannsson, Richard Roberts, Viorela Ila, John J. Leonard, and Frank Dellaert. iSAM2: Incremental smoothing and mapping using the Bayes tree. IJRR, 31(2):216-235, 2012.
[17] Manfred Klopschitz, Christopher Zach, Arnold Irschara, and Dieter Schmalstieg. Generalized detection and merging of loop closures for video sequences. In 3D Data Processing, Visualization, and Transmission, 2008.
[18] Mathieu Labbe and François Michaud. Appearance-based loop closure detection for online large-scale and long-term operation. TRO, 29(3), 2013.
[19] Mathieu Labbe and François Michaud. Online global loop closure detection for large-scale multi-session graph-based SLAM. In IROS, 2014.
[20] Yasir Latif, César Cadena, and José Neira. Robust loop closing over time for pose graph SLAM. IJRR, pages 1611-1626, 2013.
[21] Yasir Latif, Guoquan Huang, John Leonard, and José Neira. An online sparsity-cognizant loop-closure algorithm for visual navigation. In RSS, 2014.
[22] Donghwa Lee, Hyongjin Kim, and Hyun Myung. 2D image feature-based real-time RGB-D 3D SLAM. In RITA.
[23] Stephanie Lowry, Niko Sünderhauf, Paul Newman, John J. Leonard, David Cox, Peter Corke, and Michael J. Milford. Visual place recognition: A survey. TRO, 2015.
[24] Michael J. Milford and Gordon F. Wyeth. SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights. In ICRA, 2012.
[25] Michael J. Milford, Gordon F. Wyeth, and David Prasser. RatSLAM: A hippocampal model for simultaneous localization and mapping. In ICRA, 2004.
[26] Raúl Mur-Artal and Juan D. Tardós. Fast relocalisation and loop closing in keyframe-based SLAM. In ICRA, 2014.
[27] Raúl Mur-Artal, José María Martínez Montiel, and Juan D. Tardós. ORB-SLAM: A versatile and accurate monocular SLAM system. TRO, 31(5):1147-1163, 2015.
[28] Tayyab Naseer, Luciano Spinello, Wolfram Burgard, and Cyrill Stachniss.
Robust visual robot localization across seasons using network flows. In AAAI, 2014.
[29] Tayyab Naseer, Michael Ruhnke, Cyrill Stachniss, Luciano Spinello, and Wolfram Burgard. Robust visual SLAM across seasons. In IROS, 2015.
[30] Feiping Nie, Heng Huang, Xiao Cai, and Chris H. Ding. Efficient and robust feature selection via joint ℓ2,1-norms minimization. In NIPS, 2010.
[31] Edward Pepperell, Peter Corke, and Michael J. Milford. All-environment visual place recognition with SMART. In ICRA, 2014.
[32] Yongliang Qiao, Cindy Cappelle, and Yassine Ruichek. Place recognition based visual localization using LBP feature and SVM. In Advances in Artificial Intelligence and Its Applications.
[33] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In IROS, 2012.
[34] Niko Sünderhauf and Peter Protzel. BRIEF-Gist: Closing the loop by simple means. In IROS, 2011.
[35] Niko Sünderhauf, Peer Neubert, and Peter Protzel. Are we there yet? Challenging SeqSLAM on a 3000 km journey across all four seasons. In ICRA Workshop, 2013.
[36] Niko Sünderhauf, Sareh Shirazi, Adam Jacobson, Feras Dayoub, Edward Pepperell, Ben Upcroft, and Michael Milford. Place recognition with ConvNet landmarks: Viewpoint-robust, condition-robust, training-free. In RSS, 2015.
[37] Sebastian Thrun and John J. Leonard. Simultaneous localization and mapping. In Springer Handbook of Robotics, 2008.
[38] Xue Yang, Fei Han, Hua Wang, and Hao Zhang. Enforcing template representability and temporal consistency for adaptive sparse tracking. In IJCAI, 2016.
[39] Hao Zhang, Fei Han, and Hua Wang. Robust multimodal sequence-based loop closure detection via structured sparsity. In RSS, 2016.


More information

Visual localization using global visual features and vanishing points

Visual localization using global visual features and vanishing points Visual localization using global visual features and vanishing points Olivier Saurer, Friedrich Fraundorfer, and Marc Pollefeys Computer Vision and Geometry Group, ETH Zürich, Switzerland {saurero,fraundorfer,marc.pollefeys}@inf.ethz.ch

More information

Beyond a Shadow of a Doubt: Place Recognition with Colour-Constant Images

Beyond a Shadow of a Doubt: Place Recognition with Colour-Constant Images Beyond a Shadow of a Doubt: Place Recognition with Colour-Constant Images Kirk MacTavish, Michael Paton, and Timothy D. Barfoot Abstract Colour-constant images have been shown to improve visual navigation

More information

Place Recognition using Near and Far Visual Information

Place Recognition using Near and Far Visual Information Place Recognition using Near and Far Visual Information César Cadena John McDonald John J. Leonard José Neira Instituto de Investigación en Ingeniería de Aragón (I3A), Universidad de Zaragoza, Zaragoza

More information

Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction

Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction Marc Pollefeys Joined work with Nikolay Savinov, Christian Haene, Lubor Ladicky 2 Comparison to Volumetric Fusion Higher-order ray

More information

String distance for automatic image classification

String distance for automatic image classification String distance for automatic image classification Nguyen Hong Thinh*, Le Vu Ha*, Barat Cecile** and Ducottet Christophe** *University of Engineering and Technology, Vietnam National University of HaNoi,

More information

Proc. 14th Int. Conf. on Intelligent Autonomous Systems (IAS-14), 2016

Proc. 14th Int. Conf. on Intelligent Autonomous Systems (IAS-14), 2016 Proc. 14th Int. Conf. on Intelligent Autonomous Systems (IAS-14), 2016 Outdoor Robot Navigation Based on View-based Global Localization and Local Navigation Yohei Inoue, Jun Miura, and Shuji Oishi Department

More information

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009 Learning and Inferring Depth from Monocular Images Jiyan Pan April 1, 2009 Traditional ways of inferring depth Binocular disparity Structure from motion Defocus Given a single monocular image, how to infer

More information

Human Motion Detection and Tracking for Video Surveillance

Human Motion Detection and Tracking for Video Surveillance Human Motion Detection and Tracking for Video Surveillance Prithviraj Banerjee and Somnath Sengupta Department of Electronics and Electrical Communication Engineering Indian Institute of Technology, Kharagpur,

More information

Short Survey on Static Hand Gesture Recognition

Short Survey on Static Hand Gesture Recognition Short Survey on Static Hand Gesture Recognition Huu-Hung Huynh University of Science and Technology The University of Danang, Vietnam Duc-Hoang Vo University of Science and Technology The University of

More information

Feature descriptors. Alain Pagani Prof. Didier Stricker. Computer Vision: Object and People Tracking

Feature descriptors. Alain Pagani Prof. Didier Stricker. Computer Vision: Object and People Tracking Feature descriptors Alain Pagani Prof. Didier Stricker Computer Vision: Object and People Tracking 1 Overview Previous lectures: Feature extraction Today: Gradiant/edge Points (Kanade-Tomasi + Harris)

More information

/10/$ IEEE 4048

/10/$ IEEE 4048 21 IEEE International onference on Robotics and Automation Anchorage onvention District May 3-8, 21, Anchorage, Alaska, USA 978-1-4244-54-4/1/$26. 21 IEEE 448 Fig. 2: Example keyframes of the teabox object.

More information

Scan Context: Egocentric Spatial Descriptor for Place Recognition within 3D Point Cloud Map

Scan Context: Egocentric Spatial Descriptor for Place Recognition within 3D Point Cloud Map Scan Context: Egocentric Spatial Descriptor for Place Recognition within 3D Point Cloud Map Giseop Kim and Ayoung Kim In many robotics applications, place recognition is the important problem. For SLAM,

More information

Combining PGMs and Discriminative Models for Upper Body Pose Detection

Combining PGMs and Discriminative Models for Upper Body Pose Detection Combining PGMs and Discriminative Models for Upper Body Pose Detection Gedas Bertasius May 30, 2014 1 Introduction In this project, I utilized probabilistic graphical models together with discriminative

More information

Notes 9: Optical Flow

Notes 9: Optical Flow Course 049064: Variational Methods in Image Processing Notes 9: Optical Flow Guy Gilboa 1 Basic Model 1.1 Background Optical flow is a fundamental problem in computer vision. The general goal is to find

More information

Proceedings of Australasian Conference on Robotics and Automation, 7-9 Dec 2011, Monash University, Melbourne Australia.

Proceedings of Australasian Conference on Robotics and Automation, 7-9 Dec 2011, Monash University, Melbourne Australia. Towards Condition-Invariant Sequence-Based Route Recognition Michael Milford School of Engineering Systems Queensland University of Technology michael.milford@qut.edu.au Abstract In this paper we present

More information

3D Environment Reconstruction

3D Environment Reconstruction 3D Environment Reconstruction Using Modified Color ICP Algorithm by Fusion of a Camera and a 3D Laser Range Finder The 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems October 11-15,

More information

Lecture 12 Recognition

Lecture 12 Recognition Institute of Informatics Institute of Neuroinformatics Lecture 12 Recognition Davide Scaramuzza 1 Lab exercise today replaced by Deep Learning Tutorial Room ETH HG E 1.1 from 13:15 to 15:00 Optional lab

More information

Aggregating Descriptors with Local Gaussian Metrics

Aggregating Descriptors with Local Gaussian Metrics Aggregating Descriptors with Local Gaussian Metrics Hideki Nakayama Grad. School of Information Science and Technology The University of Tokyo Tokyo, JAPAN nakayama@ci.i.u-tokyo.ac.jp Abstract Recently,

More information

Seminar Heidelberg University

Seminar Heidelberg University Seminar Heidelberg University Mobile Human Detection Systems Pedestrian Detection by Stereo Vision on Mobile Robots Philip Mayer Matrikelnummer: 3300646 Motivation Fig.1: Pedestrians Within Bounding Box

More information

Estimating Human Pose in Images. Navraj Singh December 11, 2009

Estimating Human Pose in Images. Navraj Singh December 11, 2009 Estimating Human Pose in Images Navraj Singh December 11, 2009 Introduction This project attempts to improve the performance of an existing method of estimating the pose of humans in still images. Tasks

More information

CS4670: Computer Vision

CS4670: Computer Vision CS4670: Computer Vision Noah Snavely Lecture 6: Feature matching and alignment Szeliski: Chapter 6.1 Reading Last time: Corners and blobs Scale-space blob detector: Example Feature descriptors We know

More information

Computer Vision. Exercise Session 10 Image Categorization

Computer Vision. Exercise Session 10 Image Categorization Computer Vision Exercise Session 10 Image Categorization Object Categorization Task Description Given a small number of training images of a category, recognize a-priori unknown instances of that category

More information

Segmentation and Tracking of Partial Planar Templates

Segmentation and Tracking of Partial Planar Templates Segmentation and Tracking of Partial Planar Templates Abdelsalam Masoud William Hoff Colorado School of Mines Colorado School of Mines Golden, CO 800 Golden, CO 800 amasoud@mines.edu whoff@mines.edu Abstract

More information

Dense 3D Reconstruction from Autonomous Quadrocopters

Dense 3D Reconstruction from Autonomous Quadrocopters Dense 3D Reconstruction from Autonomous Quadrocopters Computer Science & Mathematics TU Munich Martin Oswald, Jakob Engel, Christian Kerl, Frank Steinbrücker, Jan Stühmer & Jürgen Sturm Autonomous Quadrocopters

More information

Using the Deformable Part Model with Autoencoded Feature Descriptors for Object Detection

Using the Deformable Part Model with Autoencoded Feature Descriptors for Object Detection Using the Deformable Part Model with Autoencoded Feature Descriptors for Object Detection Hyunghoon Cho and David Wu December 10, 2010 1 Introduction Given its performance in recent years' PASCAL Visual

More information

Local features and image matching. Prof. Xin Yang HUST

Local features and image matching. Prof. Xin Yang HUST Local features and image matching Prof. Xin Yang HUST Last time RANSAC for robust geometric transformation estimation Translation, Affine, Homography Image warping Given a 2D transformation T and a source

More information

A Framework for Multi-Robot Pose Graph SLAM

A Framework for Multi-Robot Pose Graph SLAM A Framework for Multi-Robot Pose Graph SLAM Isaac Deutsch 1, Ming Liu 2 and Roland Siegwart 3 Abstract We introduce a software framework for realtime multi-robot collaborative SLAM. Rather than building

More information

An ICA based Approach for Complex Color Scene Text Binarization

An ICA based Approach for Complex Color Scene Text Binarization An ICA based Approach for Complex Color Scene Text Binarization Siddharth Kherada IIIT-Hyderabad, India siddharth.kherada@research.iiit.ac.in Anoop M. Namboodiri IIIT-Hyderabad, India anoop@iiit.ac.in

More information

Visual Bearing-Only Simultaneous Localization and Mapping with Improved Feature Matching

Visual Bearing-Only Simultaneous Localization and Mapping with Improved Feature Matching Visual Bearing-Only Simultaneous Localization and Mapping with Improved Feature Matching Hauke Strasdat, Cyrill Stachniss, Maren Bennewitz, and Wolfram Burgard Computer Science Institute, University of

More information

Bilevel Sparse Coding

Bilevel Sparse Coding Adobe Research 345 Park Ave, San Jose, CA Mar 15, 2013 Outline 1 2 The learning model The learning algorithm 3 4 Sparse Modeling Many types of sensory data, e.g., images and audio, are in high-dimensional

More information

Efficient Interest Point Detectors & Features

Efficient Interest Point Detectors & Features Efficient Interest Point Detectors & Features Instructor - Simon Lucey 16-423 - Designing Computer Vision Apps Today Review. Efficient Interest Point Detectors. Efficient Descriptors. Review In classical

More information

Human Detection and Tracking for Video Surveillance: A Cognitive Science Approach

Human Detection and Tracking for Video Surveillance: A Cognitive Science Approach Human Detection and Tracking for Video Surveillance: A Cognitive Science Approach Vandit Gajjar gajjar.vandit.381@ldce.ac.in Ayesha Gurnani gurnani.ayesha.52@ldce.ac.in Yash Khandhediya khandhediya.yash.364@ldce.ac.in

More information

(Deep) Learning for Robot Perception and Navigation. Wolfram Burgard

(Deep) Learning for Robot Perception and Navigation. Wolfram Burgard (Deep) Learning for Robot Perception and Navigation Wolfram Burgard Deep Learning for Robot Perception (and Navigation) Lifeng Bo, Claas Bollen, Thomas Brox, Andreas Eitel, Dieter Fox, Gabriel L. Oliveira,

More information

Introduction to SLAM Part II. Paul Robertson

Introduction to SLAM Part II. Paul Robertson Introduction to SLAM Part II Paul Robertson Localization Review Tracking, Global Localization, Kidnapping Problem. Kalman Filter Quadratic Linear (unless EKF) SLAM Loop closing Scaling: Partition space

More information

Self Lane Assignment Using Smart Mobile Camera For Intelligent GPS Navigation and Traffic Interpretation

Self Lane Assignment Using Smart Mobile Camera For Intelligent GPS Navigation and Traffic Interpretation For Intelligent GPS Navigation and Traffic Interpretation Tianshi Gao Stanford University tianshig@stanford.edu 1. Introduction Imagine that you are driving on the highway at 70 mph and trying to figure

More information

Lecture 12 Recognition. Davide Scaramuzza

Lecture 12 Recognition. Davide Scaramuzza Lecture 12 Recognition Davide Scaramuzza Oral exam dates UZH January 19-20 ETH 30.01 to 9.02 2017 (schedule handled by ETH) Exam location Davide Scaramuzza s office: Andreasstrasse 15, 2.10, 8050 Zurich

More information

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality

More information

Dynamic Routing Between Capsules

Dynamic Routing Between Capsules Report Explainable Machine Learning Dynamic Routing Between Capsules Author: Michael Dorkenwald Supervisor: Dr. Ullrich Köthe 28. Juni 2018 Inhaltsverzeichnis 1 Introduction 2 2 Motivation 2 3 CapusleNet

More information

A High Dynamic Range Vision Approach to Outdoor Localization

A High Dynamic Range Vision Approach to Outdoor Localization A High Dynamic Range Vision Approach to Outdoor Localization Kiyoshi Irie, Tomoaki Yoshida, and Masahiro Tomono Abstract We propose a novel localization method for outdoor mobile robots using High Dynamic

More information

Topological Mapping. Discrete Bayes Filter

Topological Mapping. Discrete Bayes Filter Topological Mapping Discrete Bayes Filter Vision Based Localization Given a image(s) acquired by moving camera determine the robot s location and pose? Towards localization without odometry What can be

More information

Localization and Map Building

Localization and Map Building Localization and Map Building Noise and aliasing; odometric position estimation To localize or not to localize Belief representation Map representation Probabilistic map-based localization Other examples

More information

Robust Visual Localization in Changing Lighting Conditions

Robust Visual Localization in Changing Lighting Conditions 2017 IEEE International Conference on Robotics and Automation (ICRA) Singapore, May 29 - June 3, 2017 Robust Visual Localization in Changing Lighting Conditions Pyojin Kim 1, Brian Coltin 2, Oleg Alexandrov

More information

Efficient Vision Data Processing on a Mobile Device for Indoor Localization

Efficient Vision Data Processing on a Mobile Device for Indoor Localization Efficient Vision Data Processing on a Mobile Device for Indoor Localization Michał Nowicki, Jan Wietrzykowski, Piotr Skrzypczyński, Member, IEEE 1 arxiv:1611.02061v1 [cs.ro] 7 Nov 2016 Abstract The paper

More information

CS 231A Computer Vision (Fall 2011) Problem Set 4

CS 231A Computer Vision (Fall 2011) Problem Set 4 CS 231A Computer Vision (Fall 2011) Problem Set 4 Due: Nov. 30 th, 2011 (9:30am) 1 Part-based models for Object Recognition (50 points) One approach to object recognition is to use a deformable part-based

More information