
Institutionen för systemteknik
Department of Electrical Engineering

Examensarbete (Master's thesis)

Simultaneous Localisation and Mapping of Indoor Environments Using a Stereo Camera and a Laser Camera

Examensarbete utfört i Reglerteknik vid Tekniska högskolan vid Linköpings universitet av (Master's thesis carried out in Automatic Control at the Institute of Technology, Linköping University, by)
Jon Bjärkefur and Anders Karlsson

LiTH-ISY-EX--1/4427--SE
Linköping 21

Department of Electrical Engineering, Linköpings universitet, SE Linköping, Sweden
Linköpings tekniska högskola, Linköpings universitet, Linköping


Simultaneous Localisation and Mapping of Indoor Environments Using a Stereo Camera and a Laser Camera

Examensarbete utfört i Reglerteknik vid Tekniska högskolan i Linköping av (Master's thesis carried out in Automatic Control at the Institute of Technology in Linköping by)
Jon Bjärkefur and Anders Karlsson

LiTH-ISY-EX--1/4427--SE

Handledare (Supervisors): Fredrik Lindsten, ISY, Linköpings universitet; Zoran Sjanic, ISY, Linköpings universitet; Christina Grönwall, FOI; Joakim Rydell, FOI
Examinator (Examiner): Thomas Schön, ISY, Linköpings universitet

Linköping, 2 June, 21


Avdelning, Institution (Division, Department): Division of Automatic Control, Department of Electrical Engineering, Linköpings universitet, SE Linköping, Sweden
Språk (Language): Engelska/English
Rapporttyp (Report category): Examensarbete
ISRN: LiTH-ISY-EX--1/4427--SE
Titel (Title): Simultan lokalisering och kartering av inomhusmiljöer med en stereokamera och en laserkamera / Simultaneous Localisation and Mapping of Indoor Environments Using a Stereo Camera and a Laser Camera
Författare (Author): Jon Bjärkefur and Anders Karlsson

Sammanfattning (Abstract):
This thesis describes and investigates different approaches to indoor mapping and navigation. A system capable of mapping large indoor areas with a stereo camera and/or a laser camera mounted to e.g. a robot or a human is developed. The approaches investigated in this report are either based on SLAM (Simultaneous Localisation and Mapping) techniques, e.g. Extended Kalman Filter-SLAM (EKF-SLAM) and Smoothing and Mapping (SAM), or registration techniques, e.g. Iterated Closest Point (ICP) and Normal Distributions Transform (NDT). In SLAM, it is demonstrated that the laser camera can contribute to the stereo camera by providing accurate distance estimates. By combining these sensors in EKF-SLAM, it is possible to obtain more accurate maps and trajectories compared to if the stereo camera is used alone. It is also demonstrated that by dividing the environment into smaller ones, i.e. submaps, it is possible to build large maps in close to linear time. A new approach to SLAM based on EKF-SLAM and SAM, called Submap Joining Smoothing and Mapping (SJSAM), is implemented to demonstrate this. NDT is also implemented and the objective is to register two point clouds from the laser camera to each other so that the relative motion can be obtained. The NDT implementation is compared to ICP and the results show that NDT performs better at estimating the angular difference between the point clouds.

Nyckelord (Keywords): EKF, SLAM, SAM, SJSAM, Maps, Submaps, NDT, ICP, Laser scan registration


Abstract

This thesis describes and investigates different approaches to indoor mapping and navigation. A system capable of mapping large indoor areas with a stereo camera and/or a laser camera mounted to e.g. a robot or a human is developed. The approaches investigated in this report are either based on SLAM (Simultaneous Localisation and Mapping) techniques, e.g. Extended Kalman Filter-SLAM (EKF-SLAM) and Smoothing and Mapping (SAM), or registration techniques, e.g. Iterated Closest Point (ICP) and Normal Distributions Transform (NDT). In SLAM, it is demonstrated that the laser camera can contribute to the stereo camera by providing accurate distance estimates. By combining these sensors in EKF-SLAM, it is possible to obtain more accurate maps and trajectories compared to if the stereo camera is used alone. It is also demonstrated that by dividing the environment into smaller ones, i.e. submaps, it is possible to build large maps in close to linear time. A new approach to SLAM based on EKF-SLAM and SAM, called Submap Joining Smoothing and Mapping (SJSAM), is implemented to demonstrate this. NDT is also implemented and the objective is to register two point clouds from the laser camera to each other so that the relative motion can be obtained. The NDT implementation is compared to ICP and the results show that NDT performs better at estimating the angular difference between the point clouds.


Acknowledgments

A lot of people have been helpful during this work. When we started the work, we soon realised that mapping and navigation is a very wide and at the same time deep topic. However, thanks to good guidance in the beginning, especially from our examiner Dr Thomas Schön, the project got a good and smooth start. This master thesis has been performed at the Swedish Defence Research Agency, FOI. Our supervisors at FOI, Dr Joakim Rydell and Dr Christina Grönwall, have been very helpful. We have had many interesting discussions and lately they have proofread our report in a rigorous and helpful way. We would also like to thank our supervisors at ISY, Zoran Sjanic and Fredrik Lindsten. Dr Jonas Nygårds helped us with interesting discussions when we defined the main part of the project. Also a big thanks from Anders Karlsson to Ulrike Münzner, whose happy heart has power to make a stone a flower. Her support and inspiration made this work much easier.


Contents

1 Introduction
   Simultaneous Localisation and Mapping
   A Current Application of SLAM
   Motivation for Indoor SLAM
   Laser Scan Registration
   Goal and Problem Definition
   Limitations
   Method

2 Sensors
   Bumblebee Stereo Camera
   PMD Laser Camera
   The Rig
   Camera Calibration

3 Feature Extraction
   Feature Detection in Stereo Vision: Scale Invariant Feature Transform (SIFT); Speeded Up Robust Features (SURF); CenSurE and SUSurE
   Comparison of SIFT and SURF: Stereo Pair Features
   Laser Vision
   Conclusion

4 Data Association
   Measurements
   Nearest Neighbour Association
   Mahalanobis Association
   Removal of Unstable Landmarks
   Conclusion

5 SLAM Models
   2D SLAM With Simulated Measurements: Rig State; Coordinate Systems; Measurement Model; System Dynamics
   3D SLAM With Stereo Camera Measurements: Robot State; Coordinate Systems; Measurement Model; System Dynamics

6 SLAM Algorithms
   The SLAM Problem
   Extended Kalman Filter SLAM: Time Update; Measurement Update; 3D Visual EKF-SLAM; Pros and Cons With EKF-SLAM; Partitioned Updates
   Submaps: Conditionally Independent Divide and Conquer SLAM; Conditionally Independent Graph SLAM; Tectonic SAM
   Information Form SLAM: Sparse Extended Information Filter; Exactly Sparse Extended Information Filter
   Smoothing and Mapping: The Least Squares Problem; The Measurement Matrix A and the Vector b; Incremental Smoothing and Mapping
   Complexity of SLAM Algorithms
   Choice of algorithms for this thesis

7 Submap Joining Smoothing and Mapping
   Conditionally Independent Submaps
   Coordinate Systems and Rotation Matrices
   Measurement equation
   System Dynamics
   Algorithm
   Time Complexity
   Consistency
   Similar Work

8 Laser Scan Registration
   Iterative Closest Point
   Normal Distributions Transform: 2D-NDT; 3D-NDT

9 Experimental Results With SLAM
   Experiments with EKF: Stereo Camera Mounted to a Rig; Stereo Camera Held in Hand; Fireman Looking Around in a Room With The Stereo Camera; Sweeping Movement - Validity Check of Angle Estimates; Mapping of a Conference Room; Consistency Problems
   Experiments With SJSAM: Submaps; Merging of Submaps; Summary
   SJSAM on the workshop data
   Discussion
   Future work: EKF-SLAM; SJSAM

10 Experimental Results With Laser Scan Registration
   Data Collection
   Iterative Closest Point: Rotation; Reducing the Number of Points; Creating a Trajectory
   Normal Distributions Transform: Grid Size; Rotation; Reducing the number of points; Creating a Trajectory
   Discussion
   Future work

11 SLAM With the Stereo Camera and the Laser Camera
   Laser camera only
   Laser and Stereo Camera Pixel Mapping
   Laser and Stereo Camera With Fused Depth Information
   Experimental Results: Reference Results - Stereo Camera Only; Laser Camera Only; Stereo and Laser Camera With Fused Depths
   Discussion

Conclusion

A NDT

Bibliography

Chapter 1
Introduction

The available technologies of the twenty-first century offer great possibilities in the art of making maps. While the first maps were created with information from travelling merchants and explorers, modern maps are made from satellite images and aerial photographs. The advanced technologies have also made navigation much easier. It is no longer necessary to use sextants, stars and other tricks. Satellites orbit the earth and can provide anyone using a Global Positioning System (GPS) with an accurate position estimate in seconds. A major shortcoming of the GPS system is that it does not work indoors. Neither is it possible to take overview photos of building interiors. Indoor mapping and localisation require other techniques, and remain an open and active research area.

1.1 Simultaneous Localisation and Mapping

Simultaneous Localisation and Mapping (SLAM) deals with the problem of making maps of previously unknown environments while simultaneously performing localisation within these maps. SLAM is a very active research area and thus there exists a great number of different solutions to the problem. A SLAM algorithm will need to keep track of both the moving platform itself and the world around it. The world is usually modelled using landmarks. A landmark is a distinct feature in the world, at a specific location, that can be used for navigation. An example of a real world landmark is a lighthouse.

There are many different parts in SLAM that need to be solved. Figure 1.1 illustrates an example that will be used to describe these parts. The figure shows a robot that moves through a corridor between time t = 1 and t = 2. The robot is equipped with a camera, so that it can see the environment in front of it. For each captured image, the following steps need to be performed:

Feature extraction, which means extracting interesting points from the newly acquired images.

Figure 1.1: A black robot moves in an unknown corridor and observes corners (marked with red circles).

Data association, that is, matching the new features to corresponding landmarks if possible.

Pose estimation, which means that a new pose is calculated using the information from the new features and the data association.

At t = 1, an image is captured by the robot. This image is processed with a feature extraction system that finds corners, marked with red circles in Figure 1.1. Three corners are found. Since the world is completely unknown at t = 1, i.e. the robot's map is empty, these landmarks can be added to the map right away and no data association is needed. The pose estimation is also not needed now, since the robot has not moved yet. Another image is captured at t = 2. The feature extraction system finds four corners. The data association system must now find out if any of these observed corners have been seen before, i.e. in t = 1. It realises that two of the corners have been seen before and returns two matches. The pose estimation system can now use these matches to estimate the relative movement between t = 1 and t = 2. This will also give an estimate of the speed of the robot and improved estimates of the matched corners' positions.

1.2 A Current Application of SLAM

SLAM technology is already utilised around the world. The most common usage is in the field of autonomous robotic vacuum cleaners, where the robots are equipped with a wide-angle camera looking upwards. The SLAM systems then extract features from the ceiling and track these. It is then possible to map a whole

apartment/house so that the robots can avoid cleaning the same spot twice during the same session. Ceiling based visual SLAM (CV-SLAM) was first introduced in [2]. The somewhat fascinating results can be seen at bq5hzzgf3vq. A vacuum cleaner is mapping an entire apartment while at the same time cleaning it more or less optimally. One of the most recent CV-SLAM vacuum cleaners that are available to customers is the Samsung Navibot. It has a wide-angle camera with a field of view of 167 degrees, capturing images at 3 frames per second. The robot makes a map and uses it to localise itself. It is possible for the robot to go home to its charging station and then go back to work at the place where it was before.

1.3 Motivation for Indoor SLAM

Imagine that a large building is on fire. Several firemen run around looking for people to rescue. A task force leader is standing outside of the building and needs to have an overview of where people are found and what the structure of the building is. There has been no time to get the blue-prints of the building. Each fireman has a small stereo camera attached to his or her helmet. All cameras send live video footage to a main computer in the fire truck. The main computer processes the image flows and calculates how each fireman is moving and what the corridors and rooms that he or she sees look like. A combined map is built on the go in real time and the force leader can see it while it is created. When a fireman sees a person in need of rescue, he communicates this to the task force leader. The leader can now push a button and the main computer marks the location in the map. The combined map and the trajectories of all the firemen can be used to see where no one has looked. It can also be used to plan the ongoing mission. Furthermore, it is easy to see that the same technology can also be used by e.g. the police or the military.

1.4 Laser Scan Registration

Two cameras are available for the SLAM systems in this thesis. The first one is a stereo camera. A stereo camera consists of two cameras mounted a bit away from each other, but looking in the same direction. The benefit of this is the same as humans have with their two eyes - it is possible to know approximately how far away an object is. The second camera is a laser camera. This camera gives accurate estimates of the range to each part in the observed scene. Chapter 2 contains more information about these sensors.

The laser camera is a relatively new sensor. The resolution in pixels is very limited compared to a modern digital camera. The big difference with the laser camera is that each pixel measures the intensity and the amplitude of the reflected light from objects in the world. Its biggest advantage is that it can also measure the distance to objects in the world.

These measurements are performed by illuminating the environment with light of a known wavelength. By checking the phase of the incoming light, the distance to the object depicted at a certain pixel can be calculated. A measurement of this type gives the ability to perform a different form of navigation. Imagine that the camera looks at a completely flat wall 2.5 m away. In a perfect world this means that each pixel in the range image will measure a distance of 2.5 m. When moving the camera forward 0.5 m, each pixel will measure a distance of 2.0 m. Now say that the movement of 0.5 m is unknown; this will be the case when just moving the camera around. Then there exist methods for calculating this motion in an efficient way. It can be thought of as walking in a room and looking around. For each step that is taken, the change in the environment is used to deduce how long the step was and how the viewing angle changed. By taking a step forward and noting the length and direction each time, the complete path can be recreated easily. Doing this with a laser camera is called laser scan registration and for this there exist multiple methods. In this thesis ICP and NDT will be used for laser scan registration.

1.5 Goal and Problem Definition

The goal of this thesis is to investigate different aspects of indoor mapping using the stereo camera and the laser camera. The outcome shall be a report of these aspects along with a SLAM system capable of making large indoor maps. More specifically, the following questions shall be answered in this thesis:

1. Which SLAM techniques are suitable for making large high-quality maps of indoor environments?
2. How can the laser camera complement the stereo camera when making indoor maps?
3. What will the quality of the map be if the cameras are moved systematically compared to if they are moved more randomly?
4. What will the quality of the map be if the cameras are mounted to a robot compared to if they are handheld?
5. How do the two laser scan registration methods compare to each other?

1.6 Limitations

In this thesis there are two notable limitations. The first one is that all data from the stereo camera and the laser camera will be collected indoors. Collecting data indoors reduces problems with illumination, especially for the laser camera. The second limitation is that there are no moving objects present in the images. Moving objects can be a problem since they affect the images in an unpredictable

way. This means that there will be no people or moving machines etc. in the room when the data is collected. Another limitation is that EKF-SLAM will not be used for the largest maps. The reason for this is that the computational time becomes a limiting factor. Computational time is also a limiting factor for the laser registration algorithms and therefore these algorithms will only be applied to smaller data sets.

1.7 Method

Implementation of the SLAM algorithms is performed in steps of gradually increasing complexity. Gradually increasing the functionality of the algorithms makes it easier to spot errors that can be hard to detect later on. The implementation is done with the following steps:

1. 2D simulation with known data association.
2. 2D simulation with unknown data association.
3. Increase the dimensionality to three dimensions and repeat the two previous steps.
4. Extend the algorithm to use real world data.

This is an efficient way to detect problems with data association and other minor problems that can be almost impossible to detect with noisy data from the real world. Initial testing is done with the well known EKF-SLAM method. Since SLAM algorithms become computationally demanding for large maps, a submap based approach will be used for the largest maps. The submaps will then be merged into one single map in an efficient way.

Real world data consists of features in the stereo images extracted by SURF [5]. Image processing is not the goal of this thesis and therefore an existing implementation of SURF will be used. The data is, if possible, collected with ground truth available. Data association is then performed with either of the following two methods:

- Nearest neighbour association.
- Mahalanobis distance association.

Laser scan registration is tested by using a set of positions marked with tape on the floor. The laser camera is moved between the points and collects a set of images at each. The two algorithms for registration are then compared to each other. As a final comparison, the laser camera will be moved along a trajectory to see which algorithm gives the most accurate result.


Chapter 2
Sensors

In this chapter the equipment used for the experiments will be described. The equipment consists of two cameras: the first one is a Bumblebee stereo camera and the second one is a 3D camera called CamCube. This chapter will describe both cameras and provide some sample images from them. Furthermore the rig used for the experiments will be described and illustrated using photos.

2.1 Bumblebee Stereo Camera

The Bumblebee stereo camera is developed by Point Grey Research. The camera is depicted in Figure 2.1. It has two sensors with a maximum resolution of pixels. The resolution used in this thesis is pixels since it provides sufficient data and the image processing time is shorter. Bumblebee is able to acquire images at 48 FPS according to the specifications. This is however not the case when using the Matlab interface on the current hardware. At best the camera reaches about 15 FPS. The field of view (FOV) for Bumblebee is about 97° horizontally. This causes some rather serious fisheye effects on the images; however, the Software Development Kit (SDK) delivered with the camera can correct this distortion. Figure 2.2 provides an example of an image that is distorted together with an image that has been rectified using the SDK.

2.2 PMD Laser Camera

The camera used for obtaining 3D-scans is a PMD[vision] CamCube that can be seen in Figure 2.6. This camera has a resolution of pixels, which gives one measurement point per pixel in each scan. The SDK provided with the camera makes it possible to directly acquire multiple image types such as a range image, intensity image, amplitude image and also a calibrated 3D point cloud. Figure 2.3 shows the three images and a 3D-scan of the same scene.
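The thesis does not spell out how the CamCube converts its measurements into range values. For a continuous-wave time-of-flight sensor of this kind, the standard relation between the measured phase shift Δφ of the reflected modulated light and the range d is d = c · Δφ / (4π f_mod), where c is the speed of light and f_mod is the modulation frequency. This relation is given here only as background and is an assumption about the sensor internals, not something stated in the thesis.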

Figure 2.1: Point Grey Research Bumblebee stereo camera.

(a) Distorted image. (b) Rectified image.
Figure 2.2: Distorted and rectified image from the Bumblebee stereo camera.

(a) Range image. (b) Amplitude image. (c) Intensity image. (d) Point cloud.
Figure 2.3: Range image, amplitude image, intensity image and point cloud from the CamCube.

Figure 2.4: Bumblebee mounted on a table.

According to specifications the standard deviation is less than 3 mm at 2 m distance and the frame rate is specified to 25 FPS. In the Matlab implementation used to acquire images, the frame rate seldom exceeded 12 FPS with the current hardware. The field of view is specified to 40° in the horizontal and vertical directions. The 3D-scans are used for creating a trajectory. It is also possible to use any of the other images to detect points of interest and use the range image to acquire the distance. This approach will be tested in Chapter 11. CamCube has some advantages over an ordinary camera. One of them is being able to acquire images in total darkness.

2.3 The Rig

When using only the stereo camera, a table with wheels was used. The camera was mounted on an arm attached to the table. For collecting data an ordinary laptop was used. This rig is depicted in Figure 2.4. To be able to better follow ground truth the camera was mounted on a tripod as well; the tripod was then mounted on a frame with four wheels. Beneath the camera a rod that almost reached the floor was mounted. The only purpose of the rod was to make it easier to follow a predefined trajectory taped on the floor. Using this rig the camera could be moved along a trajectory with quite good precision. Figure 2.5 shows the stereo camera mounted on the tripod. When both cameras were used they were mounted on a tripod. In this setup the CamCube was mounted beneath the stereo camera. The stereo camera was approximately centered on top of the laser camera. The tripod with both cameras mounted can be seen in Figure 2.6. Using this setup the cameras can see almost the same scene, apart from the fact that the stereo camera has a 97° FOV compared to 40° for the laser camera.

Figure 2.5: Bumblebee mounted on a tripod.

Figure 2.6: Bumblebee and CamCube mounted on a tripod.

Figure 2.7: Image from Bumblebee's left camera in grey. The red overlay in the middle is what the CamCube laser camera sees. The image from the CamCube has been scaled to fit in the image from Bumblebee.

Figure 2.7 shows the view from the laser camera as an overlay on the left image from the stereo camera. As can be seen in Figure 2.7 there is quite a large difference in FOV.

2.4 Camera Calibration

To be able to use the cameras together, they have to be calibrated. For this the Camera Calibration Toolbox [33] for Matlab is used. This toolbox can be used to calibrate a single camera and a stereo pair. In this case the two cameras are a Bumblebee stereo camera and a CamCube laser camera. This results in a total of three cameras, so only the left camera of the stereo pair is utilised.

Calibration is done by first calibrating the cameras individually. The calibration is performed using a chessboard. The chessboard is tilted in various directions and for each pose an image from each of the cameras is saved. The four outer corners on the chessboard are then marked manually. The remaining corners are then marked automatically after specifying the number of squares on the board. Figure 2.8 shows how the corner extraction is performed. The green circles represent the four corners that are manually selected. Red crosses are the corners extracted by the calibration toolbox. From the corners extracted from the images, the toolbox then calculates the parameters for the camera. Table 2.1 shows the parameters for the Bumblebee camera together with the parameters given by the specifications. Since the images

Figure 2.8: Corner extraction in an image from Bumblebee.

from Bumblebee are rectified by the SDK, distortion is set to zero. The calibration shows very promising results for Bumblebee. All values are close to the specified values, and if the uncertainties are included the specified values lie in the interval. For CamCube there are no specifications available. The sensor has pixels so the principal point can be approximated with 12 in both directions. Table 2.2 shows the parameters from the calibration of CamCube. As expected the principal point is close to 12. The other parameters are hard to verify, but it seems reasonable that f has the same value in both directions. Also there is some distortion visible in the images acquired from CamCube. However, the distortion parameters are a bit uncertain and should not be relied on.

Calibrating the left stereo camera and the laser camera as a stereo pair gives the extrinsic parameters in Table 2.3. Measurements on the rig confirm that the parameters are accurate. It is of course hard to verify since it is unknown exactly where in the camera the sensor is located. Since the rotation is close to zero it is for simplicity approximated by zero. Figure 2.9 shows the setup together with the different estimated poses of the chessboard. In total the chessboard has 14 different poses. Each pose is represented by a numbered plane in the figure. It is also obvious from the figure that the left camera has a wider field of view, which is correct according to the specifications.

Table 2.1: Parameters for Bumblebee (columns: Parameter, Specification, Estimation, Uncertainty; rows: Focal Length, Principal Point, Skew, Distortion).

Table 2.2: Parameters for CamCube (columns: Parameter, Estimation, Uncertainty; rows: Focal Length, Principal Point, Skew, Distortion).

Table 2.3: Position of right camera with respect to left camera (columns: x [mm], y [mm], z [mm], ψ [rad], θ [rad], φ [rad]).

Figure 2.9: Extrinsic parameters of the cameras.
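As a complement to the calibration discussion, the sketch below shows how intrinsic parameters of the kind estimated above (focal length, principal point, skew) are typically assembled into a camera matrix K and used to project a point. It is a minimal illustration in Python rather than the Matlab toolbox used in the thesis, and the numerical values are placeholders, not the calibrated values from Tables 2.1-2.2.

    import numpy as np

    # Placeholder intrinsics: fx, fy = focal length [pixels], (cx, cy) = principal point
    fx, fy, cx, cy, skew = 800.0, 800.0, 320.0, 240.0, 0.0

    K = np.array([[fx, skew, cx],
                  [0.0,  fy, cy],
                  [0.0, 0.0, 1.0]])

    def project(point_cam):
        """Project a 3D point given in the camera frame to pixel coordinates (u, v)."""
        uvw = K @ point_cam
        return uvw[:2] / uvw[2]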

Chapter 3
Feature Extraction

Feature extraction refers to methods for detecting interesting points in an image and describing them. There is no universal definition of what constitutes a feature, since this depends on the application. A loose definition is an interesting part of the image, which means a part that can be identified multiple times. First in this chapter, stereo vision, which uses traditional image processing, is treated, and then feature extraction in laser data is described. The concept of features will be described for the respective cameras and then methods for extracting and matching features will be presented. As a brief summary the different methods will be tested and compared for a typical image or laser scan.

3.1 Feature Detection in Stereo Vision

In stereo vision a feature is usually a point of interest surrounded by a region that is selected by a region detector. A point of interest can be a corner or any other small distinctive object. It is important that the detector has a high ability to detect the same features in similar images, i.e. high repeatability; such a detector is crucial for the SLAM problem when using stereo vision. Some feature extractors will be briefly described in the next section and later on compared. In the stereo vision case edges and corners are the most valuable features since they can easily be tracked between consecutive images. A very common detector is the Harris corner detector by Harris et al. [17]. Other common modern detectors are SIFT and SURF, which are also invariant to scaling and rotation. More recent detectors are CenSurE [1] and SUSurE [15], which aim at real-time applications.

Scale Invariant Feature Transform (SIFT)

SIFT might be one of the most used feature detectors today. As the name suggests, this method (SIFT) [23] detects scale invariant interest points from images that can be used for reliable matching between images. Interest points extracted with

SIFT are also invariant to rotation and quite robust against changes in illumination and viewpoint angle. Interest points detected by SIFT are mostly corners. Another common type of interest point is single coloured surfaces. For a single coloured surface SIFT selects the centre as an interest point. This can be a problem for specular surfaces. Reflections on a specular surface can create an area that is lighter or in some other way distinctive. The problem with this is that this area moves when the camera moves, which results in an interest point that is floating. As an example, consider a whiteboard. This is a specular surface and a reflection from e.g. a lamp in the room will move over the whiteboard as the position of the camera changes. This requires the measurement to be modelled with rather much noise.

With SIFT the initial image is convolved with Gaussians to create images separated by a scale factor k. These images are gathered in octaves. Each octave represents a doubling of the standard deviation σ. The number of pictures in each octave is s+3, where s is an arbitrary number of intervals in each octave. From this the scale factor is computed as k = 2^(1/s). Let the difference-of-Gaussian function convolved with the image be

D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) ∗ I(x, y) = L(x, y, kσ) − L(x, y, σ),

where D(x, y, σ) is the difference, G(x, y, σ) is a Gaussian, I is the image, and L(x, y, σ) is the image I convolved with the Gaussian G. When finding a point of interest, the interest point in D(x, y, σ) is compared to its eight neighbours in the current image and to the nine neighbours in the scales above and below. It is selected to be a feature only if it has the largest or smallest value among these 27 points. Points representing local extrema are called interest points. Each interest point is then assigned an orientation and a descriptor; refer to [23] for details. The descriptor is a 128-dimensional vector for each key point.

Since SIFT was introduced in 2004 there have been various implementations available in addition to Lowe's original implementation. In this thesis the implementation available in the open source library VLFeat [9] is used. VLFeat is implemented in C for efficiency and comes with a Matlab interface. Figure 3.1 shows a comparison between Lowe's and VLFeat's implementations.
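A minimal sketch of the difference-of-Gaussians construction described above is given below. It is written in Python purely for illustration (the thesis uses the VLFeat implementation) and only covers a single octave.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def difference_of_gaussians(image, sigma, s=3):
        """One octave of SIFT-style DoG images: D = L(k*sigma) - L(sigma)."""
        k = 2.0 ** (1.0 / s)  # scale factor between adjacent levels
        levels = [gaussian_filter(image.astype(float), sigma * k**i) for i in range(s + 3)]
        return [levels[i + 1] - levels[i] for i in range(s + 2)]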

Figure 3.1: VLFeat keypoints (blue) superimposed on D. Lowe's keypoints (red). As can be seen the results are almost identical. The image is retrieved from

Speeded Up Robust Features (SURF)

The goal with SURF [5] was to develop a novel scale- and rotation-invariant interest point detector and descriptor which outperforms previously proposed schemes with respect to repeatability, distinctiveness and robustness, while still maintaining a low or even lower computational cost than SIFT. SURF can be used with different types of descriptors. For feature extraction in the stereo images the SURF-128 descriptor is used; this is an extended descriptor that is much more distinctive and not much slower to compute compared to the standard SURF-64 descriptor. Matching with SURF-128 is slower than with SURF-64 due to the double dimensionality. Another but not so useful descriptor is U-SURF, which is faster but not invariant to rotation.

SURF uses precomputed Hessian box filters to approximate second order Gaussian derivatives. The filters come in different sizes, e.g. 9×9, 15×15, etc. Because of the use of integral images and box filters there is no need to iteratively apply the same filter to the output of a previously filtered layer. Instead a filter of any size can be applied directly and even in parallel. The difference between octaves is here the increase in size between filters, e.g. for the first octave the filter size increases by six, for the next octave it increases by twelve, etc.

Descriptors used by SURF are very similar to those used by SIFT, but with a little less complexity. The first step to reduce complexity is to create a reproducible orientation based on information from a circular region around the interest point, by calculating the Haar-wavelet response in the x and y directions. Then a square region is aligned to the selected orientation before the SURF descriptor is extracted from it. This square is divided into 4×4 subregions and for every subregion the orientation is calculated using Haar-wavelets once again. This gives a four-dimensional descriptor for each subregion in the format

v = (Σ d_x, Σ d_y, Σ|d_x|, Σ|d_y|),

where d_x and d_y are the wavelet responses over a subregion. This results in a descriptor of length 4 × 4 × 4 = 64 for each interest point. For the extended SURF-128 descriptor the sums and the sums of absolute values are computed separately depending on sign, which gives a descriptor of size 128.

There are various implementations available for SURF. One of them is a part of the OpenCV Library which is open source. Two other implementations are [6, 32] which are closed source, but available for almost any platform. One of these implementations is Graphics Processing Unit (GPU) based. The GPU SURF implementation is described in [11]. This implementation runs entirely on the GPU, which leaves the CPU free for other computations. This makes it almost perfect for real-time SLAM.
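To make the descriptor layout concrete, the sketch below assembles a SURF-64-style vector from Haar-wavelet responses. The responses dx and dy are assumed to be precomputed over an oriented patch, so this is only an illustration of the 4 x 4 x 4 structure, not the actual SURF implementation used in the thesis.

    import numpy as np

    def surf64_descriptor(dx, dy):
        """Build a 64-dimensional SURF-style descriptor from Haar responses dx, dy."""
        desc = []
        rows = np.array_split(np.arange(dx.shape[0]), 4)   # 4 x 4 subregions
        cols = np.array_split(np.arange(dx.shape[1]), 4)
        for bi in rows:
            for bj in cols:
                sub_dx = dx[np.ix_(bi, bj)]
                sub_dy = dy[np.ix_(bi, bj)]
                desc += [sub_dx.sum(), sub_dy.sum(),
                         np.abs(sub_dx).sum(), np.abs(sub_dy).sum()]
        v = np.asarray(desc)                               # 4 * 4 * 4 = 64 values
        return v / np.linalg.norm(v)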

CenSurE and SUSurE

Center Surround Extrema (CenSurE), as proposed by Agrawal et al. in [1], is yet another feature extractor. CenSurE is suited for real-time applications since it outperforms SIFT in repeatability and performance. The CenSurE detector consists of three steps. The first step is to compute the response to a bilevel Laplacian of Gaussian and filter weak responses. The second step is detection of local extrema, and finally local extrema with strong corner response are detected using the Harris measure.

Speeded Up Surround Extrema (SUSurE) by Ebrahimi et al. [15] is based upon CenSurE, but takes the simplifications even further to reduce computational cost without noticeable loss in performance. This makes SUSurE suitable for applications where computational power and memory are very limited, e.g. an unmanned aerial vehicle. Due to the lack of an implementation of CenSurE and only one implementation of SUSurE available, these extractors will not be compared with SIFT and SURF. The implementation of SUSurE is available at uk/publications/pub_master.jsp?id=2118. This is only a sample implementation, but an optimised version should be released soon according to the author. In its current state the application is only able to read a portable network graphics (PNG) file and write a new file with the extracted features as an overlay. It is therefore not possible to use this implementation for SLAM.

3.2 Comparison of SIFT and SURF

In this thesis the SURF-128 descriptor is used. The main reason for this is that the SIFT descriptor is 128-dimensional. Using the same length for the SURF descriptors makes it possible to use VLFeat for matching between images. There is also a matching function that comes bundled with the implementation that works for standard and extended descriptors. This method is very simple and using VLFeat is preferable. Another reason for using the SURF-128 descriptor is that it has the best recognition rate, with an average recognition rate of 85.7 %, as shown in [5].

Both SIFT and SURF were run on a 3.0 GHz Pentium D with 2 GB of RAM using Windows XP. The number of features returned in each image is about 4 6. Since stable features are more important than many features when navigating, the thresholds were quite high. When using the default threshold for the SURF implementation it returns 3 to 4 features per image. Image resolution in this case is pixels. Figure 3.2 shows the same image processed with SIFT and SURF. Both SIFT and SURF perform well when matching images from the stereo camera. Feature extraction in one image took about 3 ms with SIFT and 18 ms with SURF. Another, GPU-based, implementation has also been tested and it yields about 85 frames per second (FPS) for an image with pixels. This, however, shows that by running image processing on a GPU it is possible to perform feature extraction in real-time. GPU-SURF depends on Nvidia's Compute Unified Device

(a) SIFT. (b) GPU-SURF.
Figure 3.2: Same image processed with SIFT from VLFeat and GPU-SURF. The image processed with GPU-SURF has an overlay and therefore uses a grey colour map.

Architecture (CUDA) [22]. The GPU implementation was run on a 3.0 GHz Core 2 Quad with 8 GB of RAM and a Nvidia 96 GT graphics card using 64-bit Gentoo Linux with Nvidia CUDA Toolkit 3.0. Since the GPU implementation uses a different data structure for descriptors, this implementation was only tested for performance comparison. A brief summary can be seen in Table 3.1. In one case SIFT extracted 151 features. This was because of the difficulty of determining a threshold that results in exactly 15 features. From the computational times listed in the table it is obvious that SURF performs better than SIFT for a larger image. The time to extract features with SIFT seems to be proportional to the image size. Most notable is the GPU-SURF implementation, which has almost the same computational time for an image with twice the x and y resolution. This is because of the GPU's ability to perform parallel computations. The bottleneck when performing computations on a GPU is the time it takes for the data to reach the graphics card. This means that when changing the size of an image the texture data is handled differently to maintain performance. Also the number of threads can vary, which also affects the performance. The card used for this test has 64 cores, but newer cards from Nvidia have as many as 48 cores, which should be enough for quite large images.

Stereo Pair Features

For the stereo camera there are two images where features can be extracted. Since the images differ, the features are not necessarily the same for the two images, which means features have to be matched between the stereo pair. The previously mentioned package for feature extraction, VLFeat, comes with a function that identifies features that are present in both images. There is also a function for matching bundled with the SURF implementation. This matching function uses

Table 3.1: Computational time for different feature extractors (columns: Extractor, Time (ms), Features, Image Resolution; rows: SIFT, SURF, SURF, GPU-SURF for each of three image resolutions).

the descriptors to perform the matching. Let δ_l and δ_r be the descriptors for all features in the left and right image. By comparing each descriptor in the left image to all descriptors in the right image the best corresponding feature can be found. Finding the match for feature m_i can be described as

m_i = arg max_j { δ_i · δ_j / (‖δ_i‖ ‖δ_j‖) },   (3.1)

which means finding the descriptor δ_j that maximises the normalised scalar product between δ_i and δ_j. A feature is only considered to have a match if

max_j { δ_i · δ_j / (‖δ_i‖ ‖δ_j‖) } ≥ α,

where α ∈ [0, 1]. Choosing a good value for α can be quite tricky since the descriptors are only rotation and scale invariant. This implies that there are other environmental conditions that affect the descriptors, e.g. illumination and viewing angle. Setting α close to 1 makes it hard to recognise features in different images, and using a low value for α increases the possibility of false matches. This will be discussed further in Chapter 4. Figure 3.3 shows the result after matching features in the left image to features in the right image. Each feature is represented by a yellow circle. The yellow circles represent the scaling and rotation and are centred at the feature positions.
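A minimal sketch of the matching rule in Equation (3.1), written in Python rather than the VLFeat/Matlab code used in the thesis; the threshold value 0.8 is only an example.

    import numpy as np

    def match_descriptors(D_left, D_right, alpha=0.8):
        """Match each row of D_left to the row of D_right that maximises the
        normalised scalar product; accept the match only if it exceeds alpha."""
        L = D_left / np.linalg.norm(D_left, axis=1, keepdims=True)
        R = D_right / np.linalg.norm(D_right, axis=1, keepdims=True)
        scores = L @ R.T                       # all pairwise normalised products
        matches = []
        for i, row in enumerate(scores):
            j = int(np.argmax(row))
            if row[j] >= alpha:
                matches.append((i, j))
        return matches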

Figure 3.3: Features matched in a stereo pair.

3.3 Laser Vision

There are various ways of extracting features from the CamCube. Since the camera provides several images, e.g. range and amplitude images, these images can be seen as ordinary images with a different scaling. It is therefore possible to extract features with SIFT or SURF from these images one at a time, or to use a combination of the different image types. Another way to extract features from the PMD camera is to use the point clouds to extract features in 3D, e.g. average range, volume, sphere approximation etc.

Although there is not much difference between the amplitude image and the intensity image, the features can differ greatly. Initial tests showed that SURF found zero features in the intensity image but quite many in the amplitude image. As a result of this, SIFT was applied to the images. This gave zero features in the amplitude image and hundreds in the intensity image. By normalising the intensity image it was possible to extract features from the intensity image with SURF as well. Let I_i denote the intensity image; normalisation is then performed using

I_I = (I_i − min(I_i)) / (max(I_i) − min(I_i)).

Using this normalisation gives an intensity image with I_I ∈ [0, 1]. Figure 3.4 shows SURF features in the three images acquired from CamCube together with features in the normalised intensity image. It can be seen that there are no features in the intensity image. In the normalised intensity image there are some features, but these seem to be placed at random. The range image gives the best features. The features are mainly placed on single coloured surfaces and edges. Features in the amplitude image may also appear random at first. By differentiating the image in the x- and y-directions it can be seen that features are placed near sharp edges. The differentiated images together with the extracted features are shown in Figure 3.5. In the intensity image the features still appear to be quite

(a) Intensity image. (b) Normalised intensity image. (c) Amplitude image. (d) Range image.
Figure 3.4: Features in the different images from CamCube. Features are marked with a red star.

random and the reason for this is unknown.

(a) Intensity image differentiated in x-direction. (b) Intensity image differentiated in y-direction. (c) Amplitude image differentiated in x-direction. (d) Amplitude image differentiated in y-direction.
Figure 3.5: Features overlayed on the differentiated versions of the amplitude and intensity images.

3.4 Conclusion

Both SIFT and SURF give almost identical interest points when applied to stereo images, although SIFT is somewhat slower, which can clearly be seen in Table 3.1. This leads to the conclusion that SURF is more suitable for the problem at hand and will therefore be used for feature extraction. Most preferable would be to use GPU-SURF, but since there is no need for real-time SLAM at this point and there is no hardware available for mobile usage, the choice falls on SURF with extended descriptors, i.e. SURF-128. The extended descriptors offer a higher recognition rate at a slightly higher computational cost. For features in the images from CamCube, only the distance image can be considered to give stable features. The amplitude image will also be used, but

only in combination with the distance image. The intensity image, however, will be discarded. The intensity image is discarded due to the fact that the features seem to be unstable. It is also the case that it is hard to determine a threshold for feature extraction in the intensity image.

Chapter 4
Data Association

This chapter describes the data association part of the SLAM problem. Data association is an important part of all SLAM algorithms, from the simple EKF-SLAM to the more advanced Tectonic SAM [26]. It is important to have a robust method for data association. This basically means being able to identify the same features in a sequence of images, see Chapter 3. By extracting the same feature in a sequence of images, the rotation and translation of the camera can be calculated if the parameters of the camera are known. Section 4.1 will define a measurement and will be followed by two sections describing methods for associating measurements to landmarks.

4.1 Measurements

The definition of a measurement, z, is something that depends on the type of sensors available. It can for example be

z = (a_x  a_y  a_z)^T,   (4.1)

where a is the acceleration measured by an Inertial Measurement Unit (IMU). In this case the sensors are a stereo camera and a laser camera. This gives the possibility to take measurements only from the stereo camera, only from the laser camera, or from a combination of both.

Given that only the stereo camera is used, two measurements are acquired in each sample, one for each camera in the stereo pair. A sketch of the geometry of a stereo camera is shown in Figure 4.2. Let the measurements from a stereo pair be z_1 and z_2. The measurements will then be

z_1 = (u_1, v_1)^T,  z_2 = (u_2, v_2)^T,   (4.2)

where u and v denote the pixel coordinates of the sensor for each extracted feature. Figure 4.1 shows how the coordinates (u, v) are represented in the image sensor.

Figure 4.1: Representation of pixels in the sensor.

The measurements from each sensor are then combined into one measurement and used in EKF-SLAM. The first sensor, the left camera, is chosen as the primary sensor. Measurements from the second sensor, the right camera, are only used to calculate the depth to the features present in both images. A measurement is then defined as

z = (u, v, d)^T,   (4.3)

where u and v represent the pixel coordinates in the image from the primary sensor for features present in both images. For each feature the depth d can be estimated given that the parameters of the stereo camera are known. In this case the baseline b and the focal length f are known. Hence the depth can be calculated as

d = f b / (u_1 − u_2).   (4.4)

For the data association the measurement is then transformed into a measurement in world coordinates by using

m = d · K^(−1) (u, v, 1)^T,   (4.5)

which gives the feature position in Cartesian world coordinates. This is described more extensively in a later section. The data association will be performed in world coordinates.
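The measurement construction in Equations (4.3)-(4.5) can be sketched as follows. This is an illustration in Python (the thesis implementation is in Matlab), with K denoting the intrinsic camera matrix from the calibration in Chapter 2.

    import numpy as np

    def stereo_measurement(u1, v1, u2, f, b):
        """Form z = (u, v, d) from a feature at (u1, v1) in the left image and
        horizontal coordinate u2 in the right image, Eqs. (4.3)-(4.4)."""
        d = f * b / (u1 - u2)          # depth from disparity
        return np.array([u1, v1, d])

    def to_world(z, K):
        """Back-project a measurement into Cartesian coordinates, Eq. (4.5)."""
        u, v, d = z
        return d * (np.linalg.inv(K) @ np.array([u, v, 1.0]))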

Figure 4.2: A sketch of the stereo camera.

Figure 4.3: Illustration of one measurement and two landmarks.

4.2 Nearest Neighbour Association

Using the shortest Euclidean distance for data association is the simplest way to associate features. Association works by finding the distance between a measurement and a landmark. A landmark is defined as a distinct point in the world that corresponds to a feature in an image. For nearest neighbour association the influence of measurement noise is not considered, so every measurement is regarded as being perfect. This is illustrated by Figure 4.3. When performing association the distance Δ_ik from the measurement m_i to all landmarks l_k is calculated according to Δ_ik = ‖m_i − l_k‖. The distance Δ_ik is then compared to two gates. The first is g_A, which determines if the measurement is close enough to a landmark to be associated with it. The other gate, g_R, decides whether the distance to all landmarks is large enough for the measurement to be a new landmark. Figure 4.4 shows how the gates work.

Figure 4.4: Gates used during association.

The red dot is a landmark in the database. A measurement closer than g_A is associated to the landmark. If the measurement is outside g_R a new landmark will be created. All measurements in the grey area are discarded.

Figure 4.5 shows the map from a run with EKF-SLAM using the shortest distance. Landmarks are visualised by red stars. The surrounding red ellipses illustrate the 99.5 % confidence interval. The trajectory is given by the black dots and the blue line represents the final orientation. The result is good although there is no ground truth for verification. The result is quite sensitive to the parameters g_A and g_R. A large value for g_A increases the possibility of false associations, so it is preferable to use a small value, although this makes it harder to detect loop closure after drift. Drift means the accumulation of small errors in pose during a long sequence of measurements. A good estimate for g_A is the movement between two consecutive frames. Choosing a good value for g_R can also be quite tricky since a large difference g_R − g_A leaves a big dead zone where features are discarded. Choosing a small value results in the introduction of landmarks close to already existing landmarks, which may lead to a larger risk of incorrect association.

Association according to this implementation is described in Algorithm 1. The algorithm starts by finding the distance between a measurement m_i and all landmarks l_k. The distances are then sorted so that the shortest distance is first. Then, for all sorted distances, find the shortest distance that corresponds to a landmark that is not associated to a measurement. If no association is made and the smallest distance is larger than g_R, a new landmark has been found and the measurement is added to X_new.

Figure 4.5: EKF-SLAM with association using shortest distance.

Algorithm 1: Associate measurements M to landmarks L using nearest neighbour
Data: Current measurements M and all landmarks L
Result: Best matching landmarks X and new landmarks X_new
begin
   forall measurements m_i in M do
      Δ_ik = ‖m_i − l_k‖ for all l_k in L
      d_ik = sort(Δ_ik)
      forall sorted distances d_ik do
         Find the first k not in X
         if k not in X and d_ik ≤ g_A then
            X = X ∪ {k}
            break
         end
      end
      if d_i1 ≥ g_R then
         X_new = X_new ∪ {m_i}
      end
   end
end

4.3 Mahalanobis Association

One of the biggest problems with nearest neighbour association is that the nearest feature does not have to be the most likely. Figure 4.6 shows an example of this. The figure is the same as Figure 4.3, but now includes landmark uncertainty represented by the ellipses. As can be seen in the figure, the landmark closest to the measurement is the one in (1, 1), but it can also be seen that the measurement is more likely to correspond to the landmark in (2, 2). This can be solved by using the normalised innovation squared (NIS), also known as the Mahalanobis distance. Association using the Mahalanobis distance is performed using a measurement z_i and the predicted observation ẑ_k = h(x, l_k). The innovation can then be calculated as i_ik = z_i − ẑ_k. Normalising with the covariance of the landmark, P, and squaring the innovation gives the NIS according to

M_ik = i_ik^T P^(−1) i_ik.   (4.6)

If the innovation has a Gaussian distribution, which normally is the case for the EKF, the NIS forms a χ² distribution. The degrees of freedom for the distribution are equivalent to the dimension of the innovation vector. Association is performed by finding M_ik ≤ g_A, where g_A is an arbitrary gate. The integral of the χ² distribution specifies the probability that association will be performed if z_i is an observation of l_k. Applying the gate means computing the integral of the χ² distribution from 0 to g_A. This means that g_A defines the probability that association will

Figure 4.6: Illustration of a measurement and two landmarks with landmark uncertainty ellipses.

be performed. Figure 4.7 shows the χ² distribution for three different degrees of freedom, k. Common values for g_A are 4, 9, and 16. Setting the gate g_A to e.g. g_A = 4 for three degrees of freedom means that the gate will accept 74 % of the associations, as 74 % of the χ² probability mass lies between 0 and 4. Table 4.1 shows the probability that association will be accepted for some different values of g_A with three degrees of freedom. Figure 4.8 shows the probability as a continuous function of g_A.

Table 4.1: Probability that association will be accepted (columns: g_A, Probability [%]).

In Figure 4.6 the ellipses represent a 97.5 % confidence interval. Trying to perform data association with g_A = 9 would therefore result in the measurement not being associated to any of the landmarks. Increasing g_A to 16 makes the landmark in (2, 2) the most likely association. When there are multiple measurements z = {z_1 ... z_n} within the gate of a landmark l, a likelihood of association Λ_i can be defined for each z_i in z as

Λ_i = 1 / ((2π)^(n/2) √(det(P))) · exp(−(1/2) i_i^T P^(−1) i_i),   (4.7)

48 34 Data Association.6 k = 1 k = 2 k = g A Figure 4.7: Probability density function (PDF) for the χ 2 distribution. The PDF is plotted for one, two, and three degrees of freedom. The integral of a χ 2 distribution from to g A is equivalent to the probability of association. 1.9 k = 1 k = 2 k = g A Figure 4.8: Cumulative distribution function (CDF) for the χ 2 distribution. The CDF is plotted for one, two, and three degrees of freedom. The value at g A is equal to the integral from to g A of the corresponding PDF in Figure 4.7.

where n is the dimension of a measurement. By taking the negative log of Equation (4.7), the normalised distance N_i is obtained according to

N_i = i_i^T P^(−1) i_i + log(det(P)).   (4.8)

The goal is then to select the observation that minimises N_i, since this maximises the likelihood of association Λ_i. The normalised distance determines which measurement corresponds to which landmark. The NIS is then used to determine whether the association should be accepted or rejected by comparing it to the gate g_A.

Since the database with landmarks can become quite large, the association can be sped up by first reducing the number of possible landmarks. Intuitively this can be thought of as walking along a trajectory. If the current position is approximately known there is no need to check if the current measurements correspond to landmarks that were seen 1 m away. Using the Euclidian distance to reduce the number of landmarks can speed up the association since it is a fast method compared to calculating the Mahalanobis distance. In the current implementation the max radius r_max, within which the search for suitable landmarks is performed, can be chosen arbitrarily. Using a large value just increases computational time since the motion between two consecutive frames is quite small. In Algorithm 2, which describes Mahalanobis association, the max association distance is called r_max.

An additional extension to the matching algorithm is the use of descriptors. The descriptors are used for association in the same way as they are used for matching between stereo pairs in Chapter 3. In this case the descriptor of the measurement is compared to the descriptors of the landmarks. If a measurement is close enough to be associated to a landmark, the normalised scalar product between the descriptor of the measurement and the landmark is calculated. If it is larger than a predefined threshold α, the association is considered to be correct. This is a verification that the measurement and the landmark really are the same. Algorithm 2 describes data association with Mahalanobis distance. The algorithm includes the use of descriptors and the ability to replace an existing association if a better match is found.

Using this method for association in the EKF-SLAM algorithm yields the result displayed in Figure 4.9, which is very similar to the previous test in Figure 4.5. Once again red stars represent landmarks, red ellipses represent the uncertainty, the black dots are the trajectory, and the blue line shows the final orientation. Due to the lack of ground truth there is no way to tell which trajectory is the most accurate. What can be said is that the association with Mahalanobis distances requires less tuning and is better for loop closure. Mahalanobis distance association is applicable for low and moderate noise levels. If the measurements are too accurate there will be problems with association due to the inverse of the covariance matrix. No noise will result in the Mahalanobis distance becoming infinite. Too much noise will create landmarks with very big uncertainty and this will result in very few landmarks in total. Having few landmarks will create problems when trying to track position and movement.

Algorithm 2: Associate measurements M to landmarks L using Mahalanobis distances
Data: Current measurements M and landmarks L
Result: Best matching landmarks X and new landmarks X_new

begin
  forall measurements m_i ∈ M do
    Find all l_k ∈ L with ||m_i − l_k|| ≤ r_max
    x_i ← ∅, M_min ← ∞, N_min ← ∞
    forall landmarks l_k closer than r_max do
      Find and normalise the descriptors δ_m,i and δ_l,k
      Calculate the normalised innovation M (see Eq. 4.6)
      Calculate the normalised distance N (see Eq. 4.8)
      if M ≤ g_A and N ≤ N_min and δ_m,i · δ_l,k ≥ α then
        N_min ← N
        x_i ← k
      else if M ≤ M_min then
        M_min ← M
      end
    end
    if x_i ≠ ∅ then
      if x_i ∉ X then
        Not associated before
        X ← X ∪ x_i
      else
        Check if better than the previous association
        Find the index l for which x_i = X_l
        Find the normalised distance N_l corresponding to X_l
        if N_min ≤ N_l then
          X_l ← x_i
          N_l ← N_min
        end
      end
    else if M_min ≥ g_R then
      This is a new landmark
      X_new ← X_new ∪ m_i
    end
  end
end
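For clarity, a minimal Python sketch of the gating test performed for a single measurement/landmark pair in the inner loop of Algorithm 2 is given below. The thesis implementation is in Matlab; the function name and arguments here are illustrative only.

import numpy as np

def gating_test(z, z_pred, P_i, desc_meas, desc_lm, g_A, alpha):
    # Innovation between the actual and the predicted measurement.
    innovation = z - z_pred
    P_inv = np.linalg.inv(P_i)
    # Normalised innovation (NIS), cf. Eq. (4.6), and normalised distance, Eq. (4.8).
    M = float(innovation.T @ P_inv @ innovation)
    N = M + np.log(np.linalg.det(P_i))
    # Normalised scalar product between the descriptors.
    d1 = desc_meas / np.linalg.norm(desc_meas)
    d2 = desc_lm / np.linalg.norm(desc_lm)
    descriptor_ok = float(d1 @ d2) >= alpha
    accept = (M <= g_A) and descriptor_ok
    return accept, M, N

Among all candidate landmarks that pass this test, Algorithm 2 keeps the one with the smallest normalised distance N.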

Figure 4.9: EKF-SLAM with association using shortest distance.

4.4 Removal of Unstable Landmarks

When extracting features from images it often happens that a feature is present in just one single image. To reduce the number of landmarks introduced in the landmark database L, a score is maintained for each landmark. For each landmark there is also a value defining in which sample the landmark was seen for the first time. When a new landmark is introduced in the database L it receives the score s = 1. The score is then monitored for a total of N samples. When the landmark is seen again the score is updated by s = s + 1. If the score is lower than s_min after N consecutive samples, the landmark is removed from the database L. If the score is higher than s_min, the landmark is considered stable and cannot be removed from the database. The values of s_min and N can be chosen arbitrarily as long as s_min ≤ N. Choosing s_min = N when N is large reduces the number of landmarks efficiently, since a landmark must then be seen in N consecutive samples. If it is undesirable to remove landmarks, this can be achieved by setting s_min = N = 1, which is the same as considering a landmark stable once it has been seen a single time.

The benefit of removing unstable landmarks is that it reduces the risk of incorrect association. It also speeds up the SLAM algorithm since it reduces the dimensionality of the problem. Some testing has shown that s_min = 3 and N = 6 reduces the number of unstable landmarks efficiently while still keeping enough landmarks for the SLAM algorithm to succeed.
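A minimal sketch of this bookkeeping, assuming a simple dictionary-based landmark database (the class name and layout are illustrative, not the thesis implementation):

class LandmarkScoreKeeper:
    def __init__(self, s_min=3, n_window=6):
        self.s_min = s_min        # minimum score required to become stable
        self.n_window = n_window  # number of samples a new landmark is monitored
        self.records = {}         # landmark id -> {"score", "first_seen", "stable"}

    def add(self, lm_id, sample):
        # A new landmark starts with score s = 1.
        self.records[lm_id] = {"score": 1, "first_seen": sample, "stable": False}

    def reobserved(self, lm_id):
        # Each re-observation increases the score by one.
        self.records[lm_id]["score"] += 1

    def prune(self, sample):
        # Remove landmarks that did not reach s_min within the monitoring window.
        removed = []
        for lm_id, rec in list(self.records.items()):
            if rec["stable"]:
                continue
            if sample - rec["first_seen"] >= self.n_window:
                if rec["score"] < self.s_min:
                    removed.append(lm_id)
                    del self.records[lm_id]
                else:
                    rec["stable"] = True
        return removed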

4.5 Conclusion

Both methods for data association give good results, but the method using the Mahalanobis distance is more robust, especially when it comes to detecting loop closure after drift. Because of this, Mahalanobis association is the preferred method. Association using only the shortest distance is by far the faster of the two methods. The nearest neighbour association used here is a very simple version and it is possible that it could be improved; introducing descriptors should yield a significant improvement, first of all by reducing the number of incorrect associations. However, since nearest neighbour is faster than Mahalanobis, it is a good solution for reducing the number of candidate landmarks before doing data association with Mahalanobis distances.

Chapter 5

SLAM Models

This chapter describes the models used in the SLAM systems of this thesis. The cameras must have models relating their measurements to the conditions of the rig and the landmarks. It is also necessary to have a model of how the rig moves. Another necessity is coordinate systems to express these models in.

5.1 2D SLAM With Simulated Measurements

The authors of this thesis have developed 2D versions of the SLAM algorithms, i.e. versions where a simulated rig can only move in one plane and only make observations of landmarks in this plane. The reason for the development of 2D versions is to get an understanding of the algorithms and to test various concepts in easier environments. These versions do not use real sensors, so the measurements are simulated.

Rig State

The condition of the rig is called its state. It contains all the known information about the rig. The number of dimensions of the state depends on how much of the rig's condition needs to be modelled. The state of the rig at time t in the 2D world is

$$x(t) = \begin{pmatrix} x(t) \\ y(t) \\ \phi(t) \\ \dot{x}(t) \\ \dot{y}(t) \\ \dot{\phi}(t) \end{pmatrix}, \quad (5.1)$$

where x(t) and y(t) are the relative positions in a coordinate system, ẋ(t) and ẏ(t) are the corresponding velocities, φ(t) is the heading/yaw in which the rig is looking and φ̇(t) is the angular velocity.

Figure 5.1: The black rig is attached to the local coordinate system XY, moving around in the global coordinate system xy.

Coordinate Systems

There are two coordinate systems. The first is a local coordinate system called XY. It is fixed to the rig with its origin at the camera. The X-axis points in the direction in which the camera is looking. The Y-axis points to the left. The global coordinate system is called xy. The plots in this report will have x to the right of the paper and y pointing towards the top of the paper. A point p_xy in xy can be transformed into a point p_XY in XY using

$$p_{XY} = R_{xy}(p_{xy} - t_{xy}) = \begin{pmatrix} \cos(\psi) & \sin(\psi) \\ -\sin(\psi) & \cos(\psi) \end{pmatrix}\left(p_{xy} - \begin{pmatrix} x \\ y \end{pmatrix}\right), \quad (5.2)$$

where x, y and ψ are the position and orientation of XY relative to xy. Figure 5.1 illustrates how the rig is bound to the local coordinate system and how the coordinate systems relate to each other, with the same notation as in (5.2).

Measurement Model

A measurement z_k in the 2D versions of EKF-SLAM and SAM in this thesis consists of the range and bearing from the rig to a landmark, given in the local coordinate system XY. The measurement is defined by

$$z_k = \begin{pmatrix} r \\ \varphi \end{pmatrix}, \quad (5.3)$$

where r and ϕ are the range and bearing to the landmark, respectively. By defining

$$\begin{pmatrix} \Delta x \\ \Delta y \end{pmatrix} = \begin{pmatrix} l_x - x \\ l_y - y \end{pmatrix}, \quad (5.4)$$

where (x y)^T are the coordinates of the robot in the global coordinate system xy and (l_x l_y)^T is the position of the landmark in xy, the measurement equation h is given by

$$\hat{z}_k = \begin{pmatrix} \hat{r} \\ \hat{\varphi} \end{pmatrix} = h(x, l_k) = \begin{pmatrix} \sqrt{\Delta x^2 + \Delta y^2} \\ \mathrm{pi2pi}(\mathrm{atan2}(\Delta y, \Delta x) - \psi) \end{pmatrix}. \quad (5.5)$$

Equation (5.5) predicts a measurement ẑ_k given the state x of the rig and the global coordinates l_k of the landmark. The function pi2pi normalises an angle to the interval from −π to π. The notation is shown in Figure 5.2.

Figure 5.2: The black rig is attached to the local coordinate system XY, moving around in the global coordinate system xy. It observes a red landmark through a measurement z = (r ϕ)^T.

System Dynamics

The system dynamics is a mathematical description of how the system moves between time t and t + T. It is often the case that the system cannot move or accelerate beyond some limits. It is also possible to estimate where the system will be at time t + T, given that the position and speed of the system at time t are known. The system dynamics is therefore sometimes also called the prediction equation or the process equation.

The 2D EKF-SLAM system of this report uses a constant velocity model with the state vector in (5.1). This model sets limits on how much the system can accelerate, approximates the velocity as constant and predicts the next position of the rig. The discrete time version of the model is

$$x(t + T) = Fx(t) + Gw(t) = \begin{pmatrix} I_3 & T I_3 \\ 0 & I_3 \end{pmatrix} x(t) + \begin{pmatrix} \tfrac{T^2}{2} I_3 \\ T I_3 \end{pmatrix} w(t), \quad (5.6)$$

where I_3 is the 3×3 identity matrix, T is the sample time and w(t) is the process noise, i.e. the acceleration of the rig.
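A minimal Python/NumPy sketch of the 2D models in (5.5) and (5.6); the thesis implementation is in Matlab, so the function names and values here are illustrative only:

import numpy as np

def pi2pi(a):
    # Normalise an angle to the interval [-pi, pi).
    return (a + np.pi) % (2.0 * np.pi) - np.pi

def h_range_bearing(rig, landmark):
    # rig = (x, y, psi); predicts the range-bearing measurement of Eq. (5.5).
    dx, dy = landmark[0] - rig[0], landmark[1] - rig[1]
    r_hat = np.hypot(dx, dy)
    phi_hat = pi2pi(np.arctan2(dy, dx) - rig[2])
    return np.array([r_hat, phi_hat])

def cv_predict(x, w, T):
    # Constant velocity prediction of Eq. (5.6), state = (x, y, phi, vx, vy, vphi).
    I3 = np.eye(3)
    F = np.block([[I3, T * I3], [np.zeros((3, 3)), I3]])
    G = np.vstack([0.5 * T**2 * I3, T * I3])
    return F @ x + G @ w

# Example: one prediction step followed by a predicted measurement.
x = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.1])
x = cv_predict(x, np.zeros(3), T=0.1)
z_hat = h_range_bearing(x[:3], np.array([2.0, 1.0]))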

5.2 3D SLAM with stereo camera measurements

The 3D SLAM system uses a stereo camera as its primary sensor. The system keeps track of interest points in the two images from the stereo camera. This enables the rig to continuously build a model of its environment and simultaneously localise itself within it.

Robot State

The orientation of the rig is described using the Euler angles defined in Table 5.1.

Table 5.1: Euler angles in the 3D model.

  Euler angle   Notation
  Yaw           φ
  Pitch         θ
  Roll          ψ

The yaw angle is the rotation of the rig around its Z-axis, pitch is the rotation around its Y-axis and roll is the rotation around the X-axis (the coordinate systems are defined below). A total of 12 variables are needed in the state vector used in 3D-SLAM when a constant velocity model is used together with Euler angles. The state vector is

$$x(t) = \begin{pmatrix} x(t) & y(t) & z(t) & \phi(t) & \theta(t) & \psi(t) & \dot{x}(t) & \dot{y}(t) & \dot{z}(t) & \dot{\phi}(t) & \dot{\theta}(t) & \dot{\psi}(t) \end{pmatrix}^T, \quad (5.7)$$

with the same notation as in (5.1), but extended with the altitude z(t), the pitch θ(t), the roll ψ(t) and the corresponding velocities.

Coordinate Systems

The local coordinate system is called XYZ and is fixed to the center of the left camera. The X-axis points in the forward direction, the Y-axis points in the left direction and the Z-axis points upwards. The global coordinate system is called xyz. The plots in this report will have x to the right of the paper, y pointing

to the top of the paper and z pointing out from the paper, towards the reader. Figure 5.3 shows an illustration of the two coordinate systems.

Figure 5.3: The black rig is attached to the local coordinate system XYZ, moving around in the global coordinate system xyz.

Using Euler angles, it is possible to relate a point in one of the coordinate systems to the same point in the other coordinate system. The variant used is the yaw, pitch and roll convention. The following matrices and (5.11) make a rotation from the global map coordinate system to the local robot coordinate system. The yaw rotation is defined as

$$Y = \begin{pmatrix} \cos(\phi) & \sin(\phi) & 0 \\ -\sin(\phi) & \cos(\phi) & 0 \\ 0 & 0 & 1 \end{pmatrix}. \quad (5.8)$$

Equation (5.8) makes a left hand (clockwise) rotation of the global coordinate system around its z-axis. Equation (5.9) makes a left hand rotation of the global coordinate system around its y-axis. This is the pitch rotation,

$$P = \begin{pmatrix} \cos(\theta) & 0 & -\sin(\theta) \\ 0 & 1 & 0 \\ \sin(\theta) & 0 & \cos(\theta) \end{pmatrix}. \quad (5.9)$$

Finally, (5.10) makes a left hand rotation of the global coordinate system around its x-axis, corresponding to the roll rotation,

$$R = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos(\psi) & \sin(\psi) \\ 0 & -\sin(\psi) & \cos(\psi) \end{pmatrix}. \quad (5.10)$$

The following multiplication of these matrices gives the final rotation matrix:

$$T_1 = R\,P\,Y = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos(\psi) & \sin(\psi) \\ 0 & -\sin(\psi) & \cos(\psi) \end{pmatrix} \begin{pmatrix} \cos(\theta) & 0 & -\sin(\theta) \\ 0 & 1 & 0 \\ \sin(\theta) & 0 & \cos(\theta) \end{pmatrix} \begin{pmatrix} \cos(\phi) & \sin(\phi) & 0 \\ -\sin(\phi) & \cos(\phi) & 0 \\ 0 & 0 & 1 \end{pmatrix}. \quad (5.11)$$

A point p_xyz in xyz can be transformed into a point p_XYZ in XYZ using

$$p_{XYZ} = T_1(p_{xyz} - t_{xyz}), \quad (5.12)$$

where T_1 is the rotation matrix in (5.11) and t_xyz is a vector to the origin of XYZ. A rotation from the local coordinate system to the global coordinate system is made by changing the sign of the angles in (5.8), (5.9) and (5.10). The order of the rotations also has to be changed, which gives the rotation matrix T_2,

$$T_2 = \begin{pmatrix} \cos(\phi) & -\sin(\phi) & 0 \\ \sin(\phi) & \cos(\phi) & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \cos(\theta) & 0 & \sin(\theta) \\ 0 & 1 & 0 \\ -\sin(\theta) & 0 & \cos(\theta) \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos(\psi) & -\sin(\psi) \\ 0 & \sin(\psi) & \cos(\psi) \end{pmatrix}. \quad (5.13)$$

Measurement Model

The measurement model relates a pixel in an image to an xyz-position in the world coordinate system. For this to be possible, it is necessary to know the depth to the position that the pixel corresponds to. This information can be retrieved because the stereo camera observes the xyz-position from two different positions with a known distance between the lenses. Section 4.1 describes how the depth estimate is retrieved from a stereo image pair.

Each camera is modelled with a pinhole model, illustrated in Figure 5.4. The large rectangle is the part of the world that is observed. That area is projected through the pinhole, i.e. the focal point, and onto the image plane in the small rectangle. The red line shows how a specific part of the world is projected onto the image plane. Figure 5.5 illustrates how the pinhole camera works with the notation used in this thesis. Note that the stereo images are mirrored in both directions of the image plane before they arrive at the computer, so that they look correct. This is illustrated in the camera model by mirroring the image plane in the focal point. The focal length of the camera is f, the depth to the observed landmark is d and l_XYZ is the position of the observed landmark in local rig coordinates. This landmark is projected onto the pixel coordinates u and v along the corresponding axes of the image plane. The measurement vector is defined as

Figure 5.4: The pinhole camera model.

Figure 5.5: The pinhole camera model with the image plane mirrored in the focal point.

$$z = \begin{pmatrix} u \\ v \\ d \end{pmatrix}, \quad (5.14)$$

where the origin is the top left pixel in the image, as seen in Figure 5.5. The measurement model consists of several steps. The first is to calculate the relative distance between the landmark and the rig in world coordinates. The second step is to rotate these distances to the local coordinate system of the rig. These two steps can be performed using

$$l_{XYZ} = T_1(l_{xyz} - x_{xyz}), \quad (5.15)$$

where T_1 is the Euler rotation matrix defined in (5.11). The third step is to change the definition of the axes so that they correspond to those in the image plane. The X-axis of the local coordinate system relates to the depth d in the measurement. The left-pointing Y-axis relates to the u-axis in the image (with inverted sign), while the Z-axis relates to the v-axis in the image (also with inverted sign). This is performed using

$$l_{XYZ,M} = M\,l_{XYZ}, \quad (5.16)$$

where M is the transformation matrix that relates the axes of the camera coordinate system uv and the local rig coordinate system XYZ. It is given by

$$M = \begin{pmatrix} 0 & -1 & 0 \\ 0 & 0 & -1 \\ 1 & 0 & 0 \end{pmatrix}. \quad (5.17)$$

Note that the signs on two of the axes are altered because those axes point in different directions, see Figure 5.5. After this, the final step is to translate the point defined in the altered local coordinate system to a pixel coordinate and a depth, resulting in a measurement. This is done by performing

$$z = \begin{pmatrix} u \\ v \\ d \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & l_{XYZ,M,Z} \end{pmatrix} C\, l_{XYZ,M} / l_{XYZ,M,Z}, \quad (5.18)$$

where l_{XYZ,M,Z} is the third component of l_{XYZ,M}. The camera matrix C is defined by

$$C = \begin{pmatrix} f & 0 & u_0 \\ 0 & f & v_0 \\ 0 & 0 & 1 \end{pmatrix}, \quad (5.19)$$

where f is the focal length of the camera. The pixel values of the image centre point are u_0 and v_0 in the u- and v-directions respectively. From (5.18), together with (5.15) and (5.16), the measurement equation used in 3D visual SLAM with a stereo camera is given.
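A minimal sketch of the stereo measurement model in Eqs. (5.11) and (5.15)-(5.19), written as hypothetical NumPy code; the function names and parameters are illustrative only and the thesis implementation is in Matlab:

import numpy as np

def euler_T1(phi, theta, psi):
    # Rotation from global xyz to local XYZ (yaw, pitch, roll), Eq. (5.11).
    c, s = np.cos, np.sin
    Y = np.array([[ c(phi),  s(phi), 0], [-s(phi), c(phi), 0], [0, 0, 1]])
    P = np.array([[ c(theta), 0, -s(theta)], [0, 1, 0], [ s(theta), 0, c(theta)]])
    R = np.array([[1, 0, 0], [0,  c(psi), s(psi)], [0, -s(psi), c(psi)]])
    return R @ P @ Y

def h_stereo(landmark_xyz, rig_xyz, phi, theta, psi, f, u0, v0):
    # Project a global landmark into a (u, v, d) stereo measurement.
    T1 = euler_T1(phi, theta, psi)
    l_local = T1 @ (landmark_xyz - rig_xyz)             # Eq. (5.15)
    M = np.array([[0, -1, 0], [0, 0, -1], [1, 0, 0]])   # Eq. (5.17)
    l_cam = M @ l_local                                 # Eq. (5.16)
    depth = l_cam[2]
    C = np.array([[f, 0, u0], [0, f, v0], [0, 0, 1]])   # Eq. (5.19)
    uv1 = C @ (l_cam / depth)
    return np.array([uv1[0], uv1[1], depth])            # Eq. (5.18)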

System Dynamics

The dynamics of the system is described by a constant velocity model, as in the 2D case. The equation for this looks very much like the one in (5.6), but is extended with the extra states used in 3D-SLAM.
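A minimal sketch of the 12-state constant velocity model hinted at above, i.e. the 3D analogue of Eq. (5.6). The block structure here is an assumption consistent with the 2D model; the thesis does not print the full matrices:

import numpy as np

def cv_matrices_3d(T):
    # F (12x12) and G (12x6) for state (x, y, z, phi, theta, psi, and their rates).
    I6 = np.eye(6)
    F = np.block([[I6, T * I6],
                  [np.zeros((6, 6)), I6]])
    G = np.vstack([0.5 * T**2 * I6, T * I6])
    return F, G

F, G = cv_matrices_3d(T=0.1)
x_pred = F @ np.zeros(12) + G @ np.zeros(6)   # one prediction step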


63 Chapter 6 SLAM Algorithms This chapter will present and describe a number of different SLAM algorithms. The choice of algorithms to implement in this thesis is presented at the end of the chapter. A successful SLAM implementation needs to address many issues, some of them introduced in Section 1.1. There are more important issues to concern. Different approaches to SLAM handle these in different ways and with varying success. The major complications that a SLAM algorithm needs to solve apart from feature extraction and data association is: 1. Loop closing Loop closing shall occur when the rig revisits a previously seen area and the result should be an updated and improved map. The loop closure shall also be used to update the current state of the rig. 2. Large number of landmarks The SLAM algorithm must be able to handle that the map becomes large, i.e. that there is a large number of landmarks. The major problem with many landmarks is that a lot of time might be needed for calculations. It is therefore important to consider the time complexity of any SLAM algorithm. 3. Inconsistency problems The resulting map must be consistent, i.e. correct. This issue will be explained later but in short, it is an issue that may occur because of linearisation errors in SLAM. The most common approach to SLAM is based on the Extended Kalman Filter (EKF-SLAM). This approach is very successful at loop closure, but the time complexity is not very good and inconsistency errors may occur. Another approach is to use particle filters, and one of the most successful algorithms in this direction is FastSLAM. FastSLAM can have a more hard time to detect loop closure but is generally faster than EKF-SLAM and more consistent. A third approach is to use the information form instead of the dual, the covariance, used in EKF-SLAM. These approaches will be referred to as information form SLAM and generally they 49

64 5 SLAM Algorithms l 2 l 1 z 1,1 z 1,2 z 2,2 z 3,4 z 2,3 z 3,3 l 4 l 3 Figure 6.1: A black rig moves and observes red landmarks. produce maps faster than the EKF, but time issues may occur during loop closure and what s more, data association will be more complicated since the covariances needed are not readily available. Another way of solving SLAM is to use techniques from optimisation. One such approach is Smoothing and Mapping (SAM). It represents the problem using least squares and solves it by using efficient optimisation techniques. Smoothing refers to the fact that SAM is maintaining an estimate of the whole trajectory. Filtering approaches such as EKF-SLAM and information form variants are only maintaining an estimate of the current rig position. A nice benefit of using the whole trajectory is that inconsistency errors become smaller. SAM is however suffering from the fact that the matrices in the least squares problem grow very large very rapidly as the trajectory becomes longer and the map grows. It is possible to divide the environment into smaller maps, i.e. submaps. Most approaches to SLAM can be improved by this concept since no matter how efficient an algorithm is, computational time will be an issue when the map becomes large enough. 6.1 The SLAM Problem Any SLAM algorithm will need to keep track of both the rig itself and the world around it. At time t, all the information about the rig, e.g. its position, orientation and velocities, is collected in the state vector x t. This was described in Chapter 5. The world is usually modelled using landmarks. The algorithms in this thesis will assume that the true locations of these landmarks are time invariant, i.e. that the landmarks do not move. All the known information about landmark number k is collected in the vector m k. This information is typically a 2D or 3D position. Figure 6.1 shows a black rig moving in a previously unknown 2D environment between time t = 1 and t = 3. It has a sensor that can see red objects. These objects are the landmarks that have been observed up to time t = 3. By keeping track of landmarks between successive points of time, it is possible to estimate

65 6.2 Extended Kalman Filter SLAM 51 how the rig moves. The estimates of the landmark locations will also be improved when they are observed several times. z t,k is an observation, i.e. a measurement, of landmark k at time-point t. The force that moves the rig is called the control signal. This signal is often known and can be used to improve the SLAM solution. The control vector u t is applied at time t and moves the rig from x t to x t Extended Kalman Filter SLAM EKF-SLAM is based on the Extended Kalman filter and uses a state-space model with additive Gaussian noise. The state vector X t = ( ) xt l (6.1) contains both the current robot state x t and the position of all the landmarks l. The algorithm is recursively updating an estimate of this combined state vector and the corresponding covariance matrix, ( ) Pxx P P t = xl, (6.2) P lx using the measurements through the observation model and the predictions from the motion model, i.e. the system dynamics equation [14]. The motion is described by P ll x t+1 = f(x t, u t ) + g(v t ), (6.3) where v t is additive Gaussian noise with zero mean and covariance Q t. measurement model is described by The z t = h(x t, l) + e t, (6.4) where z t is all the measurements at time t and e t is additive Gaussian noise with zero mean and covariance R t Time Update The time update part of EKF-SLAM is where the knowledge of the system dynamics and the control signals affecting the system is used for prediction of the next robot pose. It is also the part where uncertainty is added to the solution. Figure 6.2 shows how a time update in a 2D world can look like. Note that landmark l 1 and l 2 have been observed earlier. The location of the rig at time-point t, using all information available up to and including time-point t, is denoted x t t. With the same notation, x t+1 t is the location of the rig at time-point t + 1 using all the information available up to time-point t. This estimate has therefore not used the sensor data available in t + 1, but only the prediction information from the system dynamics, possibly

66 52 SLAM Algorithms l 1 x t 1 t 1 x t t 1 l 2 Figure 6.2: At time t 1, the rig knows the location of two landmarks. A time update is made, leading to a new estimate of x and an increased uncertainty of x. x t t l 1 x t 1 t 1 x t t 1 l 2 Figure 6.3: The estimate x t t 1 is adjusted into x t t by performing a measurement update with landmark l 1 and l 2. including odometry data and other control signals. The time update is performed by x t t 1 = f(x t 1 t 1, u t 1 ) (6.5) P xx,t t 1 = FP xx,t 1 t 1 F T + GQ t G T, (6.6) where F = f is the Jacobian of the system dynamics equation f evaluated at the estimate x t 1 t 1 and G is a matrix that describes how the noise affects the system as in (5.6). As stated in Section 6.1, it is assumed that the true locations of the landmarks are static, so the estimates of the landmarks are not affected in the time update since there is no motion to model for them Measurement Update The measurement update is the part where the knowledge acquired from the sensors are used to adjust the estimated position of the rig so that the uncertainty of the solution is lowered. Figure 6.3 shows a measurement update in a 2D world, performed right after the time update in Figure 6.2.

67 6.2 Extended Kalman Filter SLAM 53 The rig re-observes landmark l 1 and l 2. Since the true location of these landmarks are stationary, they behave similar to e.g. lighthouses in oceans. Their locations are known to some accuracy in a global map, and the observer of a lighthouse can therefore be more certain about his or her position. Figure 6.3 shows how the estimate x t t 1 is adjusted to the new estimate x t t using the relative observation of l 1 and l 2. Also shown in the figure is that the uncertainty of the rig is lowered. The first step of the measurement update is to create the innovation vector i t = z t h(x t t 1, l t 1 ), (6.7) i.e. the error between the actual measurements and what the measurement equation predicts the measurement to be, using the current estimate of the state vector. The measurement update is then performed by ( xt t l t ) = ( xt t 1 l t 1 ) + K t i t (6.8) P t t = P t t 1 K t S t W T t, (6.9) where K t is the Kalman gain, i.e. information on how much to trust the different parts of the innovation vector and how to relate the information to a change in the state vector. The Kalman gain is calculated by S t = HP t t 1 H T + R t (6.1) K t = P t t 1 H T S 1 t. (6.11) where H = h is the Jacobian of the measurement equation h evaluated at the estimates x t t 1 and m t 1. It is normally the case that multiple landmarks are observed in each time-point. The measurements are stacked on top of each other in the measurement vector z t. Therefore, the measurement equation h must be able to relate multiple measurements to multiple landmarks. This is solved by stacking several measurement equations on top of each other in h, and placing the different Jacobians in the right places in H. This will be explained further in Section D Visual EKF-SLAM A 3D visual EKF-SLAM system has been implemented in Matlab. This section will describe how it works and how it is implemented. Algorithm 3 describes the system on a very high level. The different parts of the system will then be treated in subsections, before finally the whole system is summarised in Algorithm 3.

68 54 SLAM Algorithms Algorithm 3: High level description of 3D visual EKF-SLAM. 1 begin 2 forall the Stereo pair images do 3 Image processing 4 Data association 5 Measurement update 6 Augmentation of new landmarks 7 Time update 8 end 9 end Image Processing The data processing step takes a stereo image pair and returns measurements z = (u v d) T corresponding to interest points in the images. These interest points, i.e. features, are found using SURF as explained in Section By finding the same features in the left and the right image, as explained in Section 4.1, it is possible to calculate the depths to the world points to which the features correspond. The formula for this calculation is described in Section 4.1. Data Association The data association part of the 3D visual EKF-SLAM is a very important part. It is also the part with the highest computational cost. The input to the data association are the measurements z generated in the data processing step and the output is the matches, i.e. the relations, that are found between these measurements and the stored landmarks l. A second output is information on which measurements that do not match any stored landmarks but should be added as new ones, i.e. augmented to the state vector X t. Mahalanobis distances are calculated between each measurement and all candidate landmarks, i.e. landmarks that are close enough geometrically. As a validity check, the normalised scalar product between the descriptor of the measurement and the landmark is calculated. So, for a measurement to match to a specific landmark the following needs to hold: The position corresponding to z i is close enough geometrically to l j. The normalised cross product between the descriptors of z i and l j must be greater than a specific value. No other measurement z k that pass the above two points has a better mahalanobis distance to l j. Measurement Update The measurement update is done in a batch variant, so that all the measurements from a stereo pair is updating the EKF at the same time. Section described

69 6.2 Extended Kalman Filter SLAM 55 the basics behind the measurement update. The structure of the matrices used will be described further. The creation of these matrices can be seen as a pre step to the measurement update. Finally, this section will describe how to perform the actual measurement update in an efficient and stable way using Cholesky decomposition. Consider an example where three landmarks have been seen. Data association finds the following relations z 1 l 3, z 2 l 4, z 3 l 1 where l 1, l 3 and l 4 correspond to landmarks stored in the state vector X t. The measurement vector will be z t = z 1 z 2 z 3 with z i = u i v i d i for i = 1, 2, 3. (6.12) The combined measurement equation h is h 1 (x t, l 3 ) ẑ t = h(x t, l) = h 2 (x t, l 4 ), (6.13) h 3 (x t, l 1 ) where h i was defined for the stereo camera in Section and ẑ t is a prediction of the measurement z t. It is a bit more complicated to form the combined Jacobian H. In this example, H is given by H 1,x H 1,l3 H = H 2,x H 2,l4, (6.14) H 3,x H 3,l1 where H i,x is the Jacobian of h i with respect to the state x of the rig. H i,lj is the Jacobian of h i with respect to the states of landmark l j. Note that the correct column for Jacobian H i,lj is decided by where l j is stored in the state vector X t. The columns of H relates to the rows of the state vector X t while the number of rows in H relates to the number of measurements in z t. The last matrix that need to be created is the measurement noise R t. In this example, is is given by R z1 R t = R z2, (6.15) R z3 where R zi is the measurement noise for measurement i. The actual measurement update can be performed using Cholesky decomposition. This is more numerically stable than the standard approach described in Section The innovation i t is created just as in (6.7) in Section 6.2.2, repeated here for completeness, The update can now be performed using i t = z t h(x t, l t ). (6.16)

$$\begin{pmatrix} x_{t|t} \\ l_t \end{pmatrix} = \begin{pmatrix} x_{t|t-1} \\ l_{t-1} \end{pmatrix} + W_2 i_t, \quad (6.17)$$

$$P_{t|t} = P_{t|t-1} - W_1 W_1^T, \quad (6.18)$$

where

$$W_1 = P_{t|t-1} H^T \operatorname{chol}(S_t)^{-1}, \quad (6.19)$$
$$W_2 = W_1 \operatorname{chol}(S_t)^{-T}, \quad (6.20)$$
$$S_t = H P_{t|t-1} H^T + R_t, \quad (6.21)$$

are calculated with the help of the inbuilt Cholesky function in Matlab. Note that before calling the Cholesky function it is a good idea to ensure that S_t really is symmetric, and to make it symmetric if it is not.

Augmentation of New Landmarks

When a new landmark l_new is observed through a measurement z, this landmark needs to be added to the state vector X, resulting in an extended state vector X′. Note that the time index t has been dropped for simplicity. The state augmentation is done using

$$X' = \begin{pmatrix} X \\ l_{new} \end{pmatrix} = \begin{pmatrix} x \\ l \\ l_{new} \end{pmatrix}. \quad (6.22)$$

It is also necessary to extend the corresponding covariance matrix P, defined in Section 6.2 like

$$P = \begin{pmatrix} P_{xx} & P_{xl} \\ P_{lx} & P_{ll} \end{pmatrix}, \quad (6.23)$$

with new rows and columns. The new covariance matrix P′ is created using

$$P' = \begin{pmatrix} P_{xx} & P_{xl} & (H_x P_{xx})^T \\ P_{lx} & P_{ll} & (H_x P_{xl})^T \\ H_x P_{xx} & H_x P_{xl} & H_x P_{xx} H_x^T + H_z R_z H_z^T \end{pmatrix}, \quad (6.24)$$

where H_x is the Jacobian of the measurement equation h with respect to the states of the rig, x, and H_z is the Jacobian of the same measurement equation h with respect to the variables of the measurement z.
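A minimal sketch of the Cholesky-based measurement update in Eqs. (6.17)-(6.21), written as hypothetical NumPy code; the thesis uses Matlab and its built-in chol() for this step:

import numpy as np

def ekf_update_cholesky(X, P, z, z_pred, H, R):
    # One batch EKF measurement update using a Cholesky factor of S.
    innovation = z - z_pred                     # Eq. (6.16)
    S = H @ P @ H.T + R                         # Eq. (6.21)
    S = 0.5 * (S + S.T)                         # enforce symmetry before the factorisation
    U = np.linalg.cholesky(S).T                 # upper factor, S = U.T @ U
    W1 = P @ H.T @ np.linalg.inv(U)             # Eq. (6.19)
    W2 = W1 @ np.linalg.inv(U.T)                # Eq. (6.20), equals the Kalman gain
    X_new = X + W2 @ innovation                 # Eq. (6.17)
    P_new = P - W1 @ W1.T                       # Eq. (6.18)
    return X_new, P_new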

71 6.2 Extended Kalman Filter SLAM 57 Time Update The 3D visual EKF-SLAM system uses a constant velocity model for the system dynamics. This model was described in Chapter 5. The time update is performed by x t t 1 = f(x t 1 t 1, u t 1 ) (6.25) as was described in Section P xx,t t 1 = FP xx,t 1 t 1 F T + GQ t G T, (6.26) Pros and Cons With EKF-SLAM This approach has been successfully used for over a decade in successively more complicated scenarios. EKF-SLAM maintains the full covariance matrix for the problem. While this give good solutions in small environments, several problems arise when mapping larger areas. The first problem is that standard EKF-SLAM scales poorly as the complexity for doing the measurement update is quadratic in the number of landmarks. Another important issue is that EKF-SLAM can be inconsistent when the map is large, since the estimated covariances become small compared to the true errors [4]. This is because of linearisation errors and the way uncertainty is modelled with Gaussian noise. According to [4], the problem is easily seen on the estimated heading of the robot. If the initial heading variance is set to a too large value, this variance will actually decrease as more measurements arrive. That should of course not be possible. Figure 6.4 shows the variance of some of the estimated rig states as a function of samples in our 3D visual EKF-SLAM implementation. The initial variance is set to.5 2, and as can be seen in the figure, this value is decreasing. The map is however looking all right, and the problem is avoided at least to some extent by setting the initial variance to a small value. A small initial value will result in an increasing variance. However, the variance will eventually be too optimistically estimated, ultimately leading to problems with loop closing and jagged trajectories. Section shows another example illustrating this problem. With this said, EKF-SLAM generally produces good results without approximations other than those from the linearisation. Even though the complexity is quadratic, the algorithm runs fast in our experiments with over 2 3D landmarks. Matlab can for example invert a matrix (corresponding to a covariance matrix with 192 landmarks) in approximately.15 seconds on a standard desktop computer Partitioned Updates A typical SLAM scenario involves much exploration of new environments. This means that landmarks not seen for a while will not be affected much by new measurements, unless of course the robot is closing a loop. It is possible to maintain

72 58 SLAM Algorithms.25.2 Variance x position Variance y position Variance heading Figure 6.4: Variance of estimated states in 3D visual EKF-SLAM using a large initial variance on the heading. a local map within a region close to the robot, and update the global map with a lower frequency. One such solution is the compressed EKF (CEKF) [16]. This method is however maintaining absolute coordinates for all the landmarks and will be affected by large covariances giving problems as stated in Section 6.2. Another approach that avoids large covariances is to use a local coordinate system as used in the constrained local submap filter (CLSF) [35]. CLSF seems like an efficient approach to EKF-SLAM, since most updates are performed in a small local environment, avoiding both complexity problems and some consistency problems. The global map update will still be slow but performed at a low frequency. 6.3 Submaps A popular way of solving the SLAM problem is to divide the environment into smaller areas, submaps, so that only a local map around the robot is updated most of the time. Techniques based on submaps can be used both to reduce the computational cost and to improve the consistency of the global estimated map [4]. The reason for this is that in many of these methods, no real global-level Kalman update is done and as stated before, the problems with inconsistency arise when the number of landmarks and the true uncertainties are large. There are several different variants of submaps. Some of them use a global reference system for the submap anchor-points, and some of them are only relatively located to each

73 6.3 Submaps 59 other. Common to both variants is that they use local coordinate systems in each submap Conditionally Independent Divide and Conquer SLAM Based on EKF-SLAM, Conditionally Independent Divide and Conquer SLAM (CI D&C SLAM) [29] is a submapping technique that works on conditionally independent submaps. The technique makes it possible to share important information between the submaps, the robot state and landmarks being visible from several maps. In other words, when initiating a new submap, the camera velocities and some initial features overlapping with the previous map can be used to get a good start with the new map. The submaps can be joined into a global map by back-propagating information through the chain of submaps, successively merging the common elements of the submaps. This operation takes time linear in the total number of landmarks [29]. Loop closure can be performed by matching landmarks in the current field of view with landmarks in the submaps that is likely to be somewhere close to the camera given the estimated camera position. This is preferably done using both descriptors and geometrical properties of the landmarks in a batch algorithm, see e.g. [1]. The actual loop closure is then performed by adding the common features between the current and the old submap in all submaps connecting these two submaps. For this to be efficient, the number of common features between the submaps need to be small and bounded by a constant. This methods appears to be a good good start for future better submapping techniques, but it also seems like the method is not totally defined and the loop closure part is only demonstrated to work for a single loop closure event at the end Conditionally Independent Graph SLAM Conditionally Independent Graph SLAM (CI-Graph SLAM) [28] has much in common with CI D&C SLAM but is also maintaining a spanning tree over the conditionally independent submaps. The graph has different edges between the nodes representing features that the camera has seen in neighbouring submaps. The spanning tree makes it possible to transmit information between submaps. Additionally, it gives the possibility to revisit a previous map. This algorithm seems like a more complete solution to submapping SLAM compared to CI D&C SLAM. However, it also has the problems with a lot of copying of states and landmarks to the submaps involved in a loop closure. Another important issue is the fact that the paper [28] explains the algorithm using absolute submaps. Absolute submaps do not reduce problems with inconsistency and the accuracy of the result will therefore not be better than that of normal EKF-SLAM. The paper states that the algorithm can be used for local submaps too, but this is not demonstrated and might not have been tried.

74 6 SLAM Algorithms Tectonic SAM Tectonic SAM (T-SAM) [26] is a submapping approach built on the Smoothing and Mapping concept described in Section The name comes from the tectonic plates of the earth, whom s internal structure do not change much but their relative locations do. This is also what happens in T-SAM. Experimental results in [26] look nice but it seems like computational times are very long. Another possible problem is that nothing is written about data association, and it might be hard to access the covariances needed for robust data association in a reasonable time. 6.4 Information Form SLAM The covariance matrix used in EKF-SLAM is very dense, since all landmarks correlates to each other. The inverse of the covariance matrix is the information matrix. Many of the elements in this matrix are zero or close to zero. Various sparsification techniques can be used to round small values to zero. A measurement update is done by adding information to this matrix and is therefore a quick operation. The price paid for this efficiency is a slow time update and problems of accessing the mean estimate and the covariance of the state vector. In a naive implementation, acquiring the full covariance matrix is done by inverting the information matrix. Several methods exist to get access to the mean estimate and parts of the covariance matrix in more efficient ways, but it is still a costly part of the algorithm. This problem is very present when trying to use batch gating methods for data association, since large parts of the covariance matrix are needed [3] Sparse Extended Information Filter Sparse Extended Information Filter (SEIF) is an information form SLAM algorithm that performs sparsification of the information matrix. The sparsification step removes links between the robot and some landmarks in the state vector, changing active landmarks into passive. The information in the removed links is spread out to other robot-landmark pairs instead. The benefit of this comes in the marginalization of the old robot state in the time update. This is normally resulting in new and/or strengthened links between all landmarks that the robot is linked to. However, if some landmarks are made passive before the marginalization, these will not be linked to the other active landmarks, keeping the number of new links low. This leads to an approximate solution but also a faster algorithm. The problem is however that SEIF produces overconfident estimates. In fact, the covariance ellipses of the estimates become smaller than that of the EKF-SLAM and that is inconsistent behaviour [34] Exactly Sparse Extended Information Filter Exactly Sparse Extended Information Filter (ESEIF) [34] is not breaking links like the SEIF do, instead it is controlling the initial formation of the links. It controls the number of active landmarks. When the number of active landmarks exceeds

75 6.5 Smoothing and Mapping 61 a threshold, the robot is kidnapped, i.e. links to landmarks are broken, and relocated using a set of selected measurements. The old landmarks that were active are set to passive and the landmarks observed during the relocation are made active. The result is an efficient algorithm, with some loss of information because of the threshold, that does not produce the overconfident covariances that the SEIF does. Instead, the estimates are conservative compared to the EKF- SLAM [34]. 6.5 Smoothing and Mapping All the previous algorithms perform filtering. Only the current state is kept in the state vector. Smoothing and Mapping (SAM) [12] is instead performing smoothing, solving the so called full SLAM problem using least squares. What this means is essentially that an estimate of the full trajectory is updated in each sample. All previous states (robot poses) are stored, together with all the observed landmarks. The results will therefore be better and more complete than results gained from EKF-SLAM and other filtering methods. SLAM with smoothing avoids the problems with inconsistency errors occurring in filtering approaches [21] The Least Squares Problem SAM solves the least squares problem given by { Θ N = argmin i=1 f i(x i 1, u i ) x i 2 Λ i + } K k=1 h k(x ik, l jk ) z k 2 Σ k. Θ (6.27) The goal is to find the parameter vector Θ that minimizes the expression. Θ consists of the variables to be estimated, i.e. the trajectory and the map. Minimizing the expression can be described as finding an estimate of the map and the trajectory that suits all the measurements and the system dynamics the best. The number of samples is N, K is the number of observations, Λ i is the process noise and Σ k is the measurement noise. Equation (6.28) below is a linearised version of (6.27) that can be solved with QR-factorisation. δθ = argmin δθ { N i=1 F i 1 i δx i 1 + G i iδx i a i 2 K H i k i δx ik + J j k 2 k δl jk c k k=1 Λ i + Σ k } (6.28) where F is the Jacobian of the process model, G is a negative identity matrix used to avoid special handling of δx i and a i = xi f i(xi 1, u i) is the odometry prediction error. Note that the linearisation point xi is the robot state in sample i. Superscripts determine around what estimate to evaluate the Jacobian. For

76 62 SLAM Algorithms example, F i 1 i stands for the Jacobian used at sample i but evaluated around the current estimate of the robot state in sample i 1. H is the Jacobian of the measurement equation with respect to the states of the robot, J is the Jacobian of the measurement equation with respect to the states of the landmark and c k = z k h k (xi k, lj k ) is the measurement prediction error. The state vector x SAM in the least squares problem will consist of all δx and all δl. The constants before these variables can be collected in a matrix, called the measurement matrix A. a i and c k can be collected in a vector b. By doing this (6.28) can be rewritten as δ = argmin Aδ b. (6.29) δ An important thing here is how to handle Λ i and Σ k. These matrices exist for each process sample and each measurement. By multiplying a row in the least squares problem with a value greater than one the importance of that row is increased - it is trusted more. Mathematically this means that the rows corresponding to a measurement in matrix A and b should be multiplied with Σ T/2 k if Σ k is the corresponding measurement covariance. Equation (6.29) can be solved with QR-factorisation. This results in a vector δ which contains information on how much to change all the variables in the state vector x SAM. The information is simply added using x SAM = x SAM + δ, (6.3) whereafter a new and improved estimate of the map and trajectory is available The Measurement Matrix A and the Vector b Measurement rows are added before prediction rows, so that the last robot state is found at the end of the state vector after each sample. Matrix (6.31) below shows the structure of the measurement matrix used in SAM. The first two rows correspond to two measurements done in sample one. The third row is a prediction row, but a special prediction row since it is the first prediction row. It will only contain G, since there are no older state to bind the first state to. The fourth and fifth row are measurement rows of sample two. The fourth row is a reobservation of the landmark seen at row one, while the fifth row is an observation of a new landmark. The last row is the prediction row that binds the two robot states to each other with odometry. J 1 1 H 1 1 J 2 2 H 1 2 A = G 1 1 J 1 3 H 2, b = 3 J 3 4 H 2 4 F 1 2 G 2 2 c 1 c 2 a 1 c 3 c 4 a 2 (6.31) The rows of A and b are then multiplied with the covariance information as stated in Section

77 6.5 Smoothing and Mapping y x Figure 6.5: SAM before loop closure. Experiments With 2D SAM This section presents some results to clarify the concept of SAM. Further results with EKF and SAM will be presented in Chapter 9. SAM has been implemented and tested in Matlab for a simulated 2D environment with range-bearing measurements. A map and trajectory is built iteratively as new observations of landmarks are made. The map and the trajectory is continuously smoothed. Figure 6.5 shows the map at sample 38 of 5, before loop closure. The simulated measurements are noisy. The standard deviations for the range and bearing measurements are 5 centimetres and.1 radians respectively. Figure 6.6 shows the trajectory after loop closure, i.e. when the simulated rig reobserves landmarks that it saw in the beginning of the trajectory. It is easy to see that both the map and the trajectory have been smoothed, i.e. improved Incremental Smoothing and Mapping Incremental Smoothing and Mapping (isam) [21] is an improvement of SAM that is using different techniques to speed up the solution. The algorithm performs incremental updates on a QR-factorisation of a sparse information matrix. This matrix will be very sparse when exploring new environments, but fill-in (non-zero elements far away from the diagonal) will arrive when closing loops because all the old robot states are a part of the optimisation problem. This slows down the algorithm if not taken care of but isam solves the issue with periodic variable

reordering. The only problem with this is that the variable reordering takes time and results in periodic peaks in a graph of the time consumption per sample.

Figure 6.6: SAM after loop closure.

6.6 Complexity of SLAM Algorithms

The time complexities of the different algorithms are:

              Algorithm   Memory     T_local         T_loop
  Global      EKF         O(n²)      O(n²)           O(n²)
              SEIF        O(n)       O(1) amort.     O(n²)
              ESEIF       O(n)       O(1)            O(n²)
              SAM/iSAM    O(n + p)   O(1)            almost O(n + p)
  Partitioned CEKF        O(n²)      O(n)            O(n²)
              CLSF        O(n²)      O(1)            O(n²)
  Submaps     CI D&C      O(n²)      O(n²)           O(n²)
              CI-Graph    O(n)       O(1)            O(n²)
              ATLAS       O(n)       O(1)

where n is the number of landmarks in the map and p is the number of rig poses in the trajectory. These values have been retrieved from the powerpoint presentation Advanced EKF-SLAM: Building Large Maps [31], presented at SLAM Summer School 2009.

79 6.6 Complexity of SLAM Algorithms 65 The memory complexity states how much memory is needed for storage of the variables estimated in the SLAM algorithms. It is clear that the information form variants (SEIF, ESEIF and SAM) have a low complexity and this is because of the sparse representation they utilise. The EKF approaches use a dense covariance matrix and their memory complexity is therefore O(n 2 ). The time complexity has been divided in two columns. The first one, named T local relates to the scenario where the rig is exploring new environments and no loop closure occurs. The information form approaches have a benefit here, once again because of the sparse representation. For partitioned and submap methods, T local relates to the complexity within a local partition or a submap. The last column, T loop, shows the time complexity for when the algorithms perform loop closure. Most algorithms have a complexity of O(n 2 ) here, but SAM (and also isam) is claimed to have a complexity of O(n + p). It seems to be a bit unknown if this is actually true, but experiments in [21] points in this direction and so does the experiments in this report, given in Section Choice of algorithms for this thesis This theory study shows that SAM and isam are promising algorithms. There exists many more algorithms than have been described in this chapter, and possibly better ones, but it was decided that more attention should be given to SAM. EKF was also chosen as an algorithm to implement. One great advantage of the EKF is its relative simplicity in a SLAM context. Covariances needed for data association are easily acquired, and it is relatively easy to see if the implementation does what it should. EKF-SLAM can make good maps over smaller environments, and should therefore be a good candidate for making submaps in a submap based algorithm. Chapter 7 will describe how EKF and SAM can be combined so that most of their good properties are preserved while some of the poor ones are removed.

80

Chapter 7

Submap Joining Smoothing and Mapping

This chapter presents a new solution to the SLAM problem, which has been invented and developed in this master thesis project. Submap Joining Smoothing and Mapping (SJSAM) combines the good properties of standard EKF-SLAM and SAM. The pros and cons of these two algorithms are discussed below.

As stated in Section 6.2, EKF-SLAM has problems with mapping larger areas. The first problem is that standard EKF-SLAM scales badly, as the complexity of the measurement update is quadratic in the number of landmarks. Another important issue is that EKF-SLAM is inconsistent when the map is large, since the estimated covariances are too small compared to the true errors [4]. This is because of linearisation errors and may result in bad maps with e.g. jumps in the trajectory and bad heading estimates. The inconsistent behaviour also makes data association harder. With this in mind, EKF-SLAM is still a very good SLAM algorithm as long as the map size is small. It is a good choice for producing consistent submaps.

Section 6.5 described SAM. It stated that one problem with SAM is that the state vector grows very fast, since all rig states of the trajectory are stored in it. In other words, even if no additional landmarks are seen for a while, the state vector will grow as the trajectory becomes longer. Another problem is that each landmark observation is stored in a matrix. If 20 3D landmarks are observed in each sample, and 10 samples occur each second, 20 · 10 · 3 = 600 rows will be added to the measurement matrix every second. While these rows are sparse, they are used for calculations in each future sample and will not be removed. They will not only affect the time for solving the least squares problem, they will also be costly to update during relinearisation. As a recap, relinearisation is when the rows are recalculated using updated linearisation points, i.e. improved estimates. The good things with SAM are the smoothing part and the time complexity of almost O(n + p), where n is the number of landmarks and p is the number of rig states.

SJSAM uses a hierarchical approach with EKF-SLAM in the bottom layer

and SAM in the top layer. EKF-SLAM creates the submaps and SAM then treats each submap as a combined measurement. When a new time step occurs in the SAM part, one submap is loaded in to be merged. Each submap is locally referenced, so that the robot always starts at its origin. A submap contains information about all the landmarks seen in it, the descriptors for these landmarks and the location of the robot at the end of the submap. All of this information is used in the SAM part of SJSAM. The landmarks of the loaded submap are seen as relative measurements from the origin of the submap; SAM uses this information as a measurement update. The location of the rig at the end of the submap is seen as odometry information, and SAM uses this to predict the next position of the rig. The algorithm will be described in more detail in the next sections, and especially in Section 7.5.

This approach results in only a skeleton of the whole trajectory being maintained in the state vector of SAM, e.g. every 20:th sample. All the observations of a landmark between these skeleton nodes are merged into a single observation. This approach has two advantages. The first is that the state vector of SAM becomes smaller. The second and more interesting one is that the number of measurement rows is reduced by a large factor (20 in this example).

7.1 Conditionally Independent Submaps

Conditionally independent submaps are created with EKF-SLAM in a way very similar to [29]. A new submap is initiated when the robot is more than one meter away from the start of the current submap. Each submap is locally referenced, so the position of the robot in the new submap is at the origin. The covariance for this new position is zero. The speed of the robot is initiated with the speed of the robot in the old submap, rotated with the rotation matrix T_2 from (5.13). The landmarks that were matched in the last step of the old submap are used as initial landmarks in the new submap. The positions of these landmarks are calculated with the rotation matrix T_2. The covariance is calculated from the old covariances by marginalization of the old robot uncertainty. This is done using (7.1), where l stands for landmarks and r for the robot state,

$$P'_{l,l} = P_{l,l} - P_{l,r} P_{r,r}^{-1} P_{r,l}. \quad (7.1)$$

The resulting covariance matrix is then rotated to fit the new coordinate system, once again using T_2.

7.2 Coordinate Systems and Rotation Matrices

There are two coordinate systems in SJSAM. The notation for these coordinate systems is the same as for the 3D models in Chapter 5, which also defines the rotation matrices used in SJSAM.
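Returning to the submap initialisation step, a minimal sketch of the landmark covariance computation in (7.1), assuming NumPy and that P_lr holds the landmark-robot cross covariance (names and layout are illustrative only):

import numpy as np

def marginalise_robot(P_ll, P_lr, P_rr):
    # New landmark covariance with the old robot state marginalised out, Eq. (7.1).
    return P_ll - P_lr @ np.linalg.inv(P_rr) @ P_lr.T

# The result is then rotated into the new submap frame using T_2; an assumed
# helper would apply a block diagonal matrix with one T_2 block per landmark.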

7.3 Measurement equation

In SJSAM, the landmarks of a submap are seen as relative measurements from the origin of the submap. Each measurement consists of the relative distance between the landmark and the robot in three dimensions. The measurement equation is

$$z = \begin{pmatrix} \Delta x \\ \Delta y \\ \Delta z \end{pmatrix} = h(x, l) = T_1 \begin{pmatrix} l_x - x_x \\ l_y - x_y \\ l_z - x_z \end{pmatrix}, \quad (7.2)$$

where z is the measurement, x is the robot pose, l is the position of the observed landmark and T_1 is the Euler rotation matrix defined in (5.11). The covariances of the landmarks in the submaps are used as measurement noise in SJSAM.

7.4 System Dynamics

The last position in each submap is seen as odometry information in the prediction part of the SJSAM merging algorithm. This gives initial estimates of the robot states and the linearisation points needed in SAM. The equation for predicting a new robot pose, given the old robot pose estimate and the odometry information, is the following:

$$x_t = \begin{pmatrix} x_t \\ y_t \\ z_t \\ \phi_t \\ \theta_t \\ \psi_t \end{pmatrix} = f(x_{t-1}, u_{t-1}) = \begin{pmatrix} x_{t-1} \\ y_{t-1} \\ z_{t-1} \\ \phi_{t-1} \\ \theta_{t-1} \\ \psi_{t-1} \end{pmatrix} + \begin{pmatrix} T_2 \begin{pmatrix} \Delta x \\ \Delta y \\ \Delta z \end{pmatrix} \\ \begin{pmatrix} \Delta \phi \\ \Delta \theta \\ \Delta \psi \end{pmatrix} \end{pmatrix}, \quad (7.3)$$

where the pose of the robot at the end of a locally referenced submap is (Δx Δy Δz Δφ Δθ Δψ)^T, i.e. the odometry input given by the final pose in the last submap, and T_2 is the Euler rotation matrix defined in (5.13). The covariance of the final robot pose in the submap is seen as process noise in SJSAM.

7.5 Algorithm

Algorithm 4 gives an overview of the mechanics in SJSAM. The first step in each sample is to retrieve the linearisation points, corresponding to the estimates of the rig state in sample i and i − 1. The estimate in i − 1 is simply accessed by looking at the last rows of the state vector. The linearisation point for i is calculated using the process equation with the state in i − 1 and the odometry information of the last submap, as stated in (7.3).

The second step is data association. The measurements in a sample correspond to the landmarks of one submap. The current solution performs this data association using nearest neighbour calculation with Euclidean distances. Each landmark is stored with its descriptor from the SURF extractor and this information is used

84 7 Submap Joining Smoothing and Mapping Algorithm 4: Submap Joining Smoothing and Mapping (SJSAM) Data: Submaps s i = (x i L i ) T S created with e.g. EKF-SLAM. The last state of the rig in submap i is x i and L i is the landmarks in it Result: A state vector ˆX = (ˆx ˆm) T, where ˆx is a skeleton trajectory and ˆm is the map 1 begin 2 forall the submaps s i S do 3 1. Retrieve the linearisation points 4 if i > 1 then 5 ˆx i 1 is the last rig state estimate in ˆX 6 ˆx i = f(ˆx i 1, u i 1 ) where u i 1 = x i 1 s i 1 (see Equation (7.3)) 7 else 8 Initiate ˆx 1 with zeros 9 end 1 2. Data association 11 Find all l j L i that match a previously seen landmark 12 Find all l n L i that have not been seen before Measurement update 14 forall the reobservations l j do 15 Add 3 rows to A and b 16 Put J and H at the right places on these rows (see Section 6.5) 17 end 18 forall the new observations l n do 19 Add 3 rows to ˆX for the new landmark 2 Calculate an initial estimate for the landmark using the measurement, and add this to the new rows in ˆX 21 Add 3 rows to A and b, Add 3 columns to A 22 Put J and H at the right places on these rows 23 end Time update 25 Add 6 rows to A and b, Add 6 columns to A 26 if i > 1 then 27 Put F and G at the right places on these rows (see Section 6.5) 28 else 29 Put G at the right place on these rows 3 end 31 Add 6 rows to ˆX for the new rig state, Put ˆx i at these rows Relinearisation 33 Recompute all the elements of A and b that was created during sample 1... (i 1), using the new linearisation points Solve the least squares problem 35 Solve Aδ = b using Q-less QR-factorisation 36 ˆX = ˆX + δ 37 end 38 end

Each landmark is stored with its descriptor from the SURF extractor, and this information is used as a validity check. The normalised scalar product between the descriptors of a measurement and a landmark must be larger than 0.6 for a nearest neighbour match to be valid. In EKF-SLAM this value was set to 0.75, but it has been lowered here since the distance between the observations of a landmark becomes longer.

The third step is to expand the measurement matrix A and the vector b with three rows for each observation. The next step is to add the prediction rows and to augment the state vector with a new state.

SJSAM will incrementally find better estimates of the linearisation points used in the past, so all matrices formed with these need to be recalculated once in a while. This is called relinearisation. SJSAM performs relinearisation in each sample; the whole matrix A and the vector b are traversed from top to bottom. A database is maintained with information about where in the matrix and the vector a landmark or a robot state can be found.

SJSAM then solves the problem using Q-less QR-factorisation of the sparse matrix A together with the vector b. Solving the resulting triangular system gives δ, which states how much each variable in the current state vector X_hat should change. This correction is simply added to X_hat, which contains all previous states and landmark positions (see the description of SAM for details on how this is performed). The result is a new and improved estimate.

The exact robot state covariance estimates can be retrieved efficiently from the QR-factor R using back-substitution [21]. Conservative estimates of the landmark covariances can be retrieved efficiently using the robot covariance and the measurement noise used when the landmark was initiated. Exact estimates are more costly to access.

7.6 Time Complexity

The time complexity for creating the submaps using EKF-SLAM is O(1), because the size of each submap is bounded by a constant. If n is the number of landmarks in the merged map and p is the number of submaps, merging the submaps using SAM is done in O(n + p) during exploration of new ground and slightly more than O(n + p) during loop closure. The resulting complexity for SJSAM is therefore close to O(n + p), an improvement over the O(n^{2.4}) complexity of EKF-SLAM.

7.7 Consistency

EKF-SLAM produces local and consistent maps, avoiding the problems with too small covariances and wrong estimates that arise when building large maps, as stated in Section 6.2. The SAM part of SJSAM continuously relinearises the problem using refined linearisation points, solving the full SLAM problem. This improves consistency, because the problems of large scale EKF-SLAM are mostly due to linearisation errors.
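Returning to the data association step described at the beginning of this section, a minimal Matlab sketch of the Euclidean nearest-neighbour search with the descriptor validity check could look as follows; the thresholds, array layouts and function name are assumptions made for illustration.

    % Sketch of Euclidean nearest-neighbour association with a SURF
    % descriptor validity check (normalised scalar product), as in SJSAM.
    % landmarks: 3 x N positions, landmark_desc: 128 x N stored descriptors
    % z: 3 x 1 measured position, d: 128 x 1 measured descriptor
    function idx = associate(z, d, landmarks, landmark_desc, max_dist, min_score)
        dists = sqrt(sum((landmarks - z).^2, 1));   % Euclidean distances
        [dmin, idx] = min(dists);
        if dmin > max_dist
            idx = 0;                                % too far away: new landmark
            return
        end
        ld = landmark_desc(:, idx);
        score = (d' * ld) / (norm(d) * norm(ld));   % normalised scalar product
        if score < min_score                        % e.g. 0.6 in SJSAM
            idx = 0;                                % descriptor mismatch: reject
        end
    end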

7.8 Similar Work

The Sparse Local Submap Joining Filter (SLSJF) [18] is similar to SJSAM in that it uses a hierarchical approach with submaps. SLSJF also stores each start/end pose of the submaps in the state vector. However, SLSJF uses an Extended Information Filter (EIF) representation for the submap merging instead of the SAM representation.

The Iterated Sparse Local Submap Joining Filter (I-SLSJF) [19] is an improvement of SLSJF that is more costly, but also capable of producing better estimates. It works like SLSJF but employs a special step if the change in the state estimate of the robot becomes too large between two submaps. This step recalculates the entire information matrix and information vector using new and improved linearisation points, similar to the relinearisation step in SJSAM. SLSJF and I-SLSJF retrieve the state vector x from the information matrix I and the information vector i by solving the linear equation Ix = i, i.e. a least squares problem similar to that in SJSAM.

The most obvious difference between SJSAM and I-SLSJF is that I-SLSJF uses EIF instead of SAM for the submap merging. In other words, SJSAM and I-SLSJF use very different representations of the data. Another difference is that I-SLSJF only performs smoothing (recalculation of the matrix and vector) occasionally, while SJSAM in its current version recalculates the measurement matrix and vector in each sample. A third difference is that I-SLSJF does not use a specific prediction step.

Chapter 8

Laser Scan Registration

Laser scan registration, or scan matching, is a very useful method for building maps based on range images. Matching is performed to deduce how a robot or equivalent has moved between two consecutive scans, based on the parts that are common between multiple scans. There are various ways to solve this problem; the most common is probably the Iterative Closest Point (ICP) method described in [7]. Another method is the Normal Distributions Transform (NDT), described in [8, 24]. Both methods are suitable for two and three dimensions, although ICP is best applied in situations where the motion is small or approximately known. In this case there is no inertial measurement unit (IMU) or other sensor available, so predictions will rely on a model.

8.1 Iterative Closest Point

ICP [7] is an algorithm employed to minimise the difference between two scans and can be used, for example, to localise robots or reconstruct 3D surfaces. The complexity of one iteration is in the worst case O(N_Y N_X), where N_Y is the number of points in the model scan and N_X is the number of points in the data scan. The model scan refers to the point cloud acquired at time t, which can be seen as a reference cloud. The data scan is the point cloud from time t + 1 for which the rotation and translation are computed. If the matching is perfect, the data scan will be rotated and translated to exactly match the model scan.

The quadratic complexity of ICP makes it rather slow for large scans. It is however not necessary to use all available data from both scans. According to [24], ICP gives acceptable results down to 8 % of the data scan, but it is still favourable to use the entire scan as model, which gives a rather high computational cost. Another common way to speed up ICP, besides sampling, is the use of kd-trees, since they are a favourable way to search for the nearest neighbour if N >> 2^D, where N is the number of points in the data set and D is the dimensionality of the set. If N and D are small, the efficiency will be similar to an exhaustive search.

Since the ICP algorithm converges to the nearest local minimum, there can be problems if the movement between two scans is large.
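As an illustration of the basic point-to-point ICP step discussed above, the following Matlab sketch performs one iteration with a brute-force nearest-neighbour search and an SVD-based rigid alignment; it leaves out the sampling, distance rejection and kd-tree acceleration used in practice, and all names are assumptions.

    % Minimal sketch of one point-to-point ICP iteration in 2D or 3D.
    % X: D x Nx data scan, Y: D x Ny model scan (columns are points).
    function [R, t] = icp_iteration(X, Y)
        Nx = size(X, 2);
        matches = zeros(1, Nx);
        for i = 1:Nx                              % brute-force O(Nx*Ny) search
            d2 = sum((Y - X(:, i)).^2, 1);
            [~, matches(i)] = min(d2);
        end
        P = Y(:, matches);                        % matched model points

        mx = mean(X, 2);  mp = mean(P, 2);        % centroids
        W  = (P - mp) * (X - mx)';                % cross-covariance
        [U, ~, V] = svd(W);
        R = U * V';                               % optimal rotation
        if det(R) < 0                             % guard against reflections
            U(:, end) = -U(:, end);
            R = U * V';
        end
        t = mp - R * mx;                          % optimal translation
        % The data scan is then updated as X = R*X + t and the step is
        % repeated until the change becomes small.
    end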

In those cases an initial guess can be necessary to obtain the global minimum. The following refinements can make the algorithm less sensitive to local minima:

- Removal of outliers
- Use of other metrics (point-to-surface instead of point-to-point)
- Use of additional information, e.g. colour or curvature

8.2 Normal Distributions Transform

The basic idea behind the Normal Distributions Transform (NDT) is to use normal distributions for matching point clouds. Each point cloud is divided into a grid; the grid elements will be called cells from now on. For each cell, the mean and covariance are calculated and used to approximate a normal distribution, which represents the probability of the cell being occupied by a point. The benefit of this representation is that the point cloud becomes piecewise continuous and differentiable, so the matching of two clouds can be solved using Newton's algorithm [13, page 86]. NDT for two dimensions is described below, followed by a description of 3D-NDT. The main difference between 2D-NDT and 3D-NDT is the dimension; most notably, the sizes of the vectors and matrices increase.

8.2.1 2D-NDT

The first step is to divide the space occupied by the model into regularly sized cells, in this case squares. For each cell that contains more than a minimum number of points, the expectation value µ and the covariance matrix Σ are calculated as

\mu = \frac{1}{n} \sum_{k=1}^{n} x_k,   (8.1)

\Sigma = \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \mu)(x_k - \mu)^T,   (8.2)

where x_{k=1,...,n} are the points contained in the cell. Each cell is then modelled as a normal distribution describing the distribution of the points. The probability of finding a point x in the cell is thus given by

p(x) = \frac{1}{c} \exp\left( -\frac{(x - \mu)^T \Sigma^{-1} (x - \mu)}{2} \right).   (8.3)

The constant c is just for normalisation and can for simplicity be set to one.

There is one obvious problem when using the covariance matrix, and that is a world without noise. Imagine all points on a line with zero noise; this makes Σ singular, and it can therefore not be inverted. This is also a problem in practice, even when small measurement noise is present, since Σ can become close to singular. To prevent this, the quotient between the smallest and the largest eigenvalue is checked; if it is smaller than 10^{-3}, the smallest eigenvalue is set to 10^{-3} times the largest one.
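A minimal Matlab sketch of the per-cell computations in (8.1)–(8.3), including the eigenvalue safeguard against a near-singular covariance described above, could look as follows; the 2D layout and the helper name are assumptions.

    % Sketch of the per-cell statistics (8.1)-(8.2), the eigenvalue
    % regularisation and the probability (8.3) with c = 1.
    % points: 2 x n points that fell into one cell, x: 2 x 1 query point
    function p = ndt_cell_probability(points, x)
        n     = size(points, 2);
        mu    = mean(points, 2);                   % Eq. (8.1)
        D     = points - mu;
        Sigma = (D * D') / (n - 1);                % Eq. (8.2)

        [V, L] = eig(Sigma);                       % guard against singularity
        lam    = diag(L);
        lam    = max(lam, 1e-3 * max(lam));        % clamp small eigenvalues
        Sigma  = V * diag(lam) * V';

        q = x - mu;
        p = exp(-0.5 * (q' / Sigma) * q);          % Eq. (8.3)
    end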

Another problem is the discretisation introduced when dividing the model into cells. This problem can be reduced by using overlapping grids. In this implementation, four overlapping grids are used: one grid with cell size l × l is placed first, then a second one shifted (l/2, 0), a third one shifted (0, l/2) and finally a fourth one shifted (l/2, l/2). This implies that every point falls into four cells, except for points near the edge. By creating a grid around the points it becomes simple to determine which cells contain a certain point. Let x_i be a point in cell c_k; this gives c_k = ⌊x_i⌋, where ⌊·⌋ denotes rounding towards negative infinity, also known as the floor function.

For the two dimensional problem there are three parameters to be optimised. These are collected in the parameter vector p = [t_x  t_y  φ]^T, where t_x and t_y are the translation and φ is the rotation angle. Using counter-clockwise rotation, the spatial mapping between two scans is given by

T(p, x) = \begin{pmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{pmatrix} x + \begin{pmatrix} t_x \\ t_y \end{pmatrix}.   (8.4)

For each pose given by the vector p, the probability density function (PDF) is evaluated for all points, and the negative sum is the score function to be minimised. Given a set of points X = {x_1, ..., x_n}, a pose p and the transformation T(p, x), the score as a function of the pose is

s(p) = -\sum_{k=1}^{n} p(T(p, x_k)).   (8.5)

Since s is piecewise continuous and differentiable, Newton's algorithm can be used to solve the optimisation problem of matching two point clouds. In each iteration the linear system H Δp = -g is solved, where g and H are the gradient and the Hessian of s, i.e. its first and second derivatives. With \tilde{x}_k denoting the transformed point T(p, x_k) with the mean µ of its cell subtracted, these are

g_i = \frac{\partial s}{\partial p_i} = \sum_{k=1}^{n} \tilde{x}_k^T \Sigma^{-1} \frac{\partial \tilde{x}_k}{\partial p_i} \exp\left(-\frac{\tilde{x}_k^T \Sigma^{-1} \tilde{x}_k}{2}\right),   (8.6)

H_{ij} = \frac{\partial^2 s}{\partial p_i \partial p_j} = \sum_{k=1}^{n} \exp\left(-\frac{\tilde{x}_k^T \Sigma^{-1} \tilde{x}_k}{2}\right) \left( \left(\frac{\partial \tilde{x}_k}{\partial p_j}\right)^T \Sigma^{-1} \frac{\partial \tilde{x}_k}{\partial p_i} + \tilde{x}_k^T \Sigma^{-1} \frac{\partial^2 \tilde{x}_k}{\partial p_i \partial p_j} - \left(\tilde{x}_k^T \Sigma^{-1} \frac{\partial \tilde{x}_k}{\partial p_i}\right)\left(\tilde{x}_k^T \Sigma^{-1} \frac{\partial \tilde{x}_k}{\partial p_j}\right) \right).   (8.7)

Solving the linear system gives Δp, which represents the translation and rotation update for one iteration. By adding Δp to p in every iteration, the total translation and rotation is acquired; convergence is reached when the elements of Δp become small. With all the equations given, the NDT algorithm can be described step by step as in Algorithm 5.
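Before the full algorithm is given, the following Matlab sketch shows one Newton iteration for 2D-NDT using the derivatives in (8.6)–(8.7). To keep it short, every point is scored against a single cell with known mean and covariance; a real implementation looks up the cell of each transformed point as in Algorithm 5, and all names are assumptions.

    % Sketch of one Newton iteration for 2D-NDT, pose p = [tx; ty; phi].
    % X: 2 x n data points; mu, Sigma: mean and (regularised) covariance of
    % the single cell used here for brevity.
    function [dp, s] = ndt_newton_step(p, X, mu, Sigma)
        n  = size(X, 2);
        g  = zeros(3, 1);  H = zeros(3, 3);  s = 0;
        c  = cos(p(3));  sn = sin(p(3));
        R  = [c -sn; sn c];
        dR = [-sn -c; c -sn];                    % dR/dphi
        Ci = inv(Sigma);

        for k = 1:n
            xk = X(:, k);
            q  = R * xk + p(1:2) - mu;           % transformed point minus mean
            J  = [eye(2), dR * xk];              % dq/dp, a 2 x 3 matrix
            e  = exp(-0.5 * q' * Ci * q);
            s  = s - e;                          % score, Eq. (8.5)
            a  = (q' * Ci * J)';                 % entries q'*Ci*dq/dp_i
            g  = g + e * a;                      % gradient, Eq. (8.6)
            H  = H + e * (J' * Ci * J - a * a'); % Hessian, Eq. (8.7) ...
            H(3,3) = H(3,3) + e * (q' * Ci * (-R * xk)); % ... 2nd-derivative term
        end
        dp = -H \ g;                             % Newton step: H*dp = -g
    end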

Algorithm 5: Register data scan X with model Y using NDT

Data: Two point clouds X and Y.
Result: Translation and rotation, p, between the two point clouds.

Build a cell structure C
for all points y_i in Y do
    Locate the cell c_k that contains y_i
    Store index i in c_k and mark c_k as not empty
for all non-empty cells c_k in C do
    Compute the mean µ_k of the points in c_k
    Compute the covariance Σ_k of the points in c_k
p ← 0 (or an initial guess)
while the elements of Δp are not sufficiently small do
    g ← 0, H ← 0, s ← 0
    for all points x_i in X do
        x_i' = T(p, x_i)
        Locate the cell that contains x_i'
        s ← s - p(x_i')  (see Eq. (8.3))
        Update g  (see Eq. (8.6))
        Update H  (see Eq. (8.7))
    Solve H Δp = -g
    p ← p + Δp
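The "build a cell structure" step of Algorithm 5 essentially amounts to the floor-based index computation described earlier. The sketch below uses a single grid (the four overlapping grids are obtained by repeating it with shifted origins); the data and cell size are placeholders.

    % Sketch of the cell index computation used when building the NDT cell
    % structure (Algorithm 5). One grid only; the overlapping grids are
    % obtained by repeating this with origins shifted by l/2.
    Y  = 5 * rand(2, 500);            % model point cloud (placeholder data)
    l  = 0.5;                         % cell side length
    ij = floor(Y / l);                % integer cell coordinates per point
    ij = ij - min(ij, [], 2) + 1;     % shift so indices start at 1
    sz = max(ij, [], 2);
    idx = sub2ind(sz', ij(1, :), ij(2, :));   % linear cell index of each point
    % Points can now be grouped per cell (e.g. with accumarray or a loop)
    % before computing the per-cell mean and covariance.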

8.2.2 3D-NDT

NDT in three dimensions can be performed in the same way as in the two dimensional case. When adding an extra dimension there are some differences; the rotation matrix becomes more involved and is, for Euler angles, given by

R = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\psi & -\sin\psi \\ 0 & \sin\psi & \cos\psi \end{pmatrix} \begin{pmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{pmatrix} \begin{pmatrix} \cos\phi & -\sin\phi & 0 \\ \sin\phi & \cos\phi & 0 \\ 0 & 0 & 1 \end{pmatrix}.   (8.8)

In this representation ψ is roll, θ is pitch and yaw is given by φ. Euler angles are explained more thoroughly earlier in the report. A preferable representation would be quaternions; quaternions have some favourable properties when it comes to multiple rotations, since a normalised quaternion always represents a valid rotation.

For the three dimensional problem the parameter vector is p = [t_x  t_y  t_z  φ  θ  ψ]^T. Thus, there are six degrees of freedom in this problem. Calculating the Jacobian and Hessian according to Eq. (8.6) and (8.7) can be quite difficult by hand and is preferably done using the Symbolic Math Toolbox in Matlab or some other symbolic computation software. The final expression for the Jacobian is given in Appendix A.

An improvement for NDT in 3D is a function for automatic refinement of the grid size, meaning that the grid size is refined when the algorithm converges. The final value of p is then taken as the initial guess after the grid is refined. In this way the grid size can be quite large in the beginning and smaller as the scans come closer together. The largest benefits are the increased precision and that it simplifies the choice of grid size.
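A small Matlab sketch of the 3D pose transformation with the parameter vector p = [t_x t_y t_z φ θ ψ] follows; the composition order of the rotation mirrors the reconstruction of (8.8) above and may differ from the exact convention used in the implementation.

    % Sketch of the 3D-NDT pose transformation with Euler angles.
    % p: 6 x 1 pose [tx ty tz phi theta psi], x: 3 x N points.
    function y = transform3d(p, x)
        p = p(:);
        phi = p(4); theta = p(5); psi = p(6);
        Rx = [1 0 0; 0 cos(psi) -sin(psi); 0 sin(psi) cos(psi)];          % roll
        Ry = [cos(theta) 0 sin(theta); 0 1 0; -sin(theta) 0 cos(theta)];  % pitch
        Rz = [cos(phi) -sin(phi) 0; sin(phi) cos(phi) 0; 0 0 1];          % yaw
        R  = Rx * Ry * Rz;
        y  = R * x + p(1:3);           % rotate and translate all points
    end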


Chapter 9

Experimental Results With SLAM

The SLAM systems of this report are validated and tested through a number of experiments. Each experiment is accompanied by at least two figures. The first one shows the map from above; all generated maps are in three dimensions, but the results are clearer if the map is shown from above. The estimated trajectory is drawn as a black line and the landmarks are illustrated with red stars. There are also ellipses showing the uncertainty of the landmark locations. These covariance ellipses illustrate 99.5 % confidence bounds, i.e. the true landmark location is inside the ellipse with 99.5 % probability.

The feature extraction is performed with SURF-128. To ensure a high enough number of features matched between the left and the right image, we use a dynamic threshold that is lowered if too few matches are found. As a result, SURF is called multiple times for some stereo pairs. Mahalanobis distances are used for data association. Each feature generated by SURF-128 comes with a 128-dimensional descriptor vector. Each landmark is stored together with the first descriptor vector observed, and this vector is used in data association for validating matches. We require the normalised scalar product between the stored and the measured descriptor vector to be at least 0.75. This has made our implementation much more robust, as spurious matches are mostly rejected.

9.1 Experiments with EKF

This section describes how the 3D visual EKF-SLAM system performs under various premises. It tries to answer questions 3 and 4 in Section 1.5, which are repeated here:

- What will the quality of the map be if the cameras are moved systematically compared to if they are moved more randomly?
- What will the quality of the map be if the cameras are mounted to a robot compared to if they are held in hand?

Note that the laser camera will not be used in these tests. The combination of the stereo camera and the laser camera, i.e. question two in the list in Section 1.5, is addressed in Chapter 11.
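As an aside on how the 99.5 % covariance ellipses mentioned above can be produced, the following Matlab sketch computes the ellipse points from a 2 × 2 landmark covariance; the scale factor is the chi-square quantile for two degrees of freedom, and the function name is an assumption.

    % Sketch of a 99.5 % confidence ellipse for a landmark with mean m (2x1)
    % and covariance P (2x2), as drawn in the map figures.
    function e = confidence_ellipse(m, P)
        k2 = -2 * log(1 - 0.995);        % chi-square quantile, 2 dof (about 10.6)
        [V, D] = eig(P);
        t = linspace(0, 2*pi, 100);
        e = m + sqrt(k2) * V * sqrt(D) * [cos(t); sin(t)];
        % plot(e(1,:), e(2,:)) draws the ellipse around the landmark estimate
    end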

Figure 9.1: The trajectory along which the camera is moved.

9.1.1 Stereo Camera Mounted to a Rig

The stereo camera is mounted to a rig. This implies that no or little movement occurs in the z-direction; neither do the pitch and roll angles vary much. The rig is moved along a taped trajectory in a workshop room, shown from above in Figure 9.1. Note that the rig is moved backward during the final meter of the trajectory. The frame rate is about 10 stereo image pairs per second.

The system produces the map in Figures 9.2, 9.3 and 9.4, with 349 landmarks, in 135 seconds. The taped trajectory that the rig was moved along is also visible in the figure. The standard deviations of the estimates can be seen in Figure 9.5. As can be seen, the uncertainty increases during the first half of the trajectory and decreases at the end, when old landmarks are seen again, i.e. at loop closure. The final estimated position of the rig is only 1.5 centimetres away from the initial position.

By using this estimated trajectory together with the depth and colour information from the stereo camera, it is possible to create full 3D models. Figure 9.6 shows the resulting 3D spatial map.

9.1.2 Stereo Camera Held in Hand

The stereo camera is now simply held in the hand of a person walking the same trajectory as the rig was moved along, but with some movement in the z-direction and in the pitch and roll angles.

Figure 9.2: 3D EKF-SLAM: Stereo camera mounted to a rig, map seen from above.

Figure 9.3: 3D EKF-SLAM: Stereo camera mounted to a rig, seen from the side.

Figure 9.4: 3D EKF-SLAM: Stereo camera mounted to a rig, seen in 3D.

Figure 9.5: 3D EKF-SLAM: Stereo camera mounted to a rig, standard deviations of the rig states (x, y, z position and heading).

Figure 9.6: A 3D model created using the trajectory in this section and depth and colour data from the stereo camera.

Figure 9.7: 3D EKF-SLAM: Stereo camera held in hand.

There is also some movement in the x- and y-directions, resulting from the dynamics of walking. The experiment simulates how the camera would move if it were fixed to e.g. the chest or the helmet of a fireman moving in a controlled way. The map has 356 landmarks and can be seen in Figure 9.7. The final estimated position of the rig is once again only 1.5 centimetres away from the initial position. The standard deviations can be seen in Figure 9.8.

9.1.3 Fireman Looking Around in a Room With the Stereo Camera

This experiment illustrates how a fireman stands in the corner of a room and looks around in it to get an overview. The camera moves somewhat like it would if it were mounted to the helmet of a fireman. However, it is likely that a real fireman would move faster and less predictably. The frame rate in the capture system is only about 10 frames per second, and this is not enough for fast movement, as will be shown later in this section. Figure 9.9 shows the map and Figure 9.10 shows the standard deviations of the states during this experiment.

As stated before, the cameras work at 10 frames per second (FPS). The camera was moved at normal speed in this experiment. Another experiment has been done on the same data, but using only every second image pair, resulting in 5 FPS. This was done in order to find the lower limit for the frame rate. The map can be seen in Figure 9.11. It can be seen that both the map and the trajectory are a bit incorrect, since the fireman should look to the right at the end of the trajectory and more of the room should be visible.

Figure 9.8: 3D EKF-SLAM: Stereo camera held in hand, standard deviations of the states.

Figure 9.9: 3D EKF-SLAM: Fireman standing in a corner and looking around.

Figure 9.10: 3D EKF-SLAM: Fireman, standard deviations of the states.

9.1.4 Sweeping Movement - Validity Check of Angle Estimates

In order to validate that EKF-SLAM estimates the angles and the angular velocities correctly, an experiment has been performed where the rig is rotated 90 degrees. Note that the sample time is about 0.1 seconds during the data collection, but the EKF-SLAM implementation uses 1 second as the sample time. This does not affect the result, other than that the magnitudes of the velocities may seem strange. Figure 9.12 shows the resulting map. Note that this small sweep alone, with only 72 image pairs, is enough to get an overview of the room. From Figure 9.13 it is easy to see that the estimated angle difference between the first and the last state is around 90 degrees. The angular velocity estimates can be seen in Figure 9.14; they are not very noisy, even though the velocities are not measured directly but estimated by the filter.

9.1.5 Mapping of a Conference Room

The map in Figure 9.15 was generated from 353 stereo image pairs using our EKF-SLAM implementation. The camera was fixed to a small table with wheels, which was moved around in a conference room. The feature extraction part takes 162 seconds in total, while the EKF part of the algorithm takes 111 seconds, resulting in 351 landmarks.

Figure 9.11: 3D EKF-SLAM: Fireman, using only every second image pair.

Figure 9.12: 3D EKF-SLAM: Stereo camera sweeping about 90 degrees.

Figure 9.13: 3D EKF-SLAM: Stereo camera sweeping about 90 degrees or 1.6 rad. Angle estimates.

Figure 9.14: 3D EKF-SLAM: Stereo camera sweeping about 90 degrees. Angular velocity estimates.

Figure 9.15: Estimated map with 353 landmarks.

Figure 9.16 shows how the standard deviations of the camera states change over time. Note that the standard deviation of the heading increases in the beginning and decreases at loop closure, just as expected.

9.1.6 Consistency Problems

A longer experiment has also been carried out, resulting in the 818 landmarks seen in Figure 9.17. The computation takes 10 minutes (image processing not included). The standard deviation estimates along the run are presented in Figure 9.18. It is obvious that the estimated heading standard deviation is too small, and the problems seem to start around sample 5, where the deviation shrinks and the trajectory bends. However, the algorithm had few feature matches between successive samples at this specific time, so it is very possible that the bend is mostly due to a lack of good measurements. The lack of matches can however not explain the small heading variance, as it should rather increase at such an event.

9.2 Experiments With SJSAM

The data used for evaluation of SJSAM has been collected using a stereo camera mounted to a trolley. The trolley was moved in a large eight-shape through one large room and one small room. Although the processing and the resulting maps are in 3D, the plots in Figure 9.19 and Figure 9.22 only show them in 2D from above. The data sequence consists of 1661 stereo image pairs gathered during a few minutes.

Figure 9.16: Variance of estimated states in 3D visual EKF-SLAM using a small initial variance on the heading.

9.2.1 Submaps

There are two main principles for when to start a new submap. The one used here is to begin a new submap when the robot has moved more than one meter from the beginning of the current submap. This guarantees that states are evenly spread out along the trajectory. A problem with this principle occurs if the robot moves around at almost the same spot while continuously observing new landmarks, because the EKF will then grow more or less without control. It is however fine if the robot stands still, since few new landmarks will be added. From this perspective, a better approach is to start a new submap when a fixed number of landmarks have been observed. This puts a limit on the size of the matrices in the EKF and guarantees O(1) time complexity.

Figure 9.19 shows the 34 submaps resulting from using the first approach, with one meter between the submaps. The time complexity is O(1), as can be seen in Figure 9.20. The time spent creating these submaps is 14 seconds, excluding the time used by SURF for feature extraction from the images. Around 75 % of the time is spent on Mahalanobis data association. Note that the spike at around iteration 14 occurs because the robot moves forward and backward in the same area, while at the same time the region contains a lot of features.
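A minimal sketch of the submap-start decision discussed above, combining both criteria; all values are placeholders and the landmark budget is an assumption.

    % Sketch of the submap-start decision. In the real system the variables
    % come from the running EKF-SLAM; here they are placeholders.
    pose          = [1.2; 0.3; 0; 0; 0; 0.1];   % current rig state
    submap_origin = zeros(6, 1);                % pose where this submap started
    num_landmarks = 45;                         % landmarks in the current submap
    max_landmarks = 100;                        % assumed landmark budget

    start_new = norm(pose(1:3) - submap_origin(1:3)) > 1.0 ...  % moved > 1 m, or
             || num_landmarks >= max_landmarks;                  % budget reached
    if start_new
        % store the current submap (final pose + landmarks) and reinitialise
        % the EKF with the current pose as the new local origin
    end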

Figure 9.17: Visual EKF-SLAM between two rooms through a corridor.

Figure 9.18: Standard deviation of estimated states in 3D visual EKF-SLAM using a small initial variance on the heading.

Figure 9.19: 34 submaps created in 14 seconds.

Figure 9.20: Time used for each sample in submap creation.

Conditionally independent submaps seem to be a quite robust approach. A much larger data set has been collected in a lab area at Linköping University. The mapped environment consists of five large rooms and the trajectory is about 70 meters. The resulting submaps can be seen in Figure 9.21. It took 495 seconds to create these submaps, not including the time spent on image processing. Note that the true final position is the same as the first position. It may look like this has been correctly estimated by SLAM, but the drift in height is almost 2 meters at the end. This is because of the long loop that is made at the end, where no loop closure can occur. Because of this large drift, SJSAM has not been able to merge the submaps into something reasonable.

9.2.2 Merging of Submaps

SAM merges the 34 submaps into the map in Figure 9.22 in 21 seconds. The trajectories from the EKF-SLAM submaps have been overlaid in the figure to fill in between the skeleton nodes of SJSAM. Relinearisation of the whole problem is performed in each sample. The resulting map has 1275 landmarks and 34 states. Loop closure has occurred, which can be seen by comparing Figures 9.19 and 9.22. The green dot shows the beginning of the last submap. This last submap ends only 6 centimetres away from the start of the first submap.

The loop closure can also be seen in the sparse measurement matrix A in Figure 9.23. This matrix has the same structure as explained in the SAM chapter. A simplified way to imagine the matrix is that landmarks correspond to columns of the matrix and observations of these correspond to rows.

Figure 9.21: 70 submaps created over a large indoor environment with five rooms.

Figure 9.22: 34 submaps merged in 21 seconds into 1275 landmarks and 34 states.

There are a lot of entries in the bottom left corner of the matrix. These are landmarks first seen in the first submaps, but re-observed in the last submap, i.e. loop closure. A third way to see the loop closure is to look at how the variance of the robot changes during the trajectory; this is shown in Figure 9.24.

The time complexity of SAM is supposed to be almost linear in the number of robot poses and map landmarks. This linear tendency can be seen in Figure 9.25, which shows how much time is spent merging each submap into the large map. More time is used at the end because of the loop closure, but exploration of new areas without loop closure is linear.

9.2.3 Summary

SJSAM has created a 3D map of two rooms of 82 square meters in a total of 155 seconds, excluding the time used for feature extraction. Most of this time is spent performing data association. The current implementation is made in Matlab. It is likely that a C implementation would speed up both the feature extraction part and the map creation.

9.2.4 SJSAM on the Workshop Data

SJSAM has also been tested on the same data used in Section 9.1.1. EKF-SLAM starts by producing submaps, shown in Figure 9.26. The error of the final position is approximately 3 centimetres.

Figure 9.23: The sparse measurement matrix A.

Figure 9.24: The standard deviation of the 34 samples.

Figure 9.25: Time used for each sample in the merging part of SJSAM.

The submaps are created in 74 seconds, i.e. half of the time that EKF-SLAM used to produce one big map. The SAM part of SJSAM merges these submaps in a total of 2 seconds. The resulting map can be seen in Figure 9.27. The error is now only about 2 centimetres, and loop closure has occurred.

9.3 Discussion

It has been shown that good maps and trajectories can be produced if the camera is mounted to a rig. When the camera is held in hand, the resulting trajectory is less accurate, but the map still looks fine. In most SLAM applications it is not necessary to have centimetre accuracy everywhere; it is more important to get a consistent overview of the area to be mapped. The experiments show that it is possible to get a good overview map of a room just by looking around from a corner for a few seconds.

It is however important to use a high enough frame rate compared to the movement of the camera. If too few images are taken during a movement, SLAM will not be able to associate new observations with old landmarks. This is especially important if the movement is unpredictable. But if the camera is moved in a very predictable and controlled way, the system dynamics will be able to predict the next pose in a reliable manner even though the frame rate is low.

The key to robust visual SLAM seems to be a high enough frame rate and a good data association implementation.

Figure 9.26: Submaps of a workshop.

Figure 9.27: Merged submaps of a workshop.

The addition of the descriptor validity check in the data association part made the EKF-SLAM implementation much more robust and less sensitive to parameter changes.

When mapping large areas, submaps are a way of making the problem less computationally demanding. EKF-SLAM, and also other algorithms, will sooner or later reach a limit where processing cannot be done in a reasonable time. However, how to best divide the environment into submaps and how to fuse these submaps into a large map remains an open question. SJSAM has proven to be a very fast approach for mid-sized maps. Compared to the EKF it is much faster, and this becomes more and more noticeable as the maps become larger. A potential problem is how to implement better data association in the merging SAM part of the algorithm, but other SAM-based approaches have solved this, so it should be possible in SJSAM as well.

The SJSAM implementation uses Euclidean distances for data association, and this is a big problem when closing large loops. Large acceptance distances between a measurement and a stored landmark are needed, but this affects the robustness of the algorithm so that false matches may be added. Mahalanobis distance is a much more dynamic method and would give SJSAM the possibility to close larger loops. However, when the loop is very large and a lot of drift has occurred, Mahalanobis association will likely fail too. If the only sensor is a camera, one solution is to compare actual images to each other with a correlation method and then apply loop closure constraints on the pose of the rig.

The ultimate SLAM algorithm would have constant time complexity. It is however easy to see that such behaviour is not possible if we want to be able to perform loop closure and not only visual odometry. A linear or close to linear time complexity is more realistic, and the concept behind SJSAM and other locally referenced submap algorithms is close to achieving this.

9.4 Future work

The implementations of EKF-SLAM and SJSAM described in this report could be improved further.

9.4.1 EKF-SLAM

The visual EKF-SLAM system can be improved in several ways, some of which are listed below.

Inertial Measurement Unit (IMU). The system would be more robust if a high frequency IMU was added. An IMU provides acceleration measurements, and these would be very valuable since the camera only measures direct movement. An IMU is often neither expensive nor very hard to incorporate; it could be included through an extra measurement update in the Extended Kalman Filter. One situation where the IMU would be especially valuable is when e.g. a fireman is looking around in a fast and unpredictable way.

Laser camera. The laser camera system described in Section 2.2 can be merged with the EKF-SLAM system used here. The easiest, and possibly also best, approach is to use the relative movement estimates from NDT or ICP as odometry information in the EKF.

Batch data association. The current data association approach uses one measurement at a time and compares it with the landmarks. This can be problematic if the uncertainty of the rig's pose is large. A better approach is to consider all the measurements at the same time and pair them to the landmarks so that the overall data association result is optimised. There are two major techniques for this. The first one is called Joint Compatibility Branch and Bound (JCBB) [25] and performs a tree search; by measuring the joint compatibility of several candidate pairings it can also reject spurious matches. The second approach is Combined Constraint Data Association (CCDA) [2], which is based on graphs. This approach can actually perform data association without any knowledge of the vehicle pose.

9.4.2 SJSAM

The merging part of the SJSAM implementation, i.e. the part that uses SAM, can be improved with any of the following points.

Mahalanobis data association. Mahalanobis data association can be used instead of the simpler Euclidean distance approach. An issue here is how to obtain good estimates of the required covariances in a reasonable time, but this has been addressed in [21].

Batch data association.

Image based loop closure. For large loops where the estimates drift a lot, alternative methods for loop closure are necessary. One approach is to compare images from the stereo camera to each other so that the rig can recognise that it has been at a specific place before, similar to how humans know where they are. This approach may be very computationally demanding, but there are efficient methods. One is called Tree of Words [27]; it abstracts each image into a set of words, and the words from different images can then be compared efficiently. If two images from different time points are similar enough, it would be possible to apply a loop closure constraint on the state of the rig.

Periodic variable reordering. The factor R resulting from the QR-factorisation in SAM is sparse and diagonal, but can be made even sparser by periodic variable reordering of the state vector. This idea is described in [12] and [21]. Variable reordering is especially beneficial at loop closure, since such an event results in many non-zero elements being added to the factor R.

Chapter 10

Experimental Results With Laser Scan Registration

This chapter presents results from experiments with laser scan registration. There are experiments with known translation and rotation, as well as experiments where the laser scans are used to create a trajectory. The registration is performed using the well known ICP algorithm and the more recent NDT algorithm.

10.1 Data Collection

The data is collected in a small room where ground truth is marked with tape on the floor. The camera is moved 25 cm between each data collection. The total translation in the y-direction (forward) is 1 m, and 0.75 m in the x-direction. This gives a total of 20 different poses. At each pose, 15 scans are collected, which are then compared pairwise with scans from another pose. Since the camera is mounted on a tripod, the roll and pitch angles should theoretically be zero degrees between the matches. The results consist of plots for all the states. The first experiments are performed with ICP; later on the same tests are performed with NDT.

10.2 Iterative Closest Point

The ICP implementation used here has one important parameter: the distance rejection limit, which is an upper bound when searching for neighbours. The distance rejection limit can be compared to the grid size for NDT. Another parameter is the maximum number of iterations; this, on the other hand, only affects the result when the limit is actually reached. For the experiments the iteration limit is set high enough that it is never reached. The distance rejection limit, d, is varied just as the grid size will be varied for NDT. The experiments are performed with 2 % of the points in the data scan together with all points in the model scan.

Figure 10.1: Successful registration of 25 cm movement with ICP.

Figure 10.1 presents the result from a registration with ICP using d = 0.5 m. Ground truth is approximately 25 cm in the y-direction; it may differ a few centimetres due to measurement errors and positioning of the camera. Figure 10.1 shows successful registration for all 15 frames in this case. The translation in the x- and z-directions is small enough to be considered acceptable.

Increasing the movement from 25 cm to 50 cm makes no difference for ICP; all the scans are correctly registered, as shown in Figure 10.2. Further increase of the translation leads to one failed registration for ICP, as illustrated by Figure 10.3. Increasing the distance rejection limit to d = 1.5 m has a positive effect: Figure 10.4 shows that all registrations are successful when using the new value for d. Even when the movement is increased further to 1 m, every registration still succeeds.

10.2.1 Rotation

Rotation was tested by placing the camera on a protractor on a table. Figure 10.5 shows the placement of the camera. The camera was then rotated ten degrees at a time, which means that only the yaw angle was changed during the rotation. Due to the current implementation of ICP it was not possible to directly obtain the rotation for independent axes; the implementation returns the rotation matrix and the transformed data set. The rotation angles can be derived from the rotation matrix, but since the definition of the rotation angles is unknown, the value closest to the actual rotation was considered to be yaw.

Figure 10.2: Successful registration of 50 cm movement with ICP.

Figure 10.3: Failed registration of 75 cm movement using ICP.

Figure 10.4: Successful registration of 75 cm movement using ICP. It can be seen that the rotation is a bit uncertain with this large translation.

Figure 10.6 shows the result when the camera has been rotated 10°. It turned out that ICP failed all registrations for the rotation, and tuning of the parameters did not lead to any improvement. Figure 10.6 can be compared to Figure 10.15, which shows the same registrations performed with NDT. Comparing the figures shows that NDT successfully registers the rotation and clearly has an advantage over ICP.

10.2.2 Reducing the Number of Points

It was also necessary to investigate how robust ICP is to changes in the number of points in the data scan. A movement of 50 cm in the y-direction was used as the test case. Registration was first performed with 2 % of the points, with the condition that all registrations must be successful. The number of points in the data scan was then reduced while keeping the same settings.

10.2.3 Creating a Trajectory

As a final test, ICP was used to create a trajectory. The trajectory was created by registering a sequence of consecutive scans and adding the translations and rotations. The scans were collected in the workshop with ground truth marked with tape on the floor. The camera was mounted on the tripod shown in Figure 2.6. Any movement in height is therefore incorrect, provided that the camera was mounted horizontally. Figure 10.7 shows the trajectory when combining the scans.

Figure 10.5: Setup for the rotation test.

Figure 10.6: Result after registration with ICP of a 10° yaw rotation.

Figure 10.7: Trajectory when using ICP to perform the registrations.

The problem with creating a trajectory is that errors accumulate over time. This basically means that one missed registration can wreck the entire trajectory, since every point depends on the orientation and position of the previous point. The trajectory shown in Figure 10.7 has a small drift in height, which is incorrect; the height is not shown in the figure, but the drift stays within an interval of about 0.1 m.

ICP had some major difficulties as soon as the camera started to turn. The problems with turning were also very obvious when testing a 10° rotation, which ICP failed to register. It is clear from the result that this problem is present for smaller angles as well, since the camera moved slowly at about 1 FPS.

Figure 10.8: Successful registration of 25 cm movement.

10.3 Normal Distributions Transform

The following tests were made with the NDT algorithm. NDT has only two important parameters, an initial guess of the motion and the grid size, but note that both of these parameters are vectors. The initial guess in each test was set to p = ( )^T. The grid size, g, had to be varied between the tests, since NDT cannot make a successful registration if the grid size is smaller than the movement. The next section displays results from experiments with different grid sizes and 2 % of the points in the data scan. After that, results for different numbers of sample points are presented.

10.3.1 Grid Size

The results from two matchings are displayed in Figures 10.8 and 10.10. In Figure 10.8 all registrations were successful. The movement was approximately 25 cm in the y-direction, which was also the result from the algorithm. The grid size used for this registration was g = (0.5, 0.5, 0.5) m. Figure 10.10 shows the result when the camera pose has changed 50 cm in the y-direction; the grid size used was g = ( ) m. The figure clearly shows that there is one missed registration, namely scan number 14. Changing the grid size for scan number 14 to g = ( ) m yields the result displayed in Figure 10.11; using this grid size, the registration succeeded. Figures 10.12 and 10.13 show registrations for 75 cm and 100 cm translations using g = ( ) m.

Figure 10.9: Decreased precision due to the larger grid size.

Figure 10.10: One failed registration of a 50 cm movement.

Figure 10.11: Successful registration of all scans after increasing the grid size.

Figure 10.12: Registration of 75 cm translation.

Figure 10.13: Registration of 100 cm translation.

Figures 10.12 and 10.13 show that NDT can be used when the translation is quite large, provided that the grid size is increased. The downside of increasing the grid size is that precision is decreased. Figure 10.9 shows the result when using g = ( ) m for the same scans used in Figure 10.8. Figure 10.9 displays similar results, but the precision has decreased. This illustrates the problem of choosing a good grid size. A small grid size requires a small distance between scans and can also miss large details in a room, e.g. chairs, tables etc. Using a grid size that is too large results in the opposite: scans close together may be interpreted as no movement at all, and small objects may be missed.

The grid size also affects the number of iterations needed for the algorithm to converge. Table 10.1 shows the number of iterations needed for a registration using two different grid sizes. In this case the differences are small, but the grid size can make a big difference in certain cases. The movement in this case should be (25, 25) cm, although it may differ by some centimetres due to measurement errors and positioning of the camera.

Figure 10.14 shows what happened when registering a (25, 25) cm translation with the grid size g = (0.5, 0.5, 0.5) m: almost every other registration fails. This grid size was clearly not large enough for this translation.

10.3.2 Rotation

The NDT algorithm was also tested for small rotations. For this test the camera was placed on a protractor on a table, as shown in Figure 10.5. The camera was turned 10° clockwise and counter-clockwise. For each pose, 15 scans were collected.

Table 10.1: Number of iterations before convergence for two different grid sizes. (Columns: Scan, Grid Size (m), Iterations, Deviation (x, y) in cm; the listed deviations lie between 6 and 8 cm in x and between 2 and 3 cm in y.)

Figure 10.14: Failed registration of a (25, 25) cm movement.

Figure 10.15: 10° counter-clockwise rotation.

The results from the registrations are shown in Figures 10.15 and 10.16, which show 15 successful registrations for each rotation direction. It appears, though, that rotations were more difficult to register correctly compared to a translation without rotation; some tuning of the grid size had to be done for the registrations to be acceptable. NDT apparently has an advantage here compared to ICP, which failed every registration when the camera was rotated.

Figure 10.17 shows the point clouds after registration. The figure shows that the alignment of the point clouds is good. The red dots to the left in the figure are points that appear after the rotation; in other words, these are points that are not common to the two scans. For points that are common to the two scans, the registration appears to be almost perfect. This can also be verified in Figures 10.15 and 10.16, where the rotations are plotted.

10.3.3 Reducing the Number of Points

The NDT algorithm was also tested with a reduced number of points. This was done in the same way as for ICP: first a grid size was chosen so that all 15 scans were correctly registered, then the number of points was reduced gradually while keeping the same grid size. Just as for ICP, NDT registered all tests correctly. Table 10.2 shows the time for completing 15 registrations with ICP and NDT respectively.

Figure 10.16: 10° clockwise rotation.

Figure 10.17: Point cloud seen from above after registration. Blue dots represent the model. The red dots represent the data scan that has been rotated and translated to match the model.

Table 10.2: Registration times for ICP and NDT. (Columns: Points (%), ICP Time (s), NDT Time (s).)

Figure 10.18: Trajectory when using NDT.

10.3.4 Creating a Trajectory

NDT was applied to the same data set as ICP. The resulting trajectory is shown in Figure 10.18. NDT results in slightly more drift in height compared to the results with ICP; the height now varies within an interval of about 0.2 m, twice that of ICP. NDT on the other hand performs much better overall and gives a decent trajectory. By comparing Figure 10.18 to Figure 10.7 it can be seen that NDT results in a smoother and more correct trajectory. However, it is also clear that the precision is not good enough for navigation using only NDT. It appears that each step is a little too short, which results in a trajectory that differs from ground truth. Using NDT as odometry should however be possible, since the registrations are fairly correct.

10.4 Discussion

Both ICP and NDT perform well when it comes to laser scan registration. NDT has the advantage of linear complexity. ICP can be implemented quite efficiently using kd-trees etc., and in the current implementation it is faster than the Matlab implementation of NDT. However, Takeuchi et al. [3] have been able to achieve an average matching time of 0.9 s per input scan for NDT, which shows that registration with NDT can be performed in real time. In the current Matlab implementation one registration can be performed in approximately 2 s, although the time depends on the number of iterations needed for convergence.

The results with the ICP algorithm are equivalent to or better than NDT for simple translations. ICP also has the advantage of fewer parameters. Choosing a good grid size is more difficult than choosing a good distance rejection limit, mainly because of the three values needed to define a grid size for NDT. On the other hand, NDT gives the freedom of using different precision in each direction. Implementing the NDT algorithm with the features suggested in Section 10.5 would make it easier to choose the grid size. A grid size that is refined after convergence would most certainly make it easier to apply NDT to any given data scan.

When trying to register scans where the camera is rotated, NDT clearly has an advantage; ICP fails every registration even for this small rotation. The reason that NDT performs better in this case is probably that the algorithm considers the shape of the point cloud, whereas ICP only tries to minimise the sum of the distances between points. Rotation of the camera introduces new points on one side of the cloud. These new points have nothing in common with the previous point cloud and are supposed to end up on one side after a successful registration. Using ICP, the data scan is centred around the model, which is not correct.

When it comes to creating a trajectory, NDT performs better than ICP. ICP has some problems when the camera is rotated and moves sideways. This is consistent with the results from the test with a 10° rotation, where NDT outperformed ICP. The conclusion is that NDT is the better choice for indoor navigation.

10.5 Future work

For the NDT algorithm there are various improvements that can be made. Some of them are:

- The use of a variable grid size should increase the precision of the matching, as suggested in [3]. This method uses a smaller grid close to the camera to increase precision.

- Introduce an IMU. An IMU would give the ability to make a qualified guess of how the camera has moved. The result would be fewer iterations and the possibility to rely on the IMU if the registration fails.

- Changing the grid size between iterations. This method is used by Magnusson et al. in [24]. The initial iterations use a larger grid; when the algorithm converges, the grid size is refined, making it possible to achieve better precision compared to the initial grid size.

- Make sure that H in Equation (8.7) is positive definite using the method described in [13]. This should fix numerical problems. The method currently used for ensuring that H is positive definite is H = H + λ_min I, where λ_min is the smallest eigenvalue of H.

- Implement NDT in another language, preferably C/C++.

- Make use of quaternions instead of Euler angles, since quaternions have many favourable properties.

Chapter 11

SLAM With the Stereo Camera and the Laser Camera

In this chapter the BumbleBee camera and the CamCube are combined in an effort to use the improved depth from the CamCube where applicable. The chapter starts with a short description of how SLAM can be performed with only the laser camera. It then continues with a description of how the cameras are utilised together and how pixel mapping and fusion of information are performed. The chapter is concluded with results from SLAM with only the laser camera as well as from the combination of the laser camera and the stereo camera. As a comparison, results with only the stereo camera are presented as a reference.

11.1 Laser camera only

The PMDTec laser camera delivers intensity, amplitude and distance images with a resolution of pixels. The amplitude image can be used like a normal image, and because of this it is possible to extract features from it using e.g. SIFT or SURF. Once again SURF is used to extract features from the images. The distance image can then be used to retrieve the distance to these features. SLAM with only the laser camera can be performed in three steps:

1. Extract features from the amplitude image using SURF.
2. Calculate the depth to these features using the distance image.
3. Process this information in EKF-SLAM.

11.2 Laser and Stereo Camera

The laser camera and the Point Grey BumbleBee stereo camera can be combined in SLAM by using pixel coordinates from the stereo camera and their corresponding distances from the laser camera. The laser camera has a smaller field of view than the stereo camera, and this is a problem that needs to be handled. A simple way to work around it is to limit the field of view of the stereo camera to that of the laser camera, but this leads to a big loss of information. A better alternative is to use the depth from the laser camera where it is available, and the depth from the stereo camera outside the FOV of the laser camera. The following list explains the algorithm in this scenario:

1. Extract features from the stereo images using SURF.
2. Calculate the depth to the features outside the FOV of the laser camera using the disparity formula.
3. Acquire the depth to the features inside the FOV of the laser camera using the distance image from the laser camera.
4. Process this information in EKF-SLAM.

11.2.1 Pixel Mapping

The mapping between the cameras is performed using the parameters from the camera calibration in Section 2.4. Pixel mapping is performed between the left stereo image and the image from the laser camera. After feature extraction in the left stereo image, the coordinates of the features are transformed to world coordinates. The world coordinates are then translated according to the values in Table 2.3. Next, the coordinates are transformed back to pixel coordinates using the parameters for the CamCube, and the pixel coordinates are verified against the size of the sensor. For every feature that has a valid pixel coordinate on the CamCube sensor, the depth is extracted from the point cloud acquired by the CamCube. The transformation between pixel coordinates and world coordinates is explained earlier in the report.

Figure 11.1 shows the result of a pixel mapping. The selected pixel is marked with a red star in both images. The error in the pixel mapping is approximately ±4 pixels at worst. The error is larger near the edges of the image, due to the fact that the image from the CamCube is distorted; the mapping might be correct but appears wrong because of the distortion. The point clouds are rectified, and since the distances are acquired from those, the distortion is neglected.
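A minimal Matlab sketch of the pixel mapping described above: back-project a feature pixel with its stereo depth, translate it into the laser camera frame, and re-project it. The camera matrices, the translation and the sensor size are illustrative placeholders, not the calibration values of Table 2.3.

    % Sketch of mapping a pixel in the left stereo image to the CamCube
    % image. All parameter values below are assumptions for illustration.
    function [uv_cam, valid] = map_pixel(uv_stereo, depth)
        K_stereo = [700 0 320; 0 700 240; 0 0 1];   % assumed stereo intrinsics
        K_cam    = [280 0 100; 0 280 100; 0 0 1];   % assumed CamCube intrinsics
        t        = [0.05; 0; 0];                    % assumed offset between cameras [m]
        cam_size = [200 200];                       % assumed CamCube sensor size

        ray = K_stereo \ [uv_stereo(:); 1];         % normalised viewing ray
        Xw  = depth * ray / ray(3);                 % 3D point in the stereo frame
        Xc  = Xw - t;                               % same point in the CamCube frame
        uvw = K_cam * Xc;
        uv_cam = uvw(1:2) / uvw(3);                 % CamCube pixel coordinates
        valid  = all(uv_cam >= 1) && uv_cam(1) <= cam_size(2) ...
                                   && uv_cam(2) <= cam_size(1);
    end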

Figure 11.1: Performed pixel mapping: (a) image from the BumbleBee, (b) image from the CamCube. The pixel in 11.1a is mapped to the pixel in 11.1b.

11.3 Laser and Stereo Camera With Fused Depth Information

The approach described in Section 11.2 can be improved further if the depth from both cameras is used in the FOV of the laser camera. The depth information from the stereo camera can be weighted together with the depth information from the laser camera using the sensor fusion formula. Suppose that y_1 and y_2 are two direct measurements of a parameter, with independent noise:

y_1 = x + e_1,  e_1 ~ N(0, R_1),   (11.1)
y_2 = x + e_2,  e_2 ~ N(0, R_2).   (11.2)

The weighted estimate of this parameter is then given by

R_x = (R_1^{-1} + R_2^{-1})^{-1}   (11.3)

and

x_hat = R_x (R_1^{-1} y_1 + R_2^{-1} y_2).   (11.4)

The following list explains the algorithm in this scenario:

1. Extract features from the stereo images using SURF.
2. Calculate the depth to all these features using the disparity formula.
3. Acquire the depth to the features inside the FOV of the laser camera using the distance image from the laser camera.
4. Apply the sensor fusion formula in Equations (11.3) and (11.4) whenever possible.
5. Process this information in EKF-SLAM.

11.4 Experimental Results

The data used for evaluation of the different approaches has been collected in a small room. The stereo camera and the laser camera were fixed to each other and collected 414 frames each during a walk around the room. The cameras return to the same position as they started in.
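To make the fusion in (11.3)–(11.4) concrete, here is a scalar Matlab sketch with made-up depth values and variances; in the general case R_1 and R_2 are covariance matrices and the same formulas apply with matrix inverses.

    % Sketch of the depth fusion in Eq. (11.3)-(11.4), scalar case.
    % The depth values and variances below are assumptions for illustration.
    y1 = 2.10;  R1 = 0.04;        % stereo depth and its variance
    y2 = 2.03;  R2 = 0.0025;      % laser depth and its variance

    Rx   = 1 / (1/R1 + 1/R2);             % fused variance, Eq. (11.3)
    xhat = Rx * (y1/R1 + y2/R2);          % fused depth estimate, Eq. (11.4)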
11.4 Experimental Results

The data used for evaluation of the different approaches have been collected in a small room. The stereo camera and the laser camera have been fixed to each other and have collected 414 frames each during a walk around the room, with the cameras returning to the same position as they started in.