FAST RECOGNITION AND POSE ESTIMATION FOR THE PURPOSE OF BIN-PICKING ROBOTICS, by ALEXANDER J. LONSBERRY


FAST RECOGNITION AND POSE ESTIMATION FOR THE PURPOSE OF BIN-PICKING ROBOTICS

by

ALEXANDER J. LONSBERRY

Submitted in partial fulfillment of the requirements for the degree of Master of Science

Thesis Advisor: Roger D. Quinn, Ph.D.

Department of Mechanical and Aerospace Engineering
CASE WESTERN RESERVE UNIVERSITY
January 2012

CASE WESTERN RESERVE UNIVERSITY
SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis/dissertation of Alexander J. Lonsberry, candidate for the Master of Science degree*.

(signed) Roger Quinn, Ph.D. (chair of the committee)
Frank Merat, Ph.D.
Jaikrishnan Kadambi, Ph.D.

(date)

*We also certify that written approval has been obtained for any proprietary material contained therein.

This thesis is dedicated to my loving family for all their caring and support through all the years, both good and bad.

Contents

List of Tables
List of Figures
Glossary
Abstract

Chapter 1: Introduction
    What is Object Recognition and Pose Estimation
    Standard 3D Process
        Offline Processing
        Online Processing
    General Use of Technology
    Project Goal
    Paper Overview

Chapter 2: Background Approaches
    Many Tools for Matching
    Global Techniques
    Local Techniques
    Conclusions

Chapter 3: Contributions

Chapter 4: Oriented Point-Pair Features
    The Fundamentals
    Oriented Point-Pair Features
    Data Structure Storage
    Matching Features: Finding Plausible Correlations
    Automatic Selection of Corresponding Model Points
    Filtering Scene-to-Model Point Correspondences
    Finding Rotation and Translation Matrix

Chapter 5: Matching Pipeline
    The Matching Pipeline
    Redundant Sub-Groups
    Sub-Sampling
    Plane Fitting Normal Calculation
    Method I: Creating Sub-Groups
    Method I: Choosing the Sub-Group SSi
    Running the Algorithm
    Removal of Scene Points

Chapter 6: Off-line Model Preparation
    Offline Description
    Mesh Surface
    Sub-Sampling
    Normal Calculation
    Feature Calculation

Chapter 7: Testing
    Testing Setup
    Test Objects
    Scenes
    Quantifying Results
    Testing Iterations
    Results: Laser Scanned Input
        Sampling Radius and Sampling Density
        Sampling Radius and Correspondence Error Limit
        Sampling Radius and Required Correspondence Pairs
        Sampling Radius and Grid Parameter
        Sampling Radius and Model Distance Discretization
        Sampling Radius and Percent Hash Overlap
    Results: Synthetic Data

Chapter 8: Discussion
    Results Summary
    General Discussion
    Parameter Sensitivity
    Synthetic versus Real Data

Chapter 9: Conclusions

Chapter 10: Future Work
    Testing the Current Algorithm
    Improving the Algorithm
    Implementing Bin-Picking System

Works Cited

List of Tables

Table 1 (varying sampling radius and sampling density)
Table 2 (varying sampling radius and correspondence error limit)
Table 3 (varying sampling radius and required correspondence pairs)
Table 4 (varying sampling radius and grid parameter)
Table 5 (varying sampling radius and distance discretization)
Table 6 (varying sampling radius and hash overlap percent)
Table 7 (varying sampling radius and sampling density)
Table 8 General effect of each of the parameters varied during the testing phase
Table 9 A comparison of some 3D scanners

List of Figures

Glossary

x_l : Lower fourth (median of the smallest half of the data).
x_u : Upper fourth (median of the largest half of the data).
A_k : Array whose cardinality is equivalent to the cardinality of the underlying set of L_k. It stores the multiplicity values of the underlying set of the multiset L_k.
C_n : A constraint set by the user. |C| > C_n is required in order to move on to the rotation and translation calculation portion of the algorithm.
C_o : Set of correspondence pairs, pre-filtering.
I_k : Set of model points found from using set intersection for the k-th point.
L_k : Multiset of model points used for the voting/accumulator process.
M' : The set of original model points that have been rotated and translated onto an identical object in the scene.
Ne_ij : A neighboring vertex; the j-th closest point to the i-th scene point s_i.
S_scan : Set of scene points scanned by the laser scanner.
d_angle : Angular quantization step.
d_dist : Distance quantization step.
e_gmax : Maximum geometric error allowed between correspondence pairs; set by the user.
e_cent : The centroid error: the error between the true and estimated position.
e_g : Geometric error between two correspondence pairs.
n_CP : The cardinality of the set of correspondence pairs, pre-filtering.
n_m, n_s, n_ss : The number of points in the model, scene, or sub-group.

p_d : The percent overlap of one bin of the hash table into its neighbor.
ss_k : The k-th scene point in a sub-group.
θ_diff : The difference, measured in degrees, between a point's estimated and true normal vectors.
ξ_s : Fourth spread.
σ_dist : A distance parameter set by the user that controls which scene points are discarded after an object has been found in the scene.
τ_object : The largest dimension of an object.
Sampling circle : The basis of a sub-group. All scene points within the sampling circle are deemed an individual sub-group.
Φ : Automatically selected set of candidate model points corresponding to a given scene point. Any of the model points in the set may be matches to the scene point.
C : Set of correspondence pairs, post-filtering.
CM : The centroid of a set of points.
CP : A correspondence pair: a set of a matching model and scene point.
F( ) : Feature function. Input is two oriented points; output is a 4D feature vector.
H( ) : Hash function. Input is a 4D feature; output is an array of model points.
M : The set of model points.
N : Normal vector associated with a scene or model point. Attached subscripts indicate the point to which it is attached.
P : Dissimilarity matrix initially holding the correspondence error values between correspondence pairs in C_o.

S : The set of scene points.
SS, SS_i : Sub-group of scene points.
T : A 4x4 transformation matrix that maps the original model to an object in the scene.
d, d( ) : Euclidean distance expressed as a scalar or as a function.
f, f_ij : A 4D feature vector; i and j are index values of the scene or model points.
m, m_i : A single model point.
r : The sampling radius of a sampling circle.
s, s_i : A single scene point.
η : The grid parameter that controls the spacing of the grid projected onto the scene. The vertical and horizontal lines of the grid are separated by the scalar value η.
θ, θ( ) : Angle between two vectors expressed as a scalar or as a function.
μ, μ_d, μ_θ : Discretized (normalized) value of the Euclidean distance or of a scalar angular value.
ρ : The number of points (vertices) per unit of surface area (points/mm²).

Fast Recognition and Pose Estimation for the Purpose of Bin-Picking Robotics

Abstract

by ALEXANDER J. LONSBERRY

This thesis presents a novel object recognition engine for the application of bin-picking. The algorithm is capable of quickly recognizing and estimating the pose of objects in a given unorganized scene. Based on the oriented point-pair feature vector, the algorithm matches points in the scene to points on the surface of an original model via an efficient voting process. Groups of features defining a point in the scene are used to find probable matching model points in a precompiled database. Sets of candidate model and scene point-pair matches are created and then filtered based on a geometric consistency constraint. Results show that the algorithm can produce centroid error values of less than 0.55 mm and angular error values of less than 4° without a secondary iterative closest point algorithm. Run-times are in the range of 0.1 to 0.5 seconds to locate a single object.

Chapter 1: Introduction

1.1 What is Object Recognition and Pose Estimation

Object recognition is defined as locating known object(s) in a scene of interest. Pose is defined as an object's position and orientation relative to a specified coordinate system. For the research here, the specific coordinate system is defined by the camera in the 3D laser scanner. Pose estimation is an algorithmic process that uses partial or approximate data from a scene to estimate position and orientation. The work presented here uses object recognition and pose estimation synonymously.

The proliferation of 3D scanning technology makes object recognition in the 3D spatial domain a viable topic of research. The current challenge in this field is to determine an object's pose as quickly as possible. Speed of estimation without loss of accuracy is a chief goal of this research. The main contributions of the work are 1) an approach to identifying key points and 2) a novel engine that robustly clusters key points with an emphasis on speedy recognition.

The central difficulty in computer vision is finding a method to efficiently define the model: a tangible object with unique physical characteristics that is being sought within some scene. The approach presented works via surface matching, matching the surface of the model to the surfaces of objects in a given scene. Matching is done by locating points in the scene that are unique to the model's surface. Having multiple correlated point matches constitutes identification. Following identification, a

transformation can be generated that maps the model onto the scene; thereafter the object's position in scene space is known.

1.2 Standard 3D Process

Many algorithms have been devised for the purpose of object recognition and pose detection. Techniques generally can be divided into two separate phases: online and offline.

Offline Processing

The model of an object is a 3D representation of the object being sought in any scene. Models are generally stored as 3D meshes or as point clouds and can be generated from a variety of software applications or CAD programs. Point clouds represent 3D information by storing a series of vertices or points in space corresponding to the surface of objects present in the scene. Vertices in a point cloud are generally defined in a Cartesian coordinate system, and for each vertex the X, Y, and Z coordinates are stored. A cloud can contain oriented points. An oriented point is one that not only has a position in space but also a normalized direction vector (i.e., a normal vector) associated with it.

Figure 1: Two images of the same object. The first is the mesh (a) of the bottle cap. The second (b) is a point cloud of that same bottle cap.

A mesh is a similar geometrical

convention, but it also includes information describing how points in space are linked. Meshes are a collection of vertices and a list of vertex interconnections that make up the faces. Faces can be triangular, quadrilateral, or other convex polygons. For the purpose of this research only triangular meshes were utilized.

In the offline phase the model is broken down into features. Each individual pose estimation technique generates different features from the model. Feature types will be discussed later in greater detail. These features are stored for online usage in some type of data structure, i.e., a hash table. This offline phase occurs only once, and the saved data structure with the feature information can be accessed repeatedly during the online phase of the process.

Online Processing

The online phase is the process in which the objects are actually sought in any given scene. 3D scene information is gathered from the scene and then software detects and locates the objects being sought. The process can be broken down into a series of generalized steps.

First, the scene must be imaged. Many different types of hardware, such as a LIDAR scanner, can be used for this purpose. The output from the typical scanner will be a point cloud. The resolution and accuracy of the point cloud are heavily dependent on the scanner and scanning method used. The hardware can have an effect on the overall results. For this thesis the focus is on the recognition software and not the input hardware.

Second, there is usually some sort of pre-processing on the input. Filters can be applied to the data. Pre-processing is more prevalent with 2D techniques, though many 3D techniques require some sort of processing as well. Most commonly, a noise reduction or smoothing filter is applied. In the case of 3D data, re-sampling may be necessary. Other filters may also be necessary, though each is computationally expensive; with run-time in consideration, fewer filters tend to give faster results.

Third is to extract the critical features from the scene. Much work has emphasized which features best describe an object in free space (Gibbins D., 2009) (Laga, 3D Shape Analysis, ) (Campbell & Flynn, 2001) (Laika & Stechele, 2007). Features calculated during this phase are computed in the same manner as the model features during the off-line phase and will eventually be compared with those stored from the model. As there is usually a tradeoff between time of calculation and accuracy of pose estimation, the features must be simple to calculate but must still generate robust results overall.

Figure 2: Matching points from the surface of two bottle caps.

20 19 Fourth is the matching process. All algorithms have some mechanism such as correlation to match the scene features to the saved features from the model. Accuracy in matching is imperative. Matching incorrect model features to scene features will inevitably generate poor pose estimations. Many algorithms break this correlating phase into two distinct parts. First a coarse registration is applied followed by a secondary refinement like ICP (Zhang, 1994) (Park, Germann, Breitenstein, & Pfister, 2010) (Halma, Haar, Bovenkamp, Eendebak, & Eekeren, 2010). For sake of speed the goal of this project is to have a single matching scheme devoid of a secondary refinement. 1.4 General Use of Technology In order to interact with the surrounding environment, human beings utilize their vision system. Like living creatures equipped with vision, computerized machines interacting with their environment must be able to see their environment and process information pertaining to it. Some of the uses for computer vision systems are: 1) recognizing particular events, 2) interacting with people and physical structures, 3) controlling processes in an industrial setting, and 4) inspection and analysis in the medical field. Today there is a large demand for 3D vision in a multitude of fields. As computer processors grow faster and 3D hardware less expensive, the burden is to develop powerful object recognition techniques. Vision technology will become more integral to many products as real time recognition techniques become available.

Project Goal

The intended use of this research is for randomized bin-picking applications, an industrial term referring to the removal (picking) of parts from a basket or bin by a robotic manipulator. Speedy and accurate removal of parts depends on many factors such as the gripping mechanism on the end-effector, the 3D scanning hardware, and the pose estimation software. This research is focused on a robust object recognition and pose estimation technique that can, in quick succession, locate more than a single instance of an object in a given scene.

Each scene is composed of a set of unorganized objects inside a bin. Objects can be stacked on top of each other in no particular pattern, as there is no mechanical pre-sorting. For much of the testing, the object scenes are representative of a typical industrial setting. Unlike much research that has focused on recognizing a wide variety of objects in a given scene, the focus here is on recognizing a single type of object. The lack of organization in the scene requires that the technique be able to handle rotation and occlusion. 3D sensors are not perfectly accurate and some noise will inevitably permeate the data; therefore the recognition algorithm must ensure accurate recognition and pose estimation in the presence of noise. The technique described in this thesis can be applied to finding a variety of different objects but focuses on finding multiple instances of a single object such that a robot can accurately pick single objects from a particular container.

22 Paper Overview Chapter 2 is a brief background to object recognition and pose estimation. It details some current methods to describe and locate objects in a 3D scene. Chapter 3 is a description of this thesis' contribution towards object recognition for bin-picking robotics. Chapter 4, describes the recognition and pose estimation algorithm. Features and a scheme to use them for matching are described, as well as the data structure used to store the surface information of the original model. Chapter 5 describes the software implementation. It entails the process: beginning from inputting scene data, and going to the end in which pose is estimated. Chapter 6 describes how the data structure used during the online phase is created. Chapter 7 describes how the algorithm is tested and presents the results. Chapter 8 discusses the results. Chapter 9 contains the conclusions and Chapter 10 is dedicated to future work. Chapter 2 Background Approaches 2.1 Many Tools for Matching To this date, many techniques have been developed for the purpose of object recognition and pose estimation using 3D data. Their focus has been to find a robust and efficient manner to identify objects in a scene. The techniques can be segregated into global and local which are enumerated further in the following sections. 2.2 Global Techniques Techniques in this class attempt to model whole objects with a single vector or single representation. Instead of breaking down models by parts, segmented surfaces, or any

other geometrical constraint, the entire model is used. This is a powerful technique provided objects in the scene have already been segmented; segmentation, though, is computationally expensive. Global approaches are typically good at finding generic shapes but fail with highly occluded data, which is present in most applications.

There are many global 3D techniques that have been devised in prior research efforts. Chen (Chen, Tian, Shen, & Ouhyoung, 2003) introduced Light Field descriptors. For a finite number of viewing directions, a 2D image is generated of the 3D model. The view directions are centered at the object's origin and the view angles are large enough to encompass the entire model in question. Chen analyzes both the boundary and the interior of the 2D model silhouettes created. A vector containing 10 Fourier coefficients describing the boundary of the 2D image and 35 Zernike moments describing the interior of the 2D view is created for each individual view of the model. Finding objects in the scene is done by calculating Light Field descriptors for the scene. Similarity between scene and model is measured by coefficient comparison.

Laga et al. overcame such view-based approaches by spherically parameterizing entire objects into geometry images (GI) (Laga, Takahashi, & Nakajima, Spherical parameterization and geometry image-based 3D shape similarity estimation, 2006). A GI is a spherical-planar domain mapping which eliminates the need for multiple views or silhouettes of the object. During the offline phase the Cartesian vertices making up a mesh are transformed from 3D space to a planar array of geometric attributes. The GI is rotationally and translationally invariant and has shown some success with a large

24 23 model database. This technique may still fall prey to issues of occlusion and scenes that are not segmented. Osada et al. proposed and analyzed Shape Distributions (Osada, Funkhouser, Chazelle, & Dobkin, 2002) which samples a set of points on an object and measures some properties at each of those points (curvature, distance to center of mass, distance between pairs, etc.). Osada actually measured five different properties: angles between three random points on the models surface, the distance between a fixed point and a point on the surface, distance between points on the surface, the square root area of the triangle between three surface points, and the cube root volume of the tetrahedron between four points on the surface. The shape is then represented by a histogram for each property. These shape distributions are invariant to rotation but require normalization for scaling effects. Efficiency is dependent on the types of features measured. As aforementioned, global recognition techniques are good at finding primitives such as spheres and planes. Rabbani used a Hough transform for detection of cylinders in point cloud data (Tahir Rabbani, 2005). The researchers devised a five dimensional (5D) Hough space with Orientation Estimation followed by a secondary Position and Radius Estimation step (Tahir Rabbani, 2005). Schnable et al. designed an automatic detection routine based on random sample consensus (RANSAC) to find basic shapes with point cloud input (Schnabel, Wahl, & Klein, 2007). The input data is randomly sampled and deconstructed into segmented

25 24 shape primitives. Candidate shapes generated from the sampling are then tested against the entire set of data to determine if the shapes are well correlated with the data set. The algorithm is proven to be simple and easy to implement and works well with noise but is best suited for finding primitive shapes and not complete complex objects. 2.3 Local Techniques Local techniques are dependent on point-descriptors. Schemes using point-descriptor features generally identify correlated model and scene points and then group them together to get a final correlation. Point-descriptors must be distinctive such that succinct matching is prevalent with little ambiguity between which model points actually correspond to a certain scene point. Low-dimensionality is generally sought to reduce overhead and computational burden. For robustness, this class of features should be invariant against rotation, scaling effects, and noise. A variety of pointdescriptors have been devised which are delineated by their radius of influence on a particular surface. Several selected local feature based techniques are described below. Point descriptors are geometric representations of the surface at a specific point. Ho et al. describes a multi-scale feature using local surface curvature values (Gibbins & Ho, 2008). Surface curvature is robust in that it is invariant to rotation and translation. The authors devised a measure of curvature called curvedness which is dependent upon k 1 and k 2, the maximum and minimum values of normal curvature respectively, at a point on a surface. Gatzke et al. developed a curvature map about a surface point

26 25 (Gatzke, Grimm, Garland, & Zelinka, 2005). The map size is based on concentric N-rings around the point defined by the local mesh. These curvature maps are invariant and describe a region about a point. Johnson developed a point-region descriptor called spin-images. These spin-images map the 3D content around an oriented point to a 2D domain. Mapped to 2D bins, the points around an oriented point are stored in accumulators. Single points are matches using linear correlation. Multiple matches are filtered and clustered together by geometric correspondence between model and scene pairs. The spin-image algorithm is robust to noise, rotation, translation, and sampling. Frome et al. describes using 3D shape contexts as an extension of 2D shape contexts used in image processing (Frome, Huber, Kolluri, Bulow, & Malik, 2004). A shape-context is very similar to the spin-image. The mapping takes coordinates in Cartesian space and accumulates them into 3D bins defined in spherical coordinates. The spherical shape context is centered on a point in space and the support region is divided into equally spaced bins. Each bin is a weighted accumulator for vertices in space. 2.4 Conclusions Many recognition and pose estimation techniques have been devised, yet there still lacks a fast, robust, and accurate technique for the purpose of recognizing objects in a bin of identical parts. The greatest issue is the relative speed of recognition in a cluttered and occluded environment.

27 26 Chapter 3 Contributions Object recognition for industrial bin-picking applications needs a compact object description for robust recognition. Speed, accuracy, and robustness are important to any object recognition/pose determination algorithm. Many approaches in the past have lacked sufficient speed or accuracy. Global approaches in general are quite limited in this respect. In contrast, there exist a variety of local features that are rather efficient. Most of these features are strongly reliant on the local surface quality and depend on both the quality and resolution of the scene. The approach described in this thesis begins with a global model based on oriented point-pair features, which is subsequently used to match objects in the scene. This approach offers a means of quick pose estimation and object recognition of cluttered point clouds. Features are constructed by the combination of pairs of oriented points. Unlike other approaches, the scheme discriminatorily uses the set of features to first segment then match a group of points to corresponding model points. Model point to scene point matching is based on a voting scheme formed on the basis of set intersection. A set of oriented points on the surface of a given object can uniquely define said object by means of topological and geometrical relationships. A single oriented point can be correlated to a model point by determining distinctiveness from features generated by relating geometric attributes of the point in question to those points neighboring it. By intersecting said features, resultant model correspondences are returned. Point pair oriented points are inherently invariant to rotation and

translation. Symmetry of the model can be an issue but is handled in this work by using a geometrically based filtering approach to matching.

A single matching pipeline is presented; at its heart is the voting/accumulator point matching scheme. The approach uses redundant overlapping sub-groups. Predicated on simplicity of implementation, and designed with parallelization in mind, this method breaks the scene down into overlapping cells. Points bounded within these cells are deemed sub-groups. Each sub-group is processed; if an object can be found inside the sub-group, its pose is estimated. The devised scheme is highly accurate even with high rates of occlusion, making it suitable for high-cycle-rate bin-picking applications.

Chapter 4 Oriented Point-Pair Features

4.1 The Fundamentals

The backbone of the algorithm is to find matching points between the scene and a given model. The model is the term given to the object that one is searching for in a given scene. For the presented research the model and scene are comprised of discrete points or vertices that lie on the surface of the objects (point clouds). It is imperative to find the scene and model points that are most nearly the same in terms of geometric surface location. A matching model and scene point pair is called a correspondence pair, CP = {s, m}, with s ∈ S, where S is the set of scene points, and m ∈ M, where M is the set of all

model points. If a set of points in the scene can accurately be identified as a corresponding group of model points, then the translation and rotation between the object in the scene and the original model can be found. Locating objects in the scene is done so that they can be picked up by a robotic gripper.

4.2 Oriented Point-Pair Features

Oriented point-pair features are the fundamental building block for the surface matching presented in this thesis. Features, which in general are unique representations of specific surfaces, are constructed by the combination of two oriented points (a point-pair). An oriented point is a three-dimensional vertex with an associated direction vector; in this case it is the surface normal N.

Figure 3: Feature created from two model points m_1 and m_2 and their associated normal vectors N_1 and N_2, respectively.

Drost et al. used the oriented point-pair feature and completed matching with an accumulator to find model points and a component of the rotation matrix from model to scene (Drost, Ulrich, Naval, & Ilic, 2010). Oriented point-pairs are a simple and robust way to describe the surface of an object. The feature encodes the relative position and orientation of two specific points on the surface of an object into a single four-dimensional vector, as shown in Figure 3. The figure is composed of two oriented points m_1

and m_2 with associated surface normal vectors N_1 and N_2, respectively. Every point m_i is a 3-tuple (x_i, y_i, z_i) ∈ R^3, representative of Euclidean space. The points are oriented: they have an associated normal N_i. If a surface is implicitly given as a set of points (x, y, z) satisfying G(x, y, z) = 0, then the i-th normal is defined as N_i = ∇G(x_i, y_i, z_i).

A feature is derived from the combination of two oriented points on the surface of an object. Given two points m_1 and m_2 and their associated normal vectors N_1 and N_2, as seen in the figure, a feature is defined as:

F(m_1, N_1, m_2, N_2) = (d, θ_1, θ_2, θ_3) = f_12,    (1)

where,

θ_1 = θ(N_1, d), θ_2 = θ(N_2, d), θ_3 = θ(N_1, N_2),    (2)

d(m_1, m_2) = ‖m_1 − m_2‖.    (3)

Here d is the Euclidean distance between the points and θ_1, θ_2, θ_3 ∈ [0, π] are the angles between the indicated vectors. F( ) is the function that takes the two oriented points as input and outputs the 4D vector.

One cannot expect to find identical features between the model and a similar object in the scene, as that would require identical point cloud representations, which is unlikely. Because of this, the features have to be generalized such that they are less discriminative. Therefore, once a feature is created it is discretized: the four dimensions are normalized/binned by some predefined amount. It is this process that generalizes each feature. Making the matching process feasible, discretization

introduces tolerance into the matching pipeline and is discussed further in the following section. As done by Drost et al., distance is quantized in steps of d_dist and angle in steps of d_angle (Drost, Ulrich, Naval, & Ilic, 2010).

Single point-pair features are not very powerful matching tools by themselves. For any given object, many oriented point-pairs may generate similar feature vectors. It is the combination of features from a set of oriented points, like that in Figure 4, which makes a surface unique. An object can be matched to a model based on the unique combination of features made from a group of points all belonging to the surface of an object in a given scene.

Figure 4: Pictured is a mesh (grey) with associated vertices (red) that are connected (orange). A single point cannot be determined accurately from a single pair of points. Rather, a point is described by the accumulation of features made by connections to surrounding points.

In order to describe the surface of a model, the set of points from the surface of the model is used to create a hash table storing the 4D features. During the online matching process, the algorithm continuously accesses the hash table to look up possible model point matches for any given oriented scene point s_i. Generation of the hash table is described in the following section.
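To make equations (1) through (4) concrete, the following is a minimal illustrative sketch in Python/NumPy (the thesis implementation itself was written in Matlab). The helper names point_pair_feature and discretize are hypothetical, and d_dist and d_angle stand for the user-set quantization steps just described.

```python
import numpy as np

def point_pair_feature(m1, n1, m2, n2):
    """4D oriented point-pair feature F(m1, N1, m2, N2) = (d, theta1, theta2, theta3).

    m1 and m2 are 3D points; n1 and n2 are their unit normal vectors.
    """
    diff = m2 - m1                       # vector from m1 to m2
    d = np.linalg.norm(diff)             # Euclidean distance ||m1 - m2||

    def angle(u, v):
        c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.arccos(np.clip(c, -1.0, 1.0))   # angle in [0, pi]

    return d, angle(n1, diff), angle(n2, diff), angle(n1, n2)

def discretize(feature, d_dist, d_angle):
    """Quantize (d, theta1, theta2, theta3) into integer bins (floor mapping)."""
    d, t1, t2, t3 = feature
    return (int(d // d_dist),
            int(t1 // d_angle), int(t2 // d_angle), int(t3 // d_angle))
```

For example, discretize(point_pair_feature(m1, n1, m2, n2), d_dist=2.0, d_angle=np.radians(10)) yields an integer 4-tuple that can serve directly as a hash-table key.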

Data Structure Storage

Every single oriented point-pair on the model generates a particular feature that is stored in a hash table for reference during the online matching process. A hash table is a particular type of data structure common in computer science. During the hashing process (adding feature vectors to the data structure), each point on the model is paired with every other point on the model to create a number of features.

In order to make matches between objects with different surface samplings, the feature vectors are binned into discrete ranges. The variables d_dist and d_angle quantize the ranges of Euclidean distance and angle measure, respectively. If a given distance value falls between n·d_dist < d < (n + 1)·d_dist, where n is some integer, then d is mapped to the value n·d_dist. The discretization is a floor function. The previous relationship can be rewritten as follows,

μ_d = floor(d / d_dist).    (4)

Here μ_d is the mapped version of d. The equation above also applies to the angular values when d_dist is replaced by d_angle. The original 4D feature vector is transformed in this manner into a new, limited space. The feature vectors can now be rewritten to include this mapping,

F(m_i, N_i, m_j, N_j) = (μ_d, μ_θ1, μ_θ2, μ_θ3) = f_ij.    (5)

For the remainder of the text the function F( ) is taken to include the mapping shown in equation (4), such that the distance and angular values are discretized. The features

once mapped are keys to the hash table. Each key is literally an address to a particular bin or memory block of the data structure. Inside each bin of the hash table is a list of model points; many other points may also be located in the same bin. A single point m_i will inevitably be binned into more than one block of the hash table. A single point can also be put into a particular bin more than once; however, for this thesis only a single instance of m_i is allowed in each bin. If multiple instances were allowed in the same bin, the algorithm would still work properly, and perhaps match more accurately, but the table would become large and incur extra computational cost during the online recognition phase. The features are directional: for a given feature vector f_ij = F(m_i, N_i, m_j, N_j), only m_i is stored in the section of the data structure corresponding to f_ij.

Figure 5: Two similar surfaces, one representing the original model (a) and the other from an object in the scene (b). Between the surfaces there is a set of points, {m_1, m_2} and {s_1, s_2}, that correspond to one another within some tolerance. Part (c) shows that hard discretization would impede matching: the discretized distance between the features would not match. Therefore tolerances need to be applied to overcome this situation.

Feature vectors can be symmetric such that f_ij = f_ji, but are not limited to be so,

and f_ij ≠ f_ji is feasible.

Discretization overlaps are added to this scheme. This deviates from previous work using the same feature vector (Drost, Ulrich, Naval, & Ilic, 2010). The rationale for the additional overlaps is best seen through an example. As seen in Figure 5, there are two identical surfaces. The first surface is from the original model (a) and the second from the same object in a given scene (b). On the surfaces there are two similar sets of oriented points {m_1, m_2} and {s_1, s_2} (associated normal vectors are not drawn). Though they are not in exactly the same position, within some tolerance they can be considered a good matching pair. In this example, because of the hard cut-off, the two features F(m_1, N_m1, m_2, N_m2) and F(s_1, N_s1, s_2, N_s2) are not equivalent, i.e., floor(d(m_1, m_2)/d_dist) ≠ floor(d(s_1, s_2)/d_dist). In order to avoid this occurrence, overlapping is added: there can be more than one resultant feature vector f between two points. This is done by the user setting a relative percentage of d_dist and d_angle; this percentage is deemed p_d. That is, if the distance (or angle) value lies within p_d·d_dist (or p_d·d_angle) of a segment boundary, then two features are created.

For the moment, focus is on the length portion of the feature vector from Figure 5. Given the set of points from the model, the mapped length is given as,

μ_d = floor(d(m_1, m_2) / d_dist).    (6)

Because of tolerances, there could be a second distance measure μ_d2,

μ_d2 = μ_d + 1,  if μ_d + 1 ≤ (d(m_1, m_2) + p_d·d_dist) / d_dist,
       μ_d − 1,  if μ_d > (d(m_1, m_2) − p_d·d_dist) / d_dist,
       ∅,        otherwise.    (7)

If μ_d2 ≠ ∅ then at least a second feature vector is created for this given set of points, as displayed below:

f_121 = (μ_d, μ_θ1, μ_θ2, μ_θ3),
f_122 = (μ_d2, μ_θ1, μ_θ2, μ_θ3).

Corresponding to the example, here the set of model points would have at least two different features, f_121 and f_122. There could be more depending on the angle measurements, as this overlapping is applicable to the angle measurements as well. In the hash table, the model point m_1 will be stored in at least two sections corresponding to the particular keys (f_121 and f_122).

The entire process of hashing these points is done off-line, before the scene-scrutinizing online phase. The term off-line simply means that compilation of the data structure of the model occurs prior to running the primary algorithm: looking through the scene for objects similar to the model. Hashing the model data set can be time consuming; in fact it is much slower than the matching process. It is done only once for a given set of parameters, and the hash table is saved for as long as it is required for use.

The hashing process is outlined in Figure 6. From the surface of the model a feature is created from point m_i to m_j. The feature is a key to a specific

location in the hash. There m_i is stored with any other model points that had a similar feature vector.

Figure 6: The surface on the left is of the original model. Contained on the model's surface are two oriented points m_i and m_j. These oriented points together make a feature which is a key to the database, which is a hash table. Inside the section of the hash corresponding to the key, an array of model points with a similar feature is stored. The model point m_i is added to this array.

Matlab was the programming language of choice for this research and its functionality was used to create the hash table. The containers.Map Matlab data structure was used as the hash table; containers.Map is a general hash table representation. A particular key points to a certain part of the table where the model point indices are stored. The entire model point (x, y, z) is not stored in the hash itself, but instead an index into an array storing the actual point (x, y, z) coordinates.

4.3 Matching Features: Finding Plausible Correlations

Assume there is a sub-group of n_ss oriented scene points SS which is a subset of points from S, the set of all scene points, i.e., SS ⊆ S. Discussion of the automated selection of SS is given in Chapter 5.
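Before moving to the matching step, the offline construction just described can be sketched as follows. This is an illustrative Python sketch, not the thesis code: a plain dictionary stands in for Matlab's containers.Map, the hypothetical point_pair_feature and discretize helpers from the earlier sketch are reused, and the percent-overlap insertion governed by p_d is omitted for brevity. The hash_lookup helper plays the role of the function H( ) used below.

```python
from itertools import permutations

def build_model_hash(model_pts, model_normals, d_dist, d_angle):
    """Offline stage: hash every ordered point pair (i, j), i != j, of the model.

    The key is the discretized feature f_ij; the value is the set of source
    indices i stored in that bin (each index appears at most once per bin).
    """
    table = {}
    n = len(model_pts)
    for i, j in permutations(range(n), 2):
        f = discretize(point_pair_feature(model_pts[i], model_normals[i],
                                          model_pts[j], model_normals[j]),
                       d_dist, d_angle)
        table.setdefault(f, set()).add(i)   # only the first point of the pair is stored
    return table

def hash_lookup(table, s1, n1, s2, n2, d_dist, d_angle):
    """H(F(s1, N1, s2, N2)): model-point indices whose pairs fell into the same bin."""
    key = discretize(point_pair_feature(s1, n1, s2, n2), d_dist, d_angle)
    return table.get(key, set())
```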

A given oriented point ss_k in the sub-group SS has n_ss − 1 related features: 4D feature vectors created by combining ss_k with every other oriented point ss_o, where o = {1, 2, …, n_ss}, o ≠ k. In order to find a likely matching model point for ss_k, the model database is searched. As aforementioned, the hash table data structure stores lists of model points in memory blocks corresponding to particular feature vectors. An intersection operation between the arrays of model points will return a list of possible model points. This is represented as follows,

I_k = H(F(ss_k, N_ssk, ss_1, N_ss1)) ∩ H(F(ss_k, N_ssk, ss_2, N_ss2)) ∩ … ∩ H(F(ss_k, N_ssk, ss_nss, N_ssnss)).    (8)

H( ) is the hash function: the input is a 4D feature vector and the output is the array of model points stored in the memory block corresponding to that particular feature. I_k = {m_1, …, m_n} is the list of possible matching model points: the correspondence list for ss_k. In the perfect case, I_k would contain only a single model point that exactly matches the scene point ss_k. Testing has made it evident that pure intersection of this sort is too discriminant. Instead, an accumulator is created and a voting scheme is used such that likely model-to-scene point correspondences can be found.

An accumulator in this context is a construct to find the model points that most likely match a single scene point. First, a list called L_k is conceived for the k-th point in the sub-group, ss_k. L_k is a multiset where repeated elements are kept. It is an archive of all model points found in the data structure corresponding to features between ss_k and all ss_o.

L_k = H(F(ss_k, ss_1)) ⊎ H(F(ss_k, ss_2)) ⊎ … ⊎ H(F(ss_k, ss_nss)),    (9)

where ⊎ denotes a multiset union in which duplicates are kept. Every unique model point in L_k could be a potential match to the scene point ss_k. To determine which model points are most likely to match ss_k, an accumulator A_k is created. A_k is an array whose cardinality is equivalent to the cardinality of the underlying set of elements in the multiset L_k. The accumulator array stores the multiplicity values of the underlying set. The accumulator array is filled as shown in Figure 7.

Figure 7: Two points s_1 and s_2 are contained on the surface of an object in the scene. The feature created between the two points is a key pointing to an array of model points in the database. Each of these points has a corresponding bin in accumulator space. Every time a model point appears in the hash table lookup, its value in accumulator space is incremented by 1.

For every feature between the point ss_k and the rest of the points ss_o, the hash table is referenced and the accumulator is updated.
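A sketch of the voting loop that fills the accumulator A_k for one scene point follows. It is illustrative only: it builds on the hypothetical hash_lookup helper sketched earlier, and a collections.Counter stands in for the accumulator array.

```python
from collections import Counter

def vote_for_model_points(k, subgroup_pts, subgroup_normals, table, d_dist, d_angle):
    """Accumulator A_k for scene point ss_k: count how often each model point
    appears among the lookups H(F(ss_k, ss_o)) over all other sub-group points.
    """
    votes = Counter()
    for o in range(len(subgroup_pts)):
        if o == k:
            continue
        hits = hash_lookup(table, subgroup_pts[k], subgroup_normals[k],
                           subgroup_pts[o], subgroup_normals[o], d_dist, d_angle)
        votes.update(hits)      # each returned model point receives one vote
    return votes                # multiplicities of the multiset L_k
```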

The number of times a particular model point is accessed is recorded in the accumulator. The more a specific model point is incremented in the accumulator, the more likely it is the scene point's correspondence. One could simply take the model point or points with the largest value in the accumulator as being the best correspondence, but in this case a better approach is sought, as described in the following section.

4.4 Automatic Selection of Corresponding Model Points

Selection of the best model point or points is critical to accuracy. A correspondence could be chosen only by finding the model point with the largest value in the accumulator array, but this is a crude approach; one should select likely candidate points because they meet some validity criterion. A better quantification is done via histogram analysis. Histograms measure the number of model points with specific accumulator values. The points being sought are those in the upper outliers, having relatively large accumulator values, which most readily match well. If outliers do not exist, then it is hard to say with any real assurance that any of the model points correspond to a given scene point. The most likely correspondences are isolated using the fourth spread (Johnson, 1997). The fourth spread identifies observations unusually far from the bulk of the data and is the upper fourth minus the lower fourth,

ξ_s = x_u − x_l,    (10)

where x_u is the upper fourth (median of the largest half of the data) and x_l is the lower fourth (median of the smallest half of the data). Extreme outliers are more than 3ξ_s units above

the upper fourth. This is shown in Figure 8. Correspondences are only created if upper outliers exist:

Plausible Model Points = {m : A_k(m) > x_u + 3ξ_s}.    (11)

If outliers do not exist according to this metric, the data is considered indiscriminate and the scene point is discarded from the sub-group. Given a discriminating data set with outliers, the plausible model points are further constrained by a simple maximum heuristic,

Φ = arg max(A_k).    (12)

Out of the initial set of possible matching model points, only the most likely are kept; the set of these points is Φ. There may be multiple model points in Φ.

Figure 8: A histogram where the horizontal axis represents the number of votes in accumulator space for a given model point, and the vertical axis is the number of model points with that particular number of votes. Matching model points are only selected if outliers exist.
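The fourth-spread selection of equations (10) through (12) can be sketched as below. This is an illustrative Python sketch under the assumption that votes is the accumulator mapping produced above; the exact tie-breaking and small-sample handling of the thesis implementation are not specified here.

```python
import numpy as np

def select_candidates(votes):
    """Keep only model points whose vote counts are extreme upper outliers.

    Fourth spread xi_s = x_u - x_l; outliers exceed x_u + 3 * xi_s. Of those,
    only the maximum-vote points are returned (the set Phi). An empty list is
    returned when no outliers exist (the scene point is indiscriminate).
    """
    counts = np.array(sorted(votes.values()))
    if counts.size < 4:
        return []
    half = counts.size // 2
    x_l = np.median(counts[:half])        # lower fourth: median of smallest half
    x_u = np.median(counts[-half:])       # upper fourth: median of largest half
    threshold = x_u + 3.0 * (x_u - x_l)
    outliers = [m for m, c in votes.items() if c > threshold]
    if not outliers:
        return []
    best = max(votes[m] for m in outliers)
    return [m for m in outliers if votes[m] == best]   # Phi = arg max of A_k
```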

For each of these points a candidate correspondence pair is created between it and ss_k. The word candidate is used because not all correspondence pairs are necessarily correct. Subsequent filtering is necessary to find the correspondence pairs that most accurately locate the object in the scene. This voting process is run for every single point in SS. The result is C_o, the set of all the candidate correspondence pairs created for a given sub-group.

4.5 Filtering Scene-to-Model Point Correspondences

With a set of correspondence pairs C_o = {CP_1, CP_2, …, CP_nCP}, the next phase is to filter these pairs. Though a multitude of correspondence pairs have been created, not all are accurate. It is necessary to include a secondary filtering process which trims away pairs that do not meet geometric consistency constraints. Geometric consistency is defined in terms of the geometric error, e_g, between two correspondence pairs: a difference of Euclidean distances. Given two correspondence pairs CP_1 = {s_1, m_1} and CP_2 = {s_2, m_2}, the correspondence error is calculated as,

e_g = d_gc(CP_1, CP_2) = | ‖m_1 − m_2‖ − ‖s_1 − s_2‖ |.    (13)

With this definition in hand, an n_CP × n_CP symmetric dissimilarity matrix P is formed, where n_CP is the number of correspondence pairs, i.e., n_CP = |C_o|. The dissimilarity matrix is defined as,

P_ij = d_gc(CP_i, CP_j),   i, j = 1, …, n_CP.    (14)

In the square matrix, the entry in the i-th row and j-th column is the correspondence error between CP_i and CP_j. We are looking for the largest group of correspondence pairs inside the dissimilarity matrix that have acceptable geometric error between them. The largest set of correspondence pairs that meets an imposed geometric error threshold is kept; the rest are thrown away. To be specific, we are searching for the largest set of correspondence pairs that meets the imposed condition,

CP_i ∈ C  such that  P_ij = P_ji < e_gmax  for all CP_j ∈ C,    (15)

where C is the largest set of correspondence pairs that meets the criterion and e_gmax is the geometric correspondence error limit set by the user. In Chapter 7 the e_gmax parameter is examined in detail.

In order to find the set of geometrically consistent correspondence pairs, P is modified such that graph theory is applicable. The P matrix is mapped into a matrix comprised of Boolean values. If a particular value P_ij is below the set threshold e_gmax, meaning it has an acceptable geometric error, then it is set to 1; in the opposite case, all values considered too large are set to 0:

P_ij = 1  if P_ij < e_gmax,   P_ij = 0  if P_ij ≥ e_gmax,   i, j = 1, 2, …, n_CP.    (16)

In this form P is equivalent to an undirected graph. An undirected graph is a mathematical abstraction describing the connectivity between objects, an example of which is in Figure 9. Specifically, for the case at hand, the objects are the preliminary correspondence pairs. The symmetric P matrix describes the connection between pairs.
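A minimal sketch of equations (13) through (16), building the Boolean consistency matrix from a list of candidate correspondence pairs, is shown below. It is illustrative Python; pairs and e_gmax follow the definitions above, and the pairs are assumed to be (scene point, model point) tuples of NumPy vectors.

```python
import numpy as np

def consistency_graph(pairs, e_gmax):
    """Boolean adjacency matrix over candidate correspondence pairs.

    Entry (i, j) is True when | ||m_i - m_j|| - ||s_i - s_j|| | < e_gmax,
    i.e. the two correspondence pairs are geometrically consistent.
    """
    n = len(pairs)
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            s_i, m_i = pairs[i]
            s_j, m_j = pairs[j]
            e_g = abs(np.linalg.norm(m_i - m_j) - np.linalg.norm(s_i - s_j))
            adj[i, j] = adj[j, i] = e_g < e_gmax
    return adj
```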

A Boolean true between a given CP_i and CP_j indicates that these two correspondences may represent two actual scene and model point matches. Figure 9 shows a graph and its corresponding associativity matrix. The objects h_i, i = 1, …, 10, are, or are not, linked by an edge. The scalar 1 indicates an edge or link while 0 represents the absence of an edge. The associated matrix in the figure is the connectivity matrix for the diagram. Highlighted in the diagram are the maximal cliques.

Figure 9: An undirected graph and its associated connectivity matrix. For the usage here in the algorithm, a matrix is constructed like the one above and the largest clique is found.

Let G = (V, E) be an undirected graph, where V = {1, …, n} is the list of nodes or vertices and E is the set of ordered pairs (v, w) of distinct vertices. Two vertices are adjacent if (v, w) ∈ E, meaning they share an edge in the undirected graph. A clique is a set κ ⊆ V in which an edge exists between all vertices. A maximal clique is one to which no other vertices in a given graph can be added without violating the criterion. The maximum clique is the largest maximal clique within G. If there are two maximal cliques κ_1 and κ_2, which are sub-graphs of G such that κ_1, κ_2 ⊆ G, and these are the only maximal cliques within G, then the maximum clique is the larger set

of the two (i.e., κ_1 is maximum given |κ_1| > |κ_2|, and, to be complete, κ_2 is maximum if |κ_1| < |κ_2|). It is important to note that not all connections in the matrix P are valid. A single connection standing alone between two correspondence pairs lacks enough evidence to merit a correct match. What is being sought is a group of correspondence pairs that all have connections between them (they all have edges between them in the P matrix). This is a classical mathematical problem: to find the largest group (maximum clique) of correspondence pairs all linked together, each link having correspondence error < e_gmax. The maximum clique is sought using the premise that larger groups tend to give better results in terms of accuracy and robustness: the larger the number of correspondence pairs grouped together, the higher the probability that we have found an object in the scene. In Figure 9, the maximum clique is the set of vertices {h_1, h_2, h_3, h_4}. To find the maximum clique, a simple scheme developed by Balaji et al. is used (Balaji, Swaminathan, & Kannan, 2010).

The output of this filtering process is the largest set of correspondence pairs C that meets the geometric error criteria and is likely to accurately match a surface of the model to the same object in the scene. The resulting set of pairs C moves to the next step in the process provided that the set meets a size constraint C_n to ensure accurate matching (|C| > C_n). Larger groups give more assurance that the group of points adequately matches an individual object. Results of varying this parameter are examined in Chapter 7.
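The thesis relies on the scheme of Balaji et al. for the maximum-clique search; that method is not reproduced here. As a simple stand-in for illustration, a greedy heuristic over the Boolean matrix of equation (16) could look like the following sketch (it returns a large clique, not necessarily the maximum one):

```python
def greedy_clique(adj):
    """Greedy stand-in for the maximum-clique search (not the Balaji et al. scheme).

    Grow a clique from each seed vertex, preferring high-degree vertices, and
    keep the largest clique found. Returns a list of vertex indices.
    """
    n = adj.shape[0]
    order = sorted(range(n), key=lambda v: adj[v].sum(), reverse=True)
    best = []
    for seed in order:
        clique = [seed]
        for v in order:
            if v != seed and all(adj[v, u] for u in clique):
                clique.append(v)
        if len(clique) > len(best):
            best = clique
    return best
```

The returned indices select the filtered correspondence set C, which is then tested against the size constraint C_n.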

Finding Rotation and Translation Matrix

Finding the translation and rotation matrix from the model to the object in the scene, in the scanner's coordinate frame, occurs once the group of correspondence pairs has been filtered. The model can be mapped onto the scene as,

M' = R(M) + T_R,    (17)

where R and T_R are the rotation and translation components, respectively, and M' is the translated and rotated model. Finding the rotation matrix is the complicated part. For accurate estimation the technique described by Horn et al. is used (Horn, Hilden, & Negahdaripour, 1988). The rotation matrix is found as,

R = D(D^T D)^(−1/2),    (18)

where D is,

D = [ S_xx  S_xy  S_xz ;
      S_yx  S_yy  S_yz ;
      S_zx  S_zy  S_zz ].    (19)

For example, some of the components of D are,

S_xx = Σ_{i=1}^{n_C} x_{s,i} · x_{m,i},    (20)
S_xy = Σ_{i=1}^{n_C} x_{s,i} · y_{m,i},    (21)
S_xz = Σ_{i=1}^{n_C} x_{s,i} · z_{m,i}.    (22)

In the summations, n_C = |C| and the subscripts attached to the coordinates indicate whether the point is a scene or model point. To correctly calculate R, however, both sets of points must have their centroids placed at the origin of the reference frame the rotations occur about. Thus the model points M_C and the scene points S_C in C are translated. The translation vectors are dictated by their centroids, which are found as,

CM_SC = (Σ_{i=1}^{n_C} s_{C,i}) / n_C,    (23)
CM_MC = (Σ_{i=1}^{n_C} m_{C,i}) / n_C,    (24)

where CM_SC and CM_MC are the centroids of S_C and M_C, respectively. The translations can be written in homogeneous matrix form as,

T_{MC→O} = [ 1 0 0 −CM_MC(x) ;  0 1 0 −CM_MC(y) ;  0 0 1 −CM_MC(z) ;  0 0 0 1 ],    (25)

T_{SC→O} = [ 1 0 0 −CM_SC(x) ;  0 1 0 −CM_SC(y) ;  0 0 1 −CM_SC(z) ;  0 0 0 1 ].    (26)

Now the original sets of points in C can be translated to the origin as,

S_C' = T_{SC→O} S_C,    M_C' = T_{MC→O} M_C,

where S_C' and M_C' are the scene and model points, respectively, that have been translated. Now the rotation matrix R can be calculated as shown in (18) through (22) using the points in S_C' and M_C'. The next step is to translate the model points back onto the original location in the scene. This is done using the following,

T_{O→SC} = [ 1 0 0 CM_SC(x) ;  0 1 0 CM_SC(y) ;  0 0 1 CM_SC(z) ;  0 0 0 1 ].    (27)

The final transform T that maps the model onto the object in the scene is,

T = T_{O→SC} R T_{MC→O}.    (28)

Now the model M can be mapped to the scene by applying the transformation matrix found in (28),

M' = T M.    (29)

The position of the object in the scene is now known to the robot.

Chapter 5 Matching Pipeline

5.1 The Matching Pipeline

A single pipeline is used to locate model objects in the scene for the purpose of bin-picking. We choose points from the scene, match them to model points, and then preclude them from the remaining points in the scene such that they can no longer be sampled. The overall method is deemed Redundant Sub-Groups. It is so named

because points in the scene can be placed into multiple sub-groups during the initialization.

5.2 Redundant Sub-Groups

Predicated on simplicity and for use with multiple processors, the method described in this section works by creating overlapping, geometrically related sub-groups of points from the set of scene points S. The first step is to sample the scene. Only a small fraction of the scanned output will actually be used for matching.

Figure 10: An outline of the algorithm to locate objects in a given scene. The first step is to create sub-groups and run the groups through the process of point identification. The matching process concludes when enough correspondence pairs are found and a rotation and translation mapping is computed.

Sub-groups are chosen based on location within the scene (i.e., sub-groups contain points that are all near each other). The sub-sampling is such that scene points can simultaneously be in multiple sub-groups, creating redundancy. Redundant overlapping sub-groups increase the probability that objects will be located, as illustrated in the sketch below.
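The grid-based grouping, described in detail later under Method I: Creating Sub-Groups, can be sketched as follows. This is an illustrative Python sketch: make_subgroups is a hypothetical helper, eta (η) is the grid spacing and r the sampling radius defined in that section, and the scene points are assumed to be an N×3 NumPy array with the bin surface roughly in the x-y plane.

```python
import numpy as np

def make_subgroups(scene_pts, eta, r):
    """Overlapping sub-groups from a regular x-y grid of sampling circles.

    Every scene point whose (x, y) projection lies within radius r of a grid
    node joins that node's sub-group; a point may belong to several groups.
    """
    xy = scene_pts[:, :2]
    x_min, y_min = xy.min(axis=0)
    x_max, y_max = xy.max(axis=0)
    subgroups = []
    for gx in np.arange(x_min, x_max + eta, eta):
        for gy in np.arange(y_min, y_max + eta, eta):
            dist = np.linalg.norm(xy - np.array([gx, gy]), axis=1)
            members = np.nonzero(dist <= r)[0]
            if members.size > 0:
                subgroups.append(members)   # indices into scene_pts
    return subgroups
```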

The entire matching pipeline is outlined in Figure 10. After the initialization steps, the iterative matching process takes input from one of the sub-groups and outputs a matching model if one can be found. Once all sub-groups have been examined or eliminated (elimination is described in 5.2.4), the algorithm ends.

Sub-Sampling

The first step is to sub-sample the scene points. This is not the same as choosing sub-groups, which is discussed later. The laser scanner used can output a very dense point cloud, denoted S_scan = {s_1, s_2, …, s_n}. The scanner outputs a point cloud with surface density ρ_scene ≈ 4.5 points/mm², where ρ_scene represents the average number of points in the scene divided by the total surface area in the scene. Because the scanner relies on a camera with a fixed number of pixels, surfaces that are closer to the lens are sampled more densely than surfaces that are farther away; point density also decreases as a surface faces away from the camera lens. ρ_scene is large enough on average that scenes are composed of roughly 100,000 points. If all of the scene points were used for feature creation (i.e., features created between every pair of points), then the algorithm would be very slow. On the other hand, the large density of points makes the calculation of associated normal vectors more accurate. To reach the speeds required by a robotic bin-picker, the scene is sub-sampled to reduce the size of the data set. Given the original S_scan, the sub-sampled points S are related to S_scan as S ⊆ S_scan. Given the original surface density ρ_scene, the density after sampling is ρ_sub, where ρ_sub < ρ_scene. The key is to sub-sample the scene such that ρ_sub is large

enough to describe the object surfaces but small enough to run quickly. It should be noted that the runtime depends on ρ_sub in a nonlinear fashion: the runtime is O(n²), where n is the number of points used. Instead of applying a computationally expensive sub-sampling algorithm to ensure a specific spacing between sub-sampled points, a simple random selection is used. The results with random sub-sampling have proven successful, though further work may be merited in applying a Poisson-disk sub-sampling filter to ensure equivalent spacing between sampled points. Once S has been selected, the associated normal vectors for these points must be calculated. Since the features are based on oriented points, the normal vectors are only calculated for the selected sub-sampled points, but all the points in the scene are used for this calculation so that the normal vectors are as accurate as possible. This normal calculation is discussed in the following section.

Plane Fitting Normal Calculation

For each point s_i ∈ S, an associated normal vector N_i is found. To ensure the estimation of these normal vectors is as accurate as possible, all points in the scene are used. For a given scene point s_i, it and a user-specified number of its closest neighboring points (6 neighboring points were used for this thesis) are used to fit a plane using a least-squares regression. The j-th neighboring point for the i-th scene point, Ne_ij, does not have to be in the sub-sampled set; the only condition is that Ne_ij is part of the original scan: Ne_ij ∈ S_scan.
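The least-squares plane fit described here, which the derivation on the following pages solves via the singular value decomposition, can be sketched as below. It is an illustrative Python/NumPy sketch with estimate_normal a hypothetical helper; neighbors would hold the user-specified number of nearest points drawn from S_scan.

```python
import numpy as np

def estimate_normal(point, neighbors):
    """Least-squares plane normal for a scene point and its nearest neighbors.

    The normal is the right singular vector of the centered neighborhood
    matrix Q associated with the smallest singular value, i.e. the direction
    of least variance of the local surface patch.
    """
    pts = np.vstack([point, neighbors])
    q = pts - pts.mean(axis=0)                 # subtract the centroid (x_o, y_o, z_o)
    _, _, vt = np.linalg.svd(q, full_matrices=False)
    normal = vt[-1]                            # smallest-singular-value direction
    return normal / np.linalg.norm(normal)
```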

Calculation of the associated normal vectors is done by fitting a plane to a group of points surrounding the surface point in question; the estimated plane's normal vector becomes the associated normal vector for that scene point. A least-squares approach is used. The function to be minimized is the sum of squared distances from the plane to the neighboring points,

g(a, b, c, d) = \sum_{j=1}^{N_e} \frac{(a x_j + b y_j + c z_j + d)^2}{a^2 + b^2 + c^2},    (30)

where a, b, c, and d are the constants in the equation of the plane and (x_j, y_j, z_j) are the coordinates of the neighboring vertices. If the centroid of the vertex and its neighbors is (x_o, y_o, z_o), equation (30) can be written as

g(a, b, c) = \sum_{i=1}^{N} \frac{\left( a(x_i - x_o) + b(y_i - y_o) + c(z_i - z_o) \right)^2}{a^2 + b^2 + c^2}.    (31)

This can be put into matrix form with

v^T = [a \;\; b \;\; c],    (32)

Q = \begin{bmatrix} x_1 - x_o & y_1 - y_o & z_1 - z_o \\ \vdots & \vdots & \vdots \\ x_N - x_o & y_N - y_o & z_N - z_o \end{bmatrix}.    (33)

Substituting into (31) gives

g(v) = \frac{v^T Q^T Q v}{v^T v}.    (34)

Defining W = Q^T Q, (34) becomes a Rayleigh quotient,

g(v) = \frac{v^T W v}{v^T v}.    (35)

According to Rayleigh's principle, the quotient is minimized by the eigenvector of W corresponding to its smallest eigenvalue (Strang, 1988). To find the eigenvalues and eigenvectors of W, the singular value decomposition (SVD) of Q is used; the eigenvector corresponding to the smallest eigenvalue is the plane's normal vector. The SVD of Q is

Q = U S V^T.    (36)

Substituting (36) into W = Q^T Q and noting that U is orthogonal gives

W = V S^T S V^T.    (37)

Substituting W into equation (35) results in

g(v) = \frac{v^T V S^T S V^T v}{v^T v}.    (38)

With this decomposition of W, the columns of V are the eigenvectors and S^T S is a diagonal matrix containing the eigenvalues. To minimize the quotient, v is set equal to the eigenvector in V corresponding to the smallest eigenvalue in S^T S. V and S are found using the svd() function in Matlab, and v, the sought normal vector, follows directly.
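The following MATLAB function is a minimal sketch of this plane-fit normal estimation; the function name and the assumption that the neighbourhood is passed in as a matrix are illustrative only.

```matlab
% Minimal sketch of the plane-fitting normal estimation (illustrative name).
% P is a k-by-3 matrix holding a scene point and its nearest neighbours from S_scan.
% (P - c uses implicit expansion, available from MATLAB R2016b; use bsxfun on older releases.)
function n = plane_fit_normal(P)
    c = mean(P, 1);                 % centroid (x_o, y_o, z_o) of the neighbourhood
    Q = P - c;                      % centred coordinates, the matrix Q of equation (33)
    [~, S, V] = svd(Q, 'econ');     % SVD of Q; the columns of V are the eigenvectors of W = Q'*Q
    [~, k] = min(diag(S));          % smallest singular value <-> smallest eigenvalue of W
    n = V(:, k);                    % unit normal minimising the Rayleigh quotient (35)
end
```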

Method I: Creating Sub-Groups

Creating the overlapping sub-groups follows the normal vector estimation. The input to this stage is only the set of sampled points S. The process is illustrated in Figure 11. Instead of randomly choosing points, an evenly spaced rectangular grid is constructed and overlaid onto the scene. The grid spacing is denoted by the grid parameter η, which is set by the user; the experimental testing described in Chapter 7 examines how η influences the accuracy and speed of the algorithm. At each grid node a sampling radius r is specified; all radii are equivalent. Sampling here is redundant in that the same point can be sampled multiple times and placed into multiple sub-groups, depending on the grid and sampling radius.

Figure 11 An outline of the process in which sub-groups are created from the set of sub-sampled points. In (a) the scene is shown in its entirety; this is all of the points output from the scanner. (b) shows a sampled version of (a): not all of the scanned points are used, for computational efficiency. In (c) a grid is formed; at the intersection of the vertical and horizontal lines a vertex vert_i is created. At each vertex a sampling radius emanates outwards, as seen in (d). The highlighted points in (e) are inside the sampling radius and as such are part of sub-group i. In (f) all of the sub-groups are shown, so that one can see that they do overlap.

With reference to Figure 11, the first step of the process is to take the input data and evenly grid the space. All of the scanned data used in this research came from objects on a flat surface in the x-y plane relative to the scanner. Though the parts lie in three-dimensional Cartesian space (x, y, z), the sampling process occurs in the two-dimensional plane (x, y), i.e. a projection ℝ³ → ℝ². The scene is comprised of two bottle caps, as shown in Figure 11 (a). Part (b) shows the scanned point cloud. In part (c) a black grid is overlaid onto the scene in the x-y plane (z = 0). Each vertex represents the center of a cylinder radiating outwards. In part (d) a single circle, referred to as the sampling circle, is placed onto the grid with its center at one of the grid's intersections. Highlighted in yellow in part (e) are the points of the sub-group SS_i. For every vertex in the grid a similar circle is overlaid and a new sub-group is found; this can be seen in part (f), where a circle emanates outward at every vertex. Because the algorithm has no knowledge of where the objects in the scene are, we cannot predict where to place sub-groups. Therefore a large number of overlapping sub-groups is created to ensure the objects are found. An object is most likely to be located if it is contained wholly inside a sub-group: if the surface of an object, represented by a point cloud, is completely contained by the sub-group, then we have the maximum probability of locating the object, since the maximum number of oriented points and associated features is available. Why, then, oversample the scene? We oversample because we cannot simply pick out sub-groups that are guaranteed to contain a single object; with arbitrarily placed sub-groups we may include points from many objects. In order to find all the objects, we overlap the sub-groups across the scene space to ensure that every object is covered. A sketch of this gridded sub-group construction is given below.
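The MATLAB function below is a minimal sketch of the gridded, overlapping sub-group construction under the assumptions above (flat x-y workspace, equal radii at every grid vertex); the function name and inputs are illustrative.

```matlab
% Minimal sketch of overlapping sub-group creation (illustrative names).
% S is an m-by-3 matrix of sub-sampled scene points, eta the grid spacing (mm)
% and r the sampling radius (mm). Each cell of 'subgroups' holds point indices.
function subgroups = create_subgroups(S, eta, r)
    xs = min(S(:,1)) : eta : max(S(:,1));        % grid vertices in x
    ys = min(S(:,2)) : eta : max(S(:,2));        % grid vertices in y
    subgroups = {};
    for x = xs
        for y = ys
            d = hypot(S(:,1) - x, S(:,2) - y);   % planar (x, y) distance to the grid vertex
            idx = find(d <= r);                  % points inside this sampling circle
            if ~isempty(idx)
                subgroups{end+1} = idx;          %#ok<AGROW> a point may land in several sub-groups
            end
        end
    end
end
```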

Method I: Choosing the Sub-Group SS_i

Accuracy and fast recognition are antagonistic. It is conceivable to achieve near-perfect pose estimation if an object in the scene is sampled very densely and every point is used by the algorithm, but as more points are added, the processing time increases in a non-linear fashion. This non-linearity is built into the algorithm: for each point added, a new feature is created between the new point and every other point already present, so the number of features equals n_SS² − n_SS, where n_SS is the number of points in a particular sub-group. Because the objects are strewn at random and the centers of the sampling circles are evenly spaced, some sub-groups will be more populous than others. The question is which sub-group the algorithm should start with. For the research presented here this is done in a very simple manner: after all the sub-groups have been created, the average sub-group size is found, and the sub-groups are ordered by how close their size is to that mean. This ordering dictates the succession in which sub-groups are processed through the matching scheme; sub-groups with sizes closest to the mean are processed first, as sketched below.
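A minimal MATLAB sketch of this ordering rule; the stand-in cell array of sub-group index lists is illustrative.

```matlab
% Minimal sketch of ordering sub-groups by closeness to the mean size (illustrative data).
subgroups = { (1:52)', (1:75)', (1:64)' };       % stand-in sub-groups of point indices
sizes     = cellfun(@numel, subgroups);          % number of points in each sub-group
[~, ord]  = sort(abs(sizes - mean(sizes)));      % closest to the mean size comes first
subgroups = subgroups(ord);                      % processing order for the matching scheme
```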

Running the Algorithm

Having chosen a particular sub-group SS_i of scene points, the points are processed according to the approach outlined graphically in Figure 10. The process first creates an accumulator array for each of the scene points in the sub-group; outlier-based automatic selection follows the filling of the accumulator. This is done for each point in the sub-group. By the end of this process there exists a set of candidate correspondence pairs. They are deemed candidates because many, but not all, of the pairs correctly match the portion of the scene found inside the sampling radius. These candidates are filtered to find the maximal clique under geometrical constraints. Provided there are enough filtered correspondence pairs, they are used to find the transformation matrix: a combination of the rotational and translational components that maps the model object onto the scene. The scene points belonging to the located object are then removed.

Removal of Scene Points

So that the algorithm does not repeatedly find the same objects, points from the scene representing a located object are removed from the unprocessed sub-groups. This is done much like the initial step of the iterative closest point (ICP) algorithm. Using the found transformation matrix T, the original model points are mapped onto the scene (M′ = TM). All scene points that are within a specified geometric distance of any point m′_i ∈ M′ are removed from the scene. A scene point s_i is removed from the set of scene points S if

d(s_i, M′) = \min_{1 \le j \le n_M} d(s_i, m′_j) \le \sigma_{dist},    (39)

where σ_dist is a user-set parameter. For the research presented, σ_dist = 6 mm, which is 1/10 of the overall size of the objects used (the bottle cap).
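A minimal MATLAB sketch of this removal step is shown below; the stand-in point sets are illustrative, and pdist2 assumes the Statistics and Machine Learning Toolbox is available.

```matlab
% Minimal sketch of the scene-point removal of equation (39) (illustrative data).
S       = rand(2000, 3) * 300;            % stand-in scene points (mm)
M_prime = rand(1000, 3) * 60 + 120;       % stand-in model points already transformed into the scene
sigma_dist = 6;                           % mm, 1/10 of the bottle-cap size
D    = pdist2(S, M_prime);                % all pairwise distances (Statistics Toolbox)
keep = min(D, [], 2) > sigma_dist;        % keep only points farther than sigma_dist from the model
S    = S(keep, :);                        % scene with the located object removed
```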

Chapter 6 Off-line Model Preparation

6.1 Offline Description

The offline portion constructs a feature-based representation of the original object and stores its features in an easily accessible data structure. Each model sought in the scene requires this pre-processing, which occurs only once; the stored feature information can then be used repeatedly.

6.2 Mesh Surface Sub-Sampling

Mesh and point cloud sub-sampling are important for 3D pose estimation and object recognition techniques. For the purposes of this research, the models of the objects being sought are created in CAD software and exported in a variety of formats. Exported models are generally made to be efficient in terms of memory and required graphics hardware, which means surfaces are represented with the fewest vertices and faces possible. Meshes produced in this manner lack evenly spaced vertices across the object's surface. Scanning hardware, on the other hand, outputs a point cloud that is very different in construction from the CAD-generated models: scanned scenes have far more evenly distributed point densities across object surfaces. The density also depends on the angle of the surface relative to the scanner and on the distance from the scanner. Objects that are farther from the scanner are represented by fewer points, and surfaces oriented at a steep angle relative to the scanner are likewise represented by fewer points.

In order to keep the surface densities similar, the scenes used in testing are all limited in size and in distance from the scanner; this is discussed in more detail in 7.1. To create features that closely match those produced by the scanning hardware, it is imperative to sub-sample the model's mesh so that vertices on the model can match points in the scene. Sampling the model is constrained by ρ_model (points/mm²), the average number of points on the surface per unit area. It is unlikely that an object in the scene will be sampled such that its points exactly coincide with particular model points, and clearly the sampling rate of the model and the average density of the scanner output affect the final matching results. What is the optimal ratio between the number of surface vertices and the surface area? During the testing phase a variety of densities were evaluated, and the results are presented in 7.3. Before creating the feature vectors, the model mesh is sampled evenly according to a set value of ρ_model using the Poisson-disk sampling method. This was initially written in Matlab using the dart-throwing method, but it is more efficient to use the open-source MeshLab mesh-editing package, which includes a filter for sub-sampling an input mesh (Cline, Jeschke, White, Razdan, & Wonka, 2009). All mesh model instances used and tested here are sub-sampled with MeshLab for consistency. The result is a set of model points (x, y, z) ∈ M satisfying M_surf(x, y, z) = 0, which are used to develop the features and create the model database.
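The target number of Poisson-disk samples follows from ρ_model and the mesh surface area. The MATLAB fragment below is a minimal sketch of that bookkeeping; the tetrahedron mesh and the density value are illustrative, and the sub-sampling itself is performed by the MeshLab filter, not by this fragment.

```matlab
% Minimal sketch: choosing the Poisson-disk sample count from rho_model and the
% mesh surface area (illustrative mesh and values; the sampling is done in MeshLab).
V = [0 0 0; 60 0 0; 0 60 0; 0 0 60];      % stand-in vertex array (mm)
F = [1 2 3; 1 2 4; 1 3 4; 2 3 4];         % stand-in triangle faces
rho_model = 0.06;                         % desired points per mm^2 (example value)
e1 = V(F(:,2),:) - V(F(:,1),:);           % first edge of every triangle
e2 = V(F(:,3),:) - V(F(:,1),:);           % second edge of every triangle
tri_area  = 0.5 * sqrt(sum(cross(e1, e2, 2).^2, 2));
n_samples = round(rho_model * sum(tri_area));   % sample count requested from the Poisson-disk filter
```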

6.3 Normal Calculation

Oriented point-pair features require a normal vector associated with each model point. All normal calculations for the model vertices used in database generation are completed in MeshLab. Given that the model surface is defined implicitly as the set of points (x, y, z) ∈ M satisfying M_surf(x, y, z) = 0, the i-th normal is defined as N_i = ∇M_surf(x_i, y_i, z_i).

6.4 Feature Calculation

Having the set of oriented points, the next step is to calculate the features. The question is how to find optimum discretization values d_angle and d_dist for hashing the relative angle measures and the distance measures, respectively. The calculation of d_angle remains consistent for each data structure, as it is based on the expected accuracy of the normal calculation; d_dist, on the other hand, is tested experimentally and the results are discussed in 7.2.

An estimate for d_angle has been determined under testing. The accuracy of the normal vectors is highly dependent on the density of the point cloud representing the surface. The method used to estimate the normal vectors of a given scene is based on fitting a plane to a set of neighboring vertices, as described in the plane-fitting normal calculation section of Chapter 5, and its accuracy depends strongly on the surface point density. Sampling at points on the surface with high curvature is an issue: low sampling rates can cause a loss of normal vector accuracy because high-curvature regions may be neglected. This problem is mitigated by increasing the sampling density, which improves the accuracy of the estimated normal directions.

Figure 12 shows a plot of the average angular error versus the surface sampling density. Angular error is measured in degrees and is on the vertical axis. The error is the angle θ_diff between the true normal vector orientation and the estimated orientation, defined as

θ_diff = \cos^{-1}\left( \frac{N_{m,true} \cdot N_{m,est}}{\lVert N_{m,true} \rVert \, \lVert N_{m,est} \rVert} \right).    (40)

Here N_{m,true} and N_{m,est} are the ground-truth and estimated normal vectors for each point on the surface of the model, respectively, and θ_diff ∈ [0, π]. The true normal vectors are estimated accurately in MeshLab from the original model mesh.

Figure 12 Angular difference between the estimated normal vector and the ground-truth normal vector of the points describing the surface of a given object. The independent variable is the surface point density (points/mm²).
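A short MATLAB sketch of equation (40), with illustrative normal vectors:

```matlab
% Minimal sketch of the angular error of equation (40) (illustrative vectors).
N_true = [0 0 1];                                       % ground-truth normal
N_est  = [0.05 -0.02 0.99];                             % estimated normal
theta_diff = acosd( dot(N_true, N_est) / (norm(N_true) * norm(N_est)) );  % degrees
```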

Given that the scenes have ρ_scene ≈ 9.5 points/mm², the measured accuracy of the normal estimation process at this density is used to determine an appropriate d_angle: the discretization is set to the average angular error plus three standard deviations. Rounding this to a whole number gives d_angle = 8°. Setting the angular discretization in this manner is reasonable given the accuracy of the results shown in Chapter 7.

The goal is to set d_dist to a value that does not over-constrain the matching while still being somewhat discriminating. If d_dist is too large, it is indiscriminate and too many features will have identical values, which can lead to poor matching performance: if a single feature returns thousands of possible matching model points, the task of finding a single correct match becomes difficult. If, on the contrary, d_dist is too small, a matching model point may never be found, for two reasons: sampling differences and scanning error. Sampling differences readily occur; even if two surfaces have the same sample density, it is unlikely that two points will lie in exactly the same location. In addition, 3D scanning technology has intrinsic error associated with it. Because of this, the discretization tolerance must be loose enough that model points can still be found. Finding appropriate values for d_dist is described in Chapter 7. A small sketch of how the two discretization values are used is given below.
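The fragment below is a hypothetical MATLAB sketch of how d_dist and d_angle could be used to bin a feature's distance and relative-angle components into integer hash indices; the feature components themselves are defined in Chapter 4, and the specific measures and values used here are examples only.

```matlab
% Hypothetical sketch of discretizing one distance and one relative-angle measure
% of a point-pair feature into hash-key indices (example values).
d_dist  = 0.7;                         % mm, one of the values tested in Chapter 7
d_angle = 8;                           % degrees, set from the normal-error analysis above
dist_measure  = 12.3;                  % mm, example pairwise distance
angle_measure = 47.0;                  % degrees, example relative angle
key = [floor(dist_measure / d_dist), floor(angle_measure / d_angle)];  % integer bin indices
```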

Chapter 7 Testing

7.1 Testing Setup

7.1.1 Test Objects

The research focuses on a single small object: a detergent bottle cap whose largest dimension is denoted τ_object ≈ 60 mm. This model is used for several reasons. It has regions of both high and low curvature: some areas of the cap can be described as a simple cylinder with constant curvature and very few distinct features, while other portions of its surface have very distinct features. The cap is rotationally symmetric about an axis through the object's centroid, perpendicular to the large opening at the mouth of the cap.

7.1.2 Scenes

Testing is done with two types of data: synthetic and laser scanned. The majority of testing is completed with actual bottle cap scans. Because it is very difficult to quantify the error in these laser scans, synthetic data is also tested for comparison with the laser-scanned results.

Scanned Scenes

A set of 5 scanned scenes was used to test the accuracy and speed of the algorithm on laser-scanned data. Each scene is composed of between 1 and 12 bottle caps randomly distributed in front of the laser scanner. Scanning of the scenes is done with the David Laser Scanner, a camera-based low-cost laser scanner comprising a hand-held red line laser, a web cam, and a backplane; PC software with a simple interface is provided. This scanning unit was chosen for its relatively low expense, at the cost of repeatable accuracy and scanning speed.

The orientation of the objects and their distance from the camera affect the surface point density. To maintain a roughly uniform surface point density across all objects in the scene, the objects were kept within a specified range of the camera lens (0.25 m to 0.5 m). The objects themselves are randomly oriented in the scene.

The ground-truth positions of the objects in the scenes are found using MeshLab's align tool; these ground-truth positions are only estimates. The alignment process built into MeshLab begins with a manual rough alignment, in which the user picks a set of matching points from both the model and the scene. MeshLab then takes the rough alignment and uses an embedded ICP algorithm to find the best alignment; the ICP settings were left unaltered. The best accuracy achievable, as described by the scanner's manufacturer, is an average positional error of 0.4 mm, but that result is obtained with a more powerful laser, a more expensive gray-scale camera, and professional software tuning, and was not expected for the testing presented in this thesis. From empirical estimation, the geometric error between the original model and its ICP fit to the scene in MeshLab can be as high as 1.5 mm in isolated cases, and it is difficult to give an accurate measure of the average scanner error. In repeated tests of the MeshLab align function on scanned data, the centroid of the model rotated and translated onto an object in the scene deviated from the mean centroid position by an average of 0.55 mm (standard deviation = 0.1842 mm).

Using the same method to define angular error as specified in section 7.1.3, the angular deviation of a ground-truth estimate from the mean was also measured (the standard deviation of this measurement is 1.25°). Given that the align function returns nearly identical position information on repeated runs, MeshLab is used to determine the ground-truth positions of the objects. The use of the align function is illustrated in Figure 13, where the original model is rotated and translated onto the position of one object in the scene. The output of this mapping in MeshLab is a transformation matrix (a 4×4 rotation and translation matrix) that defines the true position of the object in the scene relative to the original position of the model in the camera coordinate frame.

Figure 13 Illustration of MeshLab's align function. (a) The original model (blue) in the camera's coordinate frame is to be aligned with an object in the scene (red). (b) The model has been aligned with one of the objects in the scene; the rotation and translation have been accounted for. This process is used to determine the true orientation of objects in the scene.

Synthetic Scenes

A series of 10 synthetic scenes, each containing 16 bottle caps, is also analyzed, for two reasons: to quantify the error contributed by the scanner, and to obtain better estimates of the recognition rate, the average rate at which objects are located given that only a percentage of each object is visible due to clutter and occlusion. It is difficult to truly estimate the error in the results from the scanned data, so synthetic scenes are needed in which the objects' positions are known exactly. These synthetic scenes can be run through the algorithm to establish its baseline accuracy, and comparing against them gives an understanding of the total error associated with using the David laser scanner together with MeshLab. To gain a firm understanding of the recognition rate, a large number of objects with varying visible surface areas must be tested; initial results during the testing phase indicated that hundreds of laser-scanned objects would be needed to give a clear picture. Testing was therefore done on 160 bottle caps placed in 10 different scenes with varying visible surface areas. The point clouds in the synthetic scenes are generated from the original bottle cap mesh, but are sampled randomly and differently from the models used to create the hash tables, to simulate real-world scenarios. All objects in the scenes are randomly occluded and their true positions are precisely known. A sketch of how such a synthetic object can be generated is given below.
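The MATLAB fragment below is a hypothetical sketch of generating one such synthetic object: a randomly re-sampled model is placed at a random pose with a known ground truth and crudely occluded by discarding the points on one side of a random plane. The thesis does not specify the generation procedure beyond the description above, so every detail here is an assumption (vrrotvec2mat is part of the Simulink 3D Animation toolbox; implicit expansion requires MATLAB R2016b or later).

```matlab
% Hypothetical sketch of synthetic object generation (all details are assumptions).
M  = rand(4000, 3) * 60;                        % stand-in for a randomly re-sampled model (mm)
ax = randn(1, 3);  ax = ax / norm(ax);          % random rotation axis
R  = vrrotvec2mat([ax, 2*pi*rand]);             % random rotation (Simulink 3D Animation toolbox)
t  = 100 * rand(1, 3);                          % random translation (mm); (R, t) is the ground truth
obj = M * R' + t;                               % object placed in the synthetic scene
c   = mean(obj, 1);                             % crude occlusion: keep the points on one side of a
dir = randn(3, 1);  dir = dir / norm(dir);      % random plane through the object centroid
obj = obj((obj - c) * dir > 0, :);
```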

7.1.3 Quantifying Results

Results are quantified by accuracy, speed of matching, and recognition rate. Accuracy is broken into two components: centroid error and angular error. The centroid of a given model M, defined by the set of points m_i ∈ M, is

CM = \frac{\sum_{i=1}^{n_M} m_i}{n_M},    (41)

where n_M = |M| is the number of evenly distributed points across the model's surface; the centroid is purely geometric. Given a transformation T_j that maps the original model onto the j-th object in a scene, the corresponding centroid CM_j can also be calculated,

CM_j = \frac{\sum_{i=1}^{n_{M_j}} m_{ji}}{n_{M_j}},    (42)

where

m_{ji} \in M_j = T_j M.    (43)

Centroid error is the deviation of the estimated position found by the algorithm from the ground-truth centroid position found in MeshLab. Given CM_j^{true}, the ground-truth centroid of the j-th object in a scene, and CM_j^{est}, the centroid found by the presented algorithm, the centroid error is defined as

e_{cent} = \lVert CM_j^{true} - CM_j^{est} \rVert.
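A minimal MATLAB sketch of equation (41) and the resulting centroid error, using illustrative point sets:

```matlab
% Minimal sketch of the centroid error metric (illustrative point sets).
M_true  = rand(4000, 3) * 60;                    % model points at the ground-truth pose (mm)
M_est   = M_true + 0.5 * randn(size(M_true));    % stand-in for the pose estimated by the algorithm
CM_true = mean(M_true, 1);                       % ground-truth centroid, equation (41)
CM_est  = mean(M_est, 1);                        % estimated centroid
e_cent  = norm(CM_true - CM_est);                % centroid error in mm
```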

Angular error is defined via the Euler axis and the associated rotation angle: any sequence of rotations in 3D space can be accomplished by a single rotation about a specific axis. Given that the ground-truth pose of an object in a scene and the estimated pose are both known, the rotation matrix between the two can be calculated. Using the vrrotmat2vec command in Matlab, this rotation matrix is converted to an Euler axis-angle representation, and the rotation angle is taken as the measure of angular error between the estimated and ground-truth poses, as sketched below.
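A minimal MATLAB sketch of this angular-error computation, with an illustrative pair of rotation matrices (vrrotmat2vec and vrrotvec2mat are part of the Simulink 3D Animation toolbox):

```matlab
% Minimal sketch of the axis-angle angular error metric (illustrative rotations).
R_true = eye(3);                                 % ground-truth orientation
R_est  = vrrotvec2mat([0 0 1 deg2rad(5)]);       % estimated orientation, 5 degrees off about z
R_rel  = R_true' * R_est;                        % relative rotation between the two poses
axang  = vrrotmat2vec(R_rel);                    % Euler axis-angle [x y z theta]
e_ang  = rad2deg(abs(axang(4)));                 % angular error in degrees
```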

Recognition rate is the metric describing how robust the algorithm is to occlusion and clutter. For simplicity, the rate of recognition is presented as the percentage of times an object is located by the algorithm as a function of the percentage of the object's surface area seen by the scanner.

7.2 Testing Iterations

There are a variety of settings in the algorithm that require sensitivity testing to find a group of settings that results in fast and accurate matching ("fast" being relative to the machine performing the bin-picking operation). The testing here is not intended to optimize the settings but rather to determine how adjustments of the parameters change the algorithm's output. The following parameters have been tested:

- Sampling radius: r (mm)
- Sampling density: ρ_scene (number of sampled points / total scene surface area)
- Correspondence geometric error limit: e_gmax
- Required correspondence pairs: C_n
- Grid parameter: η
- Model distance discretization: d_dist
- Hash overlap percent: p_d

An optimization is left for future work; the extent of this research is to give a baseline of what the algorithm is capable of producing.

7.3 Results: Laser Scanned Input

7.3.1 Sampling Radius and Sampling Density

The first results presented are those with varying sampling radius r and sampling density ρ_scene, keeping all other parameters constant at the values in Table 1. This set of tests determines the effect these independent parameters have on the results in terms of accuracy, run time, and recognition rate.

Table 1 Parameters for algorithm testing (varying sampling radius and sampling density)
    Sampling radius r (mm): {20, 22.5, 25, 27.5, 30}
    Sampling density ρ_scene (points/mm²): {0.06, 0.07, 0.08, 0.09}
    Model distance discretization: d_dist
    Percent hash overlap p_d: 15%
    Correspondence error limit e_gmax: 1.5
    Grid parameter η (mm): 15
    Required correspondence pairs C_n: 8
    Hash model size (number of points): 4000

The results indicate that varying the radius linearly produces an approximately linear response in the average centroid error e_cent(r) and the average angular error e_angular(r), shown graphically in Figure 14 and Figure 15 respectively; the errors are therefore not linear in the number of oriented scene points in a sub-group. As r increases linearly, the area swept by the sampling circle grows non-linearly (it depends on r²), and the number of scene points included in a particular sub-group, n_SS, is proportional to that area. There are therefore diminishing returns, in terms of both accuracy and time, from adding more points to a sub-group. The results also indicate that increasing the size of the sub-groups has a non-linear effect on the run-time, as seen in Figure 16, which plots the normalized run-time versus the size of a sub-group.

Figure 14 A plot of the matching results in terms of centroid error for the parameters shown in Table 1. Centroid error is measured in mm and is the deviation of the estimated object position from the estimated true position. The sampling radius r is on the x-axis. The different curves correspond to different surface point densities (ρ_scene).

Figure 15 Plot of the average angular error versus sampling radius and sampling density. The sampling radius r is on the x-axis. The different curves correspond to different surface point densities (ρ_scene).

The relationship between r or ρ_scene and the time to recognize an individual object is non-linear because of the way the algorithm works. As the algorithm runs, sub-groups of oriented scene points are run through the voting process to find matching model points. The addition of a single scene point creates a new feature between it and every other point already included in the sub-group. Each feature is analyzed in linear time, but because the number of features per sub-group grows non-linearly, the time cost of adding new points to a sub-group is non-linear. A visualization of this is seen in Figure 16, which shows the normalized run-time versus the number of scene points in a sub-group for the accumulator-based voting portion of the algorithm.

Figure 16 Normalized run-time versus the number of scene points in a given sub-group for the accumulator-based voting portion of the algorithm. As the number of points included in a sub-group increases, the time increases non-linearly.

Figure 17 shows the average run-time per object as the independent variables are varied. This again shows the non-linear relationship between the run-time and the sampling radius and sampling density.

Figure 17 Average run-time to find a single object in a given scene. The sampling radius r is on the x-axis. The different curves correspond to different surface point densities (ρ_scene).

These results show minimal dependence of the error, both centroid and angular, on the sampling density (Figure 14 and Figure 15). The centroid error is marginally more susceptible to changes in surface density: as ρ_scene increases, the results indicate that the error increases marginally. This is not expected; as the number of sampled points increases one would expect the error to decrease. The result is certainly influenced by the rate of recognition. With a lower sampling density, objects in the scene with a smaller visible surface patch have fewer associated points, which leads to lower recognition rates, as shown in Figure 18 (and more clearly in Figure 37 and Figure 38, which test a larger synthetic data set). The curves in Figure 18 are not smooth, likely because only 5 scenes of scanned data are used.

The synthetic data testing portion shows smoother results, using a larger number of scenes and objects. When ρ_scene is small enough that some objects are not recognizable by the algorithm, the average accuracy increases: the well-defined objects in the scene (those with larger visible surfaces) are located accurately, while the objects that are harder to recognize are simply ignored. When ρ_scene becomes larger, the harder-to-find objects are located, but because fewer points represent them (relative to other objects in the scene) their pose estimates are less accurate.

Figure 18 Plot of the recognition rates for the various sampling density values. Lower sampling density values correspond to lower recognition rates.

7.3.2 Sampling Radius and Correspondence Error Limit

This section describes the results of varying the sampling radius r and the correspondence geometric error limit e_gmax. The correspondence error was discussed in section 4.5 and is part of the filtering process that eliminates poor correspondence pairs; a sketch of one such pairwise consistency test is given below.
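The fragment below is a hypothetical MATLAB sketch of a pairwise geometric-consistency test of the kind that e_gmax is assumed to bound; the actual filter is defined in section 4.5 and is not reproduced here, and the points used are example values only.

```matlab
% Hypothetical sketch of a pairwise consistency test bounded by e_gmax (example values).
e_gmax = 1.5;                                  % mm, correspondence error limit
s_i = [10 0 0];   s_j = [22  3 1];             % scene points of two candidate correspondence pairs
m_i = [ 0 0 0];   m_j = [12  3 0];             % the corresponding model points
consistent = abs(norm(s_i - s_j) - norm(m_i - m_j)) <= e_gmax;   % accept the pair of pairs if true
```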

Table 2 Parameters for algorithm testing (varying sampling radius and correspondence error limit)
    Sampling radius r (mm): {20, 22.5, 25, 27.5, 30}
    Sampling density ρ_scene (points/mm²): 0.06
    Model distance discretization: d_dist
    Percent hash overlap p_d: 15%
    Correspondence error limit e_gmax: {1, 1.5, 2, 2.5, 3}
    Grid parameter η (mm): 20
    Required correspondence pairs C_n: 9
    Hash model size (number of points): 4000

As e_gmax decreases, the resulting object matching becomes more accurate, because only very geometrically consistent correspondence pairs are allowed. The larger e_gmax becomes, the more correspondence pairs are accepted through the filtering process; this is synonymous with loosening a tolerance. The average error increases as the correspondence error limit increases, as shown in Figure 19 and Figure 20.

Figure 19 Matching results for the parameters shown in Table 2. Centroid error is measured in mm and is the deviation of the estimated object position from the estimated true position. The sampling radius r is on the x-axis. The various curves are associated with different error limits (e_gmax).

The effect of the correspondence error limit e_gmax on accuracy is large relative to the total average centroid error. Variation of this parameter has a roughly linear effect for the larger values of e_gmax in the tested set: the decrease in error from e_gmax = 3 to e_gmax = 2.5 is nearly identical to the drop from e_gmax = 2.5 to e_gmax = 2.0, about 0.2 mm of average centroid error. This linearity appears to weaken as the error limit approaches 0; in the limit as e_gmax → 0, the results indicate that the error does not tend to zero.

Figure 20 Matching results for the parameters shown in Table 2. Angular error is measured in degrees and is the deviation of the estimated object orientation from the estimated true orientation. The sampling radius r is on the x-axis. The various curves correspond to different error limits (e_gmax).

Figure 21 The average run-time to find a single object in a given scene. The sampling radius r is on the x-axis. The various curves correspond to different error limits (e_gmax).

The sampling radius seems to have a larger effect on the run-time than the correspondence error limit. Figure 21 again shows the non-linearity of the relationship between sampling radius and run time. The effect of the correspondence error limit appears less important, especially as the sampling radius increases beyond r = 30 mm; below that threshold, e_gmax plays a large role in dictating the speed of object recognition, because of the lower recognition rates associated with the stiffer tolerance. As e_gmax becomes smaller and the sampling radius decreases, the average recognition time increases. As can be seen in Figure 22, a smaller correspondence error limit generally decreases the rate of recognition.

Lower rates of recognition carry a time penalty. Increasing e_gmax does not always increase the time it takes the algorithm to parse an entire scene, but lowering the error limit can reduce the number of objects found, and as the number of objects found in a scene decreases, the average time to find an object increases. Figure 21 also indicates that the sampling radius can be optimized, since sampling radius values between 25 mm and 30 mm induce the fastest recognition times. Given that τ_object ≈ 60 mm, these values make sense: when r = 30 mm a sampling circle can encompass an entire object in the scene, and having the entire object represented within a single sub-group ensures the most precise results, with the algorithm using all the points representing the object. These results indicate that setting the sampling radius to r ≈ τ_object/2 optimizes the run-time for the set of parameters used in this experiment.

Figure 22 The recognition rate with the parameters set as shown in Table 2. The plotted lines are 3rd-order polynomials fit to the original data.

In general, varying the correspondence error limit either limits the number of objects found while increasing accuracy, or allows more objects to be found at the cost of accuracy.

7.3.3 Sampling Radius and Required Correspondence Pairs

The results presented here are those with varying sampling radius r and required correspondence pairs C_n. To determine the effect these two independent parameters have on the results in terms of accuracy, run time, and recognition rate, they are varied as shown in Table 3.

Table 3 Parameters for algorithm testing (varying sampling radius and required correspondence pairs)
    Sampling radius r (mm): {20, 22.5, 25, 27.5, 30}
    Sampling density ρ_scene (points/mm²): 0.06
    Model distance discretization: d_dist
    Percent hash overlap p_d: 15%
    Correspondence error limit e_gmax: 1.75
    Grid parameter η (mm): 20
    Required correspondence pairs C_n: {6, 7, 8, 9, 10, 11, 12}
    Hash model size (number of points): 4000

Figure 23 Matching results for the parameters shown in Table 3. Centroid error is measured in mm and is the deviation of the estimated object position from the estimated true position. The required number of correspondence pairs is on the x-axis. The various curves correspond to different sampling radii (r).

Increasing C_n yields more accurate results, as shown in Figure 23. Requiring a larger number of consistent correspondence pairs reduces the probability that a set of correspondence pairs matches the object inaccurately, and it diminishes the effect of any bad matching pairs. Increasing the number of required consistent correspondence pairs also causes the algorithm to effectively discount objects with lower visible surface area; the results in terms of recognition rates are shown in Figure 25.

The results of this particular set of tests indicate that the average time to find a single object is nearly linear in C_n over this range of values: increasing the required number of consistent correspondence pairs increases the run time.

Figure 24 The average run-time to find a single object in a given scene. The required number of correspondence pairs is on the x-axis. Different curves correspond to different sampling radii (r).

Figure 25 The recognition rate with the parameters set as shown in Table 3. The plotted lines are 3rd-order polynomials fit to the original data. The underlying data resembles that of the earlier recognition-rate plots.

7.3.4 Sampling Radius and Grid Parameter

Results are presented for varying the sampling radius r and the grid spacing parameter η, which dictates the number and position of the sub-groups. The parameters used in these experiments are shown in Table 4. This test is used to determine the effect that both independent parameters have on accuracy and run times.

Table 4 Parameters for algorithm testing (varying sampling radius and grid parameter)
    Sampling radius r (mm): {15, 17.5, 20, 22.5, 25, 27.5, 30}
    Sampling density ρ_scene (points/mm²): 0.06
    Model distance discretization: d_dist
    Percent hash overlap p_d: 15%
    Correspondence error limit e_gmax: 1.75
    Grid parameter η (mm): {5, 10, 15, 20, 25}
    Required correspondence pairs C_n: 9
    Hash model size (number of points): 4000

Varying η adjusts the way in which the scene is sampled: the smaller the value of η, the more sub-groups are created. Interestingly, the grid parameter does have an effect on the average error. Smaller values of η induce slightly larger errors than the larger values, as shown in Figure 26. Once the grid parameter reaches a value of about 15 mm, further increases have little effect on the centroid error.

Figure 26 The matching results for the parameters shown in Table 4. Centroid error is measured in mm and is the deviation of the estimated object position from the estimated true position. The sampling radius r is on the x-axis. The different curves correspond to different grid parameter values (η).

Figure 27 The average run-time to find a single object in a given scene. The sampling radius r is on the x-axis. The different curves correspond to different grid parameter values (η).

The grid parameter affects the average speed of recognition differently than it affects the average error. Figure 27 shows that recognition speed is maximized for specific values of both η and r. In section 7.3.2 the results indicated that a sampling radius of 30 mm was approximately optimal (r ≈ τ_object/2); the parameters used for these tests elicit a different response, with a sampling radius of r ≈ 20 mm (r ≈ τ_object/3) and η = 15 mm giving the fastest results. This indicates that the parameters are coupled to some extent.

Figure 28 The recognition rate given the parameters in Table 4. The results are similar to those shown in Figure 18. The curves are not smooth because only 5 scenes of scanned data were used; the synthetic data shows smoother results using a larger number of scenes and objects.

7.3.5 Sampling Radius and Model Distance Discretization

Results are presented for varying the sampling radius r and the distance discretization parameter d_dist. The parameters used are shown in Table 5. This test determines the effect that these two independent parameters have on the results in terms of accuracy, run-time, and recognition rate.

Table 5 Parameters for algorithm testing (varying sampling radius and distance discretization)
    Sampling radius r (mm): {20, 22.5, 25, 27.5, 30}
    Sampling density ρ_scene (points/mm²): 0.05
    Model distance discretization d_dist: {0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2}
    Percent hash overlap p_d: 15%
    Correspondence error limit e_gmax: 1.5
    Grid parameter η (mm): 15
    Required correspondence pairs C_n: 8
    Hash model size (number of points): 4000

Varying the distance discretization parameter has only a small effect on the results. As shown in Figure 29, the error remains fairly constant across the model discretization steps; the small perturbations are due to the small number of scenes used in these experiments.

Figure 29 Matching results in terms of centroid error for the parameters shown in Table 5. Centroid error is measured in mm and is the deviation of the estimated object position from the estimated true position. The distance discretization is on the x-axis. The different curves correspond to different sampling radii.

The relative consistency seen in Figure 28 is also seen in Figure 29. The discretization value has a minimal effect on the average run-time, which is determined by the radius r rather than d_dist.

Figure 30 The average run-time to find a single object in a given scene. The distance discretization is on the x-axis. The different curves correspond to different sampling radii.

The distance discretization parameter has a very mild effect on the rate of recognition, as seen in Figure 31. It can be concluded that, over the ranges tested here, d_dist has minimal effect on the output compared with the other parameters tested; this may change if the range of tested values is widened.

Figure 31 The recognition rate for the parameters shown in Table 5. The plotted lines are 3rd-order polynomials fit to the original data.

7.3.6 Sampling Radius and Percent Hash Overlap

Results are presented for varying the sampling radius r and the hash table overlap percent parameter p_d. The parameters used in this testing are shown in Table 6. This test is used to determine the effects that these independent parameters have on accuracy, run times, and recognition rate.

Table 6 Parameters for algorithm testing (varying sampling radius and hash overlap percent)
    Sampling radius r (mm): {20, 22.5, 25, 27.5, 30}
    Sampling density ρ_scene (points/mm²): 0.05
    Model distance discretization: d_dist
    Percent hash overlap p_d: {0%, 15%, 25%, 35%}
    Correspondence error limit e_gmax: 1.5
    Grid parameter η (mm): 15
    Required correspondence pairs C_n: 8
    Hash model size (number of points): 4000

These results indicate that the p_d parameter has only a marginal effect on the results. Figure 32 shows that the results for an overlap of 15% vary wildly, and an increase from p_d = 25% to p_d = 35% seems to result in slightly more accurate matching.

Figure 32 The matching results for the parameters shown in Table 6. Angular error is measured in degrees and is the deviation of the estimated object orientation from the estimated true orientation. The sampling radius r is on the x-axis. The various curves correspond to different hash table overlap percentages (p_d).

The average run time per object is shown in Figure 33. It is clear that the change in overlap percent affects the overall run-time: an increase in p_d increases the run-time linearly. This result makes sense; as the percentage grows, the model database also grows in size, and the size of the hash table used for matching affects the run-time. With a larger database, the algorithm has to parse through more potential matching model points, increasing the run-time. Because there is no clear increase in accuracy but a definite time penalty when the overlap percent is increased, there seems to be little advantage in increasing it.

Figure 33 The average run-time to find a single object in a given scene. The sampling radius r is on the x-axis. The different curves correspond to different hash overlap percentages (p_d).

Figure 34 The recognition rate with the parameters shown in Table 6. The plotted lines are 3rd-order polynomials fit to the original data.

The change in p_d does not seem to improve the recognition rate very much, as shown graphically in Figure 34. There is a slight improvement in going from p_d = 0% to p_d = 15%, but further enlargement has no noticeable effect.

7.4 Results: Synthetic Data

The results presented here are for ten synthetic scenes, each with 16 bottle caps. The parameters used for testing are the same as those used in section 7.3.1 and are summarized in Table 7. The purpose of using synthetic data is to quantify the error of the scanner and to obtain better estimates of the recognition rate.

Table 7 Parameters for algorithm testing (varying sampling radius and sampling density)
    Sampling radius r (mm): {20, 22.5, 25, 27.5, 30}
    Sampling density ρ_scene (points/mm²): {0.06, 0.07, 0.08, 0.09}
    Model distance discretization: d_dist
    Percent hash overlap p_d: 15%
    Correspondence error limit e_gmax: 1.5
    Grid parameter η (mm): 15
    Required correspondence pairs C_n: 8
    Hash model size (number of points): 4000

As expected, the results shown here suggest that the scanner is responsible for much of the error. The average centroid error presented in Figure 35 is roughly half of that obtained with the real scanned data (between 0.3 mm and 0.6 mm lower).

Figure 35 The matching results in terms of centroid error for the parameters shown in Table 7. Centroid error is measured in mm and is the deviation of the estimated object position from the estimated true position. The sampling radius r is on the x-axis. The different curves correspond to different surface point densities (ρ_scene).

The results for the average angular error are close in magnitude to those for the scanned data set: the average angular error ranges from 3.5° to 7.5°, similar to the scanned data. The shape of the curves, however, is not monotonically decreasing; instead the curves look parabolic, with maximum error values between r = 20 mm and r = 22 mm.

Figure 36 The average angular error versus sampling radius and sampling density. The sampling radius r is on the x-axis. The different curves correspond to different surface point densities (ρ_scene).

Figure 37 The recognition rate as a function of the sampling density values. Lower sampling density values correspond to lower recognition rates.

One of the primary reasons for testing the synthetic data (other than quantifying the error due to the scanner) is to gain a better understanding of the algorithm's recognition rate; far more objects, with varying visible surface areas, are tested than in the laser-scanned data sets. The actual recognition rates are shown in Figure 37, and Figure 38 shows the data fitted to 5th-order polynomials.

Figure 38 Plot of the data from Figure 37 fit to 5th-order polynomials.

Chapter 8 Discussion

8.1 General Discussion

Using real input data from the David laser scanner, the average centroid error was in the range of 0.8 mm to 2.4 mm. Given that τ_object ≈ 60 mm, the centroid error relative to the size of the object is in the range of 1.33% to 4%. Synthetic scene testing resulted in an average centroid error in the range of 0.55 mm to 0.85 mm. The average angular errors were between 3.5° and 7.5° for both real and synthetic data input.

    Average run time: 0.1 to 0.5 seconds per object
    Average centroid error, laser-scanned input: 0.8 mm to 2.4 mm
    Average centroid error, synthetic input: 0.55 mm to 0.85 mm
    Average angular error, scanned/synthetic input: 4° to 6°

The algorithm described in this thesis is designed for bin-picking applications, and in any bin-picking application a large part of the focus is on the gripper design. With these relatively low recognition errors it should not be too difficult to design an end-effector that can pick up objects despite the position errors; the gripping mechanism could, for example, be self-orienting, so that as the object is picked it conforms to the physical geometry of the end-effector.

The algorithm proved to be very fast even though it is implemented in Matlab. The time to find single objects ranged between 0.1 and 0.5 seconds (roughly 120 to 600 objects per minute), with the exact recognition time highly dependent on the algorithm's parameters. The algorithm is capable of running very quickly for position errors on the order of 1.5 mm to 2.5 mm. As shown in section 7.3.2, with e_gmax = 3 and r = 20 mm the bottle caps were recognized in an average time of 0.135 seconds per object. Given that Matlab is an interpreted language, this is a strong result; writing the code in a compiled language such as C++ would yield much faster recognition. It is very hard to estimate the actual speed improvement, but a factor in the range of 2 to 10 is plausible. Also, because of the manner in which the algorithm works (sending sub-groups of points through the pipeline individually), there is room for parallelization across multiple CPU or GPU cores. Currently no physical robot is known that could keep up with such a high throughput (> 500 parts/minute).


More information

Topic 6 Representation and Description

Topic 6 Representation and Description Topic 6 Representation and Description Background Segmentation divides the image into regions Each region should be represented and described in a form suitable for further processing/decision-making Representation

More information

3D object recognition used by team robotto

3D object recognition used by team robotto 3D object recognition used by team robotto Workshop Juliane Hoebel February 1, 2016 Faculty of Computer Science, Otto-von-Guericke University Magdeburg Content 1. Introduction 2. Depth sensor 3. 3D object

More information

Algorithms for Grid Graphs in the MapReduce Model

Algorithms for Grid Graphs in the MapReduce Model University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Computer Science and Engineering: Theses, Dissertations, and Student Research Computer Science and Engineering, Department

More information

SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS

SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS Cognitive Robotics Original: David G. Lowe, 004 Summary: Coen van Leeuwen, s1460919 Abstract: This article presents a method to extract

More information

Mobile Human Detection Systems based on Sliding Windows Approach-A Review

Mobile Human Detection Systems based on Sliding Windows Approach-A Review Mobile Human Detection Systems based on Sliding Windows Approach-A Review Seminar: Mobile Human detection systems Njieutcheu Tassi cedrique Rovile Department of Computer Engineering University of Heidelberg

More information

CITS 4402 Computer Vision

CITS 4402 Computer Vision CITS 4402 Computer Vision Prof Ajmal Mian Lecture 12 3D Shape Analysis & Matching Overview of this lecture Revision of 3D shape acquisition techniques Representation of 3D data Applying 2D image techniques

More information

3D Computer Vision. Structured Light II. Prof. Didier Stricker. Kaiserlautern University.

3D Computer Vision. Structured Light II. Prof. Didier Stricker. Kaiserlautern University. 3D Computer Vision Structured Light II Prof. Didier Stricker Kaiserlautern University http://ags.cs.uni-kl.de/ DFKI Deutsches Forschungszentrum für Künstliche Intelligenz http://av.dfki.de 1 Introduction

More information

Computer Vision 6 Segmentation by Fitting

Computer Vision 6 Segmentation by Fitting Computer Vision 6 Segmentation by Fitting MAP-I Doctoral Programme Miguel Tavares Coimbra Outline The Hough Transform Fitting Lines Fitting Curves Fitting as a Probabilistic Inference Problem Acknowledgements:

More information

Manipulating the Boundary Mesh

Manipulating the Boundary Mesh Chapter 7. Manipulating the Boundary Mesh The first step in producing an unstructured grid is to define the shape of the domain boundaries. Using a preprocessor (GAMBIT or a third-party CAD package) you

More information

Segmentation and Tracking of Partial Planar Templates

Segmentation and Tracking of Partial Planar Templates Segmentation and Tracking of Partial Planar Templates Abdelsalam Masoud William Hoff Colorado School of Mines Colorado School of Mines Golden, CO 800 Golden, CO 800 amasoud@mines.edu whoff@mines.edu Abstract

More information

Lecture notes: Object modeling

Lecture notes: Object modeling Lecture notes: Object modeling One of the classic problems in computer vision is to construct a model of an object from an image of the object. An object model has the following general principles: Compact

More information

Lecture 8 Object Descriptors

Lecture 8 Object Descriptors Lecture 8 Object Descriptors Azadeh Fakhrzadeh Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University 2 Reading instructions Chapter 11.1 11.4 in G-W Azadeh Fakhrzadeh

More information

Lesson 3. Prof. Enza Messina

Lesson 3. Prof. Enza Messina Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical

More information

Lecture 6: Multimedia Information Retrieval Dr. Jian Zhang

Lecture 6: Multimedia Information Retrieval Dr. Jian Zhang Lecture 6: Multimedia Information Retrieval Dr. Jian Zhang NICTA & CSE UNSW COMP9314 Advanced Database S1 2007 jzhang@cse.unsw.edu.au Reference Papers and Resources Papers: Colour spaces-perceptual, historical

More information

Lofting 3D Shapes. Abstract

Lofting 3D Shapes. Abstract Lofting 3D Shapes Robby Prescott Department of Computer Science University of Wisconsin Eau Claire Eau Claire, Wisconsin 54701 robprescott715@gmail.com Chris Johnson Department of Computer Science University

More information

Uncertainties: Representation and Propagation & Line Extraction from Range data

Uncertainties: Representation and Propagation & Line Extraction from Range data 41 Uncertainties: Representation and Propagation & Line Extraction from Range data 42 Uncertainty Representation Section 4.1.3 of the book Sensing in the real world is always uncertain How can uncertainty

More information

CHAPTER 3. Single-view Geometry. 1. Consequences of Projection

CHAPTER 3. Single-view Geometry. 1. Consequences of Projection CHAPTER 3 Single-view Geometry When we open an eye or take a photograph, we see only a flattened, two-dimensional projection of the physical underlying scene. The consequences are numerous and startling.

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

Finding 2D Shapes and the Hough Transform

Finding 2D Shapes and the Hough Transform CS 4495 Computer Vision Finding 2D Shapes and the Aaron Bobick School of Interactive Computing Administrivia Today: Modeling Lines and Finding them CS4495: Problem set 1 is still posted. Please read the

More information

A Novel Algorithm for Automatic 3D Model-based Free-form Object Recognition

A Novel Algorithm for Automatic 3D Model-based Free-form Object Recognition A Novel Algorithm for Automatic 3D Model-based Free-form Object Recognition Ajmal S. Mian, M. Bennamoun and R. A. Owens School of Computer Science and Software Engineering The University of Western Australia

More information

6 Mathematics Curriculum

6 Mathematics Curriculum New York State Common Core 6 Mathematics Curriculum GRADE GRADE 6 MODULE 5 Table of Contents 1 Area, Surface Area, and Volume Problems... 3 Topic A: Area of Triangles, Quadrilaterals, and Polygons (6.G.A.1)...

More information

Accurate 3D Face and Body Modeling from a Single Fixed Kinect

Accurate 3D Face and Body Modeling from a Single Fixed Kinect Accurate 3D Face and Body Modeling from a Single Fixed Kinect Ruizhe Wang*, Matthias Hernandez*, Jongmoo Choi, Gérard Medioni Computer Vision Lab, IRIS University of Southern California Abstract In this

More information

Describing 3D Geometric Primitives Using the Gaussian Sphere and the Gaussian Accumulator

Describing 3D Geometric Primitives Using the Gaussian Sphere and the Gaussian Accumulator Noname manuscript No. (will be inserted by the editor) Describing 3D Geometric Primitives Using the Gaussian and the Gaussian Accumulator Zahra Toony Denis Laurendeau Christian Gagné Received: date / Accepted:

More information

Chapter 15 Introduction to Linear Programming

Chapter 15 Introduction to Linear Programming Chapter 15 Introduction to Linear Programming An Introduction to Optimization Spring, 2015 Wei-Ta Chu 1 Brief History of Linear Programming The goal of linear programming is to determine the values of

More information

Combining Appearance and Topology for Wide

Combining Appearance and Topology for Wide Combining Appearance and Topology for Wide Baseline Matching Dennis Tell and Stefan Carlsson Presented by: Josh Wills Image Point Correspondences Critical foundation for many vision applications 3-D reconstruction,

More information

Shape Descriptor using Polar Plot for Shape Recognition.

Shape Descriptor using Polar Plot for Shape Recognition. Shape Descriptor using Polar Plot for Shape Recognition. Brijesh Pillai ECE Graduate Student, Clemson University bpillai@clemson.edu Abstract : This paper presents my work on computing shape models that

More information

Lecture 17: Solid Modeling.... a cubit on the one side, and a cubit on the other side Exodus 26:13

Lecture 17: Solid Modeling.... a cubit on the one side, and a cubit on the other side Exodus 26:13 Lecture 17: Solid Modeling... a cubit on the one side, and a cubit on the other side Exodus 26:13 Who is on the LORD's side? Exodus 32:26 1. Solid Representations A solid is a 3-dimensional shape with

More information

EXAM SOLUTIONS. Image Processing and Computer Vision Course 2D1421 Monday, 13 th of March 2006,

EXAM SOLUTIONS. Image Processing and Computer Vision Course 2D1421 Monday, 13 th of March 2006, School of Computer Science and Communication, KTH Danica Kragic EXAM SOLUTIONS Image Processing and Computer Vision Course 2D1421 Monday, 13 th of March 2006, 14.00 19.00 Grade table 0-25 U 26-35 3 36-45

More information

CS 468 Data-driven Shape Analysis. Shape Descriptors

CS 468 Data-driven Shape Analysis. Shape Descriptors CS 468 Data-driven Shape Analysis Shape Descriptors April 1, 2014 What Is A Shape Descriptor? Shapes Shape Descriptor F1=[f1, f2,.., fn] F2=[f1, f2,.., fn] F3=[f1, f2,.., fn] What Is A Shape Descriptor?

More information

Project Updates Short lecture Volumetric Modeling +2 papers

Project Updates Short lecture Volumetric Modeling +2 papers Volumetric Modeling Schedule (tentative) Feb 20 Feb 27 Mar 5 Introduction Lecture: Geometry, Camera Model, Calibration Lecture: Features, Tracking/Matching Mar 12 Mar 19 Mar 26 Apr 2 Apr 9 Apr 16 Apr 23

More information

A Keypoint Descriptor Inspired by Retinal Computation

A Keypoint Descriptor Inspired by Retinal Computation A Keypoint Descriptor Inspired by Retinal Computation Bongsoo Suh, Sungjoon Choi, Han Lee Stanford University {bssuh,sungjoonchoi,hanlee}@stanford.edu Abstract. The main goal of our project is to implement

More information

CS 664 Segmentation. Daniel Huttenlocher

CS 664 Segmentation. Daniel Huttenlocher CS 664 Segmentation Daniel Huttenlocher Grouping Perceptual Organization Structural relationships between tokens Parallelism, symmetry, alignment Similarity of token properties Often strong psychophysical

More information

Problem definition Image acquisition Image segmentation Connected component analysis. Machine vision systems - 1

Problem definition Image acquisition Image segmentation Connected component analysis. Machine vision systems - 1 Machine vision systems Problem definition Image acquisition Image segmentation Connected component analysis Machine vision systems - 1 Problem definition Design a vision system to see a flat world Page

More information

Smarter Balanced Vocabulary (from the SBAC test/item specifications)

Smarter Balanced Vocabulary (from the SBAC test/item specifications) Example: Smarter Balanced Vocabulary (from the SBAC test/item specifications) Notes: Most terms area used in multiple grade levels. You should look at your grade level and all of the previous grade levels.

More information

Model-based segmentation and recognition from range data

Model-based segmentation and recognition from range data Model-based segmentation and recognition from range data Jan Boehm Institute for Photogrammetry Universität Stuttgart Germany Keywords: range image, segmentation, object recognition, CAD ABSTRACT This

More information

Outline 7/2/201011/6/

Outline 7/2/201011/6/ Outline Pattern recognition in computer vision Background on the development of SIFT SIFT algorithm and some of its variations Computational considerations (SURF) Potential improvement Summary 01 2 Pattern

More information

Object Recognition with Invariant Features

Object Recognition with Invariant Features Object Recognition with Invariant Features Definition: Identify objects or scenes and determine their pose and model parameters Applications Industrial automation and inspection Mobile robots, toys, user

More information

RANSAC and some HOUGH transform

RANSAC and some HOUGH transform RANSAC and some HOUGH transform Thank you for the slides. They come mostly from the following source Dan Huttenlocher Cornell U Matching and Fitting Recognition and matching are closely related to fitting

More information

Boundary descriptors. Representation REPRESENTATION & DESCRIPTION. Descriptors. Moore boundary tracking

Boundary descriptors. Representation REPRESENTATION & DESCRIPTION. Descriptors. Moore boundary tracking Representation REPRESENTATION & DESCRIPTION After image segmentation the resulting collection of regions is usually represented and described in a form suitable for higher level processing. Most important

More information

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

More information

Reconstruction of Polygonal Faces from Large-Scale Point-Clouds of Engineering Plants

Reconstruction of Polygonal Faces from Large-Scale Point-Clouds of Engineering Plants 1 Reconstruction of Polygonal Faces from Large-Scale Point-Clouds of Engineering Plants Hiroshi Masuda 1, Takeru Niwa 2, Ichiro Tanaka 3 and Ryo Matsuoka 4 1 The University of Electro-Communications, h.masuda@euc.ac.jp

More information

EE368 Project Report CD Cover Recognition Using Modified SIFT Algorithm

EE368 Project Report CD Cover Recognition Using Modified SIFT Algorithm EE368 Project Report CD Cover Recognition Using Modified SIFT Algorithm Group 1: Mina A. Makar Stanford University mamakar@stanford.edu Abstract In this report, we investigate the application of the Scale-Invariant

More information

Flexible Calibration of a Portable Structured Light System through Surface Plane

Flexible Calibration of a Portable Structured Light System through Surface Plane Vol. 34, No. 11 ACTA AUTOMATICA SINICA November, 2008 Flexible Calibration of a Portable Structured Light System through Surface Plane GAO Wei 1 WANG Liang 1 HU Zhan-Yi 1 Abstract For a portable structured

More information

CS 223B Computer Vision Problem Set 3

CS 223B Computer Vision Problem Set 3 CS 223B Computer Vision Problem Set 3 Due: Feb. 22 nd, 2011 1 Probabilistic Recursion for Tracking In this problem you will derive a method for tracking a point of interest through a sequence of images.

More information

Course Number: Course Title: Geometry

Course Number: Course Title: Geometry Course Number: 1206310 Course Title: Geometry RELATED GLOSSARY TERM DEFINITIONS (89) Altitude The perpendicular distance from the top of a geometric figure to its opposite side. Angle Two rays or two line

More information

Introduction. Introduction. Related Research. SIFT method. SIFT method. Distinctive Image Features from Scale-Invariant. Scale.

Introduction. Introduction. Related Research. SIFT method. SIFT method. Distinctive Image Features from Scale-Invariant. Scale. Distinctive Image Features from Scale-Invariant Keypoints David G. Lowe presented by, Sudheendra Invariance Intensity Scale Rotation Affine View point Introduction Introduction SIFT (Scale Invariant Feature

More information

CoE4TN4 Image Processing

CoE4TN4 Image Processing CoE4TN4 Image Processing Chapter 11 Image Representation & Description Image Representation & Description After an image is segmented into regions, the regions are represented and described in a form suitable

More information

SUPPLEMENTARY FILE S1: 3D AIRWAY TUBE RECONSTRUCTION AND CELL-BASED MECHANICAL MODEL. RELATED TO FIGURE 1, FIGURE 7, AND STAR METHODS.

SUPPLEMENTARY FILE S1: 3D AIRWAY TUBE RECONSTRUCTION AND CELL-BASED MECHANICAL MODEL. RELATED TO FIGURE 1, FIGURE 7, AND STAR METHODS. SUPPLEMENTARY FILE S1: 3D AIRWAY TUBE RECONSTRUCTION AND CELL-BASED MECHANICAL MODEL. RELATED TO FIGURE 1, FIGURE 7, AND STAR METHODS. 1. 3D AIRWAY TUBE RECONSTRUCTION. RELATED TO FIGURE 1 AND STAR METHODS

More information

human vision: grouping k-means clustering graph-theoretic clustering Hough transform line fitting RANSAC

human vision: grouping k-means clustering graph-theoretic clustering Hough transform line fitting RANSAC COS 429: COMPUTER VISON Segmentation human vision: grouping k-means clustering graph-theoretic clustering Hough transform line fitting RANSAC Reading: Chapters 14, 15 Some of the slides are credited to:

More information

Robust PDF Table Locator

Robust PDF Table Locator Robust PDF Table Locator December 17, 2016 1 Introduction Data scientists rely on an abundance of tabular data stored in easy-to-machine-read formats like.csv files. Unfortunately, most government records

More information

Fitting. Lecture 8. Cristian Sminchisescu. Slide credits: K. Grauman, S. Seitz, S. Lazebnik, D. Forsyth, J. Ponce

Fitting. Lecture 8. Cristian Sminchisescu. Slide credits: K. Grauman, S. Seitz, S. Lazebnik, D. Forsyth, J. Ponce Fitting Lecture 8 Cristian Sminchisescu Slide credits: K. Grauman, S. Seitz, S. Lazebnik, D. Forsyth, J. Ponce Fitting We want to associate a model with observed features [Fig from Marszalek & Schmid,

More information

Structural and Syntactic Pattern Recognition

Structural and Syntactic Pattern Recognition Structural and Syntactic Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551, Fall 2017 c 2017, Selim Aksoy (Bilkent

More information

Structured Light II. Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov

Structured Light II. Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov Structured Light II Johannes Köhler Johannes.koehler@dfki.de Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov Introduction Previous lecture: Structured Light I Active Scanning Camera/emitter

More information

Scope and Sequence for the New Jersey Core Curriculum Content Standards

Scope and Sequence for the New Jersey Core Curriculum Content Standards Scope and Sequence for the New Jersey Core Curriculum Content Standards The following chart provides an overview of where within Prentice Hall Course 3 Mathematics each of the Cumulative Progress Indicators

More information

This blog addresses the question: how do we determine the intersection of two circles in the Cartesian plane?

This blog addresses the question: how do we determine the intersection of two circles in the Cartesian plane? Intersecting Circles This blog addresses the question: how do we determine the intersection of two circles in the Cartesian plane? This is a problem that a programmer might have to solve, for example,

More information

Chapter 11 Representation & Description

Chapter 11 Representation & Description Chain Codes Chain codes are used to represent a boundary by a connected sequence of straight-line segments of specified length and direction. The direction of each segment is coded by using a numbering

More information

Geometry Processing & Geometric Queries. Computer Graphics CMU /15-662

Geometry Processing & Geometric Queries. Computer Graphics CMU /15-662 Geometry Processing & Geometric Queries Computer Graphics CMU 15-462/15-662 Last time: Meshes & Manifolds Mathematical description of geometry - simplifying assumption: manifold - for polygon meshes: fans,

More information

Automatic registration of terrestrial laser scans for geological deformation monitoring

Automatic registration of terrestrial laser scans for geological deformation monitoring Automatic registration of terrestrial laser scans for geological deformation monitoring Daniel Wujanz 1, Michael Avian 2, Daniel Krueger 1, Frank Neitzel 1 1 Chair of Geodesy and Adjustment Theory, Technische

More information

Robot Mapping. SLAM Front-Ends. Cyrill Stachniss. Partial image courtesy: Edwin Olson 1

Robot Mapping. SLAM Front-Ends. Cyrill Stachniss. Partial image courtesy: Edwin Olson 1 Robot Mapping SLAM Front-Ends Cyrill Stachniss Partial image courtesy: Edwin Olson 1 Graph-Based SLAM Constraints connect the nodes through odometry and observations Robot pose Constraint 2 Graph-Based

More information

Lecture 8: Fitting. Tuesday, Sept 25

Lecture 8: Fitting. Tuesday, Sept 25 Lecture 8: Fitting Tuesday, Sept 25 Announcements, schedule Grad student extensions Due end of term Data sets, suggestions Reminder: Midterm Tuesday 10/9 Problem set 2 out Thursday, due 10/11 Outline Review

More information

Data Representation in Visualisation

Data Representation in Visualisation Data Representation in Visualisation Visualisation Lecture 4 Taku Komura Institute for Perception, Action & Behaviour School of Informatics Taku Komura Data Representation 1 Data Representation We have

More information

Supplementary Materials for

Supplementary Materials for advances.sciencemag.org/cgi/content/full/4/1/eaao7005/dc1 Supplementary Materials for Computational discovery of extremal microstructure families The PDF file includes: Desai Chen, Mélina Skouras, Bo Zhu,

More information

Pedestrian Detection Using Correlated Lidar and Image Data EECS442 Final Project Fall 2016

Pedestrian Detection Using Correlated Lidar and Image Data EECS442 Final Project Fall 2016 edestrian Detection Using Correlated Lidar and Image Data EECS442 Final roject Fall 2016 Samuel Rohrer University of Michigan rohrer@umich.edu Ian Lin University of Michigan tiannis@umich.edu Abstract

More information