A Comparison and Evaluation of Three Different Pose Estimation Algorithms In Detecting Low Texture Manufactured Objects

Clemson University TigerPrints, All Theses, 12-2011. A Comparison and Evaluation of Three Different Pose Estimation Algorithms In Detecting Low Texture Manufactured Objects. Robert Kriener, Clemson University, rkriene@clemson.edu. Follow this and additional works at: https://tigerprints.clemson.edu/all_theses Part of the Electrical and Computer Engineering Commons. Recommended Citation: Kriener, Robert, "A Comparison and Evaluation of Three Different Pose Estimation Algorithms In Detecting Low Texture Manufactured Objects" (2011). All Theses. 1290. https://tigerprints.clemson.edu/all_theses/1290 This Thesis is brought to you for free and open access by the Theses at TigerPrints. It has been accepted for inclusion in All Theses by an authorized administrator of TigerPrints. For more information, please contact kokeefe@clemson.edu.

A Comparison and Evaluation of Three Different Pose Estimation Algorithms In Detecting Low Texture Manufactured Objects

A Thesis Presented to the Graduate School of Clemson University In Partial Fulfillment of the Requirements for the Degree Master of Science, Electrical Engineering

by Robert Charles Kriener, Dec 2011

Accepted by: Dr. Richard Groff, Committee Chair; Dr. Stanley Birchfield; Dr. Adam Hoover

Abstract

This thesis examines the problem of pose estimation, which is the problem of determining the pose of an object in some coordinate system. Pose refers to the object's position and orientation in the coordinate system. In particular, this thesis examines pose estimation techniques using either monocular or binocular vision systems. Generally, when trying to find the pose of an object the objective is to generate a set of matching features, which may be points or lines, between a model of the object and the current image of the object. These matches can then be used to determine the pose of the object which was imaged. The algorithms presented in this thesis all generate possible matches and then use these matches to generate poses. The two monocular pose estimation techniques examined are two versions of SoftPOSIT: the traditional approach using point features, and a more recent approach using line features. The algorithms function in very much the same way, with the only difference being the features used by the algorithms. Both algorithms are started with a random initial guess of the object's pose. Using this pose a set of possible point matches is generated, and then using these matches the pose is refined so that the distances between matched points are reduced. Once the pose is refined, a new set of matches is generated. The process is then repeated until convergence, i.e., minimal or no change in the pose. The matched features depend on the initial pose; thus the algorithm's output is dependent upon the initially guessed pose.

By starting the algorithm with a variety of different poses, the goal of the algorithm is to determine the correct correspondences and then generate the correct pose. The binocular pose estimation technique presented attempts to match 3-D point data from a model of an object to 3-D point data generated from the current view of the object. In both cases the point data is generated using a stereo camera. This algorithm attempts to match 3-D point triplets in the model to 3-D point triplets from the current view, and then uses these matched triplets to obtain the pose parameters that describe the object's location and orientation in space. The results of attempting to determine the pose of three different low texture manufactured objects across a sample set of 95 images are presented using each algorithm. The results of the two monocular methods are directly compared and examined. The results of the binocular method are examined as well, and then all three algorithms are compared. Out of the three methods, the best performing algorithm, by a significant margin, was found to be the binocular method. The types of objects searched for all had low feature counts, low surface texture variation, and multiple degrees of symmetry. The results indicate that it is generally hard to robustly determine the pose of these types of objects. Finally, suggestions are made for improvements that could be made to the algorithms which may lead to better pose results.

Table of Contents

Title Page
Abstract
List of Tables
List of Figures
1 Introduction
1.1 Motivation
1.2 Related Work
1.3 Outline
2 Background
2.1 Notation
2.2 What is meant by pose?
2.3 How does imaging work?
2.4 Camera calibration
2.5 Pose From Correspondences
2.6 POSIT
2.7 3D Reconstruction From Stereo Images
3 Methods
3.1 SoftPOSIT
3.2 SoftPOSIT With Line Features
3.3 Pose Clustering From Stereo Data
4 Experiments and Results
4.1 Experiments
4.2 Results
5 Conclusions and Discussion
Bibliography

List of Tables

1.1 Classification of a few of the different pose estimation techniques discussed. Each unknown correspondence algorithm depends on or builds upon the known correspondence algorithm to the left.
1.2 Classifications of the different types of pose estimation algorithms discussed along with their requirements.
4.1 Summary of the properties and significance of performance classings.

List of Figures

2.1 The relationship of the model, camera, and world coordinate systems.
2.2 Mathematically identical camera models.
2.3 The projection of a point onto the image plane.
2.4 Estimating pose with known correspondences.
2.5 Example of two cameras in space.
2.6 Example of two stereo rectified cameras.
2.7 Geometry of the disparity to depth relationship.
3.1 Point relationships in SoftPOSIT.
3.2 Generation of projected lines in SoftPOSITLines.
3.3 Example form of the matrix m for SoftPOSIT with line features.
3.4 Example of two matched triplets.
4.1 The Objects of Interest.
4.2 Example poses from each class.
4.3 Total pose error of the three algorithms for each image in the cube image set. The translation error is given in cm while the rotation errors are lengths in the scaled consistent space (4.1).
4.4 Pose error of the three algorithms on the cube image set. For each algorithm, results are sorted by total error and classified. The dotted lines indicate class boundaries and the numbers indicate the class labels. Table 4.1 shows the requirements of each class.
4.5 Breakdown of the translational error for the three algorithms for each image in the cube set. Errors are given in cm.
4.6 Total pose error of the three algorithms for each image in the assembly image set. The translation error is given in cm while the rotation errors are lengths in the scaled consistent space (4.1).
4.7 Pose error of the three algorithms on the assembly image set. For each algorithm, results are sorted by total error and classified. The dotted lines indicate class boundaries and the numbers indicate the class labels. Table 4.1 shows the requirements of each class.
4.8 Breakdown of the translational error for the three algorithms for each image in the assembly set. Errors are given in cm.

4.9 Total pose error of the three algorithms for each image in the cuboid image set. The translation error is given in cm while the rotation errors are lengths in the scaled consistent space (4.1).
4.10 Pose error of the three algorithms on the cuboid image set. For each algorithm, results are sorted by total error and classified. The dotted lines indicate class boundaries and the numbers indicate the class labels. Table 4.1 shows the requirements of each class.
4.11 Breakdown of the translational error for the three algorithms for each image in the cuboid set. Errors are given in cm.
4.12 Total error for the three pose estimation algorithms on the assembly set. The first row shows the results of trying to find the assembly using the assembly as the model, while the second row shows the results of finding the assembly using only the cube as the model.
4.13 Two example result images from Class 2. Both of these poses illustrate instances where poses are perceptually correct and features are matched, but the correspondences are incorrect. The white lines indicate the final pose estimated by the algorithm.
4.14 Example image where the SoftPOSITLines algorithm outperforms the triplet matching algorithm. The goal is to identify the pose of the green cube. The white wire frames show the poses estimated by the two algorithms. In this instance the triplet matching algorithm incorrectly identified the red cuboid as the green cube.
4.15 Two example images (one per column) where the SoftPOSIT algorithms outperform the triplet matching algorithm. The goal is to identify the pose of the red cuboid. The wire frames show the poses estimated by the algorithms. In both instances the triplet matching algorithm incorrectly identifies the surface of the stick as a surface of the red cuboid.

Chapter 1

Introduction

Pose estimation is the process of determining the pose of an object in space. The pose of an object is the object's translation and orientation, i.e., roll, pitch, and yaw, in some coordinate system. This thesis will examine the problem of pose estimation using vision systems.

1.1 Motivation

Pose estimation is an important problem in autonomous systems. In the case of an industrial robot attempting to interact with or avoid an object, the robot must know where the object is located and how it is oriented. Typically, the problem of locating objects for grasping is avoided by ensuring that objects are always at the same location through some sort of tooling system. The objects with which the robot will interact are loaded into the tooling system by humans before the robot is able to interact with them. If the robot were capable of identifying where the objects were via its own pose estimation system it could, in theory, load the parts into the system itself. One reason why this technology is not prevalent in industry currently is that many manufactured objects, such as solid metal/plastic components, do not have many readily detectable features.

Pose estimation is also important in mobile robotic systems. If a robot is to retrieve an object it must be able to locate it in space first. Pose estimation can also be used in mobile robot localization. If the location of a known landmark can be determined then the robot can estimate its own position in space, much like how a human would look for a familiar building or sign to identify where they are.

1.2 Related Work

Many researchers have studied the pose estimation problem and developed algorithms to find the pose of objects. Table 1.1 shows the relationship of a few of the pose estimation algorithms which will be discussed, specifically including the algorithms which will be examined in this thesis. Table 1.2 shows some of the different types of pose estimation problems which will be discussed and the common assumptions associated with them. The three categories of pose estimation problems shown in the table are pose estimation, pose tracking, and AR pose estimation techniques. The first category, pose estimation, addresses the problem of identifying an object's pose in space w.r.t. the camera, using a single image of the object. Pose tracking is the problem of tracking an object's pose from frame to frame in a video sequence, which is equivalent to finding the object's precise pose when the approximate pose is already known. The AR pose estimation techniques presented all work only with video sequences, and are related to structure from motion techniques. The AR techniques address the problem of finding the camera's pose in the world. This thesis focuses on the first category of problems, pose estimation.

                    Known Correspondences          Unknown Correspondences
Monocular Vision    POSIT [10]                     SoftPOSIT [8, 9]
                    PnP Methods [18, 23]           RANSAC [12]
Binocular Vision    Absolute Orientation [24]      Triplet Matching [21]

Table 1.1: Classification of a few of the different pose estimation techniques discussed. Each unknown correspondence algorithm depends on or builds upon the known correspondence algorithm to the left.

Algorithm Type       Algorithms                               Requirements                     Applied To
Pose Estimation      SoftPOSIT [8, 9], Triplet Matching [21]  Model known                      Single image
Pose Tracking        RAPiD [20], [27] and [13]                Model known, approx. pose known  Video or single image
AR Pose Estimation   [36] and [29]                            Moving camera                    Video

Table 1.2: Classifications of the different types of pose estimation algorithms discussed along with their requirements.

Pose estimation, when the approximate pose is known, has been widely studied. These algorithms are generally used for pose tracking. In these instances the pose from one image to the next can only vary slightly, thus the approximate pose is known, and the problem is constrained. Some example algorithms for pose tracking include RAPiD [20], a method proposed by Lowe [27], and yet another method by Jurie [13]. Another common application of pose estimation is in augmented reality (AR) systems. These systems use pose to place objects in an image, such that the inserted object appears as if it were actually in the original scene. Often in these applications precise pose is not necessary because there is no physical interaction between the system and the world, and objects only need to appear as if they were actually in a scene. Also, since AR is typically applied to video, many of the algorithms take advantage of the camera's motion to help with the pose estimation problem. Some example AR pose estimation algorithms include [36, 29]. Lepetit gives a thorough survey of pose estimators for both AR and pose tracking applications in [25]. This thesis will focus on mathematical and geometrical methods of pose estimation, which rely on matching a model of the object to be found to some sort of image or sensor data. In all of these algorithms the true pose is assumed to lie within a large search space, the approximate pose is not known a priori, and the only image data available is a single image or a pair of stereo images. One of the most common methods for estimating pose with a model and image data is to extract features from the image, such as lines, corners, or even circles, and match the extracted features to the model features. If the correspondences/matches between the features of the model and the image are known the problem becomes nearly trivial.

One common algorithm for pose estimation with known point feature correspondences is POSIT (Pose from Orthography and Scaling with ITerations) [10]. This algorithm assumes that feature correspondences are known in advance and will fail when correspondences are incorrect. Other methods of pose estimation with known correspondences include [18, 23, 32, 1]. All of these algorithms are capable of generating pose estimates given a set of point, or in some cases line, correspondences and a camera's calibration matrix. The POSIT algorithm was later updated to become SoftPOSIT [9], which combines the POSIT algorithm with a correspondence estimation algorithm, softassign [15, 38]. This algorithm requires all of the point features in both the model and current image to be provided, along with a guess of the possible pose of the object. The algorithm matches the model and image features and estimates the pose to minimize the distance between all of the matched features. The pose output by the algorithm is dependent upon the initial pose guessed, and the algorithm is not guaranteed to converge. Even in cases where the algorithm does converge there is no way to know that the pose is correct without further evaluation. SoftPOSIT was extended to work with line features [8], but still has many of the same problems as the original SoftPOSIT. Another well known algorithm for estimating poses with features is RANSAC (RANdom SAmple Consensus) [12]. This algorithm matches, at random, the minimum number of point features from the model to features in the image to estimate a pose. The absolute minimum of matched features is three [18], which will provide up to four feasible pose estimates, while four matched features will yield a single pose estimate. By iterating through the possible sets of matches at random the actual pose can be generated. This algorithm has the advantage that it is guaranteed to yield the correct pose at some point; however, the correct pose must be extracted from all of the poses returned by the algorithm. The algorithm is also exponential (theoretically) in execution time as the number of features increases, making it a bad choice for feature rich scenes.

Some of the most robust pose estimation algorithms currently available [16, 17, 6] make use of Scale Invariant Feature Transform (SIFT) [28] features. These algorithms combine SIFT features with monocular, stereo, or Time of Flight (TOF) cameras to give highly accurate poses for objects. Although these algorithms work well, they are limited to use on highly textured objects. This is due to the fact that they rely on SIFT features, which are only present on surfaces with high texture. Therefore, these algorithms are not suitable for use on many manufactured objects which have fairly consistent surfaces such as cardboard boxes, metal components, or plastics. These algorithms would also fail if the surfaces of the objects were changed even when their form remains the same, e.g., if a company redesigned its packaging art or decided to make its products in different colors. Both SoftPOSIT and RANSAC can be applied to any set of image/model point feature correspondences regardless of how they are generated. Besides SIFT, many other popular feature detectors exist, including the Harris corner detector [19], SURF [3], FAST [33], and many others. See [31, 35] for a comprehensive review and comparison of common point feature detectors. However, as with SIFT, other point features require certain types of surface texture variation to function well. If the object to be detected has few corners or reliable surface features, then there are no reliable features to match. This is true of many manufactured objects. Another drawback to feature based methods is that in order to match features they must first be extracted from the image, and as the image's content becomes increasingly complex the number of false matches and occluded features increases. All of the pose estimation algorithms discussed up to this point are feature based, in that they require the matching of model and image features as a step in estimating a pose, and thus are restricted to being applied to objects which contain features.

Another class of pose estimators uses only range data to estimate an object's pose. All of these estimators [30], [34], [21] rely only on range data, that is, (x, y, z) point locations, to estimate poses rather than feature extraction. These types of algorithms can work on objects of any shape, color, or texture provided accurate enough depth information can be extracted. Many devices exist which can generate depth information, including stereo cameras, laser scanners, TOF cameras, sonar, and radar. Thus, these algorithms are not restricted to working only with stereo range data.

1.3 Outline

This paper compares and examines the effectiveness of SoftPOSIT with point features, SoftPOSIT with line features, and a 3-D point triplet matching algorithm in detecting the pose of low texture manufactured objects. The first two algorithms are directly comparable as they both are run on 2-D image data and rely on feature extraction. The third algorithm uses a stereo camera setup to reconstruct the scene's 3-D geometry as a point cloud and then examines this data to extract the pose of the object within. The overall performance of these algorithms will be compared over a sample set of images, but the reader should keep in mind the differences between the algorithms when comparing their performance. Chapter 2 presents some background content including basic concepts of imaging, 3-D reconstruction, and pose estimation with known correspondences. Chapter 3 examines in detail the three pose estimation algorithms presented in this thesis. Chapter 4 presents the experiments conducted to examine the effectiveness of the three pose estimation algorithms studied along with the experimental results. Finally, Chapter 5 presents a review of the experimental findings along with possible future improvements and modifications that can be made.

Chapter 2

Background

2.1 Notation

All points in 3-D space will be defined by the capital letter P, and a superscript letter C, M, or W will designate the frame of reference of the point. The letter C indicates the point is represented in the camera coordinate system, M indicates the point is represented with respect to the model coordinate system, and W indicates the point is represented with respect to the world coordinate system. Points will be enumerated by subscript numbers, or in the general case a subscript i. P^M_2, for example, would correspond to object point 2 in the model's coordinate system. The coordinates of a point P will be expressed by capital letters (X, Y, Z). Figure 2.1 gives an example of 3-D points expressed in different frames. All image points will be designated by the lower case letter p. In the case of two cameras with separate images, superscript C's will be used to indicate the image to which the point belongs. All image points will be enumerated with subscript numbers. For example, p^{C_1}_3 would indicate the third image point in the image generated by camera 1. The coordinates of a point p will be expressed as lowercase letters (x, y).

For both image points p and 3-D points P the homogeneous representation of the points will often need to be used. The homogeneous form is achieved by appending a 1 to the coordinates, so that

$$p = \lambda \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \qquad P = \lambda \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}$$

The homogeneous form allows easier expressions of rotations and translations of points. Note the lambda term is included because homogeneous coordinates are scale invariant. When the last coordinate of the points is 1 the coordinates are referred to as normalized homogeneous coordinates. In any case, the coordinate form (X, Y, Z) or homogeneous form [X, Y, Z, 1]^T of points may be used throughout the thesis when referring to points. It has been shown that points have a homogeneous form which is generated by appending a 1 to the coordinates. However, homogeneous coordinates also allow an alternate way to express lines. Specifically, a line l can be described in a Euclidean sense by the equation ax + by + c = 0 or in homogeneous form by l = [a, b, c]. The previous equation can then be expressed in a homogeneous sense by the equation [a, b, c][x, y, 1]^T = 0.
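To make the notation concrete, the sketch below is a small illustration added for this write-up (it is not from the thesis); the point and line values are arbitrary examples chosen only to show the homogeneous forms and the line incidence test described above.

```python
import numpy as np

def to_homogeneous(p):
    """Append a 1 to a point's coordinates, e.g. (x, y) -> [x, y, 1]."""
    return np.append(np.asarray(p, dtype=float), 1.0)

def normalize_homogeneous(p):
    """Divide by the last coordinate so the point is in normalized homogeneous form."""
    p = np.asarray(p, dtype=float)
    return p / p[-1]

# A 2-D image point and a 3-D point in homogeneous form.
p = to_homogeneous([3.0, 2.0])          # [3, 2, 1]
P = to_homogeneous([1.0, 4.0, 2.5])     # [1, 4, 2.5, 1]

# A line ax + by + c = 0 in homogeneous form l = [a, b, c].
l = np.array([1.0, -1.0, -1.0])         # the line x - y - 1 = 0

# The incidence test [a, b, c][x, y, 1]^T = 0 holds for points on the line.
print(np.isclose(l @ to_homogeneous([3.0, 2.0]), 0.0))   # True: (3, 2) lies on x - y - 1 = 0
```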

Matrices and vectors will both be indicated by bold face text. R ∈ R^{3×3} will be a rotation matrix which can be expressed as

$$R^M_C = \begin{bmatrix} r_1 \\ r_2 \\ r_3 \end{bmatrix}, \qquad r_i \in \mathbb{R}^{1 \times 3} \qquad (2.1)$$

where r_1 is the unit vector of the camera frame's X axis ê^C_x expressed in terms of the unit vectors of the model frame ê^M_x, ê^M_y, and ê^M_z. Similarly, r_2 and r_3 are the unit vectors ê^C_y and ê^C_z expressed in terms of the model coordinate system's unit vectors. This rotation matrix completely describes the rotation from the model to the camera coordinate system and satisfies R^T R = R R^T = I and det(R) = 1. Note that the superscript on R^M_C indicates the source coordinate system and the subscript the destination coordinate system. So R^M_C is the rotation matrix that converts coordinates in the model frame to coordinates in the camera coordinate frame, assuming the origins of the two systems coincide. In the case where the origins of the two systems do not coincide, an additional translation T^M_C ∈ R^3 must be applied to the points to shift them to the correct location, where T^M_C is the vector from the origin of the camera coordinate system to the origin of the model coordinate system in the camera's frame of reference. To convert a point from one coordinate system to another, the rotation and translation transforms can be applied to the point to generate the new coordinates. For example, to convert point P^M from the model frame to the camera frame the following equation would be used:

$$P^C = R^M_C P^M + T^M_C$$

This equation first rotates the point and then shifts it to the proper position in the camera's frame.

Using homogeneous coordinates this transform can be expressed as a single homogeneous rigid body transform of the form

$$\begin{bmatrix} P^C \\ 1 \end{bmatrix} = \begin{bmatrix} R^M_C & T^M_C \\ 0 & 1 \end{bmatrix} \begin{bmatrix} P^M \\ 1 \end{bmatrix}$$

This rigid body transform allows a set of points belonging to an object which are expressed in a model coordinate frame to be expressed in the camera system's coordinate frame.
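As a concrete illustration of the rigid body transform above (a sketch added for this document; the rotation, translation, and point values are made up), the snippet below converts a model-frame point to the camera frame with both the separate R, T form and the single 4 x 4 homogeneous transform, and confirms they agree.

```python
import numpy as np

# An example rotation (90 degrees about the Z axis) and translation; values are arbitrary.
R_MC = np.array([[0.0, -1.0, 0.0],
                 [1.0,  0.0, 0.0],
                 [0.0,  0.0, 1.0]])
T_MC = np.array([0.1, 0.0, 2.0])

P_M = np.array([0.05, 0.02, 0.0])        # a point in the model frame

# Separate form: P^C = R^M_C P^M + T^M_C
P_C = R_MC @ P_M + T_MC

# Equivalent single homogeneous rigid body transform.
G = np.eye(4)
G[:3, :3] = R_MC
G[:3, 3] = T_MC
P_C_h = G @ np.append(P_M, 1.0)

print(np.allclose(P_C, P_C_h[:3]))       # True: both forms give the same camera-frame point
```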

2.2 What is meant by pose?

As discussed in Chapter 1, the goal of pose estimation is to generate a pose that describes an object's position and orientation in space with respect to some coordinate system. Pose in this instance will be a translation T^M_C and rotation R^M_C which fully describe the position and orientation of an object in the camera's frame of reference. If the relationship between the camera's coordinate system and a world coordinate system is known (R^C_W, T^C_W), the overall pose of the object in the world can be determined; see (2.2). Figure 2.1 shows the relationship of three coordinate systems. In Figure 2.1, P^M_i are the object points expressed in the model coordinate frame, and P^M_0 is the centroid of the model and the origin of the model coordinate system. P^C_i are the object points expressed in the camera coordinate frame, and P^C_0 is the centroid of the object in the camera's coordinate frame. P^W_i are the object points in the world coordinate frame.

Figure 2.1: The relationship of the model, camera, and world coordinate systems.

The equation relating the coordinates of the points in the model frame to the points in the world frame is given by

$$\begin{bmatrix} P^W \\ 1 \end{bmatrix} = \begin{bmatrix} R^C_W & T^C_W \\ 0 & 1 \end{bmatrix} \begin{bmatrix} R^M_C & T^M_C \\ 0 & 1 \end{bmatrix} \begin{bmatrix} P^M \\ 1 \end{bmatrix} \qquad (2.2)$$

If it is assumed that the camera's relationship to the world is constant, i.e., the camera does not move or the camera and world frame move synchronously, then the transform relating the camera coordinate system and world coordinate system (R^C_W, T^C_W) can be calculated once and will remain constant. Assuming the camera to world transform is known, the goal of pose estimation is to find the rotation matrix R^M_C and translation vector T^M_C which will locate the object in the camera frame of reference.

Figure 2.2: Mathematically identical camera models. (a) Camera model. (b) Frontal camera model.

2.3 How does imaging work?

2.3.1 Modeling a camera

The simplest model to examine the behavior of a camera is the pinhole model. This model treats the camera as a single point and a plane. In an actual camera, light in the world travels through a lens which focuses the light through a point and onto film or a CCD. The point in the model is equivalent to the center of the lens, the optical center, and the plane is equivalent to the CCD or film in a camera. The optical center of the camera will be defined as the origin of the camera's coordinate system, O^C. The Z-axis will be defined by the location where the plane normal passes through O^C, and the X and Y axes will be parallel to the image plane with the X-axis left to right and the Y-axis pointing up and down as in Figure 2.2(a). This geometry generates an inverted image, which digital cameras correct by inverting the image data. To achieve the same result with the model, the imaging plane can be moved in front of the focal point as in Figure 2.2(b).

Figure 2.3 illustrates the projection of a point onto the image plane for both models. Notice the frontal plane model gives a non-inverted image. The length of the perpendicular line between the image plane and the optical center is the focal length f. It is related to the length between the CCD/film and the lens of a camera. The units used for the length will determine the correspondence between pixel lengths and real world lengths. In this thesis all lengths will be in meters; thus, f has units of pixels/meters.

2.3.2 The geometry of image formation

Using this frontal model the geometry of how an image is formed can be explained. Figure 2.3 shows the projection of a point onto the image plane for both the real and frontal camera models. Notice that two similar triangles are formed with lengths Y, Z and y, f. Using the relationship of similar triangles the y coordinate, and similarly the x coordinate, of the projected point p^P = (x, y) can be calculated. Note that the superscript P indicates the coordinates are with respect to the projected image coordinate system. The relationship between the two coordinate systems is as follows:

$$x^P = f\frac{X^C}{Z^C}, \qquad y^P = f\frac{Y^C}{Z^C} \qquad (2.3)$$

At this stage the transform necessary to project points from the camera's coordinate system onto the image plane and into the projected image coordinate system has been shown. Since images typically assume that the origin of the image coordinate system is at the top left of the image, an additional transform must be applied to these projected points' coordinates to shift the origin to the top left. This transform is a simple translation in the x and y coordinates of the image.

Figure 2.3: The projection of a point onto the image plane.

With the previous transformation equation (2.3) the change was from camera coordinates in meters to image coordinates in pixels; however, this transform is within the same space, thus the translation's units are in pixels. Specifically, the translation T^P_I = [u_o, v_o], where u_o, v_o are the coordinates of the center of the image in pixels w.r.t. the image coordinate system origin O^I. Thus, the complete transform to convert from camera coordinates to image coordinates is given by the equation

$$x^I = f\frac{X^C}{Z^C} + u^I_o, \qquad y^I = f\frac{Y^C}{Z^C} + v^I_o \qquad (2.4)$$

This equation can be simplified by using homogeneous coordinates and some simple matrix algebra.

$$\lambda \underbrace{\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}}_{p^I} = \underbrace{\begin{bmatrix} f & 0 & u_o & 0 \\ 0 & f & v_o & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}}_{H} \underbrace{\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}}_{P^C} \qquad (2.5)$$

The λ factor is in the equation to ensure that the result of the matrix multiplication is indeed a normalized homogeneous coordinate, i.e., its third coordinate is one.

This factor appears because any point along a ray from the optical center through a pixel on the image plane will project down to that pixel. The H matrix in equation (2.5) is commonly referred to as the camera matrix or calibration matrix, and this is its simplest form. In reality the two f terms are slightly different because of varying pixel dimensions in the X and Y directions. Additionally, there is a skew term which can be added to the matrix. There are also distortion terms which can be used to correct lens distortion in the projection, but for most simple applications all of the distortions can be ignored along with the higher complexity terms in the camera matrix. Without in-depth knowledge of the construction of the camera it is not possible to know the value of f, u_o, or v_o. Thus methods have been developed to determine these parameters through calibration. With proper calibration all of the parameters in the calibration matrix along with the distortion terms can be estimated with a high level of accuracy.
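A short sketch of equation (2.5) in code, added here for illustration (the focal length, principal point, and test point are invented values): camera-frame points are pushed through H and then divided by λ to get pixel coordinates.

```python
import numpy as np

f, u_o, v_o = 800.0, 320.0, 240.0        # assumed focal length and principal point, in pixels

# The simplest camera matrix H from equation (2.5).
H = np.array([[f,   0.0, u_o, 0.0],
              [0.0, f,   v_o, 0.0],
              [0.0, 0.0, 1.0, 0.0]])

def project(P_C):
    """Project a camera-frame point onto the image plane and normalize by lambda."""
    p = H @ np.append(P_C, 1.0)          # homogeneous image point, scaled by lambda = Z^C
    return p[:2] / p[2]                  # normalized pixel coordinates (x^I, y^I)

print(project(np.array([0.1, -0.05, 2.0])))   # [360., 220.]
```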

2.4 Camera calibration

There exist many different methods for performing camera calibration. In this implementation the built-in method, cv::calibrateCamera, in the OpenCV library was used. Camera calibration requires a series of differing views of a calibration pattern, in this case a checkerboard, to be fed into the function along with the dimensions of the checkers on the pattern. The checkerboard pattern makes it easy to find the corners of the squares, and if the dimensions of the squares are known a model for the checkerboard can be easily generated. With a known model of the calibration pattern and with the detected squares of the imaged calibration pattern, correspondences between the detected image corners and the model corners can be generated. Using these correspondences a homography matrix can be generated that represents the transform the model goes through to create the image. Using a number of these homographies from different images of the calibration pattern, the parameters of the calibration matrix can be empirically determined. Thus the matrix H can be determined. A more detailed explanation of the calibration process can be found in [40]. A survey of calibration methods and their approaches can be found in [2].
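The procedure just described maps directly onto OpenCV's Python bindings; the sketch below is illustrative only, with a placeholder checkerboard size, square size, and image path rather than the values actually used in the thesis.

```python
import cv2
import numpy as np
import glob

pattern = (9, 6)            # interior corner count of the checkerboard; placeholder value
square = 0.025              # square size in meters; placeholder value

# Model of the planar checkerboard: (x, y, 0) corner coordinates in meters.
board = np.zeros((pattern[0] * pattern[1], 3), np.float32)
board[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

object_points, image_points = [], []
for name in glob.glob("calibration_views/*.png"):      # differing views of the pattern (placeholder path)
    gray = cv2.imread(name, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        object_points.append(board)
        image_points.append(corners)

# Fit the calibration matrix and distortion terms from the model/image correspondences.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    object_points, image_points, gray.shape[::-1], None, None)
print("reprojection error:", rms)
print("calibration matrix:\n", K)
```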

2.5 Pose From Correspondences

It has been shown that any point in space which lies along a ray that intersects the image plane can project down to the plane at that intersection point. If a series of correspondences between a geometric model and an image of that model can be determined, then a pose estimate which aligns the model points in space along the rays passing through the image points can be generated. Figure 2.4 shows a possible pose generated from four correspondences between an image and a model.

Figure 2.4: Estimating pose with known correspondences.

Normally four points are sufficient to recover the correct actual pose of the object as long as the four points are not co-planar. In this example, Figure 2.4, the four points are co-planar. For three non co-planar points or four co-planar points, there are multiple poses for the object which will result in the same image. Using four non-coplanar points avoids this problem. There are many methods to solve for the pose of an object given a model and a set of image correspondences, including the P3P (Perspective 3 Point) [18] problem, POSIT [10], and others [23][1]. In this paper the focus will be on the POSIT algorithm as it is an integral part of the SoftPOSIT algorithm.

2.6 POSIT

2.6.1 Overview of POSIT

POSIT [10] uses known image/model point correspondences and a known camera calibration to reconstruct the pose of an object. The goal of POSIT is to relate a model's geometry, a scaled orthographic projection of the model, and an actual image of the modeled object to recover all of the parameters which define the pose. The algorithm initially assumes that the object is at some depth which is relatively far away from the camera as compared to the depth of the actual object itself, and then fits the pose as best it can at this depth by trying to align image and model features. This is the POS (Pose from Orthography and Scaling) algorithm. Based upon the error of the fit a better depth estimate is created and the process is repeated. The repeated application of the POS algorithm is the POSIT (POS with ITerations) algorithm. After iteratively improving the pose, the algorithm will eventually converge and return the pose of the object.

2.6.2 Scaled Orthographic Projection

The Scaled Orthographic Projection (SOP) of a model is an approximation of the perspective transform. In fact, the SOP is a special case of the perspective transform where all of the points in the scene of an image lie in a plane parallel to the image plane. To generate the scaled orthographic projection all of the points in a scene are orthogonally projected onto a plane parallel to the image plane at a distance Z_o from the camera's origin. Then these point coordinates are scaled by Z_o/f to generate the SOP. In POSIT the model undergoes the SOP to generate a simulated image. If there are N model points P^M_0 ... P^M_N ∈ R^{3×1}, where P^M_0 coincides with the origin of the model coordinate system, then a perspective projection of these points would have the form of the equation in (2.3), i.e.,

$$x^P_i = f\frac{X^C_i}{Z^C_i}, \qquad y^P_i = f\frac{Y^C_i}{Z^C_i}$$

Assuming the plane for the orthographic projection is located at the z coordinate of P^M_0 in the camera coordinate system, i.e., Z_o = Z^C_0, then the SOP image coordinates p'_i of a point P^M_i are given by

$$x'_i = f\frac{X^C_i}{Z^C_0}, \qquad y'_i = f\frac{Y^C_i}{Z^C_0}$$

Combining these forms, a more desirable form of the SOP image coordinates p'_i, which relates the known image coordinates and desired model coordinates in the camera's coordinate system, is generated:

$$x'_i = x^P_0 + s(X^C_i - X^C_0), \qquad y'_i = y^P_0 + s(Y^C_i - Y^C_0), \qquad s = \frac{f}{Z^C_0} \qquad (2.6)$$

2.6.3 POS

The prior definition of the rotation matrix (2.1) will be used as the unknown rotation matrix R^M_C we seek to find with the POSIT algorithm. Using this notation the pose of the object can be fully recovered with the parameters r_1, r_2, r_3, and the coordinates of P^C_0. The following two equations relate the known parameters, the model and image features, to the unknown parameters r_1, r_2, and Z^C_0:

$$(P^M_i - P^M_0) \cdot \frac{f}{Z^C_0} r_1 = x^P_i(1 + \epsilon_i) - x^P_0 \qquad (2.7)$$

$$(P^M_i - P^M_0) \cdot \frac{f}{Z^C_0} r_2 = y^P_i(1 + \epsilon_i) - y^P_0 \qquad (2.8)$$

where ε_i is defined as

$$\epsilon_i = \frac{1}{Z^C_0}(P^M_i - P^M_0) \cdot r_3 \qquad (2.9)$$

and r_3 is calculated by taking r_1 × r_2, since R^M_C is required to have orthogonal rows. These equations relate the image coordinates of the SOP and the actual perspective projection to the model, with the coordinates of the SOP expressed in terms of the perspective projection. Looking in more detail, it can be shown that x'_0 = x^P_0 because the plane used in generating the SOP is located at the Z coordinate of P^M_0. Thus, the SOP and perspective projection of P^M_0 are the same point. Examining the term x^P_i(1 + ε_i), it can be shown that this term is the image coordinate x'_i of the SOP of P^M_i; the full proof of this fact is shown in [10]. Intuitively this makes sense because ε_i is the ratio of the distance between the Z coordinates of the model points and the distance between the camera's origin and the orthographic projection plane. Thus, if an object is far away ε_i is small and x'_i ≈ x^P_i, but when an object is close ε_i is large and the disparity between x'_i and x^P_i increases, so the coordinate must be shifted a greater distance. The left hand side of equation (2.7) is the projection of a vector in the model coordinate system onto the vector r_1, which is the image X-axis expressed in the model coordinate system; this projection is then scaled by the SOP scaling factor. Thus, the result of the left hand side of the equation is the length between the two model points P^M_0 and P^M_i along the X-axis in the SOP coordinate system, which is equal to the distance between the points x'_i = x^P_i(1 + ε_i) and x'_0 = x^P_0. Since r_1, r_2, and Z^C_0 will be chosen to optimize the fit of all N model points, equations (2.7) and (2.8) will need to be re-written in a form which lends itself to developing a linear system. The equations are rewritten

$$(P^M_i - P^M_0) \cdot I = \xi_i, \qquad (P^M_i - P^M_0) \cdot J = \eta_i$$

with

$$I = \frac{f}{Z^C_0} r_1, \quad J = \frac{f}{Z^C_0} r_2, \quad \xi_i = x^P_i(1 + \epsilon_i) - x^P_0, \quad \eta_i = y^P_i(1 + \epsilon_i) - y^P_0 \qquad (2.10)$$

These equations can be rewritten as a linear system of the form

$$A I = x, \qquad A J = y \qquad (2.11)$$

A ∈ R^{(N-1)×3} is the matrix of model points P^M_{1...N} in the model coordinate system, which does not change. I is the same as in equation (2.10), while x ∈ R^{(N-1)×1} and y ∈ R^{(N-1)×1} are vectors containing the ξ_i and η_i respectively. Equation (2.11) can be solved in a simple least squares sense to give values for I and J. Looking back at the definitions of I and J, it can be seen that r_1 and r_2 can be recovered by normalizing I and J. The amount by which r_1 and r_2 are scaled is s = f/Z^C_0; thus the average of the magnitudes of I and J gives a good estimate of f/Z^C_0. Since f is known in the algorithm, Z^C_0 can be readily calculated. The last parameters to be calculated are r_3 and ε_i. r_3 can be quickly generated by taking r_1 × r_2, and ε_i is now dependent only on already calculated parameters.

2.6.4 POS with ITerations

By using the results of the first application of POS to generate new values of ε_i, and then repeating the POS algorithm with the new ε_i values, the POSIT algorithm is developed. Up to now the POSIT algorithm has been developed; now the problem of how to start the algorithm is addressed. After all, the linear system (2.11) requires initial values for the ε_i. Making the assumption that the Z dimensions of the object are small compared to the distance to the object from the camera, the algorithm can be started with ε_i = 0. This initial seeding of the algorithm works well when the assumption is true, but can cause the algorithm to diverge from the correct answer when the assumption is false. Because of this the POSIT algorithm is only useful when the assumption is indeed true, which for many real applications is acceptable.

If the POSIT algorithm is run in a loop until |ε_i(n) − ε_i(n−1)| < δ, then the algorithm can be considered to have converged. Once the algorithm has converged the pose parameters can be recovered from the values returned by POSIT. R^M_C can be recovered from r_1, r_2, and r_3, and the translation vector is T^M_C = [x^P_0/s, y^P_0/s, f/s]^T, which is the image point p^P_0 projected back into space at a depth Z^C_0. Now the object's pose has been reconstructed using the model coordinates, corresponding image coordinates, and the camera's focal length.
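The loop described in Sections 2.6.3 and 2.6.4 is short enough to sketch directly. The code below is an illustrative reimplementation written for this document (not the thesis author's code); it assumes the image coordinates are already expressed relative to the principal point and that point 0 is the model origin with a known correspondence.

```python
import numpy as np

def posit(P_model, p_image, f, iters=50, tol=1e-6):
    """POS with iterations, following equations (2.7)-(2.11).

    P_model: (N+1, 3) model points, P_model[0] at the model origin.
    p_image: (N+1, 2) corresponding image points relative to the principal point.
    Returns (R, T) mapping model coordinates into the camera frame.
    """
    A = P_model[1:] - P_model[0]                 # fixed matrix of model vectors
    x, y = p_image[1:, 0], p_image[1:, 1]
    x0, y0 = p_image[0]
    eps = np.zeros(len(A))                       # start with epsilon_i = 0 (object assumed shallow)
    for _ in range(iters):
        xi = x * (1.0 + eps) - x0                # right-hand sides of (2.7) and (2.8)
        eta = y * (1.0 + eps) - y0
        I = np.linalg.lstsq(A, xi, rcond=None)[0]    # solve A I = xi in a least squares sense
        J = np.linalg.lstsq(A, eta, rcond=None)[0]   # solve A J = eta
        s = 0.5 * (np.linalg.norm(I) + np.linalg.norm(J))   # estimate of f / Z0
        r1, r2 = I / np.linalg.norm(I), J / np.linalg.norm(J)
        r3 = np.cross(r1, r2)
        Z0 = f / s
        new_eps = A @ r3 / Z0                    # equation (2.9)
        if np.max(np.abs(new_eps - eps)) < tol:  # converged: epsilon no longer changes
            eps = new_eps
            break
        eps = new_eps
    R = np.vstack([r1, r2, r3])
    T = np.array([x0 / s, y0 / s, f / s])        # image point p0 back-projected to depth Z0
    return R, T
```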

2.7 3D Reconstruction From Stereo Images

One last topic to explore related to the algorithms which will be presented is 3D reconstruction from stereo images. The goal of 3-D reconstruction is to re-project an image's points back into space at the appropriate depth so that a 2-D image can be used to recreate a 3-D point cloud which approximates the continuous surface which was imaged. If there are two cameras in a world looking at the same object, then each camera will project the same point P on the object down to different points, p^{C_1} and p^{C_2}, in each camera's image coordinate system. Using the camera models as shown in Figure 2.5, for each camera the line of sight from the camera origin through the image plane at the pixel corresponding to the model point P can be reconstructed. If noise is non-existent then, in theory, the lines of sight from both cameras will intersect at the object point in space. If the distance and orientation between the two cameras is known, then the location of the object point in space w.r.t. the cameras can be determined via triangulation. Two major assumptions are made above which must be explored further.

Figure 2.5: Example of two cameras in space.

First, it was assumed that the rotation R^{C_1}_{C_2} and translation T^{C_1}_{C_2} between the two cameras were known. In reality this is almost never the case. Thus this relationship must be determined via some method. Thankfully, due to the geometry of two cameras looking at a point, the rotation and translation between the two cameras can be calculated relatively easily.

2.7.1 Finding the Essential Matrix

Looking at Figure 2.5, the line drawn between the two cameras' origins is called the baseline, and it intersects each camera's image plane at e^{C_1} and e^{C_2}. These two points are referred to as the epipolar points, and the lines between e^{C_1}, p^{C_1} and e^{C_2}, p^{C_2} are the epipolar lines l^{C_1}, l^{C_2}. The baseline is the common edge of the triangle formed between the cameras' origins and any point in space P. Any point lying on this triangle in space will project onto the image plane of camera one somewhere along the line between p^{C_1} and e^{C_1}, and onto the image plane of camera two somewhere along p^{C_2} and e^{C_2}. The essential matrix E captures the relationship of a normalized homogeneous image point p^{C_1} and its epipolar line l^{C_1} in image one to the corresponding epipolar line l^{C_2} in image two, specifically

$$l^{C_2} = E\, p^{C_1}, \qquad l^{C_1} = E^T p^{C_2}$$

Looking at point P in Figure 2.5, P^{C_1} are the coordinates of P in camera one's coordinate system. The coordinates of P^{C_2} are P^{C_2} = R^{C_1}_{C_2} P^{C_1} + T^{C_1}_{C_2}. Converting to normalized homogeneous image coordinates, this relationship becomes

$$\lambda_2\, p^{C_2} = R^{C_1}_{C_2}\, \lambda_1\, p^{C_1} + T^{C_1}_{C_2}$$

Multiplying this equation by $\hat{T}$ gives

$$\hat{T}\, \lambda_2\, p^{C_2} = \hat{T} R^{C_1}_{C_2}\, \lambda_1\, p^{C_1} + 0$$

with

$$\hat{T} = \begin{bmatrix} 0 & -T_3 & T_2 \\ T_3 & 0 & -T_1 \\ -T_2 & T_1 & 0 \end{bmatrix}$$

Taking the inner product of both sides with p^{C_2} gives

$$(p^{C_2})^T\, \hat{T} R^{C_1}_{C_2}\, (p^{C_1}) = 0 \qquad (2.12)$$

This equation is known as the epipolar constraint, and the essential matrix E is given by

$$E = \hat{T} R^{C_1}_{C_2}$$

E is a function of R^{C_1}_{C_2} and T^{C_1}_{C_2}, and if E can be calculated, R^{C_1}_{C_2} and T^{C_1}_{C_2} can be recovered. If a number of point correspondences between images from camera one and images from camera two can be generated, then by exploiting the epipolar constraint

and the properties of the matrix E, a precise numerical approximation of E can be calculated. A common algorithm which does this is the 8-Point algorithm [26]. In brief, the algorithm sets up a linear system of equations using the point correspondences and E that conforms to the epipolar constraint. This system is then solved in a least squares sense to give a best fit E. Using SVD the rank of E is forced to be two, which is the form required for an essential matrix. The result is an accurate approximation of E. With E known, R^{C_1}_{C_2} and T^{C_1}_{C_2} can be recovered using SVD. It was shown that for any point in image one, the corresponding point in image two will lie along the line defined by l^{C_2} = E p^{C_1}. A method to calculate E and find the location of camera two in relationship to camera one has also been developed. Using all of these knowns, a point in image one can be chosen, then the corresponding point in image two can be found along the line l^{C_2}, which allows the triangulation of point P using the known correspondences, R^{C_1}_{C_2}, and T^{C_1}_{C_2}.
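In practice this machinery is available off the shelf. The sketch below, written for this document with a synthetic placeholder scene, estimates E from matched points and decomposes it into R and T; note that OpenCV's estimator runs a five point solver inside RANSAC rather than the 8-Point algorithm described above.

```python
import cv2
import numpy as np

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])              # an assumed calibration matrix

# Synthetic scene: random 3-D points seen by camera one, and by camera two after a
# known rotation/translation (placeholders standing in for real matched features).
rng = np.random.default_rng(0)
P = rng.uniform([-1, -1, 4], [1, 1, 8], size=(50, 3))
R_true, _ = cv2.Rodrigues(np.array([0.0, 0.2, 0.0]))   # small rotation about Y
t_true = np.array([-0.3, 0.0, 0.0])                    # baseline along X

def project(points, R, t):
    cam = points @ R.T + t                              # camera-frame coordinates
    return (cam[:, :2] / cam[:, 2:3]) * K[0, 0] + K[:2, 2]

pts1 = project(P, np.eye(3), np.zeros(3))
pts2 = project(P, R_true, t_true)

# Estimate E from the epipolar constraint and decompose it into rotation and translation.
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
_, R_12, T_12, _ = cv2.recoverPose(E, pts1, pts2, K)
print(R_12)
print(T_12)    # recovered only up to scale
```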

2.7.2 Stereo rectification

Up to now one of the two assumptions which were made earlier has been addressed, which is that R^{C_1}_{C_2} and T^{C_1}_{C_2} were known. The second assumption was that there was no noise in the image. In reality noise is unavoidable in imaging due to the fact that points in continuous space are projected into pixels which have discrete coordinates. A second level of noise is added due to imperfections and distortions in the lens of the camera. With noise added into the images, the two rays projected out from each camera's origin through the corresponding image points will not intersect in space. Thus an approximate intersection must be chosen which minimizes some sort of error metric, such as the re-projection error in both images. Avoiding the complexities of approximating the intersection of the two lines and continuously calculating search lines l^{C_2} to find correspondences, the images from each camera can first be rectified. In a rectified stereo pair the cameras have the layout shown in Figure 2.6.

Figure 2.6: Example of two stereo rectified cameras.

In this camera layout the baseline between the cameras does not intersect the image planes because the image planes are parallel. Since the baseline does not intersect the image planes, the epipolar points are now at infinity. When this happens the corresponding epipolar lines in each image are the same and are all parallel. This simplifies the search for correspondences because now a pixel at location p^{C_1} = (x, y) will correspond to a pixel in image two at p^{C_2} = (x − d, y). The value d is known as the disparity for the pixel between the two images. Correspondences can be easily generated by comparing the sum of the color values in a window around a point p^{C_1} in image one to the sum of the color values in a window around a point p^{C_2} in image two, where the two points are related by a disparity d. The value of d which minimizes the difference of these two sums is the optimum disparity for the pixel. Looking at the geometry between the two cameras, the depth of a point is directly related to the length of the baseline, the focal length of the camera, and the disparity.

This relationship is given by

$$Z^{C_1} = \frac{f B}{d}$$

where B is the length of the baseline, d is the disparity, f is the focal length, and Z^{C_1} is the distance of the world point from the camera's origin along the Z axis. Figure 2.7 shows this relationship.

Figure 2.7: Geometry of the disparity to depth relationship.

Ignoring the fact that noise causes the projection rays from each image to not intersect at an exact point, but instead choosing to re-project the point along the ray corresponding to image one, the coordinates of point P can be reconstructed:

$$P = \left( \frac{Z^{C_1}}{f}\, x,\; \frac{Z^{C_1}}{f}\, y,\; Z^{C_1} \right)$$

To convert the cameras' geometry to the geometry of stereo rectified cameras, the image planes of the two cameras can be rotated in space so that they become co-planar. If the two planes are only rotated then the baseline will remain the same and the above calculations will hold. Once the rotation is found which aligns the two image planes, a transformation can be calculated which converts the pixel coordinates in the original image plane to the proper coordinates in the new image plane. The result is two stereo rectified images. OpenCV includes a function which can perform this transformation, which is based upon the method in [14]. If the object points are reconstructed with respect to this rectified image plane, the reconstructed points can be transferred back to the original coordinate system by using the inverse of the rotation used to generate the new image plane. Thus it has been shown how the locations of 3D points of an object can be recovered from two images of the points.
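A sketch of the disparity-to-depth relationship and back-projection above, using assumed focal length and baseline values (a real pipeline would take d from a stereo matcher run over the rectified rows):

```python
import numpy as np

f = 800.0        # focal length in pixels (assumed value)
B = 0.12         # baseline length in meters (assumed value)

def reconstruct(x, y, d):
    """Back-project a rectified pixel (x, y) with disparity d to camera-one coordinates.

    Image coordinates are taken relative to the principal point, as in the text.
    """
    Z = f * B / d                         # depth from the disparity relationship
    return np.array([Z * x / f, Z * y / f, Z])

print(reconstruct(40.0, -12.0, 16.0))     # [0.3, -0.09, 6.0]: a point at 6 m depth
```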


Chapter 3

Methods

This thesis will focus on the implementation and comparison of three pose estimation algorithms. The first of the three algorithms is the SoftPOSIT [9] algorithm. The second algorithm is an extension of the SoftPOSIT algorithm designed to work with line features [8] instead of point features. The last of the three algorithms is one proposed by Ulrich Hillenbrand in a paper called Pose Clustering From Stereo Data [21].

3.1 SoftPOSIT

3.1.1 The SoftPOSIT algorithm

The SoftPOSIT algorithm is an extension of POSIT which is designed to work with unknown correspondences. The algorithm develops correspondences while updating the estimate of the pose. The algorithm takes an initial guess of the pose and then develops possible correspondences based upon the initial pose guessed. With the set of guessed correspondences the pose can be refined and then new correspondences generated. This process is repeated until a final set of correspondences and the pose fitting the correspondences is generated. First, the method used to update the pose is changed slightly from the original POSIT algorithm.

3.1.1.1 Updating POSIT

The previous definition of a rotation matrix R^M_C from equation (2.1) will again be used. The vector T^M_C = [T_x, T_y, T_z]^T is the translation from the origin of the camera O^C to the origin of the model P^C_0, which need not be a visible point. The rigid body transform relating the model frame to the camera frame is then given by the combination of R^M_C and T^M_C. The image coordinates of the N model points P^M_{i=0...N}, with the model pose given by R^M_C and T^M_C, are

$$w_i \begin{bmatrix} x^P_i \\ y^P_i \\ 1 \end{bmatrix} = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} R^M_C & T^M_C \\ 0 & 1 \end{bmatrix} \begin{bmatrix} P^M_i \\ 1 \end{bmatrix}$$

Notice that the camera matrix H here assumes the image coordinates are with respect to the principal point, not the shifted image origin as in equation (2.5). The previous expression can be rewritten to take the form

$$w_i \begin{bmatrix} x^P_i \\ y^P_i \\ 1 \end{bmatrix} = \frac{1}{T_z} \begin{bmatrix} f r_1 & f T_x \\ f r_2 & f T_y \\ r_3 & T_z \end{bmatrix} \begin{bmatrix} P^M_i \\ 1 \end{bmatrix}$$

Setting s = f/T_z and remembering that homogeneous coordinates are scale invariant, the previous equation can be re-written

$$w_i \begin{bmatrix} x^P_i \\ y^P_i \end{bmatrix} = \begin{bmatrix} s r_1 & s T_x \\ s r_2 & s T_y \end{bmatrix} \begin{bmatrix} P^M_i \\ 1 \end{bmatrix} \qquad (3.1)$$

$$w_i = r_3 \cdot P^M_i / T_z + 1$$

Notice that w_i is similar to the (ε_i + 1) term in equations (2.7) and (2.8) from the POSIT algorithm. Similar to the term in POSIT, w_i is the projection of a model line onto the camera's Z axis plus one. That is, w_i is the ratio of the distance from the camera origin to a model point over the distance from the camera origin to the SOP plane, or simply the ratio of the Z coordinate of a model point over the distance to the SOP plane. The equation for the SOP of a model point takes a similar form

$$\begin{bmatrix} x'_i \\ y'_i \end{bmatrix} = \begin{bmatrix} s r_1 & s T_x \\ s r_2 & s T_y \end{bmatrix} \begin{bmatrix} P^M_i \\ 1 \end{bmatrix} \qquad (3.2)$$

This is identical to equation (3.1) if and only if w_i = 1. If w_i = 1 then r_3 · P^M_i = 0, which means that the model point lies on the SOP projection plane and the SOP is identical to the perspective projection. Rearranging equation (3.1) gives

$$\underbrace{\begin{bmatrix} P^M_i \\ 1 \end{bmatrix}^T \begin{bmatrix} s\, r_1^T & s\, r_2^T \\ s\, T_x & s\, T_y \end{bmatrix}}_{p'_i} = \underbrace{\begin{bmatrix} w_i x^P_i & w_i y^P_i \end{bmatrix}}_{p''_i} \qquad (3.3)$$

Assuming there are at least four correspondences between model points P^M_i and image points p_i, and that w for each correspondence is known, a system of equations can be set up to solve for the unknown parameters in equation (3.3). The left half of equation (3.3) will be defined as p'_i, which is the SOP of model point P^M_i for the given pose, as in Figure 3.1. This definition is straightforward, as the left half of equation (3.3) is simply the transpose of equation (3.2), which was the equation for the image coordinates of the SOP of a model point. The right hand side of equation (3.3) will be defined as p''_i, which is the SOP of model point P^M_i constrained to lie along the true line of sight L_i of P^C_i, the line passing through the camera origin and the actual image point p_i. The point lying along the line of sight will be referred to as P^C_{L_i} and will be constrained to have the same Z coordinate as P^C_i. Figure 3.1 illustrates the relative layout of the points. It has been shown that p''_i = w p_i, which can be proven by observing the geometry of the points. It was shown before that w is the ratio of the Z coordinate of a model point over the distance to the SOP plane T_z. Therefore, w T_z is the Z coordinate of a point P^C_i. Using this fact, P^C_{L_i} = w T_z p_i / f, which is the re-projection of image point p_i to a depth w T_z. This gives the camera coordinates of point P^C_{L_i}.

When the correct pose is found, the points p'_i and p''_i will be identical because P^C_i will already lie along L_i, the line of sight of P^C_i. Thus the goal of the algorithm is to find a pose such that the difference between the actual SOP and the SOP constrained to the lines of sight is zero. An error function which defines the error as the sum of the squared distances between p'_i and p''_i is given by

E = \sum_i^N d_i^2 = \sum_i^N \| p'_i - p''_i \|^2 = \sum_i^N \left[ (Q_1 \cdot P^M_i - w x_{P_i})^2 + (Q_2 \cdot P^M_i - w y_{P_i})^2 \right]    (3.4)

[Figure 3.1: Point relationships in SoftPOSIT]

with

Q_1 = s \begin{bmatrix} \mathbf{r}_1 & T_x \end{bmatrix}

Q_2 = s \begin{bmatrix} \mathbf{r}_2 & T_y \end{bmatrix}

Iteratively minimizing this error will eventually lead to the right pose. To minimize the error, the derivative of equation (3.4) is taken, which can be expressed as a system of equations

Q_1 = \left( \sum_i^N P^M_i P^{M\,T}_i \right)^{-1} \left( \sum_i^N w x_{P_i} P^M_i \right)    (3.5)

Q_2 = \left( \sum_i^N P^M_i P^{M\,T}_i \right)^{-1} \left( \sum_i^N w y_{P_i} P^M_i \right)    (3.6)

Like POSIT, at the start of the loop it can be assumed that w_{i=0...N} = 1 in order to calculate new values for Q_1 and Q_2, and then w_{i=0...N} is updated using the new estimated pose. What has been developed up to here is simply a variation on the original POSIT algorithm; it can now be extended to work with unknown correspondences.
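The following MATLAB sketch shows one iteration loop of this pose update with known correspondences, directly implementing equations (3.1), (3.5), and (3.6). The example model points and image coordinates are illustrative assumptions and are not taken from the thesis data.

% Minimal sketch of POSIT-style pose updates with known correspondences,
% implementing equations (3.1), (3.5) and (3.6). Pm holds homogeneous model
% points [X;Y;Z;1]; x and y are image coordinates relative to the principal
% point; f is the focal length. All numeric values are illustrative.
f  = 800;
Pm = [0 3 3 0 0; 0 0 3 3 0; 0 0 0 0 3; 1 1 1 1 1];   % five cube corners (cm), non-coplanar
x  = [10; 90; 95; 14; 12];                            % assumed matched image x coordinates
y  = [12; 15; 92; 88; -70];                           % assumed matched image y coordinates
N  = size(Pm, 2);
w  = ones(N, 1);                                      % initial assumption w_i = 1
for iter = 1:30
    A  = Pm * Pm';                     % sum_i P_i P_i^T
    Q1 = A \ (Pm * (w .* x));          % equation (3.5)
    Q2 = A \ (Pm * (w .* y));          % equation (3.6)
    s  = sqrt(norm(Q1(1:3)) * norm(Q2(1:3)));
    r1 = Q1(1:3)' / s;  r2 = Q2(1:3)' / s;  r3 = cross(r1, r2);
    Tz = f / s;
    w  = (r3 * Pm(1:3, :))' / Tz + 1;  % update w from equation (3.1)
end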

3.1.1.2 POSIT with unknown correspondences

For the case of unknown correspondences there are N model points and M image points. Model points will be indexed with the subscript i and image points with the subscript j; thus there are P^M_{i=0...N} model points and p_{j=0...M} image points. If correspondences are unknown, then any image point can correspond to any model point and there are a total of MN possible correspondences. With w defined as before in (3.1), the new SOP image points are

p''_j = w p_j    (3.7)

and

p'_i = \begin{bmatrix} Q_1 \cdot P^M_i \\ Q_2 \cdot P^M_i \end{bmatrix}    (3.8)

where equation (3.7) is the SOP of model point P^C_i constrained to the line of sight L_j of image point p_j, and equation (3.8) is identical to the original p'_i in (3.3) but uses the new Q notation. The distance between points p''_j and p'_i is given by

d_{ij}^2 = \| p'_i - p''_j \|^2 = (Q_1 \cdot P^M_i - w x_{P_j})^2 + (Q_2 \cdot P^M_i - w y_{P_j})^2    (3.9)

which can be used to update the previous error equation (3.4), giving a new error equation

E = \sum_i^N \sum_j^M m_{ij} \left( d_{ij}^2 - \alpha \right) = \sum_i^N \sum_j^M m_{ij} \left( (Q_1 \cdot P^M_i - w x_{P_j})^2 + (Q_2 \cdot P^M_i - w y_{P_j})^2 - \alpha \right)    (3.10)

where m_{ij} is a weight in the range 0 ≤ m_{ij} ≤ 1 expressing the likelihood that model

point P^M_i corresponds to image point p_j. The term α is there to push the error away from the trivial solution of setting all the weights to zero, and to account for noise in the locations of feature points in the images, so that slightly mis-aligned model and image points can still be matched. In the case that all correspondences are completely correct, m_{ij} = 1 or 0 and α = 0, this equation is identical to the previous error equation (3.4).

The matrix m is an (M + 1) x (N + 1) matrix where each entry expresses the probability of correspondence between an image point and a model point. The individual entries are populated based upon the distance between SOP points p''_j and p'_i given by d_{ij}. As d_{ij} increases, the corresponding entry m_{ij} decreases towards zero; as d_{ij} decreases, m_{ij} increases, indicating that the points likely match. At the end of the SoftPOSIT algorithm the entries of m should all be nearly zero or one, indicating that points either correspond or do not. The matrix m is also repeatedly normalized across its rows and columns to ensure that the cumulative probability of any image point matching any model point is one and the total probability of any model point matching any image point is one. This matrix form is referred to as doubly stochastic, and an algorithm from Sinkhorn [37] is used to achieve it. In the case that a given model point is not present in the image, or a point in the image does not have a matching model point, the weight in the last row or column of m will be set to one. The last row and column of m are the slack row and slack column, respectively, and are the reason why m has one extra row and column. Entries in these locations indicate that no correspondence could be determined.
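As an illustration of the weighting and normalization step, the MATLAB sketch below populates the assignment matrix with an exponential weighting of the distances d_ij, which is the usual softassign choice but is an assumption here since the text only states that m_ij grows as d_ij shrinks, and then applies Sinkhorn row and column normalizations with a slack row and column.

% Sketch of building the (M+1)x(N+1) assignment matrix m and making it
% approximately doubly stochastic with Sinkhorn iterations. D(j,i) holds
% the distance d_ij between image point j and model point i (assumed layout).
N = 4;  M = 5;  D = 10 * rand(M, N);    % illustrative sizes and distances
beta = 0.05;  alpha = 1.0;              % illustrative constants

m = ones(M+1, N+1);                     % slack row and column start at one
m(1:M, 1:N) = exp(-beta * (D.^2 - alpha));   % assumed softassign weighting
for it = 1:50                           % Sinkhorn normalisation
    m(1:M, :) = bsxfun(@rdivide, m(1:M, :), sum(m(1:M, :), 2));  % rows, excluding the slack row
    m(:, 1:N) = bsxfun(@rdivide, m(:, 1:N), sum(m(:, 1:N), 1));  % columns, excluding the slack column
end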

With the error function defined, the values of Q_1 and Q_2 which minimize the error are found in the same fashion as previously and are given by

Q_1 = \left( \sum_i^N \sum_j^M m_{ij} P^M_i P^{M\,T}_i \right)^{-1} \left( \sum_i^N \sum_j^M m_{ij} w x_{P_j} P^M_i \right)    (3.11)

Q_2 = \left( \sum_i^N \sum_j^M m_{ij} P^M_i P^{M\,T}_i \right)^{-1} \left( \sum_i^N \sum_j^M m_{ij} w y_{P_j} P^M_i \right)    (3.12)

As before, the algorithm is started with w_{i=0...N} = 1 and then m is populated by calculating all of the values of d_{ij}. Since m must be populated before updating Q_{1,2}, an initial pose must be given to the algorithm, which is the pose used to generate m. With m populated, Q_{1,2} can be updated, which is then used to generate a better guess for w_{i=0...N}. This process is repeated until convergence. At the conclusion of the algorithm the pose parameters can be retrieved from Q_{1,2}:

s = \left( \left\| [Q_1^1, Q_1^2, Q_1^3] \right\| \left\| [Q_2^1, Q_2^2, Q_2^3] \right\| \right)^{1/2}

R_1 = [Q_1^1, Q_1^2, Q_1^3]^T / s
R_2 = [Q_2^1, Q_2^2, Q_2^3]^T / s
R_3 = R_1 \times R_2

T = \left[ \frac{Q_1^4}{s}, \frac{Q_2^4}{s}, \frac{f}{s} \right]

3.1.2 Limitations and Issues with SoftPOSIT

The need for an initial guessed pose is the major limitation of the SoftPOSIT algorithm. Since the pose update equation is dependent upon the initial pose with which the algorithm is started, the algorithm will only converge to the local minimum which satisfies the error function (3.10). To converge to the true pose, the algorithm may need to be started at a variety of different poses around the actual pose. Additionally, if the algorithm is started at a pose which is too far away from the correct pose, the algorithm will not converge and will terminate early with no solution.

Since the algorithm relies on matching feature points and pays no attention to the visibility of feature points based upon pose, the algorithm will often match

points in the model which are actually occluded by the model itself to points in the image. For example, when trying to match our cube model to the images, most of the corner points on the back of the cube are not visible because the front of the cube occludes the back; however, the algorithm will often match points which correspond to the back of the model to imaged corners belonging to the front of the cube. These types of matches should not be allowed due to the geometry of the model occluding itself, but there is no mechanism in the algorithm to account for this.

The other major limitation of this algorithm is accurate feature extraction. When trying to detect corners, for example, a rounded corner may not be detected, or the overlap of two objects may lead to spurious corner detections which the algorithm may converge to. When the algorithm does finally converge to a pose, there is no way of knowing if the algorithm generated the correct correspondences or even matched to true features of the object instead of some of the spuriously detected ones. Thus any pose detected by the algorithm must be evaluated to check its fitness before being accepted as the final answer.

3.2 SoftPOSIT With Line Features

3.2.1 SoftPOSIT With Line Features Algorithm

After the initial development of SoftPOSIT, an extension of the algorithm was created to allow it to be run on line features [8]. The underlying SoftPOSIT algorithm is identical to the one previously described. Since the SoftPOSIT algorithm relies on point features to actually perform the pose estimation and correspondence determination, the line features and correspondences are converted to point

[Figure 3.2: Generation of projected lines in SoftPOSITLines]

features and point correspondences.

3.2.1.1 Converting line features to point features

For the current image, all of the lines which are candidates for matching to the model lines are detected. Using the previous notation, the two end points of a model line are given by L_i = (P^M_i, P'^M_i) and the two end points of a detected image line are l_j = (p_j, p'_j). N will now represent the number of model lines, meaning there will be 2N model points which correspond to the lines, and M image lines which will have a total of 2M points. The plane in space which contains the actual model line used to generate image line l_j can be defined using the points (C_O, p_j, p'_j), as in Figure 3.2. The normal to this plane, n_j, is given by

n_j = [p_j, 1] \times [p'_j, 1]

If the current model pose is correct and model line L_i corresponds to image line l_j, then the points

S^C_i = R^C_M P^M_i + T^C_M
S'^C_i = R^C_M P'^M_i + T^C_M

will lie on the plane defined by (C_O, p_j, p'_j) and will also satisfy the constraint that n_j^T S^C_i = n_j^T S'^C_i = 0. Using the SoftPOSIT algorithm, it is assumed that at first R^C_M and T^C_M will not be correct and therefore L_i will not lie in the plane. Recalling the SoftPOSIT algorithm, the model points given the current pose were constrained to lie on the lines of sight of the image points. In this instance it will be required that the model lines lie on the planes of sight of the image lines. If S^C_i and S'^C_i are the model line endpoints in the camera's frame for the current pose, then the nearest points to these line endpoints which fulfill the planar constraint are the orthogonal projections of the points S^C_i and S'^C_i onto the plane of sight. The coordinates of these projected points are given by

S^C_{ij} = R P^M_i + T - \left[ (R P^M_i + T) \cdot n_j \right] n_j    (3.13)

S'^C_{ij} = R P'^M_i + T - \left[ (R P'^M_i + T) \cdot n_j \right] n_j    (3.14)

Notice these points are still in the 3D camera frame; however, the image of these points can be generated as

p_{ij} = \frac{(S_{ij_x}, S_{ij_y})}{S_{ij_z}}, \quad p'_{ij} = \frac{(S'_{ij_x}, S'_{ij_y})}{S'_{ij_z}}    (3.15)

The collection of point pairs given by (3.15) will be analogous to the constrained SOP points p''_j, see equation (3.7). The collection of these points for the current guess of R^C_M and T^C_M will be referred to as

P_{img}(R^C_M, T^C_M) = \{ p_{ij}, p'_{ij}, \; 1 ≤ i ≤ N, \; 1 ≤ j ≤ M \}    (3.16)
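The MATLAB sketch below carries out equations (3.13) through (3.15) for a single model line and image line. The pose, model endpoints, and image endpoints are illustrative assumptions, and the normal n_j is normalized to unit length so the projection formula can be applied directly.

% Sketch of equations (3.13)-(3.15): project the endpoints of one model
% line onto the plane of sight of one image line, then image them.
% R, T give the current pose guess; Pm1, Pm2 are 3x1 model line endpoints;
% pj1, pj2 are 2x1 image line endpoints (principal-point coordinates).
R = eye(3);  T = [0; 0; 30];
Pm1 = [0; 0; 0];  Pm2 = [3; 0; 0];
pj1 = [10; 5];    pj2 = [85; 9];

nj = cross([pj1; 1], [pj2; 1]);  nj = nj / norm(nj);   % unit plane-of-sight normal
S1 = R*Pm1 + T;  S1 = S1 - (S1' * nj) * nj;            % equation (3.13)
S2 = R*Pm2 + T;  S2 = S2 - (S2' * nj) * nj;            % equation (3.14)
p1 = S1(1:2) / S1(3);  p2 = S2(1:2) / S2(3);           % equation (3.15)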

The collection of model points analogous to p'_i, see equation (3.8), will be referred to as

P_{model} = \{ P^M_i, P'^M_i, \; 1 ≤ i ≤ N \}    (3.17)

A new m matrix for expressing the probability that point p_{ij} corresponds to P^M_i and p'_{ij} corresponds to P'^M_i must now be developed. The total dimensionality of m will be 2MN x 2N, but the matrix will only be sparsely populated. First, half of the possible entries are 0 because p_{ij} corresponds to P^M_i and p'_{ij} corresponds to P'^M_i but the opposite is not true, i.e. p_{ij} does not correspond to P'^M_i and p'_{ij} does not correspond to P^M_i. Since the image points are generated by projecting model lines onto planes formed by image lines, image points should only be matched back to the model lines which generated them. If, for example, a set of image points p_{1j} and p'_{1j} correspond to L_1 projected onto all of the image line planes, p_{1j} should only be matched to P^M_1 and p'_{1j} to P'^M_1. Attempting other correspondences would be senseless, as the points p_{1j} and p'_{1j} are derived from L_1. Thus m will take a block diagonal form, as in the example in Figure 3.3. In Figure 3.3, l_1 corresponds to L_3 and l_2 corresponds to L_1. As before, the matrix is required to be doubly stochastic, which can still be achieved via Sinkhorn's [37] method. When the pose is correct, every entry in the matrix will be close to one or zero, indicating that the lines/points either correspond or do not.

Again recalling the previous algorithm, the values of m prior to normalization are related to the distances between the model points' SOPs and their line-of-sight corrected SOPs. Since this algorithm is matching line features, distances will be defined in terms of line differences rather than point distances. Using these distances, any points generated from model line L_i and image line l_j, i.e. points p_{ij}, p'_{ij}, have distance

[Figure 3.3: Example form of the matrix m for SoftPOSIT with line features]

measures

d_{ij} = θ(l_j, l_i) + ρ \, d(l_j, l_i)    (3.18)

where θ(l_j, l_i) = 1 - \cos \angle(l_j, l_i), l_i is the line obtained by taking the perspective projection of L_i, and d(l_j, l_i) is the sum of the distances from the endpoints of l_j to the closest points on l_i. Thus this distance metric takes into account both the mis-orientation of two matched lines and the distance between the two lines. The reason d(l_j, l_i) is chosen as the sum of the distances from the endpoints of the image line to the closest points on the imaged model line is that a partially occluded line will still have a distance of zero, indicating that a match is found. This behavior is desirable because the algorithm should be able to match partially occluded image lines to whole model lines. These distance measures are used to populate m prior to the normalization by the Sinkhorn algorithm.
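A minimal MATLAB sketch of this distance measure follows. The endpoint-to-segment distance helper and the use of the absolute cosine, which treats the lines as undirected, are implementation assumptions; the endpoints and the weight rho are illustrative.

% Sketch of the line distance measure in equation (3.18). a1,a2 are the
% endpoints of the detected image line l_j; b1,b2 are the endpoints of the
% projected model line l_i (all 2x1); rho weights distance against orientation.
a1 = [10; 5];  a2 = [85; 9];  b1 = [12; 4];  b2 = [80; 12];  rho = 0.1;

va = a2 - a1;  vb = b2 - b1;
theta = 1 - abs(va' * vb) / (norm(va) * norm(vb));   % 1 - cos of the angle (lines treated as undirected)

% distance from a point q to the closest point on the segment [b1,b2]
segDist = @(q) norm(q - (b1 + max(0, min(1, ((q-b1)'*vb)/(vb'*vb))) * vb));
d_ij = theta + rho * (segDist(a1) + segDist(a2));    % equation (3.18)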

Using the new weighting matrix, and the modified p''_j and p'_i given by equations (3.16) and (3.17) respectively, the originally described SoftPOSIT algorithm can be applied to the line-generated points. The algorithm is started and terminated in the same fashion as before. The algorithm is started with an initial pose guess and assumes w = 1. The algorithm then generates the points P_{img} and the corresponding weights for the probability of points matching, and updates Q_1 and Q_2 using the weights and the current w's. Next, the algorithm updates the values of the w's using the current pose guess and repeats the process until convergence.

3.2.2 Limitations and Issues with SoftPOSIT Using Line Features

The major advantage of using line features over point features is that line features are generally more stable and easier to detect. For example, a rounded corner probably will not be detected by a corner detector; however, the two lines leading into the rounded corner will still appear. The problem of occlusions generating fake features is still present, because two overlapping objects will generally form a line when a line detector is used. The problem of self-occlusion is also still not addressed in this algorithm, so lines which are not visible in the current object pose can still be matched to image lines. This is especially a problem when the object is symmetric and thus has many lines which are parallel and can align when in certain poses. This algorithm also returns the local pose which minimizes the error function, so again the algorithm must be started from different initial poses to find the global minimum. The poses must also be evaluated for correctness, as with regular SoftPOSIT.

Typically, when compared to SoftPOSIT using point features, the final poses returned by the algorithm are more accurate and the probability of converging to the correct pose is generally higher.

3.3 Pose Clustering From Stereo Data

3.3.1 Pose Clustering From Stereo Data Algorithm

In Section 2.7 it was shown how it is possible to generate a 3D point cloud reconstruction of a scene given two views of the scene. It will now be assumed that a model point cloud M has been generated, where the origin and orientation of the model frame are known and the points' coordinates are expressed with reference to this frame. The origin of the model is located at the center of the model point cloud. For every point in the model cloud, the line of sight from the camera to the original point must also be stored; the need for this will be shown later. If another image is captured of the same object and the coordinates of the 3D point cloud reconstruction are generated with respect to the camera coordinate system, then this point cloud will be referred to as S, the scene point cloud. The goal then is to find some rigid body transform which will relate the points in M to the points in S. This transform will be the pose of the object with respect to the camera's coordinate system.

It should be noted that in the full implementation of this algorithm the model is generated by taking multiple views of an object from different angles and reconstructing the complete 3D geometry of the object. This task is simple enough to do if the object is placed at a known location in a model-based coordinate system and the camera is moved to specific known locations in the model frame so that all of the

[Figure 3.4: Example of two matched triplets]

reconstructions from each view can be transformed into the frame of reference of the object. In this implementation we will be using a model constructed from only a single viewpoint; however, this has no bearing on the implementation, so the extension to multiple views is as simple as stitching together different viewpoints to make a more complete model.

Assume that both the model and scene point clouds have been generated. If three points in M which correspond to three points in S can be identified, then the transform that moves the coordinates of the points in M to the corresponding points in S gives the pose of the object. However, full point correspondences are impossible to generate because the only data being used in this algorithm is 3D point data. Since point correspondences cannot be directly generated by matching features, triplet correspondences are generated instead, where a triplet correspondence refers to matching the lengths between three points in the scene to the lengths between three points in the model. Figure 3.4 shows two matched triplets. If three triplet lengths can be matched, then three point correspondences can be generated and the pose can be reconstructed. Since the 3D point cloud reconstruction is not exactly accurate, due to noise in the images and re-projection errors, it is not possible to match triplet lengths exactly. Instead, a matching threshold is used such that if two lengths are within some tolerance they are considered to be matched.

Due to the matching tolerance and the geometry of the objects there will be

many triplets in the model which can be matched to a single triplet in the scene. If all of the rotations and translations were computed which move all of the matched model triplets into the scene, then only one of the transforms would be the correct transform and all of the others would be incorrect. If the process of picking a triplet from the scene and matching it to possible matches in the model is repeated, then eventually a number of correct guesses will be generated along with many more incorrect guesses. However, if the poses are stored in a 6D parameter space, a cluster of poses corresponding to the actual object pose will develop along with other randomly distributed poses throughout the rest of the space. Dividing the 6D parameter space into a set of hypercubes allows easy detection of when a cluster has been generated in the space. Once a cluster of points in the parameter space is detected, the pose which best describes the cluster can be generated. This pose will then correspond to the pose of the object in space. Using this approach, Hillenbrand developed an algorithm [21] that is summarized as follows:

1. Draw a random point triple from S, the sample point cloud.

2. Among all matching point triples in M, pick one at random.

3. Compute the rigid body transform which moves the triple from M to the triple in S.

4. Generate the six parameters which describe the transform and place the pose estimate into the 6D pose space.

5. If the hypercube containing this 6D point has fewer than N_samples members, return to 1; otherwise continue.

6. Estimate the best pose using the 6D point cluster generated in the space.

Now that the algorithm as a whole has been presented, the details of the steps will be given. To find pairs of matching triplets, an efficient method for matching triplet lengths needs to be developed. To do this, a hash table is generated containing triplets from the model, indexed by the lengths between the points. To ensure that points are always matched in the proper order, the line lengths are always generated by going clockwise around the points according to the point of view of the camera. Failure to do this would result in incorrect point correspondences even though the lengths were correctly matched. The three values used to hash three model points r_1, r_2, r_3 with lines of sight l_1, l_2, l_3 are given by

\begin{bmatrix} k_1 \\ k_2 \\ k_3 \end{bmatrix} = \begin{cases} \begin{bmatrix} \|r_2 - r_3\| \\ \|r_3 - r_1\| \\ \|r_1 - r_2\| \end{bmatrix} & \text{if } [(r_2 - r_1) \times (r_3 - r_1)]^T (l_1 + l_2 + l_3) > 0 \\ \begin{bmatrix} \|r_3 - r_2\| \\ \|r_2 - r_1\| \\ \|r_1 - r_3\| \end{bmatrix} & \text{else} \end{cases}    (3.19)

where k_1, k_2, k_3 are the lengths between the points. This hashing method guarantees that the points are hashed in a clockwise order according to the point of view of the camera. In addition to hashing the three points with the key k_1, k_2, k_3, the points are also hashed with the keys k_3, k_1, k_2 and k_2, k_3, k_1. The three points are hashed with all three of these entries because when picking three points from the current scene there is no way of knowing which order they will appear in, only that the lengths between them are generated in a clockwise manner.
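The following MATLAB sketch computes the clockwise-ordered length key of equation (3.19) for one model triple. The quantization of the lengths into discrete hash bins is an assumed implementation detail; the text does not state the bin width used.

% Sketch of equation (3.19): build the clockwise-ordered length key for a
% model point triple r1,r2,r3 (3x1 each) with summed line of sight lsum.
r1 = [0;0;30];  r2 = [3;0;30];  r3 = [0;3;31];  lsum = [0;0;3];   % illustrative values

if dot(cross(r2 - r1, r3 - r1), lsum) > 0
    k = [norm(r2-r3); norm(r3-r1); norm(r1-r2)];
else
    k = [norm(r3-r2); norm(r2-r1); norm(r1-r3)];
end
key = sprintf('%d_%d_%d', round(k / 0.005));   % assumed 5 mm bins; cyclic permutations hashed as well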

Using the method presented to generate clockwise lengths, a point triple can be selected in S and the appropriate lengths generated. Using those three lengths and the hash table, all of the model triples which could possibly match the scene triple can be quickly found.

The method for finding the rigid body transform which relates the three points is based on quaternions and is explained in [24]. This method is used because it finds the best-fit R and T, in a least squares sense, relating points r_1, r_2, r_3 in the model to points r'_1, r'_2, r'_3 in the scene. This method is also specifically designed to work with three pairs of corresponding points, which is the number of correspondences this algorithm generates.

A method of converting pose parameters to 6D points is now presented. A rotation matrix R can be expressed as an axis of rotation and an angle of rotation, R = exp(ŵθ), where ŵ is the unit vector about which the rotation takes place and θ is the amount in radians by which points are rotated. The vector ŵθ is called the canonical form of a rotation. R̃ will now denote the canonical form of a rotation matrix R. Combining the vectors [R̃, T] into one large vector gives a 6D vector which completely describes the rigid body transform/pose. This 6D vector could be chosen to perform pose clustering, but there is one major problem. If parameter clustering is to be performed, a consistent parameter space must be used so that clusters are not formed due to the topology of the parameter space alone. Hillenbrand shows in his earlier work [39] that the canonical parameter space is not consistent and therefore is not suitable for parameter clustering. He proposes a transform

ρ = \left( \frac{\|\tilde{R}\| - \sin\|\tilde{R}\|}{\pi} \right)^{1/3} \frac{\tilde{R}}{\|\tilde{R}\|}    (3.20)

which is a consistent space parameterized by a vector ρ ∈ R^3, where the elements ρ_1, ρ_2, ρ_3 all satisfy -1 ≤ ρ_i ≤ 1.
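A MATLAB sketch of this conversion is shown below. The axis-angle extraction from the rotation matrix is the standard one; the small-angle guard and the example rotation are assumptions added for illustration.

% Sketch of equation (3.20): map a rotation matrix to the consistent
% 3-vector rho via its canonical (axis-angle) form.
R = expm([0 -0.3 0.2; 0.3 0 -0.1; -0.2 0.1 0]);       % illustrative rotation

theta = acos(max(-1, min(1, (trace(R) - 1) / 2)));     % rotation angle
if theta < 1e-9
    rho = [0; 0; 0];                                   % assumed handling of the identity case
else
    w = [R(3,2)-R(2,3); R(1,3)-R(3,1); R(2,1)-R(1,2)] / (2*sin(theta));  % unit rotation axis
    rho = ((theta - sin(theta)) / pi)^(1/3) * w;       % equation (3.20), with R~ = w*theta
end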

Thankfully, the Euclidean translation space is already consistent, so all of the pose estimates can be stored in a consistent 6D parameter space using the vector p = [ρ, T].

Now that poses can be parameterized in a 6D space, the final part of the algorithm can be examined. This part of the algorithm determines the best pose which represents all of the pose points in the cluster. The best pose is found by using a mean shift procedure described in [7]. The procedure is started with p̄_1 equal to the mean of all the poses p_{i=1...N_samples} in the bin which was filled, and is repeated until ‖p̄_k - p̄_{k-1}‖ < δ, indicating that the procedure has converged.

p̄_k = \frac{\sum_{i=1}^{N_{samples}} w_{ik} p_i}{\sum_{i=1}^{N_{samples}} w_{ik}}    (3.21)

w_{ik} = u\left( \|\bar{ρ}_{k-1} - ρ_i\| / r_{rot} \right) u\left( \|\bar{T}_{k-1} - T_i\| / r_{trans} \right)

where

u(x) = \begin{cases} 1 & \text{if } |x| < 1 \\ 0 & \text{else} \end{cases}

The radii r_{rot} and r_{trans} define a maximum radius around the current mean within which points must lie in order to contribute to the new mean. These values are dependent upon the bin size used to generate the point cluster and can be varied accordingly. The final result of the clustering algorithm is the pose given by p̄_k, which represents the mean of the major cluster within the bin. This is the final pose output by the algorithm.
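The mean shift refinement of equation (3.21) can be sketched in MATLAB as follows. The synthetic pose samples, radii, tolerance, and iteration cap are illustrative assumptions.

% Sketch of the mean shift step in equation (3.21). P is a 6xK matrix of
% pose samples [rho; T] from the filled bin.
P = [0.02*randn(3,200); 0.3*randn(3,200) + repmat([1;2;30],1,200)];  % synthetic cluster
r_rot = 0.05;  r_trans = 0.7;  delta = 1e-4;

p_bar = mean(P, 2);
for it = 1:100                                        % bounded iterations as a safeguard
    d_rot   = sqrt(sum((P(1:3,:) - repmat(p_bar(1:3), 1, size(P,2))).^2, 1));
    d_trans = sqrt(sum((P(4:6,:) - repmat(p_bar(4:6), 1, size(P,2))).^2, 1));
    w = (d_rot < r_rot) & (d_trans < r_trans);        % indicator weights u(.)u(.)
    if ~any(w), break; end                            % assumed guard for an empty window
    p_new = mean(P(:, w), 2);                         % weighted mean with 0/1 weights
    if norm(p_new - p_bar) < delta, p_bar = p_new; break; end
    p_bar = p_new;
end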

3.3.2 Limitations and Issues with Pose Clustering

Of the three algorithms discussed in this thesis, this is the only one which is feature independent. This is one of the most appealing aspects of the algorithm, because it can be used on any type of object with any texture, as long as some sort of stereo depth information can be recovered. The drawback is that it makes it relatively easy to confuse similar objects. For example, in the experiments section we will attempt to find cubes and cuboids whose dimensions are the same except that the cuboid is wider. In this sort of case it is easy to place the cube inside of the cuboid because the geometries of the shapes are relatively similar.

Chapter 4

Experiments and Results

4.1 Experiments

4.1.1 The sample set

For the evaluation of the algorithms a total of 95 different images were captured, and their 3D reconstructions were generated. The images include single cubes, cuboids, and assemblies of cubes and cuboids, both with and without other objects in the frame and with and without occlusions. The sample set was captured using a pair of PlayStation Eye cameras controlled with OpenCV. Results will be presented that show the effectiveness of each algorithm in detecting the objects of interest (see Figure 4.1), and that compare the effectiveness of detecting assemblies using only a single component as the model against using the entire assembly as the model, i.e., finding the assembly using only the cube as the model compared to detecting the pose of the assembly using the entire assembly as the model.

[Figure 4.1: The Objects of Interest. (a) Cube (3 cm x 3 cm x 3 cm), (b) Assembly, (c) Cuboid (3 cm x 3 cm x 6 cm)]

4.1.1.1 Sample set pre-processing

For all of the images, background subtraction was used to isolate the actual objects in the scene from the backdrop. The 3D reconstruction and line/corner detection was then performed on the segmented objects only, to remove noise sources unassociated with the objects in the scene. No color information was used to distinguish objects from one another, determine object boundaries, or verify poses.

4.1.1.2 Sample set divisions

The image set was divided into three parts and then all of the algorithms were run against the sets. The first set is the collection of all images where a cube, as in Figure 4.1(a), is the object of interest. This set includes pictures of individual cubes, cubes as a part of an assembly, and cubes with other objects and cuboids present. The second set consists of all images where an assembly, as in Figure 4.1(b), is the object of interest. The assembly is a cube directly attached to a rectangular cuboid. The assembly set consists of images of a single assembly and images of a single assembly with cubes, cuboids, and other objects present. The final set consists of all images where the rectangular cuboid shown in Figure 4.1(c) is the object of interest. This set contains images of the cuboid by itself and images of the cuboid with cubes and

other objects present.

4.1.1.3 Generation of baseline true poses

All of the images in the sample set show the object of interest in a 45 x 30 cm space which is roughly 30 cm from the camera. Baseline truths for the positions of the objects were generated by placing the objects at a known location in the world coordinate system and converting the coordinates to a location in the camera's coordinate frame of reference. The orientation of the objects was found by fitting the model of the object to the image of the object, manually generating correspondences and applying a PnP (Perspective-n-Point) algorithm. The baseline truths were generated in this split fashion because it is difficult to accurately determine the 3D location of an object when trying to generate correspondences by hand: small changes in the location of the object can lead to nearly unnoticeable changes in the image data, and manually localizing the point where a feature is located in an image is not necessarily an easy task even for a human. Locating the object in a world coordinate system can easily be done by placing the object on a sheet of grid paper, as was done in this implementation, and then converting the coordinates to camera coordinates with a coordinate system transform. The orientation of the object, on the other hand, is difficult to measure by hand and can be relatively accurately determined by generating correspondences by hand, because PnP algorithms are less sensitive to noise when determining orientation as opposed to position.

4.1.2 SoftPOSIT algorithm implementations

The SoftPOSITPoints algorithm was obtained directly from Daniel DeMenthon and was run in Matlab 2011a.

For the points algorithm the chosen features were the block corners. To find the corners a corner detector was used [5]. This corner detector detects corners by looking for line intersections and is designed to robustly detect rounded corners where lines do not meet at a sharp point, which is often the case with the objects of interest. The SoftPOSITLines algorithm was created by modifying DeMenthon's original code and was also implemented in Matlab 2011a. The SoftPOSITLines algorithm required line extraction, and this was achieved via the Canny edge detector [4] in combination with the Douglas-Peucker line simplification algorithm [11].

4.1.2.1 SoftPOSIT pose verification

For both of the variants of the SoftPOSIT algorithm the final step before accepting the estimated pose is an evaluation of the fitness of the pose. This was done by calculating the mean distance of the model lines, projected to the image plane given the current pose, to the nearest image lines. Once all of the distances were calculated, the pose was accepted if at least 5 model lines were within 4 pixels average distance of some image line. In addition, the pose was required to project into the foreground segmented region of the image. This check was performed first, to quickly eliminate poses which place the object in an area occupied by the background.

4.1.3 Triplet matching algorithm implementation

The point triplet matching algorithm was coded in Visual Studio 2008 using C++ and the OpenCV library.

In this implementation of the triplet matching algorithm, two levels of pose binning were performed. In the first level of binning the translation space is divided

into 4 cm cubes, which corresponds to dividing each translation parameter into 25 bins, and the consistent rotation sphere is divided into 64 bins, with 4 bins of division for each parameter. In the second level the 4 cm cube is divided into 7 mm cubes by re-binning with 6 bins per parameter. The rotation space is re-binned using six bins per parameter to give a resolution of 0.04 units for each parameter. This binning method was chosen because it allows a large initial search space to be somewhat refined and then re-binned to give an accurate final pose. This saves time in the binning process and memory in the implementation, because the more bins a space is divided into, the longer it takes for a bin to fill, which means that more pose estimates must be stored before a single bin fills. With the hierarchical binning method a space can be coarsely binned until a bin fills, which gives a smaller search space that can then be binned for a more accurate pose estimate. With the hierarchical method, all poses from the higher levels of binning can be removed from memory once the refined search space is determined. This helps with avoiding memory limitations when searching for a pose over a large space.

For the first level of binning a bin was considered full when it contained 5000 poses, and for the second level a full bin was achieved at 300 poses. The bigger bins require a bigger count to be considered full because incorrectly matched triplets can accumulate easily when the bins are larger, and this can lead to bins which fill but do not contain the actual pose. Likewise, smaller bins can have a smaller fill count because randomly matched triplets will be more dispersed throughout the bins.
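As an illustration of the first binning level, the MATLAB sketch below maps one 6D pose sample to the integer coordinates of its hypercube and to a single linear index. The flattening into a linear index and the exact handling of the range edges are assumed implementation details.

% Sketch of the first-level binning: map a pose sample p = [rho; T]
% (rho in [-1,1]^3; T in metres with X,Y in [-0.5,0.5] and Z in [0,1])
% to a bin. Bin counts follow the description above (4 per rotation
% parameter, 25 per translation parameter).
p = [0.12; -0.40; 0.05; 0.10; -0.02; 0.31];                       % illustrative pose sample
rotBins = 4;  trBins = 25;

rb = min(floor((p(1:3) + 1) / 2 * rotBins), rotBins - 1);          % 0..3 per rotation axis
tb = min(floor([p(4:5) + 0.5; p(6)] * trBins), trBins - 1);        % 0..24 per translation axis
idx = [rb; tb];                                                    % hypercube coordinates
linIdx = 1 + idx(1) + rotBins*(idx(2) + rotBins*(idx(3) + ...
         rotBins*(idx(4) + trBins*(idx(5) + trBins*idx(6)))));     % assumed mixed-radix flattening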

4.1.4 The search space

The search space for all of the algorithms was identical. The search space covered -0.5 m to 0.5 m in the X and Y directions and 0 m to 1 m in the Z direction, where the X, Y, and Z directions correspond to the camera coordinate axes in Figure 2.2(b), and the full range of rotations in the unit sphere. For the SoftPOSIT algorithms the initial poses were selected at random with uniform probability from within the total range.

4.2 Results

The results of the three algorithms will be discussed in three sections. First the results of the two monocular methods of pose estimation, SoftPOSITPoints and SoftPOSITLines, will be discussed and analyzed. Then the results of the binocular triplet matching algorithm will be examined. Finally, a global comparison of the three algorithms and their overall performance will be given.

Results from each algorithm will be presented in a fashion that allows direct comparisons of the performance of each algorithm on specific images, and in a classed form that enables a quick evaluation of the performance of each algorithm in a global sense. Table 4.1 shows the breakdown and significance of each class and Figure 4.2 shows example poses from each class. Class 1 corresponds to poses which are correct within a small tolerance, and that can be used in applications where relatively high accuracy is a concern, such as accurately gripping an object for assembly. Class 2 provides poses which yield good localization but no useful sense of orientation. This type of accuracy could be useful for simple obstacle avoidance. Class 3 provides poses which accurately express an object's orientation but only approximate its location. These pose results could be used for visual rotational servoing or determining which way a rotating object is pointing. Finally, the poses in Class 4 are essentially incorrect and have no real practical use. In all cases the translational error is given in units

[Figure 4.2: Example poses from each class. (a) Example Class 1 Pose, (b) Example Class 2 Pose, (c) Example Class 3 Pose, (d) Example Class 4 Pose]

Class   Translation Error (cm)   Rotation Error   Significance
        Min       Max            Min      Max
  1     0         1.5            0        2        Essentially correct pose in terms of both position and orientation.
  2     0         1.5            2        -        Pose provides correct localization but no sense of orientation.
  3     1.5       3.5            0        2        Pose provides approximate localization and correct orientation.
  4     All other cases                            Pose provides no useful localization or orientation.

Table 4.1: Summary of the properties and significance of the performance classes

of centimeters. The rotation error is given by a uniformly scaled distance in the consistent rotation space (4.1) between the true pose and the algorithm's guessed pose,

e_r = 10 \, \| ρ_{actual} - ρ_{guess} \|    (4.1)

where

ρ = \left( \frac{\|\tilde{R}\| - \sin\|\tilde{R}\|}{\pi} \right)^{1/3} \frac{\tilde{R}}{\|\tilde{R}\|}

For reference, 1 unit of error in the scaled consistent space corresponds to roughly 15 degrees of rotation error about the correct axis of rotation, or rotating the correct amount around an axis which is 6 degrees off from the correct axis of rotation.
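For concreteness, the MATLAB sketch below evaluates this error metric for a pair of rotations, repeating the conversion of equation (3.20). The two rotation matrices stand in for a ground-truth pose and an estimated pose and are purely illustrative.

% Sketch of the rotation error metric in equation (4.1). Ra is an
% illustrative ground-truth rotation and Rb an illustrative estimate;
% the small-angle case is ignored for brevity.
Ra = expm([0 -0.5 0.1; 0.5 0 -0.2; -0.1 0.2 0]);
Rb = expm([0 -0.45 0.12; 0.45 0 -0.25; -0.12 0.25 0]);

rhos = zeros(3, 2);  Rs = {Ra, Rb};
for k = 1:2
    R = Rs{k};
    theta = acos(max(-1, min(1, (trace(R) - 1) / 2)));
    w = [R(3,2)-R(2,3); R(1,3)-R(3,1); R(2,1)-R(1,2)] / (2*sin(theta));
    rhos(:, k) = ((theta - sin(theta)) / pi)^(1/3) * w;   % equation (3.20)
end
e_r = 10 * norm(rhos(:,1) - rhos(:,2));                   % equation (4.1)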

Plots are also provided which show the breakdown of the translation error for each image in terms of the error along each unit axis in cm.

The plots displaying the performance of the algorithms across the different image sets are given in Figures 4.3-4.12. Figures 4.3, 4.6, and 4.9 show the total pose estimation error of each algorithm on each image of the dataset. These figures allow direct comparisons of the performance of each algorithm on individual images. Figures 4.5, 4.8, and 4.11 show the breakdown of the translation estimation error of each algorithm on each image of the dataset. These figures allow direct comparisons of the translational accuracy of each algorithm on individual images. Figures 4.4, 4.7, and 4.10 show the total pose estimation error for each algorithm, independently sorted by class. These figures allow global comparisons of the performance of each algorithm on the different image sets.

4.2.1 Monocular Methods

Of the two monocular methods presented, SoftPOSITPoints and SoftPOSITLines, the lines method is clearly the more stable, in terms of converging to a usable pose, and the more accurate of the two. For the case of detecting either the cube or the cuboid, the lines method consistently had higher detection rates for Classes 1, 2, and 3 (see Figures 4.4 and 4.10). The only image set where the points algorithm outperformed the lines algorithm was in detecting the assembly, see Figure 4.7. However, in this instance neither algorithm had any Class 1 detections, and the points algorithm only outperformed the lines algorithm by achieving two Class 3 detections.

A major limitation of the monocular pose estimation methods is correctly determining the Z coordinate of the object, or in other words the object's distance from the camera. Figures 4.5, 4.8, and 4.11 all show the breakdown of the translation error into its individual X, Y, and Z components. For both the points and lines algorithms a disproportionately large part of the error is in the Z component. In fact, many instances can be found where the X and Y error components are within a few millimeters while the Z component is only within a few centimeters. This error disparity is due to poor localization of image features and pixel discretization. Due to the properties of the perspective transform, mis-locating a feature in the image by

[Figure 4.3: Total pose error of the three algorithms for each image in the cube image set. The translation error is given in cm while the rotation errors are lengths in the scaled consistent space (4.1).]

[Figure 4.4: Pose error of the three algorithms on the cube image set. For each algorithm, results are sorted by total error and classified. The dotted lines indicate class boundaries and the numbers indicate the class labels. Table 4.1 shows the requirements of each class.]

[Figure 4.5: Breakdown of the translational error for the three algorithms for each image in the cube set. Errors are given in cm.]

[Figure 4.6: Total pose error of the three algorithms for each image in the assembly image set. The translation error is given in cm while the rotation errors are lengths in the scaled consistent space (4.1).]

[Figure 4.7: Pose error of the three algorithms on the assembly image set. For each algorithm, results are sorted by total error and classified. The dotted lines indicate class boundaries and the numbers indicate the class labels. Table 4.1 shows the requirements of each class.]

[Figure 4.8: Breakdown of the translational error for the three algorithms for each image in the assembly set. Errors are given in cm.]

[Figure 4.9: Total pose error of the three algorithms for each image in the cuboid image set. The translation error is given in cm while the rotation errors are lengths in the scaled consistent space (4.1).]

[Figure 4.10: Pose error of the three algorithms on the cuboid image set. For each algorithm, results are sorted by total error and classified. The dotted lines indicate class boundaries and the numbers indicate the class labels. Table 4.1 shows the requirements of each class.]

[Figure 4.11: Breakdown of the translational error for the three algorithms for each image in the cuboid set. Errors are given in cm.]

[Figure 4.12: Total error for the three pose estimation algorithms on the assembly set. The first row shows the results of trying to find the assembly using the assembly as the model, while the second row shows the results of finding the assembly using only the cube as the model.]

a pixel or two will not have a large bearing on the object's X, Y location in space, but it does greatly affect the object's Z location in space. This relationship can be seen in equation (2.3), which shows that an image's x, y coordinates are proportional to an object's X, Y world coordinates by a factor given by Z. Thus small changes in x or y correspond to larger changes in Z. If better feature localization were possible, many of the Class 3 poses could be improved to Class 1 poses, since this would reduce the error in the Z direction. The lines algorithm does not suffer from this localization issue quite as much as the points algorithm, because lines do not require localization to a single point. For this reason the lines algorithm tends to have more Class 1 members than Class 3 members and also consistently has more Class 1 members than the points algorithm.

Another limitation of both of these algorithms is mismatching features. This is especially a problem with highly rotationally symmetric objects like the cube and cuboid, because these objects can assume many poses where model features will line up with image features and only one of the possible poses is actually correct. These types of poses belong to Class 2; see Figure 4.13 for examples. In both of these instances the poses visually appear to be correct to the naked eye, and indeed many of both the line and point features appear to be correctly matched, but in actuality both poses have a rotational error of 4 or more units in the scaled consistent space.

The fitness test for accepting the poses returned by the algorithms is another limiting factor in the correctness of the final pose accepted. If the requirements for the fitness test were increased, so that lines have to be closer than 4 pixels apart on average or more than 5 model and image lines must match up, then better poses might be obtainable. However, if the acceptance criteria are made too strict, then none of the poses returned by the algorithm may ever be accepted. It is also hard to determine a good number for the matching requirement because, depending upon the view of the

[Figure 4.13: Two example result images from Class 2. (a) SoftPOSITPoints Class 2 pose example, (b) SoftPOSITLines Class 2 pose example. Both of these poses illustrate instances where poses are perceptually correct and features are matched, however the correspondences are incorrect. The white lines indicate the final pose estimated by the algorithm.]

object, different numbers of lines are visible in the image.

4.2.2 Binocular Method

The triplet matching algorithm was the one binocular algorithm run on the dataset. For each of the three subsets the triplet matching algorithm was able to return a Class 1 pose on more than 50% of the images. For the assembly case (Figures 4.6, 4.7, 4.8) the algorithm shows the best convergence results of any of the three sets. The improved results are due to the fact that the assembly model is larger and has a more unique shape than the other models; thus there are more unique triplets which can be generated and matched by the algorithm than with the other models. For example, any three points chosen on a single face of the cube could be matched to three points on any other face of the cube, because all of the faces are identical. With the assembly model, three points can be chosen which are spaced far enough apart that they can only be matched back to one side of the assembly. Figure 4.12 shows the results of all three algorithms in trying to find the