A Fast and Accurate Feature-Matching Algorithm for Minimally-Invasive Endoscopic Images


IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. 32, NO. 7, JULY 2013

A Fast and Accurate Feature-Matching Algorithm for Minimally-Invasive Endoscopic Images

Gustavo A. Puerto-Souza* and Gian-Luca Mariottini

Abstract: The ability to find image similarities between two distinct endoscopic views is known as feature matching, and is essential in many robotic-assisted minimally-invasive surgery (MIS) applications. Differently from feature-tracking methods, feature matching does not make any restrictive assumption about the chronological order between the two images or about the organ motion, but first obtains a set of appearance-based image matches, and subsequently removes possible outliers based on geometric constraints. As a consequence, feature-matching algorithms can be used to recover the position of any image feature after unexpected camera events, such as complete occlusions, sudden endoscopic-camera retraction, or strong illumination changes. We introduce the hierarchical multi-affine (HMA) algorithm, which improves over existing feature-matching methods because of the larger number of image correspondences, the increased speed, and the higher accuracy and robustness. We tested HMA over a large (and annotated) dataset with more than 100 MIS image pairs obtained from real interventions, and containing many of the aforementioned sudden events. In all of these cases, HMA outperforms the existing state-of-the-art methods in terms of speed, accuracy, and robustness. In addition, HMA and the image database are made freely available on the Internet.

Index Terms: Abdomen, endoscopic image analysis, endoscopy, feature matching, robust estimation.

I. INTRODUCTION

In robotic-assisted minimally-invasive surgery (MIS), the ability to find image similarities between (at least two) laparoscopic views of the same scene is crucial in many applications, such as shape recovery [1]–[3], camera calibration [4], structure and camera-motion estimation [5], [6], or augmented reality (AR) [7]–[10].
Thus far, this similarity-search problem has been addressed either by means of recursive (feature-tracking) strategies or by means of feature matching (or "tracking by detection") [11], [12]. Feature-tracking algorithms require that the features are extracted from sequential frames, and have been successfully applied to MIS in the case of small occlusions [4], [6], [13]–[15]. Feature-matching methods do not make any restrictive assumption about the chronological order of the frames to be processed, or about the scene geometry (e.g., known organ motion due to breathing). Because of these characteristics, feature-matching algorithms are of utmost importance to automatically recover those tracked features that were lost after large and sudden camera motions, complete occlusions, or strong organ deformations. In general, feature-matching algorithms initially find a set of potential matches by leveraging the appearance around distinctive image features (e.g., SIFT [16]) extracted from the image before and after the sudden camera event. Subsequently, possible ambiguities in these appearance-based matches are removed by enforcing a geometric constraint, i.e., by estimating an image mapping (e.g., rotation, translation, scale, and shear) between the two candidate feature sets.

Manuscript received November 04, 2012; revised January 02, 2013; accepted January 02, 2013. Date of publication June 26, 2013. Asterisk indicates corresponding author. *G. A. Puerto-Souza is with the Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX USA (gustavo.puerto@mavs.uta.edu). G. L. Mariottini is with the Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX USA (gianluca@uta.edu). Color versions of one or more of the figures in this paper are available online.
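The appearance-based stage just described is commonly implemented with a nearest-neighbor distance-ratio test on the feature descriptors [16]. The sketch below is only a toy illustration of that idea, not the paper's code; the descriptor values and the 0.8 ratio threshold are assumptions:

```python
import numpy as np

def nndr_matches(train_desc, query_desc, ratio=0.8):
    """Distance-ratio matching: for each query descriptor, accept its nearest
    training descriptor only if it is clearly closer than the second nearest."""
    matches = []
    for qi, q in enumerate(query_desc):
        d = np.linalg.norm(train_desc - q, axis=1)  # distances to all training descriptors
        i1, i2 = np.argsort(d)[:2]                  # indices of the two nearest neighbors
        if d[i1] < ratio * d[i2]:                   # unambiguous: keep the match
            matches.append((qi, int(i1)))
    return matches

# Toy descriptors: query 0 matches training 0 cleanly; query 1 is ambiguous
# (two training descriptors are almost equally close) and is dropped.
train = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.9]])
query = np.array([[0.95, 0.05], [0.0, 0.95]])
print(nndr_matches(train, query))  # → [(0, 0)]
```

Only the matches surviving this ratio test are handed to the geometric-validation stage, where the image mapping mentioned above is estimated.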
This mapping is of particular importance, since it can be used to predict the position of the lost tracked features in the image after the event. Because of these appealing characteristics, feature matching is of fundamental importance in many applications, such as structure from motion [5], [6], [17], registration [1]–[3], and localization. We are here particularly interested in augmented-reality (AR) applications [7], [18]–[21], which aim at increasing the surgeon's visual awareness of anatomical targets by precisely overlaying preoperative radiological data onto live laparoscopic videos. Providing a reliable and accurate feature matching is fundamental in AR to recover the position of lost anchor image points (e.g., after a complete and prolonged occlusion). In this way, feature matching can be used to automatically re-initialize the lost augmented display (cf. the example in Fig. 1) and guarantee long-term augmentations. Recently, some feature-matching strategies have been presented [16], [22], [23] that try to be robust in the presence of object deformations. However, these algorithms are still computationally cumbersome, cannot retrieve a large-enough set of accurate matches and, to the best of our knowledge, have never been evaluated in MIS scenarios. MIS images are challenging due to the presence of frequent occlusions, object deformations, and image clutter (smoke, blood, and reflections). Addressing these shortcomings is thus a primary need for the medical-imaging community. The original contribution of this work is the design of a novel feature-matching algorithm that improves over the existing methods by finding a larger number of image correspondences at an increased speed, and with both a higher accuracy and robustness to image clutter. Our method, called hierarchical multi-affine (HMA), hierarchically clusters the initial set of appearance-based matches into spatially-distributed clusters

over the organ's surface. For each of these clusters, an affine transformation is then estimated to further prune the incorrect initial matches, and to capture features spread over the entire object's surface. Each of these clusters of features is estimated according to a local geometric (affine) transformation, which maps image features from the image before the occlusion to the one after the occlusion.

Fig. 1. Application of feature matching for augmented-reality recovery: AR systems make use of anchor points (i.e., associations between the 2-D laparoscopic view and the 3-D CT model) to maintain an augmented view (frame "before occlusion"). However, occlusions can cause the loss of the anchor points (frame "occlusion"), and then the loss of the augmented view (frame "after occlusion"). HMA can be used to automatically recover the lost anchor points and thus the augmented view (frame "recovered augmentation").

Due to the sensitive nature of the MIS scenario, feature-matching methods have to comply with safety requirements to be approved for use in the operating room. High accuracy is required in order to precisely recover a high number of features. Moreover, real-time and robust performance are crucial for any in vivo MIS application. For these reasons, HMA has been extensively evaluated with respect to accuracy, robustness, and time over two large and varied in-lab and MIS image datasets (the latter including images from six 3-h-long surgical interventions). In this work, we decided to focus on the comparison among three of the most popular features: SIFT [16], ASIFT [24], and SURF [25]. While other sparse and dense features have been proposed in the past years [26]–[29], the aforementioned ones represent a good compromise, being invariant to a large number of affine parameters while still being extractable almost in real time.
In all the aforementioned evaluation scenarios, we show that HMA outperforms the existing state-of-the-art methods in terms of speed, inlier-detection rate, and accuracy.

A. Related Work

The first step in feature matching consists of extracting salient features from the two images of the same scene. Several algorithms have been proposed to detect features (SIFT, SURF, ASIFT), each one with different invariance properties. After this feature-extraction part, a feature-matching step follows, which in general consists of two phases: first, an appearance-based matching phase [30], in which the local appearance of each feature is used to determine a set of candidate (or initial) matches; second, these initial matches are pruned of appearance ambiguities by means of an additional geometric-validation phase, which leverages the spatial arrangement of the features. In the past years, several approaches have been proposed to address this second geometric-validation step. For example, in [31], the authors use an image database to construct a 3-D model of the object of interest and finally match the features of the query image against the 3-D object. However, this algorithm is computationally expensive, it assumes that the object is nondeformable, and it needs a large set of images to build the 3-D model. Other approaches have been proposed in [23], [32]–[34] to model feature matching as a graph-matching problem. In general, the possible matches are considered as nodes, while their disagreement (or agreement) is measured by the weight of each edge, modeled by an energy function. In [32], geometric constraints are used to penalize those matches that change their relative length and orientation between the two images. However, these constraints are only suitable for rigid movements, and do not work under significant object deformations or viewpoint change.
In [34], an energy function is used to penalize the occluded features, and the geometric constraints of [32] are relaxed to neighboring matches. Despite these improvements, these methods are not robust to viewpoint changes. In [33], geometric constraints are used to represent each feature position as an affine combination of its neighbors. Even if this algorithm is robust to affine viewpoint changes, it performs poorly with large occlusions and nonrigid deformations. The method proposed in [35] creates additional (randomly-distorted) training images, and estimates a single homography mapping together with the associated inliers. However, because of the uncontrolled (random) generation of these images, only a limited set of correspondences can be detected. We focus here on some recent feature-matching algorithms that are more appropriate for dealing with large camera movements, occlusions, and deformations. In particular, the algorithm in [16] detects a predominant set of matches (inliers) that satisfy a unique affine transformation [36]. While this method can reliably discard many wrong matches, only the limited number of matches agreeing with this single transformation is kept. The above issue was addressed in [22], where multiple local-affine transformations have been used to detect a larger number of matches, uniformly distributed over the entire nonplanar object surface. However, due to its high computational time, this method could only be used for offline data processing. The work in [23] defines a dissimilarity measure between the matches based on both a geometrical and an appearance constraint. Finally, an agglomerative step generates clusters of matches by iteratively merging similar ones (according to the dissimilarity measure). A drawback of this algorithm is its high computational complexity for an increasing number of matches [37]. The HMA algorithm presented in this paper improves over the existing methods according to the following.
HMA uses multiple affine transformations to accurately map features between the two images. Because of this, HMA can also detect a larger percentage of correct matches when compared with [16] and [23].

HMA is fast, because of the adoption of a hierarchical feature-clustering phase. In particular, when a large number of features are detected (e.g., when using high-definition images or extracting ASIFT features), HMA is almost one order of magnitude faster than [23] and [22]. HMA is robust to outliers, because it incorporates several robust techniques (e.g., RANSAC and nonparametric data analysis). HMA is an outgrowth of our recent conference paper [38], which we have improved upon in several directions. First, we improved the feature-matching performance and reduced the computational time by adopting an initial Hough-voting phase. Second, we extended HMA's evaluation and comparison by including a matching-performance metric based on ROC analysis [39]. Finally, we evaluated HMA over a large and manually annotated dataset of in-lab and in vivo images. HMA and the image database are made available on the Internet for the entire community.1

The paper is organized as follows. Section II introduces the feature-matching problem, the basic notation, and the details of the HMA algorithm. Section III reports the results of our extensive experimental evaluation and the comparison of HMA with state-of-the-art feature-matching methods. Finally, in Section IV we discuss our results.

II. METHODS

We introduce here the basics of feature matching, and highlight the need for more advanced methods when dealing with laparoscopic images. In Section II-D we then introduce the hierarchical multi-affine (HMA) algorithm.

A. The Feature-Matching Problem

Consider a pair of images, a training image (e.g., before an occlusion) and a query image (e.g., after an occlusion), and two corresponding sets of image features (e.g., SIFT [16], SURF [25], ASIFT [24]) extracted from the training and query images, respectively.
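For concreteness, each such feature bundles keypoint geometry with an appearance descriptor, as detailed next. A minimal container for one feature might look as follows (field names are ours, chosen for illustration, not the paper's notation):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Feature:
    """One image feature: keypoint geometry plus a local-appearance descriptor."""
    x: float                 # keypoint pixel position (x)
    y: float                 # keypoint pixel position (y)
    scale: float             # detection scale (drawn as the circle radius in Fig. 3)
    orientation: float       # dominant local gradient direction, in radians
    descriptor: np.ndarray   # appearance vector, e.g. 128-D for SIFT

f = Feature(x=120.5, y=64.0, scale=3.2, orientation=0.4,
            descriptor=np.zeros(128))
print(f.descriptor.shape)  # → (128,)
```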
Each feature consists of a keypoint vector [16], which stores the geometric characteristics of the feature (such as its position), and of a descriptor vector, which captures the local appearance around the keypoint position. The main goal of a feature-matching algorithm (see the diagram in Fig. 2) is to retrieve pairs of similar features (correspondences) among the two images by leveraging both the appearance and the geometric information contained in all these features. Existing feature-matching algorithms consist of two phases: an appearance-based matching and a geometric validation. The appearance-based matching uses the information encoded in the descriptor vectors to obtain a set of initial (candidate) matches [40], [41]. A popular method is the nearest neighbor distance ratio (NNDR) [16], which matches a query feature with a training feature when the two have the closest distance between descriptor vectors, and when the ratio between the closest and the second-closest descriptor distances is less than a threshold.

1 Source code available online.

Fig. 2. Diagram of the HMA feature-matching algorithm: HMA is specifically used to remove incorrect initial matches.

However, due to appearance similarity, this set of initial matches may contain a large number of incorrect matches. In order to prune the initial matches of these outliers, a geometric-validation phase is usually adopted, which models the geometric motion of one (or a group of) candidate matching features between the two images. As a result, those matches that agree with this geometric model are considered inliers (i.e., correct matches); otherwise, they are discarded as outliers. In what follows, we assume a given set of initial matches (e.g., computed by using NNDR). We detail the geometric-constraint phase (cf. Sections II-B and II-C) and highlight potential problems of this phase when applied to laparoscopic images. This will lead to the design of the hierarchical multi-affine algorithm (cf. Section II-D).

B. Imposing Geometric Constraints

Geometric constraints can be used to model image transformations of one (or a group of) features from the training image to the query image. These geometric constraints are usually adopted to predict the feature mapping from one image to the other, and vice

versa. Because of this, geometric constraints can be leveraged to detect incorrect matches that, differently from the majority of initial matches, do not agree with this mapping. Examples of such constraints are the similarity transformation (which models feature rotation, translation, and scale) and the affine transformation (which also includes shear). As detailed in the following Sections II-B1 and II-B2, these geometric transformations can be directly estimated from the keypoint vector, which usually contains the feature (pixel) position, its scale, and its orientation. For example, Fig. 3 shows SIFT and SURF keypoints detected in an endoscopic image.

Fig. 3. Left: SIFT keypoints detected in an endoscopic image: the position of each feature is represented by the center of the circle, the scale is proportional to the radius, and the orientations indicate the directions of the most prominent local image gradients around the feature position. Right: SURF keypoints: differently from SIFT, SURF keypoints tend to be less dense on textureless areas.

1) Similarity Transformation: A similarity transformation maps a generic image point x (which does not have to correspond to a keypoint location) to x', according to the following model:

x' = s R(θ) x + t    (1)

where the parameters of the similarity transformation are the scale change s, the 2-D rotation angle θ (with rotation matrix R(θ)), and the translation t between keypoints. These similarity parameters can be estimated from the parameters of a matched keypoint pair, with positions x_t and x_q, scales σ_t and σ_q, and orientations φ_t and φ_q, as follows: s = σ_q/σ_t, θ = φ_q − φ_t, and t = x_q − s R(θ) x_t.

2) Affine Transformation: An affine transformation models the rotation, translation, scale, and shear of image features, and is given by the following [36]:

x' = A x + t    (2)

where A is a 2×2 matrix. The six affine-transformation parameters (the entries of A and t) can be estimated from at least three (noncollinear) matches by first rewriting (2) into a linear form in homogeneous coordinates, and then by calculating a least-squares solution.

C.
Feature Matching by Imposing (Single) Affine Constraint

Geometric constraints (e.g., an affine transformation T) are often used to model the mapping for groups of initial matches. The quality of this mapping, and of each potential match, is represented by the (pixel) symmetric reprojection error

e(x_t, x_q) = ||x_q − T(x_t)|| + ||x_t − T⁻¹(x_q)||    (3)

which accumulates the mapping error in both directions. Those matches that exhibit a reprojection error larger than a threshold are considered outliers (i.e., wrong matches). Usually, the estimation of both the transformation and of the inliers is performed by means of RANSAC (RANdom SAmple Consensus) [42]. In brief, RANSAC randomly selects a minimal number of matches to estimate an instance of the model (e.g., three matches to estimate an affine transformation) and checks how many other matches (the consensus) agree with this minimal model. This random selection is iterated many times, until a final transformation is obtained that has a large consensus, or until a maximum number of iterations is reached.

Fig. 4. Imposing a geometric (affine) constraint on feature matches. (a) Set of initial matches (better seen in color); the correct matches are shown in (green) solid lines, and the wrong matches in (yellow) dashed lines. (b) Subset of matches agreeing with the affine transformation, with a threshold of 1.5 pixels. Note that this set contains only a few correct matches, localized in a portion of the image.

An example of this single-affine robust estimation is illustrated in Fig. 4. The initial matches are shown in Fig. 4(a): for clarity of presentation, we indicate the correct matches with solid (green) lines, and the inaccurate matches with (yellow) dashed lines. Fig. 4(b) shows the correct matches obtained in our experiments as the result of an affine transformation estimated with RANSAC. Note that the refined matches in Fig.
4(b) do not contain outliers and are correctly mapped by the estimated transformation. However, we have observed an important drawback of this method, namely its capacity to recover only a limited number of matching features, lying on an (almost planar) portion of the organ surface (cf. the polygon in Fig. 4). This happens because the affine constraint models the feature motion only as a rigid motion plus shear, and it is thus more appropriate when observing planar surfaces. This phenomenon was never reported before in the literature, and it motivated our team to search for a better solution to the feature-matching problem in the general case of a nonplanar (e.g., organ's) surface.

D. Hierarchical Multi-Affine (HMA) Algorithm

HMA improves over the aforementioned limitations by estimating a set of multiple and spatially-distributed affine transformations. As illustrated in Fig. 5, each transformation

maps any image feature from the training image to its corresponding region in the query image. In estimating these transformations, HMA simultaneously computes the set of final matches (inliers) that support these local affine transformations. As we would expect, multiple affine transformations adapt to the object surface more precisely than a single transformation. As an effect, they can 1) retrieve a larger number of correct matches, and 2) estimate a set of highly-accurate image transformations.

Fig. 5. From a set of initial (appearance-based) matches, HMA retrieves a set of refined (or final) matches, and estimates a set of local affine transformations (better seen in color).

Fig. 6. Illustrative example of the hierarchical clustering of the matches: the initial matches (correct matches are represented by solid lines, wrong ones by dotted lines) are iteratively divided into smaller clusters (represented by polygons), each one limited to a portion of the image. Note that this clustering generates clusters spatially distributed over the whole scene.

We present here a general overview of the HMA algorithm; each stage will be further detailed in Section II-E. From a given set of candidate matches, HMA clusters the associated keypoints into contiguous areas, spatially distributed over the entire organ's surface. As illustrated in Fig. 6, this clustering is hierarchical: each cluster of matches is represented by a tree node, while the edges represent the expansion of a cluster (node) into sub-clusters (children nodes). The root node of the tree contains all the appearance-based matches, as indicated by the (black) polygon in the root node of Fig. 6. These matches are clustered into disjoint portions, and geometric constraints are enforced in each cluster to remove outliers (see the colored polygons in the inner node).
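The cluster-and-refine recursion illustrated in Fig. 6 can be skeletonized as below. This is an illustrative outline only, with caller-supplied clustering and model-fitting functions standing in for HMA's k-means and RANSAC stages; the names and thresholds are ours, not the paper's:

```python
def expand(matches, cluster_fn, fit_fn, min_matches=4, max_error=1.5):
    """Recursively cluster matches and fit one local model per cluster.

    cluster_fn(matches) -> list of sub-clusters;
    fit_fn(cluster) -> (model, inliers, mean_error).
    A cluster becomes a leaf when its fit is accurate enough; otherwise its
    inliers are expanded further. Returns a list of (model, inliers) leaves.
    """
    leaves = []
    for cluster in cluster_fn(matches):
        if len(cluster) < min_matches:
            continue                       # too few matches to fit a model: discard
        model, inliers, err = fit_fn(cluster)
        if err <= max_error or len(inliers) == len(cluster):
            leaves.append((model, inliers))                # terminal "leaf" node
        else:
            leaves.extend(expand(inliers, cluster_fn, fit_fn,
                                 min_matches, max_error))  # keep expanding
    return leaves

# Toy run: split lists in half until small; a zero-error "fit" keeps everything.
split = lambda ms: [ms[:len(ms) // 2], ms[len(ms) // 2:]] if len(ms) > 4 else [ms]
fit = lambda c: ("model", c, 0.0)
print(len(expand(list(range(8)), split, fit)))  # → 2
```

The stop criterion here (mean error below a threshold, or no outliers left) is a stand-in for the leaf condition described in the text.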
Note that the first node to the left ("inner node") still contains some outliers, thus requiring further expansion, while the right node has already reached a terminal state ("leaf node"), since it has the minimum number of supporting matches and does not contain any wrong matches. Thanks to this clustering formulation, HMA provides several advantages with respect to the other existing techniques.

Fig. 7. HMA algorithm: the initial matches in the root node are passed to an initial clustering, which detects and removes potential outliers. The remaining matches are passed to an expansion phase, which consists of three steps: clustering, affine estimation, and correction. As a result, the root node is divided into k new nodes. These new nodes are subject to a stop criterion to determine whether each node requires further expansion. This expansion phase is repeated until each node reaches a leaf state.

Accuracy: The use of multiple affine transformations is effective in removing a high percentage of incorrect matches on different areas of the organ's surface.
Speed: The expansion of each node allows each new branch of the tree to be processed separately. This permits the user to stop the expansion of each branch when a desired accuracy (or tree depth) is achieved, thus reducing the computational time. Note that the hierarchical structure of HMA lends itself to further acceleration through a parallel implementation on a multicore machine, by processing each sub-tree on a different processor.
Robustness: Each feature in the training image will be associated (i.e., matched) to only one feature in the query image. This is done by guaranteeing that each cluster (and thus, each affine transformation) is populated by contiguous and nonoverlapping sets of matches. Furthermore, each local affine transformation is estimated by means of RANSAC, which makes HMA robust to outliers.

E.
HMA: Block Diagram and Phases

We detail here each phase of the HMA algorithm, illustrated in the block diagram of Fig. 7. The appearance-based matches are passed to a Hough-voting phase, which is used to find clusters of features with similar keypoint parameters. In doing so, this phase can discard those

matches that have very low votes. These outliers are sent to an outlier buffer. The resulting inliers are instead assigned to the root node of the tree. This node is passed into a node-expansion phase, which consists of three steps. 1) A clustering step, which partitions the matching keypoints into clusters (each one containing matches with similar keypoint parameters). 2) For each cluster, a robust affine-estimation step is used to estimate an affine transformation, as well as the corresponding sets of inliers and outliers. At this level, the remaining outliers will be referred to as hard outliers; they are removed from the node and sent to the outlier buffer. 3) Finally, a correction step is used to a) verify that the clusters are spatially disjoint, b) if necessary, reassign matches from one cluster to another in order to ensure disjoint clusters, and c) update the sets of inliers and outliers. After these stages, each affine transformation, together with its corresponding inliers and outliers, defines a new child node. A stop criterion is adopted at this point to check when to stop the node expansion and deem a node a leaf. The set of final matches and the set of final transformations are extracted from the leaf nodes that exceed a threshold on the minimal number of inliers. However, since some correct matches could have been erroneously labeled as hard outliers in a previous phase, a final top-down phase tries to recover them by checking their pixel reprojection error against each of the final affine transformations. It is evident at this point that, even if HMA shares the same tree-like structure as hierarchical k-means [43], it also has significant differences. First, HMA makes use of two additional stages (affine estimation and correction) in order to generate a clustering of the matches that results in separated image regions.
Second, HMA can detect outliers by integrating a geometric-constraint phase. Finally, HMA expands a node only when necessary, and not a fixed number of times.

1) Initial Hough Voting: Since the set of initial (appearance-based) matches may contain a large number of incorrect matches [e.g., those in Fig. 4(a)], HMA adopts a Hough voting to remove potential outliers. In doing so, the probability of success of the next phases increases, which also leads to a lower expected computational time [16]. This voting is done in the space of keypoint parameters, which is discretized using broad bin sizes: 0.25 times the dimension of the larger image for both the translation of the similarity transformation and the image position, 30° for the orientation, and a factor of 2 for the scale. The Hough voting first computes the similarity parameters of the matches (cf. Section II-B1). Then, each match votes for the bin closest to its similarity parameters. Bins with many votes represent matches with consistent parameters, while bins with fewer votes represent potential outliers. For this reason, those matches voting for bins with fewer than three votes are removed, since three is the minimal number of matches required to fit an affine transformation (cf. Section II-B2). These removed matches are sent to the hard-outliers buffer, while the remaining ones are passed to the node-expansion phase.

2) Node Expansion: Clustering Step: This represents the first step in the node-expansion phase, which is executed at every level of the tree. In this step, the matches in the input node (e.g., the root node, if the current tree level is 0) are partitioned into clusters (see Fig. 8) by applying k-means [43] over a six-dimensional vector consisting of the query-keypoint position together with the four similarity-transformation parameters (cf. Section II-B).

Fig. 8. Clustering step: example of the resulting sets of the clustering step at an early stage of the tree. Note that clusters 1 and 4 successfully isolate the majority of correct matches.

We observed that clustering in this six-dimensional space is key, because it simultaneously leverages the spatial position of each keypoint together with the geometric information given by the similarity parameters. As a result, the obtained clusters will contain features that are both spatially close and close in their similarity parameters (which indeed represent a first approximation to each local affine transformation). As shown in the example of Fig. 8, we observed that the translation parameters play an important role in discriminating between correct and incorrect matches at early levels of the tree, such as the root node. Giving a higher importance to these translation parameters in the vectors passed to k-means is key to successfully isolating most of the outliers, as we can see in Fig. 8(b) and (c). Meanwhile, we also noticed that the keypoint-position components are more important in discriminating the correct matches at deeper levels of the tree. In fact, in these cases, the position parameters are ideal to ensure the spatial contiguity of the clusters, while the translation parameters are less informative due to their larger variation within smaller clusters. Finally, we observed that the scale and orientation parameters are sensitive to viewpoint changes at every level of the tree, especially when the observed object is nonplanar. Due to the above facts, HMA weights the components of each vector of matches before applying k-means, depending on the average reprojection error at that node. The weights are computed by interpolating between given initial and final weight vectors according to an increasing function α of the average symmetric reprojection error at that node (taken at its maximum value for the initial iteration). The function α takes values in the interval [0, 1]: it increases smoothly with the average error up to a maximum mapping-error threshold, and is equal to 1 otherwise.

Fig. 9. Example plot of the α function.

The α function is shaped by two parameters: the maximum mapping-error threshold and a smoothing parameter; Fig. 9 shows an example plot. After extensive empirical evaluation, we determined values for the initial and final weight vectors that provide a good trade-off between outlier-detection capability and spatial contiguity of the clusters. Note that, when the number of outliers is large (such as at the initial levels of the tree), the weighting gives more importance to the translation parameters, thus separating matches with large translations (possible outliers). When the number of outliers is reduced, the weighting reduces the importance of the translation parameters and increases that of the position components (thus enforcing the spatial contiguity of the clusters). From our experience, a good value for the number of clusters is k = 4, since it offers a good balance between simplifying the problem (i.e., generating four easier subproblems) and not compromising the overall accuracy. A larger k will instead tend to generate clusters with few matches, sometimes making it impossible to fit any affine transformation.

3) Node Expansion: Affine-Estimation Step: This is the core component of HMA, in which the geometric constraints are imposed at each node, in order to simultaneously estimate the local affine transformation and remove possible outliers. First, the affine-estimation step verifies that the input cluster contains enough matches to fit an affine model (i.e., more than three matches); otherwise, the cluster is discarded and its matches are sent to the hard-outliers buffer. An affine transformation is then estimated by means of RANSAC (cf. Sections II-B2 and II-C); if the consensus is greater than a minimum value, the inliers and outliers are computed, and a final affine model is estimated from the obtained inliers.
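The estimate-then-refit loop just described can be sketched as follows. This is a minimal single-cluster illustration in the spirit of Section II-C, not the authors' implementation; the reprojection error here is one-sided rather than symmetric, and the iteration count, tolerance, and minimum consensus are assumptions:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2x3 affine transform mapping src (N,2) onto dst (N,2),
    from N >= 3 noncollinear correspondences (cf. Eq. (2))."""
    X = np.hstack([src, np.ones((len(src), 1))])   # homogeneous coordinates
    B, *_ = np.linalg.lstsq(X, dst, rcond=None)    # solves X @ B ≈ dst
    return B.T                                     # 2x3 matrix [A | t]

def ransac_affine(src, dst, iters=200, tol=1.5, min_consensus=4, seed=0):
    """RANSAC: sample minimal 3-match sets, keep the largest consensus,
    then refit the affine model on all of its inliers."""
    rng = np.random.default_rng(seed)
    Xh = np.hstack([src, np.ones((len(src), 1))])
    best = None
    for _ in range(iters):
        idx = rng.choice(len(src), 3, replace=False)   # minimal sample
        A = fit_affine(src[idx], dst[idx])
        err = np.linalg.norm(Xh @ A.T - dst, axis=1)   # one-sided reprojection error
        inliers = np.flatnonzero(err < tol)
        if best is None or len(inliers) > len(best):
            best = inliers
    if best is None or len(best) < min_consensus:
        return None                                    # cluster would be discarded
    return fit_affine(src[best], dst[best]), best      # final model from the inliers

# Five points related by a known affinity, plus one corrupted match.
A_true = np.array([[1.1, 0.1, 5.0], [-0.2, 0.9, 3.0]])
src = np.array([[0.0, 0], [10, 0], [0, 10], [10, 10], [5, 5], [3, 7]])
dst = np.hstack([src, np.ones((6, 1))]) @ A_true.T
dst[5] += 50.0                                         # gross outlier
A_est, inliers = ransac_affine(src, dst)
print(sorted(inliers.tolist()))  # → [0, 1, 2, 3, 4]
```

In HMA this estimation runs once per cluster; matches rejected here go to the hard-outliers buffer, with a later chance of recovery in the top-down phase.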
If the minimal consensus is not reached, the cluster is discarded and the matches are sent to the buffer. As anticipated in the previous section, after the entire tree is created, a final opportunity will be given in the top-down phase to the outliers in the buffer to be associated with one of the final transformations.

Fig. 10(a) shows an example of the ATs and the sets of inliers (colored dots) obtained from the clusters in the example of Fig. 8. As observed, these regions nicely capture keypoints according to the slope of the organ's surface. Also note that some clusters were discarded because it was not possible to fit an affine transformation with a minimum consensus.

Fig. 10. (a) Example of affine transformations (the planar regions) and inliers for each cluster (better seen in color); observe that the local region associated with each cluster (colored square) may overlap in both images. (b) and (c) Example of the correction phase. (b) Two clusters of inliers (blue and red points) and their associated image regions (red and blue squares). Note that these regions overlap (yellow circle), which negatively affects the performance of HMA. (c) LDA computes a separation (solid yellow line) between both clusters, generating two nonoverlapping image regions (colored regions).

Finally, in order to remove isolated matches with large residuals,5 HMA adopts a nonparametric outlier-detection technique [44]. In particular, from the statistics of the symmetric reprojection errors in each node, a threshold is built to discard matches with a larger error. This threshold is computed from a quartile of the reprojection errors over all the matches, scaled by a fixed factor.6 Those matches with a reprojection error outside this statistical measure are removed from the cluster and sent to the buffer. For each cluster, the resulting affine transformation and the sets of inliers and outliers are passed to the correction step. Note the following key observations regarding the affine-estimation step.
As is well known for RANSAC, choosing the right consensus value is crucial. In our experience, a minimal consensus of six matches represents a good trade-off between robustness and cluster rejection. Clusters with a low percentage of correct matches are rejected, since the RANSAC estimation assumes that a majority of the matches are inliers. The number of clusters and the minimal consensus are closely related: for example, a large number of clusters (which reduces HMA's computational complexity) will generate clusters with fewer matches, thus requiring a smaller minimal consensus (which is less robust to outliers) to avoid immediate cluster rejections.

4) Node Expansion: Correction Step: Even after the clustering and affine-estimation steps, some feature matches belonging to distinct clusters (i.e., distinct affine transformations) may still overlap, i.e., they may share a common image area. For example, Fig. 10(b) shows an overlap (blue and red points) of two clusters after the affine-estimation step. This overlap can negatively affect the performance of our algorithm: when computing the affine mapping of a generic image feature lying in the overlapping region, it is not clear which affine transformation should be used. The goal of the correction step is to solve this ambiguity by creating nonoverlapping clusters of matches. We addressed

5 These matches could negatively affect subsequent clustering phases. 6 The factor was fixed empirically.

Fig. 11. In-lab experiment. (a) Some of the images used in the rotation set. The left image was fixed as the training image, while the object in the query images was rotated by up to 30° about its vertical axis. (b) Some of the images in the deformation set, where a soft object (left image) was deformed with different levels of strength.

this problem by passing the query-keypoint positions and their corresponding cluster class indexes as training data to a linear discriminant analysis (LDA) algorithm [43]. LDA uses this training data to learn the linear parameters used to separate feature matches within each cluster. Finally, the inliers and outliers are used as the testing set for LDA. In this way, each image region is reassigned to only one cluster, as observed in the example of Fig. 10(c). Once all the matches are reassigned, the sets of inliers and outliers are updated with a final estimation of the affine transformation.

5) Stop Criterion: In order to stop the expansion of a node and deem it a leaf, HMA examines whether there is any benefit in further expanding it. This is achieved by measuring the node's ratio of inliers: a node is deemed a leaf when this ratio exceeds a threshold (0.9 in our implementation; cf. Section III-D).

6) Top-Down Phase: After every node reaches a leaf state, a final opportunity is given to retrieve correct matches that were erroneously classified as hard outliers; the matches in the buffer are recovered and associated with the (spatially) closest node by examining the symmetric reprojection error.

III. EXPERIMENTS AND RESULTS

A. Overview of the Experimental Evaluation

We thoroughly compared the performance of HMA in two scenarios: 1) a highly-controlled in-lab dataset with nonplanar objects, and 2) a large laparoscopic-surgery dataset with more than 100 images acquired from six real videos of partial-nephrectomy interventions.
The in-lab dataset consists of 18 image pairs simulating highly-controlled cases of occlusion, viewpoint changes, and deformations. Fig. 11 shows a representative example for each of these scenarios. Note that, while the popular graffiti image was chosen as the texture for the in-lab objects, our in-lab dataset substantially differs from other computer-vision databases because in our case this texture is applied to nonplanar object surfaces. The laparoscopic-image dataset contains many cases in which there is a real need to accurately retrieve a precise and large number of matches. These cases range from prolonged and complete camera occlusion (e.g., due to the surgical-tool motion in front of the camera), to camera retraction and reinsertion, sudden camera motion, and specular reflections. Fig. 14 shows image pairs that are representative of each of these scenarios. In order to provide an extensive evaluation of HMA over all of the above scenarios and over several features (SIFT, ASIFT, and SURF), as well as a thorough comparison against other state-of-the-art feature-matching algorithms (cf. Section III-D), we manually annotated the data (both the features and the matches) extracted from the aforementioned endoscopic images. This ground-truth data was captured by four expert users: three users provided independent annotations, and the fourth user resolved any conflicts among the three annotations. Initial SIFT matches were manually labeled by each user as correct or incorrect by carefully observing their positions in the two images; only the most certain matches were labeled as correct. A set of corresponding corners was also manually selected by each user in each image pair; again, only the most certain image correspondences were selected. Note that SURF and ASIFT matches were not labeled due to the large number of their initial matches.
Also, note that for SURF and ASIFT features we did not limit our datasets to only the strongest features, in order not to alter the performance (e.g., the percentage of correct matches).7 Our comparison (for each feature type) is based on measuring the algorithms' accuracy in detecting the correct matches (matching performance), as well as their accuracy in mapping ground-truth corresponding points between images (mapping performance). These measures are described in detail in Section III-C. Additionally, we compared the computational time of each algorithm, measured in seconds of CPU time8 required by each run. In our experimental evaluation (cf. Sections III-E and III-F) we compare, analyze, and discuss the performance of all the aforementioned algorithms, and show that HMA represents a considerable improvement over the existing methods.

B. Comparison With Existing Datasets

Our benchmark improves over existing databases because it includes many carefully-annotated surgical images for accurate testing in a minimally-invasive surgical scenario. On our dataset webpage we provide a detailed description of both the DB image pairs and the (ground-truth) matching and mapping data, to be used by the community to evaluate future algorithms. To the best of our knowledge, only two other large datasets have been made publicly available. The dataset in [45] contains many stereo videos from real endoscopic surgeries. However, and differently from our benchmark, [45] does not currently contain any MIS image pairs from challenging cases that can be used for evaluating feature-matching algorithms. Furthermore, ground-truth corresponding features and matching labels are not currently provided. Finally, [45] does not include any in-lab experiments (with ground-truth features) under controlled object rotations and deformations.
The dataset in [46] contains thousands of images to be used in many computer-vision problems, such as image retrieval,

7 For example, ASIFT can produce more than 4000 initial matches per image, thus rendering the ground-truth (manual) labeling practically unfeasible. 8 Intel Core i7 2670QM 2.20 GHz, Intel Corp., Santa Clara, CA, USA.

classification, recognition, and, finally, matching. However, this dataset does not contain any endoscopic-image cases, and the provided images are only obtained by applying several (known) homography distortions to single images of (almost-planar) scenes (e.g., a building facade). As a result, this dataset cannot be used to compare the efficiency and robustness of feature-matching algorithms towards scene distortions, as well as towards possible reflections, clutter, and illumination changes (which are very common in real surgical scenarios when moving the endoscopic camera to a different viewpoint).

C. Validation Metrics

The matching-performance metric compares the final matches against a ground truth of manually-labeled matches. We base our analysis on receiver operating characteristic (ROC) curves [39], which are commonly used to visualize, organize, and select classifiers based on their classification performance. ROC curves depict the relative trade-off between the sensitivity (or recall) and the 1-specificity. Given a set of initial and final matches with known labels (correct or wrong), the true positives are those final matches labeled as correct, and the false positives are those final matches labeled as wrong. These sets are used to compute sensitivity = TP / P and 1-specificity = FP / N, where TP and FP are the numbers of true and false positives, and P and N are the numbers of initial matches labeled correct and wrong, respectively. The ROC curves are parameterized by a score, which indicates the confidence with which the algorithm accepts or rejects each match, e.g., the negative of the reprojection error of each match when mapped by its corresponding AT. The matches are sorted in descending order of score; this order is used to iteratively compute and plot the cumulative sensitivity and 1-specificity (cf.
[39] for illustrative examples). The mapping-performance metric measures how precisely each algorithm's transformation maps a manually-selected set of corresponding points between each image pair (cf. Section III-A). This is done by computing the symmetric reprojection error [36] of these points. Note that this measure is independent of the number of initial matches (which varies with different thresholds and different features); in fact, the mapping performance only requires a set of known corresponding points between the images.

D. Description of the Comparing Methods

In what follows, we briefly describe the three recent feature-matching algorithms that we compared against HMA. These algorithms differ in the geometric constraints used to detect and remove outliers from the initial matches (cf. Section II-A). We also provide a short analysis of the strengths and weaknesses of each algorithm.

Lowe's algorithm [16]: This method works similarly to the strategy described in Section II-C, by estimating a single affine transformation that maps features from the training to the query image. The refined matches are only those that obey such a geometric affine transformation within a specific reprojection-error pixel threshold. This transformation is estimated by first using a voting scheme (Hough) to cluster the initial matches into sets with similar similarity-transformation parameters. For each cluster, a single affine transformation is estimated (cf. Section II-C), and a probabilistic model is used to select the best model.

Adaptive multi-affine (AMA) algorithm [22]: AMA relaxes the assumption of a single affine model by estimating a set of multiple affine transformations, each associated with a cluster of matches. As a result, the inliers (matches) are distributed along the entire organ's surface. In AMA, a set of clusters is estimated as in [16], and the clusters are then sent to a cascade of RANSAC-based affine estimators.
For each cluster, a supporting transformation is computed. An adaptive procedure is then used to select the best number of clusters for k-means, and to estimate all the transformations and the corresponding inliers. AMA extracts more inliers than Lowe's approach; however, the adaptive process is computationally expensive [38].

Agglomerative correspondence clustering (ACC) algorithm [23]: ACC determines the set of refined matches by employing a hierarchical clustering algorithm based on an agglomerative (bottom-up) strategy [43]. This strategy iteratively merges pairs of matches (or clusters of matches) into a single cluster based on a dissimilarity measure between matches (or clusters) consisting of both geometric and appearance constraints. ACC iteratively merges clusters according to both their dissimilarity measure and a linkage criterion to generate the final clusters. ACC requires the user to specify a larger number of (nonintuitive) parameters than Lowe, AMA, and HMA.

For a fair comparison, all of the aforementioned algorithms were implemented in MATLAB and tested on the same dataset. Our implementation of HMA used a RANSAC threshold of 5 pixels, a minimal consensus threshold of 6, a threshold of 0.9 over the ratio of inliers, and the empirically-determined weight vectors (cf. Section II). Similarly, we used the same threshold of 5 pixels for AMA and Lowe. ACC uses a cutoff value of 10 for the linkage function (a threshold over the dissimilarity measure [23]), which we observed to maximize its performance. The remaining parameters for Lowe, AMA, and ACC were fixed as originally reported in [16], [22], and [23], respectively.

E. In-Lab Experiments

The goal of the in-lab experiments is to evaluate the performance of the algorithms under highly-controlled cases of rotation and deformation.
This dataset consists of 18 image pairs divided into two testing cases. The rotation case consists of seven image pairs of a nonplanar object, in which the query images were taken under controlled object rotations (up to 30°), as shown in Fig. 11(a). The other 11 image

TABLE I IN-LAB EXPERIMENT: ROTATION AND DEFORMATION SETS, AVERAGE REPROJECTION ERRORS (PIXELS) PER IMAGE PAIR. TABLE II IN-LAB DATASET: ALGORITHMS' AVERAGE PERFORMANCE.

Fig. 12. In-lab experiment, matching performance (better seen in color). (a) Plots of the AUC of the ROC curves for the rotation case. (b) AUC curves for the deformation case.

pairs correspond to the deformation case, in which a soft object was subjected to different levels of deformation (low, medium, strong), as Fig. 11(b) illustrates. All images share the same resolution, and the average numbers of initial matches for SIFT, SURF, and ASIFT are 196, 239, and 4000 for the rotation set, and 540, 545, and 8761 for the deformation set. Our mapping ground truth consists of an average of 55 corresponding corners for the rotation set and 30 for the deformation set, which were carefully selected from different parts of the entire surface of the objects. We computed the area under the curve (AUC) of each ROC curve in order to better compare the performance of the methods in these in-lab experiments. Fig. 12 shows the comparison of the four algorithms in both testing cases; note that larger AUC values (i.e., closer to 1) indicate better matching performance. Table I provides detailed statistics of each algorithm's mapping performance for each testing case: the first group of columns represents the different image-pair rotations, while the remaining columns represent the different levels of deformation (low, medium, and strong).

Fig. 13. Qualitative example of the matching performance of the in-lab experiment for SIFT features. Note that the single affine transformation (yellow arrow) estimated by Lowe tends to accumulate large reprojection errors (yellow lines) when mapping those ground-truth correspondences (green stars) that do not support the affine transformation (polygon). (Better seen in color).
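The ROC quantities used in this evaluation (sensitivity, 1-specificity, and the AUC of Section III-C) can be computed from scored, labeled matches as in the following sketch. This is our own illustration, not the paper's code; the scores would be, e.g., the negative symmetric reprojection errors of the matches:

```python
import numpy as np

def roc_points(scores, labels):
    """ROC operating points from per-match scores (higher = more confident)
    and ground-truth labels (True = correct match).

    Returns (fpr, tpr): cumulative 1-specificity and sensitivity arrays."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(labels, dtype=bool)[order]
    tp = np.cumsum(labels)           # true positives accepted at each cutoff
    fp = np.cumsum(~labels)          # false positives accepted at each cutoff
    P, N = labels.sum(), (~labels).sum()
    tpr = np.concatenate([[0.0], tp / max(P, 1)])   # sensitivity
    fpr = np.concatenate([[0.0], fp / max(N, 1)])   # 1-specificity
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve via the trapezoidal rule."""
    return float(np.sum(0.5 * (tpr[1:] + tpr[:-1]) * np.diff(fpr)))
```

Sorting by descending score and accumulating true/false positives reproduces the cumulative sensitivity and 1-specificity described in Section III-C; a perfect ranking of the matches yields an AUC of 1.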
Table II summarizes the average results for both cases: the first four rows contain the results for the rotation case and the last four those for the deformation case. The second, fifth, and eighth columns show the mean and standard deviation of the symmetric pixel reprojection error. The third, sixth, and ninth columns present the required computational time (in seconds), while the fourth column indicates the sensitivity and 1-specificity values, and the seventh and tenth contain the percentage of inliers, which we use as a heuristic to indicate the detection power of each algorithm. Fig. 13 shows an example of qualitative results when SURF features are used. Note that HMA, AMA, and Lowe retrieve both a set of refined matches and multiple (or a single) affine transformations. Note that Lowe's single affine transformation only captures a reduced number of matches, localized on a strip of the object's surface, thus indicating the difficulty to reliably

TABLE III LAPAROSCOPIC DATASET: ALGORITHMS' PERFORMANCE.

Fig. 14. Examples of image pairs contained in the database.

map features far away from the ones supporting the (single) affine transformation. As a result, and as illustrated in Fig. 13(b), the mapping of those ground-truth correspondences lying on the sides of the object surface gives rise to high reprojection errors. Conversely, HMA and AMA capture more matches, distributed around the whole object's surface. Also observe that ACC sometimes tends to detect more ambiguous matches than the other methods, e.g., those at the top-right corner of the object.

F. Surgical-Images Dataset

Our dataset includes more than 100 image pairs, selected from cases of instrument occlusion, fast camera or organ motion, change of illumination, and camera retraction. In particular, the image pairs were manually selected by considering cases without blur and those with large viewpoint changes but still with high visibility of the same scene. Fig. 14 shows some representative examples of image pairs relative to cases of camera retraction, complete occlusion, or zoom. The sets of initial matches contained on average approximately 265 matches for SIFT, 233 for SURF, and 2872 for ASIFT. The set of manually-selected correspondences between each image pair contained on average 20 points; manually obtaining a higher (average) number of ground-truth corresponding corners was indeed very difficult because of the strong illumination changes and image clutter. Fig. 15(a) depicts the ROC curve for each algorithm when SIFT features are used. Note that the results of all the image pairs are integrated into the average ROC curves for each algorithm. We also include the confidence intervals9 (vertical lines) for some selected score values.10 In addition, Fig.
15(b) and (c) show the direct comparisons of sensitivity and 1-specificity between HMA and the other methods. Table III summarizes the average mapping performance of the algorithms for the different types of features, with the same parameters and thresholds as in the in-lab experiment. Fig. 16 shows qualitative examples of the matching performance of the four algorithms for SIFT, SURF, and ASIFT.

9 We use a significance level of 95% in a two-tailed test. 10 For clarity, we only present score values that represent small, medium, and large reprojection errors.

Fig. 15. MIS-dataset matching performance (better seen in color). (a) Average ROC curves of the four algorithms. The vertical lines show the 95% confidence intervals of the mean. (b) and (c) Direct comparison of HMA's sensitivity and 1-specificity against Lowe, AMA, and ACC.

IV. DISCUSSION AND CONCLUSION

In this work, we have presented our novel hierarchical multi-affine (HMA) feature-matching algorithm to find image similarities between two laparoscopic views. HMA removes incorrect matches from a given set of initial appearance-based matches by iteratively partitioning the features into clusters over the organ's surface, and by estimating an affine transformation for each cluster. This affine mapping makes it possible to recover the pixel position of tracked features that were lost after a complete and prolonged occlusion or sudden camera motions. HMA is an important tool in MIS applications and has the potential to become a core component in many surgical-vision applications. We evaluated our HMA algorithm in two controlled in-lab tests, as well as in challenging MIS scenarios. In addition, we compared HMA's performance with respect to other state-of-the-art algorithms: Lowe's [16], the adaptive multi-affine (AMA) [22], and the agglomerative correspondence clustering (ACC) [23] algorithms (cf. Section III-D). The in-lab experiment (cf.
Section III-E) demonstrated HMA's higher matching and mapping performance, and lower computational time, with respect to the other algorithms under highly-controlled rotations and deformations. In particular, from Fig. 12 it is clear that HMA outperforms AMA, Lowe, and ACC in the matching task. Table I shows that HMA also has a higher mapping accuracy than Lowe and ACC, in terms of both average error and standard deviation. Note that

Fig. 16. Qualitative example of the matching performance of the four algorithms for three different image pairs. The first column shows solutions for database image pair 13 using SIFT features; the second column contains solutions for image pair 6 using SURF features. The third column shows the solutions for image pair 38 when ASIFT features are used. (Better seen in color).

AMA is very competitive, indicating that the multi-affine approach adapts better to (nonplanar) object rotations and deformations. However, even if HMA and AMA share the same multi-affine approach, HMA was specifically designed to be faster than AMA: HMA uses a hierarchical structure that iteratively divides the feature-matching problem into smaller subproblems, thus reducing the computational cost, whereas AMA uses a brute-force strategy to determine the right number of local affine transformations, thus incurring a very large overhead. HMA's improvement over the other approaches is shown in Table II. In particular, note the large difference in computational time when the number of matches is large: in the deformation case with ASIFT, HMA is three times faster than Lowe, 39 times faster than AMA, and 914 times faster than ACC.

In the second experiment, we evaluated these feature-matching algorithms in the highly-cluttered MIS environment. Our database is particularly challenging due to the high percentage of incorrect matches after the appearance-based matching phase (larger than in the in-lab test). This happens for several reasons: few and sparse features due to large texture-less image regions; image similarities due to the ambiguous nature of surgical images; and high image distortion due to the endoscope lenses and endoscopic illumination (this last phenomenon, for example, causes many good matches to be localized around the image center).
Despite these problems, our results, illustrated in the average ROC curves in Fig. 15, in the qualitative comparison in Fig. 16, and in Table III, show that HMA achieves a great balance between speed and accuracy: HMA has high matching and mapping performance, but with a significantly reduced computational time. Furthermore, Fig. 15(a) shows that HMA only requires a score (reprojection error) of 15 pixels to detect most of the correct matches, while the other algorithms require larger thresholds and consequently detect an increased number of false positives (incorrect matches). Fig. 15(b) shows that HMA achieves better sensitivity than the other methods at the same score values, thus indicating an increased capability to detect the correct matches even at lower score thresholds. At the same time, in Fig. 15(c) HMA has performance similar to Lowe and AMA, and more stability to outliers than ACC. In particular, observe the gap between the dashed (pink) curve and the other curves in Fig. 15(c): this gap indicates a higher detection of false positives (incorrect matches) when ACC's dissimilarity threshold is increased. Also, as observed in Fig. 16, HMA is comparable with AMA, is more effective than Lowe's single-affine formulation, and is more robust than ACC [several ambiguities can be readily found by visual inspection, e.g., in Fig. 16(c)]. Furthermore, from the results in Table III, and similarly to the in-lab experiment, we observed that HMA achieves a mapping error comparable to AMA's, but at a speed comparable to Lowe's. These results demonstrate that HMA's hierarchical formulation successfully increases the convergence speed without sacrificing mapping accuracy. Also observe that, in the case of a large number of matches (e.g., when using ASIFT), HMA again achieves reduced computational times compared with Lowe, AMA, and ACC (approximately 2.57, 20, and 245 times faster, respectively).
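As an illustration of this divide-and-conquer structure, the following sketch (ours, and deliberately simplified: plain k-means on the query-keypoint positions stands in for the paper's weighted clustering, and the correction and top-down phases are omitted) recursively splits the matches until each cluster supports a single affine model. Parameter values follow Section III-D:

```python
import numpy as np

def _affine_inliers(src, dst, thresh=5.0, iters=100, rng=None):
    """Tiny RANSAC: best 3x2 affine matrix M and inlier mask (None if < 3 matches)."""
    rng = rng or np.random.default_rng(0)
    n = len(src)
    if n < 3:
        return None, None
    X = np.hstack([src, np.ones((n, 1))])            # homogeneous query points
    best = None
    for _ in range(iters):
        i = rng.choice(n, 3, replace=False)
        M, *_ = np.linalg.lstsq(X[i], dst[i], rcond=None)
        mask = np.linalg.norm(X @ M - dst, axis=1) < thresh
        if best is None or mask.sum() > best[1].sum():
            best = (M, mask)
    M, mask = best
    if mask.sum() >= 3:                              # refit on all inliers
        M, *_ = np.linalg.lstsq(X[mask], dst[mask], rcond=None)
        mask = np.linalg.norm(X @ M - dst, axis=1) < thresh
    return M, mask

def _kmeans(pts, k, iters=20, rng=None):
    """Plain k-means on 2-D points; returns a cluster label per point."""
    rng = rng or np.random.default_rng(0)
    centers = pts[rng.choice(len(pts), k, replace=False)].copy()
    for _ in range(iters):
        d = np.linalg.norm(pts[:, None] - centers[None], axis=2)
        lab = d.argmin(axis=1)
        for j in range(k):
            if (lab == j).any():
                centers[j] = pts[lab == j].mean(axis=0)
    return lab

def hma_expand(matches, depth=0, max_depth=8, k=4, min_ratio=0.9, min_consensus=6):
    """Recursive node expansion: returns a list of (affine, inlier_matches) leaves.
    `matches` is an (N, 4) array of [x1, y1, x2, y2] correspondences."""
    src, dst = matches[:, :2], matches[:, 2:]
    M, mask = _affine_inliers(src, dst)
    if M is None or mask.sum() < min_consensus:
        return []                                    # cluster discarded -> outlier buffer
    if mask.mean() >= min_ratio or depth >= max_depth or len(matches) < k * min_consensus:
        return [(M, matches[mask])]                  # stop criterion met: leaf node
    leaves = []
    lab = _kmeans(src, k)                            # split into k sub-clusters
    for j in range(k):
        if (lab == j).sum() >= 3:
            leaves += hma_expand(matches[lab == j], depth + 1,
                                 max_depth, k, min_ratio, min_consensus)
    return leaves
```

Because each recursion level runs RANSAC only on a shrinking subset of the matches, the work per node decreases quickly down the tree, which is the source of the speed-up over AMA's brute-force model search.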
In addition, we collected HMA's maximum tree depth for both experiments, and observed that it varies with the feature type (SIFT, SURF, and ASIFT). Across all the MIS image pairs, the minimum depth reported was 2 (SURF) and the maximum 62 (ASIFT). In the case of the in-lab dataset, we observed that the percentage of correct matches in each image pair is usually


More information

Object Recognition with Invariant Features

Object Recognition with Invariant Features Object Recognition with Invariant Features Definition: Identify objects or scenes and determine their pose and model parameters Applications Industrial automation and inspection Mobile robots, toys, user

More information

CS 231A Computer Vision (Winter 2018) Problem Set 3

CS 231A Computer Vision (Winter 2018) Problem Set 3 CS 231A Computer Vision (Winter 2018) Problem Set 3 Due: Feb 28, 2018 (11:59pm) 1 Space Carving (25 points) Dense 3D reconstruction is a difficult problem, as tackling it from the Structure from Motion

More information

Introduction. Introduction. Related Research. SIFT method. SIFT method. Distinctive Image Features from Scale-Invariant. Scale.

Introduction. Introduction. Related Research. SIFT method. SIFT method. Distinctive Image Features from Scale-Invariant. Scale. Distinctive Image Features from Scale-Invariant Keypoints David G. Lowe presented by, Sudheendra Invariance Intensity Scale Rotation Affine View point Introduction Introduction SIFT (Scale Invariant Feature

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

A Systems View of Large- Scale 3D Reconstruction

A Systems View of Large- Scale 3D Reconstruction Lecture 23: A Systems View of Large- Scale 3D Reconstruction Visual Computing Systems Goals and motivation Construct a detailed 3D model of the world from unstructured photographs (e.g., Flickr, Facebook)

More information

Structured Light II. Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov

Structured Light II. Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov Structured Light II Johannes Köhler Johannes.koehler@dfki.de Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov Introduction Previous lecture: Structured Light I Active Scanning Camera/emitter

More information

CS231A Section 6: Problem Set 3

CS231A Section 6: Problem Set 3 CS231A Section 6: Problem Set 3 Kevin Wong Review 6 -! 1 11/09/2012 Announcements PS3 Due 2:15pm Tuesday, Nov 13 Extra Office Hours: Friday 6 8pm Huang Common Area, Basement Level. Review 6 -! 2 Topics

More information

Chapter 3 Image Registration. Chapter 3 Image Registration

Chapter 3 Image Registration. Chapter 3 Image Registration Chapter 3 Image Registration Distributed Algorithms for Introduction (1) Definition: Image Registration Input: 2 images of the same scene but taken from different perspectives Goal: Identify transformation

More information

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality

More information

EE795: Computer Vision and Intelligent Systems

EE795: Computer Vision and Intelligent Systems EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 FDH 204 Lecture 10 130221 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Review Canny Edge Detector Hough Transform Feature-Based

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS

SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS Cognitive Robotics Original: David G. Lowe, 004 Summary: Coen van Leeuwen, s1460919 Abstract: This article presents a method to extract

More information

An Ensemble Approach to Image Matching Using Contextual Features Brittany Morago, Giang Bui, and Ye Duan

An Ensemble Approach to Image Matching Using Contextual Features Brittany Morago, Giang Bui, and Ye Duan 4474 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 24, NO. 11, NOVEMBER 2015 An Ensemble Approach to Image Matching Using Contextual Features Brittany Morago, Giang Bui, and Ye Duan Abstract We propose a

More information

Instance-level recognition II.

Instance-level recognition II. Reconnaissance d objets et vision artificielle 2010 Instance-level recognition II. Josef Sivic http://www.di.ens.fr/~josef INRIA, WILLOW, ENS/INRIA/CNRS UMR 8548 Laboratoire d Informatique, Ecole Normale

More information

Image Features: Local Descriptors. Sanja Fidler CSC420: Intro to Image Understanding 1/ 58

Image Features: Local Descriptors. Sanja Fidler CSC420: Intro to Image Understanding 1/ 58 Image Features: Local Descriptors Sanja Fidler CSC420: Intro to Image Understanding 1/ 58 [Source: K. Grauman] Sanja Fidler CSC420: Intro to Image Understanding 2/ 58 Local Features Detection: Identify

More information

Overview. Augmented reality and applications Marker-based augmented reality. Camera model. Binary markers Textured planar markers

Overview. Augmented reality and applications Marker-based augmented reality. Camera model. Binary markers Textured planar markers Augmented reality Overview Augmented reality and applications Marker-based augmented reality Binary markers Textured planar markers Camera model Homography Direct Linear Transformation What is augmented

More information

Mobile Human Detection Systems based on Sliding Windows Approach-A Review

Mobile Human Detection Systems based on Sliding Windows Approach-A Review Mobile Human Detection Systems based on Sliding Windows Approach-A Review Seminar: Mobile Human detection systems Njieutcheu Tassi cedrique Rovile Department of Computer Engineering University of Heidelberg

More information

Specular 3D Object Tracking by View Generative Learning

Specular 3D Object Tracking by View Generative Learning Specular 3D Object Tracking by View Generative Learning Yukiko Shinozuka, Francois de Sorbier and Hideo Saito Keio University 3-14-1 Hiyoshi, Kohoku-ku 223-8522 Yokohama, Japan shinozuka@hvrl.ics.keio.ac.jp

More information

Outline 7/2/201011/6/

Outline 7/2/201011/6/ Outline Pattern recognition in computer vision Background on the development of SIFT SIFT algorithm and some of its variations Computational considerations (SURF) Potential improvement Summary 01 2 Pattern

More information

Eppur si muove ( And yet it moves )

Eppur si muove ( And yet it moves ) Eppur si muove ( And yet it moves ) - Galileo Galilei University of Texas at Arlington Tracking of Image Features CSE 4392-5369 Vision-based Robot Sensing, Localization and Control Dr. Gian Luca Mariottini,

More information

3D Computer Vision. Structured Light II. Prof. Didier Stricker. Kaiserlautern University.

3D Computer Vision. Structured Light II. Prof. Didier Stricker. Kaiserlautern University. 3D Computer Vision Structured Light II Prof. Didier Stricker Kaiserlautern University http://ags.cs.uni-kl.de/ DFKI Deutsches Forschungszentrum für Künstliche Intelligenz http://av.dfki.de 1 Introduction

More information

Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation

Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation Obviously, this is a very slow process and not suitable for dynamic scenes. To speed things up, we can use a laser that projects a vertical line of light onto the scene. This laser rotates around its vertical

More information

A Survey of Light Source Detection Methods

A Survey of Light Source Detection Methods A Survey of Light Source Detection Methods Nathan Funk University of Alberta Mini-Project for CMPUT 603 November 30, 2003 Abstract This paper provides an overview of the most prominent techniques for light

More information

arxiv: v1 [cs.cv] 28 Sep 2018

arxiv: v1 [cs.cv] 28 Sep 2018 Extrinsic camera calibration method and its performance evaluation Jacek Komorowski 1 and Przemyslaw Rokita 2 arxiv:1809.11073v1 [cs.cv] 28 Sep 2018 1 Maria Curie Sklodowska University Lublin, Poland jacek.komorowski@gmail.com

More information

3D Reconstruction of a Hopkins Landmark

3D Reconstruction of a Hopkins Landmark 3D Reconstruction of a Hopkins Landmark Ayushi Sinha (461), Hau Sze (461), Diane Duros (361) Abstract - This paper outlines a method for 3D reconstruction from two images. Our procedure is based on known

More information

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009 Learning and Inferring Depth from Monocular Images Jiyan Pan April 1, 2009 Traditional ways of inferring depth Binocular disparity Structure from motion Defocus Given a single monocular image, how to infer

More information

Reinforcement Matching Using Region Context

Reinforcement Matching Using Region Context Reinforcement Matching Using Region Context Hongli Deng 1 Eric N. Mortensen 1 Linda Shapiro 2 Thomas G. Dietterich 1 1 Electrical Engineering and Computer Science 2 Computer Science and Engineering Oregon

More information

WP1: Video Data Analysis

WP1: Video Data Analysis Leading : UNICT Participant: UEDIN Fish4Knowledge Final Review Meeting - November 29, 2013 - Luxembourg Workpackage 1 Objectives Fish Detection: Background/foreground modeling algorithms able to deal with

More information

Edge and corner detection

Edge and corner detection Edge and corner detection Prof. Stricker Doz. G. Bleser Computer Vision: Object and People Tracking Goals Where is the information in an image? How is an object characterized? How can I find measurements

More information

SIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014

SIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014 SIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014 SIFT SIFT: Scale Invariant Feature Transform; transform image

More information

Rotation Invariant Image Registration using Robust Shape Matching

Rotation Invariant Image Registration using Robust Shape Matching International Journal of Electronic and Electrical Engineering. ISSN 0974-2174 Volume 7, Number 2 (2014), pp. 125-132 International Research Publication House http://www.irphouse.com Rotation Invariant

More information

Trademark Matching and Retrieval in Sport Video Databases

Trademark Matching and Retrieval in Sport Video Databases Trademark Matching and Retrieval in Sport Video Databases Andrew D. Bagdanov, Lamberto Ballan, Marco Bertini and Alberto Del Bimbo {bagdanov, ballan, bertini, delbimbo}@dsi.unifi.it 9th ACM SIGMM International

More information

Midterm Examination CS 534: Computational Photography

Midterm Examination CS 534: Computational Photography Midterm Examination CS 534: Computational Photography November 3, 2016 NAME: Problem Score Max Score 1 6 2 8 3 9 4 12 5 4 6 13 7 7 8 6 9 9 10 6 11 14 12 6 Total 100 1 of 8 1. [6] (a) [3] What camera setting(s)

More information

Viewpoint Invariant Features from Single Images Using 3D Geometry

Viewpoint Invariant Features from Single Images Using 3D Geometry Viewpoint Invariant Features from Single Images Using 3D Geometry Yanpeng Cao and John McDonald Department of Computer Science National University of Ireland, Maynooth, Ireland {y.cao,johnmcd}@cs.nuim.ie

More information

BSB663 Image Processing Pinar Duygulu. Slides are adapted from Selim Aksoy

BSB663 Image Processing Pinar Duygulu. Slides are adapted from Selim Aksoy BSB663 Image Processing Pinar Duygulu Slides are adapted from Selim Aksoy Image matching Image matching is a fundamental aspect of many problems in computer vision. Object or scene recognition Solving

More information

EN1610 Image Understanding Lab # 4: Corners, Interest Points, Hough Transform

EN1610 Image Understanding Lab # 4: Corners, Interest Points, Hough Transform EN1610 Image Understanding Lab # 4: Corners, Interest Points, Hough Transform The goal of this fourth lab is to ˆ Learn how to detect corners, and use them in a tracking application ˆ Learn how to describe

More information

Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction

Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction Marc Pollefeys Joined work with Nikolay Savinov, Christian Haene, Lubor Ladicky 2 Comparison to Volumetric Fusion Higher-order ray

More information

Visual Odometry. Features, Tracking, Essential Matrix, and RANSAC. Stephan Weiss Computer Vision Group NASA-JPL / CalTech

Visual Odometry. Features, Tracking, Essential Matrix, and RANSAC. Stephan Weiss Computer Vision Group NASA-JPL / CalTech Visual Odometry Features, Tracking, Essential Matrix, and RANSAC Stephan Weiss Computer Vision Group NASA-JPL / CalTech Stephan.Weiss@ieee.org (c) 2013. Government sponsorship acknowledged. Outline The

More information

COMPARATIVE STUDY OF IMAGE EDGE DETECTION ALGORITHMS

COMPARATIVE STUDY OF IMAGE EDGE DETECTION ALGORITHMS COMPARATIVE STUDY OF IMAGE EDGE DETECTION ALGORITHMS Shubham Saini 1, Bhavesh Kasliwal 2, Shraey Bhatia 3 1 Student, School of Computing Science and Engineering, Vellore Institute of Technology, India,

More information

III. VERVIEW OF THE METHODS

III. VERVIEW OF THE METHODS An Analytical Study of SIFT and SURF in Image Registration Vivek Kumar Gupta, Kanchan Cecil Department of Electronics & Telecommunication, Jabalpur engineering college, Jabalpur, India comparing the distance

More information

Detecting motion by means of 2D and 3D information

Detecting motion by means of 2D and 3D information Detecting motion by means of 2D and 3D information Federico Tombari Stefano Mattoccia Luigi Di Stefano Fabio Tonelli Department of Electronics Computer Science and Systems (DEIS) Viale Risorgimento 2,

More information

Cs : Computer Vision Final Project Report

Cs : Computer Vision Final Project Report Cs 600.461: Computer Vision Final Project Report Giancarlo Troni gtroni@jhu.edu Raphael Sznitman sznitman@jhu.edu Abstract Given a Youtube video of a busy street intersection, our task is to detect, track,

More information

COMPARATIVE STUDY OF DIFFERENT APPROACHES FOR EFFICIENT RECTIFICATION UNDER GENERAL MOTION

COMPARATIVE STUDY OF DIFFERENT APPROACHES FOR EFFICIENT RECTIFICATION UNDER GENERAL MOTION COMPARATIVE STUDY OF DIFFERENT APPROACHES FOR EFFICIENT RECTIFICATION UNDER GENERAL MOTION Mr.V.SRINIVASA RAO 1 Prof.A.SATYA KALYAN 2 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING PRASAD V POTLURI SIDDHARTHA

More information

Determinant of homography-matrix-based multiple-object recognition

Determinant of homography-matrix-based multiple-object recognition Determinant of homography-matrix-based multiple-object recognition 1 Nagachetan Bangalore, Madhu Kiran, Anil Suryaprakash Visio Ingenii Limited F2-F3 Maxet House Liverpool Road Luton, LU1 1RS United Kingdom

More information

Midterm Examination CS 540-2: Introduction to Artificial Intelligence

Midterm Examination CS 540-2: Introduction to Artificial Intelligence Midterm Examination CS 54-2: Introduction to Artificial Intelligence March 9, 217 LAST NAME: FIRST NAME: Problem Score Max Score 1 15 2 17 3 12 4 6 5 12 6 14 7 15 8 9 Total 1 1 of 1 Question 1. [15] State

More information

CAP 5415 Computer Vision Fall 2012

CAP 5415 Computer Vision Fall 2012 CAP 5415 Computer Vision Fall 01 Dr. Mubarak Shah Univ. of Central Florida Office 47-F HEC Lecture-5 SIFT: David Lowe, UBC SIFT - Key Point Extraction Stands for scale invariant feature transform Patented

More information

Linear combinations of simple classifiers for the PASCAL challenge

Linear combinations of simple classifiers for the PASCAL challenge Linear combinations of simple classifiers for the PASCAL challenge Nik A. Melchior and David Lee 16 721 Advanced Perception The Robotics Institute Carnegie Mellon University Email: melchior@cmu.edu, dlee1@andrew.cmu.edu

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Towards a visual perception system for LNG pipe inspection

Towards a visual perception system for LNG pipe inspection Towards a visual perception system for LNG pipe inspection LPV Project Team: Brett Browning (PI), Peter Rander (co PI), Peter Hansen Hatem Alismail, Mohamed Mustafa, Joey Gannon Qri8 Lab A Brief Overview

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Robot localization method based on visual features and their geometric relationship

Robot localization method based on visual features and their geometric relationship , pp.46-50 http://dx.doi.org/10.14257/astl.2015.85.11 Robot localization method based on visual features and their geometric relationship Sangyun Lee 1, Changkyung Eem 2, and Hyunki Hong 3 1 Department

More information

Using Geometric Blur for Point Correspondence

Using Geometric Blur for Point Correspondence 1 Using Geometric Blur for Point Correspondence Nisarg Vyas Electrical and Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA Abstract In computer vision applications, point correspondence

More information

Harder case. Image matching. Even harder case. Harder still? by Diva Sian. by swashford

Harder case. Image matching. Even harder case. Harder still? by Diva Sian. by swashford Image matching Harder case by Diva Sian by Diva Sian by scgbt by swashford Even harder case Harder still? How the Afghan Girl was Identified by Her Iris Patterns Read the story NASA Mars Rover images Answer

More information

Part-Based Skew Estimation for Mathematical Expressions

Part-Based Skew Estimation for Mathematical Expressions Soma Shiraishi, Yaokai Feng, and Seiichi Uchida shiraishi@human.ait.kyushu-u.ac.jp {fengyk,uchida}@ait.kyushu-u.ac.jp Abstract We propose a novel method for the skew estimation on text images containing

More information

Previously. Part-based and local feature models for generic object recognition. Bag-of-words model 4/20/2011

Previously. Part-based and local feature models for generic object recognition. Bag-of-words model 4/20/2011 Previously Part-based and local feature models for generic object recognition Wed, April 20 UT-Austin Discriminative classifiers Boosting Nearest neighbors Support vector machines Useful for object recognition

More information

Hierarchical Multi-Affine (HMA) algorithm for fast and accurate feature matching in minimally-invasive surgical images

Hierarchical Multi-Affine (HMA) algorithm for fast and accurate feature matching in minimally-invasive surgical images Hierarchical Multi-Affine (HMA) algorithm for fast and accurate feature matching in minimally-invasive surgical images Gustavo A. Puerto-Souza, and Gian Luca Mariottini Abstract The ability to find similar

More information

Features Points. Andrea Torsello DAIS Università Ca Foscari via Torino 155, Mestre (VE)

Features Points. Andrea Torsello DAIS Università Ca Foscari via Torino 155, Mestre (VE) Features Points Andrea Torsello DAIS Università Ca Foscari via Torino 155, 30172 Mestre (VE) Finding Corners Edge detectors perform poorly at corners. Corners provide repeatable points for matching, so

More information

URBAN STRUCTURE ESTIMATION USING PARALLEL AND ORTHOGONAL LINES

URBAN STRUCTURE ESTIMATION USING PARALLEL AND ORTHOGONAL LINES URBAN STRUCTURE ESTIMATION USING PARALLEL AND ORTHOGONAL LINES An Undergraduate Research Scholars Thesis by RUI LIU Submitted to Honors and Undergraduate Research Texas A&M University in partial fulfillment

More information

10/03/11. Model Fitting. Computer Vision CS 143, Brown. James Hays. Slides from Silvio Savarese, Svetlana Lazebnik, and Derek Hoiem

10/03/11. Model Fitting. Computer Vision CS 143, Brown. James Hays. Slides from Silvio Savarese, Svetlana Lazebnik, and Derek Hoiem 10/03/11 Model Fitting Computer Vision CS 143, Brown James Hays Slides from Silvio Savarese, Svetlana Lazebnik, and Derek Hoiem Fitting: find the parameters of a model that best fit the data Alignment:

More information

Local Features Tutorial: Nov. 8, 04

Local Features Tutorial: Nov. 8, 04 Local Features Tutorial: Nov. 8, 04 Local Features Tutorial References: Matlab SIFT tutorial (from course webpage) Lowe, David G. Distinctive Image Features from Scale Invariant Features, International

More information

Stereo Vision. MAN-522 Computer Vision

Stereo Vision. MAN-522 Computer Vision Stereo Vision MAN-522 Computer Vision What is the goal of stereo vision? The recovery of the 3D structure of a scene using two or more images of the 3D scene, each acquired from a different viewpoint in

More information

Multiple-Choice Questionnaire Group C

Multiple-Choice Questionnaire Group C Family name: Vision and Machine-Learning Given name: 1/28/2011 Multiple-Choice naire Group C No documents authorized. There can be several right answers to a question. Marking-scheme: 2 points if all right

More information

Recognizing Apples by Piecing Together the Segmentation Puzzle

Recognizing Apples by Piecing Together the Segmentation Puzzle Recognizing Apples by Piecing Together the Segmentation Puzzle Kyle Wilshusen 1 and Stephen Nuske 2 Abstract This paper presents a system that can provide yield estimates in apple orchards. This is done

More information

Bridging the Gap Between Local and Global Approaches for 3D Object Recognition. Isma Hadji G. N. DeSouza

Bridging the Gap Between Local and Global Approaches for 3D Object Recognition. Isma Hadji G. N. DeSouza Bridging the Gap Between Local and Global Approaches for 3D Object Recognition Isma Hadji G. N. DeSouza Outline Introduction Motivation Proposed Methods: 1. LEFT keypoint Detector 2. LGS Feature Descriptor

More information

Automatic Image Alignment (feature-based)

Automatic Image Alignment (feature-based) Automatic Image Alignment (feature-based) Mike Nese with a lot of slides stolen from Steve Seitz and Rick Szeliski 15-463: Computational Photography Alexei Efros, CMU, Fall 2006 Today s lecture Feature

More information

Discovering Visual Hierarchy through Unsupervised Learning Haider Razvi

Discovering Visual Hierarchy through Unsupervised Learning Haider Razvi Discovering Visual Hierarchy through Unsupervised Learning Haider Razvi hrazvi@stanford.edu 1 Introduction: We present a method for discovering visual hierarchy in a set of images. Automatically grouping

More information

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai Decision Trees Decision Tree Decision Trees (DTs) are a nonparametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target

More information

Endoscopic Reconstruction with Robust Feature Matching

Endoscopic Reconstruction with Robust Feature Matching Endoscopic Reconstruction with Robust Feature Matching Students: Xiang Xiang Mentors: Dr. Daniel Mirota, Dr. Gregory Hager and Dr. Russell Taylor Abstract Feature matching based 3D reconstruction is a

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

LOCAL AND GLOBAL DESCRIPTORS FOR PLACE RECOGNITION IN ROBOTICS

LOCAL AND GLOBAL DESCRIPTORS FOR PLACE RECOGNITION IN ROBOTICS 8th International DAAAM Baltic Conference "INDUSTRIAL ENGINEERING - 19-21 April 2012, Tallinn, Estonia LOCAL AND GLOBAL DESCRIPTORS FOR PLACE RECOGNITION IN ROBOTICS Shvarts, D. & Tamre, M. Abstract: The

More information

Fully Automatic Endoscope Calibration for Intraoperative Use

Fully Automatic Endoscope Calibration for Intraoperative Use Fully Automatic Endoscope Calibration for Intraoperative Use Christian Wengert, Mireille Reeff, Philippe C. Cattin, Gábor Székely Computer Vision Laboratory, ETH Zurich, 8092 Zurich, Switzerland {wengert,

More information

A NEW FEATURE BASED IMAGE REGISTRATION ALGORITHM INTRODUCTION

A NEW FEATURE BASED IMAGE REGISTRATION ALGORITHM INTRODUCTION A NEW FEATURE BASED IMAGE REGISTRATION ALGORITHM Karthik Krish Stuart Heinrich Wesley E. Snyder Halil Cakir Siamak Khorram North Carolina State University Raleigh, 27695 kkrish@ncsu.edu sbheinri@ncsu.edu

More information

Using temporal seeding to constrain the disparity search range in stereo matching

Using temporal seeding to constrain the disparity search range in stereo matching Using temporal seeding to constrain the disparity search range in stereo matching Thulani Ndhlovu Mobile Intelligent Autonomous Systems CSIR South Africa Email: tndhlovu@csir.co.za Fred Nicolls Department

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

Vision-based Mobile Robot Localization and Mapping using Scale-Invariant Features

Vision-based Mobile Robot Localization and Mapping using Scale-Invariant Features Vision-based Mobile Robot Localization and Mapping using Scale-Invariant Features Stephen Se, David Lowe, Jim Little Department of Computer Science University of British Columbia Presented by Adam Bickett

More information

University of Florida CISE department Gator Engineering. Clustering Part 4

University of Florida CISE department Gator Engineering. Clustering Part 4 Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Visuelle Perzeption für Mensch- Maschine Schnittstellen

Visuelle Perzeption für Mensch- Maschine Schnittstellen Visuelle Perzeption für Mensch- Maschine Schnittstellen Vorlesung, WS 2009 Prof. Dr. Rainer Stiefelhagen Dr. Edgar Seemann Institut für Anthropomatik Universität Karlsruhe (TH) http://cvhci.ira.uka.de

More information

Feature Detectors and Descriptors: Corners, Lines, etc.

Feature Detectors and Descriptors: Corners, Lines, etc. Feature Detectors and Descriptors: Corners, Lines, etc. Edges vs. Corners Edges = maxima in intensity gradient Edges vs. Corners Corners = lots of variation in direction of gradient in a small neighborhood

More information