IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 61, NO. 10, OCTOBER 2014

Toward Long-Term and Accurate Augmented-Reality for Monocular Endoscopic Videos

Gustavo A. Puerto-Souza, Student Member, IEEE, Jeffrey A. Cadeddu, and Gian-Luca Mariottini, Member, IEEE

Abstract: By overlaying preoperative radiological 3-D models onto the intraoperative laparoscopic video, augmented-reality (AR) displays promise to increase surgeons' visual awareness of high-risk surgical targets (e.g., the location of a tumor). Existing AR surgical systems lack robustness and accuracy because of the many challenges in endoscopic imagery, such as frequent changes in illumination, rapid camera motions, prolonged organ occlusions, and tissue deformations. The frequent occurrence of these events can cause the loss of image (anchor) points and, thus, the loss of the AR display after a few frames. In this paper, we present the design of a new AR system that represents a first step toward a long-term and accurate augmented surgical display for monocular (calibrated and uncalibrated) endoscopic videos. Our system uses correspondence-search methods, and a new weighted sliding-window registration approach, to automatically and accurately recover the overlay by predicting the image locations of a high number of anchor points that were lost after a sudden image change. The effectiveness of the proposed system in maintaining a long-term (over 2 min) and accurate (less than 1 mm) augmentation has been documented over a set of real partial-nephrectomy laparoscopic videos.

Index Terms: Augmented reality (AR), endoscopic vision, feature tracking.

I. INTRODUCTION

AUGMENTED reality (AR) displays promise to improve the outcome of minimally-invasive surgical interventions because of the possibility to enhance the surgeon's awareness of high-risk anatomical targets [1], [2]. As illustrated in Fig. 1, the accurate overlay of a patient's preoperative radiological 3-D organ model (e.g., from CT scans) onto the live surgical video can reveal the exact location, orientation, and depth of a tumor, or of other important anatomical structures. The detection and tracking of anchor points (a set of associations between the organ's 3-D model points and video features) is of utmost importance to ensure a prolonged and accurate augmented display even after strong illumination changes, camera occlusions, and tissue deformations.

Manuscript received October 4, 2013; revised February 16, 2014; accepted May 2, 2014. Date of publication May 14, 2014; date of current version September 16, 2014. Asterisk indicates corresponding author.
G. A. Puerto-Souza is with the Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX, USA (e-mail: gustavo.puerto@mavs.uta.edu).
G.-L. Mariottini is with the Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX, USA (e-mail: gianluca.mariottini@uta.edu).
J. A. Cadeddu is with the Department of Urology, University of Texas Southwestern Medical Center, Dallas, TX, USA (e-mail: jeffrey.cadeddu@utsouthwestern.edu).
This paper contains multimedia material available online at http://ieeexplore.ieee.org (File size: 239 MB).

Fig. 1. Example of an AR display: A set of anchor points (i.e., matches between the 3-D CT model and the endoscopic image) is used to maintain the AR display.
While artificial fiducial markers inserted into the surgical scene [3], [4] have been adopted in the past, recent AR systems [5]-[7] have used natural features within the surgical scene. Even if the latter approaches are less invasive for the patient, they are very sensitive to frequent large camera motions, illumination changes, and occlusions, since they exclusively rely on image-feature trackers [8]-[12]. As a result, the anchor points (and thus the AR display) are easily lost after a few frames, and a time-consuming manual reinitialization of the augmentation by an expert user is required.

In an effort to promote long-term and accurate AR surgical displays, we present the design and prototype development of a new AR system that can automatically and accurately recover the augmentation over long monocular endoscopic surgical videos and after unexpected events (such as rapid camera motions, occlusions, or organ deformations). Accuracy is indeed important in many surgical interventions: e.g., adequate resection margins in partial nephrectomy are about 5-7 mm. The key ingredients of our system consist of the use of feature matching [13], [14] to automatically recover the precise position of the anchor points, as well as of a sliding-window (SW) weighted least-squares criterion to ensure an accurate and stable AR display. As such, the proposed system provides accurate augmentation over long time periods using weighted point correspondences, temporal smoothing, removal of the augmentation when tracking fails, and subsequent recovery of the alignment when possible. Our system also works in the case of unknown camera-calibration parameters, which is of interest for retrospectively augmenting uncalibrated videos for skill assessment and evaluation. Our system was tested on real monocular (calibrated and uncalibrated) surgical videos from two partial-nephrectomy laparoscopic interventions, and an overall accuracy of less than

1 mm was achieved. Our sequences are representative of challenging scenarios, such as camera retraction/reinsertion, prolonged partial and total occlusion of the organ by surgical instruments passing in front of the endoscope, as well as fast camera motions and strong organ deformations. To the best of our knowledge, this is the first CT-to-video AR surgical system that achieves both long-term and highly-accurate augmentation without the need for fiducial markers. Our design and the results presented here represent a first important step toward the wide adoption and acceptance of augmented-reality systems in minimally-invasive surgery.

A. Related Work and Original Contribution

Existing AR systems can be broadly divided into fiducial-based and feature-based, depending on whether they use artificial markers (fiducials) or natural features (e.g., tissue textures), respectively. Fiducial-based systems [15] work by registering the virtual object with respect to a visible fiducial with known geometry. In [4], color-coded metallic fiducial shafts are used to co-register in real time the transrectal-ultrasonography 3-D data with the live laparoscopic video. Similarly, the system in [16] uses fiducial markers to align monocular video with 3-D CT data. Despite their popularity in man-made environments [17], [18], fiducial-based surgical systems are invasive for the patient, since the fiducials need to be placed in the patient's body preoperatively (e.g., at the time of the radiological exams), and they have low accuracy because of the paucity and size of the fiducials that can be inserted.

Feature-based systems do not require artificial fiducials, but make use of natural structures in the scene, such as corners and textures. The authors of [19] proposed an AR system with a semiautomatic initialization based on the contours of the 3-D model. This system incorporates an illumination-invariant feature tracker robust to partial occlusions. A major drawback of this method is its high computational complexity, which makes it inadequate for surgical applications. A major improvement was presented in parallel tracking and mapping (PTAM) [20], where tracking and mapping are treated in a parallel way, thus improving real-time performance, as well as accuracy and robustness to fast camera motions. However, PTAM is sensitive to the initialization phase, and it has recently been documented to be not robust when applied to endoscopic scenarios [21]. The work in [22] detects new anchor points in the scene based on both optical flow and affine constraints. However, this approach requires known camera-calibration parameters and has been designed to work in man-made environments that have less clutter than those encountered during surgical interventions.

Feature-based surgical AR systems are mostly designed for stereo endoscopes. The authors of [3], [5], [6], [23] propose markerless real-time systems that use at their core the iterative closest point (ICP) algorithm [24] to co-register the 3-D reconstruction of the scene (from stereo endoscopes) with the preoperative 3-D CT model. These approaches require accurate camera calibration and AR initialization in order for ICP to converge to a correct solution. Furthermore, they are very sensitive to noise and to occlusions, since they do not include any anchor-point recovery strategy.
Finally, they have been tested only over very short video sequences. Addressing the problem of long-term tracking of image features in endoscopic videos is an open challenge [25], [26]; it has recently been studied in [12], where two new feature detectors/descriptors were introduced and tracked over long sequences. However, that work adopts spatial and temporal filters for tracking, and will therefore not work in the case of strong organ deformations and prolonged image occlusions or blurs. More recently, some efforts have been made to achieve wide-baseline registration, as in [21], where the authors present a two-phase approach to register the uterus in monocular laparoscopy that effectively decouples 3-D mapping from registration. However, the proposed method is tailored to registering a 3-D model obtained from an initial video sequence, not from preoperative radiological scans (e.g., CT).

The novelty of our study consists of the introduction of a feature-based AR system for long-term augmentations. The major contributions of our system are the automatic detection of anchor points and the accurate estimation of the camera projection model. Our system can recover the overlay of radiological (CT) data after unexpected camera events, such as a total and prolonged occlusion, illumination changes, and organ deformations. Our system makes no assumptions about the endoscope's position and orientation, and it also works in the case of unknown endoscope-calibration parameters. This work is an extension of our conference submissions [27] and [28], which we have improved in several directions: first, the algorithm's pipeline has been refined to achieve better accuracy in augmenting longer video sequences. Second, a modified version of the weights for the registration phase is presented here that also accounts for the reprojection error from the previous frames. Finally, the proposed system has been tested over a dataset lasting several minutes and including many challenging scenarios (camera retraction/reinsertion, bleeding, prolonged total occlusions, smoke, and organ deformations). Note that such an extensive validation has never been performed before, and that existing state-of-the-art methods are usually validated only on sequences lasting a few seconds.

II. METHODS

The AR system described in the following has been designed to ensure both accuracy and long-term overlay. These two features are particularly important because of the occurrence of challenging (but very common) events at the operating site, such as fast camera motion, smoke, blood, changes in illumination, organ deformation, or total occlusions (e.g., due to surgical instruments moving in front of the camera).

A. Monocular AR Architecture

This system aims to provide accurate augmentation over long time periods by using weighted point correspondences, temporal smoothing, and removal of the augmentation when tracking fails, with subsequent recovery of the alignment when possible. This

subsection provides an overview of the proposed AR pipeline, while each phase is described in detail in the following subsections. The proposed AR architecture is illustrated in the block diagram of Fig. 2.

Fig. 2. Block diagram of the AR pipeline: Several stages are used to accurately estimate the projection matrix $P_t$ while being robust to occlusions and fast camera motions.

We assume a given initial alignment between the preoperative 3-D model and the initial monocular video frame, $I_{t_0}$, at time $t_0$. This alignment corresponds to finding a set of $n_0$ 3D-to-2D corresponding point pairs at time $t_0$, usually referred to as anchor-point pairs. We indicate this set as $\alpha_0 \triangleq \{(u_0^i, X^i)\}_{i=1}^{n_0}$. The set of anchor points could be obtained either by carefully placing fiducial markers on the scene before the surgery, or by means of a manual alignment with the assistance of an expert user. Since our datasets are retrospective, and in order not to alter or interfere with the operating site, we designed a graphical user interface (GUI) to help an experienced urologist in the manual-alignment task (cf. Section II-B). Note that, while the choice of a GUI for the initial alignment is preferred in the case of monocular endoscopic videos (as in this work), 3-D-to-3-D initial-registration techniques (e.g., ICP) can be seamlessly adopted in the case of stereo endoscopic videos. The output of this stage consists of a set $\alpha_0$ of initial anchor-point associations.

Once the initial anchor-point pairs are obtained, a projection matrix $P_t$ is estimated to project the CT model on top of the current frame $I_t$. However, the estimation of $P_t$ is made challenging by several factors, such as noisy and erroneous anchor points, camera occlusions, and organ deformations. Our system is designed to provide robustness to these potential problems by iterating over the following four stages.

At time $t$, the set of anchor-point pairs, $\alpha_t \triangleq \{(u_t^i, X^i)\}_{i=1}^{n_t}$, is passed to a projection-estimation stage that estimates a projection matrix $P_t$ [29], in the cases of both known and unknown camera-calibration parameters. One of the major contributions of this step is the use of a weighting scheme to reduce the impact that image features clustered over a specific portion of the organ might have in biasing the augmentation over that region. The second contribution consists of incorporating a temporal-smoothing approach to reduce the sensitivity to noise on the anchor points, as well as to reduce the model's jitter in the augmented video. Robustness to incorrect associations is ensured by the inclusion of RANSAC in the estimation of $P_t$. Moreover, our scheme is designed for either calibrated or uncalibrated scenarios. The estimated matrix $P_t$ is then used in the augmentation stage to project the entire 3-D CT organ model onto the current endoscopic video. Next, a feature-tracking stage is used to update the location of the image points, $u_t \rightarrow u_{t+1}$, by tracking them into the next image frame, $I_{t+1}$. As a result, each 3-D CT point $X^i$ is now associated to a feature $u_{t+1}^i$. These three stages are iterated until unexpected events happen. In particular, these events are detected by means of a recovery-condition criterion. The second major contribution of this work is indeed the ability to automatically recover after system failures caused by camera occlusions, endoscope retractions, and organ deformations.
In these cases, an anchor-point recovery stage is adopted at time $t$ to recover the positions of the features $u_t$ corresponding to a set of previously-detected anchor points (e.g., $u_0$). If the recovery is successful, the new set of anchor points is used to compute $P_t$. Otherwise, the system does not render the virtual model on the current frame and tries to recover the anchor-point pairs at the next frame. These stages are iterated for each new frame of the video.

B. Initial Alignment

Similarly to other works [3]-[7], we assume a given set of initial anchor points $\alpha_0$, i.e., a known initial alignment between the first endoscopic-video frame and the 3-D CT model. Our approach does not require the use of special patterns or fiducials

on the scene, nor any additional knowledge about the camera location. We developed a GUI that allows the user to rotate and translate the organ's model to best match the profiles of the observed scene, as shown in Figs. 3(a) and (b). Once this alignment is given, a set of Shi-Tomasi corners [30], $z_0$, is extracted from the endoscopic image with a minimum inter-corner distance of 10 pixels [cf. Fig. 3(c)]. Then, each corner $z_0^i$ is associated to the closest projected model point, $P_0(X^i)$, and only those associations within a distance of $\tau = 3.5$ pixels are kept. This process results in a set of initial anchor-point associations, $\alpha_0 = \{(u_0^i, X^i)\}_{i=1}^{n_0}$, where $u_0^i = z_0^i$.

Fig. 3. Initial alignment: Our GUI for the manual alignment between the 3-D model and the endoscopic video. (a) GUI before manual alignment. (b) After manual alignment: the control panel allows the user to manipulate the position of the model (green/red dots) until it fully matches the organ profile. (c) Corner extraction: the Shi-Tomasi corners $z_0$ extracted on the initial frame. (d) Anchor-point association: the resulting initial anchor-point associations.

We finally note that, as for every 3-D-to-2-D registration, the quality of the initial alignment will certainly dominate the performance along the entire video. In the experimental results presented in Section III, we tried to have a fair comparison by initializing all of the evaluated methods with the same projection matrix, thus effectively focusing the comparison on their relative algorithmic differences.

C. Projection-Estimation Stage

The proposed approach estimates the $3 \times 4$ projection matrix $P_t$ from a set of anchor points, $\alpha_t$, and seamlessly works even if the camera-calibration parameters are unknown. Furthermore, our method is robust to outliers, and is accurate despite image noise or clustered features. In the uncalibrated case, we adopt an improved version of the direct linear transformation (DLT) approach [29]. In DLT, $P_t$ is obtained from the homogeneous system of linear equations $A_t^{DLT} p_t = 0$, where $A_t^{DLT}$ is a $2n_t \times 12$ matrix created from the anchor-point associations $\alpha_t$, and $p_t$ is a $12 \times 1$ vector constructed by stacking the columns of the projection matrix $P_t$. The solution of this homogeneous system of linear equations is found as the eigenvector corresponding to the smallest eigenvalue of matrix $A_t^{DLT}$ [29].
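As a concrete illustration of this DLT step, the following minimal Python/NumPy sketch (an illustration under synthetic data, not the authors' MATLAB implementation) builds the $2n_t \times 12$ constraint matrix from 3D-to-2D correspondences and recovers $P$ up to scale from the smallest right singular vector; here the 12-vector stacks the rows of $P$, an equivalent convention to the column stacking used in the text.

```python
import numpy as np

def dlt_projection(X, u):
    """Estimate a 3x4 projection matrix P (up to scale) from n >= 6
    3-D points X (n x 3) and their pixel projections u (n x 2) via DLT."""
    n = X.shape[0]
    Xh = np.hstack([X, np.ones((n, 1))])          # homogeneous 3-D points
    A = np.zeros((2 * n, 12))
    for i in range(n):
        x, y = u[i]
        A[2 * i, 4:8] = -Xh[i]                    # y * (row 3) - (row 2) = 0
        A[2 * i, 8:12] = y * Xh[i]
        A[2 * i + 1, 0:4] = Xh[i]                 # (row 1) - x * (row 3) = 0
        A[2 * i + 1, 8:12] = -x * Xh[i]
    # The null vector of A is the right singular vector of the smallest
    # singular value (equivalently, the smallest eigenvector of A^T A).
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)

# Synthetic check: project random points with a known camera, re-estimate P.
rng = np.random.default_rng(0)
P_true = np.array([[800., 0., 320., 0.],
                   [0., 800., 240., 0.],
                   [0.,   0.,   1., 2.]])         # toy camera, 2 units away
X = rng.uniform(-1, 1, size=(20, 3))
uh = (P_true @ np.hstack([X, np.ones((20, 1))]).T).T
u = uh[:, :2] / uh[:, 2:]
P_est = dlt_projection(X, u)
P_est *= np.sign(P_est[2, 3]) / np.linalg.norm(P_est)
print(np.allclose(P_est, P_true / np.linalg.norm(P_true), atol=1e-8))
```

In practice, point coordinates are usually preconditioned (Hartley normalization [29]) before solving; this is omitted here for brevity.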
In the calibrated case, images are first undistorted, and the perspective-three-point (P3P) approach [31] is used to compute $\{{}^{C}X_t^i\}_{i=1}^{n_t}$, i.e., the 3-D reconstruction of the observed features in the current camera frame. Then, the matrix

$A_t^{P3P} = \sum_{i=1}^{n_t} \left({}^{C}X_t^i - \overline{{}^{C}X_t}\right)\left(X^i - \overline{X}\right)^T$

is constructed, where $\overline{(\cdot)}$ denotes the mean operator. Matrix $A_t^{P3P}$ is then used to extract the camera pose as in [32], i.e., ${}^{C}_{W}R = UV^T$ and ${}^{C}_{W}t = \overline{{}^{C}X} - {}^{C}_{W}R\,\overline{X}$, where $U$ and $V$ are obtained from the singular value decomposition (SVD), $[U, \Sigma, V] = \mathrm{SVD}(A_t^{P3P})$. The projection matrix is finally computed as $P_t = K\,[{}^{C}_{W}R \mid {}^{C}_{W}t]$.

In both cases, a RANSAC phase [33] is implemented to simultaneously estimate $P_t$ while discarding outliers. In our experiments, we observed that RANSAC alone would not be able to cope with two other sources of error: 1) the presence of anchor points clustered in a single region of the organ, and 2) the frequent jitter among consecutive frames, due to noise in the measurements. Our novel contribution stems from addressing these two problems. In order to deal with case 1), we devised a weighted-RANSAC strategy that down-weighs both those anchor points clustered in small portions of the image [see, e.g., Figs. 4(a)-(c)], as well as those that exhibited large reprojection errors in the previous frames.

Fig. 4. Example of the bias in the projection-matrix estimation. (a) The (yellow) asterisks represent the tracked anchor points, whose positions were perturbed with additive white Gaussian noise with a standard deviation of 3 pixels. Note that some corners are clustered in small regions (green circle), while others are very isolated (white arrow). (b) Resulting augmentation using a DLT with no weighting scheme. Note that this augmentation is not accurate (see red arrow) and tends to discard many features (red crosses/yellow squares). (c) Resulting augmentation using a DLT incorporating the weighting scheme. Note that this augmentation is more accurate (see yellow arrow) and also preserves more features than the regular DLT (i.e., with no weights).

In particular, the weights $w^i$ have been chosen as

$w^i \triangleq 0.5\,e^{-f_d(u_t^i,\,u_t,\,\delta)} + 0.5\,e^{-f_e(u_{t-1}^i,\,P_{t-1},\,X^i)} \in [0, 1]$

where $f_d(u_t^i, u_t, \delta)$ represents the density of features around $u_t^i$ (i.e., the number of elements of $u_t$ within a circular region centered on $u_t^i$ and with radius $\delta$), and $f_e(u_{t-1}^i, P_{t-1}, X^i)$ is the pinhole reprojection error of the $i$th anchor point at time $t-1$, defined as $f_e(u_t^i, P_t, X^i) = \| u_t^i - P_t\,\tilde{X}^i \|$, where $\tilde{X}^i$ denotes the extension of $X^i$ to homogeneous coordinates. In the uncalibrated case, the weights $w^i$ are incorporated into the system of linear equations $W_t A_t^{DLT} p_t = 0$, where $W_t$ is a diagonal matrix with $\mathrm{diag}(W_t) = [w^1, w^1, w^2, w^2, \ldots, w^{n_t}, w^{n_t}]^T$ (each weight is repeated because every anchor point contributes two equations). In the calibrated case, the weights are incorporated directly in the computation of $A_t^{P3P}$, i.e., $A_t^{P3P} = \sum_{i=1}^{n_t} w^i \left({}^{C}X_t^i - \overline{{}^{C}X_t}\right)\left(X^i - \overline{X}\right)^T$.
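The following sketch illustrates the two ingredients just described under simplifying assumptions: the weight rule $w^i$, and the weighted SVD-based pose extraction for the calibrated case. The array shapes and the reflection guard on the rotation are our additions (not stated in the text), and the P3P back-projection producing the camera-frame points is assumed to have been run already.

```python
import numpy as np

def anchor_weights(u_t, u_prev, P_prev, X, delta=25.0):
    """Weights w_i = 0.5*exp(-f_d) + 0.5*exp(-f_e), combining the local
    feature density f_d and the previous frame's reprojection error f_e."""
    n = u_t.shape[0]
    # f_d: number of other tracked points within a circle of radius delta
    d = np.linalg.norm(u_t[:, None, :] - u_t[None, :, :], axis=2)
    f_d = (d < delta).sum(axis=1) - 1             # exclude the point itself
    # f_e: pinhole reprojection error at time t-1
    proj = (P_prev @ np.hstack([X, np.ones((n, 1))]).T).T
    f_e = np.linalg.norm(u_prev - proj[:, :2] / proj[:, 2:], axis=1)
    return 0.5 * np.exp(-f_d) + 0.5 * np.exp(-f_e)

def weighted_pose(Xc, Xw, w):
    """Rigid pose (R, t) aligning world points Xw (n x 3) to camera-frame
    points Xc (n x 3) via SVD of the weighted covariance A = U S V^T,
    with R = U V^T and t = mean(Xc) - R mean(Xw), as in the text [32]."""
    w = w / w.sum()
    mc, mw = w @ Xc, w @ Xw                       # weighted centroids
    A = (w[:, None] * (Xc - mc)).T @ (Xw - mw)    # weighted A_t^P3P (3 x 3)
    U, _, Vt = np.linalg.svd(A)
    if np.linalg.det(U @ Vt) < 0:                 # reflection guard (standard)
        U[:, -1] *= -1
    R = U @ Vt
    return R, mc - R @ mw
```

The projection matrix $P_t = K\,[R \mid t]$ then follows; inside the RANSAC loop, only the inlier pairs would contribute to these sums.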

The other source of inaccuracy was observed to cause jittering of the projected model across consecutive frames. We address this issue by formulating a sliding-window (SW) approach [34]. The overall W-RANSAC SW estimation stage uses the constraints obtained from the inliers of W-RANSAC in the previous $k$ iterations, $A_t, A_{t-1}, \ldots, A_{t-k}$, and associates each constraint with a forgetting-factor coefficient $\beta = [\beta_0, \beta_1, \ldots, \beta_k]^T$. This factor indicates the relative importance of each linear system. As a result, a new constraint is formulated as $\sum_{i=0}^{k} \beta_i A_{t-i}^T A_{t-i}$. In the uncalibrated case, $P_t$ is obtained as the eigenvector corresponding to the smallest eigenvalue of this new constraint. For the calibrated case, the SVD decomposition of this new constraint is used to extract ${}^{C}_{W}R$ and ${}^{C}_{W}t$, and $P_t = K\,[{}^{C}_{W}R \mid {}^{C}_{W}t]$ is readily computed. In Section III, we will present a comparison between the two (calibrated and uncalibrated) projection-estimation algorithms. In order to illustrate the benefits of the weighted-SW (W-SW) approach, a comparison against a simple RANSAC-based DLT will also be included.
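A minimal sketch of this accumulation, assuming each element of `A_list` is the inlier constraint matrix of one of the last frames (most recent first) and using the $\beta$ values reported in Section III:

```python
import numpy as np

def sliding_window_dlt(A_list, beta=(1.0, 0.75, 0.5, 0.25)):
    """Combine per-frame DLT constraints into M = sum_i beta_i A_i^T A_i
    and return the eigenvector of its smallest eigenvalue as the
    temporally-smoothed solution p_t (reshaped to 3 x 4)."""
    # zip truncates to the shorter sequence, so extra matrices are ignored
    M = sum(b * A.T @ A for b, A in zip(beta, A_list))
    eigvals, eigvecs = np.linalg.eigh(M)   # M symmetric PSD; ascending order
    return eigvecs[:, 0].reshape(3, 4)     # smallest-eigenvalue eigenvector
```

For the calibrated case, the same forgetting-factor accumulation would instead be applied to the $3 \times 3$ covariance matrices before the SVD-based pose extraction shown earlier.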
D. Augmentation Stage

In this stage, the projection matrix estimated in the previous stage is used to overlay the entire 3-D organ model onto the endoscopic view. Note that, when projecting the entire 3-D model onto the endoscopic image, we only preserve those anchor points whose reprojection error is below $\gamma = 10$ pixels.

E. Feature-Tracking Stage

The feature-tracking stage updates the anchor points every time a new frame is acquired. In order to track image features (corners) among consecutive endoscopic video frames, we adopted the feature-tracking algorithm in [35], which is robust to changes in illumination. However, no matter how strong the tracker is, the tracked points will still be lost after unexpected camera and organ motions, or camera occlusions by the surgical instruments. The use of just a feature tracker is, surprisingly, still the standard in many AR systems.

F. Tracking-Recovery Stage

As mentioned previously, the tracking algorithm can lose some tracked features due to unexpected or sudden image events. As a consequence, the augmented overlay may deteriorate in quality or even disappear. For the sake of clarity, but without loss of generality, we will focus our discussion on the challenging case of a total loss of tracked features, e.g., due to the motion of a surgical tool in front of the camera while the organ and the camera are also moving.

In order to accurately recover a high percentage of anchor points after a complete occlusion, we adopted a tracking-recovery stage (see Fig. 5). Our method compares an image before the occlusion with the current one after the occlusion. The current frame will be indicated as $I_t$, while the image before the occlusion is selected from a buffer containing the initial frame, $I_0$, and the last successfully recovered image (e.g., $I_m$, with $m < t$).

Fig. 5. Diagram of the tracking recovery: Our method uses the images before and after the occlusion, and the initial associations (anchor points $(u_0, X)$) between the image before the occlusion and the 3-D model. As an output, we recover the alignment between the image after the occlusion and the 3-D model.

First, local image features are extracted from both images (feature-extraction block) and are matched by means of an appearance-based criterion to find a set of candidate matches (initial-matching block; a distance-ratio threshold of 0.8 is a common choice [36]). We chose SIFT features instead of Shi-Tomasi corners because of the invariance of SIFT to rotation and scale, which Shi-Tomasi corners do not offer. These initial matches are used by our hierarchical multiaffine (HMA) feature-matching strategy [14] to compute an image transformation that predicts in $I_t$ the position of the anchor points from $I_m$. This procedure is repeated for each image in the buffer, and the solution that leads to the largest number of recovered features and the lowest reprojection error is selected.

However, due to uncertainties in the estimated transformation, these predicted points $u_t$ might not correspond exactly to the original tracked features. For this reason, a corner-association stage has been introduced to ensure that the recovered features are of the same kind as the original tracked ones, e.g., Shi-Tomasi corners. This is done by first extracting Shi-Tomasi corners, $\{z\}$, from the image after the occlusion (corner-extraction block), and then associating them to the closest recovered anchor points $\{u_t\}$. An anchor point is updated to its closest corner only if their distance is less than a threshold $\gamma$ (e.g., 3.5 pixels); in this way, only the most certain associations are preserved, as illustrated in the sketch below. As a result of the whole feature-recovery phase, we obtain a new set of anchor points $\{(u_t^i, X^i)\}$. If the recovery is successful, the anchor points $\{(u_t^i, X^i)\}$ are passed to the projection-estimation phase to re-compute $P_t$. Conversely, if the recovery fails (e.g., an instrument is still occluding the organ), the frame is skipped (i.e., displayed without augmentation) and a recovery attempt is made with the next frame.
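The corner-association step admits a compact NumPy illustration, assuming `u_pred` holds the HMA-predicted anchor locations and `corners` the Shi-Tomasi detections, both as n x 2 pixel arrays (names are ours):

```python
import numpy as np

def associate_corners(u_pred, corners, gamma=3.5):
    """Snap each predicted anchor location to its nearest extracted corner,
    preserving only associations closer than gamma pixels."""
    d = np.linalg.norm(u_pred[:, None, :] - corners[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    keep = d[np.arange(len(u_pred)), nearest] < gamma
    # Return the snapped positions and a mask selecting the surviving
    # anchors; their 3-D counterparts X^i are filtered with the same mask.
    return corners[nearest[keep]], keep
```

Note that this greedy nearest-corner rule may, in principle, snap two predictions to the same corner; a one-to-one assignment would be a straightforward refinement.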

III. EXPERIMENTAL EVALUATION

We evaluated the performance of our system on real surgical videos from partial-nephrectomy interventions. Fig. 6 shows some representative frames extracted from these sequences, which include many cases of rapid camera motions, organ deformations, and prolonged occlusions.

Fig. 6. Examples of evaluated frames from both video sequences. (a) First sequence (prolonged occlusion): 570 frames containing cases of fast organ motion and camera occlusion. (b) Second sequence (occlusion, retraction, and deformation): 3596 frames containing cases of camera retraction, fast camera motion, organ deformation, and prolonged occlusions.

In order to illustrate the strengths of our approach, we compared the performance of our proposed solution with a RANSAC DLT approach that estimates the projection matrix by means of DLT with RANSAC (i.e., without the weighting scheme and the SW approach) and also adopts the feature-recovery strategy of Fig. 5. The second approach (denoted as W-DLT SW) estimates the projection matrix with the weighted DLT on a SW, together with the feature-recovery strategy. The third approach (denoted as W-P3P SW) assumes known camera intrinsic-calibration parameters and rectified images, and incorporates the projection-matrix estimation with the weighted P3P, together with a SW, as well as the feature-recovery strategy.

For each sequence, a 3-D model of the diseased kidney was obtained by processing the preoperative CT scans of the respective patient. ITK-Snap [37] was used to segment the kidney and the tumor and to generate a 3-D model for each sequence. Each model was then processed (e.g., simplified and smoothed) by means of MeshLab [38] in order to obtain the final 3-D models used in our experiments.

The first testing sequence is about 19 s long, mostly containing strong changes in illumination and partial occlusions, and thus represents a good benchmark for a detailed performance analysis of our approach. The second testing sequence is longer (more than 2 min) and contains a wider range of challenging cases, such as total occlusions, strong organ deformations, camera retraction and reinsertion, as well as zoom-ins and zoom-outs. This sequence is useful to support the observations from the first sequence, and to demonstrate the effectiveness of the proposed method over a long sequence with a wider range of challenging events. Note that our long and comprehensive sequences differ from other monocular videos used in state-of-the-art papers, which are only a few seconds long and contain very limited or controlled events. To the best of our knowledge, this is the first work that presents such a comparison over challenging and realistic laparoscopic sequences. Because of the richer set of challenging scenarios contained in these videos, our data are more complete than in-lab phantom organ models.

We assessed the performance of our system based on the following parameters over the entire length of each sequence: precision of overlay, robustness to noise, robustness to occlusion, and precision of 3-D registration. The precision of overlay is measured as the average pixel reprojection error between the 2-D component of the anchor-point pairs and their corresponding reprojected 3-D points, i.e., $\frac{1}{n_t}\sum_{i=1}^{n_t} f_e(u_t^i, X^i, P_t)$. A high reprojection error indicates a disagreement between the estimated projection matrix and the measured tracked points, and thus a potentially wrong augmentation. The number of anchor points in each frame was used as an indicator of the robustness to noise since, as expected, the display is more sensitive to noise when it has only a few anchor points. The robustness to occlusion is measured by maintaining statistics over the number of skipped frames, as well as over the number of successfully recovered anchor points. The precision of 3-D registration is measured as the average distance between the 3-D points of the model and the corresponding 3-D points reconstructed from the endoscopic images, i.e., $\frac{1}{n_t}\sum_{i=1}^{n_t} \| \hat{X}_t^i - X^i \|$. The reconstructed 3-D points are obtained from the pinhole camera model as $\hat{X}_t^i = \lambda_t^i K^{-1} \tilde{u}_t^i$, where $K$ is the camera intrinsic-calibration matrix, and $\lambda_t^i$ can be obtained from the third coordinate of $P_t \tilde{X}^i$. In the uncalibrated case, an estimate $\hat{K}$, calculated from $P_t$ [29], is used instead of $K$. This error (in mm) provides a more intuitive interpretation of the accuracy of the estimated projection matrix.
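Both metrics can be computed as in the following sketch; our reading of the formulas above is that the 3-D comparison is carried out in the camera frame, where the model points are mapped through $K^{-1} P_t \tilde{X}^i$ (an assumption, since the text leaves the common frame implicit):

```python
import numpy as np

def overlay_and_registration_errors(u, X, P, K):
    """Average reprojection error (pixels) and average 3-D registration
    error for anchor pairs (u_i, X_i) under projection P and intrinsics K."""
    n = X.shape[0]
    PX = (P @ np.hstack([X, np.ones((n, 1))]).T).T   # K (R X + t) per anchor
    lam = PX[:, 2:]                                  # depth: 3rd coordinate
    reproj = np.linalg.norm(u - PX[:, :2] / lam, axis=1).mean()
    uh = np.hstack([u, np.ones((n, 1))])
    X_hat = lam * (np.linalg.inv(K) @ uh.T).T        # back-projected pixels
    X_cam = (np.linalg.inv(K) @ PX.T).T              # model in camera frame
    reg = np.linalg.norm(X_hat - X_cam, axis=1).mean()
    return reproj, reg
```

In the uncalibrated case, `K` would be replaced by the estimate $\hat{K}$ decomposed from $P_t$.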
The parameters used by our algorithms are $\delta = 25$, $\tau = \gamma = 3.5$, $k = 4$, $\beta = [1, 0.75, 0.5, 0.25]^T$, and an accuracy of 5 pixels for the RANSAC-based DLT; the parameters of HMA were set as in [14], with minimal changes to improve efficiency. Based on the aforementioned performance indices, the following tests were used to trigger the anchor-point recovery stage (see the sketch below):
1) The number of anchor-point pairs, $n_t$, is less than 25% of the number of initial anchor-point pairs, $n_0$.
2) The average reprojection error $f_e(u_t^i, X^i, P_t)$ is larger than 4 pixels.
3) The number of anchor points, $n_t$, is less than 80% of the number of previously successfully-recovered anchor-point pairs, and $f_e(u_t^i, X^i, P_t) > 3.5$ pixels.
The first two conditions represent limit situations in which we deem it imperative to recover the anchor points, due to the high risk of performing a wrong augmentation; the third condition detects potential cases in which the overlay is incorrect due to the accumulation of errors in the tracked anchor points.
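Expressed as a predicate (a direct transcription of the three tests, with `n_rec` denoting the number of anchor points at the last successful recovery):

```python
def recovery_needed(n_t, n_0, n_rec, avg_reproj_err):
    """True if any of the three recovery-condition tests above fires,
    triggering the anchor-point recovery stage for the current frame."""
    return (n_t < 0.25 * n_0                                    # test 1
            or avg_reproj_err > 4.0                             # test 2
            or (n_t < 0.80 * n_rec and avg_reproj_err > 3.5))   # test 3
```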

All these systems were implemented in MATLAB and processed offline on a computer with an i7 processor at 2.20 GHz and 8 GB of RAM. In our MATLAB implementation, the average time required to process a frame is 1.2 s (0.4 s for the projection-estimation stage, 0.65 s for the display, and 0.25 s for the tracking), plus 6.5 s for the recovery strategy when invoked. However, note that these systems are at the prototype stage, and a final C/C++ implementation will be comprehensively optimized and parallelized to achieve significantly faster execution times.

A. Experimental Results: First Sequence

This sequence consists of 570 frames (approx. 19 s). The sequence shows an exposed kidney containing a large tumor located on the top, as illustrated in the examples of Fig. 6(a). The beginning of the sequence (frames 1-220) mostly contains cases of strong changes in illumination and fast organ motion due to the patient's breathing. The kidney is then mostly occluded by an ultrasound probe during two later intervals of the sequence. Figs. 7(a) and (b) show the initial frame and the 3-D model reconstructed from the CT images. Note that, in the 3-D model, we colored in green the portions of the kidney corresponding to healthy tissue, while the tumor is colored in blue. Fig. 7(c) shows the initial alignment obtained from our GUI, containing 188 initial anchor points (yellow asterisks) for the two uncalibrated systems, and 161 initial anchor points for the calibrated case.

Fig. 7. First sequence, initial alignment: Example of the manual alignment between the initial frame and the 3-D model for the calibrated case. The resulting 161 anchor points (yellow asterisks) are spread over the visible surface of the kidney and tumor. (a) Initial frame. (b) 3-D model. (c) Initial alignment.

Fig. 8. AR results for the RANSAC DLT strategy: (First row) Selected frames from the augmented endoscopic sequence. (Second row) Number of tracked features during the sequence. (Third row) Average reprojection error of the tracked points during the sequence. (Fourth row) Average 3-D registration error of the tracked points during the sequence.

Fig. 9. AR results for the W-DLT SW strategy: (First row) Selected frames from the augmented endoscopic sequence. (Second row) Number of tracked features during the sequence. (Third row) Average reprojection error of the tracked points during the sequence. (Fourth row) Average 3-D registration error of the tracked points during the sequence.

Our qualitative results for the three methods, RANSAC DLT, W-DLT SW, and W-P3P SW, are summarized in Figs. 8-10, respectively. In these figures, the first row shows the resulting overlay for some of the most crucial frames of the video sequence. In particular, we chose a frame before the first occlusion, during the first occlusion, after the first occlusion, and finally, during the final occlusion (frames 165, 336, 400, and 566, respectively).

We use red arrows in these images to indicate the evident disparity between the projected 3-D model boundary and the organ boundary in the endoscopic video. The second row in Figs. 8-10 shows the number of tracked features during the sequence. Note that each peak in these plots represents a successful anchor-point recovery, which results in an increase in the number of anchor-point pairs. Similarly, the third row in Figs. 8-10 shows the plots of the average reprojection error of the tracked features. Note that the average reprojection error increases over time if no anchor-point recovery stage is used; each drop in the plots represents a successful recovery of anchor points, thus resulting in a reduction of the reprojection error. The fourth row in Figs. 8-10 shows the plots of the average 3-D registration error of the tracked features. Note that the average 3-D registration error also decreases when an anchor-point recovery phase is launched.

Fig. 10. AR results for our W-P3P SW system: (First row) Selected frames from the augmented endoscopic sequence. (Second row) Number of tracked features during the sequence. (Third row) Average reprojection error of the tracked points during the sequence. (Fourth row) Average 3-D registration error of the tracked points during the sequence.

Fig. 11. Second sequence, initial alignment: Example of the manual alignment between the initial frame and the 3-D model, for the calibrated case. The resulting 213 anchor points (yellow asterisks) are spread over the visible surface of the kidney and tumor. (a) Initial frame. (b) 3-D model. (c) Initial alignment.

Fig. 12. AR results for the RANSAC DLT strategy: (First row) Selected frames from the augmented endoscopic sequence. (Second row) Number of tracked features during the sequence. (Third row) Average reprojection error of the tracked points during the sequence. (Fourth row) Average 3-D registration error of the tracked points during the sequence.

B. Experimental Results: Second Sequence

This sequence consists of 3596 frames (approx. 2 min). The sequence shows a partially exposed kidney with a clearly visible large tumor located on the top-left of the kidney, as illustrated

in the examples of Fig. 6(b). This sequence is longer and more challenging than the first one, since it contains several cases of prolonged partial and total occlusions due to the surgical instruments passing in front of the camera, retraction and reinsertion of the endoscope, strong changes of illumination, as well as zooms and strong organ deformations due to the ultrasound probe pressing the organ. We subsampled this sequence in order to reduce the processing time, selecting every fifth frame when the scene was static and every second frame otherwise, resulting in 1536 frames. Figs. 11(a) and (b) show the initial frame and the 3-D model reconstructed from the CT images, where the green portion of the kidney corresponds to healthy tissue and the blue regions correspond to the tumor. Fig. 11(c) shows the initial alignment obtained from our GUI, containing 243 anchor points for the uncalibrated cases, and 213 anchor points (yellow asterisks) for the calibrated one.

Fig. 13. AR results for the W-DLT SW strategy: (First row) Selected frames from the augmented endoscopic sequence. (Second row) Number of tracked features during the sequence. (Third row) Average reprojection error of the tracked points during the sequence. (Fourth row) Average 3-D registration error of the tracked points during the sequence.

Fig. 14. AR results for our W-P3P SW system: (First row) Selected frames from the augmented endoscopic sequence. (Second row) Number of tracked features during the sequence. (Third row) Average reprojection error of the tracked points during the sequence. (Fourth row) Average 3-D registration error of the tracked points during the sequence.

Similarly to the first sequence, Figs. 12-14 summarize our qualitative results for all systems. The first row shows the resulting overlay for a frame after a first occlusion by the ultrasound probe, after a zoom-in, during an occlusion by a surgical instrument, and finally, during an organ deformation caused by the ultrasound probe pressing the organ (frames 925, 2100, 2668, and 3186, respectively). The second, third, and fourth rows in Figs. 12-14 show the number of tracked features, the average reprojection error of the tracked features, and the 3-D registration error during the sequence, respectively. Note that all systems successfully prevail over the multiple challenges of this sequence, eventually recovering even after total occlusions or strong organ deformations.

Table I shows the statistics collected throughout both sequences for all the evaluated systems.

TABLE I: Statistics of the performance for the three methods (RANSAC-DLT, W-DLT-SW, and W-P3P-SW) in both evaluated sequences, reporting, for each sequence, the mean and standard deviation of the number of anchor points, of the average reprojection error (pix), and of the average 3-D error (mm), together with the number of recoveries (Rec.) and drops.

The second-third and tenth-eleventh columns contain the mean and standard deviation of the number of anchor points; the fourth-fifth and twelfth-thirteenth columns show the mean and standard deviation of the average reprojection error; and the sixth-seventh and fourteenth-fifteenth columns show the mean and standard deviation of the average 3-D registration error. Finally, the eighth-ninth and sixteenth-seventeenth columns contain the number of successful anchor-point recoveries (Rec.) and unsuccessful ones (Drops) for each sequence. The resulting videos for all methods are available online.

IV. DISCUSSION AND CONCLUSIONS

In this paper, we have presented the design and validation of a new AR system, which represents a first step toward a long-term and accurate augmented surgical display. Our study represents an improvement over current state-of-the-art AR systems, since it integrates both a robust estimation of the projection matrix (cf. Section II-C), for both the calibrated and uncalibrated camera cases, and a tracking-recovery strategy (cf. Section II-F), which accurately retrieves those anchor points that were lost due to unexpected events (e.g., occlusions or deformations). A weighted SW projection-estimation strategy is adopted and is shown to improve accuracy. The effectiveness of our tracking-recovery strategy has been demonstrated on two long and challenging endoscopic sequences from two real partial-nephrectomy interventions. In both cases, the average registration error was always less than 1.5 mm.

The results presented in Figs. 8-14 and Table I demonstrate the advantages of our approach, for both the calibrated (W-P3P SW) and uncalibrated (W-DLT SW) cases, compared to the RANSAC DLT strategy on both video sequences. From the results of the first sequence, we noticed the benefits of RANSAC and the anchor-point recovery stage, since the RANSAC DLT strategy can maintain a good augmentation during the entire first sequence, even in cases of occlusion, as shown in Fig. 8(a). This happens because RANSAC removes potentially wrong anchor points, and the anchor-point recovery stage recovers those anchor points that were lost in previous frames. This is evident from the plots in Figs. 8(b) and (c), which show that the number of anchor points is always above the minimum threshold (50) and the average reprojection error is always below the maximum limit (4 pixels). However, because of the faster decay in the number of anchor points, DLT requires the frequent adoption (30 times) of the anchor-point recovery, mostly during the second occlusion, as noticeable from the many peaks in Fig. 8(b). These peaks represent the time instants when the recovery takes place and the augmentation is successfully reinitialized. Moreover, the overlay is sometimes inaccurate, as shown in frames 336, 400, and 566, where the contour of the augmented model does not match the current contour of the organ in the endoscopic image (red arrow/ellipse). This happens because the RANSAC-based DLT applied at each frame is not able to fully overcome the noise in the anchor points introduced by the tracker.
Even if RANSAC rejects many outliers at each iteration, it preserves only a few anchor points, thus making the estimation of the projection matrix more sensitive to noise. As a result, the augmentation is slightly unstable, which is evident in the augmented video.

The performance of our proposed W-DLT SW (uncalibrated case) is illustrated in the results of Fig. 9(a). Note that our algorithm maintains the overlay of the 3-D model very close to the real organ boundaries in the endoscopic video (yellow arrows). Furthermore, the decay in the number of tracked features during the sequence is smoother, and the overlay is more stable during the whole sequence, indicating that the SW approach and the feature weights are indeed helpful. The SW helps in providing a more stable projection-matrix estimation than DLT, even in the case of occlusions (frames 350 and 541). This improved stability of our method is demonstrated by the plots of the number of tracked features, the average reprojection error, and the 3-D registration error in Fig. 9(b)-(d), respectively. In particular, observe that the decay in the number of features is slower than in the RANSAC DLT approach. As a result, our method only requires a tracking recovery a few times (8, compared with the 30 of the DLT strategy), mostly when the reprojection error of the anchor points passes the threshold of 4 pixels, e.g., during the second occlusion. Also, note that the number of features never passes the minimum limit (50), the average reprojection error never goes above the maximum threshold (4 pixels), and the 3-D error is below 1 mm. Moreover, all plots are smoother than in the other two approaches. Differently from RANSAC DLT, the reprojection error for W-DLT SW tends to increase over time, indicating a stronger temporal relationship between frames.

The qualitative results for the calibrated case (W-P3P SW) are presented in Fig. 10(a). Note that this version of our algorithm achieves the most accurate overlay in the augmented endoscopic video (yellow arrows). Similarly to the uncalibrated

case, the decay in the number of tracked features during the sequence is slow, and the augmentation is more stable than in RANSAC DLT. However, the number of recoveries is large (32), the majority happening during the occlusions. This suggests that the P3P-based projection-matrix estimation is more sensitive to noise, due to occlusions, than its DLT-based counterpart. However, for a fair comparison, note that W-P3P SW had two subtle differences with respect to the other two approaches. First, the input images were rectified, thus requiring an initial alignment with a different number of anchor points. Second, the 3-D registration-error measure computed by W-P3P SW is the most precise, since it uses the ground-truth camera intrinsic-calibration parameters, K, instead of an estimated one. With these points in mind, the impressive performance of W-P3P SW is evident, given its high number of anchor points (always above the minimum of 50), as well as comparable reprojection and 3-D registration errors, which were maintained below 4 pixels and 1 mm, respectively [cf. Fig. 10(b)-(d)].

The results from the second sequence support the observations made from the first sequence. Fig. 12 shows that RANSAC DLT can maintain the augmentation during the majority of the second sequence because of the recovery strategy. However, the anchor points quickly deteriorate after a few frames, resulting in incorrect overlays, as shown in the examples in Fig. 12(a). In particular, note the high instability of the overlay after losing anchor points in key regions, indicated by (red) ellipses. This observation is also evident from the plots in Fig. 12(b), which show steep peaks in the number of features, which usually decay rapidly.

In the case of W-DLT SW, the results illustrated in Fig. 13(a) show a more precise overlay of the 3-D model onto the real organ boundary in the endoscopic video (yellow arrows). Note in Fig. 13(b) that the number of tracked features during the sequence is larger and decays more slowly than in RANSAC DLT. Note that the peaks in the reprojection and 3-D registration errors in Fig. 13(c)-(d) represent cases when the augmentation went wrong due to an unexpected event, e.g., occlusion by an instrument or a fast zoom-in. Also, observe that despite the increased number of recoveries (60) and the same number of dropped frames (326 of 1536, approx. 21.2%), the final augmented video is significantly more stable and has less jitter than with the RANSAC DLT strategy.

The plots in Fig. 14(a) show the high accuracy of W-P3P SW. Furthermore, the plot in Fig. 14(b) shows the high number of tracked features with a slow decay. The plots in Figs. 14(c) and (d) show larger reprojection and 3-D registration errors, always below 4 pixels and 1.5 mm, respectively. Observe from Table I that the number of recoveries (57) is comparable with the uncalibrated case, but the number of dropped frames is slightly larger (378 out of 1536, approx. 23.2%). However, it is important to consider that in several of these dropped frames the organ was drastically deformed (e.g., by an ultrasound probe), explaining the inability of P3P to estimate the camera rotation and translation components. The final augmented video is very accurate and, similarly to the uncalibrated case, is significantly more stable and has less jitter than with the RANSAC DLT strategy.
All of these observations, for both experiments, are summarized in the statistics presented in Table I.

The endoscopic-vision community is in need of an accurate comparison of state-of-the-art augmented-reality systems. As such, our future work will focus on creating a publicly available and annotated dataset of endoscopic videos and CT scans. Also, we will focus on evaluating the robustness of the system after changes in the organ's topology (e.g., after the excision of the tumor). Another topic of future investigation is the adoption of deformable-registration techniques jointly with biomechanical models of the organ, in order to more effectively combine preoperative and intraoperative images. Prior work was recently done in [39] to show the increased accuracy when compared to extended-Kalman-filter (EKF) SLAM registration, but further work is needed to expand this preliminary work to domains other than cardiac surgery.

REFERENCES

[1] O. Ukimura and I. S. Gill, "Image-fusion, augmented reality, and predictive surgical navigation," Urologic Clin. North Amer., vol. 36, no. 2, 2009.
[2] P. Lamata, W. Ali, A. Cano, J. Cornella, J. Declerck, O. J. Elle, A. Freudenthal, H. Furtado, D. Kalkofen, E. Naerum, et al., "Augmented reality for minimally invasive surgery: Overview and some recent advances," in Augmented Reality, 2010.
[3] D. Cohen, E. Mayer, D. Chen, A. Anstee, J. Vale, G.-Z. Yang, A. Darzi, and P. Edwards, "Augmented reality image guidance in minimally invasive prostatectomy," in Prostate Cancer Imaging: Computer-Aided Diagnosis, Prognosis, and Intervention, 2010.
[4] T. Simpfendorfer, M. Baumhauer, M. Muller, C. N. Gutt, H. P. Meinzer, J. J. Rassweiler, S. Guven, and D. Teber, "Augmented reality visualization during laparoscopic radical prostatectomy," J. Endourol., vol. 25, no. 12, 2011.
[5] L. M. Su, B. P. Vagvolgyi, R. Agarwal, C. E. Reiley, R. H. Taylor, and G. D. Hager, "Augmented reality during robot-assisted laparoscopic partial nephrectomy: Toward real-time 3D-CT to stereoscopic video registration," Urology, vol. 73, no. 4, 2009.
[6] B. Vagvolgyi, L.-M. Su, R. Taylor, and G. D. Hager, "Video to CT registration for image overlay on solid organs," in Proc. 4th Workshop Augment. Environ. Med. Imag. Comput.-Aided Surg., 2008.
[7] W. E. Higgins, J. P. Helferty, K. Lu, S. A. Merritt, L. Rai, and K.-C. Yu, "3D CT-video fusion for image-guided bronchoscopy," Comput. Med. Imag. Graph., vol. 32, no. 3, 2008.
[8] R. Richa, A. P. L. Bo, and P. Poignet, "Towards robust 3D visual tracking for motion compensation in beating heart surgery," Med. Image Anal., vol. 15, no. 3, 2011.
[9] D. Stoyanov, G. Mylonas, F. Deligianni, A. Darzi, and G.-Z. Yang, "Soft-tissue motion tracking and structure estimation for robotic assisted MIS procedures," in Proc. Med. Image Comput. Comput.-Assist. Intervent., 2005.
[10] P. Mountney and G.-Z. Yang, "Motion compensated SLAM for image guided surgery," in Proc. Med. Image Comput. Comput.-Assist. Intervent., 2010.
[11] S. Giannarou, M. Visentini-Scarzanella, and G.-Z. Yang, "Probabilistic tracking of affine-invariant anisotropic regions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, Jan. 2013.
[12] M. Yip, D. Lowe, S. Salcudean, R. Rohling, and C. Nguan, "Real-time methods for long-term tissue feature tracking in endoscopic scenes," in Proc. 3rd Int. Conf. Inf. Process. Comput.-Assist. Intervent., 2012.
[13] G. A. Puerto-Souza and G.-L.
Mariottini, "Hierarchical multi-affine (HMA) algorithm for fast and accurate feature matching in minimally-invasive surgical images," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2012.
[14] G. A. Puerto-Souza and G.-L. Mariottini, "A fast and accurate feature-matching algorithm for minimally-invasive endoscopic images," IEEE Trans. Med. Imag., vol. 32, no. 7, Jul. 2013.

[15] P. Pratt, A. Marco, C. Payne, A. Darzi, and G.-Z. Yang, "Intraoperative ultrasound guidance for transanal endoscopic microsurgery," in Proc. Med. Image Comput. Comput.-Assist. Intervent., vol. 7510, 2012.
[16] D. Teber, S. Guven, T. Simpfendörfer, M. Baumhauer, E. O. Güven, F. Yencilek, A. S. Gözen, and J. Rassweiler, "Augmented reality: A new tool to improve surgical accuracy during laparoscopic partial nephrectomy? Preliminary in vitro and in vivo results," Eur. Urol., vol. 56, no. 2, 2009.
[17] H. Kato, "ARToolKit: Library for vision-based augmented reality," Tech. Rep. PRMU, Inst. Electron. Inf. Commun. Eng., vol. 101, no. 652.
[18] R. Munoz-Salinas, "ArUco: A minimal library for vision-based augmented reality applications based on OpenCV," Jun. 6, 2014. [Online].
[19] G. Bleser, H. Wuest, and D. Stricker, "Online camera pose estimation in partially known and dynamic scenes," in Proc. IEEE/ACM Int. Symp. Mixed Augment. Reality, 2006.
[20] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in Proc. 6th IEEE/ACM Int. Symp. Mixed Augment. Reality, 2007.
[21] T. Collins, D. Pizarro, A. Bartoli, M. Canis, and N. Bourdel, "Real-time wide-baseline registration of the uterus in laparoscopic videos using multiple texture maps," in Proc. Augment. Reality Environ. Med. Imag. Comput.-Assist. Intervent., 2013.
[22] J. Park, S. You, and U. Neumann, "Natural feature tracking for extendible robust augmented realities," in Proc. Int. Workshop Augment. Reality, 1998.
[23] H. Meinzer, M. Fangerau, M. Schmidt, T. R. dos Santos, A. M. Franz, L. Maier-Hein, and J. M. Fitzpatrick, "Convergent iterative closest-point algorithm to accommodate anisotropic and inhomogeneous localization error," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 8, Aug. 2012.
[24] P. J. Besl and N. D. McKay, "A method for registration of 3-D shapes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, no. 2, 1992.
[25] D. Mirota, M. Ishii, and G. Hager, "Vision-based navigation in image-guided interventions," Annu. Rev. Biomed. Eng., vol. 13, 2011.
[26] D. Stoyanov, "Surgical vision," Ann. Biomed. Eng., vol. 40, no. 2, 2012.
[27] G. A. Puerto-Souza, A. Castano-Bardawil, and G.-L. Mariottini, "Real-time feature matching for the accurate recovery of augmented-reality display in laparoscopic videos," in Augmented Environments for Computer-Assisted Interventions (Lecture Notes in Computer Science, vol. 7815). Berlin: Springer-Verlag, 2012.
[28] G. A. Puerto-Souza and G.-L. Mariottini, "An augmented-reality system for laparoscopic surgery robust to complete occlusions and fast camera motions," in Proc. IEEE Int. Conf. Robot. Autom., 2013.
[29] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[30] J. Shi and C. Tomasi, "Good features to track," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 1994.
[31] V. Lepetit, F. Moreno-Noguer, and P. Fua, "EPnP: An accurate O(n) solution to the PnP problem," Int. J. Comput. Vis., vol. 81, no. 2, 2009.
[32] K. Arun, T. Huang, and S. Blostein, "Least-squares fitting of two 3-D point sets," IEEE Trans. Pattern Anal. Mach. Intell., vol. 9, no. 5, 1987.
[33] M. A. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Commun. ACM, vol. 24, Jun. 1981.
[34] D.
MacKay, Information Theory, Inference and Learning Algorithms. Cambridge, U.K.: Cambridge Univ. Press, [35] J. Hailin, P. Favaro, and S. Soatto, Real-time feature tracking and outlier rejection with changes in illumination, in Proc. IEEE Int. Conf. Comput. Vis., 2001, vol. 1, pp [36] D. G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comp. Vis., vol. 60, no. 2, pp , [37] P. A. Yushkevich, J. Piven, H. C. Hazlett, R. G. Smith, S. Ho, J. C. Gee, and G. Gerig, User-guided 3D active contour segmentation of anatomical structures: Significantly improved efficiency and reliability, Neuroimage, vol. 31, no. 3, pp , [38] P. Cignoni, M. Corsini, and G. Ranzuglia, Meshlab: An open-source 3d mesh processing system, ERCIM News, vol. 1, no. 73, pp , Apr [39] P. Pratt, D. Stoyanov, M. Visentini-Scarzanella, and G. Z. Yang, Dynamic guidance for robotic surgery using image- constrained biomechanical models, in Proc. 13th Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., 2010, pp Gustavo A. Puerto-Souza (S 11) received the B.S. degree in mathematics from the University of Yucatán, Mérida, México, in 2006, and the M.S. degree in computer science from the Center for Research in Mathematics, Guanajuato, México, in 2010.Since 2010, he has been working toward the Ph.D. degree in computer science and engineering at University of Texas at Arlington, Arlington, TX, USA. He is a Member of the Active Sensing Technologies for Robotics and Automation (ASTRA) Laboratory, and research interests include surgical vision, localization, and augmented reality for endoscopic scenarios. Mr. Puerto-Souza received the Best Paper Award from the 6th Pacific-Rim Symposium on Image and Video Technology. Jeffrey A. Cadeddu received the B.S. degree in biomedical engineering from Johns Hopkins University, Baltimore, MD, USA, in 19889, and the Ph.D. degree in medicine from Johns Hopkins University School of Medicine, in 1993, where he also completed his urology and surgery residencies. He then joined The University of Texas Southwestern Medical Center, Dallas, TX, USA, in 1999, where he currently holds the dual appointments of Professor of Urology and Professor of Radiology. In addition, he holds the Ralph C. Smith, M.D. Distinguished Chair in Minimally Invasive Urologic Surgery and serves as the Director of the Clinical Center for Minimally Invasive Urologic Cancer Treatment. His affiliations include membership in the American Urological Association, the Endourological Society, the Society for Urologic Oncology, Texas Urological Society, and the Engineering and Urology Society. He currently serves as Associate or Assistant Editor on behalf of the Journal of Endourology, World Journal of Urology, and the International Brazilian Journal of Urology, and serves as a Survey Section Editor for the Journal of Urology. His publications include over 200 peer-reviewed articles; over 75 invited articles and book chapters; numerous editorial comments, book reviews, and original videos. Dr. Cadeddu is the recipient of the American Urological Associations 2007 Gold Cystoscope Award. In April 2013, he was elected to active membership in the American Association of Genitourinary Surgeons. Gian-Luca Mariottini (S 04 M 06) received the M.S. degree in computer engineering and the Ph.D. degree in robotics and automation from the University of Siena, Siena, Italy, in 2002 and 2006, respectively. 
In 2005 and 2007, he was a Visiting Scholar at the GRASP Lab (CIS Department, UPENN, USA), and he held postdoctoral positions at the University of Siena from 2006 to 2007, the Georgia Institute of Technology, Atlanta, GA, USA, from 2007 to 2008, and the University of Minnesota, Minneapolis, MN, USA, from 2008 to 2010. Since September 2010, he has been an Assistant Professor in the Department of Computer Science and Engineering at The University of Texas at Arlington, Arlington, TX, USA, where he founded the ASTRA Robotics Lab. His research interests include robotics and computer vision, with a particular focus on endoscopic vision, augmented-reality systems for minimally-invasive surgery, and single- and multi-robot localization and navigation. Dr. Mariottini received the Best Paper Award at the 6th Pacific-Rim Symposium on Image and Video Technology.
