Augmented Reality VU. Computer Vision 3D Registration (2) Prof. Vincent Lepetit

Feature Point-Based 3D Tracking

Feature Points for 3D Tracking Much less ambiguous than edges; Point-to-point reprojection error, not point-to-line. More robust to occlusions than template matching. BUT not for all objects.

Feature Point Extraction How to extract the same physical points?

Good Features to Track J. Shi and C. Tomasi. Good features to track. CVPR'94. Defines a "cornerness" measure. Idea: look for the points x that are easy to match under a 2D translation.

Local Feature Detection: The Math Consider shifting the window W by (u, v): how do the pixels in W change? Compare each pixel before and after the shift by summing up the squared differences. [Figure: window W in the image with axes x, y, shifted by (u, v)]

Local Feature Detection: The Math Summing up the squared differences of pixel intensities: $E(u, v) = \sum_{(x,y) \in W} \left[ I(x + u, y + v) - I(x, y) \right]^2$ [Figure: window W shifted by (u, v)]

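Small Motion Assumption Taylor series expansion of I(x + u, y + v): $I(x + u, y + v) = I(x, y) + I_x u + I_y v + \text{higher-order terms}$. If the motion (u, v) is small, then the first-order approximation is good: $I(x + u, y + v) \approx I(x, y) + \begin{bmatrix} I_x & I_y \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix}$. Plugging this into the formula from the previous slide: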

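Local Feature Detection: The Math $E(u, v) \approx \sum_{(x,y) \in W} \left( \begin{bmatrix} I_x & I_y \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} \right)^2 = \sum_{(x,y) \in W} \begin{bmatrix} u & v \end{bmatrix} \begin{bmatrix} I_x \\ I_y \end{bmatrix} \begin{bmatrix} I_x & I_y \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = \sum_{(x,y) \in W} \begin{bmatrix} u & v \end{bmatrix} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} u & v \end{bmatrix} \left( \sum_{(x,y) \in W} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix} \right) \begin{bmatrix} u \\ v \end{bmatrix}$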

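Local Feature Detection: The Math $E(u, v) \approx \begin{bmatrix} u & v \end{bmatrix} \left( \sum_{(x,y) \in W} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix} \right) \begin{bmatrix} u \\ v \end{bmatrix}$. We are looking for image locations (x, y) such that E(u, v) is large for all directions [u, v]. How is it related to H?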

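Local Feature Detection: The Math $H = \sum_{(x,y) \in W} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}$ is a 2 × 2 symmetric matrix. It can be decomposed into: $H = \begin{bmatrix} x_+ & x_- \end{bmatrix} \begin{bmatrix} \lambda_+ & 0 \\ 0 & \lambda_- \end{bmatrix} \begin{bmatrix} x_+^\top \\ x_-^\top \end{bmatrix}$, where $\lambda_+$ and $\lambda_-$ are the eigenvalues of H, and $x_+$ and $x_-$ are the eigenvectors of H.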

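Local Feature Detection: The Math $H = \begin{bmatrix} x_+ & x_- \end{bmatrix} \begin{bmatrix} \lambda_+ & 0 \\ 0 & \lambda_- \end{bmatrix} \begin{bmatrix} x_+^\top \\ x_-^\top \end{bmatrix}$, where $\lambda_+$ and $\lambda_-$ are the eigenvalues of H, and $x_+$ and $x_-$ are the eigenvectors of H. $E(u, v) \approx \begin{bmatrix} u & v \end{bmatrix} H \begin{bmatrix} u \\ v \end{bmatrix}$. $x_+$ = direction for (u, v) of largest increase in E; $\lambda_+$ = amount of increase in direction $x_+$; $x_-$ = direction for (u, v) of smallest increase in E; $\lambda_-$ = amount of increase in direction $x_-$. [Figure: directions $x_+$ and $x_-$]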

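Harris Cornerness Computation [Figure: processing pipeline. The image derivatives $g_x$ and $g_y$ are combined into the products $(g_x)^2$, $(g_y)^2$ and $g_x g_y$, and each product is smoothed with a Gaussian: $\mathrm{Gauss}(\cdot) * (g_x)^2$, $\mathrm{Gauss}(\cdot) * (g_y)^2$, $\mathrm{Gauss}(\cdot) * g_x g_y$]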

$\min(\lambda_1, \lambda_2)$
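
The min-eigenvalue cornerness can be sketched in a few lines. This is only an illustration under stated assumptions: derivatives are taken with central finite differences, the window is a plain square without Gaussian weighting, and the function name and `half` parameter are hypothetical.

```python
import math


def min_eigenvalue_cornerness(image, x, y, half=1):
    """Smallest eigenvalue of H accumulated over a square window centred on (x, y).

    `image` is a list of rows of intensities. The window, grown by one pixel
    for the finite differences, must stay inside the image.
    """
    sxx = sxy = syy = 0.0
    for j in range(y - half, y + half + 1):
        for i in range(x - half, x + half + 1):
            # Central finite differences approximate I_x and I_y.
            ix = (image[j][i + 1] - image[j][i - 1]) / 2.0
            iy = (image[j + 1][i] - image[j - 1][i]) / 2.0
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
    # Closed-form smallest eigenvalue of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    trace = sxx + syy
    root = math.sqrt((sxx - syy) ** 2 + 4.0 * sxy * sxy)
    return (trace - root) / 2.0
```

On a synthetic image containing one bright square, the value is positive at the square's corner and zero both along a straight edge and in a flat area, which is why the smaller eigenvalue is a usable cornerness score.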

Feature Point Tracking Two general approaches: KLT Kanade-Lucas-Tomasi tracker: Detection in the first frame then tracking; Detection in every frame and matching.

KLT Kanade-Lucas-Tomasi Tracker [Shi & Tomasi CVPR94] Detection: Good Features to Track (CVPR'94). Tracking: make use of the Lucas-Kanade algorithm to minimize $\sum_j \left( I_t(f(m_j; p_i + \Delta_i)) - T(m_j) \right)^2$ → Correlation measure: Sum of Squared Differences; → f: translation model or affine model. Monitoring the templates: stop tracking when the similarity becomes low (correlation measure > threshold, since a large sum of squared differences means a poor match).
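
For the translation model, one Lucas-Kanade update amounts to solving a 2 × 2 linear system built from the image gradients and the template residuals. The sketch below is a simplified illustration, not the tracker's actual code: images are passed as callables (an assumption that avoids interpolation code), gradients use central differences, and only a single Gauss-Newton step is taken, whereas a real tracker would iterate and re-warp.

```python
def lucas_kanade_translation_step(image, template, x0, y0, half):
    """One Gauss-Newton step for the translation (dx, dy) such that
    template(x, y) ~ image(x + dx, y + dy) over a window centred on (x0, y0).

    `image` and `template` are hypothetical callables (x, y) -> intensity.
    """
    hxx = hxy = hyy = bx = by = 0.0
    for y in range(y0 - half, y0 + half + 1):
        for x in range(x0 - half, x0 + half + 1):
            ix = (image(x + 1, y) - image(x - 1, y)) / 2.0
            iy = (image(x, y + 1) - image(x, y - 1)) / 2.0
            residual = template(x, y) - image(x, y)  # error at zero displacement
            hxx += ix * ix
            hxy += ix * iy
            hyy += iy * iy
            bx += ix * residual
            by += iy * residual
    det = hxx * hyy - hxy * hxy
    if abs(det) < 1e-12:
        raise ValueError("window has no 2D texture; the translation is ambiguous")
    # Solve [[hxx, hxy], [hxy, hyy]] [dx, dy]^T = [bx, by]^T by Cramer's rule.
    dx = (hyy * bx - hxy * by) / det
    dy = (hxx * by - hxy * bx) / det
    return dx, dy
```

The matrix being inverted is the same gradient matrix H used for detection, so points with a large smallest eigenvalue are exactly those for which this system is well conditioned.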

Disadvantages of the KLT Tracker Loses all the features after a while. Potential solution: regularly redetect feature points, but the redetection can be confused by features that are still being tracked. There is a better solution.

Detection + Matching in Every Frame Detecting in every frame; Matching consecutive frames.

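Feature Point Matching Possible correlation measures: Sum of squared differences: $C = \sum_{i=-d_i}^{+d_i} \sum_{j=-d_j}^{+d_j} \left( I_1(x + i, y + j) - I_2(x' + i, y' + j) \right)^2$; Normalized cross-correlation: $C = \sum_{i=-d_i}^{+d_i} \sum_{j=-d_j}^{+d_j} \frac{\left( I_1(x + i, y + j) - \bar{I}_1 \right) \left( I_2(x' + i, y' + j) - \bar{I}_2 \right)}{\sigma_1 \sigma_2}$. [Figure: points 1 to 5 in the left and right images; a window of half-size $(d_i, d_j)$ centered on $(x, y)$ in $I_1$ and on $(x', y')$ in $I_2$]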

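Feature Point Matching Cross-correlation measure: $C = \sum_{i=-d_i}^{+d_i} \sum_{j=-d_j}^{+d_j} \frac{\left( I_1(x + i, y + j) - \bar{I}_1 \right) \left( I_2(x' + i, y' + j) - \bar{I}_2 \right)}{\sigma_1 \sigma_2}$, where $\bar{I}_k$ is the mean intensity of the window in image k and $\sigma_k$ normalizes by the deviation of its intensities. It is invariant to affine changes of the lighting, and lies between -1 (completely different patches) and +1 (equal patches). In practice: accept patches when C > 0.8. [Figure: points 1 to 5 in both images]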
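
A minimal sketch of the cross-correlation measure on two patches follows. It assumes that each $\sigma_k$ is taken as the norm of the mean-centred patch, which is one normalization that keeps C in [-1, 1]; the function name and the choice to return 0 for a constant patch are illustrative assumptions.

```python
import math


def normalized_cross_correlation(patch1, patch2):
    """Cross-correlation of two equally sized patches (lists of rows)."""
    a = [v for row in patch1 for v in row]
    b = [v for row in patch2 for v in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    # Assumption: sigma_k is the norm of the mean-centred patch.
    norm_a = math.sqrt(sum(v * v for v in da))
    norm_b = math.sqrt(sum(v * v for v in db))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # constant patch: the measure is undefined, treat as no match
    return sum(p * q for p, q in zip(da, db)) / (norm_a * norm_b)
```

Multiplying one patch by a gain and adding an offset leaves the score at +1, which illustrates the invariance to affine lighting changes; inverting the contrast gives -1.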

Feature Point Matching 1. For each point, search for the correspondent that maximizes the correlation. Search limited to a region of interest centered on the point.

Feature Point Matching 1. For each point, search for the correspondent that maximizes the correlation. Search limited to a region of interest. Retain the best correspondent according to the correlation.

Feature Point Matching 1. For each point, search for the correspondent that maximizes the correlation. Search limited to a Region of interest. 2. Reverse the role of the images.

Feature Point Matching Keep the points that choose each other.
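
The two-way check ("keep the points that choose each other") can be sketched as follows. The sketch assumes the correlations have already been computed into a matrix `scores[i][j]` between point i of the first image and point j of the second; the region-of-interest restriction is left out for brevity, and the 0.8 acceptance threshold is the value quoted for the cross-correlation measure.

```python
def mutual_best_matches(scores, min_score=0.8):
    """Return the pairs (i, j) that pick each other as best correspondent.

    scores[i][j] is the correlation between point i (image 1) and
    point j (image 2); higher means more similar.
    """
    if not scores or not scores[0]:
        return []
    n1, n2 = len(scores), len(scores[0])
    # Step 1: best correspondent in image 2 for every point of image 1.
    best_for_1 = [max(range(n2), key=lambda j: scores[i][j]) for i in range(n1)]
    # Step 2: reverse the roles of the images.
    best_for_2 = [max(range(n1), key=lambda i: scores[i][j]) for j in range(n2)]
    # Keep only mutually consistent pairs with a high enough correlation.
    return [(i, j) for i, j in enumerate(best_for_1)
            if best_for_2[j] == i and scores[i][j] > min_score]
```

A pair that is mutually consistent but weakly correlated is still rejected by the threshold, so the two tests complement each other.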

Using Interest Points for 3D Tracking: Tracking planes Simon et al., "Pose Estimation for Planar Structures", CGA02. For a plane: $H_{w,t} = H_t \times H_{t-1} \times \dots \times H_1 \times H_{w,0}$. Estimation of the homographies $H_t$ from matches. [Figure: image sequence linked by $H_{w,0}$, $H_1$, $H_2$, ..., $H_t$]
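
Chaining the frame-to-frame homographies is a sequence of 3 × 3 matrix products. The sketch below only illustrates the composition order; homographies are plain nested lists, and all function names are hypothetical. Estimating each $H_t$ from matches is not shown.

```python
def matmul3(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


def chain_homographies(h_w0, frame_homographies):
    """Return H_{w,t} = H_t x ... x H_1 x H_{w,0}.

    `frame_homographies` lists H_1, ..., H_t in temporal order.
    """
    h = h_w0
    for h_k in frame_homographies:
        h = matmul3(h_k, h)  # the newest homography multiplies on the left
    return h


def apply_homography(h, x, y):
    """Map the point (x, y) through h using homogeneous coordinates."""
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return u / w, v / w
```

Because every new estimate is multiplied onto the previous product, any error in one $H_t$ stays in all later $H_{w,t}$, which is one way drift accumulates in this kind of recursive scheme.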

Interest Point-Based Tracking Advantage: Robust to occlusions

Reference frame-based tracking [Vacchetti et al PAMI04] Reference frames are images of the object, captured and registered offline.

Reference Frame-Based Tracking Method Frame at time t

Reference Frame-Based Tracking Method How to match points with points in a (registered) reference frame of the object? Reference frame Frame at time t

Wide baseline matching During the tracking we roughly know where the camera is. Frame at time t.

Wide baseline matching During the tracking we roughly know where the camera is. We re-render the reference frame from the viewpoint estimated at time t-1. [Figure: reference frame; «re-rendered reference frame» rendered from the viewpoint estimated at time t-1; frame at time t]

Wide baseline matching During the tracking we roughly know where the camera is. We re-render the reference frame from the viewpoint estimated at time t-1. The «re-rendered reference frame» is an intermediate image that can easily be matched with the current frame. [Figure: reference frame; «re-rendered reference frame» rendered from the viewpoint estimated at time t-1; frame at time t]

Reference Frame-Based Tracking Method Works but is not accurate → the virtual objects jitter:

In the reference frames-based tracking method, the successive frames were tracked independently. reference frames time t-1 time t

Stable Tracking Method Idea: track interest points on the object over the successive frames and use them to improve the accuracy of the camera registration. reference frames time t-1 time t

Stable Tracking Method The tracked points are the projections of 3D points lying on the object surface → we should also optimize over the 3D positions of these points. [Figure: reference frame, time t-1, time t]

Stable Tracking Method The tracked points are the projections of 3D points lying on the object surface → we should also optimize over the 3D positions of these points. We optimize not only over the current camera position but also over the previous ones. The problem becomes: minimize, over the camera positions up to time t and the 3D positions of the tracked points, the sum of (reprojection errors of the tracked points) + (reprojection errors of the points matched with reference frames), with the constraint that the tracked points lie on the object surface. [Figure: reference frame, time t-1, time t]

Stable Tracking Method 1. We consider only the current and the previous frames to keep reasonable computation times. 2. The optimization of the 3D positions of the tracked points, under the constraint that they lie on the object surface, can easily be performed using a transfer function Ψ (Ying Shan et al., Model-Based Bundle Adjustment with Application to Face Modeling, ICCV01) → the 3D positions are not explicitly computed. [Figure labels: Object, $n_i$, $m_i$, Ψ, error to minimize, camera at time t-1, camera at time t]

Full Method

Results

Augmented Reality

Face Tracking Face assumed to be rigid; Generic 3D model of the face; 1 reference frame built manually on a frontal view of the face; Automatic reinitialisation using a 2D detection.

Vision-Based 3D Tracking

Recursive Tracking t = 0 t = 1 t = 2...

3D Object Detection Keypoint detection (Harris, extrema of Laplacian, affine regions,...); Keypoint recognition (descriptor matching or classification); Robust pose estimation (RANSAC+P3P,...). Registered image(s) of the object to detect Input image
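
The robust estimation step can be illustrated with a generic RANSAC loop. To keep the sketch short, a 2D translation estimated from a single match stands in for the P3P pose solver; this substitution, the threshold, the iteration count and the function name are all assumptions made for illustration.

```python
import random


def ransac_translation(matches, threshold=2.0, iterations=50, seed=0):
    """Estimate a 2D translation from matches ((x1, y1), (x2, y2)) with outliers."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        # Minimal sample: one match fully determines a translation.
        (x1, y1), (x2, y2) = rng.choice(matches)
        tx, ty = x2 - x1, y2 - y1
        # Consensus set: matches that agree with this hypothesis.
        inliers = [m for m in matches
                   if abs(m[1][0] - m[0][0] - tx) <= threshold
                   and abs(m[1][1] - m[0][1] - ty) <= threshold]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refit the model on the largest consensus set.
    n = len(best_inliers)
    tx = sum(m[1][0] - m[0][0] for m in best_inliers) / n
    ty = sum(m[1][1] - m[0][1] for m in best_inliers) / n
    return (tx, ty), best_inliers
```

With P3P the structure stays the same: the minimal sample becomes three 2D-3D correspondences, the hypothesis a camera pose, and the inlier test a reprojection error.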

Keypoint-Based Object Detection

Step 1: Detection invariant to scale and rotation, or perspective transformation

Step 2: Patch rectification

Step 3: Build description vector

Step 4: Match description vectors

Feature Detector in SIFT: Invariant to Rotation and Scale

Scale-Space Theory Original image Successive convolutions with a Gaussian filter or Gaussian derivative filter while increasing σ [Lindeberg 9*]

Laplacian of Gaussian $(G_{xx} + G_{yy})$ for feature point detection [Figure: Laplacian operator]

Fast Approximation of the Laplacian of Gaussian Convolution with the Laplacian of Gaussian is not separable, and therefore slow. However, the Laplacian of Gaussian can be approximated by the difference of two Gaussians: $G(\sigma) - G(\sigma')$. [Figure: $G(\sigma)$, $G(\sigma')$ and their difference $G(\sigma) - G(\sigma')$]
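
The quality of the approximation can be checked numerically. The sketch below works in 1D for brevity (an assumption; the slide concerns 2D images) and takes σ = kσ' with k close to 1, in which case the difference of Gaussians is close to (k - 1) σ'² times the second derivative of the Gaussian. Function names are illustrative.

```python
import math


def gaussian(x, sigma):
    """1D Gaussian with standard deviation sigma."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))


def gaussian_second_derivative(x, sigma):
    """Second derivative of the 1D Gaussian (the 1D Laplacian of Gaussian)."""
    return (x * x / sigma ** 4 - 1.0 / sigma ** 2) * gaussian(x, sigma)


def difference_of_gaussians(x, sigma, sigma_prime):
    """G(sigma) - G(sigma') evaluated at x."""
    return gaussian(x, sigma) - gaussian(x, sigma_prime)


def scaled_laplacian(x, sigma_prime, k):
    """(k - 1) * sigma'^2 * G''(sigma'): the value the difference should approach."""
    return (k - 1.0) * sigma_prime ** 2 * gaussian_second_derivative(x, sigma_prime)
```

The factor σ'² in front of the second derivative is the same scale normalization used when comparing Laplacian responses across scales.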

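Efficient Scale-Space Detection [Figure: pyramid construction with Resample, Blur and Subtract steps]

Accurate Keypoint Localization Keypoint locations: extrema of the Difference-of-Gaussian in scale space. Sub-pixel and sub-scale interpolation: the Taylor expansion of the Difference-of-Gaussian $D$ around the sample point is $D(\mathbf{x}) = D + \frac{\partial D}{\partial \mathbf{x}}^\top \mathbf{x} + \frac{1}{2} \mathbf{x}^\top \frac{\partial^2 D}{\partial \mathbf{x}^2} \mathbf{x}$, where $\mathbf{x} = (x, y, \sigma)^\top$ is the offset from that point. Offset of the extremum (use finite differences for the derivatives): $\hat{\mathbf{x}} = -\left( \frac{\partial^2 D}{\partial \mathbf{x}^2} \right)^{-1} \frac{\partial D}{\partial \mathbf{x}}$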
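
In one dimension the interpolation reduces to fitting a parabola through three neighbouring samples: the extremum lies at an offset of minus the first derivative divided by the second derivative, both estimated with finite differences. The sketch below shows only this 1D case (a simplification of the joint position and scale refinement), and the function name is hypothetical.

```python
def refine_extremum_1d(d_prev, d_curr, d_next):
    """Offset of the extremum relative to the middle sample.

    The three arguments are responses sampled at positions -1, 0 and +1.
    """
    first = (d_next - d_prev) / 2.0          # central difference for D'
    second = d_next - 2.0 * d_curr + d_prev  # finite difference for D''
    if second == 0.0:
        return 0.0  # locally flat or linear: keep the sampled position
    return -first / second
```

For samples taken from a parabola the recovered offset is exact; on real responses an offset larger than 0.5 in magnitude suggests that the extremum is closer to a neighbouring sample.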

Results

Affine Region Detectors: Invariant to Affine Transformations

Harris-Affine & Hessian-Affine Region Detector Harris-Affine: uses the auto-correlation matrix as in the classic Harris detector: $M = \begin{bmatrix} \mathrm{Gauss}(\cdot) * I_x^2 & \mathrm{Gauss}(\cdot) * (I_x I_y) \\ \mathrm{Gauss}(\cdot) * (I_x I_y) & \mathrm{Gauss}(\cdot) * I_y^2 \end{bmatrix}$. Local maxima of the smallest eigenvalue indicate the presence of a corner. Hessian-Affine: considers the Hessian matrix: $H = \begin{bmatrix} I_{xx} & I_{xy} \\ I_{xy} & I_{yy} \end{bmatrix}$. Local maxima of the determinant or of the smallest eigenvalue indicate the presence of a blob structure.

Scale Selection Both the Harris-Affine and the Hessian-Affine detectors use the Laplacian to select the "characteristic" scale: $\sigma^2 \, \mathrm{Lap}(x, \sigma)$

Affine Transformation Estimation Warp by the affine transformation $M^{1/2}$, where M is the auto-correlation matrix.

Harris-Affine & Hessian-Affine Region Detector Algorithm: 1. Detect initial region with Harris or Hessian detector and select the scale; 2. Estimate the shape with the second moment matrix (=auto-correlation matrix); 3. Normalize the affine region to the circular one; 4. Go to step 2 if the eigenvalues of the second moment matrix for new point are not equal.

Maximally Stable Extremal Region Detector Binary thresholding with thresholds from 0 to 255; regions that remain unchanged over a large range of thresholds are kept.

Affine Normalization Warp by $M_1^{1/2}$; warp by $M_2^{1/2}$. We still have to correct for the orientation!

Select Canonical Orientation Create a histogram of the local gradient directions computed over the image patch; each gradient contributes its norm, weighted according to its distance to the patch center; assign the canonical orientation at the peak of the smoothed histogram. [Figure: orientation histogram over 0 to 2π]
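
The orientation selection can be sketched as a weighted histogram followed by a peak search. The number of bins (36), the [0.25, 0.5, 0.25] circular smoothing kernel and the function name are assumptions for illustration; the distance-based weight is supplied by the caller rather than computed here.

```python
import math


def canonical_orientation(gradients, bins=36):
    """Peak direction, in radians within [0, 2*pi), of a gradient histogram.

    `gradients` is an iterable of (gx, gy, weight) triples, where `weight`
    encodes the distance of the pixel to the patch centre.
    """
    hist = [0.0] * bins
    for gx, gy, weight in gradients:
        angle = math.atan2(gy, gx) % (2.0 * math.pi)
        index = int(angle / (2.0 * math.pi) * bins) % bins
        hist[index] += math.hypot(gx, gy) * weight  # norm times spatial weight
    # Circular smoothing; hist[i - 1] wraps around for i == 0.
    smoothed = [0.25 * hist[i - 1] + 0.5 * hist[i] + 0.25 * hist[(i + 1) % bins]
                for i in range(bins)]
    peak = max(range(bins), key=lambda i: smoothed[i])
    return (peak + 0.5) * 2.0 * math.pi / bins  # centre of the winning bin
```

The returned angle is quantized to the bin width; a finer estimate could interpolate around the peak, which this sketch does not attempt.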

SIFT Description Vector Made of local histograms of gradients. In practice: 8 orientations × 4 × 4 histograms = 128-dimensional vector.

Handling Lighting Changes Additive brightness offsets do not affect gradients; normalization to unit length removes contrast (gain) changes; saturation affects magnitudes much more than orientations: magnitudes are thresholded.
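
The normalization and thresholding of the description vector can be sketched as below. The clamp value of 0.2 follows the commonly cited SIFT setting and is not given in these slides, so it should be read as an assumption, as should the function name.

```python
import math


def normalize_descriptor(vector, clamp=0.2):
    """Normalize to unit length, clamp large entries, then renormalize."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        return list(vector)  # empty patch: nothing to normalize
    unit = [v / norm for v in vector]           # removes contrast (gain) changes
    clipped = [min(v, clamp) for v in unit]     # limits the effect of saturation
    norm = math.sqrt(sum(v * v for v in clipped))
    return [v / norm for v in clipped]
```

Scaling all histogram entries by the same gain leaves the output unchanged, and entries above the clamp end up equal, so a few very strong gradients cannot dominate the match.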

Standard Approach Step 4: Match description vectors

Matching: Approximate Nearest Neighbour Best-Bin-First: approximate nearest-neighbour search in a k-d tree. [Figure: k-d tree partition with the query point q]
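
A compact sketch of the Best-Bin-First idea follows: branches of the k-d tree are visited in order of their distance to the query through a priority queue, and the search stops after a fixed budget of node visits. The dictionary-based tree, the function names and the default budget are illustrative assumptions, not a reference implementation.

```python
import heapq


def build_kdtree(points, depth=0):
    """Build a k-d tree (nested dicts) over a list of equal-length tuples."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_bin_first(tree, query, max_checks=200):
    """Approximate nearest neighbour of `query`, visiting about max_checks nodes."""
    best, best_dist = None, float("inf")
    counter = 0  # tie-breaker so the heap never has to compare tree nodes
    heap = [(0.0, counter, tree)]
    checks = 0
    while heap and checks < max_checks:
        bound, _, node = heapq.heappop(heap)
        if node is None or bound >= best_dist:
            continue  # this branch cannot contain a closer point
        while node is not None:  # descend to a leaf, queueing the other branches
            checks += 1
            dist = squared_distance(node["point"], query)
            if dist < best_dist:
                best, best_dist = node["point"], dist
            diff = query[node["axis"]] - node["point"][node["axis"]]
            if diff < 0:
                near, far = node["left"], node["right"]
            else:
                near, far = node["right"], node["left"]
            if far is not None:
                counter += 1
                # Squared distance to the splitting plane bounds the far branch.
                heapq.heappush(heap, (diff * diff, counter, far))
            node = near
    return best
```

With a very large budget the search returns the exact nearest neighbour; lowering `max_checks` trades accuracy for speed, which is the point of the approximation for 128-dimensional description vectors.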