Coarse-to-fine image registration


Today we will look at a few important topics in scale space in computer vision, in particular coarse-to-fine approaches and the SIFT feature descriptor. I will present only the main ideas here, to give you a sense of the problem and of possible solutions. You should consider these ideas to be the tip of the iceberg. I am hoping that some of you will be curious and will choose these as topics for your term paper or for further reading on your own.

Coarse-to-fine image registration

Lucas-Kanade image registration made use of the second moment matrix

M = \sum_{(x,y) \in Ngd(x_0, y_0)} \begin{bmatrix} (\partial I/\partial x)^2 & (\partial I/\partial x)(\partial I/\partial y) \\ (\partial I/\partial x)(\partial I/\partial y) & (\partial I/\partial y)^2 \end{bmatrix}.

Let's begin by building a scale space out of this matrix. Suppose that each of the partial derivatives is computed with a normalized Gaussian derivative, where the Gaussian has parameter σ. This defines a scale space of second moment matrices, namely a family of second moment matrices defined over the domain (x, y, σ).

Notice that the summations over the neighborhoods themselves define a convolution. For example, if we were to define

w(x, y) = \begin{cases} 1, & (x, y) \in Ngd(0, 0) \\ 0, & \text{otherwise} \end{cases}

then we could write

M = w(x, y) * \begin{bmatrix} (\partial I/\partial x)^2 & (\partial I/\partial x)(\partial I/\partial y) \\ (\partial I/\partial x)(\partial I/\partial y) & (\partial I/\partial y)^2 \end{bmatrix}

where it is understood that each term in the 2 x 2 matrix is a function of (x, y) and the convolution is applied to each of the terms. Often in practice one wishes to give more weight to closer neighbors of (x_0, y_0), and one uses a Gaussian for w(x, y). In this common case we have one Gaussian G_{σ_D}(x, y) to blur the noisy image and a second Gaussian G_{σ_I}(x, y) to integrate the second moment matrix. The D and I refer to the derivative scale and the integration scale, respectively. They are also sometimes called the inner scale and the outer scale. Typically the ratio σ_I : σ_D is held constant, for example at 3.

Recall the version of the problem in which we were trying to find a translation (h_x, h_y) such that I(x + h_x, y + h_y) ≈ J(x, y). In deriving the Lucas-Kanade method, we used a first order model for the intensities of I(x, y) in local neighborhoods. This required that the translation distance h = ||(h_x, h_y)|| is much less than the radius of the neighborhood. (Recall that the first order model says the intensity is linear; but if the intensity were linear over the whole neighborhood, then we would have the aperture problem and could not uniquely determine (h_x, h_y).) This implies that we need to choose a neighborhood scale σ_I that is much greater than h. But what do we do if the actual translation distance h is large? In that case, it would seem that our estimate of (h_x, h_y) should give a poor match, since it is computed under the assumption that h is small.
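To make the per-scale computation concrete, here is a minimal sketch of a single Lucas-Kanade translation step in the notation above. This is my own illustration rather than code from the course: it assumes NumPy/SciPy, grayscale images I and J of the same size, and hypothetical arguments sigma_d and sigma_i playing the roles of the derivative scale σ_D and integration scale σ_I.

import numpy as np
from scipy.ndimage import gaussian_filter

def lk_translation_step(I, J, sigma_d, sigma_i):
    # One Lucas-Kanade step: solve M h = b for the translation (h_x, h_y) at every pixel.
    I = np.asarray(I, dtype=float)
    J = np.asarray(J, dtype=float)
    # Gaussian-derivative gradients of I at the inner (derivative) scale sigma_d.
    Ix = gaussian_filter(I, sigma_d, order=(0, 1))   # d/dx (axis 1 = columns)
    Iy = gaussian_filter(I, sigma_d, order=(1, 0))   # d/dy (axis 0 = rows)
    It = gaussian_filter(J - I, sigma_d)             # smoothed temporal difference
    # Integrate the entries of M and of the right-hand side b over a Gaussian
    # window at the outer (integration) scale sigma_i.
    Mxx = gaussian_filter(Ix * Ix, sigma_i)
    Mxy = gaussian_filter(Ix * Iy, sigma_i)
    Myy = gaussian_filter(Iy * Iy, sigma_i)
    bx = gaussian_filter(Ix * It, sigma_i)
    by = gaussian_filter(Iy * It, sigma_i)
    # Solve the 2x2 system at every pixel (Cramer's rule); flag degenerate pixels,
    # where the aperture problem makes the system (near-)singular.
    det = Mxx * Myy - Mxy * Mxy
    det[np.abs(det) < 1e-12] = np.nan
    hx = (Myy * bx - Mxy * by) / det
    hy = (Mxx * by - Mxy * bx) / det
    return hx, hy

Iterating this step, warping I by the current estimate and re-solving, is the within-scale iteration from lecture 10; the coarse-to-fine scheme described next adds an outer loop over σ.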

Assume our scale space for the second moment matrix is defined by keeping the ratio σ_I : σ_D constant, so by σ here let's just say we mean σ_D. Blurring the image by σ now removes not just the image noise, but also smooths out variations in the image signal at distances similar to σ. The blurred image is now smooth (roughly linear) at these distances, so the linear model applies at these distances, and we can estimate (h_x, h_y) up to around the distance σ. Note, however, that this estimate of (h_x, h_y) often will give us only a rough approximation, since the LK method also assumes that (h_x, h_y) is constant over the integration neighborhood Ngd(x_0, y_0). But using a larger σ_D means that we need to use a larger σ_I too (as mentioned earlier, we often hold σ_I : σ_D constant), since we need the gradient ∇I to be constant over the σ_D neighborhood but to vary over the σ_I neighborhood. The larger we make σ_I, however, the less likely it is that (h_x, h_y) will be constant over this neighborhood. What can we do?

The solution is to use a coarse-to-fine approach. Suppose we wish to estimate (h_x, h_y) near pixel (x_0, y_0). (We will repeat the following for all pixels, so let's just work with a single pixel.) We compute the scale spaces I(x, y, σ) and J(x, y, σ). We first estimate (h_x, h_y) at the largest scale, and then proceed iteratively to smaller and smaller scales. Suppose we have estimated (h_x, h_y)^{k+1} at scale σ_{k+1}. To estimate (h_x, h_y)^k at scale σ_k, we shift the pixels (x, y) ∈ Ngd(x_0, y_0) within the scale-σ_k slice:

I'(x, y) ← I(x + h_x^{k+1}, y + h_y^{k+1}, σ_k).

Note that we copy the values into a temporary variable I'(x, y); that is, we don't actually modify the scale space! We then apply LK registration to match I'(x, y) to J(x, y) in Ngd(x_0, y_0). This match gives an estimate (δh_x, δh_y), and so our net estimate at scale σ_k is

(h_x, h_y)^k ← (h_x, h_y)^{k+1} + (δh_x, δh_y).

Notice that at each scale σ_k, the blur used to compute the gradient is less and less, and the size of the neighborhood over which we integrate the gradients is less and less. This means that we ultimately use all the information in the image, and at the end we assume only that the translation (h_x, h_y) is constant in a small neighborhood.

One final note: you should realize that there are two iterations going on here. There is the coarse-to-fine iteration mentioned above, and there is also the iteration within each scale σ_k, which we discussed in lecture 10. I did not mention the latter iteration above to avoid confusion, but I mention it now to remind you that it is still there.
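Here is a correspondingly minimal sketch of the coarse-to-fine loop, again my own illustration and not code from the notes. It assumes the lk_translation_step function from the previous sketch, and for simplicity it warps the raw image and lets each LK step do the blurring, whereas the notes shift the scale-space slice I(x, y, σ_k) itself.

import numpy as np
from scipy.ndimage import map_coordinates

def coarse_to_fine_translation(I, J, sigmas, ratio=3.0):
    # Estimate a per-pixel translation field (hx, hy), iterating from the
    # coarsest scale in sigmas down to the finest.
    I = np.asarray(I, dtype=float)
    J = np.asarray(J, dtype=float)
    hx = np.zeros_like(I)
    hy = np.zeros_like(I)
    ys, xs = np.mgrid[0:I.shape[0], 0:I.shape[1]].astype(float)
    for sigma in sorted(sigmas, reverse=True):
        # Shift I by the current estimate into a temporary copy I' (the original
        # image is never modified).
        I_warp = map_coordinates(I, [ys + hy, xs + hx], order=1, mode='nearest')
        # One LK step at this scale: derivative scale sigma, integration scale ratio*sigma.
        dhx, dhy = lk_translation_step(I_warp, J, sigma, ratio * sigma)
        # Accumulate the correction; treat degenerate (aperture-problem) pixels as zero update.
        hx = hx + np.nan_to_num(dhx)
        hy = hy + np.nan_to_num(dhy)
    return hx, hy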

Local descriptors

Difference of Gaussians (DOG)

Let's now turn to our second topic for today's lecture. Recall the problem of blob detection from last lecture. I presented a simple solution to it which was based on the normalized Laplacian filter. One observation that is historically very important in vision research is that the Laplacian of a Gaussian has a similar shape to a difference of Gaussians (called a DOG), namely a difference of two Gaussians with different standard deviations (e.g. D. Marr and E. Hildreth, "Theory of Edge Detection," Proc. Royal Soc. London B, 1980). You can see this by sketching the two functions by hand (as I did in class; do it for yourself now). A more formal proof goes as follows.

First, one can show just by taking derivatives that

\sigma \frac{\partial^2 G_\sigma(x)}{\partial x^2} = \frac{\partial G_\sigma(x)}{\partial \sigma}.

(This happens to be a version of the heat equation in physics.) Then we can approximate

\frac{\partial G_\sigma(x)}{\partial \sigma} \approx \frac{G_{s\sigma}(x) - G_\sigma(x)}{s\sigma - \sigma}

where s is slightly bigger than 1, and so

G_{s\sigma}(x) - G_\sigma(x) \approx (s-1)\,\sigma \frac{\partial G_\sigma(x)}{\partial \sigma},

and substituting from the heat equation above gives what we want:

G_{s\sigma}(x) - G_\sigma(x) \approx (s-1)\,\sigma^2 \frac{\partial^2 G_\sigma(x)}{\partial x^2}.

Similarly, in 2D,

G_{s\sigma}(x, y) - G_\sigma(x, y) \approx (s-1)\,\sigma^2 \nabla^2 G_\sigma(x, y).

Thus the DOG and the Laplacian of a Gaussian have similar shapes, as claimed. (Note that (s - 1) is a constant, and so doesn't affect any claims about scale invariance.) This argument is taken from D. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," International Journal of Computer Vision, 2004, but the basic idea is much older than that.

Often one approximates the normalized Laplacian scale space by computing a Gaussian scale space (G_σ * I)(x, y) over a discrete set of scales s^i σ_0 and then computing the difference of Gaussians at neighboring scales. This is done by the SIFT method, which I will discuss below. (You will also use this scheme in Assignment 2.)

Keypoints

Suppose that we have computed a scale space using the normalized Laplacian (or a difference of Gaussians). I argued last lecture that such a scale space will give local maxima and local minima over (x, y, σ) when there is a blob present. In 2D, a blob is a square that is brighter than its background (this gives a local minimum) or a square that is darker than its background (this gives a local maximum). Recall that the maxima and minima occur at specific scales which are related to the size of the squares. Also, the heights of the local maxima and minima depend on the difference in intensity between the blob and its background. Last class I assumed the background intensity was zero, but since the DOG gives no response to a constant image, all that matters is the difference in intensity (recall Question 7 on the midterm exam).

As you will see in Q2 of Assignment 2, for general images you will find lots of local maxima and minima, and these typically are not due to isolated square regions on a uniform background. Rather, these local maxima and minima arise from other image intensity patterns. We refer to a local maximum or minimum of the DOG-filtered image as a keypoint.
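To make this concrete, here is a minimal sketch of the DOG construction and keypoint detection. This is my own illustration and the parameter values are arbitrary; a real SIFT implementation also works in octaves, localizes extrema to sub-pixel accuracy, and rejects edge-like responses.

import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_keypoints(I, sigma0=1.6, s=2 ** 0.25, num_scales=8, thresh=0.02):
    # Gaussian scale space at scales s^i * sigma0, and DOG slices formed by
    # differencing neighboring scales (this approximates the normalized
    # Laplacian up to the constant factor (s - 1)).
    I = np.asarray(I, dtype=float)
    sigmas = [sigma0 * s ** i for i in range(num_scales)]
    G = np.stack([gaussian_filter(I, sig) for sig in sigmas])   # shape (num_scales, H, W)
    D = G[1:] - G[:-1]
    # A keypoint is a local max or min of D over its 3x3x3 neighborhood in
    # (scale, y, x); the threshold keeps only large peaks (intensities assumed in [0, 1]).
    is_max = (D == maximum_filter(D, size=3)) & (D > thresh)
    is_min = (D == minimum_filter(D, size=3)) & (D < -thresh)
    k, y, x = np.nonzero(is_max | is_min)
    keep = (k > 0) & (k < D.shape[0] - 1)          # drop extrema at the scale borders
    return [(x[i], y[i], sigmas[k[i]]) for i in np.nonzero(keep)[0]]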

These keypoints are similar to the Harris corner points mentioned in lecture 9, but now we are specifically defining them to be local maxima and minima of the DOG-filtered image in a 3D scale space, rather than maxima of the Harris corner measure at a particular scale. For each keypoint, we have a position (x, y) and a scale σ. If we had a second image of the same scene, but scaled relative to the first, then the scale space of that image would be stretched relative to the first, and it should have corresponding maxima and minima. (Moreover, these maxima and minima should still be present if the second image were 2D translated and 2D rotated relative to the first image.)

If we have a set of several hundred keypoints in one image and we want to match these to keypoints in another image, it is sometimes convenient to have a descriptor of the image in the neighborhood of each keypoint. Otherwise, any keypoint in the first image could potentially be matched to any keypoint in the second image. Each keypoint has a scale σ, so it makes sense to use the intensity structure in the scale space at that scale σ and in a local neighborhood of the keypoint position, where by local we mean a neighborhood that depends on σ. For example, as mentioned earlier, if σ is the scale at which we compute the derivative (σ_D), then the neighborhood could be a Gaussian window defined by σ_I, which is a fixed multiple of σ_D.

How can we describe the local intensity structure? Let's briefly sketch out one solution, called SIFT, which has been extremely popular in the last decade (see http://people.cs.ubc.ca/~lowe/keypoints).

SIFT: Scale invariant feature transform

Assume we have found a keypoint at (x_0, y_0, σ) which is a local maximum (or minimum) in a difference of Gaussians scale space. We ensure that this local peak is sufficiently different from its neighbors and sufficiently different from zero, i.e. that it is well localized and is a large peak. We now want to describe the intensities in scale space in the neighborhood of the keypoint. SIFT constructs a local descriptor from the normalized gradient vectors σ ∇(G_σ * I)(x, y) at the scale σ and in the (x, y) neighborhood of the feature point. Suppose we sample a square of width, say, 4σ with a 64 x 64 grid of samples. Note that the larger σ is, the wider the spacing of the samples relative to the original image.

The first step is to find any dominant orientations. For example, a square has four dominant orientations, corresponding to the four edges. A triangle with one long side and two short sides would have one dominant orientation, whereas a triangle with two long sides and one short side would have two dominant orientations. To find the dominant orientations, take the sampled gradient vectors in the neighborhood of (x_0, y_0) and treat them as a set; that is, apart from the weighting, we ignore their spatial locations. Then make direction bins and drop these vectors into their appropriate bins, as in the pseudocode below.
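The histogram pseudocode below assumes that the gradient direction theta(x, y) and magnitude mag(x, y) at the keypoint's scale are available. One plausible way to compute them (my sketch, not part of the notes) is:

import numpy as np
from scipy.ndimage import gaussian_filter

def gradient_maps(I, sigma):
    # Gradient of the sigma-blurred image; theta in degrees in [0, 360).
    I = np.asarray(I, dtype=float)
    gx = gaussian_filter(I, sigma, order=(0, 1))
    gy = gaussian_filter(I, sigma, order=(1, 0))
    mag = np.hypot(gx, gy)
    theta = np.degrees(np.arctan2(gy, gx)) % 360.0
    return theta, mag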

// Let theta(x,y) and mag(x,y) be the direction (in degrees) and the length
// of the image gradient at (x, y, sigma).
anglestep = 360 / numAngleBins
for ct = 0 : numAngleBins - 1
    hist(ct) = 0
for (x,y) in Ngd(x0, y0, sigma)
    bin = floor(theta(x,y) / anglestep)    // which orientation bin this gradient falls in
    hist(bin) = hist(bin) + mag(x,y)       // vote weighted by the gradient magnitude

Lowe uses 36 angle bins of 10 degrees width. This gives a histogram of orientations (see http://en.wikipedia.org/wiki/histogram). The bin with the maximum value is taken to be the dominant orientation of the keypoint. If there is more than one dominant orientation, i.e. more than one peak in the histogram with roughly the same (maximal) height, then we make multiple local descriptors, one per dominant orientation.

There are several ways one could imagine building a descriptor once one has a dominant orientation. For example, one could use the raw image intensities I(x, y, σ) in some σ-neighborhood. This turns out to work very poorly for matching between images, though, since the intensities tend to change enormously when the lighting or the camera parameters change.

What SIFT does, in a nutshell (see Lowe's 1999 or 2004 paper for more details if you want to probe a bit deeper), is the following. Define a rotated square that is oriented parallel to the dominant direction found above. The width of the square is some multiple of the scale σ, and the square is partitioned into a 16 x 16 grid. A gradient vector is computed for each of these 256 grid elements. The 16 x 16 grid is then partitioned into a 4 x 4 array of 4 x 4 subgrids. For each of these subgrids, an orientation histogram with 8 orientation bins of 45 degrees each is constructed (rather than 36 orientations of 10 degrees, which is what was used to find the dominant orientation for the whole neighborhood). These 16 orientation histograms, with 8 bins each, define a 128-dimensional feature descriptor. (In Lowe's 2004 paper, he carries out several experiments comparing performance for various choices of the number of subgrids and number of direction bins.)

Of course, there are several technical details that I have left out here, but that is the main idea.
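Putting the pieces together, a stripped-down version of the descriptor assembly might look like the sketch below. This is my own illustration under the simplest possible choices (the function name, the width multiple, and the pixel-rounded sampling are all mine); Lowe's actual implementation also weights the magnitudes with a Gaussian window, distributes each vote over neighboring bins by trilinear interpolation, and clips large values before renormalizing.

import numpy as np

def sift_like_descriptor(theta, mag, x0, y0, sigma, dominant, width_mult=4.0):
    # Sample a 16 x 16 grid over a square of width width_mult*sigma centered on the
    # keypoint and rotated to align with the dominant orientation (in degrees).
    desc = np.zeros((4, 4, 8))                       # 4x4 subgrids, 8 orientation bins each
    c, s = np.cos(np.radians(dominant)), np.sin(np.radians(dominant))
    step = width_mult * sigma / 16.0
    for i in range(16):
        for j in range(16):
            # Offset of this sample in the rotated square, mapped into image coordinates.
            u, v = (i - 7.5) * step, (j - 7.5) * step
            x = int(round(x0 + c * u - s * v))
            y = int(round(y0 + s * u + c * v))
            if not (0 <= y < theta.shape[0] and 0 <= x < theta.shape[1]):
                continue
            # Orientation measured relative to the dominant orientation, in 8 bins of 45 degrees.
            ang = (theta[y, x] - dominant) % 360.0
            b = int(ang // 45) % 8
            desc[i // 4, j // 4, b] += mag[y, x]     # vote weighted by gradient magnitude
    d = desc.ravel()                                 # 4 * 4 * 8 = 128 dimensions
    return d / (np.linalg.norm(d) + 1e-12)           # normalize the descriptor

Descriptors from two images can then be compared by their Euclidean distance in this 128-dimensional space when matching keypoints between the images.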