DARTs: Efficient scale-space extraction of DAISY keypoints

David Marimon, Arturo Bonnin, Tomasz Adamek, and Roger Gimeno
Telefonica Research and Development, Barcelona, Spain
{marimon,tomasz}@tid.es

Abstract

Winder et al. [15, 14] have recently shown the superiority of the DAISY descriptor [12] over other widely adopted descriptors such as SIFT [8] and SURF [1]. Motivated by those results, we present a novel algorithm that extracts viewpoint- and illumination-invariant keypoints and describes them with a particular implementation of a DAISY-like layout. We demonstrate how to compute the scale-space efficiently and re-use this information for the descriptor. Comparisons to similar approaches such as SIFT and SURF show higher precision vs. recall performance for the proposed method. Moreover, we dramatically reduce the computational cost, by factors of 6 and 3, respectively. We also demonstrate the use of the proposed method in computer vision applications.

1. Introduction

Keypoints, or salient points, are those samples in an image that are highly repeatable across different viewing conditions such as viewpoint and illumination. Identifying those keypoints in different images of the same object or scene makes it possible to fulfill tasks such as reconstructing 3D spaces or recognising objects, among others. In the last decade, several keypoint extraction techniques have been developed that have gained wide acceptance across several application domains due to their robustness [8, 1]. Two recent studies [15, 14] have thoroughly evaluated the performance of different descriptor layouts and features, among other parameters, for describing image patches. In those studies, one can observe clear improvements of specific combinations of parameters (e.g. the DAISY descriptor [12]) over the widely accepted descriptors cited above. However, a full implementation of those combinations leads to computationally demanding methods.

In this paper, we aim at extracting viewpoint- and illumination-invariant keypoints in scale-space with a variation of the DAISY descriptor. The algorithm is designed for low overall computational complexity. On one hand, we perform an efficient computation of the scale-space, which is intensively re-used during the keypoint description phase. On the other hand, we identify an algorithmic optimisation of the description that dramatically speeds up the process. To sum up, we propose a novel framework for fast scale-space keypoint extraction and description with the following contributions:

- An approximation of the determinant of the Hessian by piece-wise triangle filters that is faster and gives similar or better results than the integral image-based approximation of SURF and the Difference-of-Gaussian of SIFT;
- A keypoint orientation assignment faster than the one in SURF;
- A DAISY-like descriptor extracted efficiently by re-using computations done for keypoint extraction and by optimising the sampling space.

All these improvements lead to speed-up factors of 6 and 3 when compared to SIFT and SURF, respectively, with better precision-vs-recall performance than that obtained by those methods.

Figure 1 shows a block diagram of the method. First, the image is filtered with a triangle kernel at different scales. This is followed by the computation of the determinant of the Hessian at each scale and then the detection of extrema in this space. For each extremum, the dominant orientations are computed from gradient information extracted from the triangle-filtered images.
For each dominant orientation, a descriptor is calculated. This computation uses oriented gradients, also extracted from the triangle-filtered images.

The structure of the paper is as follows. The next section describes similar keypoint extraction techniques. Section 3 presents the proposed scale-space extraction method. The adaptation of the descriptors proposed by Winder et al. is dealt with in Section 4. The results of validating the approximations in comparison to related methods are given in Section 5. Section 6 shows the successful application of the proposed framework to computer vision domains, namely object tracking and 3D reconstruction. The last section discusses the achievements of the proposed technique and indicates future paths of research.

Figure 1. Block diagram of the proposed scale-space extraction of DAISY-like keypoints. The triangle-filtered images are efficiently computed and re-used for the computation of the dominant orientation and the descriptor.

2. Related work

The problem of scale-space keypoint detection has been studied before in computer vision [7]. This section is devoted to those techniques that share similarities with the proposed method. There exist two main widely adopted methods for scale-space keypoint extraction and description, namely SIFT [8] and SURF [1].

The SIFT method searches for extrema of the Difference-of-Gaussian at consecutively sampled scales. First, a pyramid of smoothed versions of the input image is computed. Extrema are found inside each octave (doubling of sigma). At each of those extrema, a descriptor is built based on the orientation of the gradient. First, a grid is defined according to the main orientation of the gradient around the keypoint. Inside each grid cell, a histogram of the orientation of the gradient, weighted by its magnitude, is computed. The descriptor is typically built with 4x4 such regions and histograms of 8 bins, leading to a vector of 128 components. This method has demonstrated good performance in a large variety of applications, such as 3D reconstruction, object recognition and robot localisation. However, it has one main drawback, namely the computational cost of building the pyramid.

SURF, on the other hand, is designed for much faster scale-space extraction. Extrema are located on the determinant of the Hessian approximated by Haar-wavelets. The descriptor is based on the polarity of the intensity changes. Sums of the gradient (oriented with the main orientation of the keypoint) and of the absolute value of the gradient in the horizontal and vertical directions are computed. The descriptor is usually formed of 4 such values computed on 4x4 regions around the keypoint, leading to a descriptor of 64 values. The benefit of this method comes mainly in the extraction phase, where the Haar-wavelets are computed by accessing the integral image. This dramatically reduces the number of memory accesses and computations, especially in a multi-scale environment.

For a comparative study of relevant affine-invariant scale-space keypoint extraction techniques, the reader is referred to [10]. Although not covered in this section, the reader should also note that robust single-scale extraction and/or description methods have been proposed in the literature [3, 11, 6]. The performance of several descriptors, including SIFT, has been evaluated in [9]. More extensive evaluations have recently been pursued by Winder et al. [15, 14] with different features, layouts and steps to describe keypoints. In their analysis, the authors propose a chain of elements to describe a keypoint. The best results are achieved by a sampling layout called DAISY [12], also known as S4 [15]. This sampling clearly outperforms the grid layout used in SIFT and SURF, and also improves on the results achieved with the polar layout used in GLOH [9]. Motivated by those results, we propose an extraction method and a particular configuration of a DAISY-like descriptor that shows better precision vs. recall results than SIFT and SURF.
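As background to the SURF discussion above, the integral-image mechanism can be sketched in a few lines of numpy (a generic illustration, not SURF's actual code): once the summed-area table is built, any box sum costs exactly four accesses, independently of the box size.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero first row/column:
    ii[y, x] = sum of img[:y, :x]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] with exactly four accesses, which is
    what makes Haar-wavelet responses cheap at any scale."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
```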
3. Keypoint extraction

The proposed method for scale-space extrema detection is composed of three steps: (i) an efficient computation of the approximated determinant of the Hessian at each scale; (ii) extrema search in scale-space; and (iii) locating the keypoint with sub-pixel and sub-scale accuracy as in Brown et al. [2]. Let us start by explaining the extraction algorithm generically and then the exact process that makes it efficient.

Computing the determinant of the Hessian consists in evaluating $L_{xx}(i,j)\,L_{yy}(i,j) - L_{xy}(i,j)^2$, where $L_{xx}$ is the second horizontal derivative of Gaussian applied to the image, $L_{yy}$ is the vertical one, and $L_{xy}$ is the cross derivative. We propose to approximate the shape of those filters with piece-wise triangles. Let us assume that an image is filtered with a 2D triangle-shaped filter, obtaining $L(i,j)$. Figure 2(a) plots the shape of this filter. For each scale, all the derivatives are computed by accessing the filtered response at different points. Conceptually, the shape of the second horizontal derivative of Gaussian (see Figure 2(b)) is approximated by translated and weighted triangle-shaped responses (see Figure 2(c)).

Figure 2. Plots of the shapes used to compute the determinant of the Hessian. (a) 2D triangle-shaped kernel. (b) Second derivative of Gaussian. (c) Second derivative of Gaussian approximated by accessing weighted triangle responses.

This process is performed at different scales $k$, leading to the following approximations:

$$L^k_{xx} = L(k, i - d_1, j) - 2\,L(k, i, j) + L(k, i + d_1, j) \quad (1)$$

$$L^k_{yy} = L(k, i, j - d_1) - 2\,L(k, i, j) + L(k, i, j + d_1) \quad (2)$$

$$L^k_{xy} = L(k, i - d_2, j - d_2) - L(k, i + d_2, j - d_2) - L(k, i - d_2, j + d_2) + L(k, i + d_2, j + d_2) \quad (3)$$

where $d_1 = (2 \cdot 3\sigma + 1)/3$ and $d_2 = d_1/2$ are chosen experimentally to best approximate the kernel of the second derivative of Gaussian generated with the corresponding $\sigma$. Note that those approximations (unless $d_1$ is equal to 1) are not equivalent to filtering with a triangle filter and then convolving with a second-derivative filter $[1, -2, 1]$, as this would generate undesired artifacts. As can be deduced, computing the derivatives requires only 9 different accesses to $L$. This should be compared to the box-shaped approximation of SURF [1], where the computation of the determinant of the Hessian requires 8 + 8 + 16 = 32 accesses to the integral image.

The scale-space is formed as a stack of filtered versions $L(k, i, j)$, all with the size of the input image. This is different from SIFT or SURF, where a pyramid of filtered versions is created; in that process, sub-sampling is performed at each octave (doubling of the sigma of the Gaussian filter). In our experiments, sub-sampling causes a relevant loss of performance, and therefore we propose to filter the input image without sub-sampling. Although this might look computationally demanding, one of the contributions of our work is the efficient computation of this stack of triangle-filtered images, as described below.

So far we have explained how to compute the determinant of the Hessian by accessing a limited number of samples of the filtered versions of the input image. Nevertheless, filtering with a 2D triangle kernel requires many memory accesses per pixel. Moreover, multi-scale filtering is a time-consuming process, especially for large-sized kernels. Therefore, we propose an efficient computation of the triangle-filtered versions. As studied by Heckbert [4], Gaussian filters can be approximated by iteratively convolving box-type filters. Heckbert's main observation was that the convolution of a function $f$ with a kernel $g$ is equivalent to the convolution of the $n$-times repeated integration of $f$ with the $n$-th derivative of $g$:

$$f * g = \left( \underbrace{\int \cdots \int}_{n} f(x)\, dx^n \right) * \frac{\partial^n g(x)}{\partial x^n}. \quad (4)$$

Heckbert proposes to obtain the kernel function by the $n$-times repeated convolution of a box filter ($\mathrm{box}^{*n}$). For increasing $n$, the kernel approximates a Gaussian filter [13]. From [4], the $n$-th derivative of the $n$-times convolution is

$$\frac{\partial^n}{\partial x^n}\,\mathrm{box}^{*n}(x) = \left(\frac{\partial\,\mathrm{box}}{\partial x}\right)^{*n}(x) = N^{-n} \sum_{i=0}^{n} \binom{n}{i} (-1)^i\, \delta\!\left(x + \left(\tfrac{n}{2} - i\right) N\right), \quad (5)$$

where $N$ is the length of the kernel and $\delta(x)$ is a discrete impulse function. The filtering can then be computed by accessing only $n + 1$ (in the one-dimensional case) samples of the $n$-times integrated signal. For instance, a one-dimensional Bartlett filter ($n = 2$) needs only 3 samples of the second integral of the input signal $f$, regardless of the size $N$ of the kernel. The process of integrating and accessing $n + 1$ samples to obtain a filtered signal can be performed on-the-fly.
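To make Eq. (5) concrete, here is a minimal numpy sketch of a one-dimensional Bartlett filter computed from the twice-integrated signal (our own illustration, not the authors' code; zero-padding at the borders is an assumption):

```python
import numpy as np

def bartlett_filter_1d(f, N):
    """Bartlett (triangle, n = 2) filtering of a 1D signal via
    Heckbert's repeated integration: integrate twice, then take the
    3-tap second difference of Eq. (5). Cost per output sample:
    3 accesses, independent of the kernel size N."""
    fp = np.pad(np.asarray(f, dtype=np.float64), 2 * N)   # zero borders
    F2 = np.concatenate(([0.0], np.cumsum(np.cumsum(fp))))
    x = np.arange(len(f)) + 2 * N        # sample positions inside F2
    # Eq. (5) with n = 2: taps (+1, -2, +1) at offsets (-N, 0, +N).
    return (F2[x + N] - 2.0 * F2[x] + F2[x - N]) / N**2
```

Since the cumulative sums can be maintained as running sums, the whole pass indeed stays at 3 accesses per sample regardless of N.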
In our case, for a single scale $k$, $L(k, i, j)$ can be obtained on the fly with only two passes (one horizontal and one vertical) over the image, with 3 memory accesses per pixel at each pass, instead of the $N \times N$ accesses performed in standard filtering (or $N$ at each pass, assuming that kernel separability is applied).

The search for extrema is performed on all scales except the first ($k = 1$) and the last ($k = K$). Each extremum is searched for within a window of $\sigma \times \sigma$ on the current scale $k$ and the upper ($k + 1$) and lower ($k - 1$) scales. In order to speed up the process, a first test is performed on a 3x3 window on the current scale $k$ to quickly detect non-maxima and avoid further processing. It should be noted that both SIFT and SURF search for extrema inside octaves of the pyramid and generate extra scales to allow for correct extrema detection. The proposed extrema search is continuous in the scale-space stack.
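Tying the pieces of this section together, a sketch of the per-scale detector response: one separable triangle pass, then the nine-access determinant of the Hessian of Eqs. (1)-(3). Function names and the edge-replication border handling are our assumptions; the routine re-uses bartlett_filter_1d from above.

```python
import numpy as np

def triangle_filter_2d(img, N):
    """Separable 2D triangle filtering: one horizontal and one vertical
    Bartlett pass (3 accesses per pixel per pass)."""
    tmp = np.apply_along_axis(bartlett_filter_1d, 1, img, N)
    return np.apply_along_axis(bartlett_filter_1d, 0, tmp, N)

def det_hessian(L, sigma):
    """Determinant of the Hessian at one scale from the triangle-filtered
    image L, following Eqs. (1)-(3): 9 distinct accesses per pixel.
    Here i is the column (horizontal) index and j the row index."""
    d1 = max(1, int(round((2 * 3 * sigma + 1) / 3.0)))
    d2 = max(1, d1 // 2)
    m = max(d1, d2)
    Lp = np.pad(L, m, mode="edge")
    def shift(dr, dc):   # view of L displaced by (dr rows, dc columns)
        return Lp[m + dr: m + dr + L.shape[0], m + dc: m + dc + L.shape[1]]
    Lxx = shift(0, -d1) - 2 * L + shift(0, d1)                  # Eq. (1)
    Lyy = shift(-d1, 0) - 2 * L + shift(d1, 0)                  # Eq. (2)
    Lxy = (shift(-d2, -d2) - shift(-d2, d2)
           - shift(d2, -d2) + shift(d2, d2))                    # Eq. (3)
    return Lxx * Lyy - Lxy ** 2
```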

4. Keypoint description

Winder et al. have recently evaluated different features, layouts and steps to describe keypoints [15, 14]. Motivated by those results, we have investigated several configurations of the DAISY descriptor that fit our purpose of re-using as much information as possible from the extraction step.

4.1. Orientation assignment

In order to generate viewpoint-invariant descriptors, the first step is to detect the dominant orientation of the keypoint. The approach followed in SIFT is to compute the histogram of gradient orientations, weighted by their magnitude, within a window of the corresponding Gaussian-smoothed image. Gradients are computed by pixel differentiation. The dominant orientation is found at the peak of the histogram. If more than one dominant peak is found, several keypoints are generated. In the case of SURF, derivatives are computed with Haar-wavelets (exploiting the integral image) at sampled points in a circular neighbourhood of the keypoint. The magnitude of each sample populates a space of horizontal and vertical derivatives. This space is scanned and the dominant orientation is found as the largest sum of values within a sliding window.

Our method exploits some of the benefits of both approaches. Speed-up is gained by sampling a circular neighbourhood, as in SURF. Our advantage is that the gradient at each sample $(i, j)$ is computed by simply accessing $L(k)$ at two points (compared to 6 samples in SURF): $\partial_x = L(k, i - d_3, j) - L(k, i + d_3, j)$ for the horizontal derivative, and equivalently for the vertical one, where $d_3 = (2 \cdot 3\sigma' + 1)/6$ and $\sigma'$ comes from the sub-scale accuracy step mentioned at the beginning of Section 3. Each derivative is accumulated into a histogram with a weight proportional to its magnitude and a Gaussian kernel centered at the keypoint. Finally, multiple dominant orientations are found similarly to SIFT.
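A minimal sketch of this orientation assignment, under stated assumptions: the bin count, the neighbourhood radius, the Gaussian width and the peak threshold are ours (the paper does not give these constants), and sigma stands for the sub-scale-refined scale.

```python
import numpy as np

def dominant_orientations(L, kx, ky, sigma, n_bins=36, peak_ratio=0.8):
    """Sample a circular neighbourhood, obtain each derivative with two
    accesses to L(k), and accumulate a Gaussian- and magnitude-weighted
    orientation histogram. Constants are assumptions, not the paper's."""
    d3 = max(1, int(round((2 * 3 * sigma + 1) / 6.0)))
    radius = int(round(4 * sigma))                # assumed support
    hist = np.zeros(n_bins)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dx * dx + dy * dy > radius * radius:
                continue                           # circular neighbourhood
            y, x = ky + dy, kx + dx
            if not (d3 <= x < L.shape[1] - d3 and d3 <= y < L.shape[0] - d3):
                continue
            gx = L[y, x - d3] - L[y, x + d3]       # two accesses
            gy = L[y - d3, x] - L[y + d3, x]       # two accesses
            w = np.exp(-(dx * dx + dy * dy) / (2.0 * (2 * sigma) ** 2))
            b = int(n_bins * (np.arctan2(gy, gx) % (2 * np.pi)) / (2 * np.pi))
            hist[b % n_bins] += w * np.hypot(gx, gy)
    # As in SIFT, every sufficiently strong peak yields a keypoint copy.
    return [2 * np.pi * (b + 0.5) / n_bins
            for b in range(n_bins)
            if hist[b] > 0 and hist[b] >= peak_ratio * hist.max()]
```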
4.2. Picking a DAISY-like descriptor

The computation of our particular DAISY-like descriptor is inspired by the results of Winder et al. [14]. In our implementation, we compute first derivatives on samples around the keypoint. Those samples follow a particular DAISY layout. The accumulation of derivatives forms a vector that is normalised in a final step. We explain the details of this process hereafter.

Winder et al. propose a so-called T2 block, which consists in computing first-order derivatives (the gradient) for every pixel in a neighbourhood of the keypoint and obtaining a vector of four values: $\{\, |\partial_x| - \partial_x;\; |\partial_x| + \partial_x;\; |\partial_y| - \partial_y;\; |\partial_y| + \partial_y \,\}$. In our case, only selected samples are evaluated and the derivatives are oriented according to the orientation of the keypoint. Note that SURF also computes oriented gradients but cannot exploit the integral image without introducing artifacts, as Haar-wavelets are aligned with the image pixel indexing. In our case, on the other hand, computing an oriented gradient is straightforward and takes only two accesses to $L(k)$: $\partial_{x,\theta} = L(k, i - d_3 \cos\theta, j - d_3 \sin\theta) - L(k, i + d_3 \cos\theta, j + d_3 \sin\theta)$. Note that quantisation effects can occur at small scales due to the nature of image pixel accesses. We trade off a possible interpolated access for computation speed, as the loss of performance in our tests is not significant.

The extraction of those four values is performed on samples with a spatial distribution that follows an oriented DAISY-like layout. This layout consists of concentric rings and of segments (regions of the image) centered along each ring, dividing the space with a regular angle. Figure 3 shows this layout, with a circle representing each segment. We have tested different layouts, varying the number of segments and rings. The layout that produced the best results for us, while keeping the length of the descriptor relatively short, has 8 segments and 2 rings. This produces a vector of (1 + 2x8)x4 = 68 values. Moreover, the number of samples selected for each segment varies depending on the ring. In particular, the kernels have size 3x3, 5x5 and 7x7 for the central segment, the first ring and the second ring, respectively. A Gaussian weighting is also applied to the samples of each segment. Prior to accessing the samples of each segment, the whole layout (segment centers and segment samples) is rotated w.r.t. the dominant orientation of the keypoint. Finally, L2-normalisation is applied to the descriptor vector, and the result is quantised to only 8 bits. This vector is what we call a DART keypoint.
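The following sketch assembles the descriptor as described above. It is our illustration of the pipeline, not the authors' implementation: the per-segment Gaussian weighting and border checks are omitted, and the helper names are hypothetical.

```python
import numpy as np

def oriented_gradient(L, x, y, d3, theta):
    """Oriented derivatives, two accesses to L(k) each, following the
    formula above (theta + 90 degrees for the second component).
    Accesses are rounded: interpolation is traded for speed."""
    dx, dy = d3 * np.cos(theta), d3 * np.sin(theta)
    gx = (L[int(round(y - dy)), int(round(x - dx))]
          - L[int(round(y + dy)), int(round(x + dx))])
    gy = (L[int(round(y - dx)), int(round(x + dy))]
          - L[int(round(y + dx)), int(round(x - dy))])
    return gx, gy

def dart_descriptor(L, kx, ky, sigma, theta):
    """Sketch of the (1 + 2x8)x4 = 68-value descriptor: a 3x3 central
    segment plus 8 segments of 5x5 on ring 1 (radius 4*sigma) and 8 of
    7x7 on ring 2 (radius 8*sigma), samples spaced 2*sigma, the whole
    layout rotated by the dominant orientation theta."""
    d3 = max(1, int(round((2 * 3 * sigma + 1) / 6.0)))
    c, s = np.cos(theta), np.sin(theta)
    segments = [((0.0, 0.0), 3)]
    for r, k in ((4 * sigma, 5), (8 * sigma, 7)):
        segments += [((r * np.cos(a + theta), r * np.sin(a + theta)), k)
                     for a in np.arange(8) * np.pi / 4]
    desc = []
    for (cx, cy), k in segments:
        t2 = np.zeros(4)
        for u in range(-(k // 2), k // 2 + 1):
            for v in range(-(k // 2), k // 2 + 1):
                ox, oy = 2 * sigma * u, 2 * sigma * v
                x = kx + cx + c * ox - s * oy      # rotated sample grid
                y = ky + cy + s * ox + c * oy
                gx, gy = oriented_gradient(L, x, y, d3, theta)
                t2 += (abs(gx) - gx, abs(gx) + gx,
                       abs(gy) - gy, abs(gy) + gy)     # T2 block
        desc.extend(t2)
    desc = np.asarray(desc)
    desc /= np.linalg.norm(desc) + 1e-12               # L2 normalisation
    return np.clip(desc * 255, 0, 255).astype(np.uint8)  # 8-bit quantisation
```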

4.3. Optimisation of the descriptor

In the DAISY descriptor there are two parameters that further determine the layout: the distance between samples inside each segment and the distance between the center of the keypoint and the centers of the segments. Our experiments have led us to samples separated by 2σ, and a distance to the segments of 4σ for the first ring and 8σ for the second. The segments in this layout largely overlap, as can be seen in Figure 3. This property is actually desirable in our design, since we can then optimise the descriptor computation.

Figure 3. Our DAISY layout with 2 rings and 8 segments (circles) per ring, oriented at 0 degrees. Due to the distance of the segments to the center, many samples (dots) are closely located.

By looking at the coordinates of the samples that contribute to each segment, one can observe that very close samples (if not exactly the same) are accessed. We proceed by re-grouping near samples into a single sample. The X and Y oriented derivatives computed at that sample contribute to several segments with the corresponding weights. The process of determining the samples that are to be accessed is performed only once, independently of the scale. The sub-scale (σ') is a multiplicative factor applied at the moment of computing the descriptor of a given keypoint. The result is a grid of samples and the corresponding links to the segments they contribute to. This optimisation drastically reduces the number of accesses. In particular, we re-group samples located within a radius of 0.5σ. In this case, from the original 3x3 + 5x5x8 + 7x7x8 = 601 samples (2404 accesses to L(k)), the number drops to 197 (788 accesses). This reduction causes no significant loss of performance, as shown in the next section. Further re-grouping with a larger radius starts to impoverish the overall results. As can be deduced, this optimisation has less impact on keypoints at small scales, where samples are even closer together.
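A sketch of this re-grouping step, under assumptions: samples are hypothetical (x, y, segment_index, weight) tuples in σ-normalised coordinates, and the greedy first-fit merge is our reading of the text.

```python
def regroup_samples(samples, radius=0.5):
    """Re-group descriptor samples as described above: a sample landing
    within `radius` (in units of sigma) of an already kept sample is
    merged into it, and the kept sample inherits a link to the segment
    the merged one fed. Run once, independently of scale; the sub-scale
    is applied later as a multiplicative factor."""
    kept = []    # [(x, y)] grid of samples that will actually be read
    links = []   # links[i] = [(segment_index, weight), ...]
    for x, y, seg, w in samples:
        for i, (px, py) in enumerate(kept):
            if (x - px) ** 2 + (y - py) ** 2 <= radius ** 2:
                links[i].append((seg, w))   # merge into nearby sample
                break
        else:
            kept.append((x, y))
            links.append([(seg, w)])
    return kept, links
```

On the layout above this collapses the 601 nominal samples to roughly 197, i.e. 788 instead of 2404 accesses to L(k), with each kept sample feeding every segment it is linked to.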
5. Experiments

This section describes the experiments performed to assess the performance of DART.

5.1. Extraction process

The extraction process can be validated by measuring the repeatability of keypoints across different image transformations. For this validation, we use the framework proposed by Mikolajczyk et al. [10]. Figure 4 shows a comparison of the repeatability score on several sequences¹ for SIFT, SURF, DART and the determinant of the Hessian (identified as Hessian) computed with Gaussian filters over the same range of σ's as that approximated by the scale-stack of DART. The overall performance of Hessian is better than any of the compared methods. For viewpoint changes, blur and jpeg compression, we can observe that DART has similar or better performance than SIFT and SURF. Although comparable to the other techniques, scale changes seem to affect DART's extraction phase. We attribute this behaviour to the approximation of Gaussian second derivatives with triangles.

5.2. Descriptor and its optimisation

The proposed variation of the DAISY descriptor, at keypoints detected with our extractor, can be validated by measuring 1-Precision vs. Recall. For this validation, we use the framework proposed by Mikolajczyk and Schmid [9], using their Matlab implementation and sorting the nearest-neighbour correspondence assignments between images 1 and 4 of their dataset. Figure 5 depicts the performance achieved by our method with and without descriptor optimisation, together with the performance of SIFT and SURF. Note that in each case we use the corresponding keypoint extraction method. As can be seen, DART produces better results in all the evaluated sequences.

5.3. Computational complexity

The proposed algorithm for extracting and describing keypoints is designed for efficiency. In order to validate this hypothesis, keypoints on the first Graffiti image (800x640 pixels) have been extracted and described on an Intel Core 2 Duo CPU @ 2.33 GHz with 2 GB RAM. The time spent by our method is compared to the time spent by the binaries of SIFT and SURF (from the official authors' websites). Table 1 shows the time spent² for each method. We use different thresholds in DART for a fairer comparison in terms of number of keypoints. Results show a speed-up by a factor of 6 when compared to SIFT and 3 when compared to SURF.

Method | Descriptor length | Octaves | Levels | Keypoints | Time [s]
SIFT   | 128               | 5       | 3      | 3106      | 3.356
DART   | 68                | 6       | 3      | 3044      | 0.536
SURF   | 64                | 4       | 3      | 1557      | 1.207
DART   | 68                | 6       | 3      | 1540      | 0.394

Table 1. Time spent by different keypoint extraction methods on the first image of the Graffiti sequence (800x640 pixels). The number of levels is the number of intermediate filtering steps per octave.

Moreover, the following computational differences must be noted. First, DART generates more filtered images (18) than SIFT (15) or SURF (12)³. Second, both SIFT and SURF perform sub-sampling at each new octave, whereas there is no sub-sampling in DART. Although not implemented in our case, notice the highly parallelisable nature of the scale-stack extraction step. This indicates that further reduction of the computational cost is possible.

¹ We provide a sub-set of the dataset results due to space limitations.
² Elapsed time includes image loading and also writing the keypoints to an ASCII file.
³ Both SIFT and DART generate a number of filtered images that depends on the size of the input image.

Figure 4. Repeatability score (%) for image sequences, from left to right: Graffiti, Boat, Bikes and UBC. Determinant of the Hessian from Gaussian derivatives (Hessian) provided for reference. DART shows similar or better performance than SIFT and SURF for viewpoint changes, blur and jpeg compression. Scale changes seem to affect DART's extraction phase more than the other techniques.

Figure 5. 1-Precision vs. Recall for image sequences, from left to right and top to bottom: Graffiti, Boat, Bikes and Leuven. DART produces better results in all the evaluated sequences.

6. Applications

In order to further validate the applicability of the proposed method, the following sections explore two computer vision problems that successfully employ DART.

6.1. Object tracking

Three-dimensional object tracking consists in trailing the 3D pose of an object w.r.t. a static or moving camera. This is often used in applications such as Augmented Reality. In the particular case of planar objects, the problem can be solved by matching keypoints extracted from a reference image of the object against those extracted from each frame of a video stream. Once the correspondences are established, the pose of the object can be estimated. This approach is generally known as tracking-by-detection [5]. We have implemented nearest-neighbour matching of DART descriptors. We eliminate correspondences whose Euclidean distance is beyond a threshold, as well as those where the ratio between the distances to the first and second best match is greater than 0.7. Figure 6 shows snapshots of tracking-by-detection using DART.
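A minimal sketch of this matching scheme follows. The absolute-distance threshold value is an assumption (the paper does not state it); the ratio test at 0.7 is as described above.

```python
import numpy as np

def match_dart(ref, frame, max_dist=60.0, ratio=0.7):
    """Nearest-neighbour matching with the two rejection tests above:
    an absolute Euclidean-distance threshold (value assumed) and the
    first-to-second-best ratio test at 0.7, applied to the 8-bit
    descriptor vectors (rows of ref and frame)."""
    matches = []
    f = frame.astype(np.float32)
    for i, d in enumerate(ref.astype(np.float32)):
        dists = np.linalg.norm(f - d, axis=1)
        if len(dists) < 2:
            break
        j = int(np.argmin(dists))
        second = np.partition(dists, 1)[1]      # second-best distance
        if dists[j] <= max_dist and dists[j] < ratio * second:
            matches.append((i, j))              # keep (ref, frame) pair
    return matches
```

The surviving correspondences are then fed to the pose estimator of the tracking-by-detection loop.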

Figure 6. Object tracking-by-detection using DART keypoint matching. Top: synthetic reference image of the object. Bottom: different video frames with correct tracking under scale and viewpoint changes. Orange lines show keypoint correspondences. A blue rectangle contours the tracked object.

6.2. 3D Reconstruction

There is a vast variety of scene reconstruction techniques. In our case, DART keypoints are triangulated from two or more consistent views to generate a 3D point cloud. To perform this task, Structure-from-Motion and epipolar geometry are used to build the geometric representation of a real scene using a video sequence as input⁴. The video sequence has been captured with a hand-held camera performing panning and travelling motion. Figure 7 shows snapshots of the point cloud generated by triangulating 167411 DART keypoints from 884 frames.

⁴ This video reconstruction framework is called Videosurfing and was developed in our lab. A running prototype of the system is available at: http://surfing.tidprojects.com.

Figure 7. 3D reconstruction using DART keypoints on a video sequence captured with a hand-held camera. Top: first frame. Bottom: coloured cloud of 3D points and the location of the camera for each frame (represented in blue).

7. Conclusions

We have proposed a novel technique to efficiently extract DAISY-like keypoints. We have contributed an extraction method that approximates the determinant-of-Hessian scale-space with efficiently computed piece-wise triangle filters. We have also introduced a variation of the DAISY descriptor with an optimisation of the sampling space. The method has been compared to similar techniques in terms of repeatability, precision vs. recall and computational cost. In terms of repeatability, our extractor has comparable or better performance than SIFT and SURF. In the case of precision-recall, the variation of the DAISY descriptor is a clear benefit w.r.t. the other methods. We have also shown a reduction of the computational complexity, with speed-up factors of 6 when compared to SIFT and 3 when compared to SURF. Together with this evaluation, we have provided proof of applicability to object tracking and 3D reconstruction applications. We are currently experimenting with DARTs in the context of object recognition with a bag-of-visual-words approach, obtaining promising results.

8. Acknowledgements

Telefónica I+D participates in the Torres Quevedo subprogram (MICINN), co-financed by the European Social Fund, for the recruitment of researchers. This work was developed within the MobiAR project, financed by the MITyC (Spanish Government) inside the Avanza program. We would also like to thank Dr. Vincent Lepetit for his fruitful comments at the early stage of this paper.

References

[1] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded Up Robust Features. In Proc. European Conference on Computer Vision (ECCV), May 2006.
[2] M. Brown and D. Lowe. Invariant features from interest point groups. In Proc. British Machine Vision Conference (BMVC), pages 656–665, September 2002.
[3] C. Harris and M. Stephens. A combined corner and edge detector. In Alvey Vision Conf., pages 147–151, 1988.
[4] P. Heckbert. Filtering by repeated integration. In Proc. Computer Graphics (SIGGRAPH), volume 20, pages 315–321, 1986.
[5] V. Lepetit and P. Fua. Monocular model-based 3D tracking of rigid objects. Foundations and Trends in Computer Graphics and Vision, 1(1):1–89, September 2005. ISBN: 1-933019-03-4.
[6] V. Lepetit and P. Fua. Keypoint recognition using randomized trees. IEEE Trans. on Pattern Analysis and Machine Intelligence, 28(9):1465–1479, 2006.
[7] T. Lindeberg. Scale-Space Theory in Computer Vision. Kluwer, 1994.
[8] D. Lowe. Distinctive image features from scale-invariant keypoints. Intl. Journal of Computer Vision, 60(2):91–110, 2004.
[9] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE Trans. on Pattern Analysis and Machine Intelligence, 27(10):1615–1630, 2005.
[10] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. Intl. Journal of Computer Vision, 65(1/2):43–72, 2005.
[11] E. Rosten and T. Drummond. Machine learning for high-speed corner detection. In Proc. European Conference on Computer Vision (ECCV), volume 1, pages 430–443, 2006.
[12] E. Tola, V. Lepetit, and P. Fua. A fast local descriptor for dense matching. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 1–8, Los Alamitos, CA, USA, 2008. IEEE Computer Society.
[13] W. Wells. Efficient synthesis of Gaussian filters by cascaded uniform filters. IEEE Trans. on Pattern Analysis and Machine Intelligence, 8(2), 1986.
[14] S. Winder, G. Hua, and M. Brown. Picking the best DAISY. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 178–185, Los Alamitos, CA, USA, 2009. IEEE Computer Society.
[15] S. A. J. Winder and M. Brown. Learning local image descriptors. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 1–8, Los Alamitos, CA, USA, 2007. IEEE Computer Society.