
Discovering Visual Hierarchy through Unsupervised Learning

Haider Razvi
hrazvi@stanford.edu

1 Introduction:

We present a method for discovering visual hierarchy in a set of images. Automatically grouping and organizing images has applications for personal use, where personal photographs can be arranged as overexposed or underexposed, urban or rural landscapes, close-ups or family gatherings (lots of people), among various other useful hierarchies. It also has applications in the educational/research domain, where it would be interesting, and perhaps useful, to find out whether a large dataset such as ImageNet [1] can be arranged into other interesting hierarchies. Effective hierarchies allow for efficient searching of images and better image recommendations (i.e. other images similar to a certain image of interest). The problem is broken down into selecting appropriate features of interest from an image and then organizing images based on these features. We must be able to handle variations in scale, viewpoint, lighting, occlusion, etc.

2 Extracting Points of Interest:

2.1 MSER:

The MSER [2] method is used to detect blobs in images; blobs are patches in the image that are either brighter or darker than their surroundings. Blob detectors work differently from edge detectors and provide complementary information about patches in the image, such as their shape. They can be used to obtain regions of interest for further processing. The MSER algorithm sorts pixels by intensity and considers pixels below a threshold to be black and those above it to be white. It does this for a range of thresholds and determines regions based on the stability of a group of pixels with respect to the change in threshold. It is invariant to affine transformations of image intensities and is quite robust to viewpoint and scale changes. Once the MSER method returns points of interest (also called region seeds), lists of pixels belonging to each region are extracted. The VLFeat library is used to do this [7]. A binary image is generated for each feature by setting pixels not belonging to that region to zero. Finally, SIFT-based descriptors are generated for these features. This is similar to the approach outlined in David Lowe's paper [4]. The descriptors are 128-dimensional vectors with values from 0-255.

The MSER method can sometimes give incorrect results where it considers multiple adjacent objects in the image to be a single large feature/region. Two methods are used to guard against this (a code sketch of the filtering follows the list):

1. All features which comprise more than 8% of the entire image's pixels are rejected.
2. All features which comprise between 5-8% of the image's pixels are candidates to be broken down into sub-features. For this purpose a bounding circle of the feature is calculated and SIFT is used to detect features within this bounding circle.
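As a minimal illustration of the two steps above, the following Python sketch uses OpenCV's MSER as a stand-in for the VLFeat interface [7] that the project actually uses; the function name and structure are assumptions, and only the 5%/8% thresholds come from the list.

    import cv2
    import numpy as np

    def mser_region_masks(gray):
        # Detect MSER regions; each region is an array of (x, y) pixel coordinates.
        mser = cv2.MSER_create()
        regions, _ = mser.detectRegions(gray)
        total_pixels = gray.shape[0] * gray.shape[1]
        masks, split_candidates = [], []
        for pts in regions:
            # Binary image for this feature: region pixels white, everything else zero.
            mask = np.zeros(gray.shape, dtype=np.uint8)
            mask[pts[:, 1], pts[:, 0]] = 255
            frac = len(pts) / total_pixels
            if frac > 0.08:
                continue                       # step 1: reject features covering > 8%
            elif frac > 0.05:
                split_candidates.append(mask)  # step 2: split via SIFT in a bounding circle
            else:
                masks.append(mask)
        return masks, split_candidates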

Note: we don't re-run MSER on the area within the bounding circle, as MSER would again consider any pixels falling within this bounding circle to be one feature. This almost always gives the wrong shape, so running MSER again isn't very useful.

Figure-1 shows MSER features extracted for a sample image, along with a binary image for one of the features. Figure-1E shows an example where MSER goes wrong and considers multiple objects to be a single large feature; the calculated bounding circle is also shown. Figure-1F shows an example where MSER performs reasonably well.

[FIGURE-1] [A] Original image [B] SIFT detected features [C] MSER detected features [D] MSER region contours [E] MSER incorrectly detects all this as a single large feature [F] MSER correctly detects the truck in the background

To calculate the bounding circle for a detected feature, the binary image, represented as a 2-D matrix, is linearly indexed (using the same conventions as MATLAB) with the top-left corner of the image taken as index 1. The white pixel with the maximum linear index, w_r, gives the bottom pixel in the right-most column of the feature/region, while the white pixel with the minimum linear index, w_l, gives the top pixel in the left-most column of the region. In MATLAB-style code (B is the binary region image):

    idx = find(B);                            % linear indices of white pixels
    [y_r, x_r] = ind2sub(size(B), max(idx));  % coordinates of w_r
    [y_l, x_l] = ind2sub(size(B), min(idx));  % coordinates of w_l
    C_x = (x_r + x_l) / 2;
    C_y = (y_r + y_l) / 2;
    C_r = alpha * sqrt((C_x - x_r)^2 + (C_y - y_r)^2);

where alpha is a scaling factor, empirically selected to be 0.667.

2.2 SIFT:

SIFT [3] is a method for detecting and also encoding features in an image. It is invariant to scale, orientation, and affine distortion, and is partially invariant to illumination changes. It applies a difference-of-Gaussians function to a series of smoothed and resampled images, taking the minima and maxima of the result as points of interest.
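Continuing the earlier sketch, SIFT detection restricted to such a bounding circle might look as follows; OpenCV again stands in for VLFeat, and its contrastThreshold and edgeThreshold parameters (default values shown) play the roles of the peak and edge thresholds discussed next.

    import cv2
    import numpy as np

    sift = cv2.SIFT_create(contrastThreshold=0.04, edgeThreshold=10)

    def sift_in_circle(gray, c_x, c_y, c_r):
        # Restrict detection to the bounding circle of an oversized MSER region.
        mask = np.zeros(gray.shape, dtype=np.uint8)
        cv2.circle(mask, (int(c_x), int(c_y)), int(c_r), 255, thickness=-1)  # filled circle
        keypoints, descriptors = sift.detectAndCompute(gray, mask)
        return keypoints, descriptors  # one 128-dimensional descriptor per keypoint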

The SIFT method tends to detect a lot of very small features in the image. Its two parameters, the peak threshold and the (non-)edge threshold, are set to limit the detection of many very small points as features. Figure-1B shows SIFT-based features detected for an image; these are drawn as circular patches where the line from the center of the circle shows the feature's orientation.

4 Dataset & Data Biasing:

My dataset consists of images across 5 categories, including cars, scooters, faces, and buildings, with an equal number of images in each category. This is a relatively easy dataset, as the categories are quite distinct from each other. Scooters and cars do share some similarity in their features (shape of the wheels, in some cases headlights, etc.). This overlap is included specifically to see whether the algorithm can distinguish between objects that have some percentage of similar features but are distinct overall.

To improve clustering performance, I initially take 3 images from each category, and for the car and scooter categories I manually segment out the object (car/scooter). For the remaining categories I don't do any segmentation but pick out images with less cluttered backgrounds, so that the majority of the features detected in the image actually belong to the object of interest. I call this data biasing, and I run K-means clustering on only these 15 images to cluster the SIFT features into 5 clusters and the MSER features into 2 clusters, as discussed previously. These cluster models are saved. Figure-2 shows an example of a segmented image (car) and a non-segmented image (building); note that the building image also has a few cars and people in it.

[FIGURE-2] Segmented image; non-segmented image

5 Representing Images as Feature Vectors:

For each of the original images (note, these are not the segmented images but rather their original counterparts) SIFT and MSER features are detected. Taking each image individually, its SIFT features are assigned to the 5 clusters of the saved SIFT cluster model above, and a 5-dimensional feature vector S is generated for the image; S_i contains the number of SIFT features of the image that mapped to cluster i. Similarly, a 2-dimensional feature vector is generated for the image based on its detected MSER features and the saved MSER cluster model. Thus one 5-dimensional SIFT feature vector and one 2-dimensional MSER feature vector are produced for each image.
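A minimal sketch of this counting step, assuming the saved cluster models are scikit-learn KMeans objects (the report does not name its K-means implementation):

    import numpy as np
    from sklearn.cluster import KMeans

    def image_feature_vector(descriptors, cluster_model):
        # descriptors: (n, d) array of one image's local features.
        # Entry i of the result counts the features that mapped to cluster i.
        assignments = cluster_model.predict(descriptors)
        return np.bincount(assignments, minlength=cluster_model.n_clusters)

    # Usage sketch: fit once on the descriptors of the 15 biased images,
    # then apply to every image in the dataset.
    # sift_model = KMeans(n_clusters=5).fit(biased_sift_descriptors)
    # S = image_feature_vector(image_sift_descriptors, sift_model)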

6 Clustering Images:

The EM algorithm is used to cluster images based on only their 5-dimensional SIFT feature vectors. A cluster is given an identity based on the category that has the maximum number of images falling into that cluster. Figure-3A shows the clustering produced; the crosses [x] represent images that were correctly clustered, while the squares represent incorrectly clustered images. Similarly, the EM algorithm is used to cluster images based on only their 2-dimensional MSER feature vectors. The resulting clustering is shown in Figure-3B.
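This clustering and labeling step can be sketched with a Gaussian mixture fit by EM; scikit-learn here is an assumption, as the report does not name its EM implementation.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def cluster_and_score(X, labels, n_clusters=5):
        # X: one feature vector per image; labels: integer category per image.
        pred = GaussianMixture(n_components=n_clusters, random_state=0).fit_predict(X)
        correct = 0
        for c in range(n_clusters):
            members = labels[pred == c]
            if members.size:
                # Cluster identity = majority category; its members count as correct.
                correct += np.bincount(members).max()
        return correct / len(labels)  # fraction of correctly clustered images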

[FIGURE-3] 3A: clustering on SIFT feature vectors; 3B: clustering on MSER feature vectors

[FIGURE-4] Bar charts: Predicted Cluster Decomposition [SIFT]; Actual Cluster Assignments [SIFT]; Predicted Cluster Decomposition [MSER]; Actual Cluster Assignments [MSER]

Figure-4 is another important figure. Under the Predicted Cluster Decomposition heading it shows the number of images that were assigned by EM to each cluster, and the color coding shows which categories these assigned images actually belonged to. For example, in Predicted Cluster Decomposition [SIFT], 22 images were assigned by EM to one cluster; 19 of these images actually belonged to that cluster's category, while 2 belonged to a second category and 1 to a third. The complementary figures labeled Actual Cluster Assignments show, for the images belonging to each category, how many ended up assigned to each cluster. For example, in Actual Cluster Assignments [SIFT], 19 of one category's images were assigned to their own cluster, but 21 of them were assigned to another cluster. Similar results are also shown for MSER.

7 Combining SIFT & MSER:

In the figures above, note that SIFT is conservative about assigning images to the face category: it has high precision for that cluster but low recall, as most face images end up in other clusters. MSER behaves differently: the face cluster is the largest, and most face images do land in this cluster. However, while MSER has higher recall for this cluster, it has lower precision, with images from several other categories also falling into it. I aim to combine the MSER and SIFT features in an attempt to achieve better results.

The simplest approach I tried first was to concatenate the SIFT and MSER feature vectors for an image, so that an image is now represented by one 7-dimensional feature vector, and then to cluster the images using EM. This worked really well for cars and scooters, which had perfect results: all cars ended up in one cluster with no images from any other category, and the same was true for scooters. Faces improved too; however, accuracy for buildings especially suffered. The overall accuracy (percentage of images correctly clustered) was 73%, compared with 63.5% for MSER-only EM clustering and a nicer 84% for SIFT-only EM clustering. Simply concatenating the SIFT and MSER feature vectors is probably not a good approach; I am looking into other schemes and consider this future work.

Figure-5 shows the clustering produced by EM run on SIFT-only feature vectors. For shortage of space I did not include all images; the mosaic contains most of the negative examples (incorrectly clustered images) but fewer of the correctly clustered images than there actually were.

8 References:

[1] ImageNet, http://www.image-net.org/
[2] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In BMVC, 2002.
[3] D. G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the International Conference on Computer Vision, vol. 2, pp. 1150-1157, 1999.
[4] P.-E. Forssén and D. G. Lowe. Shape descriptors for maximally stable extremal regions. In ICCV, 2007.
[5] G. Csurka, C. Bray, C. Dance, and L. Fan. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, 2004.
[6] D. Martin, C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using brightness and texture. In NIPS, 2003.
[7] VLFeat, http://www.vlfeat.org/

[FIGURE-5] Mosaic of the clustering produced by EM on SIFT-only feature vectors