Image Retrieval (Matching at Large Scale)

Victoria Small
5 years ago
1 Image Retrieval (Matching at Large Scale)

2 Image Retrieval (matching at large scale) At a large scale, the problem of matching similar images becomes the problem of retrieving similar images given a query image. Effective solutions to this problem require an indexing structure that records where to find all images in which a feature occurs (the same matched local feature can be present in many images).

3 Indexing images with local features In text documents, where the problem is to find all pages on which a word occurs, inverted indexes are the common solution. Following the analogy, visual vocabularies offer a simple but effective way to index images efficiently with an inverted file.
[Figure: a text inverted index next to a visual-word inverted file. (a) All database images are loaded into the index, mapping words to image numbers. (b) A new query image is mapped to indices of database images that share a word. Credit: K. Grauman, B. Leibe]
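The inverted-file idea described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind the figure: the function names `build_inverted_index` and `candidate_images` and the toy word ids are assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(image_words):
    """Map each visual word id to the sorted list of image ids that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return {word: sorted(ids) for word, ids in index.items()}


def candidate_images(index, query_words):
    """Count, per database image, how many distinct query words it shares."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, []):
            votes[image_id] += 1
    return dict(votes)


# Toy database (hypothetical word ids): image id -> visual words found in it.
database = {1: [7, 23], 2: [7, 62], 3: [76, 8, 91]}
index = build_inverted_index(database)

# A query image containing words 7, 8 and 91 only touches the index
# entries for those three words, never the full database.
votes = candidate_images(index, [7, 8, 91])
```

Only the postings of words that occur in the query are visited, which is what makes the lookup cheap when the vocabulary is large and each image contains few of its words.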

4 A seminal system: Video Google Video Google [Sivic and Zisserman, ICCV 2003] is a seminal system that maps visual features to visual words using k-means clustering and supports effective content-based retrieval of visual data. The fundamental idea of the paper is to treat each frame as a document and then find the visual words it contains. With image features represented as visual words, an inverted file index is used to index them efficiently for content-based retrieval. Video Google retrieves the key frames and shots of a video that contain a particular object with the ease, speed, and accuracy with which Google retrieves text documents (web pages) containing particular words.
[Figure: text retrieval (word, document ID) compared with image retrieval (feature, word number, list of frame numbers)]

5 Video Google visual analogy
Text retrieval -> Video Google:
- Word -> Visual descriptor
- Stemming -> Using centroids of visual feature clusters
- Text document (page) -> Video frame
- Document corpus (book) -> Video
Text retrieval:
1. Assume a vocabulary
2. Parse documents into words
3. Perform stemming: "walk" = {walk, walking, walks, ...}
4. Use a stop list to reject very common words
5. For each document, define a vector with components given by the frequency of occurrence of the words the document contains
6. Store the vector in an inverted file
The Video Google algorithm:
1. Detect affine covariant regions in each key frame of the video
2. Reject unstable regions
3. Build the visual vocabulary
4. Remove stop-listed words
5. For each image, compute the weighted document frequency based on the occurrence of the visual words
6. Build the inverted file index

6 The Video Google algorithm Pre-processing (off-line):
Step 1. Calculate viewpoint-invariant regions and region descriptors:
- Shape Adapted (SA) region: elliptical shape adaptation about an interest point, centered on corner-like features, using the Harris-affine operator
- Maximally Stable (MS) region: MSER segmentation to extract blobs of high contrast with respect to their surroundings
A 720 x 576 pixel video frame yields about 1200 regions. Each region is represented by a 128-dimensional vector using the SIFT descriptor.
Step 2. Reject unstable regions: any region that does not survive for more than 3 frames is rejected. This stability check roughly halves the number of regions, to about 600 per frame.
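The stability check in Step 2 can be illustrated with a short Python sketch. It assumes that region tracking has already been done and that each track is given as the list of frame numbers in which the region was found; the function name `reject_unstable` and the track ids are hypothetical.

```python
def reject_unstable(tracks, min_frames=3):
    """Keep only the tracks that survive for more than `min_frames` frames.

    `tracks` maps a (hypothetical) track id to the frame numbers in which
    the tracked region was detected.
    """
    return {
        track_id: frames
        for track_id, frames in tracks.items()
        if len(set(frames)) > min_frames
    }


# Toy tracks: r1 is seen in 5 frames, r2 in 2 frames, r3 in exactly 3 frames.
tracks = {"r1": [0, 1, 2, 3, 4], "r2": [5, 6], "r3": [1, 2, 3]}
stable = reject_unstable(tracks)
```

With the threshold from the slide, a region seen in exactly 3 frames is still rejected, because it must survive for more than 3 frames.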

7 Video Google: Maximally Stable (MS) regions are shown in yellow; Shape Adapted (SA) regions are shown in blue/cyan.

8 Zoomed view: MS regions in yellow, SA regions in cyan.

9 Step 3. Build the visual vocabulary: use k-means clustering to vector quantize the descriptors into clusters. The Mahalanobis distance is used as the distance function for k-means clustering: $d(x_1, x_2) = \sqrt{(x_1 - x_2)^\top \Sigma^{-1} (x_1 - x_2)}$, where $\Sigma$ is the covariance matrix of the descriptors.
Step 4. Remove stop-listed visual words: the most frequent visual words, which occur in almost all images (such as highlights, which occur in many frames), are rejected.
Step 5. Compute the tf-idf weighted document frequency vector: $t_i = \frac{n_{id}}{n_d} \log \frac{N}{N_i}$, where
- n_id = number of times term i appears in document d
- n_d = number of terms in document d
- N = number of documents in the collection
- N_i = number of documents in which term i appears
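The weighting in Step 5 can be sketched as follows, assuming the standard tf-idf form: term frequency n_id / n_d multiplied by log(N / N_i). The function name `tf_idf_vectors` and the toy word ids are assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return, per document, a sparse vector {word: (n_id / n_d) * log(N / N_i)}.

    `docs` maps a document (frame) id to its list of visual word ids.
    """
    n_docs = len(docs)                    # N
    doc_freq = Counter()                  # N_i for every word i
    for words in docs.values():
        doc_freq.update(set(words))

    vectors = {}
    for doc_id, words in docs.items():
        counts = Counter(words)           # n_id for every word i in this document
        n_d = len(words)                  # n_d
        vectors[doc_id] = {
            word: (n_id / n_d) * math.log(n_docs / doc_freq[word])
            for word, n_id in counts.items()
        }
    return vectors


# Toy collection: word 2 occurs in every document, so its weight is zero.
vectors = tf_idf_vectors({"a": [1, 1, 2], "b": [2, 3]})
```

A word that appears in every document gets weight log(1) = 0, which is the same intuition that motivates the stop list in Step 4.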

10 Vocabulary building
- Subset of 48 shots selected: 10k frames = 10% of the movie
- Region construction (SA + MS): 10k frames * 1200 = 1.2E7 regions
- Frame tracking, rejecting unstable regions: ~200k regions
- Clustering descriptors using k-means
- SIFT descriptor representation
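The clustering step can be sketched with a small pure-Python k-means. Note the simplifications, which are assumptions of this example and not the system's actual procedure: it uses squared Euclidean distance instead of the Mahalanobis distance, takes the first k points as initial centroids, and runs on toy 2-D points instead of 128-dimensional SIFT descriptors.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(point, centroids):
    """Index of the nearest centroid, i.e. the visual word id of `point`."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def k_means(points, k, iterations=20):
    """Plain Lloyd iterations with a naive deterministic initialization."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[assign(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


# Toy descriptors forming two well-separated groups.
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
vocabulary = k_means(points, k=2)

# Quantizing a new descriptor means looking up its nearest centroid.
word_id = assign((9, 9), vocabulary)
```

The returned centroids play the role of the visual vocabulary, and `assign` is the vector quantization applied to both database and query descriptors.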

11 Step 6. Build the inverted-file indexing structure. An inverted file is structured like an ideal book index: it has an entry for each word in the corpus, followed by a list of all the documents (and the positions in each document) in which the word occurs.
[Figure: (a) All database images are loaded into the index, mapping words to image numbers. (b) A new query image is mapped to indices of database images that share a word.]

12 The Video Google algorithm for content-based retrieval Run-time (on-line): take a query image region, then:
1. Determine the set of visual words within the query region
2. Retrieve keyframes based on visual word frequencies
3. Re-rank the top keyframes using spatial consistency
Processing pipeline: generate query descriptors; use nearest neighbor to build the query vector; use the inverted index to find relevant frames; calculate the distance to relevant frames; rank results.
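The retrieval stage can be sketched as below. The sketch assumes that frames and the query are already represented as sparse weighted word vectors, and it ranks by cosine similarity, which is one common choice and an assumption here; the names `cosine` and `rank_frames` and the toy data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {word: weight}."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_frames(query_vec, frame_vecs, inverted_index):
    """Score only the frames that share a word with the query, best first."""
    candidates = set()
    for word in query_vec:
        candidates.update(inverted_index.get(word, ()))
    scored = [(cosine(query_vec, frame_vecs[f]), f) for f in candidates]
    # Sort by descending score; break ties by frame id for determinism.
    return [f for _, f in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Toy data: frame 3 shares no word with the query and is never scored.
frame_vecs = {1: {7: 1.0}, 2: {7: 0.5, 8: 0.5}, 3: {9: 1.0}}
inverted_index = {7: [1, 2], 8: [2], 9: [3]}
ranking = rank_frames({7: 1.0, 8: 1.0}, frame_vecs, inverted_index)
```

The spatial-consistency re-ranking of the top results is a separate step and is not part of this sketch.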

13 Spatial consistency:
- Matched covariant regions in the retrieved frames should have a spatial arrangement similar to that of the outlined region in the query image
- To verify a pair of matching regions (A, B), a circular search area is defined by the k (5 in the figure) spatial nearest neighbors in both frames
- Each match that lies within the search areas in both frames casts a vote in support of the match (in the example, three supporting matches are found)
- Matches with no support are rejected
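The voting scheme can be approximated with the following Python sketch. It is a simplification and an assumption of this example: neighbors are taken only among the matched regions themselves, and a match votes for another when it is among that match's k nearest neighbors in both frames. The function names and coordinates are hypothetical.

```python
def nearest(index, points, k):
    """Indices of the k points closest to points[index], excluding itself."""
    px, py = points[index]
    others = [j for j in range(len(points)) if j != index]
    others.sort(key=lambda j: (points[j][0] - px) ** 2 + (points[j][1] - py) ** 2)
    return set(others[:k])


def spatial_votes(matches, k=5):
    """For each match, count the other matches that are neighbors in both frames."""
    query_pts = [q for q, _ in matches]
    frame_pts = [f for _, f in matches]
    return [
        len(nearest(i, query_pts, k) & nearest(i, frame_pts, k))
        for i in range(len(matches))
    ]


def filter_matches(matches, k=5):
    """Reject matches that receive no supporting vote."""
    votes = spatial_votes(matches, k)
    return [m for m, v in zip(matches, votes) if v > 0]


# Each match is ((x, y) in the query, (x, y) in the retrieved frame).
# The first three matches move together; the last one is an outlier.
matches = [
    ((0, 0), (100, 100)),
    ((2, 0), (102, 100)),
    ((0, 3), (100, 103)),
    ((50, 50), (0, 0)),
]
votes = spatial_votes(matches, k=1)
kept = filter_matches(matches, k=1)
```

The number of votes per match can also serve as a score for re-ranking frames, in addition to discarding unsupported matches.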

14 How Video Google works: query region and its close-up.

15 Original matches based on visual words

16 Original matches based on visual words

17 Matches after using the stop-list

18 Final set of matches after filtering on spatial consistency

19 Video Google

20 Video Google

21 Video Google Performance Analysis
- Q: number of queried descriptors (~10^2)
- M: number of descriptors per frame (~10^3)
- N: number of key frames per movie (~10^4)
- D: descriptor dimension (128, ~10^2)
- K: number of words in the vocabulary (16 x 10^3 ~ 10^3)
- α: ratio of documents that do not contain any of the Q words (~0.1)
Computational cost:
- Nearest neighbor = QMND
- Video Google: query vector quantization + distance = QKD + Q(αN)
- Improvement factor: ~10^6
