Video Google faces. Josef Sivic, Mark Everingham, Andrew Zisserman. Visual Geometry Group University of Oxford


The objective: retrieve all shots in a video, e.g. a feature-length film, containing a particular person, via a visually defined search on faces. Pretty Woman [Marshall, 1990]. Applications: intelligent fast-forward on characters; pulling out all videos of x from 1000s of digital-camera mpegs.

Uncontrolled viewing conditions. Image variations due to: pose/scale, lighting, partial occlusion, expression; cf. standard face databases.

The ideal situation (face space): despite all these image variations, we want different identities to map to distinct points.

and reality face space

Approach: minimize variations due to pose and lighting by choice of feature vector; use multiple face exemplars to represent expressions; represent each identity by a distribution over exemplar feature vectors.

The benefits of video Automatically associate expression exemplars

Outline: 1. Obtaining sets of faces using tracking within shots (identity-free). 2. Matching face sets within shots (requires identity matching). 3. Indexing for efficient retrieval. Live demo: Pretty Woman, Groundhog Day, Casablanca.

1. Obtaining sets of faces by tracking within a shot

Face detection: need to associate detections with the same identity across frames.

Face detector [Mikolajczyk et al., ECCV 2004]. In the tradition of Rowley et al. '96, Schneiderman & Kanade '00, Viola & Jones '01, and also inspired by the SIFT descriptor of Lowe '99. Local features: gradient quantized orientations, Laplacian. Weak classifiers: from feature occurrence and co-occurrence. Strong classifier: built using AdaBoost. Operate at a high-precision (90%) operating point: few false positives.

Face detection performance on the CMU-MIT test data: 125 images with 481 frontal faces; ROC curve. Operate at a high-precision (90%) operating point: few false positives.

Track local affine covariant regions on faces: detect regions independently in each frame; a region's size and shape are not fixed, but automatically adapt to the image intensity to cover the same physical surface (i.e. the pre-image is the same surface region); region tracking then connects the detected regions temporally. Tracks persist through pose changes, partial occlusions, and face deformations.

Viewpoint covariant segmentation. Characteristic scales (size of region): Lindeberg and Garding, ECCV 1994; Lowe, ICCV 1999; Mikolajczyk and Schmid, ICCV 2001. Affine covariance (shape of region): Baumberg, CVPR 2000; Matas et al., BMVC 2002; Mikolajczyk and Schmid, ECCV 2002; Schaffalitzky and Zisserman, ECCV 2002; Tuytelaars and Van Gool, BMVC 2000. [Figure: maximally stable regions; shape adapted regions]

Tracking covariant regions in two stages. Goal: develop very long, good-quality tracks. Stage I: match regions detected in neighbouring frames. Problems: e.g. missing detections. Stage II: repair tracks by region propagation.
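Stage I can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: regions are represented only by descriptor vectors, matching is greedy nearest-neighbour with a distance threshold, and the Stage II propagation repair is omitted; the function name and threshold are assumptions.

```python
import numpy as np

def link_tracks(frames, max_dist=0.5):
    """frames: list of (n_i, d) arrays, one descriptor per detected region.
    Chain regions across neighbouring frames by nearest-descriptor match;
    a track ends when no sufficiently close match exists."""
    tracks = [[(0, i)] for i in range(len(frames[0]))]
    open_tracks = list(range(len(tracks)))
    for t in range(1, len(frames)):
        prev, curr = frames[t - 1], frames[t]
        used, still_open = set(), []
        for trk in open_tracks:
            f, i = tracks[trk][-1]
            if f != t - 1:
                continue  # track already ended earlier
            d = np.linalg.norm(curr - prev[i], axis=1)
            j = int(d.argmin())
            if d[j] < max_dist and j not in used:
                tracks[trk].append((t, j))
                used.add(j)
                still_open.append(trk)
        # regions not matched to any existing track start new tracks
        for j in range(len(curr)):
            if j not in used:
                tracks.append([(t, j)])
                still_open.append(len(tracks) - 1)
        open_tracks = still_open
    return tracks
```

Each track is a list of (frame, region-index) pairs, which is enough to build the region tubes used below.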

Example I original sequence

Example I tracked regions

Example II tracked regions

Region tubes

Connecting face detections temporally. Goal: associate the face detections of each character within a shot. Approach: agglomeratively merge face detections based on connecting region tubes; require a minimum number of region tubes to overlap both face detections in their respective frames.
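The merging step above can be sketched with union-find. A minimal sketch under stated assumptions: each face detection carries the set of region-tube IDs that pass through it, and two detections are merged when they share at least a threshold number of tubes; the function name and threshold value are illustrative, not the paper's.

```python
def merge_detections(tubes_per_detection, min_shared_tubes=2):
    """tubes_per_detection: list of sets of region-tube IDs, one per detection.
    Returns a cluster label per detection (same label = same identity group)."""
    n = len(tubes_per_detection)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    # agglomeratively link every pair connected by enough tubes
    for i in range(n):
        for j in range(i + 1, n):
            if len(tubes_per_detection[i] & tubes_per_detection[j]) >= min_shared_tubes:
                union(i, j)

    # relabel union-find roots as consecutive cluster IDs
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]
```

Because union-find is transitive, detections in distant frames end up in the same face tube as long as a chain of tube-connected detections links them.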

Example: Buffy the Vampire Slayer Breakfast Scene

raw face detections

face tubes

2. Matching face sets within shots

Face feature vector: a possible approach is to determine 3D pose/illumination in the manner of the Blanz, Romdhani & Vetter 3D morphable model; instead we concentrate on near-frontal pose and compensate for pose/illumination variation using descriptors designed with built-in invariance: multiple overlapping SIFTs.

Face feature vector - summary: multiple, overlapping, affinely transformed local SIFT descriptors (face detector; eyes/nose/mouth; multiple overlapping SIFTs), inspired by the Elastic Bunch Graph Matching representation of von der Malsburg et al. and the Component Approach of Heisele et al.

Detect face features for rectification. [Figure: video with detected features; close-up; rectified face]

Eyes/nose/mouth detectors Training data: ~5,000 images with hand-marked facial features Scale determined by face detector Fixed-size patches extracted around feature points

Constellation-like appearance/shape model. Model shape X (2-D points) and appearance A (patches at the points in X); appearance and shape are assumed independent. The appearance of a feature is modelled as a mixture of Gaussians (GMM); an EM (mixture of probabilistic PCA) algorithm is used to estimate the parameters. The joint position of all features is modelled as a (mixture of) Gaussians with full covariance, so the positions of all features interact. [Figure: feature positions x_i, appearance patches a_j, GMM clusters]

SIFT descriptor [Lowe 1999] rectified face Create array of orientation histograms 8 orientations x 3x3 spatial bins = 72 dim.
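The 3x3 spatial grid of 8-bin orientation histograms can be sketched as follows. This is a simplified illustration of the descriptor layout, not Lowe's exact implementation (it omits Gaussian centre weighting, trilinear bin interpolation, and rotation normalization); the function name is an assumption.

```python
import numpy as np

def sift_like_descriptor(patch, n_spatial=3, n_orient=8):
    """patch: 2-D grayscale array. Returns an (n_spatial^2 * n_orient)-dim
    descriptor: 3x3 spatial bins x 8 orientation bins = 72 dimensions."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)                          # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)     # orientation in [0, 2*pi)
    h, w = patch.shape
    desc = np.zeros((n_spatial, n_spatial, n_orient))
    ys = np.linspace(0, h, n_spatial + 1).astype(int)
    xs = np.linspace(0, w, n_spatial + 1).astype(int)
    for i in range(n_spatial):
        for j in range(n_spatial):
            m = mag[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].ravel()
            a = ang[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].ravel()
            bins = np.floor(a / (2 * np.pi) * n_orient).astype(int) % n_orient
            for b, wgt in zip(bins, m):
                desc[i, j, b] += wgt  # magnitude-weighted orientation vote
    desc = desc.ravel()
    n = np.linalg.norm(desc)
    return desc / n if n > 0 else desc  # per-descriptor illumination normalization
```

The final L2 normalization is what gives each local SIFT its own illumination invariance, matching the point on the next slide that lighting may vary across the face.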

Face feature vector - summary Benefits of local SIFT descriptors: SIFT unaffected by small localization errors in eyes/nose/mouth detector Centre weighting de-emphasizes background (no foreground segmentation) Illumination normalization per SIFT allows lighting to vary across face multiple overlapping SIFTs SIFT for each facial feature, i.e. 5 x 72 = 360 vector for entire face

Parameters/representation: support region size, number, and overlap; representation of the distribution; distance measures between distributions over face exemplars.

Representation of face set I: represent the face tube by a set of 360-vectors; no representation of ordering or dynamics.

Matching face sets within a shot. Min-min distance: d(A, B) = min over a in A, b in B of d(a, b), where A, B are sets of face descriptors (360-vectors).
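The min-min distance is a one-liner over the pairwise distance matrix; a minimal sketch assuming Euclidean distance between descriptors (the function name is illustrative):

```python
import numpy as np

def min_min_distance(A, B):
    """A: (n_a, d) and B: (n_b, d) arrays of face descriptors.
    Returns d(A, B) = min over a in A, b in B of ||a - b||."""
    diffs = A[:, None, :] - B[None, :, :]       # (n_a, n_b, d) pairwise differences
    return np.sqrt((diffs ** 2).sum(-1)).min()  # smallest pairwise distance
```

Being a minimum over all pairs, this distance is small whenever the two tubes share even one similar-looking face, which is what makes it suitable for agglomerative merging.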

Matching face sets within a shot. Goal: match face tubes of a particular person within a shot (to overcome occlusions and self-occlusions). Approach: agglomeratively merge face sets using the min-min distance with exclusion constraints. Exclusion principle: the same character cannot appear twice in the same frame.

face tubes (tracking only)

intra-shot matching

3. Indexing for efficient retrieval

Preliminaries. Film statistics for Pretty Woman: 170,000 frames, 1,151 shots. Pre-processing: track local regions through every shot; detect faces in every frame using a frontal face detector (38,0457 face detections); obtain face tubes by tracking (659 face tubes); after intra-shot matching, 611 face tubes remain.

Representation of face set II: the face tube as a single vector. Obtain a compact representation of the entire face tube: treat the face descriptors as samples from an underlying unknown pdf, and represent the face tube as a histogram over face exemplars (a non-parametric model of the pdf); cf. the Gaussian approximation of [Shakhnarovich et al., ECCV 2002].

Represent the face tube as a histogram over face exemplars. Given a set of precomputed face exemplars, assign each face to its nearest exemplar and count. A separate histogram is kept for each facial feature; the per-feature histograms are concatenated into a single n-vector. Facial feature exemplars are obtained by k-means clustering on a subset of the movie.
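The assign-and-count step can be sketched as follows; a minimal illustration assuming Euclidean nearest-exemplar assignment and L1-normalized histograms (function names are assumptions, and the k-means that produces the exemplars is taken as given):

```python
import numpy as np

def tube_histogram(face_descs, exemplars):
    """face_descs: (n_faces, d); exemplars: (K, d) cluster centres.
    Assign each descriptor to its nearest exemplar and count."""
    d2 = ((face_descs[:, None, :] - exemplars[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)                       # nearest exemplar per face
    hist = np.bincount(nearest, minlength=len(exemplars)).astype(float)
    return hist / hist.sum()  # normalize so tubes of different length compare

def tube_vector(per_feature_descs, per_feature_exemplars):
    """One histogram per facial feature, concatenated into a single n-vector."""
    return np.concatenate([tube_histogram(d, e)
                           for d, e in zip(per_feature_descs,
                                           per_feature_exemplars)])
```

Concatenating per-feature histograms is what the later slide means by representing the marginals of each facial feature rather than their joint distribution.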

Examples of face feature vocabulary. Facial vocabulary: k-means initialized by progressive constructive clustering (which determines K).

Number of exemplars K per facial feature:
  left eye     537
  middle eye   523
  right eye    675
  mouth        834
  nose         675
  Total      3,244

Examples of face feature visual words

Represent marginals of each facial feature, not joint

Matching face tubes: use chi-squared as a distance measure between face tube histograms, χ²(p, q) = Σ_k (p_k − q_k)² / (p_k + q_k). [Figure: histograms of counts p_k and q_k over the exemplars]

Matching face tubes: use chi-squared as a distance measure between face tube histograms, χ²(p, q) = Σ_k (p_k − q_k)² / (p_k + q_k). An alternative would be to measure the KL divergence between the sets, KL(p‖q) = Σ_k p_k log(p_k / q_k); the two are related as ½ χ²(p, q) ≤ KL(p ‖ ½(p+q)) + KL(q ‖ ½(p+q)) ≤ ln 2 · χ²(p, q).
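Both distances are a few lines each; a minimal sketch assuming normalized histograms, with a small epsilon (an illustrative choice, not from the paper) to guard against empty bins:

```python
import numpy as np

def chi2(p, q, eps=1e-12):
    """Chi-squared distance between two normalized histograms."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return (((p - q) ** 2) / (p + q + eps)).sum()

def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) = sum_k p_k log(p_k / q_k)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with p_k = 0 contribute nothing
    return (p[mask] * np.log(p[mask] / (q[mask] + eps))).sum()
```

Chi-squared is preferred here in part because it is symmetric and stays finite when one histogram has a zero bin where the other does not, which plain KL(p‖q) does not guarantee.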

Making the search efficient (Google-like retrieval). Represent the video by a histogram over facial feature exemplars for each face tube; each column is a normalized histogram. [Figure: 42 facial feature exemplars x 5 face tubes, counts p_k over exemplars] Cf. words vs. documents (e.g. web pages) in text retrieval. Employ text-retrieval techniques, e.g. inverted file indexing and ranking (here on chi-squared).
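The inverted-file idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the index maps each exemplar to the tubes whose histogram is nonzero there, candidate tubes are gathered from the query's nonzero exemplars only, and ranking uses the chi-squared distance from the previous slide; all names are illustrative.

```python
import numpy as np

def build_inverted_index(tube_histograms):
    """tube_histograms: dict tube_id -> 1-D normalized histogram.
    Inverted file: exemplar index -> list of tube_ids with a nonzero count."""
    index = {}
    for tid, h in tube_histograms.items():
        for k in np.flatnonzero(h):
            index.setdefault(int(k), []).append(tid)
    return index

def retrieve(query_hist, tube_histograms, index):
    """Score only tubes sharing at least one exemplar with the query,
    then rank by chi-squared distance (smaller = better match)."""
    candidates = set()
    for k in np.flatnonzero(query_hist):
        candidates.update(index.get(int(k), []))

    def chi2(p, q, eps=1e-12):
        return (((p - q) ** 2) / (p + q + eps)).sum()

    return sorted(candidates,
                  key=lambda tid: chi2(query_hist, tube_histograms[tid]))
```

As in text retrieval, the index lets the system touch only tubes that share at least one "visual word" with the query instead of scoring every tube in the film.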

Video Google Faces Demo

Inter-shot retrieval: ground-truth evaluation. Ground truth for 7 characters: 373 face tracks (minimum of 10 detections each); per-character track counts: 194, 97, 15, 23, 29, 9, 7.

Inter shot retrieval example I Query sequence Retrieved sequences (shown by first detection) Example sequence

Inter shot retrieval example II Query sequence Examples of recognized faces Retrieved sequences (shown by first detection)

Inter shot retrieval (other characters)

Example: matching across movies. Bill Murray in Lost in Translation [Coppola, 2003] and Groundhog Day [Ramis, 1993].

Lost in translation - query Query shot Example face detection 192 associated face detections

Find Bill Murray in Groundhog Day. Face detections from the first 36 retrieved face tracks: the first false positive is ranked 42nd, with 15 false positives in the first 100 retrieved face tracks (out of a total of 596 face tracks).

Summary: face shot retrieval using a specialized vocabulary and a strong spatial model. Extensions: include hair/clothes in the visual query for more specific search (integrating vocabularies); add a profile face detector to harvest further face tubes; use the exclusion principle to provide negative exemplar sets in inter-shot matching; apply to other object classes. Previous work: object retrieval in entire movies, Sivic and Zisserman, ICCV 2003. Demo: http://www.robots.ox.ac.uk/~vgg/research/vgoogle/