Video Google Faces
Josef Sivic, Mark Everingham, Andrew Zisserman
Visual Geometry Group, University of Oxford
The objective
Retrieve all shots in a video, e.g. a feature-length film, containing a particular person: a visually defined search on faces.
Example film: Pretty Woman [Marshall, 1990]
Applications:
- intelligent fast-forward on characters
- pull out all videos of person x from 1000s of digital camera MPEGs
Uncontrolled viewing conditions
Image variations due to:
- pose/scale
- lighting
- partial occlusion
- expression
cf. standard face databases
The ideal situation
Despite all these image variations, we want different identities to map to distinct points in face space.
... and the reality
[Figure: face space]
Approach
- minimize variations due to pose and lighting by choice of feature vector
- multiple face exemplars to represent expressions
- each identity represented by a distribution over exemplar feature vectors
The benefits of video
Automatically associate expression exemplars.
Outline
1. Obtaining sets of faces using tracking within shots (identity free)
2. Matching face sets within shots (requires identity matching)
3. Indexing for efficient retrieval
Live demo: Pretty Woman, Groundhog Day, Casablanca
1. Obtaining sets of faces by tracking within a shot
Face detection
Need to associate detections with the same identity across frames.
Face detector [Mikolajczyk et al., ECCV 2004]
In the tradition of Rowley et al. '96, Schneiderman & Kanade '00, and Viola & Jones '01, and also inspired by the SIFT descriptor of Lowe '99.
- Local features: quantized gradient orientations, Laplacian
- Weak classifiers: from feature occurrence and co-occurrence
- Strong classifier: built using AdaBoost
Operate at the high-precision (90%) operating point: few false positives.
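As a rough illustration of the boosting step, here is a minimal discrete-AdaBoost sketch over stump classifiers on binary feature occurrences. The function names and feature format are hypothetical; the actual detector's weak learners (occurrence/co-occurrence features) and its cascade structure are more elaborate.

```python
import numpy as np

def adaboost(X, y, n_rounds=50):
    """Discrete AdaBoost over single-feature stumps.

    X: (n_samples, n_features) binary feature-occurrence matrix (a
    hypothetical stand-in for the slide's weak classifiers).
    y: labels in {-1, +1} (face / non-face)."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                  # sample weights
    stumps = []                              # (feature, polarity, alpha)
    for _ in range(n_rounds):
        best = None
        for j in range(d):
            for pol in (+1, -1):
                pred = pol * (2 * X[:, j] - 1)      # stump fires on feature j
                err = w[pred != y].sum()             # weighted error
                if best is None or err < best[0]:
                    best = (err, j, pol, pred)
        err, j, pol, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)        # weak-classifier weight
        w *= np.exp(-alpha * y * pred)               # re-weight samples
        w /= w.sum()
        stumps.append((j, pol, alpha))
    return stumps

def strong_classifier(stumps, X, threshold=0.0):
    """Weighted vote of the weak classifiers. Raising `threshold` moves
    the detector toward the high-precision operating point on the slide
    (fewer false positives, at the cost of recall)."""
    score = sum(alpha * pol * (2 * X[:, j] - 1) for j, pol, alpha in stumps)
    return score > threshold
```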
Face detection performance
CMU-MIT test data: 125 images with 481 frontal faces.
[Figure: ROC curve]
Operate at the high-precision (90%) operating point: few false positives.
Track local affine covariant regions on faces
- detect regions independently in each frame
- a region's size and shape are not fixed, but adapt automatically to the image intensity to cover the same physical surface, i.e. the pre-image is the same surface region
- tracking: connect the detected regions temporally
Track through pose changes, partial occlusions, and face deformations.
Viewpoint covariant segmentation
Characteristic scales (size of region):
- Lindeberg and Garding, ECCV 1994
- Lowe, ICCV 1999
- Mikolajczyk and Schmid, ICCV 2001
Affine covariance (shape of region):
- Baumberg, CVPR 2000
- Matas et al., BMVC 2002
- Mikolajczyk and Schmid, ECCV 2002
- Schaffalitzky and Zisserman, ECCV 2002
- Tuytelaars and Van Gool, BMVC 2000
[Figure: maximally stable regions; shape adapted regions]
Tracking covariant regions: two stages
Goal: develop very long, good-quality tracks.
- Stage I: match regions detected in neighbouring frames. Problem: missing detections.
- Stage II: repair tracks by region propagation.
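A schematic of the two stages, assuming each detected region carries a descriptor; the region format, the thresholds, and the gap-bridging shortcut (a crude stand-in for full region propagation) are all assumptions.

```python
import numpy as np

def track_regions(frames, max_dist=0.4, max_gap=2):
    """Two-stage tracking sketch. `frames` is a list of lists of regions,
    each region a dict {'desc': np.ndarray, 'xy': (x, y)} (hypothetical
    format). Stage I links regions in neighbouring frames by descriptor
    distance; Stage II keeps a track alive through up to `max_gap`
    missed detections and re-links it, instead of breaking the track."""
    tracks, active = [], []
    for regions in frames:
        unmatched = list(range(len(regions)))
        for tr in active:
            last = tr['regions'][-1]
            # find the nearest still-unmatched region by descriptor distance
            best, best_d = None, max_dist
            for i in unmatched:
                d = np.linalg.norm(last['desc'] - regions[i]['desc'])
                if d < best_d:
                    best, best_d = i, d
            if best is not None:
                tr['regions'].append(regions[best])
                tr['gap'] = 0
                unmatched.remove(best)
            else:
                tr['gap'] += 1          # Stage II: tolerate a short gap
        # retire tracks whose gap exceeded the limit
        tracks += [tr for tr in active if tr['gap'] > max_gap]
        active = [tr for tr in active if tr['gap'] <= max_gap]
        # start new tracks from still-unmatched regions
        active += [{'regions': [regions[i]], 'gap': 0} for i in unmatched]
    return tracks + active
```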
Example I original sequence
Example I tracked regions
Example II tracked regions
Region tubes
Connecting face detections temporally
Goal: associate the face detections of each character within a shot.
Approach: agglomeratively merge face detections based on connecting region tubes; require a minimum number of region tubes to overlap both face detections.
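In code, the merge criterion might look like the following union-find sketch; the detection and tube data formats and the `min_tubes` threshold are illustrative, not the paper's values.

```python
def connect_face_detections(face_dets, region_tubes, min_tubes=2):
    """Group face detections into face tubes. `face_dets` is a list of
    (frame, (x0, y0, x1, y1)); `region_tubes` maps tube id -> {frame: (x, y)}.
    Two detections are joined if enough region tubes pass through both."""
    def tubes_through(det):
        f, (x0, y0, x1, y1) = det
        return {tid for tid, pts in region_tubes.items()
                if f in pts and x0 <= pts[f][0] <= x1 and y0 <= pts[f][1] <= y1}

    # union-find over detections
    parent = list(range(len(face_dets)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    tube_sets = [tubes_through(d) for d in face_dets]
    for i in range(len(face_dets)):
        for j in range(i + 1, len(face_dets)):
            if len(tube_sets[i] & tube_sets[j]) >= min_tubes:
                parent[find(i)] = find(j)   # same face tube

    groups = {}
    for i in range(len(face_dets)):
        groups.setdefault(find(i), []).append(face_dets[i])
    return list(groups.values())
```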
Example: Buffy the Vampire Slayer Breakfast Scene
raw face detections
face tubes
2. Matching face sets within shots
Face feature vector
A possible approach would be to determine 3D pose/illumination in the manner of the Blanz, Romdhani & Vetter 3D morphable model. Instead, we concentrate on near-frontal pose and compensate for pose/illumination variation using descriptors designed with built-in invariance: multiple overlapping SIFTs.
Face feature vector - summary
Multiple, overlapping, affinely transformed local SIFT descriptors.
[Pipeline: face detector -> eyes/nose/mouth -> multiple overlapping SIFTs]
Inspired by the Elastic Bunch Graph Matching representation of von der Malsburg et al. and the Component Approach of Heisele et al.
Detect face features for rectification
[Figure: video with detected features; close-up; rectified face]
Eyes/nose/mouth detectors
- Training data: ~5,000 images with hand-marked facial features
- Scale determined by the face detector
- Fixed-size patches extracted around feature points
Constellation-like appearance/shape model
- Model shape X (2-D points) and appearance A (patches at the points in X); appearance and shape are assumed independent
- Appearance of each feature is modelled as a mixture of Gaussians (GMM); an EM (mixture of probabilistic PCA) algorithm is used to estimate the parameters
- Joint position of all features is modelled as a (mixture of) Gaussians with full covariance (positions of all features interact)
[Figure: positions x_i, GMM clusters, appearances a_j]
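Written out, the independence assumptions above amount to a factorized density of roughly this form (a sketch; the exact parameterization of the mixtures is not given on the slide):

```latex
% shape and appearance assumed independent:
p(X, A) = p(X)\, p(A)
% joint feature positions: (mixture of) full-covariance Gaussians:
p(X) = \sum_m \pi_m \, \mathcal{N}\!\big(X;\ \mu_m, \Sigma_m\big)
% appearance: per-feature GMM over patches a_j:
p(A) = \prod_j \sum_k \pi_{jk} \, \mathcal{N}\!\big(a_j;\ \mu_{jk}, \Sigma_{jk}\big)
```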
SIFT descriptor [Lowe 1999]
Create an array of orientation histograms over the rectified face: 8 orientations x 3x3 spatial bins = 72 dimensions.
Face feature vector - summary
Benefits of local SIFT descriptors:
- SIFT is unaffected by small localization errors in the eyes/nose/mouth detector
- centre weighting de-emphasizes the background (no foreground segmentation needed)
- illumination normalization per SIFT allows lighting to vary across the face
One SIFT per facial feature, i.e. a 5 x 72 = 360-vector for the entire face.
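A simplified numpy sketch of the 72-dim descriptor and the 360-vector concatenation. It uses hard assignment to orientation bins; the real SIFT adds centre weighting and soft binning, and the patch extraction around each facial feature is assumed done elsewhere.

```python
import numpy as np

def sift_like_descriptor(patch, n_ori=8, grid=3):
    """72-dim SIFT-style descriptor of a grayscale patch: a grid x grid
    array of n_ori-bin gradient-orientation histograms (8 x 3x3 = 72),
    L2-normalized for illumination invariance."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ori = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    bins = np.minimum((ori / (2 * np.pi) * n_ori).astype(int), n_ori - 1)
    h, w = patch.shape
    desc = np.zeros((grid, grid, n_ori))
    ys = np.minimum(np.arange(h) * grid // h, grid - 1)  # row -> spatial bin
    xs = np.minimum(np.arange(w) * grid // w, grid - 1)  # col -> spatial bin
    for i in range(h):
        for j in range(w):
            desc[ys[i], xs[j], bins[i, j]] += mag[i, j]  # magnitude-weighted
    desc = desc.ravel()
    return desc / (np.linalg.norm(desc) + 1e-10)         # per-SIFT normalization

def face_descriptor(feature_patches):
    """Concatenate one descriptor per facial feature (e.g. left eye,
    right eye, middle, nose, mouth): 5 x 72 = 360-vector per face."""
    return np.concatenate([sift_like_descriptor(p) for p in feature_patches])
```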
Parameters/representation
- Support region size, number, and overlap
- Representation of the distribution over face exemplars
- Distance measures between distributions
Representation of face set I: face tube
Represent the tube by the set of its 360-vectors; no representation of ordering or dynamics.
Matching face sets within a shot
Min-min distance:
d(A, B) = \min_{a \in A,\ b \in B} d(a, b)
where A and B are sets of face descriptors (360-vectors).
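For instance, a direct numpy transcription, with Euclidean distance assumed for the element-wise d(a, b):

```python
import numpy as np

def min_min_distance(A, B):
    """d(A, B) = min over a in A, b in B of ||a - b||, where A and B are
    (n, 360) and (m, 360) arrays of face descriptors."""
    # all pairwise Euclidean distances via broadcasting, then the global min
    diffs = A[:, None, :] - B[None, :, :]
    return np.sqrt((diffs ** 2).sum(-1)).min()
```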
Matching face sets within a shot
Goal: match face tubes of a particular person within a shot (to overcome occlusions and self-occlusions).
Approach: agglomeratively merge face sets using the min-min distance, with exclusion constraints.
Exclusion principle: the same character cannot appear twice in the same frame.
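A greedy sketch of this merging loop; the tube format and the distance threshold are assumptions.

```python
import numpy as np

def merge_face_tubes(tubes, dist_fn, max_dist=0.8):
    """Agglomerative merging with the exclusion constraint. Each tube is
    {'frames': set of frame ids, 'descs': (n, 360) array}; `dist_fn` is
    e.g. min_min_distance above. The threshold value is illustrative."""
    while True:
        best = None
        for i in range(len(tubes)):
            for j in range(i + 1, len(tubes)):
                # exclusion principle: a character cannot appear twice in
                # one frame, so temporally overlapping tubes never merge
                if tubes[i]['frames'] & tubes[j]['frames']:
                    continue
                d = dist_fn(tubes[i]['descs'], tubes[j]['descs'])
                if d < max_dist and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return tubes
        _, i, j = best                       # merge the closest pair
        tubes[i] = {'frames': tubes[i]['frames'] | tubes[j]['frames'],
                    'descs': np.vstack([tubes[i]['descs'], tubes[j]['descs']])}
        tubes.pop(j)
```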
face tubes (tracking only)
intra-shot matching
3. Indexing for efficient retrieval
Preliminaries
Film statistics for Pretty Woman: 170,000 frames, 1151 shots.
Pre-processing:
- track local regions through every shot
- detect faces in every frame using a frontal face detector (38,0457 face detections)
- obtain face tubes by tracking (659 face tubes)
- after intra-shot matching ("plumbing"): 611 face tubes
Representation of face set II: the face tube as a single vector
- Obtain a compact representation of the entire face tube (one descriptor per face)
- Treat the face descriptors as samples from an underlying unknown pdf
- Represent the face tube as a histogram over face exemplars (a non-parametric model of the pdf)
cf. the Gaussian approximation of [Shakhnarovich et al., ECCV 2002]
Represent face tube as a histogram over face exemplars
- Given a set of precomputed face exemplars, assign each face to its nearest exemplar
- Build a separate histogram of exemplar counts for each facial feature
- Concatenate the histograms of all facial features into an n-vector
Facial feature exemplars are obtained by k-means clustering on a subset of the movie.
[Figure: face tube -> exemplars -> histogram over exemplars (counts)]
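A numpy sketch of this assignment-and-count step; the array shapes and the per-feature exemplar lists are assumptions consistent with the 5 x 72 descriptors above.

```python
import numpy as np

def tube_histogram(tube_descs, exemplars_per_feature):
    """Histogram-over-exemplars representation of one face tube.

    tube_descs: (n_faces, 5, 72) per-feature descriptors for the tube.
    exemplars_per_feature: list of 5 arrays (K_f, 72), e.g. from k-means
    on a subset of the movie. Returns the concatenated, normalized
    n-vector (n = sum of the K_f)."""
    hists = []
    for f, exemplars in enumerate(exemplars_per_feature):
        descs = tube_descs[:, f, :]                       # (n_faces, 72)
        d = np.linalg.norm(descs[:, None] - exemplars[None], axis=-1)
        nearest = d.argmin(axis=1)                        # nearest exemplar id
        hists.append(np.bincount(nearest, minlength=len(exemplars)))
    h = np.concatenate(hists).astype(float)
    return h / h.sum()                                    # normalize
```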
Examples of face feature vocabulary
Facial vocabulary: k-means initialized by progressive constructive clustering (which determines K).

Feature       K
left eye      537
middle eye    523
right eye     675
mouth         834
nose          675
Total         3,244
Examples of face feature visual words
Represent the marginal distribution of each facial feature, not the joint.
Matching face tubes
Use chi-squared as a distance measure between face tube histograms:
\chi^2(p, q) = \sum_k \frac{(p_k - q_k)^2}{p_k + q_k}
[Figure: counts p_k and q_k over exemplars]
Matching face tubes
An alternative would be to measure the KL divergence between the distributions:
KL(p \| q) = \sum_k p_k \log(p_k / q_k)
though the two are related: with m = (p + q)/2,
\frac{1}{2}\chi^2(p, q) \le KL(p \| m) + KL(q \| m) \le \ln 2 \cdot \chi^2(p, q)
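Both distances are a few lines of numpy. The epsilon handling of empty bins is an implementation choice, and the assert simply spot-checks the bound above on random histograms.

```python
import numpy as np

def chi2(p, q):
    # chi-squared (triangular discrimination) between histograms
    return ((p - q) ** 2 / (p + q + 1e-12)).sum()

def kl(p, q):
    # KL divergence, with a small epsilon guarding empty bins
    p, q = p + 1e-12, q + 1e-12
    return (p * np.log(p / q)).sum()

# quick numerical check of the bound on the slide
rng = np.random.default_rng(0)
p = rng.random(50); p /= p.sum()
q = rng.random(50); q /= q.sum()
m = 0.5 * (p + q)
c = kl(p, m) + kl(q, m)
assert 0.5 * chi2(p, q) <= c <= np.log(2) * chi2(p, q)
```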
Making the search efficient (Google-like retrieval)
Represent the video by a histogram over facial feature exemplars for each face tube; each column of the resulting matrix is a normalized histogram.
[Figure: exemplars x face tubes count matrix, e.g. 42 facial feature exemplars x 5 face tubes]
cf. words vs documents (e.g. web pages) in text retrieval.
Employ text-retrieval techniques, e.g.:
- inverted file indexing
- ranking (here on chi-squared)
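A minimal inverted-file sketch of this retrieval step; the shortlist-then-rank split and all names are illustrative.

```python
import numpy as np

def chi2(p, q):
    # chi-squared distance between normalized histograms
    return ((p - q) ** 2 / (p + q + 1e-12)).sum()

def build_inverted_index(tube_hists):
    """Inverted file: exemplar ('visual word') id -> list of face tubes
    whose histogram contains that exemplar. `tube_hists` is a list of
    normalized histograms, one per face tube."""
    index = {}
    for tube_id, h in enumerate(tube_hists):
        for word in np.nonzero(h)[0]:
            index.setdefault(int(word), []).append(tube_id)
    return index

def retrieve(query_hist, tube_hists, index):
    """Shortlist only the tubes sharing at least one exemplar with the
    query (via the inverted file), then rank the shortlist by chi-squared."""
    candidates = set()
    for word in np.nonzero(query_hist)[0]:
        candidates.update(index.get(int(word), []))
    scores = [(chi2(query_hist, tube_hists[t]), t) for t in candidates]
    return sorted(scores)   # smallest distance first
```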
Video Google Faces Demo
Inter-shot retrieval: ground truth evaluation
Ground truth for 7 characters: 373 face tracks (with a minimum of 10 detections per track).
Per-character track counts: 194, 97, 15, 23, 29, 9, 7.
Inter-shot retrieval: example I
[Figure: query sequence; retrieved sequences (shown by first detection); example sequence]
Inter-shot retrieval: example II
[Figure: query sequence; examples of recognized faces; retrieved sequences (shown by first detection)]
Inter-shot retrieval (other characters)
Example: matching across movies
Bill Murray in Lost in Translation [Coppola, 2003] and Groundhog Day [Ramis, 1993].
Lost in Translation - query
[Figure: query shot; example face detection]
192 associated face detections.
Find Bill Murray in Groundhog Day
[Figure: face detections from the first 36 retrieved face tracks]
First false positive ranked 42nd; 15 false positives in the first 100 retrieved face tracks (out of 596 face tracks in total).
Summary
Face shot retrieval using a specialized vocabulary and a strong spatial model.
Extensions:
- include hair/clothes in the visual query for a more specific search (integrate vocabularies)
- add a profile face detector to harvest further face tubes
- use the exclusion principle to provide negative exemplar sets in inter-shot matching
- apply to other object classes
Previous work: object retrieval in entire movies [Sivic and Zisserman, ICCV 2003].
Demo: http://www.robots.ox.ac.uk/~vgg/research/vgoogle/