Video Google faces. Josef Sivic, Mark Everingham, Andrew Zisserman. Visual Geometry Group University of Oxford


1 Video Google faces Josef Sivic, Mark Everingham, Andrew Zisserman Visual Geometry Group University of Oxford

2 The objective Retrieve all shots in a video, e.g. a feature-length film, containing a particular person. Visually defined search on faces. Example: Pretty Woman [Marshall, 1990]. Applications: intelligent fast forward on characters; pull out all videos of x from 1000s of digital camera mpegs.

3 Uncontrolled viewing conditions Image variations due to: pose/scale, lighting, partial occlusion, expression (cf. standard face databases)

4 The ideal situation face space Despite all these image variations, want different identities to map to distinct unique points

5 and reality face space

6 Approach Minimize variations due to pose and lighting by choice of feature vector. Use multiple face exemplars to represent expressions. Each identity is represented by a distribution over exemplar feature vectors.

7 The benefits of video Automatically associate expression exemplars

8 Outline 1. Obtaining sets of faces using tracking within shots (identity free) 2. Matching face sets within shots (requires identity matching) 3. Indexing for efficient retrieval. Live demo: Pretty Woman, Groundhog Day, Casablanca

9 1. Obtaining sets of faces by tracking within a shot

10 Face detection Need to associate detections with the same identity across frames


11 Face Detector [Mikolajczyk et al., ECCV 2004] In the tradition of Rowley et al. '96, Schneiderman & Kanade '00, Viola & Jones '01, and also inspired by the SIFT descriptor of Lowe '99. Local features: gradient, quantized orientations, Laplacian. Weak classifiers: from feature occurrence and co-occurrence. Strong classifier: using AdaBoost. Operate at a high-precision (90%) point: few false positives.

12 Face detection performance on CMU-MIT test data: 125 images with 481 frontal faces. [Figure: ROC curve] Operate at a high-precision (90%) point: few false positives.

13 Track local affine covariant regions on faces Detect regions independently in each frame. A region's size and shape are not fixed, but automatically adapt to the image intensity to cover the same physical surface, i.e. the pre-image is the same surface region. Tracking: connect the detected regions temporally. Track through pose changes, partial occlusions, face deformations.



15 Viewpoint covariant segmentation Characteristic scales (size of region): Lindeberg and Garding ECCV 1994; Lowe ICCV 1999; Mikolajczyk and Schmid ICCV 2001. Affine covariance (shape of region): Baumberg CVPR 2000; Matas et al. BMVC 2002; Mikolajczyk and Schmid ECCV 2002; Schaffalitzky and Zisserman ECCV 2002; Tuytelaars and Van Gool BMVC 2000. [Figure: maximally stable regions; shape adapted regions]

16 Tracking covariant regions: two stages Goal: develop very long, good-quality tracks. Stage I: match regions detected in neighbouring frames. Problems: e.g. missing detections. Stage II: repair tracks by region propagation.

17 Example I original sequence

18 Example I tracked regions

19 Example II tracked regions

20 Region tubes

21 Connecting face detections temporally Goal: associate face detections of each character within a shot. Approach: agglomeratively merge face detections based on connecting region tubes; require a minimum number of region tubes to overlap the face detections.
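
The merging step on slide 21 can be illustrated with a minimal sketch. It assumes each face detection has already been reduced to the set of region-tube ids passing through its bounding box. The names `merge_detections` and `min_shared` are hypothetical and the threshold value is not given in the slides. Linking every pair that shares enough tubes and taking connected components is used here as a simple stand-in for the agglomerative merge.

```python
from itertools import combinations

def merge_detections(detection_tubes, min_shared=3):
    """Group face detections connected by at least `min_shared` region tubes.

    detection_tubes: dict mapping a detection id to the set of region-tube
    ids that overlap that detection (hypothetical input format).
    Returns a sorted list of groups, each a sorted list of detection ids.
    """
    parent = {d: d for d in detection_tubes}

    def find(d):
        # Union-find root lookup with path halving.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    # Link every pair of detections that shares enough region tubes.
    for a, b in combinations(detection_tubes, 2):
        if len(detection_tubes[a] & detection_tubes[b]) >= min_shared:
            parent[find(a)] = find(b)

    groups = {}
    for d in detection_tubes:
        groups.setdefault(find(d), []).append(d)
    return sorted(sorted(g) for g in groups.values())
```

Detections that are not directly linked can still end up in one group through an intermediate detection, which is what lets a face tube survive a frame where few regions were detected.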

22 Example: Buffy the Vampire Slayer Breakfast Scene

23 raw face detections

24 face tubes

25 2. Matching face sets within shots


26 Face feature vector A possible approach is to determine 3D pose/illumination in the manner of the Blanz, Romdhani & Vetter 3D morphable model. Instead, concentrate on near-frontal pose, and compensate for pose/illumination variation using descriptors designed with built-in invariance: multiple overlapping SIFTs.


27 Face feature vector - summary Multiple, overlapping, affinely transformed local SIFT descriptors. [Figure: face detector; eyes/nose/mouth; multiple overlapping SIFTs] Inspired by the Elastic Bunch Graph Matching representation of von der Malsburg et al. and the Component Approach of Heisele et al.

28 Detect face features for rectification Video with detected features close-up rectified face

29 Eyes/nose/mouth detectors Training data: ~5,000 images with hand-marked facial features Scale determined by face detector Fixed-size patches extracted around feature points

30 Constellation-like Appearance/Shape Model Model shape X (2-D points) and appearance A (patches at points in X). Appearance and shape are assumed independent. Appearance of a feature is modelled as a mixture of Gaussians (GMM); the EM algorithm (mixture of probabilistic PCA) is used to estimate parameters. Joint position of all features is modelled as a (mixture of) Gaussians with full covariance (positions of all features interact). [Figure: position x_i; GMM clusters; appearance a_j]

31 SIFT descriptor [Lowe 1999] rectified face Create array of orientation histograms 8 orientations x 3x3 spatial bins = 72 dim.

32 Face feature vector - summary Benefits of local SIFT descriptors: SIFT is unaffected by small localization errors in the eyes/nose/mouth detector. Centre weighting de-emphasizes background (no foreground segmentation). Illumination normalization per SIFT allows lighting to vary across the face. Multiple overlapping SIFTs: one SIFT for each facial feature, i.e. a 5 x 72 = 360-vector for the entire face.
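
The dimension count on slides 31 and 32 (8 orientations x 3x3 spatial bins = 72 per feature, 5 x 72 = 360 per face) can be made concrete with a simplified sketch. This is not Lowe's full SIFT: there is no bin interpolation or clipping, and the Gaussian width `sigma` is an assumed value. The function names `patch_descriptor` and `face_descriptor` are hypothetical.

```python
import math

N_ORI, GRID = 8, 3  # 8 orientations x 3x3 spatial bins = 72 dimensions

def patch_descriptor(patch):
    """Simplified SIFT-like descriptor of a square grey-level patch (list of rows)."""
    n = len(patch)
    hist = [0.0] * (GRID * GRID * N_ORI)
    centre = (n - 1) / 2.0
    sigma = n / 2.0  # assumed width of the centre weighting
    for y in range(1, n - 1):
        for x in range(1, n - 1):
            dx = patch[y][x + 1] - patch[y][x - 1]
            dy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(dx, dy)
            if mag == 0.0:
                continue
            # Centre weighting de-emphasizes pixels near the patch border.
            r2 = (x - centre) ** 2 + (y - centre) ** 2
            weight = math.exp(-r2 / (2 * sigma ** 2))
            angle = math.atan2(dy, dx) % (2 * math.pi)
            ori = int(angle / (2 * math.pi) * N_ORI) % N_ORI
            cell = (y * GRID // n) * GRID + (x * GRID // n)
            hist[cell * N_ORI + ori] += weight * mag
    # Per-descriptor normalization removes dependence on gradient scale.
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist

def face_descriptor(patches):
    """Concatenate the descriptors of the facial-feature patches."""
    return [v for p in patches for v in patch_descriptor(p)]
```

Because each 72-vector is normalized on its own, a brightness or contrast change confined to one part of the face only affects that part's descriptor before normalization, which is the point made about lighting varying across the face.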

33 Parameters/representation Support region size, number and overlap Representation of distribution Distance measures between distribution over face exemplars

34 Face tube Representation of face set - I represent tube by set of 360-vectors no representation of ordering or dynamics

35 Matching face sets within a shot min-min distance: $d(A, B) = \min_{a \in A,\, b \in B} d(a, b)$, where A and B are sets of face descriptors (360-vectors)
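
A minimal sketch of the min-min distance follows. The slides do not say which point distance d(a, b) is used, so Euclidean distance is an assumption here, and `min_min_distance` is a hypothetical name.

```python
import math

def min_min_distance(A, B):
    """Smallest pairwise distance between two sets of descriptors.

    A, B: non-empty sequences of equal-length numeric vectors.
    Euclidean distance is assumed for the point distance d(a, b).
    """
    return min(math.dist(a, b) for a in A for b in B)
```

Two face sets are close under this measure as soon as any one pair of faces is close, so a single shared pose or expression is enough to link them.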

36 Matching face sets within shot Goal: Match face tubes of a particular person within a shot (to overcome occlusions, self-occlusions) Approach: Agglomeratively merge facesets using min-min distance with exclusion constraints. Exclusion principle: The same character cannot appear twice in the same frame
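
The agglomerative merge with the exclusion constraint might look like the sketch below. The stopping `threshold`, the Euclidean point distance, and the `(frames, descriptors)` input format are all assumptions made for illustration, and `merge_face_sets` is a hypothetical name.

```python
import math

def min_min(A, B):
    # min-min distance between two descriptor sets (Euclidean assumed).
    return min(math.dist(a, b) for a in A for b in B)

def merge_face_sets(face_sets, threshold):
    """Greedily merge face sets within one shot.

    face_sets: list of (frames, descriptors) pairs, where `frames` is the
    set of frame numbers the face tube covers.
    Two sets sharing a frame are never merged (exclusion principle).
    """
    sets = [(set(frames), list(descs)) for frames, descs in face_sets]
    while True:
        best = None
        for i in range(len(sets)):
            for j in range(i + 1, len(sets)):
                if sets[i][0] & sets[j][0]:
                    continue  # same frame: cannot be the same character
                d = min_min(sets[i][1], sets[j][1])
                if d <= threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return sets
        _, i, j = best
        # Merge the closest admissible pair and repeat.
        sets[i] = (sets[i][0] | sets[j][0], sets[i][1] + sets[j][1])
        del sets[j]
```

After a merge, the combined set inherits the frames of both parts, so the exclusion check also blocks later merges that would place one character twice in a frame indirectly.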

37 face tubes (tracking only)

38 intra-shot matching

39 3. Indexing for efficient retrieval

40 Preliminaries Film statistics for Pretty Woman: 170,000 frames, 1151 shots. Pre-processing: track local regions through every shot; detect faces in every frame using a 'frontal' face detector (38,0457 face detections); obtain face tubes by tracking (659 face tubes); after intra-shot matching (plumbing) (611 face tubes).

41 Face tube representation as single vector (representation of face set - II) Each face has a descriptor; obtain a compact representation for the entire face tube. Treat face descriptors as samples from an underlying unknown pdf. Represent the face tube as a histogram over face exemplars (non-parametric model of the pdf), cf. Gaussian approximation [Shakhnarovich et al., ECCV 2002].

42 Represent face tube as histogram over face exemplars Given a set of precomputed face exemplars, assign each face to the nearest exemplar. Keep a separate histogram for each facial feature, then concatenate the histograms for each facial feature into an n-vector. Facial feature exemplars are obtained by k-means clustering on a subset of the movie. [Figure: face tube; exemplars; histogram of counts over exemplars]
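
The histogram construction can be sketched as follows, assuming the exemplars for each facial feature are already available. Nearest-exemplar assignment uses Euclidean distance, and each per-feature histogram is normalized by the number of faces in the tube. Both choices are assumptions, and `tube_histogram` is a hypothetical name.

```python
import math

def tube_histogram(tube, vocabularies):
    """Represent a face tube as concatenated per-feature histograms.

    tube: list of faces; each face is a list with one descriptor per
    facial feature.
    vocabularies: one list of exemplar descriptors per facial feature.
    Returns an n-vector, n being the total number of exemplars.
    """
    hist = []
    for f, exemplars in enumerate(vocabularies):
        counts = [0] * len(exemplars)
        for face in tube:
            # Assign this facial feature to its nearest exemplar.
            nearest = min(range(len(exemplars)),
                          key=lambda k: math.dist(face[f], exemplars[k]))
            counts[nearest] += 1
        # Normalize so tubes of different lengths are comparable.
        hist.extend(c / len(tube) for c in counts)
    return hist
```

Each facial feature is quantized independently, so the result captures the marginal distribution of every feature but not their joint distribution.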

43 Examples of face feature vocabulary Facial vocabulary: K-means initialized by progressive constructive clustering (determines K)
Feature       K
left eye      537
middle eye    523
right eye     675
mouth         834
nose          675
Total         3,244

44 Examples of face feature visual words

45 Represent marginals of each facial feature, not joint

46 Matching face tubes Use chi-squared as a distance measure between face tube histograms: $\chi^2(p, q) = \sum_{k=1}^{S} \frac{(p_k - q_k)^2}{p_k + q_k}$ [Figure: two histograms of counts, p_k and q_k, over exemplars]

47 Matching face tubes Use chi-squared as a distance measure between face tube histograms: $\chi^2(p, q) = \sum_{k=1}^{S} \frac{(p_k - q_k)^2}{p_k + q_k}$. An alternative would be to measure the KL divergence between the sets, $\mathrm{KL}(p \,\|\, q) = \sum_k p_k \log(p_k / q_k)$, though these are related as $\frac{1}{2}\chi^2(p, q) \le \mathrm{KL}\left(p \,\|\, \tfrac{1}{2}(p + q)\right) + \mathrm{KL}\left(q \,\|\, \tfrac{1}{2}(p + q)\right) \le \ln 2 \cdot \chi^2(p, q)$
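
The two measures can be compared numerically with a short sketch. It uses natural logarithms and skips empty bins, which is an assumed convention for 0/0 terms. The function names are hypothetical.

```python
import math

def chi2(p, q):
    """Chi-squared distance: sum over k of (p_k - q_k)^2 / (p_k + q_k)."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)

def kl(p, q):
    """KL divergence: sum over k of p_k * log(p_k / q_k), natural log."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def kl_to_mean(p, q):
    """KL(p || m) + KL(q || m), where m is the mean histogram (p + q) / 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return kl(p, m) + kl(q, m)

p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
lower = 0.5 * chi2(p, q)
upper = math.log(2) * chi2(p, q)
print(lower <= kl_to_mean(p, q) <= upper)
```

Measuring KL against the mean histogram keeps every term finite even when one histogram has an empty bin, which the plain KL(p || q) does not.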

48 Making the search efficient (Google-like retrieval) Represent the video by a histogram over facial feature exemplars for each face tube. [Figure: matrix of counts p_k with 42 facial feature exemplars (rows) and 5 face tubes (columns); each column is a normalized histogram] cf. words vs documents (e.g. web pages) in text retrieval. Employ text-retrieval techniques, e.g. inverted file indexing and ranking (here on chi-squared).
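
One way an inverted file could be combined with chi-squared ranking is sketched below. It is an illustration under stated assumptions, not the authors' implementation. It relies on the identity that a bin where both histograms are non-zero contributes p_k + q_k - 4 p_k q_k / (p_k + q_k), while a bin where only one is non-zero contributes that value in full, so only shared bins need to be visited. Histograms are stored sparsely as dicts, and all names are hypothetical.

```python
from collections import defaultdict

def build_inverted_file(tubes):
    """Map each exemplar id to the (tube id, weight) pairs that use it.

    tubes: dict of tube id -> sparse histogram {exemplar id: weight}.
    """
    index = defaultdict(list)
    for tube_id, hist in tubes.items():
        for k, w in hist.items():
            if w > 0:
                index[k].append((tube_id, w))
    return index

def rank(query, tubes, index):
    """Rank tubes sharing at least one exemplar with the query by chi-squared."""
    overlap = defaultdict(float)
    for k, p in query.items():
        if p <= 0:
            continue
        # Visit only the tubes listed under this exemplar.
        for tube_id, q in index.get(k, ()):
            overlap[tube_id] += 4 * p * q / (p + q)
    query_mass = sum(query.values())
    scores = {t: query_mass + sum(tubes[t].values()) - s
              for t, s in overlap.items()}
    return sorted(scores.items(), key=lambda item: item[1])
```

Tubes with no exemplar in common with the query never appear in any visited posting list, so they are left out of the ranking rather than scored at the maximum distance.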

49 Video Google Faces Demo

50 Inter shot retrieval ground truth evaluation Ground truth for 7 characters: 373 face tracks (minimum number of 10 detections)

51 Inter shot retrieval example I Query sequence Retrieved sequences (shown by first detection) Example sequence

52 Inter shot retrieval example II Query sequence Examples of recognized faces Retrieved sequences (shown by first detection)

53 Inter shot retrieval (other characters)

54 Example: Matching across movies Bill Murray: Lost in Translation [Coppola 2003], Groundhog Day [Ramis 1993]

55 Lost in Translation - query Query shot. Example face detection. 192 associated face detections.

56 Find Bill Murray in Groundhog Day Face detections from the first 36 retrieved face tracks. First false positive ranked 42nd; 15 false positives in the first 100 retrieved face tracks (out of a total of 596 face tracks).

57 Summary Face shot retrieval using a specialized vocabulary and strong spatial model. Extensions: - Include hair/clothes in the visual query for more specific search (integrate vocabularies) - Add a profile face detector to harvest further face tubes - Use the exclusion principle to provide negative exemplar sets in inter-shot matching - Apply to other object classes. Previous work: Object retrieval in entire movies, Sivic and Zisserman, ICCV 2003. Demo:
