Describable Visual Attributes for Face Verification and Image Search

Size: px

Start display at page:

Download "Describable Visual Attributes for Face Verification and Image Search"

Polly Blair
6 years ago
Views:

1 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 1 Describable Visual Attributes for Face Verification and Image Search Kumar, Berg, Belhumeur, Nayar. PAMI, Ryan Lei 2011/05/05 Course ammai

2 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 2 Outline Title explanation Introduction Related work Creating labeled image datasets Learning visual attributes Application to face verification Application to face search Possible improvements

3 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 3 Title Explanation Describable visual attributes: Labels that can be given to an image to describe its appearance. Examples: gender, age, skin color, hair style, smiling. Goal 1 - Face verification (authentication): a. Determine whether two faces are of the same individual. b. Knowing the identity, judge if the query face image is the same person. Goal 2 - Image Search: Attribute-based image search system. Sample query: smiling Asian man with glasses.

4 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 4 Introduction (1) Traditional face recognition research: Use of low-level image features to directly train classifier for the classification task. Examples: color (histogram, GCM), texture (Gabor), intensity (Haar, LBP), gradient (SIFT, HOG). The feature representations are often high dimensional, and are in an abstract space. Proposal - attribute-based representation: a. Describable visual attributes ( attributes in this slide) b. Similes: similarities to reference faces.

5 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 5 Recap: semantic concept detection (course MMAI).

6 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 6 Concept space for faces -- (a): attributes

7 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 7 Concept space for faces -- (b): similes

8 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 8 Introduction (2) Advantages of attribute-based representation: A mid-level representation bridging the semantic gap. Dimensionality reduction & manifold discovery. Flexible, generalizable, efficient. Flexible: various levels of specificity. white male a set of people. + brown-hair green-eyes scar-on-forehead a specific person. + smiling lit-from-above seen-from-left a particular image of that person.

9 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 9 Generalizable: Introduction (3) Learn a set of attributes from a large image collection, and then apply them in arbitrary combinations to the recognition of unseen images. Efficient: k binary attributes can identify up to 2 k categories. Requires a much smaller labeled dataset to achieve comparable performance on recognition tasks.

10 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 10 Introduction (4) Contributions of this paper: 1. Introduce attribute & simile classifiers, and face representation using these results. 2. Application to face verification and face search. 3. Releases two large public datasets: FaceTracer and PubFig. Previous work: N. Kumar, et al., Attribute and simile classifiers for face verification, CVPR N. Kumar, et al., FaceTracer: A Search Engine for Large Collections of Images with Faces, ECCV 2008.

11 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 11 Creating Labeled Datasets (1) Steps 1 & 2: Download from a variety of online resources, such as Yahoo! Images and Flickr. Using a commercial face detector (Omron OKAO), detect faces, pose angles, and fiducial points. Using the yaw angle, flip the face so it always faces left. Align faces using linear least square regression.

online resources, such as Yahoo! Images and Flickr.

12 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 12 Creating Labeled Datasets (1) Steps 1 & 2: pose angles fiducial points Download from a variety of online resources, such as Yahoo! Images and Flickr. Using a commercial face detector (Omron OKAO), detect faces, pose angles, and fiducial points. Using the yaw angle, flip the face so it always faces left. Align faces using linear least square regression.

13 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 13 Review of linear least square regression Basics (1-dimensional-output case): Given a vector x, when want to predict its output y. Prepare a training set of N data: {x 1, x 2,..., x N }. Introduce M basis functions Φ j (.), e.g., Gaussian, such that y is a linear combination of M basis components: y = M " w!! (x) = w T!(x) j j j=1 Goal: learn the weight vector w. Prediction: y new = w T!(x new )

14 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 14 Review of linear least square regression Formulation: N Overdetermined systems:! # # X = # data # " basis M " j=1! 1 (x 1 )! 2 (x 1 )!! M (x 1 )! 1 (x 2 )! 2 (x 2 )!! M (x 2 )!! "!! 1 (x N )! 2 (x N )!! M (x N ) Solve the normal equations for w :! j (x i )! w j = y i, i =1, 2,..., N $! & # & # &, y = # & # % " y 1 y 2! y N X T Xw = X T y! w = (X T X) "1 X T y $ & & & & % ' Xw = y

15 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 15 Review of linear least square regression Extension to D-dimensional-output case:! # # W = # # " output dim. w 11 w 12! w 1D w 21 w 22! w 2D!! "! w M1 w M 2! w MD $! & # & # &, Y = # & # % " output dim. y 11 y 12! y 1D y 21 y 22! y 2D!! "! y N1 y N 2! y ND X T XW = X T Y! W = (X T X) "1 X T Y Given 6 fiducial points, predict 6 new coordinates. In our case, W matrix is an affine transformation. $ & & & & %

16 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 16 Creating Labeled Datasets (2) Step 3: Crowd-sourcing part. They chose Amazon Mechanical Turk, a service that matches workers to online jobs created by requesters. Quality control by requiring confirmation of results by several workers, or minimum worker experience, etc. After verification, they collected 145,000 attribute labels, at the cost of USD$6,000.

17 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 17 Amazon Mechanical Turk Worker Requester

18 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 18 Creating Labeled Datasets (3) FaceTracer attribute labels: 15,000 faces with 5,000 labels. (0.33 labels per image) Researchers still need to train their own attribute classifiers to transform the faces into attribute space. PubFig identity labels: 58,797 images of 200 public figures. Development set: 60 people. (for training simile classifiers) Evaluation set: 140 people. (same purpose as LFW) What if jointly labeling identities and attributes?

19 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 19 # of faces of this person.

20 Multimedia Analysis and Indexing, Fall 2010, NTU 20 Attributes labeled in the FaceTracer dataset However, all attribute classifiers in this work are binary.

21 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 21 Learning Visual Attributes (1) Definitions: Consider the attribute gender and the qualitative labels male and female. An Attribute can be thought of as a function that maps an image I to a real value r i. female male r i Large positive (negative) values of r i indicate the present (absent) of the i-th attribute. Core part in this work: Feature selection, though still an open problem in machine learning.

22 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 22 Learning Visual Attributes (2) Low-level features options: a) Region of face. [10] b) Type of pixel value. [5] c) Normalization to apply. [3] d) Level of aggregation to use. [3] A total of 10*5*3*3 = 450 combinations.

23 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 23 Learning Visual Attributes (3) Complete feature pool: However, not all possible combinations are valid, e.g., normalization of hues. a) ˆx = x µ ˆx = x! µ! b) c) d) (concat)

Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 24 Learning Visual Attributes (4) Forward feature selection in each iteration: 1.

24 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 24 Learning Visual Attributes (4) Forward feature selection in each iteration: 1. Train all possible classifiers by concatenating one feature option with the current feature set. 2. Evaluate the performances by cross-validation. 3. Feature(s) with highest CV accuracy added. 4. Drop the lowest-scoring 70% of features. 5. Keep adding until the accuracy stops improving.

25 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 25 Learning Visual Attributes (5) Classifiers used: SVMs with RBF kernel. (tool: libsvm) Seems to be only one final classifier. ECCV 2008: Boosting approach. Thought: Features A and B work well individually. Will they work well together?

26 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 26 Typically range from 80% to 90%. Analysis shows that all regions and feature types are useful: the power of feature selection.

27 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 27 Learning Visual Attributes (6) Simile classifiers:

28 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 28 Learning Visual Attributes (6) Simile classifiers: 60 reference people. 8 manually selected face regions. (Similes are component-based by definition.) 6 manually selected combinations of [pixel value type, normalization, aggregation], without automatic selection. (Could have done so.) Yield a total of 60*8*6 = 2,880 simile classifiers, each being an SVM with RBF kernel.

29 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 29 Now we have these attributes and similes as midlevel representation. How do we apply them in high-level classification / retrieval problems?

30 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 30 Application to Face Verification (1) Problem: Are these two faces of the same person? Existing methods: Very early work & early work: Compare L2 distance in PCA-reduced space. Improved by the supervised LDA. Early work used well-known low-level features directly. Often make avoidable mistakes: men being confused for women, young people for old, Asians for Caucasians. From this observation, they claim that the attribute and simile classifiers can avoid such mistakes.

31 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 31 Application to Face Verification (2) Many steps have been explained in previous sections. Goal here: the verification classifier.

32 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 32 Application to Face Verification (3) Verification classifier: Train a verification classifier V that compares attribute vectors C(I 1 ) and C(I 2 ) of two face images I 1 and I 2, and returns the decision v(i 1, I 2 ). Define a i = C i (I 1 ) and b i = C i (I 2 ), subscript i meaning the i-th attribute (simile) value. i = 1, 2,, n. Use both the absolute difference and product. 1) a i b i : Observation that a i and b i should be close if the are of the same individual. 2) a i * b i : Observation that a i and b i should have the same sign if the are of the same individual.

33 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 33 Application to Face Verification (4) Verification classifier (cont.): Concatenate them in tuple p i, then concatenate for all i: <p 1,,p n > is the training input to the classifier. Again, they use SVM with RBF kernel. Experiment: Performed on datasets LFW and PubFig. Even the individuals are disjoint in training / testing sets. In other words, machine never sees the same pair of people in its model. Rely solely on attributes / similes.

34 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 34

35 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 35 How well can attributes potentially do? 91.86% 81.57% Attributes obtained by machine learning: 81.57%. Attributes obtained by human labeling: 91.86%. Face verification entirely by human: 99.20%.

36 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 36 Application to Face Search FaceTracer search engine: For details, refer to the work in ECCV Demo video on YouTube: Text-based query. Remember that we have bridged the semantic gap by those mid-level classifiers. For each attribute, search results are ranked by confidences. Convert confidences into probabilities in [0,1] by Gaussian. For multiple query terms, just combine them by taking the product (AND) of those probabilities. Indexing: basic inverted index? Not mentioned.

37 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 37 What about the user gap?

38 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 38 To accurately capture user s search intention: sketch-based concept-based people-attribute -based

39 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 39 Possible Improvements Go beyond binary classifiers for attributes. Regressors are good for quantitative attributes like age. Use feature selection in training the Simile classifiers. Automatic (Dynamic) selection of attributes. Image search: Product of probabilities can t prevent outlier scores. TF-IDF approach for ranking the word frequencies? Combine identity information? Combine location and size, even sketch?

40 Advanced Topics in Multimedia Analysis and Indexing, Spring 2011, NTU. 40 Thank You! Q & A

CS 1674: Intro to Computer Vision. Attributes. Prof. Adriana Kovashka University of Pittsburgh November 2, 2016

CS 1674: Intro to Computer Vision Attributes Prof. Adriana Kovashka University of Pittsburgh November 2, 2016 Plan for today What are attributes and why are they useful? (paper 1) Attributes for zero-shot