Deep Face Recognition Nathan Sun
Why Facial Recognition?
- Picture ID or video tracking
- Higher security from facial recognition software
- Immensely useful to police in tracking suspects
- Your face becomes your ID, instead of carrying around a driver's license
- Collect demographic or personal data geared towards an individual (e.g. a gambling addict)
Current Issues
- Difficult to generate an image dataset without too much personpower
- A large dataset is needed to train CNNs
- Large public datasets have been lacking
- Large corporations (Facebook, Google, etc.) have huge datasets
- Goal: investigate various CNN architectures for face identification and verification
Dataset Comparisons
Dataset                 Identities    Images
LFW                     5,749         13,233
WDRef (another paper)   2,995         99,773
CelebFaces              10,177        202,599
Ours (this paper)       2,622         2,600,000
Facebook                4,030         4,400,000
Google                  8,000,000     200,000,000
Current State of Affairs
- DeepFace
  - Siamese network: the same CNN is applied to pairs of faces to obtain descriptors, which are compared using Euclidean distance
  - Trains the CNN to minimize the distance between congruous pairs of faces and maximize the distance between incongruous pairs
- DeepID
  - Multiple CNNs with multiple layers
  - Very complicated, with over 200 CNNs
- Google CNN
  - Triplet-based loss: a pair of congruous faces (a, b) and an incongruous face (c)
  - Goal is to make a closer to b than to c; comparisons are always relative to a pivot face (a)
  - Achieves the best performance on LFW (Labeled Faces in the Wild) and YTF (YouTube Faces)
Dataset Collection
- 5-stage process
- Favors automation over manual acquisition
- Reduce person-hours, increase computer hours
Stage 1: Bootstrapping + Filtering
- Bootstrapping is a nonparametric approach to statistical inference
  - Uses the variability within a sample to estimate the sampling distribution empirically, instead of making assumptions about the sampling distribution of a statistic
  - Can be done with or without replacement (here, without)
- Obtain a candidate list from the IMDb celebrity list
  - 2.5k actors and 2.5k actresses (each with a sufficient number of photos)
- Download 200 images for each of the 5k names
- Human annotators check each identity for sufficient image purity (the downloaded images should mostly show the named person)
  - Narrows the list down to 3,250 identities
- Remove names that already appear in LFW and YTF
  - 2,622 identities in the final list
Stage 2: Collect More Images
- Query each name on Google and Bing Image Search
- Query once more after appending "actor" to the name
- 500 images per query, for a total of 2,000 images per identity (2 search engines x 2 queries x 500 images)
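The image count on this slide can be checked with a small sketch. The engine labels, the placeholder name "Jane Doe", and the helper names are assumptions for illustration; no real search API is called.

```python
# Illustrative only: engine labels and helper names are hypothetical.
ENGINES = ["google", "bing"]
IMAGES_PER_QUERY = 500


def build_queries(name):
    """Return the two query strings per identity: the bare name, then name + 'actor'."""
    return [name, f"{name} actor"]


def images_per_identity(name):
    """Engines x queries x images per query."""
    return len(ENGINES) * len(build_queries(name)) * IMAGES_PER_QUERY


print(build_queries("Jane Doe"))
print(images_per_identity("Jane Doe"))
```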
Stage 3: Improve Purity with Automatic Filter
- Remove erroneous faces from each set using a classifier
- Top 50 images (by Google search rank) for each identity are used as positive training samples
- Top 50 images of all other identities are used as negative training samples
- A one-vs-rest linear SVM is trained for each identity on Fisher Vector Faces descriptors
  - SVM (support vector machine): finds a separating hyperplane between the classes, defined by its support vectors
  - Hard-margin, since the data are linearly separable
- The top 1,000 images per identity (ranked by SVM score) are retained
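The ranking idea can be sketched for a single identity as follows. This is a minimal sketch, not the paper's implementation: it swaps in a soft-margin linear SVM trained by subgradient descent on the hinge loss (instead of a hard-margin solver), and the 2-D "descriptors" are made-up stand-ins for Fisher Vector Faces descriptors.

```python
import random


def train_linear_svm(samples, labels, epochs=200, lr=0.1, reg=0.01, seed=0):
    """Fit w, b by subgradient descent on the L2-regularized hinge loss.

    labels are +1 (this identity) or -1 (any other identity).
    """
    rng = random.Random(seed)
    w, b = [0.0] * len(samples[0]), 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = samples[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # Shrink the weights (gradient of the L2 regularizer).
            w = [wj * (1.0 - lr * reg) for wj in w]
            if margin < 1.0:
                # Sample violates the margin: move the hyperplane towards it.
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def svm_score(w, b, x):
    """Signed distance-like score; higher means more likely this identity."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


# Toy descriptors (assumed): positives for one identity, negatives for all others.
positives = [[2.0, 1.5], [1.8, 2.2], [2.5, 2.0]]
negatives = [[-2.0, -1.0], [-1.5, -2.2], [-2.3, -1.8]]
w, b = train_linear_svm(positives + negatives, [1] * 3 + [-1] * 3)

# Rank downloaded candidates by score and keep the top ones
# (the slide keeps the top 1,000; here the top 2 of 4).
candidates = [[2.1, 1.9], [-1.9, -2.0], [1.7, 2.4], [-2.2, -1.2]]
kept = sorted(candidates, key=lambda x: svm_score(w, b, x), reverse=True)[:2]
print(kept)
```

In a one-vs-rest setup, one such classifier would be trained per identity, each with its own positives and the pooled images of all other identities as negatives.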
Stage 4: Near-Duplicate Removal
- Duplicate images can be returned by different search engines or search terms
- Compute a VLAD descriptor for each image
- Cluster the descriptors within the 1,000 images for each identity
- Use a very tight threshold and retain a single element per cluster
- VLAD (Vector of Locally Aggregated Descriptors)
  - Very low-dimensional (e.g. 16 bytes per image)
  - All descriptors for very large image datasets can fit into main memory
  - Starts by vector-quantizing a locally invariant descriptor such as SIFT (Scale-Invariant Feature Transform)
- Images in the same cluster are near duplicates; images in different clusters are distinct
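One simple way to realize "tight threshold, one element per cluster" is greedy leader clustering, sketched below. The clustering algorithm, the threshold value, and the 2-D toy descriptors are assumptions; the slide does not say which clustering method was used, and real VLAD descriptors are not computed here.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length descriptors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def remove_near_duplicates(descriptors, threshold):
    """Return indices of one representative per cluster.

    An image starts a new cluster only if it is farther than `threshold`
    from every representative kept so far; otherwise it is dropped as a
    near duplicate.
    """
    kept = []
    for i, d in enumerate(descriptors):
        if all(euclidean(d, descriptors[k]) > threshold for k in kept):
            kept.append(i)
    return kept


# Toy descriptors (assumed): images 0/1 and 2/3 are near duplicates.
descriptors = [[0.0, 0.0], [0.01, 0.0], [1.0, 1.0], [1.0, 1.005], [3.0, 0.0]]
print(remove_near_duplicates(descriptors, threshold=0.05))  # [0, 2, 4]
```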
Stage 5: Final Manual Filtering
- Increase the purity of the data with human annotations
- Automatic ranking is used to reduce annotation time and cost
- A multi-way CNN is trained to distinguish between the 2,622 identities using the AlexNet architecture
- AlexNet (a CNN written in CUDA to run with GPU support)
  - 8 weight layers (5 convolutional, 3 fully connected), 7 of them hidden
  - Uses ReLU (rectified linear unit) activations: f(x) = max(0, x), which is quicker to train than the standard f(x) = tanh(x)
- Final number of images obtained: 982,803
Data Collection Overview
- Small human annotation cost (14 days of manual effort)
- A considerable part of the acquisition process is carried out automatically
- Stage labels: A for automatic, M for manual
- EER (equal error rate) values measure performance on LFW
Convolutional Neural Networks (CNN)
- Based on neural image reception
  - Neural receptors have receptive fields
  - These fields overlap to give a more accurate representation
- Every image is a matrix of pixel values (each pixel potentially a different color)
- Convolution takes a filter, overlays it on the picture, and moves it across by its stride size
Convolutional Neural Networks (CNN)
- Running a filter over an image generates a feature map
- Different filters produce different feature maps
  - Edge detection, sharpen, blur, etc.
- The size of the feature maps depends on:
  - Depth (number of filters)
  - Stride (number of pixels the filter is slid each step)
  - Zero-padding (controls the size of the feature maps; convolution with zero-padding is called wide convolution)
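The effect of stride and zero-padding on feature-map size can be sketched in plain Python. The toy image and the 2x2 edge filter are assumptions chosen for illustration; as in most CNN libraries, the filter is applied without flipping (strictly a cross-correlation).

```python
def output_size(input_size, filter_size, stride=1, padding=0):
    """Feature-map size along one dimension: (W - F + 2P) / S + 1."""
    return (input_size - filter_size + 2 * padding) // stride + 1


def convolve2d(image, kernel, stride=1, padding=0):
    """Slide a square kernel over a 2-D image (no kernel flip)."""
    h, w, k = len(image), len(image[0]), len(kernel)
    # Surround the image with `padding` rows/columns of zeros.
    padded = [[0.0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]
    out = []
    for i in range(output_size(h, k, stride, padding)):
        row = []
        for j in range(output_size(w, k, stride, padding)):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    acc += padded[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(acc)
        out.append(row)
    return out


# Toy 4x4 image with a vertical edge, and a filter that responds to it.
image = [[0, 0, 1, 1] for _ in range(4)]
kernel = [[-1, 1], [-1, 1]]
print(convolve2d(image, kernel))  # [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
print(output_size(224, 3, stride=1, padding=1))  # 224: padding preserves the size
```

The feature map peaks exactly where the edge sits, which is the sense in which a filter "detects" a feature.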
Network Architecture and Training
- The CNNs used in this paper are very deep, with multiple layers
- Such networks achieve state-of-the-art performance in some tasks of the ImageNet ILSVRC (Large Scale Visual Recognition Challenge) 2014
Learning a Face Classifier
- N-way classification (here N = 2,622)
- The CNN associates to each training image $\ell_t$, $t = 1, \dots, T$, a score vector $x_t = W\phi(\ell_t) + b \in \mathbb{R}^N$
- Uses a final fully connected layer containing N linear predictors $W \in \mathbb{R}^{N \times D}$, $b \in \mathbb{R}^N$, one per identity
- Scores are compared to the ground-truth class identity $c_t \in \{1, \dots, N\}$ by computing the empirical softmax log-loss
- Score vectors are compared using Euclidean distance to identify a face
- Bootstrapping the network as a classifier makes training significantly easier and faster (compared to training directly with triplet loss)
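The scoring and loss on this slide can be sketched for a single training image. The tiny dimensions (N = 3 identities, D = 2 features) and all numeric values are assumptions for illustration; classes are 0-indexed here, whereas the slide uses 1..N.

```python
import math


def fc_scores(W, b, phi):
    """Score vector x = W * phi + b, one score per identity."""
    return [sum(wij * pj for wij, pj in zip(row, phi)) + bi
            for row, bi in zip(W, b)]


def softmax_log_loss(scores, true_class):
    """-log softmax(scores)[true_class], using a stable log-sum-exp."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[true_class]


# Toy example (assumed values): N = 3 identities, D = 2 features.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
phi = [2.0, 0.5]              # stand-in for the CNN output phi(l_t)
x = fc_scores(W, b, phi)      # [2.0, 0.5, -2.5]
print(softmax_log_loss(x, true_class=0))  # small loss: class 0 scores highest
print(softmax_log_loss(x, true_class=2))  # large loss: class 2 scores lowest
```

The empirical loss over the training set is the sum of this quantity over all T images.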
Learn a Face Embedding Using a Triplet Loss
- Learn score vectors that perform well in the final application
- Similar to metric learning (learning a similarity function)
- Triplet loss operates on an anchor image, a positive (another image of the same identity that is not the anchor itself), and a negative (an image of a different identity)
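A common hinge form of the triplet loss, for one (anchor, positive, negative) triplet, is sketched below. The margin value 0.2 and the 2-D embeddings are assumptions for illustration, not values taken from the slides.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther from the anchor than the
    positive by at least `margin` (in squared distance)."""
    return max(0.0, margin
               - squared_distance(anchor, negative)
               + squared_distance(anchor, positive))


anchor, positive = [0.0, 0.0], [0.1, 0.0]
print(triplet_loss(anchor, positive, negative=[1.0, 0.0]))  # 0.0: already well separated
print(triplet_loss(anchor, positive, negative=[0.2, 0.0]))  # positive: margin is violated
```

Only triplets that violate the margin contribute a gradient, which is why training focuses on the hard cases.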
Architecture
- 8 convolutional blocks and 3 fully connected blocks (the fully connected blocks perform classification)
- Fully connected: the size of the filters matches the size of the input data (each filter senses data from the entire image)
- Each convolution layer is followed by one or more non-linearities such as ReLU and max pooling
- Max pooling (2x2) reduces variance and computational complexity
- ReLU speeds up training (the gradient is simple to compute: no division or multiplication, and all negatives are set to 0.0)
- Why so many layers?
  - More convolution layers allow more complicated objects to be recognized
  - E.g. layer 1 for edge detection, layer 2 for shapes, layer 3 and beyond for higher-level features
ReLU
- Replaces all negative values in the feature map with zero
- Introduces non-linearity into the CNN (real-world data is non-linear)
- Needed because convolution is a linear operation
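A minimal sketch of ReLU applied to a feature map, with its gradient compared against tanh. The feature-map values are made up for illustration.

```python
import math


def relu(x):
    """f(x) = max(0, x)."""
    return max(0.0, x)


def relu_map(feature_map):
    """Apply ReLU element-wise to a 2-D feature map."""
    return [[relu(v) for v in row] for row in feature_map]


def relu_grad(x):
    """Gradient is a simple comparison: 1 for positive inputs, else 0."""
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    """Gradient of tanh; it shrinks towards 0 for large |x| (saturation)."""
    return 1.0 - math.tanh(x) ** 2


print(relu_map([[-1.5, 2.0], [0.0, -0.1]]))  # [[0.0, 2.0], [0.0, 0.0]]
print(relu_grad(5.0), tanh_grad(5.0))        # ReLU keeps gradient 1; tanh is nearly 0
```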
Pooling
- Also called subsampling or down-sampling
- Reduces the dimensionality of each feature map but retains the most important information (max, average, or sum)
- Makes the input representation more manageable
- Makes the network invariant to small transformations
Convolutional Neural Network
- Combination of convolution, ReLU, and pooling
- Once the number of pixels has been reduced, a fully connected layer can be used
- The fully connected layer classifies the input based on the high-level features generated by convolution + ReLU + pooling
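Max pooling over non-overlapping 2x2 windows can be sketched as follows; the 4x4 feature map is a made-up example, and even height and width are assumed.

```python
def max_pool_2x2(feature_map):
    """Keep the maximum of each non-overlapping 2x2 window (stride 2)."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [max(feature_map[i][j], feature_map[i][j + 1],
             feature_map[i + 1][j], feature_map[i + 1][j + 1])
         for j in range(0, cols - 1, 2)]
        for i in range(0, rows - 1, 2)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
print(max_pool_2x2(feature_map))  # [[4, 2], [2, 8]]
```

Each output value summarizes four inputs, so the map shrinks to a quarter of its size while the strongest responses survive.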
Training
- 3 CNN configurations (A, B, D)
- Optimized with stochastic gradient descent, regularized with dropout and weight decay
- Filter weights initialized by random sampling from a Gaussian distribution
- Triplet loss is trained with the network frozen except for the last fully connected layer
  - 10 epochs, where each epoch contains all possible positive pairs (a, p), with a the anchor and p a similar image
Datasets and Evaluation Protocols
- Use existing benchmarks:
  - LFW dataset (13,233 images and 5,749 identities)
  - YTF dataset (3,425 videos and 1,595 identities)
- Follow the standard evaluation protocols and report EER
- EER (equal error rate) is the error rate at the ROC (receiver operating characteristic) operating point where the false positive and false negative rates are equal
  - Sweep the decision threshold to obtain the false acceptance rate and false rejection rate
  - EER is where the two rates are equal (plot both; the EER is the point of intersection)
  - Lower value = higher accuracy
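The EER definition can be sketched as a threshold sweep over pair distances. The distances below are made up, and picking the threshold where the two rates are closest is one simple approximation; the benchmark protocols themselves are not reproduced here.

```python
def error_rates(same_dists, diff_dists, threshold):
    """A pair is accepted as 'same person' when its distance <= threshold."""
    far = sum(d <= threshold for d in diff_dists) / len(diff_dists)  # false acceptance rate
    frr = sum(d > threshold for d in same_dists) / len(same_dists)   # false rejection rate
    return far, frr


def equal_error_rate(same_dists, diff_dists):
    """Sweep thresholds; return (EER estimate, threshold) where FAR and FRR are closest."""
    best = None
    for t in sorted(set(same_dists) | set(diff_dists)):
        far, frr = error_rates(same_dists, diff_dists, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy distances (assumed): same-identity pairs should be closer than different ones.
same_dists = [0.2, 0.3, 0.4, 0.9]
diff_dists = [0.35, 0.8, 1.0, 1.2]
print(equal_error_rate(same_dists, diff_dists))  # (0.25, 0.4)
```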
Experimental Results / Analysis
- Implementation based on the MATLAB toolbox MatConvNet
  - Linked against the NVIDIA cuDNN libraries to accelerate training
  - Used 4 NVIDIA Titan Black GPUs
- Face images are input at 224 x 224 pixels
- Training on the full dataset (F) is better than on the curated dataset (C)
- 2D alignment slightly improves performance
- Learning an embedding for verification significantly boosts performance (deep convolutional network + triplet loss)
Component Analysis
- More data is better (curation might remove hard positives)
- 2D alignment is better
  - Face alignment relies on landmark detection (localizing facial key points)
- Slight improvement from configuration A to B, no difference between B and D
- Triplet loss embedding improves performance by 1.8 percentage points
  - This corresponds to reducing the error rate by 68%
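The link between a 1.8-point gain and a 68% error reduction is simple arithmetic, sketched below. The two accuracy values are hypothetical, chosen only so that the gain is 1.8 points; they are not the paper's reported numbers.

```python
def relative_error_reduction(acc_before, acc_after):
    """Fraction of the original error removed; accuracies are in percent."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return (err_before - err_after) / err_before


# Hypothetical accuracies: a 1.8-point gain starting from a ~2.65% error.
print(relative_error_reduction(97.35, 99.15))  # about 0.68

# Conversely, the slide's two figures imply the starting error rate:
print(1.8 / 0.68)  # about 2.65 (percent)
```

A small absolute gain is a large relative gain when the starting error is already low.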
Comparison with State of the Art: LFW
- Achieved results comparable to the state of the art while requiring less data for learning and using a simpler network architecture (compared to DeepID)
- ROC curves are shown in the slide figure
Comparison with State of the Art: YTF
- K = number of faces used to represent each video
- Triplet loss embedding achieves state-of-the-art results
Conclusion
- Use weaker classifiers to rank the data presented to annotators, saving time (less human work)
- A deep CNN without embellishments but with proper training can achieve results comparable to the state of the art
How Good is Facial Recognition Today?
- Apple (pioneers of advancement) uses DCNs (deep convolutional networks)
  - iPhone X with Face ID
- An Asian woman's iPhone X could be unlocked by her teenage son
- Brothers can unlock the same phones
- People under the age of 14 may not yet have sufficiently distinguishing features, which leads to errors
References
- https://www.robots.ox.ac.uk/~vgg/publications/2015/parkhi15/parkhi15.pdf
- http://www.robots.ox.ac.uk/~vgg/publications/2013/simonyan13/simonyan13.pdf
- https://www.robots.ox.ac.uk/~vgg/publications/2013/arandjelovic13/arandjelovic13.pdf
- http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf
- https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
Thank You!