Deep Face Recognition. Nathan Sun



Why Facial Recognition?
- Picture ID or video tracking
- Higher security with facial recognition software
- Immensely useful to police in tracking suspects
- Your face becomes your ID, so you no longer need to carry a driver's license
- Collect demographic data, or personal data geared towards an individual (e.g. a gambling addict)

Current Issues
- Difficulty in generating an image dataset without too much personpower
- You need a large dataset to train CNNs
- Large public datasets have been lacking
- Large corporations (Facebook, Google, etc.) have huge datasets
- Investigate various CNN architectures (face identification and verification)

Dataset Comparisons
Dataset                Identities   Images
LFW                    5,749        13,233
WDRef (another paper)  2,995        99,773
CelebFaces             10,177       202,599
Ours (this paper)      2,622        2,600,000
Facebook               4,030        4,400,000
Google                 8,000,000    200,000,000

Current State of Affairs
- DeepFace
  - Siamese network: the same CNN is applied to both faces of a pair to obtain descriptors, which are compared using Euclidean distance
  - The CNN is trained to minimize the distance between congruous pairs of faces and maximize the distance between incongruous pairs
- DeepID
  - Multiple CNNs with multiple layers
  - Very complicated, with over 200 CNNs
- Google CNN
  - Triplet-based loss: a pair of congruous faces (a, b) and an incongruous face (c)
  - Goal is to make a closer to b than to c, with a acting as the pivot face
  - Achieves best performance on LFW (Labeled Faces in the Wild) and YTF (YouTube Faces)

Dataset Collection
- 5-stage process
- Desire for more automation vs. manual acquisition
- Reduce man-hours, increase computer hours

Stage 1: Bootstrapping + Filtering
- Bootstrapping is a nonparametric approach to statistical inference
  - Uses the variability within a sample to estimate a statistic's sampling distribution empirically, instead of making assumptions about it
  - Can be done with or without replacement (here it is without)
- Obtain candidate list from the IMDB celebrity list
  - 2.5k actors and 2.5k actresses (sufficient number of photos for each)
- Download 200 images for each of the 5k names
- Human annotators check for sufficient image purity (no similar images)
  - Narrows down to 3,250 identities
- Check against LFW and YTF and remove names already there
  - 2,622 final number of identities

Stage 2: Collect More Images
- Query each name on Google and Bing Image Search
- Query once more after appending "actor" to the name
- 500 images for each query, for a total of 2,000 images per identity

Stage 3: Improve Purity with Automatic Filter
- Remove erroneous faces in each set using a classifier
- Top 50 images (by Google search rank) for each identity are used as positive training samples
- Top 50 images of all other identities are used as negative training samples
- Train a one-vs-rest linear SVM on Fisher Vector Faces descriptors for each identity
  - SVM (support vector machine) for classification
  - Support vectors define the separating hyperplane
  - Hard-margin, since the data is linearly separable
- The SVM ranks the images of each identity and the top 1,000 are retained
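
The ranking step in Stage 3 can be sketched as follows. This is a minimal illustration, assuming a linear classifier (w, b) has already been trained for one identity; the weights, bias, and two-dimensional descriptors are made-up toy values rather than real Fisher Vector descriptors, and `keep_top_k` is a hypothetical helper name.

```python
# Sketch: rank one identity's downloaded images by a linear classifier score
# and keep the best-scoring ones. All numbers below are toy values.

def linear_score(w, b, x):
    """Decision value w.x + b of a linear classifier (e.g. a linear SVM)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def keep_top_k(descriptors, w, b, k):
    """Return the indices of the k highest-scoring images, best first."""
    scores = [linear_score(w, b, x) for x in descriptors]
    ranked = sorted(range(len(descriptors)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

w, b = [1.0, -1.0], 0.0                      # assumed, already-trained classifier
descriptors = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.3], [0.1, 0.9]]
kept = keep_top_k(descriptors, w, b, k=2)    # in the slides k would be 1,000
print(kept)                                  # [0, 2]
```

Images 0 and 2 score highest for this identity and are kept; the other two are treated as erroneous and dropped.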

Stage 4: Near Duplicate Removal
- Duplicate images can be returned by different search engines or search terms
- Compute a VLAD descriptor for each image
- Cluster the descriptors within the 1,000 images for each identity
- Use a very tight threshold and retain a single element per cluster
- VLAD (Vector of Locally Aggregated Descriptors)
  - Very low dimensional (e.g. 16 bytes per image)
  - All descriptors for very large image datasets can fit into main memory
  - Starts by vector quantizing a locally invariant descriptor such as SIFT (Scale Invariant Feature Transform)
  - Each cluster groups similar elements; different clusters hold different elements
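
A rough sketch of the "tight threshold, one element per cluster" idea is shown below. The slides do not say which clustering algorithm was used, so this substitutes a simple greedy scheme; the two-dimensional descriptors and the threshold of 0.05 are illustrative assumptions, not VLAD values.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two descriptors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def remove_near_duplicates(descriptors, threshold):
    """Greedy clustering: keep an image only if it is farther than
    `threshold` from every image kept so far (one element per cluster)."""
    kept = []
    for i, d in enumerate(descriptors):
        if all(euclidean(d, descriptors[j]) > threshold for j in kept):
            kept.append(i)
    return kept

# Toy descriptors: images 0/1 and 2/3 are near duplicates of each other.
descriptors = [[0.0, 0.0], [0.01, 0.0], [1.0, 1.0], [1.0, 1.005], [0.5, 0.2]]
unique = remove_near_duplicates(descriptors, threshold=0.05)
print(unique)  # [0, 2, 4]
```

A tight threshold means only images that are almost identical are merged, so genuinely different photos of the same person survive.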

Stage 5: Final Manual Filtering
- Increase purity of data with human annotations
- Automatic ranking is used to reduce annotation time and cost
- Multi-way CNN trained to distinguish between the 2,622 identities using the AlexNet architecture
- AlexNet (CNN written with CUDA to run with GPU support)
  - 5 convolutional and 3 fully connected weight layers (7 hidden layers plus the output layer)
  - ReLU (rectified linear unit) is used as the activation function
    - f(x) = tanh(x) (standard)
    - f(x) = max(0, x) (ReLU, quick to train)
- Final number of images obtained: 982,803

Data Collection Overview
- Small human annotation cost (14 days of manual effort)
- A considerable part of the acquisition process is carried out automatically
- Table legend: A for automatic, M for manual
- EER (equal error rate) values are performance on LFW

Convolutional Neural Networks (CNN)
- Based on neural image reception
  - Neural receptors have receptive fields
  - These fields overlap to present a more accurate representation
- Every image is a matrix of pixel values (each pixel potentially a different color)
- Convolution takes a filter, overlays it on the picture, and slides it across by its stride size

Convolutional Neural Networks (CNN)
- Running a filter over an image generates a feature map
- Different filters produce different feature maps
  - Edge detection, sharpen, blur, etc.
- Size of feature maps depends on:
  - Depth (number of filters)
  - Stride (number of pixels the filter is slid at each step)
  - Zero-padding (controls the size of feature maps; using zero-padding is called wide convolution)
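
To make stride and zero-padding concrete, here is a small pure-Python sketch of sliding one filter over a square image. It assumes a square single-channel image and a square filter, and computes cross-correlation (no filter flip), which is what CNN libraries usually call convolution. The image and filter values are toy examples.

```python
def output_size(n, f, stride, pad):
    """Side length of the feature map for an n x n input and f x f filter."""
    return (n + 2 * pad - f) // stride + 1

def conv2d(image, kernel, stride=1, pad=0):
    """Slide `kernel` over a zero-padded square `image` with the given stride."""
    n, f = len(image), len(kernel)
    size = n + 2 * pad
    padded = [[0.0] * size for _ in range(size)]
    for r in range(n):
        for c in range(n):
            padded[r + pad][c + pad] = image[r][c]
    out_n = output_size(n, f, stride, pad)
    out = []
    for i in range(out_n):
        row = []
        for j in range(out_n):
            acc = 0.0
            for u in range(f):
                for v in range(f):
                    acc += padded[i * stride + u][j * stride + v] * kernel[u][v]
            row.append(acc)
        out.append(row)
    return out

# 4x4 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1]] * 4
# Simple horizontal-gradient filter: responds where brightness rises left to right.
kernel = [[-1, 1], [-1, 1]]

feature_map = conv2d(image, kernel)           # stride 1, no padding: 3x3 output
print(feature_map)                            # [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
print(output_size(224, 3, stride=1, pad=1))   # 224: padding preserves the size
```

The filter responds only in the column where the edge sits, which is the feature map for "vertical edge". Increasing the stride shrinks the map, and zero-padding grows it back.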

Network Architecture and Training
- CNNs used in this paper are very deep, with multiple layers
- Can achieve state-of-the-art performance in some tasks of the ImageNet ILSVRC (Large Scale Visual Recognition Challenge) 2014

Learning a Face Classifier
- N-way classification (here 2,622-way)
- The CNN associates to each training image $\ell_t$, $t = 1, \dots, T$, a score vector $x_t = W \phi(\ell_t) + b \in \mathbb{R}^N$
- Uses a fully connected layer containing N linear predictors $W \in \mathbb{R}^{N \times D}$, $b \in \mathbb{R}^N$, one per identity
- Scores are compared to the ground-truth class identity $c_t \in \{1, \dots, N\}$ by computing the empirical softmax log-loss
- Euclidean distance is used to compare score vectors to identify a face
- Bootstrapping the network as a classifier makes training significantly easier and faster (compared to training with triplet loss directly)
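
As a hedged illustration of the score vector and the softmax log-loss, the sketch below uses a toy setting with N = 3 identities and D = 2 descriptor dimensions. The values of W, b, and the descriptor `phi` are made up, and class indices are 0-based here, whereas the slide counts from 1.

```python
import math

def scores(W, b, phi):
    """Score vector x = W * phi + b, one entry per identity."""
    return [sum(wij * pj for wij, pj in zip(row, phi)) + bi
            for row, bi in zip(W, b)]

def softmax_log_loss(x, c):
    """-log(exp(x[c]) / sum_q exp(x[q])), computed in a numerically stable way."""
    m = max(x)
    log_sum = m + math.log(sum(math.exp(xq - m) for xq in x))
    return log_sum - x[c]

W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # N x D, toy values
b = [0.0, 0.0, -1.0]                        # one bias per identity
phi = [2.0, 0.0]                            # CNN descriptor of one training image

x = scores(W, b, phi)                       # [2.0, 0.0, 1.0]
loss_correct = softmax_log_loss(x, c=0)     # ground truth is the top-scoring class
loss_wrong = softmax_log_loss(x, c=1)       # ground truth is a low-scoring class
print(x, loss_correct < loss_wrong)         # [2.0, 0.0, 1.0] True
```

The loss is small when the ground-truth identity receives the highest score and grows as its score falls behind the others; training sums this loss over all T images.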

Learn a Face Embedding Using a Triplet Loss
- Learn score vectors that perform well in the final application
- Like metric learning (learning a similarity function)
- Triplet loss depends on having an anchor image, a positive image of the same identity that is not the anchor itself, and a negative image of a different identity
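
A minimal sketch of a triplet loss on embedded vectors follows. It uses the common hinge form max(0, margin + d(a, p) - d(a, n)) with squared Euclidean distances; the margin of 0.2 and the two-dimensional embeddings are illustrative assumptions, not values taken from the slides.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther from the anchor than the positive
    by at least `margin`; positive otherwise."""
    return max(0.0, margin + sq_dist(anchor, positive) - sq_dist(anchor, negative))

anchor = [0.0, 0.0]
positive = [0.1, 0.0]          # same identity, close to the anchor
easy_negative = [1.0, 0.0]     # different identity, already far away
hard_negative = [0.3, 0.0]     # different identity, still too close

print(triplet_loss(anchor, positive, easy_negative))      # 0.0
print(triplet_loss(anchor, positive, hard_negative) > 0)  # True
```

Only triplets that violate the margin contribute to the loss, so training effort concentrates on faces that are still confused with one another.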

Architecture
- 8 convolutional blocks and 3 fully connected blocks (fully connected = classification)
- Fully connected: the size of the filters matches the size of the input data (each filter senses data from the entire image)
- Each convolution layer is followed by ReLU or max pooling
  - Max pooling reduces variance and computational complexity (2x2 pooling)
  - ReLU speeds up training (the gradient is simple and cheap to compute: no division or multiplication, all negatives are set to 0.0)
- Why so many layers?
  - More convolution layers = more complicated objects can be recognized
  - E.g. layer 1 for edge detection, layer 2 for shapes, layer 3 and beyond for higher-level features

ReLU
- Replaces all negative pixel values in the feature map with zero
- Introduces non-linearity into the CNN (real-world data is non-linear)
- Needed because convolution is a linear operation
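
The effect of ReLU on a feature map, and one reason it is quick to train with, can be sketched like this. The feature-map values are toy numbers; the gradient comparison with tanh is a general property of the two functions rather than something stated on the slide.

```python
import math

def relu(x):
    """Rectified linear unit: negatives become zero, positives pass through."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def tanh_grad(x):
    """Gradient of tanh: shrinks towards 0 for large |x| (saturation)."""
    return 1.0 - math.tanh(x) ** 2

feature_map = [[-2.0, 0.5], [3.0, -0.1]]
rectified = [[relu(v) for v in row] for row in feature_map]
print(rectified)                          # [[0.0, 0.5], [3.0, 0.0]]
print(relu_grad(5.0), tanh_grad(5.0))     # ReLU keeps gradient 1.0; tanh is nearly 0
```

For large activations tanh saturates and passes back almost no gradient, while ReLU keeps a gradient of 1 and needs only a comparison to evaluate.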

Pooling
- Subsampling or down-sampling
- Reduces dimensionality of each feature map but retains the most important information (max, average, or sum)
- Makes the input representation more manageable
- Makes the network invariant to small transformations
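
A short sketch of 2x2 max pooling with stride 2 is given below, using a toy 4x4 feature map. The function name `max_pool` and the sample values are illustrative.

```python
def max_pool(feature_map, size=2, stride=2):
    """Keep the maximum of each size x size window, moving by `stride`."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [max(feature_map[r + u][c + v] for u in range(size) for v in range(size))
         for c in range(0, cols - size + 1, stride)]
        for r in range(0, rows - size + 1, stride)
    ]

fm = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 8],
    [1, 1, 7, 5],
]
pooled = max_pool(fm)
print(pooled)  # [[6, 2], [2, 9]]
```

The 4x4 map shrinks to 2x2 while the strongest response in each window survives. Moving a strong value to a neighbouring position inside the same window leaves the output unchanged, which is the small-transformation invariance mentioned above.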

Convolutional Neural Network
- Combination of convolution, ReLU, and pooling
- Once the number of pixels has been reduced, a fully connected layer can be used
- Classifies the input based on high-level features generated by convolution + ReLU + pooling

Training
- 3 CNN configurations (A, B, D)
- Optimized with stochastic gradient descent, regularized with dropout and weight decay
- Filter weights in the CNN are initialized by random sampling from a Gaussian distribution
- Triplet loss is trained with the network frozen except for the last fully connected layer
  - 10 epochs, where each epoch contains all possible positive pairs (a, p), with a the anchor and p a similar image
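
The three training ingredients named above (Gaussian initialization, SGD with weight decay, and dropout) can be sketched in a few lines. The learning rate, weight decay, standard deviation, and dropout rate below are placeholder values chosen for illustration, and the dropout shown is the common "inverted" variant; none of these specifics come from the slides.

```python
import random

def init_weights(n, std=0.01, seed=0):
    """Sample n weights from a zero-mean Gaussian (std is a placeholder)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in range(n)]

def sgd_step(weights, grads, lr=0.01, weight_decay=5e-4):
    """One SGD update with L2 weight decay: w <- w - lr * (grad + decay * w)."""
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]

def dropout(activations, rate=0.5, rng=None):
    """Inverted dropout: zero each unit with probability `rate`
    and rescale the survivors so the expected value is unchanged."""
    rng = rng or random.Random(0)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

weights = init_weights(4)
weights = sgd_step(weights, grads=[0.1, -0.2, 0.0, 0.3])
hidden = dropout([1.0, 2.0, 3.0, 4.0], rate=0.5)
print(weights, hidden)
```

Weight decay pulls every weight slightly towards zero on each step, and dropout randomly silences units during training; both discourage the network from memorizing the training faces.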

Datasets and Evaluation Protocols
- Use existing benchmarks:
  - LFW dataset (13,233 images and 5,749 identities)
  - YTF dataset (3,425 videos and 1,595 identities)
- Follow the standard evaluation protocol and report EER
- EER (equal error rate) is the error rate at the ROC (receiver operating characteristic) operating point where the false positive and false negative rates are equal
  - Sweep the decision threshold to obtain the false acceptance rate and false rejection rate
  - EER is where the two rates are equal (plot both; the EER is the point of intersection)
  - Lower value = higher accuracy
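
The EER definition can be sketched by sweeping a distance threshold over a handful of verification pairs. The distances below are invented toy values; "genuine" means both faces belong to the same person and "impostor" means they do not. Real evaluations interpolate the ROC curve, whereas this sketch simply picks the threshold where the two rates are closest.

```python
def error_rates(genuine, impostor, threshold):
    """Pairs with distance <= threshold are accepted as the same person."""
    far = sum(d <= threshold for d in impostor) / len(impostor)  # false acceptance
    frr = sum(d > threshold for d in genuine) / len(genuine)     # false rejection
    return far, frr

def equal_error_rate(genuine, impostor):
    """Return (EER estimate, threshold) where FAR and FRR are closest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = error_rates(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]

genuine = [0.2, 0.3, 0.4, 0.7]    # same-person pair distances (toy values)
impostor = [0.5, 0.8, 0.9, 1.0]   # different-person pair distances (toy values)

eer, threshold = equal_error_rate(genuine, impostor)
print(eer, threshold)  # 0.25 0.5
```

At a threshold of 0.5 one of four impostor pairs is accepted and one of four genuine pairs is rejected, so both rates equal 25%, which is the EER for this toy set.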

Experimental Results / Analysis
- Implementation based on the MATLAB toolbox MatConvNet
  - Linked against NVIDIA cuDNN libraries to accelerate training
  - Used 4 NVIDIA Titan Black GPUs
- Face images are input at 224 x 224 pixels
- Training on the full dataset (F) is better than on the curated dataset (C)
- 2D alignment slightly improves performance
- Learning an embedding for verification significantly boosts performance (DCN + triplet loss)

Component Analysis
- More data is better (curation might remove hard positives)
- 2D alignment is better
  - Face alignment is landmark detection (localizing facial key points)
- Slight improvement from config A to B, no difference between B and D
- Triplet loss embedding improves performance by 1.8% (absolute)
  - This means reducing the error rate by 68%
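
The two figures on this slide are consistent with each other, as a quick back-of-the-envelope check shows. The only inputs are the 1.8-point gain and the 68% relative reduction quoted above; the error rates before and after are derived from them and are approximate, not values read from the paper's tables.

```python
absolute_gain = 1.8         # percentage points gained, from the slide
relative_reduction = 0.68   # fraction of the error removed, from the slide

# If removing 1.8 points of error is 68% of the original error,
# the original error must have been about 1.8 / 0.68.
error_before = absolute_gain / relative_reduction
error_after = error_before - absolute_gain

print(round(error_before, 2), round(error_after, 2))  # 2.65 0.85
```

In other words, an error rate of roughly 2.6% drops to roughly 0.85%, which is why a seemingly small absolute gain counts as a large improvement.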

Comparison with State of the Art: LFW
- Achieved results comparable to the state of the art while requiring less data for learning and using a simpler network architecture (compared to DeepID)
- [Figure: ROC curves on LFW]

Comparison with State of the Art: YTF
- K = number of faces used to represent each video
- Triplet loss embedding achieves state of the art

Conclusion
- Use weaker classifiers to rank the data presented to annotators to save time (less human work)
- A deep CNN without embellishments but with proper training can achieve results comparable to the state of the art

How Good is Facial Recognition Today?
- Apple (pioneers of advancement) uses DCNs (deep convolutional networks)
  - iPhone X with Face ID
- An Asian woman's iPhone X could be unlocked by her teenage son
- Brothers can unlock each other's phones
- People under the age of 14 do not yet have distinguishing features, which leads to errors

References
- https://www.robots.ox.ac.uk/~vgg/publications/2015/parkhi15/parkhi15.pdf
- http://www.robots.ox.ac.uk/~vgg/publications/2013/simonyan13/simonyan13.pdf
- https://www.robots.ox.ac.uk/~vgg/publications/2013/arandjelovic13/arandjelovic13.pdf
- http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf
- https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

Thank You!