Rapid Face and Object Detection in iOS

Instructor - Simon Lucey. 16-623 - Designing Computer Vision Apps.

Today: Background on Rapid Face Detection. Facial Feature Detection in iOS. Dlib for Object Detection.

Face Detection

What can OpenCV do? Functionality overview: image processing; filters; transformations; video, stereo, 3D; calibration; pose estimation; edges, contours; robust features; segmentation; optical flow; detection and recognition; depth. Taken from "OpenCV 3.0 latest news and the roadmap".

OpenCV 3.0 In terms of detectors, all the standard ones are still there: the Viola & Jones style face detector and the Dalal & Triggs style pedestrian detector. OpenCV 3.0 now also supports deformable parts based models. Dalal, Navneet, and Bill Triggs. "Histograms of oriented gradients for human detection." 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). Vol. 1. IEEE, 2005. Viola, Paul, and Michael Jones. "Rapid object detection using a boosted cascade of simple features." Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001). Vol. 1. IEEE, 2001.

Viola & Jones Figure: classifiers placed along an axis running from computationally expensive (high capacity) to computationally cheap (low capacity). (Viola & Jones, 2001)

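Viola & Jones Instead of searching all regions of an image with a classifier of the same complexity, we can use a cascade. Figure: all sub-windows enter stage 1; a sub-window that passes a stage (T) moves on to the next stage (1, 2, 3, then further processing), while a sub-window that fails any stage (F) is rejected.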
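The early-rejection idea can be sketched in a few lines of Python. This is a minimal illustration, not the Viola & Jones detector: the three stage functions are hypothetical stand-ins (simple statistics on a list of pixel values), chosen only so that cheap tests run first.

```python
def run_cascade(window, stages):
    """Evaluate stages in order; stop at the first one that rejects.

    Returns (accepted, number_of_stages_evaluated).
    """
    for index, stage in enumerate(stages, start=1):
        if not stage(window):
            return False, index  # rejected early, later stages never run
    return True, len(stages)


def mean(values):
    return sum(values) / len(values)


# Hypothetical stages, ordered from cheapest to most selective.
stages = [
    lambda w: mean(w) > 0.2,           # stage 1: enough overall brightness
    lambda w: max(w) - min(w) > 0.3,   # stage 2: enough contrast
    lambda w: w[0] < w[-1],            # stage 3: dark-to-bright pattern
]

print(run_cascade([0.0, 0.1, 0.1], stages))  # rejected by stage 1
print(run_cascade([0.5, 0.5, 0.5], stages))  # rejected by stage 2
print(run_cascade([0.3, 0.5, 0.9], stages))  # passes all stages
```

Most windows leave at stage 1, so the average cost per window stays close to the cost of the cheapest test.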

Viola & Jones The vast majority of regions in a natural image can be rejected using small capacity classifiers. Figure 4: From left to right: input image, followed by portions of the image which contain un-rejected patches after the sequential evaluation of 1 (13.3% patches remaining), 10 (2.6%), 20 (0.01%) and 30 (0.002%) support vectors. Note that in these images, a pixel is displayed if it is part of any remaining un-rejected patch at any scale, orientation or position. This explains the apparent discrepancy between the above percentages and the visual impression.

Improving Speed Further Viola & Jones (2001) suggested using box filters.

Box Filters Given a (sub)image $I(x, y)$ and two adjacent regions $R_1$ and $R_2$, the feature value is $f(i) = \sum_{(x,y) \in R_2} I(x, y) - \sum_{(x,y) \in R_1} I(x, y)$.

Box Filters Three types of box filters. Vary size, aspect ratio, location, orientation. Defined over a 24 x 24 window, this gives roughly 160,000 distinct features. A classifier can be built using a single feature.
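To see where a number of this size comes from, the placements can be counted directly. The sketch below assumes the five commonly used shapes (two-rectangle and three-rectangle filters in both orientations, plus one four-rectangle filter); the slide does not list the exact set, so the shape table is an assumption.

```python
def count_placements(base_w, base_h, window=24):
    """Count every scaled, shifted copy of a base_w x base_h pattern
    that fits inside a window x window region."""
    total = 0
    for w in range(base_w, window + 1, base_w):      # widths: multiples of base_w
        for h in range(base_h, window + 1, base_h):  # heights: multiples of base_h
            total += (window - w + 1) * (window - h + 1)
    return total


# Assumed shape set: (width, height) of the smallest version, in cells.
SHAPES = {
    "two-rectangle, horizontal": (2, 1),
    "two-rectangle, vertical": (1, 2),
    "three-rectangle, horizontal": (3, 1),
    "three-rectangle, vertical": (1, 3),
    "four-rectangle": (2, 2),
}

counts = {name: count_placements(w, h) for name, (w, h) in SHAPES.items()}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name}: {n}")
print("total:", total)  # 162336 under these assumptions
```

Under these assumptions the total is 162,336, which is in line with the "160,000" quoted on the slide.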

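Why do this? We need to compute the box filter values many, many times, and we must do it very fast! The integral image $II$ of an image $I$ is $II(x, y) = \sum_{x' \le x,\; y' \le y} I(x', y')$, the sum of all pixels above and to the left of $(x, y)$, inclusive.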
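A minimal pure-Python sketch of this definition follows. It builds the integral image in a single pass, using a running row sum plus the value directly above; the function name and the list-of-lists image layout are choices made for this illustration.

```python
def integral_image(img):
    """Return II where II[y][x] is the sum of img over all rows <= y
    and all columns <= x (inclusive)."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        row_sum = 0  # sum of img[y][0..x]
        for x in range(cols):
            row_sum += img[y][x]
            above = ii[y - 1][x] if y > 0 else 0
            ii[y][x] = row_sum + above
    return ii


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
print(integral_image(image))
# [[1, 3, 6], [5, 12, 21], [12, 27, 45]]
```

The whole table costs one addition pass over the image, and it is computed once per frame no matter how many box filters are evaluated afterwards.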

Computing Integral Images Computing the sum of pixels in a rectangular area $A$ takes four lookups: $f(A) = II(A) - II(B) - II(C) + II(D)$. Here $II(A)$ is the integral image read at the bottom-right corner of $A$, $II(B)$ and $II(C)$ are read at the two adjacent corners (the areas above and to the left of $A$), and $II(D)$ is read at the top-left corner; it is added back because the area it covers is subtracted twice. A 3 box filter array takes only 8 lookups.
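The four-lookup rule, and the 8-lookup count for three adjacent boxes, can be checked with a short sketch. It uses an integral image padded with a leading row and column of zeros so that no border cases are needed; that padding, the function names, and the "centre minus outer boxes" sign convention are assumptions of this example.

```python
def padded_integral_image(img):
    """ii[y][x] = sum of img over rows < y and columns < x."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for y in range(rows):
        row_sum = 0
        for x in range(cols):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w x h rectangle with top-left pixel (x, y): four lookups,
    matching II(A) - II(B) - II(C) + II(D)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def three_box_feature(ii, x, y, w, h):
    """Centre box minus the two outer boxes, for three side-by-side w x h boxes.
    Neighbouring boxes share corners, so only 8 distinct lookups are needed."""
    top = [ii[y][x + k * w] for k in range(4)]         # 4 lookups
    bottom = [ii[y + h][x + k * w] for k in range(4)]  # 4 lookups
    left = bottom[1] - top[1] - bottom[0] + top[0]
    centre = bottom[2] - top[2] - bottom[1] + top[1]
    right = bottom[3] - top[3] - bottom[2] + top[2]
    return centre - (left + right)


image = [
    [1, 2, 3, 4, 5, 6],
    [1, 2, 3, 4, 5, 6],
]
ii = padded_integral_image(image)
print(rect_sum(ii, 2, 0, 2, 2))           # 14 = 3 + 4 + 3 + 4
print(three_box_feature(ii, 0, 0, 2, 2))  # -14 = 14 - (6 + 22)
```

The cost of either function does not depend on the box size, which is what makes scanning with many large filters affordable.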

Computational Comparison To evaluate a 24x24 region of an image using pixels with a linear classifier takes 576 lookups and flops. Conversely, to evaluate a 24x24 region with a 3 box filter classifier takes 8 lookups and flops.

Detection rates for various numbers of false detections, evaluation reported on the MIT-CMU Face Database (Viola & Jones, 2001):

False detections        10     31     50     65     78     95     110    167    422
Viola-Jones             78.3%  85.2%  88.8%  89.8%  90.1%  90.8%  91.1%  91.8%  93.7%
Rowley-Baluja-Kanade    83.2%  86.0%  -      -      -      89.2%  -      90.1%  89.9%

Viola & Jones Boosting is ideally suited to be used with box filters. Techniques like AdaBoost, LogitBoost or GentleBoost can naturally learn a complex classifier from a weighted combination of weak classifiers: $H(x) = \mathrm{sign}\left( \sum_{m=1}^{M} \alpha_m h_m(x) \right)$, where each $h_m$ is a weak classifier and $\alpha_m$ is its weight. (Viola & Jones, 2001)
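As a rough sketch of how such a weighted vote can be learned, the code below runs discrete AdaBoost over threshold "stumps" on scalar feature values (think of each value as one box filter response). It is a textbook-style illustration under simplifying assumptions, not the training procedure of the paper; all function names are made up for this example.

```python
import math


def stump_predict(stump, x):
    """Weak classifier h(x): +polarity if feature j exceeds the threshold."""
    j, threshold, polarity = stump
    return polarity if x[j] > threshold else -polarity


def candidate_stumps(X):
    """All stumps with thresholds between (and just below) observed values."""
    stumps = []
    for j in range(len(X[0])):
        values = sorted({x[j] for x in X})
        thresholds = [values[0] - 1.0]
        thresholds += [(a + b) / 2.0 for a, b in zip(values, values[1:])]
        for t in thresholds:
            stumps.append((j, t, +1))
            stumps.append((j, t, -1))
    return stumps


def train_adaboost(X, y, rounds):
    """Return a list of (alpha_m, h_m) pairs; labels y must be +1 or -1."""
    n = len(X)
    weights = [1.0 / n] * n
    stumps = candidate_stumps(X)
    model = []
    for _ in range(rounds):
        def weighted_error(s):
            return sum(w for w, xi, yi in zip(weights, X, y)
                       if stump_predict(s, xi) != yi)

        best = min(stumps, key=weighted_error)
        eps = min(max(weighted_error(best), 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - eps) / eps)
        model.append((alpha, best))
        # Increase the weight of misclassified samples, then renormalise.
        weights = [w * math.exp(-alpha * yi * stump_predict(best, xi))
                   for w, xi, yi in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def strong_classify(model, x):
    """H(x) = sign(sum_m alpha_m * h_m(x))."""
    score = sum(alpha * stump_predict(s, x) for alpha, s in model)
    return 1 if score >= 0 else -1


# No single threshold separates these labels, but three stumps together do.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [1, -1, -1, 1]
model = train_adaboost(X, y, rounds=3)
print([strong_classify(model, x) for x in X])
```

On this toy set the best single stump still misclassifies one of the four points, while the three-stump vote classifies all of them correctly.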

Schneiderman & Kanade Schneiderman & Kanade reported improved results using a semi-naive Bayesian classifier. The technique used a cascade of box filters, but used mutual information (MI) to group features into independent sets.

Schneiderman & Kanade

The Future - Deep Detection? Farfade, Sachin Sudhakar, Mohammad J. Saberian, and Li-Jia Li. "Multi-view face detection using deep convolutional neural networks." Proceedings of the 5th ACM on International Conference on Multimedia Retrieval. ACM, 2015. Li, Haoxiang, et al. "A convolutional neural network cascade for face detection." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

The Future - Deep Detection Figure: structures of the three detection networks.
12-net: 12x12 input image, 3 channels; convolutional layer, 16 3x3 filters, stride 1; max-pooling layer, 3x3 kernel, stride 2; fully-connected layer, 16 outputs; labels, 2 classes (face / non-face).
24-net: 24x24 input image, 3 channels; convolutional layer, 64 5x5 filters, stride 1; max-pooling layer, 3x3 kernel, stride 2; fully-connected layer, 128 outputs, combined with the fully-connected layer of the 12-net applied to the resized input; labels, 2 classes (face / non-face).
48-net: 48x48 input image, 3 channels; convolutional layers, 64 5x5 filters, stride 1; max-pooling layers, 3x3 kernel, stride 2; normalization layers, 9x9 region; fully-connected layer, 256 outputs, combined with the fully-connected layer of the 24-net applied to the resized input; labels, 2 classes (face / non-face).
Still uses a cascade strategy for speed. The 12x12 network disposes of 90% of the image; the subsequent denser networks are then applied only to what remains. Li, Haoxiang, et al. "A convolutional neural network cascade for face detection." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

The Future - Deep Detection Li, Haoxiang, et al. "A convolutional neural network cascade for face detection." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

Today: Background on Rapid Face Detection. Facial Feature Detection in iOS. Dlib for Object Detection.

Face Detection in iOS In iOS, face detection comes built in and can be performed much more efficiently than with standard OpenCV. It utilizes the QuartzCore and CoreImage frameworks within the project.

Facial Feature Detection

CIFaceFeature

CIFaceFeature Example We are now going to demonstrate a simple example of face detection in iOS. In your browser, please go to https://github.com/slucey-cs-cmu-edu/cifacefeature_lena Or better yet, if you have git installed, you can type from the command line: $ git clone https://github.com/slucey-cs-cmu-edu/cifacefeature_lena.git

CIFaceFeature Example

Smerk and GPUImage Recently, an extension to GPUImage was proposed that allows iOS face detection to be used within GPUImage. It is called Smerk; the GitHub project page can be found at https://github.com/mattfoley/smerk

Smerk Example We are now going to demonstrate how we can perform realtime face tracking through GPUImage. In your browser, please go to https://github.com/slucey-cs-cmu-edu/smerk_example Or better yet, if you have git installed, you can type from the command line: $ git clone https://github.com/slucey-cs-cmu-edu/smerk_example.git

Smerk Example

Today: Background on Rapid Face Detection. Facial Feature Detection in iOS. Dlib for Object Detection.

Dlib C++ for Computer Vision Dlib is a general purpose cross-platform C++ library designed using contract programming and modern C++ techniques. It is open source software, licensed under the Boost Software License. The code is platform independent (Windows, Linux, Mac OS X). It offers a very useful set of vision and learning tools. Check out more details at http://dlib.net/

Why is Dlib useful? What makes Dlib very cool is its ability to train your own object detectors quickly and easily. This is hard to do in OpenCV, as it relies on something called Hard Negative Mining, which requires setting tricky parameters and can often take hours or days to train a model.

Dlib - Make your own detector! Dlib uses the well-known HOG + SVM pipeline for object detection (Dalal & Triggs, 2005). It does not rely on Hard Negative Mining (HNM); instead it employs a structural Support Vector Machine (SVM). No need for a negative training set, no messy parameters. Using this tutorial, the authors were able to learn a face detector in just a few minutes using Dlib. Figure: visualization of the HOG detector.
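For readers who have not met HOG before, the core building block is a histogram of gradient orientations over a small cell of pixels. The sketch below is a deliberately simplified illustration and is not Dlib's implementation: it uses central differences, 9 unsigned orientation bins over 0 to 180 degrees, hard bin assignment, and no block normalization, all of which are assumptions made to keep it short.

```python
import math


def cell_orientation_histogram(cell, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations
    for the interior pixels of one cell (a list of rows of intensities)."""
    rows, cols = len(cell), len(cell[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]   # horizontal central difference
            gy = cell[y + 1][x] - cell[y - 1][x]   # vertical central difference
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


# A vertical edge: intensity changes left to right, so gradients point along x.
vertical_edge = [[0, 0, 1, 1] for _ in range(4)]
# A horizontal edge: intensity changes top to bottom, so gradients point along y.
horizontal_edge = [[0] * 4, [0] * 4, [1] * 4, [1] * 4]

print(cell_orientation_histogram(vertical_edge))    # all votes in bin 0
print(cell_orientation_histogram(horizontal_edge))  # all votes in bin 4
```

A full detector concatenates such histograms over a grid of cells, normalizes them, and feeds the resulting vector to a linear SVM that is scanned over the image.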

Dlib versus OpenCV Red box - Dlib. Blue circles - OpenCV. Taken from http://blog.dlib.net/2014/02/dlib-186-released-make-your-own-object.html

Dlib versus OpenCV Another example - 8 images of stop signs downloaded and labeled. Dlib was then used to create a HOG detector. Visualization of HOG Detector Taken from http://blog.dlib.net/2014/02/dlib-186-released-make-your-own-object.html.

More to read

H. Schneiderman & T. Kanade. "Probabilistic Modeling of Local Appearance and Spatial Relationships for Object Recognition", CVPR 1998.
P. Viola and M. Jones. "Rapid Object Detection Using a Boosted Cascade of Simple Features", CVPR 2001.
N. Dalal & B. Triggs. "Histograms of Oriented Gradients for Human Detection", CVPR 2005.
H. Li et al. "A Convolutional Neural Network Cascade for Face Detection", CVPR 2015.

A Convolutional Neural Network Cascade for Face Detection
Haoxiang Li, Zhe Lin, Xiaohui Shen, Jonathan Brandt, Gang Hua
Stevens Institute of Technology, Hoboken, NJ 07030, {hli18, ghua}@stevens.edu
Adobe Research, San Jose, CA 95110, {zlin, xshen, jbrandt}@adobe.com

Abstract In real-world face detection, large visual variations, such as those due to pose, expression, and lighting, demand an advanced discriminative model to accurately differentiate faces from the backgrounds. Consequently, effective models for the problem tend to be computationally prohibitive. To address these two conflicting challenges, we propose a cascade architecture built on convolutional neural networks (CNNs) with very powerful discriminative capability, while maintaining high performance. The proposed CNN cascade operates at multiple resolutions, quickly rejects the background regions in the fast low resolution stages, and carefully evaluates a small number of challenging candidates in the last high resolution stage. To improve localization effectiveness, and reduce the number of candidates at later stages, we introduce a CNN-based calibration stage after each of the detection stages in the cascade. The output of each calibration stage is used to adjust the detection window position for input to the subsequent stage. The proposed method runs at 14 FPS on a single CPU core for VGA-resolution images and 100 FPS using a GPU, and achieves state-of-the-art detection performance on two public face detection benchmarks.

1. Introduction Face detection is a well studied problem in computer vision. Modern face detectors can easily detect near frontal faces. Recent research in this area focuses more on the uncontrolled face detection problem, where a number of factors such as pose changes, exaggerated expressions and extreme illuminations can lead to large visual variations in face appearance, and can severely degrade the robustness of the face detector.
The difficulties in face detection mainly come from two aspects: 1) the large visual variations of human faces in the cluttered backgrounds; 2) the large search space of possible face positions and face sizes. The former one requires the face detector to accurately address a binary classification problem while the latter one further imposes a time efficiency requirement. Ever since the seminal work of Viola et al. [27], the boosted cascade with simple features becomes the most popular and effective design for practical face detection. The simple nature of the features enable fast evaluation and quick early rejection of false positive detections. Meanwhile, the boosted cascade constructs an ensemble of the simple features to achieve accurate face vs. non-face classification. The original Viola-Jones face detector uses the Haar feature which is fast to evaluate yet discriminative enough for frontal faces. However, due to the simple nature of the Haar feature, it is relatively weak in the uncontrolled environment where faces are in varied poses, expressions under unexpected lighting. A number of improvements to the Viola-Jones face detector have been proposed in the past decade [30]. Most of them follow the boosted cascade framework with more advanced features. The advanced feature helps construct a more accurate binary classifier at the expense of extra computation. However, the number of cascade stages required to achieve the similar detection accuracy can be reduced. Hence the overall computation may remain the same or even reduced because of fewer cascade stages. This observation suggests that it is possible to apply more advanced features in a practical face detection solution as long as the false positive detections can be rejected quickly in the early stages. In this work, we propose to apply the Convolutional Neural Network (CNN) [13] to face detection.
Compared with the previous hand-crafted features, CNN can automatically learn features to capture complex visual variations by leveraging a large amount of training data and its testing phase can be easily parallelized on GPU cores for acceleration. Considering the relatively high computational expense of the CNNs, exhaustively scanning the full image in multiple scales with a deep CNN is not a practical solution. To achieve fast face detection, we present a CNN cascade, ...

MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com
Rapid Object Detection Using a Boosted Cascade of Simple Features
Viola, P.; Jones, M. TR2004-043 May 2004

Abstract This paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates. This work is distinguished by three key contributions. The first is the introduction of a new image representation called the Integral Image which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features from a larger set and yields extremely efficient classifiers [6]. The third contribution is a method for combining increasingly more complex classifiers in a cascade which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions.
The cascade can be viewed as an object specific focus-of-attention mechanism which unlike previous approaches provides statistical guarantees that discarded regions are unlikely to contain the object of interest. In the domain of face detection the system yields detection rates comparable to the best previous systems. Used in real-time applications, the detector runs at 15 frames per second without resorting to image differencing or skin color detection.

IEEE Computer Society Conference on Computer Vision and Pattern Recognition

This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories, Inc.; an acknowledgment of the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories, Inc. All rights reserved. Copyright © Mitsubishi Electric Research Laboratories, Inc., 2004. 201 Broadway, Cambridge, Massachusetts 02139

Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs
INRIA Rhône-Alps, 655 avenue de l'Europe, Montbonnot 38334, France
{Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr

Abstract We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

1 Introduction Detecting humans in images is a challenging task owing to their variable appearance and the wide range of poses that they can adopt. The first need is a robust feature set that allows the human form to be discriminated cleanly, even in cluttered backgrounds under difficult illumination. We study the issue of feature sets for human detection, showing that locally normalized Histogram of Oriented Gradient (HOG) descriptors provide excellent performance relative to other existing feature sets including wavelets [17,22]. The proposed descriptors are reminiscent of edge orientation histograms [4,5], SIFT descriptors [12] and shape contexts [1], but they are computed on a dense grid of uniformly spaced cells and they use overlapping local contrast normalizations for improved performance. We make a detailed study of the effects of various implementation choices on detector performance, taking pedestrian detection (the detection of mostly visible people in more or less upright poses) as a test case. For simplicity and speed, we use linear SVM as a baseline classifier throughout the study. The new detectors give essentially perfect results on the MIT pedestrian test set [18,17], so we have created a more challenging set containing over 1800 pedestrian images with a large range of poses and backgrounds. Ongoing work suggests that our feature set performs equally well for other shape-based object classes.
We briefly discuss previous work on human detection in §2, give an overview of our method in §3, describe our data sets in §4 and give a detailed description and experimental evaluation of each stage of the process in §5-6. The main conclusions are summarized in §7.

2 Previous Work There is an extensive literature on object detection, but here we mention just a few relevant papers on human detection [18,17,22,16,20]. See [6] for a survey. Papageorgiou et al [18] describe a pedestrian detector based on a polynomial SVM using rectified Haar wavelets as input descriptors, with a parts (subwindow) based variant in [17]. Depoortere et al give an optimized version of this [2]. Gavrila & Philomen [8] take a more direct approach, extracting edge images and matching them to a set of learned exemplars using chamfer distance. This has been used in a practical real-time pedestrian detection system [7]. Viola et al [22] build an efficient moving person detector, using AdaBoost to train a chain of progressively more complex region rejection rules based on Haar-like wavelets and space-time differences. Ronfard et al [19] build an articulated body detector by incorporating SVM based limb classifiers over 1st and 2nd order Gaussian filters in a dynamic programming framework similar to those of Felzenszwalb & Huttenlocher [3] and Ioffe & Forsyth [9]. Mikolajczyk et al [16] use combinations of orientation-position histograms with binary-thresholded gradient magnitudes to build a parts based method containing detectors for faces, heads, and front and side profiles of upper and lower body parts. In contrast, our detector uses a simpler architecture with a single detection window, but appears to give significantly higher performance on pedestrian images.

3 Overview of the Method This section gives an overview of our feature extraction chain, which is summarized in fig. 1. Implementation details are postponed until §6. The method is based on evaluating well-normalized local histograms of image gradient orientations in a dense grid. Similar features have seen increasing use over the past decade [4,5,12,15]. The basic idea is that local object appearance and shape can often be characterized rather well by the distribution of local intensity gradients or ...

Probabilistic Modeling of Local Appearance and Spatial Relationships for Object Recognition
Henry Schneiderman and Takeo Kanade
Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213
Presented at CVPR98

Abstract In this paper, we describe an algorithm for object recognition that explicitly models and estimates the posterior probability function, $P(\text{object} \mid \text{image})$. We have chosen a functional form of the posterior probability function that captures the joint statistics of local appearance and position on the object as well as the statistics of local appearance in the visual world at large. We use a discrete representation of local appearance consisting of approximately $10^6$ patterns. We compute an estimate of $P(\text{object} \mid \text{image})$ in closed form by counting the frequency of occurrence of these patterns over various sets of training images. We have used this method for detecting human faces from frontal and profile views. The algorithm for frontal views has shown a detection rate of 93.0% with 88 false alarms on a set of 125 images containing 483 faces combining the MIT test set of Sung and Poggio with the CMU test sets of Rowley, Baluja, and Kanade. The algorithm for detection of profile views has also demonstrated promising results.

1. Introduction In this paper we derive a probabilistic model for object recognition based primarily on local appearance. Local appearance is a strong constraint for object recognition when the object contains areas of distinctive detailing. For example, the human face consists of distinctive local regions such as the eyes, nose, and mouth.
However, local appearance alone is usually not sufficient to recognize an object. For example, a human face becomes unintelligible to a human observer when the various features are not in the proper spatial arrangement. Therefore the joint probability of local appearance and position on the object must be modeled. Nevertheless, representation of only the appearance of the object is still not sufficient for object recognition. Some local patterns on the object may be more unique than others. For example, the intensity patterns around the eyes of a human face are much more unique than the intensity patterns found on the cheeks. In order to represent the uniqueness of local appearance, the statistics of local appearance in the world at large must also be modeled. The underlying representation we have chosen for local appearance is discrete. We have partitioned the space of local appearance into a finite number of patterns. The discrete nature of this representation allows us to estimate the overall statistical model, $P(\text{object} \mid \text{image})$, in closed form by counting the frequency of occurrence of these patterns over various sets of training images. In this paper we derive a functional form for the posterior probability function $P(\text{object} \mid \text{image})$ that combines these representational elements. We then describe how we have applied this model to the detection of faces in frontal view and profile. We begin in section 2 with a review of Bayes decision rule. We then describe our strategy for deriving the functional form of the posterior probability function in section 3 and perform the actual derivation in section 4. In section 5, we describe how we use training images to estimate a specific probability function within the framework of this functional form. In section 6 and 7 we give our results for frontal face detection and profile detection, respectively.
In section 8 we compare our representation with other appearance-based recognition methods.

2. Review of Bayes decision rule The posterior probability function gives the probability that the object is present given an input image. Knowledge of this function is all that is necessary to perform object recognition. For a given input image region, x = image, we decide whether the object is present or absent based on which probability is larger, $P(\text{object} \mid x)$ or $P(\overline{\text{object}} \mid x) = 1 - P(\text{object} \mid x)$, respectively. This choice is known as the maximum a posteriori (MAP) rule or the Bayes decision rule. Using this decision rule, we achieve optimal performance, in the sense of minimum rate of classification errors, if the posterior probability function is accurate.

3. Model derivation strategy Unfortunately, it is not practically feasible to fully represent $P(\text{object} \mid \text{image})$ and achieve optimal performance; it is too large and complex a function to represent. The best we can do is choose a simplified form of $P(\text{object} \mid \text{image})$ that can be reliably estimated using the available training data. Although a fully general form of $P(\text{object} \mid \text{image})$ is intractable, it provides a useful starting point for derivation of a sim-