Clustering Lightened Deep Representation for Large Scale Face Identification

Shilun Lin linshilun@bupt.edu.cn
Zhicheng Zhao zhaozc@bupt.edu.cn
Fei Su sufei@bupt.edu.cn

ABSTRACT
On specific face datasets, such as the LFW benchmark, recent face recognition methods have achieved near perfect accuracy. However, face identification is still a challenging task on a super large scale dataset, where real applications are urgently needed; thus the Microsoft challenge of recognizing one million celebrities (MS-Celeb-1M) has attracted increasing attention. In this paper, we propose a three-step strategy to address this problem. Firstly, based on a cross-domain face dataset, i.e., the CASIA-WebFace dataset, an efficient and deliberate face representation model with a Max-Feature-Map (MFM) activation function is trained to map raw images into the feature space quickly. Secondly, face representations with the same MID in MS-Celeb-1M are clustered into three subsets: a pure set, a hard set and a mess set. The cluster centers are used as the gallery representations of the corresponding MID; this scheme reduces the impact of noisy images and the number of comparisons during face matching. Finally, the locality sensitive hashing (LSH) algorithm is applied to speed up the search for the nearest centroid. Experimental results show that our face CNN model can extract stable and discriminative face representations, and the proposed three-step strategy achieves a promising performance without any manual selection for the MS-Celeb-1M dataset. Furthermore, we find that via clustering a relatively pure set is kept by many MIDs in MS-Celeb-1M, which indicates this scheme is effective for cleaning a huge but messy dataset.

Keywords
Large Scale Face Identification, Lightened Deep Representation, Clustering, Convolutional Neural Network

1. INTRODUCTION
In the past few years, face recognition under uncontrolled environments has become one of the most extensively researched fields in computer vision [19, 20, 15, 17, 10, 8, 13, 2, 6, 14, 16], due to the publication of LFW [4], an extensively reported dataset for the evaluation of face recognition algorithms. However, surpassing human recognition accuracy on LFW does not mean that face recognition has been solved, for the number of images in LFW is relatively small. Face identification is still a challenging task on a super large scale dataset. Many real world applications need accurate identification at planetary scale; e.g., in suspect searching, the face identification algorithm should find the suspect in a large scale gallery image set. Similarly, large scale robustness is also necessary in the field of mobile payment, to ensure that other people cannot use your account with their faces [5]. Recently, Microsoft released MS-Celeb-1M [3], a large scale real world face image dataset, to the public, encouraging researchers to develop the best face recognition techniques to recognize one million people entities identified from Freebase. Its V1.0 version contains 10M celebrity face images for the top 100K celebrities, which can be used to train and evaluate both face identification and verification algorithms at a relatively large scale. In this paper, we propose a three-step strategy to address the MS-Celeb-1M (V1.0) challenge of recognizing 100K celebrities in the real world; the flowchart is shown in Fig. 1.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted.
To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICC '17, March 22, 2017, Cambridge, United Kingdom
© 2017 ACM. ISBN 978-1-4503-4774-7/17/03... $15.00
DOI: http://dx.doi.org/10.1145/3018896.3025149

Figure 1: The basic idea of our proposed three-step strategy for the MS-Celeb-1M challenge, including face representation extractor training, clustering in the feature space and closest centroid finding.

To achieve ultimate accuracy, existing CNN based models tend to be deeper or to ensemble multiple local facial patches. A very deep model leads to a long representation extraction time on CPU or GPU, and the application of multiple facial patches requires much time and introduces uncertainty caused by automatic facial landmark detection. For a large scale face recognition task, in addition to the time consumption and the demand for a large amount of computing resources in offline training, the time consumption of representation extraction and the dimension of the representation should be carefully considered. To obtain a small face representation extractor with fast feature extraction and a low-dimensional representation, the widely used ReLU is replaced with the Max-Feature-Map (MFM), which has proved to be effective [19]. Only one kind of aligned face patch extracted from CASIA-WebFace [20] is applied as training data. There are two reasons for not using the MS-Celeb-1M dataset for training. On the one hand, this dataset is still relatively dirty at this stage and takes time to be cleaned. On the other hand, in practical applications it is difficult to cover all query identities in the training set. Our aim is to obtain a generalized model with good performance.

The MS-Celeb-1M (V1.0) contains 8,456,240 images of 99,892 MIDs. At such a large scale, it is not economical to compare a query image with the entire dataset. The visualization of face representations obtained by the extractor provides a direct visual insight: through layer by layer abstraction, face representations of the same person tend to be similar, while there is a clear difference between representations of different persons. According to this observation, K-means [7] is applied to the face representations with the same MID in MS-Celeb-1M, and three cluster centers (or fewer) are obtained as the gallery representations of the corresponding MID. The result of clustering shows that the face images corresponding to the representations of many MIDs are divided into three parts (a pure set, a hard set and a mess set, the latter two do not necessarily exist).
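As a minimal illustration of the last two steps, the following numpy sketch clusters the representations of each MID into at most three centroids and assigns a query to the MID of the nearest centroid. The `kmeans` and `identify` helpers, the L2 metric and the toy data are illustrative assumptions, not the authors' implementation (which additionally accelerates the centroid search with LSH).

```python
import numpy as np

def kmeans(feats, k=3, iters=20, seed=0):
    # Plain Lloyd's K-means on one MID's feature vectors; returns (k, d) centroids.
    rng = np.random.default_rng(seed)
    k = min(k, len(feats))
    centers = feats[rng.choice(len(feats), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign every feature vector to its nearest centroid.
        labels = np.linalg.norm(feats[:, None] - centers[None], axis=2).argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = feats[labels == j].mean(axis=0)
    return centers

def identify(query, gallery):
    # Return the MID whose nearest cluster centroid is closest to the query.
    best_mid, best_dist = None, np.inf
    for mid, centers in gallery.items():
        dist = np.linalg.norm(centers - query, axis=1).min()
        if dist < best_dist:
            best_mid, best_dist = mid, dist
    return best_mid
```

On real data, `feats` would hold the deep representations of all images sharing one MID, and the resulting centroids would play the roles of the pure, hard and mess sets.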
The pure set contains normal images of the person who appears most frequently in a MID folder. Faces in the hard set have variations in pose, illumination, or expression, and a small number of them do not belong to the person appearing in the pure set. The mess set consists of the noisy images. Clustering in the feature space not only reduces the amount of computation required in the query phase, but also provides an effective preprocessing method for further data cleaning.

Although the number of comparisons in the query phase is reduced by clustering, and the face features are compact thanks to the MFM activation function, searching for the nearest one among hundreds of thousands of gallery representations still limits the speed of our system. Locality-Sensitive Hashing (LSH) [12] is an effective method to deal with this issue. Through hashing, similar gallery representations are mapped to the same bucket with high probability. During evaluation, the LSH algorithm maps the query item to the bucket that contains similar gallery representations and takes approximately O(1) time to find the nearest one.

2. RELATED WORK
2.1 Data Set
The history of face recognition can be seen as the history of the evolution of face databases. Early face datasets, such as PIE [11] and FERET [9], were almost all collected under controlled environments. Very high performance has been achieved on these ideal datasets through the efforts of many researchers. However, models learned from these datasets are difficult to generalize to practical applications, especially under uncontrolled environments. So the interest of the community gradually transferred from controlled environments to uncontrolled environments, and the publication of a milestone dataset, LFW [4], including 13,233 images of 5,749 identities, promoted the studies of unconstrained face recognition.
Compared to previous datasets, the biggest difference of LFW is that its images are crawled from the Internet rather than acquired under several pre-defined environments. Therefore, LFW has more variations in pose, illumination, expression, resolution and imaging device, and these factors are combined together in a random way. YTF [18] is based on the name list of LFW (1,595 identities) but was created for video based face recognition. All 3,425 videos in YTF were downloaded from YouTube. Because videos on YouTube are highly compressed, the quality of the face snapshots is lower than in LFW. Along with the development of deep learning, the scale of face databases is also increasing. Among the large scale public datasets, CASIA-WebFace [20], which includes about 500K photos of 10K celebrities crawled from the web, is a great resource for training. Although the scale of CelebFaces [15] is relatively smaller than CASIA-WebFace, it contains rich face attribute labels for face parsing. As for large scale private datasets: Facebook's SFC [17] contains more than 4,000 subjects, and each subject has an average of 1,000 images. Using SFC, [17] successfully learns an effective face representation robust to face variations in the wild. Google has access to massive photo collections; for example, they trained FaceNet [10] on 200 million photos of 8 million people (and more recently on 500M of 10M) and achieved outstanding performance in many face recognition tasks. The details of recent representative large scale face datasets are shown in Table 1.

Table 1: Current large scale face datasets.
Dataset         Identities   Images    Access
WDRef [1]       2,995        99,773    Private
CelebFace [15]  10,177       202,599   Public
WebFace [20]    10,575       494,414   Public
VGG Face [8]    2,622        2.6M      Public
SFC [17]        4,030        4.4M      Private
Google [10]     10M          500M      Private

2.2 Deep Face Representation Learning
Data and algorithms are two essential components of face recognition, especially of deep face representation learning.
The publication of large scale web-crawled datasets promotes the studies of unconstrained face recognition. Several excellent face features have been learned by different deep networks and have achieved high performance on the LFW face verification task. Taigman et al. [17] proposed DeepFace, a multi-class network trained to perform the face identification task on 4.4 million faces of over 4,000 identities. Through an ensemble of three networks using different alignments and color channels, they obtained high accuracy approaching the human level. Sun et al. [13] proposed to combine identification and verification losses to reduce intra-personal variations while enlarging inter-personal differences during the training of DeepID2. They concatenated the features from 25 such networks, each operating on a different face patch, and then PCA was applied to get
the compact face representation. Schroff et al. [10] presented a system, called FaceNet, which directly learns a mapping from face images to a compact Euclidean space, trained on 100-200 million face thumbnails of about 8 million different identities. All the groups mentioned above surpassed human performance and achieved near perfect results on the LFW benchmark. However, face recognition is far from being solved. Many applications require accurate identification at planetary scale, like finding the best matching face in a database of billions of people, while LFW includes only 13K photos of 5K people.

The MS-Celeb-1M (V1.0) contains 8,456,240 images of 99,892 MIDs. Using the entire MS-Celeb-1M dataset as the gallery set may not be appropriate. On the one hand, the computation and storage overhead is relatively large; on the other hand, there are many noisy images in a MID folder, and calculating the distance between them and the query image is not necessary. In fact, through layer by layer abstraction (Fig. 3), the face representation has high distinguishability (intuitive visualization results will be given in Section 4), which provides a good basis for clustering. So the K-means algorithm is applied to the face representations with the same MID in MS-Celeb-1M, and three cluster centers (or fewer) are obtained as the gallery representations of the corresponding MID. During querying, only the distances between all cluster centers and the query image need to be calculated. Interestingly, in the absence of any supervisory information, the pictures in a MID folder are automatically divided into different categories (pure, hard and mess, the latter two do not necessarily exist), which provides an effective preprocessing method for follow-up database cleaning.

3.
PROPOSED METHOD
Since face recognition involves small inter-personal variations and large intra-personal variations, how to learn a discriminative face representation that narrows the intra-personal distance and enlarges the inter-personal gap is always a key topic. Deep face representation has made remarkable breakthroughs in this field and has been widely used. In large scale face recognition, the time consumption of feature extraction and the overhead of feature storage should not be overlooked. The Max-Feature-Map activation function proposed in [19] is advantageous for learning a CNN model with small size, fast feature extraction and a compact representation. As shown in Fig. 2, the output of the MFM activation is the maximum of two convolution feature map candidate nodes. This operation selects the more notable and discriminative nodes in both the convolutional and fully connected layers and makes the model lightened. Details of the network architecture are given in Fig. 3. MFM instead of ReLU is utilized after the convolutional layers and the fully connected layer. The face representation is extracted from the FC1 layer (after MFM activation).

Figure 2: Principle of the Max-Feature-Map activation function.

Figure 3: The flowchart of Transfer FaceNet.

To further speed up our system, Locality-Sensitive Hashing (LSH) is utilized in the final query phase. Through hashing, similar cluster centers are mapped to the same bucket with high probability. During evaluation, the LSH algorithm maps the query item to the bucket that contains similar gallery representations and takes approximately O(1) time to find the nearest one.

4. EXPERIMENTAL RESULTS
4.1 Face Representation Extractor
In our experiments, model training is based on the open source deep learning framework Caffe. Training samples for the face representation extractor are 144×144 gray-scale aligned (using five facial landmarks) face images, which are randomly cropped to 128×128 and mirrored for data augmentation.
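As a concrete illustration of the MFM operation described in Section 3, the following numpy sketch (our illustrative re-implementation, not the actual Caffe layer) splits the channels of a feature map into two candidate halves and keeps the elementwise maximum, halving the number of output channels:

```python
import numpy as np

def mfm(x, axis=1):
    # Max-Feature-Map: split the channel axis into two candidate halves
    # and keep the elementwise maximum, halving the channel count.
    a, b = np.split(x, 2, axis=axis)
    return np.maximum(a, b)

# A batch of two 4-channel 3x3 feature maps becomes two 2-channel maps.
x = np.random.default_rng(0).standard_normal((2, 4, 3, 3))
y = mfm(x)
assert y.shape == (2, 2, 3, 3)
```

Compared with ReLU, which zeroes negative responses, MFM performs a competitive selection between paired feature maps; this is what keeps the model small and the extracted representation compact.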
60% dropout is used to avoid overfitting on the fully connected layers. The learning rate is set to 1e-3 initially and reduced to 1e-5 during learning. Fig. 4 shows feature maps from the Conv4b layer of the trained face representation extractor.

Figure 4: Visualization of the first 36 feature maps produced by the Conv4b layer of the face representation extractor on images of Zhiling Lin (a,b,left) and Xiang Chen (c,d,right). Best viewed in color.

We can see that the response to face concepts like mouth and eyes is visible, which can provide discriminative information for face recognition. For example, in our daily life, we intuitively use concepts like big eyes and a high nose to distinguish different people. Comparing the feature maps of the same person (a vs. b, feature maps in the red boxes for instance) in Fig. 4, we can find that the intra-personal variations of the feature maps produced by the higher layers of the face representation extractor are small, while the feature maps of two different persons show a significant difference (a vs. c, feature maps in the red and green boxes). This characteristic of the face representation provides a good foundation for the further clustering.

4.2 Clustering in the Feature Space
With the purpose of reducing the number of comparisons during evaluation and excluding the impact of noisy images on identification performance, the K-means algorithm is applied to the face representations with the same MID in MS-Celeb-1M, and three cluster centers (or fewer) are obtained as the gallery representations of the corresponding MID. Some results of clustering are shown in Fig. 5. Images of MID m.0b3q8k (top) are divided into three sets, including a pure set, a hard set and a mess set, as described before. The hard set or mess set does not always exist, because the images of some MIDs are already relatively pure.

Figure 5: Clustering results of MID m.0b3q8k (top) and m.0cz96px (bottom) in MS-Celeb-1M (V1).

MID m.0cz96px (bottom) includes face images of many identities, and it is difficult to confirm which one is the main identity. We believe that images corresponding to such MIDs need to be re-collected. We find that via clustering a relatively pure set is kept by many MIDs in MS-Celeb-1M, which implies this scheme is effective for cleaning huge but messy data. Our results of clustering in the feature space provide guidance for further data cleaning, which is helpful for obtaining a relatively clean and large scale dataset that can be used to train a more powerful face representation extractor.

4.3 Performance
In querying, the MID of the nearest centroid is assigned to the query image. Introducing Locality-Sensitive Hashing (LSH) is effective in speeding up the process of finding the closest centroid (from O(n) to approximately O(1) time complexity, where n is the number of centroids). Because we emphasize the generalization ability of our system on the large scale web-crawled face dataset, in the process of building and testing our system, MS-Celeb-1M (V1) is neither used for training, nor is any manual cleaning applied to this dataset. Finally, top-1 identification rates of 58.20% and 38.4% are achieved on the MS-Celeb-1M (V1) random and hard dev sets, respectively. The 100K list in MS-Celeb-1M (V1) covers only about 75% of the celebrities in the measurement set, so in the absence of any expansion of the database, the highest achievable identification rate is 75%.

5. CONCLUSION AND FUTURE WORK
This paper proposes a three-step strategy to construct a promising system for large scale automatic face identification by clustering lightened deep representations. We offer an intuitive visual interpretation of the discriminability of the lightened deep representation. Face representations are clustered in the feature space based on this observation, which reduces both the impact of noisy images mixed into each MID and the number of comparisons during matching. Benefiting from the discriminative face representation, the images of each MID are divided into three sets (or fewer), and most MIDs contain a pure set of the main identity. Such automated processing will reduce the labor cost of further data cleaning.
Moreover, a relatively clean dataset is essential for training a deep model with strong generalization ability. Exploring the appropriate number of cluster centers for each MID is one of the key points of our future work. How to achieve a good tradeoff between processing speed and accuracy is one of the issues to be solved in a large scale face recognition system.
6. ACKNOWLEDGMENT
This work is supported by the Chinese National Natural Science Foundation (61532018, 61372169, 61471049), the Special Funds of the Beijing Municipal Co-construction Project, and the Beijing Key Lab of Network System and Network Culture.

7. REFERENCES
[1] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun. Bayesian face revisited: A joint formulation. In Computer Vision - ECCV 2012, pages 566-579. Springer, 2012.
[2] D. Chen, X. Cao, F. Wen, and J. Sun. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3025-3032. IEEE, 2013.
[3] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. MS-Celeb-1M: Challenge of recognizing one million celebrities in the real world.
[4] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007.
[5] I. Kemelmacher-Shlizerman, S. Seitz, D. Miller, and E. Brossard. The MegaFace benchmark: 1 million faces for recognition at scale. arXiv preprint arXiv:1512.00596, 2015.
[6] C. Lu and X. Tang. Surpassing human-level face verification performance on LFW with GaussianFace. arXiv preprint arXiv:1404.3840, 2014.
[7] J. MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281-297. Oakland, CA, USA, 1967.
[8] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference, volume 1, page 6, 2015.
[9] P. J. Phillips, H. Moon, S. Rizvi, P. J. Rauss, et al. The FERET evaluation methodology for face-recognition algorithms. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(10):1090-1104, 2000.
[10] F. Schroff, D. Kalenichenko, and J. Philbin.
FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815-823, 2015.
[11] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression (PIE) database. In Automatic Face and Gesture Recognition, 2002. Proceedings. Fifth IEEE International Conference on, pages 46-51. IEEE, 2002.
[12] M. Slaney and M. Casey. Locality-sensitive hashing for finding nearest neighbors [lecture notes]. IEEE Signal Processing Magazine, 25(2):128-131, 2008.
[13] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems, pages 1988-1996, 2014.
[14] Y. Sun, X. Wang, and X. Tang. Hybrid deep learning for face verification. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 1489-1496. IEEE, 2013.
[15] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1891-1898. IEEE, 2014.
[16] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2892-2900, 2015.
[17] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1701-1708. IEEE, 2014.
[18] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 529-534. IEEE, 2011.
[19] X. Wu, R. He, and Z. Sun. A lightened CNN for deep face representation. arXiv preprint arXiv:1511.02683, 2015.
[20] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch.
arXiv preprint arXiv:1411.7923, 2014.