WHO MISSED THE CLASS? - UNIFYING MULTI-FACE DETECTION, TRACKING AND RECOGNITION IN VIDEOS. Yunxiang Mao, Haohan Li, Zhaozheng Yin


Department of Computer Science, Missouri University of Science and Technology, USA
ym8r8@mst.edu, hl87c@mst.edu, yinz@mst.edu

ABSTRACT

We investigate the problem of checking class attendance by detecting, tracking and recognizing multiple student faces in classroom videos taken by instructors. Instead of recognizing each individual face independently, we first perform multi-object tracking to associate detected faces (including false positives) into face tracklets (each tracklet contains multiple instances of the same individual with variations in pose, illumination, etc.), and then cluster the face instances in each tracklet into a small number of clusters, achieving a sparse face representation with less redundancy. We then formulate a unified optimization problem to (a) identify false-positive face tracklets; (b) link face tracklets broken by long occlusions that belong to the same person; and (c) recognize the group of faces simultaneously under the spatial and temporal context constraints of the video. We test the proposed method on the Honda/UCSD database and in real classroom scenarios. The high recognition performance achieved by recognizing a group of multi-instance tracklets simultaneously demonstrates that multi-face recognition is more accurate than recognizing each individual face independently.

Index Terms: Face detection, face tracking, face recognition, multiple object tracking, multiple object recognition.

1. INTRODUCTION

We are interested in checking class attendance by recognizing student faces in classroom images/videos taken by the instructor. Face recognition in images has been intensively researched over the past three decades [17, 21].
Typically, image-based face recognition systems proceed in the following steps: detect a human face in an image, compare the detected face with a reference dataset of N faces, and classify who he/she is. This workflow builds face models/features for each person independently and recognizes each face in isolation. However, achieving high-performance face recognition with this one-face-image-versus-N-references strategy remains challenging due to variations in facial expression, pose, illumination condition, occlusion, etc. (This work was supported by NSF grant CNS-1205695 and the Intelligent Systems Center of Missouri S&T.)

In group settings, such as a short video clip of students' faces in a classroom, video information can be exploited to improve face recognition performance. In video-based face recognition, facial motion features are extracted in addition to appearance-based image features [6, 18]; 3D face models are fit to video frames to achieve pose-invariant face recognition [6, 10]; and video dictionaries are designed to encode temporal and pose information [2, 20]. Video-based face recognition has also been cast as a pairwise image-set matching problem [1, 4, 5, 16, 19]: both the query and the reference gallery contain a set of images for each individual, corresponding to different poses, illuminations, expressions, etc., and faces from two videos are compared based on the similarity between the two image sets.

Considering the temporal recurrence of faces in a video volume, face tracking has been adopted to improve face detection and face recognition. In [11], face tracking is leveraged to improve the hit rate of face detection in videos while decreasing the false-alarm rate.
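The baseline one-image-versus-N-references workflow described above can be sketched with a minimal eigenface-style subspace [14] and nearest-neighbor matching. The data, dimensions and function names below are invented for illustration; this is not the paper's trained system.

```python
import numpy as np

def eigenface_model(train, n_components):
    """train: (n_samples, n_pixels) flattened face crops."""
    mean = train.mean(axis=0)
    X = train - mean
    # SVD yields the principal axes without forming the covariance matrix.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return mean, Vt[:n_components]

def project(faces, mean, W):
    """Eigen-coefficients of one or more flattened faces."""
    return (faces - mean) @ W.T

def classify(query, mean, W, ref_coefs):
    """Nearest reference in the eigen-subspace: (index, distance)."""
    q = project(query[None], mean, W)[0]
    d = np.linalg.norm(ref_coefs - q, axis=1)
    return int(np.argmin(d)), float(d.min())
```

The eigen-subspace makes the N comparisons cheap, but the sketch also makes the paper's point visible: a single query image is matched against N references with no temporal context to fall back on.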
A face tracker combining generative and discriminative face models [7] tracks faces in videos and builds face models reflecting appearance variations, enabling face recognition with person-discriminant features. By tracking faces as they gradually change from frontal to profile view, the frontal and profile face appearances are linked, so each identity contributes more faces for recognition or age estimation [12, 13]. In [3], the interdependencies between face tracking and face recognition are explored to improve multiple-face recognition in videos.

For the class-attendance-checking problem, we propose to improve face recognition performance with an M-versus-N strategy, where M denotes the number of face sets detected in the videos (note that the M sets may contain false-positive face sets) and N denotes the number of reference faces (i.e., the number of registered students in the class). In the M-versus-N strategy, we recognize the M face sets simultaneously, considering the spatial and temporal context constraints among them. The underlying motivation is that recognizing M face sets at once is more accurate than recognizing an individual face set M times.

Fig. 1. Overview of the proposed method. To protect the privacy of students, we hide their eyes in the paper.

Fig. 1 illustrates the workflow of the proposed system. Given a short video clip of student faces in a classroom, a face detector detects many face candidates in individual images, including non-face false positives. We then apply a multiple-object tracking approach to link detected faces into tracklets based on spatio-temporal information. Each tracklet contains multiple face instances with variations in pose, expression, illumination, etc. The tracklets are summarized by sparse representations and encoded with entropy-related features. Finally, we formulate a global optimization problem solving three problems at once: classify tracklets as reliable or unreliable (e.g., false positives), connect broken tracklets that belong to the same person, and recognize the group of reliable tracklets.

Our contributions in this paper are: (1) we perform multi-object tracking to associate detected faces (including false positives) into face tracklets (each containing multiple instances of the same individual with variations in pose, illumination, etc.) and then cluster the face instances in each tracklet into a small number of clusters, achieving a sparse face representation with less redundancy; (2) we formulate a unified optimization problem to (a) identify false-positive face tracklets; (b) link face tracklets broken by long occlusions that belong to the same person; and (c) recognize the group of faces simultaneously under the spatial and temporal context constraints of the video.

2. FACE DETECTION AND MULTI-FACE TRACKING

2.1. Face Detection

In a crowded scene such as a classroom, some students may be occluded by students sitting in front of them, so their faces are hard to detect from a single viewpoint. However, when the instructor moves from one end of the blackboard to the other while recording, occluded faces eventually appear. For each frame of the video, we apply the cascaded face detector based on Haar-like features [15], which works well in the classroom scene (Fig. 2). We tune the detector for a low miss-detection rate to make sure that every person's face is detected in some images for the final recognition; the cost of a low miss-detection rate is a fair number of false positives. After this detection step, we have thousands of cropped face images consisting of true positives and false positives.

Fig. 2. Sample face detection results (detected faces are marked by red rectangles). The right column shows a student face before, during and after occlusion.

2.2. Face Tracking

Recognizing each detected instance by comparing it with the reference gallery is unreliable due to the large variations in face pose, expression, etc. In this paper, we group detected instances belonging to the same person into tracklets based on their temporal and spatial information. The face recognizer then recognizes tracklets instead of single images.
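The speed of the Haar-feature detector of Sec. 2.1 comes from integral images, which reduce any rectangle sum, and hence any Haar-like feature, to a few array lookups. A minimal numpy sketch of that mechanism (illustrative only; the actual Viola-Jones cascade [15] learns thousands of such features with AdaBoost):

```python
import numpy as np

def integral_image(img):
    """Cumulative 2D sum, zero-padded so any box sum costs 4 lookups."""
    ii = np.cumsum(np.cumsum(img.astype(np.int64), axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] from the padded integral image."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

def haar_two_rect_vertical(ii, r, c, h, w):
    """A two-rectangle Haar-like feature: top half minus bottom half."""
    top = box_sum(ii, r, c, r + h // 2, c + w)
    bottom = box_sum(ii, r + h // 2, c, r + h, c + w)
    return top - bottom
```

Because every feature evaluation is constant time regardless of window size, the cascade can afford the dense multi-scale scan needed here to keep the miss-detection rate low.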

Three cases are possible in the face tracking (tracklet association): linking, appearing and disappearing. A single detected face image is treated as a special tracklet of length one.

Linking. Two tracklets T_i and T_j that belong to the same person may be separated by missed detections. We define the cost of linking the two tracklets as

c(T_i → T_j) = ‖l_i^e − l_j^s‖/τ_l + |t_i^e − t_j^s|/τ_t + |s_i^e − s_j^s|/τ_s    (1)

where i, j ∈ [1, N] and N is the number of tracklets; l_i^e is the tail location of tracklet T_i and l_j^s is the head location of tracklet T_j; t_i^e is the time of the last instance of T_i and t_j^s is the time of the first instance of T_j; and |s_i^e − s_j^s| is the scale change between the last instance of T_i and the first instance of T_j. The linking cost is only considered when: the spatial distance between the tail of T_i and the head of T_j does not exceed the maximal translation distance τ_l; T_j appears after T_i and the time gap between the tail of T_i and the head of T_j does not exceed the maximal allowed interval τ_t; and the scale-change ratio between T_i and T_j does not exceed the maximal allowed ratio τ_s.

Appear. A tracklet appears in the current frame when it cannot be connected to any previous tracklet. The cost of a tracklet appearance is the average linking cost over all tracklet pairs:

c(∅ → T_j) = (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} c(T_i → T_j)    (2)

Disappear. A tracklet from a previous frame disappears when it cannot be connected to any later tracklet, either because it becomes occluded or because it moves out of the field of view. The cost of a tracklet disappearance is likewise

c(T_i → ∅) = (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} c(T_i → T_j)    (3)

The number of linking hypotheses, N_link, is determined by the spatial and temporal gating constraints of Eq. (1). Each tracklet also has one appearing and one disappearing hypothesis, so the cost vector over all possible associations has dimension N_link + 2N_tracklet. Tracklet association is then formulated as the minimization

argmin_b  c^T b,   s.t.  R^T b = 1    (4)

where b is a binary vector in which b_i = 1 means that the i-th association hypothesis is selected in the optimal solution, and R is a (N_link + 2N_tracklet) × 2N_tracklet binary matrix in which the nonzero elements of each row indicate which tracklets are involved in that association. Solving the binary integer program in Eq. (4) groups the detected faces into tracklets.

3. RECOGNIZING A GROUP OF MULTI-INSTANCE TRACKLETS SIMULTANEOUSLY

Recognizing a face tracklet is more reliable than recognizing each individual face image, since a tracklet contains multiple instances of a person with face variations. Instead of recognizing each tracklet independently, we can further improve recognition performance by recognizing a group of tracklets simultaneously.

3.1. Recognizer Training

To train a face recognizer, we run our face detection and tracking on training videos and manually annotate the generated tracklets to construct a face image pool for each person. Each pool contains hundreds of faces with different illumination conditions, poses and expressions, but since they are cropped from consecutive frames, many of them are highly similar. We reduce this redundancy while maintaining the face variations (Fig. 3): first, we encode each image with a histogram of Local Binary Patterns [9]; then we use K-means to cluster the face images in each pool (K = 50 in our experiments). This leaves a small number of training images per person. Finally, we train an eigenface recognizer [14] on the collection of these small training sets from all people.

Fig. 3. Generating a sparse and representative training dataset.
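The redundancy-reduction step of Sec. 3.1 can be sketched as follows: an LBP histogram per face crop, a plain K-means over those histograms, and one representative per cluster. This is a self-contained illustration with a toy K-means, not the paper's exact implementation (which uses the LBP of [9] with K = 50).

```python
import numpy as np

def lbp_histogram(img):
    """Normalized 256-bin histogram of 8-neighbor LBP codes."""
    c = img[1:-1, 1:-1]
    codes = np.zeros_like(c, dtype=np.uint8)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dr, dc) in enumerate(offsets):
        nb = img[1 + dr:img.shape[0] - 1 + dr, 1 + dc:img.shape[1] - 1 + dc]
        codes |= (nb >= c).astype(np.uint8) << bit
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()

def kmeans(X, k, iters=20, seed=0):
    """Toy Lloyd's K-means; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

def representatives(X, labels, centers):
    """Index of the sample nearest each non-empty cluster center."""
    reps = []
    for j in range(centers.shape[0]):
        idx = np.where(labels == j)[0]
        if idx.size:
            d = ((X[idx] - centers[j]) ** 2).sum(axis=1)
            reps.append(int(idx[np.argmin(d)]))
    return reps
```

Keeping one face per cluster preserves the pose/illumination spread of the pool while shrinking the training set to K images per person.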

3.2. Entropy-based Tracklet Feature

Suppose we have N reference people in the training dataset, each with K reference images obtained by K-means clustering. Suppose face tracking yields M tracklets, and the i-th tracklet contains m(i) detected face images, denoted {q_{i_1}, ..., q_{i_m(i)}}. We project each detected face image q_{i_k} (k ∈ [1, m(i)]) and each reference image p_{j_l} (l ∈ [1, K]) into the eigen-subspace and obtain their eigen-coefficients, then compute the 2-norm distance between the coefficients, denoted d_{i_k,j_l}. The distance between the k-th instance of the i-th tracklet and the j-th reference person is d_{i_k,j} = min_l d_{i_k,j_l}, giving a 1 × N vector [d_{i_k,1}, ..., d_{i_k,N}] of distances between one tracklet instance and the N reference people. To judge whether an instance is informative, we compute its entropy:

f_{i_k} = −Σ_{j=1}^{N} e^{−d̄_{i_k,j}} log(e^{−d̄_{i_k,j}})    (5)

where d̄_{i_k,j} is d_{i_k,j} normalized by the maximum of d_{i_k,j} over j ∈ [1, N]. The entropy of a tracklet is the average over its instances:

f_i = (1/m(i)) Σ_{k=1}^{m(i)} f_{i_k}    (6)

Fig. 4. Entropy encoding.

Fig. 4 illustrates some entropy computations. If a tracklet is similar to one reference but dissimilar to the others, its entropy is low (low uncertainty). Since false positives are not similar to any reference person, the instances of a false-positive tracklet have high entropies, resulting in a high tracklet entropy. More generally, tracklets that are similar to no reference person, or similar to several reference people at once, have high entropy, indicating they are unreliable for recognition.

3.3. Global Tracklet Recognition

We formulate global tracklet recognition as a single optimization problem:

argmax_{x_ij}  Σ_{i=1}^{M} Σ_{j=1}^{N+1} x_ij e^{−D_ij}    (7)

s.t.  Σ_{j=1}^{N+1} x_ij = 1,  ∀ i ∈ [1, M]    (8)

(x_pj + x_qj) δ(C_p ∩ C_q ≠ ∅) ≤ 1,  ∀ j ∈ [1, N], p, q ∈ [1, M], p ≠ q    (9)

Objective function. In Eq. (7), x_ij is a binary variable and D_ij = min_{k,l} d_{i_k,j_l}. The (N+1)-th reference represents unreliable tracklets. The distance between the i-th tracklet and the unreliable label is defined as

D_{i,N+1} = L_i + E_i + F_i    (10)

where

L_i = 0 if m(i) < M_L,  1 if m(i) ≥ M_L    (11)

E_i = 1 if min_j(D_ij) < M_E,  0 if min_j(D_ij) ≥ M_E    (12)

F_i = 1 if f_i < M_F,  0 if f_i ≥ M_F    (13)

L_i is a binary variable obtained by thresholding the length of the i-th tracklet, E_i by thresholding the minimum distance between the i-th tracklet and all reference people, and F_i by thresholding the entropy of the i-th tracklet.

Constraints. The constraints in Eq. (8) ensure that each tracklet is assigned exactly one label: either one of the reference people or the unreliable label. The constraints in Eq. (9) enforce temporal exclusion among tracklets: a reference person's label cannot be assigned to two tracklets that appear together in at least one frame. Supposing the video has T frames, we construct an M × T binary constraint matrix C, in which C_ij = 1 means the i-th tracklet appears in the j-th frame. We scan the rows of C pairwise; if rows C_p and C_q intersect (i.e., δ(C_p ∩ C_q ≠ ∅) = 1), labels 1 to N cannot be assigned to tracklets p and q at the same time. Checking all pairwise rows adds at most M(M−1)/2 such constraints to the optimization problem.

By solving the optimization problem, tracklets are classified into two categories, reliable or unreliable; broken tracklets belonging to the same person due to long occlusion or large spatial distance/scale changes in the video can be

connected; and the tracklets are recognized simultaneously against the reference images, while the students who missed the class are identified as those with no associated tracklet.

4. EXPERIMENTS

4.1. Datasets

The first dataset, Honda/UCSD, was collected by Lee et al. [8]. We use its first subset, which contains 59 videos of 20 different people: 20 videos for training and 39 for testing. The individuals were asked to demonstrate different expressions and head movements. The second dataset considers small-group recognition: it consists of three test videos taken in a graduate class with 16 registered students; 9, 9 and 13 students appear in the three videos, respectively. The third dataset considers a bigger undergraduate class with 39 registered students; 35, 37, 37, 34, 32 and 38 students appear in its 6 videos, respectively.

4.2. Comparing Face Recognition by Single Image, Face Tracklet and Tracklet Group on the Honda/UCSD Dataset

Each video in the Honda/UCSD dataset contains a single person (left of Fig. 5). First, we run the face detector on the 20 training videos and apply the LBP + K-means method to the detected faces to construct the sparse training dataset for each person; we then train our eigenface recognizer for the 20 people. After detecting and tracking faces in the 39 test videos, the faces are grouped into 19 test sets corresponding to 19 different test people. The first experiment recognizes each individual image in the test videos using the eigenface recognizer. Table 1 summarizes this single-image performance; accuracy (Acc) is the percentage of individual images in a video that are correctly recognized. On average, the accuracy is about 75% over all test videos, indicating that recognition based on a single image is not reliable.
Since people in the test videos deliberately act with different expressions and head poses, single-image-based recognition is challenging. The second experiment recognizes each person by face tracklet; recognizing one tracklet instead of one single image overcomes the difficulty caused by these variations. Out of the 19 face tracklets corresponding to 19 different people, only one is wrongly recognized, giving a recognition accuracy of 95% (second column of Table 2, k = 1). In the third experiment, we exhaustively evaluate all possible combinations of selecting k (k ∈ [1, 19]) tracklets out of the 19 to form a tracklet group, and recognize the group simultaneously with our global recognition method. Table 2 shows the average accuracy of global tracklet recognition for different group sizes: the average accuracy increases as more tracklets are recognized together, reaching 100% when all 19 tracklets are recognized simultaneously. Comparing the three experiments, recognizing tracklets is more accurate than recognizing single images, and recognizing a group of tracklets is more accurate still.

Fig. 5. Test datasets.

4.3. Face Recognition in Classrooms

We test the proposed method on two classroom video datasets captured by two instructors. In our implementation, M_L, M_E and M_F are set to 14, 3000 and 19, respectively. For a dataset of V videos, we choose 1 of the V videos for testing and use the rest as training videos.
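With the thresholds M_L = 14, M_E = 3000 and M_F = 19 in hand, the entropy feature of Eqs. (5)-(6) and the unreliable-label distance of Eqs. (10)-(13) can be sketched directly; the function names and the toy distance values are illustrative.

```python
import math

def instance_entropy(dists):
    """Eq. (5): entropy of one instance's distances to the N references,
    after normalizing by the largest distance."""
    m = max(dists)
    probs = [math.exp(-d / m) for d in dists]  # small distance -> weight near 1
    return -sum(p * math.log(p) for p in probs)

def tracklet_entropy(per_instance_dists):
    """Eq. (6): average instance entropy over the tracklet."""
    return sum(instance_entropy(d) for d in per_instance_dists) / len(per_instance_dists)

def unreliable_distance(length, min_ref_dist, entropy,
                        M_L=14, M_E=3000.0, M_F=19.0):
    """Eqs. (10)-(13): D_{i,N+1} = L_i + E_i + F_i. A long, low-entropy
    tracklet close to some reference gets a large distance to the
    'unreliable' label, so the optimizer avoids discarding it."""
    L = 1 if length >= M_L else 0        # Eq. (11)
    E = 1 if min_ref_dist < M_E else 0   # Eq. (12)
    F = 1 if entropy < M_F else 0        # Eq. (13)
    return L + E + F
```

A tracklet matching exactly one reference yields a lower entropy than one equidistant from all references, which is what pushes false positives toward the (N+1)-th label in Eq. (7).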
Three evaluation metrics are used to evaluate recognition performance: (1) precision on tracklets: the percentage of detected and tracked tracklets that are correctly recognized; (2) recall on students attending the class: the percentage of attending students whose associated tracklets are all correct; (3) recall on students missing the class: the percentage of absent students who have no associated tracklet. The results for the two classes are summarized in Tables 3 and 4, respectively. On average, precision and recall above 90% are achieved in both classes.

5. CONCLUSION

We investigated the problem of checking class attendance by detecting, tracking and recognizing multiple student faces in videos taken by instructors. We perform multi-object tracking to associate detected faces into face tracklets, and cluster the face instances in each tracklet into a small number of clusters, achieving a sparse face representation with less redundancy. We then formulate a unified optimization problem to (a) identify false-positive face tracklets; (b) link face tracklets broken by long occlusions that belong to the same person; and (c) recognize the group of faces simultaneously under the spatial and temporal context constraints of the video. Experiments on the Honda/UCSD database show that recognizing a group of multi-instance tracklets is more accurate than recognizing a single tracklet or a single image. The high recognition performance achieved in classroom scenarios demonstrates the feasibility and effectiveness of video-based class attendance checking.

Table 1. Recognition based on a single image.
Video No:  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  avg
Acc (%):  75  90  64  75  78  75  56  92  63  80  51  55  80  62  93  88  83  77  84   75

Table 2. Recognition based on tracklet groups of k tracklets.
k:        1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
Acc (%): 95  95  95  95  95  95  95  95  96  96  96  97  97  98  98  98  99  99 100

Table 3. Recognition performance in an undergraduate class with 39 students. Prec: precision on tracklets; Rec1: recall on attendance; Rec2: recall on missing class.
Seq.  Attn#  Prec(%)  Rec1(%)  Rec2(%)
1     35     97       97       100
2     37     93       90       100
3     37     92       83       100
4     34     95       94       80
5     32     89       97       88
6     38     94       81       100
Avg   35.5   93       90       95

Table 4. Recognition performance in a graduate class with 16 students. Prec: precision on tracklets; Rec1: recall on attendance; Rec2: recall on missing class.
Seq.  Attn#  Prec(%)  Rec1(%)  Rec2(%)
1     9      100      100      100
2     9      90       90       100
3     13     100      100      100
Avg   10.3   97       97       100

6. REFERENCES

[1] H. Cevikalp and B. Triggs, "Face Recognition Based on Image Sets," in Proc. CVPR, Jun. 2010.
[2] Y. Chen, et al., "Dictionary-Based Face Recognition from Video," in Proc. ECCV, Oct. 2012.
[3] A. Cohen and V. Paclovic, "An Efficient IP Approach to Constrained Multiple Face Tracking and Recognition," in Proc. ICCV, Nov. 2011.
[4] Z. Cui, et al., "Image Sets Alignment for Video-based Face Recognition," in Proc. CVPR, Jun. 2012.
[5] J. Harguess, et al., "Fusing Face Recognition from Multiple Cameras," in Proc. WACV, Dec. 2009.
[6] J. Harguess, et al., "Full-Motion Recovery from Multiple Video Cameras Applied to Face Tracking and Recognition," in Proc. ICCV, Nov. 2011.
[7] M. Kim, et al., "Face Tracking and Recognition with Visual Constraints in Real-World Videos," in Proc. CVPR, Jun. 2008.
[8] K. C. Lee, et al., "Video-based face recognition using probabilistic appearance manifolds," in Proc. CVPR, Jun. 2003.
[9] T. Ojala, et al., "A Comparative Study of Texture Measures with Classification Based on Feature Distributions," Pattern Recognition, 29:51-59, 1996.
[10] B. Raducanu and F. Dornaika, "Pose-Invariant Face Recognition in Videos for Human-Machine Interaction," in Proc. ECCV, Oct. 2012.
[11] E. Sariyanidi, et al., "LZM in Action: Realtime Face Recognition System," in Proc. ECCV, Oct. 2012.
[12] J. Sivic, et al., "'Who are you?' - Learning person specific classifiers from video," in Proc. CVPR, Jun. 2009.
[13] Z. Song, et al., "Learning universal multi-view age estimator by video contexts," in Proc. ICCV, Nov. 2011.
[14] M. A. Turk and A. P. Pentland, "Face Recognition Using Eigenfaces," in Proc. CVPR, Jun. 1991.
[15] P. Viola and M. Jones, "Rapid Object Detection using a Boosted Cascade of Simple Features," in Proc. CVPR, 2001.
[16] R. Wang, et al., "Manifold-Manifold Distance with application to face recognition based on image set," in Proc. CVPR, Jun. 2008.
[17] M. Yang, et al., "Detecting Faces in Images: A Survey," IEEE TPAMI, 2002.
[18] N. Ye and T. Sim, "Towards general motion-based face recognition," in Proc. CVPR, 2010.
[19] Q. Yin, et al., "An Associate-Predict Model for Face Recognition," in Proc. CVPR, Jun. 2011.
[20] Q. Zhang and B. X. Li, "Discriminative K-SVD for Dictionary Learning in Face Recognition," in Proc. CVPR, Jun. 2010.
[21] W. Zhao, et al., "Face recognition: A literature survey," ACM Computing Surveys, 2003.