A Video-Based Face Detection Method Using a Graph Cut Algorithm in Classrooms


Jiun-Lin Guo, Chiung-Yao Fang*, Yi-Chun Li, and Sei-Wang Chen
Department of Computer Science and Information Engineering
National Taiwan Normal University, Taipei, Taiwan
*Corresponding Author: violet@csie.ntnu.edu.tw

ABSTRACT This study presents a face detection method for student identification applied in classrooms with varying illumination and complex environments. The face detection system should be able to process many students simultaneously, sometimes more than thirty. Moreover, the faces may not be directly in front of the camera. These changes of head pose increase the difficulty of face detection. In this paper, an improved dynamic graph cut algorithm is applied to extract foregrounds and to detect a subject's skin color regions. The main advantage of using the graph cut algorithm to extract regions is that the extracted results are usually smooth and complete, because the algorithm takes the relationship with neighboring pixels into account. Moreover, such methods tolerate small camera shifts better than pixel-based models. This paper proposes an improved dynamic graph cut algorithm that can deal with a sequence of frames, reduce the running time of the graph cut algorithm, and automatically provide the hard constraints for the input frames. Finally, the subject's facial region is selected from the detected skin color regions. The experimental results show that the proposed method can robustly detect the subject's face under various illuminations and in complex classroom environments.

Keywords: dynamic graph cut algorithm, foreground extraction, skin color detection, video-based face detection

1. Introduction

Classroom observation is the process of recording the instructor's teaching practices and student actions, including facial expressions. The results can be used to refine the materials or improve the teaching methods. However, obtaining these types of data manually is time consuming and may disturb the students' learning. Many researchers have therefore focused on developing automatic systems to collect this data quickly (Fang, Kuo, Lee, & Chen, 2011). In this work, a face detection method is proposed to robustly detect students' faces under various illuminations and in complex classroom environments. It is hoped that the high quality of the detection results will be helpful for recognizing a student's facial expressions, including concentration, distraction, and drowsiness. Observing and recording the facial expressions of students during classes allows an evaluation of the teaching methods, which is a very important factor in classroom observation in educational research.

This study develops a student face detection system that addresses the following issues. Firstly, the system should be able to process successive frames, as it is applied to classroom observation of multiple subjects. Secondly, the system should be able to process many students simultaneously, sometimes in excess of thirty. Finally, students' faces may not always be directly in front of the camera. They may turn their heads to talk to each other, to look at the teacher, or to read from the blackboard during class. These changes of head pose increase the difficulty of face detection. Features that are invariant with respect to different head poses should therefore be considered when developing the face detection system. Since color features are invariant with respect to different head poses, this paper proposes a color-based method to extract skin color and detect faces. The main problem impacting a color-based method is the effect of illumination.
If this problem can be solved, the color features will be robust under various head poses. The range of skin color in various color spaces has been studied extensively; researchers have collected large numbers of images with skin color regions and investigated the suitable range of skin color within given color spaces. HSV (Sigal, Sclaroff, & Athitsos, 2004; Zhang, Jiang, Liang, & Liu, 2010) and YCrCb (Chai & Ngan, 1999; Mahmoud, 2008; Phung, 2002) are two color spaces commonly used in skin color detection, since the distributions of skin colors are more concentrated in these two color spaces than in others. Color compensation techniques can be applied to the input images (Zhang et al., 2010) to deal with the illumination effect. Moreover, Sigal et al. (2004) used a Markov chain and Bayes' rule to model the color change parameters in an attempt to dynamically learn the suitable range of skin color. In this study, a fixed skin color interval of the HSV color space, similar to the one proposed by Zhang et al.

(2010), is used as the initial skin color range of the system. This skin color range is dynamically updated through an improved graph cut algorithm proposed in this paper.

In addition, a reliable foreground extraction technique is required. This paper also proposes a foreground extraction technique, an improved graph cut algorithm, that is suited to the classroom environment. Foreground extraction is usually achieved through pixel-based modeling. For instance, a mixture of Gaussians is a commonly used model (Elgammal, Duraiswami, Harwood, & Davis, 2002; Kae & Bow, 2010). It provides a model for estimating background colors; however, the background colors must initially occupy a higher proportion of time. Secondly, a pixel-based model does not take the color information of neighbors into account; thus, noise may affect the foreground extraction results. Finally, a pixel-based model is sensitive to camera shifts of only a few pixels. Therefore, many researchers (Heikkilä & Pietikäinen, 2006; Cheng, Gong, Schuurmans, & Caelli, 2011) have applied complex pixel-based methods to improve the results, attempting to use neighboring information to obtain smoother foreground extraction results.

The traditional graph cut algorithm (Wu & Leahy, 1993; Juan & Boykov, 2006) provides another foreground extraction concept. This method was originally developed for single-image segmentation and requires some initial information inputted manually by the user (Wu & Leahy, 1993). Users provide hard constraints (i.e., several labeled pixels) using brush tools to construct the color distributions of background and foreground pixels. Application of the graph cut algorithm, based on the hard constraints, allows the system to extract the foreground pixels.
The main advantage of using the graph cut algorithm to extract the foreground is that the extracted results are usually smooth and complete, since the method naturally takes the relation of neighboring pixels into account. Moreover, this method is more resistant to small camera shifts than pixel-based models. In this paper, an improved dynamic graph cut algorithm is proposed to deal with sequences of frames, reduce the running time of the traditional graph cut algorithm, and automatically provide the hard constraints of the input frames. The proposed method is used to extract the foreground and detect a student's face under various illuminations and in complex classroom environments.

2. System Overview

Figure 1 shows a flowchart of the face detection system. The proposed system first uses an improved dynamic graph cut algorithm to extract the foregrounds of the input frame. A dynamic skin color range, based on the extracted foregrounds, is applied to detect skin color regions. The preliminary result of skin color detection is then refined by applying another graph cut algorithm. Finally, the face regions are

automatically selected from the refined skin regions and the process is complete. It should be noted that the kernel technique of the proposed system is the graph cut algorithm. To apply the graph cut algorithm to video foreground extraction, two issues should be considered: (1) the information about hard constraints should be provided automatically, and (2) the time complexity of the graph cut algorithm should be reduced for a real-time system. This study solves these issues to improve the graph cut algorithm. The graph cut algorithm is introduced in the following section.

[Flowchart: Input Frames -> Foreground Extraction -> Skin Color Region Detection -> Skin Color Region Refinement -> Face Region Selection, with a Color Range Updating feedback step.] Fig. 1. Flowchart of the face detection system.

3. Graph Cut Algorithm

3.1 Energy function

Image labeling methods can be used to extract the foreground pixels of the input frame. The fitness of an image labeling can be measured using an energy function. Given a frame I with N pixels, the labeling can be recorded by a binary vector A = (a_1, a_2, ..., a_N), where a_i ∈ {0, 1}, i = 1, 2, ..., N. For each pixel i, a_i = 1 indicates a foreground pixel; otherwise (a_i = 0), it is a background pixel. Boykov and Jolly (2001) defined an energy function of a labeled vector A:

E(A) = R(A) + λB(A), (1)

where λ is a constant. In Eq. (1), R(A) is the region part and B(A) is the smooth part. They are defined as:

R(A) = Σ_{i=1}^{N} r(a_i), (2)

and:

r(a_i) = { -log P(I_i | O)  if a_i = 1
         { -log P(I_i | B)  otherwise,                                    (3)

where P(I_i | O) and P(I_i | B) are the conditional probabilities of the color I_i of pixel i occurring under the foreground (O) and background (B) color models, respectively. These two color models can be obtained from histograms computed from the hard constraints. The smooth part B(A) is defined over each pair of neighboring pixels (4-connected or 8-connected). Let p and q indicate two neighboring pixels, p ≠ q and 1 ≤ p, q ≤ N. Thus:

B(A) = Σ_{p,q} b(p, q) δ(a_p, a_q), (4)

where

δ(a_p, a_q) = { 1 if a_p ≠ a_q
             { 0 otherwise,

and b(p, q) = exp{-β(I_p - I_q)^2}, where I_p and I_q are the color values of the pixels p and q, respectively. In Eq. (2), R(A) is the penalty for the unfitness between the labeled results and the background or foreground color model. In comparison, in Eq. (4), B(A) is the penalty for the color similarity of each pair of neighboring pixels that have different labels. With the energy function defined, the next step is to determine the optimal labeling that minimizes it.

3.2 Weighted graph construction

The traditional graph cut algorithm proposed by Boykov and Jolly (2001) is a technique that can be used to minimize the energy function to find the optimal labeling. To apply the well-known minimum-cut/maximum-flow approach, a weighted graph should first be constructed. Boykov and Jolly (2001) introduced two additional virtual vertices, the foreground terminal S and the background terminal T, in the weighted graph construction, shown in Figure 2. Given a frame I, a graph G = (V, E) can be constructed, where the set of vertices V contains all pixels in the frame plus the foreground and background terminals, S and T. In the graph, each pixel has two types of links, neighboring links and terminal links. These links form the set

of edges E. A neighboring link (n-link) connects two neighboring pixels in the frame, while a terminal link (t-link) connects each pixel to S or T. In the 4-connected case, one pixel has four n-links, as shown in Figure 2. Let p and q be two neighboring pixels in the frame, and let S and T be the foreground and background terminals, respectively. The weight of the n-link between p and q, W_{n-link}(p, q), is defined as:

W_{n-link}(p, q) = exp{-β(I_p - I_q)^2}, (5)

where p ≠ q and 1 ≤ p, q ≤ N.

[Fig. 2 shows the weighted graph: image pixels connected by n-links, with t-links from each pixel to the foreground terminal S and the background terminal T.] Fig. 2. Construction of the weighted graph.

It should be noted that the weights of the n-links are defined by the smooth part of the energy function, equal to the function b in Eq. (4). The weight of the t-link between p and the foreground terminal S is defined as:

W_{t-link}(p, S) = { 4λ + ε          if p ∈ O
                  { 0               if p ∈ B                              (6)
                  { -log P(I_p | B) otherwise,

where λ and ε are constants. If pixel p is labeled as a background pixel, then the weight of the t-link between p and the foreground terminal S is set to zero. If pixel p is labeled as a foreground pixel, then the weight of the t-link between p and S is set to 4λ + ε to ensure it is larger than the sum of all its n-link weights. Moreover, according to the function r shown in Eq. (3), the weights of the t-links of the unlabeled pixels are set to the region part of the energy function. The weight of the t-link between p and the background terminal T is similarly defined as:

W_{t-link}(p, T) = { 4λ + ε          if p ∈ B
                  { 0               if p ∈ O                              (7)
                  { -log P(I_p | O) otherwise.

This completes the construction of the weighted graph. Boykov and Jolly (2001) proved that the minimum cut of the weighted graph is equivalent to the optimal labeling of its corresponding frame, and that the sum of the weights of the links crossing the minimum cut equals the minimum of the energy function. The most significant property is that after performing the minimum cut algorithm, each pixel of the frame is connected to one and only one terminal vertex. The labeling result can be obtained by checking which terminal each pixel is connected to.

4. Foreground Extraction Using a Graph Cut

4.1 Hard constraints of foreground areas

As mentioned previously, hard constraints indicate the initial labeling of the foreground and background pixels. In this section, we propose an automatic technique to label the foreground and background pixels as the hard constraints. Two probability models for the intensity distributions of the background and foreground pixels can then be constructed based on the hard constraints.

Given two color frames, a difference image can be obtained by pixel-to-pixel intensity subtraction in each color channel. If one of the frames represents the background image, then the difference image reveals the foreground objects. An example is shown in Figure 3: Figure 3(a) shows a background image captured in a classroom, Figure 3(b) shows a frame with some students in the same classroom, and their difference image is shown in Figure 3(c). In Figure 3(c), the darkest pixels correspond to the background pixels, since their difference values are close to zero. Moreover, even though the colors are changed by the subtraction, the students can still be observed roughly in the difference image. If these two frames are both represented in an RGB color model, then their pixel-to-pixel difference histogram in each channel can be easily obtained.
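The difference-image step above can be sketched as follows (a minimal sketch assuming NumPy arrays for the frames; the function names are illustrative, not from the paper):

```python
import numpy as np

def difference_image(frame, background):
    # Signed pixel-to-pixel subtraction in each color channel; background
    # pixels yield values near zero, foreground objects stand out.
    return frame.astype(np.int16) - background.astype(np.int16)

def channel_histograms(diff):
    # One difference histogram per color channel over the signed
    # range [-255, 255] (index 255 corresponds to a difference of zero).
    return [np.bincount((diff[..., c] + 255).ravel(), minlength=511)
            for c in range(diff.shape[-1])]
```

The signed (rather than absolute) difference is kept because, as discussed next, the histogram is not symmetric around zero and the two sides are thresholded separately.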

(a) (b) (c) Fig. 3. An example showing how the difference image is obtained.

Fig. 4. The histograms of the R, G, and B channels (from left to right) of the difference image shown in Fig. 3(c).

Figure 4 shows the histograms of the R, G, and B channels, from left to right, of the difference image shown in Fig. 3(c). Since the number of background pixels is always large, most of the difference values are close to zero. Binarizing the difference image with a threshold can reveal the foreground objects. Chiu et al. (2010) proposed a fast algorithm to determine a suitable threshold. However, since the difference histogram is not symmetric around the zero point, a single threshold is not sufficient to obtain the foreground object. Thus, the method is improved here and the thresholds are defined more precisely: two thresholds, T_D and T_H, are used in this study.

[Fig. 5 shows the difference histogram with zero at the center; the starting points S_D and S_H and the thresholds T_D and T_H are marked on either side of zero.] Fig. 5. An illustration of threshold determination.
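The two-threshold binarization can be sketched as below, following Chiu et al.'s criterion that a suitable threshold sits at a local minimum of the difference histogram; the starting points are taken here as given parameters, and all names are illustrative:

```python
import numpy as np

def first_local_min(hist, start, step):
    # Walk outward from the starting index until the first local minimum,
    # the discrete analogue of H'(x) = 0 and H''(x) > 0.
    i = start
    while 0 < i + step < len(hist) - 1:
        i += step
        if hist[i] < hist[i - 1] and hist[i] < hist[i + 1]:
            break
    return i

def binarize(diff, hist, s_d, s_h, zero=255):
    # T_D is searched leftwards from S_D and T_H rightwards from S_H;
    # pixels whose difference lies outside [T_D, T_H] become foreground.
    t_d = first_local_min(hist, s_d, -1) - zero
    t_h = first_local_min(hist, s_h, +1) - zero
    return (diff < t_d) | (diff > t_h)
```

Starting the search away from zero, as in the text, keeps the small noise bins near the histogram center from being picked as spurious minima.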

Let H_D(x) be the difference histogram. Chiu et al. (2010) argue that if x* is a suitable threshold, then H_D(x*) should be a local minimum of the difference histogram; thus, x* should satisfy H_D'(x*) = 0 and H_D''(x*) > 0. Instead of searching for the local minima from the middle zero point, the system starts from the points S_D and S_H shown in Figure 5, which are determined by the proportions of the foreground area in the previous frame. The first local minima outside these two starting points are taken as the thresholds T_D and T_H, respectively. Searching for the thresholds from the starting points S_D and S_H prevents the small noise near the middle of the difference histogram from having an impact.

Figure 6 shows an example of the image binarization procedure, where Figure 6(a) shows the input frames, Figure 6(b) shows the foreground extraction results using only the single threshold proposed by Chiu et al. (2010), and Figure 6(c) shows the foreground extraction results using the two thresholds T_D and T_H. A comparison of the corresponding frames in Figures 6(b) and (c) shows that the foreground areas extracted using two thresholds are more complete and correct.

(a) (b) (c) Fig. 6. An example of image binarization. (a) The input frames, (b) the foreground extraction results using only one threshold, and (c) the foreground extraction results using two thresholds.

4.2 Foreground extraction

The next step is to use the foreground extraction results as the hard constraint pixels

to construct the color distributions of the background and foreground. The color distribution of the foreground area, P(x | O), is constructed from the foreground extraction result, while the color distribution of the background area, P(x | B), can be computed directly from the background image stored in the system. To prevent gradual changes in the background from impacting the classification, after the graph cut refinement the intensity values of the pixels in the background image are updated wherever the pixels of the input frame are classified as background pixels:

I_B(i, j) = α I_B(i, j) + (1 - α) I(i, j), (8)

where I_B is the background image and I is the input frame. The symbol α is an input parameter that adjusts the speed of updating the background pixels.

In Figure 6(c), an observation of the face areas of the student in front shows that the foreground areas are still defective. Thus, using the foreground areas as the hard constraints, the graph cut algorithm can be applied to obtain more accurate foreground results. Figure 7 shows the final foreground extraction result refined by the graph cut algorithm.

Fig. 7. The final foreground extraction result refined by the graph cut algorithm.

5. Skin Color Region Detection Using a Graph Cut

5.1 Hard constraints of skin color regions

The graph cut algorithm is also applied to detect skin color regions, since: (1) skin color pixels are usually collected into several compact regions in the input frames, and (2) the distribution of skin color is highly concentrated in the hue component. Figure 8 shows an example in which the skin color regions are compact in the hue component. One can observe that in the skin regions, such as faces and hands, the hue values of the pixels are very similar. Thus, a fixed skin color interval in the HSV color space, 0 ≤ H ≤ 50, is used to detect the initial skin color pixels. These pixels are regarded as the hard constraints of the skin color regions.
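The running-average update of Eq. (8) might be sketched as follows (NumPy, with an illustrative boolean mask standing in for the graph-cut background classification):

```python
import numpy as np

def update_background(background, frame, is_background, alpha):
    # Eq. (8): I_B(i, j) <- alpha * I_B(i, j) + (1 - alpha) * I(i, j),
    # applied only where the pixel was classified as background;
    # alpha adjusts the speed of updating the background pixels.
    bg = background.astype(np.float64)
    fr = frame.astype(np.float64)
    blended = alpha * bg + (1.0 - alpha) * fr
    return np.where(is_background[..., None], blended, bg)
```

A large alpha keeps the stored background stable; a small alpha lets it follow gradual lighting changes more quickly.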

(a) (b) Fig. 8. An example showing the skin color regions: (a) the input frames and (b) their corresponding hue values.

Note that a fixed skin color interval usually selects most skin pixels, but the detection result is often fragmented due to a few outliers. Moreover, the lighting conditions in a classroom vary over time, affecting the color distribution of the skin pixels. This means the suitable skin color interval of the input frame should not be fixed. Therefore, once the graph cut algorithm has been applied to detect the complete skin region, a dynamic learning scheme is used to update the interval every frame. Thus, the hard constraints of the skin color regions can follow the changes in lighting conditions. The proposed algorithm to initialize and update the skin color interval is as follows:

1. Construct the hue histogram H of the skin color pixels obtained by the graph cut algorithm. Let the value of bin i in the histogram H be h_i, where 0 ≤ i ≤ 255.
2. If the system is initializing, find the top three values in the interval [0, 60]; otherwise, find the three peak values nearest the center of the previous skin color interval. Take their average s as the center of the new interval.
3. Sum the histogram values located to the left of s to obtain the left portion P_l:

   P_l = (Σ_{j=0}^{s} h_j) / (Σ_{i=0}^{255} h_i), (9)

   and calculate the right bound value r, which should satisfy:

   (Σ_{j=s}^{r} h_j) / (Σ_{i=0}^{255} h_i) ≥ P_l.

4. Compute the standard deviation σ_s of the histogram values located in [0, r].
5. Set the interval [s - 2σ_s, s + 2σ_s] as the new interval.

It should be noted that skin color pixels are distributed on the left side of the hue histogram, since the skin colors are close to red in the hue component. In addition, to

prevent the new interval from diverging too much, the maximum width of the interval is bounded by 60. If the width of the interval would exceed this maximum, then the system only shifts the interval and does not change its width.

5.2 Skin color region extraction

The distribution of the skin color values P(x | skin) is calculated from the detection results of the skin color interval, while the distribution of non-skin color values P(x | ~skin) is obtained from all the foreground pixels with the exception of the skin color pixels. Figure 9 shows an example of skin-color region extraction. Figure 9(a) shows the input frames and Figure 9(b) shows the hard constraints detected by the initial skin color interval. Figure 9(c) shows the skin-color region extraction results of the graph cut algorithm. One can observe that the skin color regions shown in Figure 9(c) are more complete and compact than those shown in Figure 9(b).

(a) (b) (c) Fig. 9. An example of the skin-color segmentation result: (a) the input frames, (b) the hard constraints detected by the initial skin color interval, and (c) the skin-color region extraction results of the graph cut algorithm.
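The five-step interval update described in Section 5.1 could be sketched as below. This is one interpretation under stated assumptions: step 4's "variance" is read as the spread of the hue values weighted by the histogram restricted to [0, r], peak selection is simplified, and all names are illustrative:

```python
import numpy as np

def update_skin_interval(hue_values, prev_center=None, max_width=60):
    # Step 1: hue histogram of the skin pixels found by the graph cut.
    h = np.bincount(hue_values, minlength=256)
    total = max(h.sum(), 1)
    # Step 2: three peaks -- at start-up the top three bins in [0, 60],
    # afterwards the three non-empty bins nearest the previous center --
    # averaged to give the new center s.
    if prev_center is None:
        peaks = np.argsort(h[:61])[-3:]
    else:
        order = np.argsort(np.abs(np.arange(256) - prev_center))
        peaks = order[h[order] > 0][:3]
    s = int(np.mean(peaks))
    # Step 3: left portion P_l (Eq. 9) and the right bound r with
    # sum(h[s..r]) / total >= P_l.
    p_l = h[:s + 1].sum() / total
    cum = np.cumsum(h[s:]) / total
    r = min(s + int(np.searchsorted(cum, p_l)), 255)
    # Step 4: spread sigma_s of the hue distribution restricted to [0, r].
    bins = np.arange(r + 1)
    w = h[:r + 1]
    wsum = max(w.sum(), 1)
    mean = (bins * w).sum() / wsum
    sigma = np.sqrt((((bins - mean) ** 2) * w).sum() / wsum)
    # Step 5: new interval [s - 2*sigma, s + 2*sigma], width capped at 60.
    lo, hi = s - 2 * sigma, s + 2 * sigma
    if hi - lo > max_width:
        lo, hi = s - max_width / 2, s + max_width / 2
    return max(lo, 0.0), min(hi, 255.0)
```

On each frame the returned interval replaces the previous one, so the hard constraints track gradual lighting changes.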

6. Face Region Selection

The final stage of the proposed system is the selection of face regions. The face regions can be selected from the skin color regions obtained in the previous section. Face regions usually contain more complicated textures than non-face regions, for example, arm and leg regions. Moreover, the shape of a face region is more similar to a circle or a rectangle than the shapes of non-face regions. Thus, given a skin color region and its corresponding bounding box, several criteria are proposed to distinguish face regions from non-face regions.

1. Regularity: Regularity evaluates the skin region's similarity to its bounding box, computed as:

   R = A / B, (10)

   where A is the area of a skin color region and B is the area of its bounding box. The R value of a face region should be larger than a given threshold.

2. Aspect ratio: The aspect ratio is defined as the ratio of the height to the width of the bounding box. In this study, the aspect ratio of a face region is set in the range [0.5, 1].

3. Convexity: The shape of a face region should be convex, which means the center of mass must lie inside the region.

4. Number of corners: Since face regions are more complex than non-face regions, the number of corners in a face region should be larger than that in a non-face region. In this study, the smallest eigenvalue of the Hessian matrix is used as the corner feature. Given an image I:

   H = Σ_p w(p) [ (dI/dx)^2        (dI/dx)(dI/dy) ]
                [ (dI/dx)(dI/dy)   (dI/dy)^2      ], (11)

   where w(p) is a mask with a 3 × 3 patch size.

In Figure 10(a), the green boxes represent the bounding boxes of the skin color regions and, in Figure 10(b), the green points are the corners detected in the skin color regions. Note that the number of corners in a face region is larger than that in a non-face region.
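The four criteria above might be combined as in this sketch; the regularity and corner-count thresholds are illustrative placeholders, since the paper's exact values are not given here:

```python
def is_face_region(area, box_w, box_h, center_inside, n_corners,
                   min_regularity=0.5, min_corners=10):
    # Criterion 1: regularity R = A / B (Eq. 10), region area over
    # bounding-box area.
    regularity = area / float(box_w * box_h)
    # Criterion 2: aspect ratio (height / width) restricted to [0.5, 1].
    aspect = box_h / float(box_w)
    # Criterion 3: convexity -- the center of mass lies inside the region.
    # Criterion 4: enough corners, since faces are more textured.
    return (regularity > min_regularity
            and 0.5 <= aspect <= 1.0
            and center_inside
            and n_corners >= min_corners)
```

A region must pass all four tests; failing any one of them marks it as a non-face candidate for the probability update described below.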

(a) (b) Fig. 10. An example of the corner detection results on skin color regions: (a) bounding boxes of skin color regions and (b) the corners detected in the skin color regions.

Using these criteria, most of the face regions can be selected quickly and accurately. However, there are still some challenging cases. Firstly, some skin color regions in the background, which can be regarded as noise, will affect the selection results. Secondly, several skin color regions may be occluded by others and merge into a single region. Finally, the face regions under some head poses may not contain enough corners. Thus, a temporal face-region finding scheme and a probability updating function are applied to avoid inaccurate detection of some face regions.

Let R_{i,j} be a face region, denoting the j-th face region in frame i, let B_{i,j} be its corresponding bounding box, and let N_{i,j} be the number of skin color pixels in B_{i,j}. The temporal face-region finding scheme is as follows. If R_{i+1,j} is not found in frame i+1, then no face region in frame i+1 is detected near R_{i,j}, and the system computes the number of skin color pixels, N_{i+1,j}, inside B_{i,j}. In this case, if N_{i+1,j} ≥ (1/2) N_{i,j}, then a shift vector is computed to obtain the center of B_{i+1,j}. This shift vector begins at the center of B_{i,j} and ends at the center of the skin color region inside B_{i,j} in frame i+1. An example of the shift vector computation is shown in Figure 11. Figure 11(a) shows a bounding box (the green box) detected in frame i and Figure 11(b) shows the noisy skin-color detection result and the shift vector (the blue arc) detected in frame i+1. The shift vector helps the system track the face regions in frame i+1.
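The shift-vector step of the temporal finding scheme could be sketched as follows (NumPy; the box layout and names are illustrative):

```python
import numpy as np

def shift_vector(box, skin_mask, prev_count):
    # box = (x, y, w, h) of B_{i,j} from frame i; skin_mask is the binary
    # skin-color detection result of frame i+1; prev_count = N_{i,j}.
    # A shift is computed only when N_{i+1,j} >= prev_count / 2.
    x, y, w, h = box
    ys, xs = np.nonzero(skin_mask[y:y + h, x:x + w])
    if len(xs) < prev_count / 2.0:
        return None  # too few skin pixels remain inside the old box
    cx, cy = x + w / 2.0, y + h / 2.0        # center of B_{i,j}
    mx, my = x + xs.mean(), y + ys.mean()    # skin centroid inside the box
    return (mx - cx, my - cy)                # vector toward B_{i+1,j}'s center
```

Translating B_{i,j} by the returned vector gives a candidate position for the missing bounding box in frame i+1.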

(a) (b) Fig. 11. An example of computing the shift vector: (a) a bounding box and (b) the noisy skin color detection result and its shift vector.

The probability update function is defined below. For a skin color region R_{i,j}, the probability of R_{i,j} being a face region is updated by:

P_f(R_{i,j}) = { 1                    if R_{i,j} satisfies the criteria
             { P_f(R_{i,j}) - c_1   if R_{i,j} does not satisfy the criteria   (12)
             { P_f(R_{i,j}) - c_2   if R_{i,j} is not found,

where c_1 and c_2 are two predefined constants that adjust how long R_{i,j} survives under the second and third conditions. In this study, if the probability of a face region decreases to zero, the system removes the face region.

7. Dynamic Graph Cut Algorithm

The traditional graph cut algorithm is memory and time consuming. A large graph must be constructed for each frame, containing one vertex per pixel (plus the two terminals) and a proportionally large number of edges. Applying a max-flow/min-cut algorithm to such a large graph requires substantial computation time. Kohli and Torr (2007) proposed a dynamic graph cut algorithm to reduce the computation time. The dynamic graph cut algorithm uses the graph cut result of frame i as the initial graph of frame i+1 and only changes some edge weights. However, changing the weights of edges changes the flow capacities. If a resulting capacity is less than the flow through it, a flow inconsistency occurs and the structure of the graph is destroyed. Kohli and Torr's strategy is to re-parameterize the graph, which updates the flow capacities for the new graph without affecting the image labeling result. Let θ_1 and θ_2 be two different assignments of the weights on the graph; then θ_2 is called a re-parameterization of θ_1 if and only if

arg min_A E_{θ1}(A) = arg min_A E_{θ2}(A). The method to re-parameterize a graph involves modifying the t-links and n-links through a specific mechanism. Figure 12 shows an example of how a graph is re-parameterized by modifying t-links. Figure 12(a) shows a residual graph without an augmenting path, which is equivalent to a flow network with a max-flow passing through it. After frame i+1 is inputted, an inconsistent flow on edge {S, p}, which has been assigned a negative-weight t-link, is obtained. The solution is to add a positive value a to both {S, p} and {p, T} to update the weights of the edges so that no negative-weight t-link occurs, as shown in Figure 12(b). The updated graph is a re-parameterized graph with updated t-link weights and is free from flow inconsistency (Figure 12(c)).

(a) (b) (c) Fig. 12. An example showing the re-parameterization of a graph by modifying t-links: (a) the residual graph with a negative-weight t-link, (b) adding a positive value to the t-links, and (c) the re-parameterized graph.

Figure 13 shows an example of the re-parameterization of a graph by modifying its n-links. In Figure 13(a), a flow inconsistency occurs on edge {q, p} after frame i+1 is inputted. To avoid the occurrence of negative-weight edges, a positive value a is added to the weights of {q, p}, {p, T}, and {S, p}. This update can be regarded as reversing the overflow to avoid the inconsistency, as shown in Figure 13(b). It should be noted that a should be the minimum positive value that avoids the negative-weight n-links without raising other flows. The updated graph is a re-parameterized graph with updated n-link weights and is free from flow inconsistency (Figure 13(c)).
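The t-link re-parameterization in Figure 12 rests on a simple invariant: adding the same constant a to both t-links of a pixel raises the cost of every s-t cut by exactly a, because each cut severs exactly one of the two, so the arg min is unchanged. A minimal sketch (names are illustrative):

```python
def reparameterize_tlinks(w_sp, w_pt):
    # If a weight update left one of pixel p's t-links negative, add the
    # smallest constant a that restores non-negativity to both t-links
    # {S, p} and {p, T}; the minimum cut (hence the labeling) is unchanged.
    a = max(0, -min(w_sp, w_pt))
    return w_sp + a, w_pt + a
```

Choosing the smallest such a keeps the residual capacities tight, which is what makes the subsequent max-flow computation on the updated graph cheap.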

(a) (b) (c) Fig. 13. An example of graph re-parameterization by modifying n-links: (a) the residual graph with a negative-weight n-link, (b) reversing the overflow by adding a positive value, and (c) the re-parameterized graph.

(a) (b) Fig. 14. Two examples of the speed-up achieved by the dynamic graph cut: (a) the input videos and (b) the runtimes required to find the max-flow.

The dynamic graph cut algorithm provides a scheme for modifying the weights of a residual graph in order to avoid the overhead of constructing and deconstructing a large graph for every frame. Furthermore, since the graph modified by the dynamic graph cut is an almost full flow network, it greatly reduces the runtime required to

apply the max-flow algorithm to the graph. Figure 14 shows two examples of the achievable speed-up. Figure 14(a) shows the first frames of two input videos, and Figure 14(b) shows their corresponding runtimes to find the max-flow. The blue lines show the per-frame runtime of the traditional graph cut algorithm, and the red lines show the runtime of the dynamic graph cut algorithm. It can clearly be seen that the runtime of the dynamic graph cut algorithm is less than half that of the traditional algorithm.

Although the dynamic graph cut algorithm is more effective than the traditional graph cut algorithm, it can be improved further. In video input, successive frames are very similar, and the objects in them do not move much. Thus, the system can simply modify the weights of selected edges of the graph in the following frames and leave the others unchanged. A temporal difference with thresholds is used to decide whether the connected edges of a vertex should be modified or not. These thresholds are equal to the global thresholds T_D and T_H introduced in Section 4.1. Figure 15 shows an example of the runtime of the improved dynamic graph cut algorithm. Figure 15(a) shows the first frame of the input video and Figure 15(b) shows the runtimes required to find the max-flow. In Figure 15(b), the blue, red, and green lines show the runtimes of the traditional graph cut algorithm, the dynamic graph cut algorithm, and the proposed graph cut algorithm, respectively. In comparison to the dynamic graph cut algorithm, the average runtime of the proposed graph cut algorithm is reduced by approximately 16%, though its variance is higher.

(a) (b) Fig. 15. An example of the runtime of the improved dynamic graph cut algorithm: (a) the input video and (b) the runtime required to find the max-flow, where the green line is the runtime of the proposed graph cut algorithm.
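The selective-update rule of the improved algorithm might be sketched as below (assuming grayscale frames for brevity; the function name is illustrative):

```python
import numpy as np

def vertices_to_update(prev_frame, frame, t_d, t_h):
    # Only pixels whose temporal difference falls outside [T_D, T_H] have
    # their connected edges re-weighted; the rest of the residual graph
    # is carried over unchanged to the next frame.
    diff = frame.astype(np.int16) - prev_frame.astype(np.int16)
    return (diff < t_d) | (diff > t_h)
```

Since most classroom pixels change little between successive frames, the returned mask is typically sparse, which is where the extra runtime saving over the plain dynamic graph cut comes from.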

8. Experimental Results

The experiments consisted of three parts: foreground extraction, skin color detection, and face detection. The image sizes differ between the foreground extraction and skin-color detection stages and the face detection stage; the processing speed is approximately 12 frames/second. Five classrooms at National Taiwan Normal University, B101, B103, C209, S101, and the B1 lyceum, were used to obtain different classroom environments. Since classrooms B101 and B103 are similar, Figure 16 shows four of the five classrooms. These classrooms have different background color distributions and lighting conditions; B103 has a background similar to skin color, and C209 is the only one that is not a lecture theater.

(a) (b) (c) (d) Fig. 16. Four classrooms used to obtain the experimental videos: (a) B103, (b) C209, (c) S101, and (d) the B1 lyceum.

Seven videos (totaling 24,952 frames) were taken in these classrooms, named in series as 265, 266, 288, 292, 301, 303, and 307. Figure 17(a) shows one frame of each video. Figures 17(b) and (d) show the corresponding experimental results of foreground extraction and skin color detection, respectively. Moreover, to compute the error rates of foreground extraction and skin color detection, ground truth images were produced manually (shown in Figures 17(c) and (e)).

Table 1 shows the experimental results of foreground extraction. The precision rates and recall rates of foreground extraction are approximately 85–92% and 84–97%, respectively, and the f-measure values of the seven sequences are all high. One can see from Table 1 that the precision rates are all lower than the recall rates, except for video 265. This means that when objects are occluded by others, the small background regions near the object boundaries are classified as foreground areas.
However, in most cases this misclassification does not affect the face detection results unless the color of the background region is similar to skin color. If the input videos contain skin-color-like backgrounds, then the results of skin color detection will be

easily affected, and the heights and widths of the bounding boxes of the detected skin regions will be incorrect. These situations can be partially resolved by the probability updating function.

Fig. 17. Some experimental examples of (a) the experimental videos, (b) the experimental results of foreground extraction, (c) the ground truth of foreground extraction, (d) the experimental results of skin color detection, and (e) the skin color ground truth.

Table 1. The experimental results of foreground extraction.

Video No.   No. of frames   Precision   Recall    F-measure
265         …               …           84.43%    …
266         …               …           95.33%    …
288         …               …           97.18%    …
292         …               …           91.13%    …
301         …               …           91.85%    …
303         …               …           97.00%    …
307         …               …           95.36%    …

Table 2 shows the experimental results of skin color detection. The F-measure values of these seven sequences are generally lower than those of foreground extraction. In dark lighting conditions, as in videos 265 and 307, it can be seen that the system obtains the skin color regions

sufficiently well and achieves high precision rates (88.47% and 91.57%, respectively). However, some precision rates of skin color detection are much lower than those of the foreground extraction stage. The reasons for a low precision rate can be: (1) if there are few skin color pixels in some frames, the proportion of misclassified pixels increases; (2) the brown pixels of a student's hair are often misclassified as skin color pixels, lowering the precision rate, e.g., in video 288; and (3) many small noisy pixels around the skin regions in frames with skin-color-like backgrounds, e.g., in video 266, are misclassified as foreground skin color pixels. In comparison, low recall rates occur in video 288 (66.43%) and video 307 (66.99%), both of which were captured in the B1 lyceum. The blue color of the chairs and the lighting conditions cause color variation of the students' faces in the frames, making it difficult for the system to determine a suitable skin color range.

Table 2. The experimental results of skin color detection.

Video No.   No. of frames   Precision   Recall    F-measure
265         …               …           77.91%    …
266         …               …           88.49%    …
288         …               …           66.43%    …
292         …               …           81.73%    …
301         …               …           87.31%    …
303         …               …           71.91%    …
307         …               …           66.99%    …

Table 3 shows the final results of the proposed face detection method. Two critical factors affect the precision rates. The first is the distribution of the skin region positions: the more dispersed the skin regions are, the better the results. The second is the resolution of the skin regions, which depends on the lighting conditions, the position of the camera, and the size of the frame. In videos 266 and 303, the face regions are clear and non-overlapping, and the lighting conditions are good. Thus, their precision and recall rates are rather high (91-99%). Although the background color of video 266 is similar to skin color, application of the probability update function allows the system to find the face regions as long as the corner features are clear.
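The per-video scores reported in the tables are the standard pixel-wise precision, recall, and F-measure of a detection mask against the manually produced ground truth. A minimal sketch of this evaluation (not the authors' code) is:

```python
import numpy as np

def pixel_scores(pred, truth):
    """Pixel-wise precision, recall, and F-measure of a binary
    detection mask `pred` against a binary ground-truth mask `truth`."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()    # correctly detected pixels
    fp = np.logical_and(pred, ~truth).sum()   # false detections
    fn = np.logical_and(~pred, truth).sum()   # missed pixels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Toy 2x2 example: one true pixel detected, one false alarm, one miss.
pred = np.array([[1, 1], [0, 0]])
truth = np.array([[1, 0], [1, 0]])
p, r, f = pixel_scores(pred, truth)   # p = 0.5, r = 0.5, f = 0.5
```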
Videos 288 and 307 are both challenging cases, since the skin regions occupy only small areas of the scene. Moreover, the faces of students seated in the back rows are mostly occluded, which lowers the recall rates. Video 265 is the darkest case the system was tested on. The system obtains a good skin-color detection result, but the poor lighting conditions cause misdetection of the corner features. As a result, the system does not

detect most of the faces of students seated in the back rows, and the recall rate is reduced to 43%. In addition, the processing rate for all the videos is approximately 12 frames/second, meaning the technique has the potential to become a real-time system with further optimization or GPU parallelization.

Table 3. The experimental results of the proposed face detection method.

Video No.   No. of frames   Precision   Recall    F-measure
265         …               …           43%       …
266         …               …           99%       …
288         …               …           83%       …
292         …               …           93%       …
301         …               …           96%       …
303         …               …           98%       …
307         …               …           91%       …

9. Conclusions

Face detection using skin color is usually considered unreliable due to its sensitivity to lighting conditions, the race of the subjects, and other sources of skin color variation. In this paper, an improved graph cut algorithm is proposed to perform foreground and skin color extraction. Moreover, a dynamic learning strategy to update the skin color range in each frame is also proposed. This strategy improves the correctness of the initial skin color range and reduces the computational time of skin-color region detection. The experimental results show that, even in a classroom with a background similar to skin color, the system can still detect faces successfully. The proposed method is robust against the various head poses of the subjects, and will help in further analysis of the behavior of students in a classroom.

Acknowledgment

The authors would like to thank the National Science Council of the Republic of China, Taiwan, for financially supporting this research under Contract No. NSC E.

References

Boykov, Y. Y. & Jolly, M. P. 2001. Interactive Graph Cuts for Optimal Boundary & Region Segmentation of Objects in N-D Images, Proceedings of the International Conference on Computer Vision, Vancouver, Canada, 1.
Chai, D. & Ngan, K. N. 1999. Face Segmentation Using Skin-Color Map in

Videophone Applications, IEEE Transactions on Circuits and Systems for Video Technology, 9 (4).
Cheng, L., Gong, M., Schuurmans, D., & Caelli, T. 2011. Real-Time Discriminative Background Subtraction, IEEE Transactions on Image Processing, 20 (5).
Chiu, C. C., Ku, M. Y., & Liang, L. W. 2010. A Robust Object Segmentation System Using a Probability-Based Background Extraction Algorithm, IEEE Transactions on Circuits and Systems for Video Technology, 20 (4).
Elgammal, A. M., Duraiswami, R., Harwood, D., & Davis, L. 2002. Background and Foreground Modeling Using Nonparametric Kernel Density Estimation for Visual Surveillance, Proceedings of the IEEE, 90 (7).
Fang, C. Y., Kuo, M. H., Lee, G. C., & Chen, S. W. 2011. Student Gesture Recognition System in Classroom 2.0, Proceedings of the IASTED International Conference on Computers and Advanced Technology in Education (CATE 2011), Cambridge, United Kingdom.
Heikkilä, M. & Pietikäinen, M. 2006. A Texture-Based Method for Modeling the Background and Detecting Moving Objects, IEEE Transactions on Pattern Analysis and Machine Intelligence, 28 (4).
Juan, O. & Boykov, Y. 2006. Active Graph Cuts, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New York.
KaewTraKulPong, P. & Bowden, R. 2001. An Improved Adaptive Background Mixture Model for Real-Time Tracking with Shadow Detection, Proceedings of the 2nd European Workshop on Advanced Video-Based Surveillance Systems, London.
Kohli, P. & Torr, P. H. S. 2007. Dynamic Graph Cuts for Efficient Inference in Markov Random Fields, IEEE Transactions on Pattern Analysis and Machine Intelligence, 29 (12).
Mahmoud, T. M. 2008. A New Fast Skin Color Detection Technique, World Academy of Science, Engineering and Technology.
Phung, S. L. 2002. A Novel Skin Color Model in YCbCr Color Space and Its Application to Human Face Detection, Proceedings of the IEEE International Conference on Image Processing (ICIP '02), 1, New York.

Sigal, L., Sclaroff, S., & Athitsos, V. 2004. Skin Color-Based Video Segmentation Under Time-Varying Illumination, IEEE Transactions on Pattern Analysis and Machine Intelligence, 26 (7).
Wu, Z. & Leahy, R. 1993. An Optimal Graph Theoretic Approach to Data Clustering: Theory and Its Application to Image Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 15 (11).
Zhang, X. N., Jiang, J. Z., Liang, H., & Liu, C. L. 2010. Skin Color Enhancement Based on Favorite Skin Color in HSV Color Space, IEEE Transactions on Consumer Electronics, 56 (3).


String Extraction From Color Airline Coupon Image Using Statistical Approach String Extraction From Color Airline Coupon Image Using Statistical Approach Yi Li, Zhiyan Wang, Haizan Zeng School of Computer Science South China University of echnology, Guangzhou, 510640, P.R.China

More information

A Texture-Based Method for Modeling the Background and Detecting Moving Objects

A Texture-Based Method for Modeling the Background and Detecting Moving Objects A Texture-Based Method for Modeling the Background and Detecting Moving Objects Marko Heikkilä and Matti Pietikäinen, Senior Member, IEEE 2 Abstract This paper presents a novel and efficient texture-based

More information

Fully Automatic Methodology for Human Action Recognition Incorporating Dynamic Information

Fully Automatic Methodology for Human Action Recognition Incorporating Dynamic Information Fully Automatic Methodology for Human Action Recognition Incorporating Dynamic Information Ana González, Marcos Ortega Hortas, and Manuel G. Penedo University of A Coruña, VARPA group, A Coruña 15071,

More information

Segmentation of Distinct Homogeneous Color Regions in Images

Segmentation of Distinct Homogeneous Color Regions in Images Segmentation of Distinct Homogeneous Color Regions in Images Daniel Mohr and Gabriel Zachmann Department of Computer Science, Clausthal University, Germany, {mohr, zach}@in.tu-clausthal.de Abstract. In

More information

Detecting and Identifying Moving Objects in Real-Time

Detecting and Identifying Moving Objects in Real-Time Chapter 9 Detecting and Identifying Moving Objects in Real-Time For surveillance applications or for human-computer interaction, the automated real-time tracking of moving objects in images from a stationary

More information

A Hand Gesture Recognition Method Based on Multi-Feature Fusion and Template Matching

A Hand Gesture Recognition Method Based on Multi-Feature Fusion and Template Matching Available online at www.sciencedirect.com Procedia Engineering 9 (01) 1678 1684 01 International Workshop on Information and Electronics Engineering (IWIEE) A Hand Gesture Recognition Method Based on Multi-Feature

More information

Pupil Localization Algorithm based on Hough Transform and Harris Corner Detection

Pupil Localization Algorithm based on Hough Transform and Harris Corner Detection Pupil Localization Algorithm based on Hough Transform and Harris Corner Detection 1 Chongqing University of Technology Electronic Information and Automation College Chongqing, 400054, China E-mail: zh_lian@cqut.edu.cn

More information

Connected Component Analysis and Change Detection for Images

Connected Component Analysis and Change Detection for Images Connected Component Analysis and Change Detection for Images Prasad S.Halgaonkar Department of Computer Engg, MITCOE Pune University, India Abstract Detection of the region of change in images of a particular

More information

Computers and Mathematics with Applications. An embedded system for real-time facial expression recognition based on the extension theory

Computers and Mathematics with Applications. An embedded system for real-time facial expression recognition based on the extension theory Computers and Mathematics with Applications 61 (2011) 2101 2106 Contents lists available at ScienceDirect Computers and Mathematics with Applications journal homepage: www.elsevier.com/locate/camwa An

More information

A Comparison of Color Models for Color Face Segmentation

A Comparison of Color Models for Color Face Segmentation Available online at www.sciencedirect.com Procedia Technology 7 ( 2013 ) 134 141 A Comparison of Color Models for Color Face Segmentation Manuel C. Sanchez-Cuevas, Ruth M. Aguilar-Ponce, J. Luis Tecpanecatl-Xihuitl

More information

Image retrieval based on region shape similarity

Image retrieval based on region shape similarity Image retrieval based on region shape similarity Cheng Chang Liu Wenyin Hongjiang Zhang Microsoft Research China, 49 Zhichun Road, Beijing 8, China {wyliu, hjzhang}@microsoft.com ABSTRACT This paper presents

More information

EE368 Project Report CD Cover Recognition Using Modified SIFT Algorithm

EE368 Project Report CD Cover Recognition Using Modified SIFT Algorithm EE368 Project Report CD Cover Recognition Using Modified SIFT Algorithm Group 1: Mina A. Makar Stanford University mamakar@stanford.edu Abstract In this report, we investigate the application of the Scale-Invariant

More information

Input sensitive thresholding for ancient Hebrew manuscript

Input sensitive thresholding for ancient Hebrew manuscript Pattern Recognition Letters 26 (2005) 1168 1173 www.elsevier.com/locate/patrec Input sensitive thresholding for ancient Hebrew manuscript Itay Bar-Yosef * Department of Computer Science, Ben Gurion University,

More information

Robotics Programming Laboratory

Robotics Programming Laboratory Chair of Software Engineering Robotics Programming Laboratory Bertrand Meyer Jiwon Shin Lecture 8: Robot Perception Perception http://pascallin.ecs.soton.ac.uk/challenges/voc/databases.html#caltech car

More information

The Curse of Dimensionality

The Curse of Dimensionality The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more

More information

Image Segmentation Using Iterated Graph Cuts BasedonMulti-scaleSmoothing

Image Segmentation Using Iterated Graph Cuts BasedonMulti-scaleSmoothing Image Segmentation Using Iterated Graph Cuts BasedonMulti-scaleSmoothing Tomoyuki Nagahashi 1, Hironobu Fujiyoshi 1, and Takeo Kanade 2 1 Dept. of Computer Science, Chubu University. Matsumoto 1200, Kasugai,

More information

CHAPTER 4 DETECTION OF DISEASES IN PLANT LEAF USING IMAGE SEGMENTATION

CHAPTER 4 DETECTION OF DISEASES IN PLANT LEAF USING IMAGE SEGMENTATION CHAPTER 4 DETECTION OF DISEASES IN PLANT LEAF USING IMAGE SEGMENTATION 4.1. Introduction Indian economy is highly dependent of agricultural productivity. Therefore, in field of agriculture, detection of

More information

Generative and discriminative classification techniques

Generative and discriminative classification techniques Generative and discriminative classification techniques Machine Learning and Category Representation 2014-2015 Jakob Verbeek, November 28, 2014 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.14.15

More information

Color Content Based Image Classification

Color Content Based Image Classification Color Content Based Image Classification Szabolcs Sergyán Budapest Tech sergyan.szabolcs@nik.bmf.hu Abstract: In content based image retrieval systems the most efficient and simple searches are the color

More information

A Robust Wipe Detection Algorithm

A Robust Wipe Detection Algorithm A Robust Wipe Detection Algorithm C. W. Ngo, T. C. Pong & R. T. Chin Department of Computer Science The Hong Kong University of Science & Technology Clear Water Bay, Kowloon, Hong Kong Email: fcwngo, tcpong,

More information

Tri-modal Human Body Segmentation

Tri-modal Human Body Segmentation Tri-modal Human Body Segmentation Master of Science Thesis Cristina Palmero Cantariño Advisor: Sergio Escalera Guerrero February 6, 2014 Outline 1 Introduction 2 Tri-modal dataset 3 Proposed baseline 4

More information

A Background Modeling Approach Based on Visual Background Extractor Taotao Liu1, a, Lin Qi2, b and Guichi Liu2, c

A Background Modeling Approach Based on Visual Background Extractor Taotao Liu1, a, Lin Qi2, b and Guichi Liu2, c 4th International Conference on Mechatronics, Materials, Chemistry and Computer Engineering (ICMMCCE 2015) A Background Modeling Approach Based on Visual Background Extractor Taotao Liu1, a, Lin Qi2, b

More information

Background subtraction in people detection framework for RGB-D cameras

Background subtraction in people detection framework for RGB-D cameras Background subtraction in people detection framework for RGB-D cameras Anh-Tuan Nghiem, Francois Bremond INRIA-Sophia Antipolis 2004 Route des Lucioles, 06902 Valbonne, France nghiemtuan@gmail.com, Francois.Bremond@inria.fr

More information

Research on Evaluation Method of Video Stabilization

Research on Evaluation Method of Video Stabilization International Conference on Advanced Material Science and Environmental Engineering (AMSEE 216) Research on Evaluation Method of Video Stabilization Bin Chen, Jianjun Zhao and i Wang Weapon Science and

More information

High Capacity Reversible Watermarking Scheme for 2D Vector Maps

High Capacity Reversible Watermarking Scheme for 2D Vector Maps Scheme for 2D Vector Maps 1 Information Management Department, China National Petroleum Corporation, Beijing, 100007, China E-mail: jxw@petrochina.com.cn Mei Feng Research Institute of Petroleum Exploration

More information

RSRN: Rich Side-output Residual Network for Medial Axis Detection

RSRN: Rich Side-output Residual Network for Medial Axis Detection RSRN: Rich Side-output Residual Network for Medial Axis Detection Chang Liu, Wei Ke, Jianbin Jiao, and Qixiang Ye University of Chinese Academy of Sciences, Beijing, China {liuchang615, kewei11}@mails.ucas.ac.cn,

More information

MATRIX BASED INDEXING TECHNIQUE FOR VIDEO DATA

MATRIX BASED INDEXING TECHNIQUE FOR VIDEO DATA Journal of Computer Science, 9 (5): 534-542, 2013 ISSN 1549-3636 2013 doi:10.3844/jcssp.2013.534.542 Published Online 9 (5) 2013 (http://www.thescipub.com/jcs.toc) MATRIX BASED INDEXING TECHNIQUE FOR VIDEO

More information

Motion Detection Using Adaptive Temporal Averaging Method

Motion Detection Using Adaptive Temporal Averaging Method 652 B. NIKOLOV, N. KOSTOV, MOTION DETECTION USING ADAPTIVE TEMPORAL AVERAGING METHOD Motion Detection Using Adaptive Temporal Averaging Method Boris NIKOLOV, Nikolay KOSTOV Dept. of Communication Technologies,

More information