Abstract. 1 Introduction. 2 Motivation. Information and Communication Engineering October 29th 2010


Information and Communication Engineering, October 29th 2010

A Survey on Head Pose Estimation from Low Resolution Image

Sato Laboratory M1, 48-106416, Isarun CHAMVEHA

Abstract

Recognizing the head pose of another person is a natural human ability that poses a great challenge to the computer vision community. Head pose estimation has recently become more popular, with the majority of research working on high-resolution images. Although high-resolution images are common nowadays, the restrictions of many surveillance systems mean there is still high demand for estimation techniques that work on low resolution images. In this paper, we discuss the development of such methods over the years. We also group the methods into categories and discuss the advantages and disadvantages of each.

1 Introduction

With the rapid development of technology, head pose estimation, one of the basic human abilities, has recently become more popular in the computer vision community. A person's head pose conveys a lot of information; for example, it can indicate where the person's attention is directed. Moreover, people turn their head toward the person they are talking to as a nonverbal sign to capture that person's attention and to signal that they are about to speak. This information is useful in various types of applications, as described in the next section.

Most recent systems base the estimation on high-resolution images. While such systems may yield good results thanks to their ability to identify the components of the face, many of them cannot be applied to head poses in low resolution images, in which the components of the face are barely visible. In this paper, we first discuss why head pose estimation from low resolution images is an interesting research topic, then review the existing methods, compare the approaches, and finally summarize the content of the paper.
Figure 1: Ba et al.[2] use head pose estimation to estimate the visual focus of attention in a natural meeting scenario

2 Motivation

Head pose conveys a lot of information; for example, the direction of a person's head can be used to estimate their focus of attention. Head pose can also convey nonverbal cues to other people, and mutual gaze can reveal the point of attention of every person in a room. Head pose estimation is an interesting area of research because of the broad range of research and applications to which it can be applied. Ba et al.[2],[1] use head pose estimation and contextual cues to estimate the visual focus of attention in a meeting scenario (Figure 1). Benfold et al.[4] use head pose estimation and pedestrian tracking to automatically estimate the attention of an automated surveillance system (Figure 2). Smith et al.[16] use head pose estimation to estimate people's visual focus of attention (VFOA), in order to estimate the attention of multiple people looking at an outdoor advertisement.


Table 1: Works on Head Pose Estimation from Low Resolution Images by Category

Approach | Works | Advantages | Disadvantages
Appearance Template | Robertson et al.[14][10]; Benfold et al.[3]; Orozco et al.[11]; Niyogi et al.[9] | Training the model requires only positive samples; adding new samples to the template model is easy | Needs good head localization
Detector Arrays | Zhang et al.[21]; Rowley et al.[15] | Needs no head localization | Training the model requires both negative and positive samples; difficult decision when multiple detectors detect a face frame
Nonlinear Regression | Stiefelhagen et al.[18][17]; Gourier et al.[5] | Training the model requires only positive samples; fast and accurate | Prone to errors from head localization
Tracking | Pappu et al.[13]; Morency et al.[7]; Wu et al.[20] | Results are accurate | Needs initialization; cannot handle large drifts between image frames

Figure 2: Benfold et al.[4] use head pose estimation and pedestrian tracking to automatically estimate the attention of an automated surveillance system

3 Head Pose Estimation from Low Resolution Images Methods

In this section, we review the existing methods, group similar methods together, and discuss the advantages and disadvantages of each group. In the computer vision context, head pose estimation techniques estimate the head pose relative to the camera. Some techniques estimate the head pose in up to three degrees of freedom, usually referred to as yaw, pitch and roll (Figure 3). Although this kind of result is desirable, low resolution images usually do not contain enough information to estimate all three degrees of freedom. Therefore, many works on low resolution images discretize head poses into multiple head pose classes. For example, in Benfold et al.[3], Robertson et al.[14] and Orozco et al.[11] the head poses are discretized into 8 classes (Figure 4).
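The discretization of head pose into classes can be illustrated with a minimal sketch. It assumes eight equal-width yaw bins of 45 degrees with class 0 centred on 0 degrees; the bin layout, the meaning of 0 degrees, and the function names `yaw_to_class` and `class_center_yaw` are assumptions made for illustration and are not taken from the cited works.

```python
def yaw_to_class(yaw_degrees: float, num_classes: int = 8) -> int:
    """Map a yaw angle in degrees to one of num_classes equal-width bins.

    Assumption: class 0 is centred on 0 degrees and class indices grow
    with increasing yaw. The cited works may define their classes differently.
    """
    bin_width = 360.0 / num_classes
    # Shift by half a bin so that each class is centred on a multiple of bin_width.
    shifted = (yaw_degrees + bin_width / 2.0) % 360.0
    # The final modulo guards against a floating-point result equal to 360.0.
    return int(shifted // bin_width) % num_classes


def class_center_yaw(pose_class: int, num_classes: int = 8) -> float:
    """Return the yaw angle (degrees) at the centre of a pose class."""
    return pose_class * (360.0 / num_classes)


if __name__ == "__main__":
    for angle in (0.0, 30.0, 180.0, -45.0):
        print(angle, "->", yaw_to_class(angle))
```

With eight classes, every angle within 22.5 degrees of a class centre falls into that class, which reflects the coarse granularity that low resolution images can realistically support.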
Following the categorization of head pose estimation techniques by Murphy-Chutorian et al.[8], recent work on head pose estimation from low resolution images can be divided into four


main categories, as follows (Table 1).

Appearance Template Method: matches the input against the templates of each head pose class to find the most similar view.
Detector Arrays Method: uses a separate detector for each pose class and assigns the pose of the detector that best matches the current input.
Nonlinear Regression Method: maps the input image space to the pose directions.
Tracking Method: uses the movement between video frames to estimate the pose of the head.

Figure 3: Visual representation of the rotation angles: yaw, pitch and roll

Figure 4: Head pose classes used by Orozco et al.[11]

Figure 5: Illustration for Appearance Template Method[8]

3.1 Appearance Template Method

Appearance template methods use image-based comparison to match the input against examples from each pose class and select the pose that matches the input best as the estimate (Figure 5). Usually, a descriptor is extracted from the image for the comparison. Finding a good descriptor for low resolution images is still a challenging task for appearance template methods. The descriptor is then classified into one of the pose classes using techniques such as a binary search tree[14], randomized ferns[4], or a multi-class Support Vector Machine (SVM)[11].

Robertson et al.[14][10] use a descriptor based on skin color to classify head poses and use the body direction to further limit the candidate pose classes for the input. Benfold et al.[3] improve the method of Robertson et al.[14] by using a descriptor that learns a model of the skin color for each new video without requiring any other cues. This method also uses the randomized ferns technique[12] for training and for classifying head pose data into 8 head pose classes, then uses a Hidden Markov Model to filter out the errors. Orozco et al.[11] state that the skin descriptor in [14] is not applicable in some situations, so they use a descriptor based on Similarity Distance instead.
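The matching step of an appearance template method can be sketched as a nearest-template search. This is a minimal illustration only: it assumes descriptors are plain numeric vectors compared by Euclidean distance, whereas the cited works use skin-color or Similarity Distance descriptors with tree, fern, or SVM classifiers. The function name `classify_pose` and the toy data are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def classify_pose(descriptor: Sequence[float],
                  templates: Dict[int, List[Sequence[float]]]) -> int:
    """Return the pose class whose stored example is closest to the descriptor.

    templates maps a pose class index to the example descriptors for that class.
    Only positive examples are needed; no negative samples are involved.
    """
    best_class = None
    best_distance = math.inf
    for pose_class, examples in templates.items():
        for example in examples:
            distance = math.dist(descriptor, example)  # Euclidean distance
            if distance < best_distance:
                best_class, best_distance = pose_class, distance
    if best_class is None:
        raise ValueError("the template set is empty")
    return best_class


if __name__ == "__main__":
    # Toy two-dimensional descriptors, purely for illustration.
    templates = {0: [[0.0, 0.0], [0.1, 0.0]], 1: [[1.0, 1.0]]}
    print(classify_pose([0.9, 0.8], templates))
    # Extending the model is just a matter of storing another example.
    templates[2] = [[5.0, 5.0]]
    print(classify_pose([4.0, 4.0], templates))
```

Because the model is only a collection of stored examples, a new pose class or a new sample can be added without retraining anything, which is the property the survey lists as an advantage of this category.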
The methods in this category have the advantage that the model can be trained using only positive samples, without the need for negative samples. Furthermore, new samples can be added to the template model at any time. These methods also generally work well with very low resolution images. However, they usually require good localization of the head image, and localizing low resolution head images is not a trivial task, so good localization techniques are needed for these methods.

3.2 Detector Arrays Method

Detector arrays methods use techniques similar to face detection, which is well developed and has been very successful. These methods use a separate face detector for each pose class, then assign the pose of the detector with the greatest score (Figure 6). Zhang et al.[21] use FloatBoost classifiers, a variant of AdaBoost[19], to classify head poses

into 5 classes. These methods have the advantage that they do not require head localization techniques, because the detectors are trained to detect the head in each pose. However, training the detectors is burdensome because it requires a lot of training data, including both positive and negative samples. A problem also arises when multiple detectors simultaneously detect the same window: it is then hard to determine which pose the input should be assigned to. Rowley et al.[15] suggest a solution using a router method, which first assumes that the window contains a face and determines its orientation, then rotates the face and uses a face detector to confirm the existence of the face. The detector can therefore be trained on only a small range of rotation; in this work, the detector is trained within the range -10 to

Figure 6: Illustration for Detector Arrays Method[8]

3.3 Nonlinear Regression Method

Nonlinear regression methods map the input image space to the pose directions. Usually the training data is labeled with discrete or continuous angles. Nonlinear regression methods mostly use neural network approaches (Figure 7). Stiefelhagen et al.[18][17] use separate neural networks to estimate the pan and tilt angles of the head pose. Gourier et al.[5] use linear auto-associative neural networks to estimate the head pose. These approaches have the advantages of being fast and of requiring only positive samples for training, but they also require good localization of the head and are prone to errors from bad localization.

Figure 7: Illustration for Nonlinear Regression Method[8]

3.4 Tracking methods

Tracking methods utilize the movement of the head image between frames to estimate the head pose.
These methods usually require proper initialization of the head image (Figure 8). Pappu et al.[13] generate synthetic views offline using head tracking methods, then find candidate head poses among the generated views and select the pose that minimizes the difference in appearance between the input and the generated head poses. Morency et al.[7] use stereo-motion based head tracking, which combines depth and brightness gradient tracking with initialization and stabilization modules to produce a system that is robust to strong illumination changes. Wu et al.[20] build an ellipsoidal model of points, where each point maintains probability density functions of local image features, and use maximum a posteriori (MAP) estimation to estimate the pose. This technique is robust to illumination changes because it uses a local edge density feature that is independent of person and illumination.

These methods have the advantage that the results are quite accurate, thanks to their ability to track small drifts between images. However, they first require initialization of the head to be tracked, which means the face must be in some known pose in order to initialize the system. They also cannot handle large drifts between image frames, which makes them unusable in some systems.
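The two limitations of tracking methods, the need for an initial pose and the sensitivity to large inter-frame changes, can be illustrated with a toy sketch. It assumes that some tracker already supplies a per-frame yaw change in degrees; the function name `track_yaw` and the threshold `max_delta` are hypothetical, and the sketch does not reproduce any of the cited trackers.

```python
from typing import Iterable, List, Tuple


def track_yaw(initial_yaw: float,
              frame_deltas: Iterable[float],
              max_delta: float = 15.0) -> Tuple[List[float], List[int]]:
    """Accumulate per-frame yaw changes starting from a known initial pose.

    Returns the yaw estimate after every frame (including the initial one)
    and the 1-based indices of frames whose change exceeded max_delta.
    For those frames the previous estimate is kept, since a large jump
    violates the small-motion assumption of the tracker.
    """
    yaw = initial_yaw % 360.0
    trajectory = [yaw]
    unreliable_frames = []
    for index, delta in enumerate(frame_deltas, start=1):
        if abs(delta) > max_delta:
            unreliable_frames.append(index)  # too large a jump: do not integrate it
        else:
            yaw = (yaw + delta) % 360.0
        trajectory.append(yaw)
    return trajectory, unreliable_frames


if __name__ == "__main__":
    trajectory, unreliable = track_yaw(0.0, [5.0, 5.0, 40.0, -3.0])
    print(trajectory, unreliable)
```

Without a value for `initial_yaw` the accumulated estimate has no reference, and once a large jump is skipped every later estimate inherits the error, which is why such systems depend on initialization and on smooth video.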

Figure 8: Illustration for Tracking Method[8]

4 Head Pose Estimation Comparisons

4.1 Ground Truth Data Sets

There are currently no publicly available datasets for head pose estimation on low resolution images. The ground truth dataset for each work is mostly taken from low resolution video, with the direction labeled by hand. Orozco et al.[11] use 800 manually cropped head images, 100 for each pose class, from the i-LIDS[6] dataset (Figure 9). Gourier et al.[5] downsample images from the Pointing'04 dataset to low resolution images of dimension 23x30. In Robertson et al.[14][10], the ground truth was produced by a human user drawing the line-of-sight on the images.

Figure 9: Sample head image from i-LIDS[6] used by Orozco et al.[11]. The images are of low resolution and subject to directional lighting changes.

4.2 Comparison of Published Results

Table 2: Mean Angular Errors and Correct Classification Rates of Each Work

Approach | Publication | Mean Error (Yaw / Pitch) | Classification Rate | Dimension | Pose Classes
Appearance Template | Robertson et al.[14][10] | | | x20 | 8
Appearance Template | Benfold et al.[3] | | | x10 | 8
Appearance Template | Orozco et al.[11] | | % | 20x20 | 8
Appearance Template | Niyogi et al.[9] | | % | 40x30 | 15
Detector Arrays | Zhang et al.[21] | | % | - | 5
Detector Arrays | Rowley et al.[15] | | % | 20x20 | 36
Nonlinear Regression | Stiefelhagen et al.[17] | | % / 66.3% | - | Continuous
Nonlinear Regression | Gourier et al.[5] | | % / 45% | 23x30 | 22
Tracking | Wu et al.[20] | | | x32 | -

Although there is no public dataset for head pose estimation on low resolution images, we can compare the methods through the published results of each paper (Table 2). From the results, we can see that the classification rates of the appearance template methods and the detector arrays methods are quite high. The recovery of head pose under changes in roll seems to be already robust with the method of Rowley et al.[15], even with as many as 36 pose classes. For the yaw angle, Orozco et al.[11] and Zhang et al.[21] recover the angle with decent accuracy using 8 and 5 pose classes respectively.
The nonlinear regression methods can recover both yaw and pitch angles with low mean errors, using continuous output or a high number of pose classes. The next section discusses the applicability of these methods.
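The two measures reported in Table 2 can be written down as a short sketch. It assumes angles are given in degrees and that the angular error is measured along the shorter arc of the circle; the function names are hypothetical and the numbers in the example are made up for illustration, not taken from any of the cited papers.

```python
from typing import Sequence


def angular_difference(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def mean_angular_error(predicted: Sequence[float], truth: Sequence[float]) -> float:
    """Mean of the per-sample angular differences, in degrees."""
    if len(predicted) != len(truth) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(angular_difference(p, t) for p, t in zip(predicted, truth)) / len(truth)


def classification_rate(predicted: Sequence[int], truth: Sequence[int]) -> float:
    """Fraction of samples whose predicted pose class equals the true class."""
    if len(predicted) != len(truth) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


if __name__ == "__main__":
    print(mean_angular_error([10.0, 350.0], [20.0, 10.0]))
    print(classification_rate([0, 1, 2, 3], [0, 1, 2, 4]))
```

Mean angular error suits methods with continuous output, while classification rate suits methods that predict discrete pose classes, so the two columns of Table 2 are not directly interchangeable across approaches.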


4.3 Comparison of Real-World Applicability

From the viewpoint of real-world applicability, some works may yield better results than others but rely on constraints or assumptions that prevent them from being used in some situations. This section discusses the limitations of the approaches proposed by various works (Table 3). The assumptions commonly made in head pose estimation from low resolution images can be categorized as follows.

A Continuous video assumption: assumes a continuous video stream with only small pose changes between frames. Movement between frames should be minimal, since abrupt movement makes the system unstable.
B Initialization assumption: assumes that the subject's head is initialized to a known pose before the system can operate.
C Single DOF assumption: assumes that the head rotates around only one axis.

The real-world applicability of recent works on head pose estimation from low resolution videos is as follows.

1. Appearance Template Methods are used with very low resolution images without any assumptions about the head rotation, which means they can estimate head rotation at any angle. These methods should work well in various applications with low resolution images in unconstrained environments. However, they quantize the head pose, relative to the camera, into a fixed number of head pose classes.
2. Detector Arrays Methods are usually used with head images that rotate in one degree of freedom (DOF). These methods are, however, robust against head localization errors, so they can be applied easily without having to develop robust head localization techniques.
3. Nonlinear Regression Methods, like appearance template methods, have no constraints on head rotation angles. Although they have only been applied to 2 degrees of freedom, they recover the angle for each degree of freedom well.
4. Tracking Methods perform well in various applications. However, they require initialization and a smooth video without large drifts in order to perform well.

Figure 10: Sample output by Rowley et al.[15]; the method utilizes face detection techniques as one of the cores of the system

5 Research opportunities

There are still many research opportunities in head pose estimation from low resolution images. The opportunities for each category of methods are as follows.

5.1 Appearance Template Method

There is a lot of room for improvement in appearance template methods. The current descriptors for appearance templates can still be developed further. Also, as previously stated, these methods require precise head localization to work well, so more precise head localization algorithms, or approaches that are robust against localization errors, are also interesting topics for research.

5.2 Detector Arrays Method

Detector arrays methods use face detection techniques as the core of the approach. As face detection methods are already quite stable, there is not much room for further research (Figure 10). Still, these methods could be combined with other methods in order to increase the efficiency of the estimation.

5.3 Nonlinear Regression Method

Nonlinear regression methods perform quite well compared to other methods and are a very interesting research direction. These techniques also require precise head localization.

5.4 Tracking Method

Tracking methods could be developed further in order to make initialization automatic or to impose fewer constraints. Also, some tracking methods assume the

continuous images to be robust, with no drift or significant abnormalities between image frames. These assumptions do not hold in many applications, so there are research opportunities in solving these problems.

Table 3: Real-World Applicability for Works on Head Pose Estimation from Low Resolution Image

Published Date | Work | Approach | Assumptions | DOF | Pose Classes
1996 | Niyogi et al.[9] | Appearance Template | | |
 | Rowley et al.[15] | Detector Arrays | C | |
 | Pappu et al.[13] | Tracking | B | |
 | Wu et al.[20] | Tracking | A,B | |
 | Stiefelhagen et al.[18],[17] | Nonlinear Regression | - | 2 | Continuous
2002 | Morency et al.[7] | Tracking | B | 3 | Continuous
2005 | Robertson et al.[10],[14] | Appearance Template | | |
 | Gourier et al.[5] | Nonlinear Regression | | |
 | Zhang et al.[21] | Detector Arrays | C | |
 | Benfold et al.[3] | Appearance Template | | |
 | Orozco et al.[11] | Appearance Template | | |

6 Summary and Conclusion

Head pose estimation in low resolution images is a research field that has recently become popular, and there is still a lot of room for improvement. This paper provides an overview of the methods for head pose estimation from low resolution images and points out possible opportunities for improving upcoming techniques.

References

[1] S. Ba and J. Odobez. Multi-person visual focus of attention from head pose and meeting contextual cues. Pattern Analysis and Machine Intelligence, IEEE Transactions on, PP(99):1-1,
[2] S. Ba and J.-M. Odobez. Recognizing visual focus of attention from head pose in natural meetings. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 39(1):16-33, February
[3] B. Benfold and I. Reid. Colour invariant head pose classification in low resolution video
[4] B. Benfold and I. Reid. Guiding visual surveillance by tracking human attention
[5] N. Gourier, J. Maisonnasse, D. Hall, and J. L. Crowley. Head pose estimation on low resolution images. In CLEAR 2006, South,
[6] i-LIDS Team.
Imagery library for intelligent detection systems (i-LIDS); a standard for testing video based detection systems. pages 75-80, October
[7] L.-P. Morency, A. Rahimi, N. Checka, and T. Darrell. Fast stereo-based head tracking for interactive environments. pages , May
[8] E. Murphy-Chutorian and M. Trivedi. Head pose estimation in computer vision: A survey. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(4): , April
[9] S. Niyogi and W. Freeman. Example-based head tracking. pages , October
[10] N. M. Robertson, I. Reid, and J. Brady. What are you looking at? Gaze estimation in medium-scale images. In Proc. HAREM05, 16th British Machine Vision Conference, Oxford, September 2005,
[11] J. Orozco, S. Gong, and T. Xiang. Head pose classification in crowded scenes
[12] M. Ozuysal, P. Fua, and V. Lepetit. Fast keypoint recognition in ten lines of code. pages 1-8, June
[13] R. Pappu and P. A. Beardsley. A qualitative approach to classifying gaze direction. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, pages ,
[14] N. Robertson and I. Reid. Estimating gaze direction from low-resolution faces in video. In Proceedings of the 9th European Conference on Computer Vision, volume 3952/2006, pages ,

[15] H. Rowley, S. Baluja, and T. Kanade. Rotation invariant neural network-based face detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, June
[16] K. Smith, S. Ba, J.-M. Odobez, and D. Gatica-Perez. Tracking the visual focus of attention for a varying number of wandering people. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(7): , July
[17] R. Stiefelhagen. Estimating head pose with neural networks: results on the Pointing04 ICPR workshop evaluation data. In Proceedings of the ICPR Workshop on Visual Observation of Deictic Gestures,
[18] R. Stiefelhagen, J. Yang, and A. Waibel. Modeling focus of attention for meeting indexing based on multiple cues. IEEE Transactions on Neural Networks, 13: ,
[19] P. Viola and M. Jones. Robust real-time object detection. In International Journal of Computer Vision,
[20] Y. Wu and K. Toyama. Wide-range, person- and illumination-insensitive head orientation estimation. Automatic Face and Gesture Recognition, IEEE International Conference on, 0:183,
[21] Z. Zhang, Y. Hu, M. Liu, and T. Huang. Head pose estimation in seminar room using multi view face detectors. In CLEAR 06: Proceedings of the 1st international evaluation conference on Classification of events, activities and relationships, pages , Berlin, Heidelberg, Springer-Verlag.
