On the Design and Evaluation of Robust Head Pose for Visual User Interfaces: Algorithms, Databases, and Comparisons


On the Design and Evaluation of Robust Head Pose for Visual User Interfaces: Algorithms, Databases, and Comparisons

Sujitha Martin, Shinko Y. Cheng, Ashish Tawari, Mohan Trivedi, Erik Murphy-Chutorian

ABSTRACT

An important goal in automotive user interface research is to predict a user's reactions and behaviors in a driving environment. The behavior of both drivers and passengers can be studied by analyzing eye gaze; head, hand, and foot movement; upper body posture; etc. In this paper, we focus on estimating head pose, which has been shown to be a good predictor of driver intent and a good proxy for gaze estimation, and provide a valuable head pose database for future comparative studies. Most existing head pose estimation algorithms still struggle under large spatial head turns. Our method, however, relies on facial features that remain visible even during large spatial head turns to estimate head pose. The method is evaluated on the LISA-P Head Pose database, which contains head pose data from on-road daytime and nighttime drivers of varying age, race, and gender; ground truth for head pose is provided by a motion capture system. With special regard to eye gaze estimation for automotive user interface studies, the automatic head pose estimation technique presented in this paper can replace previous eye gaze estimation methods that rely on manual data annotation, or be used in conjunction with them when necessary.

Dr. Murphy-Chutorian is currently with Google Inc. Dr. Cheng is currently with HRL Laboratories LLC.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. AutomotiveUI '12, October 17-19, Portsmouth, NH, USA. Copyright (c) 2012 ACM ... $15.00

Categories and Subject Descriptors: I.4.6 [Image Processing and Computer Vision]: Segmentation - Edge and Feature Detection; I.5.1 [Pattern Recognition]: Models - Structural

General Terms: Algorithms, Experimentation, Performance

Keywords: Database; Facial features; Head pose

1. INTRODUCTION

Safety of the driver and those in the vicinity is highly dependent on the driver's awareness of the changing driving environment. The driver has to be attentive to elements outside the vehicle, such as other vehicles in the surround, traffic lights, and pedestrians, as well as elements inside the vehicle, such as navigation tools, communication modules, and driver assistance systems. In such a visually demanding environment, an efficient Visual User Interface (VUI) for driver assistance is crucial. An intelligent VUI can monitor the visual attention of the driver operating the vehicle and can provide assistance when the driver's attention is distracted from the primary task of safe driving.

Monitoring a driver's focus of attention can be achieved by analyzing the driver's physical pose and motion; for example, the driver's upper body posture, head pose, foot movement, and eye gaze. In this paper we focus on head pose, which plays a vital role in driver intent prediction [8], [4] and serves as a proxy for gaze estimation. While gaze can be estimated accurately with eye localization, it is more practical in a driving environment to do so using head pose. A major reason is that eye localization requires careful integration and calibration of multiple high-resolution cameras around the driver in order to accommodate large spatial head turns. Additionally, eye-occluding objects can negatively affect gaze estimation when using eye localization.

As stated in the conclusion of a study of visual distraction on driving performance [29]: "...the more severe distractions, that last for a longer duration or are directed further away from forward, are more likely to involve a head pose component. Given some of the limitations associated with eye-based measures (e.g., eyes sometimes occluded by eyewear), optimizing a head-based measure should become a goal."

Figure 1: A block diagram of our proposed approach. It is a two-step approach: extracting features from an image and using corresponding feature points on a 3D generic-face model to estimate head pose.

To this end, we introduce a database containing video sequences of various head poses from on-road driving environments. To the best of our knowledge, this is a unique database with continuous head pose measurements along with manually annotated facial landmarks in a video sequence. This database will help researchers in the automotive field develop and evaluate methods related to head dynamics and analysis. We additionally present a hybrid approach (Fig. 1) of detecting prominent facial features and using corresponding points on a 3D generic-face model to estimate head pose. Evaluation of this method is done using both manual feature annotation and automatic feature detection.

The remainder of the paper is organized as follows. Section 2 provides studies related to VUIs. Section 3 describes the LISA-P Head Pose database. Section 4 provides details on the proposed hybrid head pose approach, followed by its evaluation on the LISA-P Head Pose database in Section 5. Finally, Section 6 concludes the discussion with future directions.

2. RELATED STUDIES

In this section, we present studies related to head pose estimation applications and algorithms, and available databases for evaluating head pose estimation algorithms, in order to show the contributions we make with our work. In the automotive field, head pose estimation has been shown to be a good predictor of driver intent and a good proxy for gaze estimation.
For the former case, some contributions include driver lane change intent prediction [19] and turn intent prediction [4]. Interestingly, the lane change intent work by McCall et al. [19] was used by Doshi et al. [8] to determine the relative usefulness of eye gaze and head dynamics for determining intent. While the lane change intent work uses block matching for head pose estimation, which is further detailed in [15], the turn intent prediction system uses a commercial product. This commercial motion-capture system uses retroreflective markers, placed on the subject's head, and kinematics software to compute head pose. Even though markers give precise head pose estimates, many researchers are inclined towards non-intrusive measures in vehicular applications. One such application is a forward collision warning system in which a driver's head orientation is determined using shape features and an ellipsoidal face model. In another application, robust monocular head pose estimators were developed to infer a driver's focus of attention in a real car environment [22]. Although the method in [22] works well in the range from -90° to +90° in yaw, where 0° refers to the driver looking straight at the camera, it needs an initial estimate of the head position and orientation, which is used to generate a texture-mapped 3D model of the head. HyHOPE [20], on the other hand, has automatic initialization, but its tracking becomes less reliable as the head approaches ±90° in yaw. These and other head pose estimation algorithms support head gesture analyzers like OHMeGA [17], [18], which analyzes head gestures as viewed from a front-facing monocular camera, using mostly optical flow and sometimes head pose to remove uncertainty in spatial head fixation locations.

Just as head pose estimation plays an important role in driver intent prediction systems, it also plays an important role as a proxy for gaze estimation. Many, like Valenti et al. [26], emphasize the importance of head pose for gaze estimation in the presence of non-frontal faces and introduce hybrid schemes combining head pose and eye location for gaze estimation. Other works, although they do not state it explicitly, compute eye-gaze estimation with respect to head pose, either manually or automatically. In one application, a framework using eye-gaze estimation along with road scene understanding (i.e., sign and pedestrian detection) has been used to give feedback to the driver when a sign was missed [9]. This system, however, uses an off-the-shelf stereo-camera-based eye gaze tracker, called faceLAB, that requires manual calibration for individual drivers. faceLAB is also used in [10] to develop a road scene monotony detector, which gives context awareness to fatigue detection tools. While some studies use automatic eye-gaze trackers, others rely on manual eye gaze annotation with respect to head pose. In [6], Curin et al. studied the effects of dictation and short text editing while driving; road attention was evaluated using the direction and fixation of eye gaze, as annotated manually. Similarly, Christiansen et al. [5] studied drivers' reactions to warning systems by monitoring the driver's eyes.

In regard to the head pose algorithm discussed in this work, other head pose algorithms using feature points are of particular interest. Work that uses feature points to estimate head pose falls into two categories: geometric methods and flexible models. Most geometric methods require 5-6 feature points and, in addition, one of the following: known camera parameters [13], face symmetry using lip corners and both eyes [11], parallel lines between eye corners and between lip corners [27], or the anatomical structure of the eye and person-specific parameters [3]. Unlike our method, which can work with any set of distinct features on the face, these methods require a particular set of features to be visible, i.e., not occluded. The same is true for many flexible face model based head pose methods [28]. In [14], Ishikawa et al. describe driver gaze tracking using a global head model, specifically an Active Appearance Model (AAM), to track the head. These and other head pose estimation algorithms are further explored in recent surveys [21], [12].

Many databases are available for the evaluation of head pose estimation algorithms, from images to video sequences,

from far-field to near-field camera views. CMU PIE is one of many well-known databases containing facial images of different people under different illuminations, across multiple poses, and with varying expressions [24]. Another interesting database is the IDIAP Head Pose Database [1] of video sequences showing one or more heads of participants engaged in natural activities (i.e., meetings, debates), with head pose measured using magnetic trackers called the Flock of Birds system. This magnetic tracker system is also used for ground truth head pose measurements in the BU Face Tracking database [16], which contains two sets of video sequences, one collected under uniform illumination and one under varying illumination, with free head motion of several subjects. Unlike our database, none of the above-mentioned databases has on-road drivers as subjects or additionally provides manually annotated facial features. More information on other databases can be found in [21].

3. LISA-P HEAD POSE DATABASE

To the best of our knowledge, there exists no public database that provides a collection of video sequences of head movements from on-road driving, with continuous ground truth measurements and manually annotated facial features. Providing such a database will allow researchers to have a common ground on which to compare future head pose estimation algorithms and head behavior studies. The LISA-P Head Pose Database will soon be released to the public at cvrr.ucsd.edu/lisa/datasets.html.

3.1 On-Road Driving

All head pose data collected in the on-road driving environment is natural and spontaneous. There are 14 video sequences of drivers of varying age, ethnicity, and gender, and the drives took place both in the daytime and at night. The subjects have markers and cameras relevant to a motion capture system on the back of their head and a few feet behind them, respectively, to capture ground truth head pose. Additionally, a front-facing monocular camera, located on the dashboard, records their head movements; its view of the driver is as shown in Fig. 2.

Figure 2: Video sequence taken from on-road driving in the LISA-P Head Pose database. This sequence shows the driver performing spatially large head turns.

The video sequences were collected at a frame rate of 30 frames per second, with a 640x480 pixel resolution. Since the on-road driving environment is uncontrolled, the distribution of head orientation is not spatially equal. Fig. 3 shows a definite bias towards the driver looking straight most of the time, which is to be expected in on-road driving. Nonetheless, the subjects do perform spatially large head turns, as shown in Fig. 2.

Figure 3: A normalized histogram of all head pose orientations collected from on-road driving of the LISA-P Head Pose database, separated by the three independent axes of rotation.

Table 1: A summary of the accumulated head pose data over all 14 video sequences of on-road driving (number of frames per subject and yaw/pitch rotation-angle ranges, with average, STD, min, and max statistics).

Table 1 gives a summary of the accumulated head pose data over all 14 video sequences. There are close to frames, and the range of yaw/pitch rotation angles achieved differs from driver to driver. Even though a large portion consists of looking straight, the drivers do perform spatially large head turns, as indicated by the min/max statistics.

3.2 Facial Feature Annotation

In addition to providing a database with a continuous view of the driver's head and ground truth for head pose, the database contains annotations of 7 prominent facial features. As shown in Fig. 4, the features are the eye corners, nose tip, and nose corners. These features are chosen because they conform to rigid-body motion constraints and are least sensitive to facial expression changes. Current annotations of these feature points exist for two video sequences and are performed once every 5 or 10 frames; the annotation frequency increases when there are interesting head movements. The chosen features have two more useful traits: first, not all of them are occluded during large spatial head turns, and second, they have a higher probability of being uniquely detected in video blurred by fast head movements. Fig. 5 shows a histogram of the number of features annotated in the dataset; it shows that most of the time, the chosen prominent features are visible, i.e., not occluded, in the image plane.

Figure 4: Illustration of the 7 prominent features as annotated in the database: eye corners, nose tip, and nose corners. Annotations are shown using a real image from the dataset (left) and its corresponding points on a 3D generic-face model [2] (right).

4. OUR APPROACH

As shown in the block diagram of our proposed approach (Fig. 1), the framework consists of two steps. First, at least four prominent facial features are extracted from the detected face region in a monocular 2D image. In the next step, the detected facial features and their correspondences on a 3D generic-face model [2] are used to estimate the head orientation. Specifically, we use the Pose from Orthography and Scaling (POS) algorithm [7] to find the rotation matrix and the translation vector such that, when applied to the 3D face model and projected onto a 2D plane, they yield a representation of the 2D face image of interest. The details of each step are provided in the following sections.

4.1 ROI Detection

Feature detection has been an ongoing effort for many decades, and many algorithms exist in the literature to extract features robustly and accurately. We use the Constrained Local Model (CLM) [23], a model-based technique for deformable object fitting. It utilizes a parametrized shape model and an ensemble of local feature detectors to jointly detect and track facial features. A typical CLM model fitting uses 66 landmarks. The fitting assumes that some reference model is known a priori, and an affine transformation is used to align the input image with the reference model. Overall, the process takes an image and, if available, an initial estimate of landmark positions on the image, and outputs the optimal landmark positions for the current image.
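The affine alignment used during model fitting — registering detected landmark positions to a reference shape — can be sketched as a simple least-squares fit. The snippet below is an illustrative assumption (a generic 2D affine fit in NumPy), not the CLM implementation of [23]; the point sets in the comments are made up:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2D affine transform mapping src points to dst points.

    src, dst: (n, 2) arrays of corresponding landmark coordinates, n >= 3
    (with at least 3 points not on a single line).
    Returns a 2x3 matrix A such that dst ~= [src | 1] @ A.T
    """
    n = src.shape[0]
    X = np.hstack([src, np.ones((n, 1))])        # homogeneous source coords
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)  # solves X @ A = dst in LSQ sense
    return A.T                                   # 2x3 affine matrix

def apply_affine(A, pts):
    """Apply a 2x3 affine matrix A to an (n, 2) array of points."""
    return np.hstack([pts, np.ones((pts.shape[0], 1))]) @ A.T
```

In a CLM-style pipeline, `src` would be the current landmark estimates and `dst` the reference shape (or vice versa); the recovered transform normalizes the face for the local feature detectors.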
4.2 POS Algorithm to Head Pose

The POS algorithm [7] approximates perspective projection with a scaled orthographic projection to describe the relationship between points on a 3D face model and their projections onto the image plane. Although the POSIT algorithm [7] (POS with ITerations) is designed to compute a better scaled orthographic projection of the feature points on the 3D face model onto the image plane at every iteration, it requires knowledge of an intrinsic camera parameter, the focal length. Therefore, in our current implementation, we use the POS algorithm. The POS algorithm assumes that the distribution of the features on the object, as well as the image of these points under perspective projection, is known. Approximating the perspective projection with a scaled orthographic projection, the algorithm efficiently estimates the rotation matrix by solving a linear system of equations. Details and geometric proofs regarding the POS algorithm can be found in [7].

From the obtained rotation matrix, the pitch (ψ), yaw (θ), and roll (φ) rotation angles in the Euler coordinate system are computed as follows [25]. We assume that

                              | R11 R12 R13 |
    R = Rz(φ) Ry(θ) Rx(ψ) =   | R21 R22 R23 |
                              | R31 R32 R33 |

and compute

    θ = -sin⁻¹(R31)
    ψ = tan⁻¹(R32 / R33)
    φ = tan⁻¹(R21 / R11)

Figure 5: A normalized histogram of annotated facial feature points from on-road driving in the LISA-P Head Pose database. Feature acronyms: L=Left, R=Right, E=Eye, N=Nose, C=Corner, T=Tip.

5. RESULTS AND DISCUSSION

Our approach is evaluated using the on-road driving video sequences of the LISA-P Head Pose database. The evaluation is conducted in two parts. The first uses the manually annotated feature points of two video sequences in the database for head pose estimation. The second uses CLM for fully automatic feature extraction.
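The POS computation and the Euler-angle extraction described in Section 4.2 can be sketched in a few lines of NumPy. This is a minimal illustration written from the descriptions in [7] and [25], not the authors' implementation; the angle convention follows R = Rz(φ)Ry(θ)Rx(ψ), and any coordinates fed to these functions in practice would come from the detected features and the 3D face model:

```python
import numpy as np

def pos(model_pts, image_pts):
    """Pose from Orthography and Scaling (POS), after DeMenthon and Davis [7].

    model_pts: (n, 3) feature coordinates on the 3D face model
               (n >= 4, non-coplanar); image_pts: (n, 2) matching 2D points.
    Returns (R, s): the 3x3 rotation matrix and the scale factor s = f/Z0.
    """
    M = model_pts - model_pts[0]    # vectors from the reference feature point
    b = image_pts - image_pts[0]    # image-plane offsets from its projection
    # Scaled orthography gives two linear systems: M @ I = bx and M @ J = by,
    # where I and J are the first two rows of R scaled by s.
    I, *_ = np.linalg.lstsq(M, b[:, 0], rcond=None)
    J, *_ = np.linalg.lstsq(M, b[:, 1], rcond=None)
    s = (np.linalg.norm(I) + np.linalg.norm(J)) / 2.0
    i, j = I / np.linalg.norm(I), J / np.linalg.norm(J)
    k = np.cross(i, j)              # third row completes the rotation
    j = np.cross(k, i)              # re-orthogonalize against noise
    return np.vstack([i, j, k]), s

def euler_zyx(R):
    """Pitch, yaw, roll (psi, theta, phi) from R = Rz(phi) Ry(theta) Rx(psi)."""
    theta = -np.arcsin(R[2, 0])           # yaw:   R31 = -sin(theta)
    psi = np.arctan2(R[2, 1], R[2, 2])    # pitch: R32/R33 = tan(psi)
    phi = np.arctan2(R[1, 0], R[0, 0])    # roll:  R21/R11 = tan(phi)
    return psi, theta, phi

# Illustrative usage (coordinates are hypothetical, not from the model of [2]):
#   R, s = pos(model_landmarks_3d, detected_landmarks_2d)
#   pitch, yaw, roll = np.degrees(euler_zyx(R))
```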
The manually annotated features from the LISA-P Head Pose database are available for two video sequences, and automatic feature extraction using CLM is run on four subjects from the database; both are used to evaluate our approach. In Fig. 6, the red curve and the green curve show the results of using manual feature annotation and CLM feature extraction, respectively, to compute head pose for one of the subjects in the database.

Figure 6: Ground truth head pose measurements (blue), head pose estimation using manually annotated facial features and the POS algorithm (red), and head pose estimation using CLM for feature extraction and the POS algorithm (green). The data for this evaluation is taken from on-road driving.

The figure shows that our approach follows the ground truth head pose well. The standard deviation (STD) errors of head pose estimation using manual feature annotation and automatic feature extraction are given in Table 2. The presented STD values were computed when the deviation of the head pose estimation was within 30° of the ground truth head pose; this resulted in discarding 5% and 22% of the frames when using manually annotated features and CLM features, respectively. The reported error can be attributed to the fact that the mean 3D face model differs from the actual 3D configuration of the tracked facial features. Furthermore, manual annotations are often not consistent. As for CLM, the deviations occurred mainly because CLM allows for any facial orientation in the image plane; if further constraints on the location and possible orientations of the face are used, we can expect improved performance.

Table 2: In evaluating our approach, feature detection was done using manual annotation and CLM. Standard deviation error in the three independent axes of rotation is given. The data for this evaluation is taken from real-car driving.

    Approach: Manual feature annotation + POS; CLM feature extraction + POS
    Pitch STD 6.0; Yaw STD 5.0; Roll STD

Taking head pose estimation using manual feature annotation as a baseline for the best possible case with the POS algorithm, using CLM feature extraction for head pose estimation shows promising results for automating our approach.

6. CONCLUDING REMARKS

In this paper we have introduced a valuable database containing video sequences of various head poses in an on-road driving environment.
Additionally, ground truth for head pose is provided with the help of a motion capture system, and manual facial feature annotation is provided for up to 7 prominent features. We have also introduced an approach to estimate head pose during large spatial head turns, where some, but not all, facial features are self-occluded, and during varying speeds of head movement. Evaluation of our approach on the introduced database shows an average of 7° standard deviation error. Going forward, the 3D generic-face model used for the POS algorithm should be learned online to better fit the user's face and thus produce better head pose estimates. Another direction of future work is to find an optimal way to track a minimal number of good features such that head pose can be estimated with high precision over large spatial deviations.

7. ACKNOWLEDGMENTS

This research is supported in part by projects sponsored by the University of California Discovery program, the National Science Foundation, and the Electronics Research Laboratory of Volkswagen of America. The authors would like to thank Amruta Trivedi and Aaron Briones for helping with annotation of the facial features in our database, and the members of our laboratory for their assistance.

8. REFERENCES

[1] S. Ba and J.-M. Odobez. Evaluation of multiple cue head pose estimation algorithms in natural environements. In IEEE International Conference on Multimedia and Expo (ICME).
[2] J. Busby. 3D head scan.
[3] J. Chen and Q. Ji. 3D gaze estimation with a single camera without IR illumination. In International Conference on Pattern Recognition (ICPR), pages 1-4.
[4] S. Cheng and M. Trivedi. Turn-intent analysis using body pose for intelligent driver assistance. IEEE Pervasive Computing, 5(4):28-37.
[5] L. H. Christiansen, N. Y. Frederiksen, A. Ranch, and M. B. Skov. Investigating the effects of an advance warning in-vehicle system on behavior and attention in controlled driving. In Proceedings of the 3rd International Conference on Automotive User Interfaces and Interactive Vehicular Applications.
[6] J. Curin, M. Labsky, T. Macek, and J. Kleindienst. Dictating and editing short texts while driving: Distraction and task completion. In Proceedings of the 3rd International Conference on Automotive User Interfaces and Interactive Vehicular Applications (AutomotiveUI '11), pages 13-20.
[7] D. F. Dementhon and L. S. Davis. Model-based object pose in 25 lines of code. International Journal of Computer Vision, 15.
[8] A. Doshi and M. Trivedi. On the roles of eye gaze and head dynamics in predicting driver's intent to change lanes. IEEE Transactions on Intelligent Transportation Systems, 10(3).
[9] L. Fletcher, L. Petersson, N. Barnes, D. Austin, and A. Zelinsky. A sign reading driver assistance system using eye gaze. In Proceedings of the 2005 IEEE International Conference on Robotics and Automation (ICRA).
[10] L. Fletcher, L. Petersson, and A. Zelinsky. Road scene monotony detection in a fatigue management driver assistance system. In IEEE Intelligent Vehicles Symposium.
[11] A. Gee and R. Cipolla. Determining the gaze of faces in images. Image and Vision Computing, 12(10).
[12] D. Hansen and Q. Ji. In the eye of the beholder: A survey of models for eyes and gaze. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(3).
[13] T. Horprasert, Y. Yacoob, and L. Davis. Computing 3-D head orientation from a monocular image sequence. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition.
[14] T. Ishikawa, S. Baker, I. Matthews, and T. Kanade. Passive driver gaze tracking with active appearance models. In Proceedings of the 11th World Congress on Intelligent Transportation Systems.
[15] J. Jain and A. Jain. Displacement measurement and its application in interframe image coding. IEEE Transactions on Communications, 29(12).
[16] M. La Cascia and S. Sclaroff. Fast, reliable head tracking under varying illumination. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1.
[17] S. Martin, C. Tran, A. Tawari, J. Kwan, and M. M. Trivedi. Optical flow based head movement and gesture analyzer (OHMeGA). In 21st International Conference on Pattern Recognition (ICPR).
[18] S. Martin, C. Tran, and M. M. Trivedi. Optical flow based head movement and gesture analysis in automotive environment. In IEEE International Conference on Intelligent Transportation Systems (ITSC).
[19] J. McCall, D. Wipf, M. Trivedi, and B. Rao. Lane change intent analysis using robust operators and sparse Bayesian learning. IEEE Transactions on Intelligent Transportation Systems, 8(3).
[20] E. Murphy-Chutorian and M. Trivedi. HyHOPE: Hybrid head orientation and position estimation for vision-based driver head tracking. In IEEE Intelligent Vehicles Symposium, 2008.
[21] E. Murphy-Chutorian and M. Trivedi. Head pose estimation in computer vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(4).
[22] E. Murphy-Chutorian and M. Trivedi. Head pose estimation and augmented reality tracking: An integrated system and evaluation for monitoring driver awareness. IEEE Transactions on Intelligent Transportation Systems, 11(2).
[23] J. Saragih, S. Lucey, and J. Cohn. Face alignment through subspace constrained mean-shifts. In IEEE 12th International Conference on Computer Vision (ICCV), 2009.
[24] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression (PIE) database. In Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition, pages 46-51.
[25] G. Slabaugh. Computing Euler angles from a rotation matrix.
[26] R. Valenti, N. Sebe, and T. Gevers. Combining head pose and eye location information for gaze estimation. IEEE Transactions on Image Processing, 21(2).
[27] J.-G. Wang and E. Sung. EM enhancement of 3D head pose estimated by point at infinity. Image and Vision Computing, 25(12).
[28] J. Wu and M. M. Trivedi. A two-stage head pose estimation framework and evaluation. Pattern Recognition, 41(3).
[29] H. Zhang, M. Smith, and R. Dufour. A final report of safety vehicles using adaptive interface technology: Visual distraction.


More information

Real Time Eye s Off the Road Detection System for Driver Assistance

Real Time Eye s Off the Road Detection System for Driver Assistance Real Time Eye s Off the Road Detection System for Driver Assistance Manthri Lavanya & C.Ravi Teja 1 M.Tech,Assistant Professor, Department of ECE, Anantha Lakshmi Institute of Technology, Anantapur, Ap,

More information

Passive driver gaze tracking with active appearance models

Passive driver gaze tracking with active appearance models Carnegie Mellon University Research Showcase @ CMU Robotics Institute School of Computer Science 2004 Passive driver gaze tracking with active appearance models Takahiro Ishikawa Carnegie Mellon University

More information

Machine learning based automatic extrinsic calibration of an onboard monocular camera for driving assistance applications on smart mobile devices

Machine learning based automatic extrinsic calibration of an onboard monocular camera for driving assistance applications on smart mobile devices Technical University of Cluj-Napoca Image Processing and Pattern Recognition Research Center www.cv.utcluj.ro Machine learning based automatic extrinsic calibration of an onboard monocular camera for driving

More information

A Video Database for Head Pose Tracking Evaluation C O M M U N I C A T I O N I D I A P. Jean-Marc Odobez a. Silèye O. Ba a.

A Video Database for Head Pose Tracking Evaluation C O M M U N I C A T I O N I D I A P. Jean-Marc Odobez a. Silèye O. Ba a. C O M M U N I C A T I O N A Video Database for Head Pose Tracking Evaluation Silèye O. Ba a Jean-Marc Odobez a IDIAP Com 5-4 I D I A P September 25 a IDIAP Research Institute IDIAP Research Institute www.idiap.ch

More information

On Road Vehicle Detection using Shadows

On Road Vehicle Detection using Shadows On Road Vehicle Detection using Shadows Gilad Buchman Grasp Lab, Department of Computer and Information Science School of Engineering University of Pennsylvania, Philadelphia, PA buchmag@seas.upenn.edu

More information

A Robust Two Feature Points Based Depth Estimation Method 1)

A Robust Two Feature Points Based Depth Estimation Method 1) Vol.31, No.5 ACTA AUTOMATICA SINICA September, 2005 A Robust Two Feature Points Based Depth Estimation Method 1) ZHONG Zhi-Guang YI Jian-Qiang ZHAO Dong-Bin (Laboratory of Complex Systems and Intelligence

More information

AAM Based Facial Feature Tracking with Kinect

AAM Based Facial Feature Tracking with Kinect BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 15, No 3 Sofia 2015 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.1515/cait-2015-0046 AAM Based Facial Feature Tracking

More information

arxiv: v1 [cs.cv] 28 Sep 2018

arxiv: v1 [cs.cv] 28 Sep 2018 Camera Pose Estimation from Sequence of Calibrated Images arxiv:1809.11066v1 [cs.cv] 28 Sep 2018 Jacek Komorowski 1 and Przemyslaw Rokita 2 1 Maria Curie-Sklodowska University, Institute of Computer Science,

More information

Vision-based Lane Analysis: Exploration of Issues and Approaches for Embedded Realization

Vision-based Lane Analysis: Exploration of Issues and Approaches for Embedded Realization 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops Vision-based Lane Analysis: Exploration of Issues and Approaches for Embedded Realization R. K. Satzoda and Mohan M. Trivedi Computer

More information

Probabilistic Head Pose Tracking Evaluation in Single and Multiple Camera Setups

Probabilistic Head Pose Tracking Evaluation in Single and Multiple Camera Setups Probabilistic Head Pose Tracking Evaluation in Single and Multiple Camera Setups Sileye O. Ba 1 and Jean-Marc Odobez 1 IDIAP Research Institute, Martigny, Switzerland Abstract. This paper presents our

More information

Measurement of Pedestrian Groups Using Subtraction Stereo

Measurement of Pedestrian Groups Using Subtraction Stereo Measurement of Pedestrian Groups Using Subtraction Stereo Kenji Terabayashi, Yuki Hashimoto, and Kazunori Umeda Chuo University / CREST, JST, 1-13-27 Kasuga, Bunkyo-ku, Tokyo 112-8551, Japan terabayashi@mech.chuo-u.ac.jp

More information

3D TRACKING AND DYNAMIC ANALYSIS OF HUMAN HEAD MOVEMENTS AND ATTENTIONAL TARGETS. Erik Murphy-Chutorian and Mohan M. Trivedi

3D TRACKING AND DYNAMIC ANALYSIS OF HUMAN HEAD MOVEMENTS AND ATTENTIONAL TARGETS. Erik Murphy-Chutorian and Mohan M. Trivedi 3D TRACKING AND DYNAMIC ANALYSIS OF HUMAN HEAD MOVEMENTS AND ATTENTIONAL TARGETS Erik Murphy-Chutorian and Mohan M. Trivedi Systems for Human Interactivity, Visualization and Analysis Computer Vision and

More information

ADVANCES in NATURAL and APPLIED SCIENCES

ADVANCES in NATURAL and APPLIED SCIENCES ADVANCES in NATURAL and APPLIED SCIENCES ISSN: 1995-0772 Published BY AENSI Publication EISSN: 1998-1090 http://www.aensiweb.com/anas 2016 Special 10(6): pages Open Access Journal Estimation of Various

More information

Fully Automatic Endoscope Calibration for Intraoperative Use

Fully Automatic Endoscope Calibration for Intraoperative Use Fully Automatic Endoscope Calibration for Intraoperative Use Christian Wengert, Mireille Reeff, Philippe C. Cattin, Gábor Székely Computer Vision Laboratory, ETH Zurich, 8092 Zurich, Switzerland {wengert,

More information

Segmentation and Tracking of Partial Planar Templates

Segmentation and Tracking of Partial Planar Templates Segmentation and Tracking of Partial Planar Templates Abdelsalam Masoud William Hoff Colorado School of Mines Colorado School of Mines Golden, CO 800 Golden, CO 800 amasoud@mines.edu whoff@mines.edu Abstract

More information

Model Based Techniques for Tracking Human Head for Human Robot Interaction

Model Based Techniques for Tracking Human Head for Human Robot Interaction Model Based Techniques for Tracking Human Head for Human Robot Interaction Faisal Khan University of Wisconsin, Madison faisal@cs.wisc.edu Varghese Mathew University of Wisconsin, Madison vmathew@cs.wisc.edu

More information

Intuitive Human-Robot Interaction through Active 3D Gaze Tracking

Intuitive Human-Robot Interaction through Active 3D Gaze Tracking Intuitive Human-Robot Interaction through Active 3D Gaze Tracking Rowel Atienza and Alexander Zelinsky Research School of Information Sciences and Engineering The Australian National University Canberra,

More information

Postprint.

Postprint. http://www.diva-portal.org Postprint This is the accepted version of a paper presented at 14th International Conference of the Biometrics Special Interest Group, BIOSIG, Darmstadt, Germany, 9-11 September,

More information

Eye Detection by Haar wavelets and cascaded Support Vector Machine

Eye Detection by Haar wavelets and cascaded Support Vector Machine Eye Detection by Haar wavelets and cascaded Support Vector Machine Vishal Agrawal B.Tech 4th Year Guide: Simant Dubey / Amitabha Mukherjee Dept of Computer Science and Engineering IIT Kanpur - 208 016

More information

Moving Object Segmentation Method Based on Motion Information Classification by X-means and Spatial Region Segmentation

Moving Object Segmentation Method Based on Motion Information Classification by X-means and Spatial Region Segmentation IJCSNS International Journal of Computer Science and Network Security, VOL.13 No.11, November 2013 1 Moving Object Segmentation Method Based on Motion Information Classification by X-means and Spatial

More information

Recognition (Part 4) Introduction to Computer Vision CSE 152 Lecture 17

Recognition (Part 4) Introduction to Computer Vision CSE 152 Lecture 17 Recognition (Part 4) CSE 152 Lecture 17 Announcements Homework 5 is due June 9, 11:59 PM Reading: Chapter 15: Learning to Classify Chapter 16: Classifying Images Chapter 17: Detecting Objects in Images

More information

A Robust Method of Facial Feature Tracking for Moving Images

A Robust Method of Facial Feature Tracking for Moving Images A Robust Method of Facial Feature Tracking for Moving Images Yuka Nomura* Graduate School of Interdisciplinary Information Studies, The University of Tokyo Takayuki Itoh Graduate School of Humanitics and

More information

DEPTH AND GEOMETRY FROM A SINGLE 2D IMAGE USING TRIANGULATION

DEPTH AND GEOMETRY FROM A SINGLE 2D IMAGE USING TRIANGULATION 2012 IEEE International Conference on Multimedia and Expo Workshops DEPTH AND GEOMETRY FROM A SINGLE 2D IMAGE USING TRIANGULATION Yasir Salih and Aamir S. Malik, Senior Member IEEE Centre for Intelligent

More information

Multi-sensor Gaze-tracking

Multi-sensor Gaze-tracking Multi-sensor Gaze-tracking Joe Rice March 2017 Abstract In this paper, we propose a method using multiple gaze-tracking-capable sensors along with fuzzy data-fusion techniques to improve gaze-estimation.

More information

On Performance Evaluation Metrics for Lane Estimation

On Performance Evaluation Metrics for Lane Estimation On Performance Evaluation Metrics for Lane Estimation Ravi Kumar Satzoda and Mohan M. Trivedi Laboratory for Intelligent and Safe Automobiles, University of California San Diego, La Jolla, CA-92093 Email:

More information

Using temporal seeding to constrain the disparity search range in stereo matching

Using temporal seeding to constrain the disparity search range in stereo matching Using temporal seeding to constrain the disparity search range in stereo matching Thulani Ndhlovu Mobile Intelligent Autonomous Systems CSIR South Africa Email: tndhlovu@csir.co.za Fred Nicolls Department

More information

I D I A P. Probabilistic Head Pose Tracking Evaluation in Single and Multiple Camera Setups Sileye O. Ba a R E S E A R C H R E P O R T

I D I A P. Probabilistic Head Pose Tracking Evaluation in Single and Multiple Camera Setups Sileye O. Ba a R E S E A R C H R E P O R T R E S E A R C H R E P O R T I D I A P Probabilistic Head Pose Tracking Evaluation in Single and Multiple Camera Setups Sileye O. Ba a Jean-Marc Odobez a IDIAP RR 07-21 march 2007 published in Classification

More information

Announcements. Recognition (Part 3) Model-Based Vision. A Rough Recognition Spectrum. Pose consistency. Recognition by Hypothesize and Test

Announcements. Recognition (Part 3) Model-Based Vision. A Rough Recognition Spectrum. Pose consistency. Recognition by Hypothesize and Test Announcements (Part 3) CSE 152 Lecture 16 Homework 3 is due today, 11:59 PM Homework 4 will be assigned today Due Sat, Jun 4, 11:59 PM Reading: Chapter 15: Learning to Classify Chapter 16: Classifying

More information

Human Robot Communication. Paul Fitzpatrick

Human Robot Communication. Paul Fitzpatrick Human Robot Communication Paul Fitzpatrick Human Robot Communication Motivation for communication Human-readable actions Reading human actions Conclusions Motivation What is communication for? Transferring

More information

Head Pose Estimation for Driver Assistance Systems: A Robust Algorithm and Experimental Evaluation

Head Pose Estimation for Driver Assistance Systems: A Robust Algorithm and Experimental Evaluation Proceedings of the 2007 IEEE Intelligent Transportation Systems Conference Seattle, WA, USA, Sept. 30 - Oct. 3, 2007 WeA1.4 Head Pose Estimation for Driver Assistance Systems: A Robust Algorithm and Experimental

More information

3D Head Pose and Facial Expression Tracking using a Single Camera

3D Head Pose and Facial Expression Tracking using a Single Camera Journal of Universal Computer Science, vol. 16, no. 6 (2010), 903-920 submitted: 15/9/09, accepted: 2/3/10, appeared: 28/3/10 J.UCS 3D Head Pose and Facial Expression Tracking using a Single Camera LucasD.Terissi,JuanC.Gómez

More information

Dense 3-D Reconstruction of an Outdoor Scene by Hundreds-baseline Stereo Using a Hand-held Video Camera

Dense 3-D Reconstruction of an Outdoor Scene by Hundreds-baseline Stereo Using a Hand-held Video Camera Dense 3-D Reconstruction of an Outdoor Scene by Hundreds-baseline Stereo Using a Hand-held Video Camera Tomokazu Satoy, Masayuki Kanbaray, Naokazu Yokoyay and Haruo Takemuraz ygraduate School of Information

More information

818 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 15, NO. 2, APRIL 2014

818 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 15, NO. 2, APRIL 2014 818 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 15, NO. 2, APRIL 2014 Continuous Head Movement Estimator for Driver Assistance: Issues, Algorithms, and On-Road Evaluations Ashish Tawari,

More information

Learning the Deep Features for Eye Detection in Uncontrolled Conditions

Learning the Deep Features for Eye Detection in Uncontrolled Conditions 2014 22nd International Conference on Pattern Recognition Learning the Deep Features for Eye Detection in Uncontrolled Conditions Yue Wu Dept. of ECSE, Rensselaer Polytechnic Institute Troy, NY, USA 12180

More information

FAST REGISTRATION OF TERRESTRIAL LIDAR POINT CLOUD AND SEQUENCE IMAGES

FAST REGISTRATION OF TERRESTRIAL LIDAR POINT CLOUD AND SEQUENCE IMAGES FAST REGISTRATION OF TERRESTRIAL LIDAR POINT CLOUD AND SEQUENCE IMAGES Jie Shao a, Wuming Zhang a, Yaqiao Zhu b, Aojie Shen a a State Key Laboratory of Remote Sensing Science, Institute of Remote Sensing

More information

Head Pose Estimation with One Camera, in Uncalibrated Environments

Head Pose Estimation with One Camera, in Uncalibrated Environments Head Pose Estimation with One Camera, in Uncalibrated Environments Stylianos Asteriadis Image, Video and Multimedia Systems Lab National Technical University of Athens 9, Iroon Polytechniou str, Athens,

More information

Preceding vehicle detection and distance estimation. lane change, warning system.

Preceding vehicle detection and distance estimation. lane change, warning system. Preceding vehicle detection and distance estimation for lane change warning system U. Iqbal, M.S. Sarfraz Computer Vision Research Group (COMVis) Department of Electrical Engineering, COMSATS Institute

More information

Vehicle Occupant Posture Analysis Using Voxel Data

Vehicle Occupant Posture Analysis Using Voxel Data Ninth World Congress on Intelligent Transport Systems, Chicago, Illinois, October Vehicle Occupant Posture Analysis Using Voxel Data Ivana Mikic, Mohan Trivedi Computer Vision and Robotics Research Laboratory

More information

Monocular Vision Based Autonomous Navigation for Arbitrarily Shaped Urban Roads

Monocular Vision Based Autonomous Navigation for Arbitrarily Shaped Urban Roads Proceedings of the International Conference on Machine Vision and Machine Learning Prague, Czech Republic, August 14-15, 2014 Paper No. 127 Monocular Vision Based Autonomous Navigation for Arbitrarily

More information

Adaptive Skin Color Classifier for Face Outline Models

Adaptive Skin Color Classifier for Face Outline Models Adaptive Skin Color Classifier for Face Outline Models M. Wimmer, B. Radig, M. Beetz Informatik IX, Technische Universität München, Germany Boltzmannstr. 3, 87548 Garching, Germany [wimmerm, radig, beetz]@informatik.tu-muenchen.de

More information

Facial Processing Projects at the Intelligent Systems Lab

Facial Processing Projects at the Intelligent Systems Lab Facial Processing Projects at the Intelligent Systems Lab Qiang Ji Intelligent Systems Laboratory (ISL) Department of Electrical, Computer, and System Eng. Rensselaer Polytechnic Institute jiq@rpi.edu

More information

Creating a distortion characterisation dataset for visual band cameras using fiducial markers.

Creating a distortion characterisation dataset for visual band cameras using fiducial markers. Creating a distortion characterisation dataset for visual band cameras using fiducial markers. Robert Jermy Council for Scientific and Industrial Research Email: rjermy@csir.co.za Jason de Villiers Council

More information

Definition, Detection, and Evaluation of Meeting Events in Airport Surveillance Videos

Definition, Detection, and Evaluation of Meeting Events in Airport Surveillance Videos Definition, Detection, and Evaluation of Meeting Events in Airport Surveillance Videos Sung Chun Lee, Chang Huang, and Ram Nevatia University of Southern California, Los Angeles, CA 90089, USA sungchun@usc.edu,

More information

Chapter 9 Object Tracking an Overview

Chapter 9 Object Tracking an Overview Chapter 9 Object Tracking an Overview The output of the background subtraction algorithm, described in the previous chapter, is a classification (segmentation) of pixels into foreground pixels (those belonging

More information

Comparative Study of Coarse Head Pose Estimation

Comparative Study of Coarse Head Pose Estimation Comparative Study of Coarse Head Pose Estimation Lisa M. Brown and Ying-Li Tian IBM T.J. Watson Research Center Hawthorne, NY 10532 {lisabr,yltian}@us.ibm.com Abstract For many practical applications,

More information

Improving Vision-Based Distance Measurements using Reference Objects

Improving Vision-Based Distance Measurements using Reference Objects Improving Vision-Based Distance Measurements using Reference Objects Matthias Jüngel, Heinrich Mellmann, and Michael Spranger Humboldt-Universität zu Berlin, Künstliche Intelligenz Unter den Linden 6,

More information

Stereo Vision. MAN-522 Computer Vision

Stereo Vision. MAN-522 Computer Vision Stereo Vision MAN-522 Computer Vision What is the goal of stereo vision? The recovery of the 3D structure of a scene using two or more images of the 3D scene, each acquired from a different viewpoint in

More information

On Modeling Variations for Face Authentication

On Modeling Variations for Face Authentication On Modeling Variations for Face Authentication Xiaoming Liu Tsuhan Chen B.V.K. Vijaya Kumar Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213 xiaoming@andrew.cmu.edu

More information

A Street Scene Surveillance System for Moving Object Detection, Tracking and Classification

A Street Scene Surveillance System for Moving Object Detection, Tracking and Classification A Street Scene Surveillance System for Moving Object Detection, Tracking and Classification Huei-Yung Lin * and Juang-Yu Wei Department of Electrical Engineering National Chung Cheng University Chia-Yi

More information

DISTANCE MAPS: A ROBUST ILLUMINATION PREPROCESSING FOR ACTIVE APPEARANCE MODELS

DISTANCE MAPS: A ROBUST ILLUMINATION PREPROCESSING FOR ACTIVE APPEARANCE MODELS DISTANCE MAPS: A ROBUST ILLUMINATION PREPROCESSING FOR ACTIVE APPEARANCE MODELS Sylvain Le Gallou*, Gaspard Breton*, Christophe Garcia*, Renaud Séguier** * France Telecom R&D - TECH/IRIS 4 rue du clos

More information

#65 MONITORING AND PREDICTING PEDESTRIAN BEHAVIOR AT TRAFFIC INTERSECTIONS

#65 MONITORING AND PREDICTING PEDESTRIAN BEHAVIOR AT TRAFFIC INTERSECTIONS #65 MONITORING AND PREDICTING PEDESTRIAN BEHAVIOR AT TRAFFIC INTERSECTIONS Final Research Report Luis E. Navarro-Serment, Ph.D. The Robotics Institute Carnegie Mellon University Disclaimer The contents

More information

Efficient SLAM Scheme Based ICP Matching Algorithm Using Image and Laser Scan Information

Efficient SLAM Scheme Based ICP Matching Algorithm Using Image and Laser Scan Information Proceedings of the World Congress on Electrical Engineering and Computer Systems and Science (EECSS 2015) Barcelona, Spain July 13-14, 2015 Paper No. 335 Efficient SLAM Scheme Based ICP Matching Algorithm

More information

A REAL-TIME FACIAL FEATURE BASED HEAD TRACKER

A REAL-TIME FACIAL FEATURE BASED HEAD TRACKER A REAL-TIME FACIAL FEATURE BASED HEAD TRACKER Jari Hannuksela, Janne Heikkilä and Matti Pietikäinen {jari.hannuksela, jth, mkp}@ee.oulu.fi Machine Vision Group, Infotech Oulu P.O. Box 45, FIN-914 University

More information

Head Pose Estimation by using Morphological Property of Disparity Map

Head Pose Estimation by using Morphological Property of Disparity Map Head Pose Estimation by using Morphological Property of Disparity Map Sewoong Jun*, Sung-Kee Park* and Moonkey Lee** *Intelligent Robotics Research Center, Korea Institute Science and Technology, Seoul,

More information

Image processing techniques for driver assistance. Razvan Itu June 2014, Technical University Cluj-Napoca

Image processing techniques for driver assistance. Razvan Itu June 2014, Technical University Cluj-Napoca Image processing techniques for driver assistance Razvan Itu June 2014, Technical University Cluj-Napoca Introduction Computer vision & image processing from wiki: any form of signal processing for which

More information

Time-to-Contact from Image Intensity

Time-to-Contact from Image Intensity Time-to-Contact from Image Intensity Yukitoshi Watanabe Fumihiko Sakaue Jun Sato Nagoya Institute of Technology Gokiso, Showa, Nagoya, 466-8555, Japan {yukitoshi@cv.,sakaue@,junsato@}nitech.ac.jp Abstract

More information

An Interactive Technique for Robot Control by Using Image Processing Method

An Interactive Technique for Robot Control by Using Image Processing Method An Interactive Technique for Robot Control by Using Image Processing Method Mr. Raskar D. S 1., Prof. Mrs. Belagali P. P 2 1, E&TC Dept. Dr. JJMCOE., Jaysingpur. Maharashtra., India. 2 Associate Prof.

More information

Generic Face Alignment Using an Improved Active Shape Model

Generic Face Alignment Using an Improved Active Shape Model Generic Face Alignment Using an Improved Active Shape Model Liting Wang, Xiaoqing Ding, Chi Fang Electronic Engineering Department, Tsinghua University, Beijing, China {wanglt, dxq, fangchi} @ocrserv.ee.tsinghua.edu.cn

More information

Using Geometric Blur for Point Correspondence

Using Geometric Blur for Point Correspondence 1 Using Geometric Blur for Point Correspondence Nisarg Vyas Electrical and Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA Abstract In computer vision applications, point correspondence

More information

Advanced Driver Assistance Systems: A Cost-Effective Implementation of the Forward Collision Warning Module

Advanced Driver Assistance Systems: A Cost-Effective Implementation of the Forward Collision Warning Module Advanced Driver Assistance Systems: A Cost-Effective Implementation of the Forward Collision Warning Module www.lnttechservices.com Table of Contents Abstract 03 Introduction 03 Solution Overview 03 Output

More information

THE POSITION AND ORIENTATION MEASUREMENT OF GONDOLA USING A VISUAL CAMERA

THE POSITION AND ORIENTATION MEASUREMENT OF GONDOLA USING A VISUAL CAMERA THE POSITION AND ORIENTATION MEASUREMENT OF GONDOLA USING A VISUAL CAMERA Hwadong Sun 1, Dong Yeop Kim 1 *, Joon Ho Kwon 2, Bong-Seok Kim 1, and Chang-Woo Park 1 1 Intelligent Robotics Research Center,

More information

Categorization by Learning and Combining Object Parts

Categorization by Learning and Combining Object Parts Categorization by Learning and Combining Object Parts Bernd Heisele yz Thomas Serre y Massimiliano Pontil x Thomas Vetter Λ Tomaso Poggio y y Center for Biological and Computational Learning, M.I.T., Cambridge,

More information

Tracking driver actions and guiding phone usage for safer driving. Hongyu Li Jan 25, 2018

Tracking driver actions and guiding phone usage for safer driving. Hongyu Li Jan 25, 2018 Tracking driver actions and guiding phone usage for safer driving Hongyu Li Jan 25, 2018 1 Smart devices risks and opportunities Phone in use 14% Other distractions 86% Distraction-Affected Fatalities

More information

Vehicle Dimensions Estimation Scheme Using AAM on Stereoscopic Video

Vehicle Dimensions Estimation Scheme Using AAM on Stereoscopic Video Workshop on Vehicle Retrieval in Surveillance (VRS) in conjunction with 2013 10th IEEE International Conference on Advanced Video and Signal Based Surveillance Vehicle Dimensions Estimation Scheme Using

More information

Pose estimation using a variety of techniques

Pose estimation using a variety of techniques Pose estimation using a variety of techniques Keegan Go Stanford University keegango@stanford.edu Abstract Vision is an integral part robotic systems a component that is needed for robots to interact robustly

More information

Adaptive Zoom Distance Measuring System of Camera Based on the Ranging of Binocular Vision

Adaptive Zoom Distance Measuring System of Camera Based on the Ranging of Binocular Vision Adaptive Zoom Distance Measuring System of Camera Based on the Ranging of Binocular Vision Zhiyan Zhang 1, Wei Qian 1, Lei Pan 1 & Yanjun Li 1 1 University of Shanghai for Science and Technology, China

More information

Parametric Manifold of an Object under Different Viewing Directions

Parametric Manifold of an Object under Different Viewing Directions Parametric Manifold of an Object under Different Viewing Directions Xiaozheng Zhang 1,2, Yongsheng Gao 1,2, and Terry Caelli 3 1 Biosecurity Group, Queensland Research Laboratory, National ICT Australia

More information

HEAD TRACKING FOR 3D AUDIO USING THE NINTENDO WII REMOTE

HEAD TRACKING FOR 3D AUDIO USING THE NINTENDO WII REMOTE HEAD TRACKING FOR 3D AUDIO USING THE NINTENDO WII REMOTE Mauricio Ubilla School of Enginering Department of Computer Science Pontificia Universidad Catolica de Chile Domingo Mery School of Enginering Department

More information

A Simulation Study and Experimental Verification of Hand-Eye-Calibration using Monocular X-Ray

A Simulation Study and Experimental Verification of Hand-Eye-Calibration using Monocular X-Ray A Simulation Study and Experimental Verification of Hand-Eye-Calibration using Monocular X-Ray Petra Dorn, Peter Fischer,, Holger Mönnich, Philip Mewes, Muhammad Asim Khalil, Abhinav Gulhar, Andreas Maier

More information

Transactions on Information and Communications Technologies vol 16, 1996 WIT Press, ISSN

Transactions on Information and Communications Technologies vol 16, 1996 WIT Press,   ISSN ransactions on Information and Communications echnologies vol 6, 996 WI Press, www.witpress.com, ISSN 743-357 Obstacle detection using stereo without correspondence L. X. Zhou & W. K. Gu Institute of Information

More information

Flexible Calibration of a Portable Structured Light System through Surface Plane

Flexible Calibration of a Portable Structured Light System through Surface Plane Vol. 34, No. 11 ACTA AUTOMATICA SINICA November, 2008 Flexible Calibration of a Portable Structured Light System through Surface Plane GAO Wei 1 WANG Liang 1 HU Zhan-Yi 1 Abstract For a portable structured

More information

Generic 3D Face Pose Estimation using Facial Shapes

Generic 3D Face Pose Estimation using Facial Shapes Generic 3D Face Pose Estimation using Facial Shapes Jingu Heo CyLab Biometrics Center Carnegie Mellon University 5000 Forbes Ave, Pittsburgh, PA 15213 jheo@cmu.edu Marios Savvides CyLab Biometrics Center

More information

SHRP 2 Safety Research Symposium July 27, Site-Based Video System Design and Development: Research Plans and Issues

SHRP 2 Safety Research Symposium July 27, Site-Based Video System Design and Development: Research Plans and Issues SHRP 2 Safety Research Symposium July 27, 2007 Site-Based Video System Design and Development: Research Plans and Issues S09 Objectives Support SHRP2 program research questions: Establish crash surrogates

More information

Intensity-Depth Face Alignment Using Cascade Shape Regression

Intensity-Depth Face Alignment Using Cascade Shape Regression Intensity-Depth Face Alignment Using Cascade Shape Regression Yang Cao 1 and Bao-Liang Lu 1,2 1 Center for Brain-like Computing and Machine Intelligence Department of Computer Science and Engineering Shanghai

More information

Applying Synthetic Images to Learning Grasping Orientation from Single Monocular Images

Applying Synthetic Images to Learning Grasping Orientation from Single Monocular Images Applying Synthetic Images to Learning Grasping Orientation from Single Monocular Images 1 Introduction - Steve Chuang and Eric Shan - Determining object orientation in images is a well-established topic

More information

Camera Parameters Estimation from Hand-labelled Sun Sositions in Image Sequences

Camera Parameters Estimation from Hand-labelled Sun Sositions in Image Sequences Camera Parameters Estimation from Hand-labelled Sun Sositions in Image Sequences Jean-François Lalonde, Srinivasa G. Narasimhan and Alexei A. Efros {jlalonde,srinivas,efros}@cs.cmu.edu CMU-RI-TR-8-32 July

More information

Three-Dimensional Measurement of Objects in Liquid with an Unknown Refractive Index Using Fisheye Stereo Camera

Three-Dimensional Measurement of Objects in Liquid with an Unknown Refractive Index Using Fisheye Stereo Camera Three-Dimensional Measurement of Objects in Liquid with an Unknown Refractive Index Using Fisheye Stereo Camera Kazuki Sakamoto, Alessandro Moro, Hiromitsu Fujii, Atsushi Yamashita, and Hajime Asama Abstract

More information

Tracking Requirements for ATC Tower Augmented Reality Environments

Tracking Requirements for ATC Tower Augmented Reality Environments Tracking Requirements for ATC Tower Augmented Reality Environments Magnus Axholt Department of Science and Technology (ITN) Linköping University, Sweden Supervisors: Anders Ynnerman Linköping University,

More information

Expression Detection in Video. Abstract Expression detection is useful as a non-invasive method of lie detection and

Expression Detection in Video. Abstract Expression detection is useful as a non-invasive method of lie detection and Wes Miller 5/11/2011 Comp Sci 534 Expression Detection in Video Abstract Expression detection is useful as a non-invasive method of lie detection and behavior prediction, as many facial expressions are

More information

Multi-Camera Calibration, Object Tracking and Query Generation

Multi-Camera Calibration, Object Tracking and Query Generation MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Multi-Camera Calibration, Object Tracking and Query Generation Porikli, F.; Divakaran, A. TR2003-100 August 2003 Abstract An automatic object

More information

Illumination-Robust Face Recognition based on Gabor Feature Face Intrinsic Identity PCA Model

Illumination-Robust Face Recognition based on Gabor Feature Face Intrinsic Identity PCA Model Illumination-Robust Face Recognition based on Gabor Feature Face Intrinsic Identity PCA Model TAE IN SEOL*, SUN-TAE CHUNG*, SUNHO KI**, SEONGWON CHO**, YUN-KWANG HONG*** *School of Electronic Engineering

More information

Visible and Long-Wave Infrared Image Fusion Schemes for Situational. Awareness

Visible and Long-Wave Infrared Image Fusion Schemes for Situational. Awareness Visible and Long-Wave Infrared Image Fusion Schemes for Situational Awareness Multi-Dimensional Digital Signal Processing Literature Survey Nathaniel Walker The University of Texas at Austin nathaniel.walker@baesystems.com

More information

Lane Markers Detection based on Consecutive Threshold Segmentation

Lane Markers Detection based on Consecutive Threshold Segmentation ISSN 1746-7659, England, UK Journal of Information and Computing Science Vol. 6, No. 3, 2011, pp. 207-212 Lane Markers Detection based on Consecutive Threshold Segmentation Huan Wang +, Mingwu Ren,Sulin

More information