Monitoring Head Dynamics for Driver Assistance Systems: A Multi-Perspective Approach


Proceedings of the 16th International IEEE Annual Conference on Intelligent Transportation Systems (ITSC 2013), The Hague, The Netherlands, October 6-9, 2013

Sujitha Martin, Ashish Tawari and Mohan M. Trivedi
Laboratory of Intelligent and Safe Automobiles, UCSD, CA 92092, USA
{scmartin, atawari, mtrivedi} at ucsd.edu

Abstract: A visually demanding driving environment, where elements surrounding a driver are constantly and rapidly changing, requires the driver to make spatially large head turns. Many existing state-of-the-art vision-based head pose algorithms, however, still have difficulty continuously monitoring the head dynamics of a driver. This occurs because, from the perspective of a single camera, spatially large head turns induce self-occlusions of facial features, which are key elements in determining head pose. In this paper, we introduce a shape-feature-based multi-perspective framework for continuously monitoring the driver's head dynamics. The proposed approach utilizes a distributed camera setup to observe the driver over a wide range of head movements. Using head dynamics and a confidence measure based on the symmetry of facial features, a particular perspective is chosen to provide the final head pose estimate. Our analysis of real-world driving data shows promising results.

I. INTRODUCTION

In the 2011 American Community Survey (ACS) 3-year estimates [1], it is reported that of the 138 million surveyed people, 86.2% used a car, truck or van as their means of transportation to work and on average spent at least 25 minutes traveling to work daily. Meanwhile, the World Health Organization (WHO) predicted that by 2020, traffic-related injuries would become the 3rd leading cause of the global burden of disease [2]. Even when faced with these alarming predictions, drivers will continue to drive to work and other recreational places, and be exposed to safety-critical situations. However, injury-related collisions can be avoided or mitigated if the driver is notified when such situations arise. A comprehensive survey on automotive collisions demonstrated that a driver was 31% less likely to cause an injury-related collision when he had one or more passengers who could alert him to unseen hazards [3]. Consequently, there is great potential for intelligent driver assistance systems (IDAS) that can alert the driver to potential dangers or even guide them briefly through a critical situation.

Assessing the source and level of danger in a driving situation, with a driver in the loop, is still a very challenging task and requires context-specific systems that monitor not only the outside environment but also the inside of the vehicle [4]. For example, if a vehicle is drifting from its lane, looking inside the vehicle at the driver will show whether the driver is distracted, in which case an IDAS can alert the driver, or whether he intends to change lanes, in which case an IDAS can assist in changing lanes. Another application is in the design of active heads-up displays [5], where safety-critical information is efficiently presented to the driver, knowing his or her field of view. In our present discussion, we focus on one of the integral components for monitoring driver awareness: analysis of head dynamics. It involves continuous estimation of head position and orientation/pose.
These continuous estimates of head pose are necessary for applications like lane change intent prediction [6][7], which looks for particular patterns in head dynamics over time, and driver vigilance monitoring [8][9], which uses the driver's nodding frequency to monitor fatigue. In a recent publication describing the LISA-P head pose database from naturalistic on-road driving [10], Martin et al. showed, using head pose taken from a motion capture system, that 95% of the time the driver's head pose was within 36° of forward facing in the yaw rotation angle. While it is true that a driver spends most of the time looking at the road directly ahead, situations critical to driver safety usually occur when the driver makes spatially large head turns away from forward facing. Therefore, there is a need to evaluate over databases more concentrated on large head movements during naturalistic on-road driving, in order to better assess any algorithm's performance, robustness and reliability in continuous head tracking. To the best of our knowledge, there exists only one other study [11] that presents evaluations specifically over the instances where drivers make spatially large head turns during naturalistic driving.

In this paper, we introduce a vision-based multi-perspective framework for head pose estimation, which targets the issues presented by spatially large head turns. In the proposed framework, we show how to robustly combine the different perspectives while treating them separately. For evaluations, we utilize naturalistic on-road driving data. It is important to note that we choose segments of the naturalistic driving data around specific maneuvers which demand large head movement, thereby presenting realistic requirements for a continuous driver monitoring system. We present comparisons between single- and multi-perspective approaches using mean squared errors in the yaw, pitch and roll rotation angles, and the failure rate, the percentage of time when the system's output is unreliable.

II. RELATED STUDIES

In this section, we present some representative literature on vision-based head pose estimation algorithms and show that these works explicitly or implicitly state their need for visibility of facial features to estimate head pose. This will emphasize that during spatially large head movements

over a period of time, the lack of continuous visibility of facial features from the perspective of a single camera can be overcome with multiple camera perspectives. A recent survey by Murphy-Chutorian and Trivedi [12] gives a more thorough comparison of a variety of head pose estimation algorithms.

TABLE I. SUMMARY OF ISSUES WITH RESPECT TO MONITORING THE HEAD DYNAMICS DURING NATURALISTIC ON-ROAD DRIVING, FACED AND ADDRESSED (IF AT ALL) BY SINGLE- AND MULTIPLE-PERSPECTIVE SYSTEMS.

Issues in monitoring the head dynamics of a driver | Single-perspective | Multi-perspective
Wide changes in head pose in the yaw rotation angle | Even with a wide field of view camera, the limitation in perspective from a single camera makes it susceptible | Less susceptible with careful placement of multiple cameras
Varying lighting conditions | Any fixed placement of a single camera, depending on time of day and direction of car motion, will be affected by lighting conditions (e.g. washed-out face) | Lower probability of all perspectives being severely affected
Occlusion of the driver's face (e.g. due to hand movements on the wheel, adjusting the sun shade) | Highly susceptible | Lower probability of all perspectives being severely affected

Facial-feature and geometry based algorithms are an intuitive way of estimating head pose and explicitly state the visibility requirement of facial features. Many of them require at least 5 to 6 specific features to be visible in the image plane of the camera and use geometric constraints such as face symmetry using the lip corners and both eyes [13], the anatomical structure of the eye together with person-specific parameters [14], or parallel lines between eye corners and between lip corners [15]. Yet another method, while not requiring specific facial features, requires at least 4 features in general positions on the face and uses POS [16] on corresponding points between the image and a mean face model to estimate head pose [10]. On the other hand, algorithms like HyHOPE [17], which uses a texture-based 3D model to represent the driver's head, implicitly require the visibility of facial features. The approach is to obtain an initial 3D model of the face, and then use rotation, translation and projection of the model to match views from video frames to estimate head pose. As claimed in [17], such algorithms are also unreliable during spatially large head movements due to self-occlusion of a majority of the face.

The consequence of non-visible facial features from a single-perspective camera during spatially large head movements has been suppressed because many contributions in the literature have evaluated over datasets where, a majority of the time, the driver is close to forward facing. In this paper, we explicitly address this concern and show improved performance of the multi-perspective approach when compared against the single-perspective one.

Fig. 1. Illustration of disadvantages present in single-perspective systems. Each row of images is from a particular camera location and each column of images is time-synchronized. Single-perspective systems are susceptible to spatially large head movements (column 1), hand movements (column 2) and lighting conditions (column 3).

III. CHALLENGES IN SINGLE PERSPECTIVES

Monitoring the head dynamics of a driver requires continuous estimation of head position and orientation. From a computer vision perspective, however, intrinsic properties of head dynamics present a challenge to the robustness of many existing algorithms. As discussed in Section II, many existing
state-of-the-art head pose estimation algorithms rely on a portion of the face being visible in the image plane to estimate head pose. This means that even during spatially large head movements, algorithms require the visibility of facial features to continuously track the state of the head. With a single perspective of the driver's head, however, large spatial head movements induce self-occlusions of facial features. The first column of Fig. 1 illustrates this issue. In Fig. 1, each row of images is taken from a different camera perspective and each column of images is time-synchronized. Clearly, the availability of multiple perspectives decreases the severity of self-occlusions at any instant in time, which translates to an increase in the robustness of continuous head tracking.

Occlusions of facial features can also occur due to external objects (e.g. hand movements near the face region, sunglasses). Depending on the camera perspective, hand movements on the steering wheel during vehicle maneuvers, to adjust the sunshade, to point, etc. can cause occlusion. The second column in Fig. 1 shows an example of occlusion by the hand as the driver adjusts the sunshade.

Another source of challenge for vision-based driver head monitoring systems in naturalistic driving conditions is illumination change. The effects of lighting conditions are also highly dependent on the camera location.

In Fig. 1, the last column illustrates the effects of lighting conditions. Therefore, a multiple-perspective approach, with suitable camera placements, can mitigate the adverse effect of any one camera perspective being unreliable for tracking the head. Outlines of these issues, and how (if at all) they are addressed by single-perspective and multi-perspective systems, are given in Table I. In our current implementation, our main focus is on improving head tracking accuracy for a continuous driver monitoring system during spatially large head movements.

Fig. 2. Time-synchronized images over a period of time from the distributed camera setup around the driver. In the image series, the driver is moving his head from almost frontal facing to the left.

IV. MULTI-PERSPECTIVE FRAMEWORK

In the multi-perspective framework, each perspective is kept independent of the others and can be processed in parallel. To estimate head dynamics, we use a shape-based head pose method to find local features, such as eye corners, nose corners and the nose tip, and determine pose from the relative configuration of these features. A camera selection procedure then selects the appropriate perspective based on head dynamics and a confidence measure based on the symmetry of facial features.

A. Head pose estimation

The head pose estimation algorithm used in the multi-perspective framework is based on facial features and their respective geometry. At least 4 facial feature points in general positions in the image plane, and their corresponding points in a 3D generic mean face model, are required in this geometric approach. With these corresponding points, Pose from Orthography and Scaling (POS) [16] is used to estimate head pose. In our current implementation, manual annotation plus POS is used as a reference to evaluate Constrained Local Models (CLM) [18] plus POS in the multi-view selective framework.

1) CLM: In recent times, model-based techniques have been extensively used in nonrigid deformable object fitting. One such technique, CLM, is used for facial landmark detection and tracking in video data. CLM represents objects, in our case faces, using local appearance descriptions centered around landmarks of interest, and a parameterized shape model to capture plausible deformations of the landmark locations. An ensemble of local detectors predicts the probable locations of the landmarks, called a response map. The true response maps of these detectors are approximated by various models, such as a Gaussian density with diagonal covariance [19], full covariance [20], or a Gaussian mixture model [21], to make optimization efficient and numerically stable. In [18], the response map is represented non-parametrically using a kernel density estimate (KDE), and the landmark locations are optimized via subspace-constrained mean-shifts while enforcing their joint motion via the shape model.

CLM is generally used to fit faces of various orientations, positions and sizes in the image plane. However, when streaming images from a fixed camera looking at the driver's face while driving, there are constraints on the orientation, position and size of the face that can be used to further improve facial feature tracking. For each perspective, we estimate the probable face location and face size. When the tracked features are not within the expected region, or if the size of the face is beyond what is expected from that perspective, the tracked facial features are discarded and the system is re-initialized. Similarly, if the head pose computed from the detected facial features is not realizable in a driving scenario, it is also rejected. The rejected frames are counted toward one of the metrics (failure rate) used to assess the performance of the system.
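To make this per-perspective gating concrete, a minimal sketch follows. The region and size priors, thresholds, and function names are illustrative assumptions, not values from the paper.

```python
import numpy as np

def track_is_valid(landmarks, expected_center, expected_size,
                   max_offset=80.0, size_tolerance=0.5):
    """Accept a CLM track only if the tracked face lies in the image
    region, and has roughly the size, expected for this camera
    perspective. Thresholds here are illustrative, not from the paper.

    landmarks:       (N, 2) array of tracked feature locations (pixels).
    expected_center: (2,) prior face center for this perspective.
    expected_size:   prior face size (pixels) for this perspective.
    """
    landmarks = np.asarray(landmarks, dtype=float)
    center = landmarks.mean(axis=0)
    # Approximate the face size by the spread of the tracked landmarks.
    size = np.linalg.norm(landmarks.max(axis=0) - landmarks.min(axis=0))

    in_region = np.linalg.norm(center - expected_center) < max_offset
    plausible = abs(size - expected_size) < size_tolerance * expected_size
    return in_region and plausible
```

A frame rejected by this gate (or by the pose-realizability check above) triggers re-initialization and counts toward the failure-rate metric of Section V.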
2) POS: Given a 3D model of an object, POS finds the position and orientation of the camera coordinate frame with respect to the object reference frame, such that when the 3D object is orthographically projected onto a plane parallel to the camera's image plane and scaled, it matches the image in the camera with minimal error. Given that $M_i$ and $p_i = (x_i, y_i)$ are known points in the 3D model and the image plane, respectively, the system of equations to solve is as follows:

$\overrightarrow{M_0 M_i} \cdot \alpha\,\mathbf{i} = x_i - x_0$
$\overrightarrow{M_0 M_i} \cdot \alpha\,\mathbf{j} = y_i - y_0$

The notation $\overrightarrow{M_0 M_i}$ represents the vector from the reference point $M_0$ to $M_i$, and $\alpha$ is the earlier-mentioned scaling factor that arises due to perspective projection. The vectors $\mathbf{i}$ and $\mathbf{j}$ form the first two rows of the rotation matrix, and the third row is the cross product $\mathbf{k} = \mathbf{i} \times \mathbf{j}$. Note, however, that while $\mathbf{k}$ is perpendicular to $\mathbf{i}$ and $\mathbf{j}$, the vectors $\mathbf{i}$ and $\mathbf{j}$ are not necessarily perpendicular to each other because of noisy 2D-3D correspondences. Therefore, the rotation matrix is projected into SO(3) by normalizing the magnitude of the eigenvalues.
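Read as a least-squares problem over the feature correspondences, the POS step can be sketched in a few lines of numpy. The SVD-based projection onto SO(3) below is a standard stand-in for the eigenvalue normalization mentioned above, not necessarily the authors' exact procedure.

```python
import numpy as np

def pos_head_pose(model_pts, image_pts):
    """Pose from Orthography and Scaling (POS) [16], minimal sketch.

    model_pts: (N, 3) points of the generic mean face model.
    image_pts: (N, 2) corresponding image points; N >= 4, with the
               first point used as the reference M0 / p0.
    Returns a rotation matrix R in SO(3) and the scale factor alpha.
    """
    A = model_pts[1:] - model_pts[0]          # vectors M0Mi, shape (N-1, 3)
    bx = image_pts[1:, 0] - image_pts[0, 0]   # x_i - x_0
    by = image_pts[1:, 1] - image_pts[0, 1]   # y_i - y_0

    # Solve A (alpha i) = bx and A (alpha j) = by in the least-squares sense.
    I, _, _, _ = np.linalg.lstsq(A, bx, rcond=None)
    J, _, _, _ = np.linalg.lstsq(A, by, rcond=None)

    alpha = 0.5 * (np.linalg.norm(I) + np.linalg.norm(J))
    i, j = I / np.linalg.norm(I), J / np.linalg.norm(J)
    k = np.cross(i, j)                        # third row of the rotation

    # Noisy 2D-3D correspondences leave i and j slightly non-orthogonal,
    # so project the stacked matrix onto the nearest rotation via SVD.
    U, _, Vt = np.linalg.svd(np.vstack([i, j, k]))
    R = U @ Vt
    if np.linalg.det(R) < 0:                  # guard against reflections
        U[:, -1] *= -1
        R = U @ Vt
    return R, alpha
```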

B. Perspective Selection algorithm

In our current implementation, we use three cameras: one is placed to look straight at the driver, one looks at the driver from the left, and the last looks at the driver from the right, as shown in Fig. 2. When the estimated head pose, with respect to the camera's reference frame, in the current frame is outside of a defined range in the yaw rotation angle, the direction of the head movement is used to select the camera perspective for the next frame. If tracking becomes unreliable, the system is re-initialized with a camera chosen based on the symmetry of the detected facial features; a higher symmetry score means the detected face is closer to frontal pose in the given perspective. The block diagram in Fig. 3 illustrates this process, and Fig. 4 illustrates the process as applied to naturalistic on-road driving test data.

Fig. 3. A block diagram of the multi-view selective framework for head dynamics analysis. Inputs are images from multiple cameras. Head dynamics and a confidence-based measure are used to select which perspective to use for head dynamics analysis.
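A minimal sketch of this selection rule follows. The yaw working range, the sign convention (negative yaw assumed to mean a leftward turn), and the camera ordering are assumptions made for illustration.

```python
def select_camera(current_cam, yaw, yaw_rate, tracking_ok, symmetry_scores,
                  yaw_range=(-30.0, 30.0)):
    """Choose the camera perspective for the next frame.

    yaw:             head yaw (degrees) in the current camera's frame;
                     negative is assumed to mean a leftward turn.
    yaw_rate:        sign gives the direction of head movement.
    tracking_ok:     False when tracking has become unreliable.
    symmetry_scores: dict mapping camera name -> facial-symmetry score.
    """
    order = ["left", "center", "right"]
    if not tracking_ok:
        # Re-initialize with the camera seeing the most frontal face.
        return max(symmetry_scores, key=symmetry_scores.get)

    low, high = yaw_range
    idx = order.index(current_cam)
    # Outside the working range: follow the direction of head movement.
    if yaw < low and yaw_rate < 0 and idx > 0:
        return order[idx - 1]   # head turning left -> camera to the left
    if yaw > high and yaw_rate > 0 and idx < len(order) - 1:
        return order[idx + 1]   # head turning right -> camera to the right
    return current_cam
```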
V. RESULTS AND DISCUSSION

The proposed multi-perspective framework is evaluated on naturalistic on-road driving test data. We evaluate on events that induce spatially wide changes in the head pose of the driver. By evaluating on these select events, we show the usefulness of a multi-perspective approach in continuously tracking the head.

A. Testbed and Dataset

Data is collected from naturalistic, on-road driving using the LISA-A testbed [11]. There are three cameras observing the driver. One camera was placed on the leftmost edge of the front windshield (near the A-pillar) to get a better view of the driver's face when the driver turns his head leftward. Another camera was placed on the front windshield, close to the dashboard, to get the best view when the driver is mostly near forward facing. The last camera was placed near the rearview mirror, for when the driver turns his head rightward. Most importantly, data from all three cameras were captured in a time-synchronized manner.

Using this testbed, four drivers were asked to drive naturally on local streets and freeways near UCSD. Approximately 30 minutes of data was collected with each driver in sunny weather conditions, which allowed for varying lighting conditions. In addition, driving in an urban setting, the drivers passed through many stop signs and made multiple turns, and driving on the freeway allowed for multiple lane change occurrences, resulting in a dataset with wide spatial changes in head pose.

From the collected data, we select events during the drive that induced large head movements. These events are when the driver is making right/left turns, right/left lane changes, stops at stop signs, and freeway merges. Table II shows the events considered, the number of respective events analyzed during the total 120-minute drive over all drivers, and the total number of frames accumulated for each event. The evaluations reported in the following section are over these selected events. It is important to note that each event can induce more than one set of spatially wide head movements.

TABLE II. A LIST OF EVENTS CONSIDERED FOR EVALUATION, AND THE RESPECTIVE COUNT AND NUMBER OF FRAMES.

Events | No. of events | Total no. of frames
Right turns | - | -
Left turns | - | -
Stop sign | - | -
Right lane change | - | -
Left lane change | - | -
Merge | - | -
All | - | -

Ground truth for evaluations is available in the form of manual facial feature annotations. The features annotated are the eye corners, nose corners and nose tip. These features are chosen particularly because they are distinct features on the face and are less susceptible to deformation. The annotations are available once every 5 frames for all three camera perspectives for the events considered. These manual feature annotations and their corresponding locations on a 3D generic face model are used with POS to estimate head pose. In the following evaluation, we use this estimated head pose as a reference to compute the error in head pose estimated from automatic facial feature tracking. During annotation of the facial features, there were instances when fewer than 4 features were visible. Since we need at least 4 features in general positions to estimate head pose, we did not annotate frames where fewer than 4 features were visible and accounted for them in one of the metrics (failure rate).

Fig. 4. Illustration of the multi-perspective framework. The graph at the bottom shows a segment taken from a subject's naturalistic on-road driving experiment. The horizontal axis represents time and the vertical axis represents head rotation in the yaw angle relative to a particular camera's reference frame. Blue asterisks represent the left camera, red circles the center camera and magenta crosses the right camera. The plot shows the relative head pose from the selected camera. Images from each perspective at three instants are shown, with red boxes drawn around the chosen camera perspective at the respective instants.

B. On-road Performance Evaluation

Evaluations are done in two parts. First, we test how well the head is tracked with the multi-perspective framework and compare with the output from a single perspective. In our evaluation, the single perspective refers to the output from the front-facing camera, while the multi-perspective framework includes all three camera perspectives described in the previous section. We determine that head tracking has failed when, for the chosen camera perspective, the estimated head pose is unavailable or the estimated head pose is more than 30° from ground truth in the yaw rotation angle. Normalizing this count by the total number of frames over all events gives the failure rate. As shown in Table III, the failure rate with the multi-perspective framework is only 9.4%, whereas it is 29% with the single perspective. These percentages matter for many applications, such as lane change intent prediction, where a large percentage of loss in head tracking can critically handicap a system. Therefore, a smaller percentage of loss in head tracking is highly preferable in safety-critical settings like the driving environment.

In the second part of the evaluation, we compute the mean squared error (MSE) of the estimated head pose for the multi-perspective and the single-perspective systems. We see from Table III that the MSE is comparable between the multi-perspective and single-perspective approaches.

TABLE III. EVALUATION OF THE MULTI-VIEW SELECTIVE FRAMEWORK.

 | Multi-view | Single-view
Mean squared error (pitch) | - | -
Mean squared error (yaw) | - | -
Mean squared error (roll) | - | -
Failure rate | 9.4% | 29.0%
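The two metrics reduce to a few lines of code. This sketch assumes per-frame yaw estimates, with NaN marking frames where the selected perspective produced no pose, and uses the 30° threshold from the text.

```python
import numpy as np

def failure_rate_and_mse(est_yaw, ref_yaw, fail_thresh_deg=30.0):
    """Failure rate and mean squared error, per the definitions above.

    est_yaw: per-frame yaw estimates (degrees) from the chosen
             perspective, with np.nan where no estimate was available.
    ref_yaw: reference yaw from the manual annotations plus POS.
    """
    est = np.asarray(est_yaw, dtype=float)
    ref = np.asarray(ref_yaw, dtype=float)

    # A frame fails if no pose was produced or the yaw error exceeds 30 deg.
    failed = np.isnan(est) | (np.abs(est - ref) > fail_thresh_deg)
    failure_rate = failed.mean()

    # MSE over the frames where tracking succeeded (pitch and roll
    # would be handled identically).
    ok = ~failed
    mse = float(np.mean((est[ok] - ref[ok]) ** 2)) if ok.any() else float("nan")
    return failure_rate, mse
```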

VI. CONCLUDING REMARKS

In a driving environment, the driver is prone to making large spatial head movements during maneuvers such as lane changes and right/left turns. At these crucial moments, it is important to continuously and reliably track the head of the driver. Since a single perspective of the driver is insufficient during spatially large head movements, we proposed a multi-perspective framework. In this framework, we use the head pose and head dynamics estimated from the previous frame to select which perspective should be used in the current frame for head pose estimation; the symmetry of facial features is used for initialization purposes. We evaluated this system on events during naturalistic on-road driving that induce large spatial head turns and showed that in the multi-perspective framework head tracking was reliable over 90% of the time, whereas it was reliable less than 72% of the time with the single-perspective approach.

In future studies, the multi-perspective framework can be further improved by considering other cues for fusing camera perspectives. One possibility is using the confidence of the detected facial features. Furthermore, it will be interesting to examine whether having more or fewer than the current number of cameras improves the tracking ability and significantly decreases the MSE. As for the vision-based head pose algorithm, we used a generic face model, which artificially introduces error. In the future, we will pursue automatic adaptation to an individual driver's face.

ACKNOWLEDGMENT

The authors would like to thank the support from UC Discovery and industry partners, especially Audi and VW Electronic Research Laboratory. We also thank our UCSD LISA colleagues who helped in a variety of important ways in our research studies.

REFERENCES

[1] (2012, October) Means of transportation to work by selected characteristics, American Community Survey 3-year estimates. [Online].
[2] M. Peden, R. Scurfield, D. Sleet, D. Mohan, A. Hyder, E. Jarawan, and C. Mathers, "World report on road traffic injury prevention," World Health Organization, Tech. Rep., 2004. [Online].
[3] E. Murphy-Chutorian and M. Trivedi, "HyHOPE: Hybrid head orientation and position estimation for vision-based driver head tracking," in IEEE Intelligent Vehicles Symposium, June 2008.
[4] C. Tran and M. Trivedi, "Driver assistance for keeping hands on the wheel and eyes on the road," in IEEE International Conference on Vehicular Electronics and Safety (ICVES), Nov. 2009.
[5] A. Doshi, S. Y. Cheng, and M. M. Trivedi, "A novel active heads-up display for driver assistance," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 39, no. 1, 2009.
[6] A. Doshi, B. Morris, and M. Trivedi, "On-road prediction of driver's intent with multimodal sensory cues," IEEE Pervasive Computing, vol. 10, no. 3, July-September 2011.
[7] A. Doshi and M. Trivedi, "On the roles of eye gaze and head dynamics in predicting driver's intent to change lanes," IEEE Transactions on Intelligent Transportation Systems, vol. 10, no. 3, Sept. 2009.
[8] L. Bergasa, J. Nuevo, M. Sotelo, R. Barea, and M. Lopez, "Real-time system for monitoring driver vigilance," IEEE Transactions on Intelligent Transportation Systems, vol. 7, no. 1, March 2006.
[9] Z. Zhu and Q. Ji, "Real time and non-intrusive driver fatigue monitoring," in IEEE International Conference on Intelligent Transportation Systems, Oct. 2004.
[10] S. Martin, A. Tawari, E. Murphy-Chutorian, S. Y. Cheng, and M. M. Trivedi, "On the design and evaluation of robust head pose for visual user interfaces: Algorithms, databases, and comparisons," in 4th ACM SIGCHI International Conference on Automotive User Interfaces and Interactive Vehicular Applications (AUTO-UI), 2012.
[11] A. Tawari and M. M. Trivedi, "Head dynamic analysis: A multi-view framework," in Workshop on Social Behaviour Analysis, International Conference on Image Analysis and Processing, 2013.
[12] E. Murphy-Chutorian and M. Trivedi, "Head pose estimation in computer vision: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, April 2009.
[13] A. Gee and R. Cipolla, "Determining the gaze of faces in images," Image and Vision Computing, vol. 12, no. 10, 1994.
[14] J. Chen and Q. Ji, "3D gaze estimation with a single camera without IR illumination," in IEEE International Conference on Pattern Recognition (ICPR), Dec. 2008.
[15] J.-G. Wang and E. Sung, "EM enhancement of 3D head pose estimated by point at infinity," Image and Vision Computing, vol. 25, no. 12, 2007.
[16] D. F. Dementhon and L. S. Davis, "Model-based object pose in 25 lines of code," International Journal of Computer Vision, vol. 15, 1995.
[17] E. Murphy-Chutorian and M. Trivedi, "Head pose estimation and augmented reality tracking: An integrated system and evaluation for monitoring driver awareness," IEEE Transactions on Intelligent Transportation Systems, vol. 11, no. 2, June 2010.
[18] J. Saragih, S. Lucey, and J. Cohn, "Face alignment through subspace constrained mean-shifts," in IEEE 12th International Conference on Computer Vision, Oct. 2009.
[19] T. Cootes and C. J. Taylor, "Active shape models - 'smart snakes'," in British Machine Vision Conference, 1992.
[20] Y. Wang, S. Lucey, and J. Cohn, "Enforcing convexity for improved alignment with constrained local models," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2008.
[21] L. Gu and T. Kanade, "A generative shape regularization model for robust face alignment," in Proceedings of the 10th European Conference on Computer Vision: Part I (ECCV '08), 2008.


More information

Social Behavior Prediction Through Reality Mining

Social Behavior Prediction Through Reality Mining Social Behavior Prediction Through Reality Mining Charlie Dagli, William Campbell, Clifford Weinstein Human Language Technology Group MIT Lincoln Laboratory This work was sponsored by the DDR&E / RRTO

More information

DISTANCE MAPS: A ROBUST ILLUMINATION PREPROCESSING FOR ACTIVE APPEARANCE MODELS

DISTANCE MAPS: A ROBUST ILLUMINATION PREPROCESSING FOR ACTIVE APPEARANCE MODELS DISTANCE MAPS: A ROBUST ILLUMINATION PREPROCESSING FOR ACTIVE APPEARANCE MODELS Sylvain Le Gallou*, Gaspard Breton*, Christophe Garcia*, Renaud Séguier** * France Telecom R&D - TECH/IRIS 4 rue du clos

More information

Embedded ECG Based Real Time Monitoring and Control of Driver Drowsiness Condition

Embedded ECG Based Real Time Monitoring and Control of Driver Drowsiness Condition International Journal of Science, Technology and Society 2015; 3(4): 146-150 Published online June 15, 2015 (http://www.sciencepublishinggroup.com/j/ijsts) doi: 10.11648/j.ijsts.20150304.17 ISSN: 2330-7412

More information

A Street Scene Surveillance System for Moving Object Detection, Tracking and Classification

A Street Scene Surveillance System for Moving Object Detection, Tracking and Classification A Street Scene Surveillance System for Moving Object Detection, Tracking and Classification Huei-Yung Lin * and Juang-Yu Wei Department of Electrical Engineering National Chung Cheng University Chia-Yi

More information

Probabilistic Facial Feature Extraction Using Joint Distribution of Location and Texture Information

Probabilistic Facial Feature Extraction Using Joint Distribution of Location and Texture Information Probabilistic Facial Feature Extraction Using Joint Distribution of Location and Texture Information Mustafa Berkay Yilmaz, Hakan Erdogan, Mustafa Unel Sabanci University, Faculty of Engineering and Natural

More information

Automatic Shadow Removal by Illuminance in HSV Color Space

Automatic Shadow Removal by Illuminance in HSV Color Space Computer Science and Information Technology 3(3): 70-75, 2015 DOI: 10.13189/csit.2015.030303 http://www.hrpub.org Automatic Shadow Removal by Illuminance in HSV Color Space Wenbo Huang 1, KyoungYeon Kim

More information

Contexts and 3D Scenes

Contexts and 3D Scenes Contexts and 3D Scenes Computer Vision Jia-Bin Huang, Virginia Tech Many slides from D. Hoiem Administrative stuffs Final project presentation Dec 1 st 3:30 PM 4:45 PM Goodwin Hall Atrium Grading Three

More information

Disguised Face Identification (DFI) with Facial KeyPoints using Spatial Fusion Convolutional Network. Nathan Sun CIS601

Disguised Face Identification (DFI) with Facial KeyPoints using Spatial Fusion Convolutional Network. Nathan Sun CIS601 Disguised Face Identification (DFI) with Facial KeyPoints using Spatial Fusion Convolutional Network Nathan Sun CIS601 Introduction Face ID is complicated by alterations to an individual s appearance Beard,

More information

Partial Calibration and Mirror Shape Recovery for Non-Central Catadioptric Systems

Partial Calibration and Mirror Shape Recovery for Non-Central Catadioptric Systems Partial Calibration and Mirror Shape Recovery for Non-Central Catadioptric Systems Abstract In this paper we present a method for mirror shape recovery and partial calibration for non-central catadioptric

More information

Time-to-Contact from Image Intensity

Time-to-Contact from Image Intensity Time-to-Contact from Image Intensity Yukitoshi Watanabe Fumihiko Sakaue Jun Sato Nagoya Institute of Technology Gokiso, Showa, Nagoya, 466-8555, Japan {yukitoshi@cv.,sakaue@,junsato@}nitech.ac.jp Abstract

More information