Gaze Tracking by Using Factorized Likelihoods Particle Filtering and Stereo Vision


Gaze Tracking by Using Factorized Likelihoods Particle Filtering and Stereo Vision

Erik Pogalin
Information and Communication Theory Group, Delft University of Technology
P.O. Box 5031, 2600 GA Delft, The Netherlands

Abstract

In the area of visual perception research, information about a person's attention on visual stimuli shown on a screen can be used for various purposes, such as studying the phenomenon of human vision itself or investigating eye movements while that person is looking at images and video sequences. This paper describes a non-intrusive method to estimate the gaze direction of a person by using stereo cameras. First, facial features are tracked with a particle filtering algorithm to estimate the 3D head pose. The 3D gaze vector is then calculated by finding the eyeball center and the cornea center of both eyes. For the purpose mentioned above, we also propose a screen registration scheme that accurately locates a planar screen in world coordinates to within 2 mm. With this information, the gaze projection on the screen can be calculated. The experimental results indicate that an average gaze direction error of about 7° can be achieved.

Keywords: Gaze tracking, facial feature tracking, particle filtering, stereo vision.

1 INTRODUCTION

An eye gaze tracker is a device that estimates the direction of the gaze of human eyes. Gaze tracking can be used for numerous applications, ranging from diagnostic applications such as psychological and marketing research to interactive systems in the Human-Computer Interaction (HCI) domain ([4], [17]). For example, studying eye movements during reading can be used to diagnose reading disorders. Investigating the user's attention on advertisements can help to improve their effectiveness. In the HCI domain, gaze tracking can be used as a way to interact with machines, e.g. as a pointing device for disabled people operating a computer or as a support system in cars that alerts drivers when they fall asleep.

Several commercial gaze tracking products exist that are highly accurate and reliable. They are mostly based on a so-called infrared technique. Tobii [18] and ERT [6] developed systems that use a motorized camera and infrared lighting to track the eye gaze. Their products are mainly used for visual perception research. Other companies such as Fourward [8] and ASL [1] use head-mounted cameras to track the user's eyes from a close distance. These kinds of products are suitable for user interaction as well as visual perception research.

There are two disadvantages which make these infrared-based gaze tracking products less attractive for wide use. Most of them require special hardware such as motorized cameras, helmets or goggles, making the product expensive (between US$15,000 and US$150,000 as reported in [19]). Furthermore, this special hardware can cause discomfort and restricts the user's movements.

In this paper, we design a gaze tracking scheme in the framework of visual perception research. In a typical experiment, users are asked to watch visual stimuli that are displayed on a screen [4]. Their gaze projection on the text, image or video sequence shown on the screen can be used for various purposes, such as diagnosing reading disorders, analyzing the effectiveness of advertisements or investigating differences in attention while evaluating the image quality of a video sequence.
Considering these applications and the two disadvantages mentioned above, we summarized the following requirements as guidelines during the system design:

- The system should detect and track the user's gaze on a 2D screen by estimating the intersection point between the gaze ray and the screen.
- The system must use a non-intrusive technique.
- The system should track a single user at a time.
- The system does not have to work in real time.
- The system should be made as cheap as possible, and it should be possible to use the system for user-interaction purposes.
- The average angular gaze error should not exceed 5°.

Inspired by the work of Matsumoto et al. [15] and Ishikawa et al. [12], who used completely non-intrusive methods to estimate gaze directions in 3D, we make another contribution to this type of solution by introducing some modifications to their method. Our tracking scheme combines the auxiliary particle filtering algorithm of [16] with stereo information to detect and track facial features such as the eye and mouth corners. The 3D locations of these features determine the pose of the head. Furthermore, we use a 3D eye model which assumes that the eyeball is a sphere. Unlike Ishikawa et al., we choose to use the corners of the eye socket instead of corners located on the eyeball surface. This makes the tracking more robust to occlusions and eye blinks. Finally, we devised a screen registration scheme to locate a 2D surface that is not visible in the camera view (such as a monitor positioned behind the camera) by using a special mirror. In this way, the screen location in the world coordinate system is known accurately, so that we can directly calculate the intersection of the gaze ray with the screen. Besides the screen, other objects could be registered in the world coordinate system as well. With minor modifications the system could therefore easily be applied for user-interaction purposes.

This paper is organized as follows. In section 2 we present a short summary of the work that has been done previously in eye gaze tracking. The outline of our gaze tracking system is presented in section 3. In section 4 we discuss the calibration of the cameras and the registration of the 2D screen. Next, the two most important modules of the system, head pose tracking and gaze direction estimation, are described in sections 5 and 6, respectively. The system performance is evaluated and the results are given in section 7, and finally, section 8 concludes this paper with a discussion and recommendations for future work.

2 PREVIOUS WORK

In the last few years, gaze tracking research has concentrated on intrusive as well as non-intrusive video-based techniques. Using image processing and computer vision techniques, it is possible to compute the gaze direction without the need for any kind of physical contact with the user. The most popular technique is the use of infrared lighting to capture several reflections from parts of the eye (pupil, cornea and lens reflections) [4]. The relative position of these reflections changes with pure eye rotation, but remains relatively constant under minor head movements. With appropriate calibration procedures, this method estimates the user's point of regard on a planar surface (e.g. a PC monitor) on which calibration points are displayed. Several variations to interpolate the gaze from known calibration points have been reported in the literature, including the use of artificial neural networks ([2], [5], [13]). This infrared technique is widely applied in current commercial gaze trackers. However, it needs a high-resolution image of the eye, which explains the use of expensive hardware, such as a zoom-capable camera mounted below the screen or attached to a helmet.

Another approach that has been developed recently detects the head pose separately and uses this information to estimate the gaze direction in 3D. This method has several advantages compared to the infrared technique. Aside from the cheap hardware requirements (a pair of normal cameras and a PC), tracking is not restricted to the point of regard on a planar object.
Since the gaze is tracked in a 3D world, we can also intersect the gaze with other objects of interest, provided that those objects are properly registered in the 3D world (i.e. their locations are accurately known). Because of this, the system can easily be modified for interaction purposes.

Matsumoto et al. [15] used stereo cameras to detect and track the head pose in 3D. A 3D model for each user is built by selecting several facial features in the initialization phase. This 3D pose is rigidly tracked over time. To measure the gaze direction, the location of the eyeball center is calculated from the head pose and the cornea center is extracted from the stereo images. The vector that connects the eyeball center and the cornea center is the estimate of the gaze direction. The use of Active Appearance Models (AAM) has been proposed by Ishikawa et al. [12]. A 3D AAM is fitted to the user's face and tracked over time by using only a single camera. Similar steps as in [15] are taken to measure the 3D gaze vector. Another camera is used to view the scene, and by asking the user to look at several points in the world, the relative gaze orientation with respect to the projection of these points in the view-camera image can be interpolated.

This paper makes another contribution to the 3D gaze tracking method. The pose of the head is tracked by using the particle filtering algorithm proposed in [16]. Combined with stereo vision, the 3D head pose can be recovered. We use a slightly different eyeball model than the one used in [12] and [15]. Since visual perception research is our main concern, we also devise a screen registration scheme to locate a planar screen with respect to the cameras. With this information, the gaze projection on the screen can be calculated.

3 SYSTEM OUTLINE

Our gaze tracking system consists of three main modules: head pose tracking, gaze direction estimation and intersection calculation (figure 1). We use a 3D facial feature model to determine the 3D pose of the head. Together with a 3D eye model, the 3D gaze vector can be determined. Figure 2 shows the hardware setup of the system. A pair of USB cameras placed below the monitor is used to capture the user in the scene.

Figure 1. Block diagram of the gaze tracking system. The left part shows the off-line steps that have to be done before the actual tracking is performed.

Figure 2. Hardware setup of the gaze tracking system. A pair of USB cameras placed below the monitor is used to capture the user in the scene.

Several pre-processing steps must be done before performing the actual tracking. First of all, the stereo cameras must be calibrated. In the calibration process, the left camera reference frame is used as the world reference frame. Secondly, we need to register the screen position in world coordinates. In this way, after calibrating the cameras and the screen, we can directly compute the intersection of the gaze ray with the screen plane. The calibration procedure is discussed in detail in section 4. The third and last step is to estimate the user-dependent parameters of the 3D facial feature model and the 3D eye model. The facial feature model is built by taking several shots of the head under different poses. The eye model is created by acquiring a training sequence in which the user looks at several calibration points on the screen. The estimated parameters are used for the actual tracking. We refer to section 6 for more details on the eyeball model used.

The head pose tracking (section 5) is initialized manually in the first frame received by the cameras. In this initialization phase, we choose the facial features that we want to track and use the image coordinates of these features (in the left and right frame) as start positions for the head pose tracking. In our system, the corners of the eyes and mouth are selected. A rectangular color window defined around each chosen feature is used as reference template. These facial features are tracked throughout the whole video stream by using the particle filtering algorithm proposed in [16]. The system then performs stereo triangulation on each facial feature. The output of this module is the 3D locations of all features, which determine the pose of the head in the current frame.

Once we know the 3D locations of the eye corners, the location of the eyeball center can be determined (see section 6). A small search window is defined around the eye corners to search for the cornea center in the left and right frame. The 3D locations of the cornea centers are found by triangulation. The gaze is then defined by a 3D vector connecting the eyeball center and the cornea center. Two gaze vectors are acquired from the gaze direction estimation module, one for the left and one for the right eye. The last step is to intersect the gaze vectors from the left and right eye with the object of interest (e.g. the monitor screen). The intersection is done by extending each vector from the eyeball center until it reaches the screen. To compensate for the effect of noise, we take the average of the left and right projected gaze points and output the resulting single 2D screen coordinate. In the following sections, each module of the system will be discussed in more detail.
4 CAMERA CALIBRATION

This section discusses the calibration of the cameras and the registration of the 2D screen. The results are the intrinsic parameters of the cameras and the extrinsic parameters of the cameras and the screen (i.e. the relative position of the cameras and the screen with respect to the world reference frame). In section 4.1 we deal with the calibration of the stereo cameras, followed by the screen registration in section 4.2.

4.1 Calibrating Stereo Cameras

Camera calibration is done by using the method proposed by Zhang [20]. This method only requires the cameras to observe a planar checkerboard grid shown at different orientations (figure 3). In the following we describe the calibration notation that will be used in the remaining sections. A 2D point is denoted by x = [u v]^T and a 3D point by X = [x y z]^T. We use x̃, X̃ to denote the homogeneous coordinates of a 2D and a 3D point, respectively.

Figure 3. The setup used for stereo camera calibration. The origin of each camera frame is located at the pinhole of the camera. The left camera frame is also used as the world frame (w: world, l: left camera, r: right camera and g: calibration grid).

A pinhole camera model is used with the following notation:

λ x̃_im = K X_c,   X_c = R X_w + T    (1)

which relates a 3D point X_w = [x_w y_w z_w]^T in the world reference frame to its image projection x_im = [u v 1]^T in pixels, up to a scale factor λ. The matrix K, called the camera or calibration matrix, is given by

K = \begin{bmatrix} f_x & \alpha f_x & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}

and contains the intrinsic parameters: the focal lengths f_x and f_y, the coordinates of the principal point (u_0, v_0) and the skewness of the image axes α. The same 3D point X_w can be represented in the camera reference frame by X_c = [x_c y_c z_c]^T, which is related to X_w by a 3x3 rotation matrix R and a 3x1 translation vector T. This frame transformation can also be written as a single matrix:

M = \begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix}

In this paper we use the left camera frame as the world frame, so for the left and right camera we have:

X̃_l = M_wl X̃_w,   M_wl = I_{4×4}
X̃_r = M_wr X̃_w,   M_wr = \begin{bmatrix} R_wr & T_wr \\ 0 & 1 \end{bmatrix} = M_lr    (2)

where I_{N×N} is an identity matrix of size N×N and M_lr denotes the extrinsic parameters of the stereo cameras, i.e. the transformation between the left and right camera frame.

We use a lens distortion model that incorporates radial and tangential distortion coefficients. Let x_d be the normalized and distorted image projection in the camera reference frame:

x_d = \begin{bmatrix} x_c / z_c \\ y_c / z_c \end{bmatrix} = \begin{bmatrix} x \\ y \end{bmatrix}

and r^2 = x^2 + y^2. The undistorted coordinate x_ud is defined as follows [3]:

x_ud = D_r x_d + D_t    (3)

where

D_r = 1 + k_1 r^2 + k_2 r^4 + k_5 r^6,
D_t = \begin{bmatrix} 2 k_3 x y + k_4 (r^2 + 2x^2) \\ k_3 (r^2 + 2y^2) + 2 k_4 x y \end{bmatrix}

are the radial and tangential distortion terms, respectively. These coefficients can be represented by a single vector k = [k_1 k_2 k_3 k_4 k_5]^T. Finally, equation (1) can be modified to include the distortion model:

x̃_im = K x̃_ud    (4)
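To make the notation concrete, the sketch below projects a single 3D world point into pixel coordinates following equations (1), (3) and (4) as written above. It is only an illustration of the model, not code from the paper; the numerical camera parameters at the bottom are made up for the example.

```python
import numpy as np

def project_point(X_w, K, R, T, k):
    """Project a 3D world point to pixel coordinates (equations (1), (3), (4)).

    K : 3x3 calibration matrix; R, T : world-to-camera rotation and translation;
    k : [k1, k2, k3, k4, k5] radial (k1, k2, k5) and tangential (k3, k4) coefficients.
    """
    X_c = R @ X_w + T                               # world frame -> camera frame
    x, y = X_c[0] / X_c[2], X_c[1] / X_c[2]         # normalized projection x_d
    r2 = x * x + y * y
    k1, k2, k3, k4, k5 = k
    D_r = 1 + k1 * r2 + k2 * r2**2 + k5 * r2**3     # radial factor
    D_t = np.array([2 * k3 * x * y + k4 * (r2 + 2 * x * x),
                    k3 * (r2 + 2 * y * y) + 2 * k4 * x * y])   # tangential term
    x_ud = D_r * np.array([x, y]) + D_t             # equation (3)
    u, v, w = K @ np.array([x_ud[0], x_ud[1], 1.0]) # equation (4), homogeneous pixels
    return np.array([u / w, v / w])

# Example with made-up parameters: a point about 60 cm in front of the left camera.
K = np.array([[800.0, 0.0, 160.0], [0.0, 800.0, 120.0], [0.0, 0.0, 1.0]])
print(project_point(np.array([0.05, 0.02, 0.6]), K, np.eye(3), np.zeros(3),
                    [-0.25, 0.1, 1e-3, -1e-3, 0.0]))
```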

To estimate the intrinsic and extrinsic camera parameters, the following steps are taken:

- Acquiring stereo images. Position the two cameras so that an overlapping view of the user's head is achieved. Take a series of images of the calibration grid (figure 4) under different plane orientations.
- Extracting the grid reference frame. For each plane orientation, the four intersection corners of the pattern are chosen manually (the white diamonds in figure 4). The inner intersections are detected automatically by estimating the planar homography between the grid plane and its image projection [20]. All detected intersection points are then refined by the Harris corner detector [3] to achieve sub-pixel accuracy. From each image, we obtain the image coordinate of each intersection point x_im and its coordinate in the grid reference frame X_g = [x_g y_g 0]^T.
- Estimating individual camera parameters. The intrinsic parameters and the distortion coefficients of each camera are estimated by minimizing the pixel reprojection error of all intersection points in all images, in the least-squares sense. The initial guess for the parameters is made by setting the distortion coefficients to zero and choosing the centers of the images as the principal points. The initial focal lengths are calculated from the orthogonal vanishing points constraint [3].
- Estimating the parameters of both cameras. The individually optimized parameters for each camera from the previous step are now used as the initial guess for the total optimization (considering both cameras). At the end we obtain the optimized distortion coefficients of both cameras, the calibration matrix of each camera and the external parameters relating the two cameras.

Figure 4. The extracted intersection points from a calibration grid. The four intersection corner points are chosen manually (white diamonds), while the inner points are automatically extracted by using plane homography.

4.2 Registering the Screen to the World Frame

In order to intersect the gaze vector with the screen, the screen location with respect to the world frame must be determined. In other words, we need to determine the transformation M_ws from the world frame to the screen frame. We use the following method to estimate this transformation. A mirror is placed in front of the camera to capture the reflection of the screen. The camera perceives this reflection as if another screen is located at the same distance from the mirror but in the opposite direction (see figure 5). We attach a reference frame to each of the objects: O_w, O_m, O_v and O_s for the world, mirror, virtual screen and real screen frame, respectively. If we know the location of the mirror and this virtual screen, then we can also calculate the location of the real screen. By taking three co-planar points on the screen in world coordinates (e.g. points that lie on the XY-plane of the screen), we get the first two orthogonal vectors that define the screen reference frame. The third one can be computed by taking the cross product of these two vectors.

Figure 5. The hardware setup used for the screen registration. The stereo cameras are represented by two ellipses in front of the screen. Each object is shown with its own reference frame (w: world, m: mirror, v: virtual screen and s: screen).

Figure 6. The mirror used for the registration of the screen. A part of the reflection layer is removed, so that the camera can see the calibration pattern put behind the mirror. Compare the extracted reference frame with figure 5.

By displaying a calibration pattern on the screen, the virtual-screen-to-world frame transformation M_vw can be computed from the reflection of that pattern. With this information, we can choose three co-planar points and calculate their 3D world coordinates v^w_orig, v^w_long and v^w_short (figure 5). Then, applying the following transformation to each of these points results in the corresponding 3D screen points s^w_orig, s^w_long and s^w_short in world coordinates:

s̃^w_i = M_mw \, diag(1, 1, -1, 1) \, M_wm \, ṽ^w_i    (5)

In equation (5) the virtual points are first transformed to mirror coordinates via M_wm. The second matrix mirrors the points across the mirror's XY-plane.
After that, multiplying again with the inverse transformation M_mw gives the screen points in world coordinates s^w_i.

To determine the location of the mirror, a part of the mirror's reflection layer is removed, making that part transparent. A calibration pattern is placed behind the glass. For the calculation of the world-to-mirror frame transformation M_wm, the grid frame extraction from section 4.1 must be slightly modified: instead of extracting intersection points from the whole grid, only the points on the grid border need to be detected (figure 6).

The last step is to determine M_ws from the calculated screen points. The rotation and translation components of the transformation can be determined as follows:

s_xaxis = s^w_long - s^w_orig
s_yaxis = s^w_short - s^w_orig
s_zaxis = s_xaxis × s_yaxis

R_ws = [ŝ_xaxis  ŝ_yaxis  ŝ_zaxis]^T
T_ws = -R_ws s^w_orig    (6)

with ŝ_i as the normalized version of s_i.
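The two registration steps just described can be summarized in a few lines. The sketch below assumes the homogeneous world-to-mirror transformation M_wm and the three virtual-screen points are already available from the grid extraction, and uses the reflection matrix diag(1, 1, -1, 1) for the mirroring across the mirror's XY-plane; it is a minimal illustration of equations (5) and (6), not the authors' implementation.

```python
import numpy as np

MIRROR_XY = np.diag([1.0, 1.0, -1.0, 1.0])   # reflection across the mirror's XY-plane

def reflect_to_real_screen(v_w, M_wm):
    """Map a virtual-screen point (world coords) to the real screen (equation (5))."""
    M_mw = np.linalg.inv(M_wm)                # mirror-to-world transformation
    v_h = np.append(v_w, 1.0)                 # homogeneous coordinates
    s_h = M_mw @ MIRROR_XY @ M_wm @ v_h
    return s_h[:3]

def world_to_screen_transform(s_orig, s_long, s_short):
    """Build R_ws, T_ws from three co-planar screen points (equation (6))."""
    x_axis = s_long - s_orig
    y_axis = s_short - s_orig
    z_axis = np.cross(x_axis, y_axis)
    R_ws = np.vstack([a / np.linalg.norm(a) for a in (x_axis, y_axis, z_axis)])
    T_ws = -R_ws @ s_orig                     # X_s = R_ws X_w + T_ws maps s_orig to 0
    return R_ws, T_ws
```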

Since the camera calibration is only accurate for the space in which the calibration grid was positioned, we need to acquire two sets of images. The first set covers the space where the user's head and the mirror are supposed to be located, in front of the camera. For the second set, we place the calibration grid at the estimated location of the screen reflection (the virtual screen), farther away from the camera. The calibration is then performed over the joint set of images. After that, the screen registration described above can be carried out.

5 HEAD POSE TRACKING

In this section the head pose tracking module is discussed in detail. First, a short summary of the particle filtering algorithm is provided in section 5.1, followed by a description of the factorized likelihoods particle filtering scheme proposed in [16] (section 5.2). The 3D facial feature model that is used in our scheme is described in section 5.3. Finally, in section 5.4 we discuss the role of particle filtering in the head tracking module and propose the use of stereo information as prior knowledge for the tracking. The choice of particle filtering parameters is also discussed there.

5.1 Particle Filtering

Recently, particle filtering has become a popular algorithm for visual object tracking. In this algorithm, a probabilistic model of the state of an object (e.g. location, shape or appearance) and its motion is applied to analyze a video sequence. A posterior density p(x|Z) can be defined over the object's state, parameterized by a vector x, given the measurements Z from the images up to time t. This density is approximated by a discrete set of weighted samples, called particles (figure 7). At time t, this set is represented by {s_k, π_k}, which contains K particles s_1, s_2, ..., s_K and their weights π_1, π_2, ..., π_K (for easier notation, we drop the time index).

Figure 7. An illustration of the particle-based representation of a 1-dimensional posterior distribution. The continuous density is approximated by a finite number of samples or particles s_k (depicted by the circles). Each particle is assigned a weight π_k (represented by the circle radius) in proportion to the value of the observation density p(z|x = s_k), which is an estimate of the posterior density at s_k.

The main idea of particle filtering is to update this particle-based representation of the posterior density p(x|Z) recursively from previous time frames:

p(x|Z) ∝ p(z|x) p(x|Z⁻)
p(x|Z⁻) = ∫_{x⁻} p(x|x⁻) p(x⁻|Z⁻)    (7)

where the superscript ⁻ denotes the previous time instant. See [10] for the complete derivation of this equation. Beginning from the posterior of the previous time instant p(x⁻|Z⁻), a number of new particles are randomly sampled from the previous set {s_k, π_k}, which is approximately equal to sampling from p(x⁻|Z⁻). Particles with higher weights have a higher probability of being picked for the new set, while particles with lower weights can be discarded. Next, each of the chosen particles is propagated via the transition probability p(x|x⁻), resulting in a new set of particles. This is approximately equivalent to sampling from the density p(x|Z⁻) (equation (7), second line). In the last step, new weights are assigned to the new particles, measured from the observation density, that is, π_k = p(z|x = s_k). The new set of pairs {s_k, π_k} represents the posterior probability p(x|Z) at the current time t.
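The resample-propagate-reweight cycle described above can be written compactly as follows. This is a generic sketch for a single tracked object: the transition and likelihood callables stand in for p(x|x⁻) and p(z|x), and the random-walk example at the bottom uses made-up numbers.

```python
import numpy as np

def particle_filter_step(particles, weights, transition, likelihood, rng):
    """One update of the particle set {s_k, pi_k} (section 5.1).

    particles : (K, D) array, weights : (K,) array summing to 1,
    transition(p, rng) samples from p(x | x-), likelihood(p) returns p(z | x = p).
    """
    K = len(particles)
    # 1. Resample: particles with higher weights are picked more often.
    idx = rng.choice(K, size=K, p=weights)
    chosen = particles[idx]
    # 2. Propagate each chosen particle through the transition model.
    propagated = np.array([transition(p, rng) for p in chosen])
    # 3. Reweight with the observation density and normalize.
    new_w = np.array([likelihood(p) for p in propagated])
    new_w /= new_w.sum()
    return propagated, new_w

# Tiny 2D example: random-walk transition, Gaussian likelihood around a "measurement".
rng = np.random.default_rng(0)
parts = rng.normal([100.0, 80.0], 5.0, size=(200, 2))
w = np.full(200, 1.0 / 200)
meas = np.array([103.0, 78.0])
parts, w = particle_filter_step(
    parts, w,
    transition=lambda p, r: p + r.normal(0.0, 2.0, size=2),
    likelihood=lambda p: np.exp(-np.sum((p - meas) ** 2) / (2 * 4.0 ** 2)),
    rng=rng)
print("mean position:", (w[:, None] * parts).sum(axis=0))   # weighted average of the particles
```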
Once the new set is constructed, the moments of the state at the current time t can be estimated. We can, for instance, take the weighted average of the particles, obtaining the mean position:

E[x] = Σ_{k=1}^{K} π_k s_k    (8)

In our case, we consider a facial feature such as an eye or mouth corner as a single object, with its image location as the state. In every time frame, the facial feature location is tracked by evaluating the appearance of the feature. Several problems occur when this algorithm is used to track multiple objects [16]. One of these problems is that propagating each object independently deteriorates the tracking robustness when there are interdependencies between the objects. By incorporating this information in the tracking scheme, the propagation becomes more efficient, i.e. fewer particles are wasted on areas with low likelihood. For example, if we track multiple facial features individually without any information about the relative distances between the features, the rigidness of the face is lost. By introducing some constraints in the propagation of each facial feature, the rigidness of the face is preserved.

5.2 Auxiliary Particle Filtering with Factorized Likelihoods

The method summarized below, proposed in [16], is one of the improvements to particle filtering for the case of tracking multiple objects. The state is partitioned as x = [x_1 x_2 ... x_M]^T such that x_i (i = 1, 2, ..., M) represents the state of each object and M is the number of objects.

Each partition is propagated and evaluated independently:

p(x_i|Z) ∝ p(z|x_i) ∫_{x⁻} p(x_i|x⁻) p(x⁻|Z⁻)    (9)

Similar to the notation in section 5.1, each posterior p(x_i|Z) is represented by a set of sub-particles and their weights {s_ik, π_ik}, with k = 1, 2, ..., K and K the number of sub-particles. After separately propagating those sets, a proposal distribution is constructed from the individual posteriors: g(x) = ∏_i p(x_i|Z). By ignoring the interdependencies between the different x_i, we can construct the sample s_k = [s_1k s_2k ... s_Mk]^T (a concatenation of sub-particles) by independently sampling from each p(x_i|Z).

The individual propagation steps are summarized below. The density p(x|Z) now represents the posterior of all objects, instead of only one object. Starting from the set {s⁻_k, π⁻_k} of the previous time frame, the following steps are repeated for every partition i:

1) Propagate all K particles s⁻_k via the transition probability p(x_i|x⁻) in order to arrive at a collection of K sub-particles µ_ik. Note that while s⁻_k has the dimensionality of the state space x, µ_ik has the dimensionality of the partitioned state x_i.
2) Evaluate the observation likelihood associated with each sub-particle µ_ik, that is, let λ_ik = p(z|x_i = µ_ik).
3) Sample K particles from the collection {s⁻_k, λ_ik π⁻_k}. This favors particles with high λ_ik, i.e. particles which end up in areas with high likelihood when propagated with the transition probability.
4) Propagate each chosen particle s⁻_k via the transition probability p(x_i|x⁻) in order to arrive at a collection of K sub-particles s_ik. Note that s_ik has the dimensionality of partition i.
5) Assign a weight π_ik to each sub-particle as follows:

   w_ik = p(z|x_i = s_ik) / λ_ik,   π_ik = w_ik / Σ_j w_ij

After this procedure, we have M posteriors p(x_i|Z), each represented by {s_ik, π_ik}. Then, sampling K particles from the proposal function g(x) is approximately equivalent to constructing each particle s_k = [s_1k s_2k ... s_Mk]^T by independently sampling each s_ik from p(x_i|Z). Finally, in order for these particles to represent the total posterior p(x|Z), we need to assign to each particle a weight equal to [11]:

π_k = p(s_k|Z⁻) / ∏_i p(s_ik|Z⁻)    (10)

In other words, the re-weighting process favors particles for which the joint probability is higher than the product of the marginals. In the general case that the above equation cannot be evaluated by an appropriate model, the weights need to be estimated. Here, prior information such as the interdependencies between the objects is utilized. After normalizing the sum to one again, we end up with a collection {s_k, π_k} as the particle-based representation of p(x|Z).
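A sketch of steps 1-5 and the re-weighting of equation (10) is given below. It mirrors the structure of the scheme in [16] but is not the authors' code; the transition and likelihood models are passed in as callables, and the reweight callable stands in for whatever approximation of equation (10) is used (in this paper, the shape prior of section 5.4.3).

```python
import numpy as np

def auxiliary_pf_factorized(particles, weights, transition, likelihood, reweight, rng):
    """One time step of auxiliary particle filtering with factorized likelihoods.

    particles : (K, M, D) array -- K particles, M partitions (objects), D dims each.
    transition(x, rng)  -- samples partition i's state from p(x_i | x-).
    likelihood(i, x_i)  -- evaluates p(z | x_i).
    reweight(particle)  -- approximates equation (10) for a full particle.
    """
    K, M, D = particles.shape
    new_parts = np.empty_like(particles)
    for i in range(M):
        # Steps 1-2: propagate partition i of every old particle and score it.
        mu = np.array([transition(particles[k, i], rng) for k in range(K)])
        lam = np.array([likelihood(i, mu[k]) for k in range(K)])
        # Step 3: auxiliary resampling, favouring particles that land in likely areas.
        aux = lam * weights
        idx = rng.choice(K, size=K, p=aux / aux.sum())
        # Steps 4-5: propagate the chosen particles again and weight by p(z|s_ik)/lambda_ik.
        s_i = np.array([transition(particles[k, i], rng) for k in idx])
        w_i = np.array([likelihood(i, s_i[k]) for k in range(K)]) / lam[idx]
        w_i /= w_i.sum()
        # Sample partition i of the new particles from its marginal posterior.
        new_parts[:, i] = s_i[rng.choice(K, size=K, p=w_i)]
    # Re-weighting (equation (10)): favour particles whose joint configuration is
    # more probable than the product of the marginals, here via a supplied prior.
    new_w = np.array([reweight(p) for p in new_parts])
    return new_parts, new_w / new_w.sum()
```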
5.3 3D Facial Feature Model

The facial feature model in our scheme consists of two components:

- templates of the facial features' appearance
- relative 3D coordinates of the facial features (reference face model)

The facial features shown in figure 8 are defined as the corners of the eyes and mouth. This facial feature model is user-dependent and must be built before tracking can be performed. First, a stereo snapshot of the head is taken. From this shot the relative 3D positions of the facial features are extracted by manually locating the features in the left and right images and triangulating those features; together they form a reference shape model for the user's face. Next, at the beginning of each tracking process (initialization phase), the start positions of the facial features in the left and right frames are selected manually. Simultaneously, a rectangular image template around each feature is acquired. These templates are used in the tracking process.

Figure 8. The 3D facial feature model. On the left the facial feature templates are shown. On the right we see their locations in 3D, calculated from stereo images. The triangle represents the 2D face plane, formed by connecting the average locations of all three feature pairs.

5.4 Multiple Facial Feature Tracking

In this section we use the auxiliary particle filtering scheme described in the previous section for the problem of multiple facial feature tracking. Figure 9 shows the overview of the head tracking module in which the facial features are tracked. The facial feature templates from the initialization phase are used to track the features in 2D. The output of each particle-filtering block is a set of particles that represents the distribution of the 2D facial feature locations, for the left and right image respectively: {s_k, π_k}_L and {s_k, π_k}_R. In order to do the re-weighting process of equation (10), we use the reference face model (figure 8) as prior information on the relative 3D positions of the facial features.
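Both the construction of the reference shape model and the per-frame combination of left and right particles rely on triangulating corresponding image points from the two calibrated views. The paper does not spell out the triangulation method; the sketch below uses standard linear (DLT) triangulation from the two 3x4 projection matrices, which is one common choice (the pixel coordinates are assumed to be undistorted first).

```python
import numpy as np

def triangulate_dlt(x_left, x_right, P_left, P_right):
    """Linear (DLT) triangulation of one stereo correspondence.

    x_left, x_right : undistorted pixel coordinates [u, v] in each view.
    P_left, P_right : 3x4 projection matrices (world frame = left camera frame).
    Returns the 3D point in world coordinates.
    """
    A = np.vstack([
        x_left[0] * P_left[2] - P_left[0],
        x_left[1] * P_left[2] - P_left[1],
        x_right[0] * P_right[2] - P_right[0],
        x_right[1] * P_right[2] - P_right[1],
    ])
    _, _, Vt = np.linalg.svd(A)     # solution = right singular vector of smallest value
    X = Vt[-1]
    return X[:3] / X[3]
```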

Figure 9. Block diagram of the head pose tracking module. Particle filtering is used to track the 2D locations of the facial features in the left and right frame.

We combine the two particle sets from the left and right image into a set of 3D particles by triangulating each left and right particle (one-to-one correspondence), and compare each 3D particle with the reference face model to calculate the weights π_k,3D. These weights are then assigned to the left and right sets (π_k,L and π_k,R), and the individual propagation for the next frame can start again. From each frame we can roughly estimate the 3D locations of the facial features by calculating the weighted average of the 3D particles (equation (8)). The reference face model is then fitted to these 3D points to refine the estimate of the head pose in the current frame. In the following subsections we describe the choice of the state, the observation model and the transition model used for the 2D tracking. After that, we discuss how the priors are used to take the interdependencies between the facial features into account.

5.4.1 State and Transition Model

We consider each facial feature as an object. For every facial feature i, the object state is represented by x_i = [u_i v_i u'_i v'_i]^T, with [u_i v_i] and [u'_i v'_i] as the current and the previous 2D image coordinates of that feature, respectively. We choose to include the previous image coordinates in order to take the object's motion velocity and trajectory into account. To simplify the evaluation of the transition density, we assume that p(x_i|x⁻) = p(x_i|x⁻_i), which means that each feature can be propagated individually. A second-order process with Gaussian noise is used for the individual propagation of each feature:

p(x_i|x⁻_i) ∼ \begin{bmatrix} 1+α & 0 & -α & 0 \\ 0 & 1+β & 0 & -β \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} x⁻_i + N(0, σ_n)    (11)

with α, β ∈ [0, 1] as weight factors that determine the strength of the contribution of the horizontal and vertical motion velocity of a particle in the transition model.

5.4.2 Observation Model

After the 2D particles are propagated in steps 1 and 4 of section 5.2, the weight of each sub-particle needs to be determined. This is done by evaluating the observation likelihood p(z|x_i). We use the same observation model as proposed in [16]. A template-based method is used as the measurement z from the images. The color difference between a reference template and an equally sized window centered on each sub-particle is used as a measure of the weight of the particle, that is, the probability of a sub-particle being the location of a facial feature. Let the reference template be r_i, and let the window centered on a sub-particle be o_i. The color-based difference is then defined as [16]:

c(o_i, r_i) = (o_i - E{o_i,Y}) - (r_i - E{r_i,Y})    (12)

where the subscript Y denotes the luminance component of the template and E{A} is the mean of all elements in A. The matrix c(o_i, r_i) contains the RGB color difference between o_i and r_i.
Finally, the scalar color distance between those two matrices is defined by:

d(o_i, r_i) = E{ρ(c(o_i, r_i))},   ρ(c(·)) = |c(·)_R| + |c(·)_G| + |c(·)_B|    (13)

where ρ(·) is a robust function defined as the L1-norm of the color channels per pixel. The observation likelihood is defined as:

p(z|x_i) ∝ ε_o + exp(-d(o_i, r_i)² / (2σ_o²))    (14)

where σ_o and ε_o are the model parameters (see figure 10). The parameter σ_o determines the steepness of the curve, that is, how fast the curve drops for bad particles (i.e. particles that have low similarity with the reference template). The parameter ε_o is used to prevent particles from getting stuck on local maxima when the object is lost. To improve the ability to recover from a lost object, ε_o should be small but non-zero [14].

Figure 10. The observation model.
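A small sketch of the observation likelihood of equations (12)-(14) is given below. The patches are H x W x 3 RGB arrays; the Rec. 601 luma weights and the values of sigma_o and eps_o in the example are assumptions, since the paper does not state them.

```python
import numpy as np

def luminance(img):
    # Rec. 601 luma weights -- an assumption; the paper does not state its luminance formula.
    return img @ np.array([0.299, 0.587, 0.114])

def observation_likelihood(window, template, sigma_o, eps_o):
    """Colour-based observation likelihood of equations (12)-(14).

    window, template : HxWx3 float RGB patches of equal size (o_i and r_i).
    """
    c = (window - luminance(window).mean()) - (template - luminance(template).mean())  # eq (12)
    rho = np.abs(c).sum(axis=2)          # L1 norm of the colour channels per pixel
    d = rho.mean()                       # eq (13): expectation over all pixels
    return eps_o + np.exp(-d**2 / (2.0 * sigma_o**2))   # eq (14), up to normalization

# Example with a made-up 11x11 patch; sigma_o and eps_o are illustrative values only.
rng = np.random.default_rng(1)
ref = rng.random((11, 11, 3))
obs = np.clip(ref + rng.normal(0, 0.05, ref.shape), 0, 1)
print(observation_likelihood(obs, ref, sigma_o=0.1, eps_o=0.05))
```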

5.4.3 Priors

To approximate the re-weighting process defined in equation (10), we use a similar approach as in the calculation of the observation likelihood. The prior information on the relative 3D positions of the facial features is now used. After we get the new particle sets from the left and right images, {s_k, π_k}_L and {s_k, π_k}_R, we combine these sets (one-to-one correspondence) by triangulating each particle pair, resulting in K 3D particles. The weight of each 3D particle π_k is then approximated by:

π_k = ε_p + exp(-d_k² / (2σ_p²))    (15)

where σ_p and ε_p are model parameters similar to those of the observation likelihood (equation (14)) and d_k is the difference between the reference face shape and the shape derived from the k-th 3D particle. To calculate this difference, the reference face shape is first rotated such that its face plane (figure 8) coincides with the face plane of the measured shape. The scalar distance d_k is then defined as:

d_k = sqrt( (1/M) Σ_{i=1}^{M} d_ik² )    (16)

where d_ik is the 3D spatial distance between feature i of the reference and feature i of the k-th 3D particle.

6 GAZE DIRECTION ESTIMATION

Having acquired the estimate of the head pose in the previous section, we now discuss the gaze direction estimation module and the intersection calculation module in detail. We begin by presenting the geometrical eye model used in our system in section 6.1. In section 6.2 the calculation of the 3D gaze vector is explained. Finally, the intersection between the gaze ray and the screen is dealt with in section 6.3.

6.1 Geometrical Eye Model

We use a 3D eyeball model similar to the model used by Matsumoto et al. [15] and Ishikawa et al. [12]. The eyeball is regarded as a sphere with radius r and center O (figure 11). We assume that the eyeball is fixed inside the eye socket, except for rotational movements around its center. Therefore the relative position of the center O and the eye corners is constant regardless of the head movements. Unlike Ishikawa et al., we also assume that the inner and outer corners of the eye socket (E_1 and E_2) are not located on the eyeball surface. It is easier to locate and track the eye corners than points on the eyeball surface, because these corners are more distinctive (figure 11). This also makes the tracking more robust to eye blinks. Furthermore, we assume that the anatomical axis of the eye coincides with the visual axis¹. The gaze direction is defined by a 3D vector going from the eyeball center O through the cornea center C.

Figure 11. The eyeball model used in our system. The capital letters denote 3D points in world coordinates. The gaze direction v̂_g is defined as a 3D vector from the eyeball center O pointing to the cornea center C. The points E_1 and E_2 are the inner and outer corners of the eye socket.

Our 3D eyeball model consists of two parameters:

- the radius of the eyeball r,
- the relative position of the eyeball center with respect to the eye corners.

The relative position of the eyeball center is defined as a 3D vector from the mid-point of the eye corners M to the eyeball center O, termed the offset vector d. These parameters are determined for each person by taking a training sequence in which the gaze points of that person are known. The training sequence is acquired by recording the user's head pose and cornea center locations while he is looking at several calibration points on the screen. Since we know the locations of the calibration points, we can calculate the gaze vectors to these points. If we consider only one calibration point P, the gaze vector is determined by

v_g = P - C,   v̂_g = v_g / ‖v_g‖

with v̂_g as the normalized gaze vector when the eye gaze is fixed on point P (see figure 11).

¹ The anatomical axis is defined as the vector from the eyeball center to the center of the lens, while the visual axis is defined as the vector connecting the fovea and the center of the lens. The visual axis represents the true gaze direction. On the retina, the image that we see is projected at the fovea, which is slightly above the projection of the optical axis.
If we consider only one calibration point P, the gaze vector is determined by v g = P C, ˆv g = v g v g 1 Anatomical axis is defined as the vector from eyeball center to the center of the lens, while visual axis is defined as the vector connecting the fovea and the center of the lens. The visual axis represents the true gaze direction. On the retina, the image that we see will be projected at the fovea, which is slightly above the projection of the optical axis. 9

The relation between the gaze vector and the unknown parameters r and d is reflected by the equation:

d + r v̂_g = C - M    (17)

This equation cannot be solved on its own, because we have 4 unknowns (the radius r and the offset vector d = [d_x, d_y, d_z]) and only 3 equations (one for each of the x, y and z components). If we combine the left and right eye, assuming the same eyeball radius, we still have 7 unknowns and 6 equations. Therefore, we need at least 2 calibration points to estimate the eyeball parameters for each user. The generalized matrix equation for N calibration points can be derived from equation (17), written in the form Ax = b:

\begin{bmatrix} v̂_{gL,1} & I & 0 \\ \vdots & \vdots & \vdots \\ v̂_{gL,N} & I & 0 \\ v̂_{gR,1} & 0 & I \\ \vdots & \vdots & \vdots \\ v̂_{gR,N} & 0 & I \end{bmatrix} \begin{bmatrix} r \\ d_L \\ d_R \end{bmatrix} = \begin{bmatrix} C_{L,1} - M_{L,1} \\ \vdots \\ C_{L,N} - M_{L,N} \\ C_{R,1} - M_{R,1} \\ \vdots \\ C_{R,N} - M_{R,N} \end{bmatrix}    (18)

Solving this matrix equation in the least-squares sense leads to the desired eyeball parameters. Note that the calculation is done in the face coordinate system (see figure 8); otherwise equation (18) would not be valid.

6.2 Estimating the Gaze Vector

Once the eyeball parameters are estimated, we can estimate the gaze direction. The overview of the gaze direction estimation module is given in figure 12.

Figure 12. Detailed block diagram of the gaze direction estimation module.

Figure 13. The ROI defined between the inner and outer eye corners. The small dot in the middle of the circle represents the 2D cornea center.

From the head pose tracking module we get the 3D locations of all facial features. However, for gaze direction estimation we only need the 2D and 3D positions of the inner and outer eye corners (for the left and right eye). This information is used to estimate the cornea center and eyeball center locations.

6.2.1 Finding the Eyeball Center

We calculate the locations of the left and right eyeball centers separately by using the following equation:

O = ½(E_1 + E_2) + d = M + d    (19)

where d is the offset vector obtained from the training sequence.

6.2.2 Finding the Cornea Center

To find the cornea center we first project the 3D eye corners back to the left and right 2D image planes. A small ROI in the image is then defined between the inner and outer corner locations (figure 13). Then, template matching with a disk-shaped template on the intensity image is used to approximately locate the cornea. After that, we define an even smaller ROI around the initial cornea center location and apply the circular Hough transform on the edge image of the smaller ROI. The second ROI is used to filter out irrelevant edges. The pixel position with the highest confidence (most votes) is the estimate of the cornea center. The steps described above are done for the left and right image separately. The left and right 2D cornea center locations are then triangulated to find the 3D location.
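The following is a minimal sketch of the circular Hough voting used to refine the cornea center: edge pixels inside the second ROI vote for candidate centers at a set of assumed cornea radii, and the cell with the most votes is returned. The edge detection and ROI handling are left out, and the discretization choices are illustrative, not the authors' exact implementation.

```python
import numpy as np

def hough_circle_center(edge_mask, radii):
    """Vote for circle centers given a binary edge mask (section 6.2.2, simplified).

    edge_mask : HxW boolean array of edge pixels inside the cornea ROI.
    radii     : iterable of candidate cornea radii in pixels.
    Returns the (row, col) center with the most votes.
    """
    H, W = edge_mask.shape
    acc = np.zeros((H, W))
    ys, xs = np.nonzero(edge_mask)
    thetas = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
    for r in radii:
        # Every edge pixel votes for all centers lying at distance r from it.
        cy = (ys[:, None] + r * np.sin(thetas)[None, :]).round().astype(int).ravel()
        cx = (xs[:, None] + r * np.cos(thetas)[None, :]).round().astype(int).ravel()
        ok = (cy >= 0) & (cy < H) & (cx >= 0) & (cx < W)
        np.add.at(acc, (cy[ok], cx[ok]), 1)
    return np.unravel_index(np.argmax(acc), acc.shape)
```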
6.2.3 The 3D Gaze Vector

After finding the 3D cornea center location C and the 3D eyeball center O for the left and right eye, the gaze vector for the current frame is calculated by

v_g = C - O,   v̂_g = v_g / ‖v_g‖    (20)

The normalized left and right gaze vectors are finally forwarded to the intersection calculation module (see figure 12).

6.3 Intersecting the Gaze Vector with the Screen

The overview of the intersection calculation module is shown in figure 14. To intersect the gaze ray with the screen we need information about the screen location. In figure 15, the gaze direction is projected onto the screen at point P. The resulting gaze ray can be written in parametric representation as:

g(t) = O + v̂_g t    (21)

where O is the eyeball center and v̂_g is the unit gaze vector. For a certain scalar t, the gaze ray will intersect the screen at point P. By using the knowledge that the dot product of every point in a plane with the plane's normal is a constant [9],

N · P = N · O_s = c,

and the parametric representation of the gaze ray in equation (21), we can obtain the value t_P at which the gaze ray intersects the screen plane:

N · (O + v̂_g t_P) = N · O_s
t_P = (N · O_s - N · O) / (N · v̂_g)    (22)

Equation (22) can be further simplified if we do the calculation in the screen coordinate system. We then have O_s = 0 and N = [0 0 1]^T, reducing the calculation to a division of two scalars:

t_P = -o_z / v̂_g,z    (23)

where o_z is the z component of the eyeball center (in the screen coordinate system). For the output of the whole system, the average of the projected gaze rays from the left and right eyes is taken to compensate for the effect of noise.
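Equations (19)-(23) combine into a short computation per eye, sketched below. The offset vector d is assumed here to be already expressed in the same (world) frame as the eye corners; in the paper it is stored in the face coordinate system and rotated along with the estimated head pose. R_ws and T_ws are the world-to-screen transformation from section 4.2.

```python
import numpy as np

def gaze_on_screen(E1, E2, d, C, R_ws, T_ws):
    """Project one eye's gaze onto the screen (equations (19)-(23)).

    E1, E2 : 3D inner/outer eye corners, C : 3D cornea center (world frame).
    d      : eyeball-center offset, assumed already rotated into the world frame.
    R_ws, T_ws : world-to-screen transformation from the registration step.
    """
    O = 0.5 * (E1 + E2) + d                  # eyeball center, equation (19)
    v = C - O
    v_hat = v / np.linalg.norm(v)            # gaze direction, equation (20)
    # Work in screen coordinates so the screen plane is z = 0 (equation (23)).
    O_s = R_ws @ O + T_ws
    v_s = R_ws @ v_hat
    t_p = -O_s[2] / v_s[2]
    P = O_s + t_p * v_s                      # intersection point on the ray, equation (21)
    return P[:2]                             # 2D screen coordinates (x, y)

# The system output averages the left- and right-eye projections:
# gaze_point = 0.5 * (gaze_on_screen(*left_args) + gaze_on_screen(*right_args))
```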

Figure 14. Detailed block diagram of the intersection calculation module.

Figure 15. Illustration of the ray-plane intersection.

7 EXPERIMENTAL RESULTS

In this section we evaluate the performance of each module of the gaze tracking system. The calibration and screen registration results are presented in section 7.1. Section 7.2 discusses the tracking performance of the auxiliary particle filtering. The gaze training and estimation results are shown in section 7.3, and finally, we test the whole system by applying it to some sequences in section 7.4.

7.1 Stereo Calibration and Screen Registration

To calibrate the web camera pair, we took 16 image pairs (320x240 pixels) of the checkerboard calibration grid in various positions. The first eight shots were taken while the grid was held about 50 cm away from the camera. The remaining shots were made while holding the grid about 120 cm away from the camera. Table I shows the estimated camera parameters. (Note that each rotation matrix is represented by three rotation angles, one for the x, y and z axis respectively.) The results in this table indicate that the average horizontal and vertical reprojection errors are very small (below 0.1 pixel). The reprojection error remains relatively constant if fewer than 16 images are taken, but this results in a larger error in the estimated parameters.

TABLE I. Stereo camera calibration results: optimized intrinsic parameters (focal lengths, principal points, radial and tangential distortion coefficients) and average reprojection error for the left and right camera, and the extrinsic parameters (rotation angles in degrees and translation in mm) relating the two cameras, each with its standard deviation.

For the screen registration we took another 5 shots containing the mirror in various positions (figure 16). By using the method described in section 4.2, we could compute the position of the screen with respect to the world frame for each stereo image pair. The estimated world-to-screen transformation M_ws for each mirror position is listed in table II.

Figure 16. An example of the shots for the screen registration. The images shown here were taken from the left camera.

Figure 17. Example of the head pose tracking with particle filtering. The results presented here were taken from the left camera for frames 1 (user initialization), 51, 86 and 122 (from left to right and top to bottom).

TABLE II. Screen registration results: the world-to-screen transformation M_ws.

Mirror position | Rotation angles (R_α, R_β, R_γ) (deg.) | Translation vector (T_x, T_y, T_z) (mm)
#1 | (25.46, 9.58, 1.74) | (65.42, , )
#2 | (24.96, 8.59, 1.71) | (67.72, , )
#3 | (25.40, 8.68, 1.80) | (64.73, , )
#4 | (24.44, 8.59, 1.64) | (69.45, , )
#5 | (24.90, 10.43, 1.45) | (70.04, , )
Mean | |
Standard deviation | |

We can see from the standard deviations of the rotation angles and the translation vectors that the screen registration method is accurate to within about 2 mm translation error and less than 1° rotation error. The mean value of the transformation is used to determine the screen location in the intersection calculation module.

7.2 Head Pose Tracking

Figure 17 shows an example of the head pose tracking using K = 100 particles. The tracking was performed by choosing α = 0.7 and β = 0.5 for the horizontal and vertical speed components respectively, with a noise standard deviation of σ_n = 1.8 pixels (see equation (11)). The choice of these parameters depends strongly on the expected speed of the head movements. If only slow movements are present, we can choose smaller values for α, β and σ_n, thereby improving the tracking precision (smaller jitter). Using larger values decreases the precision, but makes the tracking more robust to faster movements.

Figure 18. The reference face shape model (above) and the estimated face shape in the face coordinate system for all frames (below).

When we compare the estimated face shape over all frames of the same sequence, the shape apparently varies slightly over time (figure 18). This is caused by the stochastic nature of particle filtering. The statistics are shown in table III. This variation would render the user's eyeball model useless, because it assumes that the eye corners are fixed with respect to the whole face shape model. This is the reason that we fit the reference shape to the estimated shape (section 5.4). In this way the rigidness of the face shape in each frame is preserved.
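The fitting of the reference shape to the estimated 3D feature locations is a rigid alignment of two small point sets. The paper does not name the algorithm used; a standard choice is the SVD-based least-squares (Kabsch) fit sketched below.

```python
import numpy as np

def fit_rigid(reference, measured):
    """Least-squares rigid alignment (Kabsch): find R, t with measured ~ R @ reference + t.

    reference, measured : (M, 3) arrays of corresponding 3D facial feature points.
    Returns R (3x3), t (3,) and the aligned reference shape.
    """
    ref_c = reference - reference.mean(axis=0)
    mea_c = measured - measured.mean(axis=0)
    U, _, Vt = np.linalg.svd(mea_c.T @ ref_c)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])   # avoid a reflection
    R = U @ D @ Vt
    t = measured.mean(axis=0) - R @ reference.mean(axis=0)
    return R, t, reference @ R.T + t
```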

TABLE III. Statistics of the estimated face shape: standard deviations (in mm, along x, y and z) of the two mouth corners, the two left-eye corners and the two right-eye corners.

TABLE IV. Estimated eyeball parameters (in mm) for the left and right eye: the radius r and the offset vector components d_x, d_y, d_z, together with the x and y differences between ground truth and measurement, their average error and their standard deviation.

7.3 Gaze Direction Estimation

Before we could estimate the gaze direction, we trained the system in order to estimate the user-dependent model parameters (see section 6.1). The training sequence was acquired by recording the user's eye corners and cornea positions while he was looking at 4 calibration points in the corners of the screen. After that we estimated the eyeball parameters by solving equation (18). The results are summarized in table IV.

We analyzed the effect of errors in two quantities on the overall gaze error: the cornea center and the eyeball center. Together they determine the gaze vector (section 6.2). A new sequence was acquired while the user was looking at one point with his head fixed. Since the head and cornea were fixed, the variations in the tracked eye corners (and thus, indirectly, the eyeball center) and the cornea center locations were only caused by the algorithm. The standard deviations of the eyeball center and cornea center are shown in table V. We can see that the cornea fitting produced almost twice as large a deviation as the eyeball center. For the following calculation, the mean of the cornea centers over all frames, in which the head and eyes are really steady, was considered the true location. The same was done for the mean of the eyeball centers over all frames.

TABLE V. Tracking and cornea fitting error: standard deviations (in mm, along x, y and z) of the eyeball center and the cornea center for the left and right eye.

TABLE VI. Effect of individual parameter errors on the gaze projection error: standard deviations (in mm, in x and y) of the gaze projection in three experiments, for the cases in which no parameter is fixed, the eyeball center is fixed, and the cornea center is fixed.

The gaze projections on the screen were calculated in three passes. First, the gaze vector in each frame was calculated as usual by equation (20). In the second pass, we held the cornea center constant over all frames by taking its mean. In the last pass, the eyeball center location was held constant over all frames, again by taking its mean. This experiment was repeated 3 times on the same sequence to make the results more reliable, since particle filtering is stochastic in nature (table VI). The results indicate that errors in the cornea center fitting have the largest influence on the vertical gaze projection error. If the noise in the cornea center is removed (by taking the average), the spread of the gaze projection error becomes smaller and rounder (see also figure 19).

Figure 19. The plot of the gaze projections of the first experiment of table VI. The symbols represent the mean gaze projection (+), the gaze points when no parameters are fixed, the points with a fixed eyeball center, and the points with a fixed cornea center (each drawn with a different symbol).

7.4 Overall Performance

In this section we present the overall results of the gaze tracking system when applied to the training sequence and a test sequence. Both sequences were recorded while a person was looking at the same 4 calibration points (figure 20). As we can see, the average error of the gaze projection on the screen is about 6 cm, which corresponds to an angular error of about 7° at a distance of 50 cm. Figure 21 shows the 3D gaze vectors of the left and right eyes projected back onto the image plane.

When we compare the gaze direction estimates for all 4 calibration points, we see that the projected gaze points on the lower part of the screen have a much smaller and rounder spread. This is caused by the cornea fitting error. Since the cameras were located below the screen, we had an almost frontal view of the face when the user was looking at the lower part of the screen. Hence, the fitting produced a smaller error because the cornea image has a circular form. The further the user's gaze moves away from the camera, the more elliptic the cornea image projection becomes, making it more difficult to fit. As a result, a spread similar to that in section 7.3 was observed for the gaze toward the upper part of the screen.

8 CONCLUSIONS AND RECOMMENDATIONS

In this paper a gaze tracking system based on particle filtering and stereo vision is presented. We propose to track facial features in 2D by particle filtering, and to use the stereo information to estimate the head pose of a user in 3D. Together with a 3D eyeball model, the 3D gaze direction can be estimated. For gaze tracking applications in visual perception research, we need to know the projection of the user's gaze on the screen where the visual stimuli are presented. We devised a screen registration scheme in order to accurately locate the screen with respect to the cameras. With this information, the gaze projection on the screen can be calculated.

The results achieved by our gaze tracking scheme are promising. The average gaze projection error is about 7°, only a few degrees off the specified requirement. At a user-monitor distance of 50 cm and with a 17-inch screen (about 30x24 cm), this means that we can distinguish gaze projections on the screen in about 5x3 distinct blocks.

There is still room for improvement in our gaze tracking system. Based on the results in the previous section, the cornea fitting should be the main concern, since errors in this part have the greatest influence on the overall gaze error. A higher resolution for the cornea fitting is needed, for example by using a larger image resolution or by fitting the cornea with sub-pixel accuracy. Together with a more sophisticated fitting algorithm such as ellipse fitting [7], better results should be achievable. Another possible source of error is the exclusion of the difference between the anatomical and visual axes of the eye from the 3D eye model. Compensating for this difference will also reduce the overall gaze tracking error. The head pose tracking module still has some difficulty tracking persons with glasses and fast head movements. The use of lighting and rotation-invariant templates might help to reduce tracking loss. Furthermore, some smoothing in the temporal domain could help to reduce the jitter in the estimated facial feature locations.
Finally, to eliminate the manual selection of the facial features in the initialization phase, the possibility of automatically locating these features should be explored.

Figure 20. The plots of projected gaze points from the training sequence (above) and the test sequence (below). The gaze points for each calibration point are represented by a different symbol.

REFERENCES

[1] Applied Science Laboratories (ASL), USA. (Last visited: 22 October 2004)
[2] Baluja, S. and Pomerleau, D., Non-intrusive Gaze Tracking using Artificial Neural Networks, Report no. CMU-CS, Carnegie Mellon University.

Figure 21. Results from the detection of the gaze direction. The vectors are drawn starting from the cornea centers of the left and right eye, respectively.

[3] Bouguet, J.Y., Camera Calibration Toolbox for MATLAB. (Last visited: 28 September 2004)
[4] Duchowski, A.T., Eye Tracking Methodology: Theory and Practice, London: Springer.
[5] Ebisawa, Y., Improved Video-based Eye-gaze Detection Method, IEEE Transactions on Instrumentation and Measurement, 47(4).
[6] Eye Response Technologies (ERT), USA. (Last visited: 22 October 2004)
[7] Fitzgibbon, A.W., Pilu, M. and Fischer, R.B., Direct Least-squares Fitting of Ellipses, IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5).
[8] Fourward Technologies, Inc., USA. (Last visited: 22 October 2004)
[9] Glassner, A.S., Graphics Gems, Cambridge: Academic Press.
[10] Isard, M. and Blake, A., Condensation - Conditional Density Propagation for Visual Tracking, International Journal of Computer Vision, 29(1):5-28.
[11] Isard, M. and Blake, A., ICondensation: Unifying Low-level and High-level Tracking in a Stochastic Framework, Proceedings of the 5th European Conference on Computer Vision, vol. 1.
[12] Ishikawa, T., Baker, S., Matthews, I. and Kanade, T., Passive Driver Gaze Tracking with Active Appearance Models, Proceedings of the 11th World Congress on Intelligent Transportation Systems.
[13] Ji, Q. and Zhu, Z., Eye and Gaze Tracking for Interactive Graphic Display, International Symposium on Smart Graphics.
[14] Lichtenauer, J., Reinders, M. and Hendriks, E., Influence of the Observation Likelihood Function on Particle Filtering Performance in Tracking Applications, Proceedings of the 6th IEEE International Conference on Automatic Face and Gesture Recognition.
[15] Matsumoto, Y. and Zelinsky, A., An Algorithm for Real-time Stereo Vision Implementation, Proceedings of the 4th IEEE International Conference on Automatic Face and Gesture Recognition.
[16] Patras, I. and Pantic, M., Particle Filtering with Factorized Likelihoods for Tracking Facial Features, Proceedings of the 6th IEEE International Conference on Automatic Face and Gesture Recognition.
[17] Reingold, E.M., McConkie, G.W. and Stampe, D.M., Gaze-contingent Multiresolutional Displays: An Integrative Review, Human Factors, 45(2).
[18] Tobii Technology AB, Sweden. (Last visited: 22 October 2004)
[19] Wooding, D., Eye Movement Equipment Database (EMED), UK. (Last visited: 22 October 2004)
[20] Zhang, Z., A Flexible New Technique for Camera Calibration, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11).
