Passive Driver Gaze Tracking with Active Appearance Models

Takahiro ISHIKAWA, Takeo KANADE

Monocular gaze estimation is usually performed by locating the pupils, and the inner and outer eye corners, in the image of the driver's head. Of these feature points, the eye corners are just as important, and perhaps harder to detect, than the pupils. The eye corners are usually found using local feature detectors and trackers. In this paper, we describe a monocular driver gaze tracking system which uses a global head model, specifically an Active Appearance Model (AAM), to track the whole head. From the AAM, the eye corners, eye region, and head pose are robustly extracted and then used to estimate the gaze.

Key words: Monocular driver gaze tracking system, Active Appearance Model

An intelligent car that monitors the behavior of the driver can be made far safer. Many of the most important components of the driver's behavior are related to their eye gaze. Whether the driver is drowsy or not is related to both their blink rate and their temporal gaze variation. Whether they are distracted or not can often be determined by detecting whether they are looking outside into the road scene, or instead at other passengers, the car radio, etc. By combining eye gaze with an understanding of the objects in the road-scene, it is even possible for an intelligent car to determine whether the driver has noticed potential dangers in the scene.

Most passive approaches to gaze estimation are in essence very similar. See, for example, 3)-5), 7), 9), 10). The location of the pupil (or equivalently the iris), together with the inner and outer corners of the eye, are detected in the input image(s). The eye gaze can then be computed using a simple geometric head model. If an estimate of the head pose is available, a more refined geometric model can be used and a more accurate gaze estimate made.

Of these four quantities (iris/pupil location, inner eye corner location, outer eye corner location, and head pose), the most difficult to estimate reliably are the eye corners (and to a lesser extent the head pose). Once the eye corners have been located, locating the iris/pupil, both robustly and accurately, is relatively straightforward. Perhaps somewhat ironically, the main difficulty in gaze estimation is not finding the iris/pupil.

The usual approach to locating the inner and outer eye corners is feature point detection and tracking.4)5) The head pose is normally computed in a similar manner; i.e. first detect and track a collection of anatomical feature points (eye corners, nose, etc.) and then use a simple geometric model to compute the head pose. The problem with all of these feature-based methods is that they are very local; they only use information in the immediate vicinity of the feature point. If the face was tracked as a single object, a lot more visual information could be used to detect and track the eye corners and estimate the head pose, both more robustly and more accurately.

In recent years, a number of face models have been proposed to model the face as a single object, most notably Active Appearance Models (AAMs)2) and 3D Morphable Models (3DMMs).1) Unfortunately, AAMs are only 2D models and so estimating the 3D head pose is difficult. On the other hand, fitting or tracking with 3D models is relatively slow. In particular, the fastest algorithm8) to track with a 3DMM operates at around 30 seconds per frame (i.e. almost 1000 times slower than real-time, by which we mean 30 frames per second).

Recently, we have developed real-time algorithms for fitting both 2D AAMs6) and a 3D variant of them.11) Both of these algorithms operate at well over 200 frames per second, leaving plenty of time for the other computational tasks, such as iris/pupil detection, and the estimation of the gaze direction itself. In this paper, we describe how we have used these algorithms to build a gaze estimation system that derives its robustness and high accuracy from the fact that the eye corners and head pose are estimated

using the entire appearance of the face, rather than by just tracking a few isolated feature points.

We begin by describing the geometric head model we use to estimate the gaze direction. There is nothing particularly novel about this model. Similar models have been used by other authors.4)5) The essence of our model is contained in Fig. 1. We assume that the eyeball is spherical and that the inner and outer eye corners have been estimated, in our case using an AAM as described in the following section. Our algorithm can be split into two steps:

1. Estimate (1) the center and (2) the radius of the eyeball in the image from the eye corners and the head pose.
2. Estimate the gaze direction from the pupil location, the center, and the radius of the eyeball.

The first of these steps requires the following anatomical constants, also shown in Fig. 1(b):

R0: The radius of the eyeball in the image when the scale of the face is 1 (see below for a definition of scale).
(Tx, Ty): The offset in the image (when the face is frontal and the scale is 1) between the mid-point of the two eye corners and the center of the eyeball.
L: The depth of the center of the eyeball relative to the plane containing the eye corners.

We now describe these two steps in turn and then how to estimate the anatomical constants. The center and radius of the eyeball are computed using the following four steps:

1. The mid-point (mx, my) between the inner corner (e1x, e1y) and outer corner (e2x, e2y) is computed:

   (mx, my) = ((e1x + e2x)/2, (e1y + e2y)/2)    (1)

2. The scale of the face S is computed. The most obvious way to estimate the scale is to use the foreshortening-corrected distance between the eye corners:

   S = sqrt((e1x - e2x)^2 + (e1y - e2y)^2) / cos(phix)    (2)

Fig. 1 Gaze Estimation Geometric Model. (a) In the image we detect the pupil and the eye corners (using the AAM). From these quantities we first estimate the eyeball center and radius, and then the gaze. (b) The anatomical constants (R0, L, Tx, Ty) when the scale of the face is 1. (c) Top down view used to compute the offsets to the eye center from the head pose phi when the scale of the face is S. (d) Top down view of the eyeball used for computing the gaze direction theta when the scale of the face is S.

The disadvantage of this approach is that it is very noise sensitive because it is the difference between two points that are very close together in the image. Instead, we used the scale that is estimated by the AAM. This estimate is more reliable because the scale is computed by effectively averaging over the entire face region.

3. The center of the eyeball (ox, oy) is then computed as the mid-point (mx, my) plus two corrections:

   (ox, oy) = (mx, my) + S (Tx cos(phix), Ty cos(phiy)) + S L (sin(phix), sin(phiy))    (3)

The first correction is a foreshortened offset that compensates for the fact that the mid-point of the eye corners is not necessarily the eye center, even for a frontal image. The second correction compensates for the fact that the eyeball center does not, in general, lie in the plane of the eye corners. In Equation (3), (phix, phiy) is the head pose.

4. The radius of the eyeball in the image is computed as R = S R0.

The gaze direction (thetax, thetay) can then be computed as follows (see Fig. 1(d)):

   sin(thetax) = (px - ox) / sqrt(R^2 - (py - oy)^2)
   sin(thetay) = (py - oy) / sqrt(R^2 - (px - ox)^2)    (4)

The anatomical constants R0, (Tx, Ty), and L are pre-computed in an offline training phase as follows. Substituting Equation (3) into Equation (4) gives:

   sin(thetax) = (px - mx - S Tx cos(phix) - S L sin(phix)) / sqrt((S R0)^2 - (py - oy)^2)
   sin(thetay) = (py - my - S Ty cos(phiy) - S L sin(phiy)) / sqrt((S R0)^2 - (px - ox)^2)    (5)

We collect a set of training samples where the gaze direction and head pose of a person takes one of the two following special forms:

   (thetax, thetay, phix, phiy) = (alpha_i, 0, beta_i, 0)  or  (thetax, thetay, phix, phiy) = (0, alpha_i, 0, beta_i)    (6)

Suppose we have Nx images of the first form and Ny of the second. Approximating the square roots in Equation (5) by S R0, we combine the equations for these training samples and create the following matrix equation:

   | sin(alpha_1)   sin(beta_1)   cos(beta_1)   0            |              | (px_1 - mx_1) / S_1    |
   | ...            ...           ...           ...          |   | R0 |     | ...                    |
   | sin(alpha_Nx)  sin(beta_Nx)  cos(beta_Nx)  0            |   | L  |     | (px_Nx - mx_Nx) / S_Nx |
   | sin(alpha_1)   sin(beta_1)   0             cos(beta_1)  | . | Tx |  =  | (py_1 - my_1) / S_1    |
   | ...            ...           ...           ...          |   | Ty |     | ...                    |
   | sin(alpha_Ny)  sin(beta_Ny)  0             cos(beta_Ny) |              | (py_Ny - my_Ny) / S_Ny |    (7)

The least-squares solution of Equation (7) gives the anatomical constants (R0, L, Tx, Ty).

The usual approach to locating the inner and outer eye corners is feature point detection and tracking.4)5) The problem with these feature-based methods is that they are very local; they only use information in the immediate vicinity of the feature. Hence, feature point tracking is neither as robust nor as accurate as it could be.
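As a concrete illustration, the geometric model of Equations (1)-(4) and the least-squares estimation of the anatomical constants can be sketched in a few lines of Python. This is a minimal sketch: the constant values, function names, and sample format below are hypothetical, and `fit_anatomical_constants` uses the same small-offset approximation (the square roots in Equation (5) replaced by S R0) that underlies Equation (7).

```python
import math

import numpy as np

# Hypothetical anatomical constants (R0, L, Tx, Ty); in practice they
# are learned offline from calibration samples via Equation (7).
R0, L_DEPTH, TX, TY = 10.0, 5.0, 0.0, 2.0

def eyeball_center_and_radius(e1, e2, S, phi):
    """Steps 1, 3 and 4: mid-point of the eye corners (Equation (1)),
    eyeball center with its two pose corrections (Equation (3)),
    and the eyeball image radius R = S * R0."""
    phix, phiy = phi
    mx, my = 0.5 * (e1[0] + e2[0]), 0.5 * (e1[1] + e2[1])
    ox = mx + S * TX * math.cos(phix) + S * L_DEPTH * math.sin(phix)
    oy = my + S * TY * math.cos(phiy) + S * L_DEPTH * math.sin(phiy)
    return (ox, oy), S * R0

def gaze_direction(p, o, R):
    """Equation (4): gaze angles (theta_x, theta_y) from the pupil
    location p, the eyeball center o, and the image radius R."""
    tx = math.asin((p[0] - o[0]) / math.sqrt(R**2 - (p[1] - o[1])**2))
    ty = math.asin((p[1] - o[1]) / math.sqrt(R**2 - (p[0] - o[0])**2))
    return tx, ty

def fit_anatomical_constants(samples):
    """Equation (7): stack one linear equation per calibration sample
    of the special forms (alpha, 0, beta, 0) / (0, alpha, 0, beta),
    then solve for (R0, L, Tx, Ty) by least squares.  Each sample is
    (alpha, beta, S, p, m, axis) with p and m the pupil and mid-point
    coordinates along the given axis."""
    rows, rhs = [], []
    for alpha, beta, S, p, m, axis in samples:
        if axis == 'x':
            rows.append([math.sin(alpha), math.sin(beta), math.cos(beta), 0.0])
        else:
            rows.append([math.sin(alpha), math.sin(beta), 0.0, math.cos(beta)])
        rhs.append((p - m) / S)
    sol, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return sol
```

For a frontal head (phi = (0, 0)) with the pupil at the estimated eyeball center, the recovered gaze is (0, 0), i.e. straight ahead.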
We now describe an approach that tracks the head as a single object and show how it can be used to: (1) estimate the head pose, (2) estimate the eye corner locations, and (3) extract the region for pupil localization.

Active Appearance Models (AAMs)2) are generative face models. An AAM consists of two components, the shape and the appearance. The 2D shape of an AAM is defined by a 2D triangulated mesh and in particular the vertex locations of the mesh:

   s = ( u1 u2 ... un ; v1 v2 ... vn )

AAMs allow linear shape variation. This means that the shape matrix s can be expressed as a base shape s0 plus a linear combination of m shape matrices s_i:

   s = s0 + sum_{i=1}^{m} p_i s_i    (8)

where the coefficients p_i are the shape parameters. AAMs are normally computed from training data consisting of a set of images with the shape mesh (usually hand) marked on them.2) The Iterative Procrustes Algorithm and Principal Component Analysis are then applied to compute the base shape s0 and the shape variations s_i. An example of the base shape s0 and the first two shape modes (s1 and s2) of an AAM are shown in Fig. 2(a)-(c).

The appearance of the AAM is defined within the base mesh s0. Let s0 also denote the set of pixels u = (u, v)^T that lie inside the base mesh s0, a convenient abuse of terminology. The appearance of the AAM is then an image A(u) defined over the pixels u in s0. AAMs allow linear appearance variation. This means that the appearance A(u) can be expressed as a base appearance A0(u) plus a linear combination of l appearance images A_i(u):

   A(u) = A0(u) + sum_{i=1}^{l} lambda_i A_i(u)    (9)

where lambda_i are the appearance parameters. As with the shape, the appearance images A_i are usually computed by applying PCA to the (shape normalized) training images.2) An example of the base appearance lambda0 and the first two appearance modes (lambda1 and lambda2) are shown in Fig. 2(d)-(f).

Although Equations (8) and (9) describe the AAM shape and appearance variation, they do not describe how to generate a model instance. The AAM model instance with shape parameters p and appearance parameters lambda is created by warping the appearance A from the base mesh s0 to the model shape mesh s. In particular, the pair of meshes s0 and s define a piecewise affine warp from s0 to s which we denote W(u; p). Three example model instances are included in Fig. 2(g)-(i). This figure demonstrates the generative power of an AAM. The AAM can generate face images with different poses (Fig. 2(g) and (h)), different identities (Fig. 2(g) and (i)), and different expressions (Fig. 2(h) and (i)).
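The linear shape and appearance models of Equations (8) and (9) can be sketched directly. The toy base shape, modes, and parameter values below are illustrative numbers, not a trained model; a real AAM would obtain them from Procrustes alignment and PCA on hand-marked training images.

```python
import numpy as np

# Toy AAM: n = 3 mesh vertices, m = 2 shape modes, l = 1 appearance
# mode.  All numbers are made up for illustration.
s0 = np.array([[0.0, 1.0, 0.5],
               [0.0, 0.0, 1.0]])            # base shape: u row, v row
shape_modes = np.array([[[0.1, -0.1, 0.0],
                         [0.0,  0.0, 0.2]],
                        [[0.0,  0.0, 0.3],
                         [0.1, -0.1, 0.0]]])  # s_1, s_2
A0 = np.array([0.5, 0.5, 0.5, 0.5])         # base appearance (flattened pixels)
app_modes = np.array([[0.2, -0.2, 0.1, 0.0]])  # A_1

def shape_instance(p):
    """Equation (8): s = s0 + sum_i p_i * s_i."""
    return s0 + np.tensordot(p, shape_modes, axes=1)

def appearance_instance(lam):
    """Equation (9): A(u) = A0(u) + sum_i lambda_i * A_i(u)."""
    return A0 + lam @ app_modes
```

Rendering a full model instance would additionally warp the appearance from s0 to the generated shape with the piecewise affine warp W(u; p); that step is omitted here.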
Fig. 2 An example Active Appearance Model.2) (a-c) The AAM base shape s0 and the first two shape modes s1 and s2. (d-f) The AAM base appearance lambda0 and the first two appearance modes lambda1 and lambda2. (g-i) Three example model instances. (j-l) The base 3D shape and the first two 3D shape modes.

Fig. 3 Example driver head tracking results with an AAM. View in color for best clarity.

Driver head tracking is performed by fitting the AAM sequentially to each frame in the input video. Three frames from an example movie of a driver's head being tracked with an AAM are included in Fig. 3. Given an input image I, the goal of AAM fitting is to minimize:

   sum_{u in s0} [ A0(u) + sum_{i=1}^{l} lambda_i A_i(u) - I(W(u; p)) ]^2    (10)

simultaneously with respect to the 2D AAM shape p and appearance lambda parameters. In 6) we proposed an algorithm to minimize the expression in Equation (10) that operates at around 230 frames per second. For lack of space, the reader is referred to 6) for the details.

The shape component of an AAM is 2D, which makes driver head pose estimation difficult. In order to extract the head pose, we also build a 3D linear shape model:

   s' = s'0 + sum_{i=1}^{m'} p'_i s'_i    (11)

where the coefficients p'_i are the 3D shape parameters and s'_i, etc., are the 3D shape coordinates:

   s' = ( x1 x2 ... xn ; y1 y2 ... yn ; z1 z2 ... zn )    (12)

In 11) we showed how the equivalent 3D shape variation s'_i can be computed from the corresponding 2D shape variation s_i using our non-rigid structure-from-motion algorithm.12) An example of the 3D base shape s'0 and the first two 3D shape modes (s'1 and s'2) of the AAM in Fig. 2 are shown in Fig. 2(j)-(l).

In order to combine this 3D model with the 2D model we need an image formation model. We use the weak perspective imaging model defined by:

   u = P(x) = ( ix iy iz ; jx jy jz ) (x y z)^T + (a b)^T    (13)

where (a, b) is an offset to the origin and the projection axes i = (ix, iy, iz) and j = (jx, jy, jz) are equal length and orthogonal: i.i = j.j; i.j = 0. To extract the driver head pose and AAM scale we perform the AAM fitting by minimizing:

   sum_{u in s0} [ A0(u) + sum_{i=1}^{l} lambda_i A_i(u) - I(W(u; p)) ]^2 + K || s0 + sum_{i=1}^{m} p_i s_i - P( s'0 + sum_{i=1}^{m'} p'_i s'_i ) ||^2    (14)

simultaneously with respect to p, lambda, P, and p', rather than using Equation (10). In Equation (14), K is a large constant weight. In 11) we extended our 2D AAM fitting algorithm 6) to minimize the expression in Equation (14). The algorithm operates at around 286Hz.11) The second term enforces the (heavily weighted soft) constraints that the 2D shape s equals the projection of the 3D shape s' with projection matrix P.

Once the expression in Equation (14) has been minimized, the driver head pose and AAM scale can be extracted from the projection matrix P. Two examples of pose estimation are shown in Fig. 4.
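The weak perspective model of Equation (13) is simple enough to sketch directly. The function names below are hypothetical; the scale recovery uses the fact that the projection axes i and j are equal-length and orthogonal, so their common length is the face scale S.

```python
import numpy as np

def project_weak_perspective(X, i_axis, j_axis, offset):
    """Equation (13): u = [i; j] @ x + (a, b), applied to a 3 x n
    array of 3D points X.  i_axis and j_axis are the two rows of the
    weak perspective projection matrix P; offset is (a, b)."""
    P = np.stack([np.asarray(i_axis, float), np.asarray(j_axis, float)])
    return P @ np.asarray(X, float) + np.asarray(offset, float)[:, None]

def scale_from_projection(i_axis, j_axis):
    """Since i and j are equal length and orthogonal, their common
    length is the scale S of the face; averaging the two norms makes
    the estimate robust to small numerical violations."""
    return 0.5 * (np.linalg.norm(i_axis) + np.linalg.norm(j_axis))
```

One common way to read off head pose (not necessarily the paper's exact formulation) is to normalize i and j and complete them with their cross product into a rotation matrix, from which roll, pitch, and yaw follow.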

Fig. 4 Pose estimation results. The computed roll, pitch, and yaw are displayed in the top left.

Once the AAM has been fit to an input video frame, it is easy to locate the eye corners and extract the eye regions. The AAM has mesh vertices that correspond to each of the inner and outer eye corners. The eye corner locations in the image can therefore be just read out of W(s; p), the location of the AAM mesh in the image. See Fig. 5(a) for an example.

In the AAM, each eye is modeled by six mesh vertices. It is therefore also easy to extract the eye region as a box slightly larger than the bounding box of the six eye mesh vertices. If (x_i, y_i) denotes the six mesh vertices for i = 1, ..., 6, the eye region is the rectangle with bottom-left coordinate (BLx, BLy) and top-right coordinate (TRx, TRy), where:

   (BLx, BLy) = (min_i x_i - cx, min_i y_i - dy)
   (TRx, TRy) = (max_i x_i + cx, max_i y_i + dy)    (15)

and (cx, dy) is an offset to expand the rectangle. Again, see Fig. 5(a) for an example.

Once the eye region has been extracted from the AAM, we detect the iris to locate its center. Our iris detector is fairly conventional and consists of two parts. Initially, template matching with a disk shaped template is used to approximately locate the iris. The iris location is then refined using an ellipse fitting algorithm similar to the ones in 7), 10).

We apply template matching twice to each of the eye regions using two different templates. The first template is a black disk template which is matched against the intensity image. The second template is a ring (annulus) template that is matched against the vertical edge image. The radii of both templates are determined from the scale of the AAM fit. The two sets of matching scores are summed to give a combined template matching confidence. The position of the pixel with the best confidence becomes the initial estimate of the center of the iris. In Fig. 5(b) we overlay the eye image with this initial estimate of the iris location and radius.

Fig. 5 Example iris detection result: (a) Example eye region extraction results computed from an AAM fit. (b) The initial iris location and radius computed by template matching. (c) The refined iris location and radius after edge-based ellipse fitting.
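The eye region extraction of Equation (15) is a one-liner in code. This is a minimal sketch; the default offset values are hypothetical, not taken from the paper.

```python
def eye_region(vertices, cx=2.0, dy=2.0):
    """Equation (15): bounding box of the six eye mesh vertices,
    expanded by the offsets (cx, dy).  Returns the bottom-left and
    top-right corners of the eye region rectangle."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (min(xs) - cx, min(ys) - dy), (max(xs) + cx, max(ys) + dy)
```

In the full system the rectangle would be cut out of the input frame and handed to the iris detector described above.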

The initial iris estimate is then refined as follows. First, edges are detected by scanning radially outward from the initial center of the pupil. Next, an ellipse is fit to the detected edges to refine the estimate of the iris center. Edges a long way away from the initial estimate of the radius are filtered out for robustness. The ellipse is parameterized:

   a1 x^2 + a2 x y + a3 y^2 + a4 x + a5 y = 1    (16)

and the parameters a1, ..., a5 are fit using least squares. This refinement procedure is repeated iteratively until the estimate of the center of the iris converges (typically only 2-3 iterations are required). Example results are shown in Fig. 5(c).

In Fig. 6 we include three frames of a head being tracked using an AAM. Notice how the facial features are tracked accurately across wide variations in the head pose. In Fig. 7 we include three example pose estimates using the 3D AAM. Note that the yaw is estimated particularly accurately.

Besides the eye corner locations and the head pose, the other quantity we extract from the AAM fit is the head scale S. We evaluate the scale estimate using the fact that the scale is inversely proportional to the distance to the head (the depth). In Fig. 8 we compute the distance to the head using the scale as the driver slides the seat back.

We collected a ground-truthed dataset by asking each subject to look in turn at a collection of markers on the wall. The 3D position of these markers was then measured relative to the head position and the ground-truth gaze angles computed. We took multiple sequences with different head poses. All variation was in the yaw direction and from approximately -20 degrees to +20 degrees relative to frontal. In Fig. 9 (a-c) we include an example frontal image for each of 3 subjects. We overlay the image with the AAM fit and a line denoting the estimated gaze direction. We also include close ups of the extracted eye regions and the detected iris. In Fig. 9 (d-f) we plot the estimated azimuth gaze angle against the ground truth. The average error is 3.2 degrees. The green line in the figure denotes the correct answer.

Fig. 6 An example of head tracking with an AAM.

Fig. 7 Example pose (yaw, pitch, roll) estimation with the AAM. (a) Ground truth (0, 0, 0), estimated (0, -1.8, -3.6). (b) Ground truth (10, 0, 0), estimated (10, 2.0, -4.3). (c) Ground truth (20, 0, 0), estimated (20.8, -1.8, -0.8).
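Because Equation (16) is linear in the parameters a1, ..., a5, the ellipse fit reduces to an ordinary least-squares solve. The sketch below also recovers the center by setting the gradient of the implicit equation to zero, a standard consequence of the implicit form that the text does not spell out; the function names are hypothetical.

```python
import numpy as np

def fit_ellipse(xs, ys):
    """Fit a1*x^2 + a2*x*y + a3*y^2 + a4*x + a5*y = 1 (Equation (16))
    to edge points (xs, ys) by linear least squares."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    D = np.column_stack([xs**2, xs * ys, ys**2, xs, ys])
    a, *_ = np.linalg.lstsq(D, np.ones(len(xs)), rcond=None)
    return a

def ellipse_center(a):
    """Center of the fitted ellipse: the gradient of the implicit
    equation vanishes there, i.e. solve
    [[2a1, a2], [a2, 2a3]] (x, y)^T = (-a4, -a5)^T."""
    a1, a2, a3, a4, a5 = a
    return np.linalg.solve([[2 * a1, a2], [a2, 2 * a3]], [-a4, -a5])
```

In the refinement loop, edges far from the current radius estimate would be discarded before each fit, and the fit repeated until the center converges.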

Fig. 8 Verification of the scale estimated by the AAM. Since the scale is inversely proportional to depth, we can use the scale to estimate the distance to the driver's head. (a-h) The distance estimated from the AAM scale increases smoothly as the seat is moved backward: (a) 64.7cm, (b) 67.4cm, (c) 71.3cm, (d) 74.0cm, (e) 75.7cm, (f) 77.5cm, (g) 78.7cm, (h) 80.1cm.

Fig. 9 Gaze estimation. (a-c) Gaze estimates for subjects 1-3 overlaid on the input image. We also include the AAM, the extracted eye region, and the detected iris. (d-f) A comparison between the ground-truth azimuth gaze angle and the angle estimated by our algorithm. The average error is 3.2 degrees.

If there are two cameras in the car, one imaging the driver, the other imaging the outside world, it is possible to calibrate the relative orientations of the cameras by asking a person to look at a collection of points in the world and then marking the corresponding points in the outside-view image. The relative orientation can then be solved for using least-squares. We performed this experiment and then asked the subject to track a person walking outside the car with their gaze. Three frames from a video of the results are shown in Fig. 10. In Fig. 10 (a-c) we display the exterior view. We overlay the estimated gaze direction with a yellow circle that corresponds to a 5.0 degree gaze radius. In Fig. 10 (d-f) we include the corresponding interior view of the driver overlaid with the AAM, the extracted eye regions, the detected iris, and the estimated gaze plotted as a line. As can be seen, the person always lies well inside the circle, demonstrating the high accuracy of our algorithm.
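The depth check of Fig. 8 relies only on scale being inversely proportional to depth, so a single calibrated (scale, depth) pair fixes the proportionality constant. A one-line sketch, with hypothetical reference values:

```python
def depth_from_scale(S, S_ref, depth_ref):
    """Depth is inversely proportional to the AAM scale S, so one
    calibrated pair (S_ref, depth_ref) determines the constant:
    depth = depth_ref * S_ref / S.  The reference values are
    hypothetical, e.g. the first seat position of Fig. 8."""
    return depth_ref * S_ref / S
```

Halving the observed scale doubles the estimated distance to the head.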

(a) Exterior view frame 78 (b) Exterior view frame 634 (c) Exterior view frame 687 (d) Interior view frame 78 (e) Interior view frame 634 (f) Interior view frame 687

Fig. 10 Mapping the driver's gaze into the external scene. The driver was told to follow the person walking outside in the parking lot. We overlay the external view with a yellow circle with radius corresponding to a 5.0 degree error in the gaze estimate. As can be seen, the person always lies well within the circle, demonstrating the accuracy of our algorithm.

We have presented a driver gaze estimation algorithm that uses an Active Appearance Model2) to: (1) track the eye corners, (2) extract the eye region, (3) estimate the scale of the face, and (4) estimate the head pose. The irises are detected in the eye region using fairly standard techniques and the gaze estimated from the above information using a fairly standard geometric model. The robustness and accuracy of our passive, monocular system are derived from the AAM tracking of the whole head, rather than using a local feature based technique. Once the eye corners have been located, finding the irises and computing the gaze are straightforward.

The research described in this paper was supported by DENSO CORPORATION, Japan.

1) V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In Proceedings of Computer Graphics, Annual Conference Series (SIGGRAPH) (1999), pp. 187-194.
2) T. Cootes, G. Edwards, and C. Taylor. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6) (June 2001):681-685.
3) A. Gee and R. Cipolla. Determining the gaze of faces in images. Image and Vision Computing (1994):632-647.
4) J. Heinzmann and A. Zelinsky. 3-D facial pose and gaze point estimation using a robust real-time tracking paradigm. In Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition (1998), pp. 142-147.
5) Y. Matsumoto and A. Zelinsky. An algorithm for real-time stereo vision implementation of head pose and gaze direction measurement. In Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition (2000), pp. 499-505.
6) I. Matthews and S. Baker. Active Appearance Models revisited. International Journal of Computer Vision, 60(2) (2004):135-164.
7) T. Ohno, N. Mukawa, and A. Yoshikawa. FreeGaze: A gaze tracking system for everyday gaze interaction. In Proceedings of the Symposium on ETRA (2002), pp. 125-132.
8) S. Romdhani and T. Vetter. Efficient, robust and accurate fitting of a 3D morphable model. In Proceedings of the International Conference on Computer Vision (2003).
9) P. Smith, M. Shah, and N. da Vitoria Lobo. Monitoring head/eye motion for driver alertness with one camera. In Proceedings of the IEEE International Conference on Pattern Recognition (2000), pp. 636-642.

10) K. Talmi and J. Liu. Eye and gaze tracking for visually controlled interactive stereoscopic displays. Signal Processing: Image Communication, 14 (1999):799-810.
11) J. Xiao, S. Baker, I. Matthews, and T. Kanade. Real-time combined 2D+3D active appearance models. In IEEE Conference on Computer Vision and Pattern Recognition (2004).
12) J. Xiao, J. Chai, and T. Kanade. A closed-form solution to non-rigid shape and motion recovery. In Proceedings of the European Conference on Computer Vision (2004).