Eye Typing off the Shelf

Dan Witzner Hansen
Dept. of Innovation, IT University of Copenhagen, Copenhagen, Denmark

Arthur Pece
Heimdall Vision & Dept. of Computer Science, University of Copenhagen, Copenhagen, Denmark

Abstract

The goal of this work is to use off-the-shelf components for gaze-based interaction, with a focus on eye typing. Avoiding dedicated hardware such as IR light emitters makes eye tracking significantly more difficult and requires robust methods capable of handling large changes in image quality. We employ an active-contour method to obtain robust iris tracking. The main strength of the method is that the contour model avoids explicit feature detection: contours are simply assumed to remove statistical dependencies between pixels on opposite sides of the contour. The contour model is utilized in an approach combining particle filtering with the EM algorithm. The method is robust against light changes and camera defocusing. For the purpose of determining where the user is looking, calibration is usually needed. The number of calibration points used in different methods varies from a few to several thousand, depending on the prior knowledge about the setup and equipment. We examine basic properties of gaze determination when the geometry of the camera, screen and user is unknown. In particular, we present a lower bound on the number of calibration points needed for gaze determination on planar objects, and we examine degenerate configurations. Based on this lower bound we apply a simple calibration procedure to facilitate button selections for fast on-screen typing.

Keywords: Eye tracking, Expectation Maximisation, particle filter, gaze calibration, lower bound, components off-the-shelf.

1. Introduction

Humans acquire a vast amount of information through the eyes, and the eyes in turn reveal information about our attention and intention. Detecting the gaze therefore yields valuable information for use in psychophysics and human-computer interaction (HCI). The use of commercial off-the-shelf (COTS) products as elements in larger systems is becoming increasingly commonplace. Reduced budgets, accelerating enhancement rates of COTS components, and the increased accessibility of such systems catalyze this process. Using COTS for camera-based eye tracking has many advantages, but it also introduces several new problems, as fewer assumptions about the system can be made. Eye tracking based on COTS holds potential for a large number of applications, for example in the entertainment industry and for eye typing [2]. For severely disabled people, a means of communication is crucial. Producing text using eye positioning ("eye typing") is an appropriate modality for this purpose, as conscious control of eye movements is retained in most types of handicaps. In this framework it is generally not possible to exploit IR light sources and other specially engineered devices, as they cannot be bought in a common hardware store. By the same token, pan-and-tilt cameras cannot be used, forcing such systems to be passive. Very little control over the cameras and the geometry of the setup can be expected. The methods employed for eye tracking should therefore be able to handle changes in light conditions, image defocusing, and view and scale changes. The purpose of this paper is to show that explicit feature detection is not needed for iris tracking, which makes the proposed method robust towards changes in illumination and image defocusing.
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 presents an overview of the method and defines the formalism. Section 4 derives the marginalized contour model. Section 5 describes gaze determination, and Section 6 derives a lower bound on the number of calibration points needed for gaze estimation when the setup is unknown. The results on iris tracking, gaze estimation and eye typing are given in Section 7.

2. Related Work

Many high-end systems for eye tracking use special light sources and synchronization schemes. Infrared (IR) light emitters are often utilized to obtain stable and controlled light conditions and robust gaze determination. Kalman filtering [5] and mean shift filtering [9] are recent approaches to eye tracking.

Methods for the detection and extraction of eye features, such as eye corners and iris contours, can roughly be divided into two classes: (a) methods based on global information, such as deformable templates and active appearance models [2], and (b) methods based on combining local information. Most existing approaches of the latter type assume the iris contour to be circular and detect the outer iris boundary (limbus) through intensity edges [1].

3. Method Overview

The proposed method is based on recursive estimation of the state variables (i.e. the iris pose and velocity) of the object being tracked. Most active contour methods can be placed into two main classes, depending on the principle used for evaluating the image evidence. One class assumes that object edges generate image features; it thus depends on the extraction of features from the image in the neighborhood of the contour and on the assignment of one (and only one) correspondence between points on the contour and image features [4]. The assumption behind feature-based methods (that edges generate image features) is questionable from a physical standpoint, but it has the advantage that shape and pose refinement reduces to a least-squares problem. Apart from the problem of finding correct correspondences, the thresholds necessary for feature detection inevitably make these methods sensitive to noise. Other active-contour methods avoid feature detection by maximizing feature values (without thresholding) underlying the contour, rather than minimizing the distance between the locally strongest feature and the contour [8]. The underlying idea is that a large image gradient is likely to arise from a boundary between object and background. Such energy-based methods have the disadvantage of not explicitly taking into account unmodelled variations of the contour shape; in addition, they are not suitable for gradient-based optimization of the object pose and shape. The method introduced here is of the latter class, but smoothing is replaced by marginalization over possible deformations of the object shape.

3.1. Marginalized Iris Tracker

To track a target over an image sequence, we propose to use particle filtering, as it is robust in clutter and capable of recovering from occlusions thanks to its multiple-hypothesis representation. Particle filtering is also suitable for iris tracking, because changes in iris position are fast and do not follow a smooth, fully predictable pattern. The object location is represented by the sample mean. Increasing the sample set size improves the expected accuracy, but also increases the computational demand. To lower the computational demand while maintaining accuracy, the EM contour method [6] is utilized to optimize the sample mean from the particle filter. Figure 1 illustrates a flow diagram of the method.

[Figure 1: Overall tracking is performed by particle filtering; starting from the mean of the particles (sample mean), maximum likelihood estimation of the object state is performed through the EM contour algorithm. The mean calculated in the previous time step is employed to compensate for time-dependent scale changes.]
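To make the interplay between the particle filter and the EM refinement concrete, the following minimal Python sketch outlines one tracking step of the scheme in Figure 1. The helpers log_likelihood_ratio (the contour likelihood of section 4) and em_refine (the EM contour step of [6]) are hypothetical placeholders; this is an illustration of the general scheme, not the authors' implementation.

    import numpy as np

    def propagate(particles, sigma):
        """First-order AR dynamics: diffuse each particle with Gaussian noise.
        sigma is the per-dimension noise scale (time dependent in the paper)."""
        return particles + np.random.randn(*particles.shape) * sigma

    def track_step(particles, weights, image, sigma,
                   log_likelihood_ratio, em_refine):
        # 1. Resample particles according to the previous weights.
        idx = np.random.choice(len(particles), size=len(particles), p=weights)
        particles = propagate(particles[idx], sigma)
        # 2. Re-weight each hypothesis by the marginalized contour likelihood.
        logw = np.array([log_likelihood_ratio(image, x) for x in particles])
        weights = np.exp(logw - logw.max())
        weights /= weights.sum()
        # 3. Use the sample mean as the state estimate ...
        mean_state = weights @ particles
        # 4. ... and refine it with a few EM contour iterations (Figure 1).
        return particles, weights, em_refine(image, mean_state)

Each particle is a 5-vector (c_x, c_y, λ_1, λ_2, θ) as defined in the next subsection.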
3.2. State Model and Dynamics

Depending on the viewing angle, the iris appears elliptical. Modelling the iris as an ellipse, the state is defined by x = (c_x, c_y, λ_1, λ_2, θ), where (c_x, c_y) is the center, λ_1 and λ_2 are the major and minor axes, and θ is the angle of the major axis with respect to the vertical. Pupil movements can be very rapid from one image frame to another. The dynamics is therefore modelled as a first-order autoregressive process with a Gaussian noise model whose covariance matrix, Σ_t, is time dependent. The time dependency is due to scale changes: when the apparent size of the eye increases, the corresponding image-plane eye movements can also be expected to increase. For this reason, the first two diagonal elements of Σ_t (corresponding to the state variables c_x and c_y) are assumed to depend linearly on the previous sample mean.

4. Observation Model

This section describes the observation model, which defines the pdf (probability density function) p(y_t | x_t). The model can be divided into two components: (a) a geometric component defining a pdf over image locations of contours, and (b) a texture component defining a pdf over pixel gray-level differences given the contour locations. We refer to the boundary of the object being tracked as the modelled boundary. For simplicity we assume: (1) the density of the observation depends only on gray-level differences (GLDs); (2) gray-level differences between pixels along a line are statistically independent; (3) intensity values of nearby pixels are correlated if both belong to the object being tracked or both belong to the background, i.e. a priori statistical dependencies between nearby pixels are assumed; (4) there is no correlation between nearby points if they are on opposite sides of the object boundary, i.e. statistical independence across object boundaries is assumed;

(5) the shape of the contour is subject to random local variability, and marginalization over local deformations of the contour leads to a Bayesian estimate of the contour parameters. Taken together, these assumptions mean that no features need to be detected and matched to the model (leading to greater robustness against noise), while local shape variations are still taken explicitly into account. The model leads to a simple closed-form expression for the likelihood of the image given the contour parameters [6].

4.1. Definitions

Denote the normal to the contour at a given point as the measurement line, and define the coordinate ν on the measurement line. η(ν) is a binary indicator variable which is 1 if the boundary of the target lies in the interval [ν − Δν/2, ν + Δν/2] on the measurement line (with regular inter-point spacing Δν) and 0 otherwise. Given the position µ of the contour on the measurement line, the distance from µ to ν is ε = ν − µ. Denote the gray-level difference between two points on the measurement line by ΔI(ν) ≡ I(ν + Δν/2) − I(ν − Δν/2), and the observation on a given measurement line by I = {ΔI(iΔν) | i ∈ Z}. These definitions are illustrated in Figure 2.

[Figure 2: Marginalized contour definitions.]

4.2. Likelihood of the image

The observations along a measurement line depend on the contour location in the image. This means that likelihoods computed for different contour locations are not directly comparable, as they are likelihoods of different observations. Since the image itself does not depend on the contour location (but its likelihood does), a better evaluation function is the likelihood of the entire image I as a function f_1(I | µ) of the contour location µ. Denote by f_a(I) the likelihood of the image given no contour, and by f_R(I | µ) the ratio f_1(I | µ) / f_a(I):

    log f_1(I | µ) = log f_a(I) + log f_R(I | µ) = log f_a(I) + h(I | µ)    (1)

where h(I | µ) ≡ log f_R(I | µ). The first term involves complex statistical dependencies between pixels and is expensive to calculate, as all image pixels must be inspected. Most importantly, its estimation is needless: it is an additive term that is independent of the presence and location of the contour. Consequently, for the purpose of fitting contours to the image, it is sensible to consider only the log-likelihood ratio.

4.3. Statistics of gray-level differences

Research on the statistics of natural images shows that the pdf of gray-level differences between neighboring pixels is well approximated by a generalized Laplacian [3]:

    f_L(ΔI) = (1/Z_L) exp( −|ΔI/λ|^β )    (2)

where ΔI is the gray-level difference, λ depends on the distance between the two sampled image locations, β is a parameter approximately equal to 0.5, and Z_L is a normalization constant. For β = 0.5 it can be shown that Z_L = 4λ.

4.4. Distributions on measurement lines

If there is no known edge (object boundary) between two image locations [ν − Δν/2, ν + Δν/2] on the measurement line, the pdf of the gray-level difference follows the generalized Laplacian defined in equation (2):

    f[ΔI(ν) | η(ν) = 0] = f_L[ΔI(ν)]    (3)

Assuming independence between gray-level differences, the pdf of the observation in the absence of an edge is given by

    f_a(I) ≡ ∏_i f_L[ΔI(iΔν)]    (4)

It is important to note that the absence of the boundary between the object being tracked and the background does not imply the absence of an edge: due to unmodelled object boundaries and surface features, edges may occur within the background as well as within the object.
Two points observed on opposite sides of an edge are statistically independent. The conditional pdf of a gray-level difference across an edge can, for simplicity, be assumed to be uniform:

    f[ΔI(ν) | η(ν) = 1] ≈ 1/m    (5)

where m is the number of gray levels.
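As an illustration, the pointwise densities of equations (2)-(5) can be written down directly. The sketch below is a plain NumPy transcription meant only to make the quantities concrete; parameter values such as λ are assumptions.

    import numpy as np

    def f_L(dI, lam, beta=0.5):
        """Generalized Laplacian pdf of a gray-level difference (eq. 2)."""
        Z_L = 4.0 * lam  # closed-form normalization for beta = 0.5
        return np.exp(-np.abs(dI / lam) ** beta) / Z_L

    def log_f_a(dI_line, lam):
        """Log-likelihood of a measurement line containing no edge (eqs. 3-4):
        independent generalized-Laplacian gray-level differences."""
        return np.sum(np.log(f_L(dI_line, lam)))

    def f_edge(m=256):
        """Uniform pdf of a gray-level difference across an edge (eq. 5)."""
        return 1.0 / m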

If there is a known object boundary at location jΔν, then only one point corresponds to a gray-level difference across the boundary; the rest are gray-level differences within either object or background. In this case, the pdf of the observation is given by:

    f_c(I | jΔν) = (1/m) · f_a(I) / f_L[ΔI(jΔν)]    (6)

4.5. Marginalizing over deformations

The geometric object model cannot be assumed to be perfect. In other words, the position of the idealized contour does not correspond exactly to the position of the object boundary, even if the position of the object is known. For simplicity, we assume a Gaussian distribution of geometric deformations of the object at each sample point. In the following, ν denotes the location of the object boundary on the measurement line. As mentioned above, µ is the intersection of the measurement line and the (idealized) contour, and the distance from µ to ν is ε = ν − µ. The prior pdf of deformations f_D(ε) is defined by:

    f_D(ε) = (1/Z_D) exp( −ε²/(2σ²) )    (7)

where Z_D = √(2π) σ is a normalization factor. Marginalizing over possible deformations, the likelihood is given by:

    f_M(I | µ) = (1/m) f_a(I) ∫ [ f_D(ε) / f_L(ΔI(ν)) ] dε    (8)

Following section 4.2, we use the likelihood ratio:

    f_R(I | µ) = f_M(I | µ) / f_a(I) = (1/m) ∫ [ f_D(ε) / f_L(ΔI(ν)) ] dε    (9)

This is the ratio between the likelihood of the hypothesis that the target is present (equation 8) and the null hypothesis that the contour is not present (equation 4). Hence the likelihood ratio can be used for testing the hypothesis of the presence of a contour. For the EM contour algorithm, it is convenient to take the logarithm to obtain the log-likelihood ratio:

    h(I | µ) = −log m + log ∫ [ f_D(ε) / f_L(ΔI(ν)) ] dε    (10)

It follows that, for a given observation I, the evaluation increases when the contour is placed at a location that maximizes the gray-level difference magnitudes |ΔI| under a Gaussian window centered at µ.
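The log-likelihood ratio of equation (10) reduces to a one-dimensional sum along each measurement line. A minimal sketch, assuming the f_L of the previous sketch and a simple discretization of the marginalization integral:

    import numpy as np

    def h(dI_line, nu_grid, mu, lam, sigma_d, m=256):
        """h(I|mu) = -log m + log integral of f_D(nu - mu) / f_L(dI(nu)),
        approximated by a Riemann sum over the measurement-line samples."""
        eps = nu_grid - mu                                   # deformations
        f_D = np.exp(-eps**2 / (2 * sigma_d**2)) \
              / (np.sqrt(2 * np.pi) * sigma_d)               # Gaussian prior (eq. 7)
        ratio = f_D / f_L(dI_line, lam)                      # integrand of eq. 10
        dnu = nu_grid[1] - nu_grid[0]                        # grid spacing
        return -np.log(m) + np.log(np.sum(ratio) * dnu)

Evaluating h over candidate contour positions µ (one call per measurement line, summed over lines) yields the scores used both to weight particles and to drive the EM contour iterations.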
5. Gaze Estimation

For the purpose of gaze estimation, we need to infer the point where the subject is looking given the image data. More specifically, we aim at finding the distribution p(x | D), where x is the gaze position and D is the data obtained from the image. The maximum a posteriori (MAP), maximum likelihood (ML), or least-squares (LS) estimates are most often used, and hence a deterministic mapping Φ : R^m → R^3 from an m-dimensional feature space to world coordinates is inferred. When using the gaze information for screen-based applications, the image of Φ is a subset of R^2; we will therefore only consider the mapping Φ : R^m → R^2, as the depth is implicitly given. The process of gathering data for finding the transformation Φ is called calibration. Calibration is usually performed by assuming that the user looks at N predefined points (target values) t_i on the screen, while relating these to the corresponding images of the eye x_i. A pair of feature coordinates x_i and target values t_i is called a conjugate pair.

There are several approaches to determining the mapping from image to screen coordinates. These methods can be divided into (a) feature-based and (b) appearance- or view-based methods. Feature-based methods use estimated features such as contours and eye corners for gaze determination; due to the low number of features used, the dimensionality of the input space is generally quite low. IR-based eye trackers generally use feature-based methods, as the center of the eye and the glint (reflection) are easily obtained [5]. Appearance-based methods do not explicitly extract features, but use all the image information as input; the dimensionality of the input space is therefore much higher than for feature-based methods [7].

6. A Lower Bound on Calibration Points

In this section, we obtain a lower bound on the number of calibration points needed for gaze-based interaction using uncalibrated cameras, in the case where the setup geometry is unknown but fixed. This lower bound is valid under a small-angle approximation for the range of gaze directions of practical interest. Modelling the eye as a sphere, the position of the iris is defined by two rotation angles α, β of the eye in the horizontal and vertical directions. We further define the origin α = 0, β = 0 as the position of the eye fixating the center of the screen. The exact parametrization is irrelevant for our purposes, since in the following we are only interested in the absolute value θ = √(α² + β²) of the angle between the origin and the current direction of gaze. Consider the distances a between a corner of the screen and the center of the screen, and b between the eye and the screen. Typical values would be a ≈ 23 cm and b ≈ 60 cm.

This gives a ratio a/b ≈ 0.38, which means that the maximum value of θ is θ_M = arctan(a/b) ≈ 0.36 (measuring the angle in radians). This assumes that, when fixating the center of the screen, the optical axis of the eye is perpendicular to the screen; if the screen is tilted, θ_M becomes even smaller.

Consider the plane E tangent to the eyeball at the point α = 0, β = 0 (Fig. 3). Again, we define a coordinate system in this plane with the origin at the point of tangency. Each direction of gaze (α, β) corresponds to one and only one point e on the E plane. It is clear from Fig. 3 that the point e and the point on the screen that is being fixated are related through a homography, which we define as T_E^S. Defining r as the eye radius, it is also clear that |e| = r tan θ.

[Figure 3: The geometry used to derive a lower bound on the number of calibration points. The eye is represented by the hemicircle on the right-hand side, looking at the screen S. The tangent plane E is represented by a green line. Perspective projections of the center of the iris onto the E plane and onto the screen are represented by black dashed lines; orthographic projections of the iris center onto the E plane are represented by red line segments.]

The distance between the iris and the E plane is equal to r(1 − cos θ) and is therefore never larger than r(1 − cos θ_M) ≈ 0.065r. This distance can be neglected when the camera image plane is almost parallel to the E plane. Therefore, we can consider that the camera is imaging the orthographic projection e′ of the iris onto the E plane (see Fig. 3). Clearly, |e′| = r sin θ, and therefore the relative error 2|e − e′| / (|e| + |e′|) is at most equal to the relative difference 2(tan θ_M − sin θ_M)/(tan θ_M + sin θ_M). Inserting θ_M ≈ 0.36 into this expression, the relative error can be seen to be at most approximately 0.07. This systematic error is comparable to the random error in gaze estimation (see next section). Neglecting this error, we can assume that the camera is imaging e instead of e′. Therefore, there is an approximate homography from the E plane to the camera image plane, which we define as T_C^E. The concatenation of two homographies is also a homography, and therefore the transformation from image to screen coordinates via the eye, Φ = T_C^S = T_C^E T_E^S, is a homography. A homography is defined by four point correspondences, and hence the transformation from image to screen coordinates is defined by four points. If head movements are allowed, additional conjugate points are needed for the estimation of Φ, and thus four points can only be considered a lower bound.

To summarize, we have proven that four calibration points are sufficient if the following approximations are valid: (1) the eye is spherical; (2) the maximum distance between the iris and the E plane is negligible; (3) the maximum distance between the points e and e′ is negligible; (4) the head does not move.

7. Results

The setup of camera, user and monitor is fixed within one session but varies between sessions. For calibration, the user is asked to gaze at four predefined areas on the screen; the centers of these areas serve as the calibration pattern for gaze estimation.
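Concretely, the four-point calibration implied by section 6 can be realized with the standard direct linear transform (DLT) for homography estimation. The sketch below is a minimal example of this textbook procedure, not the authors' exact code:

    import numpy as np

    def calibrate_homography(eye_pts, screen_pts):
        """Estimate the 3x3 homography H with screen ~ H * eye (homogeneous
        coordinates) from four (or more) conjugate point pairs via the DLT."""
        A = []
        for (x, y), (u, v) in zip(eye_pts, screen_pts):
            A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
            A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
        # The homography is the null vector of A (last right singular vector).
        _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
        return Vt[-1].reshape(3, 3)

    def gaze_to_screen(H, x, y):
        """Map an eye-feature position to screen coordinates."""
        u, v, w = H @ np.array([x, y, 1.0])
        return u / w, v / w

With exactly four conjugate pairs the system has an (up to scale) unique solution, matching the lower bound; additional pairs yield a least-squares fit that can absorb small head movements.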
The contour model is initialized at a fixed position and size using 100 samples. Σ_0 is set manually so as to obtain sufficient accuracy while still allowing some freedom of head movement. For locating the eye, the extent of the noise model Σ is initially set high and is then decreased over the first frames.

The method has been tested on a 1.2 GHz PC with 128 MB RAM, on both Europeans and Asians, in live test situations and on prerecorded sequences from web and video cameras. On digital video camera images, a frame rate of 25 frames per second is obtained. Figure 4 shows images from testing the method on iris tracking using a standard video camera. These images indicate that the method is capable of tracking the iris under scale changes, squinting of the eye, various light conditions and image defocusing, without any explicit feature detection. Despite these drastic changes in the observations, tracking is maintained without changing the model or any of its parameters. Owing to the differences in image quality, tracking eyes in web camera images is vastly more difficult than in video camera or IR-based images. The method is nevertheless capable of tracking the iris without changing the model or its parameters for all three types of images. Clearly, tracking accuracy improves with improved image quality; using high-quality IR-based images thus allows for significantly larger head movements than using web cameras.

Using four calibration points, an accuracy of 4 degrees in gaze estimation is obtained. Figure 5 shows the results of fixating the gaze on 12 predefined points on the screen. The mean absolute errors are 0.5 in. and 0.3 in. in the x and y directions, respectively; the standard deviations are 2.4 cm and 0.5 cm in the x and y directions, respectively. For typing, a set of 12 on-screen buttons is used for entering text. Rather than using continuous cursor positions, a nearest-neighbor classifier is used to avoid "dancing mouse" effects due to errors in gaze estimation. The average typing speed for novice users is 3 words per minute (WPM) on common expressions.
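The button selection described above amounts to snapping each (noisy) gaze estimate to the nearest button center. A minimal sketch, with an assumed 4 x 3 button layout on a hypothetical 1280 x 1024 screen:

    import numpy as np

    def select_button(gaze_xy, button_centers):
        """Return the index of the button center nearest the gaze estimate."""
        d = np.linalg.norm(button_centers - np.asarray(gaze_xy), axis=1)
        return int(np.argmin(d))

    # Example: 12 button centers laid out on a 4 x 3 grid (assumed layout).
    centers = np.array([[160 + 320 * i, 170 + 340 * j]
                        for j in range(3) for i in range(4)])
    print(select_button((700, 400), centers))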

[Figure 4: Tracking the iris under various light conditions (IR and non-IR), head poses, image blurring and scales.]

[Figure 5: Gaze estimation of on-screen cursor positions on a 17 in. screen. The black dots represent fixation points and the colored crosses the estimated directions of gaze, corresponding to a 4-degree accuracy. The standard deviations are 2.4 and 0.5 cm in the x and y directions, respectively.]

8. Conclusion

We have developed a tracking method based on particle filtering and the EM algorithm and used it for iris tracking. The contour model leads to a simple marginalization technique: methods that involve feature detection at any stage should marginalize over all possible correspondences of image features to model features compatible with a hypothesized pose. In practice such marginalization is often difficult, but avoiding feature detection makes marginalization much easier to implement. The method has proven robust for tracking eyes under moderate variations in position, scale and image defocusing, without performing explicit feature detection. The method is also fairly robust in the face of occlusions and changes in illumination. It is thus capable of handling the changes imposed by off-the-shelf cameras, making it well suited for both high-quality and low-cost eye tracking. We have given a general lower bound for the problem of determining gaze position on planar objects in the case where the geometry of the setup of camera, user and monitor is unknown but fixed. The lower bound is used directly in a simple calibration procedure and for gaze estimation.

References

[1] J. Daugman. The importance of being random: statistical principles of iris recognition. Pattern Recognition, 36(2):279-291, February 2003.

[2] Dan Witzner Hansen, John Paulin Hansen, Mads Nielsen, Anders Sewerin Johansen, and Mikkel B. Stegmann. Eye typing using Markov and active appearance models. In IEEE Workshop on Applications of Computer Vision, 2002.

[3] J. Huang and D. Mumford. Statistics of natural images and models. In IEEE Computer Vision and Pattern Recognition (CVPR), 1999.

[4] Michael Isard and Andrew Blake. Contour tracking by stochastic propagation of conditional density. In European Conference on Computer Vision, pages 343-356, 1996.

[5] Q. Ji and X. Yang. Real time visual cues extraction for monitoring driver vigilance. Lecture Notes in Computer Science, 2095:107, 2001.

[6] A.E.C. Pece and A.D. Worrall. Tracking with the EM contour algorithm. In European Conference on Computer Vision, pages I:3-17, 2002.

[7] L.Q. Xu, D. Machin, and P. Sheppard. A novel approach to real-time non-intrusive gaze finding. In British Machine Vision Conference, 1998.

[8] A.L. Yuille and J.M. Coughlan. Fundamental limits of Bayesian inference: Order parameters and phase transitions for road tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(2):160-173, February 2000.

[9] Z. Zhu, Q. Ji, and K. Fujimura. Combining Kalman filtering and mean shift for real time eye tracking. In International Conference on Pattern Recognition, 2002.
