A Bayesian Framework for Real-Time 3D Hand Tracking in High Clutter Background
Hanning Zhou, Thomas S. Huang
University of Illinois at Urbana-Champaign
405 N. Mathews, Urbana, IL 61801, U.S.A.

Abstract

Robust tracking of global hand motion against cluttered backgrounds is an important task in human-computer interaction and in the automatic interpretation of American Sign Language. It remains an open problem due to lighting variation, background clutter and occlusion. In this paper, a Bayesian framework is proposed that incorporates a hand shape model, a skin color model and image observations to recover the position and orientation of the hand in 3D space from monocular images. The robustness of our approach has been verified with extensive experiments.

1 Introduction

Hand gestures can be a more natural way for humans to interact with computers. For instance, one can use his or her hand to manipulate virtual objects directly in a virtual environment. However, capturing human hand motion is inherently difficult due to its articulation and variability. One way to solve the problem is divide and conquer [Wu and Huang, 1999], i.e., decoupling hand motion into the global motion of a rigid model and the finger movements, and solving the two iteratively until convergence is reached. This approach demands (1) a robust algorithm to recover the 3D position and orientation of the hand, and (2) an efficient algorithm to recover the finger configuration.

The first problem has been extensively explored. Black and Jepson [Black and Jepson, 1996] used an appearance-based model in eigenspace to recover the 2D position of a bounding box. The more general case of 3D curve matching has been addressed by Zhang [Zhang, 1994]. Blake and Isard [Blake and Isard, 1998] used active contours to track global hand motion and recover the global hand position; their model is a deformable 2D planar curve, which cannot handle out-of-plane rotation. O'Hagan et al. [O'Hagan et al., 2001] developed a real-time HCI system that tracks fingertips from stereo views; it requires a clean background, and its accuracy was only evaluated using a grid pattern. Among the many tracking techniques, particular interest has been put on combining multiple cues [Wu and Huang, 2001], [Tao et al., 2000], [Birchfield, 1998], including color segmentation, edges and motion.

In this paper, we propose a Bayesian framework to combine the a priori knowledge of the color and shape of the hand with the observations from image sequences. Within this framework, an algorithm based on ICP (iterative closest point) [Zhang, 1994] is used to find the maximum likelihood solution
for the position and orientation of a rigid planar model of the hand in 3D space, given images captured with a single camera.

Section 2 establishes a Bayesian network to describe the generative model. Section 3 introduces a novel feature, the likelihood edge, and the observation process. In Section 4, the tracking problem is formulated as inference in the Bayesian network and solved with an ICP-based algorithm. Section 5 provides experimental results in both quantitative and visual form. Section 6 discusses the applicable situations and limitations of this approach and directions for future extension and improvement.

2 Bayesian Hand Tracking

A single-view sequence of color images is given by scaled orthographic projection of a rigid planar model of the hand undergoing homogeneous transformation. The a priori knowledge λ includes a rigid planar model of the hand contour, given as {s_i = (m_i, dm_i), i = 1, 2, ..., n}, where m_i = [u_i, v_i] are the 2D coordinates and dm_i = [du_i, dv_i] is the normal direction (pointing from the inner region to the outer region) of the contour at s_i, together with the initial pose (3D position and orientation) of the hand. The tracking problem is formulated as finding the corresponding sequence of rotation-translation matrices M = [R T] with respect to the initial pose. Figure 1 shows the Bayesian network describing the dependencies in the generative model, where LE and GE denote the likelihood edge and the grayscale edge respectively, as defined in Section 3.

Figure 1: The Bayesian network for hand tracking.

The observed features LE and GE (edge points in the likelihood ratio image and the grayscale image) are generated from a distribution p(edge | M, λ, LC). Assuming the prior p(M | λ, LC) is uniformly distributed over the subspace of feasible transformations, by Bayes' rule

p(M | edge, λ, LC) = p(edge | M, λ, LC) p(M | λ, LC) / p(edge | λ, LC)

and we can maximize the posterior p(M | edge, λ, LC) ∝ p(edge | M, λ, LC) p(M | λ, LC) by maximizing the likelihood p(edge | M, λ, LC), since p(edge | λ, LC) is independent of M. As LE and GE are independent given M and LC (i.e., d-separated [Jensen, 1996]), the likelihood function p(edge | M, λ, LC) can be decomposed as

p(edge | M, λ, LC) = p(LE | M, λ, LC) p(GE | M, λ, LC)    (1)
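Since LE and GE are conditionally independent given M and LC, a candidate transformation M can be scored in the log domain by summing the two feature log-likelihoods. The sketch below illustrates Equation (1) under an assumed i.i.d. Gaussian model for the point-matching residuals; the function names and the Gaussian assumption are illustrative, not part of the paper.

```python
import numpy as np

def feature_log_likelihood(residuals, sigma=2.0):
    # Illustrative assumption: i.i.d. Gaussian residuals between warped
    # model points and their matched edge points (up to a constant).
    r = np.asarray(residuals, dtype=float)
    return -0.5 * np.sum((r / sigma) ** 2)

def pose_score(residuals_le, residuals_ge):
    # Equation (1) in the log domain: d-separation of LE and GE given
    # M and LC turns the product of likelihoods into a sum of logs.
    return feature_log_likelihood(residuals_le) + feature_log_likelihood(residuals_ge)
```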
The observation and inference in the Bayesian network can be implemented with the flow chart shown in Figure 2; each component of the flow chart is explained in detail in the following sections.

Figure 2: The flow chart for Bayesian hand tracking.

3 Observation: Extract Matching Candidates

Each frame captured by the camera is an RGB image denoted by I_k, which is converted to a grayscale image G_k and an HSI (hue, saturation and intensity) image H_k.
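As a concrete sketch of this first stage of the flow chart, the snippet below converts a captured frame into G_k and a color representation standing in for H_k. OpenCV offers HSV rather than HSI, so the HSV conversion here is an assumption used as an approximation.

```python
import cv2

def preprocess_frame(frame_bgr):
    # Grayscale image G_k, used later for the grayscale edge GE.
    G_k = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # HSV stand-in for the HSI (hue, saturation, intensity) image H_k,
    # used later for the skin-color likelihood ratio image L_k.
    H_k = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    return G_k, H_k
```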
The HSI image is further mapped to a likelihood ratio image L_k by the function defined in Equation (2):

L_k(u, v) = p(H_k(u, v) | skin) / p(H_k(u, v) | nonskin)    (2)

Since

p(H_k(u, v) | skin) = p(skin | H_k(u, v)) p(H_k(u, v)) / p(skin)    (3)

p(H_k(u, v) | nonskin) = p(nonskin | H_k(u, v)) p(H_k(u, v)) / p(nonskin)    (4)

Equation (2) can be evaluated as

L_k(u, v) = p(skin | H_k(u, v)) p(nonskin) / (p(nonskin | H_k(u, v)) p(skin))    (5)

Jones and Rehg [Jones and Rehg, 1999] used a standard likelihood ratio approach [Fukunaga, 1990], but quantitative information is lost when the likelihood ratio is thresholded to decide whether a pixel belongs to the skin or nonskin region. To preserve the sufficient statistics, we use the likelihood ratio without thresholding. The candidate correspondences of the sample points m_i are the edge points in the likelihood ratio image, called the likelihood edge: LE = {sh_j = (le_j, dle_j), j = 1, 2, ..., n_LE}, where le_j = [u_j, v_j] denotes the 2D coordinates and dle_j = [du_j, dv_j] denotes the gradient. Similarly, the grayscale edge GE = {sg_j = (ge_j, dge_j), j = 1, 2, ..., n_GE} is extracted from the grayscale image G_k.
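A minimal sketch of this observation stage, assuming the skin and nonskin color densities are stored as normalized hue histograms (the density representation is an assumption here; Jones and Rehg built RGB histograms). The same edge extractor is applied to L_k for the likelihood edge and to G_k for the grayscale edge.

```python
import numpy as np

def likelihood_ratio_image(hue, skin_hist, nonskin_hist, eps=1e-6):
    # Equation (2): per-pixel ratio p(color | skin) / p(color | nonskin),
    # kept unthresholded to preserve the sufficient statistics.
    return skin_hist[hue] / (nonskin_hist[hue] + eps)

def edge_points_with_gradients(img, thresh):
    # Edge points and unit gradient directions: (le_j, dle_j) when applied
    # to L_k, (ge_j, dge_j) when applied to G_k.
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ys, xs = np.nonzero(mag > thresh)
    d = np.stack([gx[ys, xs], gy[ys, xs]], axis=1)
    d /= np.linalg.norm(d, axis=1, keepdims=True) + 1e-12
    return np.stack([xs, ys], axis=1), d
```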
4 Inference: Recover 3D Motion

Under moderate out-of-plane rotation and finger articulation, a human hand can be approximated as a planar object. Taking the centroid of the hand model as the origin of the world coordinate system and the z-axis as pointing out of the frontal side of the palm, the 3D coordinates of each sample point are p_i = [x_i[1] − x̄[1], y_i[1] − ȳ[1], 0]^T, where (x̄[1], ȳ[1]) is the centroid of the model. The transformation from time instant 1 to k can be expressed as

p_i[k] = R p_i[1] + T,  i = 1 ... N    (6)

Given image points of a planar object from two perspective camera views, it is a classical stereo vision problem to search for correspondences and solve for [R T] between the two camera coordinate systems [Tsai and Huang, 1981], [Tsai et al., 1982], [Weng et al., 1992], [Weng et al., 1993], [Hartley and Zisserman, 2000], [Faugeras et al., 2001]. However, a general solution is usually sensitive to observation noise. In our special case a canonical planar model is given, based on which an ICP-based algorithm [Zhang, 1994] can be used to iteratively search for correspondences and solve for the homography.

4.1 Matching With The Model

We use the homography matrix H from the previous frame (H is the identity matrix in the very first frame) to warp the model with Equations (9) and (10). Matching a warped model point m_i[k] with observed edge points g_j[k], l_j[k] can be formulated as the optimization problem in Equation (7):

F_GE(i) = argmin_{j ∈ N} d(s_i[k], g_j[k])
F_LE(i) = argmin_{j ∈ N} d(s_i[k], l_j[k])    (7)

where N denotes the search region, and the distance measure is defined as follows:

d(s_i, g_j) = w_1 (m_i − g_j)^T Σ^{-1} (m_i − g_j) + w_2 ⟨dm_i, dg_j⟩
d(s_i, l_j) = w_1 (m_i − l_j)^T Σ^{-1} (m_i − l_j) + w_2 ⟨dm_i, dl_j⟩    (8)

In d(s_i, g_j), the first term is the Mahalanobis distance [Duda and Hart, 1973] between m_i and g_j; Σ, the covariance matrix of g, is approximated by a diagonal matrix whose element σ_i equals the inverse of the strength of the edge point g_i. In order to discriminate between edges belonging to two adjacent fingers, the second term uses the inner product of the normal direction of the model point and that of the corresponding edge point. w_1 and w_2 are weights trading off position information against orientation information. The notation for d(s_i, l_j) is analogous. This optimization problem is solved by a nearest-neighbor search, with the search region adapted according to the distribution of the distances between correspondences in the previous frame.

4.2 Estimating the 3D Homography Transformation

Assuming the intrinsic camera parameter matrix Π is approximately invariant and can be estimated beforehand, and that the camera optic center at time instant 1 lies at distance d along the Z direction, we can express the 3D coordinates and 2D projection of the i-th sample point as p'_i[1] = [x_i[1] − x̄[1], y_i[1] − ȳ[1], z_i[1]]^T (with z_i[1] = d) and m_i[1] = (1/d) Π p'_i[1], where (x̄[1], ȳ[1]) is the centroid of the planar model.

To reduce error accumulation, we estimate motion from the very first frame to frame k. Since the index i of each model point is consistent across all frames, the correspondence in Equation (7) is equivalent to correspondences between the edge points in frame k and the model points in the very first frame. Denote the edge points in frame k as x_i[k] = F_GE(i)[k] (i = 1 ... N) and x_j[k] = F_LE(j)[k] (j = N+1 ... 2N), and the model points in frame 1 as x_i[1] = m_i[1] (i = 1 ... N) and x_j[1] = m_{j−N}[1] (j = N+1 ... 2N); then we have x_i[k] ≅ H x_i[1] (i = 1 ... 2N). Given the correspondences, H can be solved up to a scale with the four-point algorithm [Faugeras et al., 2001]. We warp the model p'_i[1] with the homography H, project it with the intrinsic parameter matrix Π, and go back to the matching step to search for correspondences:

p'_i[k] = H p'_i[1]    (9)

m_i[k] = (1/z_i[k]) Π p'_i[k]    (10)
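The following sketch implements the matching step of Equations (7) and (8) as a brute-force nearest-neighbor search, together with a standard DLT estimate of H from the resulting correspondences as a stand-in for the four-point algorithm of [Faugeras et al., 2001]. The weights w1, w2 and the exhaustive search (in place of the paper's adaptive search region) are simplifying assumptions.

```python
import numpy as np

def match_model_points(model_pts, model_normals, edge_pts, edge_dirs,
                       edge_strength, w1=1.0, w2=1.0):
    # Equation (8): Sigma is diagonal with sigma_j = 1 / edge strength,
    # so Sigma^{-1} contributes a factor equal to the strength itself.
    matches = []
    for m, dm in zip(model_pts, model_normals):
        diff = edge_pts - m                    # (n_edges, 2) displacements
        maha = np.sum(diff * diff, axis=1) * edge_strength
        orient = edge_dirs @ dm                # inner products <dm_i, dg_j>
        matches.append(int(np.argmin(w1 * maha + w2 * orient)))
    return np.array(matches)

def homography_dlt(x1, x2):
    # Direct linear transform: solve x2 ~ H x1 up to scale from >= 4
    # point correspondences via SVD.
    A = []
    for (u, v), (up, vp) in zip(x1, x2):
        A.append([0, 0, 0, -u, -v, -1, vp * u, vp * v, vp])
        A.append([u, v, 1, 0, 0, 0, -up * u, -up * v, -up])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    return Vt[-1].reshape(3, 3)
```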
4.3 Final Pose: From the 3D Homography to R and T

When the ICP algorithm converges, the 3D homography H is solved from m_i[k] = H m_i[1], i = 1, 2, ..., N, where m_i[1] are the sample points in the very first frame and m_i[k] those in the current frame, with the SVD-based algorithm [Faugeras et al., 2001]. From H, we can find a unique solution for R' and T' (see Appendix I), which relate p'_i[1] (the sample points in camera coordinates at time 1) to p'_i[k]. To find [R T] between p_i[1] and p_i[k] in world coordinates as defined in Equation (6), we notice that

p_i[k] = R' p_i[1] + (R' − I)[0, 0, d]^T + T'    (11)

thus R = R' and T = (R' − I)[0, 0, d]^T + T'.
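Once R' and T' are recovered from H (Appendix I; [Faugeras et al., 2001]), the change of coordinates in Equation (11) is a few lines of linear algebra. A minimal sketch, assuming R' and T' are already available:

```python
import numpy as np

def world_motion_from_camera_motion(R_prime, T_prime, d):
    # Equation (11): the world origin sits at depth d in front of the
    # camera at time 1, so R = R' and T = (R' - I)[0, 0, d]^T + T'.
    offset = np.array([0.0, 0.0, d])
    R = R_prime
    T = (R_prime - np.eye(3)) @ offset + T_prime
    return R, T
```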
5 Experimental Results

5.1 Quantitative Results for Global Motion

The trajectories of the translation and rotation parameters are shown in Figure 3. The mean square error (MSE) in each dimension is shown in Table 1, where R(X), R(Y), R(Z) denote rotation about the X, Y, Z axes respectively and T(X), T(Y), T(Z) denote translation along the X, Y, Z axes.

Figure 3: Trajectories of the translation and rotation parameters (Rot(X), Rot(Y), Rot(Z), T(X), T(Y), T(Z) versus frame number). The green lines show the original transformation used to synthesize the image sequence; the red dots show the output of the global tracking algorithm.

Table 1: MSE, the range of motion (ROM) and the percentage value of their ratio in each dimension.

            R(X)   R(Y)   R(Z)   T(X)   T(Y)   T(Z)
MSE
ROM
Ratio (%)

5.2 Demonstration Using Real-World Data

We have implemented the system; it executes at 29 frames per second on a Pentium III 1.0 GHz processor. Figure 4 shows snapshots from various video clips: simultaneous in-plane rotation, translation and slight out-of-plane rotation; occlusion by another hand with very similar pose and shape for a significant period of time, with lighting variation; and strong out-of-plane rotation with lighting variation.

Figure 4: Snapshots of the tracking result. These and other AVI sequences are available at the author's web site hzhou/histedge.

Compared with the hand tracker by Blake and Isard [Blake and Isard, 1998], our tracker has the following advantages: (1) it can recover not only in-plane rotation but also out-of-plane rotation; (2) it is robust against cluttered backgrounds; and (3) it is robust against lighting variation.

6 Conclusions

In this paper, we propose a Bayesian framework to combine the a priori knowledge of the color and shape of the hand with the observations from image sequences. Based on this framework, we introduce a new feature, the likelihood edge, to combine color and edge information, and we use an ICP-based algorithm to find the maximum likelihood solution for the position and orientation of the hand in 3D space. The ICP-based iteration alleviates the influence of noise, and the robustness of the approach has been verified with extensive experiments. Since we use a 2D model to approximate the hand, the matching of edge points introduces considerable noise when extreme out-of-plane rotation of the hand occurs.
In order to drive a control using the recovered motion, it would be helpful to smooth the motion analysis results. We shall also extend our approach to articulated hand tracking by incorporating a local finger tracker.

Acknowledgments

This work is supported by National Science Foundation Grant IIS and by the National Science Foundation Alliance Program. The authors thank Ying Wu and John Lin for inspiring discussions and suggestions.

References

[Birchfield, 1998] Birchfield, S. (1998). Elliptical head tracking using intensity gradients and color histograms. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 1.

[Black and Jepson, 1996] Black, M. and Jepson, A. (1996). EigenTracking: Robust matching and tracking of articulated objects using a view-based representation. In Proc. European Conference on Computer Vision, volume 1.

[Blake and Isard, 1998] Blake, A. and Isard, M. (1998). Active Contours. Springer-Verlag, London.

[Duda and Hart, 1973] Duda, R. and Hart, P. (1973). Pattern Classification and Scene Analysis. Wiley, New York.

[Faugeras et al., 2001] Faugeras, O., Luong, Q.-T., and Papadopoulo, T. (2001). The Geometry of Multiple Images. MIT Press, Cambridge.

[Hartley and Zisserman, 2000] Hartley, R. and Zisserman, A. (2000). Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge.

[Jensen, 1996] Jensen, F. V. (1996). An Introduction to Bayesian Networks. Springer-Verlag, New York.

[Jones and Rehg, 1999] Jones, M. and Rehg, J. (1999). Statistical color models with application to skin detection. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume I, Fort Collins.

[Fukunaga, 1990] Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition. Academic Press, New York, second edition.

[O'Hagan et al., 2001] O'Hagan, R., Zelinsky, A., and Rougeaux, S. (2001). Visual gesture interfaces to virtual environments. Interacting with Computers.

[Tao et al., 2000] Tao, H., Sawhney, H., and Kumar, R. (2000). Dynamic layer representation with applications to tracking. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 2.

[Tsai and Huang, 1981] Tsai, R. and Huang, T. (1981). Estimating 3-D motion parameters of a rigid planar patch I. IEEE Transactions on Acoustics, Speech, and Signal Processing, 29(12).
[Tsai et al., 1982] Tsai, R., Huang, T., and Zhu, W. (1982). Estimating 3-D motion parameters of a rigid planar patch II: Singular value decomposition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 30(8).

[Weng et al., 1993] Weng, J., Ahuja, N., and Huang, T. (1993). Motion and Structure from Image Sequences. Springer-Verlag, New York.

[Weng et al., 1992] Weng, J., Huang, T., and Ahuja, N. (1992). Motion and structure from line correspondences: Closed-form solution, uniqueness, and optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14.

[Wu and Huang, 1999] Wu, Y. and Huang, T. S. (1999). Capturing articulated human hand motion: A divide-and-conquer approach. In Proc. IEEE International Conference on Computer Vision, Corfu, Greece.

[Wu and Huang, 2001] Wu, Y. and Huang, T. S. (2001). Robust visual tracking by co-inference learning. In Proc. IEEE International Conference on Computer Vision, pages 26-33, Vancouver.

[Zhang, 1994] Zhang, Z. (1994). Iterative point matching for registration of free-form curves and surfaces. International Journal of Computer Vision, 13(2):119-152.