Hand Pose Recovery with a Single Video Camera

Kihwan Kwon, Hong Zhang, and Fadi Dornaika
Dept. of Computing Science, University of Alberta, Edmonton, Alberta, Canada
{kihwan, zhang, dornaika}@cs.ualberta.ca

Keywords: hand pose, object pose, real-time tracking, glove, hand animation

Abstract

In this paper, we present a system to determine human hand pose from a single monochrome image in real time. We formulate the hand pose problem as determining the position and orientation of the palm as well as the joint angles of the fingers. The 3D pose of the palm and each joint angle are computed simultaneously using feature points on the human hand. We exploit an existing fast object pose algorithm developed by DeMenthon and Davis [1] together with a heuristic minimization of a 1D function. We make use of a special glove to remove noise caused by deformation of the hand and to help find correspondences between 3D object points and their images. We demonstrate our system using a mock-up cardboard hand with one finger. Experimental results with a real human hand wearing the special glove are presented.

1 Introduction

Hand pose recovery has applications in teleoperation of robot hands, human-computer interfaces, hand animation in computer graphics, and virtual reality. One popular approach is to use gloves capable of sensing joint positions. The problem with such gloves is their invasiveness, since the user has to wear a glove wired to a machine. In contrast, computer vision based approaches have been attempted but with limited success. One of the problems with computer vision is that it is slow due to complex algorithms and costly image processing. For a computer vision based hand tracking system to be successful and widely available, four essential requirements should be satisfied: it should be inexpensive, fast, robust, and accurate. To be inexpensive, the system should be simple; rather than using several cameras, one is preferred.
The system should be fast, since for most hand tracking applications (e.g., gesture recognition) real-time performance is desirable. It should be robust in dealing with occlusion and image noise. And, of course, the computed hand pose should be accurate.

Dorner [3] presented a hand tracking system to serve as an interface for American Sign Language signers. The system used non-linear optimization techniques to recover 26 parameters of hand pose by minimizing the difference between the projected model and features in the image. The system did not work in real time. Rehg [4] implemented a system called DigitEyes which tracked an unadorned hand using line and point features. He used residual functions for fitting the projected model to features in the image, and had to use special hardware for real-time performance. Stereo was employed to track full hand articulation. Segen and Kumar [5] implemented a system to drive an articulated hand model with stereo images. It worked in real time, although their system controlled only the wrist, index finger, and thumb, with some joints disabled.

Our system relates the hand pose problem to a linear perspective-n-point pose algorithm. Joint angles and the palm pose are determined by a perspective-n-point pose algorithm together with a heuristic minimization of a 1D function. Our system has several advantages. First, it uses an object pose algorithm based on linear techniques, allowing the system to be fast. Second, we find the joint angles in order by applying a simple technique repeatedly, rather than recovering all defined hand pose parameters with one nonlinear minimization process. The same technique is applied to a different subset of the feature points depending on which joint angle we compute: the subset consists of five points for the computation of a bottom joint angle and six for the computation of a middle joint angle. Third, we use a single camera, and the estimates of the joint angles and the pose of the human palm in 3D space are accurate as long as the feature points can be seen.

The remainder of this paper is organized as follows.
In section 2, assumptions for our system are given. In section 3, we elaborate on our hand pose algorithm. In section 4, the object pose algorithm we used is briefly discussed. In section 5, we describe some of our experiments, and we conclude with future directions of research in section 6.

2 Assumptions

Several assumptions are made for the simplicity of our algorithm and to help real-time execution. We assume the palm is rigid, without considering its deformation. Note that palm deformation has not been studied in the hand tracking literature, although it has been treated as noise. In our case, we force the palm to be rigid by using a wood board, so that our feature points cannot be deformed or moved independently by palm deformation. Each finger (the thumb is not considered in this paper) has three links. The bottom joint, joined with the palm, has one degree of freedom (DOF) instead of two: abduction and adduction are not considered in our system. The top joint angle of each finger is dependent on the middle joint angle, and their coupling is assumed linear; Rijpkema and Girard [2] suggest that the relationship between the middle joint angle and the top joint angle is almost linear. This helps increase the speed of the system, since the top joint angles are obtained from the middle joint angles. We regard the top joint angle as two-thirds of the middle joint angle, θ3 = (2/3)θ2. Note that each finger then has two DOFs, since the bottom and middle joints have one DOF each and the top joint is dependent on the middle one. We assume that all the feature points are visible all the time and that they are correctly labeled.

3 Hand pose using only 12 markers

We decided to use simple points for the hand model (Figure 1) for at least two reasons. First, experiments on human motion perception [6] showed that humans can recognize biological motions in 3D space by observing just light markers attached to the subject. This implies that simple points carry enough information about the motion of an articulated object. Second, the 3D pose of a rigid object can be determined using at least three model points [8]. Since the hand is composed of rigid segments, it should be possible to determine hand pose using points.

Figure 1: Our hand model: 12 points are used to describe all defined hand parameters.

Figure 2: Problem definition of hand pose (one finger is shown for the sake of simplicity). The palm pose (R, t) relates the palm frame P to the camera frame; θ1 and θ2 are the joint angles of the finger frame F.

We attach four coplanar points to the palm; four is the minimum number of points required by the object pose algorithm we use [1]. We use only one feature point for each segment of a finger. Moreover, since we assume the top joints are linearly related to the middle joints, computation of the top joint angles does not require any marker. So 14 DOFs (6 for the palm and 8 for the four fingers) of the human hand are determined by only 12 points. Figure 1 shows the configuration of feature points on a right hand.

3.1 Hand pose

Our problem can be stated as follows. Given one captured image featuring all markers and knowledge of the palm model and the fingers' kinematics, we estimate the palm pose as well as the joint angles of each finger (see Figure 2). We could, of course, simultaneously recover the palm
pose and the joint angles using the perspective equations associated with each image point. However, this leads to non-linear systems, which is not our goal. Thus, we have developed the following heuristic, which combines a real-time object pose algorithm with the minimization of a 1D function.

First consider the problem of finding the 3D object pose of the model shown in Figure 3 using five points labeled 1 through 5. To find the object pose, we need to know the 3D coordinates of the five points with respect to the palm frame, and the actual image coordinates corresponding to those five points. The problem is that we do not know the coordinates of feature 5 with respect to the palm frame, since we do not know the joint angle of the finger (note that the finger moves in a plane perpendicular to the palm plane, according to our assumption). To resolve this, the angle is set to an initial value that will be updated using a specific criterion. Using this initial estimate (although it may be far from the actual angle), we can compute the coordinates of feature 5. Now we can determine the 3D pose of the hand model using the five points and their corresponding image points. Once the 3D pose transformation of the hand model is found, we can back-project the five points onto the image plane to verify the transformation: it is easy to see whether the transformation is accurate by comparing the projected image points with the actual image points. We repeat the process with another angle value until an accurate transformation is found. Once an accurate transformation is found, the guessed angle is the actual angle of the finger, and the transformation itself represents the palm pose.

We use the following notation: (u_i, v_i) are the image coordinates of the object points (x_i, y_i, z_i), i ∈ {1, ..., 5}; T_FP is the transformation from the finger frame to the palm frame; M is the camera projection matrix; L is the distance of feature 5 from the origin of the finger frame F. Note that M and T_FP are known. Our algorithm can be summarized as follows:

1. Set the joint angle θ to a lower bound.
2. Compute the coordinates of feature 5 with respect to the palm frame:
   (x_5, y_5, z_5, 1)^T = T_FP (L cos θ, L sin θ, 0, 1)^T
3. Using an object pose algorithm, compute T_PC, the transformation from the palm frame to the camera frame.
4. Back-project the object points using the transformation found and the camera projection matrix:
   (u'_i, v'_i) = M T_PC (x_i, y_i, z_i, 1)^T
5. Compute the residual error
   e = (1/N) Σ_{i=1}^{N} E_i,  N = 5,
   where E_i is the Euclidean distance between the projected image point (u'_i, v'_i) and the actual image point (u_i, v_i).
6. Save the (angle, residual error) pair.
7. If the angle is greater than an upper bound, stop; otherwise increment the angle by a fixed step and go to step 2.
8. Choose the angle which corresponds to the least residual error. That is the estimated joint angle.

However, there are several concerns with this algorithm. First, it takes time, since the object pose algorithm must be run many times for the computation of each joint angle, making the system slow. This can be overcome by using a fast object pose algorithm like the one developed by DeMenthon and Davis [1]; section 4 describes how we used it. Second, even if the projected image points and the actual image points coincide, the joint angle is not always correct: Figure 4 shows a degenerate case where two solutions are possible. This kind of problem can be overcome using a small search range. In our system, except for the initial run, the search range can be reduced significantly: if we know the previous angle and assume the finger joint has not rotated much between frames at a fast frame rate, the next angle is close to the previous one, and a situation of multiple solutions can be avoided.
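As an illustration, the 1D search in steps 1 through 8 might be sketched as follows. This is a minimal reconstruction under the notation above, not the authors' implementation: the object pose solver is a caller-supplied placeholder (the paper uses the DeMenthon-Davis algorithm [1]), and all function and variable names are assumed.

```python
import numpy as np

def feature5_in_palm(theta, T_fp, L):
    """Step 2: coordinates of feature 5 in the palm frame for joint
    angle theta, given the 4x4 finger-to-palm transform T_fp."""
    return T_fp @ np.array([L * np.cos(theta), L * np.sin(theta), 0.0, 1.0])

def back_project(M, T_pc, pts_h):
    """Step 4: map homogeneous palm-frame points (Nx4) to image points
    via the palm-to-camera transform T_pc (4x4) and projection M (3x4)."""
    uvw = (M @ (T_pc @ pts_h.T)).T
    return uvw[:, :2] / uvw[:, 2:3]            # perspective divide

def search_joint_angle(image_pts, palm_pts_h, T_fp, L, M, pose_solver,
                       lower, upper, step):
    """Steps 1-8: scan [lower, upper] in increments of `step`, keeping the
    angle whose back-projected points best match the actual image points
    (mean Euclidean residual)."""
    best_theta, best_err = None, np.inf
    theta = lower
    while theta <= upper:
        pts = np.vstack([palm_pts_h, feature5_in_palm(theta, T_fp, L)])
        T_pc = pose_solver(pts, image_pts)     # step 3 (placeholder solver)
        residual = np.linalg.norm(
            back_project(M, T_pc, pts) - image_pts, axis=1).mean()  # step 5
        if residual < best_err:                # steps 6 and 8
            best_theta, best_err = theta, residual
        theta += step                          # step 7
    return best_theta
```

With tracking, the range [lower, upper] can be narrowed to a small interval around the previous frame's angle, as discussed above.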
Also, finger physiology can be considered to exclude unreasonable angles. By repeating the above algorithm, all other joint angles can be computed in the same way, although computation of a middle joint angle requires six feature points. Note that the palm pose can be derived either from the above algorithm or from the four coplanar points attached to the palm.

3.2 The glove

To support our assumptions (each finger moves in a plane and the palm is rigid) on a real hand, we developed a special glove (Figure 5). For the fingers, thin
wood board segments are joined together using small hinges, which force each finger to move in a plane and disable adduction and abduction. The joined wood board segments are attached to a wood board palm, making the palm effectively rigid and removing the noise caused by the non-rigidity of the human palm. This hand-like structure is attached to a black glove. Feature points are printed and glued on the palm and on the finger segments. Note that the feature points are positioned between finger joints, not on the joints, to prevent deformation of the feature points when the fingers bend. We tried to reduce the invasiveness of the glove by using a wool glove and thin wood board segments. The advantage of using such a supporting structure is twofold. First, it helps to acquire more accurate feature positions, leading to accurate computation of joint angles. Second, the glove will fit almost all adult subjects, requiring no calibration of the system, which would be necessary if a naked hand were used, since each person's hand has a different size.

Figure 3: A simple example describing our algorithm. Feature 5 lies at distance L from the origin of the finger frame F; all five points should be expressed in the palm frame.

Figure 4: A degenerate case where the plane of the finger passes through the center of projection. In this case, the trajectory of the finger tip (a circle) projects onto a straight line, so there are two possible angles associated with each line of sight.

Figure 5: The special glove to be worn by the subject.

4 Object pose

Determining object pose from images of model points is a well-known problem in photogrammetry and computer vision. This so-called Perspective-n-Point problem [8] can be solved with closed-form or numerical solutions. Closed-form solutions have been applied only to a limited number of points ([8], [9]). Numerical solutions ([10], [11]) are required for an arbitrary number of points; however, good initial guesses are necessary for the numerical algorithms to converge, and they are also computationally expensive. Recently, a linear n-point algorithm was suggested by [12], making a closed-form solution available with four or more points.

We used the iterative method developed by DeMenthon and Davis [1]. Their algorithm exploits the scaled orthographic projection (SOP). In SOP, object points are first projected onto a plane parallel to the image plane and passing through the origin of the object frame, and are then projected with true perspective onto the image plane (Figure 6). The algorithm has several advantages compared to non-linear methods. First, it does not require initial guesses. Second, it is fast, so it suits real-time applications. Like non-linear methods, it can be applied to an arbitrary number of points. The algorithm deals with two distinct cases: the coplanar case (all the object points lie on a plane) and the non-coplanar case. In our system, however, the configuration of points can be coplanar or non-coplanar depending on the joint angles of the moving fingers. In Figure 7, if the finger joint angle is zero, all five points lie on a plane. Experiments showed that when the joint angle was small (approximately less than 20 degrees), neither the coplanar algorithm nor the non-coplanar algorithm worked well, since the configuration is not coplanar and not
quite non-coplanar.

Figure 6: Perspective projection (p_i) and scaled orthographic projection (w_i) of an object point P_i. The reference point P_0 is projected onto p_0 in both projections.

Figure 7: Four virtual feature points are added to the palm model in order to make the 3D model of the whole hand non-coplanar regardless of the joint angle values.

To make the configuration always non-coplanar, so that we do not have to worry about the case when a joint angle is zero or small, we use four imaginary points (see Figure 7). Since we can compute the transformation from the camera to the palm from the four actual coplanar points on the palm using the coplanar algorithm, we can back-project the imaginary object points using this transformation to get the corresponding imaginary image coordinates required by the non-coplanar algorithm. With this one extra step, we can always use the non-coplanar algorithm with nine points. We also impose an orthogonality constraint, which is not enforced by this object pose algorithm, using the method described in [7]. This way we obtain a more accurate object pose.

5 Experiments

To recognize the markers easily, we use a black background and a black glove with white markers. We use circular dots of different sizes for the palm pose. Once the palm pose is known, we can use the associated palm-plane-to-image-plane mapping to project each finger marker onto the image. The associated angle value can be set to a fixed value (e.g., the average value) or to the value computed from the previous image. This provides a predicted 2D location for each finger marker. Distinguishing between fingers can then be carried out by comparing the predicted 2D locations with the actual 2D locations.
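This labeling step might be sketched as a greedy nearest-neighbour match between predicted and detected 2D marker locations. The sketch below is an illustrative reconstruction under the paper's all-markers-visible assumption, not the authors' code, and the function name is assumed.

```python
# Greedy nearest-neighbour labeling: predicted[i] is assigned the index of
# the closest still-unused detected marker. Assumes every marker is visible
# and detected exactly once, as the paper does.

def label_markers(predicted, detected):
    """Return, for each predicted (x, y) location, the index of the
    matching detected marker."""
    labels, used = [], set()
    for px, py in predicted:
        best_j, best_d = None, float("inf")
        for j, (dx, dy) in enumerate(detected):
            if j in used:
                continue
            d = (px - dx) ** 2 + (py - dy) ** 2   # squared distance suffices
            if d < best_d:
                best_j, best_d = j, d
        labels.append(best_j)
        used.add(best_j)
    return labels
```

For the small, well-separated marker sets used here, a greedy assignment is adequate; a globally optimal assignment (e.g., Hungarian method) would only matter if predictions were poor.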
Note that the use of circular dots as markers considerably reduces the image processing cost and allows very accurate localization of image points. A grayscale NEC T1-23A CCD camera was set up about one meter away from the hand; the hand size was about 15 cm. Since we are using a weak perspective object pose algorithm, the hand should be at some distance from the camera for better performance: for the weak perspective pose algorithm to converge, an object should lie in the neighborhood of the optical axis and beyond some distance from the camera (for details, see [7] and [1]). For easy verification of our algorithm, we developed an animated hand with OpenGL, so that the animated hand can follow the motion of the real hand wearing the special glove. The machine we are using is a Pentium 500 MHz PC running Linux.

We did two types of experiments. First, we made a hand from thick cardboard with one finger and only one joint (Figure 8). This allows us to verify our algorithm, since the cardboard finger angle can be measured accurately. Second, we tested with a real hand wearing the developed glove.

5.1 Experiment with a cardboard hand

To test the accuracy of the proposed algorithm, we used the cardboard hand described above. Table 1 shows the measured and computed angles for several configurations; these configurations are shown in Figure 9. Each configuration has a different joint angle except for B and C. As can be seen in Table 1, errors are within 5 degrees for arbitrary configurations in 3D space.
Figure 8: The model hand made of cardboard with only one finger segment (views A, B, C).

configuration  | A | B | C | D | E | F
measured angle |   |   |   |   |   |
computed angle |   |   |   |   |   |

Table 1: Test results associated with six typical configurations.

Figure 9: Typical configurations used for testing (views D, E, F).

5.2 Experiments with a real hand wearing the glove

Wearing the glove described previously, the animated hand model followed the real hand successfully in real time. The hand pose algorithm runs at 8 Hz. It is difficult to obtain ground truth for the joint angles of moving fingers. Although ground truth for the fingers' motion is not available, one can evaluate the performance by comparing the real motion with that performed by the synthetic model. We also plan to measure it accurately using a robot hand fitted with feature points. Figure 10 shows images of the real hand moving in 3D space wearing the glove, the animated hand model following the motion of the real hand, and the tracked real hand. We found that the search range of the joint angle can be ±5 degrees around the previous angle, with an increment step of 1 degree.

6 Summary and future research

We have described how to track a human hand using a single camera. The tracked features consist of 12 simple markers. The algorithm is simple and accurate; moreover, it works in real time. We used a special glove made of wood board segments. One of its advantages is that it fits almost all subjects, so no system calibration is necessary for different subjects. Our future research will focus on the occlusion problem. We assumed that all feature circles can be seen at any given time; in reality this is not the case, and feature points are easily occluded. We are considering feature tracking algorithms to alleviate the occlusion problem. Different features such as rings, as in Dorner [3], are also being considered, since a ring can be seen from almost any direction.

Figure 10: Left: the real hand in action (images were taken with another camera from different directions); middle: the animated hand model following the 3D motion of the real hand (palm and fingers); right: the tracked real hand.

References

[1] D. F. DeMenthon and L. S. Davis. Model-based object pose in 25 lines of code. International Journal of Computer Vision, 15(1/2), 1995.
[2] H. Rijpkema and M. Girard. Computer animation of knowledge-based human grasping. Computer Graphics, 25(4), July 1991.
[3] B. Dorner. Hand shape identification and tracking for sign language interpretation. In Looking at People Workshop, Chambery, France, 1993.
[4] J. M. Rehg and T. Kanade. Visual tracking of high DOF articulated structures: An application to human hand tracking. In Proc. 3rd ECCV, Stockholm, Sweden, volume II, pages 35-45, 1994.
[5] J. Segen and S. Kumar. Driving a 3D articulated hand model in real-time. In IMDSP'98, Alpbach, Austria, July 1998.
[6] G. Johansson. Visual perception of biological motion and a model for its analysis. Perception & Psychophysics, 14(2), 1973.
[7] R. Horaud, F. Dornaika, B. Lamiroy, and S. Christy. Object pose: the link between weak perspective, paraperspective, and full perspective. International Journal of Computer Vision, 22(2), March 1997.
[8] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6), 1981.
[9] R. Horaud, B. Conio, O. Leboulleux, and B. Lacolle. An analytic solution for the perspective 4-point problem. Computer Vision, Graphics, and Image Processing, 47(1):33-44, July 1989.
[10] J. S.-C. Yuan. A general photogrammetric method for determining object position and orientation. IEEE Transactions on Robotics and Automation, 5(2), April 1989.
[11] R. Y. Tsai. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal of Robotics and Automation, 3, 1987.
[12] L. Quan and Z. Lan. Linear n-point camera pose determination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(7), July 1999.
Czech Pattern Recognition Workshop, Tomáš Svoboda (Ed.) Peršlák, Czech Republic, February 4, Czech Pattern Recognition Society Camera Calibration with a Simulated Three Dimensional Calibration Object Hynek
More informationIntelligent Robotics
64-424 Intelligent Robotics 64-424 Intelligent Robotics http://tams.informatik.uni-hamburg.de/ lectures/2013ws/vorlesung/ir Jianwei Zhang / Eugen Richter University of Hamburg Faculty of Mathematics, Informatics
More informationLow Cost Motion Capture
Low Cost Motion Capture R. Budiman M. Bennamoun D.Q. Huynh School of Computer Science and Software Engineering The University of Western Australia Crawley WA 6009 AUSTRALIA Email: budimr01@tartarus.uwa.edu.au,
More informationFlexible Calibration of a Portable Structured Light System through Surface Plane
Vol. 34, No. 11 ACTA AUTOMATICA SINICA November, 2008 Flexible Calibration of a Portable Structured Light System through Surface Plane GAO Wei 1 WANG Liang 1 HU Zhan-Yi 1 Abstract For a portable structured
More informationMultiple Motion Scene Reconstruction from Uncalibrated Views
Multiple Motion Scene Reconstruction from Uncalibrated Views Mei Han C & C Research Laboratories NEC USA, Inc. meihan@ccrl.sj.nec.com Takeo Kanade Robotics Institute Carnegie Mellon University tk@cs.cmu.edu
More informationVisual Tracking of Human Body with Deforming Motion and Shape Average
Visual Tracking of Human Body with Deforming Motion and Shape Average Alessandro Bissacco UCLA Computer Science Los Angeles, CA 90095 bissacco@cs.ucla.edu UCLA CSD-TR # 020046 Abstract In this work we
More informationOccluded Facial Expression Tracking
Occluded Facial Expression Tracking Hugo Mercier 1, Julien Peyras 2, and Patrice Dalle 1 1 Institut de Recherche en Informatique de Toulouse 118, route de Narbonne, F-31062 Toulouse Cedex 9 2 Dipartimento
More informationA deformable model driven method for handling clothes
A deformable model driven method for handling clothes Yasuyo Kita Fuminori Saito Nobuyuki Kita Intelligent Systems Institute, National Institute of Advanced Industrial Science and Technology (AIST) AIST
More informationA Two-stage Scheme for Dynamic Hand Gesture Recognition
A Two-stage Scheme for Dynamic Hand Gesture Recognition James P. Mammen, Subhasis Chaudhuri and Tushar Agrawal (james,sc,tush)@ee.iitb.ac.in Department of Electrical Engg. Indian Institute of Technology,
More informationAccurate Motion Estimation and High-Precision 3D Reconstruction by Sensor Fusion
007 IEEE International Conference on Robotics and Automation Roma, Italy, 0-4 April 007 FrE5. Accurate Motion Estimation and High-Precision D Reconstruction by Sensor Fusion Yunsu Bok, Youngbae Hwang,
More informationVehicle Occupant Posture Analysis Using Voxel Data
Ninth World Congress on Intelligent Transport Systems, Chicago, Illinois, October Vehicle Occupant Posture Analysis Using Voxel Data Ivana Mikic, Mohan Trivedi Computer Vision and Robotics Research Laboratory
More informationMETRIC PLANE RECTIFICATION USING SYMMETRIC VANISHING POINTS
METRIC PLANE RECTIFICATION USING SYMMETRIC VANISHING POINTS M. Lefler, H. Hel-Or Dept. of CS, University of Haifa, Israel Y. Hel-Or School of CS, IDC, Herzliya, Israel ABSTRACT Video analysis often requires
More informationInverse Kinematics. Given a desired position (p) & orientation (R) of the end-effector
Inverse Kinematics Given a desired position (p) & orientation (R) of the end-effector q ( q, q, q ) 1 2 n Find the joint variables which can bring the robot the desired configuration z y x 1 The Inverse
More informationPOME A mobile camera system for accurate indoor pose
POME A mobile camera system for accurate indoor pose Paul Montgomery & Andreas Winter November 2 2016 2010. All rights reserved. 1 ICT Intelligent Construction Tools A 50-50 joint venture between Trimble
More informationCH2605-4/88/0000/0082$ IEEE DETERMINATION OF CAMERA LOCATION FROM 2D TO 3D LINE AND POINT CORRESPONDENCES
DETERMINATION OF CAMERA LOCATION FROM 2D TO 3D LINE AND POINT CORRESPONDENCES Yuncai Liu Thomas S. Huang and 0. D. Faugeras Coordinated Science Laboratory University of Illinois at Urbana-Champaign 1101
More informationA General Expression of the Fundamental Matrix for Both Perspective and Affine Cameras
A General Expression of the Fundamental Matrix for Both Perspective and Affine Cameras Zhengyou Zhang* ATR Human Information Processing Res. Lab. 2-2 Hikari-dai, Seika-cho, Soraku-gun Kyoto 619-02 Japan
More informationArticulated Structure from Motion through Ellipsoid Fitting
Int'l Conf. IP, Comp. Vision, and Pattern Recognition IPCV'15 179 Articulated Structure from Motion through Ellipsoid Fitting Peter Boyi Zhang, and Yeung Sam Hung Department of Electrical and Electronic
More informationVisualization 2D-to-3D Photo Rendering for 3D Displays
Visualization 2D-to-3D Photo Rendering for 3D Displays Sumit K Chauhan 1, Divyesh R Bajpai 2, Vatsal H Shah 3 1 Information Technology, Birla Vishvakarma mahavidhyalaya,sumitskc51@gmail.com 2 Information
More informationCompositing a bird's eye view mosaic
Compositing a bird's eye view mosaic Robert Laganiere School of Information Technology and Engineering University of Ottawa Ottawa, Ont KN 6N Abstract This paper describes a method that allows the composition
More informationDynamic Model Of Anthropomorphic Robotics Finger Mechanisms
Vol.3, Issue.2, March-April. 2013 pp-1061-1065 ISSN: 2249-6645 Dynamic Model Of Anthropomorphic Robotics Finger Mechanisms Abdul Haseeb Zaidy, 1 Mohd. Rehan, 2 Abdul Quadir, 3 Mohd. Parvez 4 1234 Mechanical
More informationCS 664 Structure and Motion. Daniel Huttenlocher
CS 664 Structure and Motion Daniel Huttenlocher Determining 3D Structure Consider set of 3D points X j seen by set of cameras with projection matrices P i Given only image coordinates x ij of each point
More informationActivityRepresentationUsing3DShapeModels
ActivityRepresentationUsing3DShapeModels AmitK.Roy-Chowdhury RamaChellappa UmutAkdemir University of California University of Maryland University of Maryland Riverside, CA 9252 College Park, MD 274 College
More informationOutdoor Scene Reconstruction from Multiple Image Sequences Captured by a Hand-held Video Camera
Outdoor Scene Reconstruction from Multiple Image Sequences Captured by a Hand-held Video Camera Tomokazu Sato, Masayuki Kanbara and Naokazu Yokoya Graduate School of Information Science, Nara Institute
More informationTriangulation: A new algorithm for Inverse Kinematics
Triangulation: A new algorithm for Inverse Kinematics R. Müller-Cajar 1, R. Mukundan 1, 1 University of Canterbury, Dept. Computer Science & Software Engineering. Email: rdc32@student.canterbury.ac.nz
More information3D Face and Hand Tracking for American Sign Language Recognition
3D Face and Hand Tracking for American Sign Language Recognition NSF-ITR (2004-2008) D. Metaxas, A. Elgammal, V. Pavlovic (Rutgers Univ.) C. Neidle (Boston Univ.) C. Vogler (Gallaudet) The need for automated
More informationVisual Odometry for Non-Overlapping Views Using Second-Order Cone Programming
Visual Odometry for Non-Overlapping Views Using Second-Order Cone Programming Jae-Hak Kim 1, Richard Hartley 1, Jan-Michael Frahm 2 and Marc Pollefeys 2 1 Research School of Information Sciences and Engineering
More informationFeature Transfer and Matching in Disparate Stereo Views through the use of Plane Homographies
Feature Transfer and Matching in Disparate Stereo Views through the use of Plane Homographies M. Lourakis, S. Tzurbakis, A. Argyros, S. Orphanoudakis Computer Vision and Robotics Lab (CVRL) Institute of
More informationTo appear in ECCV-94, Stockholm, Sweden, May 2-6, 1994.
To appear in ECCV-94, Stockholm, Sweden, May 2-6, 994. Recognizing Hand Gestures? James Davis and Mubarak Shah?? Computer Vision Laboratory, University of Central Florida, Orlando FL 3286, USA Abstract.
More informationCalibration of a Multi-Camera Rig From Non-Overlapping Views
Calibration of a Multi-Camera Rig From Non-Overlapping Views Sandro Esquivel, Felix Woelk, and Reinhard Koch Christian-Albrechts-University, 48 Kiel, Germany Abstract. A simple, stable and generic approach
More informationGesture Recognition Technique:A Review
Gesture Recognition Technique:A Review Nishi Shah 1, Jignesh Patel 2 1 Student, Indus University, Ahmedabad 2 Assistant Professor,Indus University,Ahmadabad Abstract Gesture Recognition means identification
More informationPartial Calibration and Mirror Shape Recovery for Non-Central Catadioptric Systems
Partial Calibration and Mirror Shape Recovery for Non-Central Catadioptric Systems Abstract In this paper we present a method for mirror shape recovery and partial calibration for non-central catadioptric
More informationOcclusion Detection of Real Objects using Contour Based Stereo Matching
Occlusion Detection of Real Objects using Contour Based Stereo Matching Kenichi Hayashi, Hirokazu Kato, Shogo Nishida Graduate School of Engineering Science, Osaka University,1-3 Machikaneyama-cho, Toyonaka,
More informationFingertips Tracking based on Gradient Vector
Int. J. Advance Soft Compu. Appl, Vol. 7, No. 3, November 2015 ISSN 2074-8523 Fingertips Tracking based on Gradient Vector Ahmad Yahya Dawod 1, Md Jan Nordin 1, and Junaidi Abdullah 2 1 Pattern Recognition
More informationVisualization and Analysis of Inverse Kinematics Algorithms Using Performance Metric Maps
Visualization and Analysis of Inverse Kinematics Algorithms Using Performance Metric Maps Oliver Cardwell, Ramakrishnan Mukundan Department of Computer Science and Software Engineering University of Canterbury
More informationGround Plane Motion Parameter Estimation For Non Circular Paths
Ground Plane Motion Parameter Estimation For Non Circular Paths G.J.Ellwood Y.Zheng S.A.Billings Department of Automatic Control and Systems Engineering University of Sheffield, Sheffield, UK J.E.W.Mayhew
More informationApplying Neural Network Architecture for Inverse Kinematics Problem in Robotics
J. Software Engineering & Applications, 2010, 3: 230-239 doi:10.4236/jsea.2010.33028 Published Online March 2010 (http://www.scirp.org/journal/jsea) Applying Neural Network Architecture for Inverse Kinematics
More informationEpipolar Geometry in Stereo, Motion and Object Recognition
Epipolar Geometry in Stereo, Motion and Object Recognition A Unified Approach by GangXu Department of Computer Science, Ritsumeikan University, Kusatsu, Japan and Zhengyou Zhang INRIA Sophia-Antipolis,
More informationCalibration of a Different Field-of-view Stereo Camera System using an Embedded Checkerboard Pattern
Calibration of a Different Field-of-view Stereo Camera System using an Embedded Checkerboard Pattern Pathum Rathnayaka, Seung-Hae Baek and Soon-Yong Park School of Computer Science and Engineering, Kyungpook
More informationTransactions on Information and Communications Technologies vol 16, 1996 WIT Press, ISSN
ransactions on Information and Communications echnologies vol 6, 996 WI Press, www.witpress.com, ISSN 743-357 Obstacle detection using stereo without correspondence L. X. Zhou & W. K. Gu Institute of Information
More informationAll human beings desire to know. [...] sight, more than any other senses, gives us knowledge of things and clarifies many differences among them.
All human beings desire to know. [...] sight, more than any other senses, gives us knowledge of things and clarifies many differences among them. - Aristotle University of Texas at Arlington Introduction
More informationFace Recognition At-a-Distance Based on Sparse-Stereo Reconstruction
Face Recognition At-a-Distance Based on Sparse-Stereo Reconstruction Ham Rara, Shireen Elhabian, Asem Ali University of Louisville Louisville, KY {hmrara01,syelha01,amali003}@louisville.edu Mike Miller,
More informationAutomatic Kinematic Chain Building from Feature Trajectories of Articulated Objects
Automatic Kinematic Chain Building from Feature Trajectories of Articulated Objects Jingyu Yan and Marc Pollefeys Department of Computer Science The University of North Carolina at Chapel Hill Chapel Hill,
More informationInverse Kinematics Analysis for Manipulator Robot With Wrist Offset Based On the Closed-Form Algorithm
Inverse Kinematics Analysis for Manipulator Robot With Wrist Offset Based On the Closed-Form Algorithm Mohammed Z. Al-Faiz,MIEEE Computer Engineering Dept. Nahrain University Baghdad, Iraq Mohammed S.Saleh
More informationPose Estimation from Circle or Parallel Lines in a Single Image
Pose Estimation from Circle or Parallel Lines in a Single Image Guanghui Wang 1,2, Q.M. Jonathan Wu 1,andZhengqiaoJi 1 1 Department of Electrical and Computer Engineering, The University of Windsor, 41
More informationCamera Calibration Utility Description
Camera Calibration Utility Description Robert Bryll, Xinfeng Ma, Francis Quek Vision Interfaces and Systems Laboratory The university of Illinois at Chicago April 6, 1999 1 Introduction To calibrate our
More informationVisual Hulls from Single Uncalibrated Snapshots Using Two Planar Mirrors
Visual Hulls from Single Uncalibrated Snapshots Using Two Planar Mirrors Keith Forbes 1 Anthon Voigt 2 Ndimi Bodika 2 1 Digital Image Processing Group 2 Automation and Informatics Group Department of Electrical
More informationEEE 187: Robotics Summary 2
1 EEE 187: Robotics Summary 2 09/05/2017 Robotic system components A robotic system has three major components: Actuators: the muscles of the robot Sensors: provide information about the environment and
More informationNonrigid Surface Modelling. and Fast Recovery. Department of Computer Science and Engineering. Committee: Prof. Leo J. Jia and Prof. K. H.
Nonrigid Surface Modelling and Fast Recovery Zhu Jianke Supervisor: Prof. Michael R. Lyu Committee: Prof. Leo J. Jia and Prof. K. H. Wong Department of Computer Science and Engineering May 11, 2007 1 2
More informationCombining Appearance and Topology for Wide
Combining Appearance and Topology for Wide Baseline Matching Dennis Tell and Stefan Carlsson Presented by: Josh Wills Image Point Correspondences Critical foundation for many vision applications 3-D reconstruction,
More informationA Novel Stereo Camera System by a Biprism
528 IEEE TRANSACTIONS ON ROBOTICS AND AUTOMATION, VOL. 16, NO. 5, OCTOBER 2000 A Novel Stereo Camera System by a Biprism DooHyun Lee and InSo Kweon, Member, IEEE Abstract In this paper, we propose a novel
More informationModel-Based Human Motion Capture from Monocular Video Sequences
Model-Based Human Motion Capture from Monocular Video Sequences Jihun Park 1, Sangho Park 2, and J.K. Aggarwal 2 1 Department of Computer Engineering Hongik University Seoul, Korea jhpark@hongik.ac.kr
More informationFast Natural Feature Tracking for Mobile Augmented Reality Applications
Fast Natural Feature Tracking for Mobile Augmented Reality Applications Jong-Seung Park 1, Byeong-Jo Bae 2, and Ramesh Jain 3 1 Dept. of Computer Science & Eng., University of Incheon, Korea 2 Hyundai
More informationA Stratified Approach for Camera Calibration Using Spheres
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. XX, NO. Y, MONTH YEAR 1 A Stratified Approach for Camera Calibration Using Spheres Kwan-Yee K. Wong, Member, IEEE, Guoqiang Zhang, Student-Member, IEEE and Zhihu
More informationCHAPTER 6 PERCEPTUAL ORGANIZATION BASED ON TEMPORAL DYNAMICS
CHAPTER 6 PERCEPTUAL ORGANIZATION BASED ON TEMPORAL DYNAMICS This chapter presents a computational model for perceptual organization. A figure-ground segregation network is proposed based on a novel boundary
More informationA novel approach to motion tracking with wearable sensors based on Probabilistic Graphical Models
A novel approach to motion tracking with wearable sensors based on Probabilistic Graphical Models Emanuele Ruffaldi Lorenzo Peppoloni Alessandro Filippeschi Carlo Alberto Avizzano 2014 IEEE International
More informationAnimation. Keyframe animation. CS4620/5620: Lecture 30. Rigid motion: the simplest deformation. Controlling shape for animation
Keyframe animation CS4620/5620: Lecture 30 Animation Keyframing is the technique used for pose-to-pose animation User creates key poses just enough to indicate what the motion is supposed to be Interpolate
More informationAutomatic Reconstruction of 3D Objects Using a Mobile Monoscopic Camera
Automatic Reconstruction of 3D Objects Using a Mobile Monoscopic Camera Wolfgang Niem, Jochen Wingbermühle Universität Hannover Institut für Theoretische Nachrichtentechnik und Informationsverarbeitung
More informationStereo CSE 576. Ali Farhadi. Several slides from Larry Zitnick and Steve Seitz
Stereo CSE 576 Ali Farhadi Several slides from Larry Zitnick and Steve Seitz Why do we perceive depth? What do humans use as depth cues? Motion Convergence When watching an object close to us, our eyes
More informationFitting (LMedS, RANSAC)
Fitting (LMedS, RANSAC) Thursday, 23/03/2017 Antonis Argyros e-mail: argyros@csd.uoc.gr LMedS and RANSAC What if we have very many outliers? 2 1 Least Median of Squares ri : Residuals Least Squares n 2
More informationPerspective Projection Describes Image Formation Berthold K.P. Horn
Perspective Projection Describes Image Formation Berthold K.P. Horn Wheel Alignment: Camber, Caster, Toe-In, SAI, Camber: angle between axle and horizontal plane. Toe: angle between projection of axle
More informationProjector Calibration for Pattern Projection Systems
Projector Calibration for Pattern Projection Systems I. Din *1, H. Anwar 2, I. Syed 1, H. Zafar 3, L. Hasan 3 1 Department of Electronics Engineering, Incheon National University, Incheon, South Korea.
More informationExperiments with Edge Detection using One-dimensional Surface Fitting
Experiments with Edge Detection using One-dimensional Surface Fitting Gabor Terei, Jorge Luis Nunes e Silva Brito The Ohio State University, Department of Geodetic Science and Surveying 1958 Neil Avenue,
More information