A Real-time Application of Hand Gesture Recognition in Video Games


Ying Li
March 10, 2011

Abstract

Hand gesture recognition has many useful applications, one of which is video games. In this project, a virtual Rubik's cube game is implemented in which the cube is manipulated by recognizing the hand gestures of the user. The application uses skin color to detect the hands and extracts features containing spatiotemporal information, which are classified by a support vector machine (SVM). In the experiment, the application achieves a recognition rate higher than 90%, and each frame is processed and rendered in 30 ms.

1. Introduction

With the rapid development of video games and 3D engines, the role of human-computer interaction (HCI) is becoming more important. Tremendous efforts have been made to provide a more natural and realistic experience of interacting with the computer. A video renderer projects a 3D world onto 2D images, while conventional input devices such as mice and joysticks work in the opposite direction, converting 2D motion into manipulation of the 3D objects in the game. Human hand gestures and body language, by contrast, offer a more natural way of direct manipulation. With this in mind, this project explores the possibility of real-time recognition of hand gestures in a simple cube game.

Approaches to hand gesture recognition fall mainly into two categories: those that infer gestures directly from the observed visual images, usually called appearance-based modeling, and those that infer gestures from motion and posture model parameters computed from the image sequences, known as 3D-model-based approaches[1, 2]. Although 3D-model-based approaches offer a general solution to most hand gesture recognition problems, they are computationally intensive, and their intermediate results may be more than a specific task needs. Moreover, the transformation from 3D models to 2D images might not be invertible, which makes 3D-model-based approaches unstable in some cases. In contrast, the appearance-based approach can run in real time and achieve high accuracy for specific tasks.

The recognition process in appearance-based approaches consists of the following steps, depicted in fig. 1: spotting the hand blob, extracting features, and classification. Drawing upon the similarity to the speech recognition problem, Starner et al. proposed using hidden Markov models (HMMs) to recognize sentence-level American Sign Language (ASL) selected from a 40-word lexicon, achieving an accuracy higher than 90%[3], whereas Yang et al. reported a 93.42% recognition rate by classifying motion trajectories with a time-delay neural network (TDNN)[4]. Darrell et al. took a different approach by accumulating representative n-tuple image sequences as gesture templates[5]. Another approach that exploits image sequences is to model different gestural actions by motion history images (MHI), which are 2D images formed by accumulating the motion of every single pixel in the image over some temporal window[7]. In this project, however, an SVM is used for the recognition task, given the success of SVMs in image classification problems.

Figure 1: major steps in hand gesture recognition

2. Approach

As depicted in fig. 1, the hand blobs are first separated from the image background, and a feature vector is extracted from them and fed to the SVM classifier. Since an SVM does not use temporal information directly, we extract features that encode spatiotemporal data.

2.1. Detection of hand blob

Given that the user is interacting with the application, the algorithm simply uses a skin color model determined a priori to segment the hands. Such a model may not be appropriate when lighting conditions change abruptly or when the hands are hard to distinguish from the background. The segmentation problem is nevertheless somewhat alleviated here, because the field of view of a webcam is limited and the regions of interest are restricted to the area around the virtual cube. After the mask of potential hand pixels is computed, morphological opening and closing are applied to the mask to remove noise and join disparate elements. A breadth-first search (BFS) is then run on the modified mask to find connected hand components. The centroid and bounding box of each component are calculated as a by-product of this process.

2.2. Feature extraction

After the hands are segmented from the background, a second analysis is applied to each detected hand blob to form the feature vector. Along with the centroid x, y coordinates mentioned in the hand blob detection section, the feature vector includes the changes in x and y between frames, optical flow, and shape information. Each hand component is labeled right, top, left or bottom according to its position relative to the center of the virtual cube.

2.2.1. Optical flow computation

To compute the optical flow between frames, the Lucas-Kanade optical flow method is applied to every two consecutive grayscale images. A brief explanation of the algorithm is presented below. When calculating optical flow from one frame to the next, it is assumed that the brightness of the same point remains constant:

$$I(x, y, t) = I(x + \delta x, y + \delta y, t + \delta t)$$

where x, y are the coordinates of a pixel and t is the time at which the first frame is taken. Applying a Taylor series expansion to the right-hand side of the equation, we get:

$$I(x + \delta x, y + \delta y, t + \delta t) = I(x, y, t) + \frac{\partial I}{\partial x}\delta x + \frac{\partial I}{\partial y}\delta y + \frac{\partial I}{\partial t}\delta t + \text{H.O.T.}$$

Neglecting the higher-order terms, it follows from this equation that:

$$\frac{\partial I}{\partial x}\delta x + \frac{\partial I}{\partial y}\delta y + \frac{\partial I}{\partial t}\delta t = 0$$

Dividing by $\delta t$ we get:

$$\frac{\partial I}{\partial x}\frac{\delta x}{\delta t} + \frac{\partial I}{\partial y}\frac{\delta y}{\delta t} + \frac{\partial I}{\partial t}\frac{\delta t}{\delta t} = 0$$

which results in

$$\frac{\partial I}{\partial x}V_x + \frac{\partial I}{\partial y}V_y + \frac{\partial I}{\partial t} = 0$$

where $V_x$, $V_y$ are the x and y components of the velocity, and $\partial I/\partial x$, $\partial I/\partial y$ and $\partial I/\partial t$ are the derivatives of the image at (x, y, t) in the corresponding directions.

Lucas and Kanade assume a translational model and solve for a velocity $v = (V_x, V_y)^T$ that approximately satisfies the equation above for all the pixels $p_1, \ldots, p_{N^2}$ in a small neighborhood of size $N \times N$. In this way, we obtain an overconstrained linear system of equations $A v = b$, which is solved by the least squares method, with

$$A = \begin{bmatrix} \nabla I(p_1)^T \\ \nabla I(p_2)^T \\ \vdots \\ \nabla I(p_{N^2})^T \end{bmatrix}, \qquad b = -\begin{bmatrix} I_t(p_1) \\ I_t(p_2) \\ \vdots \\ I_t(p_{N^2}) \end{bmatrix}$$

in which

$$\nabla I(x, y, t) = \begin{bmatrix} I_x(x, y, t) \\ I_y(x, y, t) \end{bmatrix} = \begin{bmatrix} \frac{\partial I}{\partial x}(x, y, t) \\ \frac{\partial I}{\partial y}(x, y, t) \end{bmatrix}, \qquad I_t(x, y, t) = \frac{\partial I}{\partial t}(x, y, t)$$

so that the solution can be written as:

$$v = (A^T A)^{-1} A^T b$$

To speed up the computation, the optical flow is computed only within the bounding box obtained in the hand detection step. The x, y velocities calculated for each pixel are then assigned to an 8x8 grid to form a 128-element vector. To put more emphasis on the contours, another 128-element vector is constructed from the optical flow of the contours. Although computing optical flow only at interest points would give more reliable estimates, it is not advisable in this case, since the hand blob is blurred during the action phase, which makes corner detection inaccurate.
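As a rough illustration of the least squares step above, the following Python sketch estimates the velocity at a single pixel from two grayscale frames. It is not the OpenCV routine used in the application; the function name `lucas_kanade_velocity`, the central-difference gradients, and the `half` window parameter (so that N = 2*half + 1) are assumptions made for this example.

```python
def lucas_kanade_velocity(prev, curr, cx, cy, half=1):
    """Estimate (Vx, Vy) at pixel (cx, cy) from two grayscale frames.

    prev, curr: lists of rows of numbers with identical shape.
    The window spans (2*half + 1) x (2*half + 1) pixels and must lie at
    least one pixel inside the image border (needed for the gradients).
    Returns None when A^T A is singular (e.g. a textureless window).
    """
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(cy - half, cy + half + 1):
        for x in range(cx - half, cx + half + 1):
            # Spatial derivatives by central differences (an assumption).
            ix = (prev[y][x + 1] - prev[y][x - 1]) / 2.0
            iy = (prev[y + 1][x] - prev[y - 1][x]) / 2.0
            # Temporal derivative as a frame difference.
            it = curr[y][x] - prev[y][x]
            # Accumulate the entries of A^T A and of A^T I_t.
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        return None
    # Solve (A^T A) v = A^T b with b = -I_t using the 2x2 inverse.
    vx = (-syy * sxt + sxy * syt) / det
    vy = (sxy * sxt - sxx * syt) / det
    return vx, vy
```

Solving the 2x2 normal equations in closed form avoids any matrix library. A real implementation would also check the conditioning of $A^T A$, not only whether its determinant vanishes.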

2.2.2. Hand shape related features

Geometric moments are a succinct description of the shape of a component; therefore the second moments ($m_{20}$, $m_{11}$, $m_{02}$) are computed from the contour of each hand:

$$\begin{bmatrix} m_{20} & m_{11} \\ m_{11} & m_{02} \end{bmatrix}$$

By curve fitting, the major and minor axes of the ellipse are deduced as features. To retain more detailed shape information about the hand, each pixel in the hand component is assigned to an 8x8 grid, and the sums of the x and y coordinates in each grid cell are collected to form a 128-element vector. To reduce computation cost, and relying on the duality property of the SVM, neither the outer product for curve fitting nor the eigenvalues and eigenvectors are computed explicitly. As in the optical flow computation, another 128-element vector is constructed from the contour of the hand.

2.3. Feature vector normalization

The features extracted in the previous section differ by orders of magnitude, so their averages and standard deviations are not comparable across features; the features are therefore normalized. For the feature matrix computed from the training set (one row per sample, one column per feature), the average and standard deviation are calculated for every feature, and each element is divided by the standard deviation after the average is subtracted. If the standard deviation of a feature equals zero, that feature column is removed from the matrix. After this step every feature has an average of zero and a standard deviation of one.
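The standardization step can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions: the helper names `standardize` and `apply_standardization` are hypothetical, the population standard deviation (dividing by the number of samples) is assumed, and reusing the training statistics for unseen samples is a common practice that the text does not spell out.

```python
import math


def standardize(rows):
    """Standardize the columns of a training matrix given as a list of rows.

    Returns (scaled_rows, kept_columns, means, stds). Columns whose standard
    deviation is zero are dropped, as described in the text.
    """
    n = len(rows)
    n_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_cols)]
    # Population standard deviation (divide by n): an assumption.
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
        for j in range(n_cols)
    ]
    kept = [j for j in range(n_cols) if stds[j] > 0.0]
    scaled = [[(r[j] - means[j]) / stds[j] for j in kept] for r in rows]
    return scaled, kept, means, stds


def apply_standardization(sample, kept, means, stds):
    """Scale one unseen sample with statistics taken from the training set."""
    return [(sample[j] - means[j]) / stds[j] for j in kept]
```

For instance, with training rows `[[1.0, 5.0, 100.0], [3.0, 5.0, 300.0]]` the constant middle column is dropped and the two remaining columns both become `-1.0` and `1.0`.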
To transform all the elements to the range [-1, 1], each element is then divided by the square root of the number of training samples N, noting that:

$$(x_i - \bar{x})^2 \le \sum_{j=1}^{N} (x_j - \bar{x})^2 \;\Rightarrow\; |x_i - \bar{x}| \le \sqrt{\sum_{j=1}^{N} (x_j - \bar{x})^2} = \sigma\sqrt{N}$$

Although an SVM may be more robust to unnormalized data than other methods, since it learns a weight for each feature during training, a normalization step is likely to improve the numerical accuracy of the computation, because numbers are stored with the finite precision of their data type.

2.4. Brief explanation of SVM

The normalized feature vector structured as described above is then passed to the SVM classifier. Unlike conventional classifiers, an SVM is not trained to fit the density of the sample distribution; instead it is trained to minimize structural risk[6]. A brief explanation of SVM follows.

Figure 2: illustration of linear SVM

Given a dataset $\{(x_i, y_i)\}$ with $y_i \in \{-1, +1\}$ and $x_i \in \mathbb{R}^n$, if there exists a hyperplane H that separates the dataset, it can be written as:

$$g(x) = w \cdot x + b = 0$$

Let $H_1$ be the hyperplane parallel to H through the closest points labeled +1, and $H_2$ the hyperplane parallel to H through the closest points labeled -1, so that:

$$H: w \cdot x + b = 0; \quad H_1: w \cdot x + b = k_1; \quad H_2: w \cdot x + b = k_2$$

Setting $k = \frac{k_1 - k_2}{2}$, we have

$$H_1: w \cdot x + b - k_1 + k = k; \quad H_2: w \cdot x + b - k_2 - k = -k$$

Since $b - k_1 + k = b - k_2 - k$, this constant shift can be absorbed into b, and the two hyperplanes can also be written as:

$$H_1: w \cdot x + b = k; \quad H_2: w \cdot x + b = -k$$

Dividing both equations by k (and rescaling w and b accordingly), we have:

$$H_1: w \cdot x + b = 1; \quad H_2: w \cdot x + b = -1$$

From the equations above, it can be derived that the margin is $\frac{2}{\|w\|}$. So to maximize the margin, we compute

$$\min \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 \quad (i = 1, 2, \ldots, n)$$

Because all the functions involved are convex, there exists a unique global optimum. Solving the problem using Lagrange multipliers $\alpha_i$, we get:

$$w = \sum_i y_i \alpha_i x_i, \qquad b = y_j - \sum_i y_i \alpha_i (x_i \cdot x_j), \quad j \in \{j \mid \alpha_j > 0\}$$
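A toy check may make the formulas for w, b and the margin more concrete. The sketch below takes a two-point dataset whose multipliers can be derived by hand ($\alpha_1 = \alpha_2 = 0.5$ maximizes the dual objective $2\alpha - 2\alpha^2$ for this particular data) and recovers the hyperplane from them. The helper names `dot`, `primal_from_dual` and `margin_width` are hypothetical, and no optimizer is involved.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def primal_from_dual(xs, ys, alphas):
    """Recover (w, b) of a hard-margin linear SVM from its multipliers.

    Implements w = sum_i y_i alpha_i x_i and
    b = y_j - sum_i y_i alpha_i (x_i . x_j) for a support vector j.
    """
    dim = len(xs[0])
    w = [sum(a * y * x[d] for a, y, x in zip(alphas, ys, xs)) for d in range(dim)]
    # Any sample with alpha_j > 0 is a support vector and can define b.
    j = next(i for i, a in enumerate(alphas) if a > 0)
    b = ys[j] - sum(a * y * dot(x, xs[j]) for a, y, x in zip(alphas, ys, xs))
    return w, b


def margin_width(w):
    """Distance between H1 and H2, i.e. 2 / ||w||."""
    return 2.0 / math.sqrt(dot(w, w))


# Toy data: one sample per class on the x-axis (hand-derived multipliers).
xs = [(1.0, 0.0), (-1.0, 0.0)]
ys = [1, -1]
alphas = [0.5, 0.5]
w, b = primal_from_dual(xs, ys, alphas)
```

Here both samples lie exactly on the margin hyperplanes, so $y_i (w \cdot x_i + b) = 1$ for each of them and the margin equals the distance between the two points.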

To allow for mislabeled samples, slack variables $\xi_i$ are introduced to trade off a large margin against a small error penalty:

$$y_i (w \cdot x_i + b) \ge 1 - \xi_i$$

Now to maximize the margin, we compute

$$\min \left( \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \right) \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad (i = 1, 2, \ldots, n)$$

The classifying function thus derived is:

$$f(x) = \mathrm{sgn}\{w \cdot x + b\} = \mathrm{sgn}\Big\{\sum_i \alpha_i y_i (x_i \cdot x) + b\Big\}$$

It follows from the equation above that, to compute the label of a test sample, we only need the inner products of the test sample with the support vectors. If the dataset is not linearly separable in the input space, the samples can first be projected to a higher-dimensional space and then classified with the linear separation method outlined above; this is where the kernel trick comes in:

$$f(x) = \mathrm{sgn}\Big\{\sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b\Big\}, \qquad K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$$

If K satisfies Mercer's conditions, then the kernel represents a legitimate inner product in feature space. This resembles a neural network whose hidden layer computes the inner products of the test sample with the support vectors (fig. 3). The choice of kernel is therefore another parameter of the SVM method, though it is reported that in some applications the performance remains comparable whatever kernel is chosen[6]. In this project, the Gaussian or radial basis function (RBF) kernel is used because of its effectiveness in image recognition:

$$K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2), \quad \gamma > 0$$

As the optimization outlined above shows, the SVM minimizes $\|w\|^2$, which naturally reduces the chance of overfitting. Moreover, its structural risk minimization principle is an advantage when it is applied to a dataset of very high dimension. Although the prediction of an SVM classifier is determined only by the support vectors, in this project it is difficult to know in advance which samples will lie on the margin, and hence how best to construct the training set.

Figure 3: illustration of the kernel trick

2.5. Game status update rule

The probabilities $p(\mathrm{action}_a \mid \mathrm{action}_b)$, $p(\mathrm{action} \mid \mathrm{inaction})$ and $p(\mathrm{inaction} \mid \mathrm{action})$ can be estimated by applying the classifier to the test set. If a window of a specific temporal length (shorter than a typical action) is set, the action update rule can in theory be derived by maximizing the likelihood of the multinomial distribution. However, since the sample set is rather limited compared with the range of untested user actions, such a delicate strategy may well not work as well as a simple one. In the application, the update rule is therefore tentatively set as follows: if the predicted labels of two consecutive frames are the same and differ from the current status of the cube, the status of the cube is updated to the predicted one.

3. Experiment Results and Analysis

The dataset is built by recording frames at 25 fps from a webcam. The true labels of the dataset are determined afterwards. Because of the subtle transition in which a previous action is being retracted while the next one is being prepared, there is no guarantee that the current labeling is correct. The state of each frame is classified into 49 groups, in which the rotations made for preparation purposes are kept separate from the null state. If two hands are detected in the frame at appropriate positions, with one hand rotating and the other keeping still, the corresponding level of the cube is twisted. The middle level of the cube therefore cannot be manipulated directly. This is not a problem, however, because the viewpoint of the cube can be changed. If only a single hand is detected and the rotation is correctly perceived by the application, the corresponding rotation of the cube takes place. Reducing the number of states is likely to reduce the computation cost, especially if the SVM classifier is not constructed as a decision tree. However, it is unlikely that reducing the number of states would decrease the error rate, which depends on the construction of the support vectors. The non-rotation actions are separated from the null state because those labels might be useful for constructing a finite state machine to update the status of the cube, provided the prediction for each frame is sufficiently accurate.

Figure 4: (a) the gesture of rotating the cube upward; (b) the gesture of rotating the cube downward

In fig. 4, the optical flow drawn on the images is in accordance with the movement of the hand. Those frames were sampled at 25 fps. Because the Lucas-Kanade method assumes a translational model, increasing the sampling rate might improve the accuracy of the optical flow calculation, given that most of the motions involved in this application are not pure translations and may be accompanied by shape deformation.

The accuracy of the SVM classifier is given in table 1, obtained from a grid search for appropriate parameters using 10-fold cross validation. The best recognition rate achieved is 91.87%, when C equals 32 and gamma equals 2. The results show that as gamma increases, the accuracy first increases and then drops. This behavior is related to the influence of gamma on the Gaussian function: as gamma becomes bigger, the Gaussian function becomes steeper. When combined with a larger penalty (C value), the SVM classifier is able to reproduce highly irregular decision boundaries, at the risk of overfitting the training set. The same pattern appears in the table as C increases, which also demonstrates this.
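The effect of gamma described above can be seen directly from the kernel values. The short sketch below evaluates the RBF kernel for a few values of gamma and plugs it into the kernelized decision function from the SVM section; the helper names `rbf_kernel` and `decide`, the toy support vectors, and the chosen gamma values are illustrative assumptions, not the trained model or the grid used in the experiment.

```python
import math


def rbf_kernel(u, v, gamma):
    """Gaussian (RBF) kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def decide(x, support_vectors, labels, alphas, b, gamma):
    """Kernelized decision f(x) = sgn(sum_i alpha_i y_i K(x_i, x) + b)."""
    score = b + sum(
        a * y * rbf_kernel(sv, x, gamma)
        for sv, y, a in zip(support_vectors, labels, alphas)
    )
    return 1 if score >= 0 else -1


# Kernel value between two points at unit distance for growing gamma:
# the larger gamma is, the faster the influence of a support vector decays.
origin, neighbour = (0.0, 0.0), (1.0, 0.0)
decay = [rbf_kernel(origin, neighbour, g) for g in (0.5, 2.0, 8.0)]

# Toy classifier with one support vector per class (assumed values).
svs = [(1.0, 0.0), (-1.0, 0.0)]
labels = [1, -1]
alphas = [1.0, 1.0]
```

With a large gamma each support vector only affects its immediate neighbourhood, which is what allows the highly irregular decision boundaries mentioned above.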
Table 1: Error rate computed from 10-fold cross validation in a grid search for appropriate gamma and C

From the table, most of the errors come from mistaking hand actions for inaction, which means the application might be unresponsive to some actions of the user. Conversely, errors that come from mistaking inaction for an action would result in unintended movement of the cube when it is supposed to be still. The low rate of confusing one action with another, however, suggests the potential of using an SVM classifier in hand gesture recognition. Given the probabilities $p(\mathrm{action}_a \mid \mathrm{action}_b)$, $p(\mathrm{action} \mid \mathrm{inaction})$ and $p(\mathrm{inaction} \mid \mathrm{action})$, it can be derived that, under the current cube update rule, the probability of a correct status change is [...] if the previous status is inactive and the current status is active, and [...] for the reverse status change. These statistics are based on the results of cross validation, so the performance of the application might be lower in practice.

Examining the erroneous labels, most of the mistakes occur at the boundary between two actions. Because of the ambiguity between the preparation and retraction phases of continuous gestures, mistakes of this type are difficult to reduce. I expect that around 10% of the mistakes might be corrected if the trajectory of the gesture, or at least the state of the previous frame, is known.

When tested on a 2.53 GHz computer, the application extracts features in 15 ms, predicts the label of a frame with the SVM classifier in 12 ms, and does the rendering in 3 ms. All image operation functions and the SVM implementation are from the OpenCV library.

4. Conclusion

In this project, a real-time application of hand gesture recognition is implemented. The application uses skin color to detect hand blobs and extracts features from the motion and shape of the hands as input to an SVM classifier, which outputs a label. Though the error rate is [...], it can be further reduced by taking the trajectory into consideration. The error rate is computed from a dataset in which the samples were collected at a fairly consistent position relative to the camera, so it cannot be predicted how the system will behave when the hand position changes, even if the training set is enlarged. Nor is it certain how the application will respond to different people.

References

[1] V. Pavlovic, R. Sharma, and T. S. Huang, Visual interpretation of hand gestures for human-computer interaction: A review, IEEE Trans. Pattern Anal. Mach. Intell. 19, 1997.

[2] T. S. Huang and V. Pavlovic, Hand gesture modeling, analysis, and synthesis, Proc. 1995 IEEE International Workshop on Automatic Face and Gesture Recognition, September 1995, pp. 73-79.

[3] T. Starner, J. Weaver, and A. Pentland, Real-time American Sign Language recognition using desk and wearable computer based video, IEEE Trans. Pattern Anal. Mach. Intell. 20, 1998.

[4] M-H. Yang, N. Ahuja, Recognizing hand gesture using motion trajectories, Proc. of IEEE CS Conference on Computer Vision and Pattern Recognition, 1999, pp. 468-472.

[5] T. Darrell, I. Essa, and A. Pentland, Task-Specific Gesture Analysis in Real-Time Using Interpolated Views, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 12, pp. 1,236-1,242, Dec.

[6] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press.

[7] A.F. Bobick and J.W. Davis, Real-Time Recognition of Activity Using Temporal Templates, Proc. Int'l Conf. Automatic Face and Gesture Recognition, Killington, Vt., Oct.
