A Two-stage Scheme for Dynamic Hand Gesture Recognition

James P. Mammen, Subhasis Chaudhuri and Tushar Agrawal
(james,sc,tush)@ee.iitb.ac.in
Department of Electrical Engg., Indian Institute of Technology, Bombay, India-76

Abstract

In this paper a scheme is presented for recognizing hand gestures using the output of a hand tracker which tracks a rectangular window bounding the hand region. A hierarchical scheme for dynamic hand gesture recognition is proposed, based on a state representation of the dominant feature trajectories that uses a priori knowledge of the way in which each gesture is performed.

1. Introduction

Hand gesture recognition is essential in a host of applications such as haptic interfaces for large-screen multimedia and virtual reality environments [1], robot programming by demonstration, sign language recognition, human-computer interaction, telerobotic applications, etc. Previous attempts at recognizing similar gestures have used a variety of methods. Davis and Shah [2] used markers for tracking finger tips and used the fingertip trajectories for recognizing seven gestures. In [3], the hand's position in the image, its velocity, values obtained by eigen analysis, etc. are used as features, and words from the ASL are recognized using an HMM-based scheme. In [4], the concept of motion energy is used to estimate the dominant motion of the hand, and the gestures are recognized by fitting finite state models of gestures. In [5], a gesture is classified as a sequence of postures using Principal Component Analysis and recognized using Finite State Machines. Multiple cameras are used in [6] to extract the 3D pose of the human body. Instead of using all the parameters describing the pose as features, the trajectory obtained by a projection into a 2D eigenspace is used for gesture recognition.
In [7], each feature trajectory is split into sub-trajectories and recognition is achieved by maximizing the probability of it being a particular gesture in the eigenspaces of each of these sub-trajectories. An incremental recognition strategy that is an extension of the condensation algorithm is proposed in [8] to recognize gestures based on the 2D hand trajectory. Gestures are modeled as velocity trajectories and the condensation algorithm is used to incrementally match the gesture models to the input data.

A robust hand tracker proposed in [9] is used for tracking the hand in order to extract features which can be used for recognizing the gestures. At the first level of classification, the gesture to be recognized is assigned to one of the five classes based on its dominant feature trajectories. In the next level, the gesture is recognized using a sequence of states obtained from the dominant feature trajectories. The representation of gestures as a sequence of states overcomes the problem of variation in the speed at which a gesture is performed, thus avoiding time warping of the data sequence.

2. Selection of Features

The proposed gesture recognition system is intended to be used as an interface for a telerobotic system. Dynamic manipulative hand gestures being most suited to such an application, we select the ten gestures listed in Table 1 to form our gesture vocabulary. The hand tracker mentioned in the previous section tracks the change in position and shape of a rectangular window bounding the palm region of the hand performing the gesture. Hence the coordinates of the centroid of the hand region should serve as good features. The change in area of the hand region can capture the variation in the shape of the palm region; moreover, it is also indicative of the variation in hand position along the axis of the camera.
Thus the feature vector for the n-th frame in the video sequence is selected to be f(n) = [X(n) Y(n) A(n)]^T. In order to have values which are meaningful from one video sequence to another, we scale the x- and y-coordinates of the centroid by the width w(1) of the hand region in the start position, and the area by the initial area, i.e. A(1) = 1. This provides invariance to magnification due to the distance from the camera at which the gesture is performed, and also with respect to changes in hand size from person to person. Note that the features are not made invariant to the starting position. This is not required in our scheme, as the representation of trajectories as a sequence of states does not depend on it. Thus we have

X(n) = [m_10(n)/m_00(n)] / w(1)
Y(n) = [m_01(n)/m_00(n)] / w(1)     (1)
A(n) = m_00(n)/m_00(1)
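As an illustration of Eq. (1), the following minimal sketch computes the feature vectors from binary hand-region masks. The helper names are ours, and estimating the initial hand width w(1) as the bounding-box width of the first mask is our assumption (the paper obtains the window from its tracker).

```python
import numpy as np

def moments(mask):
    """Raw image moments m_00, m_10, m_01 of a binary hand mask."""
    ys, xs = np.nonzero(mask)
    return float(len(xs)), float(xs.sum()), float(ys.sum())

def features(masks):
    """Per-frame feature vectors f(n) = [X(n), Y(n), A(n)]^T as in Eq. (1):
    centroid coordinates scaled by the initial hand width, and area
    scaled by the initial area so that A(1) = 1."""
    m00_1, _, _ = moments(masks[0])
    xs0 = np.nonzero(masks[0])[1]
    w1 = float(xs0.max() - xs0.min() + 1)   # assumed initial hand width
    feats = []
    for mask in masks:
        m00, m10, m01 = moments(mask)
        feats.append([(m10 / m00) / w1, (m01 / m00) / w1, m00 / m00_1])
    return np.array(feats)
```

By construction the first frame yields A(1) = 1, and doubling the hand area in a later frame yields A(n) = 2 regardless of the absolute pixel counts.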
where m_pq(n) is the pq-th moment of the hand region obtained by tracking in the n-th frame.

Figure 1: Smoothing of the feature trajectories for the Move Right gesture. (a) Features as obtained from the tracker, and (b) feature trajectories after smoothing.

Figure 2: Smoothed feature trajectories of various gestures: (a) Move Up, (b) Move Down, (c) Go Away, and (d) Come Closer.

The raw data obtained by tracking may be noisy and hence we smooth the data using an averaging filter. Fig. 1(a) shows the raw features as obtained from the tracker for the Move Right (RIGHT) gesture, clearly depicting their noisy nature. The result of smoothing with an averaging filter of length 7 is given in Fig. 1(b). The feature trajectories of each unknown gesture are smoothed in this manner before any further stages of processing in order to eliminate noise. Smoothed feature trajectories of gestures performed by a single user are shown in Fig. 2.

3. Initial Classification of Gestures based on the Dominant Features

We observe that although 3 features have been selected, the information conveyed by the gestures is not simultaneously captured by all of them. For example, in the case of the Move Right (RIGHT) gesture, the information is contained in the horizontal motion, which is captured by the Y(n) feature in our case. The other two features do not show meaningful variations in this case. We call the features which capture the information conveyed by a gesture the dominant features of that gesture. By defining S_X, S_Y and S_A to be the sets of gestures whose dominant features are X(n), Y(n) and A(n), respectively, we obtain the Venn diagram representation of Fig. 3. It shows the relationship between the gestures and their dominant features. For example, the Clockwise (CW) and Counterclockwise (CCW) gestures, where both X(n) and Y(n) convey information, belong to both sets S_X and S_Y. The seven non-overlapping regions in Fig. 3
show that, for 3 features, we may have at most 7 classes, based on which of the features are dominant. Our set of gestures does not cover all seven classes: the subset (S_X ∩ S_A) ∪ (S_X ∩ S_Y ∩ S_A) is empty in our case. For the Move Left (LEFT), Move Right (RIGHT), Move Up (UP), Move Down (DOWN), Move Counterclockwise (CCW) and Move Clockwise (CW) gestures, either X(n) or Y(n) or both are the dominant features. In the Push (PUSH) and Pull (PULL) gestures, A(n) is the dominant feature, as the change in the area of the hand conveys the information. The remaining two gestures, viz., Go Away (AWAY) and Come Closer (CLOSER), show very little motion and small variation in area. Hence at the first level, both are included in the same class, where the range of variation in X(n) and Y(n) is small and the range of variation of A(n) is comparable to that of X(n) and/or Y(n). Thus in the first stage of classification we have 5 classes, each containing 2 gestures, as shown in Fig. 4. As the dominant feature or features can be determined from the range of variation of each feature trajectory, we obtain the decision-tree based scheme shown in Fig. 5 for the first level of classification. In terms of the features this assumes the following form. First we smooth each feature trajectory as mentioned earlier in order to suppress tracking noise. Thereafter we obtain the range of variation of feature i as Δ_i = max(i(n)) − min(i(n)), where i = X, Y, A', and A'(n) = C·A(n), where C is a factor used to make the range of change in area comparable to that of change in position. In this study a fixed value of C is used. Now we assign the gesture to be recognized to one of the classes
Figure 3: Gesture sets based on dominant features.

Figure 4: Initial classification of the gesture set based on dominant features: Class I — UP, DOWN; Class II — LEFT, RIGHT; Class III — CCW, CW; Class IV — PUSH, PULL; Class V — CLOSER, AWAY.

in the first stage as follows. We define f_i = arg max_i Δ_i, and the classification proceeds as follows.

- Class I: if (f_i = X) & (Δ_X ≥ τ1) & (Δ_Y < Δ_X/2)
- Class II: if (f_i = Y) & (Δ_Y ≥ τ1) & (Δ_X < Δ_Y/2)
- Class III: if ((f_i = X) & (Δ_X > τ2) & (Δ_Y > Δ_X/2)) OR ((f_i = Y) & (Δ_Y > τ2) & (Δ_X > Δ_Y/2))
- Class IV: if (f_i = A) & (max(A'(n))/min(A'(n)) > τ3)
- Class V: if none of the above

where & is the logical AND operator and τ1, τ2 and τ3 are appropriate thresholds for determining whether the motion is significant. A threshold of τ1 means that we consider a motion greater than τ1 times the initial hand width to be a large motion. The threshold τ2 for Class III is smaller than τ1 because, when there is significant motion in both the horizontal and vertical directions, the range of variation in each direction is naturally reduced compared to the case in which there is large motion in one direction alone.

Figure 5: Decision tree for the first stage of classification ("Is the variation in X large and the variation in Y small?" → Class I; "Is the variation in Y large and the variation in X small?" → Class II; "Are the variations in X and Y both large?" → Class III; "Is A the feature with maximum variation and is the ratio of its maximum to minimum large?" → Class IV; otherwise Class V).

4. Trajectory Based Classification

We need to analyse the dominant feature trajectories within each class in order to recognize the gestures. We observe that each gesture consists of portions of motion in which the feature dynamics are similar and distinct. As the information characterizing the gesture is contained in the sequence of these portions of similar dynamics, we propose a definition of a gesture which incorporates this information.
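The first-stage classification rules above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the threshold values τ1, τ2, τ3 are placeholder defaults (the exact values used in the study are not specified here), and the function name is ours.

```python
import numpy as np

def first_stage_class(X, Y, A, C=1.0, tau1=1.0, tau2=0.75, tau3=1.5):
    """Assign a gesture to Class I-V from its smoothed trajectories
    X(n), Y(n), A(n).  C scales the area so its range is comparable
    to the position ranges; tau1/tau2/tau3 are placeholder thresholds."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    Ap = C * np.asarray(A, dtype=float)          # A'(n) = C * A(n)
    d = {'X': X.max() - X.min(),
         'Y': Y.max() - Y.min(),
         'A': Ap.max() - Ap.min()}               # ranges Delta_i
    fi = max(d, key=d.get)                       # feature with max range

    if fi == 'X' and d['X'] >= tau1 and d['Y'] < d['X'] / 2:
        return 'I'                               # UP / DOWN
    if fi == 'Y' and d['Y'] >= tau1 and d['X'] < d['Y'] / 2:
        return 'II'                              # LEFT / RIGHT
    if (fi == 'X' and d['X'] > tau2 and d['Y'] > d['X'] / 2) or \
       (fi == 'Y' and d['Y'] > tau2 and d['X'] > d['Y'] / 2):
        return 'III'                             # CW / CCW
    if fi == 'A' and Ap.min() > 0 and Ap.max() / Ap.min() > tau3:
        return 'IV'                              # PUSH / PULL
    return 'V'                                   # AWAY / CLOSER
```

For instance, a trajectory with a large range only in X falls into Class I, while comparable large ranges in both X and Y fall into Class III.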
We define a gesture to be a sequence of states {s_1, ..., s_N}, where the states correspond to portions of the dominant feature trajectory having similar dynamics. For each class a sequence of states can be defined such that it characterizes and distinguishes between the gestures in that class. The resulting sequence of states obtained from the trajectory can be used for gesture recognition as explained below.

Class I: This class consists of the UP and DOWN gestures. Due to the presence of only one dominant feature, viz. X(n), the states are scalars in this case. For the UP gesture, initially the hand shows a small downward movement followed by a long upward motion, as shown in Fig. 6. Hence ideally X(n) would initially increase from the first time index (frame) up to some index k and then decrease all the way until the end of the sequence, say m. Hence the trajectory should be split into two states representing this. We select the states to be the change in the feature over portions of uniform dynamics, i.e. portions which consist of motion in the same direction. Thus, ideally this gesture would result
Figure 6: Obtaining states from the dominant trajectory X(n) (the trajectory is split at the extremum at frame k; the sequence ends at frame m).

in the states s_1 = X(k) − X(1) and s_2 = X(m) − X(k). The frame k corresponds to a maximum where the direction of motion changes, and hence where the sign of the derivative of X(n) changes. In order to determine whether it is the UP gesture, we use its most dominant characteristic, i.e. the upward motion, resulting in the recognition criterion Σ s_i < 0. The DOWN gesture is similar except that the direction is reversed. Thus, we may summarize the criterion for classification as follows: if Σ s_i < 0: UP; else: DOWN.

Class II: The gestures in this class are LEFT and RIGHT. This class and its gestures are analogous to those of Class I. The only difference is that instead of X(n) being the dominant feature as in Class I, Y(n) is the dominant feature here. As a result, the same strategy as in Class I can be used: if Σ s_i < 0: RIGHT; else: LEFT.

Class III: The gestures in this class are CCW and CW. As this class has two dominant features, X(n) and Y(n), the states are constructed using both of them. The method of selecting the states is depicted in Fig. 7(a). Both trajectories are split individually at points of maxima or minima, and later both of these are merged together to obtain states for the combined 2D trajectory. Since our purpose is to differentiate between the directions of rotation of the hand, we select the directions of change of X(n) and Y(n) as the elements of the 2D state vectors. An increase is denoted by +1 and a decrease by −1. Thus, for the trajectories shown in Fig. 7, the states would be s_1 = [+1 −1]^T, s_2 = [−1 −1]^T, s_3 = [−1 +1]^T and s_4 = [+1 +1]^T. In the case of the CCW and CW gestures, ideally we would expect a sequence of 4 states due to the hand moving in a circle and returning back to the same position.
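The splitting of a 1-D dominant trajectory into states at its extrema, and the Class I criterion above, can be sketched as follows (a minimal illustration; the function names are ours):

```python
import numpy as np

def states_1d(x):
    """States of a 1-D dominant-feature trajectory: the signed change
    over each maximal run of motion in one direction, with runs closed
    at extrema where the direction of motion flips."""
    states = []
    start = 0          # first frame of the current run
    direction = 0      # +1 increasing, -1 decreasing, 0 not yet moving
    for i in range(1, len(x)):
        step = np.sign(x[i] - x[i - 1])
        if step == 0:
            continue                       # ignore flat stretches
        if direction == 0:
            direction = step
        elif step != direction:            # extremum: close current state
            states.append(x[i - 1] - x[start])
            start, direction = i - 1, step
    if direction != 0:
        states.append(x[-1] - x[start])
    return states

def classify_class1(x):
    """Class I rule: UP if the total signed change sum(s_i) < 0."""
    return 'UP' if sum(states_1d(x)) < 0 else 'DOWN'
```

A trajectory that rises briefly and then falls far below its start yields two states whose sum is negative, and is therefore labeled UP; swapping the roles of the two trajectories gives the Class II (LEFT/RIGHT) rule.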
Figure 7: Recognition of gestures in Class III. (a) Merging for 2D states, and (b) mapping from states to directions.

However, the hand may rotate a bit more or less, resulting in the number of states N_s not being exactly equal to 4 in some cases. Hence, for recognizing the gesture, we use a scheme based on determining the direction of rotation. We define a mapping f : S → D, i.e. from the 2D states to directions, as shown in Fig. 7(b):

f([+1 +1]^T) = 1; f([+1 −1]^T) = 2; f([−1 −1]^T) = 3; f([−1 +1]^T) = 4.

For a clockwise rotation of the hand the resultant sequence f(s_1), f(s_2), ..., f(s_n) would be a subsequence of the periodic sequence 1, 2, 3, 4, 1, 2, 3, 4, ... The direction of rotation is given by the sign of the sum of the cyclic differences α_N = Σ_{i=2}^{N_s} f(s_i) Ψ f(s_{i−1}), where Ψ denotes a K-cyclic difference mapping defined as follows. For all a, b ∈ {1, ..., K},

a Ψ b = a − b,        if −K/2 < a − b < K/2
a Ψ b = a − b + K,    if a − b ≤ −K/2
a Ψ b = a − b − K,    if a − b ≥ K/2

In our case K = 4. Thus we recognize the gesture as follows: if α_N > 0: CW; else: CCW.

Class IV: This class has two gestures, PUSH and PULL, which have only one dominant feature, A'(n). Both would ideally result in a single state in which the area either increases or decreases monotonically. Thus, similar to Class I, we define the state to be s_1 = A'(m) − A'(1). Hence to recognize the gestures we check whether the number of states N_s is 1, and differentiate between them based on the direction of change, i.e. increase or decrease of area. The resultant scheme is summarized below.
If N_s = 1: (if s_1 > 0: PUSH; else: PULL).

Class V: Both the gestures of this class, AWAY and CLOSER, are similar in the sense that both have A'(n) as a dominant feature showing similar variation. Hence this feature cannot be used to discriminate between them. However, X(n) shows very different characteristics for the two. For the CLOSER gesture, since there is no horizontal motion of the hand, X(n) remains almost constant, whereas for the AWAY gesture there is a distinct horizontal movement, first towards the left and then towards the right. Y(n) also shows similar behavior for both gestures. Hence we use only X(n) to form the states for the AWAY gesture, using the same method as used for Classes I and II, which should ideally result in 2 states. The recognition criteria are listed below.

If Δ_X < Δ_Y/2: CLOSER; else if N_s = 2 & s_2 > 0: AWAY.

A gesture which does not fall into any of these categories results in an unrecognized gesture.

5. Experimental Results

The hierarchical gesture recognition scheme based on dominant features was tested on a data set of 8 gestures performed by different users. Each gesture sequence is of a different length, depending on the time taken to perform the gesture. The gestures were captured in a natural office environment with a cluttered background. Out of the 8 gestures, all were correctly recognized except for two false recognitions and 6 unrecognized gestures. The false recognitions were due to the manner in which the gesture was performed: there was significant motion in both the x- and y-directions, which led to misclassification. The unrecognized gestures were found to be due to false skin-color detection resulting in faulty tracking, which has nothing to do with the accuracy of the proposed recognition scheme. Table 1 summarizes the results.

6. Conclusions

In this paper, a technique for gesture recognition using the data available from a hand tracker developed earlier has been proposed.
We use a hierarchical scheme where, in the first stage, we classify the gesture based on the dominant features and, in the second stage, recognize it based on states describing the dominant feature trajectory. Owing to the state-based approach, our recognition technique is unaffected by the rate at which gestures are performed and by the initial position of the hand in the image. The representation of gestures as a sequence of states, obtained by splitting the dominant feature trajectories into perceptually important segments, results in a fast and simple recognition scheme.

Table 1: Summary of experimental results

Gestures   Instances   True   False   Unrecog.
LEFT 6 6
RIGHT 9 8
UP
DOWN 9
CCW 6
CW 7 7 7
PUSH 8
PULL 9
AWAY
CLOSER
Total 8 6

References

[1] V. Pavlovic, R. Sharma and T. S. Huang, Visual interpretation of hand gestures for human-computer interaction: A review, IEEE Trans. on PAMI, vol. 19, no. 7, pp. 677-695, 1997.
[2] J. Davis and M. Shah, Visual gesture recognition, in IEE Proc. - Vision, Image, Signal Processing, April 1994.
[3] T. Starner, J. Weaver and A. Pentland, Real-time American Sign Language recognition using desk and wearable computer based video, IEEE Trans. on PAMI, Dec. 1998.
[4] M. Yeasin and S. Chaudhuri, Visual understanding of dynamic hand gestures, Pattern Recognition, vol. 33, 2000.
[5] J. Martin and J. L. Crowley, An appearance-based approach to gesture recognition, in Proc. of ICIAP, Sep. 1997.
[6] H. Ohno and M. Yamamoto, Gesture recognition using character recognition techniques on two-dimensional eigenspace, in Proc. of IEEE ICCV, Sep. 1999.
[7] D. Hall, J. Martin and J. L. Crowley, Statistical recognition of parameter trajectories for hand gestures and face expressions, in Proc. of ECCV, 1998.
[8] M. J. Black and A. D. Jepson, Recognizing temporal trajectories using the condensation algorithm, in Proc. of IEEE Int. Conf. on Automatic Face and Gesture Recognition, 1998.
[9] James P. Mammen, S. Chaudhuri and Tushar Agrawal, Simultaneous tracking of both hands by estimation of erroneous observations, in Proc. of BMVC, Manchester, Sep. 2001.