Depth Based Dual Component Dynamic Gesture Recognition


Depth Based Dual Component Dynamic Gesture Recognition

Helman Stern, Kiril Smilansky, Sigal Berman
Department of Industrial Engineering and Management, Deutsche Telekom Labs at BGU, Ben-Gurion University of the Negev, Beer-Sheva, 84105, Israel
The 2013 International Conference IPCV'13

Abstract

In this paper we describe several approaches for recognition of gestures that include simultaneous arm motion and hand configuration variations. Based on compound (dual component) gestures selected from the ASL, we developed methods for recognizing such gestures from Kinect sensor videos. The method consists of hand segmentation from depth images followed by feature extraction based on block partitioning of the hand image. When combined with trajectory features, a single-stage classifier is obtained. A second method, which classifies arm movement and hand configuration in two stages, is also developed. These two methods are compared to a moment based classifier from the literature. Using a database of 11 subjects for training and testing, the average classification accuracy of the one-stage classifier was the highest (95.5%) and that of the moment based classifier was the lowest (20.9%). The two-stage classifier obtained an average classification accuracy of 61.1%.

Keywords: Gesture recognition, human-machine interaction, dynamic motion gestures, sign language

1. Introduction

Gesture recognition systems (GRS) offer a natural and intuitive way of interacting with machines and electronic devices such as computers and robots. GRS can both enhance user experience and offer improved operational capabilities more suitable for the requirements of modern technological devices. The vocabulary of many GRSs includes either dynamic gestures, where meaning is conveyed in the arm/hand movement, or static gestures, where meaning is based on arm/hand pose.
Compound gestures, in which meaning is conveyed in both motion and pose, can enrich the gesture vocabulary and are frequently found in sign languages (SL). The difficulty of designing a classifier for SLs stems not only from their multimodal nature (facial expression, head pose and nodding, body part proximity, torso movements) but also from the myriad variations of the motion signing itself (two-handed signs, complex hand configuration changes combined with arm motion, arm pose, cyclic movements, etc.). One such sign structure is that of gestures composed of a number of different components, such as hand configuration, hand orientation, and change of hand location (due to arm movements). Thus, in this paper we study a class of dual component signings which are part of the ASL. Also, because of the recent surge in the availability of depth-enabled video cameras, which reduce the amount of classification and image processing requirements, we have selected this as our source of gesture signal input. We consider dual component gestures comprised of an arm component and a hand component. As each component can be either static or dynamic with regard to motion, of the four possible cases we are only interested in those with at least one dynamic component. Following Ong and Ranganath [1], classification schemes can be divided into those that use a single classification stage and those that classify components of a gesture and then integrate them for final gesture classification. We provide a new taxonomy for gestures, highlighting the different types of meaning-conveying components and possibilities for compound gestures. Gestures from American Sign Language (ASL) are then used to demonstrate the taxonomy and ground the research within a realistic recognition problem. A recognition algorithm suitable for compound gestures was developed. Recognition is based on both block partition values of hand configuration and hand trajectory angles.
Two classification methods were developed: (i) a one-stage classifier based on a single vector of all features (configuration and motion), and (ii) a two-stage classifier based on separate classification of the hand configuration component and the arm motion component, followed by a concatenative stage in which the recognized components are recombined into a final gesture. In addition, a moment based method from the literature (Agrawal and Chaudhuri [2]) was tested for comparative purposes. In the following section we provide a short background on gesture recognition systems and signing gestures. In section 3 we propose our taxonomy and gesture set. Following the introduction of the architecture for classifying dual component gestures in section 4, the method of segmentation of each component from a depth map and feature extraction is discussed. Our proposed algorithms for

dual component gesture classification and the moment-based method are covered in section 5. Experimental testing of the algorithms and a discussion of the results are the subjects of sections 6 and 7, respectively. Final conclusions appear in section 8.

2. Background and related literature

2.1 Gesture recognition system architecture

The architecture of the hand gesture recognition system is depicted in Fig. 1. A continuous video is captured by a sensor (camera) mounted in front of the user, who is gesturing with his/her hand. The frames of the video are separated and sent to a tracker which locates the hand in each image. The tracking algorithm calculates a centroidal position of the hand. A sequence of centroids constitutes a gesture trajectory which is then extracted from the video stream.

Figure 1. Architecture for a gesture recognition system

This trajectory is processed and its descriptive features are extracted. These features are then fed into the classification module, which recognizes the particular gesture. When a gesture is recognized, its representative command is sent to a state machine, which can take the form of a controllable physical or virtual object.

2.2 Motion gesture classification

For motion gesture classification, dynamic time warping (DTW) and hidden Markov models (HMM) are most often used. HMM is a generative classification method, quantifying the probability that an input vector could have been generated from a model constructed for each gesture based on the training set. In contrast, DTW is a discriminative classification method directly comparing the input vector to template vectors of each gesture. The longest common subsequence (LCS) algorithm follows the same idea as DTW, using a distance function as a similarity measure between a temporal test sequence and a pattern (Frolova et al. [3]).

2.3 SL gesture recognition

Survey papers on SL recognition are scarce. However, we did find two.
The first, by Havasi and Szabo [4], describes a semi-automatic construction of a sign database and gives a summary of technical and linguistic issues of the signing effort. The second, by Ong and Ranganath [1], reviews SL data capture, feature extraction and classification, and the integration of non-manual signs with hand sign gestures. Mitra and Acharya [5] divide SLs into three major types: (i) finger-spelling, (ii) word level sign vocabulary, and (iii) non-manual features. Finger spelling is used for characters, word level signs are used for most of the communication, and non-manual features are facial expressions and positions of the tongue, mouth, and body. An integration of motion gesture recognition and posture recognition is presented in Rashid et al. [6]. Depth information and orientation features are classified by HMM and support vector machines. Rokade et al. [7] proposed using gesture trajectory features and key frames, followed by DTW, for recognizing ASL gestures. Agrawal and Chaudhuri [2] used spatial moments up to the second order as features and principal component analysis (PCA) for classifying between gestures that were characterized both by hand motion and hand configuration.

3. Gesture vocabulary and taxonomy

Ten gestures are selected from the ASL vocabulary (Table 1) to ground the research within a realistic recognition problem. The gestures were chosen such that they emphasize the importance of recognizing both arm/hand trajectory and hand configuration. A two-layered gesture taxonomy for hand pose-trajectory combinations is shown in Fig. 2. The selected ASL gestures were placed in the taxonomy tree of Fig. 2 according to their static/dynamic component properties. A hand pose-trajectory is considered as a two-component gesture: (a) hand configuration, and (b) hand location(s). Each may be either static or dynamic.
The hand location component may be either (i) a static hand location (no arm movement during a gesture period) or (ii) a dynamic hand trajectory (arm movement). The hand component may be either (i) a static hand configuration (no hand configuration change during a gesture period) or (ii) a dynamic hand configuration (hand configuration change during the gesture period). It is clear that a dynamic gesture is defined as having at least one dynamic component.

Table 1. Ten ASL gestures used for testing (X - horizontal, Y - vertical, Z - toward the camera)

Figure 2. Gesture taxonomy based on static/dynamic combinations

4. Compound gesture classification

The architecture of the developed recognition method is depicted in Fig. 3. The first stage, hand segmentation, is based on a depth map and the hand centroid location in each frame. For each frame, the hand region is segmented and features of the hand configuration are extracted. Additional features of the hand trajectory during the gesture are also extracted. The features are then used for classifying the gesture into one of the gesture classes in the vocabulary. Two different classifiers were developed: a one-stage classifier, which classifies the gestures using a single vector of all features, and a two-stage classifier, which first separately classifies configuration features and motion features and then classifies the gesture using the results of the first stage. This section is organized according to the architecture stages.

Figure 3. Gesture classification system diagram (depth map and hand centroid (x, y, z) -> segmentation -> feature extraction -> classification)

4.1 Hand segmentation

The purpose of this stage, as shown in Fig. 4, is to segment the hand from the rest of the depth map, for each frame in the gesture video. The target of this process is to segment the hand as tightly as possible without losing regions of the hand itself. The output of the segmentation is a scaled intensity image in which pixels that belong to the hand have a scaled value above one and all the other pixels have a value of zero (black).

Figure 4. Hand segmentation. Left: depth map with overlaid hand centroid marked by a red X sign; Middle: segmented hand region; Right: a tight bound about the segmented hand.

The segmentation starts with cropping the hand region of interest (ROI) using the pinhole camera model, which determines the hand ROI dimensions X, Y (in pixels) based on the distance Z between the hand and the camera (obtained from the depth map).
Hand ROI dimensions are computed using the basic equations of perspective projection:

Y = F_y * H_h / Z ;  X = F_x * H_w / Z    (1)

where F_x and F_y are the camera's focal lengths along the horizontal and vertical axes, respectively. The hand width H_w and height H_h are based on average dimensions as reported in Pheasant and Haslegrave [8]. The hand centroid

[x (pixels), y (pixels), z (mm)] is acquired from the hand tracker and the hand ROI is centered about it, as shown in Fig. 4 (middle). Finally, a bounding box is created around the hand. Fig. 4 (right) shows the original depth map of the cropped hand ROI.

4.2 Hand configuration feature extraction

A block partition method was used for dividing the segmented hand image into NxM sub-blocks. Each value in the block partition matrix represents the mean value of the pixels in the sub-block. The matrix size was found by a direct parameter search to be optimal at 8x5. A black and white (BW) mask is produced in which pixels that belong to the hand have the value one (white) and all other pixels have the value zero (black). Then, for each sub-block, the fraction of white pixels in the sub-block is calculated as a feature of the hand configuration. Fig. 5 shows a block partition in which each cell of the 8x5 matrix on the right side represents the mean value of the matching sub-block of the BW mask on the left side.

Figure 5. Block partition of hand configuration

The NxM block white-ratio features representing the hand configuration are extracted from each segmented frame, and the aspect ratio [height / width] of the bounding box is additionally computed and added to the feature vector. Therefore, for an 8x5 cell matrix, the length of the configuration feature vector is 41.

4.3 Hand trajectory feature extraction

For a gesture video of n frames, a sequence of n hand centroids [x (pixels), y (pixels), z (mm)] is acquired from the hand tracker. Because all the gestures in our database are in fact 2D (XY plane or YZ plane), the plane with the maximum motion is chosen and the sequence of points is projected onto that plane. It is a common phenomenon that dynamic gestures are planar even when the gesture vocabulary is embedded in 3D space.
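As a rough illustration of the block-partition features, the following NumPy sketch computes the 8x5 white-pixel fractions plus the bounding-box aspect ratio (41 features). This is our own minimal reconstruction, not the authors' Matlab implementation; the function name and the integer rounding of sub-block boundaries are assumptions.

```python
import numpy as np

def block_partition_features(hand_mask, rows=8, cols=5):
    """Block-partition features for a segmented hand BW mask.

    hand_mask : 2D binary array (1 = hand pixel, 0 = background).
    Returns the rows*cols white-pixel fractions plus the bounding-box
    aspect ratio (height / width), i.e. 41 features for an 8x5 grid.
    """
    ys, xs = np.nonzero(hand_mask)
    # Tight bounding box around the hand region.
    box = hand_mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    h, w = box.shape
    feats = []
    # Fraction of white pixels (= mean of the binary mask) per sub-block.
    for i in range(rows):
        for j in range(cols):
            sub = box[i * h // rows:(i + 1) * h // rows,
                      j * w // cols:(j + 1) * w // cols]
            feats.append(sub.mean() if sub.size else 0.0)
    feats.append(h / w)  # aspect ratio feature
    return np.array(feats)
```

A solid rectangular mask, for instance, yields forty block fractions of 1.0 followed by its height/width ratio.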
Each sample trajectory is of a different length, varying from 12 to 78 frames, depending on the time it took the subject to perform the gesture; therefore, we normalize the length of the trajectory. The trajectory is resampled into n=39 points and transformed into a vector of n-1=38 absolute angles of the hand motion direction. Each absolute angle is computed using eq. 2, where x and y are the coordinates of the hand centroids in the selected plane of maximum motion:

θ_i = tan⁻¹[(y_i − y_{i−1}) / (x_i − x_{i−1})]    (2)

5. Gesture classification

Two different methods were developed for classifying the gestures: a one-stage classifier and a two-stage classifier. The one-stage classifier combines the features of the hand configurations of some key frames with the features of the hand trajectory and classifies the combined feature vector. The two-stage classifier decouples the gesture at the component level and classifies the hand configurations of some key frames and the hand trajectory separately. The results of these classifications are combined into one output classification using a fuzzy rule set. In addition, a moment based method from the literature is described and used for comparative testing.

5.1 One-stage classifier

The one-stage classifier is based on one feature vector (Fig. 6) that represents the whole gesture. For each frame, a feature vector represents the hand configuration. In addition, the hand trajectory feature vector represents the change in hand location between frames. Instead of classifying each of those channels of information separately, a classification of a full feature vector is employed which takes into account all information in the gesture. Under the assumption that the configuration of the hand is only important at the beginning and at the end of the gesture, only n1 frames at the beginning and n2 frames at the end of the gesture are considered for the configuration features.
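The 38-angle trajectory component described in section 4.3 can be sketched as follows. This is an assumed reconstruction: the paper does not state how resampling is done (we use arc-length parameterization with linear interpolation), and we use atan2 rather than the plain arctan of eq. 2 so that the full direction quadrant is kept.

```python
import numpy as np

def trajectory_angle_features(centroids, n_points=39):
    """Resample a 2D centroid trajectory to a fixed length and return
    the n_points-1 absolute motion-direction angles (cf. eq. 2).

    centroids : (n, 2) array of hand centroids in the selected plane.
    """
    pts = np.asarray(centroids, dtype=float)
    # Arc-length parameterization, then linear resampling to n_points.
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    t = np.linspace(0.0, s[-1], n_points)
    x = np.interp(t, s, pts[:, 0])
    y = np.interp(t, s, pts[:, 1])
    # Absolute direction angle of each resampled step.
    return np.arctan2(np.diff(y), np.diff(x))
```

For any input length, the output is a fixed 38-element angle vector, which is what makes gestures of 12 to 78 frames comparable.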
The trajectory of the hand is represented by a vector of N=38 absolute angles of the hand motion direction.

Figure 6. Classification features. Top: end configuration features (frames N−n2+1 to N); Middle: beginning configuration features (frames 1 to n1);

Bottom: hand trajectory features (38 angles over the trajectory).

The feature vectors for the hand configuration and the motion trajectory were combined into a single feature vector after applying PCA. The combined feature vectors (one for each test gesture) were used to train and test a K-nearest-neighbor (KNN) classifier using a Euclidean distance measure.

5.2 Two-stage classifier

In the first stage, the hand configuration and motion trajectory are decoupled and fuzzy memberships of each are derived. This starts with the hand configurations in the n1 first and n2 last frames of the gesture, which are subsequently combined into two membership vectors using a fuzzy rule base (one for the hand configuration at the beginning of the gesture and one for the end). Another membership vector, for the hand trajectory, is computed based on an HMM classifier. In the second stage, a fuzzy rule base is used for combining the three different membership vectors. The result is a single membership vector, with different degrees of belonging to each of the gestures in the gesture vocabulary. Finally, a defuzzification phase is executed, where the gesture is classified into the gesture class with the highest degree of membership.

5.3 Moment based classifier

The moment based classifier suggested by Agrawal and Chaudhuri [2] was used to provide a comparison to our one- and two-stage algorithms. One of the main differences between Agrawal and Chaudhuri's method and ours is that they do not use block partition features, but instead use a set of moments calculated from a BW hand image. These moments are concatenated over all frames of the gesture video and classified by a PCA classifier in a one-stage procedure.

6. Experiments

The two classifiers which were developed, and the moment based classifier suggested by Agrawal and Chaudhuri [2], were implemented using Matlab (Mathworks, USA). In all implementations, the PrimeSense sensor and OpenNI were used to acquire depth image videos and track the hand centroid.
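The core of the one-stage pipeline (PCA dimensionality reduction followed by a Euclidean-distance KNN) can be sketched in NumPy as follows. This is a generic illustration under our own assumptions, not the authors' Matlab code; in particular the 85% variance threshold is taken from section 6.2 and the helper names are invented.

```python
import numpy as np

def pca_fit(X, var_keep=0.85):
    """Fit PCA on training vectors X (n_samples x n_features); keep the
    leading components explaining var_keep of the variance."""
    mu = X.mean(axis=0)
    _, svals, Vt = np.linalg.svd(X - mu, full_matrices=False)
    var = svals ** 2
    # Smallest k whose cumulative explained variance reaches var_keep.
    k = int(np.searchsorted(np.cumsum(var) / var.sum(), var_keep)) + 1
    return mu, Vt[:k]

def knn_classify(train_X, train_y, test_x, k=1):
    """K-nearest-neighbor vote with Euclidean distance."""
    d = np.linalg.norm(train_X - test_x, axis=1)
    nearest = train_y[np.argsort(d)[:k]]
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[np.argmax(counts)]
```

In use, the start/end configuration features would be projected with `pca_fit` (via `(x - mu) @ components.T`), concatenated with the 38 trajectory angles, and fed to `knn_classify`.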
This section starts with a description of the database used for testing the developed algorithms. The database consists of samples of the ten-gesture ASL vocabulary, which were recorded using the PrimeSense sensor. We then describe the classification evaluation process. Three experiments were performed, testing the one-stage, two-stage and moment based classifiers, with a comparison made between them.

6.1 Gesture training and testing database

The evaluation of the classification methods was done using the ten ASL gesture vocabulary shown in Table 1. Gestures of 11 subjects aged 25 to 46 (10 male, 9 female) were recorded. Each subject provided 5 samples of each of the 10 gesture types. A leave-one-subject-out cross validation experiment was performed. Accordingly, a total of eleven experiments were conducted, with 50 samples per gesture used for training in each experiment and 5 samples per gesture for testing.

6.2 One-stage classifier evaluation

Before the evaluation, the values of several important feature and classifier parameters were determined by a direct search method. Recall that, for our composite gestures, the hand component was extracted only at the start and end of the gesture trajectory, where different hand configurations were expected to have occurred. Thus, the number of frames at the start and end of the gesture from which the hand component is extracted are two parameters that need to be determined. Two other feature parameters are the number of rows and the number of columns of the optimal block partition of the hand image. For the KNN classifier we also determined the optimal number of nearest neighbors and the number of clusters. The final parameters and the ranges over which they were optimized are shown in Table 2. It should be noted that the number of clusters N for the KNN represents the number of forms used to represent each of the gestures.
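The leave-one-subject-out protocol described above can be sketched generically; any classifier (such as the KNN of section 5.1) can be plugged in through the `fit_predict` callable. The function name and interface are our own assumptions for illustration.

```python
import numpy as np

def leave_one_subject_out(features, labels, subjects, fit_predict):
    """Leave-one-subject-out cross validation.

    fit_predict(train_X, train_y, test_X) -> predicted labels.
    Returns the per-held-out-subject accuracies.
    """
    accs = []
    for s in np.unique(subjects):
        test = subjects == s
        # Train on all other subjects, test on the held-out subject.
        pred = fit_predict(features[~test], labels[~test], features[test])
        accs.append(np.mean(pred == labels[test]))
    return np.array(accs)
```

With 11 subjects this yields 11 accuracies, whose mean and spread correspond to the averages and confidence intervals reported in section 7.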
In effect, we used a constraint to ensure an equal number of forms for each gesture (to allow for the natural variability of a gesture due to different users' motion behavior). Since this resulted in 4 forms for each gesture, the final number of clusters used for the 10-gesture KNN was 40. The parameter evaluation was based on the accuracies of the classifier using a gesture test set for each candidate parameter set. The trajectory of the hand was represented by a vector of 38 absolute directional hand motion angles (Fig. 6 and eq. 2), after reforming all gestures to a common length. The hand configuration component feature vector is of length (n1+n2) x (40 block features (8x5) + 1 aspect ratio) = (n1+n2) x 41. So for the n1=6 start frames the start hand feature length is 246, and for the n2=1 end frame the end hand feature length is 41. For the start and end hand component vectors we used PCA to reduce the dimensionality to 30 and 11, respectively (based on the first eigenvalues that explain 85% of the variance). The hand trajectory feature vector was added to the reduced hand configuration vectors, resulting in a full 79-feature vector for the gesture. These vectors were used for training and testing the classifier.

Table 2. Parameters for optimization

Param | Description | Min | Max | Optimum
M | Number of rows for the block partition | | | 8

N | Number of columns for the block partition | | | 5
k | Number of nearest neighbors for classification | | |
n1 | Number of frames at the beginning of the gesture | | | 6
n2 | Number of frames at the end of the gesture | | | 1
N | Number of clusters for K-means | | | 40

6.3 Two-stage classifier evaluation

Nine possible hand configuration classes were specified, according to the hand shapes at the beginning and end of the ASL gestures. A training set of feature vectors was constructed for each configuration class, based on the n1 first and n2 last frames of the gesture training set. For each frame, membership values of the hand configuration were computed using KNN. The cosine amplitude r_ij was used as a similarity measure between a tested feature vector i and a training feature vector j:

r_ij = Σ_{k=1..m} F_ik·F_jk / sqrt( (Σ_{k=1..m} F_ik²)·(Σ_{k=1..m} F_jk²) )

where m is the feature vector length and F_ik is the k-th feature of vector i. The cosine amplitude r_ij can take a value between zero (for perpendicular vectors) and one (for parallel vectors). Then, the membership value M_i(C) of frame i for configuration class C is the mean value of the k highest measures between the tested feature vector and the training feature vectors belonging to that configuration class:

M_i(C) = mean_{j ∈ E(C)} r_ij    (3)

where C is the configuration class and E(C) is the set of k highest measures of the class. Membership vectors were computed for the hand configurations, one for the beginning of the gesture and one for the end. A fuzzy rule base was used for combining the first n1=6 frames into one membership vector and the last n2=1 frames into a second membership vector. The membership value for each configuration class at the start of the gesture is the maximum membership value over the first six frames, and at the end of the gesture it is the membership value of the last frame:

M_start(C) = max_{i=1..6} { M_i(C) };  M_end(C) = M_n(C)    (4)

Six possible hand trajectory classes were specified.
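The cosine-amplitude similarity and the per-frame class membership of eq. 3 can be sketched as below. This is a minimal reconstruction under our assumptions (function names and the default k are illustrative); note the [0, 1] range holds for non-negative feature vectors such as the block white-pixel fractions.

```python
import numpy as np

def cosine_amplitude(fi, fj):
    """r_ij: cosine-amplitude similarity between two feature vectors
    (0 for perpendicular vectors, 1 for parallel non-negative ones)."""
    return float(np.dot(fi, fj) / np.sqrt(np.dot(fi, fi) * np.dot(fj, fj)))

def membership(test_vec, class_train_vecs, k=3):
    """Membership of a frame in a configuration class: mean of the k
    highest cosine-amplitude measures against the class training set
    (eq. 3)."""
    r = sorted((cosine_amplitude(test_vec, tv) for tv in class_train_vecs),
               reverse=True)
    return float(np.mean(r[:k]))
```

A test vector identical (up to scale) to the class training vectors thus gets membership 1.0.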
Because paths in the YZ plane differed considerably from the paths in the XY plane, the plane of motion is not considered during the classification. The HMM algorithm used for trajectory classification (Frolova et al. [3]) contained eight states, which were the eight possible motion directions. The output was a log-likelihood vector over the dataset, which was used as the membership vector of the trajectory, M_traj(T), where T is a possible trajectory. Ten fuzzy rules combined the results of the trajectory and hand configuration classifications. For example: if the configuration at the start is a closed fist, the configuration at the end is the hand stretched forward, and the trajectory is forward, then the ASL gesture is Magic. The "and" directive is translated into a minimum function, and the membership value for each gesture is given in eq. 5:

M(G) = min { M_start(Cstart(G)), M_end(Cend(G)), M_traj(T(G)) }    (5)

where Cstart(G) and Cend(G) are the configurations at the start and end of gesture class G, and T(G) is the trajectory of gesture class G. Finally, a test gesture was classified to the gesture class with the highest membership value.

6.4 Moment based classifier evaluation

Agrawal and Chaudhuri's method, unlike ours, uses a set of spatial moments up to the second order, calculated from a BW hand image, to represent the hand configuration. For each nth frame, the feature vector is:

f(n) = [A(n) Cx(n) Cy(n) Cxx(n) Cxy(n) Cyy(n)]    (6)

where Cxx(n), Cxy(n) and Cyy(n) are the second order spatial moments of the hand region, Cx(n) and Cy(n) are the first order moments, and A(n) is the zero order moment. For the whole gesture, each moment creates a trajectory; thus six feature trajectories represent the gesture, characterized both by the motion of the hand in space and the changing configuration of the hand throughout the gesture. After using PCA for dimensionality reduction, the test data are presented to a PCA classifier.
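The fuzzy min-rule combination of eq. 5 can be sketched as follows. The rule table and class labels here are illustrative stand-ins (only the Magic rule is paraphrased from the text); defuzzification is the argmax over the resulting memberships.

```python
def gesture_memberships(m_start, m_end, m_traj, rules):
    """Combine component memberships with the fuzzy min rule (eq. 5).

    m_start, m_end : dicts mapping configuration class -> membership.
    m_traj         : dict mapping trajectory class -> membership.
    rules          : dict gesture -> (start_config, end_config, trajectory).
    Returns dict gesture -> membership; defuzzify by taking the argmax.
    """
    return {g: min(m_start[cs], m_end[ce], m_traj[t])
            for g, (cs, ce, t) in rules.items()}
```

For example, with memberships 0.9 (fist at start), 0.8 (stretched hand at end) and 0.7 (forward trajectory), the Magic rule fires with degree min(0.9, 0.8, 0.7) = 0.7.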
7. Results and discussion

According to the 95% confidence intervals (Table 3), the one-stage classifier gave significantly better results than the other two classifiers. The average classification accuracy of the one-stage classifier was the highest (95.45%) and that of the moment based classifier was the lowest (20.91%). The two-stage classifier obtained an average classification accuracy of 61.09%.

Table 3. Comparative classification accuracies

Classifier | Ave. accuracy | 95% confidence interval
One-stage | 95.45% | [92.2, 98.7]
Two-stage | 61.09% | [51.4, 70.8]
Moment | 20.91% | [17.6, 24.2]

With the two-stage classifier, many (53%) of the "Magic" (1) gestures were classified as "Audience" (2). The

trajectories were correctly classified; however, the configurations were not, which is what caused the misclassification. Fig. 7 shows examples of the misclassified configurations. The left image, for the Magic gesture, shows an intermediate phase between the closed fist configuration and the open hand stretched forward, and therefore cannot be recognized. The center and right images show the "C" and "P" configurations twisted towards the camera, causing a deformed view, and therefore they cannot be recognized.

Figure 7. Misclassified configurations

With the moment based classifier, many of the gestures were classified as "Magic" (1) or "Audience" (2). Because those gestures are performed towards the camera, the area of the hand region changes throughout the gesture; therefore, scaling by the area in the first frame becomes insufficient. For the one-stage classifier, the most confused gestures were Chicago (5) and Philadelphia (6), which share the same hand trajectory but have different hand configurations that occasionally look alike after the block partition is performed. Fig. 8 shows an example of the hand configurations.

Figure 8. ASL "C" compared to "P"

As an additional observation, there are interactions between the different motion modalities within the compound gestures performed by the users. Even when the textbook hand configurations are identical in two gestures with different trajectories, the configurations actually performed may differ, especially when taking into account the dynamic evolution of the configuration, as the person performing the gesture is moving his/her hand while constructing the configuration. Therefore, the component classifier of a particular hand configuration class is trained on all variations of that class over all types of gestures containing it, even those affected or distorted by particular trajectory paths.
8. Conclusions

Two different methods were developed for classifying sign language gestures: a one-stage classifier and a two-stage classifier. The one-stage classifier combines the features of the hand configurations of some key frames with the features of the hand trajectory and classifies the combined feature vector. The two-stage classifier decouples the gesture at the component level and classifies the hand configurations of key frames and the hand trajectory separately. The results of these classifications are combined into one output classification using a fuzzy rule set. In addition, a moment based method from the literature was described and used for comparative testing. The classification methods were tested using ten ASL gestures. The vocabularies were arranged according to a new dynamic taxonomy of gesture constructs. Gestures were captured using a PrimeSense 3D sensor and were executed by 11 subjects. The average classification accuracy of the one-stage classifier was the highest (95.45%), that of the moment based classifier was the lowest (20.91%), and the two-stage classifier obtained an average classification accuracy of 61.09%. In the future, we intend to add face and body part detection, which will allow the integration of non-manual components into the gesture definitions.

Acknowledgements

This research was partially supported by Deutsche Telekom AG. We acknowledge the help of Dr. Darya Frolova, Noam Geffen, Tom Godo, Eti Almog, Omri Mendels, Merav Shmueli, and Shani Talmor.

References

[1] S. C. W. Ong, S. Ranganath, "Automatic Sign Language Analysis: A Survey and the Future beyond Lexical Meaning", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 27, No. 6.
[2] T. Agrawal, S. Chaudhuri, "Gesture Recognition Using Position and Appearance Features", ICIP 2003 Conf. Proceedings, Vol. 3.
[3] D. Frolova, H. Stern, S.
Berman, "Most probable longest common subsequence for recognition of gesture character input", IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Vol. 42, Issue 3, 2013.
[4] L. Havasi, H. M. Szabo, "A Motion Capture System for Sign Language Synthesis: Overview and Related Issues", in Proceedings of Computer as a Tool, EUROCON 2005, Vol. 1.
[5] S. Mitra, T. Acharya, "Gesture Recognition: A Survey", IEEE Trans. on Systems, Man, and Cybernetics, Part C: Applications & Reviews, Vol. 37, No. 3, 2007.
[6] O. Rashid, A. Al-Hamadi, B. Michaelis, "A framework for the integration of gesture and posture recognition using HMM and SVM", paper presented at ICIS, Vol. 4, 2009.
[7] U. S. Rokade, D. Doye, M. Kokare, "Hand gesture recognition using object based key frame selection", in Digital Image Processing, 2009.
[8] S. Pheasant, C. M. Haslegrave, Bodyspace: Anthropometry, Ergonomics, and the Design of Work, Boca Raton: CRC Press, 2000.


More information

Gesture Recognition using Temporal Templates with disparity information

Gesture Recognition using Temporal Templates with disparity information 8- MVA7 IAPR Conference on Machine Vision Applications, May 6-8, 7, Tokyo, JAPAN Gesture Recognition using Temporal Templates with disparity information Kazunori Onoguchi and Masaaki Sato Hirosaki University

More information

Fully Automatic Methodology for Human Action Recognition Incorporating Dynamic Information

Fully Automatic Methodology for Human Action Recognition Incorporating Dynamic Information Fully Automatic Methodology for Human Action Recognition Incorporating Dynamic Information Ana González, Marcos Ortega Hortas, and Manuel G. Penedo University of A Coruña, VARPA group, A Coruña 15071,

More information

A Visualization Tool to Improve the Performance of a Classifier Based on Hidden Markov Models

A Visualization Tool to Improve the Performance of a Classifier Based on Hidden Markov Models A Visualization Tool to Improve the Performance of a Classifier Based on Hidden Markov Models Gleidson Pegoretti da Silva, Masaki Nakagawa Department of Computer and Information Sciences Tokyo University

More information

An Interactive Technique for Robot Control by Using Image Processing Method

An Interactive Technique for Robot Control by Using Image Processing Method An Interactive Technique for Robot Control by Using Image Processing Method Mr. Raskar D. S 1., Prof. Mrs. Belagali P. P 2 1, E&TC Dept. Dr. JJMCOE., Jaysingpur. Maharashtra., India. 2 Associate Prof.

More information

Non-rigid body Object Tracking using Fuzzy Neural System based on Multiple ROIs and Adaptive Motion Frame Method

Non-rigid body Object Tracking using Fuzzy Neural System based on Multiple ROIs and Adaptive Motion Frame Method Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics San Antonio, TX, USA - October 2009 Non-rigid body Object Tracking using Fuzzy Neural System based on Multiple ROIs

More information

MULTI ORIENTATION PERFORMANCE OF FEATURE EXTRACTION FOR HUMAN HEAD RECOGNITION

MULTI ORIENTATION PERFORMANCE OF FEATURE EXTRACTION FOR HUMAN HEAD RECOGNITION MULTI ORIENTATION PERFORMANCE OF FEATURE EXTRACTION FOR HUMAN HEAD RECOGNITION Panca Mudjirahardjo, Rahmadwati, Nanang Sulistiyanto and R. Arief Setyawan Department of Electrical Engineering, Faculty of

More information

Unsupervised Human Members Tracking Based on an Silhouette Detection and Analysis Scheme

Unsupervised Human Members Tracking Based on an Silhouette Detection and Analysis Scheme Unsupervised Human Members Tracking Based on an Silhouette Detection and Analysis Scheme Costas Panagiotakis and Anastasios Doulamis Abstract In this paper, an unsupervised, automatic video human members(human

More information

Dynamic Human Shape Description and Characterization

Dynamic Human Shape Description and Characterization Dynamic Human Shape Description and Characterization Z. Cheng*, S. Mosher, Jeanne Smith H. Cheng, and K. Robinette Infoscitex Corporation, Dayton, Ohio, USA 711 th Human Performance Wing, Air Force Research

More information

AN EXAMINING FACE RECOGNITION BY LOCAL DIRECTIONAL NUMBER PATTERN (Image Processing)

AN EXAMINING FACE RECOGNITION BY LOCAL DIRECTIONAL NUMBER PATTERN (Image Processing) AN EXAMINING FACE RECOGNITION BY LOCAL DIRECTIONAL NUMBER PATTERN (Image Processing) J.Nithya 1, P.Sathyasutha2 1,2 Assistant Professor,Gnanamani College of Engineering, Namakkal, Tamil Nadu, India ABSTRACT

More information

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality

More information

Linear Discriminant Analysis for 3D Face Recognition System

Linear Discriminant Analysis for 3D Face Recognition System Linear Discriminant Analysis for 3D Face Recognition System 3.1 Introduction Face recognition and verification have been at the top of the research agenda of the computer vision community in recent times.

More information

Facial Expression Detection Using Implemented (PCA) Algorithm

Facial Expression Detection Using Implemented (PCA) Algorithm Facial Expression Detection Using Implemented (PCA) Algorithm Dileep Gautam (M.Tech Cse) Iftm University Moradabad Up India Abstract: Facial expression plays very important role in the communication with

More information

Facial expression recognition using shape and texture information

Facial expression recognition using shape and texture information 1 Facial expression recognition using shape and texture information I. Kotsia 1 and I. Pitas 1 Aristotle University of Thessaloniki pitas@aiia.csd.auth.gr Department of Informatics Box 451 54124 Thessaloniki,

More information

Generative and discriminative classification techniques

Generative and discriminative classification techniques Generative and discriminative classification techniques Machine Learning and Category Representation 2014-2015 Jakob Verbeek, November 28, 2014 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.14.15

More information

Feature Point Extraction using 3D Separability Filter for Finger Shape Recognition

Feature Point Extraction using 3D Separability Filter for Finger Shape Recognition Feature Point Extraction using 3D Separability Filter for Finger Shape Recognition Ryoma Yataka, Lincon Sales de Souza, and Kazuhiro Fukui Graduate School of Systems and Information Engineering, University

More information

Human Motion Detection and Tracking for Video Surveillance

Human Motion Detection and Tracking for Video Surveillance Human Motion Detection and Tracking for Video Surveillance Prithviraj Banerjee and Somnath Sengupta Department of Electronics and Electrical Communication Engineering Indian Institute of Technology, Kharagpur,

More information

Fine Classification of Unconstrained Handwritten Persian/Arabic Numerals by Removing Confusion amongst Similar Classes

Fine Classification of Unconstrained Handwritten Persian/Arabic Numerals by Removing Confusion amongst Similar Classes 2009 10th International Conference on Document Analysis and Recognition Fine Classification of Unconstrained Handwritten Persian/Arabic Numerals by Removing Confusion amongst Similar Classes Alireza Alaei

More information

Recognition Rate. 90 S 90 W 90 R Segment Length T

Recognition Rate. 90 S 90 W 90 R Segment Length T Human Action Recognition By Sequence of Movelet Codewords Xiaolin Feng y Pietro Perona yz y California Institute of Technology, 36-93, Pasadena, CA 925, USA z Universit a dipadova, Italy fxlfeng,peronag@vision.caltech.edu

More information

3D Face and Hand Tracking for American Sign Language Recognition

3D Face and Hand Tracking for American Sign Language Recognition 3D Face and Hand Tracking for American Sign Language Recognition NSF-ITR (2004-2008) D. Metaxas, A. Elgammal, V. Pavlovic (Rutgers Univ.) C. Neidle (Boston Univ.) C. Vogler (Gallaudet) The need for automated

More information

Human Robot Interaction using Dynamic Hand Gestures

Human Robot Interaction using Dynamic Hand Gestures Human Robot Interaction using Dynamic Hand Gestures Zuhair Zafar 1, Daniel A. Salazar 1, Salah Al-Darraji 2, Djordje Urukalo 3, Karsten Berns 1, and Aleksandar Rodić 3 1 University of Kaiserslautern, Department

More information

Radially Defined Local Binary Patterns for Hand Gesture Recognition

Radially Defined Local Binary Patterns for Hand Gesture Recognition Radially Defined Local Binary Patterns for Hand Gesture Recognition J. V. Megha 1, J. S. Padmaja 2, D.D. Doye 3 1 SGGS Institute of Engineering and Technology, Nanded, M.S., India, meghavjon@gmail.com

More information

Accelerometer Gesture Recognition

Accelerometer Gesture Recognition Accelerometer Gesture Recognition Michael Xie xie@cs.stanford.edu David Pan napdivad@stanford.edu December 12, 2014 Abstract Our goal is to make gesture-based input for smartphones and smartwatches accurate

More information

Face Recognition by Combining Kernel Associative Memory and Gabor Transforms

Face Recognition by Combining Kernel Associative Memory and Gabor Transforms Face Recognition by Combining Kernel Associative Memory and Gabor Transforms Author Zhang, Bai-ling, Leung, Clement, Gao, Yongsheng Published 2006 Conference Title ICPR2006: 18th International Conference

More information

Facial Expression Recognition using Principal Component Analysis with Singular Value Decomposition

Facial Expression Recognition using Principal Component Analysis with Singular Value Decomposition ISSN: 2321-7782 (Online) Volume 1, Issue 6, November 2013 International Journal of Advance Research in Computer Science and Management Studies Research Paper Available online at: www.ijarcsms.com Facial

More information

Recognition of Human Body Movements Trajectory Based on the Three-dimensional Depth Data

Recognition of Human Body Movements Trajectory Based on the Three-dimensional Depth Data Preprints of the 19th World Congress The International Federation of Automatic Control Recognition of Human Body s Trajectory Based on the Three-dimensional Depth Data Zheng Chang Qing Shen Xiaojuan Ban

More information

Mobile Human Detection Systems based on Sliding Windows Approach-A Review

Mobile Human Detection Systems based on Sliding Windows Approach-A Review Mobile Human Detection Systems based on Sliding Windows Approach-A Review Seminar: Mobile Human detection systems Njieutcheu Tassi cedrique Rovile Department of Computer Engineering University of Heidelberg

More information

Recognizing Handwritten Digits Using the LLE Algorithm with Back Propagation

Recognizing Handwritten Digits Using the LLE Algorithm with Back Propagation Recognizing Handwritten Digits Using the LLE Algorithm with Back Propagation Lori Cillo, Attebury Honors Program Dr. Rajan Alex, Mentor West Texas A&M University Canyon, Texas 1 ABSTRACT. This work is

More information

2 OVERVIEW OF RELATED WORK

2 OVERVIEW OF RELATED WORK Utsushi SAKAI Jun OGATA This paper presents a pedestrian detection system based on the fusion of sensors for LIDAR and convolutional neural network based image classification. By using LIDAR our method

More information

FACIAL MOVEMENT BASED PERSON AUTHENTICATION

FACIAL MOVEMENT BASED PERSON AUTHENTICATION FACIAL MOVEMENT BASED PERSON AUTHENTICATION Pengqing Xie Yang Liu (Presenter) Yong Guan Iowa State University Department of Electrical and Computer Engineering OUTLINE Introduction Literature Review Methodology

More information

EE368 Project Report CD Cover Recognition Using Modified SIFT Algorithm

EE368 Project Report CD Cover Recognition Using Modified SIFT Algorithm EE368 Project Report CD Cover Recognition Using Modified SIFT Algorithm Group 1: Mina A. Makar Stanford University mamakar@stanford.edu Abstract In this report, we investigate the application of the Scale-Invariant

More information

Programming-By-Example Gesture Recognition Kevin Gabayan, Steven Lansel December 15, 2006

Programming-By-Example Gesture Recognition Kevin Gabayan, Steven Lansel December 15, 2006 Programming-By-Example Gesture Recognition Kevin Gabayan, Steven Lansel December 15, 6 Abstract Machine learning and hardware improvements to a programming-by-example rapid prototyping system are proposed.

More information

Human Identification at a Distance Using Body Shape Information

Human Identification at a Distance Using Body Shape Information IOP Conference Series: Materials Science and Engineering OPEN ACCESS Human Identification at a Distance Using Body Shape Information To cite this article: N K A M Rashid et al 2013 IOP Conf Ser: Mater

More information

Mouse Pointer Tracking with Eyes

Mouse Pointer Tracking with Eyes Mouse Pointer Tracking with Eyes H. Mhamdi, N. Hamrouni, A. Temimi, and M. Bouhlel Abstract In this article, we expose our research work in Human-machine Interaction. The research consists in manipulating

More information

A Framework for Efficient Fingerprint Identification using a Minutiae Tree

A Framework for Efficient Fingerprint Identification using a Minutiae Tree A Framework for Efficient Fingerprint Identification using a Minutiae Tree Praveer Mansukhani February 22, 2008 Problem Statement Developing a real-time scalable minutiae-based indexing system using a

More information

CHAPTER 4 SEMANTIC REGION-BASED IMAGE RETRIEVAL (SRBIR)

CHAPTER 4 SEMANTIC REGION-BASED IMAGE RETRIEVAL (SRBIR) 63 CHAPTER 4 SEMANTIC REGION-BASED IMAGE RETRIEVAL (SRBIR) 4.1 INTRODUCTION The Semantic Region Based Image Retrieval (SRBIR) system automatically segments the dominant foreground region and retrieves

More information

Scene Text Detection Using Machine Learning Classifiers

Scene Text Detection Using Machine Learning Classifiers 601 Scene Text Detection Using Machine Learning Classifiers Nafla C.N. 1, Sneha K. 2, Divya K.P. 3 1 (Department of CSE, RCET, Akkikkvu, Thrissur) 2 (Department of CSE, RCET, Akkikkvu, Thrissur) 3 (Department

More information

Texture Segmentation by Windowed Projection

Texture Segmentation by Windowed Projection Texture Segmentation by Windowed Projection 1, 2 Fan-Chen Tseng, 2 Ching-Chi Hsu, 2 Chiou-Shann Fuh 1 Department of Electronic Engineering National I-Lan Institute of Technology e-mail : fctseng@ccmail.ilantech.edu.tw

More information

A new approach to reference point location in fingerprint recognition

A new approach to reference point location in fingerprint recognition A new approach to reference point location in fingerprint recognition Piotr Porwik a) and Lukasz Wieclaw b) Institute of Informatics, Silesian University 41 200 Sosnowiec ul. Bedzinska 39, Poland a) porwik@us.edu.pl

More information

IMPLEMENTATION OF SPATIAL FUZZY CLUSTERING IN DETECTING LIP ON COLOR IMAGES

IMPLEMENTATION OF SPATIAL FUZZY CLUSTERING IN DETECTING LIP ON COLOR IMAGES IMPLEMENTATION OF SPATIAL FUZZY CLUSTERING IN DETECTING LIP ON COLOR IMAGES Agus Zainal Arifin 1, Adhatus Sholichah 2, Anny Yuniarti 3, Dini Adni Navastara 4, Wijayanti Nurul Khotimah 5 1,2,3,4,5 Department

More information

Model-based segmentation and recognition from range data

Model-based segmentation and recognition from range data Model-based segmentation and recognition from range data Jan Boehm Institute for Photogrammetry Universität Stuttgart Germany Keywords: range image, segmentation, object recognition, CAD ABSTRACT This

More information

Bag Classification Using Support Vector Machines

Bag Classification Using Support Vector Machines Bag Classification Using Support Vector Machines Uri Kartoun, Helman Stern, Yael Edan {kartoun helman yael}@bgu.ac.il Department of Industrial Engineering and Management, Ben-Gurion University of the Negev,

More information

Film Line scratch Detection using Neural Network and Morphological Filter

Film Line scratch Detection using Neural Network and Morphological Filter Film Line scratch Detection using Neural Network and Morphological Filter Kyung-tai Kim and Eun Yi Kim Dept. of advanced technology fusion, Konkuk Univ. Korea {kkt34, eykim}@konkuk.ac.kr Abstract This

More information

Expanding gait identification methods from straight to curved trajectories

Expanding gait identification methods from straight to curved trajectories Expanding gait identification methods from straight to curved trajectories Yumi Iwashita, Ryo Kurazume Kyushu University 744 Motooka Nishi-ku Fukuoka, Japan yumi@ieee.org Abstract Conventional methods

More information

Generic Face Alignment Using an Improved Active Shape Model

Generic Face Alignment Using an Improved Active Shape Model Generic Face Alignment Using an Improved Active Shape Model Liting Wang, Xiaoqing Ding, Chi Fang Electronic Engineering Department, Tsinghua University, Beijing, China {wanglt, dxq, fangchi} @ocrserv.ee.tsinghua.edu.cn

More information

Applying Synthetic Images to Learning Grasping Orientation from Single Monocular Images

Applying Synthetic Images to Learning Grasping Orientation from Single Monocular Images Applying Synthetic Images to Learning Grasping Orientation from Single Monocular Images 1 Introduction - Steve Chuang and Eric Shan - Determining object orientation in images is a well-established topic

More information

Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong)

Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong) Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong) References: [1] http://homepages.inf.ed.ac.uk/rbf/hipr2/index.htm [2] http://www.cs.wisc.edu/~dyer/cs540/notes/vision.html

More information

Linear combinations of simple classifiers for the PASCAL challenge

Linear combinations of simple classifiers for the PASCAL challenge Linear combinations of simple classifiers for the PASCAL challenge Nik A. Melchior and David Lee 16 721 Advanced Perception The Robotics Institute Carnegie Mellon University Email: melchior@cmu.edu, dlee1@andrew.cmu.edu

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Graph-based High Level Motion Segmentation using Normalized Cuts

Graph-based High Level Motion Segmentation using Normalized Cuts Graph-based High Level Motion Segmentation using Normalized Cuts Sungju Yun, Anjin Park and Keechul Jung Abstract Motion capture devices have been utilized in producing several contents, such as movies

More information

3D object recognition used by team robotto

3D object recognition used by team robotto 3D object recognition used by team robotto Workshop Juliane Hoebel February 1, 2016 Faculty of Computer Science, Otto-von-Guericke University Magdeburg Content 1. Introduction 2. Depth sensor 3. 3D object

More information

CS 4758: Automated Semantic Mapping of Environment

CS 4758: Automated Semantic Mapping of Environment CS 4758: Automated Semantic Mapping of Environment Dongsu Lee, ECE, M.Eng., dl624@cornell.edu Aperahama Parangi, CS, 2013, alp75@cornell.edu Abstract The purpose of this project is to program an Erratic

More information

Face Detection Using Color Based Segmentation and Morphological Processing A Case Study

Face Detection Using Color Based Segmentation and Morphological Processing A Case Study Face Detection Using Color Based Segmentation and Morphological Processing A Case Study Dr. Arti Khaparde*, Sowmya Reddy.Y Swetha Ravipudi *Professor of ECE, Bharath Institute of Science and Technology

More information

Texture Image Segmentation using FCM

Texture Image Segmentation using FCM Proceedings of 2012 4th International Conference on Machine Learning and Computing IPCSIT vol. 25 (2012) (2012) IACSIT Press, Singapore Texture Image Segmentation using FCM Kanchan S. Deshmukh + M.G.M

More information

AAM Based Facial Feature Tracking with Kinect

AAM Based Facial Feature Tracking with Kinect BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 15, No 3 Sofia 2015 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.1515/cait-2015-0046 AAM Based Facial Feature Tracking

More information

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation.

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation. Equation to LaTeX Abhinav Rastogi, Sevy Harris {arastogi,sharris5}@stanford.edu I. Introduction Copying equations from a pdf file to a LaTeX document can be time consuming because there is no easy way

More information

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009 Learning and Inferring Depth from Monocular Images Jiyan Pan April 1, 2009 Traditional ways of inferring depth Binocular disparity Structure from motion Defocus Given a single monocular image, how to infer

More information

Sign Language Recognition using Dynamic Time Warping and Hand Shape Distance Based on Histogram of Oriented Gradient Features

Sign Language Recognition using Dynamic Time Warping and Hand Shape Distance Based on Histogram of Oriented Gradient Features Sign Language Recognition using Dynamic Time Warping and Hand Shape Distance Based on Histogram of Oriented Gradient Features Pat Jangyodsuk Department of Computer Science and Engineering The University

More information

A System for Joining and Recognition of Broken Bangla Numerals for Indian Postal Automation

A System for Joining and Recognition of Broken Bangla Numerals for Indian Postal Automation A System for Joining and Recognition of Broken Bangla Numerals for Indian Postal Automation K. Roy, U. Pal and B. B. Chaudhuri CVPR Unit; Indian Statistical Institute, Kolkata-108; India umapada@isical.ac.in

More information

Face Alignment Under Various Poses and Expressions

Face Alignment Under Various Poses and Expressions Face Alignment Under Various Poses and Expressions Shengjun Xin and Haizhou Ai Computer Science and Technology Department, Tsinghua University, Beijing 100084, China ahz@mail.tsinghua.edu.cn Abstract.

More information

Stereo Image Rectification for Simple Panoramic Image Generation

Stereo Image Rectification for Simple Panoramic Image Generation Stereo Image Rectification for Simple Panoramic Image Generation Yun-Suk Kang and Yo-Sung Ho Gwangju Institute of Science and Technology (GIST) 261 Cheomdan-gwagiro, Buk-gu, Gwangju 500-712 Korea Email:{yunsuk,

More information

Semi-Supervised PCA-based Face Recognition Using Self-Training

Semi-Supervised PCA-based Face Recognition Using Self-Training Semi-Supervised PCA-based Face Recognition Using Self-Training Fabio Roli and Gian Luca Marcialis Dept. of Electrical and Electronic Engineering, University of Cagliari Piazza d Armi, 09123 Cagliari, Italy

More information

Tracking Trajectories of Migrating Birds Around a Skyscraper

Tracking Trajectories of Migrating Birds Around a Skyscraper Tracking Trajectories of Migrating Birds Around a Skyscraper Brian Crombie Matt Zivney Project Advisors Dr. Huggins Dr. Stewart Abstract In this project, the trajectories of birds are tracked around tall

More information

Task analysis based on observing hands and objects by vision

Task analysis based on observing hands and objects by vision Task analysis based on observing hands and objects by vision Yoshihiro SATO Keni Bernardin Hiroshi KIMURA Katsushi IKEUCHI Univ. of Electro-Communications Univ. of Karlsruhe Univ. of Tokyo Abstract In

More information

Robotics Programming Laboratory

Robotics Programming Laboratory Chair of Software Engineering Robotics Programming Laboratory Bertrand Meyer Jiwon Shin Lecture 8: Robot Perception Perception http://pascallin.ecs.soton.ac.uk/challenges/voc/databases.html#caltech car

More information

Cursive Handwriting Recognition System Using Feature Extraction and Artificial Neural Network

Cursive Handwriting Recognition System Using Feature Extraction and Artificial Neural Network Cursive Handwriting Recognition System Using Feature Extraction and Artificial Neural Network Utkarsh Dwivedi 1, Pranjal Rajput 2, Manish Kumar Sharma 3 1UG Scholar, Dept. of CSE, GCET, Greater Noida,

More information

Robust PDF Table Locator

Robust PDF Table Locator Robust PDF Table Locator December 17, 2016 1 Introduction Data scientists rely on an abundance of tabular data stored in easy-to-machine-read formats like.csv files. Unfortunately, most government records

More information

Performance Evaluation Metrics and Statistics for Positional Tracker Evaluation

Performance Evaluation Metrics and Statistics for Positional Tracker Evaluation Performance Evaluation Metrics and Statistics for Positional Tracker Evaluation Chris J. Needham and Roger D. Boyle School of Computing, The University of Leeds, Leeds, LS2 9JT, UK {chrisn,roger}@comp.leeds.ac.uk

More information

Comparative Study of Hand Gesture Recognition Techniques

Comparative Study of Hand Gesture Recognition Techniques Reg. No.:20140316 DOI:V2I4P16 Comparative Study of Hand Gesture Recognition Techniques Ann Abraham Babu Information Technology Department University of Mumbai Pillai Institute of Information Technology

More information

Gesture Identification Based Remote Controlled Robot

Gesture Identification Based Remote Controlled Robot Gesture Identification Based Remote Controlled Robot Manjusha Dhabale 1 and Abhijit Kamune 2 Assistant Professor, Department of Computer Science and Engineering, Ramdeobaba College of Engineering, Nagpur,

More information

IMPROVED FACE RECOGNITION USING ICP TECHNIQUES INCAMERA SURVEILLANCE SYSTEMS. Kirthiga, M.E-Communication system, PREC, Thanjavur

IMPROVED FACE RECOGNITION USING ICP TECHNIQUES INCAMERA SURVEILLANCE SYSTEMS. Kirthiga, M.E-Communication system, PREC, Thanjavur IMPROVED FACE RECOGNITION USING ICP TECHNIQUES INCAMERA SURVEILLANCE SYSTEMS Kirthiga, M.E-Communication system, PREC, Thanjavur R.Kannan,Assistant professor,prec Abstract: Face Recognition is important

More information

Hand Gesture Recognition using Depth Data

Hand Gesture Recognition using Depth Data Hand Gesture Recognition using Depth Data Xia Liu Ohio State University Columbus OH 43210 Kikuo Fujimura Honda Research Institute USA Mountain View CA 94041 Abstract A method is presented for recognizing

More information

Human Action Recognition Using Silhouette Histogram

Human Action Recognition Using Silhouette Histogram Human Action Recognition Using Silhouette Histogram Chaur-Heh Hsieh, *Ping S. Huang, and Ming-Da Tang Department of Computer and Communication Engineering Ming Chuan University Taoyuan 333, Taiwan, ROC

More information

A Mouse-Like Hands-Free Gesture Technique for Two-Dimensional Pointing

A Mouse-Like Hands-Free Gesture Technique for Two-Dimensional Pointing A Mouse-Like Hands-Free Gesture Technique for Two-Dimensional Pointing Yusaku Yokouchi and Hiroshi Hosobe Faculty of Computer and Information Sciences, Hosei University 3-7-2 Kajino-cho, Koganei-shi, Tokyo

More information

Motion. 1 Introduction. 2 Optical Flow. Sohaib A Khan. 2.1 Brightness Constancy Equation

Motion. 1 Introduction. 2 Optical Flow. Sohaib A Khan. 2.1 Brightness Constancy Equation Motion Sohaib A Khan 1 Introduction So far, we have dealing with single images of a static scene taken by a fixed camera. Here we will deal with sequence of images taken at different time intervals. Motion

More information

Arm-hand Action Recognition Based on 3D Skeleton Joints Ling RUI 1, Shi-wei MA 1,a, *, Jia-rui WEN 1 and Li-na LIU 1,2

Arm-hand Action Recognition Based on 3D Skeleton Joints Ling RUI 1, Shi-wei MA 1,a, *, Jia-rui WEN 1 and Li-na LIU 1,2 1 International Conference on Control and Automation (ICCA 1) ISBN: 97-1-9-39- Arm-hand Action Recognition Based on 3D Skeleton Joints Ling RUI 1, Shi-wei MA 1,a, *, Jia-rui WEN 1 and Li-na LIU 1, 1 School

More information

Defect Detection of Regular Patterned Fabric by Spectral Estimation Technique and Rough Set Classifier

Defect Detection of Regular Patterned Fabric by Spectral Estimation Technique and Rough Set Classifier Defect Detection of Regular Patterned Fabric by Spectral Estimation Technique and Rough Set Classifier Mr..Sudarshan Deshmukh. Department of E&TC Siddhant College of Engg, Sudumbare, Pune Prof. S. S. Raut.

More information

Face Recognition Using Long Haar-like Filters

Face Recognition Using Long Haar-like Filters Face Recognition Using Long Haar-like Filters Y. Higashijima 1, S. Takano 1, and K. Niijima 1 1 Department of Informatics, Kyushu University, Japan. Email: {y-higasi, takano, niijima}@i.kyushu-u.ac.jp

More information

A Real-Time Hand Gesture Recognition for Dynamic Applications

A Real-Time Hand Gesture Recognition for Dynamic Applications e-issn 2455 1392 Volume 2 Issue 2, February 2016 pp. 41-45 Scientific Journal Impact Factor : 3.468 http://www.ijcter.com A Real-Time Hand Gesture Recognition for Dynamic Applications Aishwarya Mandlik

More information

WP1: Video Data Analysis

WP1: Video Data Analysis Leading : UNICT Participant: UEDIN Fish4Knowledge Final Review Meeting - November 29, 2013 - Luxembourg Workpackage 1 Objectives Fish Detection: Background/foreground modeling algorithms able to deal with

More information

Tracking of Human Body using Multiple Predictors

Tracking of Human Body using Multiple Predictors Tracking of Human Body using Multiple Predictors Rui M Jesus 1, Arnaldo J Abrantes 1, and Jorge S Marques 2 1 Instituto Superior de Engenharia de Lisboa, Postfach 351-218317001, Rua Conselheiro Emído Navarro,

More information

/10/$ IEEE 4048

/10/$ IEEE 4048 21 IEEE International onference on Robotics and Automation Anchorage onvention District May 3-8, 21, Anchorage, Alaska, USA 978-1-4244-54-4/1/$26. 21 IEEE 448 Fig. 2: Example keyframes of the teabox object.

More information

A Hand Gesture Recognition Method Based on Multi-Feature Fusion and Template Matching

A Hand Gesture Recognition Method Based on Multi-Feature Fusion and Template Matching Available online at www.sciencedirect.com Procedia Engineering 9 (01) 1678 1684 01 International Workshop on Information and Electronics Engineering (IWIEE) A Hand Gesture Recognition Method Based on Multi-Feature

More information

CS4733 Class Notes, Computer Vision

CS4733 Class Notes, Computer Vision CS4733 Class Notes, Computer Vision Sources for online computer vision tutorials and demos - http://www.dai.ed.ac.uk/hipr and Computer Vision resources online - http://www.dai.ed.ac.uk/cvonline Vision

More information

Product information. Hi-Tech Electronics Pte Ltd

Product information. Hi-Tech Electronics Pte Ltd Product information Introduction TEMA Motion is the world leading software for advanced motion analysis. Starting with digital image sequences the operator uses TEMA Motion to track objects in images,

More information

EE 589 INTRODUCTION TO ARTIFICIAL NETWORK REPORT OF THE TERM PROJECT REAL TIME ODOR RECOGNATION SYSTEM FATMA ÖZYURT SANCAR

EE 589 INTRODUCTION TO ARTIFICIAL NETWORK REPORT OF THE TERM PROJECT REAL TIME ODOR RECOGNATION SYSTEM FATMA ÖZYURT SANCAR EE 589 INTRODUCTION TO ARTIFICIAL NETWORK REPORT OF THE TERM PROJECT REAL TIME ODOR RECOGNATION SYSTEM FATMA ÖZYURT SANCAR 1.Introductıon. 2.Multi Layer Perception.. 3.Fuzzy C-Means Clustering.. 4.Real

More information

OCR For Handwritten Marathi Script

OCR For Handwritten Marathi Script International Journal of Scientific & Engineering Research Volume 3, Issue 8, August-2012 1 OCR For Handwritten Marathi Script Mrs.Vinaya. S. Tapkir 1, Mrs.Sushma.D.Shelke 2 1 Maharashtra Academy Of Engineering,

More information

Seminar Heidelberg University

Seminar Heidelberg University Seminar Heidelberg University Mobile Human Detection Systems Pedestrian Detection by Stereo Vision on Mobile Robots Philip Mayer Matrikelnummer: 3300646 Motivation Fig.1: Pedestrians Within Bounding Box

More information

[10] Industrial DataMatrix barcodes recognition with a random tilt and rotating the camera

[10] Industrial DataMatrix barcodes recognition with a random tilt and rotating the camera [10] Industrial DataMatrix barcodes recognition with a random tilt and rotating the camera Image processing, pattern recognition 865 Kruchinin A.Yu. Orenburg State University IntBuSoft Ltd Abstract The

More information