Framework for a Portable Gesture Interface
Sébastien Wagner*, Bram Alefs and Cristina Picus
Advanced Computer Vision GmbH - ACV, Wien, Austria
* sebastien.wagner@acv.ac.at

Abstract

Gesture recognition is a valuable extension for interaction with portable devices. This paper presents a framework for interaction by hand gestures using a head-mounted camera system. The framework includes automatic activation using AdaBoost hand detection, tracking of chromatic and luminance color modes based on adaptive mean shift, and pose recognition using template matching of the polar histogram. The system achieves a 95% detection rate and 96% classification accuracy at real-time processing, for a non-static camera setup and cluttered background.

1. Introduction

This paper describes a framework for visual communication based on recognition of static hand poses on portable devices such as PDAs or tablet PCs. The motivation for this work is the development of a multimodal interface in the context of the SNOW (Services for Nomadic Worker) project. The interface is meant to help aircraft workers in the acquisition of maintenance procedures. The nomadic worker is able to switch among pen, voice and gesture input modalities.

This paper presents the framework for gesture recognition, which consists of three interdependent modules: hand detection, color-space segmentation and pose recognition. Hand detection is based on a cascaded AdaBoost detector trained on a specific initialization gesture. Segmentation is performed using an adaptive 3D model of the hand color in the YCbCr color space. Classification is done by matching polar histograms of the hand silhouette.

The framework deals with the several challenges posed by the use of a portable device. The camera system is not fixed and consists of a single head-mounted camera. Therefore, illumination varies, the background can be cluttered, and background motion is added to the ego-motion of the camera.
The visual input may be sent from the portable system to a server, which performs the recognition tasks. Since the bandwidth of data transmission is limited, processing should be possible at a low frame rate, for which spatial tracking is not feasible. Furthermore, the system has to be robust to variations of hand shape between different users and between hand poses, and to the case where the user wears gloves.

The paper consists of four sections. Section 2 discusses the state of the art in hand gesture recognition. Section 3 presents the framework, which consists of three modules: one for hand detection, one for tracking based on color cues and one for pose recognition. Section 4 presents results for each of the modules.

2. State of the art

Recent literature is dedicated to visual interfaces on portable systems. These works specifically address the requirements of mobile interfaces, including real-time performance. In [7] the ego-motion of the camera is taken into account for the state-space prediction in the tracking algorithm. A foreground/background color model adaptation algorithm is used by [11]. Kölsch et al. [9] describe a combination of methods achieving real-time performance and robustness against the conditions of mobile interfaces. Detection is achieved using an AdaBoost classifier trained on intensity features for a specific gesture. A combination of color and spatial cues is used for hand segmentation and tracking.

Other applications in the literature deal with a static camera position for non-mobile platforms, which often provides a well controlled environment. Hand segmentation is based on gray-level or color thresholding [15] or on an a priori skin color model [3][16]. These methods rely on the presence of a homogeneous background and uniform illumination. Alternative approaches to hand segmentation and tracking include the use of frame-to-frame motion cues [5][8], local grouping of optical flow patterns [6] and accumulation of motion history gradients [1].
These methods are applied to cases for which the background is steady or its motion can be modeled [7]. Model-based approaches are more robust to background motion than motion-based approaches. On the other hand, they can be computationally expensive, since the hand is a non-rigid object with at least 25 degrees of freedom [4]. One approach in the literature uses detection of blob and ridge features to model the hand palm and fingers [4]. The method requires sufficient image contrast, limited background clutter and a sufficiently high frame rate. Appearance-based approaches classify hand poses using features such as edge and color [17], shape
context [14][19] and eigenimages [13]. Several classifiers are combined in a decision tree in order to improve classification performance [14][17].

3. Method

The framework consists of three modules. Image acquisition is done by a head-mounted camera. For this a regular webcam is used, with an image size of 320x240 pixels and a limited dynamic range. In case of a wireless connection for distributed processing, the available frame rate can be about 3 fps. Figure 1 presents the head-mounted camera geometry and an example of image data recorded in the worker environment.

Figure 1. Camera geometry and camera view in worker environment

The first module detects the hand if it is within the field of view. It uses an AdaBoost cascaded classifier based on edge features, for one specific gesture (open hand). Detection of the gesture activates the gesture recognition interface. The detection is color independent and is used to generate a model for the hand color. The second module tracks the hand based on color cues, independent of the type of gesture. It uses adaptive mean shift to determine the color modes in YCbCr space. The third module classifies the hand pose and verifies the detection results using template matching of the polar histogram of the hand silhouette.

Figure 2. Framework workflow: image acquisition and hand detection while recognition mode is off; color initialization, color tracking and pose recognition while recognition mode is on

The framework workflow is outlined in figure 2: at first recognition is off and the hand detection module is active. If the hand is detected, the hand color model is initialized and the recognition mode is activated. For the following frames the color model is updated every time a hand pose is recognized.

3.1 Hand detection

The aim of the first module is to detect a specific hand gesture without constraints on background and illumination.
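The activation workflow just outlined can be sketched as a small state machine. This is an illustrative Python sketch, not the authors' implementation; the four module callables (`detect_hand`, `init_color_model`, `track_color`, `recognize_pose`) are hypothetical stand-ins for the modules described above.

```python
# Sketch of the framework workflow: detection activates recognition mode,
# and the color model is updated only when a pose is recognized.
class GestureInterface:
    def __init__(self, detect_hand, init_color_model, track_color, recognize_pose):
        self.detect_hand = detect_hand          # AdaBoost detector (module 1)
        self.init_color_model = init_color_model
        self.track_color = track_color          # color tracker (module 2)
        self.recognize_pose = recognize_pose    # pose classifier (module 3)
        self.recognition_on = False
        self.color_model = None

    def process_frame(self, frame):
        """Return the recognized pose for this frame, or None."""
        if not self.recognition_on:
            roi = self.detect_hand(frame)
            if roi is None:
                return None                     # stay in detection mode
            self.color_model = self.init_color_model(frame, roi)
            self.recognition_on = True
        mask = self.track_color(frame, self.color_model)
        pose, new_model = self.recognize_pose(mask, self.color_model)
        if pose is not None:
            self.color_model = new_model        # update only on recognition
        return pose
```

In this sketch, a failed detection leaves the system in detection mode, and a frame whose pose is not recognized leaves the color model unchanged, matching the update rule described above.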
In order to avoid unintentional activation, the detector is designed for the open hand gesture in vertical orientation only. We use an AdaBoost classifier cascade, originally proposed for face detection by Viola and Jones [18] and more recently also applied to gesture recognition [10][14]. For initialization the open hand is a suitable gesture, since it shows many edge features [10]. The detector is trained using a set of Haar-like features, each of which forms one weak classifier. For each partial region an 8-bin edge-orientation histogram is determined and compared between the positive and the negative partials [12]. Edge orientations are more discriminative in detecting finger patterns than the usual Haar-like features based on image intensity [18]. Furthermore, comparing normalized histograms is more robust to large variations in intensity. The features used are shown in figure 3. Apart from the usual Haar features, two additional features are designed in order to detect vertical finger patterns. The partials are indicated with black and white regions. During the learning step, features are generated with random position, scale and aspect ratio. Each feature is a possible weak classifier discriminating the hand from the background. A combination of weak classifiers provides a strong classifier, and several strong classifiers are combined as layers of a cascade.

Figure 3. Feature types for hand detection

3.2 Color tracking

The segmentation task is accomplished by use of color cues, and is therefore invariant to shape changes or motion. A model in 3D color space is initialized when the hand is detected and updated over time. The model is generated online, so that the system is not restricted to a specific hand color. The system can also adapt to non-skin colors, e.g. in case the worker wears gloves. The model is updated online to account for changes in illumination. Tracking in the color domain is possible even at low frame rates, for which spatial tracking is not feasible.
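The cascade evaluation of section 3.1 rests on integral images, which allow any rectangle sum, and hence any Haar-like feature response, to be computed in constant time. The sketch below uses a simplified two-rectangle feature rather than the paper's edge-orientation partials, and is a minimal Python illustration, not the trained detector.

```python
# Integral image and a two-rectangle Haar-like feature response.
def integral_image(img):
    """img: list of rows of numbers. ii[y][x] = sum of img[0..y][0..x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, x0, y0, x1, y1):
    """Sum of pixels in the inclusive rectangle (x0,y0)-(x1,y1), O(1)."""
    s = ii[y1][x1]
    if x0 > 0: s -= ii[y1][x0 - 1]
    if y0 > 0: s -= ii[y0 - 1][x1]
    if x0 > 0 and y0 > 0: s += ii[y0 - 1][x0 - 1]
    return s

def haar_vertical_edge(ii, x, y, w, h):
    """Left half minus right half: responds to vertical edges."""
    half = w // 2
    left = rect_sum(ii, x, y, x + half - 1, y + h - 1)
    right = rect_sum(ii, x + half, y, x + 2 * half - 1, y + h - 1)
    return left - right
```

A weak classifier would threshold such a response; AdaBoost then weights and combines many of them into one cascade layer.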
In fact, color varies more slowly than the hand position, and tracking is therefore more robust. The color model consists of a mask in the full 3D color space spanned by the Y, Cb and Cr channels. In order to use mean shift efficiently, the distribution is projected onto multiple 2D planes that together contain the whole chrominance and luminance information.
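The projection step amounts to marginalizing the 3D histogram onto two coordinate planes. A minimal sketch, assuming histograms stored as dictionaries keyed by quantized bin indices (a representation chosen here for brevity, not taken from the paper):

```python
# Marginalize a 3D YCbCr histogram onto the Cb-Cr and Y-Cr planes.
def project_histogram(hist3d):
    """hist3d: {(y, cb, cr): count}. Returns (cbcr, ycr) 2D histograms."""
    cbcr, ycr = {}, {}
    for (y, cb, cr), n in hist3d.items():
        cbcr[(cb, cr)] = cbcr.get((cb, cr), 0) + n   # sum over Y
        ycr[(y, cr)] = ycr.get((y, cr), 0) + n       # sum over Cb
    return cbcr, ycr
```

The two 2D planes share the Cr axis, so together they retain both the chrominance (Cb-Cr) and the luminance (Y-Cr) structure of the 3D distribution.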
Results from the 2D projections are combined for the 3D segmentation. We notice that the chrominance distribution on the hand surface is quite localized, while the luminance values cover an extended range. Figure 4 shows an example of input data and the location of the region of interest, together with the corresponding domains in the two projection planes, Cb-Cr and Y-Cr, for the hand color mode. We include luminance because the position of the hand color in the Cb-Cr projection changes for a different luminance value. In fact, the color is luminance-dependent. This makes luminance a discriminative factor to distinguish between objects close in chrominance. Using the entire 3D chrominance-luminance color space, colors can be modeled even for non-uniformly illuminated objects, such as the hand surface. The method is still useful if the hand is over- or under-saturated, in which case chrominance is not discriminative.

Figure 4. Example of hand color histogram: region of interest; typical domains in Cb-Cr and Y-Cr for the hand color in the ROI

The color model is initialized when the hand is detected. The AdaBoost detector provides a region of interest (ROI) which defines a close region surrounding the hand. The region outside the ROI is regarded as the background region (BGR). Because of the spread in luminance values, the hand usually does not correspond to a sharp peak in the color histogram, but rather to a diffuse mode. This occurs even if the hand is spatially the dominant object in the ROI. Furthermore, sharp-peaked color modes in the ROI usually occur in the surrounding BGR as well and are eliminated after background subtraction. In most cases only one mode is left. Figure 5 shows the histograms of the ROI and BGR, and the 2D Gaussian mode of the distribution obtained after BGR elimination, for a hand in cluttered background.

Figure 5. Normalized histograms: a) background; b) ROI; c) the distribution after background elimination.
The ellipse indicates the 2D Gaussian mode.

The hand color distribution is given by the dominant mode in the resulting 3D color histogram. Let h_ROI be the normalized histogram in the ROI and h_BGR the one in the BGR. We define the color distribution H of the hand in the ROI as:

H = max(0, k * h_ROI - h_BGR)    (1)

where k is a parameter defining the relative occurrence of ROI and BGR pixels. In the limit of small k, only pixels occurring exclusively in the foreground are selected.

The 3D distribution H is projected onto the Y-Cr and Cb-Cr planes. The search for the dominant modes is carried out on the two planes, using an adaptive variant of the mean-shift algorithm [2]. The mode search is performed in 2D since it is computationally less expensive than in 3D. The algorithm is able to determine the number of modes and their locations. Starting from a set of initial local-maximum candidates, the local maximum search is performed by evaluating first- and second-order moments of the distribution in a search window centered on each seed point. At each step until convergence, the window center is moved in the direction of the ascending gradient and the window size is adapted to the length of the principal axis, given by the largest eigenvalue of the covariance matrix. Performance is increased by the use of integral images for the computation of the statistical moments.

The dominant mode in the Cb-Cr plane is selected together with the corresponding Y-Cr mode. For both modes, i = 1, 2, a 2D Gaussian color model is defined:

P_i(x) = (1 / (2 * pi * |C_i|^(1/2))) * exp(-d_i(x) / 2)    (2)

where C_i is the covariance matrix, given by the statistical moments following from the mean-shift search window, x_ci is the position of the center of mass and d_i(x) = (x - x_ci)^T C_i^(-1) (x - x_ci) is the Mahalanobis distance. Corresponding models in the two planes are selected and combined into a 3D mask: a threshold is set for P_i(x), and the 3D mask is given by the intersection of the two 2D masks. Once the color model has been initialized, we track the components in the two planes independently.
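Equations (1) and (2) can be sketched directly. The following Python fragment is an illustration under the reconstruction above (the weighting k and the distance threshold are free parameters here, not values from the paper); thresholding P_i(x) is equivalent to thresholding the Mahalanobis distance, which is what the mask test does.

```python
# Equation (1): per-bin background-subtracted hand color distribution.
def hand_distribution(h_roi, h_bgr, k=1.0):
    """H = max(0, k * h_roi - h_bgr), per bin."""
    bins = set(h_roi) | set(h_bgr)
    return {b: max(0.0, k * h_roi.get(b, 0.0) - h_bgr.get(b, 0.0))
            for b in bins}

# Equation (2): Mahalanobis distance for a 2x2 covariance matrix.
def mahalanobis2(x, center, cov):
    """d(x) = (x - c)^T C^-1 (x - c), with C given row-wise."""
    dx, dy = x[0] - center[0], x[1] - center[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # inverse of [[a, b], [c, d]] is (1/det) * [[d, -b], [-c, a]]
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det

def in_mode(x, center, cov, max_dist=3.0):
    """A pixel belongs to the mode if its Mahalanobis distance is small."""
    return mahalanobis2(x, center, cov) <= max_dist ** 2
```

With a small k, a color bin that appears at all in the background is suppressed to zero, which matches the "exclusively foreground" limit described above.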
The color model is updated from the ROI that results from pose recognition. The model update is done by averaging over the last few seconds, in order to minimize the effect of outlier colors.
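One simple way to realize such an update, a sketch rather than the authors' scheme, is a sliding-window average over the last few recognized frames; at the ~3 fps mentioned in section 3, "a few seconds" corresponds to a window of roughly ten frames.

```python
from collections import deque

# Sliding-window average of a 2D model parameter (e.g. a mode center),
# so that a single outlier segmentation cannot drag the model.
class AveragedModel:
    def __init__(self, window=10):
        self.samples = deque(maxlen=window)   # keeps only the newest frames

    def update(self, center):
        self.samples.append(center)

    def value(self):
        n = len(self.samples)
        return tuple(sum(s[i] for s in self.samples) / n for i in range(2))
```

An exponential moving average would serve the same purpose with constant memory; the explicit window shown here makes the "last few seconds" behavior literal.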
3.3 Pose recognition

This section presents a method for the recognition of 8 different hand poses that can be used for a human-machine interface. They are shown in figure 6 (from left to right): full hand, fist, thumb left, thumb + forefinger, Y-shape, forefinger, thumb + two fingers, four. Samples of these classes are collected from different users and used for training.

Figure 6. The 8 classes

The recognition module consists of three parts: hand silhouette extraction, computation of the polar histogram and template matching for multiple classes. A binary image is computed from the 3D color mask produced by the color tracking module. Regions providing potential hand silhouettes are selected using scale and spatial correlation. For each silhouette, the polar histogram is computed with respect to the center of mass. The best match between the polar histogram and the class templates provides the recognized pose.

For the hand silhouette extraction, we search for potential hand candidates in the binary image. A distance function is defined that provides, for each pixel, the distance to the closest non-segmented point. The hand palm corresponds to a local maximum of the distance function. Using prior knowledge about the possible hand size in the image, candidate local maxima are selected. Further spatial relations are used in the selection, e.g. the position of the center of mass, the standard deviation, etc. Around each selected local maximum a region is defined, with a size depending on the amplitude of the distance function maximum. The candidate silhouette is then scaled to a template size of 60x60 pixels, which is evaluated by the classifier. Figure 7 shows the polar histograms of three class templates: full hand, fist and thumb + forefinger. The polar histograms are computed with respect to the center of mass of the silhouettes.
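The distance function used for palm localization can be computed with a classic two-pass distance transform. This sketch uses the L1 (city-block) metric for brevity; the paper does not specify which metric is used, so treat the choice as an assumption.

```python
# Two-pass L1 distance transform on a binary mask; the maximum of the
# distance function marks the hand-palm candidate described above.
def distance_transform(mask):
    """mask: list of rows of 0/1. Distance of each 1-pixel to the nearest 0."""
    h, w = len(mask), len(mask[0])
    INF = h + w
    d = [[0 if mask[y][x] == 0 else INF for x in range(w)] for y in range(h)]
    for y in range(h):                      # forward pass: top-left neighbors
        for x in range(w):
            if y > 0: d[y][x] = min(d[y][x], d[y - 1][x] + 1)
            if x > 0: d[y][x] = min(d[y][x], d[y][x - 1] + 1)
    for y in range(h - 1, -1, -1):          # backward pass: bottom-right
        for x in range(w - 1, -1, -1):
            if y < h - 1: d[y][x] = min(d[y][x], d[y + 1][x] + 1)
            if x < w - 1: d[y][x] = min(d[y][x], d[y][x + 1] + 1)
    return d

def palm_candidate(mask):
    """Location (x, y) and amplitude of the distance-function maximum."""
    d = distance_transform(mask)
    best = max((v, x, y) for y, row in enumerate(d) for x, v in enumerate(row))
    return (best[1], best[2]), best[0]
```

In the full method the amplitude of this maximum also sets the size of the region cut out around the palm before scaling to the 60x60 template.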
Each pixel of the silhouette falls into one sector of the spider net formed by the bins of the histogram, resulting in an unfolded representation of the hand shape as shown in the lower row of figure 7. The binning is a compromise between coarse binning, for which the classes are not separable, and fine binning, for which the classifier is sensitive to slight shape variations. We take an angular coordinate that consists of 16 bins, covering pi/8 radians each. The radial bins each cover 5 pixels of the 60x60 template. The class templates are compared using the distance between two polar histograms a and b [19]:

D(a, b) = sum_{k=1..K} (a_k - b_k)^2 / (a_k + b_k)    (3)

where k runs over the bins of the histograms. Initial class templates are defined from a training set as the samples within one class that have the minimum average distance to the others. Additional class templates are added for outlier samples that correspond to other possible orientations of the hand. A candidate hand silhouette is classified or discarded using the nearest-neighbor classifier. The confidence value is given by the distance to the closest class template. Candidate silhouettes with a low confidence value are discarded.

Figure 7. Hand silhouette and polar histograms for three examples

Log-polar histogram features have been proposed in [19] for object recognition based on shape context. In the present paper, a fast hand pose recognition method is proposed that uses a single polar histogram to describe the hand silhouette.

Figure 8. Training set examples; average over the entire dataset

Figure 9. AdaBoost first-layer features superposed on the average hand image
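The polar histogram and the chi-square-style distance of equation (3) can be sketched as follows. This is an illustrative Python fragment: the 16 angular bins follow the text, while the radial bin count and maximum radius are assumptions chosen to cover a 60x60 template from its center.

```python
import math

# Polar histogram of a silhouette around its center of mass, plus the
# chi-square-style distance of equation (3) and nearest-neighbor matching.
def polar_histogram(points, n_angle=16, n_radius=12, max_radius=42.0):
    """points: (x, y) silhouette pixels. Returns a flat normalized histogram."""
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    hist = [0.0] * (n_angle * n_radius)
    for x, y in points:
        r = math.hypot(x - cx, y - cy)
        a = math.atan2(y - cy, x - cx) % (2 * math.pi)
        ri = min(int(r / max_radius * n_radius), n_radius - 1)
        ai = min(int(a / (2 * math.pi) * n_angle), n_angle - 1)
        hist[ai * n_radius + ri] += 1.0
    total = sum(hist)
    return [v / total for v in hist]

def chi2_distance(a, b):
    """Equation (3): sum over bins of (a_k - b_k)^2 / (a_k + b_k)."""
    return sum((x - y) ** 2 / (x + y) for x, y in zip(a, b) if x + y > 0)

def classify(hist, templates):
    """Nearest neighbor: (label, distance) of the closest class template."""
    d, label = min((chi2_distance(hist, t), lbl) for lbl, t in templates)
    return label, d
```

The returned distance doubles as the confidence value: a candidate whose closest-template distance is too large would be discarded, as described above.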
4. Results

This section discusses results for each of the three modules. For the hand detection module, we expect the classifier to be more robust if the dataset has low variability. To validate this hypothesis, two datasets are evaluated. One consists of 2000 samples of the hand pose without alignment. The second consists of 5000 samples recorded with a marker. The marker is used to rotate the samples so that the finger patterns are vertically aligned. Figure 8 shows in its different rows examples from each of the datasets, together with the average intensity over the entire set of samples. For the unaligned dataset, the hand is barely recognizable in the average, whereas for the aligned dataset the finger pattern is still visible.

The AdaBoost classifier was trained on samples of each dataset, such that for each layer 99% are correctly detected and 40% false positives are allowed. After six layers a training result of 0.99^6 correct detections and 0.4^6 false positives is achieved. Figure 9 shows the resulting 8 features of the first layer trained on the aligned dataset, superposed on the average hand image. The generated features are mostly positioned on finger patterns. The following layers have a similar number of features.

Figure 10 shows the ROC curves for two different classifiers. Result A shows the classifier trained on aligned data, evaluated on an independent aligned dataset. Result B shows the classifier trained on unaligned data, evaluated on the same aligned dataset. Markers indicate the results after classification for each layer. The highest detection rate is achieved by the classifier trained on aligned data.

Figure 11. Detection results

The color tracker is initialized by the hand detection module. The hand detector defines an ROI and a BGR region that are used for the initialization of the color model. The ROI is shown in the first row of figure 12, which shows an example of detection in cluttered background.
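The layer-rate arithmetic above compounds multiplicatively, since a window must pass every layer of the cascade: 0.99^6 is roughly 94% detections kept, while 0.4^6 is about 0.4% false positives. A one-line sketch:

```python
# How per-layer rates compound through an n-layer cascade.
def cascade_rates(det_per_layer, fp_per_layer, n_layers):
    """A window must pass every layer, so the rates multiply."""
    return det_per_layer ** n_layers, fp_per_layer ** n_layers
```

This is the design trade-off behind cascades: each layer is individually permissive (40% false positives), yet the product over six layers drives the false-positive rate down by more than two orders of magnitude while losing only a few percent of true detections.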
Note that the hand color has a strong Cr component: the Cr channel shows large values for hand and arm pixels. The color probability map is obtained by setting each image pixel to the corresponding value in the color distribution H of section 3.2. The corresponding image mask is evaluated from the binary 3D histogram mask. Note that in the first example, although the desk and hand colors are very similar, the hand is correctly segmented. In some cases hand pixels are discarded and wrongly classified as background pixels. The second row of figure 12 shows an example of the environment inside the airplane, which includes saturated image regions in the background. Once the color model is initialized, it is updated each time the pose is correctly recognized.

Figure 10. ROC curves for the two classifiers (A: trained on aligned data; B: trained on unaligned data), with markers for layers 1-6 (detection rate versus false positive rate)

Figure 12. Examples of color segmentation: input image; Cr channel; (c) image mask from the 3D histogram mask

Figure 13. Correct recognitions (a-h)

Figure 11 shows some detection results for cluttered background in an office environment. False positives are eliminated by spatial integration of dense regions in the response map. The classifier for pose recognition is trained and then evaluated on two independent datasets. 95.9% of the 1600 validation samples are discriminated successfully.
The pose recognition module is tested using an online demonstrator. Figure 13 shows correctly classified samples for challenging cases. The rows show the ROI and the extracted hand silhouette for the different samples. The inclusion of multiple templates per class enables pose recognition for rotated hands, e.g. a, b, f and h. The hand orientation range is naturally limited to approximately ±40 degrees for the head-mounted camera setup. Despite inaccurate color-based segmentation (c, e, g) or important shape variation due to a different hand pose (d), the classifier is able to classify the hand silhouette correctly.

5. Conclusion

This paper presents a framework for gesture recognition in a work-field environment, consisting of three modules for hand detection, color tracking and pose recognition. The accuracy of the AdaBoost detector is improved by training on aligned hands and by including dedicated features for finger patterns. The use of a 3D adaptive color model provides hand segmentation without prior knowledge of the hand color. The inclusion of luminance enables distinction between objects that have similar chrominance values. Eight hand poses are fully discriminated using multiple templates and nearest-neighbor classification based on the polar histogram centered on the silhouette. The framework was implemented in Matlab and evaluated online. Images are processed in real time, at 7 fps for hand detection and 4 fps for color tracking and pose recognition on a Pentium 4 PC.

6. References

[1] Bradski, G. R. and Davis, J. W., Motion Segmentation and Pose Recognition with Motion History Gradients, Machine Vision and Applications, 13, p. 174, 2002.
[2] Bradski, G. R., Computer Vision Face Tracking for Use in a Perceptual User Interface, Intel Technology Journal, Q2, p. 15, 1998.
[3] Brèthes, L., Menezes, P., Lerasle, F. and Hayet, J., Face Tracking and Hand Gesture Recognition for Human-Robot Interaction, in Proc. IEEE Intl. Conf. on Robotics and Automation, vol. 2, p. 1901, 2004.
[4] Bretzner, L., Laptev, I. and Lindeberg, T., Hand Gesture Recognition using Multi-scale Colour Features, Hierarchical Models and Particle Filtering, in Proc. IEEE Intl. Conf. on Automatic Face and Gesture Recognition, 2002.
[5] Cui, Y. and Weng, J. J., Hand Segmentation using Learning-based Prediction and Verification for Hand Sign Recognition, in Proc. IEEE Intl. Conf. on Automatic Face and Gesture Recognition, Killington, Vt., 1996.
[6] Cutler, R. and Turk, M., View-based Interpretation of Real-Time Optical Flow for Gesture Recognition, in Proc. IEEE Intl. Conf. on Automatic Face and Gesture Recognition, 1998.
[7] Dominguez, S. M., Keaton, T. and Sayed, A. H., Robust Finger Tracking for Wearable Computer Interfacing, in ACM PUI 2001, Orlando, FL, 2001.
[8] Freeman, W. T., Anderson, D. B., Beardsley, P. A., Dodge, C. N., Roth, M., Weissman, C. D., Yerazunis, W. S., Kage, H., Kyuma, K., Miyake, Y. and Tanaka, K., Computer Vision for Interactive Computer Graphics, IEEE Computer Graphics and Applications, vol. 18, no. 3, 1998.
[9] Kölsch, M. and Turk, M., Fast 2D Hand Tracking with Flocks of Features and Multi-Cue Integration, in Proc. IEEE Workshop on Real-Time Vision for Human-Computer Interaction (at CVPR), 2004.
[10] Kölsch, M. and Turk, M., Robust Hand Detection, in Proc. IEEE Intl. Conf. on Automatic Face and Gesture Recognition, 2004.
[11] Kurata, T., Okuma, T., Kourogi, M. and Sakaue, K., The Hand Mouse: GMM Hand-color Classification and Mean-Shift Tracking, in Second Intl. Workshop on Recognition, Analysis and Tracking of Faces and Gestures in Real-time Systems, 2001.
[12] Levi, K. and Weiss, Y., Learning Object Detection from a Small Number of Examples: the Importance of Good Features, in Proc. Intl. Conf. on Computer Vision and Pattern Recognition, 2004.
[13] Moghaddam, B. and Pentland, A., Probabilistic Visual Learning for Object Representation, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 19, 1997.
[14] Ong, E.
and Bowden, R., A Boosted Classifier Tree for Hand Shape Detection, in Proc. IEEE Intl. Conf. on Automatic Face and Gesture Recognition, 2004.
[15] Stark, M. and Kohler, M., ZYKLOP: Video-based Gesture Recognition for Human-Computer Interaction, in Modeling - Virtual Worlds - Distributed Graphics, W. D. Fellner, ed.
[16] Starner, T. and Pentland, A. P., A Wearable Computer Based American Sign Language Recognizer, in Proc. Intl. Symposium on Wearable Computing, vol. 1, 1997.
[17] Stenger, B., Thayananthan, A., Torr, P. H. S. and Cipolla, R., Hand Pose Estimation Using Hierarchical Detection, in Proc. Intl. Workshop on Human-Computer Interaction, p. 105, 2004.
[18] Viola, P. and Jones, M. J., Rapid Object Detection using a Boosted Cascade of Simple Features, in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2001.
[19] Zhang, H. and Malik, J., Learning a Discriminative Classifier using Shape Context Distances, in Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, 2003.
FAST HUMAN DETECTION USING TEMPLATE MATCHING FOR GRADIENT IMAGES AND ASC DESCRIPTORS BASED ON SUBTRACTION STEREO Makoto Arie, Masatoshi Shibata, Kenji Terabayashi, Alessandro Moro and Kazunori Umeda Course
More informationTemplate-Based Hand Pose Recognition Using Multiple Cues
Template-Based Hand Pose Recognition Using Multiple Cues Björn Stenger Toshiba Corporate R&D Center, 1, Komukai-Toshiba-cho, Saiwai-ku, Kawasaki 212-8582, Japan bjorn@cantab.net Abstract. This paper presents
More informationDetecting and Segmenting Humans in Crowded Scenes
Detecting and Segmenting Humans in Crowded Scenes Mikel D. Rodriguez University of Central Florida 4000 Central Florida Blvd Orlando, Florida, 32816 mikel@cs.ucf.edu Mubarak Shah University of Central
More informationPerson Detection in Images using HoG + Gentleboost. Rahul Rajan June 1st July 15th CMU Q Robotics Lab
Person Detection in Images using HoG + Gentleboost Rahul Rajan June 1st July 15th CMU Q Robotics Lab 1 Introduction One of the goals of computer vision Object class detection car, animal, humans Human
More informationAutomatic Image Alignment (feature-based)
Automatic Image Alignment (feature-based) Mike Nese with a lot of slides stolen from Steve Seitz and Rick Szeliski 15-463: Computational Photography Alexei Efros, CMU, Fall 2006 Today s lecture Feature
More informationMIME: A Gesture-Driven Computer Interface
MIME: A Gesture-Driven Computer Interface Daniel Heckenberg a and Brian C. Lovell b a Department of Computer Science and Electrical Engineering, The University of Queensland, Brisbane, Australia, 4072
More informationA Two-stage Scheme for Dynamic Hand Gesture Recognition
A Two-stage Scheme for Dynamic Hand Gesture Recognition James P. Mammen, Subhasis Chaudhuri and Tushar Agrawal (james,sc,tush)@ee.iitb.ac.in Department of Electrical Engg. Indian Institute of Technology,
More informationMotion Estimation and Optical Flow Tracking
Image Matching Image Retrieval Object Recognition Motion Estimation and Optical Flow Tracking Example: Mosiacing (Panorama) M. Brown and D. G. Lowe. Recognising Panoramas. ICCV 2003 Example 3D Reconstruction
More informationEdge and corner detection
Edge and corner detection Prof. Stricker Doz. G. Bleser Computer Vision: Object and People Tracking Goals Where is the information in an image? How is an object characterized? How can I find measurements
More informationLocal Features: Detection, Description & Matching
Local Features: Detection, Description & Matching Lecture 08 Computer Vision Material Citations Dr George Stockman Professor Emeritus, Michigan State University Dr David Lowe Professor, University of British
More informationFingertips Tracking based on Gradient Vector
Int. J. Advance Soft Compu. Appl, Vol. 7, No. 3, November 2015 ISSN 2074-8523 Fingertips Tracking based on Gradient Vector Ahmad Yahya Dawod 1, Md Jan Nordin 1, and Junaidi Abdullah 2 1 Pattern Recognition
More informationComputer Vision with MATLAB MATLAB Expo 2012 Steve Kuznicki
Computer Vision with MATLAB MATLAB Expo 2012 Steve Kuznicki 2011 The MathWorks, Inc. 1 Today s Topics Introduction Computer Vision Feature-based registration Automatic image registration Object recognition/rotation
More informationA Hybrid Face Detection System using combination of Appearance-based and Feature-based methods
IJCSNS International Journal of Computer Science and Network Security, VOL.9 No.5, May 2009 181 A Hybrid Face Detection System using combination of Appearance-based and Feature-based methods Zahra Sadri
More informationTextural Features for Image Database Retrieval
Textural Features for Image Database Retrieval Selim Aksoy and Robert M. Haralick Intelligent Systems Laboratory Department of Electrical Engineering University of Washington Seattle, WA 98195-2500 {aksoy,haralick}@@isl.ee.washington.edu
More informationLocal features: detection and description. Local invariant features
Local features: detection and description Local invariant features Detection of interest points Harris corner detection Scale invariant blob detection: LoG Description of local patches SIFT : Histograms
More informationPostprint.
http://www.diva-portal.org Postprint This is the accepted version of a paper presented at 14th International Conference of the Biometrics Special Interest Group, BIOSIG, Darmstadt, Germany, 9-11 September,
More informationBus Detection and recognition for visually impaired people
Bus Detection and recognition for visually impaired people Hangrong Pan, Chucai Yi, and Yingli Tian The City College of New York The Graduate Center The City University of New York MAP4VIP Outline Motivation
More informationSkin and Face Detection
Skin and Face Detection Linda Shapiro EE/CSE 576 1 What s Coming 1. Review of Bakic flesh detector 2. Fleck and Forsyth flesh detector 3. Details of Rowley face detector 4. Review of the basic AdaBoost
More informationScale Invariant Feature Transform
Why do we care about matching features? Scale Invariant Feature Transform Camera calibration Stereo Tracking/SFM Image moiaicing Object/activity Recognition Objection representation and recognition Automatic
More informationIntroduction. Introduction. Related Research. SIFT method. SIFT method. Distinctive Image Features from Scale-Invariant. Scale.
Distinctive Image Features from Scale-Invariant Keypoints David G. Lowe presented by, Sudheendra Invariance Intensity Scale Rotation Affine View point Introduction Introduction SIFT (Scale Invariant Feature
More informationPatch-based Object Recognition. Basic Idea
Patch-based Object Recognition 1! Basic Idea Determine interest points in image Determine local image properties around interest points Use local image properties for object classification Example: Interest
More informationLearning to Recognize Faces in Realistic Conditions
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
More informationAnno accademico 2006/2007. Davide Migliore
Robotica Anno accademico 6/7 Davide Migliore migliore@elet.polimi.it Today What is a feature? Some useful information The world of features: Detectors Edges detection Corners/Points detection Descriptors?!?!?
More informationEECS150 - Digital Design Lecture 14 FIFO 2 and SIFT. Recap and Outline
EECS150 - Digital Design Lecture 14 FIFO 2 and SIFT Oct. 15, 2013 Prof. Ronald Fearing Electrical Engineering and Computer Sciences University of California, Berkeley (slides courtesy of Prof. John Wawrzynek)
More informationObject detection using non-redundant local Binary Patterns
University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2010 Object detection using non-redundant local Binary Patterns Duc Thanh
More informationKey properties of local features
Key properties of local features Locality, robust against occlusions Must be highly distinctive, a good feature should allow for correct object identification with low probability of mismatch Easy to etract
More informationTri-modal Human Body Segmentation
Tri-modal Human Body Segmentation Master of Science Thesis Cristina Palmero Cantariño Advisor: Sergio Escalera Guerrero February 6, 2014 Outline 1 Introduction 2 Tri-modal dataset 3 Proposed baseline 4
More informationA Real-Time Hand Gesture Recognition for Dynamic Applications
e-issn 2455 1392 Volume 2 Issue 2, February 2016 pp. 41-45 Scientific Journal Impact Factor : 3.468 http://www.ijcter.com A Real-Time Hand Gesture Recognition for Dynamic Applications Aishwarya Mandlik
More informationCS4670: Computer Vision
CS4670: Computer Vision Noah Snavely Lecture 6: Feature matching and alignment Szeliski: Chapter 6.1 Reading Last time: Corners and blobs Scale-space blob detector: Example Feature descriptors We know
More informationFace Detection for Skintone Images Using Wavelet and Texture Features
Face Detection for Skintone Images Using Wavelet and Texture Features 1 H.C. Vijay Lakshmi, 2 S. Patil Kulkarni S.J. College of Engineering Mysore, India 1 vijisjce@yahoo.co.in, 2 pk.sudarshan@gmail.com
More informationHand Gesture Recognition. By Jonathan Pritchard
Hand Gesture Recognition By Jonathan Pritchard Outline Motivation Methods o Kinematic Models o Feature Extraction Implemented Algorithm Results Motivation Virtual Reality Manipulation of virtual objects
More informationSIFT - scale-invariant feature transform Konrad Schindler
SIFT - scale-invariant feature transform Konrad Schindler Institute of Geodesy and Photogrammetry Invariant interest points Goal match points between images with very different scale, orientation, projective
More informationWindow based detectors
Window based detectors CS 554 Computer Vision Pinar Duygulu Bilkent University (Source: James Hays, Brown) Today Window-based generic object detection basic pipeline boosting classifiers face detection
More informationTask analysis based on observing hands and objects by vision
Task analysis based on observing hands and objects by vision Yoshihiro SATO Keni Bernardin Hiroshi KIMURA Katsushi IKEUCHI Univ. of Electro-Communications Univ. of Karlsruhe Univ. of Tokyo Abstract In
More informationModel-based segmentation and recognition from range data
Model-based segmentation and recognition from range data Jan Boehm Institute for Photogrammetry Universität Stuttgart Germany Keywords: range image, segmentation, object recognition, CAD ABSTRACT This
More informationCHAPTER 1 Introduction 1. CHAPTER 2 Images, Sampling and Frequency Domain Processing 37
Extended Contents List Preface... xi About the authors... xvii CHAPTER 1 Introduction 1 1.1 Overview... 1 1.2 Human and Computer Vision... 2 1.3 The Human Vision System... 4 1.3.1 The Eye... 5 1.3.2 The
More informationHarder case. Image matching. Even harder case. Harder still? by Diva Sian. by swashford
Image matching Harder case by Diva Sian by Diva Sian by scgbt by swashford Even harder case Harder still? How the Afghan Girl was Identified by Her Iris Patterns Read the story NASA Mars Rover images Answer
More informationHISTOGRAMS OF ORIENTATIO N GRADIENTS
HISTOGRAMS OF ORIENTATIO N GRADIENTS Histograms of Orientation Gradients Objective: object recognition Basic idea Local shape information often well described by the distribution of intensity gradients
More informationTexture. Texture is a description of the spatial arrangement of color or intensities in an image or a selected region of an image.
Texture Texture is a description of the spatial arrangement of color or intensities in an image or a selected region of an image. Structural approach: a set of texels in some regular or repeated pattern
More informationTracking and Recognizing People in Colour using the Earth Mover s Distance
Tracking and Recognizing People in Colour using the Earth Mover s Distance DANIEL WOJTASZEK, ROBERT LAGANIÈRE S.I.T.E. University of Ottawa, Ottawa, Ontario, Canada K1N 6N5 danielw@site.uottawa.ca, laganier@site.uottawa.ca
More informationEye Detection by Haar wavelets and cascaded Support Vector Machine
Eye Detection by Haar wavelets and cascaded Support Vector Machine Vishal Agrawal B.Tech 4th Year Guide: Simant Dubey / Amitabha Mukherjee Dept of Computer Science and Engineering IIT Kanpur - 208 016
More informationRadially Defined Local Binary Patterns for Hand Gesture Recognition
Radially Defined Local Binary Patterns for Hand Gesture Recognition J. V. Megha 1, J. S. Padmaja 2, D.D. Doye 3 1 SGGS Institute of Engineering and Technology, Nanded, M.S., India, meghavjon@gmail.com
More informationArticulated Pose Estimation with Flexible Mixtures-of-Parts
Articulated Pose Estimation with Flexible Mixtures-of-Parts PRESENTATION: JESSE DAVIS CS 3710 VISUAL RECOGNITION Outline Modeling Special Cases Inferences Learning Experiments Problem and Relevance Problem:
More informationDetection of Small-Waving Hand by Distributed Camera System
Detection of Small-Waving Hand by Distributed Camera System Kenji Terabayashi, Hidetsugu Asano, Takeshi Nagayasu, Tatsuya Orimo, Mutsumi Ohta, Takaaki Oiwa, and Kazunori Umeda Department of Mechanical
More informationTracking Using Online Feature Selection and a Local Generative Model
Tracking Using Online Feature Selection and a Local Generative Model Thomas Woodley Bjorn Stenger Roberto Cipolla Dept. of Engineering University of Cambridge {tew32 cipolla}@eng.cam.ac.uk Computer Vision
More informationCAP 5415 Computer Vision Fall 2012
CAP 5415 Computer Vision Fall 01 Dr. Mubarak Shah Univ. of Central Florida Office 47-F HEC Lecture-5 SIFT: David Lowe, UBC SIFT - Key Point Extraction Stands for scale invariant feature transform Patented
More informationImage Based Feature Extraction Technique For Multiple Face Detection and Recognition in Color Images
Image Based Feature Extraction Technique For Multiple Face Detection and Recognition in Color Images 1 Anusha Nandigam, 2 A.N. Lakshmipathi 1 Dept. of CSE, Sir C R Reddy College of Engineering, Eluru,
More informationFACE DETECTION BY HAAR CASCADE CLASSIFIER WITH SIMPLE AND COMPLEX BACKGROUNDS IMAGES USING OPENCV IMPLEMENTATION
FACE DETECTION BY HAAR CASCADE CLASSIFIER WITH SIMPLE AND COMPLEX BACKGROUNDS IMAGES USING OPENCV IMPLEMENTATION Vandna Singh 1, Dr. Vinod Shokeen 2, Bhupendra Singh 3 1 PG Student, Amity School of Engineering
More informationHarder case. Image matching. Even harder case. Harder still? by Diva Sian. by swashford
Image matching Harder case by Diva Sian by Diva Sian by scgbt by swashford Even harder case Harder still? How the Afghan Girl was Identified by Her Iris Patterns Read the story NASA Mars Rover images Answer
More informationChapter 9 Object Tracking an Overview
Chapter 9 Object Tracking an Overview The output of the background subtraction algorithm, described in the previous chapter, is a classification (segmentation) of pixels into foreground pixels (those belonging
More informationProgress Report of Final Year Project
Progress Report of Final Year Project Project Title: Design and implement a face-tracking engine for video William O Grady 08339937 Electronic and Computer Engineering, College of Engineering and Informatics,
More informationCAP 6412 Advanced Computer Vision
CAP 6412 Advanced Computer Vision http://www.cs.ucf.edu/~bgong/cap6412.html Boqing Gong April 21st, 2016 Today Administrivia Free parameters in an approach, model, or algorithm? Egocentric videos by Aisha
More informationAdaptive Feature Extraction with Haar-like Features for Visual Tracking
Adaptive Feature Extraction with Haar-like Features for Visual Tracking Seunghoon Park Adviser : Bohyung Han Pohang University of Science and Technology Department of Computer Science and Engineering pclove1@postech.ac.kr
More informationFeature Detection. Raul Queiroz Feitosa. 3/30/2017 Feature Detection 1
Feature Detection Raul Queiroz Feitosa 3/30/2017 Feature Detection 1 Objetive This chapter discusses the correspondence problem and presents approaches to solve it. 3/30/2017 Feature Detection 2 Outline
More informationLocal features and image matching. Prof. Xin Yang HUST
Local features and image matching Prof. Xin Yang HUST Last time RANSAC for robust geometric transformation estimation Translation, Affine, Homography Image warping Given a 2D transformation T and a source
More informationCEE598 - Visual Sensing for Civil Infrastructure Eng. & Mgmt.
CEE598 - Visual Sensing for Civil Infrastructure Eng. & Mgmt. Section 10 - Detectors part II Descriptors Mani Golparvar-Fard Department of Civil and Environmental Engineering 3129D, Newmark Civil Engineering
More informationCSE 252B: Computer Vision II
CSE 252B: Computer Vision II Lecturer: Serge Belongie Scribes: Jeremy Pollock and Neil Alldrin LECTURE 14 Robust Feature Matching 14.1. Introduction Last lecture we learned how to find interest points
More informationIMPROVING SPATIO-TEMPORAL FEATURE EXTRACTION TECHNIQUES AND THEIR APPLICATIONS IN ACTION CLASSIFICATION. Maral Mesmakhosroshahi, Joohee Kim
IMPROVING SPATIO-TEMPORAL FEATURE EXTRACTION TECHNIQUES AND THEIR APPLICATIONS IN ACTION CLASSIFICATION Maral Mesmakhosroshahi, Joohee Kim Department of Electrical and Computer Engineering Illinois Institute
More informationDisguised Face Identification Based Gabor Feature and SVM Classifier
Disguised Face Identification Based Gabor Feature and SVM Classifier KYEKYUNG KIM, SANGSEUNG KANG, YUN KOO CHUNG and SOOYOUNG CHI Department of Intelligent Cognitive Technology Electronics and Telecommunications
More informationA robust method for automatic player detection in sport videos
A robust method for automatic player detection in sport videos A. Lehuger 1 S. Duffner 1 C. Garcia 1 1 Orange Labs 4, rue du clos courtel, 35512 Cesson-Sévigné {antoine.lehuger, stefan.duffner, christophe.garcia}@orange-ftgroup.com
More informationHand Gesture Recognition using Depth Data
Hand Gesture Recognition using Depth Data Xia Liu Ohio State University Columbus OH 43210 Kikuo Fujimura Honda Research Institute USA Mountain View CA 94041 Abstract A method is presented for recognizing
More informationA Hierarchical Compositional System for Rapid Object Detection
A Hierarchical Compositional System for Rapid Object Detection Long Zhu and Alan Yuille Department of Statistics University of California at Los Angeles Los Angeles, CA 90095 {lzhu,yuille}@stat.ucla.edu
More informationImproved Hand Tracking System Based Robot Using MEMS
Improved Hand Tracking System Based Robot Using MEMS M.Ramamohan Reddy P.G Scholar, Department of Electronics and communication Engineering, Malla Reddy College of engineering. ABSTRACT: This paper presents
More informationFace Detection and Alignment. Prof. Xin Yang HUST
Face Detection and Alignment Prof. Xin Yang HUST Many slides adapted from P. Viola Face detection Face detection Basic idea: slide a window across image and evaluate a face model at every location Challenges
More informationA Robust Hand Gesture Recognition Using Combined Moment Invariants in Hand Shape
, pp.89-94 http://dx.doi.org/10.14257/astl.2016.122.17 A Robust Hand Gesture Recognition Using Combined Moment Invariants in Hand Shape Seungmin Leem 1, Hyeonseok Jeong 1, Yonghwan Lee 2, Sungyoung Kim
More informationFinger Recognition for Hand Pose Determination
Finger Recognition for Hand Pose Determination J.R. Parker and Mark Baumback Digital Media Laboratory University of Calgary Abstract Hand pose and gesture recognition methods have been based on many properties
More informationImproving Latent Fingerprint Matching Performance by Orientation Field Estimation using Localized Dictionaries
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 11, November 2014,
More informationCS221: Final Project Report - Object Recognition and Tracking
CS221: Final Project Report - Object Recognition and Tracking Shaojie Deng Ya Xu March 20th, 2009 We call a sub-window positive if it is one of the objects:,,, or. Otherwise it is called negative or background
More information