Hand Gesture Recognition for Human-Robot Interaction

Jesús Ignacio Bueno Monge


Master of Science Thesis
Stockholm, Sweden 2006

Master's Thesis in Computer Science (20 credits) at the School of Computer Science and Engineering, Royal Institute of Technology, 2006. Supervisor at CSC: Danica Kragic. Examiner: Henrik Christensen. TRITA-CSC-E 2006:070, ISRN-KTH/CSC/E--06/070--SE. Royal Institute of Technology, School of Computer Science and Communication (KTH CSC), Stockholm, Sweden.

Abstract

This Master's thesis deals with the problem of visual gesture recognition. New, more natural forms of interaction for controlling systems have been an intensive field of research, since current methods do not provide much naturalism. The computer vision system presented here recognizes not only a gesture described by the user's movement, but also the posture the hands keep during the interaction. It is a general-purpose system for commanding different applications, and it was successfully tested controlling a robot navigation system, recognizing commands such as start, stop, turn right/left, and speed up/down. The robustness of the process relies on its skin segmenter, which uses a two-class Bayes framework for skin and background classification. The parametric probability distributions are modeled by a Mixture of Gaussians, and the model adapts itself to illumination changes in the scene throughout the execution time. Hand postures are identified using Principal Component Analysis, tracking is performed with Kalman filters, and gestures are recognized using Hidden Markov Models. Experimental evaluation shows how the integration of tracking and an adaptive color model supports the robustness and flexibility of the system when illumination changes occur, providing a good base for subsequent gesture and posture recognition.

Contents

1 Introduction
  1.1 The goal of the thesis
  1.2 Outline
2 System background
  2.1 Segmentation
  2.2 Tracking
  2.3 Posture recognition
  2.4 Gesture recognition
  2.5 Color adaptation
  2.6 System architecture
3 Skin color space
  3.1 Color transformation
  3.2 Skin pixels distribution
4 Segmentation
  4.1 Color modeling
  4.2 Classification
  4.3 Blob searching
  4.4 Morphology
  4.5 Training
5 Tracking
  5.1 Predictions
  5.2 Tracking issues
6 Color adaptation
  6.1 On-line adaptive learning
  6.2 New skin pixels
7 Posture recognition
  7.1 Principal Component Analysis
  7.2 Training
8 Gesture recognition
  8.1 Gesture partitioning
  8.2 Hidden Markov Models
  8.3 Topology of models
  8.4 Recognition
  8.5 Training
9 Experimental evaluation
  9.1 Color model
  9.2 Color adaptation
  9.3 Updating policy
  9.4 Posture recognition
  9.5 Gesture recognition
  9.6 Speed and optimization
10 Conclusion
11 Improvements and future work
12 Other applications for the system
References
A User guide
  A.1 Running the system
  A.2 Training

1 Introduction

Over the next few decades, society will experience significant aging [1]. This increase requires new services for managed care and new facilities for providing assistance to elderly and disabled people. One of the potential solutions that has been strongly advocated is the use of robotic appliances to provide services such as cleaning, getting dressed, or mobility assistance. For such systems to be truly useful to ordinary users, the facilities for interaction and instruction have to be made natural and easy.

A service robot that moves autonomously in a regular, dynamic environment and performs tasks such as cleaning is considered one of the main research goals in the field of robotics. Considering our aging society and the upcoming need for different types of care systems, service robot development is easily motivated. Mobile robots are already capable of moving around autonomously in known or partially unknown environments, fulfilling tasks such as door opening or simple object manipulation. In general, rather than executing tasks in a preprogrammed manner, the aim is to develop a system that allows the robot to acquire knowledge about the environment through interaction with the user and to use that knowledge to plan task execution. Naturally, there is a need for a system that enables the user to demonstrate the task to the robot, which then learns from observation. Such a teaching scenario requires a system that is able not only to recognize and track the user, but also to understand and interpret his/her instructions in a Programming by Demonstration (PbD) manner [2], [3].

Figure 1: Example scenarios for gesture recognition: left) instructing a robot, and right) Programming by Demonstration.

Interaction between human users and robots (HRI) has become an important research problem [4], [5]. Since its beginnings, different devices have been used for that purpose (keyboards, joysticks, tactile screens, game pads...). But none of them resembles the natural way of human communication [6], which considers not only voice but also the gestures made during a conversation. Gestures make our communication richer and provide people with additional information that would not be easy to guess from the words alone (assent, negation, doubt,

pointing...) [7]. Therefore, hand gestures are one of the most appealing instruction methods. Many different kinds of systems have been proposed for gesture recognition [7], [6], [8]. Some of them use special devices, for example gloves, to capture finger and hand movements. This makes the interaction less natural, but makes it easier to design the system. One option that avoids the use of any non-natural device is a computer vision system. The main disadvantage of vision systems is their complexity, not only in producing reliable results, but also because of the execution time the process can take; it should be as close to real time as possible.

HRI and PbD settings consider the use of both posture and gesture based interfaces. A posture is a static configuration of fingers and hands, and it is usually extracted from a single image. For gestures, a sequence of images is used to track human hand motion and compare it to a temporal gesture model. The current system uses both postures and gestures. For some tasks, such as instructing the robot to stop, a defined configuration of the hand is sufficient. In a PbD setting, a pointing posture is also very frequent.

1.1 The goal of the thesis

I was given the task of testing and improving previous work developed at the Centre for Autonomous Systems (CAS), Royal Institute of Technology. That previous work is a Master's thesis which studied the problem of vision based gesture recognition and implemented a complete gesture recognition system [9]. It was intended to be incorporated into the Intelligent Service Robot (ISR) project at the same department. The ISR project was aimed at developing a basic robot architecture and prototype tasks for a domestic or office robot assistant, but this gesture recognition system would be used in a new service robot scenario for natural task instruction.

The requirements for the new system are more wide-ranging than for the previous work. The new system was required to recognize gestures generated by two hands. Apart from gesture recognition, which requires a temporal model, the system was also required to implement hand posture recognition to provide different types of user-robot interaction. In contrast to gesture recognition, posture recognition commonly uses a single image to recognize one hand posture from a discrete set. Since it is based on a single image, posture recognition requires reliable hand segmentation, because a hand posture is determined by the finger configuration: if one of the fingers is lost during the hand segmentation process, this may cause a recognition failure.

Since the system was supposed to use images from a camera, color segmentation had to be implemented to provide a base for extracting the hands from the image. The technique used for that purpose has to work under different illumination conditions and with different skin colors. The system has to work in a domestic environment where illumination conditions can change significantly during user-robot interaction, and it has to be robust enough to continue segmenting the hands reliably while the user, for example, walks near a window or below a lamp.

The previous work had some weaknesses and did not meet the requirements posed for this work. First of all, hand segmentation results were not

suitable for posture recognition. The technique used in the mentioned thesis provided only big, blob-like skin regions which strongly altered the original shape of the hand due to successive morphological processing. As a consequence, the segmented region representing the hand was not detailed enough to represent the clear difference between finger configurations. The second most important weakness of the previous work was that it was not robust to illumination changes in the scene. This is due to the adopted modeling, which uses a simple threshold based color segmentation technique in HSV space. The thresholds were updated after a gesture was recognized, but after some initial evaluation of the system it was clear that such a simple updating policy, using thresholding in the HSV space, was not sufficient to meet the requirements of the new system. On the other hand, after some evaluation it was concluded that the tracking and gesture recognition parts were performing well also in terms of the new system, with the exception that in the previous work only gestures described with the right hand were recognized.

In conclusion, the objective of this thesis was to design, implement, and test a complete gesture recognition system which improves the previous work on the points discussed above. Hand segmentation had to be re-designed from scratch in order to cover all the requirements of the new version. Hand posture recognition had to be included as a new feature. The tracking process and gesture recognition were adopted from the previous system and were not re-designed, apart from being adapted for integration into the new system architecture.

1.2 Outline

Chapter 2 describes the different steps of a recognition process. The following chapters explain in detail the chosen color model (chapter 3), the segmentation technique (chapter 4), tracking (chapter 5), color model updating (chapter 6), and posture (chapter 7) and gesture (chapter 8) recognition. Chapter 9 presents experiments and results. Conclusions are summed up in chapter 10. Chapters 11 and 12 discuss improvements that could be made in future work and other applications where the system can be useful. Finally, appendix A is a user guide for working with the programs developed (trainers and recognizer).

2 System background

The new system has to support all the requirements described in the previous section (1.1), and the new system architecture is designed to meet them. In addition, the tracking and gesture recognition parts from the previous work need to be integrated into the new design. The new architecture is derived from the classical global vision based gesture interpretation system proposed in [6] (Fig. 2).

Figure 2: Flow diagram of the classical global vision based gesture interpretation system.

The two initial steps from the classical model are divided into four steps: data analysis is separated into image segmentation and tracking, while recognition consists of static postures and dynamic gestures.

2.1 Segmentation

Segmentation is the first step in the process. It is responsible for providing measurements which are robust to illumination changes, and thus for providing well defined skin shapes from every frame to the posture recognizer. The color segmentation step takes an image from the input device and produces as output a new binary image in which each pixel is classified as skin or background. There are many image segmentation techniques that can be considered when designing a color segmentation module [10], [11]. Some of these are mentioned below.

Histogram thresholding is based on constructing color and hue histograms. Images are then thresholded at the most clearly separated peaks [12], [13]. But this thresholding technique is not possible when the histograms are unimodal, which usually happens when the image consists of a large background area and small regions to segment.

Region growing starts by generating initial pixels called seeds, and grows regions around them using some homogeneity criterion [14]. There have been examples of work that use this approach, since it is fast and robust when the thresholds for the initial seeds are well defined [15], [16]. That is not the case in this thesis, since this system is required to be as independent as possible of the skin color of the user.

Non-parametric statistical approaches, where the model is not specified but is determined by the data. Their main advantage is that they provide a probability density function that can be easily estimated [17]. As a drawback, histogram approaches require a considerable amount of training data, and adapting a histogram is a slow process when there is a large amount of training data to process.

Parametric statistical approaches. Compared to non-parametric ones, it is harder to calculate the probability densities. But as a main advantage, they can adapt the density functions very quickly to new colors, since they do not take into account the old training data, which therefore does not need to be so abundant.

A parametric statistical method offers the most suitable advantages for the purposes of our system. It adapts quickly and continuously to the new skin colors in the scene. It is slower at segmenting a frame, but fast enough to perform the recognition in real time.

The first segmentation step yields a binary image that marks each pixel as skin or background; it is then necessary to find regions by connecting skin pixels. The run length algorithm explained in [18] is a fast approach for this. It has the advantage that, during its own processing, it can calculate some geometrical moments that are useful in later steps: position, center of mass, weight, maximum and minimum coordinates, and orientation. The segmentation process produces noise and some pixel misclassification, which requires some morphological operations to fix the connected regions that are found.

Figure 3: Flow diagram of the segmentation process.

At this point, the segmentation diagram looks as shown in Fig. 3. An important fact is not shown in this diagram: it is not defined in which color space the skin model is represented. If that color space differs from that of the input image coming from the camera, an extra step is necessary to perform the transformation. To make this color transformation faster, in case it needs to be performed, a look-up table should be used. As an example, most cameras deliver either a raw image format (RGB0) or an RGB image. If we want to use some other color representation, such as HSV, the pixel values of the original image have to be transformed to this new space.

Input from the camera consists of images 320 pixels wide and 240 pixels high. This is a size that almost all cameras support, and it represents a good trade-off between the time required to process the image and the resolution required to detect the user's hands and head.

2.2 Tracking

Kalman filters were used in the previous work to track the hands and head of the user. These filters have often been used in interactive computer graphics [19], [20]. Their main disadvantage is their weak robustness to occlusion problems, but since solving occlusions is not one of the goals of the thesis, the previous tracking subsystem is adapted to the new system.

2.3 Posture recognition

Continuous posture recognition is a new step that has to be implemented in the new system. It needs to be performed frame by frame for both hands, so the chosen technique has to be fast and reliable at the same time. Current approaches use different pattern classification methods:

Geometric moments such as area, perimeter, major axis length, minor axis length, eccentricity, and some invariant moment functions are commonly used. This technique has the disadvantage of depending on the order in which the moments are used in discrimination. It works fine when the objects to classify are very different, but in the examples considered in our work the major differences come from the finger configuration, which cannot be easily discriminated with a moment based representation alone.

Dimensionality reduction techniques. The most used are Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA). LDA is a technique used to find the linear discriminant functions of a set of variables which best classify the samples. PCA is used to simplify a dataset by projecting samples into a new space. It is commonly thought that LDA outperforms PCA because it is directly related to class discrimination, but as stated in [21], when the training dataset is small PCA is more reliable. Therefore, PCA is the recognition technique the system uses to perform the new continuous posture recognition process.

2.4 Gesture recognition

Hidden Markov Models are a technique widely used for speech and gesture recognition [22], [23]. The previous work implements a gesture module which uses positions and their differences for recognition. The results this module achieves are good enough to support the new system requirements; therefore, like the tracking module, it is not re-designed but adapted for integration into the new system. That adaptation includes the use of both hands to describe gestures instead of only the right one.

2.5 Color adaptation

Color adaptation is the most important issue for the new system. This module is in charge of updating the color model in such a way that the adaptation really helps the system to perform better in a dynamic environment. To manage that, some interactions between different modules need to be designed. The adaptation has to be done after the frame has been segmented, because it needs to use the pixels that have been marked as skin to update the model. In order to make the model move towards a new color region, it is not enough for the system to use only skin that has already been segmented correctly. Some new pixels should be provided to make the model learn new colors. Those pixels can only be taken from real blobs, and they are the pixels that are repaired from noise mistakes (holes in the blobs).

An updating policy is also needed. It is not useful to update every frame, because the system can easily fail when no skin is being recognized. Some confidence measure is needed to decide when to update the model, and that can be provided by the tracking module. In the previous work, the model was updated only after a gesture was completely recognized. In this new system it should be done more often, because there is no use in waiting so long.

2.6 System architecture

In summary, the new system architecture evolves as shown in Fig. 4. The different components which form the system are presented in the remainder of the report and evaluated in section 9. The center of interaction in the system is the model updating module, which, as said before, uses information from the tracking results, the original image, and the previous model to adapt it towards the new colors in future frames. The gesture and posture subsystems are isolated in the architecture, since there is no relationship between them. Posture recognition is continuous and performed frame by frame, but gesture recognition is performed only when the user moves his/her hands.

Figure 4: Flow diagram of the system.

3 Skin color space

Skin color detection and modelling have frequently been used for face detection. A recent survey [24] presents a detailed overview of commonly used color spaces. These include RGB [25], normalized RGB [26, 27], HSV [28], YCrCb [29] and YUV [30]. In [31], skin chrominance models and their performance have been evaluated. For real-world applications and dynamic scenes, color spaces that separate the chrominance and luminance components of color are typically considered preferable. The main reason is that when only the chrominance-dependent components of color are considered, increased robustness to illumination changes can be achieved.

3.1 Color transformation

For the purposes of this thesis, using RGB is a better choice than, for example, HSV to model skin color [32]. The HSV family presents lower reliability when the scenes are complex and contain similar colors such as wood textures. Moreover, in order to transform a frame it would be necessary to convert each pixel to the new color space, which can be avoided if the camera provides RGB images directly, as most of them do. If a camera could not do that, a color transformation would be required. Each image is 320 pixels wide and 240 pixels high, which means that 320 × 240 = 76,800 transformations per image would be needed. In order to avoid such a large number of heavy calculations, the system would use a look-up table where all possible transformations are precalculated. Each transformation is then reduced to a memory access. The system uses 24 bit images, which means 2^24 different colors using three bytes per color, requiring 48 MB of memory for the table.
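To make the look-up table concrete, the following is a minimal sketch in Python with NumPy. It is not the thesis code: the convert function is a placeholder for whatever target color space is wanted (for example an RGB-to-HSV routine), and is assumed to be vectorized and to return channels already scaled to [0, 255].

```python
import numpy as np

def build_lut(convert):
    """Precompute a color transform for all 2^24 RGB values.

    convert: vectorized function mapping R, G, B arrays in [0, 1] to a
    tuple of three channel arrays scaled to [0, 255] (an assumption of
    this sketch). The final table is 3 * 2^24 bytes, the 48 MB above.
    """
    levels = np.arange(256, dtype=np.float32) / 255.0
    r, g, b = np.meshgrid(levels, levels, levels, indexing="ij")
    c0, c1, c2 = convert(r, g, b)
    return np.stack([c0, c1, c2], axis=-1).astype(np.uint8)  # (256,256,256,3)

def transform(image, lut):
    """One table access per pixel instead of per-pixel arithmetic."""
    return lut[image[..., 0], image[..., 1], image[..., 2]]
```

For a 320 x 240 frame, transform replaces the 76,800 per-pixel conversions with a single fancy-indexing lookup into the precomputed table.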

As shown in section 9.6, the system requires about 10% of the total execution time when it needs to transform images into a new color space.

3.2 Skin pixels distribution

The system needs an initial training with skin colors. This step is done using an off-line training process with a set of example images that are manually segmented, from which histograms are calculated. These histograms give the system an initial part of the whole color space where pixels are considered to be skin.

The three-dimensional plots in Fig. 5 show how skin pixels are distributed. The plots are generated from the skin pixels of one single user, but under four different illumination conditions and taken with different cameras. The cloud of points which represents the skin changes its shape and moves when the light changes, which motivates the idea of using an adaptive classification model. The plots (Fig. 5) also show the difference between including head pixels in the model or not. Lips, eyes, and some hair colors fall outside the cloud of skin points, but they do not disturb it, because all the other head pixels fit well inside the cloud and help to define it better.

Figure 5: Same user in four different scenes with different cameras. Green points represent skin pixels from both hands and the head, while red points represent points which belong only to the hands.

At the beginning, it is recommended to have a rich initial model (Fig. 6) covering as many users and illumination conditions as possible. Then, when a new user uses the system, it will segment some initial patches of skin and adapt to them.

Figure 6: Skin pixels in RGB color space.

4 Segmentation

This section describes the main step in the recognition process: later tracking and recognition depend heavily on the segmentation results. Details about how Mixtures of Gaussians are used to model skin color are given in section 4.1. The Bayes framework which classifies pixels using that probabilistic model is shown in section 4.2. Sections 4.3 and 4.4 discuss blob searching and the morphological operations used to fix the chosen blobs. Finally, section 4.5 gives some ideas about how training should be performed.

4.1 Color modeling

It is not easy to choose the pixel segmentation technique. Some approaches use thresholds to determine the skin region inside the whole color space [9]. This is an easy and fast way to do it, but it is usually not a good method considering the skin pixel distribution shown in Fig. 6. This system also requires a good model updating technique, and modifying thresholds is not a very reliable way to achieve a robust system. Another option is the use of non-parametric probabilistic models [17]. They are much more complex than thresholding, as they use Bayes' rule to decide which pixels are skin and which are background. The results are commonly much better, but they have a problem that makes them impracticable for this system: it takes too long to update the model with new pixels, and a lot of information from the past has to be saved. As discussed in section 2.1, the method adopted in this work is a probabilistic

model, but a parametric one. From the histograms made during the training step, a set of normal Gaussian distributions is fit to the data (Fig. 7), creating a Mixture of Gaussians [33]. A Mixture of Gaussians is more appropriate than a single Gaussian model for segmenting human skin, as tested in [34] and [35]. In section 9.1, different numbers of components are tested for the mixture.

P(x) = \sum_{j=1}^{n} w(j) \, g(x; \mu(j), \sigma^2(j)), \qquad \sum_{j=1}^{n} w(j) = 1

where g(x; \mu(j), \sigma^2(j)) represents a single normal distribution.

Figure 7: Example of a Mixture of Gaussians (4 components).

The three color variables (R, G, and B) are not independent. In fact, the plots in Fig. 8 show that RGB skin colors are highly correlated. The estimation is therefore calculated using full covariance matrices. If the variables were not correlated like this, the covariances could be modeled by simple independent variances, but instead we model the covariance matrix as:

cov = \begin{pmatrix} \sigma_r^2 & \sigma_{rg} & \sigma_{rb} \\ \sigma_{gr} & \sigma_g^2 & \sigma_{gb} \\ \sigma_{br} & \sigma_{bg} & \sigma_b^2 \end{pmatrix}

This estimation is calculated using the Expectation Maximization (EM) algorithm [36], [37].

Figure 8: Skin pixels correlation (R-G, R-B, and G-B).

The EM algorithm is an iterative optimization method, where initial values for the means and variances are calculated by randomly distributing the points among the mixture components (k points per component). Each component starts with the same weight:

\mu(j) = \frac{1}{k} \sum_{i=1}^{k} x_i, \qquad \sigma^2_{nm}(j) = \frac{1}{k-1} \sum_{i=1}^{k} (x_{in} - \mu_n(j))(x_{im} - \mu_m(j)), \qquad w(j) = \frac{1}{N}

The algorithm consists of two steps. During the Expectation step (E step), the posterior probability of each pixel belonging to each component is calculated and then normalized:

E(j, i) = \frac{w(j)}{\sqrt{(2\pi)^3 |\sigma^2(j)|}} \, e^{-\frac{1}{2}(x_i - \mu(j))^T (\sigma^2(j))^{-1} (x_i - \mu(j))}, \qquad E(j, i) \leftarrow \frac{E(j, i)}{\sum_{j=1}^{N} E(j, i)}

The Maximization step (M step) re-calculates the new weights, means, and covariance matrices to be used in the following iteration:

\mu(j) = \frac{\sum_{i=1}^{k} E(j, i) \, x_i}{\sum_{i=1}^{k} E(j, i)}, \qquad \sigma^2_{nm}(j) = \frac{\sum_{i=1}^{k} E(j, i)(x_{in} - \mu_n(j))(x_{im} - \mu_m(j))}{\sum_{i=1}^{k} E(j, i)}, \qquad w(j) = \frac{\sum_{i=1}^{k} E(j, i)}{\sum_{j=1}^{N} \sum_{i=1}^{k} E(j, i)}
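As an illustration, here is a compact NumPy sketch of EM for a full-covariance Mixture of Gaussians over RGB pixels. It follows the equations above, including the random initialization and the 0.05 tolerance on the means, but it is a simplified reconstruction rather than the thesis implementation; a library such as scikit-learn's GaussianMixture provides the same fit.

```python
import numpy as np

def gaussian(x, mean, cov):
    """Full-covariance normal density evaluated at each row of x (N, 3)."""
    d = x - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt(((2 * np.pi) ** x.shape[1]) * np.linalg.det(cov))
    return np.exp(-0.5 * np.einsum("ni,ij,nj->n", d, inv, d)) / norm

def em_fit(x, n_components, tol=0.05, max_iter=100, seed=None):
    """Fit a Mixture of Gaussians to pixels x of shape (N, 3) with EM."""
    rng = np.random.default_rng(seed)
    # Random initialization: distribute the points among the components.
    labels = rng.integers(n_components, size=len(x))
    means = np.array([x[labels == j].mean(axis=0) for j in range(n_components)])
    covs = np.array([np.cov(x[labels == j].T) for j in range(n_components)])
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(max_iter):
        # E step: normalized posterior E(j, i) for each pixel and component.
        resp = np.stack([w * gaussian(x, m, c)
                         for w, m, c in zip(weights, means, covs)], axis=1)
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: re-estimate weights, means, and full covariance matrices.
        nk = resp.sum(axis=0)
        new_means = (resp.T @ x) / nk[:, None]
        covs = np.array([(resp[:, j, None] * (x - new_means[j])).T
                         @ (x - new_means[j]) / nk[j]
                         for j in range(n_components)])
        weights = nk / nk.sum()
        done = np.abs(new_means - means).max() < tol  # convergence on means
        means = new_means
        if done:
            break
    return weights, means, covs
```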

Convergence is reached when the distance between two consecutive means is below an error threshold (0.05). Choosing this threshold is not a critical decision, since the aim of the model is only to be able to find some skin patches at the beginning of the execution; the model is then adapted to the new light and user skin color. A further explanation of the Expectation Maximization algorithm for a Mixture of Gaussians can be found in [38].

The next important decision is: how many components (n) should each model use? This was decided empirically. The skin model is not accurate enough when using just two components, while using four or five is similar to using three but requires more calculations per pixel. More details and evaluation are provided in section 9.1.

4.2 Classification

Once the model is available, a Bayes classifier is used to estimate the probability of a pixel being skin or background. It therefore needs to keep two color models: one for skin colors and another for the background color. For classification, the system uses ideas proposed in [17], but differently from their work, it uses Mixtures of Gaussians for color modeling instead of non-parametric histograms. In the following, the conditional density is denoted P(rgb | FG) for the skin regions and P(rgb | BG) for the background, with rgb ∈ R^3. The posteriors P(BG | rgb) and P(FG | rgb) are estimated using Bayes' formula, and the classification boundary is drawn where the ratio of P(FG | rgb) and P(BG | rgb) exceeds some threshold T. T is a relative risk factor associated with misclassification:

T < \frac{P(FG \mid rgb)}{P(BG \mid rgb)} = \frac{P(rgb \mid FG) \, P(FG)}{P(rgb \mid BG) \, P(BG)}

In other words, when the above is satisfied, the pixel under consideration is labeled as skin. The value of T was chosen empirically from a set of images. A ROC curve was calculated from three different sets of image samples (Fig. 9). The ROC curve shows the number of hits (the fraction of true skin pixels which are correctly classified) against the number of false alarms (the fraction of background pixels which are incorrectly classified as skin). Depending on the complexity of the sample sets (the background can contain a door, a table, or shelves), the ROC line shows more or fewer false alarms. A high threshold value (nearly 1.0) would give the best hit rate, but that is not what the system requires. The most important issue is to decrease false alarms: if false alarms are too frequent, they can confuse the updating process at the beginning of the execution and move the model towards an incorrect color (for example, a piece of wood). In conclusion, 0.4 is the chosen threshold, because it is much better for the system to keep the false alarm rate below 2%, even though the hit rate then drops to between 55 and 90% (depending on the sample set). That is not a problem for the system, because it soon adapts to the user's skin color and the hit rate rises quickly over the next frames.
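The classification rule translates almost verbatim into code. The sketch below assumes SciPy and the (weights, means, covariances) mixture representation fitted above; the prior ratio P(FG)/P(BG) is folded into a single prior parameter, whose value is an assumption of the sketch, not something given in the text.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_pdf(pixels, weights, means, covs):
    """P(rgb | model) under a Mixture of Gaussians; pixels has shape (N, 3)."""
    return sum(w * multivariate_normal.pdf(pixels, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

def classify_skin(image, skin_model, bg_model, prior=1.0, T=0.4):
    """Label a pixel as skin when
    P(FG|rgb) / P(BG|rgb) = P(rgb|FG) P(FG) / (P(rgb|BG) P(BG)) > T,
    where prior stands for the assumed ratio P(FG) / P(BG)."""
    pixels = image.reshape(-1, 3).astype(float)
    ratio = prior * mixture_pdf(pixels, *skin_model) \
                  / np.maximum(mixture_pdf(pixels, *bg_model), 1e-300)
    return (ratio > T).reshape(image.shape[:2])
```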

Figure 9: ROC curves calculated from three different sets of images.

4.3 Blob searching

The result of the classification step is a binary image showing which pixels were classified as skin and which as background. The next step is to create regions from adjacent pixels. Assuming the user is always the only person in the image and that there is no interference from the background (for example, a big wooden surface), it is reasonable to assume that the three biggest blobs are the head and the two hands. But it is also possible that the user does not show both hands in every frame, so some checking should always be done (size, position, etc.). A run length algorithm is used to search for blobs while also calculating some moments of the blobs (position, centroid, size, direction...). The size is used to choose blobs within an expected maximum and minimum size, position and centroid are used for tracking, and direction is only used when drawing tracking results on the original image. The algorithm works better searching for neighbors in all eight possible directions (8-connectivity) rather than using 4-connectivity, because sometimes a poorly segmented finger is connected to the palm only by a diagonally neighboring pixel.
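For completeness, here is a simple 8-connectivity labeling sketch that also collects some of the moments mentioned above. It uses a breadth-first flood fill rather than the run length algorithm of [18], so it illustrates the output of this step, not the method actually used; the minimum blob size is an arbitrary assumption.

```python
import numpy as np
from collections import deque

def find_blobs(mask, min_size=50):
    """Group skin pixels of a binary mask into 8-connected regions."""
    h, w = mask.shape
    labels = np.zeros((h, w), dtype=np.int32)
    blobs, next_label = [], 1
    for y0 in range(h):
        for x0 in range(w):
            if not mask[y0, x0] or labels[y0, x0]:
                continue
            queue, pixels = deque([(y0, x0)]), []
            labels[y0, x0] = next_label
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):        # 8-connectivity: all eight
                    for dx in (-1, 0, 1):    # neighbors, diagonals included
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny, nx] and not labels[ny, nx]):
                            labels[ny, nx] = next_label
                            queue.append((ny, nx))
            if len(pixels) >= min_size:
                ys, xs = np.array(pixels).T
                blobs.append({"size": len(pixels),
                              "centroid": (ys.mean(), xs.mean()),
                              "bbox": (ys.min(), xs.min(), ys.max(), xs.max())})
            next_label += 1
    return blobs
```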

4.4 Morphology

The original blobs received after segmentation are not good enough to be used directly for PCA recognition, which is rather sensitive to small shape changes. They are full of noise and holes, and it is necessary to make them more suitable for the recognition step. The first step is to fill the holes of each segmented region. As shown in Fig. 10, a small change was introduced to the classical morphological closing. After the initial dilation, the perimeter of the region is estimated, which is then used to fill the region inside it. After this, all the pixels along the perimeter are removed to recover the original outer shape of the region. This last step is done because the dilation can add some non-skin pixels to the region, and if they are added frequently the model can get used to them, leading to worse segmentation in future frames.

Figure 10: Morphological operations. From left to right: original blob, expanded blob, perimeter, filled blob, and final blob.

4.5 Training

Training the skin color model requires a set of sample images which have to be segmented manually. All the other pixels are used to model the background. The model will be better if the set of samples is large and varied (different kinds of light, different users...). A large number of samples can make the Expectation Maximization algorithm very slow, but it is calculated off-line. The result is a parametric model, so a large training set does not make the recognizer program slower.

Choosing a good range of samples is not the only important decision to take while training the color model. The number of Gaussian components for each model (skin and background) must also be chosen. From experience, a background model with just two Gaussians is enough. As will be discussed in section 9.3, the background model is not updated at run time, so it is not really important to have very precise knowledge about the background. The skin model is more complicated: the more components the mixture has, the better the

model fits. On the other hand, recognition and updating times also grow. One or two Gaussians are not good enough, and the difference in quality between using three or four Gaussians is not very big. A detailed explanation is provided in section 9.1.

The color training program saves a file with all the information about both models needed by the recognizer: number of components, means, weights, and covariance matrices. The histogram generated from the skin pixels is not needed any more.

5 Tracking

In the previous work, the Kalman filter [39] is used to track the segmented blobs. The Kalman filter has been the subject of extensive research and application, and is often used in interactive computer graphics. It provides noise reduction and predictions from its input. Predictions are useful to decide which blob should be considered the head or the right/left hand. Each region uses one individual filter per coordinate (x, y), which makes six filters necessary to track the two hands and the head.

5.1 Predictions

The regions are tracked directly in image space, where the state is represented by the position p = (x \; y)^T and the velocity of each tracked region:

x = \begin{pmatrix} p \\ \dot{p} \end{pmatrix}

Under the assumption of a first order Taylor expansion, the system is defined as:

x_{k+1} = A x_k + C w_k, \qquad A = \begin{pmatrix} 1 & T \\ 0 & 1 \end{pmatrix}, \qquad C = \begin{pmatrix} T^2/2 \\ T \end{pmatrix}

where w_k represents Gaussian noise with variance \sigma_w^2. The observation model is defined as:

z_k = H x_k + n_k, \qquad H = \begin{pmatrix} 1 & 0 \end{pmatrix}

where z is the measurement vector and n_k is the measurement noise, which is assumed to be Gaussian with variance \sigma_n^2. Using the standard Kalman filter updating model [39], it is straightforward to track the regions of interest over time. The matching is performed using a nearest neighbor algorithm. For a further description of the Kalman filter algorithm, see [40].
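The prediction and update cycle for one coordinate can be sketched as follows, using the A, C, and H matrices defined above. The noise variances and the initial state covariance are illustrative guesses, not values from the thesis.

```python
import numpy as np

class CoordinateKalman:
    """Constant-velocity Kalman filter for one image coordinate.

    The state is (position, velocity); six of these filters track the
    two hands and the head, one per coordinate of each region."""

    def __init__(self, T=1.0, sigma_w=1.0, sigma_n=4.0):
        self.A = np.array([[1.0, T], [0.0, 1.0]])      # state transition
        C = np.array([[T * T / 2.0], [T]])             # noise coupling
        self.H = np.array([[1.0, 0.0]])                # position is observed
        self.Q = (sigma_w ** 2) * (C @ C.T)            # process noise cov
        self.R = np.array([[sigma_n ** 2]])            # measurement noise
        self.x = np.zeros((2, 1))                      # (position, velocity)
        self.P = np.eye(2) * 100.0                     # state covariance

    def predict(self):
        self.x = self.A @ self.x
        self.P = self.A @ self.P @ self.A.T + self.Q
        return float(self.x[0, 0])                     # predicted position

    def update(self, z):
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)       # Kalman gain
        self.x = self.x + K @ (np.array([[z]]) - self.H @ self.x)
        self.P = (np.eye(2) - K @ self.H) @ self.P
        return float(self.x[0, 0])                     # filtered position
```

predict() supplies the position used for the nearest neighbor matching of blobs, after which update() incorporates the measured blob centroid.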

5.2 Tracking issues

The main disadvantage of Kalman filters in tracking are occlusion problems [9]. Short occlusions do not interrupt tracking, because the filters hold the velocity of the blobs and have restricted acceleration. Problems arise when the blobs come so close that they merge into one single object. If the user moves the two hands across each other, the tracker will probably get confused and will swap the right and left hands in subsequent tracking.

To cope with another problem inherent to the Kalman filter, namely multiple targets converging to a single one, the previous work implemented a re-initialization step. During tracking, a measure of confidence is maintained. When one of the following three rules is broken, the measure of confidence is decreased. If they are violated for a long time, the level of confidence becomes low and re-initialization is triggered:

- The region that belongs to the left hand has to be to the left of the right hand.
- The face is above both hands.
- None of the three tracked regions are at the same location.

Figure 11 shows how the filter smooths the hand movements during one of the test runs. This is crucial for the other part of the system, namely the gesture recognition step [41], since the Hidden Markov Models are based on the whole hand trajectory.

Figure 11: Left) the raw image positions of the hand regions, and right) the positions of the hand regions estimated by the Kalman filter.

6 Color adaptation

In order to meet the system requirements stated in section 1.1, the system cannot rely on a static segmentation technique. Users are assumed to be moving, getting closer to and farther from light sources. Sources can be natural (sunlight from windows) or artificial, such as fluorescent lights or bulbs. Even a painted wall can alter skin colors by reflection. A Mixture of Gaussians was chosen to model the skin color because it has the key advantage over non-parametric models that on-line adaptation is fast in two senses: it takes few frames to learn the colors of a hand palm, and little computing time to transform the model.

6.1 On-line adaptive learning

Model updating is done by calculating new means, weights, and covariance matrices for all the components in the Mixture of Gaussians [42]. It is not necessary to keep in memory all the points used in the training step, nor the histograms. But a value for the adaptation speed \alpha has to be estimated empirically. This value cannot be small, because the model would change drastically if false pixels entered the updating step. It cannot be equal to 1.0 either, or all new pixels would be useless for updating the original model. More details and test results are provided in section 9.2.

Model updating starts by calculating the new expected posterior probability of each component G_j for the new skin pixel:

q_n(j) = P(G_j \mid x_n) = \frac{w_{n-1}(j) \, g(x_n; \mu_{n-1}(j), \sigma^2_{n-1}(j))}{P(x_n)}

New weights for the mixture components are calculated using these probabilities and the adaptation rate:

w_n(j) = \alpha \, w_{n-1}(j) + (1 - \alpha) \, q_n(j)

Each component gets a new learning rate from the defined adaptation speed and the previous values:

\eta_n(j) = \frac{(1 - \alpha) \, q_n(j)}{w_n(j)}

Finally, new means and covariance matrices are calculated using the above learning rate:

\mu_n(j) = (1 - \eta_n(j)) \, \mu_{n-1}(j) + \eta_n(j) \, x_n

\sigma^2_n(j) = (1 - \eta_n(j)) \, \sigma^2_{n-1}(j) + \eta_n(j) \, (x_n - \mu_n(j))(x_n - \mu_n(j))^T
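The update equations translate directly into code. The sketch below applies one accepted skin pixel to the mixture, using the same (weights, means, covariances) representation as before; looping it over all accepted pixels of a frame gives the per-frame update.

```python
import numpy as np
from scipy.stats import multivariate_normal

def adapt_model(weights, means, covs, x, alpha):
    """One on-line update of the skin mixture with a new skin pixel x (3,).

    alpha close to 1 keeps the model stable; (1 - alpha) acts as the
    global learning rate of the adaptation."""
    comp = np.array([w * multivariate_normal.pdf(x, mean=m, cov=c)
                     for w, m, c in zip(weights, means, covs)])
    q = comp / comp.sum()                        # posterior q_n(j) = P(G_j | x)
    weights = alpha * weights + (1.0 - alpha) * q
    for j in range(len(weights)):
        eta = (1.0 - alpha) * q[j] / weights[j]  # per-component learning rate
        means[j] = (1.0 - eta) * means[j] + eta * x
        d = (x - means[j]).reshape(-1, 1)
        covs[j] = (1.0 - eta) * covs[j] + eta * (d @ d.T)
    return weights, means, covs
```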

6.2 New skin pixels

After the morphological operations, the model gets skin patches with some new pixels that the classifier did not find. It also uses all the correctly classified pixels in the hand, which help to stabilize the model. In order to prevent the model from moving towards strange color areas, only points which are near the current Gaussian distributions are accepted. The more frequently the updating is done, the more reliable the segmentation in future frames will be. On the other hand, if a false blob is recognized (some materials such as wood, cardboard...), the model can evolve to a wrong state. As shown in section 9.6, a frame by frame updating rate is feasible (the updating process takes less than 10% of the total execution time), but one has to be certain that the blob is a real hand (see section 9.3 for evaluation). To cope with the problem of including non-skin pixels in the model update, some checking is done before pixels are used: each pixel must be close to some Gaussian component of the model. If a pixel has a very low probability of being skin under the current model (below a fixed threshold), it is rejected.

7 Posture recognition

Principal Component Analysis (PCA) is a classical technique often used in pattern recognition. Other approaches use techniques [6] such as geometrical moments [43], contours [44], particle filtering [45], [46], 3D models [47], etc., but PCA is simple, reliable, and fast. PCA training is the most expensive step in the system, but it is calculated off-line. If the segmentation results are good enough (without losing any finger, and without expanding beyond the skin), then even with few samples per posture there is a high probability of a correct recognition.

7.1 Principal Component Analysis

The basic idea is to project samples into a smaller space with fewer dimensions. In addition, the most relevant information is located in the first components of the new coordinate system [48]. The training procedure is explained later in section 7.2; it is more complex and slower than recognition, but it is calculated off-line. As training results, eigenvectors and eigenvalues are estimated. Those vectors ([e_1, e_2, \ldots, e_k]^T) are used to calculate the representation of each sample in the new space:

g_i = [e_1, e_2, \ldots, e_k]^T (x_i - c)

where c is the mean vector of the training samples (see section 7.2). Before performing the recognition, the blob has to be prepared. To fit into a defined frame (30 pixels wide and 30 pixels high), it has to be scaled to the correct size along its larger dimension, and centered in the frame along the other. More details and evaluation of this step are given in section 9.4. The representation of the new scaled sample is then calculated, and the training sample at the shortest distance from this point determines the most likely posture. This could be calculated as the Euclidean distance, but the results show that using a pseudo-distance weighted by the eigenvalues is slightly better. This way, the first components are considered more important than the following ones. An example of the distance calculation is given in the evaluation section 9.4.

Euclidean distance:

d^2 = \sum_{j=1}^{k} (s_j - g_{ij})^2

where g_i is the representation of the trained sample, s is the sample being recognized, and k is the number of eigenvectors used.

Pseudo-distance:

d^2 = \sum_{j=1}^{k} ev_j \, (s_j - g_{ij})^2

where ev_j is the eigenvalue used to weigh the j-th component of the distance.
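A sketch of the recognition step: project the prepared 30 x 30 blob and pick the nearest training sample. The direction of the eigenvalue weighting follows the pseudo-distance formula as reconstructed above; dropping eigvals from the expression gives the plain Euclidean case.

```python
import numpy as np

def recognize_posture(blob, eigvecs, eigvals, mean, train_proj, train_labels):
    """Classify a scaled 30x30 binary blob by nearest projected neighbor.

    eigvecs: (k, 900) matrix of chosen eigenvectors; mean: per-pixel
    average c; train_proj: (M, k) projected training samples."""
    s = eigvecs @ (blob.reshape(900).astype(float) - mean)   # g = E (x - c)
    d2 = ((s - train_proj) ** 2 * eigvals).sum(axis=1)       # pseudo-distance
    best = int(np.argmin(d2))
    return train_labels[best], d2[best]
```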

7.2 Training

This is a very time-consuming process, but like all the training for the system, it is calculated off-line. The objective is to get the eigenvalues and eigenvectors of a matrix containing N posture samples. A sample is a small binary image. It cannot be too large, because that would make the recognizer much slower, since its speed depends on the number of dimensions of a sample. Each sample is scaled into a frame of 30 x 30 pixels, which is enough to perform good recognition with a reasonable run time. It is recommended to have more than one sample per posture, and for some postures it is necessary because they can be described with either hand. A very large set of samples is not useful, because most of them will be similar and thus do not offer more reliability. The training set and its evaluation are described in detail in section 9.4.

Image samples are transformed into vectors with value 1.0 for skin pixels and 0.0 for background pixels. Using all those vectors (samples) as columns, a matrix is built:

x = [x_1, x_2, \ldots, x_N]

From every pixel in that matrix, the average of its row is subtracted:

c_i = \frac{1}{N} \sum_{j=1}^{N} x_{ji}, \qquad c = [c_1, c_2, \ldots, c_m]^T

The resulting matrix is called P:

P = [x_1 - c, \; x_2 - c, \; \ldots, \; x_N - c]

The covariance matrix Q is calculated as P P^T. Finally, the eigenvectors and eigenvalues of Q are calculated. Only the first vectors are useful: the number of dimensions of the new space has to be small, but large enough to allow reliable recognition. Eigenvectors are accepted until the cumulative sum of normalized eigenvalues reaches almost 1; beyond that point, the remaining dimensions do not contain useful information. The system currently works with 10 dimensions. An evaluation of the chosen threshold is given in section 9.4.

The last step is to find the new representation of all the samples used for training. To do this, a new matrix is built using the chosen eigenvectors as rows; that matrix is then multiplied by all the samples. The PCA training program saves a file with the chosen eigenvectors and eigenvalues, the averages of each pixel (necessary for recognition), and the new representation of the training samples.
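The training procedure corresponds to a few lines of NumPy. In this sketch the samples are stored as rows, so P^T P equals the P P^T of the text; the 0.999 cumulative-eigenvalue threshold is a placeholder, since the thesis only states that the chosen value reduces 900 dimensions to 10.

```python
import numpy as np

def train_pca(samples, energy=0.999):
    """Fit PCA to binary posture samples of shape (N, 900): 30x30 images."""
    X = samples.astype(float)
    mean = X.mean(axis=0)                    # per-pixel average c
    P = X - mean                             # centered samples as rows
    Q = P.T @ P                              # 900 x 900 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Q)     # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]        # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Accept eigenvectors until the normalized eigenvalue sum reaches energy.
    k = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), energy)) + 1
    E = eigvecs[:, :k].T                     # (k, 900) projection matrix
    return mean, E, eigvals[:k], (E @ P.T).T # plus projected training samples
```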

8 Gesture recognition

Hidden Markov Models (HMM) have found many uses in recognition problems since they were described in the 1960s. This statistical method started to be used in the 1970s for speech recognition, and since the 1980s it has also been used for biological sequences such as DNA. It is a technique which fits this kind of gesture recognition problem well. An HMM has a set of parameters whose values are trained to describe a pattern. Samples are then classified by the models, and the one which provides the highest posterior probability is chosen to describe the sample. The previous work concentrates only on isolated gestures. Continuous recognition [49], [50] would be more suitable for the system, since it would recognize movements without a silence stop between them, but it is not within the scope of this thesis.

8.1 Gesture partitioning

Gestures have to be separated by a silence stop between them. A gesture is a continuous sequence in which the motion of the blob exceeds a threshold (Fig. 12). Only positions are taken into account when recognizing the gesture; in fact, the difference in position between two consecutive measurements of the blob is the motion measure used. The motion measurements are then smoothed by convolution with a Gaussian kernel whose variance corresponds to the expected length of a gesture (about 20 frames); a sketch of this partitioning step is shown after the model definition below. Once the gesture is isolated in time, the positions which describe it are normalized to remove differences in user position or gesture size. All the blobs are then centered, which makes the system unable to distinguish whether a gesture is being made, for example, above or below the head. The system is only sensitive to the dynamics within the gesture.

8.2 Hidden Markov Models

Each known gesture in the database is modeled by a trained HMM. The most probable HMM for a sequence is selected to classify the gesture. An HMM is described by the following parameters:

- States, S = {s_1, s_2, \ldots, s_N}. The state at time t is denoted q_t.
- Symbols, V = {v_1, v_2, \ldots, v_M}.
- Transition probability distribution, A = {a_{ij}}, where a_{ij} = P(q_{t+1} = s_j | q_t = s_i), 1 ≤ i, j ≤ N.
- Observation probability distribution, B = {b_j(k)}, where b_j(k) = P(v_k at t | q_t = s_j), 1 ≤ j ≤ N, 1 ≤ k ≤ M.
- Initial state distribution, π = {π_i}, where π_i = P(q_1 = s_i), 1 ≤ i ≤ N.
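Referring back to section 8.1, the partitioning step can be sketched as frame-to-frame motion, Gaussian smoothing, and thresholding into active intervals. The 0.0006 threshold is taken from Fig. 12 and the window from the expected gesture length of about 20 frames, but the exact kernel shape used in the thesis is not specified, so the one below is an assumption.

```python
import numpy as np

def partition_gestures(positions, threshold=0.0006, win=20):
    """Split a tracked-blob trajectory into gestures separated by stillness.

    positions: (T, 2) array of normalized blob centers, one per frame;
    win: roughly the expected gesture length in frames."""
    motion = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    taps = np.arange(-win, win + 1)
    kernel = np.exp(-0.5 * (taps / (win / 2.0)) ** 2)  # assumed Gaussian
    smooth = np.convolve(motion, kernel / kernel.sum(), mode="same")
    active = smooth > threshold
    padded = np.concatenate(([False], active, [False]))
    edges = np.flatnonzero(np.diff(padded.astype(int)))
    return list(zip(edges[::2], edges[1::2]))          # (start, end) frames
```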

Figure 12: Partitioning starts when the motion exceeds a defined threshold (0.0006), and stops when it falls below it again.

Measurements are organized into observation sequences, O = o_1 o_2 \ldots o_T. The observations for gestures are not discrete, therefore the observation probability distribution B of each state is replaced by a continuous Gaussian density:

b_j(o_t) = \frac{1}{\sqrt{(2\pi)^n |\sigma_j^2|}} \, e^{-\frac{1}{2}(o_t - \mu_j)^T (\sigma_j^2)^{-1} (o_t - \mu_j)}

where n is the number of dimensions of the observations, \mu_j is the mean, and \sigma_j^2 the covariance for state j. For a further explanation of HMMs, see [51].

Given an HMM, there are three problems of interest to solve:

- Evaluation. Calculate P(O | \lambda), the probability of the observation sequence given the model.
- Decoding. Choose a corresponding sequence of internal states which best explains the observation sequence.
- Training. Adjust the model parameters using some observation sequences.

Only two of the three need to be solved for the purposes of this thesis: evaluation (section 8.4) and training (section 8.5). The decoding problem is more important in continuous recognition, where the most probable sequence of gestures needs to be recognized.

8.3 Topology of models

Figure 13 shows the topology of the HMMs used in the previous work. Transitions are allowed from one state to the next one or two states forward, but never backwards. Transitions from a state to itself are also allowed. The observations are the normalized positions of the blobs.

Figure 13: The topology of the HMMs used.

8.4 Recognition

The evaluation problem is solved for each of the HMMs, and the pattern with the highest value of P(O | \lambda) is chosen as the corresponding gesture. Several algorithms have been proposed to calculate that probability. One possibility is the forward variable \alpha_i(t) = P(o_1, o_2, \ldots, o_t, q_t = s_i \mid \lambda) [52]:

\alpha_i(1) = \pi_i b_i(o_1), \quad 1 \le i \le N

\alpha_j(t+1) = \left[ \sum_{i=1}^{N} \alpha_i(t) a_{ij} \right] b_j(o_{t+1}), \quad 1 \le t \le T-1, \; 1 \le j \le N

P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_i(T)

The total likelihood P(O | \lambda) is not the only possible measurement for recognition; the decoding problem is usually solved by the Viterbi algorithm [52], which memorizes the most probable state at each step of the recursion. But as stated before, the forward solution works well for the purposes of this thesis, where most of the effort has been put into the skin segmentation steps. When the probability of the sequence has been calculated for all the models, the highest is chosen as the recognized gesture, as long as that probability passes a confidence threshold.
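A sketch of the evaluation step, computing log P(O | λ) with the forward recursion and one Gaussian density per state (section 8.2). The log-sum-exp bookkeeping is an implementation detail added here to avoid numerical underflow on long sequences.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_forward(obs, pi, A, means, covs):
    """Evaluation problem: log P(O | lambda) for one HMM.

    obs: (T, n) observation sequence; pi: (N,) initial distribution;
    A: (N, N) transition matrix; means, covs: per-state Gaussians."""
    T, N = len(obs), len(pi)
    logB = np.array([multivariate_normal.logpdf(obs, mean=means[j], cov=covs[j])
                     for j in range(N)]).T             # (T, N): log b_j(o_t)
    log_alpha = np.log(pi + 1e-300) + logB[0]          # alpha_i(1)
    for t in range(1, T):
        m = log_alpha.max()                            # log-sum-exp trick
        log_alpha = m + np.log(np.exp(log_alpha - m) @ A + 1e-300) + logB[t]
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())     # log P(O | lambda)
```

Running this for every trained model and taking the maximum, subject to the confidence threshold, implements the recognition rule above.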

31 number of states and transitions. The measurements used is a combination of positions and velocities of the blobs [9]. The steps and programs that needs to be run to train the system with new gestures is described in Appendix A. Training is performed in two steps. First observation sequences are divided among the states and initial means and covariances are calculated. j = 1 T j X X 8t;q t=s j o t j 2 = 1 (o t j )(o t j ) 0 T j 8t;q t=s j where T j is the number of observations in state j. Using the Viterbi algorithm [52] the most probable state sequence is calculated, and then the the transitions probabilities are approximated by a ij = A ij PN k=1 A ik where Aij is the number of transitions made from i to j in the state sequence. Then the approximation of means and covariances is repeated with the new state sequence until they don't change. But it is possible to get a better trained HMM using Baum-Welch reestimation [52]. Instead of assigning each observation to one state, the observation is assigned to each state in proportion to the probability of occupying that particular state. L j (t) denotes the probability of being in state j at time t, then mean and covariances are j = P T t=1 L j(t)o t PT t=1 L j(t) 2 j = P T t=1 L j(t)(o t j )(o t j ) 0 P T t=1 L j(t) L j (t) is calculated using the Forward-backward algorithm. Forward probability is dened as i (t) = P (o 1 ; o2; : : : ; o t ; qt = s i j ) and was already described in section 8.4. Backward probability is dened as i (t) = P (o t+1 ; : : : ; o T ; j x(t) = i; ) which is calculated using the following recursion i (t) = NX j=1 i (T ) = a in ; a ij b j (o t+1 ) j (t + 1); 1 i N 1 i N; 1 t T 26

\beta_i(T) = a_{iN}, \quad 1 \le i \le N

\beta_i(t) = \sum_{j=1}^{N} a_{ij} b_j(o_{t+1}) \beta_j(t+1), \quad 1 \le i \le N, \; 1 \le t < T

Multiplying the forward and backward probabilities gives

\alpha_i(t) \beta_i(t) = P(O, q_t = s_i \mid \lambda)

which can be used to calculate the state occupancy probability:

L_j(t) = P(q_t = s_j \mid O, \lambda) = \frac{P(O, q_t = s_j \mid \lambda)}{P(O \mid \lambda)} = \frac{\alpha_j(t) \beta_j(t)}{P(O \mid \lambda)}

The transition probabilities are then approximated by

\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \alpha_i(t) \, a_{ij} \, b_j(o_{t+1}) \, \beta_j(t+1)}{\sum_{t=1}^{T-1} \alpha_i(t) \beta_i(t)}

A further explanation of the training can be found in [52, 51].
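To complete the picture, here is a linear-scale sketch of the backward pass and the occupancy-based re-estimation. It assumes the common termination β_i(T) = 1 rather than the exit transition a_{iN} used above, and unscaled probabilities, so it is only illustrative for short sequences.

```python
import numpy as np

def backward(B, A):
    """Backward table beta of shape (T, N); B[t, j] = b_j(o_t), linear scale.
    Assumes the termination beta_i(T) = 1 (no explicit exit state)."""
    T, N = B.shape
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    return beta

def occupancy(alpha, beta):
    """State occupancy L_j(t) = alpha_j(t) beta_j(t) / P(O | lambda)."""
    return alpha * beta / alpha[-1].sum()

def reestimate_gaussians(obs, L):
    """Baum-Welch update of the per-state means and covariance matrices."""
    occ = L.sum(axis=0)                          # total occupancy per state
    means = (L.T @ obs) / occ[:, None]
    covs = np.array([(L[:, j, None] * (obs - means[j])).T @ (obs - means[j])
                     / occ[j] for j in range(L.shape[1])])
    return means, covs
```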

9 Experimental evaluation

This section gives some results from the experimental evaluation. The system has been tested using several different cameras, including low resolution web cams. Obviously, the reliability of the results decreases in that case, but the system is still usable for applications such as, for example, controlling a music player on a desktop. Evaluation has been done on the quality of the color mixture depending on the number of components in the Mixture of Gaussians, the updating policy for model adaptation, posture recognition, gesture recognition, and the running speed of the recognizer program.

9.1 Color model

Choosing the correct number of components in the Mixture of Gaussians is very important for the behavior of the system. The more components it uses, the better the segmentation, but the slower the program runs. It is supposed to be a real time application, so there must be a compromise between quality and speed. The pictures in Fig. 14 show skin segmentation using 1, 2, 3, 4, and 5 components in the mixture.

Figure 14: Skin segmentation using 1, 2, 3, 4, and 5 components.

Three is the number of components chosen. It provides better results than using 1 or 2, and the difference between a model with 3 components and one with 4 or 5 is not very big. Moreover, each additional Gaussian in the color mixture raises the execution time, as shown in Table 1.

Table 1: Increase in time requirements with each additional Gaussian in the GMM (default is 3), in milliseconds and as a percentage.

9.2 Color adaptation

As stated in section 6, the color adaptation speed is controlled by a constant α. The closer that value is to 1.0, the slower the adaptation. It is a good idea to have a slow speed to avoid situations in which the skin pixels come from a failure in segmentation or tracking. A properly chosen constant can make the model robust to changes in case the tracking fails, as long as that does not happen very often. On the other hand, it makes updating slow, because each new pixel carries only a small weight against the current model. But taking into account that the updating process is performed frame by frame, with around 500 pixels or even more when the level of tracking confidence is high (hands and head), the model still manages to adapt quickly. In the sample sequences of Fig. 15 and 16, the model adapts from the same initial state to the new skin color in the scene. The initial model covers a large set of users and illumination conditions, while the final one has been restricted to the actual scene by the updating process.

9.3 Updating policy

The main aim of this thesis is to develop a gestural control unit robust to the frequent illumination changes that occur in the scene while interacting with the robot. Having a good training set is not enough: skin pixels change their color very often, so an updating policy is needed. The results of segmentation using the color model without any color updating are shown in Fig. 17. The results are not good. When the hand moves, its color changes, so a frame by frame update is necessary. In the following tests only the skin model is updated, while the background model is kept constant. That decision simplifies the adaptation process and makes it faster, because most of the points in an image are assumed to be background, and updating them would make the program run much slower.

The following sequences show what happens when the update is done using just the three biggest blobs. When a false blob is accepted, the model moves towards that color. On a simple background this works well (Fig. 18), but on a complex background (with a lot of wooden tables, doors...) the results show the problem discussed above (Fig. 19).

The way to solve this problem is to define an updating rule. During the following sequence (Fig. 20), the system may update every frame, but only with pixels from blobs which have been tracked for at least a minimum period of time. Only the supposed hands (both left and right) are used. The reason why pixels from the head are not used to update the model at the beginning of a tracking sequence is that they can contain colors that disturb the model (such as lips, glasses, hair...).

Figure 15: The initial model moves to a different color region where the skin is illuminated by natural light.

Those head pixels are used later, once the tracking has been running for a long time and has become stable. The system also performs well with a low resolution web camera, as shown in Fig. 21.

9.4 Posture recognition

The system recognizes five different postures, shown in Fig. 22. At present the system works with a set of 12 samples (Fig. 23): four samples for the REST posture and two samples each for the STOP, RIGHT, LEFT, and UP postures. The system has been trained with those samples and an eigenvalue threshold which reduces the eigenspace from 900 to 10 dimensions. Recognition speed depends on that number of dimensions, as shown in Tables 2 and 3.

Table 2: Number of eigenvectors chosen using different frame sizes and thresholds.

Figure 16: The initial model moves to a different color region where the skin is illuminated by artificial light.

Figure 17: The hand is subdivided into two blobs because the model cannot update towards this user's skin color.

Table 3: Percentage of total execution time needed for PCA recognition using different frame sizes and thresholds.

Some conclusions can be drawn from those tables, considering the chosen threshold. When using a 20x20 frame, the recognition time is below 5% of the total execution time.

Figure 18: At the beginning the left hand is not segmented, but the model updates with new pixels from the right one, and then the left blob starts growing until it fits the whole hand during the rest of the sequence.

Using a 30x30 frame, this time rises to 7.5%, but the recognition results are better; in particular, the discrimination between the REST and STOP postures improves. Using a 40x40 pixel frame would almost double the recognition time without providing better reliability.

Recognition is performed by choosing the closest neighbor. Table 4 shows the difference between using the Euclidean distance and the new pseudo-distance presented in chapter 7. Postures are easier to recognize with the pseudo-distance, since the distances between the correct and the false postures are proportionally bigger.

The recognition depends highly on the quality of the skin segmentation. The shape of the blob should be as good as possible in order to get reliable recognition. Hands often lose a finger, and that can make the recognizer confuse the posture, as shown in Table 5. Those results were taken from two long sequences in which the REST posture is the most used (as is usual while interacting with the system).

9.5 Gesture recognition

The system recognizes five different gestures, shown in Fig. 24. Some of them are described with one hand and some with both. The head does not participate in any gesture. Each gesture is modelled by an HMM, trained with a large number of samples for that gesture. Only examples without any occlusion or tracking failure were included in the training sets. All the gestures were performed by the author himself, wearing long sleeves.

Figure 19: When the left hand goes over the shelves, the model starts learning those new colors. Then a piece of the table is segmented, which makes the model learn some pixels from the computers near it. Finally the system fails, segmenting even the ceiling of the laboratory.

The system is able to distinguish the five trained gestures when they are described carefully by the author, but some mistakes are made due to tracking failures. The results can be seen in Table 6, where SPEED UP is the least reliable command. The reason is that when the hands move up they sometimes occlude the head, confusing the tracking filters.

9.6 Speed and optimization

Skin segmentation is a very time-consuming process: heavy operations must be computed for each pixel in the image. The parametric model uses a three-dimensional Mixture of Gaussians, which means that a lot of matrix multiplications are needed.

Figure 20: Leaving head pixels out of the model update and using only hand blobs that are being tracked, the model updates much better than before. Both hands are segmented during this sequence, but sometimes they are not chosen as one of the 3 biggest blobs because of the pieces of wood in the background. Head pixels are used once the tracking gets stable.

Figure 21: Example test run obtained with a low resolution web camera.

Figure 22: Known postures are REST (2), STOP (2), LEFT, RIGHT and UP (2).

Figure 23: Training set: LEFT (2), RIGHT (2), STOP (2), UP (2) and REST (4).

The system should run in real time, and without any optimization that was completely unreachable, so a lot of effort was put into making the program fast. The model and its updating operations became more complex, with many precalculation steps that avoid repeating work while segmenting an image. Some matrix multiplications were also optimized by exploiting their known number of zero elements or their symmetry. The program runs on a GNU/Linux system with a 1.6 GHz Pentium processor. At that point it segmented around 12 frames (320 x 240 pixels each) per second; this rate includes grabbing and displaying the images, which accounts for approximately 13% of the processing time. Tables 7 and 8 show the execution profile for this slow version. Color transformation and skin segmentation take almost 70% of total execution time. When using the RGB color space directly from the camera, the LUT transformation can be avoided, which saves 8% of the time. Clearly, the next optimization has to address the skin segmentation step: about 40% of skin segmentation time is spent on background probability calculation.
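The kind of precalculation involved can be illustrated with a hedged sketch (names and structure are illustrative, not the thesis code): for each Gaussian, the inverse covariance and the log of the normalization constant are computed once per model update, so per-pixel evaluation reduces to a cheap quadratic form.

    import numpy as np

    def precompute(weights, means, covs):
        """Precompute per-Gaussian constants once per model update.
        weights: (K,), means: (K, 3), covs: (K, 3, 3) for a K-component RGB GMM."""
        inv_covs = np.linalg.inv(covs)                      # (K, 3, 3)
        log_norm = (np.log(weights)
                    - 0.5 * np.log(np.linalg.det(covs))
                    - 1.5 * np.log(2.0 * np.pi))            # (K,)
        return inv_covs, log_norm

    def log_likelihood(pixels, means, inv_covs, log_norm):
        """Per-pixel GMM log-likelihood using the precomputed constants.
        pixels: (N, 3) RGB values; all heavy per-Gaussian work was done above."""
        d = pixels[:, None, :] - means[None, :, :]          # (N, K, 3)
        q = np.einsum('nki,kij,nkj->nk', d, inv_covs, d)    # quadratic forms
        comp = log_norm[None, :] - 0.5 * q                  # (N, K) component log-probs
        m = comp.max(axis=1, keepdims=True)
        return (m + np.log(np.exp(comp - m).sum(axis=1, keepdims=True))).ravel()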

          Euclidean    Pseudo
LEFT
LEFT
RIGHT
RIGHT
STOP
STOP
UP
UP
REST
REST
REST
REST

Table 4: Distances between a random RIGHT posture and all training samples.

%        REST    STOP    LEFT    RIGHT    UP
REST
STOP
LEFT
RIGHT
UP

Table 5: Confusion matrix using a minimum-distance threshold to accept a recognition. System recognitions are rows, real postures are columns.

%      S    MR    ML    SU    SD
S
MR
ML
SU
SD

Table 6: Gesture confusion matrix. System recognitions are rows, real gestures are columns. S - START, MR - MOVE RIGHT, ML - MOVE LEFT, SU - SPEED UP, SD - SPEED DOWN.

Since the background is assumed to be static (section 9.3), its probability can be pre-calculated as a LUT. Building the table takes some time at the beginning of the execution, but it pays off, as can be seen in Table 9: the program speeds up to 15 frames per second after this optimization. Skin segmentation is still the heaviest step, taking over 50% of the time; a sketch of the precalculation follows.
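A hedged sketch of that precalculation, assuming 8-bit RGB input (the quantization, table layout, and names are illustrative): the background model is evaluated once for every quantized RGB value and stored, so at run time the background probability of a pixel is a single table lookup.

    import numpy as np

    BITS = 5  # quantize each channel to 2**BITS levels; assumed value

    def build_background_lut(bg_log_likelihood, bits=BITS):
        """Evaluate the (static) background model once per quantized RGB triple.
        bg_log_likelihood: function mapping (N, 3) RGB floats to (N,) log-probs."""
        levels = 1 << bits
        step = 256 // levels
        axis = (np.arange(levels) * step + step // 2).astype(np.float64)
        grid = np.stack(np.meshgrid(axis, axis, axis, indexing='ij'), axis=-1)
        return bg_log_likelihood(grid.reshape(-1, 3)).reshape(levels, levels, levels)

    def lookup(image, lut, bits=BITS):
        """Per-pixel background log-probability by table lookup.
        image: (H, W, 3) uint8. Only the adaptive skin model is evaluated online."""
        q = image >> (8 - bits)
        return lut[q[..., 0], q[..., 1], q[..., 2]]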

Figure 24: Gestures that the system recognizes to date: MOVE LEFT, MOVE RIGHT, SPEED UP, SPEED DOWN, START.

A final optimization idea is the following: it is not always necessary to segment the whole image. When tracking runs with high confidence, it is possible to predict where the blobs will be located in the next frame, so a partial segmentation of that frame can be done. Experiments show that partial segmentation was performed in 85% of the frames. This is another example of how tracking integration helps the system perform better, raising its speed to over 20 frames per second. Table 10 shows the execution profile for the final optimized version.
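A hedged sketch of this tracking-guided partial segmentation (the margin and the Kalman prediction interface are assumptions): classify only a window around each predicted blob position and fall back to a full-frame pass when confidence is low.

    import numpy as np

    MARGIN = 20  # pixels of slack around each predicted blob; assumed value

    def segment_partial(image, predictions, classify, margin=MARGIN):
        """predictions: list of (cx, cy, w, h, confident) from the Kalman filters.
        classify: function mapping an RGB sub-image to a binary skin mask.
        Returns a full-size mask, computed only inside the predicted windows."""
        h, w = image.shape[:2]
        if not all(p[4] for p in predictions):
            return classify(image)      # low confidence: segment the whole frame
        mask = np.zeros((h, w), dtype=bool)
        for cx, cy, bw, bh, _ in predictions:
            x0 = max(int(cx - bw / 2) - margin, 0)
            y0 = max(int(cy - bh / 2) - margin, 0)
            x1 = min(int(cx + bw / 2) + margin, w)
            y1 = min(int(cy + bh / 2) + margin, h)
            mask[y0:y1, x0:x1] = classify(image[y0:y1, x0:x1])
        return mask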

OPERATION             %
Skin segmentation
Show image            6.99
PCA analysis          6.63
Update model          6.02
Get image             5.86
Morphology            3.88
Search blobs          2.21
HMM calculations      0.02
Kalman filter         0.01
Other operations      5.14

Table 7: Execution profile during a long sequence using RGB directly from the camera.

OPERATION             %
Skin segmentation
LUT transformation    7.97
Show image            6.56
PCA analysis          6.07
Update model          5.66
Get image             5.65
Morphology            3.59
Search blobs          2.08
HMM calculations      0.02
Kalman filter         0.01
Other operations      2.11

Table 8: Execution profile during a long sequence with color transformation.

OPERATION             %
Skin segmentation
Show image            8.09
PCA analysis          7.47
Update model          6.84
Get image             6.72
Backg. precalculation 5.91
Morphology            4.47
Search blobs          2.53
HMM calculations      0.02
Kalman filter         0.01
Other operations      6.18

Table 9: Execution profile using background precalculation.

10 Conclusion

This system is a complete posture and gesture recognizer. Its flexible design makes it suitable for use as a command interface for any application, although its motivation is to be integrated in a Programming by Demonstration (PbD) framework such as a service robot.

OPERATION             %
Skin segmentation
Show image
PCA analysis          9.78
Update model          9.39
Get image             8.64
Backg. precalculation 7.19
Morphology            6.11
Search blobs          3.26
HMM calculations      0.02
Kalman filter         0.01
Other operations      9.36

Table 10: Execution profile using partial segmentation.

The system runs in real time, processing around 20 frames per second at a frame size of 320 x 240 pixels.

Skin segmentation is the most important step in the process: all later results depend on its robustness to illumination changes during interaction. As expected, it is also the most time-consuming step, but spending more than 35% of execution time on skin pixel classification is justified when it provides the desired robustness. The color space that gives the best results is RGB. Other spaces (such as HSV, which is often used in skin segmentation) are more easily confused by non-skin pixels while the model is being updated. Moreover, using RGB directly from the camera speeds up the program, saving 8% of execution time. The adaptation algorithm and its policy work well, making the classification deliver good hand shapes; a set of morphological operations helps towards that purpose by adding some extra pixels to the hand blob. Those blob pixels adapt the color model, making future classifications better and better. Only hand pixels are allowed to update the model when tracking begins, because head pixels can contain other colors (hair, lips...); head pixels are used later, once tracking has run for some more time. Updating is executed almost every frame (every frame in which a hand is being tracked) and takes less than 10% of execution time.

Filtering and tracking leave room for improvement. The Kalman filter works well as long as the user interacts in simple scenes: no other users should appear in the scene, or it will not be able to track properly. Occlusions of hands and head are very common while describing a gesture; they do not necessarily interrupt tracking, but they disturb the filtered trajectory. User movements should be made slowly, or the tracker will get confused and require re-initialization. Kalman filters demand a careful user; otherwise another technique should be used to track the hands.

PCA works fine, although sometimes it takes a few frames to properly recognize the posture of the hand. When a hand moves or changes its posture, some shadows appear on it, and the model needs a few frames to learn the new colors and adapt to them, covering the whole hand. Even if a finger cannot be segmented properly, posture recognition is reliable after a few frames. PCA is performed every frame for both hands, taking 10% of total time.

Gesture recognition also provides a high recognition rate, even when the system is tested by different users, once they get used to it. In conclusion, the system is user-independent, although new users may need some time to get accustomed to it: skin adaptation does not take long at the beginning, posture recognition works properly after that, and it is easy to describe the hand gestures at the correct speed (Fig. 25).

Figure 25: Sequence of interaction with the system
