Gesture Recognition to control Mindstorms robot


University of Manchester
School of Computer Science

Gesture Recognition to control Mindstorms robot

Third Year Project Report

Meida Pociunaite
BSc Artificial Intelligence

Supervisor: Dr. Ke Chen

April

Abstract

Author: Meida Pociunaite
Gesture Recognition to control Mindstorms robot

Over the past few years, gesture recognition has made its debut in the entertainment and gaming markets. Now it is becoming a commonplace technology, enabling humans and machines to interface more easily in the home, the automobile and at work. The familiar example of a person sitting on a couch and controlling the lights and TV with a wave of the hand is being realized by gesture recognition technology, which enables natural interaction with the electronics that surround us. This report describes a third year project, Gesture Recognition to control Mindstorms robot. My motivation for this project came from being a part of the 'Robogals Manchester' student society, where I teach children to build and program LEGO Mindstorms robots. The goal was to design a basic gesture recognition system that understands a small number of predefined gestures used to control the LEGO robot. The main issues investigated in this project include salient gesture feature extraction/representation and appropriate recognition techniques. The report details the research, design, development and testing of the system. Many project objectives have been met; however, the system has some limitations, so further improvements to the system are suggested.

Supervisor: Dr. Ke Chen

Contents

1 Introduction
  1.1 Motivation
  1.2 Aims and Objectives of the Project
  1.3 Report Outline
2 Background
  2.1 Gesture Recognition Approaches
    2.1.1 Input devices and methods
    2.1.2 Pre-processing
    2.1.3 Feature selection
    2.1.4 Machine Learning algorithms
3 Design
  3.1 Overall System
  3.2 Devices and APIs
  3.3 Dataset
  3.4 Supervised learning
    3.4.1 Training
    3.4.2 Testing (Offline, Online)
  3.5 Graphical User Interface requirements
4 Implementation
  4.1 Capturing gestures
  4.2 Pre-processing (Detecting skin, Blob correction)
  4.3 Feature extraction (Edge detection, Centroid Calculation, Ratios Calculation)
  4.4 Training HMMs
  4.5 Recognizing gestures
  4.6 Controlling robot
5 Results

6 Testing and Evaluation
  6.1 Offline
  6.2 Online
7 Conclusion
  7.1 Problems & Limitations
  7.2 Improvements
  7.3 Summary
References

List of Figures

1.1: Gesture Recognition System
1.2: Project aim
2.1: Input methods
2.2: Input devices [23][24]
2.3: Vision-based methods to capture gesture [25]
2.4: Xbox 360 Kinect [29]
2.5: Pre-processing methods [35][36][37]
2.6: Median filter [39]
2.7: Dynamic Time Warping [49]
2.8: Neural Network [52]
2.9: Hidden Markov Model [56]
3.1: Dataset [61]
3.2: Supervised learning
3.3: Training flowchart
3.4: K-fold cross-validation
3.5: Online testing flowchart
4.1: Image acquisition
4.2: Skin detection
4.3: Median filter result
4.4: Blob algorithm
4.5: Blob results
4.6: Edge detection
4.7: Hand shape centroid
4.8: Ratio calculation regions in gray: horizontally and vertically
4.9: Hidden Markov Models
4.10: LEGO NXT Mindstorms robot [69]
5.1: Main window screenshot
5.2: Recognized gesture status
5.3: Real-time results
6.1: Gesture classes used for classification
6.2: Testing results using two gesture classes
6.3: More detailed testing results using two gesture classes
6.4: Gesture classes used for classification
6.5: Testing results using three gesture classes
6.6: Gesture classes used for classification
6.7: Testing results using four gesture classes
6.8: Online testing results
7.1: Lighting conditions problem
7.2: Skin color objects problem

Chapter 1
Introduction

A gesture is an expressive, meaningful physical action of the human body, joints, fingers or face that serves as a symbol to convey meaning and information or is used to interact with the environment [1][2]. Gesture recognition is a subject in computer science and language technology with the goal of interpreting human body gestures via mathematical algorithms [3]. One of the main purposes of gesture recognition research is to identify a specific human gesture and convey to the user the information pertaining to that individual gesture [4]. Hand gesture recognition can be used to enhance human-computer interaction without depending on traditional input devices such as the keyboard and mouse, and gives rise to many applications such as hardware-free remote controls. In this chapter, the motivation, the objectives of the project and a short overview of the report are stated.

1.1 Motivation

This project is related to computer vision. This field has made huge progress over recent years, since its techniques can be seen in devices that are widely used in everyday life, such as facial recognition in smartphones or the Xbox 360 Kinect gaming console. Moreover, gesture recognition is widely used in medical equipment to find and diagnose diseases, in reconstructing crime scenes from videos and photographs, and also in telerobotic control applications [5]. Robotic systems can be controlled naturally and intuitively with such telerobotic communication [6][7]. A significant benefit of such a system is that it presents a natural way to send geometrical information or commands to the robot, such as left, right, etc. A robotic hand can be controlled remotely by hand gestures. Research has been carried out in this area for a long time, and several approaches have been developed for sensing hand movements and controlling a robotic hand [8][9][10]. The first data glove, called the Sayre Glove [11], was created in the Electronic Visualization Laboratory in 1977 and marked the start of studies of gesture recognition techniques. Over the past thirty-five years, researchers have gradually adopted the camera to implement Human Computer Interaction (HCI). Compared with data gloves, gesture recognition through cameras makes HCI more natural and direct. Pre-processing, feature extraction and classification are the three tasks of a gesture recognition system

(shown in figure 1.1). For the pre-processing task, a camera is usually used to capture RGB images, and the system then utilizes the characteristics of the human complexion to separate the gesture from the background. Lin detects the skin candidate regions in the color image with a Gaussian Mixture Model (GMM) skin model [12], and in Kramberger's research, a parametric skin color model is used to improve the detection accuracy of pixel-based skin color segmentation [13]. For the feature extraction and recognition tasks, recent research normally falls into two categories: (1) probabilistic graph model based methods, which involve the Hidden Markov Model (HMM) [14], Dynamic Time Warping and Neural Networks; (2) template based methods, which are related to the template matching algorithm [15][16]. By employing these algorithms, some researchers implement the recognition task based on the skeleton: their systems capture the main points of the human skeleton and recognize the motions. Meanwhile, others achieve the interaction based on static hand gestures, such as the peace or OK gestures. However, these gestures are recognized by the extended fingers, which means the bent-finger information is lost in the recognition process. In general, this project is mostly concerned with recognition techniques for finding and recognizing sequences of dynamic gestures based on a probabilistic model such as the HMM, which will be discussed in more detail in the following chapter. The real-time recognition system can identify four different classes of gesture commands: move forward, turn right, turn left and move backward. The Mindstorms robot receives the gesture information wirelessly and moves accordingly, so the robot can interact with people promptly through human hand gesture recognition.

Figure 1.1: Gesture Recognition System

1.2 Aims and Objectives of the Project

The goal of this project is to design and develop an efficient gesture recognition system used to control the Mindstorms robot wirelessly according to the recognized gesture, while

researching and experimenting with machine learning algorithms and gesture representation data, as illustrated in figure 1.2. The main objectives of this project can be summarised as follows:

- Implement skin detection
- Find the most suitable data to represent gestures
- Recognize dynamic gestures
- Recognize gestures in real time
- Connect to and control the Mindstorms robot using gestures wirelessly
- Learn and understand existing gesture recognition algorithms

Figure 1.2: Project aim

1.3 Report Outline

This report is divided into seven chapters. Firstly, the background of the project and an overview of the approaches used are discussed in chapter 2. Chapter 3 provides information about the system design, followed by the implementation details, such as the selected features, gesture representation and machine learning algorithms, in chapter 4. The results of the system are shown in chapter 5, while chapter 6 describes how the system was tested and evaluated. Finally, the conclusion of the project, its limitations, improvements and suggestions for further work are given in chapter 7.

Chapter 2
Background

This chapter describes essential background knowledge and previous work in this project area. A comparison of various input methods and machine learning algorithms is given.

2.1 Gesture Recognition Approaches

There are many ways to create a gesture recognition system, depending on which input device, which type of input data to represent the gesture, and which machine learning algorithms are used.

2.1.1 Input devices and methods

To start with, there are two main types of gesture input methods used for gesture recognition systems (see figure 2.1). A particular pose [17] or still position [18] made by a human is called a static gesture. The pose is held throughout the exertion, or in other cases the user needs to define the start and end poses (see figure 2.1(A)). A dynamic gesture is a sequence of static gestures captured over a period of time (see figure 2.1(B)) [19].

Figure 2.1: Input methods

A gesture can be represented in different ways, depending on the type of input device and data [20]. First of all, gestures can be captured using two types of devices: sensor based or vision based. The approach where optical or mechanical sensors are used, or

sometimes even attached to the user, to directly output data such as angles, positions and orientations, is called sensor based. This technique is used for handheld devices such as wands, which use position tracking, rotation tracking and buttons to interact with 3D objects [21]. More complex gestures are possible with gloves that also measure the movement of the fingers [22]. Unlike optical sensors, such sensors are usually more reliable and are not affected by lighting conditions or obstructed backgrounds. However, as this approach requires the user to wear a data glove and sometimes requires calibration, it is inconvenient to use. Also, such data gloves are usually more expensive than optical sensors, e.g. cameras. As a result, it is not a very popular way to do hand gesture recognition. Figure 2.2 shows an example of a wand and a data glove.

Figure 2.2: Input devices [23][24]

The second type of gesture capturing approach is vision based, where input images are captured using a 2D or 3D camera (figure 2.3). The method to be used depends on the way the image is going to be analyzed. The appearance based method extracts features like colour, shape or size from the image's visual appearance (see figure 2.3(A)). The method where the image and the depth of field are used to produce a volumetric representation, i.e. a mesh of the hand, is 3D camera based (see figure 2.3(C)). This method is quite computationally expensive, therefore there exists a simplified version, the skeletal based method, where only key joints are used instead, as illustrated in figure 2.3(B). Unfortunately, many vision based techniques can give poor results depending on the tracked objects, the background and lighting conditions.

Figure 2.3: Vision-based methods to capture gesture [25]

Although a large amount of research has been done in vision based gesture recognition, there is some variation in the tools and environments used between implementations. Therefore, there are some hybrid systems, which include both the vision and sensor approaches. A well-known and popular example device is the Xbox 360 Kinect, which was mentioned earlier in the introduction (figure 2.4). It is a depth measurement system that consists of two parts, the infrared (IR) laser emitter and the IR camera [26]. The emitter creates a known noisy pattern of structured IR light [27]. The output of the emitter has a pattern of nine bright dots, caused by the imperfect filtering of light. The dots are recorded by the infrared camera and then compared to a known pattern, where any disturbances are taken to be variations in the surface and are detected as closer or further away. The depth sensing works on the principle of structured light, which is the process of projecting a known pattern of pixels onto a scene and looking at the way it deforms when meeting surfaces. This allows the vision system to calculate the depth and surface information of the objects [28].

Figure 2.4: Xbox 360 Kinect [29]

2.1.2 Pre-processing

After the image has been obtained, various methods of processing can be applied to it to perform the many different vision tasks required. The main purpose of the pre-processing stage is to extract the hand shape, or in other words remove the unwanted background from the input image, and pass it on to the feature selection stage, which will be discussed in the next section. The pre-processing stage is of great importance to the success of the classification. If the chosen approach fails to extract a proper hand shape from the input image, the system will not be able to achieve the desired results. Furthermore, it is vital that the pre-processing part is as robust as possible so that environment changes such as noise, background and lighting conditions do not affect the classification results.

The first step that needs to be done is segmentation. It is a way of spatially separating the background from the object, which is the hand in the gesture recognition case. One of the methods to do that is simple threshold-based background segmentation, as presented in this paper [30] (see figure 2.5(A)). The method uses the image color intensities, which are classified by a threshold value to convert the image into binary form. Even though this method is easy to implement, the problem is that it works only for images in which objects are distinct from the background in intensity. Moreover, the threshold is a parameter which is, in general, difficult to adjust automatically.

The next method is color-based segmentation. Specifically, skin color-based segmentation is widely used for gesture recognition systems. Specific skin color rules are applied to the image to threshold the background from the hand and make the image binary, as shown in figure 2.5(B). Which rules are applied depends on the image color space: different rules apply to YCbCr, RGB, HSV, etc. The disadvantage of this approach is that if the background contains any object having the same color as the hand, the noise will be very high [31]. However, P. Kovac's work shows that this approach can be used under different illumination conditions, as he classified skin colour by heuristic rules that take into account two different conditions: uniform daylight and flash illumination [32].

Background subtraction is another segmentation method used for gesture recognition systems. Firstly, the background image is taken and stored. When the next image frame is taken, the previously stored background is subtracted from it. Therefore, this approach gives only the moving or dynamic parts, which in the gesture recognition case would be the hands or other body parts (figure 2.5(C) illustrates it). This method is computationally efficient and can effectively segment the hand's region against complex backgrounds. The disadvantage of this method is that if the lighting conditions change suddenly, then there

is a change in pixel value wherever the light intensity changed, and additive noise contributes to the output [33][34].

Figure 2.5: Pre-processing methods [35][36][37]

A further approach to improve the image obtained after segmentation is to apply filtering, as suggested in this paper [38]. Even though the segmentation process can successfully extract the hand, it may also create holes inside the hand shape. Therefore, filtering is a useful step in order to be able to properly extract hand features in the next stage. The most widely used filter for removing holes is the median filter, which assigns each pixel the median value of its neighbors. Accordingly, in a black and white image, the pixel is given the same color as the majority of that pixel's neighbours. An example is shown in figure 2.6.

Figure 2.6: Median filter [39]
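To make the neighbourhood idea concrete, the following is a minimal sketch, assuming the binary image is held as a bool array (true = white); it is an illustration only, not the AForge.NET filter used later in this project, and the class and method names are hypothetical.

static class BinaryMedian
{
    // Median filter on a binary image: each output pixel takes the majority
    // value of its 3x3 neighbourhood, which closes small isolated holes.
    public static bool[,] Filter3x3(bool[,] src)
    {
        int h = src.GetLength(0), w = src.GetLength(1);
        var dst = new bool[h, w];
        for (int y = 0; y < h; y++)
        {
            for (int x = 0; x < w; x++)
            {
                int white = 0, total = 0;
                for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                {
                    int ny = y + dy, nx = x + dx;
                    if (ny < 0 || ny >= h || nx < 0 || nx >= w) continue;
                    total++;
                    if (src[ny, nx]) white++;
                }
                // The median of a binary neighbourhood is simply its majority value.
                dst[y, x] = white * 2 > total;
            }
        }
        return dst;
    }
}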

2.1.3 Feature selection

In this stage, hand features are extracted and passed later on to the gesture recognition stage for classification. There is no doubt that selecting good features to recognize the hand gesture path plays a significant role in the system performance. Good features are characterised as features that correlate highly within each gesture class but differ between gesture classes [40]. Sometimes, having more features may improve the gesture recognition quality, but having too many features may induce over-fitting and reduced robustness. Therefore, it takes careful thought and design to decide what data to collect and what features to compute in order to get accurate predictions. Some examples of the features used are given below [41][42][43], classified into two groups: high and low level [44]. The low-level features are easier to obtain, but high-level features can produce better results even though they are harder to extract:

- High-level features: fingertips, fingers, joint locations, etc.
- Low-level features: colors, contours, edges, silhouettes, shapes, etc.

As there are many features to choose from, quite often feature vectors rather than a single feature are used to represent a gesture. Therefore, each gesture frame is represented as a sequence of features, which is especially useful for dynamic gesture recognition.

2.1.4 Machine Learning algorithms

Machine Learning algorithms have the ability to learn from experience or, in other words, to modify their execution based on newly acquired information. A typical machine learning algorithm uses given examples to build up classification or decision making abilities in order to proceed on new, unfamiliar inputs. There are three types of learning algorithms [45][46]:

Unsupervised Learning - the machine is only given a training set, but no classification for these examples is given. This type of method is used for data clustering problems.

Supervised Learning - the machine is given a set of examples and the correct classification or labels for them. This method is used for classification and regression problems, with models such as decision trees, Neural Networks, Support Vector Machines (SVM), etc.

Reinforcement Learning - no examples are given, but positive or negative reinforcement is given after each decision is made. This type of method is used mostly for control problems where there is feedback such as winning or losing a game.

For general purpose gesture recognition systems the supervised machine learning method is used. The various supervised learning models can be trained and subsequently used for classification, depending on the data collected from the input device. Below follows a brief explanation of the most common models.

Dynamic Time Warping (DTW) - is especially suitable for data that vary in length over time [47]. The sequences are warped to match each other in length by repeating one or more elements in either sequence (see figure 2.7). The distance between the sequences, a measure of how comparable they are, is calculated by summing the distances between corresponding elements. For classification, the incoming gesture is compared to all known instances of gestures, and a machine learning algorithm such as Nearest Neighbour is needed, where the incoming gesture is classified as the gesture of the sequence it is closest to. One of the main issues with using a plain distance measure for the similarity between two time series is that the results can sometimes be very unintuitive: if, for example, two time series are identical but slightly out of phase with each other, then such a measure will report a very poor similarity [48].

Figure 2.7: Dynamic Time Warping [49]
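As an illustration of the warping idea (this project does not use DTW), a minimal sketch of the standard DTW distance between two one-dimensional sequences is given below; the class and method names are hypothetical.

using System;

static class Dtw
{
    // Classic dynamic-programming DTW distance between two 1-D sequences.
    // dtw[i, j] holds the cost of aligning a[0..i-1] with b[0..j-1].
    public static double Distance(double[] a, double[] b)
    {
        int n = a.Length, m = b.Length;
        var dtw = new double[n + 1, m + 1];
        for (int i = 0; i <= n; i++)
            for (int j = 0; j <= m; j++)
                dtw[i, j] = double.PositiveInfinity;
        dtw[0, 0] = 0;

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                double cost = Math.Abs(a[i - 1] - b[j - 1]);
                // Repeating (warping) elements of either sequence corresponds to
                // moving horizontally, vertically or diagonally through the grid.
                dtw[i, j] = cost + Math.Min(dtw[i - 1, j],
                                 Math.Min(dtw[i, j - 1], dtw[i - 1, j - 1]));
            }
        }
        return dtw[n, m];
    }
}

A nearest-neighbour classifier would then assign the incoming gesture the label of the training sequence with the smallest DTW distance.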

Neural Network - is a mathematical model inspired by biological neural networks. It consists of a number of parallel processing units with multiple inputs and one output. There are connections between the nodes (neurons), with weights (importance) given to them, as illustrated in figure 2.8. In the starting layer, each piece of input is fed to a node [50]. When the inputs are passed along to the next nodes, each input is multiplied by the weight of the connection it follows; the weighted inputs reaching a node in the next layer are summed and, if the resulting value is high enough, passed along through the rest of the network. When the output layer is reached, the node with the highest value is chosen. One of the disadvantages of Neural Networks is that they are unable to explain the model or network that has been built in a useful way. This explanation is especially important for analysts, who want to know how the model behaves [51]. Moreover, if the input data does not represent the problem well, a Neural Network will not produce good results. Therefore, it takes careful thought to understand the problem and the outcome that is expected.

Figure 2.8: Neural Network [52]

Hidden Markov Model (HMM) - is another tool for representing probability distributions over sequences of observations. Hidden Markov Models can accept time-sequential data and are therefore suitable for dynamic gesture recognition. HMMs are generative models, in which a finite number of states are connected by transitions, and which can generate an observation sequence depending on their transition, emission and initial probabilities [53]. Consequently, the joint distribution of observations and hidden states (which are not directly observable), or equivalently both the prior distribution of hidden states (the transition probabilities) and the conditional distribution of observations given states (the emission probabilities), is modeled (see figure 2.9). The system can be in any one state at a single

moment in time, where each state depends only on the state preceding it (Markov property) [54]. The Markov property is the assumption that the probability of any variable in the chain depends not on the whole sequence of states that precede it, but only on the previous variable. The formula for the Markov property is [55]:

P(x_n | x_1, x_2, ..., x_{n-1}) = P(x_n | x_{n-1})

For example:

P(x_4 | x_1, x_2, x_3) = P(x_4 | x_3)

It is derived from the chain rule, where the joint distribution of a set of random variables is calculated using conditional probabilities:

P(x_1, x_2, ..., x_n) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) ... P(x_n | x_1, ..., x_{n-1})

For example:

P(x_1, x_2, x_3) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2)

In figure 2.9:
x - states
y - possible observations (emission probabilities)
a - state transition probabilities
b - output probabilities

Figure 2.9: Hidden Markov Model [56]

A basic Hidden Markov Model is usually defined by λ = (N, M, A, B, P) [57], where:

N is the number of states.
M is the number of observation symbols per state.
A is the NxN state transition matrix, the probability distribution of transitions between states.
B is the NxM emission matrix, the probability distribution of the observations from each state.
P is the prior probability vector, the initial state probability distribution at time 0.

There are well known problems to be solved when using HMMs [57]. The first one is the evaluation problem: given a model and an observation sequence, compute the probability that the observation sequence was produced by this model. The second one is the decoding problem: given a model and an observation sequence, determine the most likely sequence of states that produced the given observation sequence. And finally, the last one is the learning problem: given an observation sequence, compute the optimal model parameters to maximise the probability of the observation sequence originating from the model [58]. The efficient solution to the first problem is the Forward algorithm, which computes the probability of the observation sequence by summing over all possible state sequences. To solve the decoding problem, the Viterbi algorithm is used, which is similar to the Forward algorithm, with the difference that the summation over the states at a certain time step becomes a maximization [59]. It calculates the probability of the most likely path that ends at each possible state at the final time given an observation sequence; among these, the highest probability path is the desired solution. Further, the Baum-Welch algorithm is used to solve the learning problem. It finds the most likely parameters, such as the transition, emission and prior probabilities, for a model to have produced a given observation sequence [60]. As a part of Baum-Welch, the Forward-Backward algorithm is also used; it calculates the posterior probabilities of individual states at a certain time step given the entire observation sequence [57]. To conclude, the Hidden Markov Model is widely used in gesture and speech recognition systems, as it works well for time-varying classification, and therefore this is the generative model I have used for the classification stage of my project.
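To make the evaluation problem concrete, below is a minimal sketch of the Forward algorithm for a discrete-observation HMM defined by the A, B and P matrices above. It is an illustration only; the project itself relies on the Accord.NET implementation described in Chapter 4, and the names used here are hypothetical.

static class HmmForward
{
    // Forward algorithm for a discrete HMM: A = NxN transition matrix,
    // B = NxM emission matrix, P = initial state distribution,
    // obs = observation symbol indices. Returns P(observation sequence | model).
    public static double Evaluate(double[,] A, double[,] B, double[] P, int[] obs)
    {
        int n = P.Length;       // number of states
        int t = obs.Length;     // length of the observation sequence
        var alpha = new double[n];

        // Initialisation: alpha_1(i) = P(i) * B(i, o_1)
        for (int i = 0; i < n; i++)
            alpha[i] = P[i] * B[i, obs[0]];

        // Induction: alpha_{k+1}(j) = [sum_i alpha_k(i) * A(i, j)] * B(j, o_{k+1})
        for (int k = 1; k < t; k++)
        {
            var next = new double[n];
            for (int j = 0; j < n; j++)
            {
                double sum = 0;
                for (int i = 0; i < n; i++)
                    sum += alpha[i] * A[i, j];
                next[j] = sum * B[j, obs[k]];
            }
            alpha = next;
        }

        // Termination: sum the final alphas over all states.
        double prob = 0;
        for (int i = 0; i < n; i++)
            prob += alpha[i];
        return prob;
    }
}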

Chapter 3
Design

The design of the system is based on the approaches described previously. This chapter introduces the overall system architecture, the devices and libraries used, and then gives a more detailed description of the design of the main system components.

3.1 Overall System

For the pre-processing, skin detection has been chosen as the hand segmentation method. Initially, the hand centroid feature alone was used for classification, but as it did not provide good results, the ratios between black and white pixels (which will be discussed further in Chapter 4) were added to improve the system. Moreover, a machine learning algorithm must be applied to the known gestures in order to predict unseen gestures in the future. From Chapter 2, it is clear that particular machine learning algorithms are better suited to different types of gestures than others. As this project is based on dynamic gestures, I used the Hidden Markov Model for classification, because it is a generative model and therefore more suitable for time-varying data than the other algorithms mentioned before.

3.2 Devices and APIs

The vision based approach seems to be the most popular nowadays, as it does not require the user to wear anything specific as in the sensor-based approach, and it is therefore more appropriate for interfacing easily at home, in the automobile and at work. Since I used the standard built-in FaceTime HD camera, my gesture recognition system is appearance based. I also used the LEGO Mindstorms NXT robot, which is usually programmed using the graphical NXT software provided by LEGO; but as I needed to control the robot wirelessly, I had to use external tools to deal with it. MonoBrick is a communication library which allowed me to do that on Mac OS, and because this library is written in C#, C# was chosen as the main language for the whole project. Therefore I used the Xamarin Studio development environment and the Mono framework to allow me to easily develop the application on Mac OS in C#. In addition, Xcode was used to create the software's Graphical User Interface (GUI). The AForge.NET C# framework has

been used for the image processing tasks such as filtering and image analysis. Additionally, for the classification stage, the choice was made based on the advantages and disadvantages of the machine learning algorithms discussed in Chapter 2. The Accord.NET framework for building machine learning applications was therefore used to implement the HMMs.

3.3 Dataset

In order to make a system recognize a new gesture image, it simply needs to learn from a given data set and find the closest gesture class match. For this project, I have chosen the Cambridge hand gesture data set, which consists of 900 image sequences of nine gesture classes defined by three primitive hand shapes and three primitive motions. Each class contains one hundred image sequences. But because I use gestures to control the robot, I needed just four gestures, to turn right and left and to move forward and backward (see figure 3.1 for the gestures chosen).

Figure 3.1: Dataset [61]

3.4 Supervised learning

This section describes how supervised learning was used to design the training and testing phases. As mentioned before, supervised learning means that the system is given a set of examples and the correct labels for them [62]. An example is illustrated in figure 3.2.

Figure 3.2: Supervised learning

3.4.1 Training

Training is the process of looking over many images of the data set with known skin regions, extracting features, applying the learning algorithm and creating a trained model, which will be used to classify new, unknown images. Figure 3.3 shows a flowchart of the training.

Figure 3.3: Training flowchart

3.4.2 Testing

Once the model has been trained, it can then determine the identity of a new, previously unseen input image, which could come from a testing set in offline testing or from online data in real time.

Offline

To do the offline testing I used cross-validation, which involves partitioning the data set into complementary subsets. The model is trained on one subset (the training set) and then validated on the second subset (the test set) [63]. Specifically, I used the most common type of cross-validation, K-fold cross-validation, where K is the number of subsample partitions. As I had twenty image sequences for each gesture class, I divided each class into five folds. A single subsample was retained as the test data and the remaining four subsamples were used as the training data, as illustrated in Figure 3.4. This process was repeated multiple times using different partitions, with each of the folds used exactly once as the validation set (a small partitioning sketch is given at the end of this section).

Figure 3.4: K-fold cross-validation

Online

Online testing is the process of using the trained model created in the training phase to classify unseen images inputted from the webcam (see the testing flowchart in figure 3.5).
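The sketch below illustrates the fold layout used for the offline testing, assuming 20 sequences per gesture class split into 5 folds of 4 sequences; it is an illustrative reconstruction rather than the project's code, and the helper names are hypothetical.

using System.Collections.Generic;
using System.Linq;

static class KFold
{
    // Splits the sequence indices 0..count-1 of one gesture class into k folds,
    // e.g. 20 sequences into 5 folds of 4, as used for the cross-validation.
    public static List<int[]> Split(int count, int k)
    {
        int foldSize = count / k;
        return Enumerable.Range(0, k)
            .Select(f => Enumerable.Range(f * foldSize, foldSize).ToArray())
            .ToList();
    }
}

// Usage: one fold becomes the test set, the remaining k-1 folds the training set.
// var folds = KFold.Split(20, 5);
// var testSet = folds[0];
// var trainSet = folds.Skip(1).SelectMany(x => x).ToArray();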

Figure 3.5: Online testing flowchart

3.5 Graphical User Interface requirements

In order to be able to test the system in real time, a simple piece of software was needed. The software needs the ability to:

- Add a gesture frame
- Start and stop grabbing gestures
- Restart the gesture grabbing
- Recognise gestures in real time
- Show which gesture is recognized
- Show the system status

Chapter 4
Implementation

This chapter describes in detail the techniques used to implement the system. A brief explanation of the development methodology used is also presented, supplemented by code snippets.

4.1 Capturing gestures

The webcam is used to capture a real-time video stream of hand gestures to generate commands for the robot. Images are captured by controlling the webcam's processing frame rate (see figure 4.1). The robot moves in four possible directions using four possible types of commands, which are right, left, forward and backward. The user needs to take five image frames by pressing the button, and then image processing and classification are done to extract the gesture command.

Figure 4.1: Image acquisition

The code below demonstrates how the video streaming was achieved:

// Find video device
captureSession = new QTCaptureSession ();
var device = QTCaptureDevice.GetDefaultInputDevice (QTMediaType.Video);

// Add device input
captureInput = new QTCaptureDeviceInput (device);

// Create decompressor for video output, to get raw frames
decompressedVideo = new QTCaptureDecompressedVideoOutput ();

decompressedVideo.DidOutputVideoFrame += delegate(object sender, QTCaptureVideoFrameEventArgs e) {
    currentImage = e.VideoFrame;
};

4.2 Pre-processing

The main purpose of this step is to extract the hand shape from the input image and pass it on to the feature selection stage.

4.2.1 Detecting skin

The method used to find a human hand in the image is skin detection. The main aim of this method is to look for the color of the skin. The way this method generally works is to filter out all regions of an image which are not considered to be skin; the remaining regions indicate the presence of a hand palm. Specific skin color rules are applied to do that. Which rules are applied depends on the image color space: different rules apply to YCbCr, RGB, HSV, etc. The disadvantage of this approach is that if there is any object having the same color as the hand, the noise will be very high [64]. My data set images were in the RGB color space, therefore I thresholded each pixel's RGB values by applying P. Kovac's rules to classify skin [65]. If a pixel satisfied these rules, it was marked as skin, otherwise as background:

Uniform daylight illumination: R > 95, G > 40, B > 20, R > G, R > B, max{R, G, B} - min{R, G, B} > 15, R - G > 2;

Flashlight: R > 210, G > 210, B > 170.

The results I got after applying the rules above to the data set images can be seen in figure 4.2.
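As an illustration of the rules above, a per-pixel sketch of the uniform daylight test is given below; the helper is a simplified stand-in rather than the project's actual skin detection code, and the names are hypothetical.

using System;

static class SkinDetector
{
    // Kovac-style uniform daylight rule, as listed above, applied to one RGB pixel.
    public static bool IsSkinDaylight(int r, int g, int b)
    {
        int max = Math.Max(r, Math.Max(g, b));
        int min = Math.Min(r, Math.Min(g, b));
        return r > 95 && g > 40 && b > 20
            && max - min > 15
            && r > g && r > b
            && r - g > 2;
    }
}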

Figure 4.2: Skin detection

I also tried to apply the YCbCr rules to do the skin detection. Therefore, I converted each pixel's RGB values to the Y, Cb and Cr values using the rules published in this paper [66].

double way = * R * G * B;
double cb = R way;
double cr = B way;

Then, as in the previous method, I thresholded the skin from the background using the YCbCr rules below [65]:

Y >= 100, Y <= 255, Cb >= 75, Cb <= 125, Cr >= 132, Cr <= 172.

But the results received using the RGB filtering were better than with YCbCr, hence RGB has been chosen as the main method for the system.

4.2.2 Blob correction

After extracting a hand in the image, the holes created inside the hand shape needed to be filled in order to be able to extract features well. I used a Median filter to do that. A Median filter assigns each pixel the median value of its neighbors; in a binary image it gives the pixel the same color as the majority of that pixel's neighbors. The results of this filtering are shown in figure 4.3.

Figure 4.3: Median filter result

Because AForge.NET comes with this filter, I did not have to implement it myself. The lines of code below show how I used it:

AForge.Imaging.Filters.Median filter = new AForge.Imaging.Filters.Median();
filter.ApplyInPlace (Image);

But as the filter did not give excellent results and still left some holes, I applied the Closing filter and also implemented a blob analysis function to remove the existing holes. The closing filter is one of two important operators from mathematical morphology. It is derived from the fundamental operations of erosion and dilation: the closing filter performs a dilation followed by an erosion. In images with bright objects on a dark background, the closing filter fills narrow gaps between objects [67]. Dilation is a morphological filter that works by considering a neighborhood around each pixel. From the list of neighbor pixels, the maximum value is found and stored as the corresponding resulting value. Finally, each pixel in the image is replaced by the resulting value generated for its associated neighborhood [68]. Erosion is also a morphological filter. The basic effect of the operator on a binary image is to erode away the boundaries of regions of foreground pixels. Therefore, areas of foreground pixels shrink in size, and holes within those areas become larger [67]. These lines of code show how I used the Closing filter, which came with the AForge.NET library:

AForge.Imaging.Filters.Closing filter = new AForge.Imaging.Filters.Closing();
filter.ApplyInPlace (Image);

Moreover, I implemented a blob analysis function to make sure that all black holes inside the hand shape are removed. It was done by checking each black pixel's 10th neighbours in the following eight directions: up, down, right, left and diagonally, as can be seen in figure 4.4.

Figure 4.4: Blob algorithm

I then checked those eight neighbours to see which of them were white. If the number of white neighbour pixels was greater than or equal to seven, I made the black pixel being processed white. The results after applying the Closing filter and the algorithm mentioned above are illustrated in figure 4.5.

Figure 4.5: Blob results
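The following is a sketch of the hole-filling heuristic just described, reconstructed from the description above rather than taken from the project's code; it assumes the binary image is held as a bool array where true marks white (skin) pixels, and the names are hypothetical.

static class BlobFill
{
    // For every black pixel, sample the pixel 10 steps away in each of the eight
    // directions; if at least 7 of those samples are white, the pixel is assumed
    // to be a hole inside the hand shape and is turned white.
    static readonly int[,] Dirs =
        { { -1, 0 }, { 1, 0 }, { 0, -1 }, { 0, 1 }, { -1, -1 }, { -1, 1 }, { 1, -1 }, { 1, 1 } };

    public static bool[,] FillHoles(bool[,] img, int step = 10)
    {
        int h = img.GetLength(0), w = img.GetLength(1);
        var result = (bool[,])img.Clone();
        for (int y = 0; y < h; y++)
        {
            for (int x = 0; x < w; x++)
            {
                if (img[y, x]) continue;               // only consider black pixels
                int whiteNeighbours = 0;
                for (int d = 0; d < 8; d++)
                {
                    int ny = y + Dirs[d, 0] * step;
                    int nx = x + Dirs[d, 1] * step;
                    if (ny >= 0 && ny < h && nx >= 0 && nx < w && img[ny, nx])
                        whiteNeighbours++;
                }
                if (whiteNeighbours >= 7)
                    result[y, x] = true;               // fill the hole
            }
        }
        return result;
    }
}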

4.3 Feature extraction

In this stage, hand features are extracted and passed later on to the classification stage. I did not want to use features which would be hard to extract and would therefore slow down the running time, thus I decided to use a simple hand shape centroid, which was not a difficult feature to extract. But because the classification results were not good enough, I added the ratios feature to improve the results.

4.3.1 Edge detection

After getting a proper hand shape in the image, the next step was edge detection. I used the AForge.NET library filter for it (see the code below).

AForge.Imaging.Filters.Edges filter = new AForge.Imaging.Filters.Edges ();
filter.ApplyInPlace (Image);

The filter finds hand edges by calculating the maximum difference between pixels in four directions around the pixel being processed. The results can be seen in figure 4.6.

Figure 4.6: Edge detection

4.3.2 Centroid Calculation

The next step was to extract the first feature: the centroid of the hand shape. It was done by calculating the average of the x and y coordinates of all white pixels (which are the hand contour pixels). See figure 4.7 for the obtained results.
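A short sketch of the centroid computation described above, assuming the edge image is a bool array in which true marks the white contour pixels; the names are illustrative rather than the project's own.

static class Centroid
{
    // Centroid of the hand contour: the mean x and y of all white (edge) pixels.
    public static (double X, double Y) Compute(bool[,] edges)
    {
        double sumX = 0, sumY = 0;
        int count = 0;
        for (int y = 0; y < edges.GetLength(0); y++)
        {
            for (int x = 0; x < edges.GetLength(1); x++)
            {
                if (!edges[y, x]) continue;
                sumX += x;
                sumY += y;
                count++;
            }
        }
        return count == 0 ? (0.0, 0.0) : (sumX / count, sumY / count);
    }
}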

Figure 4.7: Hand shape centroid

4.3.3 Ratios Calculation

The next feature extracted was the ratio of white to black pixels, where the image was split into two parts horizontally and two parts vertically according to the centroid coordinates (see figure 4.8). Each ratio was calculated by simply dividing the number of white pixels by the number of black pixels in that region. Therefore, I got four numbers, each being the ratio of a different region.

Figure 4.8: Ratio calculation regions in gray: horizontally and vertically

4.4 Training HMMs

For a gesture recognition system, a state represents a pose. The distribution for each state is over symbols represented by feature vectors. My feature vector f consists of six features, which are:

1. X coordinate of the centroid
2. Y coordinate of the centroid
3. Ratio of the upper half of the horizontal region
4. Ratio of the lower half of the horizontal region

5. Ratio of the left half of the vertical region
6. Ratio of the right half of the vertical region

A Hidden Markov Model can represent a single gesture. Therefore, I created four HMMs, one for each gesture (illustrated in figure 4.9).

Figure 4.9: Hidden Markov Models

The feature vector is assumed to have a multivariate normal distribution in each state, with the mean vector and covariance matrix being state dependent. The Forward algorithm was used to calculate the probability of an observation sequence occurring given the HMM. Also, the Baum-Welch algorithm maximized the probability of an observation sequence for a given Hidden Markov Model. The Accord.NET library was used to implement the HMMs, as coding them from scratch would have been an enormous task. Therefore, all I had to do was learn how to use the provided methods, insert the correct data and make sure that every function was given the correct number of states, features, labels, etc. I had 20 gesture sequences for each gesture class, where the number of images in each sequence varied from 42 to 99. Moreover, it would not have been very reasonable to use each image in a sequence as a state, because it would require a lot of time to train and test the HMMs. Therefore, I decided to use 5 states for each sequence, based on the graphs I got from plotting the sequence feature data; it was clear that 5 states modeled the differences between data points well. Accordingly, I calculated 5 average points for each sequence, which were used for training the HMMs. And as mentioned in Chapter 3, I used K-fold cross-validation, so the set which was tested was not used for training.
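To show how the six-element feature vectors and the five averaged points fit together, the sketch below reduces a variable-length sequence of feature vectors to five averaged points, as described above; it is an illustrative reconstruction, not the project's code, and the names are hypothetical. The project's actual training data and Accord.NET calls follow below.

static class SequenceResampler
{
    // Reduce a gesture sequence of arbitrary length (42-99 frames in the data set)
    // to 5 averaged feature vectors, one per HMM state. Each frame is a 6-element
    // vector: centroid x, centroid y and the four region ratios.
    public static double[][] ToFivePoints(double[][] frames, int points = 5)
    {
        int features = frames[0].Length;
        var result = new double[points][];
        for (int p = 0; p < points; p++)
        {
            // Average the frames falling into the p-th equal segment of the sequence.
            int start = p * frames.Length / points;
            int end = (p + 1) * frames.Length / points;
            var avg = new double[features];
            for (int f = start; f < end; f++)
                for (int k = 0; k < features; k++)
                    avg[k] += frames[f][k];
            for (int k = 0; k < features; k++)
                avg[k] /= (end - start);
            result[p] = avg;
        }
        return result;
    }
}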

Some of the code can be seen below:

// hand movement from centre to right 0001
//SET1( )
double[] fva1 = new double[] { , , 0.383, 0.635, 0.458, 0.542};
double[] fva2 = new double[] { , , 0.353, 0.619, 0.395, 0.580};
double[] fva3 = new double[] { , , 0.313, 0.604, 0.313, 0.670};
double[] fva4 = new double[] { , , 0.280, 0.624, 0.253, 0.805};
double[] fva5 = new double[] { , , 0.262, 0.634, 0.226, 0.868};
double[][] set1_0000 = { fva1, fva2, fva3, fva4, fva5 };
+ 19 remaining sequences for this gesture

// hand movement from centre to left 0000
//SET1( )
double[] fvb1 = new double[] { , , 0.298, 0.743, 0.425, 0.452};
double[] fvb5 = new double[] { , , 0.298, 0.871, 0.726, 0.281};
double[][] set12_0000 = { fvb1, fvb2, fvb3, fvb4, fvb5 };
+ 19 remaining sequences for this gesture

// hand movement from centre to down 0002
//SET1( )
double[] fvc1 = new double[] { , , 0.349, 0.727, 0.631, 0.404};
double[] fvc5 = new double[] { , , 0.110, 0.751, 0.213, 0.220};
double[][] set13_0000 = { fvc1, fvc2, fvc3, fvc4, fvc5 };
+ 19 remaining sequences for this gesture

// hand movement with two fingers 0008
//SET1( )
double[] fvd1 = new double[] { , , 0.148, 0.512, 0.378, 0.188};
double[] fvd5 = new double[] { , , 0.144, 0.497, 0.265, 0.222};
double[][] set14_0000 = { fvd1, fvd2, fvd3, fvd4, fvd5 };
+ 19 remaining sequences for this gesture

//Defining inputs, excluding the set which is going to be tested
double[][][] inputs = new double[32][][] {set2_0004, set2_0005, set2_0006, set2_0007, set3_0008, set3_0009, ..., set52_0019};

//Labels
int[] outputs = new int[]{ 0,

, 1,
, 2,
, 3, };

string[] classes = new string[]{"a","b","c","d"};
int states = 5;

//Creates a sequence classifier containing 4 hidden Markov Models with 5 states
//and an underlying multivariate Normal distribution as density.
var classifier = new HiddenMarkovClassifier<MultivariateNormalDistribution>
    (4, new Forward(states), new MultivariateNormalDistribution(6), classes);

// Configure the learning algorithms to train the sequence classifier
var teacher = new HiddenMarkovClassifierLearning<MultivariateNormalDistribution>(classifier,
    modelIndex => new BaumWelchLearning<MultivariateNormalDistribution>(
        classifier.Models[modelIndex]));

// Train the sequence classifier using the algorithm
teacher.Run(inputs, outputs);

Moreover, the library provides a method to save the trained model, so there is no need to train on the data every time a gesture is inputted during online testing using the webcam:

classifier.Save(path);

That trained model can then simply be loaded (see the code below).

var classifier = HiddenMarkovClassifier<MultivariateNormalDistribution>.Load(path);

4.5 Recognizing gestures

The unseen gesture, or the set which was not used for training, was run through all of the HMMs. The testing code example can be seen below:

//set1 //0
int hmmresult = classifier.Compute(set1_0000);
int hmmresult2 = classifier.Compute(set1_0001);
int hmmresult3 = classifier.Compute(set1_0002);
int hmmresult4 = classifier.Compute(set1_0003);

The gestures were classified by the model that had the highest probability, and the label of the recognized class was printed out. The testing results will be discussed in the next chapter.

4.6 Controlling robot

As mentioned earlier, the robot was controlled via Bluetooth using the MonoBrick library. The Bluetooth connection to the robot was made by firstly pairing it with my laptop and secondly implementing the communication via this code:

var brick = new Brick<Sensor,Sensor,Sensor,Sensor>("/dev/tty.MARIE DevB");

The robot was built by myself, following the instructions provided by LEGO (see figure 4.10).

Figure 4.10: LEGO NXT Mindstorms robot [69]

It has two motors, which make the robot move. They are connected to the A and B ports of the robot. Therefore, the robot moves in the direction determined by which motors are turned on. To make it move forward or backward, both of the motors need to be

turned on, but to make it move left or right, just one of them. To make the motors move for the required one second, I used 360 degrees as the movement parameter, which means that the motor makes exactly one whole turn and stops. The code below displays how it was implemented:

//Move forward
speed = 30;
brick.MotorB.On(speed, degrees);
brick.MotorC.On(speed, degrees);

//Move backward
speed = 30;
brick.MotorB.On(speed, degrees);
brick.MotorC.On(speed, degrees);

//Turn left
speed = 30;
brick.MotorB.On(speed, degrees);

//Turn right
speed = 30;
brick.MotorC.On(speed, degrees);

Chapter 5
Results

In the previous chapters, the structure of the system was described, the pre-processing, feature extraction and classification stages were elaborated upon, and the methods used to achieve their purpose were discussed. This chapter shows the results that confirm these methods as useful for hand gesture classification. Moreover, some screenshots are provided to present the system's appearance and give an idea of how it works. As noted previously, the system interface was created with Xcode. It consists only of the main window, where the user is able to see the video stream (see figure 5.1). The user needs to define 5 frames of the chosen gesture, and the frames can be added by pressing the button. The process can be restarted if the user requires. The number of frames inputted can be seen on the main window, as can the status of the current process. When the 5 frames have been inputted and the gesture is recognized, it is shown on the status label (see figure 5.2).

Figure 5.1: Main window screenshot

Figure 5.2: Recognized gesture status

As can be seen in figure 5.3 below, the system worked quite well in real time and was able to detect skin, extract the edge and calculate the centroid correctly if the lighting conditions were appropriate and there were no objects of a similar-to-skin color in the background.

1. Frame inputted from webcam
2. Skin segmentation + blob correction
3. Edge detection
4. Centroid

Figure 5.3: Real-time results

Chapter 6
Testing and Evaluation

The testing of the system is detailed in this chapter. A discussion of the offline and online testing efficiency is also given.

6.1 Offline

Once the model has been trained, the offline testing can determine the identity of a new, previously unseen gesture image. As mentioned earlier, K-fold cross-validation has been used, where each gesture class's data was split into 5 random sets. Each set contained 4 image sequences from each gesture class. The system was trained using 4 sets and tested using 1 set, and the sets were swapped over many times to get more significant testing results. Firstly, the testing was done using only two gesture classes: turn left and turn right (see figure 6.1). The results were highly accurate, as the gestures were recognized with 80% and 100% accuracy (shown in figure 6.2). Each line specifies which sets of each gesture were used for training and which for testing. Yellow indicates sets used for testing, and green indicates the training sets (this applies to the following figures also).

Figure 6.1: Gesture classes used for classification

Figure 6.2: Testing results using two gesture classes

Secondly, I tried more detailed testing using the same two gesture classes. All possible combinations of different training and testing sets of both gesture classes were used, but the overall results differed by just one percentage point (see figure 6.3). Therefore, the following testing was done using the previous approach.

Figure 6.3: More detailed testing results using two gesture classes

The next classification was done using three gesture classes, as can be seen in figure 6.4. The accuracy of the overall testing results falls off a little, but this is quite normal as the system needed to distinguish between more gesture classes (see figure 6.5).

Figure 6.4: Gesture classes used for classification

Figure 6.5: Testing results using three gesture classes

And finally, all four gesture classes were used for testing (shown in figure 6.6). The overall results were quite accurate, with more than 60% accuracy for each gesture class and 78.8% accuracy for the whole system, which shows that the system works reasonably well (see figure 6.7).

Figure 6.6: Gesture classes used for classification

Figure 6.7: Testing results using four gesture classes

6.2 Online

The system was also tested under real-time conditions using the webcam. The system classified the inputted gesture frames and the performance was measured yet again. I did not have to train the model each time the system was used, because I loaded the saved model, and so the system was able to test a gesture very quickly online. Therefore, the robot moved according to the recognized gesture quickly and robustly. As can be seen in figure 6.8, the overall system performance was only 63.5% when testing in real time. But this can be improved, as will be discussed in the next chapter. The lower accuracy rate of the online classification, compared with the offline classification, is due to the different

lighting conditions and a lack of efficient features which would allow the system to easily distinguish between gesture classes.

Gesture      Accuracy rate
             54%
             67%
             75%
             58%
Total        63.5%

Figure 6.8: Online testing results

Chapter 7
Conclusion

Gesture recognition has proved to be one of the most significant research areas in the field of computer vision. In recent years, the potential of vision-based communication systems between humans and computers has increased enormously. Even though data gloves and sensors that track body parts make the gesture recognition problem easier, identifying gestures using only image information is still a challenging task. As I did not use any sensors or data gloves for gesture recognition, my system can acquire and classify hand gestures from images only. The general gesture system consists of three stages: pre-processing, feature extraction and classification. Various approaches have been discussed in the second chapter, and that information justifies the choice of hand segmentation, gesture representation and machine learning algorithm. Firstly, image processing is done to segment the hand from the background using the skin detection technique. Then the system extracts relevant features for classification, such as the hand centroid and the ratios of white to black pixels in the image regions, and finally classifies the gesture features using a Hidden Markov Model classifier to make the robot move according to the recognized gesture. The approaches used worked well on the dataset when testing offline, as the average accuracy achieved was 78.8%, but they did not work perfectly using the webcam in real time, as the overall accuracy rate was 63.5%. Therefore, the problems and limitations of the system are discussed further below.

7.1 Problems & Limitations

One of the problems with my system is the lack of robustness under different lighting conditions (see figure 7.1).

Figure 7.1: Lighting conditions problem

Secondly, as can be seen in figure 7.2, there is a problem segmenting the hand shape if there is any object similar to skin color in the background.

Figure 7.2: Skin color objects problem

In summary, if the system cannot correctly segment the hand shape in the image, the features will not be extracted well and therefore will not be classified fairly. Moreover, even though the extracted hand features correlated within each gesture class, they were not very different between gesture classes.

7.2 Improvements

My system proves that the combination of methods chosen can achieve simple gesture recognition; on the other hand, some improvements could be made to achieve higher accuracy and robustness.


Keywords Binary Linked Object, Binary silhouette, Fingertip Detection, Hand Gesture Recognition, k-nn algorithm. Volume 7, Issue 5, May 2017 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Hand Gestures Recognition

More information

Automatic Tracking of Moving Objects in Video for Surveillance Applications

Automatic Tracking of Moving Objects in Video for Surveillance Applications Automatic Tracking of Moving Objects in Video for Surveillance Applications Manjunath Narayana Committee: Dr. Donna Haverkamp (Chair) Dr. Arvin Agah Dr. James Miller Department of Electrical Engineering

More information

Gesture Recognition Technique:A Review

Gesture Recognition Technique:A Review Gesture Recognition Technique:A Review Nishi Shah 1, Jignesh Patel 2 1 Student, Indus University, Ahmedabad 2 Assistant Professor,Indus University,Ahmadabad Abstract Gesture Recognition means identification

More information

Mouse Simulation Using Two Coloured Tapes

Mouse Simulation Using Two Coloured Tapes Mouse Simulation Using Two Coloured Tapes Kamran Niyazi 1, Vikram Kumar 2, Swapnil Mahe 3 and Swapnil Vyawahare 4 Department of Computer Engineering, AISSMS COE, University of Pune, India kamran.niyazi@gmail.com

More information

Chapter 4. The Classification of Species and Colors of Finished Wooden Parts Using RBFNs

Chapter 4. The Classification of Species and Colors of Finished Wooden Parts Using RBFNs Chapter 4. The Classification of Species and Colors of Finished Wooden Parts Using RBFNs 4.1 Introduction In Chapter 1, an introduction was given to the species and color classification problem of kitchen

More information

Motion Detection Algorithm

Motion Detection Algorithm Volume 1, No. 12, February 2013 ISSN 2278-1080 The International Journal of Computer Science & Applications (TIJCSA) RESEARCH PAPER Available Online at http://www.journalofcomputerscience.com/ Motion Detection

More information

Keywords: Thresholding, Morphological operations, Image filtering, Adaptive histogram equalization, Ceramic tile.

Keywords: Thresholding, Morphological operations, Image filtering, Adaptive histogram equalization, Ceramic tile. Volume 3, Issue 7, July 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Blobs and Cracks

More information

Fully Automatic Methodology for Human Action Recognition Incorporating Dynamic Information

Fully Automatic Methodology for Human Action Recognition Incorporating Dynamic Information Fully Automatic Methodology for Human Action Recognition Incorporating Dynamic Information Ana González, Marcos Ortega Hortas, and Manuel G. Penedo University of A Coruña, VARPA group, A Coruña 15071,

More information

HAND-GESTURE BASED FILM RESTORATION

HAND-GESTURE BASED FILM RESTORATION HAND-GESTURE BASED FILM RESTORATION Attila Licsár University of Veszprém, Department of Image Processing and Neurocomputing,H-8200 Veszprém, Egyetem u. 0, Hungary Email: licsara@freemail.hu Tamás Szirányi

More information

Chapter 9 Object Tracking an Overview

Chapter 9 Object Tracking an Overview Chapter 9 Object Tracking an Overview The output of the background subtraction algorithm, described in the previous chapter, is a classification (segmentation) of pixels into foreground pixels (those belonging

More information

A Two-stage Scheme for Dynamic Hand Gesture Recognition

A Two-stage Scheme for Dynamic Hand Gesture Recognition A Two-stage Scheme for Dynamic Hand Gesture Recognition James P. Mammen, Subhasis Chaudhuri and Tushar Agrawal (james,sc,tush)@ee.iitb.ac.in Department of Electrical Engg. Indian Institute of Technology,

More information

Mouse Pointer Tracking with Eyes

Mouse Pointer Tracking with Eyes Mouse Pointer Tracking with Eyes H. Mhamdi, N. Hamrouni, A. Temimi, and M. Bouhlel Abstract In this article, we expose our research work in Human-machine Interaction. The research consists in manipulating

More information

Prof. Fanny Ficuciello Robotics for Bioengineering Visual Servoing

Prof. Fanny Ficuciello Robotics for Bioengineering Visual Servoing Visual servoing vision allows a robotic system to obtain geometrical and qualitative information on the surrounding environment high level control motion planning (look-and-move visual grasping) low level

More information

An Introduction to Content Based Image Retrieval

An Introduction to Content Based Image Retrieval CHAPTER -1 An Introduction to Content Based Image Retrieval 1.1 Introduction With the advancement in internet and multimedia technologies, a huge amount of multimedia data in the form of audio, video and

More information

Face Recognition using Eigenfaces SMAI Course Project

Face Recognition using Eigenfaces SMAI Course Project Face Recognition using Eigenfaces SMAI Course Project Satarupa Guha IIIT Hyderabad 201307566 satarupa.guha@research.iiit.ac.in Ayushi Dalmia IIIT Hyderabad 201307565 ayushi.dalmia@research.iiit.ac.in Abstract

More information

(Sample) Final Exam with brief answers

(Sample) Final Exam with brief answers Name: Perm #: (Sample) Final Exam with brief answers CS/ECE 181B Intro to Computer Vision March 24, 2017 noon 3:00 pm This is a closed-book test. There are also a few pages of equations, etc. included

More information

COMS W4735: Visual Interfaces To Computers. Final Project (Finger Mouse) Submitted by: Tarandeep Singh Uni: ts2379

COMS W4735: Visual Interfaces To Computers. Final Project (Finger Mouse) Submitted by: Tarandeep Singh Uni: ts2379 COMS W4735: Visual Interfaces To Computers Final Project (Finger Mouse) Submitted by: Tarandeep Singh Uni: ts2379 FINGER MOUSE (Fingertip tracking to control mouse pointer) Abstract. This report discusses

More information

Programming-By-Example Gesture Recognition Kevin Gabayan, Steven Lansel December 15, 2006

Programming-By-Example Gesture Recognition Kevin Gabayan, Steven Lansel December 15, 2006 Programming-By-Example Gesture Recognition Kevin Gabayan, Steven Lansel December 15, 6 Abstract Machine learning and hardware improvements to a programming-by-example rapid prototyping system are proposed.

More information

Face Detection Using Color Based Segmentation and Morphological Processing A Case Study

Face Detection Using Color Based Segmentation and Morphological Processing A Case Study Face Detection Using Color Based Segmentation and Morphological Processing A Case Study Dr. Arti Khaparde*, Sowmya Reddy.Y Swetha Ravipudi *Professor of ECE, Bharath Institute of Science and Technology

More information

Robot vision review. Martin Jagersand

Robot vision review. Martin Jagersand Robot vision review Martin Jagersand What is Computer Vision? Computer Graphics Three Related fields Image Processing: Changes 2D images into other 2D images Computer Graphics: Takes 3D models, renders

More information

One Dim~nsional Representation Of Two Dimensional Information For HMM Based Handwritten Recognition

One Dim~nsional Representation Of Two Dimensional Information For HMM Based Handwritten Recognition One Dim~nsional Representation Of Two Dimensional Information For HMM Based Handwritten Recognition Nafiz Arica Dept. of Computer Engineering, Middle East Technical University, Ankara,Turkey nafiz@ceng.metu.edu.

More information

MIME: A Gesture-Driven Computer Interface

MIME: A Gesture-Driven Computer Interface MIME: A Gesture-Driven Computer Interface Daniel Heckenberg a and Brian C. Lovell b a Department of Computer Science and Electrical Engineering, The University of Queensland, Brisbane, Australia, 4072

More information

Classification of objects from Video Data (Group 30)

Classification of objects from Video Data (Group 30) Classification of objects from Video Data (Group 30) Sheallika Singh 12665 Vibhuti Mahajan 12792 Aahitagni Mukherjee 12001 M Arvind 12385 1 Motivation Video surveillance has been employed for a long time

More information

International Journal of Innovative Research in Computer and Communication Engineering

International Journal of Innovative Research in Computer and Communication Engineering Moving Object Detection By Background Subtraction V.AISWARYA LAKSHMI, E.ANITHA, S.SELVAKUMARI. Final year M.E, Department of Computer Science and Engineering Abstract : Intelligent video surveillance systems

More information

Connected Component Analysis and Change Detection for Images

Connected Component Analysis and Change Detection for Images Connected Component Analysis and Change Detection for Images Prasad S.Halgaonkar Department of Computer Engg, MITCOE Pune University, India Abstract Detection of the region of change in images of a particular

More information

Fabric Defect Detection Based on Computer Vision

Fabric Defect Detection Based on Computer Vision Fabric Defect Detection Based on Computer Vision Jing Sun and Zhiyu Zhou College of Information and Electronics, Zhejiang Sci-Tech University, Hangzhou, China {jings531,zhouzhiyu1993}@163.com Abstract.

More information

Hand Gesture Recognition with Microsoft Kinect A Computer Player for the Rock-paper-scissors Game

Hand Gesture Recognition with Microsoft Kinect A Computer Player for the Rock-paper-scissors Game Hand Gesture Recognition with Microsoft Kinect A Computer Player for the Rock-paper-scissors Game Vladan Jovičić, Marko Palangetić University of Primorska Faculty of Mathematics, Natural Sciences and Information

More information

Motion Estimation for Video Coding Standards

Motion Estimation for Video Coding Standards Motion Estimation for Video Coding Standards Prof. Ja-Ling Wu Department of Computer Science and Information Engineering National Taiwan University Introduction of Motion Estimation The goal of video compression

More information

Region-based Segmentation

Region-based Segmentation Region-based Segmentation Image Segmentation Group similar components (such as, pixels in an image, image frames in a video) to obtain a compact representation. Applications: Finding tumors, veins, etc.

More information

The Detection of Faces in Color Images: EE368 Project Report

The Detection of Faces in Color Images: EE368 Project Report The Detection of Faces in Color Images: EE368 Project Report Angela Chau, Ezinne Oji, Jeff Walters Dept. of Electrical Engineering Stanford University Stanford, CA 9435 angichau,ezinne,jwalt@stanford.edu

More information

those items. We observed ratio measurements of size and color for objects from different angles

those items. We observed ratio measurements of size and color for objects from different angles 1 Title: Object Recognition Using Clustering, Classification, and Centroids Authors: Corey Proscia and Alexander Wang, CSC 242, University of Rochester Date: May 8, 2006 Abstract: Intelligent agents can

More information

CSE/EE-576, Final Project

CSE/EE-576, Final Project 1 CSE/EE-576, Final Project Torso tracking Ke-Yu Chen Introduction Human 3D modeling and reconstruction from 2D sequences has been researcher s interests for years. Torso is the main part of the human

More information

Progress Report of Final Year Project

Progress Report of Final Year Project Progress Report of Final Year Project Project Title: Design and implement a face-tracking engine for video William O Grady 08339937 Electronic and Computer Engineering, College of Engineering and Informatics,

More information

Accurate 3D Face and Body Modeling from a Single Fixed Kinect

Accurate 3D Face and Body Modeling from a Single Fixed Kinect Accurate 3D Face and Body Modeling from a Single Fixed Kinect Ruizhe Wang*, Matthias Hernandez*, Jongmoo Choi, Gérard Medioni Computer Vision Lab, IRIS University of Southern California Abstract In this

More information

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1 Big Data Methods Chapter 5: Machine learning Big Data Methods, Chapter 5, Slide 1 5.1 Introduction to machine learning What is machine learning? Concerned with the study and development of algorithms that

More information

FOREGROUND DETECTION ON DEPTH MAPS USING SKELETAL REPRESENTATION OF OBJECT SILHOUETTES

FOREGROUND DETECTION ON DEPTH MAPS USING SKELETAL REPRESENTATION OF OBJECT SILHOUETTES FOREGROUND DETECTION ON DEPTH MAPS USING SKELETAL REPRESENTATION OF OBJECT SILHOUETTES D. Beloborodov a, L. Mestetskiy a a Faculty of Computational Mathematics and Cybernetics, Lomonosov Moscow State University,

More information

Recognition of Human Body Movements Trajectory Based on the Three-dimensional Depth Data

Recognition of Human Body Movements Trajectory Based on the Three-dimensional Depth Data Preprints of the 19th World Congress The International Federation of Automatic Control Recognition of Human Body s Trajectory Based on the Three-dimensional Depth Data Zheng Chang Qing Shen Xiaojuan Ban

More information

Gesture Recognition: Hand Pose Estimation. Adrian Spurr Ubiquitous Computing Seminar FS

Gesture Recognition: Hand Pose Estimation. Adrian Spurr Ubiquitous Computing Seminar FS Gesture Recognition: Hand Pose Estimation Adrian Spurr Ubiquitous Computing Seminar FS2014 27.05.2014 1 What is hand pose estimation? Input Computer-usable form 2 Augmented Reality Gaming Robot Control

More information

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018 MIT 801 [Presented by Anna Bosman] 16 February 2018 Machine Learning What is machine learning? Artificial Intelligence? Yes as we know it. What is intelligence? The ability to acquire and apply knowledge

More information

Hand Gesture Extraction by Active Shape Models

Hand Gesture Extraction by Active Shape Models Hand Gesture Extraction by Active Shape Models Nianjun Liu, Brian C. Lovell School of Information Technology and Electrical Engineering The University of Queensland, Brisbane 4072, Australia National ICT

More information

Extracting Layers and Recognizing Features for Automatic Map Understanding. Yao-Yi Chiang

Extracting Layers and Recognizing Features for Automatic Map Understanding. Yao-Yi Chiang Extracting Layers and Recognizing Features for Automatic Map Understanding Yao-Yi Chiang 0 Outline Introduction/ Problem Motivation Map Processing Overview Map Decomposition Feature Recognition Discussion

More information

Measurement of 3D Foot Shape Deformation in Motion

Measurement of 3D Foot Shape Deformation in Motion Measurement of 3D Foot Shape Deformation in Motion Makoto Kimura Masaaki Mochimaru Takeo Kanade Digital Human Research Center National Institute of Advanced Industrial Science and Technology, Japan The

More information

The Kinect Sensor. Luís Carriço FCUL 2014/15

The Kinect Sensor. Luís Carriço FCUL 2014/15 Advanced Interaction Techniques The Kinect Sensor Luís Carriço FCUL 2014/15 Sources: MS Kinect for Xbox 360 John C. Tang. Using Kinect to explore NUI, Ms Research, From Stanford CS247 Shotton et al. Real-Time

More information

Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong)

Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong) Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong) References: [1] http://homepages.inf.ed.ac.uk/rbf/hipr2/index.htm [2] http://www.cs.wisc.edu/~dyer/cs540/notes/vision.html

More information

Estimating Human Pose in Images. Navraj Singh December 11, 2009

Estimating Human Pose in Images. Navraj Singh December 11, 2009 Estimating Human Pose in Images Navraj Singh December 11, 2009 Introduction This project attempts to improve the performance of an existing method of estimating the pose of humans in still images. Tasks

More information

Unsupervised learning in Vision

Unsupervised learning in Vision Chapter 7 Unsupervised learning in Vision The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual

More information

Detection of Edges Using Mathematical Morphological Operators

Detection of Edges Using Mathematical Morphological Operators OPEN TRANSACTIONS ON INFORMATION PROCESSING Volume 1, Number 1, MAY 2014 OPEN TRANSACTIONS ON INFORMATION PROCESSING Detection of Edges Using Mathematical Morphological Operators Suman Rani*, Deepti Bansal,

More information

Introducing Robotics Vision System to a Manufacturing Robotics Course

Introducing Robotics Vision System to a Manufacturing Robotics Course Paper ID #16241 Introducing Robotics Vision System to a Manufacturing Robotics Course Dr. Yuqiu You, Ohio University c American Society for Engineering Education, 2016 Introducing Robotics Vision System

More information

Image Processing. Bilkent University. CS554 Computer Vision Pinar Duygulu

Image Processing. Bilkent University. CS554 Computer Vision Pinar Duygulu Image Processing CS 554 Computer Vision Pinar Duygulu Bilkent University Today Image Formation Point and Blob Processing Binary Image Processing Readings: Gonzalez & Woods, Ch. 3 Slides are adapted from

More information

Kinect-based identification method for parts and disassembly track 1

Kinect-based identification method for parts and disassembly track 1 Acta Technica 62, No. 3B/2017, 483 496 c 2017 Institute of Thermomechanics CAS, v.v.i. Kinect-based identification method for parts and disassembly track 1 Zhang Zhijia 2,3, Wei Xin 2,3, Zhou Ziqiang 3,

More information

Digital Image Processing Fundamentals

Digital Image Processing Fundamentals Ioannis Pitas Digital Image Processing Fundamentals Chapter 7 Shape Description Answers to the Chapter Questions Thessaloniki 1998 Chapter 7: Shape description 7.1 Introduction 1. Why is invariance to

More information

Indian Institute of Technology Kanpur District : Kanpur Team number: 2. Prof. A. K. Chaturvedi SANKET

Indian Institute of Technology Kanpur District : Kanpur Team number: 2. Prof. A. K. Chaturvedi SANKET CSIDC 2003 Interim Report Country : India University : Indian Institute of Technology Kanpur District : Kanpur 208016 Team number: 2 Mentor: Prof. A. K. Chaturvedi akc@iitk.ac.in +91-512-2597613 SANKET:

More information

CHAPTER 3 FACE DETECTION AND PRE-PROCESSING

CHAPTER 3 FACE DETECTION AND PRE-PROCESSING 59 CHAPTER 3 FACE DETECTION AND PRE-PROCESSING 3.1 INTRODUCTION Detecting human faces automatically is becoming a very important task in many applications, such as security access control systems or contentbased

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

An algorithm of lips secondary positioning and feature extraction based on YCbCr color space SHEN Xian-geng 1, WU Wei 2

An algorithm of lips secondary positioning and feature extraction based on YCbCr color space SHEN Xian-geng 1, WU Wei 2 International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 015) An algorithm of lips secondary positioning and feature extraction based on YCbCr color space SHEN Xian-geng

More information

Using Edge Detection in Machine Vision Gauging Applications

Using Edge Detection in Machine Vision Gauging Applications Application Note 125 Using Edge Detection in Machine Vision Gauging Applications John Hanks Introduction This application note introduces common edge-detection software strategies for applications such

More information

Outline. Project Goals

Outline. Project Goals Roque Burleson, Mayuri Shakamuri CS 589-04: Image Processing Spring 2008 Instructor: Dr. Qingzhong Liu New Mexico Tech Outline Hyperspectral Imaging Eigen Faces Face Recognition Face Tracking Other Methods

More information

Automatic Gait Recognition. - Karthik Sridharan

Automatic Gait Recognition. - Karthik Sridharan Automatic Gait Recognition - Karthik Sridharan Gait as a Biometric Gait A person s manner of walking Webster Definition It is a non-contact, unobtrusive, perceivable at a distance and hard to disguise

More information

A Visualization Tool to Improve the Performance of a Classifier Based on Hidden Markov Models

A Visualization Tool to Improve the Performance of a Classifier Based on Hidden Markov Models A Visualization Tool to Improve the Performance of a Classifier Based on Hidden Markov Models Gleidson Pegoretti da Silva, Masaki Nakagawa Department of Computer and Information Sciences Tokyo University

More information

International Journal of Computer Engineering and Applications, Volume XII, Issue I, Jan. 18, ISSN

International Journal of Computer Engineering and Applications, Volume XII, Issue I, Jan. 18,   ISSN International Journal of Computer Engineering and Applications, Volume XII, Issue I, Jan. 18, www.ijcea.com ISSN 2321-3469 SURVEY ON OBJECT TRACKING IN REAL TIME EMBEDDED SYSTEM USING IMAGE PROCESSING

More information

Traffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers

Traffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers Traffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers A. Salhi, B. Minaoui, M. Fakir, H. Chakib, H. Grimech Faculty of science and Technology Sultan Moulay Slimane

More information

Robust Tracking of People by a Mobile Robotic Agent

Robust Tracking of People by a Mobile Robotic Agent Robust Tracking of People by a Mobile Robotic Agent Rawesak Tanawongsuwan, Alexander Stoytchev, Irfan Essa College of Computing, GVU Center, Georgia Institute of Technology Atlanta, GA 30332-0280 USA ftee

More information

3-D MRI Brain Scan Classification Using A Point Series Based Representation

3-D MRI Brain Scan Classification Using A Point Series Based Representation 3-D MRI Brain Scan Classification Using A Point Series Based Representation Akadej Udomchaiporn 1, Frans Coenen 1, Marta García-Fiñana 2, and Vanessa Sluming 3 1 Department of Computer Science, University

More information

CHAPTER 6 DETECTION OF MASS USING NOVEL SEGMENTATION, GLCM AND NEURAL NETWORKS

CHAPTER 6 DETECTION OF MASS USING NOVEL SEGMENTATION, GLCM AND NEURAL NETWORKS 130 CHAPTER 6 DETECTION OF MASS USING NOVEL SEGMENTATION, GLCM AND NEURAL NETWORKS A mass is defined as a space-occupying lesion seen in more than one projection and it is described by its shapes and margin

More information

CLASSIFICATION OF BOUNDARY AND REGION SHAPES USING HU-MOMENT INVARIANTS

CLASSIFICATION OF BOUNDARY AND REGION SHAPES USING HU-MOMENT INVARIANTS CLASSIFICATION OF BOUNDARY AND REGION SHAPES USING HU-MOMENT INVARIANTS B.Vanajakshi Department of Electronics & Communications Engg. Assoc.prof. Sri Viveka Institute of Technology Vijayawada, India E-mail:

More information

An Introduction to Pattern Recognition

An Introduction to Pattern Recognition An Introduction to Pattern Recognition Speaker : Wei lun Chao Advisor : Prof. Jian-jiun Ding DISP Lab Graduate Institute of Communication Engineering 1 Abstract Not a new research field Wide range included

More information

Cursive Handwriting Recognition System Using Feature Extraction and Artificial Neural Network

Cursive Handwriting Recognition System Using Feature Extraction and Artificial Neural Network Cursive Handwriting Recognition System Using Feature Extraction and Artificial Neural Network Utkarsh Dwivedi 1, Pranjal Rajput 2, Manish Kumar Sharma 3 1UG Scholar, Dept. of CSE, GCET, Greater Noida,

More information

Applying Synthetic Images to Learning Grasping Orientation from Single Monocular Images

Applying Synthetic Images to Learning Grasping Orientation from Single Monocular Images Applying Synthetic Images to Learning Grasping Orientation from Single Monocular Images 1 Introduction - Steve Chuang and Eric Shan - Determining object orientation in images is a well-established topic

More information

C E N T E R A T H O U S T O N S C H O O L of H E A L T H I N F O R M A T I O N S C I E N C E S. Image Operations II

C E N T E R A T H O U S T O N S C H O O L of H E A L T H I N F O R M A T I O N S C I E N C E S. Image Operations II T H E U N I V E R S I T Y of T E X A S H E A L T H S C I E N C E C E N T E R A T H O U S T O N S C H O O L of H E A L T H I N F O R M A T I O N S C I E N C E S Image Operations II For students of HI 5323

More information

Human Detection and Motion Tracking

Human Detection and Motion Tracking Human Detection and Motion Tracking Technical report - FI - VG20102015006-2011 04 Ing. Ibrahim Nahhas Ing. Filip Orság, Ph.D. Faculty of Information Technology, Brno University of Technology December 9,

More information

EE795: Computer Vision and Intelligent Systems

EE795: Computer Vision and Intelligent Systems EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 WRI C225 Lecture 02 130124 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Basics Image Formation Image Processing 3 Intelligent

More information

Facial Expression Detection Using Implemented (PCA) Algorithm

Facial Expression Detection Using Implemented (PCA) Algorithm Facial Expression Detection Using Implemented (PCA) Algorithm Dileep Gautam (M.Tech Cse) Iftm University Moradabad Up India Abstract: Facial expression plays very important role in the communication with

More information

Generating and Learning from 3D Models of Objects through Interactions. by Kiana Alcala, Kathryn Baldauf, Aylish Wrench

Generating and Learning from 3D Models of Objects through Interactions. by Kiana Alcala, Kathryn Baldauf, Aylish Wrench Generating and Learning from 3D Models of Objects through Interactions by Kiana Alcala, Kathryn Baldauf, Aylish Wrench Abstract For our project, we plan to implement code on the robotic arm that will allow

More information

A Performance Evaluation of HMM and DTW for Gesture Recognition

A Performance Evaluation of HMM and DTW for Gesture Recognition A Performance Evaluation of HMM and DTW for Gesture Recognition Josep Maria Carmona and Joan Climent Barcelona Tech (UPC), Spain Abstract. It is unclear whether Hidden Markov Models (HMMs) or Dynamic Time

More information

Eye Detection by Haar wavelets and cascaded Support Vector Machine

Eye Detection by Haar wavelets and cascaded Support Vector Machine Eye Detection by Haar wavelets and cascaded Support Vector Machine Vishal Agrawal B.Tech 4th Year Guide: Simant Dubey / Amitabha Mukherjee Dept of Computer Science and Engineering IIT Kanpur - 208 016

More information

Face Recognition Using Vector Quantization Histogram and Support Vector Machine Classifier Rong-sheng LI, Fei-fei LEE *, Yan YAN and Qiu CHEN

Face Recognition Using Vector Quantization Histogram and Support Vector Machine Classifier Rong-sheng LI, Fei-fei LEE *, Yan YAN and Qiu CHEN 2016 International Conference on Artificial Intelligence: Techniques and Applications (AITA 2016) ISBN: 978-1-60595-389-2 Face Recognition Using Vector Quantization Histogram and Support Vector Machine

More information

Improving Latent Fingerprint Matching Performance by Orientation Field Estimation using Localized Dictionaries

Improving Latent Fingerprint Matching Performance by Orientation Field Estimation using Localized Dictionaries Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 11, November 2014,

More information