Hand Gesture Recognition with Microsoft Kinect: A Computer Player for the Rock-paper-scissors Game

Vladan Jovičić, Marko Palangetić
University of Primorska, Faculty of Mathematics, Natural Sciences and Information Technologies
vladan.jovicic@student.upr.si, marko.palangetic@student.upr.si

Abstract This paper describes the use of Microsoft Kinect for hand detection and hand gesture recognition. Unlike devices that capture images with a standard RGB camera alone, Kinect has an additional depth sensor that provides a depth value for every pixel in its view area. Initially this technology was used only for playing Xbox games, but developers soon realised that Kinect could also power interesting software on the PC. Its API makes it possible to recognise the 20 most important points of the human body and to work with them. Besides the Microsoft Kinect SDK, there are also open-source frameworks for working with the device. Hand gesture detection is one of the most researched problems in the field, and this paper presents one way to approach it with modern Microsoft technology. We provide efficient algorithms for hand detection in space using the Kinect depth sensor (the central hand point is obtained from the Microsoft Kinect API, and the rest of the hand is detected from the distances of the remaining points to the central point in space), an algorithm for sorting the contour points of a gesture clockwise (necessary for finger detection), and an algorithm for finger detection (using the angles of the hand contour viewed as a polygon). Once a hand is detected, gesture recognition is done by counting the extended fingers. We also explain the main difference between body-part detection with Kinect and with an ordinary web camera (using the colour of the hand versus its distance for hand gesture recognition). Finally, the paper presents an implementation of the well-known Rock-paper-scissors game built on these hand and finger detection results.
1 Motivation

Hand gesture recognition is a very popular topic in computer science, since it can be used to control different machines. There have been many attempts to find an algorithm that can recognise a hand in a plain image obtained with an RGB camera. Almost all such algorithms rely on colour comparison, but they have trouble distinguishing the face from the hand. With the appearance of sensors that provide depth information for each pixel, the problem became more tractable and various new algorithms appeared. Still, using Kinect for hand gesture recognition remains an open problem: the device is good at tracking large objects such as the human body, but objects that occupy a small part of the image (e.g. a hand) cannot be recognised accurately. In this paper we present several algorithms for hand tracking and gesture recognition, with their advantages and disadvantages.

2 Hardware: The Microsoft Kinect Platform

2.1 About Microsoft Kinect

Microsoft Kinect is a motion sensing input device made for Microsoft's Xbox video game consoles. Based around a webcam-style add-on peripheral, it enables users to control and interact with their console or computer without the need for a game controller, keyboard or mouse, through a natural user interface using gestures and spoken commands [6]. The first-generation Kinect was introduced in November 2010 in an attempt to broaden the Xbox 360's audience beyond its typical gamer base. A version for Windows was released on February 1, 2012. Microsoft released a Kinect software development kit for Windows 7, meant to allow developers to write Kinect-enabled applications in C++/CLI, C#, or Visual Basic [7]. The Kinect sensor is a horizontal bar connected to a small base with a motorized pivot, designed to be positioned lengthwise above or below the video display. The device features an RGB camera, a depth sensor and a multi-array microphone running proprietary software [8], which together provide full-body 3D motion capture, facial recognition and voice recognition capabilities. The microphone array enables the Xbox 360 to perform acoustic source localization and ambient noise suppression. The depth sensor consists of an infrared laser projector combined with a monochrome CMOS sensor, which captures video data in 3D under any ambient light conditions [9]. The software technology enables advanced gesture recognition, facial recognition and voice recognition.
Kinect is capable of simultaneously tracking up to six people, including two active players for motion analysis with feature extraction of 20 joints per player. Kinect's various sensors output video at frame rates from 9 Hz to 30 Hz, depending on resolution. The default RGB video stream uses 8-bit VGA resolution (640 x 480 pixels) with a Bayer color filter, but the hardware is capable of resolutions up to 1280 x 1024 (at a lower frame rate) and other color formats. The monochrome depth-sensing video stream has VGA resolution (640 x 480 pixels) with 11-bit depth, which provides 2,048 levels of sensitivity. The Kinect can also stream the view from its IR camera directly (i.e. before it has been converted into a depth map) as 640 x 480 video at the normal frame rate, or 1280 x 1024 at a lower frame rate [10].

3 Software: The Microsoft Development Kit

3.1 Microsoft Kinect SDK

In June 2011 Microsoft released the Kinect Software Development Kit (SDK), which includes drivers compatible with Windows 7 (but not with earlier versions of Windows [1]). The Kinect SDK also contains APIs and tools for developing Kinect-enabled applications for Microsoft Windows. Developing with these APIs is almost the same as developing other Windows applications, except that the SDK provides support for the features of the Kinect, including color images, depth images, audio input, and skeletal data. Some possibilities with the Kinect SDK are: recognizing and tracking people's movement using skeletal tracking; determining the distance between an object
and the sensor camera using depth data; and capturing audio using noise and echo cancellation, or finding the location of the audio source [2]. The main characteristic of the SDK is that it already implements a way to determine what is human and what is not in the scene. Moreover, Kinect can recognize certain special points on the human body which are very important for tracking human motion; these points are shown in Figure 1. At every moment, Kinect knows the positions of these points in 3D space. They are, in a sense, the maximum of information that Kinect can give about the player's position in 3D space. Because of that, the Microsoft SDK does not have any API for finger or hand tracking, apart from the position of the hand centre, which is one of the special points.

Figure 1: Kinect recognized points

3.2 Alternative SDKs

Besides the Microsoft SDK there exist many open-source frameworks whose SDKs support working with Kinect. They are very popular among programmers, and the most famous of them are:

OpenNI. Open Natural Interaction is an industry-led, non-profit organization focused on certifying and improving interoperability of natural user interfaces and organic user interfaces for natural interaction devices, the applications that use those devices, and the middleware that facilitates access to and use of such devices. The OpenNI framework provides a set of open-source APIs, intended to become a standard for applications that access natural interaction devices. The API framework itself is also sometimes referred to as the OpenNI SDK. The APIs provide support for voice command recognition, hand gestures, and body motion tracking [4].

OpenKinect. OpenKinect is an open community of people interested in making use of the Xbox Kinect hardware with PCs and other devices. They are working on free, open-source libraries that enable the Kinect to be used with Windows, Linux, and Mac OS [5].

4 Implementation: Putting it All Together

4.1 The Rock-paper-scissors Game Idea

Rock-paper-scissors is a hand game usually played by two persons, who each show one of three shapes with a hand: rock, paper or scissors. Rock beats scissors, scissors beats paper and paper beats rock; if both players show the same shape, the game is tied [3]. With Microsoft Kinect this game can be implemented so that one player is human and the other is the computer. Since the Kinect
SDK provides APIs to recognize the skeleton and the movement of the body, it is used to recognize the hand and, moreover, the shape the player showed.

4.2 Hand Gesture Recognition

The algorithm for hand gesture recognition is based on processing the obtained depth image. The main idea is to find one point of the hand and then group all other points with the same depth. The Kinect SDK provides APIs for skeletal tracking and for obtaining the 3D positions of certain points of the human body; the most important for our algorithm are the points of the hands. The first step is to find the coordinates (x, y, z) of the right hand (since it is used for playing) in 3D space. The origin of the coordinate system, the point (0, 0, 0), is the Kinect depth sensor itself, so the obtained z coordinate is the distance of the hand from the depth sensor in millimetres. The obtained point has three coordinates, but the depth data provided by the sensor is a standard 2D image, so the point has to be transformed from 3D space to 2D; this is easily done with the SDK function NuiTransformSkeletonToDepthImage. The next step of the algorithm is to find all adjacent points (actually pixels in the obtained depth data) which have the same depth (distance) as the first obtained point. We maintain two lists, open and closed. The open list holds points that are candidates for hand points; the closed list is used to avoid processing the same point twice. First, put the initial point in the open list. While the open list is not empty, take one point from it and examine all its adjacent points, classifying each into the open or the closed list. If some point does not have approximately the same depth as the initial point, it must be a boundary point; such points are collected in a separate list of boundary points. This algorithm produces the result shown in Figure 2a. There is one more way to find all hand points. The first step is the same as above: find one pixel and determine its depth.
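The open/closed-list search described above can be sketched as a breadth-first flood fill over the depth image. The following is a minimal illustration, not the paper's actual code: the function name, the 60 mm tolerance and the plain-list depth image are our own assumptions (the paper only says "approximately the same depth"), and 4-connectivity is assumed for "adjacent" pixels.

```python
from collections import deque

def find_hand_pixels(depth, seed, tolerance=60):
    """Breadth-first search from the hand-centre pixel, collecting all
    connected pixels whose depth is close to the seed depth.

    depth     -- 2D list of depth values in millimetres
    seed      -- (row, col) of the hand centre obtained from the skeleton
    tolerance -- maximal allowed depth difference in mm (an assumption;
                 the paper only requires "approximately the same depth")
    Returns (hand_pixels, boundary_pixels).
    """
    rows, cols = len(depth), len(depth[0])
    seed_depth = depth[seed[0]][seed[1]]
    open_list = deque([seed])   # candidates still to be processed
    closed = {seed}             # already seen, avoids processing twice
    hand, boundary = [], []
    while open_list:
        r, c = open_list.popleft()
        hand.append((r, c))
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols) or (nr, nc) in closed:
                continue
            closed.add((nr, nc))
            if abs(depth[nr][nc] - seed_depth) <= tolerance:
                open_list.append((nr, nc))   # same depth: part of the hand
            else:
                boundary.append((nr, nc))    # depth jump: boundary pixel
    return hand, boundary
```

The closed set plays the role of the paper's closed list, and the queue the role of its open list; boundary pixels fall out of the same pass, as in the description above.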
Then start from pixel (0, 0) and search through the whole image, picking the pixels that have the same depth as, and are in the neighbourhood of, the first pixel. Pixels that do not belong to the hand but are adjacent to one or more hand pixels are labeled as boundary pixels. The first approach has smaller time complexity, but it is less precise than the second one. In this project the second algorithm is used, with an optimization of the search region.

Figure 2: (a) Processing depth data; (b) Contour

When the hand is recognized and the boundary pixels are obtained, the next problem is finding the contour of the hand, which is needed for finger detection. The boundary pixels do represent the contour, but they are randomly scattered; to detect fingers, it was necessary to sort these pixels either clockwise or counterclockwise. The following algorithm is used. Pick one of the pixels and put it on a stack. Search for boundary pixels in the neighbourhood of the top pixel of the stack; if a new pixel is found, push it onto the stack. There are situations where no next pixel can be found: in that case pop the last pixel from the stack and continue from the pixel that is now on top. Repeat this procedure and stop when the first pixel is reached again. Why does this algorithm work? For every boundary pixel, our hand recognition algorithm also adds its adjacent boundary pixels; that is, every pixel marked as boundary has an adjacent boundary pixel. Another algorithm was also tested for sorting the pixels: choose one pixel at random and find the pixel closest to it; repeating this procedure, always from the last found pixel, also yields the hand contour. In our experiments, the first algorithm turned out to be more efficient. The obtained result is shown in Figure 2b.

The clockwise-sorted list of pixels is very suitable for detecting fingers. The basic idea is to pick the points a = i-th, b = (i + k)-th and c = (i - k)-th of this list and calculate the angle between the vectors ab and ac. If the absolute difference between the obtained angle and α is smaller than 5 degrees, the i-th pixel is marked as a fingertip. By experimental results, the best outcome is obtained for α = 40 and k = 22.

Figure 3: Finger detection

The algorithm for gesture recognition is based on the number of fingers. This technique is used because only three gestures (rock, paper and scissors) have to be recognized. Shape matching could also be used, since it produces good results, but it is a little slower. Thus, the number of fingers is found with the previous algorithm: five detected fingertips mean the gesture for paper, two fingertips the gesture for scissors, and if no fingers are recognized, the player showed the gesture for rock.
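The angle test for fingertips can be sketched as follows, assuming the contour is already sorted clockwise. This is our own illustrative reconstruction: we compute the angle at the i-th point between the vectors towards its k-th successor and k-th predecessor, the function name is ours, and treating the contour as cyclic is an assumption. In practice, several adjacent indices around a real fingertip all pass the test, so some non-maximum suppression would be needed before counting fingers.

```python
import math

def fingertip_indices(contour, k=22, alpha=40.0, eps=5.0):
    """Mark contour points where the angle between the vectors to the
    k-th successor and k-th predecessor is within eps degrees of alpha.

    contour       -- clockwise-ordered list of (x, y) boundary pixels
    k, alpha, eps -- the experimentally chosen values from the text
    """
    n = len(contour)
    tips = []
    for i in range(n):
        ax, ay = contour[i]
        bx, by = contour[(i + k) % n]   # k-th successor (cyclic)
        cx, cy = contour[(i - k) % n]   # k-th predecessor (cyclic)
        v1 = (bx - ax, by - ay)
        v2 = (cx - ax, cy - ay)
        norm = math.hypot(*v1) * math.hypot(*v2)
        if norm == 0:
            continue
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        # clamp to avoid domain errors from floating-point rounding
        angle = math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))
        if abs(angle - alpha) < eps:
            tips.append(i)
    return tips
```

A sharp convexity of the contour (a fingertip) gives a small angle between the two vectors, while flat stretches of the contour give angles near 180 degrees, which is why the threshold around α = 40 isolates fingertips.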
The algorithm is illustrated in Figure 3.

4.3 Implementation Problems

One problem that does not depend on the algorithm comes from the Kinect SDK itself. If two of the special points are close to each other, there are precision errors when obtaining the centre point of the right hand. Thus, for best results, it is recommended to keep the hand far enough from any other special point. Another source of problems is the position of the fingers: if the human player shows two fingers that are not separated, the algorithm described above will not recognize any finger.

4.4 Computer Player Implementation

For the computer player, two methods can be used. The first is simply picking a random shape out of the three possible ones; this method does not depend on the human player's shape. The second method extends the first in an attempt to beat the human player. If the human player shows shape b after shape a several times, then there is a high probability that he will always show shape b after a. The algorithm is as follows: store pairs of consecutively shown shapes and count how many times each pair is repeated. If there exists a pair that has been repeated more than α times, then, when the human player shows the first shape of that pair, the computer player shows the shape that beats the second one. Otherwise, the first (random) algorithm is used.
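The pair-counting strategy above can be sketched as follows. The class name, the string encoding of the shapes, and the concrete threshold value (standing in for the paper's unspecified α) are our own assumptions.

```python
import random

class ComputerPlayer:
    """Predicts the human's next shape from repeated consecutive pairs.

    If the pair (a, b) -- shape b shown right after shape a -- has occurred
    more than `threshold` times, the computer answers the human's last
    shape a with the shape that beats b; otherwise it plays randomly.
    """
    BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

    def __init__(self, threshold=3):
        self.threshold = threshold   # stands in for the paper's alpha
        self.pair_counts = {}        # (previous, next) -> occurrences
        self.last_shape = None       # human's previously shown shape

    def next_move(self):
        """Choose the computer's shape before the human reveals theirs."""
        if self.last_shape is not None:
            # find the most frequent successor of the human's last shape
            best, count = None, 0
            for (a, b), c in self.pair_counts.items():
                if a == self.last_shape and c > count:
                    best, count = b, c
            if best is not None and count > self.threshold:
                return self.BEATS[best]   # beat the predicted shape
        return random.choice(list(self.BEATS))

    def observe(self, human_shape):
        """Record the human's shown shape after each round."""
        if self.last_shape is not None:
            pair = (self.last_shape, human_shape)
            self.pair_counts[pair] = self.pair_counts.get(pair, 0) + 1
        self.last_shape = human_shape
```

For example, once a player has shown paper right after rock more than `threshold` times, the computer responds to the next rock with scissors, the shape that beats the predicted paper.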
5 Closing Thoughts & Further Work

Working on this project was very helpful for the students to get to know Kinect as a capable piece of hardware that can replace classical input devices. It also helped the authors learn the basics of recognizing parts of the human body and working with them. For further work, we plan to research how to improve the precision of detection and how to find a better way to recognize the scissors, rock and paper gestures.

References

[1] Boulos, Maged N. Kamel, et al. Web GIS in practice X: a Microsoft Kinect natural user interface for Google Earth navigation. International Journal of Health Geographics 10.1 (2011): 45.
[2] Kinect for Windows Programming, www.microsoft.com
[3] Game Basics. Retrieved 2009-12-05.
[4] OpenNI: Standard framework for 3D sensing, www.openni.org
[5] www.openkinect.org
[6] Project Natal 101. Microsoft. June 1, 2009. Archived from the original on June 1, 2009. Retrieved June 2, 2009.
[7] Kinect for Windows SDK beta launches, wants PC users to get a move on. Engadget. June 16, 2011. Retrieved October 19, 2011.
[8] Totilo, Stephen (January 7, 2010). Natal Recognizes 31 Body Parts, Uses Tenth of Xbox 360 Computing Resources. Kotaku, Gawker Media. Retrieved November 25, 2010.
[9] Totilo, Stephen (June 5, 2009). Microsoft: Project Natal Can Support Multiple Players, See Fingers. Kotaku, Gawker Media. Retrieved June 6, 2009.
[10] Play.com (UK): Kinect: Xbox 360. Play.com. Retrieved July 2, 2010.