Kinect Cursor Control
EEE178, Dr. Fethi Belkhouche
Christopher Harris, Danny Nguyen

Abstract: An Xbox 360 Kinect is used to develop two applications that control the desktop cursor of a Windows computer. Application A uses the skeletal tracking feature to track body joints in the X and Y directions to control both cursor positioning and clicking. Application B uses the skeletal tracking feature to track body joints in the X, Y, and Z directions to control both cursor positioning and clicking. Along with a detailed layout of the algorithm behind the Kinect Cursor Control application, this paper also discusses in detail the process that makes the Kinect's skeletal tracking feature possible.

I. INTRODUCTION
The Xbox 360 Kinect has been around since 2010, and following the success of its controller-free gameplay technology, people have been using the Kinect's technology for all kinds of applications. In this project, the Xbox 360 Kinect is used to control the desktop cursor of a Windows computer. The idea came from the need to control a PC-connected TV screen without purchasing an expensive wireless mouse, assuming the user already owns a Kinect. The project uses the Kinect's skeletal tracking along with Microsoft's Visual Studio 2013 to create two simple applications. These applications let us control the cursor's position with the right hand and its clicking with the left, but first it is necessary to understand the features of the Kinect and the process behind skeletal tracking, from the first step of creating a depth map to the final step of joint position proposal.

II. MICROSOFT XBOX 360 KINECT
Figure 2.1. Kinect Main Components
The Kinect combines several detection mechanisms to build an accurate and comprehensive set of 3D data about what is going on inside a room. The color camera
captures RGB images along with per-pixel depth information. The sensor also relies on its pair of depth sensors, which measure the distance of objects in the room in three dimensions by emitting infrared structured-light beams: a specific grid pattern is projected and distorted according to each object's distance from the emitter. The Kinect combines structured light with two classic computer vision techniques: depth from focus, using a special astigmatic lens with different focal lengths in the x and y directions (i.e., objects that appear more blurred are farther away), and depth from stereo, which is discussed in detail later in this paper. The images are measured by an 11-bit 640x480 monochrome CMOS sensor providing 2048 levels of gray, which builds a map showing the distance from the sensor of every point in the image, as seen in figure 2.2.
Figure 2.2. Infrared structured light beams to gray level depth map
Depth is computed for the scene using a method called stereo triangulation. The depth measurement requires that corresponding points in one image be found in the second image; this is known as the correspondence problem and is solved by a method called stereo matching. Once those corresponding points are found, we can find the disparity between the two images. Disparity measures the displacement of a point between the two images (i.e., the number of pixels between a point in the right image and the corresponding point in the left image).
Figure 2.3. Stereo Matching - Correspondence Problem
Because the Kinect uses the simplest case, with the device constructed such that the image planes of the cameras are parallel to each other and to the baseline, the camera centers at the same height, and the focal lengths equal, the epipolar lines fall along the horizontal scan
lines of the images, and the images are rectified (aligned along the same parallel axis). This simplifies the process and reduces computation time. Once we have the disparity, we use triangulation to calculate the depth of that point in the scene: stereo matching computes a disparity for every pixel of the image, and triangulation then computes the 3D position for every disparity.
Figure 2.4. Depth from Stereo Images - Rectified Image
Using triangulation, the relationship between disparity and depth is determined as shown in figure 2.5; for a rectified pair it reduces to Z = f * b / d, where Z is the depth of the point, f the focal length, b the baseline, and d the disparity.
Figure 2.5. Disparity: Relationship to Depth
Using this relationship, we can draw important conclusions: disparity values are inversely proportional to the depth Z of a point. That is, far points have low disparity (the horizon has a disparity of zero) and close points have high disparity. Disparity is also proportional to the baseline b (the larger the baseline, the higher the disparity).
For the Kinect, inferring body position is a two-stage process: first compute a depth map using structured light, then infer body position using machine learning.
Body Part Inference and Joint Proposals
A key component of this work is the intermediate body part representation, in which several localized body part labels that densely cover the body are defined and color-coded as in figure 2.6.
Figure 2.6. Synthetic and Real Data. Pairs of depth images and ground truth body parts.
Some of these parts are defined to directly localize particular skeletal joints of interest, while others fill the gaps or can be used in combination to predict other joints. This intermediate representation recasts the problem as one that can be solved by efficient classification algorithms. Simple depth comparison features, as seen in figure 2.7, are used to compute the following quantity at a given pixel x:

    f_theta(I, x) = d_I(x + u / d_I(x)) - d_I(x + v / d_I(x))    (1)

where d_I(x) is the depth at pixel x in image I, and the parameters theta = (u, v) describe two pixel offsets u and v. Normalizing the offsets by 1 / d_I(x) makes the features depth invariant.

Figure 2.7. Depth Image Features
Figure 2.7 illustrates two features at different pixel locations x. Feature f_theta1 looks upwards: Equation 1 returns a large positive response for pixels x near the top of the body, but a value close to zero for pixels x lower down the body. Feature f_theta2 may instead find thin vertical structures such as the arm.
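As a concrete illustration, the depth comparison feature of Equation 1 can be sketched in a few lines. This is a minimal sketch, not the paper's implementation: the depth image is a plain Python list of rows, and the helper names `depth_at` and `depth_feature` are ours.

```python
# Sketch of the depth comparison feature of Eq. 1.
# Pixels that fall outside the image get a large constant depth,
# so probes that leave the body read as "far background".

BIG_DEPTH = 10.0  # out-of-bounds / background depth value (meters)

def depth_at(image, x, y):
    """Depth at pixel (x, y), or BIG_DEPTH outside the image."""
    if 0 <= y < len(image) and 0 <= x < len(image[0]):
        return image[y][x]
    return BIG_DEPTH

def depth_feature(image, x, y, u, v):
    """f_theta(I, x): difference of two depth probes whose offsets
    u and v are scaled by 1/depth, making the feature depth invariant."""
    d = depth_at(image, x, y)
    ux, uy = int(x + u[0] / d), int(y + u[1] / d)
    vx, vy = int(x + v[0] / d), int(y + v[1] / d)
    return depth_at(image, ux, uy) - depth_at(image, vx, vy)

# Toy 4x4 depth image: a "body" at 2.0 m with background above it.
img = [
    [BIG_DEPTH, BIG_DEPTH, BIG_DEPTH, BIG_DEPTH],
    [BIG_DEPTH, 2.0, 2.0, BIG_DEPTH],
    [2.0, 2.0, 2.0, 2.0],
    [2.0, 2.0, 2.0, 2.0],
]

# An "upward-looking" feature: probe above the pixel vs. at the pixel.
# Near the top of the body the upward probe hits background, giving a
# large positive response; lower down, both probes hit the body.
f_top = depth_feature(img, 1, 1, (0.0, -2.0), (0.0, 0.0))
f_body = depth_feature(img, 1, 3, (0.0, -2.0), (0.0, 0.0))
print(f_top, f_body)  # -> 8.0 0.0
```

This mirrors the behavior described above: a large positive response near the top of the body and a response near zero farther down.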
Figure 2.8. Randomized Decision Forest
These features, combined in a decision forest, make a sufficient and accurate classification tool for all trained parts. A forest is an ensemble of T decision trees, each consisting of split and leaf nodes, as depicted in figure 2.8. Each split node consists of a feature f_theta and a threshold tau. To classify pixel x in image I, one starts at the root and repeatedly evaluates Eq. 1, branching left or right according to the comparison with the threshold tau. At the leaf node reached in tree t, a learned distribution P_t(c | I, x) over body part labels c is stored. The distributions are averaged over all trees in the forest to give the final classification:

    P(c | I, x) = (1 / T) * sum_{t=1..T} P_t(c | I, x)    (2)

Body part recognition as described above infers per-pixel information. This information is then pooled across pixels to generate reliable proposals for the 3D skeletal joints; these proposals are the final output of the algorithm. Because outlying pixels severely degrade the quality of global estimates (global 3D centers of probability mass for each part, accumulated using the known calibrated depth), a local mode-finding approach based on mean shift with a weighted Gaussian kernel is used.
Figure 2.9. Mean shift descriptors
Mean shift is used to find modes in this density efficiently. A final confidence estimate is given as the sum of the pixel weights reaching each mode. The detected modes lie on the surface of the body, and by means of the learning algorithm a final joint position proposal is produced, making skeletal tracking possible.

III. VISUAL STUDIO 2013
Owing to the Kinect's popularity as a stereo vision sensor, Microsoft's website provides an SDK for Visual Studio for the sole purpose of programming applications for the Kinect, including games as well as Windows Forms applications. It is the programmability of Visual Studio, together with the Kinect SDK, that makes the Kinect so robust in terms of
application development. For our application we use Kinect SDK v1.8, which can be found on Microsoft's Kinect for Windows page. This package is essentially the bridge between the features of the Kinect, including all the software and libraries necessary for the sensors to work, and code written in C#, C++, or Visual Basic. By using the Kinect SDK package, we can develop our own application that interacts directly with the Kinect.

IV. COORDINATE CONVERSION
One thing we need to consider while programming an application is the coordinate system of the Kinect. As shown in Figure 4.1, it is a Cartesian coordinate system with the origin at the center of the image. In this case, the X and Y values range from -1 to 1; however, depending on the object's distance from the sensor, this range may change. For a typical application, the user stands roughly 6 feet away from the sensor, and -1 to 1 is a good approximation for the range of X and Y.
Figure 4.1. Kinect's Coordinate System
For certain applications such as ours, we need to find the relationship between the coordinate system of the Kinect and the display resolution of the computer, as shown in Figure 4.2. We also need to account for the fact that the origin of the PC's display coordinates is in the top left corner, as opposed to the bottom left of conventional X-Y coordinates.
Figure 4.2. PC Display Coordinate System
The equations needed to convert from the Kinect's coordinates to the PC coordinates are as follows:
    Xpixel = ((Xk + Xr) / (2 * Xr)) * Xpc
    Ypixel = ((-Yk + Yr) / (2 * Yr)) * Ypc
Figure 4.3. Conversion Equations
where:
Xpc / Ypc is the computer's maximum resolution.
Xk / Yk are the Kinect's X/Y coordinates.
Xr / Yr define the Kinect's reference x/y frame.
The minus sign in front of Yk flips the Y direction, compensating for the fact that the origin is in the top left corner of the PC coordinate system. These two equations convert the Kinect's coordinates (input) to the PC display coordinates (output), which depend on the resolution of the screen.
In addition, we can limit the detection area of the Kinect to a localized window. Since our application controls the cursor, we can restrict the user's movement to a small region, as shown in Figure 4.4. This alleviates the issue of users having to move farther away in order to reach the corners of the screen with the cursor.
Figure 4.4. Kinect's Reference Frame
The red box represents the localized window, which can be resized by simply changing the values of Xr / Yr above.
Example Calculation: Suppose we have the point (0.3, -0.3) in the Kinect's coordinate system, as shown in Figure 4.5, where the reference window frame values Xr and Yr are 0.7 and 0.5 respectively, and the maximum display resolution of the computer is 1366x768. Plugging the values into the equations, we get the converted point in pixels.
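The worked example can be checked with a short script. This is a minimal sketch: the linear mapping below is our reconstruction of the conversion in Figure 4.3 (mapping the reference window [-Xr, Xr] x [-Yr, Yr] to screen pixels, with Y flipped because the screen origin is top-left), and the function name `kinect_to_screen` is ours.

```python
def kinect_to_screen(xk, yk, xr, yr, width, height):
    """Map a Kinect skeleton coordinate (xk, yk) inside the
    reference window [-xr, xr] x [-yr, yr] to screen pixels.
    Y is negated because the screen origin is the top-left corner."""
    x_px = (xk + xr) / (2 * xr) * width
    y_px = (-yk + yr) / (2 * yr) * height
    return int(x_px), int(y_px)

# Example from the text: point (0.3, -0.3), reference frame
# Xr = 0.7, Yr = 0.5, display resolution 1366x768.
x, y = kinect_to_screen(0.3, -0.3, 0.7, 0.5, 1366, 768)
print(x, y)  # -> 975 614
```

As a sanity check, the Kinect origin (0, 0) maps to the center of the screen, (683, 384) for a 1366x768 display.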
Figure 4.5. Kinect's Reference values 0.7, 0.5
Figure 4.6. Solving for the X and Y Pixel Coordinates
Figure 4.7. Converted Pixel Coordinates
Now that we have a way to convert between coordinate systems, we can simply track the joints, convert their positions from the Kinect coordinate system to PC coordinates, and set them as the new mouse coordinates.

V. SKELETAL TRACKING TO CONTROL CLICKING
The first application developed tracks four points of the body, as shown in Figure 5.1:
Figure 5.1. Skeletal Tracking Joints
The Right Wrist is tracked to control the movement, or position, of the cursor. The Left Wrist is tracked via its Y coordinate to determine the type of click. The Shoulder Center is tracked via its Y coordinate and acts as the threshold for a double click. The Left Hip is tracked via its Y coordinate and acts as the threshold for a single click and hold.
Conditions for clicking (comparisons are in screen coordinates, where Y increases downward):
A Left Wrist Y position less than the Shoulder Center's (wrist raised above the shoulder) triggers a double click.
A Left Wrist Y position greater than the Left Hip's (wrist lowered below the hip) triggers a single click and hold.
A Left Wrist Y position between the Shoulder Center's and the Left Hip's does nothing.

VI. Z-DIRECTION VELOCITY TO CONTROL DOUBLE CLICK
The second application tracks only two points of the body: the Left Wrist and the Right Wrist. As in the first application, the Right Wrist controls the positioning of the cursor and the Left Wrist controls the clicking. In this case, we track an additional axis, the Z direction of the Left Wrist. We then measure the Left Wrist's velocity toward the sensor and use it to decide when to perform a double click. Velocity is measured as the change in position: by tracking the Left Wrist's distance along the Z direction and taking the difference between its previous and current positions, we can calculate an instantaneous velocity. For example, if the current distance from the Kinect to the tracked joint is 2.5 feet and the previous distance was 2.3 feet, the measured instantaneous velocity is -0.2 feet per sample. The negative sign signifies that the joint is moving away from the Kinect, and clicking is not triggered for negative velocities. We then had to determine a threshold to trigger the clicking event.
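The velocity test just described can be sketched in a few lines. This is a minimal sketch in Python rather than the application's C#, and the names `should_double_click` and `CLICK_THRESHOLD` are ours.

```python
# Sketch of the Z-velocity double-click test. Velocity is the previous
# Z distance minus the current one, in feet per sample, so a positive
# value means the wrist is moving toward the Kinect.

CLICK_THRESHOLD = 0.05  # feet per sample, the threshold found by trial

def should_double_click(prev_z, curr_z, threshold=CLICK_THRESHOLD):
    """True when the wrist is pushed toward the sensor fast enough."""
    velocity = prev_z - curr_z
    return velocity > threshold

# Example from the text: previous 2.3 ft, current 2.5 ft gives a
# velocity of -0.2 ft/sample (moving away), so no click is triggered.
print(should_double_click(2.3, 2.5))  # -> False
# A push from 2.5 ft to 2.4 ft gives +0.1 ft/sample: click.
print(should_double_click(2.5, 2.4))  # -> True
```

In the real application, `prev_z` is updated with the current joint distance on every skeleton frame, exactly as in the code listing of Appendix B.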
After several trials ranging from a slight movement to an aggressive push, the velocity threshold chosen is about 0.05 feet per sample, corresponding to a simple push: high enough not to be triggered by incidental forward movement, yet gentle enough that performing it cannot injure anyone.

VII. CONCLUSION
In this paper we discussed the process behind skeletal tracking, a two-stage process: computing a depth map using structured light, then inferring the body position using machine learning (i.e., transforming the depth image into a body part image, and the body part image into a skeleton). We then used the Kinect's skeletal tracking to create two versions of a cursor control application. The first version sets clicking thresholds on the Y position of the left wrist, whereas the second sets a clicking threshold on the left wrist's Z velocity. Both applications' clicking mechanisms work reliably. Small problems remain: the cursor positioning sometimes picks up unwanted noise, which makes the cursor jump around the screen. This may result from a mismatch between the aspect ratio of the Kinect's localized window and that of the PC display. Possible solutions include adjusting the aspect ratio of the localized window to match the computer's, or adjusting the distance between the user and the Kinect.
APPENDIX A. CODE LISTING: Skeletal tracking

    // Check for Kinect sensor.
    // Check for skeleton.
    // As long as a skeleton is found, run the following:
    bool leftclick;
    if (WristLeft < ShoulderCenter)
    {
        // The wrist is higher than the shoulder: double click.
        leftclick = true;
        Console.WriteLine("DOUBLE CLICK!");
        for (int cnt = 0; cnt < 4; cnt++)
        {
            // Call the clicking method four times: click, release, click, release.
            NativeMethods.SendMouseInput(cursorX, cursorY, resolution.Width, resolution.Height, leftclick);
            leftclick = !leftclick;
        }
        Thread.Sleep(1000); // A small delay to prevent endless clicking.
    }
    else if (WristLeft > ShoulderCenter && WristLeft < HipLeft)
    {
        // The hand is between the shoulder and hip: do nothing.
        leftclick = false;
        NativeMethods.SendMouseInput(cursorX, cursorY, resolution.Width, resolution.Height, leftclick);
    }
    else
    {
        // The hand is below the left hip: click and hold until it is raised above the hip.
        Console.WriteLine("SINGLE CLICK AND HOLD!");
        leftclick = true;
        NativeMethods.SendMouseInput(cursorX, cursorY, resolution.Width, resolution.Height, leftclick);
    }

APPENDIX B. CODE LISTING: Velocity Tracking

    // Check for Kinect sensor.
    // Check for skeleton.
    // As long as a skeleton is found, run the following:
    bool leftclick;
    if ((Zvel - jwl.Position.Z) > 0.05)
    {
        // The previous Z position minus the current one exceeds 0.05: click.
        leftclick = true;
        Console.Write("DOUBLE CLICK!");
        for (int cnt = 0; cnt < 4; cnt++)
        {
            // Call the clicking method four times: click, release, click, release.
            NativeMethods.SendMouseInput(cursorX, cursorY, resolution.Width, resolution.Height, leftclick);
            leftclick = !leftclick;
        }
        Thread.Sleep(500); // A small delay to prevent endless clicking.
    }
    else
    {
        leftclick = false;
        NativeMethods.SendMouseInput(cursorX, cursorY, resolution.Width, resolution.Height, leftclick);
    }
    Zvel = jwl.Position.Z; // Save the current Z position as the new previous value.