Kinect Cursor Control
EEE178, Dr. Fethi Belkhouche
Christopher Harris, Danny Nguyen

Abstract: An Xbox 360 Kinect is used to develop two applications that control the desktop cursor of a Windows computer. Application A uses the skeletal tracking feature to track body joints in the X and Y directions to control both cursor positioning and clicking. Application B uses the skeletal tracking feature to track body joints in the X, Y, and Z directions to control both cursor positioning and clicking. In this paper, along with a detailed layout of the algorithm behind the Kinect Cursor Control application, the process that makes the Kinect's skeletal tracking feature possible is also discussed in detail.

I. INTRODUCTION

The Xbox 360 Kinect has been around since 2010, and after the success of its controller-free gameplay technology, people have been using the Kinect for all kinds of applications. In this project, the Xbox 360 Kinect is used to control the desktop cursor of a Windows computer. The idea came from the need to control a PC-connected TV screen without purchasing an expensive wireless mouse, assuming the user already owns a Kinect. The project uses the Kinect's skeletal tracking together with Microsoft's Visual Studio 2013 to create two simple applications. These applications allow us to control the cursor's position with the right hand and its clicking with the left, but first it is necessary to understand the features of the Kinect and the process behind skeletal tracking, from the initial step of creating a depth map to the final step of joint position proposal.

II. MICROSOFT XBOX 360 KINECT

Figure 2.1. Kinect Main Components

The Kinect combines a few detection mechanisms to build up an accurate and comprehensive amount of 3D data about what is going on inside a room.

The color camera captures RGB images, and the Kinect pairs it with a depth sensing system that provides per-pixel depth information. The depth sensors measure the distance of objects in the room in three dimensions by emitting infrared structured-light beams that project a specific grid pattern, which is distorted according to a person's distance from the emitter. The Kinect combines structured light with two classic computer vision techniques: depth from focus, using a special astigmatic lens with different focal lengths in the x and y directions (i.e., regions that are more blurred are farther away), and depth from stereo, which is discussed in detail later in this paper. The images are measured by an 11-bit 640x480 monochrome CMOS sensor providing 2048 levels of grey, from which the Kinect builds a map showing the distance from the sensor of every point in the image, as seen in Figure 2.2.

Figure 2.2. Infrared structured light beams to gray-level depth map

The depth calculations are performed using a method called stereo triangulation. The depth measurement requires that points in one image be matched to their corresponding points in the second image, known as the correspondence problem, which is solved by a method called stereo matching. Once the corresponding points are found, we can compute the disparity between the two images. Disparity measures the displacement of a point between the two images (i.e., the number of pixels between a point in the right image and the corresponding point in the left image).

Figure 2.3. Stereo Matching - Correspondence Problem
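
To make the correspondence search concrete, the sketch below shows a generic block-matching search along one rectified scanline using the sum of absolute differences (SAD). It is a textbook illustration of stereo matching in general, not the Kinect's actual matching algorithm, and all names in it are ours.

using System;

// Illustrative only: brute-force SAD block matching along one rectified scanline.
static class BlockMatcher
{
    // leftImg/rightImg are grayscale images stored row-major; (x, y) must be far
    // enough from the borders that the matching window stays inside both images.
    public static int FindDisparity(byte[] leftImg, byte[] rightImg, int width,
                                    int x, int y, int window = 4, int maxDisparity = 64)
    {
        int bestDisparity = 0;
        long bestCost = long.MaxValue;

        for (int d = 0; d <= maxDisparity && x - d - window >= 0; d++)
        {
            long cost = 0;
            for (int dy = -window; dy <= window; dy++)
                for (int dx = -window; dx <= window; dx++)
                {
                    int l = leftImg[(y + dy) * width + (x + dx)];
                    int r = rightImg[(y + dy) * width + (x + dx - d)];
                    cost += Math.Abs(l - r);    // sum of absolute differences
                }
            if (cost < bestCost) { bestCost = cost; bestDisparity = d; }
        }
        return bestDisparity;   // displacement of the best match, in pixels
    }
}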

Because the Kinect uses the simplest case, with the device constructed so that the image planes of the cameras are parallel to each other and to the baseline, the camera centers at the same height, and the focal lengths equal, the epipolar lines fall along the horizontal scan lines of the images and the images are rectified (aligned along the same parallel axis). This simplifies the process and reduces the computation time. Once we have the disparity, triangulation is used to calculate the depth of that point in the scene: a disparity is computed for every pixel of the image with stereo matching, then triangulation is used to compute the 3D position for every disparity.

Figure 2.4. Depth from Stereo Images - Rectified Image

Using triangulation, the relationship between disparity and depth is determined as shown in Figure 2.5.

Figure 2.5. Disparity: Relationship to Depth

From this relationship we can draw important conclusions: disparity values are inversely proportional to the depth Z of a point, so far points have low disparity (e.g., the horizon has a disparity of zero) and close points have high disparity. The disparity is also proportional to the baseline b (i.e., the larger the baseline, the higher the disparity).
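
For a rectified stereo pair with focal length f and baseline b, this relationship can be written in the standard form (our notation):

    d = x_L - x_R = \frac{f \, b}{Z}  \quad\Longleftrightarrow\quad  Z = \frac{f \, b}{d}

which captures both conclusions above: depth Z grows as disparity d shrinks, and, for a fixed depth, a larger baseline b yields a larger disparity.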

For the Kinect, inferring body position is a two-stage process: first compute a depth map using structured light, then infer the body position using machine learning.

Body Part Inference and Joint Proposals

A key component of this work is the intermediate body part representation, in which several localized body part labels that densely cover the body are defined and color-coded as in Figure 2.6.

Figure 2.6. Synthetic and Real Data. Pairs of depth images and ground-truth body parts.

Some of these parts are defined to directly localize particular skeletal joints of interest, while others fill the gaps or can be used in combination to predict other joints. This intermediate representation transforms the problem into one that can be solved by efficient classification algorithms. Simple depth comparison features, as seen in Figure 2.7, are used to compute the following quantity at a given pixel x:

    f_\theta(I, x) = d_I(x + u / d_I(x)) - d_I(x + v / d_I(x))    (1)

where d_I(x) is the depth at pixel x in image I, and the parameters \theta = (u, v) describe the offsets u and v.

Figure 2.7. Depth Image Features

Figure 2.7 illustrates two features at different pixel locations x. Feature f_\theta1 looks upwards: Equation 1 returns a large positive response for pixels x near the top of the body, but a value close to zero for pixels x lower down the body. Feature f_\theta2 may instead find thin vertical structures such as the arm.
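
As a rough illustration of Equation 1 (not the production implementation), a depth comparison feature could be evaluated as sketched below; the depth image layout, the background constant, and all names are our assumptions.

// Illustrative sketch of the depth comparison feature of Eq. 1 (our own names;
// the trained feature parameters come from the body-part recognition model).
struct DepthFeature
{
    public double Ux, Uy, Vx, Vy;           // offset pair theta = (u, v)
    const double BackgroundDepth = 1e6;     // large constant for background/out-of-bounds probes

    public double Evaluate(float[] depth, int width, int height, int x, int y)
    {
        double dAtX = Probe(depth, width, height, x, y);
        if (dAtX <= 0) dAtX = BackgroundDepth;
        // Offsets are scaled by 1 / d_I(x) so the feature is depth invariant.
        double first  = Probe(depth, width, height, x + (int)(Ux / dAtX), y + (int)(Uy / dAtX));
        double second = Probe(depth, width, height, x + (int)(Vx / dAtX), y + (int)(Vy / dAtX));
        return first - second;
    }

    // Out-of-bounds probes return a large constant depth, standing in for background.
    static double Probe(float[] depth, int width, int height, int x, int y)
    {
        if (x < 0 || y < 0 || x >= width || y >= height) return BackgroundDepth;
        return depth[y * width + x];
    }
}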

Figure 2.8. Randomized Decision Forest

These features, combined in a decision forest, make a sufficient and accurate classification tool for all trained parts. A forest is an ensemble of T decision trees, each consisting of split and leaf nodes as depicted in Figure 2.8. Each split node consists of a feature f_\theta and a threshold \tau. To classify pixel x in image I, one starts at the root and repeatedly evaluates Eq. 1, branching left or right according to the comparison with the threshold \tau. At the leaf node reached in tree t, a learned distribution P_t(c | I, x) over body part labels c is stored. The distributions are averaged over all trees in the forest to give the final classification:

    P(c | I, x) = \frac{1}{T} \sum_{t=1}^{T} P_t(c | I, x)    (2)

Body part recognition as described above infers per-pixel information. This information is then pooled across pixels to generate reliable proposals for the 3D skeletal joints; these proposals are the final output of the algorithm. Because outlying pixels severely degrade the quality of global estimates (global 3D centers of probability mass for each part, accumulated using the known calibrated depth), a local mode-finding approach based on mean shift with a weighted Gaussian kernel is used instead.

Figure 2.9. Mean shift descriptors

Mean shift is used to find the modes in this density efficiently. A final confidence estimate is given as the sum of the pixel weights reaching each mode. The detected modes lie on the surface of the body, and through the learning algorithm a final joint position proposal is produced, making skeletal tracking possible.
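
To make the per-pixel classification of Eq. 2 concrete, here is a minimal sketch of forest evaluation, building on the DepthFeature sketch above. The types are ours and purely illustrative; the trained trees themselves are part of the Kinect's body-part model and are not shown.

// Illustrative sketch of classifying one pixel with a decision forest (Eq. 2).
class TreeNode
{
    public DepthFeature Feature;        // split feature f_theta
    public double Threshold;            // split threshold tau
    public TreeNode Left, Right;        // children; unused on leaves
    public double[] LeafDistribution;   // P_t(c | I, x), only set on leaves
}

static class Forest
{
    // Average the leaf distributions reached in every tree (Eq. 2).
    public static double[] Classify(TreeNode[] trees, float[] depth, int width, int height,
                                    int x, int y, int numLabels)
    {
        var p = new double[numLabels];
        foreach (var root in trees)
        {
            var node = root;
            while (node.LeafDistribution == null)   // descend until a leaf is reached
            {
                double f = node.Feature.Evaluate(depth, width, height, x, y);
                node = f < node.Threshold ? node.Left : node.Right;
            }
            for (int c = 0; c < numLabels; c++)
                p[c] += node.LeafDistribution[c] / trees.Length;
        }
        return p;   // averaged distribution over body part labels
    }
}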

III. VISUAL STUDIO 2013

Because of the Kinect's popularity as a stereo vision sensor, Microsoft's website provides an SDK for Visual Studio for the sole purpose of programming applications for the Kinect, including games as well as Windows Forms applications. It is the programmability of Visual Studio, together with the Kinect SDK, that makes the Kinect so robust in terms of application development. For our application we are using the Kinect SDK v1.8, which can be found on Microsoft's Kinect for Windows page. This package is essentially the bridge between the features of the Kinect, including all the software and libraries necessary for the sensors to work, and C#, C++, or Visual Basic code. By utilizing the Kinect SDK package, we can develop our own application that directly interacts with the Kinect.
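
As an example of what that bridge looks like in C#, the following is a minimal sketch of starting the sensor and receiving skeleton frames with SDK 1.8; error handling and the cursor logic itself are omitted, and the structure is illustrative rather than our exact application code. The joint names match those used in Sections V and VI.

using System;
using System.Linq;
using Microsoft.Kinect;   // Kinect for Windows SDK 1.8

class KinectCursorApp
{
    Skeleton[] skeletons;

    public void Start()
    {
        // Grab the first connected sensor and enable skeletal tracking.
        KinectSensor sensor = KinectSensor.KinectSensors
            .FirstOrDefault(s => s.Status == KinectStatus.Connected);
        if (sensor == null) return;

        sensor.SkeletonStream.Enable();
        sensor.SkeletonFrameReady += OnSkeletonFrameReady;
        sensor.Start();
    }

    void OnSkeletonFrameReady(object sender, SkeletonFrameReadyEventArgs e)
    {
        using (SkeletonFrame frame = e.OpenSkeletonFrame())
        {
            if (frame == null) return;
            if (skeletons == null) skeletons = new Skeleton[frame.SkeletonArrayLength];
            frame.CopySkeletonDataTo(skeletons);

            Skeleton body = skeletons.FirstOrDefault(
                s => s.TrackingState == SkeletonTrackingState.Tracked);
            if (body == null) return;

            // Joints used by the two applications (Sections V and VI).
            SkeletonPoint rightWrist = body.Joints[JointType.WristRight].Position;
            SkeletonPoint leftWrist  = body.Joints[JointType.WristLeft].Position;
            // rightWrist.X / rightWrist.Y drive the cursor; leftWrist drives clicking.
        }
    }
}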

IV. COORDINATE CONVERSION

One thing we need to consider while programming an application is the coordinate system of the Kinect. As shown in Figure 4.1, it uses Cartesian coordinates with the origin at the center of the image. In this case, the maximum X and Y values are -1 and 1; however, depending on the object's distance from the sensor, these values may change. For a typical application, the user stands roughly 6 feet away from the sensor, and -1 and 1 are a good approximation of the maximum X and Y values.

Figure 4.1. Kinect's Coordinate System

For certain applications such as ours, we need to find the relationship between the coordinate system of the Kinect and the display resolution of the computer, as shown in Figure 4.2. We also need to account for the fact that the origin of the PC's display coordinates is in the top-left corner, as opposed to the conventional bottom-left of an X-Y coordinate system.

Figure 4.2. PC Display Coordinate System

The equations needed to convert from the Kinect's coordinates to the PC coordinates are given in Figure 4.3.

Figure 4.3. Conversion Equations

where:
- Xpc/Ypc is the computer's maximum resolution,
- Xk/Yk are the Kinect's X/Y coordinates,
- Xr/Yr define the Kinect's reference X/Y frame, and
- the minus sign in front of Yk flips the negative Y direction, compensating for the fact that the origin of the PC coordinate system is in the top-left corner.

These two equations convert the Kinect's coordinates (input) to the PC display coordinates (output), which depend on the resolution of the screen. In addition, we can limit the detection area of the Kinect to a localized window. Since our application controls the cursor, we can limit the user's movement to a small region, as shown in Figure 4.4. This alleviates the issue of users having to move farther away in order to reach the corners of the screen with the mouse.

Figure 4.4. Kinect's Reference Frame

The red box represents the localized window, which can be resized by simply changing the values of Xr/Yr above.

Example Calculation: Suppose we have a point (0.3, -0.3) in the Kinect's coordinate system, as shown in Figure 4.5, where the reference window frame values Xr and Yr are 0.7 and 0.5 respectively, and the maximum display resolution of the computer is 1366x768. Plugging the values into the equations gives the converted point in pixels.

Figure 4.5. Kinect's Reference values 0.7, 0.5
Figure 4.6. Solving for the X and Y Pixel Coordinates
Figure 4.7. Converted Pixel Coordinates
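
The conversion equations themselves appear only as figures, so the sketch below implements one plausible form consistent with the definitions above: the reference window [-Xr, Xr] x [-Yr, Yr] is mapped linearly onto the full screen, with Yk negated to flip the Y axis. The exact expressions and values in the paper's own figures may differ.

using System;

static class KinectToScreen
{
    // Map a Kinect skeleton-space point (xk, yk) inside the reference window
    // [-xr, xr] x [-yr, yr] onto screen pixels; yk is negated because the PC
    // display origin is in the top-left corner. (Assumed form of Figure 4.3.)
    public static (int X, int Y) Convert(double xk, double yk,
                                         double xr, double yr,
                                         int screenWidth, int screenHeight)
    {
        double px = (xk + xr) / (2 * xr) * screenWidth;
        double py = (-yk + yr) / (2 * yr) * screenHeight;

        // Clamp so points outside the localized window stay on the screen.
        px = Math.Max(0, Math.Min(screenWidth - 1, px));
        py = Math.Max(0, Math.Min(screenHeight - 1, py));
        return ((int)px, (int)py);
    }
}

// Worked example from Section IV: (0.3, -0.3) with Xr = 0.7, Yr = 0.5 on a
// 1366x768 display maps to roughly (975, 614) under this assumed formula.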

Now that we have a way to convert between coordinate systems, we can simply track the joints, convert their positions from the Kinect coordinate system to PC coordinates, and set them as the new mouse coordinates.

V. SKELETAL TRACKING TO CONTROL CLICKING

The first application tracks four points of the body, as shown in Figure 5.1:

Figure 5.1. Skeletal Tracking Joints

The Right Wrist is tracked to control the movement, or position, of the cursor. The Left Wrist is tracked via its Y coordinate to determine the type of clicking. The Shoulder Center is tracked via its Y coordinate and acts as the threshold for double clicking. The Left Hip is tracked via its Y coordinate and acts as the threshold for single click and hold.

Conditions for clicking:
- Left Wrist Y position less than Shoulder Center triggers a double click.
- Left Wrist Y position greater than Left Hip triggers a single click and hold.
- Left Wrist Y position greater than Shoulder Center and less than Left Hip does nothing.

VI. Z-DIRECTION VELOCITY TO CONTROL DOUBLE CLICK

The second application tracks only two points of the body: the Left Wrist and the Right Wrist. As in the first application, the Right Wrist controls the positioning of the cursor and the Left Wrist controls the clicking. In this case, we track an additional axis, the Z direction of the Left Wrist. We then measure the velocity of the Left Wrist in the positive direction (toward the sensor) and use it to decide when to perform a double click. The velocity is measured as the change in position: by tracking the distance of the Left Wrist along the Z direction and taking the difference between the past and current positions, we can calculate the instantaneous velocity. For example, if the current distance from the Kinect to the tracked joint is 2.5 feet and the previous distance was 2.3 feet, the measured instantaneous velocity is -0.2 feet per sample. The negative sign signifies that the joint is moving away from the Kinect, and clicking is not triggered for negative velocities. We then had to determine a threshold to trigger the clicking event. After several trials ranging from a slight movement to an aggressive push, the velocity chosen was about 0.05 feet per sample, corresponding to a simple push: large enough that a weak forward movement does not trigger it, and not so forceful that someone could be hurt performing it.

VII. CONCLUSION

In this paper, we have discussed the process behind skeletal tracking, a two-stage pipeline: computing a depth map using structured light, then inferring the body position using machine learning (i.e., transforming the depth image into a body part image, and the body part image into a skeleton). We then used the Kinect's skeletal tracking to create two versions of a cursor control application. The first version sets clicking thresholds on the Y position of the left wrist, whereas the second version sets clicking thresholds on the left wrist's Z velocity. The clicking mechanism of both applications works as intended. Small problems remain in that the positioning of the cursor sometimes picks up unwanted noise, which results in the cursor jumping around the screen. This may be the result of the mismatch between the aspect ratio of the Kinect's localized window frame and that of the PC display. Possible solutions include adjusting the aspect ratio of the localized window to match that of the computer, or adjusting the distance between the user and the Kinect.

APPENDIX A. CODE LISTING: Skeletal Tracking

// Check for Kinect sensor
// Check for skeleton
// As long as a skeleton is found, run the following:
bool leftclick;
if (WristLeft < ShoulderCenter)
{
    // Wrist raised above the shoulder: double click.
    leftclick = true;
    Console.WriteLine("DOUBLE CLICK!");
    for (int cnt = 0; cnt < 4; cnt++)
    {
        // Calls the clicking method four times: click, release, click, release.
        NativeMethods.SendMouseInput(cursorX, cursorY, resolution.Width, resolution.Height, leftclick);
        leftclick = !leftclick;
    }
    Thread.Sleep(1000); // small delay to prevent endless clicking
}
else if (WristLeft > ShoulderCenter && WristLeft < HipLeft)
{
    // Hand between the shoulder and hip: do nothing.
    leftclick = false;
    NativeMethods.SendMouseInput(cursorX, cursorY, resolution.Width, resolution.Height, leftclick);
}
else
{
    // Hand below the left hip: click and hold until the hand is raised above the hip.
    Console.WriteLine("SINGLE CLICK AND HOLD!");
    leftclick = true;
    NativeMethods.SendMouseInput(cursorX, cursorY, resolution.Width, resolution.Height, leftclick);
}

APPENDIX B. CODE LISTING: Velocity Tracking

// Check for Kinect sensor
// Check for skeleton
// As long as a skeleton is found, run the following:
bool leftclick;
if ((Zvel - jwl.Position.Z) > 0.05)
{
    // Previous minus current Z position greater than 0.05: click.
    leftclick = true;
    Console.Write("DOUBLE CLICK!");
    for (int cnt = 0; cnt < 4; cnt++)
    {
        // Calls the clicking method four times: click, release, click, release.
        NativeMethods.SendMouseInput(cursorX, cursorY, resolution.Width, resolution.Height, leftclick);
        leftclick = !leftclick;
    }
    Thread.Sleep(500); // small delay to prevent endless clicking
}
else
{
    leftclick = false;
    NativeMethods.SendMouseInput(cursorX, cursorY, resolution.Width, resolution.Height, leftclick);
}
Zvel = jwl.Position.Z; // set the new previous Z position
