Electronic Letters on Computer Vision and Image Analysis 5(3):96-104, 2005

Hand Tracking and Gesture Recognition for Human-Computer Interaction

Cristina Manresa, Javier Varona, Ramon Mas and Francisco J. Perales
Unidad de Gráficos y Visión por Computador, Departamento de Matemáticas e Informática, Universitat de les Illes Balears, Edificio Anselm Turmeda, Crta. Valldemossa km 7.5, 07122 Palma de Mallorca - España

Received 4 February 2005; accepted 18 May 2005

Abstract

The proposed work is part of a project that aims at the control of a videogame based on hand gesture recognition. This goal implies the restriction of real-time response and the use of unconstrained environments. In this paper we present a new algorithm to track and recognise hand gestures for interacting with a videogame. This algorithm is based on three main steps: hand segmentation, hand tracking and gesture recognition from hand features. For the hand segmentation step we use the colour cue due to the characteristic colour values of human skin, its invariant properties and its computational simplicity. To prevent errors from hand segmentation we add hand tracking as a second step. Tracking is performed assuming a constant velocity model and using a pixel labeling approach. From the tracking process we extract several hand features that are fed into a finite state classifier which identifies the hand configuration. The hand can be classified into one of the four gesture classes or one of the four different movement directions. Finally, the system's performance is evaluated by showing the usability of the algorithm in a videogame environment.

Key Words: Hand Tracking, Gesture Recognition, Human-Computer Interaction, Perceptual User Interfaces.

1 Introduction

Nowadays, the majority of human-computer interaction (HCI) is based on mechanical devices such as keyboards, mice, joysticks or gamepads. In recent years there has been a growing interest in methods based on computational vision due to their ability to recognise human gestures in a natural way [1].
These methods use the images acquired from a camera or from a stereo pair of cameras as input. The main goal of these algorithms is to measure the hand configuration at each time instant. To facilitate this process, many gesture recognition applications resort to the use of uniquely coloured gloves or markers on hands or fingers [2]. In addition, using a controlled background makes it possible to locate the hand efficiently, even in real-time [3]. These two conditions impose restrictions on the use and on the interface setup. We have specifically avoided solutions that require coloured gloves or markers and a

Correspondence to: <cristina.manresa@uib.es>
Recommended for acceptance by <Perales F., Draper B.>
ELCVIA ISSN: 1577-5097
Published by Computer Vision Center / Universitat Autonoma de Barcelona, Barcelona, Spain
controlled background because of the initial requirements of our application. It must work for different people, without any complements on them, and also for unpredictable backgrounds. Our application uses images from a low-cost web camera placed in front of the work area, where the recognised gestures act as the input for a computer 3D videogame. The players, rather than pressing buttons, must use different hand gestures that our application should recognise. This fact increases the complexity, since the response time must be very fast. Users should not appreciate a significant delay between the instant they perform a gesture or motion and the instant the computer responds. Therefore, the algorithm must provide real-time performance on a conventional processor. Most of the known hand tracking and recognition algorithms do not meet this requirement and are inappropriate for visual interfaces. For instance, particle filtering-based algorithms can maintain multiple hypotheses at the same time to robustly track the hands, but they have high computational demands [4]. Recently, several contributions for reducing the complexity of particle filters have been presented, for example, using a deterministic process to help the random search [5]. Also, in [6] we can see a multi-scale colour feature for representing hand shape, and particle filtering that combines shape and colour cues in a hierarchical model. The system has been fully tested and seems robust and stable. To our knowledge the system runs at about 10 frames/second and does not consider several hand states. However, these algorithms only work in real-time for a reduced-size hand, and in our application the hand fills most of the image. In [7], shape reconstruction is quite precise, a high-DOF model is considered, and in order to avoid self-occlusions infrared orthogonal cameras are used. The authors propose to apply this technique using a colour skin segmentation algorithm.
In this paper we propose a real-time non-invasive hand tracking and gesture recognition system. In the next sections we explain our method, which is divided into three main steps. The first step is hand segmentation: the image region that contains the hand has to be located. In this process the use of shape cues is possible, but shapes vary greatly during natural hand motion [8]. Therefore, we choose skin-colour as the hand feature. Skin-colour is a distinctive cue of hands and it is invariant to scale and rotation. The next step is to track the position and orientation of the hand to prevent errors in the segmentation phase. We use a pixel-based tracking for the temporal update of the hand state. In the last step we use the estimated hand state to extract several hand features that define a deterministic process of gesture recognition. Finally, we present the system's performance evaluation results, which prove that our method works well in unconstrained environments and for several users.

2 Hand Segmentation Criteria

The hand must be located in the image and segmented from the background before recognition. Colour is the selected cue because of its computational simplicity, its invariant properties regarding the hand shape configurations, and the characteristic values of human skin-colour. Also, the assumption that colour can be used as a cue to detect faces and hands has been proved useful in several publications [9,10]. For our application, the hand segmentation has been carried out using a low computational cost method that performs well in real time. The method is based on a probabilistic model of the distribution of skin-colour pixels. Thus, it is necessary to model the skin-colour of the user's hand. The user places part of his hand in a learning square as shown in Fig. 1. The pixels restricted to this area will be used for model learning. Next, the selected pixels are transformed from the RGB-space to the HSL-space and the chroma information is taken: hue and saturation.

Figure 1: Application interface and skin-colour learning square.
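As a minimal sketch of this chroma extraction step (assuming Python with the standard colorsys module, which works in the equivalent HLS ordering; the paper's actual implementation is C++/OpenCV, and the function name is invented for this illustration), the lightness channel is simply discarded:

```python
import colorsys

def chroma_features(pixels_rgb):
    """Convert RGB pixels (0-255 per channel) into the (hue, saturation)
    pairs used for skin-colour learning; lightness is discarded."""
    samples = []
    for r, g, b in pixels_rgb:
        # colorsys returns (hue, lightness, saturation), all in [0, 1].
        h, l, s = colorsys.rgb_to_hls(r / 255.0, g / 255.0, b / 255.0)
        samples.append((h, s))  # keep only the chroma: hue and saturation
    return samples
```

In practice these samples would be gathered only from the pixels inside the learning square of Fig. 1.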
We have encountered two problems in this step that have been solved in a pre-processing phase. The first one is that human skin hue values are very near to red colour, that is, their value is very close to 2π radians, so it is difficult to learn the distribution due to the angular nature of hue, which can produce samples on both limits. To solve this inconvenience the hue values are rotated π radians. The second problem in using the HSL-space appears when the saturation values are close to 0, because then the hue is unstable and can cause false detections. This can be avoided by discarding saturation values near 0. Once the pre-processing phase has finished, the hue, H, and saturation, S, values for each selected pixel are used to infer the model, that is, x = (x_1, ..., x_n), where n is the number of samples and a sample is x_i = (H_i, S_i). A Gaussian model is chosen to represent the skin-colour probability density function. The values for the parameters of the Gaussian model (mean, \bar{x}, and covariance matrix, Σ) are computed from the sample set using standard maximum likelihood methods [11]. Once they are found, the probability that a new pixel, x = (H, S), is skin can be calculated as

P(x \text{ is skin}) = \frac{1}{2\pi\,|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\bar{x})^{T}\Sigma^{-1}(x-\bar{x})}.   (1)

Finally, we obtain the blob representation of the hand by applying a connected components algorithm to the probability image, which groups pixels into the same blob. The system is robust to background changes and low light conditions. If the system gets lost, the user can initialise it again by going to the hand Start state. Fig. 2 shows the blob contours found by the algorithm for the different environment conditions where the system has been tested.

Figure 2: Hand contours for different backgrounds (1st row) and different light conditions (2nd row).

3 Tracking Procedure

USB cameras are known for the low quality images they produce. This fact can cause errors in the hand segmentation process.
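The pre-processing and model fitting described above can be sketched as follows (a non-authoritative Python/NumPy illustration; the hue rotation of π radians follows the text, while the concrete low-saturation cut-off of 0.1 and the function names are assumptions of this sketch):

```python
import numpy as np

def learn_skin_model(samples):
    """Fit the Gaussian skin-colour model of Eq. (1).
    samples: sequence of (hue, saturation) pairs, hue in [0, 2*pi)."""
    x = np.asarray(samples, dtype=float)
    # Rotate hue by pi radians so skin tones (near 0 / 2*pi) do not wrap.
    x[:, 0] = (x[:, 0] + np.pi) % (2 * np.pi)
    # Discard samples with saturation near 0, where hue is unstable
    # (0.1 is an assumed cut-off; the paper does not give a value).
    x = x[x[:, 1] > 0.1]
    mean = x.mean(axis=0)
    cov = np.cov(x, rowvar=False)
    return mean, cov

def skin_probability(pixel, mean, cov):
    """Evaluate the Gaussian density of Eq. (1) at one (hue, sat) pixel."""
    d = np.array(pixel, dtype=float)
    d[0] = (d[0] + np.pi) % (2 * np.pi)  # same hue rotation as in learning
    d = d - mean
    norm = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)
```

Thresholding this probability over the whole frame yields the binary image to which the connected components algorithm is applied.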
In order to make the application robust to these segmentation errors we add a tracking algorithm. This algorithm tries to maintain and propagate the hand state over time.
We represent the hand state at time t, s(t), by means of a vector, s(t) = (p(t), w(t), α(t)), where p = (p_x, p_y) is the hand position in the 2D image, the hand size is represented by w = (w, h), where w is the hand width and h is the hand height in pixels, and, finally, α is the hand's angle in the 2D image plane. First, from the hand state at time t we build a hypothesis of the hand state, h = (p(t+1), w(t), α(t)), for time t+1, applying a simple second-order autoregressive process to the position component:

p(t+1) - p(t) = p(t) - p(t-1).   (2)

Equation (2) expresses a dynamical model of constant velocity. Next, if we assume that at time t, M blobs have been detected, B = {b_1, ..., b_j, ..., b_M}, where each blob b_j corresponds to a set of connected skin-colour pixels, the tracking process has to set the relation between the hand hypothesis, h, and the observations, b_j, over time. In order to cope with this problem, we define an approximation to the distance from the image pixel, x = (x, y), to the hypothesis h. First, we normalize the image pixel coordinates:

n = R_{\alpha}^{T}\,(x - p(t+1)),   (3)

where R_α is a standard 2D rotation matrix about the origin, α is the rotation angle, and n = (n_x, n_y) are the normalized pixel coordinates. Then, we can find the crossing point, c = (c_x, c_y), between the hand hypothesis ellipse and the normalized image pixel as follows:

c_x = w\cos\vartheta, \quad c_y = h\sin\vartheta,   (4)

where ϑ is the angle between the normalized image pixel and the hand hypothesis. Finally, the distance from an image pixel to the hand hypothesis is

d(x, h) = \|n\| - \|c\|.   (5)

This distance can be seen as an approximation of the distance from a point in the 2D space to a normalized ellipse (normalized means centered at the origin and not rotated). From the distance definition of (5) it turns out that its value is equal to or less than 0 if x is inside the hypothesis h, and greater than 0 if it is outside.
Therefore, considering the hand hypothesis h and a point x belonging to a blob b, if the distance is equal to or less than 0, we conclude that the blob b supports the existence of the hypothesis h and it is selected to represent the new hand state. This tracking process can also detect the presence or the absence of the hand in the image [12].
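Under the definitions of Eqs. (2)-(5), the constant-velocity prediction and the blob-support test can be sketched like this (an illustrative Python/NumPy version; the function names, the pixel-list blob representation, and the reading of (w, h) as the ellipse semi-axes are assumptions of this sketch):

```python
import numpy as np

def predict_position(p_t, p_prev):
    """Constant-velocity prediction of Eq. (2): p(t+1) = 2 p(t) - p(t-1)."""
    return 2 * np.asarray(p_t, dtype=float) - np.asarray(p_prev, dtype=float)

def ellipse_distance(x, center, size, alpha):
    """Approximate distance of Eq. (5) from pixel x to the hypothesis
    ellipse: <= 0 inside, > 0 outside.  size = (w, h), alpha = angle."""
    c, s = np.cos(alpha), np.sin(alpha)
    R = np.array([[c, -s], [s, c]])
    # Eq. (3): rotate into the normalized (axis-aligned, centered) frame.
    n = R.T @ (np.asarray(x, dtype=float) - np.asarray(center, dtype=float))
    theta = np.arctan2(n[1], n[0])
    w, h = size
    # Eq. (4): crossing point between the ellipse and the pixel direction.
    cross = np.array([w * np.cos(theta), h * np.sin(theta)])
    return np.linalg.norm(n) - np.linalg.norm(cross)

def blob_supports(blob_pixels, center, size, alpha):
    """A blob supports the hypothesis if some pixel lies inside the ellipse."""
    return any(ellipse_distance(x, center, size, alpha) <= 0
               for x in blob_pixels)
```

The blob that supports the predicted hypothesis is then selected as the new hand state, as described above.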
Figure 3: Gesture alphabet and valid gesture transitions.

4 Gesture Recognition

Our gesture alphabet consists of four hand gestures and four hand directions in order to fulfil the application's requirements. The hand gestures correspond to a fully opened hand (with separated fingers), an opened hand with fingers together, a fist, and a last gesture that appears when the hand is not visible, in part or completely, in the camera's field of view. These gestures are defined as the Start, Move, Stop and No-Hand gestures respectively. Also, when the user is in the Move gesture, he can carry out Left, Right, Front and Back movements. For the Left and Right movements, the user rotates his wrist to the left or right. For the Front and Back movements, the hand gets closer to or further from the camera. Finally, the valid hand gesture transitions that the user can carry out are defined in Fig. 3. The process of gesture recognition starts when the user's hand is placed in front of the camera's field of view and the hand is in the Start gesture, that is, the hand is fully opened with separated fingers. In order to avoid fast hand gesture changes that were not intended, every change must be kept fixed for 5 frames; otherwise the hand gesture does not change from the previously recognised gesture. To achieve this gesture recognition, we use the hand state estimated in the tracking process, that is, s = (p, w, α). This state can be viewed as an ellipse approximation of the hand where p = (p_x, p_y) is the ellipse centre and w = (w, h) is the size of the ellipse in pixels. To facilitate the process we define the major axis length as M and the minor axis length as m. In addition, we compute the hand's blob contour and its corresponding convex hull using standard computer vision techniques. From the hand's contour and the hand's convex hull we can calculate a sequence of contour points between two consecutive convex hull vertices.
This sequence forms the so-called convexity defect (i.e., a finger concavity) and it is possible to compute the depth of the ith convexity defect, d_i. From these depths it is possible to compute the depth average, \bar{d}, as a global hand feature, see (6), where n is the total number of convexity defects in the hand's contour, see Fig. 4:

\bar{d} = \frac{1}{n}\sum_{i=0}^{n} d_i.   (6)

Figure 4: Extracted features for the hand gesture recognition. In the right image, u and v indicate the start and end points of the ith convexity defect, and the depth, d_i, is the distance from the farthermost point of the convexity defect to the convex hull segment.

The first step of the gesture recognition process is to model the Start gesture. The average of the depths of the convexity defects of an opened hand with separated fingers is larger than in an opened hand with no separated fingers or in a fist. This feature is used for differentiating the following hand gesture transitions: from Stop to Start; from Start to Move; and from No-Hand to Start. However, first it is necessary to compute the Start gesture feature, T_start. Once the user is correctly placed in the camera's field of view with the hand widely opened, the skin-colour learning process is initiated. The system also computes the Start gesture feature over the first n frames:

T_{start} = \frac{1}{2n}\sum_{t=0}^{n} \bar{d}(t).   (7)

Once the Start gesture is identified, the most probable valid gesture change is the Move gesture. Therefore, if the current hand depth average is less than T_start, the system goes to the Move hand gesture. If the current hand gesture is Move, the hand directions will be enabled: Front, Back, Left and Right. If the user does not want to move in any direction, he should keep his hand in the Move state. The first time that the Move gesture appears, the system computes the Move gesture feature, T_move, that is, an average of the approximated area of the hand for n consecutive frames:

T_{move} = \frac{1}{n}\sum_{t=0}^{n} M(t)\,m(t).   (8)

In order to recognize the Left and Right directions, the calculated angle of the fitted ellipse is used. To prevent undesired jitter effects in orientation, we introduce a predefined constant, T_jitter.
Then, if the angle of the ellipse that circumscribes the hand, α, satisfies α > T_jitter, the Left orientation will be set. If the angle satisfies α < -T_jitter, the Right orientation will be set. In order to control the Front and Back orientations and to return to the Move gesture, the hand must not be rotated, and the Move gesture feature is used to differentiate these movements. If C_front · T_move < M·m, the hand orientation will be Front. The Back orientation will be set if M·m < C_back · T_move. The Stop gesture will be recognised using the ellipse's axes: when the hand is in a fist, the fitted ellipse is almost like a circle and m and M are practically the same, that is, M/m < C_stop. C_front, C_back and C_stop are predefined constants established during the algorithm performance evaluation. Finally, the No-Hand state will appear when the system does not detect the hand, when the size of the detected hand is not large enough, or when the hand is at the limits of the camera's field of view. The next possible hand state will be the Start gesture, and it will be detected using the transition procedure from Stop to Start explained earlier on.

Some examples of gesture transitions and the recognised gesture results can be seen in Fig. 5. These examples are chosen to show the algorithm's robustness for different lighting conditions, hand configurations and users. We have found that a correct learning of the skin-colour is very important; if it fails, some problems with the detection and the gesture recognition can be encountered. One of the main problems with the use of the application is hand control: maintaining the hand in the camera's field of view without touching the limits of the capture area. This problem has been shown to disappear with user training.

Figure 5: Gesture recognition examples for different lighting conditions, users and hand configurations.
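Putting the rules of this section together, the deterministic classifier can be sketched as a small state machine (a hedged Python illustration: the order in which the threshold tests are applied is our reading of the text, the concrete constant values in the usage note are invented, and the valid-transition constraints of Fig. 3 are omitted for brevity):

```python
class GestureClassifier:
    """Sketch of the finite-state gesture classifier of Section 4.
    t_start, t_move, t_jitter, c_front, c_back, c_stop correspond to the
    paper's thresholds; hold_frames implements the 5-frame stability rule."""

    def __init__(self, t_start, t_move, t_jitter,
                 c_front, c_back, c_stop, hold_frames=5):
        self.t_start, self.t_move, self.t_jitter = t_start, t_move, t_jitter
        self.c_front, self.c_back, self.c_stop = c_front, c_back, c_stop
        self.hold_frames = hold_frames
        self.state = "NO_HAND"
        self._candidate, self._count = None, 0

    def _classify(self, depth_avg, major, minor, alpha):
        if major / minor < self.c_stop:        # fist: ellipse close to a circle
            return "STOP"
        if depth_avg > self.t_start:           # deep finger concavities
            return "START"
        if alpha > self.t_jitter:              # wrist rotated left
            return "LEFT"
        if alpha < -self.t_jitter:             # wrist rotated right
            return "RIGHT"
        area = major * minor
        if area > self.c_front * self.t_move:  # hand closer to the camera
            return "FRONT"
        if area < self.c_back * self.t_move:   # hand further from the camera
            return "BACK"
        return "MOVE"

    def update(self, depth_avg, major, minor, alpha):
        """Feed one frame of hand features; a new gesture must persist for
        hold_frames frames before the reported state changes."""
        g = self._classify(depth_avg, major, minor, alpha)
        if g == self._candidate:
            self._count += 1
        else:
            self._candidate, self._count = g, 1
        if self._count >= self.hold_frames:
            self.state = g
        return self.state
```

In the paper the thresholds T_start and T_move are learnt online from the first frames of the corresponding gesture, rather than fixed in advance as in this sketch.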
5 System's performance evaluation

In this section we show the accuracy of our hand tracking and gesture recognition algorithm. The application has been implemented in Visual C++ using the OpenCV libraries [13]. The application has been tested on a Pentium IV running at 1.8 GHz. The images have been captured using a Logitech Messenger WebCam with a USB connection. The camera provides 320x240 images at a capture and processing rate of 30 frames per second.

Figure 6: System's performance evaluation results (number of tests per gesture; S: Start, M: Move, L: Left, R: Right, F: Front, B: Back, P: Stop, N: No-Hand).

For the performance evaluation of the hand tracking and gesture recognition, the system has been tested on a set of 40 users. Each user has performed a predefined set of 40 gestures, and therefore we have 1600 gestures to evaluate the application results. It is natural to think that the system's accuracy should be measured by controlling the performance of the desired user movements for managing the videogame. This sequence included all the application's possible states and transitions. Figure 6 shows the performance evaluation results. These results are represented using a graph with the application states, such as Start or Move, as columns and the number of appearances of the gesture as rows. The columns are paired for each gesture: the first column is the number of tests of the gesture that have been correctly identified; the second column is the total number of times that the gesture has been carried out. As can be seen in Fig. 6, the hand gesture recognition works well in 98% of the cases.

6 Conclusions

In this paper we have presented a real-time algorithm to track and recognise hand gestures for human-computer interaction within the context of videogames.
We have proposed an algorithm based on skin-colour hand segmentation and tracking for gesture recognition from extracted hand morphological features. The system's performance evaluation results have shown that users can substitute traditional interaction metaphors with this low-cost interface. The experiments have confirmed that continuous training of the users results in higher skills and, thus, better performance. Also, the system has been tested in an indoor laboratory with changing background scenarios and low light conditions. In these cases the system runs well, with the logical exception of situations with similar skin-coloured backgrounds or several hands intersecting in the same space and time. The system must be improved to discard bad classification situations due to the segmentation procedure; but, in this case, the user can restart the system by simply going to the Start hand state.
Acknowledgements

The projects TIC2003-0931 and TIC2002-10743-E of the MCYT Spanish Government and the European Project HUMODAN 2001-32202 from the EU V Program-IST have subsidized this work. J. Varona acknowledges the support of a Ramon y Cajal fellowship from the Spanish MEC.

References

[1] V.I. Pavlovic, R. Sharma, T.S. Huang, "Visual interpretation of hand gestures for human-computer interaction: a review", IEEE Pattern Analysis and Machine Intelligence, 19(7): 677-695, 1997.

[2] R. Bowden, D. Windridge, T. Kadir, A. Zisserman, M. Brady, "A Linguistic Feature Vector for the Visual Interpretation of Sign Language", in Tomas Pajdla, Jiri Matas (Eds.), Proc. European Conference on Computer Vision, ECCV04, v. 1: 391-401, LNCS 3022, Springer-Verlag, 2004.

[3] J. Segen, S. Kumar, "Shadow gestures: 3D hand pose estimation using a single camera", Proc. of the Computer Vision and Pattern Recognition Conference, CVPR99, v. 1: 485, 1999.

[4] M. Isard, A. Blake, "ICONDENSATION: Unifying low-level and high-level tracking in a stochastic framework", Proc. European Conference on Computer Vision, ECCV98, pp. 893-908, 1998.

[5] C. Shan, Y. Wei, T. Tan, F. Ojardias, "Real time hand tracking by combining particle filtering and mean shift", Proc. Sixth IEEE Automatic Face and Gesture Recognition, FG04, pp. 229-674, 2004.

[6] L. Bretzner, I. Laptev, T. Lindeberg, "Hand Gesture Recognition using Multi-Scale Colour Features, Hierarchical Models and Particle Filtering", Proc. Fifth IEEE International Conference on Automatic Face and Gesture Recognition, FGR02, 2002.

[7] K. Ogawara, K. Hashimoto, J. Takamatsu, K. Ikeuchi, "Grasp Recognition using a 3D Articulated Model and Infrared Images", Institute of Industrial Science, Univ. of Tokyo, Tokyo, Japan.

[8] T. Heap, D. Hogg, "Wormholes in shape space: tracking through discontinuous changes in shape", Proc. Sixth International Conference on Computer Vision, ICCV98, pp. 344-349, 1998.

[9] G.R.
Bradski, "Computer video face tracking for use in a perceptual user interface", Intel Technology Journal, Q2'98, 1998.

[10] D. Comaniciu, V. Ramesh, "Robust detection and tracking of human faces with an active camera", Proc. of the Third IEEE International Workshop on Visual Surveillance, pp. 11-18, 2000.

[11] C.M. Bishop, Neural Networks for Pattern Recognition. Clarendon Press, 1995.

[12] J. Varona, J.M. Buades, F.J. Perales, "Hands and face tracking for VR applications", Computers & Graphics, 29(2): 179-187, 2005.

[13] G.R. Bradski, V. Pisarevsky, "Intel's Computer Vision Library", Proc. of IEEE Conference on Computer Vision and Pattern Recognition, CVPR00, v. 2: 796-797, 2000.