Detection of hand grasping an object from complex background based on machine learning co-occurrence of local image feature

Shinya Morioka, Yasuhiro Hiramoto, Nobutaka Shimada, Tadashi Matsuo, Yoshiaki Shirai
Ritsumeikan University, Shiga, Japan

Abstract

This paper proposes a method for detecting a hand grasping an object in a single image based on machine learning. Learning proceeds as a bootstrap scheme, from hand-only detection to modeling of the object-hand relationship. In the first step, Randomized Trees are built that learn the positional and visual relationship between image features of the hand region and the wrist position, and that output a probability distribution over wrist positions. The wrist position is then detected by voting these distributions for multiple features in a newly captured image. In the second step, the suitability as hand is calculated for each feature by evaluating the voting result. Based on this, the object features specified by the grasping type are extracted, and the positional and visual relationship between the object and the hand is learned so that the object center can be predicted from the hand shape. In the local coordinate system constructed from the predicted object center and the wrist position, One Class SVMs learn where the object, the hand region and the background are located for the target grasping type. Finally, the grasping hand, the grasped object and the background are each extracted from the cluttered background.

I. INTRODUCTION

Some studies detect unknown objects in images based on machine learning, such as [1]. Another approach uses further image cues, such as human actions and the kind of objects. To recognize various actions of human hands, such as holding or picking up something, the hands must be detected and their postures estimated from images. In some previous studies, the object category is recognized by using the functional relationship between an object and the hand action [2][3]. They simultaneously evaluate the image appearances of the hand and the object.
However, when a hand is holding something, it is difficult to detect the hand and precisely classify its posture, because parts of the hand region may be occluded by the grasped object. This paper proposes a method that detects the position of a hand holding an object based on local hand features in spite of the occlusion. By learning local features from hands grasping various tools in various ways, the proposed method estimates the position, scale and direction of the hand and the region of the grasped object. Experiments demonstrate that the proposed method can detect them in cluttered backgrounds.

II. WRIST POSITION DETECTION BASED ON RANDOMIZED TREES LEARNING RELATIVE POSITIONS OF THE LOCAL IMAGE FEATURES

A. Training of Randomized Trees model providing the probability distribution of wrist positions

Randomized Trees (RT) [4] is a multi-class classifier that accepts multi-dimensional features and provides a probability distribution over the class indices. Here we construct a classifier that provides the probability distribution of wrist positions when SURF [5] image features are input. The distribution must be estimated even if the hand appears at an image position, scale and orientation different from those of the training images. For this purpose, the wrist positions in the training data are first normalized based on the position, scale and orientation of each SURF feature of the grasping hands seen in the training images, and the pairs of normalized wrist positions and features are then input to the classifier for training. However, since the probability distribution predicted by RT is discrete, a continuous distribution over the image space cannot be treated directly by RT. Therefore, the normalized image space is divided into discrete grid blocks, each of which has a unique class label; the class label of the block on which the wrist position lies is determined; and an RT is constructed that outputs the probability distribution of the wrist position over the grid for a 128-dimensional SURF image feature as input.
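The normalization and grid labeling described above can be sketched as follows. This is only an illustrative sketch: the function names, the 7 x 7 grid, the unit block size and the value used for the background label are assumptions, not values from the paper, and the rotation sign convention is assumed to follow the form of Eq. (2).

```python
import math

J_BACK = -1  # hypothetical id for the background label j_back


def to_local(px, py, fx, fy, theta, scale):
    """Express image point (px, py) in the frame of a feature located at
    (fx, fy) with orientation theta (radians) and scale `scale`."""
    dx, dy = px - fx, py - fy
    u = (dx * math.cos(theta) + dy * math.sin(theta)) / scale
    v = (-dx * math.sin(theta) + dy * math.cos(theta)) / scale
    return u, v


def block_label(u, v, block_size=1.0, half_blocks=3):
    """Index of the grid block containing (u, v), or None outside the grid.
    The grid has (2 * half_blocks + 1) ** 2 blocks centred on the origin."""
    n = 2 * half_blocks + 1
    col = math.floor(u / block_size + 0.5) + half_blocks
    row = math.floor(v / block_size + 0.5) + half_blocks
    if 0 <= col < n and 0 <= row < n:
        return row * n + col
    return None


def training_label(feature, wrist, on_hand):
    """Class label for one training pair; feature = (x, y, theta, scale)."""
    if not on_hand:
        return J_BACK
    fx, fy, theta, scale = feature
    u, v = to_local(wrist[0], wrist[1], fx, fy, theta, scale)
    # None means the wrist falls outside the grid; dropping such pairs is
    # an assumption of this sketch.
    return block_label(u, v)


# The same hand seen at two different poses (translated, rotated by 90
# degrees and doubled in size) should receive the same label.
label_a = training_label((0.0, 0.0, 0.0, 1.0), (2.0, 0.0), on_hand=True)
label_b = training_label((10.0, 10.0, math.pi / 2, 2.0), (10.0, 14.0), on_hand=True)
label_bg = training_label((5.0, 5.0, 0.0, 1.0), (2.0, 0.0), on_hand=False)
```

Because the wrist is expressed relative to each feature's own position, direction and scale, `label_a` and `label_b` coincide, which is the invariance the normalization is meant to provide.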
A local coordinate space defined for each SURF vector is segmented into blocks by a grid (Fig. 1(c)). The number of blocks is finite and they are indexed. A label $j$ is an index that denotes a block or the background. The $j$-th block $C_j$ is a region of the local coordinate space defined as

\[ C_j = \left\{ (u, v)^t \;\middle|\; |u - u_c(j)| < S_{block}/2,\ |v - v_c(j)| < S_{block}/2 \right\}, \tag{1} \]

where $(u_c(j), v_c(j))^t$ denotes the center of $C_j$ and $S_{block}$ denotes the size of a block. When the wrist position is $\mathbf{x}_{wrist} = (x_{wrist}, y_{wrist})^t$ and a SURF vector $\mathbf{v}$ is detected on a hand, the label $j$ of $\mathbf{v}$ is defined as the block index $j$ such that $T_{\mathbf{v}}(\mathbf{x}_{wrist}) \in C_j$, where $T_{\mathbf{v}}$ is the transformation into the local coordinate system, defined as

\[ T_{\mathbf{v}}(x, y) = \frac{1}{S_{\mathbf{v}}} \begin{bmatrix} (x - x_{\mathbf{v}}) \cos\theta_{\mathbf{v}} + (y - y_{\mathbf{v}}) \sin\theta_{\mathbf{v}} \\ -(x - x_{\mathbf{v}}) \sin\theta_{\mathbf{v}} + (y - y_{\mathbf{v}}) \cos\theta_{\mathbf{v}} \end{bmatrix}, \tag{2} \]

$(x_{\mathbf{v}}, y_{\mathbf{v}})$ denotes the position where the SURF vector $\mathbf{v}$ is detected, and $\theta_{\mathbf{v}}$ and $S_{\mathbf{v}}$ denote its direction and scale, respectively. The label of a SURF vector detected on the background is defined as a special index $j_{back}$, which is distinguished from the block indices.

To learn the relation between a SURF vector $\mathbf{v}$ and its label $j$, we collect many pairs of a SURF vector and its label from many training images in which the true wrist position is known. We then train the RT with these pairs so that we can calculate a probability distribution $P(j \mid \mathbf{v})$.

Fig. 1. Conversion of wrist position and positions of features to relative positional relationship class labels

B. Wrist position detection based on votes from the probability distribution of wrist positions

A wrist position is estimated by voting in the image space, based on the probability distribution $P(j \mid \mathbf{v})$ learned with the RT. The vote function $V_{wrist}(x, y)$ is defined as

\[ V_{wrist}(x, y) = \sum_{\mathbf{v} \,\in\, \text{all SURF vectors}} V_{\mathbf{v}}(x, y), \tag{3} \]

where

\[ V_{\mathbf{v}}(x, y) = \begin{cases} P(j = \hat{j} \mid \mathbf{v}) & \text{if } \exists\, C_{\hat{j}} \text{ s.t. } T_{\mathbf{v}}(x, y) \in C_{\hat{j}}, \\ 0 & \text{otherwise}, \end{cases} \]

and $P(j \mid \mathbf{v})$ is the probability distribution calculated by the trained RT. The position with the maximum votes,

\[ \arg\max_{x, y} V_{wrist}(x, y), \tag{4} \]

is considered a position suitable for a wrist. However, the global maximum may not be the true position, because the votes represent a local likelihood and comparing them globally is not meaningful. Therefore, we allow multiple candidates of wrist positions that have locally maximum votes.

To find local maxima without explicitly defining local regions, we use mean-shift clustering. Seeds for the clustering are placed at regular intervals in the image space as shown in Fig. 2(a), where a blue point denotes a seed, a green point denotes a trajectory, a red point denotes a limit of convergence, and the green and blue circles denote the first and second wrist candidates, respectively. The image space is clustered by the limit positions of convergence (Fig. 2(b)). For each cluster, the position with the maximum votes is taken as a candidate (Fig. 2(c)).

Fig. 2. Flow of wrist candidate detection

If the background region is larger than the hand region, the above voting is disturbed by SURF features on the background. To overcome this, we roughly separate the SURF features that apparently originate from a hand from the others with an SVM [6]. We perform the above voting algorithm with the SURF features, excluding those that apparently originate from non-hand regions.

An example of finding candidates is shown in Fig. 3. First, SURF features are extracted from an input image (Fig. 3(a)). Using the SVM, they are roughly classified, and those apparently originating from non-hand regions are excluded (Fig. 3(b),(c)). For each of the remaining SURF vectors $\mathbf{v}$, the local coordinate system is defined as described above, and the conditional probability distribution $P(j \mid \mathbf{v})$ is predicted by the RT classifier (Fig. 3(d),(e)). By voting the probability distributions, candidates for the wrist position are determined (Fig. 3(f),(g)). The red rectangle in Fig. 3(g) is the detected wrist position, showing that the wrist is successfully detected.
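The voting of Eq. (3) and the mean-shift search for locally maximal votes can be sketched as below. This is a simplified illustration under several assumptions: the per-feature distributions are hand-written dictionaries standing in for the output of a trained RT, the grid size, seed spacing and window radius are arbitrary, and each candidate is taken at the rounded limit point of the mean shift rather than at the maximum-vote position within each cluster.

```python
import math


def to_local(px, py, fx, fy, theta, scale):
    """Image point (px, py) in the local frame of a feature (cf. Eq. (2))."""
    dx, dy = px - fx, py - fy
    u = (dx * math.cos(theta) + dy * math.sin(theta)) / scale
    v = (-dx * math.sin(theta) + dy * math.cos(theta)) / scale
    return u, v


def block_label(u, v, block_size, half_blocks):
    """Grid-block index containing (u, v), or None outside the grid."""
    n = 2 * half_blocks + 1
    col = math.floor(u / block_size + 0.5) + half_blocks
    row = math.floor(v / block_size + 0.5) + half_blocks
    return row * n + col if 0 <= col < n and 0 <= row < n else None


def vote_map(features, width, height, block_size=1.0, half_blocks=2):
    """Accumulate votes. Each feature is (x, y, theta, scale, probs), where
    probs maps a block label j to P(j | v) (a stand-in for the RT output)."""
    votes = [[0.0] * width for _ in range(height)]
    for fx, fy, theta, scale, probs in features:
        for y in range(height):
            for x in range(width):
                u, v = to_local(x, y, fx, fy, theta, scale)
                j = block_label(u, v, block_size, half_blocks)
                if j is not None:
                    votes[y][x] += probs.get(j, 0.0)
    return votes


def mean_shift_candidates(votes, step=4, radius=2.0, iters=50):
    """Run vote-weighted mean shift from a regular grid of seeds and return
    [((x, y), votes)] for the distinct limit points, best first."""
    height, width = len(votes), len(votes[0])
    limits = {}
    for sy in range(0, height, step):
        for sx in range(0, width, step):
            px, py, wsum = float(sx), float(sy), 0.0
            for _ in range(iters):
                wsum = mx = my = 0.0
                for y in range(max(0, int(py - radius)), min(height, int(py + radius) + 1)):
                    for x in range(max(0, int(px - radius)), min(width, int(px + radius) + 1)):
                        if (x - px) ** 2 + (y - py) ** 2 <= radius ** 2:
                            w = votes[y][x]
                            wsum, mx, my = wsum + w, mx + w * x, my + w * y
                if wsum == 0.0:
                    break  # no votes within reach of this seed
                nx, ny = mx / wsum, my / wsum
                if abs(nx - px) + abs(ny - py) < 1e-6:
                    break  # converged
                px, py = nx, ny
            if wsum > 0.0:
                cx, cy = round(px), round(py)
                limits[(cx, cy)] = votes[cy][cx]
    return sorted(limits.items(), key=lambda kv: -kv[1])


# Two features agree on a wrist at pixel (5, 5); a third votes for (1, 9).
features = [
    (3, 5, 0.0, 1.0, {14: 0.8}),
    (5, 3, math.pi / 2, 1.0, {14: 0.6}),
    (1, 7, 0.0, 1.0, {22: 0.5}),
]
votes = vote_map(features, width=12, height=12)
candidates = mean_shift_candidates(votes)
```

In this toy input the first candidate is the pixel supported by two features and the second is the pixel supported by one, so both local maxima are retained instead of only the global one.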

Fig. 3. Flow of wrist position detection based on votes from the probability distribution of wrist positions

III. EXTRACTION OF HAND AND OBJECT REGION BASED ON LEARNING OF CO-OCCURRENCE OF HAND AND OBJECT FEATURES

To find a wrist position from features observed on a hand grasping an object, the features must be classified into those on the hand and those on the object. We introduce a One Class SVM [7] for this classification. It is trained with object-relative positions represented in a wrist-object coordinate system. First, we estimate the center of gravity of the object by RT and derive a wrist-object coordinate system from the estimated center. Then, we distinguish between feature points on the object and those on the hand by the One Class SVM.

A. Estimating gravity center of a hand by coarse classification

A wrist position can be found by the algorithm described in Section II. We can derive a wrist-object coordinate system by finding the center of gravity of the object. Here, we are looking at images showing object grasping states, and we need to classify the image features into those that belong to the hand and those that belong to the object. For this, we use the suitability as hand $h$, which represents how much each detected SURF feature contributed to the voting for the wrist position when that position was determined (i.e. the probability it voted) (Fig. 4(b)). The position $(x, y)$ within the image is classified using $h$. To calculate $h$, let $i$ be the index of a SURF feature that voted for the maximum $v_{max}$, and let $P^i_{wrist}$ be the value voted by feature $i$ at that position, so that

\[ v_{max} = \sum_i P^i_{wrist}. \tag{5} \]

The larger $P^i_{wrist}$ is, the more strongly feature $i$ contributed to determining the wrist position. In other words, the suitability as hand of each feature is

\[ h_i = P^i_{wrist}. \tag{6} \]

K-means clustering is applied to the vectors $\mathbf{x} = (x, y, h)$, which combine the position of each feature in the image with its suitability as hand, to determine whether each feature belongs to the hand class or the object class. The $h$ values of the two cluster centers $\mathbf{x}_{m_1}$ and $\mathbf{x}_{m_2}$ are compared; the cluster with the larger value is judged to be the hand class $L_{hand}$ and the other the object class $L_{obj}$:

\[ \begin{cases} m_1 \in L_{hand},\ m_2 \in L_{obj} & \text{if } h_{m_1} > h_{m_2}, \\ m_1 \in L_{obj},\ m_2 \in L_{hand} & \text{otherwise}. \end{cases} \tag{7} \]

From this, the class to which each key point $\mathbf{x}$ belongs is determined using the Euclidean distance $d(\mathbf{x}, \mathbf{x}_m)$ to the center-of-gravity vector $\mathbf{x}_m$ of each class. The hand class $L_{hand}$ and the object class $L_{obj}$ are given by equation (8).

\[ \mathbf{x} \in \begin{cases} L_{hand} & \text{if } d(\mathbf{x}, \mathbf{x}_{m_{L_{hand}}}) < d(\mathbf{x}, \mathbf{x}_{m_{L_{obj}}}), \\ L_{obj} & \text{otherwise}. \end{cases} \tag{8} \]

This yields the centers of gravity of the wrist position and object class features as well as the wrist-object coordinate system (Fig. 4(c)). Fig. 4 shows the hand region based on the effective scope of the SURF features classified into the hand class (a darker color indicates greater suitability as hand). (Pink rectangle: hand feature class, blue rectangle: object feature class)

Fig. 4. Flow of classification of hand and object features

B. Learning of an RT to output the probability distribution of object positions

Using the process in the previous section, the hand and object areas can be segmented in the training sample images of object grasping. Using these segmented images, the relationship between the wrist position and the center position of the object in these kinds of grasping patterns can be newly learned. Thus, as in Section II.A, an RT is trained on the relationship between the object's center of gravity, extracted in the previous section from the training sample image of a grasping hand, and the hand SURF features in the same image, yielding a predictor of the object's location relative to local features. After the wrist position is detected, the probability distributions output by this object-location predictor are voted into the image space relative to the SURF features that have high suitability as hand ($h$), and the object's center of gravity is detected. Fig. 5 shows an example detection result. (Red rectangle: wrist position, blue rectangle: center of gravity of objects)

Fig. 5. Detection of center of gravity of objects.

C. Learning of a One Class SVM to identify hand and object features based on the wrist-object coordinate system

With the classifiers obtained in the previous section, it becomes possible, given an input image of a grasping hand, to predict where in the image the center of the object is likely to appear. Because the wrist-object coordinate system can thus be set from the detected wrist position and the predicted object center as explained above, it is possible to learn where the grasped object area appears relative to these reference coordinates, and so to segment the object area and the background area. In the wrist-object coordinate system defined in Section III.A, both a hand-feature One Class SVM and an object-feature One Class SVM are built in order to estimate the range of the hand and object regions in the image. A One Class SVM is a classifier that learns using only positive instances to classify items into two classes. For hand features, the classifier is trained on the position and suitability as hand $(x, y, h)$ in the wrist-object coordinate system. The conversion of image coordinates $\mathbf{x} = (x, y, h)$ to wrist-object coordinates $\mathbf{x}'$ is expressed in equation (9).

\[ \mathbf{x}' = (x', y', h) \tag{9} \]

Specifically, the classifier $f_{hand}(\mathbf{x}')$ for hand class features is given by equation (10).
\[ \begin{cases} \mathbf{x}' \in L_{hand} & \text{if } f_{hand}(\mathbf{x}') > 0, \\ \mathbf{x}' \notin L_{hand} & \text{otherwise}. \end{cases} \tag{10} \]

For object features, a suitability as object class common to the appearance of all normally grasped items cannot be evaluated, so the classifier is trained with only the two-dimensional position $(x, y)$ in the wrist-object coordinate system. The object class feature classifier $f_{cobj}(\mathbf{y}')$, which uses $\mathbf{y}' = (x', y')$ as the wrist-object coordinates converted from the image coordinates, is expressed in equation (11).

\[ \begin{cases} \mathbf{y}' \in L_{cobj} & \text{if } f_{cobj}(\mathbf{y}') > 0, \\ \mathbf{y}' \notin L_{cobj} & \text{otherwise}. \end{cases} \tag{11} \]

$\mathbf{x}'$ and $\mathbf{y}'$ are calculated from the input image. Because countless objects can accompany each hand shape grasping an object, $f_{hand}(\mathbf{x}')$ is trusted more than $f_{cobj}(\mathbf{y}')$. Thus, object class candidate recognition is expressed as in equation (12).

\[ \begin{cases} \mathbf{y}' \in L_{cobj} & \text{if } f_{cobj}(\mathbf{y}') > 0 \wedge f_{hand}(\mathbf{x}') \le 0, \\ \mathbf{y}' \notin L_{cobj} & \text{otherwise}. \end{cases} \tag{12} \]

From equations (10), (11) and (12), the classifier function $f_l(\mathbf{x}', \mathbf{y}')$ for the hand class $L_{hand}$, the object candidate class $L_{cobj}$ and the background class $L_{back}$ is expressed in equation (13).

\[ f_l(\mathbf{x}', \mathbf{y}') = \begin{cases} L_{hand} & \text{if } f_{hand}(\mathbf{x}') > 0, \\ L_{cobj} & \text{if } f_{cobj}(\mathbf{y}') > 0 \wedge f_{hand}(\mathbf{x}') \le 0, \\ L_{back} & \text{otherwise}. \end{cases} \tag{13} \]

Fig. 6 shows an example result of classification into hand, object and background classes (features that are neither hand nor object) using these two classifiers. (Pink rectangle: hand feature, blue rectangle: object feature, green rectangle: background feature, red rectangle: wrist position)

Fig. 6. Result of feature class classification

IV. CONCLUSION

We constructed an RT that outputs the probability distribution of wrist positions, trained on the wrist position relative to the SURF features of the hand class. We then searched for wrist positions in a complex background. From the resulting wrist detections, we calculated the suitability as hand of the SURF features and classified the features into the hand class and the object class. From these classification results, we constructed an RT that outputs the probability distribution of the object position relative to the wrist, trained on the relationship with the location of the object's center of gravity, and then prepared the wrist-object coordinate system. To infer the hand and object areas in this coordinate system, we constructed a One Class SVM for hand features and a One Class SVM for object features, and used these two classifiers to segment the hand, object and background areas against a complex background. In the future, we will study how various objects change the grip of the hand in order to extract objects based on that relationship.

REFERENCES

[1] S. Kawabata, S. Hiura, and K. Sato, "A Rapid Anomalous Region Extraction Method by Iterative Projection onto Kernel Eigen Space," The Institute of Electronics, Information and Communication Engineers, D, J91-D(11), pp. 2673-2683, November 2008.
[2] N. Kamakura, Shape of hand and Hand motion. Ishiyaku Publishers, 1989.
[3] H. Kasahara, J. Matsuzaki, N. Shimada, and H. Tanaka, "Object Recognition by Observing Grasping Scene from Image Sequence," Meeting on Image Recognition and Understanding 2008, IS2-5, pp. 623-628, 2008.
[4] V. Lepetit and P. Fua, "Keypoint recognition using randomized trees," Transactions on Pattern Analysis and Machine Intelligence, 28, 9, pp. 1465-1479.
[5] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool, "SURF: Speeded Up Robust Features," Computer Vision and Image Understanding, Vol. 110, No. 3, pp. 346-359, 2008.
[6] V. Vapnik, The Nature of Statistical Learning Theory, Springer, 1995.
[7] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, "Estimating the support of a high-dimensional distribution," Neural Computation, 13(7):1443-1471, 2001.