Multiple Frame Motion Inference Using Belief Propagation

Jiang Gao
The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213
jgao@cs.cmu.edu

Jianbo Shi
Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104
jshi@cs.upenn.edu

Abstract

We present an algorithm for automatic inference of human upper body motion. A graph model is proposed for inferring human motion, and motion inference is posed as a mapping problem between state nodes in the graph model and features in image patches. Belief propagation is utilized for Bayesian inference in this graph. A multiple-frame inference model/algorithm is proposed to combine both structural and temporal constraints in human motion. We also present a method for capturing constraints of human body configuration under different view angles. The algorithm is applied in a prototype system that can automatically label upper body motion from videos, without manual initialization of body parts.

1. Introduction

Human motion detection and tracking have many applications. For example, motion perception in a human-machine interface could enable people to communicate with computers using body language or gestures. Another application is human activity analysis, in which human motion and gestures are detected and recognized from surveillance cameras. Many research activities have been directed toward tracking and recognition of human motion and gestures. In this paper, we describe our approach for automatic inference of human upper body motion from motion energy images and color features.

1.1. Previous works

While much work has been done on human motion tracking (Bregler 1998, Ju 1996), most of the algorithms need manual initialization of body parts for tracking. Among the algorithms with self-initialization ability, only some estimate details of body parts.

Figure 1. Motion energy images of a gesture with motion history accumulated. (a) Input frames. (b) Accumulations of motion energy images starting from the first frame. (c) Same as (b), with motion energy pixels of the current frame marked.
Generally, automatic labeling of body parts is based on selected image features and techniques. Background subtraction is an effective technique for human detection and tracking (Haritaoglu 1998, Felzenszwalb), and is widely used in video surveillance applications with static cameras. The multi-view approach (Gavrila 1996) makes use of 3D information and can resolve some of the ambiguities in complex situations, such as occlusions. However, background subtraction and multi-view algorithms are not always applicable, for example in instant human activity analysis in single-camera videos. Body structure approaches based on component models of the human body have been proposed recently (Ioffe, Felzenszwalb). To label body parts, these algorithms search the space of possible human configurations and find the best match with image observations. In the work of Mori, shape context matching is used to match contours of body parts.

In this paper, we propose a new framework that can infer human upper body motion and label body parts without manual initialization. We pose body part labeling as a Bayesian inference problem in a Markov network (Jordan 1998, Yedidia). Our motivation and method are similar to those of Freeman, though with different applications. Our model is proposed to capture constraints of human body configuration under different view angles. A multiple-frame Markov network model is further proposed for combining both temporal and structural constraints in the Markov network. We use belief propagation, which performs spatial and temporal inference simultaneously, to infer body motion in the Markov network. We are using this approach to design an intelligent human-machine interface, where we can assume limited view angles, a single person, and a still background.

1.2. Motion energy images

The motion-energy image (MEI) is a motion feature for representing moving regions (Bobick). Let $D(x, y, t)$ be a binary image sequence indicating regions of motion. $D(x, y, t)$ can be obtained by image differencing followed by thresholding. In this paper we assume that

$$D(x, y, t) = 1 \tag{1}$$

represents that the pixel at $(x, y)$ in frame $t$ is in motion. The binary motion-energy image $E_\tau(x, y, t)$ is then defined as

$$E_\tau(x, y, t) = \bigcup_{i=0}^{\tau-1} D(x, y, t-i), \tag{2}$$

where $\tau$ is the temporal duration for computing the MEI. Fig. 1 shows an accumulation of MEI of a gesturing activity over several frames. The outlined pixels in Fig. 1(c) are the current slice of motion energy pixels.

In this paper, we use the motion-energy image as the motion feature to infer 3D positions of body joints. To do this, we propose a Markov network model to embed constraints of human body structure. We also propose a multiple-frame Markov network model to take advantage of temporal consistency in human motion. We divide the task into two phases: first, inferring 2D positions of body joints in images; second, recovering 3D positions of body joints. In the following, we will first present our inference model.
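As an illustration of Eqs. (1) and (2), the following Python sketch computes a binary motion mask by frame differencing and accumulates the masks into a motion-energy image. It is only a minimal sketch under stated assumptions: frames are tiny grayscale images stored as nested lists, and the names `motion_mask` and `motion_energy_image` as well as the threshold of 30 are illustrative choices that do not come from the paper.

```python
from typing import List

Frame = List[List[int]]  # grayscale image as rows of pixel intensities


def motion_mask(prev: Frame, curr: Frame, threshold: int) -> Frame:
    """Binary D(x, y, t): 1 where the absolute frame difference exceeds threshold."""
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prow, crow)]
        for prow, crow in zip(prev, curr)
    ]


def motion_energy_image(masks: List[Frame], t: int, tau: int) -> Frame:
    """E_tau(x, y, t): union (logical OR) of D(x, y, t - i) for i = 0 .. tau - 1."""
    window = masks[max(0, t - tau + 1): t + 1]
    height, width = len(window[0]), len(window[0][0])
    return [
        [1 if any(m[y][x] for m in window) else 0 for x in range(width)]
        for y in range(height)
    ]


# Three made-up 2x3 frames; masks[k] is the difference between frames k and k + 1.
frames = [
    [[0, 0, 0], [0, 0, 0]],
    [[90, 0, 0], [0, 0, 0]],
    [[90, 0, 0], [0, 80, 0]],
]
masks = [motion_mask(a, b, threshold=30) for a, b in zip(frames, frames[1:])]
mei = motion_energy_image(masks, t=1, tau=2)
```

Here `mei` marks every pixel that moved in at least one of the last `tau` difference images, which is the union in Eq. (2).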
The method for combining motion and color features into the inference models is given in section 4.

The organization of this paper is as follows. In section 2, we describe a Markov network model for human upper body motion inference in a single frame. In section 3, we propose a multi-frame model for Bayesian inference. Section 4 describes using motion and color features to constrain the inference space, and gives experimental results. Finally, section 5 concludes the paper.

2. Modeling single frame probability

Our goal is to infer positions of body joints based on motion energy images. In a Bayesian framework, given image features $x$, the body configuration $h$ can be estimated as:

$$h = \arg\max_{h} P(h \mid x), \tag{3}$$

$$P(h \mid x) = c\,P(x \mid h)\,P(h). \tag{4}$$

Here the body configuration $h$ consists of positions of body joints, denoted in this paper as $(s_1, s_2, \ldots, s_N)$.

2.1. The Markov network model

We propose the Markov network model shown in Fig. 2(a) to solve Eqs. (3) and (4). In this model, state nodes (the empty circles) represent 2D positions of body joints. Each state node is connected to a measurement node (the filled-in circles), as well as to its neighbors. We denote a state at node $i$ as $s_i$, and the observation at the corresponding measurement node as $x_i$. As shown in Fig. 2(b), $x_i$ corresponds to the body parts between joints. We define the image patches (observations) corresponding to the wrist joint, elbow, and shoulder joint as the lower arm, upper arm, and shoulder girdle, respectively. The image patches are defined based on a cardboard person model (Fig. 2(c)).

Figure 2. (a)-(b) A Markov network model for upper body motion inference. The empty circles are state nodes, and filled-in circles are measurement nodes. (c) Cardboard person model for evidence computation. (d) An upper body motion inference result.

In this model, $P_{ij}(s_i, s_j)$ is the probability that two body joint positions appear together. $P(x_i \mid s_i)$ is computed by counting the number of motion energy pixels $(x, y)$ in each image patch, i.e., pixels with $D(x, y, t) = 1$ (Eq. (1)). This definition is equivalent to saying that the more motion energy pixels an image patch contains, the more likely it is that the body joint positions defining the body part are right. Clearly, the count computed above does not directly correspond to a probability function. We need to convert this energy measurement to a probabilistic measurement. This is done by the transform

$$P(x_i \mid s_i) = C / \{1 + \exp(-E)\}, \tag{5}$$

where $E$ is the number of motion energy pixels in an image patch, and $C$ is a normalization coefficient.

The Markov network model essentially decomposes Eqs. (3)-(4) by:

$$h = (s_1, \ldots, s_N) = \arg\max P(h \mid x), \tag{6}$$

$$P(x \mid h = (s_1, \ldots, s_N)) = \prod_{i=1}^{N} P(x_i \mid s_i), \tag{7}$$

$$P(s_1, \ldots, s_N) = \frac{\prod_{(s_i, s_j)} P_{ij}(s_i, s_j)}{\prod_{s_i} P(s_i)^{\deg(s_i)}}, \tag{8}$$

where the degree of $s_i$, $\deg(s_i)$, is the number of nodes connected with $s_i$. Eqs. (6)-(8) are solved by inference in the Markov network model. In the following, we first propose a learning algorithm to estimate parameters of the Markov network model, then describe an inference algorithm based on belief propagation.

2.2. Learning the Markov network model

The Markov network defined above decomposes the Bayesian inference problem of Eqs. (3)-(4) into local states and their corresponding measurements or evidences. However, we need to estimate $P_{ij}(s_i, s_j)$ before we can do inference in this network. $P_{ij}(s_i, s_j)$ represents a prior probability for body joint positions, or configuration constraints of the human body. In this paper, we assume the face position can be estimated beforehand using algorithms such as face detection. The approximate position of the shoulder girdle is then estimated based on the position of the face and pose assumptions. Since we are only interested in human upper body configuration, we need to estimate the $P_{ij}(s_i, s_j)$'s between wrist, elbow, and shoulder joints.

We model joint probabilities of the 2D projections of body joints, rather than directly modeling the constraints in 3D. The advantage of this approach is that it simplifies the modeling process and avoids recovering 3D pose at the beginning. The drawback is that the modeling is view-specific. In our system, we train a separate set of $P_{ij}(s_i, s_j)$ for each view angle. We consider only 3 view angles at this stage, namely frontal, turning to the left, and turning to the right. For our application of human-computer interface, the 3-view-angle assumption is enough.

We use a supervised learning method for estimating $P_{ij}(s_i, s_j)$ between 2D positions of joints $i$ and $j$. We uniformly sample the 2D image space, as shown in Fig. 3, and only estimate joint probabilities at the sampling positions (intersections of the grid in Fig. 3). All the other positions are tied to the nearest sampling coordinates.

Before the learning process, each pair of sampling positions $(s_i, s_j)$ is given the same probability:

$$P_{ij}(s_i, s_j) = 1.0 / (N_s \cdot N_s), \tag{9}$$

where $N_s$ is the number of sampling positions. Then we run our body part labeling system through video sequences. For each estimated pair of body joint positions $(s_i, s_j)$ that is not a valid human body configuration, we reduce its probability by

$$P_{ij}(s_i, s_j) = P_{ij}(s_i, s_j) / T, \tag{10}$$

where $T$ is a constant and $T > 1$. Then $P_{ij}(s_i, s_j)$ is renormalized by:

$$P_{ij}(s_i, s_j) = \frac{P_{ij}(s_i, s_j)}{\sum_{s_i, s_j} P_{ij}(s_i, s_j)}. \tag{11}$$

Figure 3. Learning joint probabilities at sampling points of a 2D image plane.

3. Modeling multiple-frame probabilities

In everyday life, people use temporal constraints of human motion trajectories to help track body parts. While people may not be able to find the body configuration at an instant with enough confidence, they can definitely track human body parts after a long sequence of human performance. In this section, we extend the single-frame Markov network model of section 2 into a Markov network for multiple frames of human motion.
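Before the temporal terms are introduced, the prior-learning procedure of section 2.2 (uniform initialization, division by T for invalid pairs, renormalization, and tying positions to the nearest grid sample) can be sketched in a few lines of Python. This is an illustrative sketch rather than the system's actual implementation: the grid spacing, the value T = 2, the function names, and the hand-written list of invalid pairs (standing in for supervised labels) are all assumptions made for the example.

```python
from itertools import product
from typing import Dict, Iterable, List, Tuple

Point = Tuple[int, int]
Pair = Tuple[Point, Point]


def grid_points(width: int, height: int, step: int) -> List[Point]:
    """Sampling positions: intersections of a uniform grid over the image."""
    return [(x, y) for y in range(0, height, step) for x in range(0, width, step)]


def nearest_sample(p: Point, step: int, width: int, height: int) -> Point:
    """Tie an arbitrary image position to the nearest sampling coordinate."""
    def snap(v: int, size: int) -> int:
        last = ((size - 1) // step) * step  # largest grid coordinate on this axis
        return min(last, max(0, (v + step // 2) // step * step))
    return snap(p[0], width), snap(p[1], height)


def uniform_prior(samples: List[Point]) -> Dict[Pair, float]:
    """Uniform start: every pair of sampling positions gets 1 / (N_s * N_s)."""
    n = len(samples)
    return {(a, b): 1.0 / (n * n) for a, b in product(samples, repeat=2)}


def penalize_and_renormalize(
    prior: Dict[Pair, float], invalid_pairs: Iterable[Pair], T: float = 2.0
) -> Dict[Pair, float]:
    """Divide invalid pairs by T (T > 1), then renormalize the table to sum to 1."""
    if T <= 1.0:
        raise ValueError("T must be greater than 1")
    updated = dict(prior)
    for pair in invalid_pairs:
        updated[pair] /= T
    total = sum(updated.values())
    return {pair: p / total for pair, p in updated.items()}


samples = grid_points(width=20, height=10, step=10)   # two sampling positions
prior = uniform_prior(samples)
bad_pair = ((0, 0), (0, 0))  # hypothetical label: two joints at the same point
learned = penalize_and_renormalize(prior, [bad_pair], T=2.0)
```

Repeating the update over many labeled frames progressively lowers the prior of configurations that are repeatedly marked invalid, while the renormalization keeps the table a probability distribution.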

3.1. Temporal Constraints

The temporal constraints are added to the Markov network model by defining the joint probability of state nodes corresponding to the same body joint in consecutive frames, as shown in Fig. 4. Assuming $s_i^t$ is a state of node $i$ in frame $t$, and $s_i^{t+1}$ is a state of node $i$ in frame $t+1$, their joint probability is defined as follows:

$$P_i^{t,t+1}(s_i^t, s_i^{t+1}) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{\|s_i^t - s_i^{t+1}\|^2}{2\sigma^2}\right), \tag{12}$$

which is a Gaussian distribution of the 2D distance between $s_i^t$ and $s_i^{t+1}$. The covariance parameter $\sigma$ is determined empirically. The joint probability (12) only imposes smoothness of the transition between $s_i^t$ and $s_i^{t+1}$, without any specific model. This gives the system the ability to deal with a wide range of human motions. With temporal constraints, the Bayesian inference algorithm is more robust, and can even recover from labeling errors in a single frame.

Figure 4. Adding temporal constraints to the model. By connecting the state nodes, states of the same body joint in consecutive frames are given a joint probability.

3.2. Belief propagation

Belief propagation (BP) is an iterative algorithm for inferring the hidden states (i.e., solving Eqs. (6)-(8)) in a Markov network based on message passing. A basic iteration is as follows:

$$m_{ij}(s_j) = \alpha \sum_{s_i} P_{ij}(s_i, s_j)\, P(x_i \mid s_i) \prod_{k \in N(i) \setminus j} m_{ki}(s_i), \tag{13}$$

$$b_i(s_i) = \alpha\, P(x_i \mid s_i) \prod_{k \in N(i)} m_{ki}(s_i), \tag{14}$$

where $m_{ij}$ is the message that node $i$ sends to node $j$, and $b_i$ is the belief, i.e., the marginal posterior probability, at node $i$. $b_i$ is obtained by multiplying all incoming messages to the node by the local evidence (likelihood). $\alpha$ is a normalization constant. $N(i) \setminus j$ means all nodes neighboring node $i$ except $j$. All messages $m_{ij}(s_j)$ are initialized to 1 before the iterations begin.

Though the belief propagation algorithm is exact (i.e., guaranteed to converge to the optimal solution) only in networks without loops, recent studies show that it can also converge in many loopy networks. Our multi-frame Markov network contains loops. It is therefore interesting to see whether the BP algorithm can converge to the optimal solution in this network.

As described in section 2.2, we defined three different view angles (poses).
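The message-passing iteration and the Gaussian temporal term described in this section can be made concrete with a small sum-product sketch in Python. It is an illustration under assumptions rather than the system's actual code: the node names, candidate positions, evidence values, sigma = 10, the 2D Gaussian normalizer, and the synchronous message schedule are all hypothetical choices. The example follows one joint over three frames, where the evidence of the middle frame alone favors a distracter.

```python
import math
from typing import Callable, Dict, List, Tuple

State = Tuple[float, float]  # candidate 2D joint position


def temporal_prior(a: State, b: State, sigma: float) -> float:
    """Gaussian of the 2D distance between states in consecutive frames."""
    d2 = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2)) / (2.0 * math.pi * sigma ** 2)


def belief_propagation(
    states: Dict[str, List[State]],
    evidence: Dict[str, List[float]],
    edges: List[Tuple[str, str]],
    pairwise: Callable[[str, str, State, State], float],
    iterations: int = 10,
) -> Dict[str, List[float]]:
    """Sum-product BP; returns normalized beliefs per node and candidate."""
    neighbors: Dict[str, List[str]] = {n: [] for n in states}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    # msgs[(i, j)][b] is the message from i to j about candidate b of node j.
    msgs = {(i, j): [1.0] * len(states[j]) for i in neighbors for j in neighbors[i]}
    for _ in range(iterations):
        new_msgs = {}
        for (i, j) in msgs:
            out = []
            for sj in states[j]:
                total = 0.0
                for a, si in enumerate(states[i]):
                    incoming = math.prod(
                        msgs[(k, i)][a] for k in neighbors[i] if k != j
                    )
                    total += pairwise(i, j, si, sj) * evidence[i][a] * incoming
                out.append(total)
            z = sum(out)  # alpha: normalize each message
            new_msgs[(i, j)] = [v / z for v in out]
        msgs = new_msgs
    beliefs = {}
    for i in states:
        raw = [
            evidence[i][a] * math.prod(msgs[(k, i)][a] for k in neighbors[i])
            for a in range(len(states[i]))
        ]
        z = sum(raw)
        beliefs[i] = [v / z for v in raw]
    return beliefs


candidates = [(10.0, 10.0), (50.0, 50.0)]  # true wrist position, distracter
states = {"t0": candidates, "t1": candidates, "t2": candidates}
evidence = {"t0": [0.9, 0.1], "t1": [0.4, 0.6], "t2": [0.9, 0.1]}
edges = [("t0", "t1"), ("t1", "t2")]
beliefs = belief_propagation(
    states, evidence, edges,
    pairwise=lambda i, j, a, b: temporal_prior(a, b, sigma=10.0),
)
```

With these made-up numbers, the single-frame evidence for `t1` prefers the second candidate, while the belief after message passing prefers the first, which mirrors the recovery from single-frame labeling errors mentioned in section 3.1.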
We compute the beliefs of possible body configurations for each pose. Body pose and configuration are determined simultaneously by selecting the one which has the highest belief given the observations.

3.3. 3D body configuration recovery

Recovering 3D configurations from the 2D projections of body joint positions is based on the algorithm of Taylor. Assume $(u_1, v_1)$ and $(u_2, v_2)$ are projections of the 3D points $(X_1, Y_1, Z_1)$ and $(X_2, Y_2, Z_2)$ on the image plane. Under orthographic projection we have

$$(u_1 - u_2) = s(X_1 - X_2), \tag{15}$$

$$(v_1 - v_2) = s(Y_1 - Y_2), \tag{16}$$

and it can be derived that

$$\Delta Z = \sqrt{l^2 - \left((u_1 - u_2)^2 + (v_1 - v_2)^2\right) / s^2}, \tag{17}$$

where $\Delta Z$ is the depth difference between the two points, $s$ is a constant, and $l$ is the length between $(X_1, Y_1, Z_1)$ and $(X_2, Y_2, Z_2)$. By assuming a reference depth $Z$, and using the depth difference computed in (17), we can recover 3D positions of all the body joints. For details, refer to Taylor.

4. Experimental results

4.1. System architecture

We developed a prototype system that can automatically detect and label human upper body motion in a natural environment. The algorithm is shown in Fig. 5.

Figure 5. Upper body motion inference system. (Block labels: Feature Extraction; Compute probabilities in N neighboring frames; Belief Prop.; 3D Recovery.)
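The last stage of the pipeline, the depth recovery of section 3.3, can be sketched as follows. This is a minimal Python illustration under assumptions: the scale `s`, the segment lengths, and the reference depth are taken as known inputs, and the sign of each depth difference, which the square root in Eq. (17) leaves undetermined, is passed in explicitly through a hypothetical `signs` argument.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def relative_depth(p1: Point2D, p2: Point2D, length: float, scale: float) -> float:
    """Magnitude of the depth difference between two joints, as in Eq. (17)."""
    du, dv = p1[0] - p2[0], p1[1] - p2[1]
    radicand = length ** 2 - (du * du + dv * dv) / scale ** 2
    if radicand < 0:
        raise ValueError("projected segment is longer than scale * length")
    return math.sqrt(radicand)


def recover_chain(
    points_2d: Sequence[Point2D],
    lengths: Sequence[float],
    scale: float,
    z_ref: float = 0.0,
    signs: Optional[Sequence[int]] = None,
) -> List[Point3D]:
    """Recover 3D joints along a chain (e.g. shoulder, elbow, wrist).

    The first joint is placed at the assumed reference depth z_ref; each
    following joint is offset by the depth difference of its segment.
    """
    signs = signs or [1] * len(lengths)
    first = points_2d[0]
    joints: List[Point3D] = [(first[0] / scale, first[1] / scale, z_ref)]
    segments = zip(points_2d, points_2d[1:])
    for (p_prev, p_next), length, sign in zip(segments, lengths, signs):
        dz = relative_depth(p_prev, p_next, length, scale)
        joints.append((p_next[0] / scale, p_next[1] / scale, joints[-1][2] + sign * dz))
    return joints


# Made-up example: two segments of length 5 seen at unit scale.
joints = recover_chain([(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)], lengths=[5.0, 5.0], scale=1.0)
```

Choosing the signs is a separate decision; in this sketch they default to +1 for every segment purely for illustration.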

The algorithm proceeds as follows. First, face detection is conducted, color and motion features are extracted (Section 4.1.1), and candidate positions for each body joint are detected based on the image features (Section 4.1.2). Then, Bayesian inference is conducted in the multi-frame Markov network model. We use belief propagation to find the best body configuration. At this stage, the estimated body configuration consists of 2D positions of body joints on images. We then apply the 3D recovery algorithm to recover 3D coordinates of body joints (Section 3.3).

4.1.1. Color and motion features. We use different features in our system. For motion feature extraction, we apply frame differencing to obtain motion energy images for subsequent processing. For color feature extraction, we apply a face detection algorithm first, and build a skin color model from the detected face region.

4.1.2. Detection of candidate states from features. Candidate states of body joints are needed in the belief propagation algorithm. Theoretically, these states can be obtained by uniformly sampling the space of interest, but the potential number of candidate states would make the computational complexity extremely high. Here, we use a more practical approach by first finding candidate positions of body joints using the extracted image features. This approach improves the speed by sacrificing some accuracy.

Candidate positions for hands and wrists are detected based on the color model obtained from face detection. Some results are shown in Fig. 6(b). Distracters in the background comprise some of the candidate positions, but those are expected.

For elbow and shoulder joints, we use another strategy. We first generate approximate positions of shoulder joints based on an assumption of human pose, then we use the motion feature to generate candidate positions of elbows. Fig. 6(a) shows the method used in generating the candidate elbow positions. We accumulate motion cues in rectangles whose width approximates the width of the upper arm, rotating around the assumed shoulder joints.
For each rectangle at a rotation angle, we cluster motion cues within the rectangle, and find the major connected component of motion cues. The border of the connected component at the far end from the shoulder joint is detected as a candidate elbow joint position. After inference in the Markov network, we use the motion feature to optimize positions of shoulders, based on the estimated positions of elbows. Fig. 6(b) gives results of elbow and wrist joint detection.

It is worth noting that the candidate position detection step discussed in this section is used for speeding up the algorithm, and is not required by the proposed Markov network model. We can always sample the space to get the candidate states.

Figure 6. (a) Detection of elbow joint positions. (b) Detected candidate elbow (blue) and wrist/hand (red and pink) joint positions overlaid on frame images.

4.2. Results

We tested our algorithm on captured videos of people performing meaningful gestures. The videos are recorded with 5 people, each 5 to minutes. We also tested our system on cooking shows and some surveillance videos.

Belief propagation is straightforward to apply in the multi-frame Markov network. In our experiments, the belief propagation algorithm always converges in several iterations even though the Markov network contains loops. Fig. 7 shows a motion inference result. The estimated 2D joint positions and recovered 3D configuration are given in Fig. 7(a) and Fig. 7(b). By incorporating temporal constraints, the multi-frame model avoids many detection errors that a single-frame algorithm would make. Fig. 7(d) shows an example of a detection error based on a single frame. Fig. 7(c) shows the convergence of beliefs of state nodes after each BP iteration. In this experiment we use 4 state nodes for each frame, and a 7-frame window (total of 28 state nodes) to infer body joint positions. We show the beliefs for all candidate states in state nodes.

In our experiments, we found errors in about % of the total frames of videos under test. This does not include the cases where the estimation is roughly correct but inaccurate.
Errors occur mostly in occlusion situations (Fig. 8) or in more subtle situations, such as when two hands are too close together. Fig. 8 gives an example of an inference error caused by occlusion, partly due to the motion energy feature we used. The feature has limited discriminating ability in occlusion situations. We are now working on adding more features to the system in order to deal with some difficult situations.

Figure 7. (a) Motion inference result. (b) Recovered 3D stick figures. (c) Convergence of beliefs of state nodes with 9 and 88 candidate states, respectively. Beliefs in BP iterations are shown (horizontal axis: Iterations; vertical axis: Beliefs). (d) A single-frame optimal estimation, which was corrected by multi-frame constraints in (a).

Figure 8. An error caused by occlusion.

5. Conclusions

Human motion inference and body part labeling is a difficult problem. So far no existing feature is reliable enough for inference. We believe that to solve the problem we have to take advantage of an effective statistical inference approach and a combination of different features. This paper is an attempt in this direction. We propose a Markov network model for inference of human upper body motion. We utilize the belief propagation algorithm for inference in this Markov network. The multi-frame Bayesian inference algorithm using BP gives promising results.

In the future, we will improve the algorithm in the following aspects. First, we will compare the results of using detected candidate states and uniformly sampled candidate states. Second, we will utilize more features, or better ways to use these features, in order to deal with some difficult situations. Finally, we will look for a better solution to the view-specific problem.

References

[1] A. Bobick and J. Davis. The recognition of human movement using temporal templates. IEEE Trans. PAMI, vol. 3, no. 4.
[2] C. Bregler and J. Malik. Tracking people with twists and exponential maps. Proc. IEEE CVPR, pp. 8-5, 1998.
[3] P.F. Felzenszwalb. Object recognition with pictorial structures. MIT AI Technical Report.
[4] W.T. Freeman, E.C. Pasztor, and O.T. Carmichael. Learning Low-Level Vision. International Journal of Computer Vision, vol. 4, no., pp. 4-57, October.
[5] D. Gavrila and L.S. Davis. 3-D model-based tracking of human in action: a multi-view approach. Proc. IEEE CVPR, pp. 73-8, 1996.
[6] I. Haritaoglu, D. Harwood, and L.S. Davis. W4: Who, when, where, what: a real time system for detecting and tracking people. 3rd IEEE Int. Conf. Automatic Face and Gesture Recognition, Nara, Japan, 1998.
[7] S. Ioffe and D.A. Forsyth. Human tracking with mixtures of trees. Proc. IEEE Int. Conf. on Computer Vision, pp. 69-695, July.
[8] M.I. Jordan, ed. Learning in Graphical Models. Cambridge: MIT Press, 1998.
[9] S.M. Ju, M.J. Black, and Y. Yacoob. Cardboard people: A parameterized model of articulated motion. Int. Conf. on Automatic Face and Gesture Recognition, pp. 38-44, 1996.
[10] G. Mori and J. Malik. Estimating human body configurations using shape context matching. Proc. ECCV, pp. 666-68.
[11] C.J. Taylor. Reconstruction of articulated objects from point correspondences in a single image. Computer Vision and Image Understanding, vol. 8, no. 3, pp. 349-363.
[12] J.S. Yedidia, W.T. Freeman, and Y. Weiss. Generalized Belief Propagation. Advances in Neural Information Processing Systems (NIPS), vol. 3, pp. 689-695.