Estimating Human Body Pose from a Single Image via the Specialized Mappings Architecture

Boston University, OpenBU (http://open.bu.edu). CAS: Computer Science: Technical Reports, 2000-06-10.

Rosales, Romer; Sclaroff, Stan. "Estimating Human Body Pose from a Single Image via the Specialized Mappings Architecture", Technical Report BUCS-2000-015, Computer Science Department, Boston University, June 10, 2000. Available from: https://hdl.handle.net/2144/1809

To appear in IEEE Workshop on Human Motion, Austin, TX, 2000.

Specialized Mappings and the Estimation of Human Body Pose from a Single Image

Rómer Rosales and Stan Sclaroff
Boston University, Computer Science Department
111 Cummington St., Boston, MA 02215
email: {rrosales,sclaroff}@bu.edu

Abstract

We present an approach for recovering articulated body pose from single monocular images using the Specialized Mappings Architecture (SMA), a non-linear supervised learning architecture. SMAs consist of several specialized forward (input to output space) mapping functions and a feedback matching function, estimated automatically from data. Each of these forward functions maps certain areas (possibly disconnected) of the input space onto the output space. A probabilistic model for the architecture is first formalized along with a mechanism for learning its parameters. The learning problem is approached using a maximum likelihood estimation framework; we present Expectation Maximization (EM) algorithms for several different choices of the likelihood function. The performance of the presented solutions under these different likelihood functions is compared in the task of estimating human body posture from low-level visual features obtained from a single image, showing promising results.

1 Introduction and Related Work

Estimating articulated body pose from low-level visual features is an important yet difficult problem in computer vision and machine learning. To date, there has been extensive research in the development of algorithms for human motion tracking [7, 21, 19, 4, 13, 9, 23, 17] and recognition [5], human pose estimation from a single image [1, 20], and machine learning approaches [3, 12, 22, 20]. Being able to infer detailed body pose would open the doors to the development of a great number of applications for human-computer interfaces, video coding, visual surveillance, human motion recognition, ergonomics, video indexing/retrieval, etc. In their everyday life, humans can easily estimate body part location and structure from relatively low-resolution images of the projected 3D world (e.g., watching a video).
Unfortunately, this problem is inherently difficult for a computer. Finding the mapping between low-level image features and body configurations is highly complex and ambiguous. The difficulty stems from the number of degrees of freedom in the human body, the complex underlying probability distribution, ambiguities in the projection of human motion onto the image plane, self-occlusion, insufficient temporal or spatial resolution, etc.

In this paper we attack the problem of articulated body pose estimation within the framework of non-linear supervised learning. In particular, we use a novel machine learning architecture, the Specialized Mappings Architecture (SMA). The fundamental components of an SMA are a set of specialized mapping functions and a single feedback matching function. All of these functions are estimated directly from data, in our case: examples of body poses (output) and their corresponding visual features (input).

Figure 1. The data used for training is formed by 2D marker positions and their corresponding image visual features. Here we show some frames from the same sequence viewed from two given camera orientations: (a) 0 rad, (b) $6\pi/32$ rad. Training is done by sampling the set of all possible orientations (here 32) from the same distance and height.

SMAs are related to machine learning models [14, 11, 8, 20] that use the principle of divide-and-conquer to reduce the complexity of the learning problem by splitting it into several simpler ones. In our case, each of these hopefully simpler problems is attacked using different specialized functions that act as the simpler problem solvers. In general these algorithms try to fit surfaces to the observed data by (1) splitting the input space into several regions, and (2) approximating simpler functions to fit the input-output relationship inside these regions. Sometimes these functions can be constants, and the regions may be recursively subdivided, creating a hierarchy of functions. Convergence has been reported to be generally faster than gradient-based neural network optimization algorithms [14].
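The two-step divide-and-conquer recipe described above can be illustrated with a minimal sketch. This is not the SMA itself: it assumes a one-dimensional input, fixed region boundaries chosen by hand, and constant functions per region (the simplest case mentioned in the text). The names `region_index`, `fit_piecewise_constant`, and `predict` are hypothetical.

```python
from statistics import mean

def region_index(x, boundaries):
    """Step 1: index of the input region containing x (boundaries sorted ascending)."""
    return sum(x >= b for b in boundaries)

def fit_piecewise_constant(xs, ys, boundaries):
    """Step 2: fit one constant per region; each constant is the mean target there."""
    regions = [[] for _ in range(len(boundaries) + 1)]
    for x, y in zip(xs, ys):
        regions[region_index(x, boundaries)].append(y)
    # Empty regions fall back to 0.0 (an arbitrary choice for this sketch).
    return [mean(r) if r else 0.0 for r in regions]

def predict(x, boundaries, constants):
    """Map an input using only the function that owns its region (a hard split)."""
    return constants[region_index(x, boundaries)]

# Toy data: targets near 1 for x < 0.5 and near 3 for x >= 0.5.
xs = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
ys = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]
constants = fit_piecewise_constant(xs, ys, boundaries=[0.5])
```

Here the regions are fixed before any function is fitted, so the split cannot adapt to what the local functions are able to represent.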
The divide process may create a new problem: how to optimally partition the problem such that we obtain several sub-problems that can be solved using the specific solver capabilities (i.e., the form of the mapping functions). In this sense we can consider [20] as a simplification of our approach, where the splitting is done at once, without considering either the power or characteristics of the mapping functions or the input-output relationship in the training set. This gives rise to two

independent optimization problems in which input regions are formed and a mapping function estimated for each region, causing sub-optimality. In this paper we generalize these underlying ideas and present a probabilistic interpretation along with an estimation framework that simultaneously optimizes for both problems. Moreover, we provide a formal justification of the seemingly ad-hoc method described in [20].

In the work of [8], hard splits of the data were used, i.e., the parameters in one region only depend on the data falling in that region. In [14], some of the drawbacks of the hard-split approach were pointed out (e.g., increase in the variance of the estimator), and an architecture that uses soft splits of the data, the Hierarchical Mixture of Experts, was described. In this architecture, as in [11], at each level of the tree, a gating network is used to control the influence (weight) of the expert units (mapping functions) to model the data. However, in [11] arbitrary subsets of the expert units can be chosen. Unlike these architectures, in SMAs the mapping selection is done using a feedback matching process, currently in a winner-take-all fashion, but soft splitting is done during training.

Previous learning-based approaches for estimating human body pose include [12], where a statistical approach was taken for reconstructing the three-dimensional motions of a human figure. It consisted of building a Gaussian probability model for short human motion sequences. This method assumes that 2D tracking of joints in the image is given. Unlike this method, we do not assume tracking can be performed (e.g., we do not assume that a body model can be matched to images from frame to frame). There are many known disadvantages and limitations in performing visual tracking [20]: manual initialization, poor long-term stability, necessary iterative solutions during reconstruction, and high dependence of the algorithms on the characteristics of the articulated model. In [3], the manifold of human body configurations was modeled via a hidden Markov model and learned via entropy minimization.
In [22] dynamic programming is used to calculate the best global labelling of the joint probability density function of the position and velocity of body features; it was assumed that it is possible to track these features for pairs of frames. Unlike these previous learning-based methods, our method does not attempt to model the dynamical system; instead, it relies only on instantaneous configurations. Even though this ignores information (i.e., motion components) that can be useful for constraining the reconstruction process, it provides invariance with respect to speed (i.e., sampling differences) and direction in which motions are performed. Furthermore, fewer training sequences are needed in learning a model. In our approach, a feedback matching step is used, which transforms the reconstructed configuration back to the visual cue space to choose among the set of reconstruction hypotheses. Finally, no tracking is assumed.

2 Specialized Mappings and Learning

In this paper, SMAs are described to approach the problem of supervised learning. Define the set of output-input observation pairs $Z = \{(\psi_i, \upsilon_i)\}$, with $\psi_i \in \Psi$ and $\upsilon_i \in \Upsilon$. Let us call the output and input vectors the target and cue vectors, and consider them as elements of $\mathbb{R}^t$ and $\mathbb{R}^c$ respectively. Let us assume that there is a functional relation between cue and target vectors that we call $\phi^{*} : \mathbb{R}^c \to \mathbb{R}^t$, such that $\psi = \phi^{*}(\upsilon)$; define this to be the forward mapping. The problem is to approximate this function $\phi^{*}$. In theory this problem can be formulated by finding

$$\phi = \arg\min_{\phi} \sum_{i=1}^{n} E(\phi(\upsilon_i) - \psi_i)$$

where $n$ is the cardinality of $\Psi$ or $\Upsilon$ [2, 10, 15], and $E$ is an error function. The problem of function approximation from sparse data is known to be ill-posed if no further constraints are added [2, 10] (e.g., on the functional form or architecture of $\phi$). In this paper, we attack non-linear supervised learning problems using an architecture that generates a series of $m$ functions $\phi_k$ in which each of these functions is specialized to map only certain inputs, for example a region of the input space. However, the domain of $\phi_k$ can be more general than just a connected region in the input space.
We propose to determine these regions and functions simultaneously. In contrast with [14, 11] we do not have a mixture of expert functions weighted by gating networks when generating an output; in SMAs, an input is only mapped by a given function. For this, assume there is another functional relation $\zeta$ such that $\upsilon = \zeta(\psi)$ (i.e., an inverse mapping), which can be known or learned. Given this, SMAs involve a feedback matching process to choose among the series of hypotheses given by each specialized function.

2.1 Probabilistic Model

In order to give a probabilistic interpretation to the architecture, let us define some notation first. Let the training sets of output-input observations be $\Psi = \{\psi_1, \psi_2, \ldots, \psi_n\}$ and $\Upsilon = \{\upsilon_1, \upsilon_2, \ldots, \upsilon_n\}$ respectively. We will use $z_i = (\psi_i, \upsilon_i)$ to define the given output-input training pair, and $Z = \{z_1, \ldots, z_n\}$ as our observed training set. In general the vector $z$ is defined to be composed of two parts, one denoted $\psi$ and another denoted $\upsilon$, associated with the output and input space respectively. Define the unobserved random variables $y = (y_1, \ldots, y_n)$, with $i \in \{1, \ldots, n\}$. In our model these variables have as domain the discrete set $C = \{1, \ldots, m\}$ of labels for the specialized functions, and $y_i$ can be thought of as the function number used to map data point $i$; therefore $m$ is the number of specialized functions in the model. Our model uses parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_m, \lambda)$, where $\theta_k$ represents the parameters of the mapping function $k$. The vector $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_m)$, where $\lambda_k$ represents $P(y_i = k \mid \theta)$, the prior probability that the mapping function with label $k$ will be used to map an unknown point. As an example, $P(y_i \mid z_i, \theta)$ represents the probability that function number $y_i$ generated data point number $i$ (given our model parameters). Using Bayes rule and assuming independence among observations and a uniform prior $p(\theta)$ we have the joint

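probability of our architecture:

$$P(Z, y, \theta) \propto P(Z \mid y, \theta)\, P(y \mid \theta) = \prod_{i=1}^{n} P(z_i \mid y_i, \theta)\, P(y_i \mid \theta) \qquad (1)$$

A key question in instantiating the architecture is: what is $P(z_i \mid y_i, \theta)$ (the probability that point $z_i$ was generated using the mapping function $y_i$, assuming a certain value for its parameters)? In this paper we analyze three possible cases:

1. A Gaussian joint distribution of input-output vectors:

$$P(z_i \mid y_i, \theta) = P(\psi_i, \upsilon_i \mid y_i, \theta) = N((\psi_i, \upsilon_i);\, \mu_{y_i}, \Sigma_{y_i}) \qquad (2)$$

2. A Gaussian distribution with mean defined by the error incurred in using the possibly non-linear function $\phi_{y_i}$ as a mapping function, and a fixed, given variance $\Sigma_{y_i}$:

$$P(z_i \mid y_i, \theta) = N(\psi_i;\, \phi_{y_i}(\upsilon_i), \Sigma_{y_i}) \qquad (3)$$

3. A comparison of distance measures among all functions, which generates a competition among functions to represent the data points, for example:

$$P(z_i \mid y_i, \theta) = \frac{e^{-\rho(\psi_i - \phi_{y_i}(\upsilon_i))}}{\sum_{j \in C} e^{-\rho(\psi_i - \phi_{j}(\upsilon_i))}} \qquad (4)$$

where $\rho$ is a given error norm, and $\phi_j$ is the $j$-th mapping function. This can be written more generally as:

$$P(z_i \mid y_i, \theta) = \frac{e^{-\rho_{y_i}(z_i)}}{\sum_{j \in C} e^{-\rho_{j}(z_i)}} \qquad (5)$$

3 EM Algorithms for Learning the Parameters of the Model

The probabilistic parameter estimation problem is approached under the Expectation Maximization (EM) algorithm framework [6], using the notation followed by [16]. The E-step consists of finding $\tilde{P}(y) = P(y \mid Z, \theta)$. It can be shown that this reduces to:

$$\tilde{P}^{(t)}(y) = \prod_{i} \frac{\lambda_{y_i}\, P(z_i \mid y_i, \theta)}{\sum_{k \in C} \lambda_{k}\, P(z_i \mid y_i = k, \theta)} = \prod_{i} \tilde{P}^{(t)}(y_i) \qquad (6)$$

The M-step consists of finding $\theta^{(t)} = \arg\max_{\theta} E_{\tilde{P}}[\log P(y, Z \mid \theta)]$. In our case we can show that this is equivalent to:

$$\theta^{(t)} = \arg\max_{\theta} \sum_{i} \sum_{y_i \in C} \tilde{P}^{(t)}(y_i)\, [\log P(z_i \mid y_i, \theta) + \log P(y_i \mid \theta)] \qquad (7)$$

It is important to mention that this is valid if $P(z_i \mid y, \theta)$ depends on $y_i$ and not on $y_j$, for any $j \neq i$. Note that for the distributions discussed above, this is true. We present solutions for the cases described above. Due to space constraints, only final equations are shown.

In case (1) we have:

$$P(z_i \mid y_i, \theta) = N(z_i;\, \mu_{y_i}, \Sigma_{y_i}) = N\left( \begin{pmatrix} \mu_{\psi} \\ \mu_{\upsilon} \end{pmatrix}, \begin{pmatrix} \Sigma_{\psi\psi} & \Sigma_{\upsilon\psi}^{\top} \\ \Sigma_{\upsilon\psi} & \Sigma_{\upsilon\upsilon} \end{pmatrix} \right)_{y_i} \qquad (8)$$

In this case, we can show that the SMA parameter learning problem reduces neatly to estimating a mixture of Gaussians, which is straightforward using EM.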
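The E-step of Eq. 6 factorizes over data points: for each point, the prior of every specialized function is multiplied by that function's likelihood for the point, and the products are normalized. A minimal sketch follows, under these assumptions: likelihoods are supplied in log form, all priors are strictly positive, and the toy likelihood is the case (2) Gaussian with an isotropic covariance (variance times identity). The names `e_step` and `gaussian_log_lik` are hypothetical, not from the paper.

```python
import math

def e_step(log_lik, priors):
    """Per-point posteriors over function labels, following Eq. 6.

    log_lik[i][k] is log P(z_i | y_i = k, theta); priors[k] is the prior of function k.
    """
    posteriors = []
    for row in log_lik:
        scores = [math.log(p) + ll for p, ll in zip(priors, row)]
        top = max(scores)  # shift by the maximum for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        posteriors.append([w / total for w in weights])
    return posteriors

def gaussian_log_lik(residual, variance):
    """Log density of a zero-mean isotropic Gaussian at `residual` (assumed form)."""
    d = len(residual)
    sq = sum(r * r for r in residual)
    return -0.5 * (sq / variance + d * math.log(2.0 * math.pi * variance))

# Two data points, two functions; residuals are target minus function output.
log_lik = [
    [gaussian_log_lik([0.1], 1.0), gaussian_log_lik([2.0], 1.0)],
    [gaussian_log_lik([3.0], 1.0), gaussian_log_lik([0.0], 1.0)],
]
posteriors = e_step(log_lik, priors=[0.5, 0.5])
```

Because the function only sees log-likelihoods, the same routine would apply to any of the three likelihood choices; only the way `log_lik` is filled in changes.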
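Moreover, the ML estimate of the conditional distribution $P(\psi \mid \upsilon, y, \theta)$ (the conditional distribution is of major importance because our problem consists in estimating $\psi$ from observing $\upsilon$) is also Gaussian, given by:

$$P(\psi \mid \upsilon, y, \theta) = N\left( \mu_{\psi} + \Sigma_{\upsilon\psi}^{\top} \Sigma_{\upsilon\upsilon}^{-1} (\upsilon - \mu_{\upsilon}),\; \Sigma_{\psi\psi} - \Sigma_{\upsilon\psi}^{\top} \Sigma_{\upsilon\upsilon}^{-1} \Sigma_{\upsilon\psi} \right)_{y} \qquad (9)$$

where $\mu_{\psi}$, $\mu_{\upsilon}$ and the $\Sigma$ blocks are the partitions of the mean and covariance of component $y$ in Eq. 8. Therefore in case (1), each specialized function is just the mean of the conditional distribution (conditioned on the observation and the function index):

$$\phi_k(\upsilon) = \mu_{\psi} + \Sigma_{\upsilon\psi}^{\top} \Sigma_{\upsilon\upsilon}^{-1} (\upsilon - \mu_{\upsilon}) \qquad (10)$$

with all parameters taken from component $k$. Moreover, we have an expression for the confidence in this estimate, given by the variance above. Thus, the functions are linear (affine) in the input vector.

In case (2) we have:

$$\frac{\partial E}{\partial \lambda_k} = \sum_{i} \tilde{P}^{(t)}(y_i = k)\, \frac{\partial}{\partial \lambda_k} \log P(y_i = k \mid \theta) \qquad (11)$$

$$\frac{\partial E}{\partial \theta_k} = \sum_{i} \tilde{P}^{(t)}(y_i = k) \left[ \left( \frac{\partial \phi_k(\upsilon_i)}{\partial \theta_k} \right)^{\top} \Sigma_k^{-1} (\psi_i - \phi_k(\upsilon_i)) \right] \qquad (12)$$

where $E$ is the cost function found in Eq. 7. This gives the following update rule for $\lambda_k$ (where Lagrange multipliers were used to incorporate the constraint $\sum_k \lambda_k = 1$):

$$\lambda_k = \frac{1}{n} \sum_{i} P(y_i = k \mid z_i, \theta) \qquad (13)$$

The update of $\theta_k$ depends on the form of $\phi_k$. This case is of particular importance in justifying the approach presented in [20] from a probabilistic perspective. In [20], output data (from $\Psi$) is clustered using a mixture of Gaussians model, and then for each cluster a multi-layer perceptron is used to estimate the mapping from input to output space. Let us consider the SMA obtained by choosing $\phi_k$ to be a multi-layer perceptron neural network. First note that the bracketed term in Eq. 12 is equivalent to backpropagation (assuming $\Sigma_k = I$). Using a winner-take-all variant to update the gradient found in Eq. 12, we have:

$$\frac{\partial E}{\partial \theta_k} = \sum_{i \in W_k} \left[ \left( \frac{\partial \phi_k(\upsilon_i)}{\partial \theta_k} \right)^{\top} \Sigma_k^{-1} (\psi_i - \phi_k(\upsilon_i)) \right] \qquad (14)$$

with $W_k = \{ i \mid \arg\max_j \tilde{P}^{(t)}(y_i = j) = k \}$ (i.e., use a hard assignment of the data points to optimize each of the functions, according to the posterior probability $\tilde{P}^{(t)}$). Therefore each of the specialized functions is trained using backpropagation with a subset of the training set (moreover, these subsets are disjoint).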
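The closed-form pieces above can be sketched together: the case (1) mapping of Eq. 10, the prior update of Eq. 13, and the hard assignment sets used in Eq. 14. This is an illustrative sketch, not the authors' implementation. It assumes the cue covariance block is non-singular, stores the cross-covariance with one row per cue dimension and one column per target dimension, and uses a small hand-written linear solver to stay dependency-free. All function names are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def conditional_mean(cue, mu_target, mu_cue, cov_cue_target, cov_cue_cue):
    """Eq. 10: mu_target + cov_cue_target^T cov_cue_cue^{-1} (cue - mu_cue)."""
    centered = [c - m for c, m in zip(cue, mu_cue)]
    w = solve(cov_cue_cue, centered)  # cov_cue_cue^{-1} (cue - mu_cue)
    return [mu + sum(cov_cue_target[r][j] * w[r] for r in range(len(w)))
            for j, mu in enumerate(mu_target)]

def update_priors(posteriors):
    """Eq. 13: each prior is the average posterior of its function over the data."""
    n, m = len(posteriors), len(posteriors[0])
    return [sum(row[k] for row in posteriors) / n for k in range(m)]

def winner_take_all_sets(posteriors):
    """Sets W_k of Eq. 14: each point goes to its most probable function."""
    sets = [[] for _ in posteriors[0]]
    for i, row in enumerate(posteriors):
        sets[max(range(len(row)), key=row.__getitem__)].append(i)
    return sets

# Toy component: 2-D cue, 1-D target.
estimate = conditional_mean(
    cue=[2.0, 4.0], mu_target=[0.0], mu_cue=[0.0, 0.0],
    cov_cue_target=[[1.0], [2.0]], cov_cue_cue=[[2.0, 0.0], [0.0, 4.0]])

toy_posteriors = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
priors = update_priors(toy_posteriors)
hard_sets = winner_take_all_sets(toy_posteriors)
```

In the toy component the solved weights are [1, 1], so the estimate is 1·1 + 2·1 = 3. The hard sets partition the point indices, which is why the per-function training subsets are disjoint.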

Note that the maximization process that finds the sets $W_k$ can also be stated as

$$\arg\max_{j} P(z_i \mid y_i = j, \theta)\, P(y_i = j \mid \theta) \qquad (15)$$

The approach in [20] can then be explained within the framework of SMAs presented here by (1) performing the E-step (i.e., computing $\tilde{P}^{(t)}(y_i)$) once, and therefore fixing $\tilde{P}^{(t)}(y_i)$ throughout the whole optimization process, and (2) using a winner-take-all variant for the M-step. Finally, (3) the choice of a Gaussian cost function for clustering (done in the E-step) is justified by choosing $P(z_i \mid \theta)$ to be a Gaussian mixture, as suggested by Eq. 15. Let us call this special version of case (2), case (2a).

In case (3) we have: taking derivatives in Eq. 7 with respect to $\lambda_k$, we obtain Eq. 13 as the update rule for $\lambda_k$. Taking derivatives in Eq. 7 with respect to $\theta_k$, we obtain:

$$\frac{\partial E}{\partial \theta_k} = \sum_{i} \left\{ \frac{\partial \rho_k(z_i)}{\partial \theta_k} \left[ P(z_i \mid y_i = k, \theta) - \tilde{P}^{(t)}(y_i = k) \right] \right\} \qquad (16)$$

where $\rho_k(z_i)$ denotes the exponent of Eq. 5. Note that, in keeping the formulation general, we have not defined the form of the specialized functions $\phi_k$ in Eqs. 12 and 16. In both cases, whether or not we can find a closed-form solution for the update of $\theta_k$ depends on the form of $\phi_k$. For example, if $\phi_k$ is a non-linear function, it is likely that we will have to use iterative optimization to find $\theta_k^{(t)}$. In the case where $\phi_k$ yields a quadratic form for $E$, a closed-form update exists. Note also that the bracketed term in Eq. 16 is the difference between prior and posterior distributions (which gives an intuition of what the goal of the process is), and only affects the importance or weight of the contribution of each data point. The results from case (3) will be evaluated experimentally in further work.

4 Feedback Matching

When generating an output $\hat{y}$ given an input $x$, we have a series of output hypotheses $\hat{Y} = \{\hat{y}_k\}$ obtained using $\hat{y}_k = \phi_k(x)$, with $k \in C$. Given the set $\hat{Y}$, we define the most accurate hypothesis to be the one that minimizes a function $F(\zeta(\hat{y}_j), x; Z)$ over $j$, for example:

$$k^{*} = \arg\min_{j}\, (\zeta(\hat{y}_j) - x)^{\top} \Sigma_{\Upsilon}^{-1} (\zeta(\hat{y}_j) - x) \qquad (17)$$

where $\Sigma_{\Upsilon}$ is the covariance matrix of the elements in the set $\Upsilon$ and $k^{*}$ is the assigned label.
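The feedback matching rule of Eq. 17 can be sketched as a single pass over the specialized functions. This is a minimal illustration under stated assumptions: the forward maps and the inverse (feedback) map are passed in as plain Python callables, and the inverse of the cue covariance is precomputed. The names `mahalanobis_sq` and `feedback_match`, and the toy maps, are hypothetical.

```python
def mahalanobis_sq(diff, inv_cov):
    """Quadratic form diff^T inv_cov diff."""
    n = len(diff)
    return sum(diff[r] * inv_cov[r][c] * diff[c]
               for r in range(n) for c in range(n))

def feedback_match(x, forward_maps, inverse_map, inv_cov):
    """Return (label, hypothesis) with the smallest Eq. 17 cost."""
    best = None
    for k, phi in enumerate(forward_maps):
        hypothesis = phi(x)             # output hypothesis of function k
        back = inverse_map(hypothesis)  # hypothesis mapped back to cue space
        diff = [b - xi for b, xi in zip(back, x)]
        cost = mahalanobis_sq(diff, inv_cov)
        if best is None or cost < best[0]:
            best = (cost, k, hypothesis)
    return best[1], best[2]

# Toy setup: the inverse map halves the target, so only the second
# forward map (doubling) is consistent with it.
forward_maps = [lambda x: [x[0] + 1.0], lambda x: [2.0 * x[0]]]
inverse_map = lambda y: [y[0] / 2.0]
label, hypothesis = feedback_match([4.0], forward_maps, inverse_map, [[1.0]])
```

Each function is evaluated once per input, so the cost of this selection grows linearly with the number of specialized functions.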
It is important to notice that the feedback matching could be used actively during learning, instead of using it only during inference to choose among the set of hypotheses. The form of the cost function could vary; here (in Eq. 17) we have assumed that the data from $\Upsilon$ is Gaussian distributed. This is explained more thoroughly in [20].

5 Experiments

Cases (1), (2) and (2a) of the described SMA formulation were tested. The experimental setup is the same as that used in [20]. We used a computer graphics based feedback function $\zeta$ [20], and Eq. 17 as the feedback matching cost function. The training data consisted of twelve sequences obtained through 3D motion capture. As stated previously, training data consists of a set of example input-output pairs, $(\psi_i, \upsilon_i)$. The output consisted of 11 2D marker positions (projected to the image plane using a perspective model), linearly encoded by eight real values using Principal Component Analysis (PCA). The input consisted of seven real-valued Hu moments computed on synthetically generated silhouettes of the articulated figure. Input-output pairs were generated using computer graphics by sampling the equator of the view-sphere to render 32 views [20]. We generated approximately 60,000 data vectors for training (corresponding to 32 views) and 9,984 for testing (also containing samples, equally distributed, from 32 views).

The only free parameters in this test, related to the given SMAs, were (a) the number of specialized functions used: 15, 5, and 15 for cases (1), (2) and (2a) respectively, and (b) for cases (2) and (2a), the form of $\phi_k$, which we chose to be multi-layer perceptrons with 16 hidden neurons. Note that several model selection approaches could be used instead to choose the number of parameters of the architecture (e.g., Minimum Description Length [18]).

Fig. 2 shows the body pose estimates obtained in several single images coming from two different sequences at specific orientations (due to space limitations case (2) is not included; its performance is comparable with the rest). The agreement between pose estimates and ground-truth is easy to perceive for all sequences.
Note that for self-occluding configurations, pose estimation is harder, but the estimate is still close to ground-truth. Neither human intervention nor pose initialization is required.

Using the training and testing data described above, we measured the average marker error for each model (as the distance between reconstructed and ground-truth projected marker position). With respect to the height of the body, the mean and variance of the marker error were: (1) 2.82% and 0.09%, (2) 2.73% and 0.02%, (2a) 2.34% and 0.04% respectively. Note the number of parameters in each model: (1) 3600, (2) 1205, (2a) 3615. In case (1), the training was considerably faster because of the extra processing time necessary in (2) and (2a) for training each neural network once the clustering or the weights per sample are decided. The smaller variance obtained in case (2) (in general a desirable behavior) is probably due to the soft splits of the data used by the learning algorithm. Inference required approximately the same computational time per specialized function in each case.

Fig. 3 shows the average marker error and variance per body orientation. Note that in all cases the error is bigger for orientations closer to $\pi/2$ and $3\pi/2$ radians. This intuitively agrees with the notion that at those angles (side-view), there is less visibility of the body parts.

5.1 Experiments using Real Visual Cues

For the next example, in Fig. 4 we test the system against real segmented visual data, obtained from observing a human subject. Reconstruction for several relatively complex

action sequences is shown for both models. Note that even though the characteristics of the segmented body differ from the ones used for training, good performance is achieved. Most frames are visually close to what can be thought of as the right pose reconstruction. Body orientation is also correct. The following variables are believed to affect performance the most: (1) likelihood distribution choice, (2) enough data to account for observed configurations, (3) number of approximating functions with specialized domains, (4) differences in body characteristics used for training/testing, and (5) discriminative power of the chosen image features (Hu moments reduce the image interpretation to a seven-dimensional vector).

Figure 3. Mean marker error and variance for cases (1) (top, broken line), (2) (middle, continuous line) and (2a) (bottom, broken line) per view angle, sampled every $2\pi/32$ radians.

6 Conclusion

We have proposed the use of a non-linear supervised learning framework, the Specialized Mappings Architecture (SMA), for estimating human body pose from single images. A learning algorithm was developed for this architecture using the framework of ML estimation, latent variable models and Expectation Maximization. The implemented algorithm for inference runs in linear time $O(M)$ with respect to the number of specialized functions $M$.

The incorporation of the feedback step actively during learning is an important possibility provided by SMAs and is currently being considered. Note that so far the feedback matching is used for inference only (for choosing among the set of hypotheses). Feedback could also be used for determining the distribution or importance of each training sample with respect to each of the mapping functions.

In experiments, an SMA learns how to map low-level visual features to a higher-level representation like a set of joint positions of the body. Human pose reconstruction from a single image is a particularly difficult problem because this mapping is highly ambiguous and complex. We have obtained excellent results even using a very simple set of image features, such as image moments. Choosing the best subset of image features from a given set is by itself a complex problem, and a topic of on-going research. This is a very important step considering that low-level visual features are relatively easily obtained using current vision techniques.

References

[1] C. Barron and I. Kakadiaris. Estimating anthropometry and pose from a single image. In CVPR, 2000.
[2] M. Bertero, T. Poggio, and V. Torre. Ill-posed problems in early vision. Proc. of the IEEE, 76:869-889, 1988.
[3] M. Brand. Shadow puppetry. In ICCV, 1999.
[4] C. Bregler. Tracking people with twists and exponential maps. In CVPR, 1998.
[5] J. Davis and A. F. Bobick. The representation and recognition of human movement using temporal templates. In CVPR, 1997.
[6] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood estimation from incomplete data. Journal of the Royal Statistical Society (B), 39(1), 1977.
[7] J. Deutscher, A. Blake, and I. Reid. Articulated body motion capture by annealed particle filtering. In CVPR, 2000.
[8] J. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19:1-141, 1991.
[9] D. Gavrila and L. Davis. Tracking of humans in action: a 3-D model-based approach. In Proc. ARPA Image Understanding Workshop, Palm Springs, 1996.
[10] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural network architectures. Neural Computation, 7:219-269, 1995.
[11] G. Hinton, B. Sallans, and Z. Ghahramani. A hierarchical community of experts. Learning in Graphical Models, M. Jordan (editor), 1998.
[12] N. Howe, M. Leventon, and B. Freeman. Bayesian reconstruction of 3D human motion from single-camera video. In NIPS, 1999.
[13] I. Haritaoglu, D. Harwood, and L. Davis. Ghost: A human body part labeling system using silhouettes. In Intl. Conf. Pattern Recognition, 1998.
[14] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181-214, 1994.
[15] N. Kolmogorov and S. Fomine. Elements of the Theory of Functions and Functional Analysis. Dover, 1975.
[16] R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, M. Jordan (editor), 1998.
[17] J. M. Rehg and T. Kanade. Model-based tracking of self-occluding articulated objects. In ICCV, 1995.
[18] J. Rissanen. Stochastic complexity and modeling. Annals of Statistics, 14:1080-1100, 1986.
[19] R. Rosales and S. Sclaroff. 3D trajectory recovery for tracking multiple objects and trajectory guided recognition of actions. In CVPR, 1999.
[20] R. Rosales and S. Sclaroff. Inferring body pose without tracking body parts. In CVPR, 2000.
[21] H. Sidenbladh, M. Black, and D. Fleet. Stochastic tracking of 3D human figures using 2D image motion. In ECCV, 2000.
[22] Y. Song, X. Feng, and P. Perona. Towards detection of human motion. In CVPR, 2000.
[23] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real time tracking of the human body. PAMI, 19(7):780-785, 1997.

Figure 2. Example reconstruction of several testing sequences. Each set (4 rows each) consists of input images, reconstruction using case (1), reconstruction using case (2a), and ground-truth, shown every 25th frame. View angles are 0 and $12\pi/32$ radians respectively for each set. Note that these sequences have challenging configurations; body orientation is also recovered correctly.

Figure 4. Reconstruction obtained from segmenting a human subject (every 30th frame). Two sequences are shown; each consists of the input sequence and the case (1) and case (2a) reconstructions.