Viewpoint Invariant 3D Landmark Model Inference from Monocular 2D Images Using Higher-Order Priors


Chaohui Wang 1,2, Yun Zeng 3, Loic Simon 1, Ioannis Kakadiaris 4, Dimitris Samaras 3, Nikos Paragios 1,2

1 Center for Visual Computing, Ecole Centrale Paris, Châtenay-Malabry, France
2 Equipe GALEN, INRIA Saclay - Ile-de-France, Orsay, France
3 Department of Computer Science, Stony Brook University, NY, USA
4 Computational Biomedicine Lab, University of Houston, TX, USA

Abstract

In this paper, we propose a novel one-shot optimization approach to simultaneously determine both the optimal 3D landmark model and the corresponding 2D projections without explicit estimation of the camera viewpoint, which is also able to deal with misdetections as well as partial occlusions. To this end, a 3D shape manifold is built upon fourth-order interactions of landmarks from a training set where pose-invariant statistics are obtained in this space. The 3D-2D consistency is also encoded in such high-order interactions, which eliminate the necessity of viewpoint estimation. Furthermore, the modeling of visibility improves further the performance of the method by handling missing correspondences and occlusions. The inference is addressed through a MAP formulation which is naturally transformed into a higher-order MRF optimization problem and is solved using a dual-decomposition-based method. Promising results on standard face benchmarks demonstrate the potential of our approach.

1. Introduction

3D model inference from 2D images is one of the most challenging problems in computer vision. This is due to the fact that both camera estimation and 3D model optimization have to be addressed within a single framework. In the most general case, the camera parameters are unknown, the 3D model itself usually inherits high complexity (high degrees of freedom even for non-articulated objects), while at the same time image features can be ambiguous, occluded and noisy.
There are numerous applications involving the above scenario, such as traffic monitoring with 3D model based tracking [18], hand tracking [9], facial analysis [4] and medical imaging [17]. Such an inference process usually involves three steps: the first aims to determine a compact representation of the 3D model, the second to associate such a representation with the 2D image observation, and the last to recover the optimal parameters of the model.

Modeling variations of the 3D model requires a statistical parametric representation of the object of interest. Such representations in most cases are pose-variant, i.e., all training examples are registered to a same referential frame where statistics are then built from the training data. One can cite active shape [8] and active appearance [7] models (ASMs and AAMs), which offer a good compromise between computational complexity and model expressiveness potential. Other representations adopt more complex statistical models that can vary from Mixtures of Gaussians (MoG) to non-parametric density functions [3]. In such a context, one has to address the curse of dimensionality (the dimensionality of the manifold to be learned versus the number of training samples).

Once the representation has been built, the next steps consist of defining an image likelihood and combining it with a 3D model prior towards optimal estimation of the 3D model. Since the image likelihood is related to both the 3D model configuration and the camera parameters, the model estimation is often achieved through an alternating search or EM-style approach [14]. Given an initial 3D-2D correspondence map, the camera parameters are first estimated and then used to define the fitting error between the model and the image. This error is to be optimized by gradient-driven methods and iterative search processes so as to estimate both the correspondences and the optimal model configuration.
Despite the promising performance of such a scheme, the fact that explicit estimation of the camera viewpoint parameters is required in the process is a major drawback, since coordinate-descent approaches are prone to be trapped in local minima and provide no guarantee on the optimality of the estimations. Graphical models have become a dominant approach in computer vision and have been employed to address a number of vision problems (e.g., [5, 19, 20]), which is mainly due to their strength in terms of the quality of the optimum. The use of higher-order models has attracted more and more attention (e.g., [15, 12, 22]), along with the development of efficient optimization methods. Higher-order interactions can naturally introduce invariance to certain classes of transformations like translation/rotation/scale, while at the same time they can deal with vision tasks of considerable complexity. The objective of our paper is to take advantage of their strength and propose a unified formulation to estimate 3D models from 2D images without alternating search.

The main contribution of this paper is a probabilistic inference approach that does not require explicit viewpoint estimation, while being able to jointly optimize the pose parameters and the corresponding landmark selection as well as explicitly handling missing correspondences and occlusions via visibility modeling. To this end, we formulate the problem as a maximum a posteriori (MAP) estimation task which involves 3D pose parameters, associated 2D correspondences and visibility states. We derive a posterior probability as the product of an image likelihood, a visibility prior, a 3D geometric prior and a projection consistency prior constraining the 2D and 3D configurations. In order to circumvent the need for viewpoint estimation, we adopt a high-order decomposition of the 3D model that makes it possible to determine the projection error between a given 3D configuration and the corresponding 2D landmark positions in a distributed manner. Furthermore, explicit visibility modeling is also introduced to cope with misdetections and outliers. The MAP inference is then naturally transformed into a higher-order MRF optimization problem and all the latent variables are inferred using a one-shot optimization over a factor graph [3] through dual-decomposition [2, 16, 20, 22]. The proposed formulation has been validated in the context of 3D facial pose estimation from 2D images. Promising results on standard face benchmarks demonstrate the potential of our method.

Published in: 2011 IEEE International Conference on Computer Vision. © 2011 IEEE.
The remainder of this paper is organized as follows. In Sec. 2, we present the probabilistic formulation for the joint estimation of the 3D pose, its visibility states and the 2D correspondences. The individual likelihoods with respect to geometry, 3D-to-2D consistency, visibility, and image support are presented in Sec. 3 while the corresponding higher-order graphical model is discussed in Sec. 4. Experimental results compose Sec. 5, while discussion and future work conclude the paper in Sec. 6.

2. Probabilistic 3D-2D Inference Framework

We consider a point-distribution shape model composed of a set $\mathcal{V}$ of landmarks located on the surface of the 3D object of interest. Let latent variable $X_i = (X_i^{(3)}, X_i^{(2)})$ denote the 3D and 2D positions of a landmark $i$ ($i \in \mathcal{V}$). More specifically, $X_i^{(3)}$ and $X_i^{(2)}$, 3-dimensional and 2-dimensional vectors respectively, denote the 3D position of landmark $i$ in the model space and the 2D position in the observed image $I$. Each variable $X_i$ takes a value $x_i$ from its possible configuration set $\mathcal{X}_i = \mathcal{X}_i^{(3)} \times \mathcal{X}_i^{(2)}$, where $\mathcal{X}_i^{(3)}$ and $\mathcal{X}_i^{(2)}$ denote the 3D and 2D position candidate sets, respectively. Due to the fact that landmarks may be invisible, we also introduce a visibility variable $O_i$ for landmark $i$ [19]. $O_i = 1$ when the landmark is visible in the 2D image space, and $O_i = 0$ otherwise.
Given the observed image $I$, the estimation of the 3D-2D positions and the visibility of the landmarks is formulated as a maximization of the posterior probability of $(X, O) = ((X_i)_{i \in \mathcal{V}}, (O_i)_{i \in \mathcal{V}})$ over their domains $\mathcal{X} = \prod_{i \in \mathcal{V}} \mathcal{X}_i$ and $\mathcal{O} = \{0, 1\}^{|\mathcal{V}|}$:

$$(x, o)^{\mathrm{opt}} = \arg\max_{(x, o) \in \mathcal{X} \times \mathcal{O}} p(x, o \mid I) \tag{1}$$

The posterior probability $p(x, o \mid I)$ is:

$$
\begin{aligned}
p(x, o \mid I) &= p(x, o, I) / p(I) \propto p(x, o, I) \\
&= p(I \mid x, o)\, p(x, o) \\
&= p(I \mid x^{(2)}, x^{(3)}, o)\, p(x^{(2)} \mid x^{(3)}, o)\, p(o \mid x^{(3)})\, p(x^{(3)}) \\
&= \underbrace{p(I \mid x^{(2)}, o)}_{\text{Image Likelihood}}\; \underbrace{p(x^{(2)} \mid x^{(3)}, o)}_{\text{Projection Prior}}\; \underbrace{p(o)}_{\text{Visibility Prior}}\; \underbrace{p(x^{(3)})}_{\text{3D Model Prior}}
\end{aligned}
\tag{2}
$$

where $p(I \mid x^{(2)}, o)$ encodes the likelihood of the observed image given the 2D position configurations $x^{(2)}$ and the visibility states $o$ of the landmarks, $p(x^{(2)} \mid x^{(3)}, o)$ encodes the projection prior from the 3D configuration $x^{(3)}$ to the 2D configuration of the landmarks, $p(o)$ denotes the visibility prior on the landmarks, and $p(x^{(3)})$ denotes the prior on the 3D configurations of the landmarks. Note that this probabilistic formulation can be directly applied to the estimation of 3D (or 2D) configuration of the landmarks given 2D (or 3D) configuration, simply by instantiating the variables whose configurations are known.

3. Definitions of the Probability Terms

In this section, we define all the probability terms which are involved in the posterior probability $p(x, o \mid I)$ (Eq. 2).

3.1. Image Likelihood

The image likelihood $p(I \mid x^{(2)}, o)$ measures the occurrence probability of the observed image $I$, given the 2D position configurations $x^{(2)}$ and the visibility states $o$ of the landmarks. If we assume, without loss of generality, that the landmarks are independent in terms of appearance, then we can define $p(I \mid x^{(2)}, o)$ as follows:

$$p(I \mid x^{(2)}, o) \propto \prod_{i \in \mathcal{V}} p(I \mid x_i^{(2)}, o_i) \tag{3}$$

Regarding $p(I \mid x_i^{(2)}, o_i)$, there are two cases:

1. When $O_i = 1$, the landmark's position is informative and $p(I \mid x_i^{(2)}, o_i)$ denotes the likelihood of the observed image given that landmark $i$ is located at position $x_i^{(2)}$, which can be defined using the output of a classifier such as Randomized Forest [6].
2. When $O_i = 0$, the landmark's position is not informative and $p(I \mid x_i^{(2)}, o_i)$ denotes a uniform distribution, thus we assume that $p(I \mid x_i^{(2)}, o_i) = \hat{p}$ (constant).

3.2. Projection Prior

The projection prior $p(x^{(2)} \mid x^{(3)}, o)$ measures the occurrence possibility of the 2D positions $x^{(2)}$ of the landmarks when the 3D positions $x^{(3)}$ and the visibility states $o$ are given, which is modeled using a Gibbs distribution:

$$p(x^{(2)} \mid x^{(3)}, o) \propto \exp\left\{ -\frac{f(x, o)}{T} \right\} \tag{4}$$

where $T$ is the temperature, and the energy function $f(x, o)$ encodes the inconsistency between the 3D and 2D configurations of the landmarks taking the visibility states into account (the smaller $f(x, o)$ is, the better is the correspondence between $x^{(3)}$ and $x^{(2)}$).

Without loss of generality, we use the weak-perspective camera configuration [1] to model the projection from 3D points to 2D points (footnote 1). Let us first consider a triple $\tau$ ($\tau \subset \mathcal{V}$ and $|\tau| = 3$) of landmarks that are all visible. Their 3D-2D positions $x_\tau$ determine at most two projection mappings $P^{(s)}_{x_\tau}$ ($s \in \{1, 2\}$) [1, 11] corresponding to two reflective symmetric camera configurations. Then for any additional visible point $i$, we can measure the error $e_{x_\tau}(x_i)$ between its 2D position $x_i^{(2)}$ and the value obtained by projecting its 3D position $x_i^{(3)}$, i.e.:

$$e_{x_\tau}(x_i) = \min_{s \in \{1, 2\}} \left\| P^{(s)}_{x_\tau}(x_i^{(3)}) - x_i^{(2)} \right\| \tag{5}$$

where $\|\cdot\|$ denotes the Euclidean norm, and between the two feasible projections we consider the most prominent one with respect to the considered 2D configuration [1]. On the contrary, if one or more of these four points are invisible, we set a constant energy $\hat{E}$ as the projection error $e_{x_\tau}(x_i)$, which can be understood as an upper bound of the average projection error which is allowed between four points.
Therefore, we define the error function $e_{x_\tau, o_\tau}(x_i, o_i)$ by taking the visibility states into account as:

$$e_{x_\tau, o_\tau}(x_i, o_i) = w_\tau \cdot \begin{cases} e_{x_\tau}(x_i) & \text{if } o_j = 1,\ \forall j \in \tau \cup \{i\} \\ \hat{E} & \text{otherwise} \end{cases} \tag{6}$$

where $w_\tau$ is a confidence weight for the error measure obtained under the mapping determined by the positions of the points in clique $\tau$, which will be presented later in this section. The 3D-2D consistency between a quadruple $c$ of landmarks then consists of the sum of the errors which are determined by taking all possible combinations of triples within the quadruple and evaluating the projection error on the remaining point:

$$e(x_c, o_c) = \sum_{\tau \subset c,\ |\tau| = 3} e_{x_\tau, o_\tau}(x_{c \setminus \tau}, o_{c \setminus \tau}) \tag{7}$$

Finally, we define the energy function $f(x, o)$ as the sum of $e(x_c, o_c)$ over all the quadruples, i.e.:

$$f(x, o) = \sum_{c \in \mathcal{C}} e(x_c, o_c) \tag{8}$$

where $\mathcal{C} = \{c \mid c \subset \mathcal{V} \text{ and } |c| = 4\}$ denotes the set of all quadruples. Last, we should note that we can further combine other cues in this projection prior, such as regional texture similarity.

Footnote 1: In the proposed framework, the weak-perspective camera model can be easily replaced by other camera models such as the perspective model.

Robust Confidence Weight

Since the projection matrix estimation is unstable when considering triples of 3D points that are nearly collinear [1], we introduce a confidence weight $w_\tau$ to modulate the error contribution of each triple $\tau$ of points. For a triangle $\triangle_{x_\tau}$ consisting of a triple $\tau$ of points with 3D positions $x_\tau^{(3)}$, we define the non-collinear coefficient $\mathrm{NC}(x_\tau^{(3)})$ using the square root of its area $\mathrm{Area}(\triangle_{x_\tau})$ and its perimeter $\mathrm{Perim}(\triangle_{x_\tau})$ as follows:

$$\mathrm{NC}(x_\tau^{(3)}) = \frac{\mathrm{Area}^{1/2}(\triangle_{x_\tau})}{\mathrm{Perim}(\triangle_{x_\tau})} \tag{9}$$

We can observe that, once normalized by a constant factor, $\mathrm{NC}(x_\tau^{(3)}) = 1$ for an equilateral triangle and $\mathrm{NC}(x_\tau^{(3)}) = 0$ when the three points are collinear. Then we learn the confidence weight $w_\tau$ by averaging the non-collinear coefficients for each triple $\tau$ over the training data:

$$w_\tau = \frac{1}{M} \sum_{m=1}^{M} \mathrm{NC}(x_{\tau, m}^{(3)}) \tag{10}$$

where $M$ denotes the number of training samples.

Specification of the Projection Error

Regarding the computation of $e_{x_\tau}(x_i)$, we use the efficient method proposed in [1] to compute directly the projection of a 3D point under the projection determined by a triple of corresponding 3D-2D points without calculating the projection mapping. We refer readers to [1] for more details. Collinear triples of points lead to degenerate configurations from which we cannot obtain a solution for the projection mapping. In this case, the corresponding error term $e_{x_\tau}(x_i)$ in Eq. 5 is not well-defined. To deal with this, we consider two different scenarios: (i) When we have prior knowledge that the 3D positions of a triple $\tau$ of points have to be collinear, we simply ignore the corresponding error measure by defining $e_{x_\tau}(x_i) = 0$ (this is consistent with the confidence weight defined in Eq. 10, i.e., $w_\tau = 0$ leads to zero contribution to $f(x)$); (ii) Otherwise, we define $e_{x_\tau}(x_i) = +\infty$ if $x_\tau^{(3)}$ are collinear so that the final solution of $x_\tau^{(3)}$ cannot be exactly collinear. By doing so, the term $e_{x_\tau}(x_i)$ is well-defined for all the cases. For the sake of clarity, hereafter, we assume that the definition of $e_{x_\tau}(x_i)$ in Eq. 5 implicitly includes the definition in the degenerate case.

3.3. Visibility Prior

We introduce the visibility variable $O$ to achieve a more precise modeling of the 3D-2D estimation, due to the fact that a landmark can be invisible. The notion of invisibility encodes occlusions and self-occlusions in the 3D space, as well as misdetection due to insufficient image support or classification failure. The inference process is performed by considering, for each landmark $i$, a number of 2D positions which lead to the highest probabilities $p(I \mid x_i^{(2)})$ towards composing the set of plausible solutions for $X_i^{(2)}$, expecting that at least one candidate is (or is close to) the true position. However, because of erroneous detection or occlusions, it is possible that all the candidates are far from the ground truth. In such a context, we define the notion of visibility as whether the true 2D correspondence of the landmark is captured by the candidate set. More specifically, $O_i = 1$ means that at least one candidate in $\mathcal{X}_i^{(2)}$ is close to the ground truth, and $O_i = 0$ stands for the opposite case.
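The candidate-based notion of visibility can be illustrated with a minimal Python sketch. The text does not quantify "close", so the `tolerance` parameter (in pixels) and the function name `visibility_state` are assumptions made purely for illustration.

```python
import math


def visibility_state(candidates, true_position, tolerance):
    """Return 1 if at least one 2D candidate lies within `tolerance`
    of the true 2D position, and 0 otherwise.

    `tolerance` is a hypothetical threshold; the paper only says "close".
    """
    for u, v in candidates:
        if math.hypot(u - true_position[0], v - true_position[1]) <= tolerance:
            return 1
    return 0


# One candidate near the true position: the landmark counts as visible.
visible = visibility_state([(10.0, 10.0), (50.0, 50.0)], (12.0, 11.0), 5.0)
# All candidates far away (e.g. a misdetection): the landmark counts as invisible.
hidden = visibility_state([(100.0, 100.0)], (12.0, 11.0), 5.0)
```

During inference the true position is of course unknown, so a check of this kind only makes sense when generating or analysing test data with ground truth; in the model itself $O_i$ is a latent variable.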
The prior probability $p(o)$ is defined as follows:

$$p(o) = \prod_{i \in \mathcal{V}} p(o_i) \tag{11}$$

where $p(o_i)$ denotes the prior probability of the visibility of each individual landmark $i$ and is modeled as a Bernoulli distribution $\mathrm{Bern}(o_i \mid \mu_i)$ with parameter $\mu_i = \Pr(O_i = 1)$. In practice, it is usually reasonable to assume the same parameter $\mu > 0.5$ for all the landmarks [20].

3.4. 3D Model Prior

The training data are used to learn a 3D shape model. No assumption on registration between surfaces is being made. However, we assume that correspondences have been determined for the landmarks among the samples of the training set. The key concern of shape modeling is how to capture the inherent variability of the class of objects from a reasonably small training set using a compact representation that can be easily adopted towards an efficient inference. We adopt the pose-invariant prior in [22] which is based on the relative lengths of a triple of points, and extend it to a more general formulation where the cliques can be of any higher order. Such a prior does not require the estimation of the global pose in the training and testing stages and eliminates the bias caused by such estimations.

Let us consider a clique (footnote 2) $c$ ($c \subset \mathcal{V}$ and $|c| \geq 3$) of landmarks; we enumerate all the pairs $\mathcal{P}_c = \{(i, j) \mid i, j \in c \text{ and } i < j\}$ of points. Let $d_{ij} = \|x_i^{(3)} - x_j^{(3)}\|$ denote the Euclidean distance between points $i$ and $j$ ($(i, j) \in \mathcal{P}_c$). We obtain the relative distance $\hat{d}_{ij}$ by normalizing the distance $d_{ij}$ over the sum of the distances between the pairs of points involved in clique $c$, i.e.,

$$\hat{d}_{ij} = d_{ij} \Big/ \sum_{(i, j) \in \mathcal{P}_c} d_{ij} \tag{12}$$

Since for clique $c$, any relative distance $\hat{d}_{ij}$ is a linear combination of the others (i.e., $\sum_{(i, j) \in \mathcal{P}_c} \hat{d}_{ij} = 1$), we store all the relative distances except one in a vector $\hat{d}_c = (\hat{d}_{ij})_{(i, j) \in \tilde{\mathcal{P}}_c}$, where $\tilde{\mathcal{P}}_c \subset \mathcal{P}_c$ contains the pairs that are involved in the vector $\hat{d}_c$. Then, the statistics on $\hat{d}_c$ are learned from the training data. We can model its distribution $p_c(\hat{d}_c)$ using standard probabilistic models such as MoGs and Parzen-Windows.
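As an illustration of Eq. 12, the following Python sketch computes the relative-distance vector of a clique and checks that it is unchanged by translation and uniform scaling (it is also unchanged by rotation, since only pairwise distances are used). The function name and the choice of dropping the last pair are assumptions of this sketch, not details given in the text.

```python
import math
from itertools import combinations


def relative_distance_vector(points):
    """Relative pairwise distances of a clique (Eq. 12), one entry dropped.

    `points` is a list of 3D positions (tuples), one per landmark of the clique.
    """
    pairs = list(combinations(range(len(points)), 2))  # all (i, j) with i < j
    dists = [math.dist(points[i], points[j]) for i, j in pairs]
    total = sum(dists)
    if total == 0.0:
        raise ValueError("all points coincide; relative distances are undefined")
    relative = [d / total for d in dists]  # these entries sum to 1
    return relative[:-1]  # drop one redundant entry (which one is our choice)


# A quadruple of points, and the same quadruple scaled by 2 and translated.
quad = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
moved = [(2.0 * x + 5.0, 2.0 * y - 3.0, 2.0 * z + 1.0) for x, y, z in quad]
d_hat = relative_distance_vector(quad)
d_hat_moved = relative_distance_vector(moved)
```

For a quadruple there are six pairs, so the stored vector has five entries; `d_hat` and `d_hat_moved` agree up to floating-point rounding.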
Finally, we define the prior probability of the 3D configuration $x^{(3)}$ as:

$$p(x^{(3)}) \propto \prod_{c \in \mathcal{C}} p_c(\hat{d}_c(x_c^{(3)})) \tag{13}$$

where $\mathcal{C}$ denotes the set of all cliques, and $\hat{d}_c(x_c^{(3)})$ denotes the mapping from the 3D position $x_c^{(3)}$ of the clique $c$ to the relative distance vector $\hat{d}_c$.

4. Higher-order MRF Formulation

The data likelihood, the 3D-2D consistency, the visibility prior and the 3D shape model, presented in Sec. 3, can be naturally encoded within a higher-order MRF model where latent variables are to be inferred through an energy minimization. In this perspective, the negative logarithm of the posterior probability (Eq. 2) is factorized into the potentials of the MRF and constitutes the MRF energy. To this end, we use a node to model a landmark $i$ ($i \in \mathcal{V}$) with its latent 3D-2D position $X_i$ and its visibility $O_i$.

Footnote 2: As presented in Sec. 4, we use 4-order cliques (quadruples) in this work, i.e., $|c| = 4$. However, other higher-order cliques $c$ ($|c| \geq 3$) can also be used in this shape model.
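To give a sense of the size of the clique set, the sketch below enumerates all quadruples of a landmark set and, for each quadruple, the four (triple, remaining point) splits that the projection error of Eq. 7 sums over. The function names are hypothetical; the value 13 is the number of landmarks of the face model used in Sec. 5.

```python
from itertools import combinations


def quadruples(num_landmarks):
    """All cliques c with |c| = 4 over landmarks 0 .. num_landmarks - 1."""
    return list(combinations(range(num_landmarks), 4))


def triple_point_splits(quad):
    """The (triple, remaining point) pairs of one quadruple, as summed in Eq. 7."""
    return [(tuple(j for j in quad if j != i), i) for i in quad]


cliques = quadruples(13)                    # 13 landmarks as in Sec. 5
splits = triple_point_splits(cliques[0])    # cliques[0] is (0, 1, 2, 3)
```

With 13 landmarks this yields 715 quadruples, each contributing four triple-based error terms, which is why an efficient higher-order optimizer matters for this formulation.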


Actually, we can use a single random variable (footnote 3) to encode $X_i$ and $O_i$ compactly by simply defining a special label occ within the 2D position candidate set $\mathcal{X}_i^{(2)}$ such that:

$$x_i = \begin{cases} (x_i^{(3)}, x_i^{(2)}) & \text{if } O_i = 1 \\ (x_i^{(3)}, \mathrm{occ}) & \text{if } O_i = 0 \end{cases} \tag{14}$$

This compact representation is valid because the 2D position $x_i^{(2)}$ is meaningless when the landmark $i$ is occluded (i.e., when $O_i = 0$, the image likelihood $p(I \mid x^{(2)}, o)$ and the projection prior $p(x^{(2)} \mid x^{(3)}, o)$ are constant with respect to $x_i^{(2)}$).

Footnote 3: In order to reduce the number of symbols used, we reuse $X_i$ to denote this new random variable. Accordingly, we reuse $x_i$, $\mathcal{X}_i$ and the other related notations.

In order to factorize the potential functions, we use a fourth-order clique to model a quadruple $c$ of landmarks. Due to the bijective mappings between nodes and landmarks and between fourth-order cliques and quadruples, we reuse $\mathcal{V}$ and $\mathcal{C}$ to denote the node set and the clique set which determine the topology of the MRF. The 3D and 2D positions of the landmarks are estimated through the minimization of the MRF energy $E(x)$:

$$x^{\mathrm{opt}} = \arg\min_{x \in \mathcal{X}} E(x) \tag{15}$$

Here, the energy of the MRF is defined as the negative logarithm of the posterior probability in Eq. 2 (up to an additive constant) and can be factorized into the following form:

$$E(x) = \sum_{i \in \mathcal{V}} U_i(x_i) + \sum_{c \in \mathcal{C}} H_c(x_c) \tag{16}$$

where $x_c$ denotes the configuration $(x_i)_{i \in c}$ of clique $c$.

Singleton potential $U_i(x_i)$ ($i \in \mathcal{V}$) encodes the data likelihood (see section 3.1) and the visibility prior (see section 3.3). After taking the negative logarithm, we obtain its definition as follows:

$$U_i(x_i) = \begin{cases} -\log p(I \mid x_i^{(2)}) & \text{if } x_i^{(2)} \neq \mathrm{occ} \\ \lambda_1 & \text{if } x_i^{(2)} = \mathrm{occ} \end{cases} \tag{17}$$

where $\lambda_1$ is a constant coefficient.

Higher-order clique potential $H_c(x_c)$ ($c \in \mathcal{C}$) is defined as follows:

$$H_c(x_c) = \lambda_2 H_c^{(1)}(x_c) + \lambda_3 H_c^{(2)}(x_c) \tag{18}$$

where $\lambda_2 > 0$ and $\lambda_3 > 0$ are two balancing constants, $H_c^{(1)}(x_c)$ encodes the 3D statistic geometry constraints implied by the shape prior on the 3D configuration of the landmarks, and $H_c^{(2)}(x_c)$ encodes the 3D-2D projection prior:

$$\begin{cases} H_c^{(1)}(x_c) = -\log p_c(\hat{d}_c(x_c^{(3)})) \\ H_c^{(2)}(x_c) = e(x_c, o_c(x_c)) \end{cases} \tag{19}$$

where $o_c(x_c)$ denotes the binary visibility values that are recovered from $x_c$ using Eq. 14, and the definitions of $e(x_c, o_c)$ and $p_c(\hat{d}_c(x_c^{(3)}))$ have been presented in Sec. 3.2 and 3.4, respectively.

Dual-Decomposition MRF Inference: Regarding the inference of the proposed higher-order MRF, we adopt the dual-decomposition optimization framework [2, 16], which is considered to be the state-of-the-art towards MAP-MRF inference [16, 20], in particular when handling higher-order MRFs [15, 22]. Based on this framework, we decompose the original problem, which is difficult to solve directly, into a set of sub-problems which can be solved very efficiently. The solutions of the sub-problems are combined using the projected subgradient method [16, 20] to achieve the solution of the original problem. Regarding the decomposition, like [22], we decompose the original graph into a set of factor trees which can be solved within polynomial time using the max-product belief propagation algorithm [3].

5. Experimental Results

5.1. Experimental Settings

The performance of the proposed method was evaluated on the publicly-available facial expression datasets BU-3DFE [24] and BU-4DFE [23]. The former consists of 3D range data of 6 prototypical facial expressions of 100 different subjects (56 female and 44 male), and the latter is composed of 3D dynamic facial expressions of 101 different subjects (58 female and 43 male). The subjects included in both datasets are of various ethnic/racial origins. The considered model consists of 13 landmarks (eyes, nose, mouth and eyebrows as shown in Fig. 1(a)). In the inference stage, its 3D initialization was done by randomly picking one training example. Regarding the 3D positions of the landmarks, the search was guided by a coarse-to-fine scheme and sparse sampling strategy in a similar way as [13].

Figure 1. (a) The distribution of landmarks; (b) The histogram presenting the distribution of the number of missing 2D correspondences in the first experiment.
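Returning briefly to the energy of Sec. 4, the potentials of Eqs. 16 to 19 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the image likelihood, the shape prior value $p_c(\cdot)$ and the projection error $e(\cdot)$ are taken as precomputed numbers, and all names and toy values below are hypothetical.

```python
import math

OCC = "occ"  # special 2D label meaning "landmark not visible" (Eq. 14)


def singleton_potential(label_2d, likelihood, lambda1):
    """U_i of Eq. 17: -log image likelihood for a real candidate, lambda1 for occ."""
    if label_2d == OCC:
        return lambda1
    return -math.log(likelihood)


def clique_potential(shape_prior, projection_error, lambda2, lambda3):
    """H_c of Eqs. 18-19, given p_c(d_c) and e(x_c, o_c) already evaluated."""
    return lambda2 * -math.log(shape_prior) + lambda3 * projection_error


def mrf_energy(singletons, cliques):
    """E(x) of Eq. 16: the sum of all singleton and clique potentials."""
    return sum(singletons) + sum(cliques)


# Toy numbers (hypothetical): two singleton terms, one labelled occ, and one clique term.
u = [singleton_potential((120, 85), 0.5, lambda1=2.0),
     singleton_potential(OCC, None, lambda1=2.0)]
h = [clique_potential(shape_prior=0.25, projection_error=1.5, lambda2=1.0, lambda3=2.0)]
energy = mrf_energy(u, h)
```

The constant $\lambda_1$ acts as the price of declaring a landmark occluded: the occ label is preferred only when every real candidate has an image likelihood low enough that $-\log p$ exceeds $\lambda_1$, or when keeping the candidate would incur large clique costs.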
Upon convergence of the algorithm, we performed Procrustes Analysis [10] to obtain the similarity transform between the estimated 3D model and the ground truth, then transformed the estimated one into the referential frame of the ground truth. In terms of quantitative evaluation, a common goodness-of-fit criterion is the squared error standardized by the scale of the object. Thus, Procrustes distance [10] was used as the dissimilarity measure $E_d$ to evaluate our method quantitatively, which can be computed as follows:

$$E_d = \sum_{i \in \mathcal{V}} \left\| x_i^{(3)} - \bar{x}_i^{(3)} \right\|^2 \Big/ \sum_{i \in \mathcal{V}} \left\| \bar{x}_i^{(3)} - C \right\|^2 \tag{20}$$

where $x_i^{(3)}$ and $\bar{x}_i^{(3)}$ denote the resulting and ground truth 3D positions of landmark $i$, respectively, and $C = \frac{1}{|\mathcal{V}|} \sum_{i \in \mathcal{V}} \bar{x}_i^{(3)}$ is the center of the ground truth model. The smaller $E_d$ is, the closer the resulting model is to the ground truth.

In all the experiments, leave-one-out cross-validation was adopted for the evaluation of the method. In this context, we do the validation on a sample while using the remaining samples as training data, and such a validation is done for all the samples contained in a dataset using the same parameter settings. Regarding the 3D model prior (Eq. 13), we modeled the probability distribution $p_c(\hat{d}_c(x_c^{(3)}))$ between a quadruple $c$ of points using a two-component Gaussian Mixture.

5.2. Qualitative Results and Quantitative Analysis

First, we considered 100 samples of the neutral expression from BU-3DFE, one from each subject. The 2D landmark correspondence space was associated with 5 labels, four corresponding to the 2D position candidates and the last to the occlusion label occ. On top of the ground truth correspondence, noise was added to generate erroneous 2D candidates as well. Furthermore, for 10% of the landmarks (randomly sampled), the true correspondence was removed and replaced with a random position in the image plane, which produced between 0 and 5 missing 2D correspondences for each test (see Fig. 1(b)). Figs. 2(a) and (b) present 3D model estimation results. Fig. 2(c).3 and Fig. 2(c).1 (i.e., the boxes 3 and 1 in Fig. 2(c)) depict the statistics of the dissimilarity measure $E_d$ (Eq. 20) for the initialization and the resulting 3D model obtained by the proposed method, respectively. The qualitative and quantitative evaluations demonstrate that our method leads to well-estimated 3D models even when correspondences are partially missing. Furthermore, in order to demonstrate the impact of the visibility modeling, we have also evaluated an alternative version (without visibility modeling) of the proposed method where the occ label was removed from the 2D candidate set of each node, and show the obtained statistics of $E_d$ in Fig. 2(c).2. Based on the comparison of Fig. 2(c).1 and Fig. 2(c).2, we can conclude that the visibility modeling indeed leads to significantly better performance.

Figure 2. Results of the first experiment. (a) and (b): 3D model estimation results. In each sub-figure, a 3D face mesh is provided for measuring visually the error between the resulting positions (in red) of landmarks and the ground truth (in blue). (c): Boxplots for the distributions of dissimilarity measures for quantitatively evaluating the 3D model estimation. c.1: Results obtained by the proposed method; c.2: Results obtained by the version without visibility modeling; c.3: Initialization of the model. On each box, the central mark in red is the median, the edges of the box are the 25th and 75th percentiles, the whiskers extend to the most extreme data points not considered outliers, and outliers are plotted individually.

Second, we employed the facial feature point detector proposed in [21] to obtain the 2D position candidates for 101 samples of BU-4DFE, also one from each subject. Such a detector is based on Gabor features and boosting classifiers, and can localize the considered landmarks well from observed 2D images (Figs. 3(a)-(f)), though errors may still be present in some tests. We also performed a leave-one-out cross-validation in this experiment. Figs. 3(a')-(f') show six 3D model estimation results of different qualities and Fig. 3(g) presents the statistics of $E_d$ for the proposed method and the version without visibility modeling. These results further demonstrate the potential of the proposed method to infer the 3D configuration of the model from 2D observed images with misdetection/occlusion handling.

Figure 3. Results of the second experiment. (a)-(f): 2D landmark detection results [21]; (a')-(f'): The corresponding 3D model estimation results. (g): Boxplots for the distributions of dissimilarity measures for quantitatively evaluating the 3D model estimation. g.1: Results obtained by the proposed method; g.2: Results obtained by the version without visibility modeling; g.3: Initialization of the model.

Last but not least, we compared our method with an alternative method (ASM+RANSAC) with a relaxed condition where we assumed that the ground truth 2D correspondences were known. For each test, we first learned an ASM [8] from the training data. Then, we used RANSAC [11] to estimate the camera projection function based on the initialization of the shape model and the given ground truth 2D correspondences. Once the projection function was estimated, we searched for the best shape configuration by minimizing the errors between the projections of the 3D points and their 2D correspondences. Furthermore, we evaluated both methods using two different initializations: besides the random sample initialization used throughout the experiments, we also tested the mean-shape initialization where we chose one example as the reference, registered all the other training examples to it and computed the mean shape as initialization. We performed leave-one-out cross-validation on all the 2500 samples of the BU-3DFE dataset and the quantitative evaluation is shown in Fig. 4. Figs. 4.1 and 4.4 show that our method performed equally well with the two different initializations, which demonstrates robustness with respect to the choice of initialization. The evaluation of ASM+RANSAC is presented in Figs. 4.2 and 4.5. We observe from Fig. 4 that the dissimilarity measure of our method is approximately 3 to 5 times lower compared to ASM+RANSAC, which demonstrates that our method performs significantly better than ASM+RANSAC and is highly robust with respect to the initialization.
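The dissimilarity measure $E_d$ of Eq. 20, which underlies all the boxplots discussed above, can be sketched as follows. This minimal Python version assumes the estimated shape has already been aligned to the ground truth frame (the Procrustes alignment step itself is not implemented here); the function name is hypothetical.

```python
def dissimilarity(estimated, ground_truth):
    """E_d of Eq. 20 for two aligned shapes given as lists of 3D points."""
    n = len(ground_truth)
    dim = len(ground_truth[0])
    # Center C of the ground truth model.
    center = [sum(p[k] for p in ground_truth) / n for k in range(dim)]
    # Numerator: squared error between estimated and ground truth landmarks.
    squared_error = sum(
        sum((a - b) ** 2 for a, b in zip(x, g))
        for x, g in zip(estimated, ground_truth)
    )
    # Denominator: squared scale of the ground truth model around its center.
    squared_scale = sum(
        sum((b - c) ** 2 for b, c in zip(g, center)) for g in ground_truth
    )
    return squared_error / squared_scale


truth = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
shifted = [(0.0, 0.0, 1.0), (2.0, 0.0, 1.0)]  # every landmark off by one unit
perfect_score = dissimilarity(truth, truth)
shifted_score = dissimilarity(shifted, truth)
```

Because the error is divided by the squared scale of the ground truth model, the measure is dimensionless, which makes values comparable across subjects with faces of different sizes.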
Figure 4. Comparison with ASM+RANSAC in terms of dissimilarity measures. 1. Our method with random-sample initialization; 2. ASM+RANSAC with random-sample initialization; 3. The random-sample initialization; 4. Our method with mean-shape initialization; 5. ASM+RANSAC with mean-shape initialization; 6. The mean-shape initialization.

In conclusion, the results of all the experiments demonstrate that our method, despite the considerable variability of pose and facial geometry, estimates the 3D configuration of the model well even in the presence of misdetections, and significantly outperforms the alternative methods.

6. Conclusion

In this paper, we have introduced a novel approach for 3D landmark model inference from a monocular 2D view that combines the estimation of the 3D pose, the visibility states and the 2D correspondences. The main innovations of the method are the absence of camera parameter estimation, the ability to model geometric consistency through local priors, the explicit modeling of visibility and the one-shot optimization to jointly infer all the variables. We have evaluated our method on standard facial datasets with promising results.

Future work concerns first the achievement of a better model decomposition towards recovering the smallest subset of higher-order interactions that can express the 3D geometric manifold, which could drastically decrease the computational complexity of the method. The use of more advanced parameterizations of the manifold which go beyond simple 3D landmark positions (e.g., the entire surface through some kind of local interpolation) would open new application domains of our method like body pose estimation or medical image analysis where 2D partial acquisition of 3D objects is frequent. Last but not least, faster optimization algorithms for higher-order MRFs, including potential implementations of existing optimizers on GPUs, could be beneficial to our approach both in terms of the considered application as well as in terms of modularity with respect to other 3D pose estimation problems from 2D images.

Acknowledgments

This work was partially supported by the European Research Council Starting Grant DIOCLES (ERC-STG) and the MEDICEN Competitive Cluster sterEOS+ grant.

References

[1] T. D. Alter. 3-D pose from 3 points using weak-perspective. IEEE TPAMI, 16(8).
[2] D. P. Bertsekas. Nonlinear Programming. Athena Scientific.
[3] C. M. Bishop. Pattern Recognition and Machine Learning. Springer.
[4] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In SIGGRAPH.
[5] Y. Boykov and G. Funka-Lea. Graph cuts and efficient N-D image segmentation. IJCV, 70(2).
[6] L. Breiman. Random forests. Machine Learning, 45(1):5-32.
[7] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE TPAMI, 23(6).
[8] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models - their training and application. CVIU, 61(1):38-59.
[9] M. de La Gorce, N. Paragios, and D. J. Fleet. Model-based hand tracking with texture, shading and self-occlusions. In CVPR.
[10] I. L. Dryden and K. V. Mardia. Statistical shape analysis. John Wiley & Sons Inc.
[11] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6).
[12] B. Glocker, H. Heibel, N. Navab, P. Kohli, and C. Rother. TriangleFlow: Optical flow with triangulation-based higher-order likelihoods. In ECCV.
[13] B. Glocker, N. Komodakis, G. Tziritas, N. Navab, and N. Paragios. Dense image registration through MRFs and efficient linear programming. Medical Image Analysis, 12(6).
[14] L. Gu and T. Kanade. 3D alignment of face in a single image. In CVPR.
[15] N. Komodakis and N. Paragios. Beyond pairwise energies: Efficient optimization for higher-order MRFs. In CVPR.
[16] N. Komodakis, N. Paragios, and G. Tziritas. MRF optimization via dual decomposition: Message-passing revisited. In ICCV.
[17] R. Kurazume, K. Nakamura, T. Okada, Y. Sato, N. Sugano, T. Koyama, Y. Iwashita, and T. Hasegawa. 3D reconstruction of a femoral shape using a parametric model and two 2D fluoroscopic images. CVIU, 113(2).
[18] D. Koller, K. Daniilidis, and H.-H. Nagel. Model-based object tracking in monocular image sequences of road traffic scenes. IJCV, 10.
[19] E. B. Sudderth, M. I. Mandel, W. T. Freeman, and A. S. Willsky. Distributed occlusion reasoning for tracking with nonparametric belief propagation. In NIPS.
[20] L. Torresani, V. Kolmogorov, and C. Rother. Feature correspondence via graph matching: Models and global optimization. In ECCV.
[21] D. Vukadinovic and M. Pantic. Fully automatic facial feature point detection using Gabor feature based boosted classifiers. In IEEE SMC.
[22] C. Wang, O. Teboul, F. Michel, S. Essafi, and N. Paragios. 3D knowledge-based segmentation using pose-invariant higher-order graphs. In MICCAI.
[23] L. Yin, X. Chen, Y. Sun, T. Worm, and M. Reale. A high-resolution 3D dynamic facial expression database. In Int. Conf. on Automatic Face and Gesture Recognition.
[24] L. Yin, X. Wei, Y. Sun, J. Wang, and M. Rosato. A 3D facial expression database for facial behavior research. In Int. Conf. on Automatic Face and Gesture Recognition.
