Learning Active Appearance Models from Image Sequences


Jason Saragih¹ Roland Goecke¹,²
¹ Research School of Information Sciences and Engineering, Australian National University
² National ICT Australia, Canberra Research Laboratory, Canberra, Australia
jason.saragih@rsise.anu.edu.au, roland.goecke@nicta.com.au

Abstract

One of the major drawbacks of the Active Appearance Model (AAM) is that it requires a training set of pseudo-dense correspondences. Most methods for automatic correspondence finding involve a groupwise model building process which optimises over all images in the training sequence simultaneously. In this work, we pose the problem of correspondence finding as an adaptive template tracking process. We investigate the utility of this approach on an audio-visual (AV) speech database and show that it can give reasonable results.

Keywords: AAM, automatic model building.

(National ICT Australia is funded by the Australian Government's Backing Australia's Ability initiative, in part through the Australian Research Council. Copyright © 2006, Australian Computer Society, Inc. This paper appeared at the HCSNet Workshop on the Use of Vision in HCI (VisHCI 2006), Canberra, Australia. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 56. R. Goecke, A. Robles-Kelly & T. Caelli, Eds. Reproduction for academic, not-for-profit purposes permitted provided this text is included.)

1 Introduction

Active appearance models (AAM) are a powerful class of generative parametric models for non-rigid visual objects which couple a compact representation with an efficient alignment method. Since their introduction by Edwards et al. (Edwards, Taylor & Cootes 1998) and the subsequent extension (Cootes, Edwards, Taylor, Burkhardt & Neumann 1998), the method has found applications in many image modelling, alignment and tracking problems, for example (Lehn-Schiøler, Hansen & Larsen 2005), (Stegmann & Larsson 2003), (Mittrapiyanuruk, DeSouza & Kak 2005).

The main drawback of the AAM is that it requires pseudo-dense annotations for every training image to build its statistical models of shape and texture. Each of these images may require hundreds of corresponding points. Manual annotation of large databases is therefore tedious and error prone. The process is especially difficult for objects which exhibit only a small number of corner-like features (e.g. the human face consists mostly of edges). A process which automates the annotation step is hence highly desirable and may encourage more widespread use of the AAM.

In this paper, we discuss the automatic annotation (finding physically corresponding points across images) of audio-visual (AV) speech databases which consist of sequences of talking heads. As a test case, we investigate its utility on the AVOZES database (Goecke & Millar 2004). This scenario for automatic annotation is more constrained than the general problem, as the changes in shape and texture between consecutive frames of a sequence are relatively small. Nonetheless, we show that this problem is still a challenging one, mainly due to its high dimensionality, which makes it difficult to optimise and to avoid spurious local minima. We approach the automatic annotation process from a tracking perspective, where the annotations in a reference image are propagated through the sequence by virtue of an adaptive template.

We begin with an overview of related work in Section 2. The problem of image based correspondence is discussed in Section 3.
An outline of our approach to the automatic annotation of image sequences is then presented in Section 4. In Section 5, we describe the results of applying this approach to the AVOZES database. Section 6 concludes with a discussion of the results and future directions.

2 Related Work

There has been significant research over the years into automatically finding semi-dense correspondences across images of the same class for building AAMs. These methods can be broadly categorised into feature based and image based approaches.

Feature based methods (Chui, Win, Schultz, Duncan & Rangarajan 2003), (Walker, Cootes & Taylor 1999), (Hill & Taylor 1996) find correspondences between salient features in the images by examining the local structure of those features. The advantage of this approach is that feature comparisons and calculations are relatively cheap. The downside, however, is twofold. Firstly, there may be insufficient salient features in the object to build a good appearance model. Secondly, as the feature comparisons generally consider only local image structure, the global image structure which the AAM ultimately models is ignored, and hence the model may be suboptimal.

Image based methods (Cootes, Marsland, Twining, Smith & Taylor 2004), (Baker, Matthews & Schneider 2004), (Jebara 2003) usually find dense image correspondences by finding a nonlinear warping function which minimises some error measure between the intensities of the images. The main advantage of these methods is that the global structure of the image is taken into account, better mimicking the AAM for which the correspondences will later be used. The main drawback is that, to accurately represent the shape variations of the visual object, the warping function generally needs to be parametrised by a large number of parameters (usually a set of landmark points). This results in a very large optimisation problem which is slow to solve and prone to terminating in local minima.

3 Image Based Correspondence

The heart of image based methods for correspondence consists of finding a warping function between a set of images such that every location in one image is warped to the same physically meaningful (corresponding) location in all other images. However, as there is no true sense of physical correspondence for un-annotated images, the quality of a set of warping functions is usually quantified by some measure of the compactness of a model built from the warped images. Examples of such measures include MDL (Cootes, Twining, Petrovic, Schestowitz & Taylor 2005), specificity/generalisation (Schestowitz, Twining, Petrovic, Cootes, Crum & Taylor 2006) and minimum volume PCA (Jebara 2003).

Apart from the measure of quality, there is a large amount of variation among image based correspondence methods at the implementation level. These variations include, but are not limited to, the model and warp parametrisation, the model fitting method and the landmark selection process. In this section, we describe the choices we made on these factors for the experiments presented in Section 5. In most cases, we follow the conventions of most AAM implementations.

3.1 Linear Appearance Models

Active appearance models assume that the visual phenomenon being modelled takes the form of a degenerate Gaussian distribution, where the shape and texture can be modelled by a compact set of linear modes of variation. The texture is generated as follows:

$$t(\mathbf{x}) = \bar{t}(\mathbf{x}) + \sum_{k=1}^{m_t} q_k t_k(\mathbf{x}), \quad (1)$$

where $t(\mathbf{x})$ is the generated model texture at pixel location $\mathbf{x}$, $\bar{t}(\mathbf{x})$ is the mean texture at that location, $t_k(\mathbf{x})$ is the $k$-th mode of texture variation and $q_k$ is the magnitude of variation in that direction. Similarly, a novel instance of the model's shape can be generated using:

$$\mathbf{s} = \bar{\mathbf{s}} + \sum_{k=1}^{m_s} p_k \mathbf{s}_k, \quad (2)$$

where $\mathbf{s} = [\mathbf{x}_1; \ldots; \mathbf{x}_n]$ is the shape vector of concatenated landmark locations, $\bar{\mathbf{s}}$ is the mean shape, $\mathbf{s}_k$ is the $k$-th mode of shape variation and $p_k$ is the magnitude of variation in that direction. These models are usually obtained by applying PCA to a set of annotated images, retaining only the $m_t$ and $m_s$ largest modes of variation in texture and shape, respectively. The resulting model is a compact representation of a high dimensional visual object by a small set of parameters.

Although these separate models of variation (called independent appearance models) have been shown to adequately represent the variations exhibited by many visual objects, they fail to take into account the correlations between shape and texture. In some cases, where there is a strong correlation between shape and texture, failing to take these correlations into account may result in a model capable of generating unrealistic instances of the object class. Furthermore, the resulting model may not be as compact as it could be if these correlations were considered in the model building process. An example of this is a person-specific AAM. In these cases, it is beneficial to perform a second level of PCA, this time on the concatenation of the shape and texture parameters:

$$\mathbf{a} = \begin{bmatrix} \mathbf{W}_s \mathbf{p} \\ \mathbf{q} \end{bmatrix}, \quad (3)$$

where $\mathbf{W}_s$ is a weighting matrix which normalises the difference in units between shape and texture. A common choice for this matrix is an isotropic diagonal matrix representing the ratio between the total variations of shape and texture in the training set.
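Before turning to the combined model, the following is a minimal numpy sketch of building and sampling such a linear model; the data layout (one training vector per row) and the SVD-based PCA are our own illustrative choices, not code from the paper.

```python
import numpy as np

def build_linear_model(X, num_modes):
    """PCA model from training vectors stacked as the rows of X.

    Returns the mean, the num_modes largest modes of variation and
    their eigenvalues, as used in Equations (1) and (2)."""
    mean = X.mean(axis=0)
    # PCA via SVD of the centred data matrix.
    _, svals, vt = np.linalg.svd(X - mean, full_matrices=False)
    eigvals = svals**2 / (len(X) - 1)
    return mean, vt[:num_modes], eigvals[:num_modes]

def synthesise(mean, modes, params):
    """Generate a novel instance, e.g. s = s_bar + sum_k p_k s_k (Eq. 2)."""
    return mean + params @ modes
```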
By applying PCA to a set of these training vectors, a combined appearance model is obtained, for which novel instances can be generated as follows:

$$\mathbf{a} = \sum_{k=1}^{m_a} c_k \mathbf{a}_k, \quad (4)$$

where $\mathbf{a}_k$ is the $k$-th mode of combined appearance variation and $c_k$ is the magnitude of variation in that direction. The combined appearance model can be used to generate novel instances of shape and texture directly as follows:

$$\mathbf{s} = \bar{\mathbf{s}} + \mathbf{Q}_s \mathbf{a} \quad (5)$$
$$\mathbf{t} = \bar{\mathbf{t}} + \mathbf{Q}_t \mathbf{a}, \quad (6)$$

where

$$\mathbf{Q}_s = \mathbf{S} \mathbf{W}_s^{-1} \mathbf{A}_s \quad (7)$$
$$\mathbf{Q}_t = \mathbf{T} \mathbf{A}_t \quad (8)$$
$$\mathbf{A} = \begin{bmatrix} \mathbf{A}_s \\ \mathbf{A}_t \end{bmatrix} \quad (9)$$

and

$$\mathbf{S} = [\mathbf{s}_1, \ldots, \mathbf{s}_{m_s}], \quad \mathbf{T} = [\mathbf{t}_1, \ldots, \mathbf{t}_{m_t}], \quad \mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_{m_a}]$$

are matrices of concatenated modes of variation of shape, texture and appearance, respectively. For visual objects exhibiting strong correlations between shape and texture, the resulting combined appearance model is usually more compact than the independent appearance model, exhibiting a smaller number of modes of variation.

3.2 Model Quality

The quality of a model is usually quantified by some measure of compactness. In our work, we follow the method in (Jebara 2003), which estimates the compactness of Gaussian distributed models through an approximation of the volume of the variations of the model. The measure used here is the trace of the model's covariance matrix, i.e. the sum of the eigenvalues of the model:

$$Q = \sum_{i=1}^{m} \lambda_i \quad (10)$$

In the AAM, variations in pixel values in the image frame are generated from variations in both shape and texture, each of which is modelled by a Gaussian distribution. Therefore, a measure of compactness of an appearance model must take into account the compactness of both models, which may disagree with each other. For example, for the same database, a model which exhibits a compact shape distribution may result in a non-compact texture, as it needs to accommodate pixel intensity variations which are not accounted for by the shape. On the other hand, if the texture is evaluated in a reference frame (as opposed to the image frame, as is done in an MDL formulation (Cootes et al. 2005)), the shape may be chosen such that the texture is compact at the cost of a non-compact shape distribution. In (Jebara 2003), only the texture compactness is used as a measure of quality, which may result in a non-compact shape distribution, which in turn may result in a model that can generate implausible shapes. Although it is easy to form a single measure of model quality through a weighting of the compactness of shape and texture, this weighting is usually chosen heuristically, based on intuitions about good results from manual analysis of example models. In this work, we instead investigate the trends of the shape and texture compactness measures separately, for different settings of the training parameters.

As a final note, in our implementation the sum in Equation (10) is performed over all non-zero eigenvalues of the system rather than only the most significant ones. This is because we want to measure the model quality by considering the total amount of variation in the training set. Since the total variation may differ depending on the implementation details, common methods used in PCA, such as retaining only a certain percentage of the total variation, may not give a discriminative measure, as different amounts of variation may be discarded as noise.

Figure 1: Piecewise-affine warping. Top row: pseudo-dense landmark triangulation on images $I_1$ and $I_2$. Bottom: $I_2(\mathbf{W}(\mathbf{x}; \mathbf{p}))$, i.e. $I_2$ warped onto $I_1$ using the piecewise affine warp defined by the triangulation.

3.3 Landmarks and the Warping Function

The shape of an AAM is defined through a set of landmarks, which in turn parametrise the warping function used to project the texture from the image to the reference frame.

3.3.1 Landmark Selection Scheme

The choice of these landmarks is crucial to the compactness of the model. As a rule of thumb, for a given number of landmark points, the set which, under the warping function, accounts for the most shape variation within the object class should be chosen. This way, the variation in the texture model attributable to shape variation is minimised. However, in the problem of automatic model building, the parts of the object which exhibit the most variation in shape are not known a priori. Therefore, a choice must be made regarding the contribution of each location in the image to the variation in texture due to unaccounted variations in shape. In general, locations with high texture contribute more to this variation than do flat regions. Therefore, we propose a sequential selection process where landmarks are chosen iteratively based on their saliency, measured by the cornerness of the point in a reference image. A similar strategy was adopted in (Cootes et al. 2005), where it was demonstrated that placing landmarks on strong edges, and ignoring flat regions, gave the best performance, as it allowed more control over the boundary regions in the image. Our method differs, however, in the way the landmarks are chosen. In their approach, the landmarks are initialised on an equally spaced grid and then moved to the closest strong edge. In our work, we sequentially select the most salient pixel location, then zero out a small region around that point in the saliency image, as sketched below. This process guarantees that the most salient locations are selected, but prevents trivial landmarks (i.e. those which are too close together to represent adequate shape variations) from being selected.
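A short sketch of this sequential selection follows, assuming a precomputed saliency (e.g. cornerness) map; the suppression radius and the map itself are illustrative assumptions, not values from the paper.

```python
def select_landmarks(saliency, num_points, radius):
    """Sequentially pick the most salient pixel, then zero out a small
    region around it so that near-coincident landmarks are skipped."""
    s = saliency.astype(float).copy()
    pts = []
    for _ in range(num_points):
        y, x = np.unravel_index(np.argmax(s), s.shape)
        pts.append((x, y))
        # Suppress a (2*radius + 1)-sized neighbourhood around the pick.
        s[max(0, y - radius):y + radius + 1,
          max(0, x - radius):x + radius + 1] = 0.0
    return np.array(pts, dtype=float)
```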
Apart from these salient landmarks, we also add a fixed number of border landmarks, equally spaced around the image border, such that the whole image is encoded in the texture model. As the domain of an AAM's texture is usually defined within the convex hull of the reference shape only, adding these border landmarks allows the background to be incorporated into the model's texture. This may allow a more accurate model building process, since the boundary between the object and the background can give strong cues for model fitting.

3.3.2 Warping Functions

The most common warping function used for AAMs is the piecewise affine warp. This type of warp utilises a triangulation of the landmarks in the reference image, where pixels within the domain of each triangle are warped using an affine function. Although there are many other warping functions which could be used, such as thin-plate splines or B-splines, the piecewise affine warp is simple and efficient. Furthermore, it allows the inverse of the warp to be computed efficiently, which is beneficial in an image generation process where the texture in the reference frame is projected onto the image frame. Although the piecewise affine warp has the disadvantage that it is discontinuous at the boundaries of the triangles, we find that a sufficiently dense set of landmarks, chosen according to the scheme described in Section 3.3.1, usually results in a triangulation where the edges in the image correspond to edges of the triangles, minimising the effect of this discontinuity. An example of a pseudo-dense landmark selection with its triangulation and warping process is shown in Figure 1.
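The following is a minimal, unoptimised sketch of such a piecewise affine warp for a greyscale image, using nearest-neighbour sampling; practical implementations precompute triangle membership and interpolate. The triangulation `tris` (landmark index triples, e.g. from a Delaunay triangulation of the reference shape) is an assumed input.

```python
def piecewise_affine_warp(image, tris, src_pts, dst_pts, out_shape):
    """Warp `image` (landmarks `src_pts`) onto the reference frame
    (landmarks `dst_pts`): for each destination pixel, find its triangle,
    take barycentric coordinates, and sample the source image there."""
    out = np.zeros(out_shape, dtype=image.dtype)
    ys, xs = np.mgrid[0:out_shape[0], 0:out_shape[1]]
    pix = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    for tri in tris:
        d = dst_pts[tri]          # triangle in the reference frame
        s = src_pts[tri]          # same triangle in the image frame
        # Barycentric coordinates of every pixel w.r.t. this triangle
        # (assumes non-degenerate triangles).
        M = np.column_stack([d[1] - d[0], d[2] - d[0]])
        bary = np.linalg.solve(M, (pix - d[0]).T).T
        inside = (bary >= 0).all(axis=1) & (bary.sum(axis=1) <= 1)
        # Affine map: same barycentric coordinates in the source triangle.
        src = s[0] + bary[inside] @ np.stack([s[1] - s[0], s[2] - s[0]])
        xi = np.clip(src[:, 0].round().astype(int), 0, image.shape[1] - 1)
        yi = np.clip(src[:, 1].round().astype(int), 0, image.shape[0] - 1)
        out.ravel()[np.flatnonzero(inside)] = image[yi, xi]
    return out
```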

3.4 Alignment

Regardless of the model building process used, automatic AAM construction generally involves a non-rigid registration to align the model to an image. The alignment process essentially finds the model parameters which best describe the image. This usually involves minimising some measure of fitness between the model and the image which contains a data term and a smoothness term:

$$C = C_d + \eta C_s, \quad (11)$$

where $C_d$ is the data term, $C_s$ is the smoothness term and $\eta$ is a regularisation parameter which trades off the contribution of the data and smoothness terms to the total cost.

3.4.1 The Data Term

The data term is usually defined as a function of the difference between the model's texture and the image texture warped back to the reference frame:

$$C_d = \sum_{\mathbf{x} \in \Omega} \rho\left(E(\mathbf{x}); \sigma\right) \quad (12)$$
$$E(\mathbf{x}) = t(\mathbf{x}; \mathbf{q}) - I(\mathbf{W}(\mathbf{x}; \mathbf{p})), \quad (13)$$

where $\Omega$ is the domain over which the model's texture is defined (i.e. the convex hull of the landmark points), $t(\mathbf{x}; \mathbf{q})$ is the model's texture, $I(\mathbf{W}(\mathbf{x}; \mathbf{p}))$ is the image texture warped back to the reference frame, and $\rho$ is some function of the residuals, parametrised by $\sigma$. A common choice in AAM alignment is the L2-norm (Baker & Matthews 2002), in which case the data term takes the least squares form. However, in some cases it may be beneficial to use a robust error function to minimise the effect of outliers in the data. This is particularly important in model building, as regions which are not yet accounted for by the texture model may deteriorate the estimate of the shape in other parts of the image, leading to a non-compact model. For the experiments presented in Section 5, we use the Geman-McClure function:

$$\rho(r; \sigma) = \frac{r^2}{\sigma^2 + r^2}, \quad (14)$$

which has been used extensively for optical flow estimation (Black & Anandan 1993), (Blake, Isard & Reynard 1994). The choice of the scale parameter $\sigma$ for robust error functions is always problematic, as it depends on the distribution of the residuals. One approach is to assume that the error function models the underlying distribution of residuals, and to find the $\sigma$ which best fits that distribution. However, this usually leads to a complex non-linear estimation process. Therefore, in our work, we assume a contaminated Gaussian distribution for the residuals. In this framework, the estimate of $\sigma$ can be derived from the median value of the absolute residuals:

$$\sigma = \mathrm{med}\left(|E(\mathbf{x})|\right), \quad (15)$$

which has been claimed to have excellent resistance towards outliers, tolerating almost 50% of them (Sawhney & Ayer 1996).

3.4.2 The Smoothness Term

In automatic model building, the landmarks should be allowed to move freely to minimise the data term. However, as the AAM's shape consists of a pseudo-dense set of landmarks, the dimensionality of the optimisation process is very large and, if not constrained, the process is likely to get trapped in spurious local minima. These minima usually correspond to implausible shapes. As such, a smoothness term is required to encourage the model to deform smoothly.

The form of the smoothness constraint depends on the visual object being modelled. The most common choice is to penalise the magnitude of the deformation of every landmark from a reference shape, as was adopted in (Baker et al. 2004). The problem with this approach is that it does not take into account the spatial relationship between the deformations of landmarks. In this work, we penalise only the difference between the deformations of landmarks, similar to the smoothness constraint in variational optical flow estimation (Brox, Bruhn, Papenberg & Weickert 2004). The differences are weighted by a smooth function of the landmark distances in a predefined shape:

$$C_s = \sum_{i,j}^{n} k_{ij} \, \|d(i,j)\|^2, \quad (16)$$

where

$$k_{ij} = \frac{\exp\left(-\|\hat{\mathbf{x}}_i - \hat{\mathbf{x}}_j\|^2 / 2\sigma_s^2\right)}{\sum_{j}^{n} \exp\left(-\|\hat{\mathbf{x}}_i - \hat{\mathbf{x}}_j\|^2 / 2\sigma_s^2\right)}, \quad (17)$$

$\sigma_s$ is a smoothing factor, and

$$d(i,j) = \left[\mathbf{W}(\mathbf{x}_i; \mathbf{p}) - \hat{\mathbf{x}}_i\right] - \left[\mathbf{W}(\mathbf{x}_j; \mathbf{p}) - \hat{\mathbf{x}}_j\right] \quad (18)$$

is the difference between landmark displacements, with $\hat{\mathbf{x}}_k = \mathbf{W}(\mathbf{x}_k; \mathbf{p}_0)$ being the location of the $k$-th landmark in the predefined shape, parametrised by $\mathbf{p}_0$. In most works utilising a smoothness measure, the predefined shape is set to the reference shape (i.e. $\mathbf{p}_0 = \mathbf{0}$).
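In code, the robust error, its scale estimate and the smoothness weights are straightforward; this sketch follows Equations (14), (15) and (17) directly.

```python
def geman_mcclure(r, sigma):
    """Geman-McClure robust error rho(r; sigma) = r^2 / (sigma^2 + r^2),
    Equation (14)."""
    return r**2 / (sigma**2 + r**2)

def robust_scale(E):
    """Scale estimate as the median absolute residual (Equation 15),
    under the contaminated-Gaussian assumption."""
    return np.median(np.abs(E))

def smoothness_weights(ref_pts, sigma_s):
    """Row-normalised Gaussian weights k_ij over landmark distances in
    the predefined shape (Equation 17)."""
    d2 = ((ref_pts[:, None, :] - ref_pts[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * sigma_s**2))
    return w / w.sum(axis=1, keepdims=True)
```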
The problem with this is that it assumes the deformations are isotropic for all landmarks. This type of smoothing does not fit the notion of a linear shape class modelled by a degenerate Gaussian. In contrast, we set the predefined shape to the initial shape in the alignment process. Smoothing the deformations in an isotropic manner starting from this shape better suits the form of the shape model, as it does not over-constrain the overall highly anisotropic shape deformations whilst still encouraging the landmarks to deform smoothly.

3.4.3 Optimisation

To optimise the cost function in Equation (11), we adopt the Gauss-Newton method, which is commonly used for image alignment. To allow the use of the robust error function in the Gauss-Newton optimisation procedure, the data term must be reformulated. Since it contains no squared term, the derivation of the parameter update would otherwise require a second order Taylor expansion, akin to the Newton algorithm. Therefore, following (Baker, Gross & Matthews 2003), we replace the data term in Equation (12) with:

$$C_d = \sum_{\mathbf{x} \in \Omega} \varrho\left(E(\mathbf{x})^2; \sigma\right) \quad (19)$$

and the reformulated robust error function:

$$\varrho(r; \sigma) = \frac{r}{\sigma^2 + r}. \quad (20)$$

This requires only that the error function is symmetric, which is satisfied by the Geman-McClure function. With this reformulation, the Gauss-Newton Hessian of the data term is given by:

$$\mathbf{H}_d = \sum_{\mathbf{x} \in \Omega} \varrho'\left(E(\mathbf{x})^2\right) \mathbf{J}_d(\mathbf{x})^T \mathbf{J}_d(\mathbf{x}), \quad (21)$$

where $\varrho'(E(\mathbf{x})^2)$ is the derivative of the reformulated robust error function and

$$\mathbf{J}_d(\mathbf{x}) = \left[\, \nabla I(\mathbf{W}(\mathbf{x}; \mathbf{p})) \frac{\partial \mathbf{W}(\mathbf{x}; \mathbf{p})}{\partial \mathbf{p}} \quad \frac{\partial t(\mathbf{x}; \mathbf{q})}{\partial \mathbf{q}} \,\right] \quad (22)$$

is the Jacobian of the data term. It should be noted here that, since we allow the landmark points to move freely, the warping function $\mathbf{W}$ is directly parametrised by the locations of the landmarks (i.e. $\mathbf{p} = [\mathbf{x}_1; \ldots; \mathbf{x}_n]$). Therefore, the distance measure in Equation (18) is equivalent to:

$$d(i,j) = (\mathbf{x}_i - \hat{\mathbf{x}}_i) - (\mathbf{x}_j - \hat{\mathbf{x}}_j). \quad (23)$$

This is in contrast to the usual AAM formulation, where the warp is parametrised by the magnitudes of the modes of shape variation. The Gauss-Newton Hessian of the smoothness term is given by:

$$\mathbf{H}_s = \sum_{i,j}^{n} k_{ij} \left[ \mathbf{J}_x(i,j)^T \mathbf{J}_x(i,j) + \mathbf{J}_y(i,j)^T \mathbf{J}_y(i,j) \right], \quad (24)$$

where the $2k$-th entry of the $x$ smoothness term's Jacobian $\mathbf{J}_x(i,j)$ is given by:

$$\mathbf{J}_x(i,j)_{2k} = \begin{cases} 1 & \text{if } k = i \\ -1 & \text{if } k = j \\ 0 & \text{otherwise} \end{cases}, \quad (25)$$

and similarly for the $(2k+1)$-th entry of $\mathbf{J}_y$. For $\mathbf{J}_x$, the entries at the $(2k+1)$-th locations are all zero, and similarly for the $2k$-th locations of $\mathbf{J}_y$. This simple form, which affords a fast calculation of the Hessian and gradient, is a result of optimising directly over the landmark locations. The parameter update of the Gauss-Newton optimisation of Equation (11) then takes the following form (a sketch of this update appears at the end of this subsection):

$$\begin{bmatrix} \Delta \mathbf{p} \\ \Delta \mathbf{q} \end{bmatrix} = \left[ \mathbf{H}_d + \eta \begin{bmatrix} \mathbf{H}_s & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix} \right]^{-1} \left[ \mathbf{g}_d + \eta \begin{bmatrix} \mathbf{g}_s \\ \mathbf{0} \end{bmatrix} \right], \quad (26)$$

where

$$\mathbf{g}_d = \sum_{\mathbf{x} \in \Omega} \varrho'\left(E(\mathbf{x})^2\right) \mathbf{J}_d(\mathbf{x})^T E(\mathbf{x}) \quad (27)$$
$$\mathbf{g}_s = \sum_{i,j}^{n} k_{ij} \left[ \mathbf{J}_x(i,j)^T d_x(i,j) + \mathbf{J}_y(i,j)^T d_y(i,j) \right] \quad (28)$$

are the gradients of the data and smoothness terms, respectively.

The optimisation process can usually be sped up by using the inverse compositional formulation (Matthews & Baker 2003). By reversing the roles of the model and the image in the data term, the gradients of the data term can be precomputed, and hence a large proportion of the computation needs to be done only once. The extension of the inverse compositional image alignment (ICIA) algorithm to robust error norms was proposed in (Baker et al. 2003). With this formulation, the Hessian of the data term cannot be precomputed, despite the fixed gradients, as the derivatives of the robust error terms cannot be precomputed. Although an efficient approximation has been derived by assuming spatial coherence of the outliers, this implementation is not particularly effective for automatic model building from databases, as the images are generally occlusion free, with outliers stemming mainly from misalignment, image noise, changes in texture not yet accounted for by the texture model, and interlacing effects. The presence of the smoothness term means that the Hessian needs to be updated and inverted at every iteration, which is the most costly part of the optimisation when there is a large number of landmark points. Furthermore, for the methods described in Section 4, the model is updated after every image, requiring the gradients to be recomputed. Due to these factors, we predict that using the ICIA formulation would not give dramatic improvements in efficiency, and hence it was not incorporated into our implementation.
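A sketch of one such update step follows, assuming the per-pixel data Jacobians are stacked row-wise into `Jd` and that the shape parameters occupy the leading entries of the parameter vector; the assembly of `Jd`, `Hs` and `gs` is as in Equations (21)-(28).

```python
def gauss_newton_step(Jd, E, rho_prime, Hs, gs, eta):
    """One robust Gauss-Newton update (Equation 26).

    Jd        : (num_pixels, num_params) stacked data Jacobians
    E         : (num_pixels,) residuals t(x;q) - I(W(x;p))
    rho_prime : (num_pixels,) derivative of the reformulated robust
                error evaluated at E**2 (the per-pixel weights)
    Hs, gs    : smoothness Hessian/gradient over the shape parameters
    """
    WJ = Jd * rho_prime[:, None]
    Hd = WJ.T @ Jd                  # Equation (21)
    gd = WJ.T @ E                   # Equation (27)
    n_s = Hs.shape[0]               # smoothness acts on landmarks only
    H = Hd.copy()
    H[:n_s, :n_s] += eta * Hs
    g = gd.copy()
    g[:n_s] += eta * gs
    return np.linalg.solve(H, g)    # the update [delta_p; delta_q]
```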
3.4.4 Gaussian Pyramid

Despite the use of the smoothing term, the optimisation process may still converge to a local minimum due to the high dimensionality of the problem. This problem can be partially alleviated by optimising over a Gaussian pyramid. There are issues, however, with regard to how the shape is parametrised between the levels of the pyramid. A pseudo-dense correspondence at the lowest level of the pyramid may result in an over-parametrised model at the highest level, which results both in a slow alignment process and in a higher likelihood of getting stuck in local minima.

Instead, in this work we build a separate model for each level. Starting at the highest pyramid level, a set of landmarks is chosen as described in Section 3.3.1. With this, the automatic model building process described in Section 4 is performed. Moving down the pyramid, a new set of landmarks is chosen from the reference image. The propagation of these landmarks to the other images is illustrated in Figure 2 and sketched in code below: first the landmarks are downscaled to the previous pyramid level (bottom to top left in Figure 2), then the landmarks are warped using the found correspondence for that level (top row), and finally up-scaled back to the current pyramid level (top to bottom right).

Figure 2: Initialising points in lower levels of the Gaussian pyramid. Top row: warps at the higher pyramid level. Bottom: landmarks at the current pyramid level.

With the smoothness term described in Section 3.4.2, the use of the Gaussian pyramid allows a stiff regularisation parameter η to be used, as the movements of points at every level will be relatively small. This in turn allows the optimisation process to better avoid local minima.
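A one-line sketch of this propagation; `warp_coarse`, the correspondence found at the coarser level (mapping reference coordinates to image coordinates), is a hypothetical stand-in, and a fixed scale factor of 2 between levels is assumed.

```python
def propagate_landmarks(pts, warp_coarse, scale=2.0):
    """Downscale landmarks to the coarser level, apply the warp found
    there, then upscale back to the current level (cf. Figure 2)."""
    return scale * warp_coarse(pts / scale)
```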

4 Incremental Model Building

Most approaches to automatic model building can be classed as groupwise, where a model is iteratively refined from an initial estimate by first fitting it to each image, followed by a reconstruction of a new model from the fitted images. One of the drawbacks of this approach is that it does not take into account the sequential nature of images in video. As such, its initial estimate of the model may be far from the optimum, which may cause the algorithm to converge slowly or get stuck in local minima.

By assuming that the appearance of the visual object varies slowly between consecutive frames in a sequence, the model building process can be posed as a tracking problem. Although the complexity of the warping function is much higher than in most tracking problems, which generally solve only for a similarity or affine transform, the same mechanisms apply. We start with an initial template, without loss of generality taken as the first image in the sequence, and propagate the landmark positions to the other images in the sequence through a consecutive alignment process. Unlike typical tracking problems, however, due to the high dimensionality of the parameter space, the alignment process must generally utilise gradient based approaches, as non-gradient methods such as particle filters would be too computationally expensive to evaluate.

One of the main difficulties associated with template tracking is the change in the object's texture throughout the sequence. Although this problem can be partially alleviated by using a robust error function, as the sequence progresses the object's texture may undergo such significant changes that treating them as outliers may lead to misalignment. One solution to this problem is to update the template using the texture from the previous frame. However, simply replacing the template's texture with the most recent image makes the algorithm prone to drifting. In this work, we investigate the utility of two adaptable template approaches for automatic model building from image sequences.

4.1 Method 1: Grounded Templates

There are a number of approaches to the template update problem which reduce the drifting phenomenon, for example (Matthews, Ishikawa & Baker 2004), (Zhong, Jain & Dubuisson-Jolly 2000), (Loy, Goecke, Rougeaux & Zelinsky 2000). In this work we follow the approach of (Loy et al. 2000), where the new template is defined as a weighted combination of the initial template and the texture from the most recent image:

$$T_t(\mathbf{x}) = \alpha T_0(\mathbf{x}) + (1 - \alpha) T_{t-1}(\mathbf{x}). \quad (29)$$

The parameter $\alpha \in (0, 1)$ is a grounding factor which reduces the drifting effect whilst allowing the template to adapt to the object's current texture. As the template is updated once before the alignment process in the next image, the optimisation needs to be performed only over the landmark locations. Therefore, the Jacobian of the data term in Equation (22) for this method is simply:

$$\mathbf{J}_d(\mathbf{x}) = \nabla I(\mathbf{W}(\mathbf{x}; \mathbf{p})) \frac{\partial \mathbf{W}(\mathbf{x}; \mathbf{p})}{\partial \mathbf{p}} \quad (30)$$

and the Gauss-Newton update in Equation (26) is now given by:

$$\Delta \mathbf{p} = \left[ \mathbf{H}_d + \eta \mathbf{H}_s \right]^{-1} \left[ \mathbf{g}_d + \eta \mathbf{g}_s \right]. \quad (31)$$

The output of the template matching algorithm is a set of corresponding annotations for every image in the sequence, from which an appearance model can be built in the usual manner.
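A sketch of Method 1 as we read it: `align` is a hypothetical stand-in for the Gauss-Newton alignment of Section 3.4 (returning the fitted landmarks and the image texture sampled in the reference frame), and α = 0.5 is an arbitrary illustrative setting; the paper does not prescribe these names or values.

```python
def track_grounded(frames, T0, pts0, align, alpha=0.5):
    """Method 1: propagate landmarks through a sequence while keeping
    the template anchored to the initial texture (Equation 29)."""
    T, pts = T0, pts0
    annotations = [pts0]
    for frame in frames[1:]:
        pts, sampled = align(frame, T, pts)   # fit, then I(W(x;p))
        # Equation (29), reading T_{t-1} as the most recently
        # encountered (aligned) texture.
        T = alpha * T0 + (1 - alpha) * sampled
        annotations.append(pts)
    return annotations
```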
4.2 Method 2: Incremental Texture Learning

One of the weaknesses of the template update approach is that it takes into account only the initial and most recently encountered textures. As such, it makes no use of the knowledge of the texture variations encountered earlier in the sequence. One possibility for incorporating this information is to perform an incremental model building process as the object is tracked through the sequence. For this algorithm we utilise incremental PCA (Li 2004) to update the model, rather than the template, after matching to every new image.

Starting with the template of the first image, we match it to the next image using the approach described in Section 4.1. Some of the variations captured as outliers may in fact be intrinsic variations of the object rather than just image noise. The texture of the newly aligned image is therefore used as a new data instance for the linear model, which incremental PCA integrates into the model. The amnesic factor (a weighting between the current model and the new data instance) is set to $\frac{n}{1+n}$, where $n$ is the number of samples used to build the current model, so that every sample integrated into the model is given the same importance; see (Li 2004) for details. Once the model exhibits some linear modes of variation apart from the mean, matching to the next image is done by simultaneously updating the landmark locations and the texture parameters $\mathbf{q}$, using the update equations described in Section 3.4.3. This way, images which exhibit texture variations previously encountered in the sequence will be matched better than with a fixed template. Again, the data term is formulated using the robust error function to account for texture variations not yet encountered in the sequence.
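The following sketch shows one incremental update with the equal-importance amnesic weighting n/(n+1); for brevity it re-diagonalises a small augmented basis rather than reproducing Li's (2004) in-place algorithm, so it should be read as an illustration of the weighting, not as that paper's exact method.

```python
def incremental_pca_update(mean, modes, eigvals, x, n):
    """Fold one new texture sample x into a PCA model built from n
    samples, weighting old model and new sample by n/(n+1), 1/(n+1)."""
    w_old, w_new = n / (n + 1.0), 1.0 / (n + 1.0)
    new_mean = w_old * mean + w_new * x
    resid = x - new_mean
    # Re-estimate the subspace from the scaled old modes plus the
    # residual direction of the new sample.
    B = np.vstack([modes * np.sqrt(w_old * eigvals)[:, None],
                   np.sqrt(w_new) * resid])
    _, svals, vt = np.linalg.svd(B, full_matrices=False)
    k = len(eigvals)
    return new_mean, vt[:k], svals[:k] ** 2
```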

5 Experiments

5.1 The AVOZES Database

AVOZES (Goecke & Millar 2004), the Audio-Video Australian English Speech data corpus, is a database of 20 speakers uttering a variety of phrases, designed for research on the statistical relationship of audio and video speech parameters with an audio-video automatic speech recognition task in mind. Although sparse annotations for the vital mouth points, such as the lip corners, are available, these points were chosen manually and represent only a heuristic intuition about their usefulness for automatic speech recognition. A more elaborate set of cues, which may not be directly obvious, may be useful for audio-video speech recognition. AAMs, which encode both a pseudo-dense set of landmark points and texture variations, provide a rich set of features to a speech recognition system, which may allow better recognition rates to be achieved. An intensive study of the application of AAMs in this domain can be found in (Neti, Potaminos, Luettin, Matthews, Glotin & Vergyri 2001).

In our experiments we used only the continuous speech sequences for each of the speakers. The continuous speech part of AVOZES consists of three sequences, each with a different phrase. The lengths of the sequences range from 90 to 150 frames. As the video files in the database consist of a stereo pair, warped to half height, we used only the part of each sequence pertaining to the left camera, which we scaled back to the true aspect ratio.

For each of the speakers, we performed both of the image based correspondence methods described in Section 4 on all three sequences together. Since there may be large differences between the start and end of the sequences of the different phrases, we find the images in the latter two sequences which are most similar to an image in the first sequence. After tracking through the first sequence, the model is tracked in each other sequence starting from the most similar image found previously, initialising the shape estimate from the corresponding image in the first sequence. The tracking process in these other sequences is performed forwards and backwards until the end and beginning of the sequences, respectively. From the resulting correspondences, the compactness of the shape and texture models is calculated as described in Section 3.2. The experiments were repeated for a number of settings of the smoothing parameter η.

Figure 3: Shape and texture model compactness for every speaker in AVOZES. The models were built from correspondences found using the grounded template method with three settings of the regularisation parameter η = {1, 10, 100}.

Figure 4: Shape and texture model compactness for every speaker in AVOZES. The models were built from correspondences found using the incremental texture learning method with three settings of the regularisation parameter η = {1, 10, 100}.

5.2 Results

Figures 3 and 4 show histograms of the shape and texture model compactness for each of the speakers in the AVOZES database, built from correspondences obtained using the methods described in Sections 4.1 and 4.2, for three different settings of the regularisation parameter η. Comparing the two methods, the shape compactness differs little between them. The main difference lies in the texture compactness, where the incremental texture learning method generates models which are around twice as compact for most speakers compared to the grounded template method. As discussed in Section 4.2, this result is expected, as incremental texture learning retains a memory of previously encountered texture variations. Also, as the alignment process may contain errors which can accumulate throughout the sequence, this approach is constrained to valid texture instances rather than just the first and most recently encountered textures, which may be erroneous.

Studying each method independently, the compactness of the shape model improves as η is increased, as expected. Perhaps more surprisingly, the texture model's compactness is affected little by the different settings of the regularisation parameter. We attribute this to the fact that the texture model is evaluated in a reference frame. The effect of this is that, for groups of landmarks which correspond to flat parts of the image, their movements contribute little to the change in the texture when projected onto the reference frame. As such, shapes with significantly different landmark locations in these flat regions may result in very similar textures. An example of this is shown in Figure 5. Landmarks in flat regions are more likely to be perturbed by image noise, and hence, for the same texture compactness, the model with the better shape compactness is generally the better model.

From the correspondences in each image, found using the incremental texture learning method with η = 100, we built a combined appearance model (see Section 3.1) using every 10th image in the sequences. The mean and first mode of variation for all speakers are shown in Figures 7 and 8.
Although the correspondences appear to be of high quality for most speakers, as observed through the crispness of the images, there are a few for which the tracking method seems to have failed to obtain the correct correspondences across the sequences. In particular, the f7 and m5 speakers are particularly poor, where the first mode of variation seems to entail the presence or disappearance of visual artefacts. Referring to the texture compactness histograms in Figures 3 and 4, it can be seen that these two speakers exhibit the least compact models in the database by a significant margin. It is clear that in these cases, the tracking process used to find the correspondences failed significantly in parts of the image, resulting in the texture model needing to account for variations due to misalignment rather than intrinsic texture variations of the speaker.

Figure 5: Two shapes with significant landmark differences in flat regions exhibiting similar texture when projected to the reference frame. Top: shape difference. Middle: shapes of two images. Bottom: texture projections onto the reference image.

Figure 6: Images from the f7 and m5 speakers which illustrate the large differences in scale affecting content in the images.

On closer inspection, we found that these two speakers exhibited significant motion towards the camera during some parts of the sequences. Example images from these sequences are shown in Figure 6. As such, significant parts of the background are occluded when the speakers are close to the camera, but reappear when they move further from it. Because the background exhibits some strong texture and colour variations (see the white strip behind the speakers' heads), the disappearance/emergence of these areas perturbs the alignment process significantly, despite the use of a robust error function. As models of the other speakers, which exhibit relatively small amounts of head movement, could be built compactly, we suspect that databases with a uniform background do not exhibit this problem. However, in cases where this is not practical, one solution would be to initialise the feature points within the face region exclusively, either using a manual crop in the first image or using some type of skin colour detector. It should be noted, however, that the accuracy of the alignment around the periphery of the face using this approach may be inferior to that of an approach which encodes the background.

As a final note, although the methods tested here have been shown to give reasonably compact models when no significant visual artefacts disappear or emerge throughout the sequence, the correspondences are obtained in a pairwise manner, so the model quality may be improved through a groupwise method. In fact, the methods discussed in this work can serve as a good initialisation for groupwise methods, encouraging faster convergence and helping to avoid local minima.

6 Conclusion

In this work, we have investigated the utility of adaptive tracking methods for automatically building pseudo-dense correspondences across sequences of a deformable object, with an AV database as a test case. We compared two methods, the grounded template and incremental texture learning methods, measuring their performance through shape and texture compactness measures as well as a qualitative analysis of the resulting linear models of variation. Through extensive experiments we have shown that this approach can be used to build highly compact models of a linearly deforming object, including the background in the image. We also found that if the background exhibits significant texture then, despite the background being static, movements of the object which cause these textured regions to be occluded, or new textured regions to appear later in the sequence, significantly degrade the performance of this method. However, we suspect that this is a problem exhibited by most image based correspondence methods which utilise diffeomorphic warps and do not explicitly model the disappearance or emergence of visual artefacts. Future work on extending this method might involve investigating the efficiency gains of the inverse compositional formulation, evaluating the alignment error in the image rather than the reference frame, and extending the approach to incremental shape model learning.
Although the methods investigated in this paper and their possible extensions allow significant savings in human intervention, requiring only one manual markup per speaker, for large databases containing thousands of sequences this approach may be infeasible. The much more difficult problem of finding correspondences across sequences of different instances of the same object class (different speakers in AVOZES, for example) remains an open problem.

References

Baker, S., Gross, R. & Matthews, I. (2003), Lucas-Kanade 20 years on: A unifying framework: Part 3, Technical report, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA.

Baker, S. & Matthews, I. (2002), Lucas-Kanade 20 years on: A unifying framework: Part 1, Technical Report CMU-RI-TR-02-16, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA.

Baker, S., Matthews, I. & Schneider, J. (2004), Automatic construction of active appearance models as an image coding problem, IEEE Transactions on Pattern Analysis and Machine Intelligence 26(10).

Black, M. & Anandan, P. (1993), The robust estimation of multiple motions: Affine and piecewise-smooth flow fields, Technical report, Xerox PARC.

Blake, A., Isard, M. & Reynard, D. (1994), Learning to track curves in motion, in IEEE Conference on Decision and Control.

Brox, T., Bruhn, A., Papenberg, N. & Weickert, J. (2004), High accuracy optical flow estimation based on a theory for warping, in T. Pajdla & J. Matas, eds, 8th European Conference on Computer Vision, Vol. 4, Springer-Verlag, Prague, Czech Republic.

Chui, H., Win, L., Schultz, R., Duncan, J. S. & Rangarajan, A. (2003), A unified non-rigid feature registration method for brain mapping, Medical Image Analysis 7(2).

Cootes, T. F., Edwards, G., Taylor, C. J., Burkhardt, H. & Neumann, B. (1998), Active appearance models, in European Conference on Computer Vision, Vol. 2.

Cootes, T. F., Marsland, S., Twining, C. J., Smith, K. & Taylor, C. J. (2004), Groupwise diffeomorphic non-rigid registration for automatic model building, in European Conference on Computer Vision.

Cootes, T. F., Twining, C. J., Petrovic, V., Schestowitz, R. & Taylor, C. J. (2005), Groupwise construction of appearance models using piece-wise affine deformations, in British Machine Vision Conference, Vol. 2.

Edwards, G., Taylor, C. J. & Cootes, T. F. (1998), Interpreting face images using active appearance models, in IEEE International Conference on Automatic Face and Gesture Recognition.

Goecke, R. & Millar, J. B. (2004), The Audio-Video Australian English Speech data corpus AVOZES, in 8th International Conference on Spoken Language Processing INTERSPEECH-ICSLP, Vol. III, ISCA, Jeju, Korea.

Hill, A. & Taylor, C. J. (1996), A method of non-rigid correspondence for automatic landmark identification, in 7th British Machine Vision Conference, Vol. 2.

Jebara, T. (2003), Images as bags of pixels, in International Conference on Computer Vision.

Lehn-Schiøler, T., Hansen, L. K. & Larsen, J. (2005), Mapping from speech to images using continuous state space models, in Lecture Notes in Computer Science, Vol. 3361, Springer.

Li, Y. (2004), On incremental and robust subspace learning, Pattern Recognition 37(7).

Loy, G., Goecke, R., Rougeaux, S. & Zelinsky, A. (2000), Stereo 3D lip tracking, in 6th International Conference on Control, Automation, Robotics and Vision, Singapore.

Matthews, I. & Baker, S. (2003), Active appearance models revisited, Technical Report CMU-RI-TR-03-02, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA.

Matthews, I., Ishikawa, T. & Baker, S. (2004), The template update problem, IEEE Transactions on Pattern Analysis and Machine Intelligence 26(6).

Mittrapiyanuruk, P., DeSouza, G. N. & Kak, A. C. (2005), Accurate 3D tracking of rigid objects with occlusion using active appearance models, in IEEE Workshop on Motion and Video Computing.

Neti, C., Potaminos, G., Luettin, J., Matthews, I., Glotin, H. & Vergyri, D. (2001), Large-vocabulary audio-visual speech recognition: A summary of the Johns Hopkins Summer 2000 Workshop, in Workshop on Multimedia Signal Processing (MMSP), Cannes.

Sawhney, H. S. & Ayer, S. (1996), Compact representation of videos through dominant and multiple motion estimation, IEEE Transactions on Pattern Analysis and Machine Intelligence 18(8).

Schestowitz, R. S., Twining, C. J., Petrovic, V. S., Cootes, T., Crum, B. & Taylor, C. J. (2006), Non-rigid registration assessment without ground truth, in Medical Image Understanding and Analysis, Vol. 2.

Stegmann, M. B. & Larsson, H. B. (2003), Fast registration of cardiac perfusion MRI, in International Society of Magnetic Resonance In Medicine, Toronto, Canada.

Walker, K. N., Cootes, T. F. & Taylor, C. J. (1999), Automatically building appearance models from image sequences using salient features, in T. Pridmore, ed., British Machine Vision Conference, Vol. 2.

Zhong, Y., Jain, A. K. & Dubuisson-Jolly, M. P. (2000), Object tracking using deformable templates, IEEE Transactions on Pattern Analysis and Machine Intelligence 22(5).

Figure 7: The first mode of variation of the female speakers in AVOZES, varied between ±3 standard deviations (columns: $-3\sqrt{\lambda}$, $\mu$, $+3\sqrt{\lambda}$).

Figure 8: The first mode of variation of the male speakers in AVOZES, varied between ±3 standard deviations (columns: $-3\sqrt{\lambda}$, $\mu$, $+3\sqrt{\lambda}$).


More information

ELEC Dr Reji Mathew Electrical Engineering UNSW

ELEC Dr Reji Mathew Electrical Engineering UNSW ELEC 4622 Dr Reji Mathew Electrical Engineering UNSW Review of Motion Modelling and Estimation Introduction to Motion Modelling & Estimation Forward Motion Backward Motion Block Motion Estimation Motion

More information

3D Model Acquisition by Tracking 2D Wireframes

3D Model Acquisition by Tracking 2D Wireframes 3D Model Acquisition by Tracking 2D Wireframes M. Brown, T. Drummond and R. Cipolla {96mab twd20 cipolla}@eng.cam.ac.uk Department of Engineering University of Cambridge Cambridge CB2 1PZ, UK Abstract

More information

Dense Active Appearance Models Using a Bounded Diameter Minimum Spanning Tree

Dense Active Appearance Models Using a Bounded Diameter Minimum Spanning Tree ANDERSON, STENGER, CIPOLLA: DENSE AAMS USING A BDMST Dense Active Appearance Models Using a Bounded Diameter Minimum Spanning Tree Robert Anderson Department of Engineering Cambridge University Cambridge,

More information

Lucas-Kanade 20 Years On: A Unifying Framework: Part 2

Lucas-Kanade 20 Years On: A Unifying Framework: Part 2 LucasKanade 2 Years On: A Unifying Framework: Part 2 Simon aker, Ralph Gross, Takahiro Ishikawa, and Iain Matthews CMURITR31 Abstract Since the LucasKanade algorithm was proposed in 1981, image alignment

More information

A Method of Automated Landmark Generation for Automated 3D PDM Construction

A Method of Automated Landmark Generation for Automated 3D PDM Construction A Method of Automated Landmark Generation for Automated 3D PDM Construction A. D. Brett and C. J. Taylor Department of Medical Biophysics University of Manchester Manchester M13 9PT, Uk adb@sv1.smb.man.ac.uk

More information

Visual Tracking. Antonino Furnari. Image Processing Lab Dipartimento di Matematica e Informatica Università degli Studi di Catania

Visual Tracking. Antonino Furnari. Image Processing Lab Dipartimento di Matematica e Informatica Università degli Studi di Catania Visual Tracking Antonino Furnari Image Processing Lab Dipartimento di Matematica e Informatica Università degli Studi di Catania furnari@dmi.unict.it 11 giugno 2015 What is visual tracking? estimation

More information

Feature Detection and Tracking with Constrained Local Models

Feature Detection and Tracking with Constrained Local Models Feature Detection and Tracking with Constrained Local Models David Cristinacce and Tim Cootes Dept. Imaging Science and Biomedical Engineering University of Manchester, Manchester, M3 9PT, U.K. david.cristinacce@manchester.ac.uk

More information

Tracking in image sequences

Tracking in image sequences CENTER FOR MACHINE PERCEPTION CZECH TECHNICAL UNIVERSITY Tracking in image sequences Lecture notes for the course Computer Vision Methods Tomáš Svoboda svobodat@fel.cvut.cz March 23, 2011 Lecture notes

More information

2D vs. 3D Deformable Face Models: Representational Power, Construction, and Real-Time Fitting

2D vs. 3D Deformable Face Models: Representational Power, Construction, and Real-Time Fitting 2D vs. 3D Deformable Face Models: Representational Power, Construction, and Real-Time Fitting Iain Matthews, Jing Xiao, and Simon Baker The Robotics Institute, Carnegie Mellon University Epsom PAL, Epsom

More information

Learning Efficient Linear Predictors for Motion Estimation

Learning Efficient Linear Predictors for Motion Estimation Learning Efficient Linear Predictors for Motion Estimation Jiří Matas 1,2, Karel Zimmermann 1, Tomáš Svoboda 1, Adrian Hilton 2 1 : Center for Machine Perception 2 :Centre for Vision, Speech and Signal

More information

Face analysis : identity vs. expressions

Face analysis : identity vs. expressions Face analysis : identity vs. expressions Hugo Mercier 1,2 Patrice Dalle 1 1 IRIT - Université Paul Sabatier 118 Route de Narbonne, F-31062 Toulouse Cedex 9, France 2 Websourd 3, passage André Maurois -

More information

SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS

SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS Cognitive Robotics Original: David G. Lowe, 004 Summary: Coen van Leeuwen, s1460919 Abstract: This article presents a method to extract

More information

Ensemble registration: Combining groupwise registration and segmentation

Ensemble registration: Combining groupwise registration and segmentation PURWANI, COOTES, TWINING: ENSEMBLE REGISTRATION 1 Ensemble registration: Combining groupwise registration and segmentation Sri Purwani 1,2 sri.purwani@postgrad.manchester.ac.uk Tim Cootes 1 t.cootes@manchester.ac.uk

More information

Announcements. Computer Vision I. Motion Field Equation. Revisiting the small motion assumption. Visual Tracking. CSE252A Lecture 19.

Announcements. Computer Vision I. Motion Field Equation. Revisiting the small motion assumption. Visual Tracking. CSE252A Lecture 19. Visual Tracking CSE252A Lecture 19 Hw 4 assigned Announcements No class on Thursday 12/6 Extra class on Tuesday 12/4 at 6:30PM in WLH Room 2112 Motion Field Equation Measurements I x = I x, T: Components

More information

Using Subspace Constraints to Improve Feature Tracking Presented by Bryan Poling. Based on work by Bryan Poling, Gilad Lerman, and Arthur Szlam

Using Subspace Constraints to Improve Feature Tracking Presented by Bryan Poling. Based on work by Bryan Poling, Gilad Lerman, and Arthur Szlam Presented by Based on work by, Gilad Lerman, and Arthur Szlam What is Tracking? Broad Definition Tracking, or Object tracking, is a general term for following some thing through multiple frames of a video

More information

TRACKING OF FACIAL FEATURE POINTS BY COMBINING SINGULAR TRACKING RESULTS WITH A 3D ACTIVE SHAPE MODEL

TRACKING OF FACIAL FEATURE POINTS BY COMBINING SINGULAR TRACKING RESULTS WITH A 3D ACTIVE SHAPE MODEL TRACKING OF FACIAL FEATURE POINTS BY COMBINING SINGULAR TRACKING RESULTS WITH A 3D ACTIVE SHAPE MODEL Moritz Kaiser, Dejan Arsić, Shamik Sural and Gerhard Rigoll Technische Universität München, Arcisstr.

More information

3D Morphable Model Parameter Estimation

3D Morphable Model Parameter Estimation 3D Morphable Model Parameter Estimation Nathan Faggian 1, Andrew P. Paplinski 1, and Jamie Sherrah 2 1 Monash University, Australia, Faculty of Information Technology, Clayton 2 Clarity Visual Intelligence,

More information

Local Image Registration: An Adaptive Filtering Framework

Local Image Registration: An Adaptive Filtering Framework Local Image Registration: An Adaptive Filtering Framework Gulcin Caner a,a.murattekalp a,b, Gaurav Sharma a and Wendi Heinzelman a a Electrical and Computer Engineering Dept.,University of Rochester, Rochester,

More information

AAM Based Facial Feature Tracking with Kinect

AAM Based Facial Feature Tracking with Kinect BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 15, No 3 Sofia 2015 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.1515/cait-2015-0046 AAM Based Facial Feature Tracking

More information

Automated Pivot Location for the Cartesian-Polar Hybrid Point Distribution Model

Automated Pivot Location for the Cartesian-Polar Hybrid Point Distribution Model Automated Pivot Location for the Cartesian-Polar Hybrid Point Distribution Model Tony Heap and David Hogg School of Computer Studies, University of Leeds, Leeds LS2 9JT, UK email: ajh@scs.leeds.ac.uk Abstract

More information

Chapter 3 Image Registration. Chapter 3 Image Registration

Chapter 3 Image Registration. Chapter 3 Image Registration Chapter 3 Image Registration Distributed Algorithms for Introduction (1) Definition: Image Registration Input: 2 images of the same scene but taken from different perspectives Goal: Identify transformation

More information

Using temporal seeding to constrain the disparity search range in stereo matching

Using temporal seeding to constrain the disparity search range in stereo matching Using temporal seeding to constrain the disparity search range in stereo matching Thulani Ndhlovu Mobile Intelligent Autonomous Systems CSIR South Africa Email: tndhlovu@csir.co.za Fred Nicolls Department

More information

MULTIVIEW REPRESENTATION OF 3D OBJECTS OF A SCENE USING VIDEO SEQUENCES

MULTIVIEW REPRESENTATION OF 3D OBJECTS OF A SCENE USING VIDEO SEQUENCES MULTIVIEW REPRESENTATION OF 3D OBJECTS OF A SCENE USING VIDEO SEQUENCES Mehran Yazdi and André Zaccarin CVSL, Dept. of Electrical and Computer Engineering, Laval University Ste-Foy, Québec GK 7P4, Canada

More information

Face Tracking : An implementation of the Kanade-Lucas-Tomasi Tracking algorithm

Face Tracking : An implementation of the Kanade-Lucas-Tomasi Tracking algorithm Face Tracking : An implementation of the Kanade-Lucas-Tomasi Tracking algorithm Dirk W. Wagener, Ben Herbst Department of Applied Mathematics, University of Stellenbosch, Private Bag X1, Matieland 762,

More information

EigenTracking: Robust Matching and Tracking of Articulated Objects Using a View-Based Representation

EigenTracking: Robust Matching and Tracking of Articulated Objects Using a View-Based Representation EigenTracking: Robust Matching and Tracking of Articulated Objects Using a View-Based Representation Michael J. Black and Allan D. Jepson Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto,

More information

Enforcing Non-Positive Weights for Stable Support Vector Tracking

Enforcing Non-Positive Weights for Stable Support Vector Tracking Enforcing Non-Positive Weights for Stable Support Vector Tracking Simon Lucey Robotics Institute, Carnegie Mellon University slucey@ieee.org Abstract In this paper we demonstrate that the support vector

More information

Feature-Driven Direct Non-Rigid Image Registration

Feature-Driven Direct Non-Rigid Image Registration Feature-Driven Direct Non-Rigid Image Registration V. Gay-Bellile 1,2 A. Bartoli 1 P. Sayd 2 1 LASMEA, CNRS / UBP, Clermont-Ferrand, France 2 LCEI, CEA LIST, Saclay, France Vincent.Gay-Bellile@univ-bpclermont.fr

More information

Outline 7/2/201011/6/

Outline 7/2/201011/6/ Outline Pattern recognition in computer vision Background on the development of SIFT SIFT algorithm and some of its variations Computational considerations (SURF) Potential improvement Summary 01 2 Pattern

More information

CS 4495 Computer Vision Motion and Optic Flow

CS 4495 Computer Vision Motion and Optic Flow CS 4495 Computer Vision Aaron Bobick School of Interactive Computing Administrivia PS4 is out, due Sunday Oct 27 th. All relevant lectures posted Details about Problem Set: You may *not* use built in Harris

More information

Multiple Model Estimation : The EM Algorithm & Applications

Multiple Model Estimation : The EM Algorithm & Applications Multiple Model Estimation : The EM Algorithm & Applications Princeton University COS 429 Lecture Dec. 4, 2008 Harpreet S. Sawhney hsawhney@sarnoff.com Plan IBR / Rendering applications of motion / pose

More information

Motion Estimation (II) Ce Liu Microsoft Research New England

Motion Estimation (II) Ce Liu Microsoft Research New England Motion Estimation (II) Ce Liu celiu@microsoft.com Microsoft Research New England Last time Motion perception Motion representation Parametric motion: Lucas-Kanade T I x du dv = I x I T x I y I x T I y

More information

Augmented Reality VU. Computer Vision 3D Registration (2) Prof. Vincent Lepetit

Augmented Reality VU. Computer Vision 3D Registration (2) Prof. Vincent Lepetit Augmented Reality VU Computer Vision 3D Registration (2) Prof. Vincent Lepetit Feature Point-Based 3D Tracking Feature Points for 3D Tracking Much less ambiguous than edges; Point-to-point reprojection

More information

3D Statistical Shape Model Building using Consistent Parameterization

3D Statistical Shape Model Building using Consistent Parameterization 3D Statistical Shape Model Building using Consistent Parameterization Matthias Kirschner, Stefan Wesarg Graphisch Interaktive Systeme, TU Darmstadt matthias.kirschner@gris.tu-darmstadt.de Abstract. We

More information

Active Wavelet Networks for Face Alignment

Active Wavelet Networks for Face Alignment Active Wavelet Networks for Face Alignment Changbo Hu, Rogerio Feris, Matthew Turk Dept. Computer Science, University of California, Santa Barbara {cbhu,rferis,mturk}@cs.ucsb.edu Abstract The active appearance

More information

16720 Computer Vision: Homework 3 Template Tracking and Layered Motion.

16720 Computer Vision: Homework 3 Template Tracking and Layered Motion. 16720 Computer Vision: Homework 3 Template Tracking and Layered Motion. Instructor: Martial Hebert TAs: Varun Ramakrishna and Tomas Simon Due Date: October 24 th, 2011. 1 Instructions You should submit

More information

EE368 Project Report CD Cover Recognition Using Modified SIFT Algorithm

EE368 Project Report CD Cover Recognition Using Modified SIFT Algorithm EE368 Project Report CD Cover Recognition Using Modified SIFT Algorithm Group 1: Mina A. Makar Stanford University mamakar@stanford.edu Abstract In this report, we investigate the application of the Scale-Invariant

More information

82 REGISTRATION OF RETINOGRAPHIES

82 REGISTRATION OF RETINOGRAPHIES 82 REGISTRATION OF RETINOGRAPHIES 3.3 Our method Our method resembles the human approach to image matching in the sense that we also employ as guidelines features common to both images. It seems natural

More information

Image Segmentation and Registration

Image Segmentation and Registration Image Segmentation and Registration Dr. Christine Tanner (tanner@vision.ee.ethz.ch) Computer Vision Laboratory, ETH Zürich Dr. Verena Kaynig, Machine Learning Laboratory, ETH Zürich Outline Segmentation

More information

Segmentation-Based Motion with Occlusions Using Graph-Cut Optimization

Segmentation-Based Motion with Occlusions Using Graph-Cut Optimization Segmentation-Based Motion with Occlusions Using Graph-Cut Optimization Michael Bleyer, Christoph Rhemann, and Margrit Gelautz Institute for Software Technology and Interactive Systems Vienna University

More information

CS201: Computer Vision Introduction to Tracking

CS201: Computer Vision Introduction to Tracking CS201: Computer Vision Introduction to Tracking John Magee 18 November 2014 Slides courtesy of: Diane H. Theriault Question of the Day How can we represent and use motion in images? 1 What is Motion? Change

More information

Digital Volume Correlation for Materials Characterization

Digital Volume Correlation for Materials Characterization 19 th World Conference on Non-Destructive Testing 2016 Digital Volume Correlation for Materials Characterization Enrico QUINTANA, Phillip REU, Edward JIMENEZ, Kyle THOMPSON, Sharlotte KRAMER Sandia National

More information

Peripheral drift illusion

Peripheral drift illusion Peripheral drift illusion Does it work on other animals? Computer Vision Motion and Optical Flow Many slides adapted from J. Hays, S. Seitz, R. Szeliski, M. Pollefeys, K. Grauman and others Video A video

More information

Feature Tracking and Optical Flow

Feature Tracking and Optical Flow Feature Tracking and Optical Flow Prof. D. Stricker Doz. G. Bleser Many slides adapted from James Hays, Derek Hoeim, Lana Lazebnik, Silvio Saverse, who in turn adapted slides from Steve Seitz, Rick Szeliski,

More information

Multiple Model Estimation : The EM Algorithm & Applications

Multiple Model Estimation : The EM Algorithm & Applications Multiple Model Estimation : The EM Algorithm & Applications Princeton University COS 429 Lecture Nov. 13, 2007 Harpreet S. Sawhney hsawhney@sarnoff.com Recapitulation Problem of motion estimation Parametric

More information

Leow Wee Kheng CS4243 Computer Vision and Pattern Recognition. Motion Tracking. CS4243 Motion Tracking 1

Leow Wee Kheng CS4243 Computer Vision and Pattern Recognition. Motion Tracking. CS4243 Motion Tracking 1 Leow Wee Kheng CS4243 Computer Vision and Pattern Recognition Motion Tracking CS4243 Motion Tracking 1 Changes are everywhere! CS4243 Motion Tracking 2 Illumination change CS4243 Motion Tracking 3 Shape

More information

Motion Tracking and Event Understanding in Video Sequences

Motion Tracking and Event Understanding in Video Sequences Motion Tracking and Event Understanding in Video Sequences Isaac Cohen Elaine Kang, Jinman Kang Institute for Robotics and Intelligent Systems University of Southern California Los Angeles, CA Objectives!

More information

CRF Based Point Cloud Segmentation Jonathan Nation

CRF Based Point Cloud Segmentation Jonathan Nation CRF Based Point Cloud Segmentation Jonathan Nation jsnation@stanford.edu 1. INTRODUCTION The goal of the project is to use the recently proposed fully connected conditional random field (CRF) model to

More information

Face View Synthesis Across Large Angles

Face View Synthesis Across Large Angles Face View Synthesis Across Large Angles Jiang Ni and Henry Schneiderman Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 1513, USA Abstract. Pose variations, especially large out-of-plane

More information

Structured Light II. Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov

Structured Light II. Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov Structured Light II Johannes Köhler Johannes.koehler@dfki.de Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov Introduction Previous lecture: Structured Light I Active Scanning Camera/emitter

More information

Supplementary Material Estimating Correspondences of Deformable Objects In-the-wild

Supplementary Material Estimating Correspondences of Deformable Objects In-the-wild Supplementary Material Estimating Correspondences of Deformable Objects In-the-wild Yuxiang Zhou Epameinondas Antonakos Joan Alabort-i-Medina Anastasios Roussos Stefanos Zafeiriou, Department of Computing,

More information

Image processing and features

Image processing and features Image processing and features Gabriele Bleser gabriele.bleser@dfki.de Thanks to Harald Wuest, Folker Wientapper and Marc Pollefeys Introduction Previous lectures: geometry Pose estimation Epipolar geometry

More information

CS231A Course Notes 4: Stereo Systems and Structure from Motion

CS231A Course Notes 4: Stereo Systems and Structure from Motion CS231A Course Notes 4: Stereo Systems and Structure from Motion Kenji Hata and Silvio Savarese 1 Introduction In the previous notes, we covered how adding additional viewpoints of a scene can greatly enhance

More information

Outdoor Scene Reconstruction from Multiple Image Sequences Captured by a Hand-held Video Camera

Outdoor Scene Reconstruction from Multiple Image Sequences Captured by a Hand-held Video Camera Outdoor Scene Reconstruction from Multiple Image Sequences Captured by a Hand-held Video Camera Tomokazu Sato, Masayuki Kanbara and Naokazu Yokoya Graduate School of Information Science, Nara Institute

More information

Direct Estimation of Non-Rigid Registrations with Image-Based Self-Occlusion Reasoning

Direct Estimation of Non-Rigid Registrations with Image-Based Self-Occlusion Reasoning Direct Estimation of Non-Rigid Registrations with Image-Based Self-Occlusion Reasoning V. Gay-Bellile 1,2 A. Bartoli 1 P. Sayd 2 1 LASMEA, Clermont-Ferrand, France 2 LCEI, CEA LIST, Saclay, France Vincent.Gay-Bellile@univ-bpclermont.fr

More information

arxiv: v1 [cs.cv] 28 Sep 2018

arxiv: v1 [cs.cv] 28 Sep 2018 Camera Pose Estimation from Sequence of Calibrated Images arxiv:1809.11066v1 [cs.cv] 28 Sep 2018 Jacek Komorowski 1 and Przemyslaw Rokita 2 1 Maria Curie-Sklodowska University, Institute of Computer Science,

More information

Multi-View AAM Fitting and Camera Calibration

Multi-View AAM Fitting and Camera Calibration To appear in the IEEE International Conference on Computer Vision Multi-View AAM Fitting and Camera Calibration Seth Koterba, Simon Baker, Iain Matthews, Changbo Hu, Jing Xiao, Jeffrey Cohn, and Takeo

More information

Increasing the Density of Active Appearance Models

Increasing the Density of Active Appearance Models Increasing the Density of Active Appearance Models Krishnan Ramnath ObjectVideo, Inc. Simon Baker Microsoft Research Iain Matthews Weta Digital Ltd. Deva Ramanan UC Irvine Abstract Active Appearance Models

More information

Kanade Lucas Tomasi Tracking (KLT tracker)

Kanade Lucas Tomasi Tracking (KLT tracker) Kanade Lucas Tomasi Tracking (KLT tracker) Tomáš Svoboda, svoboda@cmp.felk.cvut.cz Czech Technical University in Prague, Center for Machine Perception http://cmp.felk.cvut.cz Last update: November 26,

More information

EE795: Computer Vision and Intelligent Systems

EE795: Computer Vision and Intelligent Systems EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 FDH 204 Lecture 11 140311 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Motion Analysis Motivation Differential Motion Optical

More information

Optical Flow-Based Motion Estimation. Thanks to Steve Seitz, Simon Baker, Takeo Kanade, and anyone else who helped develop these slides.

Optical Flow-Based Motion Estimation. Thanks to Steve Seitz, Simon Baker, Takeo Kanade, and anyone else who helped develop these slides. Optical Flow-Based Motion Estimation Thanks to Steve Seitz, Simon Baker, Takeo Kanade, and anyone else who helped develop these slides. 1 Why estimate motion? We live in a 4-D world Wide applications Object

More information