Sequential Learning of Layered Models from Video


Michalis K. Titsias and Christopher K. I. Williams
School of Informatics, University of Edinburgh, Edinburgh EH1 2QL, UK

Abstract. A popular framework for the interpretation of image sequences is the layers or sprite model, see e.g. [1], [2]. Jojic and Frey [3] provide a generative probabilistic model framework for this task, but their algorithm is slow as it needs to search over discretized transformations (e.g. translations, or affines) for each layer simultaneously. Exact computation with this model scales exponentially with the number of objects, so Jojic and Frey used an approximate variational algorithm to speed up inference. Williams and Titsias [4] proposed an alternative sequential algorithm for the extraction of objects one at a time using a robust statistical method, thus avoiding the combinatorial explosion. In this chapter we elaborate on our sequential algorithm in the following ways: Firstly, we describe a method to speed up the computation of the transformations based on approximate tracking of the multiple objects in the scene. Secondly, for sequences where the motion of an object is large so that different views (or aspects) of the object are visible at different times in the sequence, we learn appearance models of the different aspects. We demonstrate our method on four video sequences, including a sequence where we learn articulated parts of a human body.

1 Introduction

A powerful framework for modelling video sequences is the layer-based approach, which models an image as a composite of 2D layers, each one representing an object in terms of its appearance and region of support or mask, see e.g. Wang and Adelson [1] and Irani et al. [2]. A layered representation explicitly accounts for occlusion between the objects, permits motion segmentation in a general multiple-frame setting rather than in pairs of frames, and provides appearance models for the underlying objects.
These properties can allow layered models to be useful for a number of different purposes such as video compression, video summarization, background substitution (e.g. alpha matting applications), object recognition (e.g. learning an object recognition system from video clips without needing human annotation) and others. Jojic and Frey [3] provided a principled generative probabilistic framework for learning a layered model allowing transparency between the layers. Williams and Titsias [4] developed a similar model where layers strictly combine by occlusion. Learning these models using an exact EM algorithm faces the problem that as the number of objects increases, there is a combinatorial explosion of the number of configurations that need to be considered. If there are L possible objects, and there are J transformations that any one object can undergo, then we will need to consider O(J^L) combinations to explain any image. Jojic and Frey [3] tackled this problem by using a variational inference scheme searching over all transformations simultaneously, while Williams and Titsias [4] developed a sequential approach using robust statistics which searches over the transformations of one object at a time. Neither of these methods requires a video sequence; both can work on unordered sets of images. In that case, training can be very slow, as it is necessary to search over all possible transformations of at least a single object in every image. However, for video sequences we can considerably speed up the training by first localizing each object based on a recursive processing of the consecutive frames. Recursive localization can approximate the underlying sequence of transformations of an object in the frames, so learning can be carried out with a very focused search over the neighbourhood of these transformations, or with no search at all when the approximation is accurate. We refer to the recursive localization procedure as object tracking. In this chapter we describe two developments of the above model. Firstly, assuming video data and based on tracking, we speed up the method of Williams and Titsias [4]. First, the moving background is tracked and then its appearance is learned, while moving foreground objects are found at later stages. The tracking algorithm itself recursively updates an appearance model of the tracked object and approximates the transformations by matching this model to the frames through the sequence. Secondly, in order to account for variation in object appearance due to changes in the 3D pose of the object and self occlusion, we model different visual aspects or views of each foreground object.
This is achieved by introducing a set of mask and appearance pairs, each one associated with a different view of the object. To learn different viewpoint object models we use approximate tracking, so that we first estimate the 2D or planar transformations of each foreground object in all frames and then, given these transformations, we stabilize the video and learn the viewpoint models for that object using a mixture modelling approach. The structure of the remainder of the chapter is as follows: In section 2 we describe the layered generative model, which assumes multiple views for each foreground object. Section 3 describes learning the model from a video sequence, while section 4 discusses related work. Section 5 gives experimental results and we conclude with a discussion in section 6.

2 Generative layered model

For simplicity we will present the generative model assuming that there are two layers, i.e. a foreground object and a static background. Later in this section we will discuss the case of an arbitrary number of foreground layers and a moving background. Let b denote the appearance image of the background arranged as a vector. Assuming that the background is static, b will have the same size as the data images (although note that for moving backgrounds, b will need to be larger than the image size). Each entry b_i stores the ith pixel value, which can either be a grayscale intensity value or a colour value. In our implementation we allow coloured images where b_i is a three-dimensional vector in RGB space. However, for notational convenience below we assume that b_i is a scalar representing a grayscale value. In contrast to the background, the foreground object occupies some region of the image and thus to describe this layer we need both an appearance f and a mask π. The foreground is allowed to move, so there is an underlying transformation with index j that e.g. corresponds to translational or affine motion, and a corresponding transformation matrix T_j, so that T_j f and T_j π are the transformed foreground and mask, respectively. A pixel in an observed image is either foreground or background. This is expressed by a vector of binary latent variables s, one for each pixel, drawn from the distribution

P(s|j) = ∏_{i=1}^P (T_j π)_i^{s_i} (1 − T_j π)_i^{1−s_i},   (1)

where 1 denotes the vector of ones. Each variable s_i is drawn independently, so that for pixel i, if (T_j π)_i ≈ 0 the pixel will be ascribed to the background with high probability, and if (T_j π)_i ≈ 1 it will be ascribed to the foreground with high probability. Note that s is the binary mask of the foreground object in an example image, while π is the prior untransformed mask that captures roughly the shape of the object stored in f. Selecting a transformation index j, using a prior P_j over J possible values with ∑_{j=1}^J P_j = 1, and a binary mask s, an image x is drawn from the Gaussian

p(x|j, s) = ∏_{i=1}^P N(x_i; (T_j f)_i, σ_f^2)^{s_i} N(x_i; b_i, σ_b^2)^{1−s_i},   (2)

where each pixel is drawn independently from the above conditional density. To express the likelihood of an observed image p(x) we marginalise out the latent variables, which are the transformation j and the binary mask s. Particularly, we first sum out s using (1) and (2) and obtain

p(x|j) = ∏_{i=1}^P [(T_j π)_i N(x_i; (T_j f)_i, σ_f^2) + (1 − T_j π)_i N(x_i; b_i, σ_b^2)].   (3)

Using now the prior P_j over the transformation j, the probability of an observed image x is p(x) = ∑_{j=1}^J P_j p(x|j). Given a set of images {x^1, ..., x^N} we can adapt the parameters θ = {b, f, π, σ_f^2, σ_b^2} to maximize the log likelihood using the EM algorithm. The above model can be extended so as to have a moving background and L foreground objects [4].
For example, for two foreground layers with parameters (f_1, π_1, σ_1^2) and (f_2, π_2, σ_2^2) and also a moving background, the analogue of equation (3) is

p(x|j_1, j_2, j_b) = ∏_{i=1}^P [(T_{j_1} π_1)_i N(x_i; (T_{j_1} f_1)_i, σ_1^2) + (1 − T_{j_1} π_1)_i [(T_{j_2} π_2)_i N(x_i; (T_{j_2} f_2)_i, σ_2^2) + (1 − T_{j_2} π_2)_i N(x_i; (T_{j_b} b)_i, σ_b^2)]],   (4)

where j_1, j_2 and j_b denote the transformation of the first foreground object, the second foreground object and the background, respectively. Furthermore, we can allow for an arbitrary occlusion ordering between the foreground objects, so that it can vary in different images, by introducing an additional hidden variable that takes as values all L! possible permutations of the foreground layers.
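To make the generative process concrete, the following is a minimal sketch of sampling from the two-layer model of equations (1)-(2). It is a toy illustration, not the implementation used in the chapter: cyclic 2D shifts stand in for the discretized transformations T_j, and all sizes and parameter values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_image(b, f, pi, shifts, p_prior, sigma_f, sigma_b):
    """Sample one image from the two-layer model of equations (1)-(2).

    b, f are H x W appearance images, pi is the foreground mask in
    [0, 1]; the transformation set is a list of cyclic 2D shifts.
    """
    j = rng.choice(len(shifts), p=p_prior)       # transformation prior P_j
    dy, dx = shifts[j]
    Tf = np.roll(f, (dy, dx), axis=(0, 1))       # T_j f
    Tpi = np.roll(pi, (dy, dx), axis=(0, 1))     # T_j pi
    s = rng.random(b.shape) < Tpi                # binary mask s ~ P(s|j), eq (1)
    mean = np.where(s, Tf, b)                    # foreground or background mean
    noise = np.where(s, sigma_f, sigma_b) * rng.standard_normal(b.shape)
    return mean + noise, j, s                    # eq (2), pixelwise Gaussian

# toy example: 8x8 dark background, bright 3x3 foreground square
H = W = 8
b = np.zeros((H, W))
f = np.zeros((H, W)); f[:3, :3] = 1.0
pi = np.zeros((H, W)); pi[:3, :3] = 1.0
shifts = [(0, 0), (0, 2), (2, 0)]
x, j, s = sample_image(b, f, pi, shifts, [0.5, 0.25, 0.25], 0.05, 0.05)
```

With a hard 0/1 mask as here, s is deterministic given j; with intermediate π values the pixel ownership itself becomes stochastic, which is what the mask prior in equation (1) expresses.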

2.1 Incorporating multiple viewpoints

The layered model presented above assumes that the foreground object varies due to a 2D (or planar) transformation. However, in many video sequences this assumption will not be true, e.g. a foreground object can undergo 3D rotation so that at different times we may see different views (or aspects) of the object. For example, Figure 3 shows three frames of a sequence capturing a man walking; clearly the man's pose changes substantially over time. Next we generalize the layered model so that the appearance of a foreground object can be chosen from a set of possible appearances associated with different viewpoints. Assume again that there are two layers: one static background and one moving foreground object. We introduce a discrete latent variable v that can take V possible values indexed by integers from 1 to V. For each value v we introduce a separate pair of appearance f^v and mask π^v defined as in section 2. Each pair (f^v, π^v) models a particular view of the object. To generate an image x we first select a transformation j and a view v using prior probabilities P_j and P_v, respectively. Then we select a binary mask s from the distribution P(s|j, v) = ∏_{i=1}^P (T_j π^v)_i^{s_i} (1 − T_j π^v)_i^{1−s_i}, and draw an image x from the Gaussian p(x|j, v, s) = ∏_{i=1}^P N(x_i; (T_j f^v)_i, σ_f^2)^{s_i} N(x_i; b_i, σ_b^2)^{1−s_i}. Note the similarity of the above expressions with equations (1) and (2). The only difference is that now the appearance f and mask π are indexed by v to reflect the fact that we have also chosen a view for the foreground object. To express the probability distribution according to which an image is generated given the transformation, we sum out the binary mask and the view variable and obtain

p(x|j) = ∑_{v=1}^V P_v p(x|j, v),   (5)

where p(x|j, v) is given as in (3) with f and π indexed by v. Notice how equation (5) relates to equation (3).
Clearly now p(x|j) is a mixture model whose components are of the form given in (3), so that each mixture component is associated with a visual aspect. For example, if we choose to have a single view the latter expression reduces to the former one. It is straightforward to extend the above model to the case of L foreground layers with varying viewpoints. In this case we need a separate view variable v_ℓ for each foreground object ℓ and a set of appearance and mask pairs (f_ℓ^{v_ℓ}, π_ℓ^{v_ℓ}), v_ℓ = 1, ..., V_ℓ. For example, when we have two foreground objects and a moving background, the conditional p(x|j_1, j_2, j_b, v_1, v_2) is given exactly as in (4) by introducing suitable indexes to the foreground appearances and masks that indicate the choices made for the viewpoint variables.

3 Learning

Given the set of possibly unordered images {x^1, ..., x^N}, a principled way to learn the parameters θ = ({f_1^{v_1}, π_1^{v_1}, σ_{1,v_1}^2}_{v_1=1}^{V_1}, ..., {f_L^{v_L}, π_L^{v_L}, σ_{L,v_L}^2}_{v_L=1}^{V_L}, b, σ_b^2) is by maximizing the log likelihood L(θ) = ∑_{n=1}^N log p(x^n|θ) using the EM algorithm. However, an exact EM algorithm is intractable. If the foreground objects and the background can undergo J transformations, and assuming V views for each foreground object, the time needed to carry out the E-step for a training image is O(J^{L+1} V^L L!). Clearly, this computation is infeasible as it grows exponentially with L. A variational EM algorithm can be used to simultaneously infer the parameters, the transformations and the viewpoint variables in all images in time linear in L. However, such an algorithm can face two problems: (i) it will be very slow, as the number of transformations J can be very large (e.g. for translations and rotations it can be of the order of hundreds of thousands), and (ii) simultaneous search over all the unknown quantities can be prone to severe local maxima, e.g. there is a clear danger of confusing the aspects of one object with the corresponding aspects of a different object. Our learning algorithm works for video data and proceeds in stages so that, roughly speaking, each stage deals with a different set of unknown variables. Table 1 illustrates all the different steps of the algorithm. At Stage 1, we ignore the search needed to compute the occlusion ordering of the foreground objects and we focus on approximating the transformations {j_1^n, j_2^n, ..., j_L^n, j_b^n}_{n=1}^N and inferring the viewpoint variables {v_1^n, ..., v_L^n}_{n=1}^N for all training images. In this stage we also obtain good initial estimates for the parameters of all objects. At Stage 2 we compute the occlusion orderings and we jointly refine all the parameters by optimizing the complete likelihood of the model, where all the transformations, view variables and occlusion orderings have been filled in using their approximate values. Stage 1 is the intensive part of the learning process and is divided into sub-stages. Particularly, the object parameters and their associated transformations are estimated in a greedy fashion so as to deal with one object at a time. We first track the background in order to approximate the transformations (j_b^1, ..., j_b^N) and then, given these transformations, we learn the background appearance.
Then for each foreground object sequentially we track it and learn all of its different views.

Stage 1:
1. Track the background to compute the transformations (j_b^1, ..., j_b^N). Then learn the background parameters (b, σ_b^2).
2. Suppress all the pixels that have been classified as part of the background in each training image, so that w_1^n indicates the remaining non-background pixels in the image x^n.
3. for ℓ = 1 to L
   (a) Using the w_ℓ^n vectors, track the ℓth object to compute the transformations (j_ℓ^1, ..., j_ℓ^N). Then learn the parameters {f_ℓ^{v_ℓ}, π_ℓ^{v_ℓ}, σ_{ℓ,v_ℓ}^2}_{v_ℓ=1}^{V_ℓ}.
   (b) If ℓ = L go to Stage 2. Otherwise, construct the vectors w_{ℓ+1}^n from w_ℓ^n so that all the pixels classified as part of the ℓth object in all images are additionally suppressed.

Stage 2: Using the inferred values of the parameters θ from Stage 1, the transformations, and the view variables, compute the occlusion ordering of the foreground layers in each image. Then, using these occlusion orderings, refine the parameters of the objects.

Table 1. The steps of the learning algorithm.

The next three sections explain in detail all the steps of the learning algorithm. Particularly, section 3.1 discusses learning the background (step 1 in Stage 1), section 3.2 describes learning the foreground objects (steps 2 and 3 in Stage 1) and section 3.3 discusses computation of the occlusion ordering of the foreground objects and refinement of the parameters (Stage 2). A preliminary version of this algorithm was presented in [5].

3.1 Learning the background

Assume that we have approximated the transformations {j_b^1, ..., j_b^N} of the background in each frame of the video. We will discuss shortly how to obtain such an approximation using tracking. Using these transformations we wish to learn the background appearance. At this stage we consider images that contain a background and many foreground objects; however, we concentrate on learning only the background. This goal can be achieved by introducing a likelihood model for the images that only accounts for the background, while the presence of the foreground objects is explained by an outlier process. For a background pixel, the foreground objects are interposed between the camera and the background, thus perturbing the pixel value. This can be modelled with a mixture distribution as p_b(x_i; b_i) = α_b N(x_i; b_i, σ_b^2) + (1 − α_b) U(x_i), where α_b is the fraction of times a background pixel is not occluded, and the robustifying component U(x_i) is a uniform distribution common to all image pixels. When the background pixel is occluded it should be explained by the uniform component. Such robust models have been used for image matching tasks by a number of authors, notably Black and colleagues [6]. The background can be learned by maximizing the log likelihood L_b = ∑_{n=1}^N log p(x^n|j_b^n), where

p(x|j_b) = ∏_{i=1}^P [α_b N(x_i; (T_{j_b} b)_i, σ_b^2) + (1 − α_b) U(x_i)].   (6)

The maximization of the likelihood over (b, σ_b^2) can be achieved by using the EM algorithm to deal with the pixel outlier process. For example, the update equation for the background b is

b ← [∑_{n=1}^N T_{j_b^n}^T (r^n(j_b^n) ∘ x^n)] ./ [∑_{n=1}^N T_{j_b^n}^T r^n(j_b^n)],   (7)

where y ∘ z and y ./ z denote the element-wise product and element-wise division between two vectors y and z, respectively.
In (7) the vector r(j_b) stores the values

r_i(j_b) = α_b N(x_i; (T_{j_b} b)_i, σ_b^2) / [α_b N(x_i; (T_{j_b} b)_i, σ_b^2) + (1 − α_b) U(x_i)]   (8)

for each image pixel i, which is the probability that the ith image pixel is part of the background (and not some outlier due to occlusion) given j_b. The update for the background appearance b is very intuitive. For each image x^n, the pixels which are ascribed to non-occluded background (i.e. r_i^n(j_b^n) ≈ 1) are transformed by T_{j_b^n}^T, which reverses the effect of the transformation by mapping the image x^n into the larger and stabilized background image b, so that x^n is located within b in the position specified by j_b^n. Thus, the non-occluded pixels found in each training image are located properly within the big panorama image and averaged to produce b.
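A minimal sketch of this robust EM update is given below. It assumes cyclic image shifts as the transformations (so that applying T^T amounts to the opposite shift) and a constant density u for the uniform outlier component; the function name and toy values are our own, not from the chapter.

```python
import numpy as np

def update_background(frames, shifts, b, sigma_b, alpha_b, u=1.0):
    """One EM update of the background, equations (7)-(8).

    frames: list of H x W images; shifts: the per-frame transformations
    j_b^n as cyclic 2D shifts; u: density of the uniform outlier component.
    """
    num = np.zeros_like(b)
    den = np.zeros_like(b)
    for x, (dy, dx) in zip(frames, shifts):
        Tb = np.roll(b, (dy, dx), axis=(0, 1))              # T_{j_b} b
        gauss = (np.exp(-0.5 * (x - Tb) ** 2 / sigma_b ** 2)
                 / np.sqrt(2 * np.pi * sigma_b ** 2))
        # eq (8): responsibility that each pixel is non-occluded background
        r = alpha_b * gauss / (alpha_b * gauss + (1 - alpha_b) * u)
        # T^T maps image coordinates back into the background frame
        num += np.roll(r * x, (-dy, -dx), axis=(0, 1))
        den += np.roll(r, (-dy, -dx), axis=(0, 1))
    return num / np.maximum(den, 1e-12)                     # eq (7)

# occlusion-free toy check: a constant background is recovered exactly
frames = [np.full((4, 4), 0.7), np.full((4, 4), 0.7)]
b_new = update_background(frames, [(0, 0), (1, 0)],
                          b=np.zeros((4, 4)), sigma_b=0.1, alpha_b=0.9)
```

In a real sequence the responsibilities r downweight frames where a foreground object covers a given background pixel, which is exactly the blurring-out effect described in the text.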

Tracking the background. We now discuss how we can quickly approximate the transformations {j_b^1, ..., j_b^N} using tracking. To introduce the idea of our tracking algorithm, assume that we know the appearance of the background b as well as the transformation j_b^1 that associates b with the first frame. Since motion between successive frames is expected to be relatively small, we can determine the transformation j_b^2 for the second frame by searching over a small discrete set of neighbouring transformations centered at j_b^1 and inferring the most probable one (i.e. the one giving the highest likelihood according to equation (6), assuming a uniform prior). This procedure can be applied recursively to determine the sequence of transformations in the entire video. However, the background b is not known in advance; we can still apply roughly the same tracking algorithm by suitably initializing and updating the background b as we process the frames. More specifically, we initialize b so that its centered part is the first frame x^1 in the sequence. The remaining values of b are set to zero and are considered not yet initialized, which is indicated by a mask m of the same size as b that takes the value 1 for initialized pixels and 0 otherwise. The transformation of the first frame j_b^1 is the identity, which means that the first frame is untransformed. The transformation of the second frame, and in general of any frame n + 1, n ≥ 1, is determined by evaluating the posterior probability

R(j_b) ∝ exp{ [∑_{i=1}^P (T_{j_b} m^n)_i log p_b(x_i^{n+1}; (T_{j_b} b^n)_i)] / [∑_{i=1}^P (T_{j_b} m^n)_i] },   (9)

over the set of possible j_b values around the neighbourhood of j_b^n. The approximate transformation j_b^{n+1} for the frame is chosen to be j_b^{n+1} = j_b^*, where j_b^* maximizes the above posterior probability.
Note that (9) is similar to the likelihood (6), with the only difference being that pixels of the background that are not yet initialized are removed from consideration and the score is normalized (by ∑_{i=1}^P (T_{j_b} m^n)_i) so that the number of not-yet-initialized pixels (which can vary with j_b) does not affect the total score. Once we know j_b^{n+1}, we use all the frames up to frame x^{n+1} (i.e. {x^1, ..., x^{n+1}}) to update b according to equation (7), where the vectors r^t(j_b^t), with t = 1, ..., n + 1, have been updated according to equation (8) for the old value b^n of the background. The mask m is also updated so that it always indicates the pixels of b that have been explored so far. The effect of these updates is that as we process the frames the background model b is adjusted so that any occluding foreground object is blurred out, revealing the background behind. Having tracked the background, we can then learn its full structure as described earlier in this section.

3.2 Learning the foreground objects

Imagine that the background b and its most probable transformations in all training images have been approximated. What we wish to do next is to learn the foreground objects, one at a time. Particularly, we assume again that we have approximated the transformations {j_ℓ^1, ..., j_ℓ^N} of the ℓth foreground object in all frames. This approximation can be obtained quickly using a tracking algorithm (see later in this section) that is repeatedly applied to the video sequence and each time outputs the transformations associated with a different object. Learning of the ℓth foreground object will be based on a likelihood model for the images that only accounts for that foreground object and the background, while the presence of the other foreground objects is explained by an outlier process. Particularly, the other foreground objects can occlude both the ℓth foreground object and the background. Thus, we robustify the foreground and background pixel densities so that the Gaussians in equation (2) are replaced by p_f(x_i; f_i) = α_f N(x_i; f_i, σ_f^2) + (1 − α_f) U(x_i) and p_b(x_i; b_i) = α_b N(x_i; b_i, σ_b^2) + (1 − α_b) U(x_i) respectively, where U(x_i) is a uniform distribution over the range of all possible pixel values, and α_f and α_b express prior probabilities that a foreground (resp. background) pixel is not occluded. Any time a foreground or background pixel is occluded, this can be explained by the uniform component U(x_i). Based on this robustification, we can learn the parameters associated with all the different aspects of the object by maximizing the log likelihood

L_ℓ = ∑_{n=1}^N log ∑_{v_ℓ=1}^{V_ℓ} P_{v_ℓ} ∏_{i=1}^P [(T_{j_ℓ^n} π_ℓ^{v_ℓ})_i p_f(x_i^n; (T_{j_ℓ^n} f_ℓ^{v_ℓ})_i) + (1 − T_{j_ℓ^n} π_ℓ^{v_ℓ})_i p_b(x_i^n; (T_{j_b^n} b)_i)],   (10)

where p_f(x_i^n; (T_{j_ℓ^n} f_ℓ^{v_ℓ})_i) and p_b(x_i^n; (T_{j_b^n} b)_i) have been robustified as explained above. This maximization is carried out by EM, where in the E-step the quantities Q^n(v_ℓ), s^n(v_ℓ) and r^n(v_ℓ) are computed as follows. Q^n(v_ℓ) denotes the probability p(v_ℓ|x^n, j_b^n, j_ℓ^n) and is obtained by

Q^n(v_ℓ) = P_{v_ℓ} p(x^n|j_b^n, j_ℓ^n, v_ℓ) / ∑_{v_ℓ=1}^{V_ℓ} P_{v_ℓ} p(x^n|j_b^n, j_ℓ^n, v_ℓ),   (11)

while the vectors s^n(v_ℓ) and r^n(v_ℓ) store the values

s_i^n(v_ℓ) = (T_{j_ℓ^n} π_ℓ^{v_ℓ})_i p_f(x_i^n; (T_{j_ℓ^n} f_ℓ^{v_ℓ})_i) / [(T_{j_ℓ^n} π_ℓ^{v_ℓ})_i p_f(x_i^n; (T_{j_ℓ^n} f_ℓ^{v_ℓ})_i) + (1 − T_{j_ℓ^n} π_ℓ^{v_ℓ})_i p_b(x_i^n; (T_{j_b^n} b)_i)],   (12)

and

r_i^n(v_ℓ) = α_f N(x_i^n; (T_{j_ℓ^n} f_ℓ^{v_ℓ})_i, σ_{ℓ,v_ℓ}^2) / [α_f N(x_i^n; (T_{j_ℓ^n} f_ℓ^{v_ℓ})_i, σ_{ℓ,v_ℓ}^2) + (1 − α_f) U(x_i^n)],   (13)

for each image pixel i. In the M-step we update the parameters {f_ℓ^{v_ℓ}, π_ℓ^{v_ℓ}, σ_{ℓ,v_ℓ}^2}_{v_ℓ=1}^{V_ℓ}. For example, the updates of π_ℓ^{v_ℓ} and f_ℓ^{v_ℓ} are

π_ℓ^{v_ℓ} ← [∑_{n=1}^N Q^n(v_ℓ) T_{j_ℓ^n}^T s^n(v_ℓ)] ./ [∑_{n=1}^N Q^n(v_ℓ) T_{j_ℓ^n}^T 1],   (14)

f_ℓ^{v_ℓ} ← [∑_{n=1}^N Q^n(v_ℓ) T_{j_ℓ^n}^T (s^n(v_ℓ) ∘ r^n(v_ℓ) ∘ x^n)] ./ [∑_{n=1}^N Q^n(v_ℓ) T_{j_ℓ^n}^T (s^n(v_ℓ) ∘ r^n(v_ℓ))].   (15)

The above updates are very intuitive. Consider, for example, the appearance f_ℓ^{v_ℓ}.
For pixels which are ascribed to the ℓth foreground and are not occluded (i.e. (s^n(v_ℓ) ∘ r^n(v_ℓ))_i ≈ 1), the values in x^n are transformed by T_{j_ℓ^n}^T, which reverses the effect of the transformation. This allows the foreground pixels found in each training image to be mapped into a stabilized frame and then averaged (weighted by the viewpoint posterior probabilities Q^n(v_ℓ)) to produce f_ℓ^{v_ℓ}.

Tracking the foreground objects. The appearance of each foreground object can vary significantly through the video due to large pose changes, so our algorithm should be able to cope with such variation. Below we describe a tracking algorithm that at each step matches a mask π and appearance f to the current frame. Large viewpoint variation is handled by updating π and f on-line so that at each time step they fit the shape and appearance of the object in the current frame. We first discuss how to track the first foreground object, so we assume that ℓ = 1. The pixels which are explained by the background in each image x^n are flagged by the background responsibilities r^n(j_b^n) computed according to equation (8). Clearly, the mask w^n = 1 − r^n(j_b^n) roughly indicates all the pixels of frame x^n that belong to the foreground objects. By focusing only on these pixels, we wish to start tracking one of the foreground objects through the entire video sequence, ignoring for the moment the rest of the foreground objects. Our algorithm tracks the first object by matching to the current frame, and then updating in an on-line fashion, a mask π_1 and appearance f_1 of that object. The mask and the appearance are initialized so that π_1^1 = 0.5 ∘ w_1^1 and f_1^1 = x^1, where 0.5 denotes the vector with 0.5 values¹. Due to this initialization we know that the first frame is untransformed, i.e. j_1^1 is the identity transformation. To determine the transformation of the second frame, and in general the transformation j_1^{n+1}, with n ≥ 1, of the frame x^{n+1}, we evaluate the posterior probability

R(j_1) ∝ exp{ ∑_{i=1}^P (w_1^{n+1})_i log [(T_{j_1} π_1^n)_i p_f(x_i^{n+1}; (T_{j_1} f_1^n)_i) + (1 − T_{j_1} π_1^n)_i (1 − α_b) U(x_i^{n+1})] },   (16)

where p_f(x_i^{n+1}; (T_{j_1} f_1^n)_i) is robustified as explained earlier, j_1 takes values around the neighbourhood of j_1^n, and w_1^{n+1} = 1 − r^{n+1}(j_b^{n+1}). R(j_1) measures the goodness of the match at those pixels of frame x^{n+1} which are not explained by the background. Note that as the objects will, in general, be of different sizes, the probability R(j_1) over the transformation variable will have greater mass on transformations relating to the largest object. The transformation j_1^{n+1} is set equal to j_1^*, where j_1^* maximizes the above posterior probability.
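The matching step of equation (16) can be sketched as follows: for each candidate transformation in a small neighbourhood of the previous one, score the frame pixels not explained by the background, and keep the best. Cyclic 2D translations stand in for the transformation set here, and the helper name and default parameter values are illustrative assumptions, not the chapter's implementation.

```python
import numpy as np

def match_object(x, w, pi, f, prev_shift, alpha_f=0.9, sigma=0.1, u=1.0):
    """Pick the shift maximizing the match score R(j_1) of equation (16).

    x: current frame; w: mask of pixels not explained by the background;
    pi, f: current object mask and appearance; the search covers a 3x3
    neighbourhood of the previous frame's shift.
    """
    def robust_fg(obs, mean):
        g = (np.exp(-0.5 * (obs - mean) ** 2 / sigma ** 2)
             / np.sqrt(2 * np.pi * sigma ** 2))
        return alpha_f * g + (1 - alpha_f) * u      # robustified p_f

    best, best_score = prev_shift, -np.inf
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            j = (prev_shift[0] + dy, prev_shift[1] + dx)
            Tpi = np.roll(pi, j, axis=(0, 1))       # T_j pi
            Tf = np.roll(f, j, axis=(0, 1))         # T_j f
            # per-pixel term inside the log of eq (16)
            lik = Tpi * robust_fg(x, Tf) + (1 - Tpi) * (1 - alpha_f) * u
            score = (w * np.log(np.maximum(lik, 1e-300))).sum()
            if score > best_score:
                best, best_score = j, score
    return best
```

Applying this frame by frame, with the on-line mask and appearance updates described next, yields the approximate transformation sequence used later for learning.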
Once we determine j_1^{n+1}, we update both the mask π_1 and appearance f_1. The mask is updated according to

π_1^{n+1} = [β π_1^n + T_{j_1^{n+1}}^T s^{n+1}] ./ [β 1 + T_{j_1^{n+1}}^T 1],   (17)

where β is a positive number. The vector s^{n+1} expresses a soft segmentation of the object in the frame x^{n+1} and is computed similarly to equation (12). The update (17) defines the new mask as a weighted average of the stabilized segmentation in the current frame (i.e. T_{j_1^{n+1}}^T s^{n+1}) and the current value of the mask; β determines the relative weight between these two terms. In all our experiments we have set β = 0.5. Similarly, the update for the foreground appearance f_1 is given by

f_1^{n+1} = [β f_1^n + T_{j_1^{n+1}}^T (s^{n+1} ∘ r^{n+1} ∘ x^{n+1})] ./ [β 1 + T_{j_1^{n+1}}^T (s^{n+1} ∘ r^{n+1})].   (18)

The vector r^{n+1} is defined similarly to equation (13). Again the above update is very intuitive. For pixels which are ascribed to the foreground (i.e. (s^{n+1} ∘ r^{n+1})_i ≈ 1), the values in x^{n+1} are transformed by T_{j_1^{n+1}}^T into the stabilized frame, which allows the foreground pixels found in the current frame to be averaged with the old value f_1^n in order to produce f_1^{n+1}. Note that the updates given by (17) and (18) are on-line versions of the respective batch updates of the EM algorithm for maximizing the log likelihood (10), assuming that V_1 = 1. The above tracking algorithm is a modification of the method presented in [5], with the difference that the batch updates of (f_1, π_1) used there have been replaced by on-line counterparts that allow tracking the object when its appearance changes significantly from frame to frame. Once the first object has been tracked, we learn the different viewpoint models for that object by maximizing (10). When these models have been learned, we can go through the images to find which pixels are explained by this object. Then we can remove these pixels from consideration by properly updating each w_ℓ^n vector, which allows tracking a different object at the next stage. Particularly, we compute the vector ρ_ℓ^n = ∑_{v_ℓ=1}^{V_ℓ} Q^n(v_ℓ) (T_{j_ℓ^n} π_ℓ^{v_ℓ}) ∘ r^n(v_ℓ). ρ_ℓ^n will give values close to 1 only for the non-occluded object pixels of image x^n, and these are the pixels that we wish to remove from consideration. We can now run the same tracking algorithm again by computing w_{ℓ+1}^n = (1 − ρ_ℓ^n) ∘ w_ℓ^n (ℓ ≥ 1), which allows tracking a different object at the (ℓ+1)th iteration. Note also that the new mask π_{ℓ+1} is initialized to 0.5 ∘ w_{ℓ+1}^1, while the appearance f_{ℓ+1} is always initialized to the first frame x^1.

¹ The value of 0.5 is chosen to express our uncertainty about whether these pixels will ultimately be in the foreground mask or not.

3.3 Specification of the occlusion ordering and refinement of the object models

Once we run the greedy algorithm (Stage 1 in Table 1), we obtain an estimate of all model parameters, an approximation of the object transformations in each training image, as well as the probabilities Q^n(v_ℓ) which express our posterior belief that image x^n was generated by view v_ℓ of model ℓ. Using now these quantities, we wish to compute the occlusion ordering of the foreground objects in each training image.
This is necessary since even when the occlusion ordering remains fixed across all video frames, the algorithm might not extract the objects in accordance with this ordering, i.e. discovering first the nearest object to the camera, then the second nearest object, etc. The order in which the objects are found is determined by the tracking algorithm, and typically the largest objects, which occupy more pixels than the others, are more likely to be tracked first. A way to infer the occlusion ordering of the foreground objects or layers in an image x^n is to consider all possible permutations of these layers and choose the permutation that gives the maximum likelihood. The simplest case is to have two foreground objects with parameters {π_1^{v_1}, f_1^{v_1}, σ_{1,v_1}^2}_{v_1=1}^{V_1} and {π_2^{v_2}, f_2^{v_2}, σ_{2,v_2}^2}_{v_2=1}^{V_2}, respectively. From the posterior probabilities Q^n(v_1) and Q^n(v_2) corresponding to image x^n, we choose the most probable views v_1^n and v_2^n. Conditioned on these estimated views as well as the transformations, the log likelihood values of the two possible orderings are

L_k^n = ∑_{i=1}^P log [(T_{j_k^n} π_k^{v_k^n})_i p_f(x_i^n; (T_{j_k^n} f_k^{v_k^n})_i) + (1 − T_{j_k^n} π_k^{v_k^n})_i [(T_{j_ℓ^n} π_ℓ^{v_ℓ^n})_i p_f(x_i^n; (T_{j_ℓ^n} f_ℓ^{v_ℓ^n})_i) + (1 − T_{j_ℓ^n} π_ℓ^{v_ℓ^n})_i p_b(x_i^n; (T_{j_b^n} b)_i)]],   (19)

where k = 1, ℓ = 2 or k = 2, ℓ = 1. The selected occlusion ordering for the image x^n is the one with the largest log likelihood. When we have L foreground objects we work exactly analogously, expressing all L! permutations of the foreground layers and selecting the one with the largest likelihood.
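A sketch of this permutation search is given below. It generalizes the two-layer case of equation (19) by compositing any number of layers back-to-front; the transformed masks and per-pixel robust densities for the chosen views are assumed to be precomputed, and the function name and input conventions are our own.

```python
import numpy as np
from itertools import permutations

def best_ordering(masks, fg_liks, bg_lik):
    """Choose the occlusion ordering with the highest log likelihood.

    masks[l]: transformed mask (T_j pi) of layer l as an array over pixels;
    fg_liks[l]: per-pixel robust foreground density p_f for layer l;
    bg_lik: per-pixel robust background density p_b.
    """
    best, best_ll = None, -np.inf
    for order in permutations(range(len(masks))):   # front-to-back orders
        lik = bg_lik.copy()
        for l in reversed(order):                   # composite back-to-front
            lik = masks[l] * fg_liks[l] + (1 - masks[l]) * lik
        ll = np.log(np.maximum(lik, 1e-300)).sum()  # log likelihood, cf. (19)
        if ll > best_ll:
            best, best_ll = order, ll
    return best, best_ll
```

For two layers this evaluates exactly the two orderings of equation (19); the pruning by overlap groups described next would restrict the permutations to each overlapping subset of layers.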

The above computation of the occlusion ordering takes O(L!) time, so it can be used only when we have few foreground objects. However, in most cases we can further speed up this computation and estimate the occlusion ordering for a large number of objects. The idea is that an object usually does not overlap with (either occludes or is occluded by) all of the other L − 1 objects, but only some of them. Thus, if for each object we identify the overlapping objects, the complexity in the worst case will be O(G!), where G is the largest number of objects that simultaneously overlap with each other. Details of this algorithm together with illustrative examples are given in section B.3 of [7]. Once the occlusion ordering has been specified for each training image, we can maximize the complete log likelihood for the model described in section 2 (using the approximated transformations, viewpoints and occlusion orderings) and refine the appearances and masks of the objects. Note that for this maximization we need the EM algorithm in order to deal with the fact that each pixel follows an (L + 1)-component mixture distribution (for L = 2 see equation (4)). However, this EM runs quickly, since all the transformations, viewpoints and occlusion orderings have been filled in with the approximate values provided at previous stages of the learning process.

4 Related work

There is a huge literature on motion analysis and tracking in computer vision, and there is indeed much relevant prior work. Particularly, Wang and Adelson [1] estimate object motions in successive frames and track them through the sequence: they compute optical flow vectors, fit affine motion models to these vectors, and then cluster the motion parameters into a number of objects using k-means. Darrell and Pentland [8], and Sawhney and Ayer [9], used similar approaches based on optical flow estimation between successive frames, applying the MDL principle to select the number of objects.
Note that a major limitation of optical-flow based methods concerns regions of low texture, where flow information can be sparse, and large inter-frame motion. The method of Irani et al. [2] is much more relevant to ours. They do motion estimation using optical flow by matching the current frame against an accumulative appearance image of the tracked object. The appearance of a tracked object develops through time, although they do not take into account issues of occlusion, so that if a tracked object becomes occluded for some frames, it may be lost. The work of Tao et al. [10] is also relevant in that it deals with a background model and object models defined in terms of masks and appearances. However, note that in their work the mask is assumed to be of elliptical shape (parameterised as a Gaussian) rather than a general mask. The mask and appearance models are dynamically updated. However, the initialization of each model is handled by a separate module, and is not obtained automatically. For the aerial surveillance example given in the paper, initialization of the objects can be obtained by simple background subtraction, but that is not sufficient for the examples we consider. Later work by Jepson et al. [11] uses a polybone model for the mask instead of the Gaussian, but this still has limited representational capacity in comparison to our general mask. Jepson et al. also use more complex tracking methods which include the birth and death of polybones in time, as well as temporal tracking proposals. The idea of focusing search when carrying out transformation-invariant clustering has also been used before, e.g. by Fitzgibbon and Zisserman [12] in their work on automatic cast listing of movies. However, in that case prior knowledge that faces were

being searched for meant that a face detector could be run on the images to produce candidate locations, while this is not possible in our case as we do not know a priori what objects we are looking for. As well as methods based on masks and appearances, there are also feature-based methods for tracking objects in image sequences, see e.g. [13], [14]. These attempt to track features through a sequence and cluster these tracks using different motion models. Allan et al. [15] describe a feature-based algorithm for computing the transformations of multiple objects in a video by simultaneously clustering features of all frames into objects. The obtained transformations are then used to learn a layered generative model for the video. Currently our method ignores the spatial continuity in the segmentation labels of the pixels. This might result in noisy segmentations in some cases. A method for learning layers that incorporates spatial continuity has recently been considered by [14] and [16]. They use a layered model with an MRF prior for the pixel labels and make use of the graph cuts algorithm for efficient updating of the masks. Note that within our framework we could also incorporate an MRF for the pixel labels at the cost of increased computation. Finally, the mechanism for dealing with multiple viewpoints using mixture models has been considered before in [17]. However, in that work they consider one object present in the images against a cluttered background, and only the appearance images of different poses of the object are learned (not masks). Also, they do not apply tracking and they consider a global search over transformations. In contrast, our method can be applied to images with multiple objects and learns the background as well as different poses for the foreground objects. An important aspect of our method is the use of tracking (applied prior to learning), which stabilizes an object and then efficiently learns its views.
5 Experiments

We consider four video sequences: the Frey-Jojic (FJ) sequence (Figure 1), the arms-torso video sequence showing a moving human upper body (Figure 2), the man-walking sequence (Figure 3) and the Groundhog day van sequence 2 (Figure 4). We will also assume that the number of different views that we wish to learn for each foreground object is known. The FJ sequence consists of images (excluding the black border). This sequence can be well modelled by assuming a single view for each of the foreground objects, thus we set V = 1 for both objects. During tracking we used a window of translations in units of one pixel. The learning stage requires EM, which converged in about 30 iterations. Figure 1a shows the evolution of the initial mask and appearance (t = 1) through frames 10 and 20 as we track the first object (Frey). Notice that as we process the frames the mask focuses on only one of the two objects and the appearance remains sharp only for this object. The real running time of our MATLAB implementation for processing the whole sequence was 3 minutes. The computer used for all the experiments reported here was a 3GHz Pentium. Figure 1b shows the results after Stage 1 of the algorithm is completed. Figure 1c shows the final appearances of the foreground objects after the computation of the occlusion ordering and the joint refinement of all the parameters. Comparing Figure 1b with Figure 1c,

2 We thank the Visual Geometry Group at Oxford for providing the Groundhog day van data.

Fig. 1. Panel (a) shows the evolution of the mask π1 (top row) and the appearance f1 (bottom row) at times 1, 10 and 20 as we track the first object (Frey). Again notice how the mask becomes focused on one of the objects (Frey) and how the appearance remains clear and sharp only for Frey. Panel (b) shows the mask and the element-wise product of the mask and appearance model (π f) learned for Frey (first column from the left) and Jojic (second column) using the greedy algorithm (after Stage 1; see Table 1). Panel (c) displays the corresponding masks and appearances of the objects after the refinement step.

Fig. 2. Panel (a) shows three frames of the arms-torso video sequence. Panel (b) displays the masks and appearances of the parts of the arms-torso video sequence. In particular, the plots in the first column show the learned mask (top row) and the element-wise product of the mask and appearance (bottom row) for the head/torso. Each pair of panels in the other two columns provides the same information for the two arms.

Fig. 3. The panels in the first row show three frames of the man-walking sequence. The panels in the last two rows show the element-wise product of the mask (thresholded at 0.5) and appearance (shown against a grey background) for all six viewpoint models.

we can visually inspect the improvement in the appearances of the objects, e.g. the ghosts in the masks of Figure 1b have disappeared in Figure 1c. When we carry out the refinement step, we always initialize the background and the foreground appearances and masks using the values provided by the greedy algorithm. The variances are reinitialized to a large value to help escape from local maxima. Note also that for this maximization we maintain the robustification of the background and foreground pixel densities (αb and αf are set to 0.9) in order to deal with possible variability of the objects, e.g. local clothing deformation, or changes in the lighting conditions. The EM algorithm used for the above maximization converges in a few iterations (e.g. fewer than 20). This is because the object appearances obtained from the greedy algorithm are already close to their final values, and all the transformations of the objects have been filled in with the approximate values provided by the greedy algorithm. We demonstrate our method for learning parts of the human body using the arms-torso sequence, which consists of images. Three frames of this sequence are shown in Figure 2a. To learn the articulated parts we use translations and rotations, so that the transformation matrix Tj that applies to π and f implements any combination of translation and rotation. We implemented this using the MATLAB function tformarray.m and nearest neighbour interpolation. Note that we set the number of views V = 1 for all foreground objects. The tracking method searches over a window of translations and 15 rotations (at 2° spacing), so that it searches over 1500 transformations in total. Figure 2b shows the three parts discovered by the algorithm, i.e. the head/torso and the two arms. Note the ambiguity of the masks and appearances around the joints of the two arms with the torso, which is due to the deformability of the clothing in these areas.
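The translation-plus-rotation search described above amounts to enumerating a grid of candidate transformations and scoring each one. The sketch below is illustrative Python (the chapter's own implementation uses MATLAB's tformarray.m); the 10×10 translation window in the test is our assumption, chosen only because it reproduces the stated total of 1500 = 10 · 10 · 15 candidates.

```python
import itertools
import numpy as np

def transform_grid(tx_range, ty_range, n_rot=15, rot_step_deg=2.0):
    """Candidate transformations for tracking an articulated part.

    Returns every combination (tx, ty, theta) of an integer translation
    from the search window and one of n_rot rotation angles centred on
    zero at rot_step_deg spacing (e.g. -14°, -12°, ..., +14°).
    """
    rots = (np.arange(n_rot) - n_rot // 2) * rot_step_deg
    return [(tx, ty, th)
            for tx, ty, th in itertools.product(tx_range, ty_range, rots)]
```

The tracker would then transform the current mask and appearance by each candidate and keep the transformation with the highest likelihood, just as in the translation-only case.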
The total real running time for learning this sequence was roughly one hour. Note that when we learn object parts we should also learn a joint distribution over the parts; a method for computing such a distribution is described in [7]. The man-walking sequence consists of colour images. Figure 3 displays three frames of that sequence and also the learned visual aspects (the element-wise product of each appearance and mask pair). We assumed that the number of different views is six, i.e. V1 = 6. When we applied the tracking algorithm we used a window of translations in units of one pixel. Processing the whole video took 5 minutes. In our fourth experiment, we used frames of the Groundhog day van sequence. Figure 4a displays three frames of that sequence. During tracking we assumed a window of translations in units of one pixel, plus 5 scalings for each of the two axes spaced at a 1% change in the image size, and 5 rotations at 2° spacing. We assumed that the number of different views of the foreground object that we wish to learn is three, i.e. V1 = 3. Figure 4b shows the learned prior mask $\pi_1^{v_1}$ and the element-wise product of the appearance $f_1^{v_1}$ and the mask for each view. Figure 4c shows the background, which is also learned. Clearly, each appearance has modelled a different view of the van. However, the masks are a bit noisy; we believe this could be improved by using spatial continuity constraints. Processing the whole video took 6 hours, most of which was spent during tracking.

6 Discussion

Above we have presented a general framework for learning a layered model from a video sequence. The important feature of this method is tracking the background and the

Fig. 4. Panel (a) shows three frames of the van sequence. Panel (b) shows the pairs of mask and element-wise product of the mask and appearance (shown against a grey background) for all different viewpoints. Note that the element-wise products are produced by making the masks binary (thresholded at 0.5). Panel (c) displays the background.

foreground objects sequentially, so as to deal with one object and the associated transformations at a time. Additionally, we have combined this tracking method with allowing multiple views for each foreground object, so as to deal with large viewpoint variation. These models are learned using a mixture modelling approach. Tracking the object before knowing its full structure allows for efficient learning of the object viewpoint models. Some issues for the future are to automatically identify how many views are needed to efficiently model the appearance of each object, to determine the number of objects, and to deal with objects/parts that have internal variability. Another issue is to automatically identify whether a detected model is a part or an independent object. This might be achieved by using a mutual information measure, since we expect parts of the same object to have significant statistical dependence.

Acknowledgements

This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST. This publication only reflects the authors' views.

References

1. J. Y. A. Wang and E. H. Adelson. Representing Moving Images with Layers. IEEE Transactions on Image Processing, 3(5).
2. M. Irani, B. Rousso, and S. Peleg. Computing Occluding and Transparent Motions. International Journal of Computer Vision, 12(1):5-16.
3. N. Jojic and B. J. Frey. Learning Flexible Sprites in Video Layers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society Press, Kauai, Hawaii.
4. C. K. I. Williams and M. K. Titsias. Greedy Learning of Multiple Objects in Images using Robust Statistics and Factorial Learning. Neural Computation, 16(5).
5. M. K. Titsias and C. K. I. Williams. Fast unsupervised greedy learning of multiple objects and parts from video. In Proc. Generative-Model Based Vision Workshop.
6. M. J. Black and A. D. Jepson.
EigenTracking: Robust Matching and Tracking of Articulated Objects Using a View-Based Representation. In Proc. ECCV.
7. M. K. Titsias. Unsupervised Learning of Multiple Objects in Images. PhD thesis, School of Informatics, University of Edinburgh.
8. T. Darrell and A. P. Pentland. Cooperative Robust Estimation Using Layers of Support. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(5).
9. H. S. Sawhney and S. Ayer. Compact Representations of Videos Through Dominant and Multiple Motion Estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8).
10. H. Tao, H. S. Sawhney, and R. Kumar. Dynamic Layer Representation with Applications to Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2000.

11. A. D. Jepson, D. J. Fleet, and M. J. Black. A Layered Motion Representation with Occlusion and Compact Spatial Support. In Proceedings of the Seventh European Conference on Computer Vision, ECCV 2002. Springer, LNCS.
12. A. Fitzgibbon and A. Zisserman. On Affine Invariant Clustering and Automatic Cast Listing in Movies. In Proceedings of the Seventh European Conference on Computer Vision, ECCV 2002. Springer, LNCS.
13. P. H. S. Torr. Geometric motion segmentation and model selection. Phil. Trans. Roy. Soc. Lond. A, 356.
14. J. Wills, S. Agarwal, and S. Belongie. What Went Where. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2003, pages I:37-44.
15. M. Allan, M. K. Titsias, and C. K. I. Williams. Fast Learning of Sprites using Invariant Features. In Proceedings of the British Machine Vision Conference 2005.
16. M. P. Kumar, P. H. S. Torr, and A. Zisserman. Learning layered pictorial structures from video. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing.
17. B. J. Frey and N. Jojic. Transformation Invariant Clustering Using the EM Algorithm. IEEE Trans Pattern Analysis and Machine Intelligence, 25(1):1-17, 2003.


Real-Time Feature Descriptor Matching via a Multi-Resolution Exhaustive Search Method 297 Rea-Time Feature escriptor Matching via a Muti-Resoution Ehaustive Search Method Chi-Yi Tsai, An-Hung Tsao, and Chuan-Wei Wang epartment of Eectrica Engineering, Tamang University, New Taipei City,

More information

Navigating and searching theweb

Navigating and searching theweb Navigating and searching theweb Contents Introduction 3 1 The Word Wide Web 3 2 Navigating the web 4 3 Hyperinks 5 4 Searching the web 7 5 Improving your searches 8 6 Activities 9 6.1 Navigating the web

More information

Searching, Sorting & Analysis

Searching, Sorting & Analysis Searching, Sorting & Anaysis Unit 2 Chapter 8 CS 2308 Fa 2018 Ji Seaman 1 Definitions of Search and Sort Search: find a given item in an array, return the index of the item, or -1 if not found. Sort: rearrange

More information

CLOUD RADIO ACCESS NETWORK WITH OPTIMIZED BASE-STATION CACHING

CLOUD RADIO ACCESS NETWORK WITH OPTIMIZED BASE-STATION CACHING CLOUD RADIO ACCESS NETWORK WITH OPTIMIZED BASE-STATION CACHING Binbin Dai and Wei Yu Ya-Feng Liu Department of Eectrica and Computer Engineering University of Toronto, Toronto ON, Canada M5S 3G4 Emais:

More information

Automatic Hidden Web Database Classification

Automatic Hidden Web Database Classification Automatic idden Web atabase Cassification Zhiguo Gong, Jingbai Zhang, and Qian Liu Facuty of Science and Technoogy niversity of Macau Macao, PRC {fstzgg,ma46597,ma46620}@umac.mo Abstract. In this paper,

More information

Computer Graphics. - Shading & Texturing -

Computer Graphics. - Shading & Texturing - Computer Graphics - Shading & Texturing - Empirica BRDF Approximation Purey heuristic mode Initiay without units (vaues [0,1] r = r,a + r,d + r,s ( + r,m + r,t r,a : Ambient term Approximate indirect iumination

More information

Shape-based Object Recognition in Videos Using 3D Synthetic Object Models

Shape-based Object Recognition in Videos Using 3D Synthetic Object Models Shape-based Object Recognition in Videos Using 3D Synthetic Object Modes Aexander Toshev GRASP Lab University of Pennsyvania toshev@seas.upenn.edu Ameesh Makadia Googe makadia@googe.com Kostas Daniiidis

More information

Learning Objects and Parts in Images

Learning Objects and Parts in Images Learning Objects and Parts in Images Chris Williams E H U N I V E R S I T T Y O H F R G E D I N B U School of Informatics, University of Edinburgh, UK Learning multiple objects and parts from images (joint

More information

M. Badent 1, E. Di Giacomo 2, G. Liotta 2

M. Badent 1, E. Di Giacomo 2, G. Liotta 2 DIEI Dipartimento di Ingegneria Eettronica e de informazione RT 005-06 Drawing Coored Graphs on Coored Points M. Badent 1, E. Di Giacomo 2, G. Liotta 2 1 University of Konstanz 2 Università di Perugia

More information

AUTOMATIC gender classification based on facial images

AUTOMATIC gender classification based on facial images SUBMITTED TO IEEE TRANSACTIONS ON NEURAL NETWORKS 1 Gender Cassification Using a Min-Max Moduar Support Vector Machine with Incorporating Prior Knowedge Hui-Cheng Lian and Bao-Liang Lu, Senior Member,

More information

Forgot to compute the new centroids (-1); error in centroid computations (-1); incorrect clustering results (-2 points); more than 2 errors: 0 points.

Forgot to compute the new centroids (-1); error in centroid computations (-1); incorrect clustering results (-2 points); more than 2 errors: 0 points. Probem 1 a. K means is ony capabe of discovering shapes that are convex poygons [1] Cannot discover X shape because X is not convex. [1] DBSCAN can discover X shape. [1] b. K-means is prototype based and

More information

Application of Intelligence Based Genetic Algorithm for Job Sequencing Problem on Parallel Mixed-Model Assembly Line

Application of Intelligence Based Genetic Algorithm for Job Sequencing Problem on Parallel Mixed-Model Assembly Line American J. of Engineering and Appied Sciences 3 (): 5-24, 200 ISSN 94-7020 200 Science Pubications Appication of Inteigence Based Genetic Agorithm for Job Sequencing Probem on Parae Mixed-Mode Assemby

More information

1682 IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 22, NO. 6, DECEMBER Backward Fuzzy Rule Interpolation

1682 IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 22, NO. 6, DECEMBER Backward Fuzzy Rule Interpolation 1682 IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 22, NO. 6, DECEMBER 2014 Bacward Fuzzy Rue Interpoation Shangzhu Jin, Ren Diao, Chai Que, Senior Member, IEEE, and Qiang Shen Abstract Fuzzy rue interpoation

More information

Massively Parallel Part of Speech Tagging Using. Min-Max Modular Neural Networks.

Massively Parallel Part of Speech Tagging Using. Min-Max Modular Neural Networks. assivey Parae Part of Speech Tagging Using in-ax oduar Neura Networks Bao-Liang Lu y, Qing a z, ichinori Ichikawa y, & Hitoshi Isahara z y Lab. for Brain-Operative Device, Brain Science Institute, RIEN

More information

Privacy Preserving Subgraph Matching on Large Graphs in Cloud

Privacy Preserving Subgraph Matching on Large Graphs in Cloud Privacy Preserving Subgraph Matching on Large Graphs in Coud Zhao Chang,#, Lei Zou, Feifei Li # Peing University, China; # University of Utah, USA; {changzhao,zouei}@pu.edu.cn; {zchang,ifeifei}@cs.utah.edu

More information

On Upper Bounds for Assortment Optimization under the Mixture of Multinomial Logit Models

On Upper Bounds for Assortment Optimization under the Mixture of Multinomial Logit Models On Upper Bounds for Assortment Optimization under the Mixture of Mutinomia Logit Modes Sumit Kunnumka September 30, 2014 Abstract The assortment optimization probem under the mixture of mutinomia ogit

More information

MULTIGRID REDUCTION IN TIME FOR NONLINEAR PARABOLIC PROBLEMS: A CASE STUDY

MULTIGRID REDUCTION IN TIME FOR NONLINEAR PARABOLIC PROBLEMS: A CASE STUDY MULTIGRID REDUCTION IN TIME FOR NONLINEAR PARABOLIC PROBLEMS: A CASE STUDY R.D. FALGOUT, T.A. MANTEUFFEL, B. O NEILL, AND J.B. SCHRODER Abstract. The need for paraeism in the time dimension is being driven

More information

Testing Whether a Set of Code Words Satisfies a Given Set of Constraints *

Testing Whether a Set of Code Words Satisfies a Given Set of Constraints * JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 6, 333-346 (010) Testing Whether a Set of Code Words Satisfies a Given Set of Constraints * HSIN-WEN WEI, WAN-CHEN LU, PEI-CHI HUANG, WEI-KUAN SHIH AND MING-YANG

More information

Human Instance Segmentation from Video using Detector-based Conditional Random Fields

Human Instance Segmentation from Video using Detector-based Conditional Random Fields VINEET ET AL.: HUMAN INSTANCE SEGMENTATION 1 Human Instance Segmentation from Video using Detector-based Conditiona Random Fieds Vibhav Vineet 1 vibhav.vineet-2010@brookes.ac.uk Jonathan Warre 1 jwarre@brookes.ac.uk

More information

University of Illinois at Urbana-Champaign, Urbana, IL 61801, /11/$ IEEE 162

University of Illinois at Urbana-Champaign, Urbana, IL 61801, /11/$ IEEE 162 oward Efficient Spatia Variation Decomposition via Sparse Regression Wangyang Zhang, Karthik Baakrishnan, Xin Li, Duane Boning and Rob Rutenbar 3 Carnegie Meon University, Pittsburgh, PA 53, wangyan@ece.cmu.edu,

More information

Handling Outliers in Non-Blind Image Deconvolution

Handling Outliers in Non-Blind Image Deconvolution Handing Outiers in Non-Bind Image Deconvoution Sunghyun Cho 1 Jue Wang 2 Seungyong Lee 1,2 sodomau@postech.ac.kr juewang@adobe.com eesy@postech.ac.kr 1 POSTECH 2 Adobe Systems Abstract Non-bind deconvoution

More information

Stereo Matching with Energy Minimizing Snake Grid for 3D Face Modeling

Stereo Matching with Energy Minimizing Snake Grid for 3D Face Modeling Stereo Matching with Energy Minimizing Snake Grid for 3D Face Modeing Shafik Huq 1, Besma Abidi 1, Ardeshir Goshtasby 2, and Mongi Abidi 1 1 Imaging, Robotics, and Inteigent System (IRIS) Laboratory, Department

More information

Absolute three-dimensional shape measurement with two-frequency square binary patterns

Absolute three-dimensional shape measurement with two-frequency square binary patterns 871 Vo. 56, No. 31 / November 1 217 / Appied Optics Research Artice Absoute three-dimensiona shape measurement with two-frequency square binary patterns CHUFAN JIANG AND SONG ZHANG* Schoo of Mechanica

More information

Authorization of a QoS Path based on Generic AAA. Leon Gommans, Cees de Laat, Bas van Oudenaarde, Arie Taal

Authorization of a QoS Path based on Generic AAA. Leon Gommans, Cees de Laat, Bas van Oudenaarde, Arie Taal Abstract Authorization of a QoS Path based on Generic Leon Gommans, Cees de Laat, Bas van Oudenaarde, Arie Taa Advanced Internet Research Group, Department of Computer Science, University of Amsterdam.

More information

Factorization for Probabilistic Local Appearance Models

Factorization for Probabilistic Local Appearance Models MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.mer.com Factorization for Probabiistic Loca Appearance Modes Baback Moghaddam Xiang Zhou TR2002-50 June 2002 Abstract We propose a nove oca appearance

More information

AN OVERVIEW OF IMAGE SEGMENTATION TECHNIQUES. IN Fabris-Rotelli 1 and J-F Greeff 1. ABSTRACT

AN OVERVIEW OF IMAGE SEGMENTATION TECHNIQUES. IN Fabris-Rotelli 1 and J-F Greeff 1. ABSTRACT AN OVERVIEW OF IMAGE SEGMENTATION TECHNIQUES IN Fabris-Rotei 1 and J-F Greeff 1 1 Department of Statistics, University of Pretoria, 0002, Pretoria, South Africa, inger.fabris-rotei@up.ac.za ABSTRACT We

More information

Collinearity and Coplanarity Constraints for Structure from Motion

Collinearity and Coplanarity Constraints for Structure from Motion Coinearity and Copanarity Constraints for Structure from Motion Gang Liu 1, Reinhard Kette 2, and Bodo Rosenhahn 3 1 Institute of Information Sciences and Technoogy, Massey University, New Zeaand, Department

More information

Quality of Service Evaluations of Multicast Streaming Protocols *

Quality of Service Evaluations of Multicast Streaming Protocols * Quaity of Service Evauations of Muticast Streaming Protocos Haonan Tan Derek L. Eager Mary. Vernon Hongfei Guo omputer Sciences Department University of Wisconsin-Madison, USA {haonan, vernon, guo}@cs.wisc.edu

More information

Resource Optimization to Provision a Virtual Private Network Using the Hose Model

Resource Optimization to Provision a Virtual Private Network Using the Hose Model Resource Optimization to Provision a Virtua Private Network Using the Hose Mode Monia Ghobadi, Sudhakar Ganti, Ghoamai C. Shoja University of Victoria, Victoria C, Canada V8W 3P6 e-mai: {monia, sganti,

More information

Identifying and Tracking Pedestrians Based on Sensor Fusion and Motion Stability Predictions

Identifying and Tracking Pedestrians Based on Sensor Fusion and Motion Stability Predictions Sensors 2010, 10, 8028-8053; doi:10.3390/s100908028 OPEN ACCESS sensors ISSN 1424-8220 www.mdpi.com/journa/sensors Artice Identifying and Tracking Pedestrians Based on Sensor Fusion and Motion Stabiity

More information

Performance Enhancement of 2D Face Recognition via Mosaicing

Performance Enhancement of 2D Face Recognition via Mosaicing Performance Enhancement of D Face Recognition via Mosaicing Richa Singh, Mayank Vatsa, Arun Ross, Afze Noore West Virginia University, Morgantown, WV 6506 {richas, mayankv, ross, noore}@csee.wvu.edu Abstract

More information

Delay Budget Partitioning to Maximize Network Resource Usage Efficiency

Delay Budget Partitioning to Maximize Network Resource Usage Efficiency Deay Budget Partitioning to Maximize Network Resource Usage Efficiency Kartik Gopaan Tzi-cker Chiueh Yow-Jian Lin Forida State University Stony Brook University Tecordia Technoogies kartik@cs.fsu.edu chiueh@cs.sunysb.edu

More information

Endoscopic Motion Compensation of High Speed Videoendoscopy

Endoscopic Motion Compensation of High Speed Videoendoscopy Endoscopic Motion Compensation of High Speed Videoendoscopy Bharath avuri Department of Computer Science and Engineering, University of South Caroina, Coumbia, SC - 901. ravuri@cse.sc.edu Abstract. High

More information

Slide 1 Lecture 18 Copyright

Slide 1 Lecture 18 Copyright 5D=@ MI (Georges de a Tour) Side 1 Lecture 18 9DO 5D=@ MI Shadows give us important visua cues about 3D object pacement and motion Movies are from: http://vision.psych.umn.edu /users/kersten/kerstenab/demos/shadows.htm

More information

FREE-FORM ANISOTROPY: A NEW METHOD FOR CRACK DETECTION ON PAVEMENT SURFACE IMAGES

FREE-FORM ANISOTROPY: A NEW METHOD FOR CRACK DETECTION ON PAVEMENT SURFACE IMAGES FREE-FORM ANISOTROPY: A NEW METHOD FOR CRACK DETECTION ON PAVEMENT SURFACE IMAGES Tien Sy Nguyen, Stéphane Begot, Forent Ducuty, Manue Avia To cite this version: Tien Sy Nguyen, Stéphane Begot, Forent

More information

AN EVOLUTIONARY APPROACH TO OPTIMIZATION OF A LAYOUT CHART

AN EVOLUTIONARY APPROACH TO OPTIMIZATION OF A LAYOUT CHART 13 AN EVOLUTIONARY APPROACH TO OPTIMIZATION OF A LAYOUT CHART Eva Vona University of Ostrava, 30th dubna st. 22, Ostrava, Czech Repubic e-mai: Eva.Vona@osu.cz Abstract: This artice presents the use of

More information

Further Concepts in Geometry

Further Concepts in Geometry ppendix F Further oncepts in Geometry F. Exporing ongruence and Simiarity Identifying ongruent Figures Identifying Simiar Figures Reading and Using Definitions ongruent Trianges assifying Trianges Identifying

More information

RDF Objects 1. Alex Barnell Information Infrastructure Laboratory HP Laboratories Bristol HPL November 27 th, 2002*

RDF Objects 1. Alex Barnell Information Infrastructure Laboratory HP Laboratories Bristol HPL November 27 th, 2002* RDF Objects 1 Aex Barne Information Infrastructure Laboratory HP Laboratories Bristo HPL-2002-315 November 27 th, 2002* E-mai: Andy_Seaborne@hp.hp.com RDF, semantic web, ontoogy, object-oriented datastructures

More information

Further Optimization of the Decoding Method for Shortened Binary Cyclic Fire Code

Further Optimization of the Decoding Method for Shortened Binary Cyclic Fire Code Further Optimization of the Decoding Method for Shortened Binary Cycic Fire Code Ch. Nanda Kishore Heosoft (India) Private Limited 8-2-703, Road No-12 Banjara His, Hyderabad, INDIA Phone: +91-040-3378222

More information

A Two-Step Approach for Interactive Pre-Integrated Volume Rendering of Unstructured Grids

A Two-Step Approach for Interactive Pre-Integrated Volume Rendering of Unstructured Grids A Two-Step Approach for Interactive Pre-Integrated Voume Rendering of Unstructured Grids Stefan Roettger and Thomas Ert Visuaization and Interactive Systems Group University of Stuttgart Abstract In the

More information

PROOF COPY JOE. Hierarchical feed-forward network for object detection tasks PROOF COPY JOE. 1 Introduction

PROOF COPY JOE. Hierarchical feed-forward network for object detection tasks PROOF COPY JOE. 1 Introduction 45 6, 1 June 2006 Hierarchica feed-forward network for object detection tasks Ingo Bax Gunther Heidemann Hege Ritter Bieefed University Neuroinformatics Group Facuty of Technoogy P.O. Box 10 01 31 D-33501

More information

Solutions to the Final Exam

Solutions to the Final Exam CS/Math 24: Intro to Discrete Math 5//2 Instructor: Dieter van Mekebeek Soutions to the Fina Exam Probem Let D be the set of a peope. From the definition of R we see that (x, y) R if and ony if x is a

More information

Multiple Plane Phase Retrieval Based On Inverse Regularized Imaging and Discrete Diffraction Transform

Multiple Plane Phase Retrieval Based On Inverse Regularized Imaging and Discrete Diffraction Transform Mutipe Pane Phase Retrieva Based On Inverse Reguaried Imaging and Discrete Diffraction Transform Artem Migukin, Vadimir Katkovnik, and Jaakko Astoa Department of Signa Processing, Tampere University of

More information

WHILE estimating the depth of a scene from a single image

WHILE estimating the depth of a scene from a single image JOURNAL OF L A T E X CLASS FILES, VOL. 4, NO. 8, AUGUST 05 Monocuar Depth Estimation using Muti-Scae Continuous CRFs as Sequentia Deep Networks Dan Xu, Student Member, IEEE, Eisa Ricci, Member, IEEE, Wani

More information

MCSE Training Guide: Windows Architecture and Memory

MCSE Training Guide: Windows Architecture and Memory MCSE Training Guide: Windows 95 -- Ch 2 -- Architecture and Memory Page 1 of 13 MCSE Training Guide: Windows 95-2 - Architecture and Memory This chapter wi hep you prepare for the exam by covering the

More information

May 13, Mark Lutz Boulder, Colorado (303) [work] (303) [home]

May 13, Mark Lutz Boulder, Colorado (303) [work] (303) [home] "Using Python": a Book Preview May 13, 1995 Mark Lutz Bouder, Coorado utz@kapre.com (303) 546-8848 [work] (303) 684-9565 [home] Introduction. This paper is a brief overview of the upcoming Python O'Reiy

More information