Recursive estimation of generative models of video


Nemanja Petrovic (Google Inc.), Sumit Basu (Microsoft Research), Aleksandar Ivanovic (University of Illinois Urbana), Thomas Huang (University of Illinois Urbana), Nebojsa Jojic (Microsoft Research)

Abstract

In this paper we present a generative model and learning procedure for unsupervised video clustering into scenes. The work addresses two important problems: realistic modeling of the sources of variability in the video, and fast transformation-invariant frame clustering. We address the computationally intensive learning in this model by combining recursive model estimation, fast inference, and on-line learning, and thus achieve real-time frame clustering performance. Novel aspects of this method include an algorithm for the clustering of Gaussian mixtures and the fast computation of the KL divergence between two mixtures of Gaussians. The efficiency and the performance of the clustering and KL approximation methods are demonstrated. We also present a novel video browsing tool based on the visualization of the variables in the generative model.

1. Introduction

The amount of video data available to an average consumer has already become overwhelming. Still, there is a lack of efficient general-purpose tools for navigating this vast amount of information. We suggest that a successful video browsing and summarization system has to accomplish two goals. First, it must correctly model the sources of the vast information content in the video. Second, it must provide the user with an intuitive and fast video navigation interface that is compatible with, if not jointly optimized with, the analysis algorithm. As a solution to the first problem we propose the clustering of related but non-sequential frames into scenes. Clustering is based on the generative model (Fig. 1) that builds on the model for translation-invariant clustering [14]. Learning in a generative model with multiple discrete variables faces considerable computational challenges.
We utilize the properties of the video signal to develop a provably convergent recursive clustering algorithm. To make the model more realistic, it assumes the video frame to be a portion of the video scene, which is reminiscent of panoramic scene representations.

Figure 1. Left: Scene generative model. The pair (c, z) is a Gaussian mixture. Observation x is obtained by scaling by Z, rotating by R, and transforming and cropping the latent image z by a translation indexed by T. W is the fixed cropping window that models frame x as a small part of the video scene z. The effect of the composition WT is illustrated as the shaded rectangle that indicates the position of x in z. Right: Nine typical frames from the video are initially clustered into three clusters using only translation-invariant clustering. The three distributions so obtained are clustered into a single cluster using translation-, scale- and rotation-invariant distribution clustering (cf. Sec. 3).

As a solution to the second problem we propose a video navigation tool based on the visualization of the variables in the generative model, which ideally reflects all frames in the video. The navigation tool serves as a visually meaningful index into the video.

Video clustering and summarization is one of the most difficult problems in automatic video understanding. It aims to produce a short, yet representative, synopsis of the video by extracting pertinent information or highlights that enable the viewer to quickly grasp the general story or navigate to a specific segment. The two main approaches to video summarization are static summarization (including shots, mosaics and storyboards) and dynamic summarization (video skimming). Numerous shot and key-frame detection approaches are based on extracting and tracking low-level features over time and detecting their abrupt

changes. But long lists of shots result in another information flood rather than abstraction, while key frames alone are not sufficient for a user to judge the relevance of the content. Dynamic video summarization, often referred to as skimming, consists of collecting representative or desirable sub-clips from the video. The navigation tool we introduce in this paper has the distinctive feature that it includes both a static and a dynamic summary of the video.

Several shot-independent approaches to summarization have been proposed recently, such as recursive key-frame clustering [9, 10]. While promising, they lack a robust similarity measure, and once the number of clusters exceeds a certain level, visually different shots start to merge.

There have been many interesting approaches to video browsing and summarization based on mosaicking. In this paper we introduce a probabilistic mosaic representation of video that includes invariance with respect to camera motion and changes in scale and rotation. It assumes the video scene to be much larger than the viewing field of the camera. We should emphasize that our goal is not to build perfect mosaics. Rather, it is to construct a robust similarity measure that yields a high likelihood of the frame under the generative model.

Similar mosaicking representations [4, 5, 6, 7] were used before, some of them based on generative models [8], but they were overly constrained with respect to camera motion and required that target scenes be approximately planar and contain no moving objects. For example, [5] studied mosaics in the context of video browsing and a synoptic view of a scene. The video summaries work [6] uses mosaics for summarization, but ignores the foreground motion and relies on background invariance for scene clustering, together with an ad hoc scene similarity measure. Our work is similar to the mosaicking work [7] on mosaic-based representations of video sequences. There are a few important differences.
Our method, having video clustering and compact video presentation as its ultimate goals, has a notion of variability of appearance (moving objects in the scene and blemishes in the background are allowed and treated as noise), automatically estimates the number of parameters (e.g., the number of different scenes), explains the cause of variability in the data, and recognizes a scene that has already appeared. Also, other methods were used in highly regimented cases (e.g., aerial surveillance, sitcoms), whereas ours is intended for the general class of unconstrained home videos.

Realistic graphical (generative) models may sometimes face serious computational challenges. Similarly, naive learning in this model is infeasible. Clustering one hour of video does not allow visiting each datum more than once. This constraint suggests one-pass over-clustering of the frames, followed by iterative cluster grouping. Each of these operations corresponds to a re-estimation of the parameters in the model. We derive the algorithm for recursive estimation of this model based on the EM algorithm, thus inheriting its convergence and optimality properties. Fast inference methods and the video navigation tool are the features of this work.

2. Model

The video analysis algorithm we present is based on a generative model (Figure 1, left) that assumes the video scenes are generated by a set of normalized scenes that are subjected to geometrical transforms and noise [3]. The appearance of the scene is modeled by a Gaussian appearance map. The probability density of the vector of pixel values $z$ for the latent image corresponding to the cluster $c$ is

$$p(z \mid c) = \mathcal{N}(z;\, \mu_c, \Phi_c) \tag{1}$$

where $\mu_c$ is the mean of the latent image $z$, and $\Phi_c$ is a diagonal covariance matrix that specifies the variability of each pixel in the latent image. The variability $\Phi_c$ is necessary to capture various causes not captured by the variability in scene class and transformation, e.g., slight blemishes in appearance or changes in lighting.
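The generative process described so far (a latent scene drawn from a per-pixel Gaussian, then shifted and cropped into a frame, as in the Figure 1 caption) can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: it covers translation and cropping only, the translation is modeled as a cyclic shift with `np.roll`, the cropping window is the upper-left corner, and the function name `sample_frame` and the image sizes are hypothetical.

```python
import numpy as np

def sample_frame(mu, phi, shift, frame_shape, rng):
    """Draw one frame: latent scene z ~ N(mu, diag(phi)), shifted, then cropped."""
    # Latent scene with independent per-pixel Gaussian noise (diagonal covariance).
    z = mu + np.sqrt(phi) * rng.standard_normal(mu.shape)
    # Translation (assumed cyclic): scene pixel (dy, dx) moves to the origin.
    dy, dx = shift
    shifted = np.roll(z, shift=(-dy, -dx), axis=(0, 1))
    # Fixed cropping window: keep the upper-left block of the shifted scene.
    h, w = frame_shape
    return shifted[:h, :w]

rng = np.random.default_rng(0)
mu = rng.uniform(0.0, 1.0, size=(48, 64))   # scene mean (hypothetical size)
phi = np.full(mu.shape, 0.01)               # per-pixel variances
frame = sample_frame(mu, phi, shift=(5, 10), frame_shape=(24, 32), rng=rng)
print(frame.shape)
```

With zero variance the frame is exactly the window of the mean scene starting at the chosen shift, which makes the role of the cropping window easy to check.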
We do not model the full covariance matrix, as there is never enough data to estimate it. It is possible, however, to use a subspace modeling technique to capture some correlations in this matrix.

The observable image is modeled as a cropped region of the latent scene image. Before cropping, the latent image undergoes a transformation composed of a zoom, a rotation and a translation. The motivation for this is that, of the global camera motion types, zoom and pan are the most frequent, while rotations are fairly rare. Leaving out more complex motion, such as that produced by perspective effects, several dominant motion vector fields, nonuniform motion, etc., speeds up the algorithm (real time in our implementation), but makes the variance maps defined above a crucial part of the model, as they can capture the extra variability, although in a cruder manner. In addition, the nonuniform variance map has to capture some other causes of variability that we left out, such as small illumination changes, variable contrast, etc.

The probability density of the observable vector of pixel values $x$ for the image corresponding to the zoom $Z$, translation $T$, rotation $R$, latent image $z$ and fixed cropping transform $W$ is

$$p(x \mid T, Z, R, z) = \delta(x - WTZRz) \tag{2}$$

where $T$, $R$ and $Z$ come from a finite set of possible transformations. A similar affine generative model in conjunction with Bayesian inference was proposed in [13]. We consider only a few different levels of zoom and rotation. The computational burden of searching over all integer translations

is relieved by the use of the Fast Fourier Transform (FFT) for performing computations in the Fourier domain (Sec. 4). While we can view this model as one in which the composition $WTZR$ is treated as a novel transformation, it is imperative to keep these transformations separate in order to derive an efficient inference algorithm based on FFTs, which is several orders of magnitude faster than an algorithm that tests all possible transformations jointly.

The joint likelihood of a single video frame $x$ and latent image $z$, given $c$ and $T$, is

$$p(x, z \mid c, T, Z, R) = \delta(x - WTZRz)\, \mathcal{N}(z;\, \mu_c, \Phi_c) \tag{3}$$

Note that the distribution over $z$ can be integrated out in closed form:

$$p(x \mid c, T, Z, R) = \mathcal{N}(x;\, WTZR\mu_c,\; WTZR\,\Phi_c\,R'Z'T'W') \tag{4}$$

Under the assumption that each frame is independently generated in this fashion, the joint distribution over all variables is

$$p(\{x_t, c_t, R_t, Z_t, T_t\}_{t=1}^{T}) = \prod_t p(x_t \mid c_t, R_t, Z_t, T_t)\, p(c_t)\, p(T_t, Z_t) \tag{5}$$

The model is parameterized by the scene means $\mu_c$, the pixel variances stored on the diagonal of $\Phi_c$, and the scene probabilities $\pi_c = p(c_t = c)$, and as such provides a summary of what is common in the video. The hidden variables $c_t, R_t, Z_t, T_t$ describe the main causes of variability in the video, and as such vary from frame to frame. The prior distribution over $R$, $Z$, $T$ is assumed uniform.

3. Recursive model estimation

It is possible to derive the EM algorithm in closed form (Sec. 4) for the proposed model. However, the number of scenes (components in the mixture) is unknown. Also, the exhaustive computation of the posterior probabilities over transformations and classes is intractable. We use a variant of the incremental EM algorithm [15, 16] to quickly cluster the frames into a large number of clusters using, at this stage, the translation- and cropping-invariant model only. We dynamically update the number of classes by adding a new class whenever the model cannot explain the new data well.
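The idea of adding a class whenever the current model cannot explain a new frame can be illustrated with a deliberately simplified one-pass clusterer. This is a sketch under strong assumptions that are not taken from the paper: frames are compared without any transformation search, each cluster is an isotropic Gaussian with a fixed variance `var`, assignments are hard rather than soft, and the names `online_cluster` and `threshold` are hypothetical.

```python
import numpy as np

def online_cluster(frames, threshold, var=1.0):
    """One pass over the frames; returns (cluster means, label per frame)."""
    means, counts, labels = [], [], []
    for x in frames:
        x = np.asarray(x, dtype=float).ravel()
        best, best_ll = -1, -np.inf
        for k, m in enumerate(means):
            # Average per-pixel log-likelihood under an isotropic Gaussian.
            ll = -0.5 * np.mean((x - m) ** 2) / var - 0.5 * np.log(2 * np.pi * var)
            if ll > best_ll:
                best, best_ll = k, ll
        if best_ll < threshold:
            # No existing class explains the frame well: open a new class.
            means.append(x.copy())
            counts.append(1)
            labels.append(len(means) - 1)
        else:
            # Incremental update of the running mean of the winning class.
            counts[best] += 1
            means[best] += (x - means[best]) / counts[best]
            labels.append(best)
    return means, labels

rng = np.random.default_rng(1)
burst_a = 0.0 + 0.1 * rng.standard_normal((20, 16))
burst_b = 10.0 + 0.1 * rng.standard_normal((20, 16))
burst_a_again = 0.0 + 0.1 * rng.standard_normal((10, 16))
means, labels = online_cluster(np.vstack([burst_a, burst_b, burst_a_again]), threshold=-5.0)
print(len(means), labels[0], labels[25], labels[-1])
```

A lower threshold merges more frames into existing classes; a higher one over-clusters, which is the regime the recursive re-estimation is meant to clean up afterwards.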
Given that a large number of frames $x_t$ have been clustered (summarized) in this manner using a mixture model $p(x) = \sum_c p(x \mid c)\,p(c)$ with $C$ clusters (components), each described by the prior $p(c) = \pi_c$, mean $\mu_c$ and a diagonal covariance matrix $\Phi_c$, we want to estimate another mixture model $p_1$, defined by a smaller number of clusters $S$ with parameters $\pi_s, \mu_s, \Phi_s$, on the same data. We will formally derive the re-estimation algorithm using a Gaussian mixture model as an example, with the understanding that the same derivation is carried out for the more complex models that include transformations.

Assuming that $p$ summarizes the data well, we can replace the real data $\{x_i\}$ with similar ("virtual") data $\{y_i\}$, $i = 1 \ldots N$, randomly generated from the obtained mixture model, and estimate the parameters of the model $p_1$ using the virtual data $\{y_i\}$. As the number of virtual data $N$ grows to infinity, their distribution converges in probability to the original data distribution. We can fit the simpler distribution $p_1$ to $\{y_i\}$ without actually generating them, by working only with the expectations under the model $p$. We optimize the expectation $\frac{1}{N}\sum_i \log p_1(y_i)$ of the likelihood of the generated data for large $N$, where $y_i$ is sampled from $p(y)$ (in our example the mixture model with parameters $\{\pi_c, \mu_c, \Phi_c\}$):

$$\frac{1}{N}\sum_{i=1}^{N} \log p_1(y_i) \to E[\log p_1(y)] = \int_y p(y) \log p_1(y) = \int_y \Big[\sum_c p(y \mid c)\,p(c)\Big] \log p_1(y) = \sum_c p(c) \int_y p(y \mid c) \log p_1(y) \ge \sum_c p(c) \int_y p(y \mid c) \sum_s q_y(s) \log \frac{p_1(y \mid s)\,p_1(s)}{q_y(s)} = -EF \tag{6}$$

where the inequality follows by the same convexity argument as in the case of the standard free energy [15]. Such a reparametrized model $p_1$ can itself be recursively reparametrized, giving a hierarchy of models of decreasing complexity. By doing this we resort to the original data exactly once and avoid costly re-processing of hundreds of thousands of video frames.

The new bound on the free energy $EF$ would be tight if $q_y(s)$ were exactly equal to the posterior, i.e., $q_y(s) = p_1(s \mid y)$.
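However, we assume that the posterior is the same for all $y$ once the class $c$ is chosen, and we emphasize this with the notation $q_c(s)$. Under this assumption the bound further simplifies into

$$-EF = \sum_c p(c) \Big\{ \sum_s q_c(s) \Big[ \int_y p(y \mid c) \log p_1(y \mid s) \Big] + \sum_s q_c(s) \big[ \log p_1(s) - \log q_c(s) \big] \Big\}$$
$$= \sum_c p(c) \sum_s q_c(s) \Big[ -\tfrac{1}{2} (\mu_s - \mu_c)^T \Phi_s^{-1} (\mu_s - \mu_c) - \tfrac{1}{2} \mathrm{tr}(\Phi_s^{-1} \Phi_c) - \tfrac{1}{2} \log |2\pi\Phi_s| \Big] + \sum_c p(c) \sum_s q_c(s) \big[ \log p_1(s) - \log q_c(s) \big] \tag{7}$$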
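For diagonal covariances, every term of the simplified bound in Eq. (7) has a closed form, so the bound can be evaluated without sampling any virtual data. The sketch below is an illustration only: the function names `expected_loglik` and `bound`, the array layout (one row per component, variances stored as vectors) and the example mixtures are assumptions, and `q` may be any row-stochastic C x S matrix.

```python
import numpy as np

def expected_loglik(mu_c, var_c, mu_s, var_s):
    """E[c, s] = E_{N(mu_c, var_c)}[log N(y; mu_s, var_s)], diagonal covariances."""
    diff2 = (mu_c[:, None, :] - mu_s[None, :, :]) ** 2            # (C, S, D)
    maha = np.sum(diff2 / var_s[None, :, :], axis=2)              # Mahalanobis term
    trace = np.sum(var_c[:, None, :] / var_s[None, :, :], axis=2) # tr(Phi_s^-1 Phi_c)
    logdet = np.sum(np.log(2 * np.pi * var_s), axis=1)            # log|2 pi Phi_s|
    return -0.5 * (maha + trace + logdet[None, :])

def bound(pi_c, q, E, pi_s):
    """Right-hand side of Eq. (7): a lower bound on E_p[log p_1(y)]."""
    q_safe = np.clip(q, 1e-300, None)   # avoids log(0); zero entries contribute 0
    inner = q * (E + np.log(pi_s)[None, :] - np.log(q_safe))
    return float(np.sum(pi_c[:, None] * inner))

# Hypothetical example: C = 3 fine components, S = 2 coarse components, D = 2.
pi_c = np.array([0.3, 0.3, 0.4])
mu_c = np.array([[0.0, 0.0], [0.5, 0.0], [4.0, 4.0]])
var_c = np.full((3, 2), 0.1)
pi_s = np.array([0.6, 0.4])
mu_s = np.array([[0.25, 0.0], [4.0, 4.0]])
var_s = np.full((2, 2), 0.2)

E = expected_loglik(mu_c, var_c, mu_s, var_s)
q_uniform = np.full((3, 2), 0.5)
print(bound(pi_c, q_uniform, E, pi_s))
```

Any valid `q` gives a lower bound; a better-chosen `q` only tightens it, which is what the re-estimation iteration exploits.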

Minimizing the free energy under the usual constraints, e.g., $\sum_s q_c(s) = 1$, yields an iteration of an EM algorithm that reparameterizes the model. For the plain mixture model,

$$q_c(s) \propto p_1(s)\, \exp\Big\{ -\tfrac{1}{2} (\mu_s - \mu_c)^T \Phi_s^{-1} (\mu_s - \mu_c) - \tfrac{1}{2} \mathrm{tr}(\Phi_s^{-1} \Phi_c) - \tfrac{1}{2} \log |2\pi\Phi_s| \Big\} \tag{8}$$

$$\mu_s = \frac{\sum_c p(c)\, q_c(s)\, \mu_c}{\sum_c p(c)\, q_c(s)}, \qquad \Phi_s = \frac{\sum_c p(c)\, q_c(s) \big[ (\mu_s - \mu_c)(\mu_s - \mu_c)^T + \Phi_c \big]}{\sum_c p(c)\, q_c(s)}, \qquad \pi_s = p_1(s) = \frac{\sum_c p(c)\, q_c(s)}{\sum_c p(c)} = \sum_c p(c)\, q_c(s) \tag{9}$$

A similar reparametrization algorithm was intuitively proposed in [1] for data clustering in the presence of uncertainties. The idea of recursive density estimation is reminiscent of [2]. The EM algorithm above converges to a local maximum, and the quality of the results depends on the validity of the assumption that the posterior $q(s)$ is shared among all virtual data samples from the same class. When the model $p$ captures the original data with many narrow models $p(y \mid c)$, and $S \ll C$, the approximation is reasonable and reduces the computation by a factor of $T/C$ in comparison with retraining directly on the original data.

The result of recursive model estimation is a hierarchy of models that can be elegantly presented to the user through an appropriate user interface, shown in the video submission. Figure 1 (right) illustrates recursive clustering of three distributions into a single hyper-cluster, using both translation- and scale-invariant clustering.

Computing the fast approximate KL divergence between two mixtures

The optimization in Eq. (6) can alternatively be seen as the minimization of the KL divergence between the distributions $p$ and $p_1$. Thus, we can use the bound on the variational free energy for the re-estimation problem to obtain a tight upper bound on the KL divergence between two mixtures of Gaussians (MoGs), a problem not tractable in closed form. Recently, efficient and accurate computation of the KL divergence between mixtures has attracted a lot of attention [17, 18].
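The re-estimation iteration of Eqs. (8) and (9) operates only on the parameters of the fine mixture, never on the data. A minimal sketch for the plain mixture model with diagonal covariances is given below; the function name `reduce_mixture`, the array layout, the fixed iteration count and the example values are assumptions, and the code assumes no coarse component becomes empty during the iterations.

```python
import numpy as np

def reduce_mixture(pi_c, mu_c, var_c, pi_s, mu_s, var_s, n_iter=50):
    """Fit an S-component mixture to a C-component one using Eqs. (8)-(9)."""
    for _ in range(n_iter):
        # E-step, Eq. (8): q_c(s) up to normalization, computed in the log domain.
        diff2 = (mu_c[:, None, :] - mu_s[None, :, :]) ** 2
        log_q = (np.log(pi_s)[None, :]
                 - 0.5 * np.sum(diff2 / var_s[None], axis=2)
                 - 0.5 * np.sum(var_c[:, None, :] / var_s[None], axis=2)
                 - 0.5 * np.sum(np.log(2 * np.pi * var_s), axis=1)[None, :])
        log_q -= log_q.max(axis=1, keepdims=True)   # numerical stability
        q = np.exp(log_q)
        q /= q.sum(axis=1, keepdims=True)
        # M-step, Eq. (9): weights p(c) q_c(s) play the role of soft counts.
        w = pi_c[:, None] * q                       # (C, S)
        pi_s = w.sum(axis=0)
        mu_s = (w.T @ mu_c) / pi_s[:, None]
        diff2 = (mu_c[:, None, :] - mu_s[None, :, :]) ** 2
        var_s = np.einsum('cs,csd->sd', w, diff2 + var_c[:, None, :]) / pi_s[:, None]
    return pi_s, mu_s, var_s, q

# Hypothetical example: four narrow components forming two tight pairs.
pi_c = np.full(4, 0.25)
mu_c = np.array([[0.0, 0.0], [0.2, 0.0], [10.0, 10.0], [10.2, 10.0]])
var_c = np.full((4, 2), 0.01)
pi_s, mu_s, var_s, q = reduce_mixture(
    pi_c, mu_c, var_c,
    pi_s=np.array([0.5, 0.5]),
    mu_s=np.array([[1.0, 1.0], [9.0, 9.0]]),
    var_s=np.ones((2, 2)))
print(np.round(mu_s, 3))
```

Because the coarse variance in Eq. (9) adds the fine variances to the spread of the fine means, merged components widen exactly as much as the components they absorb require.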
As the ground truth for the computation of the KL divergence, we will use Monte Carlo simulation with a large number of particles:

$$KL(f \,\|\, g) = \int f \log \frac{f}{g} \approx \frac{1}{n} \sum_{t=1}^{n} \log \frac{f(x_t)}{g(x_t)} \tag{10}$$

While this method is asymptotically exact, it is painfully slow. In [17] the authors proposed a couple of approximations of the KL divergence based on counting the influence of only the nearest components in the two mixtures ("weak interactions"). They demonstrated that their approximation is better than the previous one published in [18]. The conclusion of their work is that the KL divergence based on the unscented transformation [19] (also known as the "quadratic approximation") gives excellent results with a slight computational overhead. This method is based on the approximate computation of the expectation of some function $h$ under a $d$-dimensional Gaussian $f$ with mean $\mu$ and covariance matrix $\Sigma$ as

$$\int f(x)\, h(x)\, dx \approx \frac{1}{2d} \sum_{k=1}^{2d} h(x_k) \tag{11}$$

where the set of $2d$ "sigma points" $x_k$ is defined as

$$x_k = \mu + \big(\sqrt{d\Sigma}\big)_k, \quad k = 1, \ldots, d$$
$$x_{d+k} = \mu - \big(\sqrt{d\Sigma}\big)_k, \quad k = 1, \ldots, d \tag{12}$$

We will use this method as the current state of the art to compare against the variational method. Given two mixtures $p$ and $p_1$, the KL divergence can be separated into two terms:

$$KL(p, p_1) = -H(p) - \int_y p(y) \log p_1(y) = \int_y p(y) \log p(y) - \int_y p(y) \log p_1(y) \tag{13}$$

We note that the optimization we performed in Eq. (6) is the variational maximization of a lower bound on $\int_y p(y) \log p_1(y)$. By substituting the $S \times C$ matrix $q$ (readily computed in Eq. (8)) into Eq. (7), the upper bound for $-\int_y p(y) \log p_1(y)$ follows. In the same manner, the lower bound on the entropy $H(p)$ of the Gaussian mixture $p$ with parameters $(\pi_c, \mu_c, \Phi_c)$ can be approximated as

$$\sum_c \Big\{ \pi_c \big( \tfrac{1}{2} \log \det(2\pi\Phi_c) \big) + \log(\pi_c) \Big\} < H(p) \tag{14}$$

The summation of the lower and upper bounds of the two terms in the KL divergence need not lead to an unambiguous conclusion on the nature of the approximation. Empirically, we found that the entropy term contributes negligibly to the KL divergence.

4. Inference of classes and transformations. Learning the scenes in the model

Inference (posterior optimization).
In this section we omit the rotation R by treating it as an identity transformation, in order to keep the derivations simple. For our

model it is possible to derive an exact EM algorithm that optimizes the free energy [15] of the form

$$F = \sum_t \sum_{c_t, Z_t, T_t} q(c_t, Z_t, T_t) \log \frac{p(x_t \mid c_t, Z_t, T_t)\, \pi_{c_t}}{q(c_t, Z_t, T_t)} \tag{15}$$

For given parameters, we can optimize the free energy with respect to the posterior $q$. We express the posterior as $q(c_t, Z_t)\, q(T_t \mid c_t, Z_t)$ and optimize $F$ under the normalization constraints $\sum_{c_t, Z_t} q(c_t, Z_t) = 1$ and $\sum_{T_t} q(T_t \mid c_t, Z_t) = 1$, which gives the same result as applying the Bayes rule:

$$q(T_t \mid c_t, Z_t) \propto p(x_t \mid c_t, Z_t, T_t), \qquad q(c_t, Z_t) \propto p(c_t)\, \exp\Big\{ \sum_{T_t} q(T_t \mid c_t, Z_t) \log p(x_t \mid c_t, Z_t, T_t) \Big\} \tag{16}$$

Parameter optimization. Finding the derivatives of $F$ with respect to the mean of cluster $c_t = k$ we get

$$\sum_{t=1}^{T} \sum_{\{T_t, c_t\}} q(\{c_t, T_t\})\, (W T_t Z_t)'\, (W T_t Z_t \Phi_k Z_t' T_t' W')^{-1}\, (x_t - W T_t Z_t \mu_k) = 0 \tag{17}$$

It can be shown that

$$T' W' W T Z \Phi^{-1} Z' T' W' x_t = Z \Phi^{-1} Z'\, \mathrm{diag}(T' X_t)$$
$$T' W' W T Z \Phi^{-1} Z' T' W' W T = Z \Phi^{-1} Z'\, \mathrm{diag}(T' m) \tag{18}$$

where $m \triangleq \mathrm{diag}(W'W)$ is the binary mask that shows the position of the observable image within the latent image (upper left corner), and where $X \triangleq W'x$ is the frame $x$ zero-padded to the resolution of the latent image. Thus, assuming that the zoom has a small effect on the inverse variances, i.e., $(Z\Phi Z')^{-1} \approx (Z\Phi^{-1}Z')$ (see footnote 1), we obtain simple update rules, e.g., for $\mu_k$:

$$\mu_k = \frac{\sum_{t=1}^{T} \sum_{Z_t} q(c_t = k, Z_t)\, Z_t^{-1} \sum_{T_t} q(T_t \mid c_t = k, Z_t)\, (T_t' X_t)}{\sum_{t=1}^{T} \sum_{Z_t} q(c_t = k, Z_t)\, Z_t^{-1} \sum_{T_t} q(T_t \mid c_t = k, Z_t)\, (T_t' m)} \tag{19}$$

where $Z^{-1}$ is the pseudoinverse of the matrix $Z$, or the inverse zoom. In a similar fashion, we obtain the derivatives of $F$ with respect to the other two types of model parameters, $\Phi_k$ and $\pi_k$, and derive the update equations.

It may seem at first that zero-padding the original frame to the size of the latent image constitutes an unjustified manipulation of the data. But, taking into account that zero is the neutral element for the summation of the sufficient statistics in (19), it is actually a mathematical convenience to treat all variables as being of the same dimensionality (resolution).
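The structure of the mean update (19), a sum of shifted zero-padded frames divided by a sum of shifted masks, can be shown in a reduced setting. The sketch below makes several simplifying assumptions that are not in the paper: a single class, no zoom, a hard (delta) posterior over translations so that each frame comes with one known shift, and cyclic shifts via `np.roll`. The function name `estimate_scene_mean` and the sizes are hypothetical.

```python
import numpy as np

def estimate_scene_mean(frames, shifts, scene_shape):
    """Accumulate shifted zero-padded frames and masks; return (mean, visit counts)."""
    num = np.zeros(scene_shape)
    den = np.zeros(scene_shape)
    for x, (dy, dx) in zip(frames, shifts):
        h, w = x.shape
        X = np.zeros(scene_shape)
        X[:h, :w] = x                       # frame zero-padded to scene resolution
        m = np.zeros(scene_shape)
        m[:h, :w] = 1.0                     # binary mask of observed pixels
        num += np.roll(X, shift=(dy, dx), axis=(0, 1))   # frame moved back into place
        den += np.roll(m, shift=(dy, dx), axis=(0, 1))   # per-pixel visit counter
    # Pixels never visited keep a zero mean instead of dividing by zero.
    mean = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return mean, den

rng = np.random.default_rng(2)
scene = rng.uniform(size=(20, 30))
shifts = [(0, 0), (3, 5), (10, 12), (6, 20)]
# Each frame is the upper-left 8x12 window of the scene after shifting by (-dy, -dx).
frames = [np.roll(scene, shift=(-dy, -dx), axis=(0, 1))[:8, :12] for dy, dx in shifts]
mean, visits = estimate_scene_mean(frames, shifts, scene.shape)
print(int(visits.max()), int((visits > 0).sum()))
```

Zero-padding leaves the numerator untouched outside the frame, and the mask counts only the observed pixels, so the ratio averages each scene pixel over exactly the frames that saw it.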
The intuition behind (19) is that the mean latent (panoramic) image is the weighted sum of the properly shifted and scaled frames, normalized by a counter that keeps track of how many times each pixel was visited.

Footnote 1. To avoid degenerate solutions, the likelihood is scaled by the increase in the number of pixels that the zoom causes.

Speeding up inference and parameter optimization using FFTs. The inference and update equations (16) and (17) involve either testing all possible transformations or summing over all possible transformations $T$. If all possible integer shifts are considered (which is desirable, since one can then handle arbitrary interframe shifts), these operations can be performed efficiently in the Fourier domain by identifying them as either convolutions or correlations. For example, (19) can be computed efficiently using the two-dimensional FFT [11] as

$$\sum_T q(T \mid c, Z)\, (T'X) = \mathrm{IFFT2}\big[ \mathrm{conj}(\mathrm{FFT2}(q(T))) \circ \mathrm{FFT2}(X) \big] \tag{20}$$

where $\circ$ denotes pointwise multiplication and conj denotes the complex conjugate. This is done for each combination of the class and scale variables, and a similar convolution of the transformation posterior is also applied to the mask $m$. Similarly, FFTs are used in inference to compute the Mahalanobis distance in (4). This reduces the computational complexity of both inference and parameter update from $N^2$ to $N \log N$ ($N$ being the number of pixels), allows us to analyze video frames of higher resolution, and demonstrates the benefit of keeping the translation variable $T$ separate from the cropping $W$ and zoom $Z$ in the model. The computation is still proportional to the number of classes, as well as to the number of zoom levels we search and sum over in the E and M steps, but the number of these configurations is typically much smaller than the number of possible shifts in the image.

On-line learning. Batch EM learning suffers from two drawbacks: the need to preset the number of classes $C$, and the need to iterate. The structure of realistic video allows the development of more efficient algorithms.
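The identity behind Eq. (20) is a circular cross-correlation computed in the Fourier domain, and it can be checked numerically against the brute-force sum over all shifts. The sketch assumes cyclic shifts, a real-valued posterior `q` stored as an image indexed by the shift (dy, dx), and the convention that the shifted frame is `np.roll(X, (-dy, -dx))`; under the opposite sign convention the conjugate moves to the other factor. The function names are hypothetical.

```python
import numpy as np

def shifted_sum_bruteforce(q, X):
    """Sum over all cyclic shifts of q[dy, dx] * (X shifted by (-dy, -dx))."""
    H, W = X.shape
    out = np.zeros((H, W))
    for dy in range(H):
        for dx in range(W):
            out += q[dy, dx] * np.roll(X, shift=(-dy, -dx), axis=(0, 1))
    return out

def shifted_sum_fft(q, X):
    """Same quantity via IFFT2[conj(FFT2(q)) * FFT2(X)]."""
    spectrum = np.conj(np.fft.fft2(q)) * np.fft.fft2(X)
    return np.real(np.fft.ifft2(spectrum))

rng = np.random.default_rng(3)
X = rng.uniform(size=(6, 8))                 # zero-padded frame (hypothetical size)
q = rng.uniform(size=(6, 8))
q /= q.sum()                                 # posterior over shifts sums to one
print(np.allclose(shifted_sum_bruteforce(q, X), shifted_sum_fft(q, X)))
```

The brute-force version touches every pixel once per shift, while the FFT version needs only three transforms and one pointwise product, which is the source of the reduction from quadratic to N log N cost.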
Frames in video typically come in bursts of a single class, which means that the algorithm does not need to test all classes against all frames all the time. We use an on-line variant of the EM algorithm with incremental estimation of sufficient statistics [15, 16]. The re-estimation update equations (Eq. (9)) are reformulated in the same manner.

In order to dynamically learn the number of scenes in the on-line EM algorithm, we introduce a threshold on the log-likelihood such that whenever the log-likelihood falls under the threshold a new scene is introduced. The sensitivity to the choice of the threshold is overcome by setting a high threshold that guarantees that the likelihood of the dataset remains high, but which may lead to over-clustering. The problem of merging a large number of clusters, still much smaller than the number of frames, is addressed in Section

3. When the number of clusters is reduced to the order of 100, we apply full learning using the batch EM algorithm, with the number of clusters determined by the MDL criterion. Taking into account that camera/object shifts are by far the most common transformations in video, another speed-up is to perform translation-only invariant clustering in the first pass (by setting Z and R to identity matrices). This approach removes most of the variability in the data at little computational cost. The overall performance of our clustering is 35 frames per second on a 3 GHz PC.

5. Experiments

Computing the KL divergence between two mixtures. We tested the performance of computing the upper bound on the KL divergence between two Gaussian mixtures in a setup similar to [17]. The mean of each Gaussian in the five-dimensional space is randomly chosen according to N(0, 1), whereas the covariance matrix is sampled from a Wishart distribution. In order to avoid almost singular covariance matrices that may arise in the random sampling, the condition number is set to be at least 20. The covariance matrix is premultiplied by a small number ɛ that accounts for the width of the Gaussians. Higher values of ɛ correspond to higher overlap between the blobs. We tested four methods: Monte Carlo simulation with 10,000 particles (MC10000), assumed to be the gold standard; Monte Carlo simulation with 100 particles (MC100); the method based on the unscented transform; and our method (variational). We repeated each simulation 100 times and averaged the results. We present the results in Table 1 and Fig. 2. The best results were obtained via the unscented approximation, followed by our method and MC100. The bottom row of Table 1 indicates the relative processing time needed to compute the KL divergence for each method. While approximate, our method is by far the fastest of the proposed methods. In Fig.
2 we illustrate the accuracy of the computation of the KL divergence for different values of the parameter ɛ, and the computational complexity of our method for different dimensionalities of the space. As demonstrated, our method scales especially well in high-dimensional spaces.

Video clustering and navigation. We tested our system on an 18-minute-long home video. In the first pass of on-line learning, the video is summarized in 290 clusters, many of them repeating. We re-estimate this model until we end up with roughly three dozen classes. In all but the first pass we search over all configurations of the zoom, rotation and class variables. The complexity of the learning drops as we go higher in the hierarchy (due to the smaller number of clusters), so we do not need to be careful about the exact selection of the sequence of thresholds or numbers of classes: we simply train a large number of models, as the user can quickly choose any one of them at browsing time.

Figure 2. Computing the KL divergence between two mixtures (12 and 8 components). Comparison of Monte Carlo simulations, the unscented transform-based method and the method in this work. Top: KL divergence calculated between two MoGs in the 20-dimensional space. The horizontal axis depicts the regularization parameter epsilon. Our method, as anticipated, is an upper bound on the true KL divergence. Middle: KL divergence as a function of the dimensionality of the mixtures. The regularization parameter is fixed to 0.4. Bottom: Computational time in seconds for our method and the unscented-based method as a function of the space dimensionality. Monte Carlo simulations are too slow to scale.

[Table 1: columns ɛ, unscented, MC10000, variational, MC100; bottom row: Time.]

Table 1. Values of the KL divergence for four different methods and different values of the regularization parameter ɛ. MC10000 is taken as the gold standard. The variational method is a reasonable approximation and it is by far the fastest. As expected, for larger ɛ all methods make large errors.

Fig. 3 (top) illustrates the time line and scene partitioning using shot detectors and our method. On the left we hand-label the ground truth (also indicating the repeating scenes). In the middle we show shot detection results using a commercial shot detector based largely on the color histogram (from Microsoft MovieMaker). On the right we show the clustering results using our method, and indicate the cases where our method over-segmented. The clustering fails in the cases where the scene variability is not explained by the model. Some of the issues can be tackled by using a higher number of scale levels in the model and by increasing the scene size with respect to the frame size.

But the real benefit of this approach is in the novel video navigation and browsing tool. Supplemental video material (Fig. 3 bottom and nemanja/video1.wmv) demonstrates the usefulness of this method. Cluster means are visualized as thumbnail images that represent the index into the video. For each pixel in each frame in the video there is a mapping into the thumbnail image. The user browses the video by moving the mouse pointer over the active panel. Instantly, frames within the cluster that are located in the proximity of the cursor are retrieved and marked in green on the time line at the bottom of the interface. The user can further double-click on each thumbnail image, and it will decompose into the child clusters that were merged together in the re-estimation procedure. The browsing may then continue seamlessly at the different level. Informally tested on 165 users, the system proved to be very useful for rapidly grasping the content of a video not seen before.

6. Conclusions

In this work we presented a generative model for video that proved useful for unsupervised video clustering.
We specifically addressed the intractability of naive learning for large-scale problems by introducing a number of algorithmic techniques for rapid learning and inference. Our video analysis requires no hand-set parameters. The burden of selecting the optimal number of scenes, itself a highly subjective task, is shifted to the user, who chooses among the hierarchy of models. We believe that this model and the accompanying intuitive user interface will prove useful for quick and seamless video retrieval. Additionally, we will further explore the benefits of the proposed method for the rapid computation of the KL divergence, especially in high-dimensional spaces and for massive data sets.

Acknowledgments

This work was supported in part by Advanced Research and Development Activities (ARDA) under Contract MDA C

References

[1] T.F. Cootes, C.J. Taylor. Statistical Models of Appearance for Computer Vision, Technical Report, University of Manchester.
[2] A. Gray, A. Moore. Rapid Evaluation of Multiple Density Models, In Artificial Intelligence & Statistics.
[3] D. Mumford. Pattern theory: a unifying perspective, In Perception as Bayesian Inference, Cambridge University Press.
[4] R. Jones, D. DeMenthon, and D. Doermann. Building mosaics from video using MPEG motion vectors, In ACM Multimedia.
[5] P. Anandan, M. Irani, M. Kumar, and J. Bergen. Video as an image data source: Efficient representations and applications, In Proceedings of IEEE ICIP.
[6] A. Aner, J. R. Kender. Video summaries through mosaic-based shot and scene clustering, In ECCV.
[7] M. Irani, P. Anandan, S. Hsu. Mosaic based representations of video sequences and their applications, In ICCV, June.
[8] M. Brown, D. Lowe. Recognising Panoramas, ICCV03.
[9] M. Yeung, B. Yeo, B. Liu. Segmentation of video by clustering and graph analysis, In CVIU, 71:1, July.
[10] A. Girgensohn, J. Boreczky. Time-constrained keyframe selection technique, In Proc. IEEE Multimedia Computing and Systems.
[11] B.J. Frey and N. Jojic.
Fast, large-scale transformation-invariant clustering, In NIPS 14, Cambridge, MA: MIT Press.
[12] C. M. Bishop. Variational learning in graphical models and neural networks, In Proceedings 8th ICANN.
[13] J. Winn, A. Blake. Generative Affine Localisation and Tracking, In NIPS, 2003.


Figure 3. Top: Comparison of ground truth, shot detection and our method. The time line with ground truth segmentation (left) illustrates the positions and labels of 15 scenes in the video. Some of the scenes are repeating. The shot detection algorithm based on the color histogram (center) is sensitive to sudden swipes and changes in light. At the same time, scene changes go undetected if they are due to slow swipes of the camera. Our approach (right) correctly detects scenes in most cases. We label the situations when it over-segments. Bottom: A snapshot of the video navigation and browsing tool. See the demo at nemanja/video1.wmv

[14] B.J. Frey, N. Jojic. Transformation-invariant clustering using the EM algorithm. PAMI, 25(1), Jan.
[15] R. M. Neal, G. E. Hinton. A new view of the EM algorithm that justifies incremental, sparse and other variants, In Learning in Graphical Models, Norwell MA: Kluwer Academic Publishers.
[16] S. J. Nowlan. Soft Competitive Adaptation: Neural Network Learning Algorithms based on Fitting Statistical Mixtures, Ph.D. thesis, Carnegie Mellon University.
[17] J. Goldberger, S. Gordon, H. Greenspan. Efficient Image Similarity Measure based on Approximations of KL-Divergence Between Two Gaussian Mixtures, ICCV03.
[18] N. Vasconcelos. On the complexity of probabilistic image retrieval, In ICCV.
[19] S. Julier, J. K. Uhlmann. A general method for approximating non-linear transformations of probability distributions, Technical report, RRG, University of Oxford, 1996.
