Clustering Dynamic Textures with the Hierarchical EM Algorithm for Modeling Video


IEEE TRANS. ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, TO APPEAR

Adeel Mumtaz, Emanuele Coviello, Gert R. G. Lanckriet, Antoni B. Chan

Abstract: The dynamic texture (DT) is a probabilistic generative model, defined over space and time, that represents a video as the output of a linear dynamical system (LDS). The DT model has been applied to a wide variety of computer vision problems, such as motion segmentation, motion classification, and video registration. In this paper, we derive a new algorithm for clustering DT models that is based on the hierarchical EM algorithm. The proposed clustering algorithm is capable of both clustering DTs and learning novel DT cluster centers that are representative of the cluster members, in a manner that is consistent with the underlying generative probabilistic model of the DT. We also derive an efficient recursive algorithm for sensitivity analysis of the discrete-time Kalman smoothing filter, which is used as the basis for computing expectations in the E-step of the HEM algorithm. Finally, we demonstrate the efficacy of the clustering algorithm on several applications in motion analysis, including hierarchical motion clustering, semantic motion annotation, and learning bag-of-systems codebooks for dynamic texture recognition.

Index Terms: Dynamic Textures, Expectation Maximization, Kalman Filter, Bag of Systems, Video Annotation, Sensitivity Analysis.

1 INTRODUCTION

Modeling motion as a spatio-temporal texture has shown promise in a wide variety of computer vision problems which have otherwise proven challenging for traditional motion representations, such as optical flow [1], [2]. In particular, the dynamic texture model, proposed in [3], has demonstrated a surprising ability to abstract a wide variety of complex global patterns of motion and appearance into a simple spatio-temporal model. The dynamic texture (DT) is a probabilistic generative model, defined over space and time, that represents a video (i.e., a spatio-temporal volume) as the output of a linear dynamical system (LDS).
The model includes a hidden-state process, which encodes the motion of the video over time, and an observation variable that determines the appearance of each video frame, conditioned on the current hidden state. Both the hidden-state vector and the observation vector are representative of the entire image, enabling a holistic characterization of the motion for the entire sequence. The DT model has been applied to a wide variety of computer vision problems, including video texture synthesis [3], video registration [4], [5], motion and video texture segmentation [6], [7], [8], [9], [10], human activity recognition [11], human gait recognition [12], and motion classification [13], [14], [15], [16], [17], [18], [19], [20]. These successes illustrate both the modeling capabilities of the DT representation and the robustness of the underlying probabilistic framework. In this paper, we address the problem of clustering dynamic texture models, i.e., clustering linear dynamical systems. Given a set of DTs (e.g., each learned from a small video cube extracted from a large set of videos), the goal is to group similar DTs into K clusters, while also learning a representative DT center that can sufficiently summarize each group. This is analogous to standard K-means clustering, except that the datapoints are dynamic textures instead of real vectors. A robust DT clustering algorithm has several potential applications in video analysis, including: 1) hierarchical clustering of motion; 2) video indexing for fast video retrieval; 3) DT codebook generation for the bag-of-systems motion representation; 4) semantic video annotation via weakly-supervised learning. Finally, DT clustering can also serve as an effective method for learning DTs from a large dataset of video via hierarchical estimation.

(A. Mumtaz and A. B. Chan are with the Department of Computer Science, City University of Hong Kong. E-mail: adeelmumtaz@gmail.com, abchan@cityu.edu.hk. E. Coviello and G. R. G. Lanckriet are with the Department of Electrical and Computer Engineering, University of California, San Diego. E-mail: emanuetre@gmail.com, gert@ece.ucsd.edu.)
The parameters of the LDS lie on a non-Euclidean space (non-linear manifold), and hence cannot be clustered directly with the K-means algorithm, which operates on real vectors in Euclidean space. One solution, proposed in [18], first embeds the DTs into a Euclidean space using non-linear dimensionality reduction (NLDR), and then performs K-means on the low-dimensional space to obtain the clustering. While this performs the task of grouping the DTs into similar clusters, [18] is not able to generate novel DTs as cluster centers. These limitations could be addressed by clustering the DT parameters directly on the non-linear manifold, e.g., using intrinsic mean-shift [21] or LLE [22]. However, these methods require analytic expressions for the log and exponential map on the manifold, which are difficult to compute for the DT parameters. An alternative to clustering with respect to the manifold structure is to directly cluster the probability distributions of the DTs. One method for clustering probability distributions, in particular Gaussians, is the hierarchical expectation-maximization (HEM) algorithm for Gaussian mixture models (GMMs), first proposed in [23]. The HEM algorithm of [23] takes a Gaussian mixture model (GMM) with K_b mixture components and reduces it to another GMM with K_r components (K_r < K_b), where each of the new Gaussian components represents a group of the original Gaussians (i.e.,

forming a cluster of Gaussians). HEM proceeds by generating virtual samples from each of the Gaussian components in the base GMM. Using these virtual samples, the reduced GMM is then estimated using the standard EM algorithm. The key insight of [23] is that, by applying the law of large numbers, a sum over virtual samples can be replaced by an expectation over the base Gaussian components, yielding a clustering algorithm that depends only on the parameters of the base GMM. The components of the reduced GMM are the Gaussian cluster centers, while the base components that contributed to these centers are the cluster members. In this paper, we propose an HEM algorithm for clustering dynamic textures through their probability distributions [24]. The resulting algorithm is capable of both clustering DTs and learning novel DT cluster centers that are representative of the cluster members, in a manner that is consistent with the underlying generative probabilistic model of the DT. Besides clustering dynamic textures, the HEM algorithm can be used to efficiently learn a DT mixture from large datasets of video, using a hierarchical estimation procedure. In particular, intermediate DT mixtures are learned on small portions of the large dataset, and the final model is estimated by running HEM on the intermediate models. Because HEM is based on maximum-likelihood principles, it drives model estimation towards similar optimal parameter values as performing maximum-likelihood estimation on the full dataset. We demonstrate the efficacy of the HEM clustering algorithm for DTs on several computer vision problems. First, we perform hierarchical clustering of video textures, showing that HEM groups perceptually similar motions together. Second, we use HEM to learn DT mixture models for semantic motion annotation, based on the supervised multi-class labeling (SML) framework [25]. DT annotation models are learned efficiently from weakly-labeled videos, by aggregating over large amounts of data using the HEM algorithm.
Third, we generate codebooks with novel DT codewords for the bag-of-systems motion representation, and demonstrate improved performance on the task of dynamic texture recognition. The contributions of this paper are three-fold. First, we propose and derive the HEM algorithm for clustering dynamic textures (linear dynamical systems). This involves extending the original HEM algorithm [23] to handle mixture components with hidden states (which are distinct from the hidden assignments of the overall mixture). Second, we derive an efficient recursive algorithm for calculating the E-step of this HEM algorithm, which makes a novel contribution to the subfield of suboptimal filter analysis or sensitivity analysis [26]. In particular, we derive expressions for the behavior (mean, covariance, and cross-covariance) of the Kalman smoothing filter when a mismatched source is applied. Third, we demonstrate the applicability of our HEM algorithm on a wide variety of tasks, including hierarchical DT clustering, DTM density estimation from large amounts of data, and estimating DT codebooks for BoS representations. The remainder of this paper is organized as follows. Section 2 discusses related work, and we review dynamic texture models in Section 3. In Section 4, we derive the HEM algorithm for DT mixture models, and in Appendix A we derive an efficient algorithm for sensitivity analysis of the Kalman smoothing filter. Finally, Section 5 concludes the paper by presenting three applications of HEM with experimental evaluations.

2 RELATED WORK

[18] proposes to cluster DT models using non-linear dimensionality reduction (NLDR). First, the DTs are embedded into a Euclidean space using multidimensional scaling (MDS) and the Martin distance function. Next, the DTs are grouped together by applying K-means clustering on the low-dimensional embedded points. Generating representative DTs corresponding to the K-means cluster centers is challenging, due to the pre-image and out-of-sample limitations of kernelized NLDR techniques.
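The embed-then-cluster pipeline of [18] can be illustrated with classical MDS. The sketch below is a toy stand-in, not the method of [18] itself: it uses plain Euclidean distances between 2-D points in place of Martin distances between DTs, and recovers an embedding whose pairwise distances match the input matrix.

```python
import numpy as np

def classical_mds(D, d=2):
    """Embed points into R^d from a pairwise distance matrix D
    via classical multidimensional scaling (double centering)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # Gram matrix of the embedding
    w, V = np.linalg.eigh(B)              # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:d]         # keep the top-d components
    L = np.sqrt(np.maximum(w[idx], 0.0))
    return V[:, idx] * L                  # n x d embedding

# toy stand-in: distances between points that already live in the plane
pts = np.array([[0., 0.], [0., 1.], [5., 0.], [5., 1.]])
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
X = classical_mds(D, d=2)
# the embedding reproduces the input distances (up to rotation/reflection)
D2 = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
```

In the actual pipeline, K-means would then be run on the rows of X, and the pre-image problem noted above is exactly that a K-means centroid in this space has no corresponding DT.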
[18] works around this problem by selecting the DT whose low-dimensional embedding is closest to the low-dimensional cluster center as the representative DT for the cluster. The HEM algorithm for GMMs, proposed in [23], has been employed in [27] to build GMM hierarchies for efficient image indexing, and in [25] to estimate GMMs from large image datasets for semantic annotation. In this paper, we extend the HEM algorithm to dynamic texture mixtures (DTMs), where each mixture component is an LDS. In contrast to GMMs, the E-step inference of HEM for DTMs requires a substantial derivation to obtain an efficient algorithm, due to the hidden state variables of the LDS. Other approaches to clustering probability distributions have also been proposed in the literature. [28] introduces a generic clustering algorithm based on Bregman divergences. Setting the Bregman divergence to the discrete KL divergence yields an algorithm for clustering multinomials. When the Bregman divergence is the sum of the Mahalanobis distance and the Burg matrix divergence, the result is a clustering algorithm for multivariate Gaussians [29], which uses the covariances and means of the base Gaussians. Similarly, [30] minimizes the weighted sum of the Kullback-Leibler (KL) divergences between the cluster center and each probability distribution, yielding an alternating minimization procedure identical to [29]. While this approach could also be applied to clustering dynamic textures, it would require calculating prohibitively large (if not infinite) covariance matrices. Previous works on sensitivity analysis [31], [32] focus on the actual covariance matrix of the error, i.e., the covariance of the error between the state estimate of the Kalman smoothing filter and the true state when a mismatched source LDS is applied. In contrast, the HEM E-step requires the expectation, covariance, and cross-covariance of the smoothed state estimator under a different LDS, i.e., the actual expected behavior of the Kalman smoothing filter when a mismatched source is applied. Some of these quantities are related to the actual error covariance matrix, and some are not.
Hence, the results from [31], [32] cannot be directly used to obtain our HEM E-step, or vice versa. With respect to our previous work, the HEM algorithm for DTMs was originally proposed in [24]. In contrast to [24], this paper presents a more complete analysis of HEM-DTM and significantly more experimental results: 1) a complete derivation of the HEM algorithm for DT mixtures; 2) a complete

derivation of the sensitivity analysis of the Kalman smoothing filter, used for the E-step in HEM-DTM; 3) a new experiment on semantic motion annotation using a video dataset of real scenes; 4) new experiments on dynamic texture recognition with the bag-of-systems representation using different datasets, as well as comparisons with other state-of-the-art methods. Finally, we have also applied HEM-DTM to music annotation in [33], which mainly focuses on large-scale experiments and interpreting the parameters of the learned DT annotation models.

3 DYNAMIC TEXTURE MODELS

A dynamic texture (DT) [3] is a generative model for both the appearance and the dynamics of video sequences. The model consists of a random process containing an observation variable y_t, which encodes the appearance component (vectorized video frame at time t), and a hidden state variable x_t, which encodes the dynamics (evolution of the video over time). The appearance component is drawn at each time instant, conditionally on the current hidden state. The state and observation variables are related through the linear dynamical system (LDS) defined by

x_t = A x_{t-1} + v_t,  (1)
y_t = C x_t + w_t + ȳ,  (2)

where x_t ∈ R^n and y_t ∈ R^m are real vectors (typically n ≪ m). The matrix A ∈ R^{n×n} is a state transition matrix, which encodes the dynamics or evolution of the hidden state variable (i.e., the motion of the video), and the matrix C ∈ R^{m×n} is an observation matrix, which encodes the appearance component of the video sequence. The vector ȳ ∈ R^m is the mean of the dynamic texture (i.e., the mean video frame). v_t is a driving noise process, and is zero-mean Gaussian distributed, i.e., v_t ~ N(0, Q), where Q ∈ R^{n×n} is a covariance matrix. w_t is the observation noise and is also zero-mean Gaussian, i.e., w_t ~ N(0, R), where R ∈ R^{m×m} is a covariance matrix (typically, it is assumed the observation noise is i.i.d. between the pixels, and hence R = rI_m is a scaled identity matrix).
Finally, the initial condition is specified as x_1 ~ N(µ, S), where µ ∈ R^n is the mean of the initial state, and S ∈ R^{n×n} is its covariance. The dynamic texture is specified by the parameters Θ = {A, Q, C, R, µ, S, ȳ}. A number of methods are available to learn the parameters of the dynamic texture from a training video sequence, including maximum-likelihood (e.g., expectation-maximization [34]), or a suboptimal, but computationally efficient, greedy least-squares procedure [3]. While a dynamic texture models a time-series as a single sample from a linear dynamical system, the dynamic texture mixture (DTM), proposed in [8], models multiple time-series as samples from a set of K dynamic textures. The DTM model introduces an assignment random variable z ~ multinomial(π_1, ..., π_K), which selects the parameters of one of the K dynamic texture components for generating a video observation, resulting in the system equations

x_t = A_z x_{t-1} + v_t,
y_t = C_z x_t + w_t + ȳ_z,  (3)

where each mixture component is parameterized by Θ_z = {A_z, Q_z, C_z, R_z, µ_z, S_z, ȳ_z}, and the DTM model is parameterized by Θ = {π_z, Θ_z}_{z=1}^K. Given a set of video samples, the maximum-likelihood parameters of the DTM can be estimated with recourse to the expectation-maximization (EM) algorithm [8]. The EM algorithm for DTM alternates between estimating first- and second-order statistics of the hidden states, conditioned on each video, with the Kalman smoothing filter (E-step), and computing new parameters given these statistics (M-step).

4 THE HEM ALGORITHM FOR DYNAMIC TEXTURES

The hierarchical expectation-maximization (HEM) algorithm was proposed in [23] to reduce a Gaussian mixture model (GMM) with a large number of components into a representative GMM with fewer components. In this section we derive the HEM algorithm when the mixture components are dynamic textures.

4.1 Formulation

Let Θ^(b) = {π_i^(b), Θ_i^(b)}_{i=1}^{K^(b)} denote the base DT mixture model with K^(b) components.
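The generative process in (1)-(3) is straightforward to simulate. Below is a minimal sketch (with illustrative, hand-picked parameter values) that draws one video sample y_{1:τ} from a single DT component; sampling from a DTM would additionally draw the assignment z first and use the selected component's parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_dt(A, C, Q, R, mu0, S0, ybar, tau):
    """Draw one video y_{1:tau} from the DT of eqns (1)-(2)."""
    n, m = A.shape[0], C.shape[0]
    x = rng.multivariate_normal(mu0, S0)          # x_1 ~ N(mu, S)
    ys = []
    for _ in range(tau):
        # y_t = C x_t + w_t + ybar, with w_t ~ N(0, R)
        y = C @ x + rng.multivariate_normal(np.zeros(m), R) + ybar
        ys.append(y)
        # x_{t+1} = A x_t + v_t, with v_t ~ N(0, Q)
        x = A @ x + rng.multivariate_normal(np.zeros(n), Q)
    return np.array(ys)                           # tau x m

n, m, tau = 2, 5, 30
A = 0.9 * np.eye(n)                               # stable dynamics
C = rng.standard_normal((m, n))
Q, R = 0.1 * np.eye(n), 0.01 * np.eye(m)
Y = sample_dt(A, C, Q, R, np.zeros(n), np.eye(n), np.zeros(m), tau)
```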
The likelihood of the observed random variable y_{1:τ} ~ Θ^(b) is given by

p(y_{1:τ}|Θ^(b)) = Σ_{i=1}^{K^(b)} π_i^(b) p(y_{1:τ}|z^(b) = i, Θ^(b)),  (4)

where y_{1:τ} is the video, τ is the video length, and z^(b) ~ multinomial(π_1^(b), ..., π_{K^(b)}^(b)) is the hidden variable that indexes the mixture components. p(y_{1:τ}|z^(b) = i, Θ^(b)) is the likelihood of the video y_{1:τ} under the i-th DT mixture component, and π_i^(b) is the prior weight of the i-th component. The goal is to find a reduced DT mixture model, Θ^(r), which represents (4) using fewer mixture components. The likelihood of the observed video random variable y_{1:τ} ~ Θ^(r) is

p(y_{1:τ}|Θ^(r)) = Σ_{j=1}^{K^(r)} π_j^(r) p(y_{1:τ}|z^(r) = j, Θ^(r)),  (5)

where K^(r) is the number of DT components in the reduced model (K^(r) < K^(b)), and z^(r) ~ multinomial(π_1^(r), ..., π_{K^(r)}^(r)) is the hidden variable for indexing components in Θ^(r). Note that we will always use i and j to index the components of the base model Θ^(b) and the reduced model Θ^(r), respectively. We will also use the short-hand Θ_i^(b) and Θ_j^(r) to denote the i-th component of Θ^(b) and the j-th component of Θ^(r), respectively. For example, we denote p(y_{1:τ}|z^(b) = i, Θ^(b)) = p(y_{1:τ}|Θ_i^(b)).

4.2 Parameter estimation

To obtain the reduced model, HEM [23] considers a set of N virtual samples drawn from the base model Θ^(b), such that N_i = N π_i^(b) video samples are drawn from the i-th component. The DT, however, has both observable Y and hidden state X variables (which are distinct from the hidden assignments of the overall mixture). To adapt HEM to DT models with hidden state variables, the most straightforward approach is to draw virtual samples from both X and Y according to their joint distribution. However, when computing the parameters of a new DT of the reduced model, there is no guarantee

that the virtual hidden states from the base models live in the same basis (equivalent DTs can be formed by scaling, rotating, or permuting A, C, and X). This basis mismatch will cause problems when estimating parameters from the virtual samples of the hidden states. The key insight is that, in order to remove the ambiguity caused by multiple equivalent hidden-state representations, we must only generate virtual samples from the observable Y, while treating the hidden states X as additional missing information in HEM. We denote the set of N_i virtual video samples for the i-th component as Y_i = {y_{1:τ}^(i,m)}_{m=1}^{N_i}, where y_{1:τ}^(i,m) ~ Θ_i^(b) is a single video sample and τ is the length of the virtual video (a parameter we can choose). The entire set of N samples is denoted as Y = {Y_i}_{i=1}^{K^(b)}. To obtain a consistent hierarchical clustering, we also assume that all the samples in a set Y_i are eventually assigned to the same reduced component Θ_j^(r), as in [23]. The parameters of the reduced model can then be computed using maximum-likelihood estimation with the virtual video samples,

Θ^(r)* = argmax_{Θ^(r)} log p(Y|Θ^(r)),  (6)

where

log p(Y|Θ^(r)) = Σ_{i=1}^{K^(b)} log p(Y_i|Θ^(r)) = Σ_{i=1}^{K^(b)} log Σ_{j=1}^{K^(r)} π_j^(r) p(Y_i|z_i = j, Θ^(r)) = Σ_{i=1}^{K^(b)} log Σ_{j=1}^{K^(r)} π_j^(r) ∫ p(Y_i, X_i|Θ_j^(r)) dX_i,  (7)

and X_i = {x_{1:τ}^(i,m)}_{m=1}^{N_i} are the hidden-state variables corresponding to Y_i, and z_i is the hidden variable assigning Y_i to a mixture component in Θ^(r). (7) requires marginalizing over the hidden states {X, Z}, and hence (6) can be solved using the EM algorithm [35], which is an iterative optimization method that alternates between estimating the hidden variables with the current parameters, and computing new parameters given the estimated hidden variables (the "complete data"):

E-Step: Q(Θ^(r), Θ̂^(r)) = E_{X,Z|Y,Θ̂^(r)}[log p(X, Y, Z|Θ^(r))],
M-Step: Θ̂^(r)* = argmax_{Θ^(r)} Q(Θ^(r), Θ̂^(r)),

where Θ̂^(r) is the current estimate of the parameters, p(X, Y, Z|Θ^(r)) is the complete-data likelihood, and E_{X,Z|Y,Θ̂^(r)} is the conditional expectation with respect to the current model parameters.
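The key quantity HEM needs is an expected log-likelihood under the base component rather than an average over actual samples. For the DT this requires the sensitivity analysis derived later, but the idea can be checked numerically in the simple 1-D Gaussian case of [23], where E_{y~N(μ_b,σ_b²)}[log N(y; μ_r, σ_r²)] has a closed form. This is a toy sketch; the names and values are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_logpdf(y, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (y - mu) ** 2 / var)

def expected_gauss_loglik(mu_b, var_b, mu_r, var_r):
    """Closed-form E_{y ~ N(mu_b, var_b)}[ log N(y; mu_r, var_r) ]."""
    return -0.5 * (np.log(2 * np.pi * var_r)
                   + (var_b + (mu_b - mu_r) ** 2) / var_r)

# a Monte Carlo average over "virtual samples" converges to the
# closed form, which depends only on the two sets of parameters
mu_b, var_b, mu_r, var_r = 1.0, 2.0, 0.5, 1.5
y = rng.normal(mu_b, np.sqrt(var_b), size=200_000)
mc = gauss_logpdf(y, mu_r, var_r).mean()
exact = expected_gauss_loglik(mu_b, var_b, mu_r, var_r)
```

The same replacement of a virtual-sample sum by an exact expectation is what makes the HEM-DTM E-step depend only on the base and reduced DT parameters.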
As is common with the EM formulation for mixture models, we introduce a hidden assignment variable z_{i,j}, which is an indicator variable for when the video sample set Y_i is assigned to the j-th component of Θ^(r), i.e., when z_i = j. The complete-data log-likelihood is then

log p(X, Y, Z|Θ^(r)) = log Π_{i=1}^{K^(b)} Π_{j=1}^{K^(r)} (π_j^(r) p(Y_i, X_i|Θ_j^(r)))^{z_{i,j}} = Σ_{i=1}^{K^(b)} Σ_{j=1}^{K^(r)} z_{i,j} log π_j^(r) + z_{i,j} log p(Y_i, X_i|Θ_j^(r)).  (8)

We next derive the Q function, E-step, and M-step.

4.3 Q function for HEM-DTM

In the E-step, the Q function is obtained by taking the conditional expectation, with respect to the hidden variables {X, Z}, of the complete-data likelihood in (8),

Q(Θ^(r), Θ̂^(r)) = Σ_{i=1}^{K^(b)} Σ_{j=1}^{K^(r)} E_{X,Z|Y,Θ̂^(r)}[z_{i,j} log π_j^(r) + z_{i,j} log p(Y_i, X_i|Θ_j^(r))] = Σ_{i=1}^{K^(b)} Σ_{j=1}^{K^(r)} ẑ_{i,j} log π_j^(r) + ẑ_{i,j} E_{X_i|Y_i,Θ̂_j^(r)}[log p(Y_i, X_i|Θ_j^(r))],  (9)

where (9) follows from

E_{X,Z|Y,Θ̂^(r)}[z_{i,j} log p(Y_i, X_i|Θ_j^(r))] = E_{Z|Y,Θ̂^(r)}[E_{X|Z,Y,Θ̂^(r)}[z_{i,j} log p(Y_i, X_i|Θ_j^(r))]] = E_{Z|Y,Θ̂^(r)}[z_{i,j}] E_{X_i|Y_i,z_{i,j}=1,Θ̂^(r)}[log p(Y_i, X_i|Θ_j^(r))] = ẑ_{i,j} E_{X_i|Y_i,Θ̂_j^(r)}[log p(Y_i, X_i|Θ_j^(r))],  (10)

and ẑ_{i,j} is the probability that sample set Y_i is assigned to component j of Θ^(r), obtained with Bayes' rule,

ẑ_{i,j} = E_{Z|Y,Θ̂^(r)}[z_{i,j}] = p(z_i = j|Y_i, Θ̂^(r)) = π_j^(r) p(Y_i|Θ̂_j^(r)) / Σ_{j'=1}^{K^(r)} π_{j'}^(r) p(Y_i|Θ̂_{j'}^(r)).  (11)

For the likelihood of the virtual samples, p(Y_i|Θ̂_j^(r)), we can obtain an approximation that only depends on the model parameters Θ_i^(b) that generated the samples,

log p(Y_i|Θ̂_j^(r)) = Σ_{m=1}^{N_i} log p(y_{1:τ}^(i,m)|Θ̂_j^(r)) = N_i (1/N_i) Σ_{m=1}^{N_i} log p(y_{1:τ}^(i,m)|Θ̂_j^(r)) ≈ N_i E_{y|Θ_i^(b)}[log p(y_{1:τ}|Θ̂_j^(r))],  (12)

where (12) follows from the law of large numbers [23] (as N_i → ∞). Substituting into (11), we get the expression for ẑ_{i,j}, similar to the one derived in [23],

ẑ_{i,j} = π_j^(r) exp(N_i E_{y|Θ_i^(b)}[log p(y_{1:τ}|Θ̂_j^(r))]) / Σ_{j'=1}^{K^(r)} π_{j'}^(r) exp(N_i E_{y|Θ_i^(b)}[log p(y_{1:τ}|Θ̂_{j'}^(r))]).  (13)

For the last term in (10), we have

E_{X_i|Y_i,Θ̂_j^(r)}[log p(Y_i, X_i|Θ_j^(r))] = Σ_{m=1}^{N_i} E_{x|y_{1:τ}^(i,m),Θ̂_j^(r)}[log p(y_{1:τ}^(i,m), x_{1:τ}^(i,m)|Θ_j^(r))] ≈ N_i E_{y|Θ_i^(b)} E_{x|y,Θ̂_j^(r)}[log p(y_{1:τ}, x_{1:τ}|Θ_j^(r))],  (14)

where, again, (14) follows from the law of large numbers. Hence, the Q function is given by

Q(Θ^(r), Θ̂^(r)) = Σ_{i=1}^{K^(b)} Σ_{j=1}^{K^(r)} ẑ_{i,j} log π_j^(r) + ẑ_{i,j} N_i E_{y|Θ_i^(b)} E_{x|y,Θ̂_j^(r)}[log p(y_{1:τ}, x_{1:τ}|Θ_j^(r))].  (15)

Note that the form of the Q function in (15) is similar to that of the EM algorithm for DTM [8]. The first difference is the additional expectation with respect to Θ_i^(b). In HEM, each base DT Θ_i^(b) takes the role of a data-point in standard EM, where an additional expectation w.r.t. Θ_i^(b) averages over the possible values of the data-point, yielding the double expectation E_{y|Θ_i^(b)} E_{x|y,Θ̂_j^(r)}. The second difference is the additional weighting of N_i on the second term, which accounts for the prior probabilities of each base DT. Given these two differences with EM-DTM, the Q function for HEM-DTM will have the same form as that of EM [8, eqn. 16], but with two modifications: 1) conditional statistics of the hidden state will be computed using a double expectation, E_{y|Θ_i^(b)} E_{x|y,Θ̂_j^(r)}; 2) an additional weight N_i will be applied when aggregating these expectations. Therefore, it can be shown that the HEM-DTM Q function is

Q(Θ^(r); Θ̂^(r)) = Σ_{j=1}^{K^(r)} { N̂_j log π_j^(r) − (1/2) [ tr(R_j^{-1}(Λ̂_j − Γ̂_j C_j^T − C_j Γ̂_j^T + C_j Φ̂_j C_j^T)) + tr(S_j^{-1}(η̂_j − ξ̂_j µ_j^T − µ_j ξ̂_j^T + M̂_j µ_j µ_j^T)) + tr(Q_j^{-1}(ϕ̂_j − Ψ̂_j A_j^T − A_j Ψ̂_j^T + A_j φ̂_j A_j^T)) + M̂_j (τ log|R_j| + (τ−1) log|Q_j| + log|S_j|) ] },  (16)

where we define the aggregate statistics

N̂_j = Σ_i ẑ_{i,j},                          M̂_j = Σ_i ŵ_{i,j},
ξ̂_j = Σ_i ŵ_{i,j} x̂_1^(i,j),                η̂_j = Σ_i ŵ_{i,j} P̂_{1,1}^(i,j),
Φ̂_j = Σ_i ŵ_{i,j} Σ_{t=1}^τ P̂_{t,t}^(i,j),    ϕ̂_j = Σ_i ŵ_{i,j} Σ_{t=2}^τ P̂_{t,t}^(i,j),
φ̂_j = Σ_i ŵ_{i,j} Σ_{t=2}^τ P̂_{t−1,t−1}^(i,j),  Ψ̂_j = Σ_i ŵ_{i,j} Σ_{t=2}^τ P̂_{t,t−1}^(i,j),
β̂_j = Σ_i ŵ_{i,j} Σ_{t=1}^τ x̂_t^(i,j),        γ̂_j = Σ_i ŵ_{i,j} Σ_{t=1}^τ û_t^(i),
Λ̂_j = Σ_i ŵ_{i,j} Σ_{t=1}^τ Û_t^(i,j),        Γ̂_j = Σ_i ŵ_{i,j} Σ_{t=1}^τ Ŵ_t^(i,j),

with ŵ_{i,j} = ẑ_{i,j} N_i = ẑ_{i,j} π_i^(b) N.
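Given the matrix of expected log-likelihoods E_{y|Θ_i^(b)}[log p(y_{1:τ}|Θ̂_j^(r))], the soft assignments in (13) are a softmax over j. A minimal sketch, with made-up numbers standing in for the expected log-likelihoods, using the standard max-subtraction trick since the exponents N_i E[·] are large and negative:

```python
import numpy as np

def soft_assignments(log_pi_r, N, E_loglik):
    """z_hat[i, j] as in eqn (13): posterior that base component i is
    assigned to reduced component j, computed stably in the log domain.

    log_pi_r : (K_r,) log prior weights of the reduced components
    N        : (K_b,) virtual sample counts N_i
    E_loglik : (K_b, K_r) expected log-likelihoods E_{y|b_i}[log p(y|r_j)]
    """
    logits = log_pi_r[None, :] + N[:, None] * E_loglik   # K_b x K_r
    logits -= logits.max(axis=1, keepdims=True)          # avoid underflow
    z = np.exp(logits)
    return z / z.sum(axis=1, keepdims=True)

# toy numbers: 3 base components, 2 reduced components
E = np.array([[-10., -50.],
              [-12., -48.],
              [-60.,  -9.]])
z_hat = soft_assignments(np.log([0.5, 0.5]), np.array([10., 10., 10.]), E)
```

Without the max subtraction, exp(N_i E[·]) would underflow to zero for realistic values, so the log-domain form is essential in practice.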
The individual conditional state expectations are

x̂_t^(i,j) = E_{y|Θ_i^(b)} E_{x|y,Θ̂_j^(r)}[x_t],  (17)
P̂_{t,t}^(i,j) = E_{y|Θ_i^(b)} E_{x|y,Θ̂_j^(r)}[x_t x_t^T],  (18)
P̂_{t,t−1}^(i,j) = E_{y|Θ_i^(b)} E_{x|y,Θ̂_j^(r)}[x_t x_{t−1}^T],  (19)
Ŵ_t^(i,j) = E_{y|Θ_i^(b)}[(y_t − ȳ_j) E_{x|y,Θ̂_j^(r)}[x_t]^T],  (20)
Û_t^(i,j) = E_{y|Θ_i^(b)}[(y_t − ȳ_j)(y_t − ȳ_j)^T],  (21)
û_t^(i) = E_{y|Θ_i^(b)}[y_t],  (22)

where Θ̂_j^(r) is the current parameter estimate for the j-th component of the reduced model. Note that the expectations of the hidden state, conditioned on each component Θ_i^(b), are computed through a common DT model Θ̂_j^(r). Hence, the potential problem with mismatches between the hidden-state bases of the Θ_i^(b) is avoided. We next derive an efficient algorithm for computing the E-step expectations.

4.4 E-step expectations

To simplify notation, we denote the parameters of a given base mixture component Θ_i^(b) as Θ_b = {A_b, Q_b, C_b, R_b, µ_b, S_b, ȳ_b}, and likewise for a reduced mixture component Θ̂_j^(r) as Θ_r = {A_r, Q_r, C_r, R_r, µ_r, S_r, ȳ_r}. We denote the corresponding expectations in (17)-(22) by dropping the i and j indices, {x̂_t, P̂_{t,t}, P̂_{t,t−1}, Ŵ_t, Û_t, û_t}. The inner expectations in (17)-(20), E_{x|y,Θ_r}, are related to the conditional state estimators of the Kalman smoothing filter of Θ_r, when given an observation y_{1:τ} [34], [26],

x̃_{t|τ} = E_{x_t|y_{1:τ},Θ_r}[x_t],  Ṽ_{t|τ} = cov_{x_t|y_{1:τ},Θ_r}(x_t),  Ṽ_{t,t−1|τ} = cov_{x_t,x_{t−1}|y_{1:τ},Θ_r}(x_t, x_{t−1}),  (23)

where ã_{t|s} denotes the estimate at time t, conditioned on the sequence y_{1:s}, w.r.t. Θ_r. Rewriting (17)-(20) in terms of the Kalman smoothing filter in (23),

x̂_t = E_{y|Θ_b}[x̃_{t|τ}],
P̂_{t,t} = E_{y|Θ_b}[Ṽ_{t|τ} + x̃_{t|τ} x̃_{t|τ}^T] = V̂_t + χ̂_t + x̂_t x̂_t^T,
P̂_{t,t−1} = E_{y|Θ_b}[Ṽ_{t,t−1|τ} + x̃_{t|τ} x̃_{t−1|τ}^T] = V̂_{t,t−1} + χ̂_{t,t−1} + x̂_t x̂_{t−1}^T,
Ŵ_t = E_{y|Θ_b}[(y_t − ȳ_r) x̃_{t|τ}^T] = κ̂_t + (û_t − ȳ_r) x̂_t^T,

where we define the double expectations

V̂_t = E_{y|Θ_b}[Ṽ_{t|τ}],  V̂_{t,t−1} = E_{y|Θ_b}[Ṽ_{t,t−1|τ}],  (24)
κ̂_t = cov_{y|Θ_b}(y_t, x̃_{t|τ}),  χ̂_t = cov_{y|Θ_b}(x̃_{t|τ}),  χ̂_{t,t−1} = cov_{y|Θ_b}(x̃_{t|τ}, x̃_{t−1|τ}).  (25)

Note that x̃_{t|τ} is the output of the state estimator of a Kalman smoothing filter for Θ_r when the observation y is generated from a different model Θ_b. This is also known as suboptimal filter analysis or sensitivity analysis [26], [36], where the goal is to analyze filter performance when an optimal filter, designed for some source distribution, is run on a different source distribution. Hence, the expectations in (24) and (25) can be calculated by sensitivity analysis of the Kalman smoothing filter for model Θ_r and source Θ_b. This procedure is summarized here, with the derivation appearing in Appendix A. First, consider the Kalman filters for Θ_b and Θ_r, which calculate the statistics of the state x_t given the previous observations y_{1:t−1},

x̄_{t|t−1} = E_{x_t|y_{1:t−1},Θ_b}[x_t],  V̄_{t|t−1} = cov_{x_t|y_{1:t−1},Θ_b}(x_t),
x̃_{t|t−1} = E_{x_t|y_{1:t−1},Θ_r}[x_t],  Ṽ_{t|t−1} = cov_{x_t|y_{1:t−1},Θ_r}(x_t).  (26)

Sensitivity analysis of the Kalman filter consists of marginalizing over the distribution of partial observations, y_{1:t−1} ~ Θ_b, and computing the mean and covariance of the true state x_t and the state estimators x̄_{t|t−1} and x̃_{t|t−1},

x̂_t = E_{Θ_b}[(x_t, x̄_{t|t−1}, x̃_{t|t−1})],  V̂_t = cov_{Θ_b}((x_t, x̄_{t|t−1}, x̃_{t|t−1})).  (27)

Second, using these results for the Kalman filter, sensitivity analysis of the Kalman smoothing filter consists of marginalizing (23) over the distribution of full observations, y_{1:τ} ~ Θ_b, yielding the expectations in (24) and (25). The remaining two expectations in (21)-(22) are calculated from the marginal statistics of Θ_b,

Û_t = E_{y|Θ_b}[(y_t − ȳ_r)(y_t − ȳ_r)^T] = cov_{y|Θ_b}(y_t, y_t) + (û_t − ȳ_r)(û_t − ȳ_r)^T = C_b V̂_t^{1,1} C_b^T + R_b + (û_t − ȳ_r)(û_t − ȳ_r)^T,  (28)
û_t = E_{y|Θ_b}[y_t] = C_b x̂_t^1 + ȳ_b.  (29)

Finally, for the soft assignments ẑ_{i,j}, the expected log-likelihood term, E_{y|Θ_b}[log p(y|Θ_r)], is calculated efficiently by expressing the observation log-likelihood of the DT in innovation form and marginalizing over y ~ Θ_b, resulting in (35)-(37). This is derived in Appendix A.
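For reference, the quantities in (26) come from the standard Kalman filter recursions, which both the E-step and the sensitivity analysis build on. Below is a minimal textbook implementation (not the paper's Algorithms 2-3, and ignoring the mean ȳ for simplicity) that returns the one-step-ahead predictions x_{t|t−1} and covariances V_{t|t−1}:

```python
import numpy as np

def kalman_filter(Y, A, C, Q, R, mu0, S0):
    """Standard Kalman filter for the LDS (1)-(2) with ybar = 0.
    Returns the one-step-ahead predictions x_{t|t-1} and covariances
    V_{t|t-1} for t = 1..tau."""
    n = A.shape[0]
    x_pred, V_pred = mu0, S0                     # prior for t = 1
    xs, Vs = [], []
    for y in Y:
        xs.append(x_pred)
        Vs.append(V_pred)
        S = C @ V_pred @ C.T + R                 # innovation covariance
        K = V_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
        x_filt = x_pred + K @ (y - C @ x_pred)   # measurement update
        V_filt = (np.eye(n) - K @ C) @ V_pred
        x_pred = A @ x_filt                      # time update (predict t+1)
        V_pred = A @ V_filt @ A.T + Q
    return np.array(xs), np.array(Vs)

rng = np.random.default_rng(2)
A = np.array([[0.9]]); C = np.array([[1.0]])
Q = np.array([[0.1]]); R = np.array([[0.5]])
Y = rng.standard_normal((50, 1))                 # arbitrary observations
xs, Vs = kalman_filter(Y, A, C, Q, R, np.zeros(1), np.eye(1))
```

Note that the covariances V_{t|t−1} do not depend on the observed values, only on the model parameters, which is one reason the sensitivity analysis of (27) is tractable.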
Algorithm 1 summarizes the procedure for calculating the E-step expectations in (17)-(22). First, the Kalman filter and Kalman smoothing filter are run on Θ_b and Θ_r, using Algorithm 2. Next, sensitivity analysis is performed on the Kalman filter and Kalman smoothing filter via Algorithms 3 and 4, where Θ_r is the model and Θ_b is the source. Finally, the expectations and expected log-likelihood are calculated according to (30)-(37).

Algorithm 1: Expectations for HEM-DTM
1: Input: DT parameters Θ_b and Θ_r, length τ.
2: Run the Kalman smoothing filter (Algorithm 2) on Θ_b and Θ_r to obtain {V̄_{t|t−1}, V̄_{t|τ}, V̄_{t,t−1|τ}} and {Ṽ_{t|t−1}, Ṽ_{t|τ}, Ṽ_{t,t−1|τ}}.
3: Run sensitivity analysis on the Kalman filters, Θ_b and Θ_r (Algorithm 3), to obtain {x̂_t, V̂_t}.
4: Run sensitivity analysis on the Kalman smoothing filters, Θ_b and Θ_r (Algorithm 4), to obtain {x̂_t, χ̂_t, χ̂_{t,t−1}, κ̂_t}.
5: Compute the E-step expectations, for t = {1, ..., τ}:
   û_t = C_b x̂_t^1 + ȳ_b,  (30)
   Û_t = C_b V̂_t^{1,1} C_b^T + R_b + (û_t − ȳ_r)(û_t − ȳ_r)^T,  (31)
   P̂_{t,t} = Ṽ_{t|τ} + χ̂_t + x̂_t x̂_t^T,  (32)
   P̂_{t,t−1} = Ṽ_{t,t−1|τ} + χ̂_{t,t−1} + x̂_t x̂_{t−1}^T,  (33)
   Ŵ_t = κ̂_t + (û_t − ȳ_r) x̂_t^T.  (34)
6: Compute the expected log-likelihood l:
   Σ̂_t = C_r Ṽ_{t|t−1} C_r^T + R_r,  Λ̂_t = V̂_t^{3,3} + x̂_t^3 (x̂_t^3)^T,  (35)
   λ̂_t = C_b V̂_t^{2,3} + (C_b x̂_t^1 + ȳ_b − ȳ_r)(x̂_t^3)^T,  (36)
   l = Σ_{t=1}^τ [ −(1/2) tr(Σ̂_t^{−1}(Û_t − λ̂_t C_r^T − C_r λ̂_t^T + C_r Λ̂_t C_r^T)) − (1/2) log|Σ̂_t| − (m/2) log(2π) ].  (37)
7: Output: {x̂_t, P̂_{t,t}, P̂_{t,t−1}, Ŵ_t, Û_t, û_t}, l.

4.5 M-step

In the M-step of HEM for DTM, the parameters Θ^(r) are updated by maximizing the Q function. The form of the HEM Q function in (16) is identical to that of EM for DTM [8]. Hence, the equations for updating the parameters are identical to those of EM for DTM, although the aggregate expectations are different. Each DT component Θ_j^(r) is updated as

C_j* = Γ̂_j Φ̂_j^{−1},      R_j* = (1/(τ M̂_j)) (Λ̂_j − C_j* Γ̂_j^T),
A_j* = Ψ̂_j φ̂_j^{−1},      Q_j* = (1/((τ−1) M̂_j)) (ϕ̂_j − A_j* Ψ̂_j^T),
µ_j* = (1/M̂_j) ξ̂_j,       S_j* = (1/M̂_j) η̂_j − µ_j* (µ_j*)^T,
π_j* = N̂_j / K^(b),        ȳ_j* = (1/(τ M̂_j)) (γ̂_j − C_j* β̂_j).  (38)
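The updates in (38) are weighted least-squares solutions built from the aggregate second-order statistics. As a sanity check of the A_j* = Ψ̂_j φ̂_j^{−1} form, the sketch below estimates the transition matrix of a single simulated LDS from the analogous sums Σ_t x_t x_{t−1}^T and Σ_t x_{t−1} x_{t−1}^T, using observed states in place of the smoothed expectations (toy values, not the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(3)

# simulate a long state sequence x_t = A x_{t-1} + v_t
A_true = np.array([[0.9, 0.1],
                   [0.0, 0.8]])
T, n = 50_000, 2
X = np.zeros((T, n))
for t in range(1, T):
    X[t] = A_true @ X[t - 1] + 0.1 * rng.standard_normal(n)

# aggregate second-order statistics, mirroring Psi_hat and phi_hat
Psi = X[1:].T @ X[:-1]       # sum_t x_t x_{t-1}^T
phi = X[:-1].T @ X[:-1]      # sum_t x_{t-1} x_{t-1}^T
A_hat = Psi @ np.linalg.inv(phi)   # the A* = Psi phi^{-1} update
```

In HEM-DTM the same normal equations are formed from the expected statistics of Algorithm 1 rather than from observed states, weighted by ŵ_{i,j}.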
5 APPLICATIONS AND EXPERIMENTS

In this section, we discuss several novel applications of HEM-DTM to video and motion analysis, including hierarchical motion clustering, semantic motion annotation, and DT codebook generation for the bag-of-systems video representation, which are illustrated in Figure 1. These applications exploit several desirable properties of HEM to obtain promising results. First, given a set of input DTs, HEM estimates a novel set of fewer DTs that represents the input in a manner that is consistent with the underlying generative probabilistic models, by maximizing the log-likelihood of virtual samples generated from the input DTs. As a result, the clusters formed by HEM are also consistent with the probabilistic framework. Second, HEM can estimate models on large datasets, by breaking the learning problem into smaller pieces. In particular, intermediate models are learned on small non-overlapping portions of a large dataset, and the final model is estimated by running HEM on the intermediate models. Because HEM is based on maximum-likelihood principles, it drives model estimation towards similar optimal parameter values as performing maximum-likelihood estimation on the full dataset. However, the computer memory requirements are significantly less, since we no longer have to store the entire dataset during parameter estimation. In addition, the intermediate models are estimated independently of each other, so the task can be easily

parallelized.

Fig. 1: Applications of the HEM-DTM algorithm: a) hierarchical clustering of video textures; b) learning DT annotation models; c) training a bag-of-systems (BoS) codebook.

In the remainder of the section, we present three applications of HEM-DTM to video and motion analysis.

5.1 Implementation notes

In the following experiments, the EM-DTM algorithm is first used to learn video-level DTMs from overlapping vectorized video patches (spato-temporal cubes) extracted from the video. We initialize EM-DTM using an iterative component splitting procedure suggested in [8], where EM is run repeatedly with an increasing number of mixture components. Specifically, we start by estimating a DTM with K = 1 components by running EM-DTM to convergence. Next, we select a DT component and duplicate it to form two components (this is the "splitting"), followed by slightly perturbing the DT parameters. This new DTM with K = 2 components serves as the initialization for EM-DTM, which is again run until convergence. The process is repeated until the desired number of components is reached. We use a growing schedule of K = {1, 2, 4, 8, 16}, and perturb the observation matrix C when creating new DT components. We use a similar procedure to initialize the reduced DTM when running HEM-DTM. We set the virtual sample parameters to τ = 20 and N = . The state-space dimension is set to n = 10. The likelihood of a video under a DT, p(y_{1:τ}|Θ), is calculated efficiently using the innovation form of the likelihood in (74). Finally, we make a standard i.i.d. assumption on the observation noise of the DT, i.e., R = rI. In this case, the inversion of large m×m covariance matrices, e.g., in (37) and (48), is calculated efficiently using the matrix inversion lemma.
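The component-splitting initialization described above can be sketched as follows, with toy parameter vectors standing in for DT components; in the real procedure each growth step is followed by running EM-DTM to convergence, and only the observation matrix C is perturbed.

```python
import numpy as np

rng = np.random.default_rng(4)

def split_schedule(components, target_K, perturb=0.01):
    """Grow a mixture by duplicating and slightly perturbing its
    components, doubling the count until target_K is reached.
    Components here are toy parameter vectors, not full DTs."""
    comps = list(components)
    while len(comps) < target_K:
        new = []
        for c in comps:
            new.append(c)                                      # original
            new.append(c + perturb * rng.standard_normal(c.shape))  # split copy
        comps = new[:target_K]
        # (in the real algorithm, EM runs to convergence here)
    return comps

start = [np.zeros(3)]          # K = 1
grown = split_schedule(start, 16)   # follows K = 1, 2, 4, 8, 16
```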
5.2 Hierarchical clustering of video textures

We first consider hierarchical motion clustering of video textures, by successively clustering DTs with the HEM algorithm, as illustrated in Figure 1a. Given a set of K_1 video textures, spatio-temporal cubes are extracted from the video and a DT is learned for each video texture. This forms the first level of the hierarchy (the video-level DTs). The next level in the hierarchy is formed by clustering the DTs from the previous level into K_2 groups with the HEM algorithm (K_2 < K_1). The DT cluster centers are selected as the representative models at this level, and the process is continued, with each level in the hierarchy learned from the preceding level. The result is a tree representation of the video dataset, with similar textures grouped together in the hierarchy. Note that this type of hierarchy could not be built in a straightforward manner using the EM algorithm on the original spatio-temporal cubes. While it is possible to learn several DTMs with successively smaller values of K, there is no guarantee that the resulting mixtures, or the cluster memberships of the video patches, will form a tree.

5.2.1 Experimental setup

We illustrate hierarchical motion clustering on the video texture dataset from [8]. This dataset is composed of 99 video sequences, each containing 2 distinct video textures (see Figure 2 for examples). There are 12 texture classes in total, ranging from water (sea, river, pond) to plants (grass and trees), to fire and steam. To obtain the first level of the hierarchy, we learn one DT for each texture in each video (the locations of the textures are known), and pool these DTs together to form a DTM with K_1 = 198 components. Each DT is learned using [6] on 100 spatio-temporal cubes sampled from the texture segment. The second level of the hierarchy is obtained by running HEM on the level-1 DT mixture to reduce the number of components to K_2 = 12.

Fig. 2: Video texture examples: a) videos with 2 textures (river-far, escalator; grass, fire; plant-a, river-far); b) ground-truth labels.

Fig. 3: Hierarchical clustering of video textures: the arrows and brackets show the cluster memberships from the preceding level (the groupings between Levels 1 and 2 are omitted for clarity).

Finally, the

third and fourth levels are obtained by running HEM on the previous level for K_3 = 6 and K_4 = 3 clusters, respectively.

5.2.2 Clustering Results

Figure 3 shows the hierarchical clustering that is obtained with HEM. The first level contains the DTs that represent each texture segment in the database. Each vertical bar represents one DT, where the color indicates the ground-truth cluster label (texture name). In the second level, the 12 DT components are shown as vertical bars, where the colors indicate the proportion of the cluster membership with a particular ground-truth cluster label. In most cases, each cluster corresponds to a single texture (e.g., grass, escalator, pond), which illustrates that HEM is capable of clustering DTs into similar motions. The Rand index for the level-2 clustering using HEM is … (for comparison, clustering histograms-of-oriented-optical-flow using K-means yields a Rand index of 0.958). One error is seen in the HEM cluster with both the river and river-far textures, which is reasonable considering that the river-far texture contains both near and far perspectives of water. Moving up to the third level of the hierarchy, HEM forms two large clusters containing the plant textures (plant-i, plant-a, grass) and water textures (river-far, river, sea-far). Finally, in the fourth level, the video textures are grouped together according to broad categories: plants (grass, plant-a, plant-i), water (pond, river-far, river, sea-far), and rising textures (fire, jellyfish, and steam). These results illustrate that HEM for DTs is capable of extracting meaningful clusters in a hierarchical manner.

5.3 Semantic video texture annotation

In this section, we formulate the annotation of video sequences as a supervised multi-class labeling (SML) problem [25] using DTM models.

5.3.1 Video Annotation Framework

A video sequence is first decomposed into spatio-temporal cubes as Y = {y^(i)_{1:τ}}_{i=1}^N, where each y^(i)_{1:τ} is a vectorized video patch of length τ.
The number of video cubes N depends on the size and length of the video. Semantic content of the video can be represented with a vocabulary V = {w_1, …, w_|V|} of unique tags (e.g., trees, river, and directed motion), with size |V|. Each video is represented with an annotation vector of the form c = {c_1, ..., c_|V|}, where a particular entry c_k > 0 if there is some association of the video with the kth tag in the vocabulary. Each tag w_k is modeled as a probability distribution over the video cubes, i.e., p(y^(i)_{1:τ}|w_k), which in our case will be a DTM model. The annotation task is then to find a subset W = {w_1, ..., w_A} ⊆ V of A tags that best describes a novel video Y. Given the novel video, the most relevant tags are those with highest posterior probability according to Bayes' rule,

p(w_k|Y) = p(Y|w_k) p(w_k) / p(Y),    (39)

where p(w_k) is the kth tag prior, p(Y) is the prior for the video, and p(Y|w_k) = ∏_{i=1}^N p(y^(i)_{1:τ}|w_k). The video can then be represented as a semantic multinomial p = [p(w_1|Y), ..., p(w_|V||Y)]. The top A tags according to the semantic multinomial p are then selected as the annotations of the video. To promote annotation using a diverse set of tags, we also assume a uniform prior, p(w_k) = 1/|V|.

5.3.2 Learning tag models with HEM

For each tag w_k, the tag distribution p(y^(i)_{1:τ}|w_k) is modeled with a DTM model, which is estimated from the set of training videos associated with the particular tag. One approach to estimation is to extract all the video fragments from the relevant training videos for the tag, and then run the EM algorithm [8] directly on this data to learn the tag-level DTM. This approach, however, requires storing many video fragments in memory (RAM) for running the EM algorithm. For even modest-sized databases, the memory requirements can exceed the RAM capacity of most computers. To allow efficient training in computation time and memory requirements, the learning procedure is split into two steps. First, a video-level DTM model is learned for each video in the training set using the standard EM algorithm [8].
Next, a tag-level model is formed by pooling together all the video-level DTMs associated with a tag, to form a large mixture (i.e., each DT component in a relevant video-level DTM becomes a component in the large mixture). However, a drawback of this model aggregation approach is that the number of DTs in the DTM tag model grows linearly with the size of the training data, making inference computationally inefficient when using large training sets. To alleviate this problem, the DTM tag models formed by model aggregation are reduced to a representative DTM with fewer components by using the HEM algorithm. The HEM algorithm clusters together similar DTs in the video-level DTMs, thus summarizing the common information in videos associated with a particular tag. The new DTM tag model allows for more efficient inference, due to fewer mixture components, while maintaining a reliable representation of the tag-level model. The process for learning a tag-level DTM model from video-level DTMs is illustrated in Figure 1b.

5.3.3 Experimental setup

Fig. 4: List of tags with example thumbnails and video count for the DynTex dataset (structural tags are in bold): sea(16), field(9), tree(40), escalator(6), stream(26), boiling(7), shower(3), river(19), flag(17), candle(8), plant(27), sky(3), mobile(5), road(4), basin(20), fountain(60), waterfall(19), pond(7), foam(6), source(11), windmill(6), net(4), aquarium(4), anemone(19), rain(4), toilet(4), laundry(6), server(3), waving(78), dmotion(94), turbulent(95), oscillating(95), dmotions(38), random(11), intrinsic(15).

For the annotation experiment we use the DynTex dataset [37], which consists of over 650 videos, mostly in everyday

surroundings. Ground-truth annotation information is present for 385 sequences (called the "golden set"), based on a detailed analysis of the physical processes underlying the dynamic textures. We select the 35 most frequent tags in DynTex for annotation, comprising 337 sequences. The tags are also grouped into two categories: 1) process tags, which describe the physical texture process (e.g., sea, field, and tree), and are mainly based on the appearance; 2) structural tags, which describe only the motion characteristics (e.g., turbulent and oscillating), and are largely independent of appearance. Note that videos with a particular structural tag can have a wide range of appearances, since the tag only applies to the underlying motion. Each video has an average of 2.34 tags.¹ Figure 4 shows an example of each tag alongside the number of sequences in the dataset.

TABLE 1: Annotation results for different methods on the DynTex dataset (columns: Average Precision, Average Recall, Average F-Measure, Tags with Recall > 0; rows: DTM-HEM, GMM-HEM-DCT, GMM-HEM-OPF).

Each video is truncated to 50 frames, converted to grayscale, and downsampled 3 times using bicubic interpolation, resulting in a size of …. Overlapping spatio-temporal cubes of size … (step: …) are extracted from the videos. We only consider patches with significant motion, by ignoring a patch if any pixel has variance < 5 in time. Video-level DTMs are learned with K = 16 components to capture enough of the temporal diversity present in each sequence, while tag-level DTMs use K = 8 components. Annotation performance is measured following the procedure described in [25]. Annotation accuracy is reported by computing precision, recall and F-score for each tag, and then averaging over all tags. Per-tag precision is the probability that the model correctly uses the tag when annotating a video. Per-tag recall is the probability that the model annotates a video that should have been annotated with the tag.
Precision, recall and F-score for a tag w are defined as:

P = |W_C| / |W_A|,    R = |W_C| / |W_H|,    F = 2((P)⁻¹ + (R)⁻¹)⁻¹,    (40)

where |W_H| is the number of sequences that have tag w in the ground truth, |W_A| is the number of times the annotation system uses w when automatically tagging a video, and |W_C| is the number of times w is correctly used. In case a tag is never selected for annotation, the corresponding precision (that otherwise would be undefined) is set to the tag prior from the training set, which equals the performance of a random classifier. To investigate the advantage of the DTM's temporal representation, we compare the annotation performance of HEM-DTM to the hierarchically-trained Gaussian mixture models using DCT features [25] (GMM-HEM-DCT) and using optical flow features [9] (GMM-HEM-OPF). The dataset is split into 50% training and 50% test sets, with each video appearing exactly once in either set. Results are averaged over 5 trials, using different random training/test sets.

1. Details of the data set and more results can be found at: …

Fig. 5: (a) Average precision/recall plot; (b) F-measure plot, showing all tag-levels, for different methods on the DynTex data set.

5.3.4 Annotation Results

Table 1 shows the average precision, recall, and F-measure for annotation with A = 3 tags, while Figure 5 shows these values for all 35 tag levels. Video annotation using DTMs outperforms using DCT and optical flow features, with an F-score of … versus … and …. Overall, this suggests that the DTM can better capture both the appearance and dynamics of the video texture processes.

TABLE 2: Per-tag performance on the DynTex data set (columns: Precision, Recall, and F-Measure, each for DTM, DCT, and OPF; rows: anemone, aquarium, basin, boiling, candle, escalator, field, flag, foam, fountain, laundry, mobile, net, plant, pond, rain, river, road, sea, server, shower, sky, source, stream, toilet, tree, waterfall, windmill, dmotion, dmotions, intrinsic, oscillating, random, turbulent, waving; with averages over the process and structural categories).
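The annotation rule in (39) and the per-tag metrics in (40) can be sketched as follows. This is an illustrative sketch, not the authors' code: the per-cube log-likelihoods log p(y^(i)|w_k) are assumed to be given (in the paper they come from the DTM tag models), and the uniform tag prior of Section 5.3.1 is used.

```python
import numpy as np

def semantic_multinomial(loglik):
    """Posterior p(w_k|Y) of Eq. (39) from a (V, N) array of per-cube
    log-likelihoods, with the uniform tag prior p(w_k) = 1/V."""
    log_pY_w = loglik.sum(axis=1)          # log p(Y|w_k) = sum_i log p(y^(i)|w_k)
    log_post = log_pY_w - log_pY_w.max()   # uniform prior cancels; stabilize
    p = np.exp(log_post)
    return p / p.sum()

def annotate(loglik, tags, A=3):
    """Select the A tags with highest posterior probability."""
    p = semantic_multinomial(loglik)
    return [tags[k] for k in np.argsort(-p)[:A]]

def per_tag_metrics(truth, auto, tag, tag_prior):
    """Precision, recall and F-measure of Eq. (40) for one tag, where
    truth/auto are lists of tag sets (ground truth / automatic); an
    unused tag's precision falls back to the tag prior, as in the text."""
    W_H = sum(tag in t for t in truth)     # videos that truly have the tag
    W_A = sum(tag in a for a in auto)      # times the system used the tag
    W_C = sum(tag in t and tag in a for t, a in zip(truth, auto))
    P = W_C / W_A if W_A > 0 else tag_prior
    R = W_C / W_H if W_H > 0 else 0.0
    F = 0.0 if P == 0.0 or R == 0.0 else 2.0 / (1.0 / P + 1.0 / R)
    return P, R, F
```

Averaging `per_tag_metrics` over all tags gives the averaged precision/recall/F-score reported in Table 1.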

fc c610: (truth) boiling, turbulent; (DTM) boiling, turbulent, laundry, flag, waving; (DCT) boiling, candle, intrinsic, turbulent, flag; (OPF) candle, laundry, turbulent, intrinsic, flag
644b a320: (truth) road, dmotions; (DTM) road, dmotions, foam, waterfall, waving; (DCT) road, waving, dmotions, turbulent, dmotion; (OPF) road, waterfall, turbulent, net, dmotion
(truth) fountain, foam, turbulent, dmotions; (DTM) foam, stream, dmotions, fountain, turbulent; (DCT) waterfall, foam, stream, oscillating, dmotion
(truth) plant, oscillating; (DTM) plant, oscillating, tree, anemone, fountain; (DCT) foam, anemone, oscillating, plant, basin; (OPF) field, plant, toilet, anemone, oscillating
(truth) flag, waving; (DTM) flag, waving, laundry, turbulent, boiling; (DCT) flag, waving, windmill, laundry, candle; (OPF) flag, waving, laundry, windmill, turbulent
(truth) plant, oscillating; (DTM) plant, oscillating, tree, aquarium, stream; (DCT) waterfall, stream, oscillating, foam, plant; (OPF) toilet, plant, field, oscillating, anemone

Fig. 6: Annotation examples from the DynTex database, showing ground truth, DTM, GMM-DCT, and GMM-OPF annotations. Automatic annotations that match the ground-truth annotations are in bold.

Table 2 presents the annotation performance for the individual tags, as well as averages over the process and structural categories. For the process category, DTM outperforms DCT on average F-score (0.393 versus 0.290), although the performance on individual tags is mixed. In some cases, appearance (via DCT features) is sufficient to identify the relevant texture (e.g., net). For the structural category, DTM also outperforms DCT, with an average F-score of … versus 0.302, while also dominating DCT on all but one individual structural tag. In these cases, appearance features cannot sufficiently model the structural tags, since these tags contain significant variation in appearance. On the other hand, DTM is able to learn the common motion characteristics, in spite of the variation in appearance.
Finally, Figure 6 presents some example annotations for different videos using the top-5 tags. To give a sense of the computational cost of these annotation experiments, the average runtime using a standard PC (3.16 GHz, C++, OpenCV) was 3.3 minutes to learn a video-level DTM, 2.4 minutes to learn a tag model from video-level DTMs, and 2.3 minutes to annotate a single video.

5.3.5 Effect of various training parameters

We further investigated the effect of varying the number of states, number of components, and training set size. Figures 7(a) and 7(b) show the F-score when varying the number of video-level components and tag-level components. In general, increasing the number of components at the video and tag level improves performance, since the DTM can better capture the variations in the underlying dynamics of the video sequence. Figure 7(c) shows the annotation performance while varying the dimension of the state space n. Increasing n tends to improve performance. Finally, Figure 7(d) presents the average F-score while changing the size of the training set, by selecting a subset of the training set. The performance improves consistently with the increase in the number of videos.

Fig. 7: Effect on annotation performance when varying the number of: (a) base components; (b) tag-level components; (c) states; (d) training videos.

5.4 HEM-trained bag-of-systems codebook for dynamic texture recognition

The bag-of-systems (BoS) representation [18] is a descriptor of motion in a video, where dynamic texture (DT) codewords represent the typical motion patterns in spatio-temporal patches extracted from the video. The BoS representation of videos is analogous to the bag-of-visual-words representation of images, where images are represented by counting the occurrences of visual codewords in the image.
Specifically, in the BoS framework the codebook is formed by generative time-series models instead of words, each of them compactly characterizing the typical texture and dynamics patterns of pixels in a spatio-temporal patch. Hence, each video is represented by a BoS histogram with respect to the codebook, by assigning individual spatio-temporal patches to the most likely codeword, and then counting the frequency with which each codeword is selected. To learn the DT codebook, [18], [38] first estimate individual DTs, learned from spatio-temporal cubes extracted at spatio-temporal interest points [18], or from non-overlapping samples from the video [38]. Codewords are then generated by clustering the individual DTs using a combination of non-linear dimensionality reduction (NLDR) and K-means clustering. Due to the pre-image problem of kernelized NLDR, this clustering method is not capable of producing novel DT codewords, as discussed in Section 2. In this section, we use the HEM-DTM algorithm to generate novel DT codewords for the bag-of-systems representation, thus improving the robustness of the motion descriptor. We validate the HEM-trained BoS codebook on the task of dynamic texture recognition, while comparing with existing state-of-the-art methods [14], [39], [40], [18], [20].

5.4.1 Learning a BoS Codebook with HEM-DTM

The procedure for learning a BoS codebook is illustrated in Figure 1c. First, for each video in the training corpus, a dense sampling of spatio-temporal cubes is extracted, and a DTM

is learned with the EM algorithm [8]. Next, these DTMs are pooled together to form one large DTM, and the number of mixture components is reduced using the HEM-DTM algorithm. Finally, the novel DT cluster centers are selected as the BoS codewords. Note that this method of codebook generation is able to exploit all the training data, as opposed to only a subset selected via interest-point operators as in [18], or non-overlapping samples as in [38]. This is made possible through the efficient hierarchical learning of the codebook model, as discussed in the previous sections. Given the BoS codebook, the BoS representation of a video is formed by first counting the number of occurrences of each codeword in the video, where each spatio-temporal cube is assigned to the codeword with largest likelihood. Next, a BoS histogram (weight vector w) is formed using the standard term frequency (TF) or term frequency-inverse document frequency (TFIDF) representations,

TF: w_k^i = N_k^i / N^i,    TFIDF: w_k^i = (N_k^i / N^i) log(V / V_k),    (41)

where w_k^i is the kth codeword entry for the ith video, N_k^i is the number of times codeword k appears in video i, N^i = Σ_k N_k^i is the total number of codewords for video i, V is the total number of training videos, and V_k is the number of training videos in which codeword k occurs.

5.4.2 Related Work and Datasets

Current approaches to dynamic texture recognition use DT models [13], [14], [20], [18] or aggregations of local descriptors [39], [40]. [13], [14] represent each video as a DT model, and then leverage nearest neighbor or support vector machine (SVM) classifiers, by adopting appropriate distance functions between dynamic textures, e.g., the Martin distance [13] or Kullback-Leibler divergence [14]. The resulting classifiers are largely dependent on the appearance of the video, i.e., the particular viewpoint of each texture. Subsequent methods address this issue by proposing translation-invariant or viewpoint-independent approaches: [20] proposes distances between DTs based only on the spectrum or cepstrum of the hidden-state process x_t, while ignoring the appearance component of the model; [18] proposes a bag-of-systems representation for videos, formed by assigning spatio-temporal patches, which are selected by interest-point operators, to DT codewords. The patch-based framework of BoS is less sensitive to changes in viewpoint than the approaches based on holistic appearance [13], [14]. In contrast to using DT models, [39], [40] aggregate local descriptors to form a video descriptor. [39] uses distributions of local space-time oriented structures, while [40] concatenates local binary pattern (LBP) histograms extracted from three orthogonal planes in space-time (XY, XT, YT). While these two descriptors are less sensitive to viewpoint, they both ignore the underlying long-term motion dynamics of the texture process. The datasets used by the above papers are either based on the UCLA [13] or DynTex [37] video texture datasets, with modifications in order to test viewpoint-invariance:

UCLA50: the original UCLA dataset [13] consists of 50 classes, with 4 videos per class. The original videos are grayscale with a frame size of …. [14] crop the videos to a representative video patch so that the texture is from the same viewpoint. In our BoS experiments, we use the original uncropped versions.

UCLA39: [20] considers 39 classes from UCLA, which do not violate the assumption of spatial stationarity. Each video is cropped into a left subvideo and a right subvideo (both 48 × 48), where one side is used for training and the other for testing. This classification task is significantly more challenging than UCLA50, since the appearances of the training videos are quite different from those of the test videos.

Fig. 8: Examples from DynTex35: big-leaves, blossom-tree1-c1, blossom-tree2-c1, boiling-water2-c, boiling-water2-c, curly-hair, danube, danube-close, danube-far, escalator1-c, escalator2-c, escalator3-c1, escalator3-c2, flag-close, flame, lift-downward, naked-tree, rideau-jaune, see-waves, shower-drops1, shower-low, shower-medium, shower-strong, small-leaves, smoke, square-sheet, steam1-c1, steam1-c2, straw, straw-far, stream-wtr1-c, stream-wtr2-c, updown-tide, water-grass.
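Eq. (41) above can be sketched as follows, assuming the max-likelihood codeword assignment of each cube has already been computed (the assignment itself would require evaluating each cube's likelihood under every DT codeword); this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def bos_histograms(assignments, K):
    """TF and TFIDF BoS histograms of Eq. (41).
    assignments: one list of codeword indices (one per cube) per video."""
    V = len(assignments)                              # number of (training) videos
    counts = np.zeros((V, K))
    for i, idx in enumerate(assignments):
        for k in idx:
            counts[i, k] += 1                         # N_k^i
    tf = counts / counts.sum(axis=1, keepdims=True)   # N_k^i / N^i
    df = (counts > 0).sum(axis=0)                     # V_k: videos containing codeword k
    idf = np.log(V / np.maximum(df, 1))               # log(V / V_k); guard unused codewords
    return tf, tf * idf[None, :]
```

Note the `np.maximum(df, 1)` guard: it is an assumption of this sketch for codewords that no training video uses, where log(V / V_k) would otherwise be undefined.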
UCLA9: [18] groups related classes in UCLA into 9 super-classes, where each super-class contains different viewpoints of the same texture process. Experiments are conducted on subsets of these 9 super-classes: water vs. fountain (UCLA9wf), fountain vs. waterfall (UCLA9wff), 4 classes (UCLA9c4), and 8 classes (UCLA9c8). The original uncropped videos are used.

UCLA7: [39] also groups similar classes from UCLA into 7 super-classes, using the uncropped videos.

DynTex35: [40] uses the old DynTex dataset², consisting of 35 sequences. Each sequence is decomposed into 10 subvideos, by splitting spatially and temporally, resulting in 35 classes, with 10 videos per class. Example frames from each class in DynTex35 are presented in Figure 8.

In this paper, we validate our proposed HEM-trained BoS on each of these datasets, following the protocols established by their respective papers and comparing to their published results.

2. This is an old version of the DynTex dataset used in the previous section.

TABLE 3: Distances and kernels used for classification.

square-root distance (SR): d_s(w_1, w_2) = arccos(Σ_k √(w_{1k} w_{2k}))
χ² distance (CS): d_χ²(w_1, w_2) = (1/2) Σ_k (w_{1k} − w_{2k})² / (w_{1k} + w_{2k})
χ² kernel (CSK): K(w_1, w_2) = 1 − Σ_k (w_{1k} − w_{2k})² / ((1/2)(w_{1k} + w_{2k}))
exponentiated χ² kernel (ECS): K(w_1, w_2) = exp(−γ d_χ²(w_1, w_2))
Bhattacharyya kernel (BCK): K(w_1, w_2) = Σ_k √(w_{1k} w_{2k})

5.4.3 Experimental setup

For the UCLA-based datasets, overlapping spatio-temporal cubes of size … (step: …) pixels are extracted densely from the grayscale video. For the DynTex35 dataset, the videos are converted to grayscale, and overlapping spatio-temporal cubes of size … (step: …) pixels are extracted. We ignore patches without significant motion by only selecting patches with overall pixel variance > 1. For all datasets, we learn video-level DTMs with K = 4 components. The BoS codebook is then learned by running HEM with K = 64 on the mixture formed from all the training video DTMs. For UCLA9, we also consider a codebook size of K = 8 in order to obtain a fair comparison with [18]. Each video is then represented by its TF and TFIDF vectors using the BoS codebook. We mainly follow the protocol of [18] to train dynamic texture classifiers, using the various distances and kernel functions listed in Table 3 and the BoS representation. First, we use k-nearest neighbor classifiers using χ² and square-root distances, denoted as CS1 and SR1 for k = 1, and CS3 and SR3 for k = 3. Second, we consider support vector machines (SVMs) using kernel functions related to the CS and SR distances, such as the χ² kernel (CSK), exponentiated χ² kernel (ECS), and Bhattacharyya kernel (BCK). SVM training and testing is performed with libsvm [41], with kernel and SVM parameters selected using 10-fold cross-validation on the training set. Finally, a generative classification approach, namely a naive Bayes (NB) classifier, was also tested, as in [18].
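The distances and kernels of Table 3 can be sketched directly on BoS histograms. This is an illustrative sketch, assuming histograms normalized to sum to one; bins where both histograms are zero are skipped to avoid 0/0.

```python
import numpy as np

def sr_dist(w1, w2):
    """Square-root distance: arccos of the Bhattacharyya affinity."""
    return np.arccos(np.clip(np.sum(np.sqrt(w1 * w2)), -1.0, 1.0))

def chi2_dist(w1, w2):
    """Chi-square distance (CS) between histograms."""
    den = w1 + w2
    nz = den > 0                                     # skip empty bins
    return 0.5 * np.sum((w1[nz] - w2[nz]) ** 2 / den[nz])

def chi2_kernel(w1, w2):
    """Chi-square kernel (CSK)."""
    den = 0.5 * (w1 + w2)
    nz = den > 0
    return 1.0 - np.sum((w1[nz] - w2[nz]) ** 2 / den[nz])

def exp_chi2_kernel(w1, w2, gamma=1.0):
    """Exponentiated chi-square kernel (ECS)."""
    return np.exp(-gamma * chi2_dist(w1, w2))

def bhattacharyya_kernel(w1, w2):
    """Bhattacharyya kernel (BCK)."""
    return np.sum(np.sqrt(w1 * w2))
```

For the TF-SR1 classifier described above, a test video would simply be assigned the label of its nearest training histogram under `sr_dist`.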
All classification results are averaged over a number of trials using different training and test sets, depending on the protocol of the dataset.

5.4.4 Classification results

Table 4 presents the video classification results for the various classifiers using the HEM-BoS codebook and either TF or TFIDF representations, and existing state-of-the-art reference methods for each dataset. Reference results (Ref) are those provided in the respective papers. The row labeled "Best" refers to the best accuracy among the various classifiers using the HEM-BoS codebook. First, looking at K = 64 codewords, the best classifier using the HEM-BoS codebook consistently outperforms the reference methods of [14], [39], [40], [18], [20]. To identify the best-performing (i.e., most consistent) classifier, we rank all the HEM-BoS classifiers on each individual dataset, and then calculate the average ranking over the 5 datasets. The best ranking classifier is the 1-NN classifier using the square-root distance (TF-SR1). TF-SR1 is also consistently more accurate than the reference methods. These results demonstrate the efficacy of the HEM-BoS codebook for representing a wide range of dynamic textures, while maintaining viewpoint and translation invariance. Among the datasets, accuracy on UCLA39 is the most improved, from 20% [20] or 42.3% [39] to 56.4% for HEM-BoS. In contrast to [20], which is based solely on motion dynamics, and [39], which models local appearance and instantaneous motion, the BoS representation is able to leverage both the local appearance (for translation invariance) and motion dynamics of the video to improve the overall accuracy. Next, we compare the two methods of learning a BoS codebook, the HEM algorithm and NLDR/clustering [18], using K = 8 as in [18]. On both the 4- and 8-class UCLA9 datasets, the accuracy using HEM-BoS improves significantly over NLDR-BoS, from 89% to 97.92% on the 4-class problem, and from 80% [18] or 84% [38] to 92.83% on the 8-class problem³. The improvement in performance is due to both the
3.
Using the same patch sizes as in [38], we get similar performance on the 8-class problem: 92.83% accuracy for … patches, and 90.10% for … patches.

TABLE 4: BoS Classification Results. Average Rank is calculated from the individual ranks on each dataset for K = 64 (shown in parentheses). "A/B" refers to method A as reported in B. Datasets: UCLA9wf, UCLA9wff, UCLA9c4, UCLA9c8 (for K = 8) and UCLA9c8, UCLA7, UCLA39, UCLA50, DynTex35 (for K = 64). Average ranks — TF: CS1 8.5, CS3 13.2, SR1 4.7, SR3 10.9, CSK 5.9, ECS 5.2, BCK 6.2; TFIDF: CS1 9.1, CS3 13.2, SR1 5.2, SR3 9.6, CSK 6.0, ECS 6.4, BCK 6.6; NB 9.3. The "Best" row corresponds to TF-SR1, and the "Ref" row lists the published reference results for each dataset.

Fig. 9: Confusion matrices on UCLA9c8 using: (a) HEM-trained BoS; (b) BoS from [18].

generation of novel DT codewords, and the ability to learn these codewords efficiently from more data, i.e., from a dense sampling of spatio-temporal cubes, rather than those selected by interest point operators. Figure 9 shows the confusion matrices for UCLA9c8, using the HEM-BoS and the NLDR-BoS from [18], respectively. HEM-BoS removes the misclassifications of water to fire, and fountain to waterfall. Again, this illustrates the robustness of the BoS learned with HEM. Figure 10 shows several examples of test videos with the generated class labels. The average runtime was 1.5 hours to learn a codebook from video-level DTMs for UCLA39, and 20 seconds to calculate the BoS representation for a single video. Finally, we investigate the effect of increasing the codebook size for the BoS representation. Figure 11 plots the accuracy on UCLA{7, 39, 50} and DynTex35, versus a codebook size of K = {8, 16, 32, 64}. In general, increasing the number of codewords also increases the classifier accuracy, with accuracy saturating for UCLA50 and UCLA7. Also, increasing the codebook size increases the computational cost of projecting to the codebook.
A codebook size of K = 64 represents a good tradeoff between speed and performance for BoS classification on these datasets.

Fig. 10: Classification examples from UCLA9c8 {ground truth, classifier prediction}: {boiling, boiling}, {fire, fire}, {flowers, flowers}, {fountain, fountain}, {sea, sea}, {smoke, sea}, {water, water}, {waterfall, waterfall}.

Fig. 11: Effect of increasing the codebook size on BoS classification using different data sets.

6 CONCLUSIONS

In this paper, we have derived a hierarchical EM algorithm that both clusters DTs and learns novel DT cluster centers that are representative of the cluster members, in a manner that is consistent with the underlying probabilistic models. The clustering is achieved by generating virtual samples from the input DTs, and maximizing the log-likelihood of these virtual samples with respect to the DT cluster centers. Using the law of large numbers, the sum over virtual samples can be replaced by an expectation over the input DTs, resulting in a clustering algorithm that depends only on the input DT model parameters. For the E-step inference of HEM, we also derive a novel efficient algorithm for sensitivity analysis of the Kalman smoothing filter. Besides clustering, the HEM algorithm for DTs can also be used for hierarchical model estimation from large datasets, where DTMs are first learned on subsets of the data (e.g., individual videos), and the resulting DTMs are then aggregated using the HEM algorithm. This formulation provides a significant increase in computational and memory efficiency, in comparison to running standard EM on the full dataset. We apply the HEM algorithm to a variety of motion analysis problems. First, we apply HEM to hierarchically cluster video textures, and demonstrate that the algorithm produces consistent clusters based on video motion. Second, we use HEM to estimate motion annotation models using the SML framework, where each annotation model is a DTM learned with weakly-labeled training data.
Third, we use HEM to learn BoS codebooks and demonstrate state-of-the-art results in dynamic texture recognition. Future work will be directed at extending HEM to general graphical models, allowing a wide variety of generative models to be clustered or used as codewords in a bag-of-X representation. Finally, in this work we have not addressed the model selection problem, i.e., selecting the number of reduced mixture components. Since HEM is based on maximum likelihood principles, it is possible to apply standard statistical model selection approaches, such as the Akaike information criterion (AIC) and Bayesian information criterion (BIC) [42]. Alternatively, inspired by Bayesian non-parametric statistics, the HEM formulation could be extended to include a Dirichlet process prior [43], with the number of components adapting to the data.

ACKNOWLEDGMENTS

The authors thank R. Péteri for the DynTex dataset, and G. Doretto for the UCLA dataset. AM and ABC were supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (CityU …). EC and GRGL acknowledge support from Qualcomm, Inc., Yahoo! Inc., the Hellman Fellowship Program, the Alfred P. Sloan Foundation, NSF Grants CCF-… and IIS-…, and the UCSD FWGrid Project, NSF Research Infrastructure Grant Number EIA-…. ABC, EC and GRGL also received support from a Google Research Award.

REFERENCES

[1] B. Horn and B. Schunck, "Determining optical flow," Artificial Intelligence, vol. 17, 1981.
[2] B. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proc. DARPA Image Understanding Workshop, 1981.
[3] G. Doretto, A. Chiuso, Y. N. Wu, and S. Soatto, "Dynamic textures," Intl. J. Computer Vision, vol. 51, no. 2, 2003.
[4] A. W. Fitzgibbon, "Stochastic rigidity: image registration for nowhere-static scenes," in ICCV, vol. 1, 2001.
[5] A. Ravichandran and R. Vidal, "Dynamic texture registration," IEEE Transactions on Pattern Analysis and Machine Intelligence.
[6] G. Doretto, D. Cremers, P. Favaro, and S. Soatto, "Dynamic texture segmentation," in ICCV, vol. 2, 2003.
[7] A. Ghoreyshi and R. Vidal, "Segmenting dynamic textures with Ising descriptors, ARX models and level sets," in Dynamical Vision Workshop in the European Conf. on Computer Vision.
[8] A. B. Chan and N. Vasconcelos, "Modeling, clustering, and segmenting video with mixtures of dynamic textures," IEEE TPAMI, vol. 30, no. 5, May 2008.
[9] R. Vidal and A. Ravichandran, "Optical flow estimation & segmentation of multiple moving dynamic textures," in IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, 2005.
[10] A. B. Chan and N. Vasconcelos, "Layered dynamic textures," IEEE Trans. on Pattern Analysis and Machine Intelligence: Special Issue on Probabilistic Graphical Models in Computer Vision, vol. 31, no. 10, October 2009.
[11] R. Chaudhry, A. Ravichandran, G. Hager, and R.
Vidal, "Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions," in CVPR, 2009.
[12] A. Bissacco, A. Chiuso, and S. Soatto, "Classification and recognition of dynamical models: The role of phase, independent components, kernels and optimal transport," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, 2007.
[13] P. Saisan, G. Doretto, Y. Wu, and S. Soatto, "Dynamic texture recognition," in CVPR, vol. 2, 2001.
[14] A. B. Chan and N. Vasconcelos, "Probabilistic kernels for the classification of auto-regressive visual processes," in CVPR, vol. 1, 2005.
[15] S. V. N. Vishwanathan, A. J. Smola, and R. Vidal, "Binet-Cauchy kernels on dynamical systems and its application to the analysis of dynamic scenes," Intl. J. Computer Vision, vol. 73, no. 1, 2007.
[16] R. Vidal and P. Favaro, "DynamicBoost: Boosting time series generated by dynamical systems," in IEEE Intl. Conf. on Computer Vision, 2007.
[17] A. B. Chan and N. Vasconcelos, "Classifying video with kernel dynamic textures," in IEEE Conf. Computer Vision and Pattern Recognition, 2007.
[18] A. Ravichandran, R. Chaudhry, and R. Vidal, "View-invariant dynamic texture recognition using a bag of dynamical systems," in CVPR, 2009.
[19] B. Ghanem and N. Ahuja, "Phase based modelling of dynamic textures," in IEEE Intl. Conf. on Computer Vision, 2007.
[20] F. Woolfe and A. Fitzgibbon, "Shift-invariant dynamic texture recognition," in ECCV, 2006.
[21] H. Cetingul and R. Vidal, "Intrinsic mean shift for clustering on Stiefel and Grassmann manifolds," in CVPR, 2009.
[22] A. Goh and R. Vidal, "Clustering and dimensionality reduction on Riemannian manifolds," in CVPR, 2008.
[23] N. Vasconcelos and A. Lippman, "Learning mixture hierarchies," in Neural Information Processing Systems, 1998.
[24] A. B. Chan, E. Coviello, and G. Lanckriet, "Clustering dynamic textures with the hierarchical EM algorithm," in Intl. Conference on Computer Vision and Pattern Recognition, 2010.
[25] G. Carneiro, A. B. Chan, P. J. Moreno, and N. Vasconcelos, "Supervised learning of semantic classes for image annotation and retrieval," IEEE TPAMI, vol. 29, no. 3, March 2007.
[26] A. Gelb, Applied Optimal Estimation. MIT Press, 1974.
[27] N.
Vasconcelos, Image ndexng wth mxture herarches, n IEEE Conf. Computer Vson and Pattern Recognton, A. Baneree, S. Merugu, I. Dhllon, and J. Ghosh, Clusterng wth bregman dvergences, Journal of Machne Learnng Research (JMLR), vol. 6, pp , J. V. Davs and I. Dhllon, Dfferental entropc clusterng of multvarate gaussans, n Adv. n Neural Inf. Proc. Sys. (NIPS, J. Goldberger and S. Rowes, Herarchcal clusterng of a mxture model, n In NIPS. MIT Press, 2005, pp R. E. Grffn and A. P. Sage, Senstvty analyss of dscrete flterng and smoothng algorthms, AIAA Journal, vol. 7, pp , Oct J. Wall, A. Wllsky, and N. Sandell, On the fxed-nterval smoothng problem. Stochastcs., vol. 5, pp. 1 41, E. Covello, A. Chan, and G. Lanckret, Tme seres models for semantc musc annotaton, Audo, Speech, and Language Processng, IEEE Transactons on, vol. 19, no. 5, pp , uly R. H. Shumway and D. S. Stoffer, An approach to tme seres smoothng and forecastng usng the EM algorthm, Journal of Tme Seres Analyss, vol. 3, no. 4, pp , A. P. Dempster, N. M. Lard, and D. B. Rubn, Maxmum lkelhood from ncomplete data va the EM algorthm, Journal of the Royal Statstcal Socety B, vol. 39, pp. 1 38, S. M. Kay, Fundamentals of Statstcal Sgnal Processng: Estmaton Theory. Prentce-Hall, R. Péter, S. Fazekas, and M. J. Huskes, DynTex: A comprehensve database of dynamc textures, Pattern Recognton Letters, vol. 31, no. 12, pp , Onlne. Avalable: 38 A. Ravchandran, R. Chaudhry, and R. Vdal, Categorzng dynamc textures usng a bag of dynamcal systems, Pattern Analyss and Machne Intellgence, IEEE Transactons on, vol. PP, no. 99, p. 1, K. G. Derpans and R. P. Wldes, Dynamc texture recognton based on dstrbutons of spacetme orented structure, n CVPR, G. Zhao and M. Petkanen, Dynamc texture recognton usng local bnary patterns wth an applcaton to facal expressons, IEEE Transactons on Pattern Analyss and Machne Intellgence, C. Chang and C. Ln, Lbsvm: a lbrary for support vector machnes, ACM TIST, T. Haste, R. Tbshran, and J. 
Fredman, The elements of statstcal learnng: data mnng, nference and predcton, 2nd ed. Sprnger, Onlne. Avalable: tbs/elemstatlearn/ 43 D. M. Ble and M. I. Jordan, Varatonal nference for drchlet process mxtures, Bayesan Analyss, vol. 1, pp , 2005.

Adeel Mumtaz received the B.S. degree in computer science from Pakistan Institute of Engineering and Applied Sciences and the M.S. degree in computer system engineering from Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Pakistan, in 2004 and 2006, respectively. He is currently working toward the PhD degree in Computer Science at the City University of Hong Kong. He is currently with the Video, Image, and Sound Analysis Laboratory, Department of Computer Science, CityU. His research interests include computer vision, machine learning, and pattern recognition.

Emanuele Coviello received the Laurea Triennale degree in information engineering and the Laurea Specialistica degree in telecommunication engineering from the Università degli Studi di Padova, Padova, Italy, in 2006 and 2008, respectively. He is currently pursuing the Ph.D. degree in the Department of Electrical and Computer Engineering, University of California at San Diego (UCSD), La Jolla, where he has joined the Computer Audition Laboratory. Mr. Coviello received the Premio Guglielmo Marconi Junior 2009 award from the Guglielmo Marconi Foundation (Italy), and won the 2010 Yahoo! Key Scientific Challenges Program, sponsored by Yahoo!. His main interest is machine learning applied to content-based information retrieval, multimedia data modeling, and automatic information extraction from the Internet.

Gert Lanckriet received the M.S. degree in electrical engineering from the Katholieke Universiteit Leuven, Leuven, Belgium, in 2000 and the M.S. and Ph.D. degrees in electrical engineering and computer science from the University of California, Berkeley, in 2001 and 2005, respectively. In 2005, he joined the Department of Electrical and Computer Engineering, University of California, San Diego, where he heads the Computer Audition Laboratory. His research focuses on the interplay of convex optimization, machine learning, and signal processing, with applications in computer audition and music information retrieval. Prof. Lanckriet was awarded the SIAM Optimization Prize in 2008 and is the recipient of a Hellman Fellowship, an IBM Faculty Award, an NSF CAREER Award, and an Alfred P. Sloan Foundation Research Fellowship. In 2011, MIT Technology Review named him one of the 35 top young technology innovators in the world (TR35).

Antoni B. Chan received the B.S. and M.Eng. degrees in electrical engineering from Cornell University, Ithaca, NY, in 2000 and 2001, respectively, and the Ph.D. degree in electrical and computer engineering from the University of California, San Diego (UCSD), San Diego. From 2001 to 2003, he was a Visiting Scientist with the Vision and Image Analysis Laboratory, Cornell University, Ithaca, NY, and in 2009, he was a Postdoctoral Researcher with the Statistical Visual Computing Laboratory, UCSD. In 2009, he joined the Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong, as an Assistant Professor. His research interests include computer vision, machine learning, pattern recognition, and music analysis. Dr. Chan was the recipient of an NSF IGERT Fellowship from 2006 to 2008, and an Early Career Award in 2012 from the Research Grants Council of the Hong Kong SAR, China.


More information

Fitting & Matching. Lecture 4 Prof. Bregler. Slides from: S. Lazebnik, S. Seitz, M. Pollefeys, A. Effros.

Fitting & Matching. Lecture 4 Prof. Bregler. Slides from: S. Lazebnik, S. Seitz, M. Pollefeys, A. Effros. Fttng & Matchng Lecture 4 Prof. Bregler Sldes from: S. Lazebnk, S. Setz, M. Pollefeys, A. Effros. How do we buld panorama? We need to match (algn) mages Matchng wth Features Detect feature ponts n both

More information

Journal of Chemical and Pharmaceutical Research, 2014, 6(6): Research Article

Journal of Chemical and Pharmaceutical Research, 2014, 6(6): Research Article Avalable onlne www.jocpr.com Journal of Chemcal and Pharmaceutcal Research, 2014, 6(6):2512-2520 Research Artcle ISSN : 0975-7384 CODEN(USA) : JCPRC5 Communty detecton model based on ncremental EM clusterng

More information

Dynamic Voltage Scaling of Supply and Body Bias Exploiting Software Runtime Distribution

Dynamic Voltage Scaling of Supply and Body Bias Exploiting Software Runtime Distribution Dynamc Voltage Scalng of Supply and Body Bas Explotng Software Runtme Dstrbuton Sungpack Hong EE Department Stanford Unversty Sungjoo Yoo, Byeong Bn, Kyu-Myung Cho, Soo-Kwan Eo Samsung Electroncs Taehwan

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

The Codesign Challenge

The Codesign Challenge ECE 4530 Codesgn Challenge Fall 2007 Hardware/Software Codesgn The Codesgn Challenge Objectves In the codesgn challenge, your task s to accelerate a gven software reference mplementaton as fast as possble.

More information

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Proceedngs of the Wnter Smulaton Conference M E Kuhl, N M Steger, F B Armstrong, and J A Jones, eds A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Mark W Brantley Chun-Hung

More information

Review of approximation techniques

Review of approximation techniques CHAPTER 2 Revew of appromaton technques 2. Introducton Optmzaton problems n engneerng desgn are characterzed by the followng assocated features: the objectve functon and constrants are mplct functons evaluated

More information

Active Contours/Snakes

Active Contours/Snakes Actve Contours/Snakes Erkut Erdem Acknowledgement: The sldes are adapted from the sldes prepared by K. Grauman of Unversty of Texas at Austn Fttng: Edges vs. boundares Edges useful sgnal to ndcate occludng

More information

Object-Based Techniques for Image Retrieval

Object-Based Techniques for Image Retrieval 54 Zhang, Gao, & Luo Chapter VII Object-Based Technques for Image Retreval Y. J. Zhang, Tsnghua Unversty, Chna Y. Y. Gao, Tsnghua Unversty, Chna Y. Luo, Tsnghua Unversty, Chna ABSTRACT To overcome the

More information

An Improved Neural Network Algorithm for Classifying the Transmission Line Faults

An Improved Neural Network Algorithm for Classifying the Transmission Line Faults 1 An Improved Neural Network Algorthm for Classfyng the Transmsson Lne Faults S. Vaslc, Student Member, IEEE, M. Kezunovc, Fellow, IEEE Abstract--Ths study ntroduces a new concept of artfcal ntellgence

More information

Improving Web Image Search using Meta Re-rankers

Improving Web Image Search using Meta Re-rankers VOLUME-1, ISSUE-V (Aug-Sep 2013) IS NOW AVAILABLE AT: www.dcst.com Improvng Web Image Search usng Meta Re-rankers B.Kavtha 1, N. Suata 2 1 Department of Computer Scence and Engneerng, Chtanya Bharath Insttute

More information

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices Steps for Computng the Dssmlarty, Entropy, Herfndahl-Hrschman and Accessblty (Gravty wth Competton) Indces I. Dssmlarty Index Measurement: The followng formula can be used to measure the evenness between

More information

Active 3D scene segmentation and detection of unknown objects

Active 3D scene segmentation and detection of unknown objects Actve 3D scene segmentaton and detecton of unknown objects Mårten Björkman and Danca Kragc Abstract We present an actve vson system for segmentaton of vsual scenes based on ntegraton of several cues. The

More information

Graph-based Clustering

Graph-based Clustering Graphbased Clusterng Transform the data nto a graph representaton ertces are the data ponts to be clustered Edges are eghted based on smlarty beteen data ponts Graph parttonng Þ Each connected component

More information

Optimal Scheduling of Capture Times in a Multiple Capture Imaging System

Optimal Scheduling of Capture Times in a Multiple Capture Imaging System Optmal Schedulng of Capture Tmes n a Multple Capture Imagng System Tng Chen and Abbas El Gamal Informaton Systems Laboratory Department of Electrcal Engneerng Stanford Unversty Stanford, Calforna 9435,

More information

An efficient method to build panoramic image mosaics

An efficient method to build panoramic image mosaics An effcent method to buld panoramc mage mosacs Pattern Recognton Letters vol. 4 003 Dae-Hyun Km Yong-In Yoon Jong-Soo Cho School of Electrcal Engneerng and Computer Scence Kyungpook Natonal Unv. Abstract

More information

A Novel Adaptive Descriptor Algorithm for Ternary Pattern Textures

A Novel Adaptive Descriptor Algorithm for Ternary Pattern Textures A Novel Adaptve Descrptor Algorthm for Ternary Pattern Textures Fahuan Hu 1,2, Guopng Lu 1 *, Zengwen Dong 1 1.School of Mechancal & Electrcal Engneerng, Nanchang Unversty, Nanchang, 330031, Chna; 2. School

More information

Applying EM Algorithm for Segmentation of Textured Images

Applying EM Algorithm for Segmentation of Textured Images Proceedngs of the World Congress on Engneerng 2007 Vol I Applyng EM Algorthm for Segmentaton of Textured Images Dr. K Revathy, Dept. of Computer Scence, Unversty of Kerala, Inda Roshn V. S., ER&DCI Insttute

More information