Manifold Regularized Slow Feature Analysis for Dynamic Texture Recognition

Jie Miao, Xiangmin Xu, Member, IEEE, Xiaofen Xing, and Dacheng Tao, Fellow, IEEE

arXiv: v1 [cs.CV] 9 Jun 2017

Abstract: Dynamic textures exist in various forms, e.g., fire, smoke, and traffic jams, but recognizing dynamic texture is challenging due to the complex temporal variations. In this paper, we present a novel approach stemming from slow feature analysis (SFA) for dynamic texture recognition. SFA extracts slowly varying features from fast varying signals. Fortunately, SFA is capable of extracting invariant representations from dynamic textures. However, complex temporal variations require high-level semantic representations to fully achieve temporal slowness, and thus it is impractical to learn a high-level representation from dynamic textures directly by SFA. In order to learn a robust low-level feature to resolve the complexity of dynamic textures, we propose manifold regularized SFA (MR-SFA) by exploring the neighbor relationship of the initial state of each temporal transition and retaining the locality of their variations. Therefore, the learned features are not only slowly varying, but also partly predictable. MR-SFA for dynamic texture recognition is proposed in the following steps: 1) learning feature extraction functions as convolution filters by MR-SFA, 2) extracting local features by convolution and pooling, and 3) employing Fisher vectors to form a video-level representation for classification. Experimental results on dynamic texture and dynamic scene recognition datasets validate the effectiveness of the proposed approach.

Index Terms: Dynamic texture recognition, slow feature analysis, temporal variation, manifold regularization.

I. INTRODUCTION

Dynamic texture is an extension of texture into the temporal domain. Dynamic textures exist in the real world in various forms, e.g., fire, smoke, water, human crowds, and traffic jams. Dynamic texture recognition can be used for many applications, e.g., fire detection, traffic monitoring, scene recognition, facial expression recognition and age estimation.
Xiangmin Xu is the corresponding author. J. Miao, X. Xu and X. Xing are with the School of Electronic and Information Engineering, South China University of Technology, Wushan RD., Tianhe District, Guangzhou, P.R. China. E-mail: (miaow1988@qq.com; xmxu@scut.edu.cn; xfxing@scut.edu.cn). D. Tao is with the Centre for Quantum Computation & Intelligent Systems and the Faculty of Engineering and Information Technology, University of Technology, Sydney, 81 Broadway Street, Ultimo, NSW 2007, Australia. E-mail: dacheng.tao@uts.edu.au

Static cues are not sufficient for dynamic texture recognition. Dynamic texture is a complex temporal process that takes place in the pixel domain. Non-rigid deformations in dynamic textures make the application of traditional computer vision approaches very challenging. For example, optical flow requires motion smoothness, and a histogram of gradients requires clear edges and boundaries. Neither of these conditions can be fulfilled by dynamic textures. Although much effort has been made, dynamic texture recognition remains a challenging problem.

A linear dynamical systems (LDS) approach attempts to model dynamic textures by a statistical generative model [1]. However, LDS is sensitive to viewpoints, scale, rotation, and other factors. Some carefully designed hand-crafted features (e.g., local binary patterns [2]) describe dynamic textures by capturing the appearances and temporal variations. They tend to suffer from complex temporal variations, for example, non-rigid deformations and spatial-temporal translations.

In contrast to these approaches, we attempt to resolve the temporal complexity of dynamic textures. Once the temporal complexity is untangled, dynamic textures can be represented well. Intuitively, the complexity of dynamic textures requires temporally invariant representations. Inspired by the temporal slowness principle, slow feature analysis (SFA) extracts slowly varying features from fast varying signals [3].
For example, pixels in a video of dynamic texture vary quickly over the short term, but the high-level semantic information of the video varies slowly over the long term. Fortunately, SFA is capable of extracting invariant representations from dynamic textures. However, the complex temporal variations that exist in dynamic textures require high-level semantic representations, which cannot be obtained directly by SFA. Kernel methods [4] and non-linear expansions [5] were employed to reduce the gap between high-dimensional fast varying inputs and slowly varying high-level semantic representations. However, they are still not sufficient to extract a robust representation for dynamic texture recognition.

To address temporal complexity in dynamic texture recognition, we learn slowly varying features for local video volumes, and then we obtain video-level representations by bag-of-words models. In this way, local video volumes are well represented by learned features, and the video-level representation is invariant to translations, viewpoints, scales, and other aspects. We further improve the standard SFA by exploring the manifold regularization [6] to ensure that the learned features are not only slowly varying but also partly predictable. Specifically, we construct a neighbor relationship of all temporal transitions by their initial states, and then constrain the locality of their variations in the learned feature space. Consequently, each temporal variation can be partly predicted by its initial state, and the temporal complexity in the dynamic textures can be resolved better. The evaluation on dynamic texture and scene recognition datasets shows that competitive results can be achieved compared with state-of-the-art approaches.

The remainder of this paper is organized as follows. Section II discusses related studies. Sections III and IV detail the proposed approach. The experimental results are presented in Section V, and the conclusions are drawn in Section VI.

II. RELATED WORK

This section discusses related work on dynamic texture recognition, and briefly reviews slow feature analysis and its improvements.

A. Dynamic Texture Recognition

A linear dynamical systems (LDS) approach for dynamic texture recognition was proposed assuming that dynamic textures are stationary stochastic processes [1]. LDS is a statistical generative model. It can be further used for dynamic texture synthesis [7]. The recognition is performed by comparing the parameters of LDS. Some kernel methods and distance learning approaches were then proposed to improve the comparison [8], [9]; however, their results are still limited by LDS-based features, which cannot handle different viewpoints, scales, or other aspects. A bag-of-words model based on LDS features was proposed to improve conventional LDS-based approaches [10]. Then, the bag-of-system-trees was further proposed for better efficiency [11]. Extreme learning machine (ELM) was applied to construct the codebook of LDS features while preserving the spatial and temporal characteristics of dynamic textures [12]. A hierarchical expectation maximization algorithm was proposed to cluster dynamic textures using LDS features [13]. The mixture of LDS was also exploited for modeling, clustering and segmenting dynamic textures [14]. Although LDS is reasonable and intuitive, it tends to suffer from complex temporal variations in the sequential process.

Local features have been successfully applied to dynamic texture recognition. Local binary patterns on three orthogonal planes (LBP-TOP) were proposed for dynamic texture and facial expression recognition [2]. Instead of processing the entire video, this approach extracts features from three orthogonal planes in the video cube. LBP-TOP has been generalized to the tensor orthogonal LBP for micro-expression recognition [15]. Similar to LBP-TOP, the method of multiscale binarized statistical image features on three orthogonal planes (MBSIF-TOP) was proposed using binarized responses of filters learned by applying independent component analysis on each plane [16].
By capturing the direction of natural flows, a spatiotemporal directional number transitional graph (DNG) was proposed using spatial structures and motions of each local region [17]. Although these approaches work well, they neglect a large amount of spatial-temporal information.

Some approaches have been proposed to fully utilize the spatial-temporal information. The spatio-temporal fractal analysis (DFS) was proposed using both volumetric and multi-slice dynamic fractal spectrum components [18]. Space-time orientation distributions generated by 3D Gaussian derivative filters were used for dynamic texture recognition [19], [20], and they have been successfully extended to bag-of-words models for dynamic scene recognition [21]. Although both space and time were considered, the performance of these approaches is affected by the complexity of spatial-temporal variations.

Recently, a high-order hidden Markov model was employed to model dynamic textures [22]. A dynamic shape and appearance model was proposed by learning a statistical model of the variability directly by a Gauss-Markov model [23]. A motion estimation approach based on locally and globally varying models was proposed to estimate optical flows in dynamic texture videos [24]. Besides the pixel domain, a wavelet domain multi-fractal analysis for dynamic texture recognition was proposed, and good results can be achieved by simply using frame averages [25].

High-level features have also been exploited for dynamic texture recognition. Deep learning has been successfully applied to general object recognition and detection. It has also been applied to dynamic texture recognition. A 3D convolutional neural network (CNN) was trained from a very large number of videos [26]. This 3D CNN has been used as a general video feature extractor, and achieved a good result on dynamic scene recognition. Many approaches use a pre-trained CNN as a high-level feature extractor [27]-[29]. These approaches outperform most existing dynamic texture recognition approaches.
Besides the CNN, a complex network was proposed to extract features from dynamic textures directly [30]. A deep belief network was used to extract features from conventional features [31]. In contrast to all of the above-mentioned approaches that are based on deeply learned networks, MR-SFA extracts features without using deep networks.

B. Slow Feature Analysis

Slow feature analysis (SFA) was proposed as an unsupervised learning approach [3]. Inspired by the temporal slowness principle, SFA extracts slowly varying features from fast varying signals. It has been shown that the properties of feature extraction functions learned by SFA are similar to complex cells in the primary visual cortex (V1) of the brain [32]. SFA has been successfully applied to applications such as human action recognition [5], [33], dynamic scene recognition [34], and blind source separation [35], [36].

It is impractical to apply SFA to an entire video, which is extremely high dimensional. A possible solution is to extract local features from a small receptive field and then use them for subsequent processing. Zhang and Tao [5] employed SFA and nonlinear expansion to learn slow features of local cubes and used their accumulation as the video representation. A discriminative SFA was also proposed in their work to further improve the recognition result. However, this approach cannot generalize well to complex videos due to its dependency on simple and clear foregrounds.

Some improvements have been proposed to handle complex videos [33], [37]. Inspired by deep learning, a hierarchical approach based on SFA was proposed [37]. This approach effectively extends the receptive field by a two-layer SFA feature extraction framework, and models videos by bag-of-words models. Afterward, SFA was generalized to temporal variance analysis to utilize both slow and fast features [33]. Although fast varying motion features outperform slowly varying appearance features, temporal variance analysis relies on stabilized local volumes that are tracked by optical flows.
In contrast to videos of human action, non-rigid deformations in dynamic textures are more challenging. It is difficult to extract a robust slowly varying feature for dynamic textures directly by SFA. To accomplish this goal, Theriault et al. [34] employed SFA as a post-processing of Gabor features for dynamic scene recognition.

Although significant improvements can be achieved compared with conventional Gabor features, the result is far from good compared with other approaches.

Many other improvements in SFA have also been proposed. A regularized sparse kernel SFA was proposed to generate feature spaces for linear algorithms [4]. A change detection algorithm based on an online kernel SFA was proposed for video segmentation and tracking [38]. Although kernel methods can handle nonlinear data, they introduce more noise and computational complexity than linear approaches. Minh and Wiskott [35] proposed a multivariate SFA for blind source separation. A probabilistic SFA was proposed for facial behavior analysis [39]. Slow feature discriminant analysis (SFDA) was proposed as a supervised learning approach by maximizing the inter-class temporal variance and minimizing the intra-class temporal variance simultaneously [40]. These approaches cannot be applied to dynamic texture recognition directly.

III. MANIFOLD REGULARIZED SLOW FEATURE ANALYSIS

This section describes mathematical details about the proposed manifold regularized SFA (MR-SFA). Matrices, vectors and scalars are denoted by uppercase letters, boldface lowercase letters and regular lowercase letters respectively (e.g., matrix $X$, vector $\mathbf{x}$ and scalar $x$). All of the vectors in the paper are column vectors. The matrix and vector transpose is denoted by the superscript $T$. For example, $X^T$ is the transpose of $X$.

A. Slow Feature Analysis

First, we give a brief introduction to slow feature analysis (SFA) [3]. SFA is an unsupervised learning approach that extracts slowly varying features from fast varying signals. Here, we consider only one temporal sequence for simplicity. We denote a temporal sequence as $X = [\mathbf{x}_1, \dots, \mathbf{x}_t] \in \mathbb{R}^{p \times t}$, where $\mathbf{x}_i$ is the state at time $i$. Without loss of generality, we assume that the input sequence $\{\mathbf{x}_i\}$ is centered, i.e., we have $\sum_{i=1}^{t} \mathbf{x}_i = 0$. SFA learns a new representation $Y = [\mathbf{y}_1, \dots, \mathbf{y}_t] \in \mathbb{R}^{q \times t}$ which globally minimizes the overall temporal variation of $X$.
Defining the temporal variation at time $i$ as $\dot{\mathbf{y}}_i = \mathbf{y}_i - \mathbf{y}_{i+1}$, the objective function of SFA can be formulated as

$$\arg\min_{Y} \sum_i \|\dot{\mathbf{y}}_i\|_2^2 \quad \text{s.t.} \quad Y Y^T = I, \tag{1}$$

where $I$ is an identity matrix. The constraint $Y Y^T = I$ guarantees a nontrivial solution. Considering the linear case that $Y$ is obtained by an affine function $Y = U^T X$, where $U \in \mathbb{R}^{p \times q}$, we have

$$\sum_i \|\dot{\mathbf{y}}_i\|_2^2 = \mathrm{tr}(\dot{Y} \dot{Y}^T) = \mathrm{tr}(U^T (\dot{X} \dot{X}^T) U), \tag{2}$$

where $\mathrm{tr}(\cdot)$ is the matrix trace operator, $\dot{Y} = [\dot{\mathbf{y}}_1, \dots, \dot{\mathbf{y}}_{t-1}] \in \mathbb{R}^{q \times (t-1)}$ and $\dot{X} = [\dot{\mathbf{x}}_1, \dots, \dot{\mathbf{x}}_{t-1}] \in \mathbb{R}^{p \times (t-1)}$. For simplicity, we further assume that the input sequence $\{\mathbf{x}_i\}$ is whitened. In particular, we have

$$X X^T = I. \tag{3}$$

Therefore, the constraint $Y Y^T = I$ can be simplified as

$$Y Y^T = U^T (X X^T) U = U^T U = I. \tag{4}$$

Lastly, the objective function of SFA can be reformulated as

$$\arg\min_{U} \mathrm{tr}(U^T (\dot{X} \dot{X}^T) U) \quad \text{s.t.} \quad U^T U = I, \tag{5}$$

and the solution $U$ can be obtained by solving the eigen-decomposition problem

$$(\dot{X} \dot{X}^T) U = \Lambda U, \tag{6}$$

where $\Lambda$ is a diagonal matrix of eigenvalues.

B. Manifold Regularized Slow Feature Analysis

Standard SFA simply minimizes the overall temporal variation. Non-rigid deformations in dynamic textures result in complex and noisy temporal transitions. Although features learned by standard SFA are slowly varying, they contain a large amount of noise. To improve standard SFA, we explore the manifold regularization [6] that is based on a simple intuition: temporal features should not only be slowly varying, but also be predictable. More specifically, each state transition in a temporal sequence consists of three elements, i.e., the initial state, the temporal variation, and the final state. In particular, the final state can be determined by the initial state and the temporal transition.

It has been proven that a dynamic texture can be regarded as a stationary process, and described by linear dynamical systems (LDS) [1]. Although we cannot model temporal transitions accurately by SFA, we can utilize successive temporal states, which can be reliable and predictable if we learn them properly. Ideally, in the long term, if two transitions have similar initial states, then they should have similar variations.
In other words, each temporal variation should be partly predicted by its initial state. To accomplish this goal, we construct a neighbor relationship of all temporal transitions by their initial states, and constrain the locality of their variations in the learned feature space. A conceptual illustration of the proposed MR-SFA is shown in Fig. 1.

Notably, it is essential to construct the neighbor relationship by states, and constrain the variations. For each temporal transition, similar states always result in similar variations. However, similar variations might be caused by totally different transitions. Therefore, although the constraint is imposed on the variations, the similarity of the temporal transitions should be determined by the initial states.

Defining each temporal transition as a tuple, MR-SFA can be summarized in two aspects: minimizing intra-variations of each tuple and preserving the locality of similar tuples. Therefore, the initial objective function of MR-SFA is formulated as

$$\arg\min_{U} \sum_i \|\dot{\mathbf{y}}_i\|_2^2 + \lambda \sum_{i,j} S_{ij} \|\dot{\mathbf{y}}_i - \dot{\mathbf{y}}_j\|_2^2 \quad \text{s.t.} \quad U^T U = I, \tag{7}$$

Fig. 1. A conceptual illustration of the proposed MR-SFA. SFA is illustrated for comparison.

where $S$ is the similarity matrix, and $\lambda$ is a hyper-parameter to balance the weight between the temporal slowness and the regularization. The first part of this objective function is identical to SFA, and the second part is the manifold regularization that retains the locality of the temporal transitions.

The similarity matrix $S$ is determined by the initial states of each temporal transition. Specifically, if $\mathbf{x}_i$ is among the k-nearest neighbors of $\mathbf{x}_j$, or $\mathbf{x}_j$ is among the k-nearest neighbors of $\mathbf{x}_i$, we set

$$S_{ij} = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{r}\right), \tag{8}$$

and $S_{ij} = 0$ otherwise. Here, $r$ is a hyper-parameter that regulates the weight of the neighboring connections. The objective function incurs a heavy penalty if the temporal variations of similar transitions are mapped far apart. In this way, the locality of the temporal variations in similar transitions is preserved, and the variations can be partly predicted by their current states.

Following some simple derivations, we then have

$$\sum_i \|\dot{\mathbf{y}}_i\|_2^2 + \lambda \sum_{i,j} S_{ij} \|\dot{\mathbf{y}}_i - \dot{\mathbf{y}}_j\|_2^2 = \mathrm{tr}(\dot{Y} \dot{Y}^T) + \lambda\, \mathrm{tr}(\dot{Y} (D - S) \dot{Y}^T) = \mathrm{tr}(\dot{Y} (I + \lambda (D - S)) \dot{Y}^T) = \mathrm{tr}(\dot{Y} L \dot{Y}^T), \tag{9}$$

where $D$ is a diagonal matrix with entries $D_{ii} = \sum_j S_{ij}$, and

$$L = I + \lambda (D - S). \tag{10}$$

Thus far, the initial objective function can be reformulated as

$$\arg\min_{U} \mathrm{tr}(\dot{Y} L \dot{Y}^T) \quad \text{s.t.} \quad U^T U = I. \tag{11}$$

The matrix $D$ provides a measurement of the importance of each tuple. If a tuple has more neighbor tuples, then it might be more predictable. Therefore, we add an additional constraint $\dot{Y} D \dot{Y}^T = I$ as the weight of each tuple. The new objective function is formulated as

$$\arg\min_{U} \mathrm{tr}(\dot{Y} L \dot{Y}^T) \quad \text{s.t.} \quad \dot{Y} D \dot{Y}^T = I \ \text{and} \ U^T U = I. \tag{12}$$

Notably, different from the constraint $Y Y^T = I$ used in standard SFA, the new constraint $\dot{Y} D \dot{Y}^T = I$ cannot guarantee that the learned new representation has an identity covariance matrix.
To eliminate the constraint $\dot{Y} D \dot{Y}^T = I$, the objective function can be further reformulated as

$$\arg\min_{U} \frac{\mathrm{tr}(\dot{Y} L \dot{Y}^T)}{\mathrm{tr}(\dot{Y} D \dot{Y}^T)} \quad \text{s.t.} \quad U^T U = I. \tag{13}$$

Considering that $Y = U^T X$, we have

$$\arg\min_{U} \frac{\mathrm{tr}(U^T (\dot{X} L \dot{X}^T) U)}{\mathrm{tr}(U^T (\dot{X} D \dot{X}^T) U)} \quad \text{s.t.} \quad U^T U = I. \tag{14}$$

Last, the solution $U$ can be obtained by solving the generalized eigenvalue problem

$$(\dot{X} L \dot{X}^T) U = \Lambda (\dot{X} D \dot{X}^T) U, \tag{15}$$

where $\Lambda$ is a diagonal matrix of eigenvalues. In practice, the first solution or first few solutions that correspond to eigenvalues close to zero might be caused by noise. These noisy solutions should be abandoned, and the remaining solutions can be used for subsequent processing.

Although Sprekeler [41] showed that SFA is related to Laplacian eigenmaps [42] for encoding the locality of the neighboring samples, the proposed MR-SFA focuses on variations in temporal transitions. The largest advantage of temporal slowness is to utilize the natural temporal relationship, which is stronger than the relationship constructed by k-nearest neighbors in the original space. Successive states in a sequence might be very different due to the complex temporal variation. MR-SFA resolves these variations despite the locality in the original space.

Moreover, the aforementioned algorithm uses only one temporal sequence for learning. It can be simply extended to more sequences by evaluating all possible temporal transitions as tuples. Overall, MR-SFA can be summarized as in Algorithm 1.

Algorithm 1 MR-SFA
Input: A temporal sequence $X = [\mathbf{x}_1, \dots, \mathbf{x}_t] \in \mathbb{R}^{p \times t}$, where $\mathbf{x}_i$ is a column vector that indicates the state of the sequence at time $i$. Here $\{\mathbf{x}_i\}$ is assumed to be centered and whitened.
Output: A slowly varying and partly predictable representation $Y = U^T X \in \mathbb{R}^{q \times t}$, and the projection matrix $U \in \mathbb{R}^{p \times q}$.
1: Construct the similarity matrix $S$ by the k-nearest neighbors of $\{\mathbf{x}_i\}$.
2: $D_{ii} \leftarrow \sum_j S_{ij}$, and $L \leftarrow I + \lambda (D - S)$.
3: $A \leftarrow \dot{X} L \dot{X}^T$
4: $B \leftarrow \dot{X} D \dot{X}^T$
5: Solve the generalized eigen-decomposition problem $A U = \Lambda B U$ to obtain the solution $U$ and $Y$.
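The steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (the experiments in the paper were run in Matlab). The function names `whiten`, `mr_sfa_matrices` and `solve_generalized` are hypothetical, the whitening is scaled to unit covariance ($X X^T / t = I$, which differs from Eq. (3) only by a constant scale), and the generalized eigenproblem is solved through a Cholesky factor of $B$, which assumes $B$ is positive definite. The resulting eigenvectors satisfy $U^T B U = I$ rather than $U^T U = I$.

```python
import numpy as np

def whiten(raw):
    """Center a (p, t) sequence and whiten it to unit covariance (assumed scaling)."""
    Xc = raw - raw.mean(axis=1, keepdims=True)
    w, E = np.linalg.eigh(Xc @ Xc.T / Xc.shape[1])
    return (E / np.sqrt(w)) @ E.T @ Xc

def mr_sfa_matrices(X, k=5, r=None, lam=0.1):
    """Steps 1-4 of Algorithm 1 for a centered, whitened X of shape (p, t)."""
    p, t = X.shape
    r = p / 2.0 if r is None else r        # r = m/2 mirrors the setting in Section V-A
    X0 = X[:, :-1]                         # initial state of each transition
    Xdot = X[:, :-1] - X[:, 1:]            # temporal variations x_i - x_{i+1}
    n = t - 1
    # Pairwise squared distances between initial states.
    sq = np.sum(X0 ** 2, axis=0)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X0.T @ X0), 0.0)
    np.fill_diagonal(d2, np.inf)           # a state is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :k]     # indices of the k nearest states
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), k), nn.ravel()] = True
    mask |= mask.T                         # "i in kNN(j) or j in kNN(i)"
    S = np.where(mask, np.exp(-d2 / r), 0.0)   # Eq. (8)
    D = np.diag(S.sum(axis=1))
    L = np.eye(n) + lam * (D - S)              # Eq. (10)
    return Xdot @ L @ Xdot.T, Xdot @ D @ Xdot.T

def solve_generalized(A, B):
    """Step 5: solve A u = lambda B u, eigenvalues returned in ascending order."""
    C = np.linalg.cholesky(B)              # B = C C^T
    Z = np.linalg.solve(C, A)              # C^{-1} A
    M = np.linalg.solve(C, Z.T).T          # C^{-1} A C^{-T}
    w, V = np.linalg.eigh((M + M.T) / 2.0)
    U = np.linalg.solve(C.T, V)            # map back: u = C^{-T} v
    return w, U
```

Given `w, U = solve_generalized(*mr_sfa_matrices(X))`, a representation with `q` features would be `U[:, :q].T @ X`; following the remark above about noisy solutions, one might instead skip the first column or first few columns before selecting `q` of them.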

Fig. 2. An illustration of the proposed dynamic texture recognition framework.

IV. MR-SFA FOR DYNAMIC TEXTURE RECOGNITION

This section presents the proposed feature extraction process for dynamic texture recognition. We learn feature extraction functions from randomly extracted small video cubes, and we use them as convolution filters to generate feature maps. Then, spatial and temporal pooling are employed to extract local features from feature maps. Last, all of the extracted features are encoded by Fisher vectors to obtain a video-level representation for classification. An illustration of the proposed framework is shown in Fig. 2.

A. Learning Convolution Filters

Generally speaking, larger receptive fields contain more high-level information. However, it is impractical to obtain a high-level semantic representation simply by a linear projection. We choose to learn feature extraction functions for small receptive fields (e.g., a local volume of spatial-temporal size [...]).

Two pre-processing procedures are required for applying MR-SFA. In practice, thousands of small video sequences are used for learning. We denote the size of each sequence as $h_s \times w_s \times l_s$, where $h_s$, $w_s$, and $l_s$ are the height, width and length respectively. First, frame-based sequences are reformatted into cube-based sequences to obtain long-term stable temporal transitions. This procedure is similar to the reformatting that is proposed in [5]. Specifically, we reformat each sequence into a new sequence that consists of elemental cubes of size $h_s \times w_s \times d_s$, where $d_s$ is the length of each elemental cube. The number of elemental cubes in each reformatted sequence is $l_n = l_s - d_s + 1$. In addition to achieving long-term temporal slowness, the reformatting procedure enables reliable temporal prediction.
Second, principal component analysis (PCA) and whitening are employed to reduce the dimension of all elemental cubes from $h_s \times w_s \times d_s$ to $m$. After applying PCA whitening, the size of each sequence is $m \times l_n$. Last, MR-SFA is applied to learn feature extraction functions from these sequences.

Combining PCA and MR-SFA, features can be extracted directly from raw videos. Specifically, we denote the projection matrices of MR-SFA and PCA whitening as $U$ and $P$, and the mean of all training samples as $\bar{\mathbf{b}}$. The feature extraction function is formulated as

$$g(\mathbf{x}) = U^T P^T (\mathbf{x} - \bar{\mathbf{b}}) = W^T \mathbf{x} + \mathbf{b}, \tag{16}$$

where

$$W = (U^T P^T)^T = P U \tag{17}$$

represents the weights (i.e., convolution filters), and

$$\mathbf{b} = -W^T \bar{\mathbf{b}} \tag{18}$$

represents the biases. Therefore, the feature extraction can be performed simply by applying this linear function.

Each column of $W = [\mathbf{w}_1, \dots, \mathbf{w}_q]$ is a 3D convolution filter of size $h_s \times w_s \times d_s$. All of the slices in a learned filter are similar to one another. Therefore, we use slim filters instead of full-length 3D filters [33]. Specifically, the first frame of each filter is used to replace the original full-length 3D filter. In this way, the size of each filter is reduced from $h_s \times w_s \times d_s$ to $h_s \times w_s$. We denote these slim filters as $\hat{W} = [\hat{\mathbf{w}}_1, \dots, \hat{\mathbf{w}}_q]$. The convolution can be performed more efficiently with these 2D filters.

A visualization of the learned convolution filters $\{\hat{\mathbf{w}}_i\}$ is shown in Fig. 3. As shown in the figure, filters learned by standard SFA are noisy due to the complex temporal variations in dynamic textures. Filters learned by MR-SFA are more reliable compared with filters learned by standard SFA.

B. Feature Maps

Feature maps are obtained by convolution and pooling. We denote each video as $[I_1, \dots, I_n]$, where $I_i$ is the $i$-th frame, and $n$ is the total number of frames. The $j$-th convolution output map of frame $I_i$ is obtained by

$$M_i^{(j)} = \hat{\mathbf{w}}_j * I_i + b_j, \tag{19}$$

Fig. 3. The visualization of 16 slim convolution filters: (a) convolution filters learned by MR-SFA; (b) convolution filters learned by standard SFA. Here, 16 filters are shown from top-left to bottom-right.

where the operator $*$ indicates the convolution operation, and $\hat{\mathbf{w}}_j$ and $b_j$ are the $j$-th convolution filter and bias respectively. We further perform a spatial pooling to reduce the spatial size of $M$. The new output is denoted as

$$\hat{M}_i^{(j)} = h(g(M_i^{(j)})), \tag{20}$$

where $h(\cdot)$ and $g(\cdot)$ are the spatial pooling and activation function respectively. By default, we simply use an absolute value function as the activation function, i.e., $g(x) = |x|$. The choice of the activation function $g(\cdot)$ significantly affects the recognition results. Both max-pooling and average-pooling can be used as the spatial pooling operation $h(\cdot)$. In our work, we use a non-overlapped max pooling of size $2 \times 2$ or $4 \times 4$.

Two types of feature maps are obtained from $\hat{M}$ for the subsequent feature extraction. Specifically, the $j$-th appearance feature map of a frame $I_i$ is obtained by

$$A_i^{(j)} = |\hat{M}_i^{(j)}|, \tag{21}$$

where the operator $|\cdot|$ is an element-wise absolute value function. Appearance feature maps $\{A_i\}_{1 \le i \le n}$ keep track of appearance information, which is important for dynamic texture recognition. For standard SFA, features that were extracted from appearance feature maps are used for the final representation. They represent near-static information, which is appearance information, and they are invariant to spatial-temporal variations.

Besides slowly varying features, we also propose variation feature maps for the variation itself. The $j$-th variation feature map of a frame $I_i$ is obtained by

$$V_i^{(j)} = |\hat{M}_i^{(j)} - \hat{M}_{i+1}^{(j)}|. \tag{22}$$

Variation feature maps $\{V_i\}_{1 \le i \le n-1}$ carry well distributed temporal variation information on dynamic textures. Features extracted from variation feature maps significantly improve the representation of dynamic textures. An illustration of some of the extracted feature maps is shown in Fig. 4.

Fig. 4. The visualization of feature maps ($A^{(1)}$ to $A^{(4)}$ and $V^{(1)}$ to $V^{(4)}$) generated by the first four convolution filters. Here, the original frame of these feature maps is the input video frame (rushing river), which is shown in Fig. 2.

C. Local Features

After the feature maps are obtained, a spatial-temporal pooling is performed to accomplish local feature extraction. Appearance and variation features are extracted independently due to the differences between the appearance and variation feature maps. Therefore, each set of convolution filters results in two sets of local features in our approach.

Considering a dynamic texture as a multi-channel video cube (each channel is a feature map), each local feature is extracted by pooling a local volume of size $h_p \times w_p \times l_p$, where $h_p$, $w_p$, and $l_p$ are the height, width and length of the local volume, respectively. The spatial and temporal strides of the local feature extraction are denoted as $s_s$ and $s_t$. An illustration of the pooling procedure in a sequence of feature maps is shown in Fig. 5. In our work, average pooling is used because it maintains more valuable information compared with max pooling.

Fig. 5. An illustration of spatial-temporal pooling in a sequence of feature maps.

First, spatial pooling is performed. For each slice of size $h_p \times w_p$ in a local volume, pooling is performed in four equally divided sub-regions (i.e., top-left, top-right, bottom-left, bottom-right) of size $\frac{h_p}{2} \times \frac{w_p}{2}$. Considering that we have eight feature maps that were generated by eight convolution filters, each slice in the local volume can be described by a $4 \times 8$-dimensional feature. Furthermore, normalization is applied to each slice by dividing by its L2-norm, to obtain better generalization.

Second, temporal pooling is performed on an entire local volume. Slices in a local volume are equally divided into three parts of length $\frac{l_p}{3}$, and then they are pooled together. These three parts are concatenated as the final representation of the local volume. Thus far, each local volume is described by a local feature of size [...].
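The spatial-temporal pooling described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `local_feature`, the array layout `(channels, l_p, h_p, w_p)`, the small `eps` guard against division by zero, and the example volume size are all hypothetical, and $h_p$, $w_p$ are assumed even with $l_p$ divisible by three. With eight feature maps, four sub-regions and three temporal parts, the sketch yields $4 \times 8 \times 3 = 96$ values, which matches the dimensionality quoted in Section IV-D.

```python
import numpy as np

def local_feature(volume, eps=1e-12):
    """Pool one local volume of feature maps into a single descriptor.

    volume: array of shape (c, l_p, h_p, w_p), one channel per feature map.
    """
    c, l, h, w = volume.shape
    # Spatial average pooling over the four quadrants of every slice.
    quads = volume.reshape(c, l, 2, h // 2, 2, w // 2).mean(axis=(3, 5))  # (c, l, 2, 2)
    per_slice = quads.transpose(1, 0, 2, 3).reshape(l, 4 * c)             # (l, 4c)
    # L2-normalize the descriptor of each slice.
    per_slice = per_slice / (np.linalg.norm(per_slice, axis=1, keepdims=True) + eps)
    # Temporal average pooling over three equal parts, then concatenation.
    parts = per_slice.reshape(3, l // 3, 4 * c).mean(axis=1)              # (3, 4c)
    return parts.reshape(-1)
```

For example, `local_feature(np.ones((8, 9, 6, 6)))` returns a vector of length 96. Appearance and variation feature maps would each be passed through the same routine separately, since the text extracts the two kinds of features independently.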

D. Video Representation

A large number of local features are extracted from each video. Each feature is a descriptor for a local volume. A bag-of-words model is employed to encode all of the local features into a high-level representation. Here we use the Fisher vector for the feature encoding procedure [43]. The Fisher vector encodes low-level features by their first- and second-order statistics.

Due to the orthogonal nature of the learned filters, we divide the learned filters into smaller sets, and each set consists of eight filters. Following the grouping of filters, the feature maps are also divided into different groups. The feature extraction is performed independently in each group. In this way, the complexity of each feature set is further reduced. The features can be well described by a small Gaussian mixture model (GMM) for Fisher vectors. Using more filters in a set will result in the under-fitting of the GMM, while using fewer filters in a set will result in a bad feature description. At this point, the dimensionality of each set of local features is [...] = 96. A PCA whitening is performed to reduce its dimensionality to 48 for encoding.

Moreover, we also introduce a multi-scale feature extraction. Specifically, local features are extracted from different spatial scales, and then they are encoded together by Fisher vectors. In practice, two or four extra spatial scales are sufficient.

After applying Fisher vectors, a power normalization and an L2-normalization are applied to each set of encoded features. Then, all of them are concatenated as the final representation of each dynamic texture, and an extra L2-normalization is further applied. Last, a one-against-all linear support vector machine (SVM) is employed for classification.

V. EXPERIMENTS

The proposed approach was evaluated on three datasets, which range from dynamic texture recognition to dynamic scene recognition. Some frames extracted from the used datasets are shown in Fig. 6.
The DynTex dataset [44] is a dynamic texture dataset that consists of more than 650 high-quality videos, e.g., sea grass, trees, smoke, escalators and traffic. We use the downsampled version, in which all videos are resized to a lower resolution. The standard classification benchmark divides DynTex into 3 subsets. The Alpha dataset consists of 3 classes with 60 videos, the Beta dataset consists of 10 classes with 162 videos, and the Gamma dataset consists of 10 classes with 275 videos. We use videos from the Alpha dataset for cross-validation, and we report the results on the Beta dataset and the Gamma dataset. In particular, we follow the standard evaluation protocol and report the mean accuracy of the leave-one-video-out (LOO) cross-validation [44]. Moreover, we also follow a recently proposed alternative evaluation protocol [45]. Specifically, five videos from each category are used for training and the remainder are used for testing. We report the mean accuracy on 20 random splits.

The YUPENN dynamic scene dataset [19] was introduced to emphasize scene-specific temporal information instead of camera-induced aspects. All of the videos in the dataset are captured by a stationary camera. The dataset consists of 14 dynamic scene categories, and each category contains 30 color videos. Videos in the dataset contain significant differences in resolution, frame rate, scale, illumination and viewpoint. We follow the leave-one-video-out cross-validation protocol and report the average accuracy as the final result [19].

The DynTex++ dataset [9] consists of 36 types of dynamic textures. Each type of dynamic texture contains 100 gray videos of the same size. Following the standard evaluation protocol, we train on half the samples of each category and test on the remaining samples, and we report the average accuracy of 20 random splits as the final result [9].

A. Implementation Details

In our experiments, all of the parameters were set according to the following descriptions, unless stated otherwise.

We randomly extracted 100,000 small video cubes from 100 videos to learn convolution filters by MR-SFA. All cubes had the same size h_s × w_s × l_s. These frame-based sequences were further reformatted into cube-based sequences. The length of each elemental cube was d_s = 6, and the number l_n of elemental cubes in each reformatted sequence was 10. The dimensionality of each elemental cube was reduced to m = 64 by PCA whitening, and then MR-SFA was performed to obtain convolution filters. The number k of nearest states was set to 5, the hyper-parameter r was set to m/2 = 32, and the weight λ of the manifold regularization was set to λ = 0.1. Twenty-four convolution filters of size 7 × 7 were learned, and then they were equally divided into three groups. Three sets of variation features and three sets of appearance features were obtained for the final representation. For each set of features, we trained a GMM with 16 clusters for Fisher vectors from 16,000 randomly sampled local features.

All of the videos were converted to gray videos, and they were truncated to 256 frames. The convolution was performed densely with a stride of one. For the DynTex dataset and the YUPENN dataset, a non-overlapping max pooling of size 4 × 4 was performed after the convolution. Each local feature covered a volume of size h_p × w_p × l_p. The spatial stride s_s was 1, and the temporal stride s_t was 3. Five spatial scales ranging from 1 down to 0.25 (including 1, 0.5 and 0.25) were used for feature extraction.

The videos in the DynTex++ dataset are much smaller, and thus we performed a non-overlapping max pooling of size 2 × 2 after the convolution. Each local feature again covered a volume of size h_p × w_p × l_p. The spatial stride s_s and the temporal stride s_t were set to 1 and 3, respectively. Five spatial scales ranging from 2 down to 0.5 (including 2, 1 and 0.5) were used.
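The non-overlapping max pooling applied after the convolution can be illustrated with a short NumPy sketch. It assumes an absolute-value activation before pooling (the combination reported as best in the parameter evaluation); the function name and the choice to drop border rows and columns that do not fill a complete block are assumptions made for the illustration, not details given in the text.

```python
import numpy as np

def abs_max_pool(response, size=4):
    """Absolute-value activation followed by non-overlapping max pooling.

    response: filter response of shape (H, W) or (H, W, T).
    size: spatial block size (4 for DynTex/YUPENN, 2 for DynTex++ in the text).
    Border rows/columns that do not fill a whole block are dropped (assumption).
    """
    a = np.abs(response)
    h = (a.shape[0] // size) * size
    w = (a.shape[1] // size) * size
    a = a[:h, :w]
    # Split both spatial axes into (block index, offset inside block).
    blocks = a.reshape(h // size, size, w // size, size, *a.shape[2:])
    return blocks.max(axis=(1, 3))

# Toy 4 x 4 response pooled with 2 x 2 blocks.
pooled = abs_max_pool(np.arange(-8, 8).reshape(4, 4), size=2)
```

Because the blocks do not overlap, the pooled map is smaller than the response by a factor of `size` along each spatial axis, while the temporal axis (if present) is left untouched.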
All of the experiments were implemented in Matlab 2014a on a Linux system, and they were conducted on a server that had two Intel Xeon E V1 CPUs and 128G RAM.

B. Parameter Evaluation

To evaluate the parameters used in our experiments efficiently, we constructed a subset based on the DynTex++ dataset. We randomly chose ten videos from each category for the subset. There are 360 videos from 36 categories in total. Three videos in each category were used for training, and the


remainder were used for testing. The average accuracy on 30 random splits was reported as the final result. To further speed up the evaluation, the stride of the convolution was increased to two.

Fig. 6. Representative frames of the datasets used in our experiments; the rows from top to bottom are DynTex, YUPENN and DynTex++, respectively.

Fig. 7. The evaluation of the parameter λ. Each value of λ was evaluated 20 times. The mean accuracies obtained by different values are marked as circles and connected by a polyline. In addition to MR-SFA, the results of standard SFA are also reported as a baseline.

First, we analyzed the influence of the parameter λ, which is the weight of the manifold regularization. In addition to λ = 0, values that ranged from 0.01 to 1000 were evaluated. The evaluation of each value was repeated 20 times to obtain a credible result. As shown in Fig. 7, good results can be achieved with λ ≤ 0.3, and the best result was achieved with λ = 0.1. Using a large λ corrupts the temporal slowness of the extracted features, and thus they perform poorly. Notably, using λ = 0 also achieved a competitive result due to the additional manifold constraint Ẏ^T D Ẏ = I. Besides MR-SFA, the results obtained by SFA are also reported as the baseline. In particular, MR-SFA was replaced with SFA, and all of the other parameters were kept the same. MR-SFA significantly outperforms standard SFA. The improvement comes from two aspects: the regularization for partial prediction, and the weight constraint of the tuples. Based on this evaluation, we use λ = 0.1 for all subsequent experiments.

Second, we evaluated the number of GMM clusters that were used for the video representation. We tested different GMM cluster numbers that ranged from 1 to 256. Similar to the previous evaluation, the evaluation of each number was repeated 20 times, and the results are shown in Fig. 8. The best result was obtained using 16 GMM clusters. Cluster numbers in the range from eight to 64 are competitive with the best. Notably, results obtained by only one GMM cluster outperform results obtained by 256 GMM clusters. The proposed dynamic texture recognition relies on a small number of features, and using a large number of GMM clusters results in overfitting.

Fig. 8. The evaluation of different numbers of GMM clusters for Fisher vectors. Each value was evaluated 20 times. The mean accuracies that were obtained by different values are marked as circles and connected by a polyline.

TABLE I
THE RECOGNITION ACCURACY (%) OBTAINED BY DIFFERENT ACTIVATION FUNCTIONS AND POOLING METHODS ON THE DYNTEX++ SUBSET.
Activation functions: Linear, ReLu, Abs, Square. Pooling methods: Max, Avg.

We also evaluated different combinations of activation functions g(·) and pooling methods h(·). Each combination was evaluated 20 times, and the mean results are shown in Table I. Among all of the combinations, max pooling with the absolute value function achieved the best performance. Max pooling outperforms average pooling. The absolute value function takes advantage of both the positive and negative responses. Thus, it performs better than the linear function and the rectified linear unit (ReLu) [46] in our experiments. The square function performed poorly compared with the absolute value function. Although it works in a similar way to the absolute value function, it corrupts the linearity of

the original responses.

C. Feature Evaluation

We further conducted experiments on different datasets to analyze each set of features. Here, each evaluation is repeated 3 times, and the average result is reported. In contrast to the experiments conducted on the subset of the DynTex++ dataset, the experiments here attempted to achieve the best result. Therefore, the variance of the obtained results is small, and it can be ignored. There were 24 convolution filters that were separated into three sets, and six sets of features were generated from them. Three sets of variation features (VF) obtained from variation feature maps {V} are denoted as VF1, VF2 and VF3, and three sets of appearance features (AF) obtained from appearance feature maps {A} are denoted as AF1, AF2 and AF3.

TABLE II
THE RECOGNITION ACCURACY (%) OBTAINED BY DIFFERENT SETS OF MR-SFA FEATURES.
Columns: DynTex LOO (Beta, Gamma), DynTex Alternative (Beta, Gamma), YUPENN, DynTex++. Rows: single and combined feature sets, from the individual AF and VF sets up to AF1+VF1+AF2+VF2 and AF+VF.

As shown in Table II, the combination of all of the features (AF+VF) performed best, and each single feature set performed poorly compared with the best result. Notably, although the first set of filters is the best solution of MR-SFA compared with the others, it sometimes performed worse than the other filters. This phenomenon might be caused by the noise that exists in the learned features. As described in the previous section, the first one or few solutions of MR-SFA should be abandoned due to the noise, and we abandoned only the first solution in all experiments.

Using more sets of filters is helpful. However, a combination of different types of features is more effective. In our experiments, using more than three sets of filters barely improved the accuracy. Notably, the best recognition accuracy can be achieved by only using features that were obtained using 16 filters (AF1+VF1+AF2+VF2) on the DynTex Beta dataset and the DynTex Gamma dataset.
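The filter selection described above (abandon the first MR-SFA solution, keep 24 filters, and divide them equally into three groups of eight) can be sketched as follows. The matrix layout is an assumption made for illustration: the solutions are taken to be stored as the columns of a hypothetical matrix `W`, ordered from the first solution onward, and the function name is likewise hypothetical.

```python
import numpy as np

def group_filters(W, n_skip=1, n_filters=24, n_groups=3):
    """Drop the first n_skip solutions and split the next n_filters into groups.

    W: array of shape (m, n_solutions); each column is one learned solution
       (assumed layout, ordered from the first solution onward).
    Returns a list of n_groups arrays, each of shape (m, n_filters // n_groups).
    """
    if W.shape[1] < n_skip + n_filters:
        raise ValueError("not enough solutions to select the requested filters")
    kept = W[:, n_skip:n_skip + n_filters]
    return np.split(kept, n_groups, axis=1)

# Toy matrix: 64-dimensional whitened inputs and 30 candidate solutions.
W = np.arange(64 * 30).reshape(64, 30)
groups = group_filters(W)
```

Each group would then yield its own appearance and variation feature sets, which is how three filter groups lead to six sets of features.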
Both the appearance and motion information are essential to dynamic texture recognition. It is difficult to tell which contributes more. VF outperforms AF on both the DynTex dataset and the YUPENN dataset. All of these datasets have a relatively large resolution and complex backgrounds, which makes VF more robust than AF. However, AF outperforms VF on the DynTex++ dataset. This result might be caused by the simplicity of the videos in this dataset.

TABLE III
THE RECOGNITION ACCURACY (%) OBTAINED ON THE DYNTEX DATASET COMPARED WITH STATE-OF-THE-ART APPROACHES, USING THE LEAVE-ONE-VIDEO-OUT PROTOCOL.
Columns: Beta, Gamma. Methods: DFS [18], MBSIF-TOP [16], ELM [12], ST-TCoF [27], SFA, MR-SFA.

TABLE IV
THE RECOGNITION ACCURACY (%) OBTAINED ON THE DYNTEX DATASET COMPARED WITH STATE-OF-THE-ART APPROACHES, USING FIVE VIDEOS IN EACH CATEGORY FOR TRAINING.
Columns: Beta, Gamma. Methods: DFS [18], OTF [47], LBP-TOP [25], OTD [45], SFA, MR-SFA.

D. Comparison with State-of-the-Art Approaches

In this subsection, we compare the proposed approach with state-of-the-art approaches. We also report results that were obtained by SFA for the comparison.

The comparison on the DynTex dataset is shown in Tables III and IV. MR-SFA outperforms all of the existing approaches on the DynTex dataset. MBSIF-TOP can be regarded as an improvement over LBP-TOP; it performs well on the DynTex dataset. A significant improvement can be achieved based on LDS features using ELM. The spatial-temporal transferred convolutional neural network feature (ST-TCoF) was proposed using a pre-trained convolutional neural network [27]. With the prior knowledge of more than a million images, ST-TCoF

outperforms most of the existing features. The results obtained by MR-SFA are slightly better than the results obtained by ST-TCoF. The oriented template features (OTF) employ SIFT-like feature descriptors and a powerful global statistical descriptor for texture description [47]. The orthogonal tensor dictionary (OTD) employs tensor-based sparse coding as a dynamic texture descriptor [45]. MR-SFA outperforms all of these approaches on both evaluation protocols.

The comparison on the YUPENN dataset is shown in Table V. MR-SFA outperforms most of the existing approaches on the YUPENN dataset. An approach called bags of spacetime energies (BoSE) was proposed for dynamic scene recognition [21]. This approach uses oriented 3D Gaussian third-derivative filters for feature extraction. The result obtained by AlexNet is also reported as a baseline for all of the CNN-based approaches [26], [48]. The convolution 3D (C3D) [26], the statistical aggregation convolutional neural network (SA-CNN) [28], and ST-TCoF are CNN-based approaches that involve pre-training on enormous amounts of data. Compared with these CNN-based approaches, the results obtained by MR-SFA are still competitive.

TABLE V
THE RECOGNITION ACCURACY (%) OBTAINED ON THE YUPENN DATASET COMPARED WITH STATE-OF-THE-ART APPROACHES.
Methods               Accuracy
Gabor+SFA [34]        85.0
BoSE [21]             96.2
AlexNet [26], [48]    96.7
C3D [26]              98.1
SA-CNN [28]           98.3
ST-TCoF [27]          99.1
SFA                   97.4
MR-SFA                97.9

TABLE VI
THE RECOGNITION ACCURACY (%) ON THE DYNTEX++ DATASET COMPARED WITH STATE-OF-THE-ART APPROACHES.
Methods                       Accuracy
LBP-TOP [16]                  89.5
DFS [18]                      91.7
DNG [17]                      93.8
OTD [45]                      94.7
Chi-Square LBP-TOP [49]       97.0
MBSIF-TOP [16]                97.2
SFA                           97.0
MR-SFA                        97.7

The comparison on the DynTex++ dataset is shown in Table VI. Videos of the DynTex++ dataset contain less background than those of the other datasets. Therefore, LBP-TOP and its improvements show significant advantages on the DynTex++ dataset. Similar to LBP-TOP, DNG extracts features from nine different planes in the video cube.
The chi-squared LBP-TOP was proposed using a chi-squared transformation to better fit the Gaussian distribution [49]. MR-SFA outperforms all of the state-of-the-art approaches on the DynTex++ dataset.

Overall, both SFA and MR-SFA can achieve competitive results. More specifically, state-of-the-art results on the DynTex dataset and the DynTex++ dataset can be achieved by MR-SFA. MR-SFA obtains significant improvements on both the DynTex dataset and the DynTex++ dataset compared with standard SFA. The improvements arise from the proposed manifold regularization and the variation features. Because the YUPENN dataset contains fewer complex temporal transitions, improvements on the YUPENN dataset are relatively small.

TABLE VII
THE FEATURE EXTRACTION SPEED (FRAMES PER SECOND) EVALUATED ON A SINGLE CPU CORE.
Columns: DynTex++, DynTex, YUPENN. Rows: increasing numbers of convolution filters, starting from 8 filters.

Compared with LDS features, the features that were extracted by MR-SFA are well distributed. They can be easily modeled by a small number of GMM clusters for the video representation. In contrast, the parameters of LDS are highly nonlinear. They cannot be compared directly with respect to classification, nor are they well modeled by the conventional bag-of-words models to obtain better representations. MBSIF-TOP performs best among all of the approaches that extract features from orthogonal planes. MR-SFA outperforms MBSIF-TOP due to the learned slowly varying features and the bag-of-words model. In particular, the temporal complexity is well resolved by the learned slowly varying features, and the proposed manifold regularization further improves the robustness of the learned features.

CNN-based approaches (i.e., C3D, SA-CNN and ST-TCoF) perform well among all of the dynamic texture approaches. In particular, pre-trained CNN features contain large amounts of high-level semantic information, and thus they perform best on the YUPENN dataset. Compared with CNN-based approaches, MR-SFA uses only a single convolutional layer and fewer convolution filters.
MR-SFA outperforms CNN-based approaches on the DynTex dataset. Moreover, MR-SFA can be applied to the DynTex++ dataset, which consists of gray videos that have a small resolution and fewer semantic objects. In this situation, CNN-based approaches cannot be applied directly, but MR-SFA is still efficient and effective.

E. Computational Efficiency

In this subsection, we analyze the efficiency of the proposed approach. In our implementation, the convolution was implemented by matrix multiplications, and the pooling was implemented by integral images. Therefore, the proposed dense feature extraction can be performed efficiently. We report the average feature extraction speed on each dataset in Table VII. The evaluation was conducted on a single CPU core running at 2.4 GHz. As shown in the table, using more convolution filters linearly increases the computational complexity. Due to the low resolution of the videos, the feature extraction on the DynTex++ dataset is efficient compared with the others. Most of the computational time of the feature extraction is spent on convolution and pooling. In practice, the speed can be simply improved by using more CPU cores, or using GPUs for

acceleration. In our implementation, we simply employ data parallelism to speed up the feature extraction process.

VI. CONCLUSION

We have proposed a novel approach for dynamic texture recognition. Specifically, we learn feature extraction functions by MR-SFA, and employ convolution and pooling for local feature extraction. Then dynamic textures are represented using bag-of-words models. To the best of our knowledge, this study is the first to introduce SFA to dynamic texture recognition. The proposed MR-SFA further improves standard SFA by exploring manifold regularization. In particular, we construct the neighbor relationship of the initial states of each temporal transition, and retain the locality of their variations in the temporal transition. In this way, the variation in each temporal transition can be partly predicted by its initial state. This approach ensures that the learned features are robust to complex and noisy temporal transitions.

Overall, the proposed MR-SFA benefits from the following three aspects. First, the learned local features are not only slowly varying but also partly predictable, and thus the temporal complexity of the dynamic textures can be better resolved. Second, local features are densely extracted by convolution and pooling, which further improves the robustness of the extracted local features. Last, the bag-of-words model ensures that the final representation is invariant to various spatial-temporal translations, viewpoints, scales, and other aspects. Experimental results show that competitive results can be achieved by the proposed approach. State-of-the-art results can be achieved on the DynTex and DynTex++ datasets.

REFERENCES

[1] G. Doretto, A. Chiuso, Y. N. Wu, and S. Soatto, "Dynamic textures," International Journal of Computer Vision, vol. 51, no. 2.
[2] G. Zhao and M. Pietikäinen, "Dynamic texture recognition using local binary patterns with an application to facial expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6.
[3] L. Wiskott and T. J.
Sejnowski, "Slow feature analysis: Unsupervised learning of invariances," Neural Computation, vol. 14.
[4] W. Böhmer, S. Grünewälder, H. Nickisch, and K. Obermayer, "Regularized sparse kernel slow feature analysis," in Machine Learning and Knowledge Discovery in Databases. Springer, 2011.
[5] Z. Zhang and D. Tao, "Slow feature analysis for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34.
[6] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: A geometric framework for learning from labeled and unlabeled examples," Journal of Machine Learning Research, vol. 7, no. 3.
[7] C.-C. Hsu, L.-W. Kang, and C.-W. Lin, "Temporally coherent superresolution of textured video via dynamic texture synthesis," IEEE Transactions on Image Processing, vol. 24, no. 3, March.
[8] A. B. Chan and N. Vasconcelos, "Probabilistic kernels for the classification of auto-regressive visual processes," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
[9] B. Ghanem and N. Ahuja, "Maximum margin distance learning for dynamic texture recognition," in European Conference on Computer Vision (ECCV). Springer, 2010.
[10] A. Ravichandran, R. Chaudhry, and R. Vidal, "Categorizing dynamic textures using a bag of dynamical systems," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 2.
[11] A. Mumtaz, E. Coviello, G. R. Lanckriet, and A. B. Chan, "A scalable and accurate descriptor for dynamic textures using bag of system trees," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 4.
[12] L. Wang, H. Liu, and F. Sun, "Dynamic texture video classification using extreme learning machine," Neurocomputing, vol. 174.
[13] M. Adeel, C. Emanuele, L. Gert R G, and C. Antoni B, "Clustering dynamic textures with the hierarchical EM algorithm for modeling video," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 7.
[14] A. B. Chan and N.
Vasconcelos, "Modeling, clustering, and segmenting video with mixtures of dynamic textures," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 5.
[15] S.-J. Wang, W.-J. Yan, X. Li, G. Zhao, C.-G. Zhou, X. Fu, M. Yang, and J. Tao, "Micro-expression recognition using color spaces," IEEE Transactions on Image Processing, vol. 24, no. 12, Dec.
[16] S. R. Arashloo and J. Kittler, "Dynamic texture recognition using multiscale binarized statistical image features," IEEE Transactions on Multimedia, vol. 16.
[17] A. Ramirez Rivera and O. Chae, "Spatiotemporal directional number transitional graph for dynamic texture recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 1, pp. 1-1.
[18] Y. Xu, Y. Quan, Z. Zhang, H. Ling, and H. Ji, "Classifying dynamic textures via spatiotemporal fractal analysis," Pattern Recognition, vol. 48, no. 10.
[19] K. G. Derpanis, "Dynamic scene understanding: The role of orientation features in space and time in scene classification," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[20] K. G. Derpanis and R. P. Wildes, "Spacetime texture representation and recognition based on a spatiotemporal orientation analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 6.
[21] C. Feichtenhofer, A. Pinz, and R. P. Wildes, "Bags of spacetime energies for dynamic scene recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2014.
[22] Y. Qiao and L. Weng, "Hidden Markov model based dynamic texture classification," IEEE Signal Processing Letters, vol. 22, no. 4.
[23] G. Doretto and S. Soatto, "Dynamic shape and appearance models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12.
[24] H. Sakaino, "Motion estimation for dynamic texture videos based on locally and globally varying models," IEEE Transactions on Image Processing, vol. 24, no. 11, Nov.
[25] H. Ji, X. Yang, H. Ling, and Y.
Xu, "Wavelet domain multifractal analysis for static and dynamic texture classification," IEEE Transactions on Image Processing, vol. 22, no. 1.
[26] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in IEEE International Conference on Computer Vision (ICCV), 2015.
[27] X. Qi, C.-G. Li, G. Zhao, X. Hong, and M. Pietikäinen, "Dynamic texture and scene classification by transferring deep image features," arXiv preprint.
[28] A. Gangopadhyay, S. M. Tripathi, I. Jindal, and S. Raman, "SA-CNN: Dynamic scene classification using convolutional neural networks," arXiv preprint.
[29] M. Harandi, M. Salzmann, and M. Baktashmotlagh, "Beyond Gauss: Image-set matching on the Riemannian manifold of PDFs," arXiv preprint.
[30] W. N. Gonçalves, B. B. Machado, and O. M. Bruno, "A complex network approach for dynamic texture recognition," Neurocomputing, vol. 153.
[31] Y. Wang and S. Hu, "Exploiting high level feature for dynamic textures recognition," Neurocomputing, vol. 154.
[32] P. Berkes and L. Wiskott, "Slow feature analysis yields a rich repertoire of complex cell properties," Journal of Vision, vol. 5, no. 6, p. 9.
[33] J. Miao, X. Xu, S. Qiu, C. Qing, and D. Tao, "Temporal variance analysis for action recognition," IEEE Transactions on Image Processing, vol. 24, no. 12, Dec.
[34] C. Theriault, N. Thome, and M. Cord, "Dynamic scene classification: Learning motion descriptors with slow features analysis," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013.
