4648 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 26, NO. 10, OCTOBER 2017


Action Recognition Using 3D Histograms of Texture and A Multi-Class Boosting Classifier

Baochang Zhang, Yun Yang, Chen Chen, Linlin Yang, Jungong Han, and Ling Shao, Senior Member, IEEE

Abstract: Human action recognition is an important yet challenging task. This paper presents a low-cost descriptor called 3D histograms of texture (3DHoTs) to extract discriminant features from a sequence of depth maps. 3DHoTs are derived from projecting depth frames onto three orthogonal Cartesian planes, i.e., the frontal, side, and top planes, and thus compactly characterize the salient information of a specific action, on which texture features are calculated to represent the action. Besides this fast feature descriptor, a new multi-class boosting classifier (MBC) is also proposed to efficiently exploit different kinds of features in a unified framework for action classification. Compared with the existing boosting frameworks, we add a new multi-class constraint into the objective function, which helps to maintain a better margin distribution by maximizing the mean of margin while still minimizing the variance of margin. Experiments on the MSRAction3D, MSRGesture3D, MSRActivity3D, and UTD-MHAD data sets demonstrate that the proposed system combining 3DHoTs and MBC is superior to the state of the art.

Index Terms: Action recognition, multi-class classification, boosting classifier, depth image, texture feature.

I. INTRODUCTION

Human action recognition has been an active research topic in computer vision in the past 15 years. It can facilitate a variety of applications, ranging from human computer interaction [1]-[3], motion sensing based gaming, and intelligent surveillance to assisted living [4]. Early research mainly focuses on identifying human actions from video sequences captured by RGB video cameras.
In [5], binary motion-energy images (MEI) and motion-history images (MHI) are used to represent where motion has occurred and characterize human actions. In [6], a low computational-cost volumetric action representation from different view angles is utilized to obtain high recognition rates. In [7], the notion of spatial interest points is extended to the spatio-temporal domain based on the idea of the Harris interest point operator. The results show its robustness to occlusion and noise. In [8], a motion descriptor built upon the spatio-temporal optical flow measurement is introduced to deal with low resolution images.

Manuscript received December 19, 2016; revised April 24, 2017; accepted June 4. Date of publication June 21, 2017; date of current version July 18. This work was supported in part by the Natural Science Foundation of China under Contract and Contract. The work of B. Zhang was supported in part by the Beijing Municipal Science and Technology Commission under Grant Z and in part by the Open Projects Program of National Laboratory of Pattern Recognition. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Weisi Lin. (B. Zhang and Y. Yang contributed equally to this work.) (Corresponding author: Jungong Han.)
B. Zhang, Y. Yang, and L. Yang are with Beihang University, Beijing, China (e-mail: bczhang@buaa.edu.cn; yanglinlin@buaa.edu.cn).
Y. Yang is with the Computer Vision Laboratory, Noah's Ark Laboratory, Huawei Technologies, Beijing, China (e-mail: yangyun18@huawei.com).
C. Chen is with the Center for Research in Computer Vision, University of Central Florida, Orlando, FL, USA (e-mail: chenchen870713@gmail.com).
J. Han is with the School of Computing and Communications, Lancaster University, Lancaster LA1 4YW, U.K. (e-mail: jungonghan77@gmail.com).
L. Shao is with the School of Computing Sciences, University of East Anglia, Norwich NR4 7TJ, U.K. (e-mail: ling.shao@ieee.org).
Color versions of one or more of the figures in this paper are available online at
Digital Object Identifier /TIP
Despite the great progress in the past decades, recognizing actions in the real world environment is still problematic. With the development of RGB-D cameras, especially Microsoft Kinect, more recent research works focus on action recognition using depth images [9], [10] due to the fact that depth information is much more robust to changes in lighting conditions, compared with the conventional RGB data. In [11], a bag of 3D points corresponding to the nodes in an action graph is generated to recognize human actions from depth sequences. Alternatively, an actionlet ensemble model is proposed in [12] and the developed local occupancy patterns are shown to be immune to noise and invariant to translational and temporal misalignments. In [13], Histograms of Oriented Gradients (HOG) computed from Depth Motion Maps (DMMs) are generated, capturing body shape and motion information from depth images. In [14], Chen et al. combine Local Binary Pattern (LBP) and the Extreme Learning Machine (ELM), achieving the best performance on their own datasets.

In summary, although depth based methods have been popular, they cannot perform reliably in practical applications where large intra-class variations, e.g., the action-speed difference, exist. Such a drawback is mainly caused by two algorithm design faults. First, the visual features fed into the classifier are unable to obtain different kinds of discriminating information, the diversity of which is required in building a robust classifier. Second, few works take the theoretical bounds into account when combining different learning models for classification. We perceive that most existing works empirically stack up different learning models without any theoretical guidance, even though the results are acceptable in some situations.

To improve the robustness of the system, especially for practical application usage, we propose a feature descriptor, namely 3D Histograms of Texture (3DHoTs), which is able to extract discriminative features from depth images.
More specifically, 3DHoT is an extension of our previous DMM-LBP descriptor in the sense that the complete local binary pattern (CLBP) proposed in [15] for texture classification is employed to capture more texture features, thereby enhancing the feature representation capacity. This new feature is able to describe the motion information from various perspectives such as sign, magnitude and local difference based on the global center. Besides, we also improve the classification by combining the extreme learning machine (ELM) and a new multi-class boosting classifier (MBC). This paper is an extension of [60] in the sense that we provide the theoretical derivation of our objective, which aims to minimize the variance of margin samples following the Gaussian Mixture Model (GMM) distribution. From the theoretical perspective, our classification technique is an ensemble of base classifiers on different types of features, making it possible to tackle extremely challenging action recognition tasks. In summary, our work differs from the existing work in two aspects.

1. The primary contribution lies in a multi-class boosting classifier, which makes it possible to exploit different kinds of features in a unified framework. Compared to the existing boosting frameworks, we add a new multi-class constraint into the objective function, which helps to maintain a better margin distribution by maximizing the mean margin while controlling the margin variance even if the margin samples follow a complicated distribution, i.e., GMM.

2. We enhance our previous DMM-LBP descriptor [9] by using a more advanced texture extraction model, CLBP [15]. This new 3DHoTs feature combining DMM and CLBP encodes motion information across depth frames and local texture variation simultaneously. Using this representation can improve the performance of depth-based action recognition, especially for realistic applications.

The rest of the paper is organized as follows. Section II briefly reviews related work on depth feature representations. Section III describes the details of 3DHoT features. Section IV introduces the multi-class boosting method as well as its theoretical discussions. Experimental results are given in Section V. Some concluding remarks are drawn in Section VI.

II. RELATED WORK

Recently, depth based action recognition methods have gained much attention due to their robustness to changes in lighting conditions [16]. Researchers have made great efforts to obtain a distinctive action recognition system based on depth or skeleton models. This section presents a review of related work with a focus on feature representations for depth maps and classifier fusion, which are in line with our two contributions.

A. Feature Representation for Action Recognition

Two commonly used kinds of visual features for action recognition are handcrafted features and learned features. The former capture certain motion, shape or texture attributes of the action using statistical approaches, while the latter automatically obtain intrinsic representations from a large volume of training samples in a data-driven manner [17]. Skeleton joints from depth images are typical handcrafted features for use in action recognition, because they provide a more intuitive way to perceive human actions. In [18], robust features based on the probability distribution of skeleton data were extracted and followed by a multivariate statistical method for encoding the relationship between the extracted features. In [19], Ofli et al. proposed a Sequence of Most Informative Joints (SMIJ) based on measurements such as the mean and variance of joint angles and the maximum angular velocity of body joints. A descriptor named Histogram of Oriented Displacements (HOD) was introduced in [20], where each displacement in the trajectory voted with its length in a histogram of orientation angles. In [21], an HMM-based methodology for action recognition was developed using the star skeleton as a representative descriptor of human postures. Here, a star-like five-dimensional vector based on the skeleton features was employed to represent local human body extremes, such as the head and four limbs. In [22], Luo et al. utilized the pairwise relative positions between joints as the visual features and adopted a dictionary learning algorithm to realize the quantization of such features.
Both the group sparsity and geometry constraints are incorporated in order to improve the discriminative power of the learned dictionary. This approach has achieved the best results on two benchmark datasets, thereby representing the current state-of-the-art. Despite the fact that skeleton-based human action recognition has achieved surprising performance, the large storage requirement and high dimensionality of the feature descriptor make it impractical, if not impossible, to be deployed in real scenarios, where low-cost and fast algorithms are demanded.

Alternatively, another stream of research tried to capture motion, shape and texture handcrafted features directly from the depth maps. In [23], Fanello et al. extracted two types of features from each image, namely Global Histograms of Oriented Gradients (GHOGs) and 3D Histograms of Flow. The former was designed to model the shape of the silhouette while the latter was to describe the motion information. These features were then fed into a sparse coding stage, leading to a compact and stable representation of the image content. In [24], Tran and Nguyen introduced an action recognition method with the aid of depth motion maps and a gradient kernel descriptor, which was then evaluated using different configurations of machine learning techniques such as Support Vector Machine (SVM) and kernel based Extreme Learning Machine (KELM) on each projection view of the motion map. In [25], Zhang et al. proposed an effective descriptor, called Histogram of 3D Facets (H3DF), to explicitly encode the 3D shape and structures of various depth images by coding and pooling 3D Facets from depth images. In [66], the kernel technique is used to improve the performance for processing nonlinear quaternion signals; in addition, both RGB information and depth information are deployed to improve representation ability.

Different from the above methods that rely on handcrafted features, deep models learn the feature representation from raw depth data and appropriately generate the high level semantic representation. In our previous work [26], Wang et al. proposed a new deep learning framework, which only required small-scale CNNs but achieved higher performance with less computational cost. In [27], a DMM-Pyramid architecture that can partially keep the temporal ordinal information was proposed to preprocess the depth sequences. In their system, Yang et al. advocated the use of the convolution operation to extract spatial and temporal features from raw video data automatically and extended DMM to DMM-Pyramid. Subsequently, the raw depth sequences can be accepted by both 2D and 3D convolutional networks.

From the extensive work on depth map based action recognition, we have observed that depth maps actually contain rich discriminating texture information. However, most methods do not take it into account when generating their feature representations.

B. Classifier Fusion

In a practical action recognition system, the classifier plays an important role in determining the performance of the system, thereby gaining much attention. Most existing systems just adapted a single classifier, such as SVM [28], ELM [29] and HMM [21], into the action recognition field, and are sufficiently accurate when recognizing simple actions like sitting, walking and running. However, for more complicated human actions, such as hammering a nail, existing works have proved that combining multiple classifiers, especially weak classifiers, usually improves the recognition rate. How to combine basic classifiers therefore becomes crucial. In [9], Chen et al. employed three types of visual features, each being fed into a KELM classifier. At the decision level, a soft decision fusion scheme, namely the logarithmic opinion pool (LOGP) rule, merged the probability outputs and assigned the final class label. Instead of using specific fusion rules, most algorithms adopted boosting schemes, which iteratively weigh different single classifiers by manipulating the training dataset, and on top of it, selectively combine them depending on the weight of each classifier. For example, a boosted exemplar learning (BEL) approach [30] was proposed to recognize various actions, where several exemplar-based classifiers were learned via multiple instance learning, given a certain number of class-specific candidate exemplars. Afterwards, they applied AdaBoost to integrate the further selection of representative exemplars and action modeling.
Recently, considerable research has been devoted to multi-class boosting classification as it is able to facilitate a broad range of applications including action recognition [31]-[33]. Following [32], [39] and many other publications, we generally divide the existing works into two categories depending on how they solve the M-ary (M > 2) problems. In the first category, the proposed approaches decompose the desired multi-class problem into a collection of multiple independent binary classification problems, basically treating an M-class problem as an estimation of a two-class classifier on the training set M times. Representatives include ECOC [31], AdaBoost.MH [34], the binary GentleBoost algorithm [35], and AdaBoost.M2 [36]. In general, this type of multi-class boosting method can be easily implemented based on the conventional binary AdaBoost; however, the system performance is not satisfactory due to the fact that binary boosting scores do not represent true class probabilities. Additionally, such a two-step scheme inevitably creates resource problems by increasing the training time and memory consumption, especially when dealing with a large number of classes.

To overcome this drawback, the second approach directly boosts an M-ary classifier via optimizing a multi-class exponential loss function. One of the first attempts was the AdaBoost.M1 algorithm [36]. Similar to the binary AdaBoost method, this algorithm allowed for any weak classifier that has an error rate of less than 0.5. In [38], a new variation of the AdaBoost.M1 algorithm, named ConfAdaBoost.M1, was presented, which used the information about how confident the weak learners are to predict the class of the instances. Many studies boosted M-ary classifiers by redefining the objective functions. For example, in [37] Zou et al. extended the binary Fisher-consistency result to multi-class classification problems, where the smooth convex Fisher-consistent loss function is minimized by employing gradient descent. Alternatively, Shen et al. [32] presented an extension of the binary totally-corrective boosting framework to the multi-class case by generalizing the concept of separation hyperplane and margin derived from the famous SVM classification. Moreover, the class label representation problem is discussed in [33], which exploited different vector encodings for representing class labels and classifier responses to model the uncertainty caused by each weak learner. From the perspective of margin theory as shown in [39], researchers defined a proper margin loss function for M-ary classification and identified an optimal codebook. They further derived two boosting algorithms for the minimization of the classification risk. In [40], Shen et al. assumed a Gaussian distribution of margin and obtained a new objective, which is one of the most well-known theoretical results in the field.

To sum up, most existing works, especially the multi-class ones, focused on solving weak classifier selection and the imbalance problem by introducing more robust loss functions. From the margin theory perspective [40], they are only able to maximize the hard margin or the minimum margin when the data follows a simple distribution (Gaussian). According to the theoretical evidence in [40], a good boosting framework should aim for maximizing the average margin. Such problems were addressed in other learning methods, e.g., SVM, by employing soft margins, which actually inspired our work. Unlike [40] and other existing works [31], [32], and [39], we assume a more reasonable multiple Gaussian distribution of margin. When dealing with a multiple-class (one versus all) problem, it is evidently hard to assume that the margin follows a single Gaussian. Based on our GMM assumption, we design an objective function, intending to minimize the variance of margin samples that follow the GMM distribution.

III. 3-D HISTOGRAMS OF TEXTURE

On a depth image, the pixel values indicate the distances between the surface of an object and a depth camera location, therefore providing 3D structure information of a scene.
Commonly, researchers utilize the 3D information in the original 3D space, but we project each depth frame of a depth sequence onto three orthogonal Cartesian planes so as to make use of both the 3D structure and shape information [13]. Basically, our 3DHoTs feature extraction and description consists of two steps: salient information map generation and CLBP based feature description, each being elaborated below.

Fig. 1. Salient Information (SI) maps. From the left to the right: front (f) view, side (s) view and top (t) view.

Fig. 2. Sign and magnitude components extracted from a sample block. (a) 3 × 3 sample block; (b) the local differences; (c) the sign component of block; and (d) the magnitude component of block.

A. Salient Information (SI) Map Generation

The idea of SI is derived from DMM [13], which is generated by stacking the motion energy of depth maps projected onto three orthogonal Cartesian planes. After obtaining each projected map, its motion energy is computed by thresholding the difference between consecutive maps. The binary map of motion energy provides a strong clue to the action category being performed and indicates motion regions, i.e., where movement happens in each temporal interval. More specifically, each 3D depth frame generates three 2D projected maps aligned with the front (f), side (s), and top (t) views, i.e., $p_f$, $p_s$ and $p_t$, respectively. The summation of the absolute differences of consecutive projected maps can be used to indicate the motion within a region. The larger the summation value, the more frequently motion is likely to occur in that region. Considering both the discriminability and robustness of feature descriptors, the authors of [14] used the L1-norm of the absolute difference between two projected maps to define salient information (SI). On the one hand, the summation of the L1-norm is invariant to the length of a depth sequence. That is to say, the descriptor is less influenced by mismatched speeds when different people perform the same action. On the other hand, the L1-norm contains more salient information than other norms (e.g., L2) and it is fast to compute. Consequently, the SI maps of a depth sequence are computed as:

$$SI_{*} = \sum_{i=1}^{B-v} \left| p_{*}^{i+v} - p_{*}^{i} \right|, \tag{1}$$

where $*$ denotes f, s or t.
The parameter $v$ stands for the frame interval, $i$ represents the frame index, and $B$ is the total number of frames in a depth sequence. An example of the SI maps of a depth action sequence is shown in Fig. 1. If the sum operation in Eq. (1) is applied only when a given threshold is satisfied, it is similar to the idea of [13]. Instead of selecting frames as in the original DMM [13], however, the authors of [60] proposed that all frames should be deployed to calculate motion information. As shown in Eq. (2), the SI map for $v = 1$ contains more salient information than that of $v = 2$:

$$
\begin{aligned}
2\Big( \left|p^{2} - p^{1}\right| + \sum_{i=2}^{N-2} \left|p^{i+1} - p^{i}\right| + \left|p^{N} - p^{N-1}\right| \Big)
&\ge \left|p^{2} - p^{1}\right| + 2\sum_{i=2}^{N-2} \left|p^{i+1} - p^{i}\right| + \left|p^{N} - p^{N-1}\right| \\
&\ge \sum_{i=1}^{N-2} \left|p^{i+2} - p^{i}\right|.
\end{aligned} \tag{2}
$$

The scale factor in the above expression has little effect on the local pattern histogram. The result is evident, considering the fact that:

$$\left|p^{i+2} - p^{i+1}\right| + \left|p^{i+1} - p^{i}\right| \ge \left|p^{i+2} - p^{i}\right|. \tag{3}$$

Instead of accumulating the binary maps that result from comparison with a threshold, SI retains more detailed features than the original DMM does, based on which we further introduce a powerful texture descriptor inspired by the CLBP [15] method.

B. CLBP Based Descriptor

Our CLBP based descriptors represent SI maps from three aspects, which are:

1) Sign based descriptor for Salient Information: Given a center pixel $t_c$ in the SI image, its neighboring pixels are equally scattered on a circle with radius $r$ ($r > 0$). If the coordinates of $t_c$ are $(0, 0)$ and $m$ neighbors $\{t_i\}_{i=0}^{m-1}$ are considered, the coordinates of $t_i$ are $(-r\sin(2\pi i/m),\ r\cos(2\pi i/m))$. The sign descriptor is computed by thresholding the neighbors $\{t_i\}_{i=0}^{m-1}$ with the center pixel $t_c$ to generate an $m$-bit binary number, so that it can be formulated as:

$$Sign_{m,r}(t_c) = \sum_{i=0}^{m-1} s(t_i - t_c)\,2^{i} = \sum_{i=0}^{m-1} s(d_i)\,2^{i}, \tag{4}$$

where $d_i = (t_i - t_c)$, $s(d_i) = 1$ if $d_i \ge 0$ and $s(d_i) = 0$ if $d_i < 0$. After obtaining the sign based encoding for pixels in an SI image, a block-wise statistic histogram named HoT_S is computed over an image or a region to represent the texture information.
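As a rough illustration of Eq. (1) and Eq. (4), the following Python sketch accumulates an SI map from the projected maps of one view and computes the sign code of a pixel. It is a minimal sketch under stated assumptions, not the authors' MATLAB implementation: the function names are hypothetical, images are plain nested lists, the coordinate pair of each neighbor is read as a (row, column) offset, and offsets are rounded to the nearest pixel (exact for m = 4, r = 1, the setting used in the experiments; other settings would normally need interpolation).

```python
import math


def si_map(frames, v=1):
    """Sum of |p^(i+v) - p^i| over the projected maps of one view, as in Eq. (1)."""
    rows, cols = len(frames[0]), len(frames[0][0])
    si = [[0.0] * cols for _ in range(rows)]
    for i in range(len(frames) - v):
        for r in range(rows):
            for c in range(cols):
                si[r][c] += abs(frames[i + v][r][c] - frames[i][r][c])
    return si


def neighbor_offsets(m=4, r=1):
    """Offsets (-r sin(2*pi*i/m), r cos(2*pi*i/m)) rounded to the pixel grid.

    Assumption: the pair is interpreted as (row, column) and rounded to the
    nearest pixel, which is exact only for settings such as m = 4, r = 1.
    """
    return [(round(-r * math.sin(2 * math.pi * i / m)),
             round(r * math.cos(2 * math.pi * i / m))) for i in range(m)]


def sign_code(img, row, col, m=4, r=1):
    """Eq. (4): sum over i of s(t_i - t_c) * 2**i, with s(d) = 1 if d >= 0 else 0."""
    t_c = img[row][col]
    code = 0
    for i, (dr, dc) in enumerate(neighbor_offsets(m, r)):
        if img[row + dr][col + dc] - t_c >= 0:
            code += 2 ** i
    return code


def sign_histogram(img, m=4, r=1):
    """Histogram of sign codes over all pixels whose neighbors lie inside img."""
    hist = [0] * (2 ** m)
    for row in range(r, len(img) - r):
        for col in range(r, len(img[0]) - r):
            hist[sign_code(img, row, col, m, r)] += 1
    return hist
```

For the three toy 1 × 2 maps `[[0, 1]]`, `[[2, 5]]`, `[[1, 4]]`, `si_map` accumulates 3 and 5 for v = 1 but only 1 and 3 for v = 2, which is in line with the inequality in Eq. (3). The histogram here covers a whole image; a block-wise version would apply `sign_histogram` to each block and concatenate the results.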
2) Magnitude based descriptor for Salient Information: The magnitude is complementary to the sign information in the sense that the difference $d_i$ can be reconstructed based on them. Fig. 2 shows an example of the sign and magnitude components extracted from a sample block. The local differences are decomposed into two complementary components: the signs and magnitudes (absolute values of $d_i$, i.e., $|d_i|$). Note that 0 is coded as -1 in the encoding process (see Fig. 2(c)). The magnitude operator is defined as follows:

$$Magnitude_{m,r} = \sum_{i=0}^{m-1} \varphi(|d_i|, c)\,2^{i}, \qquad
\varphi(\sigma, c) = \begin{cases} 1, & \sigma \ge c \\ 0, & \sigma < c, \end{cases} \tag{5}$$

where $c$ is a threshold set to the mean value of $|d_i|$ on the whole image. A block-wise statistic histogram named HoT_Magnitude (HoT_M) is subsequently computed over an image or a region.

3) Center based descriptor for Salient Information: The center part of each block, which encodes the values of the center pixels, also provides discriminant information. It is denoted as:

$$Center_{m,r} = \varphi(t_c, c_1), \tag{6}$$

where $\varphi$ is defined in Eq. (5) and the threshold $c_1$ is set as the average gray level of the whole image. Subsequently, we obtain the histograms of the center based texture feature (HoT_C) over an SI image or a region.

To summarize, in our feature extraction method, each depth frame from a depth sequence is first projected onto three orthogonal Cartesian planes to form three projected maps. Under each projection plane, the absolute differences between the consecutive projected maps are accumulated over an entire sequence to generate a corresponding SI image. Then each SI image is divided into overlapped blocks. Each component of the texture descriptors is applied to the blocks and the resulting local histograms of all blocks are concatenated to form a single feature vector. Therefore, each SI image creates three histogram feature vectors denoted by HoT_S, HoT_M and HoT_C, respectively. Since there are three SI images corresponding to three projection views (i.e., front, side and top views), three feature vectors are generated as final feature vectors as follows. The feature extraction procedure is illustrated in Fig. 3.

3DHoT_S = [HoT_f_S, HoT_s_S, HoT_t_S]
3DHoT_M = [HoT_f_M, HoT_s_M, HoT_t_M]
3DHoT_C = [HoT_f_C, HoT_s_C, HoT_t_C]

Fig. 3. Pipeline of 3DHoTs feature extraction.

IV. DECISION-LEVEL CLASSIFIER FUSION BASED ON MULTI-CLASS BOOSTING SCHEME

As can be seen, we use multi-view features in order to capture the diversity of the depth image. Normally, the dissimilarity among features from different views is large. To solve this multi-view data classification problem, the majority of the research in this field advocates the use of the boosting method. The basic idea of a boosting method is to optimally incorporate multiple weak classifiers into a single strong classifier. Here, one view of features can be fed into one weak classifier. As an outstanding boosting representative, AdaBoost [40] incrementally builds an ensemble by training each new model instance to emphasize the training instances that were previously misclassified. In this paper, we concentrate on this framework, based on which we introduce a new multi-class boosting method. Suppose we have $n$ weak/base classifiers and $h_i(x)$ denotes the $i$-th base classifier; a boosting algorithm then seeks a convex linear combination:

$$F(\alpha, x) = \sum_{i=1}^{n} \alpha_i h_i(x), \tag{7}$$

where $\alpha_i$ is a weight coefficient corresponding to the $i$-th weak classifier. The AdaBoost method can thus be decomposed into two modules: base classifier construction and classifier weight calculation, given training samples.

A. Base Classifier: Extreme Learning Machine

In principle, the base classifiers in AdaBoost can be any existing classifiers performing better than random guessing. But the better a base classifier is, the better the overall decision system performs. Therefore, we use the ELM method [29] in our work, which is an efficient learning algorithm for single-hidden-layer feed-forward neural networks (SLFNs). More specifically, let $y = [y_1, \ldots, y_k, \ldots, y_C]^T \in \mathbb{R}^C$ be the class to which a sample belongs, where $y_k \in \{1, -1\}$ ($1 \le k \le C$) and $C$ is the number of classes.
Given $N$ training samples $\{x_i, y_i\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^M$ and $y_i \in \mathbb{R}^C$, a single hidden layer neural network having $L$ hidden nodes can be expressed as

$$\sum_{j=1}^{L} \beta_j\, h(w_j \cdot x_i + e_j) = y_i, \quad i = 1, \ldots, N, \tag{8}$$

where $h(\cdot)$ is a nonlinear activation function (e.g., the Sigmoid function), $\beta_j \in \mathbb{R}^C$ denotes the weight vector connecting the $j$-th hidden node to the output nodes, $w_j \in \mathbb{R}^M$ denotes the weight vector connecting the $j$-th hidden node to the input nodes, and $e_j$ is the bias of the $j$-th hidden node. The above $N$ equations can be written compactly as:

$$H\beta = Y, \tag{9}$$

where $\beta = [\beta_1^T; \ldots; \beta_L^T] \in \mathbb{R}^{L \times C}$, $Y = [y_1^T; \ldots; y_N^T] \in \mathbb{R}^{N \times C}$, and $H$ is the hidden layer output matrix. A least-squares solution $\hat{\beta}$ of (8) is found to be

$$\hat{\beta} = H^{\dagger} Y, \tag{10}$$

where $H^{\dagger}$ is the Moore-Penrose generalized inverse of matrix $H$. The output function of the ELM classifier is

$$f_L(x_i) = h(x_i)\beta = h(x_i) H^T \left( \frac{I}{\rho} + H H^T \right)^{-1} Y, \tag{11}$$

where $1/\rho$ is a regularization term and $\rho$ is set to be . The label of a test sample is assigned to the index of the output node with the largest value. In our experiments, we use a kernel-based ELM (KELM) with a radial basis function (RBF) kernel (the parameter gamma in RBF is set to be 10.5).

B. Multi-Class Boosting Classifier

Having specified the base classifier, the next step is to introduce our new multi-class boosting classifier. Our investigation is carried out from the perspective of margin sample distribution, in contrast to the traditional methods that focus on solving the weak classifier selection and the imbalance problem. One of the obvious advantages lies in the alleviation of the overfitting problem through weighing the samples. As another intuition, inspired by [40], we investigate AdaBoost based on a more reasonable hypothesis on the margin distribution and obtain a new theoretical result. Following Eq. (7), AdaBoost is equivalent to minimizing the exponential loss function [42]:

$$\min_{\alpha} \sum_{i=1}^{N} \exp\big(-y_i F(\alpha, x_i)\big), \quad \text{s.t. } \alpha \ge 0. \tag{12}$$

The logarithmic function $\log(\cdot)$ is strictly monotonically increasing, and it is easy to calculate the minimum value of a non-exponential function. Therefore, after taking the logarithm, AdaBoost amounts to solving [42]:

$$\min_{\alpha} \log\Big( \sum_{i=1}^{N} \exp\big(-y_i F(\alpha, x_i)\big) \Big), \quad \text{s.t. } \alpha \ge 0,\ \|\alpha\|_1 = \delta. \tag{13}$$

The constraint $\|\alpha\|_1 = \delta$ avoids enlarging the solution $\alpha$ by an arbitrarily large factor to make the cost function approach zero in the case of separable training data. In [43], Crammer and Singer propose to construct multi-class predictors with a piecewise linear bound. Considering the simplicity and the efficiency of a linear function, we use the following rule for this $C$-class classification,

$$\arg\max_{j=1,\ldots,C} \{\theta_{i,j}^{T} x\}, \tag{14}$$

where $\theta_j$ is a vector.
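To make Eqs. (7), (12), (13) and (14) concrete, the following Python sketch evaluates the ensemble score, the exponential loss and its logarithm, and the arg-max decision rule on toy inputs. It is only an illustration under stated assumptions: the function names are hypothetical, labels are taken as +1 or -1, one weight vector per class is assumed for the decision rule, and the max-shift used to evaluate the logarithm stably is an implementation choice that is not described in the text. No optimization over the weights is performed here.

```python
import math


def ensemble_score(alpha, base_outputs):
    """Eq. (7): F(alpha, x) = sum_i alpha_i * h_i(x) for one sample."""
    return sum(a * h for a, h in zip(alpha, base_outputs))


def exp_loss(alpha, outputs, labels):
    """Objective of Eq. (12): sum_i exp(-y_i * F(alpha, x_i))."""
    return sum(math.exp(-y * ensemble_score(alpha, h))
               for h, y in zip(outputs, labels))


def log_exp_loss(alpha, outputs, labels):
    """Objective of Eq. (13): log of the sum in Eq. (12).

    Assumption: the maximum exponent is subtracted before exponentiating
    (the usual log-sum-exp shift) purely for numerical stability.
    """
    z = [-y * ensemble_score(alpha, h) for h, y in zip(outputs, labels)]
    z_max = max(z)
    return z_max + math.log(sum(math.exp(v - z_max) for v in z))


def predict(thetas, x):
    """Eq. (14): index j (0-based here) maximizing theta_j^T x."""
    scores = [sum(t * xi for t, xi in zip(theta, x)) for theta in thetas]
    return max(range(len(scores)), key=scores.__getitem__)
```

Because the logarithm is monotonically increasing, `log_exp_loss` and `exp_loss` are minimized by the same weights; the difference lies in the added norm constraint of Eq. (13), which this sketch does not enforce.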
We then heuristically propose the following linear objective function:

$$\max_{j} \big( \theta_{i,j}^{T} x - \theta_{i,m}^{T} x \big), \tag{15}$$

where $m \ne j$. Next, we incorporate this linear objective and a multiple-class constraint into a simple form of AdaBoost described in Eq. (13). Eventually, a multi-class boosting method to calculate the weight vector separately for each class can be achieved through minimizing the following objective:

$$\min_{j} \Big( \log\Big( \sum_{i} \omega_i \exp\big(-y_i F(\theta_j, x_i)\big) \Big) + \frac{1}{N_j} \sum_{i} \big( \theta_{i,m}^{T} x_i^{j} - \theta_{i,j}^{T} x_i^{j} \big) + \lambda \|\theta_j\|_1 \Big). \tag{16}$$

The effect of $\lambda$ on the system performance is investigated in the experimental results part. $x_i^{j}$ denotes the $i$-th sample in the $j$-th class with $N_j$ samples. We make use of the interior point method to solve our objective.

Here, we further discuss the theoretical advantage behind the new objective function. The margin theory used in SVM is the state-of-the-art learning principle. The so-called dual form of AdaBoost is another significant work related to the margin theory. The latter is quite close to our work, so it is briefly introduced with a focus on explaining the difference. In [40], the authors assume a Gaussian distribution of margin, and based on it, they theoretically explain the state-of-the-art margin method (AdaBoost). However, for a multiple-class (one versus all) problem, it is hard, if not impossible, to assume that the margin follows a single Gaussian. Instead, we presume that the margin follows multiple Gaussian models. Assuming multiple Gaussian distribution models in a more complicated situation like our problem here is sensible, as a single Gaussian model is widely accepted in the theoretical analysis for a simple situation.

After settling the data distribution, the next question becomes whether our objective function maximizes the mean of margin and at the same time minimizes the variance of margin that follows Gaussian mixture models. It was stated in [40] that the success of a boosting algorithm can be understood in terms of maintaining a better margin distribution by maximizing margins and meanwhile controlling the margin variance.
In order words, t can be a sort of crteron to measure the proposed boostng algorthm. In our case, provng t s not easy, snce we have assumed that samples from dfferent classes mght follow GMM but not a sngle Gaussan. As another motvaton n [40], the boostng method can be used to solve varous complex problems, but few researchers explan t from a theoretcal aspect. We present a theorem to answer the queston mentoned above. Based on Lemmas 1 and 2 n Appendx, we obtan new theoretcal results for our boostng methods, and sgnfcantly extend the orgnal one n [36]. Here we descrbe our algorthm as follows: V. EXPERIMENTAL RESULTS Our proposed system s mplemented n MATLAB on an Intel 5 Quadcore 3.2 GHz desktop computer wth 8GB of RAM. Separate algorthmc parts correspondng to our contrbutons as well as the entre acton recognton system are evaluated and compared wth state-of-the-art algorthms based on four publc datasets ncludng MSRActon3D [44], MSRGesture3D [44], MSRActvty3D [44] and UTD-MHAD [45]. Moreover, we conduct the experments to nvestgate the effects of a few mportant parameters. For all the experments, we fx m = 4andr = 1 based on our emprcal studes n [10], [14], and the regon sze s set to 4 2 wth 15 hstogram bns when extractng 3DHoTs. A. Datasets The MSRActon3D dataset [44] s a popular depth dataset for acton recognton, contanng 20 actons performed

by 10 subjects. Each subject performs one action 2 or 3 times when facing the depth camera. The resolution of each depth image is […]. It is a challenging dataset due to the similarity of actions and large speed variations in actions. The MSRGesture3D dataset [44] is a benchmark dataset for depth-based hand gesture recognition, consisting of 12 gestures defined by American Sign Language (ASL). Each action is performed 2 or 3 times by each subject, thereby resulting in 333 depth sequences. The MSRActivity3D dataset [44] contains 16 daily activities acquired by a Microsoft Kinect device. In this dataset, there are 10 subjects, each being asked to perform the same action twice, in standing position and sitting position, respectively. There are in total 320 samples with both depth maps and RGB sequences. The UTD-MHAD dataset [45] employed four temporally synchronized data modalities for data acquisition. It provides RGB videos, depth videos, skeleton positions, and inertial signals (captured by a Kinect camera and a wearable inertial sensor) of a comprehensive set of 27 human actions. Some example frames of the datasets are shown in Fig. 4.

Algorithm 1 We Solve Our Objective Based on the MATLAB Toolbox. Our Method Utilizes the Information Derived From Depth Motion Maps and Texture Operators and Improves the Performance of the KELM Base Classifiers.

Fig. 4. An example of the basketball-shoot action from the UTD-MHAD dataset. The first row shows the color images, the second row shows the depth images.

TABLE I RECOGNITION ACCURACY (%) OF DIFFERENT FEATURE AND CLASSIFIER COMBINATIONS ON MSRACTION3D DATASET

TABLE II RECOGNITION ACCURACY (%) OF DIFFERENT FEATURE AND CLASSIFIER COMBINATIONS ON MSRGESTURE3D DATASET

B. Contribution Verification

We have claimed two contributions in Section I, which are a new multi-class boosting classifier and an improved feature descriptor. Here, we design an experiment to verify these two contributions simultaneously on the MSRAction3D dataset.
More specifically, we have combined two different feature descriptors and four different classifier fusion methods for action recognition. The feature descriptors are our 3DHoTs descriptor and the conventional DMM+LBP descriptor [9], while the four classifier fusion methods are AdaBoost.M2 [36], LOGP [9], MCBoost [39] and our MBC. The idea is to feed the two features into the four classifiers respectively and then calculate the average recognition accuracy of each combination. Table I shows the achieved results, for which we adopted the original settings suggested in [9]. If we look at each column vertically, we can find the accuracy comparisons when fixing the classifier but varying the feature descriptors. As can be seen, our 3DHoTs feature is consistently better than the DMM+LBP feature over the four classifiers, indicating that applying the CLBP descriptor on DMM maps indeed helps to represent the action. Conversely, if we look at each row horizontally, we can find the results achieved by different classifiers when the input feature is constant. It is clear that our MBC classifier performs better than the other three, regardless of the input features. Compared with AdaBoost.M2 [36], MBC achieves a much better performance because our framework focuses on the margin samples, which can be more robust when the size of the sample set is not large, as is the case in this application. As shown in Table II and Table III, our 3DHoTs feature outperforms the DMM+LBP feature over the four classifiers, which indicates that the CLBP descriptor on DMM maps makes a contribution to recognizing different actions. Furthermore, in each row respectively, it is demonstrated that our MBC classifier achieves comparable results with other classifier combination methods.

In comparison with AdaBoost.M2 and MCBoost, our MBC method performs better on both the MSRGesture3D dataset and the UTD-MHAD dataset. In fact, multiclass boosting methods cannot be directly used in our problems. We addressed the issue by combining heterogeneous classification models, which is not a customary classification task. To compare with multiclass boosting methods in a different way, we substituted our objective function with the loss function they defined for M-ary classification.

TABLE III RECOGNITION ACCURACY (%) OF DIFFERENT FEATURE AND CLASSIFIER COMBINATIONS ON UTD-MHAD DATASET

TABLE IV THREE SUBSETS OF ACTIONS USED FOR MSRACTION3D DATASET

C. System Verification

1) Results on the MSRAction3D Dataset: Similar to other publications, we establish two different experimental settings to evaluate our method.

Setting 1: The experimental setting reported in [11] is adopted. Specifically, the actions are divided into three subsets as listed in Table IV. For each subset, three different tests are carried out. In the first test, 1/3 of the samples are used for training and the rest for testing; in the second test, 2/3 of the samples are used for training and the rest for testing; in the cross-subject test, one half of the subjects (1, 3, 5, 7, 9) are used for training and the rest for testing.

Setting 2: The experimental setup suggested by [46] is used. A total of 20 actions are employed and one half of the subjects (1, 3, 5, 7, 9) are used for training and the remaining subjects are used for testing.

To facilitate a fair comparison, we set the same parameters of DMMs and blocks as noted in [9]. As illustrated in Table V, the results clearly validate the effectiveness of MBC. In test one, our method achieves 100% recognition accuracy in AS3, and also comparable results in AS1 and AS2. In the second experiment, our method gets 100% recognition accuracy on all three subsets.
In the cross-subject test, the MBC method again gets the highest average recognition accuracy in this very challenging setting with large inter-class variations of different training and testing subjects. The comparison results of setting 2 are illustrated in Table VI, showing that our approach performs the best in terms of the recognition accuracy. More specifically, the ensemble MBC classifier significantly improves the performance of the single 3DHoT feature, i.e., 3DHoT_S, by at least 3.3%. Compared to the state-of-the-art algorithm (DMM-LBP-DF) that is also based on the decision-level fusion scheme, we are 2% higher in terms of the accuracy rate. With respect to the feature extraction, we compare ours with most of the existing descriptors, e.g., DMM [9] and Cuboid [47], and our method consistently shows its advantages on the database. In terms of the classifier, MBC achieves a much better performance than SVM [13], [48] and ELM [9]. Note that all compared results are cited from the reference papers.

2) Results on the MSRGesture3D Dataset: Table VII shows the recognition results of our method as well as comparative methods on the MSRGesture3D dataset. As shown in this table, the proposed method achieves a much better performance than DMM-HOG, with an increased rate of 5.5%. The accuracy of the decision-level fusion approach (DMM-LBP-DF) is similar to ours, and both methods outperform the others. It should be noted that AdaBoost.M2 [36] is not suitable for a small set of training samples, and is therefore not used for the comparison in this experiment.

3) Results on the UTD-MHAD Dataset: In the conducted experiments, we only utilize the depth data. The data from subject numbers 1, 3, 5, 7 are used for training, and the data from subject numbers 2, 4, 6, 8 are used for testing. Note that we slightly change the parameter m to 6 for 3DHoTs feature extraction due to the better performance on this dataset. We have compared our method with the existing feature extraction methods [45] used for depth images and inertial sensors.
It is remarkable that MBC obtains a much better performance than the combination of Kinect and Inertial, as shown in Table VIII. Compared to the state-of-the-art DMM-HOG result, we obtain 2.9% higher recognition accuracy. The results clearly demonstrate the superior performance of our method. Compared to the traditional multi-class AdaBoost, we again achieve a much better performance, which further validates the effectiveness of MBC.

4) Results on the MSRActivity3D Dataset: To further test the effectiveness of the proposed method, we consider a more complicated situation, i.e., human activity recognition. We conduct an experiment on the MSRActivity3D dataset, which is more challenging due to the large intra-class variations occurring in the dataset. Experiments on this dataset are based on a cross-subject test following the same setting as in [12], where 5 subjects are used for training, and the remaining 5 subjects are used for testing. AdaBoost.M2 [36] is not used on this dataset, because the dataset is not big enough to train such an ensemble classifier well. As seen from the results reported in Table IX, our algorithm outperforms all the prior art, including several recent methods, except for [22]. This reveals that our MBC framework indeed works well even when fed two different types of features.
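The cross-subject protocols used in this section (for example, subjects 1, 3, 5, 7 for training and the remaining subjects for testing on UTD-MHAD) can be illustrated with a short sketch. This is a minimal Python illustration rather than the authors' MATLAB implementation; the function name `cross_subject_split`, the record layout, and the toy action labels are all hypothetical.

```python
def cross_subject_split(samples, train_subjects):
    """Split samples into train/test sets so that no subject appears in both.

    `samples` is a list of dicts with a "subject" key (hypothetical layout).
    """
    train_ids = set(train_subjects)
    train = [s for s in samples if s["subject"] in train_ids]
    test = [s for s in samples if s["subject"] not in train_ids]
    return train, test


# Toy data: 8 subjects, each performing two made-up actions once.
samples = [
    {"subject": subject, "action": action}
    for subject in range(1, 9)
    for action in ("wave", "clap")
]

# Odd-numbered subjects train, even-numbered subjects test.
train, test = cross_subject_split(samples, train_subjects=[1, 3, 5, 7])
print(len(train), len(test))  # prints "8 8"
```

Splitting by subject rather than by sample keeps every test subject unseen during training, which is what makes the cross-subject setting harder than the 1/3 and 2/3 random splits.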

TABLE V COMPARISON OF RECOGNITION ACCURACIES (%) OF OUR METHOD AND EXISTING METHODS ON MSRACTION3D DATASET USING SETTING 1

TABLE VI RECOGNITION ACCURACY (%) COMPARED WITH EXISTING METHODS ON MSRACTION3D DATASET

TABLE VII RECOGNITION ACCURACY (%) COMPARED WITH EXISTING METHODS ON MSRGESTURE3D DATASET

TABLE VIII RECOGNITION ACCURACY (%) COMPARED WITH EXISTING METHODS ON UTD-MHAD DATASET

TABLE IX RECOGNITION ACCURACY (%) COMPARED WITH EXISTING METHODS ON MSRACTIVITY3D DATASET

The major reason that our performance is worse than that of [22] is that our method is mainly based on depth features extracted from the raw depth signal, whereas the work in [22] employs more sophisticated skeleton-based features, which can better interpret human actions on a challenging dataset. Though we have integrated the skeleton information here in order to verify whether our multi-class boosting framework can handle two different types of features, our skeleton features, which encode only the joint position differences, are very simple, in contrast to [22], which uses group sparsity and geometry constrained dictionary learning to further enhance the skeleton feature representation. According to their results, the classification performance benefits from generalizing vector quantization (e.g., Bag-of-Words representation) to sparse coding [22]. We believe that our performance can be improved further by incorporating such sophisticated skeleton features.

5) Comparison With Deep Learning Based Methods: The baseline methods mentioned above deploy traditional handcrafted features. In contrast, deep learning models learn the feature representation from raw data and generate a high-level semantic representation [26], [27], which represents the latest development in action recognition. Here, we compare our method with two deep models: one is SMF-BDL [26] and the other is a DMM-Pyramid approach based on both traditional 2D CNN and 3D CNN for action recognition.
Similar to MBC, the decision-level fusion method is used to combine different deep CNN models. To validate the proposed 3DHoT-MBC method, we conduct the same experiments as those of the two methods. Note that the comparative results are all reported in their reference papers. The results in Table X and Table XI show that 3DHoT-MBC is even superior to the two deep learning methods.

D. Comparison With Other Boosting Methods

In this section, we create a large-scale action database by combining two action databases, MSRAction3D and UTD-MHAD, into a single one. We then compare performances of different boosting algorithms for two kinds of

features, i.e., DMM+LBP and 3DHoTs. The new combined Action-MHAD dataset has 38 distinct action categories (the same actions in both datasets are combined into one action), which consist of 1418 depth sequences. In the experiments, odd subject numbers such as 1, 3, 5, 7 are used for training and the remaining subjects are used for testing. The experimental results, as shown in Table XII, demonstrate that our MBC is superior to other boosting methods. We also verify our algorithm on the DHA dataset [61]. DHA contains 23 action categories, where the first 10 categories follow the same definitions as in the Weizmann action dataset [65] and the 11th to 16th actions are extended categories. The 17th to 23rd are the categories of selected sport actions. Each of the 23 actions was performed by 21 different individuals (12 males and 9 females), resulting in 483 action samples. Table XIII shows the recognition results of our method against existing algorithms on the DHA dataset. Again, our method achieves the best recognition performance.

TABLE X RECOGNITION ACCURACIES (%) OF OUR METHOD AND DEEP LEARNING METHODS ON MSRACTION3D DATASET USING SETTING 1

TABLE XI RECOGNITION ACCURACIES (%) OF OUR METHOD AND DEEP LEARNING METHODS ON MSRACTION3D DATASET USING SETTING 2 AND MSRGESTURE3D DATASET

TABLE XII RECOGNITION ACCURACY (%) OF DIFFERENT FEATURE AND CLASSIFIER COMBINATIONS ON ACTION-MHAD DATASET

TABLE XIII RECOGNITION ACCURACY (%) COMPARED WITH EXISTING METHODS ON DHA DATASET

Fig. 5. KELM performance w.r.t. parameter ρ on the MSRAction3D dataset.

Fig. 6. System performance w.r.t. parameter λ on two datasets.

E. Effects of Parameters

Like other action recognition systems, our system also needs to tune a few parameters in both the 3DHoTs feature extraction stage and the MBC classification stage so as to obtain the best performance.
Regarding feature extraction, the selection of m and r is critical, as they determine the region size on the DMM and also the number of neighboring points involved in the descriptor. In our previous papers [9], [14], we carried out an empirical study of these two parameters, which revealed that m = 4 and r = 1 obtain good results on most of the datasets. Our classification algorithm has two parts: the KELM base classifier and the MBC fusion algorithm. For the KELM, there is a regularization term ρ that is used to solve the ill-posed problem. In Fig. 5, we plot how the recognition accuracy of our method (cross validation on the training data) changes as we vary this parameter on the MSRAction3D dataset. As seen from the curve, we can set this parameter to 1000, because the recognition rate reaches a peak at that value. For the MBC, the regularization coefficient λ is the only parameter required to be predefined. Here, we investigate how the algorithm behaves when varying λ. To do so, we change the value of λ and plot the corresponding recognition rates on two datasets, which are illustrated in Fig. 6. As shown

in this figure, the MBC recognition accuracy oscillates when λ varies between 0 and 50. When λ exceeds 50, the MBC results increase gradually and finally level off as λ reaches 100. We find more or less the same behavior on two different datasets, which makes the selection of this parameter feasible. In fact, the regularization term reflects our selected model complexity. When we set a small λ, we actually set a loose constraint on model complexity, which easily leads to overfitting. On the other hand, a large λ ensures that we obtain a simple model. So, we set λ = 100 considering a tradeoff between algorithm performance and efficiency.

Finally, the execution time of our system is calculated to reveal the feasibility of our system for a real-time application. To this end, we have set up a simulation platform using MATLAB on an Intel i5 quad-core 3.2 GHz desktop computer with 8 GB of RAM. The proposed method is able to process over 120 frames per second.

VI. CONCLUSION

In this paper, we have proposed an effective feature descriptor and a novel decision-level fusion method for action recognition. This feature, called 3DHoTs, combines depth maps and texture description for an effective action representation of a depth video sequence. At the decision-level fusion, we have added the inequality constraints derived from a multi-class Support Vector Machine to modify the general AdaBoost optimization function, where kernel-based extreme learning machine (KELM) classifiers serve as the base classifiers. The experimental results on four standard datasets demonstrate the superiority of our method. Future work is to extend this multi-class boosting framework to other relevant applications, such as object recognition [67] and image retrieval.
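APPENDIX

Lemma 1: The GMM with 2 components is represented by $f(z,\mu_1,\sigma_1,\mu_2,\sigma_2)$ as:

$$f(z,\mu_1,\sigma_1,\mu_2,\sigma_2) = \omega_1 G_1(z,\mu_1,\sigma_1) + \omega_2 G_2(z,\mu_2,\sigma_2),$$

and we have:

$$f_{\sigma^2}(z,\mu_1,\sigma_1,\mu_2,\sigma_2) \le f_{\sigma^2}(z,0,\sigma_1,0,\sigma_2) + \varepsilon,$$

where $\omega_1,\omega_2$ are the mixture proportions, $\mu_1,\mu_2$ and $\sigma_1,\sigma_2$ are respectively the mean and variance of the Gaussian components, and $\varepsilon$ is a constant. $f_{\sigma^2}$ represents the variance of $f(\cdot)$, with $0 \le \mu_1,\mu_2 \le 1$, $0 \le \varepsilon \le 1$.

Proof: Based on the definition of variance, we obtain:

$$\begin{aligned}
f_{\sigma^2} &= \int z^2(\omega_1 G_1+\omega_2 G_2)\,dz - \Big(\int z(\omega_1 G_1+\omega_2 G_2)\,dz\Big)^2 \\
&= \omega_1\int z^2 G_1\,dz - \omega_1\Big(\int zG_1\,dz\Big)^2 + \omega_2\int z^2 G_2\,dz - \omega_2\Big(\int zG_2\,dz\Big)^2 \\
&\quad + \omega_1\Big(\int zG_1\,dz\Big)^2 - \omega_1^2\Big(\int zG_1\,dz\Big)^2 + \omega_2\Big(\int zG_2\,dz\Big)^2 - \omega_2^2\Big(\int zG_2\,dz\Big)^2 - 2\omega_1\omega_2\mu_1\mu_2 .
\end{aligned}$$

As $\sigma_1^2 = \int z^2 G_1\,dz - \big(\int zG_1\,dz\big)^2$ and $\sigma_2^2 = \int z^2 G_2\,dz - \big(\int zG_2\,dz\big)^2$, we obtain:

$$f_{\sigma^2} = \omega_1\sigma_1^2 + \omega_2\sigma_2^2 + \omega_1\omega_2\mu_1^2 + \omega_1\omega_2\mu_2^2 - 2\omega_1\omega_2\mu_1\mu_2 .$$

As $\omega_1+\omega_2 = 1$, we have $\omega_1\omega_2 \le 1/4$, and thus

$$f_{\sigma^2} = \omega_1\sigma_1^2 + \omega_2\sigma_2^2 + \omega_1\omega_2(\mu_1-\mu_2)^2 \le \omega_1\sigma_1^2 + \omega_2\sigma_2^2 + \tfrac{1}{4}(\mu_1-\mu_2)^2$$

and

$$f_{\sigma^2}(z,0,\sigma_1,0,\sigma_2) = \omega_1\sigma_1^2 + \omega_2\sigma_2^2 .$$

As we constrain $0 \le \mu_1,\mu_2 \le 1$, we have $0 \le (\mu_1-\mu_2)^2 \le 1$ and $\tfrac{1}{4}(\mu_1-\mu_2)^2 \le \tfrac{1}{4}$. Thus, we obtain:

$$f_{\sigma^2}(z,\mu_1,\sigma_1,\mu_2,\sigma_2) \le f_{\sigma^2}(z,0,\sigma_1,0,\sigma_2) + \varepsilon,$$

where $\varepsilon$ is smaller than 0.25 in the case of $0 \le \mu_1,\mu_2 \le 1$.

Lemma 2: For a GMM with $M$ components, we have:

$$f_{\sigma^2}(z,\mu_1,\sigma_1,\mu_2,\sigma_2,\ldots) \le f_{\sigma^2}(z,0,\sigma_1,0,\sigma_2,\ldots) + \varepsilon,$$

with $0 \le \mu_1,\mu_2 \le 1$, $0 \le \varepsilon \le 1$, when $M \le 4$.

Proof: We prove this Lemma in two different cases, when $M$ is an even or an odd number. When $M$ is an even number, based on Lemma 1, we have:

$$\begin{aligned}
f_{\sigma^2} &= \int z^2(\omega_1 G_1+\omega_2 G_2+\cdots+\omega_M G_M)\,dz - \Big(\int z(\omega_1 G_1+\omega_2 G_2+\cdots+\omega_M G_M)\,dz\Big)^2 \\
&\le \omega_1\sigma_1^2 + \omega_2\sigma_2^2 + \cdots + \omega_M\sigma_M^2 + \tfrac{1}{4}\big(\mu_1^2+\mu_2^2+\cdots+\mu_{M-1}^2+\mu_M^2\big).
\end{aligned}$$

As $0 \le \mu_i \le 1$, $i = 1,\ldots,M$, we have:

$$f_{\sigma^2} \le \omega_1\sigma_1^2 + \omega_2\sigma_2^2 + \cdots + \omega_M\sigma_M^2 + \frac{M}{4}.$$

We further prove Lemma 2 when $M$ is an odd number, and have:

$$\begin{aligned}
f_{\sigma^2} &= \int z^2(\omega_1 G_1+\omega_2 G_2+\cdots+\omega_M G_M)\,dz - \Big(\int z(\omega_1 G_1+\omega_2 G_2+\cdots+\omega_M G_M)\,dz\Big)^2 \\
&\le \omega_1\sigma_1^2 + \omega_2\sigma_2^2 + \cdots + \omega_M\sigma_M^2 + \tfrac{1}{4}\big(\mu_1^2+\mu_2^2+\cdots+\mu_{M-1}^2+\mu_M^2\big),
\end{aligned}$$

and we obtain:

$$f_{\sigma^2} \le \omega_1\sigma_1^2 + \omega_2\sigma_2^2 + \cdots + \omega_M\sigma_M^2 + \frac{M}{4}.$$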
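The two-component identity behind Lemma 1 can be checked numerically. The sketch below is a hedged Python illustration (the paper's own system is in MATLAB). It compares the closed-form mixture variance with a midpoint-rule integration of the mixture density, and evaluates the gap term ω1·ω2·(μ1 − μ2)², which Lemma 1 bounds by 0.25 when both means lie in [0, 1]. All function names are hypothetical, and `sigma1`, `sigma2` are assumed here to be standard deviations, so that their squares are the component variances used in the proof.

```python
import math


def gaussian_pdf(z, mu, sigma):
    """Density of a 1-D Gaussian; sigma is assumed to be a standard deviation."""
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def gmm_variance_closed_form(w1, mu1, sigma1, mu2, sigma2):
    """Mixture variance: w1*s1^2 + w2*s2^2 + w1*w2*(mu1 - mu2)^2, with w2 = 1 - w1."""
    w2 = 1.0 - w1
    return w1 * sigma1 ** 2 + w2 * sigma2 ** 2 + w1 * w2 * (mu1 - mu2) ** 2


def gmm_variance_numeric(w1, mu1, sigma1, mu2, sigma2, lo=-20.0, hi=20.0, steps=40000):
    """Mixture variance from E[z^2] - E[z]^2 using a midpoint-rule integral."""
    w2 = 1.0 - w1
    dz = (hi - lo) / steps
    first_moment = 0.0
    second_moment = 0.0
    for k in range(steps):
        z = lo + (k + 0.5) * dz
        p = w1 * gaussian_pdf(z, mu1, sigma1) + w2 * gaussian_pdf(z, mu2, sigma2)
        first_moment += z * p * dz
        second_moment += z * z * p * dz
    return second_moment - first_moment ** 2


def mean_shift_gap(w1, mu1, mu2):
    """Extra variance caused by non-zero means: w1*w2*(mu1 - mu2)^2."""
    return w1 * (1.0 - w1) * (mu1 - mu2) ** 2


# Illustrative parameter values (not taken from the paper).
closed = gmm_variance_closed_form(0.3, 0.2, 0.5, 0.9, 0.8)
numeric = gmm_variance_numeric(0.3, 0.2, 0.5, 0.9, 0.8)
print(closed, numeric, mean_shift_gap(0.3, 0.2, 0.9))
```

The gap reaches its largest value, 0.25, at equal weights with means at opposite ends of [0, 1], which is the worst case the lemma's bound on ε covers.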
