Multi-View Surveillance Video Summarization via Joint Embedding and Sparse Optimization



IEEE TRANSACTIONS ON MULTIMEDIA, VOL. XX, NO. XX, DECEMBER 20XX

Rameswar Panda and Amit K. Roy-Chowdhury, Senior Member, IEEE

Abstract: Most traditional video summarization methods are designed to generate effective summaries for single-view videos, and thus they cannot fully exploit the complicated intra- and inter-view correlations in summarizing multi-view videos in a camera network. In this paper, with the aim of summarizing multi-view videos, we introduce a novel unsupervised framework via joint embedding and sparse representative selection. The objective function is two-fold. The first is to capture the multi-view correlations via an embedding, which helps in extracting a diverse set of representatives. The second is to use an $\ell_{2,1}$-norm to model the sparsity while selecting representative shots for the summary. We propose to jointly optimize both objectives, such that the embedding can not only characterize the correlations, but also indicate the requirements of sparse representative selection. We present an efficient alternating algorithm based on half-quadratic minimization to solve the proposed non-smooth and non-convex objective, with convergence analysis. Rigorous experiments on several multi-view datasets demonstrate that our approach clearly outperforms the state-of-the-art methods.

Index Terms: Video summarization; Camera Network; Sparse optimization; Multi-view video.

I. INTRODUCTION

Networks of surveillance cameras are everywhere nowadays. The volume of data collected by such networks of vision sensors, deployed in many settings ranging from security needs to environmental monitoring, clearly meets the requirements of big data [24], [55]. The difficulties in analyzing and processing such big video data are apparent whenever there is an incident that requires foraging through vast video archives to identify events of interest. As a result, video summarization, which automatically extracts a brief yet informative summary of these videos, has attracted intense attention in recent years.
Although video summarization has been extensively studied during the past few years, many previous methods mainly focused on developing a variety of ways to summarize single-view videos in the form of a key-frame sequence or a video skim [53], [11], [70], [9], [27], [26], [29]. However, another important problem that is rarely addressed in this context is to find an informative summary from multi-view videos [14], [33], [43], [28], [44]. Multi-view video summarization refers to the problem of summarization that seeks to take a set of input videos captured from different cameras focusing on roughly the same fields-of-view (fov) from different viewpoints and produce a video synopsis or key-frame sequence that presents the most important portions of the inputs within a short duration (see Fig. 1). In this paper, given a set of videos and their shots, we focus on developing an unsupervised approach for selecting a subset of shots that constitute the multi-view summary. Such a summary can be very beneficial in many surveillance systems equipped in offices, banks, factories, and crossroads of cities, for obtaining significant information in a short time.

Fig. 1. An illustration of a multi-view camera network where six cameras C1, C2, ..., C6 are observing an area (black rectangle) from different viewpoints. Since the views are roughly overlapping, information correlations across multiple views along with correlations in each view should be taken into account for generating a concise multi-view summary.

Rameswar Panda and Amit K. Roy-Chowdhury are with the Department of Electrical and Computer Engineering, University of California, Riverside, CA, USA. E-mails: rpand002@ucr.edu, amitrc@ece.ucr.edu

Multi-view video summarization is different from single-video summarization in two important ways. First, although the amount of multi-view data is immensely challenging, there is a certain structure underlying it. Specifically, there is a large amount of correlation in the data due to the locations and fields of view of the cameras.
So, content correlations as well as discrepancies among different videos need to be properly modeled for obtaining an informative summary. Second, these videos are captured with different view angles and depths of field for the same scenery, resulting in a number of unaligned videos. Hence, differences in illumination, pose, and view angle, together with synchronization issues, pose a great challenge in summarizing these videos. So, methods that attempt to extract a summary from single-view videos usually do not produce an optimal set of representatives while summarizing multi-view videos.

To address the challenges encountered in a camera network, we propose a novel multi-view video summarization method, which has the following advantages. First, to better characterize the multi-view structure, we project the data points into a latent embedding which is able to preserve both intra- and inter-view correlations without assuming any prior correspondences/alignment between the multi-view videos. Our underlying idea hinges upon the basic concept of subspace learning [6], [41], which typically aims to obtain a latent subspace shared by multiple views by assuming that these views are generated from this subspace. Second, we propose a sparse representative selection method over the learned embedding to summarize the multi-view videos. Specifically, we formulate the task of finding summaries as a sparse coding problem where the dictionary is constrained to have a fixed basis (the dictionary is the matrix of the same data points) and the nonzero rows of the sparse coefficient matrix represent the multi-view summaries. Finally, to better leverage the multi-view embedding and selection mechanism, we learn the embedding and the optimal representatives jointly. Specifically, instead of simply using the embedding to characterize multi-view correlations and then applying the selection method, we propose to adaptively change the embedding with respect to the representative selection mechanism and unify these two objectives in forming a joint optimization problem. With joint embedding and sparse representative selection, our final objective function is both non-smooth and non-convex. We present an efficient optimization algorithm based on half-quadratic function theory to solve the final objective function.

II. RELATED WORK

There is a rich body of literature in multimedia and computer vision on summarizing videos in the form of a key-frame sequence or a video skim (see [40], [62] for reviews).

Single-view Video Summarization. Much progress has been made in developing a variety of ways to summarize a single-view video in an unsupervised manner or in developing supervised algorithms. Various strategies have been studied, including clustering [1], [9], [18], [57], attention modeling [37], saliency based linear regression model [30], super frame segmentation [20], kernel temporal segmentation [53], crowd-sourcing [26], energy minimization [54], [13], storyline graphs [27], submodular maximization [19], determinantal point process [17], [69], archetypal analysis [60], long short-term memory [68] and maximal biclique finding [7]. Recently, there has been a growing interest in using sparse coding (SC) to solve the problem of video summarization [11], [70], [8], [39], [10], [38] since the sparsity and reconstruction error terms naturally fit into the problem of summarization.
In contrast to these prior works that can only summarize a single video, we develop a novel multi-view summarization method that jointly summarizes a set of videos to find a single summary for describing the collection altogether.

Multi-view Video Summarization. Generating a summary from multi-view videos is a more challenging problem than summarizing a single video, due to the inevitable thematic diversity and content overlaps within multi-view videos. To address the challenges encountered in multi-view settings, there have been some specifically designed approaches that use random walk over spatio-temporal graphs [14] and rough sets [33] to summarize multi-view videos. A recent work in [28] uses bipartite matching constrained optimum path forest clustering to solve the problem of multi-view video summarization. An online method can also be found in [43]. However, this method relies on inter-camera frame correspondence, which can be a very difficult problem in uncontrolled settings. The work in [31] and [32] also addresses a similar problem of summarization in non-overlapping camera networks. Learning from multiple information sources such as video tags [64], topic-related web videos [47], [48] and non-visual data [71], [65] is also a recent trend in multiple web video summarization.

This paper has significant differences from our previous work in [45]. First, in [45], we proposed a static multi-view summarization method that extracts a set of key frames to present the most important portions of the input videos in the form of story-boards. While key frames are a helpful way of indexing videos, they are limited in that all motion information is lost. This limits their use in many surveillance applications where video skimming, i.e., replacing all the videos by a shorter compilation of their fragments/shots, seems better suited for obtaining significant information in a short time. In this work, we focus on dynamic shot-based video summarization, which not only reduces computational cost but also provides a more flexible way of representing videos by considering temporal aspects of activities typically shown in videos.
Towards this, we propose a video representation scheme based on spatio-temporal C3D features, which have recently shown promising results in several video recognition tasks [61], [47]. Second, in [45], we adopt a two-step approach, i.e., both embedding and representative selection are performed independently while summarizing multi-view videos. By contrast, in this work, we jointly optimize both objectives, such that the embedding not only characterizes the multi-view structural correlations, but also indicates the requirements of sparse representative selection. Experiments show that joint optimization indeed improves the summarization performance by generating more informative multi-view summaries. Third, we conduct rigorous experiments on three additional multi-view datasets including one large scale dataset captured with 19 surveillance cameras in an indoor setting [43]. We also perform a subjective user study to validate the effectiveness of our approach in generating high quality summaries for a more efficient and engaging viewing experience. New experimentation with the spatio-temporal video representation and the joint optimization demonstrates the performance improvements of the current framework for summarizing multi-view videos.

III. PROPOSED METHODOLOGY

In this section, we start by giving the main notations and the definition of the multi-view summarization problem and then present our detailed approach to summarize multi-view videos.

Notation. We use uppercase letters to denote matrices and lowercase letters to denote vectors. For a matrix $A$, its $i$-th row and $j$-th column are denoted by $a^i$ and $a_j$ respectively. $\|A\|_F$ is the Frobenius norm of $A$ and $\mathrm{tr}(A)$ denotes the trace of $A$. The $\ell_p$-norm of the vector $a \in \mathbb{R}^n$ is defined as $\|a\|_p = \left(\sum_{i=1}^{n} |a_i|^p\right)^{1/p}$ and the $\ell_0$-norm is defined as $\|a\|_0 = \sum_{i=1}^{n} |a_i|^0$. The $\ell_{2,1}$-norm can be generalized to the $\ell_{r,p}$-norm, which is defined as $\|A\|_{r,p} = \left(\sum_{i=1}^{n} \|a^i\|_r^p\right)^{1/p}$. The operator $\mathrm{diag}(\cdot)$ puts a vector on the main diagonal of a matrix.

Multi-View Video Summarization.
Given a set of videos captured with considerably overlapping fields-of-view across multiple cameras, the goal of multi-view video summarization is to compactly depict the input videos, distilling their most informative events into a short watchable synopsis. Specifically, the synopsis is composed of several video shots that represent the most important portions of the input video collection within a short duration.

Our approach can be roughly described as a set of three main tasks, namely (i) video representation, (ii) joint embedding and representative selection, and (iii) summary generation. In particular, our approach works as follows. First, we segment each video into multiple non-uniform shots using an existing temporal segmentation algorithm and represent each shot by a feature vector using a mean pooling scheme over the extracted C3D features (Section III-A). Then, we develop a novel scheme for joint embedding and representative selection by exploiting the multi-view correlations without assuming any prior correspondence between the videos (Sections III-B, III-C, III-D). Specifically, we formulate the task of finding summaries as an $\ell_{2,1}$ sparse optimization where the nonzero rows of the sparse coefficient matrix represent the relative importance of the corresponding shots. Finally, the approach outputs a video summary composed of the shots with the highest importance scores (Section V).

A. Video Representation

Video representation is a crucial step in summarization for maintaining visual coherence, which in turn affects the overall quality of a summary. It basically consists of two main steps, namely (i) temporal segmentation and (ii) feature representation. We describe these steps in the following sections.

Temporal Segmentation. Our approach starts with segmenting videos using an existing algorithm [7]. We segment each video into multiple shots by measuring the amount of change between two consecutive frames in the RGB and HSV color spaces. A shot boundary is determined at a certain frame when the portion of total change is greater than 75% [7]. We added an additional constraint to the algorithm to ensure that the number of frames within each shot lies in the range [32, 96]. The segmented shots serve as the basic units for feature extraction and subsequent processing to extract a summary.

Feature Representation.
Recent advancement in deep feature learning has revealed that features extracted from upper or intermediate layers of a CNN are generic features that have good transfer learning capabilities across different domains [58], [25]. An advantage of using deep learning features is that there exist accurate, large-scale datasets such as Imagenet [56] and Sports-1M [25] from which they can be extracted. For videos, C3D features [61] have recently shown better performance compared to features extracted using each frame separately [61], [67]. We therefore extract C3D features by taking sets of 16 input frames, applying 3D convolutional filters, and extracting the responses at layer FC6 as suggested in [61]. This is followed by a temporal mean pooling scheme to maintain the local ordering structure within a shot. The pooling result then serves as the final feature vector of a shot (4096-dimensional) to be used in the sparse optimization. We will discuss the performance benefits of employing C3D features in our experiments.

B. Multi-view Video Embedding

Consider a set of $K$ different videos captured from different cameras, where $X^{(k)} = \{x_i^{(k)} \in \mathbb{R}^D, \; i = 1, \dots, N_k\}$, $k = 1, \dots, K$. Each $x_i$ represents the feature descriptor of a video shot in the $D$-dimensional feature space. We represent each shot by extracting the shot-level C3D features as described above. As the videos are captured non-synchronously, the number of shots in each video might be different and hence there is no optimal one-to-one correspondence that can be assumed. We use $N_k$ to denote the number of shots in the $k$-th video and $N$ to denote the total number of shots in all videos.

Given the multi-view videos, our goal is to find an embedding for all the shots into a joint latent space while satisfying some constraints. Specifically, we are seeking a set of embedded coordinates $Y^{(k)} = \{y_i^{(k)} \in \mathbb{R}^d, \; i = 1, \dots, N_k\}$, $k = 1, \dots, K$, where $d$ ($\ll D$) is the dimensionality of the embedding space, with the following two constraints: (1) Intra-view correlations. The content correlations between shots of a video should be preserved in the embedding space. (2) Inter-view correlations.
The shots from different videos with high feature similarity should be close to each other in the resulting embedding space as long as they do not violate the intra-view correlations present in an individual view.

Modeling Multi-view Correlations. To achieve an embedding that preserves the above two constraints, we need to consider feature similarities between two shots in an individual video as well as across two different videos. Inspired by the recent success of sparse representation coefficient based methods to compute data similarities in subspace clustering [12], we adopt such coefficients in modeling multi-view correlations. Our proposed approach has two nice properties: (1) the similarities computed via sparse coefficients are robust against noise and outliers since the value not only depends on the two shots, but also depends on other shots that belong to the same subspace, and (2) it simultaneously carries out the adjacency construction and similarity calculation within one step, unlike kernel based methods that usually handle these tasks independently with optimal choice of several parameters.

Intra-view Similarities. Intra-view similarity should reflect the spatial arrangement of feature descriptors in each view. Based on the self-expressiveness property [12] of an individual view, each shot can be sparsely represented by a small subset of shots that are highly correlated in the dataset. Mathematically, for the $k$-th view, it can be represented as

$$x_i^{(k)} = X^{(k)} c_i^{(k)}, \quad c_{ii}^{(k)} = 0, \tag{1}$$

where $c_i^{(k)} = [c_{i1}^{(k)}, c_{i2}^{(k)}, \dots, c_{iN_k}^{(k)}]^T$, and the constraint $c_{ii}^{(k)} = 0$ eliminates the trivial solution of representing a shot with itself. The coefficient vector $c_i^{(k)}$ should have nonzero entries for a few shots that are correlated and zeros for the rest. However, in (1), the representation of $x_i$ in the dictionary $X$ is not unique in general. Since we are interested in efficiently finding a nontrivial sparse representation of $x_i$, we consider the tightest convex relaxation of the $\ell_0$ norm, i.e.,

$$\min \|c_i^{(k)}\|_1 \quad \text{s.t.} \quad x_i^{(k)} = X^{(k)} c_i^{(k)}, \; c_{ii}^{(k)} = 0. \tag{2}$$

It can be rewritten in matrix form for all shots in a view as

$$\min \|C^{(k)}\|_1 \quad \text{s.t.} \quad X^{(k)} = X^{(k)} C^{(k)}, \; \mathrm{diag}(C^{(k)}) = 0, \tag{3}$$

where $C^{(k)} = [c_1^{(k)}, c_2^{(k)}, \dots, c_{N_k}^{(k)}]$ is the sparse coefficient matrix whose $i$-th column corresponds to the sparse representation of the shot $x_i^{(k)}$. The coefficient matrix obtained from the above $\ell_1$ sparse optimization essentially characterizes the shot correlations and thus it is natural to utilize it as intra-view similarities. This provides an immediate choice of the intra-view similarity matrix as $C_{intra}^{(k)} = |C^{(k)}|^T$, where the $i$-th row of the matrix $C_{intra}^{(k)}$ represents the similarities between the $i$-th shot and all other shots in the view.

Inter-view Similarities. Since all cameras are focusing on roughly the same fovs from different viewpoints, all views apparently have a single underlying structure. Following this assumption in a multi-view setting, we find the correlated shots across two views by solving an $\ell_1$ sparse optimization similar to the one used for intra-view similarities. Specifically, we calculate the pairwise similarity between the $m$-th and $n$-th views by solving the following optimization problem:

$$\min \|C^{(m,n)}\|_1 \quad \text{s.t.} \quad X^{(m)} = X^{(n)} C^{(m,n)}, \tag{4}$$

where $C^{(m,n)} \in \mathbb{R}^{N_n \times N_m}$ is the sparse coefficient matrix whose $i$-th column corresponds to the sparse representation of the shot $x_i^{(m)}$ using the dictionary $X^{(n)}$. Ideally, after solving the proposed optimization problem in (4), we obtain a sparse representation for a shot in the $m$-th view whose nonzero elements correspond to shots from the $n$-th view that belong to the same subspace. Finally, the inter-view similarity matrix between the $m$-th and $n$-th views can be represented as $C_{inter}^{(m,n)} = |C^{(m,n)}|^T$, where the $i$-th row of the matrix $C_{inter}^{(m,n)}$ represents the similarities between the $i$-th shot of the $m$-th view and all other shots in the $n$-th view.

Objective Function. The aim of embedding is to correctly match the proximity score between two shots $x_i$ and $x_j$ to the score between the corresponding embedded points $y_i$ and $y_j$ respectively. Motivated by this observation, we reach the following objective function on the embedded points $Y$:

$$J(Y^{(1)}, \dots, Y^{(K)}) = \sum_{k} J_{intra}(Y^{(k)}) + \sum_{\substack{m,n \\ m \neq n}} J_{inter}(Y^{(m)}, Y^{(n)}) = \sum_{k} \sum_{i,j} \|y_i^{(k)} - y_j^{(k)}\|^2 \, C_{intra}^{(k)}(i,j) + \sum_{\substack{m,n \\ m \neq n}} \sum_{i,j} \|y_i^{(m)} - y_j^{(n)}\|^2 \, C_{inter}^{(m,n)}(i,j), \tag{5}$$

where $k$, $m$ and $n = 1, \dots, K$. $J_{intra}(Y^{(k)})$ is the cost of preserving local correlations within $X^{(k)}$ and $J_{inter}(Y^{(m)}, Y^{(n)})$ is the cost of preserving correlations between $X^{(m)}$ and $X^{(n)}$.
The first term says that if two shots $(x_i^{(k)}, x_j^{(k)})$ of a view are similar, which happens when $C_{intra}^{(k)}(i,j)$ is larger, their locations in the embedded space, $y_i^{(k)}$ and $y_j^{(k)}$, should be close to each other. Similarly, the second term tries to preserve the inter-view correlations by bringing the embedded points $y_i^{(m)}$ and $y_j^{(n)}$ close to each other if the pairwise proximity score $C_{inter}^{(m,n)}(i,j)$ is high. Problem (5) can be rewritten using one similarity matrix defined over the whole set of video shots as

$$J(Y) = \sum_{m,n} \sum_{i,j} \|y_i^{(m)} - y_j^{(n)}\|^2 \, C_{total}^{(m,n)}(i,j), \tag{6}$$

where the total similarity matrix is defined as

$$C_{total}^{(m,n)}(i,j) = \begin{cases} C_{intra}^{(k)}(i,j) & \text{if } m = n = k, \\ C_{inter}^{(m,n)}(i,j) & \text{otherwise.} \end{cases} \tag{7}$$

This construction defines an $N \times N$ similarity matrix where the diagonal blocks represent the intra-view similarities and the off-diagonal blocks represent the inter-view similarities. Note that since each $\ell_1$ optimization in the total similarity matrix construction (7) is solved individually, a fast parallel computing strategy can easily be adopted for efficiency. However, the matrix in (7) is not symmetric since in the $\ell_1$ optimizations (2) and (4), a shot $x_i$ can be represented as a linear combination of some shots including $x_j$, but $x_i$ may not be present in the sparse representation of $x_j$. Ideally, however, a similarity matrix should be symmetric, so that shots belonging to the same subspace are connected to each other. Hence, we reformulate (6) with a symmetric similarity matrix $W = C_{total} + C_{total}^T$ as

$$F(Y) = \sum_{m,n} \sum_{i,j} \|y_i^{(m)} - y_j^{(n)}\|^2 \, W^{(m,n)}(i,j). \tag{8}$$

With the above formulation, we make sure that two shots $x_i$ and $x_j$ get connected to each other if either $x_i$ or $x_j$ is in the sparse representation of the other. Furthermore, we normalize $W$ as $w_i \leftarrow w_i / \|w_i\|$ to make sure the weights in the similarity matrix are of the same scale. Given this construction, problem (8) reduces to the Laplacian embedding [2] of shots defined by the similarity matrix $W$. So, the optimization problem can be written as

$$Y^* = \underset{Y, \; YY^T = I}{\arg\min} \; \mathrm{tr}(Y L Y^T), \tag{9}$$

where $L$ is the graph Laplacian matrix of $W$ and $I$ is an identity matrix.
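As a rough illustration of the construction in (7) through (9), the following NumPy sketch assembles the block similarity matrix, symmetrizes it, and takes the bottom nonzero eigenvectors of the graph Laplacian. The function names (`assemble_total_similarity`, `laplacian_embedding`), the container layout for the blocks, the use of the unnormalized Laplacian $L = D - W$, and the omission of the normalization of $W$ are assumptions made for this example, not details confirmed by the paper.

```python
import numpy as np


def assemble_total_similarity(intra, inter):
    """Stack per-view blocks into the N x N matrix of Eq. (7).

    intra: list of K arrays, intra[k] has shape (N_k, N_k).
    inter: dict mapping (m, n), m != n, to an array of shape (N_m, N_n).
    (Hypothetical container layout chosen for this sketch.)
    """
    K = len(intra)
    rows = [[intra[m] if m == n else inter[(m, n)] for n in range(K)]
            for m in range(K)]
    return np.block(rows)


def laplacian_embedding(C_total, d, tol=1e-10):
    """Return a d x N embedding Y with Y @ Y.T = I, in the spirit of Eq. (9)."""
    W = C_total + C_total.T              # symmetric similarity, as for Eq. (8)
    L = np.diag(W.sum(axis=1)) - W       # unnormalized graph Laplacian (assumed)
    evals, evecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    nonzero = np.flatnonzero(evals > tol * max(1.0, evals[-1]))
    return evecs[:, nonzero[:d]].T       # bottom d nonzero eigenvectors


# Toy example: two views with 3 and 2 shots and nonnegative similarities.
rng = np.random.default_rng(0)
intra = [rng.random((3, 3)), rng.random((2, 2))]
for block in intra:
    np.fill_diagonal(block, 0.0)         # mirrors the diag(C) = 0 constraint
inter = {(0, 1): rng.random((3, 2)), (1, 0): rng.random((2, 3))}

C_total = assemble_total_similarity(intra, inter)
Y = laplacian_embedding(C_total, d=2)
```

Because `np.linalg.eigh` returns orthonormal eigenvectors, the rows of `Y` satisfy the constraint $YY^T = I$ by construction; the near-zero eigenvalue that belongs to the constant vector is skipped so that only nonzero eigenvectors are kept.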
Minimizing (9) is a generalized eigenvector problem and the optimal solution can be obtained from the bottom $d$ nonzero eigenvectors. Note that our approach is agnostic to the choice of embedding algorithm. Our method is based on the graph Laplacian because it is one of the state-of-the-art methods in characterizing the manifold structure and performs satisfactorily well in several vision and multimedia applications [15], [36], [42].

C. Sparse Representative Selection

Once the embedding is obtained, our next goal is to find an optimal subset of all the embedded shots, such that each shot can be described as a weighted linear combination of a few of the shots from the subset. The subset is then referred to as the informative summary of the multi-view videos. In particular, we are trying to represent the multi-view videos by selecting only a few representative shots. Therefore, our natural goal is to establish a shot-level sparsity, which can be induced by performing $\ell_1$ regularization on the rows of the sparse coefficient matrix [8], [11]. By introducing the row sparsity regularizer, the summarization problem can now be succinctly formulated as

$$\min_{Z \in \mathbb{R}^{N \times N}} \|Z\|_{2,1} \quad \text{s.t.} \quad Y = YZ, \tag{10}$$

where $\|Z\|_{2,1} \triangleq \sum_{i=1}^{N} \|z^i\|_2$ is the row sparsity regularizer, i.e., the sum of the $\ell_2$ norms of the rows of $Z$. The self-expressiveness constraint ($Y = YZ$) in summarization is logical as the representatives for the summary should come from the original frame set. Using Lagrange multipliers, (10) can be written as

$$\min_{Z} \|Y - YZ\|_F^2 + \lambda \|Z\|_{2,1}, \tag{11}$$

where $\lambda$ is a regularization parameter that balances the weight of the two terms. Once (11) is solved, the representative shots are selected as the points whose corresponding $\|z^i\|_2 \neq 0$.

Remark 1. Notice that both sparse optimizations in (3) and (10) look similar; however, the nature of the sparse regularizer in the two formulations is completely different. In (3), the objective of the $\ell_1$ regularizer is to induce element-wise sparsity in a column, whereas in (10), the objective of the $\ell_{2,1}$ regularizer is to induce row-level sparsity in a matrix.

Remark 2. Given the non-uniform length of shots, (11) can be modified to a weighted $\ell_{2,1}$-norm based objective to consider the length of video shots while selecting representatives as

$$\min_{Z} \|Y - YZ\|_F^2 + \lambda \|QZ\|_{2,1}, \tag{12}$$

where $Q = \mathrm{diag}(q)$ and $q \in \mathbb{R}^N$ represents the temporal length of each video shot. It is easy to see that problem (12) favors selection of shorter video shots by assigning a lower score via $Q$. In other words, problem (12) tries to minimize the number of shots by considering the temporal length of video shots, such that the overall objective turns to minimizing the length of the final summary.

D. Joint Embedding and Sparse Representative Selection

We now discuss our proposed method to jointly optimize the multi-view video embedding and sparse representation to select a diverse set of representative shots. Specifically, the performance of sparse representative selection is largely determined by the effectiveness of the graph Laplacian in embedding learning. Hence, it is a natural choice to adaptively change the graph Laplacian with respect to the subsequent sparse representative selection, such that the embedding not only characterizes the manifold structure, but also indicates the requirements of sparse representative selection. By combining the objective functions (9) and (11), the joint objective function becomes

$$\min_{Y, Z, \; YY^T = I} \mathrm{tr}(Y L Y^T) + \alpha \left( \|Y - YZ\|_F^2 + \lambda \|Z\|_{2,1} \right), \tag{13}$$

where $\alpha > 0$ is a trade-off parameter between the two objectives. The first term of the cost function projects the input data into a latent embedding by capturing the meaningful structure of the data, whereas the second term helps in selecting a robust set of representatives by minimizing the reconstruction error and the sparsity.
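To make Remark 1 and the joint objective (13) concrete, the sketch below contrasts the element-wise $\ell_1$ norm with the row-wise $\ell_{2,1}$ norm and evaluates (13) for given $Y$, $Z$ and $L$. The helper names and the toy matrices are illustrative assumptions, not part of the paper.

```python
import numpy as np


def l1_norm(Z):
    """Element-wise l1 norm: sum of absolute values of all entries."""
    return float(np.abs(Z).sum())


def l21_norm(Z):
    """l2,1 norm: sum of the l2 norms of the rows of Z."""
    return float(np.linalg.norm(Z, axis=1).sum())


def joint_objective(Y, Z, L, alpha, lam):
    """Value of Eq. (13) for an embedding Y (d x N) and coefficients Z (N x N)."""
    embedding_cost = np.trace(Y @ L @ Y.T)
    reconstruction = np.linalg.norm(Y - Y @ Z, "fro") ** 2
    return float(embedding_cost + alpha * (reconstruction + lam * l21_norm(Z)))


# One active row: the l2,1 norm sees a single selected shot.
Z_rows = np.array([[3.0, 4.0], [0.0, 0.0]])
# Same magnitudes spread over two rows: both shots count as selected.
Z_spread = np.array([[3.0, 0.0], [0.0, 4.0]])

print(l1_norm(Z_rows), l21_norm(Z_rows))      # 7.0 5.0
print(l1_norm(Z_spread), l21_norm(Z_spread))  # 7.0 7.0
```

Both matrices have the same $\ell_1$ norm, but the $\ell_{2,1}$ norm is smaller when the nonzero entries are concentrated in one row, which is why it drives whole rows of $Z$ to zero and leaves a small set of representative shots.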
Note that the proposed method is also computationally efficient as the sparse representative selection is done in the low-dimensional space by discarding the irrelevant part of a data point represented by a high-dimensional feature, which can derail the representative selection process.

IV. OPTIMIZATION

The optimization problem in (13) is non-smooth and non-convex. Solving it is thus more difficult due to the non-smooth $\ell_{2,1}$ norm and the additional embedding variable $Y$. Half-quadratic optimization techniques [21], [22] have been shown to be effective in solving these sparse optimizations in several vision and multimedia applications [63], [66], [50], [35]. Motivated by such methods, we devise an iterative algorithm to efficiently solve (13) by minimizing its augmented function alternately (see footnote 1). Specifically, if we define $\phi(x) = \sqrt{x^2 + \epsilon}$ with $\epsilon$ being a constant, we can transform $\|Z\|_{2,1}$ to $\sum_{i=1}^{n} \sqrt{\|z^i\|_2^2 + \epsilon}$, according to the analysis of the $\ell_{2,1}$-norm in [21], [35]. With this transformation, we can optimize (13) efficiently in an alternating way as follows.

Footnote 1: We solve all the sparse optimization problems using half-quadratic optimization techniques [21], [22]. Due to space limitation, we only present the optimization procedure to solve (13). However, the same procedure can be easily extended to solve the other sparse optimizations (3) and (4).

According to the half-quadratic theory [21], [22], [16], the augmented cost function of (13) can be written as

$$\min_{Y, Z, \; YY^T = I} \mathrm{tr}(Y L Y^T) + \alpha \left( \|Y - YZ\|_F^2 + \lambda \, \mathrm{tr}(Z^T P Z) \right), \tag{14}$$

where $P \in \mathbb{R}^{N \times N}$ is a diagonal matrix, and the corresponding $i$-th element is defined as

$$P_{i,i} = \frac{1}{2 \sqrt{\|z^i\|_2^2 + \epsilon}}, \tag{15}$$

where $\epsilon$ is a smoothing term, which is usually set to a small constant value. With this transformation, note that problem (14) is convex separately with respect to $Y$, $Z$, and $P$. Hence, we can solve (14) alternately with the following three steps with respect to $Z$, $Y$, and $P$, respectively.
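A minimal NumPy sketch of one way to run the alternating scheme for (14) is given here: a linear solve for $Z$, an eigen-decomposition for $Y$, and the closed-form update of $P$ from (15). The function names, the default value of `lam`, the stopping rule, and the use of the symmetric matrix $(I - Z)(I - Z)^T$ in the $Y$-step (which gives the same trace as the reconstruction term) are assumptions of this sketch, not details taken from the authors' implementation.

```python
import numpy as np


def bottom_eigenvectors(M, d, skip_null=False, tol=1e-10):
    """Rows are eigenvectors of symmetric M for its d smallest eigenvalues."""
    evals, evecs = np.linalg.eigh((M + M.T) / 2.0)   # ascending eigenvalues
    idx = np.arange(evals.size)
    if skip_null:                                     # drop (near-)zero modes
        idx = idx[evals > tol * max(1.0, evals[-1])]
    return evecs[:, idx[:d]].T


def solve_joint(L, d, alpha=0.05, lam=0.1, eps=1e-8, max_iter=25, tol=1e-6):
    """Alternately update Z, Y and P for the augmented objective (14)."""
    N = L.shape[0]
    I = np.eye(N)
    Y = bottom_eigenvectors(L, d, skip_null=True)     # initialize Y from (9)
    P = np.eye(N)                                     # initialize P as identity
    previous = np.inf
    for _ in range(max_iter):
        # Z-step: solve (Y^T Y + lam * P) Z = Y^T Y.
        G = Y.T @ Y
        Z = np.linalg.solve(G + lam * P, G)
        # Y-step: d smallest eigenvectors of L + alpha * (I - Z)(I - Z)^T.
        R = I - Z
        Y = bottom_eigenvectors(L + alpha * (R @ R.T), d)
        # P-step: diagonal reweighting from the row norms of Z, Eq. (15).
        row_sq = np.sum(Z ** 2, axis=1)
        P = np.diag(1.0 / (2.0 * np.sqrt(row_sq + eps)))
        # Smoothed version of objective (13), used only as a stopping check.
        value = (np.trace(Y @ L @ Y.T)
                 + alpha * (np.linalg.norm(Y - Y @ Z, "fro") ** 2
                            + lam * np.sum(np.sqrt(row_sq + eps))))
        change = abs(previous - value)
        if tol * max(1.0, abs(value)) >= change:
            break
        previous = value
    return Y, Z


# Toy graph on 6 shots with random symmetric weights.
rng = np.random.default_rng(0)
W = rng.random((6, 6))
W = W + W.T
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W
Y, Z = solve_joint(L, d=2)
shot_weights = np.linalg.norm(Z, axis=1)   # l2 norms of the rows of Z
```

The linear system in the $Z$-step is always solvable here because $Y^T Y$ is positive semidefinite and $\lambda P$ is a positive diagonal matrix, so their sum is positive definite.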
(1) Solving for Z: For a given $P$ and $Y$, solve the following objective to estimate $Z$:

$$\min_{Z} \; \alpha \left( \mathrm{tr}\left((Y - YZ)(Y - YZ)^T\right) + \lambda \, \mathrm{tr}(Z^T P Z) \right). \tag{16}$$

By setting the derivative of (16) with respect to $Z$ to zero, the optimal solution can be computed by solving the following linear system:

$$(Y^T Y + \lambda P) Z = Y^T Y. \tag{17}$$

(2) Solving for Y: For a given $P$ and $Z$, solve the following objective to estimate $Y$:

$$\min_{Y, \; YY^T = I} \mathrm{tr}(Y L Y^T) + \alpha \, \mathrm{tr}\left((Y - YZ)(Y - YZ)^T\right) = \min_{Y, \; YY^T = I} \mathrm{tr}\left(Y \left(L + \alpha (I - 2Z + ZZ^T)\right) Y^T\right). \tag{18}$$

Eq. (18) can be solved by eigen-decomposition of the matrix $L + \alpha(I - 2Z + ZZ^T)$. We pick the eigenvectors corresponding to the $d$ smallest eigenvalues.

(3) Solving for P: When $Z$ is fixed, we can update $P$ by employing the formulation in Eq. (15) directly.

We continue to alternately solve for $Z$, $Y$, and $P$ until a maximum number of iterations is reached or a predefined threshold is reached. Since the alternating minimization can get stuck in a local minimum, it is important to have a sensible initialization. We initialize $Y$ by solving (9) using an eigen-decomposition and $P$ by an identity matrix. Experiments show that the alternating minimization converges fast with this kind of initialization. In practice, we observe convergence in fewer than 25 iterations. Therefore, the proposed method can be applied to large scale problems in practice.

V. SUMMARY GENERATION

Above, we described how we compute the optimal sparse coefficient matrix $Z$ by jointly optimizing the multi-view embedding learning and sparse representative selection. We use the following rules to extract a multi-view summary:

(i) We first generate a weight curve using the $\ell_2$ norms of the rows in $Z$, since it provides information about the relative importance of the representatives for describing the whole videos. More specifically, a video shot with higher importance takes part in the reconstruction of many other video shots, hence its corresponding row in $Z$ has many nonzero elements with large values. On the other hand, a shot with lower importance takes part in the reconstruction of fewer shots in the whole videos, hence its corresponding row in $Z$ has a few nonzero elements with smaller values. Thus, we can generate a weight curve, where the weight measures the confidence of the video shot to be included in the final video summary.

(ii) We detect local maxima from the weight curve, then extract an optimal summary of specified length from the local maxima, constrained by the weight value and the full sequence coverage assumption. Note that shots with low or zero weights cannot be inserted into the final video summary.

Furthermore, the weight curve in our framework allows users to choose a different number of shots in the summary without incurring additional computational cost. In contrast, many other multi-view video summarization methods need to preset the number of video shots that should be included in the final summary, and any change will result in a re-calculation. Therefore, the proposed approach is scalable in generating summaries of different lengths and hence provides more flexibility for practical applications. More details on the summary length and scalability are included in the experiments.

VI. EXPERIMENTS

In this section, we present various experiments and comparisons to validate the effectiveness and efficiency of our proposed algorithm in summarizing multi-view videos.

TABLE I
DATASET STATISTICS

Datasets    # Views   Total Durations (Mins.)   Settings   Camera Type
Office      4         46:19                     Indoor     Fixed
Campus      4         56:43                     Outdoor    Non-fixed
Lobby       3         24:42                     Indoor     Fixed
Road        3         22:46                     Outdoor    Non-fixed
Badminton   3         15:07                     Indoor     Fixed
BL-7F                 :10                       Indoor     Fixed

A. Datasets. We conduct rigorous experiments using 6 multi-view datasets with 36 videos in total, which are from [14], [43] (see Tab. I). The datasets are captured in both indoor and outdoor environments with overall 360 degree coverage of the scene, making them more difficult to summarize.
All these datasets are standard in multi-view video summarization and have been used by the prior works [14], [28], [33]. It is important to note that the experiments in our prior work [45] were limited to only 3 datasets, whereas in the current work we conduct experiments on 6 datasets, including BL-7F, which is one of the largest publicly available datasets for multi-view video summarization.

B. Performance Measures. To provide an objective comparison, we compare all the approaches using three quantitative measures: Precision, Recall and F-measure ($\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$) [14], [28]. For all these metrics, a higher value indicates better summarization quality. We set the same summary length as in [14] to generate our summaries and employ the ground truth of important events reported in [14] to compute the performance measures. More specifically, the ground truth annotations contain a list of events with the corresponding start and end frames for each dataset. We take an event as correctly detected if our framework produces a video shot between the start and end of the event. We follow the prior works [14], [28], [43] and consider an event to be redundant if we detect the event simultaneously from more than one camera. Such an evaluation setting gives a fair comparison with the previous state-of-the-art methods [14], [33], [28], [44], [45].

C. Experimental Settings. We maintain the following conventions during all our experiments. (i) All our experiments are based on unoptimized MATLAB code on a desktop PC with an Intel(R) Core(TM) processor and 16 GB of DDR3 memory. We used an NVIDIA Tesla K40 GPU to extract the C3D features. (ii) Each feature descriptor is $L_2$-normalized. (iii) Determining the intrinsic dimensionality of the embedding is an open problem in the field of manifold learning. One common way is to determine it by grid search. We determine it as in most traditional approaches, such as [3].
(iv) The sparsity regularization parameter λ is computed as λ0/ρ, where λ0 is analytically computed from the embedded points [11]. (v) We empirically set α to 0.05 and keep it fixed for all results.

D. Comparison with State-of-the-art Multi-view Methods.

Goal. This experiment aims at evaluating our approach compared to the state-of-the-art multi-view summarization methods presented in the literature.

Compared Methods. We contrast our approach with several state-of-the-art methods which are specifically designed for multi-view video summarization, as follows.

RandomWalk [14]. The method first creates a spatio-temporal shot graph and then uses random walk as a clustering algorithm over the graph to extract multi-view summaries.

RoughSets [33]. The method first adopts an SVM classifier as the key frame abstraction process and then applies rough sets to remove similar frames.

BipartiteOPF [28]. This method first uses bipartite graph matching to model the inter-view correlations and then applies optimum path forest clustering on the refined adjacency matrix to generate the multi-view summary.

GMM [43]. An online Gaussian mixture model clustering is first applied on each view independently and then a distributed view selection algorithm is adopted to remove the content redundancy in the inter-view stage.

Implementation Details. To report the results of existing methods, we use prior published numbers when possible. In particular, for the multi-view summarization methods (RandomWalk, BipartiteOPF and GMM), we report the available results from the corresponding papers, and we implement RoughSets ourselves using the same video representation as the proposed one, tuning its parameters to have the best performance.

Results. Table II shows the results on three multi-view datasets, namely the Office, Campus and Lobby datasets. We have the following key observations from Table II: (i) Our approach produces summaries with the same precision as RandomWalk and BipartiteOPF for both the Office and Lobby datasets.
However, the improvement in recall value indicates the ability of our method to keep more important information in the summary compared to both of the approaches. As an illustration, in the Office dataset, the event of looking for a thick book by a member while present in the cubicle is absent in the summary produced by RandomWalk, whereas it is correctly detected by our proposed method. Fig. 2 in this connection explains the whole sequence of events detected using our approach as compared to RandomWalk.

TABLE II
PERFORMANCE COMPARISON WITH SEVERAL BASELINES INCLUDING BOTH SINGLE AND MULTI-VIEW METHODS APPLIED ON THE THREE MULTI-VIEW DATASETS. P: PRECISION IN PERCENTAGE, R: RECALL IN PERCENTAGE AND F: F-MEASURE. OURS PERFORMS THE BEST.

Methods             Office (P R F)   Campus (P R F)   Lobby (P R F)   Reference
Attention-Concate                                                     TMM2005 [37]
Sparse-Concate                                                        TMM2012 [8]
Concate-Attention                                                     TMM2005 [37]
Concate-Sparse                                                        TMM2012 [8]
Graph                                                                 TCSVT2006 [51]
RandomWalk                                                            TMM2010 [14]
RoughSets                                                             ICIP2011 [33]
BipartiteOPF                                                          TMM2015 [28]
Ours                                                                  Proposed

TABLE III
PERFORMANCE COMPARISON WITH GMM BASELINE ON BL-7F DATASET

Methods   Precision(%)   Recall(%)   F-measure(%)   Reference
GMM                                                 JSTSP2015 [43]
Ours                                                Proposed

Fig. 2. Sequence of events detected related to activities of a member (A0) inside the Office dataset. Top row: Summary produced by method [14], and Bottom row: Summary produced by our approach. Sequence of events detected in top row: 1st: A0 enters the room, 2nd: A0 sits in cubicle 1, 3rd: A0 leaves the room. Sequence of events detected in bottom row: 1st: A0 enters the room, 2nd: A0 sits in cubicle 1, 3rd: A0 is looking for a thick book to read (as per the ground truth in [14]), and 4th: A0 leaves the room. The event of looking for a thick book to read (as per the ground truth in [14]) is missing in the summary produced by method [14], whereas it is correctly detected by our approach (3rd frame: bottom row). This indicates our method captures video semantics in a more informative way compared to [14].

(ii) For all methods, including ours, performance on the Campus dataset is not as good as on the other datasets. This is expected since the Campus dataset contains many trivial events, as it was captured in an outdoor environment, thus making the summarization more difficult.
Nevertheless, for this challenging dataset, the F-measure of our approach is about 6% better than that of the recent BipartiteOPF. (iii) Table II also reveals that for all three datasets, recall is generally low compared to precision because users usually prefer to select more extensive summaries in the ground truth, which can be verified from the ground truth events from [14]. As a result, the number of events in the ground truth increases irrespective of their information content. (iv) Overall, on the three datasets, our approach outperforms all compared methods in terms of F-measure. This corroborates the fact that the proposed approach produces informative multi-view summaries in contrast to the state-of-the-art methods (see Fig. 3 for an illustrative example).

Table III shows results of our method on the larger and more complex BL-7F dataset, captured with 19 surveillance cameras on the 7th floor of the BarryLam Building at National Taiwan University [43]. From Table III, it is clearly evident that our approach significantly outperforms the recent method GMM in generating more informative multi-view summaries. The F-measure of our method is about 11% better than that of GMM [43]. This indicates that the proposed method is very effective and can be applied to large scale problems in practice. We follow the evaluation strategy of [43] and compute the performance measures in the unit of frames instead of events as in Table II to make a fair comparison with the GMM baseline.

Fig. 3. Summarized events for the Office dataset. Each event is represented by a key frame and is associated with two numbers, one above and one below the key frame. Numbers above the frame (E1, ..., E26) represent the event number, whereas the numbers below (V1, ..., V4) indicate the view from which the event is detected. Limited by space, we only present 10 events arranged in temporal order, as per the ground truth in [14]. Events shown (event/view): E1/V1, E2/V2, E3/V1, E4/V2, E6/V1, E7/V2, E8/V4, E9/V3, E10/V3, E11/V1.

E. Comparison with Single-view Methods.

Goal. The objective of this experiment is to compare our method with some single-view summarization approaches to show their performance on multi-view videos. Specifically, the purpose of comparing with single-view summarization methods is to show that techniques that attempt to find a summary from single-view videos usually do not produce an optimal set of representatives while summarizing multiple videos.

Compared Methods. We compare our approach with several baseline methods (Attention-Concate [37], Sparse-Concate [8], Concate-Attention [37], Concate-Sparse [8], Graph [51]) that use a single-video summarization approach over multi-view datasets to generate a summary. Note that in the first two baselines (Attention-Concate, Sparse-Concate), a single-video summarization approach is first applied to each view and then the resulting summaries are combined to form a single summary, whereas the other three baselines (Concate-Attention, Concate-Sparse, Graph) concatenate all the views into a single video and then apply a single-video approach to summarize multi-view videos. Both the Sparse-Concate and Concate-Sparse baselines use (11) to summarize multi-view videos without any embedding. The purpose of comparing with these two baseline methods is to explicitly show the advantage of our proposed multi-view embedding in generating informative and diverse summaries while summarizing multi-view surveillance videos.

Implementation Details. We implement Sparse-Concate and Concate-Sparse ourselves with the same temporal segmentation and C3D feature representation as the proposed one, whereas for the rest of the single-view summarization methods, we report the available results from the published papers [14], [28].

Fig. 4. Some summarized events for the Lobby dataset (columns: View 1, View 2, View 3). Top row: summary produced by Sparse-Concate [8], Middle row: summary produced by Concate-Sparse [8], and Bottom row: summary produced by our approach. It is clearly evident from both top and middle rows that both of the single-view baselines produce a lot of redundant events as per the ground truth [14] while summarizing multi-view videos; however, our approach (bottom row) produces meaningful representatives by exploiting the content correlations via an embedding. Redundant events are marked with same color borders. Note that both Sparse-Concate and Concate-Sparse summarize multiple videos without any embedding, by either applying sparse representative selection to each video separately or concatenating all the videos into a single video. Best viewed in color.

Results. We have the following key findings from Table II and Fig. 4: (i) The proposed method outperforms all the compared single-view summarization methods by a significant margin on all three datasets. We observe that directly applying these methods to summarize multiple videos produces a lot of redundant shots, which deviates from the fact that the optimal summary should be diverse and informative in describing the multi-view concepts. (ii) It is clearly evident from Fig. 4 that both of the sparse representative selection based single-view summarization methods (Sparse-Concate and Concate-Sparse) produce a lot of redundancies (simultaneous presence of most of the events) while summarizing videos on the Lobby dataset.
This is expected since both of the approaches fail to exploit the complicated inter-view content correlations present in multi-view videos. (iii) By using our multi-view video summarization method, such redundancy is largely reduced in contrast. Some events are recorded by the most informative summarized shots, while the most important events are reserved in our summaries. The proposed approach generates a highly informative and diverse summary in most cases, due to its ability to jointly model multi-view correlations and sparse representative selection.

F. Scalability in Generating Summaries. Scalability in generating summaries of different lengths has been shown to be effective while summarizing single videos [23], [46]. However, most of the prior multi-view summarization methods require the number of shots to be specified before generating summaries, which is highly undesirable in practical applications. Concretely speaking, the algorithm needs to be rerun for each change in the number of representative shots that the user wants to see in the summary. By contrast, our approach provides scalability in generating summaries of different lengths based on user constraints without any further analysis of the input videos (analyze once, generate many). This is due to the fact that the non-zero rows of the sparse coefficient matrix Z can generate a ranked list of representatives, which can subsequently be used to provide a scalable representation in generating summaries of desired length without incurring any additional cost. Such a scalability property makes our approach more suitable for providing a human-machine interface where the summary length is changed as per the user request. Fig. 5 shows the generated summaries of length 3, 5 and 7 most important shots (as determined by the weight curve described in Sec. V) for the Office dataset.

G. Performance Analysis with Shot-level C3D Features. We investigate the importance and reliability of the proposed video representation based on C3D features by comparing with 2D shot-level deep features, and find that the latter produces inferior results, with an F-measure of 84.01% averaged over three datasets (Office, Campus and Lobby) compared to 86.55% by the C3D features. We utilize Pycaffe with the VGG net pretrained model [59] to extract a 4096-dim feature vector of a frame and then use temporal mean pooling to compute a single shot-level feature vector, similar to the C3D features described in Sec. III-A. The spatio-temporal C3D features perform best, as they exploit the temporal aspects of activities typically shown in videos.

H. Performance Analysis with Video Segmentation. We examined the performance of our approach by replacing the temporal segmentation algorithm [7] by a naive approach that uniformly divides the video into several segments of equal length. We use uniform segments with a length of 2 seconds and keep the other components fixed while generating summaries.
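The naive segmentation and the shot-level pooling described in Sections G and H can be sketched as follows. This is only an illustration under stated assumptions: the function names and the `fps` parameter are hypothetical, the per-frame features are assumed to be already extracted as an array with one row per frame, and the $L_2$ normalization follows the convention listed in the experimental settings.

```python
import numpy as np


def uniform_segments(num_frames, fps, seconds=2.0):
    """Split frame indices [0, num_frames) into consecutive fixed-length segments.

    Returns a list of (start, end) pairs with end exclusive; the last segment
    may be shorter than the others.
    """
    step = max(1, int(round(fps * seconds)))
    return [(s, min(s + step, num_frames)) for s in range(0, num_frames, step)]


def shot_features(frame_features, segments):
    """Temporal mean pooling per segment, followed by L2 normalization of each row."""
    pooled = np.stack([frame_features[s:e].mean(axis=0) for s, e in segments])
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    # Guard against all-zero feature vectors before dividing.
    return pooled / np.maximum(norms, 1e-12)
```

For example, a 10-frame clip at a hypothetical 2 frames per second is split by `uniform_segments(10, fps=2)` into three segments, and `shot_features` then returns one unit-norm descriptor per segment.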

By using the video segmentation algorithm of [7], the proposed approach achieves an F-measure of 86.55% averaged over three datasets (Office, Campus and Lobby). On the other hand, with the use of uniform length segments, our approach obtains a mean F-measure of 85.43%. This shows that our approach is relatively robust to a change in the segmentation algorithm. Note that our proposed sparse optimization is flexible enough to incorporate more sophisticated temporal segmentation algorithms, e.g., [52], in generating video summaries; we expect such advanced and complex video segmentation algorithms will only benefit our proposed approach.

Fig. 5. The figure shows an illustrative example of scalability in generating summaries of different lengths based on the user constraints for the Office dataset. Each shot is represented by a key frame, and the shots are arranged according to the $\ell_2$ norms of the corresponding non-zero rows of the sparse coefficient matrix. (a): Summary for user length request of 3, (b): Summary for user length request of 5 and (c): Summary for user length request of 7.

TABLE IV
F-MEASURE COMPARISON WITH [45]

Methods   Office   Campus   Lobby   Reference
[45]      84.48    75.42    88.26   ICPR2016 [45]
Ours      89.36    77.78    92.     Proposed

I. Performance Comparison with [45]. We now compare the proposed approach with [45] to explicitly verify the effectiveness of the video representation and joint optimization for summarizing multi-view videos. Table IV shows the comparison with [45] on the Office, Campus and Lobby datasets. Following is the analysis of the results: (i) The proposed framework consistently outperforms [45] on all three datasets by a margin of about 5% in terms of F-measure (maximum improvement of 8% in terms of precision for the Office dataset). (ii) We improve around 3% in terms of F-measure for the more challenging Campus dataset, which demonstrates that the current framework is more effective in summarizing videos with outdoor scenes.
(iii) We believe the best performance in the proposed framework can be attributed to two factors working in concert: (a) a more flexible and powerful video representation via C3D features, and (b) joint embedding learning and sparse representative selection. Moreover, to better understand the contribution of joint optimization, we analyzed the performance of the proposed approach with shot-level C3D features and a 2-step process similar to [45], and found that the mean F-measure on three datasets (Office, Campus and Lobby) decreases from 86.55% to 83.85%. We believe this is because adaptively changing the graph Laplacian with respect to the sparse representative selection helps in better exploiting the multi-view correlations and also indicates the requirement of optimal representative shots to be included in the summary. It is also important to note that the approach in [45] is limited to key frame extraction only and hence may not be suitable for many surveillance applications, where video skims with motion information seem better suited for obtaining significant information in a short time.

TABLE V
USER STUDY: MEAN EXPERT RATINGS ON A SCALE OF 1 TO 10. OUR APPROACH SIGNIFICANTLY OUTPERFORMS OTHER AUTOMATIC METHODS.

Methods        Office   Campus   Lobby   Road   Badminton
RandomWalk
BipartiteOPF
Ours

J. User Study. With 5 study experts, we performed a human evaluation of the generated summaries to verify the results obtained from the automatic objective evaluation with F-measure. Our objective is to understand how a user perceives the quality of the summaries according to the visual pleasantness and information content of the system generated summary. Each study expert watched the videos at 3x speed and was then shown 3 sets of summaries constructed using different methods: RandomWalk, BipartiteOPF and Ours for 5 datasets (Office, Campus, Lobby, Road and Badminton).
Study experts were asked to rate the overall quality of each summary by assigning a rating from 1 to 10, where 1 corresponded to "The generated summary is not at all informative" and 10 corresponded to "The summary very well describes all the information present in the original videos and is also visually pleasant to watch". The summaries were shown in random order without revealing the identity of each method, and the audio track was not included, to ensure that the subjects chose the rating based solely on visual stimuli. The results are summarized in Table V. Similar to the objective evaluation, our approach significantly outperforms both of the methods (RandomWalk, BipartiteOPF). This again corroborates the fact that the proposed framework generates a more informative and diverse multi-view summary as compared to the state-of-the-art methods. Furthermore, we note that the relative rank of the different algorithms is largely preserved in the subjective user study as compared to the objective evaluation in Table II.

K. Discussions.

Abnormal Event Detection. Abnormal event detection and surveillance video summarization are two closely related problems in computer vision and multimedia. In a surveillance setting where an abnormal event took place, the proposed approach can select shots to represent the abnormal event in the final summary. This is due to the fact that our approach selects representative shots from the multi-view videos such that the set of videos can be reconstructed with high accuracy using the extracted summary. Specifically, the proposed approach in (13) favors selecting a set of shots as representatives for constructing the summary which can reconstruct all the events in the input with low reconstruction error. Consider a simple example for illustration. Let us assume a surveillance setting in a place with only pedestrian traffic. People are walking as usual and suddenly, a car is speeding. In order to reconstruct the part where the car is speeding, our method will choose a few shots from this portion; otherwise the reconstruction error will be high.

Multi-View Event Capture. In general, the purpose of overlapping fields of view is to help users check objects/events from different angles. For an event captured with multiple cameras having a large difference in view angles, the proposed method often selects more than one shot to represent the event in the summary. This is due to the fact that our approach selects representative shots from the multi-view videos such that the whole input can be reconstructed with low error. In our experiments, we have observed a similar situation while summarizing videos on the Campus dataset. The summary produced by our approach contains three shots captured with cameras 1, 3, and 4 in an outdoor environment which essentially represent the same event (E23 in the ground truth [14]). However, note that although including shots representing the same event from more than one camera in the summary may help a user to check events from different angles, it increases the summary length, which often deviates from the fact that the length of the summary should be as small as possible. Thus, the objective of our current work is to generate an optimal summary that balances the two main criteria of a good summary, i.e., maximizing the information content via representativeness and minimizing the length via sparsity.

Joint Video Segmentation and Summarization. Note that the proposed approach uses temporal video segmentation as a preprocessing step and then uses the shot-level features to extract summaries. Our approach can be modified in two ways to optimize the temporal segmentation for the task of video summarization.
First, involving a human in our current approach for giving feedback, similar to the concept of relative attributes in visual recognition [49], can help us in adaptively changing the shot boundaries for generating better quality summaries. Second, learning a dynamic agent using a Markov decision process (MDP) for moving the shot boundaries (forward or backward with temporal increments) based on the performance of our proposed summarization algorithm is also a possibility in this regard [4]. Developing an efficient framework for joint segmentation and summarization is an interesting practical problem; to the best of our knowledge there is no existing work on it, and we leave it as future work.

VII. CONCLUSIONS AND FUTURE WORKS

In this paper, we addressed the problem of summarizing multi-view videos via joint embedding learning and $\ell_{2,1}$ sparse optimization. The embedding helps in capturing content correlations in multi-view datasets without assuming any prior correspondence between the individual videos. On the other hand, the sparse representative selection helps in generating multi-view summaries as per user length request without requiring additional computational cost. Performance comparisons on six standard multi-view datasets show marked improvement over some mono-view summarization approaches as well as state-of-the-art multi-view summarization methods. Moving forward, we would like to improve our method by explicitly incorporating video semantics, which may require a more complex model with additional techniques such as attention modeling [37] or semantic feature analysis based on user preferences [34]. It is also important to note that, unlike single-view videos, understanding semantics in a multi-camera environment is a challenging problem and hence, one may also require data association strategies, e.g., [5], to properly exploit multi-view video semantics for the task of video summarization.
Moreover, in future, we would like to consider the fact that more than one camera view may be necessary to fully represent an event (e.g., due to occlusion) in a multi-view setting and hence, it may be necessary to include multiple similar shots representing the same event from more than one camera for generating a good quality video summary.

ACKNOWLEDGMENT

This work was partially supported by NSF grant. We gratefully acknowledge the support of NVIDIA with the donation of the Tesla K40 GPU used for this research.

REFERENCES

[1] J. Almeida, N. J. Leite, and R. da S. Torres. VISON: VIdeo Summarization for ONline applications. PRL.
[2] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS.
[3] D. Cai, X. He, and J. Han. Spectral regression for efficient regularized subspace learning. In ICCV.
[4] J. C. Caicedo and S. Lazebnik. Active object localization with deep reinforcement learning. In ICCV.
[5] A. Chakraborty, A. Das, and A. K. Roy-Chowdhury. Network consistent data association. TPAMI.
[6] C. Chennubhotla and A. Jepson. Sparse coding in practice. In Proc. of the Second Int. Workshop on Statistical and Computational Theories of Vision.
[7] W.-S. Chu, Y. Song, and A. Jaimes. Video co-summarization: Video summarization by visual co-occurrence. In CVPR.
[8] Y. Cong, J. Yuan, and J. Luo. Towards Scalable Summarization of Consumer Videos Via Sparse Dictionary Selection. TMM.
[9] S. E. F. de Avila, A. P. B. Lopes, A. da Luz Jr., and A. de A. Araújo. VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method. PRL.
[10] F. Dornaika and I. K. Aldine. Decremental sparse modeling representative selection for prototype selection. PR.
[11] E. Elhamifar, G. Sapiro, and R. Vidal. See all by looking at a few: Sparse modeling for finding representative objects. In CVPR.
[12] E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. TPAMI.
[13] S. Feng, Z. Lei, and S. Li. Online content-aware video condensation. In CVPR.
[14] Y. Fu, Y. Guo, Y. Zhu, F. Liu, C. Song, and Z.-H. Zhou. Multi View Video Summarization. TMM.
[15] S. Gao, I. W.-H. Tsang, L.-T. Chia, and P. Zhao. Local features are not lonely - Laplacian sparse coding for image classification. In CVPR.
[16] D. Geman and G. Reynolds. Constrained restoration and the recovery of discontinuities. TPAMI.
[17] B. Gong, W. Chao, K. Grauman, and F. Sha. Diverse sequential subset selection for supervised video summarization. In NIPS.
[18] G. Guan, Z. Wang, S. Mei, M. Ott, M. He, and D. D. Feng. A Top-Down Approach for Video Summarization. TOMCCAP.
[19] M. Gygli, H. Grabner, and L. V. Gool. Video summarization by learning submodular mixtures of objectives. In CVPR.
[20] M. Gygli, H. Grabner, H. Riemenschneider, and L. V. Gool. Creating summaries from user videos. In ECCV.
[21] R. He, T. Tan, L. Wang, and W.-S. Zheng. l21 regularized correntropy for robust feature selection. In CVPR.
[22] R. He, W.-S. Zheng, T. Tan, and Z. Sun. Half-quadratic-based iterative minimization for robust sparse representation. TPAMI.
[23] L. Herranz and J. M. Martinez. A framework for scalable summarization of video. TCSVT.
[24] T. Huang. Surveillance video: the biggest big data. Computing Now, 2014.


[25] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR.
[26] A. Khosla, R. Hamid, C. J. Lin, and N. Sundaresan. Large-scale video summarization using web-image priors. In CVPR.
[27] G. Kim, L. Sigal, and E. P. Xing. Joint summarization of large-scale collections of web images and videos for storyline reconstruction. In CVPR, 2014.
[28] S. Kuanar, K. Ranga, and A. Chowdhury. Multi-view video summarization using bipartite matching constrained optimum-path forest clustering. TMM.
[29] S. K. Kuanar, R. Panda, and A. Chowdhury. Video Key frame Extraction through Dynamic Delaunay Clustering with a Structural Constraint. JVCIR.
[30] Y. Lee, J. Ghosh, and K. Grauman. Discovering important people and objects for egocentric video summarization. In CVPR.
[31] C. D. Leo and B. S. Manjunath. Multicamera video summarization from optimal reconstruction. In ACCV Workshop.
[32] C. D. Leo and B. S. Manjunath. Multicamera Video Summarization and Anomaly Detection from Activity Motifs. TOSN.
[33] P. Li, Y. Guo, and H. Sun. Multi key-frame abstraction from videos. In ICIP.
[34] W.-N. Lie and K.-C. Hsu. Video summarization based on semantic feature analysis and user preference. In SUTC.
[35] C. Lu, J. Tang, M. Lin, L. Lin, S. Yan, and Z. Lin. Correntropy induced l2 graph for robust subspace clustering. In ICCV.
[36] X. Lu, Y. Yuan, and P. Yan. Image super-resolution via double sparsity regularized manifold learning. TCSVT.
[37] Y. F. Ma, X. S. Hua, and H. J. Zhang. A Generic Framework of User Attention Model and Its Application in Video Summarization. TMM.
[38] S. Mei, G. Guan, Z. Wang, S. Wan, M. He, and D. D. Feng. Video summarization via minimum sparse reconstruction. PR.
[39] J. Meng, H. Wang, J. Yuan, and Y.-P. Tan. From keyframes to key objects: Video summarization by representative object proposal selection. In CVPR.
[40] A. G. Money and H. Agius. Video summarisation: A conceptual framework and survey of the state of the art. JVCIR.
[41] H. Nguyen, V. Patel, N. Nasrabadi, and R. Chellappa. Sparse embedding: A framework for sparsity promoting dimensionality reduction. ECCV.
[42] F. Nie, H. Wang, H. Huang, and C. Ding. Unsupervised and semi-supervised learning via l1-norm graph. In ICCV.
[43] S.-H. Ou, C.-H. Lee, V. Somayazulu, Y.-K. Chen, and S.-Y. Chien. On-Line Multi-View Video Summarization for Wireless Video Sensor Network. JSTSP.
[44] R. Panda, A. Das, and A. K. Roy-Chowdhury. Embedded sparse coding for summarizing multi-view videos. In ICIP.
[45] R. Panda, A. Das, and A. K. Roy-Chowdhury. Video summarization in a multi-view camera network. In ICPR.
[46] R. Panda, S. K. Kuanar, and A. S. Chowdhury. Scalable video summarization using skeleton graph and random walk. In ICPR.
[47] R. Panda and A. K. Roy-Chowdhury. Collaborative summarization of topic-related videos. In CVPR.
[48] R. Panda and A. K. Roy-Chowdhury. Sparse modeling for topic-oriented video summarization. In ICASSP.
[49] D. Parikh and K. Grauman. Relative attributes. In ICCV.
[50] Y. Peng and B.-L. Lu. Robust structured sparse representation via half-quadratic optimization for face recognition. MTA.
[51] Y. Peng and C.-W. Ngo. Clip-based similarity measure for query-dependent clip retrieval and video summarization. TCSVT.
[52] Y. Poleg, C. Arora, and S. Peleg. Temporal segmentation of egocentric videos. In CVPR.
[53] D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid. Category-specific video summarization. In ECCV.
[54] Y. Pritch, A. Rav-Acha, A. Gutman, and S. Peleg. Webcam synopsis: Peeking around the world. In ICCV.
[55] A. K. Roy-Chowdhury and B. Song. Camera networks: The acquisition and analysis of videos over wide areas. Synthesis Lectures on Computer Vision.
[56] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV.
[57] M. M. Salehin and M. Paul. Fusion of Foreground Object, Spatial and Frequency Domain Motion Information for Video Summarization. Springer International Publishing.
[58] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS.
[59] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint.
[60] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes. Tvsum: Summarizing web videos using titles. In CVPR.
[61] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. C3d: generic features for video analysis. CoRR.
[62] B. Truong and S. Venkatesh. Video abstraction: A systematic review and classification. TOMCCAP.
[63] K. Wang, R. He, W. Wang, L. Wang, and T. Tan. Learning coupled feature spaces for cross-modal matching. In ICCV.
[64] M. Wang, R. Hong, G. Li, Z.-J. Zha, S. Yan, and T.-S. Chua. Event driven web video summarization by tag localization and key-shot identification. TMM.
[65] M. Wang, G. Li, Z. Lu, Y. Gao, and T.-S. Chua. When amazon meets google: Product visualization by exploring multiple web sources. TOIT.
[66] Y. Wang, C. Pan, S. Xiang, and F. Zhu. Robust hyperspectral unmixing with correntropy-based metric. TIP.
[67] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Describing videos by exploiting temporal structure. In ICCV.
[68] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman. Video summarization with long short-term memory. In ECCV.
[69] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman. Summary transfer: Exemplar-based subset selection for video summarization. CVPR.
[70] B. Zhao and E. P. Xing. Quasi real-time summarization for consumer videos. In CVPR.
[71] X. Zhu, C. C. Loy, and S. Gong. Learning from multiple sources for video summarisation. IJCV.

Rameswar Panda received his Bachelors and Masters degrees in Electronics and Telecommunication engineering from Biju Patnaik University of Technology, India and Jadavpur University, India in 2011 and 2013 respectively. He is currently pursuing the Ph.D. degree in the department of Electrical and Computer Engineering at University of California, Riverside. His main research interests include computer vision, machine learning, video summarization, person re-identification and video surveillance.

Amit K. Roy-Chowdhury received the Bachelors degree in Electrical Engineering from Jadavpur University, Calcutta, India, the Masters degree in systems science and automation from the Indian Institute of Science, Bangalore, India, and the Ph.D. degree in Electrical Engineering from the University of Maryland, College Park. He is a Professor of Electrical Engineering at U. of California, Riverside. His research interests include image processing and analysis, computer vision, and video communications and statistical methods for signal analysis. His current research projects include intelligent camera networks, wide-area scene analysis, motion analysis in video, activity recognition and search, video-based biometrics (face and gait), biological video analysis, and distributed video compression. He is coauthor of The Acquisition and Analysis of Videos over Wide Areas. He is the editor of the book Distributed Video Sensor Networks. He has been on the organizing and program committees of multiple conferences and serves on the editorial boards of a number of journals. He is a Fellow of IAPR.
