Unsupervised Content Discovery in Composite Audio Rui Cai Department of Computer Science and Technology, Tsinghua Univ. Beijing, , China


Unsupervised Content Discovery in Composite Audio

Rui Cai
Department of Computer Science and Technology, Tsinghua Univ., Beijing, China
cairui01@mails.tsinghua.edu.cn

Lie Lu
Microsoft Research Asia, No. 49 Zhichun Road, Beijing, China
llu@microsoft.com

Alan Hanjalic
Department of Mediamatics, Delft University of Technology, 2628 CD Delft, The Netherlands
A.Hanjalic@ewi.tudelft.nl

ABSTRACT
Automatically extracting semantic content from audio streams can be helpful in many multimedia applications. Motivated by the known limitations of traditional supervised approaches to content extraction, which are hard to generalize and require suitable training data, we propose in this paper an unsupervised approach to discover and categorize semantic content in a composite audio stream. In our approach, we first employ spectral clustering to discover natural semantic sound clusters in the analyzed data stream (e.g. speech, music, noise, applause, speech mixed with music, etc.). These clusters are referred to as audio elements. Based on the obtained set of audio elements, the key audio elements, which are most prominent in characterizing the content of input audio data, are selected and used to detect potential boundaries of semantic audio segments denoted as auditory scenes. Finally, the auditory scenes are categorized in terms of the audio elements appearing therein. Categorization is inferred from the relations between audio elements and auditory scenes by using the information-theoretic co-clustering scheme. Evaluations of the proposed approach performed on 4 hours of diverse audio data indicate that promising results can be achieved, both regarding audio element discovery and auditory scene categorization.

Categories and Subject Descriptors
H.5.5 [Information Interfaces and Presentation]: Sound and Music Computing - signal analysis, synthesis and processing; Systems; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing - Indexing methods; I.5.3 [Pattern Recognition]: Clustering - Algorithms; Similarity measures.

General Terms
Algorithms, Design, Experimentation, Management, Theory.
Keywords
Content-based audio analysis, unsupervised approach, key audio element, auditory scene, spectral clustering, information-theoretic co-clustering.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MM'05, November 6-11, 2005, Singapore. Copyright 2005 ACM /05/0011 $

1. INTRODUCTION
Automated semantic content extraction from digital audio streams can be beneficial to many multimedia applications, such as context-aware computing [10][1] and video content parsing, including highlight extraction [][4][5] and video abstraction and summarization [1][16]. To detect and categorize semantic content in audio signals, considerable research effort has been invested in developing the theories and methods for content-based audio analysis to bridge the "semantic gap" separating the low-level audio features and the high-level audio content semantics.

A typical approach to content-based audio analysis can be represented by the general flowchart shown in Fig. 1 [14]. There, the input audio stream is first segmented into different audio elements such as speech, music, various audio effects and any combination of these. Then, the key audio elements are selected, being the audio elements that are most characteristic of the semantics of the analyzed audio data stream [5][5]. In the next step, the auditory scenes [3], which are the temporal segments with coherent semantic content, are detected and classified based on the (key) audio elements they contain. For example, in [][5], elements such as applause, cheer, ball-hit, and whistling are used to detect the highlights in sports videos; and in film indexing [5][6][17], humor and violence scenes are categorized by detecting key audio elements like laughter, gun-shot, and explosion.
[Figure 1: flowchart with the stages low-level feature, audio classification, key audio element spotting, auditory scene segmentation, semantic categorization, and semantic description.]
Fig. 1. A unified approach to content-based audio analysis.

Previous attempts at realizing the scheme in Fig. 1, either as a whole or in parts, usually adopted supervised data analysis and classification methods. For instance, hidden Markov models (HMMs) [][6] and support vector machines (SVMs) [5] are often used to model and identify audio elements in audio signals. As for auditory scene categorization, heuristic rules such as "if double whistling, then Foul or Offside" are widely employed to infer the events in soccer games [5], while in [6] and [17] Gaussian mixture models (GMMs) and SVMs are used to statistically learn the relationships between the key audio elements and the higher-level semantics of auditory scenes.

Although the supervised approaches have proved to be effective in many applications, they show some critical limitations. First, the effectiveness of the supervised approaches relies heavily on the quality of the training data. If the training data is insufficient or badly distributed, the system performance drops significantly. Second, in most real-life applications, it is difficult to list all audio elements and semantic categories that may be found in the data. For example, in applications like pervasive computing [10] and surveillance [], both the audio elements and the semantic scenes are unknown in advance. Thus it is impossible to collect training data and learn proper statistical models in these cases.

Footnote: This work was performed at Microsoft Research Asia.

In view of the described disadvantages of the supervised methods, a number of recent works introduced unsupervised approaches into multimedia content analysis. For example, an approach based on time series clustering is presented in [] to discover "unusual" events in audio streams. In [10], unsupervised analysis of a personal audio archive is performed to create an "automatic diary". For the purpose of video summarization and abstraction, unsupervised approaches have also shown promising results. For instance, affective video content characterization and highlights extraction can be performed using the theory and methods proposed in [1]. Also in many other existing approaches (e.g. [16][19][4]), techniques like clustering and grouping are utilized for semantic analysis, instead of supervised classification and identification. However, these existing methods are not meant to provide generic content analysis solutions, as they are either designed for specific applications [1][16][19][4], or only address some isolated parts of the scheme in Fig. 1 [10][].

[Figure 2 panels. (a), part I: audio streams, feature extraction, iterative spectral clustering scheme (with context-based scaling factors), audio elements, importance measures, key audio elements; part II: auditory scene detection, information-theoretic co-clustering based auditory scene categorization (with BIC-based estimation of cluster number), auditory scene groups. (b): documents / web pages, word parsing, words, index terms selection, keywords, document categorization, documents with similar topics.]
Fig. 2. (a) The flowchart of the proposed approach to unsupervised audio content analysis, which consists of two major parts: (I) audio element discovery and key element spotting; and (II) auditory scene categorization.
(b) A comparable process of topic-based document categorization.

Working towards a more generic and robust realization of the system in Fig. 1, we propose in this paper a novel unsupervised approach to audio content analysis, which is capable of dealing with arbitrary composite audio data streams. The detailed flowchart of the proposed approach is given in Fig. 2 (a). It consists of two major steps: I) audio element discovery and key audio element spotting, and II) auditory scene categorization. Both steps are unsupervised and domain- and application-independent. The approach also facilitates audio content discovery at different semantic levels, such as mid-level audio elements and high-level auditory scenes. Our proposed approach can also be seen as an analogy to topic-based text document categorization [1], as shown in Fig. 2 (b). Here, audio elements are similar to words, while key audio elements correspond to keywords.

In the proposed scheme, the input is an arbitrary composite audio stream. After feature extraction, an iterative spectral clustering method is proposed to decompose the audio stream into audio elements. Spectral clustering [18] has proved to be successful in many complicated clustering problems, and is also very suitable in our case. To improve the clustering performance in view of the inhomogeneous distribution densities of various sounds in the feature space, we adjust the standard spectral clustering scheme [18] by using context-dependent scaling factors. Using this clustering method, the segments with similar low-level features in the audio stream are grouped into natural semantic clusters that we adopt as audio elements. Then, a number of importance measures are defined and employed to filter the obtained set of audio elements and select the key audio elements. In the auditory scene categorization step, the potential auditory scenes are first detected by investigating the co-occurrences among various key audio elements in the input audio stream.
Then, these auditory scenes are grouped into semantic categories by using the information-theoretic co-clustering algorithm [7], which exploits the relationships among various audio elements and auditory scenes. Moreover, we propose a strategy based on the Bayesian Information Criterion (BIC) for selecting the optimal cluster numbers for co-clustering.

The rest of this paper is organized as follows. Section 2 presents the algorithms for audio element detection, including feature extraction, audio stream decomposition, and key audio element selection. In Section 3, the procedure for auditory scene detection and categorization is described. Experiments and discussions can be found in Section 4, and Section 5 concludes the paper.

2. AUDIO ELEMENT DISCOVERY

2.1 Feature Extraction
We first divide the audio data stream into frames of 5ms with 50% overlap. Then, we compute a number of audio features to characterize each audio frame. Many audio features have been proposed in previous works on content-based audio analysis [][6][15], and have proved to be effective in characterizing various audio elements. Inspired by these works, we extract both temporal and spectral features for each audio frame. The set of temporal features consists of short-time energy (STE) and zero-crossing rate (ZCR), while the spectral features include sub-band energy ratios (BER), brightness, bandwidth, and 8-order Mel-frequency cepstral coefficients (MFCCs). Moreover, to provide a more complete description of audio elements and to be able to discern a greater diversity of audio elements, two new spectral features proposed in our previous works [3][5], the Sub-band Spectral Flux and the Harmonicity Prominence, are also extracted for each audio frame. In our experiments, the spectral domain is equally divided into 8 sub-bands in Mel-scale and then the sub-band features are extracted. All the above features are collected into a 29-dimensional feature vector per audio frame.
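The framing step and the two temporal features above can be illustrated with a short NumPy sketch. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, the frame length is left as a caller-supplied parameter, and the STE and ZCR definitions used here (mean squared amplitude, fraction of sign changes) are common textbook forms assumed for the example. The spectral features are not covered.

```python
import numpy as np

def frame_signal(x, sr, frame_ms, overlap=0.5):
    """Split a 1-D signal into overlapping frames, one frame per row.

    frame_ms and overlap are caller-supplied; the 50% default follows the text.
    """
    frame_len = int(round(sr * frame_ms / 1000.0))
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    if frame_len < 1 or len(x) < frame_len:
        return np.empty((0, max(frame_len, 0)))
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return np.asarray(x, dtype=float)[idx]

def short_time_energy(frames):
    """STE, assumed here to be the mean squared amplitude of each frame."""
    return np.mean(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """ZCR, assumed here to be the fraction of adjacent samples whose sign differs."""
    signs = np.sign(frames)
    signs[signs == 0] = 1  # treat exact zeros as positive
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)
```

For example, `frame_signal(x, sr=16000, frame_ms=20)` followed by `short_time_energy(...)` and `zero_crossing_rate(...)` yields one value per frame for each feature; the sampling rate and frame length in this call are arbitrary illustration values.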
In order to reduce the computational complexity of the proposed approach, we group audio frames into longer temporal audio segments of length $t$, and use these longer segments as the basis for the subsequent audio processing steps. For this purpose, a sliding window of $t$ seconds with $\Delta t$ seconds overlap is used to segment the frame sequence. In order to balance the detection resolution and the computational complexity, we choose $t$ as 1.0 second and $\Delta t$ as 0.5 seconds. At each window position, the mean and standard deviation of the frame-based features are computed and used to represent the corresponding audio segment.

2.2 Audio Stream Decomposition
The decomposition of audio streams is carried out by grouping audio segments into the clusters corresponding to audio elements. Audio elements to be found in complex composite audio streams, such as sound tracks of movies, usually have complicated and irregular distributions in the feature space. However, traditional clustering algorithms such as K-means are based on the assumption that the cluster distributions in the feature space are Gaussians [9], and this assumption is usually not satisfied in complex cases. As a promising alternative, spectral clustering [18] recently emerged and showed its effectiveness in a variety of complex applications, such as image segmentation [6][7] and multimedia signal clustering [10][19][]. We therefore choose to employ spectral clustering to decompose audio streams into audio elements. To further improve the robustness of the clustering process, we adopt the self-tuning strategy [7] to set context-based scaling factors for different data densities, and build an iterative scheme to perform a hierarchical clustering of input data.

2.2.1 Spectral Clustering Algorithm
For a given audio stream, the set $U = \{u_1, \dots, u_n\}$ of feature vectors is obtained through the feature extraction described in Section 2.1. Each element $u_i$ of the set $U$ represents the feature vector of one audio segment. After specifying the search range $[k_{min}, k_{max}]$ for the most likely number of audio elements existing in the stream, the standard spectral clustering algorithm is carried out, which consists of the following steps [18]:

Algorithm I: Spectral_Clustering ($U$, $k_{min}$, $k_{max}$)
1. Form an affinity matrix $A$ defined by $A_{ij} = \exp\left(-d(u_i, u_j)^2 / (2\sigma^2)\right)$ if $i \neq j$, and $A_{ii} = 0$. Here, $d(u_i, u_j) = \|u_i - u_j\|$ is the Euclidean distance between the feature vectors $u_i$ and $u_j$, and $\sigma$ is the scaling factor. The selection of $\sigma$ will be discussed later in this section.
2. Define $D$ to be a diagonal matrix whose $(i, i)$ element is the sum of $A$'s $i$-th row, and construct the normalized affinity matrix $L = D^{-1/2} A D^{-1/2}$.
3. Suppose $(x_1, \dots, x_{k_{max}+1})$ are the eigenvectors of $L$ associated with its $k_{max}+1$ largest eigenvalues, and $(\lambda_1, \dots, \lambda_{k_{max}+1})$ are the corresponding eigenvalues. The optimal cluster number $k$ is estimated based on the eigen-gaps between adjacent eigenvalues [18], as:
$$k = \arg\max_{i \in [k_{min}, k_{max}]} \left(1 - \lambda_{i+1} / \lambda_i\right) \tag{1}$$
Then, form the matrix $X = [x_1\, x_2 \dots x_k] \in R^{n \times k}$ by stacking the first $k$ eigenvectors in columns.
4. Form the matrix $Y$ by renormalizing each of $X$'s rows to have unit length, that is:
$$Y_{ij} = X_{ij} \Big/ \Big(\sum_j X_{ij}^2\Big)^{1/2} \tag{2}$$
5. Treat each row of $Y$ as a point in $R^k$, and cluster them into $k$ clusters via the cosine-distance based K-means. The initial centers in the K-means are selected to be as orthogonal to each other as possible [6].
6. Assign the original data point $u_i$ to cluster $c_j$ if and only if the $i$-th row of the matrix $Y$ is assigned to $c_j$.

The clustering is followed by smoothing of possible discontinuities between audio segments assigned to the same cluster. For example, if the consecutive audio segments are assigned to clusters A and B as "A-A-B-A-A", they will be smoothed to "A-A-A-A-A". Each obtained series of audio segments that belong to the same cluster is considered as one instance (occurrence) of the corresponding audio element in the input audio data stream.

2.2.2 Context-based Scaling Factors
In the spectral clustering algorithm, the scaling factor $\sigma$ affects how rapidly the similarity measure $A_{ij}$ decreases when the Euclidean distance $d(u_i, u_j)$ increases. In this way, it actually controls the value of $A_{ij}$ at which two audio segments are considered similar. In the standard spectral clustering algorithm, $\sigma$ is set uniformly for all data points (for example, to the average Euclidean distance in the data), based on the assumption that each cluster in the input data has a similar distribution density in the feature space. However, this assumption is usually not satisfied in composite audio data, which often contain clusters with different densities. Fig. 3 (a) illustrates an example affinity matrix of a 30-second audio stream composed of music (0-10s), music with dense applause (10-20s), and speech (20-30s), computed with a uniform scaling factor. From Fig. 3 (a), it can be seen that speech is distributed more sparsely than the other elements, and that music and music with dense applause are close to each other and hard to distinguish. Thus the standard spectral clustering cannot properly estimate the number of clusters based on the eigenvalues and eigen-gaps shown at the bottom of Fig. 3 (a).

[Figure 3: two affinity-matrix images, (a) and (b), each with a plot of eigenvalues and eigen-gaps beneath it.]
Fig. 3. The similarity matrices, with top 10 eigenvalues and the eigen-gaps, of a 30-second audio stream, which consists of music (0-10s), music with dense applause (10-20s), and speech (20-30s): (a) using a uniform scaling factor, (b) using the context-based scaling factors.

To obtain a more reliable similarity measure and improve the clustering robustness, the self-tuning strategy [7] is employed to select context-based scaling factors in our approach. That is, for each data point $u_i$, the scaling factor is set based on its context data density, as:
$$\sigma_i = \sum_{u_j \in close(u_i)} d(u_i, u_j) / n_b \tag{3}$$
where $close(u_i)$ denotes the set containing the $n_b$ nearest neighbors of $u_i$, and $n_b$ is experimentally set to 5 in our approach. Then the affinity matrix is re-defined as:
$$A_{ij} = \exp\left(-d(u_i, u_j)^2 / (2 \sigma_i \sigma_j)\right) \tag{4}$$
Fig. 3 (b) shows the corresponding affinity matrix computed with the context-based scaling factors. The three blocks on the diagonal are more distinct than those in Fig. 3 (a). In Fig. 3 (b), the speech segment appears more concentrated in the similarity matrix, while music and music with dense applause are further apart. According to equation (1), the prominent eigen-gap between the 3rd and 4th eigenvalues now appropriately predicts the correct number of clusters.

2.2.3 An Iterative Clustering Scheme
In order to prevent audio segments of different types from being merged into the same cluster, we also propose an iterative clustering scheme to verify whether a cluster can be divided any further. That is, at each iteration, each cluster obtained from the previous iteration is further clustered using the spectral clustering scheme. A cluster is considered to be inseparable if the spectral clustering returns only one cluster. Although some clusters may be inseparable in the similarity matrix of the previous iteration (a global scale), they may become separable in the new similarity matrix during the next iteration (a local scale). The iterative scheme is described by the following pseudo code:

    Iterative_Clustering(U, k_min, k_max)
    {
        [k, {c_1, ..., c_k}] = Spectral_Clustering(U, k_min, k_max);
        if (k == 1) return;
        for (j = 1; j <= k; j++)
            Iterative_Clustering(c_j, 1, k_max);
    }

2.3 Key Audio Element Spotting
After discovering the audio elements in the input audio stream, we wish to spot those audio elements that are most characteristic of the semantic content conveyed by the stream. For example, while an award ceremony typically contains elements such as speech, music, applause and their different combinations, the audio elements of applause can be considered a good indicator of the highlights of the ceremony.

To spot the key audio elements, we draw an analogy to keyword extraction in text document analysis, that is, some similar criteria are proposed to compute the "importance" of each audio element. As a first importance indicator, we consider the occurrence frequency of an audio element, which is a direct analogy to the term frequency in text [1]. However, the difference to text analysis is that keywords in text usually have higher occurrence frequency than other words, while key audio elements may not. This can be drawn from the following analysis.
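The clustering procedure of Sections 2.2.1 and 2.2.2 (context-based affinity as in (3) and (4), eigen-gap estimate of $k$ as in (1), row normalization as in (2), and cosine K-means) can be sketched in NumPy as follows. This is a hedged sketch, not the authors' code: all function names are hypothetical, the greedy "most orthogonal row" initialization is only an assumed stand-in for the initialization cited from [6], the eigen-gap rule assumes the leading eigenvalues are positive, and the smoothing step and the iterative scheme of Section 2.2.3 are omitted.

```python
import numpy as np

def self_tuning_affinity(U, n_b=5):
    """Affinity matrix with context-based scaling factors, eqs. (3) and (4)."""
    d = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=2)
    # sigma_i: mean distance to the n_b nearest neighbours (column 0 is the point itself)
    sigma = np.sort(d, axis=1)[:, 1:n_b + 1].mean(axis=1)
    sigma = np.maximum(sigma, 1e-12)  # guard against duplicate points
    A = np.exp(-d ** 2 / (2.0 * np.outer(sigma, sigma)))
    np.fill_diagonal(A, 0.0)
    return A

def spectral_embedding(A, k_min, k_max):
    """Steps 2-4 of Algorithm I: normalize, pick k by eigen-gap, embed rows."""
    deg = np.maximum(A.sum(axis=1), 1e-12)
    L = A / np.sqrt(np.outer(deg, deg))
    vals, vecs = np.linalg.eigh(L)            # ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]    # largest first
    k_max = min(k_max, len(vals) - 1)
    den = vals[k_min - 1:k_max]
    den = np.where(den == 0.0, 1e-12, den)    # assumes positive leading eigenvalues
    gaps = 1.0 - vals[k_min:k_max + 1] / den  # 1 - lambda_{i+1} / lambda_i
    k = k_min + int(np.argmax(gaps))
    X = vecs[:, :k]
    Y = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    return k, Y

def cosine_kmeans(Y, k, n_iter=50):
    """Cosine-similarity K-means with a greedy near-orthogonal initialization."""
    centers = [Y[0]]
    for _ in range(1, k):
        sim = np.abs(Y @ np.array(centers).T).max(axis=1)
        centers.append(Y[int(np.argmin(sim))])
    C = np.array(centers)
    labels = np.zeros(len(Y), dtype=int)
    for it in range(n_iter):
        new_labels = np.argmax(Y @ C.T, axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            members = Y[labels == j]
            if len(members):
                m = members.mean(axis=0)
                C[j] = m / max(np.linalg.norm(m), 1e-12)
    return labels

def spectral_clustering(U, k_min, k_max, n_b=5):
    """Return the estimated cluster number and one label per row of U."""
    k, Y = spectral_embedding(self_tuning_affinity(U, n_b), k_min, k_max)
    return k, cosine_kmeans(Y, k)
```

The pairwise distance matrix is built densely, which is adequate for a few thousand one-second segments but would need a sparse neighbor search for much longer streams.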
For example, the major part of the sound track of a typical action movie segment consists of "usual" audio elements, such as speech, music, speech mixed with music and some "standard" background noise (car-engine, opening/closing a door, etc.), while the remaining smaller part includes audio elements that are typical for action, like gun-shots or explosions. As the usual audio elements can be found in any other (e.g. romantic) movie segment as well, it is clear that only this small set of temporal segments containing specific audio elements is the most important to characterize the content of a particular movie segment.

We apply similar reasoning to extend our "importance" measure by other relevant indicators. The total durations and the average lengths of each occurrence of an audio element are, typically, very different for various sounds in a stream. Background sounds are usually majorities in streams while key audio elements are minorities. For instance, in a situation comedy, both the total duration and the average length of the speech are considerably longer than those of the laughter. Further, different sounds usually have different variations of their occurrence lengths. Key elements usually have relatively consistent length in each occurrence, as opposed to the strongly varying lengths of the segments of background sounds. For example, the sound of applause in a tennis game usually has similar length in each of its occurrences. The same holds for the large majority of gunshots and explosions in action movies.

Based on the above observations, we propose four heuristic importance indicators for spotting key audio elements. One of these indicators is related to the occurrence frequency, and the other three are designed to capture the observations made regarding element duration.
For a given audio element $c_i$ in the stream $S$, these indicators are defined as follows:

Element Frequency is used to take into account the occurrence frequency of $c_i$ in $S$:
$$efrq(c_i, S) = \exp\left(-(n_i - \alpha \cdot n_{avg})^2 / (2 n_{std}^2)\right) \tag{5.1}$$
Here, $n_i$ is the occurrence number of the audio element $c_i$, and $n_{avg}$ and $n_{std}$ are the corresponding mean and standard deviation of the occurrence numbers of all the audio elements. The factor $\alpha$ adjusts the expectation of how often the key elements are likely to occur. With this indicator, the audio elements that appear far more or far less frequently than the expectation $\alpha \cdot n_{avg}$ are penalized.

Element Duration takes into account the total duration of $c_i$ in the stream:
$$edur(c_i, S) = \exp\left(-(d_i - \beta \cdot d_{avg})^2 / (2 d_{std}^2)\right) \tag{5.2}$$
Here, $d_i$ is the total duration of $c_i$, and $d_{avg}$ and $d_{std}$ are the corresponding mean and standard deviation. The factor $\beta$ adjusts the expectation of key audio element duration, and has a similar effect as $\alpha$.

Average Element Length takes into account the average segment length of $c_i$ over all its occurrences, as:
$$elen(c_i, S) = \exp\left(-(l_i - \gamma \cdot l_{avg})^2 / (2 l_{std}^2)\right) \tag{5.3}$$
Here, $l_i$ is the average segment length of $c_i$, and $l_{avg}$ and $l_{std}$ are the corresponding mean and standard deviation over all the elements. The factor $\gamma$ is similar to $\alpha$ and $\beta$ and adjusts the expectation of the average segment length of key audio elements.

Element Length Variation evaluates the constancy of the segment lengths of the audio element $c_i$ in $S$:
$$evar(c_i, S) = \exp\left(-v_i^2 / (2 \delta^2 l_i^2)\right) \tag{5.4}$$
Here, $v_i$ is the standard deviation of the segment lengths of $c_i$, and $\delta$ adjusts the tolerance of $v_i$ relative to $l_i$.

The heuristic importance indicators defined above can be tuned adaptively for different applications, based on the available domain knowledge. For example, to detect unusual sounds in surveillance videos, the factors $\alpha$, $\beta$, and $\gamma$ could be set relatively small, since such sounds are not expected to occur frequently and are of a relatively short duration.
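The four indicators (5.1)-(5.4) can be computed from per-element occurrence lengths, as in the following sketch. It is only an illustration under stated assumptions: the input format (a dictionary mapping each audio element to the list of its segment lengths in seconds), the function names, and the default value 1.0 for $\alpha$, $\beta$, $\gamma$ and $\delta$ are hypothetical choices made for the example, and each element is assumed to occur at least once.

```python
import numpy as np

def _gauss(x, center, spread):
    """exp(-(x - center)^2 / (2 * spread^2)), with a guard for zero spread."""
    spread = max(float(spread), 1e-12)
    return float(np.exp(-((x - center) ** 2) / (2.0 * spread ** 2)))

def importance_indicators(occurrences, alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """Return efrq, edur, elen and evar for every audio element.

    occurrences: dict mapping element name -> list of segment lengths (seconds).
    The parameter defaults are placeholders, not values from the paper.
    """
    names = list(occurrences)
    n = np.array([len(occurrences[c]) for c in names], dtype=float)     # n_i
    d = np.array([np.sum(occurrences[c]) for c in names], dtype=float)  # d_i
    l = d / n                                                           # l_i
    v = np.array([np.std(occurrences[c]) for c in names], dtype=float)  # v_i
    result = {}
    for i, c in enumerate(names):
        result[c] = {
            "efrq": _gauss(n[i], alpha * n.mean(), n.std()),
            "edur": _gauss(d[i], beta * d.mean(), d.std()),
            "elen": _gauss(l[i], gamma * l.mean(), l.std()),
            "evar": _gauss(v[i], 0.0, delta * l[i]),
        }
    return result
```

An element whose occurrences all have the same length gets `evar` equal to 1, and the value decays as the length variation grows relative to `delta * l_i`.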
On the other hand, to detect the more repetitive key sounds like laughter in situation comedies, we may decrease $\delta$ correspondingly, as the segments of such sounds typically have a more-or-less constant length.

Assuming the above four indicators are independent of each other, the following "importance" score is used in our system to measure the importance of each audio element:
$$score(c_i, S) = efrq(c_i, S) \cdot edur(c_i, S) \cdot elen(c_i, S) \cdot evar(c_i, S) \tag{6}$$
The key elements are finally selected as the first $K$ audio elements with the highest importance scores, while the remaining audio elements are further referred to as background elements. The number of key audio elements, $K$, is chosen as:
$$K = \arg\max_k \Big\{ \sum_{i=1}^{k} d'_i \le \eta \cdot L_S \Big\} \tag{7}$$
where $d'_i$ denotes the duration of the $i$-th audio element on the list of audio elements ranked in descending order of the score (6), $L_S$ is the total duration of $S$, and $\eta$ is a tuning parameter that can be set depending on the target application. For instance, $\eta$ can be set to a relatively small value when detecting unusual sounds in a surveillance video, as the total duration of unusual sounds compared to $L_S$ is expected to be small. In our experiments, $\eta$ is set to 0.25, as we assume that the key audio elements will not cover more than 25% of the whole input audio signal.

3. AUDITORY SCENE CATEGORIZATION
Having localized the audio segments containing the key or background audio elements, we now use this information to detect and classify higher-level auditory scenes based on the type of audio elements they contain. For both of these tasks, we exploit the co-occurrence phenomena among audio elements, and in particular, among the key audio elements. In general, some (key) audio elements will rarely occur together in the same semantic context. This is particularly useful in detecting possible breaks (boundaries) in the semantic content coherence between consecutive auditory scenes. On the other hand, auditory scenes with similar semantics usually contain similar sets of typical key audio elements. For example, many action scenes will contain gunshots and explosions, while a scene in a situation comedy is typically characterized by a combination of applause, laughter, speech and light music. In this sense, the relation between the co-occurrence of audio elements and the semantic similarity of auditory scenes can be exploited for scene categorization.

3.1 Key Element-based Scene Detection
Fig. 4 illustrates an example audio element sequence, where the shaded segments are the key element segments and the white segments are the background segments.
To locate the boundaries between auditory scenes in such a continuous audio stream, an intuitive idea is to measure the semantic affinity between two adjacent key-element segments in the audio stream. If this affinity is low, then there is a potential auditory scene boundary between these segments. However, it is usually difficult to measure the semantic affinities among key elements from the low-level features. In our approach, the affinity measure is based on the following assumptions: i) there is a high affinity between two segments if the corresponding key audio elements usually occur together, and ii) the larger the time interval between two adjacent key element segments, the lower their affinity. Based on the above assumptions, we define the affinity between the $l$-th and $(l+1)$-th key element segments as:
$$a_{l,l+1} = \exp\left(-E_{k(l)k(l+1)} / \mu_E\right) \cdot \exp\left(-t_{l,l+1} / T_m\right) \tag{8}$$
where $k(l)$ is the label of the $l$-th key element segment in the stream, $E_{ij}$ is the average occurrence interval between the $i$-th key element and the $j$-th key element, $\mu_E$ is the mean of all the $E_{ij}$, $t_{l,l+1}$ is the time interval between the two key element segments, and $T_m$ is a scaling factor, which is set to 16 seconds in our experiments, following the discussions on human memory limit [3]. In equation (8), we use the exponential function to simulate the affinity distribution, and its two factors reflect the two assumptions made above.

Using (8), an affinity curve can be obtained, as illustrated in Fig. 4. Thresholding the obtained curve results in coarse auditory scenes, indicated by the intervals $S_1$, $S_2$ and $S_3$ in Fig. 4. We set this threshold experimentally as $\mu_a + \sigma_a$, where $\mu_a$ and $\sigma_a$ are the mean and standard deviation of the affinity curve, respectively.

[Figure 4: affinity curve with threshold plotted over a sequence of key element segments and background segments $l, l+1, \dots, l+5$ along the time axis, with the coarse scene intervals $S_1$, $S_2$, $S_3$ and the revised intervals $S'_1$, $S'_2$, $S'_3$.]
Fig. 4. An illustration of the key audio element-based auditory scene detection, where the shaded and white segments represent the key and background audio elements, respectively.
The scenes are first located by measuring the affinities between adjacent key element segments, as shown by the intervals $S_1$, $S_2$ and $S_3$. Then, the scene boundaries are further revised by merging the surrounding background segments, as shown by the intervals $S'_1$, $S'_2$, and $S'_3$.

The boundaries of the coarsely located auditory scenes are defined by the key elements. To determine these boundaries more precisely, we need to find out whether the surrounding segments of background elements could be merged into the corresponding auditory scene, so that we can adjust the boundaries accordingly. In our approach, a background segment is merged into an auditory scene if its affinity with the key element on the scene boundary is large enough (larger than a threshold). Moreover, if a background segment has a sufficient affinity to both surrounding scenes, it is merged with the scene whose key element on the corresponding scene boundary has the larger affinity value with the background element. In the implementation of the above mechanism, the affinity between the key element and the background element is based on a criterion similar to equation (8) that evaluates the average occurrence interval between them. Again, a pre-defined threshold, set experimentally, is used to evaluate the affinity, in the same way as the threshold in Fig. 4. After the boundary refinement, the coarse auditory scenes are updated, as the intervals $S'_1$, $S'_2$, and $S'_3$ show in Fig. 4.

As stated above, we base the detection of auditory scene boundaries on the comparison of two subsequent key audio elements, which may be somewhat strict. A more intuitive approach would be to allow more flexibility in the ordering of key audio elements, as long as their mutual distance remains acceptable. In this way we would come close to the classical video scene segmentation approaches, such as those based on fast-forward linking [11] or content recall [13]. In our approach, we choose to stick to the strict criterion (8) in order to prevent two semantically different auditory scenes from being seen as one.
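The affinity (8) and the thresholding into coarse scenes can be sketched as follows. This is a minimal sketch under assumptions that are not taken from the paper: key element segments are given as hypothetical `(label, start, end)` tuples sorted by time, the average occurrence intervals $E_{ij}$ are supplied by the caller as a dictionary keyed by label pairs, and a boundary is placed wherever the affinity falls below the threshold. The default threshold follows the $\mu_a + \sigma_a$ rule stated above. The background-segment merging step is not covered.

```python
import numpy as np

def segment_affinities(segments, E, T_m=16.0):
    """Affinity a_{l,l+1} of eq. (8) for each pair of adjacent key element segments.

    segments: list of (label, start, end) tuples sorted by start time (assumed format).
    E: dict mapping (label_a, label_b) -> average occurrence interval, in either order.
    """
    mu_E = float(np.mean(list(E.values())))
    aff = []
    for (la, _, end_a), (lb, start_b, _) in zip(segments[:-1], segments[1:]):
        e_ab = E[(la, lb)] if (la, lb) in E else E[(lb, la)]
        gap = max(0.0, start_b - end_a)  # t_{l,l+1}
        aff.append(np.exp(-e_ab / mu_E) * np.exp(-gap / T_m))
    return np.array(aff)

def coarse_scenes(segments, aff, threshold=None):
    """Cut the segment sequence wherever the affinity falls below the threshold."""
    if threshold is None:
        threshold = aff.mean() + aff.std()  # mu_a + sigma_a, as stated in the text
    scenes, current = [], [segments[0]]
    for seg, a in zip(segments[1:], aff):
        if a < threshold:
            scenes.append(current)
            current = [seg]
        else:
            current.append(seg)
    scenes.append(current)
    return scenes
```

Because the default threshold lies one standard deviation above the mean affinity, only strongly linked neighbors stay together, which matches the strictness of criterion (8) discussed above.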
Clearly, the proposed method will most likely result in an over-segmentation of the input audio stream. This, however, is not a problem, as the semantically similar scenes will be grouped together in the following step. It is also noted that, with this scheme, some background elements may not belong to any auditory scene, or, in other words, each of these background elements could also be considered as an individual auditory scene. For these scenes, the categorization can be simply based on (or the same as) their element label. Therefore, in the following scene categorization and evaluations, we do not consider the auditory scenes which contain only a background element, and only consider those auditory scenes which contain both key and background audio elements.

3.2 Co-clustering based Scene Categorization
In the proposed approach, audio elements are used as mid-level representations of the auditory scene content. Thus, for the auditory scene categorization, the semantic similarity between auditory scenes can be measured based on the audio elements they contain. Although previous works usually use key audio elements to infer the semantics of auditory scenes [6][5], in our approach all the audio elements are used in the auditory scene categorization, since the background elements can be considered as the context of the key elements, and thus can also provide extra useful information in the semantic grouping.

In scene categorization, it is useful to consider the "grouping" tendency among audio elements [4]: some audio elements usually occur together in scenes of similar semantics, such as the co-occurrence of gun-shots and explosions in war scenes, and of cheer and laughter in humor scenes. Clearly, in a reliable similarity measure of auditory scenes, the "distance" between two audio elements in the same "group" should be smaller than that between elements of different "groups" [4]. Therefore, to obtain reliable results of auditory scene categorization, audio elements also need to be grouped according to their co-occurrences in various auditory scenes.

Essentially, the processes of clustering auditory scenes and revealing likely co-occurrences of audio elements can be considered dependent on each other [4]. That is, the information on semantic scene clusters can help reveal the audio element co-occurrences and vice versa.
An analogy to the above can again be found in the domain of text document analysis, where a solution in the form of a co-clustering algorithm was recently proposed for unsupervised topic-based document clustering, which exploits the relation between document clusters and the co-occurrences of keywords. Unlike traditional one-way clustering such as k-means, co-clustering is a two-way clustering algorithm. With this algorithm, the documents and keywords are clustered simultaneously based on the concept of mutual information. In our approach, the information-theoretic co-clustering [7] is adopted to co-cluster the auditory scenes and audio elements. Moreover, we also extend the algorithm with the Bayesian Information Criterion (BIC) to automatically select the cluster numbers for both auditory scenes and audio elements.

3.2.1 Information-Theoretic Co-clustering Algorithm
The information-theoretic co-clustering can effectively exploit the relationships among various audio elements and auditory scenes, based on the mutual information theory [7]. Following the work introduced in [7], we suppose there are $m$ auditory scenes and $n$ audio elements, and all the auditory scenes could be considered as being generated by a discrete random variable $S$, whose value $s$ is taken from the set $\{s_1, \dots, s_m\}$; and similarly, all the audio elements are generated by another discrete random variable $E$, whose value $e$ is taken from the set $\{e_1, \dots, e_n\}$. Let $p(S, E)$ be a matrix, with each element $p(s, e)$ representing the co-occurrence probability of an audio element $e$ and the auditory scene $s$. Then, the mutual information $I(S; E)$ is calculated as:
$$I(S; E) = \sum_s \sum_e p(s, e) \log\left( p(s, e) / (p(s)\, p(e)) \right) \tag{9}$$
It was proved in [7] that an optimal co-clustering should minimize the loss of mutual information after the clustering, i.e. the optimal clusters should minimize the difference:
$$I(S; E) - I(S^*; E^*) = KL\left(p(S, E),\, q(S, E)\right) \tag{10}$$
where $S^* = \{s^*_1, \dots, s^*_k\}$ and $E^* = \{e^*_1, \dots, e^*_l\}$ are $k$ and $l$ disjoint clusters formed from the elements of $S$ and $E$.
KL(·) is the Kullback-Leibler (K-L) divergence, and q(S, E) is also a distribution in the form of an m × n matrix, given as:

$$q(s,e) = p(s^*,e^*)\,p(s \mid s^*)\,p(e \mid e^*), \quad \text{where } s \in s^*,\ e \in e^* \tag{11}$$

After some transformations, equation (10) can be further expressed in a symmetrical manner [7]:

$$KL(p,q) = \sum_{s^*}\sum_{s \in s^*} p(s)\,KL\big(p(E \mid s),\,q(E \mid s^*)\big) \tag{12.1}$$

$$KL(p,q) = \sum_{e^*}\sum_{e \in e^*} p(e)\,KL\big(p(S \mid e),\,q(S \mid e^*)\big) \tag{12.2}$$

Expressions (12.1) and (12.2) show that the loss of mutual information can be minimized by minimizing the K-L divergence between p(E|s) and q(E|s*), or between p(S|e) and q(S|e*). This leads to the following iterative four-step co-clustering algorithm, which was proved to monotonically reduce the loss of mutual information and converge to a local minimum [7]:

Algorithm II: Co_Clustering (p, k, l)

1) Initialization: Group all auditory scenes into k clusters, and the audio elements into l clusters. These initial clusters are formed such that their centroids are "maximally" far apart from each other [8]. Then calculate the initial value of the q matrix.

2) Updating row clusters: First, for each row s, find its new cluster index i as:

$$i = \arg\min_{k} KL\big(p(E \mid s),\,q(E \mid s^*_k)\big) \tag{13}$$

Thus the K-L divergence between p(E|s) and q(E|s*) decreases in this step. With the new cluster indices of the rows, update the q matrix according to equation (11).

3) Updating column clusters: Based on the updated q matrix from step 2, find a new cluster index j for each column e as:

$$j = \arg\min_{l} KL\big(p(S \mid e),\,q(S \mid e^*_l)\big) \tag{14}$$

Thus the K-L divergence between p(S|e) and q(S|e*) decreases in this step. With the new cluster indices of the columns, update the q matrix again.

4) Re-calculate the loss of mutual information using equation (10). If the change in the loss of mutual information is smaller than a pre-defined threshold, stop the iteration process and return the clustering results; otherwise go to step 2 to start a new iteration.

To increase the quality of the local minimum, the local search strategy is applied [8].
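The four steps of Algorithm II can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`co_cluster`, `reassign`, and the helpers) are hypothetical, the initial cluster assignments are supplied by the caller instead of being derived from "maximally" far apart centroids, and the local search strategy of [8] is omitted. The loss is evaluated through the left-hand side of equation (10).

```python
import math

def marginals(p):
    """Row sums and column sums of a matrix given as a list of rows."""
    return [sum(r) for r in p], [sum(c) for c in zip(*p)]

def mutual_information(p):
    """Mutual information of a joint distribution, as in equation (9)."""
    ps, pe = marginals(p)
    return sum(v * math.log(v / (ps[i] * pe[j]))
               for i, r in enumerate(p) for j, v in enumerate(r) if v > 0)

def cluster_joint(p, rows, cols, k, l):
    """Joint distribution p(s*, e*) over row clusters and column clusters."""
    pc = [[0.0] * l for _ in range(k)]
    for i, r in enumerate(p):
        for j, v in enumerate(r):
            pc[rows[i]][cols[j]] += v
    return pc

def kl(a, b):
    """K-L divergence of two discrete distributions (infinite if b lacks mass)."""
    total = 0.0
    for x, y in zip(a, b):
        if x > 0:
            if y <= 0:
                return math.inf
            total += x * math.log(x / y)
    return total

def reassign(p, own, other, k, l):
    """Equation (13): move each row to the row cluster with the closest q(E|s*)."""
    ps, pe = marginals(p)
    pc = cluster_joint(p, own, other, k, l)
    pcr, pcc = marginals(pc)
    # q(e|s*) = p(e*|s*) p(e|e*), which follows from equation (11).
    protos = [[pc[a][b] / pcr[a] * pe[j] / pcc[b] if pcr[a] > 0 and pcc[b] > 0 else 0.0
               for j, b in enumerate(other)] for a in range(k)]
    new = []
    for i, r in enumerate(p):
        if ps[i] <= 0:                  # empty row: keep its current cluster
            new.append(own[i])
            continue
        cond = [v / ps[i] for v in r]   # p(E|s)
        new.append(min(range(k), key=lambda a: kl(cond, protos[a])))
    return new

def co_cluster(p, rows, cols, k, l, tol=1e-9, max_iter=100):
    """Alternate row and column updates until the loss of equation (10) settles."""
    rows, cols = list(rows), list(cols)
    pt = [list(c) for c in zip(*p)]     # transpose, so columns reuse reassign (eq. 14)
    total = mutual_information(p)

    def loss():
        return total - mutual_information(cluster_joint(p, rows, cols, k, l))

    history = [loss()]
    for _ in range(max_iter):
        rows = reassign(p, rows, cols, k, l)    # step 2
        cols = reassign(pt, cols, rows, l, k)   # step 3
        history.append(loss())                  # step 4
        if history[-2] - history[-1] < tol:
            break
    return rows, cols, history
```

On a toy block-diagonal matrix of four scenes and four elements (invented for illustration), starting from a row assignment with one scene misplaced, the sketch moves that scene to the correct cluster in the first iteration and the loss of mutual information drops to zero.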
Since what we know about the input audio stream is the presence and duration of the discovered audio elements in each detected auditory scene, the co-occurrence probability of the audio element $e_j$ and the auditory scene $s_i$ is approximated in our implementation simply by the duration percentage $occr_{ij}$ of $e_j$ in $s_i$. If an audio element does not occur in the scene, its duration percentage is set to zero. Finally, to satisfy the requirement that the sum of the co-occurrence distribution is equal to one, the co-occurrence matrix p(S, E) is normalized across the whole matrix, that is:

$$p(s_i, e_j) = \frac{occr_{ij}}{\sum_{i=1}^{m}\sum_{j=1}^{n} occr_{ij}} \tag{15}$$

3.2.2 BIC-based Cluster Number Estimation

In the above co-clustering algorithm, the row cluster number k and the column cluster number l are assumed to be known. However, in an unsupervised approach, it is not possible to specify the cluster numbers beforehand. In our proposed approach, the Bayesian Information Criterion (BIC) is utilized to select the optimal cluster numbers for co-clustering. In general, the BIC searches for a tradeoff between the data likelihood and the model complexty, and has been successfully employed to select the optimal cluster number for K-means clustering [20]. In our co-clustering scheme, assuming that the model preserving more mutual information would better fit the data, the data likelihood is described by the logarithm of the ratio between the mutual information after clustering, I(S*; E*), and the original mutual information I(S; E). As co-clustering is a two-way clustering, the model complexity here should consist of two parts: the size of the row clusters (n × k: k cluster centers of dimensionality n) and the size of the column clusters (m × l: l cluster centers of dimensionality m). According to the definition of BIC [4], these two parts are further modulated by the logarithm of the numbers of rows and columns, i.e. log m and log n, respectively. Thus, the BIC in our algorithm can be formulated as:

$$BIC(k,l) = \lambda \log\frac{I(S^*;E^*)}{I(S;E)} - \frac{nk\log m + ml\log n}{2} \tag{16}$$

In our implementation, λ is set experimentally to m × n, which is the size of the co-occurrence matrix. The algorithm searches over all the (k, l) pairs in a pre-defined range, and the model with the highest BIC score is chosen as the optimal set of cluster numbers.

4. EVALUATION AND DISCUSSION

In this section, we present the evaluation results obtained for the proposed approach on the basis of several composite audio streams, addressing both the audio element discovery and the auditory scene categorization.
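As a small illustration of Section 3.2.2, the normalization of equation (15) and the BIC score of equation (16) can be sketched as below. The function names are hypothetical, λ is fixed to m × n as stated in the text, and the subtraction and the division by 2 in the penalty term follow the usual BIC form, which is an assumption about the garbled original formula. The mutual information values passed to `select_cluster_numbers` are assumed to come from a separate co-clustering run for each candidate pair (k, l).

```python
import math

def normalize_cooccurrence(occr):
    """Equation (15): scale the duration percentages so the matrix sums to one."""
    total = sum(sum(row) for row in occr)
    return [[v / total for v in row] for row in occr]

def bic_score(i_after, i_before, m, n, k, l):
    """Equation (16) with lambda = m * n (assumed sign and 1/2 factor of the penalty)."""
    if i_after <= 0:
        return -math.inf              # no mutual information preserved
    fit = m * n * math.log(i_after / i_before)
    penalty = (n * k * math.log(m) + m * l * math.log(n)) / 2
    return fit - penalty

def select_cluster_numbers(preserved, i_before, m, n):
    """Return the (k, l) pair with the highest BIC.

    preserved maps each candidate (k, l) to the mutual information I(S*; E*)
    that remains after co-clustering with those cluster numbers.
    """
    return max(preserved,
               key=lambda kl: bic_score(preserved[kl], i_before, m, n, *kl))
```

For example, with m = n = 4 and invented values `{(1, 1): 0.0, (2, 2): 0.6, (3, 3): 0.65, (4, 4): 0.69}` against an original mutual information of 0.69, the pair (2, 2) is selected: larger models preserve slightly more information, but not enough to offset the complexity penalty.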
4.1 Database Information

The proposed framework was evaluated on sound tracks extracted from various types of video, including sports, situation comedy, award ceremony, and movies, with a total length of about 4 hours. These sound tracks contain an abundance of different audio elements and are of different complexity, in order to provide a more reliable base for evaluating the proposed approach under different conditions. For example, in the test dataset, the sound track of the tennis game is relatively simple compared to the far more complex sound track of the war movie "Band of Brothers - Carentan".

Table 1. Information of the experimental audio data

| No. | Video | Category | Duration |
|-----|-------|----------|----------|
| A1 | Tennis Game | sports | 0:59:41 |
| A2 | Friends | situation comedy | 0:25:08 |
| A3 | 59th Annual Golden Globe Awards | award ceremony | 1:39:47 |
| A4 | Band of Brothers - Carentan | war movie | 1:05:19 |

Detailed information on the sound tracks we used is listed in Table 1. All the audio streams are in 16 kHz, 16-bit, mono channel format, and are divided into frames of 25 ms with 50% overlap for feature extraction. To balance the detection resolution and the computational complexity, the length of the sliding window introduced in Section 2.1 is chosen as one second, with 0.5 seconds overlap.

4.2 Audio Element Discovering and Key Element Spotting

In this section, the performance of the proposed unsupervised approach to audio element discovery and key element spotting is evaluated. Due to the page limitation, we do not list the detailed performance figures for all test audio streams. Instead, we first provide an exhaustive performance presentation on the example of one test audio stream, and then give a summary of the performance figures obtained on all other audio streams. In each evaluation step, a different test stream is chosen as an example.

4.2.1 Audio Element Discovery

In the spectral clustering for audio element discovery, the boundaries of the search range for selecting the number of clusters are set experimentally as $k_{min} = 2$ and $k_{max} = 20$ for all the sound tracks.
Moreover, to illustrate the effectiveness of the proposed spectral clustering scheme with context-based scaling factors, we compare this scheme with the standard spectral clustering.

Table 2. Comparison of the results of the standard spectral clustering and the spectral clustering with context-based scaling factors on the sound track of "Friends" (A2) (unit: second). For each of the two algorithms, the columns are: No., N, S, A, L, L&M, M, precision; the last row gives the recall. Abbr.: noise (N), speech (S), applause (A), laughter (L), and music (M).

Table 2 shows the comparison results of the two spectral clustering algorithms on the sound track of "Friends". In this experiment, we obtained 7 audio elements using the spectral clustering with context-based scaling factors, and only 5 audio elements using the standard spectral clustering. To enable a quantitative evaluation of the clustering performance, we established the ground truth by combining the results obtained by three unbiased persons who analyzed the content of the sound track and the obtained audio elements. This process resulted in 6 sound classes that we labeled as noise (N), speech (S), applause (A), laughter (L), music (M), and laughter with music (L&M). In Table 2, each row represents one discovered audio element and contains the durations (in seconds) of its occurrences with respect to the ground truth sound classes. We manually grouped those audio element occurrences associated with the same ground truth class (indicated by shaded fields in Table 2), and then calculated the precision, recall and accuracy (the duration percentage of the correctly assigned audio segments in the stream) based on the grouping results. As shown in Table 2, the accuracies of the two algorithms are on average 97.8% and 90.1%, respectively, for the sound track of "Friends" (A2).
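The precision, recall and accuracy described above can be computed from a duration table of the same shape as Table 2. The sketch below is only an illustration: the function name is hypothetical, and each discovered element is assigned to the ground truth class it overlaps most, which is an assumption standing in for the manual grouping used in the paper.

```python
def evaluate_clusters(table):
    """Precision and recall per class, plus overall accuracy.

    table[i][c] is the duration (in seconds) of discovered audio element i
    that falls into ground truth class c.
    """
    n_classes = len(table[0])
    # Assumption: an element belongs to the class it overlaps most.
    assigned = [max(range(n_classes), key=lambda c: row[c]) for row in table]
    correct = [0.0] * n_classes   # correctly assigned duration per class
    claimed = [0.0] * n_classes   # total duration of elements assigned to the class
    actual = [sum(col) for col in zip(*table)]   # ground truth duration per class
    for row, c in zip(table, assigned):
        correct[c] += row[c]
        claimed[c] += sum(row)
    precision = [correct[c] / claimed[c] if claimed[c] else 0.0
                 for c in range(n_classes)]
    recall = [correct[c] / actual[c] if actual[c] else 0.0
              for c in range(n_classes)]
    accuracy = sum(correct) / sum(actual)
    return precision, recall, accuracy
```

With an invented table such as `[[90, 10, 0], [5, 80, 5], [0, 20, 40], [10, 0, 40]]`, the last two elements are both assigned to the third class, and the accuracy is the correctly assigned duration (250 s) divided by the stream duration (300 s).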

Table 2 shows that each class in the ground truth is covered by the audio elements discovered with the spectral clustering using context-based scaling factors. In the standard spectral clustering, the sounds of applause (A), music (M) and laughter with music (L&M) were missed and falsely included in other clusters, while speech (S) is divided over three discovered audio elements. As demonstrated in Section 2.2.2, this phenomenon may be caused by the unharmonious distributions of various sound classes in the feature space. For instance, the feature distribution of speech (S) is relatively sparse and has a large divergence, while those of music (M) and laughter with music (L&M) are more "tight". The influence of unharmonious sound distributions can be reduced by setting different scaling factors for different data densities, as done in our approach.

Table 3. Performance comparison between the spectral clustering with and without context-based scaling factors on all the sound tracks. Columns: No., #gc, then #nc / #miss and accuracy for the spectral clustering with context-based scaling factors, then #nc / #miss and accuracy for the standard spectral clustering. Rows: A1, A2, A3, A4.

The performance of the audio element discovery on all test sound tracks is summarized in Table 3, which lists the number of ground truth sounds (#gc), the number of discovered audio elements (#nc), the number of missed ground truth audio elements (#miss), and the overall accuracy. Table 3 shows that by using the standard spectral clustering algorithm, around 44% of the sound classes in the ground truth are not properly discovered, and the average accuracy is only around 77.1%. The table also shows that the spectral clustering with context-based scaling factors performs better on all the test sound tracks, and achieves an average accuracy of around 94.7%. In particular, no sound classes in the ground truth are missed in the obtained set of audio elements. Hence, the use of context-based scaling factors in the spectral clustering of complex audio streams can notably improve the clustering performance.

4.2.2 Key Audio Element Spotting

Based on the discovered audio elements, the heuristic rules proposed in Section 2.4 are employed to automatically spot the key audio elements. In the experiments, the parameters α, β, γ, and δ in equations (5.1)-(5.4) are simply set to 1 without extensive tuning, assuming that the key elements in the streams have medium occurrence frequency and duration, such as the applause in the tennis game (A1) and the laughter in the situation comedy (A2).

Table 4 contains the results of the key element spotting in the sound track of "Tennis" (A1). For the 7 discovered audio elements in the stream, the table lists their total duration (dur), the number of segments included in the corresponding clusters (nseg), and the mean and standard deviation of the segment length (avgl and stdl). Based on these properties, the importance score of each audio element is computed and an "educated guess" is made for the most likely number of key elements using equation (7). In the tennis sound track, we finally obtain two key elements, applause with speech and pure applause, as indicated by the shaded fields in Table 4.

Note that in Table 4, cluster 7, which contains the ball-hit sound, was not labeled as a key audio element, although one could expect ball-hit to be a key element as well in the case of a tennis sequence. Indeed, the importance score obtained for this cluster shows that it is an excellent candidate for being a key audio element. However, due to the finite resolution of the stream decomposition in our approach, defined by the sliding-window overlap of 0.5 s, the segments of ball-hit usually contain long-time silence or noise. Therefore, in audio element clustering, the segments of ball-hit are mixed with a number of silence or noise segments in the obtained cluster 7. Thus, cluster 7 in fact describes background sounds in the tennis match and is not taken as a ground truth key element in this paper.
It is also not selected as a key element in the experiments, since the overall duration of this cluster, when added to the overall durations of the first two key audio elements, is too large for the threshold set in equation (7).

Table 4. Key element spotting on the sound track of "Tennis" (A1). Columns: No., Description, dur, nseg, avgl, stdl, score. Discovered elements: 1 clean speech; 2 applause with speech; 3 pure music; 4 pure applause; 5 silence; 6 noise; 7 silence with ball-hit.

Table 5. Discovered key elements in all the sound tracks. Columns: No., k0 / dur0, k / dur, rec. / prec., key elements (in descending order).

| No. | Key elements (in descending order) |
|-----|------------------------------------|
| A1 | applause with speech, pure applause |
| A2 | laughter, noise (falsely spotted), laughter with music, applause |
| A3 | applause, applause with light-music1, applause with light-music2, applause with dense-music, applause with speech |
| A4 | gun-shot mixed with explosion, pure gun-shot (missed) |

The spotted key audio elements from all test sound tracks are shown in Table 5. For each sound track, the number of ground truth key elements (k0) and the total duration of the corresponding audio segments (dur0) are listed, as well as the number of spotted key elements (k) and their total duration (dur). The ground truth is established here again by combining the results obtained by three unbiased persons who analyzed the content of the test sound tracks in search of the most characteristic sounds and sound combinations. Based on these values, the recall and precision are computed. The table shows that the performance on relatively simple audio streams (A1 and A3) is satisfying. All the key elements in the ground truth are well spotted and no false alarms are introduced. The average recall and precision are above 90%. On the other hand, for complex audio streams such as the situation comedy (A2) and the war movie (A4), some false alarms are introduced and some key elements are missed. For example, the noise in A2 is falsely detected as a key element, since it has a similar occurrence frequency and duration to the expected key elements.
Also in A4, some real key elements such as the pure gun-shot are missed, since the characteristics of key elements in complex audio streams vary widely and are inconsistent. These problems indicate that the heuristic rules proposed in this paper cannot yet give a complete description of all the characteristics of key elements in complex audio streams, and need to be improved in future work. However, the overall performance of key element spotting using the proposed rules on our test audio streams is still acceptable, and more than 90% (10 out of 11) of the key elements in the ground truth are properly spotted.

4.3 Auditory Scene Categorization

To better evaluate the performance of the auditory scene categorization with the co-clustering algorithm, we first make some minor corrections to the key element spotting based on the available ground truth information. That is, we delete the noise from the key element list of A2, and add the pure gun-shot to the list of A4. Then, the auditory scenes are automatically located with the strategy proposed in Section 3.1. Finally, we again asked three persons to manually group these scenes into a number of semantic categories. Based on this manual grouping, we established the ground truth for further evaluation.

To illustrate the effectiveness of the proposed co-clustering scheme for scene categorization, we compare it with a traditional one-way clustering algorithm. Here, the X-means algorithm [20], in which BIC is also used to estimate the number of clusters, is adopted for the comparison. In the X-means clustering, we search for the proper cluster number K in the range 1 ≤ K ≤ 10; in the co-clustering, we search for the optimal number k of auditory scene categories and the optimal number l of audio element groups in the ranges 1 ≤ k ≤ 10 and 1 ≤ l ≤ n (n is the number of audio elements in the corresponding sound track), respectively.

Table 6. Detailed results comparison between the X-means and the co-clustering for auditory scene categorization on the sound track of the "59th Annual Golden Globe Awards" (A3). For each of the two algorithms, the columns are: No., S1, S2, S3, prec.; the last row gives the recall. Note: (S1) scenes of hosts or winners coming to or leaving the stage; (S2) scenes of audience applauding the winners; (S3) scenes of hosts introducing the winner candidates.

Table 6 shows the detailed comparison results of the two clustering algorithms on the example sound track of the "59th Annual Golden Globe Awards" (A3).
In this stream, there are 115 obtained scenes in total, which are manually classified into 3 semantic categories: 1) the scenes of hosts or winners coming to or leaving the stage (S1), which are mainly composed of applause and music; 2) the scenes of audience applauding the winners (S2); and 3) the scenes of hosts introducing the winner candidates (S3), which are mainly composed of applause and speech. In the experiments, we obtained 5 auditory scene categories using the information-theoretic co-clustering and 9 scene categories using the X-means. In Table 6, each row represents one obtained cluster and the distribution of the auditory scenes contained therein across the ground truth categories. Similar to Table 2, we manually group those clusters associated with the same ground truth category (as indicated by the shaded fields in Table 6), and then calculate the corresponding precision and recall per grouped cluster.

The results reported in Table 6 show that the co-clustering algorithm achieves better performance in auditory scene categorization. First, the number of auditory categories obtained by co-clustering is closer to the number of ground truth categories than that obtained with the X-means. In other words, the co-clustering can provide a more exact approximation of the actual semantic content classes existing in audio streams. Second, co-clustering performs better than the X-means clustering, both in terms of precision and recall. On average, around 97.4% of the scenes are correctly clustered with the co-clustering algorithm, while the accuracy of the X-means is 87.8%.

The performance comparison between the X-means and the co-clustering on all the sound tracks is summarized in Table 7. Similar to the categorization results on A3, co-clustering achieves higher accuracies on all test sound tracks, and also gives a closer approximation of the ground truth categories in all cases.

Table 7. Performance comparison between the X-means and the co-clustering on all the sound tracks. Columns: No., labeled semantic group num., then group num. and accuracy for the X-means, then group num. and accuracy for the co-clustering. Rows: A1, A2, A3, A4.

Furthermore, with the co-clustering algorithm we also obtain several audio element groups for each test audio stream, as shown in Table 8. These clustering results realistically reveal the grouping (co-occurrence) tendency among the audio elements, as explained in Section 3.2. For example, in the "59th Annual Golden Globe Awards" ceremony (A3), we observed that the sounds of applause with light-music and applause with dense-music usually occur together in the scenes of "the hosts or winners coming to or leaving the stage", and they are correctly grouped together by the co-clustering algorithm.

Table 8. The audio element groups obtained using the co-clustering algorithm on each sound track

| No. | #G | Audio element groups |
|-----|----|----------------------|
| A1 | 4 | {clear speech}; {applause with speech, silence}; {pure music}; {pure applause, silence, silence with ball-hit} |
| A2 | 3 | {noise}; {laughter, laughter with music}; {music1, music2, speech, applause} |
| A3 | 5 | {speech1, speech2, speech3, applause with speech}; {applause}; {music with speech, music}; {noise}; {applause with dense-music, applause with light-music1, applause with light-music2} |
| A4 | 4 | {speech with sparse gun-shot, gun-shot mixed with explosion, pure gun-shot}; {heavy noise, noise, speech with noise}; {speech1, speech2, speech3, speech4, silence1, silence2, silence3, applause, music with speech}; {music} |

5. CONCLUSIONS

In this paper, an unsupervised approach is proposed to discover semantic auditory content in composite audio streams. In this approach, a spectral clustering-based scheme is presented to segment and cluster the input stream into audio elements. By sorting the obtained elements according to their importance scores, key audio elements are discovered and then used to locate the potential auditory scenes in audio streams. Finally, an information-theoretic co-clustering based categorization approach is utilized to group the auditory scenes with similar semantics, by exploiting
