Formalisation of MPEG-1 compressed domain audio features

Technical Report
CSIRO Mathematical and Information Sciences

Formalisation of MPEG-1 compressed domain audio features

By Silvia Pfeiffer and Thomas Vincent
18th December 2001
Report Number: 01/196

CSIRO Mathematical and Information Sciences
Locked Bag 17, North Ryde NSW 1670, Australia
Contact: Silvia.Pfeiffer@csiro.au

TABLE OF CONTENTS

1 Introduction
2 MPEG-1 compressed data
2.1 MPEG-1 audio encoding
2.2 Field information
2.2.1 Header-type information
2.2.2 Subband values
3 Low-level audio features
3.1 Pre-processed subband information
3.2 Cepstral features
3.3 Energy features
3.4 Silence statistics
3.5 Spectral energy statistics
3.6 Bandwidth features
3.7 Pitch features
4 Segmentation
4.1 General segmentation
4.2 Scene change
5 Classification
5.1 Silence determination
5.2 Music/speech determination
5.3 Sound effects
5.4 Noise
5.5 Speaker
6 Recognition and Identification
6.1 Gender
6.2 Speech recognition
6.3 Beat recognition
7 Conclusions
REFERENCES

MPEG-1 audio features, S. Pfeiffer, T. Vincent

1 Introduction

In this paper we give an overview of existing audio content analysis approaches in the compressed domain and incorporate them into a coherent formal structure. We first examine the kinds of information accessible in an MPEG-1 compressed audio stream and describe a coherent approach to determine features from these. These features are generic enough to be further processed with standard audio content analysis approaches. We report on a number of applications that have been presented making use of the compressed domain features. Most of them aim at creating an index to the audio stream by segmenting the stream into temporally coherent regions, which are often classified into a pre-specified set of classes. We also discuss recognition and identification applications.

2 MPEG-1 compressed data

To understand the kind of features that can be extracted from an MPEG-1 compressed audio stream, we have to understand the meaning of the encoded fields (see [HAS97, ISA93, NOL97, PAN95]). To that end, we first briefly explain the encoding steps and the resulting field structures and then explain which fields contain useful information for content analysis.

2.1 MPEG-1 audio encoding

MPEG-1 audio encoding comes in three different flavours called Layers. They increase in complexity from Layer 1 to 3 yet all follow the same processing steps:

1. The sampled sound data is broken up into analysis windows and transformed into the frequency domain. A polyphase filterbank calculates 32 frequency band magnitudes (called subband values) for each of the three Layers, which is further refined to 576 subbands for Layer 3 only.
2. The resulting subband values are manipulated according to psychoacoustic models and the desired bitrate. The aim is to filter out sounds that are masked by other sounds and to arrive at a perceptually lossless compression. The extent of compression achieved by this psychoacoustic filtering is encoder-specific and not standardised.
3. The remaining subband magnitudes are linearised into a bitstream according to the bitstream format standardised for the respective Layer.
In this last step, further compression can be done (such as Huffman encoding), which is lossless and exploits redundancies of the data contained within several successive analysis windows. The data is then encoded in so-called audio frames. The resulting files therefore consist of a sequence of (audio) frames containing Layer-specific fields.

Figure 1 displays the transformation that a sequence of 1152 PCM samples goes through during encoding. Layers 1 and 2 stop after the first transform (a polyphase filterbank), while Layer 3 goes through an additional Modified Discrete Cosine Transform (MDCT) step. Access of transformation coefficients in Layer 3 can therefore be either at the filterbank or the MDCT level.

[Figure 1: Frequency transformations of Layer 3. 1152 PCM samples pass through the polyphase filterbank into 32 subbands (36 points each); each subband then passes through a modified DCT (36 / 3 x 12 samples).]

Figure 2 displays the frame formats of all three Layers. The 32 subband values are encoded in groups of 12 (Layer 1 & 2) or 18 (Layer 3) subband samples. We call these groups granules. There is only one such granule in a Layer 1 frame, whereas a Layer 2 frame contains three granules and a Layer 3 frame two granules to exploit further redundancies. A granule in Layer 3 can be viewed as either consisting of 18 values in each of 32 subbands or of one value in each of 576 subbands, depending on whether one accesses the filterbank coefficients or the MDCT coefficients.

[Figure 2: Frame formats for all three Layers. Layer I & Layer II: frame header, followed by encoded data with bit allocation, scalefactor selection, scalefactors and quantized values for channel 1 and channel 2. Layer III (MP3): frame header, side info, and main data holding scalefactors and Huffman data for channel 1 and channel 2 in granule 1 and granule 2.]

Except for the number of granules, Layer 1 and 2 encodings are the same. Their subband values are encoded in the quantised values field after having been normalised with scalefactors. The number of bits required for the quantised values is encoded in the bit allocation field. Additionally, the scalefactor selection field used by Layer 2 stores how many of the three scalefactors of each granule were different and thus had to be encoded.
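The granule layout described above fixes how many PCM samples one frame represents. The following is a minimal sketch of that arithmetic; the function names and the 44.1 kHz sample rate used in the usage note are illustrative assumptions, not part of the report.

```python
# Granule layout per Layer as stated in the text:
# layer -> (granules per frame, subband samples per granule).
GRANULE_LAYOUT = {1: (1, 12), 2: (3, 12), 3: (2, 18)}
FILTERBANK_SUBBANDS = 32  # polyphase filterbank subbands in all Layers


def pcm_samples_per_frame(layer):
    """PCM samples represented by one frame of the given Layer."""
    granules, samples_per_granule = GRANULE_LAYOUT[layer]
    return granules * samples_per_granule * FILTERBANK_SUBBANDS


def frame_duration_ms(layer, sample_rate_hz):
    """Duration of one frame in milliseconds for a given sample rate."""
    return 1000.0 * pcm_samples_per_frame(layer) / sample_rate_hz
```

Under these assumptions a Layer 1 frame covers 384 samples, while Layer 2 and Layer 3 frames both cover 1152 samples, which at an assumed 44.1 kHz is roughly 26 ms per frame.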

Layer 3 is different. As mentioned above, it results in 576 subband values. These are further compressed mainly by use of a Huffman compression scheme after careful grouping of subbands (see Figure 3).

[Figure 3: Layer 3 encoding of frequency bands. After noise allocation and scalefactor calculation, the frequency axis is divided into pairs of big values (value < 8191; Regions 0, 1 and 2; Huffman encoded), quadruples of small values (value in {0, -1, 1}; Huffman encoded), and pairs of zero values (value = 0; not encoded).]

2.2 Field information

Without going back to decoding an MPEG audio file to PCM samples, there are two types of information that can be used as features on which to base audio content analysis approaches: the information encoded in the header-like fields (header, bit allocation, scalefactor selection, scalefactors, side information) and the encoded subband values.

2.2.1 Header-type information

Wang and Vilermo [WAN1] have used the window type information of Layer 3 to detect beats. Layer 3 uses four different kinds of analysis windows: long, long-to-short, short, and short-to-long. The short windows are used for short but intensive sounds for which the long window would introduce too much pre-echo. They found that the window-switching pattern of pop-music beats for their specific encoder at bitrates of kbps gives (long, long-to-short, short, short, short-to-long, long) window sequences in 99% of the beats.

Our own research has also examined the header-type fields [BAR97]. We have used the following fields of Layers 1 & 2 with the given feature interpretation:

- Bit allocation: stores the dynamic range of a sequence of subband values.
- Scalefactors: store information on the maximum loudness of a sequence of subband values.
- Scalefactor selection information (Layer 2 only): stores how the loudness changes on three subsequent groups; a value of 2 indicates no change, so the loudness is stable, a value of 0 indicates a transient change, and the values 1 and 3 indicate an unstable change.

2.2.2 Subband values

Basically all other published research on compressed-domain audio analysis has used the subband values as a starting point for feature calculation. Thus it is required to decode the MPEG audio stream enough to access the subband values. For all three Layers, the subband values are not available directly in the linearised file/stream but have to be reconstructed from the encoded fields, implying some processing cost. The most time-consuming step in decoding an MPEG audio stream is, however, the resynthesis of PCM samples, and this is avoided as the subband values are still in the compressed domain.

In Layers 1 and 2 the subband values may be approximated by directly using the quantised values in an encoded frame (which can only be extracted from the file with the help of the bit allocation information). This however ignores the fact that the values are normalised by the scalefactors in each of the 32 subbands. So, to arrive at the subband values encoded in the file, one has to use the quantised values and denormalise them.

In Layer 3 there are 576 subband values. To extract the frequency band magnitudes from the file, it is necessary to decode the quantised samples with a Huffman decoder. Then the scalefactors, which served to colour the quantisation noise, have to be readjusted, and quantisation has to be reversed. After this, one reaches the alias-reduced MDCT coefficients, which we call the 576 subband values. To achieve features which are comparable independent of the encoded Layer, it is possible to further decode the 576 MDCT coefficients to the original 32 polyphase filterbank coefficients, but there is a processing cost associated and a loss of frequency resolution. One however gains temporal resolution, which might be more appropriate in certain application areas.

Subband values are in the following denoted by $s_i(n)$, $i$ being the subband number, $0 \le i \le I-1$ ($I=32$ in Layers 1 and 2, and possibly 576 in Layer 3), and $n$ the time index. In the following, all index values will start with 0.
The time index $n$ is viewed from a whole-file perspective. As an example, if we take a Layer 2 encoded audio file, its 5th frame, its 2nd granule, and the 5th subband value (out of the 12) of the 10th subband, we access the value $s_9(160)$ (see Figure 4).

[Figure 4: Explanation of subband numbering scheme. Frames 0 to 4, each holding granules 0 to 2; $s_9(160)$ lies in frame 4, granule 1. The axes show the time index n, the window number t and the window positions 0 to M-1.]
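The numbering example above can be checked with a few lines of arithmetic. This is a minimal sketch for Layer 2 only; the function names are hypothetical, and all indices are zero-based as in the text (so the 5th frame is frame 4, the 2nd granule is granule 1, and the 5th value is sample 4).

```python
SAMPLES_PER_GRANULE = 12  # Layer 2: subband samples per granule, per subband
GRANULES_PER_FRAME = 3    # Layer 2: granules per frame


def time_index(frame, granule, sample):
    """Zero-based, file-wide time index n of a Layer 2 subband sample."""
    return (frame * GRANULES_PER_FRAME + granule) * SAMPLES_PER_GRANULE + sample


def window_position(n, window_size):
    """Window number t and in-window position m for consecutive,
    non-overlapping windows of the given size."""
    return divmod(n, window_size)
```

With granule-sized windows (M = 12), `time_index(4, 1, 4)` gives 160, and `window_position(160, 12)` gives window number 13 at position 4.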

Depending on the required resolution of an analysis, one may choose to calculate features on the subband value resolution level, on the granule resolution level, or on the frame resolution level. A statistical analysis on a larger time window may thus be calculated on a multiplicity of any of these resolution levels. In our own research we have chosen the subband value or granule resolution, while many others prefer the frame resolution.

Most analysis algorithms work on a window of samples. Independent of the choice of resolution, we denote the window size by $M$ and the time position within a window by $m$, $0 \le m \le M-1$. $t$ will denote the window number while going over a file, which is closely related to the time position within a file. A subband value at window position $m$ is accessed depending on the choice of resolution for analysis. For example, if we select a granule as window size, have consecutive non-overlapping windows only, and work on a subband value resolution level, the above subband value $s_9(160)$ will be in window number $t=13$ at position $m=4$ ($M=12$ is the implied window size for the Layer 2 granule here) (see Figure 4).

Synopsis of used terms:

n         Time index, 0 <= n <= N-1
i         Subband index, 0 <= i <= I-1
s_i(n)    Subband value at time index n for subband i
m         Time position within window, 0 <= m <= M-1
M         Analysis window size
t         Analysis window number

3 Low-level audio features

Going from the subband values to high-level analysis such as segmentation, classification, recognition and identification first requires the calculation of low-level audio features on which the further analysis will be based. Most often, information on the subband energies is used as a starting point for the analysis of the features.

3.1 Pre-processed subband information

Instead of using the subband values themselves to access subband energies, some researchers pre-process them to achieve different goals. Tzanetakis et al. [TZA] use the root mean squared (RMS) subband vector on a frame resolution because this is a better measure of signal energy than the subband values themselves.
A generalised formula for their approach, independent of granule, frame or any larger window, is given by:

\[ s_i(t) = \sqrt{\frac{1}{M} \sum_{m=0}^{M-1} s_i(Mt+m)^2}. \]

This enables one to arrive at 32 or 576 subband values by averaging the M subband values in the window.
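The RMS reduction can be sketched in a few lines. This is an illustrative sketch only: it assumes each subband is available as a plain list of decoded values indexed by the time index n, and the function names are hypothetical.

```python
import math


def subband_rms(subband, t, M):
    """RMS of the M values of one subband in window number t,
    i.e. of s_i(Mt), ..., s_i(Mt + M - 1)."""
    window = subband[M * t: M * t + M]
    return math.sqrt(sum(v * v for v in window) / M)


def rms_vector(subbands, t, M):
    """One RMS value per subband: a vector of 32 (or 576) entries."""
    return [subband_rms(s, t, M) for s in subbands]
```

For example, `subband_rms([1, 1, 1, 1, 3, -3, 3, -3], 1, 4)` evaluates the second window of four values and returns 3.0.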

Boccignone et al. [BOC99] use the subband mean value in a window of size M as a feature, as a way to go to a frame resolution. A generalised formula for their approach, independent of granule, frame or any larger window, is given by:

\[ \mu_i(t) = \frac{1}{M} \sum_{m=0}^{M-1} s_i(Mt+m). \]

Nakajima et al. [NAK99] calculate the normalised subband energy from the subband samples of a frame to absorb sound level dependency on audio source. The following formula normalises a single subband value on the maximum of all subband values at the same time index n:

\[ \varsigma_i(n) = 10 \log_{10}\left( \frac{s_i(n)}{\max(s_j(n) : 0 \le j \le I-1)} \right). \]

Going to a lower resolution for a window of size M, we have also used the subband mean as a basis for calculation of a normalised subband energy:

\[ \varsigma_i(t) = 10 \log_{10}\left( \frac{\mu_i(t)}{\max(\mu_j(t) : 0 \le j \le I-1)} \right). \]

In our own work we have made use of the scalefactors in Layer 1 and 2 for a granule resolution subband energy measure [BAR97]. The scalefactors are the maximum value of the sequence of subband values within a granule:

\[ scf_i(t) = \max(s_i(Mt+m) : 0 \le m \le M-1), \quad \text{with } 0 \le i \le I-1. \]

The two values $s_i(n)$ and $\varsigma_i(n)$ work at the subband value resolution, while $s_i(t)$, $\mu_i(t)$, $\varsigma_i(t)$ and $scf_i(t)$ work on a granule, frame or any larger window resolution. In the following subsections, any of these values may be used in the formulas interchangeably. The choice depends on the goals of the analysis. As a placeholder we will use $s_i(t)$.

3.2 Cepstral features

For speech and speaker recognition approaches it is standard to use a representation of the audio signal in the frequency domain as feature. Cepstral coefficients have proven to be particularly successful. They are calculated by performing another frequency transform on the logarithm of spectral coefficients. Both the 32 subband values of Layer 1 and 2, and the 576 subband values of Layer 3 are linearly spaced spectral coefficients and serve well as a basis for calculation of cepstral coefficients. Venugopal et al. [VEN99] calculate linear frequency cepstral coefficients from the subband values and use them for speaker identification. Yapp et al. [YAP97] use the quantised values of the first nine subbands for their speech recognition system. To reach a higher frequency resolution they apply the Fast Fourier Transform (FFT) on subband windows of size 5 ms using a Hamming window and padding the number of samples to a power of 2. They take the magnitude of the FFT values and assemble them over the subbands into one single frequency vector. This feature vector is then mel-warped and used to calculate cepstral, delta cepstral, and acceleration coefficients as features.

3.3 Energy features

Signal energy features are closely related to human loudness perception. When calculating energy features in the compressed domain rather than from uncompressed PCM samples, the results are closer approximations of perceptual loudness because the subband values have been filtered by the psychoacoustic model and thus the influence of non-hearable frequencies is reduced. The disadvantage however is that signal energy is distributed over the frequency bands and thus has to be added up, requiring higher computational complexity than on uncompressed signals.

Signal energy measures are often used for segmentation of an audio stream. A signal's start and end times are then usually determined by thresholding. Patel et al. [PAT96] calculate signal energy for a window of size M from the subband values. Nakajima et al. [NAK99] restrict their loudness measurement to the energy in the lowest subband, as this is the one where most energy is concentrated and this restriction provides a considerable efficiency increase. A generalised formula for signal energy is given by:

\[ E(t) = \frac{1}{I M} \sum_{i=0}^{I-1} \sum_{m=0}^{M-1} s_i(Mt+m)^2 \, h(M-1-m). \]

The window function $h(m)$ may be e.g. a Rectangular, Hamming, Hanning, Welch or Bartlett window, depending on the required narrowness or peakedness of spectral leakage. Tzanetakis et al. [TZA] prefer to use the RMS of the signal energy for loudness approximation, which achieves a better separation for low level values. Another loudness approximation is the signal magnitude, which is less sensitive to noise than signal energy. It can be calculated analogously to signal energy [PAT96] via:

\[ M(t) = \frac{1}{I M} \sum_{i=0}^{I-1} \sum_{m=0}^{M-1} |s_i(Mt+m)| \, h(M-1-m). \]

In our own work we have used the sum of scalefactors on a granule resolution for a fast approximation of the signal magnitude on Layer 1 & 2 frames [BAR97, PFE99]. Wang and Vilermo [WAN1] calculate the band energy of several subbands and use a threshold on them to determine a confidence for a pop-music beat in the granule:

\[ E_{(J_1, J_2)}(t) = \sum_{i=J_1}^{J_2} \sum_{m=0}^{M-1} s_i(Mt+m)^2 \, h(M-1-m). \]
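The three energy measures of this section share one windowed double sum and differ only in the summand and the subband range. The sketch below is an illustration under stated assumptions: subbands are plain lists of decoded values, the window h defaults to a rectangular window, and all function names are hypothetical.

```python
def _windowed(subband, t, M, h):
    """Pairs (s_i(Mt+m), h(M-1-m)) for m = 0 .. M-1."""
    return ((subband[M * t + m], h[M - 1 - m]) for m in range(M))


def signal_energy(subbands, t, M, h=None):
    """Windowed mean of squared subband values over all subbands."""
    h = h if h is not None else [1.0] * M  # assumed rectangular window
    total = sum(v * v * w for s in subbands for v, w in _windowed(s, t, M, h))
    return total / (len(subbands) * M)


def signal_magnitude(subbands, t, M, h=None):
    """Windowed mean of absolute subband values over all subbands."""
    h = h if h is not None else [1.0] * M
    total = sum(abs(v) * w for s in subbands for v, w in _windowed(s, t, M, h))
    return total / (len(subbands) * M)


def band_energy(subbands, t, M, j1, j2, h=None):
    """Unnormalised energy of subbands j1 .. j2 (inclusive) in window t."""
    h = h if h is not None else [1.0] * M
    return sum(v * v * w
               for s in subbands[j1:j2 + 1]
               for v, w in _windowed(s, t, M, h))
```

Passing a different list for `h` (for example Hamming coefficients) changes only the weights; a beat detector in the spirit of the band energy measure would then compare `band_energy` against a threshold per granule.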

3.4 Silence statistics

Silence statistics are often used as indicators for classification of audio segments into different signal classes. Speech segments for example generally contain a lot more silence than music segments. Therefore, Patel et al. [PAT96] propose pause rate as an indicator to separate speech from non-speech signals:

\[ P(t) = \frac{1}{M} \sum_{m=0}^{M-1} \big( E(Mt+m) > T_s \big) \wedge \neg \big( E(Mt+m+1) > T_s \big) \]

with $T_s$ as the silence energy threshold. P counts the number of silent segments on a time interval of size M. Similarly, Tzanetakis et al. [TZA] use a low energy feature to separate speech from music. On a window of about 1 sec (M=40 frames) they calculate the percentage of frames that have less than the average energy $\bar{E}(t)$:

\[ L(t) = \frac{1}{M} \sum_{m=0}^{M-1} \big( E(Mt+m) < \bar{E}(t) \big). \]

Nakajima et al. [NAK99] call their silence statistic energy density. It is also defined on a window of 1 sec and basically calculated as the log value of the variance of L(t).

3.5 Spectral energy statistics

Spectral energy statistics capture subband energy distribution features, which are indicative of specific types of sounds. The spectral centroid is the balancing point of the subband energy distribution [BAR97, TZA]. It is thus calculated as the first moment of the subband energy distribution:

\[ C(t) = \frac{\sum_{i=0}^{I-1} (i+1) \, s_i(t)}{\sum_{i=0}^{I-1} s_i(t)}. \]

It determines the frequency area around which most of the signal energy concentrates and is thus closely related to the time-domain zero crossing rate (ZCR) feature often used in speech recognition systems to determine exact start- and endpoints of talkspurts. It is also frequently used as an approximation for a perceptual brightness measure [BAR97]. Nakajima et al. [NAK99] use the squared subband samples in their spectral centroid calculation to better spread out the centroid values. The spectral rolloff point R is determined where 85% of the window's energy is achieved:

\[ \sum_{i=0}^{R} s_i(t) = 0.85 \sum_{i=0}^{I-1} s_i(t). \]
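The spectral centroid and the rolloff point can both be computed from one vector of per-subband values for a window. The sketch below is illustrative only: it assumes non-negative values (for example RMS or scalefactor values) with a positive sum, and it takes the rolloff point to be the smallest subband index at which the cumulative sum reaches the 85% target, which is one way of resolving the equality in the discrete case. Function names are hypothetical.

```python
def spectral_centroid(s):
    """First moment of the subband distribution, with subbands weighted
    by (i + 1) as in the centroid formula."""
    total = sum(s)
    return sum((i + 1) * v for i, v in enumerate(s)) / total


def spectral_rolloff(s, fraction=0.85):
    """Smallest subband index R at which the cumulative sum reaches
    `fraction` of the total (assumed discretisation of the rolloff point)."""
    target = fraction * sum(s)
    running = 0.0
    for i, v in enumerate(s):
        running += v
        if running >= target:
            return i
    return len(s) - 1
```

A vector concentrated in the low subbands, such as `[8, 1, 1, 0]`, yields a low centroid (1.3) and an early rolloff index (1), whereas a flat vector pushes both towards the upper subbands.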

The rolloff point is used to distinguish voiced speech from unvoiced speech and music [TZA], which have a higher rolloff point because their power is better distributed over the subband range.

Patel et al. [PAT96] propose a feature called band energy ratio, which sets the energy of the low frequencies (subbands 0 to J-1) in relation to the high frequencies (subbands J to I-1):

\[ B(t) = \frac{\sum_{i=0}^{J-1} \sum_{m=0}^{M-1} s_i(Mt+m)^2 \, h(M-1-m)}{\sum_{i=J}^{I-1} \sum_{m=0}^{M-1} s_i(Mt+m)^2 \, h(M-1-m)}. \]

This is indicative of the voicedness of a sound. They claim that J=2 is a good choice because voiced signal energy concentrates below 1.5 kHz while unvoiced signal energy is distributed over all subbands.

The spectral flux of two successive windows t and t+1 is calculated as the 2-norm of the difference between normalized subband value vectors at t and t+1 [TZA]:

\[ \Delta(t, t+1) = \sqrt{ \sum_{i=0}^{I-1} \left( \frac{s_i(t)}{\max(s_j(t) : 0 \le j \le I-1)} - \frac{s_i(t+1)}{\max(s_j(t+1) : 0 \le j \le I-1)} \right)^2 }. \]

While the first three statistics calculate spectral energy distribution features on one window, the spectral flux determines changes of spectral energy distribution between two successive windows. The subband central moments calculated by Boccignone et al. [BOC99], on the contrary, calculate statistics within subbands over several frames. They capture how much a subband's energy is dispersed from its mean:

\[ D_i^k(t) = \frac{1}{M} \sum_{m=0}^{M-1} \big( s_i(Mt+m) - \mu_i(t) \big)^k \quad \text{with } k=2,\ldots,5. \]

3.6 Bandwidth features

The bandwidth covered by a window is calculated from all subbands with sufficient energy:

\[ BW(t) = \max\big(i : (0 \le i \le I-1) \wedge (s_i(t) > T_s)\big) - \min\big(i : (0 \le i \le I-1) \wedge (s_i(t) > T_s)\big). \]

It is stipulated that the bandwidth of speech is usually narrower than that of music [NAK99, VEN99]. Nakajima et al. [NAK99] propose to determine bandwidth information by counting the number of subbands with significant level:

\[ SB(t) = \sum_{i=0}^{I-1} \big( s_i(t) > T_s \big). \]

If a window's content covers a lot of subbands, as music does, SB(t) becomes large. In addition to measuring the subband range that a window contains, this also takes into account how strongly the subbands in between are represented.
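The bandwidth, the count of significant subbands, and the spectral flux all operate on per-window subband vectors. The following is a minimal sketch under assumptions: the vectors hold non-negative values with a positive maximum, the threshold is supplied by the caller, a window without any significant subband is given bandwidth 0, and the function names are hypothetical.

```python
import math


def bandwidth(s, threshold):
    """Highest minus lowest index of the subbands exceeding the threshold."""
    active = [i for i, v in enumerate(s) if v > threshold]
    return active[-1] - active[0] if active else 0


def significant_subbands(s, threshold):
    """Number of subbands whose value exceeds the threshold."""
    return sum(1 for v in s if v > threshold)


def spectral_flux(s_t, s_next):
    """2-norm of the difference between two max-normalised subband vectors."""
    max_t, max_next = max(s_t), max(s_next)
    return math.sqrt(sum((a / max_t - b / max_next) ** 2
                         for a, b in zip(s_t, s_next)))
```

For the vector `[0, 5, 0, 7, 0, 0]` and threshold 1, the bandwidth is 2 while only 2 subbands are significant; a vector with every subband between indices 1 and 3 above threshold would have the same bandwidth but a count of 3, which is the extra information the count carries.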

3.7 Pitch features

Pitch is indicative of a speaker and thus an important property of a sound. Patel et al. [PAT96] calculate the pitch of a sound signal in the compressed domain by using the autocorrelation function of the values of the first subband on 3% overlapping windows. With an overlap of o samples, the related generic formula can be given via:

\[ A(t, k) = \frac{1}{M} \sum_{m=0}^{M-1} \big( s_0((M-o)t + m) \cdot s_0((M-o)t + m + k) \big). \]

They calculate the pitch only on windows of sufficient energy to reduce processing time on silences. In addition they perform nonlinear clipping of small subband values to avoid confusion of the first and second formants. They choose the largest autocorrelation peak value as the pitch if it contains more than 3% of the window's energy. Venugopal et al. [VEN99] use the analysis-by-synthesis approach of the Multiband Excitation Vocoder for pitch estimation. In it, speech is synthesised and an unbiased error measure is calculated by comparison to the original speech. The pitch period is the period used when the error is minimum.

4 Segmentation

When talking about segmentation of an audio stream, temporal segmentation is usually the subject. The identification of the sound components that belong to one specific sound event could be regarded as a spatio-temporal segmentation. This is a hard task and is being researched in the field of computational auditory scene analysis (CASA). We are not aware of any approaches toward CASA in the MPEG-1 compressed domain. So, here we concentrate on temporal segmentation, in which specific temporal fragments of an audio stream are identified for their homogeneous content according to some criteria. Existing segmentation approaches determine fragment boundaries based on e.g. strong changes of a specific feature or relative pauses. For such segmentation approaches, the presented features are often not used directly; instead their means and variances are calculated on larger windows of about 1-4 sec. Additionally, log transforms of the results can be used to reduce the dynamic range and make the clusters in classification more compact [TZA].
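The step from a per-frame feature to its mean and variance over a larger window can be sketched as follows. This is an illustration under assumptions: the feature is a plain list with one value per frame, windows are consecutive and non-overlapping, a trailing partial window is dropped, the population variance is used, and the small offset in the log transform is an arbitrary guard against log(0). The function names are hypothetical.

```python
import math


def texture_stats(feature, window):
    """Mean and (population) variance of a frame-level feature over
    consecutive non-overlapping windows of `window` frames."""
    stats = []
    for start in range(0, len(feature) - window + 1, window):
        chunk = feature[start:start + window]
        mean = sum(chunk) / window
        var = sum((x - mean) ** 2 for x in chunk) / window
        stats.append((mean, var))
    return stats


def log_compress(x, eps=1e-12):
    """Log transform to reduce dynamic range; eps is an assumed guard
    for zero-valued inputs."""
    return math.log(x + eps)
```

Applied to a spectral centroid or flux track, `texture_stats` yields the mean and variance pairs that would make up a feature vector per window.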
4.1 General segmentation

Tzanetakis et al. [TZA] perform generic audio segmentation at texture change instants. They use the features low energy, and mean and variances of spectral centroid, spectral rolloff point, spectral flux, and RMS energy in a feature vector. They calculate the Mahalanobis distance between successive feature vectors, and differentiate it. Peaks are then picked as segment boundaries via an adaptive thresholding algorithm, which includes a minimum duration condition to avoid small regions. They achieve up to 75% consistency with human segmentation results.

Our own approach to segmentation uses the signal magnitude as feature [PFE1] to calculate relative pauses. The segmentation algorithm also follows an adaptive thresholding approach on sec intervals. Windows are determined as silence if their signal magnitude stays under the threshold. Segmentation uses a minimum duration and a maximum tolerated interruption parameter. A sequence of silence windows gets clustered into a pause segment if it covers at least the minimum duration and is not interrupted by non-silence windows longer than the tolerated interruption. We achieve hit rates between 46% and 97% when comparing to human segmentation results, depending upon the SNR of the material.

4.2 Scene change

Barrass [BAR97] calculates a running average of the spectral centroid called brightness history. Sudden changes in brightness are used for scene change detection. Boccignone et al. [BOC99] calculate video scene changes based on audio and video breaks. Video analysis provides shot boundaries, which are scene change candidates. Audio analysis validates the candidates. They calculate the subband mean energy and four subband central moments on the first 8 subbands and accumulate these into one feature vector. Then, an Artificial Neural Network is trained to partition the feature space into silence, speech, music, and noise, resulting in transition points between different sound classes. These are used to validate the shot boundaries. They achieve a hit rate between 6% and 93% for audio breaks in comparison to human results.

5 Classification

Although temporal segmentation is an important first step in determining the structure of an audio stream, automatic determination of more information on the actual content of the fragments is of higher value. Thus the next step is to classify the fragment content into a given set of sound classes. Generic classes are silence, music, speech, and noise.
According to the requirements of the application, more specific classifications may be required, such as the determination of the type of sound effect or the separation of speakers.

5.1 Silence determination

Patel et al. [PAT96] use a standard threshold approach on the average signal energy of a video clip to determine whether the clip contains silence. Nakajima et al. [NAK99] use the variance of subband energy to distinguish between silence and non-silence:

\[ \sigma^2(n) = \frac{1}{M} \sum_{m=0}^{M-1} \big( s_i(m) - \mu_i(m) \big)^2. \]

They choose M = 1 sec and use only one subband sample per frame. This sub-sampling increases calculation speed enormously. A silent segment is declared where $\sigma^2(n) < T_s$, with $T_s$ as the silence energy threshold. They achieve a hit rate of 91% with 13% false hits. The false hits are mainly attributed to mixed signal 1 sec windows.

5.2 Music/speech determination

Patel et al. [PAT96] classify audio segments determined by video shot boundary detection. If the band energy ratio lies above .8 or the pause rate is below . or there is no pitch found, the segment is classified as a musical clip, else as a speech clip.

Nakajima et al. [NAK99] use the energy density and the average number of subbands with significant level on 1 sec windows to distinguish between music and speech. Music has a higher energy density than speech. Music also usually has significant subbands up to subband number, whereas speech rarely goes beyond subband 1. They use a multivariate Gaussian distribution to model the classes and achieve a hit rate of 93% for music (with 4% false hits) and of 88% for speech (with 16% false hits). The false hits for speech stem from intermittent sounds such as drum solos.

Tzanetakis et al. [TZA] use the features low energy, and the mean and variances of the spectral centroid, spectral rolloff point, and spectral flux in a feature vector. They compare a multivariate Gaussian distribution classifier to a K-Nearest Neighbour classifier to distinguish speech from music. They evaluate them on about hours of audio data and compare results on a frame basis, achieving about 8% accuracy for the Gaussian distribution and about 85% accuracy for the K-NN. In comparison to the classification of PCM data, the results only degrade by about %.

Barrass [BAR97] determines music on Layer 2 files as a signal that exhibits long-term wide-band stability. This stability is calculated from the scalefactor selection information. A signal is determined as long-term stable if more than 6% of the non-zero subbands of a frame have a repeated scalefactor.
Coverage of the subbands must be at least 4 out of 32 subbands. In contrast, speech is determined as a signal with low- or mid-range brightness and stability. The brightness is calculated via the spectral centroid of the scalefactors. Low-range brightness indicates a male voice, mid-range brightness a female one, and high-range brightness again signifies music.

5.3 Sound effects

Nakajima et al. [NAK99] use the average and variance of the spectral centroid on 1 sec windows to determine applause on their TV program sound tracks. Applause has a continuous self-similarity and stable centre frequency. They achieve a hit rate of 74% with 15% false hits, which occur mainly on mixed signal 1 sec windows.

5.4 Noise

Barrass [BAR97] determines noise on Layer 2 files as a signal that exhibits long-term wide-band transience. This transience is calculated from the scalefactor selection information. A signal is determined as long-term transient if more than 3% of the non-zero subbands of a frame have non-repeated scalefactors. Coverage of the subbands must be at least 1 out of 4 subbands. In addition, short, loud and bright signals are determined as a clang and also classified as noise.

5.5 Speaker

To distinguish between six different speakers, Venugopal et al. [VEN99] use normalised linear frequency cepstral coefficients and estimate Gaussian Mixture Model parameters using the Expectation Maximisation algorithm.

6 Recognition and Identification

On speech segments, recognition and identification of more specific sound content is possible, such as the gender of the speaker of a segment, the identity of the speaker, and the content of the speech. On music segments, recognition of beats and identification of rhythms can be performed. We report on one beat recognition approach based on MPEG-1 compressed domain features.

6.1 Gender

Venugopal et al. [VEN99] use the pitch estimation of the Multiband Excitation Vocoder and declare the speaker as male if the pitch is between 6 and 1 Hz and female between 1 and Hz. They achieve a hit rate of about 8%. As mentioned above, Barrass [BAR97] determines the gender of a speech frame via the brightness of the signal, which is calculated from the spectral centroid of the scalefactors in Layer 2. If the spectral centroid lies below the second subband, it is determined as male, and between the third and sixth subband as female.

6.2 Speech recognition

Yapp and Zick [YAP97] implemented a speaker-dependent, small vocabulary speech recognition system that uses compressed domain features. They calculate cepstral, delta cepstral, and acceleration coefficients as described in Section 3.2. Then they train Hidden Markov Models (HMMs) on 17 words. Training and recognition were both performed with continuously spoken sentences. A word-level accuracy of 99% was obtained on Layer 2 encoded data at 3 kbit/s. Their system works on Layer 1 and Layer 2, and inter-layer training and recognition is possible.

6.3 Beat recognition

Wang and Vilermo [WAN1] presented a compressed-domain beat detector for pop music with the aim of replacing beats that were lost during an Internet transmission of a pop song with previously stored beat samples of that song. They use the window type information of Layer 3 files and the band energy of four frequency ranges for beat detection. The four frequency ranges are: the full-band energy and the frequency intervals -459, , and Hz. The middle frequency ranges usually give poor beat information because other instruments and singing are more dominant in these ranges. When restricting the search for beats to the most probable times after inter-beat intervals (IBIs), they detect most beats.

7 Conclusions

In this paper we have given an overview of the kind of features that have been extracted in the MPEG-1 audio compressed domain. Considering the number of MPEG-1 Layer 3 (MP3) files available nowadays, audio analysis on compressed files is bound to be in great demand soon. Research in this field is still in its infancy and there are still many opportunities to pursue for fundamental research. Audio analysis results can be used more powerfully in conjunction with video analysis results to achieve automatic extraction of more abstract concepts. On its own, audio analysis can be used for sound-based audio search engines that are no longer based on textual queries and filenames but on the audio content.

ACKNOWLEDGEMENTS

We thank Conrad Parker and Stephen Barrass for their proof-reading and feedback on this paper.

REFERENCES

[BAR97] Stephen Barrass, "Bilby - A tool for foraging in MPEG audio", Technical Report No. 1/98, CSIRO Mathematical and Information Sciences, Nov (published June 1).

[BOC99] G. Boccignone, M. De Santo, and G. Percannella, "Joint Audio-Video Processing of MPEG Encoded Sequences", Proc. IEEE Intl. Conf. on Multimedia Computing and Systems (ICMCS), Vol., 1999, pp.

[HAS97] Haskell, Puri, and Netravali, "Digital Video: An Introduction to MPEG-2", Chapter 4, 1997, Chapman & Hall, New York.

[ISA93] ISO, International Organization for Standardization, "International Standard 11172-3, Information Technology - Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s - Part 3: Audio", 1993.

[NAK99] Y. Nakajima, Y. Lu, M. Sugano, A. Yoneyama, H. Yanagihara, and A. Kurematsu, "A Fast Audio Classification From MPEG", Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Vol. IV, 1999, Phoenix, Arizona, USA, pp.

[NOL97] P. Noll, "MPEG Digital Audio Coding", IEEE Signal Processing Magazine, Sept. 1997, pp.

[PAN95] D. Pan, "A Tutorial on MPEG/Audio Compression", IEEE Multimedia, Vol. 2, No. 2, summer 1995, pp.

[PAT96] N.V. Patel and I.K. Sethi, "Audio characterization for video indexing", Proc. SPIE, Storage and Retrieval for Still Image and Video Databases IV, Vol. 2670, 1996, San Jose, CA, USA, pp.

[PFE1] Silvia Pfeiffer, "Pause concepts for audio segmentation at different semantic levels", Proc. ACM Multimedia 2001, Sept 30 - Oct 5, Ottawa, Ontario, Canada, pp., 2001.

[PFE99] S. Pfeiffer, J. Robert-Ribes, and D. Kim, "Audio Content Extraction from MPEG-encoded sequences", Proc. Fifth Joint Conference on Information Sciences, pp., 1999, Vol. II, Atlantic City, New Jersey.

[TZA] G. Tzanetakis and P. Cook, "Sound analysis using MPEG compressed audio", Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pp., Vol., 2000, Istanbul, Turkey.

[VEN99] S. Venugopal, K.R. Ramakrishnan, S.H. Srinivas, and N. Balakrishnan, "Audio scene analysis and scene change detection in the MPEG compressed domain", IEEE Third Workshop on Multimedia Signal Processing, MMSP 1999, pp.

[WAN1] Ye Wang and Miikka Vilermo, "A compressed domain beat detector using MP3 audio bitstreams", Proc. ACM Multimedia 2001, Sept 30 - Oct 5, Ottawa, Ontario, Canada, pp. 194-, 2001.

[YAP97] L. Yapp and G. Zick, "Speech recognition on MPEG/audio encoded files", Proc. IEEE Intl. Conf. on Multimedia Computing and Systems (ICMCS), pp., 1997, Ottawa, Canada.
