Structural Analysis of Musical Signals for Indexing and Thumbnailing

Wei Chai, Barry Vercoe
MIT Media Laboratory
{chaiwei, bv}@media.mit.edu

Abstract

A musical piece typically has a repetitive structure. Analysis of this structure will be useful for music segmentation, indexing and thumbnailing. This paper presents an algorithm that can automatically analyze the repetitive structure of musical signals. First, the algorithm detects the repetitions of each segment of fixed length in a piece using dynamic programming. Second, the algorithm summarizes this repetition information and infers the structure based on heuristic rules. The performance of the approach is demonstrated visually using figures for qualitative evaluation, and by two structural similarity measures for quantitative evaluation. Based on the structural analysis result, this paper also proposes a method for music thumbnailing. The preliminary results obtained using a corpus of Beatles songs show that automatic structural analysis and thumbnailing of music are possible.

1. Introduction

A musical piece typically has a repetitive structure. For example, a song may have a structure of ABA, indicating a three-part compositional form in which the second section contrasts with the first section, and the third section is a restatement of the first. Methods for automatically detecting the repetitive structure of a musical piece from acoustical signals are valuable for information retrieval systems and digital libraries; for example, the result can be used for indexing the musical content or for music thumbnailing.

There has been some recent research on this topic. Dannenberg and Hu presented a method to automatically detect the repetitive structure of musical signals [6]. The process consists of searching for similar segments in a musical piece, forming clusters of similar segments, and explaining the musical structure in terms of these clusters. Three representations were investigated: monophonic pitch estimation, chroma representation, and polyphonic transcription followed by harmonic analysis.
Although the promise of this method was demonstrated in several examples, there was no quantitative evaluation of the method in their paper.

Two topics closely related to structural analysis of music have also been investigated. One is music thumbnailing (or music summarization), which aims at finding the most representative part (normally assumed to be the most repeated section) of a song. Some research on music thumbnailing deals with symbolic musical data (e.g., MIDI files and scores) [10]. There have also been studies on thumbnailing of musical signals. Logan and Chu attempted to use a clustering technique or Hidden Markov Models to find key phrases of songs. Mel cepstral features were used to characterize each song [11].

The other related topic is music segmentation. Most previous research in this area attempted to segment musical pieces by detecting the locations where a significant change of statistical properties occurs [1]. This method is more appropriate for segmenting different local events rather than segmenting the semantic components of the global structure.

Additionally, Foote proposed a representation called a similarity matrix for visualizing and analyzing the structure of audio, including symbolic music, acoustic musical signals or more general audio [7][8]. One attempt using this representation was to locate points of significant change in music (e.g., score analysis) or audio (e.g., speech/music segmentation) [8]. Bartsch and Wakefield used the similarity matrix and chroma-based features for music thumbnailing [4]. A variation of the similarity matrix was also proposed for music thumbnailing [12].

This paper describes research into automatic identification of the repetitive structure of musical pieces from acoustic signals. Specifically, an algorithm is presented that will output structural information, including both the form (e.g., AABABA) and the boundaries indicating the beginning and the end of each section. It is assumed that no prior knowledge about musical forms or the length of each section is provided, and the restatement of a section may have variations.
This assumption requires both robustness and efficiency of the algorithm. Two novel structural similarity measures are also proposed in this paper to quantitatively evaluate the performance of the algorithm, in addition to the qualitative evaluation presented by figures.

The remainder of this paper is organized as follows. Section 2 illustrates the structural analysis approach. Section 3 presents the experimental results. Section 4 explains how the structural analysis result can be used for music thumbnailing. Section 5 gives conclusions and proposes future work.

2. Approach

This section illustrates the structural analysis method, which follows five steps and is also illustrated in Figure 1:

1) Segment the signal into frames and compute the feature of each frame;
2) Segment the feature sequence into overlapped segments of fixed length and compute the repetition property of each segment using dynamic programming;
3) Detect the repetitions of each segment by finding the local minima in the dynamic programming result;
4) Merge consecutive segments that have the same repetitive property into sections and generate pairs of similar sections;
5) Segment and label the repetitive structure.

[Figure 1 flowchart: Segment the musical signal into overlapped frames (Frame 1, Frame 2, Frame 3, ..., Frame n) -> Compute the feature vector of each frame (Feature vector 1, Feature vector 2, Feature vector 3, ..., Feature vector n) -> Segment the feature vector sequence into overlapped segments (Segment 1, Segment 2, Segment 3, ..., Segment m) -> Match each segment against the feature vector sequence using dynamic programming -> Detect the repetitions of each segment -> Merge consecutive segments that have the same repetitive property into sections -> Label the repetitive structure]

Figure 1. Overview of the approach.

The following five sections explain each step in detail. All the parameter configurations are tuned based on the experimental corpus, which is described in Section 3.

2.1. Feature extraction

The algorithm first segments the signal into overlapped frames (e.g., 1024-sample window length with 512-sample overlap) and computes the feature of each frame. Two representations are investigated in this paper. One is the pitch representation, which uses autocorrelation [13] to estimate the main frequency component of each frame.
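As a rough illustration of the framing and autocorrelation-based pitch feature described above, a small Python sketch may help. It is not the authors' implementation: the function names, the 11025 Hz sampling rate used in the usage note, and the 80 to 1000 Hz search range are assumptions made only for this example. The hop of 512 samples follows the example window settings in the text.

```python
import numpy as np


def frame_signal(x, win=1024, hop=512):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def autocorr_pitch(frame, sr, fmin=80.0, fmax=1000.0):
    """Estimate the main frequency (Hz) of one frame from its autocorrelation.

    fmin and fmax are assumed search limits, not values from the paper.
    """
    frame = frame - frame.mean()
    # Keep only non-negative lags of the full autocorrelation.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(sr / fmax)                      # shortest lag considered
    hi = min(int(sr / fmin), len(ac) - 1)    # longest lag considered
    lag = lo + int(np.argmax(ac[lo:hi + 1]))
    return sr / lag
```

Applied to a 220 Hz sine sampled at 11025 Hz, `autocorr_pitch(frame_signal(x)[0], 11025)` picks a lag of about 50 samples, which corresponds to roughly 220 Hz. A real polyphonic recording would of course give a much noisier estimate.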
Although all the test data in the experiment are polyphonic, it turns out that, for musical signals with a leading vocal, this feature can still capture much information. The other representation explored is the spectral representation, i.e., FFT magnitude coefficients.

The distance between two pitch features $v_1$ and $v_2$ is defined as

$$d_p(v_1, v_2) = \frac{|v_1 - v_2|}{\text{normalization factor}} \qquad (1)$$

The distance between two spectral features $\vec{v}_1$ and $\vec{v}_2$ is defined as

$$d_f(\vec{v}_1, \vec{v}_2) = 0.5 - 0.5 \cdot \frac{\vec{v}_1 \cdot \vec{v}_2}{|\vec{v}_1| \, |\vec{v}_2|} \qquad (2)$$

In both cases, a distance value ranges between 0 and 1.

2.2. Pattern matching

After computing the feature vector $v_j$ (one-dimensional vector for the pitch representation and N-dimensional vector for the spectral representation) for each frame, the algorithm segments the feature vector sequence $V[1, n] = \{ v_j \mid j = 1, \ldots, n \}$ (n is the number of frames) into overlapped segments of fixed length l (e.g., 200 consecutive vectors with 150 vectors overlap). Since previous research has shown that dynamic programming is effective for music pattern matching [9][14], here dynamic programming is used to match each segment (i.e., $s_i = V[j, j+l-1]$) with the feature vector sequence starting from this segment (i.e., $V[j, n]$). The dynamic programming algorithm will fill in a matrix $M_i$ (i.e., the dynamic programming matrix of the i-th segment) as shown in Figure 2 based on Equation 3.

$$M_i[p, q] = \min \begin{cases} M_i[p-1, q] + e & (p \ge 1) \\ M_i[p, q-1] + e & (q \ge 1) \\ M_i[p-1, q-1] + c & (p, q \ge 1) \\ 0 & \text{o.w.} \end{cases} \qquad (3)$$

where e is the insertion or deletion cost, and c is the distance between the two corresponding feature vectors, which has been defined in Section 2.1. The last row of matrix $M_i$ is defined as function $d_i[r]$ (shown as the shaded area in Figure 2). In addition, the trace-back step of dynamic programming determines the actual alignments (i.e., the locations in $V[j, n]$ matching the beginning of $s_i$) that result in $d_i[r]$. The trace-back result is denoted as $t_i[r]$.

[Figure 2: columns are indexed by $v_j, v_{j+1}, v_{j+2}, \ldots, v_n$ and rows by $v_j, v_{j+1}, \ldots, v_{j+l-1}$; the top row is initialized to 0, 0, ..., 0 and the first column to e, 2e, ..., le; the shaded last row is $d_i[r]$.]

Figure 2. Dynamic programming matrix $M_i$.

This step is the most time-consuming one in the structural analysis algorithm; its time complexity is $O(n^2)$.

2.3. Repetition detection

This step of the algorithm detects the repetitions of each segment. To achieve this, the algorithm detects the local minima in the function $d_i[r]$ for each i, because normally a repetition of segment i will correspond to a local minimum in this function.

There are four predefined parameters in the algorithm for detecting the local minima: the width parameter w, the distance parameter d, the height parameter h, and the shape parameter p. To detect local minima of $d_i[r]$, the algorithm slides a window of width w over $d_i[r]$. Assume the index of the minimum within the window is $r_0$ with value $d_i[r_0]$, the index of the maximum within the window but left of $r_0$ is $r_1$ (i.e., $r_1 < r_0$) with value $d_i[r_1]$, and the index of the maximum within the window but right of $r_0$ is $r_2$ (i.e., $r_2 > r_0$) with value $d_i[r_2]$. If

(1) $d_i[r_1] - d_i[r_0] > h$ and $d_i[r_2] - d_i[r_0] > h$ (i.e., the local minimum is deep enough); and
(2) $\frac{d_i[r_1] - d_i[r_0]}{r_0 - r_1} > p$ or $\frac{d_i[r_2] - d_i[r_0]}{r_2 - r_0} > p$ (i.e., the local minimum is sharp enough); and
(3) no two repetitions are closer than d,

the algorithm adds the minimum into the detected repetition set. In our experiment, we set w=400, d=5, and p=0.1. Figure 3 shows the repetition detection result of one segment in the song Yesterday.

[Figure 3 plot, titled "repetition detection: Yesterday": $d_i[k]$ (vertical axis, 0 to 80) against k (horizontal axis, 0 to 3000).]

Figure 3. One-segment repetition detection result of Yesterday.
The local minima indicated by circles correspond to detected repetitions of the segment.

The repetitions detected may have add or drop errors, meaning a repetition is falsely detected or missed. For example, in Figure 3, the first, the second, the fourth and the fifth detected local minima correspond to the four restatements of the same melodic segment in the song ("here to stay", "over me", "hide away", "hide away"). However, there is an add error occurring at the third detected local minimum.

The number of add errors and that of drop errors are balanced by the predefined parameter h; whenever the local minimum is deeper than height h, the algorithm reports a detection of repetition. Thus, when h increases, there are more drop errors but fewer add errors, and vice versa. To balance these two kinds of errors, the algorithm searches within a range for the best value of h (e.g., decreasing from 10 to 5 with step -1 for the pitch representation, and decreasing from 12 to 8 with step -1 for the spectral representation), so that the number of detected repetitions of the whole song is reasonable (e.g., # detected repetitions / n ≈ 2).

For each detected minimum $d_i[r^*]$ for $s_i = V[j, j+l-1]$, let $k^* = t_i[r^*]$; thus, it is detected that the segment starting at $v_j$ is repeated at $v_{j+k^*}$. Please note that by the nature of dynamic programming, the matching part may not be of length l due to the variations in the repetition.
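The pattern matching step that produces $d_i[r]$ and $t_i[r]$ can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' code: the boundary values (a zero top row and a first column of e, 2e, ..., le) are read off Figure 2, the cost `e=0.7` is a hypothetical value because the paper does not report one, and the function names are invented for the example. Instead of an explicit trace-back pass, the sketch carries the start position of each alignment forward through the matrix, which yields the same start indices for the chosen paths.

```python
import numpy as np


def spectral_distance(v1, v2):
    """Eq. (2): 0.5 - 0.5 * cosine similarity of two spectral vectors."""
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    if denom == 0.0:
        return 0.5  # assumption: treat an all-zero frame as orthogonal
    return 0.5 - 0.5 * float(np.dot(v1, v2)) / denom


def match_segment(segment, sequence, e=0.7, dist=spectral_distance):
    """Fill the DP matrix of Eq. (3) for one segment.

    Returns (d, t): d[r] is the best cost of aligning the whole segment so
    that it ends at sequence[r]; t[r] is the index in `sequence` where that
    alignment begins. The value of e is hypothetical.
    """
    l, m = len(segment), len(sequence)
    M = np.zeros((l + 1, m + 1))
    start = np.zeros((l + 1, m + 1), dtype=int)
    M[:, 0] = e * np.arange(l + 1)   # first column: 0, e, 2e, ..., le
    start[0, :] = np.arange(m + 1)   # zero top row: a match may begin anywhere
    for p in range(1, l + 1):
        for q in range(1, m + 1):
            c = dist(segment[p - 1], sequence[q - 1])
            options = (M[p - 1, q - 1] + c, M[p - 1, q] + e, M[p, q - 1] + e)
            k = int(np.argmin(options))
            M[p, q] = options[k]
            start[p, q] = (start[p - 1, q - 1], start[p - 1, q], start[p, q - 1])[k]
    return M[l, 1:], start[l, 1:]
```

With a toy sequence in which a three-vector segment occurs at positions 0 and 5, the returned `d` is zero at the two positions where the segment ends, and `t` reports 0 and 5 as the corresponding start indices. The double loop makes the quadratic cost of this step visible.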

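The three conditions used for repetition detection can likewise be sketched in Python. The sketch assumes a plain sliding window that advances one sample at a time, which the paper does not specify. The parameter `min_gap` stands for the paper's distance parameter d (renamed here to avoid clashing with the function values), and the default `h=10.0` is simply the upper end of the search range quoted for the pitch representation. The function name is hypothetical.

```python
import numpy as np


def detect_repetitions(d, w=400, min_gap=5, h=10.0, p=0.1):
    """Return sorted indices of local minima of d that pass all three tests."""
    d = np.asarray(d, dtype=float)
    found = []
    for start in range(0, max(len(d) - w, 0) + 1):
        end = min(start + w, len(d))
        r0 = start + int(np.argmin(d[start:end]))
        if r0 == start or r0 == end - 1:
            continue  # the minimum needs samples on both sides
        r1 = start + int(np.argmax(d[start:r0]))        # maximum left of r0
        r2 = r0 + 1 + int(np.argmax(d[r0 + 1:end]))     # maximum right of r0
        left, right = d[r1] - d[r0], d[r2] - d[r0]
        deep = left > h and right > h                               # test (1)
        sharp = left / (r0 - r1) > p or right / (r2 - r0) > p       # test (2)
        spaced = all(abs(r0 - r) >= min_gap for r in found)         # test (3)
        if deep and sharp and spaced:
            found.append(r0)
    return sorted(found)
```

Because the same minimum is seen by many overlapping windows, the spacing test also prevents it from being reported more than once. Searching over h as described in the text would amount to calling this function repeatedly with decreasing `h` until the total number of detections looks reasonable.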
2.4. Segment merging

The algorithm merges consecutive segments that have the same repetitive property into sections and generates pairs of similar sections in this step.

[Figure 4 plot, titled "repetition detection (whole song): Yesterday": k (vertical axis, 0 to 2000) against j (horizontal axis, 0 to 1800).]

Figure 4. Whole-song repetition detection result of Yesterday. A circle or a square at location (j, k) indicates that the segment starting at $v_j$ is detected to repeat at $v_{j+k}$.

Figure 4 shows the repetition detection result of the song Yesterday after this step. In this figure, a circle or a square at (j, k) corresponds to a repetition detected in the last step (i.e., the segment starting at $v_j$ is repeated at $v_{j+k}$). Since typically one musical phrase consists of multiple segments, based on the configurations in previous steps, if one segment in a phrase is repeated by a shift of k, all the segments in this phrase are repeated by shifts roughly equal to k. This phenomenon can be seen from Figure 4, where the squares form horizontal patterns indicating that consecutive segments have roughly the same shifts. By detecting these horizontal patterns (denoted by squares in Figure 4) and discarding other detected repetitions (denoted by circles in Figure 4) obtained from the third step, the effects of add/drop errors are further reduced.

The output of this step is a set of merged sections in terms of tuples $\langle j_1, j_2, \text{shift} \rangle$, indicating that the segment starting at $v_{j_1}$ and ending at $v_{j_2}$ repeats roughly from $v_{j_1 + \text{shift}}$ to $v_{j_2 + \text{shift}}$. Each tuple corresponds to one horizontal pattern in the whole-song repetition detection result. For example, the tuple corresponding to the left-bottom horizontal pattern in Figure 4 is $\langle 100, 450, 370 \rangle$. Since the shifts of repetitions may not be exactly the same for segments in the merged section, the shift of the whole section is the average value.

2.5. Structure labeling

Based on the tuples obtained from the fourth step, the last step of the algorithm segments the whole piece into sections and labels each section according to the repetitive relation (i.e., gives each section a symbol such as A, B, etc.). Thus, this step will output the structural information, including both the form (e.g., AABABA) and the boundaries indicating the beginning and the end of each section.

To resolve conflicts that might occur, the rule for labeling is to always label the most frequently repeated section first. Specifically, the algorithm finds the most frequently repeated section based on the first two columns in the tuples, and labels it and its shifted versions as section A. Then the algorithm deletes the tuples already labeled, repeats the same procedure for the remaining tuples, and labels sections produced in each step as B, C, D and so on. If conflicts occur (e.g., a later labeled section overlaps the previously labeled sections), the previously labeled sections will always remain intact and the current section will be truncated.

3. Experiment and evaluation

This section presents the experimental results and evaluations of the structural analysis approach.

3.1. Data set

The experimental corpus consists of the 26 Beatles songs in the two CDs The Beatles (1962-1966). All these songs have clear repetitive structures and a leading vocal. The data were mixed to 8-bit mono and down-sampled to 11 kHz.

3.2. Measures of structural similarity

Figure 5. Comparison of the computed structure (above) using the pitch representation and the ideal structure (below) of Yesterday. Sections in the same color indicate restatements of the section. Sections in the lightest grey correspond to the sections with no repetition.

To qualitatively evaluate the results, figures as shown in Figure 5 are used to compare the structure obtained from the algorithm with the ideal structure obtained by manually labeling the repetition. This paper also proposes two measures of structural similarity to quantitatively evaluate the result. Both of the measures need to be as small as possible, ideally equal to zero.

Measure 1 (structural measure) is defined as the edit distance between the strings representing different forms. For the example in Figure 5, the distance between the ideal structure AABABA and the computed structure AABBABA is 1, indicating one insertion. Here how the algorithm labels each section is not important as long as the repetitive relation is the same; thus, this ideal structure is deemed as equivalent (0-distance) to structure BBABAB, or structure AACACA.

Measure 2 (boundary measure) is mainly used to evaluate how accurate the boundaries of each section are. It is defined as

$$BM = (1 - r) / s \qquad (4)$$

where r is the ratio of the length of parts where both structures have the same labeling to the whole length, and s is the number of the repetitive sections in the ideal structure.

3.3. Results

Figure 6 and Figure 7 show the structural and boundary measures of the experimental results using both the pitch representation and the spectral representation. In Figure 7, the baseline results corresponding to labeling the whole song as a single section are also plotted for a comparison.

[Figure 7 plots, titled "Boundary Measures of 26 Beatles' Songs (pitch representation)" and "Boundary Measures of 26 Beatles' Songs (FFT representation)": BM (vertical axis, 0 to 0.18) against song id (horizontal axis).]

Figure 7. Boundary measures of the 26 Beatles songs. The solid lines with circle markers correspond to the computed results (above: pitch representation; below: spectral representation). The dotted lines with x markers correspond to the baseline.

It is easily seen from the above figures that the performance of the third, the eighth and the ninth songs using the pitch representation is the best (the structural measures are 0 and the boundary measures are low).
For example, the result of the third song From me to you using the pitch representation is shown in Figure 8.

[Figure 6 plot: structural measure against song id (horizontal axis, 0 to 25).]

Figure 6. Structural measures of the 26 Beatles songs. The solid line with circle markers corresponds to the pitch representation results. The dashed line with square markers corresponds to the spectral representation results.

Figure 8. Comparison of the computed structure (above) using the pitch representation and the ideal structure (below) of From me to you.

One of the worst performances is that of the seventeenth song Day tripper using the pitch representation, whose result is shown in Figure 9.
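For readers who want to experiment with the two evaluation measures defined in Section 3.2, the following Python sketch shows one way they might be computed. Several choices are assumptions rather than the paper's procedure: label independence is approximated by relabeling each form in order of first appearance before taking the edit distance, the two structures are given as equal-length per-frame label lists for the boundary measure, and the count s is passed in by the caller because the text does not spell out how it is obtained. All function names are hypothetical.

```python
def canonical_form(form):
    """Relabel sections by order of first appearance, e.g. 'BBABAB' -> 'AABABA'."""
    mapping = {}
    out = []
    for symbol in form:
        if symbol not in mapping:
            mapping[symbol] = chr(ord("A") + len(mapping))
        out.append(mapping[symbol])
    return "".join(out)


def edit_distance(a, b):
    """Levenshtein distance with unit insertion, deletion and substitution costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def structural_measure(ideal, computed):
    """Measure 1: edit distance between the relabeled form strings."""
    return edit_distance(canonical_form(ideal), canonical_form(computed))


def boundary_measure(ideal_labels, computed_labels, s):
    """Measure 2, Eq. (4): BM = (1 - r) / s.

    r is the fraction of frames that carry the same label in both structures.
    """
    if len(ideal_labels) != len(computed_labels) or len(ideal_labels) == 0:
        raise ValueError("label sequences must be non-empty and equally long")
    same = sum(x == y for x, y in zip(ideal_labels, computed_labels))
    return (1.0 - same / len(ideal_labels)) / s
```

For the example given in the text, `structural_measure("AABABA", "AABBABA")` evaluates to 1, and both `"BBABAB"` and `"AACACA"` are at distance 0 from `"AABABA"`. Relabeling by first appearance is only a heuristic; it does not search over all possible label assignments.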

Figure 9. Comparison of the computed structure (above) using the pitch representation and the ideal structure (below) of Day tripper.

Some interesting results also occur. For example, for the twelfth song Ticket to ride, although the computed structure using the spectral representation is different from the ideal structure as shown in Figure 10, it also looks reasonable if section A in the computed structure is seen as the combination of section A and section B in the ideal structure.

Figure 10. Comparison of the computed structure (above) using the spectral representation and the ideal structure (below) of Ticket to ride.

3.4. Discussions

The experimental result shows that, by either the pitch representation or the spectral representation, 15 out of 26 songs have structural measures less than or equal to 2 (Figure 6) and the results of all the songs have boundary measures better than the baseline (Figure 11). This demonstrates the promise of the method.

[Figure 11 plot, titled "Boundary Measures of 26 Beatles' Songs (min)": BM (vertical axis, 0 to 0.18) against song id (horizontal axis).]

Figure 11. Boundary measures of the 26 Beatles songs. The solid line with circle markers corresponds to the best computed result for each song using either the pitch representation or the spectral representation. The dotted line with x markers corresponds to the baseline.

The result does not show that one representation is significantly superior to the other. However, for each song, the result of one representation is often better than the result of the other. This indicates that the representation does play an important role in performance and that investigating other feature representations might help improve the accuracy of the algorithm.

One can notice that, even for the song with the best performance, the computed boundaries of each section were slightly shifted from the ideal boundaries. This was mainly caused by the inaccuracy of the approximate pattern matching. To tackle this problem, other musical features (e.g., chord progressions, change in dynamics, etc.)
can be used to detect local events so as to locate the boundaries accurately. In fact, this problem suggests that computing only the repetitive relation might not be sufficient for finding the semantic structure. The position of phrase boundaries in tonal melodies relates to a number of interacting musical factors. The most obvious determinants of musical phrases are the standard chord progressions known as cadence. Other factors include surface features such as relatively large interval leaps, change in dynamics, and micropauses ("grouping preference rules"), and repeated musical patterns in terms of harmony, rhythm and melodic contour. [3]

4. Music thumbnailing via structural analysis

The problem of music thumbnailing aims at finding the most representative part of a song, which can be used for music browsing and searching. It would be helpful if the song has been segmented into semantically meaningful sections before summarization, because, although what makes a part of a song most memorable is not clear, this part often appears at particular locations within the structure, e.g., the beginning or the end part of each section. For example, among the 26 Beatles songs, 6 songs have the song titles in the first sentence of a section; 9 songs have them in the last sentence of a section; and 10 songs have them in both the first and the last sentences of a section. For many pop/rock songs, titles are contained in the hook sentences. This information is very useful for music thumbnailing: once we have the structure of a song, a straightforward strategy is to choose the beginning or the end part of the most repeated section as the summary of the music. For example, if ten-second summaries are wanted, Table 1 shows the performance of this strategy based on the criteria proposed by Logan and Chu [11]. The four columns in Table 1 indicate the percentage of summaries that contain a vocal portion, contain the song's title, are the beginning of a section, and are the beginning of a phrase.
The algorithm first finds the most repeated sections, selects the first section among these and uses its first ten seconds as the summary of the song.

The thumbnailing result using this method highly depends on the accuracy of the structural analysis result. For example, the summary of the song From me to you using the pitch representation is "If there's anything that you want; If there's anything I can do; Just call"; the summary of the song Yesterday using the pitch representation is "Yesterday, all my troubles seemed so far away; Now it looks as though". Both of the summaries start right at the beginning of the most repetitive section. However, the summary of the song Day tripper using the pitch representation does not contain any vocal portion due to the poor structural analysis result of this song.

Table 1: Thumbnailing results of the 26 Beatles songs

         Vocal   Title   Beginning of a section   Beginning of a phrase
Pitch    85%     23%     50%                      58%
FFT      88%     35%     46%                      58%

5. Conclusions and future work

This paper presents an algorithm for automatically analyzing the repetitive structure of music from acoustic signals. Preliminary results were evaluated both qualitatively and quantitatively, which demonstrate the promise of the proposed method. To improve the accuracy, more representations need to be investigated. The possibility of generalizing this method to other music genres should also be explored. Additionally, inferring the hierarchical repetitive structures of music and identifying the functionality of each section within the structure would be a more complicated yet interesting topic.

Music segmentation, thumbnailing and structural analysis are three coupled problems. Discovery of effective methods for solving any one of the three problems will benefit the other two. Furthermore, the solution to any of them depends on the study of humans' perception of music, for example, what makes part of music sound like a complete phrase and what makes it memorable or distinguishable. Human experiments are always necessary for exploring such questions.
Finally, while most previous research on music digital libraries was based on symbolic representations of music (e.g., MIDI, scores), this paper attempts to address the structural analysis problem of acoustic musical data. We believe that current automatic music transcription techniques are still far from robust and efficient, and thus analyzing acoustic musical data directly without transcription is of great application value for indexing the digital music repository, segmenting music at transitions, and summarizing the thumbnails of music, all of which will benefit the users' browsing and searching in a music digital library.

6. References

[1] J.J. Aucouturier and M. Sandler, "Segmentation of Musical Signals using Hidden Markov Models," In Proc. AES 110th Convention, May 2001.
[2] J.J. Aucouturier and M. Sandler, "Using Long-Term Structure to Retrieve Music: Representation and Matching," In Proc. International Symposium on Music Information Retrieval, Bloomington, IN, 2001.
[3] M. Balaban, K. Ebcioglu, and O. Laske, Understanding Music with AI: Perspectives on Music Cognition, Cambridge: MIT Press; Menlo Park: AAAI Press, 1992.
[4] M.A. Bartsch and G.H. Wakefield, "To Catch a Chorus: Using Chroma-based Representations for Audio Thumbnailing," In Proc. Workshop on Applications of Signal Processing to Audio and Acoustics, 2001.
[5] W.P. Birmingham, R.B. Dannenberg, G.H. Wakefield, M. Bartsch, D. Bykowski, D. Mazzoni, C. Meek, M. Mellody, and W. Rand, "MUSART: Music Retrieval via Aural Queries," In Proc. International Symposium on Music Information Retrieval, Bloomington, IN, 2001.
[6] R.B. Dannenberg and N. Hu, "Pattern Discovery Techniques for Music Audio," In Proc. International Conference on Music Information Retrieval, October 2002.
[7] J. Foote, "Visualizing Music and Audio using Self-Similarity," In Proc. ACM Multimedia Conference, Orlando, FL, pp. 77-80, 1999.
[8] J. Foote, "Automatic Audio Segmentation using a Measure of Audio Novelty," In Proc. of IEEE International Conference on Multimedia and Expo, vol. I, pp. 452-455, 2000.
[9] J. Foote, "ARTHUR: Retrieving Orchestral Music by Long-Term Structure," In Proc. International Symposium on Music Information Retrieval, October 2000.
[10] J.L. Hsu, C.C. Liu, and L.P. Chen, "Discovering Nontrivial Repeating Patterns in Music Data," IEEE Transactions on Multimedia, Vol. 3, No. 3, pp. 311-325, September 2001.
[11] B. Logan and S. Chu, "Music Summarization using Key Phrases," In Proc. International Conference on Acoustics, Speech and Signal Processing, 2000.
[12] G. Peeters, A. L. Burthe and X. Rodet, "Toward Automatic Music Audio Summary Generation from Signal Analysis," In Proc. International Conference on Music Information Retrieval, October 2002.
[13] C. Roads, The Computer Music Tutorial, MIT Press, 1996.
[14] C. Yang, "Music Database Retrieval Based on Spectral Similarity," In Proc. International Symposium on Music Information Retrieval, 2001.