Time delay estimation of reverberant meeting speech: on the use of multichannel linear prediction

University of Wollongong Research Online
Faculty of Informatics - Papers (Archive), Faculty of Engineering and Information Sciences, 2007

Eva Cheng, University of Wollongong, e4@uow.edu.au
I. Burnett, Faculty of Informatics, University of Wollongong, ianb@uow.edu.au
Christian Ritz, University of Wollongong, ritz@uow.edu.au

Publication Details
E. Cheng, I. S. Burnett & C. H. Ritz, "Time delay estimation of reverberant meeting speech: on the use of multichannel linear prediction", in International Conference on Signal Image Technology & Internet Based Systems (SITIS '07), 2007, pp. 494-5.

Research Online is the open access institutional repository for the University of Wollongong. For further information contact the UOW Library: research-pubs@uow.edu.au

Abstract
Effective and efficient access to multiparty meeting recordings requires techniques for meeting analysis and indexing. Since meeting participants are generally stationary, speaker location information may be used to identify meeting events, e.g., to detect speaker changes. Time-delay estimation (TDE) utilizing cross-correlation of multichannel speech recordings is a common approach for deriving speech source location information. Recent research improved TDE by calculating TDE from linear prediction (LP) residual signals obtained from LP analysis on each individual speech channel. This paper investigates the use of LP residuals for speech TDE, where the residuals are obtained from jointly modeling the multiple speech channels. Experiments conducted with a simulated reverberant room and real room recordings show that jointly modeled LP better predicts the LP coefficients, compared to LP applied to individual channels. Both the individually and jointly modeled LP exhibit similar TDE performance, and outperform TDE on the speech alone, especially with the real recordings.

Disciplines
Physical Sciences and Mathematics

This conference paper is available at Research Online: http://ro.uow.edu.au/infopapers/38

Third International IEEE Conference on Signal-Image Technologies and Internet-Based System

Time Delay Estimation of Reverberant Meeting Speech: On the Use of Multichannel Linear Prediction

E. Cheng, I. S. Burnett, C. Ritz
Whisper Laboratories, School of Electrical, Computer and Telecommunications Engineering
University of Wollongong, Wollongong NSW Australia 2522
[e4, ianb, ritz]@uow.edu.au

1. Introduction

Multiparty meetings occur in many government, business, research, and educational environments. Recent research has focused on techniques for efficient and effective access to offline meeting recordings [1]. Analysis of meeting events is fundamental to offline access of the recordings, and Lathoud et al. proposed the use of speaker location information for meeting speech segmentation [2]. Meeting participants generally remain stationary, and thus speaker location information can be used to analyze the meeting events for subsequent indexing and segmentation.
Speech, the dominant audio source in a meeting, may be localized using a number of techniques. Time-Delay Estimation (TDE) is a popular technique for deriving speech source location information: room acoustic effects common to meeting environments, such as reverberation and background noise, may be mitigated through frequency-domain weighting [3]. The application of weighted TDE defines the Generalized Cross Correlation (GCC) [3]. One particular form of weighting, GCC with Phase Transform (GCC-PHAT), has been shown to reliably derive TDE from reverberant speech.

Recent research has achieved more accurate TDE by applying GCC to the speech linear prediction (LP) residual, compared to GCC-PHAT on the original multichannel speech [4]. These approaches, however, do not jointly model the LP between channels, as recently used for multichannel dereverberation of speech [5]. This paper proposes to combine these two areas of research to investigate the use of joint LP models for TDE. The proposed approach is compared to individually optimized (on a per-channel basis) LP and to using the multichannel speech alone for TDE.

In the remainder of this paper, Section 2 outlines the proposed system of using a multichannel LP model front-end to GCC-based TDE. Section 3 describes the simulated and real meeting recordings used in experiments. The results are presented and analyzed in Section 4, with Section 5 concluding this paper.

2. Proposed System

Fig. 1 illustrates the proposed paradigm of using multichannel linear prediction (LP) analysis on meeting speech (recorded with a microphone array) as a front-end to time-delay estimation utilizing GCC techniques. In the proposed system, the meeting consists of five participants equally spaced in a circle of 3m in diameter. The meeting speech is then recorded by four microphones placed in the centre of the circle, as illustrated in Fig. 2.

Fig. 1. Proposed approach

Fig. 2. Meeting room setup

2.1. Single Channel Linear Prediction

Since speech is the dominant audio source in multiparty meetings, Linear Prediction (LP) is employed for speech analysis in the proposed system. In LP, samples in the speech signal are predicted as a weighted sum of the past $P$ samples, where $P$ is the predictor order. The error (or residual) signal for each channel $c$, $e_c[n]$, is defined as the difference between the original ($s_c[n]$) and predicted ($\hat{s}_c[n]$) speech signal. The LP analysis procedure is mathematically represented as:

$$\hat{s}_c[n] = \sum_{k=1}^{P} a_{k,c}\, s_c[n-k]; \qquad e_c[n] = s_c[n] - \hat{s}_c[n] \tag{1}$$

The summing weights, $a_{k,c}$, known as linear prediction coefficients, are calculated to minimize the energy, $E_c$, of the error signal $e_c[n]$:

$$E_c = \sum_{n} e_c^2[n] = \sum_{n} \Big( s_c[n] - \sum_{k=1}^{P} a_{k,c}\, s_c[n-k] \Big)^2 \tag{2}$$

Eq. (2) is minimized by setting $\partial E_c / \partial a_{k,c} = 0$ for $k = 1, 2, \ldots, P$, which reduces to the linear equation set:

$$\sum_{n} s_c[n-i]\, s_c[n] = \sum_{k=1}^{P} a_{k,c} \sum_{n} s_c[n-i]\, s_c[n-k] \tag{3}$$

for $i = 1, 2, \ldots, P$. Using the autocorrelation function $R_c[i]$ of $s_c[n]$, Eq. (3) can be reduced to (where $N$ is the length of the analysis window):

$$R_c[i] = \sum_{k=1}^{P} a_{k,c}\, R_c[|i-k|], \quad \text{where } R_c[i] = \sum_{n=i}^{N-1} s_c[n]\, s_c[n-i], \quad \text{for } i = 1, 2, \ldots, P. \tag{4}$$

2.2. Multichannel Linear Prediction

To extend the concepts of single channel LP to multiple speech channels, Gaubitch et al. proposed the use of an averaged (across channels) autocorrelation matrix, $R_{avg}$, instead of $R_c$ in Eq. (4) [6]:

$$R_{avg}[i] = \sum_{k=1}^{P} a_k\, R_{avg}[|i-k|], \quad \text{where } R_{avg} = \frac{1}{C} \sum_{c=1}^{C} R_c, \quad \text{for } i = 1, 2, \ldots, P, \tag{5}$$

and $C$ is the number of channels. This paper adopts the approach in [6] to implement multichannel LP for the purposes of TDE. The Levinson-Durbin recursion algorithm is used to solve Eqs. (4) and (5) for $a_{k,c}$ and $a_k$, the individual and joint LP models, respectively. Each of the multichannel speech signals is then filtered with the (individual or joint) LP model to obtain the LP residual signal, following Eq. (1).
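The individual and joint LP analyses of Eqs. (1), (4) and (5) can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`autocorr`, `levinson_durbin`, `lp_residual`, `individual_lp`, `joint_lp`) are hypothetical, the predictor order is left as a parameter, and framing and windowing are assumed to have been applied by the caller.

```python
import numpy as np

def autocorr(frame, order):
    """Autocorrelation R[0..order] of one analysis frame (Eq. 4)."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    return np.array([np.dot(frame[i:], frame[:n - i]) for i in range(order + 1)])

def levinson_durbin(r, order):
    """Solve R[i] = sum_k a_k R[|i-k|], i = 1..order, for a_1..a_order.

    Assumes a non-silent frame, i.e. r[0] > 0.
    """
    a = np.zeros(order + 1)          # a[k] weights s[n-k]; a[0] is unused
    err = r[0]                       # prediction error energy at order 0
    for i in range(1, order + 1):
        # reflection coefficient for stage i
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err
        prev = a.copy()
        a[i] = k
        a[1:i] = prev[1:i] - k * prev[i - 1:0:-1]
        err *= 1.0 - k * k
    return a[1:]

def lp_residual(frame, a):
    """e[n] = s[n] - sum_k a_k s[n-k] (Eq. 1), with zero initial state."""
    frame = np.asarray(frame, dtype=float)
    inverse_filter = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    return np.convolve(frame, inverse_filter)[:len(frame)]

def individual_lp(frames, order):
    """One coefficient set per channel (Eq. 4). frames has shape (C, N)."""
    return [levinson_durbin(autocorr(f, order), order) for f in frames]

def joint_lp(frames, order):
    """One shared coefficient set from the channel-averaged autocorrelation (Eq. 5)."""
    r_avg = np.mean([autocorr(f, order) for f in frames], axis=0)
    return levinson_durbin(r_avg, order)
```

For a `(C, N)` array of windowed frames, `joint_lp(frames, order)` returns the shared coefficients, and `lp_residual(frames[c], a)` gives the residual of channel `c` that would be passed on to the TDE stage.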
An alternative technique to jointly model LP across multiple channels is to average the Line Spectral Frequencies (LSFs), where LSFs are an alternative representation of $a_{k,c}$. The prediction-error filter of Eq. (1) can be expressed in the z-domain as:

$$A_c[z] = 1 - \sum_{k=1}^{P} a_{k,c}\, z^{-k} = \frac{1}{2} \big( P_c[z] + Q_c[z] \big) \tag{6}$$

where $P_c[z]$ and $Q_c[z]$ are the sum and difference equations (the polynomial $P_c[z]$ is distinct from the predictor order $P$):

$$P_c[z] = A_c[z] + z^{-(P+1)} A_c[z^{-1}], \qquad Q_c[z] = A_c[z] - z^{-(P+1)} A_c[z^{-1}] \tag{7}$$

The LSFs are defined as the polynomial roots of $P_c[z]$ and $Q_c[z]$ in Eq. (7). The LSF representation of the LP coefficients is widely used in speech coding, e.g., for interpolating the LP coefficients, due to the robustness of the LSFs to quantization noise, where other representations of the LP coefficients can result in filter instability. It is for these reasons that this paper proposes averaging the LSFs obtained from each channel as an alternative method to form the jointly modeled LP coefficients. The roots of $P_c[z]$ and $Q_c[z]$ are found by Chebyshev polynomial methods [7], and averaged across the channels to form the averaged LSFs. The averaged LSFs are then converted back to $a_k$ using Eq. (6), for subsequent filtering to obtain the LP residuals using Eq. (1). Finally, the computational complexities of averaging autocorrelation matrices and LSFs are comparable, to enable fair comparisons between the two methods.

2.3. Time-Delay Estimation

Generalized Cross Correlation (GCC) is a technique commonly applied to deriving TDE from two microphone channels [3]. Mathematically, GCC is given by:

$$\hat{G}_{x_1 x_2}[k] = \frac{X_1[k]\, X_2^*[k]}{W[k]} \tag{8}$$

where the Discrete Fourier Transforms (DFT) of the multichannel signals $x_1[n]$ and $x_2[n]$ are denoted by $X_1[k]$ and $X_2[k]$, and the frequency-domain weighting function, $W[k]$, is chosen depending on the signal and noise characteristics. Using the Inverse Discrete Fourier Transform (IDFT), the phase correlation function is given by:

$$\hat{R}_{x_1 x_2}[\tau] = \mathrm{IDFT}\big( \hat{G}_{x_1 x_2}[k] \big) \tag{9}$$

The TDE, $\hat{\tau}$, is calculated as the maximum of $\hat{R}_{x_1 x_2}[\tau]$:

$$\hat{\tau} = \arg\max_{\tau} \hat{R}_{x_1 x_2}[\tau] \tag{10}$$

To minimize erroneous TDE values, the search range of delays is constrained to an interval $-D \le \hat{\tau} \le D$, where $D$ is generally determined by the physical arrangement of the microphones. The frequency-domain weighting function, $W[k]$, shown to be most robust to reverberant speech with low levels of noise is the PHAse Transform (PHAT), which leads to the GCC-PHAT technique [3]:

$$W[k] = \big| X_1[k]\, X_2^*[k] \big| \tag{11}$$

In this paper, GCC-PHAT is applied to the reverberant speech, while simple cross-correlation (CC), or GCC with $W[k] = 1$ for all frequencies, is applied to the LP residual to extract the TDE. GCC-PHAT does not offer an advantage for LP residual signals since the PHAT weighting flattens the cross-spectrum, and the spectrum of LP residual signals is relatively flat by nature. To apply TDE to multiple channels, GCC is calculated for each channel pair. In this paper, four microphones are deployed (see Section 3), which defines six possible microphone pairs and thus six TDE calculations.

3. Meeting Recordings

Five loudspeakers equally spaced in a circle of 3m in diameter simulated active meeting participants. As illustrated in Fig. 2, the recording setup was modeled using Allen and Berkley's image method [8], with reverberation times ($T_{60}$) from anechoic ($T_{60} = 0$) to $T_{60} = 1$ second. Most office spaces generally exhibit a reverberation time of 300 ms. To evaluate the proposed system with ideal (voiced) speech source signals for LP analysis, the five English vowels ('a', 'e', 'i', 'o', 'u') of approx. ms in duration were synthesized using the Prosynth software, which employs a hierarchical phonological structure for speech synthesis [9]. Vowels were sampled at 16kHz, and stored at 16 bits/sample. To simulate a meeting using the image method room model, the vowels were played from the five source locations and recorded with the four omnidirectional microphones, as defined in the room model of Fig. 2.

Recordings were then made in a real reverberant acoustic environment of approx. 300ms reverberation time with background noise. The synthetic vowels were played in turn from the five loudspeakers (Genelec 9A) and recorded by omnidirectional microphones (RØDE NTA) arranged to match the room model and Fig. 2.

4. Results

To ensure real-time updates to the TDE are viable with the system proposed in this paper, 3ms Hamming windowed analysis frames are employed with 5% overlap between adjacent frames. As detailed in Section 3, the recorded speech is sampled at 16kHz, which leads to an LP order of for Eq. (1). To evaluate the proposed system, a number of performance metrics are used. All graphs presented in this section exhibit 95% confidence intervals over the specified mean of the following performance metrics:

Itakura distance shows the deviation between the LP autocorrelation coefficients under test, $\hat{a}_k$, and the clean speech coefficients $a_k$ (obtained from the anechoic speech in this paper):

$$d_I = \log \left( \frac{\hat{a}_k^{T} R\, \hat{a}_k}{a_k^{T} R\, a_k} \right) \tag{12}$$

where $R$ is the autocorrelation matrix. Thus, the smaller the Itakura distance, the closer the estimated LP autocorrelation coefficients are to the ideal case.
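As a worked illustration of Eq. (12), the Itakura distance can be sketched as below. Several details are assumptions, since the text does not spell them out: the coefficient vectors are taken to be the prediction-error filters $[1, -a_1, \ldots, -a_P]$ implied by Eq. (1), $R$ is taken to be the $(P+1) \times (P+1)$ symmetric Toeplitz autocorrelation matrix of the reference (anechoic) frame, and the natural logarithm is used. The name `itakura_distance` is hypothetical.

```python
import numpy as np

def itakura_distance(a_test, a_ref, r):
    """Itakura distance of Eq. (12).

    a_test, a_ref : predictor coefficients a_1..a_P (convention of Eq. 1)
    r             : autocorrelation R[0..P] of the reference frame (assumption)
    """
    a_test = np.asarray(a_test, dtype=float)
    a_ref = np.asarray(a_ref, dtype=float)
    r = np.asarray(r, dtype=float)
    order = len(a_ref)
    # Symmetric Toeplitz matrix with entries R[|i - j|]
    idx = np.abs(np.subtract.outer(np.arange(order + 1), np.arange(order + 1)))
    R = r[idx]
    # Prediction-error filter vectors [1, -a_1, ..., -a_P]
    v_test = np.concatenate(([1.0], -a_test))
    v_ref = np.concatenate(([1.0], -a_ref))
    return float(np.log((v_test @ R @ v_test) / (v_ref @ R @ v_ref)))
```

Under these assumptions the distance is zero when the two coefficient sets coincide, and non-negative whenever `a_ref` is the least-squares solution for `r`, because the denominator is then the minimum achievable residual energy.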

Prediction gain is the ratio of the anechoic signal energy to the LP residual energy. Thus, the larger the prediction gain, the more accurately the LP models the vocal tract, since the residual energy is low.

TDE Root Mean Squared Error (RMSE) indicates the root mean square error of the TDE under study from the ground truth time delay, which is known from the microphone and speaker configuration (see Fig. 2):

$$\text{TDE RMSE} = \sqrt{ \frac{1}{M X} \sum_{m=1}^{M} \sum_{x=1}^{X} \big( \hat{\tau}[x,m] - \tau[x,m] \big)^2 } \tag{13}$$

where $M$ is the number of possible microphone pairs, $\hat{\tau}[x,m]$ is the TDE, and $\tau[x,m]$ is the ground-truth time delay. Since it is known that the synthetic vowels are voiced and minimally time-varying, the TDE RMSE metric is averaged across time, where $X$ in Eq. (13) is the number of frames in the signal. Thus, the lower the RMSE, the more accurate and reliable the time delay estimation.

For the room modeling results below (Sections 4.1 and 4.2), the results are averaged across the five synthetic vowels to evaluate the TDE performance across increasing reverberation time and also to evaluate the system performance with different voiced signals. In the following sections, the Itakura distance and prediction gain performance metrics are utilized to compare the performances of TDE calculated from individually and jointly modeled LP residuals.

4.1. Autocorrelation Matrix Averaging

Fig. 3a shows the Itakura distance for the individually modeled microphone channels (solid lines), and for the joint LP model (dashed lines). It is clear that the joint LP model consistently outperforms the individual models with a lower Itakura distance across all reverberation times. These results confirm the statistical analyses and simulations of [6]: for a synthetic vowel signal, the joint LP model derives LP autocorrelation coefficients, $a_k$, that better match the ideal set of coefficients. In contrast, Fig. 3b illustrates the prediction gain for the four microphone channels, individually (solid line) and jointly (dotted line) modeled.
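The delay estimator of Eqs. (8) to (11), the prediction gain, and the TDE RMSE of Eq. (13) can be sketched together as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' code: the function names are hypothetical, the FFT is zero-padded to avoid circular wrap-around, a small floor guards the PHAT division, and the prediction gain is expressed in dB to match the figure axes. With $X_1 X_2^*$ in the numerator of Eq. (8), a negative result means the second channel lags the first.

```python
import numpy as np

def gcc_delay(x1, x2, max_delay, phat=True):
    """Delay estimate in samples (Eqs. 8-10), searched over [-max_delay, max_delay].

    phat=True uses W[k] = |X1[k] X2*[k]| (Eq. 11); phat=False uses W[k] = 1 (plain CC).
    Returns the lag tau maximising sum_n x1[n + tau] x2[n], which is negative
    when x2 is a delayed copy of x1.
    """
    n = len(x1) + len(x2)                      # zero-padding avoids circular wrap-around
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    G = X1 * np.conj(X2)
    if phat:
        G = G / np.maximum(np.abs(G), 1e-12)   # assumed floor against division by zero
    r = np.fft.irfft(G, n)
    lags = np.arange(-max_delay, max_delay + 1)
    return int(lags[np.argmax(r[lags % n])])   # negative lags wrap to the end of r

def tde_rmse(estimates, truth):
    """Eq. (13): RMSE over X frames and M microphone pairs; inputs have shape (X, M)."""
    diff = np.asarray(estimates, dtype=float) - np.asarray(truth, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def prediction_gain_db(anechoic, residual):
    """Ratio of anechoic signal energy to LP residual energy, in dB."""
    num = np.sum(np.asarray(anechoic, dtype=float) ** 2)
    den = np.sum(np.asarray(residual, dtype=float) ** 2)
    return float(10.0 * np.log10(num / den))
```

In the setting described here, `gcc_delay(..., phat=True)` would be applied to each pair of reverberant speech frames and `gcc_delay(..., phat=False)` to each pair of LP residual frames, with `max_delay` set from the microphone spacing; the per-frame, per-pair estimates then form the `(X, M)` array passed to `tde_rmse`.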
Although the jointly modeled LP coefficients better match the ideal set of coefficients (see Fig. 3a), when filtered with each channel of the reverberant speech to obtain the LP residual, there is little difference shown by either LP model in the prediction gain. Fig. 3c illustrates the TDE performance from the reverberant speech GCC-PHAT and the individually modeled LP residual GCC. It can be clearly seen that for reverberation times less than 600ms, the LP residual provides a more reliable TDE vector (across the six channel pairs, averaged in Fig. 3c) with a consistently lower TDE RMSE. As reverberation increases, however, the speech GCC-PHAT TDE exhibits slightly lower RMSE than the LP residual GCC (both individually and jointly modeled). At higher reverberation times, although the jointly modeled LP coefficients are extracted accurately compared to the individually modeled channels (see Fig. 3a), upon filtering each reverberant channel with the LP coefficients the residual can contain significant amounts of reverberation [6]. As is the case with speech, reverberation can introduce erroneous peaks into the GCC function, which in turn lead to erroneous TDE.

Fig. 3c also compares the TDE RMSE from the individually (dashed line) and jointly (dotted line) modeled LP residuals. It can be seen that the jointly modeled LP increasingly improves the TDE reliability over the individually modeled channels as reverberation time increases past 400ms. However, the improvement is less than one sample in resolution. The results in Fig. 3 suggest that in a simulated reverberant environment, while more accurately modeling the speech LP coefficients, the increased computational complexity of the joint LP model does not lead to a significant improvement in the TDE accuracy.

Fig. 3. Synthetic vowel simulation results. (a) Itakura distance (dB) vs. reverberation time ($T_{60}$, sec): individual (black solid) vs. joint LP (blue dashed), Mic 1 to Mic 4. (b) Prediction gain (dB) vs. reverberation time ($T_{60}$, sec): individual (solid line) vs. joint (dotted line) LP residual, Mic 1 to Mic 4. (c) TDE RMSE (samples) vs. reverberation time ($T_{60}$, sec): speech GCC-PHAT vs. LP residual GCC (individual and joint).

4.2. Line Spectral Frequencies Averaging

Fig. 4a shows the comparison between the Itakura distances of the joint LP models obtained by averaging autocorrelation matrices (solid line) and averaging the LSFs (dotted line). Across the simulated reverberation times, it can be seen that the LSF averaging performance is comparable to that of autocorrelation matrix averaging. Fig. 4b depicts the TDE RMSE for the speech GCC-PHAT, and GCC of the LP residual obtained by both jointly modeled techniques. The comparable performances of the two averaging techniques shown in Fig. 4a are reciprocated with TDE reliability. The TDE performance of the two jointly modeled LP techniques is comparable to, or better than, individually modeled LP and speech GCC-PHAT for reverberation times less than or greater than 400ms, respectively. Similar to the results in Fig. 3c, speech GCC-PHAT performs best at reverberation times greater than 700ms. With similar TDE results exhibited by both the autocorrelation and LSF averaging, the results in Fig. 4 suggest that joint modeling, for both the tested methods, only results in more reliable TDE over TDE from individually modeled LP speech residuals at higher reverberation times.

Fig. 4. Joint LP modeling: AR vs. LSF averaging. (a) Itakura distance (dB) vs. reverberation time ($T_{60}$, sec): AR (black solid) vs. LSF (blue dashed) averaging, Mic 1 to Mic 4. (b) TDE RMSE (samples) vs. reverberation time ($T_{60}$, sec): speech GCC-PHAT vs. LP residual GCC (individual, AR joint, LSF joint).

4.3. Real Reverberant Recordings

Fig. 5 shows the results from recording the 'e' synthetic vowel averaged over the five speaker positions, plotted across time. Similarly, Fig. 6 shows the results from recording the 'o' synthetic vowel. Although only the results from these two of the five synthetic vowels and two of the four microphones are presented here for brevity, the other three vowels and microphones exhibited similar trends. Both Figs. 5 and 6 show that the performances of the autocorrelation and LSF averaging techniques are almost identical. Figs. 5b and 6b, however, show a marked performance improvement for TDE accuracy from the LP residual (individually or jointly modeled), compared to GCC-PHAT on the speech alone. These results with real recordings confirm the findings of [4]. The jointly modeled LP residual (either AR or LSF averaged) does not significantly outperform the individually modeled LP residual, although a slight performance improvement can be seen with the 'o' vowel in Fig. 6b. The improved performance of the LP residual TDE (individually and jointly modeled) compared to speech GCC-PHAT is much more significant in a real acoustic environment than in the theoretical simulations: this can be seen by comparing the results of Fig. 4b to those in Figs. 5b and 6b. The results in Figs. 5b and 6b clearly show that the LP residual TDE is more robust than the speech GCC-PHAT to a real reverberant acoustic environment with background noise.

Fig. 5. Joint LP modeling for real recording of 'e': AR vs. LSF averaging. (a) Itakura distance (dB) vs. time (ms): AR (black solid) vs. LSF (blue dashed) averaging, two microphones. (b) TDE RMSE (samples) vs. time (ms): speech GCC-PHAT vs. LP residual GCC (individual, AR joint, LSF joint).

Fig. 6. Joint LP modeling for real recording of 'o': AR vs. LSF averaging. (a) Itakura distance (dB) vs. time (ms): AR (black solid) vs. LSF (blue dashed) averaging, two microphones. (b) TDE RMSE (samples) vs. time (ms): speech GCC-PHAT vs. LP residual GCC (individual, AR joint, LSF joint).

5. Conclusion

This paper studied the use of multichannel linear prediction for time-delay estimation (TDE) of reverberant speech. Two techniques for multichannel linear prediction were implemented: averaging the autocorrelation matrices, and averaging the line spectral frequencies (LSFs), across the speech channels.

The experiments in this paper were conducted on synthetic vowels in a modeled room and on real recordings in a reverberant room with background noise. Results showed that jointly modeled LP coefficients better match the ideal set of LP coefficients compared to individually modeling the multiple speech channels. However, there is little performance gain between TDE from individually or jointly modeled LP residuals; the reasons for this are currently being investigated with both simulated and real reverberant environments. Furthermore, the two joint LP modeling techniques studied in this paper, namely, the averaged autocorrelation matrices and LSFs, perform comparably in both the simulated and real reverberant room. Nonetheless, TDE calculated from the LP residual from either technique significantly outperforms the speech TDE in the real recordings. This suggests that, of the techniques studied, extracting TDE from the LP residual (either individually or jointly modeled) is the most robust for TDE in real reverberant environments.

6. References

[1] S. Tucker and S. Whittaker, "Accessing Multimodal Meeting Data: Systems, Problems and Possibilities", in LNCS, vol. 3361, pp. 1-11, Springer-Verlag, Berlin, 2005.
[2] G. Lathoud, I. McCowan, "Location Based Speaker Segmentation", in Proc. ICASSP, Hong Kong, pp. 176-179, April 2003.
[3] C. Knapp, G. Carter, "The Generalized Correlation Method for Estimation of Time Delay", IEEE Trans. Acoust., Speech, and Signal Proc., vol. ASSP-24, no. 4, pp. 320-327, Aug. 1976.
[4] V. C. Raykar, et al., "Speaker Localization Using Excitation Source Information in Speech", IEEE Trans. Speech and Audio Proc., vol. 13, no. 5, pp. 751-761, Sept. 2005.
[5] M. Delcroix, T. Hikichi, M. Miyoshi, "Precise Dereverberation using Multichannel Linear Prediction", IEEE Trans. Audio, Speech, and Language Proc., vol. 15, no. 2, pp. 430-440, Feb. 2007.
[6] N. Gaubitch, D. B. Ward, P. A. Naylor, "Statistical Analysis of the Autoregressive Modeling of Reverberant Speech", JASA, vol. 120, no. 6, pp. 4031-4039, Dec. 2006.
[7] P. Kabal, R. P. Ramachandran, "The Computation of Line Spectral Frequencies using Chebyshev Polynomials", IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 34, no. 6, pp. 1419-1426, Dec. 1986.
[8] J. B. Allen, D. A. Berkley, "Image Method for Efficiently Simulating Small-Room Acoustics", JASA, vol. 65, no. 4, pp. 943-950, April 1979.
[9] Prosynth: All Prosodic Speech Synthesis [Online]. Available: http://www-users.york.ac.uk/~lang9/