Conference IAFPA 2007 Plymouth, UK



1 Conference IAFPA 2007 Plymouth, UK
Increasing Speech Intelligibility by DeNoising: What can be achieved?
Jan Rademacher, University of Oldenburg, Medical Physics, Oldenburg, Germany
Jörg Bitzer, University of Applied Science Oldenburg / Ostfriesland / Wilhelmshaven, Institute for Hearing Technology and Audiology, Oldenburg, Germany
Jörg Houpert, Cube-Tec International, Bremen, Germany


2 History of Cube-Tec International
1990: Founding of the engineering offices Houpert Digital Audio (HDA) in the Technology Center University Park Bremen, Germany.
Founding of Spectral Design (together with Steinberg Media Technologies).
2001: Founding of our international distribution company in South Germany, Cube Technologies GmbH.
Founding of the Cube-Tec Development GmbH.
March 2005: Founding of the new Cube-Tec International GmbH. Merger of all four companies and transfer of assets to the Cube-Tec International headquarters in Bremen.
[Photo: Cube-Tec office at the Technology Park Bremen, Germany]

3 A few words on Cube-Tec International
Since 1990 we have been providing solutions based on digital audio signal processing and psychoacoustics.
Current developments focus on technologies for analyzing and processing huge amounts of digital audio and video files.
We use computer rendering farms for finding relevant cues, running matching algorithms, and looking for similarities in media files.
Some of the topics:
- mass processing of sound and video files
- automatic metadata extraction
- metadata-driven processing of media files
- workflow automation

4 Cube-Workflow

5 What is DOBBIN?

6 FOENICS

7 FOENICS: real-time restoration tool set for speech enhancement
Forensic Enhancement of Intelligibility by Computer System

8 Back to the joint research project and the question:
Increasing Speech Intelligibility by DeNoising: What can be achieved?

9 Motivation / Introduction
It is known that state-of-the-art denoising cannot increase speech intelligibility if the noise source is a speech-like babble noise (Bitzer et al. 2005).
The questions for the following basic research are:
- Is this statement also true for noises with other spectral shapes?
- Is it possible to predict the enhancement if the spectral shape of the disturbance is known?

10 A short review of the basics of denoising algorithms
A distorted speech signal x(t) is usually modeled as the sum of a speech signal s(t) and a distortion n(t):
$x(t) = s(t) + n(t)$
This also holds for the power spectral densities (PSDs), with frequency index m and time index l:
$X(m,l) = S(m,l) + N(m,l)$
Denoising as spectral subtraction:
$S(m,l) = X(m,l) - N(m,l)$
Be aware: X, S and N are power spectral densities, not spectra, and in this model the noise has to be uncorrelated with the speech signal.
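The additive PSD model and the spectral subtraction rule can be illustrated with a minimal sketch. The function name `spectral_subtraction`, the `floor` parameter and the example values are assumptions made for illustration only; clipping negative differences at a floor is a common practical safeguard and is not stated on the slide.

```python
import numpy as np

def spectral_subtraction(X, N, floor=0.0):
    """Estimate the speech PSD as X - N for each frequency bin.

    X: PSD of the noisy signal, N: PSD of the noise (same shape).
    Differences below `floor` are clipped (assumption: a PSD cannot be negative).
    """
    X = np.asarray(X, dtype=float)
    N = np.asarray(N, dtype=float)
    return np.maximum(X - N, floor)

# Hypothetical PSD values for four frequency bins of one frame.
S = np.array([4.0, 1.0, 0.5, 2.0])   # speech PSD
N = np.array([1.0, 1.0, 1.0, 0.5])   # noise PSD
X = S + N                            # additive model: X = S + N

S_hat = spectral_subtraction(X, N)   # recovers S when N is known exactly
clipped = spectral_subtraction([0.5], [1.0])  # overestimated noise is clipped to 0
```

In practice N(m,l) is only an estimate, so the difference can become negative in single bins; that is the case the floor handles.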

11 Principle of denoising
Estimate the speech signal by multiplying the distorted signal with a filter function:
$X(m,l)\,H(m,l) = \hat{S}(m,l)$
Optimal approach in the mean-square sense: Wiener filter [Bol79]
Problem: successful estimation of H(m,l) depends on:
- the signal-to-noise ratio (SNR)
- the spectral shape of noise and speech
Question: Is it possible to assess the achievable intelligibility enhancement without performing the denoising and listening tests?

12 Principle of denoising
Find a weighting factor H(m,l) for each frequency m at each time instance l fulfilling
$X(m,l)\,H(m,l) = \hat{S}(m,l)$

13 Prediction of enhancement
Proposed solution: a new measure that describes the intelligibility enhancement to be expected after denoising by a single value, called 'denoising success'.
Range:
- no enhancement: 0
- total enhancement: 1
This measure takes into account:
- the signal-to-noise ratio (SNR)
- the spectral shape of the speech signal
- the spectral shape of the distortion
- psychoacoustic properties of human hearing:
  - masking threshold of tonal and non-tonal components [Zwi82, ISO93]
  - band importance function [ANS97]

14 Calculation of achievable intelligibility enhancement - STEP-1
[Block diagram: N(m,l) and S(m,l) are used to estimate HWiener(m,l); HWiener(m,l) is applied to N(m,l) to give Ndenoise(m,l); S(m,l) and N(m,l) are passed on unchanged]
Input to the model is:
- N(m,l): the power spectral density of the noise (noise PSD)
- S(m,l): the power spectral density of the speech (speech PSD)
  - the long-term average spectrum of a single speaker
  - a standard LTAS (long-term average spectrum) for speech

15 Calculation of achievable intelligibility enhancement - STEP-1
[Block diagram: N(m,l) and S(m,l) are used to estimate HWiener(m,l); HWiener(m,l) is applied to N(m,l) to give Ndenoise(m,l); S(m,l) and N(m,l) are passed on unchanged]
The noise spectrum and the speech spectrum together form the Wiener filter, which is simply the power spectrum of the speech divided by the power spectrum of speech plus noise.
This Wiener filter is applied to the noise signal only.
Output of step 1: the denoised version of the original noise (denoised noise PSD), the original noise (noise PSD) and the original speech signal (speech PSD).
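Step 1 can be sketched as follows, assuming the Wiener gain is H = S / (S + N) as described in words above. The function names and the `eps` guard are illustrative assumptions. A second assumption, not stated on the slide: the gain is applied as an amplitude weight to the noise spectrum, so the noise PSD is scaled by H squared.

```python
import numpy as np

def wiener_gain(S, N, eps=1e-12):
    """Wiener gain per frequency bin: speech PSD divided by speech-plus-noise PSD.

    eps avoids a division by zero in bins where both PSDs are zero (assumption).
    """
    S = np.asarray(S, dtype=float)
    N = np.asarray(N, dtype=float)
    return S / (S + N + eps)

def denoised_noise_psd(S, N):
    """Apply the Wiener gain to the noise only.

    Assumption: H weights the spectral amplitude, so the PSD scales with H**2.
    """
    H = wiener_gain(S, N)
    return H ** 2 * np.asarray(N, dtype=float)

# Hypothetical long-term PSDs for three frequency bins.
S = np.array([1.0, 1.0, 0.0])
N = np.array([1.0, 3.0, 1.0])

H = wiener_gain(S, N)                 # about [0.5, 0.25, 0.0]
N_denoise = denoised_noise_psd(S, N)  # about [0.25, 0.1875, 0.0]
```

Bins in which the noise dominates receive a small gain, so the residual noise there drops the most; this is what later lowers the masking threshold in those bins.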

16 Calculation of achievable intelligibility enhancement - STEP-2
Masking thresholds are computed using the MPEG-1 signal model [Zwi82, ISO93]:
- MTnoise(m,l) for the distortion N(m,l)
[Block diagram: Ndenoise(m,l) passes through a masking threshold block to give MTdenoise(m,l); N(m,l) passes through a masking threshold block to give MTnoise(m,l); S(m,l) is passed on unchanged]

17 Calculation of achievable intelligibility enhancement - STEP-2
- MTnoise(m,l) for the distortion N(m,l)
- MTdenoise(m,l) for the remaining distortion Ndenoise(m,l) after denoising
[Block diagram: same as on the previous slide]

18 Calculation of achievable intelligibility enhancement - STEP-3
[Block diagram: S(m,l) is compared with MTdenoise(m,l) ("not masked?") and with MTnoise(m,l) ("masked?"); both decisions are combined with a logical AND, weighted by the BIF, and give the denoising success]
Find all frequency indices m ∈ M where a masked speech amplitude S(m,l) is unmasked after denoising.

19 Calculation of achievable intelligibility enhancement - STEP-3
[Block diagram: same as on the previous slide]
The relevance of each frequency band m ∈ M with respect to intelligibility is incorporated by the band importance function (BIF).

20 Calculation of achievable intelligibility enhancement - STEP-3
[Figure: Band Importance Factor (ANSI), BIF plotted over frequency in Hz, axis from 200 Hz to 10 kHz]

21 Calculation of achievable intelligibility enhancement - STEP-3
[Block diagram: same as on slide 18]
The denoising success is the sum of all BIF values belonging to M.
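Step 3 can be sketched for a single frame (or for long-term spectra) as shown below. The function name, the example values and the normalization of the BIF to a sum of 1 are assumptions; the normalization is chosen so that the result stays in the stated range from 0 to 1. How a value exactly on the threshold is treated is also an assumption.

```python
import numpy as np

def denoising_success(S, MT_noise, MT_denoise, bif):
    """Sum of band importance values over the bands that denoising unmasks.

    S: speech PSD per band, MT_noise / MT_denoise: masking thresholds of the
    original and the denoised noise, bif: band importance per band.
    """
    S, MT_noise, MT_denoise, bif = (
        np.asarray(a, dtype=float) for a in (S, MT_noise, MT_denoise, bif)
    )
    masked_before = S < MT_noise       # speech lies below the original threshold
    audible_after = S >= MT_denoise    # speech reaches the threshold after denoising
    M = masked_before & audible_after  # index set M of newly unmasked bands
    return float(bif[M].sum())

# Hypothetical values for four bands; bif is assumed to sum to 1.
S          = np.array([1.0, 1.0, 1.0, 1.0])
MT_noise   = np.array([2.0, 2.0, 0.5, 2.0])
MT_denoise = np.array([0.5, 2.0, 0.2, 0.8])
bif        = np.array([0.1, 0.2, 0.3, 0.4])

success = denoising_success(S, MT_noise, MT_denoise, bif)  # bands 0 and 3 are unmasked
```

Band 2 does not count because it was already audible before denoising, and band 1 does not count because it stays masked afterwards.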

22 Complete calculation scheme
[Block diagram: N(m,l) and S(m,l) are used to estimate HWiener(m,l); HWiener(m,l) is applied to N(m,l) to give Ndenoise(m,l); masking threshold blocks give MTdenoise(m,l) and MTnoise(m,l); S(m,l) is tested as "not masked?" against MTdenoise(m,l) and "masked?" against MTnoise(m,l); both decisions are combined with a logical AND, weighted by the BIF, and give the denoising success]
How does this calculation scheme perform?

23 Four different stationary noise sources
Test / Noise    Description
OLSA            speech-like noise (perfect masker)
White           white Gaussian noise
Peak            peak noise with a flat spectrum between 400 Hz and 2400 Hz
Notch           white noise with a noise gap between 400 Hz and 2400 Hz
[Figure: four panels (OLSA, White, Peak, Notch) showing the noise spectra over frequency in Hz, axis up to 10 kHz]


27 Test Results for OLSA & White Noise Masker
[Figure: two panels, denoising success (0 to 1) over SNR in dB for OLSA noise and for white noise]
OLSA noise masker: predicted enhancement of speech intelligibility is zero for all SNRs.
White noise masker: predicted enhancement of speech intelligibility is zero for all SNRs.

28 Test Results for Peak & Notch Masker
[Figure: two panels, denoising success (0 to 1) over SNR in dB for peak noise and for notch noise]
Peak noise masker: intelligibility will be improved significantly at moderate SNRs.
Notch noise masker: small improvement at low SNRs.


29 Listening Tests / Test-Design
OLSA test method
The Oldenburger sentence test (OLSA), developed by Wagener and Kollmeier at the University of Oldenburg, is a further development of the Hagerman sentence test.
- 30 sentences; the level of the noise masker is fixed, and the level of each sentence is adapted until 50% word scoring (on average) is reached.
- The task is to recognize 50% of the words in a nonsense sentence (name, verb, number, adjective, object), e.g. "Ulrich gewann fünf rote Messer" = "Ulrich won five red knives".
- The mean SNR of the last 20 sentences forms the Speech Reception Threshold (SRT).
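The last point of the test design, the SRT as the mean SNR of the last 20 of 30 sentences, can be written down in a few lines. The function name and the SNR track are hypothetical; the adaptive rule that produces the track is not specified on the slide and is not modeled here.

```python
def speech_reception_threshold(snr_track_db, n_last=20):
    """Mean SNR (in dB) over the last n_last sentences of an adaptive run."""
    if len(snr_track_db) < n_last:
        raise ValueError("track is shorter than the averaging window")
    last = snr_track_db[-n_last:]
    return sum(last) / len(last)

# Hypothetical track of 30 presentation SNRs: 10 sentences while the level
# is still converging, then 20 sentences oscillating around the threshold.
track = [0.0] * 10 + [-6.0, -8.0] * 10

srt = speech_reception_threshold(track)  # mean of the last 20 values
```

Discarding the first 10 sentences keeps the start of the adaptation, where the level is still far from the 50% point, out of the estimate.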

30 Oldenburger sentence test - Example page
[Figure: example page with the columns name, verb, number, adjective, object]

31 Test-Design II
Each test person listens to the sentences mixed with one of the four noise sources.
- Result: an SRT for each mix with the four noise sources
Each test person listens to the denoised version of the sentences mixed with one of the four noise sources.
- Result: an SRT for each of the four denoised versions
The denoiser is a standard real-world denoiser based on published technologies:
- WOLA (weighted overlap-add) based, FFTLen = 2048
- noise estimator based on (Cohen 2002)
- gain rule: Wiener with a-priori SNR (decision-directed estimation based on Ephraim and Malah 1984)
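The gain rule named above, a Wiener gain driven by a decision-directed a-priori SNR estimate, can be sketched as follows. This is a textbook-style illustration in the spirit of Ephraim and Malah (1984), not the denoiser used in the study: the smoothing factor `alpha = 0.98`, the lower limit `xi_min` and all names are assumptions, and the noise PSD is taken as given instead of being estimated as in (Cohen 2002).

```python
import numpy as np

def decision_directed_gains(X, N, alpha=0.98, xi_min=1e-3):
    """Wiener gains from a decision-directed a-priori SNR estimate.

    X: noisy PSD, shape (n_frames, n_bins); N: noise PSD, shape (n_bins,)
    or (n_frames, n_bins), strictly positive. alpha and xi_min are
    illustrative defaults, not values taken from the slides.
    """
    X = np.asarray(X, dtype=float)
    N = np.broadcast_to(np.asarray(N, dtype=float), X.shape)
    H = np.empty_like(X)
    S_prev = np.zeros(X.shape[1])          # speech PSD estimate of the previous frame
    for l in range(X.shape[0]):
        gamma = X[l] / N[l]                # a-posteriori SNR
        xi = alpha * S_prev / N[l] + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0)
        xi = np.maximum(xi, xi_min)        # limit the a-priori SNR from below
        H[l] = xi / (1.0 + xi)             # Wiener gain
        S_prev = H[l] ** 2 * X[l]          # update the speech PSD estimate
    return H

# Hypothetical input: 5 frames, bin 0 contains noise only, bin 1 has a high SNR.
X = np.array([[1.0, 101.0]] * 5)
N = np.array([1.0, 1.0])
H = decision_directed_gains(X, N)
```

The recursion smooths the a-priori SNR over time: in the high-SNR bin the gain rises from frame to frame, while the noise-only bin stays at the floor set by `xi_min`.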

32 Results of the Listening Test
[Figure: SRT in dB for the four noise maskers (Olsa, White, Peak, Notch), unprocessed condition (NoAlgo), mean values and standard deviations]
The listening test was performed by five trained expert listeners.
The figure shows the Speech Reception Threshold for the four noise maskers, with mean values and standard deviations.
First, the results for the unprocessed sound files.

33 Results of the Listening Test
[Figure: same as on the previous slide]
- The SRT decreases significantly if the noise has a different color than the speech signal and is therefore an incomplete masker.
- At such SRTs the speech is very quiet compared to the noise and can be close to the technical limits (bit depth).
- Listening conditions are very stressful because of the high gain needed to amplify the speech signal above the hearing threshold.

34 Results of the Listening Test - Denoised
[Figure: SRT in dB for the four noise maskers (Olsa, White, Peak, Notch), unprocessed (NoAlgo) and denoised (DeNoised)]
No intelligibility enhancement is possible for a perfectly masking noise.
Denoising is not useful here as long as no additional information can be used, for example spatial cues from multi-microphone processing.

35 Results of the Listening Test - White Noise Masker
[Figure: same as on slide 34]
Small enhancement for white noise (but possibly not statistically significant).
Denoising will give a very small benefit, if any.

36 Results of the Listening Test - Peak Noise Masker
[Figure: same as on slide 34]
A significant enhancement for colored peak noise.
Denoising will improve speech intelligibility, at the cost of a slightly changed and possibly disturbing background noise. The speech source will sound different.

37 Results of the Listening Test - Notch Noise Masker
[Figure: same as on slide 34]
Enhancement for notch noise.
Denoising may improve speech intelligibility.
More important: the annoying noise is reduced, so the signal can be amplified to a higher level, which will improve the SRT. (In this experiment, however, a level change was not permitted.)

38 Comparison of objective measure and listening test
The new measure 'denoising success' can coarsely predict:
- the lowering of the Speech Reception Threshold (Peak and Notch vs. OLSA noise)
- the benefit in speech intelligibility for colored noise
It is only a coarse predictor (no enhancement was predicted for white noise).

39 Conclusions
Denoising is useful if:
- The disturbance is colored compared to the long-term spectrum of the desired speech signal.
- The disturbance is so annoying that an automatic reduction is necessary in order to be able to raise the level of the speech above the hearing threshold.
Automatic prediction of an increase in speech intelligibility through a denoising process is possible, but so far only in a limited range.
More work to do!

40 Literature
[ANS97] American National Standard, Methods for calculation of the speech intelligibility index, ANSI S
[Bit05] Bitzer, Jörg; Simmer, Uwe; Holube, Inga; Schaer, Timm, "Some Experiments on Short-Time Spectral Attenuation (STSA) Algorithms and Speech Intelligibility", International Workshop on Acoustic Echo and Noise Control (IWAENC), Sept 2005, p
[Bol79] Boll, "Suppression of acoustic noise in speech using spectral subtraction", IEEE Trans. Acoust., Speech, Signal Process., ASSP-27, pages , 1979
[Coh02] Cohen, Berdugo, "Noise estimation by minima controlled recursive averaging for robust speech enhancement", IEEE Signal Processing Letters, vol. 9, pp. 12, Jan
[EM84] Ephraim, Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator", IEEE Transactions on Acoustics, Speech, and Signal Processing, Dec 1984 (32), No. 6, p
[Hag82] Hagerman, "Sentences for Testing Speech Intelligibility in Noise", Scandinavian Audiology, No. 11, 1982, p
[ISO93] ISO/IEC 11172, Information technology - Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s - Part 3: audio, BSI, London, First edition, October Implementation of ISO/IEC :1993. ISBN
[Wag03] Wagener, "Factors Influencing Sentence Intelligibility in Noise", PhD Thesis, Oldenburg, 2003
[Wie49] Wiener, Extrapolation, interpolation and smoothing of stationary time series: With engineering applications, in Principles of Electrical Engineering Series, Cambridge, Massachusetts and New York,
[Zwi82] Zwicker, Psychoakustik, Springer Verlag, Berlin, 1982

41 Abstract -1- of -2-
Increasing Speech Intelligibility by DeNoising: What can be achieved?
Jan Rademacher, University of Oldenburg, Medical Physics, Oldenburg, Germany
Prof. Dr. Jörg Bitzer, University of Applied Science Oldenburg / Ostfriesland / Wilhelmshaven, Institute for Hearing Technology and Audiology, Oldenburg, Germany
Jörg Houpert, Cube-Tec International, Bremen, Germany

Abstract: A lot of forensic material is of poor recording quality. A variety of disturbances are added to the desired speech signal. A typical way to increase the quality of the signal is to apply algorithms that can be called DeNoisers, which filter out the background noise. However, increasing the quality does not imply that the speech intelligibility is increased, which is the essential goal of the process.

In (Bitzer et al. 2005) the authors show that no significant enhancement is possible if the background noise is a perfect masker for the desired speech signal. In their study the noise signal was computed as a random mixture of 500 sentences spoken by the same speaker. Therefore, the long-term noise spectrum is exactly the same as the spectrum of the desired speech signal. Finally, the speech reception threshold is computed, which is the signal-to-noise ratio (SNR) at which 50% of all words can be understood. The methodology is described in (Kollmeier and Wesselkamp 1997) and (Wagener 2003). The test is called the Oldenburger sentence test and is based on (Hagerman 1982).

42 Abstract -2- of -2-
In real-world recordings, however, the masker has a different spectrum.

Prediction of enhancement
The differences between the long-term spectra of the desired signal and the disturbance can be used as a measure of how well a denoiser could work for this combination of signals. Of course this is a coarse prediction, since stationarity has to be assumed. Our new measure is based on the so-called optimal or Wiener filter (Boll 1979, Wiener 1949), which is well known for denoising tasks. Based on this filter, we suggest calculating the masking thresholds (MT) (Zwicker 1982), in a way closely related to the MPEG standard (ISO/IEC ), both for the original noise spectrum and for the noise spectrum denoised by the Wiener filter. These MTs are then used to find the spectral components of the speech signal that are audible before and after the denoising process. Taking into account the band importance function (American National Standard 1997), which reflects the importance of each frequency band for intelligibility, the increase in audible frequency bands after denoising will be a good approximation of the improvement in intelligibility achieved by denoising. In order to show this behaviour, subjective listening tests will be used for comparison.

Conclusions: Increasing speech intelligibility is possible if the disturbing signal does not mask all important frequencies of the desired signal. Since denoising algorithms are able to filter signal bands individually, a significant improvement in quality and intelligibility can be achieved. Furthermore, we introduced a new measure that can be used as a coarse predictor of the possible improvement.
