SPEECH FEATURE EXTRACTION USING WEIGHTED HIGHER-ORDER LOCAL AUTO-CORRELATION


Far East Journal of Electronics and Communications, Volume 3, Number 2, 2009. Published Online: September 14, 2009, by Pushpa Publishing House.

YASUO ARIKI, TETSUYA TAKIGUCHI, TAKASHI MUROI and RYOICHI TAKASHIMA
Department of Computer and System Engineering, Kobe University, Kobe, Japan

Abstract

In this paper, we propose a new feature extraction method based on higher-order local auto-correlation (HLAC) and Fisher weight maps (FWM). Widely used MFCC features lack temporal dynamics. To solve this problem, 35 types of local auto-correlation features are computed within two-dimensional local regions. These local features are accumulated over more global regions by placing high scores on the discriminative areas, where the typical features among all phonemes are well expressed. This score map is called a Fisher weight map. We verified the effectiveness of HLAC and FWM through speaker-independent phoneme recognition. The proposed method showed an 82.1% recognition rate, 8.9 points higher than the result obtained with MFCC. Furthermore, by combining the proposed feature with MFCC, ΔMFCC and ΔΔMFCC, the recognition rate improved to 86.6%. Experimental results also demonstrate the robustness of the proposed method in noisy environments in comparison with MFCC-based features.

Keywords and phrases: phoneme recognition, feature extraction, Fisher weight map, local auto-correlation.

Received June 16, 2009

I. Introduction

In speech recognition, the MFCC (Mel-Frequency Cepstrum Coefficient) is widely used; it is a cepstrum conversion of a sub-band mel-frequency spectrum within a short time window. Because it is based on a short-time spectrum, MFCC lacks temporal dynamic features, which degrades the recognition rate. To overcome this shortcoming, the regression coefficients of MFCC (delta and delta-delta MFCC) are usually utilized, but they are an indirect expression of temporal frequency changes such as formant transitions or plosives. A more direct expression of the temporal frequency changes would be a geometrical feature in a two-dimensional local area [4, 7, 8, 9, 11]; for example, within a three-frame by three-frequency-band area in the temporal-frequency domain. Figure 1 shows the time wave and spectrogram of the word "democrats". In the lower frequency bands, several formant transitions are observed, and in a high-frequency band, the plosive is observed. To locate such two-dimensional geometrical features, auto-correlation within a local area is effective because it enhances the geometrical features. Originally, this type of feature extraction was proposed in the field of facial emotion recognition [10], where 35 types of local auto-correlation features were computed within a two-dimensional local area at each pixel of an image and accumulated within discriminative areas where the typical features among all emotions were well expressed. The map showing the discriminative areas is called a Fisher weight map. Some other data-driven techniques have also been described (e.g., [2], [6]). In this paper, we propose a method to find the geometrical discriminative features and discriminative areas of phonemes in the temporal-frequency domain of speech signals by using Fisher weight maps [1].
In [1], a simple experiment on speaker-dependent vowel recognition showed, by investigating the resultant Fisher weight maps, that formant features are discriminative. In this paper, the effectiveness of the proposed discriminative feature is verified through speaker-independent phoneme recognition experiments. To evaluate the noise robustness of the proposed feature extraction method, white Gaussian noise and real noises were added to the clean speech data at SNRs of 20, 10 and 0 dB, because the higher-order local auto-correlation used in the proposed method is expected to be robust in noisy environments. In Sections II and III, auto-correlation coefficients based on the local features

and the Fisher weight maps are described. Section IV describes the extraction flow of the geometrical discriminative features for phoneme recognition. In Section V, phoneme recognition experiments are presented.

Figure 1. Example of a spectrogram of a speech signal.

II. Local Features and Weighted Higher-order Local Auto-correlations

A. Local features

Two-dimensional geometrical and local features are observed on the time-frequency matrix shown on the left in Figure 2. On the right-hand side, 3 × 3 local patterns are shown that capture the local features. The upper pattern is for continuation in the time direction, the middle for continuation in the frequency direction and the lower for transition. The flag 1 indicates that the spectrum at that position enters the computation. A local feature for the k-th local pattern at position r is defined as follows:

h_r^(k) = I(r) + I(r + a_1^(k)) + ... + I(r + a_N^(k)),   (1)

where I(r) is the log-power spectrum at position r on a time-frequency matrix composed of time t and frequency f, and r + a_i^(k) indicates the other positions within the k-th local pattern. If I(r) is the linear-power spectrum, h^(k) is calculated by multiplication, rather than addition, in Eq. (1).
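As an illustration only, Eq. (1) can be sketched in a few lines of code. The function name, the [frequency, time] indexing convention and the offset representation are our own choices, not the paper's.

```python
import numpy as np

def local_feature(I, r, offsets):
    """Local feature h_r^(k) of Eq. (1) for one displacement set.

    I: 2-D log-power spectrum indexed [frequency, time];
    r: reference position (f, t);
    offsets: displacements a_1, ..., a_N of the k-th local pattern.
    With a log-power spectrum, the products of the linear-power
    formulation turn into sums.
    """
    f, t = r
    value = I[f, t]
    for df, dt in offsets:            # add the spectrum at each flagged cell
        value += I[f + df, t + dt]
    return value
```

For a linear-power spectrum, the addition would become multiplication, as the text notes.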

We can calculate local features using any local-pattern size and any order N in the time-frequency domain, but a large local-pattern size results in very high computational complexity. Therefore, a 3 × 3 local pattern is often used [10]. By limiting local patterns to a three-frame by three-band area at reference position r, setting the order N to 2, and omitting patterns that are equivalent under translation, the number of displacement sets (a_1, ..., a_N) becomes 35. Namely, 35 types of local patterns are obtained at each position r on the time-frequency matrix, as shown in Figure 3, following [10]. For example, in the case of the top pattern in Figure 2 and pattern No. 16 in Figure 3 (continuation in the time direction), the local feature is calculated by

h^(16)(t, f) = S(t, f) + S(t − 1, f) + S(t + 1, f),   (2)

where S(t, f) is the log-power spectrum at time t and frequency f.

Figure 2. Example of local features.

B. Weighted higher-order local auto-correlations

The higher-order local auto-correlation x_k for the k-th local pattern is obtained by summing the local features shown in Eq. (1) over the time-frequency matrix. It is formalized as follows:

x_k = Σ_r h_r^(k).   (3)
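A vectorized sketch of Eqs. (2)-(3) for pattern No. 16 follows; the loop-free formulation is our own, assuming the spectrogram is stored as a [frequency, time] array and only interior positions contribute.

```python
import numpy as np

def hlac_time_continuation(S):
    """HLAC coefficient x_16 for the time-continuation pattern.

    Eq. (2) at every interior position: S(t, f) + S(t-1, f) + S(t+1, f),
    then Eq. (3) sums these local features over the matrix.
    S is indexed [frequency, time].
    """
    h16 = S[1:-1, 1:-1] + S[1:-1, :-2] + S[1:-1, 2:]  # Eq. (2) per position
    return h16.sum()                                   # Eq. (3)
```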

Figure 3. The 35 types of local patterns.

In order to express the higher-order local auto-correlation in matrix form, all the local features shown in Eq. (1) for the k-th local pattern are collected from the time-frequency matrix and arranged as the following vector:

h^(k) = [h_{2,2}^(k) ... h_{2,T−1}^(k) ... h_{F−1,T−1}^(k)]^t.   (4)

Here the dimension of the vector is M = (T − 2)[time] × (F − 2)[frequency]. The higher-order local auto-correlation x_k for the k-th local pattern is expressed as follows using the M-dimensional vector h^(k):

x_k = (h^(k))^t 1.   (5)

Here, 1 is the M-dimensional vector in which each element is set to 1, and t represents the transpose. A local feature matrix is obtained by placing the M-dimensional vectors h^(k) in the horizontal direction one by one for all of the 35 local patterns:

H = [h^(1) ... h^(K)].   (6)

The higher-order local auto-correlation vector x is obtained by packing the x_k and is expressed as follows:

x = [x_1 ... x_K]^t = H^t 1.   (7)

Figure 4 shows an example of computing the local feature matrix H. Here, moving the 35 local patterns over the windowed time-frequency matrix, the local features are computed and packed into the local feature matrix H. The higher-order local auto-correlation vector x represents the existence of the local patterns over the whole time-frequency matrix; therefore, it is not a discriminative vector. In order to give the higher-order local auto-correlation vector x discriminative ability, local features of the same local pattern are summed over the windowed time-frequency matrix with high weights on the local features where class differences appear clearly. This is done by replacing the vector 1, consisting of M ones, with a weighting vector w. The weighted higher-order local auto-correlation vector x is then obtained as follows:

x = H^t w.   (8)

Here w is called a Fisher weight map because it is computed based on linear discriminant analysis.

Figure 4. Local feature matrix.
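Eqs. (4)-(8) can be sketched as follows; the helper name and the displacement-set encoding are illustrative choices, not from the paper, and only two toy patterns stand in for the full set of 35.

```python
import numpy as np

def local_feature_matrix(S, patterns):
    """Local feature matrix H of Eq. (6).

    S: windowed time-frequency matrix indexed [frequency, time];
    patterns: list of displacement sets (a_1, ..., a_N), one per pattern.
    Each column stacks the local features of one pattern over the
    M = (T - 2)(F - 2) interior positions (Eq. (4)).
    """
    F, T = S.shape
    cols = []
    for offsets in patterns:
        h = S[1:-1, 1:-1].copy()                    # I(r) at interior positions
        for df, dt in offsets:
            h = h + S[1 + df:F - 1 + df, 1 + dt:T - 1 + dt]
        cols.append(h.ravel())
    return np.stack(cols, axis=1)                   # M x K

# Two toy patterns: time continuation and frequency continuation.
S = np.random.rand(6, 8)
H = local_feature_matrix(S, [[(0, -1), (0, 1)], [(-1, 0), (1, 0)]])
x_plain = H.T @ np.ones(H.shape[0])                 # Eq. (7), not discriminative
w = np.random.rand(H.shape[0])                      # a (random) weight map
x_weighted = H.T @ w                                # Eq. (8)
```

With a trained Fisher weight map in place of the random `w`, `x_weighted` would be the proposed discriminative feature for this window.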

III. Fisher Weight Map

In order to find the Fisher weight map, Fisher's discriminative criterion is utilized [10]. Let P be the number of training data. The local feature matrices for the training data are denoted as {H_j ∈ R^(M×K)}, j = 1, ..., P. The corresponding weighted higher-order local auto-correlation vectors, the within-class covariance matrix and the between-class covariance matrix are denoted as {x_j}, j = 1, ..., P, Σ̃_W and Σ̃_B, respectively. In this paper, each phoneme class is used for training the Fisher weight map. The Fisher discriminative criterion J(w) is then expressed as follows using these denotations:

J(w) = tr Σ̃_B / tr Σ̃_W = (w^t Σ_B w) / (w^t Σ_W w),   (9)

where Σ_W and Σ_B are the within-class covariance matrix and the between-class covariance matrix of the local feature matrices (training data). The Fisher weight map is obtained as the eigenvectors w of the following generalized eigenvalue problem, derived by maximizing the Fisher discriminative criterion under the constraint w^t Σ_W w = 1:

Σ_B w = λ Σ_W w.   (10)

Since the Fisher weight map is composed of several eigenvectors, the number of eigenvectors is optimized in the phoneme recognition process. However, if the number of eigenvectors is set to 20, the weighted higher-order local auto-correlation vector x shown in Eq. (8) becomes a 700 (= 35 × 20)-dimensional vector. With such a high dimension, the HMMs used in phoneme recognition cannot be estimated accurately and stably. To solve this problem, PCA (Principal Component Analysis) is used in this paper to reduce the dimension effectively.

IV. Extraction Flow of Geometrical Discriminative Features

Figure 5 shows the extraction flow of the geometrical discriminative features and phoneme recognition. First, speech waveforms are converted into the time-frequency domain by a short-time Fourier transformation. At this point, a time sequence of short-time spectra (frames) is obtained. Then a moving window with

several consecutive frames is placed on the time sequence of short-time log spectra, forming a windowed time-frequency matrix. Local features of the 35 types are computed at each position (time, frequency) within this window, forming a local feature matrix H of size (number of positions) × (35 types of local features). Finally, a Fisher weight map w is produced by applying linear discriminant analysis (LDA) to the local feature matrices H. Geometrical discriminative features are obtained as higher-order local auto-correlations by summing the local features weighted by the Fisher weight map for each type of local feature, forming a 35-dimensional vector x for each window. By moving this window, a sequence of 35-dimensional vectors is obtained as the geometrical discriminative feature. In phoneme recognition, phoneme HMMs are trained first. Then the test speech data are converted into a 35-dimensional vector sequence of geometrical discriminative features, and phoneme likelihoods are computed using the trained phoneme HMMs.

V. Phoneme Recognition Experiments

The new feature extraction technique was evaluated on speaker-independent phoneme recognition tasks in both clean and noisy environments.

Figure 5. New feature extraction flowchart.
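The Fisher weight map estimation of Section III, Eq. (10), can be sketched with plain NumPy. Computing the class scatter from generic M-dimensional sample vectors is a simplifying assumption here, not the paper's exact statistic, and the regularization term is our addition to keep the problem well-posed on small toy data.

```python
import numpy as np

def fisher_weight_maps(X, labels, n_maps=20, reg=1e-6):
    """Leading eigenvectors of Sigma_B w = lambda Sigma_W w (Eq. (10)).

    X: (P, M) array of M-dimensional training samples;
    labels: (P,) class label per sample (here, phoneme classes);
    returns an (M, n_maps) matrix whose columns are the weight maps.
    """
    mean = X.mean(axis=0)
    M = X.shape[1]
    Sw = np.zeros((M, M))
    Sb = np.zeros((M, M))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                    # within-class scatter
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)   # between-class scatter
    Sw += reg * np.eye(M)                                # keep Sw invertible
    # Generalized problem solved via Sw^{-1} Sb (eigenvectors may have
    # tiny imaginary parts from round-off, hence .real).
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-vals.real)                       # largest lambda first
    return vecs.real[:, order[:n_maps]]
```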

Figure 6. Phoneme recognition rate as a function of the number of eigenvectors (white noise, an SNR of 10 dB).

A. Experimental conditions

We carried out speaker-independent Japanese 25-phoneme recognition under clean and noisy conditions. The speech material was continuous speech data spoken by six male and four female speakers, manually segmented into phoneme sections. For speaker-independent phoneme recognition, 2,578 data (about 100 hand-segmented data for each phoneme) were collected and used for training (the Fisher weight map and the phoneme HMMs). The training data were also used to calculate the eigenvectors for PCA. Another 2,578 phoneme data from an individual speaker were used for testing, and the phoneme recognition rate was computed by averaging the results over the ten speakers. To evaluate the noise robustness of the proposed feature extraction method, white Gaussian noise was added to the clean speech data at SNRs of 20, 10 and 0 dB, and real noise data from the CENSREC-1-C database [5] were also used. The speech waveform was transformed into a time-frequency matrix using a short-time Fourier transformation with a 25 ms frame width and a 10 ms frame shift. The frequency axis was then converted to the mel scale using a mel filter bank (64 dimensions: F in Eq. (4)). A window with a 5-frame width (T in Eq. (4)) and a 1-frame shift was moved over the time-frequency matrix to generate the windowed time-frequency matrices. The number of dimensions D of the weighted higher-order local auto-correlation vector x after PCA reduction was also experimentally optimized.

B. Optimal eigenvector number

Figure 6 shows the variation of system performance as a function of the number

of eigenvectors in Eq. (10), where the blue line represents the recognition rate on the clean-speech test set and the red line represents the recognition rate on the noisy-speech test set (white noise, an SNR of 10 dB). The experimental results show no improvement with a larger number of eigenvectors (more than 20), where the recognition rates for clean speech and noisy speech (white noise) are 75.2% and 64.5%, respectively. Therefore, in the following experiments, the number of eigenvectors in the Fisher weight map is set to 20.

Figure 7. Dimension reduction using PCA.

The total number of dimensions of the proposed speech feature (with 20 eigenvectors) comes to 20 × 35 (local patterns) = 700, and PCA (Principal Component Analysis) can be used to reduce the total number of feature dimensions effectively. Figure 7 shows the phoneme recognition results as a function of the number of feature dimensions using PCA. The experimental results show that the recognition rate improves to 83.7% for clean speech and 72.2% for noisy speech (white noise, an SNR of 10 dB) when the total number of feature dimensions is reduced to 50 using PCA.

C. Comparison of each single feature in clean and noisy (white noise) environments

Figure 8 shows the results of speaker-independent phoneme recognition using the proposed feature FWM (Fisher Weight Map), compared with the recognition results using MFCC. The recognition rate using the proposed feature FWM with 20 eigenvectors (700 dimensions) in the Fisher weight map is 75.19%, 70.82%, 62.33% and 42.83% for clean speech and at SNRs of 20 dB, 10 dB and 0 dB, respectively, better than with MFCC.
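The 700-to-50 PCA reduction above can be sketched with a plain eigendecomposition of the training covariance. This is an illustrative implementation, not the authors' code, and any dimensions work in place of 700 and 50.

```python
import numpy as np

def pca_reduce(X, d=50):
    """Project weighted-HLAC vectors onto the top-d principal axes.

    X: (P, M) training matrix (M = 700 in the paper). Returns the
    projected data, the basis and the mean, so that a test vector x
    can later be reduced with (x - mean) @ basis.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / len(X)
    vals, vecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    basis = vecs[:, ::-1][:, :d]         # top-d principal components
    return Xc @ basis, basis, mean
```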

Figure 8. Results of speaker-independent phoneme recognition using each single feature.

The highest phoneme recognition rate is obtained by using the weighted higher-order local auto-correlation vector reduced with PCA (FWM PCA (50 dim)). Compared with MFCC and ΔMFCC, the recognition rate is improved by the proposed method at all noise levels, owing to the accumulation of the direct expression of temporal features over the ten speakers. The recognition rate for clean speech is improved by 8.9 and 7.0 points compared with MFCC and ΔMFCC, respectively. As the noise level increases, the difference in recognition rate between FWM and MFCC reaches its maximum, 20.5 points, at an SNR of 10 dB. When PCA was not applied, the dimension was so high (700) that the recognition rate was almost the same as that of MFCC.

Figure 9. Results of speaker-independent phoneme recognition using feature integration.

D. Recognition results with feature integration in clean and noisy (white noise) environments

Since FWM showed the highest phoneme recognition rate among the single features at every white-Gaussian-noise level, it was combined with MFCC (with power), ΔMFCC (with Δ power) and ΔΔMFCC (with ΔΔ power) in phoneme recognition. The results are shown in Figure 9. FWM improved the recognition rate by 3.0 points and 4.0 points when combined with MFCC and ΔMFCC, respectively, compared with the original speaker-independent FWM (82.1% in Figure 8). When the four features FWM, MFCC, ΔMFCC and ΔΔMFCC were combined together, the

recognition rate reached the highest score, 86.6%, which was 0.5 points higher than the result of MFCC + ΔMFCC + ΔΔMFCC. This indicates that FWM carries information that improves on the recognition rate obtained by the MFCC, ΔMFCC and ΔΔMFCC combination. Furthermore, for noisy speech at an SNR of 0 dB, the recognition rate of FWM + MFCC + ΔMFCC + ΔΔMFCC was 6.4 points higher than that of MFCC + ΔMFCC + ΔΔMFCC. This shows the robustness of FWM in noisy environments in comparison with MFCC.

Figure 10. Recognition rates in real noisy environments using each single feature (%).

E. Recognition results in real noisy environments

To evaluate our method with real noise data, two kinds of noise data (restaurant and street) were used from the CENSREC-1-C database [5], recorded in low-SNR (5 dB ≤ SNR < 10 dB) and high-SNR (10 dB ≤ SNR) environments. Figure 10 shows the recognition rates for each single feature. The proposed feature (FWM), whose number of dimensions is reduced to 50 with PCA, improves the recognition rate to 65.4% and 39.2% from 46.5% and 27.1% at the high SNR and the low SNR in the restaurant environment, respectively, compared with the recognition results using MFCC (with power). Likewise, the proposed feature (FWM) improves the recognition rate to 60.1% and 48.9% from 39.9% and 31.3% at the high SNR and the low SNR in the street environment, respectively, compared with the recognition result using MFCC (with power).

Figure 11 shows the recognition rates for feature integration. The integrated feature FWM + MFCC improves the recognition rate by 2.7 points and 3.1 points at the high and low SNR, respectively, in the restaurant environment, compared with MFCC + ΔMFCC. A similar result is obtained for the street environment, as shown in Figure 11. The experimental results thus show that FWM performs better than ΔMFCC as dynamic information in feature integration. Moreover, the integration FWM + MFCC + ΔMFCC + ΔΔMFCC improves the recognition rate by 3.8 points and 2.4 points at the high SNR and low SNR in the restaurant environment, respectively, and by 5.8 points and 5.5 points at the high SNR and low SNR in the street environment, respectively.

Figure 11. Recognition rates in real noisy environments using feature integration (%).

VI. Conclusion

We described a new feature extraction method based on higher-order local auto-correlation and Fisher weight maps (FWM), which enhances geometrical features in a two-dimensional local area. For speaker-independent phoneme recognition, the proposed FWM showed an 82.1% recognition rate, 8.9 and 7.0 points higher than the results obtained using MFCC and ΔMFCC, respectively, for clean speech. Furthermore, by combining FWM with MFCC, ΔMFCC and ΔΔMFCC, the recognition rate improved to 86.6%. For noisy speech in the restaurant and street

environments, the proposed FWM showed 65.4% and 60.1% recognition rates, 18.9 and 20.2 points higher than the results obtained using MFCC, respectively. By further combining with MFCC, ΔMFCC and ΔΔMFCC, the recognition rates improved to 70.7% and 64.0%, respectively. Experimental comparisons of the proposed feature extraction method FWM and MFCC in noisy environments have demonstrated that FWM is more noise robust and offers better encoding of temporal dynamics than MFCC. In future research, we will extend the method to an HMM expression and apply it to continuous phoneme recognition. Remaining problems include the lack of normalization such as CMN and the composition of HMMs with noise components. We will investigate these problems theoretically, as studied in [3].

References

[1] Y. Ariki, S. Kato and T. Takiguchi, Phoneme recognition based on Fisher weight map to higher-order local auto-correlation, Proceedings of the 9th International Conference on Spoken Language Processing (Interspeech - ICSLP), September 2006.

[2] I. M.-Chagnolleau, G. Durou and F. Bimbot, Application of time-frequency principal component analysis to text-independent speaker identification, IEEE Trans. on Speech and Audio Processing 10(6) (2002).

[3] M. P. Cooke, P. D. Green, L. B. Josifovski and A. Vizinho, Robust automatic speech recognition with missing and uncertain acoustic data, Speech Commun. 24 (2001).

[4] M. Heckmann, X. Domont, F. Joublin and C. Goerick, A closer look on hierarchical spectro-temporal features (HIST), Proceedings of Interspeech 2008, September 2008.

[5] N. Kitaoka, K. Yamamoto, T. Kusamizu, S. Nakagawa, T. Yamada, S. Tsuge, C. Miyajima, T. Nishiura, M. Nakayama, Y. Denda, M. Fujimoto, T. Takiguchi, S. Tamura, S. Kuroiwa, K. Takeda and S.
Nakamura, Development of VAD evaluation framework CENSREC-1-C and investigation of relationship between VAD and speech recognition performance, Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, December 2007.

[6] N. Malayath, H. Hermansky, S. Kajarekar and B. Yegnanarayana, Data-driven temporal filters and alternatives to GMM in speaker verification, Digital Signal Process. 10 (2000).

[7] B. T. Meyer and B. Kollmeier, Optimization and evaluation of Gabor feature sets for ASR, Proceedings of Interspeech 2008, September 2008.

[8] T. Nitta, Feature extraction for speech recognition based on orthogonal acoustic feature planes and LDA, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, May 1999.

[9] K. Schutte and J. Glass, Speech recognition with localized time-frequency pattern detectors, Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, December 2007.

[10] Y. Shinohara and N. Otsu, Facial expression recognition using Fisher weight maps, Proceedings of the 6th IEEE International Conference on Automatic Face and Gesture Recognition, May 2004.

[11] S. Y. Zhao and N. Morgan, Multi-stream spectro-temporal features for robust speech recognition, Proceedings of Interspeech 2008, September 2008.


More information

Combining Audio and Video for Detection of Spontaneous Emotions

Combining Audio and Video for Detection of Spontaneous Emotions Combining Audio and Video for Detection of Spontaneous Emotions Rok Gajšek, Vitomir Štruc, Simon Dobrišek, Janez Žibert, France Mihelič, and Nikola Pavešić Faculty of Electrical Engineering, University

More information

Classification of Hyperspectral Breast Images for Cancer Detection. Sander Parawira December 4, 2009

Classification of Hyperspectral Breast Images for Cancer Detection. Sander Parawira December 4, 2009 1 Introduction Classification of Hyperspectral Breast Images for Cancer Detection Sander Parawira December 4, 2009 parawira@stanford.edu In 2009 approximately one out of eight women has breast cancer.

More information

IMPROVED SPEAKER RECOGNITION USING DCT COEFFICIENTS AS FEATURES. Mitchell McLaren, Yun Lei

IMPROVED SPEAKER RECOGNITION USING DCT COEFFICIENTS AS FEATURES. Mitchell McLaren, Yun Lei IMPROVED SPEAKER RECOGNITION USING DCT COEFFICIENTS AS FEATURES Mitchell McLaren, Yun Lei Speech Technology and Research Laboratory, SRI International, California, USA {mitch,yunlei}@speech.sri.com ABSTRACT

More information

Modeling the Spectral Envelope of Musical Instruments

Modeling the Spectral Envelope of Musical Instruments Modeling the Spectral Envelope of Musical Instruments Juan José Burred burred@nue.tu-berlin.de IRCAM Équipe Analyse/Synthèse Axel Röbel / Xavier Rodet Technical University of Berlin Communication Systems

More information

Research on Emotion Recognition for Facial Expression Images Based on Hidden Markov Model

Research on Emotion Recognition for Facial Expression Images Based on Hidden Markov Model e-issn: 2349-9745 p-issn: 2393-8161 Scientific Journal Impact Factor (SJIF): 1.711 International Journal of Modern Trends in Engineering and Research www.ijmter.com Research on Emotion Recognition for

More information

LOCAL APPEARANCE BASED FACE RECOGNITION USING DISCRETE COSINE TRANSFORM

LOCAL APPEARANCE BASED FACE RECOGNITION USING DISCRETE COSINE TRANSFORM LOCAL APPEARANCE BASED FACE RECOGNITION USING DISCRETE COSINE TRANSFORM Hazim Kemal Ekenel, Rainer Stiefelhagen Interactive Systems Labs, University of Karlsruhe Am Fasanengarten 5, 76131, Karlsruhe, Germany

More information

DESIGN OF SPATIAL FILTER FOR WATER LEVEL DETECTION

DESIGN OF SPATIAL FILTER FOR WATER LEVEL DETECTION Far East Journal of Electronics and Communications Volume 3, Number 3, 2009, Pages 203-216 This paper is available online at http://www.pphmj.com 2009 Pushpa Publishing House DESIGN OF SPATIAL FILTER FOR

More information

Multifactor Fusion for Audio-Visual Speaker Recognition

Multifactor Fusion for Audio-Visual Speaker Recognition Proceedings of the 7th WSEAS International Conference on Signal, Speech and Image Processing, Beijing, China, September 15-17, 2007 70 Multifactor Fusion for Audio-Visual Speaker Recognition GIRIJA CHETTY

More information

PDF hosted at the Radboud Repository of the Radboud University Nijmegen

PDF hosted at the Radboud Repository of the Radboud University Nijmegen PDF hosted at the Radboud Repository of the Radboud University Nijmegen The following full text is a publisher's version. For additional information about this publication click this link. http://hdl.handle.net/2066/94752

More information

Face detection and recognition. Detection Recognition Sally

Face detection and recognition. Detection Recognition Sally Face detection and recognition Detection Recognition Sally Face detection & recognition Viola & Jones detector Available in open CV Face recognition Eigenfaces for face recognition Metric learning identification

More information

Face Detection and Recognition in an Image Sequence using Eigenedginess

Face Detection and Recognition in an Image Sequence using Eigenedginess Face Detection and Recognition in an Image Sequence using Eigenedginess B S Venkatesh, S Palanivel and B Yegnanarayana Department of Computer Science and Engineering. Indian Institute of Technology, Madras

More information

Production of Video Images by Computer Controlled Cameras and Its Application to TV Conference System

Production of Video Images by Computer Controlled Cameras and Its Application to TV Conference System Proc. of IEEE Conference on Computer Vision and Pattern Recognition, vol.2, II-131 II-137, Dec. 2001. Production of Video Images by Computer Controlled Cameras and Its Application to TV Conference System

More information

Acoustic to Articulatory Mapping using Memory Based Regression and Trajectory Smoothing

Acoustic to Articulatory Mapping using Memory Based Regression and Trajectory Smoothing Acoustic to Articulatory Mapping using Memory Based Regression and Trajectory Smoothing Samer Al Moubayed Center for Speech Technology, Department of Speech, Music, and Hearing, KTH, Sweden. sameram@kth.se

More information

Data preprocessing Functional Programming and Intelligent Algorithms

Data preprocessing Functional Programming and Intelligent Algorithms Data preprocessing Functional Programming and Intelligent Algorithms Que Tran Høgskolen i Ålesund 20th March 2017 1 Why data preprocessing? Real-world data tend to be dirty incomplete: lacking attribute

More information

FACE RECOGNITION USING INDEPENDENT COMPONENT

FACE RECOGNITION USING INDEPENDENT COMPONENT Chapter 5 FACE RECOGNITION USING INDEPENDENT COMPONENT ANALYSIS OF GABORJET (GABORJET-ICA) 5.1 INTRODUCTION PCA is probably the most widely used subspace projection technique for face recognition. A major

More information

Multidirectional 2DPCA Based Face Recognition System

Multidirectional 2DPCA Based Face Recognition System Multidirectional 2DPCA Based Face Recognition System Shilpi Soni 1, Raj Kumar Sahu 2 1 M.E. Scholar, Department of E&Tc Engg, CSIT, Durg 2 Associate Professor, Department of E&Tc Engg, CSIT, Durg Email:

More information

Mobile Face Recognization

Mobile Face Recognization Mobile Face Recognization CS4670 Final Project Cooper Bills and Jason Yosinski {csb88,jy495}@cornell.edu December 12, 2010 Abstract We created a mobile based system for detecting faces within a picture

More information

Face detection and recognition. Many slides adapted from K. Grauman and D. Lowe

Face detection and recognition. Many slides adapted from K. Grauman and D. Lowe Face detection and recognition Many slides adapted from K. Grauman and D. Lowe Face detection and recognition Detection Recognition Sally History Early face recognition systems: based on features and distances

More information

Video Editing Based on Situation Awareness from Voice Information and Face Emotion

Video Editing Based on Situation Awareness from Voice Information and Face Emotion 18 Video Editing Based on Situation Awareness from Voice Information and Face Emotion Tetsuya Takiguchi, Jun Adachi and Yasuo Ariki Kobe University Japan 1. Introduction Video camera systems are becoming

More information

Gaussian Mixture Model Coupled with Independent Component Analysis for Palmprint Verification

Gaussian Mixture Model Coupled with Independent Component Analysis for Palmprint Verification Gaussian Mixture Model Coupled with Independent Component Analysis for Palmprint Verification Raghavendra.R, Bernadette Dorizzi, Ashok Rao, Hemantha Kumar G Abstract In this paper we present a new scheme

More information

Upper-limit Evaluation of Robot Audition based on ICA-BSS in Multi-source, Barge-in and Highly Reverberant Conditions

Upper-limit Evaluation of Robot Audition based on ICA-BSS in Multi-source, Barge-in and Highly Reverberant Conditions 21 IEEE International Conference on Robotics and Automation Anchorage Convention District May 3-8, 21, Anchorage, Alaska, USA Upper-limit Evaluation of Robot Audition based on ICA-BSS in Multi-source,

More information

Dimension Reduction CS534

Dimension Reduction CS534 Dimension Reduction CS534 Why dimension reduction? High dimensionality large number of features E.g., documents represented by thousands of words, millions of bigrams Images represented by thousands of

More information

A NEW ROBUST IMAGE WATERMARKING SCHEME BASED ON DWT WITH SVD

A NEW ROBUST IMAGE WATERMARKING SCHEME BASED ON DWT WITH SVD A NEW ROBUST IMAGE WATERMARKING SCHEME BASED ON WITH S.Shanmugaprabha PG Scholar, Dept of Computer Science & Engineering VMKV Engineering College, Salem India N.Malmurugan Director Sri Ranganathar Institute

More information

On Pre-Image Iterations for Speech Enhancement

On Pre-Image Iterations for Speech Enhancement Leitner and Pernkopf RESEARCH On Pre-Image Iterations for Speech Enhancement Christina Leitner 1* and Franz Pernkopf 2 * Correspondence: christina.leitner@joanneum.at 1 JOANNEUM RESEARCH Forschungsgesellschaft

More information

A Gaussian Mixture Model Spectral Representation for Speech Recognition

A Gaussian Mixture Model Spectral Representation for Speech Recognition A Gaussian Mixture Model Spectral Representation for Speech Recognition Matthew Nicholas Stuttle Hughes Hall and Cambridge University Engineering Department PSfrag replacements July 2003 Dissertation submitted

More information

Image Processing. Image Features

Image Processing. Image Features Image Processing Image Features Preliminaries 2 What are Image Features? Anything. What they are used for? Some statements about image fragments (patches) recognition Search for similar patches matching

More information

Face Recognition Using Long Haar-like Filters

Face Recognition Using Long Haar-like Filters Face Recognition Using Long Haar-like Filters Y. Higashijima 1, S. Takano 1, and K. Niijima 1 1 Department of Informatics, Kyushu University, Japan. Email: {y-higasi, takano, niijima}@i.kyushu-u.ac.jp

More information

Discriminate Analysis

Discriminate Analysis Discriminate Analysis Outline Introduction Linear Discriminant Analysis Examples 1 Introduction What is Discriminant Analysis? Statistical technique to classify objects into mutually exclusive and exhaustive

More information

Comparing MFCC and MPEG-7 Audio Features for Feature Extraction, Maximum Likelihood HMM and Entropic Prior HMM for Sports Audio Classification

Comparing MFCC and MPEG-7 Audio Features for Feature Extraction, Maximum Likelihood HMM and Entropic Prior HMM for Sports Audio Classification MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Comparing MFCC and MPEG-7 Audio Features for Feature Extraction, Maximum Likelihood HMM and Entropic Prior HMM for Sports Audio Classification

More information

FOUR WEIGHTINGS AND A FUSION: A CEPSTRAL-SVM SYSTEM FOR SPEAKER RECOGNITION. Sachin S. Kajarekar

FOUR WEIGHTINGS AND A FUSION: A CEPSTRAL-SVM SYSTEM FOR SPEAKER RECOGNITION. Sachin S. Kajarekar FOUR WEIGHTINGS AND A FUSION: A CEPSTRAL-SVM SYSTEM FOR SPEAKER RECOGNITION Sachin S. Kajarekar Speech Technology and Research Laboratory SRI International, Menlo Park, CA, USA sachin@speech.sri.com ABSTRACT

More information

CHROMA AND MFCC BASED PATTERN RECOGNITION IN AUDIO FILES UTILIZING HIDDEN MARKOV MODELS AND DYNAMIC PROGRAMMING. Alexander Wankhammer Peter Sciri

CHROMA AND MFCC BASED PATTERN RECOGNITION IN AUDIO FILES UTILIZING HIDDEN MARKOV MODELS AND DYNAMIC PROGRAMMING. Alexander Wankhammer Peter Sciri 1 CHROMA AND MFCC BASED PATTERN RECOGNITION IN AUDIO FILES UTILIZING HIDDEN MARKOV MODELS AND DYNAMIC PROGRAMMING Alexander Wankhammer Peter Sciri introduction./the idea > overview What is musical structure?

More information

Application of Principal Components Analysis and Gaussian Mixture Models to Printer Identification

Application of Principal Components Analysis and Gaussian Mixture Models to Printer Identification Application of Principal Components Analysis and Gaussian Mixture Models to Printer Identification Gazi. Ali, Pei-Ju Chiang Aravind K. Mikkilineni, George T. Chiu Edward J. Delp, and Jan P. Allebach School

More information

THE PERFORMANCE of automatic speech recognition

THE PERFORMANCE of automatic speech recognition IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 6, NOVEMBER 2006 2109 Subband Likelihood-Maximizing Beamforming for Speech Recognition in Reverberant Environments Michael L. Seltzer,

More information

FACE RECOGNITION BASED ON GENDER USING A MODIFIED METHOD OF 2D-LINEAR DISCRIMINANT ANALYSIS

FACE RECOGNITION BASED ON GENDER USING A MODIFIED METHOD OF 2D-LINEAR DISCRIMINANT ANALYSIS FACE RECOGNITION BASED ON GENDER USING A MODIFIED METHOD OF 2D-LINEAR DISCRIMINANT ANALYSIS 1 Fitri Damayanti, 2 Wahyudi Setiawan, 3 Sri Herawati, 4 Aeri Rachmad 1,2,3,4 Faculty of Engineering, University

More information

Perspectives on Multimedia Quality Prediction Methodologies for Advanced Mobile and IP-based Telephony

Perspectives on Multimedia Quality Prediction Methodologies for Advanced Mobile and IP-based Telephony Perspectives on Multimedia Quality Prediction Methodologies for Advanced Mobile and IP-based Telephony Nobuhiko Kitawaki University of Tsukuba 1-1-1, Tennoudai, Tsukuba-shi, 305-8573 Japan. E-mail: kitawaki@cs.tsukuba.ac.jp

More information

A long, deep and wide artificial neural net for robust speech recognition in unknown noise

A long, deep and wide artificial neural net for robust speech recognition in unknown noise A long, deep and wide artificial neural net for robust speech recognition in unknown noise Feipeng Li, Phani S. Nidadavolu, and Hynek Hermansky Center for Language and Speech Processing Johns Hopkins University,

More information

LOCAL FEATURE EXTRACTION METHODS FOR FACIAL EXPRESSION RECOGNITION

LOCAL FEATURE EXTRACTION METHODS FOR FACIAL EXPRESSION RECOGNITION 17th European Signal Processing Conference (EUSIPCO 2009) Glasgow, Scotland, August 24-28, 2009 LOCAL FEATURE EXTRACTION METHODS FOR FACIAL EXPRESSION RECOGNITION Seyed Mehdi Lajevardi, Zahir M. Hussain

More information

A Visualization Tool to Improve the Performance of a Classifier Based on Hidden Markov Models

A Visualization Tool to Improve the Performance of a Classifier Based on Hidden Markov Models A Visualization Tool to Improve the Performance of a Classifier Based on Hidden Markov Models Gleidson Pegoretti da Silva, Masaki Nakagawa Department of Computer and Information Sciences Tokyo University

More information

Face Hallucination Based on Eigentransformation Learning

Face Hallucination Based on Eigentransformation Learning Advanced Science and Technology etters, pp.32-37 http://dx.doi.org/10.14257/astl.2016. Face allucination Based on Eigentransformation earning Guohua Zou School of software, East China University of Technology,

More information

Image and speech recognition Exercises

Image and speech recognition Exercises Image and speech recognition Exercises Włodzimierz Kasprzak v.3, 5 Project is co-financed by European Union within European Social Fund Exercises. Content. Pattern analysis (3). Pattern transformation

More information

Manifold Constrained Deep Neural Networks for ASR

Manifold Constrained Deep Neural Networks for ASR 1 Manifold Constrained Deep Neural Networks for ASR Department of Electrical and Computer Engineering, McGill University Richard Rose and Vikrant Tomar Motivation Speech features can be characterized as

More information

EM Algorithm with Split and Merge in Trajectory Clustering for Automatic Speech Recognition

EM Algorithm with Split and Merge in Trajectory Clustering for Automatic Speech Recognition EM Algorithm with Split and Merge in Trajectory Clustering for Automatic Speech Recognition Yan Han and Lou Boves Department of Language and Speech, Radboud University Nijmegen, The Netherlands {Y.Han,

More information

Optimizing feature representation for speaker diarization using PCA and LDA

Optimizing feature representation for speaker diarization using PCA and LDA Optimizing feature representation for speaker diarization using PCA and LDA itsikv@netvision.net.il Jean-Francois Bonastre jean-francois.bonastre@univ-avignon.fr Outline Speaker Diarization what is it?

More information

A text-independent speaker verification model: A comparative analysis

A text-independent speaker verification model: A comparative analysis A text-independent speaker verification model: A comparative analysis Rishi Charan, Manisha.A, Karthik.R, Raesh Kumar M, Senior IEEE Member School of Electronic Engineering VIT University Tamil Nadu, India

More information

Real Time Speaker Recognition System using MFCC and Vector Quantization Technique

Real Time Speaker Recognition System using MFCC and Vector Quantization Technique Real Time Speaker Recognition System using MFCC and Vector Quantization Technique Roma Bharti Mtech, Manav rachna international university Faridabad ABSTRACT This paper represents a very strong mathematical

More information

A Study on Different Challenges in Facial Recognition Methods

A Study on Different Challenges in Facial Recognition Methods Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.521

More information

2014, IJARCSSE All Rights Reserved Page 461

2014, IJARCSSE All Rights Reserved Page 461 Volume 4, Issue 1, January 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Real Time Speech

More information

MULTIVARIATE TEXTURE DISCRIMINATION USING A PRINCIPAL GEODESIC CLASSIFIER

MULTIVARIATE TEXTURE DISCRIMINATION USING A PRINCIPAL GEODESIC CLASSIFIER MULTIVARIATE TEXTURE DISCRIMINATION USING A PRINCIPAL GEODESIC CLASSIFIER A.Shabbir 1, 2 and G.Verdoolaege 1, 3 1 Department of Applied Physics, Ghent University, B-9000 Ghent, Belgium 2 Max Planck Institute

More information

Analyzing Vocal Patterns to Determine Emotion Maisy Wieman, Andy Sun

Analyzing Vocal Patterns to Determine Emotion Maisy Wieman, Andy Sun Analyzing Vocal Patterns to Determine Emotion Maisy Wieman, Andy Sun 1. Introduction The human voice is very versatile and carries a multitude of emotions. Emotion in speech carries extra insight about

More information

A MAXIMUM NOISE FRACTION TRANSFORM BASED ON A SENSOR NOISE MODEL FOR HYPERSPECTRAL DATA. Naoto Yokoya 1 and Akira Iwasaki 2

A MAXIMUM NOISE FRACTION TRANSFORM BASED ON A SENSOR NOISE MODEL FOR HYPERSPECTRAL DATA. Naoto Yokoya 1 and Akira Iwasaki 2 A MAXIMUM NOISE FRACTION TRANSFORM BASED ON A SENSOR NOISE MODEL FOR HYPERSPECTRAL DATA Naoto Yokoya 1 and Akira Iwasaki 1 Graduate Student, Department of Aeronautics and Astronautics, The University of

More information

[N569] Wavelet speech enhancement based on voiced/unvoiced decision

[N569] Wavelet speech enhancement based on voiced/unvoiced decision The 32nd International Congress and Exposition on Noise Control Engineering Jeju International Convention Center, Seogwipo, Korea, August 25-28, 2003 [N569] Wavelet speech enhancement based on voiced/unvoiced

More information

Maximum Likelihood Beamforming for Robust Automatic Speech Recognition

Maximum Likelihood Beamforming for Robust Automatic Speech Recognition Maximum Likelihood Beamforming for Robust Automatic Speech Recognition Barbara Rauch barbara@lsv.uni-saarland.de IGK Colloquium, Saarbrücken, 16 February 2006 Agenda Background: Standard ASR Robust ASR

More information

Biometrics Technology: Multi-modal (Part 2)

Biometrics Technology: Multi-modal (Part 2) Biometrics Technology: Multi-modal (Part 2) References: At the Level: [M7] U. Dieckmann, P. Plankensteiner and T. Wagner, "SESAM: A biometric person identification system using sensor fusion ", Pattern

More information

/HFWXUH1RWHVLQ&RPSXWHU6FLHQFH$XGLR DQG9LGHREDVHG%LRPHWULF3HUVRQ$XWKHQWLFDWLRQ Springer LNCS, Bigün, et. al., Eds., 1997.

/HFWXUH1RWHVLQ&RPSXWHU6FLHQFH$XGLR DQG9LGHREDVHG%LRPHWULF3HUVRQ$XWKHQWLFDWLRQ Springer LNCS, Bigün, et. al., Eds., 1997. /HFWXUHRWHVLQ&RPSXWHU6FLHQFH$XGLR DQG9LGHREDVHG%LRPHWULF3HUVRQ$XWKHQWLFDWLRQ Springer LNCS, Bigün, et. al., Eds., 997. 6XEEDQG$SSURDFK)RU$XWRPDWLF6SHDNHU5HFRJQLWLRQ 2SWLPDO'LYLVLRQ2I7KH)UHTXHQF\'RPDLQ

More information

GPU Accelerated Model Combination for Robust Speech Recognition and Keyword Search

GPU Accelerated Model Combination for Robust Speech Recognition and Keyword Search GPU Accelerated Model Combination for Robust Speech Recognition and Keyword Search Wonkyum Lee Jungsuk Kim Ian Lane Electrical and Computer Engineering Carnegie Mellon University March 26, 2014 @GTC2014

More information

CHAPTER 5 GLOBAL AND LOCAL FEATURES FOR FACE RECOGNITION

CHAPTER 5 GLOBAL AND LOCAL FEATURES FOR FACE RECOGNITION 122 CHAPTER 5 GLOBAL AND LOCAL FEATURES FOR FACE RECOGNITION 5.1 INTRODUCTION Face recognition, means checking for the presence of a face from a database that contains many faces and could be performed

More information

Face Recognition for Mobile Devices

Face Recognition for Mobile Devices Face Recognition for Mobile Devices Aditya Pabbaraju (adisrinu@umich.edu), Srujankumar Puchakayala (psrujan@umich.edu) INTRODUCTION Face recognition is an application used for identifying a person from

More information

RLAT Rapid Language Adaptation Toolkit

RLAT Rapid Language Adaptation Toolkit RLAT Rapid Language Adaptation Toolkit Tim Schlippe May 15, 2012 RLAT Rapid Language Adaptation Toolkit - 2 RLAT Rapid Language Adaptation Toolkit RLAT Rapid Language Adaptation Toolkit - 3 Outline Introduction

More information

Face recognition using Singular Value Decomposition and Hidden Markov Models

Face recognition using Singular Value Decomposition and Hidden Markov Models Face recognition using Singular Value Decomposition and Hidden Markov Models PETYA DINKOVA 1, PETIA GEORGIEVA 2, MARIOFANNA MILANOVA 3 1 Technical University of Sofia, Bulgaria 2 DETI, University of Aveiro,

More information

Voice Conversion Using Dynamic Kernel. Partial Least Squares Regression

Voice Conversion Using Dynamic Kernel. Partial Least Squares Regression Voice Conversion Using Dynamic Kernel 1 Partial Least Squares Regression Elina Helander, Hanna Silén, Tuomas Virtanen, Member, IEEE, and Moncef Gabbouj, Fellow, IEEE Abstract A drawback of many voice conversion

More information

Local Filter Selection Boosts Performance of Automatic Speechreading

Local Filter Selection Boosts Performance of Automatic Speechreading 52 5th Joint Symposium on Neural Computation Proceedings, UCSD (1998) Local Filter Selection Boosts Performance of Automatic Speechreading Michael S. Gray, Terrence J. Sejnowski Computational Neurobiology

More information

Illumination-Robust Face Recognition based on Gabor Feature Face Intrinsic Identity PCA Model

Illumination-Robust Face Recognition based on Gabor Feature Face Intrinsic Identity PCA Model Illumination-Robust Face Recognition based on Gabor Feature Face Intrinsic Identity PCA Model TAE IN SEOL*, SUN-TAE CHUNG*, SUNHO KI**, SEONGWON CHO**, YUN-KWANG HONG*** *School of Electronic Engineering

More information

GYROPHONE RECOGNIZING SPEECH FROM GYROSCOPE SIGNALS. Yan Michalevsky (1), Gabi Nakibly (2) and Dan Boneh (1)

GYROPHONE RECOGNIZING SPEECH FROM GYROSCOPE SIGNALS. Yan Michalevsky (1), Gabi Nakibly (2) and Dan Boneh (1) GYROPHONE RECOGNIZING SPEECH FROM GYROSCOPE SIGNALS Yan Michalevsky (1), Gabi Nakibly (2) and Dan Boneh (1) (1) Stanford University (2) National Research and Simulation Center, Rafael Ltd. 0 MICROPHONE

More information

Voice Conversion for Non-Parallel Datasets Using Dynamic Kernel Partial Least Squares Regression

Voice Conversion for Non-Parallel Datasets Using Dynamic Kernel Partial Least Squares Regression INTERSPEECH 2013 Voice Conversion for Non-Parallel Datasets Using Dynamic Kernel Partial Least Squares Regression Hanna Silén, Jani Nurminen, Elina Helander, Moncef Gabbouj Department of Signal Processing,

More information

Environment Independent Speech Recognition System using MFCC (Mel-frequency cepstral coefficient)

Environment Independent Speech Recognition System using MFCC (Mel-frequency cepstral coefficient) Environment Independent Speech Recognition System using MFCC (Mel-frequency cepstral coefficient) Kamalpreet kaur #1, Jatinder Kaur *2 #1, *2 Department of Electronics and Communication Engineering, CGCTC,

More information

Mengjiao Zhao, Wei-Ping Zhu

Mengjiao Zhao, Wei-Ping Zhu ADAPTIVE WAVELET PACKET THRESHOLDING WITH ITERATIVE KALMAN FILTER FOR SPEECH ENHANCEMENT Mengjiao Zhao, Wei-Ping Zhu Department of Electrical and Computer Engineering Concordia University, Montreal, Quebec,

More information

Separating Speech From Noise Challenge

Separating Speech From Noise Challenge Separating Speech From Noise Challenge We have used the data from the PASCAL CHiME challenge with the goal of training a Support Vector Machine (SVM) to estimate a noise mask that labels time-frames/frequency-bins

More information

Confidence Measures: how much we can trust our speech recognizers

Confidence Measures: how much we can trust our speech recognizers Confidence Measures: how much we can trust our speech recognizers Prof. Hui Jiang Department of Computer Science York University, Toronto, Ontario, Canada Email: hj@cs.yorku.ca Outline Speech recognition

More information

Enhancing Face Recognition from Video Sequences using Robust Statistics

Enhancing Face Recognition from Video Sequences using Robust Statistics Enhancing Face Recognition from Video Sequences using Robust Statistics Sid-Ahmed Berrani Christophe Garcia France Telecom R&D TECH/IRIS France Telecom R&D TECH/IRIS 4, rue du Clos Courtel BP 91226 4,

More information

Linear Discriminant Analysis in Ottoman Alphabet Character Recognition

Linear Discriminant Analysis in Ottoman Alphabet Character Recognition Linear Discriminant Analysis in Ottoman Alphabet Character Recognition ZEYNEB KURT, H. IREM TURKMEN, M. ELIF KARSLIGIL Department of Computer Engineering, Yildiz Technical University, 34349 Besiktas /

More information

Diagonal Principal Component Analysis for Face Recognition

Diagonal Principal Component Analysis for Face Recognition Diagonal Principal Component nalysis for Face Recognition Daoqiang Zhang,2, Zhi-Hua Zhou * and Songcan Chen 2 National Laboratory for Novel Software echnology Nanjing University, Nanjing 20093, China 2

More information

Robust Face Recognition via Sparse Representation Authors: John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma

Robust Face Recognition via Sparse Representation Authors: John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma Robust Face Recognition via Sparse Representation Authors: John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma Presented by Hu Han Jan. 30 2014 For CSE 902 by Prof. Anil K. Jain: Selected

More information