Analyzing Mel Frequency Cepstral Coefficient for Recognition of Isolated English Word using DTW Matching


Mr. Nitin Goyal, Dr. R.K. Purwar
PG student, USICT New Delhi; Associate Professor, USICT New Delhi

Abstract- In this paper we propose Mel-frequency cepstral coefficient (MFCC) feature extraction and the Dynamic Time Warping (DTW) matching algorithm for speech recognition. The feature vector (the MFCCs) is obtained from each speech frame using the Fast Fourier Transform and the Discrete Cosine Transform. The DTW matching algorithm is applied to the feature vectors thus obtained while varying the number of MFCC coefficients. A clustered database was prepared for template matching. The effect of varying the vector size on matching is considered.

Keywords- Feature Extraction, Mel Frequency Cepstral Coefficients (MFCC), Dynamic Time Warping (DTW), Discrete Cosine Transform (DCT)

I. INTRODUCTION

Speech is the fundamental mode of communication for human beings, and computer usage has become inevitable in the modern era. Exchanging information between humans and computers is therefore a natural problem, and speech recognition systems address it. A recognition system converts words spoken by humans into a form that a computer can understand and respond to [1]. A speech recognition system has two main phases: a training phase and a testing phase [2]. It is impossible for one system to recognize all English words. Methods such as DTW and HMM require training [1][2].

The speech recognition system starts by converting the human voice (a continuous signal) into a digital signal [3]. Features are then extracted from the digitized speech using the MFCC technique [4][5]. The voice feature coefficients are then compared with template patterns in the database using Dynamic Time Warping (DTW) in order to find the spoken word. After an appropriate match, the corresponding system command is invoked.

This paper focuses on an effective method for recognizing English words and generating a particular system command to open a particular application. The rest of the paper is organized as follows: the methodology is discussed in Section II, followed by feature extraction and feature matching. The results and conclusion are presented subsequently.

II. METHODOLOGY

Two algorithms are used in this paper to perform isolated English word recognition. Feature extraction is done using MFCC [6]. Feature matching is implemented using DTW. Feature extraction is the most important of all the stages because it extracts the relevant information from the speech frames as feature vectors [7][8]. For the matching stage, the DTW algorithm, which is based on dynamic programming, is used [9][10]. This algorithm measures the similarity between two time series that may be non-linearly warped in time relative to each other.

III. FEATURE EXTRACTION

The first stage in the speech recognition process is feature extraction. MFCCs are coefficients that together represent the short-term power spectrum of a sound, based on a linear cosine transform of the log power spectrum on a nonlinear Mel scale of frequency [6]. MFCC is used to extract a feature vector from the sound wave. The MFCC algorithm is based on human hearing perception and uses a Mel-scale filter bank. The process of feature extraction is shown in the block diagram.
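To make the definition above concrete (a cosine transform of the log power spectrum on a Mel scale), here is a minimal sketch for a single frame. It is an illustration under stated assumptions, not the authors' implementation: the function names, the 20-filter bank, the 256-point FFT and the small constant added before the logarithm are all choices made for this example. The Mel mapping is the standard 2595 log10(1 + f/700) relation.

```python
import numpy as np

def hz_to_mel(f):
    """Linear frequency (Hz) to Mel frequency."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters equally spaced on the Mel scale.

    Returns an array of shape (n_filters, n_fft // 2 + 1).
    """
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    bin_freqs = np.arange(n_fft // 2 + 1) * sample_rate / n_fft
    bank = np.zeros((n_filters, bin_freqs.size))
    for k in range(n_filters):
        lo, mid, hi = hz_edges[k], hz_edges[k + 1], hz_edges[k + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        # A triangle is the lower of the two slopes, clipped at zero.
        bank[k] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc_frame(frame, sample_rate, n_filters=20, n_coeffs=12, n_fft=256):
    """MFCCs of one (already windowed) frame. Parameter defaults are assumptions."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2            # power spectrum
    energies = mel_filterbank(n_filters, n_fft, sample_rate) @ power
    log_e = np.log(energies + 1e-10)                          # avoid log(0)
    k = np.arange(1, n_filters + 1)
    m = np.arange(1, n_coeffs + 1)[:, None]
    basis = np.cos(m * (k - 0.5) * np.pi / n_filters)         # cosine transform
    return basis @ log_e

# Example: a 20 ms frame at 8 kHz containing a 440 Hz tone.
frame = np.hamming(160) * np.sin(2 * np.pi * 440 * np.arange(160) / 8000)
coeffs = mfcc_frame(frame, 8000)
print(coeffs.shape)
```

The default of 12 coefficients matches the default stated in the results section; every other default here is only a plausible starting point.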

Fig. 1. Speech signal's front-end analysis
Fig. 2. Speech signal's feature extraction
[Block diagram stages: Acquisition setup, Pre-emphasis, Frame Blocking and Windowing, Discrete Fourier Transform, Mel Filter Bank, Discrete Cosine Transform]

The MFCC algorithm can be completed in the following steps [5]:

A. Acquisition setup

To obtain an audio file, recordings were made with a microphone in a soundproof room. The sampling frequency for all recordings was 8000 Hz, at normal room temperature and humidity. The speakers sat in front of the microphone at a distance of [...] cm. In this way the continuous speech signal was digitized.

B. Pre-emphasis

The spectrum of speech has less energy at the highest frequencies, with a steadily decreasing slope. The objective of the pre-emphasis filter is to counterbalance this and flatten the spectrum. In MATLAB notation:

x_preemphasis = filter([1, -0.95], 1, x)   (1)

That is, the speech signal x(n) is sent to a high-pass filter:

$$y(n) = x(n) - a\,x(n-1) \tag{2}$$

where y(n) is the output signal and a lies between 0.95 and 0.97.

C. Frame Blocking

The speech signal is segmented into frames of 20 to 30 msec. For example, with a frame duration of 20 msec and a sampling frequency of 16 kHz, each frame contains 320 samples (at the 8 kHz rate used for the recordings, the same duration gives 160 samples). The speech signal is then available for processing over short time intervals. Each frame is assumed to be stationary, and a smooth transition from frame to frame is the goal of frame blocking. The sampled speech signal is blocked into frames of K samples, with adjacent frames separated by P samples (P < K) [7].

D. Windowing

To reduce the discontinuities of the speech signal at the edges of each frame, a tapered window is applied to each one. The most commonly used window is the Hamming window [9]. Each frame is multiplied by a Hamming window in order to keep continuity between the first and last points of the frame (i.e. to reduce the Gibbs effect).
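Steps B to D can be sketched as follows. This is a minimal illustration, assuming a = 0.95, 20 msec frames at the 8000 Hz recording rate, and a 50% overlap between frames; the paper does not state the overlap, so `hop` is an assumption, and the function names and the noise input are made up for the example. `np.hamming` gives the standard Hamming window.

```python
import numpy as np

def pre_emphasis(x, a=0.95):
    """High-pass filter y(n) = x(n) - a * x(n-1), with y(0) = x(0)."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - a * x[:-1])

def frame_signal(x, frame_len, hop):
    """Block x into frames of frame_len samples (K), hop samples apart (P).

    Assumes len(x) >= frame_len; trailing samples that do not fill a frame are dropped.
    """
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

sample_rate = 8000                      # Hz, as in the acquisition setup
frame_len = sample_rate * 20 // 1000    # 20 msec -> 160 samples
hop = frame_len // 2                    # assumed 50% overlap (P < K)

# One second of noise stands in for a recorded utterance.
x = np.random.default_rng(0).standard_normal(sample_rate)

# Pre-emphasise, block into frames, then taper every frame with a Hamming window.
frames = frame_signal(pre_emphasis(x), frame_len, hop) * np.hamming(frame_len)
print(frames.shape)
```

Each row of `frames` is one windowed frame, ready for the FFT step.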
If the signal in a frame is denoted by x(n), n = 0, ..., M-1, then the signal after Hamming windowing is x(n) w(n), where w(n) is the Hamming window defined by:

$$w(n) = (1 - a) - a\cos\!\left(\frac{2\pi n}{M - 1}\right), \quad 0 \le n \le M - 1 \tag{3}$$

Different values of a give different window curves; a = 0.46 gives the standard Hamming window. The a here is the window parameter, not the pre-emphasis coefficient.

E. Discrete Fourier Transform (DFT)

The time-domain samples are converted into the frequency domain using the FFT, which is an efficient algorithm for computing the DFT. The FFT is performed to obtain the magnitude frequency response of each frame:

$$Y(\omega) = \mathrm{FFT}[h(t) * x(t)] = H(\omega)\,X(\omega) \tag{4}$$

In the above equation, * denotes convolution, and X(ω), H(ω) and Y(ω) are the Fourier transforms of x(t), h(t) and y(t) respectively. This step converts each frame of K samples from the time domain into the frequency domain.

F. Mel Filter Bank

The Mel filter bank consists of triangular filters. The filters are equally spaced along the Mel frequency, which is related to the ordinary linear frequency f by the following equation:

$$F(\mathrm{Mel}) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right) \tag{5}$$

Mel frequency grows approximately with the logarithm of the linear frequency, reflecting similar effects in human subjective aural perception [5].

G. Discrete Cosine Transform

In this last stage the Mel frequency cepstral coefficients are obtained. We apply the DCT to the log energies E_k obtained from the triangular bandpass filters to get L mel-scale cepstral coefficients. The formula for the DCT is given by:

$$C_m = \sum_{k=1}^{N} \cos\!\left[\frac{m\,(k - 0.5)\,\pi}{N}\right] E_k, \quad m = 1, 2, \ldots, L \tag{6}$$

where N is the number of triangular bandpass filters and L is the number of mel-scale cepstral coefficients. Since the FFT has already been performed, the DCT transforms the frequency domain into a time-like domain called the quefrency domain. The obtained features are referred to as the mel-scale cepstral coefficients, or MFCCs.

IV. FEATURE MATCHING

Dynamic Time Warping (DTW) is used to perform feature matching. Dynamic programming (DP) is guaranteed to find the lowest-distance path through the matrix of frame-to-frame distances while minimizing the amount of computation. DTW is a non-parametric speech recognition technique based on template matching. It has been widely used for speaker-dependent speech recognition. The DTW algorithm was first introduced to recognize spoken words in 1978 by Sakoe and Chiba. It compares a test word with a reference word template. The basic idea is that, in the training phase, the feature vector sequence of the speech corresponding to each word in the vocabulary is extracted as a template and stored in the template library [9]. Then, in the recognition phase, the feature vector sequence of the speech to be recognized is compared with each template in the library by the DTW algorithm [10].

V. RESULT AND CONCLUSION

We have taken two sets of data. Dataset 1 is the vowels of the English language, a to u. Dataset 2 is the digits ONE through TEN in English. Each word and vowel was uttered 6 times by a speaker and recorded at a sampling frequency of 8 kHz. Of the 6 utterances of each word, 5 are used for clustering and 1 is used for testing. Clustering yields one representative of the 5, which is kept in the database. This technique saves processing time by reducing the number of training samples to match against from 5 to 1.
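The DTW comparison used for feature matching (Section IV) can be sketched as a small dynamic program. This is a minimal illustration under assumptions the paper does not confirm: the local cost is the Euclidean distance between feature vectors, and the path may step horizontally, vertically or diagonally with equal weight and no slope or window constraint. The function name is hypothetical.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between two feature sequences.

    a has shape (n, d) and b has shape (m, d): one d-dimensional
    feature vector (e.g. the MFCCs of one frame) per row.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)

    # Local cost: Euclidean distance between every pair of frames.
    local = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

    # acc[i, j] is the cheapest cumulative cost of aligning a[:i] with b[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = local[i - 1, j - 1] + min(
                acc[i - 1, j],      # a advances, b repeats a frame
                acc[i, j - 1],      # b advances, a repeats a frame
                acc[i - 1, j - 1],  # both advance
            )
    return acc[n, m]

# A time-stretched copy of a sequence is at distance zero from the original.
fast = np.array([[0.0], [1.0], [2.0]])
slow = np.array([[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]])
print(dtw_distance(fast, slow))
```

Because the warping path absorbs differences in speaking rate, two utterances of the same word at different speeds can still score a small distance.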
Test samples from the two datasets are taken as input, and features are extracted as MFCC coefficients. These coefficients are represented as a feature vector. The feature vectors are matched against the clustered reference templates already stored in the database. The matching is DTW based: the template with the minimum distance to the test series (the two time series may be non-linearly related in time) is chosen as the match. The number of MFCC coefficients considered may affect matching. By default 12 coefficients were taken into consideration. We varied the number from 6 to 24 in steps of 3. Tables 1-5 show dataset 1 matching and Tables 6-15 show dataset 2 matching.
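The matching procedure just described (pick the stored template with the minimum DTW distance, and repeat for 6 to 24 coefficients in steps of 3) can be sketched as below. This is illustrative only: the random arrays stand in for real MFCC sequences, the labels and the `recognise` function are hypothetical, and the compact `dtw` helper assumes a Euclidean local cost with unconstrained unit steps.

```python
import numpy as np

def dtw(a, b):
    """Compact DTW helper (Euclidean local cost), included so the block is self-contained."""
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    acc = np.full((len(a) + 1, len(b) + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[-1, -1]

def recognise(test, templates, n_coeffs=12):
    """Return (label, distance) of the template closest to test.

    Only the first n_coeffs coefficients of every frame are compared.
    """
    scores = {label: dtw(test[:, :n_coeffs], ref[:, :n_coeffs])
              for label, ref in templates.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]

# Synthetic stand-ins: one 24-coefficient template per word.
rng = np.random.default_rng(0)
templates = {"one": rng.standard_normal((30, 24)),
             "two": rng.standard_normal((35, 24))}

# A slower, slightly noisy rendition of "two" plays the role of the test utterance.
test = np.repeat(templates["two"], 2, axis=0) + 0.01 * rng.standard_normal((70, 24))

# Sweep the number of coefficients from 6 to 24 in steps of 3.
for n in range(6, 25, 3):
    label, dist = recognise(test, templates, n_coeffs=n)
    print(n, label, round(dist, 3))
```

With real data, the sweep shows at which coefficient count a misrecognised word starts to match, which is the comparison the tables report.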



The tables show that dataset 1 (the vowels) has 60% matching at the default number of MFCC coefficients (i.e. 12), because a and i do not match. By increasing the number of MFCC coefficients to 24 we achieved 100% matching. In turn, a larger number of MFCC coefficients increases the processing overhead. Dataset 2 (the digits) has 90% matching (nine does not match). For this dataset the tables show that varying the number of MFCC coefficients has no effect on matching and only adds processing overhead. When the matching succeeds, appropriate system commands can be invoked by some predefined means. Hence this paper presented isolated English word recognition for small applications such as robots for handicapped persons and small embedded systems.

REFERENCES

[1] Lawrence Rabiner, Biing-Hwang Juang and B. Yegnanarayana, Fundamentals of Speech Recognition, Dorling Kindersley (India) Pvt. Ltd.
[2] Santosh K. Gaikwad, Bharti W. Gawali and Pravin Yannawar, A Review on Speech Recognition Technique, International Journal of Computer Applications, Volume 10, No. 3, November 2010, pp. 16-24.
[3] Cormen et al., Introduction to Algorithms, Edition 3, 31 Jul 2009.
[7] Shanthi Therese S., Chelpa Lingam, Review of Feature Extraction Techniques in Automatic Speech Recognition, International Journal of Scientific Engineering and Technology, Volume No. 2, Issue No. 6.
[4] Chadawan Ittichaichareon, Siwat Suksri and Thaweesak Yingthawornsuk, Speech recognition using MFCC, International Conference on Computer Graphics, Simulation and Modeling (ICGSM'2012), July 28-29, 2012.
[5] Md. Afzal, Sheeraz Memon, Mark Gregory, A novel approach for MFCC feature extraction, 4th International Conference on Signal Processing and Communication Systems, pp. 1-5, Gold Coast, Australia.
[6] J. Chen, K. K. Paliwal, M. Mizumachi and S. Nakamura, "Robust MFCCs derived from differentiated power spectrum," Eurospeech 2001, Scandinavia.
[7] IEEE International Multitopic Conference, INMIC 2007, 2007; Wang Chen, Miao Zhenjiang and Meng Xiao, "Comparison of different implementations of MFCC," J. Computer Science & Technology, 2001, 16(16).
[8] Han Chunguang, Li Hua, Approach to Improve Robust Performance of Mel frequency Coefficient Cepstral, Computer Engineering and Design.
[9] Sakoe H. and Chiba S., Dynamic programming algorithm optimization for spoken word recognition, IEEE Trans. Acoust. Speech Signal Process., 1978, 26(1).
[10] Gregory N. Stainhaour and George Carayannis, New Parallel Implementations for DTW Algorithms, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 38, No. 4, 1990.
[11] Dalmiya C.P., Dharun V.S. and Rajesh K.P., An Efficient Method for Tamil Speech, Proceedings of IEEE Conference on ICT, 2013.
[12] Thomas H. Cormen, Charles E. Leiserson and Ronald L. Rivest, Approximation Algorithms, in Introduction to Algorithms, Prentice Hall of India Private Limited, 2001.
[13] Waibel Alexander, Krishanan N. and Reddy Dabbala Rajagopal, Minimizing computational cost for dynamic programming algorithms (1981), Carnegie Mellon University, Computer Science Department.
[14] Anany Levitin, Introduction to The Design & Analysis of Algorithms, Villanova University, Pearson Education (Singapore) Pvt. Ltd.
