International Journal of Emerging Trends in Science and Technology

Implementation of MFCC Extraction Architecture and DTW Technique in Speech Recognition System

R.M.Sneha 1, K.L.Hemalatha 2
1 PG Student, Dept of ECE, Easwari Engineering College, Chennai, rmsneha10@gmail.com
2 Assistant Professor, Dept of ECE, Easwari Engineering College, Chennai, hemalathasabs@gmail.com

ABSTRACT: Processing of the speech signal is essential for fast and accurate automatic speech recognition. The voice is a signal of infinite information, and the speech signal is analyzed to extract the information it carries. The two processes in speech recognition are feature extraction and feature matching. Mel Frequency Cepstral Coefficients (MFCCs) are used for extracting the features of a speech signal; MFCC is chosen because its feature-extraction stage can be implemented with low complexity and low power consumption. Dynamic Time Warping (DTW) is used for feature matching, where the voice signal of the speaker is compared with a pre-stored voice. DTW is an algorithm for measuring the similarity between two sequences that may vary in time or speed. Several other methods, such as Linear Predictive Coding (LPC), Hidden Markov Models (HMM), and Artificial Neural Networks (ANN), are also used for feature extraction and feature matching. The combination of the MFCC and DTW algorithms achieves high recognition accuracy.

Keywords - Feature Extraction, Feature Matching, Mel Frequency Cepstral Coefficient (MFCC), Dynamic Time Warping (DTW).

1. INTRODUCTION
Speaker recognition is a process that enables machines to understand and interpret human speech using certain algorithms, and to verify the authenticity of a speaker with the help of a database. First, the human speech is converted to a machine-readable format, after which the machine processes the data. The data processing consists of feature extraction and feature matching. Then, based on the processed data, a suitable action is taken by the machine; the action depends on the application. Every speaker is identified by the unique numerical values of certain signal parameters, called a template or code book, pertaining to the speech produced by his or her vocal tract. The speech parameters of the vocal tract normally considered for analysis are (i) formant frequencies, (ii) pitch, and (iii) loudness.

First, the human voice is converted into digital form, producing digital data that represents the signal level at every discrete time step. The digitized speech samples are then processed using MFCC to produce voice features. After that, the voice-feature coefficients go through DTW, which selects the pattern that matches the database and the input frame so as to minimize the resulting error between them. MFCC and DTW are the popularly used cepstrum-based methods for comparing patterns and measuring their similarity, and both techniques can be implemented in MATLAB.

2. PRINCIPLE OF VOICE RECOGNITION

2.1 Feature Extraction (MFCC)
Extracting the best parametric representation of the acoustic signal is an important task for producing good recognition performance, and the efficiency of this phase is important for the next phase since it affects its behavior. MFCC is based on human hearing perception, which does not resolve frequencies on a linear scale above 1 kHz; in other words, MFCC is based on the known variation of the human ear's critical bandwidth with frequency.
MFCC uses two types of filter spacing: linear spacing at low frequencies, below 1000 Hz, and logarithmic spacing above 1000 Hz. A subjective pitch scale, the mel frequency scale, is used to capture the important phonetic characteristics of speech. A quick numerical check of this linear-then-logarithmic behaviour is shown below.
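The short Python sketch below evaluates the mel mapping used later in Step 5, F(mel) = 2595 log10(1 + f/700), at a few frequencies; the sample frequencies are arbitrary illustration values, not figures from this paper.

import numpy as np

def hz_to_mel(f):
    """Convert a frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

# Below ~1 kHz the mapping is close to linear; above it, logarithmic.
for f in [250, 500, 1000, 2000, 4000, 8000]:
    print(f"{f:5d} Hz -> {hz_to_mel(f):7.1f} mel")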

The overall process of MFCC is shown in Fig. 1.

Fig. 1. MFCC Block Diagram

MFCC consists of six computational steps. Each step has its function and mathematical approach, as discussed briefly in the following.

Step 1: Pre-emphasis
The signal is passed through a filter that emphasizes the higher frequencies, increasing the energy of the signal at high frequency:

Y[n] = X[n] - 0.95 X[n-1]

Step 2: Framing
The speech samples obtained from analog-to-digital conversion (ADC) are segmented into small frames with lengths in the range of 20 to 40 ms. The voice signal is divided into frames of N samples, with adjacent frames separated by M samples (M < N).

Step 3: Hamming windowing
A Hamming window is used as the window shape, considering the next block in the feature-extraction processing chain and integrating all the closest frequency lines. If the window is defined as W(n), 0 <= n <= N-1, where N is the number of samples in each frame, Y[n] is the output signal, X(n) is the input signal, and W(n) is the Hamming window

W(n) = 0.54 - 0.46 cos(2*pi*n / (N-1)), 0 <= n <= N-1

then the result of windowing the signal is

Y(n) = X(n) x W(n)

Step 4: Fast Fourier Transform
Each frame of N samples is converted from the time domain into the frequency domain. The Fourier transform converts the convolution of the glottal pulse U[n] and the vocal-tract impulse response H[n] in the time domain into a multiplication in the frequency domain. This statement supports the equation below:

Y(w) = FFT[h(t) * X(t)] = H(w) . X(w)

where * denotes time-domain convolution and X(w), H(w), and Y(w) are the Fourier transforms of X(t), H(t), and Y(t), respectively.

Step 5: Mel Filter Bank Processing
The range of frequencies in the FFT spectrum is very wide, and the voice signal does not follow a linear scale. A set of triangular filters is used to compute a weighted sum of the filter spectral components, so that the output of the process approximates a mel scale. Each filter's magnitude frequency response is triangular in shape, equal to unity at its centre frequency and decreasing linearly to zero at the centre frequencies of the two adjacent filters. Each filter output is then the sum of its filtered spectral components. The mel value for a given frequency f in Hz is computed as:

F(mel) = 2595 x log10(1 + f/700)

Step 6: Discrete Cosine Transform
The log mel spectrum is converted back into the time domain using the Discrete Cosine Transform (DCT). The result of the conversion is called the Mel Frequency Cepstral Coefficients, and the set of coefficients is called an acoustic vector. Each input utterance is therefore transformed into a sequence of acoustic vectors. A compact sketch of these six steps is given below.
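To make the six steps concrete, the following Python (NumPy/SciPy) sketch strings them together. This is an illustrative software rendering, not the hardware architecture implemented in this paper; the frame length, hop, FFT size, filter count, and coefficient count are assumed typical values.

import numpy as np
from scipy.fftpack import dct

def mfcc(signal, fs=16000, frame_ms=25, hop_ms=10, n_filters=26, n_ceps=13):
    """Minimal MFCC pipeline: pre-emphasis -> framing -> Hamming window
    -> FFT -> mel filter bank -> log -> DCT. Parameter values are
    assumptions, not the paper's."""
    # Step 1: pre-emphasis, y[n] = x[n] - 0.95 x[n-1]
    x = np.append(signal[0], signal[1:] - 0.95 * signal[:-1])

    # Step 2: framing into N-sample frames, adjacent frames M apart (M < N)
    N = int(fs * frame_ms / 1000)
    M = int(fs * hop_ms / 1000)
    n_frames = 1 + (len(x) - N) // M
    frames = np.stack([x[i * M : i * M + N] for i in range(n_frames)])

    # Step 3: Hamming window, w[n] = 0.54 - 0.46 cos(2*pi*n/(N-1))
    frames = frames * np.hamming(N)

    # Step 4: magnitude spectrum via FFT
    nfft = 512
    spec = np.abs(np.fft.rfft(frames, nfft))

    # Step 5: triangular mel filter bank, unity at each centre frequency
    def mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def imel(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((nfft + 1) * pts / fs).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for j in range(n_filters):
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[j, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    energies = np.log(spec @ fbank.T + 1e-10)

    # Step 6: DCT of the log mel energies -> acoustic vectors
    return dct(energies, type=2, axis=1, norm='ortho')[:, :n_ceps]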

2.2 Feature Matching (DTW)
The Dynamic Time Warping (DTW) algorithm finds the edit distance between two sequences, that is, the minimum cost of the editing operations (substitution, insertion, and deletion) required to convert one sequence into the other [1]. It is a search algorithm that dynamically finds an optimal warping path for the minimum-cost alignment of the input onto a template. Measuring the edit distance is best visualized as a two-dimensional grid with the two sequences along the top and left sides. The central idea is to express the optimal path to any intermediate point in the grid in terms of the optimal paths of all its immediate antecedents. Let g(n, m) be the minimum path cost from the origin to any point (n, m) in the grid; then

g(n, m) = d(n, m) + min{g(n-1, m), g(n, m-1), g(n-1, m-1)}

where d(n, m) is the square of the Euclidean distance between the two sequence elements at (n, m).

Fig. 2. A warping between two time series

In Fig. 2, each vertical line connects a point in one time series to its correspondingly similar point in the other time series. The lines have similar values on the y-axis but have been separated so that the vertical lines between them can be seen more easily. If the two time series were identical, all of the lines would be straight vertical lines, because no warping would be necessary to line up the two series. The warp-path distance is a measure of the difference between the two time series after they have been warped together, measured as the sum of the distances between each pair of points connected by the vertical lines in the figure. Thus, two time series that are identical except for localized stretching of the time axis have a DTW distance of zero.

The principle of DTW is to compare two dynamic patterns and measure their similarity by calculating the minimum distance between them. The classic DTW is computed as follows. Suppose we have two time series Q and C, of length n and m respectively, where:

Q = q1, q2, ..., qi, ..., qn
C = c1, c2, ..., cj, ..., cm

To align the two sequences using DTW, an n-by-m matrix is constructed whose (i, j)th element contains the distance d(qi, cj) between the two points qi and cj. The distance between the values of the two sequences is calculated as the squared Euclidean distance:

d(qi, cj) = (qi - cj)^2

Each matrix element (i, j) corresponds to the alignment between the points qi and cj. The accumulated distance is then measured by:

D(i, j) = d(i, j) + min[D(i-1, j-1), D(i-1, j), D(i, j-1)]

In the grid, the horizontal axis represents the time of the test input signal and the vertical axis the time sequence of the reference template. The path that results in the minimum distance between the input and the template signal is sought. The shaded area represents the search space for the input-time-to-template-time mapping function, and any monotonically non-decreasing path within that space is a candidate. Using dynamic-programming techniques, the search for the minimum-distance path can be done in polynomial time:

P(t) = O(N^2 V)

where N is the length of the sequences and V is the number of templates to be considered.

In practice, the major optimizations of the DTW algorithm arise from observations on the nature of good paths through the grid.

Monotonic condition: the path will not turn back on itself; both the i and j indexes either stay the same or increase, they never decrease.
Continuity condition: the path advances one step at a time; both i and j can only increase by 1 on each step along the path.
Boundary condition: the path starts at the bottom left and ends at the top right.
Adjustment window condition: a good path is unlikely to wander very far from the diagonal; the distance that the path is allowed to wander is the window length r.
Slope constraint condition: the path should be neither too steep nor too shallow, which prevents very short sequences from matching very long ones. The condition is expressed as a ratio n/m, where n is the number of steps in the x direction and m the number in the y direction: after n steps in x you must make a step in y, and vice versa.

A minimal implementation of the DTW recurrence under these conditions is sketched below.
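The recurrence and path conditions above translate directly into dynamic programming. The Python sketch below is a plain O(nm) implementation with an optional adjustment window r (a diagonal band); it is a minimal illustration under those assumptions, not the optimized search used in the paper's hardware implementation.

import numpy as np

def dtw_distance(Q, C, r=None):
    """Classic DTW between sequences Q (length n) and C (length m).
    d(q_i, c_j) is the squared Euclidean distance; r is the optional
    adjustment-window width around the diagonal."""
    n, m = len(Q), len(C)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0  # boundary condition: path starts at the origin
    for i in range(1, n + 1):
        lo, hi = (1, m) if r is None else (max(1, i - r), min(m, i + r))
        for j in range(lo, hi + 1):
            cost = np.sum((np.atleast_1d(Q[i - 1]) - np.atleast_1d(C[j - 1])) ** 2)
            # monotonic + continuity: predecessors are (i-1,j-1), (i-1,j), (i,j-1)
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]  # boundary condition: path ends at the top right

# Two series identical except for localized time stretching align at zero cost:
a = np.array([0.0, 1, 2, 3, 2, 1, 0])
b = np.array([0.0, 1, 1, 2, 3, 3, 2, 1, 0])
print(dtw_distance(a, b))  # prints 0.0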

3. METHODOLOGY
There are two phases: a training phase and a testing phase. During the training phase the speech is presented to the system to build a reference model for the speaker: the features of the speech are extracted using the MFCC algorithm and stored in the database for future use. During the testing phase the input speech is matched against the stored reference template model using the dynamic time warping algorithm. Using the MFCC algorithm and the DTW algorithm, a speech-to-text application is created and implemented.

Fig. 3. Speech Algorithm Flow Chart

Voice recognition works on the premise that a person's voice exhibits characteristics that are unique to each speaker. However, the signals seen during the training and testing sessions can differ greatly due to many factors: a person's voice changes over time and with health condition (e.g. the speaker has a cold), and the speaking rate, acoustic noise, and recording environment and microphone also vary. A minimal sketch of the two-phase flow, building on the MFCC and DTW sketches above, follows.
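Assuming the mfcc() and dtw_distance() functions from the earlier sketches, the training phase reduces to storing acoustic-vector sequences per word and the testing phase to picking the stored template with the smallest DTW distance. The template store and word labels here are hypothetical illustration, not the paper's implementation.

templates = {}  # training phase: word -> stored MFCC sequence (reference model)

def train(word, signal, fs=16000):
    """Training phase: extract features and store them as the reference."""
    templates[word] = mfcc(signal, fs)

def recognize(signal, fs=16000):
    """Testing phase: match input features against every stored template
    and return the word with the minimum DTW distance."""
    feats = mfcc(signal, fs)
    return min(templates, key=lambda w: dtw_distance(feats, templates[w]))

# Hypothetical usage: train('hello', hello_samples); recognize(test_samples)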

4. RESULT AND DISCUSSION
The two speech recognition algorithms were simulated and also implemented in hardware. The results were simulated using MATLAB and Xilinx.

5. CONCLUSION
This paper has discussed two speech recognition algorithms that are important in improving speech recognition performance. The results show that these techniques can be used effectively for speech recognition purposes. Several other techniques, such as Linear Predictive Coding (LPC), Hidden Markov Models (HMM), and Artificial Neural Networks (ANN), are currently being investigated.

REFERENCES
[1] J. Gubbi, R. Buyya, S. Marusic, and M. Palaniswami, "Internet of Things (IoT): A vision, architectural elements, and future directions," Future Generat. Comput. Syst., vol. 29, no. 7, pp. 1645-1660, Sep. 2013.
[2] H.-W. Hon, "A survey of hardware architectures designed for speech recognition," Dept. Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA, Tech. Rep. CMU-CS-91-169, Aug. 1991.
[3] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ, USA: Prentice-Hall, 1993, pp. 1-9.
[4] D. R. Reddy, "Speech recognition by machine: A review," Proc. IEEE, vol. 64, no. 4, pp. 501-531, Apr. 1976.
[5] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 4, pp. 357-366, Aug. 1980.
[6] S. Nedevschi, R. K. Patra, and E. A. Brewer, "Hardware speech recognition for user interfaces in low cost, low power devices," in Proc. 42nd DAC, Jun. 2005, pp. 684-689.
[7] N.-V. Vu, J. Whittington, H. Ye, and J. Devlin, "Implementation of the MFCC front-end for low-cost speech recognition systems," in Proc. ISCAS, May/Jun. 2010, pp. 2334-2337.
[8] P. EhKan, T. Allen, and S. F. Quigley, "FPGA implementation for GMM-based speaker identification," Int. J. Reconfig. Comput., vol. 2011, no. 3, pp. 1-8, Jan. 2011, Art. ID 420369.
[9] R. Ramos-Lara, M. López-García, E. Cantó-Navarro, and L. Puente-Rodriguez, "Real-time speaker verification system implemented on reconfigurable hardware," J. Signal Process. Syst., vol. 71, no. 2, pp. 89-103, May 2013.
[10] D. G. Childers, D. P. Skinner, and R. C. Kemerait, "The cepstrum: A guide to processing," Proc. IEEE, vol. 65, no. 10, pp. 1428-1443, Oct. 1977.