International Journal of Emerging Trends in Science and Technology

Implementation of MFCC Extraction Architecture and DTW Technique in Speech Recognition System

R.M.Sneha 1, K.L.Hemalatha 2
1 PG Student, Dept of ECE, Easwari Engineering College, Chennai, rmsneha10@gmail.com
2 Assistant Professor, Dept of ECE, Easwari Engineering College, Chennai, hemalathasabs@gmail.com

ABSTRACT: Processing of the speech signal is essential for fast and accurate automatic speech recognition. The voice is a signal of infinite information, and the speech signal is analyzed to extract the information it carries. The two processes in speech recognition are feature extraction and feature matching. Mel Frequency Cepstral Coefficients (MFCCs) are used for extracting the features of a speech signal; MFCC is chosen because its feature-extraction stage can be implemented with low complexity and low power consumption. Dynamic Time Warping (DTW) is used for feature matching, where the voice signal of the speaker is compared with a pre-stored voice. DTW is an algorithm for measuring the similarity between two sequences that may vary in time or speed. Several other methods, such as Linear Predictive Coding (LPC), Hidden Markov Models (HMM), and Artificial Neural Networks (ANN), are also used for feature extraction and feature matching. The combination of the MFCC and DTW algorithms achieves high recognition accuracy.

Keywords - Feature Extraction, Feature Matching, Mel Frequency Cepstral Coefficient (MFCC), Dynamic Time Warping (DTW).

1. INTRODUCTION
Speaker recognition is a process that enables machines to understand and interpret human speech using certain algorithms, and to verify the authenticity of a speaker with the help of a database. First, the human speech is converted to a machine-readable format, after which the machine processes the data. The data processing consists of feature extraction and feature matching. Then, based on the processed data, a suitable action is taken by the machine; the action depends on the application. Every speaker is identified by the unique numerical values of certain signal parameters, called a template or code book, pertaining to the speech produced by his or her vocal tract. The speech parameters of the vocal tract normally considered for analysis are (i) formant frequencies, (ii) pitch, and (iii) loudness.

First, the human voice is converted into digital form, producing digital data that represents the signal level at every discrete time step. The digitized speech samples are then processed using MFCC to produce voice features. After that, the voice-feature coefficients go through DTW, which selects the pattern that matches the database and the input frame so as to minimize the resulting error between them. MFCC and DTW are the popularly used cepstrum-based methods for comparing patterns and measuring their similarity, and both techniques can be implemented in MATLAB.

2. PRINCIPLE OF VOICE RECOGNITION

2.1 Feature Extraction (MFCC)
Extracting the best parametric representation of the acoustic signal is an important task for producing good recognition performance, and the efficiency of this phase is important for the next phase since it affects its behavior. MFCC is based on human hearing perception, which does not resolve frequencies on a linear scale above 1 kHz; in other words, MFCC is based on the known variation of the human ear's critical bandwidth with frequency.
MFCC uses two types of filter spacing: linear spacing at low frequencies, below 1000 Hz, and logarithmic spacing above 1000 Hz. A subjective pitch scale, the mel frequency scale, is used to capture the important phonetic characteristics of speech. A quick numerical check of this linear-then-logarithmic behaviour is shown below.
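The short Python sketch below evaluates the mel mapping used later in Step 5, F(mel) = 2595 log10(1 + f/700), at a few frequencies; the sample frequencies are arbitrary illustration values, not figures from this paper.

import numpy as np

def hz_to_mel(f):
    """Convert a frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

# Below ~1 kHz the mapping is close to linear; above it, logarithmic.
for f in [250, 500, 1000, 2000, 4000, 8000]:
    print(f"{f:5d} Hz -> {hz_to_mel(f):7.1f} mel")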

The overall process of MFCC is shown in Fig. 1.

Fig. 1. MFCC Block Diagram

MFCC consists of six computational steps. Each step has its function and mathematical approach, as discussed briefly in the following.

Step 1: Pre-emphasis
The signal is passed through a filter that emphasizes the higher frequencies, increasing the energy of the signal at high frequency:

Y[n] = X[n] - 0.95 X[n-1]

Step 2: Framing
The speech samples obtained from analog-to-digital conversion (ADC) are segmented into small frames with lengths in the range of 20 to 40 ms. The voice signal is divided into frames of N samples, with adjacent frames separated by M samples (M < N).

Step 3: Hamming windowing
A Hamming window is used as the window shape, considering the next block in the feature-extraction processing chain and integrating all the closest frequency lines. If the window is defined as W(n), 0 <= n <= N-1, where N is the number of samples in each frame, Y[n] is the output signal, X(n) is the input signal, and W(n) is the Hamming window

W(n) = 0.54 - 0.46 cos(2*pi*n / (N-1)), 0 <= n <= N-1

then the result of windowing the signal is

Y(n) = X(n) x W(n)

Step 4: Fast Fourier Transform
Each frame of N samples is converted from the time domain into the frequency domain. The Fourier transform converts the convolution of the glottal pulse U[n] and the vocal-tract impulse response H[n] in the time domain into a multiplication in the frequency domain. This statement supports the equation below:

Y(w) = FFT[h(t) * X(t)] = H(w) . X(w)

where * denotes time-domain convolution and X(w), H(w), and Y(w) are the Fourier transforms of X(t), H(t), and Y(t), respectively.

Step 5: Mel Filter Bank Processing
The range of frequencies in the FFT spectrum is very wide, and the voice signal does not follow a linear scale. A set of triangular filters is used to compute a weighted sum of the filter spectral components, so that the output of the process approximates a mel scale. Each filter's magnitude frequency response is triangular in shape, equal to unity at its centre frequency and decreasing linearly to zero at the centre frequencies of the two adjacent filters. Each filter output is then the sum of its filtered spectral components. The mel value for a given frequency f in Hz is computed as:

F(mel) = 2595 x log10(1 + f/700)

Step 6: Discrete Cosine Transform
The log mel spectrum is converted back into the time domain using the Discrete Cosine Transform (DCT). The result of the conversion is called the Mel Frequency Cepstral Coefficients, and the set of coefficients is called an acoustic vector. Each input utterance is therefore transformed into a sequence of acoustic vectors. A compact sketch of these six steps is given below.
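To make the six steps concrete, the following Python (NumPy/SciPy) sketch strings them together. This is an illustrative software rendering, not the hardware architecture implemented in this paper; the frame length, hop, FFT size, filter count, and coefficient count are assumed typical values.

import numpy as np
from scipy.fftpack import dct

def mfcc(signal, fs=16000, frame_ms=25, hop_ms=10, n_filters=26, n_ceps=13):
    """Minimal MFCC pipeline: pre-emphasis -> framing -> Hamming window
    -> FFT -> mel filter bank -> log -> DCT. Parameter values are
    assumptions, not the paper's."""
    # Step 1: pre-emphasis, y[n] = x[n] - 0.95 x[n-1]
    x = np.append(signal[0], signal[1:] - 0.95 * signal[:-1])

    # Step 2: framing into N-sample frames, adjacent frames M apart (M < N)
    N = int(fs * frame_ms / 1000)
    M = int(fs * hop_ms / 1000)
    n_frames = 1 + (len(x) - N) // M
    frames = np.stack([x[i * M : i * M + N] for i in range(n_frames)])

    # Step 3: Hamming window, w[n] = 0.54 - 0.46 cos(2*pi*n/(N-1))
    frames = frames * np.hamming(N)

    # Step 4: magnitude spectrum via FFT
    nfft = 512
    spec = np.abs(np.fft.rfft(frames, nfft))

    # Step 5: triangular mel filter bank, unity at each centre frequency
    def mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def imel(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((nfft + 1) * pts / fs).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for j in range(n_filters):
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[j, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    energies = np.log(spec @ fbank.T + 1e-10)

    # Step 6: DCT of the log mel energies -> acoustic vectors
    return dct(energies, type=2, axis=1, norm='ortho')[:, :n_ceps]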

2.2 Feature Matching (DTW)
The Dynamic Time Warping (DTW) algorithm finds the edit distance between two sequences, that is, the minimum cost of the editing operations (substitution, insertion, and deletion) required to convert one sequence into the other [1]. It is a search algorithm that dynamically finds an optimal warping path for the minimum-cost alignment of the input onto a template. Measuring the edit distance is best visualized as a two-dimensional grid with the two sequences along the top and left sides. The central idea is to express the optimal path to any intermediate point in the grid in terms of the optimal paths of all its immediate antecedents. Let g(n, m) be the minimum path cost from the origin to any point (n, m) in the grid; then

g(n, m) = d(n, m) + min{g(n-1, m), g(n, m-1), g(n-1, m-1)}

where d(n, m) is the square of the Euclidean distance between the two sequence elements at (n, m).

Fig. 2. A warping between two time series

In Fig. 2, each vertical line connects a point in one time series to its correspondingly similar point in the other time series. The lines have similar values on the y-axis but have been separated so that the vertical lines between them can be seen more easily. If the two time series were identical, all of the lines would be straight vertical lines, because no warping would be necessary to line up the two series. The warp-path distance is a measure of the difference between the two time series after they have been warped together, measured as the sum of the distances between each pair of points connected by the vertical lines in the figure. Thus, two time series that are identical except for localized stretching of the time axis have a DTW distance of zero.

The principle of DTW is to compare two dynamic patterns and measure their similarity by calculating the minimum distance between them. The classic DTW is computed as follows. Suppose we have two time series Q and C, of length n and m respectively, where:

Q = q1, q2, ..., qi, ..., qn
C = c1, c2, ..., cj, ..., cm

To align the two sequences using DTW, an n-by-m matrix is constructed whose (i, j)th element contains the distance d(qi, cj) between the two points qi and cj. The distance between the values of the two sequences is calculated as the squared Euclidean distance:

d(qi, cj) = (qi - cj)^2

Each matrix element (i, j) corresponds to the alignment between the points qi and cj. The accumulated distance is then measured by:

D(i, j) = d(i, j) + min[D(i-1, j-1), D(i-1, j), D(i, j-1)]

In the grid, the horizontal axis represents the time of the test input signal and the vertical axis the time sequence of the reference template. The path that results in the minimum distance between the input and the template signal is sought. The shaded area represents the search space for the input-time-to-template-time mapping function, and any monotonically non-decreasing path within that space is a candidate. Using dynamic-programming techniques, the search for the minimum-distance path can be done in polynomial time:

P(t) = O(N^2 V)

where N is the length of the sequences and V is the number of templates to be considered.

In practice, the major optimizations of the DTW algorithm arise from observations on the nature of good paths through the grid.

Monotonic condition: the path will not turn back on itself; both the i and j indexes either stay the same or increase, they never decrease.
Continuity condition: the path advances one step at a time; both i and j can only increase by 1 on each step along the path.
Boundary condition: the path starts at the bottom left and ends at the top right.
Adjustment window condition: a good path is unlikely to wander very far from the diagonal; the distance that the path is allowed to wander is the window length r.
Slope constraint condition: the path should be neither too steep nor too shallow, which prevents very short sequences from matching very long ones. The condition is expressed as a ratio n/m, where n is the number of steps in the x direction and m the number in the y direction: after n steps in x you must make a step in y, and vice versa.

A minimal implementation of the DTW recurrence under these conditions is sketched below.
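The recurrence and path conditions above translate directly into dynamic programming. The Python sketch below is a plain O(nm) implementation with an optional adjustment window r (a diagonal band); it is a minimal illustration under those assumptions, not the optimized search used in the paper's hardware implementation.

import numpy as np

def dtw_distance(Q, C, r=None):
    """Classic DTW between sequences Q (length n) and C (length m).
    d(q_i, c_j) is the squared Euclidean distance; r is the optional
    adjustment-window width around the diagonal."""
    n, m = len(Q), len(C)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0  # boundary condition: path starts at the origin
    for i in range(1, n + 1):
        lo, hi = (1, m) if r is None else (max(1, i - r), min(m, i + r))
        for j in range(lo, hi + 1):
            cost = np.sum((np.atleast_1d(Q[i - 1]) - np.atleast_1d(C[j - 1])) ** 2)
            # monotonic + continuity: predecessors are (i-1,j-1), (i-1,j), (i,j-1)
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]  # boundary condition: path ends at the top right

# Two series identical except for localized time stretching align at zero cost:
a = np.array([0.0, 1, 2, 3, 2, 1, 0])
b = np.array([0.0, 1, 1, 2, 3, 3, 2, 1, 0])
print(dtw_distance(a, b))  # prints 0.0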

3. METHODOLOGY
There are two phases: a training phase and a testing phase. During the training phase the speech is presented to the system to build a reference model for the speaker: the features of the speech are extracted using the MFCC algorithm and stored in the database for future use. During the testing phase the input speech is matched against the stored reference template model using the dynamic time warping algorithm. Using the MFCC algorithm and the DTW algorithm, a speech-to-text application is created and implemented.

Fig. 3. Speech Algorithm Flow Chart

Voice recognition works on the premise that a person's voice exhibits characteristics that are unique to each speaker. However, the signals seen during the training and testing sessions can differ greatly due to many factors: a person's voice changes over time and with health condition (e.g. the speaker has a cold), and the speaking rate, acoustic noise, and recording environment and microphone also vary. A minimal sketch of the two-phase flow, building on the MFCC and DTW sketches above, follows.
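Assuming the mfcc() and dtw_distance() functions from the earlier sketches, the training phase reduces to storing acoustic-vector sequences per word and the testing phase to picking the stored template with the smallest DTW distance. The template store and word labels here are hypothetical illustration, not the paper's implementation.

templates = {}  # training phase: word -> stored MFCC sequence (reference model)

def train(word, signal, fs=16000):
    """Training phase: extract features and store them as the reference."""
    templates[word] = mfcc(signal, fs)

def recognize(signal, fs=16000):
    """Testing phase: match input features against every stored template
    and return the word with the minimum DTW distance."""
    feats = mfcc(signal, fs)
    return min(templates, key=lambda w: dtw_distance(feats, templates[w]))

# Hypothetical usage: train('hello', hello_samples); recognize(test_samples)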

4. RESULT AND DISCUSSION
The two speech recognition algorithms were simulated and also implemented in hardware. The results were simulated using MATLAB and Xilinx.

5. CONCLUSION
This paper has discussed two speech recognition algorithms that are important in improving speech recognition performance. The results show that these techniques can be used effectively for speech recognition purposes. Several other techniques, such as Linear Predictive Coding (LPC), Hidden Markov Models (HMM), and Artificial Neural Networks (ANN), are currently being investigated.

REFERENCES
[1] J. Gubbi, R. Buyya, S. Marusic, and M. Palaniswami, "Internet of Things (IoT): A vision, architectural elements, and future directions," Future Generat. Comput. Syst., vol. 29, no. 7, pp. 1645-1660, Sep. 2013.
[2] H.-W. Hon, "A survey of hardware architectures designed for speech recognition," Dept. Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA, Tech. Rep. CMU-CS-91-169, Aug. 1991.
[3] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ, USA: Prentice-Hall, 1993, pp. 1-9.
[4] D. R. Reddy, "Speech recognition by machine: A review," Proc. IEEE, vol. 64, no. 4, pp. 501-531, Apr. 1976.
[5] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 4, pp. 357-366, Aug. 1980.
[6] S. Nedevschi, R. K. Patra, and E. A. Brewer, "Hardware speech recognition for user interfaces in low cost, low power devices," in Proc. 42nd DAC, Jun. 2005, pp. 684-689.
[7] N.-V. Vu, J. Whittington, H. Ye, and J. Devlin, "Implementation of the MFCC front-end for low-cost speech recognition systems," in Proc. ISCAS, May/Jun. 2010, pp. 2334-2337.
[8] P. EhKan, T. Allen, and S. F. Quigley, "FPGA implementation for GMM-based speaker identification," Int. J. Reconfig. Comput., vol. 2011, no. 3, pp. 1-8, Jan. 2011, Art. ID 420369.
[9] R. Ramos-Lara, M. López-García, E. Cantó-Navarro, and L. Puente-Rodriguez, "Real-time speaker verification system implemented on reconfigurable hardware," J. Signal Process. Syst., vol. 71, no. 2, pp. 89-103, May 2013.
[10] D. G. Childers, D. P. Skinner, and R. C. Kemerait, "The cepstrum: A guide to processing," Proc. IEEE, vol. 65, no. 10, pp. 1428-1443, Oct. 1977.