4th International Conference on Computing, Communication and Sensor Network, CCSN2015

Implementation of Speech Based Stress Level Monitoring System

V. Naveen Kumar 1, Dr. Y. Padma Sai 2, K. Sonali Swaroop 3
Department of Electronics and Communication Engineering 1,2,3
VNRVJIET, Hyderabad, India
naveenkumar_v@vnrvjiet.in 1

ABSTRACT

A pathological voice is one that exhibits abnormality in speech, and it is one of the many features of the human voice. Variations in pathological voices can differentiate hypertension from hypotension: a speech sample from a speaker with higher blood pressure shows higher variation in the frequency of the speech, and vice versa. Hypertension (high blood pressure) is a state in which the heart pumps blood through the arteries at a higher pressure, which can increase the accumulation of fatty plaques and raise the risk of heart attack. Hypotension (low blood pressure), on the other hand, can drive a person into an unconscious state. This paper proposes a model for a non-contact method of indicating a person's stress level. Feature extraction includes the calculation of MFCCs, and the k-means algorithm is used for clustering the data set and for codebook generation. The model we developed achieves an overall accuracy of 83%.

Keywords: Hypertension, Hypotension, Mel-frequency cepstrum coefficients, k-means algorithm, pathological voices.

I. INTRODUCTION

Voice production is the result of interactions between body parts that resonate. When a person breathes in, air passes through the vocal tract into the vocal cords and vibrates them at different frequencies during expiration [8]. All parts of the body play specific roles in voice production and may be responsible for variations in the speech. The human voice can be affected by various parameters; the pressure exerted by the blood on the walls of the blood vessels also affects the voice. The larynx consists of muscles that are surrounded by blood vessels.
The amount of pressure exerted on the walls of the blood vessels has a finite effect on these muscles, producing what are known as micro muscle tremors (MMT), which in turn cause variations in the speech [4]. The main idea behind this paper is to distinguish pathological voices from normal voices based on Mel frequency cepstrum coefficient feature vectors. The paper further focuses on detecting stress level by classifying the pathological voice. With the increasing demand for non-contact monitoring tools, our research helps in designing a simple and potentially economical tool that detects stress level by recording the patient's speech with a microphone. Hence this paper tries to classify these levels and indicate the state of a person. Speech analysis offers unique advantages, including user convenience, that make it an efficient technology to develop.

The human voice has many features, such as jitter, shimmer, noise-to-harmonic ratio (N/H), autocorrelation (A/C) and Mel frequency cepstrum coefficients (MFCCs). The speech for each stress level differs in parameters such as frequency and MFCCs, and this variation can be clearly differentiated. For instance, a sample with a high stress level shows higher variation in frequency compared to normal speech. The difference is shown in Figure 1.

Fig 1(a)

After the features are extracted, they are classified using algorithms such as dynamic time warping (DTW), hidden Markov models (HMM), artificial neural networks (ANN) and k-means classification [1]. Hence our paper uses Mel
frequency cepstrum coefficients for feature extraction and the k-means algorithm for classification.

The paper is organized as follows. Previous research work is described in Section II; the voice database of healthy and pathological voices used for classification is explained in Section III; Section IV covers the methodology of feature extraction and the k-means algorithm; Section V describes the results, followed by the conclusion in Section VI.

Fig 1: The variations in the frequency of speech samples having a) low stress, b) high stress and c) normal speech

II. RELATED WORK

Previous research work successfully demonstrated the usefulness of some basic features in classifying normal and high blood pressure patients. The features used were harmonic-to-noise ratio, mean pitch, shimmer and jitter; a k-means classifier gave an efficiency of 79% in classifying the two categories. MFCC features are another feature set that can be used for monitoring the minute variations in speech samples, and our study indicates that MFCCs do help in distinguishing the stress level.

III. DATA COLLECTION

The voice recordings were collected from people of different age groups, varying from 30 to 65 years, suffering from high blood pressure as well as low blood pressure. The subjects were free to choose the language. Each subject spoke for a duration of 20 seconds. The recordings were made using a mobile phone application, Easy Voice Recorder, available on the Google Play store. The samples were recorded at a sampling frequency of 16 kHz. Voice recordings from normal people were also collected and analysed.

IV. METHODOLOGY

The complete classification system is depicted in Figure 2. The entire process is divided into a training and a testing phase.

Fig 2: Block diagram of the system

In the training phase, the input sample is read into the system buffer and Mel frequency cepstrum coefficient (MFCC) features are extracted.

A.
Mel frequency cepstrum coefficients

Mel Frequency Cepstral Coefficients (MFCCs) are features widely used in speech processing applications and have proved efficient. Calculating the MFCCs involves the following steps.

1. Windowing

A speech signal changes continuously, so to simplify analysis it is assumed that on short time scales the signal does not change much. The input signal is therefore divided into frames of 20-40 ms duration [3]. If a frame is much shorter, not enough samples are available to obtain a reliable spectral estimate; if it is much longer, the signal changes too much within the frame.
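As a rough sketch of this framing step (the 25 ms frame and 10 ms hop used here are illustrative choices within the 20-40 ms range, not values fixed by the paper), the 16 kHz signal can be split into overlapping short-time frames and tapered with NumPy's built-in Hamming window:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a 1-D speech signal into overlapping short-time frames."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)       # 160 samples at 16 kHz
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    frames = np.stack([signal[i * hop_len : i * hop_len + frame_len]
                       for i in range(n_frames)])
    # Hamming window applied to every frame to taper the edges
    return frames * np.hamming(frame_len)

signal = np.random.randn(16000)          # one second of dummy audio at 16 kHz
frames = frame_signal(signal)
print(frames.shape)                      # (98, 400): 98 frames of 25 ms each
```

Each row of the result is then processed independently by the FFT and Mel-scale steps described next.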
The speech sample is multiplied by the Hamming window given by equation (1):

W(n) = 0.54 - 0.46 cos(2πn / (N - 1))    (1)

where n is the index of the sample within the frame, ranging from 0 to N - 1, and N is the width of the frame in samples.

2. Discrete Fourier transform (DFT)

The speech signal is analysed more accurately in the frequency domain, so each windowed frame is converted to the frequency domain. Because of the amount of computation involved in a direct DFT, the equivalent fast Fourier transform (FFT) is used, given by equation (2); equation (3) is called the twiddle factor:

X(k) = Σ_{n=0}^{N-1} x(n) W_N^{nk}    (2)

W_N = e^{-j2π/N}    (3)

3. Mel scale warping

MFCCs are a set of coefficients that represent a particular cosine transform of the real logarithm of the spectrum mapped onto the Mel-frequency scale. The approximate formula to compute mels for a given frequency f in Hz is given by equation (4):

mel(f) = 2595 log10(1 + f/700)    (4)

where f is the frequency to be mapped onto the Mel scale; for example, f = 1000 Hz maps to roughly 1000 mels.

Fig 3: Mel filter bank

4. Discrete cosine transform (DCT)

The transformed signal has to be mapped back from the log Mel-scale spectrum into a time-like (cepstral) domain; this is done using the DCT, which yields the cepstral coefficients. The complete MFCC process is described in Figure 4.

Fig 4: MFCCs feature extraction flow graph (windowing → FFT → Mel-scale frequency warping → log → DCT → MFCC feature vector)

After the feature vectors are extracted for the input speech sample, the MFCC feature vectors are subjected to k-means clustering and codebook generation using vector quantization.

B. Classification Using the K-means Algorithm

K-means is one of the simplest unsupervised learning algorithms [8]. The procedure classifies a given data set into a certain number of predefined clusters k. The first step is to define k centroids. The next step is to take each point in the data set and associate it with the nearest centroid. When no point is pending, this first pass is complete and an early clustering is obtained. The k centroids of the clusters resulting from the previous step are then re-calculated, and a new binding is established between the same data points and the nearest new centroids. This continues until the allocation of data points no longer changes, i.e., the centroids stop moving. In this experiment the number of centroids used is K = 3. The Euclidean distance is calculated using the formula given in equation (5):

d = ||x - y|| = √( Σ_{i=1}^{n} (x_i - y_i)² )    (5)
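A minimal sketch of this clustering procedure, using the Euclidean distance of equation (5) and K = 3 as in the experiment (the synthetic 2-D points below are stand-ins for the real MFCC feature vectors):

```python
import numpy as np

def kmeans(data, k=3, n_iter=100, seed=0):
    """Plain k-means: assign points to the nearest centroid (Euclidean
    distance, equation (5)), then recompute centroids until they settle."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # distance of every point to every centroid, then nearest-centroid labels
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        new_centroids = np.array([data[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):   # centroids stopped moving
            break
        centroids = new_centroids
    return centroids, labels

# three well-separated dummy clusters standing in for the feature vectors
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc, 0.1, size=(50, 2)) for loc in (0.0, 5.0, 10.0)])
centroids, labels = kmeans(data, k=3)
print(centroids.shape)                   # (3, 2): one centroid per cluster
```

The final centroids form the codebook stored for each stress class during training.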
Fig 5: Flow chart of the k-means algorithm (start → initialize the number of clusters → randomly select initial centres as the centroids of the K clusters → generate a new partition by assigning each data point to the closest cluster → calculate centroids for the new groups → repeat while the grouping changes → end)

V. SIMULATION RESULTS

A monitoring system that uses speech as input to indicate the stress level of a person has been developed. The system uses the MATLAB environment for analysing the speech samples. Figure 6 shows the simulation results for the stress level monitoring system.

Fig 6: MATLAB simulation of the stress level classification

In the training phase, the MFCC feature vectors were extracted with 39 parameters, comprising 13 MFC coefficients, 13 delta coefficients and 13 delta-delta coefficients. The features thus extracted are shown in Figure 7. Subplot 1 shows the speech waveform; subplot 2 shows the cepstrum representation of a speech sample indicating a normal stress level; subplots 3 and 4 show the cepstra of speech samples with low and high stress levels respectively. The variation in these cepstra can be seen clearly. The feature vectors thus obtained were clustered using the k-means algorithm.

Fig 7: Cepstrum representation of the input speech samples

Fig 8: K-means clustering of the training data and the test sample

In the testing phase, the test sample is applied to the system and its feature vector is calculated. The minimum distance between the codebooks generated in the training phase and the test feature vectors determines the stress level of the input speech sample. Figure 9 shows the simulation results for a speech sample identified as having a high stress level.
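The testing-phase decision rule described above can be sketched as a minimum-distance search over the per-class codebooks (the codebooks and test vectors below are random stand-ins for the real trained values, and the three class names are illustrative):

```python
import numpy as np

np.random.seed(0)

# hypothetical codebooks from the training phase: each class keeps the
# K=3 centroids of its 39-dimensional MFCC(+delta) feature vectors
codebooks = {
    "low":    np.random.randn(3, 39),
    "normal": np.random.randn(3, 39) + 5.0,
    "high":   np.random.randn(3, 39) + 10.0,
}

def classify(test_features, codebooks):
    """Pick the codebook whose codewords lie closest, on average, to the
    test sample's feature vectors (the minimum-distance rule)."""
    best_label, best_dist = None, np.inf
    for label, codebook in codebooks.items():
        d = np.linalg.norm(test_features[:, None, :] - codebook[None, :, :], axis=2)
        avg = d.min(axis=1).mean()       # distortion of the sample vs this codebook
        if avg < best_dist:
            best_label, best_dist = label, avg
    return best_label

test = np.random.randn(20, 39) + 10.0    # 20 feature vectors near the "high" codebook
print(classify(test, codebooks))         # high
```

Averaging the per-frame distortion makes the decision robust to the varying number of frames in each recording.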
Fig 9: Simulation results showing the test sample identified as high stress

VI. CONCLUSION

This paper presents an approach to identify the stress level of a person using speech. The MFCC feature vectors are calculated and clustered using the k-means algorithm, and an accuracy of 83% was obtained. Different classification algorithms can be explored in future work to increase the accuracy of the system.

REFERENCES

[1] A. A. Khulage and B. V. Pathak, "Analysis of speech under stress using linear techniques and non-linear techniques for emotion recognition system."
[2] A. Mesleh, D. Skopin, S. Baglikov and A. Quteishat, "Heart rate extraction from vowel speech signal," Journal of Computer Science and Technology, vol. 27, no. 6, pp. 1243-1251, Nov. 2012.
[3] B. V. Sathe-Pathak and A. R. Panat, "Extraction of pitch and formants and its analysis to identify 3 different emotional states of a person," IJCSI International Journal of Computer Science Issues, vol. 9, issue 4, no. 1, July 2012.
[4] C. S. Hopkins, R. J. Ratley, D. S. Benincasa and J. J. Grieco, "Evaluation of voice stress analysis technology," in Proc. 38th Hawaii International Conference on System Sciences, 2005.
[5] L. Dong, "Time series analysis of jitter in sustained vowels," in Proc. ICPhS XVII, Hong Kong, 17-21 Aug. 2011.
[6] M. Farrús, J. Hernando and P. Ejarque, "Jitter and shimmer measurements for speaker recognition."
[7] G. Zhou, J. H. L. Hansen and J. F. Kaiser, "Nonlinear feature based classification of speech under stress," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 3, March 2001.
[8] Saloni, R. K. Sharma and A. K. Gupta, "Classification of high blood pressure persons vs normal blood pressure persons using voice analysis," IIJSP, vol. 1, pp. 47-52, 2014.