Pitch Prediction from Mel-frequency Cepstral Coefficients Using Sparse Spectrum Recovery


1 Pitch Prediction from Mel-frequency Cepstral Coefficients Using Sparse Spectrum Recovery Achuth Rao MV, Prasanta Kumar Ghosh SPIRE LAB Electrical Engineering, Indian Institute of Science (IISc), Bangalore, India SPIRE LAB, IISc, Bangalore 1

2 1 Introduction 2 Proposed approach 3 Previous work and baseline 4 Experiments and results Database Experimental setup Evaluation 5 Conclusion and future work SPIRE LAB, IISc, Bangalore 2

3 Introduction Section 1 1 Introduction 2 Proposed approach 3 Previous work and baseline 4 Experiments and results Database Experimental setup Evaluation 5 Conclusion and future work SPIRE LAB, IISc, Bangalore 3

4 Introduction Motivation Automatic speech recognition (ASR) systems are very common in mobile devices. SPIRE LAB, IISc, Bangalore 4

5 Introduction Motivation Automatic speech recognition (ASR) systems are very common in mobile devices. Implementing ASR applications on mobile devices using these models could be challenging due to their computational and memory constraints. SPIRE LAB, IISc, Bangalore 4

6 Introduction Motivation Distributed speech recognition (DSR) allows ASR applications to be used in mobile devices 1. 1 Choi, 14-2: Invited Paper: Enabling Technologies for Wearable Smart Headsets, 2016 SPIRE LAB, IISc, Bangalore 5

7 Introduction Motivation Distributed speech recognition (DSR) allows ASR applications to be used in mobile devices 2. Such systems replace low bit-rate speech codecs with feature vectors (such as MFCCs). 2 Choi, 14-2: Invited Paper: Enabling Technologies for Wearable Smart Headsets, 2016. 3 Shao and Milner, Pitch prediction from MFCC vectors for speech reconstruction, 2004 SPIRE LAB, IISc, Bangalore 6

8 Introduction Motivation Distributed speech recognition (DSR) allows ASR applications to be used in mobile devices 2. Such systems replace low bit-rate speech codecs with feature vectors (such as MFCCs). The removal of the speech codec gives increased recognition accuracy, particularly in the presence of acoustic noise or channel errors 3. 2 Choi, 14-2: Invited Paper: Enabling Technologies for Wearable Smart Headsets, 2016. 3 Shao and Milner, Pitch prediction from MFCC vectors for speech reconstruction, 2004 SPIRE LAB, IISc, Bangalore 6

9 Introduction Motivation HMM-based recognizers used such features directly for ASR 4. 4 Gales, Maximum likelihood linear transformations for HMM-based speech recognition, 1998 SPIRE LAB, IISc, Bangalore 7

10 Introduction Motivation Recently, in many practical scenarios, speech recognition accuracy has come close to the human level using end-to-end deep architectures. Xiong et al., The Microsoft 2016 Conversational Speech Recognition System; Zweig et al., Advances in All-Neural Speech Recognition; Chan et al., Listen, attend and spell: A neural network for large vocabulary conversational speech recognition, 2016 SPIRE LAB, IISc, Bangalore 8

11 Introduction Motivation One way to use these features with such systems is to reconstruct the speech from the features. SPIRE LAB, IISc, Bangalore 9

12 Introduction Motivation In HMM-based ASR, the features used are, in most cases, Mel-frequency Cepstral Coefficients (MFCC). So we need a way to reconstruct the speech using only MFCCs. We therefore propose to predict the pitch from MFCC as the first step in speech reconstruction. SPIRE LAB, IISc, Bangalore 10

13 Introduction How do Mel-frequency Cepstral Coefficients (MFCC) encode the pitch information? SPIRE LAB, IISc, Bangalore 11

14 Introduction Source Filter model of speech 8 8 Fant, Acoustic theory of speech production: with calculations based on X-ray studies of Russian articulations, 1971 SPIRE LAB, IISc, Bangalore 12

15 Introduction Source Filter model of speech 8 Note the sparse nature of the speech spectrum! 8 Fant, Acoustic theory of speech production: with calculations based on X-ray studies of Russian articulations, 1971 SPIRE LAB, IISc, Bangalore 12

16 Introduction MFCC computation 9 9 Huang et al., Spoken language processing: A guide to theory, algorithm, and system development, 2001 SPIRE LAB, IISc, Bangalore 13

17 Introduction MFCC computation 9 w(n) is the window signal. 9 Huang et al., Spoken language processing: A guide to theory, algorithm, and system development, 2001 SPIRE LAB, IISc, Bangalore 13

18 Introduction MFCC computation 9 9 Huang et al., Spoken language processing: A guide to theory, algorithm, and system development, 2001 SPIRE LAB, IISc, Bangalore 13

19 Introduction MFCC computation 9 $H_m[k]$, $0 \le k \le N-1$, is the frequency response of the $m$-th filter. 9 Huang et al., Spoken language processing: A guide to theory, algorithm, and system development, 2001 SPIRE LAB, IISc, Bangalore 13

20 Introduction MFCC computation 9 9 Huang et al., Spoken language processing: A guide to theory, algorithm, and system development, 2001 SPIRE LAB, IISc, Bangalore 13

21 Introduction MFCC computation 9 9 Huang et al., Spoken language processing: A guide to theory, algorithm, and system development, 2001 SPIRE LAB, IISc, Bangalore 13

22 Introduction MFCC computation 9 9 Huang et al., Spoken language processing: A guide to theory, algorithm, and system development, 2001 SPIRE LAB, IISc, Bangalore 13
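The MFCC pipeline on these slides (window, magnitude DFT, mel filterbank energies, log, DCT truncation) can be sketched as below. This is a minimal illustration, not the authors' code; the mel formula, filter placement, and sampling rate are standard assumptions not fully specified on the slides.

```python
import numpy as np
from scipy.fft import dct

def mel_filterbank(num_filters, n_fft, sr):
    """Triangular filters H_m[k] spaced uniformly on the mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(mel(0.0), mel(sr / 2.0), num_filters + 2)
    bins = np.floor((n_fft + 1) * inv_mel(mel_pts) / sr).astype(int)
    H = np.zeros((num_filters, n_fft // 2 + 1))
    for m in range(1, num_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):                       # rising edge
            H[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                       # falling edge
            H[m - 1, k] = (r - k) / max(r - c, 1)
    return H

def mfcc(frame, sr=16000, n_fft=2048, M=26, K=13):
    """Window -> |DFT| -> mel filterbank energies (MFBE) -> log -> DCT."""
    w = np.hamming(len(frame))                      # w(n), the window signal
    spec = np.abs(np.fft.rfft(frame * w, n_fft))    # magnitude spectrum
    H = mel_filterbank(M, n_fft, sr)
    log_mfbe = np.log(H @ spec + 1e-10)             # log mel filterbank energies
    return dct(log_mfbe, type=2, norm='ortho')[:K]  # keep first K coefficients
```

Truncating the DCT to the first K of M coefficients is exactly the lossy step the later slides try to invert.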

23 Proposed approach Section 2 1 Introduction 2 Proposed approach 3 Previous work and baseline 4 Experiments and results Database Experimental setup Evaluation 5 Conclusion and future work SPIRE LAB, IISc, Bangalore 14

24 Proposed approach Proposed approach: Pitch prediction from MFCC What are the blocks to be inverted? SPIRE LAB, IISc, Bangalore 15

25 Proposed approach Proposed approach: Pitch prediction from MFCC The speech magnitude spectrum is enough to predict the pitch! SPIRE LAB, IISc, Bangalore 15

26 Proposed approach Proposed approach: Pitch prediction from MFCC Which blocks are non-invertible? SPIRE LAB, IISc, Bangalore 16

27 Proposed approach Proposed approach: Pitch prediction from MFCC Which blocks are non-invertible? SPIRE LAB, IISc, Bangalore 16

28 Proposed approach Proposed approach: Pitch prediction from MFCC Which blocks are non-invertible? SPIRE LAB, IISc, Bangalore 16

29 Proposed approach Proposed approach: Pitch prediction from MFCC Which blocks are non-invertible? SPIRE LAB, IISc, Bangalore 16

30 Proposed approach Proposed approach: Pitch prediction from MFCC Which blocks are non-invertible? We propose a three-step method to estimate the pitch from MFCC. 1 Estimate the MFBEs from the MFCC. 2 Recover the spectrum from the estimated MFBEs. 3 Estimate the pitch from the spectrum. SPIRE LAB, IISc, Bangalore 16

31 Proposed approach Proposed approach-estimation of the spectrum from the MFBEs SPIRE LAB, IISc, Bangalore 17

32 Proposed approach (1) Estimate the MFBE from the MFCC SPIRE LAB, IISc, Bangalore 18

33 Proposed approach (1) Estimate the MFBE from the MFCC The DCT operation is invertible only if the number of MFBEs (M) and MFCCs (K) are the same. If K < M, we use two methods to recover the MFBEs. 1 Z_DCT: zero padding of the MFCC. 2 DNN_DCT: DNN-based estimation. SPIRE LAB, IISc, Bangalore 19
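The Z_DCT variant can be sketched as below: pad the truncated MFCC vector with zeros up to M coefficients, then apply the inverse DCT. This assumes the orthonormal DCT-II convention; the slides do not state which DCT variant is used.

```python
import numpy as np
from scipy.fft import idct

def mfbe_from_mfcc_zeropad(mfcc_vec, M=26):
    """Z_DCT: append zeros for the truncated DCT coefficients, then invert
    the DCT to get an estimate of the log mel filterbank energies."""
    c = np.zeros(M)
    c[:len(mfcc_vec)] = mfcc_vec   # first K coefficients kept, rest zero
    return idct(c, type=2, norm='ortho')
```

When K = M this recovers the log-MFBEs exactly; for K < M the missing high-order coefficients introduce the estimation error studied in the experiments.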

34 Proposed approach (2) Recover the spectrum from the estimated MFBEs SPIRE LAB, IISc, Bangalore 20

35 Proposed approach [2a] Recover the spectrum from the estimated MFBEs. The voiced spectrum is sparse and the pitch can be determined from the voiced spectrum. The values around the harmonics are determined by the spectrum of the window. We model the voiced speech spectrum as

$$Y[k] \approx W[k] * \left( \sum_{l=1}^{L} x_l \, \delta(k - N_0 l) \right) = \sum_{l=1}^{L} x_l \, W(k - N_0 l)$$

This can be compactly written as $Y \approx Wx$, where $x$ is a sparse vector. SPIRE LAB, IISc, Bangalore 21
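The model $Y \approx Wx$ can be made concrete by assembling a dictionary matrix whose columns are shifted copies of the window's magnitude spectrum. This is a hypothetical construction for illustration; the slides do not give the exact matrix assembly, and the circular shift here is a simplification.

```python
import numpy as np

def window_dictionary(n_fft=2048, win_len=640):
    """Columns are shifted copies of the window's magnitude spectrum, so
    Y ≈ W @ x expresses the voiced spectrum as a sparse combination of
    harmonic impulses smeared by the window spectrum W(k - j)."""
    w = np.hamming(win_len)
    w_spec = np.abs(np.fft.fft(w, n_fft))   # window magnitude spectrum
    n_bins = n_fft // 2 + 1
    W = np.empty((n_bins, n_bins))
    for j in range(n_bins):
        # circular shift places the window main lobe at bin j
        W[:, j] = np.roll(w_spec, j)[:n_bins]
    return W
```

A voiced spectrum with pitch bin $N_0$ then corresponds to an $x$ with nonzero entries only at multiples of $N_0$.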

36 Proposed approach (2b) Recover the spectrum from the estimated MFBEs. There is error in the modeling because of non-invertibility. SPIRE LAB, IISc, Bangalore 22

37 Proposed approach (2b) Recover the spectrum from the estimated MFBEs. There is error in the modeling because of non-invertibility. The estimated MFBE can be written as $\hat{f} = H\hat{W}x + \gamma$, where $\gamma$ is the sum of the model and estimation noise. SPIRE LAB, IISc, Bangalore 22

38 Proposed approach (2b) Recover the spectrum from the estimated MFBEs. There is error in the modeling because of non-invertibility. The estimated MFBE can be written as $\hat{f} = H\hat{W}x + \gamma$, where $\gamma$ is the sum of the model and estimation noise. We propose two methods to recover the spectrum from the MFBEs. 1 Direct estimation of the spectrum under the noise model given above. 2 Estimation of the spectrum with a sparsity constraint on the spectrum. SPIRE LAB, IISc, Bangalore 22

39 Proposed approach (2c) Recover the spectrum from the estimated MFBEs. Given that $\hat{f} = H\hat{W}x + \gamma$, the maximum likelihood estimate of the spectrum is given by

$$x_{PINV} = \arg\min_{x} \|\hat{f} - H\hat{W}x\|_2^2 \quad (1)$$

SPIRE LAB, IISc, Bangalore 23

40 Proposed approach (2c) Recover the spectrum from the estimated MFBEs. Given that $\hat{f} = H\hat{W}x + \gamma$, the maximum likelihood estimate of the spectrum is given by

$$x_{PINV} = \arg\min_{x} \|\hat{f} - H\hat{W}x\|_2^2 \quad (1)$$

The solution has a closed form and can be written using the pseudo-inverse (PINV) of $H\hat{W}$ as follows:

$$x_{PINV} = \left((H\hat{W})^T H\hat{W}\right)^{-1} (H\hat{W})^T \hat{f} \quad (2)$$

SPIRE LAB, IISc, Bangalore 23
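The PINV solution of Eq. (1) can be sketched with a standard least-squares solver; this is equivalent to the pseudo-inverse closed form of Eq. (2) when $H\hat{W}$ has full column rank, and still returns a (minimum-norm) least-squares solution otherwise.

```python
import numpy as np

def spectrum_pinv(f_hat, HW):
    """x_PINV = argmin_x ||f_hat - HW x||_2^2, solved by least squares
    (equivalent to the pseudo-inverse closed form)."""
    x, *_ = np.linalg.lstsq(HW, f_hat, rcond=None)
    return x
```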

41 Proposed approach (2d) Recover the spectrum from the estimated MFBEs. We impose non-negativity and sparsity constraints on $x$. This results in the following optimization problem:

$$x_S = \arg\min_{x \ge 0} \|\hat{f} - H\hat{W}x\|_2^2 + \lambda \|x\|_1$$

10 Koh, Kim, and Boyd, An interior-point method for large-scale l1-regularized logistic regression, 2007 SPIRE LAB, IISc, Bangalore 24

42 Proposed approach (2d) Recover the spectrum from the estimated MFBEs. We impose non-negativity and sparsity constraints on $x$. This results in the following optimization problem:

$$x_S = \arg\min_{x \ge 0} \|\hat{f} - H\hat{W}x\|_2^2 + \lambda \|x\|_1$$

Since there is a non-negativity constraint on $x$, the $l_1$ norm of $x$ can be written as the sum of its elements. The equivalent optimization problem becomes:

$$x_S = \arg\min_{x \ge 0} \|\hat{f} - H\hat{W}x\|_2^2 + \lambda \mathbf{1}^T x \quad (3)$$

This optimization is posed as a quadratic programming problem 10. 10 Koh, Kim, and Boyd, An interior-point method for large-scale l1-regularized logistic regression, 2007 SPIRE LAB, IISc, Bangalore 24

43 Proposed approach (2d) Recover the spectrum from the estimated MFBEs. We impose non-negativity and sparsity constraints on $x$. This results in the following optimization problem:

$$x_S = \arg\min_{x \ge 0} \|\hat{f} - H\hat{W}x\|_2^2 + \lambda \|x\|_1$$

Since there is a non-negativity constraint on $x$, the $l_1$ norm of $x$ can be written as the sum of its elements. The equivalent optimization problem becomes:

$$x_S = \arg\min_{x \ge 0} \|\hat{f} - H\hat{W}x\|_2^2 + \lambda \mathbf{1}^T x \quad (3)$$

This optimization is posed as a quadratic programming problem 10. Note that $\lambda$ is a hyper-parameter and controls the sparsity. 10 Koh, Kim, and Boyd, An interior-point method for large-scale l1-regularized logistic regression, 2007 SPIRE LAB, IISc, Bangalore 24
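Eq. (3) can be solved with any QP solver; the paper uses an interior-point method 10. As a self-contained sketch, projected gradient descent (a simpler substitute, not the authors' solver) handles the non-negativity constraint by clipping after each step:

```python
import numpy as np

def spectrum_sparse(f_hat, HW, lam=0.1, n_iter=2000):
    """Sketch of x_S = argmin_{x >= 0} ||f_hat - HW x||_2^2 + lam * 1^T x,
    solved by projected gradient descent instead of the interior-point
    QP solver used in the paper."""
    # step size from the Lipschitz constant of the gradient, 2*sigma_max^2
    step = 1.0 / (2.0 * np.linalg.norm(HW, 2) ** 2)
    x = np.zeros(HW.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * HW.T @ (HW @ x - f_hat) + lam  # gradient of Eq. (3)
        x = np.maximum(0.0, x - step * grad)        # project onto x >= 0
    return x
```

Larger `lam` drives more entries of `x` to zero, which is how the sparsity level is controlled.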

44 Proposed approach (3) Estimation of the pitch from the estimated spectrum SPIRE LAB, IISc, Bangalore 25

45 Proposed approach (3) Estimation of the pitch from the estimated spectrum We use the Subharmonic-to-Harmonic Ratio (SHR) 11 to estimate the pitch from the spectrum. Given the magnitude spectrum $X(f)$, the pitch range $S$ and the number of harmonics $Q$, the pitch value $p^*$ is obtained by solving the optimization below:

$$p^* = \arg\max_{f \in S} \sum_{k=1}^{Q} \int_0^{\infty} \log X(f') \left[ \delta(f' - kf) - \delta(f' - (k - 1/2)f) \right] df' \quad (4)$$

11 Sun, Pitch determination and voice quality analysis using subharmonic-to-harmonic ratio, 2002 SPIRE LAB, IISc, Bangalore 26

46 Proposed approach [3] Estimation of the pitch from the estimated spectrum SPIRE LAB, IISc, Bangalore 27

47 Previous work and baseline Section 3 1 Introduction 2 Proposed approach 3 Previous work and baseline 4 Experiments and results Database Experimental setup Evaluation 5 Conclusion and future work SPIRE LAB, IISc, Bangalore 28

48 Previous work and baseline Previous work and baseline There are several works in the literature where the pitch is predicted from the MFCC using statistical models such as Gaussian mixture models (GMM) and hidden Markov models 12. Here we use a Deep neural network (DNN) based method, which has shown a lot of success in many fields, to predict the pitch from MFCC. We refer to this DNN as DNN_b. 12 Milner and Shao, Prediction of fundamental frequency and voicing from Mel-frequency cepstral coefficients for unconstrained speech reconstruction. 13 Shao and Milner, Pitch prediction from MFCC vectors for speech reconstruction, 2004 SPIRE LAB, IISc, Bangalore 29

49 Previous work and baseline [3] Estimation of the pitch from the estimated spectrum SPIRE LAB, IISc, Bangalore 30

50 Experiments and results Section 4 1 Introduction 2 Proposed approach 3 Previous work and baseline 4 Experiments and results Database Experimental setup Evaluation 5 Conclusion and future work SPIRE LAB, IISc, Bangalore 31

51 Experiments and results Database Database We use two databases: CMUARCTIC 14 and KEELE 15. 14 Kominek and Black, The CMU ARCTIC speech databases. 15 Plante, Meyer, and Ainsworth, A pitch extraction reference database, 1995 SPIRE LAB, IISc, Bangalore 32

52 Experiments and results Database Database We use two databases: CMUARCTIC 14 and KEELE 15. CMU-ARCTIC database: one male (jmk) and one female (slt), 48 min each. 14 Kominek and Black, The CMU ARCTIC speech databases. 15 Plante, Meyer, and Ainsworth, A pitch extraction reference database, 1995 SPIRE LAB, IISc, Bangalore 32

53 Experiments and results Database Database We use two databases: CMUARCTIC 14 and KEELE 15. CMU-ARCTIC database: one male (jmk) and one female (slt), 48 min each. KEELE database: one male and one female, 4 min each. 14 Kominek and Black, The CMU ARCTIC speech databases. 15 Plante, Meyer, and Ainsworth, A pitch extraction reference database, 1995 SPIRE LAB, IISc, Bangalore 32

54 Experiments and results Database Database We use two databases: CMUARCTIC 14 and KEELE 15. CMU-ARCTIC database: one male (jmk) and one female (slt), 48 min each. KEELE database: one male and one female, 4 min each. We randomly choose 80% of the CMUARCTIC data from each speaker as the training set and the rest as the test set. All of KEELE is used as a test set to evaluate the generalization of the algorithms. 14 Kominek and Black, The CMU ARCTIC speech databases. 15 Plante, Meyer, and Ainsworth, A pitch extraction reference database, 1995 SPIRE LAB, IISc, Bangalore 32

55 Experiments and results Database Database The pitch histograms for the different train and test sets are shown below. SPIRE LAB, IISc, Bangalore 33

56 Experiments and results Database Database The pitch histograms for the different train and test sets are shown below. Note that the histogram mismatch is larger for MALE. SPIRE LAB, IISc, Bangalore 33

57 Experiments and results Experimental setup Experimental setup: MFCC and Pitch computation MFCC computation: 1 Hamming window of 40 ms with a shift of 10 ms. 2 A 2048-point DFT is computed. 3 The MFBEs are computed by placing the M=26 filter banks uniformly on the mel scale from Hz 16. 4 The DCT with K = 26, 21, 16, 13 is computed to investigate the estimation error due to different amounts of truncation of the DCT coefficients. Pitch: We use the auto-correlation method from Praat 17 on the EGG signal available with the database to determine the ground truth fundamental frequency and voicing. The unvoiced frames are removed from the data for the experiments. 16 The velocity and the acceleration coefficients of MFCC are not used. 17 Boersma and Weenink, Praat: doing phonetics by computer, 2010 SPIRE LAB, IISc, Bangalore 34

58 Experiments and results Experimental setup Experimental setup: Hyper-parameter selection Proposed method: The sparse spectrum estimation has a hyper-parameter $\lambda$, which is experimentally found using 10% of randomly selected training data to minimize the pitch error for each training set. SPIRE LAB, IISc, Bangalore 35

59 Experiments and results Experimental setup Experimental setup: Hyper-parameter selection Proposed method: The sparse spectrum estimation has a hyper-parameter $\lambda$, which is experimentally found using 10% of randomly selected training data to minimize the pitch error for each training set. Pitch estimation: We use 4 harmonics to compute the pitch score using SHR. The pitch search ranges (S) for CM, CF and CM+CF are chosen to be , , respectively. SPIRE LAB, IISc, Bangalore 35

60 Experiments and results Evaluation Evaluation We use the root mean squared error 18 as the metric to measure the pitch estimation performance. It is computed from the estimated pitch $\hat{p}_i$ and the original pitch $p_i$ at the $i$-th frame over the entire test set with $N_{tot}$ voiced frames:

$$RMSE = \sqrt{\frac{1}{N_{tot}} \sum_{i=1}^{N_{tot}} (\hat{p}_i - p_i)^2}$$

18 Tabrikian, Dubnov, and Dickalov, Maximum a-posteriori probability pitch tracking in noisy environments using harmonic model, 2004 SPIRE LAB, IISc, Bangalore 36
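The RMSE metric above is straightforward to compute over the voiced frames of the test set:

```python
import numpy as np

def pitch_rmse(p_est, p_ref):
    """Root mean squared pitch error over all N_tot voiced frames."""
    p_est = np.asarray(p_est, dtype=float)
    p_ref = np.asarray(p_ref, dtype=float)
    return np.sqrt(np.mean((p_est - p_ref) ** 2))
```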

61 Experiments and results Evaluation Estimation of the pitch from the estimated spectrum SPIRE LAB, IISc, Bangalore 37

62 Experiments and results Evaluation Sample recovered spectrum SPIRE LAB, IISc, Bangalore 38

63 Experiments and results Evaluation Sample recovered spectrum The sparsity constraint helps in recovering the lower-pitch spectrum with higher accuracy. SPIRE LAB, IISc, Bangalore 38

64 Experiments and results Evaluation Set of models trained. Subscript z indicates DCT reconstruction using zero padding, subscript D indicates DCT reconstruction using DNN, and the superscript indicates the training set. Each method (DNN_b, PINV_z, PS_z, PINV_D, PS_D) is trained on CM, CF, and CM+CF, giving models such as DNN_b^CM, PS_z^CF, and PINV_D^CM+CF. SPIRE LAB, IISc, Bangalore 39

65 Experiments and results Evaluation RMSE error with matched test data: MALE, test set CM. [Figure: pitch RMSE vs. K for DNN_b, PINV, and PS models] DNN_b outperforms all the other methods. SPIRE LAB, IISc, Bangalore 40

66 Experiments and results Evaluation RMSE error with matched test data: MALE, test set CM. [Figure: pitch RMSE vs. K for DNN_b, PINV, and PS models] The pitch RMSE increases as the truncation increases. SPIRE LAB, IISc, Bangalore 40

67 Experiments and results Evaluation RMSE error with matched test data: MALE, test set CM. [Figure: pitch RMSE vs. K for DNN_b, PINV, and PS models] The gender-mismatched DNN model performs poorly. SPIRE LAB, IISc, Bangalore 40

68 Experiments and results Evaluation RMSE error with matched test data: MALE, test set CM. [Figure: pitch RMSE vs. K for DNN_b, PINV, and PS models] The PS and PINV methods are not affected much by the gender mismatch. SPIRE LAB, IISc, Bangalore 40

69 Experiments and results Evaluation RMSE error with matched test data: MALE, test set CM. [Figure: pitch RMSE vs. K for DNN_b, PINV, and PS models] The DNN-based DCT estimation helps in all cases. SPIRE LAB, IISc, Bangalore 40

70 Experiments and results Evaluation RMSE error with matched test data: FEMALE, test set CF. [Figure: pitch RMSE vs. K for DNN_b, PINV, and PS models] The same observations apply. The RMSE is lower compared to the MALE case. SPIRE LAB, IISc, Bangalore 41

71 Experiments and results Evaluation RMSE error with matched test data: MALE+FEMALE, test set CM+CF. [Figure: pitch RMSE vs. K for DNN_b, PINV, and PS models] The same observations apply. The RMSE increases compared to the gender-dependent cases. SPIRE LAB, IISc, Bangalore 42

72 Experiments and results Evaluation RMSE error with mismatched MALE test data: KM. [Figure: pitch RMSE vs. K for DNN_b, PINV, and PS models] DNN_b performs poorly because of the histogram mismatch. SPIRE LAB, IISc, Bangalore 43

73 Experiments and results Evaluation RMSE error with mismatched MALE test data: KM. [Figure: pitch RMSE vs. K for DNN_b, PINV, and PS models] The PS method outperforms all the other methods. SPIRE LAB, IISc, Bangalore 43

74 Experiments and results Evaluation RMSE error with mismatched MALE test data: KM. [Figure: pitch RMSE vs. K for DNN_b, PINV, and PS models] The RMSE of PINV_D and PS_D is higher than that of PINV_z and PS_z because the DNN-DCT prediction is also poor. SPIRE LAB, IISc, Bangalore 43

75 Experiments and results Evaluation RMSE error with mismatched FEMALE test data: KF. [Figure: pitch RMSE vs. K for DNN_b, PINV, and PS models] The PINV_z method outperforms all the other methods at lower amounts of truncation. SPIRE LAB, IISc, Bangalore 44

76 Experiments and results Evaluation RMSE error with mismatched FEMALE test data: KF. [Figure: pitch RMSE vs. K for DNN_b, PINV, and PS models] The DNN method is better at higher truncation because the histogram mismatch is smaller for the FEMALE data. SPIRE LAB, IISc, Bangalore 44

77 Experiments and results Evaluation RMSE error with mismatched MALE+FEMALE test data To evaluate the performance of the algorithms in a general unseen scenario, we evaluate the gender-independent models of each method (DNN, PINV and PS). The average RMSE on the unseen KEELE database is shown below. [Table: average RMSE for DNN_b^CM+CF, PINV_z^CM+CF, and PS_z^CM+CF] SPIRE LAB, IISc, Bangalore 45

78 Experiments and results Evaluation RMSE error with mismatched MALE+FEMALE test data To evaluate the performance of the algorithms in a general unseen scenario, we evaluate the gender-independent models of each method (DNN, PINV and PS). The average RMSE on the unseen KEELE database is shown below. [Table: average RMSE for DNN_b^CM+CF, PINV_z^CM+CF, and PS_z^CM+CF] Pitch prediction with the sparsity constraint outperforms the other two methods on general unseen data with unknown gender. SPIRE LAB, IISc, Bangalore 45

79 Conclusion and future work Section 5 1 Introduction 2 Proposed approach 3 Previous work and baseline 4 Experiments and results Database Experimental setup Evaluation 5 Conclusion and future work SPIRE LAB, IISc, Bangalore 46

80 Conclusion and future work Conclusion and future work Proposed a three-step method to estimate the pitch from MFCC vectors. SPIRE LAB, IISc, Bangalore 47

81 Conclusion and future work Conclusion and future work Proposed a three-step method to estimate the pitch from MFCC vectors. We showed that the sparsity constraint helps in recovering the pitch value more accurately for MALE subjects and generalizes well across databases. SPIRE LAB, IISc, Bangalore 47

82 Conclusion and future work Conclusion and future work Proposed a three-step method to estimate the pitch from MFCC vectors. We showed that the sparsity constraint helps in recovering the pitch value more accurately for MALE subjects and generalizes well across databases. It might be possible to train a DNN with many speakers to get a better model that generalizes well on unseen test cases. However, obtaining data with EGG from many speakers could be challenging. SPIRE LAB, IISc, Bangalore 47

83 Conclusion and future work Conclusion and future work Proposed a three-step method to estimate the pitch from MFCC vectors. We showed that the sparsity constraint helps in recovering the pitch value more accurately for MALE subjects and generalizes well across databases. It might be possible to train a DNN with many speakers to get a better model that generalizes well on unseen test cases. However, obtaining data with EGG from many speakers could be challenging. Future work may include imposing a periodicity constraint along with the sparsity constraint on the spectrum, as well as reconstruction of speech using the estimated pitch and evaluation of the ASR performance and the naturalness of the synthesized speech. SPIRE LAB, IISc, Bangalore 47

84 Conclusion and future work THANK YOU SPIRE LAB, IISc, Bangalore 48

85 Conclusion and future work SPIRE LAB, IISc, Bangalore 49

86 Conclusion and future work Experimental setup: Deep neural network 1 The structure of the DNNs is defined recursively on the layer index $l$. The input vector $z_l \in \mathbb{R}^{d_1}$ is mapped to the representation vector $z_{l+1} \in \mathbb{R}^{d_2}$ through an activation function $f_l$ as follows:

$$z_{l+1} = f_l(W_l z_l + b_l), \quad 0 \le l \le L-1 \quad (5)$$

where

$$f_l(x) = \begin{cases} \tanh(x), & 0 \le l \le L-2 \\ x, & l = L-1. \end{cases}$$

$d_1$ and $d_2$ are the input and output dimensions of the $l$-th layer. $W_l$ and $b_l$ are the parameters of the network. These parameters are estimated by back propagation and stochastic gradient descent. SPIRE LAB, IISc, Bangalore 50
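The recursion in Eq. (5) amounts to the forward pass sketched below: tanh on every hidden layer, linear output on the last. This is an illustrative inference-only sketch; the training loop (back propagation, SGD) is not shown.

```python
import numpy as np

def dnn_forward(z, weights, biases):
    """Forward pass of the described network: z_{l+1} = f_l(W_l z_l + b_l),
    with f_l = tanh for 0 <= l <= L-2 and identity for l = L-1."""
    L = len(weights)
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ z + b
        if l < L - 1:      # hidden layers use tanh
            z = np.tanh(z)
    return z               # last layer is linear
```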

87 Conclusion and future work Experimental setup: Deep neural network 1 The DNNs for both DNN_DCT and DNN_b have the same architecture and training procedure, except for the number of hidden units in each layer. We use a 4-layer network with 256 units in each layer for DNN_b and 512 units for DNN_DCT. 2 The input data is normalized to zero mean and unit variance. 3 The network is initialized using Glorot initialization. 4 The training is performed using stochastic gradient descent with a batch size of 256 and a momentum of 0.9. 20% of the training data is used to monitor the validation loss at each epoch, and the weight updates are stopped when there is no improvement in the validation loss. SPIRE LAB, IISc, Bangalore 51


Research on the New Image De-Noising Methodology Based on Neural Network and HMM-Hidden Markov Models Research on the New Image De-Noising Methodology Based on Neural Network and HMM-Hidden Markov Models Wenzhun Huang 1, a and Xinxin Xie 1, b 1 School of Information Engineering, Xijing University, Xi an

More information

27: Hybrid Graphical Models and Neural Networks

27: Hybrid Graphical Models and Neural Networks 10-708: Probabilistic Graphical Models 10-708 Spring 2016 27: Hybrid Graphical Models and Neural Networks Lecturer: Matt Gormley Scribes: Jakob Bauer Otilia Stretcu Rohan Varma 1 Motivation We first look

More information

A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models

A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating Michael Price*, James Glass, Anantha Chandrakasan MIT, Cambridge, MA * now at Analog Devices, Cambridge,

More information

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011 Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011 1. Introduction Reddit is one of the most popular online social news websites with millions

More information

Optimization of Observation Membership Function By Particle Swarm Method for Enhancing Performances of Speaker Identification

Optimization of Observation Membership Function By Particle Swarm Method for Enhancing Performances of Speaker Identification Proceedings of the 6th WSEAS International Conference on SIGNAL PROCESSING, Dallas, Texas, USA, March 22-24, 2007 52 Optimization of Observation Membership Function By Particle Swarm Method for Enhancing

More information

Authentication of Fingerprint Recognition Using Natural Language Processing

Authentication of Fingerprint Recognition Using Natural Language Processing Authentication of Fingerprint Recognition Using Natural Language Shrikala B. Digavadekar 1, Prof. Ravindra T. Patil 2 1 Tatyasaheb Kore Institute of Engineering & Technology, Warananagar, India 2 Tatyasaheb

More information

Speaker Verification with Adaptive Spectral Subband Centroids

Speaker Verification with Adaptive Spectral Subband Centroids Speaker Verification with Adaptive Spectral Subband Centroids Tomi Kinnunen 1, Bingjun Zhang 2, Jia Zhu 2, and Ye Wang 2 1 Speech and Dialogue Processing Lab Institution for Infocomm Research (I 2 R) 21

More information

Novel Subband Autoencoder Features for Non-intrusive Quality Assessment of Noise Suppressed Speech

Novel Subband Autoencoder Features for Non-intrusive Quality Assessment of Noise Suppressed Speech INTERSPEECH 16 September 8 12, 16, San Francisco, USA Novel Subband Autoencoder Features for Non-intrusive Quality Assessment of Noise Suppressed Speech Meet H. Soni, Hemant A. Patil Dhirubhai Ambani Institute

More information

SOUND EVENT DETECTION AND CONTEXT RECOGNITION 1 INTRODUCTION. Toni Heittola 1, Annamaria Mesaros 1, Tuomas Virtanen 1, Antti Eronen 2

SOUND EVENT DETECTION AND CONTEXT RECOGNITION 1 INTRODUCTION. Toni Heittola 1, Annamaria Mesaros 1, Tuomas Virtanen 1, Antti Eronen 2 Toni Heittola 1, Annamaria Mesaros 1, Tuomas Virtanen 1, Antti Eronen 2 1 Department of Signal Processing, Tampere University of Technology Korkeakoulunkatu 1, 33720, Tampere, Finland toni.heittola@tut.fi,

More information

Stacked Denoising Autoencoders for Face Pose Normalization

Stacked Denoising Autoencoders for Face Pose Normalization Stacked Denoising Autoencoders for Face Pose Normalization Yoonseop Kang 1, Kang-Tae Lee 2,JihyunEun 2, Sung Eun Park 2 and Seungjin Choi 1 1 Department of Computer Science and Engineering Pohang University

More information

Aditi Upadhyay Research Scholar, Department of Electronics & Communication Engineering Jaipur National University, Jaipur, Rajasthan, India

Aditi Upadhyay Research Scholar, Department of Electronics & Communication Engineering Jaipur National University, Jaipur, Rajasthan, India Analysis of Different Classifier Using Feature Extraction in Speaker Identification and Verification under Adverse Acoustic Condition for Different Scenario Shrikant Upadhyay Assistant Professor, Department

More information

DEEP LEARNING IN PYTHON. The need for optimization

DEEP LEARNING IN PYTHON. The need for optimization DEEP LEARNING IN PYTHON The need for optimization A baseline neural network Input 2 Hidden Layer 5 2 Output - 9-3 Actual Value of Target: 3 Error: Actual - Predicted = 4 A baseline neural network Input

More information

Why DNN Works for Speech and How to Make it More Efficient?

Why DNN Works for Speech and How to Make it More Efficient? Why DNN Works for Speech and How to Make it More Efficient? Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering, York University, CANADA Joint work with Y.

More information

Geometric Reconstruction Dense reconstruction of scene geometry

Geometric Reconstruction Dense reconstruction of scene geometry Lecture 5. Dense Reconstruction and Tracking with Real-Time Applications Part 2: Geometric Reconstruction Dr Richard Newcombe and Dr Steven Lovegrove Slide content developed from: [Newcombe, Dense Visual

More information

The Automatic Musicologist

The Automatic Musicologist The Automatic Musicologist Douglas Turnbull Department of Computer Science and Engineering University of California, San Diego UCSD AI Seminar April 12, 2004 Based on the paper: Fast Recognition of Musical

More information

Lecture 7: Neural network acoustic models in speech recognition

Lecture 7: Neural network acoustic models in speech recognition CS 224S / LINGUIST 285 Spoken Language Processing Andrew Maas Stanford University Spring 2017 Lecture 7: Neural network acoustic models in speech recognition Outline Hybrid acoustic modeling overview Basic

More information

Tracking Algorithms. Lecture16: Visual Tracking I. Probabilistic Tracking. Joint Probability and Graphical Model. Deterministic methods

Tracking Algorithms. Lecture16: Visual Tracking I. Probabilistic Tracking. Joint Probability and Graphical Model. Deterministic methods Tracking Algorithms CSED441:Introduction to Computer Vision (2017F) Lecture16: Visual Tracking I Bohyung Han CSE, POSTECH bhhan@postech.ac.kr Deterministic methods Given input video and current state,

More information

Dynamic Time Warping

Dynamic Time Warping Centre for Vision Speech & Signal Processing University of Surrey, Guildford GU2 7XH. Dynamic Time Warping Dr Philip Jackson Acoustic features Distance measures Pattern matching Distortion penalties DTW

More information

TWO-STEP SEMI-SUPERVISED APPROACH FOR MUSIC STRUCTURAL CLASSIFICATION. Prateek Verma, Yang-Kai Lin, Li-Fan Yu. Stanford University

TWO-STEP SEMI-SUPERVISED APPROACH FOR MUSIC STRUCTURAL CLASSIFICATION. Prateek Verma, Yang-Kai Lin, Li-Fan Yu. Stanford University TWO-STEP SEMI-SUPERVISED APPROACH FOR MUSIC STRUCTURAL CLASSIFICATION Prateek Verma, Yang-Kai Lin, Li-Fan Yu Stanford University ABSTRACT Structural segmentation involves finding hoogeneous sections appearing

More information

Gating Neural Network for Large Vocabulary Audiovisual Speech Recognition

Gating Neural Network for Large Vocabulary Audiovisual Speech Recognition IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. XX, NO. X, JUNE 2017 1 Gating Neural Network for Large Vocabulary Audiovisual Speech Recognition Fei Tao, Student Member, IEEE, and

More information

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University DS 4400 Machine Learning and Data Mining I Alina Oprea Associate Professor, CCIS Northeastern University January 24 2019 Logistics HW 1 is due on Friday 01/25 Project proposal: due Feb 21 1 page description

More information

FUSION MODEL BASED ON CONVOLUTIONAL NEURAL NETWORKS WITH TWO FEATURES FOR ACOUSTIC SCENE CLASSIFICATION

FUSION MODEL BASED ON CONVOLUTIONAL NEURAL NETWORKS WITH TWO FEATURES FOR ACOUSTIC SCENE CLASSIFICATION Please contact the conference organizers at dcasechallenge@gmail.com if you require an accessible file, as the files provided by ConfTool Pro to reviewers are filtered to remove author information, and

More information

Reverberant Speech Recognition Based on Denoising Autoencoder

Reverberant Speech Recognition Based on Denoising Autoencoder INTERSPEECH 2013 Reverberant Speech Recognition Based on Denoising Autoencoder Takaaki Ishii 1, Hiroki Komiyama 1, Takahiro Shinozaki 2, Yasuo Horiuchi 1, Shingo Kuroiwa 1 1 Division of Information Sciences,

More information

Hidden Markov Models. Gabriela Tavares and Juri Minxha Mentor: Taehwan Kim CS159 04/25/2017

Hidden Markov Models. Gabriela Tavares and Juri Minxha Mentor: Taehwan Kim CS159 04/25/2017 Hidden Markov Models Gabriela Tavares and Juri Minxha Mentor: Taehwan Kim CS159 04/25/2017 1 Outline 1. 2. 3. 4. Brief review of HMMs Hidden Markov Support Vector Machines Large Margin Hidden Markov Models

More information

Gender-dependent acoustic models fusion developed for automatic subtitling of Parliament meetings broadcasted by the Czech TV

Gender-dependent acoustic models fusion developed for automatic subtitling of Parliament meetings broadcasted by the Czech TV Gender-dependent acoustic models fusion developed for automatic subtitling of Parliament meetings broadcasted by the Czech TV Jan Vaněk and Josef V. Psutka Department of Cybernetics, West Bohemia University,

More information

Real Time Speaker Recognition System using MFCC and Vector Quantization Technique

Real Time Speaker Recognition System using MFCC and Vector Quantization Technique Real Time Speaker Recognition System using MFCC and Vector Quantization Technique Roma Bharti Mtech, Manav rachna international university Faridabad ABSTRACT This paper represents a very strong mathematical

More information

Multinomial Regression and the Softmax Activation Function. Gary Cottrell!

Multinomial Regression and the Softmax Activation Function. Gary Cottrell! Multinomial Regression and the Softmax Activation Function Gary Cottrell Notation reminder We have N data points, or patterns, in the training set, with the pattern number as a superscript: {(x 1,t 1 ),

More information

Modeling Phonetic Context with Non-random Forests for Speech Recognition

Modeling Phonetic Context with Non-random Forests for Speech Recognition Modeling Phonetic Context with Non-random Forests for Speech Recognition Hainan Xu Center for Language and Speech Processing, Johns Hopkins University September 4, 2015 Hainan Xu September 4, 2015 1 /

More information

Automatic Speech Recognition on Mobile Devices and over Communication Networks

Automatic Speech Recognition on Mobile Devices and over Communication Networks Zheng-Hua Tan and Berge Lindberg Automatic Speech Recognition on Mobile Devices and over Communication Networks ^Spri inger g< Contents Preface Contributors v xix 1. Network, Distributed and Embedded Speech

More information

MACHINE LEARNING: CLUSTERING, AND CLASSIFICATION. Steve Tjoa June 25, 2014

MACHINE LEARNING: CLUSTERING, AND CLASSIFICATION. Steve Tjoa June 25, 2014 MACHINE LEARNING: CLUSTERING, AND CLASSIFICATION Steve Tjoa kiemyang@gmail.com June 25, 2014 Review from Day 2 Supervised vs. Unsupervised Unsupervised - clustering Supervised binary classifiers (2 classes)

More information

ImageNet Classification with Deep Convolutional Neural Networks

ImageNet Classification with Deep Convolutional Neural Networks ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky Ilya Sutskever Geoffrey Hinton University of Toronto Canada Paper with same name to appear in NIPS 2012 Main idea Architecture

More information

RECOGNITION OF EMOTION FROM MARATHI SPEECH USING MFCC AND DWT ALGORITHMS

RECOGNITION OF EMOTION FROM MARATHI SPEECH USING MFCC AND DWT ALGORITHMS RECOGNITION OF EMOTION FROM MARATHI SPEECH USING MFCC AND DWT ALGORITHMS Dipti D. Joshi, M.B. Zalte (EXTC Department, K.J. Somaiya College of Engineering, University of Mumbai, India) Diptijoshi3@gmail.com

More information

Sum-Product Networks. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 15, 2015

Sum-Product Networks. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 15, 2015 Sum-Product Networks STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 15, 2015 Introduction Outline What is a Sum-Product Network? Inference Applications In more depth

More information

Manifold Constrained Deep Neural Networks for ASR

Manifold Constrained Deep Neural Networks for ASR 1 Manifold Constrained Deep Neural Networks for ASR Department of Electrical and Computer Engineering, McGill University Richard Rose and Vikrant Tomar Motivation Speech features can be characterized as

More information

Using Gradient Descent Optimization for Acoustics Training from Heterogeneous Data

Using Gradient Descent Optimization for Acoustics Training from Heterogeneous Data Using Gradient Descent Optimization for Acoustics Training from Heterogeneous Data Martin Karafiát Λ, Igor Szöke, and Jan Černocký Brno University of Technology, Faculty of Information Technology Department

More information

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu FMA901F: Machine Learning Lecture 3: Linear Models for Regression Cristian Sminchisescu Machine Learning: Frequentist vs. Bayesian In the frequentist setting, we seek a fixed parameter (vector), with value(s)

More information

Index. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning,

Index. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning, A Acquisition function, 298, 301 Adam optimizer, 175 178 Anaconda navigator conda command, 3 Create button, 5 download and install, 1 installing packages, 8 Jupyter Notebook, 11 13 left navigation pane,

More information

A Novel Template Matching Approach To Speaker-Independent Arabic Spoken Digit Recognition

A Novel Template Matching Approach To Speaker-Independent Arabic Spoken Digit Recognition Special Session: Intelligent Knowledge Management A Novel Template Matching Approach To Speaker-Independent Arabic Spoken Digit Recognition Jiping Sun 1, Jeremy Sun 1, Kacem Abida 2, and Fakhri Karray

More information

Bilevel Sparse Coding

Bilevel Sparse Coding Adobe Research 345 Park Ave, San Jose, CA Mar 15, 2013 Outline 1 2 The learning model The learning algorithm 3 4 Sparse Modeling Many types of sensory data, e.g., images and audio, are in high-dimensional

More information

Machine Learning Feature Creation and Selection

Machine Learning Feature Creation and Selection Machine Learning Feature Creation and Selection Jeff Howbert Introduction to Machine Learning Winter 2012 1 Feature creation Well-conceived new features can sometimes capture the important information

More information

Comparative Evaluation of Feature Normalization Techniques for Speaker Verification

Comparative Evaluation of Feature Normalization Techniques for Speaker Verification Comparative Evaluation of Feature Normalization Techniques for Speaker Verification Md Jahangir Alam 1,2, Pierre Ouellet 1, Patrick Kenny 1, Douglas O Shaughnessy 2, 1 CRIM, Montreal, Canada {Janagir.Alam,

More information

MoonRiver: Deep Neural Network in C++

MoonRiver: Deep Neural Network in C++ MoonRiver: Deep Neural Network in C++ Chung-Yi Weng Computer Science & Engineering University of Washington chungyi@cs.washington.edu Abstract Artificial intelligence resurges with its dramatic improvement

More information

Lecture 13. Deep Belief Networks. Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen

Lecture 13. Deep Belief Networks. Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen Lecture 13 Deep Belief Networks Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen IBM T.J. Watson Research Center Yorktown Heights, New York, USA {picheny,bhuvana,stanchen}@us.ibm.com 12 December 2012

More information

Using Capsule Networks. for Image and Speech Recognition Problems. Yan Xiong

Using Capsule Networks. for Image and Speech Recognition Problems. Yan Xiong Using Capsule Networks for Image and Speech Recognition Problems by Yan Xiong A Thesis Presented in Partial Fulfillment of the Requirements for the Degree Master of Science Approved November 2018 by the

More information

Najiya P Fathima, C. V. Vipin Kishnan; International Journal of Advance Research, Ideas and Innovations in Technology

Najiya P Fathima, C. V. Vipin Kishnan; International Journal of Advance Research, Ideas and Innovations in Technology ISSN: 2454-32X Impact factor: 4.295 (Volume 4, Issue 2) Available online at: www.ijariit.com Analysis of Different Classifier for the Detection of Double Compressed AMR Audio Fathima Najiya P najinasi2@gmail.com

More information

Sparse Solutions to Linear Inverse Problems. Yuzhe Jin

Sparse Solutions to Linear Inverse Problems. Yuzhe Jin Sparse Solutions to Linear Inverse Problems Yuzhe Jin Outline Intro/Background Two types of algorithms Forward Sequential Selection Methods Diversity Minimization Methods Experimental results Potential

More information

Least Squares Signal Declipping for Robust Speech Recognition

Least Squares Signal Declipping for Robust Speech Recognition Least Squares Signal Declipping for Robust Speech Recognition Mark J. Harvilla and Richard M. Stern Department of Electrical and Computer Engineering Carnegie Mellon University, Pittsburgh, PA 15213 USA

More information

Robust speech recognition using features based on zero crossings with peak amplitudes

Robust speech recognition using features based on zero crossings with peak amplitudes Robust speech recognition using features based on zero crossings with peak amplitudes Author Gajic, Bojana, Paliwal, Kuldip Published 200 Conference Title Proceedings of the 200 IEEE International Conference

More information

CS489/698: Intro to ML

CS489/698: Intro to ML CS489/698: Intro to ML Lecture 14: Training of Deep NNs Instructor: Sun Sun 1 Outline Activation functions Regularization Gradient-based optimization 2 Examples of activation functions 3 5/28/18 Sun Sun

More information

Neural Networks Based Time-Delay Estimation using DCT Coefficients

Neural Networks Based Time-Delay Estimation using DCT Coefficients American Journal of Applied Sciences 6 (4): 73-78, 9 ISSN 1546-939 9 Science Publications Neural Networks Based Time-Delay Estimation using DCT Coefficients Samir J. Shaltaf and Ahmad A. Mohammad Department

More information

Voice Conversion Using Dynamic Kernel. Partial Least Squares Regression

Voice Conversion Using Dynamic Kernel. Partial Least Squares Regression Voice Conversion Using Dynamic Kernel 1 Partial Least Squares Regression Elina Helander, Hanna Silén, Tuomas Virtanen, Member, IEEE, and Moncef Gabbouj, Fellow, IEEE Abstract A drawback of many voice conversion

More information

Lecture 20: Neural Networks for NLP. Zubin Pahuja

Lecture 20: Neural Networks for NLP. Zubin Pahuja Lecture 20: Neural Networks for NLP Zubin Pahuja zpahuja2@illinois.edu courses.engr.illinois.edu/cs447 CS447: Natural Language Processing 1 Today s Lecture Feed-forward neural networks as classifiers simple

More information

The Un-normalized Graph p-laplacian based Semi-supervised Learning Method and Speech Recognition Problem

The Un-normalized Graph p-laplacian based Semi-supervised Learning Method and Speech Recognition Problem Int. J. Advance Soft Compu. Appl, Vol. 9, No. 1, March 2017 ISSN 2074-8523 The Un-normalized Graph p-laplacian based Semi-supervised Learning Method and Speech Recognition Problem Loc Tran 1 and Linh Tran

More information

Neetha Das Prof. Andy Khong

Neetha Das Prof. Andy Khong Neetha Das Prof. Andy Khong Contents Introduction and aim Current system at IMI Proposed new classification model Support Vector Machines Initial audio data collection and processing Features and their

More information

On Pre-Image Iterations for Speech Enhancement

On Pre-Image Iterations for Speech Enhancement Leitner and Pernkopf RESEARCH On Pre-Image Iterations for Speech Enhancement Christina Leitner 1* and Franz Pernkopf 2 * Correspondence: christina.leitner@joanneum.at 1 JOANNEUM RESEARCH Forschungsgesellschaft

More information

Confidence Measures: how much we can trust our speech recognizers

Confidence Measures: how much we can trust our speech recognizers Confidence Measures: how much we can trust our speech recognizers Prof. Hui Jiang Department of Computer Science York University, Toronto, Ontario, Canada Email: hj@cs.yorku.ca Outline Speech recognition

More information

Multifactor Fusion for Audio-Visual Speaker Recognition

Multifactor Fusion for Audio-Visual Speaker Recognition Proceedings of the 7th WSEAS International Conference on Signal, Speech and Image Processing, Beijing, China, September 15-17, 2007 70 Multifactor Fusion for Audio-Visual Speaker Recognition GIRIJA CHETTY

More information

Recurrent Neural Networks. Nand Kishore, Audrey Huang, Rohan Batra

Recurrent Neural Networks. Nand Kishore, Audrey Huang, Rohan Batra Recurrent Neural Networks Nand Kishore, Audrey Huang, Rohan Batra Roadmap Issues Motivation 1 Application 1: Sequence Level Training 2 Basic Structure 3 4 Variations 5 Application 3: Image Classification

More information

Pair-wise Distance Metric Learning of Neural Network Model for Spoken Language Identification

Pair-wise Distance Metric Learning of Neural Network Model for Spoken Language Identification INTERSPEECH 2016 September 8 12, 2016, San Francisco, USA Pair-wise Distance Metric Learning of Neural Network Model for Spoken Language Identification 2 1 Xugang Lu 1, Peng Shen 1, Yu Tsao 2, Hisashi

More information

Device Activation based on Voice Recognition using Mel Frequency Cepstral Coefficients (MFCC s) Algorithm

Device Activation based on Voice Recognition using Mel Frequency Cepstral Coefficients (MFCC s) Algorithm Device Activation based on Voice Recognition using Mel Frequency Cepstral Coefficients (MFCC s) Algorithm Hassan Mohammed Obaid Al Marzuqi 1, Shaik Mazhar Hussain 2, Dr Anilloy Frank 3 1,2,3Middle East

More information

Decentralized and Distributed Machine Learning Model Training with Actors

Decentralized and Distributed Machine Learning Model Training with Actors Decentralized and Distributed Machine Learning Model Training with Actors Travis Addair Stanford University taddair@stanford.edu Abstract Training a machine learning model with terabytes to petabytes of

More information

Accelerating the Hessian-free Gauss-Newton Full-waveform Inversion via Preconditioned Conjugate Gradient Method

Accelerating the Hessian-free Gauss-Newton Full-waveform Inversion via Preconditioned Conjugate Gradient Method Accelerating the Hessian-free Gauss-Newton Full-waveform Inversion via Preconditioned Conjugate Gradient Method Wenyong Pan 1, Kris Innanen 1 and Wenyuan Liao 2 1. CREWES Project, Department of Geoscience,

More information

Chapter 3. Speech segmentation. 3.1 Preprocessing

Chapter 3. Speech segmentation. 3.1 Preprocessing , as done in this dissertation, refers to the process of determining the boundaries between phonemes in the speech signal. No higher-level lexical information is used to accomplish this. This chapter presents

More information

Deep Learning Cook Book

Deep Learning Cook Book Deep Learning Cook Book Robert Haschke (CITEC) Overview Input Representation Output Layer + Cost Function Hidden Layer Units Initialization Regularization Input representation Choose an input representation

More information

A Comparison of Sequence-Trained Deep Neural Networks and Recurrent Neural Networks Optical Modeling For Handwriting Recognition

A Comparison of Sequence-Trained Deep Neural Networks and Recurrent Neural Networks Optical Modeling For Handwriting Recognition A Comparison of Sequence-Trained Deep Neural Networks and Recurrent Neural Networks Optical Modeling For Handwriting Recognition Théodore Bluche, Hermann Ney, Christopher Kermorvant SLSP 14, Grenoble October

More information

Hello Edge: Keyword Spotting on Microcontrollers

Hello Edge: Keyword Spotting on Microcontrollers Hello Edge: Keyword Spotting on Microcontrollers Yundong Zhang, Naveen Suda, Liangzhen Lai and Vikas Chandra ARM Research, Stanford University arxiv.org, 2017 Presented by Mohammad Mofrad University of

More information

Tutorial: Using Tina Vision s Quantitative Pattern Recognition Tool.

Tutorial: Using Tina Vision s Quantitative Pattern Recognition Tool. Tina Memo No. 2014-004 Internal Report Tutorial: Using Tina Vision s Quantitative Pattern Recognition Tool. P.D.Tar. Last updated 07 / 06 / 2014 ISBE, Medical School, University of Manchester, Stopford

More information

Assignment 2. Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions

Assignment 2. Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions ENEE 739Q: STATISTICAL AND NEURAL PATTERN RECOGNITION Spring 2002 Assignment 2 Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions Aravind Sundaresan

More information

Implementation of Speech Based Stress Level Monitoring System

Implementation of Speech Based Stress Level Monitoring System 4 th International Conference on Computing, Communication and Sensor Network, CCSN2015 Implementation of Speech Based Stress Level Monitoring System V.Naveen Kumar 1,Dr.Y.Padma sai 2, K.Sonali Swaroop

More information

Multiple-View Object Recognition in Band-Limited Distributed Camera Networks

Multiple-View Object Recognition in Band-Limited Distributed Camera Networks in Band-Limited Distributed Camera Networks Allen Y. Yang, Subhransu Maji, Mario Christoudas, Kirak Hong, Posu Yan Trevor Darrell, Jitendra Malik, and Shankar Sastry Fusion, 2009 Classical Object Recognition

More information

M. Sc. (Artificial Intelligence and Machine Learning)

M. Sc. (Artificial Intelligence and Machine Learning) Course Name: Advanced Python Course Code: MSCAI 122 This course will introduce students to advanced python implementations and the latest Machine Learning and Deep learning libraries, Scikit-Learn and

More information

Probabilistic Robotics

Probabilistic Robotics Probabilistic Robotics Sebastian Thrun Wolfram Burgard Dieter Fox The MIT Press Cambridge, Massachusetts London, England Preface xvii Acknowledgments xix I Basics 1 1 Introduction 3 1.1 Uncertainty in

More information

l1 ls: A Matlab Solver for Large-Scale l 1 -Regularized Least Squares Problems

l1 ls: A Matlab Solver for Large-Scale l 1 -Regularized Least Squares Problems l ls: A Matlab Solver for Large-Scale l -Regularized Least Squares Problems Kwangmoo Koh deneb@stanford.edu Seungjean Kim sjkim@stanford.edu May 5, 2008 Stephen Boyd boyd@stanford.edu l ls solves l -regularized

More information

A long, deep and wide artificial neural net for robust speech recognition in unknown noise

A long, deep and wide artificial neural net for robust speech recognition in unknown noise A long, deep and wide artificial neural net for robust speech recognition in unknown noise Feipeng Li, Phani S. Nidadavolu, and Hynek Hermansky Center for Language and Speech Processing Johns Hopkins University,

More information

Partitioning Data. IRDS: Evaluation, Debugging, and Diagnostics. Cross-Validation. Cross-Validation for parameter tuning

Partitioning Data. IRDS: Evaluation, Debugging, and Diagnostics. Cross-Validation. Cross-Validation for parameter tuning Partitioning Data IRDS: Evaluation, Debugging, and Diagnostics Charles Sutton University of Edinburgh Training Validation Test Training : Running learning algorithms Validation : Tuning parameters of learning

More information

Deep Learning. Volker Tresp Summer 2015

Deep Learning. Volker Tresp Summer 2015 Deep Learning Volker Tresp Summer 2015 1 Neural Network Winter and Revival While Machine Learning was flourishing, there was a Neural Network winter (late 1990 s until late 2000 s) Around 2010 there

More information

Object Detection with Partial Occlusion Based on a Deformable Parts-Based Model

Object Detection with Partial Occlusion Based on a Deformable Parts-Based Model Object Detection with Partial Occlusion Based on a Deformable Parts-Based Model Johnson Hsieh (johnsonhsieh@gmail.com), Alexander Chia (alexchia@stanford.edu) Abstract -- Object occlusion presents a major

More information