Pitch Prediction from Mel-frequency Cepstral Coefficients Using Sparse Spectrum Recovery


1 Pitch Prediction from Mel-frequency Cepstral Coefficients Using Sparse Spectrum Recovery Achuth Rao MV, Prasanta Kumar Ghosh SPIRE LAB Electrical Engineering, Indian Institute of Science (IISc), Bangalore, India SPIRE LAB, IISc, Bangalore 1

2 1 Introduction 2 Proposed approach 3 Previous work and baseline 4 Experiments and results Database Experimental setup Evaluation 5 Conclusion and future work SPIRE LAB, IISc, Bangalore 2

3 Introduction Section 1 1 Introduction 2 Proposed approach 3 Previous work and baseline 4 Experiments and results Database Experimental setup Evaluation 5 Conclusion and future work SPIRE LAB, IISc, Bangalore 3

4 Introduction Motivation Automatic speech recognition (ASR) systems are very common in mobile devices. SPIRE LAB, IISc, Bangalore 4

5 Introduction Motivation Automatic speech recognition (ASR) systems are very common in mobile devices. Implementing ASR applications on mobile devices using these models could be challenging due to their computational and memory constraints. SPIRE LAB, IISc, Bangalore 4

6 Introduction Motivation Distributed speech recognition (DSR) allows ASR applications to be used in mobile devices 1. 1 Choi, 14-2: Invited Paper: Enabling Technologies for Wearable Smart Headsets, 2016 SPIRE LAB, IISc, Bangalore 5

7 Introduction Motivation Distributed speech recognition (DSR) allows ASR applications to be used in mobile devices 2. Such systems replace low bit-rate speech codecs with feature vectors (such as MFCCs). 2 Choi, 14-2: Invited Paper: Enabling Technologies for Wearable Smart Headsets, 2016. 3 Shao and Milner, Pitch prediction from MFCC vectors for speech reconstruction, 2004 SPIRE LAB, IISc, Bangalore 6

8 Introduction Motivation Distributed speech recognition (DSR) allows ASR applications to be used in mobile devices 2. Such systems replace low bit-rate speech codecs with feature vectors (such as MFCCs). The removal of the speech codec gives increased recognition accuracy, particularly in the presence of acoustic noise or channel errors 3. 2 Choi, 14-2: Invited Paper: Enabling Technologies for Wearable Smart Headsets, 2016. 3 Shao and Milner, Pitch prediction from MFCC vectors for speech reconstruction, 2004 SPIRE LAB, IISc, Bangalore 6

9 Introduction Motivation HMM-based recognizers used such features directly for ASR 4. 4 Gales, Maximum likelihood linear transformations for HMM-based speech recognition, 1998 SPIRE LAB, IISc, Bangalore 7

10 Introduction Motivation Recently, in many practical scenarios, speech recognition accuracy has come close to the human level using end-to-end deep architectures. Xiong et al., The Microsoft 2016 Conversational Speech Recognition System; Zweig et al., Advances in All-Neural Speech Recognition; Chan et al., Listen, attend and spell: A neural network for large vocabulary conversational speech recognition, 2016 SPIRE LAB, IISc, Bangalore 8

11 Introduction Motivation One way to use these features with such systems is to reconstruct the speech from the features. SPIRE LAB, IISc, Bangalore 9

12 Introduction Motivation In HMM-based ASR, the features used are, in most cases, Mel-frequency Cepstral Coefficients (MFCC). So we need a way to reconstruct the speech using only MFCCs. We therefore propose to predict the pitch from MFCC as the first step in speech reconstruction. SPIRE LAB, IISc, Bangalore 10

13 Introduction How do Mel-frequency Cepstral Coefficients (MFCC) encode the pitch information? SPIRE LAB, IISc, Bangalore 11

14 Introduction Source Filter model of speech 8 8 Fant, Acoustic theory of speech production: with calculations based on X-ray studies of Russian articulations, 1971 SPIRE LAB, IISc, Bangalore 12

15 Introduction Source Filter model of speech 8 Note the sparse nature of the speech spectrum! 8 Fant, Acoustic theory of speech production: with calculations based on X-ray studies of Russian articulations, 1971 SPIRE LAB, IISc, Bangalore 12

16 Introduction MFCC computation 9 9 Huang et al., Spoken language processing: A guide to theory, algorithm, and system development, 2001 SPIRE LAB, IISc, Bangalore 13

17 Introduction MFCC computation 9 w(n) is the window signal. 9 Huang et al., Spoken language processing: A guide to theory, algorithm, and system development, 2001 SPIRE LAB, IISc, Bangalore 13

18 Introduction MFCC computation 9 9 Huang et al., Spoken language processing: A guide to theory, algorithm, and system development, 2001 SPIRE LAB, IISc, Bangalore 13

19 Introduction MFCC computation 9 $H_m[k]$, $0 \le k \le N-1$, is the frequency response of the $m$-th filter. 9 Huang et al., Spoken language processing: A guide to theory, algorithm, and system development, 2001 SPIRE LAB, IISc, Bangalore 13

20 Introduction MFCC computation 9 9 Huang et al., Spoken language processing: A guide to theory, algorithm, and system development, 2001 SPIRE LAB, IISc, Bangalore 13

21 Introduction MFCC computation 9 9 Huang et al., Spoken language processing: A guide to theory, algorithm, and system development, 2001 SPIRE LAB, IISc, Bangalore 13

22 Introduction MFCC computation 9 9 Huang et al., Spoken language processing: A guide to theory, algorithm, and system development, 2001 SPIRE LAB, IISc, Bangalore 13
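The MFCC pipeline on these slides (window, magnitude DFT, mel filterbank energies, log, DCT truncation) can be sketched as below. This is a minimal illustration, not the authors' code; the mel formula, filter placement, and sampling rate are standard assumptions not fully specified on the slides.

```python
import numpy as np
from scipy.fft import dct

def mel_filterbank(num_filters, n_fft, sr):
    """Triangular filters H_m[k] spaced uniformly on the mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(mel(0.0), mel(sr / 2.0), num_filters + 2)
    bins = np.floor((n_fft + 1) * inv_mel(mel_pts) / sr).astype(int)
    H = np.zeros((num_filters, n_fft // 2 + 1))
    for m in range(1, num_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):                       # rising edge
            H[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                       # falling edge
            H[m - 1, k] = (r - k) / max(r - c, 1)
    return H

def mfcc(frame, sr=16000, n_fft=2048, M=26, K=13):
    """Window -> |DFT| -> mel filterbank energies (MFBE) -> log -> DCT."""
    w = np.hamming(len(frame))                      # w(n), the window signal
    spec = np.abs(np.fft.rfft(frame * w, n_fft))    # magnitude spectrum
    H = mel_filterbank(M, n_fft, sr)
    log_mfbe = np.log(H @ spec + 1e-10)             # log mel filterbank energies
    return dct(log_mfbe, type=2, norm='ortho')[:K]  # keep first K coefficients
```

Truncating the DCT to the first K of M coefficients is exactly the lossy step the later slides try to invert.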

23 Proposed approach Section 2 1 Introduction 2 Proposed approach 3 Previous work and baseline 4 Experiments and results Database Experimental setup Evaluation 5 Conclusion and future work SPIRE LAB, IISc, Bangalore 14

24 Proposed approach Proposed approach: Pitch prediction from MFCC What are the blocks to be inverted? SPIRE LAB, IISc, Bangalore 15

25 Proposed approach Proposed approach: Pitch prediction from MFCC The speech magnitude spectrum is enough to predict the pitch! SPIRE LAB, IISc, Bangalore 15

26 Proposed approach Proposed approach: Pitch prediction from MFCC Which blocks are non-invertible? SPIRE LAB, IISc, Bangalore 16

27 Proposed approach Proposed approach: Pitch prediction from MFCC Which blocks are non-invertible? SPIRE LAB, IISc, Bangalore 16

28 Proposed approach Proposed approach: Pitch prediction from MFCC Which blocks are non-invertible? SPIRE LAB, IISc, Bangalore 16

29 Proposed approach Proposed approach: Pitch prediction from MFCC Which blocks are non-invertible? SPIRE LAB, IISc, Bangalore 16

30 Proposed approach Proposed approach: Pitch prediction from MFCC Which blocks are non-invertible? We propose a three-step method to estimate the pitch from MFCC. 1 Estimate the MFBEs from the MFCC. 2 Recover the spectrum from the estimated MFBEs. 3 Estimate the pitch from the spectrum. SPIRE LAB, IISc, Bangalore 16

31 Proposed approach Proposed approach-estimation of the spectrum from the MFBEs SPIRE LAB, IISc, Bangalore 17

32 Proposed approach (1) Estimate the MFBE from the MFCC SPIRE LAB, IISc, Bangalore 18

33 Proposed approach (1) Estimate the MFBE from the MFCC The DCT operation is invertible only if the number of MFBEs (M) and MFCCs (K) are the same. If K < M, we use two methods to recover the MFBEs. 1 Z_DCT: zero padding of the MFCC. 2 DNN_DCT: DNN-based estimation. SPIRE LAB, IISc, Bangalore 19
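The Z_DCT variant can be sketched as below: pad the truncated MFCC vector with zeros up to M coefficients, then apply the inverse DCT. This assumes the orthonormal DCT-II convention; the slides do not state which DCT variant is used.

```python
import numpy as np
from scipy.fft import idct

def mfbe_from_mfcc_zeropad(mfcc_vec, M=26):
    """Z_DCT: append zeros for the truncated DCT coefficients, then invert
    the DCT to get an estimate of the log mel filterbank energies."""
    c = np.zeros(M)
    c[:len(mfcc_vec)] = mfcc_vec   # first K coefficients kept, rest zero
    return idct(c, type=2, norm='ortho')
```

When K = M this recovers the log-MFBEs exactly; for K < M the missing high-order coefficients introduce the estimation error studied in the experiments.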

34 Proposed approach (2) Recover the spectrum from the estimated MFBEs SPIRE LAB, IISc, Bangalore 20

35 Proposed approach [2a] Recover the spectrum from the estimated MFBEs. The voiced spectrum is sparse and the pitch can be determined from the voiced spectrum. The values around the harmonics are determined by the spectrum of the window. We model the voiced speech spectrum as

$$Y[k] \approx W[k] * \left( \sum_{l=1}^{L} x_l \, \delta(k - N_0 l) \right) = \sum_{l=1}^{L} x_l \, W(k - N_0 l)$$

This can be compactly written as $Y \approx Wx$, where $x$ is a sparse vector. SPIRE LAB, IISc, Bangalore 21
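The model $Y \approx Wx$ can be made concrete by assembling a dictionary matrix whose columns are shifted copies of the window's magnitude spectrum. This is a hypothetical construction for illustration; the slides do not give the exact matrix assembly, and the circular shift here is a simplification.

```python
import numpy as np

def window_dictionary(n_fft=2048, win_len=640):
    """Columns are shifted copies of the window's magnitude spectrum, so
    Y ≈ W @ x expresses the voiced spectrum as a sparse combination of
    harmonic impulses smeared by the window spectrum W(k - j)."""
    w = np.hamming(win_len)
    w_spec = np.abs(np.fft.fft(w, n_fft))   # window magnitude spectrum
    n_bins = n_fft // 2 + 1
    W = np.empty((n_bins, n_bins))
    for j in range(n_bins):
        # circular shift places the window main lobe at bin j
        W[:, j] = np.roll(w_spec, j)[:n_bins]
    return W
```

A voiced spectrum with pitch bin $N_0$ then corresponds to an $x$ with nonzero entries only at multiples of $N_0$.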

36 Proposed approach (2b) Recover the spectrum from the estimated MFBEs. There is error in the modeling because of non-invertibility. SPIRE LAB, IISc, Bangalore 22

37 Proposed approach (2b) Recover the spectrum from the estimated MFBEs. There is error in the modeling because of non-invertibility. The estimated MFBE can be written as $\hat{f} = H\hat{W}x + \gamma$, where $\gamma$ is the sum of the model and estimation noise. SPIRE LAB, IISc, Bangalore 22

38 Proposed approach (2b) Recover the spectrum from the estimated MFBEs. There is error in the modeling because of non-invertibility. The estimated MFBE can be written as $\hat{f} = H\hat{W}x + \gamma$, where $\gamma$ is the sum of the model and estimation noise. We propose two methods to recover the spectrum from the MFBEs. 1 Direct estimation of the spectrum under the noise model given above. 2 Estimation of the spectrum with a sparsity constraint on the spectrum. SPIRE LAB, IISc, Bangalore 22

39 Proposed approach (2c) Recover the spectrum from the estimated MFBEs. Given that $\hat{f} = H\hat{W}x + \gamma$, the maximum likelihood estimate of the spectrum is given by

$$x_{PINV} = \arg\min_{x} \|\hat{f} - H\hat{W}x\|_2^2 \quad (1)$$

SPIRE LAB, IISc, Bangalore 23

40 Proposed approach (2c) Recover the spectrum from the estimated MFBEs. Given that $\hat{f} = H\hat{W}x + \gamma$, the maximum likelihood estimate of the spectrum is given by

$$x_{PINV} = \arg\min_{x} \|\hat{f} - H\hat{W}x\|_2^2 \quad (1)$$

The solution has a closed form and can be written using the pseudo-inverse (PINV) of $H\hat{W}$ as follows:

$$x_{PINV} = \left((H\hat{W})^T H\hat{W}\right)^{-1} (H\hat{W})^T \hat{f} \quad (2)$$

SPIRE LAB, IISc, Bangalore 23
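The PINV solution of Eq. (1) can be sketched with a standard least-squares solver; this is equivalent to the pseudo-inverse closed form of Eq. (2) when $H\hat{W}$ has full column rank, and still returns a (minimum-norm) least-squares solution otherwise.

```python
import numpy as np

def spectrum_pinv(f_hat, HW):
    """x_PINV = argmin_x ||f_hat - HW x||_2^2, solved by least squares
    (equivalent to the pseudo-inverse closed form)."""
    x, *_ = np.linalg.lstsq(HW, f_hat, rcond=None)
    return x
```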

41 Proposed approach (2d) Recover the spectrum from the estimated MFBEs. We impose non-negativity and sparsity constraints on $x$. This results in the following optimization problem:

$$x_S = \arg\min_{x \ge 0} \|\hat{f} - H\hat{W}x\|_2^2 + \lambda \|x\|_1$$

10 Koh, Kim, and Boyd, An interior-point method for large-scale l1-regularized logistic regression, 2007 SPIRE LAB, IISc, Bangalore 24

42 Proposed approach (2d) Recover the spectrum from the estimated MFBEs. We impose non-negativity and sparsity constraints on $x$. This results in the following optimization problem:

$$x_S = \arg\min_{x \ge 0} \|\hat{f} - H\hat{W}x\|_2^2 + \lambda \|x\|_1$$

Since there is a non-negativity constraint on $x$, the $l_1$ norm of $x$ can be written as the sum of its elements. The equivalent optimization problem becomes:

$$x_S = \arg\min_{x \ge 0} \|\hat{f} - H\hat{W}x\|_2^2 + \lambda \mathbf{1}^T x \quad (3)$$

This optimization is posed as a quadratic programming problem 10. 10 Koh, Kim, and Boyd, An interior-point method for large-scale l1-regularized logistic regression, 2007 SPIRE LAB, IISc, Bangalore 24

43 Proposed approach (2d) Recover the spectrum from the estimated MFBEs. We impose non-negativity and sparsity constraints on $x$. This results in the following optimization problem:

$$x_S = \arg\min_{x \ge 0} \|\hat{f} - H\hat{W}x\|_2^2 + \lambda \|x\|_1$$

Since there is a non-negativity constraint on $x$, the $l_1$ norm of $x$ can be written as the sum of its elements. The equivalent optimization problem becomes:

$$x_S = \arg\min_{x \ge 0} \|\hat{f} - H\hat{W}x\|_2^2 + \lambda \mathbf{1}^T x \quad (3)$$

This optimization is posed as a quadratic programming problem 10. Note that $\lambda$ is a hyper-parameter and controls the sparsity. 10 Koh, Kim, and Boyd, An interior-point method for large-scale l1-regularized logistic regression, 2007 SPIRE LAB, IISc, Bangalore 24
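Eq. (3) can be solved with any QP solver; the paper uses an interior-point method 10. As a self-contained sketch, projected gradient descent (a simpler substitute, not the authors' solver) handles the non-negativity constraint by clipping after each step:

```python
import numpy as np

def spectrum_sparse(f_hat, HW, lam=0.1, n_iter=2000):
    """Sketch of x_S = argmin_{x >= 0} ||f_hat - HW x||_2^2 + lam * 1^T x,
    solved by projected gradient descent instead of the interior-point
    QP solver used in the paper."""
    # step size from the Lipschitz constant of the gradient, 2*sigma_max^2
    step = 1.0 / (2.0 * np.linalg.norm(HW, 2) ** 2)
    x = np.zeros(HW.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * HW.T @ (HW @ x - f_hat) + lam  # gradient of Eq. (3)
        x = np.maximum(0.0, x - step * grad)        # project onto x >= 0
    return x
```

Larger `lam` drives more entries of `x` to zero, which is how the sparsity level is controlled.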

44 Proposed approach (3) Estimation of the pitch from the estimated spectrum SPIRE LAB, IISc, Bangalore 25

45 Proposed approach (3) Estimation of the pitch from the estimated spectrum We use the Subharmonic-to-Harmonic Ratio (SHR) 11 to estimate the pitch from the spectrum. Given the magnitude spectrum $X(f)$, the pitch range $S$ and the number of harmonics $Q$, the pitch value $p^*$ is obtained by solving the optimization below:

$$p^* = \arg\max_{f \in S} \sum_{k=1}^{Q} \int_0^{\infty} \log X(f') \left[ \delta(f' - kf) - \delta(f' - (k - 1/2)f) \right] df' \quad (4)$$

11 Sun, Pitch determination and voice quality analysis using subharmonic-to-harmonic ratio, 2002 SPIRE LAB, IISc, Bangalore 26

46 Proposed approach [3] Estimation of the pitch from the estimated spectrum SPIRE LAB, IISc, Bangalore 27

47 Previous work and baseline Section 3 1 Introduction 2 Proposed approach 3 Previous work and baseline 4 Experiments and results Database Experimental setup Evaluation 5 Conclusion and future work SPIRE LAB, IISc, Bangalore 28

48 Previous work and baseline Previous work and baseline There are several works in the literature where the pitch is predicted from the MFCC using statistical models such as Gaussian mixture models (GMM) and hidden Markov models 12. Here we use a Deep neural network (DNN) based method, which has shown a lot of success in many fields, to predict the pitch from MFCC. We refer to this DNN as DNN_b. 12 Milner and Shao, Prediction of fundamental frequency and voicing from Mel-frequency cepstral coefficients for unconstrained speech reconstruction. 13 Shao and Milner, Pitch prediction from MFCC vectors for speech reconstruction, 2004 SPIRE LAB, IISc, Bangalore 29

49 Previous work and baseline [3] Estimation of the pitch from the estimated spectrum SPIRE LAB, IISc, Bangalore 30

50 Experiments and results Section 4 1 Introduction 2 Proposed approach 3 Previous work and baseline 4 Experiments and results Database Experimental setup Evaluation 5 Conclusion and future work SPIRE LAB, IISc, Bangalore 31

51 Experiments and results Database Database We use two databases: CMUARCTIC 14 and KEELE 15. 14 Kominek and Black, The CMU ARCTIC speech databases. 15 Plante, Meyer, and Ainsworth, A pitch extraction reference database, 1995 SPIRE LAB, IISc, Bangalore 32

52 Experiments and results Database Database We use two databases: CMUARCTIC 14 and KEELE 15. CMU-ARCTIC database: one male (jmk) and one female (slt), 48 min each. 14 Kominek and Black, The CMU ARCTIC speech databases. 15 Plante, Meyer, and Ainsworth, A pitch extraction reference database, 1995 SPIRE LAB, IISc, Bangalore 32

53 Experiments and results Database Database We use two databases: CMUARCTIC 14 and KEELE 15. CMU-ARCTIC database: one male (jmk) and one female (slt), 48 min each. KEELE database: one male and one female, 4 min each. 14 Kominek and Black, The CMU ARCTIC speech databases. 15 Plante, Meyer, and Ainsworth, A pitch extraction reference database, 1995 SPIRE LAB, IISc, Bangalore 32

54 Experiments and results Database Database We use two databases: CMUARCTIC 14 and KEELE 15. CMU-ARCTIC database: one male (jmk) and one female (slt), 48 min each. KEELE database: one male and one female, 4 min each. We randomly choose 80% of the CMUARCTIC data from each speaker as the training set and the rest as the test set. All of KEELE is used as a test set to evaluate the generalization of the algorithms. 14 Kominek and Black, The CMU ARCTIC speech databases. 15 Plante, Meyer, and Ainsworth, A pitch extraction reference database, 1995 SPIRE LAB, IISc, Bangalore 32

55 Experiments and results Database Database The pitch histograms for the different train and test sets are shown below. SPIRE LAB, IISc, Bangalore 33

56 Experiments and results Database Database The pitch histograms for the different train and test sets are shown below. Note that the histogram mismatch is larger for MALE. SPIRE LAB, IISc, Bangalore 33

57 Experiments and results Experimental setup Experimental setup: MFCC and Pitch computation MFCC computation: 1 Hamming window of 40 ms with a shift of 10 ms. 2 A 2048-point DFT is computed. 3 The MFBEs are computed by placing the M=26 filter banks uniformly on the mel scale from Hz 16. 4 The DCT with K = 26, 21, 16, 13 is computed to investigate the estimation error due to different amounts of truncation of the DCT coefficients. Pitch: We use the auto-correlation method from Praat 17 on the EGG signal available with the database to determine the ground truth fundamental frequency and voicing. The unvoiced frames are removed from the data for the experiments. 16 The velocity and the acceleration coefficients of MFCC are not used. 17 Boersma and Weenink, Praat: doing phonetics by computer, 2010 SPIRE LAB, IISc, Bangalore 34

58 Experiments and results Experimental setup Experimental setup: Hyper-parameter selection Proposed method: The sparse spectrum estimation has a hyper-parameter $\lambda$, which is experimentally found using 10% of randomly selected training data to minimize the pitch error for each training set. SPIRE LAB, IISc, Bangalore 35

59 Experiments and results Experimental setup Experimental setup: Hyper-parameter selection Proposed method: The sparse spectrum estimation has a hyper-parameter $\lambda$, which is experimentally found using 10% of randomly selected training data to minimize the pitch error for each training set. Pitch estimation: We use 4 harmonics to compute the pitch score using SHR. The pitch search ranges (S) for CM, CF and CM+CF are chosen to be , , respectively. SPIRE LAB, IISc, Bangalore 35

60 Experiments and results Evaluation Evaluation We use the root mean squared error 18 as the metric to measure the pitch estimation performance. It is computed from the estimated pitch $\hat{p}_i$ and the original pitch $p_i$ at the $i$-th frame over the entire test set with $N_{tot}$ voiced frames:

$$RMSE = \sqrt{\frac{1}{N_{tot}} \sum_{i=1}^{N_{tot}} (\hat{p}_i - p_i)^2}$$

18 Tabrikian, Dubnov, and Dickalov, Maximum a-posteriori probability pitch tracking in noisy environments using harmonic model, 2004 SPIRE LAB, IISc, Bangalore 36
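The RMSE metric above is straightforward to compute over the voiced frames of the test set:

```python
import numpy as np

def pitch_rmse(p_est, p_ref):
    """Root mean squared pitch error over all N_tot voiced frames."""
    p_est = np.asarray(p_est, dtype=float)
    p_ref = np.asarray(p_ref, dtype=float)
    return np.sqrt(np.mean((p_est - p_ref) ** 2))
```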

61 Experiments and results Evaluation Estimation of the pitch from the estimated spectrum SPIRE LAB, IISc, Bangalore 37

62 Experiments and results Evaluation Sample recovered spectrum SPIRE LAB, IISc, Bangalore 38

63 Experiments and results Evaluation Sample recovered spectrum The sparsity constraint helps in recovering the lower-pitch spectrum with higher accuracy. SPIRE LAB, IISc, Bangalore 38

64 Experiments and results Evaluation Set of models trained. Subscript z indicates DCT reconstruction using zero padding, subscript D indicates DCT reconstruction using DNN, and the superscript indicates the training set. Each method (DNN_b, PINV_z, PS_z, PINV_D, PS_D) is trained on CM, CF, and CM+CF, giving models such as DNN_b^CM, PS_z^CF, and PINV_D^CM+CF. SPIRE LAB, IISc, Bangalore 39

65 Experiments and results Evaluation RMSE error with matched test data: MALE, test set CM. [Figure: pitch RMSE vs. K for DNN_b, PINV, and PS models] DNN_b outperforms all the other methods. SPIRE LAB, IISc, Bangalore 40

66 Experiments and results Evaluation RMSE error with matched test data: MALE, test set CM. [Figure: pitch RMSE vs. K for DNN_b, PINV, and PS models] The pitch RMSE increases as the truncation increases. SPIRE LAB, IISc, Bangalore 40

67 Experiments and results Evaluation RMSE error with matched test data: MALE, test set CM. [Figure: pitch RMSE vs. K for DNN_b, PINV, and PS models] The gender-mismatched DNN model performs poorly. SPIRE LAB, IISc, Bangalore 40

68 Experiments and results Evaluation RMSE error with matched test data: MALE, test set CM. [Figure: pitch RMSE vs. K for DNN_b, PINV, and PS models] The PS and PINV methods are not affected much by the gender mismatch. SPIRE LAB, IISc, Bangalore 40

69 Experiments and results Evaluation RMSE error with matched test data: MALE, test set CM. [Figure: pitch RMSE vs. K for DNN_b, PINV, and PS models] The DNN-based DCT estimation helps in all cases. SPIRE LAB, IISc, Bangalore 40

70 Experiments and results Evaluation RMSE error with matched test data: FEMALE, test set CF. [Figure: pitch RMSE vs. K for DNN_b, PINV, and PS models] The same observations apply. The RMSE is lower compared to the MALE case. SPIRE LAB, IISc, Bangalore 41

71 Experiments and results Evaluation RMSE error with matched test data: MALE+FEMALE, test set CM+CF. [Figure: pitch RMSE vs. K for DNN_b, PINV, and PS models] The same observations apply. The RMSE increases compared to the gender-dependent cases. SPIRE LAB, IISc, Bangalore 42

72 Experiments and results Evaluation RMSE error with mismatched MALE test data: KM. [Figure: pitch RMSE vs. K for DNN_b, PINV, and PS models] DNN_b performs poorly because of the histogram mismatch. SPIRE LAB, IISc, Bangalore 43

73 Experiments and results Evaluation RMSE error with mismatched MALE test data: KM. [Figure: pitch RMSE vs. K for DNN_b, PINV, and PS models] The PS method outperforms all the other methods. SPIRE LAB, IISc, Bangalore 43

74 Experiments and results Evaluation RMSE error with mismatched MALE test data: KM. [Figure: pitch RMSE vs. K for DNN_b, PINV, and PS models] The RMSE of PINV_D and PS_D is higher than that of PINV_z and PS_z because the DNN-DCT prediction is also poor. SPIRE LAB, IISc, Bangalore 43

75 Experiments and results Evaluation RMSE error with mismatched FEMALE test data: KF. [Figure: pitch RMSE vs. K for DNN_b, PINV, and PS models] The PINV_z method outperforms all the other methods at lower amounts of truncation. SPIRE LAB, IISc, Bangalore 44

76 Experiments and results Evaluation RMSE error with mismatched FEMALE test data: KF. [Figure: pitch RMSE vs. K for DNN_b, PINV, and PS models] The DNN method is better at higher truncation because the histogram mismatch is smaller for the FEMALE data. SPIRE LAB, IISc, Bangalore 44

77 Experiments and results Evaluation RMSE error with mismatched MALE+FEMALE test data To evaluate the performance of the algorithms in a general unseen scenario, we evaluate the gender-independent models of each method (DNN, PINV and PS). The average RMSE on the unseen KEELE database is shown below. [Table: average RMSE for DNN_b^CM+CF, PINV_z^CM+CF, and PS_z^CM+CF] SPIRE LAB, IISc, Bangalore 45

78 Experiments and results Evaluation RMSE error with mismatched MALE+FEMALE test data To evaluate the performance of the algorithms in a general unseen scenario, we evaluate the gender-independent models of each method (DNN, PINV and PS). The average RMSE on the unseen KEELE database is shown below. [Table: average RMSE for DNN_b^CM+CF, PINV_z^CM+CF, and PS_z^CM+CF] Pitch prediction with the sparsity constraint outperforms the other two methods on general unseen data with unknown gender. SPIRE LAB, IISc, Bangalore 45

79 Conclusion and future work Section 5 1 Introduction 2 Proposed approach 3 Previous work and baseline 4 Experiments and results Database Experimental setup Evaluation 5 Conclusion and future work SPIRE LAB, IISc, Bangalore 46

80 Conclusion and future work Conclusion and future work Proposed a three-step method to estimate the pitch from MFCC vectors. SPIRE LAB, IISc, Bangalore 47

81 Conclusion and future work Conclusion and future work Proposed a three-step method to estimate the pitch from MFCC vectors. We showed that the sparsity constraint helps in recovering the pitch value more accurately for MALE subjects and generalizes well across databases. SPIRE LAB, IISc, Bangalore 47

82 Conclusion and future work Conclusion and future work Proposed a three-step method to estimate the pitch from MFCC vectors. We showed that the sparsity constraint helps in recovering the pitch value more accurately for MALE subjects and generalizes well across databases. It might be possible to train a DNN with many speakers to get a better model that generalizes well on unseen test cases. However, obtaining data with EGG from many speakers could be challenging. SPIRE LAB, IISc, Bangalore 47

83 Conclusion and future work Conclusion and future work Proposed a three-step method to estimate the pitch from MFCC vectors. We showed that the sparsity constraint helps in recovering the pitch value more accurately for MALE subjects and generalizes well across databases. It might be possible to train a DNN with many speakers to get a better model that generalizes well on unseen test cases. However, obtaining data with EGG from many speakers could be challenging. Future work may include imposing a periodicity constraint along with the sparsity constraint on the spectrum, as well as reconstruction of speech using the estimated pitch and evaluation of the ASR performance and the naturalness of the synthesized speech. SPIRE LAB, IISc, Bangalore 47

84 Conclusion and future work THANK YOU SPIRE LAB, IISc, Bangalore 48

85 Conclusion and future work SPIRE LAB, IISc, Bangalore 49

86 Conclusion and future work Experimental setup: Deep neural network 1 The structure of the DNNs is defined recursively on the layer index $l$. The input vector $z_l \in \mathbb{R}^{d_1}$ is mapped to the representation vector $z_{l+1} \in \mathbb{R}^{d_2}$ through an activation function $f_l$ as follows:

$$z_{l+1} = f_l(W_l z_l + b_l), \quad 0 \le l \le L-1 \quad (5)$$

where

$$f_l(x) = \begin{cases} \tanh(x), & 0 \le l \le L-2 \\ x, & l = L-1. \end{cases}$$

$d_1$ and $d_2$ are the input and output dimensions of the $l$-th layer. $W_l$ and $b_l$ are the parameters of the network. These parameters are estimated by back propagation and stochastic gradient descent. SPIRE LAB, IISc, Bangalore 50
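The recursion in Eq. (5) amounts to the forward pass sketched below: tanh on every hidden layer, linear output on the last. This is an illustrative inference-only sketch; the training loop (back propagation, SGD) is not shown.

```python
import numpy as np

def dnn_forward(z, weights, biases):
    """Forward pass of the described network: z_{l+1} = f_l(W_l z_l + b_l),
    with f_l = tanh for 0 <= l <= L-2 and identity for l = L-1."""
    L = len(weights)
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ z + b
        if l < L - 1:      # hidden layers use tanh
            z = np.tanh(z)
    return z               # last layer is linear
```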

87 Conclusion and future work Experimental setup: Deep neural network 1 The DNNs for both DNN_DCT and DNN_b have the same architecture and training procedure, except for the number of hidden units in each layer. We use a 4-layer network with 256 units in each layer for DNN_b and 512 units for DNN_DCT. 2 The input data is normalized to zero mean and unit variance. 3 The network is initialized using Glorot initialization. 4 The training is performed using stochastic gradient descent with a batch size of 256 and a momentum of 0.9. 20% of the training data is used to monitor the validation loss at each epoch, and the weight updates are stopped when there is no improvement in the validation loss. SPIRE LAB, IISc, Bangalore 51


Research on the New Image De-Noising Methodology Based on Neural Network and HMM-Hidden Markov Models Research on the New Image De-Noising Methodology Based on Neural Network and HMM-Hidden Markov Models Wenzhun Huang 1, a and Xinxin Xie 1, b 1 School of Information Engineering, Xijing University, Xi an

More information

27: Hybrid Graphical Models and Neural Networks

27: Hybrid Graphical Models and Neural Networks 10-708: Probabilistic Graphical Models 10-708 Spring 2016 27: Hybrid Graphical Models and Neural Networks Lecturer: Matt Gormley Scribes: Jakob Bauer Otilia Stretcu Rohan Varma 1 Motivation We first look

More information

A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models

A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating Michael Price*, James Glass, Anantha Chandrakasan MIT, Cambridge, MA * now at Analog Devices, Cambridge,

More information

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011 Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011 1. Introduction Reddit is one of the most popular online social news websites with millions

More information

Optimization of Observation Membership Function By Particle Swarm Method for Enhancing Performances of Speaker Identification

Optimization of Observation Membership Function By Particle Swarm Method for Enhancing Performances of Speaker Identification Proceedings of the 6th WSEAS International Conference on SIGNAL PROCESSING, Dallas, Texas, USA, March 22-24, 2007 52 Optimization of Observation Membership Function By Particle Swarm Method for Enhancing

More information

Authentication of Fingerprint Recognition Using Natural Language Processing

Authentication of Fingerprint Recognition Using Natural Language Processing Authentication of Fingerprint Recognition Using Natural Language Shrikala B. Digavadekar 1, Prof. Ravindra T. Patil 2 1 Tatyasaheb Kore Institute of Engineering & Technology, Warananagar, India 2 Tatyasaheb

More information

Speaker Verification with Adaptive Spectral Subband Centroids

Speaker Verification with Adaptive Spectral Subband Centroids Speaker Verification with Adaptive Spectral Subband Centroids Tomi Kinnunen 1, Bingjun Zhang 2, Jia Zhu 2, and Ye Wang 2 1 Speech and Dialogue Processing Lab Institution for Infocomm Research (I 2 R) 21

More information

Novel Subband Autoencoder Features for Non-intrusive Quality Assessment of Noise Suppressed Speech

Novel Subband Autoencoder Features for Non-intrusive Quality Assessment of Noise Suppressed Speech INTERSPEECH 16 September 8 12, 16, San Francisco, USA Novel Subband Autoencoder Features for Non-intrusive Quality Assessment of Noise Suppressed Speech Meet H. Soni, Hemant A. Patil Dhirubhai Ambani Institute

More information

SOUND EVENT DETECTION AND CONTEXT RECOGNITION 1 INTRODUCTION. Toni Heittola 1, Annamaria Mesaros 1, Tuomas Virtanen 1, Antti Eronen 2

SOUND EVENT DETECTION AND CONTEXT RECOGNITION 1 INTRODUCTION. Toni Heittola 1, Annamaria Mesaros 1, Tuomas Virtanen 1, Antti Eronen 2 Toni Heittola 1, Annamaria Mesaros 1, Tuomas Virtanen 1, Antti Eronen 2 1 Department of Signal Processing, Tampere University of Technology Korkeakoulunkatu 1, 33720, Tampere, Finland toni.heittola@tut.fi,

More information

Stacked Denoising Autoencoders for Face Pose Normalization

Stacked Denoising Autoencoders for Face Pose Normalization Stacked Denoising Autoencoders for Face Pose Normalization Yoonseop Kang 1, Kang-Tae Lee 2,JihyunEun 2, Sung Eun Park 2 and Seungjin Choi 1 1 Department of Computer Science and Engineering Pohang University

More information

Aditi Upadhyay Research Scholar, Department of Electronics & Communication Engineering Jaipur National University, Jaipur, Rajasthan, India

Aditi Upadhyay Research Scholar, Department of Electronics & Communication Engineering Jaipur National University, Jaipur, Rajasthan, India Analysis of Different Classifier Using Feature Extraction in Speaker Identification and Verification under Adverse Acoustic Condition for Different Scenario Shrikant Upadhyay Assistant Professor, Department

More information

DEEP LEARNING IN PYTHON. The need for optimization

DEEP LEARNING IN PYTHON. The need for optimization DEEP LEARNING IN PYTHON The need for optimization A baseline neural network Input 2 Hidden Layer 5 2 Output - 9-3 Actual Value of Target: 3 Error: Actual - Predicted = 4 A baseline neural network Input

More information

Why DNN Works for Speech and How to Make it More Efficient?

Why DNN Works for Speech and How to Make it More Efficient? Why DNN Works for Speech and How to Make it More Efficient? Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering, York University, CANADA Joint work with Y.

More information

Geometric Reconstruction Dense reconstruction of scene geometry

Geometric Reconstruction Dense reconstruction of scene geometry Lecture 5. Dense Reconstruction and Tracking with Real-Time Applications Part 2: Geometric Reconstruction Dr Richard Newcombe and Dr Steven Lovegrove Slide content developed from: [Newcombe, Dense Visual

More information

The Automatic Musicologist

The Automatic Musicologist The Automatic Musicologist Douglas Turnbull Department of Computer Science and Engineering University of California, San Diego UCSD AI Seminar April 12, 2004 Based on the paper: Fast Recognition of Musical

More information

Lecture 7: Neural network acoustic models in speech recognition

Lecture 7: Neural network acoustic models in speech recognition CS 224S / LINGUIST 285 Spoken Language Processing Andrew Maas Stanford University Spring 2017 Lecture 7: Neural network acoustic models in speech recognition Outline Hybrid acoustic modeling overview Basic

More information

Tracking Algorithms. Lecture16: Visual Tracking I. Probabilistic Tracking. Joint Probability and Graphical Model. Deterministic methods

Tracking Algorithms. Lecture16: Visual Tracking I. Probabilistic Tracking. Joint Probability and Graphical Model. Deterministic methods Tracking Algorithms CSED441:Introduction to Computer Vision (2017F) Lecture16: Visual Tracking I Bohyung Han CSE, POSTECH bhhan@postech.ac.kr Deterministic methods Given input video and current state,

More information

Dynamic Time Warping

Dynamic Time Warping Centre for Vision Speech & Signal Processing University of Surrey, Guildford GU2 7XH. Dynamic Time Warping Dr Philip Jackson Acoustic features Distance measures Pattern matching Distortion penalties DTW

More information

TWO-STEP SEMI-SUPERVISED APPROACH FOR MUSIC STRUCTURAL CLASSIFICATION. Prateek Verma, Yang-Kai Lin, Li-Fan Yu. Stanford University

TWO-STEP SEMI-SUPERVISED APPROACH FOR MUSIC STRUCTURAL CLASSIFICATION. Prateek Verma, Yang-Kai Lin, Li-Fan Yu. Stanford University TWO-STEP SEMI-SUPERVISED APPROACH FOR MUSIC STRUCTURAL CLASSIFICATION Prateek Verma, Yang-Kai Lin, Li-Fan Yu Stanford University ABSTRACT Structural segmentation involves finding hoogeneous sections appearing

More information

Gating Neural Network for Large Vocabulary Audiovisual Speech Recognition

Gating Neural Network for Large Vocabulary Audiovisual Speech Recognition IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. XX, NO. X, JUNE 2017 1 Gating Neural Network for Large Vocabulary Audiovisual Speech Recognition Fei Tao, Student Member, IEEE, and

More information

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University DS 4400 Machine Learning and Data Mining I Alina Oprea Associate Professor, CCIS Northeastern University January 24 2019 Logistics HW 1 is due on Friday 01/25 Project proposal: due Feb 21 1 page description

More information

FUSION MODEL BASED ON CONVOLUTIONAL NEURAL NETWORKS WITH TWO FEATURES FOR ACOUSTIC SCENE CLASSIFICATION

FUSION MODEL BASED ON CONVOLUTIONAL NEURAL NETWORKS WITH TWO FEATURES FOR ACOUSTIC SCENE CLASSIFICATION Please contact the conference organizers at dcasechallenge@gmail.com if you require an accessible file, as the files provided by ConfTool Pro to reviewers are filtered to remove author information, and

More information

Reverberant Speech Recognition Based on Denoising Autoencoder

Reverberant Speech Recognition Based on Denoising Autoencoder INTERSPEECH 2013 Reverberant Speech Recognition Based on Denoising Autoencoder Takaaki Ishii 1, Hiroki Komiyama 1, Takahiro Shinozaki 2, Yasuo Horiuchi 1, Shingo Kuroiwa 1 1 Division of Information Sciences,

More information

Hidden Markov Models. Gabriela Tavares and Juri Minxha Mentor: Taehwan Kim CS159 04/25/2017

Hidden Markov Models. Gabriela Tavares and Juri Minxha Mentor: Taehwan Kim CS159 04/25/2017 Hidden Markov Models Gabriela Tavares and Juri Minxha Mentor: Taehwan Kim CS159 04/25/2017 1 Outline 1. 2. 3. 4. Brief review of HMMs Hidden Markov Support Vector Machines Large Margin Hidden Markov Models

More information

Gender-dependent acoustic models fusion developed for automatic subtitling of Parliament meetings broadcasted by the Czech TV

Gender-dependent acoustic models fusion developed for automatic subtitling of Parliament meetings broadcasted by the Czech TV Gender-dependent acoustic models fusion developed for automatic subtitling of Parliament meetings broadcasted by the Czech TV Jan Vaněk and Josef V. Psutka Department of Cybernetics, West Bohemia University,

More information

Real Time Speaker Recognition System using MFCC and Vector Quantization Technique

Real Time Speaker Recognition System using MFCC and Vector Quantization Technique Real Time Speaker Recognition System using MFCC and Vector Quantization Technique Roma Bharti Mtech, Manav rachna international university Faridabad ABSTRACT This paper represents a very strong mathematical

More information

Multinomial Regression and the Softmax Activation Function. Gary Cottrell!

Multinomial Regression and the Softmax Activation Function. Gary Cottrell! Multinomial Regression and the Softmax Activation Function Gary Cottrell Notation reminder We have N data points, or patterns, in the training set, with the pattern number as a superscript: {(x 1,t 1 ),

More information

Modeling Phonetic Context with Non-random Forests for Speech Recognition

Modeling Phonetic Context with Non-random Forests for Speech Recognition Modeling Phonetic Context with Non-random Forests for Speech Recognition Hainan Xu Center for Language and Speech Processing, Johns Hopkins University September 4, 2015 Hainan Xu September 4, 2015 1 /

More information

Automatic Speech Recognition on Mobile Devices and over Communication Networks

Automatic Speech Recognition on Mobile Devices and over Communication Networks Zheng-Hua Tan and Berge Lindberg Automatic Speech Recognition on Mobile Devices and over Communication Networks ^Spri inger g< Contents Preface Contributors v xix 1. Network, Distributed and Embedded Speech

More information

MACHINE LEARNING: CLUSTERING, AND CLASSIFICATION. Steve Tjoa June 25, 2014

MACHINE LEARNING: CLUSTERING, AND CLASSIFICATION. Steve Tjoa June 25, 2014 MACHINE LEARNING: CLUSTERING, AND CLASSIFICATION Steve Tjoa kiemyang@gmail.com June 25, 2014 Review from Day 2 Supervised vs. Unsupervised Unsupervised - clustering Supervised binary classifiers (2 classes)

More information

ImageNet Classification with Deep Convolutional Neural Networks

ImageNet Classification with Deep Convolutional Neural Networks ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky Ilya Sutskever Geoffrey Hinton University of Toronto Canada Paper with same name to appear in NIPS 2012 Main idea Architecture

More information

RECOGNITION OF EMOTION FROM MARATHI SPEECH USING MFCC AND DWT ALGORITHMS

RECOGNITION OF EMOTION FROM MARATHI SPEECH USING MFCC AND DWT ALGORITHMS RECOGNITION OF EMOTION FROM MARATHI SPEECH USING MFCC AND DWT ALGORITHMS Dipti D. Joshi, M.B. Zalte (EXTC Department, K.J. Somaiya College of Engineering, University of Mumbai, India) Diptijoshi3@gmail.com

More information

Sum-Product Networks. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 15, 2015

Sum-Product Networks. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 15, 2015 Sum-Product Networks STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 15, 2015 Introduction Outline What is a Sum-Product Network? Inference Applications In more depth

More information

Manifold Constrained Deep Neural Networks for ASR

Manifold Constrained Deep Neural Networks for ASR 1 Manifold Constrained Deep Neural Networks for ASR Department of Electrical and Computer Engineering, McGill University Richard Rose and Vikrant Tomar Motivation Speech features can be characterized as

More information

Using Gradient Descent Optimization for Acoustics Training from Heterogeneous Data

Using Gradient Descent Optimization for Acoustics Training from Heterogeneous Data Using Gradient Descent Optimization for Acoustics Training from Heterogeneous Data Martin Karafiát Λ, Igor Szöke, and Jan Černocký Brno University of Technology, Faculty of Information Technology Department

More information

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu FMA901F: Machine Learning Lecture 3: Linear Models for Regression Cristian Sminchisescu Machine Learning: Frequentist vs. Bayesian In the frequentist setting, we seek a fixed parameter (vector), with value(s)

More information

Index. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning,

Index. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning, A Acquisition function, 298, 301 Adam optimizer, 175 178 Anaconda navigator conda command, 3 Create button, 5 download and install, 1 installing packages, 8 Jupyter Notebook, 11 13 left navigation pane,

More information

A Novel Template Matching Approach To Speaker-Independent Arabic Spoken Digit Recognition

A Novel Template Matching Approach To Speaker-Independent Arabic Spoken Digit Recognition Special Session: Intelligent Knowledge Management A Novel Template Matching Approach To Speaker-Independent Arabic Spoken Digit Recognition Jiping Sun 1, Jeremy Sun 1, Kacem Abida 2, and Fakhri Karray

More information

Bilevel Sparse Coding

Bilevel Sparse Coding Adobe Research 345 Park Ave, San Jose, CA Mar 15, 2013 Outline 1 2 The learning model The learning algorithm 3 4 Sparse Modeling Many types of sensory data, e.g., images and audio, are in high-dimensional

More information

Machine Learning Feature Creation and Selection

Machine Learning Feature Creation and Selection Machine Learning Feature Creation and Selection Jeff Howbert Introduction to Machine Learning Winter 2012 1 Feature creation Well-conceived new features can sometimes capture the important information

More information

Comparative Evaluation of Feature Normalization Techniques for Speaker Verification

Comparative Evaluation of Feature Normalization Techniques for Speaker Verification Comparative Evaluation of Feature Normalization Techniques for Speaker Verification Md Jahangir Alam 1,2, Pierre Ouellet 1, Patrick Kenny 1, Douglas O Shaughnessy 2, 1 CRIM, Montreal, Canada {Janagir.Alam,

More information

MoonRiver: Deep Neural Network in C++

MoonRiver: Deep Neural Network in C++ MoonRiver: Deep Neural Network in C++ Chung-Yi Weng Computer Science & Engineering University of Washington chungyi@cs.washington.edu Abstract Artificial intelligence resurges with its dramatic improvement

More information

Lecture 13. Deep Belief Networks. Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen

Lecture 13. Deep Belief Networks. Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen Lecture 13 Deep Belief Networks Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen IBM T.J. Watson Research Center Yorktown Heights, New York, USA {picheny,bhuvana,stanchen}@us.ibm.com 12 December 2012

More information

Using Capsule Networks. for Image and Speech Recognition Problems. Yan Xiong

Using Capsule Networks. for Image and Speech Recognition Problems. Yan Xiong Using Capsule Networks for Image and Speech Recognition Problems by Yan Xiong A Thesis Presented in Partial Fulfillment of the Requirements for the Degree Master of Science Approved November 2018 by the

More information

Najiya P Fathima, C. V. Vipin Kishnan; International Journal of Advance Research, Ideas and Innovations in Technology

Najiya P Fathima, C. V. Vipin Kishnan; International Journal of Advance Research, Ideas and Innovations in Technology ISSN: 2454-32X Impact factor: 4.295 (Volume 4, Issue 2) Available online at: www.ijariit.com Analysis of Different Classifier for the Detection of Double Compressed AMR Audio Fathima Najiya P najinasi2@gmail.com

More information

Sparse Solutions to Linear Inverse Problems. Yuzhe Jin

Sparse Solutions to Linear Inverse Problems. Yuzhe Jin Sparse Solutions to Linear Inverse Problems Yuzhe Jin Outline Intro/Background Two types of algorithms Forward Sequential Selection Methods Diversity Minimization Methods Experimental results Potential

More information

Least Squares Signal Declipping for Robust Speech Recognition

Least Squares Signal Declipping for Robust Speech Recognition Least Squares Signal Declipping for Robust Speech Recognition Mark J. Harvilla and Richard M. Stern Department of Electrical and Computer Engineering Carnegie Mellon University, Pittsburgh, PA 15213 USA

More information

Robust speech recognition using features based on zero crossings with peak amplitudes

Robust speech recognition using features based on zero crossings with peak amplitudes Robust speech recognition using features based on zero crossings with peak amplitudes Author Gajic, Bojana, Paliwal, Kuldip Published 200 Conference Title Proceedings of the 200 IEEE International Conference

More information

CS489/698: Intro to ML

CS489/698: Intro to ML CS489/698: Intro to ML Lecture 14: Training of Deep NNs Instructor: Sun Sun 1 Outline Activation functions Regularization Gradient-based optimization 2 Examples of activation functions 3 5/28/18 Sun Sun

More information

Neural Networks Based Time-Delay Estimation using DCT Coefficients

Neural Networks Based Time-Delay Estimation using DCT Coefficients American Journal of Applied Sciences 6 (4): 73-78, 9 ISSN 1546-939 9 Science Publications Neural Networks Based Time-Delay Estimation using DCT Coefficients Samir J. Shaltaf and Ahmad A. Mohammad Department

More information

Voice Conversion Using Dynamic Kernel. Partial Least Squares Regression

Voice Conversion Using Dynamic Kernel. Partial Least Squares Regression Voice Conversion Using Dynamic Kernel 1 Partial Least Squares Regression Elina Helander, Hanna Silén, Tuomas Virtanen, Member, IEEE, and Moncef Gabbouj, Fellow, IEEE Abstract A drawback of many voice conversion

More information

Lecture 20: Neural Networks for NLP. Zubin Pahuja

Lecture 20: Neural Networks for NLP. Zubin Pahuja Lecture 20: Neural Networks for NLP Zubin Pahuja zpahuja2@illinois.edu courses.engr.illinois.edu/cs447 CS447: Natural Language Processing 1 Today s Lecture Feed-forward neural networks as classifiers simple

More information

The Un-normalized Graph p-laplacian based Semi-supervised Learning Method and Speech Recognition Problem

The Un-normalized Graph p-laplacian based Semi-supervised Learning Method and Speech Recognition Problem Int. J. Advance Soft Compu. Appl, Vol. 9, No. 1, March 2017 ISSN 2074-8523 The Un-normalized Graph p-laplacian based Semi-supervised Learning Method and Speech Recognition Problem Loc Tran 1 and Linh Tran

More information

Neetha Das Prof. Andy Khong

Neetha Das Prof. Andy Khong Neetha Das Prof. Andy Khong Contents Introduction and aim Current system at IMI Proposed new classification model Support Vector Machines Initial audio data collection and processing Features and their

More information

On Pre-Image Iterations for Speech Enhancement

On Pre-Image Iterations for Speech Enhancement Leitner and Pernkopf RESEARCH On Pre-Image Iterations for Speech Enhancement Christina Leitner 1* and Franz Pernkopf 2 * Correspondence: christina.leitner@joanneum.at 1 JOANNEUM RESEARCH Forschungsgesellschaft

More information

Confidence Measures: how much we can trust our speech recognizers

Confidence Measures: how much we can trust our speech recognizers Confidence Measures: how much we can trust our speech recognizers Prof. Hui Jiang Department of Computer Science York University, Toronto, Ontario, Canada Email: hj@cs.yorku.ca Outline Speech recognition

More information

Multifactor Fusion for Audio-Visual Speaker Recognition

Multifactor Fusion for Audio-Visual Speaker Recognition Proceedings of the 7th WSEAS International Conference on Signal, Speech and Image Processing, Beijing, China, September 15-17, 2007 70 Multifactor Fusion for Audio-Visual Speaker Recognition GIRIJA CHETTY

More information

Recurrent Neural Networks. Nand Kishore, Audrey Huang, Rohan Batra

Recurrent Neural Networks. Nand Kishore, Audrey Huang, Rohan Batra Recurrent Neural Networks Nand Kishore, Audrey Huang, Rohan Batra Roadmap Issues Motivation 1 Application 1: Sequence Level Training 2 Basic Structure 3 4 Variations 5 Application 3: Image Classification

More information

Pair-wise Distance Metric Learning of Neural Network Model for Spoken Language Identification

Pair-wise Distance Metric Learning of Neural Network Model for Spoken Language Identification INTERSPEECH 2016 September 8 12, 2016, San Francisco, USA Pair-wise Distance Metric Learning of Neural Network Model for Spoken Language Identification 2 1 Xugang Lu 1, Peng Shen 1, Yu Tsao 2, Hisashi

More information

Device Activation based on Voice Recognition using Mel Frequency Cepstral Coefficients (MFCC s) Algorithm

Device Activation based on Voice Recognition using Mel Frequency Cepstral Coefficients (MFCC s) Algorithm Device Activation based on Voice Recognition using Mel Frequency Cepstral Coefficients (MFCC s) Algorithm Hassan Mohammed Obaid Al Marzuqi 1, Shaik Mazhar Hussain 2, Dr Anilloy Frank 3 1,2,3Middle East

More information

Decentralized and Distributed Machine Learning Model Training with Actors

Decentralized and Distributed Machine Learning Model Training with Actors Decentralized and Distributed Machine Learning Model Training with Actors Travis Addair Stanford University taddair@stanford.edu Abstract Training a machine learning model with terabytes to petabytes of

More information

Accelerating the Hessian-free Gauss-Newton Full-waveform Inversion via Preconditioned Conjugate Gradient Method

Accelerating the Hessian-free Gauss-Newton Full-waveform Inversion via Preconditioned Conjugate Gradient Method Accelerating the Hessian-free Gauss-Newton Full-waveform Inversion via Preconditioned Conjugate Gradient Method Wenyong Pan 1, Kris Innanen 1 and Wenyuan Liao 2 1. CREWES Project, Department of Geoscience,

More information

Chapter 3. Speech segmentation. 3.1 Preprocessing

Chapter 3. Speech segmentation. 3.1 Preprocessing , as done in this dissertation, refers to the process of determining the boundaries between phonemes in the speech signal. No higher-level lexical information is used to accomplish this. This chapter presents

More information

Deep Learning Cook Book

Deep Learning Cook Book Deep Learning Cook Book Robert Haschke (CITEC) Overview Input Representation Output Layer + Cost Function Hidden Layer Units Initialization Regularization Input representation Choose an input representation

More information

A Comparison of Sequence-Trained Deep Neural Networks and Recurrent Neural Networks Optical Modeling For Handwriting Recognition

A Comparison of Sequence-Trained Deep Neural Networks and Recurrent Neural Networks Optical Modeling For Handwriting Recognition A Comparison of Sequence-Trained Deep Neural Networks and Recurrent Neural Networks Optical Modeling For Handwriting Recognition Théodore Bluche, Hermann Ney, Christopher Kermorvant SLSP 14, Grenoble October

More information

Hello Edge: Keyword Spotting on Microcontrollers

Hello Edge: Keyword Spotting on Microcontrollers Hello Edge: Keyword Spotting on Microcontrollers Yundong Zhang, Naveen Suda, Liangzhen Lai and Vikas Chandra ARM Research, Stanford University arxiv.org, 2017 Presented by Mohammad Mofrad University of

More information

Tutorial: Using Tina Vision s Quantitative Pattern Recognition Tool.

Tutorial: Using Tina Vision s Quantitative Pattern Recognition Tool. Tina Memo No. 2014-004 Internal Report Tutorial: Using Tina Vision s Quantitative Pattern Recognition Tool. P.D.Tar. Last updated 07 / 06 / 2014 ISBE, Medical School, University of Manchester, Stopford

More information

Assignment 2. Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions

Assignment 2. Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions ENEE 739Q: STATISTICAL AND NEURAL PATTERN RECOGNITION Spring 2002 Assignment 2 Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions Aravind Sundaresan

More information

Implementation of Speech Based Stress Level Monitoring System

Implementation of Speech Based Stress Level Monitoring System 4 th International Conference on Computing, Communication and Sensor Network, CCSN2015 Implementation of Speech Based Stress Level Monitoring System V.Naveen Kumar 1,Dr.Y.Padma sai 2, K.Sonali Swaroop

More information

Multiple-View Object Recognition in Band-Limited Distributed Camera Networks

Multiple-View Object Recognition in Band-Limited Distributed Camera Networks in Band-Limited Distributed Camera Networks Allen Y. Yang, Subhransu Maji, Mario Christoudas, Kirak Hong, Posu Yan Trevor Darrell, Jitendra Malik, and Shankar Sastry Fusion, 2009 Classical Object Recognition

More information

M. Sc. (Artificial Intelligence and Machine Learning)

M. Sc. (Artificial Intelligence and Machine Learning) Course Name: Advanced Python Course Code: MSCAI 122 This course will introduce students to advanced python implementations and the latest Machine Learning and Deep learning libraries, Scikit-Learn and

More information

Probabilistic Robotics

Probabilistic Robotics Probabilistic Robotics Sebastian Thrun Wolfram Burgard Dieter Fox The MIT Press Cambridge, Massachusetts London, England Preface xvii Acknowledgments xix I Basics 1 1 Introduction 3 1.1 Uncertainty in

More information

l1 ls: A Matlab Solver for Large-Scale l 1 -Regularized Least Squares Problems

l1 ls: A Matlab Solver for Large-Scale l 1 -Regularized Least Squares Problems l ls: A Matlab Solver for Large-Scale l -Regularized Least Squares Problems Kwangmoo Koh deneb@stanford.edu Seungjean Kim sjkim@stanford.edu May 5, 2008 Stephen Boyd boyd@stanford.edu l ls solves l -regularized

More information

A long, deep and wide artificial neural net for robust speech recognition in unknown noise

A long, deep and wide artificial neural net for robust speech recognition in unknown noise A long, deep and wide artificial neural net for robust speech recognition in unknown noise Feipeng Li, Phani S. Nidadavolu, and Hynek Hermansky Center for Language and Speech Processing Johns Hopkins University,

More information

Partitioning Data. IRDS: Evaluation, Debugging, and Diagnostics. Cross-Validation. Cross-Validation for parameter tuning

Partitioning Data. IRDS: Evaluation, Debugging, and Diagnostics. Cross-Validation. Cross-Validation for parameter tuning Partitioning Data IRDS: Evaluation, Debugging, and Diagnostics Charles Sutton University of Edinburgh Training Validation Test Training : Running learning algorithms Validation : Tuning parameters of learning

More information

Deep Learning. Volker Tresp Summer 2015

Deep Learning. Volker Tresp Summer 2015 Deep Learning Volker Tresp Summer 2015 1 Neural Network Winter and Revival While Machine Learning was flourishing, there was a Neural Network winter (late 1990 s until late 2000 s) Around 2010 there

More information

Object Detection with Partial Occlusion Based on a Deformable Parts-Based Model

Object Detection with Partial Occlusion Based on a Deformable Parts-Based Model Object Detection with Partial Occlusion Based on a Deformable Parts-Based Model Johnson Hsieh (johnsonhsieh@gmail.com), Alexander Chia (alexchia@stanford.edu) Abstract -- Object occlusion presents a major

More information