An Optimization of Deep Neural Networks in ASR using Singular Value Decomposition


An Optimization of Deep Neural Networks in ASR using Singular Value Decomposition

Bachelor Thesis of Igor Tseyzer

At the Department of Informatics, Institute for Anthropomatics (IFA)

Reviewer: Prof. Dr. Alexander Waibel
Second reviewer: Dr. Sebastian Stüker
Advisor: Dipl.-Inform. Kevin Kilgour

Duration: January 26, 2014 to May 26, 2014


I declare that I have developed and written the enclosed thesis completely by myself, and have not used sources or means without declaration in the text.

Karlsruhe, May 26, 2014

Igor Tseyzer


Contents

1 Introduction
2 Background and Related Work
   2.1 Basics of Automatic Speech Recognition
      2.1.1 ASR as Pattern Recognition Process
      2.1.2 Extraction of Speech Features
      2.1.3 Acoustic Models
      2.1.4 Language Models
   2.2 Artificial Neural Networks in ASR
   2.3 Related Work
      2.3.1 Deep Belief Networks
      2.3.2 Stacked Autoencoders
      2.3.3 Deep Stacking Network
      2.3.4 Rectified Linear Units and Dropout
      2.3.5 Context-Dependent Deep-Neural-Network
      2.3.6 Deep Convolutional Neural Networks
      2.3.7 Deep Recurrent Neural Networks
      2.3.8 Restructuring of DNN with Singular Value Decompositions
      2.3.9 Bottle-neck Features
3 Analysis of Existing Methods in the Optimization of DNN
   3.1 Optimizations in Acoustic Input
   3.2 Optimizations in Output Layer
   3.3 Prevention of Overfitting
   3.4 Pre-training and Weights Initialization
   3.5 Optimization of Size and Topology of DNN
   3.6 Types of ANN
   3.7 Restructuring of DNN
   3.8 Summary
4 Model Design and Implementation
   4.1 Baseline System
      4.1.1 Baseline System
      4.1.2 Deep Neural Network Topology and Training
   4.2 Singular Value Decomposition
   4.3 Applying SVD on Weight Matrix of DNN
   4.4 Non-rank Decomposition of Hidden Layers
   4.5 Low-rank Decomposition of Hidden Layers
   4.6 Low-rank Decomposition of Output Layer
   4.7 Insertion of Bottle-neck Layers to DNN with SVD
   4.8 Extracting Bottle-neck Features with SVD
   4.9 Step-by-Step Fine-tuning of Low-rank Decomposed DNN
   4.10 Summary
5 Results and Evaluation
   5.1 Topology and Performance Analysis
   5.2 Analysis of Singular Values in DNN
   Non-rank Decomposition of Hidden Layers
   Low-rank Decomposition of Hidden Layers
      Small 4 Layer DNN
      Small 5 Layer DNN
      Large 4 Layer DNN
      Validation of Results
      Summary
   Low-rank Decomposition of Output Layer
   Insertion of Bottle-neck Layer to DNN with SVD
      Bottle-neck Layers with 50%-rank Decomposition
      Bottle-neck Layers with 10%-rank Decomposition
      Validation of Results
   Extracting Bottle-neck Features with SVD
   Step-by-step Fine-tuning of Low-rank Decomposed DNN
   Summary
6 Conclusion and Outlook
Bibliography

1. Introduction

Automatic Speech Recognition (ASR) is a powerful tool which opens a new level of man-machine interaction. ASR can be applied in various areas of everyday life, such as dictation, hands-free control of devices and machine translation. In recent decades the accuracy and performance of ASR systems have significantly improved. One of the approaches which ensured this improvement is Artificial Neural Networks (ANN), and particularly Deep Neural Networks (DNN). DNNs allow the extraction of significant information from the acoustic input and in this way increase recognition accuracy. Until recently, training a DNN was not feasible. A significant improvement in the field of deep learning was achieved thanks to the growth of computing power and the introduction of deep learning methods. The recent decade has therefore seen significant advances in the research of DNNs for ASR, and recent studies in this field have shown a large improvement in speech recognition accuracy.

An example of using DNNs in practice is lecture translation systems. These systems make it possible for students to listen to lectures in a language which is not their native one. Additionally, this allows a recording and transcription of lectures to be kept in both languages: the language of the lecturer and the student's native language. Such systems must fulfill strict requirements, such as fast performance and high recognition accuracy.

In order to make an ASR system fit these strict requirements, the question of optimizing the DNN is uppermost. The main problem is finding a balance between two properties of DNNs which sit at opposite poles: performance and accuracy. Improving one of them, in general, makes the other worse. To solve this problem, various optimization techniques can be used. Today's studies are directed towards optimizing the topology, weight initialization and performance of DNNs. Solving these problems, and as a result optimizing the DNN, would allow the creation of better ASR systems. This can help to make communication between people in different countries easier and break the language barrier in education, tourism and business.

This work aims first of all to find the optimal topology for a DNN and to optimize this topology by using singular value decomposition. The major attempts will be

researched and evaluated. The main goal of this work is to develop a simple and effective method of optimizing a DNN by using singular value decomposition.

In chapter 2, the background and related work are presented. Chapter 3 analyzes modern approaches to optimizing DNNs. The design of the proposed models and their implementation is described in chapter 4. Finally, chapter 5 evaluates the proposed models.

2. Background and Related Work

In this chapter the background of ASR and related work in the field of optimizing DNNs is presented. This should make the process of speech recognition understandable and describe recent achievements in the optimization of DNNs.

2.1 Basics of Automatic Speech Recognition

In essence, speech recognition can be seen as a transformation of human speech into a form which can be understood by machines. From a mathematical point of view, the ASR process is represented as a classification task. Most systems employ probabilistic classifiers to find the most likely word sequence for a given acoustic signal. In order to perform this classification, two probabilities are estimated: the probabilities of possible word sequences and the probability distributions of the acoustic signals for each word sequence. Both probabilities are represented as two parametric models.

2.1.1 ASR as Pattern Recognition Process

Mathematically, the parts of the speech recognition process can be presented as Pattern, Classes and Classification Output [VSR12], where the Pattern is an audio recording X, the Classes are word sequences W from the space of all possible word sequences, and the Classification Output is a word sequence W for the recording X. For the given speech recording X, the word sequence Ŵ = ŵ_1, ŵ_2, ..., ŵ_n can be estimated as follows [ST95]:

Ŵ = argmax_W P(W | X)   (2.1)

Equation 2.1 can be re-factored using Bayes' rule, since P(X) does not depend on W:

Ŵ = argmax_W P(X | W) P(W)   (2.2)

The underlying Bayes expansion of equation 2.1 can also be represented as:

P(W | X) = P(X | W) P(W) / P(X)   (2.3)
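To make the decision rule in equations 2.1-2.3 concrete, the following minimal sketch scores a handful of hypotheses by combining acoustic and language model log-probabilities and picks the argmax. The hypotheses and scores are invented purely for illustration and are not taken from the thesis.

```python
import numpy as np

# Three hypothetical word sequences with made-up log-scores
hypotheses = ["guten tag", "guten takt", "gut und klag"]
log_p_x_given_w = np.array([-120.3, -119.8, -131.0])  # acoustic model log P(X|W), invented
log_p_w = np.array([-4.1, -9.7, -12.5])               # language model log P(W), invented

# argmax_W P(X|W) P(W), computed in the log domain for numerical stability
best = int(np.argmax(log_p_x_given_w + log_p_w))
print(hypotheses[best])  # -> "guten tag"
```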

Equation 2.3 is also known as the fundamental equation of speech recognition. Its parts are represented by the acoustic model, P(X | W), and the language model, P(W) [ST95]. Figure 2.1 presents the basic scheme of a statistical ASR system.

Figure 2.1: Scheme of a statistical ASR system. Based on [ST95]

2.1.2 Extraction of Speech Features

The first step in the recognition process is the transformation of an acoustic input into a sequence of feature vectors. This transformation allows the representation of the speech waveform with a relatively small number of dimensions. This process of transformation is called speech feature extraction. Speech processing is divided into two major families of methods: non-parametric methods based on periodograms, and parametric methods, which use a small number of parameters from the data [VSR12]. An overview of the methods is available in table 2.1.

| Method | Spectrum | Frequency axis | Frequency resolution | Pitch sensitivity |
| PS | Nonparametric | Linear | Static | Very high |
| Warped PS | Nonparametric | Nonlinear | Static | Very high |
| Mel-filter bank | Nonparametric | Nonlinear | Static | High |
| LP | Parametric | Linear | Static | Medium |
| Perceptual LP | Parametric | Nonlinear | Static | Medium |
| Warped LP | Parametric | Nonlinear | Static | Medium |
| Warped-twice LP | Parametric | Nonlinear | Adaptive | Medium |
| MVDR | Parametric | Linear | Static | Low |
| Warped MVDR | Parametric | Nonlinear | Static | Low |
| Warped-twice MVDR | Parametric | Nonlinear | Adaptive | Low |

PS = power spectrum, LP = linear prediction, MVDR = minimum variance distortionless response.

Table 2.1: Overview of spectral estimation methods [VSR12].

The process of extracting features starts with taking a frame at the start of the acoustic input, applying a windowing function and shifting the frame further over the acoustic input. The size of the frame should be chosen sensibly, so that it is short enough to provide the required time resolution and long enough to ensure

adequate frequency resolution. Another important parameter is the frame shift over the acoustic input. Together, frame size and frame shift can have a big influence on speech recognition accuracy, and their choice depends on the characteristics of the speech. In general, a sensible choice is a frame size of a few tens of milliseconds and a frame shift of 5-15 ms [VSR12].

The next step is the transformation of the chosen frame with the Short-Time Fourier Transform (STFT):

X(τ, ω) = ∫ x(t) w(t − τ) e^{−jωt} dt,   (2.4)

where w(t − τ) is the window function, x(t) is the acoustic input at time t, τ is the time index and ω is the frequency. Various functions can be used as the window function, for example Rectangular, Parzen, Hanning, Blackman, Kaiser and Hamming windows. In speech recognition the Hamming window is commonly used. The Hamming window is defined as:

w(n) = 0.54 − 0.46 cos(2πn / N_w)  for 0 ≤ n ≤ N_w,  and  w(n) = 0  otherwise.   (2.5)

Further processing includes calculating cepstral coefficients. The cepstrum is the result of the Fourier transform of the (warped) logarithmic spectrum. The real cepstrum uses a logarithmic function and is defined as follows [VSR12]:

c_x(k) = (1 / 2π) ∫_{−π}^{π} log |X(e^{jω})| e^{jωk} dω,   (2.6)

where X(e^{jω}) is a spectral estimate, for example on the mel scale.

The speech features can be generated as

x̂_min(k) = 0 for k < 0,  c_x(0) for k = 0,  2 c_x(k) for k > 0.   (2.7)

Another way to generate features is the Type 2 Discrete Cosine Transform (DCT) [VSR12]:

x̂_min(k) = Σ_{m=0}^{M−1} log |X(e^{jω_m})| T^{(2)}_{k,m},   (2.8)

where log |X(e^{jω_m})| is a log-power density vector and the T^{(2)}_{k,m} are components of the Type 2 DCT. The general process of generating cepstral coefficients is represented in figure 2.2.

2.1.3 Acoustic Models

As mentioned above, the Acoustic Model (AM) can be represented as the part P(X | W) in equation 2.3. After successful extraction of speech features, a classification method is required to classify the extracted features into word sequences. A commonly used method in recent times is the Hidden Markov Model (HMM).
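The framing, windowing and real-cepstrum steps described above can be sketched in a few lines of numpy. The frame size, frame shift and number of coefficients below are illustrative choices, not values taken from the thesis, and a real front-end would add mel filtering and further normalization.

```python
import numpy as np

def cepstral_features(signal, sample_rate=16000, frame_ms=25, shift_ms=10, n_ceps=13):
    """Frame the signal, apply a Hamming window and return real cepstral coefficients."""
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    window = np.hamming(frame_len)              # 0.54 - 0.46*cos(2*pi*n/(M-1))

    frames = []
    for start in range(0, len(signal) - frame_len + 1, frame_shift):
        frame = signal[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame)) + 1e-10   # magnitude spectrum, avoid log(0)
        log_spec = np.log(spectrum)
        # real cepstrum: inverse Fourier transform of the log magnitude spectrum
        cepstrum = np.fft.irfft(log_spec)
        frames.append(cepstrum[:n_ceps])
    return np.stack(frames)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_audio = rng.standard_normal(16000)     # one second of noise as a stand-in signal
    print(cepstral_features(fake_audio).shape)  # (number of frames, n_ceps)
```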

Figure 2.2: Block diagrams illustrating the processing steps required for the calculation of the compared mel-frequency cepstral coefficient and warped MVDR cepstral coefficient front-ends [VSR12]

The HMM can be described as a probabilistic function of a Markov chain [ST95]. The HMM is defined on two levels, a finite set Q = {s_1, ..., s_N} of states and an output alphabet K = {v_1, ..., v_K}. The first level is a Markov chain with a specified initial state. The visited states of the process are represented as a sequence of random variables q = q_1, ..., q_T, which take values from Q. The probability of moving from one state to another is described as follows [ST95]:

P(q_t | q_1 ... q_{t−1}) = P(q_t | q_{t−1}),   (2.9)

and can be represented as an N × N transition matrix A:

A = [a_ij]_{N×N}  with  a_ij = P(q_t = s_j | q_{t−1} = s_i).   (2.10)

The probability of the process starting in a certain state is described by the initial distribution π, which together with the transition matrix A satisfies:

π_i = P(q_1 = s_i),  Σ_{i=1}^{N} π_i = 1.   (2.11)

In general, the Markov chain represents the manner in which the process moves through the states. The second level is a set of state output probability distributions, one for each state. When the process enters state i at time t, observation symbols O = O_1 ... O_T are generated based on the state output distribution [ST95]:

P(O_t | O_1 ... O_{t−1}, q_1 ... q_t) = P(O_t | q_t).   (2.12)

The output symbol probability distribution for state j on the second level can be found using this equation and represented as a matrix B:

b_jk = P(O_t = v_k | q_t = s_j)  and  B = [b_jk]_{N×K}.   (2.13)
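To make the roles of π, A and B concrete, the following small numpy sketch computes P(O | λ) for a discrete HMM with the forward algorithm. The two-state model and its probabilities are toy values chosen only for illustration.

```python
import numpy as np

def forward_probability(pi, A, B, observations):
    """P(O | lambda) for a discrete HMM lambda = (pi, A, B) via the forward algorithm."""
    # alpha[i] = P(O_1..O_t, q_t = s_i)
    alpha = pi * B[:, observations[0]]
    for o_t in observations[1:]:
        alpha = (alpha @ A) * B[:, o_t]
    return alpha.sum()

# toy example: 2 states, 3 output symbols (all values are illustrative only)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
print(forward_probability(pi, A, B, observations=[0, 1, 2]))
```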

The Hidden Markov Model from this point of view is described as:

λ = (π, A, B).   (2.14)

After describing the HMM, the best parameters λ should be found which represent the set of given symbols O = O_1 ... O_T. This training procedure can be done using maximum-likelihood estimation (MLE), the Baum-Welch algorithm and the Expectation-Maximization (EM) algorithm [ST95]. The output of the HMM is modeled as a probability function. In speech recognition, a state output distribution P(O_t | q_t) is usually modeled with Gaussian mixture densities [VSR12]:

P(O_t | q_t = s_i) = Σ_{k=1}^{K} ω_{i,k} N(x; μ_{i,k}, Θ_{i,k}),   (2.15)

where N(x; μ, Θ) is a multivariate Gaussian density. The ω_{i,k}, μ_{i,k}, Θ_{i,k} are the mixture weight, mean vector and covariance matrix of the k-th Gaussian in the mixture state output distribution for state i, and K is the number of Gaussians in the mixture.

As can be seen, the AM allows the representation of the extracted speech features as output symbols. This output should be formed into a word sequence using a Language Model, which is described in the next section.

2.1.4 Language Models

The next step in the speech recognition process is to build hypotheses of word sequences according to the output symbol probability distributions given by the AM. Generating the hypotheses can be done using statistical language models. This modeling process is represented as the part P(W) in equation 2.3 [ST95]. The first step in building a statistical language model is to define equivalence classes for words Q = {s_1, ..., s_N}. This allows the representation of a sequence of words w_1, w_2, ..., w_m as a state of a deterministic finite state machine [ST95]:

q_0 = s[] ∈ Q
q_1 = s[w_1] ∈ Q
q_2 = s[w_1 w_2] ∈ Q
...
q_m = s[w_1 ... w_m] ∈ Q

By using this model, the probability of a sentence can be represented as follows:

P(w) = Π_{i=1}^{m} ( Σ_{n=1}^{N} P(w_i | q_{i−1} = s_n) P(s[w_1 ... w_{i−1}] = s_n) )   (2.16)

As can clearly be seen in the equation above, the probability of one particular word depends on the sequence of words which appeared before it. The models which compute this dependency are called n-grams [ST95]. The most useful for speech

recognition are the unigram, bigram and trigram. The computation of the sentence probability with unigram, bigram and trigram models is described by the following equations:

P(w) = Π_{i=1}^{m} P(w_i),   (2.17)

P(w) = P(w_1) Π_{i=2}^{m} P(w_i | w_{i−1}),   (2.18)

P(w) = P(w_1) P(w_2 | w_1) Π_{i=3}^{m} P(w_i | w_{i−2} w_{i−1}).   (2.19)

However, the number of parameters in an n-gram model can be very large. For example, a trigram model over 1000 words contains about 10^9 parameters [ST95]. In order to avoid this, the words are classified into disjoint categories:

C = {C_1, ..., C_N}  with  ∪_{k=1}^{N} C_k = W.   (2.20)

A word can be assigned to a particular category C(w) ∈ C with w ∈ C(w). So, for example, the probability for the bigram can be represented as follows [ST95]:

P(w) = P(w_1) Π_{i=2}^{m} P(w_i | C(w_i)) P(C(w_i) | C(w_{i−1})).   (2.21)

Other models apart from n-grams can be used, for example:

- multilevel n-gram language models
- morpheme-based language models
- context-free grammar language models

Using language models together with acoustic models and a dictionary enables the production of the most likely hypotheses for a given acoustic input. The existing methods of building LMs and AMs are continuously improved and extended, which will ensure the further improvement of speech recognition accuracy and performance.

2.2 Artificial Neural Networks in ASR

Artificial Neural Networks (ANNs) started to be commonly used in ASR at the end of the 1980s. The main idea was to replace the Gaussian Mixture Model (GMM) with an ANN. ANNs are very good at classification tasks, so they could be successfully applied to transform the acoustic input into the states of an HMM. There are various approaches that use hybrid ANN/HMM systems for ASR, based on different ANN architectural and algorithmic solutions. According to Trentin et al. [TG01], ANN/HMM hybrid systems can be divided into five major categories:

- Early attempts: Early approaches (between the late 1980s and the beginning of the 1990s) relied on ANN architectures that attempted to emulate HMMs.

- ANNs to estimate the HMM state-posterior probabilities: Some ANN/HMM hybrids assume that the output of an ANN is sent to an HMM.

- Global optimization: Introduction of a training scheme aimed at the optimization of a global criterion function, defined at the whole-system (i.e., ANN and HMM simultaneously) level.

- Networks as vector quantizers for discrete HMM: Unsupervised ANNs are used to perform a quantization in the acoustic feature space for discrete HMMs.

- Other approaches: Hybrid systems based on particular combination techniques between ANNs and HMMs, not belonging to any of the previous categories, and often focused on specific tasks.

According to [HDY+12], the main reason for using ANNs was the statistical inefficiency of GMMs for modeling data that lie on a non-linear manifold in the data space, and the ability of ANNs to learn models of data on such a manifold. However, in the early stages of ANNs, learning algorithms and hardware were not able to provide efficiency as good as GMMs. Since then, the development of hardware and new deep learning algorithms has made it possible to train big ANNs with many hidden layers and large output layers to estimate the emission probabilities for the HMM [HDY+12], which was not possible before. Hinton presented the first deep learning method that made this possible in 2007 [Hin07]. The main idea of his work is to pre-train each layer of a multi-layer neural network as an unsupervised Restricted Boltzmann Machine (RBM) and to fine-tune it using supervised backpropagation. The ability to learn deep models led to further development of architectures and training methods for DNNs. Since 2007, DNNs have played an increasingly important role in ASR.

2.3 Related Work

Optimizing the DNN is one of the key tasks in ASR. As shown by Dahl et al. [DYDA12], using DNNs instead of GMMs significantly improves the performance of speech recognition systems. But DNNs still have problems; various pre-training and training methods and DNN topologies have been researched in order to solve them. This section presents recent achievements in this field.

2.3.1 Deep Belief Networks

One way to optimize DNNs is the sensible initialization of weights before training. In 2012, Mohamed et al. presented their work Acoustic Modelling Using Deep Belief Networks [MDH12]. According to their work, training a multilayer network can be viewed from two points of view: directed and undirected. In the undirected view, the model contains separate layers which are not connected to

each other, but the activity of one layer can be given as input to the next layer. A simple example of the undirected view is the Restricted Boltzmann Machine (RBM). An RBM contains visible and hidden units. Visible units are connected to the input, and hidden units help the visible units to represent learned features. Restricted means that there are no connections between visible-visible or hidden-hidden units. In the directed point of view, hidden layers represent binary features and the visible layer represents a data vector. The Deep Belief Network (DBN) has top-down connections, so the probability of generating the data on the visible units is higher than in other models. The difference between RBM and DBN can be seen in figure 2.3.

Figure 2.3: The difference between RBM and DBN

Training the DBN proceeds in two stages. First comes layer-wise unsupervised pre-training, which contains a step of learning with tied weights. Second comes learning the different weights in each layer, and then a final layer with a softmax activation function is added to the top of the network. The pre-training is followed by discriminative fine-tuning using backpropagation to maximize the log probability of the correct HMM state.

The experiments of Mohamed et al. were performed on the TIMIT corpus [MDH12]. Their experiments showed that the Phone Error Rate (PER) of the best DBN is 20.7%, which is 24.17% (relative) less than the PER of a context-dependent HMM system (27.3%). Mohamed et al. concluded the following about the factors which influence DNN performance [MDH12]:

1. Better recognition performance can be achieved by using 40 filter-bank coefficients.
2. Adding hidden layers improves performance if the number of layers is less than 4. Further additions do not give much of an improvement.
3. More small layers are better than one big layer.
4. Pre-training is necessary when the network has more than one hidden layer.
5. Pre-training prevents overfitting of the network.
6. The best number of frames in the input window is 11, 17, or 27 frames.

7. Adding a last bottle-neck hidden layer helps to prevent overfitting of the network by reducing the number of weights that are not pre-trained.

2.3.2 Stacked Autoencoders

Initialization of the weights can be done in various ways. As described above, Hinton et al. [HOT06] achieved great success in deep learning with DBNs. At the same time, Bengio et al. introduced another approach [BLPL06]: they used the same layer-by-layer initialization, but with autoencoders. In 2008, Vincent et al. introduced their work in which they extract and compose robust features with denoising autoencoders [VLBM08]. Their research was based on the hypothesis that a partially destroyed input should have almost the same representation as the undestroyed input.

The basic autoencoder can be described as follows [VLBM08]. First, the autoencoder receives an input vector x ∈ [0, 1]^d and, using the function

y = f_θ(x) = s(Wx + b),   (2.22)

maps it to a hidden representation y ∈ [0, 1]^{d'}. The function f_θ is parametrised by θ = {W, b}, where W is a d' × d weight matrix and b is a bias vector. The hidden representation y is then mapped back to a vector z ∈ [0, 1]^d with the function

z = g_{θ'}(y) = s(W'y + b'),   (2.23)

where θ' = {W', b'}. The matrix W' is constrained by W' = W^T; in this case the autoencoder has tied weights. The parameters are optimized to minimize the average reconstruction error:

θ*, θ'* = argmin_{θ, θ'} (1/n) Σ_{i=1}^{n} L(x^(i), z^(i)),   (2.24)

where L is a loss function. The reconstruction cross-entropy can be presented as follows:

L_H(x, z) = H(B_x ∥ B_z).   (2.25)

Second, to construct the denoising autoencoder, a modification of the basic autoencoder is needed. It can be represented as the application of a stochastic process q_D(x̃ | x) to the input vector x; that is, the input is partly destroyed. In order to do this, a fixed number νd of components is chosen and their values are set to 0. In this way the initial input x is mapped to a partly destroyed version x̃. The partly destroyed input x̃ is then used to obtain a hidden representation:

y = f_θ(x̃) = s(Wx̃ + b).   (2.26)

Then follows the reconstruction:

z = g_{θ'}(y) = s(W'y + b').   (2.27)

Now z is a deterministic function of x̃, which represents a stochastic mapping of x. This procedure can be seen in figure 2.4.
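A compact numpy sketch of one pass of the denoising autoencoder in equations 2.22-2.27, with tied weights and masking corruption, may help make the notation concrete. The layer sizes, corruption rate and random weights are assumptions made for the example, not settings from [VLBM08].

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

d, d_hidden = 54, 100                     # input and hidden sizes (arbitrary for the sketch)
W = rng.normal(0, 0.1, size=(d_hidden, d))
b, b_prime = np.zeros(d_hidden), np.zeros(d)

def denoising_autoencoder_pass(x, corruption=0.2):
    # q_D(x_tilde | x): set a random subset of the components to zero
    mask = rng.random(x.shape) > corruption
    x_tilde = x * mask
    y = sigmoid(W @ x_tilde + b)          # hidden representation, eq. 2.26
    z = sigmoid(W.T @ y + b_prime)        # reconstruction with tied weights, eq. 2.27
    # cross-entropy reconstruction loss against the *uncorrupted* input
    loss = -np.sum(x * np.log(z) + (1 - x) * np.log(1 - z))
    return z, loss

x = rng.random(d)                         # a fake input vector in [0, 1]^d
_, loss = denoising_autoencoder_pass(x)
print(loss)
```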

Figure 2.4: An example x is corrupted to x̃. The autoencoder then maps it to y and attempts to reconstruct x [VLBM08].

As shown by [BLPL06], greedy layer-wise training can be applied to initialize weights with the autoencoder, where layer k is used as input for layer k + 1. The same procedure can be used for the denoising autoencoder [VLBM08]. The denoising autoencoder is trained by standard backpropagation to minimize the loss function L(x, z). Vincent et al. concluded that initialization of layers using denoising autoencoders helps to capture interesting structures in the input distribution [VLBM08].

A practical application of stacked denoising autoencoders was carried out by Gehring et al. [GMMW13]. In their work they used stacked autoencoders to extract bottle-neck features for ASR tasks, using the method proposed by Vincent et al. [VLBM08]. The experiments were performed on the IARPA Babel Program Cantonese language collection babel101-v0.1d, the Cantonese language collection babel101-v0.4c with 80 hours of actual speech, and the Tagalog language collection [GMMW13]. One of the experiments, the evaluation on babel101-v0.4c, showed a significant improvement of the autoencoder-based network compared to the network that used MFCC. The results can be seen in table 2.2.

Table 2.2: Character error rates of the Cantonese babel101-v0.4c system with MFCC and DBNF (deep bottle-neck feature) input features, by number of autoencoder layers [GMMW13]

Denoising autoencoders can thus be used for the initialization of DNNs and show better results than MFCC-based systems.

2.3.3 Deep Stacking Network

Another approach to initializing weights is the Deep Stacking Network (DSN), which was presented by Deng et al. in 2012 [DYP12]. The main goal of this network is to provide a method for building deep classification architectures which can also be parallelized. The research was based on the idea of composing simple functions into a chain and then stacking them together. All functions estimate the same target. This method helps to avoid the overfitting caused by learning complex functions. The structure of a DSN can be seen in figure 2.5. The first layer is a linear layer; it corresponds to the raw input data. Second are the non-linear sets of sigmoidal hidden

units. Third, the output layer is a set of C linear output units.

Figure 2.5: Illustration of the basic architecture of the DSN [DYP12]

The experiments were done on MNIST image classification and on the TIMIT corpus [DYP12]. The results of the speech classification experiments show that the DSN achieved a frame-level error rate of 43.86%, which is 1.18% better than a DNN tested on the same dataset (45.04%).

2.3.4 Rectified Linear Units and Dropout

An important part of optimizing a DNN is the activation function of its neurons. Choosing a sensible activation function for a DNN can not only improve performance, but also prevent the system from overfitting quickly during deep learning. In 2013, Dahl et al. presented an improvement of DNNs that used Rectified Linear Units (ReLU) and dropout [DSH13]. Rectified Linear Units use the simple activation function max(0, x), meaning they can easily be used in a DNN or RBM. The problem with ReLUs is that they can overfit more easily than units with the sigmoid activation function. In order to prevent this problem, Dahl et al. [DSH13] combined ReLUs with dropout, which avoids overfitting by randomly setting hidden units to zero (dropping them out) during training. The use of dropout can be presented as follows:

y_t = f( (1 / (1 − r)) (y_{t−1} ⊙ m) W + b ),   (2.28)

where f is the activation function of the layer, y_{t−1} is the output of the previous layer, m is a binary mask, W and b are the weights and bias of the current layer, and ⊙ is element-wise multiplication. The input from the previous layer is scaled by the factor 1 / (1 − r), where r is the dropout probability for units in the previous layer. The ability to avoid overfitting allows the training of big DNNs, but using dropout slows learning time by a factor of about two.

Dahl et al. performed their experiments on 50 hours of English Broadcast News. The DARPA EARS rt03 set was used for development/validation and final testing was performed on the DARPA EARS dev04f evaluation set [DSH13]. The comparison of results with full sequence training can be seen in table 2.3. They show that combining ReLUs and dropout works well. Dropout can also be used in combination with other activation functions and might improve their performance.
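The dropout computation in equation 2.28 can be sketched in a few lines of numpy: units of the previous layer are zeroed with probability r and the surviving activations are rescaled by 1/(1 − r), as in the equation. The layer sizes and weights below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda a: np.maximum(0.0, a)

def dropout_layer(y_prev, W, b, r=0.5, training=True):
    """Forward step of one ReLU layer with dropout applied to its input."""
    if training:
        m = (rng.random(y_prev.shape) >= r).astype(y_prev.dtype)   # binary dropout mask
        y_prev = (y_prev * m) / (1.0 - r)                          # rescale the survivors
    return relu(y_prev @ W + b)

y_prev = rng.random(1024)                  # activations of the previous layer
W = rng.normal(0, 0.01, size=(1024, 2048)) # illustrative layer sizes
b = np.zeros(2048)
print(dropout_layer(y_prev, W, b).shape)
```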

Table 2.3: Results with full sequence training on rt03 and dev04f, comparing the GMM baseline, ReLUs (3k units) with dropout and sMBR, and sigmoids (2k units) with sMBR [DSH13]

2.3.5 Context-Dependent Deep-Neural-Network

A second way to optimize a DNN is to use a larger number of output units. The first ANNs used in ASR mostly used context-independent (CI) HMMs. Their results were comparable to or better than GMMs. The logical development of the DNN was to use context-dependent (CD) HMM states as the output of the DNN. In 2011, Seide et al. proposed Context-Dependent Deep Neural Network HMMs, or CD-DNN-HMMs [SLY11]. The experiments were performed on the task of speech-to-text transcription using the 309-hour Switchboard-I training set. A GMM-HMM system with 40 Gaussian mixtures trained with maximum likelihood (ML) was used as a baseline. The training of the DNN was performed in two steps: first, the DNN was pre-trained as a DBN; second, it was fine-tuned with back-propagation. The experiments showed a significant improvement in word error rate (WER) of about 33% relative for the CD-DNN-HMM system (18.5%) compared to a GMM 40-mix BMMI system (27.4%), tested on the RT03S task [SLY11].

Extracting Bottle-Neck Features (BNF) can also be done using a DNN. In 2013, Gehring et al. [GMMW13] performed an extraction of deep BNF using stacked autoencoders. The description, results and evaluation of the experiments performed by Gehring et al. can be found in the section on stacked autoencoders above.

2.3.6 Deep Convolutional Neural Networks

As described above, DNNs have achieved great success in ASR. Another type of ANN can be used apart from the DNN: the Convolutional Neural Network (CNN). Unlike a DNN, whose units are fully connected, a CNN computes only a small local input for each unit. In 2013, Sainath et al. introduced an application of deep CNNs in ASR [SMKR13]. They tried to find an optimal architecture for the CNN. Figure 2.6 shows a typical CNN architecture. The computation of the hidden activations proceeds in three steps. First, the weights W are multiplied with the local inputs v1, v2, v3. Second, the weights W are shared across the entire input space, and third, max-pooling is used to remove variability in the hidden units.

Sainath et al. first trained the CNN on a 50-hour English Broadcast News task and tested it on the EARS dev04f set [SMKR13]. 40-dimensional log mel-filterbank coefficients were used as acoustic input. The performance of the CNN was compared with a fully connected DNN with a hidden layer size of 1024 units and an output layer of 512 units. Table 2.4 shows the results of comparing the DNN with the CNN. Other test results showed that the best number of hidden units for the CNN is 128 for the first convolutional layer and 256 for the second layer. Table 2.5 shows results for the architecture used. The CNN hybrid is 3-5% better relative to the DNN hybrid.

Figure 2.6: Diagram showing a typical convolutional network architecture consisting of a convolutional and a max-pooling layer. In this diagram, weights with the same line style are shared across all convolutional layer bands [SMKR13]

Table 2.4: WER as a function of the number of convolutional layers, from no convolutional layers and 6 fully connected layers (a DNN) down to 3 convolutional and 3 fully connected layers; the last configuration listed reached 22.4% WER [SMKR13]

Table 2.5: WER for NN hybrid and feature-based systems (baseline GMM/HMM, hybrid DNN, DNN-based features, hybrid CNN, CNN-based features) on dev04f and rt04 [SMKR13]

Further experiments were performed on a larger task: 400 hours of English Broadcast News and 300 hours of conversational American English telephony data from the Switchboard corpus. The DARPA EARS dev04f set was used for development and DARPA EARS rt04 for evaluation. In the next experiment the Hub5 00 set was used for development and the rt03 set for evaluation, with separate testing of the Switchboard (SWB) and Fisher (FSH) portions of the set [SMKR13]. The results can be seen in tables 2.6 and 2.7.

Table 2.6: WER on Broadcast News, 400 hrs (baseline GMM/HMM, hybrid DNN, DNN-based features, CNN-based features) [SMKR13]

Table 2.7: WER on Switchboard, 300 hrs (baseline GMM/HMM, hybrid DNN, CNN-based features) on the SWB and FSH portions [SMKR13]

As can be seen from all the experiments, both CNNs and DNNs perform better than GMM/HMM systems. For the Broadcast News test, the CNN showed a 10-12% relative improvement over the DNN-based features. For conversational American English, the CNN-based systems are between 4-7% better than the hybrid DNN.

2.3.7 Deep Recurrent Neural Networks

Recurrent neural networks (RNNs) are a powerful tool due to their ability to remember and produce sequences of actions from one input. RNNs have achieved great success in handwriting recognition. Graves et al. introduced Long Short-Term Memory (LSTM) RNNs in ASR [GMH13]. They also developed an end-to-end method which included joint training of two RNNs, one for the acoustic and one for the language model. The LSTM RNN is based on purpose-built memory cells which include an input gate, a forget gate, an output gate and a cell activation vector (see figure 2.7). The main power of an RNN is its ability to use previous context from the network, but an RNN can in addition use future context. Such an RNN is called a Bidirectional RNN (BRNN) (see figure 2.8). The BRNN computes the forward hidden sequence, the backward hidden sequence and the output sequence y. The combination of BRNN and LSTM gives the Bidirectional LSTM; a deep RNN can then be created by stacking RNN layers one on top of the other. This Bidirectional LSTM architecture was applied to the TIMIT corpus.

Figure 2.7: Long Short-Term Memory cell [GMH13]

The Bidirectional LSTM was trained on a 40-coefficient filter-bank with three training methods: Connectionist Temporal Classification (CTC), Transducer and pre-trained Transducer [GMH13]. The best system showed a Phone Error Rate (PER) of 17.7%.

Figure 2.8: Bidirectional RNN [GMH13]

2.3.8 Restructuring of DNN with Singular Value Decompositions

Training a DNN is a process that requires high computational cost. As shown above, DNNs perform better if their size is increased. In 2013, Xue et al. introduced a method of compressing a DNN without losing performance [XLG13]. Their work is based on applying singular value decomposition (SVD) to decompose the weight matrices of a DNN and then restructuring the model in order to keep it similar to the original model. To maintain the performance of the restructured model, fine-tuning is applied. Figure 2.9 shows a typical DNN in ASR with fully connected layers. In order to perform the decomposition, SVD can be applied to the weight matrices between layers.

Figure 2.9: DNN used in ASR systems [XLG13]

Applying SVD to the weight matrix A with size m × n gives the following result:

A_{m×n} = U_{m×n} Σ_{n×n} V^T_{n×n}   (2.29)

According to Xue et al. [XLG13], about 15% of the singular values contribute 50% of the total singular-value sum, and about 40% of them contribute 80% of the total. This means that the number of elements in matrix A can be reduced without losing important information. After keeping only the k biggest singular values of A, the approximated matrix looks as follows:

A_{m×n} ≈ U_{m×k} Σ_{k×k} V^T_{k×n}.   (2.30)

A_{m×n} can also be presented as

A_{m×n} ≈ U_{m×k} W_{k×n},   (2.31)

where W_{k×n} = Σ_{k×k} V^T_{k×n}.

After applying SVD, the two smaller matrices U and W can be put back into the original model to replace A. The single layer in the original model is replaced by two layers, and the number of parameters changes from m · n to (m + n) · k. The replacement of the single layer can be seen in figures 2.10 and 2.11.

Figure 2.10: One layer in the original DNN model [XLG13]

Figure 2.11: Two corresponding layers in the new DNN model [XLG13]

Xue et al. [XLG13] applied this procedure to a Microsoft internal task (task 1), which consisted of 750 hours of audio. As input, a CD-DNN-HMM system with a 13-dimension mean-normalized MFCC feature was used. The system layout is as follows: an input layer with 572 units, 4 hidden layers with 2048 units each and an output layer with 5976 units, giving roughly (572 · 2048 + 3 · 2048 · 2048 + 2048 · 5976) ≈ 26M weight parameters. Table 2.8 shows the results of applying SVD to the whole system. Xue et al. were able to achieve a reduction of the DNN model on task 1 by about 80%. By keeping a number of singular values proportional to the total amount of values, a reduction of 73% could be reached with less than 1% accuracy loss.
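The kind of analysis behind these numbers can be sketched with numpy: compute the SVD of a weight matrix, look at what fraction of the singular-value mass the k largest values carry, and compare the parameter counts of A versus the U, W factorization. The matrix below is random, so the concrete percentages will not match those reported by Xue et al.; the sketch only illustrates the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 2048, 2048, 256                 # layer sizes in the spirit of the paper's setup
A = rng.normal(size=(m, n))               # stand-in for a trained weight matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
top_k_share = s[:k].sum() / s.sum()       # fraction of the total singular-value mass kept
print(f"top {k} singular values carry {top_k_share:.1%} of the total")

params_before = m * n
params_after = (m + n) * k                # U_{m x k} plus W_{k x n}
print(f"parameters: {params_before} -> {params_after} "
      f"({params_after / params_before:.1%} of the original)")
```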

| Acoustic model | | WER | Number of parameters |
| All hidden layers (512) | Before fine-tune | 26.0% | 21M |
| | After fine-tune | 25.6% | |
| All hidden layers (256) | Before fine-tune | 27.0% | 17M |
| | After fine-tune | 25.8% | |
| All hidden layers and output layer (256) | Before fine-tune | 29.7% | 7M |
| | After fine-tune | 25.4% | |
| All hidden layers and output layer (192) | Before fine-tune | 36.7% | 5.6M |
| | After fine-tune | 25.5% | |

Table 2.8: Results of SVD restructuring on the whole model on task 1 [XLG13]

2.3.9 Bottle-neck Features

The DNN can also be used to extract speech features. As described earlier, extracting probabilistic features is a very important part of ASR, and one where various approaches can be applied. One of the most efficient is bottle-neck features (BNF). The use of BNF was first proposed by Grézl et al. [GKKC07]. They experimented with a 5-layer multi-layer perceptron (MLP) with a bottle-neck layer in the middle. After training, the output of the bottle-neck layer was used as features for a GMM-HMM recognition system. The system was trained using the NIST RT 05 meeting data. The results of the evaluation, compared with probabilistic features of the same size (25 to 45 dimensions), are presented in table 2.9. BNF show an improvement in system accuracy compared to probabilistic features.

Table 2.9: WER for probabilistic and bottle-neck features of sizes 25-45 in the full feature extraction framework [GKKC07]


3. Analysis of Existing Methods in the Optimization of DNN

This chapter presents methods of optimizing DNNs from various points of view. It compares and analyzes recent technologies and achievements in the use of DNNs in ASR. Optimizing a DNN requires finding a balance between the ASR system's accuracy and its performance. For example, big neural networks give better results, but they require a longer time to be trained. Training time and training accuracy depend on the acoustic input, the network size and topology, the activation functions, pre-training and weight initialization. The main goal of this work is to develop simple and effective methods of optimizing DNNs. This goal could be achieved by finding an optimal topology for the DNN and by applying decomposition algorithms to the DNN, in order to compress the model or improve its performance.

3.1 Optimizations in Acoustic Input

The choice of acoustic input for training a DNN can have a big influence on performance. Early attempts at DNN training used Mel Frequency Cepstral Coefficients (MFCC). Recent studies have shown that DNNs perform better on simple spectral features [DLH+13], [MDH12], [GMMW13]. In these works, the best performance was achieved with 40 log filter-banks. The advantage of spectral features over MFCC can be explained by the fact that spectral features carry more information (including possibly redundant or irrelevant information) [DLH+13], which might help the DNN in the classification task.

3.2 Optimizations in Output Layer

The performance of an ANN can also be improved by using context-dependent HMM states as the output layer of the DNN [DYDA12]. This allows coverage of a large number of HMM states. This, however, increases the network size and negatively influences its performance. Modern computing technologies such as the Graphics Processing Unit (GPU) allow much faster training of the DNN, but cannot help to avoid performance problems later in the speech recognition process, if a CPU is used.

These problems can be solved by restructuring the DNN. The main idea is to reduce the size of the DNN without affecting its performance. See the section Restructuring of DNN with Singular Value Decompositions in chapter 2 for one of the possible methods, proposed by Xue et al. [XLG13].

3.3 Prevention of Overfitting

The most commonly used activation functions for hidden layers are linear, hyperbolic tangent and sigmoid. The main remaining problem is overfitting. Usually the problem of overfitting is addressed by using cross-validation during training. This stops a network from overfitting and works well for ANNs. However, cross-validation is not able to stop individual units from being overfitted during deep training. That is why new methods for DNNs should be found and applied. One way to solve this problem is dropout. The idea is to add particular noise (zero values, or dropout) to the activation function during training. Hinton et al. first proposed dropout in 2012 [HSK+12]. In 2013, Dahl et al. applied dropout to an ASR task in combination with rectified linear units [DSH13].

3.4 Pre-training and Weights Initialization

Deep learning is divided into two important parts: pre-training and fine-tuning. The main task of pre-training is to find sensible weights for training, because a neural network with directly connected layers is difficult to train and training takes a long time. To avoid this problem, Hinton proposed using pre-training based on the Restricted Boltzmann Machine (RBM) [Hin07]. Training an RBM has advantages compared to training a directly connected network. First, hidden units help visible units to find the right distributions. Second, a poor approximation to the data distribution is generated. Third, a directly connected neural network tries to learn hidden causes that are marginally independent. The RBM can be replaced by autoencoders with a single hidden layer [Hin07]. Further development showed that denoising autoencoders perform well [VLBM08] in speech recognition tasks [GMMW13].

3.5 Optimization of Size and Topology of DNN

The introduction of deep learning algorithms allowed the training of very deep networks with a big number of units per layer. However, unlimited increases of the depth and size of networks do not bring corresponding improvements. According to Mohamed et al. [MDH12] and Graves et al. [GMH13], increasing the network's depth stops bringing significant improvement after the fifth layer. Another important property of DNNs is that many small-sized layers are better than one big one. In general, the best topology should be found according to the acoustic input, the activation function used and other parameters. Choosing the size and topology of a DNN is a trade-off between the network's number of parameters and its accuracy.

3.6 Types of ANN

The three major types of ANN used in ASR are deep fully-connected neural networks (DNN), Deep Convolutional Neural Networks (CNN) and Deep Recurrent Neural Networks (RNN). Both CNNs and RNNs have shown that they can be applied to ASR and achieve better accuracy than DNNs [SMKR13], [GMH13]. However, DNNs with sensible weight initialization and activation functions show comparable accuracy, with the benefits of simple implementation, training and optimization. A sensible choice for the ASR task could be a DNN using denoising autoencoders.

3.7 Restructuring of DNN

One of the recent approaches to optimizing DNNs is restructuring. As Xue et al. proposed [XLG13], restructuring can decrease the number of parameters in a DNN, which improves its performance. A potential application of restructuring could lead not only to performance improvement, but also to improvement of the network's recognition accuracy. This could be achieved by optimizing the size and topology of the DNN, which cannot be done by pre-training and fine-tuning alone.

3.8 Summary

A sensible way to implement and optimize the DNN is to use fully-connected neural networks together with various optimization methods. Autoencoders provide the possibility of sensible initial weight initialization, as a DBN does, but also control overfitting of units, as dropout does. Additionally, recent methods of restructuring DNNs can be easily applied to fully-connected DNNs.


4. Model Design and Implementation

In this chapter the design of the models used and their implementation are presented. The model design shows in which way a DNN can be optimized, how these models can be implemented, and the possible advantages of using them.

4.1 Baseline System

The models are applied to a baseline system, which can be used with a set of DNNs with various topologies and hidden layer sizes. The description of the baseline system and the DNNs can be found in the following sections.

4.1.1 Baseline System

The system used in this experiment was trained and decoded using the Janus Recognition Toolkit (JRTk) developed at Karlsruhe Institute of Technology and Carnegie Mellon University [SMFW01]. The neural networks were trained on 237 hours of Quaero training data of the German language from 2010 on. The audio sampling was done at 16 kHz with a 32 ms window size. The feature extraction was done on 40 log mel scale filterbank coefficients (lMEL) with pitch and FFV, which results in a feature vector with 54 elements. The acoustic models of all systems are context-dependent quinphones with three states per phoneme, using a left-to-right HMM topology without skip states. The GMM models were trained with incremental splitting of Gaussians training (MAS), followed by optimal feature space training (OFS) and refined by one iteration of Viterbi training. Vocal tract length normalization (VTLN) was not used. The training of the neural networks was performed using the Theano library [BBB+10].

4.1.2 Deep Neural Network Topology and Training

In this work, different topologies with various sizes and numbers of hidden layers will be tested. The input layer contains 702 units and has a hyperbolic tangent (tanh)

activation function. The size of the hidden layers varies from 400 to 2000 units; the number of hidden layers varies from 1 to 10. The activation function of the units in the hidden layers is the sigmoid. The output layer has 6016 units, based on 6016 context-dependent phone states. The activation function of the output layer is the softmax.

During pre-training, the weights of the hidden layers are initialized using stacked denoising autoencoders, as proposed by [VLBM08]. The denoising autoencoder works like the autoencoder proposed by Bengio et al. [BLPL06], but with the modification that the input is partly corrupted. This corruption helps to avoid overfitting and allows features to be extracted from large hidden layers. The input corruption can be represented as a stochastic process q_D(x̃ | x) applied to the input vector x. To destroy the input, a fixed number νd of components is chosen and their values are set to 0, so that the initial input x is mapped to the corrupted version x̃. The hidden representation now looks as follows:

y = f_θ(x̃) = s(Wx̃ + b).   (4.1)

Then the reconstruction z can be computed as:

z = g_{θ'}(y) = s(W'y + b').   (4.2)

Now z is a deterministic function of x̃, which represents a stochastic mapping of x. The cross-entropy error is used to compare the input with the output in order to find the necessary adjustment of the neural network weights:

L_H(x, z) = − Σ_i [ x_i log z_i + (1 − x_i) log(1 − z_i) ].   (4.3)

The hidden layers of the neural networks were pre-trained with the following parameters:

| Parameter | First hidden layer | Following hidden layers |
| corruption | | |
| loss function | mean squared error | cross entropy |
| batch-size | | |
| minibatches | 2M | 1M |
| learning-rate | | |

Table 4.1: Parameters for hidden layer pre-training

After pre-training, the output layer is added to the neural network, which is then fine-tuned using back-propagation. The results with various sizes and numbers of hidden layers are available in chapter 5, Results and Evaluation.

4.2 Singular Value Decomposition

Singular Value Decomposition (SVD) is a factorization of a matrix. The decomposition of an m × n matrix A is written as:

A_{m×n} = U_{m×n} Σ_{n×n} V^T_{n×n},   (4.4)

where U is a unitary matrix, V^T is the conjugate transpose of the unitary matrix V, and Σ is a diagonal matrix with non-negative real numbers on the diagonal. The matrix A can be approximated by a matrix A_k of lower rank. This is done by replacing the lowest singular values in the matrix Σ with zeros. The resulting approximated matrix is

A_k = U_{m×k} Σ_{k×k} V^T_{k×n}.   (4.5)

The equation below shows a more precise presentation of the approximation problem:

A = [a_11 ... a_1n; ... ; a_m1 ... a_mn]
  = U_{m×n} diag(σ_11, ..., σ_nn) V^T_{n×n}
  ≈ U_{m×k} diag(σ_11, ..., σ_kk) V^T_{k×n}.   (4.6)

After the approximation, the matrix A can be replaced with two smaller matrices, U and W, where W_{k×n} = Σ_{k×k} V^T_{k×n}. The final equation for the matrix decomposition is

A_{m×n} ≈ U_{m×k} W_{k×n}.   (4.7)

This decomposition of the weight matrices in the DNN can be used to compress the original DNN [XLG13].

4.3 Applying SVD on Weight Matrix of DNN

The connections between layers in a DNN can be represented as weight matrices. The weight matrix A_{m×n} between layers i and j has the dimension m × n, where m is the number of units in layer i and n is the number of units in layer j. After applying SVD to the matrix A_{m×n}, we have two matrices of smaller size, U_{m×k} and W_{k×n}. The matrix A is then replaced with the matrices U and W in that order: first the matrix U and then W. This process is illustrated in figure 4.1. The new layer between layers i and j has k units. For the new layer, the sigmoidal activation function is chosen. In order to ensure better fine-tuning of the new weights, the bias values in each of the three layers after SVD are set accordingly.

4.4 Non-rank Decomposition of Hidden Layers

This model applies SVD to the weight matrices between the first and the last hidden layers. The weight matrices between the input layer and the first hidden layer, as well as between the last hidden layer and the output layer, remain untouched.

Figure 4.1: An example of SVD applied to a weight matrix in a DNN. a: matrix before SVD. b: matrix after SVD

SVD applied to a weight matrix produces two matrices U and W, as described in section 4.2. If the rank of the matrices is not reduced after applying SVD to a square hidden-layer weight matrix A_{m×m}, the resulting matrices U_{m×m} and W_{m×m} are also square. So, in effect, the hidden layer is doubled. This procedure may allow the depth of a network to be increased and better accuracy to be achieved, because the weights of the new DNN are set to sensible values.

4.5 Low-rank Decomposition of Hidden Layers

Decomposing a DNN can bring the advantage of removing from the network those units in hidden layers that have not been trained effectively during fine-tuning. The main goal of this attempt is to increase the network's performance while keeping the total number of parameters constant. Theoretically, this would give the performance of a much bigger network while still having the advantages of a small network. A sensible way of performing this task is to find a good topology first and then improve it by adding new hidden layers. As shown by Mohamed et al. [MDH12], adding more than four hidden layers to a DNN does not show any significant improvement; the improvement is too small compared to the size of the DNN and the possible performance problems.

In order to perform such a decomposition, the parameter k is set to half the number of singular values on the diagonal of the matrix Σ. If the matrix A_{m×m} is square, k is set to half of m. As an example, take a network with an input layer of 702 units, four hidden layers of 1200 units each and an output layer of 6016 units (figure 4.2). SVD can be applied to the weight matrices between layers 1 and 2, 2 and 3, and 3 and 4. The matrices between the input layer 0 and layer 1, and between the last hidden layer 4 and the output layer 5, remain the same. After applying SVD, the resulting neural network has more layers (figure 4.3), but the number of parameters remains the same (table 4.2).
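The following numpy sketch illustrates the decomposition described in sections 4.3-4.5: the square weight matrix between two hidden layers is factored into U and W = ΣVᵀ, and the single layer is evaluated as two matrix products. For simplicity the intermediate layer here is kept linear, so the full-rank case reproduces the original output exactly; the thesis additionally assigns a sigmoid activation to the inserted layer and relies on fine-tuning afterwards. Sizes and weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

m = 1200                                    # hidden layer size used in the thesis experiments
A = rng.normal(0, 0.05, size=(m, m))        # stand-in weight matrix between two hidden layers
bias = np.zeros(m)
h = rng.random(m)                           # activations of the preceding hidden layer

def decompose(A, k):
    """Factor A (m x m) into U (m x k) and W = Sigma V^T (k x m)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], np.diag(s[:k]) @ Vt[:k, :]

out_original = sigmoid(h @ A + bias)        # the single original layer

U_full, W_full = decompose(A, k=m)          # non-rank decomposition (section 4.4)
out_full = sigmoid((h @ U_full) @ W_full + bias)

U_half, W_half = decompose(A, k=m // 2)     # 50%-rank decomposition (section 4.5)
out_half = sigmoid((h @ U_half) @ W_half + bias)

print(np.max(np.abs(out_original - out_full)))  # ~0: full rank reproduces the layer exactly
print(np.max(np.abs(out_original - out_half)))  # error introduced by the 50%-rank truncation
```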

Figure 4.2: Neural network before applying SVD

Figure 4.3: Neural network after applying SVD

| Network | Input layer | Hidden layers | Output layer | Number of hidden layers |
| Before SVD | 842.4K | 3 × 1.2K × 1.2K = 4320K | | 4 |
| After SVD | 842.4K | 6 × 1.2K × 0.6K = 4320K | | 7 |

Table 4.2: Number of parameters and hidden layers after applying SVD to a network with 4 hidden layers

4.6 Low-rank Decomposition of Output Layer

Due to the use of context-dependent HMM states as the output of the DNN, the output layer has a relatively big size. The next attempt is therefore to examine the influence of applying SVD to the output layer (figure 4.4). For example, in the system used here the output layer has 6016 units, so the weight matrix which connects it with the previous layer has 1200 × 6016 parameters. Applying SVD to the output layer could significantly decrease this number. The result of applying SVD to the output layer's weight matrix with a size of 1200 by 6016 units can be seen in table 4.3: leaving only 50% of the singular values allows the

Figure 4.4: Low-rank decomposition of the output layer

number of parameters to decrease by 40%. Theoretically, this approach could not only decrease the number of parameters, but also improve DNN accuracy by deleting not-trained units.

Table 4.3: Number of parameters in the output layer after applying SVD, for the original layer and three reduced values of k

4.7 Insertion of Bottle-neck Layers to DNN with SVD

This attempt explores the influence of bottle-neck layers in a DNN. A bottle-neck is a layer in a neural network whose size is significantly smaller than that of the other layers. The bottle-neck is usually placed before the last layer in the DNN. This can help to prevent overfitting by reducing the number of weights that are not pre-trained [MDH12]. By using SVD, the bottle-neck can be inserted after training the DNN. The idea is to keep only the significant singular values, so theoretically this might help to improve DNN performance. In order to achieve this goal, various numbers of bottle-neck layers are added to the DNN. The process of inserting a bottle-neck into a DNN with 4 hidden layers (figure 4.2) is illustrated by figure 4.5.

4.8 Extracting Bottle-neck Features with SVD

As proposed by Grézl et al. [GKKC07], a bottle-neck layer can be used to extract features. SVD allows the creation of such a bottle-neck layer, leaving only the significant information. The extraction of BNF can be done in two ways: with fine-tuning after SVD and without. After creating the DNN, the acoustic model and the resulting system are trained.

Figure 4.5: Insertion of a bottle-neck layer into a DNN with 4 hidden layers

4.9 Step-by-Step Fine-tuning of Low-rank Decomposed DNN

After implementing all of the approaches detailed above, the next sensible idea is to combine the best of them in one DNN. The various approaches are applied and fine-tuned one after another. Theoretically, this could give a better improvement than each approach by itself.

4.10 Summary

Decomposing a DNN without increasing the network's size, and inserting bottle-neck layers, have a good theoretical basis for being able to improve the performance of a DNN. In the next chapter these approaches will be evaluated.


5. Results and Evaluation

5.1 Topology and Performance Analysis

The topology analysis started from the pre-training of the layers. The pre-training setup consisted of five sets with hidden layer sizes of 400, 800, 1200, 1600 and 2000 units. Each set contained 11 pre-trained layers. Table 5.1 presents the time taken to pre-train the 11 layers of each set. The dependence between size and pre-training time has a linear character (see figure 5.1).

| Hidden layer size | 400 | 800 | 1200 | 1600 | 2000 |
| Pre-training time (h:m) | 13:30 | 27:12 | 39:47 | 61:29 | 102:12 |

Table 5.1: Pre-training time (h:m) for 11 layers

After pre-training, the DNNs with various parameters were fine-tuned and evaluated on a Quaero evaluation task. Figure 5.2 presents the dependence between the number of layers and the time spent on fine-tuning. As can be seen, as the number of units in the hidden layers increases, the time for fine-tuning increases dramatically and reaches almost 168 hours for the biggest network. Figure 5.3 presents the dependence between the number of layers and the validation error for networks with different hidden layer sizes. The validation error stops decreasing significantly once the network has more than 5 layers, and after the 7th or 8th layer it starts growing again. The dependence between the number of hidden layers and the Word Error Rate (WER) is noticeable (see figure 5.4). Networks with hidden layers of sizes from 1200 to 2000 units showed a similar level of WER, so it can be assumed that increasing the size of the hidden layers beyond 1200 units is not sensible. In order to verify this, the corrected sample standard deviation s of the WER is analyzed:

s = sqrt( (1 / (N − 1)) Σ_{i=1}^{N} (x_i − x̄)² )   (5.1)
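For reference, the corrected sample standard deviation in equation 5.1 corresponds to numpy's standard deviation with ddof=1, which uses the N − 1 denominator; the WER values in the snippet are placeholders, not results from the thesis.

```python
import numpy as np

wer = np.array([16.2, 15.9, 15.7, 15.8])   # placeholder WER values, one per hidden layer size
s = np.std(wer, ddof=1)                    # corrected sample standard deviation, equation 5.1
print(round(float(s), 3))
```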

Figure 5.1: Dependence between hidden layer size and pre-training time

Figure 5.2: Dependence between the number of layers and fine-tuning time

Figure 5.3: Dependence between the number of layers and validation error

Figure 5.4: Dependence between the number of layers and WER (%)

The sample standard deviation s for all networks is presented in table 5.2. According to this table, the deviation becomes smaller as the number of hidden layers increases. This means that the differences between the WER of networks with different hidden layer sizes shrink as the number of layers grows.

Table 5.2: Corrected sample standard deviation s for all hidden layer sizes (columns: number of hidden layers, mean WER x̄, variance s², standard deviation s)

As mentioned above, networks with hidden layer sizes from 1200 to 2000 units showed a similar level of WER. To verify this, only the networks with hidden layer sizes of 1200, 1600 and 2000 are considered. As can be seen from table 5.3, the corrected sample standard deviation becomes much smaller. For networks with 7 hidden layers it is 0.1, which is about 7 times smaller than for the full set of networks. A further important observation is that the average WER (x̄) stops decreasing at 8 layers and starts growing again.

Table 5.3: Corrected sample standard deviation s for networks with hidden layer sizes of 1200, 1600 and 2000 (columns: number of hidden layers, mean WER x̄, variance s², standard deviation s)

The results of the analysis are:

The addition of further hidden layers does not bring much improvement for networks with more than 5 hidden layers.

Increasing the size of the hidden layers does not bring much improvement for networks with more than 1200 units per layer.

For networks with hidden layers between 1200 and 2000 units, the WER stops decreasing at layer 8.

Based on these results, a sensible choice is a network with between 4 and 8 hidden layers of about 1200 units each.

5.2 Analysis of Singular Values in DNN

To perform a decomposition of a DNN, the singular values of its weight matrices should be analyzed. After singular value decomposition, the singular values appear on the diagonal of the matrix Σ. These values are sorted and divided into 5 groups; the results can be seen in figures 5.5 and 5.6.

Figure 5.5: Analysis of singular values of a DNN with 5 hidden layers

Figure 5.6: Analysis of singular values of a DNN with 8 hidden layers

The values are not equally distributed over the weight matrices. Indirectly, this could indicate the amount of information stored in each weight matrix. For example, the values in the matrices A_{12}, A_{23} and A_{56} are higher than in the others. This
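A possible way to compute such a grouping is sketched below; the matrix names and sizes are placeholders, not the actual weight matrices of the evaluated networks:

import numpy as np

def singular_value_groups(weight_matrices, n_groups=5):
    """For every weight matrix of a trained DNN, compute its singular values
    (sorted in descending order), split them into n_groups groups and return
    the mean magnitude of each group for comparison across matrices."""
    report = {}
    for name, W in weight_matrices.items():
        s = np.linalg.svd(W, compute_uv=False)      # singular values, largest first
        groups = np.array_split(s, n_groups)
        report[name] = [float(g.mean()) for g in groups]
    return report

# Toy example with random matrices standing in for the weight matrices A12, A23:
mats = {"A12": np.random.randn(1200, 1200), "A23": np.random.randn(1200, 1200)}
print(singular_value_groups(mats))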
