Automatic Speech Recognition using Dynamic Bayesian Networks


Rob van de Lisdonk
Faculty of Electrical Engineering, Mathematics and Computer Science
Delft University of Technology
June 2009

Graduation Committee:
Prof. drs. dr. L.J.M. Rothkrantz
Dr. ir. P. Wiggers
Dr. C. Botha

Abstract

New ideas to improve automatic speech recognition have been proposed that make use of context information about the user, such as gender, age and dialect. To incorporate this information into a speech recognition system a new framework is being developed at the mmi department of the ewi faculty at the Delft University of Technology. This toolkit is called Gaia and makes use of Dynamic Bayesian Networks. In this thesis a basic speech recognition system was built using Gaia to test whether speech recognition is possible with Gaia and dbns. dbn models were designed for the acoustic model, the language model and the training part of the speech recognizer. Experiments on a small data set showed that speech recognition is possible using Gaia. Other results showed that training using Gaia does not work yet. This issue, as well as the speed of the toolkit, needs to be addressed in the future.


Contents

1 Introduction
  1.1 Research Goal
  1.2 Contents of the report
2 Standard Speech Recognizer Techniques
  2.1 Introduction
  2.2 Hidden Markov Model
  2.3 Acoustic Preprocessing
  2.4 Acoustic Model
  2.5 Language Model
  2.6 Recognition
  2.7 Training
  2.8 More Techniques
3 DBN based automatic speech recognition
  3.1 Bayesian Networks
    3.1.1 Exact Inference
    3.1.2 Approximate Inference
    3.1.3 Learning
  3.2 Dynamic Bayesian Networks
    3.2.1 Exact Inference
    3.2.2 Approximate Inference
    3.2.3 Learning
  3.3 From HMM to DBN
4 Tools
  4.1 Gaia Toolkit
    4.1.1 ObservationFile
    4.1.2 PTable
    4.1.3 XMLFile
    4.1.4 DBN
  4.2 HTK
  4.3 SRILM
  4.4 copy sets and test sets
5 Models
  5.1 Acoustic Model
    Context Files
  Language Model
    Word Uni-gram
    Word Tri-gram
    Interpolated Word Tri-gram
    Phone Recognizer
6 Implementation
  PreProcessor
  FileManager
  IOConverter
  Trainer
  Recognizer
  GaiaXMLGeneration
  lexicontool
  createlmtext
7 Experiments and Results
  Initial Experiments
  Smaller Data Set
  Recognition Experiments
  Pruning Experiments
  Training Experiments
8 Conclusion and Recommendations

Chapter 1 Introduction

Computers are an integral part of life for most of us. Almost everyone in developed countries has a computer available for personal use. The way we operate or communicate with computers is with a keyboard and mouse. This is not a very natural way of communicating for humans and we need to learn to use those control mechanisms. Furthermore, heavy use of keyboard and mouse has led to people suffering from Repetitive Strain Injury (rsi) complaints, which is another signal that it is not an optimal way of communicating. Speech on the other hand is a very natural way of communicating for humans and we humans are very proficient with it. If computers could understand what we are saying, would that not be an ideal way of operating computers? This is one of the reasons why research is done in Automatic Speech Recognition (asr). Humans and computers however, are two very different things and asr is complicated to do correctly. asr research has been around for some time. The earliest research started around 1936 [17], but at that time the main problem was the lack of computing power. Computers became more powerful and in the 1980s systems were developed that could recognize single words. In the 1990s systems were developed that could recognize continuous speech with a vocabulary of a few thousand words, and today we have systems that do continuous speech recognition for 64k words with a recognition rate of about 95% on read speech. These results however, are obtained in a controlled environment where the system is adapted to the speaker's voice. At that recognition rate the system is therefore very restricted. The goal of asr research is to eventually create a system that can perfectly recognize natural, fluent and spontaneous speech of anybody in non-laboratory environments in real-time. The current asr systems make use of Hidden Markov Models (hmm)

which are explained in the next chapter. Although these systems can have good performance as stated above, there are new ideas for improving speech recognition for which hmms do not have sufficient modeling power. Wiggers [19] proposes to use context information in speech recognition to increase the recognition rates. One of the ideas is to use user knowledge, for example gender, age and dialect, and switch to specific speech models once these variables are estimated. This seems a good approach because specific systems work much better than general systems. A hmm however, cannot incorporate multiple models, and when confronted with different speakers hmm based systems use techniques like speaker adaptation or speaker normalization to either adjust the model parameters to the speaker or vice versa. Another context idea is to use topics and switch to a corresponding language model in which certain words are more likely to occur than others. To test these context ideas a dbn toolkit is needed, but because current toolkits that work with dbns cannot process the large number of states and the amount of data used in speech recognition, a new toolkit is needed. This new software toolkit is being developed at the mmi department of the ewi faculty at the Delft University of Technology. The outlines of this toolkit, called Gaia, are discussed in [19] and in this report some details are described.

1.1 Research Goal

One of the goals for which the Gaia toolkit is developed is to create a dbn based context dependent speech recognizer that should be able to compete with and hopefully outperform a modern hmm based speech recognizer. Because such a hmm system will have had years of development and will have many techniques implemented that improve performance, building such a system with the Gaia toolkit is no small task. My literature study [18] was therefore chosen to be a survey of techniques that current hmm based speech recognizers use to improve performance, such that I got an idea of which techniques may be useful for the dbn system and, if it was possible within the project time, could implement one. The first step toward the context dependent recognizer is creating the first working basic speech recognizer with the Gaia toolkit and thus proving that the toolkit is capable of doing speech recognition. Because the Gaia toolkit is not fully developed, this also means that a lot of testing of the Gaia toolkit will be done, such that bugs are found and missing functionality is added. During the project there was also the idea to participate with the dbn speech recognizer in the N-Best competition. This competition

was held among several universities and companies in the Netherlands and Belgium to see what the current state of affairs is for Dutch speech recognizers [14]. However, the dbn system was not finished in time to participate. To sum up, the goals of this thesis are:

- Research what makes hmms successful as an asr model.
- Design dbn models for the acoustic model, language model and training part of a speech recognizer.
- Design, implement and test a basic speech recognizer using the Gaia toolkit.
- Further develop and test the Gaia toolkit.

1.2 Contents of the report

In this thesis I will describe how the dbn speech recognizer was created and how it performed. The basic speech recognition theory and techniques are covered in the second chapter. The hmm model used in such a system is also discussed there. This is useful to gain some understanding of an asr system and to compare it to the dbn model described in the next chapter. In chapter four the external tools that were used are discussed, which is mainly the Gaia toolkit. The dbn models that I designed for the speech recognizer are explained in chapter five. The next chapter discusses the c++ tools I created to make the speech recognizer. The experiments that were done with the speech recognizer are described in chapter seven. The results from those experiments are displayed in the chapter thereafter, which is followed by the conclusion and a note on future work.


Chapter 2 Standard Speech Recognizer Techniques

In this chapter I will give a short introduction to the theory of speech recognition and describe the main parts of a speech recognizer and the standard techniques that are often used. I will also describe the hmm model here because most current asr systems use this model. Furthermore, in the next chapter the dbn theory will be discussed and because it is not that different from hmm theory this chapter will be useful in understanding how the dbn speech recognizer works. In a literature study I researched what advanced techniques hmm based systems use to improve performance and a summary of that report is given here in the last section.

2.1 Introduction

Speech recognition can be summarized in one formula:

    Ŵ = argmax_{W ∈ L} P(W | O)    (2.1)

Here Ŵ stands for the recognized word (or sentence), W for a word from the language L and O is the speech signal or the observation. Thus the recognized word is the word from our language that has the highest probability given the observation. The distribution in this form is hard to quantify because the random variables involved may have infinitely many values, so with the help of Bayes' rule we transform it into this formula:

    Ŵ = argmax_{W ∈ L} P(O | W) P(W) / P(O)    (2.2)

Figure 2.1: Simple first order Markov Model that models weather prediction (Rain or Sunny)

Because P(O) is the same for all W we can simplify the equation:

    Ŵ = argmax_{W ∈ L} P(O | W) · P(W)    (2.3)

where P(O | W) is the observation likelihood and P(W) the language model. Here we have two new terms that can be better quantified. The probability that the observation is an instantiation of the word can be calculated by the acoustic model (or observation likelihood). This acoustic model is covered in section 2.6 about recognition. The prior probability that the word occurs (for example after another word in a sentence) can be calculated with the language model and this is covered in section 2.5 about the language model.

2.2 Hidden Markov Model

In our language we can generate an enormous number of possible word sequences but we use only a small part that is correct according to our grammar. A speech recognizer uses this fact by limiting the possible word sequences it can recognize according to some (usually simple) grammar. It models this grammar using a Hidden Markov Model such that an algorithm can calculate the best possible sequence. It is derived from a Markov Chain, which is a stochastic process with the Markov property. A Markov property of order n means that the present state depends on a finite number n of past states and is independent of all other states. A simple example of a first order Markov Model is shown in figure 2.1. It models a simplified weather prediction. The weather can be rainy (R) or sunny (S) and the prediction of today is only dependent on the weather yesterday. The transition probabilities (or a_ij) for this model are shown in table 2.1. If it rained yesterday then it will be sunny today with a probability of 0.4. The difference with a Hidden Markov Model is that a hmm adds hidden variables to this model. In the previous example of the weather prediction

Table 2.1: Probability table belonging to Figure 2.1: the transition probabilities P(weather today | weather yesterday)

Figure 2.2: Simple first order Hidden Markov Model that models weather prediction by an observable newspaper (Wet or Dry)

the only variable was the weather, which is observable. In a hmm the state of hidden variables can only be determined by looking at observable variables that are being influenced by the hidden variables. Suppose for example that you are not able to leave your house and are therefore not able to observe the weather. The only clue you have about the weather is the newspaper you find on your doormat every day, which is wet or dry. The weather prediction model now becomes a hmm and is shown in figure 2.2. The R and S variables have become hidden and the new observable variables are Wp (wet paper) and Dp (dry paper), shown in grey to indicate that they are observable. They do not influence the weather system but are influenced by it, so the connections are also shown in grey to differentiate them. To predict what the weather will be we can only look at the newspaper. To complete the prediction model we need an additional probability matrix that specifies the relation between the weather and the state of the newspaper; it is specified in table 2.2. These probabilities are called the observation probabilities or b_i(o_t). In hmm speech recognition we would like to know the word that is spoken but to the computer this is a hidden variable. It can only observe the acoustically preprocessed sound, but by using this information from the speech signal it can determine which word is spoken using the methods described below.
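To make the notation concrete, the small sketch below multiplies transition probabilities a_ij and observation probabilities b_i(o_t) along one specific hidden weather sequence. It is only an illustration: apart from P(Sunny today | Rain yesterday) = 0.4, every probability value is an assumption, since the full tables are not reproduced in this text.

    #include <cstdio>

    int main() {
        // Hidden states: 0 = Rain, 1 = Sunny. Observations: 0 = wet paper, 1 = dry paper.
        // Only P(Sunny today | Rain yesterday) = 0.4 is given in the text; all other
        // numbers below are assumed values for illustration.
        const double prior[2] = {0.5, 0.5};               // P(weather on day 1)
        const double a[2][2] = {{0.6, 0.4}, {0.3, 0.7}};  // a_ij = P(today = j | yesterday = i)
        const double b[2][2] = {{0.8, 0.2}, {0.1, 0.9}};  // b_i(o) = P(paper = o | weather = i)

        // Joint probability of the hidden sequence Rain, Rain, Sunny together with
        // the observed newspapers wet, wet, dry.
        const int q[3] = {0, 0, 1};
        const int o[3] = {0, 0, 1};
        double p = prior[q[0]] * b[q[0]][o[0]];
        for (int t = 1; t < 3; ++t)
            p *= a[q[t - 1]][q[t]] * b[q[t]][o[t]];
        std::printf("P(weather sequence and papers) = %g\n", p);
        return 0;
    }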

Table 2.2: Probability table belonging to Figure 2.2: the observation probabilities P(newspaper Wet or Dry | weather Rain or Sunny)

2.3 Acoustic Preprocessing

Before we can use a speech signal for recognition we first need to extract certain information from the speech signal and put it in variables to work with. Which information should be extracted from the signal is a decision that affects the performance. The performance of the system is bounded by the amount of relevant information extracted from the speech signal. There are however methods derived from human hearing characteristics that are proven to give good performance. A well known method, mfcc, is discussed below. Another method that is often used is perceptual linear prediction (plp), which also uses knowledge about human hearing. For more information on plp see [6]. The most valuable information in the speech signal for speech recognition is the way the spectral shape changes in time. To capture this information the signal is divided into small intervals, e.g., every 10 msec. This is done with a window function; a common one is the Hamming window described in [13]. By multiplying this function with the speech signal we get short speech segments that are often chosen to be longer than the interval, for example 25 msec. Doing this for the whole signal with different time indices gives us time segments that overlap. This overlapping proves useful for representing the global shape transitions in time. From these samples we compute the discrete power spectrum. This is done by first using a Discrete Fourier Transform to compute the complex spectrum of each windowed sample and then taking the squared magnitude of each sample. Because human hearing does not have a linear frequency resolution, a transformation of the power spectrum is used to better suit human abilities. Humans have a greater resolution in the lower frequencies of the spectrum and a lower resolution in the higher frequencies. The Mel scale captures this non-linearity and can be used to transform the power spectrum into Mel scale coefficients, see figure 2.3. To transform these coefficients back into the frequency domain an inverse discrete Fourier transform can be used, but this is usually replaced by a more efficient cosine transform which does the same.
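As a small illustration of the framing, windowing and power spectrum steps just described, the sketch below spells them out for 16 kHz audio with a naive DFT. This is not the preprocessing used in the project, which relied on HTK's HCopy (see chapter 4); the mel filterbank, logarithm and cosine transform steps are omitted.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Cut the signal into overlapping frames (25 msec windows every 10 msec at 16 kHz,
    // i.e. 400 samples with a shift of 160), apply a Hamming window and compute the
    // power spectrum of each frame with a naive O(N^2) DFT.
    std::vector<std::vector<double>> framePowerSpectra(const std::vector<double>& signal,
                                                       std::size_t frameLen = 400,
                                                       std::size_t frameShift = 160) {
        const double PI = 3.14159265358979323846;
        std::vector<std::vector<double>> spectra;
        for (std::size_t start = 0; start + frameLen <= signal.size(); start += frameShift) {
            std::vector<double> frame(frameLen);
            for (std::size_t n = 0; n < frameLen; ++n) {
                double hamming = 0.54 - 0.46 * std::cos(2.0 * PI * n / (frameLen - 1));
                frame[n] = signal[start + n] * hamming;
            }
            std::vector<double> power(frameLen / 2 + 1);
            for (std::size_t k = 0; k < power.size(); ++k) {
                double re = 0.0, im = 0.0;
                for (std::size_t n = 0; n < frameLen; ++n) {
                    double angle = -2.0 * PI * static_cast<double>(k) * n / frameLen;
                    re += frame[n] * std::cos(angle);
                    im += frame[n] * std::sin(angle);
                }
                power[k] = re * re + im * im;   // squared magnitude of the complex spectrum
            }
            spectra.push_back(power);
        }
        return spectra;
    }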

Figure 2.3: A Log Power Spectrum (left) and the same spectrum Mel-Scaled (right)

The coefficients are now called Mel Frequency Cepstral Coefficients (mfcc) and are used directly as input to the speech recognizer (the word cepstral comes from cepstrum, which is the inverse of a spectrum). Often only the first 12 coefficients are used and to capture the dynamics of the speech input also the first and second derivatives are computed. These derivatives are computed as the difference between two coefficients lying a time index t in the past and future, where the second derivative is computed as the difference between the first derivatives. Finally the signal energy is computed as the sum of the speech samples in the time window and it is also computed for the derivatives. This brings the total number of coefficients for the 10 msec speech input to 39 ((12 + 1) × 3) and these are used as a feature vector for the speech recognizer.

2.4 Acoustic Model

From the preprocessing phase we obtained real-valued feature vectors for every time-slice of the speech signal. One recognition method is to compare each vector to a database of feature vectors and choose the one which fits best. A standard feature vector from the database would represent part of a word or a phone (explained below). Comparing the feature vectors will then result in the best fitting word or phone from the language. But because there is a lot of variation in the pronunciation of words there need to be a large number of vectors to represent that for each word. Furthermore, because often large lexicons are used, the number of vectors that need to be stored

is too large and another method is preferred. This method is statistical and instead of standard feature vectors a (multivariate) probability density function (pdf) is stored. We can then compute how likely it is that the current feature vector from the observation comes from the pdf. In earlier models a pdf was created for each phone. A phone is a sound from which words are built; for example the word parsley consists of the phones /p/ /aa/ /r/ /s/ /l/ /iy/. An example of a phone set for Dutch speech can be seen in table 6.2. To capture more detail of the speech, often a sub-phone model is used. Splitting up a phone sound into a beginning (on-glide), middle (pure) and end state (off-glide) enables that. For certain phones like the plosive /p/ in put this proves useful because the release and stop of this phone are different. The number of pdf models increases with this model by a factor of three. Another way of improving the phone model comes from the idea that phones influence each other and that for example the phone /r/ is different in the word translate, where it is between a /t/ and an /@/, and in the word parsley, where it is between an /aa/ and /s/. This is called the tri-phone model because for each phone there is a different representation for all the possible neighbor combinations it has. The word translate consists of the tri-phones: /t+r/ /t-r+@/ /r-@+n/ /@-n+s/ /n-s+l/ /s-l+e/ /l-e+t/ /e-t/. The number of pdf models required to represent all the different tri-phones is the number of phones to the power of three. However, many of those tri-phone combinations will never occur or are very rare (for example /t-t+t/), so often systems are created with a much lower number of tri-phones by clustering similar sounding tri-phones together. In this project the sub-phone model was used because the tri-phone model requires too much computational power. Because a pdf is used to recognize only one sub-phone and we want to recognize whole words, we need a mechanism to tie the separate sub-phone recognitions together to create phone representations. This is done using Hidden Markov Models (hmm). A hmm is created for each phone, made out of its sub-phones. Such a hmm has state probabilities (the pdfs) for each sub-phone and also transition probabilities for moving from one sub-phone to the next, or to itself, see figure 2.4. The small nodes 1 and 5 in the figure are just there to make tying this hmm to other hmms easier. For each word a hmm can be created by gluing the hmms of its phones together. Such a hmm could look like figure 2.5.
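The pdf attached to a (sub-)phone state is typically a Gaussian over the 39-dimensional feature vector. The sketch below evaluates a single diagonal-covariance Gaussian in the log domain; it is an illustration only, since the actual distributions in this project are the ones provided by Gaia and may also be mixtures.

    #include <cmath>
    #include <cstddef>

    // Log-likelihood of a feature vector under one diagonal-covariance Gaussian state
    // pdf. Working in the log domain avoids the numerical underflow discussed in
    // section 2.6; the result plays the role of log b_i(o_t) for state i.
    double logGaussian(const double* x, const double* mean, const double* var, std::size_t dim) {
        const double PI = 3.14159265358979323846;
        double logp = -0.5 * static_cast<double>(dim) * std::log(2.0 * PI);
        for (std::size_t d = 0; d < dim; ++d) {
            double diff = x[d] - mean[d];
            logp -= 0.5 * (std::log(var[d]) + diff * diff / var[d]);
        }
        return logp;
    }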

Figure 2.4: An example three-state HMM

Figure 2.5: An example HMM for the word he

2.5 Language Model

The language model is used to capture properties of the language spoken and to predict the next word or utterance. It assigns probabilities to sequences of words by means of a probability distribution. In formula 2.3 it is represented by P(W). A commonly used language model is the N-gram. An N-gram is a model that gives probabilities to sequences of words. It assumes that the probability of a word W in a sentence is only dependent on its n predecessors. For example the bi-gram model (n = 1):

    P(w_1 w_2 w_3 ... w_n) = P(w_1) P(w_2 | w_1) ... P(w_n | w_1 w_2 ... w_{n-1})    (2.4)
                           ≈ P(w_1) P(w_2 | w_1) ... P(w_n | w_{n-1})    (2.5)

When sequences of one word are used it is called a uni-gram (independent of previous words); more words give a bi-gram, tri-gram and four-gram. The probabilities are simply calculated by counting the occurrences of the sequences in a large corpus. One problem with this method is that if a sequence is not present in the corpus it gets a zero probability even though the sentence could occur.
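The counting just described can be sketched in a few lines. This is an illustration only; the actual language models in this project were built with the srilm toolkit (see chapter 4), and sentence-boundary handling is ignored here.

    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // Maximum-likelihood bi-gram estimates obtained by counting word pairs in a corpus:
    // P(w2 | w1) = count(w1 w2) / count(w1 as a history). A pair that never occurs in the
    // corpus gets probability zero, which is exactly the problem smoothing addresses.
    struct BigramModel {
        std::map<std::string, int> history;
        std::map<std::pair<std::string, std::string>, int> pairs;

        void addSentence(const std::vector<std::string>& words) {
            for (std::size_t i = 0; i + 1 < words.size(); ++i) {
                ++history[words[i]];
                ++pairs[std::make_pair(words[i], words[i + 1])];
            }
        }

        double prob(const std::string& w1, const std::string& w2) const {
            std::map<std::string, int>::const_iterator h = history.find(w1);
            std::map<std::pair<std::string, std::string>, int>::const_iterator p =
                pairs.find(std::make_pair(w1, w2));
            if (h == history.end() || p == pairs.end()) return 0.0;
            return static_cast<double>(p->second) / h->second;
        }
    };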

Figure 2.6: Figure 2.5 shown as a Hierarchical HMM

A solution to this problem is called smoothing, where a small probability is given to each sequence that is not in the corpus [4]. To keep the total probability equal to 1, the probabilities of sequences that do occur are lowered. How these probabilities are calculated differs per smoothing method.

2.6 Recognition

In section 2.4 the acoustic model is discussed, which gives the hmm models that represent phones. In section 2.5 the language model is discussed, which models the sequence of words. To get a full asr system the models need to be combined. A hmm is created for each word, which is done by gluing the hmms of its phones together as shown in figure 2.5. Instead of representing this as a regular hmm we can also put this model in a Hierarchical Hidden Markov Model (hhmm), see figure 2.6. The extra nodes 1 and 5 that help tie the hmms together are omitted. The hhmm fits with the idea that we already use a hierarchy in the system (words are made out of phones which are made out of sub-phones) and will prove to be a nice link to the next chapter. The hhmm is traversed depth-first and when a hmm is finished it will jump back to the point where it got initiated; for example when the last sub-phone of /h/ has finished the model tracks back to that phone and continues to /i/. To calculate the probability that a given speech signal is a sequence of words represented by a path through the hmms an algorithm is used.

    for every time-slice t of the observation o do
        for every state s in the word under consideration do
            for every transition s → s' specified in the HMM of the word do
                forward[s', t + 1] ← forward[s', t + 1] + forward[s, t] · a[s, s'] · b[s', o_t]
            end for
        end for
    end for
    sum all probabilities at the final time-slice

Figure 2.7: The Forward Algorithm in pseudo-code

If this is done for all possible sequences then the most likely sequence is the one with the highest probability. An algorithm that is used is a dynamic programming algorithm called the Forward algorithm, see figure 2.7. In this algorithm forward[s, t] stands for the previous path probability, a[s, s'] for the transition probability derived from the language model and b[s', o_t] for the observation likelihood derived from the pdf in that state. It calculates a probability for every possible path through the model and sums all of those at the end to give a probability for the whole word sequence. This approach however, uses many unnecessary calculations for speech recognition because only one path through the hmm will match the speech signal. Furthermore the forward algorithm has to run for each sequence hmm separately. A small variation on the forward algorithm is the Viterbi algorithm, which replaces the sum over all previous paths by the maximum of those paths. It calculates only the probability of the best path and can be run on all word sequences in parallel. The algorithm can be visualized as finding the best path through a matrix where the vertical dimension represents the states of the hmm and the horizontal dimension represents the frames of speech (i.e. time), see figure 2.8. Even with the Viterbi algorithm however, calculating the word probabilities simultaneously for all words can take long because all possible paths are still calculated. Many of those paths will have low probabilities. Pruning can be used to discard these low probability paths and keep the search space smaller without much loss in the quality of the solution. During the search for the best word or sentence many calculations are made, usually with small numbers and a lot of multiplications. This often leads to numerical underflow because computers can only represent numbers to a certain precision. A simple solution for this is to work with logarithmic calculations, which replaces a multiplication with a sum. This helps to represent much smaller probabilities because summation (or subtraction in this case, since the logarithm of a small number is negative) of small numbers leads to small results much more slowly than multiplication. Thus instead of a · b we use log(a) + log(b).
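Combining the Viterbi recursion with the logarithmic trick gives something like the sketch below for a single hmm. It is illustrative only: word sequencing, back-tracking of the best state path and pruning are left out.

    #include <algorithm>
    #include <cstddef>
    #include <limits>
    #include <vector>

    // Viterbi decoding for one HMM, entirely in the log domain: logPrior[i] is the log
    // probability of starting in state i, logA[i][j] = log a_ij and logB[j][t] = log b_j(o_t).
    // Multiplications become additions and the sum of the forward algorithm becomes a max.
    double viterbiLogProb(const std::vector<double>& logPrior,
                          const std::vector<std::vector<double>>& logA,
                          const std::vector<std::vector<double>>& logB) {
        const double LOG_ZERO = -std::numeric_limits<double>::infinity();
        const std::size_t numStates = logPrior.size();
        const std::size_t numFrames = logB[0].size();

        std::vector<double> delta(numStates), next(numStates);
        for (std::size_t i = 0; i < numStates; ++i)
            delta[i] = logPrior[i] + logB[i][0];

        for (std::size_t t = 1; t < numFrames; ++t) {
            for (std::size_t j = 0; j < numStates; ++j) {
                double best = LOG_ZERO;
                for (std::size_t i = 0; i < numStates; ++i)
                    best = std::max(best, delta[i] + logA[i][j]);   // max replaces the sum
                next[j] = best + logB[j][t];
            }
            delta.swap(next);
        }
        return *std::max_element(delta.begin(), delta.end());
    }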

Figure 2.8: Viterbi search representation

2.7 Training

Before we can actually recognize anything we first need to train the recognizer on data that is representative of the speech data it will encounter. In this training the parameters of the hmms that are created beforehand are estimated. The algorithm that is often used is the Forward-Backward algorithm (also known as Baum-Welch). It trains the transition probabilities a_ij and the observation probabilities b_i(o_t) iteratively by starting with an estimation and using this estimation in the algorithm to calculate a better estimation. Estimating a_ij is done as the expected number of transitions from state i to state j normalized by the expected number of transitions from state i. b_i(o_t) is estimated as the expected number of times in state i at time t normalized by the expected number of times in state i. These estimations are calculated with the help of a forward probability and a backward probability (hence the name). A complete description including formulas can be found in [4].

2.8 More Techniques

In a literature study [18] I researched a number of asr systems to find out which other techniques, besides the standard techniques discussed above,

they use to improve performance. A number of techniques are common among those systems: feature extraction is usually done using mfcc features. Some form of speaker adaptation is often implemented to give the system more robustness for different speakers. Examples are Speaker Adaptation, where the model parameters are adjusted to the observation, or Speaker Normalization, where the observation is first normalized before being used by the model. The Viterbi algorithm is the most common decoding algorithm and is often combined with Beam pruning or Histogram pruning. Beam pruning uses the most probable path and prunes all other paths whose probabilities are not within a certain percentage of the best path. Histogram pruning sets a threshold on the maximum number of paths in the search space. It orders similar paths into bins and prunes the least probable paths from all bins such that the total number of paths is below the threshold. The decoding is often done using a Two-Pass approach; the first decoding pass is not accurate but fast, the second pass can be more accurate because the search space has been reduced in the first pass. For the language model an N-gram is used most often, combined with a Context Dependent Tri-phone acoustic model; the latter is discussed in section 2.4 as the tri-phone model. Depending on the type of asr system you are building, some techniques are more useful than others. If you are building a simple asr system that only recognizes a few simple commands, the vocabulary size will not be very big. That means the language model does not have to be big and that enough training data can be found more easily. A State Clustering method, which groups pdfs of similar states together to reduce the model size, is not necessary in that case. When a large vocabulary is implemented for a system that has to recognize continuous speech this is probably very useful because of the lack of training data for all possible acoustics. If spontaneous speech has to be recognized it might be useful to consider the recognition of filler words (like "uh") and to exclude them from the sentence. Otherwise the grammar represented by the language model will not work properly when such a word is encountered. When time is an issue in training, instead of the accurate Forward-Backward algorithm a Viterbi approximation of that algorithm can be used, which estimates a_ij from the most probable path instead of counting all possible paths and normalizing by them. If real-time decoding is not an issue, language models can be used that are slower but more precise, for example Tri-grams instead of Bi- or Uni-grams. Pruning can be discarded or wider beams can be used, and if the Token Passing model [16] is used, which is essentially a different formulation of the Viterbi

decoding algorithm using tokens, a larger number of tokens can be allowed. All these requirements are also related to the hardware. If faster hardware is used, more computational power is available in the same time-span and more computationally expensive techniques can be used.

Chapter 3 DBN based automatic speech recognition

Research is being done on how speech recognition rates can be improved. [19] proposes to use different sorts of context information in addition to the speech signal, but because hmms are not suitable for incorporating this information he searched for a different model. [11] and [21] both proposed the use of dbns because the expressive power of hmms is limited. Because a hmm is a special case of a dbn there is no loss of expressive power when changing to these models; it actually increases. Furthermore, because dbns are used in more research disciplines there are a lot of good algorithms already available. A last advantage of using dbns appears when you want to use a multi-modal system that uses, for example, speech (audio) and lip reading (video) inputs. This can be combined in a dbn model fairly easily because it can handle different time scales for its inputs more easily than a hmm.

3.1 Bayesian Networks

To describe what a dbn is and what techniques are available to work with them I start by describing Bayesian Networks because they are a more general class of models. An introduction to dbns and their inference techniques can be found in [10] and [19]. A Bayesian Network (bn) is a graphical model that represents the relations between a set of random variables; it represents a joint probability distribution. It consists of a directed, acyclic graph which shows the (in)dependencies between the variables and a set of probability distributions that quantify those dependencies.

Figure 3.1: Example Bayesian Network that models a system that predicts whether it is cloudy or not given the state of the grass. It consists of four binary variables: C for whether it is cloudy or not, S which represents whether the sprinkler is on or off, R which is whether it rains and W which represents whether the grass is wet or not.

The advantage of having a set of probability distributions instead of one full joint probability distribution is that the set of distributions is usually much more compact. The number of probabilities in a distribution is exponential in the number of variables; for n binary variables there are 2^n probabilities. Because there are often independence relations between the variables, the individual variable distributions can be made smaller. This can be demonstrated using an example bn from [10] which is shown in figure 3.1. This bn models a system that predicts whether it is cloudy or not given the state of the grass and consists of four binary variables: C for whether it is cloudy or not, S which represents whether the sprinkler is on or off, R which is whether it rains and W which represents whether the grass is wet or not. The joint probability distribution of this system would, according to the chain rule of probability, be:

    P(C, S, R, W) = P(C) P(S | C) P(R | C, S) P(W | C, S, R)    (3.1)

But because of the conditional independence relations in the model this can be simplified to:

    P(C, S, R, W) = P(C) P(S | C) P(R | C) P(W | S, R)    (3.2)

where the set of separate distributions is more compact to represent than the total joint probability distribution.
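The saving can be made concrete for this example by counting one free parameter per row of each table (the two probabilities in a row of binary values sum to 1): the full joint distribution over four binary variables needs 2^4 − 1 = 15 free parameters, whereas the factorization of equation 3.2 needs 1 for P(C), 2 for P(S | C), 2 for P(R | C) and 4 for P(W | S, R), i.e. 9 in total. The gap grows quickly with the number of variables.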

Calculating the probability of one or more variables in a bn given some evidence is called inference. Just as the Viterbi or the Forward algorithm is used for hmms, there are algorithms for bns. A few of those methods are explained here briefly.

3.1.1 Exact Inference

The simplest and most straightforward inference method is summing out irrelevant variables from the joint probability distribution. This is a basic technique from probability theory called marginalisation:

    P(X_Q | X_E) ∝ Σ_{X_H} P(X_H, X_Q, X_E)    (3.3)

where X_Q is the set of query variables, X_E is the set of evidence variables and X_H is the set of variables that are neither in the query set nor in the evidence set. Referring to the example from figure 3.1: W is the evidence variable, C is the query variable and S and R are neither. This straightforward marginalisation however, is for most networks computationally very hard to do directly. The techniques discussed below make use of clever ideas to make marginalisation possible on larger networks.

Variable Elimination

Variable Elimination is a technique that makes marginalisation more efficient by pushing sums as far as possible into the calculation when summing out irrelevant variables. This is illustrated using the example from figure 3.1. We obtain the joint probability distribution:

    P(W = w) = Σ_c Σ_s Σ_r P(C = c, S = s, R = r, W = w)    (3.4)

which can be rewritten as:

    = Σ_c Σ_s Σ_r P(C = c) P(S = s | C = c) P(R = r | C = c) P(W = w | S = s, R = r)    (3.5)

    = Σ_c P(C = c) Σ_s P(S = s | C = c) Σ_r P(R = r | C = c) P(W = w | S = s, R = r)    (3.6)

The innermost sum is evaluated and a new term is created which needs to be summed over again:

    P(W = w) = Σ_c P(C = c) Σ_s P(S = s | C = c) T_1(c, w, s)    (3.7)

    T_1(c, w, s) = Σ_r P(R = r | C = c) P(W = w | S = s, R = r)    (3.8)

Continuing this way gives:

    P(W = w) = Σ_c P(C = c) T_2(c, w)    (3.9)

    T_2(c, w) = Σ_s P(S = s | C = c) T_1(c, w, s)    (3.10)

Message Passing

Instead of doing one variable marginalisation at a time, a technique called message passing calculates the posterior distributions of all variables in the network given some evidence simultaneously. It is a generalization of the forward-backward algorithm for hmms described briefly in section 2.7. The algorithm works only for tree shaped graphs because a cycle would lead to evidence being counted twice. It uses the fact that a variable in the model is independent from the rest of the model given its Markov blanket. A Markov blanket consists of the variable's parents, its children and the parents of its children. Variables receive new information from the neighbours in their Markov blanket, update their beliefs and propagate them back. When done for all variables this process will reach an equilibrium after a number of cycles and result in updated probability distributions for the entire model. Details on this algorithm can be found in [12].

Junction Tree

Because the message passing algorithm only works for tree shaped graphs, another algorithm has been developed that works for models that include cycles. The Junction Tree algorithm [7] creates a new tree-shaped graph that defines the same joint probability distribution as the original, but this new graph enables the use of a message passing algorithm. This new graph is obtained by a change of variables and the new variables in the graph are cliques of the original variables. The new graph is called a junction tree and if the connections between the variables are directed the message passing algorithm can be used. Usually however, it is easier to obtain undirected connections and use an adjusted message passing algorithm.

MAP

Because marginalisation is very hard to do for large networks, the Maximum A Posteriori (map) technique uses the max operator to reduce the computational requirements. Using the example again, compared to the joint distribution from equation 3.4, map calculates P(W = w) as:

    P(W = w) = max_c Σ_s Σ_r P(C = c, S = s, R = r, W = w)    (3.11)

The Viterbi algorithm is a special case of map where all summations are replaced by max operators:

    P(W = w) = max_c max_s max_r P(C = c, S = s, R = r, W = w)    (3.12)

3.1.2 Approximate Inference

The problem with exact inference is that for many models it is computationally intractable. Therefore methods have been created that approximate the correct inference results but are much faster. Here I will describe some of these methods briefly.

Loopy Belief Propagation

A straightforward idea is to use the message passing algorithm on graphs even though they have cycles. This is called loopy belief propagation. Because evidence will be counted twice this method may not converge to a result or may converge to a wrong result. In practice however, this method gives good results because in some cases all evidence is counted twice such that the effect cancels out [11].

Cutset Conditioning

Another method, called cutset conditioning [15], is to instantiate variables in the graph that break up cycles. New graphs are created for each value of an instantiated variable and the message passing algorithm is run on each of them, after which marginalisation is used to combine the results. The downside of this method is that the number of networks grows exponentially with the number of cycles in the network and with the number of possible values of the variables that are instantiated.

Sampling Methods

A number of methods exist that do stochastic sampling on the model. They generate a large number of configurations from the probability distribution and then the frequency of the relevant configurations is computed. This enables the estimation of the values of the variables in the graph. Logic sampling is a simple approach which starts at the root nodes and their prior probabilities and then follows the arcs of the graph to generate values according to the conditional probabilities to get a configuration. This method is not very good because many configurations will be generated that do not match the evidence and therefore the estimation will take long. Importance sampling improves this method by weighting the values generated by the conditional probabilities according to the evidence.

Table 3.1: Four cases of learning for Bayesian Networks

    Structure   Observability   Method
    Known       Full            Maximum Likelihood Estimation (ML)
    Known       Partial         Expectation Maximization algorithm (EM)
    Unknown     Full            Search through model space
    Unknown     Partial         EM + search through model space

3.1.3 Learning

For Bayesian Networks both the structure and the parameters of the probability distributions can be learned, although learning the structure is much more difficult than learning the parameters. Furthermore the graph can be completely observable or it can contain hidden nodes, which makes the learning more difficult. These possibilities lead to four cases and a learning method for each case is given in table 3.1. Because in this project the structure of the model is known and it contains hidden variables, I will discuss Maximum Likelihood Estimation as an introduction and then the Expectation Maximisation algorithm.

Maximum Likelihood Estimation

When the structure of the model is known and the variables are all observable, learning comes down to finding the parameters of each conditional probability distribution that maximize the likelihood of the training data. If the training set D consists of N independent items D_1, ..., D_N, the normalized log-likelihood is:

    L = (1/N) Σ_{i=1}^{m} Σ_{l=1}^{N} log P(X_i | Pa(X_i), D_l)    (3.13)

where m is the number of variables and Pa(X_i) are the parents of X_i. Assuming that the parameters of the variables are independent of each other, the contribution to the log-likelihood of each variable can be maximized independently. For the training of the W variable from the example of figure 3.1 we just need to count the number of training events where the grass is wet and divide them by all samples:

    P(W = w | S = s, R = r) ≈ N(W = w, S = s, R = r) / N(S = s, R = r)    (3.14)

where

    N(S = s, R = r) = N(W = 0, S = s, R = r) + N(W = 1, S = s, R = r)    (3.15)
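Equation 3.14 amounts to plain counting over fully observed training data, as in the following sketch (the data layout is hypothetical):

    #include <cstddef>
    #include <tuple>
    #include <vector>

    // Maximum-likelihood estimate of P(W = 1 | S = s, R = r) by counting, as in
    // equation 3.14. Each training item is a fully observed triple (w, s, r) recording
    // whether the grass was wet, the sprinkler was on and whether it rained.
    double estimateGrassWet(const std::vector<std::tuple<bool, bool, bool>>& data,
                            bool s, bool r) {
        int wet = 0, total = 0;
        for (std::size_t k = 0; k < data.size(); ++k) {
            if (std::get<1>(data[k]) == s && std::get<2>(data[k]) == r) {
                ++total;                              // N(S = s, R = r)
                if (std::get<0>(data[k])) ++wet;      // N(W = 1, S = s, R = r)
            }
        }
        return total > 0 ? static_cast<double>(wet) / total : 0.0;
    }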

For multinomial variables, like in this example, learning is counting occurrences. For Gaussian variables the sample mean and variance need to be computed and then linear regression is used to estimate the Gaussian mixtures.

Expectation Maximization

When the structure of the model is known but it contains variables that are not observable, the Expectation Maximization algorithm is used. The idea of this algorithm is that if we somehow knew the values of the hidden variables, learning would be easy, as in the ml algorithm. Therefore expected values for these variables are computed and treated as if they were observed. For the example, equation 3.14 becomes:

    P(W = w | S = s, R = r) = E[N(W = w, S = s, R = r)] / E[N(S = s, R = r)]    (3.16)

E[N(x)] is the expected number of times that event x occurs in the training data, given the current estimated parameters. It can be computed as:

    E[N(x)] = E[ Σ_k I(x ∈ D(k)) ] = Σ_k P(x | D(k))    (3.17)

where I(x ∈ D(k)) is an indicator function that has value 1 if the event occurs in training sample k and is 0 otherwise. With the expected counts the parameters are maximized and new expected counts are computed. This iteration leads to a local maximum of the likelihood.

3.2 Dynamic Bayesian Networks

Dynamic Bayesian Networks are an extension of bns which can represent stochastic processes over time. The term dynamic in dbn is a bit misleading because usually dbns are not assumed to change their structure, although there are cases where this is possible. Because the dbn evolves over time it is represented by two models: the prior and the transition model. A simple dbn extension of figure 3.1 is shown in figure 3.2. Only the C variable is connected in time, which represents the fact that whether it is cloudy at time t depends on whether it was cloudy at time t − 1. The prior probabilities are shown in table 3.2 and the transition probabilities are shown in table 3.3.

Figure 3.2: Example Dynamic Bayesian Network extended from Figure 3.1. Only the C variable is connected in time, which represents the fact that whether it is cloudy at time t depends on whether it was cloudy at time t − 1.

Table 3.2: Prior probability tables belonging to Figure 3.2, giving P(Cloudy), P(Rain | Cloudy), P(Sprinkler | Cloudy) and P(Grass Wet | Rain, Sprinkler)

Table 3.3: Transition probability table belonging to Figure 3.2, giving P(Cloudy today | Cloudy yesterday)

3.2.1 Exact Inference

In theory all inference methods discussed in the bn section also work for dbns, but then the entire network needs to be unrolled for all time-slices. Even if that size is known beforehand, it will often not fit into the computer's memory and thus online inference methods were developed that process the network slice by slice.

Frontier algorithm

The Frontier algorithm uses a Markov blanket, like the message passing algorithm, where all the hidden variables d-separate the past from the future. When variables are d-separated in a Bayesian network they are independent. The Markov blanket moves through the network in time, first forward and then backward, and is called the frontier. During its movement variables are added to and removed from it, resulting in the following operations. When moving forward a variable can be added to the frontier when all its parents are in the frontier and this is done by multiplying its conditional probability distribution onto the frontier. A variable can be removed from the frontier when all its children are in the frontier and this is done by marginalizing it out. When the variable is observed the marginalisation is skipped because its value is known. When moving backwards a variable is added to the frontier when all its children are in the frontier and this is done by expanding the domain of the frontier and duplicating the entries in it, once for each value of the variable. A variable is removed from the frontier when all its parents are in the frontier and this is done by multiplying its conditional probability distribution onto the frontier and marginalizing it out; again, if the variable is observed this marginalisation can be skipped.

Interface algorithm

The Frontier algorithm uses all the hidden variables in a slice to d-separate the past from the future. [11] shows that the set of variables that have outgoing arcs to the next time-slice already d-separates the past from the future and that the Frontier algorithm is thus sub-optimal.

The Interface algorithm uses this set, ensures that it is a clique (where each variable is connected to all other variables) by adding arcs, and calls it the interface. The algorithm creates junction trees for each time-slice including the interface variables from the preceding time-slice. The junction trees can be processed separately and messages are sent via the interface variables.

Islands algorithm

Even though online inference methods store less information than offline methods, it is sometimes still too much to fit in memory because all the forward messages need to be saved. Instead of saving these messages it is also possible to recalculate them at each time-slice. This saves space but increases the computational load enormously. The Islands algorithm [20] chooses a point between these extremes by storing the forward messages at a number of points. That results in a number of subproblems that are solved recursively.

3.2.2 Approximate Inference

Boyen-Koller

The Boyen-Koller algorithm [1] approximates the joint probability distribution of the interface in the Interface algorithm by representing it as a set of smaller clusters (marginals) of variables. The requirement that all variables in the interface need to be in a clique is dropped. How accurate the algorithm is depends on the number of clusters used to represent the interface. Using one cluster is equal to exact inference; using more lowers the accuracy but speeds up the algorithm.

Viterbi

Like the Forward algorithm for hmms, the inference algorithms can also be used with a Viterbi approximation. When marginalizing, the sum operators are replaced by a max operator. The idea behind it is that the most likely path will contain most of the probability mass. This is also called Most Probable Explanation or mpe.

3.2.3 Learning

Learning in a dbn can be done using the em algorithm from the bn section. Instead of running it on the entire network it is done for each time-slice separately using the Frontier or Interface algorithm. It uses a forward pass where it stores intermediate results and uses those during the backward pass.

Figure 3.3: How a hmm relates to a dbn. A hmm is shown on the left and is unfolded in time. When this is folded back horizontally you obtain the dbn, where each q state contains all three states of the hmm.

3.3 From HMM to DBN

Because a dbn is a generalization of a hmm we can convert a hmm to a dbn. This can be seen in figure 3.3. A hmm is shown on the vertical axis and it is unfolded to the right where its states and the possible state transitions are shown. If you fold the states back horizontally the dbn model is obtained. The three states are now enclosed in one variable q which is shown unrolled for three time indices. The q variable can be in any of the three states, although at time index 1 it starts in state 1, which means that the first time index at which it can be in state 3 is time index 3. The o variable represents the observations that are the input or evidence to this model. In the previous chapter I showed a hhmm of the acoustic model used in speech recognition in figure 2.6. The idea from that model is that speech recognition can be seen as hierarchical, where words consist of phones and phones consist of sub-phones. This model can also be converted to a dbn, which will look like figure 3.4. It consists of the S variable which represents the sub-phones or states of the phone, the P variable which represents the phones and the W variable which represents the words. The F_s, F_p and F_w variables are switches that fire when a corresponding variable has reached its end. This model will be discussed in more detail in chapter five, where it will also be shown that it can be simplified.

Figure 3.4: The hhmm from figure 2.6 converted to a dbn. It consists of the S variable which represents the sub-phones or states of the phone, the P variable which represents the phones and the W variable which represents the words. The F_s, F_p and F_w variables are switches that fire when a corresponding variable has reached its end.

Chapter 4 Tools

This chapter will describe the tools which were used but not created in this project. It will mainly cover the new Gaia toolkit but also cover two well known tools briefly.

4.1 Gaia Toolkit

In this section I will describe the Gaia toolkit from a user perspective because that is the way I used it. It will be a short description because the entire toolkit is too complex to describe in just one chapter. First a short introduction and global description will be given to give an idea of why and how the toolkit is built. After the overview some specific parts of Gaia that I used are discussed in more detail. Those parts are also interesting for readers who want to use Gaia to build a speech recognizer. It has to be noted however that the Gaia toolkit is still evolving, so this information may become outdated. I wrote in the introduction chapter that [19] proposes to use context information to improve speech recognition. He also states in that report that there is no current toolkit that can easily accommodate this. The Gaia toolkit is therefore created as a framework for probabilistic temporal reasoning over models with large state-spaces. It uses Dynamic Bayesian Networks and can, for example, be used for language modeling and speech recognition. It is written in c++ and consists of 8 libraries as shown in figure 4.1. The Base library contains, among other basic classes, the ObservationFile class and its iterator class, which are discussed below. The Text library has classes that help with textual processing, which can be used in creating language models.

Figure 4.1: The different libraries from the Gaia toolkit and their relations

Utilities is a library that contains classes that create xml files, handle the file system, do logging and do xml parsing. The Math library contains the mathematical building blocks which are used in the dbn. Real, Domain, Probability and RandomVariable are some of the classes that can be found in this library. Classes that implement different types of distributions, like Gaussian and Multinomial, can be found in the Distributions library. The JTree library contains classes that implement the JunctionTree algorithm, which is used for marginalisation in the dbn. The probability tables used in the dbn are created with the classes from the PTable library. It contains different sorts of tables, including SparsePTable, LazyPTable, DensePTable and DeterministicPTable. The most interesting library for a user is DBN. It contains the classes to create dbn objects and perform training and inference. The dbn models that can be created with the Gaia toolkit can consist of multiple parts. Figure 3.3 shows two sets of variables which are labeled. Each set belongs to a so-called slice in Gaia and can be repeated a finite or infinite number of times. In the figure the first slice will be repeated once and the second slice an infinite number of times (or for as many time-slices as there are observations). The toolkit also allows grouping the slices into chapters, which

can also be repeated a finite or infinite number of times. This construction allows for elaborate models such that, for example, different inputs with different time scales can be observed.

4.1.1 ObservationFile

ObservationFile is a class that generates the Gaia observation format. It is created with an InputSpecification, which in this project states that the observations consist of 39 real values and one nominal value. Once it is created it can be filled with ObservationVectors that should contain 39 Boost Variant objects and one value that is either zero or one. (Boost is a freely available collection of peer reviewed C++ libraries [9].) This variant construct makes it possible to use different data types in one observation. This should be useful when context information is used in the speech recognition system, such that you can, for example, extend the 39 real values with a boolean that indicates the gender of the speaker. The observation file is stored in a binary format for fast loading.

4.1.2 PTable

The PTable class is used to create the multinomial tables for nodes of the model. First a Domain for the table has to be created. The domain specifies which random variables are contained in the probability distribution that is represented by the table. This domain then has to be filled with RandomVariables, which have two indices to position them in the model and a cardinality (the number of states the random variable can be in). When used in a dbn the first index is a relative position to indicate where the variables are situated in the dbn model compared to each other. The highest index means the variable is a child node and is not a parent to any of the variables from the domain. The lowest index means that the variable is a parent node and is not a child to any of the variables from the dbn. The indices in between order the variables in this child - parent tree. The second index indicates the time-slice the variable comes from: 0 is the current time-slice, -1 is one time-slice in the past. To clarify this, consider the example model in figure 4.2. This is part of a model I used which will be explained later on. Suppose we would like to create a table for the variable Nw. This variable has four parents: W in the current time-slice and Fw, Nw and Fs all one time-slice in the past. These parents are created as RandomVariables with the following indices: W with indices (0,0), Fw with (0,-1), Nw with (1,-1) and Fs with (2,-1). Fs is the last in the tree because Nw is indirectly (via P) in between Fw and Fs.

Figure 4.2: Example model

Once the domain has been filled, the table can be constructed with the domain. After that the table can be filled with ValueVectors, which should be filled with a probability and values ordered according to the domain specification. Furthermore, these vectors should be added to the table in such a way that the values are added from low to high. When this is done correctly the vectors can be put into the PTable with the push_back() function. Otherwise the slower Add() function can be used, but with large tables this takes considerably longer because the vectors need to be sorted. To clarify this, consider the example given above from the model in figure 4.2. The table created would have entries in this order: [Nw, W, Fw, Nw, Fs]. To fill the table using the fast push_back() function all variables have to be filled from low values to high. This means that the ValueVector [1,0,0,0,0] would follow [0,99,10,34,0], which would follow [0,99,10,33,0]. This can be done using nested for loops in a program.
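The low-to-high filling order can be produced with nested for loops or, for a variable number of parents, with an odometer-style counter as in the sketch below. The cardinalities are made-up example values, and the Gaia calls themselves are left out because their exact signatures are not given here; in a real program each generated combination would be packed into a ValueVector together with its probability and handed to push_back().

    #include <cstdio>
    #include <vector>

    // Enumerate all value combinations of the domain [Nw, W, Fw, Nw(t-1), Fs] in
    // lexicographic order with the first variable most significant (an assumed reading
    // of the ordering described in the text). The cardinalities are hypothetical.
    int main() {
        const std::vector<int> cardinality = {3, 4, 2, 3, 2};
        std::vector<int> value(cardinality.size(), 0);
        bool done = false;
        while (!done) {
            // here a ValueVector would be built from `value` plus its probability
            for (std::size_t i = 0; i < value.size(); ++i)
                std::printf("%d ", value[i]);
            std::printf("\n");
            done = true;
            for (std::size_t i = value.size(); i-- > 0; ) {   // odometer-style increment
                if (++value[i] < cardinality[i]) { done = false; break; }
                value[i] = 0;
            }
        }
        return 0;
    }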

4.1.3 XMLFile

Models and distribution tables are stored in xml format in the Gaia toolkit. When the PTable is filled it can be written to a XMLFile object with a single streaming operator to create an xml file which can be included in the total dbn model. The XMLFile class is designed as an aid in creating xml files and it supports the writing of tags, attributes and data with a streaming operator. Writing end tags does not need an argument; it closes the corresponding (last) start tag automatically.

4.1.4 DBN

The DBN class is used to create dbn objects. It is constructed from a template with a specific engine. The choices are a BoyenKoller or a Frontier engine, which use those algorithms for inference. It provides functions to load a dbn object from a file in xml format. Such a file will usually contain multinomial tables generated with the PTable class, but the total model structure has to be created by hand. If a dbn model gets updated by training, the updated model can be written to a file in xml format. The DBN class has a SetIslands function which gives some control over the memory usage of the Gaia toolkit during inference. The two arguments specify into how many parts the current multinomial table has to be split and after how many milliseconds of observation duration. The idea is that when a multinomial table becomes too large to fit in memory during operation, it is better to split it up into pieces that fit into memory than to do paging between the memory and the hard drive. Another function that can speed up execution is SetPruningBeam(), which sets the parameter, in percentage, for the pruning beam. Beam pruning uses a percentage beam around the best path to decide which paths to prune. To prepare the Gaia toolkit for training the StartLearning() and afterward the EndLearning() function need to be called. The first function is called with a phase as argument and thus these methods specify which phase of the model is to be learned. In the dbn model, variables can be set to update only in a specific phase or phases. This can be used to train specific variables only or to keep a variable static during a training session. When training the model, some context information is needed for each observation that specifies which phones are spoken. This information needs to be loaded and the function StartSequence() does that. It has to be called separately for each observation, with the current .seq file as argument. It will find all other files in the same directory with the same name (but different extensions) as context information and load those xml tables. The function EndSequence() has to be called when the learning of the observation is done, to unload the context files. The function Learn() will do the actual training from an observation. It runs an expectation maximisation algorithm. As arguments it takes two ObservationFileIterator objects, one for the begin position and one for the end. The iterators are created with the observation file as argument

Because training is a computationally heavy operation, the Gaia toolkit is able to store the training results of a small data set and combine the results later on. This way it is possible to split a training task up into smaller parts and run them simultaneously on different processors. The WriteAccumulators() and ReadAccumulators() functions do this writing and reading of partial training results. Both should be called within the StartLearning() and EndLearning() calls. The write version takes as arguments two indices and the directory where the accumulation files are to be stored. The first index should be the same for all partial results that need to be accumulated; the second index should be unique for every partial result. The read version has similar arguments.

To do inference on the dbn model, the MAP() function can be used if the dbn object has the Frontier engine. If the BoyenKoller engine is used, an MPE() function is available, but at this point it does not work yet. The arguments of the MAP() function are two ObservationFileIterators as in the Learn() function, a result construct and a domain. The domain contains the random variables which need to be observed (for example the word variable) and the result construct consists of a list of observations (one for every time-slice) and the probability of this path. Similar to the case of learning, StartMAP() and EndMAP() need to be called before and after the call to MAP(). A recognition sketch along these lines is given below.
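Continuing with the same illustrative stand-ins as in the training sketch above, recognition of a single observation could look roughly as follows. The Domain and MAPResult types, the beam value and the use of variable names as strings are all made up for the example; Gaia's actual result and domain constructs may be organised quite differently.

```cpp
// Illustrative sketch only: assumes the DBN, ObservationFile and
// ObservationFileIterator stand-ins from the training sketch are visible here.
#include <string>
#include <vector>

struct Domain {
    std::vector<std::string> variables;        // e.g. { "W" } to observe the word variable
};
struct MAPResult {
    std::vector<int> path;                     // one observed value per time-slice
    double probability = 0.0;                  // probability of this best path
};

MAPResult recognizeObservation(DBN<FrontierEngine>& dbn, const std::string& obsFile) {
    dbn.SetPruningBeam(5.0);                   // hypothetical value: beam width in percent
    ObservationFile obs(obsFile);
    ObservationFileIterator begin(obs);
    ObservationFileIterator end(obs, obs.Length());

    Domain domain;
    domain.variables.push_back("W");           // observe the word variable

    MAPResult result;
    dbn.StartMAP();
    dbn.MAP(begin, end, result, domain);       // best path and its probability
    dbn.EndMAP();
    return result;
}
```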

4.2 HTK

The Hidden Markov Model Toolkit [5] is a toolkit for building and manipulating hmms and is used primarily, though not exclusively, for speech recognition. It is developed at the Cambridge University Engineering Department. htk consists of a set of library modules written in c. The tools provide sophisticated facilities for speech analysis, hmm training, testing and results analysis. I used a tool called HCopy to do the acoustical analysis of the speech signals. It creates mfccs from the speech signal, which I used to create the Gaia observation files. The reason for not doing the acoustical analysis myself is that it would be far too much work and it is beyond the scope of this thesis.

4.3 SRILM

The sri Language Modeling Toolkit [8] is a toolkit for building and applying statistical language models, primarily for use in speech recognition, statistical tagging and segmentation, and machine translation. It has been under development in the sri Speech Technology and Research Laboratory since 1995. The toolkit consists of a set of c++ libraries, a set of executable programs which perform standard tasks and some scripts that perform minor tasks. I used the ngram and ngram-count programs to create the lm and two scripts to get specific information out of the lm. Some language models that I created with srilm were an interpolated Kneser-Ney discounted tri-gram model on the word level (which was eventually not used because of its large size) and an interpolated 5-gram on the phone level. These lm files were then used to create distributions for the Gaia model.

4.4 copy sets and test sets

These are two small programs I used that help with splitting up large data sets. When combined in a script they can divide a data set into multiple parts, where each part is an integer percentage of the total. Once that is done for one specific file type, they can search for other corresponding files with a different extension and put those together.


Chapter 5

Models

In this chapter I will describe the models developed in this project. These models can be separated into a language model and an acoustic model, which are described separately below. The complete basic model looks like figure 5.1. The acoustic model part is shown in black, the language model part is shown in grey. In all of the models I created, the acoustic model part is the same and only the language models differ. Figure 5.1 shows the model for two Gaia slices to indicate the relations between the variables in time. The numbers next to the variable names indicate to which Gaia slice a variable belongs. The model starts in the first slice and moves to the second slice for the second time-slice of the observation. The second slice is then repeated for every subsequent time-slice of the observation. The reason that the first slice of the model is different is that there is no history at the beginning, which is used by variables in the second slice of the model. The meaning of the variables is explained in the next sections.

5.1 Acoustic Model

For the acoustic model we represent each time-slice of speech by an observation variable O. As the processed audio data contains 39-dimensional feature vectors, this O variable has for each of its states a corresponding 39-dimensional Gaussian probability distribution. With these features as input we can calculate from these pdfs how likely it is that the time-slice observation corresponds with each of the states of the O variable. Once this is known, each of these likely states, corresponding to a small time-slice, should be fit into a larger model on the phone level and/or the word level. How the observation variable is linked to phone and word variables is described below.
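As an aside, the per-state likelihood computation mentioned above can be illustrated with a small sketch. It assumes a single diagonal-covariance Gaussian per state, which is a simplification: Gaia's actual density representation and evaluation may differ (for example with tied mixtures, as discussed later in this chapter).

```cpp
// Illustrative sketch only: one diagonal-covariance Gaussian per O-state.
#include <array>
#include <cmath>

constexpr int kDim = 39;   // 39-dimensional mfcc feature vectors

struct DiagGaussian {
    std::array<double, kDim> mean;
    std::array<double, kDim> var;   // diagonal of the covariance matrix
};

// Log-likelihood of one observation vector under one state's Gaussian.
double logLikelihood(const std::array<double, kDim>& o, const DiagGaussian& g) {
    const double pi = std::acos(-1.0);
    double ll = -0.5 * kDim * std::log(2.0 * pi);
    for (int d = 0; d < kDim; ++d) {
        const double diff = o[d] - g.mean[d];
        ll -= 0.5 * (std::log(g.var[d]) + diff * diff / g.var[d]);
    }
    return ll;   // compared across states to find the most likely ones
}
```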

Figure 5.1: The basic dbn model with the acoustic model part indicated in black and the language model part in grey

Figure 5.2: The dbn acoustic model used for training the S, Fs, M and O variables

The acoustic model as used in the training of the speech recognizer is shown in figure 5.2. This is not exactly the same as the core acoustic part shown in figure 5.1, because we need the extra variables to specify which sounds are processed during training. Because we are only interested in training the acoustics, we pretend that each training sample consists of one word so that the dbn model can remain simple.

The Nw variable represents the position in the sentence. As each phone occupies a position, the first phone is on position 0, the second phone on position 1, etc. The P variable represents the possible phones; each of its possible states corresponds to one of the possible phones I used for the annotated training data. The value of P depends on the Nw variable only, because it is known from the annotation data. The table of P consists of a phone for each position of the sentence. This works the same in training and recognition, though in training the word(s) are known and the probability table for P is thus very small.

In recognition the words are not known and the probability table will consist of the positions of the phones in every word in the lexicon.

Because a sub-phone model is used to better represent speech, the S variable is introduced in the model. The sub-phone model states that a phone is made up of three sub-phones (on-glide, pure, off-glide). These sub-phones are represented by the S variable, which has 3 states per possible phone. Its probability table consists of the transition probabilities between these sub-phone states. The S variable therefore depends on its previous value, on P and on the variable Fs. A phone is finished if its last sub-phone has finished. When that happens the binary variable Fs is triggered, which signals to S, Nw and EOI. Either the next position in the sentence is considered, or, if EOI is also triggered, the end of the input is reached. The EOI variable is also a binary variable; it observes a special flag in the observation file that signals whether the current observation has finished. By using this variable it is ensured that the best matching path through the model always reaches the end of the model and does not, for example, stay in one state for all time-slices.

The O variable is linked to the P, S and M nodes. For each combination of states from these variables the O variable has a state which consists of a 39-dimensional probability density function. In the project there were 144 (3 * 48 * 1) (S * P * M) combinations, because the M variable was fixed to one state. This M variable was introduced to incorporate tied mixture modelling in the system. A tied mixture system uses a single set of Gaussian pdfs shared by all observation states (in this project those 144 combinations). Each observation state then uses a set of weights (mixtures) to create its own specific pdf from this set. The M variable would contain these weights.

5.1.1 Context Files

During training it is known which word(s) and thus which phones are uttered in the observation. To learn those phones in an hmm model, the separate phone hmms are simply pasted together as in figure 2.5 and the learning algorithm is run on the resulting hmm. In the dbn model however, all possible phones and states are represented by the P and S variables respectively in a single model. We thus need to specify in the model which phones are being uttered in the observation. The solution for this in the Gaia toolkit is the use of context files. For each observation file a set of context files is also loaded, which contain the probability distributions for the following variables: Nw, P, and EOI.
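For a hypothetical single-word training utterance with annotation /k a t/ (a made-up example, not taken from the actual training data), the distribution that such a context file encodes for P would essentially be deterministic:

\[
P(P = \text{/k/} \mid Nw = 0) = 1, \qquad
P(P = \text{/a/} \mid Nw = 1) = 1, \qquad
P(P = \text{/t/} \mid Nw = 2) = 1,
\]

with all other phone values receiving probability 0 at those positions.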

5.2 Language Model

A language model in speech recognition is used to better predict the sentences that are uttered. It specifies in what order the words in a sentence are likely to appear by assigning probabilities to each word ordering. Less likely word orderings can then be pruned to reduce the search space and computing effort. At the end of the project it became clear that a large vocabulary combined with a complex language model was computationally too heavy for Gaia. I therefore used different language models to test the speech recognizer with, and these are described here. Although I started the project with a complex language model that gradually became simpler, I describe them here in reverse order because it makes the complex models easier to understand. The data for all the language models came from the cgn data. I used the transcriptions of the entire cgn data to create a large text on which srilm computed the language model statistics I needed to fill the Gaia probability tables with.

5.2.1 Word Uni-gram

The uni-gram model on the word level uses no information from the past to calculate probabilities for the word variable. The probabilities are calculated by simply counting the word occurrences in a large corpus; depending on how often a word occurs it gets a corresponding probability, high for a frequent word, low for a word that does not occur often. The model is shown in figure 5.3. It looks the same as figure 5.1 except for the two grey nodes. The W variable has all the words from the lexicon as its possible states. The Nw variable is the same as discussed in the acoustic model section, but its probability table here contains all the positions of the entire lexicon because this model is used in recognition. For every state of the W variable (for every word) it has a list of all possible positions inside that word and of the position it should change to given the W and Fw variables. The P variable depends on both W and Nw and is explained in the previous section. The reason that it depends on W is that the entire lexicon annotation is contained in the states of P, and it thus needs a value of W (word) and a value of Nw (position) to return the correct phone. The binary variable Fw is triggered when the last phone of a word has finished, that is, when Fs triggers while Nw is in the last position of the word. The EOI variable depends only on Fw, which means that the end of input can only be reached after one or more whole words have finished.
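In its simplest, unsmoothed form, the counting described at the start of this subsection amounts to the relative frequency estimate

\[
P(w) = \frac{c(w)}{\sum_{w'} c(w')},
\]

so a word seen, say, 50,000 times in a corpus of 1,000,000 tokens (hypothetical counts) would get P(w) = 0.05.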

Figure 5.3: Word Uni-gram in the total model.

The variable Nw in the first slice simply starts at zero for every state of W. In the second slice of the model it depends also on W and on the previous values of Fw, Nw and Fs. If Fs is triggered then Nw goes to the next position, otherwise it keeps the same value as the previous Nw. If Fw is also triggered, Nw is reset to zero. I used the same probability table for the W variable in the first and second slice of the model. Because the W variable depends on the previous W and Fw, it needs the two grey dummy variables from figure 5.3 in the first slice for compatibility. When Fw triggers, the uni-gram probability table is used in the calculations; otherwise the W variable copies its previous value. In the first slice of the model the dummy Fw is triggered and the dummy W holds no real information because it is not used.

In the experiments I did, the data set contained only single-word utterances. Therefore I also created a model without the Fw variable, because it was not needed and this makes the model a little less complex. With the Fw variable removed the system will not consider utterances of multiple words and thus only outputs single words. The model looks exactly like figure 5.3 with the Fw variables replaced by the EOI variables and the dummy Fw variable removed.

5.2.2 Word Tri-gram

A commonly used language model in speech recognition is the tri-gram on the word level. The probability of a word depends on the previous two words. Experiments have shown that this model captures enough information to model grammatical sentences reasonably well. I created a tri-gram model with the srilm toolkit and smoothed it with modified Kneser-Ney. Because this model is quite complex to construct in Gaia, it is discussed here in two stages for clarity. The first stage as constructed in Gaia is shown in figure 5.4.

The tri-gram model in figure 5.4 works like the uni-gram model, but because the W variable now depends on its previous values, some extra variables are needed to do the calculations right. Because I use a tri-gram model, W depends on its values two time-slices in the past. Therefore three slices of the model are shown, of which the first two are used once and the last one is repeated indefinitely. Furthermore, because I used the same W probability tables for the W variable in all time-slices, the W variable in the first two time-slices needs previous values that do not exist. Therefore two dummy variables are added in grey which hold no real information. In order not to use this wrong information, the N and E variables are available. The N variable simply counts how many indices we can look back in time (0, 1 or 2), by updating its value only if Fw is triggered, up to a maximum of 2.

Figure 5.4: Part of the word Tri-gram model

The E variable signals the end of the sentence and resets N to 0 when that happens. It is different from the end-of-input variable EOI because the input can contain multiple sentences. Depending on the value of N, the W variable uses a uni-, bi- or tri-gram model and uses 0, 1 or 2 previous W variables. The W dummy variables are thus never used, but are necessary in the xml model for consistency. The third dummy variable is the Fw variable, which signals to the W variable in the first slice that a new word begins. If the Fw variable is 0, the W variable stays in the same state as in the previous time-slice. The distribution of the E variable also comes from the language model created by srilm. From all uni-, bi- and tri-grams I filtered out the ones which ended in the end-of-sentence symbol </s>. I thus used the bi-grams with the </s> symbol to create uni-gram probabilities and the tri-grams with the </s> symbol to create bi-gram probabilities for the E variable.

The problem with the model thus far is that the W variable is updated on the same time scale as the observations. For every 10 msec there will be a new slice of the model, where the W variable will usually have the same value as in the previous slice. The problem occurs when Fw triggers and the tri-gram (or bi-gram) probabilities need to be considered in the calculations. The values of W which are used according to this model are the W values of the previous two 10 ms time-slices (which are usually all the same) instead of the actual previous two words. We thus need a way to store the actual words, and this is done using the model in figure 5.5.

Figure 5.5: Word Tri-gram language model

The difference with figure 5.4 is the addition of the 1W and 2W variables and the corresponding dependencies, which are shown in grey for clarity. These variables store the actual previous words by copying the value from W to 1W and from 1W to 2W when Fw triggers. The 1W and 2W variables start out in dummy states, but because the N variable makes sure that the tri-gram probabilities are only considered after two Fw triggers, these variables will have correct values by then. When Fw is not triggered they copy their values to the next time-slice. The tri-gram probabilities can now be used correctly, because W no longer depends on its previous values but on the 1W and 2W variables when Fw is triggered. When Fw is not triggered W still depends on its previous value, because it copies that value to the next time-slice.

This model should now work if bi- and tri-gram probabilities are available for all possible word combinations. This is because once two words have been recognized, only tri-gram probabilities will be used in the calculations. The same holds for the E variable, because this also applies to the end-of-sentence uni- and bi-gram tables created. Because they are obtained from bi- and tri-grams ending with the </s> word, a sentence can only end with a bi-gram that ends in </s>. It is therefore important that all possible combinations are available. This can be achieved using smoothing.

Smoothing

Usually not all possible tri- or bi-grams occur in the training data. Smoothing is used to give these unseen events (word orderings) a small probability instead of zero probability. Because the corpus on which the language model is trained does not contain all possible word sequences, and some of those may occur during speech, we need to reserve some probability for those events by lowering the probabilities of the seen events. Once we have probabilities for all word sequences we can fill in the entire probability tables for the W variable.

The smoothing algorithm used in the language model is modified Kneser-Ney smoothing, which appears to be the best smoothing algorithm currently available according to [2]. It is implemented by srilm and the precise algorithm can be found in [2]. It is an extension of absolute discounting, which is a smoothing technique that subtracts a small constant amount of probability (D) from each non-zero event. It then distributes that total probability mass evenly over the unseen events. The amount D can be calculated with equation 5.1, where N_1 stands for the total number of uni-grams observed and N_2 for the total number of bi-grams.

D = N_1 / (N_1 + 2 N_2)   (5.1)

Kneser-Ney builds on the idea that the influence of lower-order distributions (like uni-grams) is only important if the higher-order distributions have only a few counts in the training data. Kneser-Ney smoothing therefore looks at the number of contexts a word appears in, instead of the number of times a word appears. [2] motivates this with the example of the bi-gram San Francisco. Most smoothing methods will assign too high a probability to bi-grams that end in Francisco, because Francisco has a high uni-gram count. However, when Francisco appears it is almost always after San. To assign those other bi-grams with Francisco lower probabilities, the uni-gram Francisco should receive a lower probability. The bi-gram San Francisco will not be affected much because it has a high bi-gram probability.

There are two other techniques that also address the problem of unseen events in the training data: backoff and interpolation. Backoff uses a lower-order n-gram probability when no higher-order probability is available. If a tri-gram probability cannot be found for the current word in combination with the previous two words, a bi-gram probability is used. If that also cannot be found, the uni-gram probability is used. This approach is however not possible in the Gaia toolkit due to the way Gaia is constructed. Interpolation is discussed in the next section.
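As a small worked example with hypothetical counts (not taken from the cgn data): if N_1 = 30,000 and N_2 = 10,000, the discount would be

\[
D = \frac{N_1}{N_1 + 2 N_2} = \frac{30{,}000}{30{,}000 + 2 \cdot 10{,}000} = 0.6 .
\]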

5.2.3 Interpolated Word Tri-gram

Interpolation is a technique that is used to better estimate the probabilities of unseen N-grams. The idea follows from the next example. If the bi-grams who are and who art both do not occur in the training data, they can be given an equal amount of probability by smoothing. The bi-gram who are, however, is much more likely to appear because the word are is more common than art. Interpolation uses this information by averaging over all available N-gram orders. The tri-gram P(W_i | W_{i-1}, W_{i-2}) becomes:

P(W_i | W_{i-1}, W_{i-2}) = λ_1 P(W_i | W_{i-1}, W_{i-2}) + λ_2 P(W_i | W_{i-1}) + λ_3 P(W_i)   (5.2)

For the bi-gram example given above this makes the total probability of who are larger, because the uni-gram probability of are is larger than that of art (the bi-gram probability is the same for both). Interpolation can be used in conjunction with smoothing or without it. For the example this means that the bi-gram probability is either a small smoothed amount or zero.

An interpolation structure can be created in the dbn model as in figure 5.6. The interpolation structure is shown in the model from figure 5.4 instead of the model from figure 5.5 for clarity. The structure is very simple: it consists of two λ variables, one connected to the W variable and one connected to the E variable. The λ variable for the W variable has three values, which are the weights for the uni-, bi- and tri-gram probabilities as in equation 5.2. The λ variable for the E variable works the same way but has only two values, because the E variable works with bi- and uni-grams only. The values for the λ weights can be obtained by training this model on data while keeping all other variables static. This can be done easily with the Gaia toolkit.
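A small worked version of the who are / who art example, using only the bi- and uni-gram terms of equation 5.2 and with made-up weights and probabilities purely for illustration: suppose λ_2 = 0.3, λ_3 = 0.1, the smoothed bi-gram probability is 0.0001 for both continuations, and the uni-gram probabilities are P(are) = 0.002 and P(art) = 0.00001. Then

\[
P(\textit{are} \mid \textit{who}) = 0.3 \cdot 0.0001 + 0.1 \cdot 0.002 = 0.00023, \qquad
P(\textit{art} \mid \textit{who}) = 0.3 \cdot 0.0001 + 0.1 \cdot 0.00001 = 0.000031,
\]

so the more common continuation indeed receives the larger interpolated probability.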

Figure 5.6: Language model with Interpolation construction

5.2.4 Phone Recognizer

Instead of recognizing words I also created a model that recognizes phones. The advantage of this model is that no lexicon is needed, which leads to a smaller model, and that it can recognize every possible word; it is not bounded by the lexicon. The disadvantage is that this also means that the model can recognize non-existing words and that the information obtained from the lexicon (which phones are likely to follow each other) is not available. This information can however be obtained by using a language model on the phone variable. Because there are only around 40 phones used, there is enough data in the cgn corpus to train an n-gram without smoothing. The small phone set also enables the use of larger n-grams, because even a tri-gram would give only 40^3 = 64,000 possible combinations.

Figure 5.7 shows two slices of the model, which is enough to explain how the model works without the picture becoming too cluttered. The P, Fs, S, M and O variables are present as in the previous models. The language model is built around the P variable, like the W variable in the tri-gram word model. The 1P and 2P variables store the actual previous values of P in the tri-gram model and the N variable keeps track of how many previous P values we can look back in time, up to a maximum of two in this case. These variables now depend on the Fs variable, because this signals when a phone has finished. When Fs does not signal, P, N and all the P variables copy their values to the next time-slice. If Fs signals, the N variable increases its value (if possible), the 1P variable copies the value from the P variable and the 2P variable copies the value from the 1P variable. The EOI variable is also connected to the Fs variable, such that at the end of the input only whole phones can have been recognized. In the next time-slice the P variable can, according to the value of N, use the correct previous values of P in the calculations of the inference algorithm. Interpolation is done using the same λ construction as described in the previous section, but on the phone level.

Figure 5.7: Phone recognizer with Tri-gram
