Chapter 8

MEMMs (Log-Linear Tagging Models)

8.1 Introduction

In this chapter we return to the problem of tagging. We previously described hidden Markov models (HMMs) for tagging problems. This chapter describes a powerful alternative to HMMs, log-linear tagging models, which build directly on ideas from log-linear models. A key advantage of log-linear tagging models is that they allow highly flexible representations, making it easy to integrate features into the model.

Log-linear tagging models are sometimes referred to as maximum entropy Markov models (MEMMs).[1] We will use the terms MEMM and log-linear tagging model interchangeably in this chapter. The name MEMM was first introduced by McCallum et al. (2000).

[1] This name is used because: 1) log-linear models are also referred to as maximum entropy models, as it can be shown in the unregularized case that the maximum likelihood estimates maximize an entropic measure subject to certain linear constraints; 2) as we will see shortly, MEMMs make a Markov assumption that is closely related to the Markov assumption used in HMMs.

Log-linear tagging models are conditional tagging models. Recall that a generative tagging model defines a joint distribution $p(x_1 \ldots x_n, y_1 \ldots y_n)$ over sentences $x_1 \ldots x_n$ paired with tag sequences $y_1 \ldots y_n$. In contrast, a conditional tagging model defines a conditional distribution $p(y_1 \ldots y_n \mid x_1 \ldots x_n)$, the probability of the tag sequence $y_1 \ldots y_n$ conditioned on the input sentence $x_1 \ldots x_n$. We give the following definition:

Definition 1 (Conditional Tagging Models) A conditional tagging model consists of:

- A set of words $V$ (this set may be finite, countably infinite, or even uncountably infinite).

- A finite set of tags $K$.

- A function $p(y_1 \ldots y_n \mid x_1 \ldots x_n)$ such that:

  1. For any $\langle x_1 \ldots x_n, y_1 \ldots y_n \rangle \in S$,
     $$p(y_1 \ldots y_n \mid x_1 \ldots x_n) \geq 0$$
     where $S$ is the set of all sequence/tag-sequence pairs $\langle x_1 \ldots x_n, y_1 \ldots y_n \rangle$ such that $n \geq 1$, $x_i \in V$ for $i = 1 \ldots n$, and $y_i \in K$ for $i = 1 \ldots n$.

  2. For any $x_1 \ldots x_n$ such that $n \geq 1$ and $x_i \in V$ for $i = 1 \ldots n$,
     $$\sum_{y_1 \ldots y_n \in Y(n)} p(y_1 \ldots y_n \mid x_1 \ldots x_n) = 1$$
     where $Y(n)$ is the set of all tag sequences $y_1 \ldots y_n$ such that $y_i \in K$ for $i = 1 \ldots n$.

Given a conditional tagging model, the function from sentences $x_1 \ldots x_n$ to tag sequences $y_1 \ldots y_n$ is defined as
$$f(x_1 \ldots x_n) = \arg\max_{y_1 \ldots y_n \in Y(n)} p(y_1 \ldots y_n \mid x_1 \ldots x_n)$$
Thus for any input $x_1 \ldots x_n$, we take the highest probability tag sequence as the output from the model.

We are left with the following three questions:

- How do we define a conditional tagging model $p(y_1 \ldots y_n \mid x_1 \ldots x_n)$?

- How do we estimate the parameters of the model from training examples?

- How do we efficiently find
  $$\arg\max_{y_1 \ldots y_n \in Y(n)} p(y_1 \ldots y_n \mid x_1 \ldots x_n)$$
  for any input $x_1 \ldots x_n$?
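The third question is worth making concrete. The following sketch (in Python; all names are hypothetical and not from the original text) computes the arg max by brute force, enumerating all $|K|^n$ tag sequences under an arbitrary conditional model. This is exponential in the sentence length, which is exactly what the dynamic-programming decoder described later in this chapter avoids.

```python
from itertools import product

def brute_force_decode(sentence, tags, cond_prob):
    """Return the tag sequence maximizing p(y_1..y_n | x_1..x_n).

    cond_prob(tag_seq, sentence) can be any function implementing a
    conditional tagging model. The search enumerates all |K|^n tag
    sequences, so it is only feasible for very short inputs.
    """
    best_seq, best_score = None, float("-inf")
    for tag_seq in product(tags, repeat=len(sentence)):
        score = cond_prob(tag_seq, sentence)
        if score > best_score:
            best_seq, best_score = tag_seq, score
    return list(best_seq)

# Toy usage, with an (unnormalized) scoring function standing in for p:
def toy_model(tag_seq, sentence):
    prefs = {"the": "DT", "dog": "NN", "barks": "VBZ"}
    return sum(1.0 for w, t in zip(sentence, tag_seq) if prefs.get(w) == t)

print(brute_force_decode(["the", "dog", "barks"], ["DT", "NN", "VBZ"], toy_model))
```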

The remainder of this chapter describes how MEMMs give a solution to these questions. In brief, a log-linear model is used to define a conditional tagging model. The parameters of the model can be estimated using standard methods for parameter estimation in log-linear models. MEMMs make a Markov independence assumption that is closely related to the Markov independence assumption in HMMs, and that allows the arg max above to be computed efficiently using dynamic programming.

8.2 Trigram MEMMs

This section describes the model form for MEMMs. We focus on trigram MEMMs, which make a second-order Markov assumption, where each tag depends on the previous two tags. The generalization to other Markov orders, for example first-order (bigram) or third-order (four-gram) MEMMs, is straightforward.

Our task is to model the conditional distribution
$$P(Y_1 = y_1 \ldots Y_n = y_n \mid X_1 = x_1 \ldots X_n = x_n)$$
for any input sequence $x_1 \ldots x_n$ paired with a tag sequence $y_1 \ldots y_n$. We first use the following decomposition:
$$P(Y_1 = y_1 \ldots Y_n = y_n \mid X_1 = x_1 \ldots X_n = x_n)$$
$$= \prod_{i=1}^{n} P(Y_i = y_i \mid X_1 = x_1 \ldots X_n = x_n, Y_1 = y_1 \ldots Y_{i-1} = y_{i-1})$$
$$= \prod_{i=1}^{n} P(Y_i = y_i \mid X_1 = x_1 \ldots X_n = x_n, Y_{i-2} = y_{i-2}, Y_{i-1} = y_{i-1})$$
The first equality is exact, by the chain rule of probabilities. The second equality makes use of a trigram independence assumption, namely that for any $i \in \{1 \ldots n\}$,
$$P(Y_i = y_i \mid X_1 = x_1 \ldots X_n = x_n, Y_1 = y_1 \ldots Y_{i-1} = y_{i-1}) = P(Y_i = y_i \mid X_1 = x_1 \ldots X_n = x_n, Y_{i-2} = y_{i-2}, Y_{i-1} = y_{i-1})$$
Here we assume that $y_0 = y_{-1} = *$, where $*$ is a special symbol in the model denoting the start of a sequence.

Thus we assume that the random variable $Y_i$ is independent of the values for $Y_1 \ldots Y_{i-3}$, once we condition on the entire input sequence $X_1 \ldots X_n$ and the previous two tags $Y_{i-2}$ and $Y_{i-1}$. This is a trigram independence assumption, where each tag depends only on the previous two tags. We will see that this independence assumption allows us to use dynamic programming to efficiently find the highest probability tag sequence for any input sentence $x_1 \ldots x_n$.

Note that there is some resemblance to the independence assumption made in trigram HMMs, namely that
$$P(Y_i = y_i \mid Y_1 = y_1 \ldots Y_{i-1} = y_{i-1}) = P(Y_i = y_i \mid Y_{i-2} = y_{i-2}, Y_{i-1} = y_{i-1})$$
The key difference is that we now condition on the entire input sequence $x_1 \ldots x_n$, in addition to the previous two tags $y_{i-2}$ and $y_{i-1}$.

The final step is to use a log-linear model to estimate the probability
$$P(Y_i = y_i \mid X_1 = x_1 \ldots X_n = x_n, Y_{i-2} = y_{i-2}, Y_{i-1} = y_{i-1})$$
For any pair of sequences $x_1 \ldots x_n$ and $y_1 \ldots y_n$, we define the $i$-th history $h_i$ to be the four-tuple
$$h_i = \langle y_{i-2}, y_{i-1}, x_1 \ldots x_n, i \rangle$$
Thus $h_i$ captures the conditioning information for tag $y_i$ in the sequence, in addition to the position $i$ in the sequence. We assume that we have a feature-vector representation $f(h_i, y) \in \mathbb{R}^d$ for any history $h_i$ paired with any tag $y \in K$. The feature vector could potentially take into account any information in the history $h_i$ and the tag $y$. As one example, we might have features
$$f_1(h_i, y) = \begin{cases} 1 & \text{if } x_i = \textit{the} \text{ and } y = \text{DT} \\ 0 & \text{otherwise} \end{cases}$$
$$f_2(h_i, y) = \begin{cases} 1 & \text{if } y_{i-1} = \text{V and } y = \text{DT} \\ 0 & \text{otherwise} \end{cases}$$
Section 8.3 describes a much more complete set of example features.

Finally, we assume a parameter vector $\theta \in \mathbb{R}^d$, and that
$$P(Y_i = y_i \mid X_1 = x_1 \ldots X_n = x_n, Y_{i-2} = y_{i-2}, Y_{i-1} = y_{i-1}) = \frac{\exp(\theta \cdot f(h_i, y_i))}{\sum_{y \in K} \exp(\theta \cdot f(h_i, y))}$$
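To illustrate this last equation, the following sketch (Python; the type, function, and feature names are hypothetical, not from the original text) computes $p(y \mid h; \theta)$ for a sparse, binary feature representation, storing features as string keys and $\theta$ as a dictionary.

```python
import math
from collections import namedtuple

# A history <y_{-2}, y_{-1}, x_1..x_n, i>; positions i are 1-based.
History = namedtuple("History", ["t2", "t1", "sentence", "i"])

def feature_vector(h, y):
    """Return the active (binary) features for history h and tag y,
    as a set of string keys. A tiny illustrative feature set."""
    word = h.sentence[h.i - 1]
    return {
        f"WORD:TAG={word}:{y}",
        f"BIGRAM={h.t1},{y}",
        f"TRIGRAM={h.t2},{h.t1},{y}",
    }

def log_linear_prob(h, y, tags, theta):
    """p(y | h; theta) = exp(theta . f(h,y)) / sum_y' exp(theta . f(h,y'))."""
    def score(tag):
        return sum(theta.get(feat, 0.0) for feat in feature_vector(h, tag))
    scores = {tag: score(tag) for tag in tags}
    m = max(scores.values())                      # subtract max for stability
    z = sum(math.exp(s - m) for s in scores.values())
    return math.exp(scores[y] - m) / z
```

Any feature function with the same interface could be substituted for `feature_vector`; the normalization over the tag set $K$ is what makes this a well-formed conditional distribution.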

Putting this all together gives the following:

Definition 2 (Trigram MEMMs) A trigram MEMM consists of:

- A set of words $V$ (this set may be finite, countably infinite, or even uncountably infinite).

- A finite set of tags $K$.

- Given $V$ and $K$, define $H$ to be the set of all possible histories. The set $H$ contains all four-tuples of the form $\langle y_{-2}, y_{-1}, x_1 \ldots x_n, i \rangle$, where $y_{-2} \in K \cup \{*\}$, $y_{-1} \in K \cup \{*\}$, $n \geq 1$, $x_i \in V$ for $i = 1 \ldots n$, and $i \in \{1 \ldots n\}$. Here $*$ is a special start symbol.

- An integer $d$ specifying the number of features in the model.

- A function $f : H \times K \rightarrow \mathbb{R}^d$ specifying the features in the model.

- A parameter vector $\theta \in \mathbb{R}^d$.

Given these components we define the conditional tagging model
$$p(y_1 \ldots y_n \mid x_1 \ldots x_n) = \prod_{i=1}^{n} p(y_i \mid h_i; \theta)$$
where $h_i = \langle y_{i-2}, y_{i-1}, x_1 \ldots x_n, i \rangle$, and
$$p(y_i \mid h_i; \theta) = \frac{\exp(\theta \cdot f(h_i, y_i))}{\sum_{y \in K} \exp(\theta \cdot f(h_i, y))}$$

At this point there are a number of questions. How do we define the feature vectors $f(h_i, y)$? How do we learn the parameters $\theta$ from training data? How do we find the highest probability tag sequence
$$\arg\max_{y_1 \ldots y_n} p(y_1 \ldots y_n \mid x_1 \ldots x_n)$$
for an input sequence $x_1 \ldots x_n$? The following sections answer these questions.

8.3 Features in Trigram MEMMs

Recall that the feature vector definition in a trigram MEMM is a function $f(h, y) \in \mathbb{R}^d$ where $h = \langle y_{-2}, y_{-1}, x_1 \ldots x_n, i \rangle$ is a history, $y \in K$ is a tag, and $d$ is an integer specifying the number of features in the model. Each feature $f_j(h, y)$ for $j \in \{1 \ldots d\}$ can potentially be sensitive to any information in the history $h$ in conjunction with the tag $y$.

This will lead to a great deal of flexibility in the model. This is the primary advantage of trigram MEMMs over trigram HMMs for tagging: a much richer set of features can be employed for the tagging task.

In this section we give an example of how features can be defined for the part-of-speech (POS) tagging problem. The features we describe are taken from Ratnaparkhi (1996); Ratnaparkhi's experiments show that they give competitive performance on the POS tagging problem for English. Throughout this section we assume that the history $h$ is a four-tuple $\langle y_{-2}, y_{-1}, x_1 \ldots x_n, i \rangle$. The features are as follows:

Word/tag features. One example word/tag feature is the following:
$$f_{100}(h, y) = \begin{cases} 1 & \text{if } x_i \text{ is } \textit{base} \text{ and } y = \text{VB} \\ 0 & \text{otherwise} \end{cases}$$
This feature is sensitive to the word being tagged, $x_i$, and the proposed tag for that word, $y$. In practice we would introduce features of this form for a very large set of word/tag pairs, in addition to the pair base/VB. For example, we could introduce features of this form for all word/tag pairs seen in training data. This class of feature allows the model to capture the tendency for particular words to take particular parts of speech. In this sense it plays an analogous role to the emission parameters $e(x \mid y)$ in a trigram HMM. For example, given the definition of $f_{100}$ above, a large positive value for $\theta_{100}$ will indicate that base is very likely to be tagged as a VB; conversely, a highly negative value will indicate that this particular word/tag pairing is unlikely.

Prefix and suffix features. An example of a suffix feature is as follows:
$$f_{101}(h, y) = \begin{cases} 1 & \text{if } x_i \text{ ends in } \textit{ing} \text{ and } y = \text{VBG} \\ 0 & \text{otherwise} \end{cases}$$
This feature is sensitive to the suffix of the word being tagged, $x_i$, and the proposed tag $y$. In practice we would introduce a large number of features of this form. For example, in Ratnaparkhi's POS tagger, all suffixes seen in training data up to four letters in length were introduced as features (in combination with all possible tags). An example of a prefix feature is as follows:
$$f_{102}(h, y) = \begin{cases} 1 & \text{if } x_i \text{ starts with } \textit{pre} \text{ and } y = \text{NN} \\ 0 & \text{otherwise} \end{cases}$$
Again, a large number of prefix features would be used; Ratnaparkhi's POS tagger employs features for all prefixes up to length four seen in training data.
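As a concrete illustration of how such features could be generated from training data, the following sketch (Python; the function and feature-name conventions are hypothetical, not Ratnaparkhi's implementation) collects word/tag, suffix/tag, and prefix/tag feature names from a tagged corpus, with affixes limited to four characters as described above.

```python
def collect_affix_features(tagged_corpus, max_affix_len=4):
    """Build the set of word/tag, suffix/tag and prefix/tag feature names
    seen in a corpus of (word_sequence, tag_sequence) pairs."""
    features = set()
    for sentence, tags in tagged_corpus:
        for word, tag in zip(sentence, tags):
            features.add(f"WORD:TAG={word}:{tag}")
            for k in range(1, min(max_affix_len, len(word)) + 1):
                features.add(f"SUFFIX:TAG={word[-k:]}:{tag}")
                features.add(f"PREFIX:TAG={word[:k]}:{tag}")
    return features

# Example usage on a one-sentence toy corpus:
corpus = [(["the", "racing", "cars"], ["DT", "VBG", "NNS"])]
print(sorted(collect_affix_features(corpus))[:5])
```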

Prefix and suffix features are very useful for POS tagging and other tasks in English and many other languages. For example, the suffix ing is frequently seen with the tag VBG in the Penn treebank, which is the tag used for gerunds; there are many other examples. Crucially, it is very straightforward to introduce prefix and suffix features to the model. This is in contrast with trigram HMMs for tagging, where we used the idea of mapping low-frequency words to pseudo-words capturing spelling features. The integration of spelling features, for example prefix and suffix features, in log-linear models is much less ad-hoc than the method we described for HMMs. Spelling features are in practice very useful when tagging words in test data that are infrequent or not seen at all in training data.

Trigram, bigram and unigram tag features. An example of a trigram tag feature is as follows:
$$f_{103}(h, y) = \begin{cases} 1 & \text{if } \langle y_{-2}, y_{-1}, y \rangle = \langle \text{DT, JJ, VB} \rangle \\ 0 & \text{otherwise} \end{cases}$$
This feature, in combination with the associated parameter $\theta_{103}$, plays an analogous role to the $q(\text{VB} \mid \text{DT, JJ})$ parameter in a trigram HMM, allowing the model to learn whether the tag trigram DT, JJ, VB is likely or unlikely. Features of this form are introduced for a large number of tag trigrams, for example all tag trigrams seen in training data.

The following two features are examples of bigram and unigram tag features:
$$f_{104}(h, y) = \begin{cases} 1 & \text{if } \langle y_{-1}, y \rangle = \langle \text{JJ, VB} \rangle \\ 0 & \text{otherwise} \end{cases}$$
$$f_{105}(h, y) = \begin{cases} 1 & \text{if } y = \text{VB} \\ 0 & \text{otherwise} \end{cases}$$
The first feature allows the model to learn whether the tag bigram JJ VB is likely or unlikely; the second feature allows the model to learn whether the tag VB is likely or unlikely. Again, a large number of bigram and unigram tag features are typically introduced in the model.

Bigram and unigram features may at first glance seem redundant, given the obvious overlap with trigram features. For example, it might seem that features $f_{104}$ and $f_{105}$ are subsumed by feature $f_{103}$. Specifically, given parameters $\theta_{103}$, $\theta_{104}$ and $\theta_{105}$, we can redefine $\theta_{103}$ to be equal to $\theta_{103} + \theta_{104} + \theta_{105}$, and set $\theta_{104} = \theta_{105} = 0$, giving exactly the same distribution $p(y \mid x; \theta)$ for the two parameter settings.[2]

Thus features $f_{104}$ and $f_{105}$ can apparently be eliminated. However, when used in conjunction with regularized approaches to parameter estimation, the bigram and unigram features play an important role that is analogous to the use of backed-off estimates $q_{ML}(\text{VB} \mid \text{JJ})$ and $q_{ML}(\text{VB})$ in the smoothed estimation techniques seen previously in the class. Roughly speaking, with regularization, if the trigram in feature $f_{103}$ is infrequent, then the value for $\theta_{103}$ will not grow too large in magnitude, and the parameter values $\theta_{104}$ and $\theta_{105}$, which are estimated based on more examples, will play a more important role.

[2] To be precise, this argument is correct in the case where for every unigram and bigram feature there is at least one trigram feature that subsumes it. The reassignment of parameter values would be applied to all trigrams.

Other contextual features. Ratnaparkhi also used features which consider the word before or after $x_i$, in conjunction with the proposed tag. Example features are as follows:
$$f_{106}(h, y) = \begin{cases} 1 & \text{if previous word } x_{i-1} = \textit{the} \text{ and } y = \text{VB} \\ 0 & \text{otherwise} \end{cases}$$
$$f_{107}(h, y) = \begin{cases} 1 & \text{if next word } x_{i+1} = \textit{the} \text{ and } y = \text{VB} \\ 0 & \text{otherwise} \end{cases}$$
Again, many such features would be introduced. These features add additional context to the model, introducing dependencies between $x_{i-1}$ or $x_{i+1}$ and the proposed tag. Note again that it is not at all obvious how to introduce contextual features of this form to trigram HMMs.

Other features. We have described the main features in Ratnaparkhi's model. Additional features that he includes are: 1) spelling features which consider whether the word $x_i$ being tagged contains a number, contains a hyphen, or contains an upper-case letter; 2) contextual features that consider the words at $x_{i-2}$ and $x_{i+2}$ in conjunction with the current tag $y$.
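Pulling the feature classes of this section together, the following sketch (Python; a simplified, hypothetical variant in the spirit of Ratnaparkhi's feature set, not his exact implementation) maps a history and a candidate tag to the set of active binary features, reusing the `History` tuple from the earlier sketch. The resulting feature names can be scored with a parameter vector exactly as in Section 8.2.

```python
def active_features(h, y, max_affix_len=4):
    """Return the active binary features for history h = History(t2, t1, sentence, i)
    (1-based position i) and candidate tag y. Feature names are illustrative."""
    sent, i = h.sentence, h.i
    word = sent[i - 1]
    feats = {
        f"WORD:TAG={word}:{y}",                        # word/tag feature
        f"TRIGRAM={h.t2},{h.t1},{y}",                  # tag trigram
        f"BIGRAM={h.t1},{y}",                          # tag bigram
        f"UNIGRAM={y}",                                # tag unigram
        f"PREV_WORD={sent[i - 2] if i > 1 else '*'}:{y}",
        f"NEXT_WORD={sent[i] if i < len(sent) else 'STOP'}:{y}",
    }
    for k in range(1, min(max_affix_len, len(word)) + 1):
        feats.add(f"SUFFIX={word[-k:]}:{y}")           # suffixes up to length 4
        feats.add(f"PREFIX={word[:k]}:{y}")            # prefixes up to length 4
    if any(c.isdigit() for c in word):
        feats.add(f"HAS_DIGIT:{y}")                    # spelling features
    if "-" in word:
        feats.add(f"HAS_HYPHEN:{y}")
    if any(c.isupper() for c in word):
        feats.add(f"HAS_UPPER:{y}")
    return feats
```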

8.4 Parameter Estimation in Trigram MEMMs

Parameter estimation in trigram MEMMs can be performed using the parameter estimation methods for log-linear models described in the previous chapter. The training data is a set of $m$ examples $(x^{(k)}, y^{(k)})$ for $k = 1 \ldots m$, where each $x^{(k)}$ is a sequence of words $x_1^{(k)} \ldots x_{n_k}^{(k)}$, and each $y^{(k)}$ is a sequence of tags $y_1^{(k)} \ldots y_{n_k}^{(k)}$. Here $n_k$ is the length of the $k$-th sentence or tag sequence.

For any $k \in \{1 \ldots m\}$, $i \in \{1 \ldots n_k\}$, define the history $h_i^{(k)}$ to be equal to $\langle y_{i-2}^{(k)}, y_{i-1}^{(k)}, x_1^{(k)} \ldots x_{n_k}^{(k)}, i \rangle$. We assume that we have a feature vector definition $f(h, y) \in \mathbb{R}^d$ for some integer $d$, and hence that
$$p(y_i^{(k)} \mid h_i^{(k)}; \theta) = \frac{\exp\left(\theta \cdot f(h_i^{(k)}, y_i^{(k)})\right)}{\sum_{y \in K} \exp\left(\theta \cdot f(h_i^{(k)}, y)\right)}$$
The regularized log-likelihood function is then
$$L(\theta) = \sum_{k=1}^{m} \sum_{i=1}^{n_k} \log p(y_i^{(k)} \mid h_i^{(k)}; \theta) - \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$
The first term is the log-likelihood of the data under parameters $\theta$. The second term is a regularization term, which penalizes large parameter values. The positive parameter $\lambda$ dictates the relative weight of the two terms. An optimization method is used to find the parameters $\theta^*$ that maximize this function:
$$\theta^* = \arg\max_{\theta \in \mathbb{R}^d} L(\theta)$$
In summary, the estimation method is a direct application of the method described in the previous chapter for parameter estimation in log-linear models.
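To make the objective concrete, the following sketch (Python; names are hypothetical) evaluates the regularized log-likelihood $L(\theta)$ for a sparse feature representation, reusing the `History` tuple from the earlier sketch; `feats` can be any feature function with the interface of the `active_features` sketch above. In practice a gradient-based optimizer (for example L-BFGS) would be applied to this objective together with its gradient.

```python
import math

def regularized_log_likelihood(data, tags, theta, lam, feats):
    """L(theta) = sum_k sum_i log p(y_i^k | h_i^k; theta) - (lam/2) * ||theta||^2.

    data:  list of (sentence, tag_sequence) pairs.
    theta: dict mapping feature names to real values (sparse parameter vector).
    feats: function (history, tag) -> set of active feature names.
    """
    ll = 0.0
    for sentence, tag_seq in data:
        padded = ["*", "*"] + list(tag_seq)            # y_{-1} = y_0 = *
        for i in range(1, len(sentence) + 1):
            h = History(padded[i - 1], padded[i], sentence, i)
            scores = {t: sum(theta.get(f, 0.0) for f in feats(h, t)) for t in tags}
            m = max(scores.values())
            log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
            ll += scores[tag_seq[i - 1]] - log_z       # log p(y_i | h_i; theta)
    return ll - 0.5 * lam * sum(v * v for v in theta.values())
```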

8.5 Decoding with MEMMs: Another Application of the Viterbi Algorithm

We now turn to the problem of finding the most likely tag sequence for an input sequence $x_1 \ldots x_n$ under a trigram MEMM; that is, the problem of finding
$$\arg\max_{y_1 \ldots y_n} p(y_1 \ldots y_n \mid x_1 \ldots x_n)$$
where
$$p(y_1 \ldots y_n \mid x_1 \ldots x_n) = \prod_{i=1}^{n} p(y_i \mid h_i; \theta)$$
and $h_i = \langle y_{i-2}, y_{i-1}, x_1 \ldots x_n, i \rangle$.

First, note the similarity to the decoding problem for a trigram HMM tagger, which is to find
$$\arg\max_{y_1 \ldots y_n} \; q(\text{STOP} \mid y_{n-1}, y_n) \prod_{i=1}^{n} q(y_i \mid y_{i-2}, y_{i-1}) \, e(x_i \mid y_i)$$
Putting aside the $q(\text{STOP} \mid y_{n-1}, y_n)$ term, we have essentially replaced $q(y_i \mid y_{i-2}, y_{i-1}) \, e(x_i \mid y_i)$ by $p(y_i \mid h_i; \theta)$.

The similarity of the two decoding problems leads to the decoding algorithm for trigram MEMMs being very similar to the decoding algorithm for trigram HMMs. We again use dynamic programming. The algorithm is shown in figure 8.1. The base case of the dynamic program is identical to the base case for trigram HMMs, namely
$$\pi(0, *, *) = 1$$
The recursive case is
$$\pi(k, u, v) = \max_{w \in K_{k-2}} \left( \pi(k-1, w, u) \times p(v \mid h; \theta) \right)$$
where $h = \langle w, u, x_1 \ldots x_n, k \rangle$. (We again define $K_{-1} = K_0 = \{*\}$, and $K_k = K$ for $k = 1 \ldots n$.) Recall that the recursive case for a trigram HMM tagger is
$$\pi(k, u, v) = \max_{w \in K_{k-2}} \left( \pi(k-1, w, u) \times q(v \mid w, u) \times e(x_k \mid v) \right)$$
Hence we have simply replaced $q(v \mid w, u) \times e(x_k \mid v)$ by $p(v \mid h; \theta)$.

The justification for the algorithm is very similar to the justification of the Viterbi algorithm for trigram HMMs. In particular, it can be shown that for all $k \in \{0 \ldots n\}$, $u \in K_{k-1}$, $v \in K_k$,
$$\pi(k, u, v) = \max_{y_{-1} \ldots y_k \in S(k, u, v)} \prod_{i=1}^{k} p(y_i \mid h_i; \theta)$$
where $S(k, u, v)$ is the set of all sequences $y_{-1} y_0 y_1 \ldots y_k$ such that $y_{-1} = y_0 = *$, $y_i \in K_i$ for $i = 1 \ldots k$, and $y_{k-1} = u$, $y_k = v$.

Input: A sentence $x_1 \ldots x_n$. A set of possible tags $K$. A model (for example a log-linear model) that defines a probability $p(y \mid h; \theta)$ for any $\langle h, y \rangle$ pair, where $h$ is a history of the form $\langle y_{-2}, y_{-1}, x_1 \ldots x_n, i \rangle$ and $y \in K$.

Definitions: Define $K_{-1} = K_0 = \{*\}$, and $K_k = K$ for $k = 1 \ldots n$.

Initialization: Set $\pi(0, *, *) = 1$.

Algorithm:

- For $k = 1 \ldots n$,
  - For $u \in K_{k-1}$, $v \in K_k$,
    $$\pi(k, u, v) = \max_{w \in K_{k-2}} \left( \pi(k-1, w, u) \times p(v \mid h; \theta) \right)$$
    $$bp(k, u, v) = \arg\max_{w \in K_{k-2}} \left( \pi(k-1, w, u) \times p(v \mid h; \theta) \right)$$
    where $h = \langle w, u, x_1 \ldots x_n, k \rangle$.
- Set $(y_{n-1}, y_n) = \arg\max_{u \in K_{n-1}, v \in K_n} \pi(n, u, v)$
- For $k = (n-2) \ldots 1$, $y_k = bp(k+2, y_{k+1}, y_{k+2})$
- Return the tag sequence $y_1 \ldots y_n$

Figure 8.1: The Viterbi Algorithm with backpointers.
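The following sketch (Python; function names are hypothetical) implements the algorithm of figure 8.1 directly. It reuses the `History` tuple from the earlier sketch and accepts any probability function `prob(v, h)`, for example a wrapper around the `log_linear_prob` sketch from Section 8.2.

```python
def viterbi_decode(sentence, tags, prob):
    """Viterbi decoding for a trigram MEMM, following figure 8.1.

    prob(v, h) should return p(v | h; theta) for a history
    h = History(w, u, sentence, k), with 1-based position k.
    """
    n = len(sentence)
    def K(k):                                   # K_{-1} = K_0 = {*}, K_k = K otherwise
        return ["*"] if k <= 0 else tags

    pi = {(0, "*", "*"): 1.0}                   # base case
    bp = {}
    for k in range(1, n + 1):
        for u in K(k - 1):
            for v in K(k):
                best_w, best_score = None, float("-inf")
                for w in K(k - 2):
                    h = History(w, u, sentence, k)
                    score = pi[(k - 1, w, u)] * prob(v, h)
                    if score > best_score:
                        best_w, best_score = w, score
                pi[(k, u, v)] = best_score
                bp[(k, u, v)] = best_w

    # Recover the final two tags, then follow backpointers.
    (u, v) = max(((u, v) for u in K(n - 1) for v in K(n)),
                 key=lambda uv: pi[(n, uv[0], uv[1])])
    y = [None] * (n + 1)                        # 1-based tag positions
    y[n], y[n - 1] = v, u
    for k in range(n - 2, 0, -1):
        y[k] = bp[(k + 2, y[k + 1], y[k + 2])]
    return y[1:]

# Example: decode with the log-linear model sketched in Section 8.2.
# tag_seq = viterbi_decode(sentence, tags,
#                          lambda v, h: log_linear_prob(h, v, tags, theta))
```

In practice the products of probabilities would be computed in log space to avoid numerical underflow on long sentences.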

8.6 Summary

To summarize, the main ideas behind trigram MEMMs are the following:

1. We derive the model by first making the independence assumption
   $$P(Y_i = y_i \mid X_1 = x_1 \ldots X_n = x_n, Y_1 = y_1 \ldots Y_{i-1} = y_{i-1}) = P(Y_i = y_i \mid X_1 = x_1 \ldots X_n = x_n, Y_{i-2} = y_{i-2}, Y_{i-1} = y_{i-1})$$
   and then assuming that
   $$P(Y_i = y_i \mid X_1 = x_1 \ldots X_n = x_n, Y_{i-2} = y_{i-2}, Y_{i-1} = y_{i-1}) = p(y_i \mid h_i; \theta) = \frac{\exp\{\theta \cdot f(h_i, y_i)\}}{\sum_{y} \exp\{\theta \cdot f(h_i, y)\}}$$
   where $h_i = \langle y_{i-2}, y_{i-1}, x_1 \ldots x_n, i \rangle$, $f(h, y) \in \mathbb{R}^d$ is a feature vector, and $\theta \in \mathbb{R}^d$ is a parameter vector.

2. The parameters $\theta$ can be estimated using standard methods for parameter estimation in log-linear models, for example by optimizing a regularized log-likelihood function.

3. The decoding problem is to find
   $$\arg\max_{y_1 \ldots y_n} \prod_{i=1}^{n} p(y_i \mid h_i; \theta)$$
   This problem can be solved by dynamic programming, using a variant of the Viterbi algorithm. The algorithm is closely related to the Viterbi algorithm for trigram HMMs.

4. The feature vector $f(h, y)$ can be sensitive to a wide range of information in the history $h$ in conjunction with the tag $y$. This is the primary advantage of MEMMs over HMMs for tagging: it is much more straightforward and direct to introduce features into the model.
