Mixture-of-Parents Maximum Entropy Markov Models


David S. Rosenberg, Department of Statistics, University of California, Berkeley
Dan Klein, Computer Science Division, University of California, Berkeley
Ben Taskar, Computer and Information Science, University of Pennsylvania

Abstract

We present the mixture-of-parents maximum entropy Markov model (MoP-MEMM), a class of directed graphical models extending MEMMs. The MoP-MEMM allows tractable incorporation of long-range dependencies between nodes by restricting the conditional distribution of each node to be a mixture of distributions given the parents. We show how to efficiently compute the exact marginal posterior node distributions, regardless of the range of the dependencies. This enables us to model non-sequential correlations present within text documents, as well as between interconnected documents, such as hyperlinked web pages. We apply the MoP-MEMM to a named entity recognition task and a web page classification task. In each, our model shows significant improvement over the basic MEMM, and is competitive with other long-range sequence models that use approximate inference.

1 Introduction

Two very popular and effective techniques for sequence labeling tasks, such as part-of-speech tagging, are maximum entropy Markov models (MEMMs), introduced by McCallum et al. [2000], and linear-chain conditional random fields (CRFs), introduced by Lafferty et al. [2001]. Neither of these models directly models relationships between nonadjacent labels. Higher-order Markov models relax this local conditional independence assumption, but the complexity of inference grows exponentially with the increasing range of direct dependencies.

In many situations, models could benefit from allowing information to pass directly between two labels that are far apart. For example, in named entity recognition (NER) tasks, a typical goal is to identify groups of consecutive words as being one of the following entity types: location, person, company, and other. It often happens that the type of an entity is clear in one context, but difficult to determine in another context. In a Markov model of fixed order, there is no direct way to share information between the two occurrences of the same entity. However, with long-distance interactions, we can enforce or encourage repeated words and word groups to receive the same entity labels.

Long-range dependencies arise not only within contiguous text, but also between interconnected documents. Consider the task of giving a topic label to each document in a collection, where the documents have a natural connectivity structure. For example, in a collection of scientific articles, it is natural to consider two articles connected if one article cites the other. Similarly, for a collection of web pages, a hyperlink from one web page to another is a natural indicator of connection. Since documents often connect to other documents about similar topics, it is potentially helpful to use this connectivity information in making topic label predictions. Indeed, this structure has been used to aid classification in several non-probabilistic, procedural systems [Neville and Jensen, 2000, Slattery and Mitchell, 2000], as well as in probabilistic models [Getoor et al., 2001, Taskar et al., 2002, Bunescu and Mooney, 2004].

Although a strong case can be made for the benefits of long-range models, performing inference (i.e., carrying out the labeling procedure) is intractable in most graphical models with long-range interactions.
One general approach to this challenge is to replace exact inference with approximate inference algorithms. Two previous approaches to using long-distance dependencies in linguistic tasks are loopy belief propagation [Taskar et al., 2002, Sutton and McCallum, 2004, Bunescu and Mooney, 2004] and Gibbs sampling [Finkel et al., 2005], each a form of approximate inference.

In this paper, we present the mixture-of-parents MEMM, a graphical model incorporating long-range interactions, for which we can efficiently compute marginal node posteriors without approximation or sampling. As a graphical model, the mixture-of-parents MEMM is a MEMM with additional skip edges that connect nonadjacent nodes. The skip edges are directed edges pointing from earlier labels to later labels. At this level, the model is similar to the skip-chain CRF of Sutton and McCallum [2004], which they describe as essentially a linear-chain CRF with additional long-distance edges. However, while the skip-chain CRF precludes exact inference, we make additional model assumptions to keep exact inference tractable.

In both mixture-of-parents MEMMs and skip-chain CRFs, the features on skip edges can be based on both the label and the input environment of each node in the edge. These long-distance features allow a highly informative environment at one node to influence the label chosen for the other node. In the NER task, for example, one might connect all pairs of identical words by edges. This would allow the context-sharing effect described above. In linked-document data, a number of interesting models are possible. One simple model is to have a long-distance feature connecting each document to all the other documents that it cites [Chakrabarti et al., 1998, Taskar et al., 2001]. These links allow the model to account for the strong topic correlation along bibliographic links.

The rest of the paper is organized as follows. We begin in Section 2 by reviewing maximum entropy Markov models. Then we introduce the mixture-of-parents extension, and show how to perform efficient inference in the model. In Section 3, we describe the estimation procedure. In Sections 4 and 5, we provide experimental validation on two tasks, demonstrating significant improvements in accuracy. We conclude with a discussion of our results in Section 6.

2 Models

We begin by describing maximum entropy Markov models (MEMMs), introduced by McCallum et al. [2000]. A MEMM represents the conditional distribution of a chain of labels given the input sequence.

2.1 MEMMs

Let us denote the input sequence as $x = (x_1, x_2, \ldots, x_n)$ and the label sequence as $y = (y_1, y_2, \ldots, y_n)$, where each label $y_i$ takes on values in some discrete set $\mathcal{Y}$. A first-order MEMM assumes $p(y_k \mid y_1, \ldots, y_{k-1}, x) = p(y_k \mid y_{k-1}, x)$. Inference in these models, that is, computing the posterior marginals $p(y_1 \mid x), \ldots, p(y_n \mid x)$, or the posterior mode $\arg\max_{y_1, \ldots, y_n} p(y_1, \ldots, y_n \mid x)$, requires $O(n |\mathcal{Y}|^2)$ time. An $m$th-order model assumes $p(y_k \mid y_1, \ldots, y_{k-1}, x) = p(y_k \mid y_{k-m}, \ldots, y_{k-1}, x)$, and requires $O(n |\mathcal{Y}|^{m+1})$ time for inference. For simplicity, we focus on first-order models.

In a MEMM, the conditional distributions $p(y_k \mid y_{k-1}, x)$ are taken to be log-linear, or maximum entropy, in form:

$$p_{\lambda,\mu}(y_k \mid y_{k-1}, x) = \frac{1}{Z_{y_{k-1},x}} \exp\Big( \sum_s \lambda_s f_s(y_{k-1}, y_k, x) + \sum_t \mu_t g_t(y_k, x) \Big),$$

where $Z_{y_{k-1},x}$ is a normalization function ensuring that $\sum_{y_k \in \mathcal{Y}} p_{\lambda,\mu}(y_k \mid y_{k-1}, x) = 1$. In models for named entity recognition, the features $f_s$ and $g_t$ track the attributes of the local context around the label, such as the current word, previous words, capitalization, punctuation, etc.
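To make the form of this transition model concrete, here is a minimal sketch (not the authors' code) of evaluating $p_{\lambda,\mu}(y_k \mid y_{k-1}, x)$ over all values of $y_k$, assuming the feature functions return sparse dictionaries; the function and argument names are illustrative assumptions.

```python
import numpy as np

def memm_transition(prev_label, x, k, labels, lam, mu, f_feats, g_feats):
    """Sketch: maximum-entropy transition p(y_k | y_{k-1}, x) over all labels.

    lam, mu : dicts mapping feature names to weights (the lambda_s and mu_t)
    f_feats : f_feats(prev_label, y, x, k) -> {name: value}  (edge features f_s)
    g_feats : g_feats(y, x, k) -> {name: value}               (node features g_t)
    All names and signatures here are assumptions for illustration.
    """
    scores = np.empty(len(labels))
    for i, y in enumerate(labels):
        s = sum(lam.get(name, 0.0) * val
                for name, val in f_feats(prev_label, y, x, k).items())
        s += sum(mu.get(name, 0.0) * val
                 for name, val in g_feats(y, x, k).items())
        scores[i] = s
    scores -= scores.max()        # log-sum-exp shift for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()    # normalization by Z_{y_{k-1}, x}
```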
2.2 Skip-chain models

In the MEMM, when we condition on the input sequence $x$, the label variables $y_1, \ldots, y_n$ form a Markov chain. The conditional independence structure of this model is given by a directed graph, with edges connecting adjacent labels.

We now consider a much more general model, in which we allow additional skip edges to connect nonadjacent labels. We call this model a directed skip-chain model. The graphical structure for this model is shown in Figure 1(a), with the long-range skip edges shown dashed. With respect to the graph structure, the parents of a label $y_k$ comprise the label $y_{k-1}$ immediately preceding it, as well as any earlier labels connected to $y_k$ via a skip edge. For each label $y_k$, we denote the indices of the parents of $y_k$ by $\pi_k \subseteq \{1, \ldots, k-1\}$, and we denote the set of parent labels by $y_{\pi_k} = \{y_j : j \in \pi_k\}$. We define the conditional distribution of $y_k$ as follows:

$$p(y_k \mid y_1, \ldots, y_{k-1}, x) = p(y_k \mid y_{\pi_k}, x). \qquad (1)$$

Since we are conditioning on the input $x$, the graphical structure itself is allowed to depend on the input. This allows us, for instance, to introduce skip edges connecting the labels of identical words. An undirected version of this model, called the skip-chain conditional random field, has been presented in [Sutton and McCallum, 2004]. Figure 1(b) shows the graphical structure of the skip-chain CRF.

Figure 1: (a) Directed skip-chain model; (b) skip-chain CRF. Long-range dependencies are shown as dashed edges.

Without additional restrictions on the number or placement of the skip edges, exact inference in these models is intractable. For the directed skip-chain model, the tree-width is one more than the maximum number of skip edges passing over a node in the chain. In Sutton and McCallum [2004], loopy belief propagation is used for approximate inference in the skip-chain CRF. In contrast, we introduce an assumption about the structure of the conditional distributions that enables the efficient calculation of posterior marginals.

2.3 Mixture-of-parents models

We say that a directed skip-chain model is a mixture-of-parents model if the expression in Equation (1) above can be written in the following special form:

$$p(y_k \mid y_{\pi_k}, x) = \sum_{j \in \pi_k} \alpha_{kj} \, p(y_k \mid y_j, x), \qquad (2)$$

where the mixing weights satisfy $\alpha_{kj} \ge 0$ for each $k, j$ and $\sum_{j \in \pi_k} \alpha_{kj} = 1$. We now show that for skip-chain mixture-of-parents models, we can compute the marginal posteriors $p(y_1 \mid x), \ldots, p(y_n \mid x)$ in an efficient way. In the equations below, all probabilities are conditional on $x$, so we suppress the $x$ in our calculations to reduce clutter:

$$p(y_k) = \sum_{y_1, \ldots, y_{k-1}} p(y_k \mid y_1, \ldots, y_{k-1}) \, p(y_1, \ldots, y_{k-1})
= \sum_{y_1, \ldots, y_{k-1}} \sum_{j \in \pi_k} \alpha_{kj} \, p(y_k \mid y_j) \, p(y_1, \ldots, y_{k-1})
= \sum_{j \in \pi_k} \alpha_{kj} \sum_{y_1, \ldots, y_{k-1}} p(y_k \mid y_j) \, p(y_1, \ldots, y_{k-1})
= \sum_{j \in \pi_k} \alpha_{kj} \sum_{y_j} p(y_k \mid y_j) \, p(y_j).$$

This calculation shows that if the single-parent conditional probabilities $p(y_k \mid y_j, x)$ are easy to compute, then we can also easily compute the single-node posterior distributions.[1] We can also write the marginal posteriors as $p(y_k) = \sum_{j \in \pi_k} \alpha_{kj} \, p_j(y_k)$, where $p_j(y_k) = \sum_{y_j} p(y_k \mid y_j) \, p(y_j)$ is the predictive distribution for $y_k$, given the marginal distribution of parent $y_j$. So for a skip-chain mixture-of-parents model, the posterior distribution of a node $y_k$ is a convex combination of the predictive distributions given by each parent separately.

We call this model a skip-chain mixture-of-parents model because Equation (2) defines a probabilistic mixture model. The generative interpretation of such a model is that to generate $y_k$ given the parents $y_{\pi_k}$, we first randomly choose one of the parent labels according to the multinomial probability distribution with parameters $\alpha_{kj}, j \in \pi_k$. Then, according to the mixture model, only this parent label is relevant to the determination of the label $y_k$. For example, if the randomly chosen parent node is $y_j$, then $y_k$ is drawn according to the conditional probability distribution $p(y_k \mid y_j, x)$.

Conditional distributions with this mixture-of-parents form were also considered in [Pfeffer, 2001], where they were called separable distributions. In that work, it is shown that a conditional distribution $p(y_k \mid y_{\pi_k})$ has the mixture-of-parents form iff we can write the marginal distribution $p(y_k)$ in terms of the marginal distributions of the parents of $y_k$. In general, we would need to know the full joint distribution of the parents to determine the marginal distribution of $y_k$.

[1] Note that the task of finding the posterior mode does not allow the same trick.
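A minimal sketch of this left-to-right marginal computation, under the assumption that the single-parent transition tables $p(y_k \mid y_j, x)$ have already been evaluated (e.g., with the maximum-entropy model of Section 2.4 below); the data-structure choices and names are illustrative, not the authors' code.

```python
import numpy as np

def node_marginals(prior0, parents, alpha, trans):
    """Sketch: exact marginals p(y_k | x) for a mixture-of-parents skip-chain model.

    prior0        : array of p(y_0 | x) for the first position (no parents)
    parents[k]    : list of parent indices j < k (parents[0] is empty)
    alpha[k][j]   : mixing weight alpha_kj, summing to 1 over parents[k]
    trans[(j, k)] : |Y| x |Y| array, entry [a, b] = p(y_k = b | y_j = a, x)
    Names and layout are assumptions made for this illustration.
    """
    n = len(parents)
    marginals = [np.asarray(prior0, dtype=float)]
    for k in range(1, n):
        m = np.zeros_like(marginals[0])
        for j in parents[k]:
            # predictive distribution given parent j: p_j(y_k) = sum_{y_j} p(y_k|y_j) p(y_j)
            m += alpha[k][j] * (marginals[j] @ trans[(j, k)])
        marginals.append(m)
    return marginals
```

Each position's marginal is a convex combination of the predictive distributions contributed by its parents, exactly as in the derivation above, so the whole sweep is linear in the number of edges.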

2.4 Single-parent conditionals

We now complete our description of the mixture-of-parents MEMM by giving the specific form for the individual parent conditional distributions. We use the same maximum entropy model found in the standard MEMM:

$$p_{\lambda,\mu}(y_k \mid y_j, x) = \frac{1}{Z_{y_j,x}} \exp\Big( \sum_s \lambda_s f_s(y_j, y_k, x) + \sum_t \mu_t g_t(y_k, x) \Big).$$

Although in theory we could use a different parameter vector $(\lambda, \mu)$ for each edge, in practice we only use a few distinct transition models so that we can pool the data in the parameter estimation phase.

3 Learning

We focus on learning the parameters $\lambda$ and $\mu$ of the local transition models $p_{\lambda,\mu}(y_k \mid y_j, x)$, and we assume the mixing weights $\alpha$ are given. In our experiments, we used a uniform mixing distribution.

The standard method for training MEMMs is to maximize the conditional log-likelihood of the data,

$$L_C(x, y) = \log p(y \mid x) = \sum_k \log p_{\lambda,\mu}(y_k \mid y_{\pi_k}, x),$$

under some regularization of the parameters $\lambda$ and $\mu$. In our experiments, we used ridge regularization, which penalizes the sum of squares of all the weights equally. The $L_C(x, y)$ objective function favors parameter values that assign high probability to the training data (i.e., the correctly labeled sequences) as a whole. This objective function is quite natural when the final prediction is the posterior mode of the sequence distribution, namely $\arg\max_y L_C(x, y)$. However, in the MoP-MEMM, we take each label prediction to be $\arg\max_{y_k} \log p(y_k \mid x)$. An objective function that is better suited to this form of label prediction was suggested in [Kakade et al., 2002], where the objective is to maximize the sum of posterior marginal log-likelihoods of the training labels:

$$L_M(x, y) = \sum_k \log p(y_k \mid x) = \sum_k \log \sum_{j \in \pi_k} \alpha_{kj} \sum_{y_j} p(y_k \mid y_j, x) \, p(y_j \mid x) = \sum_k \log \sum_{j \in \pi_k} \alpha_{kj} \, p_j(y_k \mid x),$$

where $p_j(y_k \mid x) = \sum_{y_j} p(y_k \mid y_j, x) \, p(y_j \mid x)$ is the distribution over $y_k$ induced by selecting $y_j$ as its sole parent.

The first objective function, $L_C$, is concave in the parameters $\lambda, \mu$, and therefore easy to optimize. The second objective function, $L_M$, although not concave, is relatively well-behaved, as noted in Kakade et al. [2002]. In our experiments, we use the L-BFGS method [Nocedal and Wright, 1999] for optimization, which requires us to compute gradients of the objective function.

3.1 Gradients

The gradient of the first objective with respect to the parameters $\lambda, \mu$ is given by:

$$\nabla L_C(x, y) = \sum_k \frac{\sum_{j \in \pi_k} \alpha_{kj} \, \nabla p(y_k \mid y_j, x)}{p(y_k \mid y_{\pi_k}, x)}. \qquad (3)$$

For the second objective the gradient is:

$$\nabla L_M(x, y) = \sum_k \frac{\nabla p(y_k \mid x)}{p(y_k \mid x)}. \qquad (4)$$

Expanding the gradient of each posterior marginal, we have:

$$\nabla p(y_k \mid x) = \sum_{j \in \pi_k} \alpha_{kj} \, \nabla p_j(y_k \mid x). \qquad (5)$$

Expanding further, we have:

$$\nabla p_j(y_k \mid x) = \sum_{y_j} \big[ p(y_k \mid y_j, x) \, \nabla p(y_j \mid x) + p(y_j \mid x) \, \nabla p(y_k \mid y_j, x) \big]. \qquad (6)$$

Finally, the derivative of the conditional distribution with respect to $\lambda_s$ is

$$\partial_{\lambda_s} p(y_k \mid y_j, x) = p(y_k \mid y_j, x) \Big( f_s(x, y_j, y_k) - \sum_{y_k'} p(y_k' \mid y_j, x) \, f_s(x, y_j, y_k') \Big), \qquad (7)$$

and the derivative with respect to $\mu_t$ is the obvious analogue, with $f_s$ replaced by $g_t$.

Note that to calculate the gradient of $L_M$, we need to compute the marginals $p(y_k \mid x)$, while no inference is required to compute the gradient of $L_C$. This highlights the difference between the two objectives: $L_M$ incorporates the uncertainty in the prediction of previous labels during learning, while $L_C$ simply uses the correct labels of previous positions. Although it may be advantageous to account for uncertainty in earlier predictions, this makes the gradient calculation at position $k$, as in Equation (5), much more difficult, since we need to incorporate gradients from previous positions.

In many natural language tasks, the set of local features that are active (non-zero) at a position is usually small (tens to hundreds). This sparsity of $p(y_k \mid y_j, x)$ allows efficient learning for MEMMs, even with millions of features, since the contributions of each position to the gradient affect only a small number of features. This property no longer holds for the gradient of the $L_M$ objective, since the gradient contribution of each position $k$ will contain the union of the features active at all of its ancestors' positions. For the $L_C$ objective, we will have the union of just the parents, not all the ancestors.
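Before turning to the sparse reorganization of these gradients, here is a minimal sketch of evaluating $L_M$ itself (not its gradient), reusing marginals computed by the inference sweep of Section 2.3; the names mirror the earlier sketch and are likewise illustrative, and regularization is omitted.

```python
import numpy as np

def marginal_log_likelihood(true_labels, parents, alpha, trans, marginals):
    """Sketch: L_M(x, y) = sum_k log p(y_k = true label | x).

    marginals[j]  : p(y_j | x) from the left-to-right inference sweep
    trans[(j, k)] : |Y| x |Y| array, entry [a, b] = p(y_k = b | y_j = a, x)
    Illustrative names only.
    """
    total = 0.0
    for k, y_true in enumerate(true_labels):
        if not parents[k]:
            p_k = marginals[k][y_true]      # first position: plain maxent marginal
        else:
            # p(y_k | x) = sum_j alpha_kj * sum_{y_j} p(y_k | y_j, x) p(y_j | x)
            p_k = sum(alpha[k][j] * (marginals[j] @ trans[(j, k)])[y_true]
                      for j in parents[k])
        total += np.log(p_k)
    return total
```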

3.2 Speeding up gradient calculations

The key to efficient gradient computation that exploits sparsity is to reorder the calculations so that only sparse vectors need to be manipulated. If we recursively expand Equations (4), (5), and (6), and regroup the terms, the total gradient can be written as a linear combination of local gradient vectors:

$$\nabla L_M(x, y) = \sum_k \sum_{j \in \pi_k} \sum_{y_j, y_k} w_{kj}(y_j, y_k) \, \nabla p(y_k \mid y_j, x),$$

where the weights $w_{kj}(y_j, y_k)$ depend only on the marginals $p(y_k \mid x)$ and the mixing weights $\alpha_{kj}$. See Appendix A for the derivation of the weights. Thus once we know the weights, the gradient is just the weighted sum of the sparse derivative vectors $\nabla p(y_k \mid y_j, x)$. To calculate the weights, we sweep from left to right, computing the appropriate $w_{kj}(y_j, y_k)$ recursively. A second left-to-right sweep just adds the sparse gradients with the computed weights.

4 The Tasks

We apply the mixture-of-parents MEMM to the CoNLL 2003 English named entity recognition (NER) dataset and the WebKB dataset [Craven et al., 1998].

4.1 The CoNLL NER task

This NER data set was developed for the shared task of the 2003 Conference on Computational Natural Language Learning (CoNLL). It was one of two NER data sets developed for the task. We used the English language dataset, which comprises Reuters newswire articles annotated with four entity types: location (LOC), person (PER), organization (ORG), and miscellaneous (MISC). The competition scored entity taggers based on their precision and recall at the entity level. Each tagger was ranked based on its overall F1 score, which is the harmonic mean of precision and recall across all entity types. We report this F1 score in our own experiments. We use the standard split of this data into a training set comprising 945 documents and a test set comprising 216 documents.

4.2 WebKB

The WebKB dataset contains webpages from four different Computer Science departments: Cornell, Texas, Washington, and Wisconsin. Each page is categorized into one of the following five webpage types: course, faculty, student, project, and other. The data set is problematic in that the category other is a diverse mix of many different types of pages. We used the subset of the dataset from Taskar et al. [2002], with the following category distribution: course (237), faculty (148), other (332), research-project (82), and student (542). The number of pages for each school is: Cornell (280), Texas (291), Washington (315), and Wisconsin (454). The number of links for each school is: Cornell (574), Texas (574), Washington (728), and Wisconsin (1614). For each page, we have access to the entire html source, as well as the links to other pages. Our goal is to collectively classify webpages into one of these five categories. In all of our experiments, we learn a model from three schools and test the performance of the learned model on the remaining school.

5 Methods and Results

There are several things one must consider when applying a mixture-of-parents MEMM to data.
First, although a MEMM may theoretically have a different parameter vector $(\lambda, \mu)$ for each edge, in practice this gives too many parameters to estimate from the data. The typical approach is to use the same parameter vectors on multiple edges. In terms of the model description above, we limit the number of distinct maximum-entropy conditional probability distributions $p_{\lambda,\mu}(y_k \mid y_{\pi_k}, x)$ that we must learn. In the NER task, for example, we restrict to two conditional probability models, one that models the transition probability between adjacent words, denoted $p_{\lambda,\mu}(y_k \mid y_{k-1}, x)$, and another that models the transition probability between nonadjacent words, denoted $p_{\lambda,\mu}(y_k \mid y_v, x)$ for $v \le k - 2$.

Next, one must decide on the edge structure of the graph. That is, for each node in our model, we must have a rule for finding its parent nodes. For sequential data, such as the NER dataset, one obvious parent for a node is the node immediately preceding it. In their skip-chain conditional random field, Sutton and McCallum [2004] put an edge between all pairs of identical capitalized words, in addition to the edges connecting adjacent words. A sketch of this kind of parent-selection rule is given below.
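The following is an illustrative parent-selection rule of this kind: each position gets its immediate predecessor as a parent, plus skip edges to earlier occurrences of the same capitalized word, truncated to the r most recent occurrences as described in Section 5.1. Everything here (names, the common-word filter) is a sketch, not the authors' implementation.

```python
def build_parents(words, r=5, common_words=frozenset()):
    """Sketch: parents of position k = position k-1, plus skip edges to the
    r most recent earlier occurrences of the same capitalized word, ignoring
    words in `common_words` (e.g. words occurring in more than 100 documents).
    Illustrative only.
    """
    seen = {}        # word -> positions of its earlier occurrences
    parents = []
    for k, w in enumerate(words):
        pk = [k - 1] if k > 0 else []
        if w[:1].isupper() and w not in common_words:
            pk.extend(seen.get(w, [])[-r:])      # cap at the r most recent occurrences
            seen.setdefault(w, []).append(k)
        parents.append(sorted(set(pk)))
    return parents
```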

Once we've specified the parents for every node in the model, we must devise a way to set the mixing weight $\alpha_{kj}$ for the $j$th parent of the $k$th node, for every valid pair $(j, k)$. While one can certainly try to learn a parametric model for the mixing weights, our preliminary results in this direction were not promising. Thus we chose to use uniform mixing weights. That is, we took $\alpha_{kj} = 1/|\pi_k|$, where $|\pi_k|$ denotes the number of parents of node $k$.

Finally, we complete our specification of the maximum-entropy conditional probability distributions by specifying the feature functions $f_s$ and $g_t$. It seems reasonable that the types of features that would be best suited for a long-distance transition model would be different from the best features for a local model. For example, one reasonable feature for a skip edge $(y_j, y_k)$ would be whether or not the preceding words $x_{j-1}$ and $x_{k-1}$ are the same. In particular, if the preceding words are equal, this would make it more likely that the labels agree: $y_j = y_k$. However, this reasoning doesn't apply for local edges: this feature would only be active if the same word occurred twice in a row in the sentence.

Our first objective was to compare mixture-of-parents MEMMs as directly as possible with regular MEMMs. To this end, we took the features on each skip edge $(y_j, y_k)$ to be the union of the features on the local edges $(y_{j-1}, y_j)$ and $(y_{k-1}, y_k)$. In this first stage, we avoided innovation in the design of skip-edge features since, after all, one could just as well improve plain MEMMs by using better features. However, perhaps we were overly cautious. Although it seems plausible that one could get better results by crafting features particularly well suited to skip edges, our initial attempts in this direction did not show any significant improvement. Thus we only report our results using the baseline skip-edge features described above.

5.1 The NER Task

For the NER task we used the same feature set used in Sutton and McCallum [2005]. In deciding which skip edges to include, we first eliminated from consideration all words occurring in more than 100 of the documents. We did this to decrease training time, and because common words are typically easy to label. After eliminating the most common words from consideration, we followed Sutton and McCallum [2005] and connected the remaining identical capitalized word pairs within each document. However, to keep from having an excessively large training set, if a word occurred more than r times within a single document, we only connected it to the r most recent occurrences. The performance of the model seemed relatively insensitive to the value of r between 3 and 10, so we kept it at 5 throughout the experiments.

In Figure 2, we tabulate the performance results of several sequence models on the NER test set.

Figure 2: Comparison of several models on the NER task: the Viterbi- and posterior-decoded MEMM, the separately and jointly trained mixture-of-parents MEMM, and CRF-based models. All models used the feature set described in Sutton and McCallum [2004], which was also the source for the skip-chain CRF result. The F1 column gives the overall entity-wise F1 score, and the FP and FN columns give the entity-wise false positive and false negative rates. The %Improvement column gives the percent reduction in false positive and false negative rates compared to the posterior-decoded MEMM model.

The Viterbi-decoded MEMM is the typical MEMM model. When the mixture-of-parents MEMM has no skip edges, we get the posterior-decoded MEMM model. These two models perform approximately the same.
If we use the MoP-MEMM model, but train each transition model separately using standard MEMM training, then we get about a 0.6% improvement in F1 over the basic MEMM. If we train the models jointly, using the sum of marginal log-likelihoods objective function $L_M$, then we get an additional 0.4% gain. The standard CRF model gets an F1 of 90.6%, which even beats the more basic MoP-MEMM model. In principle, one advantage the (undirected) CRF-based models have over the (directed) MEMM-based models is that in the CRF, the label of a given node can be directly influenced by nodes both before and after it. Finally, the skip-chain CRF gets the top performance, beating the jointly trained MoP-MEMM by 0.3%.

Analysis

It is informative to look in more detail at a particular document in the NER test set. Consider the 15th document, which is an article about a tennis tournament. The tennis player MaliVai Washington is first mentioned, with his full name, in the 17th position of the document. His last name, Washington, shows up 6 more times. The posterior-decoded MEMM correctly labels the first occurrence of Washington as a person with probability 0.99, but 5 of the next 6 occurrences are labeled as locations, though not by a large margin. The Mixture-of-Parents MEMM gets all but one of these 7 occurrences correct.

The very high confidence in the label of the first occurrence of Washington propagates via the skip edges to later occurrences, tipping things in the right direction. The improvement in F1 of the MoP-MEMM over the MEMM tells us that the propagation of information via skip edges helps more often than not. Nevertheless, sometimes skip edges lead the model astray. For example, the 22nd document of the test set is about an event in the soccer world. The acronym UEFA, which stands for the Union of European Football Associations, occurs twice in the document. The first time, it is correctly identified as an organization by both models. The second occurrence of the acronym is in the phrase "the UEFA Cup," which should be a MISC entity type. The local edge model indeed predicts that UEFA is a MISC. However, with uniform mixing, the correct local model in the MoP-MEMM is slightly overpowered by the highly incorrect non-local model prediction, which gives probability 0.99 to the second occurrence of UEFA being an ORG.

5.2 The WebKB Task

In the WebKB task, the obvious graph structure given by the hyperlinks cannot be taken immediately as our edge model: the problem is that the hyperlink graph may have directed cycles. Our approach to this problem was to first select a random ordering of the nodes. If the $i$th node and the $j$th node in our random ordering are connected by an edge (in either direction), with $i < j$, then in our model we put a directed edge from the $i$th node to the $j$th node. We use two different conditional probability models for the nodes on these edges. If the original hyperlink is pointing from $i$ to $j$, then we use the incoming edge model, and otherwise we use the outgoing edge model. In this way, we get a DAG structure with two distinct conditional probability models. Since this is now a MoP-MEMM, we can label the nodes using our standard method.

It's clear that some orderings will give rise to better models than others. To reduce this variability, we find the node marginals resulting from each of 50 random node permutations. We then predict using the average of the 50 marginals. The performance of this average predictor was typically close to the performance of the best of the 50 individual predictors.

We again tried two different approaches to training the conditional probability models. Note that any hyperlink can end up associated with either an incoming or outgoing edge model, depending on the randomly chosen ordering of its nodes. Thus for the separate training mode, each hyperlink was added to the training sets of both the incoming and outgoing edge models, and we trained each model using standard MEMM training. For joint training, we have to fix a random ordering to get a MoP-MEMM model. Again, to reduce variability, we trained on 10 different orderings of the nodes. For this dataset, joint training performed marginally worse than separate training. However, both models essentially matched the performance of the Link model of Taskar et al. [2002]. The Link model has a similar graphical structure to our model, but without the mixture-of-parents simplification. Thus, to find the labels, they use loopy belief propagation, an approximate inference technique. For comparison, we also trained a maximum-entropy node classifier that ignored the hyperlink information. The performance of this Node model, as well as the other models we've discussed, is shown in Figure 3.

Figure 3: Comparison of the percent error rates of the Node model, the Link model, and the separately and jointly trained MoP-MEMMs on the WebKB webpage classification task. All models use the feature set described in Taskar et al. [2002], which was also the source for the Link Model result. The %Improvement column gives the percent decrease in the error rate, compared to the node model.
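A sketch of this random-ordering construction (hypothetical names, not the authors' code), producing, for each linked pair of pages, a directed edge from the earlier node to the later one together with the choice of incoming vs. outgoing edge model:

```python
import random

def orient_links(n_pages, hyperlinks, seed=0):
    """Sketch: turn a hyperlink graph into a DAG under a random node ordering.

    hyperlinks : iterable of (src, dst) pairs, meaning src links to dst
    Returns (earlier, later, model) triples: the DAG edge goes earlier -> later;
    'incoming' means the original hyperlink points in the same direction as
    the DAG edge, 'outgoing' means it points the other way.  Illustrative only.
    """
    order = list(range(n_pages))
    random.Random(seed).shuffle(order)
    rank = {node: i for i, node in enumerate(order)}
    edges = []
    for src, dst in hyperlinks:
        earlier, later = (src, dst) if rank[src] < rank[dst] else (dst, src)
        model = 'incoming' if (earlier, later) == (src, dst) else 'outgoing'
        edges.append((earlier, later, model))
    return edges
```

Repeating this for several seeds and averaging the resulting node marginals corresponds to the 50-permutation averaging described above.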
6 Discussion

In addition to the works described above (Sutton and McCallum [2004], Finkel et al. [2005]), our work is similar to that of Malouf [2002] and Curran and Clark [2003]. In the latter works, the label of a word is conditioned on the most recent previous instance of that same word in an earlier sentence. To perform inference, they labeled each sentence sequentially, and allowed labels in future sentences to be conditioned on the labels chosen for earlier sentences. The Mixture-of-Parents MEMM seems to be a conceptual improvement over these methods, since we use the soft labeling (i.e., the posterior distribution) of the preceding label, rather than the predicted label, which doesn't account for the confidence of the labeling.

The skip-chain MEMM has many compelling attributes. First, it is a non-Markovian sequence model for which we can efficiently compute the exact marginal node posteriors. Second, it gives results that exceed ordinary MEMMs, sometimes by a significant margin, without any additional feature engineering.

Finally, the model is very modular: we can train several different local and skip-edge models, and interchange them to see which combinations give the best performance. Once a good separately trained MoP-MEMM is found, one can use this as a starting point in training the weights of a jointly trained MoP-MEMM.

There are also some drawbacks of the skip-chain MEMM, especially in comparison to other non-Markovian models, such as the approaches of Sutton and McCallum [2004] and Finkel et al. [2005]. The most important one is that in skip-chain MEMMs, information only flows from early labels to later labels. Although this may not be a serious problem in the newswire corpora we consider, since earlier mentions of an entity are typically less ambiguous than later mentions, it is certainly less than desirable in general. One way to address this is by sampling orderings of the nodes and averaging results of inference on several orderings, as we have done for the WebKB hypertext data, showing that a simple mixture of tractable models can capture dependencies in the data as well as an intractable model, which relies on heuristic approximate inference.

Acknowledgements

We would like to thank the anonymous reviewers and Percy Liang for helpful comments.

References

R. Bunescu and R. J. Mooney. Collective information extraction with relational Markov networks. In Proc. Association for Computational Linguistics, 2004.
S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In SIGMOD, 1998.
M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the world wide web. In Proc. AAAI98, 1998.
J. R. Curran and S. Clark. Language independent NER using a maximum entropy tagger. In Proceedings of the Seventh Conference on Natural Language Learning (CoNLL-03), Edmonton, Canada, 2003.
J. R. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 05), Ann Arbor, Michigan, June 2005. Association for Computational Linguistics.
L. Getoor, E. Segal, B. Taskar, and D. Koller. Probabilistic models of text and link structure for hypertext classification. In Proc. IJCAI01 Workshop on Text Learning: Beyond Supervision, Seattle, Wash., 2001.
S. Kakade, Y. W. Teh, and S. T. Roweis. An alternate objective function for Markovian fields. In Claude Sammut and Achim G. Hoffmann, editors, ICML. Morgan Kaufmann, 2002.
J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML-2001, 2001.
R. Malouf. Markov models for language-independent named entity recognition. In Proceedings of CoNLL-2002, 2002.
A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In ICML-2000, 2000.
J. Neville and D. Jensen. Iterative classification in relational data. In Proc. AAAI-2000 Workshop on Learning Statistical Models from Relational Data, pages 13-20, 2000.
J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, 1999.
A. Pfeffer. Sufficiency, separability and temporal probabilistic models. In J. S. Breese and D. Koller, editors, UAI. Morgan Kaufmann, 2001.
S. Slattery and T. Mitchell. Discovering test set regularities in relational domains. In Proc. ICML00, 2000.
C. Sutton and A. McCallum. Collective segmentation and labeling of distant entities in information extraction. Technical Report TR 04-49, University of Massachusetts, July 2004. Presented at ICML Workshop on Statistical Relational Learning and Its Connections to Other Fields.
C. Sutton and A. McCallum. Piecewise training of undirected models. In 21st Conference on Uncertainty in Artificial Intelligence, 2005.
B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In UAI, 2002.
B. Taskar, E. Segal, and D. Koller. Probabilistic classification and clustering in relational data. In Proc. IJCAI01, Seattle, Wash., 2001.
A Sparse gradient computations

It is easier to work in matrix notation for these derivations. Consider the gradients of the marginals and conditionals with respect to a particular parameter $\theta$ (either a $\lambda$ or a $\mu$), and for convenience define the following variables:

$$v_k(y_k) = \partial_\theta \, p(y_k \mid x), \quad \forall k; \; y_k \in \mathcal{Y}; \qquad (8)$$
$$u_{kj}(y_j, y_k) = \partial_\theta \, p(y_k \mid y_j, x), \quad \forall k; \; j \in \pi_k; \; y_j, y_k \in \mathcal{Y}.$$

We stack the vectors $v_1, \ldots, v_n$ into a single vector $v$ of length $|\mathcal{Y}|\, n$, where $n$ is the number of variables $y_k$ (e.g., the length of the sequence). Similarly, we stack the elements of the $u_{kj}$ matrices into a single vector $u$ of length $|\mathcal{Y}|^2 \sum_{k=1}^{n} |\pi_k|$.

By combining Equations (5) and (6), we can write $v = Av + Bu$ for appropriately defined matrices $A$ and $B$. Solving for $v$, we have $v = (I - A)^{-1} B u$. If $v$ is laid out in blocks corresponding to position $k$, then $A$ is upper triangular with 0's on the diagonal, and thus $I - A$ is easily invertible. The total gradient is $\partial_\theta L_M(x, y) = \sum_k v_k(y_k) = \gamma^\top v$ for an appropriately defined $\gamma$. Hence,

$$\partial_\theta L_M(x, y) = \gamma^\top (I - A)^{-1} B u = w\, u,$$

where $w = \gamma^\top (I - A)^{-1} B$ gives the weights $w_{kj}(y_j, y_k)$ of the local sparse gradients in Section 3.2.
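One way to see why no explicit matrix inverse is needed here: since $A$ is strictly triangular it is nilpotent, so the inverse is given by a terminating Neumann series,

$$(I - A)^{-1} = I + A + A^2 + \cdots,$$

with only finitely many nonzero terms. Consequently $w = \gamma^\top (I - A)^{-1} B$ can be accumulated by a single substitution sweep over positions rather than by forming the inverse, which is consistent with the left-to-right weight computation described in Section 3.2. (This remark is an added observation, not part of the original derivation.)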


More information

Liangjie Hong*, Dawei Yin*, Jian Guo, Brian D. Davison*

Liangjie Hong*, Dawei Yin*, Jian Guo, Brian D. Davison* Tracking Trends: Incorporating Term Volume into Temporal Topic Models Liangjie Hong*, Dawei Yin*, Jian Guo, Brian D. Davison* Dept. of Computer Science and Engineering, Lehigh University, Bethlehem, PA,

More information

Lecture 21 : A Hybrid: Deep Learning and Graphical Models

Lecture 21 : A Hybrid: Deep Learning and Graphical Models 10-708: Probabilistic Graphical Models, Spring 2018 Lecture 21 : A Hybrid: Deep Learning and Graphical Models Lecturer: Kayhan Batmanghelich Scribes: Paul Liang, Anirudha Rayasam 1 Introduction and Motivation

More information

COMP90051 Statistical Machine Learning

COMP90051 Statistical Machine Learning COMP90051 Statistical Machine Learning Semester 2, 2016 Lecturer: Trevor Cohn 20. PGM Representation Next Lectures Representation of joint distributions Conditional/marginal independence * Directed vs

More information

Log-linear models and conditional random fields

Log-linear models and conditional random fields Log-linear models and conditional random fields Charles Elkan elkan@cs.ucsd.edu February 23, 2010 The general log-linear model is a far-reaching extension of logistic regression. Conditional random fields

More information

Computer vision: models, learning and inference. Chapter 10 Graphical Models

Computer vision: models, learning and inference. Chapter 10 Graphical Models Computer vision: models, learning and inference Chapter 10 Graphical Models Independence Two variables x 1 and x 2 are independent if their joint probability distribution factorizes as Pr(x 1, x 2 )=Pr(x

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:

More information

MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS

MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS J.I. Serrano M.D. Del Castillo Instituto de Automática Industrial CSIC. Ctra. Campo Real km.0 200. La Poveda. Arganda del Rey. 28500

More information

Domain Adaptation of Information Extraction Models

Domain Adaptation of Information Extraction Models Domain Adaptation of Information Extraction Models Rahul Gupta IIT Bombay grahul@cse.iitb.ac.in Sunita Sarawagi IIT Bombay sunita@iitb.ac.in ABSTRACT Domain adaptation refers to the process of adapting

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Theory of Variational Inference: Inner and Outer Approximation Eric Xing Lecture 14, February 29, 2016 Reading: W & J Book Chapters Eric Xing @

More information

Link Prediction for Social Network

Link Prediction for Social Network Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue

More information

Classification. 1 o Semestre 2007/2008

Classification. 1 o Semestre 2007/2008 Classification Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 Single-Class

More information

Dynamic Bayesian network (DBN)

Dynamic Bayesian network (DBN) Readings: K&F: 18.1, 18.2, 18.3, 18.4 ynamic Bayesian Networks Beyond 10708 Graphical Models 10708 Carlos Guestrin Carnegie Mellon University ecember 1 st, 2006 1 ynamic Bayesian network (BN) HMM defined

More information

Structured Models in. Dan Huttenlocher. June 2010

Structured Models in. Dan Huttenlocher. June 2010 Structured Models in Computer Vision i Dan Huttenlocher June 2010 Structured Models Problems where output variables are mutually dependent or constrained E.g., spatial or temporal relations Such dependencies

More information

Collective classification in network data

Collective classification in network data 1 / 50 Collective classification in network data Seminar on graphs, UCSB 2009 Outline 2 / 50 1 Problem 2 Methods Local methods Global methods 3 Experiments Outline 3 / 50 1 Problem 2 Methods Local methods

More information

Hidden Markov Models in the context of genetic analysis

Hidden Markov Models in the context of genetic analysis Hidden Markov Models in the context of genetic analysis Vincent Plagnol UCL Genetics Institute November 22, 2012 Outline 1 Introduction 2 Two basic problems Forward/backward Baum-Welch algorithm Viterbi

More information

Deep Boltzmann Machines

Deep Boltzmann Machines Deep Boltzmann Machines Sargur N. Srihari srihari@cedar.buffalo.edu Topics 1. Boltzmann machines 2. Restricted Boltzmann machines 3. Deep Belief Networks 4. Deep Boltzmann machines 5. Boltzmann machines

More information

Discriminative Training with Perceptron Algorithm for POS Tagging Task

Discriminative Training with Perceptron Algorithm for POS Tagging Task Discriminative Training with Perceptron Algorithm for POS Tagging Task Mahsa Yarmohammadi Center for Spoken Language Understanding Oregon Health & Science University Portland, Oregon yarmoham@ohsu.edu

More information

Machine Learning

Machine Learning Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University February 17, 2011 Today: Graphical models Learning from fully labeled data Learning from partly observed data

More information

Overview Citation. ML Introduction. Overview Schedule. ML Intro Dataset. Introduction to Semi-Supervised Learning Review 10/4/2010

Overview Citation. ML Introduction. Overview Schedule. ML Intro Dataset. Introduction to Semi-Supervised Learning Review 10/4/2010 INFORMATICS SEMINAR SEPT. 27 & OCT. 4, 2010 Introduction to Semi-Supervised Learning Review 2 Overview Citation X. Zhu and A.B. Goldberg, Introduction to Semi- Supervised Learning, Morgan & Claypool Publishers,

More information

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin Clustering K-means Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, 2014 Carlos Guestrin 2005-2014 1 Clustering images Set of Images [Goldberger et al.] Carlos Guestrin 2005-2014

More information

Conditional Models of Identity Uncertainty with Application to Noun Coreference

Conditional Models of Identity Uncertainty with Application to Noun Coreference Conditional Models of Identity Uncertainty with Application to Noun Coreference Andrew McCallum Department of Computer Science University of Massachusetts Amherst Amherst, MA 01003 USA mccallum@cs.umass.edu

More information

CRFs for Image Classification

CRFs for Image Classification CRFs for Image Classification Devi Parikh and Dhruv Batra Carnegie Mellon University Pittsburgh, PA 15213 {dparikh,dbatra}@ece.cmu.edu Abstract We use Conditional Random Fields (CRFs) to classify regions

More information

Transductive Phoneme Classification Using Local Scaling And Confidence

Transductive Phoneme Classification Using Local Scaling And Confidence 202 IEEE 27-th Convention of Electrical and Electronics Engineers in Israel Transductive Phoneme Classification Using Local Scaling And Confidence Matan Orbach Dept. of Electrical Engineering Technion

More information

Robust Relevance-Based Language Models

Robust Relevance-Based Language Models Robust Relevance-Based Language Models Xiaoyan Li Department of Computer Science, Mount Holyoke College 50 College Street, South Hadley, MA 01075, USA Email: xli@mtholyoke.edu ABSTRACT We propose a new

More information

Expectation Propagation

Expectation Propagation Expectation Propagation Erik Sudderth 6.975 Week 11 Presentation November 20, 2002 Introduction Goal: Efficiently approximate intractable distributions Features of Expectation Propagation (EP): Deterministic,

More information

Edinburgh Research Explorer

Edinburgh Research Explorer Edinburgh Research Explorer An Introduction to Conditional Random Fields Citation for published version: Sutton, C & McCallum, A 2012, 'An Introduction to Conditional Random Fields' Foundations and Trends

More information

Tree-structured approximations by expectation propagation

Tree-structured approximations by expectation propagation Tree-structured approximations by expectation propagation Thomas Minka Department of Statistics Carnegie Mellon University Pittsburgh, PA 15213 USA minka@stat.cmu.edu Yuan Qi Media Laboratory Massachusetts

More information

Practical Markov Logic Containing First-Order Quantifiers with Application to Identity Uncertainty

Practical Markov Logic Containing First-Order Quantifiers with Application to Identity Uncertainty Practical Markov Logic Containing First-Order Quantifiers with Application to Identity Uncertainty Aron Culotta and Andrew McCallum Department of Computer Science University of Massachusetts Amherst, MA

More information