ALTW 2005 Conditional Random Fields


ALTW 2005 Conditional Random Fields Trevor Cohn tacohn@csse.unimelb.edu.au 1

Outline Motivation for graphical models in Natural Language Processing Graphical models mathematical preliminaries directed models: Belief Networks undirected models: Markov (& Conditional) Random Fields Inference: training and decoding for tree structured graphs for loopy graphs Regularisation and useful approximations 2

Motivation Graphical models define probability distributions over complex domains. Typically, these distributions are too complex to directly estimate or work with, so we factorise the distribution, i.e. divide it into manageable parts. These models allow us to estimate the probability of various events and to find the events which maximise that probability. They are particularly useful in NLP, yielding state-of-the-art results for many (most) tasks. Commonly used graphical models include: naive Bayes for document classification or topic detection; n-grams for language modelling; hidden Markov models (HMMs) for sequencing tasks (chunking, POS tagging, named entity recognition); and probabilistic context-free grammars (PCFGs) for syntactic parsing. 3

Topic Detection: Naive Bayes Topic detection in a document: the task is to identify the most salient topic in a given document. Naive Bayes is a commonly used approach, which models the creation of a document as follows: 1. select a topic, t, from the set of possible topics, T; 2. repeat N times: select a word, w, from the vocabulary, W. Each select step involves randomly sampling from a distribution, modelled as $p(\mathbf{w}, t) = p(t) \prod_{i=1}^{N} p(w_i \mid t)$. Training: estimate p(t) and p(w | t) from labelled data (or unlabelled, or both). Inference: we can then find the best topic, $t^* = \operatorname{argmax}_t p(t \mid \mathbf{w}) = \operatorname{argmax}_t p(\mathbf{w}, t)$. 4
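
The following is a minimal sketch of this generative story and the argmax decision rule; the topics, vocabulary and probability values are invented purely for illustration.

```python
import math

# Toy naive Bayes topic model: p(w, t) = p(t) * prod_i p(w_i | t).
# Topics, vocabulary and probabilities are made up for illustration.
p_topic = {"sport": 0.5, "finance": 0.5}
p_word_given_topic = {
    "sport":   {"goal": 0.5, "market": 0.1, "team": 0.4},
    "finance": {"goal": 0.1, "market": 0.6, "team": 0.3},
}

def log_joint(words, topic):
    """log p(w, t) = log p(t) + sum_i log p(w_i | t)."""
    return math.log(p_topic[topic]) + sum(
        math.log(p_word_given_topic[topic][w]) for w in words)

def best_topic(words):
    """argmax_t p(t | w) = argmax_t p(w, t)."""
    return max(p_topic, key=lambda t: log_joint(words, t))

print(best_topic(["goal", "team", "team"]))      # -> sport
print(best_topic(["market", "market", "goal"]))  # -> finance
```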

Generative vs. Discriminative Models The previous examples were all generative models: these models describe a process where the observed data (e.g. words) are generated by some hidden process (e.g. a document topic). Assumptions are made about what's going on in the hidden process (e.g. structure, label sets). We can use these models to predict (or maximise) the probability of hidden configurations when given some observed data. However, we can also directly model the conditional distribution (this is a discriminative model), e.g. the probability over topics, given the document. This assumes we have labelled training data (generative models are more flexible - they can also be trained in an unsupervised or semi-supervised manner). We'll look at two related modelling frameworks: directed and undirected graphical models. Both can be used to model generative and conditional distributions; however, typically conditional distributions are modelled with undirected models. 5

Graphical Models Both directed and undirected graphical models share many common notions: both describe conditional independence relations between random variables; both use similar inference algorithms to predict variable assignment probabilities and to find the maximum likelihood variable assignment; and both are commonly trained to optimise the likelihood of the training data. Thus, we'll start with the fundamentals of directed graphical models (Belief Networks) before proceeding. 6

Belief Networks Belief networks, a.k.a. Bayesian nets, model independence relationships between groups of random variables, and present a graphical depiction of these relationships (figure: a small network over the variables raining, sprinkler and grass wet)... but before we proceed, let's focus on some maths preliminaries. 7

Preliminaries: Independence Let X and Y be two (sets of) random variables. X is independent of Y iff $P(X \mid Y) = P(X)$. This is symmetrical: if X is independent of Y, then Y is independent of X. Intuitively: knowing the value of Y doesn't change the probabilities of X taking on particular values. Equivalently, if X and Y are independent, $P(X, Y) = P(X)P(Y)$. Let Z be another (set of) random variable(s). X is conditionally independent of Y given Z iff $P(X \mid Y, Z) = P(X \mid Z)$. 8

Preliminaries: Bayes and Chain Rules Bayes' rule: $P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{P(A)P(B \mid A)}{P(B)}$. The chain rule: $P(X_1, X_2, \ldots, X_k) = P(X_1) P(X_2 \mid X_1) \cdots P(X_k \mid X_1, X_2, \ldots, X_{k-1})$. This is just one possible order of expansion of $\{1, 2, \ldots, k\}$; no approximations have been made, nor have any assumptions (or world knowledge) been used. From now on I'll just use the notation p(a, b) to mean P(A=a, B=b), where capitals denote random variables and lower case denotes values. 9

Chain Rule: Example Imagine now that we do have some knowledge about the relationships between the random variables, e.g. six random variables $X_1, \ldots, X_6$. By the chain rule, the joint probability is given as $p(x_1, x_2, x_3, x_4, x_5, x_6) = p(x_1) p(x_2 \mid x_1) p(x_3 \mid x_1, x_2) p(x_4 \mid x_1, x_2, x_3) p(x_5 \mid x_1, x_2, x_3, x_4) p(x_6 \mid x_1, x_2, x_3, x_4, x_5)$... but we know that $x_3$ is independent of $x_2$ when given $x_1$, i.e. $p(x_3 \mid x_1, x_2) = p(x_3 \mid x_1)$. Similarly, if we know other conditional independences, we can further simplify the joint probability: $p(x_1, x_2, x_3, x_4, x_5, x_6) = p(x_1) p(x_2 \mid x_1) p(x_3 \mid x_1) p(x_4 \mid x_2) p(x_5 \mid x_3) p(x_6 \mid x_2, x_5)$. 10

Graphical Notation We can represent this structure of conditional independences in a directed acyclic graph (DAG) (figure: a DAG over $X_1, \ldots, X_6$). The edges show the conditioning variables in the expansion: $p(x_1, x_2, x_3, x_4, x_5, x_6) = p(x_1) p(x_2 \mid x_1) p(x_3 \mid x_1) p(x_4 \mid x_2) p(x_5 \mid x_3) p(x_6 \mid x_2, x_5)$. 11

Joint Decomposition The graph informs us how we can decompose the joint probability: $p(x_1, \ldots, x_n) = \prod_{i=1}^{N} p(x_i \mid x_{\pi_i})$, where $\pi_i$ are the parents of node $i$ in the graph (the sources of its incoming edges). (figure: the example DAG with each node annotated by its factor, e.g. $X_4$ with $p(x_4 \mid x_2)$ and $X_6$ with $p(x_6 \mid x_2, x_5)$.) 12
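
A small sketch of this decomposition for the example graph, with invented conditional probability tables over binary variables; the check at the end confirms the factorised joint still sums to one.

```python
import itertools

# Parents from the example DAG: X2, X3 <- X1; X4 <- X2; X5 <- X3; X6 <- X2, X5.
parents = {1: (), 2: (1,), 3: (1,), 4: (2,), 5: (3,), 6: (2, 5)}

# Placeholder CPTs over binary variables: cpt[i][parent values] = p(x_i = 1 | parents).
cpt = {
    1: {(): 0.3},
    2: {(0,): 0.2, (1,): 0.7},
    3: {(0,): 0.5, (1,): 0.1},
    4: {(0,): 0.4, (1,): 0.9},
    5: {(0,): 0.6, (1,): 0.2},
    6: {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.8},
}

def joint(x):
    """p(x_1, ..., x_6) = prod_i p(x_i | x_parents(i))."""
    prob = 1.0
    for i, pa in parents.items():
        p_one = cpt[i][tuple(x[j] for j in pa)]
        prob *= p_one if x[i] == 1 else 1.0 - p_one
    return prob

# The factorised joint still sums to one over all 2^6 assignments.
total = sum(joint(dict(zip(range(1, 7), vals)))
            for vals in itertools.product([0, 1], repeat=6))
print(round(total, 6))  # -> 1.0
```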

Why bother? Why decompose the joint probability in such a manner? Each distribution $p(x_i \mid x_{\pi_i})$ is a multi-dimensional table: for every combination of values of the parent variables, we need to record a distribution over the values of variable $i$. When there are 2 conditioning variables (parents), we record $2^3$ values (assuming all variables are binary valued); to represent the full joint (6 variables), we would require $2^6$ values. Reasoning with the model also becomes more expensive as the tables get larger, and it becomes harder to learn the parameters in the tables from data. Thus, ideally use sparsely connected graphs. 13

Graphical Models for Document Classification Naive Bayes: all tokens in the document are assumed to be independently and identically distributed; each token is conditionally independent of all other tokens, given the class. We can think of this in a generative sense: the document was created by first choosing a class, and then generating the document, one word at a time, from a distribution specific to the class. (figure: class node C with children $W_1, W_2, W_3, W_4, \ldots$) 14

Smarter Topic Detection: LDA We can do better: let's assume that every document is generated from a number of topics, and furthermore that this is a weighted set. Generative process: 1. select a distribution, M, over the set of possible classes; 2. repeat n times: (a) select a class, z, from M; (b) select a word, w, from p(w | z). (figure: M with children $Z_1, Z_2, Z_3, Z_4, \ldots$, each $Z_i$ generating a word $W_i$.) This model is called latent Dirichlet allocation (LDA). 15
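
A sketch of this generative story, assuming a Dirichlet prior over the topic mixture M; the vocabulary and per-topic word distributions are invented placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and per-topic word distributions p(w | z); all values are invented.
vocab = ["goal", "team", "market", "shares", "the"]
topics = np.array([[0.4, 0.4, 0.05, 0.05, 0.1],    # "sport"
                   [0.05, 0.05, 0.4, 0.4, 0.1]])   # "finance"

def generate_document(n_words, alpha=0.5):
    """LDA-style generative story for one document."""
    # 1. select a distribution over topics, M (here drawn from a Dirichlet prior)
    m = rng.dirichlet([alpha] * len(topics))
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=m)           # 2a. select a topic z from M
        w = rng.choice(len(vocab), p=topics[z])    # 2b. select a word from p(w | z)
        words.append(vocab[w])
    return words

print(generate_document(8))
```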

Graphical Models for Language Modelling Each token in a sentence is a random variable, $W_i$. These random variables each range over the words in the vocabulary; for a particular sentence, each variable is assigned a value, e.g. $W_i$ = "the". (figure: a second order (trigram) model over $W_1, W_2, W_3, W_4, \ldots$) 16

Graphical Models for Sequence Tagging Each observation in a sequence is assigned a random variable $O_i$, and a parallel chain of states is assigned the random variables $S_i$. E.g., for part-of-speech tagging the states correspond to POS tags, and the observations correspond to tokens. (figure: a second order hidden Markov model (HMM) with states $S_1, S_2, S_3, S_4, \ldots$ and observations $O_1, O_2, O_3, O_4, \ldots$) 17

Undirected Models Another class of graphical models is in common circulation: the undirected graphical model (a.k.a. Markov Random Field). (figure: a chain X - Y - Z.) This model is parameterised by a set of potential functions, one for each maximal clique in the graph (the largest sets of completely interconnected nodes). We're free to define these functions as we like, so long as they're non-negative. For the above graph, we define the joint probability density as $p(x, y, z) = \frac{1}{Z} \psi_{XY}(x, y) \psi_{YZ}(y, z)$, where Z is the normalising constant, $Z \triangleq \sum_x \sum_y \sum_z \psi_{XY}(x, y) \psi_{YZ}(y, z)$. 18
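
A tiny illustration of these definitions for the X - Y - Z chain: the potential values are arbitrary positive numbers, and Z is computed by brute-force summation.

```python
import itertools

# Chain MRF X - Y - Z over binary variables, with arbitrary non-negative potentials.
psi_xy = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 0.5, (1, 1): 3.0}
psi_yz = {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 2.0, (1, 1): 1.0}

# Normalising constant Z = sum_x sum_y sum_z psi_XY(x, y) * psi_YZ(y, z)
Z = sum(psi_xy[x, y] * psi_yz[y, z]
        for x, y, z in itertools.product([0, 1], repeat=3))

def p(x, y, z):
    """p(x, y, z) = (1/Z) psi_XY(x, y) psi_YZ(y, z)"""
    return psi_xy[x, y] * psi_yz[y, z] / Z

print(Z, p(1, 1, 0))
# The distribution sums to one by construction:
print(sum(p(*xyz) for xyz in itertools.product([0, 1], repeat=3)))  # -> 1.0
```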

Undirected Model Parameterisation We can incorporate the non-negativity constraints by requiring the potentials to have an exponential form, $\psi_{XY}(x, y) = \exp f_{XY}(x, y)$, where f is an unconstrained function. This results in a probability distribution of the form $p(x_1, x_2, \ldots, x_n) = \frac{1}{Z} \exp \sum_{c \in \mathcal{C}} f_c(x_c)$, where $\mathcal{C}$ are the maximal cliques in the graph, and Z is defined as $Z = \sum_{x_1} \sum_{x_2} \cdots \sum_{x_n} \exp \sum_{c \in \mathcal{C}} f_c(x_c)$. Undirected models are also referred to as log-linear models. 19

Undirected Example Our previous example as an undirected graph, with potentials $\psi_{X_1,X_2}(x_1, x_2) = p(x_1) p(x_2 \mid x_1)$, $\psi_{X_1,X_3}(x_1, x_3) = p(x_3 \mid x_1)$, $\psi_{X_2,X_4}(x_2, x_4) = p(x_4 \mid x_2)$, $\psi_{X_3,X_5}(x_3, x_5) = p(x_5 \mid x_3)$ and $\psi_{X_2,X_5,X_6}(x_2, x_5, x_6) = p(x_6 \mid x_2, x_5)$. The product of all potential functions yields the earlier expansion, with Z = 1. 20

Aside: Factor Graphs An alternative representation: show maximal cliques as factors (boxes), which are connected to each of the nodes in the clique. This reduces any graph to a pair-wise MRF; potentials are applied at the factors. Factors are labelled with the combined labels of their incident nodes (figure: factors $F_{12}, F_{13}, F_{24}, F_{35}, F_{256}$ connecting the corresponding nodes). 21

Conditional Random Fields This is a conditional undirected model, used for sequence tagging. It uses a similar structure to an HMM; however, it is conditioned on the observations (tokens). (figure: chain of states $S_1, S_2, S_3, \ldots$ with observations $O_1, O_2, O_3, \ldots$) Conditioning removes the observations from consideration, leaving a chain over $S_1, S_2, S_3, \ldots$; instead the observations are incorporated into the clique potentials, and thus the normalisation term, Z. 22

Conditional Random Fields Probabilistic formulation (after expanding the clique potentials into feature functions, h): $p(\mathbf{s} \mid \mathbf{o}) = \frac{1}{Z(\mathbf{o})} \exp \sum_{c \in \mathcal{C}} \sum_j \lambda_j h_j(\mathbf{s}_c, \mathbf{o}, c)$. Typically features are binary {0, 1} and supplied by the user (not learnt); some examples: $h_5(\mathbf{s}_c, \mathbf{o}, \{i, j\}) = 1$ if $\mathbf{s}_c = \{DT, NN\} \wedge o_j = $ "dog", and 0 otherwise; $h_{82}(\mathbf{s}_c, \mathbf{o}, \{i, j\}) = 1$ if $\mathbf{s}_c = \{\text{any}, VBG\} \wedge o_j$ ends with "ing", and 0 otherwise. N.b. the normalisation term Z is now a function of the observations: $Z(\mathbf{o}) \triangleq \sum_{s_1} \sum_{s_2} \cdots \sum_{s_T} \exp \sum_{c \in \mathcal{C}} \sum_j \lambda_j h_j(\mathbf{s}_c, \mathbf{o}, c)$. 23
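
A sketch of the two feature functions above and the resulting clique factor $\exp \sum_j \lambda_j h_j$; the weights are invented, since in practice they are learnt from data.

```python
import math

# Two illustrative binary feature functions over a pairwise clique c = (i, j),
# following the slide's examples; the weights below are invented placeholders.
def h5(s_c, o, c):
    _, j = c
    return 1.0 if s_c == ("DT", "NN") and o[j] == "dog" else 0.0

def h82(s_c, o, c):
    _, j = c
    return 1.0 if s_c[1] == "VBG" and o[j].endswith("ing") else 0.0

features = [h5, h82]
weights = [1.5, 0.8]   # the lambda_j, normally learnt in training

def clique_score(s_c, o, c):
    """exp( sum_j lambda_j h_j(s_c, o, c) ) -- one factor in the CRF product."""
    return math.exp(sum(w * h(s_c, o, c) for w, h in zip(weights, features)))

o = ["the", "dog", "is", "barking"]
print(clique_score(("DT", "NN"), o, (0, 1)))   # fires h5
print(clique_score(("NN", "VBG"), o, (2, 3)))  # fires h82
```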

Aside: The Maxent Classifier The simplest undirected model (figure: isolated nodes $X_1, X_2, X_3, X_4, \ldots$). The maximal cliques are the singleton nodes themselves; each random variable is independent of all others. As such, the partition function Z can be decomposed (and is therefore very simple to calculate!) 24

Aside: The MEMM The Maximum Entropy Markov Model (MEMM) was a precursor to the CRF, locally normalised over each transition (instead of globally, over a sequence). (figure: chain of states $S_1, S_2, S_3$ with observations $O_1, O_2, O_3$.) The probability function is $p(\mathbf{s} \mid \mathbf{o}) = \prod_t p(s_t \mid s_{t-1}, \mathbf{o})$, with the transition distribution $p(s_t \mid s_{t-1}, \mathbf{o}) = \frac{1}{Z(s_{t-1}, \mathbf{o})} \exp \sum_j \lambda_j f_j(s_{t-1}, s_t, \mathbf{o}, t)$ and local partition function $Z(s_{t-1}, \mathbf{o}) \triangleq \sum_{s_t} \exp \sum_j \lambda_j f_j(s_{t-1}, s_t, \mathbf{o}, t)$. 25

Inference The process of reasoning under the model. E.g. what is the marginal probability of $x_2$, $p(x_2)$, or the marginal probability of both $x_1$ and $x_2$, $p(x_1, x_2)$? E.g. if we observe $X_4 = x_4$, what is the probability of $x_1$, $p(x_1 \mid x_4)$? E.g. what combinations of $x_1, \ldots, x_6$ yield the maximum probability? Let's work through an example (figure: graph over $X_1, \ldots, X_6$). 26

Inference Example Calculate $p(x_1 \mid x_4) = \frac{p(x_1, x_4)}{\sum_{x_1} p(x_1, x_4)}$: condition on $X_4$ (blue) and marginalise (sum) out $X_2, X_3, X_5, X_6$ (green). Formally: $p(x_1, x_4) = \sum_{x_2} \sum_{x_3} \sum_{x_5} \sum_{x_6} p(x_1, x_2, x_3, x_4, x_5, x_6) = \sum_{x_2} \sum_{x_3} \sum_{x_5} \sum_{x_6} \frac{1}{Z} \psi_{12}(x_1, x_2) \psi_{23}(x_2, x_3) \psi_{24}(x_2, x_4) \psi_{15}(x_1, x_5) \psi_{46}(x_4, x_6) = \frac{1}{Z} \sum_{x_2} \psi_{12}(x_1, x_2) \psi_{24}(x_2, x_4) \sum_{x_3} \psi_{23}(x_2, x_3) \sum_{x_5} \psi_{15}(x_1, x_5) \sum_{x_6} \psi_{46}(x_4, x_6)$. 27

Inference Example (cont.) We introduce m terms to successively eliminate variables: $p(x_1, x_4) = \frac{1}{Z} \sum_{x_2} \psi_{12}(x_1, x_2) \psi_{24}(x_2, x_4) \sum_{x_3} \psi_{23}(x_2, x_3) \sum_{x_5} \psi_{15}(x_1, x_5) \sum_{x_6} \psi_{46}(x_4, x_6) \propto \sum_{x_2} \psi_{12}(x_1, x_2) \psi_{24}(x_2, x_4) \sum_{x_3} \psi_{23}(x_2, x_3) \sum_{x_5} \psi_{15}(x_1, x_5)$, where we can omit the last term, as it only varies with $x_4$, which is fixed. Let $m_5(x_1) \triangleq \sum_{x_5} \psi_{15}(x_1, x_5)$, giving $\sum_{x_2} \psi_{12}(x_1, x_2) \psi_{24}(x_2, x_4) \sum_{x_3} \psi_{23}(x_2, x_3)\, m_5(x_1)$. Let $m_3(x_2) \triangleq \sum_{x_3} \psi_{23}(x_2, x_3)$, giving $\sum_{x_2} \psi_{12}(x_1, x_2) \psi_{24}(x_2, x_4)\, m_3(x_2)\, m_5(x_1)$. Let $m_2(x_1) \triangleq \sum_{x_2} \psi_{12}(x_1, x_2) \psi_{24}(x_2, x_4)\, m_3(x_2)$, giving $m_2(x_1)\, m_5(x_1)$. 28

Inference Example (cont.) Finally, $p(x_1 \mid x_4) = \frac{m_2(x_1)\, m_5(x_1)}{\sum_{x_1} m_2(x_1)\, m_5(x_1)}$. The $m_2$, $m_3$, $m_5$ functions store the partial sums, which can each be computed easily; each time we introduce an m function, we are eliminating variables from the equation. Aside: the term involving $x_6$ that was omitted from the calculation demonstrates how the distribution represents conditional independence: $X_1$ is conditionally independent of $X_6$ when given $X_4$. This corresponds to graph separation - node 4 separates nodes 1 and 6. This process is called the Elimination Algorithm, and allows calculation of probabilities after observing the values of some variables and marginalising out (eliminating) others. See Pearl, 88 for a general description of the algorithm. 29
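
A sketch of this elimination, following the slide's ordering ($x_5$, $x_3$, $x_2$, with the $x_6$ term dropped); the potentials are random placeholders, and a brute-force sum over the full joint checks the result.

```python
import itertools
import random

random.seed(0)
VALS = [0, 1]

# Pairwise potentials for the example graph (edges 1-2, 2-3, 2-4, 1-5, 4-6),
# filled with arbitrary positive numbers for illustration.
def random_potential():
    return {vw: random.uniform(0.5, 2.0) for vw in itertools.product(VALS, VALS)}

psi12, psi23, psi24, psi15, psi46 = (random_potential() for _ in range(5))

def p_x1_given_x4(x4):
    """Eliminate x5, x3, x2 in turn, as on the slide (the x6 term cancels)."""
    m5 = {x1: sum(psi15[x1, x5] for x5 in VALS) for x1 in VALS}
    m3 = {x2: sum(psi23[x2, x3] for x3 in VALS) for x2 in VALS}
    m2 = {x1: sum(psi12[x1, x2] * psi24[x2, x4] * m3[x2] for x2 in VALS)
          for x1 in VALS}
    unnorm = {x1: m2[x1] * m5[x1] for x1 in VALS}
    total = sum(unnorm.values())
    return {x1: v / total for x1, v in unnorm.items()}

# Brute-force check against the full joint (proportional to the product of potentials).
def brute(x4):
    scores = {x1: 0.0 for x1 in VALS}
    for x1, x2, x3, x5, x6 in itertools.product(VALS, repeat=5):
        scores[x1] += (psi12[x1, x2] * psi23[x2, x3] * psi24[x2, x4]
                       * psi15[x1, x5] * psi46[x4, x6])
    total = sum(scores.values())
    return {x1: v / total for x1, v in scores.items()}

print(p_x1_given_x4(1))
print(brute(1))   # matches the elimination result
```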

Elimination Order Each elimination step removed a sub-graph from consideration and summarised the effect of removing that sub-graph on the state of the neighbouring node (figure: the example graph with elimination steps 1: $m_5$ from $X_5$, 2: $m_3$ from $X_3$, 3: $m_2$ from $X_2$). There are many alternative elimination orderings. Conditioned and unreachable nodes are not considered explicitly (only in the potentials). 30

Belief Propagation The elimination algorithm needs to be run many times to calculate the marginals for all nodes in the graph. We can do better: belief propagation allows us to consider all possible m() functions, reusing these functions rather than recalculating them numerous times. Terminology: $m_i(x_j, x_k)$ is called the message from i, parameterised by $x_j$ and $x_k$. Belief propagation (BP) is also called message passing, or sum-product, or max-product. In fact, the forward-backward and Viterbi algorithms for HMMs are both instances of BP. This forms the core of the Junction Tree algorithm. 31

BP Example Take the graphical model describing the occurrence of a Burglary, the house Alarm sounding, and the unreliable testimonies of neighbours Tim and Steven, who have promised to call whenever the alarm sounds (figure: B connected to A, with A also connected to S and T): $p(b, a, s, t) \propto \psi_B(b) \psi_{AB}(a, b) \psi_{AS}(a, s) \psi_{AT}(a, t)$. First, we must set (or somehow find) the potential functions $\psi$; we will assume these have been supplied. 32

BP Example: Sum-Product Belief propagation algorithm: first we select a node as the root (say, S). Pass 1: gather messages in from the leaves towards the root: {B, T}, {A}, {S}. Pass 2: distribute messages from the root to the leaves: {S}, {A}, {B, T}. Each message over an edge (source, target) sums out the source node variable from the product of the edge weight and all incoming messages to the source. E.g. the message from A to S is $m_A(s) = \sum_a \psi_{AS}(a, s)\, m_T(a)\, m_B(a)$. 33

BP Example: Sum-Product (cont.) First pass (gather) in red, second pass (distribute) in blue (figure: messages 1: $m_b(a)$ and $m_t(a)$, 2: $m_a(s)$, 3: $m_s(a)$, 4: $m_a(b)$ and $m_a(t)$). After receiving all incoming messages, a node knows its marginal probability exactly. At this point it can communicate to a neighbouring node what it believes about the neighbour's state. 34

Sum-Product Finally, the marginal distribution for a node can be computed by taking the product of incoming messages (and normalising): $p(x) \propto \prod_{Y \in \mathcal{N}(X)} m_Y(x)$. The marginal distribution over an edge (X, Y) has a similar form: $p(x, y) \propto \psi_{XY}(x, y) \prod_{Z \in \mathcal{N}(X) \setminus Y} m_Z(x) \prod_{W \in \mathcal{N}(Y) \setminus X} m_W(y)$, where the messages are defined as $m_X(y) = \sum_x \psi_{XY}(x, y) \prod_{Z \in \mathcal{N}(X) \setminus Y} m_Z(x)$. The full joint probability can now be recovered from the node and edge marginals: $p(x_1, \ldots, x_n) = \prod_i p(x_i) \prod_{(j,k) \in \mathcal{C}} \frac{p(x_j, x_k)}{p(x_j) p(x_k)}$. 35
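
A sketch of the two message passes on the alarm example, with invented potential values; the node marginals come out as normalised products of incoming messages, and a brute-force sum over the joint checks p(s).

```python
import itertools

VALS = [0, 1]

# Tree-shaped model from the alarm example:
#   p(b, a, s, t) is proportional to psi_B(b) psi_AB(a, b) psi_AS(a, s) psi_AT(a, t)
# All potential values below are invented for illustration.
psi_b  = {0: 0.9, 1: 0.1}
psi_ab = {(0, 0): 0.95, (0, 1): 0.1, (1, 0): 0.05, (1, 1): 0.9}
psi_as = {(0, 0): 0.8,  (0, 1): 0.3, (1, 0): 0.2,  (1, 1): 0.7}
psi_at = {(0, 0): 0.9,  (0, 1): 0.4, (1, 0): 0.1,  (1, 1): 0.6}

# Gather pass towards the root S: leaves B and T send to A, then A sends to S.
m_b_to_a = {a: sum(psi_b[b] * psi_ab[a, b] for b in VALS) for a in VALS}
m_t_to_a = {a: sum(psi_at[a, t] for t in VALS) for a in VALS}
m_a_to_s = {s: sum(psi_as[a, s] * m_b_to_a[a] * m_t_to_a[a] for a in VALS) for s in VALS}

# Distribute pass from the root back out: S to A, then A to B and A to T.
m_s_to_a = {a: sum(psi_as[a, s] for s in VALS) for a in VALS}
m_a_to_b = {b: sum(psi_ab[a, b] * m_t_to_a[a] * m_s_to_a[a] for a in VALS) for b in VALS}
m_a_to_t = {t: sum(psi_at[a, t] * m_b_to_a[a] * m_s_to_a[a] for a in VALS) for t in VALS}

def normalise(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

# Each node's marginal is the normalised product of its incoming messages
# (times its own unary potential, if any).
print("p(s):", normalise(m_a_to_s))
print("p(a):", normalise({a: m_b_to_a[a] * m_t_to_a[a] * m_s_to_a[a] for a in VALS}))
print("p(b):", normalise({b: psi_b[b] * m_a_to_b[b] for b in VALS}))

# Brute-force check of p(s) against the full joint.
joint_s = {s: sum(psi_b[b] * psi_ab[a, b] * psi_as[a, s] * psi_at[a, t]
                  for b, a, t in itertools.product(VALS, repeat=3)) for s in VALS}
print("check:", normalise(joint_s))
```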

Max-Product The max-product algorithm is another variant of BP where, instead of summing the effect of neighbouring sub-graphs, we find the maximising configurations. It uses the same gather, distribute message passing schedule as earlier; the messages are instead defined as $m_X(y) = \max_x \psi_{XY}(x, y) \prod_{Z \in \mathcal{N}(X) \setminus Y} m_Z(x)$. The best configuration is found by locally maximising the distribution at each node: $x^* = \operatorname{argmax}_x \prod_{Y \in \mathcal{N}(X)} m_Y(x)$. The Viterbi algorithm is an instance of max-product BP, applied to directed chains. 36
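
A sketch of max-product on a chain, i.e. Viterbi decoding, with an invented label set and potential values; backpointers recover the best configuration.

```python
# Max-product on a chain is the Viterbi algorithm. The label set and potential
# values below are invented; psi_t(prev, cur) stands for the combined transition
# and observation score at position t.
states = ["N", "V"]
psis = [
    {("<s>", "N"): 0.7, ("<s>", "V"): 0.3},                                  # position 0
    {("N", "N"): 0.2, ("N", "V"): 0.8, ("V", "N"): 0.6, ("V", "V"): 0.4},    # position 1
    {("N", "N"): 0.5, ("N", "V"): 0.5, ("V", "N"): 0.9, ("V", "V"): 0.1},    # position 2
]

def viterbi(psis, states):
    best = {"<s>": 1.0}       # max-product "message" carried to the next position
    backptrs = []
    for psi in psis:
        scores, bp = {}, {}
        for cur in states:
            prev = max(best, key=lambda p: best[p] * psi[p, cur])
            scores[cur] = best[prev] * psi[prev, cur]
            bp[cur] = prev
        best = scores
        backptrs.append(bp)
    # Recover the best configuration by following the backpointers from the end.
    state = max(best, key=best.get)
    path = [state]
    for bp in reversed(backptrs[1:]):
        state = bp[state]
        path.append(state)
    return list(reversed(path))

print(viterbi(psis, states))  # -> ['N', 'V', 'N']
```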

Loopy BP While the motivation and theory behind BP is based on trees, it can be applied to loopy graphs in two ways: (1) collapse nodes together to form the junction tree (exact, and often expensive); or (2) continue to pass messages around the graph until the messages cease to change (convergence). The latter is called loopy BP, and is only approximate, with no convergence guarantee - messages can pass around a loop indefinitely. It is empirically often quite accurate and reliable, and much more efficient than exact inference over the junction tree, but is particularly sensitive to the message passing schedule (the order of message passing through the graph). 37

Estimating the Model Parameters Until now, we've assumed the model was given, i.e. the structure of the graph and the potential tables. How do we learn these from data? If the data is fully observed (i.e. all random variables are given values, as is the case for many NLP applications), we can find the parameters which maximise the probability of the data: this is the maximum likelihood estimator (MLE). If the data is only partially observed (e.g. machine translation, where we want word alignments but are only given aligned sentences), we must resort to other methods. Discriminative models (including CRFs) aren't very good in this situation; generative models (e.g. HMMs) are more appropriate. 38

Training the Model Recall the CRF probability density function: $p(\mathbf{s} \mid \mathbf{o}) = \frac{1}{Z(\mathbf{o})} \exp \sum_{c \in \mathcal{C}} \sum_j \lambda_j h_j(\mathbf{s}_c, \mathbf{o}, c)$, where the $\lambda$ are the parameters of the model (values learnt in training) and the h are feature functions. The MLE estimate of the parameters maximises the (log) likelihood of the training data $\mathcal{D} = \{\langle \mathbf{s}^{(i)}, \mathbf{o}^{(i)} \rangle\}_{i=1}^{N}$: $\mathcal{L} = \log \prod_{i=1}^{N} p(\mathbf{s}^{(i)} \mid \mathbf{o}^{(i)}) = \sum_{i=1}^{N} \log p(\mathbf{s}^{(i)} \mid \mathbf{o}^{(i)}) = \sum_{i=1}^{N} \left[ \sum_{c \in \mathcal{C}^{(i)}} \sum_j \lambda_j h_j(\mathbf{s}^{(i)}_c, \mathbf{o}^{(i)}, c) - \log Z(\mathbf{o}^{(i)}) \right]$. 39

MLE Training This log-likelihood cannot be solved analytically; instead it is optimised by numerical methods. The log-likelihood is convex, i.e. there are no local optima, only a single global optimum. Originally, methods such as IIS and GIS were used; conjugate gradient and L-BFGS are more in vogue, being much faster and more effective. These perform gradient ascent until the global optimum is found, and require the derivative of the log-likelihood with respect to each parameter. 40

MLE Log-likelihood Gradient The gradient is given by: $\frac{\partial \mathcal{L}}{\partial \lambda_k} = \sum_{i=1}^{N} \sum_{c \in \mathcal{C}^{(i)}} h_k(\mathbf{s}^{(i)}_c, \mathbf{o}^{(i)}, c) - \sum_{i=1}^{N} \sum_{\mathbf{s}} p(\mathbf{s} \mid \mathbf{o}^{(i)}) \sum_{c \in \mathcal{C}^{(i)}} h_k(\mathbf{s}_c, \mathbf{o}^{(i)}, c) = \sum_{i=1}^{N} \sum_{c \in \mathcal{C}^{(i)}} h_k(\mathbf{s}^{(i)}_c, \mathbf{o}^{(i)}, c) - \sum_{i=1}^{N} \sum_{c \in \mathcal{C}^{(i)}} \sum_{\mathbf{s}_c} p(\mathbf{s}_c \mid \mathbf{o}^{(i)}) h_k(\mathbf{s}_c, \mathbf{o}^{(i)}, c) = E_{\tilde{p}(\mathbf{s}, \mathbf{o})}[h_k] - E_{p(\mathbf{s} \mid \mathbf{o})\tilde{p}(\mathbf{o})}[h_k]$. This is the standard maxent form: observed feature count minus expected feature count. Finding the expected feature count requires belief propagation (sum-product) to recover the marginal distributions over each maximal clique; recall that for a chain, there is a maximal clique for each adjacent pair of nodes. 41
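
A brute-force sketch of this gradient (observed minus expected feature counts) for a toy labelling task; the feature set and data are invented, and a real implementation would obtain the clique marginals via sum-product rather than enumerating every labelling.

```python
import math
from collections import defaultdict

# Toy label set and features, invented for illustration only.
LABELS = ["DT", "NN", "VB"]

def features(s, o):
    """Counts of transition and emission features for labelling s of observation o."""
    counts = defaultdict(float)
    counts[("emit", s[0], o[0])] += 1.0
    for i in range(1, len(s)):
        counts[("trans", s[i - 1], s[i])] += 1.0
        counts[("emit", s[i], o[i])] += 1.0
    return counts

def all_labellings(n):
    if n == 0:
        yield []
        return
    for rest in all_labellings(n - 1):
        for label in LABELS:
            yield rest + [label]

def gradient(weights, s_gold, o):
    score = lambda s: math.exp(sum(weights.get(f, 0.0) * v
                                   for f, v in features(s, o).items()))
    Z = sum(score(s) for s in all_labellings(len(o)))
    grad = defaultdict(float, features(s_gold, o))        # observed counts
    for s in all_labellings(len(o)):                       # minus expected counts
        p = score(s) / Z
        for f, v in features(s, o).items():
            grad[f] -= p * v
    return grad

o = ["the", "dog", "barks"]
print(dict(gradient({}, ["DT", "NN", "VB"], o)))
```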

MAP Training: Using a Prior As we often include thousands or millions of features, making the model fit each exactly is counter-productive. Use of a prior limits the modelling power, and thus the ability to over-fit the training data: $p(\Lambda \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \Lambda) p(\Lambda)}{p(\mathcal{D})} \propto p(\mathcal{D} \mid \Lambda) p(\Lambda)$. Typically we use a Gaussian (normal) distribution, $p(\Lambda) \propto \exp \left( -\frac{1}{2} \sum_j \left( \frac{\lambda_j - \mu_j}{\sigma_j} \right)^2 \right)$; this embodies an assumption that the weights should tend towards their mean (usually zero, i.e. ignore the feature by default), and each feature should be penalised for straying from its mean. The new objective is now (after excluding constant terms): $\mathcal{O} = \mathcal{L} - \frac{1}{2} \sum_j \left( \frac{\lambda_j - \mu_j}{\sigma_j} \right)^2$. Many other prior distributions are used with log-linear models, e.g. Laplacian, Hyperbolic. 42
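
A small sketch of how the zero-mean Gaussian penalty modifies the objective and its gradient; the function name and numbers are placeholders, not part of the tutorial.

```python
# MAP training adds a zero-mean Gaussian penalty to the log-likelihood:
#   O = L - sum_j lambda_j^2 / (2 sigma^2), and the gradient gains -lambda_j / sigma^2.
def penalise(log_likelihood, gradient, weights, sigma=1.0):
    objective = log_likelihood - sum(w * w for w in weights.values()) / (2 * sigma ** 2)
    new_gradient = {f: g - weights.get(f, 0.0) / sigma ** 2
                    for f, g in gradient.items()}
    return objective, new_gradient

# Example with made-up numbers: a strong weight is pulled back towards zero.
obj, grad = penalise(-3.2, {"f1": 0.4, "f2": -0.1}, {"f1": 2.0, "f2": 0.5}, sigma=1.0)
print(obj, grad)   # -> -5.325 {'f1': -1.6, 'f2': -0.6}
```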

Other Training Methods Perceptron training repeatedly processes the training data, until convergence: decoding each instance and, whenever an error is made (i.e. the predicted labelling and gold labelling differ), updating the parameters. Pseudolikelihood optimises the log-likelihood over smaller sub-graphs, where the remaining portion of the graph is observed with the gold standard labellings: $p_{PL}(\mathbf{s} \mid \mathbf{o}) = \prod_{t \in T} p(s_t \mid \mathbf{s}_{T \setminus t}, \mathbf{o})$. Piecewise training optimises the log-likelihood over smaller sub-graphs, where the rest of the graph is completely ignored. 43
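
A self-contained sketch of perceptron training for sequence labelling, using brute-force decoding over a tiny invented label set and dataset; a real implementation would decode with Viterbi.

```python
import itertools
from collections import defaultdict

# Everything here (labels, features, data) is invented for illustration.
LABELS = ["DT", "NN", "VB"]

def features(s, o):
    counts = defaultdict(float)
    for i, (label, word) in enumerate(zip(s, o)):
        counts[("emit", label, word)] += 1.0
        if i > 0:
            counts[("trans", s[i - 1], label)] += 1.0
    return counts

def decode(o, weights):
    score = lambda s: sum(weights[f] * v for f, v in features(s, o).items())
    return max(itertools.product(LABELS, repeat=len(o)), key=score)

def perceptron_train(data, epochs=5):
    weights = defaultdict(float)
    for _ in range(epochs):
        for o, s_gold in data:
            s_pred = decode(o, weights)
            if s_pred != s_gold:   # on an error, move towards gold, away from prediction
                for f, v in features(s_gold, o).items():
                    weights[f] += v
                for f, v in features(s_pred, o).items():
                    weights[f] -= v
    return weights

data = [(("the", "dog", "barks"), ("DT", "NN", "VB")),
        (("the", "cat", "sleeps"), ("DT", "NN", "VB"))]
w = perceptron_train(data)
print(decode(("the", "dog", "sleeps"), w))  # -> ('DT', 'NN', 'VB')
```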

Dynamic Conditional Random Fields Why stop at one layer? We can simultaneously model multiple layers of annotation; this avoids the premature removal of ambiguity in the typical cascade. Instead, the different layers of annotation can interact, revising decisions made in other layers until the best joint labelling is found. E.g., chunk tagging and part-of-speech tagging (figure: chunk tags $C_1, C_2, C_3, C_4, \ldots$, part-of-speech tags $P_1, P_2, P_3, P_4, \ldots$, and tokens $T_1, T_2, T_3, T_4, \ldots$). Caveat: we are forced to use the junction tree (intractable) or loopy BP (inexact). 44

Skip-chain CRFs Used for named-entity recognition, where we wish to find all person, location, organisation, etc. references in a text. Chain CRFs are quite effective for this task; however, they often tag many instances of one word with different entity labels, whereas in a given document, repeated instances of a word will tend to have the same label. The skip-chain CRF adds extra skip edges between consecutive mentions in a document, so that evidence in one chain influences the other chains (figure: chains for sentence 1, sentence 2, ..., sentence 14, with skip edges linking repeated mentions). 45

Tree CRFs Used for semantic role labelling: given a parse tree, decide which constituents fill semantic roles for a given verb. Roles include agent, patient, theme, etc. We annotate the parse structure with role information (example: [agent The luxury auto maker] [temporal adjunct last year] [verb sold] [patient 1,214 cars] [locative adjunct in the US]). 46

Tractability Issues Training a CRF is expensive; MLE/MAP training requires 100s of iterations, each involving calculation of the log-likelihood and its derivative. For a chain, each iteration costs $O(L^2 T F)$, where L is the number of labels, T is the total length of the training sequences, and F is the average number of active features. Decoding is also expensive, $O(L^2 T F)$, but only requires one iteration; perceptron training, which repeatedly decodes the training instances, can reach a good solution quickly. Approximations can speed up both training and decoding: pseudo-likelihood training, piecewise training, beam search, error-correcting output codes, feature selection. Memory usage is also a concern: typically we parallelise the implementation and run on cluster computers. 47

References Graphical models and belief propagation Judea Pearl, Probabilistic reasoning in intelligent systems: Networks of plausible inference, Morgan Kaufmann, 1988. Michael Jordan, Graphical Models, Statistical Science 19, pages 140-155, 2004. Jonathan Yedidia, William Freeman and Yair Weiss, Understanding belief propagation and its generalisation, IJCAI 2001. Maximum entropy models Adam Berger, Stephen Della Pietra, and Vincent Della Pietra, A maximum entropy approach to natural language processing, Computational Linguistics, 1996 Tutorial: http://www.cs.cmu.edu/~aberger/maxent.html 48

Maximum Entropy Markov Models References Andrew McCallum, Dayne Freitag and Fernando Pereira, Maximum Entropy Markov Models for Information Extraction and Segmentation, ICML-2000. Adwait Ratnaparkhi, A maximum entropy part-of-speech tagger, EMNLP 1996. Conditional Random Fields John Lafferty, Andrew McCallum, and Fernando Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, ICML 2001. Andrew McCallum, Dynamic conditional random fields: Factorised probabilistic models for labelling and segmenting sequence data, ICML 2004. Charles Sutton and Andrew McCallum, Collective Segmentation and Labeling of Distant Entities in Information Extraction, ICML workshop on Statistical Relational Learning, 2004. 49

References Some applications of CRFs Fei Sha and Fernando Pereira, Shallow parsing with conditional random fields, HLT- NAACL 2003 David Pinto, Andrew McCallum, Xing Wei and Bruce Croft, Table extraction using conditional random fields, SIGIR 2003 Andrew McCallum and Wei Li, Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons, CoNLL 2003 Trevor Cohn and Philip Blunsom, Semantic Role Labelling with Tree Conditional Random Fields, CoNLL 2005 50

References Alternative training methods for CRFs Andrew McCallum, Efficiently inducing features of Conditional Random Fields, UAI 2003. Brian Roark, Murat Saraclar, Michael Collins and Mark Johnson, Discriminative Language Modeling with Conditional Random Fields and the Perceptron Algorithm, ACL 2004. Trevor Cohn and Andrew Smith and Miles Osborne, Scaling conditional random fields using error-correcting codes, ACL 2005. Charles Sutton and Andrew McCallum, Piecewise Training for Undirected Models, UAI, 2005. Andrew Smith and Miles Osborne, Regularisation Techniques for Conditional Random Fields: Parameterised versus Parameter-free, IJCNLP, 2005. 51

Software JavaBayes - clean and simple graphical presentation of Bayesian networks http://www-2.cs.cmu.edu/~javabayes Graphical models toolkit (GMTK) - closed source, efficient Bayesian network package http://ssli.ee.washington.edu/~bilmes/gmtk/ Trigrams'n'Tags (TnT) - fast second order hidden Markov model http://www.coli.uni-saarland.de/~thorsten/tnt/ Zhang Le's maximum entropy classifier http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html MALLET - Java implementations of many classifiers (including maximum entropy) as well as a CRF http://mallet.cs.umass.edu 52