

10-708: Probabilistic Graphical Models, Spring 2014

Discrete sequential models and CRFs

Lecturer: Eric P. Xing    Scribes: Pankesh Bamotra, Xuanhong Li

1 Case Study: Supervised Part-of-Speech Tagging

Supervised part-of-speech tagging is the task of labeling each word of a sentence with a predefined tag, such as noun, verb, and so on. As shown in Figure 1, we are given data $D = \{x^{(n)}, y^{(n)}\}$, where $x$ represents the word sequence and $y$ represents the tag sequence.

Figure 1: Supervised Part-of-Speech Tagging

This problem can be approached with many different methods. Here we discuss three of them: the Markov random field, the Bayes network, and the conditional random field.

Markov Random Field (Figure 2): models the joint distribution over the tags $Y_i$ and words $X_i$. The individual factors are not probabilities, so a normalization constant $Z$ is needed.

Bayes Network (Figure 3): also models the joint distribution over the tags $Y_i$ and words $X_i$. Note that here the individual factors are probabilities, so $Z = 1$.

Conditional Random Field (Figure 4): models the conditional distribution over the tags $Y_i$ given the words $X_i$. The factors and $Z$ are specific to the sentence $X$.

Figure 2: Supervised Part-of-Speech Tagging with Markov Random Field

Figure 3: Supervised Part-of-Speech Tagging with Bayes Network

Figure 4: Supervised Part-of-Speech Tagging with Conditional Random Field

2 Review of Inference Algorithms

The forward-backward algorithm (marginal inference) and the Viterbi algorithm (MAP inference) are the two major inference algorithms we have seen so far. It turns out that both are belief propagation algorithms.

Figure 5: Belief Propagation in the Forward-backward Algorithm

For example, in the forward-backward algorithm shown in Figure 5, $\alpha$ is the belief from the forward pass, $\beta$ is the belief from the backward pass, and $\psi$ is the belief from the local evidence $x_i$. The belief that $Y_2 = n$ is then the product of the beliefs arriving from the three directions: $\alpha(n)\beta(n)\psi(n)$.
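To make the message-passing picture concrete, here is a minimal sketch (Python/NumPy) of the forward-backward computation for a hypothetical two-state, three-step chain; the transition, initial, and emission tables are made up for illustration. It computes the belief at $Y_2$ as the product of the three incoming messages.

```python
import numpy as np

# A two-state, three-step chain (states: 0 = noun, 1 = verb); all tables
# below are made up for illustration.
A = np.array([[0.7, 0.3],      # A[i, j] = P(Y_{t+1} = j | Y_t = i)
              [0.4, 0.6]])
pi = np.array([0.5, 0.5])      # initial distribution over Y_1
psi = np.array([[0.9, 0.2],    # psi[t, y] = P(x_t | Y_t = y) for the
                [0.3, 0.8],    # fixed observed sentence x_1..x_3
                [0.6, 0.5]])
T, K = psi.shape

# alpha[t]: message arriving at Y_t from the left (excludes psi[t]).
alpha = np.zeros((T, K))
alpha[0] = pi
for t in range(1, T):
    alpha[t] = (alpha[t - 1] * psi[t - 1]) @ A

# beta[t]: message arriving at Y_t from the right (excludes psi[t]).
beta = np.ones((T, K))
for t in range(T - 2, -1, -1):
    beta[t] = A @ (psi[t + 1] * beta[t + 1])

# Belief at Y_2 (index 1): product of the three incoming messages.
belief = alpha[1] * beta[1] * psi[1]
print(belief / belief.sum())   # posterior P(Y_2 | x_1..x_3)
```

Defining $\alpha$ and $\beta$ as the messages arriving at a node, excluding its own local evidence $\psi$, matches the three-factor product $\alpha(n)\beta(n)\psi(n)$ described above.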

3 Hidden Markov Model (HMM) and Maximum Entropy Markov Model (MEMM)

3.1 HMM

Figure 6: Hidden Markov Model

The HMM (Figure 6) is a graphical model with latent variables Y and observed variables X. It is a simple model for sequential data such as language and speech. However, the HMM has two issues.

Locality of features: an HMM captures only the dependencies between each state and its corresponding observation. But in real-world problems like NLP, each segmental state may depend not only on a single word (and the adjacent segmental stages), but also on non-local features of the whole line, such as line length, indentation, and the amount of white space.

Mismatch of objective function: an HMM learns a joint distribution over states and observations, $P(Y, X)$, but in a prediction task we need the conditional probability $P(Y \mid X)$.

3.2 MEMM

Figure 7: Maximum Entropy Markov Model

The MEMM (Figure 7) addresses these problems of the HMM. It gives the full observation sequence to every state, which makes the model more expressive than the HMM. It is also a discriminative model: it ignores modeling $P(X)$ entirely, and its learning objective, $P(Y \mid X)$, is consistent with the prediction task. However, the MEMM suffers from the label bias problem: it prefers states with fewer outgoing transitions over others. One way to avoid this problem is to stop normalizing the probabilities locally. This improvement gives us the conditional random field (CRF).

4 Comparison between Markov Random Field and Conditional Random Field

4.1 Data

CRF: $D = \{x^{(n)}, y^{(n)}\}_{n=1}^{N}$

MRF: $D = \{y^{(n)}\}_{n=1}^{N}$

4.2 Model

CRF: $P(Y \mid X, \theta) = \frac{1}{Z(X, \theta)} \prod_{c \in C} \psi_c(Y, X)$, where $\psi_c(Y, X) = \exp(\theta_c f_c(Y, X))$

MRF: $P(Y \mid \theta) = \frac{1}{Z(\theta)} \prod_{c \in C} \psi_c(Y)$, where $\psi_c(Y) = \exp(\theta_c f_c(Y))$

4.3 Log likelihood

CRF: $\ell(\theta; D) = \frac{1}{N} \sum_{n=1}^{N} \log p(y^{(n)} \mid x^{(n)}, \theta)$

MRF: $\ell(\theta; D) = \frac{1}{N} \sum_{n=1}^{N} \log p(y^{(n)} \mid \theta)$
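The difference between the two likelihoods can be checked with a brute-force sketch (Python/NumPy; the feature functions, weights, and data below are hypothetical). The point to notice is that the CRF normalizer $Z(x, \theta)$ must be recomputed for every observed input, while the MRF normalizer $Z(\theta)$ is a single constant.

```python
import itertools
import numpy as np

# Binary tag sequences y in {0,1}^3; the CRF also observes a binary
# word sequence x in {0,1}^3. All features and weights are made up.
def f_crf(y, x):   # [# adjacent equal tags, # positions where y_i == x_i]
    return np.array([sum(y[i] == y[i + 1] for i in range(2)),
                     sum(yi == xi for yi, xi in zip(y, x))])

def f_mrf(y):      # [# adjacent equal tags, # tags equal to 1]
    return np.array([sum(y[i] == y[i + 1] for i in range(2)), sum(y)])

theta = np.array([0.5, 1.0])
ys = list(itertools.product([0, 1], repeat=3))

def crf_log_lik(y, x):
    # Z(x, theta) depends on the observed input: one sum per example.
    logZ_x = np.log(sum(np.exp(theta @ f_crf(yp, x)) for yp in ys))
    return theta @ f_crf(y, x) - logZ_x

def mrf_log_lik(y):
    # Z(theta) is a single constant shared by every data point.
    logZ = np.log(sum(np.exp(theta @ f_mrf(yp)) for yp in ys))
    return theta @ f_mrf(y) - logZ

print(crf_log_lik((1, 0, 1), x=(1, 0, 1)))  # log p(y | x, theta)
print(mrf_log_lik((1, 0, 1)))               # log p(y | theta)
```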

4.4 Derivatives

CRF: $\frac{\partial \ell}{\partial \theta_{c,k}} = \frac{1}{N} \sum_{n=1}^{N} f_{c,k}(y^{(n)}, x^{(n)}) - \frac{1}{N} \sum_{n=1}^{N} \sum_{y} p(y \mid x^{(n)})\, f_{c,k}(y, x^{(n)})$

MRF: $\frac{\partial \ell}{\partial \theta_{c,k}} = \frac{1}{N} \sum_{n=1}^{N} f_{c,k}(y^{(n)}) - \sum_{y} p(y)\, f_{c,k}(y)$

In both cases the gradient is the difference between the empirical feature counts and the feature counts expected under the model.

5 Generative vs. Discriminative Models - Recap

In simple terms, the difference between generative and discriminative models is that generative models are based on the joint distribution $p(y, x)$, while discriminative models are based on the conditional distribution $p(y \mid x)$. A typical example of a generative-discriminative pair [1] is the Naive Bayes classifier and logistic regression. The principal advantage of discriminative models is that they can incorporate rich features with long-range dependencies. For example, in POS tagging we can incorporate features such as the capitalization of a word, syntactic properties of the neighbouring words, and the location of a word in the sentence. Including such features in generative models generally leads to poor performance.

Figure 8: Relationship between some generative-discriminative pairs

6 CRF formulation

Conditional random fields are formulated as follows:

$P(y_{1:n} \mid x_{1:n}) = \frac{1}{Z(x_{1:n})} \prod_{i=1}^{n} \phi(y_i, y_{i-1}, x_{1:n}) = \frac{1}{Z(x_{1:n}, w)} \prod_{i=1}^{n} \exp\!\left(w^{T} f(y_i, y_{i-1}, x_{1:n})\right)$

An important point to note is that this looks more or less like the MEMM formulation. However, here the partition function is global and sits outside the product of exponential terms.
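A small enumeration sketch (Python/NumPy, with made-up scores) makes this contrast explicit: the MEMM turns each factor into a locally normalized conditional $P(y_i \mid y_{i-1}, x)$, whereas the CRF keeps the factors unnormalized and divides once by the global $Z(x)$. Both define valid distributions over tag sequences, but they place mass differently; the MEMM's per-step normalization is the source of the label bias problem mentioned earlier.

```python
import itertools
import numpy as np

# Hypothetical per-position scores s[i][y_prev, y] = w^T f(y_i, y_{i-1}, x)
# for one fixed input x of length 3 and binary tags; values are random.
rng = np.random.default_rng(0)
s = [rng.normal(size=(2, 2)) for _ in range(3)]

def memm_prob(y):
    # Local normalization: every factor is itself a conditional distribution.
    p, prev = 1.0, 0                      # dummy start state y_0 = 0
    for i, yi in enumerate(y):
        local = np.exp(s[i][prev])        # unnormalized scores for y_i
        p *= local[yi] / local.sum()      # normalize at each step
        prev = yi
    return p

def crf_prob(y):
    # Global normalization: one partition function Z(x) for the sequence.
    def score(seq):
        prev, total = 0, 0.0
        for i, yi in enumerate(seq):
            total += s[i][prev, yi]
            prev = yi
        return np.exp(total)
    Z = sum(score(seq) for seq in itertools.product([0, 1], repeat=3))
    return score(y) / Z

for y in itertools.product([0, 1], repeat=3):
    print(y, round(memm_prob(y), 4), round(crf_prob(y), 4))
# Both columns sum to 1, but the mass is distributed differently.
```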

7 Properties of CRFs

- CRFs are partially directed models.
- CRFs are discriminative models, like MEMMs.
- The CRF formulation has a global normalizer $Z$, which overcomes the label bias problem of MEMMs.
- Being discriminative models, CRFs can model a rich set of features over the entire observation sequence.

8 Linear Chain CRFs

Figure 9: A Linear Chain CRF

In a linear chain conditional random field, we model the potentials between adjacent nodes as a Markov random field conditioned on the input $x$. Thus we can formulate the linear chain CRF as:

$P(y \mid x) = \frac{1}{Z(x, \lambda, \mu)} \exp\!\left(\sum_{i=1}^{n}\Big(\sum_{k} \lambda_k f_k(y_i, y_{i-1}, x) + \sum_{l} \mu_l g_l(y_i, x)\Big)\right)$

where

$Z(x, \lambda, \mu) = \sum_{y} \exp\!\left(\sum_{i=1}^{n}\Big(\sum_{k} \lambda_k f_k(y_i, y_{i-1}, x) + \sum_{l} \mu_l g_l(y_i, x)\Big)\right)$

The $f_k$ and $g_l$ in the equations above are referred to as feature functions; they can encode a rich set of features over the entire observation sequence.
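Computing $Z(x, \lambda, \mu)$ by summing over all $K^n$ tag sequences is intractable in general, but because the features decompose over adjacent pairs, the forward algorithm computes it in $O(nK^2)$. A minimal sketch (Python/NumPy with SciPy's logsumexp; the feature functions, sentence, and weights are hypothetical), with a brute-force check:

```python
import itertools
import numpy as np
from scipy.special import logsumexp

K, n = 2, 4                                # K tags, sentence length n
x = ["the", "dog", "runs", "fast"]         # toy observed sentence

# Hypothetical feature functions: f_k on tag pairs, g_l on (tag, x, position).
def pair_score(y_prev, y, lam):            # sum_k lambda_k f_k(y_i, y_{i-1}, x)
    return lam[0] * (y_prev == y) + lam[1] * (y_prev != y)

def state_score(y, i, mu):                 # sum_l mu_l g_l(y_i, x)
    return mu[0] * (y == 0 and x[i].endswith("s")) + mu[1] * (y == 1)

lam, mu = np.array([0.3, -0.1]), np.array([0.8, 0.2])

# Forward recursion in the log domain: O(n K^2) instead of O(K^n).
log_alpha = np.array([state_score(y, 0, mu) for y in range(K)])
for i in range(1, n):
    log_alpha = np.array([
        logsumexp([log_alpha[yp] + pair_score(yp, y, lam) for yp in range(K)])
        + state_score(y, i, mu)
        for y in range(K)])
print("forward  log Z =", logsumexp(log_alpha))

# Brute-force check, feasible only because K and n are tiny.
def seq_score(ys):
    total = sum(state_score(y, i, mu) for i, y in enumerate(ys))
    return total + sum(pair_score(ys[i - 1], ys[i], lam) for i in range(1, n))
print("brute    log Z =", logsumexp([seq_score(ys) for ys in
                                     itertools.product(range(K), repeat=n)]))
```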

9 CRFs vs. MRFs

9.1 Model

CRF: $P(y \mid x, \theta) = \frac{1}{Z(x, \theta)} \exp(\langle \theta, f(y, x) \rangle)$

MRF: $P(y \mid \theta) = \frac{1}{Z(\theta)} \exp(\langle \theta, f(y) \rangle)$

9.2 Average Likelihood

CRF: $\ell(\theta; D) = \frac{1}{N} \sum_{n=1}^{N} \log p(y^{(n)} \mid x^{(n)}, \theta)$

MRF: $\ell(\theta; D) = \frac{1}{N} \sum_{n=1}^{N} \log p(y^{(n)} \mid \theta)$

9.3 Derivatives

CRF: $\frac{d\,\ell(\theta; D)}{d\theta_{c,k}} = \frac{1}{N} \sum_{n=1}^{N} f_{c,k}(y^{(n)}, x^{(n)}) - \frac{1}{N} \sum_{n=1}^{N} \sum_{y} p(y \mid x^{(n)})\, f_{c,k}(y, x^{(n)})$

MRF: $\frac{d\,\ell(\theta; D)}{d\theta_{c,k}} = \frac{1}{N} \sum_{n=1}^{N} f_{c,k}(y^{(n)}) - \sum_{y} p(y)\, f_{c,k}(y)$

10 Parameter estimation

As we saw in the previous sections, CRFs have parameters $\lambda_k$ and $\mu_l$, which we estimate from the training data $D = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$ with empirical distribution $\tilde{p}(x, y)$. We can use iterative scaling algorithms or gradient descent to maximize the log-likelihood function.

11 Performance

Figure 10: CRFs vs. other sequential models for POS tagging on the Penn Treebank

12 Minimum Bayes Risk Decoding

Decoding in a CRF means choosing a particular $y$ under $p(y \mid x)$ such that some loss function is minimized. A minimum Bayes risk (MBR) decoder returns the assignment that minimizes the expected loss under the model distribution:

$h_\theta(x) = \operatorname*{argmin}_{\hat{y}} \sum_{y} p_\theta(y \mid x)\, L(\hat{y}, y)$

Here $L(\hat{y}, y)$ is a loss function such as the 0-1 loss or the Hamming loss.
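Under Hamming loss, the MBR decoder reduces to taking the per-position argmax of the posterior marginals, which can differ from MAP decoding (the single most probable sequence). A brute-force sketch (Python/NumPy) on a made-up posterior table; in a real CRF the marginals would come from forward-backward:

```python
import itertools
import numpy as np

# A made-up posterior p(y | x) over binary tag sequences of length 3;
# in a real CRF these probabilities would come from forward-backward.
ys = list(itertools.product([0, 1], repeat=3))
p = np.array([0.30, 0.00, 0.00, 0.14, 0.00, 0.14, 0.14, 0.28])  # sums to 1

def hamming(a, b):
    return sum(ai != bi for ai, bi in zip(a, b))

# Direct MBR: minimize the expected Hamming loss over candidate outputs.
risk = {c: sum(pi * hamming(c, y) for pi, y in zip(p, ys)) for c in ys}
mbr = min(risk, key=risk.get)

# Equivalent shortcut for Hamming loss: per-position posterior marginals.
marg = np.zeros((3, 2))
for pi, y in zip(p, ys):
    for i, yi in enumerate(y):
        marg[i, yi] += pi
posterior_decode = tuple(int(np.argmax(marg[i])) for i in range(3))

print(mbr, posterior_decode)   # both give (1, 1, 1)
# MAP decoding would instead return (0, 0, 0), the single most probable
# sequence (p = 0.30), even though each marginal favours tag 1.
```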

13 Applications - Computer Vision

A few applications of CRFs in computer vision are:

- Image segmentation
- Handwriting recognition
- Pose estimation
- Object recognition

13.1 Image segmentation

Image segmentation can be modelled with conditional random fields. We exploit the fact that the foreground and background portions of an image have pixel-wise and local characteristics that can be incorporated as CRF potentials. Image segmentation can thus be formulated as:

$Y = \operatorname*{argmax}_{y \in \{0,1\}^{n}} \sum_{i \in S} V_i(y_i, X) + \sum_{i \in S} \sum_{j \in N_i} V_{i,j}(y_i, y_j)$

Here $Y$ is the image labeling (each pixel marked foreground or background), $X$ holds the data features, $S$ is the set of pixels, and $N_i$ denotes the neighbours of pixel $i$.
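As a toy illustration of this objective (not a practical segmenter, which would use graph cuts or belief propagation rather than enumeration), the following sketch maximizes the unary-plus-pairwise score over all labelings of a 2x2 "image"; the potentials $V_i$ and $V_{i,j}$ below are made up.

```python
import itertools
import numpy as np

# A 2x2 image flattened to 4 pixels; X holds one intensity feature per pixel
# (bright pixels suggest foreground). Potentials below are made up.
X = np.array([0.9, 0.8, 0.2, 0.1])
N = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}   # 4-connected neighbours

def V_unary(y_i, x_i):
    # Data term: reward label 1 (foreground) in proportion to brightness.
    return x_i if y_i == 1 else 1.0 - x_i

def V_pair(y_i, y_j):
    # Smoothness term: reward neighbouring pixels that agree.
    return 0.2 if y_i == y_j else 0.0

def score(y):
    unary = sum(V_unary(y[i], X[i]) for i in range(4))
    pairwise = sum(V_pair(y[i], y[j]) for i in range(4) for j in N[i])
    return unary + pairwise

Y = max(itertools.product([0, 1], repeat=4), key=score)
print(np.array(Y).reshape(2, 2))   # the bright top row is labelled foreground
```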