Conditional Random Fields. Mike Brodie CS 778


Motivation: Part-Of-Speech Tagger

Motivation: object

Motivation: I object!

Motivation: object. Do you see that object? (The same word is a noun or a verb depending on context.)

Motivation: Part-Of-Speech Tagger - Java CRFTagger. INPUT FILE - Logan.txt: Logan woke up this morning and ate a big bowl of Fruit Loops. On his way to school, a small dog chased after him. Fortunately, Logan's leg had healed and he outran the dog.

Motivation: Part-Of-Speech Tagger - Java CRFTagger. OUTPUT FILE - Logan.txt.pos: Logan/NNP woke/VBD up/RP this/DT morning/NN and/CC ate/VB a/DT big/JJ bowl/NN of/IN Fruit/NNP Loops/NNP ./. On/IN his/PRP$ way/NN to/TO school/VB ,/, a/DT small/JJ dog/NN chased/VBN after/IN him/PRP ./. Fortunately/RB ,/, Logan/NNP 's/POS leg/NN had/VBD healed/VBN and/CC he/PRP outran/VB the/DT dog/NN ./.

Motivation: Stanford Named Entity Recognition. Recognizes names, places, and organizations. http://nlp.stanford.edu:8080/ner/process

Motivation: Image Applications. Segmentation; Multi-Label Classification (labels: Hat, Shirt, Cup, Stanley)

Motivation: Bioinformatics Applications. RNA Structural Alignment; Protein Structure Prediction

Motivation: Cupcakes with Rainbow Filling

Background: Isn't it about time...

Background: Isn't it about time for Graphical Models? Represent a distribution as a product of local functions, each depending on a smaller subset of the variables.

Background: Generative vs Discriminative. Generative models describe how a label vector y can probabilistically generate a feature vector x. Discriminative models describe how to take a feature vector x and assign it a label y. Either type of model can be converted to the other using Bayes' rule: $P(x \mid y)P(y) = P(x, y) = P(y \mid x)P(x)$

Background: Joint Distributions $P(x, y)$. Problems?

Background: Joint Distributions $P(x, y)$. Problems: modeling the inputs leads to intractable models; ignoring the inputs leads to poor performance.

Background: Joint Distributions $P(x, y)$. Naive Bayes is a generative model that makes strong assumptions to simplify computation.

Background: Generative vs Discriminative. Naive Bayes and Logistic Regression form a generative-discriminative pair; Logistic Regression is the conditional counterpart.

Background: Generative vs Discriminative. Naive Bayes becomes Logistic Regression if trained to maximize the conditional likelihood.

Background: Generative vs Discriminative. Logistic Regression recovers Naive Bayes if trained to maximize the joint likelihood.

Background: Sequence Labeling. Hidden Markov Model (HMM)

Background: Sequence Labeling. Hidden Markov Model (HMM). ASSUMPTIONS: $y_t \perp y_1, \ldots, y_{t-2} \mid y_{t-1}$ (each label depends only on the previous label) and $x_t \perp Y \setminus \{y_t\},\, X \setminus \{x_t\} \mid y_t$ (each observation depends only on its own label). The joint factorization these imply is given below.
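These two assumptions give the familiar HMM joint factorization (standard form, supplied here since the slide's formula did not survive transcription):

$$p(\mathbf{y}, \mathbf{x}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)$$

with $p(y_1 \mid y_0)$ read as the initial label distribution $p(y_1)$.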

Background: Sequence Labeling. Hidden Markov Model (HMM). PROBLEMS: later labels cannot influence earlier labels; cannot represent overlapping features.

Background: Improvements to HMMs. Maximum Entropy Markov Model (MEMM)

Background: Improvements to HMMs. Maximum Entropy Markov Model. (Graphical model: state $S_{t-1}$ feeds $S_t$, which also conditions on observation $O_t$.) $P(S_1, \ldots, S_n \mid O_1, \ldots, O_n) = \prod_{t=1}^{n} P(S_t \mid S_{t-1}, O_t)$

Background: Improvements to HMMs. Conditional Markov Model. PROS: a more flexible form of context dependence; independence assumptions among labels, not observations. CONS: only models a log-linear distribution over the component transition distributions.

Background: Improvements to HMMs. The Label Bias Problem: because each state's outgoing transition scores are normalized locally, states with few outgoing transitions effectively ignore the observation, biasing probability mass toward them.

Solution: Conditional Random Fields. Drop the local normalization constant and normalize GLOBALLY over the full graph. This avoids the label bias problem.

Solution: Conditional Random Fields. Calculate a path's probability by normalizing its score by the sum of the scores of all possible paths. A poorly matching path gets a low score (and probability); a well-matching path gets a high score. A sketch of this idea follows.
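A minimal brute-force sketch of this global normalization in Python (the label set, feature weights, and scoring scheme are illustrative assumptions, not the implementation from these slides):

import itertools
import math

LABELS = ["NOUN", "VERB"]

def path_score(labels, words, weights):
    # Unnormalized path score: exp of the summed weighted features.
    total, prev = 0.0, "<START>"
    for y, x in zip(labels, words):
        total += weights.get((prev, y), 0.0)   # transition feature weight
        total += weights.get((y, x), 0.0)      # observation feature weight
        prev = y
    return math.exp(total)

def path_probability(labels, words, weights):
    # Normalize by the summed scores of ALL possible label paths.
    z = sum(path_score(p, words, weights)
            for p in itertools.product(LABELS, repeat=len(words)))
    return path_score(labels, words, weights) / z

A real implementation replaces the brute-force sum over paths with the forward algorithm shown later.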

Solution: Conditional Random Fields. Similar to the CMM: discriminative, models the conditional distribution p(y | x), and allows arbitrary, overlapping features. TAKEAWAY: a CRF retains the advantages of the CMM over the HMM while overcoming the label bias problem of the CMM/MEMM.

Linear-Chain CRFs: Conditional Random Fields. Remember the HMM.

Linear-Chain CRFs: Conditional Random Fields. By expanding Z, we get the conditional distribution p(y | x).
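The equations on this slide were images and did not survive transcription; the standard linear-chain CRF conditional distribution (e.g., as in Sutton and McCallum's tutorial) is:

$$p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \exp\left(\sum_{k} \theta_k f_k(y_t, y_{t-1}, x_t)\right), \qquad Z(x) = \sum_{y'} \prod_{t=1}^{T} \exp\left(\sum_{k} \theta_k f_k(y'_t, y'_{t-1}, x_t)\right)$$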

Feature Functions. Interactive demo: students.cs.byu.edu/~mbrodie/778


Linear-Chain CRFs: Conditional Random Fields.

Feature Engineering. Benefits of a larger feature set: greater prediction accuracy and a more flexible decision boundary. However, it may lead to overfitting.

Feature Engineering: Label-Observation Features. Here the feature pairs a label with an observation function rather than a specific word value. For example: word $x_t$ is capitalized; word $x_t$ ends in "ing".
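A sketch of two such label-observation feature functions in Python (the function names and the NOUN/VERB labels are illustrative assumptions):

def capitalized_noun(y_t, y_prev, x_t):
    # Fires when the current word is capitalized and labeled NOUN.
    return 1.0 if x_t[:1].isupper() and y_t == "NOUN" else 0.0

def ends_in_ing_verb(y_t, y_prev, x_t):
    # Fires when the current word ends in "ing" and is labeled VERB.
    return 1.0 if x_t.endswith("ing") and y_t == "VERB" else 0.0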

Feature Engineering: Unsupported Features. Features that never occur in the training data, for example: word $x_t$ is "with" and label $y_t$ is NOUN. They greatly increase the number of parameters, but usually give a slight boost in accuracy. They can be useful: assigning them negative weights prevents spurious labelings from receiving high probability.

Feature Engineering: Boundary Labels. Boundary labels (start/end of sentence, edge of image) can have different characteristics from other labels. For example: capitalization in the middle of a sentence generally indicates a proper noun, but not necessarily at the beginning of a sentence.

Feature Engineering: Other Methods. Feature Induction: begin with a number of base features, then gradually add conjunctions of those features. Normalize Features: convert categorical features to binary features. Features from Different Time Steps (e.g., use $x_{t-1}$ or $x_{t+1}$ when scoring $y_t$).

Feature Engineering: Other Methods. Features as Model Combination. Input-Dependent Structure, e.g., Skip-Chain CRFs: encourage identical words in a sentence to take the same label by adding an edge between them in the graph.

Linear-Chain CRFs: Conditional Random Fields. HMM-like CRF: the previous observation affects the transition.

Feature Effects: Pseudo-Weighted Functions on the Brown Corpus (38 training, 12 test sentences)

Features                                  Accuracy   # Feature Functions
y_{t-1} and y_t                           0.0449     296
x_t and y_t                               0.6282     486
x_t and y_t AND x_t, y_{t-1} and y_t      0.6603     1176
x_t and y_t AND y_{t-1} and y_t           0.0929     782

Linear-Chain CRFs: Parameter Estimation. We want to find parameters for the feature functions by maximizing the conditional log likelihood.

Linear-Chain CRFs: Parameter Estimation. Expanding the likelihood, and adding a term that penalizes large weight vectors, gives the objective below.
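The equations on these slides were images that did not survive transcription; the standard penalized conditional log likelihood for a linear-chain CRF (as in Sutton and McCallum) is:

$$\ell(\theta) = \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k} \theta_k f_k\big(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}\big) - \sum_{i=1}^{N} \log Z\big(x^{(i)}\big) - \sum_{k} \frac{\theta_k^2}{2\sigma^2}$$

where the last term is a Gaussian prior on the weights that penalizes large weight vectors.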

Linear-Chain CRFs: Parameter Estimation. We cannot maximize $\ell(\theta)$ in closed form. Instead, we use numerical optimization, which requires the derivatives of $\ell$ with respect to each $\theta_k$, given below.
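The derivative slides were likewise images; the standard gradient is the observed feature counts minus the expected counts under the model, plus the derivative of the penalty:

$$\frac{\partial \ell}{\partial \theta_k} = \sum_{i,t} f_k\big(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}\big) - \sum_{i,t} \sum_{y, y'} f_k\big(y, y', x_t^{(i)}\big)\, p\big(y, y' \mid x^{(i)}\big) - \frac{\theta_k}{\sigma^2}$$

The edge marginals $p(y, y' \mid x^{(i)})$ are exactly what the forward-backward recursions later in the deck compute.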

Linear-Chain CRFs: Parameter Estimation. How do we optimize $\ell(\theta)$? Steepest ascent along the gradient, Newton's method, or (L-)BFGS; a SciPy sketch follows.
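A minimal sketch of the L-BFGS route with SciPy's fmin_l_bfgs_b, which the author's second implementation uses; the objective below is a placeholder, to be replaced by the negative penalized log likelihood and its gradient:

import numpy as np
from scipy.optimize import fmin_l_bfgs_b

def objective(theta):
    # Placeholder: a real CRF returns (-l(theta), -gradient).
    # fmin_l_bfgs_b MINIMIZES, hence the negation.
    return 0.5 * np.sum(theta ** 2), theta

theta0 = np.zeros(100)   # one weight per feature function (illustrative size)
theta_opt, value, info = fmin_l_bfgs_b(objective, theta0)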

Feature Effects: Pseudo-Weighted Functions on the Brown Corpus (38 training, 12 test sentences)

Features                                  Accuracy   # Feature Functions   Trained Accuracy (10 Epochs)
y_{t-1} and y_t                           0.0449     296                   0.1185
x_t and y_t                               0.6282     486                   0.6538
x_t and y_t AND x_t, y_{t-1} and y_t      0.6603     1176                  0.6731
x_t and y_t AND y_{t-1} and y_t           0.0929     782                   0.2724

Linear-Chain CRFs: Inference. During training: marginal distributions for each edge, and Z(x). During testing: the Viterbi algorithm to compute the most likely labeling, sketched below.
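A sketch of the Viterbi decoding step in Python (NumPy; the (T, K, K) layout of the log potentials is an assumption about how the factors are stored):

import numpy as np

def viterbi(log_psi):
    # log_psi[t, i, j]: log score of label i at t-1 followed by label j at t;
    # row log_psi[0, 0] plays the role of the START transition.
    T, K, _ = log_psi.shape
    delta = log_psi[0, 0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_psi[t]   # all (prev, next) label pairs
        back[t] = scores.argmax(axis=0)        # best previous label for each next
        delta = scores.max(axis=0)
    path = [int(delta.argmax())]               # best final label
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]                          # the most likely labeling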

Linear-Chain CRFs: Inference. Use the forward-backward approach, as in HMMs: define the factors $\Psi_t(y_{t-1}, y_t, x_t)$, rewrite $Z(x)$ using the distributive law, and this leads to a set of forward variables $\alpha_t$, computed by recursion.

Linear-Chain CRFs: Inference. Similarly, we compute a set of backward variables $\beta_t$, again by recursion.

Linear-Chain CRFs: Inference. Compute the marginals needed for the gradient from the forward and backward results, as in the sketch below.
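A sketch of the forward-backward recursions and edge marginals in Python (unnormalized probability space for clarity; a real implementation needs scaling or log space to avoid underflow; psi[t, i, j] stands for the factor Ψ_t(y_{t-1}=i, y_t=j, x_t)):

import numpy as np

def forward_backward(psi):
    T, K, _ = psi.shape
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    alpha[0] = psi[0, 0]                       # START row
    for t in range(1, T):
        alpha[t] = alpha[t - 1] @ psi[t]       # sum out the previous label
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = psi[t + 1] @ beta[t + 1]     # sum out the next label
    Z = alpha[T - 1].sum()                     # partition function Z(x)
    # Edge marginals p(y_{t-1}=i, y_t=j | x), the quantities the gradient needs:
    marginals = np.array([np.outer(alpha[t - 1], beta[t]) * psi[t] / Z
                          for t in range(1, T)])
    return alpha, beta, Z, marginals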

Linear-Chain CRFs: Inference. How does this compare to CRF training/inference?

Breakthrough

My Implementation: 3 Different Implementations. 1. Tensorflow

Tensorflow Tips: Do NOT start with CRFs.

Tensorflow Tips: Be prepared for awkward, sometimes unintuitive formats.

Tensorflow: Gaaaaah!! len(X) becomes X.get_shape().dims[0].value. Start out with Theano instead: more widely used, more intuitive.

My Implementation: 3 Different Implementations. 1. Tensorflow. 2. Scipy Optimization (fmin_l_bfgs_b). 3. Gradient Descent: A. a trainable version; B. a simplified, web-based version.

Experiments: Data Sets

Brown Corpus                  Penn Treebank Corpus
News Articles                 WSJ
5,000 POS Sentences           3,914 sentence SAMPLE
27,596 Tags                   $1,700 (or free BYU version)
75/25 Train-Test Split        75/25 Train-Test Split

Experiments: Comparison Models. Python Natural Language Toolkit (NLTK): Hidden Markov Model, Conditional Random Field. Averaged Multilayer Perceptron: averages the weight updates to prevent radical changes due to different training batches.
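A sketch of the weight-averaging idea in Python (the predict decoder and features extractor are assumed helpers; this is the averaged perceptron in its standard form, not the NLTK internals):

import numpy as np

def averaged_perceptron(examples, n_features, predict, features, epochs=10):
    w = np.zeros(n_features)
    w_sum = np.zeros(n_features)
    for _ in range(epochs):
        for x, y in examples:                  # y is a labeling (list/tuple)
            y_hat = predict(w, x)              # current best labeling
            if y_hat != y:
                w += features(x, y) - features(x, y_hat)
            w_sum += w                         # accumulate for averaging
    # Averaging smooths out radical updates from individual batches/examples.
    return w_sum / (epochs * len(examples))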

Experiments: Brown Corpus Results. (Bar chart of accuracy, axis from 30 to 100, comparing HMM, Avg-MLP, NLTK-CRF, and Brodie-CRF; the bar values were not transcribed.)

Experiments: Penn Corpus Results. (Bar chart of accuracy, axis from 75 to 100, comparing HMM, Avg-MLP, NLTK-CRF, and Brodie-CRF; one value is marked "?".)

Pros/Cons of Implementation. Pros: can use the forward-backward and Viterbi algorithms; learns weights for features. Cons: slow; limited to gradient descent training.

Extensions: General CRFs, Skip-Chain CRFs, Hidden-Node CRFs

CRF-RNN + CNN. Problem: traditional CNNs produce coarse outputs for pixel-level labels. Solution: apply a CRF as a post-processing refinement.

BI-LSTM-CRF

Future Directions. New model combinations with CRFs; better training approximations (e.g., alternatives to MCMC sampling methods); additional hidden architectures.


Bibliography: Other Resources. Tom Mitchell's Machine Learning, online draft chapter "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression", http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf, accessed January 25, 2016. Edwin Chen's blog, "Introduction to Conditional Random Fields", http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/, accessed February 2, 2016. LingPipe CRF tutorial, http://alias-i.com/lingpipe/demos/tutorial/crf/read-me.html, accessed February 2, 2016.

Thank You. Questions?