Semi-Markov Conditional Random Fields for Information Extraction

1 Semi-Markov Conditional Random Fields for Information Extraction
Sunita Sarawagi and William Cohen, NIPS
Presented by: Dinesh Khandelwal
Slides are adopted from Daniel Khashabi

2 Beyond Classification Learning
The standard classification problem assumes individual cases are disconnected and independent (i.i.d.: independent and identically distributed).
Many NLP problems do not satisfy this assumption: they involve making many connected decisions, each resolving a different ambiguity, but which are mutually dependent.
More sophisticated learning and inference techniques are needed to handle such situations in general.

3 Sequence Labeling Problem
Many NLP problems can be viewed as sequence labeling.
Each token in a sequence is assigned a label.
Labels of tokens are dependent on the labels of other tokens in the sequence, particularly their neighbors (not i.i.d.).

4 Named Entity Recognition
Example: My review of Fermat's last theorem by S. Singh
t:  1      2       3      4         5      6        7      8       9
x:  My     review  of     Fermat's  last   theorem  by     S.      Singh
y:  Other  Other   Other  Title     Title  Title    Other  Author  Author
(The labels correspond to $y_1, \ldots, y_9$.)

5 Problem Description
Relational connections occur in many applications: NLP, Computer Vision, Signal Processing, ...
Traditionally in graphical models, $p(x, y) = p(y \mid x)\, p(x)$.
Modeling the joint distribution can lead to difficulties:
rich local features occur in relational data;
features may have complex dependencies, so constructing a probability distribution $p(x)$ over them is difficult.
Solution: directly model the conditional $p(y \mid x)$, which is sufficient for classification!
A CRF is simply a conditional distribution $p(y \mid x)$ with an associated graphical structure.

6 Log linear representation of CRFs
$\Pr(y \mid x, W) = \frac{1}{Z(x)} e^{W^T F(x, y)}$
$F(x, y) = \sum_{i=1}^{|x|} f(i, x, y)$
$f = (f^1, \ldots, f^K)$, with $f^k(i, x, y) \in \mathbb{R}$: vector of local feature functions.
$W$: parameters to be estimated.

7 Linear Chain CRF
(Figure legend: unobservable and observable nodes.)
$f^k(i, x, y) = f^k(i, x, y_i, y_{i-1})$

8 Features
The kinds of features used in NLP-oriented machine learning systems typically involve:
Binary values: think of a feature as being on or off rather than as a feature with a value.
Values that are relative to an object/class pair rather than being a function of the object alone.
Systems typically have lots and lots of features (100,000s of features isn't unusual).

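9 Features
$f^1(i, x, y) = \begin{cases} 1 & \text{if } y_i = \text{DT and } y_{i-1} = \text{V} \\ 0 & \text{otherwise} \end{cases}$
$f^2(i, x, y) = \begin{cases} 1 & \text{if } x_i = \text{the and } y_i = \text{DT} \\ 0 & \text{otherwise} \end{cases}$
$f^3(i, x, y) = \begin{cases} 1 & \text{if suffix}(x_i) = \text{"ing" and } y_i = \text{V} \\ 0 & \text{otherwise} \end{cases}$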
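The three indicator features and the log-linear form from the earlier slide can be sketched in a few lines of Python. This is a minimal illustration only: the label set `LABELS`, the weights, and the brute-force normalization over all label sequences are assumptions made here for a tiny example, not part of the slides. Positions are 0-based in the code.

```python
from itertools import product
from math import exp

LABELS = ["DT", "V", "N"]  # hypothetical label set, for illustration only

def f1(i, x, y):
    # Fires when the current tag is DT and the previous tag is V.
    return 1.0 if i > 0 and y[i] == "DT" and y[i - 1] == "V" else 0.0

def f2(i, x, y):
    # Fires when the current word is "the" and it is tagged DT.
    return 1.0 if x[i] == "the" and y[i] == "DT" else 0.0

def f3(i, x, y):
    # Fires when the current word ends in "ing" and it is tagged V.
    return 1.0 if x[i].endswith("ing") and y[i] == "V" else 0.0

FEATURES = [f1, f2, f3]

def global_features(x, y):
    """F(x, y): sum of each local feature over all positions i."""
    return [sum(f(i, x, y) for i in range(len(x))) for f in FEATURES]

def score(x, y, w):
    """W^T F(x, y)."""
    return sum(wk * fk for wk, fk in zip(w, global_features(x, y)))

def prob(x, y, w):
    """Pr(y | x, W), normalizing by brute force over every label sequence."""
    z = sum(exp(score(x, list(yp), w))
            for yp in product(LABELS, repeat=len(x)))
    return exp(score(x, y, w)) / z
```

Brute-force normalization is exponential in the sequence length, so it is only useful for checking small cases; the dynamic programs on the later slides are what make this tractable.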

10 Segmentation models (Semi-CRFs)
CRF (token level):
i:  1  2     3       4     5         6        7   8        9
x:  I  went  skiing  with  Fernando  Pereira  in  British  Columbia
y:  O  O     O       O     I         I        O   I        I
Features $f^k(i, x, y_i, y_{i-1})$ describe the single word at position $i$.

Semi-CRF (segment level):
j:    1  2     3       4     5                 6   7
t_j:  1  2     3       4     5                 7   8
u_j:  1  2     3       4     6                 7   9
x:    I  went  skiing  with  Fernando Pereira  in  British Columbia
y_j:  O  O     O       O     I                 O   I
Features $g^k(y_j, y_{j-1}, x, t_j, u_j)$ describe the segment from $t_j$ to $u_j$.

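11 Semi-CRF
(Figure: segments $s_1, \ldots, s_p$ spanning $x_{t_1} \ldots x_{u_1}$ through $x_{t_p} \ldots x_{u_p}$.)
Let $s = (s_1, \ldots, s_p)$ denote a segmentation of $x$.
Segment $s_j = (t_j, u_j, y_j)$ consists of a start position $t_j$, an end position $u_j$, and a label $y_j$.
$1 \le t_j \le u_j \le |x|$ and $t_{j+1} = u_j + 1$.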
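The constraints on a segmentation can be checked mechanically. The sketch below is illustrative: the function names are hypothetical, segments are `(t, u, y)` tuples with 1-based inclusive positions, and it additionally assumes (as the figure suggests, but the slide text does not state) that the segments cover the whole sequence from position 1 to `n`.

```python
def is_valid_segmentation(segments, n):
    """Check 1 <= t_j <= u_j <= n and t_{j+1} = u_j + 1 for (t, u, y) segments.

    Assumption: the segmentation must also start at 1 and end at n.
    """
    if not segments:
        return n == 0
    if segments[0][0] != 1 or segments[-1][1] != n:
        return False
    if any(not (1 <= t <= u <= n) for t, u, _ in segments):
        return False
    # Adjacent segments must touch without gaps or overlap.
    return all(nxt[0] == cur[1] + 1 for cur, nxt in zip(segments, segments[1:]))

def to_token_labels(segments):
    """Expand a segmentation into one label per token."""
    labels = []
    for t, u, y in segments:
        labels.extend([y] * (u - t + 1))
    return labels
```

Applied to the "I went skiing with Fernando Pereira in British Columbia" example, the seven segments expand back to the nine token-level labels O O O O I I O I I.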

12 Semi-CRF
(Figure legend: unobservable and observable nodes.)
$g^k(j, x, s) = g^k(y_j, y_{j-1}, x, t_j, u_j)$
$\Pr(s \mid x, W) = \frac{1}{Z(x)} e^{W^T G(x, s)}$
$G(x, s) = \sum_{j=1}^{|s|} g(j, x, s)$
$Z(x) = \sum_{s'} e^{W^T G(x, s')}$
$g$ is a vector of segment level feature functions.

13 MAP Inference Semi-CRF
$\hat{s} = \arg\max_{s} \Pr(s \mid x, W)$
$\hat{s} = \arg\max_{s} W^T G(x, s)$
$\hat{s} = \arg\max_{s} W^T \sum_{j=1}^{|s|} g(y_j, y_{j-1}, x, t_j, u_j)$
$g$ is a vector of segment level feature functions.

14 Viterbi algorithm for Semi-CRF
Objective: $\max_{s} W^T \sum_{j=1}^{|s|} g(y_j, y_{j-1}, x, t_j, u_j)$
Let $L$ be an upper bound on segment length.
Let $s_{i:y}$ denote the set of all partial segmentations starting from 1 to $i$, such that the last segment has the label $y$ and ending position $i$.
$V(i, y) = \max_{y', d} \max_{s' \in s_{i-d:y'}} \left[ W^T \sum_{j=1}^{|s'|} g(y_j, y_{j-1}, x, t_j, u_j) + W^T g(y, y', x, i-d+1, i) \right]$

15 Viterbi algorithm for Semi-CRF
$V(i, y) = \max_{y', d} \max_{s' \in s_{i-d:y'}} \left[ W^T \sum_{j=1}^{|s'|} g(y_j, y_{j-1}, x, t_j, u_j) + W^T g(y, y', x, i-d+1, i) \right]$
$V(i-d, y') = \max_{s' \in s_{i-d:y'}} W^T \sum_{j=1}^{|s'|} g(y_j, y_{j-1}, x, t_j, u_j)$
$V(i, y) = \max_{y', d} \left[ V(i-d, y') + W^T g(y, y', x, i-d+1, i) \right]$

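16 Viterbi algorithm for Semi-CRF
$V(i, y) = \begin{cases} \max_{y',\, d = 1, \ldots, L} \left[ V(i-d, y') + W^T g(y, y', x, i-d+1, i) \right] & \text{if } i > 0 \\ 0 & \text{if } i = 0 \\ -\infty & \text{if } i < 0 \end{cases}$
The optimal label sequence corresponds to the path traced by $\max_{y} V(|x|, y)$.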
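The recursion for $V(i, y)$ can be sketched directly as a dynamic program with back-pointers. This is a minimal sketch under stated assumptions, not the paper's implementation: `seg_score(y, y_prev, x, t, u)` stands in for $W^T g(y, y', x, t, u)$ and is assumed to return finite values, positions `t` and `u` are 1-based and inclusive, the first segment receives `y_prev = None` instead of a start label, and `ENTITIES` / `toy_score` are hypothetical stand-ins for learned weights.

```python
import math

# Hypothetical toy lexicon; a stand-in for learned weights W and features g.
ENTITIES = {("Fernando", "Pereira"), ("British", "Columbia")}

def toy_score(y, y_prev, x, t, u):
    """Illustrative segment score: reward known entity phrases labeled I."""
    words = tuple(x[t - 1:u])
    if y == "I":
        return 3.0 if words in ENTITIES else -3.0
    return 0.0 if len(words) == 1 else -1.0  # label "O": prefer single words

def semi_crf_viterbi(x, labels, L, seg_score):
    """Return (best score, best segmentation) with segment length <= L."""
    n = len(x)
    if n == 0:
        return 0.0, []
    V = [{y: -math.inf for y in labels} for _ in range(n + 1)]
    back = [{y: None for y in labels} for _ in range(n + 1)]
    for i in range(1, n + 1):
        for y in labels:
            for d in range(1, min(L, i) + 1):
                prev_end = i - d  # V(0, .) = 0 is handled via base below
                prevs = [None] if prev_end == 0 else labels
                for yp in prevs:
                    base = 0.0 if prev_end == 0 else V[prev_end][yp]
                    cand = base + seg_score(y, yp, x, prev_end + 1, i)
                    if cand > V[i][y]:
                        V[i][y], back[i][y] = cand, (d, yp)
    # Trace back from the best final label.
    y = max(labels, key=lambda lab: V[n][lab])
    best, segs, i = V[n][y], [], n
    while i > 0:
        d, yp = back[i][y]
        segs.append((i - d + 1, i, y))
        i, y = i - d, yp
    segs.reverse()
    return best, segs
```

With `toy_score`, labels `["O", "I"]` and `L = 2`, calling `semi_crf_viterbi` on the "I went skiing with Fernando Pereira in British Columbia" sentence recovers the segmentation shown on the segmentation-models slide. Restricting `d` to `min(L, i)` plays the role of the $-\infty$ case for $i < 0$.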

17 Semi-Markov CRFs vs conventional CRFs
Since conventional CRFs need not maximize over possible segment lengths $d$, inference for semi-CRFs is more expensive. However, the additional cost is only linear in $L$.
Semi-CRFs have more expressive power.
A major advantage of semi-CRFs is that they allow features which measure properties of segments, rather than individual elements.

18 Semi-Markov CRFs vs Higher order CRFs
Semi-CRFs are no more expressive than order-$L$ CRFs.
For order-$L$ CRFs, however, the additional computational cost is exponential in $L$.
Semi-CRFs only consider sequences in which the same label is assigned to all $L$ positions, rather than all $|Y|^L$ length-$L$ sequences.
This is a useful restriction, as it leads to faster inference.
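The two cost comparisons above can be made concrete with rough Viterbi operation counts. These are a back-of-the-envelope sketch, assuming each evaluation of a feature vector costs $O(1)$ and that an order-$L$ CRF is decoded over states made of $L$ consecutive labels; the slides themselves only state "linear in $L$" and "exponential in $L$".

```latex
\begin{align*}
\text{linear-chain CRF:} &\quad O\!\left(|x|\,|Y|^{2}\right) \\
\text{semi-CRF, segment length} \le L\text{:} &\quad O\!\left(|x|\,L\,|Y|^{2}\right) \\
\text{order-}L\text{ CRF:} &\quad O\!\left(|x|\,|Y|^{L+1}\right)
\end{align*}
```

Under these assumptions, going from a linear-chain CRF to a semi-CRF multiplies the work by $L$, while going to an order-$L$ CRF multiplies it by $|Y|^{L-1}$.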

19 Parameter Learning: Semi-CRF
Given the training data $\{(x_l, s_l)\}_{l=1}^{N}$, we wish to learn the parameters of the model.
We express the log-likelihood over the training sequences as
$L(W) = \sum_{l} \log P(s_l \mid x_l, W) = \sum_{l} \left( W^T G(x_l, s_l) - \log Z_W(x_l) \right)$
$L(W)$ is concave, and can thus be maximized by gradient ascent, or one of many related methods. (The paper uses a limited-memory quasi-Newton method.)
$\nabla L(W) = \sum_{l} \left( G(x_l, s_l) - E_{\Pr(s' \mid x_l, W)}\, G(x_l, s') \right)$
The first term is the observed feature count; the second term is the expected feature count.

20 Parameter Learning: Semi-CRF
$\nabla L(W) = \sum_{l} \left( G(x_l, s_l) - E_{\Pr(s' \mid x_l, W)}\, G(x_l, s') \right)$
$\nabla L(W) = \sum_{l} \left( G(x_l, s_l) - \frac{\sum_{s'} G(x_l, s')\, e^{W^T G(x_l, s')}}{Z_W(x_l)} \right)$
The Markov property of $G$ and dynamic programming allow fast computation of the expected value of the features under the current weight vector, $E_{\Pr(s' \mid x_l, W)}\, G(x_l, s')$.
$\alpha(i, y) = \sum_{s' \in s_{i:y}} e^{W^T G(x_l, s')}$
where $s_{i:y}$ denotes all segmentations from 1 to $i$ ending at $i$ and labeled $y$.
$Z_W(x) = \sum_{y} \alpha(|x|, y)$

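21 Parameter Learning: Semi-CRF
$\alpha(i, y) = \begin{cases} \sum_{d=1}^{L} \sum_{y' \in Y} \alpha(i-d, y')\, e^{W^T g(y, y', x, i-d+1, i)} & \text{if } i > 0 \\ 1 & \text{if } i = 0 \\ 0 & \text{if } i < 0 \end{cases}$
A similar approach can be used to compute the expectation $\sum_{s'} G(x_l, s')\, e^{W^T G(x_l, s')}$.
$\eta^k(i, y) = \sum_{s' \in s_{i:y}} G^k(x_l, s')\, e^{W^T G(x_l, s')}$, restricted to the part of the segmentation ending at position $i$.
$\eta^k(i, y) = \sum_{d=1}^{L} \sum_{y' \in Y} \left( \eta^k(i-d, y') + \alpha(i-d, y')\, g^k(y, y', x, i-d+1, i) \right) e^{W^T g(y, y', x, i-d+1, i)}$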
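The $\alpha$ recursion can be checked against a brute-force sum over all segmentations on a small input. The sketch below is illustrative and makes a few assumptions that are not in the slides: `seg_score` stands in for $W^T g(y, y', x, t, u)$, a single start state (`y_prev = None`) replaces the base case $\alpha(0, y) = 1$ for every $y$, and `toy_score` is a made-up scoring function.

```python
import math

def segmentations(n, labels, L):
    """Yield every segmentation of positions 1..n with segment length <= L."""
    if n == 0:
        yield []
        return
    for d in range(1, min(L, n) + 1):
        for prefix in segmentations(n - d, labels, L):
            for y in labels:
                yield prefix + [(n - d + 1, n, y)]

def total_score(x, segs, seg_score):
    """W^T G(x, s): sum of segment scores along one segmentation."""
    total, y_prev = 0.0, None
    for t, u, y in segs:
        total += seg_score(y, y_prev, x, t, u)
        y_prev = y
    return total

def partition_brute_force(x, labels, L, seg_score):
    return sum(math.exp(total_score(x, s, seg_score))
               for s in segmentations(len(x), labels, L))

def partition_forward(x, labels, L, seg_score):
    """Z_W(x) = sum_y alpha(|x|, y), computed with the alpha recursion."""
    n = len(x)
    if n == 0:
        return 1.0  # only the empty segmentation
    alpha = [{y: 0.0 for y in labels} for _ in range(n + 1)]
    for i in range(1, n + 1):
        for y in labels:
            for d in range(1, min(L, i) + 1):
                prev_end = i - d
                if prev_end == 0:
                    alpha[i][y] += math.exp(seg_score(y, None, x, 1, i))
                else:
                    for yp in labels:
                        alpha[i][y] += alpha[prev_end][yp] * math.exp(
                            seg_score(y, yp, x, prev_end + 1, i))
    return sum(alpha[n].values())

def toy_score(y, y_prev, x, t, u):
    """Hypothetical score: depends on label, length, and label change."""
    length = u - t + 1
    change = 0.5 if y_prev is not None and y != y_prev else 0.0
    return (0.3 if y == "I" else -0.1) * length + change
```

The brute-force version enumerates exponentially many segmentations, while the forward version does work proportional to $|x| \cdot L \cdot |Y|^2$; both should agree up to floating-point error on short sequences.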

22 Parameter Learning: Semi-CRF
$E_{\Pr(s' \mid x, W)}\, G^k(x, s') = \frac{1}{Z_W(x)} \sum_{y} \eta^k(|x|, y)$
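Putting the $\alpha$ and $\eta$ recursions together gives the expected feature vector needed for the gradient. The following is a minimal sketch under assumptions that go beyond the slides: `g(y, y_prev, x, t, u)` is a hypothetical function returning a list of $K$ feature values, `w` is a plain list of weights, and the first segment gets `y_prev = None` rather than a sum over start labels.

```python
import math

def expected_features(x, labels, L, g, w):
    """E[G(x, s')] under Pr(s' | x, W), via the alpha and eta recursions."""
    n, K = len(x), len(w)
    alpha = [{y: 0.0 for y in labels} for _ in range(n + 1)]
    eta = [{y: [0.0] * K for y in labels} for _ in range(n + 1)]
    for i in range(1, n + 1):
        for y in labels:
            for d in range(1, min(L, i) + 1):
                prev_end = i - d
                prevs = [None] if prev_end == 0 else labels
                for yp in prevs:
                    feats = g(y, yp, x, prev_end + 1, i)
                    # Segment potential e^{W^T g(y, y', x, t, u)}.
                    pot = math.exp(sum(wk * fk for wk, fk in zip(w, feats)))
                    a_prev = 1.0 if prev_end == 0 else alpha[prev_end][yp]
                    e_prev = [0.0] * K if prev_end == 0 else eta[prev_end][yp]
                    alpha[i][y] += a_prev * pot
                    for k in range(K):
                        eta[i][y][k] += (e_prev[k] + a_prev * feats[k]) * pot
    z = sum(alpha[n].values())
    return [sum(eta[n][y][k] for y in labels) / z for k in range(K)]

def toy_g(y, y_prev, x, t, u):
    """Hypothetical features: segment length, and an indicator for label I."""
    return [float(u - t + 1), 1.0 if y == "I" else 0.0]
```

One training example's contribution to the gradient is then the observed feature vector $G(x_l, s_l)$ minus the list returned by `expected_features`. A handy sanity check: because segment lengths always sum to $|x|$, the expectation of the length feature in `toy_g` equals `len(x)` for any weights.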

23 Extensions
Barun, Gagan, Dhruvin, Yashoteja: This idea of reasoning over segments can be extended to the task of image segmentation.
Nupur: Introduce constraints in the model to get something similar to CCM, as in the case of CRFs.
Happy: Apart from the similarity measures they have used, there is a very good similarity measure called Gower distance, which is primarily used for non-numerical data. I think we can also use that here.
Prachi: Compare SOTA deep learning models and semi-CRFs to build insights on what one can capture and the other can't. This may enable us to improve the architectures of both models.
Yashoteja: Start with L=1, and quickly filter out the regions of the sequence that we are confident do not contain any named entities. Now we can use L=2 and resegment only those regions where entities might lie. We can then proceed with L=3, etc. The intuition is similar to that of the Apriori algorithm.

24 Experiments with NER data
Baseline algorithms:
CRF/1 labels words inside and outside entities with I and O, respectively.
CRF/4 replaces the I tag with four tags B, E, C, and U, which depend on where the word appears in an entity.
Datasets:
The Address corpus contains 4,226 words, and consists of 395 home addresses of students. The paper considered extraction of city names and state names from this corpus.
The Jobs corpus contains 73,330 words, and consists of 300 computer-related job postings. The paper considered extraction of company names and job titles.
The 18,121-word Email corpus contains 216 messages taken from the CSPACE corpus, which is mail associated with a 14-week, 277-person management game. The paper considered extraction of person names.
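The two baseline encodings can be illustrated with a short sketch. The slide does not spell out what B, E, C, and U stand for; the code assumes the common reading of begin, end, continuation, and unique (a single-word entity). The function names and the `(t, u)` span format (1-based, inclusive) are hypothetical.

```python
def encode_crf1(n, entities):
    """CRF/1: tag words inside entity spans with I, all others with O."""
    tags = ["O"] * n
    for t, u in entities:
        for i in range(t, u + 1):
            tags[i - 1] = "I"
    return tags

def encode_crf4(n, entities):
    """CRF/4: B/C/E for multi-word entities, U for single-word ones.

    Assumed meaning: B = begin, C = continuation, E = end, U = unique.
    """
    tags = ["O"] * n
    for t, u in entities:
        if t == u:
            tags[t - 1] = "U"
        else:
            tags[t - 1] = "B"
            for i in range(t + 1, u):
                tags[i - 1] = "C"
            tags[u - 1] = "E"
    return tags
```

For the nine-word skiing sentence with entity spans (5, 6) and (8, 9), CRF/1 yields O O O O I I O I I, while CRF/4 yields O O O O B E O B E, so the position of a word inside its entity becomes visible to a token-level model.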

25 Features
CRF features:
Indicators for specific words at location $i$, or locations within three words of $i$.
Indicators for capitalization/letter patterns.
Semi-CRF features:
Indicators for the phrase inside a segment and the capitalization pattern inside a segment.
Indicators for words and capitalization patterns in 3-word windows before and after the segment.
Indicators for each segment length ($d = 1, \ldots, L$), and all word-level features combined with indicators for the beginning and end of a segment.
Dictionary based features:
External dictionary of strings $D$: $g^{\text{sim},D}(j, x, s) = \max_{u \in D} \text{sim}(x_{s_j}, u)$
Internal segment dictionary
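The external-dictionary feature takes the best similarity between a candidate segment and any dictionary entry. The sketch below is illustrative only: Python's `difflib.SequenceMatcher` ratio is used as a stand-in for `sim` and is not claimed to be one of the similarity measures used in the paper; the function names, the lower-casing, and the example dictionary are assumptions.

```python
from difflib import SequenceMatcher

def sim(a, b):
    """Stand-in similarity in [0, 1]; an assumption, not the paper's measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def dictionary_feature(x, t, u, dictionary):
    """max over u' in D of sim(x[t..u], u'), positions 1-based inclusive."""
    segment = " ".join(x[t - 1:u])
    return max((sim(segment, entry) for entry in dictionary), default=0.0)
```

A feature like this is only available at the segment level: a token-level CRF sees one word at a time, whereas here the whole candidate phrase, such as "Fernando Pereira", is compared against the dictionary in one step.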

26 Results

27 Results

28 Results
Dhruvin, Prachi, Gagan: Precision/recall values are not reported.
Anshul: Why do order-L CRFs perform much worse than semi-CRFs?
Nupur, Haroun: Comparison with only CRF?
