Part-of-Speech Tagging

1 Part-of-Speech Tagging: A Canonical Finite-State Task (Intro to NLP, J. Eisner)

2 The Tagging Task. Input: the lead paint is unsafe. Output: the/Det lead/N paint/N is/V unsafe/Adj. Uses: text-to-speech (how do we pronounce "lead"?); we can write regexps like (Det) Adj* N+ over the output; preprocessing to speed up a parser (but a little dangerous); if you know the tag, you can back off to it in other tasks.
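
Once words carry tags, simple noun-phrase chunks really can be found with an ordinary regular expression over the tag string, which is what the (Det) Adj* N+ remark means. A minimal sketch, assuming the toy tag names above; the helper code is illustrative, not from the lecture:

import re

# Tagged output in word/TAG form, as on the slide.
tagged = "the/Det lead/N paint/N is/V unsafe/Adj"

# Work over the tag sequence alone: (Det )?(Adj )*(N ?)+ matches a simple noun phrase.
tags = " ".join(token.split("/")[1] for token in tagged.split())   # "Det N N V Adj"
match = re.search(r"(Det )?(Adj )*(N ?)+", tags + " ")
print(match.group(0))   # "Det N N " -> the noun phrase "the lead paint"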

3 Why Do We Care? Input: the lead paint is unsafe. Output: the/Det lead/N paint/N is/V unsafe/Adj. The first statistical NLP task; been done to death by different methods. Easy to evaluate (how many tags are correct?). A canonical finite-state task. Can be done well with methods that look only at local context, though we should really do it by parsing!

4 Degree of Supervision. Supervised: the training corpus is tagged by humans. Unsupervised: the training corpus isn't tagged. Partly supervised: the training corpus isn't tagged, but you have a dictionary giving the possible tags for each word. We'll start with the supervised case and move to decreasing levels of supervision.

5 Current Performance. Input: the lead paint is unsafe. Output: the/Det lead/N paint/N is/V unsafe/Adj. How many tags are correct? About 97% currently (for English), but the baseline is already 90%. The baseline is the performance of the stupidest possible method: tag every word with its most frequent tag, and tag unknown words as nouns.
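
That 90% baseline is easy to reproduce. A minimal sketch of a most-frequent-tag baseline; the tiny training list is made up for illustration:

from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    """tagged_corpus: iterable of (word, tag) pairs."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    # Keep only each word's most frequent tag.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def baseline_tag(words, lexicon, unknown_tag="N"):
    # Unknown words are tagged as nouns.
    return [(w, lexicon.get(w, unknown_tag)) for w in words]

toy_corpus = [("the", "Det"), ("lead", "N"), ("lead", "V"), ("lead", "N"),
              ("paint", "N"), ("is", "V"), ("unsafe", "Adj")]
lexicon = train_baseline(toy_corpus)
print(baseline_tag("the lead paint is unsafe".split(), lexicon))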

6 What Should We Look At? Example sentence: "Bill directed a cortege of autos through the dunes", with the correct tags (PN Verb Det N Prep N Prep Det N) shown above the words and some possible tags for each word (PN, Det, Adj, Noun, Verb, Prep, and maybe more) shown below. Each unknown tag is constrained by its word and by the tags to its immediate left and right. But those tags are unknown too.

9 Three Finite-State Approaches. Noisy Channel Model (statistical): the real language Y, here the part-of-speech tags generated by an n-gram model, passes through a noisy channel Y to X that replaces tags with words, yielding the observed string X, the text; we want to recover Y from X.

10 Three Finite-State Approaches. 1. Noisy Channel Model (statistical). 2. Deterministic baseline tagger composed with a cascade of fixup transducers. 3. Nondeterministic tagger composed with a cascade of finite-state automata that act as filters.

11 Review: Noisy Channel. The real language Y passes through a noisy channel to give the observed string X, with p(Y) * p(X | Y) = p(X, Y). We want to recover y from x: choose the y that maximizes p(y | x), or equivalently p(x, y).
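
In code the decision rule is just an argmax over the joint probability. A toy sketch with made-up numbers for a two-hypothesis channel:

# p(y) and p(x | y) for a toy channel; all numbers are illustrative.
p_y = {"y1": 0.7, "y2": 0.3}
p_x_given_y = {("C", "y1"): 0.1, ("C", "y2"): 0.6}

def decode(x):
    # argmax_y p(y) * p(x | y) = argmax_y p(x, y), which also maximizes p(y | x).
    return max(p_y, key=lambda y: p_y[y] * p_x_given_y.get((x, y), 0.0))

print(decode("C"))   # y2, since 0.3 * 0.6 = 0.18 beats 0.7 * 0.1 = 0.07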

12 Review: Noisy Channel. Composing the two machines (.o.) gives p(Y) * p(X | Y) = p(X, Y). Note that p(X, Y) sums to 1. Suppose x = "C"; what is the best y?

13 Review: Noisy Channel. The same composed machine, p(Y) * p(X | Y) = p(X, Y). Suppose x = "C"; what is the best y?

14 Review: Noisy Channel. Compose p(Y) with p(X | Y), then restrict to just the paths compatible with the observed output by composing with the acceptor (X = x); the result gives p(x, Y).

15 Noisy Channel for Tagging. Acceptor p(Y): a Markov Model over tag sequences. Composed (.o.) with transducer p(X | Y): Unigram Replacement, mapping tags to words. Composed (.o.) with acceptor (X = x): the observed words as a straight-line automaton. The result is a transducer that scores candidate tag sequences on their joint probability p(x, Y) with the observed words; pick the best path.

16 Markov Model (bigrams): states for the tags (Det, Adj, Noun, Verb, Prep) plus Start and Stop.

17 Markov Model: the same tag states, now connected by transition arcs.

18 Markov Model: the transition arcs now carry probabilities (e.g. 0.8 on the arc out of Start).

19 Markov Model: p(tag seq) for a path from Start to Stop is the product of its arc probabilities, e.g. p(Start ... Stop) = 0.8 * 0.3 * 0.4 * 0.5 * ... (one factor per transition).

20 Markov Model as an FSA: the same bigram model viewed as a weighted finite-state acceptor over tag sequences; as before, p(Start ... Stop) = 0.8 * 0.3 * 0.4 * 0.5 * ....

22 Markov Model (tag bigrams): the same example path probability, p(Start ... Stop) = 0.8 * 0.3 * 0.4 * 0.5 * ....
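
Under the bigram model a tag sequence's probability is just the product of its transition probabilities. A minimal sketch; the 0.8 / 0.3 / 0.4 / 0.5 factors are the ones visible on the slides, while the tag names on the example path and the final 0.2 are filled in for illustration:

# Bigram transition probabilities p(next tag | previous tag); values are illustrative.
trans = {("Start", "Det"): 0.8, ("Det", "Adj"): 0.3, ("Adj", "Adj"): 0.4,
         ("Adj", "Noun"): 0.5, ("Noun", "Stop"): 0.2}

def p_tag_seq(tags):
    # One transition factor per adjacent pair of tags.
    prob = 1.0
    for prev, nxt in zip(tags, tags[1:]):
        prob *= trans.get((prev, nxt), 0.0)
    return prob

print(p_tag_seq(["Start", "Det", "Adj", "Adj", "Noun", "Stop"]))  # 0.8 * 0.3 * 0.4 * 0.5 * 0.2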

23 Noisy Channel for Tagging. Automaton p(Y): the Markov Model over tag sequences. Composed (.o.) with transducer p(X | Y): Unigram Replacement (tags to words). Composed (.o.) with the automaton for the observed words (a straight line). The composed transducer scores candidate tag sequences on their joint probability p(x, Y) with the observed words; pick the best path.

24 Noisy Channel for Tagging. The Markov Model over tags (states such as Start, Verb, Prep, Stop, with transition probabilities such as 0.4 and 0.2), composed (.o.) with the Unigram Replacement transducer p(X | Y) (arcs such as N:cortege/..., N:autos/0.001, PN:Bill/0.002, Det:a/0.6, Det:the/0.4, Adj:cool/0.003, Verb:directed/...), composed (.o.) with the straight-line automaton for the observed words "the cool directed autos". The resulting transducer scores candidate tag sequences on their joint probability p(x, Y) with the observed words; we should pick the best path.

25 Unigram Replacement Model: p(word seq | tag seq). Arcs such as N:cortege/..., N:autos/0.001, PN:Bill/0.002 (the word probabilities for each tag sum to 1), Det:a/0.6, Det:the/0.4 (again summing to 1), Adj:cool/0.003, Verb:directed/....
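
In code the replacement model is one distribution over words per tag. A sketch: the 0.001, 0.002, 0.6, 0.4, and 0.003 values appear on the slide; the tag labels on some arcs and the remaining numbers are filled in for illustration:

# p(word | tag); each inner dict would sum to 1 over the full vocabulary.
emit = {
    "N":    {"autos": 0.001, "cortege": 0.000001},
    "PN":   {"Bill": 0.002},
    "Det":  {"a": 0.6, "the": 0.4},        # this row already sums to 1
    "Adj":  {"cool": 0.003},
    "Verb": {"directed": 0.001},
}

def p_words_given_tags(words, tags):
    # p(word seq | tag seq) under the unigram replacement model.
    prob = 1.0
    for w, t in zip(words, tags):
        prob *= emit.get(t, {}).get(w, 0.0)
    return prob

print(p_words_given_tags("the cool directed autos".split(), ["Det", "Adj", "Verb", "N"]))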

26 Compose. Compose the Markov Model p(tag seq) (transition probabilities 0.8, 0.3, ... over Start, Verb, Prep, Stop and the other tag states) with the Unigram Replacement transducer (N:cortege/..., N:autos/0.001, PN:Bill/0.002, Det:a/0.6, Det:the/0.4, Adj:cool/0.003, Verb:directed/...).

27 Compose. The composed machine assigns p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq). Its arcs carry combined weights, e.g. Det:a 0.48 and Det:the 0.32 out of Start (0.8 * 0.6 and 0.8 * 0.4), plus arcs for cool, directed, cortege, N:cortege and N:autos, continuing through the Verb and Prep states to Stop.

28 Observed Words as Straight-Line FSA: word seq "the cool directed autos".

29 Compose with "the cool directed autos". p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq), starting from the composed machine of the previous slide (Det:a 0.48, Det:the 0.32, arcs for cool, directed, cortege, autos, ...).

30 Compose with "the cool directed autos". After composing with the straight-line automaton, only arcs compatible with the observed words remain (Det:the 0.32, then arcs for cool, directed, and N:autos). Why did this loop go away?

31 The best path. For "the cool directed autos", the best path Start ... Stop has probability 0.32 * ... under p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq).

32 In Fact, Paths Form a Trellis. The surviving states and arcs of the composed machine line up by word position, giving a trellis weighted by p(word seq, tag seq); the best path Start ... Stop = 0.32 * ... for "the cool directed autos".

33 The Trellis Shape Emerges from the Cross-Product Construction for Finite-State Composition. Composing with the straight-line automaton (states 0 through 4) pairs each state of the tagging machine with a word position, giving composite states (1,1), (2,1), ..., (4,4). All paths in the straight-line machine are 4 words long, so all paths in the composition must have 4 words on the output side.
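
That cross-product pairing is easy to spell out. A minimal sketch of the trellis node set (the tag names are the slides'; everything else is illustrative, not the lecture's FST toolkit):

tags = ["Det", "Adj", "Noun", "Verb", "Prep"]
words = "the cool directed autos".split()

# Straight-line automaton states are word positions 0..4; the composed machine's
# states are (tag state, position) pairs, i.e. the trellis nodes.
trellis_nodes = [("Start", 0)] + [(t, i) for t in tags for i in range(1, len(words) + 1)]

# Every accepted path advances the position once per word, so every path through
# the composition reads exactly the 4 observed words on its output side.
print(len(trellis_nodes))   # 1 + 5 tags * 4 positions = 21 candidate nodes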

34 Actually, Trellis Isn't Complete. The trellis has no ... or ... Stop arcs; why?

35 Actually, Trellis Isn't Complete. The lattice is missing some other arcs; why?

36 Actually, Trellis Isn't Complete. The lattice is missing some states; why?

37 Find the best path from Start to Stop. Use dynamic programming, as in probabilistic parsing: what is the best path from Start to each node? Work from left to right; each node stores its best path from Start (as a probability plus one backpointer). This is a special acyclic case of Dijkstra's shortest-path algorithm, and it is faster if some arcs/states are absent.
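
A minimal Viterbi sketch of that left-to-right dynamic program, storing one probability and one backpointer per trellis node; the model tables at the bottom are made up for illustration, not the lecture's numbers:

def viterbi(words, tags, trans, emit):
    # best[i][t] = (probability of the best path ending in tag t after word i, backpointer)
    best = [{"Start": (1.0, None)}]
    for i, w in enumerate(words, 1):
        best.append({})
        for t in tags:
            best[i][t] = max((p * trans.get((prev, t), 0.0) * emit.get(t, {}).get(w, 0.0), prev)
                             for prev, (p, _) in best[i - 1].items())
    # Close off with the Stop transition, then follow backpointers right to left.
    prob, last = max((p * trans.get((t, "Stop"), 0.0), t) for t, (p, _) in best[-1].items())
    path = [last]
    for i in range(len(words), 1, -1):
        path.append(best[i][path[-1]][1])
    return prob, list(reversed(path))

# Toy run (tables are illustrative):
trans = {("Start", "Det"): 0.8, ("Det", "Adj"): 0.3, ("Adj", "Verb"): 0.2,
         ("Verb", "Noun"): 0.5, ("Noun", "Stop"): 0.2}
emit = {"Det": {"the": 0.4}, "Adj": {"cool": 0.003},
        "Verb": {"directed": 0.001}, "Noun": {"autos": 0.001}}
print(viterbi("the cool directed autos".split(), ["Det", "Adj", "Verb", "Noun"], trans, emit))

Because the trellis is acyclic and processed left to right, this is exactly the special acyclic case of Dijkstra's algorithm the slide mentions; arcs or states that are absent simply never enter the max.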

38 In Summary. We are modeling p(word seq, tag seq). The tags are hidden, but we see the words. Is tag sequence X likely with these words? This noisy channel model is a Hidden Markov Model: transition probabilities come from the tag bigram model and emission probabilities from unigram replacement, one of each per position along Start PN Verb Det N Prep ... over "Bill directed a cortege of autos through ...". Find the X that maximizes the probability product.
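
The probability product the slide describes is one transition factor and one emission factor per position. A minimal sketch of that joint score, for trans and emit tables shaped like the illustrative ones above:

def p_joint(words, tags, trans, emit):
    # p(word seq, tag seq) = product over positions of
    #   p(tag_i | tag_{i-1}) * p(word_i | tag_i), with Start and Stop padding the tags.
    padded = ["Start"] + list(tags) + ["Stop"]
    prob = 1.0
    for prev, nxt in zip(padded, padded[1:]):
        prob *= trans.get((prev, nxt), 0.0)
    for w, t in zip(words, tags):
        prob *= emit.get(t, {}).get(w, 0.0)
    return prob

# e.g. p_joint("the cool directed autos".split(), ["Det", "Adj", "Verb", "Noun"], trans, emit)
# with the toy tables from the Viterbi sketch above.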

39 Another Viewpoint. We are modeling p(word seq, tag seq). Why not use the chain rule plus some kind of backoff? Actually, we are! p(Start PN Verb Det ..., Bill directed a ...) = p(Start) * p(PN | Start) * p(Verb | Start PN) * p(Det | Start PN Verb) * ... * p(Bill | Start PN Verb Det ...) * p(directed | Bill, Start PN Verb Det ...) * p(a | Bill directed, Start PN Verb Det ...) * ...

40 Another Viewpoint (continued). The same chain-rule expansion, now shown against the full analysis Start PN Verb Det N Prep N Prep Det N Stop of "Bill directed a cortege of autos through the dunes".

41 Three Finite-State Approaches. 1. Noisy Channel Model (statistical). 2. Deterministic baseline tagger composed with a cascade of fixup transducers. 3. Nondeterministic tagger composed with a cascade of finite-state automata that act as filters.

42 Another FST Paradigm: Successive Fixups. Like successive markups, but the transducers alter the string rather than just mark it up; the input runs through the cascade of fixups to produce the output. Examples: morphology, phonology, part-of-speech tagging.

43 Transformation-Based Tagging (Brill 1995); figure from Brill's thesis.

44 Transformations Learned (figure from Brill's thesis). BaselineTag*, then fixup rules such as NN -> VB // TO _ (retag NN as VB when the previous tag is TO), ... -> VB // ... _, etc. Compose this cascade of FSTs: you get one big FST that does the initial tagging and the sequence of fixups all at once.
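
A sketch of how such a fixup cascade applies, written as plain list processing rather than FST composition; the rules here are Brill-style paraphrases (NN -> VB after TO is the classic example), not his exact learned list:

def apply_fixups(tagged, rules):
    """tagged: list of (word, tag); rules: list of (from_tag, to_tag, prev_tag) triples
    meaning 'retag from_tag as to_tag when the previous tag is prev_tag'."""
    tags = [t for _, t in tagged]
    for from_tag, to_tag, prev_tag in rules:        # apply each fixup in order, left to right
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return list(zip((w for w, _ in tagged), tags))

rules = [("NN", "VB", "TO")]                        # the classic 'NN -> VB after TO' correction
print(apply_fixups([("to", "TO"), ("lead", "NN")], rules))   # [('to', 'TO'), ('lead', 'VB')]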

45 Initial Tagging of OOV Words; figure from Brill's thesis.

46 Three Finite-State Approaches. 1. Noisy Channel Model (statistical). 2. Deterministic baseline tagger composed with a cascade of fixup transducers. 3. Nondeterministic tagger composed with a cascade of finite-state automata that act as filters.

47 Variations. Multiple tags per word, with transformations that knock some of them out; how do we encode multiple tags and knockouts? Use the above for partly supervised learning. Supervised: you have a tagged training corpus. Unsupervised: you have an untagged training corpus. Here: you have an untagged training corpus and a dictionary giving the possible tags for each word.
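
A sketch of the dictionary constraint: each word's candidate tags are restricted up front, before any learning or knocking-out happens (the tiny dictionary and the open-class fallback are illustrative):

# Possible tags per word, as a tag dictionary would supply (entries are illustrative).
tag_dict = {"the": {"Det"}, "lead": {"N", "V"}, "paint": {"N", "V"},
            "is": {"V"}, "unsafe": {"Adj"}}

def candidate_tags(words, tag_dict, open_class=("N", "V", "Adj")):
    # Unknown words fall back to the open-class tags.
    return [sorted(tag_dict.get(w, open_class)) for w in words]

print(candidate_tags("the lead paint is unsafe".split(), tag_dict))
# [['Det'], ['N', 'V'], ['N', 'V'], ['V'], ['Adj']]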
