Finite-State and the Noisy Channel
Intro to NLP - J. Eisner


Slide 1: Finite-State and the Noisy Channel (title slide)

Slide 2: Word Segmentation
x = theprophetsaidtothecity
What does this say? And what other words are substrings?
Could segment with parsing (how?), but slow.
Given L = a lexicon FSA that matches all English words, what do these regexps do?
  (here " " denotes the space character and 0 denotes the empty string ε)
  L*
  L* & x
  [ L " " ]*
  [ L " ":0 ]* .o. x     (what does this FSA look like? how many paths?)
  [ L " " ]* .o. [ [" ":0] | \" " ]* .o. x     (equivalent)

Slide 3: Word Segmentation
x = theprophetsaidtothecity
Given L = a lexicon FSA that matches all English words, let's try to recover the spaces with a single regexp:
  [ [ L " ":0 ]* .o. x ].u
  [ L " ":0 ]*   word sequence transduced to its spaceless version
  .o. x          restrict to paths that produce the observed string x
  .u             language of upper strings of those paths

Slide 4: Word Segmentation
x = theprophetsaidtothecity; L = a lexicon FSA that matches all English words.
Point of the lecture: this solution generalizes!
  [ [ L " ":0 ]* .o. x ].u
What if the lexicon is weighted? From unigrams to bigrams?
Smooth L to include unseen words?
Make the upper strings be over an alphabet of words, not letters?
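To make this concrete, here is a minimal Python sketch (not from the lecture; the toy lexicon and its unigram probabilities are invented) that recovers the best segmentation of a spaceless string by dynamic programming. When L is a weighted lexicon, this is the same computation as finding the best path in [ [ L " ":0 ]* .o. x ].u.

    import math

    # Hypothetical weighted lexicon: word -> unigram probability (toy numbers).
    lexicon = {"the": 0.30, "to": 0.25, "said": 0.10, "prophet": 0.05,
               "city": 0.05, "he": 0.05, "prop": 0.01}

    def segment(x, max_word_len=12):
        """Return (best log-probability, best word sequence) for spaceless x."""
        n = len(x)
        best = [(-math.inf, None)] * (n + 1)   # best[i] = (score, backpointer) for x[:i]
        best[0] = (0.0, None)
        for i in range(1, n + 1):
            for j in range(max(0, i - max_word_len), i):
                word = x[j:i]
                if word in lexicon and best[j][0] > -math.inf:
                    score = best[j][0] + math.log(lexicon[word])
                    if score > best[i][0]:
                        best[i] = (score, j)
        if best[n][0] == -math.inf:
            return None                        # no segmentation using this lexicon
        words, i = [], n
        while i > 0:                           # follow backpointers
            j = best[i][1]
            words.append(x[j:i])
            i = j
        return best[n][0], list(reversed(words))

    print(segment("theprophetsaidtothecity"))
    # -> (log-probability, ['the', 'prophet', 'said', 'to', 'the', 'city'])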

Slide 5: Spelling Correction
Spelling correction also needs a lexicon L, but there is distortion beyond deleting spaces.
Let T be a transducer that models common typos and other spelling errors:
  ance → ence            (deliverance, ...)
  e → ε                  (deliverance, ...)
  ε → e // Cons _ Cons   (athlete, ...)
  rr → r                 (embarrass, occurrence, ...)
  ge → dge               (privilege, ...)
  etc.
What does L .o. T represent? How is that useful?
Should L and T have probabilities? We'll want T to include all possible errors.

Slide 6: Noisy Channel Model
real language Y  →  noisy channel (Y → X)  →  observed string X
We want to recover Y from X.

Slide 7: Noisy Channel Model (spelling correction)
real language Y = correct spelling  →  noisy channel = typos  →  observed string X = misspelling
We want to recover Y from X.

Slide 8: Noisy Channel Model (word segmentation)
real language Y = (lexicon space)*  →  noisy channel = delete spaces  →  observed string X = text without spaces
We want to recover Y from X.

Slide 9: Noisy Channel Model (speech recognition)
real language Y = (lexicon space)*, scored by a language model  →  noisy channel = pronunciation and acoustic models  →  observed string X = speech
We want to recover Y from X.

Slide 10: Noisy Channel Model (parsing)
real language Y = tree, generated by a probabilistic CFG  →  noisy channel = delete everything but terminals  →  observed string X = text
We want to recover Y from X.

Slide 11: Noisy Channel Model
real language Y, with p(Y)  →  noisy channel, with p(X | Y)  →  observed string X
p(Y) · p(X | Y) = p(X, Y)
We want to recover y ∈ Y from x ∈ X: choose the y that maximizes p(y | x), or equivalently p(x, y).

Slide 12: Remember: Noisy Channel and Bayes' Theorem
Generate y with probability p(Y = y); the channel messes y up into x with probability p(X = x | Y = y); the decoder outputs the most likely reconstruction of y:
  maximize p(Y = y | X = x) = p(Y = y) p(X = x | Y = y) / p(X = x)
                            = p(Y = y) p(X = x | Y = y) / Σ_y' p(Y = y') p(X = x | Y = y')
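A small self-contained sketch of the decoder (illustration only; the candidate words and all probabilities below are invented, not from the slides): given a prior p(y) and a channel p(x | y), choose the y that maximizes p(y) p(x | y), which is the same y that maximizes the posterior p(y | x).

    # Toy noisy-channel decoder for a single misspelled word.
    prior = {"deliverance": 0.6, "delivery": 0.3, "dalliance": 0.1}     # p(y)
    channel = {                                                         # p(x | y)
        ("deliverence", "deliverance"): 0.20,
        ("deliverence", "delivery"):    0.001,
        ("deliverence", "dalliance"):   0.0001,
    }

    def decode(x):
        """Return (most likely y, its posterior p(y | x)) for observed x."""
        joint = {y: p_y * channel.get((x, y), 0.0) for y, p_y in prior.items()}  # p(x, y)
        evidence = sum(joint.values())                                           # p(x)
        if evidence == 0.0:
            return None
        best = max(joint, key=joint.get)
        return best, joint[best] / evidence

    print(decode("deliverence"))   # -> ('deliverance', posterior close to 1)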

Slides 13-14: Noisy Channel Model
[Diagram: a source FSA carrying p(y), composed (.o.) with a channel FST carrying p(x | y); each path of the result carries p(y) · p(x | y) = p(x, y). Note that p(x, y) sums to 1 over all pairs (x, y).]
Suppose x = "C"; what is the best y?

Slide 15: Noisy Channel Model
[Diagram: p(y) FSA  .o.  p(x | y) FST  .o.  an acceptor for the observed string, with weight 1 if X = x and 0 otherwise]
Composing with that acceptor restricts the machine to just the paths compatible with the output "C"; each surviving path keeps its weight p(x, y).

Slide 16: Noisy Channel Model, the general recipe
  [ Y .o. T .o. x ].u
  Y      source model FSA: accepts string y with weight p(y) (i.e., the paths accepting y have total probability p(y))
  T      noisy channel FST: accepts the pair (x, y) with weight p(x | y) (i.e., the paths transducing y to x have total probability p(x | y))
  .o. x  restrict to paths that produce the observed string x
  .u     language of upper strings of those paths
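The recipe can be imitated directly on toy weighted relations represented as Python dicts (a sketch for intuition only; a real system would use weighted FSTs, and the strings and weights here are invented): compose the source with the channel, restrict the lower side to the observed x, and project onto the upper strings.

    # Toy version of [ Y .o. T .o. x ].u with relations small enough to enumerate.
    Y = {"ab": 0.7, "ba": 0.3}                      # source model: p(y)
    T = {                                           # channel: p(x | y), keyed by (y, x)
        ("ab", "ab"): 0.8, ("ab", "aa"): 0.2,
        ("ba", "ba"): 0.9, ("ba", "aa"): 0.1,
    }

    def compose(source, channel):
        """Join p(y) with p(x | y) to get the joint p(x, y), keyed by (y, x)."""
        return {(y, x): p_y * p_xy
                for y, p_y in source.items()
                for (y2, x), p_xy in channel.items() if y2 == y}

    def restrict_and_project_up(joint, observed_x):
        """Keep pairs whose lower string is observed_x; return weights on upper strings."""
        return {y: p for (y, x), p in joint.items() if x == observed_x}

    joint = compose(Y, T)                                    # Y .o. T
    print(restrict_and_project_up(joint, "aa"))              # [ ... .o. x ].u
    # -> ab gets weight 0.14, ba gets 0.03 (up to rounding), so the best y is 'ab'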

Slide 17: It's Just Bayes' Theorem Again
  Y: p(y) is the prior on Y
  .o. T: p(x | y) is the likelihood of y, for each x
  .o. x: the evidence X = x (restrict to just the paths compatible with the observed output "C")
  = p(x, y); divide by the constant p(x) to get the posterior p(y | x)

Slide 18: Morpheme Segmentation
Let Lexicon be a machine that matches all Turkish words.
Same problem as word segmentation, just at a lower level: morpheme segmentation.
Turkish word: uygarlaştıramadıklarımızdanmışsınızcasına
  = uygar+laş+tır+ma+dık+ları+mız+dan+mış+sınız+ca+sı+na
  "(behaving) as if you are among those whom we could not cause to become civilized"
Some constraints on the morpheme sequence: bigram probabilities.
Generative model: concatenate, then fix up the joints
  stop + -ing = stopping, fly + -s = flies, vowel harmony
Use a cascade of transducers to handle all the fixups.
But this is just morphology! Can use probabilities here too (but people often don't).

Slide 19: An FST: Baby Think & Baby Talk
[Diagram: a weighted FST mapping "think" strings to "talk" strings. Its arcs include babbling arcs that emit a letter with no thought, such as ε:m and ε:b, and word arcs such as Mama:m and IWant:u, each with a probability.]
Observe the talk "mm" and recover the think string by composition. The candidate think strings include Mama, Mama IWant, Mama IWant IWant, ..., and the empty string (pure babbling), each with its path probability.

Slide 20: Joint Probability by Double Composition
[Diagram: an acceptor for all think strings  .o.  the baby FST  .o.  an acceptor for the observed talk "mm"; the composed machine keeps exactly the paths that transduce some think string to mm]
p(* : mm), the probability that the talk is mm with any think string, = sum of all paths in the composed machine.

Slides 21-23: Cute method for summing all paths
[Diagrams: the same composed machine; then the machine with its arc labels stripped away, since only the weights matter for the sum; then the result of compose + minimize, which collapses the parallel paths]
Minimization merges the parallel paths, so the total probability p(* : mm), the sum of all paths, can be read directly off the collapsed machine.

Slide 24: p(Mama | mm)
[Diagram: restrict the think side to Mama by composing with an acceptor for Mama, then compose + minimize as on the previous slides]
  p(Mama : mm) = sum of the Mama paths
  p(* : mm)    = sum of all paths
  p(Mama | mm) = p(Mama : mm) / p(* : mm)
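The same quantities can be checked by brute force on a tiny model. The sketch below uses invented arc probabilities (the slide's numbers did not survive transcription): it enumerates every path of a one-state baby FST that emits the talk string mm, sums the path probabilities for each think string, and normalizes to get the posterior p(think | mm).

    from collections import defaultdict

    STOP = 0.10                  # probability of stopping (toy number)
    ARCS = [                     # (think word, or None for babbling; emitted letters; probability)
        (None,    "m",  0.30),
        (None,    "b",  0.20),
        ("Mama",  "mm", 0.25),
        ("IWant", "u",  0.15),
    ]

    def joint_probs(talk):
        """Map each think sequence to p(think, talk) by summing its paths."""
        totals = defaultdict(float)
        def walk(remaining, think, prob):
            if not remaining:                      # may stop once all of talk is emitted
                totals[think] += prob * STOP
            for word, emit, p in ARCS:
                if remaining.startswith(emit):     # this arc explains a prefix of talk
                    walk(remaining[len(emit):],
                         think + (word,) if word else think,
                         prob * p)
        walk(talk, (), 1.0)
        return totals

    joint = joint_probs("mm")
    p_mm = sum(joint.values())                     # p(talk = mm) = sum of all paths
    for think, p in sorted(joint.items(), key=lambda kv: -kv[1]):
        print(think, p, "posterior:", p / p_mm)
    # ('Mama',) comes out as the most probable explanation of mm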

Slide 25: Edit Distance
The older baby said caca and was probably thinking clara (?). Do these match up well? How well? Some alignments of clara with caca:
  c l a r a / c a c a _                    3 substitutions + 1 deletion  = total cost 4
  c l a r _ a / c _ a _ c a                2 deletions + 1 insertion     = total cost 3
  c l a r a / c _ a c a                    1 deletion + 1 substitution   = total cost 2   ← minimum edit distance (best alignment)
  c l a r a _ _ _ _ / _ _ _ _ _ c a c a    5 deletions + 4 insertions    = total cost 9
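The minimum edit distance itself is the textbook dynamic program below (standard Levenshtein distance with unit costs for insertion, deletion, and substitution); it computes exactly the cost of the min-cost path through the grid drawn on the next slides.

    def edit_distance(upper, lower):
        m, n = len(upper), len(lower)
        # dist[i][j] = cost of the best alignment of upper[:i] with lower[:j]
        dist = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dist[i][0] = i                         # i deletions
        for j in range(n + 1):
            dist[0][j] = j                         # j insertions
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = 0 if upper[i - 1] == lower[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,        # deletion (horizontal edge)
                                 dist[i][j - 1] + 1,        # insertion (vertical edge)
                                 dist[i - 1][j - 1] + sub)  # substitution or copy (diagonal edge)
        return dist[m][n]

    print(edit_distance("clara", "caca"))   # -> 2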

Slides 26-27: Edit distance as min-cost path
[Diagram: an alignment grid. The horizontal axis is the position in the upper string (0 c 1 l 2 a 3 r 4 a 5); the vertical axis is the position in the lower string (0 c 1 a 2 c 3 a 4).]
The minimum-cost path through the grid shows the best alignment, and its cost is the edit distance.

Slide 28: Edit distance as min-cost path
A deletion edge has cost 1. It advances in the upper string only, so it's horizontal. It pairs the next letter of the upper string with ε (empty) in the lower string.

Slide 29: Edit distance as min-cost path
An insertion edge has cost 1. It advances in the lower string only, so it's vertical. It pairs ε (empty) in the upper string with the next letter of the lower string.

Slide 30: Edit distance as min-cost path
A substitution edge has cost 0 or 1. It advances in the upper and lower strings simultaneously, so it's diagonal. It pairs the next letter of the upper string with the next letter of the lower string; the cost is 0 or 1 depending on whether those letters are identical.

Slide 31: Edit distance as min-cost path
We're looking for a path from the upper left of the grid to the lower right (so as to get through both strings). Solid edges have cost 0, dashed edges have cost 1, so we want the path with the fewest dashed edges.

Slides 32-35: Edit distance as min-cost path
[Diagrams: the four alignments from slide 25 drawn as paths through the grid]
  clara / caca, letter by letter:            3 substitutions + 1 deletion  = total cost 4
  clar_a / c_a_ca:                           2 deletions + 1 insertion     = total cost 3
  clara / c_aca:                             1 deletion + 1 substitution   = total cost 2
  delete all of clara, insert all of caca:   5 deletions + 4 insertions    = total cost 9

Slide 36: Edit Distance Transducer
[Diagram: a one-state transducer over an alphabet of k letters]
  O(k) deletion arcs, O(k) insertion arcs, O(k²) substitution arcs, O(k) no-change arcs.

Slide 37: Stochastic Edit Distance Transducer
Same shape: O(k) deletion arcs, O(k) insertion arcs, O(k²) substitution arcs, O(k) identity arcs, but now likely edits are high-probability arcs.
This defines p(x | y), e.g. p(caca | clara).
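A sketch of the stochastic version (all probabilities invented, and a single probability per operation instead of one per letter pair): give each edit operation a probability, turn probabilities into costs with -log, and run the same dynamic program. The min-cost path is then the most probable single alignment, a Viterbi approximation to p(x | y).

    import math

    # Hypothetical operation probabilities.
    cost = {op: -math.log(p)
            for op, p in {"copy": 0.90, "sub": 0.02, "del": 0.04, "ins": 0.04}.items()}

    def best_alignment_neglogprob(y, x):
        """-log probability of the best alignment transducing y (upper) to x (lower)."""
        m, n = len(y), len(x)
        d = [[math.inf] * (n + 1) for _ in range(m + 1)]
        d[0][0] = 0.0
        for i in range(m + 1):
            for j in range(n + 1):
                if i < m:
                    d[i + 1][j] = min(d[i + 1][j], d[i][j] + cost["del"])
                if j < n:
                    d[i][j + 1] = min(d[i][j + 1], d[i][j] + cost["ins"])
                if i < m and j < n:
                    op = "copy" if y[i] == x[j] else "sub"
                    d[i + 1][j + 1] = min(d[i + 1][j + 1], d[i][j] + cost[op])
        return d[m][n]

    print(math.exp(-best_alignment_neglogprob("clara", "caca")))
    # approximate p(caca | clara), taking only its best alignment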

Slide 38: Most likely path that maps clara to caca?
  clara .o. (stochastic edit transducer) .o. caca
Find the best path in the composed machine (e.g. by Dijkstra's algorithm).

Slide 39: What string produced caca? (posterior distribution)
  Lexicon* .o. (stochastic edit transducer) .o. caca
= a more complicated FST, with cycles. Each path is a possible input string aligned to caca. A path's weight is proportional to the posterior probability of that path as an explanation for caca. Could find the best path.
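Continuing the sketch (toy lexicon, invented channel parameters, and the best-alignment approximation in place of summing over all alignments): score each candidate lexicon word y by log p(y) + log p(caca | y) and take the argmax. This is a brute-force stand-in for finding the best path in Lexicon* .o. (edit transducer) .o. caca when the lexicon is small.

    import math

    EDIT_P, COPY_P = 0.05, 0.90                               # hypothetical channel parameters
    lexicon = {"clara": 0.4, "cara": 0.3, "coco": 0.3}        # hypothetical p(y)

    def channel_logprob(y, x):
        """log p(x | y) from the best alignment of y (upper) with x (lower)."""
        m, n = len(y), len(x)
        d = [[math.inf] * (n + 1) for _ in range(m + 1)]
        d[0][0] = 0.0
        for i in range(m + 1):
            for j in range(n + 1):
                if i < m:
                    d[i + 1][j] = min(d[i + 1][j], d[i][j] - math.log(EDIT_P))     # deletion
                if j < n:
                    d[i][j + 1] = min(d[i][j + 1], d[i][j] - math.log(EDIT_P))     # insertion
                if i < m and j < n:
                    p = COPY_P if y[i] == x[j] else EDIT_P
                    d[i + 1][j + 1] = min(d[i + 1][j + 1], d[i][j] - math.log(p))  # copy or substitution
        return -d[m][n]

    scores = {y: math.log(p_y) + channel_logprob(y, "caca") for y, p_y in lexicon.items()}
    print(max(scores, key=scores.get))   # -> 'cara' under these toy numbers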

Slide 40: Speech Recognition by FST Composition (Pereira & Riley 1996)
  trigram language model     p(word seq)
  .o. pronunciation model    p(phone seq | word seq)
  .o. acoustic model         p(acoustics | phone seq)
  .o. observed acoustics

Slide 41: Speech Recognition by FST Composition (Pereira & Riley 1996)
[Diagram: the same cascade with the middle stages expanded: the trigram language model p(word seq), a pronunciation dictionary with arcs such as CAT : k æ t plus phone-context transducers, together giving p(phone seq | word seq), and an acoustic model giving p(acoustics | phone seq)]

Slide 42: HMM Tagging
Can think of this as a finite-state noisy channel, too! See the slides in the tagging lecture.
