Natural Language Processing

1 Natural Language Processing N-grams and minimum edit distance Pieter Wellens These slides are based on the course materials from the ANLP course given at the School of Informatics, Edinburgh, and the online Coursera Stanford NLP course by Jurafsky and Manning.

2 Last week

3 Last week Language modeling with N-gram models Introduction to N-gram models Estimating N-gram probabilities Evaluation and Perplexity Unseen N-grams and smoothing Interpolation and scaling

4 Today Language modeling with N-gram models Unseen N-grams and smoothing Interpolation and scaling Minimum edit distance Introduction Computation

5 Today Language modeling with N-gram models Unseen N-grams and smoothing Interpolation and scaling Minimum edit distance Introduction Computation

6 Unseen N-grams (generalization) We have seen "i like to" in our corpus. We have never seen "i like to smooth" in our corpus. P(smooth | i like to) = 0. Any sentence that includes "i like to smooth" will be assigned probability 0.

7 Add-one Smoothing Also called Laplace smoothing. Pretend we saw each word one more time than we did, by adding one to all the counts. MLE estimate: P_MLE(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1}). Add-one estimate: P_Add-1(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V), where V is the vocabulary size.
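To make the formula concrete, here is a minimal Python sketch of the add-one bigram estimate (the toy corpus and function names are illustrative, not part of the course materials):

from collections import Counter

def add_one_bigram_prob(w_prev, w, bigram_counts, unigram_counts, vocab_size):
    # P_Add-1(w | w_prev) = (c(w_prev, w) + 1) / (c(w_prev) + V)
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + vocab_size)

tokens = "i like to eat i like to sleep".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)
print(add_one_bigram_prob("like", "to", bigram_counts, unigram_counts, V))
print(add_one_bigram_prob("to", "smooth", bigram_counts, unigram_counts, V))  # unseen bigram, no longer zero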

8 Add-one Smoothing Add-one smoothing is a blunt tool and isn't appropriate for n-grams; it is used in other NLP applications where the number of zeros isn't so huge.

9 Advanced smoothing algorithms Intuition used by many smoothing algorithms (Good-Turing, Kneser-Ney, Witten-Bell): use the count of things we've seen once to help estimate the count of things we've never seen.

10 Notation: Nc = frequency of frequency c, i.e. the count of things we've seen c times

11 Notation: Nc = frequency of frequency c, i.e. the count of things we've seen c times. Corpus: Sam I am, I am Sam, I do not eat. Word counts: I 3, Sam 2, am 2, do 1, not 1, eat 1. N1 = ?, N2 = ?, N3 = ?

12 Notation: Nc = frequency of frequency c, i.e. the count of things we've seen c times. Corpus: Sam I am, I am Sam, I do not eat. Word counts: I 3, Sam 2, am 2, do 1, not 1, eat 1. N1 = 3, N2 = 2, N3 = 1. N = total number of tokens (or n-grams) = 10.

13 Good-Turing smoothing intuition You are fishing and caught: 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish. How likely is it that the next species is trout?

14 Good-Turing smoothing intuition You are fishing (a scenario from Josh Goodman) and caught: 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish. How likely is it that the next species is trout? 1/18. How likely is it that the next species is new (i.e., catfish or bass)? Let's use our estimate of things-we-saw-once to estimate the new things.

15 Good-Turing smoothing intuition You are fishing (a scenario from Josh Goodman) and caught: 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish. How likely is it that the next species is trout? 1/18. How likely is it that the next species is new (i.e., catfish or bass)? Let's use our estimate of things-we-saw-once to estimate the new things: 3/18 (because N1 = 3). Assuming so, how likely is it that the next species is trout? It must be less than 1/18, but how do we estimate it?

16 Good-Turing formula c* = (c + 1) * N_{c+1} / N_c. Calculate it for cases seen once (e.g. trout): MLE: 1/18. c*(trout) = (1+1) * N2/N1 = 2 * 1/3 = 2/3. P*_GT(trout) = (2/3) / 18 = 1/27.
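A small Python sketch of the frequency-of-frequency counts and the Good-Turing adjusted count c* for the fishing example (illustrative only; it does not smooth over gaps where N_{c+1} = 0, which real implementations must handle):

from collections import Counter

catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
total = sum(catch.values())          # 18 fish
N = Counter(catch.values())          # N_c: how many species were seen c times

def c_star(c):
    # Good-Turing adjusted count: c* = (c + 1) * N_{c+1} / N_c
    return (c + 1) * N[c + 1] / N[c]

print(N[1] / total)        # P(next species is unseen) = N1 / total = 3/18
print(c_star(1) / total)   # P_GT(trout) = (2/3) / 18 = 1/27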

17 Good-Turing numbers Numbers from Church and Gale (1991), 22 million words of AP Newswire. c* = (c + 1) * N_{c+1} / N_c. Count c and its Good-Turing c*: 1 -> 0.446, 2 -> 1.26, 3 -> 2.24, 4 -> 3.24, 5 -> 4.22, 6 -> 5.19, 7 -> 6.21, 8 -> 7.24, 9 -> 8.25 (the value for c = 0 is not shown here).

18 Today Language modeling with N-gram models Unseen N-grams and smoothing Interpolation and scaling Minimum edit distance Introduction Computation

19 Backoff and interpolation Sometimes it helps to use less context, for example because you have only seen the larger context a few times (not reliable).

20 Backoff and interpolation Sometimes it helps to use less context, for example because you have only seen the larger context a few times (not reliable). Backoff: use the trigram if you have good evidence, otherwise the bigram, otherwise the unigram.

21 Backoff and interpolation Sometimes it helps to use less context, for example because you have only seen the larger context a few times (not reliable). Backoff: use the trigram if you have good evidence, otherwise the bigram, otherwise the unigram. Interpolation: mix unigram, bigram and trigram. Interpolation works better than backoff.

22 Linear Interpolation Simple interpolation

23 Linear Interpolation Simple interpolation Lambdas conditional on context

24 How to find out good lambdas? Use a held-out corpus: split the data into Training Data, Held-Out Data and Test Data. Choose the lambdas to maximize the probability of the held-out data.
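A minimal sketch of simple linear interpolation over trigram, bigram and unigram estimates; the lambdas and the toy probability tables are fixed here purely for illustration, whereas in practice the lambdas are tuned on the held-out data:

def interpolated_prob(w, w1, w2, uni_p, bi_p, tri_p, lambdas=(0.5, 0.3, 0.2)):
    # P_hat(w | w1 w2) = l1*P(w | w1 w2) + l2*P(w | w2) + l3*P(w), with l1+l2+l3 = 1
    l1, l2, l3 = lambdas
    return (l1 * tri_p.get((w1, w2, w), 0.0)
            + l2 * bi_p.get((w2, w), 0.0)
            + l3 * uni_p.get(w, 0.0))

# toy MLE tables (normally estimated from the training corpus)
uni_p = {"to": 0.1, "smooth": 0.01}
bi_p = {("like", "to"): 0.5}
tri_p = {("i", "like", "to"): 0.9}
print(interpolated_prob("to", "i", "like", uni_p, bi_p, tri_p))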

25 Unknown words: Open versus closed vocabulary tasks

26 Unknown words: Open versus closed vocabulary tasks Often the training set does not contain all the words we will encounter later. Out-Of-Vocabulary (OOV) words. Open vocabulary task.

27 Unknown words: Open versus closed vocabulary tasks Often the training set does not contain all the words we will encounter later. Out-Of-Vocabulary (OOV) words. Open vocabulary task. Solution: create an unknown word token <UNK>. In a normalization phase, replace some rare or unimportant words with <UNK>. Train on this data set. At testing time, use these <UNK> probabilities for real unseen words.
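A minimal sketch of the <UNK> normalization step; the count threshold of 2 is an illustrative choice, not prescribed by the slides:

from collections import Counter

def replace_rare_with_unk(tokens, min_count=2):
    # replace rare words with <UNK> so the model learns an explicit unknown-word probability
    counts = Counter(tokens)
    return [w if counts[w] >= min_count else "<UNK>" for w in tokens]

tokens = "i like to eat i like to smooth".split()
print(replace_rare_with_unk(tokens))  # 'eat' and 'smooth' become <UNK>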

28 Large-scale (web) data

29 Large-scale (web) data For example: the Google N-gram corpus. Pruning: only store N-grams with count > threshold (Google: > 40, for unigrams). Entropy-based pruning.

30 Smoothing for Web-scale N-grams Stupid Backoff (Brants et al. 2007): no discounting, just use relative frequencies. S(w_i | w_{i-k+1}..w_{i-1}) = count(w_{i-k+1}..w_i) / count(w_{i-k+1}..w_{i-1}) if count(w_{i-k+1}..w_i) > 0, otherwise 0.4 * S(w_i | w_{i-k+2}..w_{i-1}). At the unigram level: S(w_i) = count(w_i) / N.
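A recursive sketch of Stupid Backoff over a dictionary of n-gram counts (the count table and token total are toy values; 0.4 is the fixed back-off weight from Brants et al. 2007):

def stupid_backoff(word, context, counts, total_tokens, alpha=0.4):
    # S(w | context): relative frequency if the full n-gram was seen,
    # otherwise alpha times the score with a shortened context. Not a probability.
    if not context:
        return counts.get((word,), 0) / total_tokens
    ngram = context + (word,)
    if counts.get(ngram, 0) > 0:
        return counts[ngram] / counts[context]
    return alpha * stupid_backoff(word, context[1:], counts, total_tokens, alpha)

counts = {("i",): 2, ("like",): 2, ("to",): 2,
          ("i", "like"): 2, ("like", "to"): 2, ("i", "like", "to"): 2}
print(stupid_backoff("to", ("i", "like"), counts, total_tokens=6))
print(stupid_backoff("smooth", ("i", "like"), counts, total_tokens=6))  # backs off twice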

31 Today Language modeling with N-gram models Unseen N-grams and smoothing Interpolation and scaling Minimum edit distance Introduction Computation

32 Exercise of last week Write a function that takes as input two strings and returns their similarity (or distance), a real number in [0, 1].

33 Exercise of last week Write a function that takes as input two strings and returns their similarity (or distance), a real number in [0, 1].

// similarity metric based on the Hamming distance,
// so the 2 strings have to be of the same length
public static double similarity(String s1, String s2) {
    double distance = Double.MAX_VALUE;
    if (s1.length() == s2.length()) {
        int length = s1.length();
        distance = 0;
        // we add 1 to distance when 2 characters are different
        for (int i = 0; i < length; ++i) {
            if (s1.charAt(i) != s2.charAt(i)) {
                ++distance;
            }
        }
        // the final measure has to be between 0 and 1
        distance = distance / length;
        // now we have 0 <= distance <= 1
        distance = 1 - distance;
    } else {
        System.err.println("The 2 strings must be of equal length...");
    }
    return distance;
}

34 Exercise of last week

static int compareTwo(String a, String b) {
    System.out.println(a + " <> " + b);
    a = a.toLowerCase();
    b = b.toLowerCase();
    a = fixSounds(a);
    b = fixSounds(b);
    int as = a.length();
    int bs = b.length();
    int min = (as <= bs) ? as : bs;
    int res = 0;
    for (int i = 0; i < a.length(); ++i) {
        int pos = b.indexOf(a.charAt(i));
        int dif = (pos != -1) ? Math.abs(i - pos) : min;
        res += dif;
    }
    System.out.println(a + " <> " + b + " -> " + (double) res / min);
    return res;
}

35 Exercise of last week

def similar(firststr, secondstr):
    if len(firststr) != len(secondstr):
        return 0  # completely different
    else:
        # get the number of different characters
        count = getnoofdifferentcharacters(firststr, secondstr)
        if count == len(firststr):
            return 0  # totally different
        if count == 0:
            # completely similar, whether the letters are capital or small or the same combination
            return 1
        return getprobability(count, len(firststr))

def getnoofdifferentcharacters(firststr, secondstr):
    count = 0  # count how many characters differ from one string to the other
    index = 0
    firststr = sorted(firststr.lower())
    secondstr = sorted(secondstr.lower())
    for ch in firststr:
        # compare each character in the first string to the corresponding one in the second string
        if ch != secondstr[index]:
            count = count + 1
        index = index + 1
    return count

36 Exercise of last week

def similarity(str1, str2):
    len1 = len(str1)
    len2 = len(str2)
    if len1 == 0 or len2 == 0:
        return 0.0
    ssum = 0.0
    if len1 < len2:
        str1 += "_" * (len2 - len1)
        len1 = len2
    elif len2 < len1:
        str2 += "_" * (len1 - len2)
        len2 = len1
    for i in range(len1):
        if str1[i] == str2[i] and str1[i] != "_":
            ssum += 1.0
    return ssum / len1

37 Exercise of last week

def similarity_matching(str1, str2):
    len1 = len(str1)
    len2 = len(str2)
    if len1 == 0 or len2 == 0:
        return 0.0
    best = 0.0
    for i1 in range(len2):
        nstr1 = "_" * i1 + str1
        simil = similarity(nstr1, str2)
        if best < simil:
            best = simil
    for i2 in range(len1):
        nstr2 = "_" * i2 + str2
        simil = similarity(str1, nstr2)
        if best < simil:
            best = simil
    return best

38 Why string similarity Spell correction: graffe, which is closest: graf, graft, grail, or giraffe?

39 Why string similarity Spell correction: graffe, which is closest: graf, graft, grail, or giraffe? Computational biology: alignment of nucleotides
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
Resulting alignment:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

40 Minimum edit distance The minimum number of editing operations needed to transform one string into the other.

41 Minimum edit distance The minimum number of editing operations needed to transform one string into the other. Insertion Deletion Substitution

42 Minimum edit distance The minimum number of editing operations needed to transform one string into the other. Insertion Deletion Substitution

43 Minimum edit distance Two strings and their alignment

44 Minimum edit distance Two strings and their alignment

45 Minimum edit distance Two strings and their alignment If each operation has cost 1 then the distance is 5

46 Minimum edit distance Two strings and their alignment If each operation has cost 1 then the distance is 5 If substitutions cost 2 (Levenshtein) then the distance is 8

47 Other uses of edit distance in NLP Evaluating machine translation and speech recognition. R (reference): Spokesman confirms senior government adviser was shot. H (hypothesis): Spokesman said the senior adviser was shot dead. The hypothesis contains a substitution (S), an insertion (I), a deletion (D) and another insertion (I).

48 How to find minimum edit distance Searching for a path (a sequence of edits) from the start string to the final string.

49 How to find minimum edit distance Searching for a path (a sequence of edits) from the start string to the final string. Initial state: the word we're transforming. Goal state: the word we're trying to get to.

50 How to find minimum edit distance Searching for a path (a sequence of edits) from the start string to the final string. Initial state: the word we're transforming. Goal state: the word we're trying to get to. Operators: insert, delete, substitute. Path cost: the number of edits.

51 How to find minimum edit distance Searching for a path (a sequence of edits) from the start string to the final string. Initial state: the word we're transforming. Goal state: the word we're trying to get to. Operators: insert, delete, substitute, do nothing. Path cost: the number of edits.

52 Problem: the search space is huge! We can't afford to navigate naively. Lots of distinct paths wind up at the same state. As soon as we hit a duplicate state we can break off that branch.

53 Today Language modeling with N-gram models Unseen N-grams and smoothing Interpolation and scaling Minimum edit distance Introduction Computation

54 Recursive bottom-up computation notation D(Xn, Ym): the edit distance between string X of length n and string Y of length m.

55 Recursive bottom-up computation notation D(Xn, Ym): the edit distance between string X of length n and string Y of length m. # signifies the empty string. Special case: D(Xn, #) = n and D(#, Ym) = m.

56 Recursive bottom-up computation notation D(Xn, Ym): the edit distance between string X of length n and string Y of length m. # signifies the empty string. Special case: D(Xn, #) = n and D(#, Ym) = m. Solving problems by combining solutions to subproblems: for each i = 1..N, for each j = 1..M, D(i,j) = min of D(i-1,j) + 1, D(i,j-1) + 1, and D(i-1,j-1) + 2 if X(i) differs from Y(j) (+ 0 if X(i) = Y(j)).
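A direct Python rendering of this recurrence, with substitution cost 2 as on the earlier Levenshtein slide (an illustration, not the official course solution):

def min_edit_distance(source, target, sub_cost=2):
    # bottom-up dynamic programming over the table D[i][j] described above
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                      # D(i, #) = i deletions
    for j in range(1, m + 1):
        D[0][j] = j                      # D(#, j) = j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            D[i][j] = min(D[i - 1][j] + 1,                              # deletion
                          D[i][j - 1] + 1,                              # insertion
                          D[i - 1][j - 1] + (0 if same else sub_cost))  # substitution / match
    return D[n][m]

print(min_edit_distance("intention", "execution"))   # 8 with substitution cost 2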

57 Recursive bottom-up computation Build a matrix with the initial string (intention) down the left column and the goal string (execution) along the bottom row; the column for the empty string # is initialized to D(i, #) = i, i.e. 1 through 9.

58 Recursive bottom-up computation Build a matrix with the initial string (intention) down the left column and the goal string (execution) along the bottom row; the column for the empty string # is initialized to D(i, #) = i, i.e. 1 through 9.

59 Recursive bottom-up computation Build a matrix with the initial string (intention) down the left column and the goal string (execution) along the bottom row. Edit Distance table (the fully filled-in matrix for intention to execution).

60 Edit distance and alignment The filled-out matrix only gives us the edit distance, not the alignment.

61 Edit distance and alignment The filled-out matrix only gives us the edit distance, not the alignment. To reconstruct the alignment we keep a backtrace, which simply records the cell we came from.

62 Edit distance and alignment The filled-out matrix only gives us the edit distance, not the alignment. To reconstruct the alignment we keep a backtrace, which simply records the cell we came from.
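A sketch of the same dynamic program extended with backpointers, so the alignment can be read off by walking back from the bottom-right cell (again illustrative code, not the course's reference implementation):

def edit_distance_with_alignment(source, target, sub_cost=2):
    # fill the DP table and keep, for each cell, which neighbour it came from
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], back[i][0] = i, "del"
    for j in range(1, m + 1):
        D[0][j], back[0][j] = j, "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else sub_cost
            choices = [(D[i - 1][j] + 1, "del"),
                       (D[i][j - 1] + 1, "ins"),
                       (D[i - 1][j - 1] + cost, "sub")]
            D[i][j], back[i][j] = min(choices)
    # walk the backtrace to recover the aligned character pairs
    i, j, alignment = n, m, []
    while i > 0 or j > 0:
        op = back[i][j]
        if op == "del":
            alignment.append((source[i - 1], "-")); i -= 1
        elif op == "ins":
            alignment.append(("-", target[j - 1])); j -= 1
        else:
            alignment.append((source[i - 1], target[j - 1])); i -= 1; j -= 1
    return D[n][m], list(reversed(alignment))

print(edit_distance_with_alignment("intention", "execution"))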

63 Weighted minimum edit distance Why would we add weights to the computation? Spell correction: some letters are more likely to be mistyped than others. Biology: certain kinds of deletions or insertions are more likely than others.

64 Confusion matrix for spelling

65 Weighted minimum edit distance Initialization: D(i,#) = D(i-1,#) + del(x[i]); D(#,j) = D(#,j-1) + ins(y[j]).

66 Weighted minimum edit distance Initialization: D(i,#) = D(i-1,#) + del(x[i]); D(#,j) = D(#,j-1) + ins(y[j]). Recurrence: D(i,j) = min of D(i-1,j) + del[x(i)], D(i,j-1) + ins[y(j)], and D(i-1,j-1) + sub[x(i), y(j)].
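The same dynamic program with per-symbol cost functions plugged in; the uniform costs passed at the bottom are placeholders, since in practice del, ins and sub would come from something like the spelling confusion matrix:

def weighted_edit_distance(x, y, del_cost, ins_cost, sub_cost):
    # D(i,j) built with per-symbol deletion, insertion and substitution costs
    n, m = len(x), len(y)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost(x[i - 1])       # D(i,#) = D(i-1,#) + del(x[i])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost(y[j - 1])       # D(#,j) = D(#,j-1) + ins(y[j])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + del_cost(x[i - 1]),
                          D[i][j - 1] + ins_cost(y[j - 1]),
                          D[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]))
    return D[n][m]

# placeholder costs: every deletion/insertion costs 1, substitution 0 if equal else 2
print(weighted_edit_distance("intention", "execution",
                             del_cost=lambda a: 1, ins_cost=lambda b: 1,
                             sub_cost=lambda a, b: 0 if a == b else 2))   # 8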

67 Announcement and assignment Next week (March 8) there will be no class

68 Announcement and assignment Next week (March 8) there will be no class Write a program that can generate n-gram frequency lists from a tokenized corpus (use your own tokenizer first). A program that takes a (tokenized) corpus and an integer n and writes to file(s) all n-grams of size n (and optionally smaller).
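As a starting point, a minimal sketch of the counting part of the assignment, producing an n-gram frequency list from an already tokenized corpus (file handling and the tokenizer itself are left out; all names are illustrative):

from collections import Counter

def ngram_frequencies(tokens, n):
    # count all n-grams of size n in a list of tokens
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "i like to eat i like to sleep".split()
for ngram, count in ngram_frequencies(tokens, 2).most_common(3):
    print(" ".join(ngram), count)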

69 assignment (continued) Write a probability function P(word_sequence) that takes as input a sequence of words and returns (using your n-gram data) the probability for that sequence of words. Support the following:

70 assignment (continued) Write a probability function P(word_sequence) that takes as input a sequence of words and returns (using your n-gram data) the probability for that sequence of words. Support the following: Naive count and divide (chain rule but NO Markov)

71 assignment (continued) Write a probability function P(word_sequence) that takes as input a sequence of words and returns (using your n-gram data) the probability for that sequence of words. Support the following: Naive count and divide (chain rule but NO Markov) Max likelihood estimator using n-grams (from part one of the assignment)

72 assignment (continued) Write a probability function P(word_sequence) that takes as input a sequence of words and returns (using your n-gram data) the probability for that sequence of words. Support the following: Naive count and divide (chain rule but NO Markov) Max likelihood estimator using n-grams (from part one of the assignment) Additionally add support for: add-one and good-turing smoothing, Out of Vocabulary words

73 assignment (continued) Write a probability function P(word_sequence) that takes as input a sequence of words and returns (using your n-gram data) the probability for that sequence of words. Support the following: Naive count and divide (chain rule but NO Markov) Max likelihood estimator using n-grams (from part one of the assignment) Additionally add support for: add-one and good-turing smoothing, Out of Vocabulary words Optionally add support for: backoff and/or interpolation

74 assignment (continued) Write a probability function P(word_sequence) that takes as input a sequence of words and returns (using your n-gram data) the probability for that sequence of words. Support the following: Naive count and divide (chain rule but NO Markov) Max likelihood estimator using n-grams (from part one of the assignment) Additionally add support for: add-one and good-turing smoothing, Out of Vocabulary words Optionally add support for: backoff and/or interpolation Deadline code: Friday 15 March

75 assignment (continued) Write a short paper (about 4 pages) in which you explain n-gram language modeling and look at the impact of the different parameters of the probability function, or at the impact of different values for n. You may wish to write some extra code for this in order to evaluate the model. Adhere to all scientific standards of writing: provide an abstract, an introduction, your methodology, the results and a conclusion. Refer to the literature when appropriate. Deadline paper: Thursday 21 March

76 About deadlines Each student has 5 credit/joker days for the whole semester. These extra days can be used if you realize you cannot make a deadline. You have to inform me before the deadline itself about taking up x days.
