Algorithms for NLP. Language Modeling II. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley


1 Algorithms for NLP Language Modeling II Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley

2 Announcements
- You should be able to really start the project after today's lecture
- Get familiar with bit-twiddling in Java (e.g. &, |, <<, >>)
- No external libraries / code
- We will go over KN again in recitation: edge cases
- Tentative office hours: Me: Maria: Hieu: Akshay:

3 Language Models
- Language models are distributions over sentences
- N-gram models are built from local conditional probabilities
- The methods we've seen are backed by corpus n-gram counts

$\hat{P}(w_i \mid w_{i-1}, w_{i-2}) = \frac{c(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}$
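The count-ratio estimate on this slide can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name `TrigramMle`, the space-joined string keys, and the choice to return 0 for an unseen context are all hypothetical and not part of the course code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Maximum-likelihood trigram estimates from raw corpus counts (no smoothing). */
final class TrigramMle {
    // Keys are words joined by single spaces; an assumption made for brevity.
    private final Map<String, Integer> trigramCounts = new HashMap<>();
    private final Map<String, Integer> contextCounts = new HashMap<>();

    /** Counts every trigram in one tokenized sentence. */
    void observe(List<String> words) {
        for (int i = 2; i < words.size(); i++) {
            String context = words.get(i - 2) + " " + words.get(i - 1);
            trigramCounts.merge(context + " " + words.get(i), 1, Integer::sum);
            contextCounts.merge(context, 1, Integer::sum);
        }
    }

    /** Returns c(w2, w1, w) / c(w2, w1), or 0 if the context was never seen. */
    double probability(String w2, String w1, String w) {
        String context = w2 + " " + w1;
        int contextCount = contextCounts.getOrDefault(context, 0);
        if (contextCount == 0) {
            return 0.0;
        }
        return trigramCounts.getOrDefault(context + " " + w, 0) / (double) contextCount;
    }
}
```

Any trigram that never occurred gets probability zero here, which is the problem the smoothing methods address.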

4 Kneser-Ney Smoothing
- Kneser-Ney smoothing combines two ideas
  - Discount and reallocate like absolute discounting
  - In the backoff model, word probabilities are proportional to context fertility, not frequency: $P(w) \propto |\{w' : c(w', w) > 0\}|$
- Theory and practice
  - Practice: KN smoothing has been repeatedly proven both effective and efficient
  - Theory: KN smoothing as approximate inference in a hierarchical Pitman-Yor process [Teh, 2006]

5 Kneser-Ney Edge Cases
- All orders recursively discount and back off:

$P_k(w \mid \text{prev}_{k-1}) = \frac{\max(c'(\text{prev}_{k-1}, w) - d, 0)}{\sum_v c'(\text{prev}_{k-1}, v)} + \alpha(\text{prev}_{k-1}) \, P_{k-1}(w \mid \text{prev}_{k-2})$

- The unigram base case does not need to discount (though it can)
- Alpha is computed to make the probability normalize (but if the context count is zero, then fully back off)
- For the highest order, c' is the token count of the n-gram. For all others it is the context fertility of the n-gram (see Chen and Goodman p. 18): $c'(x) = |\{u : c(u, x) > 0\}|$
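As a rough sketch of these edge cases, here is the two-level (bigram over unigram) instance of the recursion in Java. It assumes a single discount d with 0 < d < 1, an undiscounted unigram base case built from context fertility, and a full back-off when the context count is zero. The class and method names are hypothetical, and a project-scale implementation would not use nested `HashMap`s.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Interpolated Kneser-Ney for bigrams only; an illustrative sketch. */
final class KneserNeyBigram {
    private final double d; // discount, assumed to satisfy 0 < d < 1
    private final Map<String, Map<String, Integer>> bigramCounts = new HashMap<>(); // prev -> (w -> c(prev, w))
    private final Map<String, Integer> contextTotals = new HashMap<>();             // prev -> sum_v c(prev, v)
    private final Map<String, Set<String>> leftContexts = new HashMap<>();          // w -> {u : c(u, w) > 0}
    private int bigramTypes = 0;                                                     // number of distinct bigrams

    KneserNeyBigram(double discount) {
        this.d = discount;
    }

    void observe(List<String> words) {
        for (int i = 1; i < words.size(); i++) {
            String prev = words.get(i - 1);
            String w = words.get(i);
            bigramCounts.computeIfAbsent(prev, k -> new HashMap<>()).merge(w, 1, Integer::sum);
            contextTotals.merge(prev, 1, Integer::sum);
            if (leftContexts.computeIfAbsent(w, k -> new HashSet<>()).add(prev)) {
                bigramTypes++;
            }
        }
    }

    /** Unigram base case: proportional to context fertility, not discounted. */
    double unigram(String w) {
        Set<String> contexts = leftContexts.get(w);
        return contexts == null ? 0.0 : contexts.size() / (double) bigramTypes;
    }

    /** Highest order: token counts, discounted, plus alpha times the lower order. */
    double probability(String prev, String w) {
        int total = contextTotals.getOrDefault(prev, 0);
        if (total == 0) {
            return unigram(w); // context never seen: fully back off
        }
        Map<String, Integer> followers = bigramCounts.get(prev);
        double discounted = Math.max(followers.getOrDefault(w, 0) - d, 0.0) / total;
        double alpha = d * followers.size() / total; // mass removed by discounting
        return discounted + alpha * unigram(w);
    }
}
```

Because alpha equals exactly the mass removed by discounting, the probabilities for a seen context sum to one over the words that have at least one left context. A word that only ever appears sentence-initially gets unigram probability zero in this sketch.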

6 Idea 4: Big Data
There's no data like more data.

7 Data >> Method?
- Having more data is better
- but so is using a better estimator
- Another issue: N > 3 has huge costs in speech recognizers
[Plot: entropy vs. n-gram order for Katz and KN at training sizes 100,000; 1,000,000; 10,000,000; and all]

8 Tons of Data? [Brants et al, 2007]

9 What about

10 Unknown Words?
- What about totally unseen words?
- Most LM applications are closed vocabulary
  - ASR systems will only propose words that are in their pronunciation dictionary
  - MT systems will only propose words that are in their phrase tables (modulo special models for numbers, etc.)
- In principle, one can build open vocabulary LMs
  - E.g. models over character sequences rather than word sequences
  - Back-off needs to go down into a "generate new word" model
  - Typically if you need this, a high-order character model will do

11 What's in an N-Gram?
- Just about every local correlation!
  - Word class restrictions: "will have been"
  - Morphology: "she", "they"
  - Semantic class restrictions: "danced the"
  - Idioms: "add insult to"
  - World knowledge: "ice caps have"
  - Pop culture: "the empire strikes"
- But not the long-distance ones
  - "The computer which I had just put into the machine room on the fifth floor."

12 What Actually Works?
- Trigrams and beyond:
  - Unigrams, bigrams generally useless
  - Trigrams much better
  - 4-, 5-grams and more are really useful in MT, but gains are more limited for speech
- Discounting
  - Absolute discounting, Good-Turing, held-out estimation, Witten-Bell, etc.
- Context counting
  - Kneser-Ney construction of lower-order models
- See [Chen+Goodman] reading for tons of graphs
[Graph from Joshua Goodman]


14 Linguistic Pain?
- The N-gram assumption hurts one's inner linguist!
  - Many linguistic arguments that language isn't regular
  - Long-distance dependencies
  - Recursive structure
- Answers
  - N-grams only model local correlations, but they get them all
  - As N increases, they catch even more correlations
  - N-gram models scale much more easily than structured LMs
- Not convinced?
  - Can build LMs out of our grammar models (later in the course)
  - Take any generative model with words at the bottom and marginalize out the other variables

15 What Gets Captured?
Bigram model:
- [texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen]
- [outside, new, car, parking, lot, of, the, agreement, reached]
- [this, would, be, a, record, november]
PCFG model:
- [This, quarter, 's, surprisingly, independent, attack, paid, off, the, risk, involving, IRS, leaders, and, transportation, prices, .]
- [It, could, be, announced, sometime, .]
- [Mr., Toseland, believes, the, average, defense, economy, is, drafted, from, slightly, more, than, 12, stocks, .]

16 Other Techniques?
Lots of other techniques:
- Maximum entropy LMs (soon)
- Neural network LMs (soon)
- Syntactic / grammar-structured LMs (much later)

17 How to Build an LM

18 Tons of Data Good LMs need lots of n-grams! [Brants et al, 2007]

19 Storing Counts
Key function: map from n-grams to counts
- searching for the best
- searching for the right
- searching for the cheapest
- searching for the perfect
- searching for the truth
- searching for the
- searching for the most
- searching for the latest
- searching for the next
- searching for the lowest
- searching for the name 8402
- searching for the finest 8171

20 Example: Google N-Grams

21 Efficient Storage

22 Naïve Approach
c(cat) = 12    hash(cat) = 2
c(the) = 87    hash(the) =
c(and) = 76    hash(and) = 5
c(dog) = 11    hash(dog) =
c(have) = ?    hash(have) = 2
[Figure: key/value table holding cat 12, the 87, and 76, and dog 11 (slot 7)]

23 A Simple Java Hashmap?
Per 3-gram:
- 1 Pointer = 8 bytes
- 1 Map.Entry = 8 bytes (obj) + 3x8 bytes (pointers)
- 1 Double = 8 bytes (obj) + 8 bytes (double)
- 1 String[] = 8 bytes (obj) + 3x8 bytes (pointers), and that is at best, if Strings are canonicalized
Total: > 88 bytes
Obvious alternatives:
- Sorted arrays
- Open addressing

24 Open Address Hashing
c(cat) = 12    hash(cat) = 2
c(the) = 87    hash(the) = 2
c(and) = 76    hash(and) = 5
c(dog) = 11    hash(dog) =
[Figure: empty key/value array]

25 Open Address Hashing
Same counts and hashes as above, plus the query c(have) = ?, hash(have) = 2
[Figure: key/value array with slots 0 to 7 holding cat, the, and, dog]

26 Open Address Hashing
Same counts and hashes as above
[Figure: larger key/value array with slots up to 15]

27 Efficient Hashing
- Closed address hashing
  - Resolve collisions with chains
  - Easier to understand but bigger
- Open address hashing
  - Resolve collisions with probe sequences
  - Smaller but easy to mess up
- Direct-address hashing
  - No collision resolution
  - Just eject previous entries
  - Not suitable for core LM storage
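To make the open-address option concrete, here is a minimal Java sketch using linear probing over primitive arrays. Several things are assumptions for illustration: keys are non-negative longs (for example packed n-grams), -1 marks an empty slot, the table never resizes, and the class name `OpenAddressCounts` and the mixing constant are arbitrary choices rather than anything prescribed by the slides.

```java
import java.util.Arrays;

/** Open-address hash map from non-negative long keys to int counts (linear probing). */
final class OpenAddressCounts {
    private static final long EMPTY = -1L; // sentinel; assumes real keys are >= 0
    private final long[] keys;
    private final int[] values;
    private int size = 0;

    OpenAddressCounts(int capacity) {
        if (capacity < 2) {
            throw new IllegalArgumentException("capacity must be at least 2");
        }
        keys = new long[capacity];
        values = new int[capacity];
        Arrays.fill(keys, EMPTY);
    }

    private static long mix(long key) {
        long h = key * 0x9E3779B97F4A7C15L; // arbitrary odd multiplier to spread bits
        return h ^ (h >>> 32);
    }

    /** Probes from the home slot until it finds the key or an empty slot. */
    private int slotFor(long key) {
        int slot = (int) Math.floorMod(mix(key), (long) keys.length);
        while (keys[slot] != EMPTY && keys[slot] != key) {
            slot = (slot + 1) % keys.length; // probe sequence: next slot, wrapping around
        }
        return slot;
    }

    void put(long key, int value) {
        int slot = slotFor(key);
        if (keys[slot] == EMPTY) {
            // Keep one slot empty so that probing always terminates.
            if (size + 1 >= keys.length) {
                throw new IllegalStateException("table is full");
            }
            keys[slot] = key;
            size++;
        }
        values[slot] = value;
    }

    /** Returns the stored count, or 0 if the key is absent. */
    int get(long key) {
        int slot = slotFor(key);
        return keys[slot] == EMPTY ? 0 : values[slot];
    }
}
```

Each entry costs 12 bytes (one long plus one int) with no per-entry objects, compared with the more than 88 bytes tallied for the `HashMap` version. The "easy to mess up" part shows up in details such as the always-one-empty-slot invariant.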


29 Integer Encodings
[Figure: the n-gram "the cat laughed" mapped to word ids, stored with its n-gram count 233]

30 Bit Packing
Got 3 numbers under 2^20 to store?
20 bits + 20 bits + 20 bits: fits in a primitive 64-bit long
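A small Java sketch of this packing, using the shift and mask operators mentioned in the announcements. The class name `TrigramPacking` and the field order (first word in the highest bits) are assumptions; the only requirement taken from the slide is that each id is below 2^20.

```java
/** Packs three word ids, each below 2^20, into one 64-bit long. */
final class TrigramPacking {
    private static final int BITS = 20;
    private static final long MASK = (1L << BITS) - 1; // twenty low bits set

    /** Layout (an assumption): id1 in bits 40-59, id2 in bits 20-39, id3 in bits 0-19. */
    static long pack(int id1, int id2, int id3) {
        return ((long) id1 << (2 * BITS)) | ((long) id2 << BITS) | (long) id3;
    }

    static int first(long packed) {
        return (int) ((packed >>> (2 * BITS)) & MASK);
    }

    static int second(long packed) {
        return (int) ((packed >>> BITS) & MASK);
    }

    static int third(long packed) {
        return (int) (packed & MASK);
    }
}
```

The casts to `long` before shifting matter: shifting an `int` left by 40 in Java only uses the low five bits of the shift distance, so the result would be wrong without them.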

31 Integer Encodings
[Figure: the word ids of "the cat laughed" combined into a single n-gram encoding, stored with its n-gram count 233]

32 Rank Values
[Figure: the count c(the) and the number of bits needed to represent integers in its range; each entry stores an n-gram encoding followed by a 35-bit count]

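33 Rank Values
[Figure: the number of unique counts and the number of bits needed to represent the ranks of all counts; each entry stores a 60-bit n-gram encoding followed by a 20-bit rank, with a separate table mapping rank to freq]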
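The rank idea can be sketched as follows: store a small rank next to each n-gram and keep one shared table from rank to actual count. In this Java sketch ranks are assigned in ascending order of count value, which is an assumption (the slides may order ranks differently), and the class name `CountRanks` is hypothetical.

```java
import java.util.Arrays;

/** Maps each distinct count to a small integer rank and back. */
final class CountRanks {
    private final long[] uniqueCounts; // index = rank, value = count, sorted ascending

    /** Assumes counts is non-empty. */
    CountRanks(long[] counts) {
        uniqueCounts = Arrays.stream(counts).sorted().distinct().toArray();
    }

    /** Rank to store next to the n-gram encoding. */
    int rankOf(long count) {
        int rank = Arrays.binarySearch(uniqueCounts, count);
        if (rank < 0) {
            throw new IllegalArgumentException("count not in table: " + count);
        }
        return rank;
    }

    /** Recovers the real count at query time. */
    long countOf(int rank) {
        return uniqueCounts[rank];
    }

    int numRanks() {
        return uniqueCounts.length;
    }

    /** Bits needed to store any rank in [0, numRanks). */
    int bitsPerRank() {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(uniqueCounts.length - 1L));
    }
}
```

The saving comes from the fact that there are far fewer distinct count values than n-grams, so a rank needs fewer bits than the largest raw count.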

34 So Far
- Word indexer
- N-gram encoding scheme
  - unigram: f(id) = id
  - bigram: f(id_1, id_2) = ?
  - trigram: f(id_1, id_2, id_3) = ?
- Count DB: unigram, bigram, trigram
- Rank lookup

35 Hashing vs Sorting

36 Context Tries

37 Tries

38 Context Encodings [Many details from Pauls and Klein, 2011]

39 Context Encodings

40 N-Gram Lookup

41 Compression

42 Idea: Differential Compression

43 Variable Length Encodings
Encoding: length in unary, then number in binary [Elias, 75]
Example: 9 is encoded as 000 1001
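The scheme from [Elias, 75] (length in unary, then the number in binary) can be sketched in Java with bit strings. Using a `String` of '0' and '1' characters is purely for readability and is an assumption of this sketch; a real compressor would write into a packed bit array. The class name `EliasGamma` is hypothetical.

```java
/** Elias gamma code for positive integers, shown as strings of '0' and '1'. */
final class EliasGamma {
    /** Writes (bit length - 1) zeros, then the number in binary. */
    static String encode(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        String binary = Integer.toBinaryString(n); // always starts with '1'
        StringBuilder out = new StringBuilder();
        for (int i = 1; i < binary.length(); i++) {
            out.append('0'); // unary length prefix
        }
        return out.append(binary).toString();
    }

    /** Decodes one code word from the start of bits. */
    static int decode(String bits) {
        int zeros = 0;
        while (bits.charAt(zeros) == '0') {
            zeros++; // the prefix says how many bits follow the leading '1'
        }
        return Integer.parseInt(bits.substring(zeros, 2 * zeros + 1), 2);
    }
}
```

For 9 this produces `0001001`, matching the slide's example of 000 followed by 1001. Small numbers get short codes, which pays off when the stored values are small differences between neighboring sorted keys.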

44 Speed-Ups

45 Rolling Queries

46 Idea: Fast Caching
LM can be more than 10x faster w/ direct-address caching
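A direct-address cache can be sketched as a fixed array placed in front of the slow lookup: each key has exactly one slot, and a collision simply ejects the previous entry. In the Java sketch below, the class name `DirectAddressCache`, the long-valued key (for example a packed n-gram), and the hashing constant are assumptions for illustration.

```java
import java.util.function.LongToDoubleFunction;

/** Direct-address cache: one slot per hash value, no probing, collisions evict. */
final class DirectAddressCache {
    private final long[] keys;
    private final double[] values;
    private final boolean[] occupied;

    DirectAddressCache(int capacity) {
        keys = new long[capacity];
        values = new double[capacity];
        occupied = new boolean[capacity];
    }

    /** Returns the cached value, or computes it with slowLookup and caches it. */
    double get(long key, LongToDoubleFunction slowLookup) {
        int slot = (int) Math.floorMod(key * 0x9E3779B97F4A7C15L, (long) keys.length);
        if (occupied[slot] && keys[slot] == key) {
            return values[slot]; // cache hit
        }
        double value = slowLookup.applyAsDouble(key); // miss: fall through to the real LM
        keys[slot] = key;       // eject whatever was stored here before
        values[slot] = value;
        occupied[slot] = true;
        return value;
    }
}
```

Because the full key is stored and compared, the cache never returns a wrong answer; a collision only costs a repeated slow lookup.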


47 Approximate LMs
- Simplest option: hash-and-hope
  - Array of size K ~ N
  - (optional) store hash of keys
  - Store values in direct-address
  - Collisions: store the max
  - What kind of errors can there be?
- More complex options, like bloom filters (originally for membership, but see Talbot and Osborne 07), perfect hashing, etc.
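A minimal Java sketch of hash-and-hope, without the optional stored key hashes: values live in a direct-address array and a collision keeps the larger value. The class name `HashAndHopeCounts`, the long-valued keys, and the hashing constant are assumptions made for illustration.

```java
/** Hash-and-hope count store: no keys are kept, collisions keep the max. */
final class HashAndHopeCounts {
    private final int[] values;

    /** k is the array size K; the slide suggests K on the order of the number of n-grams. */
    HashAndHopeCounts(int k) {
        values = new int[k];
    }

    private int slot(long key) {
        long h = key * 0x9E3779B97F4A7C15L;
        h ^= h >>> 32;
        return (int) Math.floorMod(h, (long) values.length);
    }

    void put(long key, int count) {
        int s = slot(key);
        values[s] = Math.max(values[s], count); // collision policy: store the max
    }

    /** May overestimate, and may return a nonzero count for a key never stored. */
    int get(long key) {
        return values[slot(key)];
    }
}
```

In this sketch a lookup can never return less than the stored count, but it can return more, and an unseen n-gram can inherit some other n-gram's count. Storing a few bits of a second hash per slot would reduce the second kind of error at some memory cost.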

48 Maximum Entropy Models

49 Improving on N-Grams?
- N-grams don't combine multiple sources of evidence well
  - $P(\text{construction} \mid \text{After the demolition was completed, the})$
- Here:
  - "the" gives syntactic constraint
  - "demolition" gives semantic constraint
  - Unlikely the interaction between these two has been densely observed
- We'd like a model that can be more statistically efficient

50 Maximum Entropy LMs
- Want a model over completions y given a context x: $P(y \mid x) = P(\text{close the door} \mid \text{close the})$
- Want to characterize the important aspects of y = (v, x) using a feature function f
- f might include
  - Indicator of v (unigram)
  - Indicator of v, previous word (bigram)
  - Indicator whether v occurs in x (cache)
  - Indicator of v and each non-adjacent previous word

51 Some Definitions
- INPUTS: close the
- CANDIDATE SET: {close the door, close the table, ...}
- CANDIDATES: close the table
- TRUE OUTPUTS: close the door
- FEATURE VECTORS:
  - "close" in x ∧ v = "door"
  - v_{-1} = "the" ∧ v = "door"
  - "door" in x and v

52 Linear Models: Maximum Entropy
- Maximum entropy (logistic regression)
  - Use the scores as probabilities: make positive, then normalize
  - Maximize the (log) conditional likelihood of training data
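The "make positive, normalize" recipe is usually written as exponentiating a linear score and dividing by the sum over candidates. The Java sketch below assumes that standard form, with the score being the dot product of a weight vector and each candidate's feature vector; the class and method names are hypothetical.

```java
/** Turns linear scores w . f(y) into probabilities over a candidate set. */
final class MaxEntProbabilities {
    /** Linear score of one candidate. */
    static double score(double[] weights, double[] features) {
        double s = 0.0;
        for (int n = 0; n < weights.length; n++) {
            s += weights[n] * features[n];
        }
        return s;
    }

    /** P(y | x) for every candidate y, given one feature vector per candidate. */
    static double[] probabilities(double[] weights, double[][] candidateFeatures) {
        int numCandidates = candidateFeatures.length;
        double[] scores = new double[numCandidates];
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < numCandidates; i++) {
            scores[i] = score(weights, candidateFeatures[i]);
            max = Math.max(max, scores[i]);
        }
        double[] probs = new double[numCandidates];
        double total = 0.0;
        for (int i = 0; i < numCandidates; i++) {
            probs[i] = Math.exp(scores[i] - max); // make positive (shifted for stability)
            total += probs[i];
        }
        for (int i = 0; i < numCandidates; i++) {
            probs[i] /= total; // normalize
        }
        return probs;
    }
}
```

Subtracting the maximum score before exponentiating does not change the result, since the shift cancels in the ratio, but it avoids overflow for large scores.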

53 Maximum Entropy II
- Motivation for maximum entropy:
  - Connection to maximum entropy principle (sort of)
  - Might want to do a good job of being uncertain on noisy cases
  - In practice, though, posteriors are pretty peaked
- Regularization (smoothing)

54 Derivative for Maximum Entropy
[Equation: derivative of the objective, annotated term by term]
- Big weights are bad
- Expected feature vector over possible candidates
- Total count of feature n in correct candidates
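The three annotations match the terms of the usual gradient of an L2-regularized conditional log-likelihood. The block below is a reconstruction under that assumption, not the slide's own equation: the regularization strength $\lambda$, the squared-norm penalty, and the notation $y_i^*$ for the correct candidate of example $i$ are all introduced here for illustration.

```latex
% Assumed objective: conditional log-likelihood minus an L2 penalty
L(w) = \sum_i \log P(y_i^* \mid x_i, w) - \lambda \lVert w \rVert^2

% Derivative with respect to weight w_n
\frac{\partial L(w)}{\partial w_n}
  = \underbrace{\sum_i f_n(y_i^*)}_{\text{total count of feature } n \text{ in correct candidates}}
  - \underbrace{\sum_i \sum_{y} P(y \mid x_i, w)\, f_n(y)}_{\text{expected feature count over possible candidates}}
  - \underbrace{2 \lambda w_n}_{\text{big weights are bad}}
```

At the optimum of the unregularized objective the first two terms cancel, so the model's expected feature counts match the observed ones.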

55 Convexity
- The maxent objective is nicely behaved:
  - Differentiable (so many ways to optimize)
  - Convex (so no local optima*)
[Figure: a convex function next to a non-convex function]
- Convexity guarantees a single, global maximum value because any higher points are greedily reachable

56 Unconstrained Optimization
- Once we have a function f, we can find a local optimum by iteratively following the gradient
- For convex functions, a local optimum will be global
- Basic gradient ascent isn't very efficient, but there are simple enhancements which take into account previous gradients: conjugate gradient, L-BFGS
- Online methods (e.g. AdaGrad) now very popular
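Basic gradient ascent, the inefficient baseline mentioned here, fits in a few lines of Java. This sketch assumes a fixed step size and a fixed number of iterations, and it takes the gradient as a function argument; the class name `GradientAscent` and its signature are hypothetical. It is not conjugate gradient, L-BFGS, or AdaGrad.

```java
import java.util.function.UnaryOperator;

/** Plain fixed-step gradient ascent. */
final class GradientAscent {
    /**
     * Repeatedly moves x a small step in the direction of the gradient.
     * gradient maps a point to the gradient vector at that point.
     */
    static double[] maximize(UnaryOperator<double[]> gradient, double[] start,
                             double stepSize, int iterations) {
        double[] x = start.clone();
        for (int t = 0; t < iterations; t++) {
            double[] g = gradient.apply(x);
            for (int j = 0; j < x.length; j++) {
                x[j] += stepSize * g[j]; // ascent: move uphill
            }
        }
        return x;
    }
}
```

For example, maximizing f(x) = -(x - 3)^2 means passing the gradient x -> -2(x - 3); with a step size of 0.1 the iterate moves toward 3, shrinking the remaining distance by a constant factor each step.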

57 Implicit Representation
