N-Gram Language Modelling including Feed-Forward NNs. Kenneth Heafield. University of Edinburgh

Size: px

Start display at page:

Download "N-Gram Language Modelling including Feed-Forward NNs. Kenneth Heafield. University of Edinburgh"

Emory Anderson
5 years ago
Views:

1 N-Gram Language Modelling including Feed-Forward NNs Kenneth Heafield University of Edinburgh

2 History of Language Model History Kenneth Heafield University of Edinburgh

3 3

4 p(type Predictive) > p(tyler Predictive) 4

5 p(lose Win or) p(loose Win or) [Church et al, 2007] 5

6 Heated indoor swimming pool 6

7 Chambre Bedroom 7

8 présidente de la Chambre des représentants chairwoman of the Bedroom of Representatives 8

9 présidente de la Chambre des représentants chairwoman of the House of Representatives 9

10 présidente de la Chambre des représentants chairwoman of the House of Representatives p(chairwoman of the House of Representatives) > p(chairwoman of the Bedroom of Representatives) 10

11 p(another one bites the dust.) > p(another one rides the bus.) 11

12 Prediction Spelling Translation Speech Essential Component: Language Model p(moses Compiles) =? 12

13 Sequence Models Chain Rule p(moses compiles)=p(moses)p(compiles Moses) 13

14 Sequence Model log p(iran <s> )= log p(is <s> iran )= log p(one <s> iran is )= log p(of <s> iran is one )= log p(the <s> iran is one of )= log p(few <s> iran is one of the )= log p(countries <s> iran is one of the few )= log p(. <s> iran is one of the few countries )= log p(</s> <s> iran is one of the few countries.)= = log p(<s> iran is one of the few countries. </s> )=

15 Sequence Model log p(iran <s> )= log p(is <s> iran )= log p(one <s> iran is )= log p(of <s> iran is one )= log p(the <s> iran is one of )= log p(few <s> iran is one of the )= log p(countries <s> iran is one of the few )= log p(. <s> iran is one of the few countries )= log p(</s> <s> iran is one of the few countries.)= = log p(<s> iran is one of the few countries. </s> )= Explicit begin and end of sentence. 15

16 Sequence Model log p(iran <s> )= log p(is <s> iran )= log p(one <s> iran is )= log p(of <s> iran is one )= log p(the <s> iran is one of )= log p(few <s> iran is one of the )= log p(countries <s> iran is one of the few )= log p(. <s> iran is one of the few countries )= log p(</s> <s> iran is one of the few countries.)= = log p(<s> iran is one of the few countries. </s> )= Where do these probabilities come from? 16

17 Probabilities from Text Model p(raw in the) 17

18 Estimating from Text help in the search for an answer. Copper burned in the raw wood. put forward in the paper Highs in the 50s to lower 60s.. = p(raw in the)

19 Estimating from Text help in the search for an answer. Copper burned in the raw wood. put forward in the paper Highs in the 50s to lower 60s.. = p(raw in the) 1 4 p(ugrasena in the) 0 19

20 Estimating from Text help in the search for an answer. Copper burned in the raw wood. put forward in the paper Highs in the 50s to lower 60s.. = p(raw in the) 1 6 p(ugrasena in the)

21 Problem in the Ugrasena was not seen, but could happen. count(in the Ugrasena) p(ugrasena in the) = count(in the) = 0? 21

22 Problem in the Ugrasena was not seen, but could happen. count(in the Ugrasena) p(ugrasena in the) = = 0? count(in the) count(the Ugrasena) = = count(the) 22

23 Problem in the Ugrasena was not seen, but could happen. count(in the Ugrasena) p(ugrasena in the) = = 0? count(in the) count(the Ugrasena) = = count(the) Stupid Backoff: Drop context until count is non-zero [Brants et al, 2007] Can we be less stupid? 23

24 Smoothing in the Ugrasena was not seen, but could happen. 1 Backoff: maybe the Ugrasena was seen? 2 Neural Networks: classifier predicts next word 24

25 Backoff Smoothing in the Ugrasena was not seen try the Ugrasena p(ugrasena in the) p(ugrasena the) 25

26 Backoff Smoothing in the Ugrasena was not seen try the Ugrasena p(ugrasena in the) p(ugrasena the) the Ugrasena was not seen try Ugrasena p(ugrasena the) p(ugrasena) 26

27 Backoff Smoothing in the Ugrasena was not seen try the Ugrasena p(ugrasena in the) = p(ugrasena the)b(in the) the Ugrasena was not seen try Ugrasena p(ugrasena the) = p(ugrasena)b(the) Backoff b is a penalty for not matching context. 27

28 Backoff Language Model Unigrams Words log p log b <s> 2.0 iran is one of Bigrams Words log p log b <s> iran iran is is one one of Trigrams Words log p <s> iran is 1.1 iran is one 2.0 is one of

29 Backoff Language Model Unigrams Words log p log b <s> 2.0 iran is one of Bigrams Words log p log b <s> iran iran is is one one of Query log p(is <s> iran) = 1.1 Trigrams Words log p <s> iran is 1.1 iran is one 2.0 is one of

30 Backoff Language Model Unigrams Words log p log b <s> 2.0 iran is one of Bigrams Words log p log b <s> iran iran is is one one of Query : p(of iran is) log p(of) 2.5 log b(is) 1.4 log b(iran is) log p(of iran is) = 4.3 Trigrams Words log p <s> iran is 1.1 iran is one 2.0 is one of

31 Close words matter more. Though long-distance matters: Grammatical structure Topical coherence Words tend to repeat Cross-sentence dependencies 31

32 Where do p and b come from? Text! Kneser-Ney Witten-Bell Good-Turing 32

33 Kneser-Ney Common high-quality smoothing 1 Adjust 2 Normalize 3 Interpolate 33

34 Adjusted counts are: Trigrams Count in the text. Others Number of unique words to the left. Lower orders are used when a trigram did not match. How freely does the text associate with new words? 34

35 Adjusted counts are: Trigrams Count in the text. Others Number of unique words to the left. Lower orders are used when a trigram did not match. How freely does the text associate with new words? Input Trigam Count are one of 1 is one of 5 are two of 3 Output 1-gram Adjusted of 2 Output 2-gram Adjusted one of 2 two of 1 35

36 Discounting and Normalization Save mass for unseen events pseudo(w n w n 1 1 ) = adjusted(w 1 n ) discount n (adjusted(w1 n )) n 1 adjusted(w1 x) x Normalize 36

37 Discounting and Normalization Save mass for unseen events pseudo(w n w n 1 1 ) = adjusted(w 1 n ) discount n (adjusted(w1 n )) n 1 adjusted(w1 x) x Normalize Input 3-gram Adjusted are one of 1 are one that 2 is one of 5 Output 3-gram Pseudo are one of 0.26 are one that 0.47 is one of

38 Interpolate: Sparsity vs. Specificity Interpolate unigrams with the uniform distribution. 1 p(of) = pseudo(of) + backoff(ɛ) vocabulary 38

39 Interpolate: Sparsity vs. Specificity Interpolate unigrams with the uniform distribution, 1 p(of) = pseudo(of) + backoff(ɛ) vocabulary Interpolate bigrams with unigrams, etc. p(of one) = pseudo(of one) + backoff(one)p(of) 39

40 Interpolate: Sparsity vs. Specificity Interpolate unigrams with the uniform distribution, 1 p(of) = pseudo(of) + backoff(ɛ) vocabulary Interpolate bigrams with unigrams, etc. p(of one) = pseudo(of one) + backoff(one)p(of) Input n-gram pseudo interpolation weight of 0.1 backoff( ɛ) = 0.1 one of 0.2 backoff( one) = 0.3 are one of 0.4 backoff(are one) = 0.2 Output n-gram p of one of are one of

41 Kneser-Ney Intuition Adjust Measure association with new words. Normalize Leave space for unseen events. Interpolate Handle sparsity. How do we implement it? 41

42 LM queries often account for more than 50% of the CPU [Green et al, WMT 2014] 500 billion entries in my largest model Need speed and memory efficiency 42

43 Counting n-grams <s> Australia is one of the few 5-gram Count <s> Australia is one of 1 Australia is one of the 1 is one of the few 1 Hash table? 43

44 Counting n-grams <s> Australia is one of the few 5-gram Count <s> Australia is one of 1 Australia is one of the 1 is one of the few 1 Hash table? Runs out of RAM. 44

45 Spill to Disk When RAM Runs Out Text Hash Table File 45

46 Split Data Text Hash Table Hash Table File File 46

47 Split and Merge Text Hash Table Hash Table Sort Sort File File Merge Sort 47

48 Training Problem: Batch process large number of records. Solution: Split/merge Stupid backoff in one pass Kneser-Ney in three passes 48

49 Training Problem: Batch process large number of records. Solution: Split/merge Stupid backoff in one pass Kneser-Ney in three passes Training is designed for mutable batch access. What about queries? 49

50 Query stupid(w n w n 1 1 ) = count(w1 n ) count(w n 1 if count(w1 n ) > 0 1 ) 0.4stupid(w n w n 1 2 ) if count(w1 n ) = 0 stupid(few is one of the) count(is one of the few) = 5 count(is one of the) = 12 50

51 Query stupid(w n w n 1 1 ) = count(w1 n ) count(w n 1 if count(w1 n ) > 0 1 ) 0.4stupid(w n w n 1 2 ) if count(w1 n ) = 0 stupid(periwinkle is one of the) count(is one of the periwinkle) = 0 count(one of the periwinkle) = 0 count(of the periwinkle) = 0 count(the periwinkle) = 3 count(the) =

52 Save Memory: Forget Keys Giant hash table with n-grams as keys and counts as values. Replace the n-grams with 64-bit hashes: Store hash(is one of) instead of is one of. Ignore collisions. 52

53 Save Memory: Forget Keys Giant hash table with n-grams as keys and counts as values. Replace the n-grams with 64-bit hashes: Store hash(is one of) instead of is one of. Ignore collisions. Birthday attack: 2 64 = = Low chance of collision until 4 billion entries. 53

54 Default Hash Table boost::unordered_map and gnu_cxx::hash_map n-grams Bucket array 54

55 Default Hash Table boost::unordered_map and gnu_cxx::hash_map n-grams Bucket array Lookup requires two random memory accesses. 55

56 Linear Probing Hash Table 1.5 buckets/entry (so buckets = 6). Ideal bucket = hash mod buckets. Resolve bucket collisions using the next free bucket. Bigrams Words Ideal Hash Count iran is 0 0x959e48455f4a2e90 3 0x0 0 is one 2 0x186a7caef34acf16 5 one of 2 0xac db8dac 2 <s> iran 4 0xf0ae9c2442c6920e 1 0x0 0 56

57 Minimal Perfect Hash Table Maps every n-gram to a unique integer [0, n grams ) Use these as array offsets. 57

58 Minimal Perfect Hash Table Maps every n-gram to a unique integer [0, n grams ) Use these as array offsets. Entries not in the model get assigned offsets Store a fingerprint of each n-gram 58

59 Minimal Perfect Hash Table Maps every n-gram to a unique integer [0, n grams ) Use these as array offsets. Low memory, but potential for false positives 59

60 Less Memory: Sorted Array Look up zebra in a dictionary. Binary search Open in the middle. O(log n) time. Interpolation search Open near the end. O(log log n) time. 60

61 100 Lookups/µs 10 1 probing hash_set unordered_set interpolation binary_search set Entries 61

62 Trie Reverse n-grams, arrange in a trie. is of one are <s> is Australia <s> is Australia one are are Australia <s> 62

63 Saving More RAM Quantization: store approximate values Collapse probability and backoff 63

64 Implementation Summary Implementation involves sparse mapping Hash table Probing hash table Minimal perfect hash table Sorted array with binary or interpolation search 64

65 Problem with Backoff: in the Ugrasena kingdom Ugrasena is unseen use unigram to predict kingdom p(kingdom in the Ugrasena) = p(kingdom) One rare word of context breaks conditioning. 65

66 More Conditioning Word class/cluster model Skip-gram: skip one or more context words Neural networks 66

67 More Conditioning Word class/cluster model Skip-gram: skip one or more context words Neural networks 67

68 Neural N-grams (aka Feed-Forward) Language modeling as classification: Input in the Ugrasena Output kingdom Predict (classify) the next word given preceding words. 68

69 Turning Words into Vectors in kingdom Ugrasena the Assign each word a unique row 69

70 Query: Concatenate Vectors in the Ugrasena Neural Network Prediction 70

71 Issue: Large Vocabulary The input has vocab history dimensions. = The input matrix has vocab history hidden parameters. Say vocab 1 billion. About 1 trillion parameters. 71

72 Dealing with Vocabulary Size Word vectors Map unpopular words to unknown or part of speech Don t use words: characters or short snippets 72

73 Turning Words into Vectors in kingdom Ugrasena the Vectors from the input matrix... or your favorite ACL paper. 73

74 Use Word Vectors Steal word representation from another task (language modeling, IR, SVD). Reduce from vocab one-hot to 100 dimensional dense input. Pro Other task might have more data Con Does not jointly learn word representation 74

75 Output vocabulary Output is still vocab -way classification. = Large output matrix, lots to normalize over. = Slow evaluation, overfitting. 75

76 Dealing with Vocabulary Size Word vectors Map unpopular words to unknown or part of speech Don t use words: characters or short snippets 76

77 Translation Modeling p(bedroom of the American, de la Chambre des représentants des États-Unis) [Devin et al 2014] 77

78 Conclusion Language models measure fluency. Neural networks and backoff are the dominant formalisms.... But you should probably use recurrent networks. 78

KenLM: Faster and Smaller Language Model Queries

KenLM: Faster and Smaller Language Model Queries Kenneth heafield@cs.cmu.edu Carnegie Mellon July 30, 2011 kheafield.com/code/kenlm What KenLM Does Answer language model queries using less time and memory.