Efficient Data Structures for Massive N-Gram Datasets

Size: px

Start display at page:

Download "Efficient Data Structures for Massive N-Gram Datasets"

Clara Flynn
5 years ago
Views:

Efficient ata Structures for Massive N-Gram atasets Giulio Ermanno Pibiri University of Pisa and ISTI-CNR Pisa, Italy giulio.pibiri@di.unipi.

1 Efficient ata Structures for Massive N-Gram atasets Giulio Ermanno Pibiri University of Pisa and ISTI-CNR Pisa, Italy Rossano Venturini University of Pisa and ISTI-CNR Pisa, Italy The 40-th CM SIGIR Conference on Research and evelopment in Information Retrieval Tokyo, Japan 10/08/2017 1

2 N-grams - Introduction Strings of N words. N typically ranges from 1 to 5. Extracted from text using a sliding window approach. 2

3 N-grams - Introduction Strings of N words. N typically ranges from 1 to 5. Extracted from text using a sliding window approach. 2

4 N-grams - Introduction Strings of N words. N typically ranges from 1 to 5. Extracted from text using a sliding window approach. ooks 6% of the books ever published 2

5 N-grams - Introduction Strings of N words. N typically ranges from 1 to 5. Extracted from text using a sliding window approach. ooks 6% of the books ever published N number of grams 1 24,359, ,284, ,397,041, ,644,807, ,415,355,596 More than 11 billion grams. 2

6 N-grams - Challenge Store massive N-grams datasets in compressed space such that given a pattern, we can return its value efficiently. 3

7 N-grams - Challenge Store massive N-grams datasets in compressed space such that given a pattern, we can return its value efficiently. N-Gram values 3

8 N-grams - Challenge Store massive N-grams datasets in compressed space such that given a pattern, we can return its value efficiently. N-Gram values frequency count (integer) 3

9 N-grams - Challenge Store massive N-grams datasets in compressed space such that given a pattern, we can return its value efficiently. N-Gram values frequency count (integer) probability/backoff weight (floating point) For backoff-interpolated models, such as Kneser-Ney. 3

10 N-grams - Challenge Store massive N-grams datasets in compressed space such that given a pattern, we can return its value efficiently. N-Gram values frequency count (integer) probability/backoff weight (floating point) For backoff-interpolated models, such as Kneser-Ney. Efficient map 3

11 N-grams - Challenge Store massive N-grams datasets in compressed space such that given a pattern, we can return its value efficiently. N-Gram values frequency count (integer) probability/backoff weight (floating point) For backoff-interpolated models, such as Kneser-Ney. Efficient map hash + time - space 3

12 N-grams - Challenge Store massive N-grams datasets in compressed space such that given a pattern, we can return its value efficiently. N-Gram values frequency count (integer) probability/backoff weight (floating point) For backoff-interpolated models, such as Kneser-Ney. Efficient map hash trie + time - space + space - time 3

13 N-grams - Challenge Store massive N-grams datasets in compressed space such that given a pattern, we can return its value efficiently. N-Gram values frequency count (integer) probability/backoff weight (floating point) For backoff-interpolated models, such as Kneser-Ney. Efficient map hash trie + time - space + space - time ctive field of research Many software libraries KenLM [Heafield, WMT 2011] erkeleylm [Pauls and Klein, CL 2011] ExpGram [Watanabe at el., IJCNLP 2009] IRSTLM [Federico et al., CL 2008] RandLM [Talbot and Osborne, CL 2007] SRILM [Stolcke, INTERSPEECH 2002] 3

14 Trie Indexing 4

15 4 C C C C Trie Indexing

16 4 C C C C Trie Indexing

17 4 C C C C Trie Indexing C C C C

18 Trie Indexing C C C hash vocabulary 0 1 C 2 3 C C C C C 4

19 Trie Indexing C C C hash vocabulary 0 1 C C2 3 C C2 0 C C

20 Trie Indexing C C C hash vocabulary 0 1 C C2 3 C C2 0 C C

21 Trie Indexing C C C hash vocabulary 0 1 C C2 3 C C2 0 C C

22 Trie Indexing C C C hash vocabulary 0 1 C C2 3 C C2 0 C C

23 Trie Indexing C 0 We 3 need 5 an 5 encoder for integer sequences, supporting fast random ccess. C C hash vocabulary 0 1 C 2 3 C 0 1 C C2 0 C C

24 Trie Indexing C 0 We 3 need 5 an 5 encoder for integer sequences, supporting fast random ccess. C C hash vocabulary 0 1 C 2 3 Take range-wise prefix sums on gram-i sequences. C 0 1 C C2 0 C C

25 Trie Indexing C 0 We 3 need 5 an 5 encoder for integer sequences, supporting fast random ccess. C C hash vocabulary 0 1 C 2 3 Take range-wise prefix sums on gram-i sequences. C 0 1 C C C C

26 Trie Indexing C 0 We 3 need 5 an 5 encoder for integer sequences, supporting fast random ccess. C C hash vocabulary 0 1 C 2 3 Take range-wise prefix sums on gram-i sequences. C Elias-Fano Tries One NextGEQ per level Constant-time random ccess 0 1 C C C C

27 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). 5

28 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). 5

29 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). k = 1 5

30 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). k = 1 5

31 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). k = 1 5

32 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). k = 1 5

33 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). k = 1 5

34 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). Millions of unigrams. Height 5: longer contexts. k = 1 The number of siblings has a funnel-shaped distribution. 5

35 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). Millions of unigrams. Height 5: longer contexts. k = 1 The number of siblings has a funnel-shaped distribution

36 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). Millions of unigrams. Height 5: longer contexts. u/n by varying context-length k k = 1 The number of siblings has a funnel-shaped distribution

37 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). Millions of unigrams. Height 5: longer contexts. u/n by varying context-length k k = 1 The number of siblings has a funnel-shaped distribution

38 Experimental nalysis - EF/PEF (R)Trie Test machine Intel Xeon E v3, 2.4 GHz 193 G of RM, Linux 64 bits C++ implementation gcc with the highest optimization setting 6

39 Experimental nalysis - EF/PEF (R)Trie Test machine Intel Xeon E v3, 2.4 GHz 193 G of RM, Linux 64 bits C++ implementation gcc with the highest optimization setting 6

40 Experimental nalysis - EF/PEF (R)Trie Test machine Intel Xeon E v3, 2.4 GHz 193 G of RM, Linux 64 bits C++ implementation gcc with the highest optimization setting 6

41 Experimental nalysis - EF/PEF (R)Trie Test machine Intel Xeon E v3, 2.4 GHz 193 G of RM, Linux 64 bits C++ implementation gcc with the highest optimization setting 6

42 Experimental nalysis - EF/PEF (R)Trie Test machine Intel Xeon E v3, 2.4 GHz 193 G of RM, Linux 64 bits C++ implementation gcc with the highest optimization setting Context-based I Remapping reduces space by more than 36% on average you will notice this! 6

43 Experimental nalysis - EF/PEF (R)Trie Test machine Intel Xeon E v3, 2.4 GHz 193 G of RM, Linux 64 bits C++ implementation gcc with the highest optimization setting Context-based I Remapping reduces space by more than 36% on average you will notice this! 6

Experimental nalysis - EF/PEF (R)Trie Test machine Intel Xeon E5-2630 v3, 2.4

44 Experimental nalysis - EF/PEF (R)Trie Test machine Intel Xeon E v3, 2.4 GHz 193 G of RM, Linux 64 bits C++ implementation gcc with the highest optimization setting Context-based I Remapping reduces space by more than 36% on average brings approximately 30% more time you will notice this! will you notice this? 6

45 Experimental nalysis - Overall comparison 7

46 Experimental nalysis - Overall comparison 7

47 Experimental nalysis - Overall comparison 2.3X 2.5X 7

48 Experimental nalysis - Overall comparison 2.3X 2.5X 7

49 Experimental nalysis - Overall comparison X 2X 2X 2X X 2X 2.3X 2.5X 3.5X 5.5X 2.8X 2.7X 2.5X 2.5X 3X 7

50 Experimental nalysis - Overall comparison X 2X 2X 2X X 2X 2.3X 2.5X 3.5X 5.5X 2.8X 2.7X 2.5X 2.5X 3X 7

51 Experimental nalysis - Overall comparison X 2X 2X 2X X 2X 2.3X 2.5X 3.5X 5.5X 2.8X 2.7X 2.5X 2.5X 3X Elias-Fano Tries substantially outperform LL previous solutions in both space and time. s fast as the state-of-the-art (KenLM) but more than twice smaller. 7

52 Experimental nalysis - Perplexity Reversed Elias-Fano Tries Probabilities and backoffs are quantized (binning method) using any number of bits from 2 to 32 Stateful scoring function Elias-Fano Tries substantially outperform LL previous solutions in both space and time. s fast as the state-of-the-art (KenLM) but up to 65% more space-efficient. 8

53 Experimental nalysis - Perplexity Reversed Elias-Fano Tries Probabilities and backoffs are quantized (binning method) using any number of bits from 2 to 32 Stateful scoring function Elias-Fano Tries substantially outperform LL previous solutions in both space and time. s fast as the state-of-the-art (KenLM) but up to 65% more space-efficient. 8

54 Experimental nalysis - Perplexity Reversed Elias-Fano Tries Probabilities and backoffs are quantized (binning method) using any number of bits from 2 to 32 Stateful scoring function +60% +65% Elias-Fano Tries substantially outperform LL previous solutions in both space and time. s fast as the state-of-the-art (KenLM) but up to 65% more space-efficient. 8

55 Experimental nalysis - Perplexity Reversed Elias-Fano Tries Probabilities and backoffs are quantized (binning method) using any number of bits from 2 to 32 Stateful scoring function +60% +65% Elias-Fano Tries substantially outperform LL previous solutions in both space and time. s fast as the state-of-the-art (KenLM) but up to 65% more space-efficient. 8

56 Experimental nalysis - Perplexity Reversed Elias-Fano Tries Probabilities and backoffs are quantized (binning method) using any number of bits from 2 to 32 Stateful scoring function 2X 4X 2X 2.7X 3X 3.3X 2.5X 2X 4X +25% 3.5X +40% 15X +80% 36X +60% +65% +30% 25X +20% 16X Elias-Fano Tries substantially outperform LL previous solutions in both space and time. s fast as the state-of-the-art (KenLM) but up to 65% more space-efficient. 8

57 9

58 9

59 On-going work Parallel and scalable estimation of Kneser-Ney language models Python wrapper, installable through pip utility 9

60 On-going work Parallel and scalable estimation of Kneser-Ney language models Python wrapper, installable through pip utility Future work Optimal I-assignment for Elias-Fano? (NP-hard problem) Make queries (especially perplexity) even faster 9

61 Proudly supported by a Student Travel Grant Thanks for your attention, time, patience! ny questions? 10

Efficient Data Structures for Massive N -Gram Datasets

Efficient Data Structures for Massive N -Gram Datasets Giulio Ermanno Pibiri Dept. of Computer Science, University of Pisa, Italy giulio.pibiri@di.unipi.it ABSTRACT The efficient indexing of large and