Efficient Data Structures for Massive N-Gram Datasets

Size: px
Start display at page:

Download "Efficient Data Structures for Massive N-Gram Datasets"

Transcription

1 Efficient ata Structures for Massive N-Gram atasets Giulio Ermanno Pibiri University of Pisa and ISTI-CNR Pisa, Italy Rossano Venturini University of Pisa and ISTI-CNR Pisa, Italy The 40-th CM SIGIR Conference on Research and evelopment in Information Retrieval Tokyo, Japan 10/08/2017 1

2 N-grams - Introduction Strings of N words. N typically ranges from 1 to 5. Extracted from text using a sliding window approach. 2

3 N-grams - Introduction Strings of N words. N typically ranges from 1 to 5. Extracted from text using a sliding window approach. 2

4 N-grams - Introduction Strings of N words. N typically ranges from 1 to 5. Extracted from text using a sliding window approach. ooks 6% of the books ever published 2

5 N-grams - Introduction Strings of N words. N typically ranges from 1 to 5. Extracted from text using a sliding window approach. ooks 6% of the books ever published N number of grams 1 24,359, ,284, ,397,041, ,644,807, ,415,355,596 More than 11 billion grams. 2

6 N-grams - Challenge Store massive N-grams datasets in compressed space such that given a pattern, we can return its value efficiently. 3

7 N-grams - Challenge Store massive N-grams datasets in compressed space such that given a pattern, we can return its value efficiently. N-Gram values 3

8 N-grams - Challenge Store massive N-grams datasets in compressed space such that given a pattern, we can return its value efficiently. N-Gram values frequency count (integer) 3

9 N-grams - Challenge Store massive N-grams datasets in compressed space such that given a pattern, we can return its value efficiently. N-Gram values frequency count (integer) probability/backoff weight (floating point) For backoff-interpolated models, such as Kneser-Ney. 3

10 N-grams - Challenge Store massive N-grams datasets in compressed space such that given a pattern, we can return its value efficiently. N-Gram values frequency count (integer) probability/backoff weight (floating point) For backoff-interpolated models, such as Kneser-Ney. Efficient map 3

11 N-grams - Challenge Store massive N-grams datasets in compressed space such that given a pattern, we can return its value efficiently. N-Gram values frequency count (integer) probability/backoff weight (floating point) For backoff-interpolated models, such as Kneser-Ney. Efficient map hash + time - space 3

12 N-grams - Challenge Store massive N-grams datasets in compressed space such that given a pattern, we can return its value efficiently. N-Gram values frequency count (integer) probability/backoff weight (floating point) For backoff-interpolated models, such as Kneser-Ney. Efficient map hash trie + time - space + space - time 3

13 N-grams - Challenge Store massive N-grams datasets in compressed space such that given a pattern, we can return its value efficiently. N-Gram values frequency count (integer) probability/backoff weight (floating point) For backoff-interpolated models, such as Kneser-Ney. Efficient map hash trie + time - space + space - time ctive field of research Many software libraries KenLM [Heafield, WMT 2011] erkeleylm [Pauls and Klein, CL 2011] ExpGram [Watanabe at el., IJCNLP 2009] IRSTLM [Federico et al., CL 2008] RandLM [Talbot and Osborne, CL 2007] SRILM [Stolcke, INTERSPEECH 2002] 3

14 Trie Indexing 4

15 4 C C C C Trie Indexing

16 4 C C C C Trie Indexing

17 4 C C C C Trie Indexing C C C C

18 Trie Indexing C C C hash vocabulary 0 1 C 2 3 C C C C C 4

19 Trie Indexing C C C hash vocabulary 0 1 C C2 3 C C2 0 C C

20 Trie Indexing C C C hash vocabulary 0 1 C C2 3 C C2 0 C C

21 Trie Indexing C C C hash vocabulary 0 1 C C2 3 C C2 0 C C

22 Trie Indexing C C C hash vocabulary 0 1 C C2 3 C C2 0 C C

23 Trie Indexing C 0 We 3 need 5 an 5 encoder for integer sequences, supporting fast random ccess. C C hash vocabulary 0 1 C 2 3 C 0 1 C C2 0 C C

24 Trie Indexing C 0 We 3 need 5 an 5 encoder for integer sequences, supporting fast random ccess. C C hash vocabulary 0 1 C 2 3 Take range-wise prefix sums on gram-i sequences. C 0 1 C C2 0 C C

25 Trie Indexing C 0 We 3 need 5 an 5 encoder for integer sequences, supporting fast random ccess. C C hash vocabulary 0 1 C 2 3 Take range-wise prefix sums on gram-i sequences. C 0 1 C C C C

26 Trie Indexing C 0 We 3 need 5 an 5 encoder for integer sequences, supporting fast random ccess. C C hash vocabulary 0 1 C 2 3 Take range-wise prefix sums on gram-i sequences. C Elias-Fano Tries One NextGEQ per level Constant-time random ccess 0 1 C C C C

27 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). 5

28 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). 5

29 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). k = 1 5

30 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). k = 1 5

31 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). k = 1 5

32 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). k = 1 5

33 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). k = 1 5

34 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). Millions of unigrams. Height 5: longer contexts. k = 1 The number of siblings has a funnel-shaped distribution. 5

35 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). Millions of unigrams. Height 5: longer contexts. k = 1 The number of siblings has a funnel-shaped distribution

36 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). Millions of unigrams. Height 5: longer contexts. u/n by varying context-length k k = 1 The number of siblings has a funnel-shaped distribution

37 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). Millions of unigrams. Height 5: longer contexts. u/n by varying context-length k k = 1 The number of siblings has a funnel-shaped distribution

38 Experimental nalysis - EF/PEF (R)Trie Test machine Intel Xeon E v3, 2.4 GHz 193 G of RM, Linux 64 bits C++ implementation gcc with the highest optimization setting 6

39 Experimental nalysis - EF/PEF (R)Trie Test machine Intel Xeon E v3, 2.4 GHz 193 G of RM, Linux 64 bits C++ implementation gcc with the highest optimization setting 6

40 Experimental nalysis - EF/PEF (R)Trie Test machine Intel Xeon E v3, 2.4 GHz 193 G of RM, Linux 64 bits C++ implementation gcc with the highest optimization setting 6

41 Experimental nalysis - EF/PEF (R)Trie Test machine Intel Xeon E v3, 2.4 GHz 193 G of RM, Linux 64 bits C++ implementation gcc with the highest optimization setting 6

42 Experimental nalysis - EF/PEF (R)Trie Test machine Intel Xeon E v3, 2.4 GHz 193 G of RM, Linux 64 bits C++ implementation gcc with the highest optimization setting Context-based I Remapping reduces space by more than 36% on average you will notice this! 6

43 Experimental nalysis - EF/PEF (R)Trie Test machine Intel Xeon E v3, 2.4 GHz 193 G of RM, Linux 64 bits C++ implementation gcc with the highest optimization setting Context-based I Remapping reduces space by more than 36% on average you will notice this! 6

44 Experimental nalysis - EF/PEF (R)Trie Test machine Intel Xeon E v3, 2.4 GHz 193 G of RM, Linux 64 bits C++ implementation gcc with the highest optimization setting Context-based I Remapping reduces space by more than 36% on average brings approximately 30% more time you will notice this! will you notice this? 6

45 Experimental nalysis - Overall comparison 7

46 Experimental nalysis - Overall comparison 7

47 Experimental nalysis - Overall comparison 2.3X 2.5X 7

48 Experimental nalysis - Overall comparison 2.3X 2.5X 7

49 Experimental nalysis - Overall comparison X 2X 2X 2X X 2X 2.3X 2.5X 3.5X 5.5X 2.8X 2.7X 2.5X 2.5X 3X 7

50 Experimental nalysis - Overall comparison X 2X 2X 2X X 2X 2.3X 2.5X 3.5X 5.5X 2.8X 2.7X 2.5X 2.5X 3X 7

51 Experimental nalysis - Overall comparison X 2X 2X 2X X 2X 2.3X 2.5X 3.5X 5.5X 2.8X 2.7X 2.5X 2.5X 3X Elias-Fano Tries substantially outperform LL previous solutions in both space and time. s fast as the state-of-the-art (KenLM) but more than twice smaller. 7

52 Experimental nalysis - Perplexity Reversed Elias-Fano Tries Probabilities and backoffs are quantized (binning method) using any number of bits from 2 to 32 Stateful scoring function Elias-Fano Tries substantially outperform LL previous solutions in both space and time. s fast as the state-of-the-art (KenLM) but up to 65% more space-efficient. 8

53 Experimental nalysis - Perplexity Reversed Elias-Fano Tries Probabilities and backoffs are quantized (binning method) using any number of bits from 2 to 32 Stateful scoring function Elias-Fano Tries substantially outperform LL previous solutions in both space and time. s fast as the state-of-the-art (KenLM) but up to 65% more space-efficient. 8

54 Experimental nalysis - Perplexity Reversed Elias-Fano Tries Probabilities and backoffs are quantized (binning method) using any number of bits from 2 to 32 Stateful scoring function +60% +65% Elias-Fano Tries substantially outperform LL previous solutions in both space and time. s fast as the state-of-the-art (KenLM) but up to 65% more space-efficient. 8

55 Experimental nalysis - Perplexity Reversed Elias-Fano Tries Probabilities and backoffs are quantized (binning method) using any number of bits from 2 to 32 Stateful scoring function +60% +65% Elias-Fano Tries substantially outperform LL previous solutions in both space and time. s fast as the state-of-the-art (KenLM) but up to 65% more space-efficient. 8

56 Experimental nalysis - Perplexity Reversed Elias-Fano Tries Probabilities and backoffs are quantized (binning method) using any number of bits from 2 to 32 Stateful scoring function 2X 4X 2X 2.7X 3X 3.3X 2.5X 2X 4X +25% 3.5X +40% 15X +80% 36X +60% +65% +30% 25X +20% 16X Elias-Fano Tries substantially outperform LL previous solutions in both space and time. s fast as the state-of-the-art (KenLM) but up to 65% more space-efficient. 8

57 9

58 9

59 On-going work Parallel and scalable estimation of Kneser-Ney language models Python wrapper, installable through pip utility 9

60 On-going work Parallel and scalable estimation of Kneser-Ney language models Python wrapper, installable through pip utility Future work Optimal I-assignment for Elias-Fano? (NP-hard problem) Make queries (especially perplexity) even faster 9

61 Proudly supported by a Student Travel Grant Thanks for your attention, time, patience! ny questions? 10

Efficient Data Structures for Massive N -Gram Datasets

Efficient Data Structures for Massive N -Gram Datasets Efficient Data Structures for Massive N -Gram Datasets Giulio Ermanno Pibiri Dept. of Computer Science, University of Pisa, Italy giulio.pibiri@di.unipi.it ABSTRACT The efficient indexing of large and

More information

arxiv: v1 [cs.ir] 25 Jun 2018

arxiv: v1 [cs.ir] 25 Jun 2018 Handling Massive N -Gram Datasets Efficiently ariv:806.09447v [cs.ir] 25 Jun 208 GIULIO ERMNNO PIBIRI and ROSSNO VENTURINI, University of Pisa and ISTI-NR, Italy bstract. This paper deals with the two

More information

KenLM: Faster and Smaller Language Model Queries

KenLM: Faster and Smaller Language Model Queries KenLM: Faster and Smaller Language Model Queries Kenneth heafield@cs.cmu.edu Carnegie Mellon July 30, 2011 kheafield.com/code/kenlm What KenLM Does Answer language model queries using less time and memory.

More information

Faster and Smaller N-Gram Language Models

Faster and Smaller N-Gram Language Models Faster and Smaller N-Gram Language Models Adam Pauls Dan Klein Computer Science Division University of California, Berkeley {adpauls,klein}@csberkeleyedu Abstract N-gram language models are a major resource

More information

N-gram language models for massively parallel devices

N-gram language models for massively parallel devices N-gram language models for massively parallel devices Nikolay Bogoychev and Adam Lopez University of Edinburgh Edinburgh, United Kingdom Abstract For many applications, the query speed of N-gram language

More information

Natural Language Processing

Natural Language Processing Natural Language Processing Language Models Language models are distributions over sentences N gram models are built from local conditional probabilities Language Modeling II Dan Klein UC Berkeley, The

More information

CS 288: Statistical NLP Assignment 1: Language Modeling

CS 288: Statistical NLP Assignment 1: Language Modeling CS 288: Statistical NLP Assignment 1: Language Modeling Due September 12, 2014 Collaboration Policy You are allowed to discuss the assignment with other students and collaborate on developing algorithms

More information

N-Gram Language Modelling including Feed-Forward NNs. Kenneth Heafield. University of Edinburgh

N-Gram Language Modelling including Feed-Forward NNs. Kenneth Heafield. University of Edinburgh N-Gram Language Modelling including Feed-Forward NNs Kenneth Heafield University of Edinburgh History of Language Model History Kenneth Heafield University of Edinburgh 3 p(type Predictive) > p(tyler Predictive)

More information

arxiv: v1 [cs.ir] 29 Apr 2018

arxiv: v1 [cs.ir] 29 Apr 2018 Variable-Byte Encoding Is Now Space-Efficient Too Giulio Ermanno Pibiri University of Pisa and ISTI-CNR Pisa, Italy giulio.pibiri@di.unipi.it Rossano Venturini University of Pisa and ISTI-CNR Pisa, Italy

More information

Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval

Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval David Guthrie Computer Science Department University of Sheffield D.Guthrie@dcs.shef.ac.uk Mark Hepple Computer Science

More information

n-gram A Large-scale N-gram Search System Using Inverted Files Susumu Yata, 1, 2 Kazuhiro Morita, 1 Masao Fuketa 1 and Jun-ichi Aoe 1 TSUBAKI 2)

n-gram A Large-scale N-gram Search System Using Inverted Files Susumu Yata, 1, 2 Kazuhiro Morita, 1 Masao Fuketa 1 and Jun-ichi Aoe 1 TSUBAKI 2) n-gram 1, 2 1 1 1 n-gram n-gram A Large-scale N-gram Search System Using Inverted Files Susumu Yata, 1, 2 Kazuhiro Morita, 1 Masao Fuketa 1 and Jun-ichi Aoe 1 Recently, a search system has been proposed

More information

Smoothing. BM1: Advanced Natural Language Processing. University of Potsdam. Tatjana Scheffler

Smoothing. BM1: Advanced Natural Language Processing. University of Potsdam. Tatjana Scheffler Smoothing BM1: Advanced Natural Language Processing University of Potsdam Tatjana Scheffler tatjana.scheffler@uni-potsdam.de November 1, 2016 Last Week Language model: P(Xt = wt X1 = w1,...,xt-1 = wt-1)

More information

Algorithms for NLP. Language Modeling II. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley

Algorithms for NLP. Language Modeling II. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley Algorithms for NLP Language Modeling II Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley Announcements Should be able to really start project after today s lecture Get familiar with bit-twiddling

More information

Outline GIZA++ Moses. Demo. Steps Output files. Training pipeline Decoder

Outline GIZA++ Moses. Demo. Steps Output files. Training pipeline Decoder GIZA++ and Moses Outline GIZA++ Steps Output files Moses Training pipeline Decoder Demo GIZA++ A statistical machine translation toolkit used to train IBM Models 1-5 (moses only uses output of IBM Model-1)

More information

Inverted Index Compression

Inverted Index Compression Inverted Index Compression Giulio Ermanno Pibiri and Rossano Venturini Department of Computer Science, University of Pisa, Italy giulio.pibiri@di.unipi.it and rossano.venturini@unipi.it Abstract The data

More information

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze Indexing UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides Addison Wesley, 2008 Table of Content Inverted index with positional information

More information

Improved Address-Calculation Coding of Integer Arrays

Improved Address-Calculation Coding of Integer Arrays Improved Address-Calculation Coding of Integer Arrays Jyrki Katajainen 1,2 Amr Elmasry 3, Jukka Teuhola 4 1 University of Copenhagen 2 Jyrki Katajainen and Company 3 Alexandria University 4 University

More information

Natural Language Processing with Deep Learning CS224N/Ling284

Natural Language Processing with Deep Learning CS224N/Ling284 Natural Language Processing with Deep Learning CS224N/Ling284 Lecture 8: Recurrent Neural Networks Christopher Manning and Richard Socher Organization Extra project office hour today after lecture Overview

More information

CS159 - Assignment 2b

CS159 - Assignment 2b CS159 - Assignment 2b Due: Tuesday, Sept. 23 at 2:45pm For the main part of this assignment we will be constructing a number of smoothed versions of a bigram language model and we will be evaluating its

More information

Hustle Documentation. Release 0.1. Tim Spurway

Hustle Documentation. Release 0.1. Tim Spurway Hustle Documentation Release 0.1 Tim Spurway February 26, 2014 Contents 1 Features 3 2 Getting started 5 2.1 Installing Hustle............................................. 5 2.2 Hustle Tutorial..............................................

More information

Optimal Space-time Tradeoffs for Inverted Indexes

Optimal Space-time Tradeoffs for Inverted Indexes Optimal Space-time Tradeoffs for Inverted Indexes Giuseppe Ottaviano 1, Nicola Tonellotto 1, Rossano Venturini 1,2 1 National Research Council of Italy, Pisa, Italy 2 Department of Computer Science, University

More information

Sampling Table Configulations for the Hierarchical Poisson-Dirichlet Process

Sampling Table Configulations for the Hierarchical Poisson-Dirichlet Process Sampling Table Configulations for the Hierarchical Poisson-Dirichlet Process Changyou Chen,, Lan Du,, Wray Buntine, ANU College of Engineering and Computer Science The Australian National University National

More information

Using Graphics Processors for High Performance IR Query Processing

Using Graphics Processors for High Performance IR Query Processing Using Graphics Processors for High Performance IR Query Processing Shuai Ding Jinru He Hao Yan Torsten Suel Polytechnic Inst. of NYU Polytechnic Inst. of NYU Polytechnic Inst. of NYU Yahoo! Research Brooklyn,

More information

Indexing. CS6200: Information Retrieval. Index Construction. Slides by: Jesse Anderton

Indexing. CS6200: Information Retrieval. Index Construction. Slides by: Jesse Anderton Indexing Index Construction CS6200: Information Retrieval Slides by: Jesse Anderton Motivation: Scale Corpus Terms Docs Entries A term incidence matrix with V terms and D documents has O(V x D) entries.

More information

Lign/CSE 256, Programming Assignment 1: Language Models

Lign/CSE 256, Programming Assignment 1: Language Models Lign/CSE 256, Programming Assignment 1: Language Models 16 January 2008 due 1 Feb 2008 1 Preliminaries First, make sure you can access the course materials. 1 The components are: ˆ code1.zip: the Java

More information

Efficient Diversification of Web Search Results

Efficient Diversification of Web Search Results Efficient Diversification of Web Search Results G. Capannini, F. M. Nardini, R. Perego, and F. Silvestri ISTI-CNR, Pisa, Italy Laboratory Web Search Results Diversification Query: Vinci, what is the user

More information

Spatio-temporal Range Searching Over Compressed Kinetic Sensor Data. Sorelle A. Friedler Google Joint work with David M. Mount

Spatio-temporal Range Searching Over Compressed Kinetic Sensor Data. Sorelle A. Friedler Google Joint work with David M. Mount Spatio-temporal Range Searching Over Compressed Kinetic Sensor Data Sorelle A. Friedler Google Joint work with David M. Mount Motivation Kinetic data: data generated by moving objects Sensors collect data

More information

CMPSCI 646, Information Retrieval (Fall 2003)

CMPSCI 646, Information Retrieval (Fall 2003) CMPSCI 646, Information Retrieval (Fall 2003) Midterm exam solutions Problem CO (compression) 1. The problem of text classification can be described as follows. Given a set of classes, C = {C i }, where

More information

Patrick Nguyen, Jianfeng Gao, and Milind Mahajan November Technical Report MSR-TR

Patrick Nguyen, Jianfeng Gao, and Milind Mahajan November Technical Report MSR-TR MSRLM: a scalable language modeling toolkit Patrick Nguyen, Jianfeng Gao, and Milind Mahajan {panguyen,jfgao,milindm}@microsoft.com November 2007 Technical Report MSR-TR-2007-144 Microsoft Research Microsoft

More information

Predictive Indexing for Fast Search

Predictive Indexing for Fast Search Predictive Indexing for Fast Search Sharad Goel, John Langford and Alex Strehl Yahoo! Research, New York Modern Massive Data Sets (MMDS) June 25, 2008 Goel, Langford & Strehl (Yahoo! Research) Predictive

More information

Compressed Indexes for String Searching in Labeled Graphs

Compressed Indexes for String Searching in Labeled Graphs Compressed Indexes for String Searching in Labeled Graphs Paolo Ferragina Francesco Piccinno Rossano Venturini Dipartimento di Informatica, University of Pisa {ferragina, piccinno, rossano}@di.unipi.it

More information

Bloom Filter and Lossy Dictionary Based Language Models

Bloom Filter and Lossy Dictionary Based Language Models Bloom Filter and Lossy Dictionary Based Language Models Abby D. Levenberg E H U N I V E R S I T Y T O H F R G E D I N B U Master of Science Cognitive Science and Natural Language Processing School of Informatics

More information

Fast and compact Hamming distance index

Fast and compact Hamming distance index Fast and compact Hamming distance index Simon Gog Institute of Theoretical Informatics, Karlsruhe Institute of Technology gog@kit.edu Rossano Venturini University of Pisa and Istella Srl rossano.venturini@unipi.it

More information

Lecture 24: Image Retrieval: Part II. Visual Computing Systems CMU , Fall 2013

Lecture 24: Image Retrieval: Part II. Visual Computing Systems CMU , Fall 2013 Lecture 24: Image Retrieval: Part II Visual Computing Systems Review: K-D tree Spatial partitioning hierarchy K = dimensionality of space (below: K = 2) 3 2 1 3 3 4 2 Counts of points in leaf nodes Nearest

More information

A DEDUPLICATION-INSPIRED FAST DELTA COMPRESSION APPROACH W EN XIA, HONG JIANG, DA N FENG, LEI T I A N, M I N FU, YUKUN Z HOU

A DEDUPLICATION-INSPIRED FAST DELTA COMPRESSION APPROACH W EN XIA, HONG JIANG, DA N FENG, LEI T I A N, M I N FU, YUKUN Z HOU A DEDUPLICATION-INSPIRED FAST DELTA COMPRESSION APPROACH W EN XIA, HONG JIANG, DA N FENG, LEI T I A N, M I N FU, YUKUN Z HOU PRESENTED BY ROMAN SHOR Overview Technics of data reduction in storage systems:

More information

Accelerating Parallel Analysis of Scientific Simulation Data via Zazen

Accelerating Parallel Analysis of Scientific Simulation Data via Zazen Accelerating Parallel Analysis of Scientific Simulation Data via Zazen Tiankai Tu, Charles A. Rendleman, Patrick J. Miller, Federico Sacerdoti, Ron O. Dror, and David E. Shaw D. E. Shaw Research Motivation

More information

B. Evaluation and Exploration of Next Generation Systems for Applicability and Performance (Volodymyr Kindratenko, Guochun Shi)

B. Evaluation and Exploration of Next Generation Systems for Applicability and Performance (Volodymyr Kindratenko, Guochun Shi) A. Summary - In the area of Evaluation and Exploration of Next Generation Systems for Applicability and Performance, over the period of 01/01/11 through 03/31/11 the NCSA Innovative Systems Lab team investigated

More information

An Oracle White Paper September Oracle Utilities Meter Data Management Demonstrates Extreme Performance on Oracle Exadata/Exalogic

An Oracle White Paper September Oracle Utilities Meter Data Management Demonstrates Extreme Performance on Oracle Exadata/Exalogic An Oracle White Paper September 2011 Oracle Utilities Meter Data Management 2.0.1 Demonstrates Extreme Performance on Oracle Exadata/Exalogic Introduction New utilities technologies are bringing with them

More information

Scalable Compression and Transmission of Large, Three- Dimensional Materials Microstructures

Scalable Compression and Transmission of Large, Three- Dimensional Materials Microstructures Scalable Compression and Transmission of Large, Three- Dimensional Materials Microstructures William A. Pearlman Center for Image Processing Research Rensselaer Polytechnic Institute pearlw@ecse.rpi.edu

More information

Distributed N-Gram Language Models: Application of Large Models to Automatic Speech Recognition

Distributed N-Gram Language Models: Application of Large Models to Automatic Speech Recognition Karlsruhe Institute of Technology Department of Informatics Institute for Anthropomatics Distributed N-Gram Language Models: Application of Large Models to Automatic Speech Recognition Christian Mandery

More information

Hardware Acceleration in Computer Networks. Jan Kořenek Conference IT4Innovations, Ostrava

Hardware Acceleration in Computer Networks. Jan Kořenek Conference IT4Innovations, Ostrava Hardware Acceleration in Computer Networks Outline Motivation for hardware acceleration Longest prefix matching using FPGA Hardware acceleration of time critical operations Framework and applications Contracted

More information

CSI33 Data Structures

CSI33 Data Structures Outline Department of Mathematics and Computer Science Bronx Community College September 6, 2017 Outline Outline 1 Chapter 2: Data Abstraction Outline Chapter 2: Data Abstraction 1 Chapter 2: Data Abstraction

More information

Enhancing Bitmap Indices

Enhancing Bitmap Indices Enhancing Bitmap Indices Guadalupe Canahuate The Ohio State University Advisor: Hakan Ferhatosmanoglu Introduction n Bitmap indices: Widely used in data warehouses and read-only domains Implemented in

More information

Efficient Representation of Local Geometry for Large Scale Object Retrieval

Efficient Representation of Local Geometry for Large Scale Object Retrieval Efficient Representation of Local Geometry for Large Scale Object Retrieval Michal Perďoch Ondřej Chum and Jiří Matas Center for Machine Perception Czech Technical University in Prague IEEE Computer Society

More information

CS 206 Introduction to Computer Science II

CS 206 Introduction to Computer Science II CS 206 Introduction to Computer Science II 04 / 25 / 2018 Instructor: Michael Eckmann Today s Topics Questions? Comments? Balanced Binary Search trees AVL trees / Compression Uses binary trees Balanced

More information

A Retrieval Method for Double Array Structures by Using Byte N-Gram

A Retrieval Method for Double Array Structures by Using Byte N-Gram A Retrieval Method for Double Array Structures by Using Byte N-Gram Masao Fuketa, Kazuhiro Morita, and Jun-Ichi Aoe Abstract Retrieving keywords requires speed and compactness. A trie is one of the data

More information

Open-Source Speech Recognition for Hand-held and Embedded Devices

Open-Source Speech Recognition for Hand-held and Embedded Devices PocketSphinx: Open-Source Speech Recognition for Hand-held and Embedded Devices David Huggins Daines (dhuggins@cs.cmu.edu) Mohit Kumar (mohitkum@cs.cmu.edu) Arthur Chan (archan@cs.cmu.edu) Alan W Black

More information

Adding Integers. Unit 1 Lesson 6

Adding Integers. Unit 1 Lesson 6 Unit 1 Lesson 6 Students will be able to: Add integers using rules and number line Key Vocabulary: An integer Number line Rules for Adding Integers There are two rules that you must follow when adding

More information

Efficient Top-k Shortest-Path Distance Queries on Large Networks by Pruned Landmark Labeling with Application to Network Structure Prediction

Efficient Top-k Shortest-Path Distance Queries on Large Networks by Pruned Landmark Labeling with Application to Network Structure Prediction Efficient Top-k Shortest-Path Distance Queries on Large Networks by Pruned Landmark Labeling with Application to Network Structure Prediction Takuya Akiba U Tokyo Takanori Hayashi U Tokyo Nozomi Nori Kyoto

More information

Indexing and Searching

Indexing and Searching Indexing and Searching Introduction How to retrieval information? A simple alternative is to search the whole text sequentially Another option is to build data structures over the text (called indices)

More information

gcc o driver std=c99 -Wall driver.c bigmesa.c

gcc o driver std=c99 -Wall driver.c bigmesa.c C Programming Simple Array Processing This assignment consists of two parts. The first part focuses on array read accesses and computational logic. The second part focuses on array read/write access and

More information

R16 SET - 1 '' ''' '' ''' Code No: R

R16 SET - 1 '' ''' '' ''' Code No: R 1. a) Define Latency time and Transmission time? (2M) b) Define Hash table and Hash function? (2M) c) Explain the Binary Heap Structure Property? (3M) d) List the properties of Red-Black trees? (3M) e)

More information

Indexing Variable Length Substrings for Exact and Approximate Matching

Indexing Variable Length Substrings for Exact and Approximate Matching Indexing Variable Length Substrings for Exact and Approximate Matching Gonzalo Navarro 1, and Leena Salmela 2 1 Department of Computer Science, University of Chile gnavarro@dcc.uchile.cl 2 Department of

More information

Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay

Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 26 Source Coding (Part 1) Hello everyone, we will start a new module today

More information

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating

More information

Lecture 5: Information Retrieval using the Vector Space Model

Lecture 5: Information Retrieval using the Vector Space Model Lecture 5: Information Retrieval using the Vector Space Model Trevor Cohn (tcohn@unimelb.edu.au) Slide credits: William Webber COMP90042, 2015, Semester 1 What we ll learn today How to take a user query

More information

A Performance Evaluation of the Preprocessing Phase of Multiple Keyword Matching Algorithms

A Performance Evaluation of the Preprocessing Phase of Multiple Keyword Matching Algorithms A Performance Evaluation of the Preprocessing Phase of Multiple Keyword Matching Algorithms Charalampos S. Kouzinopoulos and Konstantinos G. Margaritis Parallel and Distributed Processing Laboratory Department

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

Mobile Visual Search with Word-HOG Descriptors

Mobile Visual Search with Word-HOG Descriptors Mobile Visual Search with Word-HOG Descriptors Sam S. Tsai, Huizhong Chen, David M. Chen, and Bernd Girod Department of Electrical Engineering, Stanford University, Stanford, CA, 9435 sstsai@alumni.stanford.edu,

More information

CS 124/LING 180/LING 280 From Languages to Information Week 2: Group Exercises on Language Modeling Winter 2018

CS 124/LING 180/LING 280 From Languages to Information Week 2: Group Exercises on Language Modeling Winter 2018 CS 124/LING 180/LING 280 From Languages to Information Week 2: Group Exercises on Language Modeling Winter 2018 Dan Jurafsky Tuesday, January 23, 2018 1 Part 1: Group Exercise We are interested in building

More information

Recommender Systems New Approaches with Netflix Dataset

Recommender Systems New Approaches with Netflix Dataset Recommender Systems New Approaches with Netflix Dataset Robert Bell Yehuda Koren AT&T Labs ICDM 2007 Presented by Matt Rodriguez Outline Overview of Recommender System Approaches which are Content based

More information

Linked Structures Songs, Games, Movies Part IV. Fall 2013 Carola Wenk

Linked Structures Songs, Games, Movies Part IV. Fall 2013 Carola Wenk Linked Structures Songs, Games, Movies Part IV Fall 23 Carola Wenk Storing Text We ve been focusing on numbers. What about text? Animal, Bird, Cat, Car, Chase, Camp, Canal We can compare the lexicographic

More information

Pre-Computable Multi-Layer Neural Network Language Models

Pre-Computable Multi-Layer Neural Network Language Models Pre-Computable Multi-Layer Neural Network Language Models Jacob Devlin jdevlin@microsoft.com Chris Quirk chrisq@microsoft.com Arul Menezes arulm@microsoft.com Abstract In the last several years, neural

More information

Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs

Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs Ritchie Zhao 1, Weinan Song 2, Wentao Zhang 2, Tianwei Xing 3, Jeng-Hau Lin 4, Mani Srivastava 3, Rajesh Gupta 4, Zhiru

More information

Entity and Knowledge Base-oriented Information Retrieval

Entity and Knowledge Base-oriented Information Retrieval Entity and Knowledge Base-oriented Information Retrieval Presenter: Liuqing Li liuqing@vt.edu Digital Library Research Laboratory Virginia Polytechnic Institute and State University Blacksburg, VA 24061

More information

Schema Extraction for Tabular Data on the Web

Schema Extraction for Tabular Data on the Web Schema Extraction for abular ata on the Web Marco delfio anan Samet epartment of Computer Science Center for utomation Research Institute for dvanced Computer Studies University of Maryland College Park,

More information

CMSC 723: Computational Linguistics I Session #12 MapReduce and Data Intensive NLP. University of Maryland. Wednesday, November 18, 2009

CMSC 723: Computational Linguistics I Session #12 MapReduce and Data Intensive NLP. University of Maryland. Wednesday, November 18, 2009 CMSC 723: Computational Linguistics I Session #12 MapReduce and Data Intensive NLP Jimmy Lin and Nitin Madnani University of Maryland Wednesday, November 18, 2009 Three Pillars of Statistical NLP Algorithms

More information

Datenbanksysteme II: Modern Hardware. Stefan Sprenger November 23, 2016

Datenbanksysteme II: Modern Hardware. Stefan Sprenger November 23, 2016 Datenbanksysteme II: Modern Hardware Stefan Sprenger November 23, 2016 Content of this Lecture Introduction to Modern Hardware CPUs, Cache Hierarchy Branch Prediction SIMD NUMA Cache-Sensitive Skip List

More information

Deep Character-Level Click-Through Rate Prediction for Sponsored Search

Deep Character-Level Click-Through Rate Prediction for Sponsored Search Deep Character-Level Click-Through Rate Prediction for Sponsored Search Bora Edizel - Phd Student UPF Amin Mantrach - Criteo Research Xiao Bai - Oath This work was done at Yahoo and will be presented as

More information

A Space-Efficient Phrase Table Implementation Using Minimal Perfect Hash Functions

A Space-Efficient Phrase Table Implementation Using Minimal Perfect Hash Functions A Space-Efficient Phrase Table Implementation Using Minimal Perfect Hash Functions Marcin Junczys-Dowmunt Faculty of Mathematics and Computer Science Adam Mickiewicz University ul. Umultowska 87, 61-614

More information

Algorithms for NLP. Language Modeling II. Taylor Berg- Kirkpatrick CMU Slides: Dan Klein UC Berkeley

Algorithms for NLP. Language Modeling II. Taylor Berg- Kirkpatrick CMU Slides: Dan Klein UC Berkeley Algorithms for NLP Language Modeling II Taylor Berg- Kirkpatrick CMU Slides: Dan Klein UC Berkeley Announcements Should be able to really start project ager today s lecture Get familiar with bit- twiddling

More information

A Secondary storage Algorithms and Data Structures Supplementary Questions and Exercises

A Secondary storage Algorithms and Data Structures Supplementary Questions and Exercises 308-420A Secondary storage Algorithms and Data Structures Supplementary Questions and Exercises Section 1.2 4, Logarithmic Files Logarithmic Files 1. A B-tree of height 6 contains 170,000 nodes with an

More information

CS294-1 Assignment 2 Report

CS294-1 Assignment 2 Report CS294-1 Assignment 2 Report Keling Chen and Huasha Zhao February 24, 2012 1 Introduction The goal of this homework is to predict a users numeric rating for a book from the text of the user s review. The

More information

OpenCL Vectorising Features. Andreas Beckmann

OpenCL Vectorising Features. Andreas Beckmann Mitglied der Helmholtz-Gemeinschaft OpenCL Vectorising Features Andreas Beckmann Levels of Vectorisation vector units, SIMD devices width, instructions SMX, SP cores Cus, PEs vector operations within kernels

More information

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University CS6200 Information Retrieval David Smith College of Computer and Information Science Northeastern University Indexing Process!2 Indexes Storing document information for faster queries Indexes Index Compression

More information

15 July, Huffman Trees. Heaps

15 July, Huffman Trees. Heaps 1 Huffman Trees The Huffman Code: Huffman algorithm uses a binary tree to compress data. It is called the Huffman code, after David Huffman who discovered d it in 1952. Data compression is important in

More information

Automatic Semantic Platform- dependent Redesign. Giulio Mori & Fabio Paternò. HIIS Laboratory ISTI-C.N.R.

Automatic Semantic Platform- dependent Redesign. Giulio Mori & Fabio Paternò.  HIIS Laboratory ISTI-C.N.R. Automatic Semantic Platform- dependent Redesign Giulio Mori & Fabio Paternò http://giove.isti.cnr.it/ HIIS Laboratory ISTI-C.N.R. Pisa, Italy Multi-Device Interactive Services: Current Practice Manual

More information

The Adaptive Radix Tree

The Adaptive Radix Tree Department of Informatics, University of Zürich MSc Basismodul The Adaptive Radix Tree Rafael Kallis Matrikelnummer: -708-887 Email: rk@rafaelkallis.com September 8, 08 supervised by Prof. Dr. Michael

More information

FastText. Jon Koss, Abhishek Jindal

FastText. Jon Koss, Abhishek Jindal FastText Jon Koss, Abhishek Jindal FastText FastText is on par with state-of-the-art deep learning classifiers in terms of accuracy But it is way faster: FastText can train on more than one billion words

More information

Database System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use

Database System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See  for conditions on re-use Chapter 11: Indexing and Hashing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files Static

More information

Presented by: Dardan Xhymshiti Spring 2016

Presented by: Dardan Xhymshiti Spring 2016 Presented by: Dardan Xhymshiti Spring 2016 Authors: Sean Chester* Darius Sidlauskas` Ira Assent* Kenneth S. Bogh* *Data-Intensive Systems Group, Aarhus University, Denmakr `Data-Intensive Applications

More information

Programming and Data Structure Laboratory (CS13002)

Programming and Data Structure Laboratory (CS13002) Programming and Data Structure Laboratory (CS13002) Dr. Sudeshna Sarkar Dr. Indranil Sengupta Dept. of Computer Science & Engg., IIT Kharagpur 1 Some Rules to be Followed Attendance is mandatory. Regular

More information

An empirical study of smoothing techniques for language modeling

An empirical study of smoothing techniques for language modeling Computer Speech and Language (1999) 13, 359 394 Article No. csla.1999.128 Available online at http://www.idealibrary.com on An empirical study of smoothing techniques for language modeling Stanley F. Chen

More information

TrafficDB: HERE s High Performance Shared-Memory Data Store Ricardo Fernandes, Piotr Zaczkowski, Bernd Göttler, Conor Ettinoffe, and Anis Moussa

TrafficDB: HERE s High Performance Shared-Memory Data Store Ricardo Fernandes, Piotr Zaczkowski, Bernd Göttler, Conor Ettinoffe, and Anis Moussa TrafficDB: HERE s High Performance Shared-Memory Data Store Ricardo Fernandes, Piotr Zaczkowski, Bernd Göttler, Conor Ettinoffe, and Anis Moussa EPL646: Advanced Topics in Databases Christos Hadjistyllis

More information

15110 PRINCIPLES OF COMPUTING SAMPLE EXAM 2

15110 PRINCIPLES OF COMPUTING SAMPLE EXAM 2 15110 PRINCIPLES OF COMPUTING SAMPLE EXAM 2 Name Section Directions: Answer each question neatly in the space provided. Please read each question carefully. You have 50 minutes for this exam. No electronic

More information

A Compressed Cache-Oblivious String B-tree

A Compressed Cache-Oblivious String B-tree A Compressed Cache-Oblivious String B-tree PAOLO FERRAGINA, Department of Computer Science, University of Pisa ROSSANO VENTURINI, Department of Computer Science, University of Pisa In this paper we study

More information

Iterative CKY parsing for Probabilistic Context-Free Grammars

Iterative CKY parsing for Probabilistic Context-Free Grammars Iterative CKY parsing for Probabilistic Context-Free Grammars Yoshimasa Tsuruoka and Jun ichi Tsujii Department of Computer Science, University of Tokyo Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033 CREST, JST

More information

arxiv: v1 [cs.cl] 16 Aug 2016

arxiv: v1 [cs.cl] 16 Aug 2016 Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees arxiv:1608.04465v1 [cs.cl] 16 Aug 2016 Ehsan Shareghi, Matthias Petri, Gholamreza Haffari and Trevor Cohn Faculty of

More information

WebSAIL Wikifier at ERD 2014

WebSAIL Wikifier at ERD 2014 WebSAIL Wikifier at ERD 2014 Thanapon Noraset, Chandra Sekhar Bhagavatula, Doug Downey Department of Electrical Engineering & Computer Science, Northwestern University {nor.thanapon, csbhagav}@u.northwestern.edu,ddowney@eecs.northwestern.edu

More information

So you haven t upgraded to MapInfo 64-bit yet?

So you haven t upgraded to MapInfo 64-bit yet? MapInfo v16 So you haven t upgraded to MapInfo 64-bit yet? This document provides a quick overview of the important features and improvements of the current 64-bit release for those customers who have

More information

Inverted List Caching for Topical Index Shards

Inverted List Caching for Topical Index Shards Inverted List Caching for Topical Index Shards Zhuyun Dai and Jamie Callan Language Technologies Institute, Carnegie Mellon University {zhuyund, callan}@cs.cmu.edu Abstract. Selective search is a distributed

More information

Sparse Non-negative Matrix Language Modeling

Sparse Non-negative Matrix Language Modeling Sparse Non-negative Matrix Language Modeling Joris Pelemans Noam Shazeer Ciprian Chelba joris@pelemans.be noam@google.com ciprianchelba@google.com 1 Outline Motivation Sparse Non-negative Matrix Language

More information

Linux Kernel Hacking Free Course

Linux Kernel Hacking Free Course Linux Kernel Hacking Free Course 3 rd edition G.Grilli, University of me Tor Vergata IRQ DISTRIBUTION IN MULTIPROCESSOR SYSTEMS April 05, 2006 IRQ distribution in multiprocessor systems 1 Contents: What

More information

Applications of Succinct Dynamic Compact Tries to Some String Problems

Applications of Succinct Dynamic Compact Tries to Some String Problems Applications of Succinct Dynamic Compact Tries to Some String Problems Takuya Takagi 1, Takashi Uemura 2, Shunsuke Inenaga 3, Kunihiko Sadakane 4, and Hiroki Arimura 1 1 IST & School of Engineering, Hokkaido

More information

Scalable Trigram Backoff Language Models

Scalable Trigram Backoff Language Models Scalable Trigram Backoff Language Models Kristie Seymore Ronald Rosenfeld May 1996 CMU-CS-96-139 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 This material is based upon work

More information

Trees in java.util. A set is an object that stores unique elements In Java, two implementations are available:

Trees in java.util. A set is an object that stores unique elements In Java, two implementations are available: Trees in java.util A set is an object that stores unique elements In Java, two implementations are available: The class HashSet implements the set with a hash table and a hash function The class TreeSet,

More information

Computing n-gram Statistics in MapReduce

Computing n-gram Statistics in MapReduce Computing n-gram Statistics in MapReduce Klaus Berberich (kberberi@mpi-inf.mpg.de) Srikanta Bedathur (bedathur@iiitd.ac.in) n-gram Statistics Statistics about variable-length word sequences (e.g., lord

More information

COS 226 Algorithms and Data Structures Fall Final Solutions. 10. Remark: this is essentially the same question from the midterm.

COS 226 Algorithms and Data Structures Fall Final Solutions. 10. Remark: this is essentially the same question from the midterm. COS 226 Algorithms and Data Structures Fall 2011 Final Solutions 1 Analysis of algorithms (a) T (N) = 1 10 N 5/3 When N increases by a factor of 8, the memory usage increases by a factor of 32 Thus, T

More information

Programming Studio #1 ECE 190

Programming Studio #1 ECE 190 Programming Studio #1 ECE 190 Programming Studio #1 Announcements In Studio Assignment Introduction to Linux Command-Line Operations Recitation Floating Point Representation Binary & Hexadecimal 2 s Complement

More information

SIGIR 2016 Richard Zanibbi, Kenny Davila, Andrew Kane, Frank W. Tompa July 18, 2016

SIGIR 2016 Richard Zanibbi, Kenny Davila, Andrew Kane, Frank W. Tompa July 18, 2016 SIGIR 2016 Richard Zanibbi, Kenny Davila, Andrew Kane, Frank W. Tompa July 18, 2016 Mathematical Information Retrieval (MIR) Many mathematical resources are available online, such as: Online databases

More information